Building a Google Lighthouse Crawler with Livewire

Published on 03.01.2024

Welcome to our tech blog! Today, we’re diving into an exciting project: a tool that combines the power of Google Lighthouse with a website crawler. The goal? To provide comprehensive Lighthouse scores for all pages on a website. We’re adding Livewire on top so the results page refreshes in real time while the audits run. Let’s break this down!

You can try the Lighthouse Crawler Tool here: https://inthemakings.com/en/tools/lighthouse-crawler

Starting Point: Packages and Libraries

We’ve got our eyes on a couple of existing packages as our starting point. These include:

  • Spatie Crawler (https://github.com/spatie/crawler): A robust tool for crawling websites.
  • Spatie Lighthouse (https://github.com/spatie/lighthouse-php): Integrates Google Lighthouse for analyzing web pages.

Should these not meet our needs, we’re prepared to extend them or develop our own solution.

Building the Crawler: Step by Step

Installation: To get started, we need to set up our environment. Here’s how:

composer require spatie/crawler
composer require spatie/lighthouse-php
npm install lighthouse
npm install chrome-launcher

Database Preparation: We’re planning a database structure that allows us to store, queue, and manage audit requests. This includes timestamps for tracking and potential cancellation of requests.

Our database models are designed as follows:

LighthousecrawlerRequest
- id, uuid, url, started_at, pinged_at, crawled_at, canceled_at, failed_at, ended_at

LighthousecrawlerPage
- id, lighthousecrawler_request_id, url, url_hash, started_at, crawled_at, failed_at, ended_at

LighthousecrawlerPageScore
- id, lighthousecrawler_page_id, name, score, formfactor

LighthousecrawlerPageAudit
- id, lighthousecrawler_page_id, name, values, formfactor
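
To make this concrete, here is what the migration for the request table could look like. This is a minimal sketch: the table name and column types are assumptions, only the column names come from the model above.

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    // A minimal sketch; the table name and column types are assumptions,
    // only the column names come from the model above.
    public function up (): void
    {
        Schema::create('lighthousecrawler_requests', function (Blueprint $table) {
            $table->id();
            $table->uuid('uuid')->unique();
            $table->string('url');
            $table->timestamp('started_at')->nullable();
            $table->timestamp('pinged_at')->nullable();
            $table->timestamp('crawled_at')->nullable();
            $table->timestamp('canceled_at')->nullable();
            $table->timestamp('failed_at')->nullable();
            $table->timestamp('ended_at')->nullable();
            $table->timestamps();
        });
    }

    public function down (): void
    {
        Schema::dropIfExists('lighthousecrawler_requests');
    }
};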

Routes and Controllers: Our setup includes a page to initiate audits and another to display results, with corresponding routes and controllers.
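
A minimal sketch of those routes follows. The controller class names and the request route name are assumptions; the result route name matches the one referenced by the Livewire component below.

use App\Http\Controllers\Lighthousecrawler\RequestController;
use App\Http\Controllers\Lighthousecrawler\ResultController;
use Illuminate\Support\Facades\Route;

// routes/web.php: controller names are assumptions; the result route
// name matches the redirect in the Livewire component below.
Route::get('/{language}/tools/lighthouse-crawler', RequestController::class)
    ->name('lighthousecrawler.request.get');

Route::get('/{language}/tools/lighthouse-crawler/{uuid}', ResultController::class)
    ->name('lighthousecrawler.result.get');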

Livewire Magic: Bringing it to Life

Our Livewire component is simple: a form with a URL input and a submit button. On submit, it creates the Lighthouse crawler request, dispatches a job to run the crawl, and redirects to the result page.

use Livewire\Attributes\Rule;
use Livewire\Component;

class Request extends Component
{
    public string $language;

    #[Rule('required|url')]
    public string $url = '';

    public function mount (string $language) {
        $this->language = $language;
    }

    public function render () {
        return view('livewire.lighthousecrawler.request');
    }

    public function submit () {
        $this->validate();

        $lighthousecrawler_request = (new \App\Domain\Lighthousecrawler\Actions\CreatesLighthousecrawlerRequestAction())->__invoke($this->url);

        dispatch(new \App\Jobs\Lighthousecrawler\RunLighthousecrawlerRequestJob($lighthousecrawler_request->id))
            ->onQueue(config('lighthousecrawler.queues.crawl'));

        return redirect()->route('lighthousecrawler.result.get', [
            'language' => $this->language,
            'uuid' => $lighthousecrawler_request->uuid
        ]);
    }
}
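
The queue names above (and the user agent used by the crawler further down) come from a small config file. A minimal sketch of config/lighthousecrawler.php; every value here is an assumption:

<?php

// config/lighthousecrawler.php: a minimal sketch; every value below is
// an assumption, not taken from the original project.
return [
    'queues' => [
        'crawl' => 'lighthousecrawler-crawl',
        'audit' => 'lighthousecrawler-audit',
    ],
    'user-agent' => 'Mozilla/5.0 (compatible; LighthouseCrawler/1.0)',
];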

Custom Crawl Profiles and Observers: We’re customizing the crawl process to filter and manage crawled pages efficiently. This includes defining a custom “CrawlProfile” to exclude unwanted pages and a “CrawlObserver” to track results.

use GuzzleHttp\RequestOptions;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue;

class RunsLighthousecrawlerRequestAction {
    public function __invoke (LighthousecrawlerRequest $lighthousecrawler_request): void {
        $lighthousecrawler_request->started_at = now();
        $lighthousecrawler_request->save();

        $crawlQueue = app(ArrayCrawlQueue::class);

        $crawl_url_host = parse_url($lighthousecrawler_request->url, PHP_URL_HOST);
        if (str_starts_with($crawl_url_host, 'www.')) {
            $crawl_url_host = substr($crawl_url_host, 4);
        }

        $crawlProfile = with(new CrawlProfile())
            // Stay on same domain as $crawl->url
            ->addFilter(function (UriInterface $uri) use ($crawl_url_host) {
                // if last part is an image file extension, return false
                if (preg_match('/\.(jpg|jpeg|png|gif|webp|svg|ico|bmp|tiff|tif|psd|ai|indd|raw|heif|jp2|j2k|jpf|jpx|jpm|mj2)$/i', $uri->getPath())) {
                    return false;
                }

                // allow $crawl->url with and without "www."
                $url_host = $uri->getHost();

                if (str_starts_with($url_host, 'www.')) {
                    $url_host = substr($url_host, 4);
                }

                return $url_host === $crawl_url_host;
            });

        $crawlObserver = new CrawlObserver(
            lighthousecrawler_request: $lighthousecrawler_request,
            createLighthousecrawlerPage: app(CreatesLighthousecrawlerPageAction::class),
            crawl_queue: $crawlQueue,
        );

        $options = [
            RequestOptions::COOKIES => true,
            RequestOptions::CONNECT_TIMEOUT => 10,
            RequestOptions::TIMEOUT => 20,
            RequestOptions::ALLOW_REDIRECTS => [
                'max' => 5,
                'track_redirects' => true,
            ],
            RequestOptions::AUTH => null,
        ];

        Crawler::create($options)
            ->setCrawlQueue($crawlQueue)
            ->setCrawlProfile($crawlProfile)
            ->setCrawlObserver($crawlObserver)
            // ->executeJavaScript()
            ->setUserAgent(config('lighthousecrawler.user-agent'))
            ->respectRobots()
            ->setTotalCrawlLimit(50)
            ->setParseableMimeTypes([
                'text/html',
                'text/plain',
                'application/json',
                'application/xml',
                'application/xhtml+xml',
                'application/rss+xml',
                'application/atom+xml',
            ])
            ->setMaximumResponseSize(1024 * 1024 * 3)
            ->setDefaultScheme('https')
            ->setConcurrency(5)
            ->startCrawling($lighthousecrawler_request->url);

        $lighthousecrawler_request->crawled_at = now();
        $lighthousecrawler_request->save();
    }
}

use GuzzleHttp\Psr7\Uri;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlQueues\CrawlQueue;
use Spatie\Crawler\CrawlUrl;

class CrawlObserver extends \Spatie\Crawler\CrawlObservers\CrawlObserver {
    public function __construct (
        protected LighthousecrawlerRequest $lighthousecrawler_request,
        protected CreatesLighthousecrawlerPageAction $createLighthousecrawlerPage,
        protected CrawlQueue $crawl_queue
    ) {}

    public function willCrawl (UriInterface $url, ?string $linkText): void {
        $this->createLighthousecrawlerPage->__invoke(
            lighthousecrawler_request: $this->lighthousecrawler_request,
            url: $url->__toString(),
        );
    }

    public function crawled (UriInterface $url, ResponseInterface $response, ?UriInterface $foundOnUrl = null, ?string $linkText = null): void {
        $redirect_header = $response->getHeader(\GuzzleHttp\RedirectMiddleware::HISTORY_HEADER);

        $lighthousecrawler_page = $this->findLighthousecrawlerPage($url);

        if (count($redirect_header) > 0) {
            $new_url = end($redirect_header);
            $this->crawl_queue->add(CrawlUrl::create(new Uri($new_url)));

            if ($lighthousecrawler_page) {
                $lighthousecrawler_page->delete();
                return;
            }
        }

        if (!$lighthousecrawler_page) {
            return;
        }

        $html = $response->getCachedBody();

        if (!$html) {
            $lighthousecrawler_page->delete();
            return;
        }

        $url_hash = md5($url->__toString());

        $same_crawl_url_exists = $this->lighthousecrawler_request
            ->lighthousecrawler_pages()
            ->whereNotNull('crawled_at')
            ->where('url_hash', $url_hash)
            ->exists();

        if ($same_crawl_url_exists) {
            $lighthousecrawler_page->delete();
            return;
        }

        $lighthousecrawler_page->crawled_at = now();
        $lighthousecrawler_page->save();

        dispatch(new RunLighthousecrawlerPageAuditJob($lighthousecrawler_page->id))
            ->onQueue(config('lighthousecrawler.queues.audit'));
    }
    
    [...]
}
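
The filterable CrawlProfile used above is not part of spatie/crawler; the package only requires a profile to implement shouldCrawl(). Here is a minimal sketch of how such a class could look. The addFilter() API mirrors how we call it above, but the implementation itself is an assumption.

use Psr\Http\Message\UriInterface;

class CrawlProfile extends \Spatie\Crawler\CrawlProfiles\CrawlProfile
{
    /** @var array<callable(UriInterface): bool> */
    protected array $filters = [];

    public function addFilter (callable $filter): static
    {
        $this->filters[] = $filter;

        return $this;
    }

    public function shouldCrawl (UriInterface $url): bool
    {
        // Crawl a URL only if every registered filter accepts it.
        foreach ($this->filters as $filter) {
            if (!$filter($url)) {
                return false;
            }
        }

        return true;
    }
}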

Running Lighthouse Audits: For every page discovered, we dispatch a job to perform the Lighthouse audit. We ensure that audits are done efficiently, capturing essential data for analysis.

use Spatie\Lighthouse\Enums\FormFactor;
use Spatie\Lighthouse\Lighthouse;

class RunsLighthousecrawlerPageAuditAction {
    public function __invoke (
        LighthousecrawlerPage $lighthousecrawler_page
    ): void {
        $lighthousecrawler_page->started_at = now();
        $lighthousecrawler_page->save();

        try {
            $this->run($lighthousecrawler_page, FormFactor::Mobile);

            $lighthousecrawler_page->ended_at = now();
            $lighthousecrawler_page->save();
        } catch (\Exception $e) {
            $lighthousecrawler_page->failed_at = now();
            $lighthousecrawler_page->save();

            throw $e;
        } finally {
            $request = $lighthousecrawler_page->lighthousecrawler_request;

            if (!$request->lighthousecrawler_pages()->whereNull('ended_at')->whereNull('failed_at')->exists()) {
                $request->ended_at = now();
                $request->save();
            }
        }
    }

    protected function run (LighthousecrawlerPage $lighthousecrawler_page, FormFactor $formfactor): void {
        $result = Lighthouse::url($lighthousecrawler_page->url)
            ->formFactor($formfactor)
            ->run();

        foreach ($result->scores() as $name => $score) {
            $lighthousecrawler_page_score = new LighthousecrawlerPageScore();
            $lighthousecrawler_page_score->lighthousecrawler_page()->associate($lighthousecrawler_page);
            $lighthousecrawler_page_score->formfactor = $formfactor->value;
            $lighthousecrawler_page_score->name = $name;
            $lighthousecrawler_page_score->score = $score;
            $lighthousecrawler_page_score->save();
        }
    }
}
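
The queued jobs themselves are thin wrappers around these actions. A minimal sketch of RunLighthousecrawlerPageAuditJob; the original implementation is not shown, so the namespaces and the body here are assumptions:

use App\Domain\Lighthousecrawler\Models\LighthousecrawlerPage;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;

// A minimal sketch; the model namespace and the job body are assumptions.
class RunLighthousecrawlerPageAuditJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    public function __construct (public int $lighthousecrawler_page_id) {}

    public function handle (RunsLighthousecrawlerPageAuditAction $action): void
    {
        $lighthousecrawler_page = LighthousecrawlerPage::findOrFail($this->lighthousecrawler_page_id);

        $action->__invoke($lighthousecrawler_page);
    }
}

Passing the id instead of the model keeps the queue payload small and avoids serializing stale model state.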

That’s our journey so far in creating this powerful tool. Stay tuned as we continue to develop and refine it. Remember, the tech world is all about innovation and adaptation. We’re excited to see where this project takes us! 🚀