How do I make a simple crawler in PHP?

Question 1

How do I make a simple crawler in PHP?

php web-crawler

Kshitij Saxena -KJ- · Feb 22, 2010 · Viewed 178.3k times · Source

Answer

Answer

Meh. Don't parse HTML with regexes.

Here's a DOM version inspired by Tatu's:

<?php
function crawl_page($url, $depth = 5)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $href .= dirname($parts['path'], 1).$path;
            }
        }
        crawl_page($href, $depth - 1);
    }
    echo "URL:",$url,PHP_EOL,"CONTENT:",PHP_EOL,$dom->saveHTML(),PHP_EOL,PHP_EOL;
}
crawl_page("http://hobodave.com", 2);

Edit: I fixed some bugs from Tatu's version (works with relative URLs now).

Edit: I added a new bit of functionality that prevents it from following the same URL twice.

Edit: echoing output to STDOUT now so you can redirect it to whatever file you want

Edit: Fixed a bug pointed out by George in his answer. Relative urls will no longer append to the end of the url path, but overwrite it. Thanks to George for this. Note that George's answer doesn't account for any of: https, user, pass, or port. If you have the http PECL extension loaded this is quite simply done using http_build_url. Otherwise, I have to manually glue together using parse_url. Thanks again George.

Question 2

I have a web page with a bunch of links. I want to write a script which would dump all the data contained in those links in a local file.

Has anybody done that with PHP? General guidelines and gotchas would suffice as an answer.

How do I make a simple crawler in PHP?

Answer

Related questions