Web Crawler Algorithm: depth?

StackOverflowNewbie · Dec 5, 2010 · Viewed 13k times

I'm working on a crawler and need to understand exactly what is meant by "link depth". Take Nutch, for example: http://wiki.apache.org/nutch/NutchTutorial

depth indicates the link depth from the root page that should be crawled.

So, say I have the domain www.domain.com and want to crawl to a depth of, say, 3 -- what do I need to do? If a site could be represented as a binary tree, I don't think it would be a problem.

Answer

Fred Foo · Dec 5, 2010

Link depth means the number of "hops" a page is away from the root, where a "hop" means following a link on a page. The reason Nutch has this restriction is that links very "far away" from the main page are unlikely to hold much information (the main page links to the most important content, so the farther you go, the more detailed and niche the pages become), while there can be very many of them, so they eat up storage space, ranking time, and bandwidth.
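
In other words, a page's link depth is just its hop count from the root. As a quick illustration (the little site dict below is entirely made up), a breadth-first walk over an in-memory link graph computes exactly that number for every reachable page:

from collections import deque

# Hypothetical link graph: page -> pages it links to.
site = {
    "/":          ["/news", "/about"],
    "/news":      ["/news/2010", "/"],
    "/about":     ["/contact"],
    "/news/2010": [],
    "/contact":   [],
}

def link_depths(graph, root):
    """Return the hop count (link depth) of every page reachable from root."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in graph[page]:
            if target not in depth:      # first time we reach it = fewest hops
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

print(link_depths(site, "/"))
# {'/': 0, '/news': 1, '/about': 1, '/news/2010': 2, '/contact': 2}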

Nutch therefore uses a scheme known as depth-limited search to bound its running time and space usage. If it didn't use this heuristic, it would have to crawl an entire site to rank all of its pages and find the top N.

To crawl to depth 3, implement this algorithm and give it a depth bound of three. The nice thing about depth-limited search is that it's a variant of depth-first search (DFS), so it's quite space-efficient:

function depth-limited-crawl(page p, int d)
    if d == 0
        return
    /* do something with p, store it or so */
    foreach (page l in links(p))
        depth-limited-crawl(l, d-1)
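
Here is one way the same algorithm could look as runnable Python. The links() helper below (plain urllib plus HTMLParser, no robots.txt handling or politeness delays) is my own sketch, not something Nutch provides:

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def links(url):
    """Fetch url and return the absolute URLs it links to."""
    try:
        html = urlopen(url).read().decode("utf-8", errors="replace")
    except OSError:
        return []                        # skip pages that fail to download
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(url, href) for href in parser.hrefs]

def depth_limited_crawl(url, d):
    if d == 0:
        return
    print(url)                           # "do something with p": here we just print it
    for target in links(url):
        depth_limited_crawl(target, d - 1)

depth_limited_crawl("http://www.domain.com/", 3)   # crawl to depth 3

Before pointing this at a real site you would also want to restrict it to your own domain and respect robots.txt.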

And no, a site cannot in general be represented as a binary tree; it's a directed graph. If you somehow remove the backlinks, it becomes a multiway tree. Either way, many sites are too large for your crawler to store.
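
Because a site is a graph rather than a tree, the plain recursion above will fetch the same page over and over (every page that links back to the home page triggers a re-crawl of it). A small variant, reusing the links() helper from the sketch above and remembering which URLs have already been seen, avoids the repeated fetches:

def depth_limited_crawl(url, d, seen=None):
    if seen is None:
        seen = set()
    if d == 0 or url in seen:
        return
    seen.add(url)                        # never fetch the same URL twice
    print(url)
    for target in links(url):
        depth_limited_crawl(target, d - 1, seen)

One caveat: because this is depth-first, a page can be reached through a long path first and marked as seen, so its outgoing links may be skipped even though a shorter path would have put them within the depth bound. When exact hop counts matter, a breadth-first queue (as in the first sketch) is the usual fix.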