How to write a crawler?

Jason · Sep 19, 2008 · Viewed 57.8k times

I have been thinking about writing a simple crawler that would crawl our NPO's websites and content and produce a list of its findings.

Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it has found, and so on?

Answer

slim · Sep 19, 2008

You'll be reinventing the wheel, to be sure. But here are the basics:

  • A list of unvisited URLs - seed this with one or more starting pages
  • A list of visited URLs - so you don't go around in circles
  • A set of rules for URLs you're not interested in - so you don't index the whole Internet
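As a rough illustration, the three pieces above could look like this in Python. This is only a sketch: the host names, the skipped extensions, and the `is_allowed` helper are assumptions made up for the example, not anything prescribed by the answer.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical values for illustration; substitute your own sites.
ALLOWED_HOSTS = {"example.org", "www.example.org"}
SKIPPED_EXTENSIONS = (".jpg", ".png", ".gif", ".zip")

unvisited = deque(["https://example.org/"])  # seeded with a starting page
visited = set()                              # URLs already fetched


def is_allowed(url):
    """Return True if the URL passes the (assumed) crawl rules."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False  # skip mailto:, javascript:, ftp:, and so on
    if parts.hostname not in ALLOWED_HOSTS:
        return False  # stay on our own sites
    return not parts.path.lower().endswith(SKIPPED_EXTENSIONS)
```

Restricting the crawl to a fixed set of hosts is the simplest rule that keeps it from wandering off across the whole Internet; more elaborate rules (path prefixes, depth limits) would slot into the same function.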

Put these in persistent storage, so you can stop and start the crawler without losing state.
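One simple way to get that persistence, assuming the lists are small enough to fit in memory, is to dump them to a JSON file every so often. The file name and the `save_state` / `load_state` helpers below are hypothetical; a database such as SQLite would be a reasonable alternative for larger crawls.

```python
import json
import os
from collections import deque

STATE_FILE = "crawler_state.json"  # hypothetical file name


def save_state(unvisited, visited, path=STATE_FILE):
    """Write both lists to disk so the crawl can be resumed later."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"unvisited": list(unvisited), "visited": sorted(visited)}, f)
    # Swap the new file in only after it is fully written, so an
    # interrupted save does not clobber the previous state.
    os.replace(tmp, path)


def load_state(seeds, path=STATE_FILE):
    """Return (unvisited, visited), falling back to the seeds on first run."""
    if not os.path.exists(path):
        return deque(seeds), set()
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return deque(state["unvisited"]), set(state["visited"])
```

Calling `save_state` after every page, or every few pages, bounds how much work is repeated after a restart.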

The algorithm is:

while(list of unvisited URLs is not empty) {
    take URL from list
    remove it from the unvisited list and add it to the visited list
    fetch content
    record whatever it is you want to about the content
    if content is HTML {
        parse out URLs from links
        foreach URL {
            if it matches your rules
               and it's not already in either the visited or unvisited list {
                add it to the unvisited list
            }
        }
    }
}
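
The pseudocode above translates fairly directly into Python using only the standard library. Treat the following as a sketch under assumptions rather than a finished crawler: the names `LinkParser`, `fetch`, `crawl`, and the `max_pages` safety limit are all invented for this example, "recording" a page here just means noting its URL, content type, and length, and the `is_allowed` argument stands for whatever rule function you choose.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import Request, urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Return (content_type, body_text) for a URL (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": "simple-crawler-sketch"})
    with urlopen(request, timeout=10) as response:
        content_type = response.headers.get_content_type()
        charset = response.headers.get_content_charset() or "utf-8"
        return content_type, response.read().decode(charset, errors="replace")


def crawl(seeds, is_allowed, fetch=fetch, max_pages=100):
    """Crawl from the seed URLs and return a list of findings."""
    unvisited = deque(seeds)
    visited = set()
    findings = []
    while unvisited and len(visited) < max_pages:
        url = unvisited.popleft()   # take URL from list
        visited.add(url)            # move it to the visited list
        try:
            content_type, body = fetch(url)
        except (OSError, ValueError):
            continue                # unreachable or malformed URL; skip it
        findings.append((url, content_type, len(body)))
        if content_type == "text/html":
            parser = LinkParser()
            parser.feed(body)
            for href in parser.links:
                # Resolve relative links and drop any #fragment.
                link = urldefrag(urljoin(url, href)).url
                if is_allowed(link) and link not in visited and link not in unvisited:
                    unvisited.append(link)
    return findings
```

A call such as `crawl(["https://example.org/"], lambda u: u.startswith("https://example.org/"))` would stay on that one (placeholder) site. Checking membership in the `deque` is linear in its length, which is fine for a small site; for a larger crawl a separate set of queued URLs would be faster. A polite crawler would also honour `robots.txt` and pause between requests, which this sketch leaves out.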