How to crawl the entire Wikipedia?

Mr CooL · Feb 22, 2010 · Viewed 18k times

I've tried the WebSphinx application.

I noticed that if I put wikipedia.org as the starting URL, it does not crawl any further.

So how do I actually crawl the entire Wikipedia? Can anyone give me some guidelines? Do I need to find the URLs myself and supply multiple starting URLs?

Does anyone have suggestions for a good website with a tutorial on using WebSphinx's API?

Answer

Andrew · Feb 22, 2010

If your goal is to crawl all of Wikipedia, you might want to look at the available database dumps. See http://download.wikimedia.org/.
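Once you have a dump, you can read it locally instead of crawling. Here is a minimal sketch, assuming the dump is a bz2-compressed MediaWiki XML export (the "pages-articles" style of file) whose pages contain `title` and `text` elements. The function names and the file path are hypothetical, and the sketch keeps only the last revision's text for each page.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, None
    # iterparse reads incrementally, so the whole dump is never parsed at once.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to limit memory use


def iter_dump(path):
    """Stream pages from a .xml.bz2 dump file (the path is an assumption)."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real dump, just to show the call shape.
    sample = (
        b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
        b"<page><title>Alpha</title><revision><text>First article</text>"
        b"</revision></page></mediawiki>"
    )
    for page_title, wikitext in iter_pages(io.BytesIO(sample)):
        print(page_title, len(wikitext))
```

For a real dump you would call `iter_dump` with the path of the file you downloaded and loop over the results the same way. Stripping the namespace from each tag avoids hard-coding the export schema version, which differs between dumps.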