How to crawl entire Wikipedia?

java web-crawler wikipedia websphinx

Mr CooL · Feb 22, 2010 · Viewed 18k times · Source

I've tried the WebSphinx application.

I noticed that if I set wikipedia.org as the starting URL, it does not crawl any further.

So how do I actually crawl the entire Wikipedia? Can anyone give me some guidelines? Do I need to find specific URLs and supply multiple starting URLs?

Does anyone have suggestions for a good website with a tutorial on using WebSphinx's API?

Answer

If your goal is to crawl all of Wikipedia, you might want to look at the available database dumps. See http://download.wikimedia.org/.
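To give a rough idea of what working with a dump looks like instead of crawling, here is a minimal sketch that streams through a MediaWiki XML export and hands each page title to a callback. It assumes the dump has already been downloaded and decompressed, and that titles appear in `<title>` elements inside `<page>` elements; the class name `DumpTitles`, the method `forEachTitle`, and the tiny inline sample are made up for illustration. It uses the JDK's StAX parser so the file is never loaded into memory all at once.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DumpTitles {

    // Streams over the XML and passes the text of every <title> element
    // to the callback. Hypothetical helper, not part of any Wikipedia tool.
    static void forEachTitle(Reader xml, Consumer<String> onTitle)
            throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                // getLocalName() ignores the namespace that real dumps declare.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    onTitle.accept(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        // Tiny stand-in for a real dump; the structure is an assumption.
        String sample =
                "<mediawiki>"
              + "<page><title>Java</title><revision><text>...</text></revision></page>"
              + "<page><title>Web crawler</title><revision><text>...</text></revision></page>"
              + "</mediawiki>";
        forEachTitle(new StringReader(sample), System.out::println);
    }
}
```

For a real dump you would pass a reader over the decompressed file in place of the `StringReader`, and read the `<text>` element of each revision in the same way if you need the article body.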
