I've tried the WebSphinx application.

I've found that if I put wikipedia.org as the starting URL, it doesn't crawl any further.

So how do I actually crawl all of Wikipedia? Can anyone give me some guidelines? Do I need to go and find those URLs myself and supply multiple starting URLs?

Does anyone have suggestions for a good website with a tutorial on using WebSphinx's API?

+24  A: 

If your goal is to crawl all of Wikipedia, you might want to look at the available database dumps. See http://download.wikimedia.org/.
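If you do go the dump route, here's a minimal sketch of streaming a dump file to disk in Java (the exact filename below is a placeholder; browse the dump index for the current pages-articles dump):

    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.net.URL;

    public class DumpFetcher {
        public static void main(String[] args) throws Exception {
            // Placeholder dump name; check http://download.wikimedia.org/
            // for the actual, current file before running this.
            URL dump = new URL("http://download.wikimedia.org/enwiki/latest/"
                    + "enwiki-latest-pages-articles.xml.bz2");
            try (InputStream in = dump.openStream();
                 FileOutputStream out = new FileOutputStream("enwiki-pages-articles.xml.bz2")) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n); // stream to disk; the dump is far too big to buffer in memory
                }
            }
        }
    }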

Andrew
+1. Crawling Wikipedia through HTTP is rude and puts a lot of extra load on the servers.
Greg Hewgill
A: 

You probably need to start with a random article, and then crawl all articles you can get to from that starting one. When that search tree has been exhausted, start with a new random article. You could seed your searches with terms you think will lead to the most articles, or start with the featured article on the front page.
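A rough sketch of that loop in plain Java, using Special:Random as the reseed source (the regex-based link extraction, the 100-page cap, and the User-Agent string are all just illustrative assumptions, not WebSphinx's API):

    import java.io.*;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.*;
    import java.util.regex.*;

    public class RandomSeedCrawler {
        // Crude link extraction; skips anchors and namespace pages like Talk: or File:
        static final Pattern WIKI_LINK = Pattern.compile("href=\"(/wiki/[^\"#:]+)\"");

        public static void main(String[] args) throws Exception {
            Set<String> visited = new HashSet<>();
            Deque<String> frontier = new ArrayDeque<>();
            // Special:Random redirects to a random article, giving a fresh seed.
            frontier.add("https://en.wikipedia.org/wiki/Special:Random");
            while (!frontier.isEmpty() && visited.size() < 100) {
                String url = frontier.poll();
                boolean isSeed = url.endsWith("Special:Random");
                if (!isSeed && !visited.add(url)) continue; // already crawled
                String html = fetch(url);
                Matcher m = WIKI_LINK.matcher(html);
                while (m.find()) {
                    frontier.add("https://en.wikipedia.org" + m.group(1));
                }
                if (frontier.isEmpty()) {
                    // Search tree exhausted: reseed with a new random article.
                    frontier.add("https://en.wikipedia.org/wiki/Special:Random");
                }
                Thread.sleep(1000); // be polite; see the comment above about server load
            }
        }

        static String fetch(String url) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            // Identify yourself with contact info; an anonymous agent is more likely to be blocked.
            conn.setRequestProperty("User-Agent", "example-crawler/0.1 (contact@example.com)");
            try (BufferedReader r = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                StringBuilder sb = new StringBuilder();
                String line;
                while ((line = r.readLine()) != null) sb.append(line).append('\n');
                return sb.toString();
            }
        }
    }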

Another question: why didn't WebSphinx crawl further? Does Wikipedia block bots that identify as 'WebSphinx'?

FrustratedWithFormsDesigner
A: 

Kind of off topic, but I recommend checking out http://www.netsoc.tcd.ie/~mu/wiki/. This guy did some really neat stuff with Wikipedia.

hypoxide
+5  A: 

I'm not sure, but maybe WebSphinx's user agent is blocked by Wikipedia's robots.txt:

http://en.wikipedia.org/robots.txt
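A quick way to check is to scan robots.txt for a group that names your crawler. A very rough sketch (whether WebSphinx identifies itself as "websphinx" is an assumption; check the User-Agent header it actually sends):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class RobotsCheck {
        public static void main(String[] args) throws Exception {
            String agent = "websphinx"; // assumed user-agent token; verify against WebSphinx's real header
            URL robots = new URL("http://en.wikipedia.org/robots.txt");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(robots.openStream()))) {
                String line;
                boolean inMatchingGroup = false;
                while ((line = in.readLine()) != null) {
                    String lower = line.toLowerCase();
                    if (lower.startsWith("user-agent:")) {
                        // Start of a new group; track whether it applies to our agent.
                        inMatchingGroup = lower.contains(agent);
                    }
                    if (inMatchingGroup) {
                        System.out.println(line); // rules that apply to this agent
                    }
                }
            }
        }
    }

This ignores the finer points of robots.txt parsing (shared multi-agent groups, wildcards), but it's enough to see whether the crawler is named at all.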

Dr.Optix
A: 

In addition to using the Wikipedia database dump mentioned above, you can use Wikipedia's API for executing queries, such as retrieving 100 random articles.

http://www.mediawiki.org/wiki/API:Query_-Lists#random.2F_rn
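For example, a minimal sketch of calling that endpoint (note that the API caps rnlimit, so 100 in one request may not be allowed depending on your API rights):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RandomArticles {
        public static void main(String[] args) throws Exception {
            // list=random returns random pages; rnnamespace=0 restricts results to articles.
            URL api = new URL("https://en.wikipedia.org/w/api.php"
                    + "?action=query&list=random&rnnamespace=0&rnlimit=10&format=json");
            HttpURLConnection conn = (HttpURLConnection) api.openConnection();
            conn.setRequestProperty("User-Agent", "example-bot/0.1 (contact@example.com)");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // raw JSON listing the random page titles
                }
            }
        }
    }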

Gabe