I've tried the WebSphinx application.

I've found that if I put wikipedia.org as the starting URL, it doesn't crawl any further.

So how do I actually crawl all of Wikipedia? Can anyone give me some guidelines? Do I need to go and find those URLs myself and supply multiple starting URLs?

Does anyone have suggestions for a good website with a tutorial on using WebSphinx's API?

+24  A: 

If your goal is to crawl all of Wikipedia, you might want to look at the available database dumps. See http://download.wikimedia.org/.
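If you do go the dump route, here's a minimal sketch of streaming a dump file to disk in Java (the exact filename below is a placeholder; browse the dump index for the current pages-articles dump):

    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.net.URL;

    public class DumpFetcher {
        public static void main(String[] args) throws Exception {
            // Placeholder dump name; check http://download.wikimedia.org/
            // for the actual, current file before running this.
            URL dump = new URL("http://download.wikimedia.org/enwiki/latest/"
                    + "enwiki-latest-pages-articles.xml.bz2");
            try (InputStream in = dump.openStream();
                 FileOutputStream out = new FileOutputStream("enwiki-pages-articles.xml.bz2")) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n); // stream to disk; the dump is far too big to buffer in memory
                }
            }
        }
    }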

Andrew
+1. Crawling Wikipedia through HTTP is rude and puts a lot of extra load on the servers.
Greg Hewgill
A: 

You probably need to start with a random article, and then crawl all articles you can get to from that starting one. When that search tree has been exhausted, start with a new random article. You could seed your searches with terms you think will lead to the most articles, or start with the featured article on the front page.
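A rough sketch of that loop in plain Java, using Special:Random as the reseed source (the regex-based link extraction, the 100-page cap, and the User-Agent string are all just illustrative assumptions, not WebSphinx's API):

    import java.io.*;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.*;
    import java.util.regex.*;

    public class RandomSeedCrawler {
        // Crude link extraction; skips anchors and namespace pages like Talk: or File:
        static final Pattern WIKI_LINK = Pattern.compile("href=\"(/wiki/[^\"#:]+)\"");

        public static void main(String[] args) throws Exception {
            Set<String> visited = new HashSet<>();
            Deque<String> frontier = new ArrayDeque<>();
            // Special:Random redirects to a random article, giving a fresh seed.
            frontier.add("https://en.wikipedia.org/wiki/Special:Random");
            while (!frontier.isEmpty() && visited.size() < 100) {
                String url = frontier.poll();
                boolean isSeed = url.endsWith("Special:Random");
                if (!isSeed && !visited.add(url)) continue; // already crawled
                String html = fetch(url);
                Matcher m = WIKI_LINK.matcher(html);
                while (m.find()) {
                    frontier.add("https://en.wikipedia.org" + m.group(1));
                }
                if (frontier.isEmpty()) {
                    // Search tree exhausted: reseed with a new random article.
                    frontier.add("https://en.wikipedia.org/wiki/Special:Random");
                }
                Thread.sleep(1000); // be polite; see the comment above about server load
            }
        }

        static String fetch(String url) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            // Identify yourself with contact info; an anonymous agent is more likely to be blocked.
            conn.setRequestProperty("User-Agent", "example-crawler/0.1 (contact@example.com)");
            try (BufferedReader r = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                StringBuilder sb = new StringBuilder();
                String line;
                while ((line = r.readLine()) != null) sb.append(line).append('\n');
                return sb.toString();
            }
        }
    }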

Another question: why didn't WebSphinx crawl further? Does Wikipedia block bots that identify as 'WebSphinx'?

FrustratedWithFormsDesigner
A: 

Kind of off topic, but I recommend checking out http://www.netsoc.tcd.ie/~mu/wiki/. This guy did some really neat stuff with Wikipedia.

hypoxide
+5  A: 

I'm not sure, but maybe WebSphinx's user agent is blocked by Wikipedia's robots.txt:

http://en.wikipedia.org/robots.txt
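A quick way to check is to scan robots.txt for a group that names your crawler. A very rough sketch (whether WebSphinx identifies itself as "websphinx" is an assumption; check the User-Agent header it actually sends):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class RobotsCheck {
        public static void main(String[] args) throws Exception {
            String agent = "websphinx"; // assumed user-agent token; verify against WebSphinx's real header
            URL robots = new URL("http://en.wikipedia.org/robots.txt");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(robots.openStream()))) {
                String line;
                boolean inMatchingGroup = false;
                while ((line = in.readLine()) != null) {
                    String lower = line.toLowerCase();
                    if (lower.startsWith("user-agent:")) {
                        // Start of a new group; track whether it applies to our agent.
                        inMatchingGroup = lower.contains(agent);
                    }
                    if (inMatchingGroup) {
                        System.out.println(line); // rules that apply to this agent
                    }
                }
            }
        }
    }

This ignores the finer points of robots.txt parsing (shared multi-agent groups, wildcards), but it's enough to see whether the crawler is named at all.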

Dr.Optix
A: 

In addition to using the Wikipedia database dump mentioned above, you can use Wikipedia's API for executing queries, such as retrieving 100 random articles.

http://www.mediawiki.org/wiki/API:Query_-Lists#random.2F_rn
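For example, a minimal sketch of calling that endpoint (note that the API caps rnlimit, so 100 in one request may not be allowed depending on your API rights):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RandomArticles {
        public static void main(String[] args) throws Exception {
            // list=random returns random pages; rnnamespace=0 restricts results to articles.
            URL api = new URL("https://en.wikipedia.org/w/api.php"
                    + "?action=query&list=random&rnnamespace=0&rnlimit=10&format=json");
            HttpURLConnection conn = (HttpURLConnection) api.openConnection();
            conn.setRequestProperty("User-Agent", "example-bot/0.1 (contact@example.com)");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // raw JSON listing the random page titles
                }
            }
        }
    }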

Gabe