I am trying to get the list of people from http://en.wikipedia.org/wiki/Category:People_by_occupation. I have to go through all the sections and get the people from each section.

How should I go about it? Should I use a crawler to fetch the pages and search through them using BeautifulSoup?
Or is there another way to get the same data from Wikipedia?

+1  A: 

If you want, you can just download the entire Wikipedia dump and work from there. The one you would probably want is the articles-only dump dated 3 Feb 2010. But beware: it's 5.6 GB in size.

Hao Wooi Lim
I wouldn't recommend using the dump for processing categories. In that case the OP would have to write Wikipedia template processing, since some categories are added through templates. I vote for crawling Wikipedia pages.
Yaroslav
Crawling such a big web site (and against the site policy) is not a good idea. Processing Wikipedia is difficult, but parsing an XML dump of it is not that bad, and I can assure you it can be done in 2 GB of RAM.
Ross
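To illustrate the dump approach discussed in the comments above, here is a minimal sketch of streaming a MediaWiki XML export page by page, so memory use stays roughly constant regardless of dump size. The function names and the sample XML are made up for illustration, and the regular expression only finds categories written directly in the wikitext. Categories added through templates, as noted above, would be missed.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches [[Category:Name]] and [[Category:Name|sort key]] in raw wikitext.
CATEGORY_LINK = re.compile(r"\[\[\s*Category\s*:\s*([^\]|]+)", re.IGNORECASE)

def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(source):
    """Yield (title, wikitext) for each <page> without loading the whole file."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    title, text = None, ""
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            root.clear()  # discard pages already processed

def categories_of(wikitext):
    """Return category names linked directly in the wikitext."""
    return [name.strip() for name in CATEGORY_LINK.findall(wikitext)]

# Tiny made-up sample standing in for a real dump file.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page><title>Ada Lovelace</title>
    <revision><text>Text [[Category:Mathematicians]]</text></revision></page>
  <page><title>Paris</title>
    <revision><text>[[Category:Capitals in Europe|Paris]]</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))))
for page_title, page_text in pages:
    print(page_title, categories_of(page_text))
```

For a real dump, the `io.BytesIO(...)` argument would be replaced by an open file object, for example one returned by `bz2.open(path, "rb")` if the dump is bzip2-compressed.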
+3  A: 

I would go with the Pywikipediabot Python project.

Have a look at category.py. You could use:

* tree        - show a tree of subcategories of a given category
* listify     - make a list of all of the articles that are in a category
systempuntoout
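As a rough illustration of what a "listify" over a whole category tree involves, the sketch below walks subcategories and collects articles, guarding against cycles (a category can end up being its own descendant). This is not the code from category.py. The function name, the two dictionaries, and their contents are hypothetical stand-ins for data the bot would fetch from Wikipedia.

```python
def listify(root, subcats, articles):
    """Collect the articles in `root` and in every category reachable from it.

    subcats:  dict mapping a category name to its direct subcategories
    articles: dict mapping a category name to the articles filed directly in it
    """
    seen = set()      # categories already visited, protects against cycles
    found = set()     # articles collected so far, de-duplicated
    stack = [root]
    while stack:
        category = stack.pop()
        if category in seen:
            continue
        seen.add(category)
        found.update(articles.get(category, ()))
        stack.extend(subcats.get(category, ()))
    return sorted(found)

# Hypothetical data; note the deliberate cycle Physicists -> Scientists.
SUBCATS = {
    "People by occupation": ["Scientists", "Artists"],
    "Scientists": ["Physicists"],
    "Physicists": ["Scientists"],
}
ARTICLES = {
    "Scientists": ["Marie Curie"],
    "Physicists": ["Marie Curie", "Niels Bohr"],
    "Artists": ["Frida Kahlo"],
}

people = listify("People by occupation", SUBCATS, ARTICLES)
print(people)
```

An explicit stack is used instead of recursion so that a very deep category tree cannot hit Python's recursion limit.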
A: 

You can use the CatScan tool to search categories.

Instructions here
http://meta.wikimedia.org/wiki/CatScan

Example search - note that the HTML format maxes out at 1000 results, so choose CSV export to retrieve all the results. Also, be sure to adjust the category depth and other options as needed.
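Once the CSV export is saved, pulling the page titles out of it could look like the sketch below. The column layout of the CatScan export is not described here, so the `column` parameter and the sample rows are assumptions. Check the actual file and adjust the index accordingly.

```python
import csv
import io

def titles_from_csv(text, column=0):
    """Return the non-empty values found in one column of CSV text.

    `column` is an assumed position of the page title; verify it against
    the real export before relying on it.
    """
    titles = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) > column and row[column].strip():
            titles.append(row[column].strip())
    return titles

# Made-up rows standing in for an exported file.
SAMPLE_CSV = 'Ada Lovelace,0\n"Curie, Marie",0\n\n'

titles = titles_from_csv(SAMPLE_CSV)
print(titles)
```

Using the `csv` module rather than splitting on commas keeps quoted titles that themselves contain commas intact.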

The pywikipediabot already mentioned is another option.

Gabe