views: 321
answers: 3

Hi all,

This is a "big" question and I don't know where to start, so I hope some of you can point me in a direction. If this turns out not to be a good question, I will close the thread with an apology.

I wish to go through the Wikipedia database (say, the English one) and compute statistics. For example, I am interested in how many active editors (a term that would need defining) Wikipedia had at each point in time (say, over the last two years).

I don't know how to build such a database, how to access it, or what types of data it contains. So my questions are:

  1. What tools do I need for this (besides basic R)? MySQL on my machine? An RODBC database connection?
  2. How do you start planning a project like this?
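To make the target statistic concrete: once the edit events are extracted, "active editors at each point in time" can be computed from a list of (timestamp, editor) pairs. The sketch below is a minimal, hypothetical version in Python; the threshold of 5 edits per month as the definition of "active" is an assumption, not a Wikipedia convention.

```python
from collections import defaultdict
from datetime import datetime

def active_editors_per_month(events, min_edits=5):
    """Count editors with at least `min_edits` edits in each month.

    `events` is an iterable of (datetime, editor_name) pairs.
    `min_edits` is an assumed activity threshold; pick your own definition.
    """
    counts = defaultdict(lambda: defaultdict(int))  # month -> editor -> edit count
    for ts, editor in events:
        counts[ts.strftime("%Y-%m")][editor] += 1
    return {month: sum(1 for n in editors.values() if n >= min_edits)
            for month, editors in sorted(counts.items())}

# Toy data: Alice makes 6 edits in January and 5 in February; Bob makes 2 in January.
events = (
    [(datetime(2010, 1, d), "alice") for d in range(1, 7)]
    + [(datetime(2010, 1, d), "bob") for d in range(1, 3)]
    + [(datetime(2010, 2, d), "alice") for d in range(1, 6)]
)
print(active_editors_per_month(events))  # {'2010-01': 1, '2010-02': 1}
```

With a threshold of 5, only Alice counts as active in either month, which is why both months report a single active editor.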
+8  A: 

You'll want to start here: http://en.wikipedia.org/wiki/Wikipedia:Database_download

Which will take you to here: http://download.wikimedia.org/enwiki/20100312/

And the file you probably want is:

# 2010-03-17 04:33:50 done Log events to all pages.
    * This contains the log of actions performed on pages.
    * pages-logging.xml.gz 1.0 GB

http://download.wikimedia.org/enwiki/20100312/enwiki-20100312-pages-logging.xml.gz

You'll then import the XML into MySQL. Generating a histogram of users per day, week, or year won't require R; you'll be able to do that with a single MySQL query, something like:

SELECT DAYOFYEAR(wiki_edit_timestamp), COUNT(*)
FROM page_logs
GROUP BY DAYOFYEAR(wiki_edit_timestamp)
ORDER BY DAYOFYEAR(wiki_edit_timestamp);

etc.

(I'm not sure what their actual schema is, but it'll be something like that.)
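The "import the XML into MySQL" step usually means streaming the dump rather than loading the whole 1.0 GB file into memory. Here is a minimal Python sketch of that extraction pass; the `<logitem>`, `<timestamp>`, and `<contributor><username>` element names are assumptions modeled on the MediaWiki export format, so verify them against the actual dump (which also wraps everything in an XML namespace) before relying on this.

```python
import xml.etree.ElementTree as ET
from io import StringIO

# A tiny stand-in for pages-logging.xml. The real dump's element names and
# namespace must be checked against the file itself; these are hypothetical.
sample = """<mediawiki>
  <logitem>
    <timestamp>2010-03-01T12:00:00Z</timestamp>
    <contributor><username>Alice</username></contributor>
  </logitem>
  <logitem>
    <timestamp>2010-03-02T08:30:00Z</timestamp>
    <contributor><username>Bob</username></contributor>
  </logitem>
</mediawiki>"""

rows = []
# iterparse() streams the document, so a 1 GB dump never sits in memory at once.
for event, elem in ET.iterparse(StringIO(sample), events=("end",)):
    if elem.tag == "logitem":
        rows.append((elem.findtext("timestamp"),
                     elem.findtext("contributor/username")))
        elem.clear()  # free the subtree we just processed

print(rows)
# [('2010-03-01T12:00:00Z', 'Alice'), ('2010-03-02T08:30:00Z', 'Bob')]
```

Each (timestamp, username) row can then be batch-INSERTed into a MySQL table such as the `page_logs` assumed by the query above.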

You'll run into issues, no doubt, but you'll learn a lot too. Good luck!

Roger
Thanks, Roger, for the head start! So now my next steps are to set up MySQL and then import the data. Thanks :)
Tal Galili
+5  A: 

You could

Karsten W.
Fantastic answer Karsten, thanks a lot!
Tal Galili
+1  A: 

Try WikiXRay and Zotero.

Juliana