Hi,

I am currently working on a project on person name disambiguation. The idea is that it should identify the correct person when several people share the same name. I have used Wikipedia for this. I want to evaluate my project on some standard data, so I am looking for test data. I am not familiar with the popular names on Wikipedia. Any idea where I can find such data? I am not looking for vast amounts of data, just some 100-500 examples.

Thank you

Adding more information to the question.

What I am looking for is people who share the same name but are actually different people. For example, Michael Jordan is a famous basketball player, and there is also a statistician with that name. I am looking for examples like this.

http://en.wikipedia.org/wiki/Michael_Jordan http://en.wikipedia.org/wiki/Michael_I._Jordan

I hope the question is clearer now.

A: 

I wonder why you can't use the names of SO users: http://stackoverflow.com/users?tab=reputation

It is already ranked by rep, so you know the "popular names".

kartheek
I think you didn't get my question. I have updated it now.
Algorist
A: 

http://en.wikipedia.org/wiki/Category:Redirects_to_disambiguation_pages is a huge list of disambiguation pages on Wikipedia. Every page linked from there contains links to pages for things with ambiguous names. Is that what you're looking for?
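
If it helps, the members of that category can also be listed through the MediaWiki API instead of browsing the page. This is only a minimal sketch, assuming the standard `action=query&list=categorymembers` query on en.wikipedia.org. The helper names and the sample payload are made up for illustration, and the code only builds the URL and parses a response, it does not make a network call:

```python
import json
from urllib.parse import urlencode

# Assumption: the public English Wikipedia API endpoint.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def category_members_url(category, limit=100, cmcontinue=None):
    """Build a query URL that lists the pages in a category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "format": "json",
    }
    # The API returns a continuation token when there are more results.
    if cmcontinue:
        params["cmcontinue"] = cmcontinue
    return API_ENDPOINT + "?" + urlencode(params)


def titles_from_response(payload):
    """Extract page titles from the JSON text of a categorymembers response."""
    data = json.loads(payload)
    members = data.get("query", {}).get("categorymembers", [])
    return [member["title"] for member in members]


# Hypothetical response body, shaped like what the API returns.
sample = (
    '{"query": {"categorymembers": '
    '[{"pageid": 1, "ns": 0, "title": "Michael Jordan (disambiguation)"}]}}'
)
print(category_members_url("Category:Redirects_to_disambiguation_pages", limit=50))
print(titles_from_response(sample))
```

Fetching the URL and following `cmcontinue` until it stops appearing would give the full list, from which 100-500 person names could be sampled by hand.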

Scott Stafford
Thanks for the link. Actually, I am already mining this data for my project, but I want to evaluate the project on some popular names.
Algorist
+2  A: 

Datasets for testing:

Good luck!

Skarab
Thanks for the datasets. But all of these datasets include data to train on and then articles to evaluate on. I just need people's names with two senses, because my code extracts features from Wikipedia and cannot be applied to arbitrary text.
Algorist
Hmm, now I understand. You are developing an algorithm that takes into account whether, e.g., a wiki page has an infobox. The best evaluation base would be information about which wiki pages were merged in a given period of time. There is surely such a dataset somewhere, because there are a lot of research projects on archiving the web, e.g., http://www.slideshare.net/phonedude/memento-time-travel-for-the-web, and Wikipedia is one of the most important knowledge portals on the web. Maybe you can extract this information from the history of the wiki pages.
Skarab
The Wikipedia page about merging could be helpful when looking for information on how to retrieve wiki page histories automatically: http://en.wikipedia.org/wiki/Wikipedia:Merge
Skarab
Here you will find information on how to get the history of pages: http://meta.wikimedia.org/wiki/Data_dumps#Format
Skarab
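
A rough sketch of reading page histories from a dump, assuming the XML export layout (`<page>` elements containing a `<title>` and one or more `<revision>` elements with `<timestamp>` and `<comment>`). The function name and the sample data are hypothetical, and whether merges can be recognized from the edit summaries is an assumption you would have to check on real data:

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_page_histories(xml_stream):
    """Yield (title, [(timestamp, comment), ...]) for each <page> element."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = None
        revisions = []
        for child in elem:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "revision":
                fields = {local_name(f.tag): (f.text or "") for f in child}
                revisions.append(
                    (fields.get("timestamp", ""), fields.get("comment", ""))
                )
        yield title, revisions
        # Free the memory used by this page; real dumps are very large.
        elem.clear()


# Tiny hypothetical sample in the export layout, not real dump data.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Michael Jordan</title>
    <revision>
      <timestamp>2010-01-01T00:00:00Z</timestamp>
      <comment>hypothetical edit summary</comment>
    </revision>
  </page>
</mediawiki>"""

for title, revisions in iter_page_histories(io.BytesIO(SAMPLE)):
    print(title, revisions)
```

For a real dump you would pass an opened (and decompressed) file instead of the in-memory sample, then filter the revision comments for whatever wording marks a merge.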