tags:
views: 65
answers: 3

I need a list of common first names for people, like "Bill", "Gordon", "Jane", etc. Is there a free list of lots of known names available, instead of me having to type them out? Something that I can easily parse from a program to fill an array, for example?

I'm not worried about:

  • Knowing if a name is masculine or feminine (or both)
  • If the dataset has a whole pile of false positives
  • If there are names that aren't on it; obviously no dataset like this will be complete.
  • If there are 'duplicates', i.e. I don't care if the dataset lists "Bill" and "William" and "Billy" as different names. I'd rather have more data than less
  • Knowing the popularity of the name

I know Wikipedia has a list of most popular given names, but that's all in an HTML page and mangled up with horrible wiki syntax. Is there a better way to get some sample data like this without having to screen-scrape Wikipedia?

+1  A: 

http://www.fakenamegenerator.com/

Antony
+2  A: 

That ought to be enough to get you started, I'd think.

Mark Rushakoff
A: 

You can easily consume the Wikipedia API (http://en.wikipedia.org/w/api.php) to retrieve the list of pages in a specific category; Category:Given names looks like a good place to start.

http://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmnamespace=0&cmlimit=500&cmtitle=Category:Given_names

Part of the result from this URL looks like this:

  <cm pageid="5797824" ns="0" title="Abdou" />
  <cm pageid="5797863" ns="0" title="Abdu" />
  <cm pageid="859035" ns="0" title="Abdul Aziz" />
  <cm pageid="6504818" ns="0" title="Abdul Qadir" />

Look at the API documentation to pick the appropriate output format and query parameters, and check the subcategories as well.
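
As a minimal sketch, here is one way to issue that query from Python 3 (standard library only) and pull the titles out. format=json is assumed here purely because it is easier to parse than the XML shown above, and the User-Agent string is a placeholder:

  # One request against the categorymembers list, returning up to 500 titles.
  import json
  import urllib.parse
  import urllib.request

  params = {
      "action": "query",
      "list": "categorymembers",
      "cmtitle": "Category:Given_names",
      "cmnamespace": "0",
      "cmlimit": "500",
      "format": "json",
  }
  url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)
  req = urllib.request.Request(url, headers={"User-Agent": "name-list-example/0.1"})
  with urllib.request.urlopen(req) as resp:
      data = json.load(resp)

  # Each member carries "pageid", "ns" and "title", matching the <cm .../> elements above.
  names = [member["title"] for member in data["query"]["categorymembers"]]
  print(len(names), names[:5])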

P.S. The wiki-text of the page you linked to contains the names in a form that is easy to extract with a regexp. Likewise, the link titles in the rendered HTML page have "(name)" attached to the name itself.
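
If you do end up going the scraping route, something along these lines might work; the exact " (name)" suffix in the link titles is an assumption based on the observation above, so adjust the pattern to whatever the page actually contains:

  import re

  # Rough sketch only: assumes link titles in the rendered HTML look like
  # title="Bill (name)"; tweak the pattern to match the real markup.
  def extract_names(html: str) -> list[str]:
      return re.findall(r'title="([^"]+?) \(name\)"', html)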

Juicy Scripter
The *cmlimit* option in the query is at the maximum (500) allowed for unauthorized users; it can be raised to 5000 items for accounts with higher API limits. Either way, use the *cmcontinue* option to retrieve all results chunk by chunk.
Juicy Scripter
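
A sketch of that chunk-by-chunk loop, assuming the current API's continuation scheme (the response carries a continue block whose values get fed back into the next request); the User-Agent string is again just a placeholder:

  import json
  import urllib.parse
  import urllib.request

  API = "https://en.wikipedia.org/w/api.php"

  def all_category_members(category: str) -> list[str]:
      titles: list[str] = []
      cont: dict[str, str] = {}
      while True:
          params = {
              "action": "query",
              "list": "categorymembers",
              "cmtitle": category,
              "cmnamespace": "0",
              "cmlimit": "500",  # maximum for ordinary (unauthorized) clients
              "format": "json",
              **cont,
          }
          req = urllib.request.Request(
              API + "?" + urllib.parse.urlencode(params),
              headers={"User-Agent": "name-list-example/0.1"},
          )
          with urllib.request.urlopen(req) as resp:
              data = json.load(resp)
          titles += [m["title"] for m in data["query"]["categorymembers"]]
          if "continue" not in data:
              return titles
          cont = data["continue"]  # carries cmcontinue for the next chunk

  names = all_category_members("Category:Given_names")
  print(len(names))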