I have recently expanded the names corpus in nltk and would like to know how I can turn the two files I have (male.txt, female.txt) into a corpus so I can access them using the existing nltk.corpus methods. Does anyone have any suggestions?

Many thanks, James.

+2  A: 

As the readme says, the names corpus is not in the public domain; you should send an email with any changes you make to the corpus author (the address is in that file). Apart from that detail of law and courtesy, you can simply replace either or both of those files with your own. They're in a perfectly simple format: one name per line, with comments allowed (lines starting with '#', which are ignored).
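To make the file format concrete, here is a minimal sketch that reads such a list using only the standard library. The helper name `read_name_list` and the sample names are hypothetical, and this only illustrates the format as described above; it is not how NLTK itself loads the files.

```python
import io


def read_name_list(stream):
    """Return the names from a one-name-per-line stream.

    Blank lines and lines starting with '#' (comments) are skipped,
    following the format described in the answer.
    """
    names = []
    for line in stream:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        names.append(line)
    return names


# Hypothetical sample content standing in for male.txt or female.txt.
sample = io.StringIO("# hypothetical additions\nAda\n\nGrace\n")
names = read_name_list(sample)
print(names)
```

For a real file, you would pass an open file object instead of the `io.StringIO` sample.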

To install a totally new corpus rather than just tweaking an existing one, you could start with the docs given here.

Alex Martelli
Thanks for the reply. I have emailed the changes to the owner of the Names corpus.
James Smith
A: 

Alex is right: start with the docs, and figure out which corpus reader will work for your corpus. Then simply instantiate it, given the path to your corpus file(s). As you'll see in the docs, the builtin corpora are simply instances of particular corpus reader classes. Looking through the code in the nltk.corpus package should be helpful as well.

Jacob
+1  A: 

I came to understand how corpus reading works by looking at the source code in nltk.corpus and then looking at the corpora themselves (located in /home/[user]/nltk_data/corpora/names; this will probably be in My Documents for XP and somewhere in User for Win7 users).

The structure of the corpus and its related function will give a good understanding of how to use the different corpora available in NLTK.

In my case I looked at the names variable in nltk.corpus' source code and was interested in the WordListCorpusReader class, as the names corpus is simply a list of words.
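As a rough sketch of that approach, the snippet below points a WordListCorpusReader at a directory containing male.txt and female.txt. The temporary directory and the sample names are assumptions made so the example is self-contained; in practice the root would be wherever your expanded files live.

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# Assumption: a throwaway directory with tiny sample files stands in
# for the real location of the expanded male.txt and female.txt.
corpus_root = tempfile.mkdtemp()
samples = {"male.txt": "Aaron\nAbel\n", "female.txt": "Ada\nGrace\n"}
for fileid, content in samples.items():
    with open(os.path.join(corpus_root, fileid), "w", encoding="utf-8") as f:
        f.write(content)

# Build a reader over the two files, one word (name) per line.
my_names = WordListCorpusReader(corpus_root, ["female.txt", "male.txt"])

male = my_names.words("male.txt")  # names from a single file
all_names = my_names.words()       # names from every file in the corpus
print(my_names.fileids(), male, len(all_names))
```

The resulting object supports the same fileids() and words() calls as the built-in names corpus, so existing code written against nltk.corpus.names should need little change.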

James Smith