I need the most exhaustive English word list I can find for several types of language processing operations, but I could not find anything on the internet that has good enough quality.

English is said to have about 1,000,000 words once foreign and technical terms are included.

Can you suggest a source of that size (or at least close to 500k words) that can be downloaded from the internet, ideally with some categorization? What input do you use for your own language processing applications?

+12  A: 

Kevin's word lists are the best I know of if you only need lists of words.

WordNet is better if you also want to know which words are nouns, verbs, etc., or need synonyms and similar relations.

Nick Fortescue
I've used Kevin's lists before. I merged a bunch of them together to get one huge list so I could generate all possible words from a given set of chars.
dotjoe
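The merge-and-search idea from the comment above could be sketched roughly as follows. This is only an illustration, not dotjoe's actual code: the function names `merge_word_lists` and `words_from_letters` are hypothetical, and it assumes each list is already loaded as plain strings, one word per entry.

```python
from collections import Counter
from typing import Iterable, List


def merge_word_lists(lists: Iterable[Iterable[str]]) -> List[str]:
    """Merge several word lists into one sorted, lowercase list without duplicates."""
    merged = set()
    for words in lists:
        for word in words:
            word = word.strip().lower()
            if word:  # skip blank entries
                merged.add(word)
    return sorted(merged)


def words_from_letters(letters: str, vocabulary: Iterable[str]) -> List[str]:
    """Return the vocabulary words that can be spelled using each given letter at most once."""
    available = Counter(letters.lower())
    found = []
    for word in vocabulary:
        needed = Counter(word)
        # A word fits if no letter is needed more often than it is available.
        if all(available[ch] >= count for ch, count in needed.items()):
            found.append(word)
    return found


if __name__ == "__main__":
    vocabulary = merge_word_lists([["Tea", "eat "], ["ate", "tea", "teat"]])
    print(words_from_letters("tea", vocabulary))
```

Counting letters with `Counter` avoids generating every permutation of the input characters, which matters once the merged list grows to hundreds of thousands of words.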
+2  A: 

Who told you there were 1 million words? According to Wikipedia, the Oxford English Dictionary has only 600,000. And the OED tries to include all technical and slang terms that are in use.

Kibbee
What's a power of two between friends?
zaratustra
English is a synthetic language. I've heard the 1M number too, usually as a lower bound on the number of words that you can create on the fly.
rmeador
+2  A: 

I did research for Purdue on controlled natural English and language domain knowledge processing.

I would take a look at the Attempto project: http://attempto.ifi.uzh.ch/site/description/ which aims to build a controlled natural English.

You can download their entire word lexicon at: http://attempto.ifi.uzh.ch/site/downloads/files/clex-6.0-080806.zip It has roughly 100,000 natural English words.

You can also supply your own lexicon for domain-specific words, which is what we did in our research. They also offer web services to parse and format natural English text.

mmattax
+3  A: 

"The 'million word' hoax rolls along", I see ;-)

How to make your word lists longer: given a noun, add any of the following to it: non-, pseudo-, semi-, -arific, -geek, ...; mutatis mutandis for verbs etc.

unhammer
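To show how quickly that trick inflates a list, here is a small sketch. The affixes are the ones named in the answer above; the function name `expand_with_affixes` is hypothetical, and joining the pieces by plain concatenation (keeping the hyphens as written) is an assumption, not a claim about English morphology.

```python
from typing import Iterable, List

# Affixes taken from the answer above; the hyphen marks where the noun attaches.
PREFIXES = ("non-", "pseudo-", "semi-")
SUFFIXES = ("-arific", "-geek")


def expand_with_affixes(
    noun: str,
    prefixes: Iterable[str] = PREFIXES,
    suffixes: Iterable[str] = SUFFIXES,
) -> List[str]:
    """Return the forms made by attaching each prefix or suffix to the noun."""
    forms = [prefix + noun for prefix in prefixes]
    forms.extend(noun + suffix for suffix in suffixes)
    return forms


if __name__ == "__main__":
    print(expand_with_affixes("word"))
    # ['non-word', 'pseudo-word', 'semi-word', 'word-arific', 'word-geek']
```

With only these five affixes, every noun in a list yields five extra entries, which is why headline word counts depend so heavily on what is allowed to count as a word.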