views: 815
answers: 9

Hello,

I have a file containing roughly all the words in English (~60k words, ~500k characters). I want to test whether a certain word I receive as input is "in English" (i.e. if this exact word is in the list).

What would be the most efficient way to do this in Python?

The trivial solution is to load the file into a list and check whether the word is in that list. The list can be sorted, which I believe will shrink the complexity to O(log n). However, I'm not sure how Python implements searching through lists, and whether there's a performance penalty if such a large list is in memory. Can I "abuse" the fact that I can put a cap on the length of words? (e.g. say the longest one is 15 characters long).

Please note I run the application on a machine with lots of memory, so I care less about memory consumption than about speed and CPU utilization.

Thanks

+1  A: 

You're basically testing whether a member is in a set or not, right?

If so, and because you said you have lots of memory, why not just load all the words as keys in memcache, and then for every word, check whether it is present in memcache or not.
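
For illustration, a minimal sketch of the memcache idea, assuming the python-memcached client and a memcached server running on localhost ('words.txt' is just a placeholder filename):

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

for line in open('words.txt'):       # one word per line (placeholder file)
    mc.set(line.strip(), 1)          # the value is a dummy; only the key matters

is_english = mc.get('hello') is not None   # True if 'hello' was loaded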

Or use the data structure that bash uses to autocomplete command names - it's fast and highly memory-efficient (I can't remember the name).

Swaroop C H
The data structure is called a Trie (http://en.wikipedia.org/wiki/Trie).
Brian
+11  A: 

The Python set type is what you should try.

A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.

gimel
Would you expect any speed difference between set and frozenset?
Brent.Longborough
+2  A: 

A Trie structure would suit your purposes. There are undoubtedly Python implementations to be found out there...
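
For illustration, a minimal dict-of-dicts trie sketch (just the idea, not any particular library's API):

END = object()   # marker meaning "a complete word ends here"

def make_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def in_trie(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

trie = make_trie(['foo', 'bar', 'baz'])
in_trie(trie, 'bar')   # True
in_trie(trie, 'ba')    # False - 'ba' is only a prefix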

Paul Dixon
+1  A: 

If memory consumption isn't an issue and the words won't change, the fastest way to do this is to put everything in a hash and search that way. In Python, this is the set. You'll have constant-time lookup.

John Feminella
+1, but I'll bring out the old saw: lookup in hashtables is not truly O(1) -- it is only O(1) if (a) the dataset is sufficiently small and (b) you don't store one of the pathological sets of keys that produces O(n) (linked-list-like) lookup times. In practice (b) is almost never violated, but many implementations violate (a) by adjusting the number of buckets according to the number of elements stored in the hashtable. But regardless of the true time complexity, hashtables should work nicely in your case.
j_random_hacker
Python makes extensive use of hashtables throughout its implementation (all class members, modules, etc.). Almost everything is stored in hashtables in Python, and because of this, you'll find that Python's hashtable implementation is one of the best and most efficient, at least when it comes to "everyday use".
Nico
I was under the impression that sets are implemented with balanced trees, not hashes (which means O(log n) lookup). Isn't this right?
Eli Bendersky
+1  A: 

500k characters is not a large list. If the items in your list are unique and you need to do this search repeatedly, use a set, which would lower the complexity to O(1) on average.

SilentGhost
Exactly - Sets are built using Hashtables - hence O(1)
Dario
+3  A: 

Sample Python code:

L = ['foo', 'bar', 'baz']  # Your list
s = set(L)                 # Converted to a set

print 'foo' in s   # True
print 'blah' in s  # False
Roman Zeyde
If you're only doing a few look-ups, the conversion from list->set can take more time than you save by using a set. Depends on the size of the list and the number of look-ups, of course.
dbr
+2  A: 

Two things:

The Python 'mutable set' type has an 'add' method ( s.add(item) ), so you could go straight from reading each line of your big file into a set, without using a list as an intermediate data structure.

Python lets you 'pickle' a data structure, so you could save your big set to a file and avoid the time of rebuilding the set later.
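
For illustration, a rough sketch of both ideas together (file names are just placeholders):

import pickle

s = set()
for line in open('words.txt'):    # read the word list line by line
    s.add(line.strip())           # add each word directly, no intermediate list

with open('words.pickle', 'wb') as out:   # save the built set once...
    pickle.dump(s, out)

with open('words.pickle', 'rb') as f:     # ...and reload it on later runs
    s = pickle.load(f)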

On a separate note, I've been looking for a list of all the single-syllable words in English for my own amusement, but the ones I've found mentioned seem to be proprietary. If it isn't too intrusive, could I ask whether your list of English words can be obtained by others?

behindthefall
You don't even need .add(). set() takes an iterable as an argument, so assuming the words are stored one per line, "f=open("words.txt") ; s = set(f)" will work, and use no unnecessary list. Pickling is not a good idea though - it will probably take at least as long to restore from a pickle as to reconstruct the set. If initialisation time is important, using an on-disk format like the dbm libraries would be better.
Brian
Thanks. I'll remember that.
behindthefall
A: 

Converting the list to a set will only be helpful if you repeatedly run this kind of query against the data, as will sorting the list and doing a binary search. If you're only going to pull data out of the list once, a plain old linear search is your best bet:

if 'foo' in some_list:
    do_something()

Otherwise, your best bet is to use either a set, as has been mentioned, or a binary search. Which one you should choose depends largely on how big the data is and how much memory you can spare. I'm told that really large lists tend to benefit more from hashing, although the amount of memory taken up can be prohibitive.
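
For the binary-search route, here's a sketch using the standard bisect module on a sorted list ('words.txt' is a placeholder for the word file):

import bisect

words = sorted(line.strip() for line in open('words.txt'))

def is_english(word):
    # bisect_left finds where `word` would be inserted; check it's actually there
    i = bisect.bisect_left(words, word)
    return i < len(words) and words[i] == word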

Finally, a third option is to import the data into a sqlite database and read directly from it. Sqlite is very fast and may save you the trouble of loading the whole list from the file. Python has a very good built-in sqlite library.
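
A rough sketch of the sqlite approach using the built-in sqlite3 module (file names are placeholders):

import sqlite3

conn = sqlite3.connect('words.sqlite')
conn.execute('CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)')
conn.executemany('INSERT OR IGNORE INTO words VALUES (?)',
                 ((line.strip(),) for line in open('wordlist.txt')))
conn.commit()

def is_english(word):
    cur = conn.execute('SELECT 1 FROM words WHERE word = ?', (word,))
    return cur.fetchone() is not None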

Jason Baker
+1  A: 

Others have given you the in-memory way using set(), and this is generally going to be the fastest approach; it should not tax your memory for a 60k-word dataset (a few MiB at most). You should be able to construct your set with:

f=open('words.txt')
s = set(word.strip() for word in f)

However, it does require some time to load the set into memory. If you are checking lots of words, this is no problem - the lookup time will more than make up for it. However, if you're only going to be checking one word per command execution (e.g. this is a command-line app like "checkenglish [word]"), the startup time will be longer than it would have taken you just to search through the file line by line.

If this is your situation, or you have a much bigger dataset, using an on-disk format may be better. The simplest way would be using the dbm module. Create such a database from a wordlist with:

import dbm

f = open('wordlist.txt')
db = dbm.open('words.db', 'c')   # 'c' creates the database if it doesn't exist
for word in f:
    db[word.strip()] = '1'       # strip the newline so lookups match exactly
f.close()
db.close()

Then your program can check membership with:

db = dbm.open('words.db','r')
if db.has_key(word):
    print "%s is english" % word
else:
    print "%s is not english" % word

This will be slower than a set lookup, since there will be disk access, but it will be faster than searching the whole file, have low memory use, and no significant initialisation time.

There are also other alternatives, such as using a SQL database (e.g. sqlite).

Brian
Bear in mind that constructing the set directly from file, while elegant, will include the line ending characters, which may not be what you want.
Brent.Longborough
Oops, you're right. Updated to strip line endings / extra whitespace.
Brian