views: 815
answers: 9

Hello,

I have a file containing roughly all the words in English (~60k words, ~500k characters). I want to test whether a certain word I receive as input is "in English" (i.e. if this exact word is in the list).

What would be the most efficient way to do this in Python?

The trivial solution is to load the file into a list and check whether the word is in that list. The list can be sorted, which I believe will shrink the complexity to O(log n). However, I'm not sure how Python implements searching through lists, and whether there's a performance penalty if such a large list is in memory. Can I "abuse" the fact that I can put a cap on the length of words? (e.g. say the longest one is 15 characters long).

Please note I run the application on a machine with lots of memory, so I care less about memory consumption than about speed and CPU utilization.

Thanks

+1  A: 

You're basically testing whether a member is in a set or not, right?

If so, and because you said you have lots of memory, why not just load all the words as keys in memcache, and then for every word, check whether it is present in memcache or not.
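
For illustration, a minimal sketch of the memcache idea, assuming the python-memcached client and a memcached server running on localhost ('words.txt' is just a placeholder filename):

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

for line in open('words.txt'):       # one word per line (placeholder file)
    mc.set(line.strip(), 1)          # the value is a dummy; only the key matters

is_english = mc.get('hello') is not None   # True if 'hello' was loaded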

Or use the data structure that bash uses to autocomplete command names - it's fast and highly memory-efficient (I can't remember the name).

Swaroop C H
The data structure is called a Trie (http://en.wikipedia.org/wiki/Trie).
Brian
+11  A: 

The Python set type is what you should try.

A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.

gimel
Would you expect any speed difference between set and frozenset?
Brent.Longborough
+2  A: 

A Trie structure would suit your purposes. There are undoubtedly Python implementations to be found out there...
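
For illustration, a minimal dict-of-dicts trie sketch (just the idea, not any particular library's API):

END = object()   # marker meaning "a complete word ends here"

def make_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def in_trie(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

trie = make_trie(['foo', 'bar', 'baz'])
in_trie(trie, 'bar')   # True
in_trie(trie, 'ba')    # False - 'ba' is only a prefix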

Paul Dixon
+1  A: 

If memory consumption isn't an issue and the words won't change, the fastest way to do this is to put everything in a hash and search that way. In Python, this is the set. You'll have constant-time lookup.

John Feminella
+1, but I'll bring out the old saw: lookup in hashtables is not truly O(1) -- it is only O(1) if (a) the dataset is sufficiently small and (b) you don't store one of the pathological sets of keys that produces O(n) (linked-list-like) lookup times. In practice (b) is almost never violated, but many implementations violate (a) by adjusting the number of buckets according to the number of elements stored in the hashtable. But regardless of the true time complexity, hashtables should work nicely in your case.
j_random_hacker
Python makes extensive use of hashtables throughout its implementation (all class members, modules, etc.). Almost everything is stored in hashtables in Python, and because of this, you'll find that Python's hashtable implementation is one of the best and most efficient, at least when it comes to "everyday use".
Nico
I was under the impression that sets are implemented with balanced trees, not hashes (which means O(log n) lookup). Isn't this right?
Eli Bendersky
+1  A: 

500k characters is not a large list. If the items in your list are unique and you need to do this search repeatedly, use a set, which would lower the complexity to O(1) on average.

SilentGhost
Exactly - Sets are built using Hashtables - hence O(1)
Dario
+3  A: 

Sample Python code:

L = ['foo', 'bar', 'baz']  # Your list
s = set(L)                 # Converted to a set

print 'foo' in s   # True
print 'blah' in s  # False
Roman Zeyde
If you're only doing a few look-ups, the conversion from list->set can take more time than you save by using a set. Depends on the size of the list and the number of look-ups, of course.
dbr
+2  A: 

Two things:

The Python 'mutable set' type has an 'add' method ( s.add(item) ), so you could go straight from reading each line of your big file into a set, without using a list as an intermediate data structure.

Python lets you 'pickle' a data structure, so you could save your big set to a file and avoid the time of rebuilding the set later.
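
For illustration, a rough sketch of both ideas together (file names are just placeholders):

import pickle

s = set()
for line in open('words.txt'):    # read the word list line by line
    s.add(line.strip())           # add each word directly, no intermediate list

with open('words.pickle', 'wb') as out:   # save the built set once...
    pickle.dump(s, out)

with open('words.pickle', 'rb') as f:     # ...and reload it on later runs
    s = pickle.load(f)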

On a separate note, I've been looking for a list of all the single-syllable words in English for my own amusement, but the ones I've found mentioned seem to be proprietary. If it isn't too intrusive, could I ask whether your list of English words can be obtained by others?

behindthefall
You don't even need .add(). set() takes an iterable as an argument, so assuming the words are stored one per line, "f=open("words.txt") ; s = set(f)" will work, and use no unnecessary list. Pickling is not a good idea though - it will probably take at least as long to restore from a pickle as to reconstruct the set. If initialisation time is important, using an on-disk format like the dbm libraries would be better.
Brian
Thanks. I'll remember that.
behindthefall
A: 

Converting the list to a set will only be helpful if you repeatedly run this kind of query against the data, as will sorting the list and doing a binary search. If you're only going to pull data out of the list once, a plain old linear search is your best bet:

if 'foo' in some_list:
    do_something()

Otherwise, your best bet is to use either a set, as has been mentioned, or a binary search. Which one you should choose depends largely on how big the data is and how much memory you can spare. I'm told that really large lists tend to benefit more from hashing, although the amount of memory taken up can be prohibitive.
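
For the binary-search route, here's a sketch using the standard bisect module on a sorted list ('words.txt' is a placeholder for the word file):

import bisect

words = sorted(line.strip() for line in open('words.txt'))

def is_english(word):
    # bisect_left finds where `word` would be inserted; check it's actually there
    i = bisect.bisect_left(words, word)
    return i < len(words) and words[i] == word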

Finally, a third option is to import the data into a sqlite database and read directly from it. Sqlite is very fast and may save you the trouble of loading the whole list from the file. Python has a very good built-in sqlite library.
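
A rough sketch of the sqlite approach using the built-in sqlite3 module (file names are placeholders):

import sqlite3

conn = sqlite3.connect('words.sqlite')
conn.execute('CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)')
conn.executemany('INSERT OR IGNORE INTO words VALUES (?)',
                 ((line.strip(),) for line in open('wordlist.txt')))
conn.commit()

def is_english(word):
    cur = conn.execute('SELECT 1 FROM words WHERE word = ?', (word,))
    return cur.fetchone() is not None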

Jason Baker
+1  A: 

Others have given you the in-memory way using set(), and this is generally going to be the fastest approach; it should not tax your memory for a 60k-word dataset (a few MiB at most). You should be able to construct your set with:

f=open('words.txt')
s = set(word.strip() for word in f)

However, it does require some time to load the set into memory. If you are checking lots of words, this is no problem - the lookup time will more than make up for it. However, if you're only going to be checking one word per command execution (e.g. this is a command-line app like "checkenglish [word]"), the startup time will be longer than it would have taken you just to search through the file line by line.

If this is your situation, or you have a much bigger dataset, using an on-disk format may be better. The simplest way would be using the dbm module. Create such a database from a wordlist with:

import dbm

f = open('wordlist.txt')
db = dbm.open('words.db', 'c')   # 'c' creates the database if it doesn't exist
for word in f:
    db[word.strip()] = '1'       # strip the newline so lookups match exactly
f.close()
db.close()

Then your program can check membership with:

db = dbm.open('words.db','r')
if db.has_key(word):
    print "%s is english" % word
else:
    print "%s is not english" % word

This will be slower than a set lookup, since there will be disk access, but it will be faster than searching the whole file, have low memory use, and no significant initialisation time.

There are also other alternatives, such as using a SQL database (e.g. sqlite).

Brian
Bear in mind that constructing the set directly from file, while elegant, will include the line ending characters, which may not be what you want.
Brent.Longborough
Oops, you're right. Updated to strip line endings / extra whitespace.
Brian