Hey guys, -- I just parsed a big file and I created a list containing 42.000 strings/words. I want to query [against this list] to check if a given word/string belongs to it. So my question is: What is the most efficient way for such a lookup? A first approach is to sort the list [list.sort()] and then just use the >> if word in list: print 'word' -- which is really trivial and I am sure there is a better way to do it. My goal is to apply a fast lookup that finds whether a given string is in this list or not. If you have any ideas of another data structure, they are welcome. Yet, I want to avoid for now more sophisticated data-structures like Tries etc. I am interested in hearing ideas (or tricks) about fast lookups or any other python library methods that might do the search faster than the simple 'in'. Thanks in advance!
Don't create a list
, create a set
. It does lookups in constant time.
If you don't want the memory overhead of a set then keep a sorted list and search through it with the bisect
module.
from bisect import bisect_left
def bi_contains(lst, item):
""" efficient `item in lst` for sorted lists """
# if item is larger than the last its not in the list, but the bisect would
# find `len(lst)` as the index to insert, so check that first. Else, if the
# item is in the list then it has to be at index bisect_left(lst, item)
return (item <= lst[-1]) and (lst[bisect_left(lst, item)] == item)
A point about sets versus lists that hasn't been considered: in "parsing a big file" one would expect to need to handle duplicate words/strings. You haven't mentioned this at all.
Obviously adding new words to a set removes duplicates on the fly, at no additional cost of CPU time or your thinking time. If you try that with a list it ends up O(N**2). If you append everything to a list and remove duplicates at the end, the smartest way of doing that is ... drum roll ... use a set, and the (small) memory advantage of a list is likely to be overwhelmed by the duplicates.
If you anticipate complex lookups later on - and by complex I mean not trivial - I recommend you store it in sqlite3
.