views: 270

answers: 3
Hey guys -- I just parsed a big file and built a list containing 42,000 strings/words. I want to query this list to check whether a given word/string belongs to it. So my question is: what is the most efficient way to do such a lookup? A first approach is to sort the list with `list.sort()` and then just use `if word in list: print 'word'`, which is really trivial, and I am sure there is a better way to do it. My goal is a fast lookup that tells me whether a given string is in the list or not. If you have ideas for another data structure, they are welcome, but for now I want to avoid more sophisticated data structures like tries. I am interested in hearing ideas (or tricks) for fast lookups, or about any other Python library methods that might do the search faster than a simple `in`. Thanks in advance!

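For illustration, a minimal sketch of the setup described above (the file name `words.txt` and the query word are placeholders, not from the original post):

# parse the file into a list, sort it, then test membership
with open('words.txt') as f:            # hypothetical input file
    words = [line.strip() for line in f]
words.sort()
if 'example' in words:                  # `in` on a list scans linearly, sorted or not
    print('example')
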
+13  A: 

Don't create a list, create a set. It does lookups in constant time.

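For example, a minimal sketch of building and querying a set while parsing (the file name is a placeholder):

# build the set while parsing; membership tests are then O(1) on average
with open('words.txt') as f:            # hypothetical input file
    words = set(line.strip() for line in f)

if 'example' in words:                  # hash lookup, no scan through 42,000 items
    print('found')
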
If you don't want the memory overhead of a set, keep a sorted list and search it with the bisect module.

from bisect import bisect_left

def bi_contains(lst, item):
    """Efficient `item in lst` for sorted lists."""
    # An empty list contains nothing (and lst[-1] would raise IndexError).
    # If item is larger than the last element it is not in the list, but bisect
    # would return len(lst) as the insertion index, so check that first.
    # Otherwise, if the item is in the list it must be at bisect_left(lst, item).
    return bool(lst) and (item <= lst[-1]) and (lst[bisect_left(lst, item)] == item)
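
Usage would look roughly like this (the word list is illustrative):

words = sorted(['apple', 'banana', 'cherry'])
print(bi_contains(words, 'banana'))     # True
print(bi_contains(words, 'durian'))     # False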
THC4k
Thanks a lot THC4k for your detailed response. Actually, I was thinking of implementing a binary search myself, but as I see that is essentially what the bisect module does anyway, so you saved me some time :). Again, thanks for your help.
@user229269, you latched on to the wrong part of the post! You probably want a `set`, not a `list` at all.
Mike Graham
@Mike Graham I know what you are saying, but I am afraid I might run into memory problems if I use sets, considering that my list is actually a fast-growing word list that is going to end up with 100,000 strings or more.
@user229269, 100,000 items isn't that many. Using a `set` instead of a `list` for that many items should only increase memory usage by <2MB, which isn't much on modern hardware. If your data did grow so large that using a `set` caused memory problems, you'd probably want to look into a very different technique, such as storing the data in a database.
Mike Graham
Yeah, actually you (@Mike Graham) are right :) -- I have switched to sets already. Thanks a lot for making me reconsider it.
A: 

A point about sets versus lists that hasn't been considered: in "parsing a big file" one would expect to need to handle duplicate words/strings. You haven't mentioned this at all.

Obviously, adding new words to a set removes duplicates on the fly, at no additional cost in CPU time or your thinking time. If you try to do that with a list, checking `word not in lst` before every append, it ends up O(N**2). If you append everything to a list and remove duplicates at the end, the smartest way of doing that is ... drum roll ... to use a set, and the (small) memory advantage of a list is likely to be overwhelmed by the duplicates.

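A sketch of that difference (the file name is a placeholder):

# collecting into a set: duplicates vanish as they are added
unique_words = set()
with open('words.txt') as f:            # hypothetical input file
    for line in f:
        unique_words.add(line.strip())  # adding an existing word is a no-op

# the list version either checks `word not in lst` on every append (O(N**2) overall)
# or deduplicates afterwards -- and the easiest way to do that is set(lst) anyway
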
John Machin
A: 

If you anticipate complex lookups later on - and by complex I mean not trivial - I recommend storing the words in sqlite3.

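A minimal sketch of what that could look like (the database file, table name, and sample words are illustrative):

import sqlite3

conn = sqlite3.connect('words.db')      # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)')
conn.executemany('INSERT OR IGNORE INTO words VALUES (?)',
                 [('apple',), ('banana',), ('cherry',)])
conn.commit()

# the PRIMARY KEY is backed by an index, so the lookup does not scan the table
cur = conn.execute('SELECT 1 FROM words WHERE word = ?', ('banana',))
print(cur.fetchone() is not None)       # True
conn.close()
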
jeffjose