views:

182

answers:

3

Given a large list of alphabetically sorted words in a file, I need to write a program that, given a word x, determines whether x is in the list. Preprocessing is OK since I will be calling this function many times over different inputs.
Priorities: 1. speed, 2. memory.

I already know I can use (n is the number of words, m is the average length of a word):

1. a trie: time is O(log(n)), space is O(log(n*m)) best case, O(n*m) worst case;
2. loading the complete list into memory and binary searching: time is O(log(n)), space is O(n*m).

I'm not sure about the complexities for the trie, so please correct me if they are wrong. Also, are there other good approaches?

A: 

> Preprocessing is ok since I will be calling this function many times over different inputs.

As food for thought, have you considered creating a set from the input data and then searching it via hashing? Building the set takes extra time up front, but if you will be querying it repeatedly, a set is a good idea: with a good hash function, the "contains" operation is O(1) expected.

Andrei Taptunov
A: 

It is O(m) time for the trie, and up to O(m*log(n)) for the binary search. The space is asymptotically O(n*m) for any reasonable method, though you can probably reduce it in some cases using compression. The trie structure is, in theory, somewhat better on memory, but in practice it has devils hiding in the implementation details: the memory needed to store pointers, and potentially poor cache locality.

There are other options for implementing a set structure: a hash set and a tree set are easy choices in most languages. I'd go for the hash set, as it is efficient and simple.

KT
Yes, a hash set is probably the best. I'm worried about memory consumption, but I guess any method's memory would be O(n*m)? I was going to use a trie; now you've scared me off. Is it not asymptotically better in space?
It depends on many details. E.g. you can note that the space requirements for a trie can also be bounded by something like a^m (where a is the size of the alphabet), which on certain toy examples can be less than n*m. However, for a sufficiently large dataset of sufficiently long strings, storing the trie is not much more efficient than just storing the strings verbatim. So if you have a really huge number of long strings, you would probably be better off storing them in some smart verbatim-but-compressed form and using a hash set as an index over this structure.
KT
A: 

I'd recommend a hash map. You can find extensions to C++ for this in both VC and GCC.

DeadMG