I have a list of about a hundred unique strings in C++. I need to check whether a value exists in this list, preferably lightning fast.

I am currently using a hash_set with std::string keys (since I could not get it to work with const char*), like so:

stdext::hash_set<const std::string> _items;
_items.insert("LONG_NAME_A_WITH_SOMETHING");
_items.insert("LONG_NAME_A_WITH_SOMETHING_ELSE");
_items.insert("SHORTER_NAME");
_items.insert("SHORTER_NAME_SPECIAL");

stdext::hash_set<const std::string>::const_iterator it = _items.find( "SHORTER_NAME" );

if( it != _items.end() ) {
   std::cout << "item exists" << std::endl;
}

Does anybody else have a good idea for a faster search method without building a complete hashtable myself?


The list of strings is fixed and will not change. It contains the names of elements that are affected by a certain bug and should be repaired on the fly when opened with a newer version.

I've built hashtables before using Aho-Corasick but I'm not really willing to add too much complexity.


I was amazed by the number of answers. I tested a few methods for performance and ended up using a combination of kirkus's and Rob K.'s answers. I had tried a binary search before, but I guess I had a small bug implementing it (how hard can it be...).

The results were shocking... I thought I had a fast implementation using a hash_set... well, turns out I did not. Here are some statistics (and the eventual code):

Random lookup of 5 existing keys and 1 non-existent key, 50,000 times:

My original algorithm took on average 18.62 seconds.
A linear search took on average 2.49 seconds.
A binary search took on average 0.92 seconds.
A search using a perfect hash table generated by gperf took on average 0.51 seconds.

Here's the code I use now:

#include <string>

bool searchWithBinaryLookup(const std::string& strKey) {
   // The array must be sorted for the binary search to work;
   // MAX_ITEM_LEN is the length of the longest item plus one.
   static const char arrItems[NUM_ITEMS][MAX_ITEM_LEN] = { /* sorted list of items */ };

   /* Binary lookup over the half-open range [low, high) */
   int low = 0;
   int high = NUM_ITEMS;

   while( low < high ) {
      int mid = low + (high - low) / 2;
      if( strKey < arrItems[mid] ) {
         high = mid;        // key, if present, lies in [low, mid)
      }
      else if( arrItems[mid] < strKey ) {
         low = mid + 1;     // key, if present, lies in [mid+1, high)
      }
      else {
         return true;       // exact match
      }
   }

   return false;
}

*NOTE: This is Microsoft VC++, so I'm not using the SGI std::hash_set.*


I did some tests this morning using gperf, as VardhanDotNet suggested, and it is indeed quite a bit faster.

+2  A: 

I doubt you'd come up with a better hashtable; if the list varies from time to time you've probably got the best way.

The fastest way would be to construct a finite state machine to scan the input. I'm not sure what the best modern tools are (it's been over ten years since I did anything like this in practice), but Lex/Flex was the standard Unix constructor.

A FSM has a table of states, and a list of accepting states. It starts in the beginning state, and does a character-by-character scan of the input. Each state has an entry for each possible input character. The entry could either be to go into another state, or to abort because the string isn't in the list. If the FSM gets to the end of the input string without aborting, it checks the final state it's in, which is either an accepting state (in which case you've matched the string) or it isn't (in which case you haven't).

Any book on compilers should have more detail, or you can doubtless find more information on the web.
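
To make the idea concrete, here is a minimal hand-rolled sketch in C++ rather than a generator like Lex/Flex: it builds a trie-shaped state table from the fixed word list once, then scans a key character by character. All names here are illustrative, not from any real tool:

#include <array>
#include <string>
#include <vector>

class StringMatcher {
public:
   explicit StringMatcher(const std::vector<std::string>& words) {
      newState(); // state 0 is the start state
      for (std::size_t w = 0; w < words.size(); ++w) {
         int s = 0;
         for (std::size_t i = 0; i < words[w].size(); ++i) {
            unsigned char c = words[w][i];
            if (transitions_[s][c] == -1) {
               int next = newState();      // may reallocate transitions_,
               transitions_[s][c] = next;  // so index again after the call
            }
            s = transitions_[s][c];
         }
         accepting_[s] = true; // end of a listed word
      }
   }

   bool contains(const std::string& key) const {
      int s = 0;
      for (std::size_t i = 0; i < key.size(); ++i) {
         s = transitions_[s][(unsigned char)key[i]];
         if (s == -1) return false; // abort: no listed string has this prefix
      }
      return accepting_[s]; // matched only if we ended in an accepting state
   }

private:
   int newState() {
      std::array<int, 256> row;
      row.fill(-1); // -1 means "no transition"
      transitions_.push_back(row);
      accepting_.push_back(false);
      return (int)transitions_.size() - 1;
   }

   std::vector<std::array<int, 256> > transitions_;
   std::vector<bool> accepting_;
};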

David Thornley
I figured a state machine would do a better job here but I'm not really willing to add that much more complexity for that extra bit of performance.
Huppie
This is actually how the search procedure of a Patricia Trie works. But it is a lot more straightforward and dead-simple to implement.
A: 

I don't know which kind of hashing function MS uses for strings, but maybe you could come up with something simpler (= faster) that works in your special case. The container should allow you to use a custom hashing class.

If it's an implementation issue of the container, you can also try whether Boost's std::tr1::unordered_set gives better results.
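
If you want to experiment with that, here is a purely hypothetical custom hash; the sampled characters and constants are assumptions to benchmark, not a recommendation:

#include <string>
#include <unordered_set> // std::tr1::unordered_set on pre-C++11 compilers

// Hypothetical cheap hash: mixes the length with two sampled characters.
// Whether this beats the default full-string hash must be verified by profiling.
struct CheapHash {
   std::size_t operator()(const std::string& s) const {
      std::size_t h = s.size();
      if (!s.empty()) {
         h = h * 131 + (unsigned char)s[s.size() / 2]; // middle character
         h = h * 131 + (unsigned char)s[s.size() - 1]; // last character
      }
      return h;
   }
};

typedef std::unordered_set<std::string, CheapHash> ItemSet;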

sth
+4  A: 

You could try a PATRICIA Trie if none of the standard containers meet your needs.

Worst-case lookup is bounded by the length of the string you're looking up. Also, strings share common prefixes, so it is really easy on memory. If you have lots of relatively short strings, this could be beneficial.

Note: PATRICIA = Practical Algorithm to Retrieve Information Coded in Alphanumeric

+3  A: 

If it's a fixed list, sort the list and do a binary search? I can't imagine, with only a hundred or so strings on a modern CPU, you're really going to see any appreciable difference between algorithms, unless your application is doing nothing but searching said list 100% of the time.

kirkus
+1  A: 

If the set of strings to check numbers in the hundreds, as you say, and the check happens during I/O (loading a file, which I assume commonly comes from disk), then I'd say: profile what you've got before looking for more exotic/complex solutions.

Of course, it could be that your "documents" contain hundreds of millions of these strings, in which case I guess it really starts to take time... Without more detail, it's hard to say for sure.

What I'm saying boils down to "consider the use-case and typical scenarios, before (over)optimizing", which I guess is just a specialization of that old thing about roots of evil ... :)

unwind
A: 

A hash table is a good solution, and by using a pre-existing implementation you are likely to get good performance. An alternative, though, is something I believe is called "indexing".

Keep some pointers around to convenient locations. E.g., if it's using letters for the sorting, keep a pointer to everything starting aa, ab, ac... ba, bb, bc... That's a few hundred pointers, but it means you can skip to a part of the list quite near the result before continuing. E.g., if an entry is "afunctionname", you can binary search between the pointers for af and ag, which is much faster than searching the whole lot... If you have a million records in total, you will likely only have to binary search a list of a few thousand (see the sketch below).
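
A minimal sketch of that idea, assuming lowercase ASCII strings as in the aa/ab example above; all names are illustrative:

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

class PrefixIndexedList {
public:
   explicit PrefixIndexedList(std::vector<std::string> words)
      : words_(words) {
      std::sort(words_.begin(), words_.end());
      // starts_[i] = first position in the sorted list that is >= the
      // i-th two-letter prefix ("aa" = 0, "ab" = 1, ..., "zz" = 675)
      std::size_t pos = 0;
      for (int i = 0; i < 26 * 26; ++i) {
         std::string prefix;
         prefix += (char)('a' + i / 26);
         prefix += (char)('a' + i % 26);
         pos = std::lower_bound(words_.begin() + pos, words_.end(), prefix)
               - words_.begin();
         starts_[i] = pos;
      }
      starts_[26 * 26] = words_.size();
   }

   bool contains(const std::string& key) const {
      if (key.size() < 2 || key[0] < 'a' || key[0] > 'z'
                         || key[1] < 'a' || key[1] > 'z')
         return std::binary_search(words_.begin(), words_.end(), key);
      int i = (key[0] - 'a') * 26 + (key[1] - 'a');
      // Search only the slice sharing the key's first two letters.
      return std::binary_search(words_.begin() + starts_[i],
                                words_.begin() + starts_[i + 1], key);
   }

private:
   std::vector<std::string> words_;
   std::size_t starts_[26 * 26 + 1];
};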

I re-invented this particular wheel, but there may be plenty of implementations out there already, which would save you the headache of implementing it yourself and are likely faster than any code I could paste in here. :)

jheriko
+9  A: 

If your list of strings is fixed at compile time, use gperf (http://www.gnu.org/software/gperf/). QUOTE: "gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only."
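
For illustration, a gperf input file for the strings from the question could look like the following. This is a sketch: the exact command-line flags and the generated signature of in_word_set vary between gperf versions.

%%
LONG_NAME_A_WITH_SOMETHING
LONG_NAME_A_WITH_SOMETHING_ELSE
SHORTER_NAME
SHORTER_NAME_SPECIAL
%%

Running something like gperf -L C++ items.gperf > items.hpp then emits a lookup class whose in_word_set(str, len) function returns a non-null pointer exactly when the key is in the list.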

The output of gperf is not governed by the GPL or LGPL, AFAIK.

Vardhan Varma
Hmm... I guess my current implementation is fast enough but nevertheless I will give gperf a try just for the experience and comparison material.
Huppie
+1  A: 

100 unique strings? If this isn't called frequently, and the list doesn't change dynamically, I'd probably use a straightforward const char array with a linear search. Unless you search it a lot, something that small just isn't worth the extra code. Something like this:

#include <cstring> /* strcmp */

const char _items[][MAX_ITEM_LEN] = { /* sorted list of items */ };
int i = 0;
/* Early-exit scan: the list is sorted, so stop once we pass the point where
   the key would be; check the bound before calling strcmp. */
for( ; i < NUM_ITEMS && strcmp( a, _items[i] ) > 0; ++i );
bool found = i < NUM_ITEMS && strcmp( a, _items[i] ) == 0;

For a list that small, I think your implementation and maintenance costs with anything more complex would probably outweigh the run-time costs, and you're not really going to get cheaper space costs than this. To gain a little more speed, you could do a hash table of first char -> list index to set the initial value of i.

For a list this small, you probably won't get much faster.

Rob K
I prefer a simple solution. That's why my current solution is like that. The code is called pretty often, so I want to get as much performance as possible from the fewest lines of code possible.
Huppie
Of course, I would wrap it in a nice class to hide all that, too.
Rob K
+3  A: 

What's wrong with std::vector? Load it, sort(v.begin(), v.end()) once, and then use lower_bound() to see if the string is in the vector. lower_bound is guaranteed to be O(log2 N) on a random-access iterator over a sorted range. I can't understand the need for a hash if the values are fixed. A vector takes less room in memory than a hash table and makes fewer allocations.
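
A minimal sketch of that approach, assuming the list has already been loaded and sorted once:

#include <algorithm>
#include <string>
#include <vector>

// Requires: std::sort(items.begin(), items.end()) was called after loading.
bool contains(const std::vector<std::string>& items, const std::string& key) {
   std::vector<std::string>::const_iterator it =
      std::lower_bound(items.begin(), items.end(), key);
   return it != items.end() && *it == key; // O(log2 N) comparisons
}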

jmucchiello
A: 

You're using binary search, which is O(log n). You should look at interpolation search, which is not as good in the worst case, but its average case is better: O(log log n).
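
Interpolation search needs a numeric key to estimate a position, so here is a minimal sketch over sorted integers; strings would first have to be mapped to numbers, e.g. by packing their leading bytes:

#include <vector>

// Sketch: interpolation search over a sorted vector. Assumes roughly
// uniformly distributed keys; degrades toward O(n) on skewed data.
bool interpolationSearch(const std::vector<long long>& v, long long key) {
   if (v.empty()) return false;
   std::size_t low = 0, high = v.size() - 1;
   while (key >= v[low] && key <= v[high]) {
      if (v[high] == v[low])            // avoid division by zero
         return v[low] == key;
      // Estimate where the key should sit between v[low] and v[high]
      std::size_t mid = low + (std::size_t)
         ((double)(key - v[low]) * (high - low) / (double)(v[high] - v[low]));
      if (v[mid] == key) return true;
      if (v[mid] < key) low = mid + 1;
      else {
         if (mid == 0) break;           // defensive: avoid unsigned underflow
         high = mid - 1;
      }
   }
   return false;
}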

Chris Harris