I have a list of about a hundred unique strings in C++. I need to check whether a value exists in this list, preferably lightning fast.

I am currently using a hash_set with std::string keys (since I could not get it to work with const char*), like so:

stdext::hash_set<const std::string> _items;
_items.insert("LONG_NAME_A_WITH_SOMETHING");
_items.insert("LONG_NAME_A_WITH_SOMETHING_ELSE");
_items.insert("SHORTER_NAME");
_items.insert("SHORTER_NAME_SPECIAL");

stdext::hash_set<const std::string>::const_iterator it = _items.find( "SHORTER_NAME" );

if( it != _items.end() ) {
   std::cout << "item exists" << std::endl;
}

Does anybody else have a good idea for a faster search method without building a complete hashtable myself?


The list of strings is fixed and will not change. It contains the names of elements that are affected by a certain bug and should be repaired on the fly when opened with a newer version.

I've built hashtables before using Aho-Corasick but I'm not really willing to add too much complexity.


I was amazed by the number of answers. I tested a few methods for performance and ended up using a combination of kirkus's and Rob K.'s answers. I had tried a binary search before, but I guess I had a small bug implementing it (how hard can it be...).

The results were shocking... I thought I had a fast implementation using a hash_set... well, turns out I did not. Here are some statistics (and the eventual code):

Random lookup of 5 existing keys and 1 non-existent key, 50,000 times:

My original algorithm took on average 18.62 seconds.
A linear search took on average 2.49 seconds.
A binary search took on average 0.92 seconds.
A search using a perfect hash table generated by gperf took on average 0.51 seconds.

Here's the code I use now:

#include <string>

bool searchWithBinaryLookup(const std::string& strKey) {
   // The array must be sorted for the binary search to work;
   // MAX_ITEM_LEN is the length of the longest item plus one.
   static const char arrItems[NUM_ITEMS][MAX_ITEM_LEN] = { /* sorted list of items */ };

   /* Binary lookup over the half-open range [low, high) */
   int low = 0;
   int high = NUM_ITEMS;

   while( low < high ) {
      int mid = low + (high - low) / 2;
      if( strKey < arrItems[mid] ) {
         high = mid;        // key, if present, lies in [low, mid)
      }
      else if( arrItems[mid] < strKey ) {
         low = mid + 1;     // key, if present, lies in [mid+1, high)
      }
      else {
         return true;       // exact match
      }
   }

   return false;
}

*NOTE: This is Microsoft VC++, so I'm not using the SGI std::hash_set.*


I did some tests this morning using gperf, as VardhanDotNet suggested, and it is indeed quite a bit faster.

+2  A: 

I doubt you'd come up with a better hashtable; if the list varies from time to time you've probably got the best way.

The fastest way would be to construct a finite state machine to scan the input. I'm not sure what the best modern tools are (it's been over ten years since I did anything like this in practice), but Lex/Flex was the standard Unix constructor.

A FSM has a table of states, and a list of accepting states. It starts in the beginning state, and does a character-by-character scan of the input. Each state has an entry for each possible input character. The entry could either be to go into another state, or to abort because the string isn't in the list. If the FSM gets to the end of the input string without aborting, it checks the final state it's in, which is either an accepting state (in which case you've matched the string) or it isn't (in which case you haven't).

Any book on compilers should have more detail, or you can doubtless find more information on the web.
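
To make the idea concrete, here is a minimal hand-rolled sketch in C++ rather than a generator like Lex/Flex: it builds a trie-shaped state table from the fixed word list once, then scans a key character by character. All names here are illustrative, not from any real tool:

#include <array>
#include <string>
#include <vector>

class StringMatcher {
public:
   explicit StringMatcher(const std::vector<std::string>& words) {
      newState(); // state 0 is the start state
      for (std::size_t w = 0; w < words.size(); ++w) {
         int s = 0;
         for (std::size_t i = 0; i < words[w].size(); ++i) {
            unsigned char c = words[w][i];
            if (transitions_[s][c] == -1) {
               int next = newState();      // may reallocate transitions_,
               transitions_[s][c] = next;  // so index again after the call
            }
            s = transitions_[s][c];
         }
         accepting_[s] = true; // end of a listed word
      }
   }

   bool contains(const std::string& key) const {
      int s = 0;
      for (std::size_t i = 0; i < key.size(); ++i) {
         s = transitions_[s][(unsigned char)key[i]];
         if (s == -1) return false; // abort: no listed string has this prefix
      }
      return accepting_[s]; // matched only if we ended in an accepting state
   }

private:
   int newState() {
      std::array<int, 256> row;
      row.fill(-1); // -1 means "no transition"
      transitions_.push_back(row);
      accepting_.push_back(false);
      return (int)transitions_.size() - 1;
   }

   std::vector<std::array<int, 256> > transitions_;
   std::vector<bool> accepting_;
};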

David Thornley
I figured a state machine would do a better job here but I'm not really willing to add that much more complexity for that extra bit of performance.
Huppie
This is actually how the search procedure of a Patricia Trie works. But it is a lot more straightforward and dead-simple to implement.
A: 

I don't know which kind of hashing function MS uses for strings, but maybe you could come up with something simpler (= faster) that works in your special case. The container should allow you to use a custom hashing class.

If it's an implementation issue of the container, you can also try whether Boost's std::tr1::unordered_set gives better results.
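
If you want to experiment with that, here is a purely hypothetical custom hash; the sampled characters and constants are assumptions to benchmark, not a recommendation:

#include <string>
#include <unordered_set> // std::tr1::unordered_set on pre-C++11 compilers

// Hypothetical cheap hash: mixes the length with two sampled characters.
// Whether this beats the default full-string hash must be verified by profiling.
struct CheapHash {
   std::size_t operator()(const std::string& s) const {
      std::size_t h = s.size();
      if (!s.empty()) {
         h = h * 131 + (unsigned char)s[s.size() / 2]; // middle character
         h = h * 131 + (unsigned char)s[s.size() - 1]; // last character
      }
      return h;
   }
};

typedef std::unordered_set<std::string, CheapHash> ItemSet;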

sth
+4  A: 

You could try a PATRICIA Trie if none of the standard containers meet your needs.

Worst-case lookup is bounded by the length of the string you're looking up. Also, strings share common prefixes, so it is really easy on memory. If you have lots of relatively short strings, this could be beneficial.

Note: PATRICIA = Practical Algorithm to Retrieve Information Coded in Alphanumeric

+3  A: 

If it's a fixed list, sort the list and do a binary search? I can't imagine, with only a hundred or so strings on a modern CPU, you're really going to see any appreciable difference between algorithms, unless your application is doing nothing but searching said list 100% of the time.

kirkus
+1  A: 

If the set of strings to check numbers in the hundreds, as you say, and the check happens during I/O (loading a file, which I assume commonly comes from disk), then I'd say: profile what you've got before looking for more exotic/complex solutions.

Of course, it could be that your "documents" contain hundreds of millions of these strings, in which case I guess it really starts to take time... Without more detail, it's hard to say for sure.

What I'm saying boils down to "consider the use-case and typical scenarios, before (over)optimizing", which I guess is just a specialization of that old thing about roots of evil ... :)

unwind
A: 

A hash table is a good solution, and by using a pre-existing implementation you are likely to get good performance. An alternative, though, is something I believe is called "indexing".

Keep some pointers around to convenient locations. E.g., if it's using letters for the sorting, keep a pointer to everything starting aa, ab, ac... ba, bb, bc... That's a few hundred pointers, but it means you can skip to a part of the list quite near the result before continuing. E.g., if an entry is "afunctionname", you can binary search between the pointers for af and ag, which is much faster than searching the whole lot... If you have a million records in total, you will likely only have to binary search a list of a few thousand (see the sketch below).
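
A minimal sketch of that idea, assuming lowercase ASCII strings as in the aa/ab example above; all names are illustrative:

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

class PrefixIndexedList {
public:
   explicit PrefixIndexedList(std::vector<std::string> words)
      : words_(words) {
      std::sort(words_.begin(), words_.end());
      // starts_[i] = first position in the sorted list that is >= the
      // i-th two-letter prefix ("aa" = 0, "ab" = 1, ..., "zz" = 675)
      std::size_t pos = 0;
      for (int i = 0; i < 26 * 26; ++i) {
         std::string prefix;
         prefix += (char)('a' + i / 26);
         prefix += (char)('a' + i % 26);
         pos = std::lower_bound(words_.begin() + pos, words_.end(), prefix)
               - words_.begin();
         starts_[i] = pos;
      }
      starts_[26 * 26] = words_.size();
   }

   bool contains(const std::string& key) const {
      if (key.size() < 2 || key[0] < 'a' || key[0] > 'z'
                         || key[1] < 'a' || key[1] > 'z')
         return std::binary_search(words_.begin(), words_.end(), key);
      int i = (key[0] - 'a') * 26 + (key[1] - 'a');
      // Search only the slice sharing the key's first two letters.
      return std::binary_search(words_.begin() + starts_[i],
                                words_.begin() + starts_[i + 1], key);
   }

private:
   std::vector<std::string> words_;
   std::size_t starts_[26 * 26 + 1];
};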

I re-invented this particular wheel, but there may be plenty of implementations out there already, which would save you the headache of implementing it yourself and are likely faster than any code I could paste in here. :)

jheriko
+9  A: 

If your list of strings is fixed at compile time, use gperf (http://www.gnu.org/software/gperf/). QUOTE: "gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only."
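
For illustration, a gperf input file for the strings from the question could look like the following. This is a sketch: the exact command-line flags and the generated signature of in_word_set vary between gperf versions.

%%
LONG_NAME_A_WITH_SOMETHING
LONG_NAME_A_WITH_SOMETHING_ELSE
SHORTER_NAME
SHORTER_NAME_SPECIAL
%%

Running something like gperf -L C++ items.gperf > items.hpp then emits a lookup class whose in_word_set(str, len) function returns a non-null pointer exactly when the key is in the list.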

The output of gperf is not governed by the GPL or LGPL, AFAIK.

Vardhan Varma
Hmm... I guess my current implementation is fast enough but nevertheless I will give gperf a try just for the experience and comparison material.
Huppie
+1  A: 

100 unique strings? If this isn't called frequently, and the list doesn't change dynamically, I'd probably use a straightforward const char array with a linear search. Unless you search it a lot, something that small just isn't worth the extra code. Something like this:

#include <cstring> /* strcmp */

const char _items[][MAX_ITEM_LEN] = { /* sorted list of items */ };
int i = 0;
/* Early-exit scan: the list is sorted, so stop once we pass the point where
   the key would be; check the bound before calling strcmp. */
for( ; i < NUM_ITEMS && strcmp( a, _items[i] ) > 0; ++i );
bool found = i < NUM_ITEMS && strcmp( a, _items[i] ) == 0;

For a list that small, I think your implementation and maintenance costs with anything more complex would probably outweigh the run-time costs, and you're not really going to get cheaper space costs than this. To gain a little more speed, you could do a hash table of first char -> list index to set the initial value of i.

For a list this small, you probably won't get much faster.

Rob K
I prefer a simple solution. That's why my current solution is like that. The code is called pretty often, so I want to get as much performance as possible from the fewest lines of code possible.
Huppie
Of course, I would wrap it in a nice class to hide all that, too.
Rob K
+3  A: 

What's wrong with std::vector? Load it, sort(v.begin(), v.end()) once, and then use lower_bound() to see if the string is in the vector. lower_bound is guaranteed to be O(log2 N) on a random-access iterator over a sorted range. I can't understand the need for a hash if the values are fixed. A vector takes less room in memory than a hash table and makes fewer allocations.
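
A minimal sketch of that approach, assuming the list has already been loaded and sorted once:

#include <algorithm>
#include <string>
#include <vector>

// Requires: std::sort(items.begin(), items.end()) was called after loading.
bool contains(const std::vector<std::string>& items, const std::string& key) {
   std::vector<std::string>::const_iterator it =
      std::lower_bound(items.begin(), items.end(), key);
   return it != items.end() && *it == key; // O(log2 N) comparisons
}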

jmucchiello
A: 

You're using binary search, which is O(log n). You should look at interpolation search, which is not as good in the worst case, but its average case is better: O(log log n).
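
Interpolation search needs a numeric key to estimate a position, so here is a minimal sketch over sorted integers; strings would first have to be mapped to numbers, e.g. by packing their leading bytes:

#include <vector>

// Sketch: interpolation search over a sorted vector. Assumes roughly
// uniformly distributed keys; degrades toward O(n) on skewed data.
bool interpolationSearch(const std::vector<long long>& v, long long key) {
   if (v.empty()) return false;
   std::size_t low = 0, high = v.size() - 1;
   while (key >= v[low] && key <= v[high]) {
      if (v[high] == v[low])            // avoid division by zero
         return v[low] == key;
      // Estimate where the key should sit between v[low] and v[high]
      std::size_t mid = low + (std::size_t)
         ((double)(key - v[low]) * (high - low) / (double)(v[high] - v[low]));
      if (v[mid] == key) return true;
      if (v[mid] < key) low = mid + 1;
      else {
         if (mid == 0) break;           // defensive: avoid unsigned underflow
         high = mid - 1;
      }
   }
   return false;
}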

Chris Harris