views:

130

answers:

3

I have a set of strings, set<string>, and a string to check, I would like to do a binary search over it, obviously, a set is sorted.

The purpose is to find the longer string in the set that is contained in the supplied string to check. It's basically to check for urls and registered handlers.

I'm digging in the std algorithms and comparators to find a standard way of doing it but I don't see how.

A: 

You can look for a prefix, not a substring that can appear anywhere. For the latter, neither a set nor a binary search does the whole job. So I'll assume you have the rest of the code, and I won't spend time thinking about it ;-)

There isn't really an efficient way to do an iterator-based binary search on a std::set, because you don't have random access, but the upper_bound and lower_bound member functions of std::set serve the same purpose. In the case where set is implemented as a balanced binary search tree, what they do is almost exactly analogous to a binary search on an array.

They can be used to find the last/first element with a specified prefix (if there are any). You can then walk backward/forward from there to find any others.

Steve Jessop
A: 

From the set of input strings you need to determine a set of N-grams, then create a multimap where the key is the N-gram, and the value is the string containing that N-gram.

E.g. suppose you have the string "Hello World", and you use the value 3 for N, then your N-grams are:

  • Hel
  • ell
  • llo
  • lo
  • o W
  • Wo
  • Wor
  • orl
  • rld

Depending on the size and characteristics of your input strings, determine a suitable value for N.

Now if you want to look for strings containing Hello, split your search string also in N-grams, and look up the N-grams in the earlier described multimap.

Depending on what you find in the multimap, you could:

  • take the intersection of all the results (mainly if all N-grams return a rather large result set)
  • abort the search if an N-gram is not found in the multimap (string is not found)
  • stop the search if there is only result

Finally, check all the found strings to see if they really contain the string. After all, the N-grams are just a trick to find the strings, but there is no guarantee that they contain the string you are looking for.

E.g. the string "Helllllo also contains the 3-grams ofr Hello, but does not contain Hello itself.

Patrick