views:

467

answers:

6

I have a large set (100k) of short strings (not more than 100 chars) and I need to quickly find all those who have a certain substring.

This will be used as a search box where the user starts typing and the system immediately gives "suggestions" (the strings that have as a substring the text that the user typed). Something similar to the "Tag" box in StackOverflow.

As this will be interactive, it should be pretty fast. What algorithm or data structure do you recommend for this?

BTW, I'll be using Delphi 2007.

Thanks in advance.

+13  A: 

The data structure you'll likely want to use is a Trie, specifically a suffix trie. Read this article for a good explanation of how they work and how to write one for your problem.

Mike Axiak
Beat me to it. In case the article doesn't make it obvious, you can build one suffix tree for the whole corpus, and annotate it to say which string(s) that suffix belongs to.
Steve Jessop
A good sugesstion, but probably an overkill for what he wants. +1 for siggesting a data structure that not many people know about.
Runner
@Runner Not many people? "Trie" is new JQuery :) It's hard to find algorithm question not having "user tries" answer today.
Nikita Rybak
@Rybak - It must have gone past me. Interesting. I thought it wasn't well known. I implemented one a year and a half ago and I sure haven't found much information on them then. Must look again :)
Runner
+18  A: 

I wrote out a long blurb, doing a bunch of complexity calculations and xzibit jokes (tree in a tree so you can lookup when you lookup), but then realized that this is easier than I thought. Browsers do this all the time and they never precompute big tables every time you're loading a page.

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

what it means is that you take your 100k strings and combine them into one long string. you then take your query substring, and iterate over your big string, looking for your matches. but you're not jumping by character (which would mean you're looking at 100k*100 iterations). you're jumping by the length of your substring, so the longer your substring gets, the faster this goes.

here's a great example: http://userweb.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-example.html

they're searching for the string EXAMPLE.

this is the kind of stuff browsers and text editors do, and they dont really build giant prefix tables every time you load a page.

Oren Mazor
+1. I always wondered if the pain of implementing and understanding suffix trees was worth it, they just seem to get mentioned as a silver bullet all string problems.
bronzebeard
@bronzebeard completely agree with you, +1 to Oren for offering some realistic solution.
Nikita Rybak
Suffix tries are terribly powerfull, but not so easy to understand and build. They shine when you have a lot of data. For simple suggestions they probably are overkill. +1 for Oren.
Runner
Boyer-Moore is the way to go. I coded one up to look through thousands of short strings in an audio library (bascially all the ID3 tags plus some extra data). Its so fast you can type and show realtime updates.
gbrandt
Have you got any proof for your assertion that no preprocessing is done in browsers? This would rather surprise me. On the other hand, it would explain why the “awesome bar” sucks so badly …. Just to make it clear: while this way is OK, using a trie is much more efficient for the amount of data that the OP is dealing with. Also, implementing a trie is *not* hard.
Konrad Rudolph
@Konrad I didn't say tries are hard or that this is a MUCH better solution than using a trie. just that its a far more exciting one :) as for the browsers, I'm not sure what they're doing, but I really doubt they're building trees, considering how ridiculously fast chrome is at searching in a really long page..
Oren Mazor
@Oren: Right, that was a misunderstanding. You’re of course right, search in sites is bound to be an “online search” (= no preprocessing of the text). I somehow thought you were talking about the auto suggest feature in the browser’s address bar, and here I’m pretty sure that some kind of index is used.
Konrad Rudolph
+6  A: 

While you certainly can speed things up with a better data structure, this is a time that brute force might be perfectly adequate. Doing a quick test:

[Edit: changed code to search for substrings, edited again to shorten the substring it searches for compared to the ones it searches in.]

#include <algorithm>
#include <iostream>
#include <vector>
#include <string>
#include <time.h>

std::string rand_string(int min=20, int max=100) { 
    size_t length = rand()% (max-min) + min;
    std::string ret;

    for (size_t i=0; i<length; i++)
        ret.push_back(rand() % ('z' - 'a') + 'a');
    return ret; 
}

class substr {
    std::string seek;
public:
    substr(std::string x) : seek(x) {}

    bool operator()(std::string const &y) { return y.find(seek) != std::string::npos; }
};

int main() { 
    std::vector<std::string> values;

    for (int i=0; i<100000; i++)
        values.push_back(rand_string());

    std::string seek = rand_string(5, 10);

    const int reps = 10;

    clock_t start = clock();
    std::vector<std::string>::iterator pos;
    for (int i=0; i<reps; i++)
         pos = std::find_if(values.begin(), values.end(), substr(seek));
    clock_t stop = clock();

    std::cout << "Search took: " << double(stop-start)/CLOCKS_PER_SEC/reps << " seconds\n";
    if (pos == values.end())
        std::cout << "Value wasn't found\n";
    else
        std::cout << "Value was found\n";
    return 0;
}

On my machine (around 4 years old -- hardly a speed demon by current standards) this runs in around 3 10 milliseconds per search. That's fast enough to appear instantaneous to an interactive user -- and with 10 times as many strings, it would still be fine.

Jerry Coffin
I'm not proficient at STL, but IIRC _std::find_ searches for _exact_ occurrence of element in container. And Fernando is interested in substrings.
Nikita Rybak
+1. Brute force rules :-)
Svein Bringsli
@Nikita: Quite true. It doesn't make a whole lot of difference, but I've edited the code to test the right thing. Even so, we're still talking about single-digit milliseconds for a search. Although I wrote the test code in C++, I'd expect results with Delphi to be about the same. Speed might differ by a few percent, but we'd have to see a 10x difference for it to even be close to significant, and I'd be *very* surprised to see that.
Jerry Coffin
+1, brute force should be tried first before complicating the matter.
GrandmasterB
And even if the number and size of strings was considerably larger, I can only think that with multiple threads, this could be even faster.
luiscubal
@Jerry Your time estimates here aren't accurate, because "seek" string is on average as long as "values" strings. Try search for something like "abcde" (more realistic example of user input). Although, results are still quite good (for me 20ms on random input). Not 0.1ms, but often good enough too.
Nikita Rybak
BTW, you could update time estimate in your post (3ms is before "substr" and confuses people)
Nikita Rybak
+4  A: 

I hate to disagree with Mike and his supporters, but suffix trees (data structure described in his link) are a lot of pain to implement. And finding reliable implementation in Pascal/Delphi might be difficult too.

Suffix Arrays offer the same functionality while being a lot simpler. The tradeoff is O(m * logn) complexity, where m is length of the search token and n is size of the dataset (100kb in this case).

In case somebody doesn't know, both suffix trees and suffix arrays allow you to find all occurrences of substring s in long text t.

Fernando's problem can be reduced to this one, by concatenating initial set of strings into one string using some delimiter symbol. For example, is initial set is ["text1", "text2", "some text"], then result string t will be "text1|text2|some text".
Now, instead of searching for string "text" in each word separately, we just need to find all occurrences of it in big string t.

I also recommend Oren's answer where he suggests another realistic approach.

Nikita Rybak
A trie and a suffix array differ quite a bit, though. Although suffix arrays *can* be built over string sets, tries do the same much more naturally and efficiently – search is O(m) instead of O(log n + m), and for 100k strings of length 100, this *can* make a difference. And finally, building a trie efficiently is a lot easier than building a suffix array efficiently – although even the brute force quicksort method yields acceptable performance in practice. But apart from that, thumbs up for suffix array!
Konrad Rudolph
+1  A: 

What you're probably looking for is an n-gram. It's used to find the most likely words related to your substring. Very interesting stuff, and though it may be overkill for what you're looking for, it's still good to know.

ReaperUnreal
+2  A: 

This Delphi implementation of Boyer-Moore-Horspool might give you what you need.
Disclaimer: I did not try this code...

François