I know there are already several similar questions here, but I need some recommendations for my case (couldn't find anything similar).

I have to search a very large amount of data for a substring that would be about a billion times smaller (10 bytes in 10 billion bytes). The haystack doesn't change, so I can bear with large precomputation if needed. I just need the searching part to be as fast as possible.

I found algorithms which take O(n + m) time (n = haystack size, m = needle size), while the naive search takes O(n·m) in the worst case. Since m in this particular case would be very small, is there any other algorithm I could look into?

Edit: Thanks everyone for your suggestions! Some more info: the data can be considered random bits, so I don't think any kind of indexing / sorting would be possible. The data to be searched can be anything, not English words or anything predictable.

+4  A: 

You are looking for the data structure called the Trie or "prefix tree". In short, this data structure encodes all the possible string prefixes which can be found in your corpus.

Here is a paper which searches DNA sequences for a small substring, using a prefix tree. I imagine it might help you, since your case sounds similar.

If you know a definite limit on the length of the input search string, you can limit the growth of your Trie so that it does not store any prefixes longer than this maximum length. In this way, you may be able to fit a Trie representing all 10G into less than 10G of memory. Especially for highly repetitive data, any sort of Trie is a compressed data representation. (Or should be, if implemented sanely.) Limiting the Trie depth to the maximum search-string length lets you limit the memory consumption still further.
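
A minimal sketch of the idea in Python (the depth limit, the node layout, and the per-node position lists are illustrative assumptions, not a tuned implementation):

    MAX_M = 10  # assumed upper bound on the needle length

    def build_trie(haystack: bytes) -> dict:
        """Depth-limited trie; each node records where its prefix starts."""
        root = {"children": {}, "positions": []}
        for start in range(len(haystack)):
            node = root
            for b in haystack[start:start + MAX_M]:  # only index up to MAX_M bytes
                node = node["children"].setdefault(b, {"children": {}, "positions": []})
                node["positions"].append(start)
        return root

    def search(root: dict, needle: bytes) -> list:
        node = root
        for b in needle:
            node = node["children"].get(b)
            if node is None:
                return []            # needle does not occur anywhere
        return node["positions"]     # every offset at which the needle starts

    haystack = b"GATTACAGATTTACA"
    print(search(build_trie(haystack), b"ATTA"))  # -> [1]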

Heath Hunnicutt
Typically with search, you want not just a yes/no match response, but also where the match occurred. Because of this, there is usually an additional memory cost to store the locations of substring matches. So even though you might be able to store 10G of data into less than 10G of memory, you will need much more to store location information.
Chris Schmich
Chris, what makes you think prefix trees don't give you that? It all depends on how you implement the 'prefix termination'. If you implement that as a list of hit locations, you're done. That is the implementation used in the paper I cited. I am aware that suffix trees work just as well as prefix trees -- both give the O(m) solution. So, I gave you a +1.
Heath Hunnicutt
You're absolutely right! Sorry, I was just nitpicking at the fact that there are hidden costs with indexing, beyond storing the original text, that don't necessarily show up in the space complexity. In other words, storing the additional list of locations with each prefix isn't free :)
Chris Schmich
Yes, there are costs I did not fully explicate. The suffix tree incurs the same costs, correct? With a prefix tree, you can fix an arbitrary limit *M* on the length of strings you will search for, and in that way get a smaller Trie. By the way, I am not sure one can't fix some *M* for a suffix tree and get the same gain... Can one?
Heath Hunnicutt
+2  A: 

If you can afford the space (a lot of space!) to create an index, it'd definitely be worth your while to index small chunks (e.g. four-byte blocks) and store these with their offsets within the haystack. A search for 10 bytes then involves looking up the first four bytes of the search term in the index and checking the next six bytes at each candidate offset.
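
One possible reading of that scheme in Python (assuming a block is indexed at every offset; the four-byte block size is just the example above):

    from collections import defaultdict

    BLOCK = 4  # example block size

    def build_index(haystack: bytes) -> dict:
        """Map every BLOCK-byte substring to the offsets where it starts."""
        index = defaultdict(list)
        for i in range(len(haystack) - BLOCK + 1):
            index[haystack[i:i + BLOCK]].append(i)
        return index

    def search(haystack: bytes, index: dict, needle: bytes) -> list:
        """Look up the first BLOCK bytes, then verify the rest in place."""
        hits = []
        for i in index.get(needle[:BLOCK], []):
            if haystack[i:i + len(needle)] == needle:
                hits.append(i)
        return hits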

Will A
+3  A: 

It's worth looking at suffix arrays and trees. They both require precomputation and significant memory, but they are better than inverted indexes in the sense that you can search for arbitrary substrings in O(m) (for suffix trees) and O(m + log n) (for suffix arrays with longest-common-prefix info).
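
For a feel of the suffix-array approach, here is a minimal sketch in Python (the naive construction is for illustration only; real implementations build the array in linear time and use LCP info to speed up the binary search):

    def build_suffix_array(text: bytes) -> list:
        # Naive O(n^2 log n) construction, for illustration only.
        return sorted(range(len(text)), key=lambda i: text[i:])

    def find(text: bytes, sa: list, needle: bytes) -> list:
        m = len(needle)
        lo, hi = 0, len(sa)
        # Binary search for the left edge of the suffixes starting with the needle.
        while lo < hi:
            mid = (lo + hi) // 2
            if text[sa[mid]:sa[mid] + m] < needle:
                lo = mid + 1
            else:
                hi = mid
        hits = []
        while lo < len(sa) and text[sa[lo]:sa[lo] + m] == needle:
            hits.append(sa[lo])
            lo += 1
        return sorted(hits)

    text = b"abracadabra"
    sa = build_suffix_array(text)
    print(find(text, sa, b"abra"))  # -> [0, 7]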

If you have a lot of time on your hands, you can look into compressed suffix arrays and succinct CSAs, which are compressed versions of your data that are also self-indexing (i.e. the data is also the index; there is no external index). This is really the best of all worlds, because not only do you have a compressed version of your data (you can throw the original away), but it's also indexed! The problem is understanding the research papers and translating them into code :)

If you do not need perfect substring matching, but rather general searching capabilities, check out Lucene.

Chris Schmich
Out of interest, Chris, how much memory would be required to employ this approach?
Will A
For suffix arrays, it's `O(n)`, but the constants can be significant once you start storing auxiliary info to speed up searches (like LCP info). For suffix trees, the space is also linear, but with much larger constants per input character, which is probably prohibitive on a 10GB dataset. The latest research in information retrieval that I've seen builds indexes on the compressed version of the data, which appears to give significant savings.
Chris Schmich
On 10G of highly repetitive data (I suspect we are talking about DNA), even a non-compressed suffix tree will, in fact, compress the data.
Heath Hunnicutt
+6  A: 

Given Knuth–Morris–Pratt or Boyer–Moore, you're not going to do any better; what you should consider is parallelizing your search process.
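
A rough sketch of one way to parallelize the scan in Python (the chunk size, worker count, and use of `bytes.find` are assumptions; chunks overlap by m - 1 bytes so matches straddling a boundary are not missed):

    from concurrent.futures import ProcessPoolExecutor

    def scan(chunk: bytes, needle: bytes, base: int) -> list:
        """Scan one chunk and return absolute match offsets."""
        hits, i = [], chunk.find(needle)
        while i != -1:
            hits.append(base + i)
            i = chunk.find(needle, i + 1)
        return hits

    def parallel_search(haystack: bytes, needle: bytes, workers: int = 4) -> list:
        n, m = len(haystack), len(needle)
        size = (n + workers - 1) // workers
        with ProcessPoolExecutor(max_workers=workers) as pool:
            # Each chunk extends m - 1 bytes into the next one.
            jobs = [pool.submit(scan, haystack[s:min(n, s + size + m - 1)], needle, s)
                    for s in range(0, n, size)]
            return sorted(h for job in jobs for h in job.result())

    # Guard the call with `if __name__ == "__main__":` when using process pools.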

dman
These are good solutions that do not require precomputation, but you can definitely get better performance with indexes or other precomputation.
Chris Schmich
With proper indexing, you can do a lot better than either of these algorithms - even if you've the 10G of physical memory free to perform them without hitting the disk!
Will A
@Chris, @Will: Correct me if I'm wrong, but what do indexes have to do with arbitrary string searching? It seems the task is: find a short block of data within a large block of unstructured data. Creating an index for every possible required string might result in an index as large as the large block of data. I think this is a case of people jumping ahead of the requirements; none of you work for CA or MS by any chance? :)
dman
@dman -- The issue is he wants to seek *several* different small strings in the one very large corpus. For that, precomputation of a search index may be faster than a naive O(n) search each time. If he is searching for *k* strings, he wants his overall search to come in at less than O(kn). Also, he may not know what the search strings are in advance. For example, he might be making a web page where users can submit small search strings over a DNA genome.
Heath Hunnicutt
@dman -- In fact, thinking about this reply to you, *building* the prefix tree, suffix tree, or index, is probably going to take longer than O(kn) for some problems. But if you want to support a pleasing user experience for, e.g., a web search front end to a genome, you want each individual search to be *fast*. In that case, paying time for the up-front indexing enables the user experience you want -- at a cost of expending more total CPU time than otherwise would be needed.
Heath Hunnicutt
@dman nice :P I am in total agreement with you: if you don't want or can't afford indexes, KMP/BM is the way to go. However, if your corpus is relatively fixed, indexing can provide almost magical performance. After all, I'm pretty sure Google, Bing, et al. don't use KMP to search the web :)
Chris Schmich
+2  A: 

Prefix/suffix trees are generally the standard, best, and most cautious solution for this sort of thing, in my opinion. You can't go wrong with them.

But here is a different idea, which relies on Bloom filters. You probably know what these are, but just in case (and for other people reading this answer): Bloom filters are very small, very compact bit vectors which approximate set inclusion. If you have a set S and a Bloom filter for that set, B(S), then

x ∈ S ⇒ x ∈ B(S)

but the converse does not hold. This is what is probabilistic about the structure: there is a (quantifiable) probability that the Bloom filter will return a false positive. But approximating inclusion with the Bloom filter is wildly faster than testing it exactly against the set.

(A simple use case: in a lot of applications, the Bloom filter is used, well, as a filter. Checking a cache is expensive, because it requires a hard drive access, so programs like Squid will first check a small Bloom filter in memory; if the Bloom filter returns a positive result, then Squid will go check the cache. If it was a false positive, that's OK, because Squid will find that out when actually visiting the cache. The advantage is that the Bloom filter will have spared Squid from checking the cache for a lot of requests where it would have been useless.)

Bloom filters have been used with some success in string search. Here is a sketch (I may remember some of the details wrong) of this application. A text file is a sequence of N lines. You are looking for a word composed of M letters (and no word can be spread across two lines). A preprocessing phase builds ONE Bloom filter for each line, by adding every subsequence of the line to the Bloom filter; for instance, for this line

 Touching this dreaded sight, twice seene of vs,

the corresponding Bloom filter will be created with "T", "To", "Tou" ... "o", "ou", ... "vs,", "s", "s,", ",". (I may have this part wrong. Or you might want to optimize.)

Then when searching for the subword of size M, simply do one very fast check on each of the Bloom filters, and when there is a hit, examine the line closely with the KMP algorithm, for instance. In practice, if you tune your Bloom filters well, the trade-off is remarkable. Searching is incredibly fast because you eliminate all useless lines.
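
A very rough sketch of that scheme in Python (the Bloom filter parameters, the hash construction, and the maximum indexed substring length are deliberately simple assumptions; a real implementation would size the filters from the desired false-positive rate):

    import hashlib

    class Bloom:
        """Deliberately simple Bloom filter backed by one big integer."""
        def __init__(self, bits=8192, hashes=3):
            self.bits, self.hashes, self.v = bits, hashes, 0
        def _indexes(self, item: bytes):
            for k in range(self.hashes):
                h = hashlib.sha1(bytes([k]) + item).digest()
                yield int.from_bytes(h[:4], "big") % self.bits
        def add(self, item: bytes):
            for i in self._indexes(item):
                self.v |= 1 << i
        def __contains__(self, item: bytes) -> bool:
            return all((self.v >> i) & 1 for i in self._indexes(item))

    MAX_M = 10  # only substrings up to the maximum needle length are added

    def index_lines(lines):
        filters = []
        for line in lines:
            f = Bloom()
            for i in range(len(line)):
                for j in range(i + 1, min(i + MAX_M, len(line)) + 1):
                    f.add(line[i:j])
            filters.append(f)
        return filters

    def search(lines, filters, needle: bytes):
        # Cheap Bloom-filter check first, exact check only on probable hits.
        return [line for line, f in zip(lines, filters)
                if needle in f and needle in line]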

I believe that from this concept you could derive a useful scheme for your situation. Right now, I see two evident adaptations:

  • either cut your data set into many blocks of size K (each with its own Bloom filter, like the lines in the previous example);

  • or use a sort of dichotomy where you split the set into two subsets, each with a Bloom filter, then split each subset into two sub-subsets with their own Bloom filters, and so on (if you are going to add all substrings as suggested with the method I described, this second idea would be a bit prohibitive, except that you don't have to add all substrings, only substrings ranging in size from 1 to 10).

Both ideas can be combined in inventive ways to create multi-layered schemes.

Jérémie
+1 this is so great to learn, thank you.
Heath Hunnicutt