views: 339
answers: 3

Hi,

I was wondering what the complexity is of the algorithm used to find the smallest snippet that contains all the search keywords.

Please explain the complexity, if non-trivial.

Thanks
Bala

+4  A: 

As stated, the problem is solved by a rather simple algorithm:

Just look through the input text sequentially from the very beginning and check each word: whether it is in the search key or not. If the word is in the key, add it to the end of a structure that we will call The Current Block. The Current Block is just a linear sequence of words, each word accompanied by the position at which it was found in the text.

The Current Block must maintain the following Property: the very first word in The Current Block must be present in The Current Block once and only once. If you add a new word to the end of The Current Block and the Property becomes violated, you have to remove the very first word from the block. This process is called normalization of The Current Block. Normalization is a potentially iterative process: once you remove the very first word from the block, the new first word might also violate The Property, so you'll have to remove it as well, and so on.

So, basically The Current Block is a FIFO sequence: the new words arrive at the right end, and get removed by normalization process from the left end.

All you have to do to solve the problem is look through the text, maintaining The Current Block and normalizing it whenever necessary so that it satisfies The Property. The shortest block containing all the keywords that you ever build is the answer to the problem.

For example, consider the text

CxxxAxxxBxxAxxCxBAxxxC

with keywords A, B and C. Looking through the text you'll build the following sequence of blocks

C
CA
CAB - all words, length 9 (CxxxAxxxB...)
CABA - all words, length 12 (CxxxAxxxBxxA...)
CABAC - violates The Property, remove first C
ABAC - violates The Property, remove first A
BAC - all words, length 7 (...BxxAxxC...)
BACB - violates The Property, remove first B
ACB - all words, length 6 (...AxxCxB...)
ACBA - violates The Property, remove first A
CBA - all words, length 4 (...CxBA...)
CBAC - violates The Property, remove first C
BAC - all words, length 6 (...BAxxxC)

The best block we built has length 4, which is the answer in this case:

CxxxAxxxBxxAxx CxBA xxxC

The exact amount of normalization work depends on the input, but it is bounded: each word is added to The Current Block at most once and removed at most once, so normalization takes O(N) steps overall. The dominant cost is checking whether the current word belongs to the keyword set, which is O(log M) per word, so the total complexity is O(N * log M), where N is the number of words in the text and M is the number of keywords.
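For concreteness, here is a minimal sketch of the block-maintenance approach in Python; the function name, the character-level "words" in the test call, and the 0-based positions are illustrative choices, not part of the answer above:

from collections import deque

def smallest_snippet(words, keywords):
    # Sketch of the Current Block idea: scan the text once, keep a FIFO
    # of keyword occurrences, and normalize so the first word is unique.
    keyset = set(keywords)            # hashed keyword set, O(1) average lookup
    block = deque()                   # FIFO of (position, word) for keywords seen
    counts = {}                       # occurrences of each keyword inside the block
    best = None
    for pos, word in enumerate(words):
        if word not in keyset:
            continue
        block.append((pos, word))
        counts[word] = counts.get(word, 0) + 1
        # Normalization: the first word of the block must occur exactly once.
        while counts[block[0][1]] > 1:
            counts[block[0][1]] -= 1
            block.popleft()
        # When every keyword is present, the block is a candidate answer.
        if len(counts) == len(keyset):
            start, end = block[0][0], block[-1][0]
            if best is None or end - start < best[1] - best[0]:
                best = (start, end)
    return best                       # None if some keyword never occurs

print(smallest_snippet(list("CxxxAxxxBxxAxxCxBAxxxC"), {"A", "B", "C"}))
# prints (14, 17): the length-4 block "CxBA" from the example above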

Now, having said that, I have to admit that I suspect that this might not be what you need. Since you mentioned Google in the caption, it might be that the statement of the problem you gave in your post is not complete. Maybe in your case the text is indexed? (With indexing the above algorithm is still applicable, just becomes more efficient). Maybe there's some tricky database that describes the text and allows for a more efficient solution (like without looking through the entire text)? I can only guess and you are not saying...

AndreyT
A: 

This is an interesting question. To restate it more formally: Given a list L (the web page) of length n and a set S (the query) of size k, find the smallest sublist of L that contains all the elements of S.

I'll start with a brute-force solution in hopes of inspiring others to beat it. Note that set membership can be checked in constant time after one pass through the set; see this question. Also note that this assumes all the elements of S are in fact in L; otherwise it will just return the sublist from 1 to n.

best = (1, n)
For i from 1 to n-k+1:
  Create/reset a hash found[] mapping each element of S to False; let counter = 0.
  For j from i to n, stopping once counter == k:
    If L[j] is in S and not found[L[j]] then counter++ and let found[L[j]] = True.
  If counter == k and j - i < best[2] - best[1] then let best = (i, j).

Time complexity is O((n+k)(n-k)), i.e., roughly n^2.
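For reference, a direct Python rendering of the pseudocode above, using a set in place of the found[] hash and counter (the names are illustrative):

def smallest_snippet_brute(L, S):
    # For each start position i, scan right until every element of S is seen.
    n, k = len(L), len(S)
    best = None
    for i in range(n):
        found = set()
        j = i
        while j < n and len(found) < k:
            if L[j] in S:
                found.add(L[j])
            j += 1
        if len(found) == k and (best is None or (j - 1) - i < best[1] - best[0]):
            best = (i, j - 1)         # L[j-1] is the word that completed the set
    return best                       # None if some element of S never occurs in L

print(smallest_snippet_brute(list("CxxxAxxxBxxAxxCxBAxxxC"), {"A", "B", "C"}))
# prints (14, 17), the same answer as above, in quadratic time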

dreeves
Already beaten by a single-pass algorithm in my answer. Also, I don't understand how you plan to do set membership in constant time, given that the set members are *words*. A perfect hash can do that, but it is not feasible in the general case, given that we are working with arbitrary *words*. The approach at your link is for *subset* membership within a finite "main" set, which is obviously not the case in this problem.
AndreyT
Nice work, Andrey! I guess I was imagining first converting all the words to integers.
dreeves
+1  A: 

The link below describes an O(m) algorithm, where m is the number of times the query terms match in the text: http://rcrezende.blogspot.com/2010/08/smallest-relevant-text-snippet-for.html

Rodes