Algorithm to find the smallest snippet from searching a document?

tags:

algorithm

views:

369

answers:

+6 Q:

Algorithm to find the smallest snippet from searching a document?

I've been going through Skiena's excellent "The Algorithm Design Manual" and got hung up on one of the exercises.

The question is: "Given a search string of three words, find the smallest snippet of the document that contains all three of the search words—i.e. , the snippet with smallest number of words in it. You are given the index positions where these words in occur search strings, such as word1: (1, 4, 5), word2: (4, 9, 10), and word3: (5, 6, 15). Each of the lists are in sorted order, as above."

Anything I come up with is O(n^2)... This question is in the "Sorting and Searching" chapter, so I assume there is a simple and clever way to do it. I'm trying something with graphs right now, but that seems like overkill.

Ideas? Thanks

Say the sorted arrays/lists are A,B,C corresponding to word1, word2 and word3 respectively.

For each a in A (and b in B and c in C) find the best window which begins at a (or b or c). Find the best among those (the best window has to begin at one of those indices).

To find the best for an a in A (similarly for B and C), find where a lies in B and in C (using binary search).

Say b1 <= a <= b2 and c1 <= a <= c2.

Then pick the best window in [a, a+strlen(word1)], [a, b2 + strlen(word2)], [a, c2+strlen(word3)] which fits all three words.

Now find the best among all all such (i.e. considering all A, B and C).

I believe this will work and is O(nlogn), but haven't tried proving it.

Moron 2010-06-02 02:40:21

Hello, Moron, does my comment above hold any water?

Hamish Grubijan 2010-06-02 02:49:25

@Hamish. It looks promising. Perhaps you should consider further on that line and do update us!

Moron 2010-06-02 02:56:56

+1 A:

From the question, it seems that you're given the index locations for each of your n “search words” (word1, word2, word3, ..., word n) in the document. Using a sorting algorithm, the n independent arrays associated with search words can readily be represented as a single array of all the index locations in ascending numerical order and a word label associated with each index in the array (the index array).

The Basic Algorithm:

(Designed to work whether or not the poster of this question intended to allow two different search words to coexist at the same index number.)

First, we define a simple function for measuring the length of a snippet that contains all n labels given a starting point in the index array. (It is obvious from the definition of our array that any starting point on the array will necessarily be the indexed location of one of the n search labels.) The function simply keeps track of the unique search labels seen as the function iterates through the elements in the array until all n labels have been observed. The length of the snippet is defined as the difference between the index of the last unique label found and the index of the starting point in the index array (the first unique label found). If all n labels aren't observed before the end of the array the function returns a null value.

Now, the snippet length function can be run for each element in your array to associate a snippet size containing all n search words starting from each element in the array. The smallest non-Null value returned by the snippet length function over the whole index array is the snippet in your document that you're looking for.

Necessary Optimizations:

Keep track of the value of the current shortest snippet length so that the value will be know immediately after iterating once through the index array.
When iterating through your array terminate the snippet length function if the current snippet under inspection ever surpasses the length of the shortest snippet length previously seen.
When the snippet length function returns null for not locating all n search words in the remaining index array elements, associate a null snippet length to all successive elements in the index array.
If the snippet length function is applied to a word label and the label immediately following it is identical to the starting label, assign a null value to the starting label and move on to the next label.

Computational Complexity:

Obviously the sorting part of the algorithm can be arranged in O(n log n).

Here's how I would work out the time complexity of the second part of the algorithm (any critiques and corrections would be greatly appreciated).

In the best case scenario, the algorithm only applies the snippet length function to the first element in the index array and finds that no snippet containing all the search words exists. This scenario would be computed in just n calculations where n is the size of the index array. Slightly worse than that is if the smallest snippet turns out to be equal to the size of the whole array. In this case the computational complexity will be a little less than 2 n (once through the array to find the smallest snippet length, a second time to demonstrate that no other snippets exist). The shorter the average computed snippet length, the more times the snippet length function will need to be applied over the index array. We can assume that our worse case scenario will be the case where the snippet length function needs to be applied to every element in the index array. To develop a case where the function will be applied to every element in the index array we need to design an index array where the average snippet length over the whole index array is negligible in comparison to the size of the index array as a whole. Using this case we can write out our computational complexity as O(C n) where C is some constant that is significantly smaller then n. Giving a final computational complexity of:

O(n log n + C n)

Where:

C << n

Edit:

AndreyT correctly points out that instead of sorting the word indicies in n log n time, one might just as well merge them (since the sub arrays are already sorted) in n log m time where m is the amount of search word arrays to be merged. This will obviously speed up the algorithm is cases where m < n.

Ami 2010-06-02 03:27:05

+4 A:

I already posted a rather straightforward algorithm that solves exactly that problem in this answer

http://stackoverflow.com/questions/2734313/google-search-results-how-to-find-the-minimum-window-that-contains-all-the-searc/2734606#2734606

However, in that question we assumed that the input is represented by a text stream and the words are stored in an easily searchable set.

In your case the input is represented slightly differently: as a bunch of vectors with sorted positions for each word. This representation is easily transformable to what is needed for the above algorithm by simply merging all these vectors into a single vector of (position, word) pairs ordered by position. It can be done literally, or it can be done "virtually", by placing the original vectors into the priority queue (ordered in accordance with their first elements). Popping an element from the queue in this case means popping the first element from the first vector in the queue and possibly sinking the first vector into the queue in accordance with its new first element.

Of course, since your statement of the problem explicitly fixes the number of words as three, you can simply check the first elements of all three arrays and pop the smallest one at each iteration. That gives you a O(N) algorithm, where N is the total length of all arrays.

Also, your statement of the problem seems to suggest that target words can overlap in the text, which is rather strange (given that you use the term "word"). Is it intentional? In any case, it doesn't present any problem for the above linked algorithm.

AndreyT 2010-06-02 22:40:18

I disagree with your claim that the time complexity of your algorithm is O(N). You're neglecting the computations necessary to determine the smallest value of the different search word arrays (which is necessary to know which value to pop before the next computation) and the computations necessary to explicitly determine the length of the snippet given the smallest values from all the search word arrays (which is a constant value proportional to the amount of search words being computed). When these factors are taken into consideration, I think my algorithm is the best that you can do.

Ami 2010-06-02 23:28:40

@Ami: I stated in my linked post that the complexity is `O(N log M)`, where `N` is the total number of occurrences of our keywords and `M` is the number of *different* keywords. However, I explicitly explained in this post that if the statement of the problem explicitly fixes `M` as a constant (`3` in this case), the algorithm becomes a `O(N)` algorithm for obvious reasons.

AndreyT 2010-06-02 23:45:17

@Ami: As for your algorithm... I don't really see a complete algorithm is your post, just a bunch of ideas, but I think it is pretty obvious that my algorithm is nothing else than a compact practical single-pass implementation of what you essentially propose in your post. I don't see though how you managed to obtain `O(N log N)` for your algorithm, which is notably worse than my `O(N log M)`

AndreyT 2010-06-02 23:49:45

@Andrey: I want to respond to the critisms you raised about my algorithm (perhaps I should rewrite my answer to make it more clear), but before I do that let me clarify the point I was making earlier. In your linked post you explained the O(log M) as "the complexity of checking whether the current word belongs to the keyword set." You correctly pointed out in your answer above that there is no need to make a check like that in our case because the indecies given in this problem are already known to be part of the keyword set. I claim that you're neglecting a different set of computations....

Ami 2010-06-03 01:26:27

@Andrey part2: Given M different sets of indicies, your algorithm requires deterimining a snippet length from the M smallest index values then popping the smallest index. How does your algorithm know which of the M indecies is smalllest index which is to be popped? To answer this question your algorithm has to iterate through all M indecies and this process is repeated exactly N times. For this reason I think your algorithm is operating at O(N*M) not O(N). O(N*M) is not ostensibly worse than O(N log N)....

Ami 2010-06-03 01:29:07

@Andrey part3: My O(N log N) comes from sorting the set of N indicies. I've reduced the problem to a sorting problem because once I've sorted the indicies, finding the smallest snippet is extremely fast. Instead of sorting all the indicies once, you need to go through the smallest values in each of the different keyword sets every time you compute a snippet length. In summation, I take back my claim "my algorithm is the best that you can do." The best algorithm depends on the size of M and whether N*M is bigger or smaller than N log N. In the case of M=3 you're algorithm probably wins (+1).

Ami 2010-06-03 01:33:05

@Ami: I already described above how the smallest index out of `M` sorted vectors can be found efficiently: by placing the vectors into a binary heap (I used a term "priority queue" above). Binary heap will give you the current minimum in `O(log M)` each time. Which will give you, again, `O(N log M)` algorithm in the end.

AndreyT 2010-06-03 01:55:25

@Ami: By applying sorting to all `N` indices you are ignoring some important information that in given to you at the beginning: the fact that all vectors are *already sorted*. Instead of sorting all `N` from scratch, you should've simply *merged* `M` sorted vectors into a single sorted vector. Merging of `M` can be carried out in `O(N log M)` time, which is more efficient. Again, as I said before, your algorithm is not really different from mine. It is just that I don't understand why you insist on `O(N log N)` *sorting*, when you can easily do the same thing by `O(N log M)` *merging*.

AndreyT 2010-06-03 02:00:50

@Andrey: Ahhh, I see. I completely forgot about merging. Duly noted.

Ami 2010-06-03 02:04:29

@Ami: don't feel bad, I'm well known to overlook some crucial aspects too. Sometimes it's as if things did not registered in the brain when you read them... Fortunately, we are all here to learn (and helps other learning).

Matthieu M. 2010-06-03 15:39:34

Unless I've overlooked something, here's a simple, O(n) algorithm:

We'll represent the snippet by (x, y) where x and y are where the snippet begins and ends respectively.
A snippet is feasible if it contains all 3 search words.
We will start with the infeasible snippet (0,0).
Repeat the following until y reaches end-of-string:
1. If the current snippet (x, y) is feasible, proceed to the snippet (x+1, y)
  Else (the current snippet is infeasible) proceed to the snippet (x, y+1)
Choose the shortest snippet among all feasible snippets we went through.

Running time - in each iteration either x or y is increased by 1, clearly x can't exceed y and y can't exceed string length so total number of iterations is O(n). Also, feasibility can be checked at O(1) in this case since we can track how many occurences of each word are within the current snippet. We can maintain this count at O(1) with each increase of x or y by 1.

Correctness - For each x, we calculate the minimal feasible snippet (x, ?). Thus we must go over the minimal snippet. Also, if y is the smallest y such that (x, y) is feasible then if (x+1, y') is a feasible snippet y' >= y (This bit is why this algorithm is linear and the others aren't).

Yuval Cohen 2010-06-09 15:19:35

ansaurus

tags:

views:

answers:

Algorithm to find the smallest snippet from searching a document?

related questions