Interpolation search on strings

views:

answers:

Interpolation search on strings

For those of you not familiar with interpolation search, it is method to search for a value in a sorted array that is potentially faster than binary search. You look at the first and last element and (assuming that the contents of the array are uniformly distributed) linearly interpolate to predict the location.

For example: we have an array of length 100 with array[0]=0 and array[99]=99. If we are looking for 80, it is intuitive to try array[80] over array[50], and if the array is close to uniformly distributed, the expected runtime is reduced to log(log(N))

For numbers, the location to check is defined by the equation: low + ((toFind - sortedArray[low]) * (high - low + 1)) / (sortedArray[high] - sortedArray[low]).

A common example used to show off the intuitive nature of interpolation search is: imagine trying to find the word 'yellow' in a dictionary. You wouldn't use binary search and go to the half way point. Rather, you would go to the expected location.

Humans can naturally linearly interpolate strings, but I can't figure out how code it up. How do we linearly interpolate strings?

Basically you use a binary tree approach to searching. Start in the middle. Is the value you are looking for larger or smaller than that point? If larger, go to the mid point of the middle and the end, and check again. If that point is larger than what you are looking for, go to the midpoint of the middle and the current point, and check again. Do that recursively until you find what you are looking for. Its very fast. Not much use in trying to 'guess' whereabout to start in the list other than the middle, since at most it'd mean 1 or 2 additional comparisons to get to that spot by starting in the middle.

GrandmasterB 2010-09-07 18:53:19

-1: Good description of binary search (if a little long and rambling and without a good code example), but this does NOT answer the question at all!

Platinum Azure 2010-09-07 19:34:38

? the question was how to guess where to start... and the answer is you dont generally have to.

GrandmasterB 2010-09-07 21:07:11

+2 A:

To find the "distance" between two strings, a simple method would be to look at the first letter that is different between them and assign a numeric value to each, then take the difference.

For example, the distance from "a" to "y" would be 24 and the distance from "y" to "z" would be 1, if each letter were assigned a value equal to its position in the alphabet.

A better performing method would go through a dictionary to weight the various letters by how common they are in actual words.

Another refinement would be to look at two characters - "aa" is farther from "bz" than "az" is from "ba", for example. Going beyond two characters wouldn't buy you much.

The reason this method isn't more popular is that it complicates the binary search algorithm for not a lot of gain. If you were to time it you might even find that standard binary search is faster; what you gain in fewer comparisons you lose in the complexity of determining distances.

Also note that the worst-case performance of this algorithm is worse than a binary search. Consider for example searching for "ae" in the list of "aa","ab","ac","ad","ae","zz" - the outlier "zz" is going to bias the search so that it's always trying the beginning of the search range. It degrades to O(n) under these conditions.

Mark Ransom 2010-09-07 18:59:14

Good points all around. +1

Platinum Azure 2010-09-07 19:36:03

The extra complexity is 2 mult/div + 5 add/sub. I have tested it and, yes, it is a little slower than binary search (if N isn't ridiculous). But if the comparison is non-trivial (as in the case of strings) then it may become worth-while.

2010-09-07 20:30:58

@user108088, the complexity is also in the distance calculations, which will also be non-trivial in the case of strings. See my edit.

Mark Ransom 2010-09-07 20:52:27

If we use the first 2 letters for distance calculation, for example dist = (word[0] << 8) + word[1], then the complexity increases by a shift, add, and two references. This is potentially less than equality comparison. Also, the log log performance is expected from large uniformly distributed data-sets, like the dictionary I'm using.

2010-09-07 22:49:16

@user108088, the distance calculation first needs to skip the identical characters in each string, unless you degrade to a mid-point calculation once the words become close enough together. You also need to take into account things like case conversion. I still recommend doing a timing comparison.

Mark Ransom 2010-09-07 23:08:28

Ok, that sounds good. Thanks for all the help.

2010-09-07 23:31:51

ansaurus

tags:

views:

answers:

Interpolation search on strings

related questions