Given a list L of n character strings, and an input character string S, what is an efficient way to find the character string in L that contains the most characters that exist in S? We want to find the string in L that is most closely made up of the letters contained in S.

The obvious answer is to loop through all n strings and check how many characters in the current string exist in S. However, this algorithm will be run frequently, and the list L of n strings will be stored in a database... looping manually through all n strings would require something like O(n*m^2), where n is the number of strings in L and m is the maximum length of any string in L (as well as the maximum length of S)... in this case m is actually a constant of 150.
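For concreteness, the brute-force loop looks roughly like this (a Python sketch with illustrative names; the real strings would of course come from the database). Keeping S's characters in a set already drops the inner membership test to O(1), i.e. O(n*m) overall:

def best_match(L, S):
    # Characters of S in a set, so "does this character exist in S?" is O(1).
    s_chars = set(S)
    best, best_count = None, -1
    for candidate in L:
        # Count how many characters of the candidate exist in S.
        count = sum(1 for ch in candidate if ch in s_chars)
        if count > best_count:
            best, best_count = candidate, count
    return best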

Is there a better way than a simple loop? Is there a data structure I can load the n strings into that would give me fast searching? Is there an algorithm that uses pre-calculated metadata about each of the n strings and would perform better than a loop?

I know there are a lot of geeks out there who are into algorithms. So please help!

Thanks!

+1  A: 

Sounds like you need a trie. Tries are used to search for words, similar to the way a spell checker works. So if the string S has its characters in the same order as the strings in L, then this may work for you.

If, however, the order of the characters in S is not relevant - like a set of Scrabble tiles from which you want to build the longest word - then this is not your solution.
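If order does matter, a minimal trie sketch looks something like this (illustrative Python, not taken from any particular library):

class TrieNode:
    def __init__(self):
        self.children = {}   # maps character -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

t = Trie()
t.insert("linear")
print(t.contains("linear"), t.contains("line"))   # True False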

Vincent Ramdhanie
+4  A: 

If you are after substrings, a trie or Patricia trie might be a good starting point.

If you don't care about the order, just about the number of each symbol or letter, I would calculate the histogram of all strings and then compare them with the histogram of the input.

               ABCDEFGHIJKLMNOPQRSTUVWXYZ
Hello World => ...11..1...3..2..1....1...

This lowers the cost per query to O(26 * n + m) - plus a one-time preprocessing pass to build the histograms for L - if you consider only case-insensitive Latin letters.

If m is constant, you could interpret each histogram as a point on a 26-dimensional unit sphere by normalizing it. Then you could just calculate the dot product of two such vectors, which yields the cosine of the angle between them, and this value should be proportional to the similarity of the strings.

Assuming m = 3, an alphabet A = { 'U', 'V', 'W' } of size three, and the following list of strings.

L = { "UUU", "UVW", "WUU" }

The histograms are the following.

H = { (3, 0, 0), (1, 1, 1), (2, 0, 1) }

A histogram h = (x, y, z) is normalized to h' = (x/r, y/r, z/r), where r is the Euclidean norm of the histogram h - that is, r = sqrt(x² + y² + z²).

H' = { (1.000, 0.000, 0.000), (0.577, 0.577, 0.577), (0.894, 0.000, 0.447) }

The input S = "VVW" has the histogram hs = (0, 2, 1) and the normalized histogram hs' = (0.000, 0.894, 0.447).

Now we can compare two histograms h1 = (a, b, c) and h2 = (x, y, z) using the Euclidean distance between them.

d(h1, h2) = sqrt((a - x)² + (b - y)² + (c - z)²)

For the example we obtain.

d((3, 0, 0), (0, 2, 1)) = 3.742
d((1, 1, 1), (0, 2, 1)) = 1.414
d((2, 0, 1), (0, 2, 1)) = 2.828

Hence "UVW" is closest to "VVW" (smaller numbers indicate higher similarity).

Using the normalized histograms h1' = (a', b', c') and h2' = (x', y', z'), we can instead compute a similarity score as the dot product of the two histograms.

d'(h1', h2') = a'x' + b'y' + c'z'

For the example we obtain.

d'((1.000, 0.000, 0.000), (0.000, 0.894, 0.447)) = 0.000
d'((0.577, 0.577, 0.577), (0.000, 0.894, 0.447)) = 0.774
d'((0.894, 0.000, 0.447), (0.000, 0.894, 0.447)) = 0.200

Again "UVW" is determined to be closest to "VVW" (larger numbers indicate higher similarity).

Both versions yield different numbers, but the result is the same. One could also use other norms - the Manhattan distance (L1 norm), for example - but this will only change the numbers, because all norms on finite-dimensional vector spaces are equivalent.
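Here is a small Python sketch of the whole approach, reproducing the example above (the function names are just illustrative):

import math
import string

ALPHABET = string.ascii_uppercase   # case-insensitive Latin letters

def histogram(word):
    # 26-dimensional letter-count vector.
    counts = [0] * len(ALPHABET)
    for ch in word.upper():
        if ch in ALPHABET:
            counts[ALPHABET.index(ch)] += 1
    return counts

def normalize(h):
    # Scale the histogram to unit Euclidean length.
    r = math.sqrt(sum(x * x for x in h))
    return [x / r for x in h] if r else h

def euclidean(h1, h2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def dot(h1, h2):
    return sum(a * b for a, b in zip(h1, h2))

L = ["UUU", "UVW", "WUU"]
S = "VVW"
hs = histogram(S)

# Smaller Euclidean distance means more similar.
print(min(L, key=lambda w: euclidean(histogram(w), hs)))                   # UVW

# Larger dot product of the normalized histograms means more similar.
print(max(L, key=lambda w: dot(normalize(histogram(w)), normalize(hs))))   # UVW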

Daniel Brückner
I think you want to take the difference between the vectors and then pick the vector with the smallest difference (measured either by a straight sum of the components or, if you want to get fancy, by squaring each component before summing).
Erich Mirabal
Your 26-dimensional vector solution gets my MOST AWESOME solution badge :) Can you give an example calculation, maybe with a 2-d vector?
Rafe
Beat me to it with this answer - though I would have just made the vectors binary, since the OP is just looking for strings in L that have the most characters used in S.
bubaker
I am still thinking about the dot product and Euclidean distance vs. block distance vs. vector difference ... I will sleep about it one night (German saying - don't know if it makes sense in English) and give an example tomorrow.
Daniel Brückner
The English idiom would be "I will sleep on it".
John Fouhy
The proposed solution works, but it's still O(n) to check against n strings.
Nick Johnson
Yes, it is still O(n) in the number of strings in L. One could try a kd-tree, something like a high-dimensional Voronoi diagram, or another space-partitioning scheme to speed the search up, but I cannot give a good and simple solution at the moment.
Daniel Brückner
See my post regarding BK-Trees, below. :)
Nick Johnson
A: 

I believe what you're looking for can be found here: Fuzzy Logic Based Search Technique

It's pretty heavy, but so is what you're asking for. It talks about word similarities, and character misplacement.

e.g.:

L I N E A R T R N A S F O R M
L I N A E R T R A N S F O R M
L E N E A R T R A N S F R M
Mike Curry
A: 

Hi,

It seems to me that the order of the characters is not important in your problem, but rather that you are searching for "near-anagrams" of the word S.

If that's so, then you can represent every word in the set L as an array of 26 integers (assuming your alphabet has 26 letters). You can represent S similarly as an array of 26 integers; now, to find the best match, you just run once through the set L and calculate a distance metric between the S-vector and the current L-vector, however you want to define the distance metric (e.g. Euclidean / sum of squares, or Manhattan / sum of absolute differences). This is an O(n) algorithm because the vectors have constant length.
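A minimal Python sketch of this, using the Manhattan metric and assuming lowercase a-z (names are illustrative; it mirrors the histogram answer above, just with a different metric):

def counts(word):
    # 26-integer count vector for lowercase a-z.
    v = [0] * 26
    for ch in word.lower():
        if 'a' <= ch <= 'z':
            v[ord(ch) - ord('a')] += 1
    return v

def manhattan(v1, v2):
    # Sum of absolute differences.
    return sum(abs(a - b) for a, b in zip(v1, v2))

def closest(L, S):
    s = counts(S)
    return min(L, key=lambda w: manhattan(counts(w), s))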

antti.huima
+1  A: 

What you want is a BK-Tree. It's a bit unintuitive, but very cool - and it makes it possible to search for elements within a Levenshtein (edit) distance threshold in O(log n) time.

If you care about ordering in your input strings, use them as is. If you don't, you can sort the individual characters before inserting them into the BK-Tree (or querying with them).
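A minimal BK-tree sketch in Python (illustrative names, not from any particular library):

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None                  # (word, {edge_distance: child})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (word, {})
                return

    def query(self, word, threshold):
        # Return all (distance, word) pairs within the threshold.
        results, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = self.distance(word, node[0])
            if d <= threshold:
                results.append((d, node[0]))
            # Only edges within [d - threshold, d + threshold] can lead
            # to matches, by the triangle inequality.
            for edge, child in node[1].items():
                if d - threshold <= edge <= d + threshold:
                    stack.append(child)
        return sorted(results)

tree = BKTree()
for s in ["UUU", "UVW", "WUU"]:
    tree.add("".join(sorted(s)))          # sort characters if order is irrelevant
print(tree.query("".join(sorted("VVW")), threshold=2))   # [(1, 'UVW'), (2, 'UUW')]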

Nick Johnson
Nice data structure! +1 But calculating Levenshtein distance is quite expensive and he does not know the distance in advance. He might still need to visit almost the whole tree.
Daniel Brückner
You can implement nearest-neighbour search in a BK-tree in a fairly straightforward fashion. I think - but can't prove - that the amount of work to find the single nearest neighbour is bounded, regardless of how far away it is.
Nick Johnson
A: 

Here is a T-SQL function that has been working great for me; it gives you the edit distance:

Example:

  SELECT TOP 1 [StringValue], edit_distance([StringValue], 'Input Value')
    FROM [SomeTable]
ORDER BY edit_distance([StringValue], 'Input Value')

The Function:

CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
  -- Levenshtein (edit) distance between @s1 and @s2, computed row by row.
  -- The previous and current DP rows are packed into varbinary variables
  -- (@cv1 / @cv0), two bytes per cell, to avoid a table variable.
  -- Note: returns 0 if @s1 is empty, regardless of @s2.
  DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int,
    @cv0 varbinary(8000), @cv1 varbinary(8000)
  SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0
  -- First row: cell j holds j (distance from the empty prefix of @s1).
  WHILE @j <= @s2_len
    SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
  WHILE @i <= @s1_len
  BEGIN
    -- Start row i; its first cell is i.
    SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1
    WHILE @j <= @s2_len
    BEGIN
      SET @c = @c + 1                                       -- cell to the left + 1
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
        CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END  -- diagonal cell + 0/1
      IF @c > @c_temp SET @c = @c_temp
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1        -- cell above + 1
      IF @c > @c_temp SET @c = @c_temp
      SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
    END
    SELECT @cv1 = @cv0, @i = @i + 1
  END
  RETURN @c  -- bottom-right cell of the DP table
END
JasonRShaver