ansaurus

Question

Answer 1

A:

To get over the 'relative distance' problem, all you need to do is take a fixed point to measure from.

You could still use the Levenstein distance, but take it from a fixed "Origin" string. For example, you could use an arbitrary length string of all spaces as your origin string.

Anyway, I'd test it out first with a small subset of known strings to see if the values reflect what you would expect to see.

Andrew Rollings 2009-01-30 22:51:24

This is unsound, you will almost certainly end up with two very dissimilar strings sharing the same distance from the fixed point. e.g. if your fixed point is the empty string, any pair of strings with the same length will share the same value.

Daniel Nadasi 2009-01-30 23:14:02

Hmmm. That's a good point, but it's also a problem with any solution due to the reduction in dimensionality. Okay... How about having a second (and/or third) (different) origin string and using that... The problem is the dimensionality of the solution is reduced hugely.

Andrew Rollings 2009-01-30 23:23:36

(continued) There is no easily solution to this due to the mapping of a multi-dimensional value (the range of strings) to a 1-dimensional space (the number line)

Andrew Rollings 2009-01-30 23:24:21

Answer 2

+3 A:

I think you are going to have to specify your problem more clearly, what exactly are you trying to achieve with this metric?

I say this, because Levenstein works since it maps pairs of strings to a metric, which can preserve the dimensionality of the string space. What happens if you try and map strings to numbers is that there is a large loss of dimensional information. For example, say I have the string "cat", I'd want "bat", "hat", "rat", "can", "cot" etc. to all be reasonably close to this. With a large number of words, the result is that you end up with dissimilar words being close relatively often e.g. "bat" and "cot" may be close, because they both happen to be similar distances from "cat" on the positive side. This is similar to the problem of what happens when you try and map the plane to a line, it is difficult to meet the restriction that points far away in the plane stay far away on the line. So, the upshot of this is that the 'The more "different" two given strings are, the more different two corresponding values should be' requirement is difficult.

So, my first suggestion is, do you really need something that does this, will a simple hash-code suffice to give you unique values, or perhaps you can use Levenstein after all and ignore the values for individual strings? If none of those suffice, perhaps you can use a multidimensional function value, that is map strings into pairs, triples or another small tuple of numbers. The extra dimensionality granted that way will give you far better results.

An example might be encoding the string as a triple: length, sum of values of letters in string, alternating sum of values of letters e.g. f("cat") = (3, 3 + 1 + 20, 3 - 1 + 20) = (3, 24, 22). This would have some of the properties you desire, but is probably not optimal. Try looking for orthogonal features of the string to do this encoding, or even better, if you have a large test set of strings there are existing libraries for mapping this sort of data into low dimensions while preserving metrics (e.g. the Levenstein metric) and you can train your function on that. I remember the S language had support for this sort of thing.

Daniel Nadasi 2009-01-30 23:08:23

Hmm. Looks like you're right.I'm trying to map a set of strings to a set of colors. So I'm all for multidimensional function value. See http://stackoverflow.com/questions/495662/mapping-arbitrary-strings-to-rgb-values

Alexander Gladysh 2009-01-30 23:16:14

Answer 3

+2 A:

I don't think this is possible to do. Start with a simple string, and assign it zero (it doesn't really matter what the number is)

"Hello World" = 0

The following strings are at distance 2 from it:

"XXllo World" = a
"HeXXo World" = b
"Hello XXrld" = c
"Hello WorXX" = d

Yet, each of these strings is 4 from each other. There is no way to sort the numbers to make it work, for the following instance:

a = 1, b = -1, c = 2, d = -2

Consider that c to 0 is 2, yet c to a is 1, yet 0 is closer than a.

And this is just a simple case.

FryGuy 2009-01-30 23:11:13

Answer 4

+1 A:

Measure the edit distance from the empty string, but instead of treating each edit as having the value "1", give it the index of the letter being added/removed in the alphabet sorted by use frequency (etaoinshrdlu...), and the difference between letter indices if your algorithm allows you to detect replacements as replacements rather than as insert+delete pairs.

moonshadow 2009-01-30 23:23:26

Answer 5

A:

This is an "off the top of the head" answer to the question. Basically, calculates the distance sentance 2 differs from sentance 1, as a cartesian distance from sentence 1(assumed to be at the orign). Where, the distances is the sum of the minimun Levenstien difference between the word in the 2 sentences. It has the property that 2 equal sentances give a 0 distance. If this approach has been published elswhere, I'm unaware of it.

Edit: it looks wrong (C# code duplicated - How can I fix it?). Thanks in advance.

using System; using System.Collections.Generic; using System.Linq; using System.Text; using System.Diagnostics;

namespace ConsoleApplication1 { class Program { static void Main(string[] args) { string str1 = "The cat sat on the mat"; string str2 = "The quick brown fox jumped over the lazy cow"; ReportDifference(str1, str1); ReportDifference(str2, str2); ReportDifference(str1, str2); ReportDifference(str2, str1); } /// /// Quick test andisplay routine /// /// First sentence to test with /// Second sentance to test with static void ReportDifference(string str1, string str2) { Debug.WriteLine( String.Format("difference between \"{0}\" and \"{1}\" is {2}", str1, str2, Difference(str1, str2))); } /// /// This does the hard work. /// Basiclly, what it does is: /// 1) Split the stings into tokens/words /// 2) Form a catresian product of the 2 lists of words. /// 3) Calculate the Levenshtein Distance between each word. /// 4) Group on the words from the first sentance /// 5) Get the min distance between the word in first sentance and all of the words frrom the second /// 6) Square the distances for each word. /// (based on the distance betwen 2 points is the sqrt of the sum of the x,y,... axises distances /// what this assumes is the first word is the orign) /// 7) take the square root of sum /// /// sentance 1 compare /// sentance 2 compare /// distance caculated static double Difference(string str1, string str2) { string[] splitters = { " " };

        var a = Math.Sqrt(
            (from x in str1.Split(splitters, StringSplitOptions.RemoveEmptyEntries)
                 from y in str2.Split(splitters, StringSplitOptions.RemoveEmptyEntries)
                 select new {x, y, ld = Distance.LD(x,y)} )
                .GroupBy(x => x.x)
                .Select(q => new { q.Key, min_match = q.Min(p => p.ld) })
                .Sum(s =>  (double)(s.min_match * s.min_match )));
        return a;
    }
}

/// <summary>
/// Lifted from http://www.merriampark.com/ldcsharp.htm
/// </summary>
public class Distance
{

    /// <summary>
    /// Compute Levenshtein distance
    /// </summary>
    /// <param name="s">String 1</param>
    /// <param name="t">String 2</param>
    /// <returns>Distance between the two strings.
    /// The larger the number, the bigger the difference.
    /// </returns>
    public static int LD(string s, string t)
    {
        int n = s.Length; //length of s
        int m = t.Length; //length of t
        int[,] d = new int[n + 1, m + 1]; // matrix
        int cost; // cost
        // Step 1
        if (n == 0) return m;
        if (m == 0) return n;
        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++) ;
        for (int j = 0; j <= m; d[0, j] = j++) ;
        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);
                // Step 6
                d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                          d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }
}

}

Aussie Craig 2009-01-31 03:48:14

Answer 6

+2 A:

I'd like to expand FryGuy's answer why it isn't gonna work in any fixed number of dimensions. Let's take aaaaaaaaaa and baaaaaaaaa, abaaaaaaaa, ... , aaaaaaaaab. In this example the strings are of length 10, but they can be of arbitrary length. The distance of each of the 10 b-strings from aaaaaaaaaa is 1, and their distance from each other is 2. In general, if you take fixed strings of length N over 2-letter alphabet, their distance graph is an N-dimensional hypercube.

There is no way you can map that into a fixed number of dimensions unless your strings are of limited length.

Rafał Dowgird 2009-01-31 14:18:31

Well, yes. But all I need is "good enough" mapping. (Of course my strings are of limited length -- there is even limited, but huge, number of strings themselves.)

Alexander Gladysh 2009-01-31 15:34:46

Answer 7

+1 A:

You can also try to look into Latent Semantic Analysis and vector space models, with the problem that you have to limit the maximum string length.

Your dimensions are the product of the elements of your alphabet and the positions in the string. Given the alphabet ("a", "b", "c", "t") and a maximum length of 3, the dimensions are (a:1, b:1, c:1, t:1, ..., a:3, b:3, c:3, t:3)

As an example, "cat" becomes (0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1).

This is of course a huge dataset, but you can use dimensionality reduction techniques (like SVD) to cut down the number of dimensions. This should work well, because there are many recurring patterns in words. You can tweak the number of output dimensions to suit your needs.

The similarity between two words can be computed by cosine similarity between the word vectors. You can also store the transformation vectors of the SVD to get the reduced vector for words, even previously unseen ones.

Torsten Marek 2009-01-31 16:42:54

ansaurus

tags:

views:

answers:

"Absolute" string metric

Update on dimensions

Update on expected result correctness

Update on Hilbert curve approach

related questions