I have a huge (but finite) set of natural language strings.

I need a way to convert each string to a numeric value. For any given string the value must be the same every time.

The more "different" two given strings are, the more different the two corresponding values should be; the more "similar" they are, the less different the values should be.

I do not know yet what exact definition of difference between strings I need. No natural language parsing, anyway. It should probably be something Levenshtein-like (but Levenshtein is relative, and I need an absolute metric). Let's start with something simple.

Update on dimensions

I'll be happy to settle for a multidimensional (3D is best) vector instead of a single numeric value.

Update on expected result correctness

As was correctly noted here and here, the distance from one string to another is a vector with MAX(firstStringLength, secondStringLength) dimensions. In general it is not possible to reduce the number of dimensions without some loss of information.

However, I do not need an absolute solution. I would settle for any "good enough" conversion from the N-dimensional string space to my 3D space.

Note also that I have a finite number of strings of finite length. (The number of strings is rather large though, about 80 million (10 GB), so I'd better pick some single-pass, stateless algorithm.)

From scanning references, I'm under the impression that a Hilbert space-filling curve may help me here. It looks like the article Analysis of the Clustering Properties of the Hilbert Space-Filling Curve discusses something close to my problem...

Update on Hilbert curve approach

  1. We map each string to a point in an N-dimensional space, where N is the maximum length of a string in the set. (BTW, can the i-th character code from a string be used as the i-th coordinate value here?)
  2. We plot a Hilbert curve through that N-dimensional space.
  3. For each string we take the point on the curve closest to the coordinates of the string. The Hilbert value of that point (the length along the curve from its beginning) is the single-dimensional value I seek.
  4. If we need a 3D value, we plot a Hilbert curve in 3D and pick the points matching the Hilbert values calculated above.
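Step 3 can be sketched in two dimensions; the N-dimensional case uses the same bit-manipulation idea with more axes. This is the well-known iterative (x, y) → d conversion for a Hilbert curve filling an n×n grid (n a power of two); the function name and shape are illustrative, not from any particular library:

```python
def xy2d(n, x, y):
    """Distance d along the Hilbert curve filling an n-by-n grid
    for grid point (x, y); n must be a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate the quadrant so the sub-curve is oriented consistently
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d
```

For strings, the coordinates would be (for instance) character codes. The cost is O(bits per coordinate × number of dimensions) per string, so a single stateless pass over 80 million strings looks feasible.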

Does this look right? What would be the computational expense here?

A: 

To get over the 'relative distance' problem, all you need to do is take a fixed point to measure from.

You could still use the Levenshtein distance, but take it from a fixed "origin" string. For example, you could use an arbitrary-length string of all spaces as your origin string.
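A sketch of that idea, assuming plain unit-cost Levenshtein distance and an all-spaces origin of some arbitrary fixed length (both choices are placeholders, as the answer says):

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance (insert/delete/substitute, cost 1)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # delete
                           cur[j - 1] + 1,               # insert
                           prev[j - 1] + (cs != ct)))    # substitute
        prev = cur
    return prev[-1]

ORIGIN = " " * 10  # fixed "origin" string of all spaces

def string_value(s):
    # the single numeric value: distance from the fixed origin
    return levenshtein(s, ORIGIN)
```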

Anyway, I'd test it out first with a small subset of known strings to see if the values reflect what you would expect to see.

Andrew Rollings
This is unsound: you will almost certainly end up with two very dissimilar strings sharing the same distance from the fixed point. E.g. if your fixed point is the empty string, any pair of strings with the same length will share the same value.
Daniel Nadasi
Hmmm. That's a good point, but it's also a problem with any solution, due to the reduction in dimensionality. Okay... how about having a second (and/or third) different origin string and using those as well? The problem is that the dimensionality of the solution is still hugely reduced.
Andrew Rollings
(continued) There is no easy solution to this, due to the mapping of a multi-dimensional value (the range of strings) to a 1-dimensional space (the number line).
Andrew Rollings
+3  A: 

I think you are going to have to specify your problem more clearly: what exactly are you trying to achieve with this metric?

I say this because Levenshtein works by mapping pairs of strings to a metric, which can preserve the dimensionality of the string space. What happens if you try to map strings to numbers is that there is a large loss of dimensional information. For example, say I have the string "cat"; I'd want "bat", "hat", "rat", "can", "cot" etc. to all be reasonably close to it. With a large number of words, the result is that you end up with dissimilar words being close relatively often: e.g. "bat" and "cot" may be close, because they both happen to lie at similar distances from "cat" on the positive side.

This is similar to the problem of mapping the plane to a line: it is difficult to meet the restriction that points far apart in the plane stay far apart on the line. So the upshot is that the 'the more "different" two given strings are, the more different the two corresponding values should be' requirement is difficult to satisfy.

So, my first suggestion is: do you really need something that does this? Would a simple hash code suffice to give you unique values, or perhaps you can use Levenshtein after all and ignore the values for individual strings? If none of those suffice, perhaps you can use a multidimensional function value, that is, map strings into pairs, triples or another small tuple of numbers. The extra dimensionality granted that way will give you far better results.

An example might be encoding the string as a triple: length, sum of the values of the letters in the string, alternating sum of the values of the letters, e.g. f("cat") = (3, 3 + 1 + 20, 3 - 1 + 20) = (3, 24, 22). This would have some of the properties you desire, but is probably not optimal. Try looking for orthogonal features of the string to do this encoding; or even better, if you have a large test set of strings, there are existing libraries for mapping this sort of data into low dimensions while preserving metrics (e.g. the Levenshtein metric), and you can train your function on that. I remember the S language had support for this sort of thing.
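The triple from the example, sketched directly (letter values a=1 ... z=26, as implied by the f("cat") numbers):

```python
def encode(s):
    """Encode a word as (length, sum of letter values, alternating sum of letter values)."""
    vals = [ord(c) - ord('a') + 1 for c in s.lower()]
    alternating = sum(v if i % 2 == 0 else -v for i, v in enumerate(vals))
    return (len(vals), sum(vals), alternating)
```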

Daniel Nadasi
Hmm. Looks like you're right. I'm trying to map a set of strings to a set of colors, so I'm all for a multidimensional function value. See http://stackoverflow.com/questions/495662/mapping-arbitrary-strings-to-rgb-values
Alexander Gladysh
+2  A: 

I don't think this is possible to do. Start with a simple string, and assign it zero (it doesn't really matter what the number is):

  • "Hello World" = 0

The following strings are at distance 2 from it:

  • "XXllo World" = a
  • "HeXXo World" = b
  • "Hello XXrld" = c
  • "Hello WorXX" = d

Yet each of these strings is at distance 4 from each of the others. There is no way to assign the numbers so that this works. Consider, for instance:

a = 1, b = -1, c = 2, d = -2

Then |c - 0| = 2 while |c - a| = 1, so on the number line c ends up closer to a than to the original string, even though the actual edit distance from c to a is 4 and from c to the original is only 2.

And this is just a simple case.
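Since all five strings here have equal length, the edit distances in question are just counts of differing positions, which makes the claim easy to check with a throwaway sketch:

```python
def diff_positions(a, b):
    # for equal-length strings, the edit distance here is the
    # number of positions where the characters differ
    return sum(x != y for x, y in zip(a, b))

origin = "Hello World"
variants = ["XXllo World", "HeXXo World", "Hello XXrld", "Hello WorXX"]
```

Each variant is 2 edits from the origin, while any two variants touch disjoint pairs of positions and are therefore 4 edits apart.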

FryGuy
+1  A: 

Measure the edit distance from the empty string, but instead of treating each edit as having cost 1, give it the index of the letter being added/removed in the alphabet sorted by frequency of use (etaoinshrdlu...), plus the difference between letter indices if your algorithm can detect replacements as replacements rather than as insert+delete pairs.
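Since the edit distance from the empty string consists only of insertions, the weighted version reduces to summing each letter's rank in the frequency ordering. A sketch; the full ordering below extends the answer's "etaoinshrdlu..." with one common approximation of the remaining letters, and unlisted characters get a worst-case weight:

```python
# approximate English letter-frequency order (tail beyond "etaoinshrdlu" is an assumption)
FREQ_ORDER = "etaoinshrdlucmfwypvbgkqjxz"

def letter_cost(ch):
    ch = ch.lower()
    return FREQ_ORDER.index(ch) + 1 if ch in FREQ_ORDER else len(FREQ_ORDER) + 1

def weighted_origin_distance(s):
    # edit distance from "" is all insertions, each costing the letter's rank
    return sum(letter_cost(c) for c in s)
```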

moonshadow
A: 

This is an "off the top of the head" answer to the question. Basically, it calculates the distance by which sentence 2 differs from sentence 1, as a Cartesian distance from sentence 1 (assumed to be at the origin), where the distance is the sum of the minimum Levenshtein differences between the words in the two sentences. It has the property that two equal sentences give a 0 distance. If this approach has been published elsewhere, I'm unaware of it.


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

namespace ConsoleApplication1
{
class Program
{
    static void Main(string[] args)
    {
        string str1 = "The cat sat on the mat";
        string str2 = "The quick brown fox jumped over the lazy cow";
        ReportDifference(str1, str1);
        ReportDifference(str2, str2);
        ReportDifference(str1, str2);
        ReportDifference(str2, str1);
    }

    /// <summary>
    /// Quick test and display routine
    /// </summary>
    /// <param name="str1">First sentence to test with</param>
    /// <param name="str2">Second sentence to test with</param>
    static void ReportDifference(string str1, string str2)
    {
        Debug.WriteLine(
            String.Format("difference between \"{0}\" and \"{1}\" is {2}",
                str1, str2, Difference(str1, str2)));
    }

    /// <summary>
    /// This does the hard work. Basically, it:
    /// 1) Splits the strings into tokens/words
    /// 2) Forms the Cartesian product of the 2 lists of words
    /// 3) Calculates the Levenshtein distance between each pair of words
    /// 4) Groups on the words from the first sentence
    /// 5) Gets the min distance between each word in the first sentence
    ///    and all of the words from the second
    /// 6) Squares the distance for each word
    ///    (the distance between 2 points is the sqrt of the sum of the
    ///    squared per-axis distances; this assumes the first sentence is the origin)
    /// 7) Takes the square root of the sum
    /// </summary>
    /// <param name="str1">First sentence to compare</param>
    /// <param name="str2">Second sentence to compare</param>
    /// <returns>Calculated distance</returns>
    static double Difference(string str1, string str2)
    {
        string[] splitters = { " " };

        var a = Math.Sqrt(
            (from x in str1.Split(splitters, StringSplitOptions.RemoveEmptyEntries)
                 from y in str2.Split(splitters, StringSplitOptions.RemoveEmptyEntries)
                 select new {x, y, ld = Distance.LD(x,y)} )
                .GroupBy(x => x.x)
                .Select(q => new { q.Key, min_match = q.Min(p => p.ld) })
                .Sum(s =>  (double)(s.min_match * s.min_match )));
        return a;
    }
}

/// <summary>
/// Lifted from http://www.merriampark.com/ldcsharp.htm
/// </summary>
public class Distance
{

    /// <summary>
    /// Compute Levenshtein distance
    /// </summary>
    /// <param name="s">String 1</param>
    /// <param name="t">String 2</param>
    /// <returns>Distance between the two strings.
    /// The larger the number, the bigger the difference.
    /// </returns>
    public static int LD(string s, string t)
    {
        int n = s.Length; //length of s
        int m = t.Length; //length of t
        int[,] d = new int[n + 1, m + 1]; // matrix
        int cost; // cost
        // Step 1
        if (n == 0) return m;
        if (m == 0) return n;
        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++) ;
        for (int j = 0; j <= m; d[0, j] = j++) ;
        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);
                // Step 6
                d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                          d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }
}

}

Aussie Craig
+2  A: 

I'd like to expand on FryGuy's answer about why this isn't going to work in any fixed number of dimensions. Take aaaaaaaaaa and baaaaaaaaa, abaaaaaaaa, ..., aaaaaaaaab. In this example the strings are of length 10, but they can be of arbitrary length. The distance of each of the 10 b-strings from aaaaaaaaaa is 1, and their distance from each other is 2. In general, if you take the strings of a fixed length N over a 2-letter alphabet, their distance graph is an N-dimensional hypercube.

There is no way you can map that into a fixed number of dimensions unless your strings are of limited length.

Rafał Dowgird
Well, yes. But all I need is a "good enough" mapping. (And of course my strings are of limited length -- there is even a limited, though huge, number of strings themselves.)
Alexander Gladysh
+1  A: 

You can also try looking into Latent Semantic Analysis and vector space models, with the caveat that you have to limit the maximum string length.

Your dimensions are the product of the elements of your alphabet and the positions in the string. Given the alphabet ("a", "b", "c", "t") and a maximum length of 3, the dimensions are (a:1, b:1, c:1, t:1, ..., a:3, b:3, c:3, t:3)

As an example, "cat" becomes (0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1).

This is of course a huge dataset, but you can use dimensionality reduction techniques (like SVD) to cut down the number of dimensions. This should work well, because there are many recurring patterns in words. You can tweak the number of output dimensions to suit your needs.

The similarity between two words can be computed by cosine similarity between the word vectors. You can also store the transformation vectors of the SVD to get the reduced vector for words, even previously unseen ones.
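A sketch of the position-times-letter encoding from the example, together with the cosine similarity; the SVD reduction step is omitted here, since in practice you would feed these vectors into any truncated-SVD routine and keep 3 output dimensions:

```python
ALPHABET = "abct"   # toy alphabet from the example
MAX_LEN = 3

def one_hot(word):
    # one dimension per (position, letter) pair, ordered position-major
    v = [0] * (MAX_LEN * len(ALPHABET))
    for i, ch in enumerate(word):
        v[i * len(ALPHABET) + ALPHABET.index(ch)] = 1
    return v

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))
```

For example, "cat" and "bat" share the letters a:2 and t:3, so their cosine similarity is 2/3, while words with no (position, letter) overlap score 0.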

Torsten Marek