I have a huge (but finite) set of natural language strings.
I need a way to convert each string to a numeric value. For any given string the value must be the same every time.
The more "different" two given strings are, the more different two corresponding values should be. The more "similar" they are, the less different values should be.
I do not know yet what exact definition of difference between strings I need. No natural language parsing anyway. It should probably be something Levenstein-like (but Levenstein is relative and I need absolute metric). Lets start with something simple.
Update on dimensions
I'll be happy to settle for a multidimensional (3d is best) vector instead of single numeric value.
Update on expected result correctness
As it was correctly noted here and here, the distance from one string to another is a vector with MAX(firstStringLength, secondStringLength)
dimensions. In general it is not possible to reduce number of dimensions without some loss of information.
However I do not need an absolute solution. I would settle for any "good enough" conversion from N-dimensional strings space to my 3D space.
Note also that I have a finite number of strings of finite length. (Number of strings is rather large though, about 80 million (10 GB), so I'd better pick some single-pass state-less algorithm.)
From scanning references, I'm under impression that Hilbert space-filling curve may help me here. Looks like Analysis of the Clustering Properties of the Hilbert Space-Filling Curve article discusses something close to my problem...
Update on Hilbert curve approach
- We map each string to a point in a N-dimensional space, where N is the maximum length of a string in set. BTW, can i-th character code from a string be used as the i-th coordinate value here?
- We plot a Hilbert curve through that N-dimensional space.
- For each string we take point on the curve, closest to the coordinates of the string. Hilbert value of that point (the length from the beginning of curve) is the single-dimensional value I seek.
- If we need 3D value, we plot Hilbert curve in 3D and pick points, matching Hilbert values, calculated above.
Does this looks right? What would be the computational expenses here?