views:

272

answers:

5

I'm looking for an algorithm or function that is able to map a string to a number in such way that the resulting values correspond the lexicographic ordering of strings. Example:

"book" -> 50000
"car"  -> 60000
"card" -> 65000
"a longer string" -> 15000
"another long string" -> 15500
"awesome" -> 16000

As a function it should be something like: f(x) = y, so that for any x1 < x2 => f(x1) < f(x2), where x is an arbitrary string and y is a number.

If the input set of x is finite, then I could always do a sort and assign the proper values, but I'm looking for something generic for an unlimited input set for x.

+15  A: 

If you require that f map to integers this is impossible.

Suppose that there is such a map f. Consider the strings a, aa, aaa, etc. Consider the values f(a), f(aa), f(aaa), etc. As we require that f(a) < f(aa) < f(aaa) < ... we see that f(a_n) tends to infinity as n tends to infinity; here I am using the obvious notation that a_n is the character a repeated n times. Now consider the string b. We require that f(a_n) < f(b) for all n. But f(b) is some finite integer and we just showed that f(a_n) goes to infinity. We have a contradiction. No such map is possible.

Maybe you could tell us what you need this for? This is fairly abstract and we might be able to suggest something more suitable. Further, don't necessarily worry about solving "it" generally. YAGNI and all that.

Jason
with a maximum string length it may be possible however
jk
@jk: yes, but question explicitly states "I'm looking for something generic, for an unlimited input set for x."
AakashM
I supposed this would be the answer, but just wanted to be sure ... :-) The background: Actually I get some input sets of lets say 10000 strings each of maximum length 50, that is always different for each query. (That's why the unlimited input set.) Out of these strings I'd like to identify/group the similar ones and need a numeric value which matches the sort order for each separate input set. Solved the problem meanwhile, so thanks for your answer.
MicSim
if you have a maximum length then the input set is limited (assuming a fixed alphabet) although probably very large. still it sounds like there is a better solution to the underlying problem anyway
jk
@MicSim: What do you mean by "similar"? What do you need the numeric value for? Hashing?
Jason
By similar I mean equal. The numerical value, I would have needed it for sorting (the numerical value was expected by a 3rd party component, that i cannot change). But I solved my problem in another way, so that I don't have this requirement anymore.
MicSim
@Jason: I like your answer, as it clearly shows: Proving an approach wrong or impossible lets you safely discard it and frees your mind for other solutions.
MicSim
+2  A: 

You would be much better off writing a comparator which you can supply to a sort function. The comparator takes two strings and returns -1, 0, or 1. Even if you could create such a map, you still have to sort on it. If you need both a "hash" and the order, then keep stuff in two data structures - one that preserves the order, and one that allows fast access.

Hamish Grubijan
+2  A: 

Maybe a Radix Tree is what you're looking for?

A radix tree, Patricia trie/tree, or crit bit tree is a specialized set data structure based on the trie that is used to store a set of strings. In contrast with a regular trie, the edges of a Patricia trie are labelled with sequences of characters rather than with single characters. These can be strings of characters, bit strings such as integers or IP addresses, or generally arbitrary sequences of objects in lexicographical order. Sometimes the names radix tree and crit bit tree are only applied to trees storing integers and Patricia trie is retained for more general inputs, but the structure works the same way in all cases.

LWN.net also has an article describing this data structures use in the Linux kernel.

Robert S. Barnes
+2  A: 

As a corollary to Jason's answer, if you can map your strings to rational numbers, such a mapping is very straightforward. If code(c) is the ASCII code of the character c and s[i] is theith character in the string s, just sum like follows:

result <- 0
scale  <- 1
for i from 1 to length(s) 
  scale <- scale / 26
  index <- (1 + code(s[i]) - code('a'))
  result <- result + index / scale
end for
return result

This maps the empty string to 0, and every other string to a rational number between 0 and 1, maintaining lexicographical order. If you have arbitrary-precision decimal floating-point numbers, you can replace the division by powers of 26 with powers of 100 and still have exactly representable numbers; with arbitrary precision binary floating-point numbers, you can divide by powers of 32.

Pillsy
Yes, and as you mention, the key here is that you absolutely have to have arbitrary-precision. Similar reasoning as in my answer shows why this must be the case. We have that `f(a_n) < f(b)` for all `n`. If there is a number `e > 0` such that `f(a_(n+1)) - f(a_n) > e` for all `n` then choose `N` so large that `N * e + f(a_1) > f(b)`. Then `f(a_N) = f(a_N) - f(a_(N-1) + f(a_(N-1)) - f(a_(N-2)) + f(a_(N-2)) - ... + f(a_1) > N * e + f(a_1) > f(b)`. So, there is no such `e` and we conclude that the `f(a_n)` must get arbitrarily close to each other and thus we require arbitrary-precision.
Jason
+2  A: 

Hi,

what you are asking for is a a temporary suspension of the pigeon hole principle (http://en.wikipedia.org/wiki/Pigeonhole%5Fprinciple).

The strings are the pigeons, the numbers are the holes. There are more pigeons than holes, so you can't put each pigeon in its own hole.

Geert Baeyaert