views:

117

answers:

5

This is a hard one (for me) I hope people can help me. I have some text and I need to transfer it to a number, but it has to be unique just as the text is unique.

For example: The word 'kitty' could produce 12432, but only the word kitty produces that number. The text could be anything and a proper number should be given.

One problem the result integer must me a 32-bit unsigned integer, that means the largest possible number is 2147483647. I don't mind if there is a text length restriction, but I hope it can be as large as possible.


My attempts. You have the letters A-Z and 0-9 so one character can have a number between 1-36. But if A = 1 and B = 2 and the text is A(1)B(2) and you add it you will get the result of 3, the problem is the text BA produces the same result, so this algoritm won't work.

Any ideas to point me in the right direction or is it impossible to do?

A: 

Build a dictionary out of words mapped to unique numbers and use that, that's the best you can do.

I doubt there are more than 2^32 number of words in use, but this is not the problem you're facing, the problem is that you need to map numbers back to words.

If you were only mapping words over to numbers, some hash algorithm might work, although you'd have to work a bit to guarantee that you have one that won't produce collisions.

However, for numbers back to words, that's quite a different problem, and the easiest solution to this is to just build a dictionary and map both ways.

In other words:

AARDUANI = 0
AARDVARK = 1
...

If you want to map numbers to base 26 characters, you can only store 6 characters (or 5 or 7 if I miscalculated), but not 12 and certainly not 20.

Unless you only count actual words, and they don't follow any good countable rules. The only way to do that is to just put all the words in a long list, and start assigning numbers from the start.

Lasse V. Karlsen
+5  A: 

Your idea is generally sane, only needs to be developed a little.

Let f(c) be a function converting character c to a unique number in range [0..M-1]. Then you can calculate result number for the whole string like this.

f(s[0]) + f(s[1])*M + f(s[2])*M^2 + ... + f(s[n])*M^n

You can easily prove that number will be unique for particular string (and you can get string back from the number).

Obviously, you can't use very long strings here (up to 6 characters for your case), as 36^n grows fast.

Nikita Rybak
Ah, this is a more formal definition of what I typed. +1
Gareth
26^n with roman alphabet
kriss
@kriss I've taken range from the question: _Mark_ wanted numbers and letters.
Nikita Rybak
@Nikita: yes, my error.
kriss
A: 

Imagine you were trying to store Strings from the character set "0-9" only in a number (the equivalent of obtaining a number of a string of digits). What would you do?

Char 9 8 7 6 5 4 3 2 1 0
Str  0 5 2 1 2 5 4 1 2 6

Num = 6 * 10^0 + 2 * 10^1 + 1 * 10^2...

Apply the same thing to your characters.

Char 5 4 3 2 1 0 
Str  A B C D E F
L = 36

C(I): transforms character to number: C(0)=0, C(A)=10, C(B)=11, ...

Num = C(F) * L ^ 0 + C(E) * L ^ 1 + ...
Mark Peters
A couple of notes: This is reversible, which I assume you want as a requirement since you want to prevent collisions, and second, you don't need to do it in backwards order like I did; I just used descending indices to demonstrate the similarity to an `atoi` type function.
Mark Peters
A: 

If it's correctly spelled text in some language, you can have a number for each word. However you'd need to consider all possible plurals, place and people names etc. which is generally impossible. What sort of text are we talking about? There's usually going to be some existing words that can't be coded in 32 bits in any way without prior knowledge of them.

Can you build a list of words as you go along? Just give the first word you see the number 1, second number 2 and check if a word has a number already or it needs a new one. Then save your newly created dictionary somewhere. This would likely be the only workable solution if you require 100% reliable, reversible mapping from the numbers back to original words given new unknown text that doesn't follow any known pattern.

With 64 bits and a sufficiently good hash like MD5 it's extremely unlikely to have collisions, but for 32 bits it doesn't seem likely that a safe hash would exist.

jjrv
A: 

Just treat each character as a digit in base 36, and calculate the decimal equivalent?

So:

'A' = 0
'B' = 1
[...]
'Z' = 25
'0' = 26
[...]
'9' = 35
'AA' = 36
'AB' = 37
[...]
'CAB' = 46657
Gareth