  1. What's the best way to convert (hash) a string like 3800290030, which represents a classification id, into a four-character one like 3450? (I need to support at most 9999 classes.) We will have fewer than 1000 classes in the 10-character space, and it will never grow to more than 10k.
  2. The hash needs to be unique and always the same for the same input.
  3. The resulting string should be numeric (but it will be saved as char(4) in SQL Server).

I removed the requirement for reversibility.

This is my solution, please comment:

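        // Requires: using System.Security.Cryptography; using System.Text;
        string classTIC = "3254002092";
        MD5 md5Hasher = MD5.Create();

        // Hash the id; MD5 always yields 16 bytes.
        byte[] classHash = md5Hasher.ComputeHash(Encoding.Default.GetBytes(classTIC));
        StringBuilder sBuilder = new StringBuilder();

        // Concatenate the decimal value of each byte into one long digit string.
        foreach (byte b in classHash)
        {
            sBuilder.Append(b.ToString());
        }

        // Reduce the (very large) number to the range 1..9999.
        string newClass = (double.Parse(sBuilder.ToString())%9999 + 1).ToString();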
+2  A: 
  1. You can do something like

    Math.Abs(str.GetHashCode() % 9999) + 1;

  2. The hash can't be unique since you have more than 9,999 strings

  3. It is not unique so it cannot be reversible

Of course, my answer is wrong if you don't have more than 9999 different 10-character classes.

In that case, you need a mapping from each string id to its 4-character representation. For example, save the strings in a list, and each string's key will be its index in the list.

Itay
Yes, I have fewer than 1000 classes in the 10-character space, and it will never grow to more than 10k.
mare
I edited my answer
Itay
Saving to a list and using the index as the new representation would require additional logic to prepend zeros to the index (0005 for index 5), and I would like to avoid this kind of manipulation and filling.
mare
You can use the string.PadLeft method to add 0s on the left, and int.Parse to parse the string to an integer.
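
A minimal sketch of the padding suggested in this comment. The `FourDigit` class and its `Format` method are hypothetical names used only for illustration; `PadLeft`, `int.Parse` and the `"D4"` format string are standard .NET calls.

```csharp
using System;

static class FourDigit
{
    // Left-pads an index in the range 0..9999 to exactly four digits.
    public static string Format(int index)
    {
        if (index < 0 || index > 9999)
            throw new ArgumentOutOfRangeException(nameof(index));
        return index.ToString().PadLeft(4, '0');
    }

    static void Main()
    {
        Console.WriteLine(Format(5));          // 0005
        Console.WriteLine(5.ToString("D4"));   // 0005, same result via a format string
        Console.WriteLine(int.Parse("0005"));  // 5
    }
}
```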
Itay
I appreciate your effort and have tried your solution but it's too error prone. Also I would like to lose the order - classifications in new space should not be in the same order as those in the source and I would like to avoid those with padded zeros. I have come up with my own solution, which I post in the question and you can evaluate it and post comments.
mare
+1  A: 
  1. Ehm, no idea.
  2. Unique is difficult: your request allows 4 characters, which is a maximum of 9999 values, so collisions will occur.
  3. A hash is not reversible. Data is lost (obviously).
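
To put a rough number on "collisions will occur", here is the standard birthday-problem approximation. It assumes an idealized hash that spreads n ids uniformly over m = 9999 values, which is an assumption and not a property of any particular function in this thread:

\[
P(\text{at least one collision}) \approx 1 - e^{-\frac{n(n-1)}{2m}}
\]

With n = 1000 and m = 9999 the exponent is about 49.96, so a collision is practically certain. Even about 118 ids already give roughly a 50% chance.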
rdkleine
+2  A: 

If you want to reverse the process and know nothing about the ids apart from there being at most 9999 of them, I think you need a translation dictionary that maps each id to its short version.

Even without the need to reverse the process, I don't think there is a way to guarantee unique ids without such a dictionary.

This short version could then simply be incremented by one with each new id.

Jens
A: 

Convert the number to base35/base36

ex: 3800290030 decimal = 22CGHK5 base-35 //length: 7

Or maybe convert to Base60 [ignoring capital O and small o so they are not confused with 0]

ex: 3800290030 decimal = 4tDw7A base-60 //length: 6
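
A minimal sketch of the base-35 conversion, assuming the alphabet is the digits 0-9 followed by the letters A-Y (the answer does not state its alphabet; this one reproduces the base-35 example above). `BaseN`, `Encode` and `Decode` are hypothetical names.

```csharp
using System;
using System.Text;

static class BaseN
{
    // Assumed base-35 alphabet: 0-9 followed by A-Y.
    const string Alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXY";

    public static string Encode(long value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        if (value == 0) return "0";
        var sb = new StringBuilder();
        while (value > 0)
        {
            // Prepend the least significant remaining digit.
            sb.Insert(0, Alphabet[(int)(value % Alphabet.Length)]);
            value /= Alphabet.Length;
        }
        return sb.ToString();
    }

    public static long Decode(string text)
    {
        long result = 0;
        foreach (char c in text)
        {
            int digit = Alphabet.IndexOf(c);
            if (digit < 0) throw new FormatException("Unexpected character: " + c);
            result = result * Alphabet.Length + digit;
        }
        return result;
    }

    static void Main()
    {
        Console.WriteLine(Encode(3800290030L));  // 22CGHK5
        Console.WriteLine(Decode("22CGHK5"));    // 3800290030
    }
}
```

Since 35^4 = 1,500,625 is far below 10^10, four base-35 characters cannot cover every 10-digit id, which is why the result is seven characters long.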

this. __curious_geek
Indeed ... does the resulting string *have* to be numeric? If not, it's not so hard. Not that base 35 works for getting only four characters.
Joren
check the revised constraints please
mare
I don't know what can be stored in a char, but even `256^4 < 10^10`. So simply storing the entire ID in the shortId field does not work. (unless char can be any UTF char...)
Jens
A: 

I think you might need to create and store a lookup table to be able to support your requirements. In that case you don't even need a hash; you could just increment the last used 4-digit lookup code.

ho1
A: 

Convert your int to binary and then base64 encode it. It won't be numbers then, but it will be a reversible hash.

Edit:

As far as I can tell, you are asking for the impossible.

You cannot take totally random data and somehow reduce the amount of data it takes to encode it (some might be shorter, others might be longer). Thus your requirement that the number be unique is not possible: there has to be some data loss somewhere, and no matter how you do it, it won't ensure uniqueness.

Second, due to the above, it is also not possible to make it reversible. Thus that is out of the question.

Therefore, the only possible way I can see is if you have an enumerable data source, i.e. you know all the values prior to calculating the value. In that case you can simply assign them a sequential id.

Cine
it has to be numbers
mare
Of course it can be unique, after I posted that the source does not contain more than 10k entries. We are translating 10k or fewer to four digits. I would like to avoid sequential, if possible.
mare
If you have an enumerable source, your problem is trivial: `Random r = new Random(123); int[] ids = new int[10000]; for (int i = 0; i < 10000; i++) ids[i] = r.Next(10000);` Now you just take from the ids in sequential order, and you still have random numbers. And because you use a seed, it is always the same random numbers.
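
A hedged variation on this idea: independent random draws can repeat, so this sketch swaps in a seeded Fisher-Yates shuffle of 0..9999, which cannot produce duplicates. `ShuffledCodes` and `Build` are hypothetical names. It also assumes the assignment is stored once it has been handed out, because the sequence a seeded `Random` produces is not guaranteed to stay identical across .NET versions.

```csharp
using System;
using System.Linq;

static class ShuffledCodes
{
    // Returns the numbers 0..count-1 in a seed-determined shuffled order.
    public static int[] Build(int count, int seed)
    {
        int[] codes = Enumerable.Range(0, count).ToArray();
        var rng = new Random(seed);
        for (int i = codes.Length - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);  // 0 <= j <= i
            int tmp = codes[i];
            codes[i] = codes[j];
            codes[j] = tmp;
        }
        return codes;
    }

    static void Main()
    {
        int[] codes = Build(10000, 123);
        Console.WriteLine(codes.Distinct().Count());  // 10000: every code is unique
        Console.WriteLine(codes[0].ToString("D4"));   // first code to hand out
    }
}
```

The n-th new class then takes `codes[n]`, so the codes are handed out in a non-sequential order while staying unique.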
Cine
A: 

Use md5 or sha, like:

    string = substring(md5("05910395410"),0,4)

or write your own simple method, for example:

    int sum = 0;
    foreach (char c in str)
    {
        sum += (int)c;
    }
    sum %= 9999;

This will not be unique, and hence is irreversible.
Itay
+1  A: 

You do not want a hash. Hashing by design allows for collisions. There is no possible hashing function for the kind of strings you work with that won't have collisions.

You need to build a persistent mapping table to convert the string to a number. Logically similar to a Dictionary<string, int>. The first string you add gets number 0. When you need to map, look up the string and return its associated number. If it is not present, then add the string and simply assign it a number equal to the count.

Making this mapping table persistent is what you'll need to think about. Trivially done with a database, of course.
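
A minimal in-memory sketch of the mapping table described above. `ClassCodeMap` and `GetCode` are hypothetical names, and persistence is left out: a real version would load and save the table, for example in the database.

```csharp
using System;
using System.Collections.Generic;

class ClassCodeMap
{
    private readonly Dictionary<string, int> _codes = new Dictionary<string, int>();

    // Returns the existing code for classId, or assigns the next free one.
    public string GetCode(string classId)
    {
        int code;
        if (!_codes.TryGetValue(classId, out code))
        {
            if (_codes.Count >= 10000)
                throw new InvalidOperationException("No four-digit codes left.");
            code = _codes.Count;  // the first string gets 0, the next 1, ...
            _codes.Add(classId, code);
        }
        return code.ToString("D4");
    }

    static void Main()
    {
        var map = new ClassCodeMap();
        Console.WriteLine(map.GetCode("3800290030"));  // 0000
        Console.WriteLine(map.GetCode("3254002092"));  // 0001
        Console.WriteLine(map.GetCode("3800290030"));  // 0000 again
    }
}
```

Because every code comes from the table itself, uniqueness holds by construction and the mapping can be reversed with a lookup in the other direction.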

Hans Passant
This is by far the best explanation so far of what I need and what I have to do, and I'm going to do it this way. Itay was close too.
mare