tags:

views:

649

answers:

9

What is the easiest way in Java to map strings (Java String) to (positive) integers (Java int), so that

  • equal strings map to equal integers, and
  • different strings map to different integers?

So, similar to hashCode() but different strings are required to produce different integers. So, in a sense, it would be a hasCode() without the collision possibility.

An obvious solution would maintain a mapping table from strings to integers, and a counter to guarantee that new strings are assigned a new integer. I'm just wondering how is this problem usually solved. Would also be interesting to extend it to other objects than strings.

+4  A: 

There's not going to be an easy or complete solution. We use hashes because there are way more possible Strings than there are ints. Collisions are just a limitation of using a finite number of bits to represent integers.

Bill the Lizard
+1  A: 

Can you use a Map to indicate which Strings you already have assigned integers to? That's kind of the "database-y" solution, where you assign each String a "primary key" from a sequence as it comes up. Then you put the String and Integer pair into a Map so you can look it up again. And if you need the String for a given Integer, you can also put the same pair into a Map.

Paul Tomblin
A: 

Since Strings in java are unbounded in length, and each character has 16 bits, and ints have 32 bits, you could only produce a unique mapping of Strings to ints if the Strings were up to two characters. But you could use BigInteger to produce a unique mapping, with something like:

String s = "my string";
BigInteger bi = new BigInteger(s.getBytes());

Reverse mapping:

String str = new String(bi.toByteArray());
Avi
This is a pretty good solution, but the BigIntegers returned can be negative. I think the positive int part was a pretty arbitrary requirement by the OP, though.
Bill the Lizard
True. One could construct a similar mapping using only positive BigIntegers, but it would be trickier, since BigInteger doesn't have a toByteArray() method that ignores the sign (which would be a counterpart to the BigInteger(int, byte[]) constructor.
Avi
actually the method proposed by the OP can produce guaranteed unique codes for up to MAXINT strings.
frankodwyer
Problem with this is that BigInteger is itself a sort of string (dynamically allocated variable length), so why bother? The real problem is in the question itself.
Daniel Earwicker
+3  A: 

I'd try to do by introducing an object holding Map and Map. Adding Strings to that object (or maybe having them created from said object) will assign them an Integer value. Requesting a Integer value for a String already registered will return the same value.

Drawbacks: Different launches will yield different Integers for the same String, depending on order unless you somehow persist the whole thing. Also, it's not very object oriented and requires a special object to create/register a String. Plus side: It's quite similar to internalizing Strings and easily understandable. (Also, you asked for an easy, not elegant way.)

For the more general case, you might create a high level subclass of Object, introduce a "integerize" method there and extend every single class from that. I think, however, that road leads to tears.

Urs Reupke
A: 

If by integer you mean the data type, then as other posters have explained this is quite impossible, due to the fact that the integer data type is of fixed size, and strings are unbound.

However if you simply mean a positive number, then theoretically you should be able to interpret the string as if it were an "integer" simply by regarding it as a byte array (in a consistent encoding). You could also treat it as an array of integers of arbitrary length, but if you can do that why not just use a string? :)

Implementation speaking, this is usually "solved" by using a hash code and simply double-checking any collisions, since there are likely to be none anyway and on the off chance there is a collision, it still works out to be constant time. However if this isn't applicable, I'm not sure what the best solution would be.

Interesting question.

chaiguy
+4  A: 

In most hashcode() type implementations, collisions are accepted as inevitable and tested for.

If you absolutely must have no collisions, guaranteed, the solution you outline will work.

Aside from this, there are cryptographic hash functions such as MD5 and SHA, where collisions are extremely unlikely (though with a lot of effort can be forced). The Java Cryptography Architecture has implementations of these. Those methods may perhaps be faster than a good implementation of your solution for very large sets. They will also execute in constant time and give the same code for the same string, no matter which order the strings are added in. Also, it doesn't require storing each string. Crypto hash results could be considered as integers but they won't fit in a java int - you could use a BigInteger to hold them as suggested in another answer.

Incidentally, if you're put off by the idea of a collision being 'extremely unlikely', it's probably similar likelihood that a bit would randomly flip in your computer memory or hard disk and cause any program to behave differently than you expect :-)

Note, there are also some theoretical weaknesses in some hash functions (e.g. MD5) but for your purposes that probably doesn't matter and you could just use the most efficient such function - those weaknesses are only relevant if someone is maliciously trying to come up with strings that have the same code as another string.

edit: I just noticed in the title of your question, it seems you want bidirectional mapping, though you don't actually state this in the question. It is (by design) not possible to go from a Crypto hash to the original string. If you really need that, you'd have to store a map keying hashes back to strings.

frankodwyer
+4  A: 

Have a look at perfect hashing.

Dan Dyer
A: 

As you outline, a hash table that resolves collisions is a standard solution. You could also use a Bentley/Sedgewick style search trie, which in many applications is faster than hashing.

If you substitute 'unique pointer' for 'unique integer' you can see Dave Hanson's solution to this problem in C. This is quite a nice abstraction because

  • The pointers can still be used as C strings.

  • Equal strings hash to equal pointers, so strcmp can be dispensed with in favor of pointer equality, and the pointers can be used as keys in other hash tables.

If Java offers a test for object identity on String objects then you can play the same game there.

Norman Ramsey
+1  A: 

This is impossible to achieve without any restrictions, simply because there are more possible Strings than there are integers, so eventually you will run out of numbers.

A solution is only possible when you limit the number of usable Strings. Then you can use a simple counter. Here is a simple implementation where all (2^32 = 4294967296 different strings) can be used. Never mind that it uses lots of memory.

import java.util.HashMap;
import java.util.Map;

public class StringToInt {

    private Map<String, Integer> map;

    private int counter = Integer.MIN_VALUE;

    public StringToInt() {
     map = new HashMap<String, Integer>();
    }

    public int toInt(String s) {
     Integer i = map.get(s);
     if (i == null) {
      map.put(s, counter);
      i = counter;
      ++counter;
     }
     return i;
    }
}
martinus
There is a bug in this code. If i == null you should return the counter.
Kaarel
Thanks Kaarel, I have fixed the bug
martinus