I have set up a circular linked list data structure that represents a word, and each element in the list is a letter from the word. At the bottom of my question are the class definitions of the list and element of the list.

The purpose of the list data structure is to be able to compare cyclic words. So... "picture" and "turepic" are the same cyclic word, so the two lists will be equal.

So I override equals() when comparing two lists, and I've read that whenever you have to override equals(), you have to also override hashCode(). However, I don't really have a good idea of how to do that.

How should I define a good hashCode for what I have set up? What things should I consider? In the example of the "picture" and "turepic", the two lists are equal so their hashCode needs to be the same. Any ideas?

Thanks, Hristo

public class Letter {
 char value;
 Letter theNextNode;

 /**
  * Default constructor for an element of the list.
  * 
  * @param theCharacter - the value for this node.
  */
 Letter(char theCharacter) {
  this.value = theCharacter;
 }
}


public class CircularWord {

 /*
  * Class Variables
  */
 Letter head;
 Letter tail;
 Letter theCurrentNode;

 int iNumberOfElements;


 /**
  * Default Constructor. All characters that make up 'theWord' are stored in a 
  * circular linked list structure where the tail's NEXT is the head. 
  */
 public CircularWord(String theWord) {

  char[] theCharacters = theWord.toCharArray();

  for (int iIndex = 0; iIndex < theCharacters.length; iIndex++) {
   this.addElement(theCharacters[iIndex]);
  }

  this.theCurrentNode = head;
  this.iNumberOfElements = theCharacters.length;
 }

 /**
  * Appends a letter and keeps the list circular (the tail's NEXT is the head).
  */
 void addElement(char theCharacter) {
  Letter theNewNode = new Letter(theCharacter);
  if (head == null) {
   head = theNewNode;
  } else {
   tail.theNextNode = theNewNode;
  }
  tail = theNewNode;
  tail.theNextNode = head;
 }
}
+1  A: 

How about the sum of the hashcodes of all the elements inside your list, each multiplied by an arbitrary value?

Something like

hashCode = 1;
for (char c : myChars) {
    hashCode += 31 * c;
}
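To see what this commutative sum does and does not distinguish, here is a self-contained sketch (the class and method names are just for illustration):

```java
public class CyclicHashDemo {

    // Commutative hash: adds 31 * c for every character, so any
    // permutation of the same letters produces the same value.
    static int cyclicHash(String word) {
        int hash = 1;
        for (char c : word.toCharArray()) {
            hash += 31 * c;
        }
        return hash;
    }

    public static void main(String[] args) {
        // Rotations collide, as the hashCode contract requires:
        System.out.println(cyclicHash("picture") == cyclicHash("turepic")); // true
        // ...but so does the mere anagram "tuerpic", which equals() must reject:
        System.out.println(cyclicHash("picture") == cyclicHash("tuerpic")); // true
    }
}
```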
Vivien Barousse
Can you explain why you think that works? I could do that, I just don't understand it. Remember, the elements in one list might be the same as in another, but they must be in a circular order... so summing might not be ideal.
Hristo
`a+b+c` is the same as `b+c+a`. So summing up the terms would give the same value if your lists contain the same elements, no matter what their order is.
Vivien Barousse
it's not just sum of hashcodes ... you missed 31*hashCode part, which makes position important. If there are two elements, with hashes 50 and 100, then 31*(31*1 + 50)+100 != 31*(31*1 + 100)+50.
Peter Štibraný
@vivien: the code that you present is not the sum of the hashcodes of all elements. Note `31*hashCode` - previous hashcode value is multiplied by 31. So with two elements it is = `31*(31*1 + hashcode1) + hashcode2`. Definitely not suitable for a circular list.
abhin4v
@Hristo -- see if my solution helps. I use subtraction instead of addition. Subtraction is non-commutative and so it would seem that it would help in your case, which is order dependent.
Vivin Paliath
Erf, bad reading... I read `hashcode = hashcode + 31 * obj.hashCode();`. Updated though...
Vivien Barousse
@Vivin: His case is NOT order dependent ;)
Vivien Barousse
That would give different hashcodes for "picture" and "turepic".
Péter Török
@Vivien oops... My bad :p But if you are just using addition, "picture", "turepic" and "tuerpci" would also be equal, would they not?
Vivin Paliath
@Vivin: That's what is being asked. "In the example of the "picture" and "turepic", the two lists are equal so their hashCode needs to be the same."
Vivien Barousse
@Péter: Code has been updated. Now the hashcode would be the same
Vivien Barousse
@all... I think subtraction is the proper approach. "picture" and "tuerpic" will always add to the same value, but they are not equal lists, so subtraction will be able to distinguish that. Do I have to worry about negative numbers?
Hristo
@Vivien ah yes. I just remembered that if two objects are equal, they must have the same hashcodes, but if they have the same hashcodes, they are not necessarily equal!
Vivin Paliath
@Hristo: With subtractions, "picture" and "turepic" would have different hashcodes, although they are equal. That breaks the hashCode contract. With addition, you have some conflicts, but that doesn't break the hashCode contract (conflicts are tolerated).
Vivien Barousse
@Vivien... good point, but how would I avoid the problem with "picture" and "tuerpic" being the same since addition results in the same value, but the lists are NOT equal?
Hristo
@Vivien... also, why do you use `31`?
Hristo
@Hristo as said in the answer, 31 is an arbitrary value. And it's always better if you use a prime number here.
Colin Hebert
@Hristo: As for the why, to have less collisions. "abg" would collide with "aaag" without this 31, for example.
Vivien Barousse
@Vivien and @Colin... thanks for the explanation. That makes a lot of sense.
Hristo
31 preventing collisions in the current formula is hogwash. "az", "by", "cx", "dw", "ev", ... all get the same hash code with that formula. One can achieve far fewer collisions with very little additional effort ... as my answer shows.
meriton
+12  A: 

So you want a hashcode calculation which gives equal results for "picture" and "turepic", but (preferably) different from the hashcode of e.g. "eruptic". Thus it is not enough to simply add up the hashcodes of the letters contained in the word - you need to have some position information too, but still, it should be independent of the actual permutation of the word. You need to define "equivalence classes", and always calculate the same hashcode for each member of the class.

The easiest way to achieve this is to select a specific member of the equivalence class and always use the hashcode of that variation for all equivalent words. E.g. select the first variant alphabetically (thanks @Michael for summing it up concisely). For "picture" et al., that would be "cturepi". Both "picture" and "turepic" (and all the other equivalent variations) should return the hash code of "cturepi". That hash code could be calculated by the standard LinkedList method, or any other preferred way.

One might say that this calculation is very expensive. True, but one could cache the result, so that only the first calculation would be costly. And I guess the selection of the first alphabetical variant could be optimized considerably in the common case (compared to the trivial solution of generating all permutations in the specific equivalence class, then sorting them and picking the first).

E.g. in many of the words, the first letter alphabetically is unique ("picture" is one of these - its first letter alphabetically is 'c', and there is only one 'c' in it). So you only need to find it, then calculate the hashcode starting from there. If it is not unique, you need to compare the second, third, etc. letters after that, until you find a difference (or you roll over).

Update 2 - examples

  • "abracadabra" contains 5 'a's. The 2nd characters after the 'a's are 'b', 'c', 'd', 'b' and 'a', respectively. So in the 2nd round of comparison you can conclude that the lexicographically smallest variation is "aabracadabr".
  • "abab" contains 2 'a's, and a 'b' after each (and then you roll over, reaching an 'a' again, so the quest ends there). So you have two identical lexicographically smallest variations of it. But since they are identical, they obviously produce the same hashcode.

Update: In the end, it all boils down to how much you actually need the hashcode - i.e. do you plan to put your circular lists into an associative collection like Set or Map? If not, you can make do with a simple, or even trivial, hash method. But if you use some associative collection heavily, a trivial hash implementation gives you lots of collisions and thus suboptimal performance. In this case, it is worth a try implementing this hash method and measuring whether it pays for itself in performance.

Update 3: sample code

Letter is basically left the same as above, I only made the fields private, renamed theNextNode to next, and added getters/setters as needed.

In CircularWord I made some more changes: dropped tail and theCurrentNode, and made the word really circular (i.e. last.next == head). The constructor, toString and equals are not relevant for computing the hashcode, so they are omitted for the sake of simplicity.

import java.util.LinkedHashSet;
import java.util.Set;

public class CircularWord {
    private final Letter head;
    private final int numberOfElements;

    // constructor, toString(), equals() omitted

    @Override
    public int hashCode() {
        return hashCodeStartingFrom(getStartOfSmallestRotation());
    }

    private Letter getStartOfSmallestRotation() {
        if (head == null) {
            return null;
        }
        Set<Letter> candidates = allLetters();
        int counter = numberOfElements;

        while (candidates.size() > 1 && counter > 0) {
            candidates = selectSmallestSuccessors(candidates);
            counter--;
        }
        return rollOverToStart(counter, candidates.iterator().next());
    }

    private Set<Letter> allLetters() {
        Set<Letter> letters = new LinkedHashSet<Letter>();
        Letter letter = head;

        for (int i = 0; i < numberOfElements; i++) {
            letters.add(letter);
            letter = letter.getNext();
        }
        return letters;
    }

    private Set<Letter> selectSmallestSuccessors(Set<Letter> candidates) {
        Set<Letter> smallestSuccessors = new LinkedHashSet<Letter>();

        char min = Character.MAX_VALUE;
        for (Letter letter : candidates) {
            Letter nextLetter = letter.getNext();
            if (nextLetter.getValue() < min) {
                min = nextLetter.getValue();
                smallestSuccessors.clear();
            }
            if (nextLetter.getValue() == min) {
                smallestSuccessors.add(nextLetter);
            }
        }
        return smallestSuccessors;
    }

    private Letter rollOverToStart(int counter, Letter lastCandidate) {
        for (; counter >= 0; counter--) {
            lastCandidate = lastCandidate.getNext();
        }
        return lastCandidate;
    }

    private int hashCodeStartingFrom(Letter startFrom) {
        int hash = 0;
        Letter letter = startFrom;
        for (int i = 0; i < numberOfElements; i++) {
            hash = 31 * hash + letter.getValue();
            letter = letter.getNext();
        }
        return hash;
    }

}

The algorithm implemented in getStartOfSmallestRotation to find the lexicographically smallest rotation of the word is basically what I describe above: compare and select the lexicographically smallest 1st, 2nd, 3rd etc. letters of each rotation, dropping the greater letters until either there is only one candidate left, or you roll over the word. Since the list is circular, I use a counter to avoid an infinite loop.

In the end, if I have a single candidate left, it may be in the middle of the word and I need to get the start of the smallest word rotation. However, as this is a singly-linked list, it is awkward to step backwards in it. Luckily, the counter nicely helps me out: it has recorded how many letters I have compared so far, but in a circular list this is equivalent to how many letters I can move forward before rolling over. Thus I know how many letters to move forward in order to get again to the beginning of the minimal word rotation I am looking for.
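For comparison, the same canonicalization idea can be sketched much more compactly on plain Strings (a hypothetical helper, not part of the list classes above), using the fact that every rotation of a word of length n appears as a length-n substring of the word concatenated with itself:

```java
public class CanonicalRotation {

    // Returns the lexicographically smallest rotation of a word by
    // scanning all n rotations inside the doubled string. O(n^2) worst case.
    static String smallestRotation(String word) {
        String doubled = word + word;
        String best = word;
        for (int i = 1; i < word.length(); i++) {
            String candidate = doubled.substring(i, i + word.length());
            if (candidate.compareTo(best) < 0) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(smallestRotation("picture")); // cturepi
        System.out.println(smallestRotation("turepic")); // cturepi
    }
}
```

All rotations map to the same canonical form, so hashing that form satisfies the contract.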

Hope this helps someone - at least it was fun to write :-)

Péter Török
You might have some problems if your word contains the same letter twice. What about `alphabet` and `abetalpha`?
Vivien Barousse
@Vivien, I was about to cover that case - see my update.
Péter Török
More simply put, the first variant alphabetically?
Michael Brewer-Davis
@Michael, yes - good point, that makes it even simpler to select one, thanks :-)
Péter Török
+1: Will result in very few collisions, even though finding the alphabetically first word can be somewhat expensive (worst case: O(length^2)).
meriton
I really like your idea, but I'm not sure how to implement it. I've never dealt with hash codes before, and it seems a little too computationally intensive to do this. Can you perhaps provide an example?
Hristo
Sort the list first and then calculate its usual hash, is that what you mean?
abhin4v
@abhin4v, no - see the example I gave.
Péter Török
@Péter... thanks for your help. I really think your solution will result in the "perfect" hashcode I'm looking for, but for the purpose of this project, the time spent on perfecting the hash code won't yield equivalent benefits. But I will keep this in mind whenever I am confronted with a similar situation and the hashcode needs to be perfect. Thanks!
Hristo
@Hristo, you might be right - my first reaction to your question was actually that you may not need to implement such a hashcode. It would only be needed if you plan to store your circular lists in an associative container like Map or Set. But then I started to think about the problem nevertheless, and came up with this idea...
Péter Török
@Hristo, note that calculating the hashcode itself would not be an issue here - you could copy the standard algorithm from `java.util.AbstractList`. The trickier part would be finding the first variant alphabetically, and with reasonable speed. It's getting late night where I am, so I won't get into producing a working code sample, but I may come back to it sometime next week, just for the fun :-)
Péter Török
@Peter: sorting and taking the usual hashcode does seem to work. see the code here: http://www.ideone.com/mapTF
abhin4v
@abhin4v, yes, but it gives the same hashcode for "picture" and "ciputer" (and any other of its permutations).
Péter Török
I still don't see how you can normalize all possible words; "aabbcc" has no unique letters.
TokenMacGuy
@TokenMacGuy: the two variants starting with 'a', in your case, are aabbcc and abbcca, and the first one is lexicographically smaller and would be chosen. That's what he meant by "compare the second, third, etc. letters after that, until you find a difference".
Blaisorblade
Oh, well I guess that could work. There is at most one unique lexicographically minimal rotation. But I'm not sure that's what Péter Török actually stated. If this were my code, I'd take care of that step when the object is created (the first permutation available is the lexicographically least).
TokenMacGuy
@TokenMacGuy, I meant exactly what @Blaisorblade assumed. Yes, you can find the lexicographically smallest variation at creation time, but you can also compute it lazily, only when (if) it is needed.
Péter Török
@Hristo, FYI I added a sample implementation to my answer.
Péter Török
A: 

I misread your question - I thought you wanted different hashcodes for "picture" and "turepic". I think in this case, you can get a hint from the fact that two objects that are equal must have the same hash code, but two objects that have the same hash code are not necessarily equal.

So you can use Vivien's solution which will guarantee that "picture" and "turepic" will have the same hash code. However, it also means that "picture" and "pitcure" would have the same hash codes as well. In this case, your equals method will have to be smarter and will have to figure out if the two list of letters actually represent the same word. Essentially your equals method helps resolve the collision that you can get from "picture"/"turepic" and "pitcure".
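To illustrate that flow, here is a hypothetical stand-in class (not the OP's code) with a commutative hash and a rotation-aware equals, dropped into a HashSet:

```java
import java.util.HashSet;
import java.util.Set;

public class CollisionDemo {

    static final class CyclicWord {
        final String word;

        CyclicWord(String word) { this.word = word; }

        // Commutative hash: all permutations of the same letters collide.
        @Override
        public int hashCode() {
            int hash = 0;
            for (char c : word.toCharArray()) {
                hash += 31 * c;
            }
            return hash;
        }

        // Rotation-aware equality: w2 is a rotation of w1 iff they have the
        // same length and w2 occurs inside w1 + w1.
        @Override
        public boolean equals(Object o) {
            if (!(o instanceof CyclicWord)) {
                return false;
            }
            CyclicWord other = (CyclicWord) o;
            return word.length() == other.word.length()
                    && (word + word).contains(other.word);
        }
    }

    public static void main(String[] args) {
        Set<CyclicWord> words = new HashSet<CyclicWord>();
        words.add(new CyclicWord("picture"));
        // Same hash bucket, and equals() confirms the rotation:
        System.out.println(words.contains(new CyclicWord("turepic"))); // true
        // Same hash bucket, but equals() rejects the mere anagram:
        System.out.println(words.contains(new CyclicWord("tuerpic"))); // false
    }
}
```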

Vivin Paliath
This is a really good idea! I'll give it a try. Thanks!
Hristo
That would give different hashcodes for "picture" and "turepic".
Péter Török
Checked your code with a function in Scala: `def hash[T](list: List[T]): Int = list.foldLeft(1) { (acc, e) => (13*acc - (if (e == null) 0 else e.hashCode)) }`. It gives different values for "abcde" and "eabcd".
abhin4v
Erm, a-b is generally not equal to b-a, but "ab" and "ba" are supposed to be equal ...
meriton
@meriton/abhin4v I misread his question. Which is why I qualified my solution with "if it is order dependent". So in that scenario, my solution isn't wrong. I've since edited my answer
Vivin Paliath
@Vivin... I am pretty sure my equals is smart enough to know that "picture" and "pitcure" are not equal. I've set up a bunch of unit tests that test a bunch of different variations. So what would you suggest now, knowing that my equals is legit?
Hristo
@Hristo is your equals smart enough to figure out that `picture` and `turepic` are equal? If that's the case, you should be good to go. In data structures where equality matters (like a HashSet or a HashMap), the code first looks at the hashcodes. If they are unequal, then it is guaranteed that the objects are unequal. If hashcodes are equal, then `equals` is called to actually figure out if the objects are equal or unequal
Vivin Paliath
My equals is smart enough to know "picture" and "turepic" are equal. All my unit tests pass except the super intense one where I use 192,718 words, and each is different. Here's the problem: java.lang.AssertionError: expected:<192718> but was:<192468>. So I believe my hashcode is not perfect.
Hristo
@Hristo It's alright for "picture" and "pitcure" to have the same hashcode as long as you can resolve the actual equality in the `equals` method.
Vivin Paliath
Alright thanks. I'll try to test my equals more thoroughly. Any suggestions on how to test the hashcode?
Hristo
@Hristo The most important thing is that `hashCode` *must* return the same hash code for two objects that are conceptually equal. In your case, "picture" and "turepic" are equal, so you need to make sure that you get the same hashcode for them (or even for "icturep"). However, this also means that "pitcure" has the same hash code. I would recommend a test which consolidates *both* hashcode and equals and actually tests for equality. Like I mentioned, this is what some of the collection classes (like HashMap/HashSet) do, to test for equality.
Vivin Paliath
@Vivin... Got it! My `equals()` was so good that it actually caught my personal error in assuming something was true :) Thanks for your help!
Hristo
+3  A: 
@Override
public int hashCode() {
    int hash = 0;
    Letter letter = head;
    for (int i = 0; i < iNumberOfElements; i++) {
        hash += letter.value * letter.value; // sum of squared char values
        letter = letter.theNextNode;
    }
    return hash;
}

Since + is commutative, equal words will have equal hashcodes. The hashcode is not very discriminating (all letter permutations get the same hash code), but it should do the trick unless you usually put many permutations into the HashSet.

Note: I add c * c rather than simply c in order to get less collisions for distinct letters.

Note 2: Unequal lists with equal hash codes do not violate the contract for hashcode. Such "collisions" should be avoided because they reduce performance, but they do not threaten the correctness of the program. In general, collisions cannot be avoided entirely; it is certainly possible to avoid them more than in my answer, but doing so makes the hashcode more expensive to compute, which might more than eat up any performance gain.
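A standalone String version of this hash (names are illustrative) makes the collision behavior easy to check:

```java
public class SquaredSumHash {

    // Commutative hash: sum of squared character values. All rotations
    // (and, indeed, all other permutations) of a word collide by design.
    static int hash(String word) {
        int h = 0;
        for (char c : word.toCharArray()) {
            h += c * c;
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("picture") == hash("turepic")); // true
        System.out.println(hash("bc") == hash("de"));           // false
    }
}
```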

meriton
That would give the same hashcode not only for "picture" and "turepic", but also for "ciputer" and any other permutation of the same letters.
Péter Török
+1 for pointing out that you don't need to generate different hashcodes for different lists to maintain the contract. Strictly speaking, this is impossible - unless you invent infinite compression and can encode every possible combination of objects in 32 bits!
dty
Nitpick: It's not impossible "strictly speaking", it's just plain impossible (pigeonhole principle) :-).
sleske
A: 
  1. define equals() and hashCode() for Letter. Do this using only the char field.
  2. For CircularWord, implement hashCode() by iterating from head to tail XOR'ing the respective values of Letter.hashCode. Finally XOR the result with some constant.
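A minimal sketch of step 2, assuming Letter.hashCode() simply returns the char value (the final constant here is arbitrary):

```java
public class XorHash {

    // XOR of all letter hashes, mixed with a constant. XOR is commutative,
    // so all rotations (and permutations) of a word collide; letters that
    // occur an even number of times cancel out entirely.
    static int hash(String word) {
        int h = 0;
        for (char c : word.toCharArray()) {
            h ^= c; // stand-in for Letter.hashCode()
        }
        return h ^ 0x9E3779B9; // arbitrary constant
    }

    public static void main(String[] args) {
        System.out.println(hash("picture") == hash("turepic")); // true
        // "abab": the two 'a's and two 'b's cancel pairwise,
        // leaving only the constant.
        System.out.println(hash("abab") == hash("")); // true
    }
}
```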

Another way would be to canonicalize the circular words, representing them as something like:

public class CircularWord {

    private static Set<String> canonicalWords = new HashSet<String>();
    private String canonicalWord;
    private int offset;

    public CircularWord(String word) {
        // Looks for an equal circular word in the set (according to our definition)
        // If found, set canonicalWord to it and calculate the offset.
        // If not found, put the word in the set, set canonical word to our argument and set offset to 0.
    }
    // Implementation of CircularWord methods using
    // canonicalWord and offset
}

You would then implement equals() and hashCode() by delegating to the String implementations.

gpeche
I love XOR, so I'm interested in your idea. However, XORing "picture" and "tuerpic" would result in the same value right? If yes, then that is NOT what I'm looking for since "picture" and "tuerpic" are NOT equal lists.
Hristo
Will have quite a few collisions: "ab", "cd", "ef", "gh", "ij", ... all get the same hashcode.
meriton
@meriton hash('a') ^ hash('b') gives me 3, hash('c') ^ hash('d') gives me 7. Using Java 6 (Sun JVM) and `Character.hashCode()`. Of course there will be a lot of collisions, but with that implementation it is hard to do better.
gpeche
@Hristo You need to guarantee that a = b implies hash(a) = hash(b), but you do not need (and in general it is impossible) to guarantee hash(a) = hash(b) implies a = b.
gpeche
gpeche: My apologies, the example should have been "bc", "de", "fg", "hi", "jk", ... With the code from my answer, these words all get different hashcodes.
meriton
@meriton It will depend on your particular implementation of `Letter.hashCode()`. Note that I said "Do this using only the char field" but did not specify *what* to do with the char field. Of course I thought `Character.hashCode()` would do better, but in Sun JVM it seems to be pretty trivial.
gpeche
@meriton Anyway there is going to be a lot of collisions because with XOR letters appearing an even number of times will cancel each other.
gpeche
That's why I used + instead of XOR :-)
meriton
A: 

Bear in mind that hashcodes are not unique. Two different objects can hash to exactly the same value. Thus hashcode is insufficient for determining equality; you have to do the actual comparison in equals(). [SPECULATIVE COMMENT REMOVED. OMG]

hashcode() can just return a constant in all cases. This may affect performance but it's totally valid. Once you get everything else done, you can work on a more efficient hashcode() algorithm.

This is a good article. Note the 'lazy hashcode' section.

Tony Ennis
"Further, I'm not even sure two identical values have to have the same hashcode" -- but if they don't, weird things will happen if you store your objects in a HashSet or HashMap. Not a good idea.
MatrixFrog
Nothing weird will happen. You'll get a sub-optimal search, according to the site I linked. I don't say that returning a constant is a good idea, only that it will work.
Tony Ennis
No, the search result would be incorrect, as the hash map would look for the value in the wrong bucket. Equal objects must have equal hashcodes. Equal hashcodes need not mean equal objects.
meriton
So the way the site works is for readers to keep pressing -1 until the author removes a speculative statement with an appropriate disclaimer? Got it.
Tony Ennis
@Tony Ennis Of course, that's how it goes. If people had not pushed back against the church, we would still believe that the world is flat. Wrong answer -> -1, especially if it's something fundamental. And instead of guessing, you could go and check (your 2nd statement is wrong, too).
atamanroman
Don't take it personally. The primary purpose is to filter good information from bad. Just keep answering and asking questions, don't think about your score too much, and it will gradually go up. Or so everyone says, and it seems to be mostly true.
MatrixFrog
+5  A: 

Do you actually need to use your hashCodes? If you don't intend to place the object members in any kind of hash structure, you could just ignore the problem:

public int hashCode() {
    return 5;
}

This satisfies the requirement that equal instances have equal hash codes. Unless I knew I needed a better hash distribution, this would probably work well enough for my own needs.

But I think I might have an idea that gives a better distribution of hashes. Pseudocode:

hash = 0
for each rotation
    hash += hash(permutation)
end
hash %= MAX_HASH

Since hash() is likely to be O(n), this algorithm is O(n^2), which is a bit slow, but because the hashes reflect the method used for equivalence testing, the distribution of hash codes is probably pretty decent. Any other commutative kernel (product, XOR) will work as well as the sum used in this example.
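A minimal Java sketch of that pseudocode, operating on Strings for brevity (the per-rotation hash is the usual 31-based polynomial hash):

```java
public class RotationSumHash {

    // Polynomial hash of one rotation, as in String.hashCode.
    static int rotationHash(String s) {
        int h = 0;
        for (char c : s.toCharArray()) {
            h = 31 * h + c;
        }
        return h;
    }

    // Sums the hash of every rotation. The rotations of "picture" and
    // "turepic" form the same set, so both words get the same sum. O(n^2).
    static int cyclicHash(String word) {
        int hash = 0;
        String doubled = word + word;
        for (int i = 0; i < word.length(); i++) {
            hash += rotationHash(doubled.substring(i, i + word.length()));
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(cyclicHash("picture") == cyclicHash("turepic")); // true
    }
}
```

(int overflow plays the role of the `hash %= MAX_HASH` step.)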

TokenMacGuy
+1 for a pragmatic solution - though I always fear that that this kind of implementation shortcut will leave someone with a nasty performance regression in a few years time when someone eventually does start using the structures as HashMap keys :-)
mikera
"I" don't use the hash codes, but java.util.Set uses them whenever I make a call `setOfUniqueWords.contains(thisWord);`. So it is important that I define it properly. I got it working though :)
Hristo
First, not all permutations are considered equal. Only "rotations" are. Next, adding the hashes can cause collisions. For instance, if you take the standard approach of interpreting the letters as coefficients of a polynomial, and use an evaluation of that polynomial as some point other than 0 as the hash, addition over all rotations would create a polynomial where all coefficients are equal, thereby losing any information about the order of letters.
meriton
@meriton: Well, if that is the method used behind the hash function, then a different kernel (say, hash *= hash(permutation)) would solve that.
TokenMacGuy
Probably (my math skills are too rusty to attempt a proof of good distribution). Yes, that method for computing a hash function for a sequence is the standard approach in Java land, featured, for instance, in the code generation of the eclipse IDE and in the standard library (c.f. java.util.AbstractList.hashCode).
meriton
I gave it a try; it produces different results, at least for 1+2x+3x^2 (and its permutations) versus 3+2x+x^2 (and its permutations).
TokenMacGuy
+1, practical idea. If some guy comes along and tries to use that without checking the `hashCode` method, it's his fault. Just add a warning in a comment. OTOH, if the OP *is* using `HashSet`/`HashMap` or is otherwise calling that method, better to implement it properly.
MAK