views:

291

answers:

8

I'm reading lines of text that can come in any order. The problem is that the output can actually be indentical to the previous output. How can I detect this, without sorting the output first?

Is there some kind of hash function that can take identical input, but in any order, and still produce the same result?

A: 

So you have input like

A B C D
D E F G
C B A D

and you need to detect that the first and third lines are identical?

Nicholas
A: 

If you want to find out if two files contain the same set of lines, but in a different order, you can use a regular hash function on each line individually, then combine them with a function where ordering doesn't matter, like addition.

I've been thinking of just adding the hash'es togheter. I'm a bit worried that I might get false negatives, but I'll find out :-)
A: 

If the lines are fairly long, you could just keep a list of the hashes of each line -- sort those and compare with previous outputs.

If you don't need a 100% fool-proof solution, you could store the hash of each line in a Bloom filter (look it up on Wikipedia) and compare the Bloom filters at the end of processing. This can give you false positives (i.e. you think you have the same output but it isn't really the same) but you can tweak the error rate by adjusting the size of the Bloom filter...

HD
Sorting the hashes is also a possibility that I have considered. I'm a bit concerned about memory usage, but I'll worry about that later.
A: 

If you add up the ASCII values of each character, you'd get the same result regardless of order.

(This may be a bit too simplified, but perhaps it sparks an idea for you. See Programming Pearls, section 2.8, for an interesting back story.)

+3  A: 

The easiest way would seem to be to hash each line on the way in, storing the hash and the original data, and then compare each new hash with your collection of existing hashes. If you get a positive, you could compare the actual data, to make sure it's not a false positive - though this would be extremely rare, you could go with a quicker hash algorithm, like MD5 or CRC (instead of something like SHA, which is slower but less likely to collide), just so it's quick, and then compare the actual data when you get a hit.

rwmnau
A: 

Any of the hash-based methods may produce bad results because more than one string can produce the same hash. (It's not likely, but it's possible.) This is particularly true of the suggestion to add the hashes, since you would essentially be taking a particularly bad hash of the hash values.

A hash method should only be attempted if it's not critical that you miss a change or spot a change where none exists.

The most accurate way would be to keep a Map using the line strings as key and storing the count of each as the value. (If each string can only appear once, you don't need the count.) Compute this for the expected set of lines. Duplicate this collection to examine the incoming lines, reducing the count for each line as you see it.

  • If you encounter a line with a zero count (or no map entry at all), you've seen a line you didn't expect.
  • If you end this with non-zero entries remaining in the Map, you didn't see something you expected.
Alan Krueger
A: 

Well the problem specification is a bit limited.

As I understand it you wish to see if several strings contain the same elements regardless of order.

For example:

A B C C B A

are the same.

The way to do this is to create a set of the values then compare the sets. To create a set do:

HashSet set = new HashSet(); foreach (item : string) { set.add(item); }

Then just compare the contents of the sets by running through one of the sets and comparing it w/others. The execution time will be O(N) instead of O(NlogN) for the sorting example.

A: 

Dog, isn't there a time complexity involved in "running through one of the sets and comparing it w/others" that makes your solution > O(n)?