views:

184

answers:

4

Let's say I want to calculate a hash of a data structure, using a hash algorithm like MD5 which accepts a serial stream, for the purposes of equivalence checking. (I want to record the hash, then recalculate the hash on the same or an equivalent data structure later, and check the hashes to gauge equivalence with high probability.)

Are there standard methods of doing this?

Issues I can see that are problematic are

  • if the data structure contains an array of binary strings, I can't just concatenate them since ["abc","defg"] and ["ab","cdefg"] are not equivalent arrays
  • if the data structure contains a collection that isn't guaranteed to enumerate in the same order, e.g. a key-value dictionary {a: "bc", d: "efg", h: "ijkl"} which should be considered equivalent to a key-value pair {d: "efg", h: "ijkl", a: "bc"}.
+3  A: 

For the first issue, also hash the lengths of the strings. This will differentiate their hashes.

For the second, sort the keys.

Grumdrig
+1  A: 

There is also another problem with structs and it is the alignment of data members on different platforms. If you want a stable and portable solution, you can solve this by implementing "serialize" method for your data structure in such a way that serialize will produce byte stream (or more commonly, output to the byte stream). Then, you can use hash algorithm with the serialized stream. In such a way, you will be able to solve the problems you mentioned by explicit traversion of your data. As other additional features you will get ability to save your data onto hdd or to send it over the network.

For the strings, you can implement Pascal type storage where length comes first.

AlexKR
+1  A: 
  • If the strings can't have any nul characters, you can use C strings to guarantee uniqueness, eg. "abc\0defg\0" is distinct from "cdefg\0".
  • For dictionaries, maybe you can sort before hashing.

This also reminds me of an issue I heard of once... I don't know what language you are using, but if you are also hashing C structs without filtering them in any way, be careful about the space between fields that the compiler might have introduced for alignment reasons. Sometimes those will not be zeroed out.

asveikau
+2  A: 

A "standard" way of doing this is to define a serialized form of the data structure, and digest the resulting byte stream.

For example, a TBSCertificate is a data structure comprising a subject name, extensions, and other information. This is converted to a string of octets in a deterministic way and hashed as part of a digital signature operation to produce a certificate.

erickson