I'm just wondering: is there a reason why some libraries (in any language) use iterative hashing such that the hashed data is encoded in hex and then rehashed, instead of rehashing the actual binary output?
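For example, in PHP the two variants I have in mind would look roughly like this (the input is just a placeholder):

    <?php
    $data = 'some input';

    // Variant 1: rehash the hex string (what I see many scripts do)
    $h = md5($data);        // 32-character hex string
    $h = md5($h);           // hashes the hex text, not the raw digest

    // Variant 2: rehash the actual binary output
    $b = md5($data, true);  // 16 raw bytes
    $b = md5($b, true);     // hashes the raw bytes directly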
A:
This is done to introduce an extra step that guards against the hash possibly starting to produce the same or similar output if it is iteratively applied directly to its own result. This extra step is independent of the hash implementation and acts as yet another re-hash stage, which will not hurt. Such precautions are not needed for reliable hashes, but you never know in advance whether some hash algorithm has a yet-unknown defect.
sharptooth
2009-12-28 13:10:12
Hold on, you are saying that the hash algorithms that do not return the raw output return some sort of "hashed" output? I think that's not true. AFAIK, most libraries will either convert the binary data to hex (an encoding) or return the binary data alone. I see no reason to prefer hex (base 16) over the raw binary here.
rFactor
2009-12-28 13:35:34
No, if you have hex output you can convert it to raw and vice versa trivially. Just the fact that the output of the hash is transformed in some way before rehashing can improve security. The key is not the transformation itself; the key is that it is done before applying the same hash again.
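For example, in PHP the conversion is lossless in both directions (hex2bin() needs PHP 5.4+; pack('H*', $hex) does the same on older versions):

    <?php
    $raw = md5('abc', true);         // 16 raw bytes
    $hex = bin2hex($raw);            // raw -> hex, no information lost
    assert(hex2bin($hex) === $raw);  // hex -> raw, round-trips exactly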
sharptooth
2009-12-28 14:11:12
The more I think about this, the less I can see how mapping bytes that can hold values 0-255 into bytes that can hold values 0-15 (the value range per byte, not the number of bytes) and rehashing the result can improve security. As far as I can tell, it should be the opposite: we would lose more entropy by rehashing the compressed output than by hashing the actual output. Can you provide any facts on this, or is it just a feeling that compressing the data before rehashing improves security?
rFactor
2009-12-28 14:43:56
It's exactly the opposite. Rehashing can either do nothing or hurt, depending on how good the hash is. A transformation like conversion to a hex string is done to minimize the risk of being hurt if the hash is not good and starts producing interdependent outputs after some iterations over the same block. In other words, hashing compresses the data and the transformation uncompresses it a little, so that the next compression doesn't produce a similar-looking block.
sharptooth
2009-12-28 14:52:11
I think we have jumped to a different topic: the inner workings of hashes? Because converting the final output to hex rather than having it as binary has no effect on the results (it's just an encoding).
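For example, in PHP:

    <?php
    // The hex string and the raw digest are the same value in different encodings:
    var_dump(bin2hex(md5('abc', true)) === md5('abc')); // bool(true)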
rFactor
2009-12-28 15:08:00
Yes, sure. You asked why the output of the hash is rehashed as a hex string instead of as a binary block. Surely the hex string and the binary block are different representations of the same data. But the point is that if you rehash the block directly, you risk running into a hash imperfection, and if you hash its string representation, that risk is reduced.
sharptooth
2009-12-28 15:19:07
Take a look at Perl's md5() function: it returns the hash of the given data in binary. If you want, you can also use the md5_hex() function, which does exactly the same except that it returns the hex-encoded representation of the binary data. Take a look at PHP's hash() function: by default it returns the data hex-encoded, but you can pass true as the third parameter to skip the hex encoding (the output stays binary). I see many PHP scripts that iteratively hash and encode to hex in the process. I'm asking why they encode in that process, since it loses info. I bet it's by mistake.
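To illustrate the PHP side of it:

    <?php
    $hex = hash('md5', 'abc');        // "900150983cd24fb0d6963f7d28e17f72"
    $raw = hash('md5', 'abc', true);  // the same digest as 16 raw bytes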
rFactor
2009-12-29 09:15:25
So my question is: is there a reason to use a for() loop that calls md5_hex() again and again instead of a for() loop that calls md5() again and again?
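In PHP terms (with md5()'s raw-output flag standing in for the Perl md5()/md5_hex() pair; the iteration count is arbitrary):

    <?php
    // Loop A: each round hashes the 32-character hex string
    $h = 'secret';
    for ($i = 0; $i < 1000; $i++) {
        $h = md5($h);
    }

    // Loop B: each round hashes the raw 16-byte digest
    $b = 'secret';
    for ($i = 0; $i < 1000; $i++) {
        $b = md5($b, true);
    }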
rFactor
2009-12-29 09:16:23
This might be by mistake, but it actually does good. Encoding to hex doesn't lose any data - it transforms the data into another representation with more bytes, each able to hold fewer values: in binary you have 16 bytes with 256 possible values each, and in hex you have 32 bytes with 16 possible values each. In fact, this transformation uncompresses the data - you get the same amount of information occupying more encoding bytes. This is exactly the opposite of what a hash does, and this is good because it shuffles the data a bit before rehashing.
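For example:

    <?php
    $raw = md5('abc', true);
    $hex = md5('abc');
    echo strlen($raw); // 16 bytes, 256 possible values each
    echo strlen($hex); // 32 bytes, only 16 possible values (0-9, a-f) each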
sharptooth
2009-12-29 10:19:09
Rehashing as binary would be hash()->hash()->hash(), and if the hash is imperfect you risk getting biased output after the final hash. You don't want this. So you do hash()->shuffle()->hash()->shuffle()->hash(), with shuffle() being the transformation from binary to hex. This way the risk of getting biased output is reduced.
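As a sketch in PHP (with bin2hex() playing the role of shuffle(); the function name and round count are arbitrary):

    <?php
    // hash() -> shuffle() -> hash() -> shuffle() -> hash()
    function rehash_with_shuffle($data, $rounds) {
        $h = md5($data, true);      // hash()
        for ($i = 1; $i < $rounds; $i++) {
            $h = bin2hex($h);       // shuffle(): binary -> hex
            $h = md5($h, true);     // hash() again
        }
        return $h;
    }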
sharptooth
2009-12-29 10:22:01
Okay. That is certainly interesting, but is there any research, or are there facts, that show how shuffle() makes the rehashing stronger? Also, is there a particular reason why one would want to use hex encoding instead of base32/64 or other encodings? Which is the best encoding method to use - the one that stores as little information per byte as possible?
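For instance, in PHP one could presumably just as well use base64 as the shuffle step:

    <?php
    $raw = md5('abc', true);      // 16 raw bytes
    $hex = bin2hex($raw);         // 32 chars, 16 possible values each
    $b64 = base64_encode($raw);   // 24 chars (incl. padding), up to 64 values each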
rFactor
2009-12-29 10:31:23
I haven't seen any research on the topic. I suppose it doesn't matter much which transformation is used as shuffle(), as long as it doesn't cause information loss.
sharptooth
2009-12-29 10:37:37
What the? @sharptooth is completely incorrect - encoding the hash output in hex has _no_ impact on the effectiveness of the hash. If hash algorithms were so delicate that applying them repeatedly to themselves without encoding could result in biased output, they'd be totally useless for their intended purpose. Encoding in hex before repeatedly hashing is simply a side-effect of sloppy use of library hash functions that return hex-encoded strings.
Nick Johnson
2010-08-07 10:48:53