+3  Q: 

SHA-1 and Unicode

Hi everyone,

Is the behavior of the SHA-1 algorithm defined for Unicode strings? I realize that SHA-1 itself does not care about the content of the string; however,
it seems to me that in order to pass the standard SHA-1 tests, the input string has to be encoded as UTF-8.

+4  A: 

SHA-1 operates on bytes; it doesn't care about the structure of the data. If you give it a sequence of bytes that encodes a Unicode string, you'll get a result. Keep in mind that it processes data in 512-bit blocks; shorter input is padded to a full block, so the digest is still well defined, though very short inputs are easy to guess by brute force.
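
As a rough illustration of the byte-level view, here is a minimal Python sketch using the standard hashlib module. The helper name sha1_of_text is made up for this example; the expected digest is the well-known SHA-1 test vector for the three bytes of "abc". It only shows that the encoding step, not SHA-1 itself, decides which bytes get hashed.

```python
import hashlib

# Well-known SHA-1 test vector for the three bytes 0x61 0x62 0x63 ("abc").
ABC_VECTOR = "a9993e364706816aba3e25717850c26c9cd0d89d"


def sha1_of_text(text: str, encoding: str) -> str:
    """Hypothetical helper: encode text to bytes first, then hash those bytes."""
    return hashlib.sha1(text.encode(encoding)).hexdigest()


# UTF-8 encodes "abc" as the same three bytes the test vector assumes.
utf8_digest = sha1_of_text("abc", "utf-8")

# UTF-16LE encodes "abc" as six bytes (61 00 62 00 63 00), so the input differs.
utf16_digest = sha1_of_text("abc", "utf-16-le")

print("UTF-8 matches the test vector: ", utf8_digest == ABC_VECTOR)
print("UTF-16 matches the test vector:", utf16_digest == ABC_VECTOR)
```

The UTF-16 digest is still a valid SHA-1 value; it simply belongs to a different byte sequence than the one the published test vector describes.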

WhirlWind
I understand that. But what if I try to pass a test with "abc" and I give it the string "abc" encoded in UTF-16? It will never pass. That is the reason for the question.
Andrew
Of course that doesn't work. SHA-1 hashes the data regardless of its structure. You have to compare apples to apples.
WhirlWind
So when talking about a Unicode string that you want to feed into the SHA-1 algorithm, what would you define as apples? :)
Andrew
If you compare it to the same Unicode string (encoded the same way), the hashes will match. If you compare it to a string encoded some other way, the hashes won't match. That's all secure hashes do, really; they show you whether data is the same or different. Comparing a UTF-16 string to its UTF-8 counterpart results in different hashes because the two contain different bytes. That's the desired result. It doesn't matter *how* you encode, so long as you do it the same way each time you hash.
WhirlWind
The hash of a UTF-8 encoded string containing only ASCII characters will match that of the equivalent pure ASCII char[] string (the two are byte-for-byte identical, as memcmp would confirm). Otherwise you have to choose between UTF-16, UTF-16LE, UTF-16BE, UCS-4... I'd keep the strings in their default encoding unless you intend to share these hashes.
Alexandre Jasmin
Thanks, Alexandre. You could have posted that as an answer, because it directly answers my question.
Andrew
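
The point about ASCII-only strings and the various UTF-16/UCS-4 choices can be checked with a short Python sketch. This is only an illustration under assumptions: the helper name digest_bytes and the sample text "hello" are invented here, and Python's "utf-16" codec (which prepends a byte order mark) stands in for "UTF-16 with BOM".

```python
import hashlib


def digest_bytes(data: bytes) -> str:
    """Hypothetical helper: SHA-1 hex digest of raw bytes."""
    return hashlib.sha1(data).hexdigest()


text = "hello"  # ASCII-only sample text, chosen for illustration

# For ASCII-only text, the ASCII and UTF-8 encodings are the same bytes,
# so their digests are necessarily equal.
ascii_digest = digest_bytes(text.encode("ascii"))
utf8_digest = digest_bytes(text.encode("utf-8"))
print("ASCII == UTF-8:", ascii_digest == utf8_digest)

# The wider encodings each produce a different byte sequence for the same text.
wide_encodings = ("utf-16-le", "utf-16-be", "utf-16", "utf-32-le")
wide_digests = {name: digest_bytes(text.encode(name)) for name in wide_encodings}

for name, value in wide_digests.items():
    print(f"{name:10s} {value}")
```

If the hashes are ever exchanged with another system, both sides would need to agree on one of these encodings, including byte order and whether a byte order mark is present.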