ansaurus

Question

Looking for a good 64 bit hash for file paths in UTF16

Answer 1

+2 A:

Archimedix 2010-09-15 20:34:47

Thanks - but MD5 and just about any other "cryptographic hash" I know of are larger than 64 bit. I'd have to collapse the key space for this. I do have a method that is Unicode 5.1 based to lowercase strings that is quite fast.

Dominik Weber 2010-09-15 21:41:58

Yes, like I said, you'd have to take a 64-bit sub-string of MD5 then. Alternatively, take a look at this question: http://stackoverflow.com/questions/1660501/what-is-a-good-64bit-hash-function-in-java-for-textual-strings

Archimedix 2010-09-15 21:55:37

Yes that would work - I was confused with the "sub-sting" - I thought that referred to the file path, not the resulting hash.

Dominik Weber 2010-09-18 00:28:11

Selected this beacuse this seems the best solution for me of and the good follow-up in the commments! Thanks!

Dominik Weber 2010-09-23 21:46:04

Answer 2

+1 A:

Even if you do not need a cryptographic hash, you can still use one, and since your problem is not about security, then a "broken" cryptographic hash would be fine. I suggest MD4, which is quite fast. On my PC (a 2.4 GHz Core2 system, using a single core), MD4 hashes more than 700 MB/s, and even for small inputs (less than 50 bytes) it can process about 8 millions messages par second. You may find faster non-cryptographic hashes, but it already takes a rather specific situation for it to make a measurable difference.

For the specific properties you are after, you would need:

To "normalize" characters so that uppercase letters are converted to lowercase (for case insensitivity). Note that, generally speaking, case-insensitivity in the Unicode world is not an easy task. From what you explain, I gather that you are only after the same kind of case-insensitivity that Windows uses for file accesses (I think that it is ASCII-only, so conversion uppercase->lowercase is simple).
To truncate the output of MD4. MD4 produces 128 bits; just use the first 64 bits. This will be as scattered as you could wish for.

There are MD4 implementations available in may places, including right in the RFC 1320 I link to above. You may also find opensource MD4 implementations in C and Java in sphlib.

Thomas Pornin 2010-09-16 13:56:06

Yeah, crypto-hashes need not be bad per se, hence I suggest doing some benchmarks for evaluation if necessary. MD4 support may be phased out for some crypto-libraries as it is considered insecure, therefore the widely implemented MD5 may be more future-proof regarding portability and compatibility. Windows supports Unicode file system names as far as I know for NTFS and VFAT (UTF-16; according to Wikipedia, NTFS uses 16 bit characters which may - but need not - be UTF-16). http://en.wikipedia.org/wiki/NTFS

Archimedix 2010-09-16 20:04:13

actually every NTFS volume has a uppercase map that the file system driver uses for case-insensitive comparisons.

Dominik Weber 2010-09-17 18:46:11

+1 for MD4 due to the speed requirement.

Dominik Weber 2010-09-18 00:26:04

Regarding speed, the hashing code used by Java in that related question I linked to might be a sufficiently good one as well for *compiled* languages (includes C/C++ and Java) - for dynamically interpreted languages (e.g. PHP, Perl), a hash function provided by the language's function library (which usually is in native machine code) may be faster than a function you provide in the interpreted language (also depends on the length of the input). http://stackoverflow.com/questions/1660501/what-is-a-good-64bit-hash-function-in-java-for-textual-strings/1660613#1660613

Archimedix 2010-09-18 15:20:24

Answer 3

+1 A:

I would just use something straightforward. I don't know what language you are using, so the following is pseudocode:

ui64 res = 10000019;
for(i = 0; i < len; i += 2)
{
  ui64 merge = ucase(path[i]) * 65536 + ucase(path[i + 1]);
  res = res * 8191 + merge; // unchecked arithmetic
}
return res;

I'm assuming that path[i + 1] is safe on the basis that if len is odd then in the last case it will read the U+0000 safely.

I wouldn't make use of the fact that there are gaps caused by the gaps in UTF-16, by lower-case and title-case characters, and by characters invalid for paths, because these are not distributed in a way to make use of this fact something that could be used speedily. Dropping by 32 (all chars below U+0032 are invalid in path names) wouldn't be too expensive, but it wouldn't improve the hashing too much either.

Jon Hanna 2010-09-22 16:19:36

This is not bad - it avoids having to capitalize the whole string - and the null termination trick is neat.

Dominik Weber 2010-09-22 22:03:32

Well, it essentially capitalises the whole string in the UCase bit. That bit of pseudo code is meant to mean a capitalisation. Whether its done as a string or char-by-char is of no matter (including as efficiency) assuming I'm remembering correctly that the file case folding is char-by-char and culture-agnostic. I could be wrong in that memory. In either case you'd want to use the same method that windows file sys uses.

Jon Hanna 2010-09-22 23:22:21

ansaurus

tags:

views:

answers:

Looking for a good 64 bit hash for file paths in UTF16

related questions