views: 156
answers: 3
Hi, I have a Unicode / UTF-16 encoded path. The path delimiter is U+005C '\'. The paths are null-terminated, root-relative Windows file system paths, e.g. "\windows\system32\drivers\myDriver32.sys"

I want to hash this path into a 64-bit unsigned integer. It does not need to be "cryptographically sound". The hashes should be case insensitive, but able to handle non-ASCII letters. Obviously, the hash should also scatter well.

There are some ideas that I had thought of:

A) Using the Windows file identifier as a "hash". In my case I do want the hash to change if the file gets moved, so this is not an option.

B) Just use a regular string hash: hash = prime * hash + codepoint for the whole string (a rough sketch of this follows the requirements list below).

I do have the feeling that the fact that the path consists of "segments" (folder names and the final file name) can be leveraged.

To sum up the needs:

1) 64-bit hash
2) good distribution / few collisions for file system paths.
3) efficient
4) does not need to be secure
5) case insensitive
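
For concreteness, a rough sketch of idea B) in C might look like the following (the hash_path_simple name, the prime constants, and the ASCII-only case folding are placeholders, and it assumes 16-bit wchar_t as on Windows):

#include <stdint.h>
#include <wchar.h>

/* Idea B): a plain 64-bit polynomial hash over the case-folded UTF-16
   code units of a null-terminated path.  Constants are arbitrary primes. */
uint64_t hash_path_simple(const wchar_t *path)
{
    uint64_t h = 99991u;                    /* arbitrary prime seed */
    for (const wchar_t *p = path; *p != L'\0'; ++p) {
        wchar_t c = *p;
        if (c >= L'A' && c <= L'Z')         /* ASCII-only case folding */
            c = (wchar_t)(c + 32);
        h = h * 31u + c;                    /* hash = prime * hash + codepoint */
    }
    return h;
}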

+2  A: 
Archimedix
Thanks - but MD5 and just about any other "cryptographic hash" I know of are larger than 64 bits. I'd have to collapse the key space for this. I do have a Unicode 5.1 based method to lowercase strings that is quite fast.
Dominik Weber
Yes, like I said, you'd have to take a 64-bit sub-string of MD5 then. Alternatively, take a look at this question: http://stackoverflow.com/questions/1660501/what-is-a-good-64bit-hash-function-in-java-for-textual-strings
Archimedix
Yes that would work - I was confused by the "sub-string" - I thought that referred to the file path, not the resulting hash.
Dominik Weber
Selected this because it seems the best solution for me, and for the good follow-up in the comments! Thanks!
Dominik Weber
+1  A: 

Even if you do not need a cryptographic hash, you can still use one, and since your problem is not about security, a "broken" cryptographic hash would be fine. I suggest MD4, which is quite fast. On my PC (a 2.4 GHz Core2 system, using a single core), MD4 hashes more than 700 MB/s, and even for small inputs (less than 50 bytes) it can process about 8 million messages per second. You may find faster non-cryptographic hashes, but it already takes a rather specific situation for it to make a measurable difference.

For the specific properties you are after, you would need:

  1. To "normalize" characters so that uppercase letters are converted to lowercase (for case insensitivity). Note that, generally speaking, case-insensitivity in the Unicode world is not an easy task. From what you explain, I gather that you are only after the same kind of case-insensitivity that Windows uses for file accesses (I think that it is ASCII-only, so conversion uppercase->lowercase is simple).

  2. To truncate the output of MD4. MD4 produces 128 bits; just use the first 64 bits. This will be as scattered as you could wish for.

There are MD4 implementations available in many places, including right in RFC 1320, which I link to above. You may also find open-source MD4 implementations in C and Java in sphlib.
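
For illustration, the two steps might be combined roughly as follows. This is only a sketch, not part of the answer's suggested code: the path_hash64 name and the ASCII-only case fold are assumptions, it assumes 16-bit wchar_t as on Windows, and it leans on OpenSSL's legacy one-shot MD4() from <openssl/md4.h> (deprecated in OpenSSL 3.x), standing in for whichever MD4 implementation you actually pick:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <openssl/md4.h>   /* legacy one-shot MD4(); deprecated in OpenSSL 3.x */

/* 64-bit path hash: fold ASCII upper-case to lower-case, MD4 the raw
   UTF-16 code units, keep the first 8 of the 16 digest bytes. */
uint64_t path_hash64(const wchar_t *path)
{
    size_t len = wcslen(path);
    wchar_t *folded = malloc((len + 1) * sizeof *folded);
    if (folded == NULL)
        return 0;                            /* error handling elided */

    for (size_t i = 0; i < len; i++) {
        wchar_t c = path[i];
        /* step 1: ASCII-only case normalization */
        folded[i] = (c >= L'A' && c <= L'Z') ? (wchar_t)(c + 32) : c;
    }
    folded[len] = L'\0';

    /* step 2: MD4 over the code units, truncated to the first 64 bits */
    unsigned char digest[MD4_DIGEST_LENGTH];
    MD4((const unsigned char *)folded, len * sizeof *folded, digest);
    free(folded);

    uint64_t h;
    memcpy(&h, digest, sizeof h);            /* which 8 bytes you keep is arbitrary */
    return h;
}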

Thomas Pornin
Yeah, crypto-hashes need not be bad per se, hence I suggest doing some benchmarks for evaluation if necessary. MD4 support may be phased out of some crypto libraries as it is considered insecure, so the widely implemented MD5 may be more future-proof regarding portability and compatibility. Windows supports Unicode file system names, as far as I know, for NTFS and VFAT (UTF-16; according to Wikipedia, NTFS uses 16-bit characters which may - but need not - be UTF-16). http://en.wikipedia.org/wiki/NTFS
Archimedix
Actually, every NTFS volume has an uppercase map that the file system driver uses for case-insensitive comparisons.
Dominik Weber
+1 for MD4 due to the speed requirement.
Dominik Weber
Regarding speed, the hashing code used by Java in that related question I linked to might be sufficiently good as well for *compiled* languages (including C/C++ and Java); for dynamically interpreted languages (e.g. PHP, Perl), a hash function provided by the language's standard library (which is usually native machine code) may be faster than a function you write in the interpreted language (this also depends on the length of the input). http://stackoverflow.com/questions/1660501/what-is-a-good-64bit-hash-function-in-java-for-textual-strings/1660613#1660613
Archimedix
+1  A: 

I would just use something straightforward. I don't know what language you are using, so the following is pseudocode:

ui64 res = 10000019;                        // arbitrary prime seed
for(i = 0; i < len; i += 2)                 // consume two UTF-16 code units per step
{
  // pack the upper-cased pair of code units into one 32-bit value
  ui64 merge = ucase(path[i]) * 65536 + ucase(path[i + 1]);
  res = res * 8191 + merge;                 // unchecked (wrapping) arithmetic
}
return res;

I'm assuming that path[i + 1] is safe, on the basis that if len is odd then the last iteration will just read the terminating U+0000.

I wouldn't make use of the gaps in the value space - the gaps in UTF-16 itself, the lower-case and title-case characters eliminated by folding, and the characters that are invalid in paths - because they are not distributed in a way that could be exploited quickly. Subtracting 32 (all code units below U+0020 are invalid in path names) wouldn't be too expensive, but it wouldn't improve the hashing much either.
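
For reference, a self-contained C version of the pseudocode above might look like this. It is an illustration only, not the answer's exact code: the hash_path_pairs name is made up, it assumes 16-bit wchar_t as on Windows, and the ASCII-only ucase_unit helper stands in for whatever case folding the file system really uses (as noted in an earlier comment, NTFS keeps a per-volume uppercase map for this).

#include <stdint.h>
#include <stddef.h>
#include <wchar.h>

/* ASCII-only upper-casing of one UTF-16 code unit; a faithful version would
   use the same uppercase mapping as the file system driver. */
static wchar_t ucase_unit(wchar_t c)
{
    return (c >= L'a' && c <= L'z') ? (wchar_t)(c - 32) : c;
}

/* len is the number of UTF-16 code units, excluding the terminating U+0000.
   If len is odd, the last iteration deliberately reads the terminator. */
uint64_t hash_path_pairs(const wchar_t *path, size_t len)
{
    uint64_t res = 10000019u;                    /* prime seed */
    for (size_t i = 0; i < len; i += 2) {
        /* merge two upper-cased code units into one 32-bit value */
        uint64_t merge = (uint64_t)ucase_unit(path[i]) * 65536u
                       + (uint64_t)ucase_unit(path[i + 1]);
        res = res * 8191u + merge;               /* wraps modulo 2^64 */
    }
    return res;
}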

Jon Hanna
This is not bad - it avoids having to capitalize the whole string - and the null termination trick is neat.
Dominik Weber
Well, it essentially capitalises the whole string in the UCase bit. That bit of pseudocode is meant to mean a capitalisation. Whether it's done as a string or char-by-char doesn't matter (including for efficiency), assuming I'm remembering correctly that the file-system case folding is char-by-char and culture-agnostic. I could be wrong about that. In either case you'd want to use the same method that the Windows file system uses.
Jon Hanna