tags:

views:

445

answers:

9

I want to convert windows pathname to unique integer.

Eg:

For pathname C:\temp\a.out, if i add ascii value of all the characters, i get 1234. But some other path can also generate the same number. So, what is the best way to generate unique numbers for different pathnames?

+12  A: 

Look into Hash functions. Make sure to consider the case-insensitive nature of most Windows filenames when performing the hash.

Most likely, the language you are using provides a library function (or collection of functions) which can take the hash of a string (or just data). SHA1 is popular and has low collision.

Here on Stackoverflow there are many questions pertaining to hash functions. To get you started, simply search for "hash function". This may be a useful SO question for your case: What is a performant string hashing function that results in a 32 bit integer with low collision rates?.

strager
See this http://stackoverflow.com/search?q=hashing+algorithm
BrianLy
Because of Jimmy's answer, hash functions will never solve this problem 100%
Allain Lalonde
@BrianLy, Thanks; I've updated my answer to provide references to SO resources.
strager
In practice though, a SHA-1 hash is so unlikely to hit an accidental collision (even by a malicious attacker) that you can pretty much depend on it for the time being.
Eclipse
+7  A: 

there are more possible pathnames than integers, therefore you can't have true uniqueness. You could settle for something like an MD5 hash.

Jimmy
Wow, insightful :) +1
Allain Lalonde
Integers are from -inf to +inf. Am I missing something? =] Also, I'm pretty sure there's a limit to the size of a pathname for FAT32 and/or NTFS; something like 1024 characters or something. So if you had a 1023-byte or less integer, you may be out of luck. =]
strager
@stranger: http://stackoverflow.com/questions/265769/maximum-filename-length-in-ntfs-xp-and-vista you have {#_of_possible_characters}^32,000 combinations to map in a integer-sized value. Yes, you are missing something ;)
rjack
@rjack: Where does the poster say how large the integer can be? Python 3, for example, supports unlimited length integers: http://docs.python.org/3.0/reference/datamodel.html#objects-values-and-types
Steve Losh
@everyone: ha you guys caught me on the integer thing. for bignums something like "pathname".inject(1, |i,c| { i = i*256+c }) should work right?
Jimmy
@steve losh: you need a 224,000 bit integer to map every ASCII-only pathname without collisions. A 27KB integer is something infeasible, even if supported by the language (yay python!)
rjack
@rjack: What? If path names are 1024 characters, you need 1024 bytes in your integer at most. Less, because not all characters are valid in path names.
derobert
Yup, just take the bytes for the pathname and pretend its one long integer, and there you go.
Eclipse
@rjack Yes, you'd need a huge integer to hold a 32,000 character pathname. How often do you see paths 32,000 characters long though? Most of the integers would be tiny and fast, but you'd still be able to fall back on large ones for the rare cases. There's still probably faster ways though.
Steve Losh
+2  A: 

Perfect hashing

Mehrdad Afshari
That's only possible if you know the domain space ahead of time.
T.E.D.
Yes, other guys has mentioned other stuff. I just noted it for sake of completeness. This works if you have static list of files in a database, for instance.
Mehrdad Afshari
+2  A: 

Yes, you'll need to use some kind of hash function, simply because the domain of your input is greater than the range of your output. In other words, there are almost certainly more valid pathnames than there are numbers representable in your target language's data type.

So it will not be possible to completely avoid collisions. If this guarantee is essential to your application, you won't be able to do it by translation to integers.

Andrzej Doyle
A: 

If this is on Unix, you could just grab its inode number. ls -i shows it on the command line. The stat() command allows you to retrive it from a program.

Soft links would show up as the same file, while hard links would show up as a different file. This may or may not be behavior you want.

I see a lot of folks talking about hashing. That could work, but theoretically if your hash does anything more than compress out integer values that are not allowable in file names, then you could have clashes. If that is unacceptable for you, then your hash is always going to be nearly as many digits as the file name. At that point, you might as well just use the file name.

T.E.D.
Two problems. One, he's on Windows. I'm sure FAT32 and NTFS have something similar to i-nodes, though. Two, i-nodes can change where the path name doesn't. Also, your method wouldn't work with network paths (as far as I know), most likely because access of low-level file info would be denied.
strager
A: 

To all the people saying "it's not possible because you have more possible paths than integers to store them in": no. The poster never specified an implementation language; some languages support arbitrary-length integers. Python, for example.

Say we take the 32,000 character paths as the limit mentioned in one of the other comments. If we have 256 different characters to use with paths we get:

Python 2.5.1 (r251:54863, May 18 2007, 16:56:43)
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 32000L**256L
20815864389328798163850480654728171077230524494533409610638224700807216119346720596024478883464648369684843227908562015582767132496646929816279813211354641525848259018778440691546366699323167100945918841095379622423387354295096957733925002768876520583464697770622321657076833170056511209332449663781837603694136444406281042053396870977465916057756101739472373801429441421111406337458176000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000L
>>>

Notice how Python represents that just fine? Yes, there's probably a better way to do it, but that doesn't mean it's impossible.

EDIT: rjack pointed out that it's actually 256^32000, not the other way around. Python still handles it just fine. The performance may leave something to be desired, but saying it's mathematically impossible is wrong.

Steve Losh
it's not 32000^256, its 256^32000, that makes a big difference.
rjack
'python -c "print pow (256, 32000)" | less' --- yes, less is definitely needed
rjack
Yep, you're right. Still works just fine though.
Steve Losh
+1  A: 

How about something like this: Use a hash of (String->n bits) for each directory level. Alloting 20 bits for each of 10 directory levels is clearly not going to scale, but maybe a telescoping level of bits, under the assumption that the lowest directory level will be the most populated -

e.g. if you have (from root) /A/B/C/D/E/F, output some sort of n-bit number where

bits n/2 - n hashes F

bits n/4 - n/2 bits hashes E

n/8 - n/4 bits hashes D

etc. etc.

Steve B.
A: 

Jimmy Said

there are more possible pathnames than integers, therefore you can't have true uniqueness. You could settle for something like an MD5 hash.

I don't think there are more possible path names then integers. As a construction to create a unique number from a pathname we can convert each letter to a (two-digit) number (so from 10-25,26=., then other special chars, and 27 being / --this is assuming there are less then 89 different characters, else, we can move to three digit encoding)

home/nlucaroni/documents/cv.pdf
1724221427232130121027242318271324122827123136251315

This forms a bijection (although, if you count only valid path names then the surjective property fails, but normally one doesn't care about that holding) --Come up with a path that isn't an integer.

This number obviously doesn't fit in a 64_bit unsigned int (max being 18446744073709551615), so it's not practical, but this isn't the point of my response.

nlucaroni
A: 

You can read here http://stackoverflow.com/questions/410705/best-way-to-determine-if-two-path-reference-to-same-file-in-c#410794 how you can uniquely identify a path. You need three numbers (dwVolumeSerialNumber, nFileIndexHigh and nFileIndexLow), maybe you can combine those three numbers to a new number with three times more bits. See also here: http://stackoverflow.com/questions/271398/post-your-extension-goodies-for-c-net-codeplex-com-extensionoverflow#274652 .

tuinstoel