views:

184

answers:

1

Hi,

There is a data set of "file" - name of file, and 32bit number is following after it - something like hash for the file.

"file1" 6a9bd9a6 1df3b24b 7ab054dc
"file2" 6a9bd54e 1df3b24b 8cd054dc
"file3" 6a9bd9a6 7ab054dc

How am I going to get unique files so s2 is not a prefix of any other s2 - that means the number is unique. If there are two same s2s, both of them are unique if they are not prefixes to any other s2.

I'm looking for a fast solution. I could come up with solution to compare each string to every other but it would be too time consuming and uneffective. Another option was to somehow use MySQL engine for tables, but I am not sure how. Can you help?

+2  A: 

You could use a trie to ensure that no string is a prefix of any other string .

When you insert into your trie, you would check for both of these cases:

1) Have I passed an old leaf node? If so, that means another string is a prefix of my string.
2) Am I wanting to mark an already existing non leaf as a leaf? If so, I am a prefix of another string.

This would be an O(N) solution where N is the number of strings (measuring the number of insertions into the trie). Each insertion runs for the length of its string.

So if you want to create hashes from here. You can easily just traverse the trie and then use the information about if you have a prefix node or not once you reach the leaf you want. Each leaf node represents a whole path, and it knows whether it is a prefix of another string or not. If it is a prefix, then it has at least 1 child node.

Brian R. Bondy
oh, so I insert all data into trie, and trie has the algorithm of returning the information about each data entry whether it got some "parent" and is a prefix of different entry?
Skuta
ya you can recursively iterate over each node of the trie. When you reach each leaf you would calculate the hash. At each leaf node you also know whether it is a prefix or not with O(1) access time.
Brian R. Bondy
do you know some working example ?
Skuta
No sorry you would have to code it
Brian R. Bondy