views: 43

answers: 4
I am looking for a time efficient method to parse a list of files into a tree. There can be hundreds of millions of file paths.

The brute-force solution would be to split each path at every directory separator and traverse the tree, adding directory and file entries by doing string comparisons, but this would be exceptionally slow.
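For concreteness, here is a minimal sketch of that brute-force approach in Python (Node and insert_path are illustrative names, not from any library):

    class Node:
        def __init__(self, name):
            self.name = name
            self.children = []  # unsorted list, searched by string comparison

    def insert_path(root, path):
        # Walk from the root, comparing each component against every
        # existing sibling; the repeated linear scans are what make this slow.
        node = root
        for part in path.split("\\"):
            for child in node.children:
                if child.name == part:
                    node = child
                    break
            else:
                child = Node(part)
                node.children.append(child)
                node = child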

The input data is usually sorted alphabetically, so the list would be something like:

C:\Users\Aaron\AppData\Amarok\Afile

C:\Users\Aaron\AppData\Amarok\Afile2

C:\Users\Aaron\AppData\Amarok\Afile3

C:\Users\Aaron\AppData\Blender\alibrary.dll

C:\Users\Aaron\AppData\Blender\and_so_on.txt

From this ordering, my natural reaction is to partition the directory listings into groups... somehow... before doing the slow string comparisons. I'm really not sure. I would appreciate any ideas.

Edit: It would be better if this tree were lazy-loaded from the top down, if possible.

A: 

If it's possible, you can generate your tree structure with the tree command.

ghostdog74
A: 

Try using Tries.
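A minimal sketch of the idea, assuming the trie is keyed on whole path components rather than single characters (the answer doesn't specify the granularity):

    class TrieNode:
        def __init__(self):
            self.children = {}  # component name -> TrieNode

    def trie_insert(root, path):
        # One hash probe per component replaces the per-sibling
        # string comparisons of the naive approach.
        node = root
        for part in path.split("\\"):
            child = node.children.get(part)
            if child is None:
                child = TrieNode()
                node.children[part] = child
            node = child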

TheMachineCharmer
A: 

To take advantage of the "usually sorted" property of your input data, begin your traversal at the directory where the last file was inserted: compare the directory name of the current pathname to the previous one. If they match, you can just insert here; otherwise, pop up a level and try again.
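One way this might look in Python, keeping a stack of the nodes along the previous path so each insertion restarts at the deepest shared directory (a sketch assuming nodes are plain dicts):

    def build_tree(sorted_paths):
        root = {}
        prev_parts = []      # components of the previous path
        prev_stack = [root]  # prev_stack[i] is the node reached after i components
        for path in sorted_paths:
            parts = path.split("\\")
            # Find how many leading components match the previous path.
            i = 0
            limit = min(len(parts), len(prev_parts))
            while i < limit and parts[i] == prev_parts[i]:
                i += 1
            # "Pop up" to the deepest shared directory, then insert from there.
            node = prev_stack[i]
            stack = prev_stack[:i + 1]
            for part in parts[i:]:
                node = node.setdefault(part, {})
                stack.append(node)
            prev_parts, prev_stack = parts, stack
        return root

Since the input is mostly sorted, consecutive paths usually share a long prefix, so the comparison loop touches only a component or two per path instead of re-walking from the root.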

David Gelhar
+1  A: 

You have no choice but to do full string comparisons, since you can't guarantee where the strings might differ. There are a couple of tricks that might speed things up a little:

  • As David said, form a tree, but search for the new insertion point starting from the previous one, perhaps with the aid of some sort of matchingPrefix routine that tells you where the new path first differs (sketched after this list).
  • Use a hash table for each level of the tree if there may be very many files within a directory and you need to count duplicates. (Otherwise, appending to a stack is fine.)
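A sketch of both points: matching_prefix plays the role of the matchingPrefix routine named above (its implementation is an assumption), and each directory node carries a per-level hash table, with a Counter to track duplicate file names:

    from collections import Counter

    class DirNode:
        def __init__(self):
            self.dirs = {}          # hash table per level: name -> DirNode
            self.files = Counter()  # counts duplicate file names

    def matching_prefix(a_parts, b_parts):
        # Return how many leading components the two split paths share,
        # i.e. where the new path first differs from the previous one.
        n = 0
        for x, y in zip(a_parts, b_parts):
            if x != y:
                break
            n += 1
        return n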
Rex Kerr