I am writing a little program that creates an index of all files in my directories. It basically iterates over each file on the disk and stores it in a searchable database, much like Unix's locate. The problem is that index generation is quite slow, since I have about a million files.

Once I have generated an index, is there a quick way to find out which files have been added or removed on the disk since the last run?

EDIT: I do not want to monitor file system events. I think the risk of getting out of sync is too high; I would much prefer something like a quick re-scan that finds where files have been added/removed. Maybe using the directories' last-modified dates or something?

A Little Benchmark

I just made a little benchmark. Running

dir /b /s M:\tests\  >c:\out.txt

takes 0.9 seconds and gives me all the information I need. When I use a Java implementation (much like this), it takes about 4.5 seconds. Any ideas on how to improve at least this brute-force approach?

Related posts: How to see if a subfile of a directory has changed

+6  A: 

Unfortunately there's no standard way to listen to file system events in Java. This may be coming in Java 7.

For now, you'll have to Google "java filesystem events" and pick the custom implementation that matches your platform.

krosenvold
+3  A: 

One way you could speed things up is to iterate over just the directories and check each one's last-modified time to see whether its contents have changed since your last index; if they have, do a normal scan of that directory and find what changed. I don't know how portable this is, though. On a Linux system a change in the hierarchy propagates upwards (this might be filesystem-dependent), so you can start at the root and work your way down, stopping when you hit a directory that hasn't changed.

MrWiggles
Does not work: Windows does not propagate changes.
martinus
You will have to check every directory since Windows doesn't propagate changes, but this will still be faster because you know all the paths from your last index run. I've done this before and it works. The speedup is tremendous!
Aaron Digulla
A: 

The file-date approach might not be the best; consider, for example, a file restored from backup. Perhaps during indexing you could store an MD5 hash of each file's contents. However, you might need to do some performance benchmarking to see if the performance is acceptable.

Conrad
I wrote something similar to find duplicate files on my hard drive. MD5 was pretty slow. I tried multi-threading the program, but it just brought my computer to a crawl.
ScArcher2
Computing the MD5 hash requires reading the whole file. Not really good if what you are looking for is performance...
Guillaume
Indeed, which is why I suggested benchmarking to check whether the performance is acceptable. Any ideas on how to treat a file that is restored from backup?
Conrad
A: 

I have heard that this task is very hard to do efficiently. I'm sure MS would have shipped a similar tool with Windows if it were easy, especially nowadays since hard drives are growing and growing.

Lycha
A: 

I haven't checked the implementation or the performance, but commons-io has a listFiles() method. It might be worth a try...

Guillaume
It just calls Java's listFiles()
martinus
+1  A: 

Given that we do not want to monitor file system events, could we then just keep track of the (name,size,time,checksum) of each file? The computation of the file checksum (or cryptographic hash, if you prefer) is going to be the bottleneck. You could just compute it once in the initial run, and re-compute it only when necessary subsequently (e.g. when files match on the other three attributes). Of course, we don't need to bother with this if we only want to track filenames and not file content.
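The (name, size, time, checksum) idea from the paragraph above can be sketched as follows. This is a minimal, hedged example of the checksum part only: the class name and buffer size are arbitrary choices, not anything from the original posts, and it uses only the standard java.security.MessageDigest API.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileChecksum {
    // Compute an MD5 digest of a file's contents, returned as a hex string.
    // This reads the whole file, so (as noted in the comments below) it should
    // only be invoked when the cheaper attributes (name, size, time) match.
    static String md5(File file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream in = new BufferedInputStream(new FileInputStream(file));
        try {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) > -1; ) {
                md.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Whether MD5 or a cheaper non-cryptographic hash is used, the expensive part is the full read of the file, so the win comes from skipping it whenever the other three attributes are unchanged.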

You mention that your Java implementation (similar to this) is very slow compared to "dir /s". I think there are two reasons for this:

  1. File.listFiles() is inherently slow. See this earlier question "Is there a workaround for Java’s poor performance on walking huge directories?", and this Java RFE "File.list(FilenameFilter) is not effective for huge directories" for more information. This shortcoming is apparently addressed by NIO.2, coming soon.

  2. Are you traversing your directories using recursion? If so, try a non-recursive approach, like pushing/popping directories to be visited on/off a stack. My limited personal experience suggests that the improvement can be quite significant.
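The stack-based traversal in point 2 can be sketched like this. A minimal, hedged version: the Walker class name and the choice to collect plain files into a list are illustrative, not from any code in the question.

```java
import java.io.File;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class Walker {
    // Iterative directory walk: push directories onto an explicit stack
    // instead of recursing, avoiding deep call stacks on big trees.
    static List<File> walk(File root) {
        List<File> files = new ArrayList<File>();
        Deque<File> stack = new ArrayDeque<File>();
        stack.push(root);
        while (!stack.isEmpty()) {
            File dir = stack.pop();
            File[] entries = dir.listFiles();
            if (entries == null) continue; // not a directory, or an I/O error
            for (File entry : entries) {
                if (entry.isDirectory()) {
                    stack.push(entry);
                } else {
                    files.add(entry);
                }
            }
        }
        return files;
    }
}
```

Note that this still pays the cost of File.listFiles() from point 1 on every directory; the stack only removes the recursion overhead.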

Zach Scrivena
A: 

How about something like this:

private static String execute( String command ) throws IOException  { 
    Process p = Runtime.getRuntime().exec( "cmd /c " + command );
    InputStream i = p.getInputStream();
    StringBuilder sb = new StringBuilder();
    for(  int c = 0 ; ( c =  i.read() ) > -1  ; ) {
        sb.append( ( char ) c );
    }
    i.close();
    return sb.toString();
}

(There is a lot of room for improvement there, since that version reads one char at a time; you can pick a better version from here to read the stream faster.)

And you use as argument:

"dir /b /s M:\tests\"
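A faster variant of the execute method above might look like this. This is a hedged sketch, not the poster's code: it swaps Runtime.exec for ProcessBuilder (which takes the command as separate arguments rather than one concatenated string) and reads the output through a buffer line by line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class Exec {
    // Run an external command and capture its output, buffered line by line
    // instead of one char at a time. On Windows the listing command would be
    // invoked as: execute("cmd", "/c", "dir", "/b", "/s", "M:\\tests\\")
    static String execute(String... command) throws IOException {
        Process p = new ProcessBuilder(command)
                .redirectErrorStream(true) // fold stderr into the same stream
                .start();
        BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
        StringBuilder sb = new StringBuilder();
        try {
            for (String line; (line = r.readLine()) != null; ) {
                sb.append(line).append('\n');
            }
        } finally {
            r.close();
        }
        return sb.toString();
    }
}
```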

If this is going to be used in a running app (rather than being a standalone app), you can discount the "warm-up" time of the JVM, which is about 1-2 seconds depending on your hardware.

You could give it a try to see what the impact is.

OscarRyz
I have tried it this way, but the dumb ordering of dir makes it difficult to use when scanning a hierarchy.
martinus
@martinus: mmhh, something along the lines of: String s = getFileList(); Arrays.sort( s.split( lineSep ) ); perhaps?
OscarRyz
+2  A: 

I've done this in my tool MetaMake. Here is the recipe:

  1. If the index is empty, add the root directory to the index with a timestamp == dir.lastModified()-1.
  2. Find all directories in the index
  3. Compare the timestamp of the directory in the index with the one from the filesystem. This is a fast operation since you have the full path (no scanning of all files/dirs in the tree involved).
  4. If the timestamp has changed, you have a change in this directory. Rescan it and update the index.
  5. If you encounter missing directories in this step, delete the subtree from the index
  6. If you encounter an existing directory, ignore it (will be checked in step 2)
  7. If you encounter a new directory, add it with timestamp == dir.lastModified()-1. Make sure it gets considered in step 2.

This will allow you to notice new and deleted files efficiently. Since you scan only known paths in step 2, this will be very fast: file systems are bad at enumerating all the entries in a directory, but they are fast when you know the exact name.

Drawback: You will not notice changed files. If you edit a file, that does not show up as a change to its directory. If you need this information too, you will have to repeat the algorithm above for the file nodes in your index. This time, you can ignore new/deleted files, because they have already been handled during the run over the directories.
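The core of steps 2-5 can be sketched like this. A hedged illustration only: the Map-based index structure is a hypothetical stand-in for whatever store MetaMake actually uses, and deciding what to do with each changed directory (rescan vs. delete subtree) is left to the caller.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class IndexRescan {
    // index maps directory path -> timestamp recorded at the last run.
    // Returns the directories that need attention: either they vanished
    // (delete their subtree from the index) or their lastModified() no
    // longer matches (rescan just that one directory).
    static List<File> findChangedDirs(Map<String, Long> index) {
        List<File> changed = new ArrayList<File>();
        for (Map.Entry<String, Long> e : index.entrySet()) {
            File dir = new File(e.getKey());
            if (!dir.exists()) {
                changed.add(dir); // step 5: directory was deleted
            } else if (dir.lastModified() != e.getValue()) {
                changed.add(dir); // step 4: entries added/removed here
            }
        }
        return changed;
    }
}
```

The key property is that every probe here is a stat of a known full path; no directory contents are enumerated unless a timestamp mismatch says they must be.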

[EDIT] Zach mentioned that timestamps are not enough. My reply is: There simply is no other way to do this. The notion of "size" is completely undefined for directories and changes from implementation to implementation. There is no API where you can register "I want to be notified of any change being made to something in the file system". There are APIs which work while your application is alive but if it stops or misses an event, then you're out of sync.

If the file system is remote, things get worse, because all kinds of network problems can cause you to get out of sync. So while my solution might not be 100% perfect and watertight, it will work for all but the most contrived exceptional cases. And it's the only solution which even gets this far.

Now there is a single kind of application which would want to preserve the timestamp of a directory after making a modification: a virus or worm. That would clearly break my algorithm, but then it's not meant to protect against a virus infection. If you want to protect against that, you must take a completely different approach.

The only other way to achieve what Zach wants is to build a new file system which logs this information permanently somewhere, sell it to Microsoft, and wait a few years (probably 10 or more) until everyone uses it.

Aaron Digulla
So far this is the best answer. Do you know if this is a filesystem-specific feature, or is it safe to use anywhere?
martinus
It works with Unix file systems and with Windows. I haven't tested with HFS, but I'd be surprised if there were any problems.
Aaron Digulla
@Aaron: Does this approach assume that the directory last modified timestamp changes whenever a file is added/removed? What happens if directory timestamps are modified externally, e.g. by "touch"? (cont'd)
Zach Scrivena
(cont'd) Also, the use of timestamps may not be good enough on some file systems, e.g. FAT only has 2-second precision so changes occurring rapidly may go undetected.
Zach Scrivena
@Zach: Then you read the dir for nothing. But there is little damage since you will only read that single dir.
Aaron Digulla
@Zach: Re FAT: if there is no way to get some information, there is little point complaining, is there? You can only ever be as good as your source of information. Which is why the war on terror is working so well ... but I digress.
Aaron Digulla
@Aaron: Not only will it read the dir for nothing, it might also *not* read directories that it should, e.g. if externally, some program is keeping the timestamp constant even when contents are changing.
Zach Scrivena
@Aaron: Re FAT: I take it you agree with me then that timestamps are sometimes insufficient.
Zach Scrivena
@Zach: Look. My solution works for several years, now. We use it to build an OS. It runs on hundreds of computers all over the world, with different kinds of Unix and Windows.
Aaron Digulla
@Aaron: No need to be defensive. I'm just pointing out the assumptions we need to make in order to use this approach.
Zach Scrivena
It's the best you can get and that's that. Take it or leave it.
Aaron Digulla
@Aaron: Is that how you sell your OS to customers?
Zach Scrivena
Luckily, AROS is OSS. As you know by now I'm socially inept. :)
Aaron Digulla
+6  A: 

Can you jump out of Java?

You could simply use

dir /b /s /on M:\tests\

The /on switch sorts by name.

If you pipe that out to a file, you can then diff it against the output from the last run, either in Java or in a batch file. Something like this in DOS (you'd need a diff tool, either diff from Cygwin or the excellent http://gnuwin32.sourceforge.net/packages/diffutils.htm):

dir /b /s /on m:\tests >new.txt
diff new.txt archive.txt >diffoutput.txt
del archive.txt
ren new.txt archive.txt

Obviously you could use a Java diff class as well, but I think the thing to accept is that a shell command is nearly always going to beat Java at a file-listing operation.
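If you do want to stay in Java for the comparison step, two sorted listings (what archive.txt and new.txt hold above) can be diffed with a single merge pass. A hedged sketch; the ListDiff name and the output-list signature are illustrative choices.

```java
import java.util.List;

public class ListDiff {
    // One merge pass over two sorted listings: entries only in 'current'
    // were added since the last run, entries only in 'previous' were removed.
    // Runs in O(n) given sorted input (e.g. from dir /b /s /on).
    static void diff(List<String> previous, List<String> current,
                     List<String> added, List<String> removed) {
        int i = 0, j = 0;
        while (i < previous.size() && j < current.size()) {
            int cmp = previous.get(i).compareTo(current.get(j));
            if (cmp == 0)     { i++; j++; }            // unchanged entry
            else if (cmp < 0) { removed.add(previous.get(i++)); }
            else              { added.add(current.get(j++)); }
        }
        while (i < previous.size()) removed.add(previous.get(i++));
        while (j < current.size())  added.add(current.get(j++));
    }
}
```

This is essentially what the external diff does for you, minus the context lines, so it's only worth it if shelling out twice (once for dir, once for diff) is awkward in your setup.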

Stewart Robinson
A: 

Try using git. Version control software is geared towards this kind of problem, and git has a good reputation for speed; it's specifically designed for working fast with local files. 'git diff --name-status' would get you what you want, I think.

Zac Thompson
I want to index large file systems with millions of files
martinus