tags:

views:

470

answers:

3

Okay, I have this function that recursively prints the names of all files in a directory. The problem is that it's very slow because it reads the files from a network device, and with my current code it has to access the device over and over again.

What I would want is to first load all the files from the directory recursively, and only then go through them with the regex to filter out the ones I don't want. Unless anyone has a better suggestion. I've never done anything like this before.

public static void printFnames(String sDir){
  // listFiles() returns null if sDir is not a directory or an I/O error occurs
  File[] faFiles = new File(sDir).listFiles();
  if(faFiles == null){
    return;
  }
  for(File file: faFiles){
    if(file.getName().matches("^(.*?)")){
      System.out.println(file.getAbsolutePath());
    }
    if(file.isDirectory()){
      printFnames(file.getAbsolutePath());
    }
  }
}

This is just a test; later on I'm not going to use the code like this. Instead I'm going to add the path and modification date of every file that matches an advanced regex to an array.
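One way to do what the question describes is to collect everything in a single recursive pass and filter afterwards. A minimal sketch, assuming plain java.io; the FileEntry holder class and the two-pass method names are made up for illustration:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class CollectThenFilter {
    // Simple holder for the data the question mentions: path plus modification date
    public static class FileEntry {
        public final String path;
        public final long lastModified;
        FileEntry(String path, long lastModified) {
            this.path = path;
            this.lastModified = lastModified;
        }
    }

    // Pass 1: walk the tree once and record every file (no filtering yet)
    public static void collect(File dir, List<FileEntry> out) {
        File[] entries = dir.listFiles();
        if (entries == null) return; // not a directory, or an I/O error occurred
        for (File f : entries) {
            if (f.isDirectory()) {
                collect(f, out);
            } else {
                out.add(new FileEntry(f.getAbsolutePath(), f.lastModified()));
            }
        }
    }

    // Pass 2: filter the in-memory list; no further filesystem access is needed
    public static List<FileEntry> filter(List<FileEntry> all, String regex) {
        List<FileEntry> matched = new ArrayList<>();
        for (FileEntry e : all) {
            if (new File(e.path).getName().matches(regex)) {
                matched.add(e);
            }
        }
        return matched;
    }
}
```

Note that pass 1 still makes the same number of network round trips as the original code; splitting the work only avoids repeating them when you re-filter with a different regex.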

+3  A: 

Assuming this is actual production code you'll be writing, then I suggest using the solution to this sort of thing that's already been built - Apache Commons IO, specifically FileUtils.listFiles(). It handles nested directories and filters (based on name, modification time, etc.).

For example, for your regex:

IOFileFilter filter = new RegexFileFilter("^(.*?)");
Collection<File> files = FileUtils.listFiles(new File(sDir), filter, DirectoryFileFilter.DIRECTORY);

This will recursively search for files matching the ^(.*?) regex, returning the results as a collection.

It's worth noting that this will be no faster than rolling your own code, it's doing the same thing - trawling a filesystem is just slow. The difference is, the Apache Commons version will have no bugs in it.

skaffman
I looked there, and from that I would use http://commons.apache.org/io/api-release/index.html?org/apache/commons/io/FileUtils.html to get all the files from the directory and subdirectories, and then search through the files so that they match my regex. Or am I wrong?
Hultner
@Hultner: It's easier than that - see my edited answer
skaffman
Yeah, the problem is that it takes over an hour to scan the folder, and doing that every time I start the program to check for updates is extremely annoying. Would it be faster if I wrote this part of the program in C and the rest in Java, and if so, would the difference be significant? For now I changed the code on the isDirectory line so that a directory also has to match a regex to be included in the search. I see that in your example it says DirectoryFileFilter.DIRECTORY; I guess I could use a regex filter there.
Hultner
@Hultner: Writing this in C would make it no faster - it's limited by your disk speed, not your processor speed. Filtering directories by regex could make a difference, though, depending on your directory structure.
skaffman
A: 

it feels like it's stupid to access the filesystem and get the contents of every subdirectory separately instead of getting everything at once.

Your feeling is wrong. That's how filesystems work. There is no faster way (except when you have to do this repeatedly or for different patterns, you can cache all the file paths in memory, but then you have to deal with cache invalidation i.e. what happens when files are added/removed/renamed while the app runs).

Michael Borgwardt
The thing is, I want to load all files of a certain type with a certain name format into a library that is presented to the user, and every time the app is started the library is supposed to be updated, but it takes forever to update. The only solution I have is to run the update in the background, but it's still annoying that it takes so long until all the new files are loaded. There must be a better way to do it, or at least a better way to update the database. It feels stupid for it to go through all the files it has already gone through once. Is there a fast way to find only the updates?
Hultner
@Hultner: Java 7 will include a facility for getting notified of filesystem updates, but that would still only work while the app is running, so unless you want to have a background service run all the time, it would not help. There might be special issues with network shares as Kevin describes, but as long as you depend on scanning through the entire directory tree, there really is no better way.
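The Java 7 facility mentioned here is the java.nio.file.WatchService API. A minimal sketch of registering a directory and picking up a change notification, assuming the file name and the firstEvent helper are made up for illustration:

```java
import java.nio.file.*;
import java.util.concurrent.TimeUnit;

public class WatchDemo {
    // Registers dir with a WatchService, creates a file inside it,
    // and blocks (up to 10s) until the filesystem reports the change
    public static String firstEvent(Path dir) throws Exception {
        try (WatchService watcher = dir.getFileSystem().newWatchService()) {
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            Files.createFile(dir.resolve("new-track.mp3"));
            WatchKey key = watcher.poll(10, TimeUnit.SECONDS);
            if (key == null) return null; // no event arrived in time
            WatchEvent<?> event = key.pollEvents().get(0);
            return event.kind().name() + ":" + event.context();
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("watched");
        System.out.println(firstEvent(dir));
    }
}
```

As noted, this only helps while the app (or a background service) is running; it does not replace the initial full scan.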
Michael Borgwardt
A: 

Java's interface for reading filesystem folder contents is not very performant (as you've discovered). JDK 7 fixes this with a completely new filesystem API, which should bring native-level performance to these sorts of operations.

The core issue is that Java makes a native system call for every single file. On a low-latency interface, this is not that big of a deal - but on a network with even moderate latency, it really adds up. If you profile your algorithm above, you'll find that the bulk of the time is spent in the pesky isDirectory() call - that's because you incur a round trip for every single call to isDirectory(). Most modern OSes can provide this sort of information when the list of files/folders is originally requested (as opposed to querying each individual file path for its properties).
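In JDK 7 this shows up as Files.walkFileTree, which hands each entry's attributes to the visitor during the traversal, so no separate isDirectory() round trip is needed. A minimal sketch (the listAll method name is made up):

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

public class WalkDemo {
    // Collects all regular-file paths in one traversal; the attrs argument
    // already distinguishes file from directory, so no extra stat call is made
    public static List<String> listAll(Path root) throws IOException {
        final List<String> files = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                files.add(file.toAbsolutePath().toString());
                return FileVisitResult.CONTINUE;
            }
        });
        return files;
    }
}
```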

If you can't wait for JDK 7, one strategy for addressing this latency is to go multi-threaded and use an ExecutorService with a maximum number of threads to perform your recursion. It's not great (you have to deal with locking of your output data structures), but it'll be a heck of a lot faster than doing this single-threaded.
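A minimal sketch of that multi-threaded approach, using a pending-task counter to detect when the recursion has drained; the class and method names are made up, and a thread-safe queue stands in for the locked output structure:

```java
import java.io.File;
import java.util.Queue;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelScan {
    // Lists every file under root, scanning each subdirectory as a separate task
    public static Queue<String> scan(File root, int nThreads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        Queue<String> results = new ConcurrentLinkedQueue<>(); // thread-safe output
        AtomicInteger pending = new AtomicInteger(1);          // tasks not yet finished
        CountDownLatch done = new CountDownLatch(1);
        scanDir(root, pool, results, pending, done);
        done.await(); // block until the last directory task has completed
        pool.shutdown();
        return results;
    }

    private static void scanDir(final File dir, final ExecutorService pool,
                                final Queue<String> results, final AtomicInteger pending,
                                final CountDownLatch done) {
        File[] entries = dir.listFiles();
        if (entries != null) {
            for (final File f : entries) {
                if (f.isDirectory()) {
                    pending.incrementAndGet(); // count the subtask before submitting it
                    pool.submit(new Runnable() {
                        public void run() { scanDir(f, pool, results, pending, done); }
                    });
                } else {
                    results.add(f.getAbsolutePath());
                }
            }
        }
        if (pending.decrementAndGet() == 0) {
            done.countDown(); // last outstanding task finished: traversal is complete
        }
    }
}
```

The counter starts at 1 for the root directory and is incremented before each subdirectory task is submitted, so it can only reach zero once every directory has been scanned.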

In all of your discussions about this sort of thing, I highly recommend that you compare against the best you could do using native code (or even a command-line script that does roughly the same thing). Saying that it takes an hour to traverse a network structure doesn't really mean much on its own. Telling us that you can do it natively in 7 seconds, but that it takes an hour in Java, will get people's attention.

Kevin Day