views:

1244

answers:

9

I am trying to process files one at a time that are stored over a network. Reading the files is fast due to buffering is not the issue. The problem I have is just listing the directories in a folder. I have at least 10k files per folder over many folders.

Performance is super slow since File.list() returns an array instead of an iterable. Java goes off and collects all the names in a folder and packs it into an array before returning.

The bug entry for this is http://bugs.sun.com/view_bug.do;jsessionid=db7fcf25bcce13541c4289edeb4?bug_id=4285834 and doesn't have a work around. They just say this has been fixed for JDK7.

A few questions:

  1. Does anybody have a workaround to this performance bottleneck?
  2. Am I trying to achieve the impossible? Is performance still going to be poor even if it just iterates over the directories?
  3. Could I use the beta JDK7 builds that have this functionality without having to build my entire project on it?
+1  A: 

A non-portable solution would be to make native calls to the operating system and stream the results.

For Linux

You can look at something like readdir. You can walk the directory structure like a linked list and return results in batches or individually.

For Windows

In windows the behavior would be fairly similar using FindFirstFile and FindNextFile apis.

grepsedawk
for now, would agree that you might want to wrap FFF/FNF as suggested here.
John Gardner
Bear in mind that bridging the native barrier will have performance impacts itself. If you take this approach, you might want to consider batching several FNF results per JNI call (or possibly use a DirectByteBuffer for the data transfer)
Kevin Day
+4  A: 

An alternative is to have the files served over a different protocol. As I understand you're using SMB for that and java is just trying to list them as a regular file.

The problem here might not be java alone ( how does it behaves when you open that directory with Microsoft Explorer x:\shared ) In my experience it also take a considerably amount of time.

You can change the protocol to something like HTTP, only to fetch the file names. This way you can retrieve the list of files over http ( 10k lines should't be too much ) and let the server deal with file listing. This would be very fast, since it will run with local resources ( those in the server )

Then when you have the list, you can process them one by exactly the way you're doing right now.

The keypoint is to have an aid mechanism in the other side of the node.

Is this feasible?

Today:

File [] content = new File("X:\\remote\\dir").listFiles();

for ( File f : content ) {
    process( f );
}

Proposed:

String [] content = fetchViaHttpTheListNameOf("x:\\remote\\dir");

for ( String fileName : content ) {
    process( new File( fileName ) );
}

The http server could be a very small small and simple file.

If this is the way you have it right now, what you're doing is to fetch all the 10k files information to your client machine ( I don't know how much of that info ) when you only need the file name for later processing.

If the processing is very fast right now it may be slowed down a bit. This is because the information prefetched is no longer available.

Give it a try.

OscarRyz
that implies a lot of things though: a webserver, implementing some kind of security that SMB/etc already have, and so on.
John Gardner
Not really. Is like have a LAN website isn't? I think this is an already solved issue. Although now that you mention it, I have no idea how to setup a website that could be visible only in the LAN. I'm pretty sure a .exe that server files and uses Windows authentication already exists.
OscarRyz
Isn't IIS support this kind of functionality?... I think it can authenticate window users very easily. But I agree, It is not transparent. However I find it far easier / faster than wrap native code via JNI and still using SMB as the protocol.
OscarRyz
+1  A: 

Although it's not pretty, I solved this kind of problem once by piping the output of dir/ls to a file before starting my app, and passing in the filename.

If you needed to do it within the app, you could just use system.exec(), but it would create some nastiness.

You asked. The first form is going to be blazingly fast, the second should be pretty fast as well.

Be sure to do the one item per line (bare, no decoration, no graphics), full path and recurse options of your selected command.

EDIT:

30 minutes just to get a directory listing, wow.

It just struck me that if you use exec(), you can get it's stdout redirected into a pipe instead of writing it to a file.

If you did that, you should start getting the files immediately and be able to begin processing before the command has completed.

The interaction may actually slow things down, but maybe not--you might give it a try.

Wow, I just went to find the syntax of the .exec command for you and came across this, possibly exactly what you want (it lists a directory using exec and "ls" and pipes the result into your program for processing):

http://java.sun.com/developer/JDCTechTips/2003/tt0304.html

Bill K
this is a good solution for static trees or trees that don't change often, but you need to transverse it often
Pyrolistical
Yeah, I used it for cases where it was a run-once batch type of program. If you did it often, well I'm not sure that using exec() to recreate that file wouldn't work--but regardless it's an inelegant, error-prone solution.
Bill K
I just tried this and a "dir /B /S ..." took about 30 mins to collect over 200MB of paths, while it took about 3.5 hours to transverse the same tree using File.list()
Pyrolistical
I am going to accept this solution, as it will probably solve the issue most of the time.
Pyrolistical
Bill: Mention that you must use a flag which tells dir/ls *not* to sort the results (so start getting names immediately): "ls -U". There doesn't seem to be such an option for "dir" :/
Aaron Digulla
+1  A: 

I doubt the problem is relate to the bug report you referenced. The issue there is "only" memory usage, but not necessarily speed. If you have enough memory the bug is not relevant for your problem.

You should measure whether your problem is memory related or not. Turn on your Garbage Collector log and use for example gcviewer to analyze your memory usage.

I suspect that it has to do with the SMB protocol causing the problem. You can try to write a test in another language and see if it's faster, or you can try to get the list of filenames through some other method, such as described here in another post.

kohlerm
+1  A: 

See whether the workaround for http://bugs.sun.com/view_bug.do?bug_id=5050516 makes a difference for your situation.

Ilja Preuß
Nah. That is related with: Win32ShellFolder that is used when displaying system icons ( the DesktopIcon, networks and the like ) and the bug is related with ZipFile. This is very likely a totally different issue.
OscarRyz
Yes, likely. Still, might be worth checking. <shrug>
Ilja Preuß
A: 

I wonder why there are 10k files in a directory. Some file systems do not work well with so many files. There are specifics limitations for file systems like max amount of files per directory and max amount of levels of subdirectory.

I solve a similar problem with an iterator solution.

I needed to walk across huge directorys and several levels of directory tree recursively.

I try FileUtils.iterateFiles() of Apache commons io. But it implement the iterator by adding all the files in a List and then returning List.iterator(). It's very bad for memory.

So I prefer to write something like this:

private static class SequentialIterator implements Iterator<File> {
    private DirectoryStack dir = null;
    private File current = null;
    private long limit;
    private FileFilter filter = null;

    public SequentialIterator(String path, long limit, FileFilter ff) {
     current = new File(path);
     this.limit = limit;
     filter = ff;
     dir = DirectoryStack.getNewStack(current);
    }

    public boolean hasNext() {
     while(walkOver());
     return isMore && (limit > count || limit < 0) && dir.getCurrent() != null;
    }

    private long count = 0;

    public File next() {
     File aux = dir.getCurrent();
     dir.advancePostition();
     count++;
     return aux;
    }

    private boolean walkOver() {
     if (dir.isOutOfDirListRange()) {
      if (dir.isCantGoParent()) {
       isMore = false;
       return false;
      } else {
       dir.goToParent();
       dir.advancePostition();
       return true;
      }
     } else {
      if (dir.isCurrentDirectory()) {
       if (dir.isDirectoryEmpty()) {
        dir.advancePostition();
       } else {
        dir.goIntoDir();
       }
       return true;
      } else {
       if (filter.accept(dir.getCurrent())) {
        return false;
       } else {
        dir.advancePostition();
        return true;
       }
      }
     }
    }

    private boolean isMore = true;

    public void remove() {
     throw new UnsupportedOperationException();
    }

}

Note that the iterator stop by an amount of files iterateds and it has a FileFilter also.

And DirectoryStack is:

public class DirectoryStack {
    private class Element{
     private File files[] = null;
     private int currentPointer;
     public Element(File current) {
      currentPointer = 0;
      if (current.exists()) {
       if(current.isDirectory()){
        files = current.listFiles();
        Set<File> set = new TreeSet<File>();
        for (int i = 0; i < files.length; i++) {
         File file = files[i];
         set.add(file);
        }
        set.toArray(files);
       }else{
        throw new IllegalArgumentException("File current must be directory");
       }
      } else {
       throw new IllegalArgumentException("File current not exist");
      }

     }
     public String toString(){
      return "current="+getCurrent().toString();
     }
     public int getCurrentPointer() {
      return currentPointer;
     }
     public void setCurrentPointer(int currentPointer) {
      this.currentPointer = currentPointer;
     }
     public File[] getFiles() {
      return files;
     }
     public File getCurrent(){
      File ret = null;
      try{
       ret = getFiles()[getCurrentPointer()];
      }catch (Exception e){
      }
      return ret;
     }
     public boolean isDirectoryEmpty(){
      return !(getFiles().length>0);
     }
     public Element advancePointer(){
      setCurrentPointer(getCurrentPointer()+1);
      return this;
     }
    }
    private DirectoryStack(File first){
     getStack().push(new Element(first));
    }
    public static DirectoryStack getNewStack(File first){
     return new DirectoryStack(first);
    }
    public String toString(){
     String ret = "stack:\n";
     int i = 0;
     for (Element elem : stack) {
      ret += "nivel " + i++ + elem.toString()+"\n";
     }
     return ret;
    }
    private Stack<Element> stack=null;
    private Stack<Element> getStack(){
     if(stack==null){
      stack = new Stack<Element>();
     }
     return stack;
    }
    public File getCurrent(){
     return getStack().peek().getCurrent();
    }
    public boolean isDirectoryEmpty(){
     return getStack().peek().isDirectoryEmpty();
    }
    public DirectoryStack downLevel(){
     getStack().pop();
     return this;
    }
    public DirectoryStack goToParent(){
     return downLevel();
    }
    public DirectoryStack goIntoDir(){
     return upLevel();
    }
    public DirectoryStack upLevel(){
     if(isCurrentNotNull())
      getStack().push(new Element(getCurrent()));
     return this;
    }
    public DirectoryStack advancePostition(){
     getStack().peek().advancePointer();
     return this;
    }
    public File[] peekDirectory(){
     return getStack().peek().getFiles();
    }
    public boolean isLastFileOfDirectory(){
     return getStack().peek().getFiles().length <= getStack().peek().getCurrentPointer();
    }
    public boolean gotMoreLevels() {
     return getStack().size()>0;
    }
    public boolean gotMoreInCurrentLevel() {
     return getStack().peek().getFiles().length > getStack().peek().getCurrentPointer()+1;
    }
    public boolean isRoot() {
     return !(getStack().size()>1);
    }
    public boolean isCurrentNotNull() {
     if(!getStack().isEmpty()){
      int currentPointer = getStack().peek().getCurrentPointer();
      int maxFiles = getStack().peek().getFiles().length;
      return currentPointer < maxFiles;
     }else{
      return false;
     }
    }
    public boolean isCurrentDirectory() {
     return getStack().peek().getCurrent().isDirectory();
    }
    public boolean isLastFromDirList() {
     return getStack().peek().getCurrentPointer() == (getStack().peek().getFiles().length-1);
    }
    public boolean isCantGoParent() {
     return !(getStack().size()>1);
    }
    public boolean isOutOfDirListRange() {
     return getStack().peek().getFiles().length <= getStack().peek().getCurrentPointer();
    }

}
damian
?You use listFiles() anyways, which would end up with the same performance.You code does a whole lot of nothing. listFiles().iterator() would have been the same.
Pyrolistical
If you want to walk recursively through 5 levels of subdirectory with 10.000 files per directory listFiles.iterator() it kill your memory.For several files in the same directory it's the same.
damian
+1  A: 

If you need to eventually process all files, then having Iterable over String[] won't give you any advantage, as you'll still have to go and fetch the whole list of files.

Yoni Roit
A: 

Using an Iterable doesn't imply that the Files will be streamed to you. In fact its usually the opposite. So an array is typically faster than an Iterable.

Peter Lawrey
A: 

Are you sure it's due to Java, not just a general problem with having 10k entries in one directory, particularly over the network?

Have you tried writing a proof-of-concept program to do the same thing in C using the win32 findfirst/findnext functions to see whether it's any faster?

I don't know the ins and outs of SMB, but I strongly suspect that it needs a round trip for every file in the list - which is not going to be fast, particularly over a network with moderate latency.

Having 10k strings in an array sounds like something which should not tax the modern Java VM too much either.

MarkR