views:

144

answers:

6

Hi guys,

I am planning to do a standard directory listing to get a list or array of the contents of a directory.

I know this is easy by using

File f = new File("C:/testDir");
File[] files = f.listFiles();

The problem is that I need a list/array/vector of URLs. So my thought was to convert the files to URLs. With the org.apache.commons.io.FileUtils library this is possible with the following simple code:

URL[] urls = FileUtils.toURLs(files);

This does exactly what I need, but it is unfortunately very slow (especially for directories with thousands of files), even though it just uses a for loop and converts every single File object with the toURL() method.

Does anyone know a way to do this task with better performance?

+1  A: 

Your solution is fine, and you shouldn't worry about performance, unless you have tens of thousands of files in that directory.

A performance optimization might be to cache the array of URLs if this functionality is used a lot.
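
For illustration, a minimal sketch of such a cache (the class name and the choice to key on the absolute path are my own assumptions, not part of the original answer; note that it never invalidates entries, so it only suits directories whose contents don't change):

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.io.FileUtils;

public class UrlListingCache {
    private final Map<String, URL[]> cache = new HashMap<String, URL[]>();

    // Converts the directory contents once and reuses the result on later calls.
    public synchronized URL[] listUrls(File dir) throws IOException {
        String key = dir.getAbsolutePath();
        URL[] urls = cache.get(key);
        if (urls == null) {
            urls = FileUtils.toURLs(dir.listFiles());
            cache.put(key, urls);
        }
        return urls;
    }
}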

That said - measure how long it takes to perform this on a directory with 2k files, and then optimize.

Bozho
His question states that he does indeed have thousands of files :P
willcodejavaforfood
I actually meant tens of thousands.. :) (to denote a big number)
Bozho
@Bozho - Fair enough :)
willcodejavaforfood
Maybe I was not precise enough. I have directories with anywhere from 1,000 up to several tens of thousands of files.
Ahaggar
@Ahaggar ok, measure it.
Bozho
A: 

The most effective way of speeding this up would be to write the routines in C, then invoke them via JNI (Java Native Interface). Of course, it's not fully portable ...
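
For what it's worth, a rough sketch of what the Java side of such a JNI binding might look like (the library name, the method signature, and the C implementation behind it are all hypothetical):

public class NativeDirectoryLister {

    static {
        // Loads the hypothetical native library (e.g. libfilelister.so / filelister.dll)
        // that would contain the C implementation of the listing routine.
        System.loadLibrary("filelister");
    }

    // Declared native: the body would be written in C and resolved through JNI at runtime.
    public static native String[] listFileUrls(String directoryPath);
}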

Jas
There is a significant overhead in invoking JNI. Even if there wasn't, there's no evidence to suggest writing the same function in C would perform any better than the Java version.
gregcase
Maybe you are right about the JNI overhead. Regarding your second comment on C and speed vs. Java, I just can't believe that anyone can dispute what I said. There shouldn't need to be, as you said, "evidence" for this - the fact that C is natively compiled gives you a very clear idea of the speed gains. Otherwise everyone would be writing low-level functions (and yes, a function listing tens of thousands of files should be low-level) in Java, Python, PHP, or other interpreted languages.
Jas
All I am saying is that the Hotspot JIT compiler often means java can run just as fast as an equivalent C program, depending on the application. On a file based task similar to what the OP described, either program is mostly going to be waiting on the underlying system I/O.
gregcase
I don't know about PHP and Python (and I do not want to distract from the problem at hand), but Java's JIT and HotSpot technology have become pretty sophisticated by now.
DerMike
@gregcase, there are at least two problems with your reasoning: 1) No matter how efficient, no JIT can resolve the issue of automatic garbage collection, which has proven itself in the past to be inefficient. There's no way the cost of allocating 10,000 objects on the heap is the same for the two, not to mention the timing of deallocation, which is fine-grained and controllable in C. 2) No matter how efficient, the JIT is still an intermediary in the process execution, thus adding cost. If I wanted to use such arguments, I would have thrown in something like code optimizers.
Jas
@Jas - there is no **evidence** to suggest a native C version would be faster. If you are sure (for whatever reason) ... why don't you actually implement the pure Java and Java + JNI + C solutions, and test which is faster.
Stephen C
@Jas - actually, for some application (and this is likely to be one of them) memory management using GC is faster than using malloc / free, assuming that the heap size is set appropriately.
Stephen C
@Jas - the cost of JITing is going to be there whether you are running a pure Java or a Java + JNI + C application. The cost of JIT compiling (say) 5 lines more Java will be insignificant compared with the time taken to run the entire application. Remember, we are not comparing a Java application to a C application here. We are comparing 2 Java applications which differ in a small (probably tiny) part of the code.
Stephen C
@Stephen C - Why should I be proving the already proven basic principles of Computer Systems Architecture? I am not the one disputing the speed of natively compiled programs vs. interpreted ones. You see, there is a reason why we study math, physics, computer science, etc.- It's because we don't want to prove the already proven theories. But feel free to do it yourself and post the benchmark data. Speaking of memory management, you missed my point because I was talking about efficiency through timing granularity, not the actual speed of alloc/dealloc.
Jas
@Stephen C - I don't know about you, but I started this discussion by comparing speeds of operation execution, not the actual apps. See my answer as well as comments and it will be clear.
Jas
@Jas - You are blustering. If it is "already proven", show me the proof. Where's the paper that proves it? Where's the data? The point is that benchmarks don't and cannot prove generalizations like "automatic storage management has proven itself to be inefficient". And if you want some hard scientific data that DISPROVES it, read this: http://www.cs.colorado.edu/department/publications/reports/docs/CU-CS-573-92.pdf (And bear in mind that Zorn was comparing malloc/free with a 1990s-era conservative collector. Modern collectors, as used in Java, are much faster.)
Stephen C
+1  A: 

If you really have that many files you might want to use several threads. Each of n threads converts 1/n of the files.

For this to be efficient you need a really large number of files.
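
A minimal sketch of that idea, assuming the File[] from listFiles() is already available and using a plain ExecutorService (the class name and the way the pool size is passed in are illustrative only):

import java.io.File;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelUrlConverter {
    // Converts each File to a URL, spreading the work over the given number of threads.
    public static URL[] toUrls(File[] files, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<URL>> futures = new ArrayList<Future<URL>>(files.length);
        for (final File file : files) {
            futures.add(pool.submit(new Callable<URL>() {
                public URL call() throws Exception {
                    return file.toURI().toURL(); // the same conversion, done in parallel
                }
            }));
        }
        URL[] urls = new URL[files.length];
        for (int i = 0; i < files.length; i++) {
            urls[i] = futures.get(i).get(); // preserves the original order
        }
        pool.shutdown();
        return urls;
    }
}

Whether this actually helps depends on how much of the time is spent in CPU-bound string/URL work versus waiting on filesystem I/O.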

kasten
+8  A: 

The only simple optimization would be to reduce object creation, which will make a modest improvement in performance. Instead of using listFiles(), which creates a whole slew of File objects, use list() to get a String array of just the file names, not the paths, and create the URLs directly. String creation and storage have less object overhead in this case. The string manipulation could obviously be made faster and more robust as well, although it probably won't make a huge difference.

Something like:

ArrayList<URL> urls = new ArrayList<URL>(); // or use an array if you prefer
for (String name : f.list())                // list() avoids creating File objects
    urls.add(new URL("file://" + f.getPath() + "/" + name)); // may throw MalformedURLException
Peter DeWeese
Thanks guys for the answers. This seems to be both simple and efficient. A first test gave me a speedup of about a factor of 5, tested with about 4000 files. I'll test a little more and then give you feedback.
Ahaggar
+1. Also, I checked this out, and found that using the constructor new URL("file", f.getPath(), name) improves the performance of this approach even further.
Tomas Narros
You might get a tiny gain by passing in the expected size when constructing the list, i.e. new ArrayList<URL>(files.length). This avoids re-allocating the underlying array in the main loop.
gregcase
Tested with 4096 files. Time with the old version: ~13500 ms; time with the above solution: 2250 ms. I think this speedup will do.
Ahaggar
@Tomás: You're right, but it only changes the time by a few milliseconds. Thanks anyway ;) I'll use it this way.
Ahaggar
+2  A: 

Creating a new URL object, instead of invoking the toURL() method, seems to be more efficient. I have checked this out:

    File parent = new File("./doc");
    File[] listado = parent.listFiles();
    long t0 = 0L;

    // Benchmark 1: convert each File with toURL()
    try {
        t0 = System.currentTimeMillis();
        for (int k = 0; k < 10000; k++) {
            URL[] listaArchivos = new URL[listado.length];
            for (int i = 0; i < listado.length; i++) {
                listaArchivos[i] = listado[i].toURL();
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    System.out.println("Files:" + listado.length + "; Time 1: " + (System.currentTimeMillis() - t0) + " ms");

    // Benchmark 2: build each URL directly by string concatenation
    try {
        t0 = System.currentTimeMillis();
        for (int k = 0; k < 10000; k++) {
            URL[] listaArchivos = new URL[listado.length];
            for (int i = 0; i < listado.length; i++) {
                listaArchivos[i] = new URL("file://" + listado[i].getAbsolutePath());
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    System.out.println("Files:" + listado.length + "; Time 2: " + (System.currentTimeMillis() - t0) + " ms");

My output is:

Files:14; Time 1: 1985 ms
Files:14; Time 2: 516 ms
Tomas Narros
Is it the same if you run your second algorithm first? Just to make sure the second algorithm doesn't profit from some internal caches...
Andreas_D
@Andreas_D I hadn't thought about this: let's check it.
Tomas Narros
@Andreas_D: Done. Now it writes: Files:14; Time 2: 532 ms / Files:14; Time 1: 1937 ms. So the "new URL(String)" approach is still faster. To my surprise, I must say.
Tomas Narros
+1  A: 

Other people have responded saying that constructing the URLs by string concatenation (e.g. "file://" + dirPath + "/" + file.getName()) is a lot faster than calling File.toURI().toString(). For instance, the OP reports a 5-fold speedup. I wondered why there was such a difference.

Apparently, one reason is that the toURI() method checks to see if this is a directory, and appends a / if it is. The corollary is that a URL for a directory produced by String concatenation won't have a trailing /.

There's another caveat with creating "file:" URLs by string concatenation: if the names in the file's path contain reserved characters (per the URL / URI specs), string concatenation may produce a malformed URL / URI. The reserved characters typically need to be percent-escaped. Furthermore, on Windows it is not entirely clear how drive letters should be represented in "file:" URLs.
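
A small sketch illustrating both points (the path with spaces is a made-up example; the exact output depends on the platform and working directory):

import java.io.File;
import java.net.URL;

public class FileUrlComparison {
    public static void main(String[] args) throws Exception {
        File file = new File("test dir", "my file.txt"); // hypothetical names containing spaces

        // Plain concatenation leaves the spaces (and any other reserved characters) unescaped.
        URL concatenated = new URL("file://" + file.getAbsolutePath());

        // File.toURI() percent-escapes reserved characters and normalizes the path.
        URL escaped = file.toURI().toURL();

        System.out.println(concatenated); // e.g. file:///home/user/test dir/my file.txt
        System.out.println(escaped);      // e.g. file:/home/user/test%20dir/my%20file.txt
    }
}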

Stephen C