I have a program in which each thread reads many lines at a time from a file, processes those lines, and writes them out to a different file. Four threads split the list of files to process among them. I'm seeing strange performance behaviour across two cases:

  • Four files with 50,000 lines each
    • Throughput starts at 700 lines/sec processed, declines to ~100 lines/sec
  • 30,000 files with 12 lines each
    • Throughput starts around 800 lines/sec and remains steady

This is internal software I'm working on so unfortunately I can't share any source code, but the main steps of the program are:

  1. Split list of files among four worker threads
  2. Start all threads.
  3. Thread reads up to 100 lines at once and stores in String[] array.
  4. Thread applies transformation to all lines in array.
  5. Thread writes lines to a file (not same as input file).
  6. Steps 3-5 repeat in each thread until all files are completely processed (a rough sketch of this loop follows the list).
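
Since I can't post the real source, here's a rough, anonymized sketch of what one worker thread does; every name and the `transform` body below are invented for illustration:

```
import java.io.*;
import java.util.List;

// Rough, anonymized sketch of one worker thread; names and the transform are placeholders.
class Worker implements Runnable {
    private final List<File> files;   // this worker's share of the file list

    Worker(List<File> files) {
        this.files = files;
    }

    @Override
    public void run() {
        String[] batch = new String[100];
        for (File in : files) {
            File out = new File(in.getPath() + ".out");   // output file differs from input
            try (BufferedReader reader = new BufferedReader(new FileReader(in));
                 BufferedWriter writer = new BufferedWriter(new FileWriter(out))) {
                int n;
                while ((n = readBatch(reader, batch)) > 0) {   // step 3: read up to 100 lines
                    for (int i = 0; i < n; i++) {
                        batch[i] = transform(batch[i]);        // step 4: per-line transformation
                    }
                    for (int i = 0; i < n; i++) {              // step 5: write the lines out
                        writer.write(batch[i]);
                        writer.newLine();
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    // Fill the array with up to batch.length lines; return how many were read.
    private static int readBatch(BufferedReader reader, String[] batch) throws IOException {
        int n = 0;
        String line;
        while (n < batch.length && (line = reader.readLine()) != null) {
            batch[n++] = line;
        }
        return n;
    }

    private static String transform(String line) {
        return line;   // placeholder; the real transformation is proprietary
    }
}
```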

What I don't understand is why 30,000 files with 12 lines each give me greater throughput than a few files with many lines each. I would have expected the overhead of opening and closing all those files to be greater than that of reading a single large file. In addition, the decline in performance in the large-files case appears exponential.

I've set the maximum heap size to 1024 MB and it appears to use 100 MB at most, so an overtaxed GC isn't the problem. Do you have any other ideas?

+1  A: 

Have you tried running a Java profiler? That will point out which parts of your code are running the slowest. Judging from this discussion, the NetBeans profiler seems like a good one to check out.

Karmastan
I've looked at the heap dump using Eclipse's MAT plug-in, but it wasn't particularly helpful. All it told me during the first case was that I have a lot of `String`s being stored, which I know. I'll take a look at the Netbeans one.
A B
I'm not really interested (immediately) in what's being stored on the heap. Instead, I'd like to know which statements are taking the longest to complete in both scenarios. That will at least tell you whether it's memory pressure (creating strings takes forever), file I/O (reading takes forever), file access (opening takes forever), or something else entirely!
Karmastan
+1  A: 

Likely your thread is holding on to the buffered String[]s for too long. Even though your heap is much larger than you need, the throughput could be suffering due to garbage collection. Look at how long you're holding on to those references.

You might also be waiting while the VM allocates more memory: asking for -Xmx1024m doesn't allocate that much up front; the JVM grabs more as it's needed. You could also try -Xms1024m -Xmx1024m (i.e. allocate all of the memory at start) to test whether that's the case.
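
As a quick way to see whether that's happening (a standalone sketch, not part of your program), you can print how much heap the JVM has actually committed versus the -Xmx ceiling:

```
// Standalone sketch: compare the heap the JVM has actually committed to the -Xmx ceiling.
public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        System.out.printf("committed: %d MB, max: %d MB%n",
                rt.totalMemory() / mb, rt.maxMemory() / mb);
    }
}
```

If the committed figure starts well below the max and creeps up during the run, the VM is still resizing the heap; with -Xms equal to -Xmx the two figures should match from the start.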

Steve B.
I do have both options enabled. The same array keeps getting reused, except that new Strings are allocated every time a line is read, so I assume the references that get overwritten, so to speak, can be collected by the GC immediately. Should I be explicitly setting the references to null as they're written out?
A B
A: 

You might have a stop-and-lock condition going on with your threads (one thread reads 100 lines into memory and holds onto the lock until it's done processing, instead of giving it up when it has finished reading from the file). I'm no expert on Java threading, but it's something to consider.

Eric
Hmm, each thread has its own Reader and Writer and no two threads ever touch the same file. Could there still be a locking issue?
A B
My guess would be that there's no locking issue if there's no sharing between threads. I think I like the answer you chose best.
Eric
+2  A: 

I am assuming that the files are located on the same disk, in which case you are probably thrashing the disk (or invalidating the disk/OS cache) with multiple threads attempting to read and write concurrently. A better pattern may be to have a dedicated reader/writer thread handle the I/O, and to alter your design so that the transform (which sounds expensive) is handled by multiple worker threads. The I/O thread can fetch the next batch and overlap writing with the transform operations as results become available. This should stop the disk thrashing and balance the I/O and CPU sides of your workload.
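
A minimal sketch of that shape (not the poster's code; the names, batch size, and `transform` body are all made up, and it assumes Java 8+ for the lambda): the main thread does all reading and writing, and only the transform is fanned out to a pool.

```
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: one thread owns all file I/O, a small pool runs the expensive transform.
public class SingleIoMultiTransform {

    // Stand-in for the real per-line processing.
    static String transform(String line) {
        return line.toUpperCase();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(4);

        // The main (I/O) thread reads a batch, farms the transform out to the pool,
        // and writes the results itself, so the disk only ever sees one reader/writer.
        for (String arg : args) {
            Path in = Paths.get(arg);
            Path out = Paths.get(arg + ".out");
            try (BufferedReader reader = Files.newBufferedReader(in);
                 BufferedWriter writer = Files.newBufferedWriter(out)) {
                List<String> batch = new ArrayList<>(100);
                String line;
                while ((line = reader.readLine()) != null) {
                    batch.add(line);
                    if (batch.size() == 100) {
                        processBatch(workers, batch, writer);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    processBatch(workers, batch, writer);
                }
            }
        }
        workers.shutdown();
    }

    // Submit every line in the batch to the pool, then write the results in order.
    static void processBatch(ExecutorService workers, List<String> batch, BufferedWriter writer)
            throws Exception {
        List<Future<String>> results = new ArrayList<>(batch.size());
        for (String line : batch) {
            results.add(workers.submit(() -> transform(line)));
        }
        for (Future<String> result : results) {
            writer.write(result.get());
            writer.newLine();
        }
    }
}
```

Writing batch N here blocks reading batch N+1; a queue between the I/O thread and the pool would give more overlap, but even this version keeps the disk access serial.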

chibacity
+2  A: 

From your numbers, I guess that GC is probably not the issue. I suspect this is normal behavior for a disk being accessed by many concurrent threads. When the files are big, the disk has to switch between the threads many times (incurring significant seek time), and the overhead becomes apparent. With small files, each one may be read as a single chunk with no extra seek time, so the threads don't interfere with each other much.

When working with a single, standard disk, serial I/O is usually better than parallel I/O.

Eyal Schneider
I'll try to recode it in such a way that the main thread reads many lines at once, allows several worker threads to process, and then the main thread writes the results all out again. Thanks!
A B
A: 

I would review this process. If you use a BufferedReader and BufferedWriter, there is no advantage to reading and processing 100 lines at a time. It's just added complication and another source of potential error. Process one line at a time and simplify your life.
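
For what it's worth, the one-line-at-a-time version is about as simple as it gets; a sketch, with `transform` standing in for whatever the real processing step is:

```
import java.io.*;

// Line-at-a-time copy-with-transform using buffered streams (sketch only).
public class LineAtATime {
    static String transform(String line) {
        return line;   // placeholder for the real per-line processing
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             BufferedWriter out = new BufferedWriter(new FileWriter(args[1]))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(transform(line));
                out.newLine();
            }
        }
    }
}
```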

EJP