I have a program in which each thread reads many lines at a time from a file, processes the lines, and writes them out to a different file. Four threads split the list of files to process among them. I'm seeing strange performance behavior across two cases:
- Four files with 50,000 lines each
  - Throughput starts at 700 lines/sec, then declines to ~100 lines/sec
- 30,000 files with 12 lines each
  - Throughput starts around 800 lines/sec and remains steady
This is internal software I'm working on, so unfortunately I can't share the actual source code, but the main steps of the program are (see the sketch after the list):
- Split the list of files among four worker threads.
- Start all threads.
- Each thread reads up to 100 lines at once and stores them in a String[] array.
- The thread applies the transformation to all lines in the array.
- The thread writes the lines to a file (not the same as the input file).
- Steps 3-5 repeat for each thread until all files are completely processed.
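Since I can't post the real code, here's a minimal sketch of the structure in plain Java. The class name, the output-file naming, and the transform body are all placeholders, not the real logic:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class BatchProcessor {
    private static final int BATCH_SIZE = 100;
    private static final int THREAD_COUNT = 4;

    public static void main(String[] args) throws InterruptedException {
        // In the real program the file list comes from elsewhere.
        List<Path> inputFiles = new ArrayList<>();
        for (String arg : args) {
            inputFiles.add(Paths.get(arg));
        }

        // Split the file list round-robin into four slices, one per worker thread.
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < THREAD_COUNT; i++) {
            List<Path> slice = new ArrayList<>();
            for (int j = i; j < inputFiles.size(); j += THREAD_COUNT) {
                slice.add(inputFiles.get(j));
            }
            Thread t = new Thread(() -> processFiles(slice));
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            t.join();
        }
    }

    private static void processFiles(List<Path> files) {
        for (Path in : files) {
            // Placeholder output naming; the real program writes to a different file.
            Path out = in.resolveSibling(in.getFileName() + ".out");
            try (BufferedReader reader = Files.newBufferedReader(in);
                 BufferedWriter writer = Files.newBufferedWriter(out)) {
                String[] batch = new String[BATCH_SIZE];
                int count;
                // Read up to 100 lines, transform them, write them out, repeat.
                while ((count = readBatch(reader, batch)) > 0) {
                    for (int i = 0; i < count; i++) {
                        batch[i] = transform(batch[i]);
                    }
                    for (int i = 0; i < count; i++) {
                        writer.write(batch[i]);
                        writer.newLine();
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    // Fills the array with up to BATCH_SIZE lines; returns how many were read.
    private static int readBatch(BufferedReader reader, String[] batch) throws IOException {
        int count = 0;
        String line;
        while (count < batch.length && (line = reader.readLine()) != null) {
            batch[count++] = line;
        }
        return count;
    }

    // Stand-in for the real transformation, which I can't share.
    private static String transform(String line) {
        return line.toUpperCase();
    }
}
```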
What I don't understand is why 30k files with 12 lines each give me better performance than a few files with many lines each. I would have expected the overhead of opening and closing all those files to be greater than that of reading a single large file. In addition, the decline in performance in the former case is exponential in nature.
I've set the maximum heap size to 1024 MB and it appears to use 100 MB at most, so an overtaxed GC isn't the problem. Do you have any other ideas?
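For reference, the 100 MB figure comes from watching heap usage while the program runs; a spot check like this (a hypothetical helper, not part of the real code) shows the same thing:

```java
// Hypothetical helper that could be called from the worker loop to spot-check heap use.
static void logHeapUsage() {
    Runtime rt = Runtime.getRuntime();
    long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    long maxMb = rt.maxMemory() / (1024 * 1024);
    System.out.println("heap used: " + usedMb + " MB of " + maxMb + " MB max");
}
```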