I am maintaining a high performance CSV parser and try to get the most out of latest technology to improve the throughput. For this particular tasks this means:
- Flash memory (We own a relatively inexpensive PCI-Express card, 1 TB of storage that reaches 1 GB/s sustained read performance)
- Multiple cores (We own a cheap Nehalem server with 16 hardware threads)
The first implementation of the CSV parser was single threaded. File reading, character decoding, field splitting, text parsing, all within the same thread. The result was a throughput of about 50MB/s. Not bad but well below the storage limit...
The second implementation uses one thread to read the file (at the byte level), one thread to decode the characters (from ByteBuffer to CharBuffer), and multiple threads to parse the fields (I mean parsing delimitted text fields into doubles, integers, dates...). This works well faster, close to 400MB/s on our box.
But still well below the performance of our storage. And those SSD will improve again in the future, we are not taking the most out of it in Java. It is clear that the current limitation is the character decoding ( CharsetDecoder.read(...) ). That is the bottleneck, on a powerful Nehalem processor it transforms bytes into chars at 400MB/s, pretty good, but this has to be single threaded. The CharsetDecoder is somewhat stateful, depending on the used charset, and does not support multithreaded decoding.
So my question to the community is (and thank you for reading the post so far): does anyone know how to parallelize the charset decoding operation in Java?