I am maintaining a high-performance CSV parser and trying to get the most out of the latest technology to improve throughput. For this particular task that means:

  • Flash memory (we own a relatively inexpensive PCI-Express card with 1 TB of storage that reaches 1 GB/s sustained read performance)
  • Multiple cores (we own a cheap Nehalem server with 16 hardware threads)

The first implementation of the CSV parser was single-threaded: file reading, character decoding, field splitting and text parsing all happened within the same thread. The result was a throughput of about 50 MB/s. Not bad, but well below the storage limit...

The second implementation uses one thread to read the file (at the byte level), one thread to decode the characters (from ByteBuffer to CharBuffer), and multiple threads to parse the fields (by which I mean parsing delimited text fields into doubles, integers, dates...). This works much faster, close to 400 MB/s on our box.

But that is still well below the performance of our storage, and those SSDs will only get faster; we are not getting the most out of them in Java. It is clear that the current limitation is the character decoding ( CharsetDecoder.read(...) ). That is the bottleneck: on a powerful Nehalem processor it transforms bytes into chars at 400 MB/s, which is pretty good, but it has to be single-threaded. The CharsetDecoder is somewhat stateful, depending on the charset used, and does not support multithreaded decoding.
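
For reference, the decode stage boils down to roughly the following (a simplified sketch; the buffer hand-off between the reader and parser threads is omitted):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Sketch of the single decoder stage: one CharsetDecoder turns ByteBuffers from
// the reader thread into CharBuffers for the parser threads. The decoder keeps
// internal state between calls, which is why this stage cannot simply be shared
// across threads for an arbitrary charset.
final class DecodeStage {
    private final CharsetDecoder decoder = Charset.forName("UTF-8") // charset is configurable in reality
            .newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);

    void decodeChunk(ByteBuffer in, CharBuffer out, boolean endOfInput) {
        CoderResult result = decoder.decode(in, out, endOfInput);
        if (result.isError()) {
            throw new IllegalStateException("malformed input near byte " + in.position());
        }
        if (endOfInput) {
            decoder.flush(out);
        }
    }
}
```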

So my question to the community is (and thank you for reading the post so far): does anyone know how to parallelize the charset decoding operation in Java?

+1  A: 

does anyone know how to parallelize the charset decoding operation in Java?

You might be able to open multiple input streams to do this (I'm not sure how you'd go about this with NIO, but it must be possible).

How difficult this would be depends on the encoding you're decoding from. You will need a bespoke solution for the target encoding. If the encoding has a fixed width (e.g. Windows-1252), then one byte == one character and decoding is easy.
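
For instance, with a single-byte charset each chunk can be cut anywhere and decoded independently, something like this sketch:

```java
import java.nio.charset.Charset;

// Sketch: for a single-byte charset such as windows-1252, any byte range decodes
// independently of its neighbours, so one task per chunk needs no boundary logic.
static String decodeChunk(byte[] data, int offset, int length) {
    return new String(data, offset, length, Charset.forName("windows-1252"));
}
```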

Modern variable-width encodings (like UTF-8 and UTF-16) contain rules for identifying the first byte of a character sequence, so it is possible to jump to the middle of a file and start decoding (you'll have to note the end of the previous chunk, so it is wise to start decoding the end of the file first).
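
For example, in UTF-8 a worker can resynchronize at an arbitrary offset by skipping continuation bytes (a sketch, with a made-up helper name):

```java
// Sketch: move an arbitrary offset forward to the start of the next UTF-8
// character. Continuation bytes have the form 10xxxxxx and never begin a
// character, so skipping them lands on a valid character boundary.
static int nextUtf8CharStart(byte[] data, int offset) {
    int i = offset;
    while (i < data.length && (data[i] & 0xC0) == 0x80) {
        i++;
    }
    return i;
}
```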

Some legacy variable-width encodings might not be this well designed, so you'll have no option but to decode from the start of the data and read it sequentially.

If it is an option, generate your data as UTF-16BE. Then you can cut out decoding and read two bytes straight to a char.
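
Something along these lines should do it (a sketch; Java's char is itself a UTF-16 code unit, so surrogate pairs simply pass through as two chars):

```java
// Sketch: UTF-16BE maps one big-endian byte pair onto one Java char, so the
// "decoding" is just byte pairing; surrogate pairs stay as two consecutive chars.
static char[] utf16beToChars(byte[] bytes) {
    char[] chars = new char[bytes.length / 2];
    for (int i = 0; i < chars.length; i++) {
        chars[i] = (char) (((bytes[2 * i] & 0xFF) << 8) | (bytes[2 * i + 1] & 0xFF));
    }
    return chars;
}
```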

If the file is Unicode, watch out for BOM handling, but I'm guessing you're already familiar with many of the low-level details.

McDowell
Unfortunately, UTF-16 is a variable-length encoding. You need UTF-32 for such simple Unicode parsing.
grddev
@grddev - I covered this in my post - it is possible to identify character sequences in the middle of UTF-16 data streams - high surrogates are 0xD800-0xDBFF and low surrogates are 0xDC00-0xDFFF. Anything else is contained in a single pair of bytes.
McDowell
@McDowell: My comment referred to the mention of UTF-16BE. You cannot completely cut out decoding. But it really is quite simple.
grddev
Thanks for those inspiring elements. Certainly fixed-length encodings are relatively easy to parallelize: you can store the raw bytes in chunks, making sure the chunks don't cut a character in the middle, decode the chunks in concurrent tasks and assemble the results. Maybe I should just do that and revert to single-threaded mode when I detect a variable-length charset. Of course my goal is to offer a CSV parser that works whatever the charset, to deliver to customers all over the world with their exotic charsets.
Antoine
+1  A: 

It is clear that the current limitation is the character decoding ( CharsetDecoder.read(...) )

How do you know that? Does your monitoring / profiling show conclusively that the decoder thread is using 100% of one of your cores?
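
For example, a quick sanity check along these lines (a sketch using ThreadMXBean) would show how much CPU the decoder thread really burns:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Sketch: sample the decoder thread's CPU time over a short interval and compare
// it to wall-clock time. A ratio close to 1.0 means the thread is truly CPU-bound.
static double cpuFraction(long threadId, long sampleMillis) throws InterruptedException {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    long cpuBefore = mx.getThreadCpuTime(threadId); // nanoseconds, -1 if unsupported
    long wallBefore = System.nanoTime();
    Thread.sleep(sampleMillis);
    long cpuAfter = mx.getThreadCpuTime(threadId);
    long wallAfter = System.nanoTime();
    return (double) (cpuAfter - cpuBefore) / (wallAfter - wallBefore);
}
```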

Another possibility is that the OS is not capable of driving the SSD at its theoretical maximum speed.

If UTF-8 decoding is definitely the bottleneck then it should be possible to do the task in parallel. But you will certainly need to implement your own decoders to do this.

Stephen C
Yes, several runs using JProfiler clearly show that the (single) character decoding thread is active almost 100% of the time. I see multiple references to UTF-8 and UTF-16 encoding in the responses, but we are writing a general-purpose CSV parser here that will be used on existing files by our customers in Europe, the US, Japan, China... So we cannot assume which charset will be used. In particular, we cannot assume whether the charset will be fixed-length or not.
Antoine
A: 

If you know the encoding, and it is either fixed-size or does not contain overlapping byte sequences, you could scan for a special sequence. In CSV, the byte sequence for newlines would make sense. Even if you detect the encoding dynamically, you could run a pass over the first few bytes to determine the encoding, and then move on to parallel decoding. A sketch of the idea follows.
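
The sketch below assumes a newline byte that cannot occur inside a multi-byte sequence of the charset at hand (true for UTF-8 and most ASCII-compatible single-byte encodings); the helper name is made up:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Sketch: cut the raw bytes at newline positions so each chunk holds whole CSV
// records and can be decoded and parsed by a separate thread.
static List<ByteBuffer> splitOnNewlines(ByteBuffer data, int targetChunkSize) {
    List<ByteBuffer> chunks = new ArrayList<>();
    int start = 0;
    while (start < data.limit()) {
        int end = Math.min(start + targetChunkSize, data.limit());
        while (end < data.limit() && data.get(end) != '\n') {
            end++; // extend to the next newline so no record straddles two chunks
        }
        if (end < data.limit()) {
            end++; // include the newline itself
        }
        ByteBuffer slice = data.duplicate();
        slice.position(start);
        slice.limit(end);
        chunks.add(slice.slice());
        start = end;
    }
    return chunks;
}
```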

grddev
I like this idea a lot: locating delimiters directly within the raw bytes. And yes, the NEW_LINE pattern is the right candidate for a CSV parser. But I have to support any charset. Are you aware of some generic method around the charset implementation that tells whether byte patterns overlap or not? I don't see any in the Javadoc.
Antoine
@Antoine: Unfortunately, I have no idea. It should not be a problem in any of the UTF encodings, or any fixed-width encoding in general. [According to this question](http://stackoverflow.com/questions/724247/newline-control-characters-in-multi-byte-character-sets) there should also be no problems for typical Japanese encodings. Whether any of the newline representations overlap in Chinese (or other) encodings, I don't know. It is in any case clear that the existing interfaces in Java do not provide the means for doing this nicely. :(
grddev
A: 

Another (crazy) alternative would be to just separate the input into chunks of some arbitrary size, ignore the decoding issues and then decode each of the chunks in parallel. However, you want to ensure that the chunks overlap (with a parametrized size). If the overlapping region of the two chunks is decoded the same way by the two threads (and your overlap was big enough for the specified encoding) it should be safe to join the results. The bigger the overlap, the more processing required, and the smaller the probability of error. Furthermore, if you are in a situation where you know the encoding is UTF-8, or a similarly simple encoding, you could set the overlap quite low (for that client) and still be guaranteed correct operation.

If the second chunk turns out to be wrong, you will have to redo it, so it is important not to make the chunks too big. If you do more than two chunks in parallel, it would be important to 'repair' from beginning to end, so that one misaligned block does not end up invalidating the next block (which might be correctly aligned).
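
A rough sketch of how the join could look, purely as an illustration (the alignment check here is heuristic, and the "redo sequentially" path is left out):

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the overlap idea: every task decodes its chunk plus `overlap` extra
// bytes taken from the end of the previous chunk. When joining, the head of each
// decoded chunk is aligned against the tail of the text decoded so far; a mismatch
// means a chunk boundary fell inside a character sequence and that chunk must be
// re-decoded from a known-good position (not shown here).
final class OverlapDecodeSketch {

    static String decode(byte[] data, Charset cs, int chunkSize, int overlap)
            throws InterruptedException, ExecutionException {
        if (data.length == 0) {
            return "";
        }
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int start = 0; start < data.length; start += chunkSize) {
                int from = Math.max(0, start - overlap);
                int to = Math.min(data.length, start + chunkSize);
                ByteBuffer slice = ByteBuffer.wrap(data, from, to - from);
                futures.add(pool.submit(() -> cs.decode(slice).toString()));
            }

            StringBuilder joined = new StringBuilder(futures.get(0).get());
            for (int i = 1; i < futures.size(); i++) {
                String current = futures.get(i).get();
                // The head of `current` was decoded from overlap bytes that the
                // previous task also decoded; align on it and drop the duplicate.
                String probe = current.substring(0, Math.min(16, current.length()));
                int at = joined.lastIndexOf(probe);
                if (at < 0 || !current.startsWith(joined.substring(at))) {
                    throw new IllegalStateException("chunk " + i + " misaligned; redo it sequentially");
                }
                int duplicated = joined.length() - at;
                joined.append(current, duplicated, current.length());
            }
            return joined.toString();
        } finally {
            pool.shutdown();
        }
    }
}
```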

grddev