I'm looking for a Java implementation of a CSV (comma-separated values) parser that handles Unicode data properly, e.g. UTF-8 CSV files with Chinese text. I suppose such a parser should internally use code-point-related methods while iterating, comparing, etc. An Apache 2 license or similar would work best.

A: 

It's pretty easy to write yourself. Open the file with a FileInputStream and an InputStreamReader that uses UTF-8. Wrap it in a BufferedReader so you can iterate through it using readLine(). Get each line as a String. Use regular expressions to split it into fields.

The only tricky part is constructing the regexes so they don't treat commas that are enclosed within quotes as field delimiters.
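A minimal sketch of the steps above, in case it helps. The class and method names are made up for illustration, and the regex is one commonly cited pattern (split on a comma only when an even number of quotes follows it), not necessarily the one the answer had in mind. It assumes quotes are balanced on each line, leaves the surrounding quotes on the fields, and will break on quoted fields that contain line breaks, since readLine() splits those.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: line-by-line CSV reading with a regex split.
class RegexCsvSplit {
    // A comma counts as a delimiter only if the rest of the line
    // contains an even number of double quotes (i.e. we are outside quotes).
    private static final Pattern COMMA_OUTSIDE_QUOTES =
            Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    static String[] splitLine(String line) {
        // limit -1 keeps trailing empty fields such as "a,b,"
        return COMMA_OUTSIDE_QUOTES.split(line, -1);
    }

    static List<String[]> readAll(String path) throws IOException {
        List<String[]> rows = new ArrayList<>();
        // Decode explicitly as UTF-8 instead of relying on the platform default.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(splitLine(line));
            }
        }
        return rows;
    }
}
```

Once the stream is decoded as UTF-8, Chinese text needs no special treatment here: the regex only looks at commas and quotes and passes everything else through untouched.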

The approach above is a bit inefficient, but fast enough for most apps. If you have real performance requirements, then you'll need something that iterates through the characters. I wrote one a few years ago that uses a state machine, and it worked OK.
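For what a character-by-character state machine might look like, here is a rough sketch. It is not the parser mentioned above; the class name, the state names, and the lenient handling of malformed quotes are all assumptions made for illustration. It parses a single record, treats a doubled quote inside a quoted field as one literal quote, and iterates over UTF-16 chars rather than code points, which is safe here because the two halves of a surrogate pair can never be equal to ',' or '"'.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-record CSV parser driven by a small state machine.
class CsvStateMachine {
    private enum State { FIELD_START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED }

    static List<String> parseRecord(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        State state = State.FIELD_START;

        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            switch (state) {
                case FIELD_START:
                    if (c == '"') {
                        state = State.QUOTED;          // opening quote
                    } else if (c == ',') {
                        fields.add("");                // empty field
                    } else {
                        field.append(c);
                        state = State.UNQUOTED;
                    }
                    break;
                case UNQUOTED:
                    if (c == ',') {
                        fields.add(field.toString());
                        field.setLength(0);
                        state = State.FIELD_START;
                    } else {
                        field.append(c);
                    }
                    break;
                case QUOTED:
                    if (c == '"') {
                        state = State.QUOTE_IN_QUOTED; // closing quote or escaped quote
                    } else {
                        field.append(c);               // commas are plain data here
                    }
                    break;
                case QUOTE_IN_QUOTED:
                    if (c == '"') {
                        field.append('"');             // "" inside quotes is a literal "
                        state = State.QUOTED;
                    } else if (c == ',') {
                        fields.add(field.toString());
                        field.setLength(0);
                        state = State.FIELD_START;
                    } else {
                        field.append(c);               // lenient: text after a closing quote
                        state = State.UNQUOTED;
                    }
                    break;
            }
        }
        fields.add(field.toString());                  // last field has no trailing comma
        return fields;
    }
}
```

To support quoted fields that span several lines, the same machine could be fed from a Reader instead of a String, ending a record only when a line break arrives outside the QUOTED state.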

That's more straightforward than I can afford without having bad dreams at night :-) I'm now looking for a ready-to-use library.
Igor Romanov
This is actually *not* straightforward. The simple case can be handled with regexes, but when you get into fields that themselves contain commas or the (optional) quote delimiters, regex will not work. Regex is a fine tool for certain jobs, but it is not a substitute for a well-written parser.
Kevin Day
I think it will work, it will just be a bit more complex. A quick Google search turns up a good regexp to use, see here for example: http://www.programmersheaven.com/user/Jonathan/blog/73-Splitting-CSV-with-regex/
Igor Romanov
+2  A: 

I don't believe in reinventing the wheel. So I do not want to write my own parser and go through the same headaches someone else did.

I personally like the CSV Parser from Ostermiller. They also have a Maven Repository if interested.


You can also check out OpenCSV. There is a Stack Overflow question already about parsing unicode.

Ascalonian
This one looks good, and it is even stated directly to support Chinese, but it's GPL I think, which is something I cannot use for my work.
Igor Romanov
A: 

Have you tried Commons CSV?

Mirko Nasato