I'm developing a Java app that exports data to CSV files, intended to be opened in Excel by end users. We just noticed that the export function uses Java's platform default encoding. This causes umlaut characters to be lost and a unit test to fail on the build server (which is configured with US-ASCII as its platform default encoding precisely to catch such problems).

The question is: which would be the best encoding to use? How does Excel determine what encoding to use? Does it use something platform-specific that presumably matches Java's platform default?

I'm currently leaning towards hardcoding Cp1252 - that should cover the target machines (the deployment environment is actually specified) and would fix the test problem. From googling around, Excel does not seem to handle UTF-8 well, so that's out, and sticking to the platform default encoding would require some sort of workaround hack for the tests.
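For illustration, hardcoding the encoding could look roughly like the sketch below. The class name `CsvExport` and the method `writeCsv` are hypothetical, and the sketch assumes the JRE ships the `windows-1252` charset (the canonical name for `Cp1252`) and that Excel on the target machines reads the file with that code page.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes CSV lines with a fixed encoding instead of
// relying on the platform default.
final class CsvExport {
    // Assumption: the target Excel installations expect windows-1252 (Cp1252).
    static final Charset CSV_CHARSET = Charset.forName("windows-1252");

    static void writeCsv(Path target, Iterable<String> lines) throws IOException {
        // Passing the charset explicitly makes the output independent of
        // the JVM's default encoding (e.g. US-ASCII on the build server).
        try (Writer out = new OutputStreamWriter(Files.newOutputStream(target), CSV_CHARSET)) {
            for (String line : lines) {
                out.write(line);
                out.write("\r\n"); // CRLF line endings, as usual for CSV on Windows
            }
        }
    }
}
```

With this, an umlaut such as "ü" is written as the single byte 0xFC regardless of the machine the export runs on. Characters that Cp1252 cannot represent would still be replaced by "?".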

A: 

I think Excel works well with UTF-16. What's wrong with exporting in UTF-16? At least that way non-ASCII characters will be preserved instead of being thrown away.

Edit: OK, 'well' might exaggerate how Excel works with UTF-16, but it still seems that UTF-16LE works better than UTF-8.
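A minimal sketch of a UTF-16LE export, in case someone wants to try it: the class `Utf16CsvExport` and its method are hypothetical names. Java's `UTF-16LE` encoder does not emit a byte order mark on its own, so the sketch writes one explicitly. Whether Excel then opens the file correctly is an assumption that would need checking against the Excel version in use.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes lines as UTF-16LE with a leading BOM.
final class Utf16CsvExport {
    static void writeUtf16Le(Path target, Iterable<String> lines) throws IOException {
        try (Writer out = new OutputStreamWriter(Files.newOutputStream(target), StandardCharsets.UTF_16LE)) {
            // U+FEFF encoded as UTF-16LE yields the bytes FF FE, which
            // readers can use to detect the encoding.
            out.write('\uFEFF');
            for (String line : lines) {
                out.write(line);
                out.write("\r\n");
            }
        }
    }
}
```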

Glen
According to http://stackoverflow.com/questions/451636/whats-the-best-way-to-export-utf8-data-into-excel, Excel does not work well with UTF-16.
Michael Borgwardt
A: 

You could get the system locale (from the system properties) and create the output file with that encoding. If your files will only be opened in Excel, maybe you should take a look at Apache POI?

Alexey Sviridov
We're already using the platform default encoding - it's what happens in Java when you don't specify an encoding. Using POI would probably be the most solid solution, but rather more work and a bigger change than what we're willing to do right now.
Michael Borgwardt
+1  A: 

I would expect Excel to work well with the platform default encoding, so sticking with that seems like the best choice for Excel in the general case. Checking if the platform default is US-ASCII and using Cp1252 instead (I guess that is the hack for the tests) would be the conceptual equivalent of suppressing a compiler warning that you know doesn't apply in this case.

However, since you write that you control the production deployment, why do you hesitate to hard-code Cp1252? It seems like a perfectly reasonable solution if that is the target encoding of the application.
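The US-ASCII check described in this answer might be sketched as follows. `ExportCharset`, `choose`, and `forExport` are hypothetical names, and the sketch assumes the `windows-1252` charset is available in the JRE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: keeps the platform default unless it is US-ASCII,
// in which case it falls back to Cp1252 (canonical name windows-1252).
final class ExportCharset {
    static Charset choose(Charset platformDefault) {
        if (StandardCharsets.US_ASCII.equals(platformDefault)) {
            // Assumption: a US-ASCII default only occurs on the build server,
            // where Cp1252 is an acceptable stand-in.
            return Charset.forName("windows-1252");
        }
        return platformDefault;
    }

    static Charset forExport() {
        return choose(Charset.defaultCharset());
    }
}
```

Taking the default as a parameter in `choose` keeps the decision testable without changing JVM settings. The export code would then pass `ExportCharset.forExport()` to its writer.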

Yishai
This is a huge system with many teams developing in parallel, and deployment and operation totally separate from development. The deployment environment is controlled by the operations guys, which is a different department - and who knows what they'll do?
Michael Borgwardt