Character Encoding Trouble - Java

Hi all,

I've written a little application that does some text manipulation and writes the output to a file (HTML, CSV, DOCX, XML), and this all appears to work fine on Mac OS X. On Windows, however, I seem to get character encoding problems: a lot of '"' characters disappear and are replaced with some weird stuff, usually the closing '"' of a pair.

I use FreeMarker to create my output files, and there is a byte[] array and in one case also a ByteArrayStream between reading the templates and writing the output. I assume this is a character encoding problem, so could someone give me advice or point me to some 'Best Practice' resource for dealing with character encoding in Java?

Thanks

+3 A:

You can control which encoding your JVM runs with by supplying, for example,

-Dfile.encoding=utf-8

(for UTF-8, of course) as an argument to the JVM. Then you should get predictable results on all platforms. Example:

java -Dfile.encoding=utf-8 my.MainClass
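To see what the flag actually changes, a small check like the following could be run with and without it. This is only a sketch: the class name DefaultCharsetCheck is made up for illustration, and the values printed depend on the JVM version, the OS and the locale.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original application.
public class DefaultCharsetCheck {

    /** Describes the charset the JVM falls back to when none is given explicitly. */
    static String describeDefault() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        // Compare the output of:
        //   java DefaultCharsetCheck
        //   java -Dfile.encoding=utf-8 DefaultCharsetCheck
        System.out.println(describeDefault());
    }
}
```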

2009-04-07 10:30:38

This sure did fix everything, cheers mate

willcodejavaforfood 2009-04-07 11:43:27

However, it's only a band-aid solution that may hide the real problem for now, only to emerge again later.

Michael Borgwardt 2009-04-07 11:57:14

+5 A:

There's really only one best practice: be aware that Strings and bytes are two fundamentally different things, and that whenever you convert between them, you are using a character encoding (either implicitly or explicitly), which you need to pay attention to.

Typical problematic spots in the Java API are:

new String(byte[])
String.getBytes()
FileReader, FileWriter

All of these implicitly use the platform default encoding, which depends on the OS and the user's locale settings. Usually, it's a good idea to avoid this and explicitly declare an encoding in the above cases (which FileReader/FileWriter unfortunately don't allow, so you have to use an InputStreamReader/OutputStreamWriter instead).
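As a rough illustration of the reader/writer case, the sketch below wraps the file streams in an InputStreamReader and OutputStreamWriter with an explicit charset. UTF-8 is an assumption here (use whatever encoding your files really have), and the class and method names are invented for the example.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
public class ExplicitEncodingIo {

    /** Writes text to a file as UTF-8, regardless of the platform default. */
    static void writeText(File file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    /** Reads a whole file, decoding its bytes as UTF-8. */
    static String readText(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(
                new FileInputStream(file), StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    /** Writes text to a temporary file and reads it back again. */
    static String roundTripViaTempFile(String text) {
        try {
            File tmp = File.createTempFile("encoding-demo", ".txt");
            try {
                writeText(tmp, text);
                return readText(tmp);
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because both sides name the same charset, text containing characters such as curly quotes should come back unchanged on any platform.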

However, your problems with the quotation marks and your use of a template engine may have a much simpler explanation. What program are you using to write your templates? It sounds like it's one that inserts "smart quotes", which are part of the Windows-specific cp1251 encoding but don't exist in the more global ISO-8859-1 encoding.
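A quick way to check whether a given character survives a particular encoding is to ask the charset's encoder. The sketch below (class name invented for illustration) tests the closing curly quote U+201D against ISO-8859-1 and UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
public class SmartQuoteCheck {

    /** Returns true if every character of the text can be represented in the charset. */
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String closingQuote = "\u201D"; // RIGHT DOUBLE QUOTATION MARK

        System.out.println(canEncode(closingQuote, StandardCharsets.ISO_8859_1)); // false
        System.out.println(canEncode(closingQuote, StandardCharsets.UTF_8));      // true

        // getBytes() silently substitutes '?' for characters the charset lacks.
        byte[] lossy = closingQuote.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println((char) lossy[0]); // ?
    }
}
```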

What you probably need to do is to be aware which encoding your templates are saved in, and configure your template engine to use that encoding when reading in the templates. Also be aware that some text files, specifically XML, explicitly declare the encoding in a header, and if that header disagrees with the actual encoding used by the file, you'll invariably run into problems.

Michael Borgwardt 2009-04-07 10:31:24

I am using FreeMarker, and the Template object that I create seems to use CP1251 even though, in another field, it also claims to use UTF-8. And they do appear as smart quotes, but the '"' does not come from my template but from the text I am parsing as input.

willcodejavaforfood 2009-04-07 11:42:35

Then the problem seems to be both in the configuration of FreeMarker (contradicting encodings are always very bad news) and in your parsing code.

Michael Borgwardt 2009-04-07 11:59:01

I've specified UTF-8 in my XML and HTML files, btw. After using the VM property, the template object no longer shows contradicting encodings.

willcodejavaforfood 2009-04-07 14:23:49

Then it sounds as if your parsing code uses the platform default encoding somewhere.

Michael Borgwardt 2009-04-07 14:34:32

@Michael - And tinwelint's suggested fix took care of that. You still think it could be a problem?

willcodejavaforfood 2009-04-07 15:10:36

It's a hidden dependency - if the app is ever run without that system property, the problem will resurface.

Michael Borgwardt 2009-04-07 15:27:07

So you think it would be better to set the encoding anywhere in the code where I use a *Stream if possible?

willcodejavaforfood 2009-04-07 16:16:59

I see what you mean though. Now I cannot distribute the application as a jar since that would not guarantee that the encoding property was used.

willcodejavaforfood 2009-04-07 16:20:05

Yes, the encoding should always be specified explicitly wherever you convert between bytes and Strings - but there shouldn't be too many places - the best of practices is to avoid such conversions wherever possible.
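Such conversions tend to hide where text is pushed into a byte stream, for instance a ZipOutputStream or a ByteArrayOutputStream. A minimal sketch of doing that with an explicit charset follows; UTF-8 is assumed, and the class and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class for illustration only.
public class ZipTextSketch {

    /** Stores the text as a single UTF-8 encoded entry in an in-memory zip archive. */
    static byte[] zipText(String entryName, String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry(entryName));
            // The Writer is the only place where chars become bytes,
            // so this is where the charset is named.
            Writer writer = new OutputStreamWriter(zip, StandardCharsets.UTF_8);
            writer.write(text);
            writer.flush(); // push the encoded bytes into the entry without closing the zip
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads the first entry of the archive back and decodes it as UTF-8. */
    static String unzipFirstEntry(byte[] zipped) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipped))) {
            if (zip.getNextEntry() == null) {
                throw new IllegalArgumentException("archive has no entries");
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = zip.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The zip and byte array streams themselves only move bytes around; the encoding matters at the Writer on the way in and at new String(...) on the way out.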

Michael Borgwardt 2009-04-07 18:33:03

This was a lot more complicated than I realised. I'll have to look at my code and see if it is even possible. I hope that the bandit is my ZipOutputStream(FileOutputStream())

willcodejavaforfood 2009-04-07 19:59:31

Took me a while but I did manage to find the offending input/output streams and make them UTF-8.

willcodejavaforfood 2009-04-08 10:04:43

Great to hear that :)

Michael Borgwardt 2009-04-08 11:08:06

@Michael - Oh I was supposed to say thank you as well :)

willcodejavaforfood 2009-04-08 16:00:15

+1 A:

Running the JVM with a 'standard' encoding via the confusingly named -Dfile.encoding will resolve a lot of problems.

Ensuring your app doesn't make use of byte[] <-> String conversions without an encoding specified is important, since sometimes you can't enforce the VM encoding (e.g. if you have an app server used by multiple applications).
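For the byte[] <-> String case, that means using the overloads that take a charset. A minimal sketch, assuming UTF-8 is the encoding you want and with invented class and method names:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
public class ExplicitBytes {

    /** Encodes text as UTF-8 instead of relying on the platform default. */
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes bytes as UTF-8 instead of relying on the platform default. */
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String closingQuote = "\u201D"; // RIGHT DOUBLE QUOTATION MARK
        byte[] encoded = toUtf8(closingQuote);
        System.out.println(encoded.length);                          // 3
        System.out.println(fromUtf8(encoded).equals(closingQuote));  // true
    }
}
```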

If you're confused by the whole encoding issue, or want to revise your knowledge, Joel Spolsky wrote a great article on this.

Brian Agnew 2009-04-07 10:44:46
