ansaurus

Question

How can I open files containing accents in Java?

Answer 1

A:

The Java system property file.encoding should match the console's character encoding. The property must be set when starting java on the command-line:

java -Dfile.encoding=UTF-8 …

Normally this happens automatically, because the console encoding is usually the platform default encoding, and Java will use the platform default encoding if you don't specify one explicitly.

erickson 2010-06-18 19:10:07

Answer 2

+3 A:

First, the character encoding used is not directly related to the locale. So changing the locale won't help much.

Second, the ï¿½ is typical for the Unicode replacement character U+FFFD � being printed in ISO-8859-1 instead of UTF-8. Here's an evidence:

System.out.println(new String("�".getBytes("UTF-8"), "ISO-8859-1")); // ï¿½

So there are two problems:

Your JVM is reading those special characters as �.
Your console is using ISO-8859-1 to display characters.

For a Sun JVM, the VM argument -Dfile.encoding=UTF-8 should fix the first problem. The second problem is to be fixed in the console settings. If you're using for example Eclipse, you can change it in Window > Preferences > General > Workspace > Text File Encoding. Set it to UTF-8 as well.

Update: As per your update:

byte[] textArray = f.getName().getBytes();

That should have been the following to exclude influence of platform default encoding:

byte[] textArray = f.getName().getBytes("UTF-8");

If that still displays the same, then the problem lies deeper. What JVM exactly are you using? Do a java -version. As said before, the -Dfile.encoding argument is Sun JVM specific. Some Linux machines ships with GNU JVM or OpenJDK's JVM and this argument may then not work.

BalusC 2010-06-18 19:10:56

I tried that and it didn't work.java -Dfile.encoding=UTF-8 load_i18nes_ESspecial_ï¿½_ï¿½_ï¿½_characters.docI'm probably wrong, but I'm not convinced there's a console issue yet. I redirect the output to a file so there's no console involved and I still get the same results. I do an "od -a" on the file and here's the relevant output:0000200 e f i l e nl s p e c i a l _ o ?0000220 = _ o ? = _ o ? = _ c h a r a c0000240 t e r s . d o c nl r e a d _ i 1

Mark Juric 2010-06-19 01:51:47

As to the first problem: that may be platform/JVM specific. Hard to tell from here on. As to the second problem: is the file written with an `OutputStreamWriter` using UTF-8 and viewed with a viewer supporting UTF-8?

BalusC 2010-06-19 03:36:43

@Mark, not sure why you're passing the 'mangled' filename on the command line. The flow seems to be (1) Java gets correct filename from OS (2) Java writes filename to stdout, where it gets mangled (3) you take the mangled filename and pass it back in to a different tool (4) Java hands the mangled filename to the OS, which can't find the file. Fix (2), and the problem goes away; passing the MANGLED filename in (3) is just making things worse.

Cowan 2010-06-19 09:26:22

Also - "I redirect the output to a file so there's no console involved and I still get the same results." -- do you mean redirect in code, using e.g. a Writer, or using your shell's command-line redirection? If the problem is Java's choice of encoding when writing to System.out, it's just those (incorrect) bytes which your shell will redirect into the file, making exactly the same problem.

Cowan 2010-06-19 09:27:51

ansaurus

tags:

views:

answers:

How can I open files containing accents in Java?

related questions