ansaurus

Question

File is not saved in UTF-8 encoding even when I set encoding to UTF-8

Answer 1

A:

The IANA-registered name is "UTF-8", not "UTF8". However, Java throws an exception for encoding names it does not recognize (and it accepts "UTF8" as an alias), so that's probably not the problem.

I suspect that Notepad is the problem. Examine the text using a hexdump program, and you should see it properly encoded.
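If no hexdump program is at hand, the same check can be done from Java. This is only a minimal sketch: the class name `HexDumpSketch` and the default path `output.xml` are made up for illustration and are not from the question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class HexDumpSketch {
    // Formats each byte as two uppercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass the real file as the first argument.
        Path path = Paths.get(args.length > 0 ? args[0] : "output.xml");
        System.out.println(toHex(Files.readAllBytes(path)));
    }
}
```

In the dump, a character such as "é" appears as the two bytes C3 A9 if the file really is UTF-8, and as the single byte E9 if it was written as Latin-1 or Windows-1252.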

kdgregory 2009-10-08 10:59:51

I create an XML file that should be encoded in UTF-8, but JBoss cannot understand that file. When I change the encoding manually to UTF-8 with Notepad++, JBoss does understand the XML.

newbie 2009-10-08 11:02:37

Answer 2

+1 A:

If there is no BOM (and Java doesn't write one for UTF-8, nor does it recognize one when reading), the text is identical in ANSI and UTF-8 encoding as long as only characters in the ASCII range are used. Therefore Notepad++ cannot detect any difference.

(And there seems to be an issue with UTF8 in Java anyways...)
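To illustrate the reading side: Java's UTF-8 decoder passes a leading BOM through as the character U+FEFF instead of dropping it, so code that reads BOM-prefixed input has to remove it itself. A minimal sketch, with a hypothetical class and method name:

```java
import java.nio.charset.StandardCharsets;

final class BomStripSketch {
    // Decodes UTF-8 bytes and removes a leading U+FEFF if one is present.
    static String decodeUtf8(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        // The decoder keeps the BOM as an ordinary character, so strip it by hand.
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(decodeUtf8(withBom)); // prints "hi"
    }
}
```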

Lucero 2009-10-08 11:03:18

This isn't really an issue. There are many cases where you don't want your UTF-8 data prefixed by a BOM. Unicode BOM FAQ: http://unicode.org/faq/utf_bom.html#bom10

McDowell 2009-10-08 11:08:44

It is one of the possibilities to automatically (and pretty reliably) detect the difference between UTF-8 and ASCII or ANSI character sets. How are you otherwise going to know UTF-8 from something else?

Lucero 2009-10-08 11:11:18

I'm not disputing that you can (or whether you should) use a BOM in any individual case; you are right that it is certainly a possibility. However, there are too many cases where a BOM will break the data (e.g. unix scripts; appending to a file) or will be completely unnecessary (e.g. database records) for Java to prefix every UTF-8 encoded stream with U+FEFF. I would just not describe this as an issue with Java so much as an issue with developers being ignorant of how to work with encodings and how/when to use BOMs.

McDowell 2009-10-08 11:29:04

@McDowell: Amen to that.

Arthur Reutenauer 2009-10-08 11:30:44

@McDowell: I completely agree. Note that I was just explaining why UTF-8 could not be detected by Notepad++ for ASCII text, and that Java has issues with UTF-8 BOMs. Just trying to put the facts on the table so that the problem can be understood; which way to go always depends on the circumstances.

Lucero 2009-10-08 11:44:02

Answer 3

+2 A:

UTF-8 is designed to be, in the common case, rather indistinguishable from ANSI. So when you write text to a file and encode the text with UTF-8, in the common case, it looks like ANSI to anyone else who opens the file.

UTF-8 is 1-byte-per-character for all ASCII characters, just like ANSI.
UTF-8 has all the same bytes for the ASCII characters as ANSI does.
UTF-8 does not have any special header characters, just as ANSI does not.

It's only when you start to get into the non-ASCII codepoints that things start looking different.

But in the common case, byte-for-byte, ANSI and UTF-8 are identical.
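The overlap is easy to check in Java. In this sketch ISO-8859-1 stands in for "ANSI" (an assumption; on Windows the ANSI code page is usually Windows-1252, which behaves the same way for the characters used here), and the class name is made up:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class AsciiOverlapSketch {
    // True when the text encodes to the same bytes in UTF-8 and ISO-8859-1.
    static boolean sameBytes(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(utf8, latin1);
    }

    public static void main(String[] args) {
        // Pure ASCII: both encodings produce identical bytes.
        System.out.println(sameBytes("<root>plain</root>")); // prints true
        // One non-ASCII character: UTF-8 needs two bytes, ISO-8859-1 one.
        System.out.println(sameBytes("caf\u00E9")); // prints false
    }
}
```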

Justice 2009-10-08 11:04:34

It is not true that ANSI and UTF-8 are usually identical. They are identical only when ASCII characters alone are used (codes 0-127). Non-ASCII characters such as "áéíóúÁÉÍÓÚñÑ" have a multibyte encoding in UTF-8 (2 bytes for this specific set), whereas in ANSI every character is encoded with 1 byte. So when are ANSI and UTF-8 identical? When no "strange" characters are used, where "strange" means any character (letter, punctuation, accent) not found in the English language.

Fernando Miguélez 2009-10-08 11:37:27

That's precisely the point: “newbie” no doubt has a lot of non-ASCII characters in his input, as suggested by his use of “alphabet” to mean “element of an alphabet”, which is common among English speakers in India in particular. He uses “ANSI” to mean that his file is interpreted as using some 8-bit encoding (probably Windows-1252) and comes out as a meaningless sequence of accented characters (“its only alphabets”).

Arthur Reutenauer 2009-10-08 11:40:03

The "common case" is bytes 00 to 7F. When all codepoints in the text can be represented in UTF-8 in a single byte in the 00-7F range, then that UTF-8-encoded text is ANSI. That includes most English-language-with-no-accents text.

Justice 2009-10-08 12:55:21

You folks keep using "ANSI" in a very strange way. I have no earthly idea what it means!

tchrist 2010-10-30 23:46:29

Answer 4

+2 A:

If you're creating an XML file (as your comments imply), I would strongly recommend that you use the XML libraries to output this and write the correct XML encoding header. Otherwise your character encoding won't conform to XML standards and other tools (like your JBoss instance) will rightfully complain.

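    import java.io.File;
    import javax.xml.transform.*;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;

    // Prepare the DOM document for writing
    Source source = new DOMSource(doc);

    // Prepare the output file
    File file = new File(filename);
    Result result = new StreamResult(file);

    // Write the DOM document to the file, declaring UTF-8 in the XML header
    Transformer xformer = TransformerFactory.newInstance().newTransformer();
    xformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    xformer.transform(source, result);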
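For checking the result without touching the file system, the snippet above can be wrapped into something runnable. This is a sketch only: the class name, the element name `greeting`, and the choice to write into memory are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class XmlUtf8Sketch {
    // Builds a one-element document and serializes it to UTF-8 bytes.
    static byte[] serialize(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("greeting"); // hypothetical element name
            root.setTextContent(text);
            doc.appendChild(root);

            Transformer xformer = TransformerFactory.newInstance().newTransformer();
            // Ask for UTF-8 explicitly so the XML declaration names the encoding.
            xformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            xformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            // Parser or transformer setup failed; rethrow unchecked for brevity.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = serialize("caf\u00E9");
        System.out.println(new String(xml, StandardCharsets.UTF_8));
    }
}
```

Because the transformer receives a byte stream rather than a `Writer`, it performs the character encoding itself, so the declaration and the actual bytes cannot disagree.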

Brian Agnew 2009-10-08 11:04:55

Answer 5

+1 A:

There's no such thing as plain text. The problem is that an application is decoding character data without you telling it which encoding the data uses.

Although many Microsoft apps rely on the presence of a Byte Order Mark to indicate a Unicode file, this is by no means standard. The Unicode BOM FAQ says more.

You can add a BOM to your output by writing the character '\uFEFF' at the start of the stream. More info here. This should be enough for applications that rely on BOMs.
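A minimal sketch of that approach, writing into memory so the resulting bytes can be inspected (the class and method names are hypothetical):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class BomWriterSketch {
    // Encodes text as UTF-8, preceded by the BOM character U+FEFF.
    static byte[] encodeWithBom(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            out.write('\uFEFF'); // the encoder turns this into the bytes EF BB BF
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = encodeWithBom("hello");
        // The first three bytes are the UTF-8 BOM.
        System.out.printf("%02X %02X %02X%n", data[0], data[1], data[2]); // prints EF BB BF
    }
}
```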

McDowell 2009-10-08 11:18:16

Answer 6

A:

Did you try to write a BOM at the beginning of the file? BOM is the only thing that can tell the editor the file is in UTF-8. Otherwise, the UTF-8 file can just look like Latin-1 or extended ANSI.

You can do it like this:

    import java.io.*;
    import java.nio.charset.StandardCharsets;

    public static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    ...
    // try-with-resources closes both streams even if a write fails
    try (OutputStream os = new FileOutputStream(file);
         Writer out = new OutputStreamWriter(os, StandardCharsets.UTF_8)) {
        // Write the raw BOM bytes first, then the UTF-8 encoded text
        os.write(UTF8_BOM);
        out.write(text);
    }

ZZ Coder 2009-10-08 12:08:26
