ansaurus

Question

UTF8 Beginning of File characters are breaking serializer & readers

Answer 1

+1 A:

Yes, that is a BOM.

Yes, some older JDKs had a bug that made them blow up on UTF-8 data starting with a BOM. And two BOMs in a row will confuse even a modern version of Java.

The solution I used was to stick a pushback stream on the front and filter it out.

Or use a more modern version of Java.
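For what it's worth, the pushback-stream idea could look roughly like the sketch below. It is not the exact code from my project: the class name `BomSkipper` and the method `skipUtf8Boms` are made up for illustration, and it assumes you want to drop every UTF-8 BOM at the start of the stream (so the doubled BOM in the question is handled too).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: the names here are illustrative, not from any library.
final class BomSkipper {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Wraps the stream and consumes every UTF-8 BOM found at its very start. */
    static InputStream skipUtf8Boms(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        while (true) {
            byte[] head = new byte[UTF8_BOM.length];
            int n = readUpTo(pushback, head);
            boolean isBom = n == UTF8_BOM.length && Arrays.equals(head, UTF8_BOM);
            if (!isBom) {
                // Not a BOM: put the bytes back so the caller sees them.
                if (n > 0) {
                    pushback.unread(head, 0, n);
                }
                return pushback;
            }
            // It was a BOM: drop it and check whether another one follows.
        }
    }

    /** Reads until the buffer is full or the stream ends; returns the byte count. */
    private static int readUpTo(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int r = in.read(buf, total, buf.length - total);
            if (r < 0) {
                break;
            }
            total += r;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] doubled = {
            (byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
            (byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
            '<', 'a', '/', '>'
        };
        InputStream clean = skipUtf8Boms(new ByteArrayInputStream(doubled));
        // readAllBytes() needs Java 9 or later.
        System.out.println(new String(clean.readAllBytes(), StandardCharsets.UTF_8));
    }
}
```

You would hand the returned stream to your reader or parser in place of the raw one.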

bmargulies 2009-11-20 22:37:35

Answer 2

+1 A:

The byte sequence 0xEF 0xBB 0xBF is the UTF-8 encoding of U+FEFF, which is the Unicode BOM (byte order mark). It is unnecessary in UTF-8, but crucial in UTF-16 or UTF-32.
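If you want to check the byte values yourself, a small Java sketch like the one below should do it. The class name `BomBytes` and the `hex` helper are made up for illustration; the only library calls are `String.getBytes` and the standard charset constants.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: prints how U+FEFF is encoded in a few charsets.
final class BomBytes {
    /** Formats bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bom = "\uFEFF";
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_8)));    // EF BB BF
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16BE))); // FE FF
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16LE))); // FF FE
    }
}
```

The two UTF-16 lines show why the mark matters there: the byte order differs, and the BOM is how a reader tells which one it has. In UTF-8 there is only one possible order, so the three bytes carry no information.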

You've got the same sequence twice.

The only good thing to do with them is ignore and/or delete them.

Jonathan Leffler 2009-11-20 22:43:45

Answer 3

+3 A:

Yeah, that's two BOMs. You're encoding to UTF-8 twice and each time adds a pseudo-BOM, due to the extremely unfortunate fact that:

Encoding.UTF8

means “UTF-8 with a pointless, meaningless U+FEFF stuck to the front to screw up your applications”. Try instead using

new UTF8Encoding(false)

which should give you a less sucky version.

bobince 2009-11-20 23:24:12

Perfect! This answered my question. I was able to write the files without the BOM. I replaced every use of `Encoding.UTF8` with `new UTF8Encoding(false)`.

Nathan 2009-11-23 20:54:28
