ansaurus

Question

What's different between utf-8 and utf-8 without BOM?

Answer 1

+1 A:

UTF-8 without BOM simply has no BOM. That doesn't make it any better than UTF-8 with BOM; the BOM only matters when the consumer of the file needs to know (or would benefit from knowing) whether the file is UTF-8-encoded or not.

The BOM is usually useful to determine the endianness of the encoding, which is not required for most use cases.

Also, the BOM can be unnecessary noise/pain for those consumers that don't know or care about it, and can result in user confusion.

Romain 2010-02-08 18:30:19

"which has no use for UTF-8 as it is 8-bits per glyph anyway." Er... no, only ASCII-7 glyphs are 8-bits in UTF-8. Anything beyond that is going to be 16, 24, or 32 bits.

R. Bemrose 2010-02-08 18:38:14

I must be tired. Sigh.

Romain 2010-02-08 18:41:43

Answer 2

+1 A:

from http://en.wikipedia.org/wiki/Byte-order_mark

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.

Always using a BOM in your file will ensure that it always opens correctly in editors that support UTF-8 and the BOM.

Edit: My real problem with the absence of BOM is the following. Suppose we've got a file which contains:

abc

Without BOM this opens as ANSI in most editors. So another user of this file opens it and appends some native characters, e.g:

abg-αβγ

Oops... Now the file is still in ANSI and guess what, "αβγ" does not occupy 6 bytes but 3. This is not UTF-8 and this causes other problems later on in the development chain.
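To make the byte counts above concrete, here is a minimal Python sketch. It assumes the "ANSI" code page in question is the Greek Windows code page cp1253; the answer does not say which code page was actually involved.

```python
# Assumption: "ANSI" here means the Greek Windows code page (cp1253).
text = "αβγ"

utf8_bytes = text.encode("utf-8")    # each Greek letter takes 2 bytes in UTF-8
ansi_bytes = text.encode("cp1253")   # each Greek letter takes 1 byte in cp1253

print(len(utf8_bytes), utf8_bytes.hex(" "))
print(len(ansi_bytes), ansi_bytes.hex(" "))

# A tool that later assumes UTF-8 cannot decode the single-byte version.
try:
    ansi_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(decodes_as_utf8)
```

The failed decode at the end is the kind of problem that shows up "later on in the development chain".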

cherouvim 2010-02-08 18:31:00

And ensure that spurious bytes appear at the beginning of the text in non-BOM-aware software. Yay.

Romain 2010-02-08 18:33:48

@Romain Muller: e.g. PHP 5 will throw "impossible" errors when you try to send headers after the BOM.

Piskvor 2010-02-08 18:47:14

updated my answer with why I want BOM in my files.

cherouvim 2010-02-20 23:04:03

Answer 3

+14 A:

The UTF-8 BOM is a sequence of bytes (EF BB BF) that allows the reader to identify the file as a UTF-8 file.

Normally, the BOM is used to signal the endianness of the encoding, but since UTF-8 doesn't have any byte-order issue, the BOM is unnecessary.

According to the Unicode standard, the BOM for UTF-8 files is not recommended:

2.6 Encoding Schemes

Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the "Byte Order Mark" subsection in Section 16.8, Specials, for more information.

Martin Cote 2010-02-08 18:33:26

Answer 4

A:

Quoted at the bottom of the Wikipedia page on BOM: http://en.wikipedia.org/wiki/Byte-order_mark#cite_note-2

"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"

pib 2010-02-08 18:35:41

Answer 5

+2 A:

The excellent answers above already answered that:

There is no official difference between UTF-8 and BOM-ed UTF-8
A BOM-ed UTF-8 string will start with the three following bytes: EF BB BF
Those bytes, if present, must be ignored when extracting the string from the file/stream
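As a rough illustration of the last point, here is one way to drop the signature in Python. The helper name strip_utf8_bom is made up for this sketch; codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading EF BB BF, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

raw = b"\xef\xbb\xbfABC"

# Manual stripping at the byte level.
print(strip_utf8_bom(raw))

# The "utf-8-sig" codec skips the BOM while decoding, whether or not it is there.
print(raw.decode("utf-8-sig"))
print(b"ABC".decode("utf-8-sig"))

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
print(repr(raw.decode("utf-8")))
```

Decoding with plain "utf-8" is how the stray character ends up at the start of the text in software that is not BOM-aware.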

But, as additional information to this, the BOM for UTF-8 could be a good way to "smell" if a string was encoded in UTF-8... Or it could be a legitimate string in any other encoding...

For example, the following data: [EF BB BF 41 42 43] could either be:

the legitimate ISO-8859-1 string "ï»¿ABC"
the legitimate UTF-8 string "ABC"

So while it can be cool to recognize the encoding of a file's content by looking at the first bytes, you should not rely on this, as shown by the example above.
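The ambiguity is easy to reproduce. A small Python sketch, decoding the same six bytes both ways:

```python
data = bytes([0xEF, 0xBB, 0xBF, 0x41, 0x42, 0x43])

# Read as ISO-8859-1: every byte maps to one character, so nothing is skipped.
as_latin1 = data.decode("iso-8859-1")

# Read as UTF-8 with the signature treated as a BOM and dropped.
as_utf8 = data.decode("utf-8-sig")

print(repr(as_latin1))
print(repr(as_utf8))
```

Both decodes succeed, so the bytes alone cannot tell you which reading was intended.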

Encodings should be known, not divined.

paercebal 2010-02-08 18:42:23

Answer 6

A:

What's different between utf-8 and utf-8 without BOM?

Short answer: In UTF-8, a BOM is encoded as the bytes EF BB BF at the beginning of the file.

Long answer:

Originally, it was expected that Unicode would be encoded in UTF-16/UCS-2. The BOM was designed for this encoding form. When you have 2-byte code units, it's necessary to indicate which order those two bytes are in, and a common convention for doing this is to include the character U+FEFF as a "Byte Order Mark" at the beginning of the data. The character U+FFFE is permanently unassigned so that its presence can be used to detect the wrong byte order.
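A short Python sketch of how this works for UTF-16, using only the standard codecs:

```python
bom = "\ufeff"

big_endian = bom.encode("utf-16-be")     # bytes FE FF
little_endian = bom.encode("utf-16-le")  # bytes FF FE

print(big_endian.hex(" "), little_endian.hex(" "))

# Reading the little-endian BOM with the wrong byte order yields U+FFFE,
# which is never a valid character, so the mismatch is detectable.
wrong_order = little_endian.decode("utf-16-be")
print(hex(ord(wrong_order)))

# The generic "utf-16" decoder uses the BOM to choose the byte order and drops it.
payload = little_endian + "A".encode("utf-16-le")
print(payload.decode("utf-16"))
```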

UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.

Which is better?

Without. As Martin Cote answered, the Unicode standard does not recommend it. It causes problems with non-BOM-aware software.

A better way to detect whether a file is UTF-8 is to perform a validity check. UTF-8 has strict rules about what byte sequences are valid, so the probability of a false positive is negligible. If a byte sequence looks like UTF-8, it probably is.
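A minimal sketch of such a validity check in Python. The function name looks_like_utf8 is made up for this example, and it leans on the strict error handling of the built-in decoder rather than implementing the rules by hand.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict by default: raises on any invalid sequence
    except UnicodeDecodeError:
        return False
    return True

print(looks_like_utf8("αβγ".encode("utf-8")))   # real UTF-8
print(looks_like_utf8("αβγ".encode("cp1253")))  # single-byte Greek text
print(looks_like_utf8(b"abc"))                  # pure ASCII is also valid UTF-8
```

Note that pure ASCII passes the check too, so it cannot distinguish ASCII-only files from UTF-8 ones; for those the distinction does not affect the decoded text.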

dan04 2010-07-31 22:53:25
