views:

283

answers:

5
  1. Why does a file saved as UTF-8 (in Notepad++) have this character at the beginning of the fstream I opened to it in my C++ program?

    ´╗┐

    I have no idea what it is, I just know that it's not there when I save to ASCII. UPDATE: If I save it to UTF8 (without BOM) it's not there.

  2. How can I check the encoding of a file (ASCII or UTF-8; everything else will be rejected ;) ) in C++? Is it exactly these characters?

Thanks!

+1  A: 

Why does a file saved as UTF8 have this character in the beginning [...] I have no idea what it is, I just know that it's not there when I save to ASCII.

I suppose you are referring to the Byte Order Mark (BOM), U+FEFF, a zero-width, non-breaking space character. Here (Notepad++ 5.4.3), a file saved as UTF-8 has the bytes EF BB BF at the beginning. I suppose that's the BOM encoded in UTF-8.

How can I check the encoding of a file

You cannot. You have to know what encoding your file was written in. While Unicode-encoded files might start with a BOM, I don't think there's a requirement that they do so.

sbi
A: 

I'm guessing you meant to ask why it has those characters. They are probably the byte order mark, which in UTF-8 is encoded as the bytes EF BB BF (see the Unicode FAQ).

As for knowing what encoding a file is in, you cannot derive that from the file itself; you have to know it ahead of time (or ask the user who supplies the file). For a better understanding of encodings without having to do a lot of reading, I highly recommend Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

SCFrench
Citing Wikipedia as a reference is a bad practice that you should probably break (http://en.wikipedia.org/wiki/Wikipedia:Citing_Wikipedia). Wikipedia is a good place to start research and find authoritative references, but it should never actually be used as a reference itself: its authorship is unverifiable, and without knowing the author you can't know the quality or accuracy of their comments. A better reference, included in the wiki article, points at the official Unicode site: http://www.unicode.org/faq/utf_bom.html
Martin York
An excellent point. I have revised my answer accordingly.
SCFrench
+6  A: 

When you save a file as UTF-16, each value is two bytes. Different computers use different byte orders. Some put the most significant byte first, some put the least significant byte first. Unicode reserves a special codepoint (U+FEFF) called a byte-order mark (BOM). When a program writes a file in UTF-16, it puts this special codepoint at the beginning of the file. When another program reads a UTF-16 file, it knows there should be a BOM there. By comparing the actual bytes to the expected BOM, it can tell if the reader uses the same byte order as the writer, or if all the bytes have to be swapped.
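The comparison described above can be sketched like this (a minimal illustration; the function and enum names are mine, not part of any standard API):

```cpp
// Classify a UTF-16 stream's byte order from its first two bytes,
// using the BOM code point U+FEFF.
enum class ByteOrder { BigEndian, LittleEndian, Unknown };

ByteOrder detect_utf16_order(unsigned char b0, unsigned char b1)
{
    if (b0 == 0xFE && b1 == 0xFF) return ByteOrder::BigEndian;    // BOM as written
    if (b0 == 0xFF && b1 == 0xFE) return ByteOrder::LittleEndian; // bytes swapped
    return ByteOrder::Unknown; // no BOM: byte order must be known out of band
}
```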

When you save a UTF-8 file, there's no ambiguity in byte order. But some programs, especially ones written for Windows, still add a BOM, encoded as UTF-8. When you encode the BOM code point as UTF-8, you get three bytes, 0xEF 0xBB 0xBF, which are the three extra characters you're seeing.
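As a quick check of that arithmetic, here is U+FEFF encoded by hand with the standard three-byte UTF-8 pattern (the helper name is mine, for illustration only):

```cpp
#include <array>

// Encode the code point U+FEFF with the 3-byte UTF-8 pattern
// 1110xxxx 10xxxxxx 10xxxxxx; the result is 0xEF 0xBB 0xBF.
std::array<unsigned char, 3> utf8_bom_bytes()
{
    unsigned int cp = 0xFEFF;
    unsigned char b0 = 0xE0 | (cp >> 12);          // 1110 + top 4 bits  -> 0xEF
    unsigned char b1 = 0x80 | ((cp >> 6) & 0x3F);  // 10 + middle 6 bits -> 0xBB
    unsigned char b2 = 0x80 | (cp & 0x3F);         // 10 + low 6 bits    -> 0xBF
    return {b0, b1, b2};
}
```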

The argument in favor of doing this is that it marks the files as truly UTF-8, as opposed to some other native encoding. For example, lots of text files on western Windows are in codepage 1252. Tagging the file with the UTF-8-encoded BOM makes it easier to tell the difference.

The argument against doing this is that lots of programs expect ASCII or UTF-8 regardless, and don't know how to handle the extra three bytes.

If I were writing a program that reads UTF-8, I would check for exactly these three bytes at the beginning. If they're there, skip them.
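That check might look like this (a minimal sketch; the helper name is mine, and it assumes the file is opened in binary mode):

```cpp
#include <fstream>
#include <string>

// Open a file and skip a leading UTF-8 BOM (0xEF 0xBB 0xBF) if present.
// If the first bytes are not a BOM, rewind so nothing is lost.
std::ifstream open_skipping_bom(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    char bom[3] = {0, 0, 0};
    in.read(bom, 3);
    if (!(in.gcount() == 3 &&
          bom[0] == '\xEF' && bom[1] == '\xBB' && bom[2] == '\xBF'))
    {
        in.clear();  // reading fewer than 3 bytes may set eof/fail bits
        in.seekg(0); // not a BOM: start over at the beginning
    }
    return in;
}
```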

Adrian McCarthy
"The argument against doing this is that lots of programs expect ASCII or UTF-8 regardless, and don't know how to handle the extra three bytes." I don't follow. `EF BB BF` is UTF-8 speak for a a zero-width, non-breaking space - which basically means "nothing" and is the reason this was picked as BOM. If a program supposedly reads UTF-8, it has to be able to read this character and know how to handle it.
sbi
It is much stronger than that. It isn't allowed to omit the BOM in a UTF encoded file. For obvious reasons, no program that would read that file would be able to guess that it contains UTF encoded text.
Hans Passant
@Hans: TTBOMK, a BOM is always optional, never required.
sbi
@sbi: Well, you can always create a bag o' bytes. but if you hand a UTF encoded file to a 3rd party app without a BOM, be prepared for the "are you kidding me!" support request response.
Hans Passant
@Hans: I didn't say anything about the advisability. If you have no other way of indicating the encoding, a BOM is a good way to do that. It's just that it isn't _required_ to be present.
sbi
Sometimes a BOM mustn't or shouldn't be present, see e.g. http://mywiki.wooledge.org/BashPitfalls#On_UTF-8_and_Byte-Order_Marks_.28BOM.29
Philipp
@Hans: See this link: http://www.unicode.org/faq/utf_bom.html#bom9
sbi
@sbi -- The use of U+FEFF as a zero-width, non-breaking space is deprecated (see for example http://unicode.org/faq/utf_bom.html#bom6).
Dan Breslau
@Dan: Yeah, I've seen this. OTOH, when encountered in the middle of a string, it's still supposed to be interpreted as a zero-space, non-breaking space.
sbi
@sbi: Not exactly. The link that I provided says that it "can be treated as an unsupported character". (Not "must", "should", or "may", but "can", for whatever that's worth.) In any case, most software that I've used handles only a subset of UTF-8, not the whole range of code points, so to assert that a program "has to" be able to read U+FEFF as a ZWNBSP is to set oneself up for disappointment :-| You may find, for example, that it doesn't display the character (good), but it doesn't lex it correctly either (bad).
Dan Breslau
@Dan: Read again. It says _"For backwards compatibility [FEFF] should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string."_ Only later it says _"When designing a markup language or data protocol, [...] any FEFF occurring in the middle of a file can be treated as an unsupported character."_
sbi
A: 

Without knowing what those characters really are (i.e., without a hex dump), it's only a guess, but my immediate guess would be that what you're seeing is the result of taking a byte order mark (BOM) and (sort of) encoding it as UTF-8. Technically, you're not allowed to/supposed to do that, but in practice it's actually fairly common.

Just to clarify, you should realize that this is not really a byte-order mark. The basic idea of a byte-order mark simply doesn't apply to UTF-8. Theoretically, UTF-8 encoding is never supposed to be applied to a BOM, but you can ignore that and apply the normal UTF-8 encoding rules to the values that make up a BOM anyway, if you want to.

Jerry Coffin
It's a little strong to say that "UTF-8 encoding is never supposed to be applied to a BOM". It's redundant to use it for byte-ordering, but it's allowable as an encoding signature. See http://unicode.org/faq/utf_bom.html#bom5
Dan Breslau
A: 

Regarding your second point, every valid ASCII string is also a valid UTF-8 string, so you don't have to check for ASCII explicitly. Simply read the file as UTF-8; if the file doesn't contain a valid UTF-8 string, you will get an error.
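A rough sketch of such a validity check (the function name is mine; this verifies only the structural byte patterns and does not reject overlong encodings or surrogate code points, which a full validator would):

```cpp
#include <cstddef>
#include <string>

// Structural UTF-8 check: every lead byte must announce a valid sequence
// length, and be followed by the right number of 10xxxxxx continuation bytes.
bool looks_like_utf8(const std::string& data)
{
    for (std::size_t i = 0; i < data.size(); )
    {
        unsigned char c = data[i];
        std::size_t extra;
        if      (c < 0x80)           extra = 0; // ASCII byte
        else if ((c & 0xE0) == 0xC0) extra = 1; // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2; // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3; // 4-byte sequence
        else return false;                      // stray continuation byte
        if (i + extra >= data.size() && extra > 0)
            return false;                       // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(data[i + k]) & 0xC0) != 0x80)
                return false;                   // bad continuation byte
        i += extra + 1;
    }
    return true;
}
```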

Philipp