If I am reading an XML or HTML file, don't I have to read the tag that tells me the encoding before I can read the file? Isn't that tag encoded the same way as the rest of the file? I am curious how you read that tag without knowing the encoding. I realize this is a solved problem. I am just curious how it's done.

Update 1

I don't get it. In UTF-16, won't each character take 2 bytes, not one, and be different from ASCII? For example, the character E in UTF-16 (U+0045) is 0xfeff0045. That is 0xfeff, then 0x0045, but some encodings change the endianness of that. Do you have to figure it out by checking for 0xfeff and realizing that can't be ASCII or something?

+1  A: 

The encoding name is limited to ([A-Za-z0-9._] | '-'), so it reads identically in any encoding based on ASCII or ISO-646 (e.g. ISO 8859-*, ISO 10646/Unicode).

Edit: There are still some ambiguities though. For example, you still need to have some idea of whether to read 8-, 16-, or 32-bit chunks at a time. There's also the minor detail that a proper UTF-16 or UTF-32/UCS-4 file should start with a BOM. The XML grammar itself has no production for a BOM, but the spec treats it as an encoding signature that is part of neither the markup nor the character data, so it is allowed ahead of the XML declaration.

If, however, you know the file is supposed to contain XML, you have a pretty good idea of how the file needs to start, so an incorrect guess is easy to detect.
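To make that concrete, here is a small sketch (Python, chosen only for illustration; `DECL_START` and `first_four` are hypothetical names, not from any spec) that prints the first four bytes of the text `<?xml` in a few encodings. Each encoding family produces a different byte pattern, so a wrong guess about the chunk size or byte order shows up immediately. The last lines also show what a BOM followed by the letter E looks like in big-endian UTF-16.

```python
# The text every XML declaration starts with.
DECL_START = "<?xml"

def first_four(encoding: str) -> bytes:
    """Return the first four bytes of '<?xml' encoded with `encoding` (no BOM)."""
    return DECL_START.encode(encoding)[:4]

for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(f"{enc:10} {first_four(enc).hex(' ')}")

# A BOM (U+FEFF) followed by 'E' (U+0045) in big-endian UTF-16.
# The BOM is a separate code unit in front of the text, not part of 'E'.
bom_and_e = "\ufeffE".encode("utf-16-be")
print(bom_and_e.hex(" "))  # fe ff 00 45
```

The 'E' itself is only the two bytes 00 45 (or 45 00 in little-endian order); the fe ff in front is the BOM for the whole file.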

Jerry Coffin
I don't get it. In UTF-16, won't each character take 2 bytes, not one, and be different from ASCII?
Anthony D
There are strict rules for parsers to deduce the length of the UTF encoding in the absence of a BOM: http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing-no-ext-info
Christian Hayter
+4  A: 

Here's what W3C has to say about it:

The XML encoding declaration functions as an internal label on each entity, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use--which is what the internal label is trying to indicate. In the general case, this is a hopeless situation. It is not entirely hopeless in XML, however, because XML limits the general case in two ways: each implementation is assumed to support only a finite set of character encodings, and the XML encoding declaration is restricted in position and content in order to make it feasible to autodetect the character encoding in use in each entity in normal cases.

http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing
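A rough sketch of that autodetection, assuming only the UTF-8/16/32 families (the spec's appendix also covers EBCDIC and some unusual byte orders, which are left out here). The names `BOMS`, `NO_BOM`, `ENCODING_RE` and `sniff_xml_encoding` are made up for this example, and this is not how any particular parser implements it. The idea is: look at the first few bytes to pick an encoding family, decode just enough with that family to read the declaration, then take the encoding name from it.

```python
import re

# BOM signatures. The 4-byte UTF-32 BOMs must be tested before the
# 2-byte UTF-16 ones, because ff fe is a prefix of ff fe 00 00.
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# Without a BOM, rely on the file starting with "<?xml".
NO_BOM = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"<?xm", "utf-8"),  # any ASCII-compatible encoding looks the same here
]

# Pulls the encoding name out of an already-decoded XML declaration.
ENCODING_RE = re.compile(
    r"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def sniff_xml_encoding(data: bytes) -> str:
    """Return the declared encoding name, or the guessed family if none is declared."""
    family, skip = "utf-8", 0  # assumed default when nothing matches
    for signature, name in BOMS:
        if data.startswith(signature):
            family, skip = name, len(signature)
            break
    else:
        for signature, name in NO_BOM:
            if data.startswith(signature):
                family = name
                break
    # The declaration only uses ASCII characters, so decoding the first few
    # hundred bytes with the guessed family is enough to read it.
    head = data[skip:skip + 400].decode(family, errors="replace")
    match = ENCODING_RE.match(head)
    return match.group(1) if match else family
```

For example, `sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>')` returns `"ISO-8859-1"`: the first four bytes say "ASCII-compatible", and that is enough to read the label.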

Robert Harvey
+1 in other words, the processor just tries all encodings until the XML encoding declaration shows up in the output
Wim Coenen
A: 

For HTML, it is documented in HTML5. (Don't read if you still believe anything is sane on the web, though.)
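As a very simplified sketch of the idea (the real HTML5 algorithm has many more steps, including HTTP headers and a careful byte-level prescan, so treat this as an illustration only): check for a BOM, then search the first 1024 bytes for a `charset` in a `<meta>` tag. The names `META_CHARSET_RE` and `sniff_html_encoding` are made up, and the `windows-1252` fallback is an assumption; the actual default depends on the browser and locale.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)

def sniff_html_encoding(data: bytes, default: str = "windows-1252") -> str:
    """Guess an HTML byte stream's encoding from a BOM or an early <meta> tag."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    # Only look near the start of the file; the tag is plain ASCII bytes,
    # so it can be found before the encoding is known.
    match = META_CHARSET_RE.search(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default  # assumed fallback, not a universal rule
```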

Ms2ger