I'm working on some code to determine the character encoding of an XML document returned by a web server (an RSS feed in this particular case). Unfortunately, the web server sometimes lies: it tells me that the document is UTF-8 when in fact it's not, or the server's boilerplate XML generation code emits <?xml version='1.0' encoding='UTF-8'?> at the start even though the document contains invalid UTF-8 byte sequences.
Since I don't have control over the server, I need to make my client code tolerate this kind of inconsistency and show something, even if some of the characters are not decoded correctly. This is an important requirement for my application.
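To make "show something" concrete, here is a minimal sketch in Python of the kind of tolerant decoding I have in mind. The function name `decode_leniently` and the fallback to UTF-8 for unknown labels are illustrative assumptions, not a fixed design:

```python
def decode_leniently(data: bytes, declared: str = "utf-8") -> str:
    """Decode with the declared encoding, never raising on bad input.

    Bytes that are invalid in the declared encoding become U+FFFD
    (the Unicode replacement character) instead of aborting the decode.
    """
    try:
        # Fast path: the server told the truth.
        return data.decode(declared)
    except LookupError:
        # The server sent an encoding label Python does not know;
        # assume UTF-8 (an arbitrary choice for this sketch).
        return data.decode("utf-8", errors="replace")
    except UnicodeDecodeError:
        # The label is known but the bytes do not match it.
        return data.decode(declared, errors="replace")
```

With this, a Latin-1 byte inside a document labeled UTF-8 shows up as a single replacement character and the rest of the text survives.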
I'm well aware that the server is violating the XML spec in this case. I try to work with the server side developers when possible to make things correct according to the spec, but sometimes this is a low priority for them or for their organization, or the server side code is not actively maintained by anyone.
In order to be robust, I want to look at the first few bytes of the XML data and try to determine if it's some form of UTF-16 or some 8-bit encoding. I already have code that looks for a byte order mark (BOM).
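For reference, a BOM check of this kind can be sketched in Python roughly as follows. The name `sniff_bom` and the returned encoding labels are hypothetical choices for this example; the BOM constants come from the standard `codecs` module:

```python
import codecs

# Check the 4-byte BOMs first: the UTF-32 LE BOM (FF FE 00 00) starts
# with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

The caller can skip `bom_length` bytes before handing the rest to the decoder.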
But sometimes the server doesn't include a BOM, even for UTF-16. In that case I want to figure out whether the document is UTF-16 by looking at the first two bytes and checking them against the list of possible first characters in an XML document.
Obviously I have to draw the line somewhere. If the document is not well-formed XML I won't be able to parse it anyway unless I write my own very tolerant parser (which I'm not planning to do). But given that it's well-formed, what could I possibly see as the first character of the document aside from a BOM?
As far as I can tell from looking at the spec, this set would be: whitespace (space, tab, new line, carriage return) and '<'. Do any XML experts out there know of anything I might be missing? I need to assume that the <?xml?> declaration may not be present even if required by the spec.
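A rough sketch of the first-two-bytes check, in Python, assuming the set above really is complete (which is exactly what I'm unsure about). The name `sniff_without_bom` and the "8-bit" label are made up for this example, and it deliberately ignores BOM-less UTF-32 and EBCDIC:

```python
# Byte values of the characters assumed to be able to open a document
# when there is no BOM: '<', space, tab, line feed, carriage return.
_FIRST_BYTES = frozenset(b"< \t\n\r")


def sniff_without_bom(data: bytes):
    """Guess the encoding family from the first two bytes.

    Returns "utf-16-be", "utf-16-le", "8-bit" (some ASCII-compatible
    encoding, UTF-8 included), or None if the bytes match nothing.
    """
    if len(data) < 2:
        return None
    first, second = data[0], data[1]
    # In UTF-16 every character in the set is one code unit whose
    # high byte is 0x00; the position of that zero gives the byte order.
    if first == 0 and second in _FIRST_BYTES:
        return "utf-16-be"
    if second == 0 and first in _FIRST_BYTES:
        return "utf-16-le"
    if first in _FIRST_BYTES:
        return "8-bit"
    return None
```

If something other than whitespace or '<' can legally start a document, the set would need to grow; a BOM-less UTF-32 little-endian document would also be misreported as UTF-16 here, since its first two bytes look the same.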
Internal DTDs, processing instructions, tags and comments all start with '<'. Is it possible to have an entity (starting with '&') or something else at the start of a document?
EDIT: Rewritten to emphasize my particular requirements.