I'm working on some code to determine the character encoding of an XML document being returned by a web server (an RSS feed in this particular case). Unfortunately, sometimes the web server lies and tells me that the document is UTF-8 when in fact it's not, or the boilerplate XML generation code on the server has <?xml encoding='UTF-8'?> at the start but the document contains invalid UTF-8 byte sequences.

Since I don't have control over the server, I need to make my client code tolerate this kind of inconsistency and show something, even if some of the characters are not decoded correctly. This is an important requirement for my application.

I'm well aware that the server is violating the XML spec in this case. I try to work with the server side developers when possible to make things correct according to the spec, but sometimes this is a low priority for them or for their organization, or the server side code is not actively maintained by anyone.

In order to be robust, I want to look at the first few bytes of the XML data and try to determine if it's some form of UTF-16 or some 8-bit encoding. I already have code that looks for a byte order mark (BOM).

But sometimes the server doesn't include a BOM, even for UTF-16. I want to figure out whether it's UTF-16 by looking at the first two bytes and checking them against the list of possible first characters in an XML document.

Obviously I have to draw the line somewhere. If the document is not well-formed XML I won't be able to parse it anyway unless I write my own very tolerant parser (which I'm not planning to do). But given that it's well-formed, what could I possibly see in the first character of the document aside from a BOM?

So far as I can tell from looking at the spec, this set would be: whitespace (space, tab, new line, carriage return) and '<'. Do any XML experts out there know of anything I might be missing? I need to assume that the <?xml?> declaration may not be present even if required by the spec.

Internal DTDs, processing instructions, tags and comments all start with '<'. Is it possible to have an entity (starting with '&') or something else at the start of a document?
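
To make that concrete, the sort of check I have in mind looks roughly like this (Python, purely for illustration; the byte values are just the whitespace-and-'<' set described above):

XML_FIRST_BYTES = {0x09, 0x0A, 0x0D, 0x20, 0x3C}   # tab, LF, CR, space, '<'

def sniff_without_bom(data):
    # Guess a UTF-16 variant from the first two bytes when no BOM is present.
    if len(data) < 2:
        return None
    b0, b1 = data[0], data[1]
    if b0 == 0x00 and b1 in XML_FIRST_BYTES:
        return "utf-16-be"        # 00 xx: big-endian UTF-16
    if b1 == 0x00 and b0 in XML_FIRST_BYTES:
        return "utf-16-le"        # xx 00: little-endian UTF-16
    if b0 in XML_FIRST_BYTES:
        return "8-bit"            # some ASCII-compatible 8-bit encoding (or UTF-8)
    return None                   # can't tell from the first two bytes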

EDIT: Rewritten to emphasize my particular requirements.

+1  A: 

The trouble is that if a feed is invalid, it probably doesn't obey any rules about legal characters. Take a look at the code for the Universal Feed Parser. It's very well-tested code for parsing garbage text into possibly-correct data structures.

The UFP uses a sub-library named Universal Encoding Detector, which should contain useful information for general encoding detection.
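
For example, with the chardet package (the standalone Python release of the Universal Encoding Detector), a best-effort decode might look something like this sketch; feed.xml just stands in for wherever your downloaded bytes come from:

import chardet  # the Universal Encoding Detector, packaged on its own as "chardet"

raw = open("feed.xml", "rb").read()   # stand-in for the downloaded feed bytes
guess = chardet.detect(raw)           # e.g. {'encoding': 'windows-1252', 'confidence': 0.87, ...}
text = raw.decode(guess["encoding"] or "utf-8", errors="replace")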

John Millikin
+2  A: 

The XML Specification provides some guidance about detecting character encodings in Appendix F, "Autodetection of Character Encodings". The problem is that it is nearly impossible to look at the first few bytes and tell whether it is UTF-8 or ISO-8859-1 or CP437, for that matter. The information that the spec contains will at least let you distinguish well-formed documents.
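
A rough sketch of the kind of lookup that appendix describes, checking the first bytes for BOMs and for the bytes of "<?xm" in a few encoding families (the labels on the right are illustrative, not necessarily codec names your platform accepts):

# BOMs first, then the bytes of "<?xm" in a few encoding families.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),   # UCS-4/UTF-32 BOMs (check before UTF-16)
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf",     "utf-8"),       # UTF-8 BOM
    (b"\xfe\xff",         "utf-16-be"),   # UTF-16 BOMs
    (b"\xff\xfe",         "utf-16-le"),
    (b"\x00\x3c\x00\x3f", "utf-16-be"),   # "<?" with no BOM
    (b"\x3c\x00\x3f\x00", "utf-16-le"),
    (b"\x3c\x3f\x78\x6d", "utf-8"),       # "<?xm" in any ASCII-compatible encoding
    (b"\x4c\x6f\xa7\x94", "ebcdic"),      # "<?xm" in an EBCDIC code page
]

def sniff(first_bytes):
    for signature, label in SIGNATURES:
        if first_bytes.startswith(signature):
            return label
    return None   # no match; fall back to assuming UTF-8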

D.Shawley
I'd never noticed that section of the spec before -- definitely useful!
Don McCaughey
I do tend to agree with John about the _garbage-in/garbage-out_ concept. I've seen a whole lot of XML over HTTP that breaks every rule that there is about encoding and it bothers me to no end. Take a look at this (http://annevankesteren.nl/2005/03/text-xml) blog entry about the subject and then notice that it is from over 3 years ago. Yet the problem continues to persist. Sorry about the rant.
D.Shawley
I couldn't agree more in principle, but when a sports writer innocently copies from Word and pastes into his company's home-grown CMS (which doesn't handle character encoding _at all_), and you discover that Word's curly quote characters in their default encoding form invalid UTF-8 byte sequences, and your users discover that their app no longer shows them the latest sports scores (or anything much at all), you start wanting to be more tolerant of other programmers' shortcomings.
Don McCaughey
Hehehe... I completely forgot about those "fancy" quotes. Sounds like you are actually looking at Windows 1252. You might want to consider sniffing one of the HTTP headers or maybe you are lucky enough to have a `<generator>` element. Try to signal non-standard processing on something that identifies the faulty generator of the content.
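
For instance, something along these lines could pull the charset parameter out of a Content-Type header (just a sketch; content_type is whatever raw header value your HTTP client hands you):

from email.message import Message

def charset_from_content_type(content_type):
    # Let the stdlib parse the header parameters instead of splitting on ';' by hand.
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()      # None if there is no charset parameter

charset_from_content_type("text/xml; charset=windows-1252")   # -> "windows-1252"
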
D.Shawley
A: 

It's not ideal, but I sometimes do this when I need to cope with bad encodings (sketched here in Python).

# raw_bytes holds the downloaded document. Try UTF-8 first; if the bytes are
# not valid UTF-8, assume a Windows source and decode as Windows-1252 instead.
try:
    text = raw_bytes.decode("utf-8")
except UnicodeDecodeError:
    text = raw_bytes.decode("cp1252", errors="replace")

That is, try to interpret the input as UTF-8, and if it fails, treat it as coming from a Windows system (which it probably is). It seems like a reasonable compromise to me.

Of course, this does require that you download the entire input into memory first, which may not be practical.

Dominic Mitchell