I'm working on some code to determine the character encoding of an XML document being returned by a web server (an RSS feed in this particular case). Unfortunately, sometimes the web server lies and tells me that the document is UTF-8 when in fact it's not, or the boilerplate XML generation code on the server has <?xml encoding='UTF-8'?> at the start but the document contains invalid UTF-8 byte sequences.

Since I don't have control over the server, I need to make my client code tolerate this kind of inconsistency and show something, even if some of the characters are not decoded correctly. This is an important requirement for my application.

I'm well aware that the server is violating the XML spec in this case. I try to work with the server side developers when possible to make things correct according to the spec, but sometimes this is a low priority for them or for their organization, or the server side code is not actively maintained by anyone.

In order to be robust, I want to look at the first few bytes of the XML data and try to determine if it's some form of UTF-16 or some 8-bit encoding. I already have code that looks for a byte order mark (BOM).

But sometimes the server doesn't include a BOM, even for UTF-16. I want to figure out whether it's UTF-16 by looking at the first two bytes and checking them against the list of possible first characters in an XML document.

Obviously I have to draw the line somewhere. If the document is not well-formed XML I won't be able to parse it anyway unless I write my own very tolerant parser (which I'm not planning to do). But given that it's well-formed, what could I possibly see in the first character of the document aside from a BOM?

So far as I can tell from looking at the spec, this set would be: whitespace (space, tab, new line, carriage return) and '<'. Do any XML experts out there know of anything I might be missing? I need to assume that the <?xml?> declaration may not be present even if required by the spec.

Internal DTDs, processing instructions, tags and comments all start with '<'. Is it possible to have an entity (starting with '&') or something else at the start of a document?
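
To make that concrete, the sort of check I have in mind looks roughly like this (Python, purely for illustration; the byte values are just the whitespace-and-'<' set described above):

XML_FIRST_BYTES = {0x09, 0x0A, 0x0D, 0x20, 0x3C}   # tab, LF, CR, space, '<'

def sniff_without_bom(data):
    # Guess a UTF-16 variant from the first two bytes when no BOM is present.
    if len(data) < 2:
        return None
    b0, b1 = data[0], data[1]
    if b0 == 0x00 and b1 in XML_FIRST_BYTES:
        return "utf-16-be"        # 00 xx: big-endian UTF-16
    if b1 == 0x00 and b0 in XML_FIRST_BYTES:
        return "utf-16-le"        # xx 00: little-endian UTF-16
    if b0 in XML_FIRST_BYTES:
        return "8-bit"            # some ASCII-compatible 8-bit encoding (or UTF-8)
    return None                   # can't tell from the first two bytes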

EDIT: Rewritten to emphasize my particular requirements.

+1  A: 

The trouble is that if a feed is invalid, it probably doesn't obey any rules about legal characters. Take a look at the code for the Universal Feed Parser. It's very well-tested code for parsing garbage text into possibly-correct data structures.

The UFP uses a sub-library named Universal Encoding Detector, which should contain useful information for general encoding detection.
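
For example, with the chardet package (the standalone Python release of the Universal Encoding Detector), a best-effort decode might look something like this sketch; feed.xml just stands in for wherever your downloaded bytes come from:

import chardet  # the Universal Encoding Detector, packaged on its own as "chardet"

raw = open("feed.xml", "rb").read()   # stand-in for the downloaded feed bytes
guess = chardet.detect(raw)           # e.g. {'encoding': 'windows-1252', 'confidence': 0.87, ...}
text = raw.decode(guess["encoding"] or "utf-8", errors="replace")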

John Millikin
+2  A: 

The XML Specification provides some guidance about detecting character encodings in Appendix F, "Autodetection of Character Encodings". The problem is that it is nearly impossible to look at the first few bytes and tell whether it is UTF-8 or ISO-8859-1 or CP437, for that matter. The information that the spec contains will at least let you distinguish well-formed documents.
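
A rough sketch of the kind of lookup that appendix describes, checking the first bytes for BOMs and for the bytes of "<?xm" in a few encoding families (the labels on the right are illustrative, not necessarily codec names your platform accepts):

# BOMs first, then the bytes of "<?xm" in a few encoding families.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),   # UCS-4/UTF-32 BOMs (check before UTF-16)
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf",     "utf-8"),       # UTF-8 BOM
    (b"\xfe\xff",         "utf-16-be"),   # UTF-16 BOMs
    (b"\xff\xfe",         "utf-16-le"),
    (b"\x00\x3c\x00\x3f", "utf-16-be"),   # "<?" with no BOM
    (b"\x3c\x00\x3f\x00", "utf-16-le"),
    (b"\x3c\x3f\x78\x6d", "utf-8"),       # "<?xm" in any ASCII-compatible encoding
    (b"\x4c\x6f\xa7\x94", "ebcdic"),      # "<?xm" in an EBCDIC code page
]

def sniff(first_bytes):
    for signature, label in SIGNATURES:
        if first_bytes.startswith(signature):
            return label
    return None   # no match; fall back to assuming UTF-8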

D.Shawley
I'd never noticed that section of the spec before -- definitely useful!
Don McCaughey
I do tend to agree with John about the _garbage-in/garbage-out_ concept. I've seen a whole lot of XML over HTTP that breaks every rule that there is about encoding and it bothers me to no end. Take a look at this (http://annevankesteren.nl/2005/03/text-xml) blog entry about the subject and then notice that it is from over 3 years ago. Yet the problem continues to persist. Sorry about the rant.
D.Shawley
I couldn't agree more in principle, but when a sports writer innocently copies from Word and pastes into his company's home-grown CMS (which doesn't handle character encoding _at all_), and you discover that Word's curly quote characters in their default encoding form invalid UTF-8 byte sequences, and your users discover that their app no longer shows them the latest sports scores (or anything much at all), you start wanting to be more tolerant of other programmers' shortcomings.
Don McCaughey
Hehehe... I completely forgot about those "fancy" quotes. Sounds like you are actually looking at Windows 1252. You might want to consider sniffing one of the HTTP headers or maybe you are lucky enough to have a `<generator>` element. Try to signal non-standard processing on something that identifies the faulty generator of the content.
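
For instance, something along these lines could pull the charset parameter out of a Content-Type header (just a sketch; content_type is whatever raw header value your HTTP client hands you):

from email.message import Message

def charset_from_content_type(content_type):
    # Let the stdlib parse the header parameters instead of splitting on ';' by hand.
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()      # None if there is no charset parameter

charset_from_content_type("text/xml; charset=windows-1252")   # -> "windows-1252"
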
D.Shawley
A: 

It's not ideal, but I sometimes do this when I need to cope with bad encodings (sketched here in Python).

# raw_bytes holds the downloaded document. Try UTF-8 first; if the bytes are
# not valid UTF-8, assume a Windows source and decode as Windows-1252 instead.
try:
    text = raw_bytes.decode("utf-8")
except UnicodeDecodeError:
    text = raw_bytes.decode("cp1252", errors="replace")

That is, try to interpret the input as UTF-8, and if it fails, treat it as coming from a Windows system (which it probably is). It seems like a reasonable compromise to me.

Of course, this does require that you download the entire input into memory first, which may not be practical.

Dominic Mitchell