There are a variety of characters that are not legally encodeable in XML, e.g. U+0007 ('bell') and U+001B ('escape'). Most of the interesting ones are non-whitespace 'control' characters.

It's clear from (e.g.) this question and others that it's the XML spec that's the issue -- but can anyone enlighten me as to why the XML spec forbids these characters?

It seems like it could have been required that they be encoded in escapes, e.g. as &#x0007; and &#x001B; respectively, but perhaps there's a practical reason that the characters were forbidden rather than required to be escaped?
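To make the behavior concrete, here is a minimal sketch using Python's standard-library parser (expat-based, XML 1.0 only). The helper name `parses` is just for illustration. It suggests that both the raw bell character and its numeric character reference are rejected, while an escaped tab is accepted:

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if the stdlib XML 1.0 parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


raw_bell = "<a>\x07</a>"       # literal U+0007 in element content
escaped_bell = "<a>&#x7;</a>"  # the same character as a numeric reference
escaped_tab = "<a>&#x9;</a>"   # U+0009 is one of the few legal C0 controls

for label, doc in [("raw bell", raw_bell),
                   ("escaped bell", escaped_bell),
                   ("escaped tab", escaped_tab)]:
    print(label, "->", "accepted" if parses(doc) else "rejected")
```

Other XML 1.0 parsers should behave the same way on these inputs, since the restriction comes from the spec's Char production rather than from this particular library.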

Answerers have suggested that there is some motivation towards avoiding transmission control characters, but Unicode includes many other control-like characters (consider U+200C "zero width non joiner"). I recognize there may be no good reason for this behavior, but I would still like to understand it better.

It's particularly frustrating because when those character values appear in other encodings or data formats, I end up "double-escaping" them in new XML documents that need to carry this data.

+6  A: 

My understanding is that this range is barred on the grounds that a markup language should have no need to support transmission and flow control characters, and that including them would create problems for editors and parsers in binary conversion.

I'm struggling to find anything ex cathedra on this from Tim Bray et al though.

edit: some discussion of control chars and a vague admission it wasn't exactly over-engineered

annakata
thank you for that thought -- I have updated the question to reflect my understanding of control vs other characters. I would welcome 'ex cathedra' links though!
Trochee
your understanding is not at fault, but try and adjust your thinking to how those characters could make sense in a markup language and you'll see they can't - there is no spoon, as it were (still looking for links btw)
annakata
but as a markup language for data -- and XML is that -- those characters are no different from some other control characters, so it seems like a design error/inconsistency, as the links you provide suggest. Thank you for those links.
Trochee
+1  A: 

XML was designed specifically around Unicode (in particular UTF-8 and UTF-16) and ISO/IEC 10646, both of which (I'm not quite positive about ISO 10646) contain the transmission/flow control characters which were left over from ASCII and the days of character-based terminals. While those characters still have uses, they don't belong in a format like XML.

As for these new encodings that use those codes for something else, well, it seems that the XML spec may need to adapt.

foxxtrot
please see my update of the question above -- if they *were* designed around Unicode and ISO-10646, why not support the entirety of the standard?
Trochee
+1  A: 

It seems like it could have been required that they be encoded in escapes, e.g. as &#x0007; and &#x001B;

You can do exactly that in XML 1.1, for all but \0.

bobince
Specifically, in XML 1.1 the control characters in 0x1-0x1F (other than tab, LF, and CR) and in 0x7F-0x9F (other than NEL, 0x85) *must* be encoded as escapes. The former were forbidden and the latter were optionally not-escaped in 1.0.
Chad Wellington
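For reference, the ranges discussed in this comment can be written down as predicates. This is a sketch transcribed from the Char and RestrictedChar productions of the XML 1.0 and 1.1 specs as I understand them; the function names are made up for illustration and this is not a validating parser:

```python
def is_xml10_char(cp: int) -> bool:
    """Code points allowed at all in an XML 1.0 document (Char production)."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_xml11_char(cp: int) -> bool:
    """Code points allowed in XML 1.1, literally or as a character reference."""
    return (0x1 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_xml11_restricted(cp: int) -> bool:
    """Code points that XML 1.1 allows only as character references."""
    return (0x1 <= cp <= 0x8
            or 0xB <= cp <= 0xC
            or 0xE <= cp <= 0x1F
            or 0x7F <= cp <= 0x84
            or 0x86 <= cp <= 0x9F)


# U+0007 (bell): illegal in 1.0, legal but escape-only in 1.1.
print(is_xml10_char(0x07), is_xml11_char(0x07), is_xml11_restricted(0x07))
# U+0000 (NUL): illegal in both versions.
print(is_xml10_char(0x00), is_xml11_char(0x00))
```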
A: 
MSalters
+1  A: 

That was a long time ago, but my best recollection was that they have no graphical representation and also no agreed-upon semantics. Picking a couple at random we see U+0006 "Acknowledge" or U+0016 "Synchronous idle"... what do those mean? Unicode doesn't say. Even back when everyone claimed to support ASCII, there was no interoperability around this junk. XML is supposed to be about interoperability.

The experience has been that people who want to use these things really want to jam binary data into their XML elements (and the next thing they want is to include U+0000 NULL), which has been an explicit non-goal of XML since day 1. If you want to represent the numbers 0x6 or 0x16, there are lots of good ways to do that which don't muddy the notion of "character".
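As one example of such a way, here is a minimal sketch that carries the byte values 0x06 and 0x16 as Base64 text instead of as characters. The element name `payload` and the `encoding` attribute are hypothetical conventions chosen for this illustration, not anything defined by XML itself:

```python
import base64
import xml.etree.ElementTree as ET

# The raw byte values we want to transport; not legal as XML 1.0 characters.
payload = bytes([0x06, 0x16])

# Encode the bytes as ASCII-safe Base64 text and store that in the element.
elem = ET.Element("payload", {"encoding": "base64"})  # hypothetical names
elem.text = base64.b64encode(payload).decode("ascii")
xml_text = ET.tostring(elem, encoding="unicode")
print(xml_text)

# The receiver parses the XML normally, then decodes the text back to bytes.
round_tripped = base64.b64decode(ET.fromstring(xml_text).text)
print(round_tripped == payload)
```

This keeps the document well-formed under XML 1.0 and avoids any ad hoc "double-escaping" scheme, at the cost of the data no longer being readable as text in the document.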