views:

225

answers:

3

Hi,

In my Java application, I need to parse XML that contains the control character 0x2 inside CDATA. I tried a few ways but couldn't get through. I want to avoid any sort of encoding. Is there any way to do this in XML 1.1?

Thanks, Shefali

+1  A: 

XML cannot contain ASCII control characters (apart from TAB, CR and LF), not even inside a CDATA section. They are disallowed by the XML spec.
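As a rough illustration of that rule, here is a minimal sketch of a check against the `Char` production of XML 1.0. The class and method names (`Xml10Chars`, `isAllowed`, `firstIllegal`) are made up for this example and are not part of any library.

```java
public class Xml10Chars {
    // True if the code point is allowed by the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isAllowed(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first disallowed character, or -1 if there is none.
    static int firstIllegal(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isAllowed(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // The sample string holds 'a', the control character 0x02, then 'b'.
        System.out.println(firstIllegal("a" + (char) 0x02 + "b"));
    }
}
```

Running a check like this over incoming data before handing it to a parser at least tells you where the offending character sits.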

Encode binary data into Base64 strings and write them to XML. No need for CDATA in this case.
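A minimal sketch of the Base64 approach, assuming Java 8 or later for `java.util.Base64`. The class name `Base64Payload` and the `<payload>` element are placeholders chosen for the example.

```java
import java.util.Base64;

public class Base64Payload {
    // Encode raw bytes (which may include control characters such as 0x02) as Base64 text.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Decode the Base64 text read back from the XML element.
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] raw = {'a', 0x02, 'b'};
        // Base64 output uses only letters, digits, '+', '/' and '=',
        // so it is safe as plain element content in XML 1.0.
        String xml = "<payload>" + encode(raw) + "</payload>";
        System.out.println(xml);
    }
}
```

The reader of the document then has to call `decode` on the element text, so both sides need to agree that the element carries Base64.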

Tomalak
The first part is only true of XML 1.0. XML 1.1 does allow these characters as character references, although as far as I can tell XML 1.1 doesn't have widespread adoption. Encoding the binary data as Base64 and using XML 1.0, as suggested by Tomalak, is probably the easiest and most compatible solution.
Matthew Wilson
But from what I read about XML 1.1, it appears that some Unicode characters that were not supported in XML 1.0 are now supported in XML 1.1?
Shefali Dubey
They may be part of document content in XML 1.1, yes, but they're still only valid in serialised XML as a character reference. (Even then, as Tomalak said, it's generally a really bad idea to be sticking arbitrary binary gunk in XML.)
bobince
+3  A: 

I need to parse xml that contains control character 0x2 inside CDATA

That's not XML, then. A raw control character U+0002 anywhere means it's not well-formed and hence not an XML document.

In XML 1.1 only, one may include control characters encoded as character references. So you might have tried to fix it up by doing a string replace of \x02 with &#2; before parsing. However, you can't put character references in CDATA sections, so that's not going to fly either.

edit: you could probably fix it in the short-term, if you are absolutely sure that every stray U+0002 character is inside a CDATA section, by replacing each with:

]]>&#2;<![CDATA[

However this is super-shonky. Whatever generated the faulty XML in the first place needs to be fixed. Go kick the person responsible for creating it!
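For what it's worth, the short-term workaround could be sketched in Java roughly like this. It assumes that every stray U+0002 really is inside a CDATA section, that the document declares `version="1.1"`, and that the parser in use accepts XML 1.1 (the parser bundled with the JDK is used here). The class and method names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CdataControlCharFix {
    // Close the CDATA section, emit the control character as a character
    // reference, then reopen the CDATA section.
    static String patch(String brokenXml) {
        return brokenXml.replace(String.valueOf((char) 0x02), "]]>&#2;<![CDATA[");
    }

    // Parse the patched text and return the root element's text content.
    // Assumption: the input declares XML 1.1, otherwise &#2; is rejected.
    static String parseText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String broken = "<?xml version=\"1.1\"?><data><![CDATA[a"
                + (char) 0x02 + "b]]></data>";
        String patched = patch(broken);
        System.out.println(patched);
        System.out.println(parseText(patched).length());
    }
}
```

A blind replace like this will also corrupt the document if a U+0002 ever turns up outside CDATA, which is one more reason to fix the producer instead.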

bobince
+1 for kicking people who produce invalid XML.
Joachim Sauer
A: 

If that's the case, then please suggest the most appropriate way of handling such characters in data.

Shefali Dubey
If you have additions to your question, please post them as comments or edit your question. Don't post them as answers (mostly because they will be ordered effectively randomly).
Joachim Sauer