views:

922

answers:

5
+1  Q: 

XML encoding issue

Hello everyone,

I want to know whether there is a quick way to check that an XML document is correctly encoded in UTF-8 and does not contain any characters that are not allowed in UTF-8-encoded XML.

<?xml version="1.0" encoding="utf-8"?>

thanks in advance, George

EDIT1: Here is the content of my XML file, in both text form and binary form.

http://tinypic.com/view.php?pic=2r2akvr&s=5

I have tried tools like xmlstarlet to check. The verdict is correct (invalid, out of the allowed UTF-8 range), but the error message itself seems wrong: in the content at the link above there is no character whose value is 0xDFDD. Any ideas?

BTW: I can send the XML file to anyone, but I did not find a way to upload the file as an attachment here. If anyone needs this file for analysis, please feel free to let me know.

D:\xmlstarlet-1.0.1-win32\xmlstarlet-1.0.1>xml val a.xml
a.xml:2: parser error : Char 0xDFDD out of allowed range
<URL>student=1砜濏磦</URL>
              ^
a.xml:2: parser error : Char 0xDFDD out of allowed range
<URL>student=1砜濏磦</URL>
              ^
a.xml:2: parser error : internal error
<URL>student=1砜濏磦</URL>
              ^
a.xml:2: parser error : Extra content at the end of the document
<URL>student=1砜濏磦</URL>
              ^
a.xml - invalid

EDIT2: I have also tried the libxml tool to validate the XML file, but ran into an error when starting it. Here is a screenshot. Any ideas?

http://tinypic.com/view.php?pic=2ildjpe&s=5

OS is Windows Server 2003 x64.

+2  A: 

Try these out

  1. http://validator.w3.org/#validate_by_input

  2. http://www.w3schools.com/XML/xml_validator.asp

CodeToGlory
The OP noted in a recent comment that he does not want online tools.
bortzmeyer
@CodeToGlory, sorry, I did not specify my needs very clearly. I am using .NET (C#) and I am looking for C# solutions, ideally built from existing .NET APIs. :-) Any ideas?
George2
+1  A: 

I presume you want to do this programmatically? In that case, this is highly dependent on what programming language you're using - which language would it be?

For example, I have used this code before in PHP. preg_match supports a /u modifier (which I think is PHP-specific) that treats both the pattern and the string it is matched against as UTF-8. A side effect is that the whole string is checked for UTF-8 validity each time you call it. HTML/XHTML doesn't allow C0/C1 control codes apart from tab, newline, carriage return, etc., so I also added a way to check for those.

function validate($allowcontrolcodes = false)
    // Method of a class holding the text in $this->string.
    // Returns true if $this->string is valid UTF-8, false otherwise.
    // If $allowcontrolcodes is false (the default), most C0 codes below
    // 0x20 (other than tab, LF, CR) as well as codes 127-159 are also
    // denied -- recommend false for HTML/XML.
    {
     if ($this->string == '') return true; // an empty string is valid UTF-8
     return preg_match($allowcontrolcodes
      ? '/^[\x00-\x{d7ff}\x{e000}-\x{10ffff}]++$/u'
      : '/^[\x20-\x7e\x0a\x09\x0d\x{a0}-\x{d7ff}\x{e000}-\x{10ffff}]++$/u',
      $this->string) ? true : false;
    }
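For readers using another language, here is a rough Python sketch of the same idea (my illustration, not part of the original answer): strict UTF-8 decoding catches malformed byte sequences, and a character-class check then rejects characters XML disallows.

```python
import re

# Characters XML 1.0 allows: tab, LF, CR, then the printable ranges,
# excluding the surrogate block U+D800-U+DFFF and U+FFFE/U+FFFF.
_XML_CHARS = re.compile(
    r'^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*$'
)

def is_valid_xml_utf8(raw: bytes) -> bool:
    """True if raw is well-formed UTF-8 and every character is XML-legal."""
    try:
        text = raw.decode('utf-8')  # strict decode rejects bad byte sequences
    except UnicodeDecodeError:
        return False
    return _XML_CHARS.match(text) is not None
```

Like the /u regex above, this walks the whole string, so the validity check is repeated on every call.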

Another way would be to use the DOM, which is available in many languages. The DOM document object has a LoadXML method which loads the document from an XML-formatted string. This will fail if the input document is not valid according to whatever character encoding it has specified, though it won't specifically enforce UTF-8. If it succeeds, you can then check the document object's "encoding" property to see which encoding was used.
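The same "just parse it" approach, sketched in Python with the standard library (my illustration; the function name is mine): a parse failure means the bytes were not valid in the declared encoding.

```python
import xml.etree.ElementTree as ET

def parse_or_error(raw: bytes):
    """Return the root element on success, or the parser's message on failure."""
    try:
        return ET.fromstring(raw)
    except ET.ParseError as e:
        return str(e)

good = parse_or_error(b'<?xml version="1.0" encoding="utf-8"?><a>caf\xc3\xa9</a>')
bad = parse_or_error(b'<?xml version="1.0" encoding="utf-8"?><a>caf\xc3</a>')
```

Here `good` parses (the bytes 0xC3 0xA9 are a valid encoding of 'é'), while `bad`, with the continuation byte removed, fails with a "not well-formed" complaint from the parser.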

thomasrutter
1. I am using .NET; is there any C# code I could use? 2. "A side-effect is that the whole string is checked for UTF-8 validity each time you do this" -- why is that a side effect? Don't we have to check validity each time this way anyway? Is there anything you consider inefficient about it (i.e., why did you say "side effect")?
George2
Sorry, I'm not familiar with C# and .NET. The /u modifier is PHP-specific, it is a regular expression check in UTF-8 mode and a side effect is it checks for UTF-8 validity. You may have more luck with DOM. For example, http://support.microsoft.com/kb/317664 No doubt System.Xml.XmlDocument has some way (like an "encoding" property) of checking what character encoding was used after importing an XML document - plus if the document is not valid according to any encoding it just won't parse.
thomasrutter
@thomasrutter, could you show me, using my sample (posted in the EDIT1 section), why it is treated as an "invalid byte sequence" by the XML UTF-8 decoder? I posted both the text form of the XML file and the corresponding binary hex values.
George2
@thomasrutter, I have studied the Microsoft KB document; does it include information about how to check whether an XML document contains invalid UTF-8 bytes? I did not find such info.
George2
A proper XML parser will fail to load an XML document if it contains any invalid characters - that is, byte sequences which are not valid as characters in its current character encoding. Therefore, if XmlDocument.LoadXml() or XmlDocument.Load() succeeds at all, the encoding is valid. That was what I was getting at. Unfortunately I have never done anything like this in .NET or C#, so I will probably be unable to help further, but that's the approach I would take.
thomasrutter
The second part of Brian Agnew's answer is what I mean. That Load method will fail if there are invalid characters.
thomasrutter
@thomasrutter, to my surprise the method never fails, even for invalid input. :-( But anyway, setting the .NET question aside: could you show me, using my sample (posted in the EDIT1 section), why it is treated as an invalidly encoded file by the XML UTF-8 decoder? I posted both the text form of the XML file and the corresponding binary hex values.
George2
+2  A: 

libxml2 can do it, it is available as a library (to integrate into your programs) or through the command-line tool xmllint. Here is an example with xmllint:

[Proper file] 
% head test.xml
<?xml version="1.0" encoding="utf-8"?>
<café>Ils s'étaient ...

% xmllint --noout test.xml
% 

[One byte in a multibyte character removed]
% xmllint --noout test.xml
test.xml:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xC3 0x74 0x61 0x69
<café>Ils s'Ãtaient ...
             ^
bortzmeyer
Hi bortzmeyer, any library which could be easily used with .Net code?
George2
I am interested in the command-line feature. Could you share the command line and options you use to check whether there are invalid characters in an XML document with the libxml2 tool? :-)
George2
What do you mean, to share? I gave an example of use and the URL of the libxml2 Web site. Isn't it enough?
bortzmeyer
Sorry, my bad English. 1. I mean: is "xmllint --noout test.xml" the command you use to check the XML file's character encoding? 2. I thought UTF-8 could encode any character (every character in every language has a UTF-8 representation of its Unicode code point), so why can there be invalid characters in a UTF-8 encoding? I believe the Unicode character à has a corresponding UTF-8 value.
George2
Yes, UTF-8 can encode every Unicode character (that's the point). But not every byte stream is legal UTF-8, far from it. So, yes, there are files which are not legal UTF-8.
bortzmeyer
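To make the distinction concrete, a small Python illustration (mine, mirroring the xmllint example above, where a continuation byte of 'é' went missing):

```python
text = "Ils s'étaient".encode('utf-8')       # 'é' encodes as 0xC3 0xA9
broken = text.replace(b'\xc3\xa9', b'\xc3')  # drop the continuation byte

text.decode('utf-8')                          # the intact bytes decode fine
try:
    broken.decode('utf-8')
except UnicodeDecodeError as err:
    print('invalid:', err.reason)             # e.g. "invalid continuation byte"
```

Every character has a UTF-8 encoding, but a lead byte like 0xC3 promises a continuation byte; when the next byte is ordinary ASCII instead, the stream is no longer legal UTF-8.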
@bortzmeyer, could you show me, using my sample (posted in the EDIT1 section), why it is treated as an "invalid byte sequence" by the XML UTF-8 decoder? I posted both the text form of the XML file and the corresponding binary hex values.
George2
@bortzmeyer, I have also tried the tool you recommended -- libxml -- but it fails to start. I posted a screenshot in the EDIT2 section of my original post; could you take a look, please?
George2
Sorry for the libxml installation but it is a Windows-specific problem and I know nothing about Windows.
bortzmeyer
No, I cannot test the sample you posted: it is an image of a hexadecimal dump (!), and I need the original, raw file.
bortzmeyer
@bortzmeyer, 1. Do you mean you have never used libxml on Windows, or that libxml does not support Windows? 2. Since this platform (Stack Overflow) does not allow uploading attachments, could you tell me how to post an attachment, or recommend some free hosting servers? :-)
George2
I mean I never use Windows.
bortzmeyer
+1  A: 

The easiest way to do this is to simply run the XML through a command line utility to perform this check.

I always have a copy of XMLStar available for stuff like this. It'll indicate immediately if it can/cannot parse your XML, and thus indicate whether the encoding is correct or not.

If you're looking for a coded method to do this, simply load the XML into your XML parser of choice. An encoding error will immediately trigger a parser exception, since with a wrong encoding, parsing can't take place, by definition.

e.g.

XmlDocument xDoc = new XmlDocument();

Next, use the Load method to load the XML document from the specified file; it throws an XmlException if the document cannot be parsed.

xDoc.Load("sampleXML.xml");
Brian Agnew
@Brian, I am writing .NET code; is there an easy way to check this using existing .NET APIs?
George2
See above (edited answer). Just use your .net parser
Brian Agnew
@Brian, I'm confused -- where is the answer in which you mentioned the ".net parser"? Could you point it out, please?
George2
Re-edited. See above.
Brian Agnew
Have you tested it? I put in some invalid characters, but the Load method never throws an exception. I assume you rely on whether the Load method throws an exception to detect invalid characters?
George2
I have a new idea: let us compare two things -- 1. XMLStar's result on whether an XML document contains invalid characters, and 2. the result of your .NET code above. Could you share the command you use to validate an XML document with XMLStar? I am very interested in getting my hands dirty with this tool. :-)
George2
Any XMLStar command will initially parse the XML. To be clear, any XML parser will throw an error on an invalid character/encoding, since by definition it can't parse the XML correctly. See http://xmlstar.sourceforge.net/docs.php and in particular the 'val' command
Brian Agnew
@Brian, 1. Have you verified that your .NET code can actually detect invalid characters in UTF-8? I tested it and it never throws any exceptions, even with invalid characters. 2. I thought UTF-8 could encode any character (every character in every language has a UTF-8 representation of its Unicode code point), so why can there be invalid characters in a UTF-8 encoding? I believe the Unicode character à has a corresponding UTF-8 value.
George2
Whilst UTF-8 can represent any character via sequences of one to four bytes, that *doesn't* imply that every byte sequence maps to a character
Brian Agnew
See http://en.wikipedia.org/wiki/UTF-8 and in particular the section labelled 'Invalid Byte Sequences'
Brian Agnew
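One of the cases that Wikipedia section lists can be demonstrated in a couple of lines of Python (my sketch, not Brian's): an overlong encoding, where a character is encoded in more bytes than necessary, is rejected even though its bit pattern looks plausible.

```python
def utf8_rejects(seq: bytes) -> bool:
    """True if a strict UTF-8 decoder refuses this byte sequence."""
    try:
        seq.decode('utf-8')
        return False
    except UnicodeDecodeError:
        return True

# 0xC0 0xAF would be an "overlong" two-byte encoding of '/' (U+002F);
# decoders must reject it, since '/' is encoded as the single byte 0x2F.
print(utf8_rejects(b'\xc0\xaf'))   # True
print(utf8_rejects(b'\x2f'))       # False: the correct one-byte encoding
print(utf8_rejects(b'\xff'))       # True: 0xFF can never appear in UTF-8
```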
@Brian, I have tried your method, but ran into some issues. Could you check the "EDIT1" part I just added to my original post? I provided detailed information about the XML file being checked, the related error messages, and my confusions.
George2
@Brian, your pointer to the Wikipedia article on UTF-8, especially the "Invalid Byte Sequences" section, is very helpful. Could you show me, using my sample (posted in the EDIT1 section), why it is treated as an "invalid byte sequence" by the XML UTF-8 decoder?
George2
I don't know, I'm afraid. Alan M has an analysis of what's going on (elsewhere in this question), but I think your best bet is to reject the data you've been given and get whoever sent it to you to sort it out!
Brian Agnew
@Brian, my code had a bug; I find your .NET solution is actually correct. Sorry!
George2
Excellent. Thx for letting me know!
Brian Agnew
But it never gives information as rich as XMLStarlet's. Let me know if you know of a way to extract equally rich information from the .NET exception object.
George2
+1  A: 

I don't know what's causing your problem, but it isn't a limitation of UTF-8 or an error in the encoding process. UTF-8 can encode every character known to Unicode, and the problematic byte sequences (ED BF 9D and ED B4 82) are valid--that is, the first byte starts with 1110 to indicate a three-byte sequence, and each of the other two bytes starts with 10 as continuation bytes are supposed to. It's the values they're trying to encode that are invalid.

Unicode and ISO/IEC 10646 do not, and will never, assign characters to any of the code points in the U+D800–U+DFFF range, so an individual code value from a surrogate pair does not ever represent a character. -Wikipedia

Your problem characters are U+DFDD and U+DD02. The fact that there are two characters from the range used for surrogate pairs might seem to suggest that they were meant to be a surrogate pair, but that doesn't work. It's UTF-16 that employs surrogate pairs; UTF-8 would encode the character as a single, four-byte sequence.

Another possibility is modified UTF-8, which encodes each half of a surrogate pair separately. But that doesn't work either: a surrogate pair is always made up of one value from the high-surrogate range (U+D800..U+DBFF) followed by one from the low-surrogate range (U+DC00..U+DFFF). These values are both from the low-surrogate range.

So it appears to be a matter of bad data rather than faulty encoding. It would help a lot if we knew what those characters were supposed to be. Failing that, some info about what kind of data you're expecting (what languages, for example), where the data came from, what's been done to it... that kind of thing.
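Alan's decoding can be checked mechanically. In Python (my sketch, not part of the original answer), the 'surrogatepass' error handler reveals which code point a sequence encodes, while strict decoding rejects it:

```python
seq = b'\xed\xbf\x9d'   # the first problematic sequence from the question

# A strict UTF-8 decoder refuses it: the encoded value lands in the
# reserved surrogate range U+D800-U+DFFF.
try:
    seq.decode('utf-8')
except UnicodeDecodeError as err:
    print('rejected:', err.reason)

# Decoding with 'surrogatepass' shows the code point anyway: U+DFDD.
code_point = ord(seq.decode('utf-8', 'surrogatepass'))
print(hex(code_point))   # 0xdfdd
```

The same calculation on ED B4 82 yields U+DD02, matching the analysis above.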

Alan Moore
@Alan, "Your problem characters are U+DFDD and U+DD02" -- I'm confused: where in my posted XML content is the character U+DFDD?
George2
Another confusion: why do you quote the UTF-16/UCS-2 Wikipedia page? My question deals with UTF-8, not UTF-16/UCS-2. :-)
George2
The UTF-8 byte sequences ED BF 9D and ED B4 82 decode to U+DFDD and U+DD02. As for the UTF-16 page, it seemed like the most succinct way to explain why those code points aren't valid: because they're reserved for UTF-16 surrogate pairs.
Alan Moore
"The UTF-8 byte sequences ED BF 9D and ED B4 82 decode to U+DFDD and U+DD02" -- I agree; I calculated them myself and got the same results. So my original XML file should be valid? But "as for the UTF-16 page" -- I am very confused. Why do you mention UTF-16 here? UTF-8 and UTF-16 are totally different. Could you describe it a bit more? BTW: on the page http://en.wikipedia.org/wiki/UTF-8 I did not find any information about which ranges of UTF-8 are reserved. :-(
George2
No, your XML file isn't valid because the *characters* U+DFDD and U+DD02 don't exist--that is, those numbers don't map to characters in any Unicode character chart. It isn't the *encoding* (UTF-8) that reserves those numbers, it's the Unicode database itself. (And the reason it reserves them is so another encoding, UTF-16, can use them for surrogate pairs, but that was just background information.)
Alan Moore
@Alan, 1. I read your reply and your quoted materials carefully again. I think the answer to my question should be "Unicode disallows the 2048 code points U+D800..U+DFFF" from the UTF-8 page (http://en.wikipedia.org/wiki/UTF-8), correct? 2. In UTF-16, this range is allowed for the code units of the encoded form, but as for the Unicode code points themselves, neither UTF-16 nor UTF-8 allows mapping encoded byte sequences to this range, correct?
George2
http://en.wikipedia.org/wiki/UTF-8#Invalid_code_points is the section you're referring to. Yes, that seems to be the problem here: the characters were invalid before they were UTF-8 encoded. So where did those characters come from?
Alan Moore