ansaurus

Question

XML encoding issue

Answer 1

+2 A:

Try these out

CodeToGlory 2009-04-21 03:46:24

The OP noted in a recent comment that he does not want online tools.

bortzmeyer 2009-04-21 09:39:37

@CodeToGlory, sorry I did not specify my needs very clearly. I am using .Net (C#) and I want to find some C# solutions, better to use some existing combination of built-in .Net APIs. :-) Any ideas?

George2 2009-04-21 09:44:59

Answer 2

+1 A:

I presume you want to do this programmatically? In that case, this is highly dependent on what programming language you're using - which language would it be?

For example, I have used this code before in PHP. preg_match allows a /u modifier (which I think is PHP-specific) which treats the pattern, and the string it is being matched against, as UTF-8. A side-effect is that the whole string is checked for UTF-8 validity each time you do this. HTML/XHTML doesn't allow C0/C1 control codes apart from tab, new line, space etc, so I also added a way to check for those here too.

function validate($allowcontrolcodes = false)
    // returns true if this is a valid utf-8 string, false otherwise.  
    // if allowcontrolcodes is false (default), then most C0 codes below 0x20, as
    // well as C1 codes 127-159, will be denied - recommend false for html/xml
    {
     if ($this->string=='') return '';
     return preg_match($allowcontrolcodes
      ? '/^[\x00-\x{d7ff}\x{e000}-\x{10ffff}]++$/u'
      : '/^[\x20-\x7e\x0a\x09\x0d\x{a0}-\x{d7ff}\x{e000}-\x{10ffff}]++$/u',
      $this->string) ? true : false; 
    }

Another way would be to use the DOM, which is available in many languages. The DOM document object has a LoadXML method which loads the document from an XML formatted string. This will fail if the document you input is not valid according to whatever character encoding it has specified, but won't specifically enforce UTF-8 encoding, but if it was successful you can then check the "encoding" property of the document object to see what encoding it was.

thomasrutter 2009-04-21 04:36:15

1. I am using .Net, any C# code which I could use? 2. "A side-effect is that the whole string is checked for UTF-8 validity each time you do this" -- why it is a side effect? I think we must check for validity each time in this way? Any points you think it is not smart/efficient enough to improve (why you said "side effect")?

George2 2009-04-21 09:43:13

Sorry, I'm not familiar with C# and .NET. The /u modifier is PHP-specific, it is a regular expression check in UTF-8 mode and a side effect is it checks for UTF-8 validity. You may have more luck with DOM. For example, http://support.microsoft.com/kb/317664 No doubt System.Xml.XmlDocument has some way (like an "encoding" property) of checking what character encoding was used after importing an XML document - plus if the document is not valid according to any encoding it just won't parse.

thomasrutter 2009-04-21 13:53:56

@thomasrutter, could you show me in my sample (I posted in EDIT 1 section), why it is treated as 'Invalid Byte Sequences' by XML UTF-8 decoder please? I posted both the text form of XML file and related bianry hex value.

George2 2009-04-22 06:30:54

@thomasrutter, I have studied the Microsoft kb document, does it include information about how to check whether an XML document contains invalid UTF-8 byte? I did not find such info.

George2 2009-04-22 06:32:49

A proper XML parser will fail to load an XML document if it contains any invalid characters - that is, byte sequences which are not valid as characters in its current character encoding. Therefore, if XmlDocument.loadXML() or XmlDocument.load() succeeds at all then this means the encoding is valid. That was what I was getting at. Unfortunately I have never done anything like this in .NET or C# so I will probably be unable to help further, but that's the approach I would take.

thomasrutter 2009-04-22 07:29:48

The second part of Brian Agnew's answer is what I mean. That Load method will fail if there are invalid characters.

thomasrutter 2009-04-22 07:31:02

@thomasrutter, to my surprise the method will never fail even for invalid input. :-( But anyway, let us discuss non .Net related questions. could you show me in my sample (I posted in EDIT 1 section), why it is treated as invalid encoding file by XML UTF-8 decoder please? I posted both the text form of XML file and related bianry hex value.

George2 2009-04-22 07:43:35

Answer 3

+2 A:

libxml2 can do it, it is available as a library (to integrate into your programs) or through the command-line tool xmllint. Here is an example with xmllint:

[Proper file] 
% head test.xml
<?xml version="1.0" encoding="utf-8"?>
<café>Ils s'étaient ...

% xmllint --noout test.xml
% 

[One byte in a multibyte character removed]
% xmllint --noout test.xml
test.xml:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xC3 0x74 0x61 0x69
<café>Ils s'Ãtaient ...
             ^

bortzmeyer 2009-04-21 07:15:41

Hi bortzmeyer, any library which could be easily used with .Net code?

George2 2009-04-21 09:37:44

I am interested about the command line feature, could you share the command line and options you use to valid whether there is invalid character in an XML document using the libxml2 tool? :-)

George2 2009-04-21 10:00:28

What do you mean, to share? I gave an example of use and the URL of the libxml2 Web site. Isn't it enough?

bortzmeyer 2009-04-21 11:08:30

Sorry, my bad English. 1. I mean is it your command which is used to check XML file encoding character -- "xmllint --noout test.xml"? 2. I think UTF-8 encoding could contains any character (I think every character in every language has its related UTF-8 form represetation of unicode table), why there could be invalid character in UTF-8 encoding? I believe unicode character Ã has related UTF-8 value.

George2 2009-04-21 13:37:19

Yes, UTF-8 can encode every Unicode character (that's the point). But not every byte stream is legal UTF-8, far from it. So, yes, there are files which are not legal UTF-8.

bortzmeyer 2009-04-21 14:03:15

@bortzmeyer, could you show me in my sample (I posted in EDIT 1 section), why it is treated as 'Invalid Byte Sequences' by XML UTF-8 decoder please? I posted both the text form of XML file and related bianry hex value.

George2 2009-04-22 06:22:38

@bortzmeyer, I have also tried to use the tool you recommended -- libxml, but it fails to start. I posted my screen snapshot into EDIT 2 section of my original post, could you take a look please?

George2 2009-04-22 06:29:28

Sorry for the libxml installation but it is a Windows-specific problem and I know nothing about Windows.

bortzmeyer 2009-04-22 20:16:47

No, I cannot test the sample you posted, it is an image of an hexadecimal dump (!) and I need the original file, the raw file.

bortzmeyer 2009-04-22 20:18:14

@bortzmeyer, 1. do you mean you never use libxml on Windows or libxml not support Windows platform? 2. Since this platform (stackoverflow) does not allow upload attachment files, could you let me know how to post attachment or recommend some free hosting servers? :-)

George2 2009-04-23 05:06:04

I mean I never use Windows.

bortzmeyer 2009-04-23 06:15:59

Answer 4

+1 A:

The easiest way to do this is to simply run the XML through a command line utility to perform this check.

I always have a copy of XMLStar available for stuff like this. It'll indicate immediately if it can/cannot parse your XML, and thus indicate whether the encoding is correct or not.

If you're looking for a coded method to do this, simply load the XML into your XML parser of choice. An encoding error will immediately trigger a parser exception (since the encoding is wrong, parsing can't take place, by definition)

e.g.

XmlDocument xDoc = new XmlDocument();

Next use the load method to load the XML document from the specified stream.

xDoc.Load("sampleXML.xml");

Brian Agnew 2009-04-21 08:26:29

@Brian, I am writing .Net code, any easy way to using existing .Net API to check?

George2 2009-04-21 09:39:15

See above (edited answer). Just use your .net parser

Brian Agnew 2009-04-21 09:47:20

@Brian, confused. where is your answer you mentioned ".net parser"? Could you point out please?

George2 2009-04-21 09:48:55

Re-edited. See above.

Brian Agnew 2009-04-21 09:49:30

Have you tested it? I put some invalid characters but Load method never throws exception? I assume you rely on whether there is exception thrown from Load method to check whether there is invalid character?

George2 2009-04-21 09:56:57

I have a new ideas, we compare two things, 1. the result by XMLStar about whether an XML documet contains invalid character 2. the result of your above .Net code. Could you share the command you use to valid XML document when using XMLStar? I am very interested to make my hands dirty to use this tool. :-)

George2 2009-04-21 09:59:15

Any XMLStar command will initially parse the XML. To be clear, any XML parser will throw an error on an invalid character/encoding, since by definition it can't parse the XML correctly. See http://xmlstar.sourceforge.net/docs.php and in particular the 'val' command

Brian Agnew 2009-04-21 10:03:05

@Brian, 1. have your verified that your .Net code could actually valid there is invalid character in UTF-8 encoding? I tested it and it never throw any exceptions even if invalid character. 2. I think UTF-8 encoding could contains any character (I think every character in every language has its related UTF-8 form represetation of unicode table), why there could be invalid character in UTF-8 encoding? I believe unicode character Ã has related UTF-8 value.

George2 2009-04-21 13:40:59

Whilst UTF-8 can represent any character via 1/2/4 bytes, that *doesn't* imply that every byte sequence maps to a character

Brian Agnew 2009-04-21 14:01:45

See http://en.wikipedia.org/wiki/UTF-8 and in particular the section labelled 'Invalid Byte Sequences'

Brian Agnew 2009-04-21 14:07:59

@Brian, I have tried to use your method, but it has some issues. Could you check my just posted "EDIT1" part of my original post? I provided detailed information about my XML file to check and related error messages with my confusions.

George2 2009-04-22 06:19:17

@Brian, your information from wikipedia about UTF-8 encoding, especially the 'Invalid Byte Sequences' section is very helpful. Could you show me in my sample (I posted in EDIT 1 section), why it is treated as 'Invalid Byte Sequences' by XML UTF-8 decoder please?

George2 2009-04-22 06:21:36

I don't know, I'm afraid. Alan M has an analysis of what's going on (elsewhere in this question), but I think your best bet is to reject the data you've been given and get whoever sent it to you to sort it out!

Brian Agnew 2009-04-22 08:29:36

@Brian, my code has a bug and I find your solution of the .Net code is correct. Sorry!

George2 2009-04-22 14:04:24

Excellent. Thx for letting me know!

Brian Agnew 2009-04-22 14:20:16

But it never gives as rich information as XMLStartlet. Let me know if you have methods to extract as rich information as the .Net exception object.

George2 2009-04-23 05:04:32

Answer 5

+1 A:

I don't know what's causing your problem, but it isn't a limitation of UTF-8 or an error in the encoding process. UTF-8 can encode every character known to Unicode, and the problematic byte sequences (ED BF 9D and ED B4 82) are valid--that is, the first byte starts with 1110 to indicate a three-byte sequence, and each of the other two bytes starts with 10 as continuation bytes are supposed to. It's the values they're trying to encode that are invalid.

Unicode and ISO/IEC 10646 do not, and will never, assign characters to any of the code points in the U+D800–U+DFFF range, so an individual code value from a surrogate pair does not ever represent a character. -Wikipedia

Your problem characters are U+DFDD and U+DD02. The fact that there are two characters from the range used for surrogate pairs might seem to suggest that they were meant to be a surrogate pair, but that doesn't work. It's UTF-16 that employs surrogate pairs; UTF-8 would encode the character as a single, four-byte sequence.

Another possibility is modified UTF-8, which does encode each byte of the surrogate pair separately. But that doesn't work either: a surrogate pair is always made up of one byte from the high-surrogate range (U+DC00..U+DFFF) and one from the low-surrogate range (U+D800..U+DBFF). These characters are both from the high-surrogate range.

So it appears to be a matter of bad data rather than faulty encoding. It would help a lot if we knew what those characters were supposed to be. Failing that, some info about what kind of data you're expecting (what languages, for example), where the data came from, what's been done to it... that kind of thing.

Alan Moore 2009-04-22 08:12:40

@Alan, "Your problem characters are U+DFDD and U+DD02" -- confused, where in my posted XML content contains character U+DFDD?

George2 2009-04-22 12:25:54

Another confusion is, why you quote UTF-16/UCS-2 wikipedia page? My question is dealing with UTF-8, not dealing with UTF-16/UCS-2. :-)

George2 2009-04-22 13:25:23

The UTF-8 byte sequences ED BF 9D and ED B4 82 decode to U+DFDD and U+DD02. As for the UTF-16 page, it seemed like the most succinct way to explain why those code points aren't valid: because they're reserved for UTF-16 surrogate pairs.

Alan Moore 2009-04-22 13:52:29

"The UTF-8 byte sequences ED BF 9D and ED B4 82 decode to U+DFDD and U+DD02" -- I agree and I calculated by myself and get the same results. So, my original XML file should be valid? But "As for the UTF-16 page" -- I am very confused. Why you mention UTF-16 here? UTF-8 and UTF-16 are totally different. Any more description please?BTW: from page http://en.wikipedia.org/wiki/UTF-8 I did not find any information about what range of UTF-8 are reserved. :-(

George2 2009-04-22 14:03:45

No, your XML file isn't valid because the *characters* U+DFDD and U+DD02 don't exist--that is, those numbers don't map to characters in any Unicode character chart. It isn't the *encoding* (UTF-8) that reserves those numbers, it's the Unicode database itself. (And the reason it reserves them is so another encoding, UTF-16, can use them for surrogate pairs, but that was just background information.)

Alan Moore 2009-04-22 18:05:00

@Alan, 1. I read your reply and your quoted materials carefully again. I think the answer to my question should be "Unicode disallows the 2048 code points U+D800..U+DFFF" from the UTF-8 page below, correct?http://en.wikipedia.org/wiki/UTF-82. In UTF-16, if this range is for the values of encoded form of byte sequences, then it is allowed, but for the unicode character point itself, neither UTF-16 nor UTF-8 allows maps encoded byte sequences to this range of unicode character, correct?

George2 2009-04-23 05:03:20

http://en.wikipedia.org/wiki/UTF-8#Invalid_code_points is the section you're referring to. Yes, that seems to be the problem here: the characters were invalid before they were UTF-8 encoded. So where did those characters come from?

Alan Moore 2009-04-23 06:39:29

ansaurus

tags:

views:

answers:

XML encoding issue

related questions