I have an automatically generated XML file that is supposed to be encoded in UTF-8. For the most part, the encoding is correct. However, there are a few characters that are not encoded properly. When viewing the file in Emacs, I see \370 and \351.

Is there a way to detect these characters programmatically? I prefer solutions using PHP, but solutions in Perl or Java would be very helpful as well.

A: 

Are you absolutely certain that the encoding is incorrect? Rather than use Emacs, I'd use a binary file viewer. What are the actual bytes at the problematic position?

With Java it would be reasonably easy to detect invalid UTF-8 byte patterns. I'm not sure whether the default Charset support would handle it, but UTF-8 is pretty simple. I usually use the UTF-8 table here as a reference for valid byte sequences.

Jon Skeet
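Since the asker prefers PHP: assuming the mbstring extension is available, a quick whole-string validity check (a simpler alternative to hand-rolling the byte-pattern logic this answer describes) is `mb_check_encoding`:

```php
<?php
// mb_check_encoding() returns true only if the string is a valid
// byte sequence in the given encoding (requires ext/mbstring).
$good = "caf\xc3\xa9"; // "café" correctly encoded as UTF-8
$bad  = "caf\xe9";     // Latin-1 0xE9 — not valid UTF-8

var_dump(mb_check_encoding($good, 'UTF-8')); // bool(true)
var_dump(mb_check_encoding($bad, 'UTF-8'));  // bool(false)
```

This only tells you *whether* the string is valid, not *where* it breaks.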
+3  A: 

You can check for UTF-8-ness of a string with this regular expression:

/^(?:
    [\x00-\x7f] |
    [\xc0-\xdf][\x80-\xff] |
    [\xe0-\xef][\x80-\xff]{2} |
    [\xf0-\xf7][\x80-\xff]{3}
)*$/x
Martin v. Löwis
Thanks! Will test this, and wrap the value with `utf8_encode` if it fails the test.
notnoop
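In PHP, the pattern above can be wrapped in `preg_match` (a sketch; the helper name is made up). Note that this is a structural check only: it accepts a few sequences that strict UTF-8 forbids, such as overlong encodings, UTF-16 surrogates, and the lead bytes 0xF5–0xF7.

```php
<?php
// Structural UTF-8 check using the answer's pattern; the /x modifier
// lets us keep the whitespace and comments inside the pattern.
function looks_like_utf8($s) {
    return (bool) preg_match(
        '/^(?:
            [\x00-\x7f]               |  # ASCII
            [\xc0-\xdf][\x80-\xff]    |  # 2-byte sequence
            [\xe0-\xef][\x80-\xff]{2} |  # 3-byte sequence
            [\xf0-\xf7][\x80-\xff]{3}    # 4-byte sequence
        )*$/x',
        $s
    );
}

var_dump(looks_like_utf8("caf\xc3\xa9")); // bool(true)
var_dump(looks_like_utf8("\xf8\xe9"));    // bool(false) — the bytes Emacs showed
```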
A: 

You can use libxml_use_internal_errors and libxml_get_errors to loop through the errors that occurred when the document was loaded. The error code you're looking for is XML_ERR_INVALID_CHAR = 9.

<?php
$xml = '<?xml version="1.0" encoding="utf-8"?>
<a>
    <b>' . chr(0xfd) . chr(0xff) . '</b>
</a>';
libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadXML($xml);

foreach (libxml_get_errors() as $error) {
    print_r($error);
}
libxml_clear_errors();

prints

LibXMLError Object
(
    [level] => 3
    [code] => 9
    [column] => 5
    [message] => Input is not proper UTF-8, indicate encoding !
Bytes: 0xFD 0xFF 0x3C 0x2F

    [file] => 
    [line] => 3
)
VolkerK
Thanks! Unfortunately, this only reports the first invalid character and never recovers from the error, so it doesn't report the rest of the errors.
notnoop
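To report *every* bad position rather than stopping at the first, a byte-by-byte scan works (a sketch, not from any of the answers; like the regex above, it validates structure only and does not reject overlong forms or surrogates):

```php
<?php
// Return the offset of every byte that can't start or continue a valid
// UTF-8 sequence. On a bad sequence, record its start and resync.
function invalid_utf8_offsets($s) {
    $bad = array();
    $len = strlen($s);
    for ($i = 0; $i < $len; ) {
        $b = ord($s[$i]);
        if ($b < 0x80)     { $need = 0; }                  // ASCII
        elseif ($b < 0xC0) { $bad[] = $i++; continue; }    // stray continuation byte
        elseif ($b < 0xE0) { $need = 1; }                  // 2-byte lead
        elseif ($b < 0xF0) { $need = 2; }                  // 3-byte lead
        elseif ($b < 0xF8) { $need = 3; }                  // 4-byte lead
        else               { $bad[] = $i++; continue; }    // 0xF8-0xFF are never valid
        // A lead byte must be followed by $need continuation bytes (0x80-0xBF).
        for ($j = 1; $j <= $need; $j++) {
            if ($i + $j >= $len || (ord($s[$i + $j]) & 0xC0) != 0x80) {
                $bad[] = $i;   // record the sequence start, then resync
                $i += $j;
                continue 2;
            }
        }
        $i += $need + 1;
    }
    return $bad;
}

// Reports offsets 2 and 5 for the two bad bytes:
print_r(invalid_utf8_offsets("ok\xf8ok\xe9"));
```

Each reported offset could then be repaired individually, e.g. by re-encoding the surrounding fragment with `utf8_encode` as suggested above.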