I have an automatically generated XML file that is supposed to be encoded in UTF-8. For the most part, the encoding is correct. However, there are a few characters that are not encoded properly. When viewing the file in Emacs, I see \370 and \351.

Is there a way to detect these characters programmatically? I prefer solutions using PHP, but solutions in Perl or Java would be very helpful as well.

A: 

Are you absolutely certain that the encoding is incorrect? Rather than use Emacs, I'd use a binary file viewer. What are the actual bytes at the problematic position?

With Java it would be reasonably easy to detect invalid UTF-8 byte patterns. I'm not sure whether the default Charset support would handle it, but UTF-8 is pretty simple. I usually use the UTF-8 table here as a reference for valid byte sequences.

Jon Skeet
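Since the asker prefers PHP: assuming the mbstring extension is available, a quick whole-string validity check (a simpler alternative to hand-rolling the byte-pattern logic this answer describes) is `mb_check_encoding`:

```php
<?php
// mb_check_encoding() returns true only if the string is a valid
// byte sequence in the given encoding (requires ext/mbstring).
$good = "caf\xc3\xa9"; // "café" correctly encoded as UTF-8
$bad  = "caf\xe9";     // Latin-1 0xE9 — not valid UTF-8

var_dump(mb_check_encoding($good, 'UTF-8')); // bool(true)
var_dump(mb_check_encoding($bad, 'UTF-8'));  // bool(false)
```

This only tells you *whether* the string is valid, not *where* it breaks.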
+3  A: 

You can check for UTF-8-ness of a string with this regular expression:

/^(?:
    [\x00-\x7f] |
    [\xc0-\xdf][\x80-\xff] |
    [\xe0-\xef][\x80-\xff]{2} |
    [\xf0-\xf7][\x80-\xff]{3}
)*$/x
Martin v. Löwis
Thanks! Will test this, and wrap the value with `utf8_encode` if it fails the test.
notnoop
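In PHP, the pattern above can be wrapped in `preg_match` (a sketch; the helper name is made up). Note that this is a structural check only: it accepts a few sequences that strict UTF-8 forbids, such as overlong encodings, UTF-16 surrogates, and the lead bytes 0xF5–0xF7.

```php
<?php
// Structural UTF-8 check using the answer's pattern; the /x modifier
// lets us keep the whitespace and comments inside the pattern.
function looks_like_utf8($s) {
    return (bool) preg_match(
        '/^(?:
            [\x00-\x7f]               |  # ASCII
            [\xc0-\xdf][\x80-\xff]    |  # 2-byte sequence
            [\xe0-\xef][\x80-\xff]{2} |  # 3-byte sequence
            [\xf0-\xf7][\x80-\xff]{3}    # 4-byte sequence
        )*$/x',
        $s
    );
}

var_dump(looks_like_utf8("caf\xc3\xa9")); // bool(true)
var_dump(looks_like_utf8("\xf8\xe9"));    // bool(false) — the bytes Emacs showed
```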
A: 

You can use libxml_use_internal_errors and libxml_get_errors to loop through the errors that occurred when the document was loaded. The error code you're looking for is XML_ERR_INVALID_CHAR = 9.

<?php
$xml = '<?xml version="1.0" encoding="utf-8"?>
<a>
    <b>' . chr(0xfd) . chr(0xff) . '</b>
</a>';
libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadXML($xml);

foreach (libxml_get_errors() as $error) {
    print_r($error);
}
libxml_clear_errors();

prints

LibXMLError Object
(
    [level] => 3
    [code] => 9
    [column] => 5
    [message] => Input is not proper UTF-8, indicate encoding !
Bytes: 0xFD 0xFF 0x3C 0x2F

    [file] => 
    [line] => 3
)
VolkerK
Thanks! Unfortunately, this only reports the first invalid character and never recovers from the error, so it doesn't report the rest of the errors.
notnoop
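To report *every* bad position rather than stopping at the first, a byte-by-byte scan works (a sketch, not from any of the answers; like the regex above, it validates structure only and does not reject overlong forms or surrogates):

```php
<?php
// Return the offset of every byte that can't start or continue a valid
// UTF-8 sequence. On a bad sequence, record its start and resync.
function invalid_utf8_offsets($s) {
    $bad = array();
    $len = strlen($s);
    for ($i = 0; $i < $len; ) {
        $b = ord($s[$i]);
        if ($b < 0x80)     { $need = 0; }                  // ASCII
        elseif ($b < 0xC0) { $bad[] = $i++; continue; }    // stray continuation byte
        elseif ($b < 0xE0) { $need = 1; }                  // 2-byte lead
        elseif ($b < 0xF0) { $need = 2; }                  // 3-byte lead
        elseif ($b < 0xF8) { $need = 3; }                  // 4-byte lead
        else               { $bad[] = $i++; continue; }    // 0xF8-0xFF are never valid
        // A lead byte must be followed by $need continuation bytes (0x80-0xBF).
        for ($j = 1; $j <= $need; $j++) {
            if ($i + $j >= $len || (ord($s[$i + $j]) & 0xC0) != 0x80) {
                $bad[] = $i;   // record the sequence start, then resync
                $i += $j;
                continue 2;
            }
        }
        $i += $need + 1;
    }
    return $bad;
}

// Reports offsets 2 and 5 for the two bad bytes:
print_r(invalid_utf8_offsets("ok\xf8ok\xe9"));
```

Each reported offset could then be repaired individually, e.g. by re-encoding the surrounding fragment with `utf8_encode` as suggested above.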