I have a PHP script that parses a huge XML file using XMLReader. During parsing, I get this encoding error:

Input is not proper UTF-8, indicate encoding ! Bytes: 0xA0 0x32 0x36 0x30

I would like to know if there is a way to skip records with bad characters.

Thanks!

+1  A: 

First of all, make sure that your XML file is indeed UTF-8 encoded. If it is not, specify the actual encoding as the second parameter to XMLReader::open().
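As a minimal sketch of that second parameter, assuming the file turns out to be ISO-8859-1 (the temporary sample file and the chosen encoding are assumptions for illustration, not something known about your data):

```php
<?php
// Hypothetical sample: a Latin-1 file with no encoding declaration that
// contains the byte 0xA0 (a non-breaking space in ISO-8859-1).
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, "<?xml version=\"1.0\"?><r><v>\xA0260</v></r>");

$reader = new XMLReader();
// The second argument overrides the encoding libxml would otherwise assume (UTF-8).
$reader->open($path, 'ISO-8859-1');

$values = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::TEXT) {
        $values[] = $reader->value; // XMLReader hands strings back as UTF-8
    }
}
$reader->close();
unlink($path);
```

If the guess about the encoding is right, the reader should get past the 0xA0 byte instead of stopping on it.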

If the encoding error is due to a genuinely malformed byte sequence in a UTF-8 document, and if you're using PHP > 5.2.0, you could pass LIBXML_NOERROR and/or (depending on the error level) LIBXML_NOWARNING as a bitmask in the third parameter of XMLReader::open():

$xml = new XMLReader(); 
$xml->open('myxml.xml', null, LIBXML_NOERROR | LIBXML_NOWARNING);

If you're using PHP > 5.1.0, you can also tweak the libxml error handling:

// enable user error handling
libxml_use_internal_errors(true);
/* ... do your XML processing ... */
$errors = libxml_get_errors();
foreach ($errors as $error) {
    // handle errors here
}
libxml_clear_errors();

I don't know whether these two workarounds actually allow XMLReader to continue reading after an error or whether they only suppress the error output. But it's worth a try.


Responding to comment:

libXML defines XML_PARSE_RECOVER (1) but ext/libxml does not expose this constant as a PHP constant. Perhaps it's possible to pass the integer value 1 to the $options parameter.

$xml = new XMLReader(); 
$xml->open('myxml.xml', null, LIBXML_NOERROR | LIBXML_NOWARNING | 1);
Stefan Gehrig
I tried calling libxml_use_internal_errors(true) before processing my XML file and adding the "LIBXML_NOERROR | LIBXML_NOWARNING" mask to XMLReader::open(). This is very helpful, but parsing still stops when an encoding error is found. Do you know if there is any way to tell libxml to continue parsing when an error is found?
Michael Alves
Edited answer regarding the comment.
Stefan Gehrig
I tried passing the integer value 1 in the $options parameter, but the behavior did not change. Parsing still stops when an encoding error is found.
Michael Alves
And you're sure that the XML file is UTF-8 encoded and that the byte sequence encountered by XMLReader is really just erroneous?
Stefan Gehrig
I can't be sure, because the file is very large (> 1 GB) and it is generated by a client.
Michael Alves
Doesn't the XML file have an XML declaration specifying its encoding? As Alan M wrote, the byte sequence would be perfectly OK in ISO-8859-1. I think you have to go the other way round and check which encoding is actually used...
Stefan Gehrig
A: 

If your XML file has a really simple structure, you may "prefilter" it to get rid of (or, even better, correct) the bad records.

Read it record by record and write out a filtered XML file, then process the filtered file.
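A rough sketch of such a prefilter, assuming one record per line (a hypothetical layout; adapt the record splitting to your file) and that the mbstring extension is available. The function name prefilterXml is made up for this example:

```php
<?php
// Copy $in to $out line by line, keeping only lines that are valid UTF-8.
// Assumes one record per line; returns the number of lines dropped.
function prefilterXml(string $in, string $out): int
{
    $src = fopen($in, 'rb');
    $dst = fopen($out, 'wb');
    $dropped = 0;
    while (($line = fgets($src)) !== false) {
        if (mb_check_encoding($line, 'UTF-8')) {
            fwrite($dst, $line);
        } else {
            $dropped++; // or repair the line here instead of skipping it
        }
    }
    fclose($src);
    fclose($dst);
    return $dropped;
}
```

Because it streams the file, memory use stays flat even for a file larger than 1 GB. Calling prefilterXml('myxml.xml', 'filtered.xml') and then opening 'filtered.xml' with XMLReader would be the intended usage.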

Csaba Kétszeri
+2  A: 

I would listen to what XMLReader is telling you. Remember that many encodings are supersets of ASCII, so (for example) UTF-8 and ISO-8859-1 are identical to ASCII for the first 128 code points. It may well be that your file is really encoded as ISO-8859-1, but almost all of the characters in it are from the lower, ASCII half of that character set. In that case, the error would be yours for letting the parser fall back to XML's default encoding, UTF-8.

In ISO-8859-1 the byte sequence 0xA0 0x32 0x36 0x30 is perfectly valid: a non-breaking space followed by '2', '6', '0'.
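A quick way to check this for the four bytes from the error message, assuming the mbstring extension is available (the variable names are just for illustration):

```php
<?php
$bytes = "\xA0260"; // the bytes 0xA0 0x32 0x36 0x30 from the error message

// 0xA0 is a continuation byte, so it cannot start a UTF-8 sequence.
$isUtf8 = mb_check_encoding($bytes, 'UTF-8');

// Interpreted as ISO-8859-1, the same bytes convert cleanly to UTF-8.
$asUtf8 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

var_dump($isUtf8);          // bool(false)
var_dump(bin2hex($asUtf8)); // string(10) "c2a0323630"
```

The non-breaking space becomes the two-byte sequence 0xC2 0xA0 in UTF-8, and the digits are unchanged.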

Alan Moore
A: 
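$xml = file_get_contents('myxml.xml');
// Replace control characters with spaces. Caveat: with the /u modifier,
// preg_replace() returns null if $xml is not valid UTF-8.
$xml = preg_replace('/[\x00-\x1f\x7f-\x9f]/u', ' ', $xml);
// parse $xml below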

bandw