ansaurus

Question

Python expat can't parse an XML file with bad characters. How can I work around this?

Answer 1

+1 A:

I'm not sure whether the '�' characters were introduced by copy-pasting the string here, but if they are in the original data, it looks like a problem in the generator, which inserted \uFFFD characters. That character is:

"used to replace an incoming character whose value is unknown or unrepresentable in Unicode"

Quoted from: http://www.fileformat.info/info/unicode/char/fffd/index.htm

Workaround? Just an idea to build on:

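from xml.parsers.expat import ExpatError, ErrorString, errors

# Assumes f is the input file opened in binary mode, xp is a parser from
# xml.parsers.expat.ParserCreate(), and buf_size is the chunk size.
good = True
buf = None
while True:
    if good:
        buf = f.read(buf_size)
    else:
        # try again with cleaned buffer
        pass
    try:
        xp.Parse(buf, len(buf) == 0)
        if len(buf) == 0:
            break
        good = True
    except ExpatError as e:
        # e.code is numeric; the errors.XML_ERROR_* constants are message strings
        if ErrorString(e.code) == errors.XML_ERROR_BAD_CHAR_REF:
            # look at xp.ErrorByteIndex (or nearby)
            # for 0xEF 0xBF 0xBD (UTF-8 replacement char) and remove it
            good = False
        else:
            # other errors processing
            raise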

Or clean the input buffer instead, handling the corner cases (a partial sequence at the end of the buffer). I can't recall whether Python's expat lets you assign a custom error handler; that would make this easier.
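
The buffer-cleaning variant could look roughly like the sketch below. It is only an illustration under assumptions: the input is UTF-8 read in binary mode, the offending bytes are exactly the encoded U+FFFD (0xEF 0xBF 0xBD), and the names `clean_chunks`, `REPLACEMENT` and `buf_size` are made up for this example. A possibly incomplete sequence at the end of a chunk is held back and prepended to the next chunk.

```python
import io
from xml.parsers.expat import ParserCreate

REPLACEMENT = b"\xef\xbf\xbd"  # U+FFFD encoded as UTF-8


def clean_chunks(f, buf_size=64 * 1024):
    """Yield chunks read from f with every encoded U+FFFD removed (hypothetical helper)."""
    carry = b""
    while True:
        chunk = f.read(buf_size)
        if not chunk:
            # End of input: whatever was held back never became a full sequence.
            if carry:
                yield carry
            return
        data = carry + chunk
        # Hold back a trailing prefix of the sequence; the next chunk may complete it.
        if data.endswith(REPLACEMENT[:2]):
            keep = 2
        elif data.endswith(REPLACEMENT[:1]):
            keep = 1
        else:
            keep = 0
        carry = data[len(data) - keep:]
        cleaned = data[:len(data) - keep].replace(REPLACEMENT, b"")
        if cleaned:
            yield cleaned


# Small demonstration on in-memory data; a tiny buf_size forces a split sequence.
sample = io.BytesIO(b"<a>x\xef\xbf\xbdy</a>")
xp = ParserCreate()
text = []
xp.CharacterDataHandler = text.append
for chunk in clean_chunks(sample, buf_size=5):
    xp.Parse(chunk, False)
xp.Parse(b"", True)
print("".join(text))
```

For a real file you would pass a file object opened with mode "rb" instead of the `io.BytesIO` sample. Note that this removes the bytes everywhere, including inside text content, which may or may not be what you want.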

If I strip the '�' characters from your sample, it parses fine. \xd1 does not fail.

OSM data?

rados 2010-03-23 01:17:05

Yes, it's the OSM whole-earth dump. I'll try to turn your code into a generator, thanks!

culebrón 2010-03-23 06:33:48

I've noticed that xp.ErrorCode contains a numeric code, but XML_ERROR_BAD_CHAR_REF contains a string (Python 2.6). That's quite a headache if I want to check the error type: I'll need to compare strings, etc.
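
One way around the string comparison, sketched here under the assumption of Python 3.2 or later (it does not apply to 2.6): `xml.parsers.expat.errors.codes` maps each message string to its numeric code, so the numeric `code` attribute of the exception can be compared directly. The helper names `is_bad_char_ref` and `first_error` are invented for this example.

```python
from xml.parsers.expat import ParserCreate, ExpatError, errors

# Numeric code for the "bad character reference" error, looked up by its message string.
BAD_CHAR_REF = errors.codes[errors.XML_ERROR_BAD_CHAR_REF]


def is_bad_char_ref(exc):
    """Return True if exc (an ExpatError) reports an invalid character reference."""
    return exc.code == BAD_CHAR_REF


def first_error(data):
    """Parse data in one go and return the ExpatError raised, or None (hypothetical helper)."""
    try:
        ParserCreate().Parse(data, True)
    except ExpatError as exc:
        return exc
    return None
```

On older versions, `xml.parsers.expat.ErrorString(code)` goes the other way and turns the numeric code into the message string.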

culebrón 2010-03-23 07:52:57

Well... this doesn't work: by the time expat raises an error, it has already consumed the characters up to that point, and I can't see a way to get the index of the wrong character in `buf`. There's only `lineno` and `columnno`, and a character counter that counts _all_ characters in the file, not just those in `buf`.
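
A possible way to turn the whole-stream counter into a position inside the current `buf` is to keep a running total of the bytes already passed to `Parse` and subtract it from `xp.ErrorByteIndex`. The sketch below assumes bytes input; `locate_error` and the sample data are invented for illustration. It does not change the fact that the parser cannot continue after the error, so a retry would still need a fresh parser.

```python
from xml.parsers.expat import ParserCreate, ExpatError


def locate_error(chunks):
    """Feed a list of byte chunks to expat.

    Returns (chunk_number, offset_in_chunk) for the first error, or None
    if the document parses cleanly. Hypothetical helper for illustration.
    """
    xp = ParserCreate()
    consumed = 0  # bytes handed to Parse() before the current chunk
    for number, buf in enumerate(chunks):
        is_final = number == len(chunks) - 1
        try:
            xp.Parse(buf, is_final)
        except ExpatError:
            # ErrorByteIndex counts from the start of the whole stream.
            # A negative offset would mean the error lies in an earlier chunk.
            return number, xp.ErrorByteIndex - consumed
        consumed += len(buf)
    return None


chunks = [b"<a>ok", b"ab\xffcd</a>"]  # 0xFF is never valid in UTF-8
print(locate_error(chunks))
```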

culebrón 2010-03-23 08:13:45

Then try the other thing I posted: clean up the buffer after each read. Also check the end of the buffer for partial sequences; if there is one, move that part into 'remainder' storage and merge it with the next buffer you read. One more thing: does your input data really contain those '�' characters, or is expat reporting something else? I checked my planet-100129.osm with expat 2.0.1, but in a C++ app (only half an hour), and there were no character errors (the file was decompressed). Can you try this on an uncompressed file and see if you get the same errors?

rados 2010-03-23 18:31:06
