A server I can't influence sends very broken XML.

Specifically, a Unicode WHITE STAR (U+2606) gets encoded as UTF-8 (E2 98 86) and the result is then translated byte-by-byte through a Latin-1-to-HTML-entity table. What I get is &acirc; followed by the raw bytes 98 86 (9 bytes in total) in a file that's declared as utf-8 and has no DTD.
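The double encoding can be reproduced in a few lines; this is a sketch of what the server presumably does, not its actual code (Python 3 names; on Python 2 the module is htmlentitydefs):

```python
# Reproduce the corruption: U+2606 WHITE STAR is encoded as UTF-8, then
# each byte is misread as Latin-1 and looked up in an entity table --
# only 0xE2 has a named entity (&acirc;), the other two bytes pass through raw.
from html.entities import codepoint2name

utf8 = '\u2606'.encode('utf-8')          # b'\xe2\x98\x86'
mangled = ''.join(
    '&%s;' % codepoint2name[b] if b in codepoint2name else chr(b)
    for b in utf8
)
print(repr(mangled))  # '&acirc;\x98\x86' -- the 9 bytes from the question
```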

I couldn't configure W3C tidy in a way that doesn't garble this irreversibly. I only found how to make lxml skip it silently. SAX uses Expat, which cannot recover after encountering this. I'd like to avoid BeautifulSoup for speed reasons.

What else is there?

+2  A: 

BeautifulSoup is your best bet in this case. I suggest profiling before ruling out BeautifulSoup altogether.

Manoj Govindan
And the link is: http://www.crummy.com/software/BeautifulSoup/
Adrian Petrescu
"[...] you don't really care what HTML is supposed to look like. Neither does this parser." :-)
leoluk
I did, and it's orders of magnitude slower than the lxml.objectify I'm using now (accepting a few broken strings in the UI)
Tobias
A: 

Maybe something like:

import re
import htmlentitydefs as ents
from lxml import etree  # or maybe lxml.html, if the input is still more broken

def repl_ent(m):
    # Look up the name between '&' and ';' in the entity table;
    # leave the match untouched if it is not a known entity.
    return ents.entitydefs.get(m.group()[1:-1], m.group())

goodxml = re.sub(r'&\w+;', repl_ent, badxml)
etree.fromstring(goodxml)
Steven D. Majewski
You need to remove the five predefined XML entities (lt, gt, amp, quot, apos) from htmlentitydefs first, or you'll unescape < and > and break the markup.
Tobias
As I said, I'm reluctant to do this because it looks like the server only entity-encodes the contents of one specific tag.
Tobias
The problem is that I don't think you can do this from SAX or a SAX filter, so you would have to drop down to the XMLReader interface, where you would end up doing something similar to the above. (The Java parser API has an optional feature that tells it to try to continue after a fatal error, so it might be possible to fix the input and carry on, but I don't know whether that can be done in Python. If it can, it's probably a more complicated procedure than the above. Are there any hooks in lxml that can do this?)
Steven D. Majewski
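As for hooks in lxml: etree does expose a recovering parser, though, as the question notes, it repairs by silently dropping what it cannot parse rather than by handing the bad span to user code. A sketch:

```python
from lxml import etree

# recover=True makes libxml2 continue past fatal errors such as an
# undefined entity; the offending reference is dropped, not repaired.
parser = etree.XMLParser(recover=True)
root = etree.fromstring(b'<r>&foo; ok</r>', parser)
print(root.tag, repr(root.text))
```

This confirms the behaviour the question complains about: parsing succeeds, but the undefined entity is simply gone from the text.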