A server I can't influence sends very broken XML.

Specifically, a Unicode WHITE STAR (U+2606) gets encoded as UTF-8 (E2 98 86) and the result is then translated byte-by-byte through a Latin-1-to-HTML-entity table. What I get is &acirc; followed by the raw bytes 98 86 (9 bytes in total) in a file that's declared as utf-8 and has no DTD.
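The double encoding can be reproduced in a few lines; this is a sketch of what the server presumably does, not its actual code (Python 3 names; on Python 2 the module is htmlentitydefs):

```python
# Reproduce the corruption: U+2606 WHITE STAR is encoded as UTF-8, then
# each byte is misread as Latin-1 and looked up in an entity table --
# only 0xE2 has a named entity (&acirc;), the other two bytes pass through raw.
from html.entities import codepoint2name

utf8 = '\u2606'.encode('utf-8')          # b'\xe2\x98\x86'
mangled = ''.join(
    '&%s;' % codepoint2name[b] if b in codepoint2name else chr(b)
    for b in utf8
)
print(repr(mangled))  # '&acirc;\x98\x86' -- the 9 bytes from the question
```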

I couldn't configure W3C tidy in a way that doesn't garble this irreversibly. I only found how to make lxml skip it silently. SAX uses Expat, which cannot recover after encountering this. I'd like to avoid BeautifulSoup for speed reasons.

What else is there?

+2  A: 

BeautifulSoup is your best bet in this case. I suggest profiling before ruling out BeautifulSoup altogether.

Manoj Govindan
And the link is: http://www.crummy.com/software/BeautifulSoup/
Adrian Petrescu
"[...] you don't really care what HTML is supposed to look like. Neither does this parser." :-)
leoluk
I did, and it's orders of magnitude slower than the lxml.objectify I'm using now (accepting a few broken strings in the UI)
Tobias
A: 

Maybe something like:

import re
import htmlentitydefs as ents
from lxml import etree  # or maybe lxml.html, if the input is still more broken

def repl_ent(m):
    # Look up the name between '&' and ';' in the entity table;
    # leave the match untouched if it is not a known entity.
    return ents.entitydefs.get(m.group()[1:-1], m.group())

goodxml = re.sub(r'&\w+;', repl_ent, badxml)
etree.fromstring(goodxml)
Steven D. Majewski
You need to remove the five predefined XML entities (lt, gt, amp, quot, apos) from htmlentitydefs first, or you'll unescape < and > and break the markup.
Tobias
As I said, I'm reluctant to do this because it looks like the server only entity-encodes the contents of one specific tag.
Tobias
The problem is that I don't think you can do this from SAX or a SAX filter, so you would have to drop down to the XMLReader interface, where you would end up doing something similar to the above. (The Java parser API has an optional feature that tells it to try to continue after a fatal error, so it might be possible to fix the input and carry on, but I don't know whether that can be done in Python. If it can, it's probably a more complicated procedure than the above. Are there any hooks in lxml that can do this?)
Steven D. Majewski
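As for hooks in lxml: etree does expose a recovering parser, though, as the question notes, it repairs by silently dropping what it cannot parse rather than by handing the bad span to user code. A sketch:

```python
from lxml import etree

# recover=True makes libxml2 continue past fatal errors such as an
# undefined entity; the offending reference is dropped, not repaired.
parser = etree.XMLParser(recover=True)
root = etree.fromstring(b'<r>&foo; ok</r>', parser)
print(root.tag, repr(root.text))
```

This confirms the behaviour the question complains about: parsing succeeds, but the undefined entity is simply gone from the text.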