ansaurus

Question

Python: How to extract XML embedded in an HTML file?

Answer 1

+1 A:

Try using the BeautifulStoneSoup part of the BeautifulSoup library, which is designed for XML.

Amber 2010-04-26 22:35:23

Answer 2

A:

Perhaps lxml would work, although I've never used it myself so I don't know how easy/complicated it would be to do what you want.

David Zaslavsky 2010-04-26 22:36:29

Answer 3

A:

(Ugh! Why do so many authors seem to think <textarea> content doesn't need HTML-escaping? Fools!)

Unfortunately BeautifulSoup 3.1 is not applying the (incorrect but common) browser-fixup of treating < and & characters inside <textarea> as text, and is instead creating real XML elements.

BeautifulSoup 3.0 copes with it OK, though. The difference arises because 3.1 switched its underlying parser from SGMLParser to HTMLParser.

bobince 2010-04-26 22:42:08

Answer 4

A:

Well, I just tried BeautifulSoup 3.0, and it doesn't work for me:

xml = '<samlp:Response xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol"></samlp:Response>'
print BeautifulSoup.BeautifulStoneSoup(xml)
<samlp:response xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol"

You will notice that the soup has changed Response to response.
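
For comparison, here is a minimal sketch of a case-preserving parse. It assumes Python 3 and the standard library's xml.etree.ElementTree, which nobody in this thread actually tried, so treat it as an illustration rather than a tested fix for the original page:

```python
import xml.etree.ElementTree as ET

# Same SAML snippet as above, with a correctly spelled closing tag.
xml = ('<samlp:Response xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol">'
       '</samlp:Response>')

root = ET.fromstring(xml)

# ElementTree expands the prefix to "{namespace-uri}LocalName"
# and leaves the case of the local name untouched.
print(root.tag)
```

Unlike the soup, a real XML parser is case-sensitive, but it also needs well-formed input, so it only works once the XML has been cut out of the surrounding HTML.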

georgehu 2010-04-26 22:59:39

Answer 5

A:

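Finally, I found that pyparsing is the best tool for the task:

from pyparsing import makeHTMLTags, SkipTo

# Match an opening <textarea> tag, then capture everything up to the closing tag as "body"
aStart, aEnd = makeHTMLTags("textarea")
search = aStart + SkipTo(aEnd)("body") + aEnd

# doc holds the HTML page source; each match corresponds to one <textarea>
matches = search.searchString(doc)
saml_resp_str = matches[0].body
relay_state_str = matches[1].body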
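
If pyparsing is not available, the same idea (grab the raw text between the textarea tags, then hand it to an XML parser) can be sketched with the standard library alone. This is a different technique from the one above: it uses a regular expression, assumes Python 3, and runs against a made-up sample page. The field names SAMLResponse and RelayState and the value cookie:1234 are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample page: two <textarea> fields, the first holding unescaped XML.
doc = '''<html><body><form>
<textarea name="SAMLResponse"><samlp:Response xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol"></samlp:Response></textarea>
<textarea name="RelayState">cookie:1234</textarea>
</form></body></html>'''

# Non-greedy match from each opening <textarea ...> tag to the next closing tag.
# Assumes no attribute value contains a literal ">".
TEXTAREA_RE = re.compile(r"<textarea\b[^>]*>(.*?)</textarea\s*>",
                         re.IGNORECASE | re.DOTALL)

bodies = TEXTAREA_RE.findall(doc)
saml_resp_str, relay_state_str = bodies[0], bodies[1]

# The first body is raw XML, so a real XML parser can take over from here.
root = ET.fromstring(saml_resp_str)
print(root.tag)
print(relay_state_str)
```

The regex works here only because the page does not escape the textarea content. If a page did escape it properly, html.unescape would have to be applied to each body before parsing.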

georgehu 2010-04-27 21:53:22
