I have an HTML file with an XML snippet embedded in it; the source is pasted on pastebin:

http://pastebin.com/Hy0QaWk8

My task is to extract the text enclosed in the first textarea, which is an XML snippet, from the HTML without any change to the original snippet. I can get it using BeautifulSoup, but it changes all the tag names to lower case.

+1  A: 

Try using the BeautifulStoneSoup part of the BeautifulSoup library, which is designed for XML.

Amber
A: 

Perhaps lxml would work, although I've never used it myself so I don't know how easy/complicated it would be to do what you want.

David Zaslavsky
A: 

(Ugh! Why do so many authors seem to think <textarea> content doesn't need HTML-escaping? Fools!)

Unfortunately BeautifulSoup 3.1 is not applying the (incorrect but common) browser-fixup of treating < and & characters inside <textarea> as text, and is instead creating real XML elements.

BeautifulSoup 3.0 copes with it OK though. Why there's a difference.

bobince
A: 

Well, I just tried BeautifulSoup 3.0, and it doesn't work for me:

import BeautifulSoup
xml = '<samlp:Response xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol"></samlp:Response>'
print BeautifulSoup.BeautifulStoneSoup(xml)
<samlp:response xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol"

Notice that the soup has changed Response to response.
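
For comparison, here is a minimal sketch of what a strict XML parser does with the same snippet. It uses the standard library's xml.etree.ElementTree rather than anything mentioned in this thread, and it is written for Python 3, unlike the Python 2 code above. It only shows that tag case survives parsing; it is not a way to pull the snippet out of the HTML page, and serializing the tree back out may not reproduce the original text exactly.

```python
import xml.etree.ElementTree as ET

# Same snippet as above, with a well-formed closing tag.
xml = '<samlp:Response xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol"></samlp:Response>'

root = ET.fromstring(xml)

# ElementTree expands the prefix into {namespace-uri}localname.
# The local name keeps its original capitalisation.
local_name = root.tag.rsplit("}", 1)[-1]
print(root.tag)
print(local_name)
```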

georgehu
A: 

In the end I found that pyparsing is the best tool for the task:

from pyparsing import makeHTMLTags, SkipTo

aStart, aEnd = makeHTMLTags("textarea")

# SkipTo captures everything up to </textarea> as raw text; doc is the HTML source string
search = aStart + SkipTo(aEnd)("body") + aEnd

saml_resp_str = search.searchString(doc)[0].body
relay_state_str = search.searchString(doc)[1].body
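
The same "start tag, skip to end tag" idea can also be sketched without pyparsing, using only the standard library's re module. This is a rough Python 3 sketch under stated assumptions: the sample page and the textarea names SAMLResponse and RelayState are made up for illustration (they are not the pastebin content), textarea elements are not nested, and no attribute value contains a ">" character.

```python
import re

# <textarea ...>raw content</textarea>; the tag name is matched
# case-insensitively and DOTALL lets the content span several lines.
TEXTAREA_RE = re.compile(
    r"<textarea\b[^>]*>(.*?)</textarea\s*>",
    re.IGNORECASE | re.DOTALL,
)

def textarea_bodies(html):
    """Return the raw, unmodified content of every <textarea> in html."""
    return TEXTAREA_RE.findall(html)

# Hypothetical sample page, for illustration only.
doc = """<html><body><form>
<textarea name="SAMLResponse"><samlp:Response xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol"></samlp:Response></textarea>
<textarea name="RelayState">token-123</textarea>
</form></body></html>"""

bodies = textarea_bodies(doc)
saml_resp_str = bodies[0]    # first textarea: the XML snippet
relay_state_str = bodies[1]  # second textarea
print(saml_resp_str)
```

Because the content is never parsed as markup, the tag names inside the textarea come back with their original case.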

georgehu