I'm scraping an HTML page, then using xml.dom.minidom.parseString() to create a DOM object.

However, the HTML page contains an '&'. I can use cgi.escape to convert it into &amp;, but that also converts all my HTML <> tags into &lt;&gt;, which makes parseString() unhappy.

How do I go about this? I would rather not just hack it and straight replace the "&"s.

thanks

+1  A: 

I would rather not just hack it and straight replace the "&"s

Er, why? That's what cgi.escape is doing - effectively just a search and replace operation for certain characters that have to be escaped.

If you only want to replace a single character, just replace the single character:

yourstring.replace('&', '&amp;')

Don't beat around the bush.
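
As a minimal sketch of that approach, assuming the scraped markup is well-formed apart from the bare "&" (the `raw` fragment below is a made-up example, not the asker's page):

```python
from xml.dom import minidom

# Hypothetical scraped fragment: well-formed apart from the bare "&".
raw = "<p>Fish & chips</p>"

# str.replace returns a new string, so keep the result.
escaped = raw.replace("&", "&amp;")

# parseString now accepts the input, and the parser decodes
# &amp; back to "&" in the resulting text node.
dom = minidom.parseString(escaped)
text = dom.documentElement.firstChild.data
print(text)
```

Note that this escapes every "&", including ones that already begin an entity such as &amp;.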

Amber
A: 

Use this: htmlstring.replace('&', '&amp;')

Srinivas Reddy Thatiparthy
sje397
A: 

If you want to make sure that you don't accidentally re-escape an already escaped & (i.e. that you don't transform &amp; into &amp;amp; or &szlig; into &amp;szlig;), you could use a negative lookahead:

import re
newstring = re.sub(r"&(?![A-Za-z])", "&amp;", oldstring)

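This leaves an & alone whenever it is followed by a letter. Note that it still escapes numeric references such as &#38;, and it does not escape a bare & that happens to precede a letter, as in AT&T.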
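
A stricter variant, offered only as a sketch: the lookahead can require a complete reference (a name or number followed by a semicolon) before it leaves the & alone. The names `BARE_AMP` and `escape_bare_ampersands` are made up for this illustration, and the pattern does not check that a named entity actually exists.

```python
import re

# Match "&" unless it starts a complete reference:
# a named one (&amp;), a decimal one (&#38;) or a hex one (&#x26;).
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def escape_bare_ampersands(text):
    """Escape only the ampersands that are not already part of a reference."""
    return BARE_AMP.sub("&amp;", text)

sample = "AT&T &amp; &#38; &#x26; &szlig; a & b"
result = escape_bare_ampersands(sample)
print(result)
```

Here the & in AT&T and the standalone & are escaped, while the four existing references pass through unchanged.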

Tim Pietzcker
+1  A: 

For scraping, try a library that can handle such HTML "tag soup", like lxml, which has an HTML parser (as well as a dedicated HTML package in lxml.html), or BeautifulSoup. Besides coping with ill-formed documents, these libraries include other things that make scraping and working with HTML easier: getting information out of forms, making hyperlinks absolute, using CSS selectors...

Steven
A: 

You shouldn't use an XML parser to parse data that isn't XML. Find an HTML parser instead; you'll be happier in the long run. The standard library has a few (HTMLParser and htmllib), and BeautifulSoup is a well-loved third-party package.
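
A rough sketch of the standard-library route, written for Python 3, where the HTMLParser module is named html.parser and htmllib no longer exists. The `TextCollector` class and the sample markup are assumptions made up for this illustration.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Hypothetical helper that gathers the text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text; a bare "&" arrives unchanged.
        self.chunks.append(data)

parser = TextCollector()
# The bare "&" and the unclosed <br> would both trip up an XML parser.
parser.feed("<p>Fish & chips<br>Tea & cake</p>")
parser.close()
print(parser.chunks)
```

No escaping step is needed here, because the HTML parser tolerates the bare "&" on its own.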

Ned Batchelder