ansaurus

Question

Need to selectively escape HTML entities (&)

Answer 1

+1 A:

i would rather not just hack it and straight replace the "&"s

Er, why? That's what cgi.escape is doing - effectively just a search and replace operation for certain characters that have to be escaped.

If you only want to replace a single character, just replace the single character:

yourstring.replace('&', '&amp;')

Don't beat around the bush.
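
A minimal sketch of the difference between the two approaches, assuming Python 3, where html.escape is the replacement for the older cgi.escape. The sample string `raw` is made up for illustration:

```python
import html

raw = 'Fish & Chips <b>"fresh"</b>'

# Escape only the ampersand and leave every other character untouched.
only_amp = raw.replace('&', '&amp;')

# Escape &, <, > and (by default) both kinds of quotes in one call.
everything = html.escape(raw)
```

If the input may already contain entities such as &amp;, either call will escape them a second time.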

Amber 2010-08-04 06:43:50

Answer 2

A:

Use this: htmlstring.replace('&', '&amp;')

Srinivas Reddy Thatiparthy 2010-08-04 06:45:09

sje397 2010-08-04 06:46:21

Answer 3

A:

If you want to make sure that you don't accidentally re-escape an already escaped & (i.e. not transform &amp; into &amp;amp; or &szlig; into &amp;szlig;), you could use a negative lookahead:

import re
newstring = re.sub(r"&(?![A-Za-z])", "&amp;", oldstring)

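This will leave an & alone when it is followed by a letter. Note that numeric references such as &#38; are still re-escaped, and a bare & directly before a letter (as in AT&T) is left unescaped.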
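
A stricter variant, offered as a sketch and not as part of the original answer, skips an & only when it begins a complete entity reference (named, decimal, or hexadecimal, terminated by a semicolon). The names `BARE_AMP` and `escape_bare_ampersands` are hypothetical:

```python
import re

# Matches an & that does NOT start a complete entity reference:
# a named one (&amp;), a decimal one (&#38;) or a hex one (&#x26;).
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def escape_bare_ampersands(text):
    """Escape only ampersands that are not already part of an entity."""
    return BARE_AMP.sub("&amp;", text)
```

The pattern checks only the shape of the reference, not whether the name is a real HTML entity, so something like &foo; is also left alone.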

Tim Pietzcker 2010-08-04 06:53:29

Answer 4

+1 A:

For scraping, try to use a library that can handle such HTML "tag soup", like lxml, which has an HTML parser (as well as a dedicated HTML package in lxml.html), or BeautifulSoup. Besides coping with ill-formed documents, these libraries include other things that make scraping and working with HTML easier: getting information out of forms, making hyperlinks absolute, using CSS selectors...

Steven 2010-08-04 09:00:26

Answer 5

A:

You shouldn't use an XML parser to parse data that isn't XML. Find an HTML parser instead; you'll be happier in the long run. The standard library has a few (HTMLParser and htmllib), and BeautifulSoup is a well-loved third-party package.
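
As a rough illustration, assuming Python 3 (where the HTMLParser module lives at html.parser and htmllib no longer exists), a tolerant parser can read text that mixes bare and escaped ampersands, and the result can then be escaped once, consistently. The class `TextCollector`, the helper `extract_text`, and the sample markup are hypothetical:

```python
from html import escape
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect text nodes; the parser decodes entity references."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def extract_text(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)

# Ill-formed input: one bare &, one escaped &amp;, and a stray end tag.
soup = "<p>Fish & Chips &amp; peas</b>"
text = extract_text(soup)
safe = escape(text, quote=False)
```

Because the parser decodes &amp; back to & before the text is escaped again, nothing ends up double-escaped.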

Ned Batchelder 2010-08-04 12:37:07
