ansaurus

Question

How to find/replace text in html while preserving html tags/structure

Answer 1

+7 A:

Use a DOM library, not regular expressions, to manipulate HTML:

lxml: a parser, document, and HTML serializer. It can also use BeautifulSoup and html5lib for parsing.
BeautifulSoup: a parser, document, and HTML serializer.
html5lib: a parser. It has a serializer.
ElementTree: a document object, and XML serializer
cElementTree: a document object implemented as a C extension.
HTMLParser: a parser.
Genshi: includes a parser, document, and HTML serializer.
xml.dom.minidom: a document model built into the standard library, which html5lib can parse to.

Stolen from http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/.

Out of these I would recommend lxml, html5lib, and BeautifulSoup.
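To illustrate the DOM approach, here is a minimal sketch using xml.dom.minidom from the list above, chosen only because it ships with Python. It assumes the input is well-formed XML; real-world HTML would need one of the recommended parsers instead. The helper name replace_in_text_nodes is made up for this example.

```python
from xml.dom import minidom


def replace_in_text_nodes(node, old, new):
    """Recursively replace `old` with `new` in text nodes only."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            child.data = child.data.replace(old, new)
        else:
            # Elements are walked recursively; attributes are never touched.
            replace_in_text_nodes(child, old, new)


# Assumption: the fragment is well-formed XML with a single root element.
doc = minidom.parseString('<p class="overflow">stack <b>overflow</b></p>')
replace_in_text_nodes(doc.documentElement, "overflow", "underflow")
result = doc.documentElement.toxml()
print(result)  # <p class="overflow">stack <b>underflow</b></p>
```

The class attribute keeps its value because only text nodes are rewritten, which is the main advantage over running a regex on the raw markup.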

meder 2009-12-06 17:46:11

Answer 2

+3 A:

Beautiful Soup or HTMLParser is your answer.
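A rough sketch of the HTMLParser route, assuming Python 3, where the module is called html.parser. The class and function names (TextReplacer, replace_text) are invented for illustration. Markup is re-emitted as it was read and only text is rewritten; note that text inside script and style elements also arrives through handle_data, so a real implementation would probably want to skip those.

```python
from html.parser import HTMLParser


class TextReplacer(HTMLParser):
    """Re-emit markup as read and apply a replacement to text only."""

    def __init__(self, old, new):
        # convert_charrefs=False keeps entities such as &amp; as separate events.
        super().__init__(convert_charrefs=False)
        self.old, self.new = old, new
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data.replace(self.old, self.new))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def replace_text(html, old, new):
    parser = TextReplacer(old, new)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(replace_text('<a class="overflow">overflow</a> overflow', "overflow", "underflow"))
# <a class="overflow">underflow</a> underflow
```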

duffymo 2009-12-06 17:46:16

Answer 3

+1 A:

Use an HTML parser such as the ones provided by lxml or BeautifulSoup. Another option is to use XSLT transformations (XSLT in Jython).

J.F. Sebastian 2009-12-06 17:56:00

Thanks, Robert.

J.F. Sebastian 2009-12-06 23:01:29

Answer 4

A:

I don't think that the DOM / HTML parser library recommendations posted so far address the specific problem in the given example: overflow should be replaced with underflow only when preceded by stack in the rendered document, whether or not there are tags between them. Such a library is a necessary part of the solution, though.

Assuming that tags never appear in the middle of words, one solution would be to

1. process the DOM, tokenize all text nodes and insert a unique identifier at the beginning of each token (e.g. word)
2. render the document as plain text
3. search and replace the plain text with regexes which use groups to match, preserve and mark unique identifiers at the beginning of each token
4. extract all tokens with marked unique identifiers from the plain text
5. process the DOM by removing unique identifiers and replacing tokens matching marked unique identifiers with corresponding changed tokens
6. render the processed DOM back to HTML

Example:

In 1. the HTML DOM,

stack <sometag>overflow</sometag>

becomes the DOM

#1;stack <sometag>#2;overflow</sometag>

and in 2. the plain text is produced:

#1;stack #2;overflow

The regex needed in 3. is #(\d+);stack\s+#(\d+);overflow\b and the replacement #\1;stack %\2;underflow. Note that only the second word is marked by changing # to % in the unique identifier, since the first word isn't altered.

In 4., the word underflow with the unique identifier numbered 2 is extracted from the resulting plain text since it was marked by changing the # to a %.

In 5., all #(\d+); identifiers are removed from text nodes of the DOM while looking up their numbers among extracted words. The number 1 is not found, so #1;stack is replaced with simply stack. The number 2 is found with the changed word underflow, so #2;overflow is replaced by underflow.

Finally in 6. the DOM is rendered back to the HTML document stack <sometag>underflow</sometag>.
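The six steps can be sketched in Python roughly as follows. This is an illustration under several assumptions, not a finished implementation: the input is well-formed XML parsed with xml.dom.minidom (real HTML would need a proper HTML parser), tokens are whitespace-delimited, the source text itself contains no sequences like #1; or %1;, and the function name replace_across_tags is made up.

```python
import itertools
import re
from xml.dom import minidom

# A token body: non-space characters, stopping before the next identifier.
WORD = r"((?:(?![#%]\d+;)\S)+)"


def text_nodes(node):
    """Yield all text nodes below `node` in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def replace_across_tags(xml, pattern, replacement):
    doc = minidom.parseString(xml)
    nodes = list(text_nodes(doc.documentElement))

    # Step 1: prefix every token with a unique identifier such as "#1;".
    ids = itertools.count(1)
    for node in nodes:
        node.data = re.sub(r"\S+", lambda m: f"#{next(ids)};{m.group(0)}", node.data)

    # Step 2: render the document as plain text.
    plain = "".join(node.data for node in nodes)

    # Step 3: search and replace; changed tokens are marked with "%".
    plain = re.sub(pattern, replacement, plain)

    # Step 4: collect the marked tokens by identifier number.
    changed = dict(re.findall(r"%(\d+);" + WORD, plain))

    # Step 5: strip identifiers in the DOM, swapping in changed tokens.
    for node in nodes:
        node.data = re.sub(
            r"#(\d+);" + WORD,
            lambda m: changed.get(m.group(1), m.group(2)),
            node.data,
        )

    # Step 6: render the DOM back to markup.
    return doc.documentElement.toxml()


print(replace_across_tags(
    "<p>stack <sometag>overflow</sometag></p>",
    r"#(\d+);stack\s+#(\d+);overflow\b",
    r"#\1;stack %\2;underflow",
))
# <p>stack <sometag>underflow</sometag></p>
```

The enclosing p element is there only because minidom needs a single root element.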

akaihola 2009-12-06 20:52:46

Answer 5

+1 A:

Note that arbitrary replacements can't be done unambiguously. Consider the following examples:

1)

HTML:

A<tag>B</tag>

Pattern -> replacement:

AB -> AXB

Possible results:

AX<tag>B</tag>
A<tag>XB</tag>

2)

HTML:

A<tag>A</tag>A

Pattern -> replacement:

A+ -> WXYZ

Possible results:

W<tag />XYZ
W<tag>X</tag>YZ
W<tag>XY</tag>Z
W<tag>XYZ</tag>
WX<tag />YZ
WX<tag>Y</tag>Z
WX<tag>YZ</tag>
WXY<tag />Z
WXY<tag>Z</tag>
WXYZ

What kind of algorithms work for your case depends highly on the nature of possible search patterns and desired rules for handling ambiguity.
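To make the ambiguity in example 2 concrete, the candidate results can be enumerated mechanically. The sketch below assumes a single tag that opens after at least one character of the replacement, closes at or after that point, or is dropped altogether; tag_placements is a hypothetical helper name.

```python
def tag_placements(replacement, tag="tag"):
    """List the ways one <tag> could be re-wrapped around part of `replacement`."""
    results = []
    n = len(replacement)
    for start in range(1, n):            # tag opens after at least one character
        for end in range(start, n + 1):  # tag closes at or after where it opens
            head = replacement[:start]
            inner = replacement[start:end]
            tail = replacement[end:]
            wrapped = f"<{tag}>{inner}</{tag}>" if inner else f"<{tag} />"
            results.append(head + wrapped + tail)
    results.append(replacement)          # last option: drop the tag entirely
    return results


for candidate in tag_placements("WXYZ"):
    print(candidate)
```

For WXYZ this produces the ten candidates listed above, in the same order.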

akaihola 2009-12-06 21:14:12

Answer 6

A:

Kastor 2010-05-15 08:55:58
