I use regexps to transform text, but I want to preserve the HTML tags.
For example, if I want to replace "stack overflow" with "stack underflow", then for the input
stack <sometag>overflow</sometag>
I must obtain
stack <sometag>underflow</sometag>
(i.e. the string substitution is done, but the tags are still there).
Use a DOM library, not regular expressions, when dealing with manipulating HTML:
- lxml: a parser, document, and HTML serializer. It can also use BeautifulSoup and html5lib for parsing.
- BeautifulSoup: a parser, document, and HTML serializer.
- html5lib: a parser. It has a serializer.
- ElementTree: a document object, and XML serializer
- cElementTree: a document object implemented as a C extension.
- HTMLParser: a parser.
- Genshi: includes a parser, document, and HTML serializer.
- xml.dom.minidom: a document model built into the standard library, which html5lib can parse to.
Stolen from http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/.
Out of these I would recommend lxml, html5lib, and BeautifulSoup.
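As a minimal sketch of what "use a DOM library" means here, the following uses the standard library's ElementTree (chosen only so the example has no third-party dependencies; lxml exposes a very similar API). The function name `replace_in_text_nodes` is made up for illustration, and the input is assumed to be a well-formed fragment with a single root element. It replaces a string in text nodes only, so tag names and attribute values are never touched. It will not find a phrase that is split across elements.

```python
import xml.etree.ElementTree as ET


def replace_in_text_nodes(markup, old, new):
    """Replace `old` with `new` in text nodes only (hypothetical helper)."""
    root = ET.fromstring(markup)  # assumes well-formed markup with one root
    for el in root.iter():
        # ElementTree stores text in two places: el.text (before the first
        # child) and el.tail (after the element's closing tag).
        if el.text:
            el.text = el.text.replace(old, new)
        if el.tail:
            el.tail = el.tail.replace(old, new)
    return ET.tostring(root, encoding="unicode")


print(replace_in_text_nodes(
    '<p class="overflow">stack <b>overflow</b></p>', "overflow", "underflow"))
```

Note that the `class="overflow"` attribute is left alone, which a plain regex over the raw HTML would not guarantee.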
Use an HTML parser such as the ones provided by lxml or BeautifulSoup. Another option is to use XSLT transformations (XSLT in Jython).
I don't think that the DOM / HTML parser library recommendations posted so far address the specific problem in the given example: `overflow` should be replaced with `underflow` only when preceded by `stack` in the rendered document, whether or not there are tags between them. Such a library is a necessary part of the solution, though.
Assuming that tags never appear in the middle of words, one solution would be to
1. process the DOM, tokenize all text nodes and insert a unique identifier at the beginning of each token (e.g. word)
2. render the document as plain text
3. search and replace the plain text with regexes which use groups to match, preserve and mark unique identifiers at the beginning of each token
4. extract all tokens with marked unique identifiers from the plain text
5. process the DOM by removing unique identifiers and replacing tokens matching marked unique identifiers with corresponding changed tokens
6. render the processed DOM back to HTML
Example:
In step 1, the HTML DOM
stack <sometag>overflow</sometag>
becomes the DOM
#1;stack <sometag>#2;overflow</sometag>
and in step 2 the plain text is produced:
#1;stack #2;overflow
The regex needed in step 3 is `#(\d+);stack\s+#(\d+);overflow\b` and the replacement is `#\1;stack %\2;underflow`. Note that only the second word is marked by changing `#` to `%` in the unique identifier, since the first word isn't altered.
In step 4, the word `underflow` with the unique identifier numbered 2 is extracted from the resulting plain text, since it was marked by changing the `#` to a `%`.
In step 5, all `#(\d+);` identifiers are removed from the text nodes of the DOM while looking up their numbers among the extracted words. The number 1 is not found, so `#1;stack` is replaced with simply `stack`. The number 2 is found with the changed word `underflow`, so `#2;overflow` is replaced by `underflow`.
Finally, in step 6 the DOM is rendered back to the HTML document
stack <sometag>underflow</sometag>
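The six steps can be sketched roughly as follows. This is an illustration under several assumptions, not a finished implementation: it uses the standard library's ElementTree instead of a real HTML parser, so the input must be a well-formed fragment with one root element; tokens are whitespace-delimited; and the original text must not already contain anything that looks like `#12;` or `%12;`. The names `text_slots` and `replace_across_tags` are made up for this example.

```python
import itertools
import re
import xml.etree.ElementTree as ET


def text_slots(el):
    """Yield (element, attribute name) for each text node in document order."""
    yield el, "text"
    for child in el:
        yield from text_slots(child)
        yield child, "tail"


def replace_across_tags(markup, pattern, replacement):
    root = ET.fromstring(markup)
    slots = list(text_slots(root))

    # Step 1: prefix every whitespace-delimited token with "#<n>;".
    counter = itertools.count(1)
    for node, attr in slots:
        text = getattr(node, attr)
        if text:
            setattr(node, attr, re.sub(
                r"\S+", lambda m: f"#{next(counter)};{m.group()}", text))

    # Step 2: render the document as plain text.
    plain = "".join(root.itertext())

    # Step 3: the caller's regex rewrites words and marks changed ones
    # by turning "#" into "%" in their identifier.
    plain = re.sub(pattern, replacement, plain)

    # Step 4: collect the marked tokens. The lookahead stops a token at
    # whitespace, end of text, or the start of the next identifier.
    changed = {
        int(num): word
        for num, word in re.findall(r"%(\d+);(\S+?)(?=[#%]\d+;|\s|$)", plain)
    }

    # Step 5: strip identifiers in the DOM, swapping in changed tokens.
    def restore(match):
        return changed.get(int(match.group(1)), match.group(2))

    for node, attr in slots:
        text = getattr(node, attr)
        if text:
            setattr(node, attr, re.sub(r"#(\d+);(\S+)", restore, text))

    # Step 6: render the processed DOM back to markup.
    return ET.tostring(root, encoding="unicode")


print(replace_across_tags(
    "<p>stack <sometag>overflow</sometag></p>",
    r"#(\d+);stack\s+#(\d+);overflow\b",
    r"#\1;stack %\2;underflow",
))
```

With the regex and replacement from the example, the `<sometag>` element survives and only its text changes. A document where `overflow` is not preceded by `stack` comes back unchanged.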
Note that arbitrary replacements can't be done unambiguously. Consider the following examples:
1)
HTML:
A<tag>B</tag>
Pattern -> replacement:
AB -> AXB
Possible results:
AX<tag>B</tag>
A<tag>XB</tag>
2)
HTML:
A<tag>A</tag>A
Pattern -> replacement:
A+ -> WXYZ
Possible results:
W<tag />XYZ
W<tag>X</tag>YZ
W<tag>XY</tag>Z
W<tag>XYZ</tag>
WX<tag />YZ
WX<tag>Y</tag>Z
WX<tag>YZ</tag>
WXY<tag />Z
WXY<tag>Z</tag>
WXYZ
What kind of algorithms work for your case depends highly on the nature of possible search patterns and desired rules for handling ambiguity.
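To make the ambiguity in example 1 concrete, here is a small sketch that enumerates the candidate placements under one possible rule, which is my assumption and not something the question specifies: text that was outside the tag stays outside, and text that was inside stays inside. The helper name `candidate_splits` is hypothetical.

```python
def candidate_splits(outside, inside, replacement):
    """List every (outside, inside) split of `replacement` that keeps the
    original outside text as a prefix and the original inside text as a suffix."""
    return [
        (replacement[:i], replacement[i:])
        for i in range(len(replacement) + 1)
        if replacement[:i].startswith(outside) and replacement[i:].endswith(inside)
    ]


# Example 1: A<tag>B</tag> with AB -> AXB
for before_tag, in_tag in candidate_splits("A", "B", "AXB"):
    print(f"{before_tag}<tag>{in_tag}</tag>")
```

Even under this fairly strict rule both results from example 1 remain valid, so some extra tie-breaking policy (for instance "inserted text goes before the tag") still has to be chosen.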