I use regexps to transform text, but I want to preserve the HTML tags. For example, if I want to replace "stack overflow" with "stack underflow", the substitution should still work when a tag sits between the words: if the input is stack <sometag>overflow</sometag>, I must obtain stack <sometag>underflow</sometag> (i.e. the string substitution is done, but the tags are still there...)

+7  A: 

Use a DOM library, not regular expressions, when manipulating HTML:

  • lxml: a parser, document, and HTML serializer. Also can use BeautifulSoup and html5lib for parsing.
  • BeautifulSoup: a parser, document, and HTML serializer.
  • html5lib: a parser. It has a serializer.
  • ElementTree: a document object, and XML serializer
  • cElementTree: a document object implemented as a C extension.
  • HTMLParser: a parser.
  • Genshi: includes a parser, document, and HTML serializer.
  • xml.dom.minidom: a document model built into the standard library, which html5lib can parse to.

Stolen from http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/.

Out of these I would recommend lxml, html5lib, and BeautifulSoup.
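As a rough illustration of the parser-based approach, here is a minimal sketch using HTMLParser from the list above (in Python 3 it lives in the standard library module html.parser). The class name TextNodeReplacer and the helper replace_in_text_nodes are made up for this example. It applies the regex to text nodes only and passes markup through, so tag names and attribute values are never rewritten.

```python
import re
from html.parser import HTMLParser


class TextNodeReplacer(HTMLParser):
    """Re-emit markup while applying a regex substitution to text nodes only."""

    def __init__(self, pattern, replacement):
        # convert_charrefs=False keeps entities such as &amp; out of handle_data,
        # so they are passed through unchanged by the handlers below.
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(pattern)
        self.replacement = replacement
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Copy the start tag exactly as it appeared in the input.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are also copied verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser lowercases tag names, so end tags are re-emitted in lowercase.
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Only text between tags is subject to the substitution.
        self.out.append(self.pattern.sub(self.replacement, data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def replace_in_text_nodes(html, pattern, replacement):
    parser = TextNodeReplacer(pattern, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Note that this works one text node at a time, so a pattern like stack\s+overflow will not match when a tag sits between the two words.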

meder
+3  A: 

Beautiful Soup or HTMLParser is your answer.

duffymo
+1  A: 

Use an HTML parser such as the ones provided by lxml or BeautifulSoup. Another option is to use XSLT transformations (XSLT in Jython).

J.F. Sebastian
Thanks, Robert.
J.F. Sebastian
A: 

I don't think that the DOM / HTML parser library recommendations posted so far address the specific problem in the given example: overflow should be replaced with underflow only when preceded by stack in the rendered document, whether or not there are tags between them. Such a library is a necessary part of the solution, though.

Assuming that tags never appear in the middle of words, one solution would be to

  1. process the DOM, tokenize all text nodes and insert a unique identifier at the beginning of each token (e.g. word)
  2. render the document as plain text
  3. search and replace the plain text with regexes which use groups to match, preserve and mark unique identifiers at the beginning of each token
  4. extract all tokens with marked unique identifiers from the plain text
  5. process the DOM by removing unique identifiers and replacing tokens matching marked unique identifiers with corresponding changed tokens
  6. render the processed DOM back to HTML

Example:

In 1. the HTML DOM,

stack <sometag>overflow</sometag>

becomes the DOM

#1;stack <sometag>#2;overflow</sometag>

and in 2. the plain text is produced:

#1;stack #2;overflow

The regex needed in 3. is #(\d+);stack\s+#(\d+);overflow\b and the replacement #\1;stack %\2;underflow. Note that only the second word is marked by changing # to % in the unique identifier, since the first word isn't altered.

In 4., the word underflow with the unique identifier numbered 2 is extracted from the resulting plain text since it was marked by changing the # to a %.

In 5., all #(\d+); identifiers are removed from text nodes of the DOM while looking up their numbers among extracted words. The number 1 is not found, so #1;stack is replaced with simply stack. The number 2 is found with the changed word underflow, so #2;overflow is replaced by underflow.

Finally in 6. the DOM is rendered back to the HTML document stack <sometag>underflow</sometag>.
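The six steps can be sketched roughly as follows. This is only an illustration under several assumptions: it uses xml.dom.minidom from the standard library as the DOM, so the input must be well-formed XML once wrapped in a root element; tokens are whitespace-delimited words; and the text must not already contain #n; or %n; sequences. The function name replace_across_tags is made up for the example.

```python
import itertools
import re
from xml.dom import minidom

TOKEN = re.compile(r"\S+")
IDENTIFIED = re.compile(r"#(\d+);(\S+)")
MARKED = re.compile(r"%(\d+);(\S+)")


def text_nodes(node):
    """Yield every text node below `node` in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def replace_across_tags(fragment, pattern, replacement):
    # Assumption: the fragment is well-formed XML once wrapped in a root element.
    doc = minidom.parseString(f"<root>{fragment}</root>")
    root = doc.documentElement
    nodes = list(text_nodes(root))

    # Step 1: prefix each token with a unique "#n;" identifier.
    ids = itertools.count(1)
    for node in nodes:
        node.data = TOKEN.sub(lambda m: f"#{next(ids)};{m.group(0)}", node.data)

    # Step 2: render the document as plain text.
    plain = "".join(node.data for node in nodes)

    # Step 3: search and replace; the replacement marks changed tokens with "%".
    plain = re.sub(pattern, replacement, plain)

    # Step 4: extract the marked tokens, keyed by identifier number.
    changed = {int(num): word for num, word in MARKED.findall(plain)}

    # Step 5: remove identifiers in the DOM, swapping in changed tokens.
    def restore(match):
        return changed.get(int(match.group(1)), match.group(2))

    for node in nodes:
        node.data = IDENTIFIED.sub(restore, node.data)

    # Step 6: serialize the children of the wrapper root back to markup.
    return "".join(child.toxml() for child in root.childNodes)
```

Called with the regex and replacement from step 3, replace_across_tags("stack <sometag>overflow</sometag>", r"#(\d+);stack\s+#(\d+);overflow\b", r"#\1;stack %\2;underflow") gives back the fragment with overflow changed to underflow and the tags intact. For real-world HTML you would swap minidom for one of the tolerant parsers recommended in the other answers.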

akaihola
+1  A: 

Note that arbitrary replacements can't be done unambiguously. Consider the following examples:

1)

HTML:

A<tag>B</tag>

Pattern -> replacement:

AB -> AXB

Possible results:

AX<tag>B</tag>
A<tag>XB</tag>

2)

HTML:

A<tag>A</tag>A

Pattern -> replacement:

A+ -> WXYZ

Possible results:

W<tag />XYZ
W<tag>X</tag>YZ
W<tag>XY</tag>Z
W<tag>XYZ</tag>
WX<tag />YZ
WX<tag>Y</tag>Z
WX<tag>YZ</tag>
WXY<tag />Z
WXY<tag>Z</tag>
WXYZ

What kind of algorithms work for your case depends highly on the nature of possible search patterns and desired rules for handling ambiguity.
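To make the ambiguity in example 2 concrete, here is a small sketch that enumerates the candidate results for a single element. The placement rule (the opening tag goes after at least one character and before the last position, plus one variant with the element dropped) is my reading of the list above, not a general rule, and the name tag_placements is made up for the example.

```python
def tag_placements(replacement, tag="tag"):
    """List the ways one element could be re-inserted into `replacement`."""
    n = len(replacement)
    results = []
    # Assumed rule: the element opens after at least one character
    # and no later than just before the last character.
    for start in range(1, n):
        for end in range(start, n + 1):
            inner = replacement[start:end]
            # An element with no content is written in self-closing form.
            element = f"<{tag}>{inner}</{tag}>" if inner else f"<{tag} />"
            results.append(replacement[:start] + element + replacement[end:])
    # Final candidate: the element is dropped entirely.
    results.append(replacement)
    return results
```

For the four-character replacement WXYZ this yields the ten candidates listed in example 2, in the same order. The count grows quadratically with the length of the replacement, which is why a rule for resolving the ambiguity has to be chosen up front.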

akaihola
A: 
Kastor