How can I make a pattern match so long as it's not inside of an HTML tag?

Here's my attempt below. Anyone have a better/different approach?

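import re

inputstr = 'mary had a <b class="foo"> little loomb</b>'

rx = re.compile('[aob]')
repl = 'x'

# Splitting on a capturing group keeps the tags in the result, so the
# pieces alternate: text, tag, text, tag, ...
tag_rx = re.compile(r'(<[^>]*>)')

pieces = []
for i, astr in enumerate(tag_rx.split(inputstr)):
    if i % 2 == 0:
        # Even indices are the text between tags; only these get substituted.
        astr = rx.sub(repl, astr)
    pieces.append(astr)

outputstr = ''.join(pieces)
print(outputstr)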

output:

mxry hxd x <b class="foo"> little lxxmx</b>

Notes:

  • The <[^>]*> pattern to match HTML tags is obviously flawed -- I wrote this quickly and didn't account for the possibility of angle brackets within quoted attributes (e.g. '<img alt="next >" />'). It doesn't account for <script> or <style> tags or comments, either.
+11  A: 

Since you are using Python anyway, I would have a look at Beautiful Soup, which is a Python HTML/XML parser. There are so many special cases and headaches in writing your own parser that it just isn't worth the effort. Your regular expression will grow unmanageably large and will still not yield the correct results in all cases.

Just use Beautiful Soup.

DrJokepu
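
For illustration, here is a minimal sketch of the parser-based approach. It is not from the answer above: it swaps in the standard library's html.parser for Beautiful Soup so that it runs without installing anything, and the names TextOnlySubber and sub_outside_tags are made up for this example. The idea is the same either way: let a parser find the text nodes, apply the substitution only there, and copy the markup through.

```python
import re
from html.parser import HTMLParser


class TextOnlySubber(HTMLParser):
    """Hypothetical helper: regex-substitute text nodes, copy markup through."""

    RAW_TEXT_TAGS = ("script", "style")

    def __init__(self, pattern, repl):
        # convert_charrefs=False reports entities such as &amp; as their own
        # events, so they can be written back out unchanged.
        super().__init__(convert_charrefs=False)
        self.rx = re.compile(pattern)
        self.repl = repl
        self.out = []
        self.raw_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as it appeared.
        self.out.append(self.get_starttag_text())
        if tag in self.RAW_TEXT_TAGS:
            self.raw_depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... />.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser reports only the (lowercased) name, so rebuild the tag.
        self.out.append("</%s>" % tag)
        if tag in self.RAW_TEXT_TAGS and self.raw_depth:
            self.raw_depth -= 1

    def handle_data(self, data):
        # Text between tags: substitute, unless it is script/style content.
        if self.raw_depth:
            self.out.append(data)
        else:
            self.out.append(self.rx.sub(self.repl, data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def sub_outside_tags(pattern, repl, html):
    """Apply re.sub(pattern, repl, ...) to the text outside tags only."""
    parser = TextOnlySubber(pattern, repl)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(sub_outside_tags('[aob]', 'x', 'mary had a <b class="foo"> little loomb</b>'))
```

With the question's input this prints the same result as the split-based attempt, mxry hxd x <b class="foo"> little lxxmx</b>, and it also leaves a tag like '<img alt="next >" />' intact. Note that end tags are rebuilt from the parsed name, so the output is not guaranteed to be byte-for-byte identical to unusual input markup.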