I am looking for a regex that replaces a given string in an HTML page, but only if the string is not part of a tag itself (for example, inside an attribute) and does not appear as text inside a link or a heading.

Examples:

Looking for 'replace_me'

<p>You can replace_me just fine</p> OK

<a href='replace_me'>replace_me</a> no match

<h3>replace_me</h3> no match

<a href='/test/'><span>replace_me</span></a> no match

<p style="background:url('replace_me')">replace_me<h1>replace_me</h1></p> first no match, second OK, third no match

Thanks in advance!

UPDATE:

I have found a working regex:

\b(replace_me)\b(?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)(?![^<>]*>)
A: 

Parsing HTML with regex is a Bad Idea that will drive you insane. A targeted search-and-replace like this is probably not quite as bad, but there are a few things to think about whichever approach you take:

  1. How many of these are there in a page?
  2. How many pages will you be doing this to?
  3. Will you be hand-checking the output, or is it automated?
  4. Which programming language(s) are you using for this?

I think the best way is not a "simple" (read: horrendously complicated) regex but a proper program with some logic behind it. Unless regular expressions are Turing-complete and someone else can provide a regex that does what you want, of course :)
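To make "a proper program" a bit more concrete, here is a minimal sketch in Python. The question never says which language is in use, so Python and its standard-library `html.parser` are assumptions here. The class name `TextReplacer` and the function `replace_outside_links_and_headings` are made up for illustration. The idea is to walk the page token by token, keep a counter of how deep we are inside `<a>` or `<h1>`..`<h6>`, and only touch text nodes when that counter is zero. Tags and attributes are copied through unchanged.

```python
import re
from html.parser import HTMLParser

# Tags whose text content must be left alone (links and headings).
SKIP_TAGS = {"a", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextReplacer(HTMLParser):
    """Hypothetical helper: re-emits the HTML, replacing `target` in text nodes only."""

    def __init__(self, target, replacement):
        # convert_charrefs=False so entities reach us untouched and can be re-emitted.
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(r"\b" + re.escape(target) + r"\b")
        self.replacement = replacement
        self.skip_depth = 0  # > 0 while inside a link or heading
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        # Copy the tag exactly as written, so attribute values are never modified.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # A function as replacement avoids backslash interpretation.
            data = self.pattern.sub(lambda m: self.replacement, data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def replace_outside_links_and_headings(html, target, replacement):
    parser = TextReplacer(target, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    page = "<p style=\"background:url('replace_me')\">replace_me<h1>replace_me</h1></p>"
    print(replace_outside_links_and_headings(page, "replace_me", "NEW"))
```

This is only a sketch: it does not repair broken markup, and it rewrites end tags in lower case. It does keep the "am I inside a link or heading?" logic in plain code rather than in nested lookaheads, which is the point being made above.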

Alphax
Vladimir
A: 
\b(replace_me)\b(?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)(?![^<>]*>)
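
For anyone who wants to try the pattern, here is a small harness that runs it over the five examples from the question. Python's `re` module is an assumption (the thread does not name a language), and the names `PATTERN`, `EXAMPLES` and `replace_all` are made up for this sketch. The first lookahead rejects a hit when the next tag starting with `h` or `a` on the same line is a closing tag, which means we are still inside a link or heading. The second lookahead rejects a hit that sits inside a tag, such as an attribute value.

```python
import re

# The regex from the post, split across lines for readability only.
PATTERN = re.compile(
    r"\b(replace_me)\b"
    r"(?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)"  # not followed by a closing </a...> or </h...> first
    r"(?![^<>]*>)"                            # not inside a tag (e.g. an attribute value)
)

# The five examples from the question.
EXAMPLES = [
    "<p>You can replace_me just fine</p>",
    "<a href='replace_me'>replace_me</a>",
    "<h3>replace_me</h3>",
    "<a href='/test/'><span>replace_me</span></a>",
    "<p style=\"background:url('replace_me')\">replace_me<h1>replace_me</h1></p>",
]


def replace_all(html, replacement="NEW"):
    # A function as replacement avoids backslash interpretation.
    return PATTERN.sub(lambda m: replacement, html)


if __name__ == "__main__":
    for html in EXAMPLES:
        print(replace_all(html))
```

Two caveats worth checking against your own pages: `[ha]` matches any tag whose name starts with `h` or `a`, so text followed by something like `</article>` on the same line is skipped as well, and `.` does not cross line breaks by default, so a link or heading whose closing tag is on a later line may not be detected.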
Vladimir