I want to replace a string in an HTML page using JavaScript, but skip it when it is inside an HTML tag. For example:

<a href="google.com">visit google search engine</a>
you can search on google tatatata...

I want to replace google with <b>google</b>, but not inside the link, so the result should be:

<a href="google.com">visit google search engine</a>
you can search on <b>google</b> tatatata...

I tried with this one:

regex = new RegExp(">([^<]*)?(google)([^>]*)?<", 'i');
el.innerHTML =  el.innerHTML.replace(regex,'>$1<b>$2</b>$3<');

The problem is that I also get <b>google</b> inside the <a> tag:

<a href="google.com">visit <b>google</b> search engine</a>
you can search on <b>google</b> tatatata...

How can I fix this?

+5  A: 

You'd be better off using an HTML parser for this rather than a regex. I'm not sure it can be done 100% reliably with a regex.

Draemon
+1  A: 

You can't really do that: your "google" is always inside some tag, so either replace all occurrences or none.

skrat
+5  A: 

You may or may not be able to do this with a regexp. It depends on how precisely you can define the conditions. Saying you want the string replaced except if it's in an HTML tag is not narrow enough, since everything on the page is presumably within some HTML tag (BODY if nothing else).

It would probably work better to traverse the DOM tree for this instead of trying to use a regexp on the HTML.

jhurshman
I agree. Find all the text nodes in the DOM that contain the string. Keep a blacklist of tags that you **don't** want to replace the string in. Check if the text node is inside one of these tags. If not, do your replacement, otherwise leave it as is.
tvanfosson
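
A minimal sketch of the text-node approach described in the comment above. The names `SKIP_TAGS`, `collectTextNodes` and `boldWord` are made up for illustration, and the list of skipped tags is an assumption to adjust for your page. The collector only reads `nodeType`, `nodeName`, `nodeValue` and `childNodes`; the wrapping step needs a browser `document`.

```javascript
// Tags whose text should be left untouched (an assumed list, adjust as needed).
const SKIP_TAGS = new Set(['A', 'SCRIPT', 'STYLE', 'TEXTAREA']);

// Collect the text nodes under `root` that contain `word` (case-insensitive)
// and are not inside one of the skipped tags.
function collectTextNodes(root, word, found = []) {
  for (const node of root.childNodes) {
    if (node.nodeType === 3) { // text node
      if (node.nodeValue.toLowerCase().includes(word.toLowerCase())) {
        found.push(node);
      }
    } else if (node.nodeType === 1 && !SKIP_TAGS.has(node.nodeName.toUpperCase())) {
      collectTextNodes(node, word, found); // element node: descend
    }
  }
  return found;
}

// Browser-only: wrap every match in <b> by splitting the text node around it.
function boldWord(root, word) {
  for (let node of collectTextNodes(root, word)) {
    let idx;
    while ((idx = node.nodeValue.toLowerCase().indexOf(word.toLowerCase())) !== -1) {
      const match = node.splitText(idx);         // `match` now starts at the word
      const rest = match.splitText(word.length); // `rest` is the text after the word
      const b = document.createElement('b');
      match.parentNode.replaceChild(b, match);
      b.appendChild(match);
      node = rest;                               // keep searching in the remainder
    }
  }
}
```

In a page this would be called as `boldWord(el, 'google')`. Because the nodes are collected before anything is modified, the newly inserted `<b>` elements are not visited again.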
+1  A: 

Parsing HTML with a regular expression is not going to be easy for anything other than trivial cases, since HTML isn't regular.

For more details see this Stack Overflow question (and answers).

Brian Agnew
A: 

Well, since everything is part of a tag, your request makes no real sense. If it's just the <a /> tag, you might just check for that case, mainly by making sure there is a closing </a> tag before the next opening <a>.

Grubsnik
A: 

I think you're all missing the question here...

When he says inside the tag, he means inside the opening tag, as in the <a href="google.com"> tag. This is quite different from text inside, say, a <p> </p> tag pair or <body> </body>. While I don't have the answer yet, I'm struggling with this same problem and I know it has to be solvable using regex. Once I figure it out, I'll come back and post.

Mike
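
One way to illustrate the "inside the opening tag" reading is to split the markup into tags and text and touch only the text pieces. This is a rough sketch, not a parser: the function name `boldOutsideTags` and its `skipTags` parameter are invented here, and it assumes no `>` inside attribute values, no comments or scripts containing markup-like text, and a non-empty search word.

```javascript
// Bold `word` in text only, never inside a tag. Text inside elements named
// in `skipTags` (for example the link text of <a>) is also left alone.
function boldOutsideTags(html, word, skipTags = ['a']) {
  // Escape regex metacharacters so `word` is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const wordRe = new RegExp(escaped, 'gi');
  let depth = 0; // > 0 while inside an element listed in skipTags

  return html
    .split(/(<[^>]*>)/) // even indices hold text, odd indices hold tags
    .map((part, i) => {
      if (i % 2 === 0) {
        return depth > 0 ? part : part.replace(wordRe, '<b>$&</b>');
      }
      const m = /^<(\/?)([a-z][a-z0-9]*)/i.exec(part);
      if (m && skipTags.includes(m[2].toLowerCase()) && !part.endsWith('/>')) {
        depth = Math.max(0, depth + (m[1] ? -1 : 1));
      }
      return part; // tags are passed through unchanged
    })
    .join('');
}
```

Passing an empty `skipTags` array gives the narrower behaviour, where only the opening tag and its attributes are protected and the link text is still replaced.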
A: 

WORKAROUND

If you can't use an HTML parser, or you are quite confident about your HTML structure, try this:

  1. Do the "bad" replacement first.
  2. Repeatedly replace (<[^>]*)(<[^>]+>) with $1, as many times as you need.

It's a simple workaround, but works for me.

Cons? Well... you have to run the replace twice for the case ... ...>, as each pass removes only the first unwanted tag from every tag on the page.
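
The two steps above could be sketched like this. The helper names `naiveBold` and `stripTagsInsideTags` are hypothetical, and the loop simply repeats the cleanup until the string stops changing instead of guessing the number of passes. It inherits the limits of the workaround: a literal `<` in the page text can confuse it, and link text is still bolded.

```javascript
// Step 1: the "bad" replacement, which also hits attribute values.
function naiveBold(html, word) {
  return html.split(word).join('<b>' + word + '</b>');
}

// Step 2: drop any tag that ended up inside another tag.
// Each pass removes one nested tag per outer tag, so repeat until stable.
function stripTagsInsideTags(html) {
  const nested = /(<[^>]*)(<[^>]+>)/g;
  let previous;
  do {
    previous = html;
    html = html.replace(nested, '$1');
  } while (html !== previous);
  return html;
}

const page = '<a href="google.com">visit google search engine</a>\nyou can search on google';
console.log(stripTagsInsideTags(naiveBold(page, 'google')));
```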

[edit:] SOLUTION

Why not use jQuery: put the HTML code into the page and do something like this:

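$(containerOrSth).find('a').each(function () {
  if ($(this).children().length == 0) {
    // No child elements: the text can be rewritten directly.
    // Note: with a string pattern, replace() changes only the first occurrence.
    $(this).text($(this).text().replace('google', 'evil'));
  } else {
    // Here you have to take care of child tags, but you need to know where
    // to expect them (before or after the text). Comment for more help.
  }
});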
naugtur
Another con is that it's not a parser.
BalusC
Hey, I said "if you can't use a parser", so yes, it's not.
naugtur