I need to write some code that will search and replace whole words in a string that are outside HTML tags. So if I have this string:

string content = "the brown fox jumped over <b>the</b> lazy dog over there";
string keyword = "the";

I need to do something like:

if (content.ToLower().Contains(keyword.ToLower()))
       content = content.Replace(keyword, String.Format("<span style=\"background-color:yellow;\">{0}</span>", keyword));

but I don't want to replace the "the" in the bold tags or the "the" in "there", just the first "the".

A: 

You'll need to give more details.

For example:

<p>the brown fox</p>

is technically inside HTML tags.

thedz
then I wouldn't want that "the". Just whole words outside HTML.
But *everything* is wrapped in HTML at some level. What does your source content look like?
thedz
+1  A: 

You can use this library to parse your HTML and replace only the words that are not inside any tag. To replace only the word "the" and not "there", use Regex.Replace with a word-boundary pattern such as @"\bthe\b" instead of String.Replace.
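For the whole-word half of the problem, a minimal sketch might look like the following. The class and method names are invented for illustration, and the highlight markup is only an assumption. Note that this still knows nothing about HTML: it would also match the "the" in <b>the</b>, so it only covers the "not there" requirement.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWordDemo
{
    // Wraps every whole-word, case-insensitive occurrence of keyword in a span.
    // Hypothetical helper: it is not HTML-aware and treats the input as plain text.
    public static string HighlightWholeWords(string content, string keyword)
    {
        // \b is a word boundary; Regex.Escape keeps keyword characters literal.
        string pattern = @"\b" + Regex.Escape(keyword) + @"\b";
        return Regex.Replace(
            content,
            pattern,
            m => "<span class=\"highlight\">" + m.Value + "</span>",
            RegexOptions.IgnoreCase);
    }
}
```

Using a MatchEvaluator rather than a replacement string keeps the original casing of each hit ("The" stays "The").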

ArsenMkrt
A: 

Try this:

content = Regex.Replace(content, "(?<!>)"
   + keyword
   + @"(?!(<|\w))", "<span blah...>" + keyword + "</span>");

Edit: I fixed the "these" case, but not the case where more than the keyword is wrapped in HTML, e.g., "<b>fox jumped over the lazy dog.</b>"
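To make the behaviour of that lookaround pattern concrete, here is a small sketch run against the question's sample string. The class name, method name and highlight markup are made up for illustration, and Regex.Escape is an addition of mine, not part of the snippet above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical helper wrapping the lookaround idea.
    public static string Highlight(string content, string keyword)
    {
        // (?<!>)   : the match must not come directly after '>'
        // (?!<|\w) : the match must not be directly followed by '<' or a word character
        string pattern = "(?<!>)" + Regex.Escape(keyword) + @"(?!<|\w)";
        return Regex.Replace(
            content,
            pattern,
            m => "<span class=\"highlight\">" + m.Value + "</span>");
    }
}
```

On "the brown fox jumped over <b>the</b> lazy dog over there" only the first "the" gets wrapped. On "<b>over the lazy</b>" the keyword is still wrapped even though it sits inside the bold tag, because the pattern only inspects the characters immediately next to the match.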

What you're asking for is going to be nearly impossible with RegEx and normal, everyday HTML, because to know if you're "inside" a tag, you would have to "pair" each start and end tag, and ignore tags that are intended to be self-closing (BR and IMG, for instance).

If this is merely eye candy for a web site, I suggest going the other route: fix your CSS so the SPAN you are adding only impacts the HTML outside of a tag.

For example:

content = content.Replace("the", "<span class=\"highlight\">the</span>");

Then, in your CSS:

span.highlight { background-color: yellow; }

b span.highlight,
i span.highlight,
em span.highlight,
strong span.highlight,
p span.highlight,
blockquote span.highlight { background: none; }

Just add an exclusion for each HTML tag whose contents should not be highlighted.

richardtallent
Hey, I fixed it, vote me back up! lol
richardtallent
A: 

I like the suggestion to use an HTML parser, but let me propose a way to enumerate the top-level text (no enclosing tags) regions, which you can transform and recombine at your leisure.

Essentially, you can treat each top-level open tag as a {, and track the nesting of only that tag. This might be simple enough compared to regular parsing that you want to do it yourself.

Here are some potential gotchas:

If it's not XHTML, you need a list of tags which are always empty:

<hr> , <br> and <img> (are there more?).

For all opening tags, if it ends in />, it's immediately closed - {} rather than {.

Case insensitivity - I believe you'll want to match tag names insensitively (just lc them all).

Super-permissive generous browser interpretations like

"<p> <p>" = "<p> </p><p>" = {}{

Quoted entities are NOT allowed to contain <> (they need to use &lt;), but maybe browsers are super permissive there as well.

Essentially, if you want to parse correct HTML markup, there's no problem.

So, the algorithm:

"end of previous tag" = start of string

repeatedly search for the next open-tag (case insensitive), or end of string:

< *([^ >/]+)[^/>]*(/?) *>|$

Handle (end of previous tag, start of match) as a region outside all tags.

Set tagname = lc($1). If there was a / ($2 isn't empty), the tag is already closed: set "end of previous tag" to the end of the match and continue at the start. Otherwise, with depth = 1:

  1. while depth > 0, scan for next (also case insensitive):

    < *(/?) *$tagname *(/?) *>

    If $1 is non-empty, it's a close tag (depth -= 1). Else if $2 is empty, it's another open tag (depth += 1). In any case, keep looping (back to 1.)

Once depth reaches 0, set "end of previous tag" to the end of that last match and go back to the start (you're at top level again). Note that I said at the top "scan for the next open tag, or end of string", i.e. make sure you process the top-level text hanging off the last closing tag.

That's it. Essentially, you get to ignore all other tags than the current topmost one you're monitoring, on the assumption that the input markup is properly nested (it will still work properly against some types of mis-nesting).
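The steps above could be sketched in C# roughly as follows. This is an illustration under assumptions, not a vetted implementation: the class and method names are invented, the two patterns are loosened slightly from the ones written above (attributes may contain "/", and nested tags of the same name may carry attributes), the always-empty tags are taken from the void element list in the HTML standard, and comments, doctypes and script blocks are not handled at all.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class TopLevelText
{
    // Void elements per the HTML standard: they never have a closing tag.
    private static readonly HashSet<string> VoidTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"
        };

    // Next opening tag: group 1 = tag name, group 2 = "/" when written <tag ... />.
    private static readonly Regex OpenTag =
        new Regex(@"<\s*([^\s>/]+)[^>]*?(/?)\s*>");

    // Applies transform to every stretch of text outside all elements and
    // copies each top-level element (with everything inside it) unchanged.
    public static string Transform(string html, Func<string, string> transform)
    {
        var result = new StringBuilder();
        int pos = 0; // "end of previous tag"
        while (pos < html.Length)
        {
            Match open = OpenTag.Match(html, pos);
            if (!open.Success) break;

            // Region between the previous element and this tag is top-level text.
            result.Append(transform(html.Substring(pos, open.Index - pos)));

            int end = open.Index + open.Length;
            string name = open.Groups[1].Value;
            bool closed = open.Groups[2].Length > 0 || VoidTags.Contains(name);
            if (!closed)
            {
                // Track nesting of this tag name only, case-insensitively.
                var sameTag = new Regex(
                    @"<\s*(/?)\s*" + Regex.Escape(name) + @"(?=[\s/>])[^>]*?(/?)\s*>",
                    RegexOptions.IgnoreCase);
                int depth = 1;
                while (depth > 0)
                {
                    Match m = sameTag.Match(html, end);
                    if (!m.Success) { end = html.Length; break; } // unclosed: rest counts as inside
                    if (m.Groups[1].Length > 0) depth--;          // close tag
                    else if (m.Groups[2].Length == 0) depth++;    // nested open tag
                    end = m.Index + m.Length;
                }
            }
            result.Append(html, open.Index, end - open.Index);
            pos = end;
        }
        // Top-level text hanging off the last closing tag.
        result.Append(transform(html.Substring(pos)));
        return result.ToString();
    }

    // Example use: highlight whole words, but only in top-level text.
    public static string HighlightOutsideTags(string html, string keyword)
    {
        string word = @"\b" + Regex.Escape(keyword) + @"\b";
        return Transform(html, text => Regex.Replace(text, word,
            m => "<span class=\"highlight\">" + m.Value + "</span>",
            RegexOptions.IgnoreCase));
    }
}
```

With the question's input, HighlightOutsideTags("the brown fox jumped over <b>the</b> lazy dog over there", "the") wraps only the first "the"; the one inside <b> and the one in "there" are left alone. A ">" inside a quoted attribute value will still confuse both patterns, which is one more reason to prefer a real parser for untrusted markup.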

Also, wherever I wrote a space above, it should probably be any whitespace (between <, >, / and the tag name you're allowed any whitespace you like).

As you can see, just because the problem is slightly easier than full HTML parsing, doesn't necessarily mean you shouldn't use a real HTML parser :) There's a lot you could screw up.

wrang-wrang
I should add that some regexp syntaxes (Perl) allow balanced paren counting, i.e. they're not really regular languages. I doubt that facility can handle closed tags like <tagname / >, but you could try it.
wrang-wrang