I like the suggestion to use an HTML parser, but let me propose a way to enumerate the top-level text regions (those with no enclosing tags), which you can transform and recombine at your leisure.
Essentially, you can treat each top-level open tag as a {, and track the nesting of only that tag. This might be simple enough compared to regular parsing that you want to do it yourself.
Here are some potential gotchas:
If it's not XHTML, you need a list of tags which are always empty:
<hr>, <br> and <img>, plus <input>, <meta>, <link>, <area>, <base>, <col>, <embed>, <source>, <track> and <wbr>.
Any opening tag that ends in /> is immediately closed: {} rather than {.
Case insensitivity - I believe you'll want to match tag names insensitively (just lc them all).
Super-permissive browser interpretations like
"<p> <p>" = "<p> </p><p>" = {}{
Quoted attribute values are NOT supposed to contain raw < or > (they should use &lt; and &gt;), but maybe browsers are super permissive there as well.
Essentially, if you want to parse correct HTML markup, there's no problem.
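To make the brace analogy and the gotchas above concrete, here is a minimal Python sketch that maps a single tag to {, } or {}. The names VOID_TAGS, TAG_RE and brace_for are hypothetical, the always-empty list is an assumption you may need to extend, and the pattern assumes attribute values contain no raw >.

```python
import re

# Assumed list of always-empty tags; extend it to suit your input.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# One tag: optional leading "/", the tag name, any attributes,
# and an optional trailing "/" just before ">".
TAG_RE = re.compile(r"<\s*(/?)\s*([^\s>/]+)[^>]*?(/?)\s*>")

def brace_for(tag: str) -> str:
    """Return '{' for an open tag, '}' for a close tag, '{}' for an empty one."""
    m = TAG_RE.fullmatch(tag)
    if m is None:
        raise ValueError(f"not a tag: {tag!r}")
    closing, name, self_closed = m.group(1), m.group(2).lower(), m.group(3)
    if closing:
        return "}"
    if self_closed or name in VOID_TAGS:
        return "{}"  # opened and closed at once
    return "{"
```

Lower-casing the name before the lookup is what makes <BR> and <br> behave the same.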
So, the algorithm:
"end of previous tag" = start of string
repeatedly search for the next open-tag (case insensitive), or end of string:
< *([^ >/]+)[^>]*?(/?) *>|$
handle (end of previous tag, start of match) as a region outside all tags.
set tagname=lc($1). If there was a / ($2 isn't empty), or tagname is one of the always-empty tags, then set "end of previous tag" to the end of the match and continue at start. Else, with depth=1,
while depth > 0, scan for next (also case insensitive):
< *(/?) *$tagname\b[^>]*?(/?) *>
If $1, then it's a close tag (depth-=1). Else if not $2, it's another open tag (depth+=1). In any case, keep looping until depth reaches 0.
Then set "end of previous tag" to the end of that last close tag and go back to the start (you're at top level again). Note that I said at the top "search for the next open-tag, or end of string", i.e. make sure you process the top-level text hanging off the last closing tag.
That's it. Essentially, you get to ignore every tag other than the topmost one you're currently tracking, on the assumption that the input markup is properly nested (it will still work properly against some types of mis-nesting).
Also, wherever I wrote a space above, it should probably be any whitespace (\s in most regex flavors): between <, >, / and the tag name you're allowed any whitespace you like.
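The steps above can be sketched in Python roughly as follows. This is an illustration under assumptions, not a tested parser: top_level_regions, OPEN_RE and VOID_TAGS are hypothetical names, the always-empty list is assumed, attribute values are assumed to contain no raw >, and comments, doctypes and script contents are not handled. Instead of matching "or end of string" inside the regex, the sketch treats "no more open tags" as that case.

```python
import re

# Assumed list of always-empty tags; extend it to suit your input.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# Next open tag: name in group 1, optional trailing "/" in group 2.
OPEN_RE = re.compile(r"<\s*([^\s>/]+)[^>]*?(/?)\s*>")

def top_level_regions(html: str):
    """Yield (start, end) spans of text that lie outside all tags."""
    prev_end = 0  # "end of previous tag"
    while True:
        m = OPEN_RE.search(html, prev_end)
        if m is None:
            # The "end of string" case: text after the last closing tag.
            if prev_end < len(html):
                yield (prev_end, len(html))
            return
        if m.start() > prev_end:
            yield (prev_end, m.start())
        prev_end = m.end()
        name = m.group(1).lower()
        if m.group(2) or name in VOID_TAGS:
            continue  # "{}": nothing to track
        # Track the nesting of this one tag name only.
        same_tag = re.compile(
            r"<\s*(/?)\s*" + re.escape(name) + r"(?![^\s/>])[^>]*?(/?)\s*>",
            re.IGNORECASE,
        )
        depth = 1
        while depth > 0:
            t = same_tag.search(html, prev_end)
            if t is None:
                return  # unbalanced: treat the rest as inside the tag
            prev_end = t.end()
            if t.group(1):
                depth -= 1  # close tag
            elif not t.group(2):
                depth += 1  # another open tag of the same name

def top_level_text(html: str):
    """Convenience wrapper returning the region strings themselves."""
    return [html[s:e] for s, e in top_level_regions(html)]
```

For instance, top_level_text('Hello <b>bold <b>nested</b> still</b> mid <br> tail') should give ['Hello ', ' mid ', ' tail'], since everything inside the outer <b> is skipped and <br> counts as already closed.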
As you can see, just because the problem is slightly easier than full HTML parsing, doesn't necessarily mean you shouldn't use a real HTML parser :) There's a lot you could screw up.