I am trying to come up with a way to match content that is not inside any XML or HTML tags. I've read that using regular expressions to parse XML/HTML is fundamentally a bad idea, and I'm open to any solution that solves my problem, but if a regex works too, all the better.

Here's an example of what I'm looking for:

the lazy fox jumped <span>over</span> the brown fence.

What I want back is

the lazy fox jumped  the brown fence

Any ideas?

+1  A: 

It's probably a naive technique, but my first instinct would be to run the regular expression, figure out what text it matches within your parent string, and REMOVE it from that string, returning the remainder. In pseudocode,

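// Needs: using System.Text.RegularExpressions;
string input = "whatever";
MatchCollection matches = Regex.Matches(input, "<.*>.*?</.*>");
foreach (Match m in matches)
{
    // String.Remove takes indexes, so use Replace and keep the returned string.
    input = input.Replace(m.Value, "");
}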
Jim Dagg
Thanks Jim, I'll try that. One question: how would that handle two spans in sequence, like "the <span>lazy</span> fox <span>jumped</span> again"? In my case I would need that to return "the fox again".
Joseph
That's where the star-question mark comes in. In the .NET regular expression syntax it's a "lazy" (that is, non-greedy) quantifier: it consumes as few characters as possible while still letting the pattern match. A fully greedy pattern such as <.*>.*<.*> would return "the again", whereas the lazy .*? in the pseudocode is meant to stop at the first closing span tag, the first possible match against the pattern, so that "the fox again" is returned.
Jim Dagg
@Jim Dagg: You should make the dot-stars inside the tags reluctant, too, e.g.: `"<.*?>"`. Also, you could do the job in one pass with `Regex.Replace(target, regex, "")`.
Alan Moore
@Jim Oh awesome! I didn't know you could do that! Thanks!
Joseph
@Alan -- good call. Hadn't thought of that.
Jim Dagg
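
For anyone following along outside .NET, here is a minimal Python sketch of the one-pass replacement suggested in the comments, with every quantifier made lazy. The function name `strip_tagged` is made up for illustration, and the sketch assumes flat, non-nested elements on a single line; nested elements of the same kind would not be handled correctly.

```python
import re

# Every quantifier is lazy, so each match covers one opening tag,
# its content, and the nearest closing tag.
TAGGED = re.compile(r"<.*?>.*?</.*?>")

def strip_tagged(text):
    """Remove each element (tags plus enclosed text) in a single pass."""
    return TAGGED.sub("", text)

print(strip_tagged("the lazy fox jumped <span>over</span> the brown fence."))
print(strip_tagged("the <span>lazy</span> fox <span>jumped</span> again"))
```

Both spans are removed in the second call, and the spaces on either side of each removed element are left in place, so the results contain double spaces.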
+2  A: 

Run this one over the string:

s/\(.*\)<.*>.*<.*>\(.*\)/\1\2/

You might need to adjust some details for your implementation (escaping the parentheses may not be required, for example), but that'll get exactly what you want, including the double space left in the middle.
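
To see how this substitution behaves when there is more than one element, here is a rough Python rendering of the same expression. The helper names `strip_once` and `strip_all` are hypothetical, and this only shows what Python's regex engine does with the pattern; because the leading group is greedy, a single pass removes just the last element, so the sketch repeats the substitution until nothing changes.

```python
import re

# Python spelling of the sed expression: groups are written unescaped,
# and the greedy quantifiers are kept exactly as in the answer.
PATTERN = re.compile(r"(.*)<.*>.*<.*>(.*)")

def strip_once(text):
    """Apply the substitution a single time, like sed without the g flag."""
    return PATTERN.sub(r"\1\2", text, count=1)

def strip_all(text):
    """Repeat the substitution until the text stops changing."""
    previous = None
    while previous != text:
        previous, text = text, strip_once(text)
    return text

print(strip_once("the lazy fox jumped <span>over</span> the brown fence."))
print(strip_once("the <span>lazy</span> fox <span>jumped</span> again"))
print(strip_all("the <span>lazy</span> fox <span>jumped</span> again"))
```

With one span, a single pass is enough. With two spans, the first pass leaves `<span>lazy</span>` in place, and the repeated version is needed to remove both.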

Gordon Worley
Thanks Gordon! I'll give it a try!
Joseph
I just edited it, so try again. Forgot to escape the lt/gt.
Gordon Worley
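
Since the question is open to non-regex solutions, here is a hedged sketch of a parser-based alternative using Python's standard `html.parser` module. The class and function names are invented for illustration. It assumes every opened element is eventually closed; void elements written as `<br>` without a slash would throw off the depth counter.

```python
from html.parser import HTMLParser

class OutsideTagText(HTMLParser):
    """Collects only the text that is not enclosed by any element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of currently open elements
        self.parts = []   # text fragments seen while depth is 0

    def handle_starttag(self, tag, attrs):
        self.depth += 1

    def handle_endtag(self, tag):
        # Ignore stray closing tags so the counter never goes negative.
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

def text_outside_tags(markup):
    parser = OutsideTagText()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(text_outside_tags("the lazy fox jumped <span>over</span> the brown fence."))
print(text_outside_tags("the <span>lazy</span> fox <span>jumped</span> again"))
```

Unlike the regex versions, tracking depth means nested elements such as `<b>a <i>b</i> c</b>` are dropped as a whole.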