+2  A: 

In general, you must not parse XML using regular expressions.

Instead, use the System.Xml namespace.

SLaks
Unfortunately, this is not viable in this situation. This is an application that was not made by my team, and we need to optimize it without rewriting anything (not my decision). It is invalid XML, so I need to do this in order to make it valid. Then I can treat it as XML :)
Phobis
So I am parsing a string that very closely resembles XML.
Phobis
A: 

I would approach it in two passes (in Perl, but the regexes should translate).

First pass. Extract all strings.

my @strings = $s =~ m{<[^>]+>([^<>]+)<[^/>]*/[^/>]*>}g;

Second pass. Filter out the unwanted ones.

@strings = grep {!/&nbsp;|^\s+$/} @strings;
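The two passes above can be sketched in Python as well. This is only an illustration: the sample input and the names `ELEMENT_TEXT`, `UNWANTED` and `extract_strings` are made up here, since the thread never shows the real data.

```python
import re

# Hypothetical sample input; the real data is not shown in the thread.
sample = "<a>Hello</a><b>&nbsp;</b><c>   </c><d>World</d>"

# First pass: capture the text sitting between an opening tag and a closing tag.
ELEMENT_TEXT = re.compile(r"<[^>]+>([^<>]+)<[^/>]*/[^/>]*>")

# Second pass: reject strings that contain &nbsp; or consist only of whitespace.
UNWANTED = re.compile(r"&nbsp;|^\s+$")

def extract_strings(s):
    candidates = ELEMENT_TEXT.findall(s)  # findall returns the single capture group
    return [c for c in candidates if not UNWANTED.search(c)]

print(extract_strings(sample))  # ['Hello', 'World']
```

Keeping the filter as a separate step means the extraction pattern stays simple, and the rejection rule can change without touching it.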
zen
+1  A: 

The regex for this will be quite cumbersome. Basically, you need a regex that looks for balanced pairs, and within each balanced pair you want anything that is valid for your scenario. The "valid for your scenario" part is the crappy part. Given the snippet you showed, you want a regex similar to:

<(?<tag>\w*)>(?<text>.*)</\k<tag>>

(Courtesy of Expresso)

(?<text>.*) <- is what you will have to construct by hand to match your elimination criteria
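For comparison, here is a rough Python rendering of the same balanced-pair idea. Python spells the named group `(?P<tag>...)` and the backreference `(?P=tag)`. This sketch also swaps in a lazy `.*?` and requires at least one character in the tag name, which are assumptions rather than part of the pattern above; the sample string is hypothetical.

```python
import re

# Named group (?P<tag>...) and named backreference (?P=tag) are Python's
# spelling of .NET's (?<tag>...) and \k<tag>.
# The lazy .*? stops at the first close tag with the same name.
BALANCED = re.compile(r"<(?P<tag>\w+)>(?P<text>.*?)</(?P=tag)>", re.DOTALL)

def balanced_pairs(s):
    """Return (tag, text) for every element whose close tag matches its open tag."""
    return [(m.group("tag"), m.group("text")) for m in BALANCED.finditer(s)]

# Hypothetical input: <b> is closed by </i>, so it is not a balanced pair.
sample = "<p>one</p><b>two</i><em>three</em>"
print(balanced_pairs(sample))  # [('p', 'one'), ('em', 'three')]
```

Like any regex of this kind, it does not understand nesting: an element that contains another element with the same tag name will be cut off at the first closing tag.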
GrayWizardx
Yes... that is what I have so far. Thank you, though. At least you are trying to help me solve the problem as I asked, instead of telling me another way to do it :) I just need to get the regex to exclude &nbsp; and any whitespace combined.
Phobis
You might not want to do that in the regex itself. If you capture each candidate and then verify it after the regex, it might be easier; otherwise your elimination pattern is going to be very complex. I would just get all the matches (including invalid ones), then iterate over them and throw out the ones you don't want.
GrayWizardx
Oops, I saw that someone posted the same solution and got a downvote. Sorry. I'm not sure of the exact syntax for the elimination pattern.
GrayWizardx
+1  A: 

I would not use regular expressions to do this! I would run it through a Tidy utility and then use XSLT and XPath.

Josh Stodola
It isn't valid XML.
Phobis
That's why you use a tidying utility. The challenge is finding one that works on your particular brand of poorly-formed XML.
Robert Rossney
Or roll your own tidy utility that takes care of the specific problems you have (given that your XML isn't ridiculously malformed). It can't be that difficult...
Josh Stodola
A: 

I was able to get what I wanted by using one regex to get the elements and a second regex to remove the ones with the whitespace I defined.

With about 30MB of data it takes 3 seconds.

  // Requires: using System.Text.RegularExpressions; using System.Web;
  Regex ElementExpression = new Regex(
            @"<(?'tag'\w+?)(?'attributes'.*?)>" + // match the opening tag; name it 'tag' and capture the rest as 'attributes'
            @"(?'text'[^<>]*?)" + // match text content (no nested markup), name it 'text'
            @"</\k'tag'>" // match the closing tag, which must repeat 'tag'
            , RegexOptions.Multiline | RegexOptions.Compiled | RegexOptions.IgnoreCase);

  // Matches content made up only of &nbsp; entities and whitespace (\s already covers \r).
  Regex WhiteSpaceExpression = new Regex(@"\A((&nbsp;)|(\s))*\Z", RegexOptions.Compiled | RegexOptions.IgnoreCase);

  text = ElementExpression.Replace(text, delegate(Match match){
        Group textGroup = match.Groups["text"];
        if (!WhiteSpaceExpression.IsMatch(textGroup.Value)){
           // Real content: re-emit the element with its text HTML-encoded.
           return String.Format("<{0}{1}>{2}</{0}>", match.Groups["tag"].Value, match.Groups["attributes"].Value, HttpUtility.HtmlEncode(textGroup.Value));
        }
        // Only &nbsp; and/or whitespace: collapse to an empty, self-closing element.
        return String.Format("<{0}{1} />", match.Groups["tag"].Value, match.Groups["attributes"].Value);
  });
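For readers outside .NET, the same replace-with-callback technique can be sketched in Python. This is a loose port, not the code that was actually run: `normalize`, `ELEMENT` and `BLANK` are names invented for the sketch, and `html.escape` stands in for `HttpUtility.HtmlEncode`, which does not encode exactly the same set of characters.

```python
import html
import re

ELEMENT = re.compile(
    r"<(?P<tag>\w+?)(?P<attributes>.*?)>"  # opening tag and any attributes
    r"(?P<text>[^<>]*?)"                   # text content with no nested markup
    r"</(?P=tag)>",                        # closing tag repeating the same name
    re.IGNORECASE,
)

# Content made up only of &nbsp; entities and whitespace (or nothing at all).
BLANK = re.compile(r"\A(?:&nbsp;|\s)*\Z", re.IGNORECASE)

def normalize(text):
    def rewrite(m):
        tag, attrs, body = m.group("tag"), m.group("attributes"), m.group("text")
        if BLANK.match(body):
            # Nothing meaningful inside: collapse to a self-closing element.
            return f"<{tag}{attrs} />"
        # Real content: keep the element and escape &, < and > in its text.
        return f"<{tag}{attrs}>{html.escape(body, quote=False)}</{tag}>"
    return ELEMENT.sub(rewrite, text)

# Hypothetical input resembling the kind of markup described in the thread.
print(normalize('<td class="x">&nbsp; </td><td>Fish & chips</td>'))
```

Note that text which already contains entities such as `&amp;` would be escaped a second time, so this only suits input whose text is not yet encoded.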
Phobis
Again, I want to make clear that this is a horrible thing to do, and I know that, but it is the scope of the task I have at hand. This is outside code that needed to be optimized while working the same way. (The previous code ran client-side in JavaScript and would bomb after about an hour!) ...You get what you pay for, and the company that paid for this paid a non-software consulting company to build this junk.
Phobis
A: 

If it's not XML, that's bad. Saying that it's a "string that closely resembles XML" is not really an adequate definition of the problem. There are an infinity of ways for a string to closely resemble XML, and a parsing solution devised for one won't work with another.

If you can be specific about the ways in which the string will deviate from XML - i.e., if you can identify the specific mistakes that the original developer was making in attempting to write XML - it should be possible to undo the damage, turn the string into well-formed XML, and then use a DOM approach to find the data that you're looking for.

If you can't be specific about the ways in which the string deviates from XML, then you have a much bigger problem than writing a regular expression.
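As a minimal sketch of the "undo the damage, then use a DOM" idea, suppose (purely as an assumption, since the thread never lists the actual defects) that the only deviations are `&nbsp;` entities, which XML does not predefine, and bare ampersands in text. The helper names `repair` and `texts` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed deviation 1: &nbsp; is used, but XML does not predefine that entity.
# Assumed deviation 2: bare ampersands appear in text instead of &amp;.
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")

def repair(raw):
    """Turn the almost-XML string into well-formed XML under the assumptions above."""
    fixed = raw.replace("&nbsp;", "&#160;")  # numeric reference for no-break space
    return BARE_AMP.sub("&amp;", fixed)      # escape ampersands that start no entity

def texts(raw):
    """Parse the repaired string and return the non-blank element texts."""
    root = ET.fromstring(repair(raw))
    # str.strip() also removes the no-break space, so &nbsp;-only text is dropped.
    return [el.text for el in root.iter() if el.text and el.text.strip()]

sample = "<rows><r>Fish & chips</r><r>&nbsp;</r><r>ok</r></rows>"
print(texts(sample))  # ['Fish & chips', 'ok']
```

Each additional known defect would get its own small repair step before parsing; anything the repair step does not anticipate will surface as a parse error rather than as silently wrong data.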

Robert Rossney