In general, you must not parse XML using regular expressions.
Instead, use the System.Xml namespace.
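As a rough illustration of what that looks like, here is a minimal sketch that streams through a document with XmlReader and collects every piece of text that is not blank. It assumes the input is already well-formed XML; the class and method names (XmlTextExtractor, ExtractText) are made up for this example and do not come from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class XmlTextExtractor
{
    // Hypothetical helper: returns the value of every text node that
    // contains something other than whitespace.
    public static List<string> ExtractText(string xml)
    {
        var results = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                // IsNullOrWhiteSpace also treats U+00A0 (a decoded
                // non-breaking space) as whitespace.
                if (reader.NodeType == XmlNodeType.Text &&
                    !string.IsNullOrWhiteSpace(reader.Value))
                {
                    results.Add(reader.Value);
                }
            }
        }
        return results;
    }
}
```

Because the reader streams, this should cope with large inputs without loading the whole document into memory. It throws an XmlException as soon as the input stops being well-formed, which is a quick way to find out whether you really have XML.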
I would approach it in two passes (in Perl, but the regexes should translate).
First pass: extract all strings.
my @strings = $s =~ m{<[^>]+>([^<>]+)<[^/>]*/[^/>]*>}g;
Second pass: filter out the unwanted ones.
@strings = grep {!/&nbsp;|^\s+$/} @strings;
The regex for this will be quite cumbersome. Basically you need a regex that looks for balanced pairs of tags, and within the balanced pair you want anything that is valid for your scenario. The "valid for your scenario" part is the crappy part. Given the snippet you showed, you want a regex similar to:
<(?<tag>\w*)>(?<text>.*)</\k<tag>>
(Courtesy of Expresso)
The (?<text>.*) group is the part you will have to construct by hand to match your elimination criteria.
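To make the "construct by hand" part concrete, here is a small sketch in C#. It keeps the same tag/backreference shape but swaps the text group for one that demands at least one non-whitespace character, which is only one possible elimination rule and an assumption on my part. The names BalancedTagDemo and InnerTexts are invented for the example. It does not handle attributes or nested elements.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class BalancedTagDemo
{
    // <tag>text</tag>, where text has no angle brackets and contains
    // at least one character that is not whitespace.
    static readonly Regex Element = new Regex(
        @"<(?<tag>\w+)>(?<text>[^<>]*[^<>\s][^<>]*)</\k<tag>>");

    // Hypothetical helper: returns the text of every matching element.
    public static List<string> InnerTexts(string input)
    {
        var texts = new List<string>();
        foreach (Match m in Element.Matches(input))
        {
            texts.Add(m.Groups["text"].Value);
        }
        return texts;
    }
}
```

Called on "<a>one</a><b> </b><c>two</c>", this skips the whitespace-only b element and returns the text of a and c.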
I would not use regular expressions to do this! I would run it through a Tidy utility and then use XSLT and XPath.
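For the XPath half of that suggestion, a sketch might look like the following. It assumes the Tidy step has already produced well-formed XML (that step is not shown), and the names XPathTextDemo and NonBlankLeafText are hypothetical. The translate() call is there because XPath's normalize-space() does not count a non-breaking space (U+00A0) as whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class XPathTextDemo
{
    // Leaf elements (no child elements) whose text is not blank once
    // non-breaking spaces are mapped to ordinary spaces.
    const string Query =
        "//*[not(*) and normalize-space(translate(., '\u00A0', ' ')) != '']";

    // Hypothetical helper: returns the text of each matching element.
    public static List<string> NonBlankLeafText(string wellFormedXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(wellFormedXml);
        var texts = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(Query))
        {
            texts.Add(node.InnerText);
        }
        return texts;
    }
}
```

The same expression could be used as the match pattern in an XSLT template if you would rather transform the document than pull values out of it.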
I was able to get what I wanted by using one regex to get the elements and a second regex to remove the ones with the whitespace I defined.
With about 30MB of data it takes 3 seconds.
// Requires: using System.Text.RegularExpressions; using System.Web;
Regex ElementExpression = new Regex(
    @"<(?'tag'\w+?)(?'attributes'.*?)>" + // match the opening tag and name it 'tag'
    @"(?'text'[^<>]*?)" +                 // match the text content and name it 'text'
    @"</\k'tag'>",                        // match the closing tag via a backreference to 'tag'
    RegexOptions.Multiline | RegexOptions.Compiled | RegexOptions.IgnoreCase);
Regex WhiteSpaceExpression = new Regex(@"\A((&nbsp;)|(\s)|(\r))*\Z", RegexOptions.Multiline | RegexOptions.Compiled | RegexOptions.IgnoreCase);
text = ElementExpression.Replace(text, delegate(Match match) {
    if (match.Groups.Count > 0) {
        Group textGroup = match.Groups["text"];
        if (!WhiteSpaceExpression.IsMatch(textGroup.Value)) {
            // Real content: keep the element and encode its text.
            return String.Format("<{0}{1}>{2}</{0}>", match.Groups["tag"].Value, match.Groups["attributes"].Value, HttpUtility.HtmlEncode(textGroup.Value));
        }
        else {
            // Only whitespace: collapse to an empty element.
            return String.Format("<{0}{1} />", match.Groups["tag"].Value, match.Groups["attributes"].Value);
        }
    }
    return match.Value;
});
If it's not XML, that's bad. Saying that it's a "string that closely represents XML" is not really an adequate definition of the problem. There are infinitely many ways for a string to closely resemble XML, and a parsing solution devised for one won't work with another.
If you can be specific about the ways in which the string will deviate from XML - i.e., if you can identify the specific mistakes that the original developer was making in attempting to write XML - it should be possible to undo the damage, turn the string into well-formed XML, and then use a DOM approach to find the data that you're looking for.
If you can't be specific about the ways in which the string deviates from XML, then you have a much bigger problem than writing a regular expression.
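As a sketch of the "undo the damage, then use a DOM" route: suppose, purely as an assumption for illustration, that the only two deviations are the HTML entity &nbsp; (which XML does not define) and bare ampersands. Your actual data may well be broken in other ways. The names XmlRepairDemo and RepairAndLoad are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

static class XmlRepairDemo
{
    // An ampersand that does not start a predefined XML entity or a
    // numeric character reference.
    static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9a-fA-F]+);)");

    // Hypothetical helper: fixes the two assumed deviations, then parses.
    public static XmlDocument RepairAndLoad(string almostXml)
    {
        // Assumed deviation 1: &nbsp; becomes a numeric reference.
        string repaired = almostXml.Replace("&nbsp;", "&#160;");
        // Assumed deviation 2: escape ampersands that stand alone.
        repaired = BareAmpersand.Replace(repaired, "&amp;");

        var doc = new XmlDocument();
        doc.LoadXml(repaired); // throws XmlException if other problems remain
        return doc;
    }
}
```

The regex here only repairs a narrowly defined mistake; the parsing itself is left to XmlDocument. If LoadXml still throws, the exception's line and position tell you which further deviation needs to be characterised.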