ansaurus

Question

Using Lookahead to match a string using a regular expression

Answer 1

+5 A:

Once again: use an HTML parser to walk the DOM. Regexes will never be robust enough to do this.

annakata 2008-12-09 11:04:06

regexHtmlParserQuestions++ ;-)

Tomalak 2008-12-09 11:08:38

I think this is a good name for a tag ;-)

bruno conde 2008-12-09 12:06:56

you forgot the "++" :)

annakata 2008-12-09 12:11:33

Obviously it has to be: ++regexHtmlParserQuestions++;

Tomalak 2008-12-09 12:38:25

Answer 2

+4 A:

It's actually impossible to solve this using standard regular expressions, since they basically implement type 3 grammars in the Chomsky hierarchy (finite state automata), whereas you need at least a type 2 grammar (some sort of stack or recursion) to correctly recognize arbitrarily nested structures.
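To make the "stack or recursion" point concrete, here is a minimal Java sketch that keeps a depth counter while scanning SPAN tags. The class and method names (SpanDepthScanner, firstBalancedSpan) are made up for illustration, and the regex is only used to tokenize tags, not to match the nesting itself. It ignores comments, scripts and attribute values containing ">", so it is not a substitute for a real parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds a balanced SPAN element by counting depth.
final class SpanDepthScanner {
    // Matches a single opening or closing SPAN tag; group 1 is "/" for a closing tag.
    private static final Pattern SPAN_TAG =
        Pattern.compile("<(/?)SPAN\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the first complete top-level SPAN element, or null if none is balanced.
    static String firstBalancedSpan(String html) {
        Matcher m = SPAN_TAG.matcher(html);
        int depth = 0;   // the counter plays the role of the stack
        int start = -1;  // offset of the outermost opening tag
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                if (depth == 0) {
                    start = m.start();
                }
                depth++;
            } else if (depth > 0) {
                depth--;
                if (depth == 0) {
                    return html.substring(start, m.end());
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "x<SPAN id=a><SPAN id=b><SPAN>c</SPAN></SPAN></SPAN>y";
        System.out.println(firstBalancedSpan(html));
    }
}
```

The counter can grow without bound, which is exactly what a finite state automaton cannot do.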

However, if you restrict the maximum nesting level, then it's probably possible, but I still doubt that regexps are the best solution.
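As a rough illustration of the bounded-nesting case, the following Java sketch builds a pattern by unrolling the nesting a fixed number of times. The names (BoundedNesting, spanPattern, matches) are hypothetical, and the sketch assumes the only tags in the input are SPAN tags: any other "<" inside the element makes it fail.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a regex for SPAN elements nested at most maxDepth levels deep.
final class BoundedNesting {
    // Text that contains no tag start at all.
    private static final String TEXT = "[^<]*";
    private static final String OPEN = "<SPAN\\b[^>]*>";
    private static final String CLOSE = "</SPAN>";

    // Unrolls the nesting: each loop iteration allows one more level of inner SPANs.
    static String spanPattern(int maxDepth) {
        String inner = TEXT; // depth 0: no nested SPAN allowed
        for (int i = 0; i < maxDepth; i++) {
            inner = TEXT + "(?:" + OPEN + inner + CLOSE + TEXT + ")*";
        }
        return OPEN + inner + CLOSE;
    }

    // True if the whole input is one SPAN element nested no deeper than maxDepth.
    static boolean matches(String html, int maxDepth) {
        return Pattern.compile(spanPattern(maxDepth), Pattern.CASE_INSENSITIVE)
                      .matcher(html)
                      .matches();
    }

    public static void main(String[] args) {
        String html = "<SPAN id=a>x<SPAN id=b>y</SPAN>z</SPAN>";
        System.out.println(matches(html, 0)); // false: one nested level is too deep
        System.out.println(matches(html, 1)); // true
    }
}
```

The pattern grows with every extra level, which is one practical reason to doubt that regexps are the best tool here.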

Michael Borgwardt 2008-12-09 11:06:38

.NET's implementation supports counting, so it is possible to match begin/end tags etc., but for anything beyond simple matching a real parser would be the way to go.

Brian Rasmussen 2008-12-09 13:55:19

Answer 3

A:

Basically, I agree with the advice above: using regexes to parse HTML is asking for code that breaks some day on strange but legal HTML constructs (not to mention malformed HTML that browsers accept...). Finding and using a good HTML parser can be rewarding in many ways...

Now, I am pragmatic (and I can't resist a little regex challenge...), and sometimes I use REs against machine-generated HTML (often an export feature), because I know the structure I see is unlikely to change, unlike hand-written pages where the author can make typos... It is mostly for quick hacks that I can adapt if the output ever changes.

In your case, the HTML is quite regular, linear and predictable, so the RE is quite simple. I give Java code because I don't know C#, but adaptation should be trivial.

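import java.util.regex.Matcher;
import java.util.regex.Pattern;

// html holds the page source; each match is an outer SPAN wrapping one inner SPAN
Pattern p = Pattern.compile("(<SPAN id.*?<SPAN id.*?</SPAN></SPAN>)");
Matcher m = p.matcher(html);
while (m.find())
{
  System.out.println(m.group(1));
}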

HTH.

PhiLho 2008-12-09 14:31:58

Answer 4

A:

Try this:

@"(?is)<SPAN\b[^>]*>\s*(<SPAN\b[^>]*>.*?</SPAN>)\s*</SPAN>"

This is basically the same as PhiLho's regex, except it permits whitespace between the tags at either end. I also had to add the Singleline/DOTALL modifier to accommodate line separators within the matched text. I don't know if either of those changes was really necessary; the sample data the OP posted was all on one line, but PhiLho broke it up (thereby breaking his own regex).
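For comparison, here is a sketch of the same pattern in Java, next to PhiLho's original. The class name, field names and sample input are made up for illustration. In Java the inline (?is) flags mean CASE_INSENSITIVE and DOTALL, and the backslashes must be doubled because Java has no verbatim string literal like C#'s @"...".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper comparing the two patterns from this thread.
final class InnerSpanExtractor {
    // PhiLho's pattern: no DOTALL, no whitespace allowed between the closing tags.
    static final Pattern ORIGINAL =
        Pattern.compile("(<SPAN id.*?<SPAN id.*?</SPAN></SPAN>)");

    // The pattern above: (?is) = case-insensitive + DOTALL, \s* tolerates whitespace.
    static final Pattern TOLERANT =
        Pattern.compile("(?is)<SPAN\\b[^>]*>\\s*(<SPAN\\b[^>]*>.*?</SPAN>)\\s*</SPAN>");

    // Collects group 1 (the inner SPAN element) of every match.
    static List<String> innerSpans(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = TOLERANT.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample input, broken across lines.
        String html = "<SPAN id=\"outer\">\n  <SPAN id=\"inner\">Hello\nworld</SPAN>\n</SPAN>";
        System.out.println(ORIGINAL.matcher(html).find()); // false: '.' stops at line breaks
        System.out.println(innerSpans(html));
    }
}
```

Note that group 1 here is the inner SPAN only, whereas PhiLho's group 1 is the whole outer element.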

Alan Moore 2008-12-09 16:24:47
