ansaurus

Question

Regex with vb.net

Answer 1

+6 A:

First of all, it should be pointed out that you want to use the HTML Agility Pack, and not regex, for this kind of thing.

But other than that, a pattern could look like:

(<tr>.*?World.*?</tr>)

It's a rather lousy pattern, but then again, use the agility pack.

David Hedlund 2009-12-08 10:15:11

wouldn't this match the first 6 lines - from first `<tr>` to the `</tr>` after `World`?

Amarghosh 2009-12-08 10:20:39

No, the non-greedy operators `?` will make sure the whitespace between `<tr>` and `World` is kept as small as possible, without breaking the match

David Hedlund 2009-12-08 10:39:46

Tested on Expresso and got this. `<tr><td>Hello</td></tr><tr><td>World</td></tr>`

Amarghosh 2009-12-08 10:58:06

That is not how non-greediness works. Non-greediness is responsible for not matching the whole string (including the last `</tr>`); after all, the whole string starts with `<tr>` and ends with `</tr>`. Because of the `.*?` at the end of the regex, the match stops at the first `</tr>` after `World`.

Amarghosh 2009-12-08 11:00:43
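
To make the over-match concrete, here is a minimal sketch in Python rather than VB.NET. That choice is an assumption made for illustration, as is the sample input, since the question's HTML is not shown in the thread. Lazy quantifiers should behave the same way in the .NET engine for this pattern.

```python
import re

# Assumed sample input; the question's HTML is not shown in the thread.
html = ("<table><tr><td>Hello</td></tr>"
        "<tr><td>World</td></tr>"
        "<tr><td>Bye</td></tr></table>")

# The pattern from the answer. re.DOTALL lets "." cross newlines,
# which is what RegexOptions.Singleline does in .NET.
pattern = re.compile(r"(<tr>.*?World.*?</tr>)", re.DOTALL)

match = pattern.search(html)

# The engine commits to the leftmost <tr>; laziness only keeps the
# tail short, so the match ends at the first </tr> after "World".
print(match.group(1))
# prints: <tr><td>Hello</td></tr><tr><td>World</td></tr>
```

The match starts at the first `<tr>` because a regex engine tries the leftmost starting position first, and `.*?` is allowed to grow as far as needed to reach `World`.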

cool, i managed to reproduce the same actually. so the pattern could be rewritten (something quick and dirty would be `(<tr>[^<]*<td>World.*?</tr>)`), but more importantly, my main point in my reply was for expressions such as these *not* to be used

David Hedlund 2009-12-08 11:03:53

(thanks for pointing me in the right direction with regards to non-greediness, tho)

David Hedlund 2009-12-08 11:04:41

A workaround is to use a negative lookahead: `(<tr>(?!.*?<tr>.*World).*?World.*?</tr>)`

Amarghosh 2009-12-08 11:05:21
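
A small sketch of how the lookahead version behaves, again in Python with an assumed sample input (neither is from the original thread). The lookahead rejects any `<tr>` that is followed by another `<tr>` and then `World`, so the match can only start at the row that actually holds the word.

```python
import re

# Assumed sample input, including a row with an extra <td>.
html = ("<table><tr><td>Hello</td></tr>"
        "<tr><td>1</td><td>World</td></tr>"
        "<tr><td>Bye</td></tr></table>")

# The lookahead pattern from the comment above.
pattern = re.compile(r"(<tr>(?!.*?<tr>.*World).*?World.*?</tr>)", re.DOTALL)

match = pattern.search(html)

# The first <tr> is rejected because a later <tr> precedes "World".
print(match.group(1))
# prints: <tr><td>1</td><td>World</td></tr>
```

One caveat of this pattern: if several rows contain `World`, the lookahead rejects all of them except the last one.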

Sorry to nitpick (again), but the new regex cannot accommodate additional `<td>` tags within that `<tr>` tag. But yeah, I agree with your main point: don't parse HTML with regex. And +1 for that :)

Amarghosh 2009-12-08 11:10:17
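
The limitation can be checked with a short sketch. Python and both input strings are assumptions made for illustration. The pattern is the quick-and-dirty rewrite `(<tr>[^<]*<td>World.*?</tr>)` from the earlier comment.

```python
import re

# The quick-and-dirty rewrite: <tr>, optional non-tag text, then a
# <td> that must start with "World".
pattern = re.compile(r"(<tr>[^<]*<td>World.*?</tr>)", re.DOTALL)

# Assumed inputs: one cell per row, and a row with an extra cell.
single_cell = "<tr><td>Hello</td></tr><tr><td>World</td></tr>"
extra_cell = "<tr><td>Hello</td></tr><tr><td>1</td><td>World</td></tr>"

single_match = pattern.search(single_cell)
extra_match = pattern.search(extra_cell)

print(single_match.group(1))  # prints: <tr><td>World</td></tr>
print(extra_match)            # prints: None
```

Because `[^<]*` cannot cross a tag, `World` has to sit in the first `<td>` of its row; any cell before it makes the match fail.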

i think this whole conversation has been illustrating that argument rather well =) yes, negative lookahead would be a wiser option here, but however we look at it, we're stuck making a bunch of assumptions as to what the html is going to look like

David Hedlund 2009-12-08 11:12:54
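
For contrast, here is a rough sketch of the parser-based route. It uses Python's standard `html.parser` as a stand-in, not the HTML Agility Pack API recommended in the answer, and the `RowFinder` class and sample input are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class RowFinder(HTMLParser):
    """Collects the cell text of every <tr> that contains the target word."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.in_row = False
        self.cells = []    # text fragments of the row being read
        self.matches = []  # rows whose text contains the target

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.in_row = True
            self.cells = []

    def handle_data(self, data):
        # Ignore whitespace-only text between tags.
        if self.in_row and data.strip():
            self.cells.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "tr" and self.in_row:
            self.in_row = False
            if any(self.target in cell for cell in self.cells):
                self.matches.append(self.cells)


# Assumed sample input with attributes, newlines and an extra cell.
html = """<table>
<tr><td>Hello</td></tr>
<tr class="x"><td>1</td><td>World</td></tr>
</table>"""

finder = RowFinder("World")
finder.feed(html)
finder.close()
print(finder.matches)  # prints: [['1', 'World']]
```

Attributes, extra cells and line breaks do not need special handling here, which is the kind of assumption the regex versions had to bake in.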
