Hi,

I have the following HTML contained in a string:

<tr>
   <td>Hello</td>
</tr>
<tr>
   <td>World</td>
</tr>
<tr>
   <td>lets all smile</td>
</tr>

I would like to use a regex to find the <tr></tr> element that contains the text "World". The difficulty is that there might be other <td> elements around the one that contains the search text. So the match needs to run from the <tr> nearest before the search text to the </tr> nearest after it, allowing any text in between.

The result of the match would be:

<tr>
   <td>World</td>
</tr>

I'm using VB.NET, by the way.

Could anyone help at all?

Thanks

Richard

+6  A: 

First of all, it should be pointed out that you want to use the HTML Agility Pack, and not regex, for this kind of thing.

But other than that, a pattern could look like:

(<tr>.*?World.*?</tr>)

It's a rather lousy pattern, but then again, use the HTML Agility Pack.

David Hedlund
wouldn't this match the first 6 lines - from first `<tr>` to the `</tr>` after `World`?
Amarghosh
No, the non-greedy operators `?` will make sure the whitespace between `<tr>` and `World` is kept as small as possible, without breaking the match
David Hedlund
Tested on Expresso and got this. `<tr><td>Hello</td></tr><tr><td>World</td></tr>`
Amarghosh
That is not how non-greediness works. Non-greediness is responsible for not matching the whole string (including the last `</tr>`): after all, the whole string starts with `<tr>` and ends with `</tr>`, but because of the `.*?` at the end of the regex, the match stops at the first `</tr>` after `World`. It does not move the start of the match, which stays at the first `<tr>`.
Amarghosh
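To make the point above concrete, here is a small sketch that runs the pattern from the answer against the HTML from the question. It is written in Python rather than VB.NET purely for illustration; the lazy quantifier behaves the same way in .NET's `Regex`, where `RegexOptions.Singleline` plays the role of `re.DOTALL`. The names `HTML`, `LAZY` and `lazy_result` are made up for this example.

```python
import re

# The HTML string from the question.
HTML = """<tr>
   <td>Hello</td>
</tr>
<tr>
   <td>World</td>
</tr>
<tr>
   <td>lets all smile</td>
</tr>"""

# re.DOTALL lets "." match line breaks, which the multi-line input needs.
LAZY = re.compile(r"(<tr>.*?World.*?</tr>)", re.DOTALL)

match = LAZY.search(HTML)
lazy_result = match.group(1) if match else None

# The match starts at the FIRST <tr>, not the one nearest to "World".
print(lazy_result)
```

The regex engine tries start positions from left to right, so the first `<tr>` wins; laziness only limits how far each `.*?` extends once that start is fixed. The result therefore contains the "Hello" row as well as the "World" row, but not the "lets all smile" row.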
cool, i managed to reproduce the same actually. so the pattern could be rewritten (something quick and dirty would be `(<tr>[^<]*<td>World.*?</tr>)`), but more importantly, my main point in my reply was for expressions such as these *not* to be used
David Hedlund
(thanks for pointing me in the right direction with regards to non-greediness, tho)
David Hedlund
A workaround is to use a negative lookahead: `(<tr>(?!.*?<tr>.*World).*?World.*?</tr>)`
Amarghosh
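A quick way to check the lookahead version is to run it against the HTML from the question. This is only a sketch in Python (the pattern syntax is also accepted by .NET, with `RegexOptions.Singleline` in place of `re.DOTALL`); `HTML`, `LOOKAHEAD` and `row` are illustrative names.

```python
import re

# The HTML string from the question.
HTML = """<tr>
   <td>Hello</td>
</tr>
<tr>
   <td>World</td>
</tr>
<tr>
   <td>lets all smile</td>
</tr>"""

# The lookahead rejects any <tr> that is followed by another <tr>
# and then, somewhere later, the search text.
LOOKAHEAD = re.compile(r"(<tr>(?!.*?<tr>.*World).*?World.*?</tr>)", re.DOTALL)

match = LOOKAHEAD.search(HTML)
row = match.group(1) if match else None
print(row)
```

On this input the first `<tr>` is rejected because a later `<tr>` precedes "World", so the match starts at the second row and `row` holds only the "World" row. One consequence of the same rule: if several rows contain the search text, only the last of them can match.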
Sorry to nitpick (again), but the new regex cannot accommodate additional `<td>` tags within that `<tr>` tag. But yeah, I agree with your main point: don't parse HTML with regex. And +1 for that :)
Amarghosh
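The following sketch illustrates that limitation, assuming "the new regex" refers to the quick pattern `(<tr>[^<]*<td>World.*?</tr>)` suggested a few comments earlier. The sample row `ROW`, with a second cell before the "World" cell, is made up for the example; Python stands in for VB.NET here.

```python
import re

# A hypothetical row with an extra cell BEFORE the one holding the search text.
ROW = "<tr><td>Hello</td><td>World</td></tr>"

# Quick pattern: only whitespace/text (no tags) may sit between <tr> and <td>World.
QUICK = re.compile(r"(<tr>[^<]*<td>World.*?</tr>)", re.DOTALL)

# Negative lookahead pattern from the comments, for comparison.
LOOKAHEAD = re.compile(r"(<tr>(?!.*?<tr>.*World).*?World.*?</tr>)", re.DOTALL)

quick_hit = QUICK.search(ROW)
lookahead_hit = LOOKAHEAD.search(ROW)

print(quick_hit)                 # no match: [^<]* cannot skip over <td>Hello</td>
print(lookahead_hit.group(1))    # the whole row
```

The quick pattern fails because `[^<]*` stops at the first `<`, so `<td>World` has to be the very first cell of the row. The lookahead pattern does not have that restriction, though it still relies on other assumptions about the markup.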
i think this whole conversation has been illustrating that argument rather well =) yes, negative lookahead would be a wiser option here, but however we look at it, we're stuck making a bunch of assumptions as to what the html is going to look like
David Hedlund
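For comparison, here is a rough sketch of the parser-based route recommended in the answer. It does not use the HTML Agility Pack (a .NET library); instead it uses Python's standard `html.parser` module as a stand-in to show the idea. `RowFinder` and `find_rows` are hypothetical names, and the sketch assumes rows are not nested inside other rows.

```python
from html.parser import HTMLParser


class RowFinder(HTMLParser):
    """Collects the markup of every <tr> whose text contains a search string."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.parts = None   # markup pieces of the current row, None outside a row
        self.text = []      # text pieces of the current row
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.parts, self.text = [], []   # a new row begins
        if self.parts is not None:
            self.parts.append(self.get_starttag_text())

    def handle_data(self, data):
        if self.parts is not None:
            self.parts.append(data)
            self.text.append(data)

    def handle_endtag(self, tag):
        if self.parts is None:
            return
        self.parts.append(f"</{tag}>")
        if tag == "tr":
            # Row finished: keep it if its text contains the search string.
            if self.needle in "".join(self.text):
                self.matches.append("".join(self.parts))
            self.parts = None


def find_rows(html, needle):
    finder = RowFinder(needle)
    finder.feed(html)
    finder.close()
    return finder.matches


HTML = """<tr>
   <td>Hello</td>
</tr>
<tr>
   <td>World</td>
</tr>
<tr>
   <td>lets all smile</td>
</tr>"""

print(find_rows(HTML, "World"))
```

Because the parser tracks where each row starts and ends, extra cells or several matching rows need no special handling. The row markup is re-assembled from the parsed pieces, so it may differ slightly from the source in edge cases such as self-closing tags.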