Hi

I need to parse a piece of HTML. It looks something like this:

<table>
   <tr>
     <td class="blabla"> <table><tr><td><table><tr><td></td></tr></table></td></tr></table>
     </td>
   </tr>
  <tr>
     <td class="blabla"> <table><tr><td></td></tr></table>
     </td>
   </tr>
</table>

I need to extract each td with class blabla, but each of these cells can contain zero or more nested tables, each with many nested tds. For every such cell I want to get the whole element:

<td class="blabla"> ... many nested stuff ... </td>

Thanks

+1  A: 

Why don't you use css selectors?

rahul
It is on a .NET win app, that parses text.
Gidon
@Gidon: Don't think about HTML as text.
Welbog
A: 

([tT][dD]\sclass=\"blabla\")

Ratnesh Maurya
A: 

You would be looking for a regex similar to /<td\sclass=\"(.*?)\">/, but I do not know how to write this in .NET.

However, because HTML is so often malformed, regex is not a good candidate for parsing it. There are much better tools for that.

As has been mentioned, XPath would be a good way to do this, using //td[@class="someClass"]. This gives you the td node. You can then get its contents and process them as required.

Xetius
+6  A: 

Don't try to parse HTML with regular expressions. You can't write an expression that will match what you want, because HTML isn't regular.
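
To make the failure concrete, here is a small sketch in Python (chosen only for illustration, since the thread has no .NET code to extend). The sample string is the question's markup collapsed onto fewer lines, and both patterns are my own guesses at what one might try first, not anything posted in the thread.

```python
import re

# The question's sample markup, collapsed onto fewer lines.
html = (
    '<table><tr>'
    '<td class="blabla"> <table><tr><td><table><tr><td></td></tr></table></td></tr></table> </td>'
    '</tr><tr>'
    '<td class="blabla"> <table><tr><td></td></tr></table> </td>'
    '</tr></table>'
)

# Non-greedy attempt: each match stops at the first </td> it meets,
# which closes a nested cell, so every result is cut short.
lazy = re.findall(r'<td class="blabla">.*?</td>', html, re.S)

# Greedy attempt: the match runs on to the last </td> in the input,
# so both "blabla" cells end up merged into a single result.
greedy = re.findall(r'<td class="blabla">.*</td>', html, re.S)

print(lazy[0])
print(len(greedy))
```

Neither pattern can know which </td> belongs to which <td>, and that pairing is exactly what a parser keeps track of.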

Use an HTML/XML parser in a library your language provides. System.Xml has a number of useful classes that will let you open your file and query it with XPath.

The XPath expression you're looking for is

//td[@class="someClass"]
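
As a rough illustration of the same query, the sketch below runs it over the question's sample with Python's xml.etree.ElementTree, standing in for the System.Xml classes mentioned above. It assumes the input is well-formed XML, which real-world HTML often is not. The class name blabla comes from the question, and ElementTree wants the relative form .//td rather than //td.

```python
import xml.etree.ElementTree as ET

# The sample from the question; it happens to be well-formed XML.
html = """<table>
  <tr>
    <td class="blabla"> <table><tr><td><table><tr><td></td></tr></table></td></tr></table>
    </td>
  </tr>
  <tr>
    <td class="blabla"> <table><tr><td></td></tr></table>
    </td>
  </tr>
</table>"""

root = ET.fromstring(html)

# ElementTree supports a subset of XPath; './/' searches all descendants.
cells = root.findall(".//td[@class='blabla']")

# Serialize each matching cell, nested tables included. strip() drops the
# whitespace that follows the element in the source (its "tail").
fragments = [
    ET.tostring(cell, encoding="unicode", short_empty_elements=False).strip()
    for cell in cells
]

for fragment in fragments:
    print(fragment)
```

Each fragment is one complete <td class="blabla"> ... </td> element, however deeply its tables nest.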
Welbog
Not sure of the .NET implementation, but wouldn't that be //td[@class="someClass"]?
Xetius
@Xetius: Right. Sorry. :)
Welbog
That is what we did in the end.
Gidon
+4  A: 

If you need to do extensive HTML parsing, I would recommend using the Html Agility Pack instead of regular expressions. HAP builds an XML document from an HTML page so you can look for specific nodes using XPath.

René
A: 

You can't do this with regular expressions alone, because the nesting makes it too complicated. Even with lookahead matching, the regex would have to change dynamically: the number of </td> tags you need to skip depends on how many <td> tags open after the one you want.
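
The counting described above is easy once a parser delivers the tags one at a time. The sketch below uses Python's html.parser purely as an illustration. CellCollector and its attributes are hypothetical names of my own, and the approach is a simplification: character entities come back decoded and HTML comments are dropped.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the markup of each <td> carrying a given class, nested content included."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0    # number of <td> elements open inside the current capture
        self.parts = []   # markup pieces of the cell being captured
        self.cells = []   # finished cells

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            # Not capturing yet: start only on a <td> with the wanted class.
            classes = (dict(attrs).get("class") or "").split()
            if tag == "td" and self.wanted_class in classes:
                self.depth = 1
                self.parts = [self.get_starttag_text()]
            return
        if tag == "td":
            self.depth += 1  # a nested cell: one more </td> to skip
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        self.parts.append(f"</{tag}>")
        if tag == "td":
            self.depth -= 1
            if self.depth == 0:  # this </td> closes the cell we started on
                self.cells.append("".join(self.parts))

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


# The question's sample markup, collapsed onto fewer lines.
html = (
    '<table><tr>'
    '<td class="blabla"> <table><tr><td><table><tr><td></td></tr></table></td></tr></table> </td>'
    '</tr><tr>'
    '<td class="blabla"> <table><tr><td></td></tr></table> </td>'
    '</tr></table>'
)

collector = CellCollector("blabla")
collector.feed(html)
collector.close()
cells = collector.cells
```

The depth counter is the piece a fixed regular expression has no way to express: it goes up on every nested <td> and the capture ends only when it returns to zero.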

Mike Caron