Hi, I have the following piece of text, from which I'd like to extract all the <td ????>???</td> tags:

<tr id=row509>
    <td id=serv509 align=center  class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
    <td align=center class='style4'>23</td>
    <td align=center class='style10'>22</td>
    <td align=center class='style6'>0</td>
    <td align=center class='style2'>0</td>
    <td id=rowtot509 align=center class='style6'>0</td>
    <td align=center class='style6'>0</td>
    <td align=center class='style2'>0</td>
    <td align=center class='style6'>0</td>
</tr>

The expected result would be:

1. <td id=serv509 align=center  class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
2. <td align=center class='style4'>23</td>
3. <td align=center class='style10'>22</td>
[..]

Any help? Thanks

+2  A: 

What's the problem with using an HTML or XML library?

Using XML and XPath, for instance, this would just be a case of doing xml / td, in whatever way the library API supports that.

Regex is a lousy way of doing that, because XML is not a regular language. Specifically, tags can nest inside other tags to arbitrary depth, and that is something regular expressions can't represent.

So, while it would be easy to create a regular expression for the simple case (<td.*?</td>), it would easily break if the XML changed just a bit.
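For the simple case, the pattern can be tried out quickly. The sketch below uses Python's `re` module purely for illustration (the question is about C#, where `Regex.Matches` plays the same role). The `row` string is an abbreviated copy of the markup from the question.

```python
import re

# Abbreviated copy of the row from the question (unquoted attributes and all).
row = """<tr id=row509>
    <td id=serv509 align=center  class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
    <td align=center class='style4'>23</td>
    <td align=center class='style10'>22</td>
</tr>"""

# Non-greedy match from each opening <td ...> to the nearest </td>.
# re.DOTALL lets a cell span several lines.
cells = re.findall(r"<td.*?</td>", row, flags=re.DOTALL)

for number, cell in enumerate(cells, start=1):
    print(f"{number}. {cell}")
```

This works only because none of the cells contains another td; the match always ends at the nearest closing tag.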

Granted, the XML here is broken (the attribute values are not quoted), but you can fix it with a regex. :-) For instance, if you replace the pattern (\w+)=(\w+) with $1='$2' (or \1='\2', if that's the syntax of C# replace patterns), you'll get valid XML.
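A minimal sketch of that quote-then-parse idea, again in Python as a stand-in for C#: `re.sub` applies the replacement and the standard library's `xml.etree.ElementTree` does the parsing. The `row` string is an abbreviated copy of the question's markup. One assumption to keep in mind: the substitution touches every `word=word` occurrence, so cell text that happens to contain such a sequence would be altered too.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated copy of the row from the question.
row = """<tr id=row509>
    <td id=serv509 align=center  class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
    <td align=center class='style4'>23</td>
    <td id=rowtot509 align=center class='style6'>0</td>
</tr>"""

# Quote every bare attribute value: id=row509 becomes id='row509'.
# Already-quoted values such as class='style1' are left alone, because
# the quote character after "=" is not a word character.
fixed = re.sub(r"(\w+)=(\w+)", r"\1='\2'", row)

# The repaired text is well-formed XML, so a real parser accepts it.
root = ET.fromstring(fixed)
cells = root.findall("td")          # direct <td> children of the <tr>
texts = [td.text for td in cells]

for td in cells:
    print(td.attrib, td.text)
```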

Daniel
The document may not be well formed (as in this case). In fact, XDocument x = XDocument.Parse(row.ToString()); throws an XmlException.
pistacchio
Ah, well, who am I to disagree with that? I use regex to extract tds from a malformed HTML page myself. Well, the pattern is in the answer. I don't know C#, so I can't give the exact code.
Daniel
Oh, by the way, your regex does not match the first two TDs!
pistacchio
The working copy :) <td[^>]*>[^<]*</td>
pistacchio
I changed the regular expression, as you may see. The older one should have worked too. Maybe there was a typo? The "working copy" one works here.
Daniel
Ah... I can see the typo. Indeed, instead of `*]` it should have been `]*`. :-)
Daniel
A: 

I would agree with Daniel, but if you really must use a regex, get yourself a copy of RegexBuddy so you can quickly debug your expression. Best $40 I've spent in a long time.

Sneal
A: 

Regular expressions are a pretty fragile tool to use for this kind of problem, especially if there's any risk at all that a table's cell content could be another table. (In that case, the first </td> tag you find after a <td> start tag may not actually be closing that element but a descendant element.)
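The nesting problem is easy to reproduce. In this small Python sketch the `nested` string is hypothetical markup (it is not from the question) in which one cell holds a whole inner table, and the pattern is the simple non-greedy one discussed earlier in the thread.

```python
import re

# Hypothetical markup: the outer cell contains an entire inner table.
nested = "<td>outer <table><tr><td>inner</td></tr></table> tail</td>"

# The non-greedy pattern stops at the first </td> it meets.
matches = re.findall(r"<td.*?</td>", nested, flags=re.DOTALL)

for match in matches:
    print(match)
```

The only match is `<td>outer <table><tr><td>inner</td>`: it ends at the inner cell's closing tag, so the outer cell is cut short and the inner cell is never reported on its own.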

A much more robust way to tackle problems like these is to parse the HTML into a DOM and then examine the DOM. The HTML Agility Pack is one that people seem to like.
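The HTML Agility Pack is a .NET library, so the sketch below does not use it. As a stand-in it uses Python's standard `html.parser`, which is event-based rather than a DOM but shows the same idea: let a real HTML parser cope with the unquoted attributes and the nesting. `CellCollector` is a hypothetical helper written for this example, and `row` is an abbreviated copy of the question's markup.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects attributes and text of each outermost <td>."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # number of <td> elements currently open
        self.cells = []      # (attributes, text) for each outermost cell
        self._attrs = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            if self.depth == 0:
                # A new outermost cell starts here.
                self._attrs = dict(attrs)
                self._text = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost cell is complete.
                self.cells.append((self._attrs, "".join(self._text).strip()))

    def handle_data(self, data):
        if self.depth > 0:
            self._text.append(data)


row = """<tr id=row509>
    <td id=serv509 align=center  class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
    <td align=center class='style4'>23</td>
</tr>"""

parser = CellCollector()
parser.feed(row)
parser.close()

for attrs, text in parser.cells:
    print(attrs, text)
```

Because the collector counts open td elements, a cell that contains another table is still reported once, as a whole, with the inner text included.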

Robert Rossney