What regex would match a nested table with identifiable text in a table cell? I've tried and failed to come up with a regular expression that extracts the specific table I want without matching from the beginning of one table to the end of another in the example. Here is something to get started: "<table>.*?</table>"

<table>
    <tr>
     <td>
      <table>
       <tr><td>Code1</td></tr>
       <tr><td>some data</td></tr>
       <tr><td>etc ...</td></tr>
      </table>
     </td>
    </tr>
    <tr>
     <td>
      <table>
       <tr><td>Code2</td></tr>
       <tr><td>some data</td></tr>
       <tr><td>etc ...</td></tr>
      </table>
     </td>
    </tr>
</table>

Say I want to extract the table containing "Code2". What regex will match specifically and only that table?

+1  A: 

Don't use a regex. Use an HTML parser!

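However, if you must use a regex, here is one in Perl. It assumes the tables are not nested; on nested input like the example in the question, the greedy .* matches from the first <table> to the last </table>: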

$xml =~ /<table>.*<td>Code2<\/td>.*<\/table>/s;
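
To illustrate why the "no nested tables" assumption matters, here is a minimal sketch that transcribes the same pattern to Python's re module and runs it on a shortened version of the question's example. The names HTML_DOC, GREEDY and matched are made up for this illustration.

```python
import re

# Shortened version of the example from the question (assumed input).
HTML_DOC = """<table>
 <tr><td>
  <table>
   <tr><td>Code1</td></tr>
  </table>
 </td></tr>
 <tr><td>
  <table>
   <tr><td>Code2</td></tr>
  </table>
 </td></tr>
</table>"""

# Same pattern as the Perl one above; re.S plays the role of the /s modifier.
GREEDY = re.compile(r"<table>.*<td>Code2</td>.*</table>", re.S)

match = GREEDY.search(HTML_DOC)
matched = match.group(0) if match else None

# The greedy quantifiers stretch the match across all three tables.
print(matched.count("<table>"))
print("Code1" in matched)
```

On this input the match runs from the outermost opening tag to the final closing tag, so the Code1 table comes along with it.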
tster
Don't use an XML parser, use an **HTML** parser!
Peter Boughton
(unless of course you can be certain the content is valid XHTML)
Peter Boughton
Thanks, edited the answer.
tster
+5  A: 

I wouldn't use a regexp for this, since HTML isn't a regular language, and there is no end of edge cases to trip you up. You're better off using an HTML parser. Whichever language or platform you're using, there'll be one available.
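
As one possible illustration of the parser approach, here is a rough sketch using Python's standard-library html.parser. The question does not name a language, so Python is an assumption, and TableFinder, find_table and HTML_DOC are hypothetical names invented for this example. It keeps a stack of open tables and returns the innermost one whose text node equals the search string.

```python
from html.parser import HTMLParser

# Shortened version of the example from the question (assumed input).
HTML_DOC = """<table>
 <tr><td>
  <table>
   <tr><td>Code1</td></tr>
   <tr><td>some data</td></tr>
  </table>
 </td></tr>
 <tr><td>
  <table>
   <tr><td>Code2</td></tr>
   <tr><td>some data</td></tr>
  </table>
 </td></tr>
</table>"""


class TableFinder(HTMLParser):
    """Find the innermost <table> that directly contains a given text node."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.open_tables = []  # stack: one record per currently open <table>
        self.found = None      # re-serialized markup of the first match

    def _record(self, text):
        # Every open table accumulates the markup seen while it is open.
        for table in self.open_tables:
            table["parts"].append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.open_tables.append({"parts": [], "hit": False})
        self._record(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._record(f"</{tag}>")
        if tag == "table" and self.open_tables:
            table = self.open_tables.pop()
            if table["hit"] and self.found is None:
                self.found = "".join(table["parts"])

    def handle_data(self, data):
        self._record(data)
        # Credit the hit to the innermost open table only.
        if self.open_tables and data.strip() == self.needle:
            self.open_tables[-1]["hit"] = True


def find_table(html, needle):
    finder = TableFinder(needle)
    finder.feed(html)
    finder.close()
    return finder.found


result = find_table(HTML_DOC, "Code2")
print(result)
```

The markup is rebuilt from parser events, so it is only an approximation of the original source for unusual input such as self-closing tags. A dedicated HTML library would likely make this shorter.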

Brian Agnew
+2  A: 

The following regex will find your table:

(?ms)<table>((?!<table>).)*<td>Code2</td>.*?</table>

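With (?ms) you turn on "multiline" mode (m), in which ^ and $ match at line breaks, and "dot matches newlines, too" (s). Only (s) matters here, since the pattern contains no anchors. The negative lookahead (?!<table>) is checked before every character consumed, which ensures that no second table start appears between the opening <table> and the Code2 cell.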
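
A quick way to check this pattern is to run it on the example. The sketch below uses Python's re module, which accepts the same (?ms) inline flags; HTML_DOC, PATTERN and extracted are names made up for this illustration, and the input is a shortened version of the question's example.

```python
import re

# Shortened version of the example from the question (assumed input).
HTML_DOC = """<table>
 <tr><td>
  <table>
   <tr><td>Code1</td></tr>
   <tr><td>some data</td></tr>
  </table>
 </td></tr>
 <tr><td>
  <table>
   <tr><td>Code2</td></tr>
   <tr><td>some data</td></tr>
  </table>
 </td></tr>
</table>"""

# The pattern from this answer, unchanged.
PATTERN = re.compile(r"(?ms)<table>((?!<table>).)*<td>Code2</td>.*?</table>")

match = PATTERN.search(HTML_DOC)
extracted = match.group(0) if match else None
print(extracted)
```

Attempts that start at the outer table or at the Code1 table fail, because the lookahead stops them at the next <table> before Code2 is reached. Note that the pattern still relies on bare <table> tags without attributes and on the Code2 table not containing a nested table of its own.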

tangens