How could I use regex to find this table in a page? I need to find it by name:

<table id="Table Name">
<tr><td class="label">Name:</td>
<td class="data"><div class="datainfo">Stuff</div></td></tr>
<tr><td class="label">Email:</td>
<td class="data"><div class="datainfo">Stuff2</div></td></tr>
<tr><td class="label">Address:</td>
<td class="data"><div class="datainfo">Stuff3</div></td></tr>
</table>
<table id="Table Name 2">
<tr><td class="label">Field1:</td>
<td class="data"><div class="datainfo">MoreStuff</div></td></tr>
<tr><td class="label">Field2:</td>
<td class="data"><div class="datainfo">MoreStuff2</div></td></tr>
<tr><td class="label">Field3:</td>
<td class="data"><div class="datainfo">MoreStuff3</div></td></tr>
</table>

Then grab the "labels" and "datainfo" and store them in an associative array such as:

$table_name['name']     // Stuff
$table_name['email']    // Stuff2
$table_name['address']  // Stuff3

$table_name2['field1']  // MoreStuff
$table_name2['field2']  // MoreStuff2
$table_name2['field3']  // MoreStuff3
+8  A: 

Regex is a bad solution in this case. Use Simple HTML DOM Parser instead.

Update: Here is a function for this:

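 // Assumes the Simple HTML DOM Parser library (which provides str_get_html) is already included.
 $html = str_get_html($html);
 print_r(get_table_fields($html, 'Table Name'));
 print_r(get_table_fields($html, 'Table Name 2'));

 // Return the rows of the table with the given id as a label => value array.
 function get_table_fields($html, $id) {
     $result = array();
     $table = $html->find('table[id='.$id.']', 0);
     if (!$table) {
         return $result; // no table with that id
     }
     foreach ($table->find('tr') as $row) {
         $key = trim($row->find('td', 0)->plaintext);
         $value = trim($row->find('td', 1)->plaintext);
         // remove the trailing ':' from the label
         $key = preg_replace('/:$/', '', $key);
         $result[$key] = $value;
     }
     return $result;
 }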
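
To illustrate why a regex is fragile here, the sketch below runs one pattern against two versions of the same row. The pattern, the sample strings, and the helper names `regex_rows` and `dom_rows` are all made up for this illustration, and the DOM side uses PHP's built-in DOMDocument rather than Simple HTML DOM Parser so that the snippet runs on its own.

```php
<?php
// Hypothetical pattern, written only for this illustration.
$pattern = '~<td class="label">(.*?):</td>\s*<td class="data"><div class="datainfo">(.*?)</div></td>~';

// The markup exactly as posted in the question (one row only).
$compact = '<table id="Table Name"><tr><td class="label">Name:</td>'
         . '<td class="data"><div class="datainfo">Stuff</div></td></tr></table>';

// Same data, but with line breaks and an extra attribute on the div.
$variant = '<table id="Table Name"><tr><td class="label">Name:</td>'
         . "<td class=\"data\">\n  <div id=\"x\" class=\"datainfo\">Stuff</div>\n</td></tr></table>";

// Count the rows the regex manages to match.
function regex_rows($pattern, $html) {
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    return count($matches);
}

// Count the label cells a DOM parser finds.
function dom_rows($html) {
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $xpath = new DOMXPath($dom);
    return $xpath->query('//td[@class="label"]')->length;
}

echo regex_rows($pattern, $compact), ' ', regex_rows($pattern, $variant), "\n";
echo dom_rows($compact), ' ', dom_rows($variant), "\n";
```

The regex finds the row in the compact markup but misses it once whitespace and an extra attribute appear, while the DOM query finds the label cell in both.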
Ivan Nevostruev
Looking into it now. I knew there had to be some kind of solution to this. Thanks very much.
Patrick
See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 for why.
Ikke
Ahh, thanks man. You didn't have to write it out for me, but I appreciate it greatly nonetheless. :)
Patrick
Use the DOM class. It's part of PHP and it's a lot faster!
AntonioCS
@AntonioCS: I think the DOM parser works for XML only, doesn't it?
Ivan Nevostruev
@Ivan http://www.php.net/manual/en/domdocument.loadhtml.php
AntonioCS
@AntonioCS: Thanks for this
Ivan Nevostruev
@Ivan No Problem, glad I could help :)
AntonioCS
@Ivan I have added the code using DOMDocument. This might give you an idea of the power of the DOM class :)
AntonioCS
A: 

I've never played with Simple HTML parser, but I'm a pretty big fan of PHP's built-in SimpleXML. This accomplishes the same thing.

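// simplexml_load_string() needs well-formed XML, so the HTML file must also be valid XML.
$XML = simplexml_load_string(file_get_contents('test_doc.html'));

$all_labels =  $XML->xpath("//td[@class='label']");
$all_datainfo = $XML->xpath("//div[@class='datainfo']");

// Pair each label with its value; the values are still SimpleXMLElement objects here.
$all = array_combine($all_labels, $all_datainfo);

$final = array();
foreach ($all as $k => $v) {
    // cast to string and strip the trailing ':' from the label
    $final[preg_replace('/:$/', '', (string)$k)] = (string)$v;
}

print_r($final);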

If you're wondering why that loop casts everything to (string), do a print_r on $all.

The final output would be:

Array
(
    [Name] => Stuff
    [Email] => Stuff2
    [Address] => Stuff3
    [Field1] => MoreStuff
    [Field2] => MoreStuff2
    [Field3] => MoreStuff3
)
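
Since `simplexml_load_string()` only accepts well-formed XML, real-world HTML often needs a detour. The following is a minimal sketch, not part of the answer above: it parses the page with `DOMDocument::loadHTML()` and hands the result to `simplexml_import_dom()`, so the same XPath queries still apply. The helper name `html_to_simplexml` and the sample markup (with a deliberately unclosed `<br>`) are assumptions made for this illustration.

```php
<?php
// Hypothetical helper: parse lenient HTML, then expose it as SimpleXML.
function html_to_simplexml($html) {
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return simplexml_import_dom($dom);
}

// Sample markup for illustration; the <br> is never closed, so it is not valid XML.
$html = '<table id="Table Name"><tr><td class="label">Name:<br></td>'
      . '<td class="data"><div class="datainfo">Stuff</div></td></tr></table>';

// Parsing the same string directly as XML fails and returns false.
$direct = @simplexml_load_string($html);

$xml = html_to_simplexml($html);
$labels = $xml->xpath("//td[@class='label']");
$values = $xml->xpath("//div[@class='datainfo']");

$final = array();
foreach ($labels as $i => $label) {
    // cast to string, then drop whitespace and the trailing ':'
    $key = rtrim(trim((string) $label), ':');
    $final[$key] = trim((string) $values[$i]);
}

var_dump($direct);
print_r($final);
```

This pairs labels and values by position, like the `array_combine()` call above, so it assumes every label cell has exactly one matching `datainfo` div.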
Erik
Will it work for HTML?
Ivan Nevostruev
I dropped his example HTML inside of `<html><body> ... </body></html>`, so ... yes, it does :)
Erik
A: 

I decided to write the code using PHP's DOMDocument class:

<?php 

$dom = new DOMDocument();

$dom->loadHTML(file_get_contents('stackoverflow_table.html'));

$count = 0;
$data = array();

while (++$count) {
  $tableid = 'Table Name' . ($count > 1 ? ' ' . $count : ''); // build the table id: 'Table Name', 'Table Name 2', ...
  $table = $dom->getElementById($tableid);
  if ($table) {
    $tds = $table->getElementsByTagName('td');

    // the cells come in (label, data) pairs
    for ($i = 0, $l = $tds->length; $i + 1 < $l; $i += 2) {
      // label text without surrounding whitespace or the trailing ':'
      $keyname = rtrim(trim($tds->item($i)->textContent), ':');
      // textContent returns the text of the nested div, whatever whitespace nodes surround it
      $value = trim($tds->item($i + 1)->textContent);

      $data[$keyname] = $value;
    }
  }
  else // there is no table with this id, so stop
    break;
}

// should present the format you wanted :)
var_dump($data);

Here is the HTML file I created for this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<meta http-equiv="Expires" content="Fri, Jan 01 1900 00:00:00 GMT">
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Cache-Control" content="no-cache">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Lang" content="en">
<meta name="author" content="">
<meta http-equiv="Reply-to" content="">
<meta name="generator" content="">
<meta name="description" content="">
<meta name="keywords" content="">
<meta name="creation-date" content="11/11/2008">
<meta name="revisit-after" content="15 days">
<title>Example</title>
<link rel="stylesheet" type="text/css" href="my.css">
</head>
<body>
<table id="Table Name">
    <tr>
        <td class="label">Name:</td>
        <td class="data">
            <div class="datainfo">Stuff</div>
        </td>
    </tr>
    <tr>
        <td class="label">Email:</td>
        <td class="data">
            <div class="datainfo">Stuff2</div>
        </td>
    </tr>
    <tr>
        <td class="label">Address:</td>
        <td class="data">
            <div class="datainfo">Stuff3</div>
        </td>
    </tr>
</table>
<table id="Table Name 2">
    <tr>
        <td class="label">Field1:</td>
        <td class="data">
            <div class="datainfo">MoreStuff</div>
        </td>
    </tr>
    <tr>
        <td class="label">Field2:</td>
        <td class="data">
            <div class="datainfo">MoreStuff2</div>
        </td>
    </tr>
    <tr>
        <td class="label">Field3:</td>
        <td class="data">
            <div class="datainfo">MoreStuff3</div>
        </td>
    </tr>
</table>  
</body>
</html>
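
The question asked for one array per table with lowercase keys. A rough sketch of that, assuming the same markup as above, could use DOMXPath instead of counting cells. The function name `tables_to_arrays` and the inline sample string are made up for this illustration; the tables are keyed by their `id` attribute.

```php
<?php
// Hypothetical helper: returns array(table id => array(label => value)).
function tables_to_arrays($html) {
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $tables = array();
    foreach ($xpath->query('//table[@id]') as $table) {
        $fields = array();
        foreach ($xpath->query('.//tr', $table) as $row) {
            $label = $xpath->query('.//td[@class="label"]', $row)->item(0);
            $info = $xpath->query('.//div[@class="datainfo"]', $row)->item(0);
            if ($label === null || $info === null) {
                continue; // skip rows that do not follow the label/datainfo layout
            }
            // lowercase the label and drop whitespace and the trailing ':'
            $key = strtolower(rtrim(trim($label->textContent), ':'));
            $fields[$key] = trim($info->textContent);
        }
        $tables[$table->getAttribute('id')] = $fields;
    }
    return $tables;
}

// Shortened sample markup for illustration.
$html = '<table id="Table Name">'
      . '<tr><td class="label">Name:</td><td class="data"><div class="datainfo">Stuff</div></td></tr>'
      . '<tr><td class="label">Email:</td><td class="data"><div class="datainfo">Stuff2</div></td></tr>'
      . '</table>'
      . '<table id="Table Name 2">'
      . '<tr><td class="label">Field1:</td><td class="data"><div class="datainfo">MoreStuff</div></td></tr>'
      . '</table>';

$tables = tables_to_arrays($html);
print_r($tables['Table Name']);
print_r($tables['Table Name 2']);
```

Looking cells up by class inside each row avoids relying on cell order or on whitespace text nodes between the tags.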
AntonioCS