views:

41

answers:

2

Hi everybody,

I have a html table like this :

<table ... >

  <tbody ... >

       <tr ... > 
             <td ...>
                  string...
              </td>
                <td ...>
                  string...
              </td>
                <td ...>
                  string...
              </td>
                <td ...>
                  string...
              </td>
                <td ...>
                  string...
              </td>
       </tr>
        <tr ... > 
             <td ...>
                  string...
              </td>
                <td ...>
                  string...
              </td>
                <td ...>
                  string...
              </td>
                <td ...>
             </td>
                <td ...>
                  string...
              </td>
       </tr>
       ..............

  </tbody>


</table>

This is a data table and I need to get all data from this. The table will have many rows (<tr></tr>) . each row will have a fixed columns (<td></td>)(currently is 5 ). remember each table,tr,td tag maybe formatted (where say "...")

And I hope everyone can help me to write a regex for preg_match_all function to get the data like this :

array(
   0 => array(
       0=> 'some data0',
       1=> 'some data1',
       2=> 'some data2',
       3=> 'some data3',
       4=> 'some data4',
   )
   1 => array(
       0=> 'some data0',
       1=> 'some data1',
       2=> 'some data2',
       3=> 'some data3',
       4=> 'some data4',
   )
   2 => array(
       0=> 'some data0',
       1=> 'some data1',
       2=> 'some data2',
       3=> 'some data3',
       4=> 'some data4',
   )
..........
)

Now the example for your test, hopfully you can help me!!!

<table border="1" >
  <tbody style="" >

       <tr style="" > 
             <td style="color:blue;">
                  data0
              </td>
                <td style="font-size:15px;">
                 data1
              </td>
                <td style="font-size:15px;">
                  data2
              </td>
                <td style="color:blue;">
                  data3
              </td>
                <td style="color:blue;">
                  data4
              </td>
       </tr>
       <tr style="" > 
             <td style="color:blue;">
                  data00
              </td>
                <td style="font-size:15px;">
                 data11
              </td>
                <td style="font-size:15px;">
                  data22
              </td>
                <td style="color:blue;">
                  data33
              </td>
                <td style="color:blue;">
                  data44
              </td>
       </tr>
       <tr style="color:black" > 
             <td style="color:blue;">
                  data000
              </td>
                <td style="font-size:15px;">
                 data111
              </td>
                <td style="font-size:15px;">
                  data222
              </td>
                <td style="color:blue;">
                  data333
              </td>
                <td style="color:blue;">
                  data444
              </td>
       </tr>

  </tbody>


</table>
A: 

You absolutely do NOT want to parse HTML with Regex.

There are far too many variations, for one, and more importantly, regex isn't very good with the hierarchal nature of HTML. It's best to use an XML parser or better-yet an HTML-specific parser.

Whenever I need to scrape HTML, I tend to use the Simple HTML DOM Parser library, which takes an HTML tree and parses it into a traversable PHP object, which you can query something like JQuery.

<?php
    require 'simplehtmldom/simple_html_dom.php';

    $sHtml = <<<EOS
    <table border="1" >
      <tbody style="" >
           <tr style="" > 
                 <td style="color:blue;">
                      data0
                  </td>
                    <td style="font-size:15px;">
                     data1
                  </td>
                    <td style="font-size:15px;">
                      data2
                  </td>
                    <td style="color:blue;">
                      data3
                  </td>
                    <td style="color:blue;">
                      data4
                  </td>
           </tr>
           <tr style="" > 
                 <td style="color:blue;">
                      data00
                  </td>
                    <td style="font-size:15px;">
                     data11
                  </td>
                    <td style="font-size:15px;">
                      data22
                  </td>
                    <td style="color:blue;">
                      data33
                  </td>
                    <td style="color:blue;">
                      data44
                  </td>
           </tr>
           <tr style="color:black" > 
                 <td style="color:blue;">
                      data000
                  </td>
                    <td style="font-size:15px;">
                     data111
                  </td>
                    <td style="font-size:15px;">
                      data222
                  </td>
                    <td style="color:blue;">
                      data333
                  </td>
                    <td style="color:blue;">
                      data444
                  </td>
           </tr>
      </tbody>
    </table>
EOS;

    $oHTML = str_get_html($sHtml);
    $oTRs = $oHTML->find('table tr');
    $aData = array();
    foreach($oTRs as $oTR) {
        $aRow = array();
        $oTDs = $oTR->find('td');

        foreach($oTDs as $oTD) {
            $aRow[] = trim($oTD->plaintext);
        }

        $aData[] = $aRow;
    }

    var_dump($aData);
?>

And the output:

array
  0 => 
    array
      0 => string 'data0' (length=5)
      1 => string 'data1' (length=5)
      2 => string 'data2' (length=5)
      3 => string 'data3' (length=5)
      4 => string 'data4' (length=5)
  1 => 
    array
      0 => string 'data00' (length=6)
      1 => string 'data11' (length=6)
      2 => string 'data22' (length=6)
      3 => string 'data33' (length=6)
      4 => string 'data44' (length=6)
  2 => 
    array
      0 => string 'data000' (length=7)
      1 => string 'data111' (length=7)
      2 => string 'data222' (length=7)
      3 => string 'data333' (length=7)
      4 => string 'data444' (length=7)
enobrev
Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org).
Gordon
Yes! You're right! Thank you too!!!
rongcon
+1  A: 

PHP has a native extension to parse HTML and XML with DOM:

$dom = new DOMDocument;
$dom->loadHTML( $htmlContent );
$rows = array();
foreach( $dom->getElementsByTagName( 'tr' ) as $tr ) {
    $cells = array();
    foreach( $tr->getElementsByTagName( 'td' ) as $td ) {
        $cells[] = $td->nodeValue;
    }
    $rows[] = $cells;
}

Adjust to your liking. Search StackOverflow or have a look at the PHP Manual or go through some of my answers to learn more about it's usage.

Gordon
Wow! Thank you! I've done it! You're awesome!!!
rongcon