For example, to extract key/value pairs from HTML like the following:

<tr> 
          <td id="td3"  class="td3"  bgcolor="#FFFFFF" colspan="4">■ Related Information </td>

        </tr>
        <tr> 
          <td id="td5" class="td5" width="10%">job title:</td>
          <td id="td5" class="td5" width="90%" colspan="3">Sales Representitive</td>
        </tr>
        <tr> 
          <td id="td5" class="td5" width="10%">Date:</td>

          <td id="td5" class="td5" width="40%">2009-9-15</td>
        </tr>
        <tr> 
          <td id="td5" class="td5" width="10%">Location:</td>

          <td id="td5" class="td5" width="40%">Jiangyin</td>
        </tr>
        <tr> 
          <td id="td5" class="td5" width="10%">Degree:</td>
          <td id="td5" class="td5" width="40%">Bachelor</td>

          <td id="td5" class="td5" width="10%">Major:</td>
          <td id="td5" class="td5" width="40%">No limit</td>
        </tr>
        <tr> 
          <td id="td5" class="td5" width="10%">Sex:</td>
          <td id="td5" class="td5" width="40%">No limit</td>
        </tr>
        <tr> 
          <td id="td5" class="td5" width="10%">Type:</td>
          <td id="td5" class="td5" width="40%">Fulltime</td>
          <td id="td5" class="td5" width="10%"></td>
          <td id="td5" class="td5" width="40%"></td>
        </tr>

I'm tired of writing long regular expressions. Is there an easier way to do this?

+5  A: 

Use an HTML or XML parser like DOMDocument or SimpleXML. Then you can simply traverse the DOM and fetch the data you want.
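
As a rough sketch of what that traversal could look like with DOMDocument: the snippet below assumes the cells in each row alternate label, value, label, value, as in the question's table. The sample string and the variable names ($doc, $pairs) are illustrative only.

```php
<?php
// Shortened sample modelled on the table in the question (illustrative input).
$html = <<<HTML
<table>
  <tr><td class="td3" colspan="4">Related Information</td></tr>
  <tr><td class="td5">job title:</td><td class="td5">Sales Representitive</td></tr>
  <tr><td class="td5">Degree:</td><td class="td5">Bachelor</td>
      <td class="td5">Major:</td><td class="td5">No limit</td></tr>
  <tr><td class="td5">Type:</td><td class="td5">Fulltime</td>
      <td class="td5"></td><td class="td5"></td></tr>
</table>
HTML;

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$pairs = array();
foreach ($doc->getElementsByTagName('tr') as $row) {
    $cells = $row->getElementsByTagName('td');
    // Assumption: cells come in label/value pairs; a lone heading cell is skipped.
    for ($i = 0; $i + 1 < $cells->length; $i += 2) {
        $key = rtrim(trim($cells->item($i)->textContent), ':');
        if ($key !== '') {
            $pairs[$key] = trim($cells->item($i + 1)->textContent);
        }
    }
}

print_r($pairs);
```

Because the parser tolerates sloppy markup, this keeps working when attributes or whitespace change, which is where a hand-written pattern tends to break.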

Gumbo
Can both of them be used to parse HTML?
Shore
@Shore: SimpleXML can only parse XML. But DOMDocument can parse both HTML and XML.
Gumbo
+2  A: 

You could use some simple regular expressions:

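$values = array();
// First split the table into rows, then pull the cell text out of each row.
if (preg_match_all("/<tr[^>]*>(.*?)<\/tr>/is", $html, $matches)) {
    foreach ($matches[1] as $match) {
        // Empty cells do not match [^<]+ and are skipped.
        if (preg_match_all("/<td[^>]*>([^<]+)<\/td>/is", $match, $tds)) {
            // $tds[1] holds the text of every non-empty cell in this row.
            $values[] = array_map('trim', $tds[1]);
        }
    }
}

var_dump($values);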

It is a lot simpler when you split the work into separate patterns instead of using one large pattern.
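
The code above yields one array of cell texts per row, not key/value pairs yet. One possible way to pair them up, assuming each row alternates label and value, is sketched below. The hard-coded $values is a stand-in for the result of the matching step, and $pairs is a name made up for this example.

```php
<?php
// Stand-in for the per-row cell arrays produced by the regex step (illustrative data).
$values = array(
    array('Related Information'),
    array('job title:', 'Sales Representitive'),
    array('Degree:', 'Bachelor', 'Major:', 'No limit'),
);

$pairs = array();
foreach ($values as $row) {
    // Split each row into consecutive [label, value] chunks.
    foreach (array_chunk($row, 2) as $chunk) {
        // A chunk with a single element (e.g. a heading row) has no value; skip it.
        if (count($chunk) === 2) {
            $pairs[rtrim(trim($chunk[0]), ':')] = trim($chunk[1]);
        }
    }
}

print_r($pairs);
```

Note that the cell pattern skips empty cells, so a label whose value cell is empty would shift the pairing in that row; this sketch does not handle that case.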

bucabay
+1  A: 

You should try the lesser-known PHP Simple HTML DOM Parser. It lets you do things like this:

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>';


// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html; // Output: <div id="hello">foo</div><div id="world" class="bar">World</div>
ryeguy