I found http://simplehtmldom.sourceforge.net/, but it did not work for me.

When I tried to fetch http://php.net/manual/en/function.curl-setopt.php
and parse it to plain HTML, it failed and returned only a partial HTML page.

What I want to do is load an HTML page and get its components individually (the contents of every div and p, in their hierarchy). I like the features of simplehtmldom, so I am looking for a similar parser that copes with all kinds of markup, from the best to the worst.

+3  A: 

I often use DOMDocument::loadHTML, which works reasonably well in the general case, and I like querying the documents with XPath once they are loaded as DOM.

Unfortunately, I suppose that in some cases, if the HTML page is really too badly formed, parsing problems can occur... That's when you start understanding that respecting web standards is a great idea...
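To illustrate the approach, here is a minimal sketch, not a definitive implementation. The markup in $html is invented for the example (unclosed p tags), and libxml_use_internal_errors() is used so that parse problems are collected rather than raised as PHP warnings. DOMNode::getNodePath() gives each match's position in the hierarchy.

```php
<?php
// Hypothetical, deliberately sloppy markup: none of the <p> tags is closed.
$html = '<html><body><div id="main"><p>First<p>Second</div><div><p>Third</body></html>';

// Collect libxml parse problems instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$parseErrors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// Every <p> that sits somewhere inside a <div>, in document order.
$paragraphs = [];
foreach ($xpath->query('//div//p') as $p) {
    $paragraphs[] = trim($p->textContent);
    // getNodePath() shows where the node sits in the tree.
    echo $p->getNodePath(), ' => ', trim($p->textContent), "\n";
}

echo count($parseErrors), " parse problem(s) reported\n";
```

Whether the parser reports any problems for a given page depends on the libxml version, so the count is only informational here.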

Pascal MARTIN
Well, as someone who has to parse *other people's* code, it's entirely irrelevant whether I respect web standards or not :-)
Joey
@Johannes > indeed; but if you parse other people's HTML, chances are you'll have to produce HTML too, one day or another... And, that day, remembering the difficulties you had parsing crappy HTML might encourage you to write clean HTML (hopefully...)
Pascal MARTIN
A: 

Building on Pascal MARTIN's response...

I use a combination of cURL and XPath. Below is a method I use in one of my classes.

protected function _get_xpath($url) {
 $referer = 'http://www.whatever.com/';
 $useragent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

 // create curl resource
 $ch = curl_init();

 // set the user agent, the referer and the url
 curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
 curl_setopt($ch, CURLOPT_REFERER, $referer);
 curl_setopt($ch, CURLOPT_URL, $url);

 // return the transfer as a string and follow redirects
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
 curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

 // $output contains the response body, or false on failure
 $output = curl_exec($ch);
 //echo htmlentities($output);

 if ($output === false || curl_errno($ch)) {
  echo 'Curl error: ' . curl_error($ch);
 }
 else {
  $dom = new DOMDocument();
  // @ silences the warnings raised for malformed markup
  @$dom->loadHTML($output);
  $this->xpath = new DOMXPath($dom);
  $this->html = $output;
 }

 // close curl resource to free up system resources
 curl_close($ch);
}

You can then query the document structure using DOMXPath::evaluate and extract the information you want:

$resultDom = $this->xpath->evaluate("//span[@id='headerResults']/strong");
$this->results = $resultDom->item(0)->nodeValue;
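
One thing to watch with the two lines above: item(0) returns null when nothing matches, so reading nodeValue directly can fail on a page that lacks the element. A rough standalone sketch follows, using invented markup in $html instead of a fetched page. It also shows that evaluate() returns scalars for expressions such as count() and string().

```php
<?php
// Hypothetical markup standing in for a fetched page.
$html = '<html><body><span id="headerResults">Found <strong>42</strong> results</span></body></html>';

libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Node-set expression: evaluate() returns a DOMNodeList.
$nodes = $xpath->evaluate("//span[@id='headerResults']/strong");
$first = $nodes->item(0);           // null when nothing matched
$results = $first !== null ? $first->nodeValue : null;

// Scalar expressions come back as plain PHP values.
$missing = $xpath->evaluate("count(//span[@id='missing'])");
$text = $xpath->evaluate("string(//span[@id='headerResults']/strong)");

var_dump($results, $missing, $text);
```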
m3mbran3
A: 

I found the one that works best for my use: http://querypath.org/

goutham