ansaurus

Question

Answer 1

+3 A:

I often use DOMDocument::loadHTML, which works reasonably well in the general case, and I like querying the documents with XPath once they are loaded as DOM.
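As a minimal sketch of that approach (the HTML string and the class name `title` are made up for illustration), loading markup with DOMDocument::loadHTML and querying it with DOMXPath might look like this:

```php
<?php
// Hypothetical input; any HTML string would do.
$html = '<html><body><h1 class="title">Hello</h1><p>First</p><p>Second</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// query() returns a DOMNodeList of the matching nodes.
$paragraphs = [];
foreach ($xpath->query('//p') as $node) {
    $paragraphs[] = $node->nodeValue;
}

// item(0) is the first match; nodeValue is its text content.
$title = $xpath->query("//h1[@class='title']")->item(0)->nodeValue;
```

Here `$title` holds the heading text and `$paragraphs` the text of each paragraph, in document order.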

Unfortunately, I suppose that in some cases, if the HTML page is really too badly formed, parsing problems can occur... That's when you start to understand that respecting web standards is a great idea...
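One way to see what the parser complains about, rather than letting PHP print warnings, is to have libxml collect its errors. This is only a sketch on a made-up malformed snippet; which messages are reported depends on the libxml2 version installed:

```php
<?php
// Hypothetical malformed input: unclosed <p> and <b>, plus a stray </div>.
$html = '<html><body><p>Unclosed paragraph<b>bold text</div></body></html>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Each entry is a LibXMLError carrying a message and a line number.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
}
libxml_clear_errors();

// Restore whatever error-handling mode was active before.
libxml_use_internal_errors($previous);

// The parser usually recovers, so the tree can still be queried.
$xpath = new DOMXPath($dom);
$bold = $xpath->query('//b')->item(0)->nodeValue;
```

Inspecting `$messages` shows how far the page is from well-formed, while the recovered tree remains usable in many cases.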

Pascal MARTIN 2009-12-09 12:05:59

Well, as someone who has to parse *other people's* code, whether I respect web standards or not is entirely irrelevant :-)

Joey 2009-12-09 12:07:42

@Johannes: indeed; but if you parse other people's HTML, chances are you'll have to produce HTML too, one day or another... And on that day, remembering the difficulties you had parsing crappy HTML might encourage you to write clean HTML (hopefully...)

Pascal MARTIN 2009-12-09 12:10:30

Answer 2

A:

Building on Pascal MARTIN's response...

I use a combination of cURL and XPath. Below is a function I use in one of my classes.

protected function _get_xpath($url) {
 $referer = 'http://www.whatever.com/';
 $useragent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
 // create curl resource
 $ch = curl_init();

 // set the request headers and the target url
 curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
 curl_setopt($ch, CURLOPT_REFERER, $referer);
 curl_setopt($ch, CURLOPT_URL, $url);

 // return the transfer as a string and follow redirects
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
 curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

 // $output contains the response body, or false on failure
 $output = curl_exec($ch);
 //echo htmlentities($output);

 if (curl_errno($ch)) {
  echo 'Curl error: ' . curl_error($ch);
 }
 else {
  // @ silences the warnings libxml raises on malformed markup
  $dom = new DOMDocument();
  @$dom->loadHTML($output);
  $this->xpath = new DOMXPath($dom);
  $this->html = $output;
 }

 // close curl resource to free up system resources
 curl_close($ch);
}

You can then query the document structure with evaluate() and extract the information you want:

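$resultDom = $this->xpath->evaluate("//span[@id='headerResults']/strong");
// item(0) is null when nothing matches, so guard before reading nodeValue
$this->results = $resultDom->length > 0 ? $resultDom->item(0)->nodeValue : null;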
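One detail that may be useful here: DOMXPath::evaluate can also return a typed scalar when the expression yields one, which avoids the item(0) lookup. The following is a sketch on a made-up snippet that reuses the headerResults id from above:

```php
<?php
// Hypothetical page fragment for illustration.
$html = '<html><body><span id="headerResults"><strong>42 results</strong></span></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// string() yields a PHP string; it is '' when nothing matches.
$results = $xpath->evaluate("string(//span[@id='headerResults']/strong)");

// count() yields a float with the number of matching nodes.
$matches = $xpath->evaluate("count(//span[@id='headerResults']/strong)");
```

With string(), a missing node gives an empty string instead of a null item, so whether that is preferable depends on how the caller wants to detect "no match".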

m3mbran3 2009-12-09 13:06:04

Answer 3

A:

The best one I found for my use is here: http://querypath.org/

goutham 2009-12-13 07:54:25

Need a good HTML parser for PHP
