Hello All,

So here is my problem in a nutshell. I'm working on a web scraping app for work and I've run into a snag. I'm trying to load the HTML markup of a site using cURL, then use DOMDocument and XPath to find specific node values in that HTML.

Initially, the user plugs in a URL for a page that displays the information they want to pull out of the site. This information can be things like prices, names, descriptions, etc. Each item they want is described by a filter string like this:

SPAN:3 P:1 TD:1 TR:1 TBODY:1 TABLE:1 TD:2 TR:1 TBODY:1 TABLE:1 DIV:1 DIV#content-right-wrapper

I have a script that turns this into an XPath query, which seems to be working fine. The result I get from it is:

//div[@id='content-right-wrapper']/div/table/tbody/tr/td[2]/table/tbody/tr/td/p/span[3]

From there I get the page's HTML using cURL and pass it into DOMDocument's loadHTML method (I've also tried load, loadXML, and loadHTMLFile). Then I create my XPath object, register a namespace (I've also tried without one) and run my XPath query with evaluate (I've tried query as well). If I loop over the results I get no output, and var_dump shows an empty DOMNodeList object.

I've gone through pages and pages on Google but I can't find anything that explains what I'm doing wrong, so I'm hoping someone here knows more than I do :-)

Here is my code:

$filter = 'SPAN:3 P:1 TD:1 TR:1 TBODY:1 TABLE:1 TD:2 TR:1 TBODY:1 TABLE:1 DIV:1 DIV#content-right-wrapper';
$xpath_query = '';
$terms = explode("|",$filter);
foreach ($terms as $term) {
    $tags = explode(' ',$term);
    foreach (array_reverse($tags) as $tag) {
        $tag = strtolower($tag);
        if (strpos($tag,'#')!==false) {
            $tagName        =   substr($tag,0,strpos($tag,'#'));
            $id             =   substr($tag,strpos($tag,'#')+1);
            $xpath_query   .=   "//".$tagName."[@id='".$id."']";
        } else if (strpos($tag,':')!==false) {
            $tagName        =   substr($tag,0,strpos($tag,':'));
            $total          =   substr($tag,strpos($tag,':')+1);
            $xpath_query   .=   '/'.$tagName.($total > 1 ? '['.$total.']':'');
        }
    }
}
$html  = new DOMDocument();
@$html->loadHTML(getPage('http://www.discoverthewind.com/VWT.php'));
file_put_contents('Page.txt',$html->saveHTML());

echo '<pre>';
var_dump($html);
echo '</pre>';

$xpath = new DOMXPath($html);
if (!$xpath->registerNamespace("x","http://www.w3.org/1999/xhtml")) die('Namespace failed');

echo '<pre>';
var_dump($xpath);
echo '</pre>';

echo 'Xpath Query: ' . $xpath_query . '<br />';
$results = $xpath->evaluate($xpath_query);

echo '<pre>';
var_dump($results);
echo '</pre>';

function getPage($url) {
    $ch = curl_init($url);
    $defaults = array(
        CURLOPT_RETURNTRANSFER => true,     // return web page
        CURLOPT_HEADER         => false,    // don't return headers
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING       => "",       // handle all encodings
        CURLOPT_USERAGENT      => "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9) Gecko/2008052906 Firefox/3.0", // who am i
        CURLOPT_AUTOREFERER    => true,     // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 0,        // 0 = wait indefinitely for the connection
        CURLOPT_SSL_VERIFYPEER => false     // skip SSL certificate verification (insecure)
    );
    curl_setopt_array($ch,$defaults);
    $content = curl_exec($ch);
    $out  = curl_getinfo($ch);
    $out['errno']   = curl_errno($ch);
    $out['errmsg']  = curl_error($ch);
    $out['content'] = $content;
    curl_close($ch);
    unset($ch,$content,$defaults,$url);
    return $out['content'];
}

I'm not sure if it has anything to do with the encoding. The page is UTF-8, but I don't see any method in the manual to set the encoding for DOMDocument. Maybe it's somewhere else, or I'm just blind?
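On the encoding question, one workaround that is often suggested is to prepend a charset meta tag to the markup before calling loadHTML, so the parser treats the bytes as UTF-8. Below is a minimal sketch of that idea. The markup and the variable names ($markup, $hint) are made up for illustration, and the file is assumed to be saved as UTF-8. If the page already declares its charset, the hint should not be needed.

```php
<?php
// Hypothetical UTF-8 markup that carries no charset declaration of its own.
// (This file itself is assumed to be saved as UTF-8.)
$markup = '<p id="name">Café Crème</p>';

// Prepend a charset meta tag so the parser reads the bytes as UTF-8.
$hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$doc = new DOMDocument();
$doc->loadHTML($hint . $markup);

// Read the text back through XPath to check it survived the round trip.
$xpath = new DOMXPath($doc);
$name  = $xpath->query('//p[@id="name"]')->item(0)->nodeValue;

echo $name . "\n";
```

The hint only affects how the input is decoded; the XPath queries themselves stay the same.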

Thanks for the help

OK, small update. I finally got some output, and I guess there is an issue with my XPath query. I tried using /html/body//a, which worked. I guess I'll need to read up on XPath queries a bit more and see if I can find where my mistake is. If anyone can point me in the right direction I'd appreciate it.
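One way to find where a long query stops matching is to run each successively longer prefix of it and print how many nodes come back. Here is a rough sketch of that approach. The function name traceXPath and the sample markup are hypothetical, and it assumes a simple path: a single leading "//", steps separated by "/", and no "/" inside predicates.

```php
<?php
// Report how many nodes each successively longer prefix of a simple XPath matches.
// Assumption: the query starts with "//" and no predicate contains "/".
function traceXPath(DOMXPath $xpath, string $query): array
{
    $steps  = explode('/', ltrim($query, '/'));
    $prefix = '/';   // becomes "//first-step" after the first iteration
    $counts = [];
    foreach ($steps as $step) {
        $prefix .= '/' . $step;
        $nodes = $xpath->query($prefix);
        $counts[$prefix] = ($nodes === false) ? 0 : $nodes->length;
    }
    return $counts;
}

// Hypothetical markup: the row has only one cell, so td[2] cannot match.
$doc = new DOMDocument();
$doc->loadHTML('<div id="c"><table><tr><td>x</td></tr></table></div>');

$counts = traceXPath(new DOMXPath($doc), "//div[@id='c']/table/tr/td[2]");
foreach ($counts as $prefix => $count) {
    echo $count . "  " . $prefix . "\n";
}
```

The first prefix that reports zero nodes points at the step that needs a closer look.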

Update #2:

I got it working. Turns out everything was loading just fine in DOMDocument; the issue was my XPath query, as I suspected earlier. The "tbody" steps in my query were the problem: the browser adds tbody to the DOM that my jQuery script walks, but it apparently isn't in the HTML that DOMDocument parses, so the path never matched. I adjusted the jQuery script, which walks up the DOM from the clicked element, to ignore the tbody tag. I also found that attribute values are case sensitive (go figure), so I adjusted the loop that builds my XPath query to lowercase only the tag name and leave the id alone. It now works for every page I throw at it.
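To illustrate the tbody point, here is a small sketch. It assumes the usual cause: the tbody came from the browser's DOM rather than from the page source, and DOMDocument does not add it when parsing. The markup and variable names are invented for the example, and the regular expression is just one possible way to drop the tbody steps.

```php
<?php
// Hypothetical markup: a table written without <tbody>, as many pages are.
$markup = '<div id="content"><table><tr><td>A</td><td>B</td></tr></table></div>';

$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Path as derived from a browser's DOM, where tbody has been added.
$browserPath = "//div[@id='content']/table/tbody/tr/td[2]";

// Same path with every tbody step (and any index on it) removed.
$sourcePath = preg_replace('#/tbody(\[\d+\])?(?=/|$)#i', '', $browserPath);

$withTbody    = $xpath->query($browserPath);
$withoutTbody = $xpath->query($sourcePath);

echo 'with tbody:    ' . $withTbody->length . " node(s)\n";
echo 'without tbody: ' . $withoutTbody->length . " node(s)\n";
foreach ($withoutTbody as $node) {
    echo $node->nodeValue . "\n";
}
```

If a page does include tbody in its source, stripping the step would break the match instead, so this is worth checking per page.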

Here is the fixed code for anyone who runs into a similar problem.

<?php
// Hard Coded filter string, from Filter Setup Module
$filter = 'SPAN#lblDimensions|A:1 SPAN#lblProdName';

// Translate filter string to Xpath query, supports or condition
$xpath_query = '';
$terms = explode("|",$filter);
foreach ($terms as $term) {
    $tags = explode(' ',$term);
    foreach (array_reverse($tags) as $tag) {
        if (strpos($tag,'#')!==false) {
            $tagName        =   substr(strtolower($tag),0,strpos($tag,'#'));
            $id             =   substr($tag,strpos($tag,'#')+1);
            $xpath_query   .=   "//".$tagName.'[@id="'.$id.'"]';
        } else if (strpos($tag,':')!==false) {
            $tagName        =   substr(strtolower($tag),0,strpos($tag,':'));
            $total          =   substr($tag,strpos($tag,':')+1);
            $xpath_query   .=   '/'.$tagName.($total > 1 ? '['.$total.']':'');
        }
    }
    if (count($terms) > 1) $xpath_query .= '|';
}
if (count($terms) > 1) $xpath_query = substr($xpath_query,0,-1);

// Setup DOMDocument object, load HTML
$html  = new DOMDocument();
@$html->loadHTML(getPage('http://www.ashleyfurniture.com/Showroom/Showroom.aspx?PageId=Showroom&CategoryID=9&ItemNo=W423-21&SetDomTab=1&SeriesNo=W423&CollectionId=&style=&PageNumber=1&IsClicked=1&CatPageNumber=1'));

// Setup XPath and run query
$xpath = new DOMXPath($html);
$results = $xpath->query($xpath_query);

// Display results
foreach ($results as $result) {
    echo $result->nodeValue.'<br />';
}

// CURL call to get the page content
function getPage($url) {
    $ch = curl_init($url);
    $defaults = array(
        CURLOPT_RETURNTRANSFER => true,     // return web page
        CURLOPT_HEADER         => false,    // don't return headers
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING       => "",       // handle all encodings
        CURLOPT_USERAGENT      => "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9) Gecko/2008052906 Firefox/3.0", // who am i
        CURLOPT_AUTOREFERER    => true,     // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 0,        // 0 = wait indefinitely for the connection
        CURLOPT_SSL_VERIFYPEER => false     // skip SSL certificate verification (insecure)
    );
    curl_setopt_array($ch,$defaults);
    $content = curl_exec($ch);
    $out  = curl_getinfo($ch);
    $out['errno']   = curl_errno($ch);
    $out['errmsg']  = curl_error($ch);
    $out['content'] = $content;
    curl_close($ch);
    unset($ch,$content,$defaults,$url);
    return $out['content'];
}
?>