Hello, what are the advantages and disadvantages of the various PHP HTML-parsing/scraping libraries, such as QP (QueryPath), Simple HTML DOM Parser, and phpQuery?

Of these, I've used QP, which failed to parse invalid HTML, and Simple HTML DOM Parser, which does a good job but tends to leak memory because of its object model. You can keep that under control by calling $object->clear(); unset($object); when you no longer need an object.
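
For example, the cleanup pattern looks like this when scraping many pages in a loop (a minimal sketch; the URL list is made up for illustration):

    <?php
    include("lib/simplehtmldom/simple_html_dom.php");

    // Hypothetical list of pages; the clear()/unset() pattern is what matters.
    $urls = array("http://example.com/page1", "http://example.com/page2");

    foreach ($urls as $url) {
        $dom = new simple_html_dom();
        $dom->load(file_get_contents($url));

        foreach ($dom->find("a") as $node) {
            echo $node->href . "\n";
        }

        // Break the parser's internal circular references so PHP can free
        // the memory, then drop our own reference.
        $dom->clear();
        unset($dom);
    }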

Are there any other scrapers? What are your experiences with them? I'm going to make this a community wiki; maybe we'll build a useful list of libraries for scraping.


I did some tests based on Byron's answer:

    <?php
    include("lib/simplehtmldom/simple_html_dom.php");
    include("lib/phpQuery/phpQuery/phpQuery.php");


    echo "<pre>";

    $html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
    $data['pq'] = $data['dom'] = $data['simple_dom'] = array();

    // Benchmark 1: built-in DOMDocument + DOMXPath
    $timer_start = microtime(true);

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $x = new DOMXPath($dom);

    foreach($x->query("//a") as $node)
    {
         $data['dom'][] = $node->getAttribute("href");
    }

    foreach($x->query("//img") as $node)
    {
         $data['dom'][] = $node->getAttribute("src");
    }

    foreach($x->query("//input") as $node)
    {
         $data['dom'][] = $node->getAttribute("name");
    }

    $dom_time =  microtime(true) - $timer_start;
    echo "dom: \t\t $dom_time . Got ".count($data['dom'])." items \n";






    $timer_start = microtime(true);
    $doc = phpQuery::newDocument($html);
    // phpQuery's foreach yields plain DOMElement nodes, so attributes are
    // read with getAttribute() rather than ->href-style property access.
    foreach( $doc->find("a") as $node)
    {
       $data['pq'][] = $node->getAttribute("href");
    }

    foreach( $doc->find("img") as $node)
    {
       $data['pq'][] = $node->getAttribute("src");
    }

    foreach( $doc->find("input") as $node)
    {
       $data['pq'][] = $node->getAttribute("name");
    }
    $time =  microtime(true) - $timer_start;
    echo "PQ: \t\t $time . Got ".count($data['pq'])." items \n";









    $timer_start = microtime(true);
    $simple_dom = new simple_html_dom();
    $simple_dom->load($html);
    foreach( $simple_dom->find("a") as $node)
    {
       $data['simple_dom'][] = $node->href;
    }

    foreach( $simple_dom->find("img") as $node)
    {
       $data['simple_dom'][] = $node->src;
    }

    foreach( $simple_dom->find("input") as $node)
    {
       $data['simple_dom'][] = $node->name;
    }
    $simple_dom_time =  microtime(true) - $timer_start;
    echo "simple_dom: \t $simple_dom_time . Got ".count($data['simple_dom'])." items \n";


    echo "</pre>";

and got:

    dom:         0.00359296798706 . Got 115 items
    PQ:          0.010568857193 . Got 115 items
    simple_dom:  0.0770139694214 . Got 115 items

Answer (+4):

I used to use Simple HTML DOM exclusively until some bright SO'ers showed me the light, hallelujah.

Just use the built-in DOM functions. They are written in C and are part of the PHP core. They are faster and more efficient than any third-party solution, and with Firebug, getting an XPath query is very simple. This simple change has made my PHP-based scrapers run faster while saving my precious time.
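
For instance (a sketch; the URL and the XPath expression are placeholders standing in for whatever you copy out of Firebug):

    <?php
    $html = file_get_contents("http://example.com/");

    $dom = new DOMDocument();
    @$dom->loadHTML($html);   // @ hides the warnings that invalid markup triggers
    $xpath = new DOMXPath($dom);

    // An expression like this can be copied straight from Firebug's
    // "Copy XPath" context-menu entry (hypothetical path, for illustration).
    foreach ($xpath->query("//div[@id='content']//a") as $node) {
        echo $node->getAttribute("href") . "\n";
    }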

My scrapers used to take ~60 megabytes to scrape 10 sites asynchronously with cURL, even with the Simple HTML DOM memory fix you mentioned.

Now my PHP processes never go above 8 megabytes.
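
If you want to check that on your own scrapers, PHP's built-in memory counters make it easy (a minimal sketch; the scraping job itself is elided):

    <?php
    // ... run your scraping job here ...

    // Current and peak memory of this PHP process, in megabytes.
    printf("current: %.1f MB\n", memory_get_usage(true) / 1048576);
    printf("peak:    %.1f MB\n", memory_get_peak_usage(true) / 1048576);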

Highly recommended.

EDIT

Okay, I did some benchmarks. The built-in DOM is at least an order of magnitude faster.

    Built in php DOM: 0.007061
    Simple html  DOM: 0.117781

    <?php
    include("../lib/simple_html_dom.php");

    $html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
    $data['dom'] = $data['simple_dom'] = array();

    // Benchmark the built-in DOMDocument + DOMXPath
    $timer_start = microtime(true);

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $x = new DOMXPath($dom);

    foreach($x->query("//a") as $node)
    {
         $data['dom'][] = $node->getAttribute("href");
    }

    foreach($x->query("//img") as $node)
    {
         $data['dom'][] = $node->getAttribute("src");
    }

    foreach($x->query("//input") as $node)
    {
         $data['dom'][] = $node->getAttribute("name");
    }

    $dom_time =  microtime(true) - $timer_start;

    echo "built in php DOM : $dom_time\n";

    // Benchmark Simple HTML DOM
    $timer_start = microtime(true);
    $simple_dom = new simple_html_dom();
    $simple_dom->load($html);
    foreach( $simple_dom->find("a") as $node)
    {
       $data['simple_dom'][] = $node->href;
    }

    foreach( $simple_dom->find("img") as $node)
    {
       $data['simple_dom'][] = $node->src;
    }

    foreach( $simple_dom->find("input") as $node)
    {
       $data['simple_dom'][] = $node->name;
    }
    $simple_dom_time =  microtime(true) - $timer_start;

    echo "simple html  DOM : $simple_dom_time\n";
Byron Whitlock
this won't work for invalid markup. How much faster is this versus simple dom?
Quamis
This **does** work for invalid markup. I don't have benchmarks but it is at least an order of magnitude faster. On large pages, simple html dom would take 1-2 seconds. The built in DOM does it in the blink of an eye. I've written many scrapers with this and I would never use simple html dom for anything ever again.
Byron Whitlock
@Quamis Notice the @ in front of loadHTML(). With that removed, you will see a ton of warnings from invalid HTML being coerced into the DOM tree (see the sketch after these comments). Works for browsers, works for PHP too ;)
Byron Whitlock
you're right, it's way faster and it loads invalid HTML; just re-tested it now
Quamis
You can find this benchmark at http://whitlock.ath.cx/FastCrawl/benchmark.php
Byron Whitlock
I edited your answer to use microtime(true) instead of plain microtime().
Quamis
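
For reference, the warning suppression discussed in the comments above can also be handled with libxml's error collection instead of the @ operator, so the parse warnings can be inspected afterwards (a minimal sketch; the URL is a placeholder):

    <?php
    $html = file_get_contents("http://example.com/");

    libxml_use_internal_errors(true);   // collect libxml warnings instead of emitting them

    $dom = new DOMDocument();
    $dom->loadHTML($html);              // no @ needed now

    foreach (libxml_get_errors() as $error) {
        // Each libXMLError carries a message plus the line it occurred on.
        echo trim($error->message) . " (line " . $error->line . ")\n";
    }
    libxml_clear_errors();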