Good morning Stack Overflow,

I'm still busy with my web crawler and I just need some final help. Because crawling the web can take a lot of time, I want to use pcntl_fork() to create multiple child processes and split the work into parts.

  1. Master - crawls the domain
  2. Child - when it receives a link, it must crawl that link found on the domain
  3. Child - must do the same as 2 when it receives a new link

Can I create as many children as I want, or do I have to set a maximum number of children?

Here's my code:

class MyCrawler extends PHPCrawler 
{


  function handlePageData(&$page_data) 
  { // CHECK DOMAIN
$domain = $_POST['domain'];
$keywords = $_POST['keywords'];
//$tags = get_meta_tags($page_data["url"]);
//$iKeyFound = null;


$find = $keywords;
$str = file_get_contents($page_data["url"]);
if(strpos($str, $find) !== false && $page_data["received"] == true) // strpos returns 0 for a match at the start, so compare against false
{           
    $keywords = $_POST['keywords'];
    if($page_data["header"]){
    echo "<table border='1' >";
    echo "<tr><td width='300'>Status:</td><td width='500'> ".strtok($page_data["header"], "\n")."</td></tr>";}
    else echo "<table border='1' >";

    // PRINT FIRST LINE

    echo "<tr><td>Page requested:</td><td> ".$page_data["url"]."</td></tr>";
    // PRINT WEBSITE STATUS

    // PRINT WEB PAGE
    echo "<tr><td>Referer-page:</td><td> ".$page_data["referer_url"]."</td></tr>";

    // CONTENT RECEIVED?
    if ($page_data["received"]==true)
      // bytes / 1024 = kilobytes
      echo "<tr><td>Content received: </td><td>".round($page_data["bytes_received"] / 1024, 2) . " Kbytes</td></tr></table>";
    else
      echo "<tr><td>Content:</td><td> Not received</td></tr></table>";


    $domain = $_POST['domain'];
    $link = mysql_connect('localhost', 'crawler', 'DRZOIDBERGGG');

    if (!$link) 
    {
        die('Could not connect: ' . mysql_error());
    }

    mysql_select_db("crawler");
    if(empty($page_data["referer_url"]))
    $page_data["referer_url"] = $page_data["url"];

    $str = strip_tags($str, '<p><b>'); // strip_tags returns the result, so it has to be assigned
    $matches = $keywords;
    //$match = preg_match_all("'/<(*.?)(*.?)>(*.?)'".$keywords."'(*.?)<\/($1)>/'", $str, $matches, PREG_SET_ORDER);
    //echo $match;

    $doc = new DOMDocument();
    $doc->loadHTML($str);

    $xPath = new DOMXpath($doc);
    $xPathQuery = "//text()[contains(translate(.,'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'), '".strtoupper($keywords)."')]";
    $elements = $xPath->query($xPathQuery);

    if($elements->length > 0){

    foreach($elements as $element){
        print "Found: " .$element->nodeValue."<br />";

        // Check and store every match, not only the last one; escape all values before they go into SQL
        $data = mysql_real_escape_string($element->nodeValue);
        $result = mysql_query("SELECT * FROM crawler WHERE data = '".$data."' ") ;

        if(mysql_num_rows($result)>0)
        echo 'Row already exists';

        else{ 
        echo 'added';
        mysql_query("INSERT INTO crawler (id, domain, url, keywords, data) VALUES ('', '".mysql_real_escape_string($page_data["referer_url"])."', '".mysql_real_escape_string($page_data["url"])."', '".mysql_real_escape_string($keywords)."', '".$data."' )");
        }
    }}

    echo '<br>';
    echo "<br><br>";
    echo str_pad(" ", 5000); // "Force flush", workaround
    flush();



} // end if
  } // end handlePageData
} // end class MyCrawler

Forgot to say: I need a workaround for 32-bit Windows (x86)!

pcntl_fork() is not supported on my client machine.

+1  A: 

I wonder if you wouldn't be better served by going with something like Gearman for this.

It's a job manager that runs on your system: you submit jobs to it (via PHP if you like), and it assigns them to workers (which can also be written in PHP), which then report back with their results. It's pretty robust and flexible in that you can let it run more workers to handle more workload.
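
To make the submit/worker split concrete, here is a rough sketch of what it could look like. It assumes the PECL gearman extension is installed and a Gearman server is listening on 127.0.0.1:4730; the job name `crawl_link` and the function names are made up for illustration, and the actual crawl is left as a placeholder.

```php
<?php
// Sketch only: assumes the PECL gearman extension and a Gearman server on
// 127.0.0.1:4730. The job name "crawl_link" is a made-up example.

// Worker callback: receives one job whose workload is a URL.
function crawlLinkJob($job)
{
    $url = $job->workload();
    // The real crawl of $url would go here; this sketch returns a marker string.
    return "crawled " . $url;
}

// Master side: queue one background job per link without waiting for results.
function queueLinks(array $links)
{
    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730);
    foreach ($links as $url) {
        $client->doBackground('crawl_link', $url);
    }
}

// Worker side: run this in as many separate PHP processes as needed.
function runWorker()
{
    $worker = new GearmanWorker();
    $worker->addServer('127.0.0.1', 4730);
    $worker->addFunction('crawl_link', 'crawlLinkJob');
    while ($worker->work()) {
        // Each iteration handles one job.
    }
}
```

The master would call `queueLinks()` with the links it finds, and each worker process would call `runWorker()`. Starting more worker processes is how you would scale it up.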

Fanis
Very nice, but it's not what I'm looking for ;) +1 anyway.
Jordy
If everything has to sit on win32 then yes, Gearman is not suitable at the moment. I'm afraid I can't help you with pcntl_fork but best of luck with it :)
Fanis
A: 

shell_exec does the job, but I don't know how to use it.
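
One possible way to use it, as a sketch rather than a tested recipe: shell_exec() itself waits for the command to finish and returns its output, so this swaps in popen()/pclose() with `start /B` on Windows, a commonly used workaround for starting a process without waiting for it. `worker.php` is a hypothetical script that crawls the URL passed as its first argument, and it assumes `php` is on the PATH.

```php
<?php
// Sketch only: worker.php is a hypothetical script that crawls the URL
// given as its first command-line argument.

// Build the command line that starts one detached worker.
function buildWorkerCommand($phpBinary, $script, $url, $isWindows)
{
    $cmd = escapeshellarg($phpBinary) . ' ' . escapeshellarg($script) . ' ' . escapeshellarg($url);
    if ($isWindows) {
        // "start /B" runs the command in the background without a new window.
        // The empty "" is the window title that "start" expects before a quoted command.
        return 'start /B "" ' . $cmd;
    }
    // Unix-like shells: discard output and background the process with "&".
    return $cmd . ' > /dev/null 2>&1 &';
}

// Start a worker for one URL and return without waiting for it to finish.
function launchWorker($url)
{
    $isWindows = strtoupper(substr(PHP_OS, 0, 3)) === 'WIN';
    $cmd = buildWorkerCommand('php', 'worker.php', $url, $isWindows);
    $handle = popen($cmd, 'r');
    if ($handle === false) {
        return false;
    }
    pclose($handle);
    return true;
}
```

The master crawler would call `launchWorker($url)` for each link it finds. Each call starts a full PHP process, so it is probably wise to cap how many are started at once.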

Jordy
A: 

Look into this: http://in.php.net/manual/en/ref.pcntl.php#37369
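
On the question of how many children to create: as far as I know PHP itself imposes no fixed limit, but every child is a full process, so memory and the operating system's process limits are the practical ceiling and a cap is sensible. A minimal sketch of a capped pool follows, assuming the pcntl extension (Unix-like systems, command line only, so not the Windows client). `crawlLink()` and `crawlInChildren()` are hypothetical names, and the crawl itself is a placeholder.

```php
<?php
// Sketch only: needs the pcntl extension, which is not available on Windows.

// Placeholder for the real per-link crawl (hypothetical).
function crawlLink($url)
{
    echo "child " . getmypid() . " crawling " . $url . "\n";
}

// Fork one child per link, never running more than $maxChildren at once.
// Returns the number of children that were waited for.
function crawlInChildren(array $links, $maxChildren)
{
    $running = 0;
    $reaped = 0;
    foreach ($links as $url) {
        // At the cap: block until one child exits before forking another.
        if ($running >= $maxChildren) {
            pcntl_wait($status);
            $running--;
            $reaped++;
        }
        $pid = pcntl_fork();
        if ($pid === -1) {
            die("fork failed\n");
        }
        if ($pid === 0) {
            // Child: open any database connection here, after the fork.
            crawlLink($url);
            exit(0);
        }
        // Parent: one more child is running.
        $running++;
    }
    // Wait for the remaining children so none are left as zombies.
    while ($running > 0) {
        pcntl_wait($status);
        $running--;
        $reaped++;
    }
    return $reaped;
}
```

Something like `crawlInChildren($links, 4)` would crawl the links four at a time. The right cap depends on the machine and on how heavy each crawl is.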

Phill Pafford