I'm creating a web crawler and everything works, except for one problem.

With file_get_contents($page_data["url"]); I get the content of a webpage. The page is then checked to see whether one of my keywords exists on it.

$find = $keywords;
$str = file_get_contents($page_data["url"]);

if (strpos($str, $find) !== false) // "== true" would miss a keyword at position 0

When I insert the data into the MySQL database, I only want the content of the div in which the keyword was found.

I know I have to use the DOM, but I'm new to DOMDocument.

EXAMPLE: http://crawler.tmp.remote.nl/example.php

A: 

I think there are some problems with your desired solution:

  1. The HTML may not be valid and you have to "repair" it to be able to parse it
  2. The information might not be stored in a DIV but in a TITLE, P, H1-H6, TD or anything else
  3. The keyword can also appear in some attributes such as the meta description or the meta keywords.
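
Points 1 and 3 can be illustrated with a small sketch. It assumes PHP's DOM extension is available; the sample markup and the variable names are made up for the example. DOMDocument::loadHTML() repairs the unclosed tags while parsing, and the keyword is then found in a meta attribute rather than in the visible text:

```php
<?php
// Hypothetical sample page: unclosed <p> and <div> tags, keyword only in the meta description
$html = '<html><head><meta name="description" content="All about the keyword"></head>'
      . '<body><div id="main"><p>Unclosed paragraph<p>Second paragraph</body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);              // libxml closes the open tags on its own
libxml_clear_errors();

// After the repair both paragraphs are separate elements
$paragraphs = $doc->getElementsByTagName('p')->length;

// Point 3: the keyword can live in an attribute instead of in the page text
$inMeta = false;
foreach ($doc->getElementsByTagName('meta') as $meta) {
    if (strpos($meta->getAttribute('content'), 'keyword') !== false) {
        $inMeta = true;
    }
}

echo "Paragraphs: $paragraphs\n";
echo 'Keyword in meta tag: ' . ($inMeta ? 'yes' : 'no') . "\n";
```

A plain strpos() on the raw HTML would report a hit for this page even though no div contains the keyword.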

Usually you would use some XPATH query to search in a DOM tree, but I really don't know how to search for a node that has a child node of type "text node" with a specific keyword in it.

You might want to have a look at Lucene which offers you some search engine functionality. There are also some HTML parsers for Lucene which might be able to solve your problem.

EDIT: You could search backwards from the matched keyword for the nearest opening tag and then search forwards for the next corresponding closing tag. But that might not actually be the closing tag of the parent DIV.
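
A rough sketch of that string-based idea, mainly to show where it goes wrong. The sample markup is invented, and the search is limited to div tags to keep it short:

```php
<?php
// Hypothetical sample: the keyword sits in div "a", right after the inner div "b"
$html    = '<div id="a"><div id="b">first</div> the keyword here</div>';
$keyword = 'keyword';

$pos = strpos($html, $keyword);

// Nearest opening <div before the match
$start = strrpos(substr($html, 0, $pos), '<div');

// Next closing </div> after the match
$end = strpos($html, '</div>', $pos) + strlen('</div>');

$fragment = substr($html, $start, $end - $start);
echo $fragment, "\n";
```

The fragment starts at div "b", which was already closed before the keyword, and ends at the closing tag of div "a". So the result is neither the parent div nor valid markup, which is why a real parser is the safer route.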

EDIT: I found a question about searching for text within a tag: http://stackoverflow.com/questions/598722/how-to-match-a-text-node-then-follow-parent-nodes-using-xpath#answer-598732. So you might try to import the whole HTML into a SimpleXML or DOMDocument object and then use XPath to search for the string and its parent DIV.
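
A minimal sketch of that approach with DOMDocument, not a finished solution. The sample page, the keyword and the variable names are assumptions, and the keyword must not contain a single quote because it is pasted straight into the XPath expression:

```php
<?php
// Hypothetical sample page and keyword
$html = '<html><body>'
      . '<div id="intro"><p>Nothing to see here.</p></div>'
      . '<div id="news"><h2>Latest</h2><p>Our Crawler found this.</p></div>'
      . '</body></html>';
$keyword = 'crawler';

libxml_use_internal_errors(true);   // real pages are rarely valid HTML
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match text nodes case-insensitively, then step up to the nearest enclosing div
$query = sprintf(
    "//text()[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '%s')]/ancestor::div[1]",
    strtolower($keyword)
);

$found = array();
foreach ($xpath->query($query) as $div) {
    // Keyed by the div's id attribute; the value is the text of the whole div
    $found[$div->getAttribute('id')] = trim($div->textContent);
}

print_r($found);
```

Only the "news" div is returned here, and its text content is what could be stored instead of the whole page.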

Kau-Boy
Well, it's a task I need to write myself, I just need one function :p
Jordy
Maybe the question from my edit will help you
Kau-Boy
Can you maybe show me an example of importing HTML and using XPath on it? I really don't understand it, maybe I'm too stupid :(
Jordy
A: 

$str = file_get_contents($page_data["url"]);

if (strpos($str, $find) !== false) // "== true" misses a keyword at position 0
{   
    echo $page_data["referer_url"]. ' - found';

    $keywords = $_POST['keywords'];

    // Open the table in both cases; the status row is only printed when a header exists
    echo "<table border='1' >";
    if ($page_data["header"]) {
        echo "<tr><td width='300'>Status:</td><td width='500'> ".strtok($page_data["header"], "\n")."</td></tr>";
    }

    // Print the requested page
    echo "<tr><td>Page requested:</td><td> ".$page_data["url"]."</td></tr>";

    // Print the referring page
    echo "<tr><td>Referer-page:</td><td> ".$page_data["referer_url"]."</td></tr>";

    // Was any content received?
    if ($page_data["received"] == true)
    {
      // Assuming bytes_received is a byte count: divide by 1024 (not 8) to get KB
      echo "<tr><td>Content received: </td><td>".round($page_data["bytes_received"] / 1024, 2) . " KB</td></tr></table>";
    }
    else
    {
      echo "<tr><td>Content:</td><td>Not received</td></tr></table>";
    }

    $domain = $_POST['domain'];
    $link = mysql_connect('localhost', 'crawler', 'password');

    if (!$link) 
    {
        die('Could not connect: ' . mysql_error());
    }

    mysql_select_db("crawler");
    if(empty($page_data["referer_url"]))
    $page_data["referer_url"] = $page_data["url"];

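    // strip_tags() returns the cleaned string instead of changing $str in place
    $str = strip_tags($str, '<p><b>');

    // Escape every value that goes into the query, not just the page content.
    // Note: the mysql_* functions were removed in PHP 7; use mysqli or PDO there.
    mysql_query("INSERT INTO crawler (id, domain, url, keywords, data) VALUES ('', '".mysql_real_escape_string($page_data["referer_url"])."', '".mysql_real_escape_string($page_data["url"])."', '".mysql_real_escape_string($keywords)."', '".mysql_real_escape_string($str)."')");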

    echo "<br><br>";
    echo str_pad(" ", 5000); // "Force flush", workaround
    flush();

}

As you can see, I can already find the keywords; now I need the part around them. Somebody told me I have to read the page into a tree structure, after which I can use the element around the keyword I found (div, p, etc.).

Jordy
A: 

Maybe this will help in a general way. The code finds all div elements that have an 'id' attribute and whose text contains "keyword", then displays the 'id' value and the text value of each element (it assumes the document is well-formed):

$sxml = new SimpleXMLElement(file_get_contents($page_data['url']));

foreach ($sxml->xpath('//div[@id]') as $div) {
    if (strpos((string) $div, 'keyword') !== false) {
        echo $div->attributes()->id . ': ' . trim($div) . "\n";
    }
}
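
If the page is not well-formed, one possible workaround is to let DOMDocument parse the HTML first and hand the result to SimpleXML with simplexml_import_dom(). This is only a sketch; the sample markup is invented and would make new SimpleXMLElement() fail because of the unclosed tags:

```php
<?php
// Hypothetical sample: not well-formed XML (unclosed <br>, missing </p>)
$html = '<html><body><div id="hit">A keyword<br>here</div>'
      . '<div id="miss"><p>Nothing</div></body></html>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);              // tolerant HTML parser instead of strict XML
libxml_clear_errors();

// Convert the repaired DOM tree into a SimpleXMLElement
$sxml = simplexml_import_dom($doc);

$ids = array();
foreach ($sxml->xpath('//div[@id]') as $div) {
    if (strpos((string) $div, 'keyword') !== false) {
        $ids[] = (string) $div['id'];
    }
}

echo implode(', ', $ids), "\n";
```

The loop itself is the same as above; only the way the document is loaded changes.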
GZipp
Thanks for your answer, but I'm still not sure where to put it. In my constructor, I guess?
Jordy
Well, I tried a few ways but I'm still getting a lot of errors, such as: Warning: SimpleXMLElement::__construct() [simplexmlelement.--construct]: </body> in /data/websites/crawler.nl/www/htdocs/content.php on line 44. So it receives the content, but I don't know how to fix the error.
Jordy
A: 

I solved the problem with:

    $doc = new DOMDocument();
    // Collect warnings about invalid HTML instead of printing them
    libxml_use_internal_errors(true);
    $doc->loadHTML($str);

    $xPath = new DOMXpath($doc);
    // Upper-case both the page text and the keyword for a case-insensitive match
    $xPathQuery = "//text()[contains(translate(.,'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'), '".strtoupper($keywords)."')]";
    $elements = $xPath->query($xPathQuery);

    if ($elements->length > 0) {
        foreach ($elements as $element) {
            print "Found: " .$element->nodeValue."<br />";
        }
    }
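
To store the markup around the keyword instead of only the matching text, something along these lines might work. It is a sketch with an invented sample page; passing a node to saveHTML() needs PHP 5.3.6 or later, and the keyword is assumed to contain no single quote:

```php
<?php
// Hypothetical sample page and keyword
$str      = '<html><body><div class="post"><p>PHP <b>Crawler</b> news</p></div></body></html>';
$keywords = 'crawler';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($str);
libxml_clear_errors();

$xPath = new DOMXPath($doc);
$xPathQuery = "//text()[contains(translate(., 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'), '"
            . strtoupper($keywords) . "')]";

$snippets = array();
foreach ($xPath->query($xPathQuery) as $textNode) {
    // parentNode is the element directly around the matching text;
    // saveHTML() with a node argument serialises just that element
    $snippets[] = $doc->saveHTML($textNode->parentNode);
}

print_r($snippets);
```

Each snippet could then go into the data column in place of the full page. Appending /ancestor::div[1] to the query would return the enclosing div rather than the direct parent.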
Jordy