ansaurus

Question

Get the div around a searched keyword (file_get_contents('url'))

Answer 1

A:

I think there are some problems with your desired solution:

The HTML may not be valid and you have to "repair" it to be able to parse it
The information might not be stored in a DIV but in a TITLE, P, H1-H6, TD or anything else
The keyword can also appear in some attributes such as the meta description or the meta keywords.
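
To illustrate the three points above, here is a minimal sketch, not a definitive implementation. It parses HTML that is not valid (using PHP's DOM extension with libxml warnings suppressed) and reports where a keyword occurs, whether in any element's text or in a meta attribute. The helper names `loadHtmlLenient` and `findKeywordLocations` and the sample markup are made up for this example.

```php
<?php
// Hypothetical helper: parse possibly invalid HTML without printing libxml warnings.
function loadHtmlLenient(string $html): DOMDocument
{
    $previous = libxml_use_internal_errors(true); // collect parse errors instead of emitting them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Hypothetical helper: list the places where $keyword occurs (case-insensitive).
function findKeywordLocations(string $html, string $keyword): array
{
    $xpath = new DOMXPath(loadHtmlLenient($html));
    $hits = [];

    // Text nodes: record the name of the element that directly contains the text.
    foreach ($xpath->query('//text()') as $text) {
        if (stripos($text->nodeValue, $keyword) !== false) {
            $hits[] = $text->parentNode->nodeName;
        }
    }

    // Attributes such as <meta name="description" content="...">.
    foreach ($xpath->query('//meta/@content') as $attr) {
        if (stripos($attr->value, $keyword) !== false) {
            $hits[] = 'meta@content';
        }
    }

    return $hits;
}

// Sample page: the <p> is never closed, and the keyword is not inside any <div>.
$html = '<html><head><title>Blue widgets</title>'
      . '<meta name="description" content="All about widgets"></head>'
      . '<body><h1>Widgets</h1><p>A widget without a closing tag</body></html>';

print_r(findKeywordLocations($html, 'widget'));
```

The keyword is found in the title, the heading, the unclosed paragraph and the meta description, none of which is a DIV.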

Usually you would use an XPath query to search a DOM tree, but I don't know how to search for a node that has a child text node containing a specific keyword.

You might want to have a look at Lucene which offers you some search engine functionality. There are also some HTML parsers for Lucene which might be able to solve your problem.

EDIT: You could search for the nearest tag before the matched keyword and then search for the next corresponding closing tag. But that might not actually be the closing tag of the parent DIV.
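
A rough sketch of that string-based idea, restricted to DIV tags, shows both the approach and the way it goes wrong. The function name `naiveEnclosingDiv` and the sample strings are made up for this example, and it is deliberately naive.

```php
<?php
// Hypothetical helper: cut from the last "<div" before the keyword
// to the first "</div>" after it. No real parsing is done.
function naiveEnclosingDiv(string $html, string $keyword): ?string
{
    $pos = stripos($html, $keyword);
    if ($pos === false) {
        return null;
    }

    // Last "<div" that starts before the keyword.
    $start = strripos(substr($html, 0, $pos), '<div');
    // First "</div>" that follows the keyword.
    $end = stripos($html, '</div>', $pos);
    if ($start === false || $end === false) {
        return null;
    }

    return substr($html, $start, $end + strlen('</div>') - $start);
}

// Fine when no other div sits between the opening tag and the keyword.
echo naiveEnclosingDiv('<div id="a"><p>find me</p></div>', 'find me'), "\n";

// Wrong when a sibling div closes before the keyword: the cut starts at
// the inner div but ends at the outer div's closing tag.
echo naiveEnclosingDiv('<div id="outer"><div id="inner">x</div> find me</div>', 'find me'), "\n";
```

The second call returns a fragment that begins with the inner div, which is not the keyword's parent, so a real parser is the safer route.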

EDIT: I found a question about searching for text within a tag: http://stackoverflow.com/questions/598722/how-to-match-a-text-node-then-follow-parent-nodes-using-xpath#answer-598732. So you might try to import the whole HTML into a SimpleXML or DOMDocument object and then use XPath to search for the string and the parent DIV.
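
A minimal sketch of that approach with DOMDocument and DOMXPath might look like the following. The function name `divsAroundKeyword` and the sample HTML are invented for illustration; the keyword comparison is done in PHP (case-sensitive) to keep the XPath simple, and `ancestor::div[1]` selects the closest enclosing DIV of each matching text node.

```php
<?php
// Hypothetical helper: return the HTML of the nearest enclosing <div>
// for every text node that contains $keyword.
function divsAroundKeyword(string $html, string $keyword): array
{
    $previous = libxml_use_internal_errors(true); // tolerate invalid HTML quietly
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];

    foreach ($xpath->query('//text()') as $text) {
        if (strpos($text->nodeValue, $keyword) === false) {
            continue;
        }
        // Walk up from the text node to the closest <div>, if there is one.
        $div = $xpath->query('ancestor::div[1]', $text)->item(0);
        if ($div !== null) {
            $result[] = $doc->saveHTML($div);
        }
    }

    return $result;
}

$html = '<html><body><div id="outer"><div id="inner"><p>needle here</p></div></div></body></html>';
print_r(divsAroundKeyword($html, 'needle'));
```

In a crawler, the string returned by file_get_contents() would be passed in place of the sample `$html`.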

Kau-Boy 2010-09-09 12:03:27

Well, it's a task I need to write myself; I just need one function :p

Jordy 2010-09-09 12:48:39

Maybe the question from my edit will help you

Kau-Boy 2010-09-09 16:57:03

Could you show me an example of importing HTML and using XPath on it? I really don't understand it, maybe I'm too stupid :(

Jordy 2010-09-10 09:39:34

Answer 2

A:

$str = file_get_contents($page_data["url"]);

// strpos() returns 0 for a match at the very start, so compare against false explicitly
if (strpos($str, $find) !== false)
{
    echo $page_data["referer_url"] . ' - gevonden'; // "gevonden" = "found"

    $keywords = $_POST['keywords'];

    // Open the table; print the status line (first header line) when a header exists
    echo "<table border='1'>";
    if ($page_data["header"]) {
        echo "<tr><td width='300'>Status:</td><td width='500'> " . strtok($page_data["header"], "\n") . "</td></tr>";
    }

    // Print the requested page
    echo "<tr><td>Page requested:</td><td> " . $page_data["url"] . "</td></tr>";

    // Print the referring page
    echo "<tr><td>Referer-page:</td><td> " . $page_data["referer_url"] . "</td></tr>";

    // Content received? (assumes bytes_received is a byte count, hence / 1024 for Kbytes)
    if ($page_data["received"] == true) {
        echo "<tr><td>Content received: </td><td>" . $page_data["bytes_received"] / 1024 . " Kbytes</td></tr></table>";
    } else {
        echo "<tr><td>Content:</td><td>Not received</td></tr></table>";
    }

    $domain = $_POST['domain'];
    $link = mysql_connect('localhost', 'crawler', 'password');

    if (!$link)
    {
        die('Could not connect: ' . mysql_error());
    }

    mysql_select_db("crawler");
    if (empty($page_data["referer_url"])) {
        $page_data["referer_url"] = $page_data["url"];
    }

    // strip_tags() returns the stripped string; it does not modify its argument
    $str = strip_tags($str, '<p><b>');

    // Escape every value that goes into the query, not only the page data
    mysql_query("INSERT INTO crawler (id, domain, url, keywords, data) VALUES ('', '"
        . mysql_real_escape_string($page_data["referer_url"]) . "', '"
        . mysql_real_escape_string($page_data["url"]) . "', '"
        . mysql_real_escape_string($keywords) . "', '"
        . mysql_real_escape_string($str) . "')");

    echo "<br><br>";
    echo str_pad(" ", 5000); // "Force flush", workaround
    flush();

}

As you can see, I can already find the keywords; now I need the part around them. Somebody told me I have to read the page into a tree structure, and then I can get the element around the keyword I found (div, p, etc.).

Jordy 2010-09-09 12:07:49

Answer 3

A:

Maybe this will help in a general way. The code will find all div elements that have an 'id' attribute and whose text contains "keyword", then display the 'id' value and the text value of each element (it assumes the document is well-formed):

$sxml = new SimpleXMLElement(file_get_contents($page_data['url']));

foreach ($sxml->xpath('//div[@id]') as $div) {
    if (strpos((string) $div, 'keyword') !== false) {
        echo $div->attributes()->id . ': ' . trim($div) . "\n";
    }
}
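
Because SimpleXMLElement only accepts well-formed XML, real-world HTML often makes the constructor fail. One possible workaround, sketched here under assumptions, is to parse the page with DOMDocument::loadHTML() (which tolerates invalid markup) and convert the result with simplexml_import_dom(). The helper name `htmlToSimpleXml` and the sample markup are made up for this example.

```php
<?php
// Hypothetical helper: build a SimpleXMLElement from HTML that need not be well-formed XML.
function htmlToSimpleXml(string $html): SimpleXMLElement
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return simplexml_import_dom($doc);
}

// The unclosed <br> would make "new SimpleXMLElement($html)" fail.
$html = '<html><body><div id="intro">A keyword appears here<br></div>'
      . '<div>no id, but a keyword</div></body></html>';
$sxml = htmlToSimpleXml($html);

// Same loop as above, now running on the imported tree.
foreach ($sxml->xpath('//div[@id]') as $div) {
    if (strpos((string) $div, 'keyword') !== false) {
        echo $div['id'] . ': ' . trim((string) $div) . "\n";
    }
}
```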

GZipp 2010-09-09 18:49:57

Thanks for your answer, I'm still not sure where to put it. In my constructor, I guess?

Jordy 2010-09-10 07:14:28

Well, I tried a few ways but I'm still getting a lot of errors, such as: Warning: SimpleXMLElement::__construct() [simplexmlelement.--construct]: </body> in /data/websites/crawler.nl/www/htdocs/content.php on line 44. So it receives the content, but I don't know how to fix the error.

Jordy 2010-09-10 07:42:14

Answer 4

A:

I solved the problem with:

    $doc = new DOMDocument();
    $doc->loadHTML($str);

    $xPath = new DOMXpath($doc);
    // Case-insensitive match: translate() upper-cases the node text before contains() compares it
    $xPathQuery = "//text()[contains(translate(.,'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'), '".strtoupper($keywords)."')]";
    $elements = $xPath->query($xPathQuery);

    if ($elements->length > 0) {
        foreach ($elements as $element) {
            print "Gevonden: " . $element->nodeValue . "<br />"; // "Gevonden" = "found"
        }
    }
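
One caveat with building the query by string concatenation: a keyword containing an apostrophe ends the XPath string literal early and breaks the query. XPath 1.0 has no escape character, so a common workaround is to choose the other quote type or fall back to concat(). The sketch below is one way to do that; the helper names `xpathLiteral` and `findTextNodes` and the sample document are made up for this example.

```php
<?php
// Hypothetical helper: turn an arbitrary string into a valid XPath 1.0 string literal.
function xpathLiteral(string $value): string
{
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    // Both quote types present: join single-quoted pieces with a quoted apostrophe.
    return "concat('" . implode("', \"'\", '", explode("'", $value)) . "')";
}

// Hypothetical helper: the case-insensitive text search, with a safely quoted keyword.
function findTextNodes(DOMDocument $doc, string $keyword): DOMNodeList
{
    $xpath = new DOMXPath($doc);
    $query = "//text()[contains(translate(., 'abcdefghijklmnopqrstuvwxyz', "
           . "'ABCDEFGHIJKLMNOPQRSTUVWXYZ'), " . xpathLiteral(strtoupper($keyword)) . ")]";

    return $xpath->query($query);
}

$doc = new DOMDocument();
$doc->loadHTML("<html><body><p>It's here</p><p>nothing to see</p></body></html>");

foreach (findTextNodes($doc, "it's") as $node) {
    echo $node->nodeValue, "\n";
}
```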

Jordy 2010-09-10 12:58:02
