Can somebody show me some examples of how to import an HTML page and use XPath to find keywords, including the rest of the text from the surrounding div, p, title, etc.?

Thank you!

EDIT:

In this case I use my web crawler as an example. I have a form that takes the website to be crawled and the keywords which have to be found in the pages of that website.

http://crawler.tmp.remote.nl/example.php

Right now it scans for web pages that contain the keywords. My problem: I need the div area around the keywords that were found.

class MyCrawler extends PHPCrawler 
{


  function handlePageData(&$page_data) 
  { // Check domain
  $domain = $_POST['domain'];
  $keywords = $_POST['keywords'];
        //$tags = get_meta_tags($page_data["url"]);
        //$iKeyFound = null;


$find = $keywords;
$str = file_get_contents($page_data["url"]);
if(strpos($str, $find) !== false) // strpos() returns 0 for a match at offset 0, so compare against false
{           
    echo $page_data["referer_url"]. ' - found';

    $keywords = $_POST['keywords'];
    if($page_data["header"]){
    echo "<table border='1' >";
    echo "<tr><td width='300'>Status:</td><td width='500'> ".strtok($page_data["header"], "\n")."</td></tr>";}
    else echo "<table border='1' >";

    // Print first line

    echo "<tr><td>Page requested:</td><td> ".$page_data["url"]."</td></tr>";
    // Print website status

    // Print web page
    echo "<tr><td>Referer-page:</td><td> ".$page_data["referer_url"]."</td></tr>";

    // Content received?
    if ($page_data["received"]==true)
      echo "<tr><td>Content received: </td><td>".round($page_data["bytes_received"] / 1024, 2)." Kbytes</td></tr></table>";
    else
      echo "<tr><td>Content:</td><td> Not received</td></tr></table>";


    $domain = $_POST['domain'];
    $link = mysql_connect('localhost', 'crawler', '--');

    if (!$link) 
    {
        die('Could not connect: ' . mysql_error());
    }

    mysql_select_db("crawler");
    if(empty($page_data["referer_url"]))
    $page_data["referer_url"] = $page_data["url"];

    $str = strip_tags($str, '<p><b>'); // strip_tags() returns the result; it does not modify $str in place
    $matches = $keywords;
    //$match = preg_match_all("'/<(*.?)(*.?)>(*.?)'".$keywords."'(*.?)<\/($1)>/'", $str, $matches, PREG_SET_ORDER);
    //echo $match;



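    // Escape every interpolated value, not only the page body
    mysql_query("INSERT INTO crawler (id, domain, url, keywords, data) VALUES ('', '".mysql_real_escape_string($page_data["referer_url"])."', '".mysql_real_escape_string($page_data["url"])."', '".mysql_real_escape_string($keywords)."', '".mysql_real_escape_string($str)."' )");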



    echo '<br>';
    echo "<br><br>";
    echo str_pad(" ", 5000); // "Force flush", workaround
    flush();



} // end if keyword found
  } // end handlePageData()
} // end class MyCrawler
A: 

An XPath expression to find <div> elements that contain certain keywords could be:

$someKeyword = "foobar";
$xPath = "//text()[contains(., '$someKeyword')]/ancestor::div[1]";
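
To actually run such an expression against a page, the HTML has to be loaded into a DOMDocument and queried through DOMXPath. The following is only a minimal sketch: the function name findDivsContaining and the sample HTML are made up for illustration, and it assumes the keyword contains no single quote (which would break the XPath string literal).

```php
<?php
// Hypothetical helper (not part of PHPCrawl): returns the text of the
// nearest enclosing <div> for every text node that contains $keyword.
function findDivsContaining($html, $keyword)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Assumption: $keyword contains no single quote.
    $query = "//text()[contains(., '$keyword')]/ancestor::div[1]";

    $results = array();
    foreach ($xpath->query($query) as $div) {
        // textContent is the text of the div and all of its descendants.
        $results[] = trim($div->textContent);
    }
    return $results;
}

// Made-up sample input for illustration.
$html = '<html><body>'
      . '<div id="a"><p>Some foobar text</p></div>'
      . '<div id="b"><p>Nothing here</p></div>'
      . '</body></html>';

$found = findDivsContaining($html, 'foobar');
print_r($found);
```

In the crawler from the question, the $str that was fetched with file_get_contents() could be passed in as $html. If the markup of the div is needed rather than its plain text, $doc->saveHTML($div) can be used in place of textContent.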

To match case-insensitively, either use translate():

$upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
$lower =  strtolower($upper);
$someKeyword = "foobar";
$xPath = "//text()[
            contains(
              translate(., '$upper', '$lower'),
              translate('$someKeyword', '$upper', '$lower')
            )
          ]/ancestor::div[1]";

or (and that would be much easier in terms of searching) lowercase the entire HTML string, and the keyword as well, before creating a DOMDocument out of it, and then simply use the first XPath expression.
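
A rough sketch of that lowercasing approach could look like the following. The function name findDivsIgnoringCase and the sample HTML are invented for illustration. It assumes ASCII keywords (for other alphabets mb_strtolower() would be the safer choice) and, as before, a keyword without single quotes. Note the trade-off: the text that comes back is lowercased too, because the whole document was lowercased before parsing.

```php
<?php
// Hypothetical helper: case-insensitive variant that lowercases both
// the HTML and the keyword, then reuses the simple contains() query.
function findDivsIgnoringCase($html, $keyword)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);      // tolerate invalid markup
    $doc->loadHTML(strtolower($html));     // lowercase the whole page
    libxml_clear_errors();

    $needle = strtolower($keyword);        // lowercase the keyword too
    $xpath  = new DOMXPath($doc);
    // Assumption: $needle contains no single quote.
    $query  = "//text()[contains(., '$needle')]/ancestor::div[1]";

    $results = array();
    foreach ($xpath->query($query) as $div) {
        $results[] = trim($div->textContent);
    }
    return $results;
}

// Made-up sample input for illustration.
$page = '<html><body><div><p>Some FooBar text</p></div></body></html>';
$hits = findDivsIgnoringCase($page, 'FOOBAR');
print_r($hits);
```

If the original casing of the surrounding text matters, the translate() variant is the better fit, since it leaves the document itself untouched.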

Tomalak
Problem solved!!!!
Jordy