I have an HTML document as a string.

I want to search for a keyword in this document and find out where it appears,

that is, in which tag it appears.

Did it appear in an H1, H2, or TITLE tag?

Let's say my document is

        $string = "<html>
                   <head> 
                   <title>bar , this is an example</title> 
                   </head> 
                   <body> 
                   <h1>latest news</h1>
                   foo <strong>bar</strong> 
                   </body>
                   </html>";


                   $arr = find_term("bar",$string);
                   print_r($arr);

I expect the result to be like this

                   [0]=> title
                   [1]=> strong

because "bar" appears once in the TITLE tag and once in the STRONG tag.

I know it is a complicated question, which is why I am asking if someone knows the answer :)

thanks

what I have so far is

        function find_term($term,$string){
               $arr = explode($term, $string);
               return $arr;
        }
        $arr = find_term("bar",$string);
        print_r($arr);

now we have an array which has the value

             Array
             (
             [0] => <html>
               <head>
               <title>

             [1] =>  , this is an example</title>
               </head>
               <body>
               <h1>latest news</h1>
               foo <strong>

             [2] => </strong>
               </body>
               </html>
             )

You can see that the last tag in every element of the array is the tag that contains "bar", but the question now is: how do I find the last tag that appears in each element?

Thanks

A: 

I think you first need to

parse the HTML into an array.

Find a function that does it, like: http://www.php.happycodings.com/Arrays/code35.html

or a class like: http://www.phpclasses.org/browse/package/5139.html

After that, search the array in a loop.

Haim Evgi
A: 

Hm... that's a tricky question^^

Okay... why don't you search the string for your keyword, remember the position where you found it, and then go through the string backwards until you see the first "<", then write everything from there into your array until you see ">"?

Should work.
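A minimal sketch of that idea, assuming the keyword search is case-insensitive; the function name `nearest_tag_before` is made up for illustration. It returns whichever tag precedes the match most closely, so it gives the wrong answer when that tag is a closing tag or an empty tag such as <br>.

```php
<?php
// Hypothetical helper: for every occurrence of $term, report the name of the
// nearest tag that starts before it. This is NOT always the enclosing tag.
function nearest_tag_before(string $html, string $term): array
{
    $tags = [];
    $offset = 0;
    while (($pos = stripos($html, $term, $offset)) !== false) {
        $before = substr($html, 0, $pos);
        // Walk backwards to the last "<" before the match.
        $open = strrpos($before, '<');
        if ($open !== false && preg_match('/^<\/?(\w+)/', substr($before, $open), $m)) {
            $tags[] = strtolower($m[1]);
        }
        $offset = $pos + strlen($term);
    }
    return $tags;
}

$string = "<html><head><title>bar , this is an example</title></head>"
        . "<body><h1>latest news</h1>foo <strong>bar</strong></body></html>";
print_r(nearest_tag_before($string, 'bar'));
```

For the document in the question this yields title and strong, but only because each "bar" directly follows its opening tag.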

Gushiken
+1  A: 

I am not a PHP programmer, but generally, if you can get hold of an HTML DOM parser, it would make this easy. Find all text nodes and search them for the text string. Whenever you have a match, just retrieve the name of the parent node.

Without a dom parser, there are two problems to deal with.

  1. Unless you are using xhtml, html isn't xml. <br> is a good example that you will have to hardcode around.

  2. Secondly, a combination of tags such as "<a><b></b>bar<c></c></a>" has to be considered. It should result in the answer "a", and not "b" or "c".

So even after having located the "bar" string, you can't just take the next or previous tag. Instead, set a counter to 1 and start backtracking. When you encounter a start tag, decrease the counter by one, and when you encounter an end tag, increase it by one. When the counter drops to 0, save the tag you are currently on.

Finally, there is also malformed html such as "<i><b>bar</i></b>". I don't really know if there is a good way to deal with that.
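The counter-based backtracking described above could be sketched roughly as follows. The function name `enclosing_tag` and the short list of empty tags are assumptions made for illustration; the sketch ignores HTML comments and does nothing special for malformed nesting.

```php
<?php
// Hypothetical helper: walk backwards from byte offset $pos and return the
// name of the tag that encloses it, or null if there is none.
function enclosing_tag(string $html, int $pos): ?string
{
    // Empty elements never have an end tag, so they must not touch the counter.
    $void = ['br', 'hr', 'img', 'input', 'meta', 'link'];

    // Collect every tag that appears before the match position.
    preg_match_all('/<(\/?)(\w+)[^>]*>/', substr($html, 0, $pos), $tags, PREG_SET_ORDER);

    $depth = 1;
    foreach (array_reverse($tags) as $tag) {
        $name = strtolower($tag[2]);
        if (in_array($name, $void, true)) {
            continue;
        }
        // End tag: one more element to skip. Start tag: one fewer.
        $depth += ($tag[1] === '/') ? 1 : -1;
        if ($depth === 0) {
            return $name;
        }
    }
    return null;
}

$html = '<a><b></b>bar<c></c></a>';
echo enclosing_tag($html, strpos($html, 'bar')), "\n"; // prints "a"
```

Walking backwards from "bar", the end tag of b raises the counter to 2, its start tag lowers it to 1, and the start tag of a lowers it to 0, so "a" is reported.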

Marcus Andrén
Thanks, yes, it would be useful if I could use a DOM parser; I didn't think about that! It looks like a great solution, but I am not sure whether there is a DOM parser in PHP.
ahmed
Yes, there is, http://php.net/dom
VolkerK
+2  A: 

You can use DOMDocument and xpath for that.

<?php
$doc = new DOMDocument;
$doc->loadhtml('<html>
  <head> 
    <title>bar , this is an example</title> 
  </head> 
  <body> 
    <h1>latest news</h1>
    foo <strong>bar</strong> 
    <i>foobar</i>
   </body>
</html>');
$xpath = new DOMXPath($doc);
foreach($xpath->query('//*[contains(child::text(),"bar")]') as $e) {
  echo $e->tagName, "\n";
}

prints

title
strong
i

Note the i-element. It contains foobar, not bar as a single word and matches the xpath query. So this solution may or may not suffice.
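If whole-word matches are needed, one possible variation is to walk the text nodes and test each one with a word-boundary regular expression. This is a sketch, not part of the code above: wrapping it in a `find_term` function and matching case-insensitively are assumptions. Each matching text node is reported once, even if the term occurs in it several times.

```php
<?php
// Hypothetical helper: returns the parent tag name of every text node that
// contains $term as a whole word.
function find_term(string $term, string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // preg_quote() keeps special characters in $term from acting as regex syntax.
    $pattern = '/\b' . preg_quote($term, '/') . '\b/i';

    $tags = [];
    foreach ($xpath->query('//text()') as $textNode) {
        if (preg_match($pattern, $textNode->nodeValue)) {
            $tags[] = $textNode->parentNode->nodeName;
        }
    }
    return $tags;
}

$html = '<html><head><title>bar , this is an example</title></head>'
      . '<body><h1>latest news</h1>foo <strong>bar</strong> <i>foobar</i></body></html>';
print_r(find_term('bar', $html)); // title and strong, but not i
```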

VolkerK
Thanks, great solution, but it does not always work because some documents have errors. I tried your code on one of my documents and the DOM parser generated 11 parsing errors. Thanks
ahmed
Errors (i.e. false===$doc->loadhtml()) or "only" warnings?
VolkerK
Yes, warnings. I think I can live with warnings. Thanks
ahmed
Is there any way to hide the warnings? Just hide them, not necessarily solve them.
ahmed
That would be @$doc->loadhtml() which will suppress any error/warning message (the return value of course is unaffected by this, if loadhtml() fails it will return false), see http://php.net/@ You can still go through the parsing errors via http://php.net/libxml_use_internal_errors
VolkerK
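A short sketch of the libxml_use_internal_errors approach mentioned in the comment above. The sample markup is made up, and which messages libxml reports for it depends on the libxml version.

```php
<?php
// Keep parse warnings in an internal buffer instead of emitting them.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Made-up markup with an unknown tag, which libxml typically complains about.
$ok = $doc->loadHTML('<html><body><foo>bar</foo></body></html>');

// The collected problems can still be inspected if needed.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// Empty the buffer and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Unlike the @ operator, this leaves the errors available for inspection rather than discarding them.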
@ahmed: I always repair HTML with Tidy before using DOM parser. Check this: http://php.net/manual/en/tidy.parsestring.php
warpech
A: 

The following code will work most of the time. It doesn't respect HTML comments and may get confused by quoted strings (e.g. <img alt="<grin>" ...), but it won't choke on pathological cases like <i><b>foo</i>bar</b>, and even gives a reasonable result.

It does not notice tags like <?php>, and doesn't know about empty tags like <br> or <input>, but it will ignore self-closing tags like <br />. You could add logic to ignore empty tags (img, hr, br, input, etc.).

The search word is surrounded by \b (word boundary) so foobar is not matched.

$html   = "<html>
               <head>
               <title>bar , this is an example</title>
               </head>
               <body class=3>
               <h1>latest news</h1>
               foo <strong>bar</strong> <br />bar
               <i>foobar</i>
               </body>
               </html>";
$search = 'bar';

preg_match_all('/(\<(\/?)(\w+).*?(\/?)\>)|(\b'.$search.'\b)/i', $html, $matches, PREG_SET_ORDER);

$tags = array();
$found = array();
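foreach ($matches as $m) {
    if ($m[2] == '/') {
        // Closing tag: remove the nearest matching open tag from the stack.
        $n = array_search($m[3], $tags);
        if ($n !== false)
            array_splice($tags, $n, 1);
    }
    else if ($m[3] and !$m[4]) {
        // Opening tag that is not self-closing: push it onto the stack.
        array_unshift($tags, $m[3]);
    }
    else if (!empty($m[5]) and $tags) {
        // Search term found: record the innermost open tag.
        // empty() avoids an undefined-index notice for self-closing tags.
        $found[] = $tags[0];
    }
}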
print_r($found);

It outputs (with the extra bar after the <br /> tag)

Array
(
    [0] => title
    [1] => strong
    [2] => body
)
Lucky
Note that if the search string is user input, it should probably be cleaned of any special characters that might confuse preg_match. Alternatively, the search string could be specified as a regular expression.
Lucky
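One way to do that cleaning is PHP's preg_quote, which escapes regex metacharacters in the search string. A small sketch; the sample search term and test strings are made up:

```php
<?php
$search  = 'node.js'; // hypothetical user input containing a regex metacharacter
$pattern = '/\b' . preg_quote($search, '/') . '\b/i';

var_dump(preg_match($pattern, 'I use Node.js daily')); // int(1)
var_dump(preg_match($pattern, 'nodexjs'));             // int(0)
```

Without preg_quote, the dot would match any character and the second string would match as well.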