ansaurus

Question

How to figure out the location of a keyword in an HTML document?

Answer 1

A:

I think you first need to parse the HTML into an array.

Find a function that does this, such as: http://www.php.happycodings.com/Arrays/code35.html

or a class such as: http://www.phpclasses.org/browse/package/5139.html

After that, search the array in a loop.

Haim Evgi 2009-08-05 07:18:35

Answer 2

A:

Hm... that's a tricky question.

Okay... why don't you search the string for your keyword, remember the position where you found it, and then go through the string backwards until you see the first "<", then write what follows into your array until you see ">".

That should work.

Gushiken 2009-08-05 07:19:05

Answer 3

+1 A:

I am not a PHP programmer, but generally, if you can get hold of an HTML DOM parser, it would make this easy. Find all text nodes and search them for the text string. Whenever you have a match, just retrieve the name of the parent node.

Without a dom parser, there are two problems to deal with.

First, unless you are using XHTML, HTML isn't XML. An element that is never closed is a good example of something you will have to hardcode around.
Secondly, a combination of tags such as "<a><b></b>bar<c></c></a>" has to be considered. It should result in the answer "a", and not "b" or "c".

Even after having located the "bar" string, you therefore can't just look at the next or previous tag. Instead, set a counter to 1 and start backtracking. When you encounter a start tag, decrease the counter by one, and when you encounter an end tag, increase it by one. When the counter drops to 0, save the tag you are currently on.

Finally, there is also malformed html such as "bar". I don't really know if there is a good way to deal with that.

Marcus Andrén 2009-08-05 07:43:48
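
The backtracking counter described in this answer can be sketched roughly as follows. This is only an illustration, not the answerer's code: the function name enclosingTag is hypothetical, the input string is an assumed example in which an empty b element precedes the keyword, and the sketch assumes well-formed markup without void or self-closed elements, comments, or "<" and ">" inside attribute values.

```php
<?php
// Hypothetical helper: name of the element enclosing the first occurrence
// of $keyword, or null if the keyword or an enclosing tag is not found.
function enclosingTag(string $html, string $keyword): ?string
{
    $pos = strpos($html, $keyword);
    if ($pos === false) {
        return null;
    }

    // Collect every tag that appears before the keyword.
    // $t[1] is "/" for an end tag, $t[2] is the tag name.
    preg_match_all('/<(\/?)(\w+)[^>]*>/', substr($html, 0, $pos), $tags, PREG_SET_ORDER);

    $counter = 1;
    // Walk the tags backwards, starting from the one nearest the keyword.
    foreach (array_reverse($tags) as $t) {
        $counter += ($t[1] === '/') ? 1 : -1; // end tag: +1, start tag: -1
        if ($counter === 0) {
            return $t[2];
        }
    }
    return null;
}

echo enclosingTag('<a><b></b>bar<c></c></a>', 'bar'), "\n"; // prints "a"
```

Walking backwards, the end tag of b raises the counter to 2, its start tag lowers it back to 1, and the start tag of a brings it to 0, so "a" is reported.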

Thanks, yes, it would be useful if I could use a DOM parser. I didn't think about that! It looks like a great solution, but I am not sure whether there is a DOM parser in PHP or not.

ahmed 2009-08-05 07:53:34

Yes, there is, http://php.net/dom

VolkerK 2009-08-05 07:56:00

Answer 4

+2 A:

You can use DOMDocument and xpath for that.

<?php
$doc = new DOMDocument;
$doc->loadhtml('<html>
  <head> 
    <title>bar , this is an example</title> 
  </head> 
  <body> 
    <h1>latest news</h1>
    foo <strong>bar</strong> 
    <i>foobar</i>
   </body>
</html>');
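$xpath = new DOMXPath($doc);
// Note: contains() converts child::text() to a string using only the
// first text child of each element, so later text children are not checked.
foreach($xpath->query('//*[contains(child::text(),"bar")]') as $e) {
  echo $e->tagName, "\n";
}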

prints

title
strong
i

Note the i-element. It contains foobar, not bar as a single word and matches the xpath query. So this solution may or may not suffice.

VolkerK 2009-08-05 07:55:02
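
If the substring match on "foobar" is a problem, one possible variation is to visit every text node and test it with a word-boundary regular expression. The sketch below is an assumption about how that could look, not part of the answer above; the function name tagsContainingWord is hypothetical.

```php
<?php
// Hypothetical helper: names of the parent elements of every text node
// that contains $word as a whole word.
function tagsContainingWord(string $html, string $word): array
{
    $doc = new DOMDocument;
    $doc->loadHTML($html);

    // \b restricts the match to whole words; preg_quote() escapes any
    // regex metacharacters in the keyword.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';

    $names = array();
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()') as $text) {
        if (preg_match($pattern, $text->nodeValue)) {
            $names[] = $text->parentNode->nodeName;
        }
    }
    return $names;
}

$html = '<html><head><title>bar , this is an example</title></head>'
      . '<body><h1>latest news</h1>foo <strong>bar</strong> <i>foobar</i></body></html>';
print_r(tagsContainingWord($html, 'bar'));
```

Because each text node is tested separately, an element whose matching text is not its first text child is also found, and the i element with "foobar" is skipped.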

Thanks, great solution, but it does not always work because some documents have errors. I tried your code on one of my documents and the DOM parser generated 11 parsing errors. Thanks.

ahmed 2009-08-05 08:05:01

Errors (i.e. false===$doc->loadhtml()) or "only" warnings?

VolkerK 2009-08-05 08:12:37

Yes, warnings. I think I can live with warnings. Thanks.

ahmed 2009-08-05 08:20:11

Is there any way to hide the warnings? Just hide them, not necessarily solve them.

ahmed 2009-08-05 08:24:28

That would be @$doc->loadhtml(), which will suppress any error/warning message (the return value is of course unaffected by this; if loadhtml() fails it will return false). See http://php.net/@ for details. You can still go through the parsing errors via http://php.net/libxml_use_internal_errors

VolkerK 2009-08-05 09:44:02
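
A minimal sketch of the libxml_use_internal_errors() route mentioned above, assuming a deliberately sloppy input string made up for this illustration. Which messages the parser reports, if any, depends on the libxml version, so none are shown here.

```php
<?php
// Made-up input with badly nested tags, for illustration only.
$html = '<html><body><p>foo <strong>bar</p></strong></body></html>';

// Buffer parser errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument;
$ok = $doc->loadHTML($html);

// Each buffered entry is a LibXMLError object (level, code, line, message, ...).
foreach (libxml_get_errors() as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

libxml_clear_errors();                  // empty the buffer
libxml_use_internal_errors($previous);  // restore the previous setting

var_dump($ok);
```

Unlike the @ operator, this keeps the messages available for logging while still keeping them out of the page output.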

@ahmed: I always repair the HTML with Tidy before using a DOM parser. Check this: http://php.net/manual/en/tidy.parsestring.php

warpech 2009-08-05 10:19:47

Answer 5

A:

The following code will work most of the time. It doesn't respect HTML comments and may get confused by quoted strings (e.g. <img alt="<grin>" ...), but it won't choke on pathological cases like foobar, and it even gives a reasonable result.

It does not notice tags like <?php>, and it doesn't know about empty tags written without a trailing slash, such as <input>, but it will ignore self-closed tags like <br />. You could add logic to ignore empty tags (img, hr, br, input, etc).

The search word is surrounded by \b (word boundary) so foobar is not matched.

$html   = "<html>
               <head>
               <title>bar , this is an example</title>
               </head>
               <body class=3>
               <h1>latest news</h1>
               foo <strong>bar</strong> <br />bar
               <i>foobar</i>
               </body>
               </html>";
$search = 'bar';

preg_match_all('/(\<(\/?)(\w+).*?(\/?)\>)|(\b'.$search.'\b)/i', $html, $matches, PREG_SET_ORDER);

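$tags = array();   // stack of currently open tags, innermost first
$found = array();  // enclosing tag for each keyword hit
foreach ($matches as $m) {
    if ($m[2] == '/') {
        // Closing tag: remove the nearest matching open tag from the stack.
        $n = array_search($m[3], $tags);
        if ($n !== false)
            array_splice($tags, $n, 1);
    }
    else if ($m[3] and !$m[4]) {
        // Opening tag that is not self-closed: push it onto the stack.
        array_unshift($tags, $m[3]);
    }
    else if (!empty($m[5])) {
        // Keyword hit ($m[5] is absent for tag matches such as <br />).
        $found[] = isset($tags[0]) ? $tags[0] : null;
    }
}
print_r($found);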

It outputs (with the extra bar after the <br /> tag)

Array
(
    [0] => title
    [1] => strong
    [2] => body
)

Lucky 2009-08-05 09:58:46

Note that if the search string is user input, it should probably be cleaned of any special characters that might confuse preg_match. Alternatively, the search string could be specified as a regular expression.

Lucky 2009-08-05 10:14:39
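
One way to do the cleaning mentioned in this comment is PHP's preg_quote(), which escapes regex metacharacters. The sketch below is an assumption about how the pattern could be built, and the helper name keywordPattern is hypothetical.

```php
<?php
// Hypothetical helper: builds the keyword part of the pattern safely.
function keywordPattern(string $search): string
{
    // preg_quote() escapes regex metacharacters; the second argument
    // additionally escapes the delimiter used by the surrounding pattern.
    return '\b' . preg_quote($search, '/') . '\b';
}

// Same tag/keyword pattern as in the answer, with an escaped keyword.
$pattern = '/(\<(\/?)(\w+).*?(\/?)\>)|(' . keywordPattern('a.b') . ')/i';

var_dump(preg_match($pattern, 'a.b')); // int(1): the dot is matched literally
var_dump(preg_match($pattern, 'axb')); // int(0): the dot is no longer a wildcard
```

Note that \b only behaves as a word boundary when the keyword starts and ends with a word character.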
