views:

36

answers:

2

After seeing several threads rubbishing the regexp method of finding a term to match within an HTML document, I've used the Simple HTML DOM PHP parser (http://simplehtmldom.sourceforge.net/) to get the bits of text I'm after, but I want to know if my code is optimal. It feels like I'm looping too many times. Is there a way to optimise the following loop?

//Get the HTML and look at the text nodes
   $html = str_get_html($buffer);
   //First we match the <body> tag as we don't want to change the <head> items
   foreach($html->find('body') as $body) {
    //Then we get the text nodes, rather than any HTML
    foreach($body->find('text') as $text) {
     //Then we match each term
     foreach ($terms as $term) {
      //Match to the terms within the text nodes
      $text->outertext = str_replace($term, '<span class="highlight">'.$term.'</span>', $text->outertext);
     }       
    }
   }

For example, would it make a difference to determine check if I have any matches before I start the loop maybe?

A: 

You don't need the outer foreach loop; there's generally only one body tag in a well-formed document. Instead, just use $body = $html->find('body',0);.

However, since a loop with only a single iteration is essentially equivalent in run time to not looping at all, it probably won't have much performance impact either way. So in reality, you really just have 2 nested loops even in your original code, not 3.

Amber
A: 

Speaking out of ignorance, does find take arbitrary XPath expressions? If it does, you can fold the two outer loops into one:

foreach($html->find('body/text') as $body) {
    ...
}
Marcelo Cantos
Not sure. It follows the jquery (CSS) method of matching. Does that help?
Jeepstone