I want to match any occurrence of a search term (or list of search terms) within the tags of a document. My current solution uses preg (within a Joomla plugin):

$pattern = '/matchthisterm/i';
$article->text = preg_replace($pattern,"<span class=\"highlight\">\\0</span>",$article->text);

But this replaces everything within the HTML of the document so I need to match the tags first. Is this even the best way to achieve this?

EDIT: OK, I've used simplehtmldom, but just need some help getting to the correct term. So far I've got:

$pattern = '/(matchthisterm)/i';
$html = str_get_html($buffer);
$es = $html->find('text');
foreach ($es as $term) {
    //Match to the terms within the text nodes 
    if (preg_match($pattern, $term->plaintext)) {
        $term->outertext = '<span class="highlight">' . $term->outertext . '</span>';
    }
}

This makes the entire text node bold rather than just the matched term. Is it OK to use preg_replace in here?

SOLUTION:

//Get the HTML and look at the text nodes
$html = str_get_html($buffer);
$es = $html->find('text');
foreach ($es as $term) {
    //Match to the terms within the text nodes
    $term->outertext = str_ireplace('matchthis', '<span class="highlight">matchthis</span>', $term->outertext);
}
A: 

I haven't used preg, but I've done pattern matching in Perl, Java and ActionScript before. If this is anything similar, you have to escape special characters, for example "\<span class.... I found a website that talks about using preg, in case you haven't come across it; it can be found here

Kyra
+3  A: 

No, processing [X][HT]ML with regex is largely disastrous. In the simplest case for your example, this input:

<a href="/foo/matchthisterm/bar">bof</a>

gives quite thoroughly broken output:

<a href="/foo/<span class="highlight">matchthisterm</span>/bar">bof</a>

The proper way to do it would be to use a proper HTML/XML parser (for example DOMDocument.loadHTML or simplehtmldom), then scan and replace the contents of each text node separately. Finally re-save the HTML back to a string.
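
As a rough illustration of that parse, replace-per-text-node, re-save cycle, here is a minimal sketch using PHP's built-in DOM extension. The function name `highlightTerm()`, the wrapper `div` with id `hl-root`, and the `highlight` class handling are assumptions made for this example, not part of the question's plugin; the `<?xml encoding="UTF-8">` prefix is a commonly used hint to make `loadHTML` read the input as UTF-8.

```php
<?php
// Sketch only: highlightTerm() is a hypothetical helper, not a library function.
function highlightTerm(string $html, string $term): string
{
    if ($term === '') {
        return $html; // nothing to highlight
    }

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    // Wrap the fragment in a known element so it can be extracted again afterwards.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="hl-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="hl-root"]')->item(0);
    if ($root === null) {
        return $html; // parsing failed; leave the input untouched
    }

    // Snapshot the text nodes first, skipping script/style content.
    $query = './/text()[not(ancestor::script) and not(ancestor::style)]';
    $textNodes = iterator_to_array($xpath->query($query, $root));

    $pattern = '/(' . preg_quote($term, '/') . ')/iu';
    foreach ($textNodes as $node) {
        // With DELIM_CAPTURE, odd indexes hold the matched term, even ones the text around it.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if ($parts === false || count($parts) < 2) {
            continue; // no match in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'highlight');
                $span->appendChild($doc->createTextNode($part));
                $fragment->appendChild($span);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialise only the children of the wrapper, not the html/body scaffolding.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo highlightTerm('<a href="/foo/matchthisterm/bar">bof matchthisterm</a>', 'matchthisterm');
```

Because only text nodes are touched, the term inside the `href` attribute is left alone. Note that the serialised output may differ cosmetically from the input (attribute quoting, entity encoding), since it has been through a parser.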

An alternative for search term highlighting is to do it in JavaScript. Since the browser has already parsed the HTML to a DOM, that saves you a processing step. See e.g. this question for an example.

bobince
Thanks bobince. I saw that question earlier. The response made me chuckle. I'll take a look at the javascript and get back to you.
Jeepstone
OK, I've used simplehtmldom, but just need some help getting to the correct term. So far I've got: $pattern = '/(matchthisterm)/i'; $html = str_get_html($buffer); $es = $html->find('text'); foreach ($es as $term) { //Match to the terms within the text nodes if (preg_match($pattern, $term->plaintext)) { $term->outertext = '<span class="highlight">' . $term->outertext . '</span>'; } } This makes the entire node text bold, am I ok to use the preg_replace in here?
Jeepstone
bobince
You can see the W3 DOM (and hence DOMDocument) way of doing that in the JS version, with `splitText` and `insertBefore`... I haven't done it with simplehtmldom myself and don't see much in the docs about manipulating existing documents. You might have to do the annoying work yourself by reading `plaintext`, replacing the matches, and then `htmlspecialchars`-escaping each text portion into a new string value to write to `outertext`. :-(
bobince
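The manual approach described in the comment above (split the plain text around the matches, escape every portion, wrap the matched ones) might look roughly like this. `highlightPlainText()` is a hypothetical helper name, and the sketch assumes PHP 7.4+ and that the text passed in is already entity-decoded; whether simplehtmldom's `plaintext` is decoded is something to check before writing the result to `outertext`.

```php
<?php
// Sketch only: highlightPlainText() is a hypothetical helper, not part of simplehtmldom.
function highlightPlainText(string $text, array $terms): string
{
    // Drop empty terms; an empty alternative would match everywhere.
    $terms = array_filter($terms, fn($t) => $t !== '');
    if (!$terms) {
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    }

    // Quote each term so it is matched literally, then join into one alternation.
    $quoted = array_map(fn($t) => preg_quote($t, '/'), $terms);
    $pattern = '/(' . implode('|', $quoted) . ')/i'; // without /u, case folding is ASCII-only

    // Odd indexes are matched terms, even indexes are the text between them.
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        $escaped = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        $out .= ($i % 2 === 1)
            ? '<span class="highlight">' . $escaped . '</span>'
            : $escaped;
    }
    return $out;
}
```

Escaping each portion separately is what keeps a stray `<` or `&` in the text from turning into markup once the string is written back as HTML.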
The link to the JS went to the wrong place. I've got Mootools loaded on the page already so could use this instead perhaps?
Jeepstone
oops :-) fixed linkref.
bobince
I neither downvoted nor upvoted this answer, but I disagree with the HTML processing here; see my answer below. I'm all for the JS solution, though. Maybe I should upvote for that, but the rest of the answer doesn't seem right to me.
Savageman
+1  A: 

I agree processing HTML with regex is not a good solution.

I just read the argument about why regex can't parse HTML here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454

I quite agree with the whole thing, but the problem is MUCH simpler here: we just need to know whether we are inside some HTML tag or not. We don't have to parse an HTML structure, build a tree, or deal with mismatched tags and other errors. We just know that an HTML tag is something between < and >. I believe regex is a very good, well-suited and consistent tool here.

Just because we're dealing with some HTML doesn't mean we can't use regex. We need to focus on the real problem here, which I believe isn't really HTML processing. We only need to know whether we're inside a tag or not. I hope I won't get too many downvotes for this, but I fully stand by my position.

I'm pointing you to a previous post of mine from earlier today (where you put a link to this topic): http://stackoverflow.com/questions/2591046/highlight-text-except-html-tags/2593488#2593488

Along the same lines, you're using preg_replace() where a simpler function like str_ireplace() would be sufficient. If you just need to replace a word (or a set of words) inside a string and deal with case insensitivity, don't use regex. Keep it simple. (I'm assuming you didn't deliberately simplify the replacement you're trying to make in order to explain your problem here.)
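
One trade-off worth knowing when picking between the two: str_ireplace() inserts a fixed replacement string, so the casing of the matched text is lost, while preg_replace() can re-insert the match via `$0`. A small sketch on a plain-text string (the sample text and variable names are made up for illustration):

```php
<?php
$text = 'MatchThis and matchthis';

// str_ireplace: simple, but every hit is replaced by the same fixed string,
// so "MatchThis" comes back as "matchthis".
$fixed = str_ireplace('matchthis', '<span class="highlight">matchthis</span>', $text);

// preg_replace: $0 re-inserts whatever was actually matched;
// preg_quote() keeps the search term literal inside the pattern.
$pattern = '/' . preg_quote('matchthis', '/') . '/i';
$kept = preg_replace($pattern, '<span class="highlight">$0</span>', $text);

var_dump($fixed, $kept);
```

If the original casing doesn't matter, the str_ireplace() version is indeed the simpler choice. Either way, the replacement should only be applied to text content, not to the raw HTML.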

Savageman
“We just know that a HTML tag is something between < and >.” No, we don't. `<div title="a> b">` is a valid tag (never mind all the invalid constructs that browsers also allow, or comments, or CDATA-content elements, or textareas, or the DOCTYPE internal subset, or a word as part of an entity reference).
bobince
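To make the `<div title="a> b">` counterexample concrete, here is a small sketch comparing a naive "between < and >" pattern with what PHP's DOM extension reads from the same input. The variable names are illustrative only.

```php
<?php
$html = '<div title="a> b">text</div>';

// Naive "a tag is whatever sits between < and >" pattern.
preg_match('/<[^>]*>/', $html, $m);
$naiveTag = $m[0]; // ends at the first ">", which is inside the attribute value

// A parser reads the quoted attribute value as a whole.
$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();
$title = $doc->getElementsByTagName('div')->item(0)->getAttribute('title');

var_dump($naiveTag, $title);
```

The regex sees the tag as `<div title="a>` and would treat ` b">` as text, which is exactly where a search term could then be "highlighted" inside an attribute.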
Well, you have a point here. That said, I've always used regex to transform strings of this kind without trouble. Still, I'll start digging into tools such as DOMDocument.loadHTML() to do the job, thanks!
Savageman
Sorry, yes, I should be using str_ireplace().
Jeepstone