ansaurus

Question

Answer 1

A:

I haven't used preg, but I've done pattern matching in Perl, Java and ActionScript before. If this is anything similar, you have to escape special characters, for example "\<span class.... I also found a website that covers using preg, in case you haven't come across it; it can be found here.

Kyra 2010-04-07 08:22:05

Answer 2

+3 A:

No, processing [X][HT]ML with regex is largely disastrous. In the simplest case for your example, this input:

<a href="/foo/matchthisterm/bar">bof</a>

gives quite thoroughly broken output:

<a href="/foo/<span class="highlight">matchthisterm</span>/bar">bof</a>
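For reference, the breakage is easy to reproduce. This is a minimal sketch: the pattern is the one quoted later in this thread, while the replacement string is an assumption, since the question's actual replacement isn't shown here.

```php
<?php
// Assumed pattern and replacement; only the pattern appears in this thread.
$pattern = '/(matchthisterm)/i';
$replacement = '<span class="highlight">$1</span>';

$input = '<a href="/foo/matchthisterm/bar">bof</a>';

// preg_replace sees one flat string, so it cannot tell an attribute value
// from visible text and rewrites the href as well.
$broken = preg_replace($pattern, $replacement, $input);

echo $broken, "\n";
```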

The proper way to do it is to use a real HTML/XML parser (for example DOMDocument::loadHTML or simplehtmldom), then scan and replace the contents of each text node separately. Finally, serialize the document back to a string.
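As a rough illustration of that approach with DOMDocument, here is a sketch rather than a definitive implementation. The function name `highlightTextNodes`, the wrapper id `hl-root` and the `highlight` class are made up for the example; it also assumes UTF-8 input and an HTML fragment rather than a full page.

```php
<?php
// Hypothetical helper: wraps case-insensitive matches of $term that occur in
// text nodes, leaving tags and attribute values untouched.
function highlightTextNodes(string $html, string $term): string
{
    if ($term === '') {
        return $html;
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup quietly
    // The XML declaration is a common trick to make loadHTML read UTF-8;
    // the wrapper div lets us get the fragment back out afterwards.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="hl-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="hl-root"]')->item(0);
    if ($root === null) {
        return $html;
    }
    $nodes = $xpath->query('.//text()[not(ancestor::script) and not(ancestor::style)]', $root);
    $pattern = '/(' . preg_quote($term, '/') . ')/iu';

    foreach (iterator_to_array($nodes) as $node) {
        // With DELIM_CAPTURE, odd indexes hold the matches, even ones the rest.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if ($parts === false || count($parts) < 2) {
            continue; // no match in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'highlight');
                $span->appendChild($doc->createTextNode($part));
                $fragment->appendChild($span);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper, not the added html/body.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo highlightTextNodes('<a href="/foo/matchthisterm/bar">bof matchthisterm</a>', 'matchthisterm'), "\n";
```

Because the match is split out of the text node and re-inserted as DOM nodes, escaping is handled by the serializer and the href is never touched.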

An alternative for search term highlighting is to do it in JavaScript. Since the browser has already parsed the HTML to a DOM, that saves you a processing step. See e.g. this question for an example.

bobince 2010-04-07 08:33:28

Thanks bobince. I saw that question earlier. The response made me chuckle. I'll take a look at the javascript and get back to you.

Jeepstone 2010-04-07 09:52:30

OK, I've used simplehtmldom, but I need some help getting to the correct term. So far I've got:

    $pattern = '/(matchthisterm)/i';
    $html = str_get_html($buffer);
    $es = $html->find('text');
    foreach ($es as $term) {
        // Match the terms within the text nodes
        if (preg_match($pattern, $term->plaintext)) {
            $term->outertext = '<span class="highlight">' . $term->outertext . '</span>';
        }
    }

This makes the entire text node bold rather than just the matched term. Am I OK to use preg_replace in here?

Jeepstone 2010-04-07 11:57:59

bobince 2010-04-07 12:50:37

You can see the W3 DOM (and hence DOMDocument) way of doing that in the JS version, with `splitText` and `insertBefore`... I haven't done it with simplehtmldom myself and don't see much in the docs about manipulating existing documents. You might have to do the annoying work yourself by reading `plaintext`, replacing the matches, and then `htmlspecialchars`-escaping each text portion into a new string value to write to `outertext`. :-(
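The manual route could look roughly like the sketch below. `highlightPlainText` is a hypothetical helper name, and the sketch assumes the text handed to it is already entity-decoded and UTF-8; whether simplehtmldom's `plaintext` returns decoded text is something to check first, otherwise entities would be escaped twice.

```php
<?php
// Hypothetical helper: takes the text of ONE text node and returns an HTML
// string with matches wrapped and every portion escaped.
function highlightPlainText(string $text, string $term): string
{
    if ($term === '') {
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    }
    $pattern = '/(' . preg_quote($term, '/') . ')/i';
    // With DELIM_CAPTURE, odd indexes hold the matches, even ones the rest.
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        $escaped = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        $out .= ($i % 2 === 1)
            ? '<span class="highlight">' . $escaped . '</span>'
            : $escaped;
    }
    return $out;
}

echo highlightPlainText('Fish & MatchThisTerm <ok>', 'matchthisterm'), "\n";
```

Inside the loop from the earlier comment, the result would then be assigned with something like `$term->outertext = highlightPlainText($term->plaintext, 'matchthisterm');`, which is untested against simplehtmldom itself.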

bobince 2010-04-07 12:52:27

The link to the JS went to the wrong place. I've got MooTools loaded on the page already, so perhaps I could use this instead?

Jeepstone 2010-04-07 13:30:01

oops :-) fixed linkref.

bobince 2010-04-07 13:37:12

I neither downvoted nor upvoted this answer, but I disagree with the HTML processing advice here; see my answer below. I'm all for the JS solution, though. Maybe I should upvote for that, but the rest of the answer doesn't seem right to me.

Savageman 2010-04-07 22:18:13

Answer 3

+1 A:

I agree processing HTML with regex is not a good solution.

I just read the argument about why regex can't parse HTML here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454

I largely agree with it, but the problem here is MUCH simpler: we just need to know whether we are inside an HTML tag or not. We don't have to parse an HTML structure, interpret a tree, or cope with mismatched tags and other errors. We just know that a HTML tag is something between < and >. I believe regex is a good, well-suited and consistent tool here.

The fact that we're dealing with HTML doesn't by itself rule out regex. We need to focus on the real problem here, which I believe isn't really HTML processing: we only need to know whether we're inside a tag or not. I hope I won't get too many downvotes for this, but I stand by my position.

I'll point you to a post I made earlier today (where you linked to this topic): http://stackoverflow.com/questions/2591046/highlight-text-except-html-tags/2593488#2593488

Along the same lines, and assuming we've been told everything we need to know, you're using preg_replace() where a simpler function like str_ireplace() would be sufficient. If you just need to replace a word (or a set of words) inside a string, case-insensitively, don't use regex. Keep it simple. (I'm assuming you didn't deliberately simplify the replacement you're trying to make in order to explain your problem here.)
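A small side-by-side, as a sketch with made-up sample text: both calls work on plain text only and know nothing about tags. One difference that may matter for highlighting is that str_ireplace() inserts the replacement string verbatim, whereas preg_replace() can reuse the matched text.

```php
<?php
$text = 'Looking for MatchThisTerm in plain text';
$term = 'matchthisterm';

// str_ireplace matches case-insensitively but inserts the replacement as
// given, so the original casing of the match is lost.
$viaStr = str_ireplace($term, '<span class="highlight">' . $term . '</span>', $text);

// preg_replace can reuse the matched text through $0, keeping its casing.
$viaPreg = preg_replace(
    '/' . preg_quote($term, '/') . '/i',
    '<span class="highlight">$0</span>',
    $text
);

echo $viaStr, "\n", $viaPreg, "\n";
```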

Savageman 2010-04-07 22:15:23

“We just know that a HTML tag is something between < and >.” No, we don't. `<div title="a> b">` is a valid tag (never mind all the invalid constructs that browsers also allow, or comments, or CDATA-content elements, or textareas, or the DOCTYPE internal subset, or a word as part of an entity reference).
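To make the first counterexample concrete, here is a short sketch using the naive "something between < and >" pattern (the pattern is an illustration, not one taken from either answer):

```php
<?php
$html = '<div title="a> b">text</div>';

// The naive pattern stops at the first ">", which sits inside the attribute.
preg_match('/<[^>]*>/', $html, $m);
echo $m[0], "\n"; // <div title="a>

// Stripping "tags" this way leaves part of the attribute behind as if it
// were visible text.
$stripped = preg_replace('/<[^>]*>/', '', $html);
echo $stripped, "\n"; //  b">text
```

Any highlighter built on that pattern would treat ` b">` as text outside a tag and could happily insert markup into it.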

bobince 2010-04-07 22:25:48

Well, you have a point there. That said, I've always used regex to transform strings of this kind without trouble. Still, I'll start digging into tools such as DOMDocument::loadHTML() to do the job, thanks!

Savageman 2010-04-08 07:23:45

Sorry, yes, I should be using str_ireplace().

Jeepstone 2010-04-08 09:16:18

Match multiple terms within <body> tags
