I'm working with long paragraphs of text that are searchable using MySQL and PHP. I'd like to be able to find and highlight only the relevant search terms and use regex to isolate them.

For example, I'd like to transform a Lorem ipsum paragraph,

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud 
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor 
in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur 
sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est 
laborum.

into something like this when searching for "dolor",

Lorem ipsum *dolor* sit amet ... labore et *dolor*e magna aliqua ... aute irure *dolor* in reprehenderit ... esse cillum *dolor*e eu fugiat ...

with two (or however many) words before and after the term.

So far I have this

search  - .*?(\w+?\b\s){2}(dolor)(\w+?\b\s){2}.*?
replace - ... $1*$2*$3...

but it's not entirely working: it only finds one word before and after (despite the {2}), fails when the search string is at the beginning or end of a string (or sentence), and doesn't eliminate the rest of the paragraph after the final found instance of the search string.

What's the best way to do this?

Thanks!

A: 

Regarding the problem that only one word is matched:

From the PHP PCRE documentation

When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration.

e.g.

String
"tweedledum tweedledee"

Regex
(tweedle[dume]{3}\s*)+

Captured value
tweedledee
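
A minimal sketch of the behaviour quoted above. It uses Python's re module purely for illustration (an assumption; the question is about PHP), since this rule and the syntax used here are the same in PCRE. The second pattern wraps the whole repetition in an outer group, which is one way to keep every iteration instead of only the last one.

```python
import re

subject = "tweedledum tweedledee"

# A repeated capturing group keeps only its final iteration.
m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", subject)
last_iteration = m.group(1)

# Wrapping the repetition in an outer group captures all iterations together.
# (?:...) is a non-capturing group, used so the outer group is group 1.
m2 = re.fullmatch(r"((?:tweedle[dume]{3}\s*)+)", subject)
all_iterations = m2.group(1)

print(repr(last_iteration))   # 'tweedledee'
print(repr(all_iterations))   # 'tweedledum tweedledee'
```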

This regex should get you a bit closer.

.*?(\w+\b\s*\w+\b\s*)?(dolor)(\w*\s*\w+\b\s*\w+\b)?.*?

Doesn't work for dolor at the start or end of the string. Doesn't handle characters that are neither whitespace nor word characters. Doesn't handle multiple dolor instances following each other (e.g. dolor dolor dolor). Doesn't handle a second dolor inside the "2 word range" of the first (e.g. Lorem ipsum dolor amet dolor). Possibly other special cases I can't think of right now are unhandled too :-)

jitter
That explains it. Is there any way around it?
Andrew
Yeah, that works better, but I really don't like the repeated \w+?\b\s* Hmmm...
Andrew
Enhanced with the dolor\w*\s* case
jitter
A: 

It fails at the beginning/end because you're specifying (or at least attempting to specify...) that a match must include exactly two words of leading and trailing context. If your "dolor" is the first word, there's nothing before it, so the match fails. Changing the {2} to {0,2} should fix that part.

One other thing which immediately stands out as a little off is your use of \w+?\b\s. The +? there is a lazy (non-greedy) "one or more", not an "optionally match one or more". If the latter is what you were trying to specify, you probably mean \w*\b\s, since * means "match zero or more". Also note that, unless you change the \s to \s+, it will fail on words separated by multiple spaces. There are also potential issues with punctuation or other characters which are neither word nor whitespace characters.
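
As a rough illustration of the two suggestions above ({0,2} instead of {2}, and \s+ instead of \s), here is a small sketch. It is written in Python only for demonstration (an assumption; the question uses PHP), and the simplified patterns and variable names are hypothetical, covering only the leading context.

```python
import re

# Exactly two leading words required: cannot match when the term comes first.
strict = re.compile(r"((?:\w+\s+){2})(dolor)")
# Zero to two leading words, each followed by one or more whitespace characters.
relaxed = re.compile(r"((?:\w+\s+){0,2})(dolor)")

at_start = "dolor sit amet"
strict_result = strict.search(at_start)              # no match at all
relaxed_context = relaxed.search(at_start).group(1)  # empty leading context

# \s+ also copes with words separated by several spaces.
spaced = "Lorem  ipsum   dolor"
spaced_context = relaxed.search(spaced).group(1)

print(strict_result, repr(relaxed_context), repr(spaced_context))
```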

In the end, though, I think that regexes might not be the best approach for what you're trying to accomplish, or at least not on their own. The most efficient way to do this would probably be to build a custom full-text search with the reverse index containing the text of the word, its position (so you can get them in the right order), and the highlighted word in context (so you can just concatenate these together for your final result).

If that's not an option, I'd go for splitting the text up into an array of words, then scanning through that for your target word. Not only does this make it easier to handle your context requirements, I would expect it to also run faster than a pure-regex solution, since it would severely reduce the potential need for backtracking. (OTOH, though, running two passes over the text (first pass to split it into an array of words, second pass to compare each word against your search term(s)) might tip things the other way.)
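
The word-array approach could look roughly like the sketch below. This is only one possible reading of the idea, written in Python for illustration (an assumption; the question uses PHP). The function name highlight_in_context and its parameters are hypothetical. It splits on whitespace, so punctuation stays attached to its word, and it merges context windows that overlap.

```python
import re

def highlight_in_context(text, term, context=2):
    """Return the term wrapped in *...* with `context` words on each side."""
    words = text.split()
    hits = [i for i, w in enumerate(words) if term.lower() in w.lower()]
    if not hits:
        return ""

    # Build half-open [start, end) windows and merge any that touch or overlap.
    windows = []
    for i in hits:
        start = max(0, i - context)
        end = min(len(words), i + context + 1)
        if windows and start <= windows[-1][1]:
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])

    term_re = re.compile(re.escape(term), re.IGNORECASE)

    def mark(word):
        # Wrap each occurrence of the term inside the word in asterisks.
        return term_re.sub(lambda m: "*" + m.group(0) + "*", word)

    pieces = [" ".join(mark(w) for w in words[s:e]) for s, e in windows]
    result = " ... ".join(pieces)
    # Add ellipses only where text was actually cut off.
    if windows[0][0] > 0:
        result = "... " + result
    if windows[-1][1] < len(words):
        result = result + " ..."
    return result

sample = ("Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do "
          "eiusmod tempor incididunt ut labore et dolore magna aliqua.")
print(highlight_in_context(sample, "dolor"))
```

Because the windows are merged before output, a second hit inside the context of the first (as in "Lorem ipsum dolor amet dolor") yields one snippet rather than two overlapping ones.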

Dave Sherohman
+1  A: 

A couple of changes:

((\w+\b\s*){2})(dolor)(\w*\s*(\w+\b\s*){2})

...$1*$3*$4...

Firstly, the {2} multiplier needs to be contained within memory (an outer capturing group) in both cases, to ensure you're remembering both words. This means we can ignore $2 and $5 when reading it back: each of them now contains only the last word matched by its repeated group.

Secondly, in the case of "dolore" and anything else with dolor\w+, the terminal 'e' becomes a word in its own right; to match your specification above, I've added \w*\s* to trap any end-of-word chars and terminal spaces in the remainder.

Otherwise, the non-greedy "?" char isn't really needed here because you're already specifying \b at the end of your \w+, so I've cleaned those out too.
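
A small sketch of how the pattern above might be used so that only the matched snippets are output and the rest of the paragraph is dropped. It collects the matches and joins them, rather than doing a search-and-replace over the whole string. Python is used only for illustration (an assumption; the question uses PHP), and the function name context_snippets is hypothetical. Like the pattern itself, it still requires two words before the term.

```python
import re

def context_snippets(text, term):
    """Join every match of the pattern above, discarding unmatched text."""
    pattern = re.compile(
        r"((\w+\b\s*){2})(" + re.escape(term) + r")(\w*\s*(\w+\b\s*){2})"
    )
    parts = []
    for m in pattern.finditer(text):
        before, found, after = m.group(1), m.group(3), m.group(4)
        # Groups 2 and 5 hold only the last repeated word, so they are ignored.
        parts.append(before + "*" + found + "*" + after.rstrip())
    if not parts:
        return ""
    return "... " + " ... ".join(parts) + " ..."

sample = ("Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do "
          "eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim")
print(context_snippets(sample, "dolor"))
```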

Jeremy Smyth
Brilliant! This is so close... The only issue now is that when I add .*? to the beginning and end of the search, anything that is not $1, $3, or $4 is cut out (which is good) until the last found group, when it just prints out the rest of the string (not good)
Andrew
I'm not sure you need that! You haven't got any anchors like ^ or $ in there, so it'll happily match in the middle of a string. This means you don't really need .* unless you wish to capture everything. Am I missing something?
Jeremy Smyth
Yeah, I want to only output ...$1$3$4... -- Right now, unless I use .* , the entire paragraph is returned with the ellipses and asterisks added
Andrew