ansaurus

Question

Answer 1

A:

Regarding the problem that only one word is matched:

From the PHP PCRE documentation

When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration.

e.g.

String
"tweedledum tweedledee"

Regex
(tweedle[dume]{3}\s*)+

Captured value
tweedledee
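
The quoted rule can be checked with a short sketch. It uses Python's re module rather than PHP's PCRE functions, on the assumption that both behave the same for this pattern. It also shows one common workaround: wrap the whole repetition in an outer capturing group and make the inner group non-capturing. The variable names are illustrative only.

```python
import re

text = "tweedledum tweedledee"

# A repeated capturing group keeps only the text of its final iteration.
repeated = re.match(r"(tweedle[dume]{3}\s*)+", text)
last_only = repeated.group(1)

# Workaround: the outer group captures the whole run of repetitions,
# and the inner group is made non-capturing with (?:...).
wrapped = re.match(r"((?:tweedle[dume]{3}\s*)+)", text)
whole_run = wrapped.group(1)

print(repr(last_only))   # 'tweedledee'
print(repr(whole_run))   # 'tweedledum tweedledee'
```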

This regex should get you a bit closer.

.*?(\w+\b\s*\w+\b\s*)?(dolor)(\w*\s*\w+\b\s*\w+\b)?.*?

Doesn't work for dolor at the start or end of the string. Doesn't handle characters that are neither whitespace nor word characters. Doesn't handle multiple dolor instances following each other (e.g. dolor dolor dolor). Doesn't handle another dolor falling inside the "2 word range" (e.g. Lorem ipsum dolor amet dolor). Other special cases I can't think of right now are probably unhandled too :-)

jitter 2009-06-22 11:49:15

That explains it. Is there any way around it?

Andrew 2009-06-22 12:02:45

Yeah, that works better, but I really don't like the repeated \w+?\b\s* Hmmm...

Andrew 2009-06-22 12:05:14

Enhanced with the dolor\w*\s* case

jitter 2009-06-22 12:22:30

Answer 2

A:

It fails at the beginning/end because you're specifying (or at least attempting to specify...) that a match must include exactly two words of leading and trailing context. If your "dolor" is the first word, there's nothing before it, so the match fails. Changing the {2} to {0,2} should fix that part.
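
A minimal sketch of the {2} versus {0,2} difference, written with Python's re module. The two patterns are simplified, hypothetical stand-ins, since the asker's original regex is not reproduced in this thread.

```python
import re

# Hypothetical simplified patterns, not the asker's original regex.
exactly_two = re.compile(r"((?:\w+\b\s*){2})(dolor)")
up_to_two = re.compile(r"((?:\w+\b\s*){0,2})(dolor)")

text = "dolor sit amet"  # the target is the first word

strict = exactly_two.search(text)
relaxed = up_to_two.search(text)

print(strict)               # None: no two words precede "dolor"
print(relaxed.group(1, 2))  # ('', 'dolor'): zero words of leading context
```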

One other thing that immediately stands out as a little off is your use of \w+?\b\s. You probably mean \w*\b\s. * means "match zero or more", which is equivalent to the "optionally match one or more" that you seem to be trying to specify with +?; in PCRE, +? is actually the non-greedy form of +, not an optional one. Also note that, unless you change the \s to \s+, it will fail on words separated by multiple spaces. There are also potential issues with punctuation or other characters that are neither word nor whitespace characters.
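
The multiple-space problem can be illustrated with a short sketch in Python's re module. Both patterns are hypothetical simplifications that require two leading words before dolor.

```python
import re

text = "Lorem  ipsum dolor"  # note the two spaces after "Lorem"

# \s matches exactly one whitespace character, so the double space breaks the match.
single_space = re.search(r"(?:\w+\b\s){2}dolor", text)

# \s+ accepts one or more whitespace characters between words.
many_spaces = re.search(r"(?:\w+\b\s+){2}dolor", text)

print(single_space)          # None
print(many_spaces.group(0))  # 'Lorem  ipsum dolor'
```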

In the end, though, I think that regexes might not be the best approach for what you're trying to accomplish, or at least not on their own. The most efficient way to do this would probably be to build a custom full-text search with an inverted (reverse) index containing the text of each word, its position (so you can get the words in the right order), and the highlighted word in context (so you can just concatenate these together for your final result).

If that's not an option, I'd go for splitting the text up into an array of words, then scanning through that for your target word. Not only does this make it easier to handle your context requirements, I would also expect it to run faster than a pure-regex solution, since it would severely reduce the potential need for backtracking. On the other hand, running two passes over the text (one to split it into an array of words, one to compare each word against your search terms) might tip things the other way.
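
As a rough sketch of the split-and-scan idea in Python: the function name snippets, the context parameter, and the choice to match any word that starts with the target (so that "dolore" also counts) are all assumptions made for illustration, not part of the original question.

```python
def snippets(text, target, context=2):
    """Return one '...before *word* after...' snippet per word starting with target."""
    words = text.split()  # splits on any run of whitespace
    found = []
    for i, word in enumerate(words):
        # Assumption: a case-insensitive prefix match, so "dolore" also counts.
        if word.lower().startswith(target.lower()):
            before = words[max(0, i - context):i]   # fewer words near the start
            after = words[i + 1:i + 1 + context]    # fewer words near the end
            found.append("..." + " ".join(before + ["*" + word + "*"] + after) + "...")
    return found


text = "Lorem ipsum dolor sit amet, consectetur adipisicing elit"
print(snippets(text, "dolor"))  # ['...Lorem ipsum *dolor* sit amet,...']
```

Because the slices simply come back shorter at the edges, a target at the very start or end of the text needs no special case. Punctuation stays attached to its word here; stripping it would need an extra step.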

Dave Sherohman 2009-06-22 11:53:08

Answer 3

A:

A couple of changes:

((\w+\b\s*){2})(dolor)(\w*\s*(\w+\b\s*){2})

...$1*$3*$4...

Firstly, the {2} multiplier needs to be contained within a capturing group ("memory") in both cases, to ensure you're remembering both words. This means we can ignore $2 when reading it back; likewise $5, which now contains only the last word matched.

Secondly, in the case of "dolore" and anything else with dolor\w+, the terminal 'e' becomes a word in its own right; to match your specification above, I've added \w*\s* to trap any end-of-word chars and terminal spaces in the remainder.

Otherwise, the non-greedy "?" char isn't really needed here because you're already specifying \b at the end of your \w+, so I've cleaned those out too.
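
To see what each group ends up holding, here is a small sketch that runs the pattern above through Python's re module (assumed to behave like PCRE for this pattern). The sample sentence is made up for illustration.

```python
import re

pattern = re.compile(r"((\w+\b\s*){2})(dolor)(\w*\s*(\w+\b\s*){2})")

m = pattern.search("Lorem ipsum dolore magna aliqua ut enim")

# Groups 1, 3 and 4 hold the leading context, the search term and the
# trailing context. Groups 2 and 5 hold only the last repeated word.
for number, value in enumerate(m.groups(), start=1):
    print(number, repr(value))
# 1 'Lorem ipsum '
# 2 'ipsum '
# 3 'dolor'
# 4 'e magna aliqua '
# 5 'aliqua '
```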

Jeremy Smyth 2009-06-22 12:04:05

Brilliant! This is so close... The only issue now is that when I add .*? to the beginning and end of the search, anything that is not $1, $3, or $4 is cut out (which is good) until the last found group, at which point it just prints out the rest of the string (not good).

Andrew 2009-06-22 12:09:15

I'm not sure you need that! You haven't got any anchors like ^ or $ in there, so it'll happily match in the middle of a string. This means you don't really need .* unless you wish to capture everything. Am I missing something?

Jeremy Smyth 2009-06-22 12:29:01

Yeah, I want to only output ...$1$3$4... -- Right now, unless I use .* , the entire paragraph is returned with the ellipses and asterisks added
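
One possible way around this, sketched in Python under the assumption that collecting matches is acceptable in place of a single substitution: a replace leaves all unmatched text where it was, whereas iterating over the matches yields only the snippets. In PHP the roughly analogous call would be preg_match_all. The pattern is the one from the answer with the inner groups made non-capturing, and the sample text is made up.

```python
import re

pattern = re.compile(r"((?:\w+\b\s*){2})(dolor)(\w*\s*(?:\w+\b\s*){2})")

text = "Lorem ipsum dolor sit amet consectetur adipisicing elit sed do"

# Substitution rewrites each match but keeps the unmatched remainder.
substituted = pattern.sub(r"...\1*\2*\3...", text)

# Building the output from the matches alone drops everything else.
collected = "".join(
    "..." + m.group(1) + "*" + m.group(2) + "*" + m.group(3).rstrip() + "..."
    for m in pattern.finditer(text)
)

print(substituted)  # ...Lorem ipsum *dolor* sit amet ...consectetur adipisicing elit sed do
print(collected)    # ...Lorem ipsum *dolor* sit amet...
```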

Andrew 2009-06-22 12:40:16


Extract snippets with PCRE regex
