Hello,

I posted this message to the Solr mailing list, but I'm trying here too in case there's a Solr expert lurking around.

I am trying to use the regex fragmenter and am having a hard time getting the results I want. I am trying to get fragments that start on a word character and end on punctuation, but the fragments returned to me seem very inflexible, even though I've provided a large slop. Here are the relevant parameters I'm using; maybe someone can point out where I've gone wrong:

<str name="hl.fragsize">500</str>
<str name="hl.fragmenter">regex</str>
<str name="hl.regex.slop">0.8</str>
<str name="hl.regex.pattern">[\w].*{400,600}[.!?]</str>
<str name="hl">true</str>
<str name="q">chinese</str>

This should match between 400 and 600 characters, beginning with a word character and ending with one of .!?. Here is an example of a typical result:

. Check these pictures out. Nine panda cubs on display for the first time Thursday in southwest China. They're less than a year old. They just recently stopped nursing. There are only 1,600 of these guys left in the mountain forests of central China, another 120 in Chinese breeding facilities and zoos. And they're about 20 that live outside China in zoos. They exist almost entirely on bamboo. They can live to be 30 years old. And these little guys will eventually get much bigger. They'll grow

As you can see, it starts with a period and ends on a word character! It's almost as if the fragments are just coming out as they will and the regex isn't doing anything at all, yet the results are different when I use the gap fragmenter. In the above result I don't see any reason why it shouldn't have stripped out the preceding period and the last two words; there is plenty of room in the slop and in the regex pattern. Please help me figure out what I'm doing wrong...

Thanks a lot,

Mark

A: 

Try:

\w[^\.!\?]{400,600}[\.!\?]

You should not need the first square brackets around \w

And you should escape the final dot.

And I do not think .* just before another quantifier ({400,600}) is a good idea, hence the .{400,600}

Since ? is a special character in regex, you should also escape it.

And since . matches anything, you should rather use [^\.!\?] in order to match anything but your ending characters.
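As a quick check of the stacked-quantifier point, here is a minimal sketch in Python. It assumes Python's re module, which is not the engine Solr runs on, so the exact behavior there may differ. The {5,60} bounds and the sample string are shortened stand-ins chosen only to keep the example small.

```python
import re

# The pattern from the question: {400,600} is stacked directly on top of .*
original = r"[\w].*{400,600}[.!?]"

try:
    re.compile(original)
    stacked_ok = True
except re.error as exc:
    # Python's re refuses a quantifier applied to another quantifier.
    stacked_ok = False
    print("rejected:", exc)

# Corrected shape, with the bounds shrunk to {5,60} for this short sample.
fixed = re.compile(r"\w[^.!?]{5,60}[.!?]")
sample = ". Check these pictures out. They can live to be 30 years old."

match = fixed.search(sample)
first_fragment = match.group(0)
print(first_fragment)
```

The search skips the leading period and space because the pattern can only begin on a word character.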

VonC
Hi, thanks for your response. You're right, the .*{400,600} was definitely a big problem and a mistake on my part. I've applied your corrections, but unfortunately my results still aren't any better. Still, this was definitely part of the problem, so thanks very much.
Markus
A: 

I've never heard of the tool you're working with (Solr), but the quantifiers in your regular expression are definitely wrong. This regex will match between 402 and 602 characters, where the first is a word character, and the last is one of three punctuation characters:

\w.{400,600}[.!?]

The dot and question mark are not metacharacters inside a character class, so there's no point escaping them. \w can stand on its own.
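The length bounds and the escaping point can be checked with a short sketch. It uses Python's re module as a stand-in for a Perl-style engine (an assumption, since Solr's engine is not tested here), and the test strings are made up for illustration.

```python
import re

plain = re.compile(r"\w.{400,600}[.!?]")
escaped = re.compile(r"\w.{400,600}[\.!\?]")  # same class, redundant escapes

# Hypothetical inputs: one word character, a run of filler, one punctuation mark.
samples = {
    "401 chars": "a" + "x" * 399 + ".",
    "402 chars": "a" + "x" * 400 + ".",
    "602 chars": "a" + "x" * 600 + "!",
    "603 chars": "a" + "x" * 601 + "?",
}

results = {}
for label, text in samples.items():
    # fullmatch requires the pattern to cover the entire string.
    with_plain = plain.fullmatch(text) is not None
    with_escaped = escaped.fullmatch(text) is not None
    results[label] = (with_plain, with_escaped)
    print(label, with_plain, with_escaped)
```

Both patterns accept exactly the same inputs, and only the 402- and 602-character strings match in full.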

Since the dot also matches the 3 punctuation characters, your regex will match as many characters as possible (up to 602), and then backtrack until the last one is one of your 3 punctuation characters.

If you want to prioritize shorter runs, use a lazy quantifier:

\w.{400,600}?[.!?]
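To make the difference between the greedy and lazy forms concrete, here is a small sketch using Python's re module (again an assumption about regex flavor). The bounds are reduced to {10,60} and the text is an invented three-sentence string so the effect is visible at a glance.

```python
import re

text = "Pandas eat bamboo. They live long. They grow big!"

greedy = re.compile(r"\w.{10,60}[.!?]")
lazy = re.compile(r"\w.{10,60}?[.!?]")

# Greedy: grabs as much as the bounds allow, then backs off to punctuation.
greedy_fragment = greedy.search(text).group(0)

# Lazy: takes the minimum first, then grows until punctuation is reached.
lazy_fragment = lazy.search(text).group(0)

print(repr(greedy_fragment))
print(repr(lazy_fragment))
```

The greedy pattern runs to the last punctuation mark within its bounds, while the lazy one stops at the first punctuation mark after the minimum length.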

If you want your regex to match only one sentence, use a negated character class:

\w[^.!?]{400,600}[.!?]
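One consequence of the negated class is worth seeing in a sketch: the whole run has to fit inside a single sentence, so a lower bound longer than any sentence in the text produces no match at all. The example below uses Python's re module, invented text, and reduced bounds; all of these are assumptions for illustration.

```python
import re

text = "Pandas eat bamboo. They live long. They grow big!"

# Bounds small enough that every sentence qualifies.
short_bounds = re.compile(r"\w[^.!?]{10,60}[.!?]")
sentences = short_bounds.findall(text)
print(sentences)

# Lower bound longer than any sentence here: the run cannot cross
# a sentence-ending character, so nothing matches.
long_bounds = re.compile(r"\w[^.!?]{40,60}[.!?]")
no_sentences = long_bounds.findall(text)
print(no_sentences)
```

With {400,600}, the same reasoning suggests this pattern would only match text containing a single sentence of at least 402 characters.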

All of the above assumes that Solr uses Perl-style regular expressions. Things like \w and {400,600} don't work in all regex flavors.

Jan Goyvaerts