Hello,
I posted this message to the Solr mailing list, but I'm trying here too in case there's a Solr expert lurking around.
I am trying to use the regex fragmenter and am having a hard time getting the results I want. I want fragments that start on a word character and end on punctuation, but the fragments being returned seem very inflexible, even though I've provided a large slop. Here are the relevant parameters I'm using; maybe someone can point out where I've gone wrong:
<str name="hl.fragsize">500</str>
<str name="hl.fragmenter">regex</str>
<str name="hl.regex.slop">0.8</str>
<str name="hl.regex.pattern">[\w].*{400,600}[.!?]</str>
<str name="hl">true</str>
<str name="q">chinese</str>
As I understand it, this should match between 400 and 600 characters, beginning with a word character and ending with one of . ! or ? Here is an example of a typical result:
. Check these pictures out. Nine panda cubs on display for the first time Thursday in southwest China. They're less than a year old. They just recently stopped nursing. There are only 1,600 of these guys left in the mountain forests of central China, another 120 in Chinese breeding facilities and zoos. And they're about 20 that live outside China in zoos. They exist almost entirely on bamboo. They can live to be 30 years old. And these little guys will eventually get much bigger. They'll grow
As you can see, it starts with a period and ends on a word character! It's almost as if the fragments are coming out however they happen to fall and the regex isn't doing anything at all, yet the results are different when I use the gap fragmenter. In the above result I don't see any reason why it shouldn't have stripped out the leading period and the last two words; there is plenty of room in both the slop and the regex pattern. Please help me figure out what I'm doing wrong.
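To make the expectation concrete, here is a minimal Python sketch of the check I would like every returned fragment to pass. It is only an illustration under a few assumptions: has_wanted_shape is a made-up helper, not anything in Solr; it uses Python's re module rather than the Java regex engine Solr runs on; the sample strings are shortened from the result above; and it says nothing about how the regex fragmenter applies hl.regex.pattern internally.

```python
import re

# Hypothetical check, not part of Solr: a fragment should start on a word
# character and end on one of . ! ?  (DOTALL lets '.' span line breaks).
WANTED = re.compile(r"\A\w.*[.!?]\Z", re.DOTALL)


def has_wanted_shape(fragment: str, min_len: int = 400, max_len: int = 600) -> bool:
    """Return True if the fragment has the length and boundaries I am after."""
    if not (min_len <= len(fragment) <= max_len):
        return False
    return WANTED.match(fragment) is not None


# Shortened stand-ins for the fragment shown above (length check relaxed).
returned = ". Check these pictures out. They'll grow"
hoped_for = "Check these pictures out."

print(has_wanted_shape(returned, min_len=0))   # leading period, trailing word
print(has_wanted_shape(hoped_for, min_len=0))  # word start, punctuation end
```

The first call fails the check because of the leading period and the trailing words; the second passes, and that is the shape I expected the fragmenter to produce.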
Thanks a lot,
Mark