I can see how two values, when doing a regular/fuzzy full-text search, can be compared to determine which one is "better" (i.e. one value contains more keywords than the other, or one contains fewer non-keywords than the other).

However, how does Lucene compute the score for regex queries made with RegexQuery? The match is boolean: a field's value either matches the regex or it doesn't. Lucene can't take keywords from my regex query and do its usual magic...

+1  A: 

This is just a wild guess, but one possible metric could be the number of backtracking steps the regex engine needs to take to match your search strings.

Of course, these values also depend mightily on the quality of your regex, but when comparing several matches, the one that was "easier to match" could be considered a better match than the one that the regex engine had to go through contortions for.

Tim Pietzcker
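To make the guess above concrete, here is a minimal sketch of a toy backtracking matcher that counts its own steps. It is not how Lucene or any production regex engine works; the function name `count_steps` and the tiny pattern syntax (literals, `.`, and `*`) are assumptions made purely for illustration.

```python
def count_steps(pattern, text):
    """Match `pattern` against all of `text` with a toy backtracking matcher.

    Supports literal characters, '.' (any character) and '*' (zero or more
    of the preceding item). Returns (matched, steps), where steps counts
    every call the matcher makes, including the dead ends it backs out of.
    """
    steps = 0

    def match_here(p, t):
        nonlocal steps
        steps += 1
        if p == len(pattern):
            # Pattern exhausted: succeed only if the text is exhausted too.
            return t == len(text)
        starred = p + 1 < len(pattern) and pattern[p + 1] == "*"
        first_ok = t < len(text) and pattern[p] in (".", text[t])
        if starred:
            # Greedy: try to consume one more character first, then back off.
            if first_ok and match_here(p, t + 1):
                return True
            return match_here(p + 2, t)
        return first_ok and match_here(p + 1, t + 1)

    return match_here(0, 0), steps


if __name__ == "__main__":
    # Hypothetical field values; fewer steps would rank as "easier to match".
    for value in ["ab", "aaaaab"]:
        matched, steps = count_steps("a*ab", value)
        print(value, matched, steps)
```

Under this hypothetical metric, matching values would be ranked by ascending step count. As noted above, the numbers depend as much on how the regex is written as on the value being matched.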
Thanks for the answer! Would be an interesting metric indeed.
maayank
+4  A: 
Xodarap
In 2, you meant the distance between each matched term and its own matched field's value, right? How would it play out, for example, if there were a distance function, with the regex ".*(dog|cat).*" and the value "my dog and cat are happy"?
maayank
@maayank: Lucene matches terms, not strings. So your regex .*(dog|cat).* would match any single term matching that regex, which would presumably be just the terms "dog" and "cat", and maybe something like "hotdog". I am not sure how exactly they would calculate distance here, but I can guess it would be along the lines of "consider each token of the regex as a literal (whether it was intended as a literal or not), and then calculate the distance." Like the code says though, this is just speculation; for now, the distance always = 1 :-)
Xodarap
@Xodarap - you could always skip analyzing/tokenizing the values while indexing to get a whole-value regex match. What do you mean by each token of the regex? The strings 'dog' and 'cat' from inside the regex query? That seems problematic to me: how would it parse a token out of something like "[^ABC]*\w[0-9]"? Thank you very much for the answer and comments!
maayank
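The term-versus-whole-value distinction from the comments above can be sketched in plain Python. This is only an illustration, not Lucene code: the function `regex_scores`, the whitespace tokenizer, and the use of `re.fullmatch` are all assumptions, and the constant score of 1.0 simply mirrors the "distance is always 1" remark.

```python
import re


def regex_scores(pattern, value, tokenize=True):
    """Return {unit: score} for each unit of `value` that matches `pattern`.

    With tokenize=True the value is split on whitespace and each term is
    tested on its own; with tokenize=False the whole value is one unit.
    Every match gets the same score, since a regex match is yes/no.
    """
    regex = re.compile(pattern)
    units = value.split() if tokenize else [value]
    # Assumption: a unit matches only if the regex covers it entirely.
    return {unit: 1.0 for unit in units if regex.fullmatch(unit)}


if __name__ == "__main__":
    pattern = r".*(dog|cat).*"
    print(regex_scores(pattern, "my dog and cat are happy"))
    print(regex_scores(pattern, "a hotdog stand"))
    print(regex_scores(pattern, "my dog and cat are happy", tokenize=False))
```

In this sketch there is nothing to rank by: a tokenized value yields one constant score per matching term, and an untokenized value yields a single constant score for the whole string.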
@maayank: good point about not tokenizing. I agree that regexes containing "generic" parts like [^A] would make the distance hard to compute, which is probably why the Lucene devs haven't implemented it yet. My theory (treat every character of the regex as a literal and then calculate the distance from that) is obviously pretty flawed, but I really don't know of a better solution. Maybe this would be a good question to ask as a general regex question? Tim's answer is also a good one.
Xodarap