I've used Lucene on a previous project, so I am somewhat familiar with the API. However, I've never had to do anything "fancy" (where "fancy" means things like using filters, different analyzers, boosting, payloads, etc).

I'm about to embark on implementing the full-text search feature of XQuery:

http://www.w3.org/TR/xpath-full-text-10/

Its query abilities are the most complicated I've seen. From my experience with Lucene, I know it can be used to implement some of the features; however, I'd like to walk through them all. For each feature, I only need a simple answer like, "Feature X is best implemented using a query filter," just so I start off in the right direction for each feature.

Note: I will be implementing my own query parser and constructing queries "by hand" using various instantiations of Lucene classes.

3.3 Cardinality Selection

This allows you to say things like:

title ftcontains "usability" occurs at least 2 times

which means that the title field must contain the word "usability" at least twice. How can this be done?
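
To make the requirement concrete, here is a minimal sketch of the check in plain Python rather than Lucene. The names `build_positions` and `occurs_at_least` are made up for illustration, and the whitespace tokenizer is an assumption; the point is only that the test reduces to a per-field term-frequency comparison.

```python
from collections import defaultdict

def build_positions(text):
    """Map each lowercased token to the list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, token in enumerate(text.lower().split()):
        positions[token].append(pos)
    return positions

def occurs_at_least(positions, term, n):
    """True if `term` occurs at least `n` times in the indexed field."""
    return len(positions.get(term, [])) >= n

title = build_positions("Usability testing and web usability")
print(occurs_at_least(title, "usability", 2))
print(occurs_at_least(title, "web", 2))
```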

3.4.4 Stemming Option

This allows the words in a query to be stemmed before they are matched against the indexed words, e.g.:

title ftcontains "improve" with stemming

which would match even if title contained "improving". Note that PorterStemFilter cannot simply be applied at index time, because the decision whether to use stemming is made at query time, not index time.

In this case, would I have to add each word to the index twice? Once for the original word and once for the stemmed word (assuming the stemmed word is different from the original word)? Or is there a better way?
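
A minimal sketch of the "index it twice" idea, in plain Python rather than Lucene. Everything here is hypothetical: `toy_stem` is a crude suffix stripper standing in for a real stemmer, and the two sets stand in for two indexed fields. The query-time flag only chooses which set to look in.

```python
def toy_stem(word):
    """Hypothetical suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_field(text):
    """Index every token twice: once as written, once stemmed."""
    tokens = text.lower().split()
    return {"exact": set(tokens), "stemmed": {toy_stem(t) for t in tokens}}

def contains(index, word, with_stemming=False):
    """Pick the exact or the stemmed variant at query time."""
    word = word.lower()
    if with_stemming:
        return toy_stem(word) in index["stemmed"]
    return word in index["exact"]

title = index_field("Improving web usability")
print(contains(title, "improve"))
print(contains(title, "improve", with_stemming=True))
```

The cost is a larger index, which is exactly what I would like to avoid if there is a better way.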

3.4.5 Case Option

This allows you to specify -- at query-time -- one of "case insensitive", "case sensitive", "lowercase", "uppercase".

The last two I think can be implemented using a query filter since, for "lowercase", it matches only if the document text is all in lower-case (and same for "uppercase").

But how would you handle the case insensitive/sensitive specifications? One thought is to add every word twice: once in its original case and once in a normalized case (arbitrarily chosen to be, say, lowercase). Any better ideas?

3.4.6 Diacritics Option

This is similar to the Case Option, except that the choices are "diacritics insensitive" or "diacritics sensitive". How would you implement this?
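
Since case and diacritics are independent switches, the "add every word more than once" idea would need one variant per combination. A minimal sketch in plain Python (not Lucene), where `normalize`, `index_field`, and `contains` are names invented for illustration and diacritics are stripped with Unicode NFD decomposition:

```python
import unicodedata

def strip_diacritics(word):
    """Decompose to NFD and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(word, case_sensitive, diacritics_sensitive):
    """Fold away whatever the chosen options say should be ignored."""
    if not diacritics_sensitive:
        word = strip_diacritics(word)
    if not case_sensitive:
        word = word.lower()
    return word

# One indexed variant per (case_sensitive, diacritics_sensitive) combination.
VARIANTS = [(c, d) for c in (True, False) for d in (True, False)]

def index_field(text):
    tokens = text.split()
    return {v: {normalize(t, *v) for t in tokens} for v in VARIANTS}

def contains(index, word, case_sensitive, diacritics_sensitive):
    key = (case_sensitive, diacritics_sensitive)
    return normalize(word, *key) in index[key]

# "\u00e9" is e with an acute accent, escaped to keep the source ASCII-only.
title = index_field("R\u00e9sum\u00e9 Writing")
print(contains(title, "resume", False, False))
print(contains(title, "resume", False, True))
```

Four variants per word is obviously wasteful, hence the question.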

3.4.7 Stop Word Option

This allows you to specify -- at query time -- "with stop words", e.g.:

abstract ftcontains "propagating of errors"
with stop words ("a", "the", "of")

would match a document with an abstract that contains "propagating few errors". It seems odd, I know. It's as if the stop words become wildcards, i.e.:

"propagating of errors" -> "propagating * errors"

where * will match any word in the document. How can this be implemented in Lucene?
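
The matching rule I have in mind can be sketched in plain Python (not Lucene). The function name `phrase_matches` and the whitespace tokenization are assumptions for illustration; each stop word in the query phrase simply matches any single token at that position.

```python
def phrase_matches(doc_tokens, phrase_tokens, stop_words):
    """True if the phrase occurs in the document, treating every stop word
    in the phrase as a placeholder that matches any single document token."""
    n = len(phrase_tokens)
    for start in range(len(doc_tokens) - n + 1):
        window = doc_tokens[start:start + n]
        if all(p in stop_words or p == d for p, d in zip(phrase_tokens, window)):
            return True
    return False

stop_words = {"a", "the", "of"}
phrase = "propagating of errors".split()

print(phrase_matches("propagating few errors".split(), phrase, stop_words))
print(phrase_matches("propagating errors".split(), phrase, stop_words))
```

Note that the placeholder still consumes one position, so "propagating errors" (with nothing in between) does not match.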

3.5.3 Mild-Not Selection

XQuery has two flavors of "not": (regular) not and mild-not. This allows you to have a query like:

body ftcontains "Mexico" not in "New Mexico"

which would only match documents that contain "Mexico" when it's not part of the phrase "New Mexico". I would guess that you could use a query filter for this, yes?
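
In terms of word positions, my understanding of mild-not is roughly the following. This is a plain-Python sketch under that assumption, not Lucene code, and `phrase_starts` and `mild_not` are made-up names: an occurrence of the left operand counts only if none of its positions fall inside an occurrence of the right operand.

```python
def phrase_starts(tokens, phrase):
    """Start positions of every occurrence of `phrase` in `tokens`."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def mild_not(tokens, include, exclude):
    """Start positions where `include` occurs outside every occurrence of `exclude`."""
    covered = set()
    for start in phrase_starts(tokens, exclude):
        covered.update(range(start, start + len(exclude)))
    return [
        start
        for start in phrase_starts(tokens, include)
        if not covered.intersection(range(start, start + len(include)))
    ]

body = "I flew from New Mexico to Mexico City".split()
print(mild_not(body, ["Mexico"], ["New", "Mexico"]))
```

The document matches if the returned list is non-empty; a body that mentions only "New Mexico" yields an empty list.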

3.6.1 Ordered Selection

This allows you to require that the order of the words in a query match the order of the words in a document, e.g.:

title ftcontains ("web site" ftand "usability") ordered

which would match only if the phrase "web site" and the word "usability" both occurred in the document and "usability" comes after "web site" in word order. The Lucene SpanQuery class must have access to word positions, yes? How do you access those?
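
Assuming word positions are available, the ordering check itself is simple. A plain-Python sketch (not Lucene; `phrase_spans` and `ordered_and` are hypothetical names) that requires some occurrence of the first operand to end before some occurrence of the second one starts:

```python
def phrase_spans(tokens, phrase):
    """(start, end) positions, inclusive, of every occurrence of `phrase`."""
    n = len(phrase)
    return [(i, i + n - 1) for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def ordered_and(tokens, first, second):
    """True if some occurrence of `first` ends before some occurrence of `second` starts."""
    first_spans = phrase_spans(tokens, first)
    second_spans = phrase_spans(tokens, second)
    return any(f_end < s_start for _, f_end in first_spans for s_start, _ in second_spans)

print(ordered_and("web site usability study".split(), ["web", "site"], ["usability"]))
print(ordered_and("usability of a web site".split(), ["web", "site"], ["usability"]))
```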

3.6.4 Scope Selection

This allows you to require that words appear in the same "scope", e.g.:

abstract ftcontains "usability" ftand "web site" same sentence

You can also do any combination of {same|different} {sentence|paragraph}. My guess for this would be to keep track of sentence/paragraph data in a payload. Yes?
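
What I mean by tracking sentence data per word, sketched in plain Python rather than with Lucene payloads. The sentence splitter is deliberately naive (it splits on '.', '!' and '?' and does not strip other punctuation), and all function names are invented for illustration.

```python
import re

def index_with_sentences(text):
    """Return (token, sentence_number) pairs; sentences end at '.', '!' or '?'."""
    entries = []
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    for sentence_no, sentence in enumerate(sentences):
        for token in sentence.lower().split():
            entries.append((token, sentence_no))
    return entries

def sentences_containing(entries, phrase):
    """Set of sentence numbers in which `phrase` occurs entirely."""
    n = len(phrase)
    found = set()
    for i in range(len(entries) - n + 1):
        window = entries[i:i + n]
        in_one_sentence = len({s for _, s in window}) == 1
        if in_one_sentence and [t for t, _ in window] == phrase:
            found.add(window[0][1])
    return found

def same_sentence(entries, phrase_a, phrase_b):
    """True if both phrases occur together in at least one sentence."""
    return bool(sentences_containing(entries, phrase_a) & sentences_containing(entries, phrase_b))

abstract = index_with_sentences("The usability of a web site matters. Testing helps.")
print(same_sentence(abstract, ["usability"], ["web", "site"]))
```

The "different sentence" and paragraph variants would follow the same pattern with a different per-word number and a different set comparison.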

3.7 Ignore Option

Given the partial XQuery:

let $x := <book>
  <title>Web Usability and Practice</title>
  <author>Montana <annotation> this author is
      an expert in Web Usability</annotation> Marigold
  </author>
  <editor>Vera Tudor-Medina on Web <annotation> best
      editor on Web Usability</annotation> Usability
  </editor>
</book>

if I were to have a query:

book ftcontains "Web Usability" without content $x//annotation

then it would not consider any text inside of annotation elements at all. "Web Usability" would be found twice: once in the title element and once in the editor element. Note that in the editor element, an annotation element sits smack in the middle of the phrase "Web Usability". My guess for this would also be to use payload data to store the element each word is inside of, then use a filter based on that. Yes?
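
To show the behavior I expect, here is a plain-Python sketch (not Lucene) that walks the book element above and drops every token inside an ignored element before phrase matching, so the remaining words become adjacent. The helper names `tokens_outside` and `count_phrase` are hypothetical.

```python
import xml.etree.ElementTree as ET

def tokens_outside(element, ignored_tag):
    """Yield tokens in document order, skipping text inside `ignored_tag` elements."""
    if element.tag == ignored_tag:
        return
    if element.text:
        yield from element.text.split()
    for child in element:
        yield from tokens_outside(child, ignored_tag)
        # Tail text follows the child but belongs to the parent, so keep it.
        if child.tail:
            yield from child.tail.split()

def count_phrase(tokens, phrase):
    """Number of positions at which `phrase` occurs in `tokens`."""
    n = len(phrase)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase)

BOOK = """<book>
  <title>Web Usability and Practice</title>
  <author>Montana <annotation> this author is
      an expert in Web Usability</annotation> Marigold
  </author>
  <editor>Vera Tudor-Medina on Web <annotation> best
      editor on Web Usability</annotation> Usability
  </editor>
</book>"""

book = ET.fromstring(BOOK)
kept = list(tokens_outside(book, "annotation"))
print(count_phrase(kept, ["Web", "Usability"]))
```

In an index, the equivalent would presumably be per-word element information plus positions that can be recomputed or filtered at query time, which is the part I am unsure how to do.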


I realize this is a lot, but any pointers appreciated. Thanks!

A: 

This might be of interest to you: http://exist.sourceforge.net/lucene.html

Swami