We plan to use Lucene as our full-text indexing (FTI) service. Among other things, we want to build a tag index based on a tag attribute of our documents that simply contains space-delimited tags.

To suggest tag completions, it would be great if there were a way to access all unique keywords of a given index. Lucene must be able to do that internally, since it expands wildcard-style queries by rewriting them as an OR over the matching terms.

Any suggestions?

A: 

If you are trying to do tag completion, you don't need all the unique tags; you need the tags that match what the user has already entered. This can be done with a wildcard, fuzzy, span, or prefix query, depending on the need.

Gandalf
+5  A: 

Use IndexReader.terms to get all the term values (and doc counts) for your tag field.
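
To illustrate what such a term listing gives you, here is a minimal plain-Java sketch. It does not call Lucene at all: the hypothetical TagDictionary class uses a TreeMap as a stand-in for a sorted term dictionary, built from the space-delimited tag attribute described in the question. The class and method names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for a sorted term dictionary; not a Lucene class.
class TagDictionary {
    // tag -> number of documents that contain it, kept in sorted key order
    private final TreeMap<String, Integer> docCounts = new TreeMap<>();

    // Adds one document's space-delimited tag attribute.
    public void addDocument(String tagAttribute) {
        Set<String> unique = new HashSet<>();
        for (String tag : tagAttribute.trim().split("\\s+")) {
            if (!tag.isEmpty()) {
                unique.add(tag); // count each tag once per document
            }
        }
        for (String tag : unique) {
            docCounts.merge(tag, 1, Integer::sum);
        }
    }

    // All unique tags with their document counts, in sorted order.
    public SortedMap<String, Integer> allTags() {
        return Collections.unmodifiableSortedMap(docCounts);
    }

    // Tags starting with the given prefix, found by a range scan over the
    // sorted keys. Tags containing the character \uffff right after the
    // prefix would be missed; that edge case is ignored in this sketch.
    public List<String> complete(String prefix) {
        return new ArrayList<>(
                docCounts.subMap(prefix, prefix + Character.MAX_VALUE).keySet());
    }
}
```

Because the keys are sorted, a completion lookup only touches the range of tags that share the prefix rather than the whole dictionary.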

Coady
+1  A: 

Tag completion needs to come from either (a) a prefix query on your list of tags (like pytho*), or (b) a query on an n-gram-tokenized field (for example, python would be indexed as p, py, pyt, pyth, pytho, python in a separate field). Both of these solutions allow you to do tag-completion queries on the fly.

What you're suggesting (and what Coady's response will get you) is a more offline approach, or something that you don't really want to run at query time. That is also fine, since tag dictionaries are not expected to be real-time, but be aware that iterating through IndexReader's terms is not meant to be a "query-time" operation.
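
As a rough illustration of option (b), the following plain-Java sketch produces the leading ("edge") n-grams of a single tag. The EdgeNGrams class is a hypothetical helper, not part of Lucene; in a real index an analyzer would emit these tokens at indexing time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not a Lucene class.
class EdgeNGrams {
    // Returns every non-empty leading substring of the term, shortest first.
    static List<String> of(String term) {
        List<String> grams = new ArrayList<>();
        for (int end = 1; end <= term.length(); end++) {
            grams.add(term.substring(0, end));
        }
        return grams;
    }
}
```

For the tag python, this yields p, py, pyt, pyth, pytho, python, so a plain term lookup on whatever the user has typed so far finds the tag without any wildcard expansion.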

bwhitman
I will look into IndexReader.terms. However, I don't think your assumptions are correct. If Lucene can expand terms at query time, then at least internally it is fast enough to yield a list of terms for a given partial term. That is the functionality I'm interested in, to avoid having to maintain a second index of unique tags.
deets
+1  A: 

Be careful about using terms from the index directly. If you have stemming enabled while indexing, odd-looking strings will start appearing in the term list: "Beauty" gets stemmed to "beauti", "create" is transformed to "creat", and so on.

Shashikant Kore
+1  A: 

You need to do two things:

1) When you create the document to index, make sure the tags field uses Field.Index.ANALYZED, so that the analyzer splits the tag string into individual terms:

doc.add(new Field("tags", tags, Field.Store.NO, Field.Index.ANALYZED));

2) Use a boolean query and OR all the terms:

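BooleanQuery query = new BooleanQuery();

// Here tags is the collection of individual tag strings to search for.
for (String tag : tags) {
    // TermQuery takes a Term (field name plus value), not two strings.
    query.add(new TermQuery(new Term("tags", tag)), BooleanClause.Occur.SHOULD);
}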
TopDocs docs = searcher.search(query, null, searchLimit);
Cambium