I'm starting from a Lucene index which someone else created. I'd like to find all of the words that follow a given word. I've extracted the term (org.apache.lucene.index.Term) of interest from the index, and I can find the documents which contain that term:

// Walk every document that contains the term.
segmentTermDocs = segmentReader.termDocs(term);
while (segmentTermDocs.next()) {
        doc = segmentReader.document(segmentTermDocs.doc());
...
}

Is there a way for me to locate the positions of the term in the document and extract the terms which follow it?

+1  A: 

Since indexing the n-grams isn't an option in your situation, some brute force will be required. You could enumerate the IndexReader's terms and termPositions, but that would likely be excruciatingly slow.
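To make the position-based idea concrete, here is a minimal sketch of the adjacency check itself. It does not call Lucene: the nested map (term to document id to positions) is a hypothetical in-memory stand-in for what enumerating terms and termPositions would give you, and the names PositionScan and followersOf are made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class PositionScan {

    // postings: term -> (document id -> positions of that term in the document).
    // Hypothetical stand-in for the data read from the index's term positions.
    static Set<String> followersOf(String word,
                                   Map<String, Map<Integer, List<Integer>>> postings) {
        Set<String> followers = new TreeSet<>();
        Map<Integer, List<Integer>> wordDocs = postings.get(word);
        if (wordDocs == null) {
            return followers; // the word does not occur in the index
        }
        // Brute force: check every term in every document it occurs in.
        for (Map.Entry<String, Map<Integer, List<Integer>>> candidate : postings.entrySet()) {
            for (Map.Entry<Integer, List<Integer>> doc : candidate.getValue().entrySet()) {
                List<Integer> wordPositions = wordDocs.get(doc.getKey());
                if (wordPositions == null) {
                    continue; // the word is not in this document
                }
                for (int p : doc.getValue()) {
                    // The candidate follows the word if the word sits one position earlier.
                    if (wordPositions.contains(p - 1)) {
                        followers.add(candidate.getKey());
                    }
                }
            }
        }
        return followers;
    }
}
```

Every term is compared against every document it appears in, which is why this approach scales poorly on a large index.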

A faster approach would be to implement a divide-and-conquer search by enumerating the terms and using a MultiPhraseQuery to check a whole group at once. Split all the potential terms into reasonably sized groups (say 1000), and run a MultiPhraseQuery search with each group and your prefix word. If a group has any hits, split it and recurse on the sub-groups until you reach a single term; a group with no hits can be discarded without testing its terms individually.
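A rough sketch of that recursion, with the Lucene call abstracted away: the anyHit predicate is a hypothetical stand-in for "build a MultiPhraseQuery from the prefix word plus this group of terms, run it, and report whether it had any hits". The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class FollowerSearch {

    // Returns every candidate term that follows the prefix word.
    // anyHit stands in for one MultiPhraseQuery search over a group of terms.
    static List<String> findFollowers(List<String> candidates,
                                      Predicate<List<String>> anyHit,
                                      int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be at least 1");
        }
        List<String> found = new ArrayList<>();
        // Split the candidates into fixed-size groups and narrow each one down.
        for (int start = 0; start < candidates.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, candidates.size());
            narrow(candidates.subList(start, end), anyHit, found);
        }
        return found;
    }

    private static void narrow(List<String> group,
                               Predicate<List<String>> anyHit,
                               List<String> found) {
        if (group.isEmpty() || !anyHit.test(group)) {
            return; // no term in this group follows the prefix word
        }
        if (group.size() == 1) {
            found.add(group.get(0)); // narrowed down to a single matching term
            return;
        }
        // At least one term in the group matches: test each half separately.
        int mid = group.size() / 2;
        narrow(group.subList(0, mid), anyHit, found);
        narrow(group.subList(mid, group.size()), anyHit, found);
    }
}
```

The saving comes from groups with no hits, which are discarded after a single query. If most terms do follow the prefix word, this can end up running more queries than testing each term on its own.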

Coady
Thanks for the ideas! This is for generating a report, so performance isn't really an issue. I ended up doing a brute-force search, creating a PhraseQuery for the term of interest paired with every other term in the index. Each query that had hits identified a term that follows the term of interest.
Matthew Simoneau