I am using Lucene (or more specifically Compass) to log threads in a forum, and I need a way to extract the keywords behind the discussion. That said, I don't want to index every entry someone makes. Instead, I'd have a list of 'keywords' that are relevant to a certain context, and if an entry matches a keyword and is above a threshold, I'd add that entry to the index.

I want to be able to use the power of an analyser to strip things out and do its magic, but then get the tokens back from the analyser so I can match them against the keywords and count how often certain words are mentioned.

Is there a way to get the tokens from an analyser without having the overhead of indexing every entry made?

I was thinking I'd have to maintain a RAMDirectory to hold all entries, and then perform searches using my list of keywords, then merge the relevant Documents to the persistence manager to actually store the relevant entries.

+1  A: 

You are on the right path. You can index each document on its own in a RAMDirectory and then search that index to check whether the document contains a relevant keyword. If it doesn't, discard the document. Otherwise, add it to the persistent (main) index.

You don't need to hold all the documents in memory. Doing so would consume a lot of memory unnecessarily.

Shashikant Kore
+1 thanks... what do you think of the answer below?
andy
+1  A: 

You should be able to skip using the RAMDirectory entirely. You can call the StandardAnalyzer directly and get it to pass back a list of tokens to you (aka keywords).

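import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Older Lucene token API: next() returns null once the stream is exhausted
StandardAnalyzer analyzer = new StandardAnalyzer();
// The field name is arbitrary here because nothing is being indexed
TokenStream stream = analyzer.tokenStream("meaningless", new StringReader("<text>"));
while (true) {
    Token token = stream.next();
    if (token == null) break;

    System.out.println(token.termText());
}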

Better yet, write your own Analyzer (they're not hard, have a look at the source code for the existing ones) that uses your own filter to watch for your keywords.
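Once the tokens are out of the analyzer as plain strings, the keyword match and the threshold check don't need Lucene at all. Here is a rough sketch of that part, assuming you have already collected the token text into a List<String>. The class name KeywordMatcher, the method names, and the "total hits" interpretation of the threshold are my own placeholders, not anything from Lucene or Compass.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: works on token strings that an analyzer has already produced.
class KeywordMatcher {

    // Count how often each keyword appears among the analyzed tokens.
    // Tokens are assumed to be normalized (e.g. lowercased) by the analyzer already.
    static Map<String, Integer> countKeywords(List<String> tokens, Set<String> keywords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            if (keywords.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Assumption: an entry is worth indexing when the total number of
    // keyword hits reaches the threshold.
    static boolean shouldIndex(Map<String, Integer> counts, int threshold) {
        int total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        return total >= threshold;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("lucene", "analyzer"));
        List<String> tokens = Arrays.asList("lucene", "analyzer", "returns", "tokens", "lucene");

        Map<String, Integer> counts = countKeywords(tokens, keywords);
        System.out.println("keyword counts: " + counts);
        System.out.println("index this entry? " + shouldIndex(counts, 3));
    }
}
```

Only the entries for which shouldIndex returns true would then be handed to the real index. If you go the custom filter route instead, the same counting logic could live inside the filter.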

Sam Doshi
+1 So there's no benefit to indexing if you don't need that info anymore?
andy
ahh my bad... you do still have to actually index something before the tokens are created? I was wondering if there was a way to just throw some text at an analyzer and have it return tokens?
andy