Hi,
As the question says, I would like to extract frequently occurring phrases with Lucene. I am pulling information from txt files, and I am losing a lot of context because I have no information about phrases; e.g., "information retrieval" is indexed as two separate words.
How can I get phrases like this? I cannot find anything useful on the internet. Any advice, links, hints, and especially examples are appreciated!
Thank you!
EDIT: I index each document with just its title and content:
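Document doc = new Document();
// File name: stored as-is and indexed as a single, unanalyzed token
doc.add(new Field("name", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
// File content: read from a Reader, with term vectors (positions and offsets)
doc.add(new Field("text", fReader, Field.TermVector.WITH_POSITIONS_OFFSETS));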
I do it this way because, for what I am doing, the content of the file matters most. Titles are too often not descriptive at all (e.g., I have many academic papers in PDF whose titles are just codes and numbers).
I desperately need to index the most frequently occurring phrases from the text contents; only now do I see how inadequate this simple "bag of words" approach is.
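To make clear what I mean by "frequently occurring phrases", here is a minimal sketch in plain Java that counts adjacent word pairs (bigrams). It does not use Lucene at all; the class PhraseCounter and its methods are hypothetical names I made up for illustration, and the tokenization is deliberately naive. What I am looking for is the Lucene way to get the same kind of result at index time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts adjacent word pairs (bigrams).
class PhraseCounter {

    // Returns a map from "word1 word2" to the number of times that pair occurs.
    static Map<String, Integer> countBigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Naive tokenization (an assumption): lowercase, then split on anything
        // that is not a letter or digit. Sentence boundaries are ignored, so a
        // pair may span two sentences.
        String[] raw = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        List<String> tokens = new ArrayList<>();
        for (String t : raw) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String bigram = tokens.get(i) + " " + tokens.get(i + 1);
            counts.merge(bigram, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the n most frequent bigrams; ties are broken alphabetically.
    static List<String> topBigrams(String text, int n) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(countBigrams(text).entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        String text = "Information retrieval is fun. Information retrieval needs phrases.";
        System.out.println(topBigrams(text, 1)); // prints [information retrieval]
    }
}
```

On the sample sentence, "information retrieval" occurs twice and every other pair once, so it comes out on top. Doing this by hand over all my files would work, but it ignores stop words and longer phrases, which is why I am hoping Lucene already offers something for it.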