Hi!

As the question says, I would like to get some frequently occurring phrases with Lucene. I am getting some information from txt files, and I am losing a lot of context by not having information for phrases, e.g. "information retrieval" is indexed as two separate words.

What is the way to get phrases like this? I cannot find anything useful on the internet; all advice, links, hints, and especially examples are appreciated!

Thank you!

EDIT: I store my documents just by title and content

Document doc = new Document();
// title: stored verbatim, kept as a single unanalyzed token
doc.add(new Field("name", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
// content: tokenized from a Reader, with term vectors (positions and offsets)
doc.add(new Field("text", fReader, Field.TermVector.WITH_POSITIONS_OFFSETS));

Because for what I am doing, the most important thing is the content of the file. Titles are too often not descriptive at all (e.g., I have many PDF academic papers whose titles are just codes and numbers).

I desperately need to index the top occurring phrases from the text contents; only now do I see how inefficient this simple "bag of words" approach is.

A: 

Hi Julia

Is it possible for you to post any code that you have written?

Basically, a lot depends on the way you create your fields and store documents in Lucene.

Let's consider a case where I have two fields: ID and Comments. In my ID field I allow values like 'finding nemo', i.e. strings with spaces, whereas Comments is a free-flow text field, i.e. I allow anything and everything my keyboard allows and Lucene can understand.

Now, in a real-life scenario it does not make sense to treat my ID 'finding nemo' as two different searchable strings, whereas I do want to index everything in Comments.

So what I will do is create a document (org.apache.lucene.document.Document) object to take care of this... something like this:

Document doc = new Document();
doc.add(new Field("comments","Finding nemo was a very tough job for a clown fish ...", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("id", "finding nemo", Field.Store.YES, Field.Index.NOT_ANALYZED));

So, essentially I have created two fields:

  1. comments: which I chose to analyze, using Field.Index.ANALYZED
  2. id: which I directed Lucene to store but not analyze, using Field.Index.NOT_ANALYZED (see the search-side sketch after this list)
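
To illustrate the search-side consequence (a sketch, with the field names taken from the snippet above): because the id field is NOT_ANALYZED, the whole value is indexed as one single token, so an exact TermQuery, including the space, is what matches it.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// The NOT_ANALYZED "id" field holds the untokenized value, so the
// exact string (including the space) must be used to match it.
Query idQuery = new TermQuery(new Term("id", "finding nemo"));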

This is how you customize Lucene using the default tokenizer and analyzer. Otherwise, you can write your own tokenizer and analyzer.
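
For completeness, here is a minimal sketch of what a custom analyzer can look like in the Lucene 3.x API (the class name is made up for illustration; StandardTokenizer and LowerCaseFilter are real Lucene classes):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical example: standard tokenization followed by lower-casing.
public class MyAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
        stream = new LowerCaseFilter(stream); // Lucene 3.0 signature; 3.1+ also takes a Version
        return stream;
    }
}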

Link(s) http://darksleep.com/lucene/

Hope this will help you... :)

Favonius
@Favonius: Thank you for the reply, Favonius! I have edited my post so you can see how I index docs. If I understand what you are saying, using only the information from the title will not be appropriate for my case..? :(
Julia
@Julia: Actually my answer is only partially correct. I misunderstood the n-gram problem as a simple indexing problem :o. Although considering only the 'id' ('title' in your case) might not be appropriate... which I think you have already recognized...
Favonius
+2  A: 

Julia, it seems what you are looking for is n-grams, specifically bigrams (also called collocations).

Here's a chapter about finding collocations (PDF) from Manning and Schütze's Foundations of Statistical Natural Language Processing.

In order to do this with Lucene, I suggest using Solr with ShingleFilterFactory. Please see this discussion for details.
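
Solr's ShingleFilterFactory is a thin wrapper around Lucene's ShingleFilter (from contrib/analyzers), so if you stay with plain Lucene, an index-time shingling analyzer could look something like this (an untested sketch against the Lucene 3.x API; the wrapper class name is made up):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// Emits single terms plus word bigrams ("shingles") for every field.
public class BigramAnalyzer extends Analyzer {
    private final Analyzer delegate = new StandardAnalyzer(Version.LUCENE_30);

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = delegate.tokenStream(fieldName, reader);
        return new ShingleFilter(stream, 2); // maxShingleSize = 2 -> bigrams
    }
}

By default, ShingleFilter keeps emitting the original unigrams alongside the shingles, so a single index can hold both the plain terms and the bigrams.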

Yuval F
@Yuval F: Yes, exactly, what I need is n-grams... I was hoping I would not have to go too deep into NLP :/ ... but may I ask, before I go into this book chapter: if I use the tools you recommended (and if I manage anyway), are the n-grams found during search time or during index time? Can I obtain as the end result one index with everything indexed together, terms and frequent n-grams? Because I am doing some concept matching with an ontology, and it would be the best solution to have it that way (if possible, of course). Thanks!
Julia
+1 for correctly recognizing the problem... :)
Favonius
@Julia: I think you can apply the ShingleFilterFactory during indexing. And maybe you can use Luke (http://wiki.apache.org/solr/LukeRequestHandler) for viewing the results. Hope you now have enough to get you going.
Yuval F
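
Besides Luke, one way to actually list the most frequent phrases once the shingled field has been indexed is to walk the term dictionary and keep the terms containing a space, i.e. the shingles. A sketch against the Lucene 3.x API (the field name "text" comes from the question; the index path is a placeholder):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

// Walk the term dictionary of the "text" field and print every bigram
// shingle (terms containing a space) with its document frequency.
public class TopPhrases {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
        TermEnum terms = reader.terms(new Term("text", ""));
        try {
            do {
                Term t = terms.term();
                if (t == null || !"text".equals(t.field())) break;
                if (t.text().indexOf(' ') >= 0) {
                    System.out.println(terms.docFreq() + "\t" + t.text());
                }
            } while (terms.next());
        } finally {
            terms.close();
            reader.close();
        }
    }
}

Note that docFreq() counts the documents containing a shingle, not its total occurrences; sorting this output by count gives a list of the top occurring phrases.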
A: 

Well, the problem of losing the context for phrases can be solved by using PhraseQuery.

An index by default contains positional information of terms, as long as you did not create pure Boolean fields by indexing with the omitTermFreqAndPositions option. PhraseQuery uses this information to locate documents where terms are within a certain distance of one another.

For example, suppose a field contained the phrase “the quick brown fox jumped over the lazy dog”. Without knowing the exact phrase, you can still find this document by searching for documents with fields having quick and fox near each other. Sure, a plain TermQuery would do the trick to locate this document knowing either of those words, but in this case we only want documents that have phrases where the words are either exactly side by side (quick fox) or have one word in between (quick [irrelevant] fox). The maximum allowable positional distance between terms to be considered a match is called slop. Distance is the number of positional moves of terms to reconstruct the phrase in order.
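
A minimal sketch of that query in the Lucene 3.x API, using the field name "text" from the question:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// Matches docs where "quick" and "fox" appear with at most one
// position between them, e.g. "quick brown fox".
PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.add(new Term("text", "quick"));
phraseQuery.add(new Term("text", "fox"));
phraseQuery.setSlop(1); // slop 0 would require the exact phrase "quick fox"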

Check out Lucene's JavaDoc for PhraseQuery

See this example code, which demonstrates how to work with various Query objects.

You can also try to combine various query types with the help of the BooleanQuery class.
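
For instance, something like this (reusing the phraseQuery built in the sketch above; the extra term "retrieval" is purely illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Require the phrase match, and score documents higher if they
// also contain the extra term.
BooleanQuery combined = new BooleanQuery();
combined.add(phraseQuery, BooleanClause.Occur.MUST);
combined.add(new TermQuery(new Term("text", "retrieval")), BooleanClause.Occur.SHOULD);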

And regarding the frequency of phrases, I suppose Lucene's scoring considers the frequency of the terms occurring in the documents.

Abhinav Upadhyay