Hi All,

I am using Lucene search to get the articles that match the search text. Is there any way to get them ordered by the number of occurrences of the search text in each article, with the most occurrences first?

Example: if my search text is "stack", and the first article contains two occurrences of the word "stack" while the second contains three, then the second article should come first and the first article second.

Any idea how I can get this done?

Below is the code that I am using:

List<LuceneSearchResult> searchResult = new List<LuceneSearchResult>();
LuceneSearchResult result;
IndexReader reader = IndexReader.Open(INDEX_DIR);
Searcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer();

// "Text" is the default field; "Text" and "Type" are field names in the index.
QueryParser parser = new QueryParser("Text", analyzer);

Query q = parser.Parse(string.Format("Text:{0} AND Type:{1}", finalText, type));
Hits hs = searcher.Search(q);
ArrayList idList = new ArrayList();
for (int i = 0; i < hs.Length(); i++)
{
    Document doc = hs.Doc(i);
    result = new LuceneSearchResult();
    result.ID = doc.Get("ID");
    result.Type = doc.Get("Type");

    // Keep only the first (highest-scoring) hit for each ID.
    if (!idList.Contains(result.ID))
    {
        searchResult.Add(result);
        idList.Add(result.ID);
    }
}
return searchResult.ToArray();
+1  A: 

Lucene should do this automatically, but it depends partly on how you formulate your query. By default, if you query with more than one word, the terms are ORed together. For example, say your query was something like this (searching the contents field):

contents:apples oranges

This would return any pages with the term "apples" OR "oranges" in them. A page that contains the word "apples" 50 times but never mentions "oranges" could still rank higher than a page that contains "apples" once and "oranges" once.

What you probably want to do is AND your query like this:

contents:apples AND oranges

Note: AND must be uppercase.

This will only return pages that contain both the word "apples" AND the word "oranges", which is probably nearer to what you want.
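
To make the OR versus AND difference concrete, here is a minimal sketch in plain Java. It is a toy model only: the pages are plain strings, `BooleanMatchSketch`, `matchAny` and `matchAll` are hypothetical names made up for illustration, and nothing here is Lucene API or reflects how Lucene scores the matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: each "page" is a bag of words, not a real Lucene index.
final class BooleanMatchSketch {

    // OR semantics: a page matches if it contains at least one query term.
    static List<Integer> matchAny(List<String> pages, List<String> terms) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            Set<String> words = tokenize(pages.get(i));
            for (String term : terms) {
                if (words.contains(term)) {
                    hits.add(i);
                    break; // one matching term is enough
                }
            }
        }
        return hits;
    }

    // AND semantics: a page matches only if it contains every query term.
    static List<Integer> matchAll(List<String> pages, List<String> terms) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            if (tokenize(pages.get(i)).containsAll(terms)) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Lowercase and split on whitespace; a crude stand-in for an analyzer.
    private static Set<String> tokenize(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
            "apples apples apples",
            "apples and oranges",
            "pears only");
        List<String> terms = Arrays.asList("apples", "oranges");
        System.out.println("OR matches:  " + matchAny(pages, terms));
        System.out.println("AND matches: " + matchAll(pages, terms));
    }
}
```

With these three sample pages, the OR query matches the first two pages (indices 0 and 1), while the AND query matches only the second page (index 1).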

Have a read of Lucene - Query Parser Syntax for more info on how to formulate queries.

Dan Diplo
A: 

I agree with Dan that this should be Lucene's default behavior. If your implementation does not behave this way, please add details so we can help you diagnose why. Lucene's Similarity class documentation explains the details of Lucene scoring, which is responsible for the order of the hits.

Yuval F
I am using AND in the query. Please see above; I have included the code that I am using.
Pranali Desai
A: 

At first sight, your code looks like it should work as expected.
Could you show us an example of finalText, type, and the results?
When I get unexpected results, I usually check what query was actually used (in debug mode check the value of q) and use that query in Luke to see what results it gives.

In my code, I usually use hits.Max instead of hits.Length. Don't know what the difference is, but it's something I noted.

Also, as a side note: unless the rest of your program dictates otherwise, you might want to use a Hashtable instead of an ArrayList for your idList; lookups are usually faster.
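
A rough sketch of that de-duplication idea, written in plain Java rather than the asker's C#. It uses a hash-based set instead of a list for the "have I seen this ID?" check. `DedupeSketch` and `distinctInOrder` are hypothetical names for illustration, and the IDs are assumed to be plain strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DedupeSketch {

    // Keeps the first occurrence of each ID and preserves the incoming order,
    // so results that arrive sorted by score stay sorted by score.
    static List<String> distinctInOrder(List<String> ids) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String id : ids) {
            // add() returns false if the ID was already present. This is a
            // constant-time check on average, whereas a Contains call on a
            // list scans the list element by element.
            if (seen.add(id)) {
                unique.add(id);
            }
        }
        return unique;
    }
}
```

The difference only matters once the result list gets large; for a handful of hits either approach is fine.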

borisCallens
A: 

I have googled around and found that Lucene lists the search results in order of score, which is not simply the number of occurrences of the phrase but is calculated from various factors. I therefore think it will not be possible to get this from Lucene directly, but if you find a way, please let me know.
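
One possible workaround, sketched in plain Java: count the raw occurrences yourself after the search and re-sort the results. This assumes the article text is available once the hits come back (for example from a stored field or a database), which may not be true of your index. `OccurrenceSortSketch`, `countOccurrences` and `sortByOccurrences` are hypothetical helper names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class OccurrenceSortSketch {

    // Counts non-overlapping, case-insensitive occurrences of term in text.
    // This is plain substring matching, so "stack" also matches inside "stacks".
    static int countOccurrences(String text, String term) {
        if (term.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase();
        String needle = term.toLowerCase();
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    // Returns a copy of the texts ordered by occurrence count, highest first.
    // List.sort is stable, so texts with equal counts keep their original order.
    static List<String> sortByOccurrences(List<String> texts, String term) {
        List<String> sorted = new ArrayList<>(texts);
        sorted.sort(Comparator.comparingInt((String t) -> countOccurrences(t, term)).reversed());
        return sorted;
    }
}
```

The obvious cost is that every matching article has to be loaded and scanned, so this is only reasonable for modest result sets.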

Pranali Desai
+2  A: 

Lucene ranks documents by score. There are several components to the score for a document for a given query. One of them is the frequency of the term in the field queried. For a search on a single term, the calculation is pretty simple: the score is proportional to the square root of the number of occurrences of the term in the field, normalized by field length. This could be where you are running into trouble.

If you search for the word "stack" and doc A has 1 occurrence while doc B has 2 occurrences, doc A could still rank higher in the results if doc B's field is significantly longer than doc A's.

The good news is that you can disable field normalization. The bad news is that you need to do it before you index, unless you override the Similarity class to always factor it out, which I wouldn't recommend. To disable norms at index time, call Field.setOmitNorms(true) in your indexing code on the Field object, before adding it to the Document you pass to the IndexWriter. In your case this would be the "Text" field.
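
To see why normalization can flip the order, here is a simplified sketch of the single-term case in plain Java. It assumes the classic default behaviour where the length norm is 1/sqrt(number of terms in the field); idf, boosts and the lossy encoding of norms are deliberately left out, and `TfNormSketch.score` is a hypothetical helper, not a Lucene call.

```java
final class TfNormSketch {

    // Simplified single-term score: sqrt(freq), optionally scaled by
    // 1/sqrt(fieldLength). The omitNorms flag mimics indexing the field
    // with norms disabled; it is an illustration, not Lucene's real code path.
    static double score(int freq, int fieldLength, boolean omitNorms) {
        double tf = Math.sqrt(freq);
        double lengthNorm = omitNorms ? 1.0 : 1.0 / Math.sqrt(fieldLength);
        return tf * lengthNorm;
    }

    public static void main(String[] args) {
        // Doc A: 1 occurrence in a 4-term field.
        // Doc B: 2 occurrences in a 100-term field.
        System.out.printf("with norms:    A=%.3f B=%.3f%n",
            score(1, 4, false), score(2, 100, false));
        System.out.printf("without norms: A=%.3f B=%.3f%n",
            score(1, 4, true), score(2, 100, true));
    }
}
```

With norms, doc A (0.5) outranks doc B (about 0.141) even though B contains the term twice. Without norms, the order follows the term frequency alone: A scores 1.0 and B about 1.414.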

KenE
Hi KenE, this sounds great, but where do I call Field.setOmitNorms(true)?
Pranali Desai
You would call it in your indexing code.
KenE