Hi,

I am trying to index a database table using Lucene. I use Lucene only for indexing; the Fields are not stored. The table has five columns: userid (PK), description, reportnumber, reporttype, report.

I intend to use the combination of userid, reportnumber and reporttype to get the data back from the database if Lucene finds a hit.

One record in the table can span multiple rows, for example:

JQ123, SOMEDESCRIPTION, 1, FIN, content of fin report
JQ123, AnotherDescription, 2, MATH, content of math report
JQ123, YetAnotherDesc, 3, MATH, content of another math report
JD456, MoreDesc, 1, STAT, content of stat report
... and so on

Some report types (e.g. MATH) have highly structured content (XML, stored as a string in the last column), and in the future I may want to extract some of that content into separate Fields of the document.

My strategy so far has been to create a Lucene Document for every row and index it. My reasoning is that (1) it is easy and seems logical (to me), and (2) if I end up extracting content from certain report types and making it into Fields, all that would be needed is an if statement that checks the report type and creates these new Fields. Here is the relevant code:

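import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Builds one Lucene Document for the current table row. userID, reportNum,
// reportType, description, report and indexwriter are set elsewhere.
public void createDocument() throws IOException {
    Document luceneDocument = new Document();
    // Key columns: indexed as single, untokenized terms.
    luceneDocument.add(new Field("userid", userID, Field.Store.NO, Field.Index.NOT_ANALYZED));
    luceneDocument.add(new Field("reportnumber", reportNum, Field.Store.NO, Field.Index.NOT_ANALYZED));
    luceneDocument.add(new Field("reporttype", reportType, Field.Store.NO, Field.Index.NOT_ANALYZED));
    // Free-text columns: run through the analyzer.
    luceneDocument.add(new Field("description", description, Field.Store.NO, Field.Index.ANALYZED));
    luceneDocument.add(new Field("report", report, Field.Store.NO, Field.Index.ANALYZED));

    if (reportType.equalsIgnoreCase("MATH")) {
        // Placeholder for Fields extracted from the XML content.
        luceneDocument.add(new Field("more fields", fieldContent, Field.Store.NO, Field.Index.ANALYZED));
    }
    indexwriter.addDocument(luceneDocument);
    indexwriter.close();
}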

1. Does having several Documents for the same record affect Lucene's search efficiency in any way?
2. Would this approach have any significant disk space overhead compared to having one Document per record in Lucene (I do not store any Fields)?

Thanks in advance for your response,

A: 

First, note how the index is set up. The entry for each term looks like:

[term][docid][docid]...

where each [docid] is the ID of a document that contains that term. So to answer your questions:

  1. If, e.g., the MATH and STAT reports of one record contained the same term, that term's list would hold two doc IDs instead of one. The search would then have to look at two documents where, with one Document per record, it would only need to look at one. But this is a very minor penalty.
  2. I assume you have to store at least an ID for each document, so you will see a minor storage increase, roughly (length of ID) * (number of documents per record). Again, this is trivial.
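To make the first point concrete, here is a toy inverted index in plain Java. It is only a sketch of the [term][docid][docid]... layout described above, not Lucene's actual data structures; the class name `PostingListSketch` and the sample texts are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index (NOT Lucene internals): term -> list of doc IDs.
class PostingListSketch {
    // Assigns doc IDs 0, 1, 2, ... in list order and records which docs contain each term.
    static Map<String, List<Integer>> buildIndex(List<String> docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                List<Integer> ids = postings.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each document at most once per term.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                    ids.add(docId);
                }
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        // One document per row: two reports of the same record are separate docs.
        List<String> perRow = Arrays.asList("quarterly variance report", "variance of the sample");
        // One document per record: the same text merged into a single doc.
        List<String> perRecord = Arrays.asList("quarterly variance report variance of the sample");

        System.out.println(buildIndex(perRow).get("variance"));    // prints [0, 1]
        System.out.println(buildIndex(perRecord).get("variance")); // prints [0]
    }
}
```

The only difference is one extra doc ID in the term's list, which is why the cost is small.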

A more important problem is that scores can't be normalized appropriately across the documents of one record. For example, suppose a search finds record #1, which matches in both its MATH and STAT reports, and record #2, which matches only in MATH. You will need to rank record #1 higher manually, because Lucene won't know that the two matching documents belong to the same record.
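One possible way to do that ranking by hand is to group the hits by record key and combine their scores. The sketch below is plain Java with no Lucene calls. It assumes each hit can be mapped back to its userid (for instance by storing that one field), and the `Hit` type, the scores, and the choice of summing are all illustrative assumptions, not something Lucene prescribes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RecordRanker {
    // Hypothetical holder for one per-row hit: the record key plus the score Lucene returned.
    static class Hit {
        final String userId;
        final float score;

        Hit(String userId, float score) {
            this.userId = userId;
            this.score = score;
        }
    }

    // Sums the scores of all hits that share a userId and returns the userIds, best first.
    // Summing is an assumption; taking the maximum would be another reasonable choice.
    static List<String> rankByRecord(List<Hit> hits) {
        Map<String, Float> totals = new HashMap<>();
        for (Hit hit : hits) {
            totals.merge(hit.userId, hit.score, Float::sum);
        }
        List<String> ranked = new ArrayList<>(totals.keySet());
        ranked.sort((a, b) -> Float.compare(totals.get(b), totals.get(a)));
        return ranked;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("JQ123", 0.4f),   // e.g. the FIN report matched
                new Hit("JD456", 0.7f),   // e.g. the STAT report matched
                new Hit("JQ123", 0.5f));  // e.g. a MATH report matched
        // JD456 has the best single hit, but JQ123 wins once its two hits are combined.
        System.out.println(rankByRecord(hits)); // prints [JQ123, JD456]
    }
}
```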

In short: unless you have some absolutely massive index, I wouldn't worry much about storage/performance. But I would worry about how you're going to score that query.

Xodarap