I am wondering whether we can achieve some RDBMS capabilities in Lucene.

Example: 1) I have 10,000 project documents (PDF files) whose content has to be indexed to make them searchable. 2) Every document is related to a SINGLE PROJECT. The project has details like project name, number, start date, end date, location, type etc.

I have to search the contents of the PDF files for a given keyword, but when displaying the results I want to show the project meta data mentioned in point (2).

My idea is to associate a field called projectId with each PDF file while indexing. Once a search returns that id, we run a second search to get the project meta data.

This way we avoid duplicated data. Also, if we want to update the project meta data, we update it in a SINGLE PLACE only. Otherwise, if we store the meta data with every PDF document in the index, we end up updating all of the documents, which is what I want to avoid.

Please advise.

+1  A: 

If I understand you correctly, you have two questions:

  1. Can I store a project id in Lucene and use it for further searches? Yes, you can. This is a common practice.
  2. Can I use this project id to search Lucene for project meta data? Yes, you can. I do not know if this is a good idea. It depends on the frequency of your meta data updates and your access pattern. If the meta data is relatively static, and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project id as a primary key to a database table, which could be a better fit.
Yuval F
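The two-stage lookup described above can be sketched with plain Python structures. This is a minimal illustration only: it does not use the Lucene API, and `DocumentIndex`, `search_with_metadata`, the file names and the project details are all hypothetical stand-ins for a document index and a separate meta data store keyed by project id.

```python
from collections import defaultdict


class DocumentIndex:
    """Hypothetical stand-in for the document index: term -> document ids."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids
        self.project_of = {}              # doc id -> stored project id

    def add(self, doc_id, project_id, text):
        # Store the project id alongside the document, then index its terms.
        self.project_of[doc_id] = project_id
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, keyword):
        # Stage 1: find the documents containing the keyword.
        return sorted(self.postings.get(keyword.lower(), set()))


def search_with_metadata(doc_index, project_meta, keyword):
    """Stage 2: resolve each hit's project id against the meta data store."""
    results = []
    for doc_id in doc_index.search(keyword):
        project_id = doc_index.project_of[doc_id]
        results.append((doc_id, project_meta[project_id]))
    return results


docs = DocumentIndex()
docs.add("a.pdf", "P1", "bridge load analysis")
docs.add("b.pdf", "P2", "Bridge inspection report")
docs.add("c.pdf", "P1", "budget summary")

# Meta data lives in exactly one place per project (made-up values).
projects = {
    "P1": {"name": "Harbour Bridge", "location": "Sydney"},
    "P2": {"name": "River Crossing", "location": "Leeds"},
}

hits = search_with_metadata(docs, projects, "bridge")
```

Because each project's meta data is held once, changing `projects["P1"]` is reflected for every document that refers to `P1` on the next search, without reindexing any document.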
Hi, all indexes would be in Lucene only; there will not be any database communication. The Lucene structure would be: 1) index directory 1 holds the index for the documents, each with a product id; 2) index directory 2 holds the index for the product meta data, which contains the product id. The main idea behind this is to reduce the Lucene index size. Otherwise each of those 10,000 documents would carry the product meta data, which is repeated data. So I want a separate, single product meta data index that is looked up using the product id stored in the document index.
KP
Fine. You can support queries of the type "give me all documents having product id nnn" or "give me meta data for product ids aaa, bbb". You can even have a two-stage query that amounts to "give me all meta data for the products relevant to these documents". This is less flexible than an RDBMS, but it seems enough for your use case. If you want range queries, you may need to pad your ids with zeros.
Yuval F
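The zero-padding advice matters when ids are indexed as text, because a term range then compares strings lexicographically rather than numerically. A small sketch of the effect follows; it uses plain Python string comparison, not Lucene, and `pad_id`, `in_term_range` and the width of 6 are illustrative assumptions.

```python
def pad_id(n, width=6):
    """Left-pad a numeric id with zeros so string order matches numeric order."""
    return str(n).zfill(width)


def in_term_range(term, lower, upper):
    """Inclusive range check using plain string (lexicographic) comparison."""
    return lower <= term <= upper


ids = (9, 10, 100)
raw_terms = [str(n) for n in ids]
padded_terms = [pad_id(n) for n in ids]

# Intended range: ids 9 through 100.
# Unpadded, "9" sorts after "100", so the range is empty as strings.
raw_hits = [t for t in raw_terms if in_term_range(t, "9", "100")]

# Padded, "000009" <= "000010" <= "000100", so all three ids match.
padded_hits = [t for t in padded_terms if in_term_range(t, pad_id(9), pad_id(100))]
```

The padding width has to be chosen up front to cover the largest id you expect, since every indexed id and every query bound must use the same width.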
A: 

Sounds like a perfectly good thing to do. The only limitation you'll have (by storing a reference to the project in Lucene rather than the project data itself) is that you won't be able to query both the document text and the project metadata in a single query. For example, "documentText:foo OR projectName:bar". If you have no such requirement, then storing an ID in Lucene that refers to a database row is a fine thing to do.

bajafresh4life
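To make the limitation concrete: with the metadata held outside the document index, a query like "documentText:foo OR projectName:bar" has to be emulated in application code as two lookups whose results are merged. The sketch below assumes simple in-memory dictionaries in place of the index and the metadata store; all function names and data are hypothetical, and it returns an unranked set rather than scored results.

```python
def docs_matching_text(documents, keyword):
    """Lookup 1: documents whose text contains the keyword."""
    return {
        doc_id
        for doc_id, doc in documents.items()
        if keyword in doc["text"].lower().split()
    }


def projects_matching_name(projects, keyword):
    """Lookup 2: projects whose name contains the keyword."""
    return {
        project_id
        for project_id, project in projects.items()
        if keyword in project["name"].lower().split()
    }


def text_or_project_name(documents, projects, text_keyword, name_keyword):
    """Emulate 'documentText:X OR projectName:Y' by merging two lookups."""
    by_text = docs_matching_text(documents, text_keyword)
    matching_projects = projects_matching_name(projects, name_keyword)
    # Map matching projects back to their documents via the stored id.
    by_project = {
        doc_id
        for doc_id, doc in documents.items()
        if doc["project_id"] in matching_projects
    }
    return by_text | by_project


documents = {
    "a.pdf": {"project_id": "P1", "text": "foo design notes"},
    "b.pdf": {"project_id": "P2", "text": "meeting minutes"},
    "c.pdf": {"project_id": "P3", "text": "site survey"},
}
projects = {
    "P1": {"name": "alpha"},
    "P2": {"name": "bar tower"},
    "P3": {"name": "gamma"},
}

matches = text_or_project_name(documents, projects, "foo", "bar")
```

Here `a.pdf` matches on its text and `b.pdf` matches through its project name. The merge step is the extra work that a single index containing both fields would not need.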
A: 

I am not sure about your overall setup, but maybe Hibernate Search is for you. It would allow you to combine the benefits of a relational database with the power of a full-text search engine like Lucene. The meta data could live in the database, maybe together with the original PDF documents, while the Lucene documents just contain the searchable data.

Hardy
+1  A: 

This is definitely possible. But always be aware that you're using Lucene for something it was not intended for. In general, Lucene is designed for full-text search, not for mapping relational content. So the more complex your relational content becomes, the more you'll see a decrease in performance.

In particular, there are a few areas to keep a close eye on:

  • Storing the value of each field in your index will decrease performance. If you are not overly concerned with sub-second search results, or if your index is relatively small, then this may not be a problem.
  • Also, be aware that if you are not using the default ranking algorithm, and your custom algorithm requires information about the project in order to calculate the score for each document, this will have a dramatic impact on search performance, as well.

If you need a more powerful index that was designed for relational content, there are hierarchical indexing tools out there (one developed by Apache, called Jackrabbit) that are worth looking into.

As your project continues to grow, you might also check out Solr, also developed by Apache, which provides some added functionality, such as faceted search.

ph0enix
A: 

You can use Lucene that way.

Pros:

Full-text search is easy to implement, which is not the case in an RDBMS.

Cons:

Referential integrity: you get it for free in an RDBMS, but in Lucene, you must implement it yourself.

lbp
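As a rough idea of what implementing referential integrity yourself could involve, the sketch below checks for documents that point at a missing project and refuses to delete a project that is still referenced. It is an assumption-laden illustration over plain dictionaries, not Lucene code; `orphaned_documents` and `delete_project` are hypothetical helpers.

```python
def orphaned_documents(doc_project_ids, project_meta):
    """Return ids of documents whose project id has no meta data entry."""
    return sorted(
        doc_id
        for doc_id, project_id in doc_project_ids.items()
        if project_id not in project_meta
    )


def delete_project(project_meta, doc_project_ids, project_id):
    """Delete a project only if no document still refers to it."""
    if project_id in doc_project_ids.values():
        raise ValueError(f"project {project_id} is still referenced by documents")
    del project_meta[project_id]


# doc id -> project id, as stored with each indexed document (made-up data).
doc_project_ids = {"a.pdf": "P1", "b.pdf": "P2", "c.pdf": "P9"}
project_meta = {"P1": {"name": "alpha"}, "P2": {"name": "beta"}, "P3": {"name": "gamma"}}

# "c.pdf" refers to P9, which has no meta data: a dangling reference.
orphans = orphaned_documents(doc_project_ids, project_meta)

# P3 is unreferenced, so it can be removed safely.
delete_project(project_meta, doc_project_ids, "P3")
```

An RDBMS would enforce both rules through a foreign key constraint; with two separate indexes, checks like these have to run in the code that writes to them.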