I need to index a large number of Java properties and manifest files.

The data in the files is just key-value pairs.

I am thinking of using Lucene for this.

However, I do not need any real full-text search capabilities, as the data is quite structured. I only need to search for exact matches of property values, and the property key is always known. There is no need for tokenizing, and there is also no "default" field. The number of unique property keys could be quite large.

I should also add that I hope to be able to hold the index entirely in memory (in Lucene that would be a RAMDirectory).

So, is Lucene (as primarily a full-text search-engine) still a good match, or does something else fit better?

Update: A simple HashMap will not do, because I want to find the files that define property A as value B. It would need to be at least a nested HashMap to hold the triples (Key, Value, Filename).
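For illustration, the nested map described in the update could look roughly like the sketch below. The class and method names (`PropertyIndex`, `add`, `find`) are made up for this example and are not from any library; it assumes everything fits in memory and that only exact matches are needed.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: an in-memory index of (key, value, filename) triples.
class PropertyIndex {
    // property key -> property value -> names of the files that define key=value
    private final Map<String, Map<String, Set<String>>> index = new HashMap<>();

    // Record that the given file defines key=value.
    void add(String fileName, String key, String value) {
        index.computeIfAbsent(key, k -> new HashMap<>())
             .computeIfAbsent(value, v -> new HashSet<>())
             .add(fileName);
    }

    // Return the files that define key=value (empty set if there are none).
    Set<String> find(String key, String value) {
        Map<String, Set<String>> byValue = index.get(key);
        if (byValue == null) {
            return Collections.<String>emptySet();
        }
        Set<String> files = byValue.get(value);
        if (files == null) {
            return Collections.<String>emptySet();
        }
        return Collections.unmodifiableSet(files);
    }
}
```

A lookup such as `find("a", "b")` then costs two hash lookups, no matter how many files were indexed.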

+2  A: 

Yes, a Lucene index with a non-tokenized field per key will do the trick. It's also a bit of overkill; some sort of Map structure will probably be enough for what you are describing.

The main benefit of using Lucene here would be that it abstracts away the details into a fairly simple API.

Sindri Traustason
Single non-tokenized field? Wouldn't that be a separate field per property?
Thilo
Thilo, you are right, I actually misunderstood what he wants to find. Editing ...
Sindri Traustason
A: 

I would start with a simple HashMap, and if you run into memory problems then move to something more complicated like Lucene. You'd be surprised how efficient a HashMap can be.

If you want to start really simple, just use the Properties object itself: it is a subclass of Hashtable (see HashMap vs Hashtable). You can easily use load(InputStream) to load multiple properties files into a single object, and then, if you decide to try HashMap, switch to it using new HashMap(propertiesObject).
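A minimal sketch of that idea, assuming Java 6 or later: `Properties.load` parses the text and the entries are then copied into a typed `HashMap`. The class name `PropertiesToMap` is invented for this example, and it reads from a `Reader` instead of an `InputStream` only so that it runs without a file on disk.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper: parse properties text and copy it into a HashMap.
class PropertiesToMap {
    static Map<String, String> load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source); // standard .properties parsing
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Copy entry by entry so the result is typed as String -> String.
        Map<String, String> map = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            map.put(name, props.getProperty(name));
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> map = load(new StringReader("a=b\nc=d\n"));
        System.out.println(map.get("a"));
    }
}
```

Copying through `stringPropertyNames()` avoids the raw-type warning that `new HashMap(propertiesObject)` produces.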

Spyder
The thing is that I have many Properties objects, not just one, and I want to search across properties files, not within a property file. I need to get the list of files that say "a=b".
Thilo
Ah, that's a bit clearer. I guess if you can guarantee uniqueness you could read it in as text and then make each line the key and the filename the value. But I think at that point some kind of database-type solution would be better.
Spyder
A: 

If you don't need full-text searching, and only want to represent a large key-value map, then I suggest that Lucene is inappropriate.

I'd suggest something like EhCache, which allows you to hold a large chunk of the data in RAM, but can swap out to a disk file if it gets too large.

skaffman
It is not really one large key-value map, more like a large collection of small key-value maps that I want to search across (not within). I want to find all the maps that say "a=b" and "c=d" for example.
Thilo
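A query such as "a=b" and "c=d" amounts to intersecting the sets of files for each pair. The sketch below assumes a flat index that maps the literal text "key=value" to the set of files containing it, along the lines of the "each line is the key" idea in the comments above. `MultiCriteriaSearch` and `filesMatchingAll` are hypothetical names.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: find the files that contain every requested key=value pair.
class MultiCriteriaSearch {
    // index: the text "key=value" -> names of the files that contain that pair
    static Set<String> filesMatchingAll(Map<String, Set<String>> index, String... pairs) {
        Set<String> result = null;
        for (String pair : pairs) {
            Set<String> files = index.getOrDefault(pair, Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<>(files); // copy, so the index itself is never modified
            } else {
                result.retainAll(files);       // keep only files that also contain this pair
            }
            if (result.isEmpty()) {
                break;                         // no file can match any more
            }
        }
        return result == null ? Collections.<String>emptySet() : result;
    }
}
```

Starting with the pair that matches the fewest files keeps the intermediate sets small, although the sketch does not attempt that ordering.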
A: 

Take a look at jdbm - it is a light-weight, open source object database that has a fast B+Tree implementation that should work for you. If you don't need high-reliability, you can turn off the log part of the database (this makes inserts much faster, at the risk of corrupting the database if you have a power failure in the middle of a write).

We've been using jdbm in several production projects for 4 or 5 years now with some really, really big data sets.

If you can hold the entire index in memory, though, you'd probably be better off using a TreeMap (or multiple TreeMaps if you need to also do reverse indexing), and just serialize it if you need to save to disk.
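As a rough sketch of the TreeMap route, the index below maps a "key=value" string to a sorted set of file names and is saved with standard Java serialization. The class name `TreeMapIndexIO` and the choice of `TreeSet<String>` values are assumptions for this example; it serializes to a byte array so that it is self-contained, and a `FileOutputStream` / `FileInputStream` pair could take its place for saving to disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: save and restore a TreeMap-based index via Java serialization.
class TreeMapIndexIO {
    // Serialize the whole index; TreeMap, TreeSet and String are all Serializable.
    static byte[] toBytes(TreeMap<String, TreeSet<String>> index) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(index);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Restore an index previously produced by toBytes.
    @SuppressWarnings("unchecked")
    static TreeMap<String, TreeSet<String>> fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (TreeMap<String, TreeSet<String>>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because a TreeMap keeps its keys sorted, the same structure also allows prefix or range scans (for example via `subMap`), which a HashMap cannot offer.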

Kevin Day
+1 for an embedded db. From the comments it seems that a classical back-forth indexed table is enough for the use cases.
kd304
A: 

@Thilo, Hi!

I have the same task: indexing files by key/value properties for fast searching. Could you share your experience? I want to use Lucene, but maybe you found something else?

Thank you!

Edward83