views:

335

answers:

8

I need to develop a system for storing large numbers (tens to hundreds of thousands) of objects. Each object is email-like: there is a main text body and several ancillary text fields of limited size. A body will range from a few bytes to several KB in size.

Each item will have a single unique ID (probably a GUID) that identifies it.

The store will only be written to when an object is added to it. It will be read often. Deletions will be rare. The data is almost all human readable text so it will be readily compressible.

A system that lets me issue the I/Os and manage the memory and caching would be ideal.

I'm going to keep the indexes in memory, using them to map to the single (and primary) key of each object. Once I have the key, I'll load the object from disk or from the cache.
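To make the access pattern above concrete, here is a minimal sketch of an append-only record file with an in-memory key-to-offset index, compressed bodies, and a small read cache. It is written in Python for brevity, not as a recommendation over C++ or C#. The class name `AppendOnlyStore`, the 20-byte record header, and the cache size are all assumptions made for illustration, and deletions are not handled.

```python
import os
import struct
import uuid
import zlib
from functools import lru_cache


class AppendOnlyStore:
    """Append-only record file with an in-memory {key: (offset, length)} index.

    Hypothetical layout: each record is a 16-byte GUID, a 4-byte
    little-endian payload length, then the zlib-compressed body.
    """

    _HEADER = struct.Struct("<16sI")

    def __init__(self, path, cache_size=1024):
        self._index = {}
        open(path, "ab").close()          # create the file if it is missing
        self._file = open(path, "r+b")
        self._rebuild_index()
        # Per-instance LRU cache of decompressed bodies.
        self._cached_read = lru_cache(maxsize=cache_size)(self._read_from_disk)

    def _rebuild_index(self):
        # Scan record headers once at startup; bodies are skipped, not read.
        self._file.seek(0)
        while True:
            header = self._file.read(self._HEADER.size)
            if len(header) < self._HEADER.size:
                break
            key_bytes, length = self._HEADER.unpack(header)
            self._index[uuid.UUID(bytes=key_bytes)] = (self._file.tell(), length)
            self._file.seek(length, os.SEEK_CUR)

    def add(self, body):
        """Append a new text body and return its freshly generated GUID."""
        key = uuid.uuid4()
        payload = zlib.compress(body.encode("utf-8"))
        self._file.seek(0, os.SEEK_END)
        self._file.write(self._HEADER.pack(key.bytes, len(payload)))
        offset = self._file.tell()
        self._file.write(payload)
        self._file.flush()
        self._index[key] = (offset, len(payload))
        return key

    def _read_from_disk(self, key):
        offset, length = self._index[key]
        self._file.seek(offset)
        return zlib.decompress(self._file.read(length)).decode("utf-8")

    def get(self, key):
        """Return the body for a key, from the cache when possible."""
        return self._cached_read(key)

    def close(self):
        self._file.close()
```

Usage would look like `store = AppendOnlyStore("objects.dat")`, then `key = store.add("some text")` and `store.get(key)`. Because the file is only ever appended to, the index can be rebuilt from the file at startup, at the cost of one pass over the record headers.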

The data management system needs to be part of my application. I do not want to depend on OS services or separately installed packages. Native (C++) would be best, but a managed (C#) solution would be OK.

I believe that a database is an obvious choice, but this needs to be super-fast for lookup and for loading an object into memory. I am not experienced with database tech, and I'm concerned that general relational systems will not handle all this variable-sized data efficiently.

(Note, this has nothing to do with my job; it's a personal project.)

In your experience, what are the viable alternatives to a traditional relational DB? Or would a DB work well for this?

+1  A: 

You don't really indicate how you will be searching this data. I've done some similar work with some text mining applications where the main data is stored in MySQL but I maintain a textual search index in Ferret (the project is in Ruby) to find the appropriate row in the messages table based on keyword search. I think this hybrid approach could work for you as well. SQLServer and Lucene.Net may work well for you in the C# environment. I'm sure if you look around you can find similar solutions in the C++ space.

I don't recommend using SQLServer full text search -- Lucene and its derivatives seem to be a much better choice.

I think that you would have much better luck with just about any DB solution over a file-based solution. Just about any modern database should be able to handle your data requirements, at least space-wise. Building the indexes on your large field is a different matter and is why I would recommend a text mining approach if you need to search over it.

tvanfosson
Hi Tvanfosson, I'm going to keep the index in memory, using it to map to the single (and primary) key for the objects. Once I have the key, I'll load the object from disk or from the cache. Thanks for the advice :)
Foredecker
A: 

Sounds like just what Berkeley DB was designed for. I haven't used it, however.

Darius Bacon
A: 

Maybe you should give some thought to a WebDav-Server like Apache+mod-dav. This will store the content and metadata on disk. For searching you may place an existing search engine on top of this WebDav server, e.g. Lucene.

This way you keep your own development to a minimum and start off with a powerful bunch of features.

mkoeller
+2  A: 

I would give PFS a try: http://blog.sensenet.hu/post/2008/05/Portal-File-System-(PFS)-an-open-source-content-repository-for-Net.aspx

Too bad you're on c/.Net, as Jackrabbit would have been a perfect choice.

Bogdan
A: 

Have a look at Glimpse.

Thevs
Strange people... If you don't get or like this idea, why down-vote?! A down-vote should at least come with an explanation of where I was wrong.
Thevs
The link you gave points to something that is not at all a solution to the question. I can think of ways that might make it do some of what is asked for but it would be a DailyWTF worthy hack.
BCS
Just have a look again. It fits all of the mentioned requirements, except maybe integration with C++, but it has a linkable library. It has indexes, a cache, and even a server for requests. And it's superfast.
Thevs
...Server for handling large number of requests...
Thevs
+2  A: 

Look at SQLite. It has bindings for many programming languages and environments and is, like Berkeley DB, an on-disk database that needs no separate database engine installation.

If you just add the right indexes, lookups will be very fast, and since it is a set-based database at heart, you can still do bulk queries and similar.
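As a rough illustration of that approach, the sketch below stores email-like objects in SQLite keyed by a GUID and looks them up by primary key. It uses Python's standard `sqlite3` module only to keep the example short; SQLite itself is a C library, so the same schema and queries apply from C++ or C#. The table name `messages`, the column names, and the choice to zlib-compress the body are assumptions for illustration, not something the question specifies.

```python
import sqlite3
import uuid
import zlib


def open_store(path=":memory:"):
    """Open (or create) the store; ':memory:' keeps this demo off the disk."""
    conn = sqlite3.connect(path)
    # Declaring id as PRIMARY KEY makes SQLite maintain a unique index on it,
    # so lookups by GUID do not scan the table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id TEXT PRIMARY KEY,"
        " subject TEXT,"
        " body BLOB NOT NULL)"
    )
    return conn


def add_message(conn, subject, body):
    """Insert one object and return its newly generated GUID string."""
    key = str(uuid.uuid4())
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO messages (id, subject, body) VALUES (?, ?, ?)",
            (key, subject, zlib.compress(body.encode("utf-8"))),
        )
    return key


def get_message(conn, key):
    """Return (subject, body) for a GUID, or None if it is not stored."""
    row = conn.execute(
        "SELECT subject, body FROM messages WHERE id = ?", (key,)
    ).fetchone()
    if row is None:
        return None
    subject, blob = row
    return subject, zlib.decompress(blob).decode("utf-8")
```

Variable-sized bodies are stored as ordinary BLOB values here, so no fixed record size has to be chosen up front. Whether compressing the body is worth the CPU cost would need to be measured on real data.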

Lasse V. Karlsen
A: 

Have you looked at db4o or Karvonite?

Gergely Orosz
A: 

Check Solid File System, which provides a great hierarchical storage (actually a virtual file system) for your data.

Eugene Mayevski 'EldoS Corp