views: 546

answers: 7
In the project I'm currently working on there is a need to save a sizable data structure to disk (edit: think dozens of MBs). Being an optimist, I thought that there must be a standard solution for such a problem; however, up to now I haven't found a solution that satisfies the following requirements:

  1. .NET 2.0 support, preferably with a FOSS implementation
  2. Version friendly (this should be interpreted as: reading an old version of the format should be relatively simple if the changes in the underlying data structure are simple, say adding/dropping fields)
  3. Ability to do some form of random access where part of the data can be extended after initial creation, without the need to deserialize the collection created up to this point in time (think of this as extending intermediate results)
  4. Space and time efficient (XML has been excluded as option given this requirement)

Options considered so far:

  • XmlSerializer: was turned down since XML serialization does not meet requirements 3 and 4.
  • SerializableAttribute: does not support requirements 2 and 3.
  • Protocol Buffers: was turned down by verdict of the documentation about Large Data Sets - since this comment suggested adding another layer on top, this would call for additional complexity which I wish to have handled by the file format itself.
  • HDF5, EXI: do not seem to have .NET implementations
  • SQLite/SQL Server Compact edition: the data structure at hand would result in a pretty complex table structure that seems too heavyweight for the intended use
  • BSON: does not appear to support requirement 3.
  • Fast Infoset: only seems to have paid .NET implementations.

Any recommendations or pointers are greatly appreciated. Furthermore if you believe any of the information above is not true, please provide pointers/examples to prove me wrong.

+6  A: 

Have you considered using SQL Server Compact Edition?

  1. It has plenty of .NET support
  2. The versioning of the schema, and the ability of new versions of your application to handle old schemas, would be entirely in your control. Versioning of SQL Server Compact itself should be fairly seamless, unless your application uses features from a newer version that did not exist in the older one.
  3. You have most of the SQL syntax available to you for querying.
  4. As the name suggests, this version of SQL Server was designed for embedded scenarios, including applications that want to avoid installing SQL Express or the full-blown version of SQL Server.

Now, this would have the same issues as SQLite in that the data structure, from what you have told us, could get complicated; but that will be true even if you roll your own binary format.

Btw, it occurs to me that you haven't clarified what exactly is meant by "sizeable". If "sizeable" means close to or more than 4 GB, obviously SQL Compact will not work nor will a host of other database file formats.

EDIT: I notice that you added SQL Compact Edition to your "too heavyweight" list after my post. SQL Compact requires only about 5 MB of RAM and 2 MB of disk storage, depending on the size of the database, so the problem cannot be that it is heavyweight. As to the second point, that the data structure would be pretty complicated: if that is true, I suspect it will be true of any relational database product, and rolling your own binary format will be even more complicated. Given that, you might look at non-relational database products such as MongoDB.
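As a rough sketch of how little ceremony is involved (the table and column names here are hypothetical, and assume a reference to the System.Data.SqlServerCe assembly):

```csharp
using System.Data.SqlServerCe;

class SqlCeSketch
{
    static void Main()
    {
        const string connStr = "Data Source=results.sdf";

        // Create the database file on first run.
        using (SqlCeEngine engine = new SqlCeEngine(connStr))
            engine.CreateDatabase();

        using (SqlCeConnection conn = new SqlCeConnection(connStr))
        {
            conn.Open();

            // Hypothetical schema: one table of intermediate results.
            using (SqlCeCommand create = new SqlCeCommand(
                "CREATE TABLE Results (Id INT IDENTITY PRIMARY KEY, Payload IMAGE)",
                conn))
                create.ExecuteNonQuery();

            // Appending later results does not require re-reading
            // earlier rows (your requirement 3).
            using (SqlCeCommand insert = new SqlCeCommand(
                "INSERT INTO Results (Payload) VALUES (@p)", conn))
            {
                insert.Parameters.AddWithValue("@p", new byte[] { 1, 2, 3 });
                insert.ExecuteNonQuery();
            }
        }
    }
}
```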

Thomas
I do think SQL CE or SQLite is the best approach. It's hard to make suggestions with no idea of the current data structure, but an embedded database certainly provides for all of the requirements. You also get the benefit of tools that allow you to query the tables/data directly in the file (for easy debugging/testing).
Dean Harding
I'm on board with this. If you want efficient random access to persisted data then you need a database, probably either relational or kvp. That's exactly what databases are *for*. It's the de facto standard and seems to satisfy all 4 requirements - and SQL CE/SQLite are far from "heavyweight."
Aaronaught
+1  A: 

Would you consider (B)JSON? If so, one of the document-oriented databases may fit your needs. CouchDB is a JSON document store with a REST API (definitely useable from .Net). CouchDB documents can have binary attachments and I've talked with people who have stored multi-MB attachments in documents without issue. I believe MongoDB, an alternative document database that uses binary JSON as a storage format, also has .Net bindings.

These "NoSQL" alternatives are easily versioned because they are essentially schema-free. JSON is quite compact, and they most certainly allow updates to the existing data.

Barry Wark
Please note that BSON is listed as one of the discarded options. Furthermore, I don't wish to store binary blobs, but .NET data structures that can be quite large and consist of many parts.
Bas Bossink
BSON is an implementation detail of the on-disk format, and for this use it is quite efficient. You most certainly can easily extend or update a document in MongoDB, negating your exclusion on requirement 3. You can serialize a data structure to a MongoDB document which you can then query, etc. Any on-disk storage is ultimately a binary blob; this or any storage scheme is a logical abstraction that makes working with the on-disk store easier. I don't think you'll find anything much better than a document database.
Barry Wark
I think a document based nosql db like mongo would suit the requirements fine + you get the scalability options as a bonus if ever needed.
Brimstedt
A: 

Have you looked at binary serialization?

See my post here for more info. It has sample code to serialize a custom class contained in a Dictionary object. Not sure how complex your structure is, but it should be pretty straightforward to adapt it to your needs.
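The core of it is only a few lines; here is a minimal, self-contained sketch using the standard BinaryFormatter (the Dictionary contents are just placeholders):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

class BinarySerializationSketch
{
    static void Main()
    {
        Dictionary<string, int> data = new Dictionary<string, int>();
        data["alpha"] = 1;
        data["beta"] = 2;

        BinaryFormatter formatter = new BinaryFormatter();

        // Serialize the whole dictionary to disk.
        using (FileStream fs = File.Create("data.bin"))
            formatter.Serialize(fs, data);

        // Deserialize it back.
        using (FileStream fs = File.OpenRead("data.bin"))
        {
            Dictionary<string, int> restored =
                (Dictionary<string, int>)formatter.Deserialize(fs);
            Console.WriteLine(restored["beta"]); // prints 2
        }
    }
}
```

Note that this round-trips the entire structure in one go, which is why it doesn't address your requirement 3 on its own.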

Add a comment if you need more help...

GalacticJello
See my latest edit; I'm aware of binary/XML serialization, but both options were turned down.
Bas Bossink
OK, but binary serialization != XML serialization. I would still check it out.
GalacticJello
A: 

If XML doesn't meet requirements due to space consumption, you could feed the XML through a System.IO.Compression.DeflateStream to reduce its size. The Deflate algorithm is essentially the same as GZip compression, but can be up to 40% faster (see Jeff Atwood's blog).
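Wrapping the serializer's stream is a one-line change; a minimal sketch (file name and payload type are illustrative):

```csharp
using System.IO;
using System.IO.Compression;
using System.Xml.Serialization;

class DeflateXmlSketch
{
    static void Main()
    {
        string[] data = { "some", "sizable", "payload" };
        XmlSerializer serializer = new XmlSerializer(typeof(string[]));

        // Compress the XML as it is written out.
        using (FileStream fs = File.Create("data.xml.deflate"))
        using (DeflateStream deflate =
            new DeflateStream(fs, CompressionMode.Compress))
            serializer.Serialize(deflate, data);

        // Decompress on the way back in.
        using (FileStream fs = File.OpenRead("data.xml.deflate"))
        using (DeflateStream inflate =
            new DeflateStream(fs, CompressionMode.Decompress))
        {
            string[] restored = (string[])serializer.Deserialize(inflate);
        }
    }
}
```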

Zach Johnson
XML is not seekable (no indexing) and compressed streams/files are not seekable either.
Paul de Vrieze
A: 

I wouldn't write off Protocol Buffers so quickly. Sure, the manual entry you reference says of the order of a megabyte, and you're dealing with tens of megabytes... but have you tried a study to see if this limitation impacts you?

If it still does impact you, my suggestion is to go with a hybrid approach: slice and dice your data set into 1 MB size chunks, and then store each chunk as a field of a SQLite table (as a binary blob). Add other fields to the table for the elements that you want to index on (or search by).

Yes, it adds complexity, but nothing else seems to be getting you near to where you need to go.
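A rough sketch of that hybrid layout, assuming the System.Data.SQLite ADO.NET provider (the table and column names are hypothetical, and the byte arrays stand in for protobuf-encoded chunks):

```csharp
using System.Data.SQLite;

class ChunkStoreSketch
{
    static void StoreChunks(byte[][] chunks)
    {
        using (SQLiteConnection conn =
            new SQLiteConnection("Data Source=chunks.db"))
        {
            conn.Open();

            // Seq gives you indexed, random access to chunks.
            using (SQLiteCommand create = new SQLiteCommand(
                "CREATE TABLE IF NOT EXISTS Chunk " +
                "(Seq INTEGER PRIMARY KEY, Data BLOB)", conn))
                create.ExecuteNonQuery();

            // Each ~1 MB protobuf-encoded slice becomes one row; new
            // slices can be appended later without touching existing rows.
            for (int i = 0; i < chunks.Length; i++)
            {
                using (SQLiteCommand insert = new SQLiteCommand(
                    "INSERT INTO Chunk (Seq, Data) VALUES (@seq, @data)",
                    conn))
                {
                    insert.Parameters.AddWithValue("@seq", i);
                    insert.Parameters.AddWithValue("@data", chunks[i]);
                    insert.ExecuteNonQuery();
                }
            }
        }
    }
}
```

Extra searchable fields would simply become additional columns alongside Data.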

Peter K.
+1  A: 

Have you considered something like db4o? The licensing might restrict you but it would seem to fit the bill otherwise.

Jon M
+1  A: 

Here is an interesting option to think of: ETCH from Cisco, available under Apache license (you pay no royalties and your software remains commercial and yours.)

The idea is to use Etch to communicate between components of your system in a binary form. The format is resilient to version changes and can handle missing fields etc., as your requirements state.

The benefit is that you gain a complete transfer system on top of the binary format. It is considered very fast (on a machine that performed 900 SOAP XML transactions per second, Etch managed 50,000 transactions).

You could store the binarized form in a lightweight RDBMS if you need multiple indices. If a single index is enough, a simple key/value store (CouchDB/MongoDB, or even Cassandra for distributed environments) would give you excellent storage performance as well!

Etamar L.