Our company has been looking for a while for a file format to hold a large amount of lab sensor data. Each time they run the instrumentation, it generates a file, which we consume and store in a database for trending, etc. A hierarchical format is preferred as it allows us to "group" data. This is an intermediate file format used before we place the data into a database. Due to our development environment, this is our priority list:

1) .NET compliant. The API will be used in web services and a client application. We do not have any control over the customer's environment, so a pure .NET solution is best.

2) Speed of reads. Our reads are random, not sequential. The faster the better. If we were not a C# development shop I would say speed is #1.

3) File Size. If the file itself is large, a good compression ratio (86% and higher) is desired.

4) Memory footprint of the reads. Due to the volume of data, we cannot simply read it all into memory. Each sensor produces time/value pairs, and this can generate well over 4 million pairs. This has eliminated XML for us.

So far we have looked at HDF5 and found that its API is horribly lacking in the .NET arena and cannot do web services, but it has the size/speed we are looking for. I have also looked into JSON, and it looked promising, but I haven't tried reading a piece of the data back. I have searched the web and not found a lot of file formats that do what we need. Any help is appreciated.

A: 

I think the special reading requirement would be a problem for any format, and in this case you'll need to implement your own parser.

Tamás Szelei
A: 

If a binary tree/balanced tree format isn't too much effort, you could look into storing it in Newick format. It can also support a key/value pair format like JSON.

It's not really any more lightweight than JSON, however: "{}" are replaced with "()".

((raccoon, bear),((sea_lion,seal),((monkey,cat), weasel)),dog);

Obviously, being a binary tree, it's very fast to query, though again probably no faster than a dictionary from a JSON object. However, it has no linked-list-style hierarchy (object graph) to worry about.
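To make the structure concrete, here is a minimal sketch of reading the Newick string shown above into nested lists. It is written in Python only for brevity, handles just the subset used in that example (parentheses, commas, plain leaf names; no branch lengths or internal labels), and `parse_newick` is a hypothetical helper name, not part of any existing library.

```python
def parse_newick(text):
    """Parse a label-only Newick string into nested lists of leaf names."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def skip_ws():
        nonlocal pos
        while text[pos].isspace():
            pos += 1

    def parse_node():
        nonlocal pos
        skip_ws()
        if text[pos] == "(":
            pos += 1  # consume '('
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("expected ')' at offset %d" % pos)
            pos += 1  # consume ')'
            skip_ws()
            return children
        # Leaf: read up to the next structural character.
        start = pos
        while text[pos] not in ",();":
            pos += 1
        return text[start:pos].strip()

    tree = parse_node()
    if text[pos] != ";":
        raise ValueError("unexpected text at offset %d" % pos)
    return tree


tree = parse_newick("((raccoon, bear),((sea_lion,seal),((monkey,cat), weasel)),dog);")
```

Note that the root of this particular example has three children, so a reader should not assume every node has exactly two.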

I'm afraid I couldn't find any .NET APIs for it though, just Java and C.

Chris S
+1  A: 

I think you might be better off storing this information in a table in your database. If you are using SQL Server, a VARBINARY column should do the job.

Your table can be hierarchical by including a [Parent] field that can be null for top-level nodes.

If you index your lookup value (id of file), random access should be quick. If you need compression, you can try using the GZip classes to compress your raw byte[] before sticking it in the database.

Using a database for this information gives you the ability to:

1) Run crazy queries, joins, etc.

2) Index multiple columns for faster lookup by different key values.

3) Choose among multiple APIs, which .NET for sure has.

4) Add compression if it doesn't affect speed too badly.

5) Back up the data, which should be a cinch.
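As a rough illustration of the table layout described above, here is a minimal sketch using Python's built-in sqlite3 and gzip modules as stand-ins for SQL Server's VARBINARY and the .NET GZip classes. The table name `node`, its columns, and the helpers `add_node` and `read_payload` are all assumptions made up for this example; the packing of time/value pairs as little-endian doubles is likewise just one possible choice.

```python
import gzip
import sqlite3
import struct

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE node (
        id      INTEGER PRIMARY KEY,
        parent  INTEGER REFERENCES node(id),  -- NULL for top-level nodes
        name    TEXT NOT NULL,
        payload BLOB                          -- gzip-compressed raw bytes
    )
""")
# Index the parent column so child lookups do not scan the whole table.
conn.execute("CREATE INDEX idx_node_parent ON node(parent)")


def add_node(name, parent=None, raw=None):
    """Insert a node, compressing its raw bytes if it has any."""
    blob = gzip.compress(raw) if raw is not None else None
    cur = conn.execute(
        "INSERT INTO node (parent, name, payload) VALUES (?, ?, ?)",
        (parent, name, blob),
    )
    return cur.lastrowid


def read_payload(node_id):
    """Fetch one node by primary key and return its decompressed bytes."""
    row = conn.execute(
        "SELECT payload FROM node WHERE id = ?", (node_id,)
    ).fetchone()
    if row is None or row[0] is None:
        return None
    return gzip.decompress(row[0])


# A top-level "run" node with one sensor underneath it.
run_id = add_node("run-001")
pairs = [(float(t), 20.0 + t * 0.5) for t in range(1000)]
raw = b"".join(struct.pack("<dd", t, v) for t, v in pairs)
sensor_id = add_node("sensor-A", parent=run_id, raw=raw)

restored = read_payload(sensor_id)
children = [r[0] for r in conn.execute(
    "SELECT name FROM node WHERE parent = ?", (run_id,))]
```

One trade-off of this layout: a compressed blob has to be decompressed as a whole before any single pair inside it can be read, so how the data is split across rows matters for random access.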

Does this advice help you out?

Jonathan.Peppers
Well, we do store it in a database, but we need something as an intermediate to contain the information. We can't just go from sensor -> DB; it goes sensor -> file -> database. Trust me when I say I would love to go directly to the DB.
mcauthorn
You can still have the sensor write to a temporary file and load that into a VARBINARY column. If the speed is acceptable, I normally try to go with a database in every situation--things are easier to manage for maintenance down the road.
Jonathan.Peppers
+1  A: 

You need a B-tree database, such as SQL Server Compact.

Also look at SQLite http://sqlite.phxsoftware.com/

CTree is more of an ISAM; if you can dispense with the SQL part, search for ctree.
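For a sense of why a B-tree-backed store suits random reads of time/value pairs, here is a minimal sketch using Python's built-in sqlite3 module (from .NET the same SQL would go through an ADO.NET provider). The table `reading`, its columns, and the helpers `load` and `read_range` are hypothetical names chosen for this example, and the data is synthetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The composite primary key gives SQLite a B-tree index on (sensor_id, t),
# so a time window for one sensor can be located without a full scan.
conn.execute("""
    CREATE TABLE reading (
        sensor_id INTEGER NOT NULL,
        t         REAL    NOT NULL,
        value     REAL    NOT NULL,
        PRIMARY KEY (sensor_id, t)
    )
""")


def load(rows):
    """Bulk-insert (sensor_id, t, value) tuples in a single transaction."""
    with conn:
        conn.executemany("INSERT INTO reading VALUES (?, ?, ?)", rows)


def read_range(sensor_id, t_start, t_end):
    """Yield (t, value) pairs with t_start <= t < t_end, one row at a time."""
    cur = conn.execute(
        "SELECT t, value FROM reading "
        "WHERE sensor_id = ? AND t >= ? AND t < ? ORDER BY t",
        (sensor_id, t_start, t_end),
    )
    for row in cur:
        yield row


# Synthetic data: two sensors, 10,000 pairs each.
load((s, float(t), s * 100.0 + t) for s in (1, 2) for t in range(10000))

window = list(read_range(2, 5000.0, 5003.0))
```

Because `read_range` is a generator over the cursor, the caller can walk a large window without holding all of it in memory at once, which speaks to the memory footprint concern in the question.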

Sorry, I'd link more, but SO isn't letting me because this is a new account.

Doug
Thanks I'll try a proof of concept and see how it works. It definitely looks promising.
mcauthorn
From all initial time tests and demos, it is as fast in reads and writes as HDF5, doesn't compress as well (10% less) but by playing with it I have been able to get the same data in a smaller file size. Thanks for the recommendation.
mcauthorn