Hi,

I have my own data structure written in C# (the structure is quite complex). I need to serialize and deserialize it. The size of the serialized file on disk can be pretty large at times (close to 1 GB), but it can also be small (depending on the number of records stored). I have the following requirements:

  1. Serialization and deserialization should be very fast.
  2. I should be able to partially deserialize a large file (i.e. access only the relevant records), because deserializing the whole file from disk would use too much memory.
  3. It should be thread-safe, as multiple processes may read/write records in the file.

I know it sounds like I need a database, but I cannot use one for multiple reasons. I tried to fulfil requirement 1 by implementing ISerializable, which made it much faster than using the built-in .NET Binary/XML serializers, but not fast enough. Requirement 2 is completely stumping me.
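
For illustration, a minimal sketch of the kind of ISerializable implementation described above; the Record type and its fields are hypothetical placeholders for the real structure:

using System;
using System.Runtime.Serialization;

[Serializable]
class Record : ISerializable {
    public int Id { get; private set; }
    public string Payload { get; private set; }

    public Record(int id, string payload) {
        Id = id;
        Payload = payload;
    }

    // Deserialization constructor: read the fields back by name.
    protected Record(SerializationInfo info, StreamingContext context) {
        Id = info.GetInt32("id");
        Payload = info.GetString("payload");
    }

    // Write only the fields that matter, bypassing the default reflection-based path.
    public void GetObjectData(SerializationInfo info, StreamingContext context) {
        info.AddValue("id", Id);
        info.AddValue("payload", Payload);
    }
}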

Anybody out there got any ideas on how to go about this? I imagine anybody who has had to save their own large file format has had to deal with similar issues.

Regards, Sam

A: 

I think we'll need more information on what the file actually looks like...

Can't you just read in pieces of sizeof(yourstruct) from the file and process them separately, instead of reading all the records into memory?

fretje
+2  A: 

I haven't worked on a scenario quite like yours. However, I have discussed a similar issue in the past, and here is the outcome of that discussion (though I confess I have never seen it implemented). Also, I am afraid there may not be a simple, straightforward solution.

Assumptions:

i. The data to be written is sorted.

Solution:

i. Fragment your data store into multiple files. Allot a range of the sorted values to each file, e.g. records 1-10000 in file 1, records 10001-20000 in file 2, and so on.

ii. When you write/read data, you know the range beforehand, so you can meet point 2 (a rough sketch of this mapping follows the list).

iii. It will also address point 3, as long as the chance of two or more processes requesting exactly the same data is low.
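
A rough sketch of the range-to-file mapping in points i and ii (the RecordsPerFile constant and the file-naming scheme are made up purely for illustration):

using System.IO;

static class RecordStore {
    const int RecordsPerFile = 10000; // records 1-10000 -> file 1, 10001-20000 -> file 2, ...

    // Map a 1-based record number to the fragment file that should hold it.
    public static string GetFilePath(string directory, int recordNumber) {
        int fileIndex = ((recordNumber - 1) / RecordsPerFile) + 1;
        return Path.Combine(directory, "data." + fileIndex + ".bin");
    }
}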

To be able to provide a more accurate solution, we need more information on what you are trying to achieve.

rAm
I agree, a multiple file solution is almost certainly best.
C. Ross
Yes, a multi-file solution would work, but it would not keep the data file portable. I posted a comment on my original post about what I am trying to achieve.
Sam
Just wondering, how do you mail a 1 GB data file? And why would a multi-file solution make the data file non-portable - unless, of course, someone modifies the data files in such a way that the data ends up out of order?
rAm
Sorry, by portable I meant that it is not very easy for users to email 200, or even 20, individual files. They are used to the single-data-file concept from working with other software. Users typically share large data files through FTP or an internal SAN, and smaller ones through email.
Sam
Yeah, now I get the point. Do let us know what you finally settle on. Many times there is no perfect solution; whatever meets the core conditions wins.
rAm
A: 

For partial (or split) deserialization (which I have been looking into some myself, e.g. for dynamic and static parts in a game level), I think you will have to write your own serialization engine.

Cecil Has a Name
+2  A: 

Is it a data tree, or a full graph - i.e. are there any circular references? If not, protobuf-net is a high-performance binary tree serializer. It supports streaming of enumerable items (so you can skip records, etc., rather than buffer everything), but to efficiently seek to a random element I expect you'd need some kind of index.

Read/write is VERY hard for a single file; in particular, writing may require moving a lot more of the data on disk than you expect... reading is also tricky and might require synchronization. It would be easier to use separate files...
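
As one rough sketch of that "some kind of index" idea (untested here): remember each record's byte offset while writing length-prefixed records, then seek straight to the record you want on read. This assumes the Foo type and length-prefix settings from the example further down:

using System.Collections.Generic;
using System.IO;
using ProtoBuf;

static class IndexedStore {
    public static List<long> Write(string path, IEnumerable<Foo> records) {
        var offsets = new List<long>();
        using (var stream = File.Create(path)) {
            foreach (Foo record in records) {
                offsets.Add(stream.Position); // index entry: where this record starts
                Serializer.SerializeWithLengthPrefix(stream, record, PrefixStyle.Base128, 1);
            }
        }
        return offsets; // persist this small index alongside the data file
    }

    public static Foo ReadAt(string path, IList<long> offsets, int recordIndex) {
        using (var stream = File.OpenRead(path)) {
            stream.Position = offsets[recordIndex]; // jump without reading earlier records
            return Serializer.DeserializeWithLengthPrefix<Foo>(stream, PrefixStyle.Base128, 1);
        }
    }
}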


Example of skipping early items; I could probably add a helper method, but the TryDeserializeWithLengthPrefix method will work... the key point is that between serialization and deserialization we only create one extra object.

using System;
using System.IO;
using System.Threading;
using ProtoBuf;

[ProtoContract]
class Foo {
    static int count;
    public static int ObjectCount { get { return count; } }
    public Foo() { // track how many objects have been created...
        Interlocked.Increment(ref count);
    }
    [ProtoMember(1)]
    public int Id { get; set; }
    [ProtoMember(2)]
    public double Bar { get; set; }    
}
static class Program {
    static void Main() {
        MemoryStream ms = new MemoryStream();
        Random rand = new Random();
        for (int i = 1; i <= 5000; i++) {
            Foo foo = new Foo { Bar = rand.NextDouble(), Id = i };
            Serializer.SerializeWithLengthPrefix(ms, foo, PrefixStyle.Base128, 1);
        }
        ms.Position = 0;
        // skip the first 999 records; only the 1000th is actually deserialized
        int index = 0;
        object obj;
        Console.WriteLine(Foo.ObjectCount); // 5000 objects created by the loop above
        Serializer.NonGeneric.TryDeserializeWithLengthPrefix(
            ms, PrefixStyle.Base128,
            // returning null tells the reader to skip that record without materializing it
            tag => ++index == 1000 ? typeof(Foo) : null, out obj);
        Console.WriteLine(Foo.ObjectCount); // 5001: only one extra object was created
        Console.WriteLine(((Foo)obj).Id);   // 1000
    }
}
Marc Gravell
Right now it is a full graph (with circular references), but your suggestion is interesting. When you say streaming, do you mean that while I enumerate over the collection it will page records in from the serialized file on demand (i.e. not load everything into memory at once)?
Sam
Correct; primarily you can pick items off as a sequence of top-level objects (assuming the data is essentially a sequence of homogeneous records) - but you can also do some fun things with cheeky collection classes that drop records, or process them in sequence and then drop them (rather than buffering them).
Marc Gravell
I've also got an example of it running against "PushLINQ", which is a LINQ implementation (by Jon and myself) designed to efficiently process large data streams - i.e. to compute sums/averages (including group aggregates) etc without buffering the individual records.
Marc Gravell
This sounds very interesting. I might be able to rework my data structure to remove the circular references and make it behave like a sequence of homogeneous objects, in which case this solution should work. Is it possible for you to share the example you just mentioned?
Sam
Sure; I haven't tidied it, but the PushLINQ example is here: http://code.google.com/p/protobuf-net/source/browse/trunk/DataPipeline/Program.cs
Marc Gravell
The "pick/drop items from a stream" is essentially Serializer.DeserializeItems<T> (returns a lazy IEnumerable<T>): http://code.google.com/p/protobuf-net/source/browse/trunk/protobuf-net/Serializer.cs
Marc Gravell
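
A tiny usage sketch of that lazy enumeration (assuming the usings, Foo type, and length-prefix settings from the example above):

using (var stream = File.OpenRead("data.bin")) {
    // Each Foo is materialized only as it is enumerated; nothing is buffered up front.
    foreach (Foo foo in Serializer.DeserializeItems<Foo>(stream, PrefixStyle.Base128, 1)) {
        if (foo.Id > 1010) break; // stop early; the rest of the file is never read
        if (foo.Id >= 1000) Console.WriteLine(foo.Bar);
    }
}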
Re circular references: you can add attributes to say "don't serialize this relationship", and (if you need to) use serialization callbacks to do parent fix-ups (i.e. presumably the parent would have a callback that tells the child record(s) who their parent is).
Marc Gravell
Also - it isn't obvious, but you can get the system to completely skip early items (rather than deserialize them and drop them); I could flesh out an example if needed... or add a helper method (which would perhaps be more useful)
Marc Gravell
I should add... it doesn't support custom structs; only classes, lists, etc...
Marc Gravell
Hey, thanks for your great help. I just downloaded protobuf-net and will try it out over the next couple of days. If this works in my case, it will save me a lot of pain!! (I was almost about to write a chunk-based algorithm!!) I will contact you if I have any queries. The early-skip example (skipping before deserializing) would be great...
Sam