Hi,

I have my own data structure written in C# (the structure is quite complex). I need to serialize and deserialize it. The size of the serialized file on disk can be pretty large at times (close to 1 GB), but it can also be small (depending on the number of records stored). I have the following requirements:

  1. Serialization and deserialization should be very fast.
  2. I should be able to partially deserialize a large file (i.e. access only the relevant records), because deserializing the whole file from disk would use too much memory.
  3. It should be thread-safe, as multiple processes may read/write records in the file.

I know it sounds like I need a database, but I cannot use one for multiple reasons. I tried to fulfil requirement 1 by implementing ISerializable, which made it much faster than using the built-in .NET Binary/XML serializers, but not fast enough. Requirement 2 is completely stumping me.
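
For illustration, a minimal sketch of the kind of ISerializable implementation described above; the Record type and its fields are hypothetical placeholders for the real structure:

using System;
using System.Runtime.Serialization;

[Serializable]
class Record : ISerializable {
    public int Id { get; private set; }
    public string Payload { get; private set; }

    public Record(int id, string payload) {
        Id = id;
        Payload = payload;
    }

    // Deserialization constructor: read the fields back by name.
    protected Record(SerializationInfo info, StreamingContext context) {
        Id = info.GetInt32("id");
        Payload = info.GetString("payload");
    }

    // Write only the fields that matter, bypassing the default reflection-based path.
    public void GetObjectData(SerializationInfo info, StreamingContext context) {
        info.AddValue("id", Id);
        info.AddValue("payload", Payload);
    }
}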

Anybody out there got any ideas on how to go about this? I imagine anybody who has had to save their own large file format has had to deal with similar issues.

Regards, Sam

A: 

I think we'll need more information on what the file actually looks like...

Can't you just read in pieces of sizeof(yourstruct) from the file and process them separately, instead of reading all the records into memory?

fretje
+2  A: 

I haven't worked on a scenario quite like yours. However, I have discussed a similar issue in the past, and here is the outcome of that discussion (though I confess I have never seen it implemented). Also, I am afraid there may not be a simple, straightforward solution.

Assumptions:

i. The data to be written is sorted.

Solution:

i. Fragment your data store into multiple files. Allot a range of the sorted values to each file, e.g. records 1-10000 in file 1, records 10001-20000 in file 2, and so on.

ii. When you write/read data, you know the range beforehand, so you can meet point 2 (a rough sketch of this mapping follows the list).

iii. It will also address point 3, as long as the chance of two or more processes requesting exactly the same data is low.
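
A rough sketch of the range-to-file mapping in points i and ii (the RecordsPerFile constant and the file-naming scheme are made up purely for illustration):

using System.IO;

static class RecordStore {
    const int RecordsPerFile = 10000; // records 1-10000 -> file 1, 10001-20000 -> file 2, ...

    // Map a 1-based record number to the fragment file that should hold it.
    public static string GetFilePath(string directory, int recordNumber) {
        int fileIndex = ((recordNumber - 1) / RecordsPerFile) + 1;
        return Path.Combine(directory, "data." + fileIndex + ".bin");
    }
}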

To be able to provide a more accurate solution, we need more information on what you are trying to achieve.

rAm
I agree, a multiple file solution is almost certainly best.
C. Ross
Yes, a multi-file solution would work, but it would not keep the data file portable. I posted a comment on my original post about what I am trying to achieve.
Sam
Just wondering, how do you mail a 1 GB data file? And why would a multi-file solution make the data file non-portable - unless, of course, someone modifies the data files in such a way that the data ends up out of order?
rAm
Sorry, by portable I meant that it is not very easy for users to email 200, or even 20, individual files. They are used to the single-data-file concept from working with other software. Users typically share large data files through FTP or an internal SAN, and smaller ones through email.
Sam
Yeah, now I get the point. Do let us know what you finally settle on. Many times there is no perfect solution; whatever meets the core conditions wins.
rAm
A: 

For partial (or split) deserialization (which I have been looking into some myself, e.g. for dynamic and static parts in a game level), I think you will have to write your own serialization engine.

Cecil Has a Name
+2  A: 

Is it a data tree, or a full graph - i.e. are there any circular references? If not, protobuf-net is a high-performance binary tree serializer. It supports streaming of enumerable items (so you can skip records, etc., rather than buffer everything), but to efficiently seek to a random element I expect you'd need some kind of index.

Read/write is VERY hard for a single file; in particular, writing may require moving a lot more of the data on disk than you expect... reading is also tricky and might require synchronization. It would be easier to use separate files...
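
As one rough sketch of that "some kind of index" idea (untested here): remember each record's byte offset while writing length-prefixed records, then seek straight to the record you want on read. This assumes the Foo type and length-prefix settings from the example further down:

using System.Collections.Generic;
using System.IO;
using ProtoBuf;

static class IndexedStore {
    public static List<long> Write(string path, IEnumerable<Foo> records) {
        var offsets = new List<long>();
        using (var stream = File.Create(path)) {
            foreach (Foo record in records) {
                offsets.Add(stream.Position); // index entry: where this record starts
                Serializer.SerializeWithLengthPrefix(stream, record, PrefixStyle.Base128, 1);
            }
        }
        return offsets; // persist this small index alongside the data file
    }

    public static Foo ReadAt(string path, IList<long> offsets, int recordIndex) {
        using (var stream = File.OpenRead(path)) {
            stream.Position = offsets[recordIndex]; // jump without reading earlier records
            return Serializer.DeserializeWithLengthPrefix<Foo>(stream, PrefixStyle.Base128, 1);
        }
    }
}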


Example of skipping early items; I could probably add a helper method, but the TryDeserializeWithLengthPrefix method will work... the key point is that between serialization and deserialization we only create one extra object.

using System;
using System.IO;
using System.Threading;
using ProtoBuf;

[ProtoContract]
class Foo {
    static int count;
    public static int ObjectCount { get { return count; } }
    public Foo() { // track how many objects have been created...
        Interlocked.Increment(ref count);
    }
    [ProtoMember(1)]
    public int Id { get; set; }
    [ProtoMember(2)]
    public double Bar { get; set; }    
}
static class Program {
    static void Main() {
        MemoryStream ms = new MemoryStream();
        Random rand = new Random();
        for (int i = 1; i <= 5000; i++) {
            Foo foo = new Foo { Bar = rand.NextDouble(), Id = i };
            Serializer.SerializeWithLengthPrefix(ms, foo, PrefixStyle.Base128, 1);
        }
        ms.Position = 0;
        // skip the first 999 records; only the 1000th is actually deserialized
        int index = 0;
        object obj;
        Console.WriteLine(Foo.ObjectCount); // 5000 objects created by the loop above
        Serializer.NonGeneric.TryDeserializeWithLengthPrefix(
            ms, PrefixStyle.Base128,
            // returning null tells the reader to skip that record without materializing it
            tag => ++index == 1000 ? typeof(Foo) : null, out obj);
        Console.WriteLine(Foo.ObjectCount); // 5001: only one extra object was created
        Console.WriteLine(((Foo)obj).Id);   // 1000
    }
}
Marc Gravell
Right now it is a full graph (with circular references), but your suggestion is interesting. When you say streaming, do you mean that while I enumerate over the collection it will page records in from the serialized file on demand (i.e. not load everything into memory at once)?
Sam
Correct; primarily you can pick items off as a sequence of top-level objects (assuming the data is essentially a sequence of homogeneous records) - but you can also do some fun things with cheeky collection classes that drop records, or process them in sequence and then drop them (rather than buffering them).
Marc Gravell
I've also got an example of it running against "PushLINQ", which is a LINQ implementation (by Jon and myself) designed to efficiently process large data streams - i.e. to compute sums/averages (including group aggregates) etc without buffering the individual records.
Marc Gravell
This sounds very interesting. I might be able to rework my data structure to remove the circular references and make it behave like a sequence of homogeneous objects, in which case this solution should work. Is it possible for you to share the example you just mentioned?
Sam
Sure; I haven't tidied it, but the PushLINQ example is here: http://code.google.com/p/protobuf-net/source/browse/trunk/DataPipeline/Program.cs
Marc Gravell
The "pick/drop items from a stream" is essentially Serializer.DeserializeItems<T> (returns a lazy IEnumerable<T>): http://code.google.com/p/protobuf-net/source/browse/trunk/protobuf-net/Serializer.cs
Marc Gravell
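
A tiny usage sketch of that lazy enumeration (assuming the usings, Foo type, and length-prefix settings from the example above):

using (var stream = File.OpenRead("data.bin")) {
    // Each Foo is materialized only as it is enumerated; nothing is buffered up front.
    foreach (Foo foo in Serializer.DeserializeItems<Foo>(stream, PrefixStyle.Base128, 1)) {
        if (foo.Id > 1010) break; // stop early; the rest of the file is never read
        if (foo.Id >= 1000) Console.WriteLine(foo.Bar);
    }
}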
Re circular references: you can add attributes to say "don't serialize this relationship", and (if you need to) use serialization callbacks to do parent fix-ups (i.e. presumably the parent would have a callback that tells the child record(s) who their parent is).
Marc Gravell
Also - it isn't obvious, but you can get the system to completely skip early items (rather than deserialize them and drop them); I could flesh out an example if needed... or add a helper method (which would perhaps be more useful)
Marc Gravell
I should add... it doesn't support custom structs; only classes, lists, etc...
Marc Gravell
Hey, thanks for your great help. I just downloaded protobuf-net and will try it out over the next couple of days. If this works in my case, it will save me a lot of pain!! (I was almost about to write a chunk-based algorithm!!) I will contact you if I have any queries. The early-skip example (skipping before deserializing) would be great...
Sam