Is it possible to deserialize only a limited number of items from a serialized array?

Background:

I have a stream that holds a serialized array of type T. The array can have millions of items, but I want to create a preview of the content and retrieve only, say, the first one hundred items. My first idea was to create a wrapper around the input stream that limits the number of bytes, but there's no direct translation from the number of items in the array to the stream size.

A: 

Could you maybe alter your data source so it contains a data preview in another array which you can deserialize separately?

Joey
+1  A: 

No, this can't be done with standard .NET serialization. You'll have to invent your own storage format. For example, include a header with offsets of data chunks:

----------------
<magic-value>
<chunks-count>
<chunk-size>
<chunk-1-offset>
<chunk-2-offset>  --+
...                 |
----------------    |
...                 |
<chunk-1>           |
...                 |
----------------    |
...               <-+
<chunk-2>
...
-----------------
...

So in order to preview data (from any arbitrary position), you'll have to load at most ceil(required-item-count/chunk-size) chunks, or one more if the requested range does not start on a chunk boundary. This will incur some overhead, but it's much better than loading the whole file.
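To make the layout above concrete, here is a minimal sketch of such a format. Everything in it is an assumption made for illustration: the ChunkedFile class, the magic value, the choice of int items written with BinaryWriter, and the per-chunk item count prefix are all hypothetical. With real objects, each chunk would instead be serialized with whatever serializer you already use.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ChunkedFile
{
    const int Magic = 0x43484B31; // hypothetical magic value

    // Writes header (magic, chunk count, chunk size, offset table), then the chunks.
    public static void Write(Stream s, int[] items, int chunkSize)
    {
        int chunkCount = (items.Length + chunkSize - 1) / chunkSize;
        var w = new BinaryWriter(s);
        w.Write(Magic);
        w.Write(chunkCount);
        w.Write(chunkSize);
        long tablePos = s.Position;
        for (int c = 0; c < chunkCount; c++) w.Write(0L); // placeholder offsets

        var offsets = new long[chunkCount];
        for (int c = 0; c < chunkCount; c++)
        {
            offsets[c] = s.Position;
            int start = c * chunkSize;
            int n = Math.Min(chunkSize, items.Length - start);
            w.Write(n); // item count of this chunk (the last one may be short)
            for (int i = 0; i < n; i++) w.Write(items[start + i]);
        }

        // Go back and fill in the real offsets, then restore the position.
        long end = s.Position;
        s.Position = tablePos;
        foreach (long offset in offsets) w.Write(offset);
        w.Flush();
        s.Position = end;
    }

    // Reads 'count' items starting at 'firstItem', touching only the chunks that overlap the range.
    public static List<int> Read(Stream s, int firstItem, int count)
    {
        var r = new BinaryReader(s);
        s.Position = 0;
        if (r.ReadInt32() != Magic) throw new InvalidDataException("Bad magic value.");
        int chunkCount = r.ReadInt32();
        int chunkSize = r.ReadInt32();
        long tablePos = s.Position;

        var result = new List<int>(count);
        int firstChunk = firstItem / chunkSize;
        int lastChunk = Math.Min(chunkCount - 1, (firstItem + count - 1) / chunkSize);
        for (int c = firstChunk; c <= lastChunk; c++)
        {
            s.Position = tablePos + 8L * c; // look up this chunk's offset
            s.Position = r.ReadInt64();     // jump to the chunk
            int n = r.ReadInt32();
            for (int i = 0; i < n; i++)
            {
                int value = r.ReadInt32();
                int index = c * chunkSize + i;
                if (index >= firstItem && result.Count < count) result.Add(value);
            }
        }
        return result;
    }
}
```

For example, with a chunk size of 64, ChunkedFile.Read(stream, 150, 100) only seeks to chunks 2 and 3 and skips the rest of the file. The stream has to be seekable for this to work.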

Anton Gogolev
So I have to save the array in chunks and accept that it loads a little more?
Rauhotz
+1  A: 

What is the serializer?

With BinaryFormatter, that would be very, very tricky.

With xml, you could perhaps pre-process the xml, but that is very tricky.

Other serializers exist, though - for example, with protobuf-net there is little difference between an array/list of items and a sequence of individual items - so it would be pretty easy to pick off a finite sequence of items without processing the entire array.


Complete protobuf-net example:

using System;
using System.IO;
using System.Linq;
using ProtoBuf;

[ProtoContract]
class Test {
    [ProtoMember(1)]
    public int Foo { get; set; }
    [ProtoMember(2)]
    public string Bar { get; set; }

    static void Main() {
        // Build 1000 sample items.
        Test[] data = new Test[1000];
        for (int i = 0; i < 1000; i++) {
            data[i] = new Test { Foo = i, Bar = ":" + i.ToString() };
        }
        // Serialize the whole array into an in-memory stream.
        MemoryStream ms = new MemoryStream();
        Serializer.Serialize(ms, data);
        Console.WriteLine("Pos after writing: " + ms.Position); // 10760
        Console.WriteLine("Length: " + ms.Length); // 10760
        // Rewind, then read back only the first 100 items.
        ms.Position = 0;
        foreach (Test foo in Serializer.DeserializeItems<Test>(ms,
                PrefixStyle.Base128, Serializer.ListItemTag).Take(100)) {
            Console.WriteLine(foo.Foo + "\t" + foo.Bar);
        }
        Console.WriteLine("Pos after reading: " + ms.Position); // 902
    }
}

Note that DeserializeItems<T> is a lazy/streaming API, so it only consumes data from the stream as you iterate over it - hence the LINQ Take(100) stops us from reading the whole stream.
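To illustrate the lazy/streaming idea without any third-party library, here is a small sketch using only BinaryWriter, BinaryReader and a C# iterator. It is not how protobuf-net is implemented; the LazyItems class and its methods are hypothetical names, and string items are assumed purely to keep the example short. Each item is written as its own length-prefixed record, and the reader yields one record per iteration step.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class LazyItems
{
    // Hypothetical helper: writes each string as its own length-prefixed record.
    public static void WriteAll(Stream s, IEnumerable<string> items)
    {
        var w = new BinaryWriter(s);
        foreach (string item in items) w.Write(item); // BinaryWriter adds the length prefix
        w.Flush();
    }

    // Iterator method: a record is read only when the caller asks for the next item.
    // Assumes a seekable stream, since it checks Position against Length.
    public static IEnumerable<string> ReadLazily(Stream s)
    {
        var r = new BinaryReader(s);
        while (s.Position < s.Length)
            yield return r.ReadString();
    }

    public static void Demo()
    {
        var ms = new MemoryStream();
        WriteAll(ms, Enumerable.Range(0, 1000).Select(i => ":" + i));
        ms.Position = 0;

        // Take(100) stops pulling from the iterator after 100 items,
        // so the remaining records are never read from the stream.
        List<string> preview = ReadLazily(ms).Take(100).ToList();
        Console.WriteLine(preview.Count + " items read, position "
            + ms.Position + " of " + ms.Length);
    }
}
```

The same pattern applies to any format where each item can be read without knowing about the items after it, which is exactly what a single serialized BinaryFormatter array does not give you.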

Marc Gravell
Would a bit of hackery implementing the ISerializable interface not get him what he wants?
Noldorin
@Noldorin - not as far as I know... you don't get to intercept the array deserialization, regardless of how you handle each individual item (via ISerializable)
Marc Gravell