views: 4501

answers: 4

I have a class that I need to binary serialize. The class contains one field as below:

private T[,] m_data;

These multi-dimensional arrays can be fairly large (hundreds of thousands of elements) and of any primitive type. When I tried standard .NET serialization on an object, the file written to disk was large; I think .NET is storing a lot of repeated metadata about element types, and possibly not packing the data as efficiently as it could.

I have looked around for custom serializers but have not seen any that deal with multi-dimensional generic arrays. I have also experimented, with some success, with built-in .NET compression on a byte array of the memory stream after serializing, but it was not as quick or as compact as I had hoped.
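
Something along these lines is what I mean (illustrative sketch, not my exact code):

    // Serialize to a MemoryStream, then run the bytes through GZipStream.
    byte[] raw;
    using (var ms = new MemoryStream())
    {
        new BinaryFormatter().Serialize(ms, m_data);
        raw = ms.ToArray();
    }
    using (var compressed = new MemoryStream())
    using (var gzip = new GZipStream(compressed, CompressionMode.Compress))
    {
        gzip.Write(raw, 0, raw.Length);
        // (read compressed.ToArray() once the gzip stream has been closed)
    }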

My question is: should I try to write a custom serializer to optimally serialize this array for the appropriate type (this seems a little daunting), or should I use standard .NET serialization and add compression?

Any advice on the best approach would be most appreciated, as would links to resources showing how to tackle serialization of a multi-dimensional generic array - as mentioned, the existing examples I have found do not support such structures.

A: 

The best code-size/output-size ratio would be to encode your array using BitConverter, converting every element into its compact binary form. It's manual, I know, but it will save 80-90% of the space compared to .NET binary serialization.
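
A rough sketch of that idea for an int[,] (illustration only; a real version would handle each element type):

    // Write the two lengths, then the 4 bytes of each element from BitConverter.
    static byte[] Pack(int[,] data)
    {
        using (var ms = new MemoryStream())
        {
            ms.Write(BitConverter.GetBytes(data.GetLength(0)), 0, 4);
            ms.Write(BitConverter.GetBytes(data.GetLength(1)), 0, 4);
            foreach (int value in data) // foreach walks a 2-D array in row-major order
                ms.Write(BitConverter.GetBytes(value), 0, 4);
            return ms.ToArray();
        }
    }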

Shachar
However, BitConverter is a pain to use with generics (you'd need to use reflection, presumably combined with Delegate.CreateDelegate for efficiency), and it doesn't work for all types (not even all the built-in structs)...
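
A sketch of that reflection trick (illustrative only; throws for types without a matching overload):

    // Bind the BitConverter.GetBytes overload for T once, then reuse the delegate.
    static Func<T, byte[]> GetBytesConverter<T>()
    {
        MethodInfo method = typeof(BitConverter).GetMethod("GetBytes", new[] { typeof(T) });
        if (method == null)
            throw new NotSupportedException("No BitConverter.GetBytes overload for " + typeof(T).Name);
        return (Func<T, byte[]>)Delegate.CreateDelegate(typeof(Func<T, byte[]>), method);
    }
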
Marc Gravell
+3  A: 

Here's what I came up with. The code below builds a 1000 x 10000 jagged array (int[][]) and writes it out using the BinaryFormatter to 2 files - one zipped and one not.

The zipped file is 1.19 MB (1,255,339 bytes); unzipped is 38.2 MB (40,150,034 bytes).

        // Needs System.IO, System.IO.Compression, System.Linq and
        // System.Runtime.Serialization.Formatters.Binary.
        int width = 1000;
        int height = 10000;
        List<int[]> list = new List<int[]>();
        for (int i = 0; i < height; i++)
        {
            list.Add(Enumerable.Range(0, width).ToArray());
        }
        int[][] bazillionInts = list.ToArray();
        using (FileStream fsZ = new FileStream("c:\\temp_zipped.txt", FileMode.Create))
        using (FileStream fs = new FileStream("c:\\temp_notZipped.txt", FileMode.Create))
        using (GZipStream gz = new GZipStream(fsZ, CompressionMode.Compress))
        {
            BinaryFormatter f = new BinaryFormatter();
            // Same array graph written twice: once through the GZip wrapper, once raw.
            f.Serialize(gz, bazillionInts);
            f.Serialize(fs, bazillionInts);
        }

I can't think of a better/easier way to do this. The zipped version is pretty damn tight.

I'd go with the BinaryFormatter + GZipStream. Making something custom would not be fun at all.
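
Reading it back is just the mirror image (a sketch, assuming the same jagged int[][]):

        using (FileStream fsZ = new FileStream("c:\\temp_zipped.txt", FileMode.Open))
        using (GZipStream gz = new GZipStream(fsZ, CompressionMode.Decompress))
        {
            BinaryFormatter f = new BinaryFormatter();
            // Deserialize straight off the decompression stream.
            int[][] roundTripped = (int[][])f.Deserialize(gz);
        }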


[edit by MG] I hope you won't be offended by an edit, but the uniform repeated Range(0,width) is skewing things vastly; change to:

        int width = 1000;
        int height = 10000;
        Random rand = new Random(123456);
        int[,] bazillionInts = new int[width, height];
        for (int i = 0; i < width; i++)
        {
            for (int j = 0; j < height; j++)
            {
                bazillionInts[i, j] = rand.Next(50000);
            }
        }

And try it; you'll see temp_notZipped.txt at 40MB, temp_zipped.txt at 62MB. Not so appealing...

TheSoftwareJedi
There are 2 problems with this; the first is that the OP asked about rectangular (not jagged) arrays. The more important one is that your compression numbers are being skewed by the uniform data. I'll add a spot of random to show what I mean...
Marc Gravell
Of course you cannot compress random data, but lots of meaningful data can be compressed reasonably well. So in my opinion compression can still be appealing.
Jakub Šturc
A: 

Can you define "large"? The 1000x10000xint example (another post) comes out at 40MB on disk, and 1000x10000x4 bytes (= raw ints) is 38MB. As overheads go, that isn't terrible.

What sort of data is T likely to be? Just primitives? I'm thinking that I could probably edit protobuf-net to support rectangular arrays* - but to keep some kind of wire compatibility we'd probably need a header (one byte) per element - i.e. 9MB of overhead for the 1000x10000 example.

This probably isn't worth it for things like float, double, etc (since they are stored verbatim under "protocol buffers") - but there may be savings for things like int, simply due to how it packs ints (especially if they tend to be on the smaller side in magnitude). Finally, if T is actually an object type like Person etc, then it should be a lot better than binary serialization, since it is very good at packing objects.

It wouldn't be trivial to shoe-horn in rectangular arrays, but let me know if this is something you'd be interested in trying.

*: it doesn't at the moment since the "protocol buffers" spec doesn't support them, but we can hack around that...
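
One possible hack (just a sketch of the idea, not something protobuf-net supports for rectangular arrays out of the box): flatten into a wrapper that carries a 1-D payload plus the two dimensions:

    // Hypothetical wrapper: the flat int[] is serializable as a repeated field,
    // and the two lengths let the reader rebuild the original shape.
    [ProtoContract]
    class Grid
    {
        [ProtoMember(1)] public int Rows;
        [ProtoMember(2)] public int Cols;
        [ProtoMember(3)] public int[] Cells;

        public static Grid From(int[,] data)
        {
            var grid = new Grid
            {
                Rows = data.GetLength(0),
                Cols = data.GetLength(1),
                Cells = new int[data.Length]
            };
            // Buffer.BlockCopy is fine here because int is a primitive type.
            Buffer.BlockCopy(data, 0, grid.Cells, 0, data.Length * sizeof(int));
            return grid;
        }
    }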

Marc Gravell
Thanks very much for your thoughts on this, Marc. I looked into protobuf-net and it looks very interesting. For my class there will be a lot of repeated/redundant data, so I'm thinking standard binary serialization plus compression should suffice.
WillH
A: 

The reason there needs to be so much data about the types is that your array of T could hold any type; more specifically, T could be SomeBaseClass, yet you could still store SomeDerivedClass instances in that array, and the deserializer needs to know that.
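
For example (using those hypothetical types):

    SomeBaseClass[] items = new SomeBaseClass[2];
    items[0] = new SomeBaseClass();
    items[1] = new SomeDerivedClass(); // legal, so each slot's real type has to be recorded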

But this redundant data makes it a good candidate for compression, as others have noted.

David