We are beginning to roll out more and more WAN deployments of our product (.Net fat client w/ IIS hosted Remoting backend). Because of this we are trying to reduce the size of the data on the wire.

We have overridden the default serialization by implementing ISerializable (similar to this), and we are seeing anywhere from 12% to 50% gains. Most of our efforts focus on optimizing arrays of primitive types. Does anyone know of a fancy way of serializing primitive types, beyond the obvious?

For example today we serialize an array of ints as follows:

[4-bytes (array length)][4-bytes][4-bytes]
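Concretely, that corresponds to something like the following (a minimal sketch using BinaryWriter/BinaryReader; the method names are just illustrative):

    // Baseline: 4-byte length prefix, then 4 bytes per element.
    static void WriteIntArray(BinaryWriter writer, int[] values)
    {
        writer.Write(values.Length);      // [4-bytes (array length)]
        foreach (int value in values)
            writer.Write(value);          // [4-bytes] per element
    }

    static int[] ReadIntArray(BinaryReader reader)
    {
        int length = reader.ReadInt32();
        var values = new int[length];
        for (int i = 0; i < length; i++)
            values[i] = reader.ReadInt32();
        return values;
    }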

Can anyone do significantly better?

The most obvious example of a significant improvement, for boolean arrays, is putting 8 bools in each byte, which we already do.

Note: Saving 7 bits per bool may seem like a waste of time, but when you are dealing with large volumes of data (which we are), it adds up very fast.
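For reference, the bit-packing mentioned above amounts to roughly this (a sketch, not our actual implementation):

    // Pack 8 bools per byte; the element count is serialized separately so the
    // reader knows how many bits of the last byte are meaningful.
    static byte[] PackBools(bool[] values)
    {
        var packed = new byte[(values.Length + 7) / 8];
        for (int i = 0; i < values.Length; i++)
            if (values[i])
                packed[i / 8] |= (byte)(1 << (i % 8));
        return packed;
    }

    static bool[] UnpackBools(byte[] packed, int count)
    {
        var values = new bool[count];
        for (int i = 0; i < count; i++)
            values[i] = (packed[i / 8] & (1 << (i % 8))) != 0;
        return values;
    }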

Note: We want to avoid general compression algorithms because of the latency associated with them. Remoting only supports buffered requests/responses (no chunked encoding). I realize there is a fine line between compression and optimal serialization, but our tests indicate we can afford very specific serialization optimizations at very little cost in latency, whereas reprocessing the entire buffered response into a new compressed buffer is too expensive.

+2  A: 

Check out the base-128 varint type used in Google's protocol buffers; that might be what you're looking for.

(There are a number of .NET implementations of protocol buffers available if you search the web, which, depending on their license, you might be able to grovel some code from!)

Greg Beech
2 of the bigger ones were implemented by frequent stackoverflow users.
Joel Coehoorn
protobuf-net by me, dotnet-protobufs by Jon Skeet
Marc Gravell
A: 

If you know which int values are more common, you can encode those values in fewer bits (and encode the less-common values using correspondingly more bits): this is called "Huffman" coding/encoding/compression.

In general though I'd suggest that one of the easiest things you could do would be to run a standard 'zip' or 'compression' utility over your data.
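For instance, the framework's GZipStream (System.IO.Compression) can be wrapped around the already-serialized payload; a minimal sketch (whether the extra latency is acceptable is a separate question, as the comment below notes):

    // Compress a serialized payload with the built-in GZip support.
    static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
                gzip.Write(data, 0, data.Length);   // gzip must be closed before reading
            return output.ToArray();
        }
    }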

ChrisW
We want to stay away from compression, for multiple reasons. Mainly the latency it introduces. Remoting only supports buffered response/request (no chunked encoding).
Greg Dean
A: 

For integers, if you usually have small numbers (under 128, or under 16,384) you can encode the number using the MSB of each byte as a flag that indicates whether it is the last byte or not. It's a little bit similar to UTF-8, but the flag bit is actually wasted (which is not the case with UTF-8).

Example (big-endian):

125 which is usually encoded as 00 00 00 7D
Could be encoded as 7D

270 which is usually encoded as 00 00 01 0E
Could be encoded as 82 0E

The main limitation is that, if you cap the encoding at 4 bytes, the effective range of a 32-bit value is reduced to 28 bits (allowing a fifth byte would restore the full range). But for small values you will usually gain a lot.

This method is actually used in old formats such as MIDI because old electronics needed very efficient and simple encoding techniques.
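A rough sketch of this scheme (big-endian, MSB set on every byte except the last, capped at 4 encoded bytes as described; it reproduces the 7D and 82 0E examples above):

    // Write an unsigned value (max 28 bits) as 7-bit groups, most significant
    // group first; the MSB marks "more bytes follow".
    static void WriteVarUInt(BinaryWriter writer, uint value)
    {
        if (value >= (1u << 28))
            throw new ArgumentOutOfRangeException("value", "limited to 28 bits");

        bool started = false;
        for (int shift = 21; shift > 0; shift -= 7)
        {
            byte group = (byte)((value >> shift) & 0x7F);
            if (group != 0 || started)
            {
                writer.Write((byte)(group | 0x80));   // not the last byte
                started = true;
            }
        }
        writer.Write((byte)(value & 0x7F));           // last byte, MSB clear
    }

    static uint ReadVarUInt(BinaryReader reader)
    {
        uint value = 0;
        byte b;
        do
        {
            b = reader.ReadByte();
            value = (value << 7) | (uint)(b & 0x7F);
        } while ((b & 0x80) != 0);
        return value;
    }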

Coincoin
A: 

Before implementing ISerializable yourself, you were probably using the XmlSerializer or the SOAP formatter in a web service. Given that your clients are all fat clients running .Net, you could try using the BinaryFormatter.

Joel Coehoorn
The XmlSerializer was the first thing to go :-)
Greg Dean
BinaryFormatter should be about as compact as you would want to get.
Joel Coehoorn
Well, it's not. The BinaryFormatter is a generic formatter that can serialize arbitrary object graphs. We are serializing very specific structures and can blow away the BF's speed and output size. There is no limit on how compact I want to go.
Greg Dean
+1  A: 

If your arrays can be sorted you can perform a simple RLE to save space. Even if they aren't sorted RLE can still be beneficial. It is fast to implement for both writing and reading.
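A minimal sketch of RLE over an int array (value followed by run length; note that with no repeated values it doubles the size, so it only pays off when duplicates are common):

    // Run-length encode an int array as (value, run length) pairs.
    static void WriteRle(BinaryWriter writer, int[] values)
    {
        writer.Write(values.Length);
        int i = 0;
        while (i < values.Length)
        {
            int value = values[i];
            int run = 1;
            while (i + run < values.Length && values[i + run] == value)
                run++;
            writer.Write(value);
            writer.Write(run);
            i += run;
        }
    }

    static int[] ReadRle(BinaryReader reader)
    {
        var values = new int[reader.ReadInt32()];
        int i = 0;
        while (i < values.Length)
        {
            int value = reader.ReadInt32();
            int run = reader.ReadInt32();
            for (int j = 0; j < run; j++)
                values[i++] = value;
        }
        return values;
    }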

dreamlax
Good idea - however we can't sort the data, and it's unlikely that the unsorted data will contain consecutive duplicates.
Greg Dean
+1  A: 

Here's a trick I used once for encoding arrays of integers:

  1. Group array elements in groups of 4.
  2. Precede each group with a byte (let's call it length mask) that indicates the length of the following 4 elements. The length mask is a bitmask composed of dibits which indicate how long the corresponding element is (00 - 1 byte, 01 - 2 bytes, 10 - 3 bytes, 11 - 4 bytes).
  3. Write out the elements as short as you can.

For example, to represent unsigned integers 0x0000017B, 0x000000A9, 0xC247E8AD and 0x00032A64, you would write (assuming little-endian): B1, 7B, 01, A9, AD, E8, 47, C2, 64, 2A, 03.

It can save you up to 68.75% (11/16) of the space in the best case. In the worst case, you would actually waste an additional 6.25% (1/16).
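A rough sketch of this scheme (assuming the first element's dibit occupies the two least significant bits of the length mask, which matches the B1 example above):

    // Encode exactly 4 unsigned ints: one length-mask byte, then each value
    // written little-endian using only as many bytes as it needs (1-4).
    static void WriteGroup(BinaryWriter writer, uint[] group)   // group.Length == 4
    {
        byte mask = 0;
        var lengths = new int[4];
        for (int i = 0; i < 4; i++)
        {
            uint v = group[i];
            int len = v <= 0xFF ? 1 : v <= 0xFFFF ? 2 : v <= 0xFFFFFF ? 3 : 4;
            lengths[i] = len;
            mask |= (byte)((len - 1) << (i * 2));   // dibit: 00=1 byte .. 11=4 bytes
        }
        writer.Write(mask);
        for (int i = 0; i < 4; i++)
            for (int b = 0; b < lengths[i]; b++)
                writer.Write((byte)(group[i] >> (8 * b)));   // shortest little-endian form
    }

    static uint[] ReadGroup(BinaryReader reader)
    {
        byte mask = reader.ReadByte();
        var group = new uint[4];
        for (int i = 0; i < 4; i++)
        {
            int len = ((mask >> (i * 2)) & 0x03) + 1;
            uint v = 0;
            for (int b = 0; b < len; b++)
                v |= (uint)reader.ReadByte() << (8 * b);
            group[i] = v;
        }
        return group;
    }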

Vojislav Stojkovic
Any idea how this would compare to the base-128 varint Greg mentioned?
Greg Dean
The varint encoding is the same encoding as used in SEMI's OASIS format. I remember that we compared the two methods (back at the job where I worked with that stuff) and the conclusion was that our compression was slightly better. The real gain was speed, though: our method performed better in C ;)
Vojislav Stojkovic
+5  A: 

(relates to messages/classes, not just primitives)

Google designed "protocol buffers" for this type of scenario (they shift a huge amount of data around) - their format is compact (using things like base-128 encoding) but extensible and version tolerant (so clients and servers can upgrade easily).

In the .NET world, I can recommend 2 protocol buffers implementations: protobuf-net (mine) and dotnet-protobufs (Jon Skeet's).

For info, protobuf-net has direct support for ISerializable and remoting (it is part of the unit tests). There are performance/size metrics here.

And best of all, all you do is add a few attributes to your classes.

Caveat: it doesn't claim to be the theoretical best - but pragmatic and easy to get right - a compromise between performance, portability and simplicity.
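To give an idea of the attribute usage (a minimal sketch; the type and member names are purely illustrative):

    using ProtoBuf;

    [ProtoContract]
    public class OrderSnapshot
    {
        [ProtoMember(1)]
        public int Id { get; set; }

        [ProtoMember(2)]
        public int[] Quantities { get; set; }
    }

    // Serializer.Serialize(stream, snapshot);
    // OrderSnapshot copy = Serializer.Deserialize<OrderSnapshot>(stream);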

Marc Gravell
Marc - I downloaded the latest source for protobuf-net and ran the example. If I am not mistaken, it looks like protobuf-net and the BinaryFormatter produce about the same size output: DatabaseCompatRem: protobuf-net = 133,010 / 8,614 / 18,853; BinaryFormatter = 133,167 / 10,171 / 19,526.
Greg Dean
Is this expected or am I doing something wrong, or even looking at the wrong example?
Greg Dean
Ah, no: that isn't what that result means... that is comparing *raw* protobuf-net (Serializer.Blah) vs BinaryFormatter which **uses** protobuf-net via ISerializable. You'd need to compare to the BinaryFormatter result without the ISerializable implementation.
Marc Gravell
Based on the results wiki page, I'd expect BinaryFormatter (without ISerializable) to be *at least* twice as large.
Marc Gravell
Ahhh I see, I'll take a second look tomorrow. Thanks for the reply.
Greg Dean
I have been on vacation. So I'm just now testing it out, and the first problem I ran into is that we have a jagged object array. I'll try to work around that and get back to you.
Greg Dean
When I say object array I literally mean object[]. Unfortunately I haven't found an obvious way to handle this when it contains boxed value types.
Greg Dean
+2  A: 

Yes, there is a fancy way of serializing primitive types. As a bonus it is also much faster (typically 20-40 times).

Simon Hewitt's open source library (see Optimizing Serialization in .NET - part 2) uses various tricks. E.g. if it is known that an array contains small integers, then less data goes into the serialised output. This is described in detail in part 1 of the article. E.g.:

...So, an Int32 that is less than 128 can be stored in a single byte (by using 7-bit encoding) ....

The full-size and the size-optimised integers can be mixed and matched. This may seem obvious, but there are other things too; e.g. the integer value 0 gets special handling - an optimisation that stores just the numeric type and the fact that the value is zero.
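The underlying idea is the same 7-bit/varint trick sketched in the earlier answers; the little-endian variant (low 7 bits first, as the BCL's BinaryWriter uses internally for string length prefixes) looks roughly like this:

    // Little-endian 7-bit encoding: emit the low 7 bits first and set the MSB
    // while more bytes follow, so values under 128 take a single byte.
    static void Write7BitEncoded(BinaryWriter writer, uint value)
    {
        while (value >= 0x80)
        {
            writer.Write((byte)(value | 0x80));
            value >>= 7;
        }
        writer.Write((byte)value);
    }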

Part 1 states:

... If you've ever used .NET remoting for large amounts of data, you will have found that there are problems with scalability. For small amounts of data, it works well enough, but larger amounts take a lot of CPU and memory, generate massive amounts of data for transmission, and can fail with Out Of Memory exceptions. There is also a big problem with the time taken to actually perform the serialization - large amounts of data can make it unfeasible for use in apps ....

I have used this library with great success in my application.

To make sure .NET serialisation is never used, put a Debug.Assert(false), a Debug.WriteLine() or similar into the place in the library code where it falls back on .NET serialisation. That's at the end of the function WriteObject() in the file FastSerializer.cs, near `createBinaryFormatter().Serialize(BaseStream, value);`.

Peter Mortensen