views: 265

answers: 8
In my application I have a simple XML-formatted file containing structured data. Each data entry has a data type and a value, something like:

<entry>
  <field type="integer">5265</field>
  <field type="float">34.23</field>
  <field type="string">Jorge</field>
</entry>

Now, this format allows us to keep the data in a human-readable form so we can check the various values, and it makes transforming and reading the file easy for interoperability.

The problem is that we have a very low bandwidth connection (about 1000 bps; yes, that's bits per second), so XML is not exactly the best format to transmit the data in. I'm looking for ways to encode the XML file into a binary equivalent that is more suitable for transmission.

Do you know of any good tutorial on the matter?

Additionally, we compress the data before sending it (simple GZIP), so I'm a little concerned about losing compression ratio if I go binary. Would the compressed size be affected so badly that it would be a bad idea to try to optimize it in the first place?

Note: This is not premature optimization; it's a requirement. 1000 bps is a really low bandwidth, so every byte counts.

Note 2: The application is written in C#, but any tutorial will do.

+1  A: 

You may want to investigate Google Protocol Buffers. They produce far smaller payloads than XML, though not necessarily the smallest payloads possible; whether they're acceptable for your use depends on a lot of factors. They're certainly easier than devising your own scheme from scratch, though.

They've been ported to C#/.NET and seem to work quite well there in my (thus far, somewhat limited) experience. There's a package at that link to integrate somewhat with VS and automatically create C# classes from the .proto files, which is very nice.

Skirwan
Seems quite good, either for directly using them or for extracting something useful. Do you know if they support transforming XSD to .proto files?
Jorge Córdoba
For clarity, protobuf-net isn't really the "port" - it is a rewrite (by me) to suit .NET typical programming styles. For the port, see dotnet-protobufs.
Marc Gravell
@Jorge: I'm not aware of such a tool, but I suspect you might be able to write an XSLT for such a transformation. @Marc: Apologies, I didn't mean to minimize your work at all. Thanks!
Skirwan
I didn't take it that way ;-p I just meant to clarify that there are parallel C# implementations.
Marc Gravell
+1  A: 

Anything which is efficient at converting the plaintext form to binary is likely to make the compression ratio much worse, yes.

However, it could well be that an XML-optimised binary format will be better than the compressed text anyway. Have a look at the various XML Binary formats listed on the Wikipedia page. I have a bit of experience with WBXML, but that's all.

As JeeBee says, a custom binary format is likely to be the most efficient approach, to be honest. You can try to gzip it, but the results will depend on what the data looks like in the first place.

And yes, as Skirwan says, Protocol Buffers are a fairly obvious candidate here - but you may want to think about custom floating-point representations, depending on what your actual requirements are. If you only need 4 significant figures (and you know the scale) then sending a two-byte integer may well be the best bet.
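For example, here is a minimal sketch of that scaled-integer idea, assuming a fixed, known scale of two decimal places (the scale factor and helper names are illustrative, not from the question):

using System;

static class ScaledFloat
{
    // Assumed fixed scale: two decimal places (factor 100).
    const float Scale = 100f;

    // Encode a value such as 34.23 into two bytes (usable range roughly -327.68..327.67).
    public static byte[] Encode(float value)
    {
        short scaled = (short)Math.Round(value * Scale);
        return BitConverter.GetBytes(scaled);
    }

    // Decode the two bytes back, losing anything beyond the assumed precision.
    public static float Decode(byte[] data)
    {
        return BitConverter.ToInt16(data, 0) / Scale;
    }
}

Encode(34.23f) comes back as two bytes instead of the four a raw float would take, at the cost of range and precision you have to fix up front.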

Jon Skeet
Not too worried about floats as almost everything is stored as decimals
Jorge Córdoba
+1  A: 

I'd dump the XML for transmission anyway: deconstruct it at the sender and reconstruct it at the receiver (in Java you could use a custom Input/OutputStream to do the work neatly). Go binary with fixed fields - data type, length, data.

Say you have 8 or fewer data types: encode the type in three bits. Then the length, e.g. as an 8-bit value (0..255).

Then for each datatype, encode differently.

  • Integer/Float: BCD - 4 bits per digit, use 15 as the decimal point. Or just the raw bits themselves (you might want different data types for 8-bit int, 16-bit int, 32-bit int, 64-bit long, 32-bit float, 64-bit double).
  • String - can you get away with 7-bit ASCII instead of 8? All upper-case letters plus digits and some punctuation could get you down to 6 bits per character.

You might want to prefix it all with the total number of fields to transmit. And perform a CRC or 8b/10b encoding if the transport is lossy, but hopefully that's already handled by the system.
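As a rough sketch of that type/length/data layout in C# (kept byte-aligned for readability rather than packing the type into three bits; the type codes are invented for illustration):

using System;
using System.IO;
using System.Text;

// Illustrative type codes; pick whatever fits your real data types.
enum FieldType : byte { Int32 = 0, Single = 1, Text = 2 }

static class FieldWriter
{
    // Write one field as: type (1 byte), length (1 byte), then the payload bytes.
    public static void Write(Stream s, FieldType type, byte[] payload)
    {
        if (payload.Length > byte.MaxValue)
            throw new ArgumentException("Payload too long for a one-byte length prefix.");
        s.WriteByte((byte)type);
        s.WriteByte((byte)payload.Length);
        s.Write(payload, 0, payload.Length);
    }

    public static void WriteInt32(Stream s, int value)
    {
        Write(s, FieldType.Int32, BitConverter.GetBytes(value));
    }

    public static void WriteSingle(Stream s, float value)
    {
        Write(s, FieldType.Single, BitConverter.GetBytes(value));
    }

    public static void WriteString(Stream s, string value)
    {
        Write(s, FieldType.Text, Encoding.ASCII.GetBytes(value));
    }
}

Written this way, the three example fields from the question (5265, 34.23 and "Jorge") come to 19 bytes before compression; packing the type into three bits and tightening the lengths as described above would shave off a few more.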

However, don't underestimate how well XML text can be compressed. I would certainly do some calculations to check how much compression is actually being achieved.

JeeBee
+1  A: 

The first thing to try is gzip; beyond that, I would try protobuf-net - I can think of a few ways of encoding that quite easily, but it depends on how you are building the XML, and whether you mind a bit of code to shim between the two formats. In particular, I can imagine representing the different data types as either 3 optional fields on the same type, or 3 different subclasses of an abstract contract.

using System.Collections.Generic;
using ProtoBuf;

[ProtoContract]
class EntryItem {
    // exactly one of these three optional members is set per item
    [ProtoMember(1)]
    public int? Int32Value {get;set;}
    [ProtoMember(2)]
    public float? SingleValue {get;set;}
    [ProtoMember(3)]
    public string StringValue {get;set;}
}
[ProtoContract]
class Entry {
    [ProtoMember(1)]
    public List<EntryItem> Items {get; set;}
}


With test:

using System;
using System.Collections.Generic;
using System.IO;
using NUnit.Framework;
using ProtoBuf;

[TestFixture]
public class TestEntries {
    [Test]
    public void ShowSize() {
        Entry e = new Entry {
            Items = new List<EntryItem>{
                new EntryItem { Int32Value = 5265},
                new EntryItem { SingleValue = 34.23F },
                new EntryItem { StringValue = "Jorge" }
            }
        };
        var ms = new MemoryStream();
        Serializer.Serialize(ms, e);
        // dump the serialized length and the raw wire bytes
        Console.WriteLine(ms.Length);
        Console.WriteLine(BitConverter.ToString(ms.ToArray()));
    }
}

Results (21 bytes)

0A-03-08-91-29-0A-05-15-85-EB-08-42-0A-07-1A-05-4A-6F-72-67-65
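For reference, the other option mentioned at the start of this answer (three subclasses of an abstract contract) could be sketched roughly like this; the type names and inheritance tag numbers are illustrative, not taken from the question:

using ProtoBuf;

[ProtoContract]
[ProtoInclude(10, typeof(Int32Item))]
[ProtoInclude(11, typeof(SingleItem))]
[ProtoInclude(12, typeof(StringItem))]
abstract class EntryItemBase { }

[ProtoContract]
class Int32Item : EntryItemBase {
    [ProtoMember(1)]
    public int Value {get;set;}
}

[ProtoContract]
class SingleItem : EntryItemBase {
    [ProtoMember(1)]
    public float Value {get;set;}
}

[ProtoContract]
class StringItem : EntryItemBase {
    [ProtoMember(1)]
    public string Value {get;set;}
}

Entry would then hold a List<EntryItemBase>, and each item is encoded as whichever subtype it actually is.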
Marc Gravell
This seems quite interesting, both in syntax and results. You wouldn't happen to know how it compares to Fast Infoset, would you?
Jorge Córdoba
Hmmm, but... again, this assumes I know in advance which data structures I'll be transferring (so EntryItem is fixed). In fact I only know that an Entry may have fields, and that each field will have a type and a value (an object). We generate the XML file by reflecting on the object and getting the field type at runtime. Can protobuf-net do the same? If not, can I "convert" the XML file to something protobuf-compatible?
Jorge Córdoba
I haven't compared it directly to Fast Infoset. Re being totally dynamic... well, it uses reflection too - so it depends: could you decorate your object with the necessary attributes? By its nature, it **doesn't** include any metadata in the wire format, so you *must* know this ahead of time to serialize/deserialize a full object. With some extra code you can make it round-trip safe for unexpected data, and query that extra data manually, but that isn't as simple as knowing the object model in advance. Generics might be of use, but *again* it needs to know all expected types up-front.
Marc Gravell
+2  A: 

Try using ASN.1. The Packed Encoding Rules (PER) should yield a pretty decently compressed form on their own, and the XML Encoding Rules (XER) should yield something equivalent to your existing XML.

Also, consider using 7zip instead of gzip.

Brian
+1  A: 

I would look into configuring your app to be responsive to smaller XML fragments; in particular ones which are small enough to fit in a single network packet.

Then arrange your data to be transmitted in order of importance to the user so that they can see useful stuff and maybe even start working on it before all the data arrives.

Nico
We already do :) but that's a nice suggestion... and not that easy to implement because of data integrity.
Jorge Córdoba
A: 

Here's the pickle you're in, though: you're compressing things with GZip. GZip is horrible on plain text until you hit about the length of the total concatenated works of Dickens, or about 1200 lines of code, because of the overhead of the dictionary and other structures GZip uses for compression.

1Kbps is fine for the task: 7500 chars will take about a minute under optimal conditions, and for <300 chars you should be fine! However, if you're really that concerned, you're going to want to compress this down for brevity. Here's how I do things at this scale:

T[ype]L[ength][data data data]+

That is, T represents the TYPE: say 0x01 for INT, 0x02 for STRING, etc. LENGTH is a single byte... so 0xFF = 255 bytes long, etc. An example data packet would look like:

0x01 0x01 0x3F 0x01 0x01 0x2D 0x02 0x06 H E L L O 0x00

This says: I have an INT, length 1, of value 0x3F; an INT, length 1, of value 0x2D; then a STRING, length 6, of a null-terminated "HELLO" (ASCII assumed). Learn the wonders that are System.Text.Encoding.UTF8.GetBytes, BitConverter, and ByteConverter.
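A quick sketch of building that exact example packet in C#, assuming the ASCII, null-terminated convention above (the 0x01/0x02 type codes are the made-up ones from this answer):

using System.IO;
using System.Text;

static class TlvExample
{
    public static byte[] BuildExamplePacket()
    {
        var ms = new MemoryStream();

        // INT, length 1, value 0x3F
        ms.WriteByte(0x01); ms.WriteByte(0x01); ms.WriteByte(0x3F);
        // INT, length 1, value 0x2D
        ms.WriteByte(0x01); ms.WriteByte(0x01); ms.WriteByte(0x2D);

        // STRING, length 6: "HELLO" plus a null terminator
        byte[] text = Encoding.ASCII.GetBytes("HELLO");
        ms.WriteByte(0x02);
        ms.WriteByte((byte)(text.Length + 1));
        ms.Write(text, 0, text.Length);
        ms.WriteByte(0x00);

        return ms.ToArray();  // 14 bytes total
    }
}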

For reference, see this page for a sense of just how much 1Kbps is. Really, for the size you're dealing with you should be fine.

Indrora
+1  A: 

Late response -- at least it comes before year's end ;-)

You mentioned Fast Infoset. Did you try it? It should give you the best results in terms of both compactness and performance. Add GZIP compression and the final size will be really small, and you will have avoided the processing penalties of compressing XML. WCF-Xtensions offers a Fast Infoset message encoding and GZIP/DEFLATE/LZMA/PPMs compression too (works on .NET/CF/SL/Azure).

Alexander Philippou