I have a client-server application that sends XML over TCP/IP from client to server and then broadcasts it out to other clients. How do I know the minimum size of XML that would warrant a performance improvement by compressing the XML rather than sending it over the regular stream?

Are there any good metrics on this or examples?

A: 

By all means compress it always.

It will save you bandwidth for anything with more than 2 tags.

Dev er dev
But isn't there overhead in zipping and unzipping?
ooo
You should also consider how the client is interpreting the XML, for example SAX parsing a compressed stream for large XML vs having to decompress and load the entire XML via DOM.
duckworth
You can always use some on-the-fly compression/decompression on streams. I don't know about C#, but it works nicely in Java: InputStream st = new GZIPInputStream(inStream); st.read();
Dev er dev
@Marko - it would be virtually identical: new GZipStream(inStream, CompressionMode.Decompress)
Marc Gravell
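
As a rough sketch of the on-the-fly approach described in the comments above (the class and method names and the TcpClient-based receive path are illustrative assumptions), the raw network stream can be wrapped in a GZipStream and handed straight to an XmlReader, so the XML is decompressed and parsed as it arrives instead of being buffered and loaded whole (the DOM-vs-streaming point from the comments):

    using System.IO;
    using System.IO.Compression;
    using System.Net.Sockets;
    using System.Xml;

    static class CompressedXmlReceiver
    {
        // Illustrative receive path: decompress and parse the XML as it streams in.
        public static void ReadCompressedXml(TcpClient client)
        {
            using (NetworkStream net = client.GetStream())
            using (var gzip = new GZipStream(net, CompressionMode.Decompress))
            using (XmlReader reader = XmlReader.Create(gzip))
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element)
                    {
                        // handle each element as it arrives
                    }
                }
            }
        }
    }
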
A: 

To decide whether compression has any benefit for you, you need to run some tests using actual or representative samples of the kind of data you expect to flow through your system.
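
A minimal sketch of such a test (the helper name is made up; feed it a representative message from your system), reporting the raw size, the gzipped size, and the time spent compressing:

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.IO.Compression;
    using System.Text;

    static class CompressionBenchmark
    {
        public static void Measure(string xml)
        {
            byte[] raw = Encoding.UTF8.GetBytes(xml);

            Stopwatch sw = Stopwatch.StartNew();
            byte[] compressed;
            using (var ms = new MemoryStream())
            {
                using (var gzip = new GZipStream(ms, CompressionMode.Compress))
                {
                    gzip.Write(raw, 0, raw.Length);
                }
                compressed = ms.ToArray(); // ToArray still works after the stream is closed
            }
            sw.Stop();

            Console.WriteLine("raw: {0} bytes, gzipped: {1} bytes, compression took: {2} ms",
                raw.Length, compressed.Length, sw.ElapsedMilliseconds);
        }
    }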

Hope this helps.

norbertB
+1  A: 

A loose metric would be to compress anything larger than a single packet, but that's just nitpicking.

There is no reason to refrain from using a binary format internally in your application - no matter how much time compression takes, the network transfer will be several orders of magnitude slower than the compression itself (unless we're talking about very slow devices).

If these two suggestions don't put you at ease, you can always benchmark to find the size at which compression starts to pay off.
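
A minimal sketch of the single-packet idea, assuming roughly 1400 bytes as the cut-off (the threshold and helper name are assumptions, not measured figures); in a real protocol the receiver also needs to be told whether a given message was compressed:

    using System.IO;
    using System.IO.Compression;

    static class PayloadCompressor
    {
        // Assumed single-packet size; tune this from your own benchmarks.
        const int CompressionThreshold = 1400;

        public static byte[] PrepareForSend(byte[] payload)
        {
            if (payload.Length <= CompressionThreshold)
                return payload; // too small to be worth compressing

            using (var ms = new MemoryStream())
            {
                using (var gzip = new GZipStream(ms, CompressionMode.Compress))
                {
                    gzip.Write(payload, 0, payload.Length);
                }
                return ms.ToArray();
            }
        }
    }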

Omer van Kloeten
+1  A: 

XML usually compresses very well, as it tends to have a lot of repetition.

Another option would be to swap to a binary format; BinaryFormatter or NetDataContractSerializer are simple options, but both are notoriously incompatible (with Java, for example) compared with XML.

Another option would be a portable binary format such as Google's "protocol buffers". I maintain a .NET/C# version of this called protobuf-net. This is designed to be side-by-side compatible with regular .NET approaches (such as XmlSerializer / DataContractSerializer), but is much smaller than XML, and requires significantly less processing (CPU etc.) for both serialization and deserialization.
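
For reference, a hedged sketch of what protobuf-net usage looks like (the Person type and its fields are invented for illustration): a type is marked with contract attributes and then serialized to any Stream:

    using System.IO;
    using ProtoBuf;

    // Hypothetical message type; the attributes assign stable field numbers.
    [ProtoContract]
    public class Person
    {
        [ProtoMember(1)] public int Id { get; set; }
        [ProtoMember(2)] public string Name { get; set; }
    }

    static class PersonSerializer
    {
        public static byte[] ToBytes(Person person)
        {
            using (var ms = new MemoryStream())
            {
                Serializer.Serialize(ms, person); // compact binary, no XML overhead
                return ms.ToArray();
            }
        }

        public static Person FromBytes(byte[] data)
        {
            using (var ms = new MemoryStream(data))
            {
                return Serializer.Deserialize<Person>(ms);
            }
        }
    }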

This page shows some numbers for XmlSerializer, DataContractSerializer and protobuf-net; I thought it included stats with/without compression, but they seem to have vanished...

[update] I should have said - there is a TCP/IP example in the QuickStart project.

Marc Gravell
A: 

In the tests that we did, we found a huge benefit, however be aware about the CPU implications.

On one project that I worked on we were sending large amounts of XML data (> 10 MB) to clients running .NET. (I'm not recommending this as a way to do things, it's just the situation we found ourselves in!) We found that as the XML files got sufficiently large, the Microsoft XML libraries were unable to parse them (the machines ran out of memory, even machines with more than 1 GB). Changing the XML parsing libraries eventually helped, but before we did that we enabled GZIP compression on the data we transferred, which helped us parse the large documents.

On our two Linux-based WebSphere servers we were able to generate the XML and then gzip it fairly easily. I think that with 50 users doing this concurrently (loading about 10 to 20 of these files) we were able to manage OK, at about 50% CPU. The compression of the XML seemed to be better handled (i.e. parsing/CPU time) on the servers than on the .NET GUIs, but this was probably due to the above inadequacies of the Microsoft XML libraries being used. As I mentioned, there are better libraries available that are faster and use less memory.

In our case, we got massive improvements in size too -- we were compressing 50 MB XML files down to about 10 MB in some cases, which obviously helped network performance as well.

Since we were concerned about the impact, and whether this would have other consequences (our users seemed to do things in large waves, so we were worried we'd run out of CPU), we had a config variable we could use to turn gzip on and off. I'd recommend that you do this too.
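
A minimal sketch of that kind of switch (the setting name and the one-byte compressed/uncompressed flag are assumptions, not details from the original system):

    using System;
    using System.Configuration;
    using System.IO;
    using System.IO.Compression;

    static class Sender
    {
        // Hypothetical appSettings flag, e.g. <add key="EnableGzip" value="true" />
        static readonly bool EnableGzip =
            string.Equals(ConfigurationManager.AppSettings["EnableGzip"], "true",
                          StringComparison.OrdinalIgnoreCase);

        // Prefix each message with a 1-byte flag so the receiver knows
        // whether the payload is gzipped.
        public static void Send(Stream output, byte[] payload)
        {
            output.WriteByte(EnableGzip ? (byte)1 : (byte)0);
            if (EnableGzip)
            {
                using (var gzip = new GZipStream(output, CompressionMode.Compress, true))
                {
                    gzip.Write(payload, 0, payload.Length);
                }
            }
            else
            {
                output.Write(payload, 0, payload.Length);
            }
        }
    }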

Another thing: we also zipped XML files before persisting them in databases, which saved about 50% space (the XML files ranged from a few KB to a few MB, but were mostly fairly small). It's probably easier to compress everything than to pick a specific size threshold for deciding when to compress.

Egwor