views: 774

answers: 4

Hi all,

Currently our .NET application constructs XML data in memory that we persist to a SQL Server database. The XElement object is converted to a string using ToString() and then stored in a varchar(MAX) column in the DB. We didn't want to use the SQL XML datatype as we didn't need any validation, and SQL doesn't need to query the XML at any stage.

Although this implementation works fine, we want to reduce the size of the database by compressing the XML before storing it, and decompressing it after retrieving it. Does anyone have any sample code for compressing an XElement object (and decompressing would be great too)? Also, what changes would I need to make to the data type of the database column so that we can fully take advantage of this compression?

I have taken another look at the XML datatype SQL Server 2005 offers, and the validation overhead is too high for us to consider using it. Also, although it does compress the XML somewhat, it doesn't provide as much compression as the .NET DeflateStream class.

I have tested the DeflateStream class by writing the XML we use to disk, and then saving the compressed version as a new file. The results are great: a 16kb file goes down to a 3kb file, so it's just a case of getting this to work in memory and saving the resulting data to the DB. Does anyone have any sample code to do the compression, and should I change the varchar(MAX) column type to maybe varbinary?

Thanks in advance

+2  A: 

This article may help you get a start.

The following snippet can compress a string and return a base-64 coded result:

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static string Compress(string text)
{
    byte[] buffer = Encoding.UTF8.GetBytes(text);
    using (MemoryStream ms = new MemoryStream())
    {
        // Pass leaveOpen: true so ms can still be read after the
        // GZipStream is disposed; disposing the GZipStream flushes the
        // final compressed block into ms.
        using (GZipStream zip = new GZipStream(ms, CompressionMode.Compress, true))
        {
            zip.Write(buffer, 0, buffer.Length);
        }

        byte[] compressed = ms.ToArray();

        // Prefix the payload with the uncompressed length (4 bytes) so
        // the decompression side knows how large a buffer to allocate.
        byte[] gzBuffer = new byte[compressed.Length + 4];
        Buffer.BlockCopy(compressed, 0, gzBuffer, 4, compressed.Length);
        Buffer.BlockCopy(BitConverter.GetBytes(buffer.Length), 0, gzBuffer, 0, 4);
        return Convert.ToBase64String(gzBuffer);
    }
}
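
And here is a matching Decompress sketch. It's untested against your data, but it simply reverses the steps above and assumes the same 4-byte length prefix written by Compress:

public static string Decompress(string compressedText)
{
    byte[] gzBuffer = Convert.FromBase64String(compressedText);

    // The first 4 bytes hold the original (uncompressed) length; the
    // rest is the compressed payload.
    int msgLength = BitConverter.ToInt32(gzBuffer, 0);

    using (MemoryStream ms = new MemoryStream(gzBuffer, 4, gzBuffer.Length - 4))
    using (GZipStream zip = new GZipStream(ms, CompressionMode.Decompress))
    {
        byte[] buffer = new byte[msgLength];

        // Read can return fewer bytes than requested, so loop until the
        // whole buffer is filled or the stream ends.
        int offset = 0;
        while (offset < buffer.Length)
        {
            int read = zip.Read(buffer, offset, buffer.Length - offset);
            if (read == 0) break;
            offset += read;
        }
        return Encoding.UTF8.GetString(buffer);
    }
}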

EDIT: As an aside, you may want to use CLOB formats even when storing XML as text because varchars have a very limited length - which XML can often quickly exceed.
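
EDIT 2: If you do go the compression route, the varbinary(MAX) column you mention is the right target: you can then skip Base64 entirely and bind the raw bytes as a parameter. A rough sketch (the table and column names here are just placeholders for your own schema):

using System.Data;
using System.Data.SqlClient;

public static void SaveCompressedXml(string connectionString, int id, byte[] compressedXml)
{
    using (SqlConnection conn = new SqlConnection(connectionString))
    using (SqlCommand cmd = new SqlCommand(
        "UPDATE dbo.XmlStore SET XmlData = @data WHERE Id = @id", conn))  // placeholder names
    {
        // A size of -1 maps the parameter to varbinary(MAX).
        cmd.Parameters.Add("@data", SqlDbType.VarBinary, -1).Value = compressedXml;
        cmd.Parameters.Add("@id", SqlDbType.Int).Value = id;

        conn.Open();
        cmd.ExecuteNonQuery();
    }
}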

LBushkin
It would be best if the data could be written as a byte array, not a Base64 string, because the 1/3 expansion from the encoding will swallow up some of the gain from compression, at least on the wire.
Steven Sudit
@Steven - The expansion amount of the base-64 coded stream really depends on the level of redundancy in the XML. Some streams may end up smaller, some may end up larger. It's difficult to predict. However, you can be sure that a pure binary stream will be smaller than the base-64 coded one. You would have to do some testing on real world data to see if the compression factor is canceled out by the base-64 coding inefficiency.
LBushkin
The amount of compression GZip gets does depend on the level of redundancy in the XML. Fortunately, XML is full of redundancy, even on top of the redundancy that plain text generally has. Base64, however, consistently expands 3 bytes of 8-bit data to 4 bytes of 7-bit text. If the output is treated as nvarchar instead of varchar, then it doubles again. On the other hand, if you actually use the XML data type in SQL Server, there's a combination of compression and indexing.
Steven Sudit
A: 

I know you tagged the question SQL 2005, but you should consider upgrading to SQL 2008 and using the wonderful new compression capabilities that come with it. It's out of the box, transparent to your application, and will save you a huge implementation/test/support cost.

Remus Rusanu
+2  A: 

I think you should also re-test the XML column. It stores the XML in a binary form, I know, not as text. It could be smaller, and may not perform badly, even if you don't actually need the additional features.

John Saunders
+1 - yes, the XML data type in SQL Server actually sort of "tokenizes" the XML stored in it - with or without an associated schema - and thus is smaller than the corresponding VARCHAR(MAX) field.
marc_s
I'm getting results that conflict with that. I have a true XML column and am confused by my results. What does it mean that datalength(cast(myxml as nvarchar(max))) > len(cast(myxml as nvarchar(max))) > datalength(myxml)?
BlueMonkMN
I have no idea. Size isn't everything, though. Test the performance.
John Saunders
Performance of what compared to what? I have only one option in my case because I need to query the XML in TSQL script. Although now I'm considering typed XML, and I have even more strange results here. Column "data" is typed XML and "data2" is untyped (same document). In descending order:

datalength(cast(data2 as nvarchar(max))) = 8712
datalength(cast(data as nvarchar(max))) = 8466
datalength(data) = 7225
datalength(data2) = 4807
len(cast(data2 as nvarchar(max))) = 4356
len(cast(data as nvarchar(max))) = 4233

Typed XML seems much larger than untyped XML in its native format!
BlueMonkMN
+1  A: 

Besides possibly compressing the string itself (perhaps using LBushkin's Base64 method above), you probably want to start with making sure you kill all the whitespace. The default XElement.ToString() method saves the element with "indenting". You need to use the ToString(SaveOptions options) method (using SaveOptions.DisableFormatting) if you want to make sure you've just got the tags and data.
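
For example (a quick sketch; doc stands in for whatever XElement you're persisting):

using System;
using System.Xml.Linq;

XElement doc = new XElement("order",
    new XElement("item", new XAttribute("sku", "A-100"), 2));

string pretty  = doc.ToString();                              // indented across multiple lines
string compact = doc.ToString(SaveOptions.DisableFormatting); // one line, no extra whitespace

Console.WriteLine(pretty.Length > compact.Length);            // True for any nested element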

Jacob Proffitt
If you want compact, machine-readable XML, then removing all that extra white-space and indenting is a great idea. Having said that, if you're going to compress it then run-length encoding will remove much of the harm. Likewise, if it winds up parsed on the server, then the white-space will affect only the size over the wire, not the stored size.
Steven Sudit