When my project is running, it collects a large number of text blocks (about 20K in a typical run; the largest I have seen is about 200K of them) in a short span of time and stores them in a relational database. Each text block is relatively small, averaging about 15 short lines (roughly 300 characters). The current implementation is in C# (VS2008) on .NET 3.5, and the backend DBMS is MS SQL Server 2005.

Performance and storage are both important concerns for the project, but the priority is performance first, then storage. I am looking for answers to these questions:

  • Should I compress the text before storing it in the DB, or let SQL Server worry about compacting the storage?
  • What would be the best compression algorithm/library to use in this context for the best performance? Currently I just use the standard GZip in the .NET framework (see the sketch after this list).
  • Do you know of any best practices for dealing with this? I welcome outside-the-box suggestions as long as they are implementable in the .NET framework (it is a big project and this requirement is only a small part of it).
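
For reference, this is roughly the shape of my current GZip round-trip; a minimal sketch only, and the TextCompressor helper is just an illustrative name, not real project code:

using System.IO;
using System.IO.Compression;
using System.Text;

static class TextCompressor
{
    // Compress one text block to a byte[] suitable for a varbinary(max) column.
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            // The GZipStream must be closed before reading the buffer back.
            return output.ToArray();
        }
    }

    // Decompress back to the original string when the row is fetched for display.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}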

EDITED: I will keep adding to this to clarify points raised

  • I don't need text indexing or searching on these text blocks. I just need to be able to retrieve each one at a later stage, by its primary key, for display as a block of text.
  • I have a working solution implemented as described above, and SQL Server has no trouble handling it. The program will run quite often and needs to work with a large data context, so you can imagine the size will grow very rapidly; hence every optimization I can make will help.
A: 

Sounds like you would benefit from using Large-Value Data Types (e.g. varchar(max)).

These data types will store up to 2^31-1 bytes of data.

If all of your strings are smallish, there are diminishing returns to be gained by compressing them. Without native SQL compression, they will not be searchable anyway if you compress them.

Chris Ballance
I don't think this helps me. I have lots of small pieces of data that need to be stored separately, not as one large blob.
Fadrian Sudaman
Anything over 8 KB will require an alternate storage type such as this.
Chris Ballance
What exactly are you storing in the strings? This information might help me point you in the right direction.
Chris Ballance
As I said in my question, each of my strings is only about 300 characters long, so nowhere near 8 KB. I have about 20K-200K of these strings to collect, but each of them is independent of the others. The text is just a snippet of something. As an example, think of a method body: I will have, say, 50K methods, and each of them has a method body averaging about 15 lines. Please don't get misled by the example and suggest source control, because that does not solve my problem. Thanks.
Fadrian Sudaman
A: 

I wouldn't worry about compressing them. For strings this size (300 characters or so), it's going to be more of a headache than it's worth. Compressing strings takes time (no matter how small they are), and SQL Server 2005 does not have a native way of doing it, which means you are going to have to write something yourself. If you do it in the application, that is going to hurt your performance; you could write a CLR routine to do it in the database, but it is still going to be an extra step to actually use the compressed string in your application (or any other application that uses it, for that matter).

Space in a database is cheap, so you aren't really saving much by compressing all the strings. Your biggest problem is going to be keeping a large number of strings in your application's memory. If you are routinely going back to the database to load some of them and not trying to cache all of them at the same time, I wouldn't worry about it unless you are actually seeing problems.

Kevin
The reason to compress isn't just to save space; it's also to save I/O. If you can save thousands of I/Os, you can get much better performance.
Gabe
+2  A: 

If you can upgrade to SQL Server 2008, I would recommend just turning on page compression, as detailed here: http://msdn.microsoft.com/en-us/library/cc280449.aspx

As an example, you can create a compressed table like this:

CREATE TABLE T1 
(c1 int, c2 nvarchar(50) )
WITH (DATA_COMPRESSION = PAGE);

If you can't use compression in the database, unfortunately your strings (no more than 300 chars) are not going to be worthwhile to compress using something like System.IO.Compression. I suppose you could try it, though.

Gabe
+1 Thanks. I will try this and monitor the performance and storage. Still hoping to get more advice that specifically addresses my questions above.
Fadrian Sudaman
On second thought, maybe I can't use it. The article says "Compression is available only in the SQL Server 2008 Enterprise and Developer editions," and I can't assume that in production.
Fadrian Sudaman
Sorry, Fadrian, but it is expensive. You can use MySQL instead, which is free. Its InnoDB tables support compression.
Gabe
Thanks. I actually migrated from MySQL to SQL Server early this year. From my personal experience comparing MySQL and MS SQL on this project, SQL Server seems a bit faster and more compact in storage for the data I have. The performance may be due to the native SQL client built into the .NET framework, I guess.
Fadrian Sudaman
How much of a performance hit do you take by enabling row compression?
Chris Ballance
If your database queries are I/O-bound, you would see a performance increase (because compressed data requires less I/O). If your queries are CPU-bound, it all depends on how much data you're accessing.
Gabe
+1 Thanks for the explanation gabe
Chris Ballance
+1  A: 

Not entirely clear on what you are asking.

In regard to performance: if you are compressing the strings in memory before storing them in the database, your program is going to be slower than if you just stuff the data straight into the table and let SQL Server worry about it later. The trade-off is that the database will be larger, but 1 TB hard drives are cheap, so is storage really that big a deal?

Based on your numbers (200K strings at roughly 300 bytes each), you are only talking about roughly 60 MB. That is not a very large dataset. Have you considered using the bulk copy feature in ADO.NET (http://msdn.microsoft.com/en-us/library/7ek5da1a.aspx)? If all of your data goes into one table, this should be fine.

This would be an alternative to having something like EF generating essentially 200K insert statements.

UPDATE: Here is another example: http://weblogs.sqlteam.com/mladenp/archive/2006/08/26/11368.aspx
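
To make the idea concrete, here is a rough sketch of the DataTable + SqlBulkCopy approach; the BulkLoader helper, table name, and column names are made up for illustration, not taken from the linked articles:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static class BulkLoader
{
    // Pushes all text blocks to the server in one bulk operation instead of
    // issuing a couple hundred thousand individual INSERT statements.
    public static void Save(string connectionString,
                            IEnumerable<KeyValuePair<int, string>> blocks)
    {
        var table = new DataTable("TextBlock");
        table.Columns.Add("ParentId", typeof(int));
        table.Columns.Add("Body", typeof(string));

        // This is also the point where foreign keys can be preloaded,
        // as discussed in the comments below.
        foreach (var block in blocks)
            table.Rows.Add(block.Key, block.Value);

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.TextBlock";
            bulk.BatchSize = 5000;                    // tune for your workload
            bulk.ColumnMappings.Add("ParentId", "ParentId");
            bulk.ColumnMappings.Add("Body", "Body");
            bulk.WriteToServer(table);
        }
    }
}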

Jake
My overall dataset is not just this; this is just the specific problem I asked about here. Also, with the frequency it runs at, 60 MB-100 MB of storage per run will fill up 1 TB fairly quickly too. I will try out the SqlBulkCopy operation and see if it helps with performance. The only problem is that I then have to work out all the foreign keys manually for the parent table that refers to this.
Fadrian Sudaman
Sounds like you are dealing with something pretty fun. If you look at the docs, you can update multiple tables with the bulk copy approach. In the example I added to my answer, the author hand-creates a DataTable that is used in the bulk copy operation. If you did something similar as you built your DataTable, you could preload all of your foreign keys and already have your lookups populated. When you finally fire off the bulk copy write, it should fly. That of course does nothing to solve your storage issue, but we'll make it a bigger problem sooner.
Jake
+1 Thanks Jake for the extra info in the comment and updating the example. I will try this out
Fadrian Sudaman
I would try this out first, because with your small amounts of data, it's probably the transaction overhead that kills you.
Gabe
@Jake and Gabe, just a follow-up: I have converted a portion of my code to perform batch inserts using SqlBulkCopy and gained a significant performance improvement. I was also able to make it work with my LINQ data context and tracked all the created keys for use in my reference table. Thanks for all the advice.
Fadrian Sudaman
Really glad it worked. The LINQ integration is really cool.
Jake
+1  A: 

Compression will consume resources, and it will typically hurt performance in situations where the significant cost is just local communication and processing.

Rohit
+2  A: 

The strings are, on average, 300 characters each. That's either 300 or 600 bytes, depending on Unicode settings. Let's say you use a varchar(4000) column and use (on average) 300 bytes each.

Then you have up to 200,000 of these to store in a database.

That's less than 60 MB of storage. In the land of databases, that is, quite frankly, peanuts. 60 GB of storage is what I'd call a "medium" database.

At this point in time, even thinking about compression is premature optimization. SQL Server can handle this amount of text without breaking a sweat. Barring any system constraints that you haven't mentioned, I would not concern myself with any of this until and unless you actually start to see performance problems - and even then it will likely be the result of something else, like a poor indexing strategy.

And compressing certain kinds of data, especially very small amounts of data (and 300 bytes is definitely small), can actually sometimes yield worse results. You could end up with "compressed" data that is actually larger than the original data. I'm guessing that most of the time, the compressed size will probably be very close to the original size.

SQL Server 2008 can perform page-level compression, which would be a somewhat more useful optimization, but you're on SQL Server 2005. So no, definitely don't bother trying to compress individual values or rows; it's not going to be worth the effort and may actually make things worse.
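
If you want to see for yourself how little there is to gain, a throwaway check along these lines (just a sketch; the sample text is invented) prints the before/after sizes of one small block run through the same GZipStream mentioned in the question. For inputs this small, the fixed GZip header and footer overhead (about 18 bytes) plus the cold dictionary tend to leave the result near, or even above, the original size:

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class CompressionCheck
{
    static void Main()
    {
        // A short, method-body-like sample standing in for one stored text block.
        string sample = string.Join(Environment.NewLine, new[]
        {
            "public int Add(int a, int b)",
            "{",
            "    // imagine ~15 short lines of method body here",
            "    return a + b;",
            "}"
        });

        byte[] raw = Encoding.UTF8.GetBytes(sample);
        byte[] packed;
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            packed = output.ToArray();
        }

        Console.WriteLine("original:   {0} bytes", raw.Length);
        Console.WriteLine("compressed: {0} bytes", packed.Length);
    }
}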

Aaronaught
+1 Thanks for the explanation. Like I said, this is only part of the requirement; my overall stored data is much larger than that. Plus, this is 60 MB per run, so it wouldn't take too long to fill up 60 GB or even a TB. At this stage, the program is intended to run very often. One of the answers suggested SQL 2008 compression, but it is limited to the Enterprise and Developer editions. Also, after reading the details, that compression mostly turns fixed-length data into variable-length, so it wouldn't help me much as I already do that.
Fadrian Sudaman
@Fadrian: Even if you got, say, a generous 60-70% compression level out of SQL '08 page compression (which you don't have access to), that's not really going to change the big picture. If it's filling up into the TB range then I would start to question how much of the data you really need stored permanently. I can't imagine what one would be doing with 10 TB of plain text...
Aaronaught
Totally agree. I do have some smarts in the application to avoid storing duplicates, again to reduce the storage, and hopefully a TB of storage is more than sufficient and it won't actually run into multiple TBs. Part of the problem is that the project I'm building models evolution data, so historical data is a main part of my domain model, and dealing with large storage is one of its major challenges.
Fadrian Sudaman
I was mistaken about the compression. ROW compression is useless to you, but PAGE compression, as Aaronaught mentions, compresses data across rows on the same "page" of data.
Gabe
A: 

It sounds like you are trying to solve a decidedly non-relational problem with a relational database. Why exactly are you using a database? It can be done, of course, but some problems just don't fit well. TFS shows that you can brute-force a problem into an RDBMS once you throw enough hardware at it, but that doesn't make it a good idea.

Stephan Eggermont