When my project is running, it collects a large number of text blocks (about 20K in a typical run; the largest I have seen is about 200K of them) in a short span of time and stores them in a relational database. Each text block is relatively small, averaging about 15 short lines (roughly 300 characters). The current implementation is in C# (VS2008) on .NET 3.5, and the backend DBMS is MS SQL Server 2005.

Performance and storage are both important concerns for the project, but the priority is performance first, then storage. I am looking for answers to these questions:

  • Should I compress the text before storing it in the DB, or let SQL Server worry about compacting the storage?
  • What would be the best compression algorithm/library to use in this context for the best performance? Currently I just use the standard GZip in the .NET framework (see the sketch after this list).
  • Do you know of any best practices for dealing with this? I welcome outside-the-box suggestions as long as they are implementable in the .NET framework (it is a big project and this requirement is only a small part of it).
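
For reference, this is roughly the shape of my current GZip round-trip; a minimal sketch only, and the TextCompressor helper is just an illustrative name, not real project code:

using System.IO;
using System.IO.Compression;
using System.Text;

static class TextCompressor
{
    // Compress one text block to a byte[] suitable for a varbinary(max) column.
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            // The GZipStream must be closed before reading the buffer back.
            return output.ToArray();
        }
    }

    // Decompress back to the original string when the row is fetched for display.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}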

EDITED: I will keep adding to this to clarify points raised

  • I don't need text indexing or searching on these text blocks. I just need to be able to retrieve each one at a later stage, by its primary key, for display as a block of text.
  • I have a working solution implemented as described above, and SQL Server has no trouble handling it. The program will run quite often and needs to work with a large data context, so you can imagine the size will grow very rapidly; hence every optimization I can make will help.
A: 

Sounds like you would benefit from using Large-Value Data Types (e.g. varchar(max)).

These data types will store up to 2^31-1 bytes of data.

If all of your strings are smallish, there are diminishing returns to be gained by compressing them. Without native SQL compression, they will not be searchable anyway if you compress them.

Chris Ballance
I don't think this helps me. I have lots of small pieces of data that need to be stored separately, not as one large blob.
Fadrian Sudaman
Anything over 8 KB will require an alternate storage type such as this.
Chris Ballance
What exactly are you storing in the strings? This information might help me point you in the right direction.
Chris Ballance
As I said in my question, each of my strings is only about 300 characters long, so nowhere near 8 KB. I have about 20K-200K of these strings to collect, but each of them is independent of the others. The text is just a snippet of something. As an example, think of a method body: I will have, say, 50K methods, and each of them has a method body averaging about 15 lines. Please don't get misled by the example and suggest source control, because that does not solve my problem. Thanks.
Fadrian Sudaman
A: 

I wouldn't worry about compressing them. For strings this size (300 characters or so), it's going to be more of a headache than it's worth. Compressing strings takes time (no matter how small they are), and SQL Server 2005 does not have a native way of doing it, which means you are going to have to write something yourself. If you do it in the application, that is going to hurt your performance; you could write a CLR routine to do it in the database, but it is still going to be an extra step to actually use the compressed string in your application (or any other application that uses it, for that matter).

Space in a database is cheap, so you aren't really saving much by compressing all the strings. Your biggest problem is going to be keeping a large number of strings in your application's memory. If you are routinely going back to the database to load some of them and not trying to cache all of them at the same time, I wouldn't worry about it unless you are actually seeing problems.

Kevin
The reason to compress isn't just to save space; it's also to save I/O. If you can save thousands of I/Os, you can get much better performance.
Gabe
+2  A: 

If you can upgrade to SQL Server 2008, I would recommend just turning on page compression, as detailed here: http://msdn.microsoft.com/en-us/library/cc280449.aspx

As an example, you can create a compressed table like this:

CREATE TABLE T1 
(c1 int, c2 nvarchar(50) )
WITH (DATA_COMPRESSION = PAGE);

If you can't use compression in the database, unfortunately your strings (no more than 300 chars) are not going to be worthwhile to compress using something like System.IO.Compression. I suppose you could try it, though.

Gabe
+1 Thanks. I will try this and monitor the performance and storage. Still hoping to get more advice that specifically addresses my questions above.
Fadrian Sudaman
On second thought, maybe I can't use it. The article says "Compression is available only in the SQL Server 2008 Enterprise and Developer editions," and I can't assume that in production.
Fadrian Sudaman
Sorry, Fadrian, but it is expensive. You can use MySQL instead, which is free. Its InnoDB tables support compression.
Gabe
Thanks. I actually migrated from MySQL to SQL Server early this year. From my personal experience comparing MySQL and MS SQL on this project, SQL Server seems a bit faster and more compact in storage for the data I have. The performance may be due to the native SQL client built into the .NET framework, I guess.
Fadrian Sudaman
How much of a performance hit do you take by enabling row compression?
Chris Ballance
If your database queries are I/O-bound, you would see a performance increase (because compressed data requires less I/O). If your queries are CPU-bound, it all depends on how much data you're accessing.
Gabe
+1 Thanks for the explanation gabe
Chris Ballance
+1  A: 

Not entirely clear on what you are asking.

In regard to performance: if you are compressing the strings in memory before storing them in the database, your program is going to be slower than if you just stuff the data straight into the table and let SQL Server worry about it later. The trade-off is that the database will be larger, but 1 TB hard drives are cheap, so is storage really that big a deal?

Based on your numbers (200K strings at roughly 300 bytes each), you are only talking about roughly 60 MB. That is not a very large dataset. Have you considered using the bulk copy feature in ADO.NET (http://msdn.microsoft.com/en-us/library/7ek5da1a.aspx)? If all of your data goes into one table, this should be fine.

This would be an alternative to having something like EF generating essentially 200K insert statements.

UPDATE: Here is another example: http://weblogs.sqlteam.com/mladenp/archive/2006/08/26/11368.aspx
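
To make the idea concrete, here is a rough sketch of the DataTable + SqlBulkCopy approach; the BulkLoader helper, table name, and column names are made up for illustration, not taken from the linked articles:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static class BulkLoader
{
    // Pushes all text blocks to the server in one bulk operation instead of
    // issuing a couple hundred thousand individual INSERT statements.
    public static void Save(string connectionString,
                            IEnumerable<KeyValuePair<int, string>> blocks)
    {
        var table = new DataTable("TextBlock");
        table.Columns.Add("ParentId", typeof(int));
        table.Columns.Add("Body", typeof(string));

        // This is also the point where foreign keys can be preloaded,
        // as discussed in the comments below.
        foreach (var block in blocks)
            table.Rows.Add(block.Key, block.Value);

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.TextBlock";
            bulk.BatchSize = 5000;                    // tune for your workload
            bulk.ColumnMappings.Add("ParentId", "ParentId");
            bulk.ColumnMappings.Add("Body", "Body");
            bulk.WriteToServer(table);
        }
    }
}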

Jake
My overall dataset is not just this; this is just the specific problem I asked about here. Also, with the frequency it runs at, 60 MB-100 MB of storage per run will fill up 1 TB fairly quickly too. I will try out the SqlBulkCopy operation and see if it helps with performance. The only problem is that I then have to work out all the foreign keys manually for the parent table that refers to this.
Fadrian Sudaman
Sounds like you are dealing with something pretty fun. If you look at the docs, you can update multiple tables with the bulk copy approach. In the example I added to my answer, the author hand-creates a DataTable that is used in the bulk copy operation. If you did something similar as you built your DataTable, you could preload all of your foreign keys and already have your lookups populated. When you finally fire off the bulk copy write, it should fly. That of course does nothing to solve your storage issue, but we'll make it a bigger problem sooner.
Jake
+1 Thanks Jake for the extra info in the comment and updating the example. I will try this out
Fadrian Sudaman
I would try this out first, because with your small amounts of data, it's probably the transaction overhead that kills you.
Gabe
@Jake and Gabe, just a follow-up: I have converted a portion of my code to perform batch inserts using SqlBulkCopy and gained a significant performance improvement. I was also able to make it work with my LINQ data context and tracked all the created keys for use in my reference table. Thanks for all the advice.
Fadrian Sudaman
Really glad it worked. The LINQ integration is really cool.
Jake
+1  A: 

Compression will consume resources, and it will typically hurt performance in situations where the significant cost is just local communication and processing.

Rohit
+2  A: 

The strings are, on average, 300 characters each. That's either 300 or 600 bytes, depending on Unicode settings. Let's say you use a varchar(4000) column and use (on average) 300 bytes each.

Then you have up to 200,000 of these to store in a database.

That's less than 60 MB of storage. In the land of databases, that is, quite frankly, peanuts. 60 GB of storage is what I'd call a "medium" database.

At this point in time, even thinking about compression is premature optimization. SQL Server can handle this amount of text without breaking a sweat. Barring any system constraints that you haven't mentioned, I would not concern myself with any of this until and unless you actually start to see performance problems - and even then it will likely be the result of something else, like a poor indexing strategy.

And compressing certain kinds of data, especially very small amounts of data (and 300 bytes is definitely small), can actually sometimes yield worse results. You could end up with "compressed" data that is actually larger than the original data. I'm guessing that most of the time, the compressed size will probably be very close to the original size.

SQL Server 2008 can perform page-level compression, which would be a somewhat more useful optimization, but you're on SQL Server 2005. So no, definitely don't bother trying to compress individual values or rows; it's not going to be worth the effort and may actually make things worse.
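
If you want to see for yourself how little there is to gain, a throwaway check along these lines (just a sketch; the sample text is invented) prints the before/after sizes of one small block run through the same GZipStream mentioned in the question. For inputs this small, the fixed GZip header and footer overhead (about 18 bytes) plus the cold dictionary tend to leave the result near, or even above, the original size:

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class CompressionCheck
{
    static void Main()
    {
        // A short, method-body-like sample standing in for one stored text block.
        string sample = string.Join(Environment.NewLine, new[]
        {
            "public int Add(int a, int b)",
            "{",
            "    // imagine ~15 short lines of method body here",
            "    return a + b;",
            "}"
        });

        byte[] raw = Encoding.UTF8.GetBytes(sample);
        byte[] packed;
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            packed = output.ToArray();
        }

        Console.WriteLine("original:   {0} bytes", raw.Length);
        Console.WriteLine("compressed: {0} bytes", packed.Length);
    }
}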

Aaronaught
+1 Thanks for the explanation. Like I said, this is only part of the requirement; my overall stored data is much larger than that. Plus, this is 60 MB per run, so it wouldn't take too long to fill up 60 GB or even a TB. At this stage, the program is intended to run very often. One of the answers suggested SQL 2008 compression, but it is limited to the Enterprise and Developer editions. Also, after reading the details, that compression mostly turns fixed-length data into variable-length, so it wouldn't help me much as I already do that.
Fadrian Sudaman
@Fadrian: Even if you got, say, a generous 60-70% compression level out of SQL '08 page compression (which you don't have access to), that's not really going to change the big picture. If it's filling up into the TB range then I would start to question how much of the data you really need stored permanently. I can't imagine what one would be doing with 10 TB of plain text...
Aaronaught
Totally agree. I do have some smarts in the application to avoid storing duplicates, again to reduce the storage, and hopefully a TB of storage is more than sufficient and it won't actually run into multiple TBs. Part of the problem is that the project I'm building models evolution data, so historical data is a main part of my domain model, and dealing with large storage is one of its major challenges.
Fadrian Sudaman
I was mistaken about the compression. ROW compression is useless to you, but PAGE compression, as Aaronaught mentions, compresses data across rows on the same "page" of data.
Gabe
A: 

It sounds like you are trying to solve a decidedly non-relational problem with a relational database. Why exactly are you using a database? It can be done, of course, but some problems just don't fit well. TFS shows that you can brute-force a problem into an RDBMS once you throw enough hardware at it, but that doesn't make it a good idea.

Stephan Eggermont