Background: Professional tools developer. SQL/DB amateur.
Setup: .NET 3.5 WinForms app talking to MS SQL Server 2008.
Scenario: I am populating a database with information extracted from a large quantity of files. This amounts to about 60M records, each of which has an arbitrarily sized message associated with it. My original plan was for an nvarchar(max) field in the record to hold the messages, however after performing a test run on a subset of the data this was going to make the database too large (it extrapolates to an unacceptable 113GB). Running a few queries on this initial test data set (a 1.3GB database) I discovered a significant amount of message duplication, and that exploiting it could shrink the message data to about one sixth of its size. I’ve tried and thought of a few approaches to achieve this, but none are satisfactory. I’ve searched around for a few days now, but either a) there doesn’t seem to be a good answer (unlikely), or b) I don’t know how to express what I need well enough (more likely).
Approaches considered/tried:
- Bulk insertion of messages into records with an nvarchar(max) field. – found to have too much redundancy.
- Stick with this message column but find a way to get the database to ‘compress’ the messages. – no idea how to do this.
- Add a message table for unique messages, keyed on an ID that the main record(s) ‘point’ at. – while this works in principle, implementing the uniqueness check turns out to be painful and it suffers from slowdown as more messages are added.
- Perform duplicate removal on the client. – requires that all the messages are fetched to the client for each population session. This doesn’t scale as they would need to fit in memory.
- Add an extra (indexed) hash column to the message table and submit the messages with a corresponding (locally generated) hash value. Search on this to narrow down the messages that actually need testing against. – complicated; there must be a better way (a rough sketch of what I mean follows this list).
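For reference, this is roughly what I was picturing for that last hash idea, except with the hash computed server-side via CHECKSUM rather than on the client (the table and column names are hypothetical, and I haven’t tried this at scale; HASHBYTES would be an alternative, but on SQL Server 2008 it only accepts up to 8000 bytes of input):

    -- Sketch only: an indexable int hash alongside the nvarchar(max) column.
    ALTER TABLE dbo.Message
        ADD MessageHash AS CHECKSUM(MessageText);

    CREATE INDEX IX_Message_Hash ON dbo.Message (MessageHash);

    -- Lookups narrow by hash first and then confirm against the full text,
    -- since CHECKSUM collisions are possible:
    -- SELECT MessageId
    -- FROM dbo.Message
    -- WHERE MessageHash = CHECKSUM(@Text) AND MessageText = @Text;

The idea is that the index seek on the int hash should cut each lookup down to a handful of candidate rows, but I don’t know whether this is the ‘proper’ way to do it.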
This third approach amounts to the creation of a string dictionary table. After a couple of iterations on this idea I ended up with the following:
- The database has a message table which maps an (auto-assigned) int ID primary key to an nvarchar(max) message.
- The client batches up messages and submits multiple records for insertion to a stored procedure.
- The stored procedure iterates through the batch of incoming records (roughly as sketched after this list), and for each message:
i. The message dictionary table is checked (SELECT) for an existing instance of the message.
ii. If found, remember the ID of the existing message.
iii. If not found, insert a new message record, remembering the ID of the new record (OUTPUT).
- The IDs for all the messages (old and new) are returned as an output result set from the procedure.
- The client generates the main table records with entries (int foreign keys) for the messages filled in with the IDs returned from the procedure.
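For concreteness, here is roughly what the procedure looks like at the moment. It is only a sketch with hypothetical names; the batch arrives as a SQL Server 2008 table-valued parameter, dbo.Message is the dictionary table described above, and I’ve used SCOPE_IDENTITY() here rather than an OUTPUT clause just to keep it short:

    -- Hypothetical table type carrying one batch of messages, in client order.
    CREATE TYPE dbo.MessageBatch AS TABLE
    (
        Seq         int           NOT NULL PRIMARY KEY,
        MessageText nvarchar(max) NOT NULL
    );
    GO

    CREATE PROCEDURE dbo.GetOrAddMessages
        @Batch dbo.MessageBatch READONLY
    AS
    BEGIN
        SET NOCOUNT ON;

        DECLARE @Results TABLE (Seq int NOT NULL, MessageId int NOT NULL);
        DECLARE @Seq int, @Text nvarchar(max), @Id int;

        DECLARE batch_cursor CURSOR LOCAL FAST_FORWARD FOR
            SELECT Seq, MessageText FROM @Batch ORDER BY Seq;

        OPEN batch_cursor;
        FETCH NEXT FROM batch_cursor INTO @Seq, @Text;

        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- i./ii. look for an existing instance of the message and remember
            -- its ID if found (this is the search that slows down over time)
            SET @Id = NULL;
            SELECT @Id = MessageId FROM dbo.Message WHERE MessageText = @Text;

            -- iii. not found: insert a new message record and remember its ID
            IF @Id IS NULL
            BEGIN
                INSERT INTO dbo.Message (MessageText) VALUES (@Text);
                SET @Id = SCOPE_IDENTITY();
            END

            INSERT INTO @Results (Seq, MessageId) VALUES (@Seq, @Id);
            FETCH NEXT FROM batch_cursor INTO @Seq, @Text;
        END

        CLOSE batch_cursor;
        DEALLOCATE batch_cursor;

        -- IDs for all the messages (old and new), in the original batch order.
        SELECT Seq, MessageId FROM @Results ORDER BY Seq;
    END

The cursor is just the simplest way I could find to preserve the batch order; it’s the SELECT against dbo.Message inside the loop that degrades as the table grows.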
Issues:
- The search for existing messages gets slower and slower as the number of messages grows, becoming the limiting factor.
- I’ve tried putting a (UNIQUE) index on the message column, but you can’t index an nvarchar(max) column.
- I’ve looked at the Full Text Search capabilities of MS SQL Server 2008, but this seems like overkill to me.
- I’ve thought about trying to MERGE in batches of messages, but I can’t see a way to easily get the corresponding list of IDs (old and new, in the right order) to give back to the client (see the sketch after this list).
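This is the sort of MERGE I was experimenting with (again only a sketch, reusing the hypothetical @Batch parameter and dbo.Message table from above). The OUTPUT clause gives me the IDs of the newly inserted messages, but not of the ones that already existed, which is where I get stuck:

    -- Sketch only: the batch would need de-duplicating on MessageText first.
    MERGE dbo.Message AS tgt
    USING @Batch AS src
        ON tgt.MessageText = src.MessageText
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (MessageText) VALUES (src.MessageText)
    OUTPUT src.Seq, inserted.MessageId;
    -- Only the rows actually inserted by the MERGE appear in this OUTPUT;
    -- the IDs of messages that were already present do not.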
It seems to me that I’m trying to achieve some sort of normalisation of my data, but from what I understand of database design this is more like ‘row normalisation’ than proper normalisation, which is about ‘column normalisation’. I’m surprised this isn’t something that’s needed all over the place, with corresponding support already available.
And so, my question is: What is the right approach here?
Any help greatly appreciated.
Sam