Background: Professional tools developer. SQL/DB amateur.
Setup: .NET 3.5 WinForms app talking to MS SQL Server 2008.
Scenario: I am populating a database with information extracted from a large quantity of files. This amounts to about 60M records, each of which has an arbitrarily sized message associated with it. My original plan was for an nvarchar(max) field in the record to hold the messages, however after performing a test run on a subset of the data this was going to make the database too large (it extrapolates to an unacceptable 113GB). Running a few queries on this initial test data set (a 1.3GB database) I discovered a significant amount of message duplication, and that exploiting it could shrink the message data to about one sixth of its size. I’ve tried and thought of a few approaches to achieve this, but none are satisfactory. I’ve searched around for a few days now, but either a) there doesn’t seem to be a good answer (unlikely), or b) I don’t know how to express what I need well enough (more likely).
Approaches considered/tried:
- Bulk insertion of messages into records with an nvarchar(max) field. – found to have too much redundancy.
- Stick with this message column but find a way to get the database to ‘compress’ the messages. – no idea how to do this.
- Add a message table for unique messages, keyed on an ID that the main record(s) ‘point’ at. – while this works in principle, implementing the uniqueness check turns out to be painful and it suffers from slowdown as more messages are added.
- Perform duplicate removal on the client. – requires that all the messages are fetched to the client for each population session. This doesn’t scale as they would need to fit in memory.
- Add an extra (indexed) hash column to the message table and submit the messages with a corresponding (locally generated) hash value. Search on this to narrow down the messages that actually need testing against. – complicated; there must be a better way (a rough sketch of what I mean follows this list).
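For reference, this is roughly what I was picturing for that last hash idea, except with the hash computed server-side via CHECKSUM rather than on the client (the table and column names are hypothetical, and I haven’t tried this at scale; HASHBYTES would be an alternative, but on SQL Server 2008 it only accepts up to 8000 bytes of input):

    -- Sketch only: an indexable int hash alongside the nvarchar(max) column.
    ALTER TABLE dbo.Message
        ADD MessageHash AS CHECKSUM(MessageText);

    CREATE INDEX IX_Message_Hash ON dbo.Message (MessageHash);

    -- Lookups narrow by hash first and then confirm against the full text,
    -- since CHECKSUM collisions are possible:
    -- SELECT MessageId
    -- FROM dbo.Message
    -- WHERE MessageHash = CHECKSUM(@Text) AND MessageText = @Text;

The idea is that the index seek on the int hash should cut each lookup down to a handful of candidate rows, but I don’t know whether this is the ‘proper’ way to do it.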
This third approach amounts to the creation of a string dictionary table. After a couple of iterations on this idea I ended up with the following:
- The database has a message table which maps an (auto-assigned) int ID primary key to an nvarchar(max) message.
- The client batches up messages and submits multiple records for insertion to a stored procedure.
- The stored procedure iterates through the batch of incoming records (roughly as sketched after this list), and for each message:
i. The message dictionary table is checked (SELECT) for an existing instance of the message.
ii. If found, remember the ID of the existing message.
iii. If not found, insert a new message record, remembering the ID of the new record (OUTPUT).
- The IDs for all the messages (old and new) are returned as an output result set from the procedure.
- The client generates the main table records with entries (int foreign keys) for the messages filled in with the IDs returned from the procedure.
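For concreteness, here is roughly what the procedure looks like at the moment. It is only a sketch with hypothetical names; the batch arrives as a SQL Server 2008 table-valued parameter, dbo.Message is the dictionary table described above, and I’ve used SCOPE_IDENTITY() here rather than an OUTPUT clause just to keep it short:

    -- Hypothetical table type carrying one batch of messages, in client order.
    CREATE TYPE dbo.MessageBatch AS TABLE
    (
        Seq         int           NOT NULL PRIMARY KEY,
        MessageText nvarchar(max) NOT NULL
    );
    GO

    CREATE PROCEDURE dbo.GetOrAddMessages
        @Batch dbo.MessageBatch READONLY
    AS
    BEGIN
        SET NOCOUNT ON;

        DECLARE @Results TABLE (Seq int NOT NULL, MessageId int NOT NULL);
        DECLARE @Seq int, @Text nvarchar(max), @Id int;

        DECLARE batch_cursor CURSOR LOCAL FAST_FORWARD FOR
            SELECT Seq, MessageText FROM @Batch ORDER BY Seq;

        OPEN batch_cursor;
        FETCH NEXT FROM batch_cursor INTO @Seq, @Text;

        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- i./ii. look for an existing instance of the message and remember
            -- its ID if found (this is the search that slows down over time)
            SET @Id = NULL;
            SELECT @Id = MessageId FROM dbo.Message WHERE MessageText = @Text;

            -- iii. not found: insert a new message record and remember its ID
            IF @Id IS NULL
            BEGIN
                INSERT INTO dbo.Message (MessageText) VALUES (@Text);
                SET @Id = SCOPE_IDENTITY();
            END

            INSERT INTO @Results (Seq, MessageId) VALUES (@Seq, @Id);
            FETCH NEXT FROM batch_cursor INTO @Seq, @Text;
        END

        CLOSE batch_cursor;
        DEALLOCATE batch_cursor;

        -- IDs for all the messages (old and new), in the original batch order.
        SELECT Seq, MessageId FROM @Results ORDER BY Seq;
    END

The cursor is just the simplest way I could find to preserve the batch order; it’s the SELECT against dbo.Message inside the loop that degrades as the table grows.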
Issues:
- The search for existing messages gets slower and slower as the number of messages grows, becoming the limiting factor.
- I’ve tried putting a (UNIQUE) index on the message column, but you can’t index an nvarchar(max) column.
- I’ve looked at the Full Text Search capabilities of MS SQL Server 2008, but this seems like overkill to me.
- I’ve thought about trying to MERGE in batches of messages, but I can’t see a way to easily get the corresponding list of IDs (old and new, in the right order) to give back to the client (see the sketch after this list).
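This is the sort of MERGE I was experimenting with (again only a sketch, reusing the hypothetical @Batch parameter and dbo.Message table from above). The OUTPUT clause gives me the IDs of the newly inserted messages, but not of the ones that already existed, which is where I get stuck:

    -- Sketch only: the batch would need de-duplicating on MessageText first.
    MERGE dbo.Message AS tgt
    USING @Batch AS src
        ON tgt.MessageText = src.MessageText
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (MessageText) VALUES (src.MessageText)
    OUTPUT src.Seq, inserted.MessageId;
    -- Only the rows actually inserted by the MERGE appear in this OUTPUT;
    -- the IDs of messages that were already present do not.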
It seems to me that I’m trying to achieve some sort of normalisation of my data, but from what I understand of database design this is more like ‘row normalisation’ than proper normalisation, which is about ‘column normalisation’. I’m surprised this isn’t something that’s needed all over the place, with corresponding support already available.
And so, my question is: What is the right approach here?
Any help greatly appreciated.
Sam