tags:

views:

150

answers:

4

In Sql Server I am using an XML type column to store a message. I do not want to store duplicate messages.

I only will have a few messages per user. I am currently querying the table for these messages, converting the XML to string in my C# code. I then compare the strings with what I am about to insert.

Unfortunately, Sql Server pretty-prints the data in the XML typed fields. What you store into the database is not necessarily exactly the same string as what you get back out later. It is functionally equivalent, but may have white space removed, etc.

Is there an efficient way to compare an XML string that I am considering inserting with those that are already in the database? As an aside, if I detect a duplicate I need to delete the older message then insert the replacement.

+1  A: 

0 - Add a hash column to your table

1 - when you receive a new message, convert the whole XML to uppercase, remove all blanks and returns/linefeed, then compute the hash value of the normalized string.

2 - check if you already have a row with the resulting hash code in it.

  • If yes, this is duplicated, treat it accordingly
  • If not, store the original XML along with the hash in a new row
p.marino
Pedantic note: It's not required that the values are equal because the hash code is, but if you use SHA1 or MD5 or something, the risk of collitions is neglible. Non-pedantic note: You could probably even create a function to perform this calculation (CLR perhaps), create a computed column for the value, and create a unique index on the computed column.
erikkallen
@erikallen: no need for CLR function, just use built-in HASHBYTES http://msdn.microsoft.com/en-us/library/ms174415.aspx
Remus Rusanu
I don't think I need to do string normalization, your step 1. I can compute the hashes on the raw message strings, then the comparison for duplicates will still be accurate. My duplicate test is to detect duplicates on the original messages that are about to be stored. ("Hello" should not equal "HELLO").
aaaa bbbb
Even better then. What type of normalization (i.e. what you consider meaningful in terms of format, spaces, case etc.) is of course something you need to evaluate depending on the specific situation.
p.marino
A: 

One solution is to stop using the XML typed field. Store the XML string into a varchar typed field.

I don't really like this solution, but I don't really like p.marino's solution either. It doesn't seem right to store a hash of something that is already in the row in the table.

aaaa bbbb
A: 

What if you use OPENXML on each row in the table and query the actual XML information for key nodes and/or key attributes? But then you need to do it row by row, I don't think OPENXML works with a whole set of table rows.

Lars Nyström
A: 

I'm not 100% sure on your exact implementation but here is something I played around with. The idea being a stored procedure would do the inserting. Inserting into the messages table does a basic check on existing messages (SQL 2008 syntax):

declare @messages table (msg xml)
insert into @messages values 
('<message>You like oranges</message>')
,('<message>You like apples</message>')

declare @newMessage xml = '<message>You like apples</message>'

insert into @messages (msg)
select @newMessage
where @newMessage.value('(message)[1]', 'nvarchar(50)') not in (
  select msg.value('(message)[1]', 'nvarchar(50)')
  from @messages  
)
eddiegroves