I have an automated process that inserts an XML document into a SQL Server 2008 table; the column is of type XML. There is a lot of duplicated data. Can anyone recommend a good way to delete non-distinct rows based on the XML column? The table has thousands of rows, and each XML document is about 70k.

Each XML document looks the same except for one element value, for example:

Row 1, Column C:

<?xml version="1.0"?><a><b/><c>2010.09.28T10:10:00</c></a>

Row 2, Column C:

<?xml version="1.0"?><a><b/><c>2010.09.29T10:10:00</c></a>

I want the value of <c> to be ignored when it comes to the diff. If everything else is equal, then I want to consider the documents to be the same. If any other element is different, then the documents should be considered different.

Thanks for all ideas.

+1  A: 

Can you qualify what 'distinct XML' means for you? For example what is the difference between:

  • <a><b/></a>
  • <?xml version="1.0"?><a><b/></a>
  • <a xmlns:xhtml="http://www.w3.org/1999/xhtml"><b/></a>
  • <a><b xsi:nil="true" /></a>
  • <a><b></b></a>
  • <?xml version="1.0" encoding="UTF-8"?><a><b/></a>
  • <?xml version="1.0" encoding="UTF-16"?><a><b></b></a>

In your opinion, how many 'distinct' XMLs are there?

Updated

If your XML looks like: <?xml version="1.0"?><a><b/><c>2010.09.29T10:10:00</c></a> then you can project the element that distinguishes the documents and query on this projection:

with cte_x as (
  select xmlcolumn.value(N'(//a/c)[1]', N'DATETIME') as xml_date_a_c,
   ...
  from table
),
cte_rank as (
  select row_number() over (partition by xml_date_a_c order by ...) as rn
  from cte_x)
delete from cte_rank
  where rn > 1;
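
For the updated question, where <c> should be ignored and everything else compared, the same pattern could be turned around: project everything except <c> and partition on that. The following is only a sketch under assumptions. The table name dbo.XmlDocs, the key column Id, and the XML column name C are hypothetical and not taken from the post. It compares only the child elements of the root <a>, so attributes on <a> or differences in a <c> nested deeper would need extra handling.

```sql
-- Sketch only. dbo.XmlDocs, Id and C are hypothetical names, not from the post.
with cte_rest as (
  select Id,
         -- Serialize every child element of <a> except <c>. The xml type
         -- cannot be used in PARTITION BY, hence the cast to nvarchar(max).
         cast(C.query(N'/a/*[local-name(.) != "c"]') as nvarchar(max)) as rest_of_doc
  from dbo.XmlDocs
),
cte_rank as (
  -- Rows whose remaining content is identical fall into the same partition;
  -- ordering by Id means the row with the lowest Id gets rn = 1.
  select row_number() over (partition by rest_of_doc order by Id) as rn
  from cte_rest
)
delete from cte_rank
  where rn > 1;
```

Comparing 70k strings across thousands of rows is likely to be slow, so it may be worth changing the final delete to a select on a copy of the table first and checking which rows would be removed.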
Remus Rusanu
Thanks for the question. Actually, there is a slight difference between the XML LOBs; I updated my original post. Thanks.
ScSub