tags:

views:

53

answers:

3

Hi,

I have a (seems to be) very easy problem, but after searching for hours, I can't find anything helpful.

Here's the problem:

In Microsoft SQL, I have a table where a column A stores some data. This data can contain duplicates (ie. two or more rows will have the same value for the column A).

I can easily find the duplicates by doing:

select A, count(A) as CountDuplicates
from TableName
group by A having (count(A) > 1)

Now, I want to retrieve the values of other columns, let's say B and C. Of course, those B and C values can be different even for the rows sharing the same A value, but it doesn't matter for me. I just want any B value and any C one, the first, the last or the random one.

If I had a small table and one or two columns to retrieve, I would do something like:

select A, count(A) as CountDuplicates, (
    select top 1 child.B from TableName as child where child.A = base.A) as B
)
from TableName as base group by A having (count(A) > 1)

The problem is that I have much more rows to get, and the table is quite big, so having several children selects will have a high performance cost.

So, is there a less ugly pure SQL solution to do this?


Not sure if my question is clear enough, so I give an example based on AdventureWorks database. Let's say I want to list available States, and for each State, get its code, a city (any city) and an address (any address). The easiest, and the most inefficient way to do it would be:

var q = from c in data.StateProvinces select new { c.StateProvinceCode, c.Addresses.First().City, c.Addresses.First().AddressLine1 };

in LINQ-to-SQL and will do two selects for each of 181 States, so 363 selects. I my case, I am searching for a way to have a maximum of 182 selects.

+4  A: 

It concerns me that you want any old value for fields b and c. If they are to be meaningless why are you returning them?

If it truly doesn't matter (and I honestly can't imagine a case where I would ever want this, but it's what you said) and the values for b and c don't even have to be from the same record, group by with the use of mon or max is the way to go. It's more complicated if you want the values for a particular record for all fields.

select A, count(A) as CountDuplicates, min(B) as B , min(C) as C
from TableName as base 
group by A 
having (count(A) > 1) 
HLGEM
Ok, it may work. By the way, what I wanted to say by meaningless is that it doesn't matter inside the same group of rows. I use this data just to have *a hint* of *"what is duplicated and how much"*.
MainMa
This may be incorrect. The B and C returned are potentially unrelated -- they may come from different records. You haven't returned an arbitrary record representing one of the A's, but fragments of two different A's.
Chris Wuestefeld
ANd I stated in the answer that it would do so. The poster himself said the values didn't matter.
HLGEM
A: 

you can do some thing like this if you have id as primary key in your table

select id,b,c from tablename 
inner join
(
select id, count(A) as CountDuplicates
from TableName as base group by A,id having (count(A) > 1) 
)d on tablename.id= d.id
Pranay Rana
Downvote - I think you're guilty of shot-gunning an answer just to be first. Obviously you didn't test this code, since it says "form" and "innet join". Also, the fact that it relies on A being a unique key (but not primary key as you said) makes it a bad general solution.
Chris Wuestefeld
ans updated now... thanks for the info
Pranay Rana
+1  A: 

The ROW_NUMBER function in a CTE is the way to do this. For example:

DECLARE @mytab TABLE (A INT, B INT, C INT)
INSERT INTO @mytab ( A, B, C ) VALUES (1, 1, 1)
INSERT INTO @mytab ( A, B, C ) VALUES (1, 1, 2)
INSERT INTO @mytab ( A, B, C ) VALUES (1, 2, 1)
INSERT INTO @mytab ( A, B, C ) VALUES (1, 3, 1)
INSERT INTO @mytab ( A, B, C ) VALUES (2, 2, 2)
INSERT INTO @mytab ( A, B, C ) VALUES (3, 3, 1)
INSERT INTO @mytab ( A, B, C ) VALUES (3, 3, 2)
INSERT INTO @mytab ( A, B, C ) VALUES (3, 3, 3)
;WITH numbered AS 
(
    SELECT *, rn=ROW_NUMBER() OVER (PARTITION BY A ORDER BY B, C)
        FROM @mytab AS m
)
SELECT *
    FROM numbered
    WHERE rn=1

As I mentioned in my comment to HLGEM and Philip Kelley, their simple use of an aggregate function does not necessarily return one "solid" record for each A group; instead, it may return column values from many separate rows, all stitched together as if they were a single record. For example, if this were a PERSON table, with the PersonID being the "A" column, and distinct contact records (say, Home and Word), you might wind up returning the person's home city, but their office ZIP code -- and that's clearly asking for trouble.

The use of the ROW_NUMBER, in conjunction with a CTE here, is a little difficult to get used to at first because the syntax is awkward. But it's becoming a pretty common pattern, so it's good to get to know it.

In my sample I've define a CTE that tacks on an extra column rn (standing for "row number") to the table, that itself groups by the A column. A SELECT on that result, filtering to only those having a row number of 1 (i.e., the first record found for that value of A), returns a "solid" record for each A group -- in my example above, you'd be certain to get either the Work or Home address, but not elements of both mixed together.

Chris Wuestefeld