views:

249

answers:

12

I have a table with columns A, B and C. Column A might have duplicates.

I need a query that will get me a resultset with unique values in column A, and I don't care which possible duplicate it takes.

I don't know anything beforehand about the rest of the data.

An example might be:

A    B    C
1    8    8
1    7    7
2    10   10

In this case I'd want to select:

A    B    C
1    x    x
2    10   10

x = It doesn't matter which value it would pick.

Kind regards,

Matthias Vance

Edit

I thought I found my solution with:

SELECT * FROM (
   SELECT * FROM test GROUP BY a
) table_test;

But that wasn't working after all.

This will result in:

[Microsoft][ODBC Excel Driver] Cannot group on fields selected with '*'
A: 

All rows that have an unique value in A

SELECT * FROM table t1 INNER JOIN
(SELECT A FROM table GROUP BY A HAVING COUNT(A) = 1) as t2 
ON t1.A = t2.A

I don't understand what you mean with "One of the rows that has one of the duplicate values in A". Could you explain it a little better?

Using your example, in MySQL just doing

SELECT * FROM table GROUP BY A

will gave you the desired result:

A    B    C
1    8    8
2    10   10
Sergi
I added an example to illustrate my problem.Matthias
Matthias Vance
Sadly, the Microsoft ODBC Excel driver doesn't allow using GROUP BY when you have "SELECT *", it works for single columns.
Matthias Vance
Just put everyone of the fields instead of *SELECT A, B, C FROM table GROUP BY A
Sergi
A: 
-- All rows that are unique in column A
select *
from table
where col_a in (select col_a from table group by col_a having count(*)=1)
-- One row per dupe
select * 
from table
where col_a in (select max(col_a) from table group by col_a having count(*)>1)
Sparky
Your second query is not going to work because of this:SELECT MAX(col_a) FROM table GROUP BY col_a HAVING COUNT(*) > 1;// This will return (as an example): 1SELECT * FROM table WHERE a IN(1);// This will return all rows where a = 1, which are two rows.Matthias
Matthias Vance
Thanks matt, you are right, I meant a different field for the Max(col_a) expression, but was a typo. But whatever field he uses as the Max() expression will need to be unique for each col_a. I kind of wish we had real column names to make it easier to provide examples...
Sparky
I can't provide any other real column names besides A, because that's the only column that's guaranteed to be in the file. (The files are user-contributed). I definitely understand it makes everything a bit harder to write/understand.
Matthias Vance
+1  A: 

The hard part is getting b and c from the same row. The following query uses a subquery to eliminate rows which do not have the lowest value for either b or c. It joins the table on itself, and says there can not be rows with a lower value of b or c. The "not" is implemented by the prev.a is null in the WHERE clause.

The subquery is called semiunique because there can still be duplicate rows with identical b and c. The outer query takes care of those with a GROUP BY. Since b and c are identical, it doesn't matter which row we choose, so we can pick one using min().

select a, min(b), min(c)
from (
    select cur.a, cur.b, cur.c
    from YourTable cur
    left outer join YourTable prev
        on cur.a = prev.a
        and (cur.b > prev.b
            or (cur.b = prev.b and cur.c > prev.c))
   where prev.a is null             
) semiunique
group by semiunique.a

Per your comment, a much simpler version to grab "something" for b and c:

select a, min(b), min(c)
from YourTable
group by a
Andomar
I tried, and failed, to rewrite your query to ignore the values of column B and C. I just want to get everything regardless of the values in other columns. Would you be so kind to (help me) rewrite this?
Matthias Vance
Edited answer, but I'm not sure if I understand you right. Maybe you could edit the question with a more detailed example?
Andomar
If you use min(b) and min(c), they might return values from different rows. I'm not sure if this is okay.
Jonas Lincoln
+1  A: 

This works in SQL Server 2008, which illustrates the concept. You need a unique column.

declare @temp as table (
id int identity(1,1),
a int,
b int, 
c int)

insert into @temp
    select 1 as A, 8 as B, 8 as C
    union
    select 1, 7, 7
    union 
    select 2, 10, 10

select a, b, c from @temp
where id in (select MAX(id) from @temp
group by a)

Seeing that you're using Excel, I'd use the same principle. Add another column to the spreadsheet and make sure it is unique. Use that column as your ID-column.

Jonas Lincoln
Unfortunately, the ODBC Excel Driver doesn't support the ALTER TABLE command.
Matthias Vance
This is the best solution given so far, and the only one certain to work. You're dealing with Excel, write a VBA macro to add the unique column, it doesn't have to be done in SQL.
Donnie
A: 

Another option would be to use the ROW_NUMBER()-function. Not sure if it's valid in the ODBC Excel driver though:

select a, b, c from (
select * 
, ROW_NUMBER() OVER (PARTITION BY A ORDER BY A) as RN
from @temp
) q where rn = 1
Jonas Lincoln
I saw this solution before and tried it, but the driver doesn't support ROW_NUMBER(). "Undefined function 'ROW_NUMBER' in expression"
Matthias Vance
A: 
select * 
from table T 
where id = (
  select min(id) from table where a = T.a
)

UPD. But if there is no primary key in your table (why?), then:

select A, min(B), min(C)
from TABLE
group by A
serge_bg
That won't work, check my comment at Sparky's answer about using MIN() and MAX().
Matthias Vance
A: 

I know this is a dirty way, but will work this case.

Pseudo code:

create table #tmpStaging with primary key as col( A )

for each row in the flatFile/excel/whatever begin begin try insert into #tmpstaging end try

begin catch --do nothing end catch end

select * from #tmpstaging will give you rows without dups

Nikhil S
I get a "[Microsoft][ODBC Excel Driver] Syntax error in CREATE TABLE statement." when I try to execute "CREATE TABLE #tempdata(id INTEGER)".
Matthias Vance
+1  A: 

Try this:

select A, B, C
from test x
where not exists (select *
                  from test y
                  where y.A = x.A
                        and (y.B < x.B or (y.B = x.B and y.C < x.C))
order by A

But since it contains correlated subquery it might be slow. (OTOH it is at least theoretically possible for database engine to optimize it into something I show below.)


What about something outside SQL? What are you going to do with result?

If you are going to process it with some program, why not just get:

select A, B, C from test order by A, B, C

and then do something like:

prev_a = None
for a, b, c in get_query_result():
    if a != prev_a:
        prev_a = a
        yield (a, b, c)

in your application?

I don't know PHP but I guess it would be something like this:

$query = "SELECT a,b,c FROM test ORDER BY a,b,c";
$result = odbc_exec($connect, $query);
$prev_a = NULL;  # I don't know what you would normally use here in PHP
while (odbc_fetch_row($result)) {
  $a = odbc_result($result, 1);
  if (is_null($prev_a) or $a != $prev_a) { 
    $b = odbc_result($result, 2);
    $c = odbc_result($result, 3);
    print("A = $a, B = $b, C = $c\n");
    $prev_a = $a;
  }
}
Tomek Szpakowicz
I think I will indeed do this client side (as in, client to the SQL server), that's the fastest solution. There might be as much as 20 columns, and using all those subqueries will result in a very slow system. I should just use ORDER BY, and check against the previous result. Thank you!
Matthias Vance
Not all 20 columns are used to decide which row to pick, right? So before choosing version, check which is better. If there are a lot of duplicates and rows are large it might be faster to do it in SQL. Check execution plan for query and measure performance.
Tomek Szpakowicz
+2  A: 

Wouldn't this simple query work:

SELECT A, MIN(B), MIN(C) FROM test GROUP BY A

It groups by A and just selects the minimum values of B and C in the rows of A. The values of B and C might come from different rows, e.g.

A  B  C
1  2  3
1  4  1

would return

A  B  C
1  2  1
HakonB
I can't do this, because the excel file will contain customer data. Column A might be duplicate, but the rest of the data needs to be in order. It would be a good solution if I didn't need it.
Matthias Vance
+1  A: 
Select A
    , Max(b) //Since You don't care about the Value
    , Max(c) //Since You don't care about the Value
From table t
Group By A
Vivek
The only potential problem is that the value for B and C could come from different rows, which the OP hasn't clarified being acceptable or not.
Jeremy Seghi
That is, indeed, a problem for me. (See also my comment on HakonB's post)
Matthias Vance
A: 

This will give you the first of every duplicate

SELECT  DISTINCT
    A,
    (SELECT TOP 1 B FROM @Table tB WHERE tb.A = t.A) B,
    (SELECT TOP 1 C FROM @Table tB WHERE tb.A = t.A) C
FROM    @Table t
astander
oopsy... i didn't saw your post either :)
Sandy
A: 

Try this,

SELECT UT.[A],
(SELECT TOP 1 B FROM [YourTable] WHERE [YourTable].A= UT.A) AS B,
(SELECT TOP 1 C FROM [YourTable] WHERE [YourTable].A= UT.A) AS C  FROM [YourTable] AS UT GROUP BY UT.[A]

I didn't tried it yet... who knows :)

Sandy