tags:

views:

267

answers:

2

I have a table which has no primary key and I can't add one - the relevant columns from it are:

Department   | Category  | 
-------------+-----------+
0001         | A         |
0002         | D         |
0003         | A         | 
0003         | A         |
0003         | C         |
0004         | B         |

I want to retrieve a single row for each Department, which gives me the department code and the Category which appears most frequently in the table, i.e.

Department   | Category  | 
-------------+-----------+
0001         | A         |
0002         | D         |
0003         | A         | 
0004         | B         |

What is the best way to achieve this? My current attempt involves a Count(Category) in a subquery from which the Max(CountofCategory) is then taken, but including the Category field at this stage means too many rows at returned (since GROUP BY is applied at Category level as well as Department). In the case of a tie, I'd just select the min/max of the category arbitrarily. Ideally this should be database-agnostic, but is likely to run on either Oracle or MySQL.

A: 

You'll have to clean this up if you are better with subqueries than I am, but in my testing this produced the result you want:

SELECT
  main.Department as Department,
  (SELECT 
     Category
   FROM yourtable
   WHERE Department=main.Department
   GROUP BY Category
   ORDER BY COUNT(Category) DESC
   LIMIT 1) AS Category
FROM yourtable main
GROUP BY main.Department

The trick is just to get the one row in the subquery to return the max value you want with the ORDER BY and the "LIMIT 1"

Renesis
This syntax doesn't work with Oracle.
APC
+3  A: 

Works in both Oracle and SQL Server, I believe is all standard SQL, from later standards:

with T_with_RN as
    (select Department
     , Category
     , row_number() over (partition by Department order by count(*) Desc) as RN
    from T
    group by Department, Category)
select Department, Category
from T_with_RN
where RN = 1

EDIT I don't know why I used the WITH, the solution is probably easier to read using an inline view:

select Department, Category
from (select Department
    , Category
    , row_number() over (partition by Department order by count(*) Desc) as RN
    from T
    group by Department, Category) T_with_RN
where RN = 1

END EDIT

Test cases:

create table T (
    Department varchar(10) null,
    Category varchar(10) null
);

-- Original test case
insert into T values ('0001', 'A');
insert into T values ('0002', 'D');
insert into T values ('0003', 'A');
insert into T values ('0003', 'A');
insert into T values ('0003', 'C');
insert into T values ('0004', 'B');
-- Null Test cases:
insert into T values (null, 'A');
insert into T values (null, 'B');
insert into T values (null, 'B');
insert into T values ('0005', null);
insert into T values ('0005', null);
insert into T values ('0005', 'X');
-- Tie Test case
insert into T values ('0006', 'O');
insert into T values ('0006', 'P');
Shannon Severance
The original question doesn't mention how to handle ties. Your solution selects one of the tied rows, effectively at random. I think it would be better to use RANK() instead of ROW_NUMBER(), as that returns all rows which tie for first place.
APC
@APC: He does mention ties: "In the case of a tie, I'd just select the min/max of the category arbitrarily." Which I took to mean he didn't care which of the tied categories was chosen. You are correct that rank() would return ties.
Shannon Severance