Many years ago, I was asked during a phone interview to delete duplicate rows in a database. After giving several solutions that do work, I was eventually told the restrictions are:

  • Assume table has one VARCHAR column
  • Cannot use rowid
  • Cannot use temporary tables

The interviewer refused to give me the answer. I've been stumped ever since.

After asking several colleagues over the years, I'm convinced there is no solution. Am I wrong?!

A: 

I would put a unique number of fixed size in the VARCHAR column for the duplicated rows, then parse out the number and delete all but the minimum row. Maybe that's what his VARCHAR constraint is for. But that stinks because it assumes that your unique number will fit. Lame question. You didn't want to work there anyway. ;-)

Dave Markle
$chars = array('L', 'O'); while(1=1){ echo $chars[0]; echo $chars[1];} echo $chars[0];
Chacha102
+1  A: 

This is a completely jacked-up way to do it, but given the asinine requirements, here is a workable solution assuming SQL 2005 or later:

  DELETE from MyTable
  WHERE ROW_NUMBER() over(PARTITION BY [MyField] order by MyField)>1
JohnFx
Interesting - sounds like row_number() is just very similar to rowid
vh
@vh row_number() is more similar to Oracle's ROWNUM than Oracle's ROWID, but far more flexible. Oracle also has ROW_NUMBER(). It's part of the analytic functions. Oh, and this won't work in SQL Server 2005/2008, because ROW_NUMBER() is not allowed in the where clause.
Shannon Severance
Ack! You are right. I swear this worked yesterday when I tested it, but alas it isn't working this morning. Sorry for the false hope. Also, I just noticed the no-rowid constraint, so this probably violates the spirit of the terms. I say you should just counter-question the interviewer and ask how they would write a query to do this without a keyboard or mouse. It is just about as nonsensical.
JohnFx
For SQL Server 2005+ you can use a CTE to make ROW_NUMBER() results available to the DELETE statement. ;WITH x AS (SELECT ROW_NUMBER() OVER (PARTITION BY MyField ORDER BY MyField) as RN, * FROM MyTable) DELETE x WHERE RN > 1
Shannon Severance
+2  A: 

And if you did have an answer, would a new restriction suddenly present itself? Since you mention ROWID, I assume you were using Oracle. The solutions are for SQL Server.

Inspired by SQLServerCentral.com http://www.sqlservercentral.com/scripts/T-SQL/62866/

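while (1=1) begin
  -- remove a single row whose value still occurs more than once
  delete top (1)
  from MyTable
  where VarcharColumn in
    (select VarcharColumn
    from MyTable
    group by VarcharColumn
    having count(*) > 1)

  -- stop once a pass deletes nothing
  if @@rowcount = 0
    break
end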

Deletes one row at a time. When the second to last row of a set of duplicates disappears then the remaining row won't be in the subselect on the next pass through the loop. (BIG Yuck!)
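
To see why the loop ends with exactly one copy of each value, here is a minimal Python sketch that simulates it on an in-memory list. The list stands in for MyTable, and the function name `delete_duplicates_one_at_a_time` is made up for illustration. It is not T-SQL and says nothing about how a real server would perform.

```python
from collections import Counter

def delete_duplicates_one_at_a_time(rows):
    """Simulate the loop: remove one duplicated row per pass."""
    rows = list(rows)  # work on a copy; stands in for MyTable
    passes = 0
    while True:
        counts = Counter(rows)
        # Same role as the subselect: values that occur more than once
        duplicated = [value for value, n in counts.items() if n > 1]
        if not duplicated:
            break  # nothing left to delete, as when @@rowcount = 0
        rows.remove(duplicated[0])  # same role as DELETE TOP (1)
        passes += 1
    return rows, passes

remaining, passes = delete_duplicates_one_at_a_time(["a", "b", "a", "a", "c", "b"])
print(sorted(remaining), passes)
```

Each pass removes one surplus row, so the number of deleting passes equals the number of surplus rows (three in this sample).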

Also, see http://www.sqlservercentral.com/articles/T-SQL/63578/ for inspiration. There RBarry Young suggests a way that might be modified to store the deduplicated data in the same table, delete all the original rows, then convert the stored deduplicated data back into the right format. He had three columns, so not exactly analogous to what you are doing.

And then it might be do-able with a cursor. Not sure and don't have time to look it up. But create a cursor to select everything out of the table, in order, and then a variable to track what the last row looked like. If the current row is the same, delete, else set the variable to the current row.
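
The cursor idea boils down to an ordered scan that compares each row with the previous one. A rough Python sketch of just that logic, assuming the rows are plain strings in a list (the function name `dedupe_sorted_scan` is hypothetical, and a real cursor would delete in place rather than build a new list):

```python
def dedupe_sorted_scan(rows):
    """Walk the rows in order; keep a row only when it differs from the previous one."""
    kept = []
    last = object()  # sentinel that never equals a real row value
    for current in sorted(rows):  # "select everything out of the table, in order"
        if current != last:
            kept.append(current)  # new value: keep it
            last = current        # remember what the last row looked like
        # otherwise the row repeats the previous one; the cursor would delete it here
    return kept

print(dedupe_sorted_scan(["b", "a", "b", "a", "a"]))
```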

Shannon Severance
You're right ... big yuck to that while loop. You're also right that a new restriction did present itself when I suggested using a PL/SQL procedure. He wanted it done only with a DELETE statement.
vh
Hey, that's me! And FYI, the technique that I used in my article *will* work on a single VARCHAR column too, so long as it is not already maxed out anywhere. Oh, and that's without any loops or cursors (which make it kind of easy) and *also* SQL 2000 compatible, so no Row_Number() function either. And yes, it is *darned* tough, but it can be done.
RBarryYoung
A: 

Assume you are implementing the DELETE statement for a SQL engine. How will you delete just one of two rows that are exactly identical? You need something to distinguish one from the other! You actually cannot delete entirely duplicate rows (ALL columns being equal) under the following constraints (as provided to you):

  1. No use of ROWID or ROWNUM
  2. No Temporary Table
  3. No procedural code

It can, however, be done if even one of the conditions is relaxed. Here are solutions, each relaxing one of the three conditions.

Assume the table is defined as below:

Create Table t1 (
c1 varchar2(100),
c2 number(5),
c3 number(2)
);

Duplicate rows identification:

Select c1, c2, c3
from t1
group by c1, c2, c3
having count(*) > 1

Duplicate rows can also be identified using this:

select c1, c2, c3, row_number() over (partition by c1, c2, c3 order by c1, c2, c3) rn
from t1
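
For anyone who wants to see what row_number() returns here, this is a small sketch using Python's built-in sqlite3 module as a stand-in engine. It assumes the bundled SQLite is 3.25 or newer (window functions), and the sample rows are invented. It only identifies duplicates; it does not delete them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t1 (c1 text, c2 integer, c3 integer)")
conn.executemany(
    "insert into t1 values (?, ?, ?)",
    [("x", 1, 1), ("x", 1, 1), ("y", 2, 2), ("x", 1, 1)],  # made-up sample data
)

# Number the rows inside each group of identical (c1, c2, c3) values
numbered = conn.execute(
    "select c1, c2, c3, "
    "row_number() over (partition by c1, c2, c3 order by c1, c2, c3) as rn "
    "from t1 order by c1, c2, c3, rn"
).fetchall()

# Any row with rn > 1 is a surplus copy
surplus = [row for row in numbered if row[3] > 1]
print(numbered)
print(len(surplus))
```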

NOTE: The row_number() analytic function cannot be used in a DELETE statement the way JohnFx suggested, at least in Oracle 10g.

  • Solution using ROWID

Delete from t1 where rowid >
( select min(t1_inner.rowid) from t1 t1_inner
where t1_inner.c1=t1.c1 and t1_inner.c2=t1.c2 and t1_inner.c3=t1.c3)

  • Solution using temp table

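-- keep one copy of every duplicated row (the identification query listed above)
create table t1_dups as (
select c1, c2, c3 from t1
group by c1, c2, c3
having count(*) > 1
);

-- remove every copy of the duplicated rows, then put one copy back
delete from t1
where (t1.c1, t1.c2, t1.c3) in (select c1, c2, c3 from t1_dups);
insert into t1 (c1, c2, c3)
select c1, c2, c3 from t1_dups;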
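
A runnable sketch of the same three steps, using Python's sqlite3 module as a stand-in for Oracle. The sample rows are invented, the row-value IN comparison assumes SQLite 3.15 or newer, and the final drop of t1_dups is my addition for cleanup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t1 (c1 text, c2 integer, c3 integer)")
conn.executemany(
    "insert into t1 values (?, ?, ?)",
    [("x", 1, 1), ("x", 1, 1), ("x", 1, 1), ("y", 2, 2), ("z", 3, 3), ("z", 3, 3)],
)

# Step 1: save one copy of every duplicated row
conn.execute(
    "create table t1_dups as "
    "select c1, c2, c3 from t1 group by c1, c2, c3 having count(*) > 1"
)
# Step 2: delete all copies of those rows from the base table
conn.execute(
    "delete from t1 where (c1, c2, c3) in (select c1, c2, c3 from t1_dups)"
)
# Step 3: put the single saved copy back, then clean up
conn.execute("insert into t1 (c1, c2, c3) select c1, c2, c3 from t1_dups")
conn.execute("drop table t1_dups")

rows = conn.execute("select c1, c2, c3 from t1 order by c1, c2, c3").fetchall()
print(rows)
```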

  • Solution using procedural code

This will use an approach similar to the case where we use a temp table.

bkm
A: 

create table temp as select c1,c2 from table group by c1,c2 having(count(*)>1 or count(*)=1); Now drop the base table. Rename the temp table to the base table's name.
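
A quick sketch of this rebuild-and-rename idea in Python's sqlite3 module, which stands in for whatever engine was intended. The table names `base` and `temp_copy` and the sample rows are made up. A group always has a count of at least 1, so the HAVING condition is true for every group and is left out here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table base (c1 text, c2 integer)")
conn.executemany(
    "insert into base values (?, ?)",
    [("a", 1), ("a", 1), ("b", 2), ("a", 1), ("b", 2)],  # invented sample data
)

# Build a copy holding one row per distinct (c1, c2) pair
conn.execute("create table temp_copy as select c1, c2 from base group by c1, c2")
# Swap the copy in for the original table
conn.execute("drop table base")
conn.execute("alter table temp_copy rename to base")

deduped = conn.execute("select c1, c2 from base order by c1, c2").fetchall()
print(deduped)
```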

santosh