views:

1347

answers:

6

There are multiple delete rows questions here, but this one is a little unique (not difficult, just unique). I have already ready the other thread found here:

http://stackoverflow.com/questions/18932/sql-how-can-i-remove-duplicate-rows

Question: I have a table with rowID,longitude,latitude,businessName,url, caption. This might look like:

rowID | long | lat | businessName | url       |               caption 1          20   -20    Pizza Hut      yum.com                null 2          20   -20    Pizza Hut       null                    rabble rabble rabble


How do I delete all of the duplicates, but only keep the one that has a URL (first priority), or keep the one that has a caption if the other doesn't have a URL(second priority) and delete the rest?

+3  A: 

Here's my looping technique. This will probably get voted down for not being mainstream - and I'm cool with that.

DECLARE @LoopVar int

DECLARE
  @long int,
  @lat int,
  @businessname varchar(30),
  @winner int

SET @LoopVar = (SELECT MIN(rowID) FROM Locations)

WHILE @LoopVar is not null
BEGIN
  --initialize the variables.
  SELECT 
    @long = null,
    @lat = null,
    @businessname = null,
    @winner = null

  -- load data from the known good row.  
  SELECT
    @long = long,
    @lat = lat,
    @businessname = businessname
  FROM Locations
  WHERE rowID = @LoopVar

  --find the winning row with that data
  SELECT top 1 @Winner = rowID
  FROM Locations
  WHERE @long = long
    AND @lat = lat
    AND @businessname = businessname
  ORDER BY
    CASE WHEN URL is not null THEN 1 ELSE 2 END,
    CASE WHEN Caption is not null THEN 1 ELSE 2 END,
    RowId

  --delete any losers.
  DELETE FROM Locations
  WHERE @long = long
    AND @lat = lat
    AND @businessname = businessname
    AND @winner != rowID

  -- prep the next loop value.
  SET @LoopVar = (SELECT MIN(rowID) FROM Locations WHERE @LoopVar < rowID)
END
David B
I use a very similar approach. This type of looping is faster than a CURSOR as well. It also has the benefit that it will not peg the server's CPU. I put similar code in the other post that you linked in your question.
hectorsosajr
What if the rowID is a char(11) variable? It is the primary key, but can you select min(foo) on something that is a string?
RyanKeeter
In order for some type to be a true primary key, it must establish an ordering on the table. There is no problem ordering by char(11).
David B
A: 

If possible, can you homogenize, then remove duplicates?

Step 1:

UPDATE BusinessLocations
SET BusinessLocations.url = LocationsWithUrl.url
FROM BusinessLocations
INNER JOIN (
  SELECT long, lat, businessName, url, caption
  FROM BusinessLocations 
  WHERE url IS NOT NULL) LocationsWithUrl 
    ON BusinessLocations.long = LocationsWithUrl.long
    AND BusinessLocations.lat = LocationsWithUrl.lat
    AND BusinessLocations.businessName = LocationsWithUrl.businessName

UPDATE BusinessLocations
SET BusinessLocations.caption = LocationsWithCaption.caption
FROM BusinessLocations
INNER JOIN (
  SELECT long, lat, businessName, url, caption
  FROM BusinessLocations 
  WHERE caption IS NOT NULL) LocationsWithCaption 
    ON BusinessLocations.long = LocationsWithCaption.long
    AND BusinessLocations.lat = LocationsWithCaption.lat
    AND BusinessLocations.businessName = LocationsWithCaption.businessName

Step 2: Remove duplicates.

Forgotten Semicolon
+1  A: 

Set-based solution:

delete from T as t1
where /* delete if there is a "better" row
         with same long, lat and businessName */
  exists(
    select * from T as t2 where
      t1.rowID <> t2.rowID
      and t1.long = t2.long
      and t1.lat = t2.lat
      and t1.businessName = t2.businessName 
      and
        case when t1.url is null then 0 else 4 end
          /* 4 points for non-null url */
        + case when t1.businessName is null then 0 else 2 end
          /* 2 points for non-null businessName */
        + case when t1.rowID > t2.rowId then 0 else 1 end
          /* 1 point for having smaller rowId */
        <
        case when t2.url is null then 0 else 4 end
        + case when t2.businessName is null then 0 else 2 end
        )
Constantin
+1  A: 
delete MyTable
from MyTable
left outer join (
     select min(rowID) as rowID, long, lat, businessName
     from MyTable
     where url is not null
     group by long, lat, businessName
    ) as HasUrl
    on MyTable.long = HasUrl.long
    and MyTable.lat = HasUrl.lat
    and MyTable.businessName = HasUrl.businessName
left outer join (
     select min(rowID) as rowID, long, lat, businessName
     from MyTable
     where caption is not null
     group by long, lat, businessName
    ) HasCaption
    on MyTable.long = HasCaption.long
    and MyTable.lat = HasCaption.lat
    and MyTable.businessName = HasCaption.businessName
left outer join (
     select min(rowID) as rowID, long, lat, businessName
     from MyTable
     where url is null
      and caption is null
     group by long, lat, businessName
    ) HasNone 
    on MyTable.long = HasNone.long
    and MyTable.lat = HasNone.lat
    and MyTable.businessName = HasNone.businessName
where MyTable.rowID <> 
     coalesce(HasUrl.rowID, HasCaption.rowID, HasNone.rowID)
Todd Waldorf
+4  A: 

This solution is brought to you by "stuff I've learned on Stack Overflow" in the last week:

DELETE restaurant
WHERE rowID in 
(SELECT rowID
    FROM restaurant
    EXCEPT
    SELECT rowID 
    FROM (
     SELECT rowID, Rank() over (Partition BY BusinessName, lat, long ORDER BY url DESC, caption DESC ) AS Rank
     FROM restaurant
     ) rs WHERE Rank = 1)

Warning: I have not tested this on a real database

Darrel Miller
man, this kicks ass! thanks!:)
balint
+1  A: 

Similar to another answer, but you want to delete based on row number rather than rank. Mix with common table expressions as well:


;WITH GroupedRows AS
(   SELECT rowID, Row_Number() OVER (Partition BY BusinessName, lat, long ORDER BY url DESC, caption DESC) rowNum 
    FROM restaurant
)
DELETE r
FROM restaurant r
JOIN GroupedRows gr ON r.rowID = gr.rowID
WHERE gr.rowNum > 1
mancaus