views:

145

answers:

6

I have an Access table of the form (I'm simplifying it a bit)

ID            AutoNumber       Primary Key
SchemeName    Text (50)
SchemeNumber  Text (15)

This contains some data eg...

ID            SchemeName           SchemeNumber
--------------------------------------------------------------------
714           Malcolm              ABC123
80            Malcolm              ABC123
96            Malcolms Scheme      ABC123
101           Malcolms Scheme      ABC123
98            Malcolms Scheme      DEF888
654           Another Scheme       BAR876
543           Whatever Scheme      KJL111
etc...

Now. I want to remove duplicate names under the same SchemeNumber. But I want to leave the record which has the longest SchemeName for that scheme number. If there are duplicate records with the same longest length then I just want to leave only one, say, the lowest ID (but any one will do really). From the above example I would want to delete IDs 714, 80 and 101 (to leave only 96).

I thought this would be relatively easy to achieve but it's turning into a bit of a nightmare! Thanks for any suggestions. I know I could loop it programatically but I'd rather have a single DELETE query.

+2  A: 
DELETE FROM Table t1
WHERE EXISTS (SELECT 1 from Table t2
             WHERE t1.SchemeNumber = t2.SchemeNumber
             AND Length(t2.SchemeName) > Length(t1.SchemeName)
)

Depend on your RDBMS you may use function different from Length (Oracle - length, mysql - length, sql server - LEN)

Michael Pakhantsov
I think this may leave duplicate records if they have the same length name? I have updated my question example table...
El Ronnoco
I have run this as a select query and it does not return anywhere near enough rows to delete.
El Ronnoco
@El Ronnoco, does the following query return anything? Select * FROM Table t1WHERE EXISTS (SELECT 1 from Table t2 WHERE t1.SchemeNumber = t2.SchemeNumber AND Length(t2.SchemeName) > Length(t1.SchemeName))
Michael Pakhantsov
@Michael Hi - This returns the same as the previous query. It returns 21 rows - where I know that I need to remove 200+.
El Ronnoco
A: 

try this:

   Select * From Table t
   Where Len(SchemeName) <
      (Select Max(Len(Schemename))
       From Table
       Where SchemeNumber = t.SchemeNumber )
    And Id > 
      (Select Min (Id) 
       From Table
       Where SchemeNumber = t.SchemeNumber
           And SchemeName = t.SchemeName)

or this:,...

   Select * From Table t
   Where Id > 
      (Select Min(Id) From Table
       Where SchemeNumber = t.SchemeNumber
         And Len(SchemeName) <
            (Select Max(Len(Schemename))
             From Table
             Where SchemeNumber = t.SchemeNumber))

if either of these selects the records that should be deleted, just change it to a delete

   Delete 
   From Table t
   Where Len(SchemeName) <
      (Select Max(Len(Schemename))
       From Table
       Where SchemeNumber = t.SchemeNumber )
    And Id > 
      (Select Min (Id) 
       From Table
       Where SchemeNumber = t.SchemeNumber
           And SchemeName = t.SchemeName)

or using the second construction:

 Delete From Table t Where Id > 
  (Select Min(Id) From Table
   Where SchemeNumber = t.SchemeNumber
     And Len(SchemeName) <
        (Select Max(Len(Schemename))
         From Table
         Where SchemeNumber = t.SchemeNumber))
Charles Bretana
I think this will leave duplicates if they have the same length SchemeName
El Ronnoco
Yes, this is true, but if there are more than one with the same length, how to decide which one not to delete? Specify a rule and I can modify the query to delete all but that one.
Charles Bretana
Hi see my updated question. Apologies for not being more specific around the rules. But only one row for each SchemeNumber should remain. This row should have the longest of the original SchemeNames for that SchemeNumber. If there are more than two different names with the same number which are the longest then it does not matter which is retained. For arguments sake I specified the one with the lowest ID.
El Ronnoco
Edit to retain only one with lowest id
Charles Bretana
The SELECT only returns one row...
El Ronnoco
ok,... Is that because there's only one row that needs to be delted? Can you show sample data that shows this?
Charles Bretana
@Charles I can't post real sample data as it is confidential. But there are 200+ rows which need to be removed. I just tried your query again today and it returned 6 rows - don't know where it when wrong yesterday! Thanks for posting :)
El Ronnoco
+2  A: 
delete ShortScheme
from Scheme ShortScheme
join Scheme LongScheme
  on ShortScheme.SchemeNumber = LongScheme.SchemeNumber
  and (len(ShortScheme.SchemeName) < len(LongScheme.SchemeName) or (len(ShortScheme.SchemeName) = len(LongScheme.SchemeName) and ShortScheme.ID > LongScheme.ID))

(SQL Server flavored)

Now updated to include the specified tie resolution. Although, you may get better performance doing it in two queries: first deleting the schemes with shorter names as in my original query and then going back and deleting the higher ID where there was a tie in name length.

Simon
Comparing lengths is not really an accurate way to do this. What if you have two different strings in SchemeName that are identical length.
James
The requirement does not specify which record to retain if they are identical lengths - just *any one of the longest*. I'm not sure if this query will work in Access though.
El Ronnoco
I tried this in Access and the DELETE ... FROM syntax isn't valid :(
El Ronnoco
A: 

If your platform supports ranking functions and common table expressions:

with cte as (
  select row_number() 
     over (partition by SchemeNumber order by len(SchemeName) desc) as rn
  from Table)
delete from cte where rn > 1;
Remus Rusanu
My platform is MSAccess I'm afraid :)
El Ronnoco
Row_number() and OVER are not supported, but many people are unaware that Jet/ACE/Access do support Partition.
David-W-Fenton
@David-W-Fenton: until your comment I was one of the many :)
Remus Rusanu
I've been programming in Access since 1996, and only discovered partition in the last year!
David-W-Fenton
+1  A: 

I'd do this in multiple steps. Large delete operations done in a single step make me too nervous -- what if you make a mistake? There's no sql 'undo' statement.

-- Setup the data
DROP Table foo;
DROP Table bar;
DROP Table bat;
DROP Table baz;
CREATE TABLE foo (
  id int(11) NOT NULL,
  SchemeName varchar(50),
  SchemeNumber varchar(15),
  PRIMARY KEY (id)
);

insert into foo values (714, 'Malcolm', 'ABC123' );
insert into foo values (80, 'Malcolm', 'ABC123' );
insert into foo values (96, 'Malcolms Scheme', 'ABC123' );
insert into foo values (101, 'Malcolms Scheme', 'ABC123' );
insert into foo values (98, 'Malcolms Scheme', 'DEF888' );
insert into foo values (654, 'Another Scheme ', 'BAR876' );
insert into foo values (543, 'Whatever Scheme ', 'KJL111' );

-- Find all the records that have dups, find the longest one
create table bar as
    select max(length(SchemeName)) as max_length, SchemeNumber
    from foo
    group by SchemeNumber
    having count(*) > 1;

-- Find the one we want to keep
create table bat as
    select min(a.id) as id, a.SchemeNumber
    from foo a join bar b on a.SchemeNumber = b.SchemeNumber 
       and length(a.SchemeName) = b.max_length
    group by SchemeNumber;

-- Select into this table all the rows to delete
create table baz as 
    select a.id from foo a join bat b where a.SchemeNumber = b.SchemeNumber 
      and a.id != b.id;

This will give you a new table with only records for rows that you want to remove.

Now check these out and make sure that they contain only the rows you want deleted. This way you can make sure that when you do the delete, you know exactly what to expect. It should also be pretty fast.

Then when you're ready, use this command to delete the rows using this command.

delete from foo where id in (select id from baz);

This seems like more work because of the different tables, but it's safer probably just as fast as the other ways. Plus you can stop at any step and make sure the data is what you want before you do any actual deletes.

Kevin Bedell
This will only delete one of the duplicates and does not account for my maximum name length requirement.
El Ronnoco
You're correct. I've updated the post with the answer that works.
Kevin Bedell
+1 Group by with having clause is the right trick.
James
Does `CREATE TABLE somename AS SELECT ...` work in any version of MS Access? Access 2003 responds "Syntax error in CREATE TABLE statement" with even a simple SELECT.
HansUp
+1  A: 

See if this query returns the rows you want to keep:

SELECT r.SchemeNumber, r.SchemeName, Min(r.ID) AS MinOfID
FROM
    (SELECT
        SchemeNumber,
        SchemeName,
        Len(SchemeName) AS name_length,
        ID
    FROM tblSchemes
    ) AS r
    INNER JOIN
    (SELECT
        SchemeNumber,
        Max(Len(SchemeName)) AS name_length
    FROM tblSchemes
    GROUP BY SchemeNumber
    ) AS w
    ON
        (r.SchemeNumber = w.SchemeNumber)
        AND (r.name_length = w.name_length)
GROUP BY r.SchemeNumber, r.SchemeName
ORDER BY r.SchemeName;

If so, save it as qrySchemes2Keep. Then create a DELETE query to discard rows from tblSchemes whose ID value is not found in qrySchemes2Keep.

DELETE 
FROM tblSchemes AS s
WHERE Not Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID);

Just beware, if you later use Access' query designer to make changes to that DELETE query, it may "helpfully" convert the SQL to something like this:

DELETE s.*, Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID)
FROM tblSchemes AS s
WHERE (((Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID))=False));
HansUp
This *almost* worked perfectly! It left me one duplicate where there were two records under the same schemeNumber with different names of the same length - it should have removed the one with the lower ID.
El Ronnoco
In that case, remove r.SchemeName from the first line of qrySchemes2Keep, and from the final GROUP BY line. Also change or remove the ORDER BY.
HansUp
Oops. Remove the lower ID? I thought you wanted to keep the lowest ID.
HansUp
Sorry, I meant to say keep the lowest ID (although it doesn't really matter which is left). It has worked for all the others but it has removed neither ID in the case I mentioned. I will make the changes you say though. Thanks.
El Ronnoco