tags:
views: 2340
answers: 7

SQL to find duplicate entries (within a group)

I have a small problem and I'm not sure what would be the best way to fix it, as I only have limited access to the (Oracle) database itself. In our table "EVENT" we have about 160k entries; each EVENT has a GROUPID, and a normal entry has exactly 5 rows with the same GROUPID. Due to a bug we currently get some duplicate entries (duplicates, so 10 rows instead of 5, each with a different EVENTID; the count may change, so the condition is just <> 5). We need to find all the entries of these groups.

Due to the limited access to the database, we cannot use a temporary table, nor can we add an index to the GROUPID column to speed things up.

We can get the GROUPIDs with this query, but we would then need a second query to get the actual data:

select A."GROUPID"
from "EVENT" A
group by A."GROUPID"
having count(A."GROUPID") <> 5

One solution would be a subselect:

select *
from "EVENT" A
where A."GROUPID" IN (
  select B."GROUPID"
  from "EVENT" B
  group by B."GROUPID"
  having count(B."GROUPID") <> 5
)

Without an index on GROUPID and with 160k entries, this takes much too long. I've tried to think of a join that could handle this, but haven't found a good solution so far.

Can anybody suggest a good solution for this?

Small edit: We don't have 100% duplicates here, as each entry still has a unique ID, and the GROUPID is not unique either (that's why we need to use "group by") - or maybe I'm just missing an easy solution for it :)

A small example of the data (I don't want to delete it, just find it):

EVENTID | GROUPID | TYPEID
123456  | 123     | 12
123457  | 123     | 145
123458  | 123     | 2612
123459  | 123     | 41
123460  | 123     | 238

234567  | 123     | 12
234568  | 123     | 145
234569  | 123     | 2612
234570  | 123     | 41
234571  | 123     | 238

The table has some more columns, like a timestamp etc., but as you can see, everything is identical except the EVENTID.

We will run the query more often during testing, to find the bug and to check whether it happens again.

+1  A: 

If your DBAs won't add an index to make this faster, ask them what they suggest you do (that's what they're paid for, after all). Presumably you have a business case for why you need this information, in which case your immediate management should be on your side.

Perhaps you could ask your DBAs to duplicate the data into a database where you could add an index.
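If they agree, a rough sketch of what that could look like (the table and index names here are made up, and this assumes you have CREATE TABLE and CREATE INDEX privileges in some scratch schema):

create table event_copy as
select * from "EVENT";

-- index the copy so the duplicate-group lookup can avoid full scans
create index event_copy_gid_ix on event_copy (groupid);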

paxdiablo
The DBA is not around for a week, otherwise that would be the easiest way, agreed.
FrankS
You don't need an index to make this faster -- you need a different query
David Aldridge
+1  A: 

From a SQL perspective I think you've already answered your own question. The approach you've described (i.e. using the sub-select) is fine, and I'd be surprised if any other way of writing the query differed vastly in performance.

160K records doesn't seem like a lot to me. I could understand if you were unhappy with the performance of that query if it was going into a piece of application code, but from the sounds of it you're just using it as part of a data cleansing exercise (and so I'd expect you to be a little more tolerant in performance terms).

Even without any supporting index, it's still just two full table scans on 160K rows, which, frankly, I'd expect to perform in some sort of vaguely reasonable time.

Talk to your db administrators. They've helped create the problem, so let them be part of the solution.

/EDIT/ In the meantime, run the query you have. Find out how long it takes, rather than guessing. Even better would be to run it with SET AUTOTRACE ON and post the results here; then we might be able to help you refine it somewhat.
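In SQL*Plus that would look something like this (just your original query, with timing and tracing switched on first):

set timing on
set autotrace on

select *
from "EVENT" A
where A."GROUPID" IN (
  select B."GROUPID"
  from "EVENT" B
  group by B."GROUPID"
  having count(B."GROUPID") <> 5
);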

Michael OShea
The problem is the missing index, and the DBA is not around at the moment, so that's why this is not a solution. The subquery alone takes about 100ms due to the missing index; multiplied by 160k entries, this means the full query takes about 4-5h. I fear that there really is no better solution.
FrankS
You are second-guessing the optimiser, wrongly. It won't take 160K*0.1 seconds, it will take more like 0.1+0.1 seconds!
Tony Andrews
Well, we stopped the script after it had been running for 2 minutes.
FrankS
Let it run and see what happens would be my suggestion. Hell, if it takes 20 minutes to run, and it's only to get you over the hump until your DBA is back, then you might find that you'll spend more time trying to optimise the query than is really worth it.
Michael OShea
True, but as I said in the other comment, I am trying to help a colleague here, and I was curious whether there is a better and faster way, as this was just the solution I came up with within a few minutes and I haven't worked with SQL for some time.
FrankS
"its still just two full table table scans on 160K rows" ... ah, but try it on a few million. One full scan too many here.
David Aldridge
I was commenting on *his* provided query, suggesting that *even* with two complete scans it should perform adequately (based on the provided information). This is borne out by the accepted answer. Nowhere does the OP talk about a few million rows.
Michael OShea
Sure, but an efficient solution will work on 160k rows, 160M rows, or 160 rows. Systems are full of methods that work on hundreds, thousands, or hundreds of thousands of rows, but which become inefficient when the data grows to millions of rows.
David Aldridge
+3  A: 

How long does that SQL actually take? You are only going to run it once, I presume, having fixed the bug that caused the corruption in the first place? I just set up a test case like this:

SQL> create table my_objects as 
  2  select object_name, ceil(rownum/5) groupid, rpad('x',500,'x') filler
  3  from all_objects;

Table created.

SQL> select count(*) from my_objects;

  COUNT(*)
----------
     83782

SQL> select * from my_objects where groupid in (
  2  select groupid from my_objects
  3  group by groupid
  4  having count(*) <> 5
  5  );

OBJECT_NAME                       GROUPID FILLER
------------------------------ ---------- --------------------------------
XYZ                                 16757 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
YYYY                                16757 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Elapsed: 00:00:01.67

Less than 2 seconds. OK, my table has half as many rows as yours, but 160K isn't huge. I added the filler column to make the table take up some disk space. The AUTOTRACE execution plan was:

-------------------------------------------------------------------------
| Id  | Operation             | Name       | Rows  | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |            |   389 |   112K| 14029   (2)|
|*  1 |  HASH JOIN            |            |   389 |   112K| 14029   (2)|
|   2 |   VIEW                | VW_NSO_1   | 94424 |  1198K|  6570   (2)|
|*  3 |    FILTER             |            |       |       |            |
|   4 |     HASH GROUP BY     |            |     1 |  1198K|  6570   (2)|
|   5 |      TABLE ACCESS FULL| MY_OBJECTS | 94424 |  1198K|  6504   (1)|
|   6 |   TABLE ACCESS FULL   | MY_OBJECTS | 94424 |    25M|  6506   (1)|
-------------------------------------------------------------------------
Tony Andrews
Aye, that's more like what I'd expect. 90K or 160K rows in Oracle terms is pretty insignificant (or at least should be).
Michael OShea
That's what I would expect myself. I must admit I'm trying to help a colleague here, and he stopped the script after 2 min. No real idea why it takes that long; my first guess was the missing index, and I am not an Oracle expert, so maybe I am just missing some information/settings.
FrankS
What you may have, I suppose, are stale optimiser statistics that misled Oracle into choosing a poor plan?
Tony Andrews
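If stale statistics turn out to be the culprit, re-gathering them is a one-liner (assuming the necessary privileges; the schema name here is illustrative):

begin
  -- refresh optimiser statistics so Oracle can cost the plan correctly
  dbms_stats.gather_table_stats(ownname => 'APP_SCHEMA', tabname => 'EVENT');
end;
/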
Try the execution plan for the analytic method as well.
David Aldridge
Tell you what -- making the groupid column not null and indexing it might make this more efficient for cases where there are very few records to retrieve. You'd probably get an index scan driving a nested loop join back to the table.
David Aldridge
+4  A: 

You can get the answer with a join instead of a subquery:

select
    a.*
from
    event as a
inner join
    (select groupid
     from event
     group by groupid
     having count(*) <> 5) as b
  on a.groupid = b.groupid

This is a fairly common way of obtaining all the information from the rows in a group.

Like your suggested answer and the other responses, this will run a lot faster with an index on groupid. It's up to the DBA to balance the benefit of making your query run a lot faster against the cost of maintaining yet another index.

If the DBA decides against the index, make sure the appropriate people understand that it's the indexing strategy, and not the way you wrote the query, that is slowing things down.
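For reference, the index in question would be something along these lines (the name is hypothetical, and creating it is the DBA's call):

create index event_groupid_ix on event (groupid);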

Walter Mitty
Oracle doesn't seem to like "as", but this worked perfectly!
FrankS
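For the record, Oracle accepts AS for column aliases but not for table aliases, so the Oracle-friendly version of the join simply drops the two keywords:

select
    a.*
from
    event a
inner join
    (select groupid
     from event
     group by groupid
     having count(*) <> 5) b
  on a.groupid = b.groupid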
Inefficient though. Use the analytic function method.
David Aldridge
How long did it take to run?
Walter Mitty
I didn't really measure it, but it took less than a second.
FrankS
A: 

Does this do what you want, and does it offer better performance? (I just thought I'd throw it in as a suggestion.)

select *
from event g
where (select count(*) from event e where g.groupid = e.groupid) <> 5
ilitirit
A: 

How about an analytic function:

SELECT *
FROM (
  SELECT eventid, groupid, typeid,
         COUNT(groupid) OVER (PARTITION BY groupid) group_count
  FROM event
)
WHERE group_count <> 5
Dave Costa
+4  A: 

A classic problem for analytic queries to solve:

select eventid,
       groupid,
       typeid
from   (
       select eventid,
              groupid,
              typeid,
              count(*) over (partition by groupid) count_by_groupid
       from   EVENT
       )
where  count_by_groupid <> 5
David Aldridge