ansaurus

Question

SQL to find duplicate entries (within a group)

Answer 1

+1 A:

If your DBAs won't add an index to make this faster, ask them what they suggest you do (that's what they're paid for, after all). Presumably you have a business case why you need this information in which case your immediate management should be on your side.

Perhaps you could ask your DBAs to duplicate the data into a database where you could add an index.

paxdiablo 2008-10-08 12:53:50

DBA is not around for a week, else it would be the easiest way, agreed.

FrankS 2008-10-08 12:59:31

You don't need an index to make this faster -- you need a different query

David Aldridge 2008-10-08 14:24:37

Answer 2

+1 A:

From a SQL perspective I think you've already answered your own question. The approach you've described (ie using the sub-select) is fine, and I'd be surprised if any other way of writing the query differed vastly in performance.

160K records doesn't seem like a lot to me. I could understand if you were unhappy with the performance of that query if it was going into a piece of application code, but from the sounds of it you're just using it as part of some data cleansing excercise. (and so would expect you to be a little more tolerant in performance terms).

Even without any supporting index, its still just two full table table scans on 160K rows, which frankly, I'd expect to perform in some sort of vaguely reasonable time.

Talk to your db administrators. They've helped create the problem, so let them be part of the solution.

/EDIT/ In the meantime, run the query you have. Find out how long it takes, rather than guessing. Even better would be to run it, with set autotrace on, and post the results here, then we might be able to help you refine it somewhat.

Michael OShea 2008-10-08 13:02:46

The problem is the missing index and the DBA is not around at the moment, so that's why this is not a solution. The subquery alone takes about 100ms due to the missing index, multiplied by 160k entries this means the full query takes about 4-5h.I fear that there really is no better solution.

FrankS 2008-10-08 13:07:18

You are second-guessing the optimiser, wrongly. It won't take 160K*0.1 seconds, it will take more like 0.1+0.1 seconds!

Tony Andrews 2008-10-08 13:13:58

well, we stopped the script after it was running for 2min

FrankS 2008-10-08 13:21:34

Leave it run, and see what happens would be my suggestion. Hell, if it takes 20 minutes to run, and its only to get you over the hump until your dba is back, then you might find that you'll spend more time trying to optimise the query than is really worth it.

Michael OShea 2008-10-08 13:28:32

True, but as I said on the other comment, I am trying to help a colleague here and I was curious if there is a better and faster way, as this was just the solution I came up with within a few minutes and I haven't been working with SQL anymore for some time.

FrankS 2008-10-08 13:34:03

"its still just two full table table scans on 160K rows" ... ah, but try it on a few million. One full scan too many here.

David Aldridge 2008-10-08 14:24:02

I was commenting on *his* provided query, suggesting that *even* with two complete scans it should perform adequately (based on the provided information). This is borne out by the accepted answer. No-where does the OP talk about a few million rows.

Michael OShea 2008-10-08 15:08:41

Sure, but an efficient solution will work on 160k rows, 160M rows, or 160 rows. Systems are full of methods that work on hundreds, thousands, or hundreds of thousands of rows, but which are inefficient when the data grew to millions of rows.

David Aldridge 2008-10-08 16:05:40

Answer 3

+3 A:

How long does that SQL actually take? You are only going to run it once I presume, having fixed the bug that caused the corruption in the first place? I just set up a test case like this:

SQL> create table my_objects as 
  2  select object_name, ceil(rownum/5) groupid, rpad('x',500,'x') filler
  3  from all_objects;

Table created.

SQL> select count(*) from my_objects;

  COUNT(*)
----------
     83782

SQL> select * from my_objects where groupid in (
  2  select groupid from my_objects
  3  group by groupid
  4  having count(*) <> 5
  5  );

OBJECT_NAME                       GROUPID FILLER
------------------------------ ---------- --------------------------------
XYZ                                 16757 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
YYYY                                16757 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Elapsed: 00:00:01.67

Less than 2 seconds. OK, my table has half as many rows as yours, but 160K isn't huge. I added the filler column to make the table take up some disk space. The AUTOTRACE execution plan was:

-------------------------------------------------------------------------
| Id  | Operation             | Name       | Rows  | Bytes | Cost (%CPU)|
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |            |   389 |   112K| 14029   (2)|
|*  1 |  HASH JOIN            |            |   389 |   112K| 14029   (2)|
|   2 |   VIEW                | VW_NSO_1   | 94424 |  1198K|  6570   (2)|
|*  3 |    FILTER             |            |       |       |            |
|   4 |     HASH GROUP BY     |            |     1 |  1198K|  6570   (2)|
|   5 |      TABLE ACCESS FULL| MY_OBJECTS | 94424 |  1198K|  6504   (1)|
|   6 |   TABLE ACCESS FULL   | MY_OBJECTS | 94424 |    25M|  6506   (1)|
-------------------------------------------------------------------------

Tony Andrews 2008-10-08 13:12:17

Aye, thats more like what I'd expect. 90K or 160K rows in oracle terms is pretty insignificant (or at least should be)

Michael OShea 2008-10-08 13:24:58

That's what I would expect myself, I must admit I trying to help a colleague here, and he stopped the script after 2min. No real idea why it takes that long, my first guess was the missing index and I am not an Oracle expert, so maybe I am just missing some informations/settings.

FrankS 2008-10-08 13:31:55

What you may have, I suppose, are stale optimiser statistics that misled Oracle into choosing a poor plan?

Tony Andrews 2008-10-08 13:40:27

Try the execution plan for the analytic method as well.

David Aldridge 2008-10-08 14:23:10

Tell you what -- making the groupid column not null and indexing it might make this more efficient for cases where there are very few records to retrieve. You'd probably get an index scan driving a nested loop join back to the table.

David Aldridge 2008-10-08 14:38:23

Answer 4

+4 A:

You can get the answer with a join instead of a subquery

select
    a.*
from
    event as a
inner join
    (select groupid
     from event
     group by groupid
     having count(*) <> 5) as b
  on a.groupid = b.groupid

This is a fairly common way of obtaining the all the information out of the rows in a group.

Like your suggested answer and the other responses, this will run a lot faster with an index on groupid. It's up to the DBA to balance the benefit of making your query run a lot faster against the cost of maintaining yet another index.

If the DBA decides against the index, make sure the appropriate people understand that its the index strategy and not the way you wrote the query that is slowing things down.

Walter Mitty 2008-10-08 13:15:59

Oracle doesn't seem to like "as", but this worked perfectly!

FrankS 2008-10-08 14:00:04

Inefficient though. Use the analytic function method.

David Aldridge 2008-10-08 14:22:19

How long did it take to run?

Walter Mitty 2008-10-08 19:26:07

didn't really measure it, but it took less than a second

FrankS 2008-10-09 06:47:57

Answer 5

A:

Does this work do what you want, and does it offer better performance? (I just thought I'd throw it in as a suggestion).

select * 
from group g
where (select count(*) from event e where g.groupid = e.groupid) <> 5

ilitirit 2008-10-08 13:21:17

Answer 6

A:

How about an analytic:

SELECT * FROM (
SELECT eventid, groupid, typeid, COUNT(groupid) OVER (PARTITION BY groupid) group_count
  FROM event
)
  WHERE group_count <> 5

Dave Costa 2008-10-08 13:26:53

Answer 7

+4 A:

A classic problem for analytic queries to solve:

select eventid,
       groupid,
       typeid
from   (
       Select eventid,
              groupid,
              typeid,
              count(*) over (partition by group_id) count_by_group_id
       from   EVENT
       )
where count_by_group_id <> 5

David Aldridge 2008-10-08 13:28:57

ansaurus

tags:

views:

answers:

SQL to find duplicate entries (within a group)

related questions