views:

233

answers:

4

I have a table like so:

Table eventlog
user  |  user_group  | event_date |  event_dur.
----     ----------    ---------     ----------
xyz         1           2009-1-1        3.5
xyz         2           2009-1-1        4.5
abc         2           2009-1-2        5
abc         1           2009-1-2        5

Notice that in the above sample data, the only thing reliable is the date and the user. Through an over site that is 90% mine to blame, I have managed to allow users to duplicate their daily entries. In some instances the duplicates were intended to be updates to their duration, in others it was their attempt to change the user_group they were working with that day, and in other cases both.

Fortunately, I have a fairly strong idea (since this is an update to an older system) of which records are correct. (Basically, this all happened as an attempt to seamlessly merge the old DB with the new DB).

Unfortunately, I have to more or less do this by hand, or risk losing data that only exists on one side and not the other....

Long story short, I'm trying to figure out the right MySQL query to return all records that have more than one entry for a user on any given date. I have been struggling with GROUP BY and HAVING, but the best I can get is a list of one of the two duplicates, per duplicate, which would be great if I knew for sure it was the wrong one.

Here is the closest I've come:

SELECT *
FROM eventlog
GROUP BY event_date, user
HAVING COUNT(user) > 1
ORDER BY event_date, user

Any help with this would be extremely useful. If need be, I have the list of users/date for each set of duplicates, so I can go by hand and remove all 400 of them, but I'd much rather see them all at once.

Thanks!

A: 

Would this work?

SELECT event_date, user
FROM eventlog
GROUP BY event_date, user
HAVING COUNT(*) > 1
ORDER BY event_date, user

What's throwing me off is the COUNT(user) clause you have.

David Andres
I thought that I had to have something in that `COUNT()` to specify which column had the duplicate data (which one was duplicated in a bad way?), anyways, testing now...
Anthony
Rats, same results. Still just getting one set of the duplicates, not both. I know it's some issue with the GROUP BY
Anthony
Is it possible that your date field includes a timestamp value (e.g., 4:00 PM)? This may exclude what would otherwise look like a pair.
David Andres
If it does, it's not showing up in phpmyadmin, which is what I'm using to do this. I'll try it again using a DATE() function to be sure.
Anthony
No luck. Is it possible that the COUNT is only returning the members of the set that,when counted, are higher than 1 in that count? (as opposed to all members of the set where the count is higher than 1)
Anthony
What do you see when you take away the HAVING clause and add COUNT(*) to the SELECT list?
David Andres
I get all of my none dupes with a 1 in the count col, and one of my two dupes with a 2 in the count col. That explains why it's not returning both. How annoying.
Anthony
Ok, either one or both of the columns differ in unexpected ways: (1) the event_date field may contain timestamps (2) the user field may contain trailing spaces in some cases
David Andres
This would have to be true for each and every instance of where it is doing this, though. It really seems more like it is grouping the duplicates together, so to speak. Like I said. I'm looking at it in phpmyadmin. If either the date field or the user field were not matching, then doing a COUNT(user) or a COUNT(event_date) wouldn't produce the same results.
Anthony
Very odd. Try SELECT DATE(event_date), RTRIM(user) and GROUP BY the same.
David Andres
Check out my newer question for where I've decided to go with this:http://stackoverflow.com/questions/1392554/mysql-delete-all-results-of-sub-query
Anthony
A: 

You can list all the field values of the duplicates with GROUP_CONCAT function, but you still get one row for each set.

Pomyk
A: 

I think this would work (untested)

SELECT  *
FROM    eventlog e1
WHERE   1 <
(
    SELECT  COUNT(*)
    FROM    eventlog e2
    WHERE   e1.event_date = e2.event_date
    AND     e1.user = e2.user
)
-- AND [maybe an additionnal constraint to find the bad duplicate]
ORDER BY event_date, user;
;
Bryan Menard
A: 

In scenarios like this, you need a primary key in the table to mark or delete duplicate records. Seems like you don't have one. Add one and you might see options that were not visible before:

UPDATE eventlog SET has_atleast_one_duplicate WHERE primary_key IN 
(
SELECT primary_key
FROM eventlog
GROUP BY event_date, user
HAVING COUNT(*) > 1
)

I understand that the answer is incomplete but see if you can find a way to proceed further.

Salman A