ansaurus

Question

How do I find duplicate entries in a database table?

Answer 1

+1 A:

You probably want this

SELECT dewey_number, author_last_name,
 COUNT(dewey_number) AS NumOccurrences
FROM book
GROUP BY dewey_number,author_last_name
HAVING ( COUNT(dewey_number) > 1 )

Kibbee 2009-01-12 16:25:46

"author_last_name" is not in the "book" table. The "book" table and "author" table are connected by a FKed "author_id" in the "book" table.

Huuuze 2009-01-12 16:27:12

Answer 2

+2 A:

A nested query can do the job.

SELECT author_last_name, dewey_number, NumOccurrences
FROM author INNER JOIN
     ( SELECT author_id, dewey_number,  COUNT(dewey_number) AS NumOccurrences
        FROM book
        GROUP BY author_id, dewey_number
        HAVING ( COUNT(dewey_number) > 1 ) ) AS duplicates
ON author.id = duplicates.author_id

(I don't know if this is the fastest way to achieve what you want.)

Update: Here is my data

SELECT * FROM author;
 id | author_last_name 
----+------------------
  1 | Fowler
  2 | Knuth
  3 | Lang

SELECT * FROM book;
 id | author_id | dewey_number |         title          
----+-----------+--------------+------------------------
  1 |         1 |          600 | Refactoring
  2 |         1 |          600 | Refactoring
  3 |         1 |          600 | Analysis Patterns
  4 |         2 |          600 | TAOCP vol. 1
  5 |         2 |          600 | TAOCP vol. 1
  6 |         2 |          600 | TAOCP vol. 2
  7 |         3 |          500 | Algebra
  8 |         3 |          500 | Undergraduate Analysis
  9 |         1 |          600 | Refactoring
 10 |         2 |          500 | Concrete Mathematics
 11 |         2 |          500 | Concrete Mathematics
 12 |         2 |          500 | Concrete Mathematics

And here is the result of the above query:

 author_last_name | dewey_number | numoccurrences 
------------------+--------------+----------------
 Fowler           |          600 |              4
 Knuth            |          600 |              3
 Knuth            |          500 |              3
 Lang             |          500 |              2

Federico Ramponi 2009-01-12 16:30:05

Ok, probably a nested query is overkill, but now I'm sure it works. Tony Andrews' comment on the author_id column applies here too.

Federico Ramponi 2009-01-12 17:07:39

By the way, why are you searching for duplicates based on author's name and Dewey code? Duplicates (multiple copies of the same book, I guess) should be based also on the book's title...

Federico Ramponi 2009-01-12 17:20:29

I updated the code, but it returns 0 results. As for the use case question, I simply created a dummy scenario -- I'm not actually working on detecting duplicate books.

Huuuze 2009-01-12 17:25:30

"I updated the code, but it returns 0 results." I executed this code in psql with two tables like yours and it returned the expected data. Check your data. Let me work out the details, then I'll update my post.

Federico Ramponi 2009-01-12 17:28:44

OK, I found part of the problem on my end -- a couple fields were missing DD numbers. Unfortunately, the query doesn't yield the desired result -- it only displays the name of a single author with books that have duplicated DD numbers.

Huuuze 2009-01-12 17:36:43

Update your data so author_id=1 only has one "600" and author_id=2 only has one 600. Do you still get a result back for "600"?

Huuuze 2009-01-12 18:11:49

Of course not. And there should not be any, otherwise your question is wrong. You can't expect a query to choose arbitrarily an author_id from those (many) having an entry with dewey=600 and have that author_id joined with table author.

Federico Ramponi 2009-01-12 18:18:32

Answer 3

+2 A:

SELECT dewey_number, author_last_name,
       COUNT(dewey_number) AS NumOccurrences
FROM book
JOIN author USING (author_id)
GROUP BY dewey_number,author_last_name
HAVING COUNT(dewey_number) > 1

If book.author_id can be null then change the join to:

LEFT OUTER JOIN author USING (author_id)

If the author_id column has a different name in each table then you can't use USING, use ON instead:

JOIN author ON author.id = book.author_id

or

LEFT OUTER JOIN author ON author.id = book.author_id

Tony Andrews 2009-01-12 16:30:12

No luck on this one. It returns 0 results.

Huuuze 2009-01-12 16:37:34

Do both tables have a column AUTHOR_ID as you stated? My USING syntax assumes that. If not change to "JOIN author ON author.xxx = book.yyy". If the query returns 0 rows (without error) that suggests that there the authors table is empty?

Tony Andrews 2009-01-12 16:45:54

I think we're getting closer. I did neglect to mention that it's "author.id" and "book.author_id", so good call on USING vs ON. I updated accordingly, but Postgresql didn't like the bracketed "[LEFT JOIN]". After removing the brackets, it yields 0 results.

Huuuze 2009-01-12 17:00:22

And, no, neither "id" field can be null.

Huuuze 2009-01-12 17:03:39

Sorry, the brackets around LEFT OUTER were my way of saying that these keywords were optional

Tony Andrews 2009-01-12 17:52:24

ansaurus

tags:

views:

answers:

How do I find duplicate entries in a database table?

related questions