Fastest “get duplicates” SQL

Tags: Database management, Duplicate records, SQL
What is an example of a fast SQL query to find duplicates in datasets with hundreds of thousands of records? I typically use something like:

select afield1, afield2
from afile a
where 1 < (select count(afield1)
           from afile b
           where a.afield1 = b.afield1);

Answer Wiki


The absolute fastest in terms of elapsed time depends on many variables, including the DBMS, the platform, and the available indexes.

But the most powerful approach, and usually the fastest, is one based on COUNT, GROUP BY, and HAVING:

SELECT key_col_1 [, key_col_2 . . .], COUNT(*) AS dup_count
FROM the_table
GROUP BY key_col_1 [, key_col_2 . . .]
HAVING COUNT(*) > 1
ORDER BY key_col_1 [, key_col_2 . . .];

This is powerful because:

1) It is as easy to code, and as quick to run, for a multi-part key as for a single-column one (a concrete multi-column sketch follows this list).

2) It can be adapted to handle large or unknown numbers of duplicates by making it an in-line view.
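
For instance, a minimal two-column sketch against a hypothetical orders table (the table and column names here are invented for illustration):

SELECT customer_id, order_date, COUNT(*) AS dup_count
FROM orders            -- orders, customer_id, order_date are hypothetical example names
GROUP BY customer_id, order_date
HAVING COUNT(*) > 1    -- any (customer_id, order_date) pair on more than one row is reported
ORDER BY customer_id, order_date;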

Using the basic query as an in-line view

Just to check if there are duplicates and get a count of them:

SELECT COUNT(*) FROM (
  SELECT key_col_1 [, key_col_2 . . .], COUNT(*) AS dup_count
  FROM the_table
  GROUP BY key_col_1 [, key_col_2 . . .]
  HAVING COUNT(*) > 1
) dups;

If you want to see some duplicate key values, but not thousands of them, and your database supports a way of returning only the first n rows, the query can be adapted to do that. With Oracle you can see the first 100 with this:

SELECT key_col_1 [, key_col_2 . . .], dup_count FROM (
  SELECT key_col_1 [, key_col_2 . . .], COUNT(*) AS dup_count
  FROM the_table
  GROUP BY key_col_1 [, key_col_2 . . .]
  HAVING COUNT(*) > 1
  ORDER BY key_col_1 [, key_col_2 . . .]
) dups
WHERE rownum < 101;
-- ORDER BY goes inside the in-line view: Oracle assigns ROWNUM before an
-- outer ORDER BY runs, so ordering outside would sort an arbitrary 100 keys
-- rather than return the first 100 in key order.
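
On databases that support the SQL-standard FETCH FIRST clause (PostgreSQL, Db2, and Oracle 12c and later, for example), a rough sketch of the same idea needs no in-line view:

SELECT key_col_1 [, key_col_2 . . .], COUNT(*) AS dup_count
FROM the_table
GROUP BY key_col_1 [, key_col_2 . . .]
HAVING COUNT(*) > 1
ORDER BY key_col_1 [, key_col_2 . . .]
FETCH FIRST 100 ROWS ONLY;   -- caps output at the first 100 duplicate keys in key order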

If you need to report some non-key column values for each row with a duplicate key, you can join the "dups" in-line view back to the table to do that.
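
A rough sketch of that join, reusing the generic names from the queries above:

SELECT t.*
FROM the_table t
JOIN (
  SELECT key_col_1 [, key_col_2 . . .]
  FROM the_table
  GROUP BY key_col_1 [, key_col_2 . . .]
  HAVING COUNT(*) > 1
) dups
  ON t.key_col_1 = dups.key_col_1 [AND t.key_col_2 = dups.key_col_2 . . .]
ORDER BY t.key_col_1 [, t.key_col_2 . . .];
-- Returns every column of every row whose key value occurs more than once.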

With larger data sets and some programming, you can get very high performance with little impact on the database system by using a command-line query utility to "print" all the key values to the output stream, piping them to sort, and then piping sort's output to a script that checks for duplicates by comparing adjacent pairs of records. Since the script is custom-coded, it is not only fast but can do whatever is needed on encountering duplicates, from simply reporting them, to throwing alerts, to providing input to an application that will clean up the data.

Discuss This Question: 1 Reply

 
  • carlosdl
    Good answer, practicaldba. I remember when I first saw this question, I was going to give a similar answer to yours, using GROUP BY and HAVING... but I decided to run some tests first, and I found that in some cases the approach used by Mm2010 (select afield1, afield2 from afile a where 1 < (select count(afield1) from afile b where a.afield1 = b.afield1)) gave the fastest results, mainly where an index existed on a.afield1.
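
A minimal sketch of the kind of index carlosdl describes (the index name is invented for illustration):

CREATE INDEX afile_afield1_ix ON afile (afield1);  -- afile_afield1_ix is a hypothetical name
-- With this index, the correlated subquery's per-row count can be answered
-- from an index range scan instead of a full scan of afile for each row.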