10 pts.
 Fastest “get duplicates” SQL
What is an example of a fast SQL to get duplicates in datasets with hundreds of thousands of records. I typically use something like: select afield1, afield2 from afile a where 1 < (select count(afield1) from afile b where a.afield1 = b.afield1);

Software/Hardware used:
ASKED: October 13, 2008  9:33 AM
UPDATED: November 25, 2008  10:14 PM

Answer Wiki:
The absolutely fastest in terms of elapsed time depends on many variables including the DBMS, platform, etc. But the most powerful, and usually fastest, is based on COUNT, GROUP BY, and HAVING: SELECT key_col_1 [,key_col_2 . . .], COUNT( * ) AS dup_count FROM the_table GROUP BY key_col_1 [,key_col_2 . . .] HAVING COUNT( * ) > 1 ORDER BY key_col_1 [,key_col_2 . . .] --This is powerful because 1) It is as easy to code and quick to run for a multi-part key as for a single column one. 2) It can be adapted to working with large or unknown numbers of duplicates, by making it an in-line view. Using the basic query as an in-line view Just to check if there are duplicates and get a count of them: SELECT COUNT(*) FROM ( SELECT key_col_1 [,key_col_2 . . .], COUNT( * ) AS dup_count FROM the_table GROUP BY key_col_1 [,key_col_2 . . .] HAVING COUNT( * ) > 1 ) dups If you want to see some duplicate key values but not thousands of them, and your database supports a method for returning the first x rows, it can be adapted to that. With Oracle you can see the first 100 with this: SELECT key_col_1 [,key_col_2 . . .], dup_count FROM ( SELECT key_col_1 [,key_col_2 . . .], COUNT( * ) AS dup_count FROM the_table GROUP BY key_col_1 [,key_col_2 . . .] HAVING COUNT( * ) > 1 ) dups WHERE rownum < 101 ORDER BY key_col_1 [,key_col_2 . . .] If you need to report some non-key column values for each row with a duplicate key, you can join the "dups" inline view to the table to do that. With larger data sets and some programming, very high performance with little impact on the database system can be achieved by using a command line query utility to "print" all the keys values to the output stream, pipe them to sort, then pipe sort output to a script that checks for duplicates by comparing pairs of records. Since the script is custom-coded, it is not only fast but can do whatever is needed on encountering duplicates, from simply reporting them to throwing alerts to providing input to an application that will clean up the data.
Last Wiki Answer Submitted:  November 25, 2008  8:56 pm  by  Practicaldba   45 pts.
All Answer Wiki Contributors:  Practicaldba   45 pts.
To see all answers submitted to the Answer Wiki: View Answer History.


Discuss This Question:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _


 

Good answer, practicaldba.
I remember when I first saw this question, I was going to give a similar answer to yours, using group by and having… but I decided to make some tests first, and I found that in some cases, the approach used by Mm2010 (select afield1, afield2 from afile a where 1 < (select count(afield1) from afile b where a.afield1 = b.afield1)) gave fastest results, mainly where an index existed on a.afield1.

 63,535 pts.