Translate

How to Identify and Remove Duplicate Records from Table?

Reason -  We made a mistake, and allowed to insert duplicate values. Now we need to remove/change the duplicates from the Database permanently.  So, we need to find them first.
Duplicate records can create problems sometimes when displaying reports or performing a Multiple Insert update. Finding duplicate records in a database needs further investigation. In some cases, duplicate records are positive, but it all depends on the data and the database design as well.

Identify Duplicate Rows:

There are various times when we need to find duplicate records in SQL Server. It is possible to find duplicates using DISTINCT, ROW NUMBER as well as the GROUP BY approach.
Using the DISTINCT approach, we can quickly get unique rows in a table. However, DISTINCT does not show how many times a row has been duplicated. Using the GROUP BY approach, we can find this.

Adding grouping and a counting field to our display of fields combination shows how many times each record appears.

 SQL> SELECT * From DupTest;  
     ID NAME  
 ---------- --------------------  
     101 Ravi  
     101 Ravi  
     102 Rajan  
     103 Karan  
     103 Karan  
     103 Karan  
     104 Pritesh  
     106 Subho  
     105 Manu  
     105 Manu  
 10 rows selected.
 SQL> SELECT ID, Count(*) Occurrences  
  2 From DupTest  
  3 Group By ID;  
     ID OCCURRENCES  
 ---------- -----------  
     102      1  
     101      2  
     104      1  
     105      2  
     103      3  
     106      1  
 6 rows selected.  


GROUP BY will show just one record for each combination of Selected records. The count(Occurrences) shows how many times the row has been duplicated. Count = 1 indicates that row appears only once.

Let us filter out using the Having clause to exclude rows that appear only once(Display only Duplicate rows).

 SQL> SELECT ID, Count(*) Occurrences  
  2 From DupTest  
  3 Group By ID  
  4 Having Count(*)>1;  
     ID OCCURRENCES  
 ---------- -----------  
     101      2  
     105      2  
     103      3  
 SQL> SELECT * From DupTest  
  2 Where ID IN(SELECT ID From DupTest Group By ID Having Count(*)>1);  
     ID NAME  
 ---------- --------------------  
     101 Ravi  
     101 Ravi  
     105 Manu  
     105 Manu  
     103 Karan  
     103 Karan  
     103 Karan  
 7 rows selected.  


Delete Duplicate Rows:

Let’s remove duplicate rows from table. Following ways can be used to delete duplicate records from table. Here assume that ID column should be unique after deleting. Now we see how to delete the duplicate records from the DupTest table in different ways. 

Method 1: Using Rowid With Group By
The following Delete statement deletes the rows using the Rowid.  

 SQL> SELECT * From DupTest;  
     ID NAME  
 ---------- --------------------  
     101 Ravi  
     101 Ravi  
     102 Rajan  
     103 Karan  
     103 Karan  
     103 Karan  
     104 Pritesh  
     106 Subho  
     105 Manu  
     105 Manu  
 10 rows selected.  
 SQL> DELETE From DupTest  
  2 Where ROWID NOT IN  
  3 (SELECT MAX(ROWID) From DupTest Group By ID);  
 4 rows deleted.  
 SQL> SELECT * From DupTest;  
     ID NAME  
 ---------- --------------------  
     101 Ravi  
     102 Rajan  
     103 Karan  
     104 Pritesh  
     106 Subho  
     105 Manu  
 6 rows selected.  

Method 2: Using Rowid With Self-Join
The following Delete statement deletes the rows using the Rowid with Self join. This is similar to the first method except we need not to group the records.
 SQL> SELECT * From DupTest;  
     ID NAME  
 ---------- --------------------  
     101 Ravi  
     101 Ravi  
     102 Rajan  
     103 Karan  
     103 Karan  
     103 Karan  
     104 Pritesh  
     106 Subho  
     105 Manu  
     105 Manu  
 10 rows selected.  

 SQL> DELETE From DupTest T1  
  2 Where ROWID NOT IN  
  3 (SELECT MAX(ROWID) From DupTest T2 Where T1.ID=T2.ID);  
 4 rows deleted.  

 SQL> SELECT * From DupTest;  
     ID NAME  
 ---------- --------------------  
     101 Ravi  
     102 Rajan  
     103 Karan  
     104 Pritesh  
     106 Subho  
     105 Manu  
 6 rows selected.  

Method 3: Using Row_Number()
The following Delete statement deletes the rows using Analytical Function. The Row_Number analytic function is used to rank the rows. Here we use the Row_Number function to rank the rows for each group of ID and then select only record from the group.

 SQL> SELECT * From DupTest;  
     ID NAME  
 ---------- --------------------  
     101 Ravi  
     101 Ravi  
     102 Rajan  
     103 Karan  
     103 Karan  
     103 Karan  
     104 Pritesh  
     106 Subho  
     105 Manu  
     105 Manu  
 10 rows selected.  
 SQL> DELETE From DupTest Where ROWID IN  
  2 (  
  3 SELECT Rid From  
  4       (SELECT ROWID Rid,  
  5       ROW_NUMBER() OVER(PARTITION BY ID ORDER BY ID) Rn  
  6       From DupTest  
  7       )  
  8 Where Rn>1  
  9 );  
 4 rows deleted.  
 SQL> SELECT * From DupTest;  
     ID NAME  
 ---------- --------------------  
     101 Ravi  
     102 Rajan  
     103 Karan  
     104 Pritesh  
     106 Subho  
     105 Manu  
 6 rows selected.  

Method 4: Using Dense_Rank()
The following Delete statement deletes the rows using Analytical Function. DENSE_RANK() Returns the rank of rows within the partition of a result set, without any gaps in the ranking.
 SQL> SELECT * From DupTest;  
     ID NAME  
 ---------- --------------------  
     101 Ravi  
     101 Ravi  
     102 Rajan  
     103 Karan  
     103 Karan  
     103 Karan  
     104 Pritesh  
     106 Subho  
     105 Manu  
     105 Manu  
 10 rows selected.  
 SQL> DELETE From DupTest Where ROWID IN  
  2   (  
  3   SELECT Rid From  
  4         (SELECT ROWID Rid,  
  5         DENSE_RANK() OVER(PARTITION BY ID ORDER BY ROWID) Rn  
  6         From DupTest  
  7         )  
  8   Where Rn>1  
  9   );  
 4 rows deleted.  
 SQL> SELECT * From DupTest;  
     ID NAME  
 ---------- --------------------  
     101 Ravi  
     102 Rajan  
     103 Karan  
     104 Pritesh  
     106 Subho  
     105 Manu  
 6 rows selected.  

Here you can use both Rank() and Dense_Rank() since both will give unique records when order by Rowid

You can any of the above the above method to delete the Duplicate records as per your requirement and data volume.




Get involved and leave your Comments in the Box Below. The more people get involved, the more we all benefit.
So, leave your thoughts before you leave the page.

3 comments: