Find and Remove Duplicates in SQL

·

·

, ,

Problem

You want to remove duplicates in SQL. You know it is easy to find duplicates using GROUP BY, but how do your remove all but one?

Background

Here’s an example of how to find duplicate records using a window function in SQL:

Suppose we have a table of students that includes columns for student ID, name, and age. We want to find all the students who have the same name and age. We can use a window function to do this in a single SQL query:

CREATE TABLE students ( 
      id INT PRIMARY KEY,
      name VARCHAR(50) NOT NULL,
      age INT NOT NULL
  );

INSERT INTO students (id, name, age)
VALUES
   (1, 'Alice', 20),
   (2, 'Bob', 22),
   (3, 'Charlie', 20),
   (4, 'Alice', 20),
   (5, 'David', 18),
   (6, 'Charlie', 20),
   (7, 'Eve', 19),
   (8, 'Bob', 22),
   (9, 'Frank', 21);

Solution

This solution to remove duplicates in SQL is similar to the solution for finding duplicate rows we showed earlier, but here we want to keep one of the rows and delete the rest.

To do this, we’ll use the ROW_NUMBER() function to help find duplicates with name and ages.

delete students 
from students 
where students.id in
(
   select id
   from (
      select id, row_number() over (partition by name, age order by id) as rownum
      from students
   ) d
   where rownum > 1
)

In this query, we use a derived table named d, to apply the ROW_NUMBER() function to the “students” table, partitioned by “name” and “age”, and ordered by “id”.

We then select only the rows where the “rownum” value is greater than 1, which corresponds to all but the first occurrence of each duplicate.

Finally, we use the DELETE statement to remove all the selected rows from the “students” table. We use a WHERE clause to filter remove the duplicates. The duplicates being rows 2, 3, 4, … within the partition.

This query eliminates the need for the subquery and uses the window function directly in the DELETE statement to identify and delete duplicates.

Learn More! SQL DELETE Statement

Discussion

You can prioritize which row to keep by using an order by within your OVER clause.

For instance, in our, since ID is order by default in ascending order, the ID are sorted in lowest to highest order for within each partition.

…row_number() over (partition by name, age order by id asc) as rownum

This means, all things equal, the lower ID is kept.

If you want to keep the highest ID value, then order by ID in descending order.

Remove Duplicates in SQL Alternatives

If you cannot use a window function to identify and remove duplicates in SQL, there are a few other ways to accomplish this task in SQL. Here are a couple of common methods:

Self-Join:

One common way to identify duplicates in SQL is to use a self-join. We join the table to itself on the columns that define the duplicates, then select only the rows that match on those columns. Here is an example:

SELECT t1.* 
FROM students t1
  JOIN students t2 ON t1.name = t2.name AND t1.age = t2.age AND t1.id < t2.id;

In this example, we join the “students” table to itself on the “name” and “age” columns, and we select only the rows where the “id” value in the first table is less than the “id” value in the second table. This ensures that we only return each duplicate set once, and we exclude the first occurrence of each set.

To remove the duplicates using this method, we can use the DELETE statement with the same join condition to remove all but the first row in each duplicate set.

Subquery with GROUP BY:

Another way to identify and remove duplicates is to use a subquery with a GROUP BY clause. We can group the table by the columns that define the duplicates, and then select only the groups that have more than one row. Here is an example:

SELECT * 
FROM students
WHERE (name, age) IN
(
   SELECT name, age
   FROM students
   GROUP BY name, age
   HAVING COUNT(*) > 1
);

In this example, we use a subquery to select the “name” and “age” values for groups that have more than one row in the “students” table. We then use these values in the WHERE clause of the outer query to select all rows that belong to these groups.

To remove the duplicates in SQL using this method, we can use the DELETE statement with the same subquery to remove all but the first row in each duplicate set. However, we need to modify the subquery to return the “id” values for each group, so that we can use them in the WHERE clause of the DELETE statement. Here is an example:

DELETE FROM students
WHERE (name, age) IN
(
   SELECT name, age
   FROM students
   GROUP BY name, age
   HAVING COUNT(*) > 1
);

In this example, we use a subquery to select the minimum “id” value for each group of duplicates in the “students” table. We then use these values in the WHERE clause of the DELETE statement to remove all but the first row in each duplicate set.

Related Articles

Join the newsletter

Subscribe to get our latest content by email.

Powered by ConvertKit

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

More from the blog


MySQL PostgreSQL SQLite SqlServer