If you want to find duplicate rows in SQL, you can go two routes. The first is to use the GROUP BY and HAVING to identify “duplicate” groups of data, or you can use a easy to understand window function to not only identify the duplicates but also its primary key.
In this video I’ll show how. I think once you watch the video, you’ll see there isn’t much mystery involved!
I encourage you to watch the video, but for those who like to ready, check out the transcript below.
Hey, this is Kris from EssentialSQL and today, if you’re looking for a great way to identify duplicate rows in your data and be able to get to those specific rows, you’ll want to watch this episode because I’m going to show you how to do that using a couple of techniques in SQL.
But namely, we’re going to use a window function called ROW_NUMBER() to really help us get to the core of the matter. So let’s get started with this, and we will start with this first table where we’re going to try to find duplicates of sales order headers, where we have the order date, the territory ID, and the account number.
When I run this, you’re going to see a whole bunch of rows and they all look different, but there’s actually duplicate dates, territories, and account numbers in here.
How could I start to find out what these are? One thing I could do is use a window function, which I go over in my blog and I can actually get and count the duplicates.
ROW_NUMBER() and Duplicate Rows In SQL
I’m going to use, what’s called ROW_NUMBER() here and let’s just type to this in. I’ll say row number and then I’m going to now type in like, what do I want to do my row number over? Because it’s basically going to return for a block of rows like, I’m row one, I’m row two, I’m row three. So that block is called a partition and to do that, we start with an over clause to signal we’re working with a window function. So I could say partition by.
And here I’ll say order date and then I’ll say territory ID and then account number, because I want to find duplicates of all three of these. Okay and then I think it wants an order by, so I’m going to have to say order by and then we’ll just say order date.
Now when I run- and we’ll just call this roadmap so we know we have an alias for it. When I run this, you’re going to see a whole bunch of row numbers that are one and what that means is that’s the first row that the system found for a unique triplet of order, date, territory, ID, and account number.
Maybe somewhere lurking in our result is a row here that has a row num of two, which would be like the second row. And guess what that is? That is one of the duplicates. So let’s see if we could try to identify that.
I think one thing that people would try to do now that they’ve done all this hard work with the window function is to add a where clause. And lets just clean this up a little bit.
It looks a little nicer and then I would say something like maybe where, and then I could say row number and let’s say greater than one because that might find the duplicates for us. Right?
Hitting SQL Limitations
When I run this, it’ll say window functions can only appear in the select or order clause. Well, that’s a bummer, but let’s try just the ROW_NUMBER(). Right? Because we did alias it and it doesn’t really look like a window function to let’s try that.
And that doesn’t work either. What can we do to make this work? What we can do to get around this limitation is take our original query and we can wrap it in what’s called a derived table and then based on that, we’ll be able to see the row number as truly as a column and then be able to do some operations off that.
Let me show you what that would look like. I can say select star and then say from. This is the drive table part that I’m creating right here, where I put it inside parentheses. Some people will call this a sub query. I think in this specific case, calling it drive table is more accurate.
Work Around with Derived Tables
When you do a derived table, you have to give it an alias and I have done that. Now let me run this and if you look at the results here, you should see where pretty much you get the same rows in the row numbers.
Now let’s go in and find the row numbers that are equal to two. And how I can do that, it’s just like where, and I can say D.ROW_NUMBER() is greater than one. Okay and this will find the duplicate rows and I run that. You’ll see I have now, the duplicate rows.
Now what’s cool about this is that notice that I now have the sales order ID as part of my result and I know what the duplicate row is. I added sales order ID to my query. Unlike a group by statement with a having that could identify duplicates, I’m able to get the actual row where this duplicate lies. If I wanted to go in and delete this, I could, because I could use this primary key now to identify the row and then actually delete it out of the record set. That’s pretty awesome.
Duplicate Rows in SQL Conclusion
If you like what you’ve seen and you want to learn more about ways you can combine SQL features together to get over some limitations, you’ll definitely want to check out my course, Fearless SQL. It will introduce you to intermediate SQL and take you from intermediate SQL to advance SQL and go through all the cool topics. Definitely check that out. It’s on my blog, essentialsql.com.
Also don’t forget to subscribe to our channel and like the video. That really helps us out a lot. And I totally appreciate you watching my video. And I liked the comments I’ve received already. They make my day. So hopefully you have a great day yourself. And I look forward to getting you another video here in the near future. Bye.