How to delete duplicate rows from a table in SQL

Duplicate records can result in unintended outcomes when working with data. Despite data professionals disliking it, this issue is often encountered in data warehouse systems. When loading data from sources into a staging table, the data load job may execute multiple times due to various reasons, such as network connectivity failures or runtime exceptions, which can lead to the insertion of duplicate records.

How to delete only excessive rows from a table?

Consider the following image, data from 6/10/2017 loaded four times into this table while there should be only one record.

Using Window function and CTE

The window function ROW_NUMBER() and Common Table Expression (CTE) come in handy to remove duplicate records. Refer to the following query, where the ROW_NUMBER() function is used to generate a unique sequence number for each row within the partition group. The records are partitioned based on the columns ‘StoreId, SaleDate, SalesAmount’ where duplicates are found. The result of this query is shown in the following image.

;WITH DUPLICATE AS  
(
SELECT *, 
ROW_NUMBER() OVER (PARTITION BY StoreId, SaleDate, SalesAmount ORDER BY StoreSaleId) RowNum 
FROM StoreSale 
) SELECT * FROM DUPLICATE

DELETE Records
It is clear that records with RowNum greater than 1 are confirmed to be duplicates and it is safe to delete them using the following query

;WITH DUPLICATE AS  
(
SELECT  
ROW_NUMBER() OVER (PARTITION BY StoreId, SaleDate, SalesAmount ORDER BY StoreSaleId) RowNum 
FROM StoreSale 
) DELETE FROM DUPLICATE WHERE RowNum >1 

Complete script used in this exercise given here.

USE TEMPDB
GO
---------------------------------------------------------------------------------------------------------------
--Prep
---------------------------------------------------------------------------------------------------------------
DROP TABLE IF EXISTS [dbo].[StoreSale]  
GO
CREATE TABLE [dbo].[StoreSale]
(
	[StoreSaleId] [bigint] IDENTITY(1,1) NOT NULL,
	[StoreId] [int] NOT NULL,
	[SaleDate] [date] NOT NULL,
	[SalesAmount] [money] NOT NULL,
	[DataLoadDate] [datetime] NULL DEFAULT (GETDATE()),
    CONSTRAINT [PK_StoreSaleId] PRIMARY KEY CLUSTERED (	[StoreSaleId] ASC) 
) ON [PRIMARY]
GO

DECLARE @StoreId SMALLINT = 1
DECLARE @SaleDate DATE = '6/1/2017'

WHILE @StoreId <=1
BEGIN
	WHILE @SaleDate <='6/10/2017'
	BEGIN
		INSERT INTO [dbo].[StoreSale] ([StoreId],[SaleDate],[SalesAmount])
		VALUES (@StoreId,@SaleDate,ROUND(RAND()*1000,2))
		SET @SaleDate=DATEADD(DD,1,@SaleDate)		
	END
	SET @StoreId+=1
	SET @SaleDate = '6/1/2017'
END

GO

INSERT INTO [dbo].[StoreSale] ([StoreId],[SaleDate],[SalesAmount])
SELECT [StoreId],[SaleDate],[SalesAmount] FROM [StoreSale] WHERE StoreId=1 AND SaleDate = '6/10/2017'
GO 2
---------------------------------------------------------------------------------------------------------------
--Select and delete duplicate records
---------------------------------------------------------------------------------------------------------------
GO

;WITH DUPLICATE AS  
(
	SELECT *, ROW_NUMBER() OVER (PARTITION BY StoreId, SaleDate, SalesAmount ORDER BY StoreSaleId) RowNum 
	FROM StoreSale 
) SELECT * FROM DUPLICATE

GO

;WITH DUPLICATE AS  
(
SELECT  ROW_NUMBER() OVER (PARTITION BY StoreId, SaleDate, SalesAmount ORDER BY StoreSaleId) RowNum 
FROM StoreSale 
) DELETE FROM DUPLICATE WHERE RowNum >1 

GO

;WITH DUPLICATE AS  
(
	SELECT *, ROW_NUMBER() OVER (PARTITION BY StoreId, SaleDate, SalesAmount ORDER BY StoreSaleId) RowNum 
	FROM StoreSale 
) SELECT * FROM DUPLICATE

Leave a Reply

Your email address will not be published. Required fields are marked *

%d bloggers like this: