我需要从一个相当大的SQL Server表(即300,000+行)中删除重复的行。

当然,由于RowID标识字段的存在,这些行不会完全重复。

MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null

我该怎么做呢?


当前回答

我知道这个问题已经回答了,但我已经创建了非常有用的sp,它将为表副本创建一个动态删除语句:

    CREATE PROCEDURE sp_DeleteDuplicate @tableName varchar(100), @DebugMode int =1
AS 
BEGIN
SET NOCOUNT ON;

IF(OBJECT_ID('tempdb..#tableMatrix') is not null) DROP TABLE #tableMatrix;

SELECT ROW_NUMBER() OVER(ORDER BY name) as rn,name into #tableMatrix FROM sys.columns where [object_id] = object_id(@tableName) ORDER BY name

DECLARE @MaxRow int = (SELECT MAX(rn) from #tableMatrix)
IF(@MaxRow is null)
    RAISERROR  ('I wasn''t able to find any columns for this table!',16,1)
ELSE 
    BEGIN
DECLARE @i int =1 
DECLARE @Columns Varchar(max) ='';

WHILE (@i <= @MaxRow)
BEGIN 
    SET @Columns=@Columns+(SELECT '['+name+'],' from #tableMatrix where rn = @i)

    SET @i = @i+1;
END

---DELETE LAST comma
SET @Columns = LEFT(@Columns,LEN(@Columns)-1)

DECLARE @Sql nvarchar(max) = '
WITH cteRowsToDelte
     AS (
SELECT ROW_NUMBER() OVER (PARTITION BY '+@Columns+' ORDER BY ( SELECT 0)) as rowNumber,* FROM '+@tableName
+')

DELETE FROM cteRowsToDelte
WHERE  rowNumber > 1;
'
SET NOCOUNT OFF;
    IF(@DebugMode = 1)
       SELECT @Sql
    ELSE
       EXEC sp_executesql @Sql
    END
END

如果你创建这样的表格

IF(OBJECT_ID('MyLitleTable') is not null)
    DROP TABLE MyLitleTable 


CREATE TABLE MyLitleTable
(
    A Varchar(10),
    B money,
    C int
)
---------------------------------------------------------

    INSERT INTO MyLitleTable VALUES
    ('ABC',100,1),
    ('ABC',100,1), -- only this row should be deleted
    ('ABC',101,1),
    ('ABC',100,2),
    ('ABCD',100,1)

    -----------------------------------------------------------

     exec sp_DeleteDuplicate 'MyLitleTable',0

它将从表中删除所有重复项。如果运行它时不带第二个参数,它将返回一条SQL语句来运行。

如果您需要排除任何列,只需在调试模式下运行它,获取代码并按照您的喜好修改它。

其他回答

快速和脏删除精确重复的行(小表):

select  distinct * into t2 from t1;
delete from t1;
insert into t1 select *  from t2;
drop table t2;

另一种基于两列删除重复项的方法

我发现这个查询更容易阅读和替换。

DELETE 
FROM 
 TABLE_NAME 
 WHERE FIRST_COLUMNS 
 IN( 
       SELECT * FROM 
           ( SELECT MIN(FIRST_COLUMNS) 
             FROM TABLE_NAME 
             GROUP BY 
                      FIRST_COLUMNS,
                      SECOND_COLUMNS 
             HAVING COUNT(FIRST_COLUMNS) > 1 
            ) temp 
   )

注意:在运行查询之前最好模拟查询。

另一个简单的解决方案可以在这里粘贴的链接中找到。这个方法很容易掌握,似乎对大多数类似的问题都很有效。虽然它是为SQL Server,但所使用的概念是可以接受的。

以下是链接页面的相关部分:

考虑以下数据:

EMPLOYEE_ID ATTENDANCE_DATE
A001    2011-01-01
A001    2011-01-01
A002    2011-01-01
A002    2011-01-01
A002    2011-01-01
A003    2011-01-01

那么我们如何删除这些重复的数据呢?

首先,使用以下代码在表中插入一个标识列:

ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)  

使用下面的代码来解决它:

DELETE FROM dbo.ATTENDANCE WHERE AUTOID NOT IN (SELECT MIN(AUTOID) _
    FROM dbo.ATTENDANCE GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE) 

删除重复记录

在这种情况下,大于操作符删除除第一条记录以外的所有记录

u1 FROM users u1 JOIN users u2 u1的地方。Id > u2.id 和u1.email = u2.email

<小于操作符在这种情况下删除除最后一条记录以外的所有记录

u1 FROM users u1 JOIN users u2 u1的地方。Id < u2.id 和u1.email = u2.email

获取重复的行:

SELECT
name, email, COUNT(*)
FROM 
users
GROUP BY
name, email
HAVING COUNT(*) > 1

删除重复的行。

DELETE users 
WHERE rowid NOT IN 
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);