我需要从一个相当大的SQL Server表(即300,000+行)中删除重复的行。

当然,由于RowID标识字段的存在,这些行不会完全重复。

MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null

我该怎么做呢?


当前回答

下面的查询用于删除重复的行。本例中的表以ID作为标识列,具有重复数据的列是Column1、Column2和Column3。

DELETE FROM TableName
WHERE  ID NOT IN (SELECT MAX(ID)
                  FROM   TableName
                  GROUP  BY Column1,
                            Column2,
                            Column3
                  /*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
                    nullable. Because of semantics of NOT IN (NULL) including the clause
                    below can simplify the plan*/
                  HAVING MAX(ID) IS NOT NULL) 

下面的脚本显示GROUP BY、HAVING、ORDER BY在一个查询中的用法,并返回带有重复列及其计数的结果。

SELECT YourColumnName,
       COUNT(*) TotalCount
FROM   YourTableName
GROUP  BY YourColumnName
HAVING COUNT(*) > 1
ORDER  BY COUNT(*) DESC 

其他回答

在微软支持网站上有一篇关于删除重复文件的好文章。这是相当保守的——他们让你在不同的步骤中做所有的事情——但它应该适用于大的表格。

在过去,我使用了自连接来实现这一点,尽管它可能会用一个HAVING子句来美化:

DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField 
AND dupes.secondDupField = fullTable.secondDupField 
AND dupes.uniqueField > fullTable.uniqueField

这将删除重复的行,除了第一行

DELETE
FROM
    Mytable
WHERE
    RowID NOT IN (
        SELECT
            MIN(RowID)
        FROM
            Mytable
        GROUP BY
            Col1,
            Col2,
            Col3
    )

引用(http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server)

另一种表达方式:——

DELETE A
FROM   TABLE A,
       TABLE B
WHERE  A.COL1 = B.COL1
       AND A.COL2 = B.COL2
       AND A.UNIQUEFIELD > B.UNIQUEFIELD 
SELECT  DISTINCT *
      INTO tempdb.dbo.tmpTable
FROM myTable

TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable

如果您想预览要删除的行,并控制要保留的重复行。参见http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/

with MYCTE as (
  SELECT ROW_NUMBER() OVER (
    PARTITION BY DuplicateKey1
                ,DuplicateKey2 -- optional
    ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
  ) RN
  FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1