I need to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows).

The rows, of course, will not be perfect duplicates because of the RowID identity field.

MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null

How can I do this?


Current answer

If you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep, see http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/

with MYCTE as (
  SELECT ROW_NUMBER() OVER (
    PARTITION BY DuplicateKey1
                ,DuplicateKey2 -- optional
    ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
  ) RN
  FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1
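
To preview before deleting, the same CTE can feed a SELECT instead of a DELETE. This is a minimal sketch, assuming the question's MyTable with Col1, Col2 and Col3 as the duplicate key and RowID as the tie-breaker (DuplicateKey1, DuplicateKey2 and CreatedAt in the answer are placeholders):

```sql
-- Preview only: lists the rows the DELETE would remove, without changing data.
-- Assumes the question's MyTable; the lowest RowID in each group is the row kept.
WITH MYCTE AS (
  SELECT RowID, Col1, Col2, Col3,
         ROW_NUMBER() OVER (
           PARTITION BY Col1, Col2, Col3
           ORDER BY RowID
         ) AS RN
  FROM MyTable
)
SELECT RowID, Col1, Col2, Col3, RN
FROM MYCTE
WHERE RN > 1
ORDER BY Col1, Col2, Col3, RowID;
```

Once the listed rows look right, swapping the final SELECT for DELETE FROM MYCTE WHERE RN > 1 removes exactly those rows.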

Other answers

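Assuming no nulls, you GROUP BY the unique columns and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that did not get a matching row id: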

DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
   SELECT MIN(RowId) as RowId, Col1, Col2, Col3 
   FROM MyTable 
   GROUP BY Col1, Col2, Col3
) as KeepRows ON
   MyTable.RowId = KeepRows.RowId
WHERE
   KeepRows.RowId IS NULL

In case you have a GUID instead of an integer, you can replace

MIN(RowId)

with

CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
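
To confirm that the GROUP BY approach above left no duplicates behind, a check along these lines can be run after the delete. It is a sketch that assumes the question's MyTable and treats Col1, Col2 and Col3 as the duplicate key:

```sql
-- Returns one row per group that still has more than one member.
-- An empty result means no duplicates remain on (Col1, Col2, Col3).
SELECT Col1, Col2, Col3, COUNT(*) AS DuplicateCount
FROM MyTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) > 1;
```

Running the same query before the delete shows how many groups are affected.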

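Here is another good article on removing duplicates.

It discusses why it is hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."

It covers the temp table solution and two MySQL examples.

In the future, are you going to prevent it at the database level or from the application side? I would suggest the database level, because your database should be responsible for maintaining referential integrity; developers will just cause problems ;)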
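
One way to enforce this at the database level is a unique index. The following is only a sketch, assuming the question's MyTable has already been de-duplicated and SQL Server 2012 or later. Because Col2 is varchar(2048), the three columns together may exceed the index key size limit, so the sketch indexes a persisted hash of them instead. The names RowHash and UX_MyTable_RowHash are made up for illustration. Note that the hash compares bytes, so it can treat values as different that a case-insensitive collation would consider equal.

```sql
-- Hypothetical column name: RowHash. CHAR(31) is an assumed separator (taken
-- not to occur in the data) so that ('ab','c') and ('a','bc') hash differently.
ALTER TABLE MyTable
    ADD RowHash AS CAST(
        HASHBYTES('SHA2_256',
                  Col1 + CHAR(31) + Col2 + CHAR(31) + CAST(Col3 AS varchar(3)))
        AS binary(32)) PERSISTED;
GO

-- Hypothetical index name. Creation fails if duplicates are still present;
-- afterwards, inserting a row with the same three values raises an error.
CREATE UNIQUE NONCLUSTERED INDEX UX_MyTable_RowHash
    ON MyTable (RowHash);
GO
```

GO is the batch separator used by SSMS and sqlcmd; the index is created in a second batch so that the new column is visible to it.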

I prefer a CTE for deleting duplicate rows from a SQL Server table.

I strongly recommend reading this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/

Keeping the original

WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)

DELETE FROM CTE WHERE RN<>1

Without keeping the original

WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
 
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
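
The difference between the two variants is easiest to see on a tiny data set. This sketch uses a made-up table variable, @Demo, and only flags the rows each variant would delete; it removes nothing:

```sql
-- @Demo and its contents are hypothetical illustration data.
DECLARE @Demo TABLE (col1 varchar(10) NOT NULL, col2 int NOT NULL);
INSERT INTO @Demo (col1, col2) VALUES ('A', 1), ('A', 1), ('B', 2);

WITH CTE AS (
    SELECT col1, col2,
           ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1, col2) AS RN,
           RANK() OVER (ORDER BY col1, col2) AS R
    FROM @Demo
)
SELECT col1, col2, RN, R,
       -- Variant 1 (keeping the original): every row after the first in its group
       CASE WHEN RN <> 1 THEN 1 ELSE 0 END AS DeletedKeepingOriginal,
       -- Variant 2 (without keeping the original): every row whose rank is shared
       CASE WHEN R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*) > 1)
            THEN 1 ELSE 0 END AS DeletedWithoutOriginal
FROM CTE;
```

Here the first variant flags only one of the two ('A', 1) rows, while the second flags both; the ('B', 2) row is flagged by neither.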

If all the columns in the duplicate rows are the same, then the query below can be used to delete the duplicate records.

SELECT DISTINCT * INTO #TemNewTable FROM #OriginalTable
TRUNCATE TABLE #OriginalTable
INSERT INTO #OriginalTable SELECT * FROM #TemNewTable
DROP TABLE #TemNewTable

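I know this question has already been answered, but I have created a pretty useful sp that builds a dynamic delete statement for a table's duplicates: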

CREATE PROCEDURE sp_DeleteDuplicate @tableName varchar(100), @DebugMode int = 1
AS
BEGIN
    SET NOCOUNT ON;

    IF (OBJECT_ID('tempdb..#tableMatrix') IS NOT NULL) DROP TABLE #tableMatrix;

    -- Number the table's columns so they can be walked one at a time
    SELECT ROW_NUMBER() OVER (ORDER BY name) AS rn, name
    INTO #tableMatrix
    FROM sys.columns
    WHERE [object_id] = OBJECT_ID(@tableName)
    ORDER BY name;

    DECLARE @MaxRow int = (SELECT MAX(rn) FROM #tableMatrix);
    IF (@MaxRow IS NULL)
        RAISERROR ('I wasn''t able to find any columns for this table!', 16, 1);
    ELSE
    BEGIN
        DECLARE @i int = 1;
        DECLARE @Columns varchar(max) = '';

        -- Build a comma-separated, bracketed column list
        WHILE (@i <= @MaxRow)
        BEGIN
            SET @Columns = @Columns + (SELECT '[' + name + '],' FROM #tableMatrix WHERE rn = @i);
            SET @i = @i + 1;
        END

        -- Remove the trailing comma
        SET @Columns = LEFT(@Columns, LEN(@Columns) - 1);

        DECLARE @Sql nvarchar(max) = '
WITH cteRowsToDelte
     AS (
SELECT ROW_NUMBER() OVER (PARTITION BY ' + @Columns + ' ORDER BY ( SELECT 0)) as rowNumber,* FROM ' + @tableName
+ ')

DELETE FROM cteRowsToDelte
WHERE  rowNumber > 1;
';
        SET NOCOUNT OFF;
        IF (@DebugMode = 1)
            SELECT @Sql;
        ELSE
            EXEC sp_executesql @Sql;
    END
END

So if you create a table like this:

IF(OBJECT_ID('MyLitleTable') is not null)
    DROP TABLE MyLitleTable 


CREATE TABLE MyLitleTable
(
    A Varchar(10),
    B money,
    C int
)
---------------------------------------------------------

INSERT INTO MyLitleTable VALUES
('ABC',100,1),
('ABC',100,1), -- only this row should be deleted
('ABC',101,1),
('ABC',100,2),
('ABCD',100,1)

-----------------------------------------------------------

exec sp_DeleteDuplicate 'MyLitleTable',0

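With the second parameter set to 0, as above, it deletes all duplicates from the table. If you run it without the second parameter, it returns the SQL statement instead of executing it.

If you need to exclude any columns, run it in debug mode, take the generated code, and modify it however you like.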
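
For example, on the question's MyTable the generated statement would partition by every column, including the RowID identity, so no two rows would ever count as duplicates. A hand-edited version of the debug output might look like this sketch, with [RowID] taken out of the PARTITION BY list (which row survives in each group is arbitrary, as in the generated code):

```sql
-- Edited by hand from the procedure's debug output for the question's MyTable:
-- [RowID] is removed from PARTITION BY so that rows differing only in their
-- identity value are treated as duplicates.
WITH cteRowsToDelte AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY [Col1],[Col2],[Col3]
               ORDER BY (SELECT 0)
           ) AS rowNumber, *
    FROM MyTable
)
DELETE FROM cteRowsToDelte
WHERE rowNumber > 1;
```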