我需要从一个相当大的SQL Server表(即300,000+行)中删除重复的行。
当然,由于RowID标识字段的存在,这些行不会完全重复。
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
我该怎么做呢?
我需要从一个相当大的SQL Server表(即300,000+行)中删除重复的行。
当然,由于RowID标识字段的存在,这些行不会完全重复。
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
我该怎么做呢?
当前回答
如果您想预览要删除的行,并控制要保留的重复行。参见http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
with MYCTE as (
SELECT ROW_NUMBER() OVER (
PARTITION BY DuplicateKey1
,DuplicateKey2 -- optional
ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
) RN
FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1
其他回答
假设没有空值,您GROUP BY唯一的列,并选择最小(或最大)RowId作为要保留的行。然后,删除所有没有行id的内容:
DELETE FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
如果您使用的是GUID而不是整数,则可以进行替换
MIN(RowId)
with
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
这是另一篇关于删除重复的好文章。
它讨论了为什么它很难:“SQL是基于关系代数的,在关系代数中不能出现重复,因为在一个集合中不允许重复。”
临时表解决方案,以及两个mysql示例。
将来,您是打算在数据库级别还是从应用程序的角度防止它?我建议数据库级别,因为你的数据库应该负责维护引用完整性,开发人员只会造成问题;)
我更喜欢CTE从sql server表中删除重复的行
强烈推荐阅读本文::http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
保持原创性
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
不保留原创
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
如果重复行的所有列都是相同的,那么下面的查询可以用来删除重复的记录。
SELECT DISTINCT * INTO #TemNewTable FROM #OriginalTable
TRUNCATE TABLE #OriginalTable
INSERT INTO #OriginalTable SELECT * FROM #TemNewTable
DROP TABLE #TemNewTable
我知道这个问题已经回答了,但我已经创建了非常有用的sp,它将为表副本创建一个动态删除语句:
CREATE PROCEDURE sp_DeleteDuplicate @tableName varchar(100), @DebugMode int =1
AS
BEGIN
SET NOCOUNT ON;
IF(OBJECT_ID('tempdb..#tableMatrix') is not null) DROP TABLE #tableMatrix;
SELECT ROW_NUMBER() OVER(ORDER BY name) as rn,name into #tableMatrix FROM sys.columns where [object_id] = object_id(@tableName) ORDER BY name
DECLARE @MaxRow int = (SELECT MAX(rn) from #tableMatrix)
IF(@MaxRow is null)
RAISERROR ('I wasn''t able to find any columns for this table!',16,1)
ELSE
BEGIN
DECLARE @i int =1
DECLARE @Columns Varchar(max) ='';
WHILE (@i <= @MaxRow)
BEGIN
SET @Columns=@Columns+(SELECT '['+name+'],' from #tableMatrix where rn = @i)
SET @i = @i+1;
END
---DELETE LAST comma
SET @Columns = LEFT(@Columns,LEN(@Columns)-1)
DECLARE @Sql nvarchar(max) = '
WITH cteRowsToDelte
AS (
SELECT ROW_NUMBER() OVER (PARTITION BY '+@Columns+' ORDER BY ( SELECT 0)) as rowNumber,* FROM '+@tableName
+')
DELETE FROM cteRowsToDelte
WHERE rowNumber > 1;
'
SET NOCOUNT OFF;
IF(@DebugMode = 1)
SELECT @Sql
ELSE
EXEC sp_executesql @Sql
END
END
如果你创建这样的表格
IF(OBJECT_ID('MyLitleTable') is not null)
DROP TABLE MyLitleTable
CREATE TABLE MyLitleTable
(
A Varchar(10),
B money,
C int
)
---------------------------------------------------------
INSERT INTO MyLitleTable VALUES
('ABC',100,1),
('ABC',100,1), -- only this row should be deleted
('ABC',101,1),
('ABC',100,2),
('ABCD',100,1)
-----------------------------------------------------------
exec sp_DeleteDuplicate 'MyLitleTable',0
它将从表中删除所有重复项。如果运行它时不带第二个参数,它将返回一条SQL语句来运行。
如果您需要排除任何列,只需在调试模式下运行它,获取代码并按照您的喜好修改它。