我有一个表,我想获得每组的最新条目。下面是表格:

DocumentStatusLogs表

|ID| DocumentID | Status | DateCreated |
| 2| 1          | S1     | 7/29/2011   |
| 3| 1          | S2     | 7/30/2011   |
| 6| 1          | S1     | 8/02/2011   |
| 1| 2          | S1     | 7/28/2011   |
| 4| 2          | S2     | 7/30/2011   |
| 5| 2          | S3     | 8/01/2011   |
| 6| 3          | S1     | 8/02/2011   |

该表将按documententid分组,并按DateCreated降序排序。对于每个documententid,我希望获得最新的状态。

我的首选输出:

| DocumentID | Status | DateCreated |
| 1          | S1     | 8/02/2011   |
| 2          | S3     | 8/01/2011   |
| 3          | S1     | 8/02/2011   |

Is there any aggregate function to get only the top from each group? See pseudo-code GetOnlyTheTop below: SELECT DocumentID, GetOnlyTheTop(Status), GetOnlyTheTop(DateCreated) FROM DocumentStatusLogs GROUP BY DocumentID ORDER BY DateCreated DESC If such function doesn't exist, is there any way I can achieve the output I want? Or at the first place, could this be caused by unnormalized database? I'm thinking, since what I'm looking for is just one row, should that status also be located in the parent table?

更多信息请参见父表:

当前文档表

| DocumentID | Title  | Content  | DateCreated |
| 1          | TitleA | ...      | ...         |
| 2          | TitleB | ...      | ...         |
| 3          | TitleC | ...      | ...         |

父表应该是这样的,以便我可以轻松地访问它的状态吗?

| DocumentID | Title  | Content  | DateCreated | CurrentStatus |
| 1          | TitleA | ...      | ...         | s1            |
| 2          | TitleB | ...      | ...         | s3            |
| 3          | TitleC | ...      | ...         | s1            |

更新 我刚刚学会了如何使用“apply”,它可以更容易地解决这类问题。


当前回答

我知道这是一个旧的线程,但TOP 1与TIES解决方案是相当不错的,可能有助于阅读一些解决方案。

select top 1 with ties
   DocumentID
  ,Status
  ,DateCreated
from DocumentStatusLogs
order by row_number() over (partition by DocumentID order by DateCreated desc)

select top 1 with ties子句告诉SQL Server要返回每个组的第一行。但是SQL Server如何知道如何对数据进行分组呢?这就是按row_number()的顺序除以(按documententid分区的顺序除以DateCreated desc的顺序。分区后的列定义了SQL Server如何对数据进行分组。在每个组中,行将根据列的顺序进行排序。排序之后,查询中将返回每个组中的第一行。

关于TOP子句的更多信息可以在这里找到。

其他回答

一些数据库引擎*开始支持允许过滤窗口函数结果的qualifier子句(接受的答案使用该子句)。

所以公认的答案可以变成

SELECT *, ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
QUALIFY rn = 1

查看这篇文章以获得更深入的解释:https://jrandrews.net/the-joy-of-qualify

您可以使用此工具查看哪个数据库支持此子句:https://www.jooq.org/translate/ 当目标方言不支持qualifier子句时,可以选择转换它。

*Teradata, BigQuery, H2, Snowflake…

I've done some timings over the various recommendations here, and the results really depend on the size of the table involved, but the most consistent solution is using the CROSS APPLY These tests were run against SQL Server 2008-R2, using a table with 6,500 records, and another (identical schema) with 137 million records. The columns being queried are part of the primary key on the table, and the table width is very small (about 30 bytes). The times are reported by SQL Server from the actual execution plan.

Query                                  Time for 6500 (ms)    Time for 137M(ms)

CROSS APPLY                                    17.9                17.9
SELECT WHERE col = (SELECT MAX(COL)…)           6.6               854.4
DENSE_RANK() OVER PARTITION                     6.6               907.1

我认为真正令人惊讶的是,无论涉及的行数是多少,CROSS APPLY的时间都是如此一致。

这是一个相当老的话题,但我想我要发表我的意见,就像公认的答案对我来说并不是特别有效一样。我在一个大型数据集上尝试了gbn的解决方案,发现它非常慢(在SQL Server 2012中500多万条记录上>45秒)。看一下执行计划,很明显问题在于它需要一个SORT操作,这会大大降低速度。

下面是我从实体框架中提取的一个替代方案,它不需要SORT操作,只进行非聚集索引搜索。这将在前面提到的记录集中将执行时间减少到< 2秒。

SELECT 
[Limit1].[DocumentID] AS [DocumentID], 
[Limit1].[Status] AS [Status], 
[Limit1].[DateCreated] AS [DateCreated]
FROM   (SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM [dbo].[DocumentStatusLogs] AS [Extent1]) AS [Distinct1]
OUTER APPLY  (SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
    FROM (SELECT 
        [Extent2].[ID] AS [ID], 
        [Extent2].[DocumentID] AS [DocumentID], 
        [Extent2].[Status] AS [Status], 
        [Extent2].[DateCreated] AS [DateCreated]
        FROM [dbo].[DocumentStatusLogs] AS [Extent2]
        WHERE ([Distinct1].[DocumentID] = [Extent2].[DocumentID])
    )  AS [Project2]
    ORDER BY [Project2].[ID] DESC) AS [Limit1]

Now I'm assuming something that isn't entirely specified in the original question, but if your table design is such that your ID column is an auto-increment ID, and the DateCreated is set to the current date with each insert, then even without running with my query above you could actually get a sizable performance boost to gbn's solution (about half the execution time) just from ordering on ID instead of ordering on DateCreated as this will provide an identical sort order and it's a faster sort.

;WITH cte AS
(
   SELECT *,
         ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
   FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

如果您希望每天有2个条目,那么这将任意选择一个。要获得一天的两个条目,请使用DENSE_RANK代替

至于是否正常化,这取决于你是否想:

在两个地方保持状态 保存状态历史 ...

目前,您保留了状态历史。如果你也想在父表中保持最新状态(这是非规范化的),你需要一个触发器来维持父表中的“状态”。或者删除这个状态历史表。

试试这个:

SELECT [DocumentID]
    ,[tmpRez].value('/x[2]', 'varchar(20)') AS [Status]
    ,[tmpRez].value('/x[3]', 'datetime') AS [DateCreated]
FROM (
    SELECT [DocumentID]
        ,cast('<x>' + max(cast([ID] AS VARCHAR(10)) + '</x><x>' + [Status] + '</x><x>' + cast([DateCreated] AS VARCHAR(20))) + '</x>' AS XML) AS [tmpRez]
    FROM DocumentStatusLogs
    GROUP BY DocumentID
    ) AS [tmpQry]