I have a table and I want to get the latest entry for each group. Here's the table:

DocumentStatusLogs table

|ID| DocumentID | Status | DateCreated |
| 2| 1          | S1     | 7/29/2011   |
| 3| 1          | S2     | 7/30/2011   |
| 6| 1          | S1     | 8/02/2011   |
| 1| 2          | S1     | 7/28/2011   |
| 4| 2          | S2     | 7/30/2011   |
| 5| 2          | S3     | 8/01/2011   |
| 6| 3          | S1     | 8/02/2011   |

The table would be grouped by DocumentID and sorted by DateCreated in descending order. For each DocumentID, I want to get the latest status.

My preferred output:

| DocumentID | Status | DateCreated |
| 1          | S1     | 8/02/2011   |
| 2          | S3     | 8/01/2011   |
| 3          | S1     | 8/02/2011   |

Is there any aggregate function to get only the top from each group? See the pseudo-code GetOnlyTheTop below:

SELECT
  DocumentID,
  GetOnlyTheTop(Status),
  GetOnlyTheTop(DateCreated)
FROM DocumentStatusLogs
GROUP BY DocumentID
ORDER BY DateCreated DESC

If such a function doesn't exist, is there any way I can achieve the output I want? Or, in the first place, could this be caused by an unnormalized database? Since what I'm looking for is just one row, should that status also be located in the parent table?

Please see the parent table for more information:

Current Documents table

| DocumentID | Title  | Content  | DateCreated |
| 1          | TitleA | ...      | ...         |
| 2          | TitleB | ...      | ...         |
| 3          | TitleC | ...      | ...         |

Should the parent table look like this, so that I can easily access its status?

| DocumentID | Title  | Content  | DateCreated | CurrentStatus |
| 1          | TitleA | ...      | ...         | S1            |
| 2          | TitleB | ...      | ...         | S3            |
| 3          | TitleC | ...      | ...         | S1            |

UPDATE: I just learned how to use "apply", which makes it easier to address this kind of problem.


Current answer

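This is one of the most easily found questions on the topic, so I wanted to give a modern answer to it (both for my own reference and to help others out). By using first_value and over you can make short work of the query above: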

Select distinct DocumentID
  , first_value(status) over (partition by DocumentID order by DateCreated Desc) as Status
  , first_value(DateCreated) over (partition by DocumentID order by DateCreated Desc) as DateCreated
From DocumentStatusLogs

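This should work in SQL Server 2012 and up (first_value is not available in SQL Server 2008). first_value can be thought of as a way to accomplish Select Top 1 when using an over clause. over allows grouping in the select list, so instead of writing nested subqueries (as many of the existing answers do), this does it in a more readable fashion. Hope this helps.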
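To make the role of distinct concrete, here is a minimal sketch that runs the same query shape against the question's sample rows. It uses Python's sqlite3 module with SQLite (3.25 or later, for window functions) standing in for SQL Server, which is an assumption on my part, as are the ISO-formatted dates and the ID 7 for the last row (the question lists ID 6 twice).

```python
import sqlite3

# Assumption: SQLite 3.25+ stands in for SQL Server in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs ("
    "ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (7, 3, "S1", "2011-08-02"),  # ID 7 is assumed; the question repeats ID 6
    ],
)

# first_value is computed for every input row, so each row of a partition
# carries the same (latest) values; DISTINCT then collapses the repeats.
QUERY = """
    SELECT {distinct} DocumentID,
           FIRST_VALUE(Status)
               OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS Status,
           FIRST_VALUE(DateCreated)
               OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS DateCreated
    FROM DocumentStatusLogs
    ORDER BY DocumentID
"""

with_distinct = conn.execute(QUERY.format(distinct="DISTINCT")).fetchall()
without_distinct = conn.execute(QUERY.format(distinct="")).fetchall()

print(with_distinct)          # one row per DocumentID
print(len(without_distinct))  # one row per log entry
```

Without distinct the query returns one row per log entry, each repeating its group's latest values, so distinct is doing real work here rather than being decoration.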

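Other answers

This is quite an old thread, but I thought I'd throw in my two cents, as the accepted answer didn't work particularly well for me. I tried gbn's solution on a large dataset and found it terribly slow (>45 seconds on more than 5 million records in SQL Server 2012). Looking at the execution plan, it's obvious that the issue is that it requires a SORT operation, which slows things down significantly.

Here's an alternative that I lifted from Entity Framework. It needs no SORT operation and does only a non-clustered index seek. This reduces the execution time to < 2 seconds on the aforementioned record set.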

SELECT 
[Limit1].[DocumentID] AS [DocumentID], 
[Limit1].[Status] AS [Status], 
[Limit1].[DateCreated] AS [DateCreated]
FROM   (SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM [dbo].[DocumentStatusLogs] AS [Extent1]) AS [Distinct1]
OUTER APPLY  (SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
    FROM (SELECT 
        [Extent2].[ID] AS [ID], 
        [Extent2].[DocumentID] AS [DocumentID], 
        [Extent2].[Status] AS [Status], 
        [Extent2].[DateCreated] AS [DateCreated]
        FROM [dbo].[DocumentStatusLogs] AS [Extent2]
        WHERE ([Distinct1].[DocumentID] = [Extent2].[DocumentID])
    )  AS [Project2]
    ORDER BY [Project2].[ID] DESC) AS [Limit1]

Now, I'm assuming something that isn't entirely specified in the original question: if your ID column is an auto-increment ID and DateCreated is set to the current date on each insert, then even without my query above you could get a sizable performance boost over gbn's solution (about half the execution time) just by ordering on ID instead of DateCreated. Under that assumption the two columns give an identical sort order, and sorting on ID is faster.
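As a rough illustration of that last point, the sketch below runs a ROW_NUMBER query of the kind the accepted answer uses, once ordered by DateCreated and once by ID, and compares the results on the question's sample rows. SQLite via Python's sqlite3 stands in for SQL Server, and the helper name latest_per_document, the ISO dates, and ID 7 for the last row are my own assumptions. It only shows that the results match when ID and DateCreated increase together; it says nothing about timings.

```python
import sqlite3

# Assumption: SQLite 3.25+ stands in for SQL Server in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs ("
    "ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (7, 3, "S1", "2011-08-02"),  # ID 7 is assumed; the question repeats ID 6
    ],
)


def latest_per_document(order_column):
    """Hypothetical helper: newest row per DocumentID, ranked by order_column."""
    # The column name is interpolated into the SQL, so only allow known names.
    if order_column not in ("ID", "DateCreated"):
        raise ValueError(f"unexpected column: {order_column}")
    sql = f"""
        WITH ranked AS (
            SELECT DocumentID, Status, DateCreated,
                   ROW_NUMBER() OVER (
                       PARTITION BY DocumentID ORDER BY {order_column} DESC
                   ) AS rn
            FROM DocumentStatusLogs
        )
        SELECT DocumentID, Status, DateCreated
        FROM ranked
        WHERE rn = 1
        ORDER BY DocumentID
    """
    return conn.execute(sql).fetchall()


by_date = latest_per_document("DateCreated")
by_id = latest_per_document("ID")
print(by_date == by_id)
```

If rows can be inserted with back-dated DateCreated values, the two orderings can disagree, so it is worth checking that assumption against your own data before switching.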

I just learned how to use cross apply. Here's how to use it in this scenario:

 select d.DocumentID, ds.Status, ds.DateCreated 
 from Documents as d 
 cross apply 
     (select top 1 Status, DateCreated
      from DocumentStatusLogs 
      where DocumentID = d.DocumentId
      order by DateCreated desc) as ds

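Some database engines* are starting to support the QUALIFY clause, which allows filtering on the result of window functions (which the accepted answer uses).

So the accepted answer can become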

SELECT *, ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
QUALIFY rn = 1

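See this article for a more in-depth explanation: https://jrandrews.net/the-joy-of-qualify

You can use this tool to see which databases support this clause: https://www.jooq.org/translate/ There is an option to transform the QUALIFY clause when the target dialect does not support it.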

*Teradata, BigQuery, H2, Snowflake…
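For engines without QUALIFY, the usual fallback is to compute the row number in a derived table and filter it in the outer query. Here is a minimal sketch of that rewrite, run against SQLite through Python's sqlite3 (SQLite, as far as I know, has no QUALIFY clause, so it makes a convenient stand-in). The ISO dates and ID 7 for the last row are assumptions of mine.

```python
import sqlite3

# Assumption: SQLite 3.25+ is used as an example of an engine without QUALIFY.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs ("
    "ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (7, 3, "S1", "2011-08-02"),  # ID 7 is assumed; the question repeats ID 6
    ],
)

# The window function runs in the inner query; the outer WHERE plays the
# role that QUALIFY rn = 1 would play in a dialect that supports it.
sql = """
    SELECT ID, DocumentID, Status, DateCreated
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY DocumentID ORDER BY DateCreated DESC
               ) AS rn
        FROM DocumentStatusLogs
    ) AS ranked
    WHERE rn = 1
    ORDER BY DocumentID
"""
latest = conn.execute(sql).fetchall()
print(latest)
```

The extra nesting is needed because a window function's result cannot be referenced in the WHERE clause of the same query level.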

SELECT * FROM
DocumentStatusLogs JOIN (
  SELECT DocumentID, MAX(DateCreated) DateCreated
  FROM DocumentStatusLogs
  GROUP BY DocumentID
  ) max_date USING (DocumentID, DateCreated)

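What database server? This code doesn't work on all of them.

Regarding the second half of your question, it seems reasonable to me to include the status as a column. You can leave DocumentStatusLogs as a log, but still store the latest status in the main table.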
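One way this could look in practice is to write the log row and the parent's status in the same transaction. The sketch below is only an illustration under my own assumptions: SQLite via Python's sqlite3 instead of your actual server, a hypothetical helper named log_status, a trimmed-down Documents table, and status changes that arrive in chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Documents (
        DocumentID INTEGER PRIMARY KEY,
        Title TEXT,
        CurrentStatus TEXT
    );
    CREATE TABLE DocumentStatusLogs (
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        DocumentID INTEGER REFERENCES Documents(DocumentID),
        Status TEXT,
        DateCreated TEXT
    );
    INSERT INTO Documents (DocumentID, Title) VALUES (1, 'TitleA'), (2, 'TitleB');
    """
)


def log_status(document_id, status, date_created):
    """Hypothetical helper: append to the log and refresh CurrentStatus together."""
    # "with conn" commits both statements, or rolls both back if one fails,
    # so the parent row cannot drift away from the log.
    with conn:
        conn.execute(
            "INSERT INTO DocumentStatusLogs (DocumentID, Status, DateCreated) "
            "VALUES (?, ?, ?)",
            (document_id, status, date_created),
        )
        conn.execute(
            "UPDATE Documents SET CurrentStatus = ? WHERE DocumentID = ?",
            (status, document_id),
        )


# Assumes entries are logged in chronological order.
log_status(1, "S1", "2011-07-29")
log_status(1, "S2", "2011-07-30")
log_status(2, "S1", "2011-07-28")

current = conn.execute(
    "SELECT DocumentID, CurrentStatus FROM Documents ORDER BY DocumentID"
).fetchall()
log_count = conn.execute("SELECT COUNT(*) FROM DocumentStatusLogs").fetchone()[0]
print(current, log_count)
```

The trade-off is the usual one for denormalized columns: reads become a plain lookup, but every writer has to go through the same code path (or a trigger) to keep the two tables in agreement.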

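BTW, if you already have the DateCreated column in the Documents table, you can just join DocumentStatusLogs using that (as long as DateCreated is unique in DocumentStatusLogs).

Edit: MsSQL does not support USING, so change it to: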

ON DocumentStatusLogs.DocumentID = max_date.DocumentID AND DocumentStatusLogs.DateCreated = max_date.DateCreated
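Putting the ON version together, here is a small sketch of the complete query, run on the question's sample rows with SQLite via Python's sqlite3 standing in for the real server (an assumption, as are the table alias l, the ISO dates, and IDs 7 and 8). It also shows one thing to watch for with this approach: if two rows of the same document share the maximum DateCreated, both come back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs ("
    "ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (7, 3, "S1", "2011-08-02"),  # ID 7 is assumed; the question repeats ID 6
    ],
)

# Join each log row to its document's maximum DateCreated, using ON
# instead of USING so the same text also suits servers without USING.
sql = """
    SELECT l.DocumentID, l.Status, l.DateCreated
    FROM DocumentStatusLogs AS l
    JOIN (
        SELECT DocumentID, MAX(DateCreated) AS DateCreated
        FROM DocumentStatusLogs
        GROUP BY DocumentID
    ) AS max_date
      ON l.DocumentID = max_date.DocumentID
     AND l.DateCreated = max_date.DateCreated
    ORDER BY l.DocumentID, l.ID
"""
before_tie = conn.execute(sql).fetchall()

# Hypothetical extra row: a second status for document 3 on the same date.
conn.execute("INSERT INTO DocumentStatusLogs VALUES (8, 3, 'S2', '2011-08-02')")
after_tie = conn.execute(sql).fetchall()

print(len(before_tie), len(after_tie))
```

If ties like this are possible in your data, a ROW_NUMBER or TOP 1 style query with a tie-breaking column (such as ID) in the ORDER BY avoids the duplicate rows.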