选择每个GROUP BY组中的第一行？

正如标题所示，我想选择用GROUP BY分组的每组行中的第一行。

具体来说，如果我有一个如下所示的采购表：

SELECT * FROM purchases;

我的输出：

id	customer	total
1	Joe	5
2	Sally	3
3	Joe	2
4	Sally	1

我想查询每个客户的最大购买量（总购买量）。类似于：

SELECT FIRST(id), customer, FIRST(total)
FROM  purchases
GROUP BY customer
ORDER BY total DESC;

预期输出：

FIRST(id)	customer	FIRST(total)
1	Joe	5
2	Sally	3

当前回答

非常快速的解决方案

SELECT a.* 
FROM
    purchases a 
    JOIN ( 
        SELECT customer, min( id ) as id 
        FROM purchases 
        GROUP BY customer 
    ) b USING ( id );

如果表是按id索引的，则速度非常快：

create index purchases_id on purchases (id);

2014-04-08 16:13:33

其他回答

非常快速的解决方案

SELECT a.* 
FROM
    purchases a 
    JOIN ( 
        SELECT customer, min( id ) as id 
        FROM purchases 
        GROUP BY customer 
    ) b USING ( id );

如果表是按id索引的，则速度非常快：

create index purchases_id on purchases (id);

2014-04-08 16:13:33

这是一个常见的最大的每组问题，它已经有了经过充分测试和高度优化的解决方案。就我个人而言，我更喜欢比尔·卡尔温（Bill Karwin）的左联解决方案（原始帖子中有很多其他解决方案）。

请注意，MySQL手册中可以意外地找到解决这个常见问题的一系列解决方案——尽管您的问题是在Postgres中，而不是MySQL中，但给出的解决方案应该适用于大多数SQL变体。请参见常见查询的示例：：保持某一列的组最大值的行。

2013-06-27 08:38:44

查询：

SELECT purchases.*
FROM purchases
LEFT JOIN purchases as p 
ON 
  p.customer = purchases.customer 
  AND 
  purchases.total < p.total
WHERE p.total IS NULL

这是怎么回事！（我去过那里）

我们希望确保每次购买的总金额最高。

一些理论知识（如果您只想了解查询，请跳过此部分）

让Total是一个函数T（customer，id），其中它返回一个给定名称和id的值为了证明给定的总数（T（customer，id））是最高的，我们必须证明我们想证明

∀x T（customer，id）>T（customer，x）（这个总数高于所有其他该客户的总计）

∃x T（customer，id）<T（customers，x）（不存在更高的总数该客户）

第一种方法需要我们获取我不太喜欢的名字的所有记录。

第二个将需要一个聪明的方式来表示，没有比这个更高的记录了。

返回SQL

如果我们在表的名称和总数小于连接表的情况下留下连接表：

LEFT JOIN purchases as p 
ON 
p.customer = purchases.customer 
AND 
purchases.total < p.total

我们确保要加入的同一用户的另一条记录总数较高的所有记录：

+--------------+---------------------+-----------------+------+------------+---------+
| purchases.id |  purchases.customer | purchases.total | p.id | p.customer | p.total |
+--------------+---------------------+-----------------+------+------------+---------+
|            1 | Tom                 |             200 |    2 | Tom        |     300 |
|            2 | Tom                 |             300 |      |            |         |
|            3 | Bob                 |             400 |    4 | Bob        |     500 |
|            4 | Bob                 |             500 |      |            |         |
|            5 | Alice               |             600 |    6 | Alice      |     700 |
|            6 | Alice               |             700 |      |            |         |
+--------------+---------------------+-----------------+------+------------+---------+

这将有助于我们在不需要分组的情况下筛选每次购买的最高总额：

WHERE p.total IS NULL
    
+--------------+----------------+-----------------+------+--------+---------+
| purchases.id | purchases.name | purchases.total | p.id | p.name | p.total |
+--------------+----------------+-----------------+------+--------+---------+
|            2 | Tom            |             300 |      |        |         |
|            4 | Bob            |             500 |      |        |         |
|            6 | Alice          |             700 |      |        |         |
+--------------+----------------+-----------------+------+--------+---------+

这就是我们需要的答案。

2018-03-24 16:11:27

基准

我测试了最有趣的候选人：

最初使用Postgres 9.4和9.5。后来为Postgres 13添加了重音测试。

基本测试设置

主表：采购：

CREATE TABLE purchases (
  id          serial  -- PK constraint added below
, customer_id int     -- REFERENCES customer
, total       int     -- could be amount of money in Cent
, some_column text    -- to make the row bigger, more realistic
);

虚拟数据（带有一些死元组），PK，索引：

INSERT INTO purchases (customer_id, total, some_column)    -- 200k rows
SELECT (random() * 10000)::int             AS customer_id  -- 10k distinct customers
     , (random() * random() * 100000)::int AS total     
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,200000) g;

ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id);

DELETE FROM purchases WHERE random() > 0.9;  -- some dead rows

INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total     
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,20000) g;  -- add 20k to make it ~ 200k

CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id);

VACUUM ANALYZE purchases;

客户表-用于优化查询：

CREATE TABLE customer AS
SELECT customer_id, 'customer_' || customer_id AS customer
FROM   purchases
GROUP  BY 1
ORDER  BY 1;

ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id);

VACUUM ANALYZE customer;

在9.5的第二次测试中，我使用了相同的设置，但使用了100000个不同的customer_id来获得每个customer_id的几行。

表购买的对象大小

基本设置：购买200k行，10k个不同的customer_id，平均每个客户20行。对于Postgres9.5，我添加了第二个测试，共有86446个不同的客户-平均每个客户2.3行。

使用此处的查询生成：

测量PostgreSQL表行的大小

为Postgres 9.5收集：

               what                | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+----------+--------------+---------------
 core_relation_size                | 20496384 | 20 MB        |           102
 visibility_map                    |        0 | 0 bytes      |             0
 free_space_map                    |    24576 | 24 kB        |             0
 table_size_incl_toast             | 20529152 | 20 MB        |           102
 indexes_size                      | 10977280 | 10 MB        |            54
 total_size_incl_toast_and_indexes | 31506432 | 30 MB        |           157
 live_rows_in_text_representation  | 13729802 | 13 MB        |            68
 ------------------------------    |          |              |
 row_count                         |   200045 |              |
 live_tuples                       |   200045 |              |
 dead_tuples                       |    19955 |              |

查询

1.CTE中的row_number（），（参见其他答案）

WITH cte AS (
   SELECT id, customer_id, total
        , row_number() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   )
SELECT id, customer_id, total
FROM   cte
WHERE  rn = 1;

2.子查询中的row_number（）（我的优化）

SELECT id, customer_id, total
FROM   (
   SELECT id, customer_id, total
        , row_number() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   ) sub
WHERE  rn = 1;

3.DISTINCT ON（请参阅其他答案）

SELECT DISTINCT ON (customer_id)
       id, customer_id, total
FROM   purchases
ORDER  BY customer_id, total DESC, id;

4.带有LATERAL子查询的rCTE（参见此处）

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, customer_id, total
   FROM   purchases
   ORDER  BY customer_id, total DESC
   LIMIT  1
   )
   UNION ALL
   SELECT u.*
   FROM   cte c
   ,      LATERAL (
      SELECT id, customer_id, total
      FROM   purchases
      WHERE  customer_id > c.customer_id  -- lateral reference
      ORDER  BY customer_id, total DESC
      LIMIT  1
      ) u
   )
SELECT id, customer_id, total
FROM   cte
ORDER  BY customer_id;

5.带LATERAL的客户表（见此处）

SELECT l.*
FROM   customer c
,      LATERAL (
   SELECT id, customer_id, total
   FROM   purchases
   WHERE  customer_id = c.customer_id  -- lateral reference
   ORDER  BY total DESC
   LIMIT  1
   ) l;

6.带有ORDER BY的array_agg（）（请参见其他答案）

SELECT (array_agg(id ORDER BY total DESC))[1] AS id
     , customer_id
     , max(total) AS total
FROM   purchases
GROUP  BY customer_id;

后果

使用EXPLAIN（ANALYZE、TIMING OFF、COSTS OFF）执行上述查询的执行时间，与热缓存相比，这是5次运行中最好的一次。

所有查询都在purchases2_3c_idx上使用了“仅索引扫描”（以及其他步骤）。有些人只是为了从较小的指数规模中获益，其他人则更有效。

A.Postgres 9.4，200k行，每个customer_id约20行

1. 273.274 ms  
2. 194.572 ms  
3. 111.067 ms  
4.  92.922 ms  -- !
5.  37.679 ms  -- winner
6. 189.495 ms

B.与A.相同，Postgres 9.5

1. 288.006 ms
2. 223.032 ms  
3. 107.074 ms  
4.  78.032 ms  -- !
5.  33.944 ms  -- winner
6. 211.540 ms

C.与B.相同，但每个customer_id有约2.3行

1. 381.573 ms
2. 311.976 ms
3. 124.074 ms  -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms

2021-08-11年与Postgres 13重新测试

简化的测试设置：没有删除的行，因为VACUUM ANALYZE完全清除了简单情况下的表。

Postgres的重要变化：

一般性能改进。CTE可以从Postgres 12开始内联，因此查询1。和2。现在执行基本相同的（相同的查询计划）。

D.类似B.每个customer_id约20行

1. 103 ms
2. 103 ms  
3.  23 ms  -- winner  
4.  71 ms  
5.  22 ms  -- winner
6.  81 ms

db<>fiddle在这里

E.类似C.每个customer_id约2.3行

1. 127 ms
2. 126 ms  
3.  36 ms  -- winner  
4. 620 ms  
5. 145 ms
6. 203 ms

db<>fiddle在这里

参加Postgres 13考试

每个客户有1百万行，10.000对100对1.6行。

F.每个客户约10.000行

1. 526 ms
2. 527 ms  
3. 127 ms
4.   2 ms  -- winner !
5.   1 ms  -- winner !
6. 356 ms

db<>fiddle在这里

G.每个客户约100行

1. 535 ms
2. 529 ms  
3. 132 ms
4. 108 ms  -- !
5.  71 ms  -- winner
6. 376 ms

db<>fiddle在这里

H.每个客户约1.6行

1.  691 ms
2.  684 ms  
3.  234 ms  -- winner
4. 4669 ms
5. 1089 ms
6. 1264 ms

db<>fiddle在这里

结论

DISTINCT ON有效地使用索引，并且通常对每个组的几行执行最佳。即使每个组有很多行，它的性能也很好。对于每个组的许多行，使用rCTE模拟索引跳过扫描的性能最好，仅次于使用单独查找表（如果可用）的查询技术。在当前接受的答案中演示的row_number（）技术从未赢得任何性能测试。那时不行，现在不行。它从未接近DISTINCT ON，即使数据分布对后者不利。row_number（）唯一的优点是：它的扩展性不是很好，只是一般。

更多基准

在Postgres11.5上通过“ogr”进行基准测试，拥有1000万行和60万个独特的“客户”。结果与我们目前所看到的一致：

访问每个标识符的最新行的正确方法？

2011年原始（过时）基准

我用PostgreSQL 9.1在一个真实的表上运行了三次测试，该表包含65579行，三列中的每一列都有单列btree索引，并用了5次运行的最佳执行时间。将@OMGPonies的第一个查询（A）与上述DISTINCT ON解决方案（B）进行比较：

选择整个表，在本例中会产生5958行。

A: 567.218 ms
B: 386.673 ms

使用条件WHERE customer BETWEEN x AND y，得到1000行。

A: 249.136 ms
B:  55.111 ms

选择WHERE客户=x的单个客户。

A:   0.143 ms
B:   0.072 ms

用另一个答案中描述的索引重复相同的测试：

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

1A: 277.953 ms  
1B: 193.547 ms

2A: 249.796 ms -- special index not used  
2B:  28.679 ms

3A:   0.120 ms  
3B:   0.048 ms

2016-01-11 06:05:43

在PostgreSQL中，另一种可能是将first_value窗口函数与SELECT DISTINCT结合使用：

select distinct customer_id,
                first_value(row(id, total)) over(partition by customer_id order by total desc, id)
from            purchases;

我创建了一个组合（id，total），因此两个值都由同一个聚合返回。当然，您可以始终应用first_value（）两次。

2019-12-09 12:49:08

选择每个GROUP BY组中的第一行？

推荐文章

最新文章

标签