用MySQL计算中位数最简单(希望不会太慢)的方法是什么?我已经使用AVG(x)来寻找平均值,但我很难找到一个简单的方法来计算中位数。现在,我将所有的行返回到PHP,进行排序,然后选择中间的行,但是肯定有一些简单的方法可以在一个MySQL查询中完成它。

示例数据:

id | val
--------
 1    4
 2    7
 3    2
 4    2
 5    9
 6    8
 7    3

对val排序得到2 2 3 4 7 8 9,因此中位数应该是4,而SELECT AVG(val) == 5。


当前回答

如果这是MySQL,现在有窗口函数,你可以这样做(假设你想四舍五入到最接近的整数-否则只需将round替换为CEIL或FLOOR或其他什么)。下面的解决方案适用于表,无论表的行数是偶数还是奇数:


WITH CTE AS (
    SELECT val,
            ROW_NUMBER() OVER (ORDER BY val ASC) AS rn,
            COUNT(*) OVER () AS total_count
    FROM data
)
SELECT ROUND(AVG(val)) AS median
FROM CTE
WHERE
    rn BETWEEN
    total_count / 2.0 AND
    total_count / 2.0 + 1;

I think some of the more recent answers on this thread were already getting at this approach, but it also seemed like people were overthinking it, so consider this an improved version. Regardless of SQL flavor, there is no reason anyone should be writing a huge paragraph of code with multiple subqueries just to get the median in 2021. However, please note that the above query only works if you're asked to find the median for a continuous series. Of course, regardless of row number, sometimes people do make a distinction between what is referred to as the Discrete Median and what is referred to as the Interpolated Median for a continuous series.

如果你被要求为一个离散级数找到中位数,而表的行数是偶数,那么上面的解决方案就不适合你,你应该恢复使用其他解决方案之一,比如TheJacobTaylor的。

下面的第二个解决方案是对TheJacobTaylor的稍微修改的版本,其中我显式地声明了CROSS JOIN。这个方法也适用于行数为奇数的表,不管你是被要求求连续序列的中位数还是离散序列的中位数,但我特别会在被要求求离散序列的中位数时使用这个方法。否则,使用第一种解决方案。这样,您就永远不必考虑数据是包含“偶数”还是“奇数”个数的数据点。


SELECT x.val AS median
FROM data x
CROSS JOIN data y
GROUP BY x.val
HAVING SUM(SIGN(1 - SIGN(y.val - x.val))) = (COUNT(*) + 1) / 2;

最后,你可以在PostgreSQL中使用内置函数轻松做到这一点。这里有一个很好的解释,以及关于离散中位数和插值中位数的有效总结。

https://leafo.net/guides/postgresql-calculating-percentile.html#calculating-the-median

其他回答

这是我的办法。当然,你可以把它放到一个过程中:-)

SET @median_counter = (SELECT FLOOR(COUNT(*)/2) - 1 AS `median_counter` FROM `data`);

SET @median = CONCAT('SELECT `val` FROM `data` ORDER BY `val` LIMIT ', @median_counter, ', 1');

PREPARE median FROM @median;

EXECUTE median;

你可以避免变量@median_counter,如果你替换它:

SET @median = CONCAT( 'SELECT `val` FROM `data` ORDER BY `val` LIMIT ',
                      (SELECT FLOOR(COUNT(*)/2) - 1 AS `median_counter` FROM `data`),
                      ', 1'
                    );

PREPARE median FROM @median;

EXECUTE median;

MariaDB / MySQL:

SELECT AVG(dd.val) as median_val
FROM (
SELECT d.val, @rownum:=@rownum+1 as `row_number`, @total_rows:=@rownum
  FROM data d, (SELECT @rownum:=0) r
  WHERE d.val is NOT NULL
  -- put some where clause here
  ORDER BY d.val
) as dd
WHERE dd.row_number IN ( FLOOR((@total_rows+1)/2), FLOOR((@total_rows+2)/2) );

Steve Cohen指出,在第一次传递之后,@rownum将包含总行数。这可用于确定中值,因此不需要第二次传递或连接。

此外,AVG(dd.val)和dd.row_number IN(…)用于在有偶数条记录时正确地产生中位数。推理:

SELECT FLOOR((3+1)/2),FLOOR((3+2)/2); -- when total_rows is 3, avg rows 2 and 2
SELECT FLOOR((4+1)/2),FLOOR((4+2)/2); -- when total_rows is 4, avg rows 2 and 3

最后,MariaDB 10.3.3+包含一个MEDIAN函数

MySQL文档中这一页的注释有以下建议:

-- (mostly) High Performance scaling MEDIAN function per group
-- Median defined in http://en.wikipedia.org/wiki/Median
--
-- by Peter Hlavac
-- 06.11.2008
--
-- Example Table:

DROP table if exists table_median;
CREATE TABLE table_median (id INTEGER(11),val INTEGER(11));
COMMIT;


INSERT INTO table_median (id, val) VALUES
(1, 7), (1, 4), (1, 5), (1, 1), (1, 8), (1, 3), (1, 6),
(2, 4),
(3, 5), (3, 2),
(4, 5), (4, 12), (4, 1), (4, 7);



-- Calculating the MEDIAN
SELECT @a := 0;
SELECT
id,
AVG(val) AS MEDIAN
FROM (
SELECT
id,
val
FROM (
SELECT
-- Create an index n for every id
@a := (@a + 1) mod o.c AS shifted_n,
IF(@a mod o.c=0, o.c, @a) AS n,
o.id,
o.val,
-- the number of elements for every id
o.c
FROM (
SELECT
t_o.id,
val,
c
FROM
table_median t_o INNER JOIN
(SELECT
id,
COUNT(1) AS c
FROM
table_median
GROUP BY
id
) t2
ON (t2.id = t_o.id)
ORDER BY
t_o.id,val
) o
) a
WHERE
IF(
-- if there is an even number of elements
-- take the lower and the upper median
-- and use AVG(lower,upper)
c MOD 2 = 0,
n = c DIV 2 OR n = (c DIV 2)+1,

-- if its an odd number of elements
-- take the first if its only one element
-- or take the one in the middle
IF(
c = 1,
n = 1,
n = c DIV 2 + 1
)
)
) a
GROUP BY
id;

-- Explanation:
-- The Statement creates a helper table like
--
-- n id val count
-- ----------------
-- 1, 1, 1, 7
-- 2, 1, 3, 7
-- 3, 1, 4, 7
-- 4, 1, 5, 7
-- 5, 1, 6, 7
-- 6, 1, 7, 7
-- 7, 1, 8, 7
--
-- 1, 2, 4, 1

-- 1, 3, 2, 2
-- 2, 3, 5, 2
--
-- 1, 4, 1, 4
-- 2, 4, 5, 4
-- 3, 4, 7, 4
-- 4, 4, 12, 4


-- from there we can select the n-th element on the position: count div 2 + 1 
set @r = 0;

select  
    case when mod(c,2)=0 then round(sum(lat_N),4)
    else round(sum(lat_N)/2,4) 
    end as Med  
from 
    (select lat_N, @r := @r+1, @r as id from station order by lat_N) A
    cross join
    (select (count(1)+1)/2 as c from station) B
where id >= floor(c) and id <=ceil(c)

MySQL从8.0版本开始支持窗口函数,您可以使用ROW_NUMBER或DENSE_RANK(不要使用RANK,因为它将相同的RANK分配给相同的值,就像在体育排名):

SELECT AVG(t1.val) AS median_val
  FROM (SELECT val, 
               ROW_NUMBER() OVER(ORDER BY val) AS rownum
          FROM data) t1,
       (SELECT COUNT(*) AS num_records FROM data) t2
 WHERE t1.row_num IN
       (FLOOR((t2.num_records + 1) / 2), 
        FLOOR((t2.num_records + 2) / 2));