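What is the simplest (and hopefully not too slow) way to calculate the median with MySQL? I already use AVG(x) to find the mean, but I am having trouble finding a simple way to calculate the median. For now, I return all the rows to PHP, sort them, and pick the middle row, but surely there must be some simple way to do it in a single MySQL query.

Sample data: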

id | val
--------
 1    4
 2    7
 3    2
 4    2
 5    9
 6    8
 7    3

Sorting on val gives 2 2 3 4 7 8 9, so the median should be 4, whereas SELECT AVG(val) == 5.


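Current answer

Unfortunately, neither TheJacobTaylor's nor velcrow's answer returns accurate results for current versions of MySQL.

Velcrow's answer above is close, but it does not compute the correct result for result sets with an even number of rows. The median is defined as 1) the middle number of an odd-sized set, or 2) the average of the two middle numbers of an even-sized set.

So here is velcrow's solution, patched to handle both odd- and even-sized sets: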

SELECT AVG(middle_values) AS 'median' FROM (
  SELECT t1.median_column AS 'middle_values' FROM
    (
      SELECT @row:=@row+1 as `row`, x.median_column
      FROM median_table AS x, (SELECT @row:=0) AS r
      WHERE 1
      -- put some where clause here
      ORDER BY x.median_column
    ) AS t1,
    (
      SELECT COUNT(*) as 'count'
      FROM median_table x
      WHERE 1
      -- put same where clause here
    ) AS t2
    -- the following condition will return 1 record for odd number sets, or 2 records for even number sets.
    WHERE t1.row >= t2.count/2 and t1.row <= ((t2.count/2) +1)) AS t3;

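To use it, follow these 3 simple steps:

1) Replace "median_table" (2 occurrences) in the code above with your table name.
2) Replace "median_column" (3 occurrences) with the name of the column whose median you want.
3) If you have a WHERE condition, replace "WHERE 1" (2 occurrences) with it.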
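
The row filter in the final WHERE clause can be checked outside the database. The following is a minimal Python sketch, not part of the original answer. It numbers the sorted values from 1 (as @row does), keeps the rows with count/2 <= row <= count/2 + 1, and averages them. The function name median_via_row_filter is made up for illustration, and Python is used only so the check can run without a MySQL server.

```python
from statistics import median


def median_via_row_filter(values):
    """Mimic the query: number the sorted rows from 1, keep the middle one or two, average them."""
    ordered = sorted(values)
    count = len(ordered)
    if count == 0:
        return None  # AVG over zero rows is NULL in SQL
    middle = [
        value
        for row, value in enumerate(ordered, start=1)
        if count / 2 <= row <= count / 2 + 1  # same bounds as the WHERE clause
    ]
    return sum(middle) / len(middle)


odd_sample = [4, 7, 2, 2, 9, 8, 3]   # the val column from the question
even_sample = [4, 7, 2, 2, 9, 8]     # same data with one row removed

print(median_via_row_filter(odd_sample), median(odd_sample))    # 4.0 4
print(median_via_row_filter(even_sample), median(even_sample))  # 5.5 5.5
```

With 7 rows the bounds are 3.5 and 4.5, so only row 4 survives. With 6 rows the bounds are 3 and 4, so rows 3 and 4 are averaged.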

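Other answers

These approaches select from the same table twice. If the source data comes from an expensive query, here is a way to avoid running it twice: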

select KEY_FIELD, AVG(VALUE_FIELD) MEDIAN_VALUE
from (
    select KEY_FIELD, VALUE_FIELD, RANKF
    , @rownumr := IF(@prevrowidr=KEY_FIELD,@rownumr+1,1) RANKR
    , @prevrowidr := KEY_FIELD
    FROM (
        SELECT KEY_FIELD, VALUE_FIELD, RANKF
        FROM (
            SELECT KEY_FIELD, VALUE_FIELD 
            , @rownumf := IF(@prevrowidf=KEY_FIELD,@rownumf+1,1) RANKF
            , @prevrowidf := KEY_FIELD     
            FROM (
                SELECT KEY_FIELD, VALUE_FIELD 
                FROM (
                    -- some expensive query
                )   B
                ORDER BY  KEY_FIELD, VALUE_FIELD
            ) C
            , (SELECT @rownumf := 1) t_rownum
            , (SELECT @prevrowidf := '*') t_previd
        ) D
        ORDER BY  KEY_FIELD, RANKF DESC
    ) E
    , (SELECT @rownumr := 1) t_rownum
    , (SELECT @prevrowidr := '*') t_previd
) F
WHERE RANKF-RANKR BETWEEN -1 and 1
GROUP BY KEY_FIELD
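
The idea behind RANKF and RANKR is easier to see outside SQL. Below is a small Python sketch of the same logic, offered as an illustration rather than a translation of the query. Each value gets a rank counted from the smallest value and a rank counted from the largest, and only values whose two ranks differ by at most 1 are averaged. The function name medians_by_key and the sample rows are assumptions made for this example.

```python
from collections import defaultdict


def medians_by_key(rows):
    """rows: iterable of (key, value) pairs. Returns {key: median} via forward/reverse ranks."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)

    result = {}
    for key, values in groups.items():
        ordered = sorted(values)
        n = len(ordered)
        middle = []
        for i, value in enumerate(ordered):
            rank_f = i + 1  # rank counted from the smallest value (RANKF)
            rank_r = n - i  # rank counted from the largest value (RANKR)
            if -1 <= rank_f - rank_r <= 1:  # WHERE RANKF-RANKR BETWEEN -1 and 1
                middle.append(value)
        result[key] = sum(middle) / len(middle)  # AVG(VALUE_FIELD) per key
    return result


rows = [
    ("book", 4), ("book", 7), ("book", 2), ("book", 2),
    ("book", 9), ("book", 8), ("book", 3),
    ("note", 11),
    ("bike", 22), ("bike", 26),
]
print(medians_by_key(rows))  # {'book': 4.0, 'note': 11.0, 'bike': 24.0}
```

For an odd-sized group only the middle value has equal ranks. For an even-sized group the two middle values have ranks that differ by exactly 1 in opposite directions.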

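I found the following code on HackerRank. It is simple and handles the typical case of an odd number of rows. It returns a row only when some value has exactly as many values below it as above it, so an even number of distinct values yields no result.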

SELECT M.MEDIAN_COL FROM MEDIAN_TABLE M WHERE  
  (SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL < M.MEDIAN_COL ) = 
  (SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL > M.MEDIAN_COL );
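
A quick way to probe which inputs this query handles is to run it against a throwaway table. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL. That is an assumption of convenience: the query uses only standard correlated subqueries, so the two engines should agree on it. The helper name run_query is made up for this example.

```python
import sqlite3

QUERY = """
SELECT M.MEDIAN_COL FROM MEDIAN_TABLE M WHERE
  (SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL < M.MEDIAN_COL) =
  (SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL > M.MEDIAN_COL)
"""


def run_query(values):
    """Load values into an in-memory table and return every row the query selects."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE MEDIAN_TABLE (MEDIAN_COL INTEGER)")
        conn.executemany(
            "INSERT INTO MEDIAN_TABLE VALUES (?)", [(v,) for v in values]
        )
        return [row[0] for row in conn.execute(QUERY)]
    finally:
        conn.close()


print(run_query([4, 7, 2, 2, 9, 8, 3]))  # [4]: three values below 4, three above
print(run_query([1, 2, 3, 4]))           # []: no value splits an even set evenly
```

On the question's sample data the query finds the median. On four distinct values it returns no rows, because no single value has equally many values on each side.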

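From: http://mdb-blog.blogspot.com/2015/06/mysql-find-median-nth-element-without.html

I suggest another approach that uses no joins but works with strings instead.

I haven't tested it on large tables, but it works fine for small and medium-sized tables.

The advantage here is that it also works with GROUP BY, so it can return the median for several groups at once.

Here is the test code for a test table: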

DROP TABLE IF EXISTS test.test_median;
CREATE TABLE test.test_median AS
SELECT 'book' AS grp, 4 AS val UNION ALL
SELECT 'book', 7 UNION ALL
SELECT 'book', 2 UNION ALL
SELECT 'book', 2 UNION ALL
SELECT 'book', 9 UNION ALL
SELECT 'book', 8 UNION ALL
SELECT 'book', 3 UNION ALL

SELECT 'note', 11 UNION ALL

SELECT 'bike', 22 UNION ALL
SELECT 'bike', 26;

Code to find the median of each group:

SELECT grp,
         SUBSTRING_INDEX( SUBSTRING_INDEX( GROUP_CONCAT(val ORDER BY val), ',', COUNT(*)/2 ), ',', -1) as the_median,
         GROUP_CONCAT(val ORDER BY val) as all_vals_for_debug
FROM test.test_median
GROUP BY grp

Output:

grp | the_median| all_vals_for_debug
bike| 22        | 22,26
book| 4         | 2,2,3,4,7,8,9
note| 11        | 11
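
To see why bike comes out as 22 rather than 24, it helps to replay the string manipulation step by step. The Python sketch below is an illustration under one stated assumption: COUNT(*)/2 is rounded half up when it is used as the SUBSTRING_INDEX count, which is consistent with the output shown above (7 values pick the 4th, 1 value picks the 1st). The function name median_via_joined_string is made up for this example.

```python
def median_via_joined_string(values):
    """Replay the GROUP_CONCAT / SUBSTRING_INDEX trick on one group of values."""
    ordered = sorted(values)
    joined = ",".join(str(v) for v in ordered)     # GROUP_CONCAT(val ORDER BY val)
    position = (len(ordered) + 1) // 2             # COUNT(*)/2 rounded half up (assumed)
    head = ",".join(joined.split(",")[:position])  # SUBSTRING_INDEX(joined, ',', position)
    return head.split(",")[-1]                     # SUBSTRING_INDEX(head, ',', -1)


groups = {
    "bike": [22, 26],
    "book": [4, 7, 2, 2, 9, 8, 3],
    "note": [11],
}
for grp, vals in groups.items():
    print(grp, median_via_joined_string(vals))
# bike 22
# book 4
# note 11
```

The result is a string, and for a group with an even number of values it is the lower of the two middle values, not their average.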

This approach seems to cover both even and odd counts, without a subquery.

SELECT AVG(t1.x)
FROM table t1, table t2
GROUP BY t1.x
HAVING SUM(SIGN(t1.x - t2.x)) = 0
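
The HAVING condition can be replayed in a few lines of Python to see which values it keeps. This is a sketch for illustration only, and the function names sign and medians_via_sign_join are made up here. For each distinct x it sums SIGN(t1.x - t2.x) over every pairing of a row equal to x with every row of the table, and keeps x when the sum is 0.

```python
def sign(n):
    """Return -1, 0 or 1, like SQL's SIGN()."""
    return (n > 0) - (n < 0)


def medians_via_sign_join(values):
    """Return the distinct x values for which SUM(SIGN(t1.x - t2.x)) is 0 in the cross join."""
    kept = []
    for x in sorted(set(values)):  # GROUP BY t1.x
        total = sum(sign(a - b) for a in values if a == x for b in values)
        if total == 0:             # HAVING SUM(SIGN(t1.x - t2.x)) = 0
            kept.append(x)         # AVG(t1.x) inside a single-x group is x itself
    return kept


print(medians_via_sign_join([4, 7, 2, 2, 9, 8, 3]))  # [4]
print(medians_via_sign_join([1, 2, 3, 4]))           # []
```

A value is kept only when as many rows lie below it as above it. On an even number of distinct values no group satisfies the condition, so the replay returns nothing.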

A simple solution for Oracle:

SELECT ROUND(MEDIAN(Lat_N), 4) FROM Station;

A simple, easy-to-understand solution for MySQL:

select case MOD(count(lat_n),2) 
when 1 then (select round(S.LAT_N,4) from station S where (select count(Lat_N) from station where Lat_N < S.LAT_N ) = (select count(Lat_N) from station where Lat_N > S.LAT_N))
-- ABS is needed here: the lower middle value gives -1 and the upper middle value gives +1
else (select round(AVG(S.LAT_N),4) from station S where 1 = ABS((select count(Lat_N) from station where Lat_N < S.LAT_N ) - (select count(Lat_N) from station where Lat_N > S.LAT_N)))
end from station;

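Explanation

STATION is the table name. LAT_N is the name of a column holding numeric values.

Suppose the station table has 101 records (an odd number). Then, if the table is sorted in ascending or descending order, the median is the 51st record.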

In the query above, for every S.LAT_N in table S I compute two counts: the number of LAT_N values less than S.LAT_N and the number of LAT_N values greater than S.LAT_N. I then compare the two counts and select S.LAT_N if they are equal. For the 51st record there are 50 values below it and 50 values above it, so the counts match and this is our answer. For every other record the two counts differ (assuming the values are distinct), so only the 51st record meets the condition.

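Now suppose the station table has 100 records (an even number). Then, if the table is sorted in ascending or descending order, the median is the average of the 50th and 51st records.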

As in the odd case, I compute two counts: the number of LAT_N values less than S.LAT_N and the number of LAT_N values greater than S.LAT_N. I then compare them, and if they differ by exactly 1 I select that S.LAT_N value and take the average. For the 50th record there are 49 values below it and 50 above it, a difference of 1, so the 50th record is the first value in the average. Similarly, for the 51st record there are 50 values below it and 49 above it, again a difference of 1, so the 51st record is the second value in the average. For every other record the counts differ by more than 1, so only the 50th and 51st records meet the condition.
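
The two counts described above can be tabulated for a small data set. The following Python sketch is an illustration that assumes all values are distinct, and the names balance and median_via_counts are made up for it. It shows that in an even-sized set the lower middle value has a difference of -1 and the upper middle value a difference of +1, so matching both requires comparing the absolute difference with 1.

```python
def balance(values):
    """For each value, return (value, count_less - count_greater)."""
    return [
        (v, sum(o < v for o in values) - sum(o > v for o in values))
        for v in sorted(values)
    ]


def median_via_counts(values):
    """Median from the less/greater counts; assumes distinct, non-empty values."""
    table = balance(values)
    if len(values) % 2 == 1:
        middle = [v for v, diff in table if diff == 0]       # odd: counts are equal
    else:
        middle = [v for v, diff in table if abs(diff) == 1]  # even: counts differ by 1
    return sum(middle) / len(middle)


print(balance([10, 20, 30, 40]))            # [(10, -3), (20, -1), (30, 1), (40, 3)]
print(median_via_counts([10, 20, 30, 40]))  # 25.0
print(median_via_counts([10, 20, 30]))      # 20.0
```

Testing only for a difference of +1 would select 30 alone in the first example, which is why the absolute value matters.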