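According to MSDN, Median is not available as an aggregate function in Transact-SQL. However, I would like to find out whether it is possible to create this functionality (using the CREATE AGGREGATE function, a user-defined function, or some other method).

What would be the best way (if possible) to do this, so that a median value (assuming a numeric data type) can be calculated in an aggregate query?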


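Current answer

This is as optimal a solution for finding medians as I can think of. The names in the example are based on Justin's example. Make sure an index exists on the table Sales.SalesOrderHeader with the index columns CustomerId and TotalDue, in that order.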

SELECT
 sohCount.CustomerId,
 AVG(sohMid.TotalDue) as TotalDueMedian
FROM 
(SELECT 
  soh.CustomerId,
  COUNT(*) as NumberOfRows
FROM 
  Sales.SalesOrderHeader soh 
GROUP BY soh.CustomerId) As sohCount
CROSS APPLY 
    (Select 
       soh.TotalDue
    FROM 
    Sales.SalesOrderHeader soh 
    WHERE soh.CustomerId = sohCount.CustomerId 
    ORDER BY soh.TotalDue
    OFFSET sohCount.NumberOfRows / 2 - ((sohCount.NumberOfRows + 1) % 2) ROWS 
    FETCH NEXT 1 + ((sohCount.NumberOfRows + 1) % 2) ROWS ONLY
    ) As sohMid
GROUP BY sohCount.CustomerId
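
The OFFSET and FETCH expressions in the query above are easier to follow with concrete row counts. The following is only an illustrative sketch: the derived table `v(n)` is made up for the example and stands in for `sohCount.NumberOfRows`.

```sql
-- For each hypothetical row count n, show how many rows are skipped
-- and how many are fetched by the median query.
SELECT n,
       n / 2 - ((n + 1) % 2) AS OffsetRows,  -- rows skipped before the middle
       1 + ((n + 1) % 2)     AS FetchRows    -- 1 row if n is odd, 2 if n is even
FROM (VALUES (1), (2), (3), (4), (5), (6)) AS v(n);

-- n  OffsetRows  FetchRows
-- 1  0           1
-- 2  0           2
-- 3  1           1
-- 4  1           2
-- 5  2           1
-- 6  2           2
```

For an odd count a single middle row is fetched, and for an even count the two middle rows are fetched, so the outer AVG returns the median in both cases.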

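Update

I was a bit unsure about which method has the best performance, so I compared my method with Justin Grant's and Jeff Atwood's by running queries based on all three methods in one batch. The batch cost of each query was:

Without index:

Mine 30%, Justin Grant's 13%, Jeff Atwood's 58%

And with index:

Mine 3%, Justin Grant's 10%, Jeff Atwood's 87%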

I tried to see how well the queries scale with the index in place by creating more data, starting from around 14,000 rows and growing by factors of 2 up to 512, which means around 7.2 million rows in the end. Note that I made sure the CustomerId field was unique for each single copy I made, so the ratio of rows to unique CustomerId values was kept constant. While doing this I ran executions where I rebuilt the index afterwards, and I noticed that the results stabilized at around a factor of 128, with the data I had, at these values:

Mine 3%, Justin Grant's 5%, Jeff Atwood's 92%

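I wondered how performance would be affected by scaling the number of rows while keeping the number of unique CustomerId values constant, so I set up a new test where I did just that. Now, instead of stabilizing, the batch cost ratios kept diverging. Also, instead of about 20 rows per CustomerId on average, I ended up with around 10,000 rows per unique Id. The numbers were:

Mine 4%, Justin's 60%, Jeff's 35%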

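I made sure I had implemented each method correctly by comparing the results. My conclusion is that the method I used is generally faster as long as the index exists. Also note that this method is the one recommended for this particular problem in this article: https://www.microsoftpressstore.com/articles/article.aspx?p=2314819&seqNum=5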
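
Since the conclusion depends on the index being present, here is a minimal sketch of how such an index might be created. The index name is an assumption made for illustration; only the table and the column order come from the answer.

```sql
-- Hypothetical index name; the key column order (CustomerId, then TotalDue)
-- is what lets the per-customer ORDER BY TotalDue read rows already sorted.
CREATE NONCLUSTERED INDEX IX_SalesOrderHeader_CustomerId_TotalDue
    ON Sales.SalesOrderHeader (CustomerId, TotalDue);
```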

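One way to further improve the performance of subsequent calls to this query is to persist the count information in an auxiliary table. You could even maintain it with a trigger that updates and stores the number of SalesOrderHeader rows per CustomerId. Of course, you could then simply store the median as well.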
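
The auxiliary-table idea could look roughly like the sketch below. The table name `dbo.CustomerOrderCount` and the trigger name are hypothetical, and this is one possible way to maintain the counts under those assumptions, not a tuned implementation. `GO` is the batch separator used by SSMS and sqlcmd; it is needed because CREATE TRIGGER must be the first statement in its batch.

```sql
-- Hypothetical auxiliary table holding one row count per customer.
CREATE TABLE dbo.CustomerOrderCount (
    CustomerId   int NOT NULL PRIMARY KEY,
    NumberOfRows int NOT NULL
);
GO

-- Hypothetical trigger that refreshes the count for every customer
-- touched by an INSERT, UPDATE, or DELETE on Sales.SalesOrderHeader.
CREATE TRIGGER Sales.trg_SalesOrderHeader_Count
ON Sales.SalesOrderHeader
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    MERGE dbo.CustomerOrderCount AS target
    USING (
        -- Recount only the affected customers; a customer whose last
        -- row was deleted gets a count of 0 through the LEFT JOIN.
        SELECT c.CustomerId, COUNT(soh.CustomerId) AS NumberOfRows
        FROM (SELECT CustomerId FROM inserted
              UNION
              SELECT CustomerId FROM deleted) AS c
        LEFT JOIN Sales.SalesOrderHeader AS soh
            ON soh.CustomerId = c.CustomerId
        GROUP BY c.CustomerId
    ) AS source
    ON target.CustomerId = source.CustomerId
    WHEN MATCHED AND source.NumberOfRows = 0 THEN
        DELETE
    WHEN MATCHED THEN
        UPDATE SET NumberOfRows = source.NumberOfRows
    WHEN NOT MATCHED AND source.NumberOfRows > 0 THEN
        INSERT (CustomerId, NumberOfRows)
        VALUES (source.CustomerId, source.NumberOfRows);
END;
```

With such a table in place, the median query could read NumberOfRows from it instead of running the GROUP BY subquery on every call.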

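Other answers

Check out other solutions for median calculation in SQL: "Simple way to calculate median with MySQL" (the solutions are mostly vendor independent).

I tried several alternatives, but because my data records contain repeated values, the ROW_NUMBER versions did not seem to be an option for me. So here is the query I used (a version with NTILE):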

SELECT distinct
   CustomerId,
   (
       MAX(CASE WHEN Percent50_Asc=1 THEN TotalDue END) OVER (PARTITION BY CustomerId)  +
       MIN(CASE WHEN Percent50_desc=1 THEN TotalDue END) OVER (PARTITION BY CustomerId) 
   )/2 MEDIAN
FROM
(
   SELECT
      CustomerId,
      TotalDue,
     NTILE(2) OVER (
         PARTITION BY CustomerId
         ORDER BY TotalDue ASC) AS Percent50_Asc,
     NTILE(2) OVER (
         PARTITION BY CustomerId
         ORDER BY TotalDue DESC) AS Percent50_desc
   FROM Sales.SalesOrderHeader SOH
) x
ORDER BY CustomerId;
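
with t1 as (
    select *, row_number() over (order by ordqty) as rn,
           count(*) over () as rc
    from ord_line
)
select rn, * from t1 where rn in ((rc + 1) / 2, (rc + 2) / 2);

It handles both even and odd row counts: it returns the single middle row for an odd count and the two middle rows for an even count.

ord_line is a table and ordqty is a column.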
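
If a single median value is wanted rather than the middle row or rows, the same filter can be wrapped in an average. This is a small sketch built on the query above; the alias `median_ordqty` is made up, and multiplying by 1.0 is an assumption that ordqty may be an integer column, to avoid integer averaging.

```sql
WITH t1 AS (
    SELECT ordqty,
           ROW_NUMBER() OVER (ORDER BY ordqty) AS rn,  -- position in sorted order
           COUNT(*) OVER ()                    AS rc   -- total number of rows
    FROM ord_line
)
-- Averages one row (odd count) or the two middle rows (even count).
SELECT AVG(1.0 * ordqty) AS median_ordqty
FROM t1
WHERE rn IN ((rc + 1) / 2, (rc + 2) / 2);
```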

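This code is somewhat long, but it is easy to understand.

medi is a table with a column val that holds the data set. smedi is a CTE with the column idx as the row number and vals as the 'val' from the medi table, sorted in ascending order. It is basic math: if the row count is odd, the median is the middle value from smedi. When it is even, it is the average of the two middle values.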

with smedi(idx,vals) as(
                select ROW_NUMBER() over(order by val),val from medi
                )
select (case
            -- odd row count: the middle row is at index count/2 + 1 (integer division)
            when (select count(*) from medi)%2!=0 then (select vals from smedi where (((select count(*) from medi)/2)+1)=idx)
            -- even row count: average the two middle rows (avg truncates for integer columns)
            else (select avg(vals) from smedi where idx in ((select count(*)/2 from medi),(select (count(*)/2)+1 from medi)))
            end)
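
To trace the odd/even rule by hand, a few rows of sample data are enough. The values below are invented for illustration; only the table name medi and the column val come from the answer.

```sql
-- Hypothetical sample data for the medi table.
CREATE TABLE medi (val int);
INSERT INTO medi (val) VALUES (7), (1), (5), (3);

-- The same numbering that the smedi CTE produces:
-- (idx, vals) = (1, 1), (2, 3), (3, 5), (4, 7)
SELECT ROW_NUMBER() OVER (ORDER BY val) AS idx, val AS vals
FROM medi
ORDER BY idx;
```

With four rows the count is even, so the two middle rows are idx 2 and idx 3 (values 3 and 5), and their average, 4, is the median.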

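In a UDF, write:

 -- Table and medianSortColumn are placeholders for your own names.
 -- Picks the smallest value that has at least half of the rows at or below it
 -- (for an even row count this is the lower of the two middle values).
 Select Top 1 medianSortColumn from [Table] T
  Where (Select Count(*) from [Table]
         Where medianSortColumn <= T.medianSortColumn) >=
           ((Select Count(*) From [Table]) + 1) / 2
  Order By medianSortColumn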