根据MSDN, Median在Transact-SQL中不能作为聚合函数使用。但是,我想知道是否可以创建此功能(使用create Aggregate函数、用户定义函数或其他方法)。

最好的方法(如果可能的话)是什么——允许在聚合查询中计算中值(假设是数值数据类型)?


当前回答

以下解决方案在这些假设下有效:

无重复值 没有取消

代码:

IF OBJECT_ID('dbo.R', 'U') IS NOT NULL
  DROP TABLE dbo.R

CREATE TABLE R (
    A FLOAT NOT NULL);

INSERT INTO R VALUES (1);
INSERT INTO R VALUES (2);
INSERT INTO R VALUES (3);
INSERT INTO R VALUES (4);
INSERT INTO R VALUES (5);
INSERT INTO R VALUES (6);

-- Returns Median(R)
select SUM(A) / CAST(COUNT(A) AS FLOAT)
from R R1 
where ((select count(A) from R R2 where R1.A > R2.A) = 
      (select count(A) from R R2 where R1.A < R2.A)) OR
      ((select count(A) from R R2 where R1.A > R2.A) + 1 = 
      (select count(A) from R R2 where R1.A < R2.A)) OR
      ((select count(A) from R R2 where R1.A > R2.A) = 
      (select count(A) from R R2 where R1.A < R2.A) + 1) ; 

其他回答

在我的解决方案表中是一个只有分数列的学生表,我正在计算分数的中位数,这个解决方案是基于SQL server 2019的

with total_c as ( --Total_c CTE counts total number of rows in a table
    select count(*) as n from student
),
even as ( --Even CTE extract two middle rows if the number of rows are even
    select marks from student 
    order by marks 
    offset (select n from total_c)/2 -1 rows
    fetch next 2 rows only
),
odd as ( --Odd CTE extract middle row if the number of rows are odd
    select marks from student 
    order by marks 
    offset (select n + 1 from total_c)/2 -1 rows
    fetch next 1 rows only
    )
--Case statement helps to select odd or even CTE based on number of rows
select                                                        
case when n%2 = 0 then (select avg(cast(marks as float)) from even)
    else (select marks from odd)
end as med_marks
from total_c
with t1 as (select *, row_number() over(order by ordqty) as rn,
count(*) over() as rc from ord_line)
select rn,* from t1 where rn in((rc+1)/2, (rc+2)/2);

它将计算偶数和奇数的中位数。

Ord_line是一个表 Ordqty是一个列

如果你想在SQL Server中使用创建聚合函数,这是如何做到的。这样做的好处是能够编写干净的查询。注意,这个过程可以很容易地计算一个百分位值。

创建一个新的Visual Studio项目,并将目标框架设置为。net 3.5(这是针对SQL 2008的,在SQL 2012中可能有所不同)。然后创建一个类文件,并放入以下代码或c#等效代码:

Imports Microsoft.SqlServer.Server
Imports System.Data.SqlTypes
Imports System.IO

<Serializable>
<SqlUserDefinedAggregate(Format.UserDefined, IsInvariantToNulls:=True, IsInvariantToDuplicates:=False, _
  IsInvariantToOrder:=True, MaxByteSize:=-1, IsNullIfEmpty:=True)>
Public Class Median
  Implements IBinarySerialize
  Private _items As List(Of Decimal)

  Public Sub Init()
    _items = New List(Of Decimal)()
  End Sub

  Public Sub Accumulate(value As SqlDecimal)
    If Not value.IsNull Then
      _items.Add(value.Value)
    End If
  End Sub

  Public Sub Merge(other As Median)
    If other._items IsNot Nothing Then
      _items.AddRange(other._items)
    End If
  End Sub

  Public Function Terminate() As SqlDecimal
    If _items.Count <> 0 Then
      Dim result As Decimal
      _items = _items.OrderBy(Function(i) i).ToList()
      If _items.Count Mod 2 = 0 Then
        result = ((_items((_items.Count / 2) - 1)) + (_items(_items.Count / 2))) / 2@
      Else
        result = _items((_items.Count - 1) / 2)
      End If

      Return New SqlDecimal(result)
    Else
      Return New SqlDecimal()
    End If
  End Function

  Public Sub Read(r As BinaryReader) Implements IBinarySerialize.Read
    'deserialize it from a string
    Dim list = r.ReadString()
    _items = New List(Of Decimal)

    For Each value In list.Split(","c)
      Dim number As Decimal
      If Decimal.TryParse(value, number) Then
        _items.Add(number)
      End If
    Next

  End Sub

  Public Sub Write(w As BinaryWriter) Implements IBinarySerialize.Write
    'serialize the list to a string
    Dim list = ""

    For Each item In _items
      If list <> "" Then
        list += ","
      End If      
      list += item.ToString()
    Next
    w.Write(list)
  End Sub
End Class

然后编译它,并将DLL和PDB文件复制到您的SQL Server机器,在SQL Server中运行以下命令:

CREATE ASSEMBLY CustomAggregate FROM '{path to your DLL}'
WITH PERMISSION_SET=SAFE;
GO

CREATE AGGREGATE Median(@value decimal(9, 3))
RETURNS decimal(9, 3) 
EXTERNAL NAME [CustomAggregate].[{namespace of your DLL}.Median];
GO

然后你可以写一个查询来计算中位数,就像这样: SELECT dbo.Median(Field) FROM Table

这是我能想到的求中位数的最优解。示例中的名称基于Justin示例。确保表有索引 销售。SalesOrderHeader以索引列CustomerId和TotalDue的顺序存在。

SELECT
 sohCount.CustomerId,
 AVG(sohMid.TotalDue) as TotalDueMedian
FROM 
(SELECT 
  soh.CustomerId,
  COUNT(*) as NumberOfRows
FROM 
  Sales.SalesOrderHeader soh 
GROUP BY soh.CustomerId) As sohCount
CROSS APPLY 
    (Select 
       soh.TotalDue
    FROM 
    Sales.SalesOrderHeader soh 
    WHERE soh.CustomerId = sohCount.CustomerId 
    ORDER BY soh.TotalDue
    OFFSET sohCount.NumberOfRows / 2 - ((sohCount.NumberOfRows + 1) % 2) ROWS 
    FETCH NEXT 1 + ((sohCount.NumberOfRows + 1) % 2) ROWS ONLY
    ) As sohMid
GROUP BY sohCount.CustomerId

更新

我有点不确定哪种方法性能最好,所以我比较了我的方法Justin Grants和Jeff Atwoods,在一个批量中运行基于这三种方法的查询,每个查询的批量成本为:

没有指数:

我的30% Justin Grants 13% Jeff Atwoods 58%

还有index

我的3%。 Justin Grants 10% Jeff Atwoods 87%

I tried to see how well the queries scale if you have index by creating more data from around 14 000 rows by a factor of 2 up to 512 which means in the end around 7,2 millions rows. Note I made sure CustomeId field where unique for each time I did a single copy, so the proportion of rows compared to unique instance of CustomerId was kept constant. While I was doing this I ran executions where I rebuilt index afterwards, and I noticed the results stabilized at around a factor of 128 with the data I had to these values:

我的3%。 贾斯汀·格兰特5% Jeff Atwoods 92%

我想知道,在保持惟一CustomerId不变的情况下,扩展行数会如何影响性能,因此我设置了一个新的测试,在其中执行了上述操作。现在,批成本比率并没有稳定下来,而是不断分化,每个CustomerId平均大约有20行,最后每个这样唯一的Id大约有10000行。数字如下:

我的4% 贾斯汀60% 杰夫斯35%

通过比较结果,我确保我正确地实现了每个方法。 我的结论是,只要索引存在,我使用的方法通常更快。还要注意,本文针对这个特定问题推荐使用这种方法https://www.microsoftpressstore.com/articles/article.aspx?p=2314819&seqNum=5

进一步提高对该查询的后续调用的性能的一种方法是在辅助表中持久化计数信息。您甚至可以通过一个触发器来维护它,该触发器更新并保存有关依赖于CustomerId的SalesOrderHeader行计数的信息,当然您也可以简单地存储中值。

在UDF中,写:

 Select Top 1 medianSortColumn from Table T
  Where (Select Count(*) from Table
         Where MedianSortColumn <
           (Select Count(*) From Table) / 2)
  Order By medianSortColumn