Condividi tramite


At long last, a post: Medians in T-SQL

Well, shoot.  Start a blog and then never post to it---not the best use of bits.  I can't promise that this will be the beginning of a post-filled revolution, but it will hopefully be of some use.

Assume you have a schema like:

sales(city varchar(30), number float)

For the purposes of this example, we don't care what number is---all we know is that we want to find the median value of number for each city.  We also want to do it using only T-SQL.  We also want to do it using only declarative T-SQL.

Here's how:

select middles.city, avg(convert(float, number)) as median
from (
select city, floor((count(*) + 1) / 2.0) as lowmiddle, ceiling((count(*) + 1) / 2.0) as highmiddle
from sales
group by city
) as middles
join (
select city, number, row_number() over (partition by city order by number) as rank
from sales
) as ranks
on (middles.city = ranks.city) and
((middles.lowmiddle = ranks.rank) or (middles.highmiddle = ranks.rank))
group by middles.city

Note that this version of median will return the average of the two middle values if the number of values for a particular city is even.  It's simpler if you want a right- or left-biased median.

How does it work?  The first subquery finds the rank for each of the two medians for each city---they'll be the same if the number of entries for the city is odd.  The second subquery finds the rank of each entry within each city.  All that is left to do is join these two to find the values associated with the median ranks and average them together.

[Note: Edited to correct the problem described by Alex in the comments.]

Comments

  • Anonymous
    December 19, 2006
    A couple posts---and a long time---ago, I wrote about computing medians in T-SQL . Adam Mechanic has

  • Anonymous
    September 28, 2007
    I was looking for something similar. Tried various posts but didnt help. Finally found this and has helped me out. Thanks a ton!!!

  • Anonymous
    January 06, 2009
    Great post!  Here's a tweak, replace rank() with row_number().  The will ensure that all goes well with sets containing non-unique values.

  • Anonymous
    January 07, 2009
    Alex, Ah!  I see what you mean---very nice catch. -Isaac