Use SQL to Calculate Median Values.
In this puzzle we’re going to learn how to find the person whose birthday, among others, is in the middle. Knowing how to calculate the median value is a good skill to have. As you start to explore business intelligence you’ll come across similar problems to solve.
Also, be sure to check out the bonus question. We’ll work with date from colonial times. The calendar was different, and you see that the standard DATETIME datatype falls short of our challenge.
Solving puzzles is a great way to learn SQL. Nothing beats practicing what you’ve learned. Once you have figured out the puzzle, post your answer in the comments so we all can learn from one another. We also discuss puzzle and more in Essential SQL Learning Group on Facebook. Be sure to find us there!
Calculate The Median Date Among A List of Dates
Here is today’s puzzle written by Chris Huntley.
You have a list of people with birth dates and want to find the person who has the middle birth date?
In other words, who has the median birth date?
Are you up to the challenge? Can you write a single SQL statement to find this person?
To get started, be sure to download the code samples. You’ll find sample data for this and the bonus question here.
Beware of Date Conversion Issues
The first Idea might be to convert it to an integer and then use that in a sort or to calculate median values. However depending on how you convert it you may end up with something you don’t want. For example converting it to a numeric looking string and then into an integer value
Here is one way to convert date to an INT, but not the right way!
SELECT personid, personname, birthdate, CAST(CONVERT(NVARCHAR(10), birthdate,112)as INT) as BirthINT, CAST(CONVERT(NVARCHAR(10), birthdate,112)as INT)+1 as BirthINTIncremented, CAST(CONVERT(NVARCHAR(10), birthdate,112)as INT)+30 as BirthINTIncrementedMore FROM @datetable
Refer to the figure below and you will find the disadvantage of this conversion. You can increment it like any other integer, but it isn’t adate anymore where you get dates that do not really exist. Incrementing the dates should be done using appropriate datatypes and possibly appropriate functions.
In continuing with this idea, we just need to calculate the median value of this integer to get the following…
-- calculating the median using the converted String -- using a CTE to reference the conversion ;WITH ctedateconversion AS ( SELECT personid, personname, birthdate, CONVERT( INT, Birthdate) as BirthINT FROM @datetable ) SELECT CONVERT(DATETIME, SUM(birthint)/(SELECT COUNT(*) FROM cteDateconversion)) As meanbirthdate FROM ctedateconversion
We do get the median date but it’s not a date on our list, which means, it doesn’t represent a person in our table. If we remove the outside conversion back into a date…
SUM(birthint)/(SELECT COUNT(*) FROM cteDateconversion) As meanbirthdate
we would have received the value, 29438…
Which is the number of days since 1/1/1900.
If you want to confirm this, use the following code
SELECT DATEDIFF(dd, '1900-01-01 00:00:00','1980-08-07 00:00:00.000')
Answer to Calculating the Median Date
To get the median birth date, I decided to use (2) PERCENTILE_DISC. The PERCENTILE_DISC computes a specific percentile for a distribution, which in our example, is a list of dates. Specifying PERCENTILE_DISC (.5) calculates the 50th percentile, which happens to be the median.
;with ctedateconversion as( SELECT personid, PersonName, Birthdate, Percentile_Disc(0.5) within group (order by Birthdate) OVER() as MedianBirthDate FROM @datetable ) SELECT personname, birthdate FROM ctedateconversion WHERE Birthdate = MedianBirthDate
This gives us the answer…
There’s more than one way to get the answer you seek. How would you accomplish this task?
Find the specific presidents that were the mean age of all presidents when they entered or left the presidency. Specifically either they were the mean age of all presidents when they entered or the mean age of all presidents when they left. This requires you to consider the datatypes you use carefully. The analyst who collected this data in your behalf, will give this to you in a denormalized format where everything is in a string datatype since they could not get DATETIME to work.
To get started, be sure to download the code samples. You’ll find sample data for the bonus question here.
The first thought might be convert it to a DATETIME then just do the math. If you try that you’ll find an error.
SELECT President, Convert(DATETIME, Birthdate) as Birthdate, Convert(DATETIME, TermStart) TermStart, Convert(DATETIME, TermEnd) as TermEnd FROM Presidents
DATETIME has a history and the dates it can hold are limited to dates greater than 1/1/1753. The history is that Pope Gregory ,for whom the calendar is named, in 1582 enacted reform to change from the Julian to Gregorian calendar.
It took Britain and the Colonies quite a while to get around to it and they finally decided that 1752 would be the change over. In order to make the change over they had to drop 11 days from September 1752.
Way back when they were designing SQL server they chose to just ignore this period and support from 1753 onward using the Gregorian Calendar.
An option might be convert to DATETIME2 which is Gregorian and supports all dates down to 00/00/0001 which makes it compatible with other DBMS.
SELECT President, Convert(DATETIME2, Birthdate) as Birthdate, Convert(DATETIME2, TermStart) TermStart, Convert(DATETIME2, TermEnd) as TermEnd FROM Presidents
How do we want to get this data? We can use chained CTEs (Common Table Expression) so we can convert and then perform the calculations as follows
; with ctepresidents as( SELECT President, Convert(DATETIME2, Birthdate) as Birthdate, Convert(DATETIME2, TermStart) TermStart, Convert(DATETIME2, TermEnd) as TermEnd FROM Presidents ) , CTESTEP2 as( SELECT President, CAST(DATEDIFF(dd, birthdate, termstart)/365.0 as INT) as startage, CAST(DATEDIFF(dd, birthdate, termend)/365.0 as INT) as endage FROM ctepresidents ) ,ctefinal as( SELECT CAST(SUM(startage)/(select COUNT(*) FROM Ctestep2)as INT) as meanStartAge, CAST(SUM(endage)/(select COUNT(*) FROM Ctestep2)as INT )as meanendAge FROM CTESTEP2) ,cteconnectingitall as( SELECT President FROM CTESTEP2 c INNER JOIN CTEFINAL f on c.startage = f.meanStartAge UNION ALL SELECT President FROM ctefinal ff INNER JOIN CTESTEP2 cc ON cc.endage = ff.meanendAge ) SELECT * FROM cteconnectingitall
The mean age of start of a presidential term is : 54
The mean age of the end of a presidential term is: 59
Just to make confirmation easier here’s the code:
;with ctepresidents as( SELECT President, Convert(DATETIME2, Birthdate) as Birthdate, Convert(DATETIME2, TermStart) TermStart, Convert(DATETIME2, TermEnd) as TermEnd FROM Presidents ) , CTESTEP2 as( SELECT President, CAST(DATEDIFF(dd, birthdate, termstart)/365.0 as INT) as startage, CAST(DATEDIFF(dd, birthdate, termend)/365.0 as INT) as endage FROM ctepresidents ) SELECT * FROM CTESTEP2 WHERE startage = 54 OR endage = 59
How did you get the answer? Did you have a flatter and easier way?
This blog was authored by Chris Huntley. He is a Business Intelligence Developer living in Dallas Fort Worth. He loves data, the Microsoft Bi Stack and tacos not necessarily in that order.
Chris frequently helps fellow SQL coders. One place you can find him is on the essentialSQL learning group.