4

How to Use GROUP BY to Summarize Query Results

Group and Summarize Data

In today’s lesson you’re going to learn about grouping results returned from your queries using the GROUP BY clause.

The objectives of today’s lesson are to:

  • Learn how to group results using GROUP BY
  • Use aggregate functions to perform calculations
  • Understand how to filter groups using the HAVING clause

Important! Please follow along and do the examples in your database.  If you haven’t already done so, sign up for my Guide to Getting Started with SQL Server.  You get instructions on how to install the free tools and sample database.

GROUP BY Clause

The SQL GROUP BY Clause is used to output a row across specified column values.  It is typically used in conjunction with aggregate functions such as SUM or Count to summarize values.  In SQL groups are unique combinations of fields.  Rather than returning every row in a table, when values are grouped, only the unique combinations are returned.

The GROUP BY Clause is added to the SQL Statement after the WHERE Clause.  Here is an example where we are listing SalesOrderID, excluding quantities greater than 100.

SELECT   SalesOrderID
FROM     Sales.SalesOrderDetail
WHERE    OrderQty <= 100
GROUP BY SalesOrderID

There are a couple of things to note.  First, the columns we want to summarize are listed, separated by commas, in the GROUP BY clause.  Second, this same list of columns must be listed in the select statement; otherwise the statement fails.

When this statement is run, not every filtered row is returned.  Only unique combinations of SalesOrderID are included in the result.  This statement is very similar to

SELECT DISTINCT SalesOrderID
FROM   Sales.SalesOrderDetail
WHERE  OrderQty <= 100

But there is a key difference.  The DISTINCT modifier stops at outputting unique combination of rows, whereas with the GROUP BY statement, we can calculate values based on the underlying filtered rows for each unique combination.

In other words, using our example, with the GROUP BY, we can calculate the number or OrderDetails per order as following:

SELECT   SalesOrderID, COUNT(SalesOrderID)
FROM     Sales.SalesOrderDetail
GROUP BY SalesOrderID

COUNT is an example of an aggregate function, these are what really give the GROUP BY statement its special value.

Aggregate Functions

Some functions, such as SUM, are used to perform calculations on a group of rows, these are called aggregate functions.  In most cases these functions operate on a group of values which are defined using the GROUP BY clause.  When there isn’t a GROUP BY clause, it is generally understood the aggregate function applies to all filtered results.

Some of the most common aggregate functions include:

AVG(expression) Calculate the average of the expression.
COUNT(expression) Count occurrences of non-null values returned by the expression.
COUNT(*) Counts all rows in the specified table.
MIN(expression) Finds the minimum expression value.
MAX(expression) Finds the maximum expression value.
SUM(expression) Calculate the sum of the expression.

These functions can be used on their own on in conjunction with the GROUP BY clause.  On their own, they operate across the entire table; however, when used with GROUP BY, their calculations are “reset” each time the grouping changes.  In this manner they act as subtotals.

General Syntax of an Aggregate Function

When using the aggregate function you can either compute the result on all values or distinct values.  For instance, to count all SalesOrderDetails records we could use the expression:

SELECT   COUNT(SalesOrderID)
FROM     Sales.SalesOrderDetail

This results in a count of 121317 being returned.

To count the distinct of orders making up the details we would use the following:

SELECT   COUNT(Distinct SalesOrderID)
FROM     Sales.SalesOrderDetail

The count now is 31465.

Using Aggregate Functions with GROUP BY

To aggregate means to make whole from individual parts.  Aggregate functions are functions that work on more than one row to return a result.

AVG and SUM

The SUM function totals up the values returned, in similar fashion AVG calculates the average.

Let’s see if we can calculate the total order amount from the OrderDetails.  From previous lessons we know how to calculate the total amount for each detail as:

SELECT SalesOrderID,
       ProductID,
       OrderQty* UnitPrice As ExtendedPrice
FROM   Sales.SalesOrderDetail

Since we can apply aggregate function to expressions, we can set up a grouping on OrderID to calculate the total price per order as

SELECT   SalesOrderID,
         SUM(OrderQty * UnitPrice) AS TotalPrice
FROM     Sales.SalesOrderDetail
GROUP BY SalesOrderID

We can even sort by the total to get the top orders first

SELECT   SalesOrderID,
         SUM(OrderQty * UnitPrice) AS TotalPrice
FROM     Sales.SalesOrderDetail
GROUP BY SalesOrderID
ORDER BY TotalPrice DESC

In similar fashion we can calculate the average order detail amount as

SELECT   SalesOrderID,
         AVG(OrderQty * UnitPrice) AS AverageOrderAmount
FROM     Sales.SalesOrderDetail
GROUP BY SalesOrderID

For the curious, since an average is calculated as the sum of the sample divided by the sample count, then using AVG in the above statement is the same as:

SELECT   SalesOrderID,
         SUM(OrderQty * UnitPrice) / COUNT(SalesOrderID) AS AvgOrderAmount
FROM     Sales.SalesOrderDetail
GROUP BY SalesOrderID

We covered a lot in this section.  Here are some key points to remember:

  1. An aggregate function can evaluate an expression such as SUM(A + B)
  2. You should alias aggregate functions, so the column names are meaningful
  3. When working with aggregate functions and GROUP BY, IT is sometimes is easier to think about the details first, that is write a simple SELECT statement, inspect the results, then add in the fancy stuff.  I cover this in How to Write Better Queries.

COUNT

The COUNT function is used when you need to know how many records exist in a table or within a group.  COUNT(*) will count every record in the grouping; whereas COUNT(expression) counts every record where expression’s result isn’t null.  You can also use Distinct with COUNT to find the number of unique values within a group.

To find the number of SalesOrderDetail Lines per order

SELECT   SalesOrderID,
         COUNT(SalesOrderDetailID)
FROM     Sales.SalesOrderDetail
GROUP BY SalesOrderID

To find the number of unique sales orders per product

SELECT   ProductID,
         COUNT(DISTINCT SalesOrderID)
FROM     Sales.SalesOrderDetail
GROUP BY ProductID

MIN and MAX

Use MIN and MAX to find the smallest and largest values, respectively, within a table or group.

For example, to find the smallest and largest product quantities ordered within a order try

SELECT   SalesOrderID,
         Min(OrderQty) AS MinQuantity,
         Max(OrderQty) AS MaxQuantity
FROM     Sales.SalesOrderDetail
GROUP BY SalesOrderID

You can also find the MIN or MAX value of a calculation.  Here we find the highest product amount ordered within a product:

SELECT   SalesOrderID,
         MAX(UnitPrice * OrderQty) as MaxAmount
FROM     Sales.SalesOrderDetail
GROUP BY SalesOrderID

HAVING Clause

The HAVING clause is used to filter groups according to the results of the aggregate functions.  This makes it possible to solve problems such as select all orders that have more than two order detail lines.

That example looks like

SELECT   SalesOrderID,
         COUNT(SalesOrderDetailID)
FROM     Sales.SalesOrderDetail
GROUP BY SalesOrderID
HAVING   Count(SalesOrderDetailID) > 2

If we wanted to find all orders greater than $1000 we would write

SELECT   SalesOrderID,
         SUM(UnitPrice * OrderQty) AS TotalPrice
FROM     Sales.SalesOrderDetail
GROUP BY SalesOrderID
HAVING   SUM(UnitPrice * OrderQty) > 1000
ORDER BY TotalPrice DESC

Note that we can use the alias TotalPrice in the ORDER BY clause,  but the having clause has to use the expression.

To hammer home HAVING, I want to show one last example.  Here you’ll see the HAVING statement includes an aggregate function that isn’t in the SELECT list.

SELECT   SalesOrderID,
         SUM(UnitPrice * OrderQty) AS TotalPrice
FROM     Sales.SalesOrderDetail
GROUP BY SalesOrderID
HAVING   AVG(UnitPrice * OrderQty) > 500
ORDER BY TotalPrice DESC

In the above query we’re getting the total price for orders where the average SalesOrderDetail amount is greater than $500.00.

Final Statement about HAVING

Though they perform a similar function, there is a key distinction between the WHERE clause and HAVING. The WHERE clause to filter individual records; whereas, the HAVING clause filters on the groups.

To keep it straight in my head I like to think of the WHERE clause doing its work before any groupings take place, and then the HAVING clause taking over after the groups are formed.

Exercises

It’s important to practice! Use the sample database to answer these questions.

  1. HR wants a report of the number of active employees by job title.  What SQL would you use?
  2. Display the Min, Max, and Average Quantity ordered for each product in SalesOrderDetails.
  3. List all employee job titles, and number of employees where the average number of sick leave hours is less than or equal to forty.
  4. For a job title returned in #3 above, is the count the same for the corresponding job title answer #1’s result?

Answers are Here!

Congratulations!  You just learned how to use the GROUP BY and HAVING clauses to summarize and filter on summarized information.  More tutorials are to follow! Remember!  I want to remind you all that if you have other questions you want answered, then post a comment or tweet me.

I’m here to help you. What other topics would you like to know more about?

Kris Wenzel
 

Kris Wenzel has been working with databases over the past 28 years as a developer, analyst, and DBA. He has a BSE in Computer Engineering from the University of Michigan and an MBA from the University of Notre Dame. Kris has written hundreds of blog articles and many online courses. He is loves helping others learn SQL.

Click Here to Leave a Comment Below 4 comments