Grouping and Aggregating Data: Module Overview
Grouping and Aggregating Data: Module Overview
Module 9
Grouping and Aggregating Data
Contents:
Module Overview 9-1
Module Overview
In addition to row-at-a-time queries, you may need to summarize data in order to analyze it. Microsoft®
SQL Server® provides a number of built-in functions that can aggregate, or summarize, information
across multiple rows. In this module, you will learn how to use aggregate functions. You will also learn
how to use the GROUP BY and HAVING clauses to break up the data into groups for summarizing and to
filter the resulting groups.
Objectives
After completing this lesson, you will be able to:
• Write queries using aggregate functions that handle the presence of NULLs in source data.
9-2 Grouping and Aggregating Data
Lesson 1
Using Aggregate Functions
In this lesson, you will learn how to use built-in functions to aggregate, or summarize, data in multiple
rows. SQL Server provides functions such as SUM, MAX, and AVG to perform calculations that take
multiple values and return a single result.
Lesson Objectives
After completing this lesson, you will be able to:
• Write queries that use aggregate functions in a SELECT list to summarize all the rows in an input set.
• Write queries using aggregate functions that handle the presence of NULLs in source data.
When working with aggregate functions, there are some considerations to keep in mind:
• Aggregate functions return a single (scalar) value and can be used in SELECT statements where a
single expression is used, such as SELECT, HAVING, and ORDER BY clauses.
• Aggregate functions ignore NULLs, except when using COUNT(*). You will learn more about this later
in the lesson.
• Aggregate functions in a SELECT list do not generate a column alias. You may wish to use the AS
clause to provide one.
• Aggregate functions in a SELECT clause operate on all rows passed to the SELECT phase. If there is no
GROUP BY clause, all rows will be summarized, as in the slide above. You will learn more about
GROUP BY in the next lesson.
To extend beyond the built-in functions, SQL Server provides a mechanism for user-defined aggregate
functions via the .NET Common Language Runtime (CLR).
For more information on other built-in aggregate functions, go to Books Online at:
Querying Microsoft® SQL Server® 9-3
Function
Syntax Description
Name
Function
Syntax Description
Name
NULL.
When a
column is
specified as
<expression
>, returns
count of
non-NULL
rows for the
column.
COUNT
returns an
int;
COUNT_BIG
returns a
big_int.
This lesson will only cover common aggregate functions. For more information on other built-in
aggregate functions, go to Books Online at:
To use a built-in aggregate in a SELECT clause, consider the following example in the TSQL sample
database:
Aggregate Example
SELECT AVG(unitprice) AS avg_price,
MIN(qty)AS min_qty,
MAX(discount) AS max_discount
FROM Sales.OrderDetails;
Note that the above example does not use a GROUP BY clause. Therefore, all rows from the
Sales.OrderDetails table will be summarized by the aggregate formulas in the SELECT clause.
The results:
When using aggregates in a SELECT clause, all columns referenced in the SELECT list must be used as
inputs for an aggregate function, or be referenced in a GROUP BY clause.
The following example will return an error:
This returns:
Since our example is not using a GROUP BY clause, the query treats all rows as a single group. All
columns, therefore, must be used as inputs to aggregate functions. Removing orderid from the previous
example will prevent the error.
In addition to numeric data such as the price and quantities in the previous example, aggregate
expressions can also summarize date, time, and character data. The following examples show the use of
aggregates with dates and characters.
This query returns first and last company by name, using MIN and MAX:
Returns:
first_customer last_customer
-------------- --------------
Customer AHPOP Customer ZRNDE
This query returns the earliest and latest orders by order date, using MIN and MAX:
Aggregating Dates
SELECT MIN(orderdate)AS earliest,MAX(orderdate) AS latest
FROM Sales.Orders;
Returns:
earliest latest
----------------------- -----------------------
2006-07-04 00:00:00.000 2008-05-06 00:00:00.000
Returns:
earliest latest
-------- -------
2006 2008
9-6 Grouping and Aggregating Data
Note that the above example uses a GROUP BY clause. GROUP BY will be covered in the next lesson. It is
used here to provide a useful example for comparing DISTINCT and non-DISTINCT aggregate functions.
Note the difference in each row between the COUNT of custid (in column 3) and the DISTINCT COUNT in
column 4. Column 3 simply returns all rows except those containing NULL. Column 4 excludes duplicate
custids (repeat customers) and returns a count of unique customers, answering the question: “How many
customers per employee?”
Question: Could you accomplish the same output with the use of SELECT DISTINCT?
Querying Microsoft® SQL Server® 9-7
C1 C2
1 NULL
2 10
3 20
4 30
5 40
6 50
The following query illustrates the difference between how AVG handles NULL and how you might
calculate an average with a SUM/COUNT(*) computed column:
The result:
If you need to summarize all rows, whether NULL or not, consider replacing the NULLs with another value
that can be used by your aggregate function.
9-8 Grouping and Aggregating Data
The following example replaces NULLs with 0 before calculating an average. The table named t2 contains
the following rows:
c1 c2
----------- -----------
1 1
2 10
3 1
4 NULL
5 1
6 10
7 1
8 NULL
9 1
10 10
11 1
12 10
Compare the effect on the arithmetic mean with NULLs-ignored verses replaced with 0:
AvgWithNULLs AvgWithNULLReplace
------------ ------------------
4 3
Warning: Null value is eliminated by an aggregate or other SET operation.
Note: This example cannot be executed against the sample database used in this course. A
script to create the table is included in the upcoming demonstration.
Demonstration Steps
Use Built-in Aggregate Functions
1. Ensure that the 20461C-MIA-DC and 20461C-MIA-SQL virtual machines are both running, and then
log on to 20461C-MIA-SQL as ADVENTUREWORKS\Student with the password Pa$$w0rd.
3. Start SQL Server Management Studio and connect to the MIA-SQL database engine instance using
Windows authentication.
5. If the Solution Explorer pane is not visible, on the View menu, click Solution Explorer.
7. Follow the instructions contained within the comments of the script file.
Querying Microsoft® SQL Server® 9-9
8. Keep SQL Server Management Studio open for the next demonstration.
9-10 Grouping and Aggregating Data
Lesson 2
Using the GROUP BY Clause
While aggregate functions are useful for analysis, you may wish to arrange your data into subsets before
summarizing it. In this lesson, you will learn how to accomplish this using the GROUP BY clause.
Lesson Objectives
After completing this lesson, you will be able to:
• Write queries that separate rows into groups using the GROUP BY clause.
• Describe the role of the GROUP BY clause in the logical order of operations for processing a SELECT
statement.
GROUP BY Syntax
GROUP BY <value1> [, <value2>, ...]
GROUP BY creates groups and places rows into each group as determined by unique combinations of the
elements specified in the clause.
For example, the following snippet of a query will result in a set of grouped rows, one per empid, in the
Sales.Orders table:
GROUP BY Snippet
FROM Sales.Orders
GROUP BY empid;
Once the GROUP BY clause has been processed and rows have been associated with a group, subsequent
phases of the query must aggregate any elements of the source rows that do not appear in the GROUP BY
list. This will have an impact on how you write your SELECT and HAVING clauses.
To see into the results of the GROUP BY clause, you will need to add a SELECT clause.
This shows the original 830 source rows being grouped into nine groups based on the unique employee
ID:
Querying Microsoft® SQL Server® 9-11
GROUP BY Example
SELECT empid, COUNT(*) AS cnt
FROM Sales.Orders
GROUP BY empid;
The result:
empid cnt
----- -----
1 123
2 96
3 127
4 156
5 42
6 67
7 72
8 104
9 43
(9 row(s) affected)
To learn more about GROUP BY, go to GROUP BY (Transact SQL) in Books Online at:
GROUP BY (Transact-SQL)
http://go.microsoft.com/fwlink/?LinkID=402754
If you recall from earlier in the course, the SELECT clause is not processed until after the FROM, WHERE,
GROUP BY, and HAVING clauses are processed, if present. When discussing the use of GROUP BY, it is
important to remember that not only does GROUP BY precede SELECT, but it also replaces the results of
the FROM and WHERE clauses with its own results. The final outcome of the query will only return one
row per qualifying group (if a HAVING clause is present). Therefore, any operations performed after
GROUP BY, including SELECT, HAVING, and ORDER BY, are performed on the groups, not the original
detail rows. Columns in the SELECT list, for example, must return a scalar value per group. This may
include the column(s) being grouped on, or aggregate functions being performed on, each group.
The following query is permitted because each column in the SELECT list is either a column in the GROUP
BY clause or an aggregate function operating on each group:
GROUP BY Example
SELECT empid, COUNT(*) AS cnt
FROM Sales.Orders
9-12 Grouping and Aggregating Data
GROUP BY empid;
This returns:
empid count
----- -----
1 123
2 96
3 127
4 156
5 42
6 67
7 72
8 104
9 43
The following query will return an error since orderdate is not an input to GROUP BY, and its data has
been "lost" following the FROM clause:
This returns:
If you did want to see orders per employee ID and per order date, add it to the GROUP BY clause, as
follows:
The net effect of this behavior is that you will not be able to combine a view of summary data with the
detailed source date, using the T-SQL tools you have learned so far. You will learn some approaches to
solving the problem later in this course.
GROUP BY Workflow
Initially, the WHERE clause is processed followed by
the GROUP BY. The slide shows the results of the
WHERE clause, followed by the GROUP BY being
performed on these results.
Source Queries
SELECT orderid, empid, custid
FROM Sales.Orders;
productid largest_order
----------- -------------
23 70
46 60
69 65
29 80
75 120
9-14 Grouping and Aggregating Data
Note: The qty column, used as an input to the MAX function, is not used in the GROUP BY
clause. This illustrates that, even though the detail rows returned by the FROM...WHERE phase are
lost to the GROUP BY phase, the source columns are still available for aggregation.
Demonstration Steps
Use the GROUP BY Clause
1. Ensure that you have completed the previous demonstration in this module. Alternatively, start the
20461C-MIA-DC and 20461C-MIA-SQL virtual machines, log on to 20461C-MIA-SQL as
ADVENTUREWORKS\Student with the password Pa$$w0rd, and run
D:\Demofiles\Mod09\Setup.cmd as an administrator.
2. If SQL Server Management Studio is not already open, start it and connect to the MIA-SQL database
engine instance using Windows authentication, and then open the Demo.ssmssln solution in the
D:\Demofiles\Mod09\Demo folder.
3. In Solution Explorer, open the 21 – Demonstration B.sql script file.
4. Follow the instructions contained within the comments of the script file.
5. Keep SQL Server Management Studio open for the next demonstration.
Querying Microsoft® SQL Server® 9-15
Lesson 3
Filtering Groups with HAVING
Once you have created groups with a GROUP BY clause, you may wish to further filter the results. The
HAVING clause acts as a filter on groups, much like the WHERE clause acts as a filter on rows returned by
the FROM clause. In this lesson, you will learn how to write a HAVING clause and understand the
differences between HAVING and WHERE.
Lesson Objectives
After completing this lesson, you will be able to:
Logical
Phase Comments
Order
5 SELECT
1 FROM
6 ORDER
BY
A HAVING clause allows you to create a search condition, conceptually similar to the predicate of a
WHERE clause, which will then test each group returned by the GROUP BY clause.
The following example from the TSQL database groups all orders by customer, then returns only those
customers who have placed orders. No HAVING clause has been added so no filter is applied to the
groups:
The following example adds a HAVING clause to the previous query. It groups all orders by customer,
then returns only those who have placed 10 or more orders. Groups containing customers who placed
fewer than 10 rows are discarded:
Note: Remember that HAVING is processed before the SELECT clause, so any column
aliases created in a SELECT clause are not available to the HAVING clause.
• A HAVING clause controls which groups are available to the next phase of the query.
You will see a comparison of using WHERE and HAVING in the next demonstration.
Querying Microsoft® SQL Server® 9-17
Demonstration Steps
Filter Grouped Data Using the HAVING Clause
1. Ensure that you have completed the previous demonstration in this module. Alternatively, start the
20461C-MIA-DC and 20461C-MIA-SQL virtual machines, log on to 20461C-MIA-SQL as
ADVENTUREWORKS\Student with the password Pa$$w0rd, and run
D:\Demofiles\Mod09\Setup.cmd as an administrator.
2. If SQL Server Management Studio is not already open, start it and connect to the MIA-SQL database
engine instance using Windows authentication, and then open the Demo.ssmssln solution in the
D:\Demofiles\Mod09\Demo folder.
Objectives
After completing this lab, you will be able to:
Password: Pa$$w0rd
4. Write a SELECT Statement to Retrieve the Customers with Orders for Each Year
5. Write a SELECT Statement to Retrieve Groups of Product Categories Sold in a Specific Year
2. Write a SELECT statement that will return groups of customers who made a purchase. The SELECT
clause should include the custid column from the Sales.Orders table and the contactname column
Querying Microsoft® SQL Server® 9-19
from the Sales.Customers table. Group by both columns and filter only the orders from the sales
employee whose empid equals five.
3. Execute the written statement and compare the results that you achieved with the desired results
shown in the file D:\Labfiles\Lab09\Solution\52 - Lab Exercise 1 - Task 1 Result.txt.
5. Execute the query and compare the results that you achieved with the desired results shown in the
file D:\Labfiles\Lab09\Solution\53 - Lab Exercise 1 - Task 2 Result.txt.
Task 4: Write a SELECT Statement to Retrieve the Customers with Orders for Each
Year
1. Write a SELECT statement that will return groups of rows based on the custid column and a calculated
column orderyear representing the order year based on the orderdate column from the Sales.Orders
table. Filter the results to include only the orders from the sales employee whose empid equals five.
2. Execute the written statement and compare the results that you achieved with the desired results
shown in the file D:\Labfiles\Lab09\Solution\54 - Lab Exercise 1 - Task 3 Result.txt.
2. Execute the written statement and compare the results that you achieved with the desired results
shown in the file D:\Labfiles\Lab09\Solution\55 - Lab Exercise 1 - Task 4 Result.txt.
Results: After this exercise, you should be able to use the GROUP BY clause in the T-SQL statement.
3. Write a SELECT Statement to Retrieve the Sales Amount Value Per Month
4. Write a SELECT Statement to List All Customers with the Total Sales Amount and Number of Order Lines
Added
9-20 Grouping and Aggregating Data
Task 1: Write a SELECT Statement to Retrieve the Total Sales Amount Per Order
1. Open the T-SQL script 61 - Lab Exercise 2.sql. Ensure that you are connected to the TSQL database.
2. Write a SELECT statement to retrieve the orderid column from the Sales.Orders table and the total
sales amount per orderid. (Hint: Multiply the qty and unitprice columns from the Sales.OrderDetails
table.) Use the alias salesamount for the calculated column. Sort the result by the total sales amount
in descending order.
3. Execute the written statement and compare the results that you achieved with the desired results
shown in the file D:\Labfiles\Lab09\Solution\62 - Lab Exercise 2 - Task 1 Result.txt.
2. Execute the written statement and compare the results that you achieved with the recommended
results shown in the file D:\Labfiles\Lab09\Solution\63 - Lab Exercise 2 - Task 2 Result.txt.
Task 3: Write a SELECT Statement to Retrieve the Sales Amount Value Per Month
1. Write a select statement to retrieve the total sales amount for each month. The SELECT clause should
include a calculated column named yearmonthno (YYYYMM notation) based on the orderdate
column in the Sales.Orders table and a total sales amount (multiply the qty and unitprice columns
from the Sales.OrderDetails table). Order the result by the yearmonthno calculated column.
2. Execute the written statement and compare the results that you achieved with the recommended
result shown in the file D:\Labfiles\Lab09\Solution\64 - Lab Exercise 2 - Task 3 Result.txt.
Task 4: Write a SELECT Statement to List All Customers with the Total Sales Amount
and Number of Order Lines Added
1. Write a select statement to retrieve all the customers (including those who did not place any orders)
and their total sales amount, maximum sales amount per order line, and number of order lines.
2. The SELECT clause should include the custid and contactname columns from the Sales.Customers
table and four calculated columns based on appropriate aggregate functions:
d. numberoforderlines, representing the number of order lines (use the orderid column in the
COUNT function)
4. Execute the written statement and compare the results that you achieved with the recommended
results shown in the file D:\Labfiles\Lab09\Solution\65 - Lab Exercise 2 - Task 4 Result.txt.
5. Notice that the custid 22 and 57 rows have a NULL in the columns with the SUM and MAX aggregate
functions. What are their values in the COUNT columns? Why are they different?
Querying Microsoft® SQL Server® 9-21
SELECT
YEAR(orderdate) AS orderyear,
COUNT(orderid) AS nooforders,
COUNT(custid) AS noofcustomers
FROM Sales.Orders
GROUP BY YEAR(orderdate);
3. Observe the results. Notice that the number of orders is the same as the number of customers. Why?
4. Correct the T-SQL statement to show the correct number of customers who placed an order for each
year.
5. Execute the written statement and compare the results that you achieved with the recommended
results shown in the file D:\Labfiles\Lab09\Solution\72 - Lab Exercise 3 - Task 1 Result.txt.
2. Execute the written statement and compare the results that you achieved with the recommended
results shown in the file D:\Labfiles\Lab09\Solution\73 - Lab Exercise 3 - Task 2 Result.txt.
2. Execute the written statement and compare the results that you achieved with the recommended
results shown in the file D:\Labfiles\Lab09\Solution\74 - Lab Exercise 3 - Task 3 Result.txt.
Results: After this exercise, you should have an understanding of how to apply a DISTINCT aggregate
function.
9-22 Grouping and Aggregating Data
Exercise 4: Writing Queries That Filter Groups with the HAVING Clause
Scenario
The sales and marketing departments were satisfied with the reports you provided to analyze customers’
behavior. Now they would like to have the results filtered, based on the total sales amount and number of
orders. So, in the final exercise, you will learn how to filter the result based on aggregated functions and
learn when to use the WHERE and HAVING clauses.
2. Write a SELECT statement to retrieve the top 10 customers (by total sales amount) who spent more
than $10,000 in terms of sales amount. Display the custid column from the Orders table and a
calculated column that contains the total sales amount, based on the qty and unitprice columns
from the Sales.OrderDetails table. Use the alias totalsalesamount for the calculated column.
3. Execute the written statement and compare the results that you achieved with the recommended
results shown in the file D:\Labfiles\Lab09\Solution\82 - Lab Exercise 4 - Task 1 Result.txt.
2. Execute the written statement and compare the results that you achieved with the recommended
results shown in the file D:\Labfiles\Lab09\Solution\83 - Lab Exercise 4 - Task 2 Result.txt.
3. Apply an additional filter to show only employees with empid equal number 3.
4. Execute the written statement and compare the results that you achieved with the recommended
results shown in the file D:\Labfiles\Lab09\Solution\85 - Lab Exercise 4 - Task 3_2 Result.txt.
5. Did you apply the predicate logic in the WHERE clause or the HAVING clause? Which do you think is
better? Why?
2. Execute the written statement and compare the results that you achieved with the recommended
result shown in the file D:\Labfiles\Lab09\Solution\86 - Lab Exercise 4 - Task 4 Result.txt.
Results: After this exercise, you should have an understanding of how to use the HAVING clause.
9-24 Grouping and Aggregating Data
Question: Can a WHERE clause and a HAVING clause in a query filter on the same column?