What is Summarize data with GROUP BY
While aggregate functions are useful for analysis, you may wish to arrange your data into subsets before summarizing it. In this section, you will learn how to accomplish this using the GROUP BY clause.
Using the GROUP BY clause
As you’ve learned, when your SELECT statement is processed, after the FROM clause and WHERE clause have been evaluated, a virtual table is created. The contents of the virtual table are now available for further processing. You can use the GROUP BY clause to subdivide the contents of this virtual table into groups of rows.
To group rows, specify one or more elements in the GROUP BY clause:
SQLCopy
GROUP BY <value1> [, <value2>, …]
GROUP BY creates groups and places rows into each group as determined by the elements specified in the clause.
For example, the following query will result in a set of grouped rows, one row per CustomerID in the Sales.SalesOrderHeader table. Another way of looking at the GROUP BY process, is that all rows with the same value for CustomerID will be grouped together and returned in a single result row.
SQLCopy
SELECT CustomerID
FROM Sales.SalesOrderHeader
GROUP BY CustomerID;
The query above is equivalent to the following query:
SQLCopy
SELECT DISTINCT CustomerID
FROM Sales.SalesOrderHeader
After the GROUP BY clause has been processed and each row has been associated with a group, later phases of the query must aggregate any elements of the source rows that are in the SELECT list but that don’t appear in the GROUP BY list. This requirement will have an impact on how you write your SELECT and HAVING clauses.
So, what’s the difference between writing the query with a GROUP BY or a DISTINCT? If all you want to know is the distinct values for CustomerID, there is no difference. But with GROUP BY, we can add other elements to the SELECT list that are then aggregated for each group.
The simplest aggregate function is COUNT(*). The following query takes the original 830 source rows from CustomerID and groups them into 89 groups, based on the CustomerID values. Each distinct CustomerID value generates one row of output in the GROUP BY query
SQLCopy
SELECT CustomerID, COUNT(*) AS OrderCount
FROM Sales.SalesOrderHeader
GROUP BY CustomerID;
For each CustomerID value, the query aggregates and counts the rows, so we result shows us how many rows in the SalesOrderHeader table belong to each customer.
CustomerID
OrderCount
1234
3
1005
1
Note that GROUP BY does not guarantee the order of the results. Often, as a result of the way the grouping operation is performed by the query processor, the results are returned in the order of the group values. However, you should not rely on this behavior. If you need the results to be sorted, you must explicitly include an ORDER clause:
SQLCopy
SELECT CustomerID, COUNT(*) AS OrderCount
FROM Sales.SalesOrderHeader
GROUP BY CustomerID
ORDER BY CustomerID;
This time, the results are returned in the specified order:
CustomerID
OrderCount
1005
1
1234
3
The clauses in a SELECT statement are applied in the following order:
- FROM
- WHERE
- GROUP BY
- HAVING
- SELECT
- ORDER BY
Column aliases are assigned in the SELECT clause, which occurs after the GROUP BY clause but before the ORDER BY clause. You can reference a column alias in the ORDER BY clause, but not in the GROUP BY clause. The following query will result in an invalid column name error:
SQLCopy
SELECT CustomerID AS Customer,
COUNT(*) AS OrderCount
FROM Sales.SalesOrderHeader
GROUP BY Customer
ORDER BY Customer;
However, the following query will succeed, grouping and sorting the results by the customer ID.
SQLCopy
SELECT CustomerID AS Customer,
COUNT(*) AS OrderCount
FROM Sales.SalesOrderHeader
GROUP BY CustomerID
ORDER BY Customer;