Article Series Navigation:
Summarizing data in a SELECT statement using a GROUP BY clause is a very common area of difficulty for beginning SQL programmers. In Part I of this two part series, we'll use a simple schema and a typical report request to cover the effect of JOINS on grouping and aggregate calculations, and how to use COUNT(Distinct) to overcome this. In Part II, we'll finish up our report while examining the problem with SUM(Distinct) and discussing how useful derived tables can be when grouping complicated data. (This article has been updated through SQL Server 2005. )
Here's the schema we'll be using along with some sample data:
Determining the Virtual Primary Key of a Result Set
Let's examine our sample data by joining these tables together to return Orders along with the OrderDetails:
Remember that Orders has a one-to-many or parent/child relationship with OrderDetails: thus, one order can have many details. When we join them together, the Order columns are repeated over and over for each OrderDetail. This is normal, standard SQL behavior and what happens when you join tables in a one-to-many relation. Our result contains one row per OrderDetail, and those OrderDetail rows are never repeated, since it is not being joined to any further tables that will produce duplicate rows.
Thus, we could say that our result set has a virtual primary key of DetailID; there will never be a duplicate OrderDetail row in the data. We can count and add up all OrderDetail columns and never worry about double counting values. However, we can not do the same for the Orders table, since its rows are duplicated in the results. Remember this as we move forward.
A Typical Summary Report Example
Here is a good example using Orders and OrderDetails that demonstrates various aggregate functions and typical things to look for:
For each Customer, we want to return the total number of orders, the number of items ordered, the total order amount, and the total shipping cost.
All of this information lives in the Orders and the OrderDetails tables, and we know how to join them together, we just need to summarize those results now. We don't want to return all 8 rows and see all of the details; we just want to return 2 rows: one per customer, with the corresponding summary calculations.
Grouping and Summarizing Primary Rows
Since we want to return 1 row for each Customer, we can simply add GROUP BY Customer to the end of the SELECT. We can return the Customer column since we are grouping on it, and we can add any other columns as long as they are summarized in an aggregate function. We can also use the COUNT(*) aggregate function to return the number of rows per group. Since each row in our data corresponds to an order item, grouping by Customer and selecting Customer and COUNT(*) will return the number of OrderDetails per Customer:
Next, let’s add the total order amount per customer. Since our join resulted in one row per item detail (remember: when joining Orders to OrderDetails, the result set has a virtual primary key of DetailID) we know that no rows in the OrderDetails table were duplicated in our results. Therefore, a simple SUM() of the Amount column from the OrderDetails table is all we need to return the total order amount per customer:
So far, so good! Only two more calculations
to go: total orders per customer, and total shipping cost. Both of these values come from the Orders table.
If you recall, our join resulted in one row per OrderDetail, and that meant that the Orders table had duplicate rows. We need to return two calculations from our Orders table -- total number of Orders, and total Shipping cost. We already know that COUNT(*) returns the number of OrderDetails, so that won’t work for us. Perhaps COUNT(OrderID) will return the total number of orders? Let's try it:
Notice that the OrderCount column returned is the same as the ItemCount, and definitely not the number of orders per customer. That is because, by definition in SQL, COUNT(expression) just returns the total number of rows in which that expression is not null. Since OrderID is never null, it returns the total row count per Customer, and since each row corresponds with an OrderDetail item, we get the number of OrderDetail items per customer, not the number of Orders per customer.
Luckily, there is a DISTINCT feature that we can use in our COUNT() aggregate function that will help us here. Count(Distinct expression) means: " return the total number of distinct values for the specified expression." So, if we write COUNT(Distinct OrderID). it will return the distinct number of OrderID values per Customer. Since each OrderID value corresponds to an Order (it is the primary key of the Orders table), we can use this to calculate our Order count:
Great! Looks good, makes sense, now we are getting somewhere.
Beware of SUMMING Duplicate Values
Moving on, let’s add an expression to calculate total shipping cost per customer. Well, we know we have a ShippingCost column in our data from the Orders table, so let's try just adding SUM(ShippingCost) to our SELECT:
Looks like we are good to go, right? Well -- not really, look at our TotalShipping column. Customer ABC has a total shipping of 195. Yet, in our Orders table, we see that the customer has 3 orders, with shipping values of 40,30 and 25. 40+30+25=95, not 195! What is going on here? Well, like the COUNT(*) expression, remember that the SUM() is acting not upon our tables themselves, but the result of the JOIN that we expressed in our SELECT. The JOIN from Orders to OrderDetails meant that rows from the Orders table were duplicated, remember? So, if we SUM() a column in our Orders table when it is joined to the Details, we are summing up duplicate values and our result will be too high. You can verify this by reviewing the results returned by the join from Orders to OrderDetails and manually adding them up by hand.
It is very important to understand this; when writing any JOINS, you need to identify which tables can and cannot have duplicate rows returned. The basic rule is very simple:
If Table A has one-to-many relation with Table B, and Table A is JOINED to Table B, then Table A will have duplicate rows in the result. The final result will have a virtual primary key equal to that of Table B's.
So, we know we can SUM() and COUNT() columns from Table B with no problem, but we cannot do that with columns from Table A because the duplicate values will skew our results. How do we handle this situation? We'll cover that and a whole lot more in Part II. coming tomorrow.