# How to Use GROUP BY in SQL Server

Summarizing data in a SELECT statement using a *GROUP BY* clause is a very common area of difficulty for beginning SQL programmers. In Part I of this two-part series, we'll use a simple schema and a typical report request to cover the effect of JOINs on grouping and aggregate calculations, and how to use *COUNT(Distinct)* to avoid counting the duplicate rows that a JOIN introduces. In Part II, we'll finish up our report while examining the problem with *SUM(Distinct)* and discussing how useful derived tables can be when grouping complicated data. (**This article has been updated through SQL Server 2005.**)

Here's the schema we'll be using along with some sample data:
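A minimal sketch of such a schema, assuming only the table and column names this article refers to (Orders with OrderID, Customer and ShippingCost; OrderDetails with DetailID, OrderID and Amount). The data types, the second customer's name (DEF), its shipping cost, and every Amount value are illustrative assumptions. The number of details per order was chosen so that the figures quoted later in the article hold: 8 joined rows, 2 customers, and three ABC orders with shipping costs of 40, 30 and 25.

```sql
-- Illustrative schema: data types and constraints are assumptions.
CREATE TABLE Orders
(
    OrderID      int         NOT NULL PRIMARY KEY,
    Customer     varchar(10) NOT NULL,
    ShippingCost money       NOT NULL
);

CREATE TABLE OrderDetails
(
    DetailID int   NOT NULL PRIMARY KEY,
    OrderID  int   NOT NULL REFERENCES Orders (OrderID),
    Amount   money NOT NULL
);

-- Hypothetical sample data: 4 orders for 2 customers.
INSERT INTO Orders (OrderID, Customer, ShippingCost)
SELECT 1, 'ABC', 40 UNION ALL
SELECT 2, 'ABC', 30 UNION ALL
SELECT 3, 'ABC', 25 UNION ALL
SELECT 4, 'DEF', 10;

-- Hypothetical sample data: 8 order details (2, 3, 1 and 2 per order).
INSERT INTO OrderDetails (DetailID, OrderID, Amount)
SELECT 1, 1, 100 UNION ALL
SELECT 2, 1, 150 UNION ALL
SELECT 3, 2, 125 UNION ALL
SELECT 4, 2,  50 UNION ALL
SELECT 5, 2, 200 UNION ALL
SELECT 6, 3, 100 UNION ALL
SELECT 7, 4,  50 UNION ALL
SELECT 8, 4, 300;
```

The `INSERT ... SELECT ... UNION ALL` form is used because it also runs on SQL Server 2005; on later versions a multi-row `VALUES` list would work as well.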

## Determining the Virtual Primary Key of a Result Set

Let's examine our sample data by joining these tables together to return Orders along with the OrderDetails:
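A minimal sketch of that join, assuming the tables are named Orders and OrderDetails, are related on OrderID, and carry the Customer, ShippingCost, DetailID and Amount columns referred to throughout this article:

```sql
-- Returns one row per OrderDetail; the Orders columns repeat
-- for every detail that belongs to the same order.
SELECT
    o.OrderID,
    o.Customer,
    o.ShippingCost,
    d.DetailID,
    d.Amount
FROM Orders o
INNER JOIN OrderDetails d
    ON d.OrderID = o.OrderID
ORDER BY o.OrderID, d.DetailID;
```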

Remember that Orders has a **one-to-many** or **parent/child** relationship with OrderDetails: thus, **one** order can have **many** details. When we join them together, the Order columns are repeated over and over for each OrderDetail. This is standard SQL behavior whenever you join tables in a one-to-many relationship. Our result contains one row per OrderDetail, and those OrderDetail rows are never repeated, since OrderDetails is not being joined to any further tables that would produce duplicate rows.

Thus, we could say that our result set has a *virtual primary key* of DetailID; there will never be a duplicate OrderDetail row in the data. We can count and add up all OrderDetail columns and never worry about double counting values. However, we cannot do the same for the Orders table, since its rows are duplicated in the results. Remember this as we move forward.

## A Typical Summary Report Example

Here is a good example using Orders and OrderDetails that demonstrates various aggregate functions and typical things to look for:

*For each Customer, we want to return the total number of orders, the number of items ordered, the total order amount, and the total shipping cost.*

All of this information lives in the Orders and the OrderDetails tables, and we know how to join them together; we just need to summarize those results now. We don't want to return all 8 rows and see all of the details; we just want to return 2 rows: one per customer, with the corresponding summary calculations.

## Grouping and Summarizing Primary Rows

Since we want to return 1 row for each Customer, we can simply add **GROUP BY Customer** to the end of the SELECT. We can return the Customer column since we are grouping on it, and we can add any other columns as long as they are summarized in an aggregate function. We can also use the *COUNT(*)* aggregate function to return the number of rows per group. Since each row in our data corresponds to an order item, grouping by Customer and selecting Customer and *COUNT(*)* will return the number of OrderDetails per Customer:
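As a sketch, assuming the Orders and OrderDetails tables are joined on OrderID, the grouped query might look like this:

```sql
-- One row per customer; COUNT(*) counts the joined rows in each group,
-- which is the number of OrderDetails for that customer.
SELECT
    o.Customer,
    COUNT(*) AS ItemCount
FROM Orders o
INNER JOIN OrderDetails d
    ON d.OrderID = o.OrderID
GROUP BY o.Customer;
```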

Next, let’s add the total order amount per customer. Since our join resulted in one row per item detail (remember: when joining Orders to OrderDetails, the result set has a *virtual primary key* of DetailID) we know that no rows in the OrderDetails table were duplicated in our results. Therefore, a simple *SUM()* of the Amount column from the OrderDetails table is all we need to return the total order amount per customer:
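One way to write this, under the same assumptions about table and column names; the alias TotalAmount is a hypothetical name picked for illustration:

```sql
-- Amount comes from OrderDetails, whose rows are never duplicated
-- by this join, so a plain SUM() is safe.
SELECT
    o.Customer,
    COUNT(*)      AS ItemCount,
    SUM(d.Amount) AS TotalAmount   -- hypothetical alias
FROM Orders o
INNER JOIN OrderDetails d
    ON d.OrderID = o.OrderID
GROUP BY o.Customer;
```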

So far, so good! Only two more calculations to go: total orders per customer, and total shipping cost. Both of these values come from the Orders table.

If you recall, our join resulted in one row per OrderDetail, and that meant that the Orders table had duplicate rows. We need to return two calculations from our Orders table -- total number of Orders, and total Shipping cost. We already know that *COUNT(*)* returns the number of OrderDetails, so that won’t work for us. Perhaps **COUNT(OrderID)** will return the total number of orders? Let's try it:
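A sketch of that attempt, again assuming the Orders and OrderDetails tables joined on OrderID (TotalAmount remains a hypothetical alias):

```sql
-- COUNT(o.OrderID) counts every joined row where OrderID is not null,
-- so it repeats the per-detail row count rather than counting orders.
SELECT
    o.Customer,
    COUNT(*)         AS ItemCount,
    SUM(d.Amount)    AS TotalAmount,   -- hypothetical alias
    COUNT(o.OrderID) AS OrderCount
FROM Orders o
INNER JOIN OrderDetails d
    ON d.OrderID = o.OrderID
GROUP BY o.Customer;
```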

Notice that the OrderCount column returned is the same as the ItemCount, and definitely not the number of orders per customer. That is because, by definition in SQL, *COUNT(expression)* just returns the total number of rows in which that expression is not null. Since OrderID is never null, it returns the total row count per Customer, and since each row corresponds with an OrderDetail item, we get the number of OrderDetail items per customer, **not** the number of Orders per customer.

## Using COUNT(Distinct)

Luckily, there is a *DISTINCT* feature that we can use in our *COUNT()* aggregate function that will help us here. *Count(Distinct expression)* means: "return the total number of distinct values for the specified expression." So, if we write **COUNT(Distinct OrderID)**, it will return the number of distinct OrderID values per Customer. Since each OrderID value corresponds to an Order (it is the primary key of the Orders table), we can use this to calculate our Order count:
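Sketched under the same assumptions, the only change from a plain COUNT is the DISTINCT keyword inside the aggregate:

```sql
-- COUNT(DISTINCT o.OrderID) counts each OrderID once per customer,
-- no matter how many detail rows the join produced for that order.
SELECT
    o.Customer,
    COUNT(*)                  AS ItemCount,
    SUM(d.Amount)             AS TotalAmount,   -- hypothetical alias
    COUNT(DISTINCT o.OrderID) AS OrderCount
FROM Orders o
INNER JOIN OrderDetails d
    ON d.OrderID = o.OrderID
GROUP BY o.Customer;
```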

Great! Looks good, makes sense, now we are getting somewhere.

## Beware of SUMMING Duplicate Values

Moving on, let’s add an expression to calculate total shipping cost per customer. Well, we know we have a ShippingCost column in our data from the Orders table, so let's try just adding **SUM(ShippingCost)** to our SELECT:
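A sketch of this naive version, assuming the same tables and join as before:

```sql
-- Caution: ShippingCost comes from Orders, whose rows are repeated
-- once per detail in the joined result, so this SUM() counts each
-- order's shipping cost once for every detail row of that order.
SELECT
    o.Customer,
    COUNT(*)                  AS ItemCount,
    SUM(d.Amount)             AS TotalAmount,   -- hypothetical alias
    COUNT(DISTINCT o.OrderID) AS OrderCount,
    SUM(o.ShippingCost)       AS TotalShipping
FROM Orders o
INNER JOIN OrderDetails d
    ON d.OrderID = o.OrderID
GROUP BY o.Customer;
```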

Looks like we are good to go, right? Well -- not really. Look at our TotalShipping column. Customer ABC has a total shipping of 195. Yet, in our Orders table, we see that the customer has 3 orders, with shipping values of 40, 30 and 25. 40+30+25=95, not 195! What is going on here? Well, as with the *COUNT(*)* expression, remember that the *SUM()* is acting not upon our tables themselves, but upon the result of the JOIN that we expressed in our SELECT. The JOIN from Orders to OrderDetails meant that rows from the Orders table were duplicated, remember? So, if we *SUM()* a column in our Orders table when it is joined to the Details, we are summing up duplicate values and our result will be too high. You can verify this by reviewing the results returned by the join from Orders to OrderDetails and adding them up by hand.
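As a quick cross-check, and only a sketch assuming the same Orders table and column names, summing ShippingCost from Orders on its own (with no join) counts each order exactly once and gives the per-customer totals that the joined query should have matched:

```sql
-- No join here, so every order contributes its shipping cost once.
SELECT
    Customer,
    SUM(ShippingCost) AS TotalShipping
FROM Orders
GROUP BY Customer;
```

This only confirms the correct totals; it does not yet combine them with the OrderDetails calculations in a single result.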

It is very important to understand this: when writing any JOIN, you need to identify which tables can and cannot have duplicate rows returned. The basic rule is very simple:

*If Table A has a one-to-many relation with Table B, and Table A is JOINED to Table B, then Table A will have duplicate rows in the result. The final result will have a virtual primary key equal to that of Table B.*

So, we know we can *SUM()* and *COUNT()* columns from Table B with no problem, but we cannot do that with columns from Table A because the duplicate values will skew our results. How do we handle this situation? We'll cover that and a whole lot more in Part II, coming tomorrow.

Source: www.sqlteam.com
