2 - Multi-line (Aggregate) Functions

Table of Contents

Aggregate Functions

When we think about the business world and what questions are being asked in the workplace, it is absolutely impossible to not include questions that include "How much ... ", "How many ... ", "What is the highest ... ", "What is the lowest ... ". With questions like "How much money did we make today?", "How many employees work for us?"," etc.... it is easy to see that some answers to questions will require the aggregation of data across multiple rows of data. To this point in your SQL learning, all SELECT statements have been executed on single rows of data, one at a time. Now we move into the world of multiple-row functions.

SELECT Statement Syntax

So far we have learned 4 clauses in the SELECT statement:

Now by adding aggregate functions, we introduce 2 new clauses, GROUP BY and HAVING.

SELECT      < comma separated field list >
FROM        < table(s) with/without JOINS >
WHERE       < conditions not involving aggregate functions >
GROUP BY    < comma separate field list (group by expression) >
HAVING      < conditions involving aggregate functions >
ORDER BY    < comma separated columns >

The 7 Aggregate Functions

There are exactly 7 different aggregate functions in SQL, and these 7 will enable the programmer to answer almost any business question. The 7 aggregate functions are:

7 Aggregate Functions
Function Description Example
SUM() the summation (addition) of data the total owing on a receipt adding up multiple purchased products
COUNT() counting the number of instances (rows) The number of employees that work at a specific location
AVG() Average - the non-weighted average of a group of given numbers The average price of products being sold on a website.
MIN() Minimum - the smallest number, the first alphabetical string, and the earliest date chronologically. The earliest date (first event) in a list of events
MAX() Maximum - the largest number, last alphabetical string, or latest date The largest number in a set of numbers
STDDEV() Standard Deviation - A statistical measure of the horizontal distribution of data points.
VARIANCE() Standard Deviation - A statistical measure of the vertical distribution of data points.

Some examples of the main functions might be:

    SELECT count(playerID) AS numPlayers FROM players;

will return the number of rows in the player table. This should product a unique list as playerid would be the primary key, so in this case DISTINCT would not be required, but it might be in other cases.

    SELECT sum(balanceOwing) AS totalOwing FROM customers;

would return the total owing from ALL customers. This would be an important number for companies to know how much money they should have coming in within the next short time period: assuming customers all pay their bills that is :)

    SELECT min(eventDate) AS firstEventDate FROM events;

would return the date of the first event in a table of events, if multiple events had the same date, it would still only return the date once.

Grouping (GROUP BY)

It is often required to have more complex answers that involve grouping data rows together. For example: How many players in a players table, does not requiring any grouping, but a slightly different question: How many players play on each team in a league would require you to group players together that play on the same team and then count them. This is achieved through the GROUP BY statement.

How many players play on each team?

    SELECT teamName, firstName, lastName
    FROM players JOIN teams USING (teamID)
    ORDER BY teamName;

gives us the players on each team, but we can only count them manually because we sorted it by teamName. It is still a manual count. Therefore, if we want the SQL to produce the actual numbers automatically, we need to group the players together by team and then use the count() aggregate function

    SELECT 
        teamName, 
        count(playerID) AS numPlayers
    FROM players JOIN teams USING (teamID)
    GROUP BY teamName
    ORDER BY teamName;

gives us the results that we need. However, what if there is a possibility that the teamName is not unique and we therefore need to ensure we are not mixing players from multiple teams.

    -- THIS CAUSES AN ERROR
    SELECT 
        teamID,
        teamName, 
        count(playerID) AS numPlayers
    FROM players JOIN teams USING (teamID)
    GROUP BY teamName
    ORDER BY teamName;

will give us an error as the teamID is not part of an aggregate function nor is it in the GROUP BY clause.
Note: That if we had used ON rather than USING then we would also have an ambiguous field error.

Clint's Law: Any field in the SELECT clause that is not part of an aggregate function, or a sub-query, MUST be in the GROUP BY clause.

yes I called it my law, I made it up, but it works. Realistically it is just a different way of saying what other books have said, but I feel it simplifies it and makes it an easy to follow rule.

So because of the above rule, the teamID in the above statement is a field that is not part of the aggregate function, and therefore must be in the GROUP BY. Let us correct this along with the ambiguous issue we had above if we had used ON.

    SELECT 
        t.teamID, -- because an inner join is used, it does not matter if this is t.teamID or p.teamID
        teamName, -- outer joins are different and careful consideration is required.
        count(playerID) AS numPlayers
    FROM players p JOIN teams t ON p.teamID = t.teamID
    GROUP BY t.teamID, teamName
    ORDER BY teamName;
Caution when grouping by a PRIMARY KEY, as grouping on a primary key can often result in nothing more than groups of one. In this case the INNER JOIN will result in repeated values for the teamID in the out put so it would be okay.

Examples

Example 1

Using the class sample database, How much money goes out for salaries per month per department

SELECT
    departmentID,
    sum(monthlySalary) AS totalSalary
FROM employees
GROUP BY departmentID
ORDER BY departmentID;
-- but this does not include commission yet
SELECT
    department_id,
    sum(monthlySalary) AS totalSalary,
    sum(monthlySalary * IsNull(commissionPercent,0)) AS totalCommission
FROM employees
GROUP BY departmentID
ORDER BY departmentID;
-- 
SELECT
    department_id,
    sum(monthlySalary) AS totalSalary,
    sum(monthlySalary * IsNull(commissionPercent,0)) AS totalCommission,
    sum(monthlySalary) + sum(monthlySalary * IsNull(commissionPercent, 0)) AS totalIncome,
    sum(monthlySalary + monthlySalary * IsNull(commissionPercent, 0)) AS totalIncome2,
    count(departmentID) AS numEmployee
FROM employees
GROUP BY departmentID
ORDER BY departmentID;

Example 2

Using the class sample database, What is the average sales commission per month?

SELECT Avg(commissionPercent) AS AvgComm
FROM employees;

results in 0.2125 (does this make sense)
let us do this manually

-- 0.2 + 0.3 + 0.2 + 0.15 = 0.85
-- 0.85 / 4 = 0.2125

But now we can question this answer, as the average is calculated based on the 4 employees whom receive a commission, but there are 20 employees? So the avg() aggregate function simply ignores the NULL values present and only counts records with numeric values. But, does the average commission involve only those employees whom receive commission or is it the average commission for all employees; which would be:

-- 0.2 + 0.3 + 0.2 + 0.15 = 0.85
    -- 0.85 / 20 = 0.0425

In order to allow the avg() function to calculate the latter method, we would have to ensure that NULL values are included in the average calculation, and therefore need to convert the null values to a 0 in this case. We can use the IsNull() function to perform this task.

SELECT Avg(IsNull(commissionPercent,0)) AS AvgComm
FROM employees; 

HAVING

The HAVING clause of the SQL statement can also be added once grouping has taken place. The HAVING clause is used, very similar to the WHERE clause, but for conditions that involve aggregate functions. There are quite often conditions that apply to summation data that can only be implemented after grouping takes place, therefore the HAVING clause is executed after the GROUP BY clause, where the WHERE clause executed before the GROUP BY clause.

Examples

Example 1

Display the average salary for employees in each job-title in each department

SELECT 
    jobID,
    departmentID,
    avg(IsNull(monthlySalary,0)) AS avgSalary
FROM employees
GROUP BY 
    jobID, 
    departmentID
ORDER BY 
    departmentID, 
    jobID;

Now, only include departments 20, 60, 80, and 90

SELECT 
    jobID,
    departmentID,
    avg(IsNull(monthlySalary,0)) AS avgSalary
FROM employees
WHERE departmentID IN (20, 60, 80, 90)
GROUP BY 
    jobID, 
    departmentID
ORDER BY 
    departmentID, 
    jobID;
    

Now, include only those where the average is above 15,000

SELECT 
    jobID,
    departmentID,
    avg(IsNull(monthlySalary,0)) AS avgSalary
FROM employees
WHERE department_id IN (20, 60, 80, 90)
GROUP BY 
    jobID, 
    departmentID
HAVING avg(IsNull(monthlySalary,0)) > 15000
ORDER BY 
    departmentID, 
    jobID;