In order to understand when and why you might use sub-queries, it is important to first understand the results that we get from non nested queries. All queries produce a table of data, regardless of the number of columns or rows that are returned; the results is still a table. However, looking closer we can determine that not all tables are the same and that tables that look one way might actually not be what they appear to be.
So query results can generally be divided into 3 hierarchical categories.
It is sometimes misleading when you get a result that gives a single value (1 column and 1 row) but it cannot be considered a scalar, because if in the future the data changes and you could get more than one row, than it cannot be a scalar, but is a list.
Note that scalars are also lists and tables, lists are tables, but not scalars, and tables are tables, but not lists or scalars.
All of the following SELECT statements will produce tables, but cannot be classified as lists or scalars.
SELECT * FROM employees; SELECT firstname, lastname FROM employees; SELECT employeeID, firstName, lastName FROM employees WHERE lower(lastname) LIKE 'jones' AND lower(firstname) LIKE 'andrew';
The third example above has multiple columns and looks like it would return a single row, but could have multiple rows as well.
SELECT employeeID FROM employees; SELECT lastname FROM employees; SELECT employeeID FROM employees WHERE lower(lastname) LIKE 'jones' AND lower(firstname) LIKE 'andrew';
The third example above can sometimes be confusing because we are obviously intending to find a single person and return their one employeeID, but there is no guarantee that there is not more than one 'Andrew Jones' working for the company, and therefore it is possible to get more than one row returned, and thus it can not be considered a scalar.
SELECT count(employeeID) FROM employees; SELECT max(orderNumber) FROM orders; SELECT teamName FROM teams WHERE teamID = 12;
All three of the above queries can only ever return a single column and single row and therefore can be considered scalar queries.
The reason why we must consider the type of query is because we are about to embark on the implementation of sub-queries. Inside of SELECT statements, we can place alternative sources for the information by nesting queries or placing a sub-query inside another query. The type of query that can be placed will depend on where it will be placed.
Inserting a sub-query in the FROM
clause will require a table (remembering that both lists and scalars are also tables).
Sub-queries can also be used within the WHERE
clause can include lists and scalars.
- Placing a sub-query after an "=" will require a scalar query as the comparison operator requires a single value to be compared.
- Placing a sub-query within an IN()
comparison operator will accept a list query, as the IN operator contains a comma separated list.
Let us learn through examples using the following table schema.
List all the employees, by name, whom work in the city of Seattle! (Do not use JOINS)
We will start this solution by looking at the information we know (City = Seattle) and obtaining the location_id for that record or records
-- STEP 1 SELECT location_id FROM locations WHERE lower(city) = 'seattle'; -- let us say this query return the value of 1700 -- it must also be noted that although we only got 1 result, it is possible to get more than one, -- so this must be considered a list query and not a scalar Query
Now using this result of 1700 we can get the departments that are in that location or locations.
-- STEP 2 SELECT department_id FROM departments WHERE location_id IN (1700); -- note that we used IN rather than =, because the previous query may have produced more than one value -- we might get results like 10, 90, 110, 190
having returned a single column list of department_id's gives us then the opportunity to get the employee information whom work in those departments.
-- STEP 3 SELECT employee_id, first_name, last_name FROM employees WHERE department_id IN (10, 90, 1010, 1090); -- this should give us the list employees whom work in Seattle.
so although we now have the answer we were looking for, we have created 3 queries and hard coded values at more than one step. We should create queries that are more dynamic and allow us to use the same query for the results to many cities directly.
-- first we will replace 'Seattle' from Step 1 above with a parameter DECLARE @city varchar(20); SET @city = 'Seattle'; SELECT location_id FROM locations WHERE lower(city) LIKE '%' + TRIM(lower(@city)) + '%'; -- this allows us to find the location_ids from any city typed in by a user. -- Next we will take step 2 and replace the value 1700 with the query that got us that value, with the parameter SELECT department_id FROM departments WHERE location_id IN ( SELECT location_id FROM locations WHERE lower(city) LIKE '%' + TRIM(lower(@city)) + '%' ); -- Now we take step 3 and replace the hard coded values with the query above that would produce those values anyways SELECT employee_id, first_name, last_name FROM employees WHERE department_id IN ( SELECT department_id FROM departments WHERE location_id IN ( SELECT location_id FROM locations WHERE lower(city) LIKE '%' + TRIM(lower(@city)) + '%' ) );
This final query above will now let us get the employees from any city at any time as typed in by a parameter. Much more useful and reusable.
So far we have seen examples where the sub-query is nested within the WHERE
clause. We can also use Tabular queries as the data source for queries as well.
List all employees whom work in Seattle whose first name starts with an 'S'.
SELECT employee_id, first_name, last_name
FROM employees
WHERE department_id IN (
SELECT department_id
FROM departments
WHERE location_id IN (
SELECT location_id
FROM locations
WHERE lower(city) LIKE '%' + TRIM(lower(@city)) + '%'
)
);
and then wrap those queries with a simple query!
SELECT *
FROM employees
WHERE upper(first_name) LIKE 'S%';
A simple statement that does not yet take into account employees in Seattle yet!
becomes ....
DECLARE @city varchar(20);
SET @city = 'Seattle';
SELECT *
FROM (
SELECT employee_id, first_name, last_name
FROM employees
WHERE department_id IN (
SELECT department_id
FROM departments
WHERE location_id IN (
SELECT location_id
FROM locations
WHERE lower(city) LIKE '%' + TRIM(lower(@city)) + '%'
)
)
)
WHERE upper(first_name) LIKE 'S%';
so sub-queries can be used as the course for a FROM
clause as well. This is very similar to the use of a view, in fact sub-queries are used in conjunction with views very often to help assist in creating object oriented SQL code.
List the employees, by name, and the department name from which they work, without using joins.
So to get information, that will be an output column, in the output without using joins, it is important to get information from multiple tables. We can utilize the fact that sub-queries can also be implemented in the SELECT
clause in addition the the previous examples using the WHERE
and FROM
clauses.
SELECT first_name, last_name, ( SELECT department_name FROM departments WHERE department_id = e.department_id ) FROM employees e ORDER BY last_name, first_name;
In the above solution, we can insert the employees department_id into the WHERE
clause of the sub-query by using the e table alias.
If we also recall from order of execution, that the SELECT
clause executes iteratively, as a loop, through each row of the result set. Therefore, the sub-query is run on each row of the employees table and thus we can get the specific department_id for each row.
It is often necessary to put two lists together that have similar columns, but very different queries and conditions to obtain. A quick example would be: list all people a company contacts, both employees and customers. For this example we might write SQL like:
SELECT firstName, lastName, email
FROM employees;
-- and
SELECT nameFirst, nameLast, contactEmail
FROM customers;
With the two above statements, we now would have to do more external work to make this a single list. Even concepts of Joins and Sub-Queries would not help here as there is not necessarily any relationship between these two tables. Therefore, we need another way to automate the combining of these two resultant sets. This is called Set Operators.
Just to give a little context, many of us took mathematics in high school, middle school, or the equivalent and usually that would cover the topics of Venn Diagrams, Set, and Sequences. Most students often wonder, "Why am I learning this, and when will I ever use this in real life?" Well database is a great example of where we use Venn Diagrams and Set Operators.
For example, in mathematics:
Let A = { 1, 2, 3, 4, 6, 8, 10 } Let B = { 3, 5, 6, 9, 10 } A ⋃ B = { 1, 2, 3, 4, 5, 6, 8, 9, 10 } -- union A ⋂ B = { 3, 6, 10 } -- intersect A - B = { 1, 2, 4, 8 } -- minus or except
and we may represent these by:
so...
Let A = SELECT * FROM employees; Let B = SELECT * FROM customers; Let C = SELECT * FROM children; -- so we can do: UNION: A ⋃ B INTERSECT: A ⋂ B EXCEPT: A - B
could be extended and visualized using Set Operators and Venn Diagrams
We will look at 4 different Set Operators that are commonly used in SQL that will allow the developer to answer most business questions when combined with other techniques, such as Joins, Sub-Queries, Filtering etc.
The 4 operators we will look at are:
We will use the following tables for examples on how these set operators work:
Employees | Customers | |||||
---|---|---|---|---|---|---|
firstName | lastName | nameFirst | nameLast | contactEmail | ||
John | Smith | js@email.com | Julie | Wilson | jw@email.com | |
Andrew | Jones | aj@email.com | Raj | Patel | rp@email.com | |
Sarah | Bennet | sb@email.com | John | Smith | js@email.com |
UNION
is the most used of the set operators, and we will look at how it works. However, it is important to know the difference between UNION and UNION ALL as they have subtle differences that are critical to understand.
So to get a complete list of people, we will need to combine the employees and customers lists into a single list.
SELECT firstName, lastName, email
FROM employees;
-- and
SELECT nameFirst, nameLast, contactEmail
FROM customers;
turns into
-- RED SELECT firstName, lastName, email FROM employees -- note the lack of semi-colon here UNION -- BLUE SELECT nameFirst, nameLast, contactEmail FROM customers; -- All in ONE statement
The results in:
nameFirst | nameLast | ||
---|---|---|---|
Andrew | Jones | aj@email.com | |
John | Smith | js@email.com | |
Julie | Wilson | jw@email.com | |
Raj | Patel | rp@email.com | |
Sarah | Bennet | sb@email.com |
a few things to note:
SELECT
portion and the blue rows are from the second SELECT
portion.SELECT
statement. This means any aliases used in the second statement are irrelevant and not necessary.ORDER BY
clause?When you eliminate duplicate records, especially in very large databases, it is much more efficient to sort the data first. To understand this concept, let us look at this in more depth.
If you have 100 records, to determine if a record is a duplicate you would:
Obviously as the database gets larger, each new record added means MANY MANY more checks each time, greatly slowing down the database and it's processes. To greatly improve this inefficiency, the data is sorted first.
When you are comparing a record, to determine matches, it is significantly better if they are sorted. When looking at the nth record, we will only need to look at the n-1th record to see if it matches. This means that no matter how many records are in the database (23 or more than a million), each record can be determined to be a match by only looking at ONE previous record.
Just for clarification,
- if not sorted, when reviewing the 100th record, we would have to compare to 99 other records
- if sorted, when reviewing the 100th record, we would only have to compare to 1 other record.
Comparing this if we were reviewing the one millionth record, it is obvious to see the efficiency here.
UNION combines
UNION ALL
works very similar to UNION
with a few subtle differences:
Using the statement:
SELECT firstName, lastName, email
FROM employees -- note the lack of semi-colon here
UNION ALL
SELECT nameFirst, nameLast, contactEmail
FROM customers;
we will see the resultant set of:
nameFirst | nameLast | ||
---|---|---|---|
Andrew | Jones | aj@email.com | |
John | Smith | js@email.com | |
Sarah | Bennet | sb@email.com | |
John | Smith | js@email.com | |
Julie | Wilson | jw@email.com | |
Raj | Patel | rp@email.com |
NOTE: the red rows are from the first SELECT part and the blue rows from the second SELECT part. Note that red and green are not mixed
Like UNION and UNION ALL, INTERSECT is another set operator that works on multiple incoming resultant sets. INTERSECT specifically finds records that EXACTLY match in both resultant sets. Note: That an exact match means, all column values in both rows are identical. Thus:
SELECT firstName, lastName, email
FROM employees -- note the lack of semi-colon here
INTERSECT
SELECT nameFirst, nameLast, contactEmail
FROM customers;
results in:
nameFirst | nameLast | ||
---|---|---|---|
John | Smith | js@email.com |
The only row that identically appears in both resultant sets.
We use the words EXCEPT
or MINUS
to describe the same thing where as the only difference is which clause is available in different DBMSs. MINUS
is the more common usage and is used in Oracle, MySQL, PostGreSQL, and many others. EXCEPT
is the form of the clause that is used in MicroSoft SQL Server.
This clause allows the statement to include all records in the first resultant set, and remove any records that also exist in the second resultant set. Any records in the second resultant set that are not in the first set, are completely ignored.
To see by example:
SELECT firstName, lastName, email
FROM employees -- note the lack of semi-colon here
EXCEPT
SELECT nameFirst, nameLast, contactEmail
FROM customers;
results in:
nameFirst | nameLast | ||
---|---|---|---|
Andrew | Jones | aj@email.com | |
Sarah | Bennet | sb@email.com |
Note that: no records from the second resultant set are included in any way and that all records from the first resultant set are included except John Smith, whom was removed because that record also exists in the second resultant set.