Joining Data in SQL - Edited
Joining Data in SQL - Edited
Hello, my name is Maham Faisal Khan, and I'll be your instructor on a journey
through the world of SQL joins. Joining data is an essential skill which enables us
to draw information from separate tables together into a single, meaningful set of
results. Although this course focuses on PostgreSQL, the material here applies to
multiple forms of SQL.
We'll begin with INNER JOIN, which along with LEFT JOIN makes up one of
the two most common joins. Two tables named "left_table" and "right_table" are
shown. Matching values of "id" in both tables are shown in the same color. In both
tables, the id field is the key. A key is a single column or group of columns that
uniquely identifies records in a table. Both tables also have fields called
left_val and right_val, which will help visualize records returned after a join.
Diagram for an INNER JOIN on the id field
• INNER JOIN looks for records in both tables which match on a given field
With SQL joins, you can join on a key field, or any other field. The INNER JOIN shown looks for
records in both tables with the same values in the key field, id. Arrows indicate records where the id
matches.
Diagram for an INNER JOIN ON the id field
In this diagram, we focus on records that match on id and fade out any records that are not of interest to
our INNER JOIN.
The result is records with ids 1 and 4.
Throughout this course, we'll work with a database of world leaders. Our database schema is displayed
here. It contains the presidents, prime_ministers and monarchs tables, as well as the states table
containing independence years, and the prime_minister_terms table, providing years the
prime_ministers assumed office.
1.2. At the president’s table
Let's start querying our database by selecting all fields from the president’s table. Recall
the use of the asterisk in SQL to select all fields quickly.
1.3. Meet the prime ministers
Suppose that we're interested in countries that have both a president and prime minister. In these small
tables, we might notice that Egypt, Portugal, Pakistan and India appear in both tables. For larger tables
though, it's not as feasible to identify these records by visual inspection. Here's where SQL joins can be
helpful.
1.5. Our first INNER JOIN
Note. The table.column_name format must be used when selecting columns that exist in both tables to
avoid a SQL error.
The syntax for performing an INNER JOIN on our tables is shown step-by-step. It is common to begin
constructing the query with the join first, and selecting fields later. After FROM, we list the left table,
followed by the INNER JOIN keyword and the right table. We then specify the field to match the tables
on, using the ON keyword. Here, we use the "country" field. Lastly, we add SELECT at the start of the
query and choose the fields we want returned. When selecting columns that exist in both tables, such
as "country" and "continent", the table-dot-column_name format must be used to avoid a SQL error.
The result of our query is displayed: four countries in our database have both presidents and prime
ministers.
1.6. Aliasing tables
In our INNER JOIN, we've had to type out "president" and "prime_minister" several times. Luckily, we
can alias table names using the same AS keyword used to alias column names. Here, we use aliases p1
and p2 in both our SELECT and ON clauses to shorten our query.
1.7. Using USING
Before we end this lesson, let's add one more SQL shortcut to make our query even more concise.
When joining on two identical column names, we can employ the USING command followed by the
shared column name in parentheses. Here, since the join field is named "country" in both tables, we
can use USING.
Exercise
Your first join
Throughout this course, you'll be working with the countries database, which contains information
about the most populous cities in the world, along with country-level economic, population, and
geographic data. The database also contains information on languages spoken in each country.
You can see the different tables in this database to get a sense of what they contain by clicking on
the corresponding tabs. Familiarize yourself with the fields that seem to be shared across tables
before you continue with the course.
In this exercise, you'll use the cities and countries tables to build your first inner join. You'll start
off by selecting all columns in step 1, performing your join in step 2, and then refining your join
to choose specific columns in step 3.
Instructions 1/3:
• Begin by selecting all columns from the cities table, using the SQL shortcut that selects all.
• -- Select all columns from cities
• SELECT *
• FROM cities;
Instructions 2/3:
• Perform an inner join with the cities table on the left and the countries table on the right; you
do not need to alias tables here.
• Join ONthe country_code and codecolumns, making sure you identify them with the correct
table.
• SELECT *
• FROM cities
• -- Inner join to countries
• INNER JOIN countries
• -- Match on country codes
• ON cities.country_code=countries.code;
Instructions 3/3:
• Complete the SELECT statement to keep three columns: the name of the city, the name of
the country, and the region the country is located in (in this order).
• Alias the name of the city AS city and the name of the country AS country.
• -- Select name fields (with alias) and region
• SELECT cities.name AS city, countries.name AS country, coun
tries.region
• FROM cities
• INNER JOIN countries
• ON cities.country_code = countries.code;
Exercise
Joining with aliased tables
Table aliases are helpful in allowing you to reference them in other parts of your query, like the
SELECT statement.
When you SELECT fields, a field can be ambiguous. For example, imagine two tables, apples and
oranges, both containing a column called color. You need to use the syntax apples.color or
oranges.color in your SELECT statement to point SQL to the correct table. Without this, you would
get the following error:
You'll practice joining with aliased tables using data from both the countries and economies tables
to examine the inflation rate in 2010 and 2015.
When writing joins, many SQL users prefer to write the SELECT statement after writing the join
code, in case the SELECT statement requires using table aliases.
Instructions
• Start with your inner join in line 5; join the tables countries AS c (left) with economies
(right), aliasing economies AS e.
• Next, use code as your joining field in line 7; do not use the USING command here.
• Lastly, select the following columns in order in line 2: code from the countries table
(aliased as country_code), name, year, and inflation_rate.
-- Select fields with aliases
SELECT c.code AS country_code, name, year, inflation_rate
FROM countries AS c
-- Join to economies (alias e)
INNER JOIN economies AS e
-- Match on code field using table aliases
ON c.code=e.code;
Exercise
USING in action
In the previous exercises, you performed your joins using the ON keyword. Recall that when both
the field names being joined on are the same, you can take advantage of the USING clause.
You'll now explore the languages table from our database. Which languages are official languages,
and which ones are unofficial?
You'll employ USING to simplify your query as you explore this question.
Instructions
• Use the country code field to complete the INNER JOIN with USING; do not change any
alias names.
SELECT c.name AS country, l.name AS language, official
FROM countries AS c
INNER JOIN languages AS l
-- Match using the code column
USING (code);
2. Defining relationships
The first type of relationship we'll talk about is a one-to-many relationship. This is the most common
type of relationship, one where a single entity can be associated with several entities. Think about a
music library. One artist can produce many songs over their career. This is a one-to- many relationship.
The same applies for authors and their books, directors and movie titles, and so on.
Jane Austen wrote many books, and each of her books had just one author. In the table schema shown,
each of her works are stored in the books table, and can be merged with the authors table using author_id
from both tables.
2.2. One-to-one relationships
Imagine fingerprint scanning at airport border control. Border control might have two tables in a
database: a table of people and a table of their prints. Each person's fingerprint could be merged
with the individuals table using their passport number. We won't cover this in detail here, but using
a binary format is one way to store images, such as fingerprints, in a database. Note here that
although an individual will have four fingerprints stored in the fingerprints table, this is not a one-
to- many relationship. It is a one-to-one relationship, because a single individual - represented as
one record on the individuals table - will have one set of fingerprints - represented as one record
on the fingerprints table. The four different fingerprints are four different fields in each record.
Exercise
Inspecting a relationship
You've just identified that the countries table has a many-to-many relationship with the languages
table. That is, many languages can be spoken in a country, and a language can be spoken in many
countries.
But, what is the best way to query all the different languages spoken in a country? Or, all the
countries that speak a certain language?
Instructions 1/4:
• Select the country name, aliased as country, from the countries table.
-- Select country (aliased) from countries
SELECT name AS country
FROM countries;
Instructions 2/4:
• Now add an alias c for the countries table and perform an inner join with the languages
table, l, on the right; join on code in line 8 with the USING keyword; include the language
name, aliased as language.
• -- Select country and language names (aliased)
• SELECT c.name AS country, l.name AS language
• -- From countries (aliased)
• FROM countries AS c
• -- Join to languages (aliased)
• INNER JOIN languages AS l
-- Use code as the joining field with the USING keyword
USING (code);
Instructions 3/4:
• Add a WHERE clause to find how many countries speak the language 'Bhojpuri'.
-- Select country and language name (aliased)
SELECT c.name AS country, l.name AS language
-- From countries (aliased)
FROM countries AS c
-- Join to languages (aliased)
INNER JOIN languages AS l
-- Use code as the joining field with the USING keyword
USING(code)
-- Filter for the Bhojpuri language
WHERE l.name='Bhojpuri';
3. Multiple joins
We've seen a few inner joins and thought about the relationships tables can have to each other. Now it's
time to level up our joins by building on these ideas.
3.1. Joins on joins
Note. Depending on the use case, left_table or right_table can be used in the ON clause.
A powerful feature of SQL is that multiple joins can be combined and run in a single query. Let's have
a look at some syntax for multiple joins. We begin with the same INNER JOIN as before, and then
chain another INNER JOIN to the result of our first INNER JOIN. Notice that we use left_table.id in the
last line of this example. If we want to perform the second join using the id field of right_table rather than
left_table, we can replace left_table.id with right_table.id in the final line.
The prime_minister_terms table from our database of world leaders
Thinking back to our example of world leaders, let's say that we are interested in identifying
countries of the world that have both a president and a prime minister, and want to know the year each
prime minister came into office. Let's have a look at the prime_minister_terms table from our database of
world leaders. It contains a prime_minister column with prime minister names as well as a pm_start
column containing the start year of a prime minister's term.
3.2. What to join first?
Let's apply the new syntax we have learned to our world leaders example. We can begin as before with
an INNER JOIN of the presidents and prime_ministers tables, using our aliasing skills to shorten the
length of the query. Recall from our first lesson that the result set of this join was a combined table
showing countries with both a prime_minister and a president.
3.3. Chaining joins
If we now join the result of our first INNER JOIN with our prime_minister_terms table on the
prime_minister field, the final result will be only those countries with both prime_ministers and
presidents, along with the years their prime_ministers assumed office. Notice that the
prime_minister_terms table has 10 records, whereas our final result only has four records. This is
because only four records from the prime_minister_terms table match the result of our first INNER
JOIN.
Here is a look at the final SQL query for chaining a second INNER JOIN to our first INNER JOIN.
Notice that the syntax for the second join is exactly the same as for the first join, but we use prime_minister
as the field to join on instead. We can continue the chain and join as many tables together as we need!
3.4. What are we joining ON?
We can limit the records returned by supplying an additional field to join on by adding the AND
keyword to our ON clause. In this example, we join on date, a frequently used second column when
joining on multiple fields. The result set now contains records that match on both id AND date.
Exercise
Joining multiple tables
You've seen that the ability to combine multiple joins using a single query is a powerful feature of
SQL.
Suppose you are interested in the relationship between fertility and unemployment rates. Your task
in this exercise is to join tables to return the country name, year, fertility rate, and unemployment
rate in a single result from the countries, populations and economies tables.
Instructions 1/2:
Do an inner join of countries AS c (left) with populations AS p (right), on code.
• Select name and fertility_rate.
• -- Select relevant fields
• SELECT c.name, p.fertility_rate
• -- Inner join countries and populations, aliased, on code
• FROM countries AS c
• INNER JOIN populations AS p
• ON c.code=p.country_code;
Instructions 2/2:
• Chain an inner join with the economies table AS e, on code.
• Select year and unemployment_rate from the economies table.
-- Select fields
SELECT name, e.year, p.fertility_rate, e.unemployment_rate
FROM countries AS c
INNER JOIN populations AS p
ON c.code = p.country_code
-- Join to economies (as e)
INNER JOIN economies AS e
-- Match on country code
ON c.code=e.code;
Exercise
Checking multi-table joins
Have a look at the results for Albania from the previous query below. You can see that the
2015 fertility_rate has been paired with 2010 unemployment_rate, and vice versa.
Instead of four records, the query should return two: one for each year. The last join was
performed on c.code = e.code, without also joining on year. Your task in this exercise is to fix
your query by explicitly stating that both the country code and year should match!
Instructions
• Modify your query so that you are joining to economies on year as well as code.
-
- Add an additional joining condition such that you are als o
joining on year
AND p.year=e.year;
4. LEFT and RIGHT JOINs
• LEFT JOIN will return all records in the left table, and those records in the
right table that match on the joining field provided
We'll now compare INNER JOIN with three types of outer joins, beginning with LEFT JOIN. Outer
joins can obtain records from other tables, even if matches are not found for the field being joined on. A
LEFT JOIN will return all records in the left_table, and those records in the right_table that match on the
joining field provided. In the diagram shown, the values of 2 and 3 do not appear in the id field of
right_table but will still be retained in the join. Records that are not of interest to a LEFT JOIN on the id
field have been faded out.
Let's go back to our world leaders example. Say we want our query to include all countries with prime
ministers, presidents if they happen to have them, and missing values if they don't. LEFT JOIN will give
us the results we need! The syntax of LEFT JOIN is very similar to INNER JOIN. Only the word
INNER is replaced with LEFT. Note that LEFT JOIN can also be written as LEFT OUTER JOIN in
SQL. The first three records in the result are the same as they were with an INNER JOIN, but from the
fourth record the result starts to look different. Since the United Kingdom does not have a president, a
corresponding null value is returned in the president field.
4.2. RIGHT JOIN
On to RIGHT JOIN! RIGHT JOIN is the second type of outer join, and is much less common than
LEFT JOIN so we won't spend as much time on it here. Instead of matching entries in the id column of
the left table to the id column of the right table, a RIGHT JOIN does the reverse. All records are retained
from right_table, even when id doesn't find a corresponding match in left_table. Null values are returned
for the left_value field in records that do not find a match.
Generic syntax for a RIGHT JOIN is shown. Note that the order of left_table and right_table is the same
as in LEFT JOIN. The only change is that we call RIGHT JOIN instead of LEFT JOIN. RIGHT JOIN
can also be written as RIGHT OUTER JOIN in SQL.
4.2.2. RIGHT JOIN with presidents and prime ministers
Let's make this concrete with our world leaders example. We perform a right join of prime_ministers on
the left and presidents on the right. The only change is from the LEFT JOIN keyword to RIGHT JOIN.
The result contains null values where countries have presidents but no prime ministers.
Now that you're familiar with both LEFT and RIGHT JOIN, let's discuss why RIGHT JOIN is less
commonly used. A key reason for this is that a RIGHT JOIN can always be re-written as a LEFT JOIN.
Because we typically type from left to right, LEFT JOIN feels more intuitive to most users when
constructing queries.
Exercise
This is a LEFT JOIN, right?
Nice work getting to grips with the structure of joins! In this exercise, you'll explore the differences
between INNER JOIN and LEFT JOIN. This will help you decide which type of join to use.
You'll begin with an INNER JOIN with the cities table (left) and countries table (right). This helps
if you are interested only in records where a country is present in both tables.
You'll then change to a LEFT JOIN. This helps if you're interested in returning all countries in the
cities table, whether or not they have a match in the countries table.
Instructions 1/2:
• Perform an inner join with cities AS c1 on the left and countries as c2 on the right.
• Use code as the field to merge your tables on.
SELECT
c1.name AS city,
code,
c2.name AS country,
region,
city_proper_pop
FROM cities AS c1
-
- Perform an inner join with cities as c1 and countries as c2
on country code
INNER JOIN countries AS c2
ON c1.country_code=c2.code
ORDER BY code DESC;
Instructions 2/2:
• Change the code to perform a LEFT JOIN instead of an INNER JOIN.
• After executing this query, have a look at how many records the query result contains.
• SELECT
• c1.name AS city,
• code,
• c2.name AS country,
• region,
• city_proper_pop
• FROM cities AS c1
• -- Join right table (with alias)
• LEFT JOIN countries AS c2
• ON c1.country_code = c2.code
• ORDER BY code DESC;
Exercise
Building on your LEFT JOIN
You'll now revisit the use of the AVG() function introduced in a previous course.
Being able to build more than one SQL function into your query will enable you to write compact,
supercharged queries.
You will use AVG() in combination with a LEFT JOIN to determine the average gross domestic
product (GDP) per capita by region in 2010.
Instructions 1/3:
• Complete the LEFT JOIN with the countries table on the left and the economies table on
the right on the code field.
• Filter the records from the year 2010.
SELECT name, region, gdp_percapita
FROM countries AS c
LEFT JOIN economies AS e
-- Match on code fields
USING(code)
-- Filter for the year 2010
WHERE year = 2010;
Instructions 2/3:
• To calculate per capita GDP per region, begin by grouping by region.
• After your GROUP BY, choose region in your SELECT statement, followed by average
GDP per capita using the AVG() function, with AS avg_gdp as your alias.
Exercise
Is this RIGHT?
You learned that right joins are not used as commonly as left joins. A key reason for this is that
right joins can always be re-written as left joins, and because joins are typically typed from left to
right, joining from the left feels more intuitive when constructing queries.
It can be tricky to wrap one's head around when left and right joins return equivalent results. You'll
explore this in this exercise!
Instructions
• Write a new query using RIGHT JOIN that produces an identical result to the LEFT JOIN
provided.
• -- Modify this query to use RIGHT JOIN instead of LEFT JOIN
• SELECT countries.name AS country, languages.name AS languag
e, percent
• FROM languages
• RIGHT JOIN countries
• USING(code)
• ORDER BY language;
5. FULL JOINs
A FULL JOIN combines a LEFT JOIN and a RIGHT JOIN. As you can see in this diagram, no values
are faded out as they were in earlier diagrams. This is because the FULL JOIN will return all ids,
irrespective of whether they have a match in the other table being joined.
5.1.1. FULL JOIN diagram
Let's have a look at the result after FULL JOIN. We see that it has retained all ids, returning missing values
in the form of nulls for fields corresponding to records where id did not find a match. All six ids we have
been working with are included in the result after the FULL JOIN. Note that this time, nulls can appear
in either left_value or right_value fields.
Time for some SQL code. In order to produce a FULL JOIN, the general format aligns closely with the
SQL syntax we've been using for INNER JOIN, LEFT JOIN and RIGHT JOIN. We adapt our join to
have the word 'FULL' before 'JOIN'. Note that the keyword FULL OUTER JOIN can also be used to
return the same result.
5.1.3. FULL JOIN example using leaders database
Let's make our sample syntax concrete with our example from the leaders database. Suppose we
were interested in all the countries in our database, and wanted to check whether they had a
president, a prime minister, or both. We'll walk through the code line by line to do this using a
FULL JOIN. The SELECT statement starts us off by including the country, as well as the
prime_minister and president fields.
Next, we specify prime_ministers as our left table and alias this as p1. Note that order of the tables
matters here, and if we switched the order, the records would be ordered differently depending on
how prime ministers and presidents are ordered in the tables.
We then add the FULL JOIN query and add presidents as the right table, using alias p2.
Lastly, the join is performed using country as the field to join on in both tables. We LIMIT to the
first 10 records.
Here is a look at our result. Note that there are null values in both the prime_minister and president fields.
We chose a FULL JOIN because we were interested in all countries, whether they had a prime minister,
a president, or both.
Exercise
Comparing joins
In this exercise, you'll examine how results can differ when performing a full join compared to a
left join and inner join by joining the countries and currencies tables. You'll be focusing on the
North American region and records where the name of the country is missing.
You'll begin with a full join with countries on the left and currencies on the right. Recall the
workings of a full join with the diagram below!
You'll then complete a similar left join and conclude with an inner join, observing the results you
see along the way.
Instructions 1/3:
• Perform a full join with countries (left) and currencies (right).
• Filter for the North America region or NULL country names.
Instructions 2/3:
• Repeat the same query as before, turning your full join into a left join with the currencies
table.
• Have a look at what has changed in the output by comparing it to the full join result.
Exercise
Chaining FULL JOINs
As you have seen in the previous chapter on INNER JOIN, it is possible to chain joins in SQL,
such as when looking to connect data from more than two tables.
Suppose you are doing some research on Melanesia and Micronesia, and are interested in pulling
information about languages and currencies into the data we see for these regions in the countries
table. Since languages and currencies exist in separate tables, this will require two consecutive full
joins involving the countries, languages and currencies tables.
Instructions
• Complete the FULL JOIN with countries as c1 on the left and languages as l on the right,
using code to perform this join.
• Next, chain this join with another FULL JOIN, placing currencies on the right, joining on
code again.
• SELECT
• c1.name AS country,
• region,
• l.name AS language,
• basic_unit,
• frac_unit
• FROM countries as c1
• -- Full join with languages (alias as l)
• FULL JOIN languages AS l
• Using (code)
• -- Full join with currencies (alias as c2)
• FULL JOIN currencies as c2
• Using (code)
• WHERE region LIKE 'M%esia';
6. Crossing into CROSS JOIN
CROSS JOINs are slightly different than joins we have seen previously: they create all possible
combinations of two tables. Let's explore the diagram for a CROSS JOIN. In this diagram we have two
tables named table1 and table2, with one field each: id1 and id2, respectively. The result of the CROSS
JOIN is all nine combinations of the id values of 1, 2, and 3 in table1 with the id values of A, B, and C
for table2.
6.1.1. CROSS JOIN syntax
Let's have a look at the syntax for CROSS JOIN. Note that the syntax is very minimal, and we do not
specify ON or USING with CROSS JOIN.
6.1.2. Pairing prime ministers with presidents
Let's apply the SQL syntax for a CROSS JOIN to the following problem that uses our database of global
leaders. Suppose that all prime ministers in Asia from our database are scheduled for individual
meetings with all presidents in South America from our database, and we are journalists who wish to
follow all the meetings that will happen. We can create a query that returns all these combinations using
a CROSS JOIN! We use a WHERE clause to focus only on prime ministers from Asia and presidents
from South America. The results of the query give us all possible pairings of the four prime ministers
from Asia in the prime_ministers table, and the two presidents from South America in the presidents
table.
Exercise
Histories and languages
Well, done getting to know all about CROSS JOIN! As you have learned, CROSS JOIN can be
incredibly helpful when asking questions that involve looking at all possible combinations or
pairings between two sets of data.
Imagine you are a researcher interested in the languages spoken in two countries: Pakistan and
India. You are interested in asking:
In this exercise, we will explore how INNER JOIN and CROSS JOIN can help us answer these
two questions, respectively.
Instructions 1/2:
• Complete the code to perform an INNER JOIN of countries AS c with languages AS l
using the code field to obtain the languages currently spoken in the two countries.
• SELECT c.name AS country, l.name AS language
• -- Inner join countries as c with languages as l on code
• FROM countries as c
• INNER JOIN languages as l
• USING (code)
• WHERE c.code IN ('PAK','IND')
• AND l.code in ('PAK','IND');
Instructions 2/2:
• Change your INNER JOIN to a different kind of join to look at possible combinations of
languages that could have been spoken in the two countries given their history.
• Observe the differences in output for both joins.
• SELECT c.name AS country, l.name AS language
• FROM countries AS c
• -- Perform a cross join to languages (alias as l)
• CROSS JOIN languages AS l
• WHERE c.code in ('PAK','IND')
• AND l.code in ('PAK','IND');
Exercise
Choosing your join
Now that you're fully equipped to use joins, try a challenge problem to test your knowledge!
You will determine the names of the five countries and their respective regions with the lowest life
expectancy for the year 2010. Use your knowledge about joins, filtering, sorting and limiting to
create this list!
Instructions:
• Complete the join of countries AS c with populations as p.
• Filter on the year 2010.
• Sort your results by life expectancy in ascending order.
• Limit the result to five countries.
• SELECT
• c.name AS country,
• region,
• life_expectancy AS life_exp
• FROM countries AS c
• -
- Join to populations (alias as p) using an appropriate joi
n
• LEFT JOIN populations AS p
• ON c.code = p.country_code
• -- Filter for only results in the year 2010
• WHERE year = 2010
• -- Sort by life_exp
• ORDER BY life_exp ASC
• -- Limit to five records
• limit 5;
7. Self joins
• Self joins are tables joined with themselves
• They can used to compare parts of the same table
Joining a table to itself may seem like an unusual thing to do. Why would we want to do that? Self joins
are used to compare values from part of a table to other values from within the same table. Recall the
prime_ministers table from earlier. Suppose all prime ministers are convening in summits on their own
continents. We want to create a new table showing all countries in the same continent as pairs.
7.1. Prime minister, meet prime minister
Self joins don't have dedicated syntax as other joins we have seen do. We can't just write SELF JOIN in
SQL code, for example. In addition, aliasing is required for a self join. Let's look at a chunk of INNER
JOIN code using the prime_ministers table. The country column is selected twice, and so is the
continent column. The prime_ministers table is on both the left and the right of the JOIN, making this
both a self join and an INNER JOIN! The vital step here is setting the joining fields which we use to match
the table to itself. For each country, we will find multiple matched countries in the right table, since we
are joining on continent. Each of these matched countries will be returned as pairs. Since this query will
return several records, we use LIMIT to return only the first 10 records.
The results are a pairing of each country with every other country in the same continent. However, note
that our join also paired countries with themselves, since they too are in the same continent as
themselves. We don't want to include these, since a leader from Portugal does not need to meet with
themselves, for example. Let's fix this.
Recall the use of the AND clause to ensure multiple conditions are met in the ON clause. In our second
condition, we use the not equal to operator to exclude records where the p1-dot-country and p2-dot-
country fields are identical.
Here's a look at our final table, showing combinations of countries in our database that are in the
same continent, but excluding records where the two country fields are the same.
Exercise
Comparing a country to itself
Self joins are very useful for comparing data from one part of a table with another part of the same
table. Suppose you are interested in finding out how much the populations for each country
changed from 2010 to 2015. You can visualize this change by performing a self join.
In this exercise, you'll work to answer this question by joining the populations table with itself.
Recall that, with self joins, tables must be aliased. Use this as an opportunity to practice your
aliasing!
Since you'll be joining the populations table to itself, you can alias populations first as p1 and again
as p2. This is good practice whenever you are aliasing tables with the same first letter.
Instructions 1/2:
• Perform an inner join of populations with itself ON country_code, aliased p1 and p2
respectively.
• Select the country_code from p1 and the size field from both p1 and p2, aliasing p1.size as
size2010 and p2.size as size2015 (in that order).
-- Select aliased fields from populations as p1
SELECT p1.country_code, p1.size AS size2010, p2.size AS size2015
-
- Join populations as p1 to itself, alias as p2, on country code
FROM populations AS p1
INNER JOIN populations AS p2
USING (country_code)
Instructions 2/2:
• Since you want to compare records from 2010 and 2015, eliminate unwanted records by
extending the WHERE statement to include only records where the p1.year matches
p2.year -5.
SELECT
p1.country_code,
p1.size AS size2010,
p2.size AS size2015
FROM populations AS p1
INNER JOIN populations AS p2
ON p1.country_code = p2.country_code
WHERE p1.year = 2010
-
- Filter such that p1.year is always five years before p2.y
ear
AND p1.year <> p2.year;
8. Set theory for SQL Joins
SQL has three main set operations, UNION, INTERSECT and EXCEPT. The Venn diagrams shown
visualize the differences between them. We can think of each circle as representing a table. The green
parts represent what is included after the set operation is performed on each pair of tables.
In this lesson, we will cover UNION and UNION ALL. Let's take a closer look at how they work.
8.2. UNION diagram
• UNION takes two tables as input, and returns all records from both tables
In SQL, the UNION operator takes two tables as input, and returns all records from both tables. The
diagram shows two tables: left and right, and performing a UNION returns all records in each table. If
two records are identical, UNION only returns them once. To illustrate this, the first two records of the
right table have been faded out. Our result has seven records.
8.3. UNION ALL diagram
• UNION ALL takes two tables and returns all records from both tables, including
duplicates
In SQL, there is a further operator for unions called UNION ALL. In contrast to UNION, given the
same two tables, UNION ALL will include duplicate records. Therefore, performing UNION ALL on
this data will return nine records, whereas UNION only returned seven. No records have been faded out.
UNION syntax
UNION ALL syntax
The syntax for performing all the set operations in this chapter is highly similar. We perform a SELECT
statement on our first table, a SELECT statement on our second table, and specify a set operation (in
this example, either UNION or UNION ALL) between them. Note that set operations do not require a
field to join ON. This is because they do not do quite the same thing as join operations we covered in
earlier chapters. Rather than comparing and merging tables on the left and right, they stack fields on top
of one another.
For all set operations, the number of selected columns and their respective data types must be identical.
For instance, we can't stack a number field on top of a character field. The result will only use field names
(or aliases, if used) of the first SELECT statement in the query.
8.5. To the monarchs table
We now return to our world leaders example. In this example, we use one more table from our leaders
database: the monarchs table. Of course, this table is only a sample of monarchs, and not a
comprehensive list of monarchs across the world.
8.5.1. Prime ministers, meet the monarchs
We can use UNION on the monarchs and prime_ministers tables to show all the different prime
ministers and monarchs in these two tables. Note that the monarch field has been aliased as 'leader'.
As we have learned, the monarch and prime_minister fields will be combined under the leader field, even
though we only aliased the monarch field. We'll LIMIT to the first 10 results so that they fit on our screen.
This is where a UNION ALL is helpful! A UNION ALL will tell us if there are any monarchs who also
act as prime ministers. The syntax for this is shown.
8.6.1. UNION ALL result
After UNION ALL, Brunei and Oman are now listed twice.
Exercise
Comparing global economies
Are you ready to perform your first set operation?
In this exercise, you have two tables, economies2015 and economies2019, available to you under
the tabs in the console. You'll perform a set operation to stack all records in these two tables on
top of each other, excluding duplicates.
When drafting queries containing set operations, it is often helpful to write the queries on either
side of the operation first, and then call the set operator. The instructions are ordered accordingly.
Instructions:
• Begin your query by selecting all fields from economies2015.
• Create a second query that selects all fields from economies2019.
• Perform a set operation to combine the two queries you just created, ensuring you do not
return duplicates.
• -- Select all fields from economies2015
• SELECT *
• FROM economies2015
• -- Set operation
• UNION
• -- Select all fields from economies2019
• SELECT *
• FROM economies2019
• ORDER BY code, year;
Exercise
Comparing two set operations
You learned in the video exercise that UNION ALL returns duplicates, whereas UNION does not.
In this exercise, you will dive deeper into this, looking at cases for when UNION is appropriate
compared to UNION ALL.
You will be looking at combinations of country code and year from the economies and populations
tables.
Instructions 1/2:
• Perform an appropriate set operation that determines all pairs of country code and year (in
that order) from economies and populations, excluding duplicates.
• Order by country code and year.
• -
- Query that determines all pairs of code and year from eco
nomies and populations, without duplicates
• SELECT code, year
• FROM economies
• UNION
• SELECT country_code AS code, year
• FROM populations
• ORDER BY code,year
Instructions 2/2:
• Amend the query to return all combinations (including duplicates) of country code and year
in the economies or the populations tables.
INTERSECT takes two tables as input, and returns only the records that exist in both tables.
In the diagram shown, we have two tables, left_table and right_table. The result of performing
INTERSECT on these tables is only the records common to both tables: the first record. All records that
are not of interest to the INTERSECT operation are faded out.
9.1.2. INTERSECT syntax
The syntax for this set operation is very similar to that of UNION and UNION ALL. We perform a
SELECT statement on our first table, a SELECT statement on our second table, and specify our set
operator between them.
Let's compare INTERSECT to performing an INNER JOIN on two fields with identical field names.
Similar to UNION, for a record to be returned, INTERSECT requires all fields to match, since in set
operations we do not specify any fields to match on. This is also why it requires the left and right table
to have the same number of columns in order to compare records. In the figure shown, only records
where both columns match are returned. In INNER JOIN, similar to INTERSECT, only results where
both fields match are returned. INNER JOIN will return duplicate values, whereas INTERSECT will
only return common records once. As we know from earlier lessons, INNER JOIN will add more
columns to the result set.
9.3. Countries with prime ministers and presidents
Let's have a look at how we can use INTERSECT to determine all countries that have both a prime
minister and a president. Each SELECT statement in our query has the same number of columns, of
identical data types, in order for the set operation to be performed. The result of the query is the four
countries with both a prime minister and a president in our leaders database. As with UNION, the result
set uses the field names provided for the left table, whether aliased or not.
9.4. INTERSECT on two fields
Next, let's think about what would happen if we selected two columns, country and prime_minister,
instead of just country, as in our previous example. The code shown does just that. What will the result
of this query be? Will this also give us the names of countries that have both a prime minister and a
president? The actual result is an empty table. Why is that? As we saw in our INTERSECT diagrams,
INTERSECT requires data from both fields in the left table to match their corresponding fields in the
right table. The search did not find any countries where the prime minister and president share the same
name.
9.5. Countries with prime ministers and monarchs
However, recall that there are monarchs in our database who also act as prime_ministers. In the
example shown, performing an INTERSECT on the prime_ministers and monarchs tables using both
country and leader name does not return an empty result.
Exercise
INTERSECT
Let's say you are interested in those countries that share names with cities. Use this task as an
opportunity to show off your knowledge of set theory in SQL!
Instructions:
• Return all city names that are also country names.
• -- Return all cities with the same name as a country
• SELECT name
• FROM countries
• INTERSECT
• SELECT name
• FROM cities;
10. EXCEPT
EXCEPT allows us to identify the records that are present in one table, but not the other. More
specifically, it retains only records from the left table that are not present in the right table.
The diagram shown illustrates the workings of the EXCEPT operation. All records that are not of interest
to the EXCEPT operation are faded out. Only the last two records of the left_table arereturned. Note that
while the id 4 does exist in the right_table, the whole record does not match, which is why the last record
of left_table is not faded out.
10.2. EXCEPT syntax
We saw earlier that there are some monarchs that also act as the prime minister for their country.
Let's say we were interested in monarchs that do NOT also hold the title of prime minister. The
EXCEPT clause is really handy for this! The SQL code shown selects the monarch and country
field from monarchs and then looks for common entries in the prime_minister and country fields
in the prime_ministers table, looking to exclude those entries. In the result, we see only the three
monarchs from our leaders database who do not also serve the role of prime minister.
Exercise
You've got it, EXCEPT...
Just as you were able to leverage INTERSECT to find the names of cities with the same names as
countries, you can also do the reverse, using EXCEPT.
In this exercise, you will find the names of cities that do not have the same names as their countries.
Instructions:
• Return all cities that do not have the same name as a country.
-
- Return all cities that do not have the same name as a cou
ntry
SELECT name
FROM cities
EXCEPT
SELECT name
FROM countries
ORDER BY name;
The six joins we've worked with so far are all "additive" in that they add columns to the original "left"
table. The diagram shows an INNER JOIN on the id field, showing an additional column being added
to the original left table.
Here is some familiar INNER JOIN syntax for this join on the id field. Lets dive deeper into what it means
for this join to be "additive".
Fields with different names are added to the result set with their original names. Fields with the same
name, such as date, are added to the result set too, but notice how we now have duplicate date columns
with the same name. This can be changed, however, with aliasing.
11.3. Semi join
A semi join chooses records in the first table where a condition is met in the second table
In this lesson, we'll see ways of joining data that do not expressly use JOIN keywords and are not additive
in the same way. Instead of using JOIN or set operators, we can leverage the WHERE clause to specify
the records to include. We'll use the two tables, left_table and right_table, in the diagram shown. A semi
join chooses records in the first table where a condition is met in the second table.
More specifically, the semi join will return all values of left_table where values of col1 are in a column
we specify, namely col2 in the right_table. Records not of interest to the semi join have been faded out.
The final result of our semi join is records corresponding to ids 2 and 3.
11.3.1. Kicking off our semi join
Let's make this concrete with an example. Suppose we are interested in determining the presidents of
countries that gained independence before 1800. We select the fields country, continent, and president.
How do we filter this data for independence year, which is a field in the states table? We'll use a semi
join!
11.3.2. Building on our semi join
Before we knew anything about joins, we knew we could leverage the WHERE clause to filter data. The
query that returns countries with indep_year before 1800 is shown. The query returns only Spain and
Portugal from our database.
11.3.3. Finish the semi join (an intro to subqueries)
We can use this list of countries as a filter by embedding it in a further WHERE clause. This is called a
subquery! It chooses records in the first table where the country matches the list returned by our
subquery. Since Spain does not have a president, only the Portuguese president is listed.
Moving on to anti joins! The anti join chooses records in the first table where col1 does NOT find a match
in col2.
The final result of our anti join is records corresponding to ids 1 and 4. Again, no new columns are added.
How might we adapt our semi join to determine countries in the Americas founded after 1800? To
change our semi join from before to after 1800, we add NOT before the IN statement. We call this an anti
join! In addition, we add to our first WHERE clause to filter for continents in the Americas. The result
shows presidents of countries in the Americas that gained independence after 1800. Only Chile and
Uruguay are returned, not the USA, which gained independence before 1800. Note again that we are
only using the few countries in our database for illustration; this is not a comprehensive list of countries
in the world!
Exercise
Semi join
Great job getting acquainted with semi joins and anti joins! You are now going to practice using
semi joins.
Let's say you are interested in identifying languages spoken in the Middle East. The languages
table contains information about languages and countries, but it does not tell you what region the
countries belong to. You can build up a semi join by filtering the countries table by a
particular region, and then using this to further filter the languages table.
You'll build up your semi join as you did in the video exercise, block by block, starting with a
selection of countries from the countries table, and then leveraging a WHERE clause to filter the
languages table by this selection.
Instructions 1/3
• Select country code as a single field from the countries table, filtering for countries in the
'Middle East' region.
-- Select country code for countries in the Middle East
SELECT code
FROM countries
WHERE region = 'Middle East';
Instructions 2/3:
• Write a second query to SELECT the name of each unique language appearing in the
languages table; do not use column aliases here.
• Order the result set by name in ascending order.
• -- Select unique language names
• SELECT DISTINCT(NAME)
• FROM languages
• -- Order by the name of the language
• ORDER BY name ASC;
Instructions 3/3:
• Create a semi join out of the two queries you've written, which filters unique languages
returned in the first query for only those languages spoken in the 'Middle East'.
• SELECT DISTINCT name
• FROM languages
• -- Add syntax to use bracketed subquery below as a filter
• WHERE code IN
• (SELECT code
• FROM countries
• WHERE region = 'Middle East')
• ORDER BY name;
Exercise
Diagnosing problems using anti join
Nice work on semi joins! The anti join is a related and powerful joining tool. It can be particularly
useful for identifying whether an incorrect number of records appears in a join.
Say you are interested in identifying currencies of Oceanian countries. You have written the
following INNER JOIN, which returns 15 records. Now, you want to ensure that all Oceanian
countries from the countries table are included in this result. You'll do this in the first step.
If there are any Oceanian countries excluded in this INNER JOIN, you want to return the names
of these countries. You'll write an anti join to this in the second step!
Instructions 1/2:
• Begin by writing a query to return the code and name (in order, not aliased) for all countries
in the continent of Oceania from the countries table.
• Observe the number of records returned and compare this with the provided INNER JOIN,
which returns 15 records.
• -- Select code and name of countries from Oceania
• SELECT code,name
• FROM countries
• WHERE continent = 'Oceania';
Instructions 2/2:
• Now, build on your query to complete your anti join, by adding an additional filter to return
every country code that is not included in the currencies table.
In this lesson, we'll dive deeper into embedding queries within queries. We'll revisit subqueries within a
WHERE clause, and then go on to subqueries within a SELECT statement.
• All semi joins and anti joins we have seen included a subquery in WHERE
• WHERE is the most common place for subqueries
The semi joins and anti joins we have seen so far involve subqueries inside WHERE. The WHERE
clause is the most common place for subqueries, because filtering data is one of the most common
data manipulation tasks. Let's revisit this with some generic syntax and understand the nuances of this
type of subquery. Have a look at the query shown. Recall that the WHERE IN clause enables us to
provide a list of values to filter on.
As we have seen in the lesson on semi joins, arguments to the IN operator are not limited to lists typed
out by us. We can include a SQL subquery as an argument for the IN operator, provided the result of the
subquery is of the same data type as the field we are filtering on.
The query shown will only work if some_field is of the same data type as some_numeric_field, because
the result of the subquery will be a numeric field. Subqueries inside WHERE can be from the same table
or from a different table, and here, the subquery is from a different table.
12.2. Subqueries inside SELECT
The second most common type of subquery is inside a SELECT clause. Say we want to count the
number of monarchs listed in the monarchs table for each continent in the states table. Again, let's build
up our solution, step by step. The query shown selects each of the five continents in the states table, and
the result is displayed.
In earlier courses we have seen the use of GROUP BY to COUNT data by a group. However, since
our monarchs data lives in a different table than the states table, this would involve a careful join before
the GROUP BY. Let's look at how to do this with a subquery instead. Our subquery requires two things.
First, it needs to COUNT all monarchs. Second, it needs a WHERE statement matching the continent
fields in the two tables. This subquery follows the selection of DISTINCT continents, and will therefore
count all monarchs within them in the SELECT statement. A subquery inside a SELECT statement
like this requires an alias, like monarch_count here. Magic!
Exercise
Subquery inside WHERE
The video pointed out that subqueries inside WHERE can either be from the same table or a
different table. In this exercise, you will nest a subquery from the populations table inside another
query from the same table, populations. Your goal is to figure out which countries had high average
life expectancies in 2015.
You can use SQL to do calculations for you. Suppose you only want records
from 2015 with life_expectancy above 1.15 * avg_life_expectancy. You could use the following
SQL query.
SELECT *
FROM populations
WHERE life_expectancy > 1.15 * avg_life_expectancy
AND year = 2015;
In the first step, you'll write a query to calculate a value for avg_life_expectancy. In the second
step, you will nest this calculation into another query.
Instructions 1/2:
• Begin by calculating the average life expectancy from the populations table.
• Filter your answer to use records from 2015 only.
-
- Select average life_expectancy from the populations table
SELECT avg(life_expectancy)
FROM populations
-- Filter for the year 2015
WHERE year = 2015
Instructions 2/2:
• The answer from your query has now been nested into another query; use this calculation
to filter populations for all records where life_expectancy is 1.15 times higher than average.
SELECT *
FROM populations
WHERE year = 2015
-
- Filter for only those populations where life expectancy i s
1.15 times higher than average
AND life_expectancy > 1.15 *
(SELECT AVG(life_expectancy)
FROM populations
WHERE year = 2015);
Exercise
WHERE do people live?
In this exercise, you will strengthen your knowledge of subquerying by identifying capital cities
in order of largest to smallest population.
Follow the instructions below to get the urban area population for capital cities only. You'll use
the countries and cities tables displayed in the console to help identify columns of interest as you
build your query.
Instructions:
• Return the name, country_code and urbanarea_pop for all capital cities (not aliased).
-- Select relevant fields from cities table
SELECT name, country_code, urbanarea_pop
FROM cities
-- Filter using a subquery on the countries table
WHERE country_code IN
(SELECT code
FROM countries
WHERE cities.name = countries.capital)
ORDER BY urbanarea_pop DESC;
Exercise
Subquery inside SELECT
As explored in the video, there are often multiple ways to produce the same result in SQL. You
saw that subqueries can provide an alternative to joins to obtain the same result.
In this exercise, you'll go further in exploring how some queries can be written using either a join
or a subquery.
In Step 1, you'll begin with a LEFT JOIN combined with a GROUP BY to select the nine countries
with the most cities appearing in the cities table, along with the counts of these cities. In Step 2,
you'll write a query that returns the same result as the join, but leveraging a nested query instead.
Instructions 1/2:
• Write a LEFT JOIN with countries on the left and the cities on the right, joining on country
code.
• In the SELECT statement of your join, include country names as country, and count the
cities in each country, aliased as cities_num.
• Sort by cities_num (descending), and country (ascending), limiting to the first nine records.
-- Find top nine countries with the most cities
SELECT countries.name AS country, COUNT(cities.name) AS cit
ies_num
FROM countries
LEFT JOIN cities
ON countries.code = cities.country_code
GROUP BY countries.name
-- Order by count of cities as cities_num
ORDER BY cities_num DESC, country ASC
-- Limit the results
LIMIT 9;
Instructions 2/2:
• Complete the subquery to return a result equivalent to your LEFT JOIN, counting all cities
in the cities table as cities_num.
• Use the WHERE clause to enable the correct country codes to be matched in the cities and
countries columns.
• SELECT countries.name AS country,
• -- Subquery that provides the count of cities
• (SELECT COUNT(*)
• FROM cities
• WHERE countries.code=cities.country_code) AS cities_num
• FROM countries
• ORDER BY cities_num DESC, country
• LIMIT 9;
The last type of subquery we'll cover is a query inside a FROM clause. Let's dig in!
Let's say we are interested in all the continents with monarchs, along with the most recent country to gain
independence in that continent. A starting SQL query to pull all the most recent independence years by
continent is shown. The query groups records by continent and returns the most recent independence
year for each group, by using the MAX() function AS most_recent. The result set is shown. Now how
do we filter this for continents with monarchs?
We can nest the subquery we initially wrote into a FROM statement, so that we are selecting from both
monarchs and our subquery, aliased as "sub". We also use the WHERE clause as before to identify
records in both tables that match on continent. As we saw on the previous slide, because the continents
from our subquery will find multiple matches in monarchs, this match will return duplicates. As before,
in our SELECT statement, we use the DISTINCT command to drop duplicate continents, and select
sub-dot-most_recent to get the most recent independence year for each continent. Lastly, we ORDER
BY continent. Have a look at our result set. There we have it: the continents with monarchs in our
database, and the most recent year of independence in those continents! That's how we include a
subquery as a temporary table in our FROM clause and then SELECT from it.
Exercise
Subquery inside FROM
Subqueries inside FROM can help select columns from multiple tables in a single query.
Say you are interested in determining the number of languages spoken for each country. You want
to present this information alongside each country's local_name, which is a field only present in
the countries table and not in the languages table. You'll use a subquery inside FROM to bring
information from these two tables together!
Instructions 1/2:
• Begin with a query that groups by each country code from languages, and counts the
languages spoken in each country as lang_num.
• In your SELECT statement, return code and lang_num (in that order).
-- Select code, and language count as lang_num
SELECT code, COUNT(*) AS lang_num
FROM languages
GROUP BY code;
Instructions 2/2:
• Select local_name from countries, with the aliased lang_num from your subquery (which
has been nested and aliased for you as sub).
• Use WHERE to match the code field from countries and sub.
Exercise
Subquery challenge
You're near the finish line! Test your understanding of subquerying with a challenge problem.
Suppose you're interested in analyzing inflation and unemployment rate for certain countries in
2015. You are interested in countries with "Republic" or "Monarchy" as their form of
government.
You will use the field gov_form to filter for these two conditions, which represents a country's
form of government. You can review the different entries for gov_form in the countries table.
Instructions:
• Select country code, inflation_rate, and unemployment_rate from economies.
• Filter code for the set of countries which contain the words "Republic" or "Monarchy" in
their gov_form.
• -- Select relevant fields
• SELECT code, inflation_rate, unemployment_rate
• FROM economies
• WHERE year = 2015
• AND code IN
• -- Subquery returning country codes filtered on gov_form
• (SELECT code
FROM countries
WHERE gov_form LIKE '%Republic%' OR gov_form LIKE '%Monar
chy%')
ORDER BY inflation_rate;
Exercise
Final challenge
You've made it to the final challenge problem! Get ready to tackle this step-by-step.
Your task is to determine the top 10 capital cities in Europe and the Americas by city_perc, a
metric you'll calculate. city_perc is a percentage that calculates the "proper" population in a city
as a percentage of the total population in the wider metro area, as follows: