SQL 1 Novo
SQL 1 Novo
SQL 1 Novo
Some of the steps in the video for this particular lesson have been updated, please see
the lesson notes below for the corrected version.
In this lesson, you will learn how to set up the development environment.
For managing our dependencies we are going to use this tool called “Anaconda.” Essentially, you
can use Anaconda to manage our dependencies and update them very easily.
Choose the version of Anaconda you need based on your current operating system you are using.
Once you have Anaconda downloaded go ahead and follow the setup wizards instructions.
After Anaconda is installed, go ahead and open the Anaconda Navigator application:
This already has all our dependencies bundled into these things called environments.
So select the Environments Tab from the Anaconda menu. At the bottom you can see a
toolbar to create and import environments:
You can change your environment by clicking on one of them on the menu. After you select an
environment, you can see all of the installed packages listed on the right.
The following step has been updated, and differs from the video:
The environment we’re going to use can be downloaded from the Course Home page. After you
download it, you will need to extract the files:
The environment.yml file contains all dependencies of the Python packages we’re going to use.
All you need to do now is go to Anaconda and import the environment. It’ll pull up a window for
you to select the environment file you now have downloaded. Name it ‘sqlite’:
After the environment is done loading, you will now see it in the menu to choose:
If you click on the green arrow next to the imported environment, you will pull up a drop-down
menu, and from it you can open a terminal or a Python interpreter:
What is a Database
A high level definition of a database is that it is a highly efficient way of storing data for querying and
data management. Querying is asking questions about the data I have. Another way to look at
databases visually is to think of them as a spreadsheet workbook such as below.
A worksheet may have many sheets such as Track, Album etc. representing data from music tracks
and the album they are part of. In database lingo, these would correspond to two different tables.
In the worksheet above, the header row (highlighted) tells us what fields(similar to columns in a
database table), are present in the data. The worksheet rows below the header contain the actual
entries (similar to rows in a database table).
Here are the header row (columns) and entries (rows) from the Track sheet (table).
Track, Name, Album, Media, Genre, Composer etc. represent database columns e.g. the
Composer column contains names of all the composers for the various tracks. The data in the
entries below the columns (such as the red highlighted entry) represent rows.
Conceptually we can think of data in a database as being organized as in the spreadsheet where
each sheet represents a table, the sheet header represents the columns and the sheet entries
represent rows. Databases are really fast and efficient at storing and retrieving millions of rows and
we will soon see an experiment that demonstrates the huge time difference between using a
database to answer a query versus using a CSV or text file to answer the same query.
Managing Data
We manage data in a database using SQL (Structured Query Language), also pronounced sequel. We
can use SQL to query the database for data, add rows, change or delete rows. We can also use it to
create new tables.
Kinds of Databases
Its worthwhile discussing why we need a database for storage with SQL for querying as opposed to
just using a CSV file. Lets look at an experiment where we compare these approaches by using
python scripts to retrieve data (list of orders) from a SQLite database Vs a CSV file. Both the
database (single table) and the CSV file have the same order information (approximately 500,000
entries – a small number in comparison to large commercial databases).
The query we will be using is a very classic commonplace query used all the time while buying from
a website : Fetch me the order whose id is <order_id>. We will be using python to ask both the text
file and the database for this information. In the case of the database, we will use the python SQLite
API. We will time both these approaches using the UNIX time command. The results are below.
CSV file
SQLite Database
The SQLite query takes about 64 ms. We can see the huge difference in retrieval time between using
a text file Vs a database for just 500,000 rows. People use databases with millions of rows in the
industry. Databases thus store data very efficiently thus allowing us to perform very powerful
queries against the data stored. This is why we use databases.
Some instructions for this particular lesson have been updated, please see the lesson
notes below for the corrected version.
Let’s get started with SQLite. Let us first download the main project source code. Inside the zip file is
the database file we are going to be working with – chinook.db – a very popular sample database
used to learn about SQL and SQLite. It comes pre-populated with a lot of data and helps exemplify
several SQLite features.
The following has been updated, and differs from the video:
Now go to the Anaconda Navigator -> Environments and ensure that you have imported the
sqlite environment:
Click on the arrow and choose Open Terminal to open a terminal with the sqlite environment. As
you can see, the Anaconda environment name (sqlite) is in parentheses which means we are in the
right environment:
Now let’s navigate to our working directory. A sequence of steps on how to do this in the terminal is
displayed below:
Note: if you’re on Windows, use ‘dir’ instead of the ‘ls’ command to check the files under
a folder. The ‘cd‘ and ‘sqlite3‘ commands work normally in the Windows Prompt terminal.
We see the sqlite version information followed by the sqlite command prompt: sqlite>. We can run
all our queries from this command prompt. Let us first look at some meta commands.
Meta Commands
These are commands (not SQL), that we can use to navigate our database.
COMMAND: .tables
.tables list all tables in the current database. Album, Artist etc. are tables storing information about
music.
As you can see, .schema <table name> lists the command for creating the table, the fields in the
table, their data types (what kind of value are expected). e.g. TrackId is an integer, Name can store a
variable length of characters with a maximum length of 200 i.e. a string etc. Millseconds represents
the length of the track, Bytes represents the size of the track, UnitPrice represents the price of the
track. We’ll look at the FOREIGN KEY clauses later.
The above images show the output of .help for the meta commands we used above. e.g. the
description for .tables states that if we leave out the table name, it will show us the schema for all
tables in the database. These two commands i.e. .tables and .schema allow us to navigate the
database to look for tables and the kind of information that can be stored in these tables.
.read FILENAME is another useful meta command that allows us to read in and execute SQL
statements stored in a .sql file.
This video was a quick overview of how to load a sqlite database, list its tables and explore what kind
of data was stored in them.
In this video,we are going to learn about the most basic SQL command we can run inside sqlite, the
SELECT command. SELECT is used to execute queries against tables in our database.
sqlite> .tables
Now lets us look at the Track table using .schema. This gives us a list of all the fields (attributes)
inside this table.
Now let us run the simplest form of a SELECT query which is to select a single column from a table. If
we think of the output of a sqlite query as being a spreadsheet, then this form of the SELECT query
will retrieve just one column.
Notice we ignore the square brackets that surround the field name in the output of .schema. e.g.
[Composer]. All SQL statements have to end in a semicolon (;).
We get a list of all the composers in the Track table. The empty spaces in the output indicate NULL
values in the Composer column i.e. Composer column with no composer.
It would be good if our results were numbered. We pick a column called TrackId which can act as a
unique identifier of a track from a music album.
This gives the ids of the tracks in the album. Now we select the TrackId and the Composer from
the Track table.
If we think of the table as a spreadsheet, this command will fetch the entire sheet.
As we see, we get a ton of information corresponding to all row and columns in the table. In the real
word, we don’t often use the ‘SELECT *’ syntax. This is because it is possible that columns got added
to a table after such a statement got added to a application. If the application assumed a certain
number of columns present in the table, then these new columns would not be known to the
application to work with.
Another reason we don’t use “SELECT *” is that is the retrieval may take a very long time for a table
with millions of rows. So in general, we always specify which columns we want from the table.
In summary, the SELECT statement along with a comma-separated list of columns can be used to
retrieve specific data from a table.
In this video, we will learn about the LIMIT and OFFSET commands. These are used with a SELECT
query to limit the data returned by the query. It is useful to quickly get a sense for what kind of data
is in a particular table.
The result returns all rows, a ton of data to browse through. Suppose I want only a few records from
this data to get a feel for the kind of data this table stores, I could do the following :
The result has only 10 records. Suppose I want to get the records between 10 and 20, I can use the
OFFSET command.
sqlite> SELECT TrackId, Name, Composer FROM Track LIMIT 10 OFFSET 10;
The result contains the rows in the table OFFSET by 10 i.e. Rows 10 – 20. We will come to this later,
but LIMIT and OFFSET can be combined with ordering to ask queries like “Give me the five longest
tracks in this database” or “Give me the three tracks with the smallest byte size”.
In summary the LIMIT command can be used to limit the result set to whatever size we want. The
OFFSET command allows me to skip ahead by a specific number of results. These commands will
become more important when we get to ordering.
In this video, we are going to learn how to pick unique rows from a particular column. We will be
working with a new table called Customer. Lets look at its structure in sqlite.
Lets first retrieve all countries for all customers to see where most of my customers are from. Maybe
I want to open a new office in a particular country.
I get a list of all countries, but notice there are several duplicates. This is not necessarily a useful
thing. If I wanted just the unique countries, I would use the DISTINCT command as follows :
We can see that there are pairs of City – Country (e.g. London | United Kingdom) that are not unique.
Suppose I want pairs of values to be distinct.
Notice there is only one value for the combination of London | United Kingdom. So using DISTINCT on
It might not necessarily be the case that customers belong to a company. It may just be individuals
that want to purchase things. In that case, we may have NULL values for the Company field
(corresponding to customers who do not represent companies).
There are legitimate company names for customers that represent companies.
Let us replace any possible empty values using the following command.
We can collapse these NULL values into a single NULL value using DISTINCT.
The result has only 1 NULL value. The above command could be useful to check if a particular
column in a table has at least one NULL value.
In summary, SELECT DISTINCT will allow us to query for unique values in a column or unique tuples
of values in a set of columns. When used on a column with NULL values, SELECT DISTINCT will
collapse any NULL values into a single NULL value.
In this video, we are going to learn how to order things using sqlite. Lets start with a simple query.
The result is ordered by the TrackId. However suppose I want to order by the AlbumId. I modify the
query as follows :
Lets now order by Milliseconds instead of OrderId. This is a challenge for you to try out.
The output is ordered by Milliseconds. The default sort ordering is ascending i.e. shortest to longest
tracks.
In the output, the longest track gets listed first and the shortest, last.
We can also order by multiple columns. This form of sorting orders by the first column and then
within that, by the second column and so on.
The following query orders first by the AlbumId and then by Milliseconds (i.e. order the songs first
by album and then by how long it is).
sqlite> SELECT Name, Milliseconds, AlbumId FROM Track ORDER BY AlbumId ASC, Milliseco
nds DESC;
We see above that all songs from album with AlbumId: 8 have been sorted by how long the songs
are, in descending order. The longest song is approximately 366,837 milliseconds long and the
shortest song is about 126,511 milliseconds long.
We can retrieve the five shortest/longest tracks by ORDERing by Milliseconds and then LIMITing the
output to five as follows :
sqlite> SELECT TrackId, Name, Milliseconds FROM Track ORDER BY Milliseconds DESC LIMI
T 5;
I can verify this by running a query that orders all tracks by Milliseconds DESC and comparing the
top five records to the above output.
Challenge : Retrieve (keeping the TrackId, Name fields unchanged from the above query), the
three smallest tracks. Use the Bytes field (this field records how many bytes the track occupies on
the hard disk). Include the Bytes column in the query.
sqlite> SELECT TrackId, Name, Bytes FROM Track ORDER BY Bytes ASC LIMIT 3;
We use ASC in the above query since we want the three smallest tracks.
Lets also print out the Milliseconds field in the above output.
sqlite> SELECT TrackId, Name, Bytes, Milliseconds FROM Track ORDER BY Bytes ASC LIMIT
3;
Notice that the shorter tracks do not take up much space on the disk either.
In summary, we learned about the ORDER BY command to sort the results. If we use multiple
columns in the ORDER BY command, the results will be sub-sorted by the second column inside the
first column and so on and so forth. We also saw how to combine ORDER BY and LIMIT to get back
results such as the “three smallest” or “five largest” tracks from the table.
In this video, we will look at the WHERE clause of the SELECT statement. This is where we can start
asking complicated questions about the data.
Suppose I wanted all the tracks composed by Miles Davis. I could do the following :
sqlite> SELECT TrackId, Name, Composer FROM Track WHERE Composer = "Miles Davis";
In some programming languages, the “=” sign is used to assign a value to a variable. In SQL, it is
used to check equality.
As expected, we see that all the result tracks were composed by “Miles Davis”.
Other types of equality such as numerical equality can also be used in a WHERE clause.
We can also use Boolean conditions in the WHERE clause using Boolean operators such as AND, OR,
NOT. The following query fetches tracks belonging to AlbumId:1 AND longer than two minutes.
sqlite> SELECT TrackId, AlbumId, Name, Composer FROM Track WHERE AlbumId = 1 AND Mill
seconds > 120000 ;
Lets now change the Milliseconds value to 320000 and run the query again.
sqlite> SELECT TrackId, AlbumId, Name, Composer FROM Track WHERE AlbumId = 1 AND Mill
seconds > 320000 ;
sqlite> SELECT TrackId, Name, Composer FROM Track WHERE Composer = "Mozart";
We get nothing back. This does not sound right as we know that there are tracks composed by
Mozart in the table. It maybe that some tracks have “Mozart” as the composer and others have his
full name “Wolfgang Amadeus Mozart”.
sqlite> SELECT TrackId, Name, Composer FROM Track WHERE Composer LIKE "%Mozart%";
The string above is like a regular expression. “%” here means zero or more characters.
“%Mozart%” will match strings that have zero or more characters followed by “Mozart” followed by
zero or more characters.
We see that the Composer field has the value “Wolfgang Amadeus Mozart”, which we should have
used in the “=” match.
Challenge: Write a query that will fetch us all the Track record composed by Bach.
sqlite> SELECT TrackId, Name, Composer FROM Track WHERE Composer LIKE "%Bach%";
The results contain tracks by Bachman and Johann Sebastian Bach. The full name of the composer
whose track we want is “Johann Sebastian Bach”.
Suppose I wanted all the tracks composed by “Miles Davis’ or “Jame Brown”.
sqlite> SELECT TrackId, Name, Composer FROM Track WHERE Composer = "Miles Davis" or C
omposer = "James Brown";
We can see tracks composed by either composer. There is a shorter query we can use to get the
same result.
sqlite> SELECT TrackId, Name, Composer FROM Track WHERE Composer IN ("Miles Davis", "
James Brown");
We get the same resulting set back. We can use the IN clause in other interesting ways.
sqlite> SELECT TrackId, Name, Composer FROM Track WHERE GenreId NOT IN (1,2,3,4) ;
In summary, the WHERE clause can be used with the SELECT statement to whittle down the number
of rows we have. We can use String equality, numeric equality, various numeric operators such as >,
<, Boolean conjunctions like AND, OR, NOT. We can use string regular expressions (with the % sign)
with the LIKE operator. We also learned about the IN operator which is a shorthand for using OR.
In this lesson, we are going to talk about aggregate functions and grouping. Suppose I want to know
the number of tracks in each album. Doing this manually would require me to group all the tracks by
AlbumId and then count the tracks in each AlbumId to get the number of tracks for that group and
then repeat this for all groups.
In sqlite we use the combination of a GROUP BY clause, and an aggregate function (COUNT) inside
that grouping.
COUNT
The output lists the AlbumId followed by the number of tracks belonging to that album.
Suppose I want to know how many tracks I have across all groups. I leave out the GROUP BY clause.
3503
The output tells us how many tracks we have in the entire table. All the rows are treated as being in
one giant group.
SUM
Lets look at another aggregate function called SUM. Suppose we want the length (milliseconds) of
each album.
The output is not ordered in any way. Lets order the output by the sum so that we can tell which
album is the longest.
sqlite> SELECT AlbumId, SUM(Milliseconds) FROM Track GROUP BY AlbumId ORDER BY SUM(Mi
lliseconds);
The output is ordered by Milliseconds rather than the AlbumId. We see that AlbumId: 229 is the
longest.
sqlite> SELECT AlbumId, SUM(Milliseconds) FROM Track GROUP BY AlbumId ORDER BY SUM(Mi
lliseconds) DESC;
MAX
The output gives us the length of the longest song in each album. Let us add the Name of the song
to the query.
sqlite> SELECT AlbumId, Name, MAX (Milliseconds) FROM Track GROUP BY AlbumId ;
The output gives me the album id, name and the length of the longest song in each album.
Suppose I want to get all albums, how many tracks they have for albums that have more than 10
tracks. We can think of the HAVING clause as a WHERE clause with GROUPS. HAVING whittles down
the number of results of a GROUP BY, similar to what WHERE does to the output of a SELECT
statement.
sqlite> SELECT AlbumId, COUNT(TrackId) FROM Track GROUP BY AlbumId HAVING COUNT(Track
Id) > 10;
In summary, we learned about GROUP BY and the various aggregate functions such as COUNT, SUM,
MAX that we can apply to groups. We also learned about the HAVING clause which can be thought of
as a WHERE clause for groups.
In this video, we will learn how to combine (join) information from two different tables. We use the
JOIN clause. This is important and useful when our data is spread out across multiple tables.
Lets look at a simple example that uses the INNER JOIN (there are other kinds of JOINS as well). Lets
see how a JOIN works manually using the tables highlighted above.
The Track table has track info but only an AlbumId. The Album table on the right has the AlbumId
and the Title of the album. Suppose we want to combine the track information with the Album title,
we would need to link the two tables. The field that links these two tables is the AlbumId. We match
the AlbumId columns in the two tables and link the corresponding rows together.
The result of joining the two tables is shown above. Joining AlbumId:1 in both tables gives us a
combined result where both tracks 84, 85, 86 have the Title “Time Out”. Similarly, joining AlbumId:2
in both tables gives a combined result where tracks 103, 104 have the Title “Kind Of Blue”.
Track 133 has a AlbumId:5 which does not have a corresponding value in the Album table. Hence we
do not include it in the result. Likewise AlbumId:3 does not have any corresponding AlbumId in the
Track table and is ignored in the result. This is how an INNER JOIN works. If a value in the joining
column does not exist in either of the joined tables, it will NOT be included in the result set.
NOTE: JOINS are not restricted to only two tables. Suppose there were an
Artist table with an ArtistId and a corresponding ArtistId in the Track
table, we could link these tables as well and include the Track, Album and
Artist information in the result.
sqlite> SELECT TrackId, Name, Title FROM Track INNER JOIN Album ON Track.AlbumId = Al
bum.AlbumId;
The output will have ALL tracks that belong to an album in the Album table. Any track whose album
is not the Album table or any album without a corresponding entry in the Track table will NOT be
listed here.
Flipping the order of the tables in the INNER JOIN does not change the results.
sqlite> SELECT TrackId, Name, Title FROM Album INNER JOIN Track ON Track.AlbumId = Al
bum.AlbumId;
Let us modify the above query so that we only get the results for AlbumId:1.
sqlite> SELECT TrackId, Name, Title FROM Album INNER JOIN Track ON Track.AlbumId = Al
bum.AlbumId WHERE Album.AlbumId = 1;
So WHERE clauses and GROUP By clauses will continue to work with the JOIN clause enabling us to
write more powerful queries.
In summary, the INNER JOIN combines data from two tables and only returns those rows that have a
match in both tables. We can further process this data using WHERE and GROUP BY clauses if
necessary.
There is another kind of frequently used join supported by sqlite, called the LEFT OUTER join.
Lets understand the rationale behind the name LEFT OUTER join. With reference to the tables above,
a LEFT OUTER join will have all rows from the LEFT table in the result. In the SQL statement, the
table name appearing on the left side of the LEFT JOIN clause will be considered to be the left table.
The result has 6 rows i.e the same rows from the left table.
Now there could be a row in the left table with no corresponding row in the right table. For example,
AlbumId:5 from the left table has no corresponding value in the right table (i.e Album). As seen
above, we replace the Title field in the result with a NULL value.
Lets do a LEFT OUTER join in sqlite by linking the Artist and Album tables.
sqlite> SELECT AlbumId, Name, Title FROM Artist LEFT JOIN Album ON Album.ArtistId = A
rtist.ArtistId;
We can see that there are some NULLS for the AlbumId and Title fields in the result. These
correspond to artists that do not have any album in our data set as the LEFT JOIN represents the
AlbumId and Title as NULL.
We can use other clauses such as WHERE, GROUP BY , HAVING with the LEFT JOIN.
sqlite> SELECT AlbumId, Name, Title FROM Artist LEFT JOIN Album ON Album.ArtistId = A
rtist.ArtistId WHERE AlbumId IS NULL;
In summary, the LEFT OUTER join gives us all rows from the left table and replaces those fields that
do not have a corresponding value in the right table with a NULL in the result. We may use clauses
such as WHERE, GROUP BY and HAVING with the LEFT OUTER join
If you’re using Windows, you’re going to need Git Bash to run the bash commands seen in the
next lesson.
You can open Git Bash in two different ways after you finished installing it. You can either
extract the source code from the zip in the Course Files tab under the video, open the folder
where all files were extracted to and right-click to open Git Bash directly in that folder with the
“Git Bash Here” option:
Or you can open Git Bash normally through the Windows main menu and navigate to the folder
where you extracted the course’s code using the cd command “cd new-folder-name” until you
reach it:
In this video, let us look at some other things we can do with sqlite. In particular, let’s look at how we
can read in .sql files and export results as .csv files.
For Windows users: you can continue this lesson by running the ‘ls’, ‘cat’, ‘head’, and
‘tail’ commands on Git Bash. Just open the Git Bash terminal on the folder where your
code is located as instructed in the previous lesson and proceed with the commands
described below normally. Note that GitBash will be used for this lesson and for these
commands only in order to check on the data we’ve been using throughout the course
and that you will continue to use the Anaconda terminal normally to run all SQLite
commands.
$ ls
Inside my folder here, I have my sqlite database and my .sql file. Let’s look at the contents of the .sql
file:
$ cat track-and-albums.sql
It is a single query that retrieves the name of the track and the name of the album. Let’s see how to
run this query and export the results to a .csv file.
sqlite will execute the whole command in the file and spit out the results.
If the .sql file had multiple queries, they would be executed in sequence.
Lets now look at how to export the results into a .csv file. In order to format the .csv file to nicely
display the data and the names of the fields, let us use some sqlite meta commands.
sqlite> .headers on
We now tell sqlite that the export mode is csv (comma seperated values).
Now any command that is run, will have its results saved in track-data.csv
sqlite> SELECT TrackId, Name, Title FROM Track INNER JOIN Album ON Track.AlbumId = Al
bum.AlbumId;
We don’t see any output on the screen. Lets exit out of the sqlite prompt using the .quit command
or by pressing ^D.
sqlite> .quit
$ ls
$ head track-data.csv
$ tail track-data.csv
We can compare the contents of the output file with the screen output we got before. They are
exactly the same.
In summary, we have seen how we can read in .sql files using the meta command .read and how we
can export the results of a query into a .csv file using the meta command .output. We can now use
the .csv file any way we want e.g. to make plots of the results.
In this video, we will look at how to execute SQL queries in the Python programming language
(www.python.org). We’ll look at how to connect to a database, set up a cursor object, execute
queries, view results. We’ll also look at execution safety issues that are important when executing
SQL from Python.
We already have a code file (querying.py) already saved in the same directory as the sqlite
database.
# run a query
result_set = cursor.execute('SELECT * FROM Track')
#cursor will execute the query and store results in the result set
for row in result_set:
# row = tuple object where each element is a value
print (row)
All our data is being printed out on the right. Each row is a python tuple object where each element
is a column value.
Suppose we wanted to print out only the songs, we would replace the print() function in the code
listing above with the line below :
print (row[1])
Lets modify the above query to return only the track where the composer is Miles Davis. Our first
inclination would be to do this as follows :
This, however, is not a great idea since this makes the code vulnerable to a SQL injection attack
which is a big security concern as it can lead to loss of data or insertion of malicious code. We can
use the cursor object to sanitize the input to prevent any such damage to the database.
import sqlite
# run a query
result_set = cursor.execute('SELECT * FROM Track')
#cursor will execute the query and store results in the result set
for row in result_set:
pass
# must be a tuple
favorite_artist = ('Miles Davis',)
result_set = cursor.execute(“SELECT * FROM Track WHERE Composer = ?”, favorite_artist
)
print (result_set.fetchone())
#close the cursor
cursor.close() # once closed, a cursor object cannot be used again. We have to get
a new cursor from the connection
The above statements replaces the single ? with the single artist inside favorite_artist. If there were
multiple ?’s and multiple artists inside favorite_artist, they would be pulled in, in sequence.
Running the code after making the changes, returns a single row as highlighted above.
Calling fetchone() on the cursor multiple times will yield the successive result from the result set.
The result now returns the next track composed by Miles Davis.
We avoid using Python’s format() function and the ‘%’ operator to prevent the SQL injection attack.
Instead, we use the combination of the Python tuple and the cursor as shown above.
In summary, we looked at how we can connect to a sqlite database and execute queries from Python
by creating a cursor on the database connection, setting the SQL statement on the cursor and calling
execute() on the cursor, and retrieve specific results from the resultant result set.