Mathematical Computing
with Python
Jay Gopalakrishnan
You are free to:
• Share: copy and redistribute the material in any medium or format, and
• Adapt: remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
• Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes
were made. You may do so in any reasonable manner, but not in any way that suggests the licensor
endorses you or your use.
• ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions
under the same license as the original.
• No additional restrictions: You may not apply legal terms or technological measures that legally restrict
others from doing anything the license permits.
Recommended citation:
Gopalakrishnan, J., Lectures on Mathematical Computing with Python, PDXOpen: Open Educational Resource
28, Portland State University Library, DOI: 10.15760/pdxopen-28, July 2020.
Preface
These lectures were prepared for a class of (mostly) second year mathematics and statis-
tics undergraduate students at Portland State University during Spring 2020. The term
was unlike any other. The onslaught of COVID-19 moved the course meetings online, an
emergency transition that few of us were prepared for. Many lectures reflect our preoccu-
pations with the damage inflicted by the virus. I have not attempted to edit these out since
I felt that a utilitarian course on computing need not be divested from the real world.
These materials offer class activities for studying basics of mathematical computing using
the python programming language, with glimpses into modern topics in scientific com-
putation and data science. The lectures attempt to illustrate computational thinking by
examples. They do not attempt to introduce programming from the ground up, although
students, by necessity, will learn programming skills from external materials. In my expe-
rience, students are able and eager to learn programming by themselves using the abun-
dant free online resources that introduce python programming. In particular, my students
and I found the two (free and online) books of Jake VanderPlas invaluable. Many sec-
tions of these two books, hyperlinked throughout these lectures, were assigned as required
preparatory reading materials during the course (see List of Preparatory Materials).
Materials usually covered in a first undergraduate linear algebra course and in a one-
variable differential calculus course form mathematical prerequisites for some lectures.
Concepts like convergence may not be covered rigorously in such prerequisites, but I have
not shied away from talking about them: I feel it is entirely appropriate that a first en-
counter with such concepts is via computation.
Each lecture has a date of preparation. It may help the reader understand the context in
relation to current events and news headlines. The timestamp also serves as an indicator
of the state of the modules in the ever-changing python ecosystem of modules for scientific
computation. The specific version numbers of the modules used are listed overleaf. The
codes may need tinkering with to ensure compatibility with future versions. The materials
are best viewed as offering a starting point for your own adaptation.
If you are an instructor declaring these materials as a resource in your course syllabus, I
would be happy to provide any needed solutions to exercises or datafiles. If you find errors
please alert me. If you wish to contribute by updating or adding materials, please fork the
public GitHub Repository where these materials reside and send me a pull request.
Jay Gopalakrishnan
(gjay@pdx.edu)
Software Requirements:
• Python >= 3.7
• Jupyter >= 1
Main modules used:
• cartopy==0.18.0b2.dev48+
• geopandas==0.7.0
• gitpython==3.1.0
• matplotlib==3.2.1
• numpy==1.18.2
• pandas==1.0.4
• scipy==1.4.1
• scikit-learn==0.23.1
• seaborn==0.10.0
• spacy==2.2.4
Other (optional) facilities used include line_profiler, memory_profiler, numexpr, pandas-datareader, and primesieve.
Table of Contents
Lecture Notebooks
• 01 Overview of some tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
• 02 Interacting with python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
• 03 Working with git . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
• 04 Conversion table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
• 05 Approximating derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
• 06 Genome of SARS-CoV-2 virus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
• 07 Fibonacci primes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
• 08 Numpy blitz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
• 09 The SEIR model of infectious diseases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
• 10 Singular value decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
• 11 Bikes on Tilikum Crossing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
• 12 Visualizing geospatial data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
• 13 Gambler’s ruin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
• 14 Google’s PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
• 15 Supervised learning by regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
• 16 Unsupervised learning by PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
• 17 Latent semantic analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Exercises
• Power sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
• Graph functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
• Argument passing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
• Piecewise functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
• Row swap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
• Averaging matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
• Differentiation matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
• Pairwise differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
• Hausdorff distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
• k-nearest neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
• Predator-prey model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
• Column space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
• Null space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
• Pandas from dictionaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
• Iris flowers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
• Stock prices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
• Passengers on the Titanic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
• Animate functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
• Insurance company . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
• Probabilities on small graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
• Ehrenfest urns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
• Power method for large graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
• Google’s toy graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
• Atmospheric carbon dioxide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
• Ovarian cancer data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
• Eigenfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
• Word vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
Projects
• Bisection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
• Rise of CO2 in the atmosphere . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
• COVID-19 cases in the west coast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
• World map of COVID-19 cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
• Neighbor’s color . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
List of Preparatory Materials for Each Activity
The activities in the table of contents are enumerated again below in a linear ordering with
hyperlinks to external online preparatory materials for each.
Required Preparation                                    Activity

Learn sorting, partitioning [JV-H], and quick           Exercise: Row Swap
ways to make matrices from [numpy.org].                 Exercise: Averaging Matrix
                                                        Exercise: Differentiation Matrix

Be acquainted with scipy.sparse's matrix format,        Exercise: Power method for large graphs
specifically COO and CSR formats.                       Exercise: Google's toy graph
I
Overview of some tools
This lecture is an introductory overview to give you a sense of the broad utility of a few
python tools you will encounter in later lectures. Each lecture or class activity is guided by
a Jupyter Notebook (like this document), which combines executable code, mathematical
formulae, and text notes. This overview notebook also serves to check and verify that you
have a working installation of some of the python modules we will need to use later. We
shall delve into basic programming using python (after this overview and a few further
start-up notes) starting from a later lecture.
The ability to program, analyze, and compute with data is a life skill, useful well
beyond your mathematics curriculum. To illustrate this claim, let us begin by considering
the most pressing current issue in our minds as we begin these lectures: the progression
of COVID-19 disease worldwide. The skills you will learn in depth later can be applied
to understand many types of data, including the data on COVID-19 disease progression.
In this overview, we shall use a few python tools to quickly obtain and visualize data on
COVID-19 disease worldwide. The live data on COVID-19 (which is changing in as yet
unknown ways) will also be used in several later activities.
Specifically, this notebook contains all the code needed to perform these tasks:
• download today’s data on COVID-19 from a cloud repository,
• make a data frame object out of the data,
• use a geospatial module to put the data on a world map,
• download county maps from US Census Bureau, and
• visualize the COVID-19 data restricted to Oregon.
The material here is intended just to give you an overview of the various tools we will learn
in depth later. There is no expectation that you can immediately digest the code here. The
goal of this overview is merely to whet your appetite and motivate you to allocate time to
learn the materials yet to come.
Please install these modules if you do not have them already. (If you do not have these
installed, attempting to run the next cell will give you an error.)
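The cell below collects the imports this overview relies on; the exact list is inferred from the code used later in this notebook, so treat it as a sketch.

    import os
    import shutil
    import urllib
    import pandas as pd
    import geopandas as gpd
    from git import Repo
    from matplotlib import pyplot as plt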
[2]: # your local folder into which you want to download the covid data
covidfolder = '../../data_external/covid19'
Remember this location where you have stored the COVID-19 data. You will need to return
to it when you use the data during activities in later days, including assignment projects.
if os.path.isdir(covidfolder):      # if the repo was cloned before, reuse it
    repo = Repo(covidfolder)
    repo.remotes.origin.pull()      # and pull in the latest data
else:                               # otherwise, clone the JHU CSSE data repository
    repo = Repo.clone_from('https://github.com/CSSEGISandData/COVID-19.git', covidfolder)
datadir = repo.working_dir + '/csse_covid_19_data/csse_covid_19_daily_reports'
The folder datadir contains many files (all of which can be listed here using the command
os.listdir(datadir) if needed). The filenames begin with a date like 03-27-2020 and
end in .csv. The suffix csv stands for “comma separated values”, a common
simple format for storing uncompressed data.
I.3 Examine the data for a specific date
The python module pandas, the workhorse for all data science tasks in python, can make a
DataFrame object out of each such .csv file. You will learn more about pandas later in the
course. For now, let us pick a recent date, say March 27, 2020, and examine the COVID-19
data for that date.
[4]: c = pd.read_csv(datadir+'/03-27-2020.csv')
The DataFrame object c has over 3000 rows. An examination of the first five rows already
tells us a lot about the data layout:
[5]: c.head()
Combined_Key
0 Abbeville, South Carolina, US
1 Acadia, Louisiana, US
2 Accomack, Virginia, US
3 Ada, Idaho, US
4 Adair, Iowa, US
Note that depending on how the output is rendered where you are reading this, the later
columns may be line-wrapped or may be visible only after scrolling to the edges. This
object c, whose head part is printed above, looks like a structured array. There are features
corresponding to locations, as specified in latitude Lat and longitude Long_. The columns
Confirmed, Deaths, and Recovered represent the number of confirmed cases, deaths, and
recovered cases due to COVID-19 at a corresponding location.
The python module geopandas (gpd) is well-suited for visualizing geospatial data. It is built on top of the pandas library.
So it is easy to convert our pandas object c to a geopandas object.
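A sketch of such a conversion, assuming the latitude and longitude columns noted above are named Lat and Long_:

    gc = gpd.GeoDataFrame(c, geometry=gpd.points_from_xy(c['Long_'], c['Lat']))
    gc.head()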
Combined_Key geometry
0 Abbeville, South Carolina, US POINT (-82.46171 34.22333)
1 Acadia, Louisiana, US POINT (-92.41420 30.29506)
2 Accomack, Virginia, US POINT (-75.63235 37.76707)
3 Ada, Idaho, US POINT (-116.24155 43.45266)
4 Adair, Iowa, US POINT (-94.47106 41.33076)
The only difference between gc and c is the last column, which contains the new geometry
objects representing points on the globe. Next, in order to place markers at these points on
a map of the world, we need to get a simple low resolution world map:
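One simple option is the low-resolution world map that ships with geopandas; a sketch:

    world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))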
You can download and use maps with better resolution from Natural Earth, but that will
be too far of a digression for this overview. On top of the above low resolution map, we
can now put the markers whose sizes are proportional to the number of confirmed cases.
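A sketch of one way to do so (the marker scaling factor below is an illustrative choice, not the original code):

    base = world.plot(alpha=0.3, figsize=(12, 8))              # the world map as background
    msz = 500 * gc['Confirmed'] / gc['Confirmed'].max()        # marker sizes scaled to case counts
    gc.plot(ax=base, markersize=msz, color='red', alpha=0.5)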
These python tools have made it incredibly easy for us to immediately identify the COVID-
19 trouble spots in the world. Moreover, these visualizations can be updated easily by
re-running this code as data becomes available for other days.
[9]: co = c[c['Province_State']=='Oregon']
The variable co now contains the data restricted to Oregon. However, we are now pre-
sented with a problem. To visualize the restricted data, we need a map of Oregon. The
module geopandas does not carry any information about Oregon and its counties. How-
ever this information is available from the United States Census Bureau. (By the way, the
2020 census is happening now! Do not forget to respond to their survey. They are one of
our authoritative sources of quality data.)
To visualize the COVID-19 information on a map of Oregon, we need to get the county
boundary information from the census bureau. This illustrates a common situation that
arises when trying to analyze data: it is often necessary to procure and merge data from
multiple sources in order to understand a real-world phenomenon.
A quick internet search reveals the census page with county information. The information
is available in an online file cb_2018_us_county_500k.zip at the URL below. Python al-
lows you to download this file using its urllib module without even needing to leave this
notebook.
[10]: # url of the data
census_url = 'https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_500k.zip'

# location of your download
your_download_folder = '../../data_external'
if not os.path.isdir(your_download_folder):
    os.mkdir(your_download_folder)
us_county_file = your_download_folder + '/cb_2018_us_county_500k.zip'

# download the zip file and save it into us_county_file
with urllib.request.urlopen(census_url) as response, open(us_county_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
Now, your local computer has a zip file, which has among its contents, files with geometry
information on the county boundaries, which can be read by geopandas. We let geopandas
directly read in the zip file: it knows which information to extract from the zip archive to
make a data frame with geometry.
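A sketch of this step, assuming the zip:// scheme understood by geopandas' file reader:

    us_counties = gpd.read_file('zip://' + us_county_file)
    us_counties.head()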
AWATER geometry
0 69473325 POLYGON ((-89.18137 37.04630, -89.17938 37.053...
1 4829777 POLYGON ((-84.44266 38.28324, -84.44114 38.283...
2 13943044 POLYGON ((-86.94486 37.07341, -86.94346 37.074...
3 6516335 POLYGON ((-84.12662 37.64540, -84.12483 37.646...
4 7182793 POLYGON ((-83.98428 38.44549, -83.98246 38.450...
The object us_counties has information about all the counties. Now, we need to re-
strict this data to just that of Oregon. Looking at the columns, we find something called
STATEFP. Searching through the government pages, we find that STATEFP refers to a 2-
character state FIPS code. The FIPS code refers to Federal Information Processing Standard
which was a “standard” at one time, then deemed obsolete, but still continues to be used
today. All that aside, it suffices to note that Oregon’s FIPS code is 41. Once we know this,
python makes it easy to restrict the data to Oregon:
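A sketch of the restriction, assuming STATEFP is stored as a string in this file:

    ore = us_counties[us_counties['STATEFP'] == '41']    # Oregon's FIPS code is 41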
Now we have the Oregon data in two data frames, ore and co. We must combine the two
data frames. This is again a situation so often encountered when dealing with real data
that there is a facility for it in pandas called merge. Both data frames have FIPS codes: in ore you
find it under the column GEOID, and in co you find it called FIPS. The merged data frame is
represented by the variable orco below:
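A sketch of the merge; the type conversions below are assumptions about how the key columns are stored (GEOID as a string, FIPS as a number):

    ore = ore.astype({'GEOID': 'int64'})                         # make the key types match
    co = co.dropna(subset=['FIPS']).astype({'FIPS': 'int64'})
    orco = pd.merge(ore, co, left_on='GEOID', right_on='FIPS')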
The orco object now has both the geometry information as well as the COVID-19 informa-
tion, making it extremely easy to visualize.
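A sketch of the visualization (the column and colormap choices here are illustrative):

    fig, ax = plt.subplots(figsize=(10, 8))
    orco.plot(ax=ax, column='Confirmed', legend=True, cmap='OrRd')
    ax.set_title('Confirmed COVID-19 cases in Oregon counties')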
This is an example of a choropleth map, a map where regions are colored or shaded in
proportion to some data variable. It is an often-used data visualization tool.
[Figure: Confirmed COVID-19 cases until 2020-03-30 in Oregon, Washington, and California, plotted against dates from February to April 2020.]
How does the progression of infections in New York compare with Hubei where the dis-
ease started? Again the answer based on the data we have up to today is easy to extract,
and is displayed next.
[Figure: Confirmed COVID-19 case counts in New York and Hubei, plotted against dates from February to April 2020.]
Of course, the COVID-19 situation is evolving, so these figures are immediately outdated
after today’s class. This situation is evolving in as yet unknown ways. I am sure that you,
like me, want to know more about how these plots will change in the next few months.
You will be able to generate plots like this and learn many more data analysis skills from
these lectures. As you amass more technical skills, let me encourage you to answer your
own questions on COVID-19 by returning to this overview, pulling the most recent data,
and modifying the code here to your needs. In fact, some later assignments will require
you to work further with this Johns Hopkins COVID-19 worldwide dataset. Visualizing
the COVID-19 data for any other state, or indeed, any other region in the world, is easily
accomplished by some small modifications to the code of this lecture.
II
Interacting with Python
Note the following from the interactive session displayed in the figure above:
• Computing the square root of a number using sqrt is successful only after import-
ing math. Most of the functionality in Python is provided by modules, like the math
module. Some modules, like math, come with python, while others must be installed
after python is installed.
• Anything beginning with # (like “# works!” in the figure) is a comment, not code. This
is the case in a python shell and also in the other forms of interacting with python
discussed below.
• The dir command shows the facilities provided by a module. As you can see, the
math module contains many functions in addition to sqrt.
II.3 Jupyter Notebook
The Jupyter notebook is a web-browser based graphical environment consisting of cells,
which can consist of code, or text. The text cells should contain text in markdown syntax,
which allows you to type not just words in bold and italic, but also tables, mathematical
formulas using LaTeX, etc. The code cells of Jupyter can contain code in various languages,
but here we will exclusively focus on code cells with Python 3.
For example, this block of text that begins with this sentence marks the beginning of a
jupyter notebook cell containing markdown content. If you are viewing this from jupyter,
click on jupyter’s top menu -> Cell -> Cell Type to see the type of the current cell,
or to change the cell type. Computations must be done in a code cell, not a markdown cell.
For example, to compute
$\left( \cos(\pi \sqrt{\pi}) \right)^7$
we open a code cell next with the following two lines of python code:

[1]: from math import cos, sqrt, pi
     cos(pi*sqrt(pi))**7
[1]: 0.14008146171564725
This seamless integration of text and code makes Jupyter attractive for developing a repro-
ducible environment for scientific computing.
II.4 Python file
Open your favorite text editor, type in some python code, and then save the file as
myfirstpy.py. Here is a simple example of such a file.
#------- myfirstpy.py ---------------------------------
from math import cos, sqrt, pi
The above output cell should display the same output as what one would have obtained
if we executed the python file on the command line.
For larger projects (including take-home assignments), you will need to create such python
files with many lines of python code. Therefore it is essential that you know how to create
and execute python files in your system.
III
Working with git
April 2, 2020
Git is a distributed version control system (and a program often used independently of
python). A version control system tracks the history of changes in projects with many files,
including data files, and codes, which many people access simultaneously. Git facilitates
identification of changes made, fetching revisions from a cloud repository in git format,
and pushing revisions to the cloud.
GitHub is a cloud server that specializes in serving data in the form of git repositories.
Many other such cloud services exist, such as Atlassian’s BitBucket.
The notebooks that form these lectures are in a git repository served from GitHub. In this
notebook, we describe how to access materials from this remote git repository. We will
also use this opportunity to introduce some object-oriented terminology like classes, objects,
constructor, data members, and methods, which are pervasive in python. Those already
familiar with this terminology and GitHub may skip to the next activity.
III.2 Git Repo class in python
We shall use the python module gitpython to work with git. (We already used this mod-
ule in the first overview lecture.) The documentation of gitpython contains a lot of infor-
mation on how to use its facilities. The main facility is the class called Repo, which it uses
to represent git repositories.
Classes have a special method called the constructor, which you would find listed among their
methods as __init__.
[3]: help(Repo.__init__)
:param path:
the path to either the root git directory or the bare git repo::
repo = Repo("/Users/mtrier/Development/git-python")
repo = Repo("/Users/mtrier/Development/git-python.git")
repo = Repo("~/Development/git-python.git")
repo = Repo("$REPOSITORIES/Development/git-python.git")
repo = Repo("C:\Users\mtrier\Development\git-python\.git")
Please note that this was the default behaviour in older versions of GitPython,
which is considered a bug though.
:raise InvalidGitRepositoryError:
:raise NoSuchPathError:
:return: git.Repo
The __init__ method is called when you type in Repo(...) with the arguments allowed
in __init__. Below, we will see how to initialize a Repo object using our github repository.
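The string variable coursefolder below should point to a folder of your choosing on your computer; the value shown is just an example matching the output that follows.

    coursefolder = '/Users/Jay/tmpdir/'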
[5]: import os
os.path.abspath(coursefolder)
[5]: '/Users/Jay/tmpdir'
Please double-check that the output is what you expected on your operating system: if not,
please go back and revise coursefolder before proceeding. (Windows users should see
forward slashes converted to double backslashes, while mac and linux users will usually
retain the forward slashes.)
We proceed to download the course materials from GitHub. These materials will be stored
in a subfolder of coursefolder called mth271content, which is the name of the git reposi-
tory.
[6]: repodir = os.path.join(os.path.abspath(coursefolder), 'mth271content')
repodir # full path name of the subfolder
[6]: '/Users/Jay/tmpdir/mth271content'
Again, the value of the string variable repodir output above describes the location on your
computer where your copy of the course materials from GitHub will reside.
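A sketch of the check producing the output below:

    os.path.isdir(repodir)    # does this folder already exist?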
[7]: True
The output above should be False if you are running this notebook for the first time, per
my assumption above. When you run it after you have executed this notebook successfully
at least once, you would already have cloned the repository, so the folder will exist.
• Repo.clone_from(...) calls the clone_from(...) method.
Now you have the updated course materials in your computer in a local folder. The object
repo stores information about this folder, which you gave to the constructor in the string
variable repodir, in a data member called working_dir. You can access any data members
of an object in memory, and you do so just like you access a method, using a dot . followed
by the member name. Here is an example:
[9]: repo.working_dir
[9]: '/Users/Jay/tmpdir/mth271content'
Note how the Repo object was either initialized with repodir (if that folder exists) or set to
clone a remote repository at a URL.
IV
Conversion table
April 2, 2020
This elementary activity is intended to check and consolidate your understanding of very
basic python language features. It is modeled after a similar activity in [HPL] and involves
a simple temperature conversion formula. You may have seen kitchen cheat sheets (or have
one yourself) like the following:
Fahrenheit Celsius
cool oven 200 F 90 C
very slow oven 250 F 120 C
slow oven 300-325 F 150-160 C
moderately slow oven 325-350 F 160-180 C
moderate oven 350-375 F 180-190 C
moderately hot oven 375-400 F 190-200 C
hot oven 400-450 F 200-230 C
very hot oven 450-500 F 230-260 C
This is modeled after a conversion table at the website Cooking Conversions for Old Time
Recipes, which many found particularly useful for translating old recipes from Europe.
Of course, the “old continent” has already moved on to the newer, more rational, metric
system, so all European recipes, old and new, are bound to have temperatures in Celsius
(C). Even if recipes don’t pique your interest, do know that every scientist must learn to
work with the metric system.
Celsius values can be converted to the Fahrenheit system by the formula
$F = \frac{9}{5}\, C + 32.$
The task in this activity is to print a table of F and C values per this formula. While ac-
complishing this task, you will recall basic python language features, like while loop, for
loop, range, print, list and tuples, zip, and list comprehension.
C = 0
while C <= 250:
    F = 9 * C / 5 + 32
    print(F, C)
    C += 10
F C
32.0 0
50.0 10
68.0 20
86.0 30
104.0 40
122.0 50
140.0 60
158.0 70
176.0 80
194.0 90
212.0 100
230.0 110
248.0 120
266.0 130
284.0 140
302.0 150
320.0 160
338.0 170
356.0 180
374.0 190
392.0 200
410.0 210
428.0 220
446.0 230
464.0 240
482.0 250
This cell shows how to add, multiply, assign, increment, print, and run a while loop. Such
basic language features are introduced very well in the prerequisite reading for this lecture,
the official python tutorial’s section titled “An informal introduction to Python.” (Note
that all pointers to prerequisite reading materials are listed together just after the table of
contents in the beginning.)
C = 0
while C <= 250:
    F = 9 * C / 5 + 32
    print('%4.0f %4.0f' % (F, C))
    C += 10
F C
32 0
50 10
68 20
86 30
104 40
122 50
140 60
158 70
176 80
194 90
212 100
230 110
248 120
266 130
284 140
302 150
320 160
338 170
356 180
374 190
392 200
410 210
428 220
446 230
464 240
482 250
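The same table can also be produced with a for loop over a range of Celsius values; a sketch consistent with the output below (the table stops at 240 because range excludes its endpoint):

    print(' F    C')
    for C in range(0, 250, 10):
        F = 9 * C / 5 + 32
        print('%4.0f %4.0f' % (F, C))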
F C
32 0
50 10
68 20
86 30
104 40
122 50
140 60
158 70
176 80
194 90
212 100
230 110
248 120
266 130
284 140
302 150
320 160
338 170
356 180
374 190
392 200
410 210
428 220
446 230
464 240
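Nothing restricts us to positive temperatures; a similar sketch tabulates values from −50 C to 45 C, matching the output below:

    print(' F    C')
    for C in range(-50, 50, 5):
        F = 9 * C / 5 + 32
        print('%4.0f %4.0f' % (F, C))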
F C
-58 -50
-49 -45
-40 -40
-31 -35
-22 -30
-13 -25
-4 -20
5 -15
14 -10
23 -5
32 0
41 5
50 10
59 15
68 20
77 25
86 30
95 35
104 40
113 45
As you see from the output above, at −40 degrees, the Fahrenheit scale and the Celsius
scale coincide. If you have lived in Minnesota, you probably know what −40 feels like, and
you likely already know the fact we just discovered above (it’s common for Minnesotans
to throw around this tidbit while commiserating in the cold).
IV.5 Store in a list
If we want to use the above-printed tables later, we would have to run a loop again. Our
conversion problem is so small that there is no cost to run as many loops as we like, but
in many practical problems, loops contain expensive computations. So one often wants
to store the quantities computed in the loop in order to reuse them later. Lists are good
constructs for this.
First we should note that python has lists and also tuples. Only the former can be modified
after creation. Here is an example of a list:
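For instance (the particular values here are just for illustration):

    Cs = [0, 10, 20, 30]      # a list; a tuple would instead be written (0, 10, 20, 30)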
You access a tuple element just like a list element, so Cs[0] will give the first element
whether or not Cs is a list or a tuple. But the statement Cs[0] = -10 that changes an
element of the container will work only if Cs is a list. We say that a list is mutable, while
a tuple is immutable. Tuples are generally faster than lists, but lists are more flexible than
tuples.
Here is an example of how to store the computed C and F values within a loop into lists.
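A sketch of such a loop, consistent with the values printed below:

    Cs = []                    # Celsius values
    Fs = []                    # corresponding Fahrenheit values
    for C in range(0, 250, 25):
        Cs.append(C)
        Fs.append(9 * C / 5 + 32)
    print(Cs)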
[0, 25, 50, 75, 100, 125, 150, 175, 200, 225]
[9]: print(Fs)
[32.0, 77.0, 122.0, 167.0, 212.0, 257.0, 302.0, 347.0, 392.0, 437.0]
This is not as pretty an output as before. But we can easily run a loop and print the stored
values in any format we like. This is a good opportunity to show off a pythonic feature zip
that allows you to traverse two lists simultaneously:
[10]: print(' F C')
for C, F in zip(Cs, Fs):
    print('%4.0f %4.0f' % (F, C))
F C
32 0
77 25
122 50
167 75
212 100
257 125
302 150
347 175
392 200
437 225
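The same Fs can be built in a single line using a list comprehension; this is the one-liner referred to below:

    Fs = [9 * C / 5 + 32 for C in Cs]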
Note how this makes for compact code without sacrificing readability: constructs like this
are why you hear so much praise for python’s expressiveness. For mathematicians, the
list comprehension syntax is also reminiscent of the set notation in mathematics: the set
(list) Fs is described in mathematical notation by
$\mathrm{Fs} = \left\{ \tfrac{9}{5}\, C + 32 \;:\; C \in \mathrm{Cs} \right\}.$
Note how similar it is to the list comprehension code. (Feel free to check that the Fs com-
puted by the above one-liner is the same as the Fs we computed previously within a loop.)
V
Approximating the derivative
April 7, 2020
In calculus, you learnt about the derivative and its central role in modeling processes
where a rate of change is important. How do we compute the derivative on a computer?
Recall what you did in your first calculus course to compute the derivative. You memo-
rized derivatives of simple functions like cos x, sin x, exp x, x^n, etc. Then you learnt rules
like the product rule, quotient rule, chain rule, etc. In the end you could systematically com-
pute derivatives of complicated functions by reducing them to simpler components and ap-
plying the rules. We could teach the computer the same rules and come up with an algo-
rithm for computing derivatives. This is the idea behind automatic differentiation. Python
modules like sympy can compute derivatives symbolically in this fashion. However, this
approach has its limits.
In the real world, we often encounter complicated functions, such as functions that cannot
be represented in terms of simple component functions, or functions whose values you can
only query from some proprietary company code, or functions whose values are based off
a table, like for instance this function.
[Figure: Tesla stock prices (and their rolling weekly means) from January to early April 2020, plotted against date.]
This function represents Tesla’s stock prices this year until yesterday (which I got, in case
you are curious, using just a few lines of python code). The function is complicated (not
to mention depressing - it reflects the market downturn due to the pandemic). But its
rate of change drives some investment decisions. Instead of the oscillatory daily stock
values, analysts often look at the rate of change of trend lines (like the rolling weekly
means above), a function certainly not expressible in terms of a few simple functions like
sines or cosines.
In this activity, we look at computing a numerical approximation to the derivative using
something you learnt in calculus.
$f'(x) \approx \frac{f(x + h/2) - f(x - h/2)}{h}$
Below is a plot of the tangent line of some function f at x, whose slope is f ′ ( x ), together
with the secant line whose slope is the approximation on the right hand side above. Clearly
as the spacing h decreases, the secant line becomes a better and better approximation to
the tangent line.
[Figure: the graph of f near x, with the tangent line at x and the secant line whose slope gives the approximation above.]
This is the Central Difference Formula for the second derivative.
The first task in this activity is to write a function to compute the above-stated second
derivative approximation,
$\frac{f(x - h) - 2 f(x) + f(x + h)}{h^2}$
given any function f of a single variable x. The parameter h should also be input, but can
take a default value of $10^{-6}$.
The prerequisite reading for this activity included python functions, keyword arguments,
positional arguments, and lambda functions. Let’s apply all of these concepts while com-
puting the derivative approximation. Note that python allows you to pass functions them-
selves as arguments to other functions. Therefore, without knowing what specific function
f to apply the central difference formula, we can write a generic function D2 for implement-
ing the formula for any f .
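A minimal sketch of such a function, implementing the formula above (the default spacing and the import of sin for the examples that follow are assumptions):

    from math import sin

    def D2(f, x, h=1E-6):
        """Approximate the second derivative of f at x by central differences."""
        return (f(x-h) - 2*f(x) + f(x+h)) / h**2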
Let’s apply the formula to some nice function, say the sine function.
D2(sin, 0.2)
[2]: -0.19864665468105613
Of course we know that the second derivative of sin(x) is the negative of itself, so a quick test of
correctness is to compare the above value to that of −sin(0.2).
[3]: -sin(0.2)
[3]: -0.19866933079506122
How do we apply D2 to, say, sin(2x )? One way is to define a function returning sin(2 ∗ x )
and then pass it to D2, as follows.
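For instance (a sketch of such a helper, named g to match the call below):

    def g(x):
        return sin(2*x)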
D2(g, 0.2)
[4]: -1.5576429035490946
An alternate way is using a lambda function. This gives a one-liner without damaging code
readability.
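A sketch of the one-liner whose output appears below:

    D2(lambda x: sin(2*x), 0.2)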
[5]: -1.5576429035490946
Of course, in either case the computed value approximates the actual second derivative of sin(2x),
namely −4 sin(2x), thus verifying our code.
[6]: -4*sin(2* 0.2) # actual 2nd derivative value
[6]: -1.557673369234602
V.3 Error
The error in the approximation formula we just implemented is
$\varepsilon(x) = f''(x) - \frac{f(x - h) - 2 f(x) + f(x + h)}{h^2}.$
Although we can’t know the error ε( x ) without knowing the true value f ′′ ( x ), calculus
gives you all the tools to bound this error.
Substituting the Taylor expansions
$f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(x) + \frac{h^4}{24} f''''(x) + \cdots$
and
$f(x - h) = f(x) - h f'(x) + \frac{h^2}{2} f''(x) - \frac{h^3}{6} f'''(x) + \frac{h^4}{24} f''''(x) + \cdots$
into the definition of $\varepsilon(x)$, we find that, after several cancellations, the dominant term
is $O(h^2)$ as $h \to 0$.
This means that if h is halved, the error should decrease by a factor of 4. Let us take a look
at the error in the derivative approximations applied to a simple function
$f(x) = x^{-6}$
at, say x = 1. I am sure you can compute the exact derivative using your calculus knowl-
edge. In the code below, we subtract this exact derivative from the computed derivative
approximation to obtain the error.
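A sketch of such a computation, consistent with the table below (the exact second derivative of x⁻⁶ at x = 1 is 42):

    exact = 42                                   # second derivative of x**(-6) at x = 1
    print('   h     D2 Result     Error')
    for k in range(4, 8):
        h = 2.0**(-k)                            # halve h at each step
        d2 = D2(lambda x: x**-6, 1, h)
        print('%.0e  %10.5f  %9.6f' % (h, d2, d2 - exact))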
h D2 Result Error
6e-02 42.99863 0.998629
3e-02 42.24698 0.246977
2e-02 42.06158 0.061579
8e-03 42.01538 0.015384
Clearly, we observe that the error decreases by a factor of 4 when h is halved. This is in
accordance with what we expected from the Taylor expansion analysis above.
V.4 Limitations
A serious limitation of numerical differentiation formulas like this can be seen when we
take values of h really close to 0. Although the limiting process in calculus relies on h going
to 0, your computer is not equipped to deal with arbitrarily small numbers: once h approaches
machine precision, floating-point round-off errors overwhelm the computation. Instead of halving
h, let us aggressively reduce h by a factor of 10, going down to $10^{-13}$, and look at the results.
[8]: for k in range(1,14):
         h = 10**(-k)
         d2g = D2(lambda x: x**-6, 1, h)
         print('%.0e %18.5f' % (h, d2g))
1e-01 44.61504
1e-02 42.02521
1e-03 42.00025
1e-04 42.00000
1e-05 41.99999
1e-06 42.00074
1e-07 41.94423
1e-08 47.73959
1e-09 -666.13381
1e-10 0.00000
1e-11 0.00000
1e-12 -666133814.77509
1e-13 66613381477.50939
VI
Genome of SARS-CoV-2
April 7, 2020
Since most data come in files and streams, a data scientist must be able to effectively work
with them. Python provides many facilities to make this easy. In this class activity, we
will review some of python’s file, string, and dictionary facilities by examining a file con-
taining the genetic code of the virus that has been disrupting our lives this term. Here is
a transmission electron micrograph showing the virus (a public domain image from the
CDC, credited to H. A. Bullock and A. Tamin).
The genetic code of each living organism is a long sequence of simple molecules called nu-
cleotides or bases. Although many nucleotides exist in nature, only 4 nucleotides, labeled
A, C, G, and T, have been found in DNA. They are abbreviations of Adenine, Cytosine,
Guanine, and Thymine. Although it is difficult to put viruses in the category of living
organisms, they also have genetic codes made up of nucleotides.
The name of the virus is SARS-CoV-2 (which is different from the name of the disease, COVID-
19), or “Severe Acute Respiratory Syndrome Coronavirus 2” in full. Searching the NCBI
website with the proper virus name will help you locate many publicly available data sets.
Let’s download NCBI’s Reference Sequence NC_045512 giving the complete genome ex-
tracted from a sample of SARS-CoV-2 from the Wuhan seafood market, called the Wuhan-
Hu-1 isolate. Here is a code using urllib that will attempt to directly download from the
url specified below. It is unclear if this url would serve as a stable permanent link. In the
event you have problems executing the next cell, please just head over to the webpage for
NC_045512, click on “FASTA” (a data format) and then click on “Send to” a file. Then save
the file in the same relative location mentioned below in f within the folder where we have
been putting all the data files in this course.
url = 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&' + \
'save=file&log$=seqview&db=nuccore&report=fasta&id=1798174254&' + \
'extrafeat=null&conwithfeat=on&hide-cdd=on'
f = '../../data_external/SARS-CoV-2-Wuhan-NC_045512.2.fasta'
[2]: import os
import urllib
import shutil
if not os.path.isdir('../../data_external/'):
    os.mkdir('../../data_external/')
r = urllib.request.urlopen(url)
fo = open(f, 'wb')
shutil.copyfileobj(r, fo)
fo.close()
As mentioned in the page describing the data, this file gives the RNA of the virus.
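A sketch of the reading step assumed here:

    lines = open(f, 'r').readlines()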
The file has been opened in read-only mode. The variable lines contains a list of all the
lines of the file. Here are the first five lines:
[4]: lines[0:5]
 'ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA\n',
 'CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC\n',
 'TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG\n',
 'TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC\n']
The first line is a description of the data. The long genetic code is broken up into the
following lines. We need to strip end-of-line characters from each such line to re-assemble
the RNA string. Here is a way to strip off the end-of-line character:
[5]: lines[1].strip()
[5]: 'ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA'
Let’s do so for every line starting ignoring the first. Since lines is a list object, ignoring the
first element of the list is done by lines[1:]. (If you don’t know this already, you must
review the list access constructs.) The following code uses the string operation join to put
together the lines into one long string. This is the RNA of the virus.
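A sketch of that step:

    rna = ''.join([line.strip() for line in lines[1:]])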
The first thousand characters and the last thousand characters of the RNA of the coron-
avirus are printed below:
[7]: rna[:1000]
[7]: 'ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTA
AAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAG
GACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTT
TCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTG
CCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACA
TCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCA
AACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGT
CGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAA
GAACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTG
ATCCTTATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAAC
GGAGGGGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCT
AGCACGTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCC
GTGAACATGAGCATGAAATTGCTTGGTACACGGAACGTTCT'
[8]: rna[-1000:]
[8]: 'GCTGGCAATGGCGGTGATGCTGCTCTTGCTTTGCTGCTGCTTGACAGATTGAACCAGCTTGAGAGCAAAATGTCTGGTA
AAGGCCAACAACAACAAGGCCAAACTGTCACTAAGAAATCTGCTGCTGAGGCTTCTAAGAAGCCTCGGCAAAAACGTACT
GCCACTAAAGCATACAATGTAACACAAGCTTTCGGCAGACGTGGTCCAGAACAAACCCAAGGAAATTTTGGGGACCAGGA
ACTAATCAGACAAGGAACTGATTACAAACATTGGCCGCAAATTGCACAATTTGCCCCCAGCGCTTCAGCGTTCTTCGGAA
TGTCGCGCATTGGCATGGAAGTCACACCTTCGGGAACGTGGTTGACCTACACAGGTGCCATCAAATTGGATGACAAAGAT
CCAAATTTCAAAGATCAAGTCATTTTGCTGAATAAGCATATTGACGCATACAAAACATTCCCACCAACAGAGCCTAAAAA
GGACAAAAAGAAGAAGGCTGATGAAACTCAAGCCTTACCGCAGAGACAGAAGAAACAGCAAACTGTGACTCTTCTTCCTG
CTGCAGATTTGGATGATTTCTCCAAACAATTGCAACAATCCATGAGCAGTGCTGACTCAACTCAGGCCTAAACTCATGCA
GACCACACAAGGCAGATGGGCTATATAAACGTTTTCGCTTTTCCGTTTACGATATATAGTCTACTCTTGTGCAGAATGAA
TTCTCGTAACTACATAGCACAAGTAGATGTAGTTAACTTTAATCTCACATAGCAATCTTTAATCAGTGTGTAACATTAGG
GAGGACTTGAAAGAGCCACCACATTTTCACCGAGGCCACGCGGAGTACGATCGAGTGTACAGTGAACAATGCTAGGGAGA
GCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGG
AGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'
[9]: len(rna)
[9]: 29903
While the human genome is over 3 billion bases in length, the genome of this virus does not even
reach a length of 30,000.
The next task in this class activity is to find if this sequence occurs in the RNA we just
downloaded, and if it does, where it occurs. To this end, we first make the replacements
required to read the string in terms of A, T, G, and C.
[11]: 'ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTACCAAGAGTGTGTTAGAGGTA
CAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATACGAGGGCAATTCACCATTTCATCCTCTAGCTGATAACAAA
TTTGCACTGACTTGCTTTAGCACTCAATTTGCTTTTGCTTGTCCTGACGGCGTAAAACACGTCTATCAGTTACGTGCCAG
ATCAGTTTCACCTAAACTGTTCATCAGACAAGAGGAAGTTCAAGAACTTTACTCTCCAATTTTTCTTATTGTTGCGGCAA
TAGTGTTTATAACACTTTGCTTCACACTCAAAAGAAAGACAGAATGATTGAACTTTCATTAATTGACTTCTATTTGTGCT
TTTTAGCCTTTCTGCTATTCCTTGTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTGCAAGATCATAATGAA
ACTTGTCACGCCTAAACGAAC'
The next step is now a triviality in view of python’s exceptional string handling mecha-
nisms:
[12]: s in rna
[12]: True
We may also easily find the location of the ORF7a sequence and read off the entire string
beginning with the sequence.
[13]: rna.find(s)
[13]: 27393
[14]: rna[27393:]
[14]: 'ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTACCAAGAGTGTGTTAGAGGTA
CAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATACGAGGGCAATTCACCATTTCATCCTCTAGCTGATAACAAA
TTTGCACTGACTTGCTTTAGCACTCAATTTGCTTTTGCTTGTCCTGACGGCGTAAAACACGTCTATCAGTTACGTGCCAG
ATCAGTTTCACCTAAACTGTTCATCAGACAAGAGGAAGTTCAAGAACTTTACTCTCCAATTTTTCTTATTGTTGCGGCAA
TAGTGTTTATAACACTTTGCTTCACACTCAAAAGAAAGACAGAATGATTGAACTTTCATTAATTGACTTCTATTTGTGCT
TTTTAGCCTTTCTGCTATTCCTTGTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTGCAAGATCATAATGAA
ACTTGTCACGCCTAAACGAACATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATG
TAGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCACTTCTATTCTAAATGGTATA
TTAGAGTAGGAGCTAGAAAATCAGCACCTTTAATTGAATTGTGCGTGGATGAGGCTGGTTCTAAATCACCCATTCAGTAC
ATCGATATCGGTAATTATACAGTTTCCTGTTTACCTTTTACAATTAATTGCCAGGAACCTAAATTGGGTAGTCTTGTAGT
GCGTTGTTCGTTCTATGAAGACTTTTTAGAGTATCATGACGTTCGTGTTGTTTTAGATTTCATCTAAACGAACAAACTAA
AATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTA
ACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTC
ACCGCTCTCACTCAACATGGCAAGGAAGACCTTAAATTCCCTCGAGGACAAGGCGTTCCAATTAACACCAATAGCAGTCC
AGATGACCAAATTGGCTACTACCGAAGAGCTACCAGACGAATTCGTGGTGGTGACGGTAAAATGAAAGATCTCAGTCCAA
GATGGTATTTCTACTACCTAGGAACTGGGCCAGAAGCTGGACTTCCCTATGGTGCTAACAAAGACGGCATCATATGGGTT
GCAACTGAGGGAGCCTTGAATACACCAAAAGATCACATTGGCACCCGCAATCCTGCTAACAATGCTGCAATCGTGCTACA
ACTTCCTCAAGGAACAACATTGCCAAAAGGCTTCTACGCAGAAGGGAGCAGAGGCGGCAGTCAAGCCTCTTCTCGTTCCT
CATCACGTAGTCGCAACAGTTCAAGAAATTCAACTCCAGGCAGCAGTAGGGGAACTTCTCCTGCTAGAATGGCTGGCAAT
GGCGGTGATGCTGCTCTTGCTTTGCTGCTGCTTGACAGATTGAACCAGCTTGAGAGCAAAATGTCTGGTAAAGGCCAACA
ACAACAAGGCCAAACTGTCACTAAGAAATCTGCTGCTGAGGCTTCTAAGAAGCCTCGGCAAAAACGTACTGCCACTAAAG
CATACAATGTAACACAAGCTTTCGGCAGACGTGGTCCAGAACAAACCCAAGGAAATTTTGGGGACCAGGAACTAATCAGA
CAAGGAACTGATTACAAACATTGGCCGCAAATTGCACAATTTGCCCCCAGCGCTTCAGCGTTCTTCGGAATGTCGCGCAT
TGGCATGGAAGTCACACCTTCGGGAACGTGGTTGACCTACACAGGTGCCATCAAATTGGATGACAAAGATCCAAATTTCA
AAGATCAAGTCATTTTGCTGAATAAGCATATTGACGCATACAAAACATTCCCACCAACAGAGCCTAAAAAGGACAAAAAG
AAGAAGGCTGATGAAACTCAAGCCTTACCGCAGAGACAGAAGAAACAGCAAACTGTGACTCTTCTTCCTGCTGCAGATTT
GGATGATTTCTCCAAACAATTGCAACAATCCATGAGCAGTGCTGACTCAACTCAGGCCTAAACTCATGCAGACCACACAA
GGCAGATGGGCTATATAAACGTTTTCGCTTTTCCGTTTACGATATATAGTCTACTCTTGTGCAGAATGAATTCTCGTAAC
TACATAGCACAAGTAGATGTAGTTAACTTTAATCTCACATAGCAATCTTTAATCAGTGTGTAACATTAGGGAGGACTTGA
AAGAGCCACCACATTTTCACCGAGGCCACGCGGAGTACGATCGAGTGTACAGTGAACAATGCTAGGGAGAGCTGCCTATA
TGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'
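The nucleotide frequencies referred to below can be collected into a dictionary, here called freq; a minimal sketch (the original cell may have computed it differently):

    freq = {b: rna.count(b) for b in 'ATGC'}    # how often each nucleotide occurs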
[16]: freq
[17]: url2 = 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?' + \
'tool=portal&save=file&log$=seqview&db=nuccore&report=fasta&' + \
'id=1828694245&extrafeat=null&conwithfeat=on&hide-cdd=on'
f2 = '../../data_external/SARS-CoV-2-Washington_MT293201.1.fasta'
[18]: r2 = urllib.request.urlopen(url2)
fo2 = open(f2, 'wb')
shutil.copyfileobj(r2, fo2)
You might have already heard in the news that there are multiple strains of the virus
around the globe. Let’s investigate this genetic code a bit closer.
Is this the same genetic code as from the Wuhan sample? Repeating the previous pro-
cedure on this new file, we now make a string object that contains the RNA from the
Washington sample. We shall call it rna2 below.
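A sketch of that repetition:

    lines2 = open(f2, 'r').readlines()
    rna2 = ''.join([line.strip() for line in lines2[1:]])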
We should note that not all data sets use just ATGC. There is a standard notation that ex-
tends the four letters, e.g., N is used to indicate any nucleotide. So, it might be a good idea
to answer this question first: what are the distinct characters in the new rna2? This can be
done very simply in python if you use the set data structure, which removes duplicates.
[20]: set(rna2)
The next natural question might be this. Are the lengths of rna and rna2 the same?
We could also look at the first and last 30 characters and check if they are the same, like so:
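A sketch of such checks:

    len(rna) == len(rna2), rna[:30] == rna2[:30], rna[-30:] == rna2[-30:]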
[25]: freq2
Although the Washington genome is not identical to the Wuhan one, its nucleotide fre-
quencies are very close to the Wuhan frequencies, reproduced here:
[26]: freq
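We can also check whether the ORF7a sequence s occurs in the Washington genome; a sketch of the check producing the output below:

    s in rna2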
[27]: True
[28]: rna2.find(s)
[28]: 27364
Thus, we located the same ORF7a instruction in this virus at a different location. Although
the genetic codes from the Washington sample and the Wuhan sample are different, they
can make the same protein ORF7a, and their nucleotide frequencies are very close.
This activity provided you with just a glimpse into the large field of bioinformatics, which
studies, among other things, patterns of nucleotide arrangements. If you are interested in
this field, you should take a look at Biopython, a bioinformatics python package.
VII
Fibonacci primes
April 9, 2020
Fibonacci numbers appear in so many unexpected places that I am sure you have already
seen them. They are elements of the Fibonacci sequence Fn defined by
$F_0 = 0, \qquad F_1 = 1,$
$F_n = F_{n-1} + F_{n-2}, \qquad \text{for } n > 1.$
Obviously, this recursive formula gives infinitely many Fibonacci numbers. We also know
that there are infinitely many prime numbers: the ancient Greeks knew it (actually proved
it) in 300 BC!
But, to this day, we still do not know if there are infinitely many prime numbers in the Fibonacci
sequence. These numbers form the set of Fibonacci primes. Its (in)finiteness is one of the
still unsolved problems in mathematics.
In this activity, you will compute a few initial Fibonacci primes, while reviewing some
python features along the way, such as generator expressions, yield, next, all, line mag-
ics, modules, and test functions. Packages we shall come across include memory_profiler,
primesieve, and pytest.
ni , n = 0, 1, 2, . . . , N − 1
succinctly:
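A sketch consistent with the values printed later in this section (the exact expression in the original cell may differ):

    L = [n**2 for n in range(1, 10)]    # a list comprehension (brackets)
    G = (n**2 for n in range(1, 10))    # the same with parentheses: a generator expression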
If you change the brackets to parentheses, then instead of a list comprehension, you get a
different object called generator expression.
Both L and G are examples of iterators, an abstraction of a sequence of things with the
ability to tell, given an element, what is the next element of the sequence. Since both L and
G are iterators, you will generally not see a difference in the results if you run a loop to
print their values, or if you use them within a list comprehension.
[3]: [l for l in L]
[4]: [g for g in G]
[5]: [g for g in G]
[5]: []
The difference between the generator expression G and the list L is that a generator expres-
sion does not actually compute the values until they are needed. Once an element of the
sequence has been computed, the generator can only move on to compute the next element of
the sequence. If the end of a finite sequence was already reached in a previous use of the
generator, then there are no more elements of the sequence to compute. This is why we
got the empty output above.
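The same values can be produced by a generator function, here called GG to match the cells below; a sketch using the yield statement discussed next:

    def GG():
        for n in range(1, 10):
            yield n**2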
[7]: G2 = GG()
print(*G2) # see that you get the same values as before
1 4 9 16 25 36 49 64 81
The yield statement tells python that this function does not just return a value, but rather
a value that is an element of a sequence, or an iterator. Internally, in order for something
to be an iterator in python, it must have a well-defined __next__() method: even though
you did not explicitly define anything called __next__ when you defined GG, python seeing
yield defines one for you behind the scenes.
Recall that you have seen another method whose name also began with two underscores,
the special __init__ method, which allows you to construct a object using the name of the
class followed by parentheses. The __next__ method is also a “special” method in that it
allows you to call next on the iterator to get its next value, like so:
[8]: G2 = GG()
     next(G2), next(G2), next(G2)
[8]: (1, 4, 9)
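Printing whatever remains (a sketch of the next cell) shows that the generator resumes where it stopped:

    print(*G2)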
16 25 36 49 64 81
As you can see, a generator “remembers” where it left off in a prior iteration.
[10]: i = -20
N = 10**8
Per official standards, memory should be reported in mebibytes (MiB), a power of two that
is close to 103 (“mebi” is made up of words “mega” and “binary”), although the commer-
cial world continues to use 10-based MB, GB, etc. The “increment” reported in the above
output in MiB is the difference between the peak memory and the memory used just before
memit was called: that gives the memory used by the statement.
Clearly we should not need so much memory for such a simple task. A better solution
is offered by the generator expression. It has no memory issues since it doesn’t store all
the elements of the sequence at once. Moreover, we can decide to stop iterating when the
numbers in the sequence get below machine precision, thus getting to the sum faster.
s = 0
for g in G3:
    s += g
    if g < 1e-15:
        break
print(s)
1.0000009539620338
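Generators can also describe sequences with no predetermined end; a sketch of such a generator of natural numbers, consistent with the values 0 through 5 printed below:

    def natural_numbers():
        n = 0
        while True:          # this loop never terminates on its own!
            yield n
            n += 1

    for n in natural_numbers():
        print(n)
        if n >= 5:
            break            # without this, the loop would run forever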
0
1
2
3
4
5
In fact the function count in module itertools does just this. Python does assume that
you are smart enough to use these without sending yourself into infinite loops. If you want
to stay safe, then avoid using while True, replacing it with while n < max where max is
some maximum number, a sentinel, that you never plan to exceed.
To generate the Fibonacci sequence, we use a generator that keeps in memory two prior elements of the sequence, as follows.
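A sketch of such a generator (the meaning of its argument as an upper bound on the values produced is an assumption):

    def fibonacci(max):
        """Generate Fibonacci numbers below max."""
        f, fnext = 0, 1
        while f < max:
            yield f
            f, fnext = fnext, f + fnext    # the tuple swap mentioned below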
[17]: Fn = fibonacci(10000)
print(*Fn)
Note that we have used python’s tuple swap idiom in the definition of fibonacci above.
To understand it, note the evaluation order of expressions
expr3, expr4 = expr1, expr2
per the official documentation. The tuple swap idiom is an example (yet another) of how
python achieves brevity without compromising elegance or readability.
[18]: P = [2, 3]
[19]: [4 % p for p in P]
[19]: [0, 1]
[20]: all([4 % p for p in P])
[20]: False
Hence the generator would conclude that the number 4 is not a prime, and proceed to the
next case n = 5, which it would conclude is a prime because:
[21]: True
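A sketch of a prime-number generator built on this idea, testing each candidate for divisibility by the primes found so far:

    def prime_numbers(N):
        """Generate prime numbers less than N."""
        primes = []
        for n in range(2, N):
            if all([n % p for p in primes]):
                primes.append(n)
                yield n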
[23]: list(prime_numbers(70))
[23]: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67]
VII.8 Verification
Verification refers to the process of cross checking that a program behaves as expected in
a few chosen cases. It is the developer’s responsibility to ensure that a code is correct.
Part of this responsibility involves designing and adding test functions that verify that the
code produces the correct output in a few cases where correct output is known. Complex
codebases with multiple files often have a suite of test functions. After a development
team member changes one file, if the test suite does not pass all tests, it is a sign that the
change has broken the code functionality due to some unanticipated repercussions of the
change in other parts of the code.
How do you know that the result of your fibonacci_primes is correct? We could design
checks, say by verifying that our prime number routine is correct for the first few primes,
plus a similar check for Fibonacci numbers. Or, we could look up the known Fibonacci
prime numbers and see if we have got the first few correctly. Design of such tests is the
process of verification. While there is no standard method for it, one often-used principle is
to code with all cases in mind and test using known cases.
Let us use the Online Encyclopedia of Integer Sequences (OEIS), which reports the currently
known values of n for which Fn is prime. It is listed there as sequence A001605. Here are
the first 10 elements of this sequence:
Based on this we can write a test function. A test function has a name that begins with test
and does not take any argument. In the test function below you see a statement of the
form assert Proposition, Error, which will raise an AssertionError and print Error
if Proposition evaluates to False (and if True, then assert lets execution continue to the
next line). The test checks if our list of initial Fibonacci primes coincides with the one
implied by nFP above.
our_list = fibonacci_primes(N)
known_list = set([F[n] for n in nFP if n < len(F)])
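Only two lines of the test function are visible above. A complete sketch along the same lines might look as follows; the indices in nFP are the first ten entries of OEIS A001605, while the search limit, helper names, and indexing conventions are assumptions that may need adjusting:

def test_fibonacci_prime():
    nFP = [3, 4, 5, 7, 11, 13, 17, 23, 29, 43]            # first 10 entries of OEIS A001605
    N = 10**8                                             # assumed search limit
    F = list(fibonacci(N))                                # Fibonacci numbers below N
    our_list = fibonacci_primes(N)
    known_list = set([F[n] for n in nFP if n < len(F)])
    assert len(known_list.difference(our_list)) == 0, 'We have a bug in fibonacci_primes!'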
[27]: test_fibonacci_prime()
One of the python modules that facilitates automated testing in python codes is the pytest
module. If you run pytest within a folder (directory), it will run all test functions it finds in
all files of the form test_*.py or *_test.py in the current directory and its subdirectories.
Please install pytest in the usual way you install any other python package.
To illustrate its use, let us make up a simple project structure as follows. This also serves as
your introduction to modules in python. Please create a folder fibonacci_primes and files
my_simple_primes.py, fibonacci.py and test_fibonacci_primes.py within the folder as
shown below:
fibonacci_primes <- create this folder within ../pyfiles
|-- fibonacci.py <- define functions fibonacci & fibonacci_primes
|-- my_simple_primes.py <- copy prime_numbers function definition here
|-- test_fibonacci_primes.py <- copy test_fibonacci_prime function here
Individual files may be thought of as python modules and you can import from them
the way you have been importing from external packages. Since fibonacci.py uses
prime_numbers which is now in a different file my_simple_primes.py, you should add
the line
from my_simple_primes import prime_numbers
at the top of fibonacci.py. Additionally, since test_fibonacci_primes.py uses the func-
tions fibonacci and fibonacci_primes, in order for the test function to find these function
definitions, you should include the line
from fibonacci import fibonacci, fibonacci_primes
at the top of test_fibonacci_primes.py.
Now you have a project called fibonacci_primes on which you can apply automated
testing programs like pytest, as shown in the next cell. Note that a test function will run
silently if all tests pass (without printing any outputs).
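The invocation cell itself is not shown here; from within a notebook, the run whose output appears below can be launched with a shell escape along these lines:

!pytest ../pyfiles/fibonacci_primes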
../pyfiles/fibonacci_primes/test_fibonacci_prime.py .
[100%]
To see what the output would look like if a test fails, you might want to run this again after
deliberately creating a bug: for example, set the initializer for Fibonacci recurrence to 2
instead of 1 in the file fibonacci.py and then return to run the above pytest to see what
happens.
need to install python-primesieve and primesieve depending on your system.) After you
install it, the following two lines will give you the same prime number list.
[29]: from primesieve import primes # do after you have installed primesieve
list(primes(70))
[29]: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67]
This package is many times faster than our simple code with a python loop above, as
you can verify using another %magic. (Recall that iPython/Jupyter facilities like %timeit
and %memit are called line magics. You can read more about line and cell magics in your
[JV-H].)
[30]: %timeit primes(1000)
3.15 µs ± 53.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
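The second timing below presumably comes from a companion cell (not visible here) along these lines:

%timeit list(prime_numbers(1000))      # hypothetical companion cell for comparison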
1.2 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
56
Theorem 1. A number F is a Fibonacci number if and only if $5F^2 + 4$ or $5F^2 - 4$ is a perfect
square.
Let’s close by implementing this check for a number to be a perfect square using math.sqrt
and let’s use it together with primesieve to get a Fibonacci prime.
import math
import primesieve

def is_square(n):                    # is n a perfect square?
    s = int(math.sqrt(n))
    return s*s == n

it = primesieve.Iterator()           # an iterator over primes
it.skipto(2**28-1)                   # skip ahead, starting near 2**28
p = it.next_prime()
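The search itself is not visible in this extract; a sketch of it, using Theorem 1 together with the iterator above, might read as follows (it can take a while to run):

while not (is_square(5*p*p + 4) or is_square(5*p*p - 4)):
    p = it.next_prime()              # advance until Theorem 1 certifies a Fibonacci number
print('Found a prime Fibonacci number:', p)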
Do feel free to experiment with increasing the skipto limit above. But maybe it is now time
to manage your expectations a bit. A few decades ago, 2^31 − 1 was the largest integer
representable on most computers. Now that 64-bit computing is common, we can go up
to 2^63 − 1 (and a bit more with unsigned integers). To go beyond, not only will we need
much faster programs, but also specialized software to represent larger integers and do
arithmetic with them.
57
VIII
Numpy blitz
Numpy arrays are more efficient than lists because all elements of numpy arrays are of the
same pre-determined type. Numpy also provides efficient ufuncs (universal functions)
which are vectorized functions that loop over array elements with loops pre-compiled in
C. Numpy also exhibits some syntactic features that a mathematician may consider a nui-
sance, but knowing how to work with numpy is key for almost all scientific computation
in python. It takes some practice to be comfortable with numpy and practice is what we
are aiming for in this activity. This activity is structured as a list of questions. You will
be in a position to appreciate these questions (and their answers) after going through this
lecture’s prerequisites on numpy yourself.
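The cells below work with a python list A and a numpy array a made from it; judging from the outputs that follow, they were set up along these lines (a sketch):

import numpy as np
A = [0.1, 1.3, 0.4, 0.5]          # a python list
a = np.array(A)                   # a numpy array made from the list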
type(a), type(A)
Here is how you find out the common data type of elements of a numpy array (and there
is no such analogue for list, since list elements can be of different types).
[3]: dtype('float64')
[4]: 2*a
[5]: 2*A
58
[5]: [0.1, 1.3, 0.4, 0.5, 0.1, 1.3, 0.4, 0.5]
64.7 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.79 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The functions np.sin and np.exp are examples of numpy’s universal functions (ufuncs)
that act directly on arrays. While np.sin is a unary ufunc, there are many binary ufuncs
like np.add: when you write x+y, you are actually calling the binary ufunc np.add(x, y).
Ufuncs are vectorized functions.
59
[8]: def f(v): # apply f to one scalar value v
return math.sin(v) * math.exp(-v)
To gain (1), the convenience feature, there are at least two options other than using ufuncs:
a) Use map
A function f acting on a scalar value can be made into a function that acts on a vector of
values using the functional programming tool map(f, x), which returns an iterator that
applies f to every element of x.
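For instance, with the scalar function f defined above, map can be used as follows (x here is an assumed sample array, purely for illustration):

x = np.linspace(0, 10, 100)       # assumed sample points
y = list(map(f, x))               # f applied to each element of x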
Both options (a) and (b) provide the convenience feature (1), letting you avoid python
loops, and allow you to write expressive short codes.
However, neither option (a) nor (b) gives you the full performance of numpy’s ufuncs.
356 ms ± 3.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
333 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
There was a time (in Python 2) when range was not as efficient, but those times have
passed.
60
• The defaults b=0, e=len(a), and s=1 may be omitted in a slice specification.
• Negative indices count from the end of the array: a[-1] is the last element of a and
a[-k] = a[len(a)-k].
• Positive indices count from the beginning as usual.
[13]: a = np.random.randint(0,9,5)
a
If you have understood these, then you should be able to say what the expected results are
from each of the following statements.
a[::]
a[-1]
a[len(a)-1]
a[-3:]
a[-4:-1:2]
slice = range(-4,-1,2)      # note: this name shadows python's built-in slice
a[-4:-1:2], a[slice]
Verify your answers:
[14]: a[::]
[15]: a[-3:]
[16]: (3, 3)
[17]: a[-4:-1:2]
61
VIII.1.3 Do you really know what = does?
[19]: a = np.array([1,2,3])
b = np.array([3,4,5,6])
[20]: a = b
a[0] = 1
a
We certainly expected the 3 to become 1 in a. Did you also expect the following?
[21]: b
We can double check that these objects occupy different memory locations using python’s
id function.
[23]: id(a), id(b)
62
[23]: (4529008192, 4529242560)
Names a and b are now both bound to the same “Object2”. (And we no longer have a
way to access “Object1”!) Now, when we change a, or when we change b, we are really
changing the same object.
[27]: a, b
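As an aside (not part of the original cells): if you want a to hold an independent copy of b's data, ask for a copy explicitly.

a = b.copy()      # a now owns its own data
a[0] = 100        # modifying a no longer affects b
a, b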
[29]: amat = np.array(Amat)
amat
Note that 2D and 1D numpy arrays are of the same type called numpy.ndarray.
[31]: 2*Amat
[32]: 2*amat
[33]: amat * amat
Look at the output: is this really matrix multiplication?! This is one thing that drives math-
ematicians crazy when they look at numpy for the first time. Mathematicians want the
multiplication operator * to mean matrix multiplication when it is applied to numpy ar-
rays. Unfortunately python’s default * does element-by-element multiplication, not ma-
trix multiplication. Since the first proposal for numpy, decades ago, many battles have
been waged to resolve this embarrassment.
Finally, a few years ago, there came some good news. Since Python 3.5, the @ symbol was
dedicated to mean the matrix multiplication operator. You can read more about it at PEP
465.
64
[35]: import sys
print(sys.version) # check if you have version >= 3.5 before trying @
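The cell demonstrating @ is not visible here; with a recent enough Python it is simply:

amat @ amat       # true matrix multiplication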
Naturally, many of us needed matrix multiplication before the @ came along, so as you
must have guessed, there is another way to do matrix multiplication:
[38]: amat.dot(amat)
Numpy provides a matrix_power routine to compute Mn for matrices M (and integers n).
[40]: np.linalg.matrix_power(amat, 2)
65
the result is a submatrix of A using row indices in rslice and column indices in cslice.
66
[45]: array([[7, 8, 5, 1, 2, 5],
[5, 2, 9, 6, 8, 9]])
But the actual situation is more complicated since numpy allows users to override the default
storage ordering. You can decide to store an array like in C or like in Fortran. Here is how
to store the same array in Fortran’s column-major ordering.
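The corresponding cell is not shown in this extract; a sketch of such a conversion is:

aF = np.array(amat, order='F')    # same matrix, but stored in column-major (Fortran) order
aF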
Obviously, it is the same matrix. How the data is stored internally is mostly immaterial
(except for some performance optimizations). The behavior of (most) numpy methods
does not change even if the user has opted to store the array in a different ordering. If you
really need to see a matrix’s internal ordering, you can do so by calling the ravel method
with keyword argument order='A'.
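For the Fortran-ordered copy made in the sketch above (aF is an assumed name), this looks like:

aF.ravel(order='A')       # elements appear in their internal (column-major) storage order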
[50]: N = np.arange(25).reshape(5,5)
N
How will you isolate elements in N whose value is between 7 and 18?
67
[51]: mask = (N>7) & (N<18)
mask
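The cell producing the output below presumably applies the mask to the array, along these lines:

N[mask]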
[52]: array([ 8, 9, 10, 11, 12, 13, 14, 15, 16, 17])
68
[[9, 1, 8, 7, 1],
[2, 8, 8, 1, 3],
[1, 6, 5, 3, 8]]])
69
        [0, 6, 5]     [8]
a.ndim=2          b.ndim=2          <= No need for Step 1 modification
a.shape=(2, 3)    b.shape=(2, 1)    <= Apply Step 2 to equalize shape
a + b = [1, 8, 3] + [1, 1, 1] = [ 2,  9,  4]
        [0, 6, 5]   [8, 8, 8]   [ 8, 14, 13]
Example 2:
a + b = [1, 8, 3] + [1]
        [0, 6, 5]
a.ndim=2          b.ndim=0          <= Apply Step 1 to equalize ndim
a.shape=(2, 3)    b.shape=()           (prepend 1 until ndim equalizes)
a + b = [1, 8, 3] + [1, 1, 1] = [ 2, 9, 4]
        [0, 6, 5]   [1, 1, 1]   [ 1, 7, 5]
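The first two examples are easy to verify in numpy (a quick check, using the arrays shown above):

a = np.array([[1, 8, 3],
              [0, 6, 5]])
b = np.array([[1],
              [8]])
a + b, a + 1          # reproduces Example 1 and Example 2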
Example 3:
a + b = [1, 8, 3] + [1, 3]
        [0, 6, 5]
a.ndim=2          b.ndim=1          <= Apply Step 1 to equalize ndim
a.shape=(2, 3)    b.shape=(2, )
70
IX
The SEIR model of infectious diseases
Recent news of COVID-19 has brought to our attention the stories of the many earlier
pandemics the world has seen. A classic case is a strain of influenza that invaded New
York City during 1968-1969, then dubbed the Hong Kong flu. The following data (from
[DM]) shows the number of deaths that winter in New York City believed to be due to this
flu.
[1]: import matplotlib.pyplot as plt
%matplotlib inline
import seaborn; seaborn.set();
plt.bar(range(1,14), [14,28,50,66,156,190,156,108,68,77,33,65,24])
plt.xlabel('Week'); plt.ylabel('Excess deaths');
plt.xticks(range(1,14)); plt.title('1968-69 Influenza in New York City');
Notice how the data from Week 1 to Week 13 roughly fits into a bell-shaped curve. You
have, by now, no doubt heard enough times that we all must do our part to flatten the curve.
The bell-shaped curve, which has been identified in many disease progressions, is the
curve we want to flatten. Some mathematical models of epidemic evolution, for instance
the well-known “SIR model” discussed in [DM], produce such bell curves. Flattening
the curve can then be interpreted as bringing relevant model parameters into a range that
produces a shallow bell.
Mathematical models are often used as tools for prediction. However, we should be wary
that models only approximate a few features of reality, and only when realistic parameter
values (which are often missing) are supplied. Yet, as the saying goes, “All models are
wrong, but some are useful.” Even if a model is far away from the “truth”, the “whole
truth”, it helps us understand the process being modeled by revealing the consequences
of various hypotheses. Hence mathematical models are key instruments of computational
thinking.
In this activity, we will study a mathematical model called the SEIR model of infectious dis-
ease progression. In the last few weeks, many researchers have been furiously working to fit
the emerging COVID-19 data into variants of the SEIR model. A number of contributions
can be viewed at the Bulletin of World Health Organization (WHO) which now maintains
a special COVID-19 Open archive.
A number that emerges from models like the SIR or the SEIR model, called R0 , or the basic
reproduction number often makes its appearance in popular science. It is even explained
in a film from 2011 called Contagion, which has now gained in popularity in view of
its almost prescient plot. The epidemiological definition of R0 is the average number of sec-
ondary cases produced by one infected individual introduced into a population entirely of
susceptible individuals. One suspects from this definition that if R0 > 1, then there will be
an epidemic outbreak. We will see that this number also naturally emerges from a math-
ematical model. A quite readable review of R0 (written before the COVID-19 pandemic)
gives an R0 of 14.5 for a measles outbreak in Ghana in the sixties. By all current accounts,
the R0 for COVID-19 appears to be between 2 and 3.
• Category “R” consists of individuals who can be removed from the system, e.g., be-
cause they have gained immunity to the disease, or because they have succumbed to
the disease.
S (Susceptible) --λ--> E (Exposed) --σ--> I (Infected) --γ--> R (Recovered)
The model then postulates rules on how populations in each category can move to other
categories. Let us consider the following simple set of rules.
• Assume that individuals move from S to E at the exposure rate λ, i.e., the population
in category S decreases with respect to time t at the rate λ × S and the population in E
correspondingly increases at the same rate:

$$\frac{dS}{dt} = -\lambda S + \cdots, \qquad \frac{dE}{dt} = +\lambda S + \cdots$$
where “· · ·” serves to remind us that there may be other unmodeled factors. In this
discussion, the number of individuals in each category (S, E, etc.) is denoted in italic
type by the same letter (S, E etc.).
• The exposure rate λ should grow with I, the number of infected individuals. A
standard hypothesis is that λ is the product of the transmission rate (or the rate of
contact) β and the probability of infection given that contact occurred, which is I/N
in a total population of N individuals, i.e.,
$$\lambda = \frac{\beta I}{N}.$$
• The incubation rate σ is the rate at which exposed hosts become infected, i.e.,
$$\frac{dE}{dt} = +\lambda S - \sigma E + \cdots, \qquad \frac{dI}{dt} = +\sigma E + \cdots$$
• The recovery rate γ is the rate at which infected individuals move to the R category:
$$\frac{dI}{dt} = +\sigma E - \gamma I + \cdots, \qquad \frac{dR}{dt} = +\gamma I + \cdots$$
$$\frac{dS}{dt} = -\frac{\beta I}{N} S, \qquad
\frac{dE}{dt} = \frac{\beta I}{N} S - \sigma E, \qquad
\frac{dI}{dt} = \sigma E - \gamma I, \qquad
\frac{dR}{dt} = \gamma I.$$
to the “R” category; some infected individuals in category “I” might not gain perfect im-
munity and so may move back to susceptible category “S.” Despite these limitations, even
this basic SEIR model can provide some useful insights on the disease evolution.
$$s = \frac{S}{N}, \qquad e = \frac{E}{N}, \qquad i = \frac{I}{N}, \qquad r = \frac{R}{N}.$$
The equivalent ODE system to be solved for the unknown functions s(t), e(t), i (t), and r (t),
has now become
$$\frac{ds}{dt} = -\beta\, i\, s, \qquad
\frac{de}{dt} = \beta\, i\, s - \sigma e, \qquad
\frac{di}{dt} = \sigma e - \gamma i, \qquad
\frac{dr}{dt} = \gamma i.$$
When supplemented with some initial conditions, say

$$s(0) = 0.99, \qquad e(0) = 0.01, \qquad i(0) = 0, \qquad r(0) = 0,$$

we have completed our formulation of the IVP to be solved. Note that the above initial
conditions correspond to a starting scenario where just 1% of the population is exposed.
Scipy’s integrate module provides a solve_ivp facility for solving IVPs like the above.
The facility assumes you have an IVP of the form
$$\frac{d\vec{Y}}{dt} = \vec{f}(t, \vec{Y}), \quad t_0 \le t \le t_1, \qquad
\vec{Y}(t_0) = \vec{Y}_0,$$
where you know the function $\vec{f} : [t_0, t_1] \times \mathbb{R}^n \to \mathbb{R}^n$ and the initial data $\vec{Y}_0$. It can then
compute an approximation of the solution ⃗Y (t) for t in the interval [t0 , t1 ] using numerical
ODE solvers that you might learn about if you take a numerical analysis course. Type in
help(solve_ivp) into a cell to get more information on how to use this function.
Let us apply this to the SEIR model. To fit to the setting required for solve_ivp, we put
$$\vec{Y} = \begin{bmatrix} s \\ e \\ i \\ r \end{bmatrix}
\qquad\text{and}\qquad
\vec{f}(t, \vec{Y}) = \begin{bmatrix} -\beta\, i\, s \\ \beta\, i\, s - \sigma e \\ \sigma e - \gamma i \\ \gamma i \end{bmatrix}.$$
We have to give this ⃗f as a function argument to solve_ivp. Let’s first define this ⃗f , called
seir_f in the code below, keeping in mind that we also need to provide some values of
β, σ, and γ before we can solve it. We pass these values as additional arguments to seir_f.
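The cell with this definition is not shown here; a sketch of it, together with a call to solve_ivp, could read as follows (the parameter values and time interval are illustrative assumptions, not the author's exact choices):

import numpy as np
from scipy.integrate import solve_ivp

def seir_f(t, y, beta, sigma, gamma):
    s, e, i, r = y
    return np.array([-beta * i * s,
                      beta * i * s - sigma * e,
                      sigma * e - gamma * i,
                      gamma * i])

beta, sigma, gamma = 1.0, 1/5, 1/14            # assumed illustrative values
sol = solve_ivp(seir_f, [0, 180], [0.99, 0.01, 0, 0], args=(beta, sigma, gamma))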
Examining the resulting solution object sol you will notice that it has a numpy array as its
data member sol.y containing the values of the computed solution ⃗Y (t) at values of t con-
tained in another data member sol.t. We can easily send these arrays to the matplotlib
module to get a plot of the solution.
75
As you can see, even with 1% exposed population, the infections rise rapidly.
However, with more time, they begin to fall, making for a bell-shaped curve, like the one
from the previously mentioned New York City data.
76
[9]: plot_ei(beta=0.5); # what happens if beta is reduced?
77
IX.5 Equilibria
In the study of evolution of dynamical systems like the SEIR model, equilibria play an
important role. An equilibrium state is a value of the vector ⃗Y (i.e., values of s, e, i, and
r) for which the rate of change $d\vec{Y}/dt = 0$, i.e., if the system happens to enter an exact
equilibrium, then it no longer changes.
For our SEIR system, an equilibrium state s, e, i, r satisfies
$$\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
= \begin{bmatrix} -\beta\, i\, s \\ \beta\, i\, s - \sigma e \\ \sigma e - \gamma i \\ \gamma i \end{bmatrix}.$$
You should be able to conclude (exercise!) that the only solutions for this system are of the
form
s ≡ constant, e = i = 0, r ≡ constant.
In other words, since e = i = 0, all equilibria of our model are disease-free equilibria. This
matches our previous observations from our simulations. After a transitional phase, where
i and e increase and decrease per the bell-curve, the system settles into an equilibrium of
the form above.
There are other scenarios where an infection persists and never quite disappears from the
population. Such equilibria where the disease is endemic are sometimes called endemic
equilibria.
As an example, suppose our model represents a city’s population, and suppose travel into
and out of the city is allowed. Then we must add terms that represent the influx of travel-
ers in each category (the number of people entering minus the number of people leaving).
Even if we assume that infected people do not travel, a small influx into susceptible cate-
gory S and exposed category E will disturb the disease-free equilibrium of our model. Let
us add terms a and b representing these influxes and see what happens.
-gamma * i + sigma * e,
gamma * i - (a + b)])
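Only the last two components of the modified right-hand side are visible in the fragment above. A sketch of the complete modified function, with the influx terms a and b placed as the surrounding text suggests, is:

def seir_f2(t, y, beta, sigma, gamma, a, b):
    s, e, i, r = y
    return np.array([-beta * i * s + a,              # assumed: influx a into S
                      beta * i * s - sigma * e + b,  # assumed: influx b into E
                     -gamma * i + sigma * e,
                      gamma * i - (a + b)])          # outflux keeping the total fixed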
As you can see from this output, the percentage of the population with the disease now
remains at around 5% and never quite vanishes, an example of an endemic equilibrium.
Suppose we want to guess the stability of one of the previously discussed disease-free
equilibrium states,
$$s = s_0, \qquad e = i = 0, \qquad r = r_0,$$
where $s_0$ and $r_0$ are some constants. Adding the e and the i equations, we observe that
$$\frac{d(e+i)}{dt} = (\beta s - \gamma)\, i.$$
Thus, despite a perturbation brought about by a small surge in the infected population
(resulting in a small positive i value), if the above derivative is negative, i.e., if
$$\beta s_0 - \gamma < 0,$$
then, the value of e + i will decrease to its equilibrium value. This simple argument already
hints at the relevance of the number
$$R_0 = \frac{\beta}{\gamma} s_0,$$
which is the basic reproductive number for this model. In some texts, R0 is defined (to
match the epidemiological definition) using an initial population that is 100% susceptible,
in which case s0 = 1 and R0 = β/γ.
The argument sketched above was not a proof that the system will return to the disease-
free equilibrium, but rather a sketch to show you why R0 naturally emerges from the
model, using only the calculus tools you have already studied. (Nonetheless, one can
indeed rigorously prove that if R0 < 1, the disease-free equilibrium is stable, see e.g.,
[HSW].)
Clearly, the infected population, despite a positive bump in infections, decays as seen be-
low. In other words, when R0 < 1 there is no outbreak.
Next, consider an example where R0 > 1.
80
This output clearly shows that the small percentage of introduced infections rapidly
increases. (If you worry about the small initial dip in i observed in the output, do please
try to plot e + i to convince yourselves.) The system eventually goes on to attain (the
unique) disease-free equilibrium, but only after inflicting some damage. Summarizing,
when R0 > 1, we can expect an epidemic outbreak.
Regarding the impact of vaccinations (provided a vaccine exists) the model does have
something to say. If a fraction, say v, of the population is vaccinated, then that population
moves directly from the S category to the R category. Therefore, changing s0 to s0 (1 − v),
we see that R0 reduces to R0 = βs0 (1 − v)/γ. A vaccine would therefore be effective to
prevent an epidemic outbreak if enough people are vaccinated, i.e., if v is sufficiently large
in order to bring R0 under 1.
81
• β = R0 γ = 2.2γ.
Finally, let us additionally assume a scenario where 0.02% of the world’s population is in-
fected at the start of the simulation, and thrice that many are exposed. (The current number
of active infections worldwide appear to be around 0.02% of the world’s population.) Here
are the outputs of the simulation under these values.
These output curves seem to suggest that the infection will proceed well into the next two
months before subsiding.
The social distancing and other governmental measures that we are now practicing can be
viewed from the perspective of this simple SEIR model. They are designed to reduce the
transmission rate β. Please go ahead and experiment to see what you get with lower β
values that you can imagine.
You will see that lowering β by a little has two effects:
• it reduces the peak value of the curves (multiply the percentage value by the world’s
population ≈ 7.5 billion, to see the effect in terms of the reduction in number of
people affected), and
• it moves the location of the infection peak farther out in time (i.e., the infection per-
sists longer but in lower numbers).
On the other hand, lowering β by a lot (enough to make R0 < 1) will take you to a regime
where e + i decreases, indicating the other side of the peak, where we really want the world
to be as soon as possible.
82
X
The Singular Value Decomposition
One of the early and natural ideas in software development for scientific computation was
the idea of packaging linear algebra software around robust implementations of matrix
factorizations, divested from specific applications. This enabled durable linear algebra
tools like LAPACK to be developed with a stable interface, usable across numerous appli-
cation domains. Commercial software like Matlab™, as well as open-source software like
octave, numpy and scipy, all take full advantage of these developments. What does this
mean for you as an aspiring data scientist? When you are faced with a specific computa-
tional task, if you are able to reformulate your task using off-the-shelf implementations of
matrix factorizations, then you might already be half-way to finishing your task.
You already know about some matrix factorizations. In your introductory linear algebra
prerequisites, you learnt about eigenvalues and eigenvectors. The computation of eigen-
values and eigenvectors is indeed the computation of a matrix factorization. This factor-
ization is called a diagonalization or an eigenvalue decomposition of an n × n matrix A
and it takes the form
$$A = X D X^{-1}$$
where D is a diagonal matrix and X is an invertible matrix, both of the same size as A. The
eigenvalues of A are found in the diagonal entries of D and the eigenvectors are columns
of X, as can be seen by rewriting the factorization as
$$AX = XD.$$
$$A = U \Sigma V^*$$
where Σ is an m × n diagonal matrix, and U and V are unitary matrices of sizes m × m
and n × n, respectively. (Recall that a square matrix Q is called unitary if its inverse equals
$Q^*$, the conjugate transpose of Q.) The diagonal entries of Σ are non-negative; the positive
ones are called the singular values of A. It is a convention to list the singular values in
non-increasing order along the diagonal. The columns of U and V are called the left and
right singular vectors, respectively.
Here is how we compute SVD using scipy.
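The corresponding cell is not visible in this extract; presumably it brings in svd along these lines (a sketch):

import numpy as np
from scipy.linalg import svd      # the svd function used in the cells below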
Recall from your linear algebra prerequisite that this dimension is what we call rank. All
outer products are of unit rank (unless one of the vectors is trivial).
A very useful way to think of the SVD is to expand the factorization as follows. Naming
the columns of U and V as $u_i$ and $v_j$, we have
$$A = U\Sigma V^* = [u_1, \ldots, u_m]
\begin{bmatrix} \sigma_1 & & \\ & \sigma_2 & \\ & & \ddots \end{bmatrix}
[v_1, \ldots, v_n]^* = \sum_{l=1}^{\min(m,n)} \sigma_l\, u_l v_l^*.$$
Thus the SVD can be viewed as a sum of unit rank outer products. Each summand in-
creases rank (if the corresponding singular value is nonzero) until the rank of A is reached.
Let’s see this in action for a small matrix.
[6]: a = np.random.rand(4, 5)
u, s, vh = svd(a)
Numpy’s broadcasting rules do not, at first sight, make it easy to form the outer product
$u_l v_l^*$. Yet, once you follow the broadcasting rules carefully, you will see that all that is
needed is a placement of a newaxis in the right places.
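For instance, with the u, s, vh computed above, the outer product for one index can be formed as in this small sketch:

k = 0
uk_vk = u[:, k][:, np.newaxis] * vh[k, :]    # (4, 1) broadcast against (5,) gives shape (4, 5)
uk_vk.shape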
Alternately, you can use the facility that numpy itself provides specifically for the outer
product, namely np.outer.
[9]: ar = np.zeros_like(a)
for i in range(4):
ar += np.outer(u[:, i], s[i] * vh[i, :])
85
[-0., -0., -0., -0., 0.],
[-0., -0., -0., 0., -0.]])
The factors of the SVD tell us all the important properties of the matrix immediately, as we
see next. If you enjoy linear algebra, I encourage you to prove the following simple result
yourself.
Notice how the rank-nullity theorem (something you may have been tortured with in your
first linear algebra course) follows as a trivial consequence of items (2) and (3).
To see how the geometry gets transformed (squashed) by the linear operator (matrix) a,
we first plot the unit circle and the parts of the x and y axis within it. Then, we track how
these points are mapped by a, as well as by each of the components of the SVD of a.
def show(c):
plt.plot(c[0, :], c[1, :])
plt.axis('image');
86
t =
np.linspace(0, 3.5 * np.pi , num=300)
l =
np.linspace(-1, 1, num=10)
z =
np.zeros_like(l)
c =
np.array([np.concatenate([l, np.cos(t), z]),
np.concatenate([z, np.sin(t), l])])
show(c)
[13]: show(a @ c)
87
Now, let us see this transformation as a composition of the following three geometrical
operations:
[14]: show(vh @ c)
[15]: show(np.diag(s) @ c)
[16]: show(u @ c)
88
When you compose these operations, you indeed get the transformation generated by a.
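For instance, assuming u, s, vh above are the SVD factors of the 2 × 2 matrix a being visualized, the composition can be checked directly (a sketch):

show(u @ np.diag(s) @ vh @ c)     # reproduces the picture of a @ c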
$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}
= \begin{bmatrix} a & 0 \\ 0 & 0 \end{bmatrix}
+ \begin{bmatrix} 0 & b \\ 0 & 0 \end{bmatrix}
+ \begin{bmatrix} 0 & 0 \\ c & 0 \end{bmatrix}
+ \begin{bmatrix} 0 & 0 \\ 0 & d \end{bmatrix}.$$
Each of the matrices on the right can have rank at most one.
As we have already seen, the SVD also expresses A as a sum of rank-one outer products.
However, the way the SVD does this, is special in that a low-rank minimizer can be read off
the SVD, as described in the following (Eckart-Young-Mirsky) theorem. Here we compare
matrices of the same size using the Frobenius norm
$$\|A\|_F = \Bigl( \sum_{i,j} |A_{ij}|^2 \Bigr)^{1/2}.$$
The theorem answers the following question: how close can we get to A using matrices
whose rank is much lower than the rank of A?
89
using the singular values $\sigma_j$ and the left and right singular vectors $u_j$, $v_j$ of A, i.e., $A_\ell$ is the
sum of the first $\ell$ terms of the SVD when written as a sum of outer products. Then, the
minimum of $\|A - B\|_F$ over all m × n matrices B of rank not more than $\ell$ is attained by
$B = A_\ell$, i.e., the minimum equals $\|A - A_\ell\|_F = (\sigma_{\ell+1}^2 + \cdots + \sigma_r^2)^{1/2}$.
This matrix approximation result is perhaps best visualized using images. Think of an im-
age as a matrix. Using matplotlib’s imread we can load PNG images. Here is an example.
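The loading cell is not shown in this extract; it was presumably along these lines (the filename here is hypothetical):

cats = plt.imread('cats.png')        # hypothetical filename; any grayscale PNG will do
cats.shape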
This is a 3-dimensional array because images are represented using RGBA values at each
pixel (red, green, blue and an alpha value for transparency). However this image of my
cats is actually a black and white image, so all RGB values are the same, as can be verified
immediately:
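The verification cell is not visible here; one way to check it (a sketch) is to compare channels:

np.linalg.norm(cats[..., 0] - cats[..., 1])      # 0.0 if the two channels coincide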
[19]: 0.0
The above line contains two features that you might want to note: the use of the ellipsis to
leave the dimension of a numpy slice unspecified, and the way to compute the Frobenius
norm in numpy. Restricting to the first of the three identical image channels, we continue:
[20]: c = cats[..., 0]
plt.imshow(c, cmap='gray');
90
Let us take the SVD of this 1040 x 758 matrix.
[21]: u, s, vh = svd(c)
[22]: plt.plot(s);
You can see a sharp drop in the magnitude of the singular values. This is a good indication
that the later summands in the SVD representation of A,
min(m,n)
A= ∑ σj u j v∗j
j =1
are adding much less to A than the first few summands. Therefore, we should be able to
represent the same A using the first few outer products.
91
[24]: # Rank 50 approximation of the cats:
l = 50; cl = u[:, :l] @ np.diag(s[:l]) @ vh[:l, :]
plt.imshow(cl, cmap='gray');
If you increase the rank l to 100, you will find that the result is visually indistinguishable
from the original.
Returning to Theorem 2, notice that the theorem also gives one the ability to specify an
error tolerance and let that dictate the choice of the rank ℓ. E.g., if I do not want the error
in my low-rank approximation to be more than some specific ε, then I need only choose ℓ
so that

$$(\sigma_{\ell+1}^2 + \cdots + \sigma_r^2)^{1/2} \le \varepsilon.$$
As an example, suppose I declare I want a low-rank approximation within the following
relative error in Frobenius norm:
[25]: relative_error = 1.e-1
Then we can find the needed ℓ using an aggregation and masking (see [JV-H] for the prereq-
uisite material on this) as follows.
[26]: s2 = s**2
total = np.sum(s2)
diff = np.sqrt((total - np.add.accumulate(s2)) / total)
l = np.argmax(diff < relative_error) + 1
l
[26]: 41
You can check that the error is indeed less than the prescribed error tolerance.
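The check itself is not shown; it presumably recomputes the rank-l approximation and its relative error along these lines (a sketch):

cl = u[:, :l] @ np.diag(s[:l]) @ vh[:l, :]       # rank-l approximation with the l found above
np.linalg.norm(c - cl) / np.linalg.norm(c)       # relative error in the Frobenius norm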
[28]: 0.09942439
As you can see, the low rank approximation does give some image compression. The
number of entries needed to store a rank ℓ approximation cl of an m × n matrix is mℓ +
ℓ + ℓn:
[29]: u.shape[0] * l + l + l * vh.shape[0]
[29]: 73759
In contrast, to store the original image (single channel) we would need to minimally store
mn numbers:
[30]: c.shape[0] * c.shape[1]
[30]: 788320
Comparing the two previous outputs, we can certainly conclude that we have some com-
pression. However, for image compression, there are better algorithms.
The utility of the low-rank approximation technique using the SVD is not really in im-
age compression, but rather in other applications needing a condensed operator. Being a
compressed approximation of an operator (i.e., a matrix) is much more than being just a
compressed visual image. For example, one can not only reduce the storage for a matrix,
but also reduce the time to apply the operator to a vector using a low-rank approximation.
Because the SVD tells us the essentials of an operator in lower dimensional spaces, it con-
tinues to find new applications in many modern emerging concepts, some of which we
shall see in later lectures.
94
XI
Bikes on Tilikum Crossing
May 1, 2020
A car-free bridge is still considered a ridiculous idea in many parts of our country. Port-
landers beg to differ. Portland’s newest bridge, the Tilikum Crossing, opened in 2015, and is
highly multimodal, allowing travel for pedestrians, bikes, electric scooters, trains, street-
cars, and buses (but the modality of travel by personal car is missing). Bike lanes were not
an afterthought, but rather an integral part of the bridge design. One therefore expects to
see a good amount of bike traffic on Tilikum.
In this activity, we examine the data collected by the bicycle counters on the Tilikum. Port-
land is divided into east side and west side by the north-flowing Willamette river and the
Tilikum connects the two sides with both eastbound and westbound lanes. Here is a photo
of the bike counter (the black display, located in between the streetcar and the bike lane)
on the bridge.
Portlanders use the numbers displayed live on this little device to boast about Portland’s
bike scene in comparison to other cities. The data from the device can also be used in more
complex ways. The goal of this lesson is to share the excitement of extracting knowledge
or information from data - it is more fun than a Sherlock Holmes tale. In this activity, you
get to be Mr. Holmes while you wrangle with the data and feel the thrill of uncovering
the following facts that even many of the locals don’t know about. (a) Most of those who
bike to work on Tilikum live on the east side. (b) Recreational bikers on Tilikum prefer
afternoon rides. (c) There are fewer bikers on the bridge after social distancing and they
appear to use the bridge during afternoons.
Comparison with Seattle’s Fremont bridge bike counter data reveals more, as we shall
see: (a) there are fewer bikers on Portland’s Tilikum than on Seattle’s Fremont bridge in
general. (b) During peak hours, bikers are distributed more evenly on Seattle’s Fremont
bridge travel lanes than on Tilikum. (c) The bike usage on both bridges has shifted to a
recreational pattern after social distancing.
The BikePed Portal provides some of the data collected from the counter for the public,
but currently only subsampled data can be downloaded from there. Here we shall instead
use the full raw data set collected by the counters, which is not yet publicly downloadable.
I gratefully acknowledge Dr. Tammy Lee and TREC for making this data accessible. This
activity is motivated by the material in Working with Time Series section of [JV-H].
96
volume flow_detector_id
0 0 1903
1 0 1903
2 0 1903
3 0 1903
4 0 1903
[3]: td.tail()
Looking through first few (of the over 300,000) data entries above, and then examining the
meta data file contents in tm, we conclude that volume gives the bike counts. The volume is
for 15-minute intervals as seen from measure_period. A quick check indicates that every
data entry has a starting and ending time that conforms to a 15-minute measurement.
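The checking cell is not visible here; one such check (a sketch) is:

(pd.to_datetime(td.end_time) - pd.to_datetime(td.start_time) == pd.Timedelta('15min')).all()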
[4]: True
Therefore, let us rename start_time to just time and drop the redundant data in end_time
and measure_period (as well as the id) columns.
[5]: td = td.rename(columns={'start_time': 'time'}).drop(
         columns=['end_time', 'measure_period', 'id'])
The meta data also tells us to expect three detectors and three values of flow_detector_id.
Here are a few entries from the meta data:
[6]: tm.T.loc[['detector_description', 'flow_detector_id', 'detector_make',
               'detector_name', 'facility_description'], :]
[6]: 0 \
detector_description Inbound towards East
flow_detector_id 1903
97
detector_make EcoCounter
detector_name Tilikum Crossing 1 EB
facility_description South bike lane of Tilikum Crossing Bridge
1 \
detector_description Inbound towards West
flow_detector_id 1904
detector_make EcoCounter
detector_name Tilikum Crossing (EAST)
facility_description North bike lane of Tilikum Crossing Bridge
2
detector_description Inbound towards West
flow_detector_id 1905
detector_make EcoCounter
detector_name Tilikum Crossing 2 WB
facility_description North bike lane of Tilikum Crossing Bridge
Although there are three values of flow_detector_id listed above, one of these values
never seems to appear in the data file. You can check that it does not as follows:
[7]: (td.flow_detector_id==1904).sum()
[7]: 0
Therefore, going through the meta data again, we conclude that eastbound and westbound
bikes pass through the flow detectors with id-numbers 1903 and 1905, respectively.
The next step is to reshape the data into the form of a time series. The start_time seems
like a good candidate for indexing a time series. But it’s a red herring. A closer look will
tell you that the times are repeated in the data set. This is because there are distinct data
entries for the eastbound and westbound volumes with the same time stamp. So we will
make two data sets (since our data is not gigabytes long, memory will not be an issue), a
tE for eastbound volume and tW for westbound volume.
[8]: tE = td.loc[td['flow_detector_id']==1903, ['time', 'volume']]
     tE.index = pd.DatetimeIndex(pd.to_datetime(tE['time'])).tz_convert('US/Pacific')
     tW = td.loc[td['flow_detector_id']==1905, ['time', 'volume']]
     tW.index = pd.DatetimeIndex(pd.to_datetime(tW['time'])).tz_convert('US/Pacific')
     tE = tE.drop(columns=['time']).rename(columns={'volume':'Eastbound'})
     tW = tW.drop(columns=['time']).rename(columns={'volume':'Westbound'})
Note that we have now indexed eastbound and westbound data by time stamps, and re-
named volume to Eastbound and Westbound respectively in each case.
We are now ready for a first look at the full time series. Let us consider the eastbound data
98
first.
[10]: tE.plot();
Clearly, we have problems with this data. A spike of 7000 bikers passing through in 15
minutes, even for a bike-crazed city like Portland, just does not seem right. Zooming in,
we find the situation even more disturbing, with a lot of zero readings before the spike:
[11]: tE['2018-11-25':'2019-06-01'].plot();
There are reports from TriMet of construction in 2018 and city traffic advisories in 2019 that
might all affect bike counter operation, but since the data set seems to have no means to in-
dicate these outages, we are forced to come up with some strategy ourselves for discarding
the false-looking entries from the data.
First, exploiting pandas’ ability to work with missing values, we declare the entries for the
dates in the above plot to be missing. Note that missing data is not the same as zero data.
When the bike counter is not working, the data should ideally be marked as missing, not
zero. Since our suspicion is that outages might have resulted in defective counts, we shall
effectively remove all data entries for these dates from the data set, as follows:
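The cell doing this is not shown; a sketch of one way to mark that date range as missing (the range is taken from the plot above) is:

import numpy as np
tE.loc['2018-11-25':'2019-06-01'] = np.nan
tW.loc['2018-11-25':'2019-06-01'] = np.nan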
Next, we shall declare all entries with a volume of more than 1000 bikes per 15 minutes to
be a missing/defective value in both the westbound and eastbound data.
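A sketch of how this can be done (the exact cell is not visible here):

tE.loc[tE['Eastbound'] > 1000, 'Eastbound'] = np.nan
tW.loc[tW['Westbound'] > 1000, 'Westbound'] = np.nan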
Examining the above graph, we still see spikes that look unreasonably high in the be-
ginning of the data, but they may actually be real because at the official opening of the
bridge there were 30,000 to 40,000 people and at least 13,000 bikes milling around. Simi-
larly, the other spikes may be real data. One can try to explain them, e.g., by consulting
https://bikeportland.org/events/, from which you might conclude that the spike on Au-
gust 25, 2019 is due to a Green Loop event, and that large spike on June 29, 2019 might be
due to all the people coming over for the World Naked Bike Ride; or was it some afterparty
of Loud’n Lit event? I can’t really tell. We’ll just leave it at that, and blame the remaining
spikes on the groovy bike scene of Portland.
The quarter-hour samples look too dense in the plot above. A better picture of the situation
is obtained by extracting weekly counts of bikes in both directions from the data.
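The corresponding cell is not shown; it presumably combined the two directions and resampled weekly, along these lines (t is the name the later cells appear to use):

t = tE.join(tW)                      # both directions in one frame
t.resample('W').sum().plot();        # weekly totals in each direction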
100
XI.3 The pattern of use
The Tilikum is being used both by people who commute to work using a bicycle as well as
recreational bicycle users. We can understand more about this division among bikers by
dividing the data into weekend and weekday entries.
The only technical skills you need for this are numpy.where and an understanding of
pandas.Timestamp objects. (Please ensure you have studied Working with Time Series
section of [JV-H] before proceeding.) Combined with a use of pandas.groupby, we can
then extract the mean biker volumes for each 15-minute interval during the day.
The result is the distribution plotted below.
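The helper weekplot used below is not defined in this extract. A sketch of how such a plot can be made with numpy.where and groupby (not the author's actual helper) is:

def weekplot(d, day='Weekday'):
    label = np.where(d.index.dayofweek < 5, 'Weekday', 'Weekend')
    d.groupby([label, d.index.time]).mean().loc[day].plot(title=day + ' distribution')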
weekplot(t)
101
The hourly distribution is distinctly “bimodal”. There is a group of westbound commuters
(on their way to work) on the bridge in the morning, and a group (probably the same
people) traveling eastbound after work. If you look closely, you will find that there are two
slightly smaller bumps indicating that there are some (although many fewer) eastbound
morning bikers and westbound evening bikers across the bridge. Yet, on the whole, the
data leads us to the interesting conclusion that the overwhelming majority of the bike
commuters on the Tilikum live on the east side and commute to the west for work and
return daily.
Often the purpose of understanding data is to guide policy and action. What might one
do with the pattern we have just discovered? The current numbers are small enough not
to pose a bike traffic problem. But envision a future where the bike counts will grow. If
it grows maintaining the same lop-sided utilization pattern, what are the city’s options
to encourage optimal bridge usage? Bike traffic flow control modifications? Generation
of more jobs on the east side? More residential zoning near the west end of the bridge?
These are complex issues where an urban planner’s expertise is needed. Nonetheless, I
hope to have convinced you of the importance of going from data (clicks on a counter) to
knowledge (patterns of use).
Next, let us look at the non-commuter, recreational, use, assuming that they occur in the
weekends. In sharp contrast to the weekday distribution, below we find that the weekend
distribution has just one peak.
102
Both the eastbound and westbound lanes seem to find a good amount of use in the week-
end. There is, most definitely, a preference for recreational riding in the afternoon. I sup-
pose that is not a major surprise in Portland as afternoons are most often when we are
given a reprieve from the battleship gray of the cloud cover.
103
Clearly, the strong bimodal distribution has weakened considerably after we all started
isolating ourselves. This perhaps comes as no surprise, since both universities on the west
side of Tilikum have switched to remote classes. It makes sense that there are fewer west-
bound commuters in the morning. What about the afternoon peak? One could imagine
various explanations for this: people isolating themselves all morning, getting restless in
the afternoon, especially with such unusually good weather we were having in April, and
deciding to take their bikes out for some fresh air. Whatever the case may be, we can summa-
rize our conclusion from the data as follows: social distancing has changed the weekday
bike use on Tilikum from a commuter to a recreational pattern.
Of course, we can also compare the overall statistics before and after social distancing, but
the results are too blunt to point out differences like the above. From the statistics outputs
below, we see that the average number of bikers per quarter-hour in each direction has
decreased by about 1:
The data can also tell us the reduction in terms of number of bikers per week, although
we should perhaps use it with some caution as not enough weeks have passed after social
distancing started to form a robust sample.
The westbound lane certainly seems to have suffered more reduction in traffic after social
distancing, whichever way we slice it.
104
XI.5 Comparison with Seattle’s Fremont bridge
Although Portland claims to be the first city in the US to adopt the open data program,
Seattle’s open data program is something to envy. Seattle’s Fremont bridge bike counter
data, even way back from 2012, is readily available for anyone to download, thanks to their
open data program (at the URL below). Let’s take a peek at their data.
[22]: import os
import shutil
import urllib
url = "https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD"
f = "../../data_external/Fremont_Bridge_Bicycle_Counter.csv"
if not os.path.isdir('../../data_external/'):
os.mkdir('../../data_external/')
if not os.path.exists(f):
with open(f, 'wb') as fo:
r = urllib.request.urlopen(url)
shutil.copyfileobj(r, fo)
[23]: sd = pd.read_csv(f)
sd.tail()
105
[24]: East West
time
2012-10-03 00:00:00 4.0 9.0
2012-10-03 01:00:00 4.0 6.0
2012-10-03 02:00:00 1.0 1.0
2012-10-03 03:00:00 2.0 3.0
2012-10-03 04:00:00 6.0 1.0
The Tilikum data is spikier than Seattle’s Fremont data (compare the max values in the
above outputs), but the average volumes (mean) are clearly higher in Seattle. That the
volume is higher in Seattle is even more clear if we plot weekly counts on both bridges on
the same axis.
[27]: sw = sd.resample('W').sum()
tw = t.resample('W').sum()
fig, axs = plt.subplots(1, 2, figsize=(13, 3), sharey=True)
plt.subplots_adjust(wspace=0.05)
106
sw.plot(ax=axs[0], title='Fremont bridge (Seattle) bikes/week');
tw.plot(ax=axs[1], title='Tilikum bridge (Portland) bikes/week');
The Fremont bridge has good bike traffic flow in both directions during the peak hours,
unlike the Tilikum. We conclude that during peak hours, bikers are distributed more
evenly on Seattle’s Fremont bridge travel lanes than on Portland’s Tilikum.
Somewhat remarkably, despite all the above-seen differences, the weekday bike counts of
both cities respond to social distancing in quite the same fashion: the bimodal weekday
distribution of commuting to work has become a unimodal afternoon recreation pattern.
108
XII
Visualizing geospatial data
May 6, 2020
Geospatial data refers to data which has a geographic component in it. Usually this means
that the data entries are associated to some point on the surface of the earth. Therefore,
geospatial data are usually visualized on maps.
Because the earth is round, in order to make a flat map, one must transform the earth’s
surface to a flat surface. Such transformations are called projections by cartographers (not
to be confused with linear projections from linear algebra). Mathematicians know that a
transformation between topologically different regions must be discontinuous somewhere.
So these projections, while very useful, cannot be perfect replicas of reality. It is useful
to know this and a bit more about python modules for projections while attempting to
visualize geospatial data on the globe.
Many references, including Geographic Data with Basemap of [JV-H], use the python mod-
ule basemap. However in recent years, the module basemap has been deprecated in favor
of the new python mapping project called Cartopy. Therefore, this activity aims at tak-
ing a quick look at cartopy. Cartopy, together with geopandas, a package built on top of
pandas, shows promise in easing geospatial visualization. They are nonetheless relatively
new efforts. You will notice that their documentation, while constantly improving, is not
as mature as, say numpy. There will likely be a number of changes in the provided facili-
ties as these efforts take hold. Our goal in this activity is to get acquainted with these new
developments for visualizing geospatial data.
109
Here is a GeoDataFrame object we have already used in 01 Overview:
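The cell loading it is not visible here; world was presumably created along these lines (a sketch using the dataset bundled with geopandas):

import geopandas
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
type(world)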
[2]: geopandas.geodataframe.GeoDataFrame
[3]: type(world.geometry)
[3]: geopandas.geoseries.GeoSeries
A GeoDataFrame can have many columns of type GeoSeries, but there can only be one ac-
tive geometry column, which is what is returned by the attribute GeoDataFrame.geometry.
Note also that the name of this active GeoSeries can be anything (not necessarily
geometry), but in the world object, the column happens to be called geometry, as can be
seen below.
[4]: world.geometry.name
[4]: 'geometry'
[5]: world.head()
geometry
0 MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2 POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3 MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4 MULTIPOLYGON (((-122.84000 49.00000, -120.0000...
The plot method of the data frame is redefined in a GeoDataFrame to use the geometry
objects in the active geometry column. So to plot this map, all we have to do is use the
plot method:
[6]: world.plot()
110
A GeoDataFrame has more attributes than a regular pandas data frame. For example, it
stores the centroids of the shapes in the active geometry column.
[7]: type(world.centroid)
[7]: geopandas.geoseries.GeoSeries
This is a GeoSeries that did not show up when we queried world.head(), but it is an
attribute of world. We can, of course, make it an additional column of world in the
usual pandas way.
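The cell doing so is not shown; presumably it was along these lines:

world['centroids'] = world.centroid       # add the centroids as a new GeoSeries column
world.head()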
geometry \
0 MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2 POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3 MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4 MULTIPOLYGON (((-122.84000 49.00000, -120.0000...
centroids
0 POINT (163.85316 -17.31631)
1 POINT (34.75299 -6.25773)
2 POINT (-12.13783 24.29117)
3 POINT (-98.14238 61.46908)
4 POINT (-112.59944 45.70563)
Now, world has two GeoSeries columns. If we make the centroids column the active
geometry column, then the output of the plot method changes since it uses the active
column’s geometry specifications.
111
[9]: world = world.set_geometry('centroids')   # change the active geometry column
     world.plot();
world.plot();
[10]: crs.PlateCarree()
As you guessed from the above output, the CRS object is able to plot itself using matplotlib.
This points to one avenue for visualizing geospatial data that has no need for geopandas.
Cartopy produces a matplotlib axis on which you can overlay your data as you see fit:
if your data has latitude and longitude associated to it, cartopy can apply the relevant
projection automatically to place it at the proper place in the figure. Below, we will focus
on alternative visualization methods using geopandas and facilities to interact between
cartopy and geopandas.
The world object of class GeoDataFrame comes with a CRS attribute, another attribute that
does not show up in world.head() output.
[11]: world.crs
112
- Prime Meridian: Greenwich
We see that the CRS above is codenamed EPSG:4326. Here EPSG stands
for European Petroleum Survey Group. They maintain several data sets at
https://spatialreference.org/ref/epsg/. Each set in “EPSG Geodetic Parameter Dataset is
a collection of definitions of coordinate reference systems and coordinate transformations
which may be global, regional, national or local in application.”
EPSG-number 4326 that we have here belongs to the WGS 84 coordinate system, the latest
in the World Geodetic System (WGS) where the earth is represented as an oblate spheroid
with coordinates in decimal degrees (Lat, Lon).
Changing the active geometry from previously set centroids to the original geometry col-
umn, let us plot the world again in order to compare the result with another CRS.
We compare this output with another very well-known projection, the Mercator projection,
which has the nice property that it is conformal, i.e., it preserves angles. EPSG has made
available a Web Mercator projection. We can easily convert world from the WGS 84 to the
Mercator CRS:
[13]: world_Mercator = world.to_crs("EPSG:3395")
world_Mercator.plot();
plt.title('World in Mercator CRS');
113
Generations of school kids have grown up with this Mercator map. Note how the Merca-
tor projection distorts the size of objects as the latitude increases from the equator to the
poles. Some countries near equator look smaller than they really are. For example, Brazil
is actually larger than the contiguous United States, but it looks smaller in the Mercator
projection. If you have time for a digression, have a look into the many discussions online,
e.g., The Problem With Our Maps, on how the false sizes on maps (perhaps inadvertently)
shape our views on countries.
[14]: ae = crs.AzimuthalEquidistant()
type(ae)
[14]: cartopy.crs.AzimuthalEquidistant
Then we convert the cartopy object to a form usable by geopandas through an intermediate
step. That step offers a glimpse of what lies at the core of many of the mapping tools, the
PROJ project. (All these dependencies made installing geopandas more involved, as you
recall from our early install sessions.)
114
[15]: aeproj4 = ae.proj4_init      # convert to a proj4 string/dict usable in geopandas
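The conversion cell itself is not visible here; presumably it reads along these lines (a sketch):

world_ae = world.to_crs(aeproj4)          # world reprojected to the azimuthal equidistant CRS
world_ae.plot();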
This represents the geopandas object world_ae, which is the world object whose geometry
has been converted to the azimuthal equidistant CRS. From the above output, we see that
the azimuthal equidistant projection shows the central view in good proportions, with ob-
vious distortions farther out although the distortions are evocative of the globe. However,
this CRS is often difficult to use for showing data that is spread over the populous parts of
the globe. (You can change the central view as shown in the next cell.) See, for example,
how difficult it is to get the far east, the west, and Europe, together in any perspective, due
to the vastness of the intermediate Pacific ocean.
[16]: crs.AzimuthalEquidistant(central_longitude=200, central_latitude=10)
115
by cartopy, we will convert or project the existing geometry objects in world_ae to
AlbersEqualArea CRS as shown below.
[18]: aea_geo = [aea.project_geometry(ii, src_crs=ae)
for ii in world_ae['geometry'].values]
Since cartopy works directly with matplotlib, we can immediately render the resulting
geometries in matplotlib’s axis.
We can alternately produce essentially the same plot using geopandas as follows.
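The corresponding cell is not shown; one way to do it (a sketch, assuming aea is the AlbersEqualArea CRS object used above) is:

world_ae.to_crs(aea.proj4_init).plot();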
116
[21]: import os
from git import Repo
First update your data folder by pulling the newest reports of COVID-19 from the GitHub
repository maintained by Johns Hopkins’ researchers.
[22]: # covidfolder (defined in the part of this cell not visible here) names a local clone location
      if os.path.isdir(covidfolder):                   # if cloned previously, pull the newest data
          repo = Repo(covidfolder); repo.remotes.origin.pull()
      else:                                            # otherwise clone the JHU CSSE repository
          repo = Repo.clone_from('https://github.com/CSSEGISandData/COVID-19.git', covidfolder)
      datadir = repo.working_dir + '/csse_covid_19_data/csse_covid_19_time_series'
      f = datadir + '/time_series_covid19_confirmed_global.csv'
[23]: c = pd.read_csv(os.path.abspath(f))
c = c.rename(columns={'Country/Region': 'country'}).iloc[:, 1:]
c.head()
Some countries are repeated (as their provinces are being counted in separate data entries),
as can be seen from the following nonzero difference:
[24]: len(c['country']) - len(set(c['country']))
[24]: 78
Therefore, we sum up each country’s confirmed COVID-19 counts for each day using a
pandas groupby operation. Of course, it doesn’t make sense to sum up the latitudes and
longitudes, so we take their average values within countries. (Taking the mean of latitudes
and longitudes will do for our current purposes, but it is definitely not a perfect solution,
which is why I’ll be asking you to improve on it by making a choropleth map in the next
assignment.)
[25]: cg = c.groupby('country')[c.columns[3:]].sum()
cg['Lat'] = c.groupby('country')['Lat'].mean()
cg['Long'] = c.groupby('country')['Long'].mean()
This newly created data frame has no GeoSeries. We use the latitude and longitude infor-
mation to create point geometries. With this information, we can make a GeoDataFrame.
Now, in cg, we have a GeoDataFrame object that should know how to plot its data columns
in the data’s associated places on the globe.
# set marker sizes, with a min marker size for cases > 1000
msz = 500 * np.where(cg[date]-1000, np.maximum(cg[date]/mx, 0.001), 0)
cg.plot(ax=ax, cmap='Wistia', markersize=msz, alpha=0.5)
[28]: covidworldmap('5/5/20')
118
The world is now littered with COVID-19 cases. I wish the world and our country had
fared better, but the data doesn’t lie.
119
XIII
Gambler’s Ruin
A gambler G starts with two chips at a table game in a casino, pledging to quit once 8 more
chips are won. G can either win a chip or lose a chip in each game, with probabilities p
and 1 − p, respectively (independent of past game history). What is the probability that G
accumulates a total of 10 chips playing the game repeatedly, without being ruined while
trying? The ruining state is one where G has no chips and can no longer play. What is the
probability that G is ruined while trying to get to 10 chips?
The goal of this activity is to introduce you to the rich subject of Markov chains, and in the
process also get you acquainted with graphs, random walks, and stochastic matrices, with
the gambler G as an entry point example. Since a probability course is not a prerequisite
for this course, I will try to present the results colloquially and without proofs, with my
advance apologies to the probabilists among you. The two theorems you will find stated
below are proved using probabilistic tools (see, for example, [S]) and are material one might
usually find in a statistics program. In the next activity, I will connect this to material from
other fields.
(Figure: the digraph of a Markov chain with states S0 , S1 , . . . , S4 and transition probabilities pij labeling the directed edges.)
The assumptions when considering a Markov chain are that the system is required to move
from state to state (the next state can be the same as the current state), and that the next
state is determined only by the current state’s transition probabilities (not by prior states).
The latter is called the memorylessness property or the Markov property. The former implies
that the probability that the system will transition from the current state is one, so that for
any i, ∑j pij = 1.
Note that the sum above may be finite or infinite: if the set of states is a finite set, say
S = {S0 , S1 , . . . , S N }, then the above sum runs from j = 0, 1, through N; otherwise the sum
should be treated as an infinite sum. Irrespective of the finiteness of the set of states S, the
Markov chain itself is thought of as an infinite sequence.
(Optional note: Here is a formal definition using the conditional probability notation,
Pr( A| B). A stochastic sequence Xn taking values from a set of states S = {S0 , S1 , . . .} is
called a Markov chain if for any subset of states Si , S j , Sk0 , Sk1 , . . . , Skn−1 ∈ S,
Pr( Xn+1 = S j | Xn = Si )
= Pr( Xn+1 = S j | Xn = Si , Xn−1 = Skn−1 , Xn−2 = Skn−2 , . . . , X0 = Sk0 )
= pij .
Throughout, we only consider what are known as time-homogeneous Markov chains, where
the probabilities pij are independent of the “time” step n.)
To connect this to the concept of random walks, let us first introduce graphs.
XIII.2 Graphs
In mathematics, we often use the word graph in a sense completely different from the
graph or plot of a function.
A graph (V, E) is a set V of n vertices, together with a set E of m edges between (some)
vertices. Although vertices are often pictorially represented as points, technically they can
be whatever things lumpable into a set V, e.g., people, labels, companies, power stations, cities, etc.
Edges are often pictorially represented as line segments (or curves) connecting two points
representing two vertices, but technically, they are just a “choice of two vertices” (not nec-
essarily distinct) from V, e.g., corresponding to the above-listed vertex examples, an edge
can represent friendship, similarities, spinoffs, wires, roads, etc.
When the above-mentioned “choice of two vertices” is changed to an ordered tuple, then
the ordering of the two vertex choices that form an edge is significant, i.e., the edge has a
direction. Thus a directed edge from vertex vi to vertex v j is the tuple (vi , v j ). If all edges
in E are directed, the graph is called a directed graph or a digraph. If a non-negative number,
a weight, is associated to each edge of a digraph, then we call the graph a weighted digraph.
Python does not come with a graph data structure built in. Before you begin to think this
somehow runs counter to the “batteries-included” philosophy of python, let me interrupt.
Python’s built-in dictionary data structure encapsulates a graph quite cleanly. Here is an
example of a graph with vertices a, b, c, d:
[1]: gd = {'a': ['b', 'd'], # a -> b, a -> d
'b': ['c', 'd', 'a'] } # b -> c, b -> d, b -> a
You can use a dict of dicts to incorporate more edge properties, such as assign
names/labels, or more importantly, weights to obtain a weighted digraph.
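For instance, a dict-of-dicts version of a weighted digraph might look like the following sketch (the weight values here are illustrative, not taken from the lecture).

gd2 = {'a': {'b': {'weight': 0.1}, 'd': {'weight': 0.2}},                      # a -> b, a -> d
       'b': {'c': {'weight': 0.3}, 'd': {'weight': 0.4}, 'a': {'weight': 0.5}}} # b -> c, b -> d, b -> a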
Although we now have a graph data structure using the python dictionary, for this to be
useful, we would have to implement various graph algorithms on it. Thankfully, there are
many python packages that implement graph algorithms. Let’s pick one package called
NetworkX as an example. Please install it before executing the next code cell. NetworkX
allows us to send in the above dictionary to its digraph constructor.
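(The constructor call itself is in a cell not shown in this excerpt; a minimal sketch, applied here to the weighted dictionary gd2 above so that the weight-related calls further down have labels to display, is:)

import networkx as nx

g = nx.DiGraph(gd2)      # digraph whose edges and weights come from the dictionary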
Now g is a DiGraph object with many methods. To see all edges connected to vertex a, a
dictionary-type access is provided. We can use it to double-check that the object is made
as intended.
[4]: g['a']
You can plot this graph using NetworkX’s facilities (which uses matplotlib under the
hood).
[5]: import matplotlib.pyplot as plt
     %matplotlib inline

     def plot_gph(g):
         pos = nx.spectral_layout(g)
         nx.draw(g, pos, with_labels=True, node_color='orange')
         labels = nx.get_edge_attributes(g, 'weight')
         nx.draw_networkx_edge_labels(g, pos, edge_labels=labels);

     plot_gph(g)
XIII.3 Random walks
Consider a weighted digraph (V, E) where the weight associated to a directed edge e =
(vi , v j ) is a number 0 < pij ≤ 1. Let us extend these numbers to every pair of vertices vi
and v j such that pij = 0 if (vi , v j ) is not an edge of the graph. Let us restrict ourselves to
the scenario where ∑j pij = 1 for any i.
A random walk on such a directed graph is a sequence of vertices in V generated stochasti-
cally in the following sense. Suppose the nth element of the sequence is the ith vertex vi in
V. Then one of its outgoing edges (vi , v j ) is selected with probability pij , and the (n + 1)th
element of the random walk sequence is set to v j . This process is repeated to get the full
random walk, once a starting vertex is selected.
Both of these notions can be encoded by the same mathematical object, namely the matrix P whose (i, j)th entry is pij . This matrix of probabilities is called a transition matrix (sometimes also called a stochastic matrix) and it can be associated either to a Markov chain or a random walk, provided its rows sum to one.
Here is an example of a transition matrix.
# S0 S1 S2 S3
P = np.array([[0, 0.0, 0.5, 0.5], # S0
[1.0, 0.0, 0.0, 0.0], # S1
[0.0, 0.0, 0.0, 1.0], # S2
[0, 1.0, 0.0, 0.0]]) # S3
Matrix to digraph The above-mentioned conceptual equivalences are often tacitly used
in graph programs. For instance, in NetworkX, one can easily make a graph out of the
above transition matrix P as follows.
[7]: gP = nx.from_numpy_array(P, create_using=nx.DiGraph)
plot_gph(gP)
Digraph to matrix We can, of course, also go the other way. For example, consider the
small graph g we made “by hand” previously:
[8]: plot_gph(g)
[9]: g.nodes # note the ordering of vertices
This NetworkX object g can produce a matrix of the graph’s edge weights. It is typical to
make this as a scipy.sparse matrix since one anticipates many zero entries (correspond-
ing to edges that do not exist in a realistic large graph). The result Pg below is a sparse
matrix, which we convert to a dense matrix, just for printing.
[10]: Pg = nx.convert_matrix.to_scipy_sparse_matrix(g)
Pg.todense()
Note how the matrix entries and edge weights in the figure are in correspondence. The ma-
trix is generally called the adjacency matrix of a weighted graph. (Note that many textbooks
will define adjacency matrix entries with 1 in place of the nonzeros above to accommodate
graphs without weights.) For digraphs in a random walk discussion, we shall refer to this
adjacency matrix as the transition matrix, as previously noted.
(Figure: the digraph of G's Markov chain on states S0 , S1 , . . . , S10 , with probability p on each edge going one state up, probability q = 1 − p on each edge going one state down, and self-loops of probability 1 at S0 and S10 .)
Here we have also indicated two additional pieces of information: G has pledged to quit
upon reaching state S10 , so once the Markov chain reaches S10 it will not go to any other
state forever. Furthermore, if G’s Markov chain reaches the ruining state of S0 , then G can’t
play any more, so the Markov chain cannot go to any other state forever.
Let us look at the corresponding transition matrix, say when p = 0.4.
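(The cell defining the helper PforG is not reproduced in this excerpt; a minimal sketch consistent with the output displayed below is:)

def PforG(p=0.4, N=10):
    # transition matrix of G's chain on states S0, ..., SN (a sketch)
    q = 1 - p
    P = np.zeros((N+1, N+1))
    P[0, 0] = P[N, N] = 1              # S0 and SN are absorbing
    for i in range(1, N):
        P[i, i+1] = p                  # win a chip with probability p
        P[i, i-1] = q                  # lose a chip with probability q
    return P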
PforG(p=0.4)
[11]: array([[1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0.6, 0. , 0.4, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0.6, 0. , 0.4, 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0.6, 0. , 0.4, 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0.6, 0. , 0.4, 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.6, 0. , 0.4, 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0.6, 0. , 0.4, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0.6, 0. , 0.4, 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.6, 0. , 0.4, 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.6, 0. , 0.4],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. ]])
In the scenario described in the beginning, G starts with 2 chips and wants to reach a total
of 10 chips. This translates to the following question: what is the probability that the above
random walk enters the state S10 ? It turns out that such questions are so often asked of
Markov chains that the answer is a basic theoretical result on Markov chains (see the next
theorem below).
Suppose A and B are index sets such that SA = { Si : i ∈ A } and SB = { Si : i ∈ B }
form a disjoint partitioning of the set S of all the states of a Markov chain. The probability
that the Markov chain attains a state in S A in some finite number of steps (i.e., if it ever
hits S A ) starting from a state Si is called a hitting probability of S A and is denoted by hi . Of
course, if i ∈ A, then hi = 1. What is not obvious is the value of hi when the chain starts
from i ∈ B. Classic probabilistic arguments prove that the hi values satisfy the system in
the next theorem.
Theorem 1. The hitting probability of a subset A of states in a Markov chain is the minimal
non-negative solution of the system of equations
hi = ∑j pij hj
for all i with i ∈ B, and hi = 1 if i ∈ A. Here minimality means that if x is another solution,
then xi ≥ hi for all i.
The reasoning that leads to the system of equations in the theorem is as follows: if the
Markov chain starts from state Si , then in the next step, it can be in any of the states S j with
probability pij , from where the hitting probability is h j , so the hitting probability hi from Si
must be the sum of all pij × h j over all states S j . This idea can be formalized into a proof of
Theorem 1. Let me highlight a few more things to note about Theorem 1:
• First trivial solution: From the definition of a Markov chain, recall that ∑j pij = 1. Hence the system of equations in Theorem 1 is satisfied by hi = 1 for all i. However, this solution need not be the minimal one mentioned in the theorem.
• Second trivial solution: One case where the minimal nonnegative solution is obvious
is when A is such that pij = 0 for all i ∈ B and j ∈ A, i.e., when it is not possible to
go from B to A. Then hi = 1 for i ∈ A and hi = 0 for i ∈ B obviously satisfies the system of equations in Theorem 1. Since the hi values for i ∈ B
cannot be reduced any further, this is the minimal solution.
• Collecting hi into a vector h, the system of equations in Theorem 1 can almost be
written as the eigenvalue equation for eigenvalue 1,
h = Ph,
but not quite, because the equation need not hold for indices i ∈ A. Indeed, as stated
in the theorem, hi must equal 1 if i ∈ A.
The approach I present here is not standard, but has the advantage that it uses eigenvector
calculations that you are familiar with from your prerequisites. Even though the eigenvec-
tor remark I made at the end of the previous section is pessimistic, looking at G’s transition
matrix, we find that the condition hi = 1 for i ∈ A can be lumped together with the re-
maining equations. Indeed, because the last row of P contains only one nonzero entry (of
1),
h10 = ∑j p10,j hj
holds. Therefore in G’s case, h is a solution of the system in Theorem 1 if and only if h is a
non-negative eigenvector of P corresponding to eigenvalue 1 and scaled to satisfy h10 = 1.
(Be warned again that this may or may not happen for other Markov chains: see exercises.)
What gives us further hope in the example of G is that we have a key piece of additional
information:
h0 = 0,
i.e., if G starts with no chips, then G cannot play, so G will stay in state S0 forever. We
might guess that this condition will help us filter out the irrelevant first trivial solution
with hi = 1 for all i.
Let me make one more remark before we start computing. The system of equations of
Theorem 1 in the case of G reduces to
hi = phi+1 + (1 − p)hi−1
for 1 ≤ i ≤ 9 together with h0 = 0 and h10 = 1. You can make intuitive sense of this
outside the general framework of the theorem. If G starts with i chips (1 ≤ i ≤ 9) so that
the probability of hitting A is hi , then in the next step there are two cases: ( a) G has i + 1
chips with probability p, or (b) G has i − 1 chips with probability 1 − p. The probability
of hitting A in case ( a) is p × hi+1 , and the probability of hitting A in case (b) is q × hi−1 .
Hence hi must be equal to the sum of these two, thus explaining the theorem’s equation
hi = phi+1 + (1 − p)hi−1 .
Let us now compute hi using the knowledge that in G’s case, h is a non-negative eigenvec-
tor of P corresponding to eigenvalue 1, scaled to satisfy h10 = 1.
P = PforG(p=0.4)
ew, ev = eig(P)
ew
The computed set of eigenvalues of P includes 1 twice. Since the diagonalization (the factor-
ization produced by eig) was successful, we know that the eigenspace of P corresponding
to eigenvalue 1 is two-dimensional. If there are vectors h in this eigenspace satisfying
h0 = 0, h10 = 1,
then such vectors clearly solve the system in Theorem 1. We can algorithmically look for
eigenvectors satisfying these two conditions in a two-dimensional space: it’s a system of
two equations and two unknowns, assembled below in the form Mc = (0, 1)t .
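(The cell that assembles M and checks its determinant is not shown; one way to set it up, reusing ew and ev computed above, is sketched below.)

V = ev[:, abs(ew - 1) < 1e-14].real      # basis of the eigenvalue-1 eigenspace (two columns)
M = V[[0, 10], :]                        # rows enforcing the conditions at S0 and S10
np.linalg.det(M)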
[13]: 0.2825542003687932
The nonzero determinant implies that M is invertible. This means that there is a unique
solution c and hence a unique vector in the eigenspace satisfying both the conditions, which
can be immediately computed as follows.
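(The cell defining Gchances is likewise not reproduced; a minimal sketch, following the eigenvector approach just described, might read:)

def Gchances(p=0.4, N=10):
    # sketch: the eigenvalue-1 eigenvector of PforG(p, N) scaled so h[0] = 0, h[N] = 1
    P = PforG(p, N)
    ew, ev = np.linalg.eig(P)
    V = ev[:, abs(ew - 1) < 1e-14].real
    c = np.linalg.solve(V[[0, N], :], np.array([0, 1]))
    return V @ c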
[15]: h = Gchances(p=0.4)
h
XIII.8 Cross checking
The answer we got above is correct, even if not intuitive. In fact, this is a manifestation of
the phenomenon that goes by the name of Gambler’s Ruin. How can we double-check the
above answer? One way to double-check the answer is the analytical technique described
in the optional exercise below.
Optional exercise: Solve the equations of Theorem 1 in closed form to conclude that the
probability of G making 10 chips, starting from i chips, is
hi = [1 − (q/p)i ] / [1 − (q/p)10 ]
whenever p ̸= q.
I’ll omit more details on this analytical way for verification, since this course is aimed at
learning computational thinking. Instead, let’s consider another computational way.
To let the computer cross check the answer, we design a completely different algorithm
whose output “should” approximate the correct result. Namely, we simulate many many
gambles, get the statistics of the outcomes, and then cross check the frequency of G’s win-
nings. (That this “should” give the right probability is connected to the law of large num-
bers.)
Here is a simple way to implement many gambles (each gamble is a sequence of games
until G reaches either S0 or S10 ) using the built-in random module that comes with python
(and using the uniform distribution in [0, 1]).
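(The cell implementing the simulation is not reproduced here; a minimal sketch, assuming G starts with 2 chips, wins with probability p = 0.4, and stops at 0 or 10 chips, is:)

from random import uniform

def gamble(p=0.4, n=10000, start=2, target=10):
    # sketch: simulate n gambles; wl[k] = 1 if the kth gamble ends at the target
    wl = np.zeros(n)
    for k in range(n):
        chips = start
        while 0 < chips < target:
            if uniform(0, 1) < p:      # win a chip with probability p
                chips += 1
            else:                      # lose a chip otherwise
                chips -= 1
        wl[k] = (chips == target)
    return wl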
[17]: n = 500000
wl = gamble(n=n)
print('Proportion of winning gambles:', np.count_nonzero(wl) / n)
Proportion of winning gambles: 0.02191
The number produced as the output is pretty close to the previously obtained h[2]. In-
deed, it gets closer to h[2] with increasing number of gambles. Now that we have built
more confidence in our answer computed using the eigensolver, let us proceed to examine
all the components of h more closely.
[18]: plt.bar(range(len(h)), h)
plt.title('$G$\'s chances of making 10 chips');
plt.xlabel('Starting number of chips'); plt.ylabel('Probability');
Since G either quits winning or gets ruined with 0 chips (not both), the probability of G’s
ruin is 1 − hi .
[19]: plt.bar(range(len(h)), 1-h, color='red')
plt.title('Chances of $G$\'s ruin');
plt.xlabel('Starting number of chips'); plt.ylabel('Probability');
This exemplifies the concept of Gambler’s Ruin: in a biased game (where p < 1/2), the
probability of G’s ruin could be much higher than the “intuitive” 1 − p for most starting
values.
Note that if all games are unbiased with p = 1/2, then we get the following linear plot, which perhaps jibes with intuition more than the biased case.
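(The cell producing that plot is not shown; a sketch of it, reusing the hypothetical Gchances from above, might be:)

h_fair = Gchances(p=0.5)
plt.bar(range(len(h_fair)), h_fair)
plt.title('$G$\'s chances of making 10 chips in fair games');
plt.xlabel('Starting number of chips'); plt.ylabel('Probability');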
A second way of computing hitting probabilities, applicable to a class of Markov chains defined next, is the subject of the next theorem (Theorem 2 below), also proved using basic probability methods.
A state Si of a Markov chain is called an absorbing state if pii = 1. Clearly, once the chain
reaches an absorbing state, it cannot transition to any other state forever.
An absorbing Markov chain is a Markov chain which has at least one absorbing state and has
paths made of directed edges from any state to an absorbing state.
Partition the states in an absorbing Markov chain using index sets A and B, like before,
except that now A denotes the indices for all the absorbing states (and B indicates the
remaining states). Then the following partitioning of the transition matrix is both easy to
conceptualize and easy to implement using numpy’s slicing operations:

       ⎡ PAA  PAB ⎤
  P =  ⎣ PBA  PBB ⎦
Note that PAA is an identity matrix and PAB is the zero matrix, because pii = 1 for all i ∈ A.
Example: The gambler G has two absorbing states S0 and S10 , and G’s Markov chain is an
absorbing Markov chain. Setting A = {0, 10} and B = {1, . . . , 9}, the blocks of the above
partitioning for this case are as follows:
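(The cell that extracts these blocks is not reproduced; one possible implementation using numpy's fancy indexing, applied to the matrix P = PforG(p=0.4) from before, is:)

A = [0, 10]
B = list(range(1, 10))
PAA = P[np.ix_(A, A)];  PAB = P[np.ix_(A, B)]
PBA = P[np.ix_(B, A)];  PBB = P[np.ix_(B, B)]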
[22]: PBA
[22]: array([[0.6, 0. ],
[0. , 0. ],
[0. , 0. ],
[0. , 0. ],
[0. , 0. ],
[0. , 0. ],
[0. , 0. ],
[0. , 0. ],
[0. , 0.4]])
[23]: PBB
[0. , 0. , 0. , 0. , 0. , 0. , 0.6, 0. , 0.4],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.6, 0. ]])
[24]: PAA
As already noted above, PAA should always be the identity matrix in an absorbing Markov
chain per our definition.
Now we are ready to state the second result on hitting probabilities, this time with an
easier algorithmic prescription for computing them.
Theorem 2. In any finite absorbing Markov chain, the matrix I − PBB is invertible (where I
denotes the identity matrix of the same size as PBB ). Moreover, letting H denote the matrix
whose (i, j)th entry equals the probability that the chain hits an absorbing state S j , j ∈ A,
starting from another state Si , i ∈ B, in any number of steps, we may compute H by H = ( I − PBB )−1 PBA .
Example: In one line of code, we can apply Theorem 2 to gambler G using the matrices
made just before the theorem:
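(That line is not reproduced here; one way to write it is sketched below.)

H = np.linalg.inv(np.eye(9) - PBB) @ PBA     # H = (I - PBB)^{-1} PBA
H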
Since the second entry of A represents the winning state S10 , the second column above gives
the probability of hitting S10 from various starting states. Note that this second column
is the same as the previously computed h. The first column in the output above gives the
probability of G’s ruin for various starting states.
This is the same as considering the N = ∞ case in our previous setting. This case results in an
infinite set of states. Theorem 1 applies both to finite and infinite set of states, but we can
only simulate Markov chains with finite number of states. Nonetheless, we can certainly
apply Theorem 2 to compute the hitting probabilities for larger and larger N and get a feel
for what might happen when N = ∞.
But first, we have to make our code better to go to large N. We use scipy’s sparse facilities
to remake the matrices and improve efficiency.
from scipy.sparse import diags, eye
from scipy.sparse.linalg import spsolve

def ruinG(p=0.4, N=10):      # (function wrapper and solve step reconstructed)
    q = 1 - p
    # Note that the first and last row of P are not accurate
    # in this construction, but we're going to trim them away:
    P = diags(q*np.ones(N), offsets=-1, shape=(N+1, N+1)) \
      + diags(p*np.ones(N), offsets=1, shape=(N+1, N+1))
    A = [0, N]
    B = range(1, N)
    I_PBB = (eye(N-1) - P[np.ix_(B, B)]).tocsc()
    PBA = P[np.ix_(B, A)].tocsc()
    # by Theorem 2, the ruin probabilities are the first column of (I - PBB)^{-1} PBA
    return spsolve(I_PBB, PBA.toarray()[:, 0])
[28]: ruinG(N=10)
After verifying that we obtain the same result for the N = 10 case, we proceed to examine higher values of N. One quick way to visualize the resulting h-values is as plots over the starting states.
fig = plt.figure(); ax = plt.gca()       # (axis setup reconstructed)
hs = ruinG(N=20)
ax.plot(hs[:21], 'r-', label='N=20')
hs = ruinG(N=30)
ax.plot(hs, 'r:', label='N=30')
hs = ruinG(N=40)
ax.plot(hs, 'r-.', label='N=40')
ax.legend();
Clearly, as N increases, we see a larger range of starting indices for which G hardly stands
a chance of escaping ruin.
For a specific case, suppose G is unable to start with more than 20 chips, but is willing
to play N games, for larger and larger N. Then we compute G’s least ruin probability, the
lowest probability of G’s ruin among all possible starting values, namely
min i=0,...,20 hi .
N0 = 20; dbl = 7                  # (values reconstructed from the output below)
p = 0.4                           # (assumed: the bias consistent with the output below)
for i in range(dbl):
    print('N = %5d, least ruin probability = %5.4f'
          %(N0*2**i, min(ruinG(p=p, N=N0*2**i)[:21])))
N = 20, least ruin probability = 0.3334
N = 40, least ruin probability = 0.9995
N = 80, least ruin probability = 1.0000
N = 160, least ruin probability = 1.0000
N = 320, least ruin probability = 1.0000
N = 640, least ruin probability = 1.0000
N = 1280, least ruin probability = 1.0000
What is illustrated in this output is often identified as another, perhaps stronger, manifes-
tation of the Gambler’s Ruin concept: even when the games are fair, G is certain to be ruined if
G continues to play forever.
XIV
Google’s PageRank
In the history of the internet, a collection of papers proposing PageRank has been influen-
tial, in particular, a 1998 paper by Sergey Brin and Lawrence Page, two graduate students,
now better known as Google co-founders. They proposed an objective metric to order the
results of a user’s internet search. For those who don’t remember, there was indeed a time
when “junk results often wash[ed] out any results that a user is interested in,” to quote
the paper. Of course, search engines now operate in a competitive business world, and
the algorithms that Google and other companies use to rank their search results currently
are not public knowledge. But the concept of PageRank is now routinely applied beyond
Google, not just to the internet, but to general data on graphs and networks. It serves as an
automatic tool to rank the relative importance of parts of any internet-like giant network.
We shall view the web (internet) as a directed graph. Each internet location (a webpage)
is a vertex of the graph. A hyperlink from one webpage to another is a directed edge of
the graph. From this viewpoint, the central idea of Brin & Page was to exploit the “link
structure of the web to calculate a quality ranking for each web page.”
To introduce PageRank, we shall build on our previous discussion of Markov chains (from
Gambler’s Ruin), which was entirely from the statistical or probabilistic perspective. Below,
we will connect to theorems of Perron and Frobenius, which are results that one might
usually learn in a mathematics program. Of course, all of this helps us understand the
effectiveness of PageRank, a topic that has entered the computer science curricula in recent
decades. Taken together, we then have an example of propitious convergence of ideas from
the distinct fields of computer science, mathematics, and statistics.
Consider a vector x with one non-negative entry xi for each vertex Vi of the digraph, with the entries summing to one. Such a vector represents a probability distribution on the vertices of the graph. We may
think of xi as the probability that the system is in state Vi . Alternatively, we may think of
xi as the probability of finding the random walker W at the digraph vertex Vi .
How does the probability of finding W on the graph change when W takes a step? Here is
another way of asking the same question: if xi is the probability that the Markov chain is
in state Vi , then what is the probability that the next state of the Markov chain is Vj ? Since
Vj can be arrived at from Vk with probability pkj , and since the prior state was Vk with
probability xk , we conclude that the answer should be the sum of pkj × xk over all the prior
states Vk . In other words, the probability that the next state is Vj equals
∑k=1,...,N pkj xk ,
which is the jth component of Pt x. This argument can be formalized to obtain the following
basic result.
Theorem 1. The probability distribution of a random walk on a directed graph with
stochastic matrix P changes from x to Pt x in each step (where Pt denotes the transpose
of P).
Note that if x is a probability vector and P is a transition matrix, then Pt x is guaranteed
(exercise!) to come out as a probability vector.
A probability distribution s on the vertices is called a stationary distribution of the Markov chain (or of the random walk) if it satisfies Pt s = s.
Example A
[2]: PA = np.array([[1/2, 1/4, 1/4],
[1/3, 1/3, 1/3],
[1/3, 1/3, 1/3]])
To examine whether the powers ( Pt )n converge, let us compute the Frobenius norm of the successive differences, ∥( Pt )n+1 − ( Pt )n ∥ F .
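(The cell computing these norms is not reproduced; a sketch of it might be:)

from numpy.linalg import matrix_power, norm

Pt = PA.T
[norm(matrix_power(Pt, n+1) - matrix_power(Pt, n), 'fro') for n in range(1, 20)]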
[3]: [0.1402709019216955,
0.023378483653615948,
0.0038964139422693537,
0.0006494023237115095,
0.00010823372061852081,
1.80389534364696e-05,
3.0064922394683963e-06,
5.010820398912459e-07,
8.351367329688623e-08,
1.3918945514670359e-08,
2.3198243477163972e-09,
3.86637279525466e-10,
6.44396235375308e-11,
1.073992260029489e-11,
1.789976112476022e-12,
2.9830188647232445e-13,
4.977570743895776e-14,
8.26743511340278e-15,
1.343782689223772e-15]
This indicates convergence. Let’s check that the convergence actually occurs to a stationary
distribution s that is an eigenvector of Pt .
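(The cells producing the next two outputs are not reproduced; a sketch of the computation, using numpy's eig, is below: the eigenvalues of Pt and the eigenvector for eigenvalue 1.)

ew, ev = np.linalg.eig(PA.T)
v = ev[:, abs(ew - 1) < 1e-12].real     # eigenvector for eigenvalue 1
v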
[4]: array([0.16666667, 1. , 0. ])
[[0.68599434]
[0.51449576]
[0.51449576]]
[6]: sA = v / v.sum()
sA
[6]: array([[0.4],
[0.3],
[0.3]])
This is the stationary distribution s for this example. To see that the same vector is obtained
as the limit of ( Pt )n , we simply raise the matrix to a large power and examine the result:
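(One way to do that, sketched here, is numpy's matrix_power:)

np.linalg.matrix_power(PA.T, 1000)      # columns approach the stationary distribution sA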
The values of s show that the random walker W will, in the limit, be found in state V0 at a
higher probability (0.4) than the other two states (0.3).
Example B
[8]: PB = np.array([[0, 1/3, 1/3, 1/3],
[0.9, 0, 0, 0.1],
[0.9, 0.1, 0, 0],
[0.9, 0, 0.1, 0]])
[[0.47368421]
[0.1754386 ]
[0.1754386 ]
[0.1754386 ]]
In this example, there is convergence of the powers to the stationary distribution, but it is
slower than Example A. We find this out by taking higher powers than before:
[11]: [2.0819420081737047e-14,
1.9224558387957245e-14,
1.6814381771214046e-14,
1.558237665379239e-14,
1.3619335994971806e-14]
Example C
[12]: PC = np.array([[0, 1, 0],
[0, 0, 1],
[1, 0, 0]])
[[0.33333333]
[0.33333333]
[0.33333333]]
In this example, we do not see convergence of the powers ( Pt )n to the above stationary distribution. There seems to be no convergence to anything:
[15]: [2.449489742783178,
2.449489742783178,
2.449489742783178,
2.449489742783178,
2.449489742783178]
[[0 1 0]
[0 0 1]
[1 0 0]]
(Figures: the digraphs of Examples A, B, and C, with vertices Vi and the transition probabilities from PA, PB, and PC marked on the directed edges.)
Summary of the three examples: Note how associating the values of si to vertex Vi pro-
duces something that matches our intuition on where to find the random walker in the
long run. The convergence in Example A is a consequence of Perron’s theorem that we
discuss next.
Graphical illustration If you have never seen Perron’s theorem before, you might be mystified how so many strong statements can be concluded simply from the positivity assumption.
I’d like to give you an idea of the reasoning that leads to these statements, without writing
out a formal proof, through the following simple example of a 2 × 2 positive matrix.
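(The cell defining the 2 × 2 matrix is not reproduced in this excerpt; any positive matrix with eigenvalues 1 and −0.5 reproduces the illustration, for instance this hypothetical choice:)

A = np.array([[0.1, 0.9],
              [0.6, 0.4]])       # a positive matrix whose eigenvalues are 1 and -0.5
np.linalg.eigvals(A)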
[17]: array([-0.5, 1. ])
We see that the dominant eigenvalue is 1 in this case. To get an idea of why An converges,
as claimed in the theorem, see what happens when we multiply A by A in terms of the
first and second columns (A0 and A1 ) of A = [ A0 , A1 ]:
A2 = A[ A0 , A1 ] = [ AA0 , AA1 ]
When A is multiplied by a positive vector the result is a linear combination of the columns
of A with positive combination coefficients. This is best seen using pictures. To this end,
we define a function below that plots the columns of A (as two thick arrows) and the region
in between (a two-dimensional cone) using criss-cross lines.
Using it we see what happens to the cone region under repeated application of A.
[18]: def plotcone(A0, A1, xlim=(-3, 3), ylim=(-3, 3),          # (signature and defaults
                   matlabel='$A$', tt='Cone between columns'):  #  reconstructed)
          t = np.linspace(0, 3, num=100)
          gridline0 = t[:, np.newaxis] * A0
          gridline1 = t[:, np.newaxis] * A1
          fig = plt.figure(); ax = plt.gca()
          for i in range(20):
              ax.plot(gridline0[:, 0], gridline0[:, 1], 'b')
              ax.plot(gridline1[:, 0], gridline1[:, 1], 'r')
              gridline0 += (1/5) * A1
              gridline1 += (1/5) * A0
          ax.set_xlim(xlim); ax.set_ylim(ylim)
          ax.set_title(tt)
          a0 = ax.arrow(0, 0, A0[0], A0[1], width=0.05, color='blue', alpha=0.3)
          a1 = ax.arrow(0, 0, A1[0], A1[1], width=0.05, color='red', alpha=0.3)
          plt.legend((a0, a1), ('First column vector of '+matlabel,
                                'Second column vector of '+matlabel), loc='lower right');
[19]: M = A.copy()
      for i in range(5):   # plot the cone between columns for each matrix power
          A0 = M[:, 0]
          A1 = M[:, 1]
          plotcone(A0, A1, matlabel='$A^'+str(i+1)+'$')
          M = M @ A
As you can see, repeated application of A eventually squeezes the cone region to a linear
region. The vectors in the boundary of the region getting squeezed are the columns of
An as n → ∞, so you have just seen a pictorial illustration of existence of the limit of An ,
and also of the theorem’s claim that in the limit, the columns become multiples of a single
vector. Moreover, the limiting linear region in the figure should remain unaltered under
further applications of A, so it must represent an eigenspace of A. Note also that all of this
happens in the positive quadrant of the plane, so obviously the squeezed in line is the span
of a vector v with positive entries, so this should be the positive eigenvector mentioned in
the theorem.
This squeezing phenomenon happens because A has positive entries. If A had negative
entries, the region between its column vectors need not get so squeezed and can dance
all over the place preventing convergence of its powers. Considering another matrix, also
with dominant eigenvalue 1, but now with a negative entry, you get the picture:
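(The cell defining B is not shown; any matrix with dominant eigenvalue 1 and a negative entry serves the purpose, for example this hypothetical choice, whose eigenvalues are 1 and −0.5:)

B = np.array([[0.9,  0.7],
              [0.2, -0.4]])
plotcone(B[:, 0], B[:, 1], matlabel='$B$')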
B = B @ B
plotcone(B[:, 0], B[:, 1], xlim=(-2.1, 0.5), ylim=(-1.1, 1.1),
         matlabel='$B^2$', tt='Nonconvergence')
Any overlaps between the cones disappear as you take further powers.
This completes our graphical illustration of the connection between positivity of entries,
the convergence of matrix powers, and the resulting capture of a positive eigenvector by
successively squeezing cones. In the next subsection, where we apply Perron’s theorem to
transition matrices, we will be more rigorous, and yet, use nothing more than what you
already know from your linear algebra prerequisites.
Putting these together, we conclude that the dominant eigenvalue satisfies µ ≤ 1. But any transition matrix P always has 1 as an eigenvalue, because the fact that its row sums are one,

∑j=1,...,N pij = 1,

says exactly that P applied to the vector of all ones returns that vector. Hence µ = 1 for positive transition matrices, and Perron’s theorem tells us that the sequence of probability distributions

x, Pt x, ( Pt )2 x, ( Pt )3 x, ...

always converges for Markov chains with pij > 0 and the limit s is independent of the initial distribution x.
The theorem also tells us that the limit of Pn exists, and moreover, the limit matrix must
have columns that are scalar multiples of the eigenvector of the dominant eigenvalue: in
other words, the columns of limn→∞ Pn must be scalar multiples of the vector of ones. On
the other hand, the limit of ( Pt )n = ( Pn )t must have columns that are multiples of the
eigenvector s satisfying Pt s = s, which we previously called the stationary distribution.
Thus, having pinned down the rows and the columns, we make the following conclusion:
• The limit of Pn as n → ∞ takes the following form:

                    ⎡ s1  s2  · · ·  s N ⎤
                    ⎢ s1  s2  · · ·  s N ⎥
   lim n→∞ Pn   =   ⎢  ⋮   ⋮           ⋮ ⎥
                    ⎣ s1  s2  · · ·  s N ⎦
Example A has a positive transition matrix. It provides an instance where all the previous
statements can be verified. For example, limn→∞ Pn is approximated by the matrix below
which reveals the above pattern of stationary distributions in each row.
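(One way to produce such an approximation, sketched here:)

np.linalg.matrix_power(PA, 100)      # a high power of P; compare with the form above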
[21]: array([[0.4, 0.3, 0.3],
[0.4, 0.3, 0.3],
[0.4, 0.3, 0.3]])
[22]: array([[0.4],
[0.3],
[0.3]])
XIV.4 PageRank
We shall now define the PageRank of vertices on an (unweighted) directed graph with N
vertices.
1. First, set aij = 1 if there is an edge from vertex vi to v j in the graph and set aij = 0
otherwise.
2. Next, let mi = ∑k=1,...,N aik .
If mi is 0, then the ith vertex is a dangling node (which has no outgoing edges). Define

wij = aij / mi   if mi > 0,          wij = 1/N   if mi = 0.
These may be thought of as weights on a directed edge from vi to v j if the edge exists
(if not, the weight is zero). The weight wij may also be viewed as providing equal
probabilities to all outgoing edges from vi .
3. Now that we have a weighted directed graph, we may associate it to a Markov chain,
setting transition probabilities to wij , but hold on: if we do so, a random walker W
on the graph can get stuck in vertices with no outgoing edges or in cycles within the
graph. (This is certain to happen on graphs as complex as the internet.) To avoid
this situation, one sets a restart probability 0 < r ≪ 1 with which W is allowed to
jump from one vertex to any other vertex in the graph. (Page called 1 − r the damping
factor.)
4. Finally, set the Markov chain transition probabilities by pij = r/N + (1 − r ) wij .
The PageRank of a vertex is defined to be the value of the stationary probability distribution at that
vertex obtained using the above pij as the transition probabilities.
Note that the transition matrix ( pij ) defined above is a positive matrix. Hence, due to
Perron’s theorem, and our prior discussion on its application to stochastic matrices, the
limit of the probability distributions
x, Pt x, ( Pt )2 x, ( Pt )3 x, ...
exists, is equal to the stationary probability distribution, which we have just decided to
call PageRank. In particular, PageRank is independent of the starting distribution x of the
random walk. Furthermore, we may also arrive at the interpretation of the PageRank of a
graph vertex as the limiting probability that a relentless random walker visits that vertex.
Here is a simple implementation for small graphs using the same notation (aij , wij , pij ) as
above. (Think about what would need to change for a giant graph like the internet. We’ll
consider big graphs in later exercises.)
[23]: def pagerank(a, r):      # (wrapper reconstructed) a: adjacency matrix, r: restart probability
          m = a.sum(axis=1)
          dangling = (m == 0)              # vertices with no outgoing edges
          m[dangling] = 1
          w = (1 / m[:, np.newaxis]) * a
          w[dangling, :] = 1 / a.shape[0]
          p = (1-r) * w + (r / a.shape[0])
          ew, ev = eig(p.T)
          s = ev[:, abs(ew-1) < 1e-15].real
          return s / s.sum()
Example D
[24]: #               0  1  2  3  4  5  6  7  8     (adjacency matrix of the above graph)
A = np.array([[0, 1, 0, 0, 1, 0, 0, 0, 0], # 0
[0, 0, 0, 0, 1, 0, 0, 0, 0], # 1
[0, 0, 0, 0, 1, 0, 0, 0, 0], # 2
[0, 0, 0, 0, 1, 0, 0, 0, 0], # 3
[0, 0, 0, 0, 0, 0, 1, 0, 0], # 4
[0, 0, 0, 0, 1, 0, 0, 0, 0], # 5
[0, 0, 0, 0, 0, 1, 0, 0, 0], # 6
[0, 0, 0, 0, 0, 1, 0, 0, 0], # 7
[0, 0, 0, 0, 0, 1, 0, 0, 0]]) # 8
[25]: array([[0.01111111],
[0.01611111],
[0.01111111],
[0.01111111],
[0.32328823],
[0.30297458],
[0.30207052],
[0.01111111],
[0.01111111]])
Notice from the output that V4 is ranked highest. A vertex to which many other vertices point usually gets a higher PageRank. Notice also how a vertex to which a highly ranked vertex points inherits a high PageRank: this is the case with vertex V6 . Vertex V5 is also highly ranked because it has its own cluster of vertices (V6 , V7 , V8 ) pointing to it, which includes the highly ranked vertex V6 .
It is instructive to look at how PageRank changes as the restart probability is decreased to
0:
[26]: pagerank(A, 0.01)
[26]: array([[0.00111111],
[0.00166111],
[0.00111111],
[0.00111111],
[0.33239996],
[0.33019631],
[0.33018707],
[0.00111111],
[0.00111111]])
[27]: array([[-0. ],
[-0. ],
[-0. ],
[-0. ],
[ 0.33333333],
[ 0.33333333],
[ 0.33333333],
[-0. ],
[-0. ]])
This identifies the cycle where the random walker would end up if it were not for the
restart mechanism.
Applied to the giant graph of the internet, the PageRank of a webpage can be interpreted as the steady state probability that a random web surfer, following hyperlinks from page to page (with infinite dedication and with no topical preference), is at that webpage.
When a user types in a search query, a search engine must first be able to mine all the
webpages relevant to the query from its database. (PageRank does not help with this
task.) Then, it must present these pages in some order to user. If the search engine has
already computed a ranking of relative importance of each webpage, then it can present the
results to the user according to that ranking. This is where PageRank helps. It does require
the search engine to solve for a giant eigenvector (with billions of entries) to compute
PageRank on the entire world wide web. The results of this computation (which cannot be
done in real time as the user searches) are stored by the search engine. The stored ranking
is then used to order the results presented to the user. There are reports that Google does
this a couple of times a year (but I don’t know how to verify this).
• The limit lim k→∞ (1/k) ∑n=0,...,k−1 An /µn exists and equals a matrix whose columns are all scalar multiples of v.
Note the main differences in Theorem 3 in comparison to Theorem 2:
• Unlike positive matrices, now there might be more than one eigenvalue whose abso-
lute value is µ.
• Unlike positive matrices, we can no longer assert that the limit of An /µn exists, only that the limit of averages of An /µn exists.
(Figures: the digraphs of Example B and Example C, reproduced from before.)
Reconsider Examples B & C Note that the Markov chains in both Examples B and C
have non-negative transition matrices that are irreducible: the irreducibility is obvious
by looking at the previous figures of the digraphs for each example. Hence the Perron-
Frobenius theorem applies to both. Therefore, in both cases we may conclude that the
stationary distribution s is the limit of the averages
[ x + Pt x + ( Pt )2 x + · · · + ( Pt )k x ] / (k + 1)
as k → ∞, for any starting distribution x. Although ( Pt )n does not converge for Example
C, these averages do. For Example B, we observed convergence for ( Pt )n so, of course, the
averages also converge.
Let me conclude with a few words on nomenclature. In the statistics literature, a Markov
chain is called an ergodic Markov chain if it is possible to arrive at any state from any other
state in one or more steps. In other words, a Markov chain is ergodic if its digraph has
paths made of directed edges from any vertex to any other vertex. This concept is there-
fore equivalent to its transition matrix being irreducible, and indeed, several texts use the
adjective irreducible instead of ergodic when studying such Markov chains. In computer
science and in graph theory, a directed graph whose vertices can be connected by a path
of directed edges is called strongly connected, yet another name for the same concept. We
have seen several such instances of distinct names for essentially the same concept in this
and the previous activity. While it may be a nuisance that the names are not standardized,
it’s not surprising for a concept that emerged as important in different disciplines to get
distinct names attached to it; it may even speak to the universality of the concept.
XV
Supervised learning by regression
Machine learning refers to mathematical and statistical techniques to build models of data. A
program is said to learn from data when it chooses a model or adapts tunable model pa-
rameters to observed data. In broad strokes, machine learning techniques may be divided
as follows:
• Supervised learning: Models that can predict labels based on labeled training data
– Classification: Models that predict labels as two or more discrete categories
– Regression: Models that predict continuous labels
• Unsupervised learning: Models that identify structure in unlabeled data
– Clustering: Models that detect and identify distinct groups in the data
– Dimensionality reduction: Models that identify lower-dimensional structure in
higher-dimensional data
In this activity, we focus on supervised learning. Note the two further subdivisions men-
tioned above within the category of supervised learning, and here are two examples within
each for further orientation:
• Classification example: identify an email as spam or not (discrete label) based on
counts of trigger words.
• Regression example: predict the arrival time (continuous label) of a streetcar at a
station based on past data.
We shall further focus on regression in this activity. Regression addresses an age-old fitting
problem: given a set of data, find a line (or a curve, or a surface, or a hypersurface in higher
dimensions) that approximately fits the data. The equation of the line, in the machine
learning language, is the model of the data that has been “learnt.” The model can then
“predict” the values, i.e., “labels” at points not covered by the original data set. Finding
equations of curves that pass through a given set of points is the problem of interpolation,
which goes at least as far back as Newton (1675). The fitting problem in regression, also
known at least as far back as Gauss (1809), is a relaxed version of the interpolation problem
in that it does not require the curves to pass through the given data, and is generally more
suitable to handle noisy data. These days, when machine learning comes at you with the
brashness of an overachieving new kid on the block, it is not fashionable to view the subject
from the perspective of established mathematical techniques with rich histories. Instead,
it has somehow become more fashionable to view machine learning as some sort of new
AI miracle. Please do not expect any miracles here.
XV.1 Linear Regression
Let’s start from linear regression in a form you have seen previously: given data points ( xi , f i ), i = 0, 1, . . . , N, fit a linear equation f ( x ) = a0 + a1 x to the data in such a way that the error e = ∑i | f ( xi ) − f i |2 is minimized. Since e is a sum of squares, it is called the least-squares error. It is easy to solve this minimization problem. Writing Ydata = [ f 0 , . . . , f N ]t for the vector of data values, Yfit = [ f ( x0 ), . . . , f ( x N )]t for the vector of model values, a = [ a0 , a1 ]t , and X for the ( N + 1) × 2 matrix whose ith row is [1, xi ], we have Yfit = Xa, and the minimizer of the least-squares error is

a = ( X t X )−1 X t Ydata .
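(The cell with the imports and the random generator rng used below is not reproduced; a minimal version, with an arbitrary seed, might be:)

import numpy as np
import matplotlib.pyplot as plt
from numpy.linalg import inv
rng = np.random.default_rng(123)       # hypothetical seed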
[2]: x = 5 * rng.random(20)
f = 3 * x + 5 * rng.random(20)
plt.scatter(x, f); plt.xlabel('x'); plt.ylabel('Continuous labels (f)');
The data has a linear trend, so let’s try to fit a line to it using the linear regression formula we derived above.
[3]: X = np.array([np.ones(len(x)), x]).T
a = inv(X.T @ X) @ X.T @ f # Create the "model"
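(The cell defining the prediction points is not shown; a sketch of it might be:)

x_predict = np.linspace(0, 5, num=100)       # points at which to predict (range assumed)
f_predict = a[0] + a[1] * x_predict          # the model's predicted labels there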
[5]: plt.scatter(x, f)
plt.xlabel('x'); plt.ylabel('Continuous labels (f)');
plt.plot(x_predict, f_predict, 'c');
To have a visual representation of the error that is minimized by this line, we can plot line
segments (the red thick lines below) whose sum of squared lengths is what we minimized:
from matplotlib.collections import LineCollection     # (start of this cell reconstructed)
fp = X @ a                                             # fitted values at the data points
plt.scatter(x, f)
lc = LineCollection([[(x[i], f[i]), (x[i], fp[i])]
                     for i in range(len(x))], color='r', linewidth=4,
                    alpha=0.5)
plt.gca().add_collection(lc)
plt.xlabel('x'); plt.ylabel('Continuous labels (f)');
plt.plot(x_predict, f_predict, 'c');
When we have two features x1 and x2 and fit a plane f ( x1 , x2 ) = a0 + a1 x1 + a2 x2 to the data, exactly the same algebra as before yields the same solution formula a = ( X t X )−1 X t Ydata , now with X having three columns (ones, the x1 values, and the x2 values).
[8]: x1 = 5 * rng.random(100)
x2 = 5 * rng.random(100)
f = 10 - (3*x1 + 2* x2 + 2 * rng.random(100))
[12]: planar_example = {'data': [np.array([x1, x2]).T, f], 'model': a}
The same formula a = ( X t X )−1 X t Ydata also applies when the columns of X are built from nonlinear functions (features) of the data.
Here is an example where we fit a quadratic curve to a simple one-dimensional data set,
i.e., here
f ( x ) = a0 + a1 x + a2 x 2
and the a’s are found by the above formula.
[13]: x = 5 * rng.random(50)
f = 3 * np.exp(x/2) + 2 * rng.random(50)
plt.scatter(x, f); plt.xlabel('x'); plt.ylabel('Continuous labels (f)');
[14]: phi0 = np.ones(len(x))
phi1 = x
phi2 = x**2
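(The remainder of that cell is not reproduced; a sketch of the remaining steps, reusing the formula above, is:)

X = np.array([phi0, phi1, phi2]).T       # feature matrix with columns 1, x, x^2
a = inv(X.T @ X) @ X.T @ f               # the fitted coefficients a0, a1, a2
xc = np.linspace(0, 5, num=100)
plt.scatter(x, f)
plt.plot(xc, a[0] + a[1]*xc + a[2]*xc**2, 'c');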
If we had attempted to fit a straight line through the data, then we would not have gotten
such a close fit. Another way of saying this in the prevalent terminology is that linear
features underfit this data, or that the linear model has high bias. Saving this example also
for later, we continue.
[16]: curve_example = {'data': [x, f], 'model': a, 'type': 'quadratic'}
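(The cells introducing scikit-learn's LinearRegression estimator are not reproduced in this excerpt; a minimal sketch of creating such a model object is:)

from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)      # the "model" whose fit method is used next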
We can now fit data to this model using the fit method. Let’s fit the same data we used in
the first example of this activity.
Clearly we seem to be getting the same result. We can confirm the results are exactly the
same by digging into the solution components within the model object, as seen below.
(Recall that in f ( x ) = a0 + a1 x, the coefficient a0 is called the intercept.)
[21]: linear_example['model']
[24]: planar_example['model']
In scikit-learn, data, be it numbers or other mathematical objects, goes inside a feature matrix. Tidy data in a feature matrix has each variable/feature in a column and each observation/sample in a row.
Curve fitting The point of view taken by scikit-learn for curve fitting is that it is a process
obtained by applying the linear regression formula after applying the above transformer.
Therefore, one can implement it using a pipeline object where this transformer is chained
to a linear regression estimator. Here is how this idea plays out for the curve-fitting exam-
ple we saw previously.
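(The cell creating and fitting that pipeline is not shown; a sketch of it, assuming the quadratic example's data was saved in curve_example as above, is:)

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

x, y = curve_example['data']
xfit = np.linspace(0, 5, num=100)                    # points at which to predict (assumed range)
quadratic_model = make_pipeline(PolynomialFeatures(2), LinearRegression())
quadratic_model.fit(x[:, np.newaxis], y)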
[28]: Pipeline(memory=None,
steps=[('polynomialfeatures',
PolynomialFeatures(degree=2, include_bias=True,
interaction_only=False, order='C')),
('linearregression',
LinearRegression(copy_X=True, fit_intercept=True,
n_jobs=None, normalize=False))],
verbose=False)
[29]: yfit = quadratic_model.predict(xfit[:, np.newaxis])
plt.scatter(x, y)
plt.plot(xfit, yfit);
We can cross-check that the model parameters are exactly the same after fitting by exam-
ining the LinearRegression object in the quadratic model pipeline:
[30]: quadratic_model.named_steps
[31]: quadratic_model.named_steps['linearregression'].intercept_
[31]: 4.963796378670274
[32]: quadratic_model.named_steps['linearregression'].coef_
To conclude, we have built some confidence in scikit-learn’s abilities under the hood. There
is plenty of material online, including [JV-H], on how to use scikit-learn and other machine
learning packages, and on important pitfalls such as overfitting. However, it may be a bit
harder to find out the mathematics behind each software facility: the documentation is
designed for quick users in a rapidly changing field, and therefore understandably does
not get into the math. This may not be comforting to you as students of mathematics,
so my focus here and in the next few lectures is to connect these software tools with the
mathematics you know.
XVI
Unsupervised learning by PCA
Recall from the previous lecture that unsupervised learning refers to machine learning mod-
els that identify structure in unlabeled data. In this activity, we study Principal Compo-
nent Analysis (PCA) which is a commonly used technique in unsupervised learning, often
used for discovering structure in high-dimensional data, and for dimensionality reduction.
In this activity, I will extensively draw upon what you studied in some earlier activities.
In particular, I will try to detail the connections between PCA and SVD, the differences
in the jargon, highlight the distinctions between PCA and regression, and illustrate how
unsupervised machine learning is different from supervised machine learning.
XVI.1 Definitions
• Given a one-dimensional data vector x = [ x1 , x2 , . . . , xm ]t , its mean, or sample mean, is x̄ = (1/m) ∑i=1,...,m xi .
• The first principal component of any centered data X is defined as a unit vector
v1 ∈ Rn that maximizes
∑i=1,...,m ( v1 · Ri )2 , where Ri denotes the ith row of X.
[2]: x = 3 * rng.random(20)
y = x + 0.75* rng.random(20)
fig = plt.figure(); ax = plt.gca()
ax.scatter(x, y, color='b')
ax.scatter(x.mean(), y.mean(), color='r', marker='*', s=150, alpha=0.6); ax.axis('equal');
We put the data in the form of m samples/observations/rows for n variables/features/columns.
Next, we need to center the data. This just means subtracting the mean of each feature/variable. Note that the mean is marked (as the red star) in the figure above.
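(The cell assembling that matrix is not shown; a one-line sketch of it is:)

XX = np.array([x, y]).T       # m x 2 data matrix: rows = samples, columns = features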
[4]: X = XX - XX.mean(axis=0)
For the visual thinker, centering the data just means moving the origin to the mean (the
red star), as illustrated in the next figure.
Now comes the part that’s harder to see, namely the graphical meaning of the maximiza-
tion problem that defines the first principal component. Consider the figure below where
a number of unit vectors are drawn colored.
[6]: fig = plt.figure(); ax = plt.gca()
ax.set_title('The maximization over unit vectors')
theta = np.linspace(0, 2*np.pi, num=100) # draw unit circle
ax.plot(np.cos(theta), np.sin(theta), ':r', alpha=0.3)
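(The rest of that cell is not reproduced; a sketch of how the colored arrows might be drawn, using the function f described next and coloring by its normalized value, is:)

thetas = np.linspace(0, 2*np.pi, num=60)
vs = np.array([np.cos(thetas), np.sin(thetas)]).T        # unit vectors v on the circle
fv = ((X @ vs.T)**2).sum(axis=0)                         # f(v) for each v
for v, c in zip(vs, fv / fv.max()):
    ax.arrow(0, 0, v[0], v[1], width=0.01, color=plt.cm.viridis(c))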
Here, the arrows represent unit vectors v, and they are colored according to the value of
the following function of the vectors v:
f (v) = ∑i=1,...,m ( v · Ri )2 .
From the figure, there is no doubt that the vectors v for which this function takes the largest
values indicate the “dominant” direction of the data points. Once we find the first maximal
vector, then we can restrict to the orthogonal complement of that vector and repeat the
same maximization to compute further principal components. (In two dimensions, this
becomes trivial, so we proceed ignoring further components.)
Statistical literature usually considers the maximization of the function

g(v) = (1/(m − 1)) ∑i=1,...,m ( v · Ri )2
instead of the above f . Of course, the maximizers of f and g are the same. The function g
represents the variance of the data Ri projected onto v, which is the statistical quantity that
the first principal component maximizes.
How do we solve the maximization problem? The answer, whose claim we verify below, is that a maximizer of f is given by the first right singular vector v1 in an SVD of the centered data X.
[7]: u, s, vt = svd(X)
Plotting the first right singular vector as an arrow through the centered data immediately
illustrates the theorem’s claim. We find that the first right singular vector is in one of the
two directions where we expected the maximizer of f , in view of the previous figure.
The second singular vector is of course orthogonal to the one shown. (Recall that the
columns of a unitary matrix are orthonormal.)
You might now be thinking that this figure is beginning to look like the linear regression
figure of the previous lecture, especially if one draws a line through that arrow, and com-
pare it with the regression line. Let me check that thinking right away.
XVI.4 PCA is different from regression
PCA and linear regression are fundamentally different. Note these differences:
• In supervised learning by regression, the data points were expressed as ( xi , f i ) to
indicate that the labels f i were dependent on the data xi .
• In contrast, now the data is viewed as just points on the plane (without any labels) so
we express the same points as ( xi , yi ). We do not start with an assumption that one
data component depends on the other.
• In supervised learning by regression, the task was to predict values of the label f
for new values of x. In PCA, the task is to discover what relationship exists, if any,
between the x and y values.
So, in spite of these philosophical differences between linear regression and PCA, why is
it producing similar-looking pictures in this two-dimensional example?
Actually, the pictures are not quite identical. Let us compute and plot the line obtained
with linear regression applied to the same points, now viewing one of the variables (the
second) as dependent on the other (the first).
def plot_reg(X, ax):
    # (wrapper reconstructed; x = first feature, f = second feature, viewed as dependent)
    x, f = X[:, 0], X[:, 1]
    X1 = np.array([np.ones(X.shape[0]), x]).T
    a = np.linalg.inv(X1.T @ X1) @ X1.T @ f
    x_predict = np.linspace(-2, 2, num=100)
    f_predict = a[0] + a[1] * x_predict
    plotX(X, ax)               # helper from an earlier cell that scatters the centered data
    ax.plot(x_predict, f_predict, 'c');
    fp = X1 @ a
    lc = LineCollection([[(x[i], f[i]), (x[i], fp[i])]
                         for i in range(len(x))],
                        color='r', linewidth=4, alpha=0.5)
    ax.add_collection(lc)
    ax.set_title('The minimization behind regression');

fig = plt.figure(); ax = plt.gca()
plot_reg(X, ax)
Recall that the line in this linear regression is arrived at by minimizing the sum of the
squares of the lengths of the (red) vertical line segments.
In PCA, a different quantity is minimized. Although we defined the principal components
using a maximization, we can transform it to a minimization as follows. Recall from linear
algebra that any vector can be decomposed into its projection along a given vector and a
component in the orthogonal complement. In particular, for the above two dimensional
data, the vector Ri can be decomposed into its projection along v, (v · Ri )v plus the compo-
nent of Ri in the orthogonal complement of v, which using a unit vector v⊥ perpendicular
to v, may be expressed as (v⊥ · Ri )v⊥ , i.e.,
Ri = ( v · Ri ) v + ( v ⊥ · Ri ) v ⊥ .
By the Pythagorean theorem,
∥ R i ∥2 = ( v · R i )2 + ( v ⊥ · R i )2 .
Since the left hand side is fixed by the data, maximizing ∑i ( v · Ri )2 over all unit vectors v is equivalent to minimizing ∑i ( v⊥ · Ri )2 over the corresponding perpendicular unit vectors v⊥ . Thus we arrive at the conclusion that the first principal component v, which maximizes ∑i ( v · Ri )2 , is also the direction for which the perpendicular contributions ∑i ( v⊥ · Ri )2 are minimized.
Below is the graphical illustration of this minimization behind the PCA (left plot). We draw little orange line segments from each data point Ri in the direction v⊥ , with length |v⊥ · Ri |, so that the sum of their squared lengths is the quantity being minimized. Please compare it with the previous figure for linear regression,
also reproduced aside below (right plot).
def plot_pca(X, ax):
    # (wrapper reconstructed; v1 = first right singular vector = first principal component)
    v1 = vt[0, :]
    ax.arrow(0, 0, v1[0], v1[1], width=0.04, color='brown', alpha=0.6)
    Xp = v1[:, np.newaxis] * (X @ v1)
    lc = LineCollection([[(X[i, 0], X[i, 1]), (Xp[0, i], Xp[1, i])]
                         for i in range(X.shape[0])],
                        color='r', linewidth=4, alpha=0.5)
    ax.add_collection(lc)
    plotX(X, ax)               # helper from an earlier cell that scatters the centered data
    ax.set_title('The minimization behind PCA');

fig = plt.figure(figsize=(12, 4))
axl, axr = fig.subplots(1, 2)
plot_pca(X, axl); plot_reg(X, axr)
Clearly the two minimizations are different. The result of the different minimizations hap-
pened to be close for the above example. But this need not happen always. The results can
indeed be quite different, as the quick example below shows.
What distinguishes PCA is not simply this difference in the associated minimization prob-
lems. As you proceed with this lecture, you will understand that the power of PCA lies in
its ability to find patterns in data, i.e., to find feature sets or basis sets in which data can be
efficiently represented.
XVI.5 PCA in scikit-learn
Instead of getting the principal components from first principles using the SVD, as we have
done above, you may just use scikit-learn’s PCA facility to get the same result.
To use it, one constructs a PCA object using some hypothesized n_components which can
be less than the data dimensions m and n. To draw the analogies with the previous
computation, let’s apply PCA to the previous data setting n_components=2 (noting that
min(m, n) = 2 in this example).
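(The cell doing that is not reproduced; a minimal sketch is:)

from sklearn.decomposition import PCA
pca = PCA(n_components=2)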
You can directly give PCA a data set that is not centered. It will do the centering behind
the scenes.
[14]: pca.fit(XX); # fit with raw (uncentered) data
Now, you may ask for the principal components of the data:
[15]: pca.components_
This matches the principal components we computed using the SVD, reproduced below.
[16]: vt
(Note that since principal components are defined only up to a sign, the vectors need only
match up to a sign, in general.)
The following two lines compute PCA (using scikit-learn) and SVD (using scipy). We will use the resulting outputs to establish correspondences between them so we can be fluent in both languages.
[19]: u, s, vt = svd(X)
[20]: norm(pca.singular_values_ - s)
[20]: 0.0
Second, the principal components returned by pca are equal to ± (ith right singular vector)
from the SVD. Let me illustrate this using the above pca and svd outputs. To check that two
vectors are equal except for the sign “±,” we define a function that computes the norms
of the sum and the difference of the vectors and prints them out. Only one of them need be
zero to have a match up to ±.
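(The definition of that function is not reproduced; a minimal sketch of it is:)

from numpy.linalg import norm

def vectors_plus_minus_diff(v, w):
    # print norms of the sum and the difference; one is ~0 when v equals plus or minus w
    print('%3.1f  %3.1f' % (norm(v + w), norm(v - w)))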
Using this function we check if the first seven principal components equal the correspond-
ing singular vector up to ±. Note how one of the printed out norms (either that of the sum
or that of the difference) is zero.
[22]: for i in range(7):
vectors_plus_minus_diff(pca.components_[i, :], vt[i, :])
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
0.0 2.0
0.0 2.0
Third, projections of the original data onto the principal axes can be obtained by transform
(or the fit_transform) method of PCA. Of course, such projections are just V-components
of the rows of the data X, or simply XV: since X = UΣV t and V has orthonormal columns,
these projections are also equal to UΣ. Hence we have the following correspondence:
[23]: # projected data from pca (can also use pca.fit_transform(XX)):
projX = pca.transform(XX)
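(The comparison producing the output below is not reproduced; a sketch of it, checking the first seven columns against UΣ up to sign, is:)

US = u[:, :7] * s[:7]                   # first seven columns of U Sigma
for i in range(7):
    vectors_plus_minus_diff(projX[:, i], US[:, i])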
1134.0 0.0
1084.5 0.0
1009.3 0.0
852.2 0.0
706.7 0.0
0.0 651.6
0.0 610.5
Fourth, to relate to the low-rank approximation using SVD that we studied in the SVD
lecture, recall that an SVD of X can be rewritten using outer products as
X = ∑j=1,...,min(m,n) σj u j v∗j
from which the best rank ℓ approximation to X, denoted by Xℓ , can be extracted simply by
throwing away the later summands:
Xℓ = ∑j=1,...,ℓ σj u j v∗j .
Before showing how this is done in scikit-learn, let us compute Xℓ , say for ℓ = 5, using the
SVD. We implement the above formula, and add the means to compensate for the fact that
the SVD was taken on centered data.
[26]: l = 5
Xl_svd = u[:, :l] @ np.diag(s[:l]) @ vt[:l, :] + XX.mean(axis=0)
There is a corresponding facility in scikit-learn. First, note that we may give the
n_components argument to PCA, which tells PCA how many principal components to
compute.
Now, to get the best rank ℓ approximation from PCA, we use the transform method, which
gives the components of the data projected onto the principal axes (and there are 5 prin-
cipal axes now). Then, we can use the inverse_transform method to lift the projected
components into the original data space of 64 pixels.
[28]: (1797, 5)
The relative difference in norm between Xl_pca and Xl_svd can now be easily verified to
be close to machine precision.
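(One way to verify it, sketched here:)

norm(Xl_pca - Xl_svd) / norm(Xl_svd)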
[30]: 1.0648619605229708e-15
Let’s summarize this correspondence as follows: the best rank ℓ approximation Xℓ of the centered data X satisfies

Xℓ = inverse_transform(transform(XX)) − mean_(XX).
Just in case this inverse_transform lifting into data space still sounds mysterious, perhaps this reverse-engineered formula for it makes it clearer:
inverse_transform(proj) = proj @ pca.components_ + mean_(XX).
This, too, can be immediately verified in our example:
[31]: 0.0
Fifth, consider the explained_variance_ attribute (an array) of the pca object. It contains the variances explained by the principal components (see the covariance matrix
discussion below for more on this terminology). The elements of this array are related to
the singular values σi as follows.
pca.explained_variance_[i] = σ_i² / (m − 1)
[32]: 0.0
[33]: norm(pca.explained_variance_ratio_ - (s**2)/(s**2).sum())
[33]: 6.661408213830422e-17
Covariance matrix To understand the origin of some of the terms used in pca attributes,
recall how the covariance matrix is defined: For centered data, the covariance matrix is
C = XᵗX / (m − 1).
The “explained variances” are the eigenvalues of C. Of course, since X = UΣV t is an SVD
of X, the covariance matrix C may be alternately expressed as
C = V (Σ² / (m − 1)) Vᵗ,
from which we conclude that the ith eigenvalue of C is σi2 /(m − 1), which matches our
observation above.
This observation also tells us that the right singular vectors (the columns of V) are actu-
ally eigenvectors of C, since the above factorization of C is actually a diagonalization of
C. Therefore, one can alternately compute the right singular vectors, aka, principal com-
ponents, as the eigenvectors of the covariance matrix simply using numpy’s or scipy’s eig.
Indeed, for the current example, we can immediately cross check that we get the same
results:
[34]: ew, ev = np.linalg.eig(X.T @ X / (m-1))   # eigenvalues & eigenvectors of C
      ii = ew.argsort()[::-1]
      ew = ew[ii]; ev = ev[:, ii]                # sort by descending order of eigenvalues
[35]: 6.25037049393972e-13
2.0 0.0
2.0 0.0
2.0 0.0
0.0 2.0
2.0 0.0
0.0 2.0
0.0 2.0
What is better, eig or svd? The relationship between PCA/SVD and eigenvectors of the
covariance matrix discussed above raises a natural question. If both give the same vectors
(principal components), which one should be recommended for computations?
Even though both give the same vectors mathematically, it’s better to use SVD (or scikit-
learn’s PCA, which uses SVD) to avoid round-off errors in the formation of X t X that arise
in some circumstances. A classical example is the case of a Läuchli matrix, an N × ( N − 1)
rectangular matrix of the form
X = \begin{bmatrix}
1 & 1 & \cdots & 1 \\
ϵ & 0 & \cdots & 0 \\
0 & ϵ & \ddots & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & ϵ
\end{bmatrix}
with a small number ϵ. The matrix X t X then has 1 + ϵ2 on the diagonal, and ones every-
where else. This matrix is very close to being singular numerically. For example, in the
N = 4 case, the matrix
XᵗX = \begin{bmatrix}
1 + ϵ² & 1 & 1 \\
1 & 1 + ϵ² & 1 \\
1 & 1 & 1 + ϵ²
\end{bmatrix}
has eigenvalues 3 + ϵ2 , ϵ2 , ϵ2 by hand calculation. However, eig is unable to distinguish
the latter from zero.
[37]: N = 4
eps = 1e-8
X = np.diag(eps * np.ones(N), k=-1)
X[0, :] = 1; X = X[:, :(N-1)]; X
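A minimal sketch of the eigenvalue computation under discussion (the exact cell contents are an assumption, consistent with the values printed below):

ew, ev = np.linalg.eig(X.T @ X)   # eigenvalues of XᵗX are exactly 3 + ϵ², ϵ², ϵ²
ew                                # the last two print as roughly zero, possibly tiny negatives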
The last two numbers are so close to machine precision that they are indistinguishable
from 0. Covariance matrices should never have negative eigenvalues, but due to numerical
difficulties, eig may return a small negative value as an eigenvalue. So in particular, if we
attempt to compute the singular values by taking square roots of these eigenvalues, we
might end up taking the square root of a negative number.
[39]: np.sqrt(ew)
[39]: array([1.73205081, 0. , nan])
In contrast, the SVD is able to output the singular values fairly close to the exact ones √(3 + ϵ²), ϵ, ϵ without difficulty.
[40]: u, s, vt = svd(X)
s
We used digits.data previously. The images key gives the images of the handwritten
digits.
There are 1797 images, each of 8 x 8 pixels. The flattened array versions of these images
are in digits.data while the 8 × 8 image versions are in digits.images. Here are the first
few of the 1797 images:
fig, axes = plt.subplots(1, 10, figsize=(10, 2))   # grid layout here is illustrative
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary')
    ax.set(xticks=[], yticks=[])
To apply PCA, we need to put these images into the tidy data format of m sam-
ples/observations/rows × n variables/features/columns. We set
• each pixel to be a feature/variable,
• each image to be a sample/observation.
Actually, this is exactly the form in which the data is stored in digits.data, where each 8 × 8 image is one of 1797 samples of a 64-variable dataset.
[44]: m, n = digits.data.shape
m, n
We construct a PCA object using this data, asking specifically to retain only 10 principal components.
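A sketch of this construction (the variable names are assumptions, chosen to match their use later in this lecture):

from sklearn.decomposition import PCA      # in case it is not already in scope
pca = PCA(n_components=10)
projdgt = pca.fit_transform(digits.data)   # coefficients of each image on the 10 principal axes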
Would you hazard a guess that the 10 principal components are the usual 10 digits?
Well . . . here is what the 10 principal components look like:
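One way to display them is to reshape each 64-component principal axis back into an 8 × 8 image (the layout below is an assumption):

fig, axes = plt.subplots(2, 5, figsize=(8, 3.5))
for i, ax in enumerate(axes.flat):
    ax.imshow(pca.components_[i, :].reshape(8, 8), cmap='binary')
    ax.set(xticks=[], yticks=[])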
Obviously, these outputs don’t look anything like recognizable digits. It is important to
understand in what sense these garbled images represent something “principal” about the
original data set. Proceed on to gain this understanding.
Each row of projdgt contains 10 coefficients, which, when multiplied by the 10 principal components, reveal what's going on. Of course, we must also correct for the previously subtracted mean. The first row of the reconstructed array then yields the following image (left). Compare it with the original first image in the data (right). Of course, this is reminiscent of the
low-rank approximation of a single image that we discussed in the prior SVD lecture; the
difference now is that we are applying the same process to 1797 centered images all at once
(although we are only showing the first one below).
Let’s dig a bit more into this. Writing the SVD of the centered image data array X as
X = ∑_k σ_k u_k v_kᵗ,
we may read off the ith row R_i, which represents the ith image in this dataset, as follows:
[R_i]_j = X_ij = ∑_k σ_k [u_k]_i [v_k]_j.
The pca object above computed the best rank-10 approximation by restricting the above sum to the first 10 summands. This is what was implemented above by the line of code
reconstructed_dgts = pca.inverse_transform(projdgt)
From the previously discussed fourth correspondence’s equivalent form of the
inverse_transform, we note that the above statement may equivalently be written as
reconstructed_dgts = projdgt @ pca.components_ + pca.mean_
where the correction for the previously subtracted mean is explicit. This also makes it abundantly clear that
the statement setting reconstructed_dgts is just an implementation of the above formula
for Ri .
Viewing the ith image/row Ri as a function f of pixels, it is instructive to view the above
formula for Ri as the sum
f(x) = a_0 ϕ_0(x) + a_1 ϕ_1(x) + · · · + a_9 ϕ_9(x)
where
ak = [σk uk ]i , ϕk = vk ,
i.e., the numbers ak = [σk uk ]i represent coefficients in a basis expansion with the basis
images ϕk set by ϕk = vk , and where x represents one of the 64 pixels. In this viewpoint,
what PCA has done is to fit the 10-term formula for f to a data set of 1797 images. While
this is reminiscent of regression, note two important differences:
• PCA found the basis ϕk (while regression needs ϕk as input).
• The coefficients a_k change for each data row (unlike in regression, where they are fixed for the whole dataset).
To summarize, PCA automatically finds an efficient basis (or feature set) to represent the
data. (In contrast, regression needs you to provide a basis ϕk as input in order to output
the best-fit coefficients ak ; see e.g., the curve fitting examples we have seen previously.)
This exemplifies one of the differences between supervised and unsupervised learning.
plt.plot(np.cumsum(PCA(64).fit(digits.data).explained_variance_ratio_))   # full PCA of the digits
plt.xlabel('number of components'); plt.ylabel('cumulative explained variance');
Clearly, with 10 components, we are far away from the cumulative sum of 1. We are much
closer to the point of diminishing returns, retaining about 95% of the variance, if we instead choose, say, 30 components.
In other words, the 64-dimensional data set may effectively be reduced to a 30-dimensional
dataset retaining 95% of the variance. (Per our discussion in the prior SVD lecture, you can,
of course, also convert this statement on variances into a precise measure of the relative er-
ror in the Frobenius norm.) Summarizing, PCA is also useful as a dimensionality reduction
tool.
XVII
Latent Semantic Analysis
June 1, 2020
If this is a proxy for the country’s current state, where we go from here seems critical in
this moment.
[1]: c = {'May31':
          'Two crises convulse a nation: a pandemic and police violence',
      'May30a':
          'Nation’s first astronaut launch to orbit from home soil in nearly a decade',
      'May30b':
          'Death of George Floyd at the hands of police set off protests',
      'May27':
          'SpaceX launch of NASA astronauts is postponed over weather'}
In this corpus, c['May31'] is a document, and c has three more documents. Each doc-
ument is composed of many words, or terms. We need to simplify the complexities of
natural language to be able to compute anything. With my apologies to the writers among
you, we proceed by taking the view that the order of words, declensions and conjugations,
and often-used words like articles and prepositions, are all immaterial. Then we view
concepts as merely associations of the remaining root words, associations marked by their
joint appearances in documents. LSA is only useful under the assumption that words that are close in semantics will occur in similar documents as the corpus of documents becomes large.
Applying the above-mentioned language simplifications to even a small corpus is a lot of
work, if you try to do it from scratch. But thankfully, there are several python modules
that excel in natural language processing (NLP). Below, I will use spaCy, one of the recent
additions to the python NLP tool set. (Please install it and also make sure to install their En-
glish dataset en_core_web_sm, say by python3 -m spacy download en_core_web_sm, be-
fore proceeding.)
The spacy module is able to process sentences, and identify nouns, verbs, direct objects,
and their interrelationships. In the cell below, after processing a sentence, the result is
saved in an SVG figure. The saved image is then displayed as an image in the next cell.
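A sketch of such a cell (the choice of sentence and the file name are assumptions):

import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(c['May27'])                                    # 'SpaceX launch of NASA astronauts ...'
svg = displacy.render(doc, style='dep', jupyter=False)   # returns the SVG markup as a string
with open('sentence_parse.svg', 'w') as f:
    f.write(svg)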
[Figure: dependency parse rendered by displacy, with arcs labeled by relations such as nsubj, dobj, det, and nummod.]
Within a Jupyter notebook, one may also directly render the resulting image (without
needing to save the image into a file) by specifying jupyter=True instead. Here is an
example.
<IPython.core.display.HTML object>
Just in case you are not reading this in a Jupyter notebook and the image does not render
on your reading device, I am reproducing the image that displacy generated:
As you can see from the annotated sentence, the module can even identify some named
entities in the real world: it knows about NASA, but it still does not know about SpaceX!
(We will fix this later in this lecture by adding our own named entity terms.)
We shall use the package’s capabilities for tokenization and lemmatization. Tokenization
is the process of dividing a sentence or a document into words via demarcation charac-
ters like white spaces. Lemmatization is the process of identifying the so-called “lemma”
of a word, allowing us to group together inflected forms of the word into a single item.
Here is the result of tokenization and lemmatization on the above sentence. Note how the
originally found words “astronauts” and “postponed” have changed in the output.
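A sketch of that step, applied to the same sentence (the printed list is only indicative of the kind of output to expect):

[w.lemma_ for w in nlp(c['May27']) if not w.is_stop and w.pos_ != 'PUNCT']
# a list like ['SpaceX', 'launch', 'NASA', 'astronaut', 'postpone', 'weather'] results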
Here we have also removed stop words, a collection of the most common words in a lan-
guage as previously identified and categorized by the NLP program. In the above exam-
ple, the words “of”, “is”, and “over” have been removed. You can view spacy’s collection
of all stop words if you use the following import statement.
from spacy.lang.en.stop_words import STOP_WORDS
d = {}
for j, dok in enumerate(c.keys()):
    tokens = [w.lemma_ for w in nlp(c[dok])
              if not w.is_stop and w.pos_ != 'PUNCT']
    for t in tokens:
        d[t] = d.setdefault(t, [])
        d[t] += [j]
from scipy.sparse import lil_matrix
import pandas as pd

# term-document matrix: rows are terms/words, columns are documents
A = lil_matrix((len(d.keys()), len(c.keys())), dtype=int)
for i, t in enumerate(d.keys()):
    for j in d[t]:
        A[i, j] = 1
Adf = pd.DataFrame(A.toarray(), index=d.keys(), columns=c.keys()); Adf
We might want to have a combination of first and last names treated as a single entity,
but the code is not yet smart enough to do that. We’ll fix that later, after introducing the
idea of LSA. For the moment, note how words have been represented as row vectors and
documents as column vectors. This is enough to understand the basics of LSA, as we see
next.
it.
[8]: import numpy as np
import matplotlib.pyplot as plt
import seaborn; seaborn.set();
from numpy.linalg import norm
from scipy.linalg import svd
[9]: u, s, vt = svd(A.toarray())
Here is the first important step in creating mathematical objects to represent documents.
Using the best rank k approximation, the first k right singular vectors are used to represent
each document as a k vector.
[10]: k = 2 # Limit to rank k
Vt = vt[:k, :]
pd.DataFrame(Vt, columns=c.keys()) # Documents as k-vectors
The second important step is to represent words (or terms) as mathematical objects in
the same space. Unlike documents, the words/terms are represented by the first k left
singular vectors, weighted by the associated singular values. The first five word tokens
are displayed below as vectors.
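A sketch of this construction (the name usp is an assumption, chosen to match its use below):

us = u[:, :k] * s[:k]                    # left singular vectors weighted by the singular values
usp = pd.DataFrame(us, index=d.keys())   # each word/term represented as a k-vector
usp.head()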
[11]: 0 1
crisis -0.269907 0.458490
convulse -0.269907 0.458490
nation -1.099150 0.308952
pandemic -0.269907 0.458490
police -0.378909 1.312628
Many words are mapped to the same point in such a small example. In other words, there
is not enough data in our small corpus to distinguish between such words.
Nonetheless, even in our very small dataset, it is very interesting to see the associations
between words in terms of how different the word vectors are. Ignoring the magnitude
of word vectors, one may measure the difference between two word vectors (both drawn
from the origin) using a device different from the norm. When magnitude is ignored, the
difference between vectors is captured by the angle the word vectors make with each other,
or by the cosine of the angle. Two vectors of the same magnitude are farther apart if the
cosine of their angle is smaller. Remember that it’s very easy to compute the cosine of the
angle between two unit vectors, since it is equal to their dot product.
[12]: astronaut = usp.loc['astronaut', :].to_numpy()
crisis = usp.loc['crisis', :].to_numpy()
police = usp.loc['police', :].to_numpy()
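The two cosines printed below can be computed with dot products; which pair of words produced which of the two values is an assumption here:

print(crisis @ police / (norm(crisis) * norm(police)))         # two words from the same news category
print(crisis @ astronaut / (norm(crisis) * norm(astronaut)))   # words from different categories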
[13]: 0.9686558216875333
[14]: 0.27103529721595343
Let’s dig into this a bit more. In our small example, since words are two-dimensional
vectors, we can plot them to see how they are dispersed in terms of angles measured from
the origin. Below, the origin is marked as a red star, and points representing the terminal
point of word vectors are annotated with the word.
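A sketch of how such a figure can be drawn from the word vectors (styling details are assumptions):

fig, ax = plt.subplots()
ax.plot(0, 0, 'r*', markersize=15)                 # the origin
for word, vec in usp.iterrows():
    ax.plot(vec[0], vec[1], 'o', color='tab:blue')
    ax.annotate(word, (vec[0], vec[1]))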
The first takeaway from this figure is that the angles the word vectors make are clearly in accordance with the previous cosine computation.
The second is more enigmatic. In our small corpus of four sentences, there were two cate-
gories of news, one of violence, and one of exploration. While we as humans can instinc-
tively make that categorization, it is uncanny that some mathematics and a few simple
lines of code can separate the words associated to the two categories into different areas
of a “word space”. The word that appears somewhat in the middle of the two categories
is nation, as it ought to. (The same figure, after a rotation, modification of arrows, and
cleaned up positioning, is what I presented at the beginning of the lecture.) You should
now have an idea of why LSA can be useful when applied to a large corpus with many
more words, documents, and hidden associations (or latent semantics).
[16]: c.update(
{
'May30Launch':
'Go NASA! Go SpaceX! Godspeed, Bob and Doug!',
'NYTimes':
'NASA and SpaceX officials more often than not ' +
'just call the pilots of this historic mission Bob and Doug.',
'May30NASAblog':
'The first stage of the SpaceX rocket has landed ' +
'successfully on the droneship, Of Course I Still Love You.',
'May31NYTimes':
'After a 19 hour trip, NASA astronauts Bob and Doug ' +
'successfully docked their capsule and entered the space station.',
})
Do you see the complexities of dealing with real examples of natural language?
The ocean droneship, controlled by an autonomous robot to help the rocket land, has a
curious name: “Of Course I Still Love You”. Standard tokenization would simply split it
into component words. It would be better to keep it as a single entity. We will do so below
with spacy’s facilities. But, before that, just in case you don’t know, that curious name for
the ship is taken from the novel The Player of Games by Iain M. Banks. Elon Musk gave his
droneship that name in tribute to Banks. Let me add a sentence from Musk and another
from Banks's novel to our text corpus.
[17]: c.update(
{
'2015Musk':
'West Coast droneship under construction will ' +
'be named Of Course I Still Love You',
'IainBanks':
'These friends of yours are ships. ' +
'Yes, both of them. ' +
'What are they called? ' +
'Of Course I Still Love You and Just Read The Instructions. ' +
'They are not warships? ' +
'With names like that?'
})
To deal with text items like the droneship name, we need to use the phrase matching
capabilities of spacy. Three examples of terms to match are added to a TerminologyList
below. Spacy also does some default phrase matching, e.g., it identifies the phrase “nearly
a decade” as a temporal unit. It is up to us whether we want to use the entire phrase as
a token or not. Below, we will modify the tokenization step to keep all phrases as tokens
with _ in place of white space so we can recognize them easily.
from spacy.matcher import PhraseMatcher

terms = ['SpaceX',
         'Of Course I Still Love You',
         'Just Read The Instructions']
patterns = [nlp.make_doc(text) for text in terms]
matcher = PhraseMatcher(nlp.vocab)
matcher.add('TerminologyList', None, *patterns)
Next, we use a slicing feature (called Span) of spacy to capture the matched phrases as
tokens. We also use the ents attribute provided by spacy to add named entities (a real-
world object with a name) to the document object.
from spacy.tokens import Span

def tokensfromdoc(doc):
    d = nlp(doc)
    matches = matcher(d)
    for match_id, start, end in matches:
        term = Span(d, start, end, label='myterms')
        d.ents = list(d.ents) + [term]
    tokens = [w.lemma_ for w in d
              if w.pos_ != 'PRON'      # no pronouns
              and w.pos_ != 'PUNCT'    # no punctuations
              and w.ent_iob_ != 'B'    # not the beginning of a named entity
              and w.ent_iob_ != 'I'    # not inside a named entity
              and not w.is_stop]       # not a stop word
    tokens += [de.string.rstrip().replace(' ', '_') for de in d.ents]
    return tokens
def dictokens(corpora):
    d = {}
    for j, dok in enumerate(corpora.keys()):
        for t in tokensfromdoc(corpora[dok]):
            d[t] = d.setdefault(t, [])
            d[t] += [j]
    return d
The above function dictokens makes a dictionary with lemmatized words as keys and
document numbers as values. This can be used to make the term-document matrix as we
did for the initial example.
def tdmatrix(d, c):
    # build the term-document matrix from the dictionary d, as done earlier
    A = lil_matrix((len(d.keys()), len(c.keys())), dtype=int)
    for i, t in enumerate(d.keys()):
        for j in d[t]:
            A[i, j] = 1
    return A
[21]: d = dictokens(c)
[22]: d = dictokens(c)
A = tdmatrix(d, c)
Adf = pd.DataFrame(A.toarray(), index=d.keys(), columns=c.keys())
This array is now a bit too big to meaningfully display here, but here are a few elements of
one row, which now displays the droneship name as a single token.
[23]: Of_Course_I_Still_Love_You
NYTimes 0
May30NASAblog 1
May31NYTimes 0
2015Musk 1
IainBanks 1
def retrieve(querytokns, W, Vt, c):
    """Return the documents of corpus c, ordered by their relevance to a query
    made of the word (row) indices in querytokns."""
    q = W[querytokns, :].mean(axis=0)
    nrm = norm(q)
    q /= nrm
    idx = np.argsort(Vt.T @ q)[::-1]
    kl = list(c.keys())
    keys = [kl[i] for i in idx]
    docs = [c[k] for k in keys]
    return docs, keys, idx
To use this on our current corpus example, let’s make the word and document vectors first.
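A sketch of that step, assuming the same construction as in the small example above (the value of k is an assumption):

u, s, vt = svd(A.toarray())
k = 5                       # rank of the approximation (this particular value is an assumption)
W = u[:, :k] * s[:k]        # word vectors, one row per term
Vt = vt[:k, :]              # document vectors, one column per document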
Here is an example of a query with two words, astronaut and first, and the first three
matching documents generated by the above strategy.
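A sketch of such a query (how the query indices were actually formed is an assumption; it mirrors the droneship query shown further below):

myquery = np.where((Adf.index == 'astronaut') | (Adf.index == 'first'))[0]
docs, keys, idx = retrieve(myquery, W, Vt, c)
docs[:3]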
[26]: ['Nation’s first astronaut launch to orbit from home soil in nearly a decade',
The first result has both search words, while the other two have one of the two search words.
Below is another example, where somewhat surprisingly, a document without the query
word (but certainly what we would consider a relevant document) is listed within the top
three matches.
[27]: myquery = np.where(Adf.index=='droneship')[0]
docs, keys, idx = retrieve(myquery, W, Vt, c)
docs[:3]
[27]: ['The first stage of the SpaceX rocket has landed successfully on the
      droneship, Of Course I Still Love You.',
      'These friends of yours are ships. Yes, both of them. What are they called?
      Of Course I Still Love You and Just Read The Instructions. They are not
      warships? With names like that?',
      'West Coast droneship under construction will be named Of Course I Still Love
      You']
Let me conclude this introduction to the subject of text analysis and information retrieval
by noting that the concept of mapping words to vectors is finding increasingly significant
uses, such as in automatic translation. I have tried to present ideas in minimal examples,
but you should be aware that there are many extensions in the literature. Some extensions
are easy to see as emerging from computational experience. An example is a generalization
that we will see in an exercise that modifies the term-document matrix to account for the
number of times a term occurs in a document. The resulting matrix will have frequency-
weighted entries, not just 0 and 1 as above. This is built into scikit-learn’s text analysis
facilities, which we shall use in the exercise.
A
Exercises
for any integers i and N. (Solution codes will be ranked in terms of correctness, readability,
and brevity.)
How do you know your answer is correct? When writing code it is important to check for correctness. Elementary mathematics tells us that
∑_{n=1}^{N} n² = (N/6)(N + 1)(2N + 1).
(If you don’t know this, prove it!) So you can easily check that your code gives the correct
answer, at least for i = 2. In fact, even for a general power i, power sums have been studied
very well and expressions connecting them to the Riemann zeta function are well known,
so for this task, there are indeed many sources to double check our code results.
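For instance, a quick check of the i = 2 case might look like this (the name powersum is hypothetical; use whatever name your solution defines):

N = 100
assert powersum(i=2, N=N) == N * (N + 1) * (2 * N + 1) // 6   # powersum is a hypothetical name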
Python has many styling guidelines for writing good code. You may want to peruse PEP 8
at your leisure. And take time to behold an easter egg (one of several) within the language:
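One such easter egg, presumably the one meant here, appears when you run the following:

import this   # prints "The Zen of Python" by Tim Peters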
A.2 Exercise: Graphing functions
This exercise checks that you have learnt the basic usage of numpy and matplotlib.
Task 1: Graph a function of one variable Plot the graph of sin( x ) for x in the interval
[0, 10].
Task 2: Graph a function of two variables Plot the graph of cos(√(x² + y²)) for (x, y) ∈ [−5, 5] × [−5, 5].
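The following discussion refers to a small function twice; for concreteness, here is one possible definition (the exact body is an assumption, and any function that modifies its argument would serve):

def twice(v):
    v *= 2     # hypothetical body: repeats or doubles v in place when v is mutable
    return v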
Consider a scenario where you make an object v and then send it to this function, like in
this example:
[2]: v = [2, 5, 1]
twice(v)
Task: Your task is to determine if v is being copied when you call twice(v) for some v. In
other words, is a deep copy of v being implicitly made by python when you send v as an
argument to twice? Answer this for at least two cases:
• v is a numpy array
• v is a string
Explain your observations.
at 1000001 uniformly spaced points in the interval [−5, 5]. Time it and then plot the func-
tion.
[1]: import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
If there are problems with this (correctness? efficiency? elegance? brevity?), explain them,
and produce a better function. (Please do check for correctness before you check anything
else.)
A.7 Exercise: Differentiation matrix
Task: Make an (n − 1) × n matrix D with the property that when it is applied to a vector
f ∈ Rn we get D f ∈ Rn−1 whose entries are
[D f]_i = f_{i+1} − f_i,    i = 0, 1, . . . , n − 2.
Use the diag facility of numpy to make D fast. Put x j = jh for some positive grid spacing h.
If f j equals f ( x j ) for some differentiable function f , then h−1 D f produces approximations
to the derivative of f , so we shall call h−1 D the differentiation matrix.
Apply the sparse differentiation matrix to the functions described above and note if there
are any performance gains.
D_ij = x_i − x_j
h(P, Q) = max_{p∈P} min_{q∈Q} ‖p − q‖,

where, for any p ∈ ℝ², the notation ‖p‖ denotes the Euclidean norm √(p · p). Using this, the Hausdorff distance is defined by

H(P, Q) = max[ h(P, Q), h(Q, P) ].
A.11 Exercise: Predator-prey model
Suppose the populations of rabbits (denoted by r (t)) and foxes (denoted by x (t)) at time t
in a jungle are modeled by the ODE system
dr/dt = αr − βrx,
dx/dt = δrx − γx,
where α = 1.1, β = 0.4, δ = 0.1, and γ = 0.1.
Task 1: Given initial conditions r (0) = 5 and x (0) = 2, solve for r (t) and x (t) and plot the
solution for 0 ≤ t ≤ 70.
Task 2: The phase plot of the solution consists of points ( x (t), r (t)) for various t values.
Prepare a figure (phase portrait) with phase plots of, say, 10 solutions, one each for randomly
chosen initial values r (0) and x (0) between 1 and 9.
Task 3: This system has two equilibria. Solve for them and mark them in your phase
portrait.
[1]: import numpy as np
from scipy.linalg import svd, qr
While numpy arrays have an implicitly defined integer index used to access the values, the
pandas objects have an explicitly defined index associated with the values, a feature shared
with the python dictionary construct. To begin our pandas exercises, start by converting
dictionaries to pandas objects.
Tasks:
1. Convert d0 to a corresponding pandas object pd0.
A.15 Exercise: Iris flower dataset
The statistical visualization package seaborn comes with the famous Iris Flower Dataset.
Your exercise is to give one-line codes to address the two tasks mentioned after the figure
below.
[1]: import pandas as pd
%matplotlib inline
import seaborn; seaborn.set()
iris = seaborn.load_dataset('iris')
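The figure below was likely produced by a pairplot call along these lines (a sketch; the styling is an assumption):

seaborn.pairplot(iris, hue='species');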
[Figure: seaborn pairplot of the iris dataset, showing pairwise scatter plots and distributions of sepal_length, sepal_width, petal_length, and petal_width, colored by species (setosa, versicolor, virginica).]
Task 1: Make a bar plot of the mean sepal sizes for each species
Task 2: Find the min, max and mean of petal sizes for each species
A.16 Exercise: Stock prices
If you have installed pandas_datareader, please download current stock prices using this:
import pandas_datareader.data as web
s = web.DataReader(['AAPL', 'GOOG', 'TSLA'], data_source='yahoo', start='2020')
Alternately, load from a file where data from until yesterday was saved. Download the file
from D2L and move it to the right place in order for the following cell to work.
[2]: s = pd.read_pickle('../../data_external/stock_prices.pkl')
In either case, you will end up with a data frame s which contains three categories of prices
for three tech stocks.
Tasks:
• Find out if a hierarchical indexing (MultiIndex) is being used in this data.
• Access and plot the closing price of AAPL on all days in the data.
• Print the closing price of all three stocks yesterday.
• Extract a smaller data frame with no MultiIndex containing only TSLA data.
[2]: t = seaborn.load_dataset('titanic')
t.head()
[2]: survived pclass sex age sibsp parch fare embarked class \
0 0 3 male 22.0 1 0 7.2500 S Third
1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
4 0 3 male 35.0 0 0 8.0500 S Third
who adult_male deck embark_town alive alone
0 man True NaN Southampton no False
1 woman False C Cherbourg yes False
2 woman False NaN Southampton yes True
3 woman False C Southampton yes False
4 man True NaN Southampton no True
f(x, t) = ½ (x + t)² + 2 sin(10(x − t)).
The following tasks are to be completed in a .py python file (not in a jupyter notebook).
Task 1: Use matplotlib.animation module to display changes in the plot of f over x with
respect to time t.
Task 2: Use the celluloid module to perform the same task. Which is faster? Which is
more convenient?
Task 3: Add a text labeling the time at each snapshot that the animation is composed of.
[Figure: a Markov chain on the three states S0, S1, S2; the transition probabilities shown in the figure all equal 1.]
Task 1:
What is the probability that you can hit one state from another state (in any number of
steps) in this Markov chain? Answer by looking at the graph and double check that an
eigenvector gives what is expected.
[Figure: a Markov chain on the four states S0, S1, S2, S3; the transition probabilities shown include 0.5, 0.5, and several equal to 1.]
Task 2:
For each question below, guess the answer from the figure, and then verify that your com-
putational method gives the expected answer.
• Starting from B = {3}, what is the probability of hitting A = {0, 1, 2} in any number
of steps?
• Starting from any state in B = {0, 1, 2}, what is the probability of hitting A = {3} in any number of steps?
• Is this an absorbing Markov chain? Verify that the answers come out the same using
the two methods you learnt.
Task 3:
[Figure: a Markov chain on the seven states S0 through S6; the transition probabilities shown include 0.4 and several equal to 1.]
• What are the hitting probabilities of A = {0, 1, 2} from the remaining states?
• What are the hitting probabilities of A = {4, 5, 6} from the remaining states?
• What are the hitting probabilities of A = {3} from the remaining states?
Task 4:
[Figure: a Markov chain on the seven states S0 through S6; the transition probabilities shown include 0.7, 0.2, 0.1, 0.8, and several equal to 1.]
• Starting from 1, what is the probability of hitting 0 in any number of steps?
• Starting from 3, what is the probability of hitting 6 in any number of steps?
Task 3: Compute the pagerank of the ruin state for r = 0.1, N = 1000. How much farther
can you go increasing N in your computing environment, while continuing to use eig?
Task 4: Implement the power method for a positive stochastic matrix P as a python func-
tion, with the inputs indicated below.
def powerP(x, aPt, r=0.1, maxn=1000, tol=1e-10):
    """ Apply power iterations to a positive stochastic matrix P.
    INPUTS:
     - x: initial probability distribution for the power method,
     - aPt: function that returns P.T @ x given x
     - r: restart probability,
     - maxn: apply at most 'maxn' power iterations
     - tol: quit if successive iterations differ by less than 'tol'.
    OUTPUT: Vector of pageranks if converged. """
This function should start with a random initial probability distribution x and compute (Pᵗ)ⁿ x for increasing n until ‖(Pᵗ)ⁿ x − (Pᵗ)ⁿ⁻¹ x‖ becomes smaller than a given input tolerance. To save memory and flops, you should not create the transition matrix P in memory; rather, you should make a function that applies Pᵗ to a vector, and pass that function as the argument aPt to the powerP function.
Apply your function to compute the pagerank of the ruin state for r = 0.1, N = 100000.
[1]: import os
     import urllib.request
     import shutil
     import numpy as np

     if not os.path.exists(f):
         r = urllib.request.urlopen(url)
         fo = open(f, 'wb')
         shutil.copyfileobj(r, fo)
         fo.close()
Task 1: Project the 4000-variable data into its first 3 principal components and view the
projections in a three-dimensional plot.
Task 2: Plot the cumulative explained variance for this dataset. What is the percentage of
variance lost in restricting the data from 4000 to 3 dimensions? How many dimensions are
needed to keep 95% of the variance?
A.26 Exercise: Eigenfaces
In this exercise you will apply PCA to a large library of facial images to extract dominant
patterns across images. The dataset is called Labeled Faces in the Wild, or LFW, (source) and
is popular in computer vision and facial recognition. It is made up of over a thousand 62 x
47 pixel face images from the internet, the first few of which are displayed below.
Task 1: We refer to the principal components of face image datasets as eigenfaces. Display
the first 28 eigenfaces of this dataset. (They will have little resemblance to the first 28
images displayed above.)
Task 2: Let N be the least number of dimensions to which you can reduce the dataset without exceeding 5% relative error in the Frobenius norm. Find N. (This requires you
to combine what you learnt in the SVD lecture on the Frobenius norm of the error in best
low-rank approximation with what you just learnt in the PCA lecture.)
Task 3: Repeat PCA, restricting to N eigenfaces (with N as in Task 2), holding back the last
seven images in the dataset. Compute the representations of these last seven images in
terms of the N eigenfaces. How do they compare visually with the original seven images?
Task 4: Restricting to only images of Ariel Sharon and Hugo Chavez, represent (and plot)
them as points on a three-dimensional space whose axes represent the principal axes 4, 5,
and 6. Do you see the points somewhat clustered in two groups? (The principal directions
0, 1, 2, 3 are excluded in this task since they seem to reflect lighting, shadows, and generic
facial features, so will likely not be useful in delineating individuals.)
[1]: c = { \
'Lincoln1865':
'With malice toward none, with charity for all ...' +
'let us strive on to finish the work we are in ... ' +
'to do all which may achieve and cherish a just and lasting peace, ' +
'among ourselves, and with all nations.',
'TrumpMay26':
'There is NO WAY (ZERO!) that Mail-In Ballots ' +
'will be anything less than substantially fraudulent.',
'Wikipedia':
'In 1998, Oregon became the first state in the US ' +
'to conduct all voting exclusively by mail.',
'FortuneMay26':
'Over the last two decades, about 0.00006% of total ' +
'vote-by-mail votes cast were fraudulent.',
'TheHillApr07':
'Trump voted by mail in the Florida primary.',
'KingJamesBible':
'Wherefore laying aside all malice, and all guile, and ' +
'hypocrisies, and envies, and all evil speakings',
}
versions.)
Task 3: Use LSA to compute three dimensional representations of all documents and
words using your term-document matrix from Task 2. Print out your vector representation
of vote (which will obviously depend on the matrix).
Task 4: Write a function to compute the cosine of the angle between the spans of two word
vectors. Compute the cosine of the angle between malice and vote. Compute the cosine
of the angle between mail and vote.
Task 5: In order to moderate the influence of words that appear very frequently, the TF-IDF matrix is often used instead of the term-document matrix. The term frequency-inverse
document frequency (TF–IDF) matrix weights the word counts by a measure of how often
they appear in the documents, according to a formula found in the scikit-learn user guide.
Compute the TF-IDF matrix for the above corpus using TfidfVectorizer.
Task 6: Recompute the two cosines of Task 4, now using the TF-IDF matrix of Task 5 and
compare.
import pandas as pd
B
Projects
where f, a, b, and eps represent f , a, b and ϵ in the above description, and niters is the
maximal number of iterations, which in this case is the maximal number of subdivisions
of the initial interval you will make (whether or not the eps tolerance is met).
3. Test your bisection code on the function f ( x ) whose roots you know, say f ( x ) =
cos( x ). Once you know your current code is working correctly, proceed to the next
step. If you skip this step, beware that “premature optimization is the root of all
evil,” according to Donald Knuth.
4. Refactor/improve/optimize: When halving the interval, can you reuse a previously
used value of f to make the code more efficient? (This would be important when
the evaluation of f is expensive.) Also have you made sure you have included com-
ments? Does your function have a docstring?
• The data may not be perfectly equispaced. Use the function interp1d from
scipy.interpolate to generate values at equispaced time intervals (after installing
scipy).
• You will see that although CO2 data has a rising trend, the curve is filled with small
oscillations. Use your experience from exercises or any other tools you know to cal-
culate rate estimates of the overall trend.
But, of course, the plot you turn in should be the plot obtained that day. And none of us knows yet how that will look.
B.4 Assignment: World map of COVID-19
Task
Your task is to make a choropleth map visualizing COVID-19 cases worldwide. The coun-
tries of the world should be displayed in the Albers equal area projection. The number of
cases in each country should be indicated by a reasonable color scale (of your choice).
Please submit three files:
1. A .png image file of the choropleth map using the latest available data on the sub-
mission day.
2. A .py python file (not jupyter notebook) which generated your .png image file.
3. A .mp4 movie file containing an animation of the choropleth maps from
01/22/2020 through the latest date of the data and a .py python file that you used
for creating the animation.
Hints
• The last task of making the mp4 movie file is harder than the other two. Begin with
the easier tasks.
• An example of a choropleth map is in the Overview lecture.
• We have already seen the Albers equal area projection in the lecture Visualizing
geospatial data.
• For animation, use your experience from the exercise on animations.
• Use the Johns Hopkins dataset.
An example solution with data until May 7, 2020, can be seen in the output of the next
code cell if you are reading this in a jupyter notebook.
Alternately, the same solution video can be downloaded or visualized at this weblink.
Consider n = 3 first (as in the figure above). Model the process as a Markov chain. The
states consist of all the possible color configurations on the arcs. Answer these questions:
• How many states are there?
• Draw the directed graph of the Markov chain.
• What are the absorbing states?
• Is this an absorbing Markov chain?
Task 2: Now, consider a general n instead of n = 3.
• How many states are there?
• Write a python function to compute the probabilities of eventually hitting the ab-
sorbing states for a general n. (Use vectorized operations as much as possible.)
Task 3: Finally, consider the generalization of the above setting from a closed curve to the
surface of a torus. The toroidal surface is divided into n × n rectangles, each of which has
one of k colors (see the figure below for an example). The process generalizes to selecting
one of these rectangles at random, the chosen rectangle then adopting a color from one of
its 8 neighbors, and then repeating.
Again, model the process as a Markov chain. This Markov chain arises in population
genetics. It shows you a manifestation of the combinatorial explosion (or the curse of dimen-
sionality) that makes computation by the standard technique quickly infeasible.
• How many states are there for a given n and k?
• Is it feasible to extend the python function you wrote in Task 2 to compute the ab-
sorption probabilities for this Markov chain, say for k = 2, n = 10?
• Imagine cutting, unfolding, and stretching the toroidal surface to a square with an
n × n grid of color cells, respecting the boundary identifications inherited from the
torus. Any state of the Markov chain can thus be implemented as a 2D integer
numpy array (each entry taking one of k values, representing the colors). Write a
python function to simulate the process for a general n and k using numpy’s ran-
dom module. You should see colors dispersing, coalescing, migrating, etc., with the
process eventually terminating in an absorbing state.
• Create an animation displaying the sequence of states obtained in one call of your
function, say with k = 4, n = 10, starting from some initial state of your choice. (Be
warned that there are random sequences of states that are too long to fit in memory
even for small n and k, so build in a fail-safe to avoid a computer crash.) Render the
animation either on the toroidal grid or on the equivalent flat n × n grid.
An example of a solution animation can be viewed below:
If the video does not render on your reading device, you may download it from this we-
blink.