2/5/24, 9:43 AM
Chapter 4 Plotting: matplotlib and seaborn
import numpy as np
np.random.seed(10)
Python has many plotting libraries. Here we discuss two of the simplest
ones, matplotlib and seaborn. Matplotlib is in a sense a very basic plotting
library, oriented toward vectors, not datasets (in this sense comparable to base-R
plotting). But it is very widely used, and with some effort it allows one to create
very nice-looking plots. It is also easier to tinker with the lower-level features in
matplotlib than in the more high-level, data-oriented libraries.
Seaborn is such a high-level data oriented plotting library (comparable to
ggplot in R in this sense). It has ready-made functionality to pick variables
from datasets and modify the visual properties of lines and points depending
on other values in data.
We assume you have imported the following modules:
import numpy as np
import pandas as pd
4.1 Matplotlib
Matplotlib is designed to be similar to the plotting functionality in the popular
matrix language MATLAB. It is a library geared toward scientific plotting. In these
notes we are mainly interested in the pyplot module, but matplotlib contains
more functionality, e.g. handling of images. Typically we import pyplot as
plt:
import matplotlib.pyplot as plt
This page is compiled using matplotlib 3.1.2.
4.1.1 Introductory examples
The module provides basic functions such as scatterplot and line plot. Both of
these functions should be called with x and y vectors. Here is a demonstration
of a simple scatterplot:
x = np.random.normal(size=50)
y = np.random.normal(size=50)
_ = plt.scatter(x, y)
_ = plt.show()
https://faculty.washington.edu/otoomet/machinelearning-py/plotting-matplotlib-and-seaborn.html#plotting-matplotlib 2/15
2/5/24, 9:43 AM Chapter 4 Plotting: matplotlib and seaborn | Machine learning in python
This small code demonstrates several functions:
first, we create 50 random dots using numpy
plt.scatter creates a scatterplot (point plot). It takes arguments x and y
for the horizontal and vertical placement of dots
plt.scatter returns an object, so we may want to assign it to a temporary
variable to avoid printing it. We use the variable name _ (just an underscore) for
a value that we do not really want to store but only want to keep from being printed.
Scatterplot automatically computes the suitable axis range.
plt.show makes the plot visible. It may not be necessary, depending on
the environment you use. For instance, when you run a notebook cell, it
will automatically make the plot visible at the end. However, if you want to
make two plots inside of the cell, you still need to call plt.show to tell
matplotlib that you are done with the first plot and now it is time to show it.
Finally, plt.show returns None, so assigning its result to a temporary
variable is not strictly necessary; we do it here only for consistency with the other calls.
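To see what the object returned by plt.scatter actually is, here is a small sketch that keeps it instead of discarding it. The "Agg" backend and the variable names points, kind and offsets are our own choices for illustration, picked so that the code also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(10)
x = np.random.normal(size=50)
y = np.random.normal(size=50)

points = plt.scatter(x, y)      # keep the returned object this time
kind = type(points).__name__    # name of the class of that object
offsets = points.get_offsets()  # (x, y) positions of the dots
print(kind, offsets.shape)
plt.close()                     # in a notebook you would call plt.show() here
```

The returned object describes the collection of dots, which is why an unassigned call prints something in interactive environments.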
Next, here is another simple example of line plot:
x = np.linspace(-5, 5, 100)
y = np.sin(x)
_ = plt.plot(x, y)
_ = plt.show()
Most of the functionality should be clear by now, but here are a few notes:
The first lines create a linear sequence of 100 numbers between -5 and 5,
and compute the sine of these numbers.
Line plots are made using plt.plot, which also takes arguments x and y.
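Several calls to plt.plot draw on the same axes, which is a simple way to compare curves. The sketch below adds a cosine curve to the sine example; the labels, the dashed line style and the "Agg" backend are our own illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-5, 5, 100)
# plt.plot returns a list of line objects; we unpack the single element
sin_line, = plt.plot(x, np.sin(x), label="sin(x)")
cos_line, = plt.plot(x, np.cos(x), linestyle="--", label="cos(x)")
plt.legend()                          # legend built from the labels above
n_lines = len(plt.gca().get_lines())  # lines on the current axes
plt.close()                           # in a notebook you would call plt.show() here
```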
4.1.2 Tuning plots
Matplotlib offers a number of arguments and additional functions to improve
the look of the plots. Below we demonstrate a few:
x = np.random.normal(size=50)
y = np.random.normal(size=50)
_ = plt.scatter(x, y,
                color="red",  # dot color
                edgecolor="black",
                alpha=0.5  # transparency
                )
_ = plt.xlabel("x") # axis labels
_ = plt.ylabel("y")
_ = plt.title("Random dots") # main label
_ = plt.xlim(-5, 5) # axis limits
_ = plt.ylim(-5, 5)
_ = plt.show()
Most of the features demonstrated above are obvious from the code and
comments. However, some explanations are still needed:
The argument color denotes the dot color when specified as a color name, like
“red” or “black”. There is also another way to specify colors, c, described
below.
Alpha denotes transparency, with alpha=0 being completely transparent
(invisible) and alpha=1 being completely opaque (default).
All the additional functions return an object that we store in a temporary
variable in order to avoid printing.
All the additional functions in plt are executed before the actual plot is
drawn on screen. In particular, although we specify the axis limits after
plt.scatter, they still apply to the scatterplot.
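The last point can be checked directly by asking the current axes for their limits before and after the calls to plt.xlim and plt.ylim. This is only a sketch; the variable names and the "Agg" backend are our own assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(10)
x = np.random.normal(size=50)
y = np.random.normal(size=50)

plt.scatter(x, y, color="red", edgecolor="black", alpha=0.5)
auto_xlim = tuple(plt.gca().get_xlim())  # limits chosen automatically from the data
plt.xlim(-5, 5)                          # set after the dots were added ...
plt.ylim(-5, 5)
new_xlim = tuple(plt.gca().get_xlim())   # ... but they still apply to the same axes
new_ylim = tuple(plt.gca().get_ylim())
print(auto_xlim, new_xlim, new_ylim)
plt.close()
```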
Sometimes we want to make the color of the dots depend on another variable.
In this case we can use the argument c instead of color:
x = np.random.normal(size=50)
y = np.random.normal(size=50)
z = np.random.choice([1,2,3], size=50)
_ = plt.scatter(x, y,
c=z # color made of variable "z"
)
_ = plt.show()
Now the dots are of different colors, depending on the value of z. Note that the
values passed to the c argument must be numbers (or valid color specifications);
arbitrary category labels given as strings will not work.
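If the grouping variable is stored as strings, one possible workaround is to convert the labels to integer codes first. The sketch below does this with np.unique; the labels "a", "b", "c" and the variable names are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(10)
x = np.random.normal(size=50)
y = np.random.normal(size=50)
labels = np.random.choice(["a", "b", "c"], size=50)  # hypothetical string categories

# categories: sorted unique labels; codes: position of each label in categories
categories, codes = np.unique(labels, return_inverse=True)
points = plt.scatter(x, y, c=codes)  # numeric codes are accepted by c
plt.close()
```

Indexing categories with codes recovers the original labels, so no information is lost in the conversion.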
4.1.3 Histograms
Histograms are a quick and easy way to get an overview of 1-D data
distributions. They can be plotted using plt.hist. As hist returns the bin
data, one may want to assign the result to a temporary variable to avoid
spurious printing in IPython-based environments (such as notebooks):
x = np.random.normal(size=1000)
_ = plt.hist(x)
_ = plt.show()
Not surprisingly, the histogram of normal random variables looks like, well, a
normal curve.
We may tune the picture somewhat by using the argument bins to specify the
desired number of bins, and make the bins more distinct by specifying
edgecolor:
_ = plt.hist(x, bins=30, edgecolor="w")
_ = plt.show()
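The bin data returned by plt.hist can be useful in its own right. Here is a sketch that unpacks the three returned values; the names counts, edges and patches are our own, and the "Agg" backend is an assumption made so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(10)
x = np.random.normal(size=1000)

# counts: observations per bin; edges: bin boundaries; patches: the drawn bars
counts, edges, patches = plt.hist(x, bins=30, edgecolor="w")
print(len(counts), len(edges), counts.sum())
plt.close()
```

With 30 bins there are 31 boundaries, and the counts add up to the number of observations.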
4.2 Seaborn: data oriented plotting
The seaborn library is designed for plotting data, not vectors of numbers. It is built
on top of matplotlib and has only limited functionality outside of that library.
Hence, in order to achieve the desired results with seaborn, one has to rely on
some matplotlib functionality for adjusting the plots. Seaborn is typically
imported as sns:
import seaborn as sns
Below we use seaborn 0.10.0.
Here is a usage example using a data frame of three random variables:
df = pd.DataFrame({"x": np.random.normal(size=50),
                   "y": np.random.normal(size=50),
                   "z": np.random.choice([1,2,3], size=50)})
_ = sns.scatterplot(x="x", y="y", hue="z", data=df)
_ = plt.show()
Note the similarities and differences compared to matplotlib:
The information is fed to seaborn using arguments x, y, and hue (and
more) that determine the horizontal and vertical location of the dots, and
their color (“hue”).
Here these arguments are not data vectors, as in the case of matplotlib,
but names of data variables. The names are looked up in the data frame specified
with the argument data.
Seaborn automatically provides the axis labels and the legend.
If needed, the plot can be further adjusted with matplotlib functionality.
Here we just use plt.show() to display it.
For some reason, seaborn insists that there should be a legend entry for the z value
“0”, even if no such value exists in the data:
df.z.unique()
## array([2, 3, 1])
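To appreciate what seaborn does automatically, here is a rough sketch of a comparable plot in plain matplotlib: we split the data frame by z ourselves, draw each group separately, and add the labels and legend by hand. This is our own approximation for illustration, not a description of how seaborn works internally.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(10)
df = pd.DataFrame({"x": np.random.normal(size=50),
                   "y": np.random.normal(size=50),
                   "z": np.random.choice([1, 2, 3], size=50)})

# one scatter call per value of z; each call gets the next default color
for value, group in df.groupby("z"):
    plt.scatter(group["x"], group["y"], label=str(value))
plt.xlabel("x")                 # seaborn would add these labels for us
plt.ylabel("y")
legend = plt.legend(title="z")  # legend with one entry per group
legend_labels = [text.get_text() for text in legend.get_texts()]
plt.close()
```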
4.2.1 Different plot types
The plotting functions of seaborn are largely comparable to those of matplotlib,
but the names may differ. Seaborn also offers additional plot types, such as density
plots, and the option to add a regression line to a scatterplot.
4.2.1.1 Scatterplot
The example above already demonstrated scatterplot. We make another
scatterplot here using sea ice extent data, this time demonstrating marker
types (style). The dataset looks like
ice = pd.read_csv("../data/ice-extent.csv.bz2", sep="\t")
ice.head(3)
##    year  month data-type region  extent   area         time
## 0  1978     11   Goddard      N   11.65   9.04  1978.875000
## 1  1978     11   Goddard      S   15.90  11.69  1978.875000
## 2  1978     12   Goddard      N   13.67  10.90  1978.958333
We plot the northern sea ice extent (measured in million km²) for September (the month
of the yearly minimum) and March (the yearly maximum) through the years. We put
both months on the same plot using different markers:
_ = sns.scatterplot(x="time", y="extent", style="month",
                    data=ice[ice.month.isin([3,9]) & (ice.region == "N")])
_ = plt.ylim(0, 17)
_ = plt.show()
The plot shows two sets of dots: circles for March and crosses for September.
Note that seaborn automatically adds a legend with default labels for the marker types. We
also use matplotlib’s plt.ylim to set the limits of the y-axis.
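The expression passed as data selects rows with a logical condition. The sketch below shows the same kind of filter on a tiny made-up table, since the ice data file itself is not reproduced here; the numbers are invented and only the column names mimic the real dataset.

```python
import pandas as pd

# made-up miniature table with some of the column names of the ice data
toy = pd.DataFrame({
    "month":  [3, 3, 9, 9, 12],
    "region": ["N", "S", "N", "S", "N"],
    "extent": [15.0, 4.0, 6.0, 18.0, 13.0],  # invented numbers, not real measurements
})

# element-wise "and" of two conditions; the parentheses around the comparison
# are needed because & binds more tightly than ==
keep = toy.month.isin([3, 9]) & (toy.region == "N")
subset = toy[keep]  # only the rows where keep is True
print(subset)
```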
4.2.1.2 Line plot
Here we replicate the previous example using a line plot:
_ = sns.lineplot(x="time", y="extent", style="month",
                 data=ice[ice.month.isin([3,9]) & (ice.region == "N")])
_ = plt.ylim(0, 17)
_ = plt.show()
Note that the code is exactly the same as in the scatterplot example, except that we
use sns.lineplot instead of sns.scatterplot. As a result the plot is made
of lines, not dots, and the style option controls the line style, not the marker style.
4.2.1.3 Regression line on scatterplot
Seaborn has a handy plot type, sns.regplot, that allows one to add a
regression line to the plot. Here we plot the September ice extent and add a trend
line (regression line) to the plot. We also change the default colors using the
scatter_kws and line_kws arguments:
_ = sns.regplot(x="time", y="extent",
scatter_kws = {"color":"blue", "alpha":0.5, "edgecolor":"
line_kws={"color":"black"},
data=ice[ice.month.isin([9]) & (ice.region == "N")])
_ = plt.show()
Unfortunately, regplot does not accept arguments like style for splitting data
into two groups.
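One possible workaround is to fit and draw a separate straight line for each group with numpy and matplotlib. The sketch below uses np.polyfit for the least-squares fit. The two series are synthetic stand-ins with invented intercepts, slopes and noise, not the real ice data, and the variable names are our own.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(10)
year = np.arange(1979, 2020)
t = year - year[0]  # years since the start, keeps the fit well conditioned

# synthetic series: invented parameters, only to have two groups to plot
series = {
    "March": 15.5 - 0.04 * t + np.random.normal(scale=0.3, size=t.size),
    "September": 7.5 - 0.08 * t + np.random.normal(scale=0.3, size=t.size),
}

slopes = {}
for name, extent in series.items():
    slope, intercept = np.polyfit(t, extent, deg=1)  # least-squares straight line
    slopes[name] = slope
    plt.scatter(year, extent, alpha=0.5, label=name)
    plt.plot(year, intercept + slope * t, color="black")
plt.legend()
plt.close()  # in a notebook you would call plt.show() here
```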
4.2.1.4 Histograms and density plots
Seaborn can do both kernel density plots and histograms using
sns.distplot. By default, the function shows a histogram overlaid with a kernel
density line, but either of these can be turned off. Both plots can be customized
further with additional keywords. Note that in seaborn 0.11 and later distplot is
deprecated in favor of histplot, kdeplot and displot.
_ = sns.distplot(ice[ice.month.isin([9]) & (ice.region == "N")].extent,
bins=10,
kde=False, # no density
hist_kws={"edgecolor":"black"})
_ = plt.show()
_ = sns.distplot(ice[ice.month.isin([9]) & (ice.region == "N")].extent,
hist=False) # no histogram
_ = plt.show()
Note that distplot does not use the data-frame-centric approach. Unlike
regplot or lineplot, it takes its input in vector form.
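To give an idea of what a kernel density line is, here is a hand-rolled sketch in numpy: the density is the average of Gaussian bumps centered at the observations. The bandwidth follows Scott's rule of thumb, which is our own choice for illustration and may differ from what seaborn uses; the sample is random stand-in data rather than the ice extent.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(10)
sample = np.random.normal(size=200)  # stand-in data, not the ice extent

# bandwidth: Scott's rule of thumb for one-dimensional data
h = sample.std(ddof=1) * sample.size ** (-1 / 5)
grid = np.linspace(sample.min() - 3 * h, sample.max() + 3 * h, 400)

# average of Gaussian bumps of width h, one centered at each observation
z = (grid[:, None] - sample[None, :]) / h
density = np.exp(-0.5 * z**2).sum(axis=1) / (sample.size * h * np.sqrt(2 * np.pi))

# histogram on the density scale, overlaid with the density line
plt.hist(sample, bins=10, density=True, edgecolor="black", alpha=0.5)
plt.plot(grid, density)
area = density.sum() * (grid[1] - grid[0])  # numerical integral, should be close to 1
plt.close()
```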