Unit 5 Numpy and Pandas _ in Python
import numpy as np
np.random.seed(10)
Numpy
Numpy is the most popular python library for matrix/vector computations. Due
to python’s popularity, it is also one of the leading libraries for numerical
analysis, and a frequent target for computing benchmarks and optimization.
It is important to keep in mind that numpy is a separate library that is not part of base python. Unlike R, base python is not vectorized, and one has to load numpy (or another vectorized library, such as pandas) in order to use vectorized operations.
(Figure: Numpy logo. Isabela Presedo-Floyd, CC BY-SA)
import numpy as np
np is pretty much the standard abbreviation for numpy and is widely used.
Arrays can be created with np.array . For instance, we can create a 1-D
vector of numbers from 1 to 4 by feeding a list of desired numbers to the
np.array :
a = np.array([1,2,3,4])
print("a:\n", a)
## a:
## [1 2 3 4]
Note that it is printed in brackets like a list, but unlike a list, it does not have
commas separating the components.
A 2-D array (matrix) can be created in a similar way from a list of lists:
b = np.array([[1,2], [3,4]])
print("b:\n", b)
## b:
## [[1 2]
## [3 4]]
The output does not have the best formatting but it is clear enough.
2/5/24, 9:32 AM
a.shape
## (4,)
b.shape
## (2, 2)
One can see that vector a has a single dimension of size 4, and matrix b
has two dimensions, both of size 2 (remember: (4,) is a tuple of length 1!).
One can also reshape arrays, i.e. change their shape into another compatible
shape. This can be achieved with the .reshape() method. .reshape takes one
argument, the new shape (as a tuple) of the array. For instance, we can
reshape the length-4 vector into a 2x2 matrix as
a.reshape((2,2))
## array([[1, 2],
## [3, 4]])
b.reshape((4,))
## array([1, 2, 3, 4])
np.arange creates sequences, quite a bit like range , but the result will be a
numpy vector. If needed, we can reshape the vector into a desired format:
## array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
## array([[0, 1, 2, 3, 4],
## [5, 6, 7, 8, 9]])
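The two printouts above are consistent with calls like the following. This is a minimal sketch; the variable names `seq` and `mat` are chosen here for illustration only.

```python
import numpy as np

seq = np.arange(10)          # integers 0..9 as a 1-D numpy vector
mat = seq.reshape((2, 5))    # the same ten values as 2 rows and 5 columns
print(seq)
print(mat)
```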
np.zeros and np.ones create arrays filled with zeros and ones respectively:
np.zeros((5,))
np.ones((2,4))
Note that column_stack expects all arrays to be passed as a single tuple (or
list).
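As a small illustration of that calling convention, the sketch below combines two vectors into a matrix with `np.column_stack`. The vectors `x` and `y` are made-up example data.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([10, 20, 30])

# The arrays are passed together as ONE tuple, not as separate arguments
cols = np.column_stack((x, y))   # x and y become the columns of a 3x2 matrix
print(cols)
```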
a = np.arange(12).reshape((3,4))
print(a)
## [[ 0 1 2 3]
## [ 4 5 6 7]
## [ 8 9 10 11]]
print(100 + a, "\n")
## [[ 1 2 4 8]
## [ 16 32 64 128]
## [ 256 512 1024 2048]]
## array([[ 2, 4, 6, 8, 10],
## [12, 14, 16, 18, 20],
## [22, 24, 26, 28, 30],
## [32, 34, 36, 38, 40]])
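Arithmetic on numpy arrays is applied element by element. The sketch below shows two such operations on the 3x4 matrix defined above; the names `added` and `powers` are illustrative only.

```python
import numpy as np

a = np.arange(12).reshape((3, 4))
added = 100 + a    # adds 100 to every element
powers = 2 ** a    # raises 2 to the power of every element
print(added)
print(powers)
```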
a > 6
a == 7
As comparison operators are vectorized, one might expect that the other
logical operators, and, or and not, are also vectorized. But this is not the case.
There are vectorized logical operators, but they differ from the base python
version. These are more similar to corresponding operators in R or C, namely
& for logical and, | for logical or, and ~ for logical not:
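A minimal sketch of the three vectorized logical operators follows. Note that each comparison is wrapped in parentheses, because `&` and `|` bind more tightly than `<` and `>`. The variable names are illustrative.

```python
import numpy as np

a = np.arange(12)
both = (a > 3) & (a < 8)      # element-wise "and"
either = (a < 2) | (a > 9)    # element-wise "or"
neither = ~either             # element-wise "not"
print(a[both])
print(a[either])
print(a[neither])
```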
Chapter 3 Numpy and Pandas | Machine learning in python
a = np.arange(12)
print(a[::2]) # every second element
https://faculty.washington.edu/otoomet/machinelearning-py/numpy-and-pandas.html
## [ 0 2 4 6 8 10]
When working with matrices (2-D arrays), we need two indices, separated by a
comma: the first one selects rows and the second one selects columns.
c = np.arange(12).reshape((3,4))
c
## array([[ 0, 1, 2, 3],
## [ 4, 5, 6, 7],
## [ 8, 9, 10, 11]])
## 6
## array([4, 5, 6, 7])
Comma can separate not just two indices but two slices, so we can write
## array([ 2, 6, 10])
## array([[0, 1, 2, 3],
## [4, 5, 6, 7]])
## array([[0, 1, 2],
## [4, 5, 6]])
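The printouts above are consistent with the following index and slice expressions on the matrix `c`. This is a sketch; the names on the left-hand side are chosen here for illustration.

```python
import numpy as np

c = np.arange(12).reshape((3, 4))

elem = c[1, 2]        # single element: 2nd row, 3rd column
row = c[1, :]         # the whole 2nd row
col = c[:, 2]         # the whole 3rd column
top = c[:2, :]        # first two rows, all columns
block = c[:2, :3]     # first two rows, first three columns
print(elem, row, col, top, block, sep="\n")
```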
a = np.array([1,2,7,8])
i = np.array([True, False, True, False])
a[i] # 1, 7
## array([1, 7])
i = a > 5
i
a[i]
## array([7, 8])
a[a > 5]
## array([7, 8])
New users of numpy (and other languages that support logical indexing)
sometimes forget that the logical condition does not have to be related to the
same array that we are attempting to extract. For instance, we can extract all
results for a certain person:
## array([14, 15])
Here index vector is based on the variable name only and is not directly
related to results . However, we use it to extract values from the latter.
Finally, we also can extract rows (or columns) from a 2-D array in a fairly
similar fashion:
## array([[17, 14],
## [20, 18],
## [13, 15]])
results[names == "Darius",:]
## array([[20, 18]])
The result is the second row of the 2-D array results , corresponding to the
name “Darius”.
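A self-contained sketch of this kind of row selection is given below. The `results` matrix matches the printout above, but the names other than "Darius" are hypothetical, as the original definition of `names` is not shown here.

```python
import numpy as np

# "Roxana" and "Atossa" are made-up placeholders; only "Darius" appears in the text
names = np.array(["Roxana", "Darius", "Atossa"])
results = np.array([[17, 14],
                    [20, 18],
                    [13, 15]])

is_darius = names == "Darius"       # logical vector built from names only
selected = results[is_darius, :]    # used to pick rows of a different array
print(selected)
```

Because the selector is a logical vector rather than a single integer, the result stays 2-dimensional even though only one row matches.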
a = np.random.randn(2,3)
a
a[a < 0] = 0
a
## array([[1.3315865 , 0.71527897, 0. ],
## [0. , 0.62133597, 0. ]])
Numpy offers a large set of random number generators. These can be invoked
as np.random.generator(params, size), where generator stands for the name of
the generator. For instance, np.random.choice(N) can be used to create random
integers from 0 to N − 1. size determines the shape of the resulting object.
NB! The argument is size, not shape, although it determines the output
shape!
x = np.random.choice(6, size=5)
x
## array([0, 2, 0, 4, 3])
But maybe we prefer not to label the results as 0..5 but 1..6. So we can just
add one to the result. Here is an example that creates 2-D array of die rolls:
1 + np.random.choice(6, size=(2,4))
## array([[1, 5, 4, 1],
## [4, 3, 2, 1]])
Numpy offers a large set of various random values. Here we list a few more:
## 'ACGTCGGGTGCGACCCGAGT'
## array([[2, 2, 1, 1],
## [2, 2, 1, 2]])
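The calls that produced the two printouts above are not shown in this text. The sketch below gives plausible equivalents (a random string of nucleotide letters and an array of binomial counts), plus one normal-distribution example. The exact generators and parameters are assumptions, so the random values will differ from the printouts.

```python
import numpy as np

np.random.seed(10)

# Random DNA-like string: sample 20 letters with replacement
dna = "".join(np.random.choice(list("ACGT"), size=20))

# Number of heads in 2 fair coin tosses, arranged as a 2x4 array
heads = np.random.binomial(2, 0.5, size=(2, 4))

# Three normally distributed values with mean 0 and standard deviation 1
z = np.random.normal(0, 1, size=3)

print(dna)
print(heads)
print(z)
```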
Exercise 3.5 We can describe a coin toss as Binomial(1, 0.5) where 1 refers
to the fact that we toss a single coin, and 0.5 means it has probability 0.5 to
come heads up. So such random variables are sequences of zeros and ones.
But how can we get a sequence of -1 and 1 instead? Demonstrate it on
computer!
The random numbers are often called pseudorandom as they are not truly
random: they are computed by a well-defined algorithm, so when
feeding the same initial values to the algorithm, one always gets the same
random numbers. However, normally the initial values are taken from certain
hard-to-control parameters outside of the program's control, such as time in
microseconds and the hard disk serial number, so in practice it is impossible to
replicate the same sequence.
However, if you need to replicate your results exactly, you have to set the
initial values explicitly using np.random.seed(value) . This re-initializes the
random number generator to the given initial state:
np.random.seed(1)
np.random.uniform(size=5) # 1st batch of numbers
np.random.seed(1)
np.random.uniform(size=5) # repeat the 1st batch
Numpy offers a set of basic statistical functions, including sum, mean, and
standard deviation std. These can be applied to the array as a whole, or
separately to rows or columns. In the latter case one has to specify the
argument axis : axis=0 means to apply the operation along the rows within
each column (it collapses the rows and returns one value per column), and
axis=1 means to apply the operation along the columns within each row (it
collapses the columns and returns one value per row). Here is an example:
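The effect of the axis argument is easiest to see on a small matrix. The sketch below uses the same 3x4 matrix as the example that follows; the result names are illustrative.

```python
import numpy as np

a = np.arange(12).reshape((3, 4))   # 3 rows, 4 columns

total = a.sum()               # all elements together
per_column = a.sum(axis=0)    # collapses rows: one value per column
per_row = a.sum(axis=1)       # collapses columns: one value per row
row_means = a.mean(axis=1)    # other functions take axis in the same way

print(total)
print(per_column)
print(per_row)
print(row_means)
```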
a = np.arange(12).reshape((3,4))
a # 3 rows, 4 columns
## array([[ 0, 1, 2, 3],
## [ 4, 5, 6, 7],
## [ 8, 9, 10, 11]])
## 66
np.sum(a)
## nan
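A nan result like the one above arises when the array contains a missing value (np.nan); the array definition that led to it is not shown in this text. The sketch below uses a made-up array `b` to show the behavior, together with numpy's nan-aware variants that skip missing values.

```python
import numpy as np

b = np.array([1.0, np.nan, 3.0])   # hypothetical data with one missing value

total = np.sum(b)          # a single nan makes the whole sum nan
total_ok = np.nansum(b)    # ignores missing values
mean_ok = np.nanmean(b)    # mean of the non-missing values

print(total, total_ok, mean_ok)
```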
Pandas
Pandas contains two central data types: Series and DataFrame. Series is
often used as a second-class citizen, just as a single variable (column) in data
frame. But it can also be used as a vectorized dict that links keys (indices) to
values. DataFrame is broadly similar to other dataframes as implemented in R
or spark. When you extract its individual columns and rows you normally get
those in the form of Series. So it is extremely useful to know the basics of
Series when working with data frames. Both DataFrame and Series include
index, a glorified row name, which is very useful for extracting information
based on names, or for merging different variables into a data frame (See
Section Concatenating data with pd.concat ).
3.2.1 Series
import pandas as pd
s = pd.Series([1,2,5,6])
s
## 0 1
## 1 2
## 2 5
## 3 6
## dtype: int64
Series is printed in two columns. The first one is the index, the second one is
the value. In this example, index is essentially just the row number and it is not
very useful. This is because we did not provide any specific index and hence
pandas picked just the row number. Underneath the two columns, you can
also see the data type; in this case it is a 64-bit integer, the default data type for
integers in pandas.
## ca 38
## tx 26
## ny 19
## fl 19
## dtype: int64
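A series like the one printed above can be created by passing the labels through the index argument. This is a minimal sketch that reproduces the values and labels from the printout.

```python
import pandas as pd

# Values and labels copied from the printout; index supplies the row labels
pop = pd.Series([38, 26, 19, 19], index=["ca", "tx", "ny", "fl"])
print(pop)
print(pop["fl"])   # look up a value by its label
```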
Now the index is helpful: we are looking at state populations, and the index tells
us which state is in which row. Another advantage of having an index is that
even when we filter and manipulate the series, its index will still retain the
original row labels. So we know that index “fl” will always correspond to Florida.
But if we have removed a few cases, or re-ordered the series, then Florida
may not be in the fourth position any more.
Exercise 3.6 Create a series of 4 capital cities where the index is the name of
corresponding country.
pop.values
pop.index
Note that the values are returned as a numpy array, and the index is a special index object.
If desired, the index can be converted to a list:
list(pop.index)
pop > 20
## ca True
## tx True
## ny False
## fl False
## dtype: bool
The result will be another series, here of logical values, as indicated by the
“bool” data type.
3.2.2 DataFrame
A DataFrame can be created manually from a dict of lists (or series). The keys of
the dict are the variable names and the values are the variable values, normally
lists or series. As an example, let’s create a data frame with three
variables, ca, tx and md, and three rows:
df = {'ca': [35, 37, 38], 'tx': [23, 24, 26], 'md': [5,5,6]}
pop = pd.DataFrame(df)
print('population:\n', pop, '\n')
## population:
## ca tx md
## 0 35 23 5
## 1 37 24 5
## 2 38 26 6
The data frame is printed as four columns. Exactly as in case of series, the
first column is index. In the example above we did not specify the index and
hence pandas picked just row numbers. But we can provide an explicit index,
## population:
## ca tx md
## 2010 35 23 5
## 2012 37 24 5
## 2014 38 26 6
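The printout with years as row labels is consistent with passing an index argument to pd.DataFrame, as in this minimal sketch.

```python
import pandas as pd

df = {'ca': [35, 37, 38], 'tx': [23, 24, 26], 'md': [5, 5, 6]}
# index supplies one row label per row, here the years from the printout
pop = pd.DataFrame(df, index=[2010, 2012, 2014])
print('population:\n', pop, '\n')
```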
Creating data frames manually is useful for testing and debugging, but in real
applications we typically read data from disk. This can be done with
pd.read_csv , which takes the file name as the first argument and also supports
many other options. In the example below, we read data about G.W. Bush's
approval rate in fall 2001. pd.read_csv assumes files are comma-separated
by default, but as this example file is tab-separated we have to declare it using
sep="\t" as an extra argument. We also read only the first 10 rows for
demonstration:
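The file name used for this example is not shown in this text, so the sketch below reads from an in-memory stand-in instead of a file on disk. The two data rows are copied from the printouts further below; with a real file, the first argument would be the file path.

```python
import io
import pandas as pd

# In-memory stand-in for a tab-separated file (assumption: same column layout)
raw = ("date\tapprove\tdisapprove\tdontknow\n"
       "2001 Dec 14-16\t86\t11\t3\n"
       "2001 Dec 6-9\t86\t10\t4\n")

# sep="\t" declares the tab separator, nrows limits the number of rows read
approval = pd.read_csv(io.StringIO(raw), sep="\t", nrows=10)
print(approval.shape)
print(approval)
```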
Exercise 3.8 In the example above: how many columns are printed? How
many variables does the dataframe contain?
What happens if we use a wrong separator? This can be easily checked with
printing the number of columns, and printing a few lines of data. Here is an
example:
## (31, 1)
a.head(2)
## date\tapprove\tdisapprove\tdontknow
## 0 2001 Dec 14-16\t86\t11\t3
## 1 2001 Dec 6-9\t86\t10\t4
Two problems are immediately visible: first, the file contains a single column
only (because pandas does not consider tab symbols as separators), and second, the two
lines we printed look weird. If you ask for variable names, you can also see
that all variable names are combined together into a single weird name:
a.columns
## Index(['date\tapprove\tdisapprove\tdontknow'], dtype='object')
The tab markers \t in printout give strong hints that the correct separator is
tab.
It may initially be quite confusing to understand how to specify the file name. If
you load data in a jupyter notebook, then the working directory is normally the
same directory where the notebook is located (see footnote 3). The notebook also lets you
complete file names with the TAB key. But in any case, the working directory can
be found with os.getcwd (get current working directory):
import os
os.getcwd()
## '/home/siim/tyyq/lecturenotes/machinelearning-py'
This helps to specify the relative path if your data file is not located in the
same place as your code. You can also find which files python finds in a
given folder, e.g. in ../data/ :
files = os.listdir("../data/")
files[:5]
As we see, this function returns a list of file names it finds in the given
location.
Indexing refers to selecting data from data frames and series based on
variable names, logical conditions, and position. It is a complex task with many
different methods, and unfortunately also with many caveats. Below, the topic
is split into several subsections:
Modifying data frames: there are slight differences when modifying data
instead of extracting, these are discussed here.
approval.head(4)
To begin with, data frames have variable names. We can extract a single
variable either with ["varname"] or a shorthand as attribute .varname (note:
replace varname with the name of the relevant variable):
## 0 86
## 1 86
## 2 87
## 3 87
## 4 87
## 5 88
## 6 89
## 7 87
## 8 90
## 9 86
## Name: approve, dtype: int64
## 0 86
## 1 86
## 2 87
## 3 87
## 4 87
## 5 88
## 6 89
## 7 87
## 8 90
## 9 86
## Name: approve, dtype: int64
## approve
## 0 86
## 1 86
## 2 87
## 3 87
## 4 87
## 5 88
## 6 89
## 7 87
## 8 90
## 9 86
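The three printouts above are consistent with the three ways of extracting the approve column sketched below. The small `approval` frame is a stand-in built from the first rows of the printed data, and the result names are illustrative.

```python
import pandas as pd

# Stand-in for the approval data: first three rows as printed in the text
approval = pd.DataFrame({
    "date": ["2001 Dec 14-16", "2001 Dec 6-9", "2001 Nov 26-27"],
    "approve": [86, 86, 87],
})

as_series = approval["approve"]     # brackets with a name: returns a Series
same_series = approval.approve      # attribute shorthand: the same Series
as_frame = approval[["approve"]]    # a list of names: returns a DataFrame

var = "approve"                     # indirect name works with brackets only
indirect = approval[var]

print(as_series, as_frame, sep="\n")
```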
The attribute shorthand is usually the easier way, but it does not work if you
need to use indirect variable name (variable name that is stored in another
variable) or if the variable name contains spaces or other special characters. It
also does not work for creating new variables in the data frame. See more in
Section 3.3.5.
## date approve
## 0 2001 Dec 14-16 86
## 1 2001 Dec 6-9 86
## 2 2001 Nov 26-27 87
## 3 2001 Nov 8-11 87
## 4 2001 Nov 2-4 87
## 5 2001 Oct 19-21 88
## 6 2001 Oct 11-14 89
## 7 2001 Oct 5-6 87
## 8 2001 Sep 21-22 90
## 9 2001 Sep 14-15 86
Filtering refers to extracting only a subset of rows from the dataframe based
on certain conditions. The conditions are logical operations that can be either
true or false, depending on the values in each row. Filtering produces a
sub-dataframe where only those observations that meet the selection criteria are
present. Here is an example:
Obviously we can use more complex selection conditions, for instance we can
look for very low or very high approval rates as follows:
Note that we are using the vectorized “or” operator | , not the base python
or . We also need to wrap both the “less than” and “greater than” parts in
parentheses.
See more in Section 3.1.4.
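The filtering code itself is not shown in this text, so here is a minimal sketch of both a simple and a combined condition. The data are the ten date and approve values printed earlier; the thresholds 89, 87 and the result names are chosen here for illustration only.

```python
import pandas as pd

approval = pd.DataFrame({
    "date": ["2001 Dec 14-16", "2001 Dec 6-9", "2001 Nov 26-27",
             "2001 Nov 8-11", "2001 Nov 2-4", "2001 Oct 19-21",
             "2001 Oct 11-14", "2001 Oct 5-6", "2001 Sep 21-22",
             "2001 Sep 14-15"],
    "approve": [86, 86, 87, 87, 87, 88, 89, 87, 90, 86],
})

# Simple condition: keep rows where approval is above 89
very_high = approval[approval.approve > 89]

# Combined condition: vectorized "or", each comparison in parentheses
extreme = approval[(approval.approve < 87) | (approval.approve > 89)]

print(very_high)
print(extreme)
```

Note that the filtered frames keep the original row labels, so `extreme` has index values 0, 1, 8 and 9 rather than 0 to 3.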
Exercise 3.10 How many polls in the data show the president’s approval rate
at least 88%? At which dates are those polls conducted?
NB! The filtered object is not guaranteed to be an independent new data
frame; pandas may treat it as a view of the original data frame. This may
give you warnings and errors later when you attempt to modify the filtered
data. If you intend to do that, make a deep copy of the data using the .copy
method. See more in Section 3.3.5.
## MY 32.7
## ID 267.7
## KH 15.3
## dtype: float64
We can access a series’ values in two ways: by position, and by index. In order
to access elements by position, we have to use the attribute .iloc[] , where the i in
iloc refers to “integer”. Unlike most other methods, .iloc expects arguments in
brackets. A single number in brackets returns the element as an element
(e.g. a single number); if the brackets contain a list (this looks like double
brackets), it returns a series, potentially containing only a single element. So
in order to extract the 2nd and 3rd element in the population series, we can write:
## 267.7
## ID 267.7
## KH 15.3
## dtype: float64
## 267.7
## ID 267.7
## MY 32.7
## dtype: float64
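The printouts above are consistent with the positional and index-based calls sketched below. The series reproduces the values shown earlier (populations labeled MY, ID, KH); the result names are illustrative.

```python
import pandas as pd

pop = pd.Series([32.7, 267.7, 15.3], index=["MY", "ID", "KH"])

second = pop.iloc[1]               # by position, single number: an element
second_third = pop.iloc[[1, 2]]    # by position, list: a Series
indonesia = pop.loc["ID"]          # by index label: an element
two_labels = pop.loc[["ID", "MY"]] # by list of labels: a Series, in that order

print(second, second_third, indonesia, two_labels, sep="\n")
```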
Exercise 3.11 Use your series of capital cities (see the exercise above).
Extract:
The fact that there are several ways to extract positional data causes a lot of
confusion for beginners. It is not helped by the common habit of not using
indices and just relying on the automatic row-numbers. In this case positional
access by .iloc[] produces exactly the same results as the index access by
.loc[] , and one can conveniently forget about the index and use whatever
feels easier. But sometimes the index changes as a result of certain
operations and that may lead to errors or unexpected results. For instance, we
can create an alternative population series without explicit index:
## 0 NaN
## 1 26.0
## 2 19.0
## 3 13.0
## dtype: float64
In this example, position and index are equivalent and hence it is easy to
forget that .loc[] is index-based access, not positional access! So one may
freely mix both methods (and remember, .loc is not needed):
pop1.loc[2]
## 19.0
pop1.iloc[2]
## 19.0
pop1[2]
## 19.0
## 1 26.0
## 2 19.0
## 3 13.0
## dtype: float64
## 13.0
## 19.0
## 19.0
Additionally, if pop2 for some reason turns into a numpy array, then pop2[2]
is based on position, as arrays do not have an index!
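The difference between position and index can be reproduced with the sketch below. It assumes that pop2 was obtained from pop1 by dropping the missing value (the defining code is not shown in this text); this is consistent with the printed values.

```python
import numpy as np
import pandas as pd

pop1 = pd.Series([np.nan, 26, 19, 13])   # automatic index 0, 1, 2, 3
pop2 = pop1.dropna()                     # assumption: index is now 1, 2, 3

by_position = pop2.iloc[2]     # third element by position
by_label = pop2.loc[2]         # element whose index label is 2
as_array = pop2.to_numpy()[2]  # plain array: position only, no index

print(by_position, by_label, as_array)
```

Here position 2 and label 2 refer to different elements, because removing the first row shifted the positions but left the labels untouched.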
## capital population
## MY Kuala Lumpur 32.7
## ID Jakarta 267.7
## KH Phnom Penh 15.3
## capital population
## KH Phnom Penh 15.3
## 15.3
There is also an index-based extractor .loc[] that accepts one (for rows) or
two (for rows and columns) indices. In case of data frames, the default row
index is just the row number; but the column index is the variable names. So
we can write
## 'Kuala Lumpur'
## population capital
## KH 15.3 Phnom Penh
## ID 267.7 Jakarta
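The data frame printouts in this section are consistent with the calls sketched below. The frame reproduces the printed capital and population table; the result names are illustrative.

```python
import pandas as pd

data = pd.DataFrame(
    {"capital": ["Kuala Lumpur", "Jakarta", "Phnom Penh"],
     "population": [32.7, 267.7, 15.3]},
    index=["MY", "ID", "KH"])

last_row = data.iloc[[2]]            # by position, list: a one-row data frame
kh_pop = data.iloc[2, 1]             # by position: 3rd row, 2nd column
my_capital = data.loc["MY", "capital"]                        # by labels
subset = data.loc[["KH", "ID"], ["population", "capital"]]    # lists of labels

print(last_row, kh_pop, my_capital, subset, sep="\n")
```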
data["capital"]
## MY Kuala Lumpur
## ID Jakarta
## KH Phnom Penh
## Name: capital, dtype: object
data["capital"]["MY"]
## 'Kuala Lumpur'
Finally, remember that 2-D numpy arrays will use similar integer-positional
syntax as .iloc[] , just without .iloc .
In conclusion, it is very important to know what your data type is when using
numpy and pandas. Indexing is all around us when working with data, there
are many somewhat similar ways to extract elements, and which way is
correct depends on the exact data type.
We got a subset of Malaysia and Indonesia. Now let’s add another variable to
these large countries:
## <string>:1: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: http://pandas.pydata.org/pandas-
large
Note the warning: A value is trying to be set on a copy of a slice…. This tells
you that filtering data[data.population > 20] did not create a new data frame
but a view of the existing one in memory, and Pandas is unhappy with the
code modifying just a part of the original data frame.
NB!
Although the result appears correct here, do not rely on this approach! It
may or may not work, depending on the exact memory layout of the
dataset!
Fortunately, the solution is very simple. We need to make an explicit copy with
.copy method before we start any modifications:
An explicit copy is not needed until you start modifying data: you can do
various filtering steps without .copy as long as you make the copy before
modifications.
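A minimal sketch of the copy-then-modify pattern follows. The data frame reproduces the country table from above; the new column name `is_large` is a hypothetical example, as the variable added in the original code is not shown in this text.

```python
import pandas as pd

data = pd.DataFrame(
    {"capital": ["Kuala Lumpur", "Jakarta", "Phnom Penh"],
     "population": [32.7, 267.7, 15.3]},
    index=["MY", "ID", "KH"])

# Filter first, then make an explicit copy before any modification
large = data[data.population > 20].copy()
large["is_large"] = True     # hypothetical new variable, added to the copy only

print(large)
print(data)                  # the original data frame is unchanged
```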
The index that is attached to series and data frames is potentially a useful and
informative tool. But sometimes it is not very useful. For instance, when you
load data from disk, then the index defaults to be the row number, and this is
rarely what we are interested in. In such cases one may want to change the
index. If you want to create a new index then you can just assign it to
df.index . For instance, we can just assign country names as index to our
large.set_index("capital")
This will remove the column “capital” from data frame as its values will be in
index instead. Note that by default, .set_index() returns a new data frame
instead of modifying it in place, so if you want to preserve it, you have to store
it in a new variable. The opposite, converting the index into a column, can be
done with .reset_index() .
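Both directions can be sketched as follows. The small data frame and the names `by_capital` and `restored` are illustrative assumptions.

```python
import pandas as pd

data = pd.DataFrame(
    {"country": ["MY", "ID", "KH"],
     "capital": ["Kuala Lumpur", "Jakarta", "Phnom Penh"],
     "population": [32.7, 267.7, 15.3]})

by_capital = data.set_index("capital")   # returns a new frame; data is unchanged
restored = by_capital.reset_index()      # index becomes the first column again

print(by_capital)
print(restored)
```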
Exercise 3.12 Take the capital-population data frame from
Section 3.3.4.
Ensure that you store and print the final data frame!
Indexing data is complex. Here we repeat and summarize the main methods
we have discussed so far. First create three objects, a numpy matrix, a data
frame, and a series. The first two are 2-dimensional but the last one 1-
dimensional.
M = np.array([[1507, 12478],
[-500, 11034],
[1537, 8443],
[1591, 6810]])
M
## established population
## Mumbai 1507 12478
## Delhi -500 11034
## Bangalore 1537 8443
## Hyderabad 1591 6810
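A data frame like the one printed above can be built from the matrix M by supplying row and column labels. The sketch below is consistent with the printout; the exact call used originally is not shown in this text.

```python
import numpy as np
import pandas as pd

M = np.array([[1507, 12478],
              [-500, 11034],
              [1537, 8443],
              [1591, 6810]])

# index gives the row names, columns gives the variable names
df = pd.DataFrame(M,
                  index=["Mumbai", "Delhi", "Bangalore", "Hyderabad"],
                  columns=["established", "population"])
print(df)
```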
s = pd.Series(M[:,0], index=df.index)
s
## Mumbai 1507
## Delhi -500
## Bangalore 1537
## Hyderabad 1591
## dtype: int64
(This is data about four cities, the year when those were established, and
population in thousands).
Exercise 3.13 Create another numpy matrix and a data frame about cities in a
similar fashion: create a matrix of data, and create a data frame from it using
pd.DataFrame . Specify index (row names) and columns (variable names).
Include at least 3 cities and 3 variables (e.g. population in millions, size in
km², and population density in people per km²).
Hint: you may invent both city names and the figures!
## -500
## array([1537, 8443])
## -500
## established 1537
## population 8443
## Name: Bangalore, dtype: int64
Series: use iloc and brackets (but these are just 1-dimensional):
## -500
## -500
## established 1537
## population 8443
## Name: Bangalore, dtype: int64
s.loc["Delhi"]
## -500
Numpy arrays: use brackets and put a colon : in place of the row indicator:
M[:,0]
Data frames: you can use iloc and brackets, exactly as in case of
numpy arrays. You can also use brackets and column names (column
index) without iloc , or dot-column name:
df.iloc[:,0]
## Mumbai 1507
## Delhi -500
## Bangalore 1537
## Hyderabad 1591
## Name: established, dtype: int64
df["established"]
## Mumbai 1507
## Delhi -500
## Bangalore 1537
## Hyderabad 1591
## Name: established, dtype: int64
df.established
## Mumbai 1507
## Delhi -500
## Bangalore 1537
## Hyderabad 1591
## Name: established, dtype: int64
If you want to extract rows and columns in a mixed way, e.g. rows by number and
columns by column names (index), you can use double extraction (two sets of
brackets) and chain your extractions into a single line:
df.iloc[:3,:]["population"]
## Mumbai 12478
## Delhi 11034
## Bangalore 8443
## Name: population, dtype: int64
Exercise 3.14 Take your own city matrix and city data frame. From both of
these extract:
data for the third city. For the data frame do it in two ways: using index,
and using row number!
area of the second city. For the data frame, do it in two ways: using
column name (column index), and column number!
Finally, if asking for a single entry (singleton), pandas simplifies the result into
a lower-ranked object (series instead of data frame, or a number instead of
series). If you want to retain a similar data structure as the original one, wrap
your selector in a list. For instance, here is the previous example changed so that
it returns a data frame:
df.iloc[:3,:][["population"]]
## population
## Mumbai 12478
## Delhi 11034
## Bangalore 8443
All these methods can create rather confusing situations sometimes. For
instance, if we do not specify index, it will be automatically created as row
numbers (but starting from 0, not 1). In that case df.iloc[i] and df.loc[i]
give the same result (assuming i is a list of row numbers). Even worse, if
the index skips some numbers, then df.loc[i] may or may not work, and
even where it works, it may give wrong results! In a similar fashion, M[i,j]
works but df[i,j] does not work, df.loc[i,j] works but M.loc[i,j] does
not work. In order to tell if the syntax is correct it is necessary to know what is
the data structure.
2. There are also operations that are not performed elementwise when using
arrays, in particular the matrix product.
3. If you run your code from the command line, the working directory is the
directory where you run the command, not the directory where the
program is located.