
Unit 5 Numpy and Pandas

import numpy as np
np.random.seed(10)

Base python does not include true vectorized data structures–vectors,

matrices, and data frames. For small things one can use lists, lists of lists, and
list comprehensions. However, such code will be bulky and slow.

This deficiency is addressed by additional libraries, in particular numpy and
pandas. Numpy is the primary way to handle matrices and vectors in python.
Vectors and matrices are the natural way to model either a single variable or
a whole dataset, so the vector/matrix approach is very important when working
with data. Moreover, these objects also model vectors and matrices as
mathematical objects. Matrix computations are extremely important in
statistics and hence also in machine learning.
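
To illustrate the point about lists made above, here is a minimal sketch that adds 100 to every element of a small table, first with a list of lists and then with a numpy array. The variable names are made up for this illustration only.

```python
import numpy as np

# a small 2x3 table as a list of lists
rows = [[1, 2, 3], [4, 5, 6]]

# base python: a nested list comprehension is needed
shifted_list = [[x + 100 for x in row] for row in rows]

# numpy: the same operation is a single vectorized expression
shifted_array = np.array(rows) + 100

print(shifted_list)
print(shifted_array)
```

Both versions contain the same numbers, but the numpy version states the intent directly and scales to large tables without explicit loops.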

Numpy

Numpy is the most popular python library for matrix/vector computations. Due
to python’s popularity, it is also one of the leading libraries for numerical
analysis, and a frequent target for computing benchmarks and optimization.

It is important to keep in mind that numpy is a separate library that is not
part of base python. Unlike R, base python is not vectorized, and one has to
load numpy (or another vectorized library, such as pandas) in order to use
vectorized operations. This also causes certain differences between the base
python approach and the way to do vectorized operations.

[Figure: Numpy logo. Isabela Presedo-Floyd, CC BY-SA 4.0, via Wikimedia Commons.]

3.1.1 Importing numpy

Numpy is typically imported as np :

import numpy as np

np is pretty much the standard alias for numpy and is widely used in
online documentation. Below we assume numpy has been imported as np .

3.1.2 Array: The Fundamental Data Structure in

Numpy

Numpy is fundamentally based on arrays, N-dimensional data structures. Here

we mainly stay with one- and two-dimensional structures (vectors and
matrices) but the arrays can also have higher dimension (called tensors).
Besides arrays, numpy also provides a plethora of functions that operate on
the arrays, including vectorized mathematics and logical operations.
2/5/24, 9:32 AM

Arrays can be created with np.array . For instance, we can create a 1-D
vector of numbers from 1 to 4 by feeding a list of desired numbers to the
np.array :

a = np.array([1,2,3,4])
print("a:\n", a)

## a:
## [1 2 3 4]

Note that it is printed in brackets like a list, but unlike a list, it does not
have commas separating the components.

If we want to create a matrix (two-dimensional array), we can feed np.array

with a list of lists, one sublist for each row of the matrix:

b = np.array([[1,2], [3,4]])
print("b:\n", b)

## b:
## [[1 2]
## [3 4]]

The output does not have the best formatting but it is clear enough.

One of the fundamental properties of an array is its dimension, called shape
in numpy. Shape is the array's size along all of its dimensions. It can be
queried with the attribute .shape , which returns the sizes in the form of a
tuple:


a.shape

## (4,)

b.shape

## (2, 2)

One can see that vector a has a single dimension of size 4, and matrix b
has two dimensions, both of size 2 (remember: (4,) is a tuple of length 1!).

One can also reshape arrays, i.e. change their shape into another compatible
shape. This can be achieved with the .reshape() method, which takes one
argument, the new shape (as a tuple) of the array. For instance, we can
reshape the length-4 vector into a 2x2 matrix as

a.reshape((2,2))

## array([[1, 2],
## [3, 4]])

and we can “straighten” matrix b into a vector with

b.reshape((4,))


## array([1, 2, 3, 4])
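
What "compatible shape" means can be sketched as follows: the new shape must contain exactly as many elements as the original array. The example also uses -1 as a placeholder dimension that numpy computes itself. The array here is a made-up example.

```python
import numpy as np

a = np.arange(6)            # 6 elements: 0, 1, ..., 5

m = a.reshape((2, -1))      # -1: let numpy compute the number of columns
print(m.shape)

try:
    a.reshape((4, 2))       # 4*2 = 8 elements, but a has only 6
except ValueError as e:
    print("cannot reshape:", e)
```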

3.1.3 Creating Arrays

Sometimes it is practical to create arrays manually as we did above, but

usually it is much more important to make those by computation. Below we list
a few options.

np.arange creates sequences, quite a bit like range , but the result will be a
numpy vector. If needed, we can reshape the vector into a desired format:

np.arange(10) # vector of length 10

## array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

np.arange(10).reshape((2,5)) # 2x5 matrix

## array([[0, 1, 2, 3, 4],
## [5, 6, 7, 8, 9]])

np.zeros and np.ones create arrays filled with zeros and ones respectively:

np.zeros((5,))

## array([0., 0., 0., 0., 0.])


np.ones((2,4))

## array([[1., 1., 1., 1.],

## [1., 1., 1., 1.]])

Arrays can be combined in different ways, e.g. np.column_stack combines
them as columns (next to each other), and np.row_stack combines them as
rows (underneath each other; np.vstack does the same, and newer numpy
versions deprecate the row_stack name in its favor). For instance, we can
combine a column of ones and two columns of zeros as follows:

oneCol = np.ones((5,)) # a single vector of ones

zeroCols = np.zeros((5,2)) # two columns of zeros
np.column_stack((oneCol, zeroCols)) # 5x3 matrix

## array([[1., 0., 0.],

## [1., 0., 0.],
## [1., 0., 0.],
## [1., 0., 0.],
## [1., 0., 0.]])

Note that column_stack expects all arrays to be passed as a single tuple (or
list).
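
Stacking rows works in the same way. The sketch below uses np.vstack, which places arrays underneath each other; the array names are made up for this example.

```python
import numpy as np

top = np.ones((1, 3))        # one row of ones
bottom = np.zeros((2, 3))    # two rows of zeros

# like column_stack, vstack expects the arrays as a single tuple (or list)
stacked = np.vstack((top, bottom))   # 3x3 matrix
print(stacked)
```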

Exercise 3.1 Use np.zeros , np.ones , mathematical operations and

concatenation to create the following array:


## array([[-1., -1., -1., -1.],

## [ 0., 0., 0., 0.],
## [ 2., 2., 2., 2.]])

See the solution

3.1.4 Vectorized Functions (Universal

Functions)

It is possible to use loops to do computations with numpy objects in exactly
the same way as when working with lists. However, one should use vectorized
operations instead whenever possible. Vectorized operations are easier to
code, easier to read, and result in faster code.

Numpy offers a plethora of vectorized functions and operators, called
universal functions. Many of these work as expected, for instance the
mathematical operations. We create a matrix, then add 100 to it, and
then raise 2 to the power of its values:

a = np.arange(12).reshape((3,4))
print(a)

## [[ 0 1 2 3]
## [ 4 5 6 7]
## [ 8 9 10 11]]

print(100 + a, "\n")


## [[100 101 102 103]

## [104 105 106 107]
## [108 109 110 111]]

print(2**a, "\n") # remember: exponent with **, not with ^

## [[ 1 2 4 8]
## [ 16 32 64 128]
## [ 256 512 1024 2048]]

Both of these mathematical operations, + and ** , are performed
elementwise for every single element of the matrix.

Exercise 3.2 Create the following array:

## array([[ 2, 4, 6, 8, 10],
## [12, 14, 16, 18, 20],
## [22, 24, 26, 28, 30],
## [32, 34, 36, 38, 40]])

See the solution

Comparison operators are vectorized too:

a > 6


## array([[False, False, False, False],

## [False, False, False, True],
## [ True, True, True, True]])

a == 7

## array([[False, False, False, False],

## [False, False, False, True],
## [False, False, False, False]])

As comparison operators are vectorized, one might expect that the other
logical operators, and, or and not, are also vectorized. But this is not the case.
There are vectorized logical operators, but they differ from the base python
version. These are more similar to corresponding operators in R or C, namely
& for logical and, | for logical or, and ~ for logical not:

(a < 3) | (a > 8) # logical or

## array([[ True, True, True, False],

## [False, False, False, False],
## [False, True, True, True]])

(a > 4) & (a < 7) # logical and

2/5/24, 9:32 AM Chapter 3 Numpy and Pandas | Machine learning in python

## array([[False, False, False, False],

## [False, True, True, False],
## [False, False, False, False]])

~(a > 6) # logical not

## array([[ True, True, True, True],

## [ True, True, True, False],
## [False, False, False, False]])

There is no vectorized multi-way comparison like 1 < x < 2 .
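
A common workaround, sketched here on a made-up array, is to write the two comparisons separately and combine them with & . Writing 1 < x < 2 directly raises a ValueError for arrays with more than one element.

```python
import numpy as np

x = np.array([0.5, 1.5, 2.5, 1.0])

# combine two vectorized comparisons; both must be in parentheses
inside = (x > 1) & (x < 2)
print(inside)
```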

3.1.5 Array Indexing and Slicing

Indexing refers to extracting elements based on their position or certain
criteria. This is one of the fundamental operations with arrays. There are two
ways to extract elements: based on position, and based on logical criteria.
Unfortunately, this also makes indexing somewhat confusing, and it takes
some time to become familiar with.

3.1.5.1 Extracting elements based on position

Array indexing is very similar to list indexing. As matrices have two

dimensions, we need two indices.

a = np.arange(12)
print(a[::2]) # every second element

https://faculty.washington.edu/otoomet/machinelearning-py/numpy-and-pandas.html 10/58

## [ 0 2 4 6 8 10]

However, unlike lists, one can do vectorized assignments in numpy:

a[5:11] = -1 # assign multiple elements
a

## array([ 0, 1, 2, 3, 4, -1, -1, -1, -1, -1, -1, 11])

One can also extract multiple elements from a vector:

a[[4,5,7]] # extract 3 elements in one go

## array([ 4, -1, -1])

When working with matrices (2-D arrays), we need two indices, separated by
a comma:

c = np.arange(12).reshape((3,4))
c

## array([[ 0, 1, 2, 3],
## [ 4, 5, 6, 7],
## [ 8, 9, 10, 11]])

c[1,2] # 2nd row, 3rd column


## 6

c[1] # 2nd row

## array([4, 5, 6, 7])

Comma can separate not just two indices but two slices, so we can write

c[:,2] # all rows, 3rd column

## array([ 2, 6, 10])

c[:2] # 1st, 2nd row

## array([[0, 1, 2, 3],
## [4, 5, 6, 7]])

c[:2, :3] # 1st, 2nd row, first three columns

## array([[0, 1, 2],
## [4, 5, 6]])

Exercise 3.3 Create matrix and access rows and columns


create a 4x5 array of even numbers: 10, 12, 14, …

extract third column

set the fourth row to 1,2,3,4,5

Note: there are many ways to achieve this.

See the solution

3.1.5.2 Boolean indexing

An extremely widely used approach is to extract elements of an array based
on a logical criterion. Fundamentally, it is just using a logical vector for
indexing. The vector must be of the same length as the array in question, and
the result contains only those elements that correspond to True in the
indexing vector. Here is an example of how we can do this manually:

a = np.array([1,2,7,8])
i = np.array([True, False, True, False])
a[i] # 1, 7

## array([1, 7])

It is important you understand what is going on here: arrays a and i will be

“matched”, so each element of a will have its “match” in i . Next, only those
elements of a that are matched with True are extracted, in this case just 1
and 7.


The previous example, manually creating a logical index vector of trues and
falses, is hardly ever useful. Almost always we use logical operations instead.
For instance, we can extract all elements of a that are greater than 5:

i = a > 5
i

## array([False, False, True, True])

a[i]

## array([7, 8])

This is often written in a more compact manner by skipping explicit logical

vector i :

a[a > 5]

## array([7, 8])

New users of numpy (and of other languages that support logical indexing)
sometimes forget that the logical condition does not have to be related to the
same array that we are extracting from. For instance, we can extract all
results for a certain person:


names = np.array(["Cyrus", "Darius", "Xerxes", "Artaxerxes", "Cyrus", "Da

results = np.array([17, 14, 20, 18, 13, 15])
results[names == "Darius"]

## array([14, 15])

Here the index vector is based on the variable names only and is not directly
related to results . However, we use it to extract values from the latter.

Finally, we also can extract rows (or columns) from a 2-D array in a fairly
similar fashion:

names = np.array(["Cyrus", "Darius", "Xerxes"])

results = np.array([[17, 14], [20, 18], [13, 15]])
results

## array([[17, 14],
## [20, 18],
## [13, 15]])

results[names == "Darius",:]

## array([[20, 18]])

The result is the second row of the 2-D array results , corresponding to the
name “Darius”.
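
Columns can be selected in the same manner by putting the logical vector in the second position. The sketch below assumes a hypothetical vector of column labels, tests , that is not part of the example above.

```python
import numpy as np

tests = np.array(["midterm", "final"])          # hypothetical column labels
results = np.array([[17, 14], [20, 18], [13, 15]])

# all rows, only those columns where the mask is True
final = results[:, tests == "final"]
print(final)
```

Note that the result is still a 2-D array (here with a single column), exactly as the row extraction above returned a 2-D array with a single row.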


Logical indexing can also be used on the left-hand-side of the expression, in

order to replace elements. Below is an example where we replace all the
negative elements of a with zero.

a = np.random.randn(2,3)
a

## array([[ 1.3315865 , 0.71527897, -1.54540029],

## [-0.00838385, 0.62133597, -0.72008556]])

a[a < 0] = 0
a

## array([[1.3315865 , 0.71527897, 0. ],
## [0. , 0.62133597, 0. ]])

When replacing elements in this fashion, we need to supply either a
replacement value of length 1 (all elements are replaced by 0 in the example
above), or a replacement vector of the correct length, one value for each
element that is replaced. For instance, we can replace the positive numbers
left in a with 1, 2, 3:

a[a > 0] = np.array([1, 2, 3])
a

## array([[1., 2., 0.],

## [0., 3., 0.]])
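
If the replacement vector has a wrong length, numpy refuses the assignment. Here is a small sketch with a made-up matrix; counting the True values of the mask first tells how many replacement values are needed.

```python
import numpy as np

a = np.array([[1.5, 0.0, 2.5],
              [0.0, 3.5, 0.0]])

n = (a > 0).sum()          # number of elements that would be replaced
print(n)

try:
    a[a > 0] = np.array([1, 2])   # two values for three positions
except ValueError as e:
    print("replacement failed:", e)
```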


Exercise 3.4 consider two vectors

names = np.array(["Roxana", "Statira", "Roxana", "Statira", "Roxana"])

score = np.array([126, 115, 130, 141, 132])

Do the following using a single one-line vectorized operation.

Extract all test scores that are smaller than 130

Extract all test scores by Statira

Add 10 points to Roxana’s scores. (You need to extract it first.)

See the solution

3.1.6 Random numbers

Numpy offers a large set of random number generators. These can be invoked
as np.random.generator(params, size) , where generator stands for the name
of the generator. For instance, np.random.choice(N) can be used to create
random integers from 0 to N − 1 . size determines the shape of the resulting
object.

NB! The argument is size, not shape, although it determines the output
shape!

Here is an example to simulate roll of a die for 5 times:

x = np.random.choice(6, size=5)
x


## array([0, 2, 0, 4, 3])

But maybe we prefer not to label the results as 0..5 but 1..6. So we can just
add one to the result. Here is an example that creates 2-D array of die rolls:

1 + np.random.choice(6, size=(2,4))

## array([[1, 5, 4, 1],
## [4, 3, 2, 1]])

Numpy offers a large set of various random values. Here we list a few more:

3.1.6.1 Random elements from list

random.choice can also extract random elements from a list:

nucleotides = ["A", "G", "C", "T"]

dna = np.random.choice(nucleotides, 20)
"".join(dna)

## 'ACGTCGGGTGCGACCCGAGT'

As the example demonstrates, random.choice picks random elements with

replacement (use replace option to change this behavior).
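
As a sketch of the replace option mentioned above: with replace=False each element can be picked at most once, so the result is a random reordering of the list. The seed value is arbitrary and only there for repeatability.

```python
import numpy as np

np.random.seed(2)                     # arbitrary seed
nucleotides = ["A", "G", "C", "T"]

# without replacement, size cannot exceed the length of the list (4 here)
shuffled = np.random.choice(nucleotides, size=4, replace=False)
print(shuffled)
```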


3.1.6.2 Random normals

random.normal(loc, scale, size) generates normally distributed random
numbers. The distribution is centered at loc and its standard deviation is
scale:

np.random.normal(1000, 100, size=10)

## array([ 969.13995143, 984.62645147, 1067.89311948, 808.64457868,

## 905.75419444, 825.4457021 , 897.90547332, 983.79309738,
## 934.20291005, 1042.21130617])
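
One way to check the meaning of loc and scale is to draw a large sample and compute its mean and standard deviation. This is only an illustrative sketch; the seed and the sample size are arbitrary choices.

```python
import numpy as np

np.random.seed(3)                         # arbitrary seed
x = np.random.normal(1000, 100, size=100_000)

print(x.mean())   # should be close to loc = 1000
print(x.std())    # should be close to scale = 100
```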

3.1.6.3 Binomial random numbers

random.binomial(n, p, size) creates binomial random numbers where the
probability of success is p and the number of trials is n:

np.random.binomial(2, 0.5, size=(2,4))

## array([[2, 2, 1, 1],
## [2, 2, 1, 2]])

Exercise 3.5 We can describe a coin toss as Binomial(1, 0.5) where 1 refers
to the fact that we toss a single coin, and 0.5 means it has probability 0.5 to
come heads up. So such random variables are sequences of zeros and ones.
But how can we get a sequence of -1 and 1 instead? Demonstrate it on
computer!

See the solution


3.1.6.4 Uniform random numbers

random.uniform(low, high, size) creates uniformly distributed random
numbers in the half-open interval [low, high), i.e. low is included but high
is not:

np.random.uniform(-1, 1, size=(3,4)) # random numbers in [-1, 1)

## array([[-0.4078626 , -0.73741789, 0.68563587, 0.31807261],

## [ 0.19087921, -0.1272926 , -0.28749935, 0.17426185],
## [-0.70105733, -0.6575228 , -0.20567095, 0.27590313]])

3.1.6.5 Repeating the exact same random sequence

The random numbers are often called pseudorandom as they are not truly
random: they are computed by a well-defined algorithm, so when feeding the
same initial values to the algorithm, one always gets the same random
numbers. However, normally the initial values are taken from certain
hard-to-control parameters outside of the program's control, such as the time
in microseconds and the hard disk serial number, so in practice it is
impossible to replicate the same sequence.

However, if you need to replicate your results exactly, you have to set the
initial values explicitly using np.random.seed(value) . This re-initializes the
random number generator to the given initial state:

np.random.seed(1)
np.random.uniform(size=5) # 1st batch of numbers


## array([4.17022005e-01, 7.20324493e-01, 1.14374817e-04, 3.02332573e-01,

## 1.46755891e-01])

np.random.uniform(size=5) # 2nd batch is different

## array([0.09233859, 0.18626021, 0.34556073, 0.39676747, 0.53881673])

np.random.seed(1)
np.random.uniform(size=5) # repeat the 1st batch

## array([4.17022005e-01, 7.20324493e-01, 1.14374817e-04, 3.02332573e-01,

## 1.46755891e-01])
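
Numpy (version 1.17 and later) also offers an alternative interface, np.random.default_rng , where the seed belongs to a separate generator object instead of the global state. The notes above do not use it; the following is just a sketch of how the same replication idea looks there.

```python
import numpy as np

rng1 = np.random.default_rng(1)     # generator object with its own seed
first = rng1.uniform(size=5)

rng2 = np.random.default_rng(1)     # same seed, separate generator
again = rng2.uniform(size=5)

print(np.array_equal(first, again))  # the two batches are identical
```

The numbers differ from those produced by np.random.seed(1) as the two interfaces use different algorithms.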

3.1.7 Statistical functions

Numpy offers a set of basic statistical functions, including sum , mean , and
standard deviation std . These can be applied to the array as a whole, or
separately to rows or columns. In the latter case one has to specify the
argument axis : axis=0 means to apply the operation across the rows, so
that the rows are collapsed and the result contains one value for each column,
and axis=1 means to apply the operation across the columns, so that the
result contains one value for each row. Here is an example:


a = np.arange(12).reshape((3,4))
a # 3 rows, 4 columns

## array([[ 0, 1, 2, 3],
## [ 4, 5, 6, 7],
## [ 8, 9, 10, 11]])

a.sum() # total sum

## 66

a.sum(axis=0) # add rows, preserve columns

## array([12, 15, 18, 21])

a.sum(axis=1) # add columns, preserve rows

## array([ 6, 22, 38])

The functions come in two forms: as a method x.sum() , and as a separate

function np.sum(x) . These two ways are pretty much equivalent.
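
The same axis logic applies to the other statistical functions, in both the method and the function form. A short sketch using the mean:

```python
import numpy as np

a = np.arange(12).reshape((3, 4))    # 3 rows, 4 columns

col_means = a.mean(axis=0)       # method form: one value per column
row_means = np.mean(a, axis=1)   # function form: one value per row

print(col_means.shape, row_means.shape)
```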

By default, a missing value of an array causes the function to return missing:


a = a.astype(float) # as np.nan is float, need a float array

a[1,2] = np.nan
a

## array([[ 0., 1., 2., 3.],

## [ 4., 5., nan, 7.],
## [ 8., 9., 10., 11.]])

np.sum(a)

## nan

NB! This differs from the corresponding functionality in pandas where

missings are ignored by default!
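
If the missing values should be ignored in numpy as well, one option is the nan-aware variants of the functions, such as np.nansum and np.nanmean . A small sketch on a made-up array:

```python
import numpy as np

a = np.array([[0., 1., 2.],
              [4., np.nan, 7.]])

print(np.sum(a))        # nan: the missing value propagates
print(np.nansum(a))     # sum of the non-missing values
print(np.nanmean(a))    # mean of the non-missing values
```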

The other statistical functions include

mean for average

median for median

var for variance

std for standard deviation

np.percentile and np.quantile for quantiles
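
The two quantile functions differ only in how the desired quantile is specified. A brief sketch on a made-up vector:

```python
import numpy as np

x = np.arange(1, 10)            # 1, 2, ..., 9

q50 = np.quantile(x, 0.5)       # quantiles are given as fractions in [0, 1]
p50 = np.percentile(x, 50)      # percentiles as percentages in [0, 100]

print(q50, p50, np.median(x))   # all three are the median
```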


Pandas

Pandas is the standard python library to work with dataframes. Unlike data
frames in R, it is not a part of base python and must be imported separately.
It is typically imported as pd :

[Figure: Pandas logo. Marc Garcia, BSD license, via Wikimedia Commons.]

import pandas as pd

Pandas relies heavily on numpy but is a separate package. Unfortunately, it

also uses a somewhat different syntax and somewhat different defaults.
However, as it is “made of” numpy, it works very well together with the latter.

Pandas contains two central data types: Series and DataFrame. Series is
often used as a second-class citizen, just as a single variable (column) in data
frame. But it can also be used as a vectorized dict that links keys (indices) to
values. DataFrame is broadly similar to other dataframes as implemented in R
or spark. When you extract its individual columns and rows you normally get
those in the form of Series. So it is extremely useful to know the basics of
Series when working with data frames. Both DataFrame and Series include
index, a glorified row name, which is very useful for extracting information
based on names, or for merging different variables into a data frame (See
Section Concatenating data with pd.concat ).

We start by introducing Series as this is a simpler data structure than

DataFrame, and allows us to introduce index.


3.2.1 Series

Series is a one-dimensional positional column (or row) of values. It is in some

sense similar to list, but from another point of view it is more like a dict, as it
contains index, and you can look up values based on index as a key. So it
allows not only positional access but also index-based (key-based) access. In
terms of internal structure, it is implemented with vectorized operations in
mind, so it supports vectorized arithmetic, and vectorized logical, string, and
other operations. Unlike dicts, it also supports multi-element extraction.

Let’s create a simple series:

s = pd.Series([1,2,5,6])
s

## 0 1
## 1 2
## 2 5
## 3 6
## dtype: int64

Series is printed in two columns. The first one is the index, the second one is
the value. In this example, index is essentially just the row number and it is not
very useful. This is because we did not provide any specific index and hence
pandas picked just the row number. Underneath the two columns, you can
also see the data type, in this case a 64-bit integer, the default integer type
of numpy (and hence pandas) on most platforms.

Now let’s make another example with a more informative index:


pop = pd.Series( [ 38, 26, 19, 19],

index = ['ca', 'tx', 'ny', 'fl'])
# population, in millions
pop

## ca 38
## tx 26
## ny 19
## fl 19
## dtype: int64

Now the index is helpful: we are looking at state populations, and index tells
us which state is in which row. Another advantage of possessing index is that
even when we filter and manipulate the series, its index will still retain the
original row labels. So we know that index “fl” will always correspond to Florida.
But if we have removed a few cases, or re-ordered the series, then Florida
may not be in the fourth position any more.
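
The following sketch illustrates this: after filtering, Florida is still found under the label “fl” although its position has changed. It uses label-based lookup with brackets, in the same way as one looks up a value in a dict.

```python
import pandas as pd

pop = pd.Series([38, 26, 19, 19],
                index=['ca', 'tx', 'ny', 'fl'])   # population, in millions

small = pop[pop < 30]            # drops 'ca'; the labels stay attached
print(small)

print(small['fl'])                      # label-based lookup finds Florida
print(list(small.index).index('fl'))    # but it is not in position 3 any more
```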

Exercise 3.6 Create a series of 4 capital cities where the index is the name of
corresponding country.

See the solution

We can extract values and index using the corresponding attributes:

pop.values

## array([38, 26, 19, 19])


pop.index

## Index(['ca', 'tx', 'ny', 'fl'], dtype='object')

Note that values are returned as a numpy array, and index is a special index
object. If desired, this can be converted to a list:

list(pop.index)

## ['ca', 'tx', 'ny', 'fl']

Series also supports vectorized operations, e.g. we can do comparisons like

pop > 20

## ca True
## tx True
## ny False
## fl False
## dtype: bool

The result is another series, here of logical values, as indicated by the
“bool” data type.
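
Arithmetic works in a similar fashion: the operation is done elementwise and the index is carried over to the result. A short sketch with the same state populations (the names persons and share are made up):

```python
import pandas as pd

pop = pd.Series([38, 26, 19, 19],
                index=['ca', 'tx', 'ny', 'fl'])   # population, in millions

persons = pop * 1_000_000      # elementwise multiplication, index is kept
share = pop / pop.sum()        # each state's share of the total

print(persons)
print(share.round(3))
```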


3.2.2 DataFrame

DataFrame is the central data structure for holding 2-dimensional rectangular

data. It is in many ways similar to R dataframes. However, it also shares a
number of features with Series, in particular the index, so you can imagine a
data frame is just a number of series stacked next to each other. Also,
extracting single rows or columns from DataFrames typically results in a
series.

3.2.2.1 Creating data frames

DataFrame can be created manually from a dict of lists (or series). The keys
of the dict are the variable names and the values are the variable values,
normally lists or series. As an example, let’s create a data frame with three
variables, ca, tx and md, and three rows:

df = {'ca': [35, 37, 38], 'tx': [23, 24, 26], 'md': [5,5,6]}
pop = pd.DataFrame(df)
print('population:\n', pop, '\n')

## population:
## ca tx md
## 0 35 23 5
## 1 37 24 5
## 2 38 26 6

The data frame is printed as four columns. Exactly as in case of series, the
first column is index. In the example above we did not specify the index and
hence pandas picked just row numbers. But we can provide an explicit index,

for instance the year of observation:

pop = pd.DataFrame(df, index = [2010,2012,2014])

print('population:\n', pop, '\n')

## population:
## ca tx md
## 2010 35 23 5
## 2012 37 24 5
## 2014 38 26 6

In this case the index is rather useful.
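
When the data frame is built from a dict of series instead of lists, the index is taken from the series. Here is a minimal sketch with two of the variables from above:

```python
import pandas as pd

years = [2010, 2012, 2014]
ca = pd.Series([35, 37, 38], index=years)
tx = pd.Series([23, 24, 26], index=years)

# no index argument needed: the index of the series is used
pop = pd.DataFrame({'ca': ca, 'tx': tx})
print(pop)
```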

Exercise 3.7 Create a dataframe of (at least 4) countries, with 2 variables:

population and capital. Country name should be the index.

Hint: feel free to invent populations!

See the solution

3.2.2.2 Read data from file

Creating data frames manually is useful for testing and debugging, but in real
applications we typically read data from disk. This can be done with
pd.read_csv , which takes the file name as the first argument and also supports
many other options. In the example below, we read data about G.W.Bush
approval rate in fall 2001. pd.read_csv assumes files are comma-separated
by default, but as this example file is tab-separated we have to declare it using
sep="\t" as an extra argument. We also read only the first 10 rows for
demonstration:


approval = pd.read_csv("../data/gwbush-approval.csv", sep="\t", nrows=10)

approval

## date approve disapprove dontknow

## 0 2001 Dec 14-16 86 11 3
## 1 2001 Dec 6-9 86 10 4
## 2 2001 Nov 26-27 87 8 5
## 3 2001 Nov 8-11 87 9 4
## 4 2001 Nov 2-4 87 9 4
## 5 2001 Oct 19-21 88 9 3
## 6 2001 Oct 11-14 89 8 3
## 7 2001 Oct 5-6 87 10 3
## 8 2001 Sep 21-22 90 6 4
## 9 2001 Sep 14-15 86 10 4

Exercise 3.8 In the example above: how many columns are printed? How
many variables does the dataframe contain?

See the solution

What happens if we use a wrong separator? This can be easily checked by
printing the number of columns and a few lines of data. Here is an
example:

a = pd.read_csv("../data/gwbush-approval.csv") # wrong separator

a.shape

## (31, 1)


a.head(2)

## date\tapprove\tdisapprove\tdontknow
## 0 2001 Dec 14-16\t86\t11\t3
## 1 2001 Dec 6-9\t86\t10\t4

Two problems are immediately visible: first, the data frame contains only a
single column (because read_csv does not consider tab symbols as separators),
and second, the two lines we printed look weird. If you ask for variable names,
you can also see that all variable names are combined into a single weird name:

a.columns

## Index(['date\tapprove\tdisapprove\tdontknow'], dtype='object')

The tab markers \t in printout give strong hints that the correct separator is
tab.
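
The same check can be tried out without any file on disk. The sketch below reads a tiny tab-separated text from memory through io.StringIO ; the text is an illustrative stand-in, not the actual data file.

```python
import io
import pandas as pd

# a tiny tab-separated "file" kept in memory (illustrative values)
text = "date\tapprove\tdisapprove\n2001 Dec\t86\t11\n2001 Nov\t87\t9\n"

wrong = pd.read_csv(io.StringIO(text))             # default separator: comma
right = pd.read_csv(io.StringIO(text), sep="\t")   # correct separator: tab

print(wrong.shape)   # a single column
print(right.shape)   # three columns
```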

It may initially be quite confusing to understand how to specify the file name. If
you load data in a jupyter notebook, then the working directory is normally the
same directory where the notebook is located. Notebook also lets you
complete file names with the TAB key. But in any case, the working directory can
be found with os.getcwd (get current working directory):

import os
os.getcwd()


## '/home/siim/tyyq/lecturenotes/machinelearning-py'

This helps to specify the relative path if your data file is not located in the
same place as your code. You can also check which files python finds in a
given folder, e.g. in ../data/ :

files = os.listdir("../data/")
files[:5]

## ['house-votes-84.csv.bz2', 'iris.csv.bz2', 'males.csv.bz2', 'hadcrut-5

As we see, this function returns a list of file names it finds in the given
location.
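
When a relative path does not seem to work, it may help to check where it actually points. The sketch below assumes the same ../data folder as above, which may not exist on your computer, so it tests for the folder before listing it.

```python
import os

data_dir = os.path.join("..", "data")       # relative path, as used above
print(os.path.abspath(data_dir))            # the absolute path it resolves to

if os.path.isdir(data_dir):
    print(sorted(os.listdir(data_dir))[:5])  # first few files, if any
else:
    print("no such folder:", data_dir)
```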

Exercise 3.9 Refresh your knowledge of relative paths!

What is your current working directory?

List all files in

your current folder

in the parent folder of it.

See the solution

As another complication, notebooks are often run on a separate server or in a
docker container. These may have no access to the files on your computer (in
the case of a server), or only limited access (in the case of a docker container).


3.3 Indexing data frames and series

Indexing refers to selecting data from data frames and series based on
variable names, logical conditions, and position. It is a complex task with many
different methods, and unfortunately also with many caveats. Below, the topic
is split into several subsections:

Select variables explains how to select desired variables from a data

frame

Filter observations with logical operations describes how to filter rows

Positional indexing of Series introduces positional indexing, indexing

based on row number, and how to do it with series

Positional indexing of data frames explains positional indexing, indexing

based on both row and column numbers, for data frames

Modifying data frames: there are slight differences when modifying data
instead of extracting, these are discussed here.

Indexing: summary and comparison provides a summary of all methods.

Fortunately, Series and data frames behave in a broadly similar way,

e.g. selecting cases by logical conditions, based on index, and location are
rather similar. As series do not have columns, we cannot access elements by
column name or by column position though.

These notes do not provide a comprehensive overview; consult e.g. McKinney,
"Python for Data Analysis", for more details.


3.3.1 Select variables in data frames

We use the G.W.Bush approval data we loaded above to demonstrate variable

access. For a refresher, the first lines of the data frame look like

approval.head(4)

## date approve disapprove dontknow

## 0 2001 Dec 14-16 86 11 3
## 1 2001 Dec 6-9 86 10 4
## 2 2001 Nov 26-27 87 8 5
## 3 2001 Nov 8-11 87 9 4

To begin with, data frames have variable names. We can extract a single
variable either with ["varname"] or a shorthand as attribute .varname (note:
replace varname with the name of the relevant variable):

approval["approve"] # approval, as series


## 0 86
## 1 86
## 2 87
## 3 87
## 4 87
## 5 88
## 6 89
## 7 87
## 8 90
## 9 86
## Name: approve, dtype: int64

approval.approve # the same, as series

## 0 86
## 1 86
## 2 87
## 3 87
## 4 87
## 5 88
## 6 89
## 7 87
## 8 90
## 9 86
## Name: approve, dtype: int64

These constructs return the column as a series. If we prefer to get a single-

column data frame, we can wrap the variable name into a list:

approval[["approve"]] # approval, as data frame


## approve
## 0 86
## 1 86
## 2 87
## 3 87
## 4 87
## 5 88
## 6 89
## 7 87
## 8 90
## 9 86

The attribute shorthand is usually the easier way, but it does not work if you
need to use an indirect variable name (a variable name that is stored in another
variable) or if the variable name contains spaces or other special characters. It
also does not work for creating new variables in the data frame. See more in
Section 3.3.5.
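
The three cases where the bracket notation is needed can be sketched as follows. The data frame here is a made-up miniature, and the column name "dont know" (with a space) is invented for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"approve": [86, 87], "dont know": [3, 5]})  # made-up data

name = "approve"
print(df[name])            # indirect name: the column name is stored in a variable
print(df["dont know"])     # name with a space: attribute access is not possible

df["total"] = df["approve"] + df["dont know"]   # new variable: use brackets
print(df.columns)
```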

The previous example where we extracted a single column as a data frame

instead of Series also hints how to extract more than one variable: just wrap
all the required variable names into a list:

vars = ["date", "approve"]

approval[vars]


## date approve
## 0 2001 Dec 14-16 86
## 1 2001 Dec 6-9 86
## 2 2001 Nov 26-27 87
## 3 2001 Nov 8-11 87
## 4 2001 Nov 2-4 87
## 5 2001 Oct 19-21 88
## 6 2001 Oct 11-14 89
## 7 2001 Oct 5-6 87
## 8 2001 Sep 21-22 90
## 9 2001 Sep 14-15 86

There are no attribute shortcuts to extract multiple columns.

3.3.2 Filter observations with logical

operations

Filtering refers to extracting only a subset of rows from the dataframe based
on certain conditions. The conditions are logical operations that can be either
true or false, depending on the values in each row. Filtering produces a
sub-dataframe where only those observations that meet the selection criteria
are present. Here is an example:

approval[approval.approve > 88]

## date approve disapprove dontknow

## 6 2001 Oct 11-14 89 8 3
## 8 2001 Sep 21-22 90 6 4


Note that we have to refer to data variables as approval.approve , not just
approve , unlike in R dplyr where one can just write approve . This is
somewhat harder to write but it is less ambiguous and produces fewer
hard-to-find bugs.

We can also use more complex selection conditions. For instance, we can
look for very low or very high approval rates as follows:

approval[(approval.approve < 86) | (approval.approve > 89)]

## date approve disapprove dontknow

## 8 2001 Sep 21-22 90 6 4

Note that we are using the vectorized “or” operator | , not the base python
or . We also need to wrap both the “less than” and “greater than” parts in
parentheses, because | binds more tightly than the comparison operators.
See more in Section 3.1.4.
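The operator rules can be tried out on a small example. This is a sketch only, assuming a hypothetical data frame polls that contains four of the polls listed above:

```python
import pandas as pd

# Hypothetical stand-in with four of the polls from the table above
polls = pd.DataFrame({
    "date": ["2001 Oct 19-21", "2001 Oct 11-14", "2001 Oct 5-6", "2001 Sep 21-22"],
    "approve": [88, 89, 87, 90],
})

# Vectorized "or": each comparison is wrapped in parentheses
extreme = polls[(polls.approve < 88) | (polls.approve > 89)]
print(extreme)

# Vectorized "and" (&) and "not" (~) follow the same pattern
middle = polls[(polls.approve >= 88) & ~(polls.approve > 89)]
print(middle)

# Base-python "or" cannot combine two boolean Series
try:
    polls[(polls.approve < 88) or (polls.approve > 89)]
except ValueError as err:
    print("ValueError:", err)
```

The last statement fails because base python's or needs a single true/false value, while each comparison here produces a whole series of them.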

Exercise 3.10 How many polls in the data show the president’s approval rate
at least 88%? At which dates are those polls conducted?

See the solution

NB! The filtered object may not be an independent data frame: pandas may treat
it as a view of the original data frame. This may give you warnings and errors
later when you attempt to modify the filtered data. If you intend to do that,
make a deep copy of the data using the .copy method. See more in Section 3.3.5.


3.3.3 Positional indexing of Series

Besides selecting variables and filtering by logical conditions, we occasionally
need to access elements by index or by position (location). Here we
demonstrate positional indexing using a series object; positional indexing
of data frames is discussed in Section 3.3.4 below:

pop = pd.Series([32.7, 267.7, 15.3], # in millions
                index=["MY", "ID", "KH"])
pop

## MY 32.7
## ID 267.7
## KH 15.3
## dtype: float64

We can access a series’ values in two ways: by position, and by index. In order
to access elements by position, we have to use the attribute .iloc[] , where the i in
.iloc stands for “integer”. Unlike most other methods, .iloc expects its arguments in
square brackets, not parentheses. A single number in brackets returns the element itself
(e.g. a single number); if the brackets contain a list (this looks like double
brackets), it returns a series, potentially containing only a single element. So
in order to extract the 2nd and 3rd elements of the population series, we can write:

pop.iloc[1] # extract 2nd element as a number

## 267.7


pop.iloc[[1,2]] # extract 2nd, 3rd as a series

## ID 267.7
## KH 15.3
## dtype: float64
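The single-bracket and double-bracket rules can be verified by looking at what comes back. The sketch below re-creates the pop series from above; the slice in the third line is an addition that the text does not discuss, included only for comparison:

```python
import pandas as pd

pop = pd.Series([32.7, 267.7, 15.3], index=["MY", "ID", "KH"])  # in millions

single = pop.iloc[1]       # one integer in the brackets: a plain number
one_elem = pop.iloc[[1]]   # a list in the brackets: a Series of length 1
sliced = pop.iloc[1:3]     # a slice: a Series, end position 3 excluded

print(type(single).__name__)
print(one_elem)
print(sliced)
```

The one-element series still carries its index label ("ID"), which the plain number does not.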

Alternatively, we can extract the elements by index. This works in a
similar fashion, except that we have to use .loc[] instead of .iloc[] . The rules
for single and double brackets apply in the same way as for
positional access.

pop.loc["ID"] # extract Indonesian population as a number

## 267.7

pop.loc[["ID", "MY"]] # extract Indonesian and Malaysian population

# as a series

## ID 267.7
## MY 32.7
## dtype: float64

Exercise 3.11 Use your series of capital cities (see the exercise above).
Extract:

1st, 3rd element by position as single elements (city names)


2nd element by country name as a 1-element series.

See the solution

One can also drop the .loc[] syntax and just use square brackets, so
instead of writing pop.loc[["ID", "MY"]] , one can just write pop[["ID",
"MY"]] .

The fact that there are several ways to extract positional data causes a lot of
confusion for beginners. It is not helped by the common habit of not using
indices and just relying on the automatic row-numbers. In this case positional
access by .iloc[] produces exactly the same results as the index access by
.loc[] , and one can conveniently forget about the index and use whatever
feels easier. But sometimes the index changes as a result of certain
operations and that may lead to errors or unexpected results. For instance, we
can create an alternative population series without explicit index:

pop1 = pd.Series([np.nan, 26, 19, 13]) # index is 0, 1, ...

pop1

## 0 NaN
## 1 26.0
## 2 19.0
## 3 13.0
## dtype: float64

In this example, position and index are equivalent and hence it is easy to
forget that .loc[] is index-based access, not positional access! So one may
freely mix both methods (and remember, .loc is not needed):


pop1.loc[2]

## 19.0

pop1.iloc[2]

## 19.0

pop1[2]

## 19.0

This becomes a problem if a numeric index is no longer equivalent to the row number,
for instance after we drop missing values:

pop2 = pop1.dropna() # remove missings

pop2 # note: the first row has index 1

## 1 26.0
## 2 19.0
## 3 13.0
## dtype: float64

pop2.iloc[2] # this is by position


## 13.0

pop2.loc[2] # this is by index

## 19.0

pop2[2] # also by index

## 19.0

Additionally, if pop2 for some reason turns into a numpy array, then pop2[2]
is based on position, as arrays do not have an index!
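The three access modes can be compared side by side. This is a minimal sketch that repeats pop1 and pop2 from above; the names arr and pop3, and the use of .reset_index(drop=True) to renumber the rows, are illustrative additions:

```python
import numpy as np
import pandas as pd

pop1 = pd.Series([np.nan, 26, 19, 13])
pop2 = pop1.dropna()             # index is now 1, 2, 3

arr = pop2.to_numpy()            # plain numpy array: the index is gone
print(pop2.loc[2])               # by index label
print(pop2.iloc[2])              # by position
print(arr[2])                    # arrays only know positions

# If labels and positions should agree again, renumber the index
pop3 = pop2.reset_index(drop=True)
print(pop3.loc[2], pop3.iloc[2])
```

Writing .loc or .iloc explicitly makes the intent visible in the code, which is safer than relying on plain brackets.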

3.3.4 Positional indexing of data frames

We use a small data frame of capital cities to demonstrate how indexing on
data frames works. The data frame contains two variables, the name of the capital
city and the population; the index is the country code:

data = pd.DataFrame({"capital":["Kuala Lumpur", "Jakarta", "Phnom Penh"],
                     "population":[32.7, 267.7, 15.3]}, # in millions
                    index=["MY", "ID", "KH"])
data


## capital population
## MY Kuala Lumpur 32.7
## ID Jakarta 267.7
## KH Phnom Penh 15.3

(MY is Malaysia, ID Indonesia and KH is Cambodia).

Just like series, data frames allow positional access by .iloc[] . However,
as data frames are two-dimensional objects, .iloc accepts two arguments
(in brackets, separated by a comma): the first one for rows, the second one for
columns. So we can write

data.iloc[2] # 3rd row, as series

## capital Phnom Penh

## population 15.3
## Name: KH, dtype: object

data.iloc[[2]] # 3rd row, as data frame

## capital population
## KH Phnom Penh 15.3

data.iloc[2,1] # 3rd row, 2nd column, as a number

## 15.3


There is also an index-based extractor .loc[] that accepts one (for rows) or
two (for rows and columns) indices. In case of data frames, the default row
index is just the row number; but the column index is the variable names. So
we can write

data.loc["MY","capital"] # Malaysian capital

## 'Kuala Lumpur'

data.loc[["KH", "ID"], ["population", "capital"]]

# Extract a sub-dataframe

## population capital
## KH 15.3 Phnom Penh
## ID 267.7 Jakarta
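One detail not covered in the text above is worth knowing when using ranges: a label slice in .loc[] includes its end point, while a position slice in .iloc[] excludes it. A short sketch on the same data frame (the variable names by_label and by_position are illustrative):

```python
import pandas as pd

data = pd.DataFrame({"capital": ["Kuala Lumpur", "Jakarta", "Phnom Penh"],
                     "population": [32.7, 267.7, 15.3]},   # in millions
                    index=["MY", "ID", "KH"])

by_label = data.loc["MY":"ID", "capital"]   # label slice: "ID" is included
by_position = data.iloc[0:1, 0]             # position slice: position 1 is excluded

print(by_label)
print(by_position)
```

The first result has two elements and the second only one, even though the two slices look alike.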

Unfortunately, data frames add their own confusing constructs. When accessing
data frames with .loc[] , we have to specify rows first, and possibly
columns second. If we drop .loc then we cannot specify rows. That is,
unless we extract one variable with brackets, get a series, and extract the
desired row in a second set of brackets…

data["capital"]


## MY Kuala Lumpur
## ID Jakarta
## KH Phnom Penh
## Name: capital, dtype: object

data["capital"]["MY"]

## 'Kuala Lumpur'

Finally, remember that 2-D numpy arrays will use similar integer-positional
syntax as .iloc[] , just without .iloc .
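The parallel between .iloc[] and numpy indexing can be seen by converting the data frame into an array. A minimal sketch, assuming the conversion is done with .to_numpy() and the result is stored under the illustrative name A:

```python
import pandas as pd

data = pd.DataFrame({"capital": ["Kuala Lumpur", "Jakarta", "Phnom Penh"],
                     "population": [32.7, 267.7, 15.3]},   # in millions
                    index=["MY", "ID", "KH"])

A = data.to_numpy()          # 2-D numpy array; row and column labels are dropped
print(A.shape)
print(A[2, 1])               # 3rd row, 2nd column
print(data.iloc[2, 1])       # the same element through .iloc
print(A[2])                  # 3rd row as a 1-D array
```

After the conversion only positions remain, so A["KH"] or A.loc[...] would not work.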

In conclusion, it is very important to know what your data type is when using
numpy and pandas. Indexing is all around us when working with data; there
are many somewhat similar ways to extract elements, and which way is
correct depends on the exact data type.

3.3.5 Modifying data frames

Modifying data frames can be done in a way broadly similar to extracting
elements. However, there are several exceptions and caveats. Let’s
demonstrate this by modifying the data frame of three countries we created
above.


3.3.5.1 One cannot create variables with dot-attribute

We can extract a single series as data.capital , but when creating a new
variable we need to specify it using brackets. For instance:

data["temperature"] = [27.7, 26.1, 26.6] # daily mean, January

data

## capital population temperature

## MY Kuala Lumpur 32.7 27.7
## ID Jakarta 267.7 26.1
## KH Phnom Penh 15.3 26.6

Afterward we can access the new variable as data.temperature .
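What happens with the dot-attribute can be checked directly. The sketch below uses a reduced version of the data frame and a copy called attempt (an illustrative name); the warnings module is used only to silence the warning pandas emits for this pattern:

```python
import warnings
import pandas as pd

data = pd.DataFrame({"population": [32.7, 267.7, 15.3]},   # in millions
                    index=["MY", "ID", "KH"])

attempt = data.copy()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")                  # pandas warns about this pattern
    attempt.temperature = [27.7, 26.1, 26.6]         # sets a plain attribute, not a column

print("temperature" in attempt.columns)              # no new column was created

data["temperature"] = [27.7, 26.1, 26.6]             # brackets create the column
print("temperature" in data.columns)
```

The dot-assignment does not raise an error, so the mistake is easy to miss: the values are stored on the python object but never become part of the data.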

3.3.5.2 Explicitly make copy when working with filtered data

A typical data science workflow consists of a) filtering data to relevant cases
only, and b) modifying the resulting subset. The first step often involves
removing missing values, or limiting the analysis to a certain subset of
interest. It is important to realize that Pandas’ filtering does not necessarily copy the
interesting cases in memory. It may instead just create a view, i.e. re-use the
same location in computer memory but limit access to a certain part of it (see footnote 4).
This is a very good idea in terms of conserving memory and avoiding
unnecessary copy operations. However, it may cause warnings and errors
when modifying the filtered data later. We demonstrate this on the same
dataset.


Select only large countries (population over 20M):

large = data[data.population > 20]

large

## capital population temperature

## MY Kuala Lumpur 32.7 27.7
## ID Jakarta 267.7 26.1

We got a subset of Malaysia and Indonesia. Now let’s add another variable to
these large countries:

large["language"] = ["Malay", "Indonesian"]

## <string>:1: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: http://pandas.pydata.org/pandas-

large

## capital population temperature language

## MY Kuala Lumpur 32.7 27.7 Malay
## ID Jakarta 267.7 26.1 Indonesian


Note the warning: A value is trying to be set on a copy of a slice…. This tells
you that pandas cannot guarantee that filtering data[data.population > 20] created
an independent data frame rather than a view of the existing one in memory, and
Pandas is unhappy with code that may be modifying just a part of the original data frame.

NB!
Although the result appears correct here, do not rely on this approach! It
may or may not work, depending on the exact memory layout of the
dataset!

Fortunately, the solution is very simple. We need to make an explicit copy with
the .copy method before we start any modifications:

large = data[data.population > 20].copy() # explicit copy

large["language"] = ["Malay", "Indonesian"]
large

## capital population temperature language

## MY Kuala Lumpur 32.7 27.7 Malay
## ID Jakarta 267.7 26.1 Indonesian

Now the modification works without a warning.

An explicit copy is not needed until you start modifying the data: you can do
various filtering steps without .copy as long as you make the copy before
any modifications.
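The filter-then-copy pattern can be sketched as follows. The second filter (dropping Jakarta) is a made-up step, included only to show several filters chained before a single copy:

```python
import pandas as pd

data = pd.DataFrame({"capital": ["Kuala Lumpur", "Jakarta", "Phnom Penh"],
                     "population": [32.7, 267.7, 15.3]},   # in millions
                    index=["MY", "ID", "KH"])

# Chain as many filters as needed, then copy once before modifying
large = data[data.population > 20]
large = large[large.capital != "Jakarta"].copy()
large["language"] = ["Malay"]

print(large)
print("language" in data.columns)   # the original data frame is untouched
```

Because large is an explicit copy, the new column exists only there and the modification cannot leak back into data.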


3.3.5.3 Modifying index

The index that is attached to series and data frames is potentially a useful and
informative tool. But sometimes it is not very useful. For instance, when you
load data from disk, the index defaults to the row number, and this is
rarely what we are interested in. In such cases one may want to change the
index. If you want to create a new index, you can just assign it to
df.index . For instance, we can assign country names as the index of our
data frame of large countries:

large.index = ["Malaysia", "Indonesia"]

large

## capital population temperature language

## Malaysia Kuala Lumpur 32.7 27.7 Malay
## Indonesia Jakarta 267.7 26.1 Indonesian

Alternatively, we can convert a column to index with the .set_index() method:

large.set_index("capital")

## population temperature language

## capital
## Kuala Lumpur 32.7 27.7 Malay
## Jakarta 267.7 26.1 Indonesian


This will remove the column “capital” from the data frame, as its values will be in
the index instead. Note that by default, .set_index() returns a new data frame
instead of modifying the original in place, so if you want to keep the result, you have to store
it in a variable. The opposite, converting the index into a column, can be
done with .reset_index() .
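The round trip between a column and the index can be sketched like this, using a reduced two-column version of the large data frame (the names by_capital and restored are illustrative):

```python
import pandas as pd

large = pd.DataFrame({"capital": ["Kuala Lumpur", "Jakarta"],
                      "population": [32.7, 267.7]},        # in millions
                     index=["Malaysia", "Indonesia"])

by_capital = large.set_index("capital")    # new data frame; large is unchanged
print(list(large.columns))                 # still contains "capital"
print(list(by_capital.columns))

restored = by_capital.reset_index()        # the index becomes a column again
print(list(restored.columns))
print(list(restored.index))                # default row numbers
```

Note that .set_index() discards the previous index, so the country names are no longer present in by_capital or restored.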

Exercise 3.12 Take the capital-population data frame from
Section 3.3.4.

Replace the index by country names

Convert the index into a variable “country”

Ensure that you store and print the final data frame!

See the solution

3.3.6 Indexing: summary and comparison

Indexing data is complex. Here we repeat and summarize the main methods
we have discussed so far. First create three objects: a numpy matrix, a data
frame, and a series. The first two are 2-dimensional but the last one is
1-dimensional.

M = np.array([[1507, 12478],
[-500, 11034],
[1537, 8443],
[1591, 6810]])
M


## array([[ 1507, 12478],

## [ -500, 11034],
## [ 1537, 8443],
## [ 1591, 6810]])

df = pd.DataFrame(M, columns=["established", "population"],
                  index=["Mumbai", "Delhi", "Bangalore", "Hyderabad"])
df

## established population
## Mumbai 1507 12478
## Delhi -500 11034
## Bangalore 1537 8443
## Hyderabad 1591 6810

s = pd.Series(M[:,0], index=df.index)
s

## Mumbai 1507
## Delhi -500
## Bangalore 1537
## Hyderabad 1591
## dtype: int64

(This is data about four cities, the year when those were established, and
population in thousands).


Exercise 3.13 Create another numpy matrix and a data frame about cities in a
similar fashion: create a matrix of data, and create a data frame from it using
pd.DataFrame . Specify index (row names) and columns (variable names).
Include at least 3 cities and 3 variables (e.g. population in millions, size in
km2, and population density people per km2).

Hint: you may invent both city names and the figures!

See the solution

Extract rows/columns by number (integer):

Numpy array: just use the numbers in brackets:

M[1,0] # second row, first column

## -500

M[2,:] # third row

## array([1537, 8443])

Data frames: use iloc and brackets:

df.iloc[1,0] # second row, first column

## -500


df.iloc[2,:] # third row

## established 1537
## population 8443
## Name: Bangalore, dtype: int64

Series: use iloc and brackets (but these are just 1-dimensional):

s.iloc[1] # second row

## -500

Extract using index (city names/column names):

numpy array: not possible

Data frames: use loc and brackets:

df.loc["Delhi","established"] # second row, first column

## -500

df.loc["Bangalore",:] # third row


## established 1537
## population 8443
## Name: Bangalore, dtype: int64

Series: use loc (but there are no columns here):

s.loc["Delhi"]

## -500

If we want to extract individual columns, we can do the following:

Numpy arrays: use brackets and put a colon : in place of the row indicator:

M[:,0]

## array([1507, -500, 1537, 1591])

Data frames: you can use iloc and brackets, exactly as in case of
numpy arrays. You can also use brackets and column names (column
index) without iloc , or dot-column name:

df.iloc[:,0]


## Mumbai 1507
## Delhi -500
## Bangalore 1537
## Hyderabad 1591
## Name: established, dtype: int64

df["established"]

## Mumbai 1507
## Delhi -500
## Bangalore 1537
## Hyderabad 1591
## Name: established, dtype: int64

df.established

## Mumbai 1507
## Delhi -500
## Bangalore 1537
## Hyderabad 1591
## Name: established, dtype: int64

If you want to extract rows and columns in a mixed way, e.g. rows by number and
columns by column name (index), you can use double extraction (two sets of
brackets) and chain your extractions into a single line:

df.iloc[:3,:]["population"]


## Mumbai 12478
## Delhi 11034
## Bangalore 8443
## Name: population, dtype: int64
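Chaining is not the only way to mix positions and names. As an alternative sketch (not the approach used in the text), one can translate the name into a position with df.columns.get_loc(), or the positions into labels through df.index, and then use a single extractor:

```python
import numpy as np
import pandas as pd

M = np.array([[1507, 12478],
              [-500, 11034],
              [1537, 8443],
              [1591, 6810]])
df = pd.DataFrame(M, columns=["established", "population"],
                  index=["Mumbai", "Delhi", "Bangalore", "Hyderabad"])

chained = df.iloc[:3, :]["population"]        # two extraction steps, as above

# Translate the column name into its position, then use .iloc only
via_iloc = df.iloc[:3, df.columns.get_loc("population")]

# Translate the row positions into labels, then use .loc only
via_loc = df.loc[df.index[:3], "population"]

print(chained.equals(via_iloc), chained.equals(via_loc))
```

A single extractor is also the safer choice when the selection is used on the left-hand side of an assignment, where chained brackets may end up modifying a temporary object.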

Exercise 3.14 Take your own city matrix and city data frame. From both of
these extract:

population density (for all cities)

data for the third city. For the data frame do it in two ways: using index,
and using row number!

area of the second city. For the data frame, do it in two ways: using
column name (column index), and column number!

See the solution

Finally, when you ask for a single entry (singleton), pandas simplifies the result into
a lower-ranked object (a series instead of a data frame, or a number instead of a
series). If you want to retain a data structure similar to the original one, wrap
your selector in a list. For instance, here is the previous example modified so that it
returns a data frame:

df.iloc[:3,:][["population"]]

## population
## Mumbai 12478
## Delhi 11034
## Bangalore 8443


All these methods can sometimes create rather confusing situations. For
instance, if we do not specify an index, it will be automatically created as row
numbers (but starting from 0, not 1). In that case df.iloc[i] and df.loc[i]
give the same result (assuming i is a list of row numbers). Even worse, if
the index skips some numbers, then df.loc[i] may or may not work, and
even where it works, it may give wrong results! In a similar fashion, M[i,j]
works but df[i,j] does not work, and df.loc[i,j] works but M.loc[i,j] does
not. In order to tell whether the syntax is correct, it is necessary to know what
the data structure is.
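The case of an index that skips numbers can be reproduced with a tiny example. This is a sketch with a made-up single-column data frame; the names df, gaps and x are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40]})   # default index 0, 1, 2, 3
gaps = df[df.x != 20]                        # index is now 0, 2, 3

print(gaps.iloc[2]["x"])     # third remaining row
print(gaps.loc[2]["x"])      # row labelled 2, which is a different row

try:
    gaps.loc[1]              # label 1 no longer exists
except KeyError as err:
    print("KeyError:", err)
```

Here .loc[2] runs without error but returns a different row than .iloc[2], which is exactly the kind of silent mismatch described above.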

2. There are also operations that are not performed elementwise when using
arrays, in particular the matrix product.

3. If you run your code from the command line, the working directory is the
directory where you run the command, not the directory where the
program is located.

4. Pandas decides whether to make a copy or a view in each case
separately, depending on which is the more efficient approach.

