Numpy 1 Merged


Numpy 1

Content
Introduction to DAV
Python Lists vs Numpy Array

Importing Numpy
Why use Numpy?

Dimension & Shape


Type Conversion
Indexing
Slicing
NPS use case

Introduction to DAV (Data Analysis and Visualization) Module


With this lecture, we're starting the DAV module.

It will contain 3 sections -

1. DAV-1: Python Libraries

Numpy
Pandas
Matplotlib & Seaborn
2. DAV-2: Probability Statistics
3. DAV-3: Hypothesis Testing

Python Lists vs Numpy Arrays


Homogeneity of data
So far, we've been working with Python lists, which can hold heterogeneous data.

a = [1, 2, 3, "Michael", True]


a

[1, 2, 3, 'Michael', True]

Because of this heterogeneity, the data elements of a Python list are not stored together in memory (RAM).

Each element is stored in a different location.


Only the addresses of the elements are stored together.
So, a list actually just references these different locations in order to access the actual elements.

On the other hand, Numpy only stores homogeneous data, i.e. a numpy array cannot contain mixed data types.

It will either

ONLY contain integers


ONLY contain floats
ONLY contain characters

... and so on.


Because of this, we can now store these different data items together, as they are of the same type.
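A small illustration of this (a minimal sketch; the import and these attributes are covered formally later in these notes): every element of a numpy array shares one dtype and one fixed item size, which is what allows the elements to be packed next to each other.

import numpy as np

arr = np.array([1, 2, 3, 4])
print(arr.dtype)      # one common dtype for all elements, e.g. int64
print(arr.itemsize)   # bytes per element (the same for every element)
print(arr.nbytes)     # total bytes = itemsize * number of elements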

Speed

Programming languages can also be slow or fast.

In fact,

Java is a decently fast language.


Python is a slow language.
C, one of the earliest available languages, is super fast.

This is because C has concepts like memory allocation, pointers, etc.

How is this possible?

With Numpy, though we will be writing our code in Python, behind the scenes all the heavy lifting is implemented in the C programming language, which
makes it faster.

Because of this, a Numpy Array will be significantly faster than a Python List in performing the same operation.

This is very important to us because, in data science, we deal with huge amounts of data.

Properties
In-built Functions

For a Python list a , we had built-in functions like sum(a) , etc.


For NumPy arrays also, we will have such in-built functions.

Slicing

Recall that we were able to perform list slicing.


All of that is still applicable here.
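A quick preview of both properties (a minimal sketch; importing numpy is covered formally in the next section):

import numpy as np

arr = np.array([10, 20, 30, 40])
print(np.sum(arr))   # 100 -- an in-built aggregate function
print(arr[1:3])      # [20 30] -- slicing works just like with lists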

Importing Numpy


Recall how we used to import a module/library in Python.

In order to use Python Lists, we do not need to import anything extra.


However, to use Numpy Arrays, we need to import the library into our environment.

Generally, we do so using the alias np .

import numpy as np

Note:

In this notebook, numpy will already be installed, as we are working on Google Colab.
However, when working in an environment that does not have it installed, you'll have to install it the first time.
This can be done with the command: !pip install numpy

Why use Numpy? - Time Comparison


Suppose you are given a list of numbers. You have to find the square of each number and store the results in a list.

a = [1,2,3,4,5]

type(a)

list

The basic approach here would be to iterate over the list and square each element.

res = [i**2 for i in a]


print(res)

[1, 4, 9, 16, 25]

Let's try the same operation with Numpy.

To do so, first of all we need to define the Numpy array.

We can convert any list a into a Numpy array using the array() function.

b = np.array(a)
b

array([1, 2, 3, 4, 5])

type(b)

numpy.ndarray

nd in numpy.ndarray stands for n-dimensional

Now, how can we get the square of each element in the same Numpy array?

b**2

array([ 1, 4, 9, 16, 25])

The biggest benefit of Numpy is that it supports element-wise operations.

Notice how easy and clean the syntax is.

But is the clean syntax and ease in writing the only benefit we are getting here?

To understand this, let's measure the time for these operations.


We will use %timeit .

l = range(1000000)

%timeit [i**2 for i in l]

343 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It took approx 340 ms per loop to iterate and square all elements from 0 to 999,999

Let's perform the same operation using Numpy arrays -

We will use the np.array() method for this.


We can perform element-wise operations using numpy.

l = np.array(range(1000000))

%timeit l**2

778 µs ± 100 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Notice that it only took approx 780 µs per loop for the numpy operation.

What is the major reason behind numpy's faster computation?

Numpy arrays are densely packed in memory due to their homogeneous type.


Numpy functions are implemented in the C programming language.
Numpy is able to divide a task into multiple subtasks and process them in parallel.

Dimensions and Shape


We can get the dimension of an array using the ndim property.

arr1 = np.array(range(1000000))
arr1.ndim

Numpy arrays have another property called shape that tells us number of elements across every dimension.

arr1.shape

(1000000,)

This means that the array arr1 has 1000000 elements in a single dimension.

Let's take another example to understand shape and ndim better.

arr2 = np.array([[1, 2, 3], [4, 5, 6], [10, 11, 12]])


print(arr2)

[[ 1 2 3]
[ 4 5 6]
[10 11 12]]

What do you think will be the shape & dimension of this array?

arr2.ndim

arr2.shape

(3, 3)

ndim specifies the number of dimensions of the array i.e. 1D (1), 2D (2), 3D (3) and so on.

shape returns the exact shape in all dimensions, that is (3,3) which implies 3 in axis 0 and 3 in axis 1.
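For instance (a quick sketch; np.zeros simply creates an array of zeros with the given shape), a 3-D array has ndim 3 and a three-number shape:

arr3d = np.zeros((2, 3, 4))   # 2 blocks, each with 3 rows and 4 columns
print(arr3d.ndim)    # 3
print(arr3d.shape)   # (2, 3, 4)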
np.arange()

Let's create some sequences in Numpy.

We can pass starting point, ending point (not included in the array) and step-size.

Syntax:

arange(start, end, step)

arr2 = np.arange(1, 5)
arr2

array([1, 2, 3, 4])

arr2_step = np.arange(1, 5, 2)
arr2_step

array([1, 3])

np.arange() behaves in the same way as range() function.

But then why not call it np.range?

Unlike range() , np.arange() lets us pass a floating point number as the step-size.

arr3 = np.arange(1, 5, 0.5)


arr3

array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])

Type Conversion in Numpy Arrays


For this, let's pass a float as one of the values in a numpy array.

arr4 = np.array([1, 2, 3, 4])


arr4

array([1, 2, 3, 4])

arr4 = np.array([1, 2, 3, 4.0])


arr4

array([1., 2., 3., 4.])

Notice that the ints were upcast to floats.


Because a numpy array can only store homogeneous data, i.e. values of one data type.

Similarly, what will happen when we run the following code? Will it give an error?

np.array(["Harry Potter", 1, 2, 3])

array(['Harry Potter', '1', '2', '3'], dtype='<U21')

No. It will convert all elements of the array to the string type.

There's a dtype parameter in the np.array() function.

What if we set the dtype of array containing integer values to float ?

arr5 = np.array([1, 2, 3, 4])


arr5

array([1, 2, 3, 4])
arr5 = np.array([1, 2, 3, 4], dtype="float")
arr5

array([1., 2., 3., 4.])

Question: What will happen in the following code?

np.array(["Shivank", "Bipin", "Ritwik"], dtype=float)

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-26-bdb627c3c07e> in <cell line: 1>()
----> 1 np.array(["Shivank", "Bipin", "Ritwik"], dtype=float)

ValueError: could not convert string to float: 'Shivank'


Since it is not possible to convert alphabetic strings to floats, it naturally raises an error.

We can also convert the data type with the astype() method.

arr = np.array([10, 20, 30, 40, 50])


arr

array([10, 20, 30, 40, 50])

arr = arr.astype('float64')
print(arr)

[10. 20. 30. 40. 50.]

Indexing
Similar to Python lists

m1 = np.arange(12)
m1

array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

m1[0] # gives first element of array

m1[-1] # negative indexing in numpy array

11

You can also use a list of indices in numpy.

m1 = np.array([100,200,300,400,500,600])

m1[[2,3,4,1,2,2]]

array([300, 400, 500, 200, 300, 300])

Did you notice how a single index can be repeated multiple times when giving a list of indices?

Note:

If you want to extract multiple indices, you need to use two sets of square brackets [[ ]]

Otherwise, you will get an error.


Because it is only expecting a single index.
For multiple indices, you need to pass them as a list.
m1[2,3,4,1,2,2]

---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-34-0ec34089038e> in <cell line: 1>()
----> 1 m1[2,3,4,1,2,2]

IndexError: too many indices for array: array is 1-dimensional, but 6 were indexed


Slicing
Similar to Python lists

m1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m1

array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

m1[:5]

array([1, 2, 3, 4, 5])

Question: What'll be the output of m1[-5:-1] ?

m1[-5:-1]

array([6, 7, 8, 9])

Question: What'll be the output for m1[-5:-1:-1] ?

m1[-5: -1: -1]

array([], dtype=int64)

Fancy Indexing (Masking)


Numpy arrays can be indexed with boolean arrays (masks).
This method is called fancy indexing or masking.

What would happen if we do this?

m1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m1 < 6

array([ True, True, True, True, True, False, False, False, False,
False])

Comparison operation also happens on each element.

All the values less than 6 return True


All the values from 6 onwards return False

Question: What will be the output of the following?

m1[[True, True, True, True, True, False, False, False, False, False]]

array([1, 2, 3, 4, 5])

Notice that we are passing a list of boolean values, one per element.

For every instance of True , it keeps the element at the corresponding position.


Conversely, for every False , it skips the element at the corresponding position.

So, this becomes a filter of sorts.

Now, let's use this to filter or mask values from our array.

Condition will be passed instead of indices and slice ranges.

m1[m1 < 6]

array([1, 2, 3, 4, 5])

This is known as Fancy Indexing in Numpy.

Question: How can we filter/mask even values from our array?

m1[m1%2 == 0]

array([ 2, 4, 6, 8, 10])

Use Case: NPS (Net Promoter Score)


Imagine you are a Data Analyst @ Airbnb

You've been asked to analyze user survey data and report NPS to the management.

But, what exactly is NPS?


Have you all seen that every month, you get a survey form from Scaler?

This form asks you to fill in feedback regarding how you are liking the services of Scaler in terms of a numerical score.
This is known as the Likelihood to Recommend Survey.
It is widely used by different companies and service providers to evaluate their performance and customer satisfaction.

Responses are given on a scale ranging from 0–10,

with 0 labeled with “Not at all likely,” and


10 labeled with “Extremely likely.”

Based on this, we calculate the Net Promoter Score.

How to calculate NPS score?

We label our responses into 3 categories:

Detractors: Respondents with a score of 0-6


Passive: Respondents with a score of 7-8
Promoters: Respondents with a score of 9-10.

Net Promoter score = % Promoters - % Detractors.
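A tiny worked example with made-up numbers: suppose there are 200 responses, of which 120 are promoters and 40 are detractors.

promoters, detractors, total = 120, 40, 200   # hypothetical counts
nps = (promoters / total) * 100 - (detractors / total) * 100
print(nps)   # 60.0 - 20.0 = 40.0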

Range of NPS

If all people are promoters (rated 9-10), we get 100 NPS


Conversely, if all people are detractors (rated 0-6), we get −100 NPS
Also, if all people are neutral (rated 7-8), we get a 0 NPS

Therefore, the range of NPS lies between [−100, 100]

Generally, each company targets to get at least a threshold NPS.

For Scaler, this is a score of 70.


This means that if NPS > 70, the company is performing really well.
Naturally, this varies from business to business.

How is NPS helpful?

Why would we want to analyse the survey data for NPS?


NPS helps a brand in gauging its brand value and sentiment in the market.

Promoters are highly likely to recommend your product or service, hence bringing in more business.
Whereas, Detractors are likely to recommend against using your product or service, hence bringing the business down.

These insights can help a business make customer-oriented decisions and improve its product.

Two-thirds of Fortune 500 companies use NPS.

Even at Scaler, every month, we randomly reach out to our learners over a call, and try to understand,

How is the overall experience for them?


What are some things that they like?
What do they not like?

Based on the feedback received, sometimes we end up getting really good insights, and tackle them.

This will help improve the next month's NPS.


NPS Problem
Let's first look at the data we have gathered.

Dataset: https://drive.google.com/file/d/1c0ClC8SrPwJq5rrkyMKyPn80nyHcFikK/view?usp=sharing

Downloading the dataset -

!pip install --upgrade gdown


!gdown 1c0ClC8SrPwJq5rrkyMKyPn80nyHcFikK

Requirement already satisfied: gdown in /usr/local/lib/python3.10/dist-packages (4.7.3)
Collecting gdown
  Downloading gdown-5.1.0-py3-none-any.whl (17 kB)
Requirement already satisfied: beautifulsoup4, filelock, requests[socks], tqdm and other dependencies
Installing collected packages: gdown
  Attempting uninstall: gdown
    Found existing installation: gdown 4.7.3
    Uninstalling gdown-4.7.3:
      Successfully uninstalled gdown-4.7.3
Successfully installed gdown-5.1.0
Downloading...
From: https://drive.google.com/uc?id=1c0ClC8SrPwJq5rrkyMKyPn80nyHcFikK
To: /content/survey.txt
100% 2.55k/2.55k [00:00<00:00, 9.56MB/s]

Loading the data -

For this we will use the .loadtxt() function


We provide file name along with the dtype of data that we want to load.
Documentation: https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html

score = np.loadtxt('survey.txt', dtype ='int')

Let's check the type of this data variable score -

type(score)

numpy.ndarray

Let's see what the data looks like -

score[:5]

array([ 7, 10, 5, 9, 9])


Let's check the number of responses -

score.shape

(1167,)

There are a total of 1167 responses for the LTR survey.

Now, let's calculate NPS using these responses.

NPS = % Promoters - % Detractors

In order to calculate NPS, we need to calculate two things:

% Promoters
% Detractors

In order to calculate % Promoters and % Detractors , we need to get the count of promoters as well as detractors.

Question: How can we get the count of Promoter/Detractor ?

We can do so by using fancy indexing (masking).

Let's get the count of promoter and detractors -

Detractors have a score <= 6

detractors = score[score <= 6]

# Number of detractors -

num_detractors = len(detractors)
num_detractors

332

Promoters have a score >= 9

promoters = score[score >= 9]

# Number of promoters -

num_promoters = len(promoters)
num_promoters

609

total = len(score)
total

1167

# % of detractors -

percentage_detractors = (num_detractors/total) * 100


percentage_detractors

28.449014567266495

# % of promoters -

percentage_promoters = (num_promoters/total) * 100


percentage_promoters

52.185089974293064

nps = percentage_promoters - percentage_detractors


nps

23.73607540702657
# Rounding off up to 2 decimal places -

np.round(nps, 2)

23.74

Numpy 2

Content
Working with 2D arrays (Matrices)

Transpose
Indexing
Slicing
Fancy Indexing (Masking)
Aggregate Functions
Logical Operations

np.any()
np.all()
np.where()

Use Case: Fitness data analysis

Working with 2D arrays (Matrices)


Let's create an array -

import numpy as np
a = np.array(range(16))
a

array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])

What will be its shape and dimensions?

a.shape

(16,)

a.ndim

How can we convert this array to a 2-dimensional array?

Using reshape()

For a 2D array, we will have to specify the following:

First argument is no. of rows


Second argument is no. of columns

Let's try converting it into an 8x2 array.

a.reshape(8, 2)

array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11],
[12, 13],
[14, 15]])

Let's try converting it into a 4x4 array.


a.reshape(4, 4)

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])

a.reshape(4, 5)

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-05ad01dfd0f5> in <cell line: 1>()
----> 1 a.reshape(4, 5)

ValueError: cannot reshape array of size 16 into shape (4,5)


This will give an Error. Why?

We have 16 elements in a , but reshape(4, 5) is trying to fill in 4x5 = 20 elements.


Therefore, whatever shape we're trying to reshape to must be able to accommodate exactly the number of elements that we have.

a.reshape(8, -1)

array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11],
[12, 13],
[14, 15]])

Notice that Numpy automatically figured out what the -1 argument should be, given that the first argument is 8 .

We can also put -1 as the first argument. As long as one argument is given, it will calculate the other one.

What if we pass both args as -1 ?

a.reshape(-1, -1)

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-decf4fe03d74> in <cell line: 1>()
----> 1 a.reshape(-1, -1)

ValueError: can only specify one unknown dimension


You need to give at least one dimension.

Let's save a as a 8 x 2 array (matrix) for now.

a = a.reshape(8, 2)

What will be the length of a ?

It will be 8, since it contains 8 rows (arrays) as its elements.


Each of these rows has 2 elements, but that's a different thing.

Explanation: len(ndarray) gives you the size of the first dimension

len(a)

len(a[0])

2
Transpose
Let's create a 2D numpy array.

a = np.arange(12).reshape(3,4)
a

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

a.shape

(3, 4)

There is another operation on a multi-dimensional array, known as Transpose.

It basically means that the rows and columns are interchanged.

a.T

array([[ 0, 4, 8],
[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11]])

Let's verify the shape of this transpose array -

a.T.shape

(4, 3)

Indexing in 2D arrays


Similar to Python lists
a

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

Can we extract just the element 6 from a ?

# Accessing 2nd row and 3rd col -


a[1, 2]

This can also be written as

a[1][2]

m1 = np.arange(1,10).reshape((3,3))
m1

array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

What will be the output of this?

m1[1, 1] # m1[row,column]

We saw how we can use a list of indices with a numpy array.

m1 = np.array([100,200,300,400,500,600])

Will this work now?

m1[2, 3]

---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-23-963ce94bbe14> in <cell line: 1>()
----> 1 m1[2, 3]

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed


Note:

Since m1 is a 1D array, this will not work.


This is because there is no separate row and column dimension here.

Therefore, you cannot use the same syntax for 1D arrays, as you did with 2D arrays, and vice-versa.

However with a little tweak in this code, we can access elements of m1 at different positions/indices.

m1[[2, 3]]

array([300, 400])

How will you print the diagonal elements of the following 2D array?
m1 = np.arange(9).reshape((3,3))
m1

array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])

m1[[0,1,2],[0,1,2]] # picking up element (0,0), (1,1) and (2,2)

array([0, 4, 8])

When a list of indices is provided for both rows and cols, for example: m1[[0,1,2],[0,1,2]]

It selects individual elements, i.e. m1[0][0], m1[1][1] and m1[2][2] .

Slicing in 2D arrays


We need to provide two slice ranges, one for row and one for column.
We can also mix Indexing and Slicing

m1 = np.arange(12).reshape(3,4)
m1

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

m1[:2] # gives first two rows

array([[0, 1, 2, 3],
[4, 5, 6, 7]])

How can we get columns from a 2D array?

m1[:, :2] # gives first two columns

array([[0, 1],
[4, 5],
[8, 9]])

m1[:, 1:3] # gives 2nd and 3rd col

array([[ 1, 2],
[ 5, 6],
[ 9, 10]])

Fancy Indexing (Masking) in 2D arrays


We did this for one dimensional arrays. Let's see if those concepts translate to 2D also.

Suppose we have the matrix m1 -

m1 = np.arange(12).reshape(3, 4)
m1

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

What will be the output of the following?

m1 < 6

array([[ True, True, True, True],


[ True, True, False, False],
[False, False, False, False]])

A matrix having boolean values True and False is returned.


We can use this boolean matrix to filter our array.

Condition(s) will be passed instead of indices and slice ranges.

m1[m1 < 6]

array([0, 1, 2, 3, 4, 5])

Values corresponding to True are retained


Values corresponding to False are filtered out

Aggregate Functions


Numpy provides various universal functions that cover a wide variety of operations and perform fast element-wise array operations.

How would we calculate the sum of elements of an array?

np.sum()

It sums all the values in a numpy array.

a = np.arange(1, 11)
a

array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

np.sum(a)

55

What if we want to find the average value or median value of all the elements in an array?

np.mean()

It gives us the mean of all values in a numpy array.

np.mean(a)

5.5
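The heading above also mentions the median; np.median works the same way (a quick sketch):

np.median(a)   # 5.5 for a = np.arange(1, 11), i.e. the average of 5 and 6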

Now, we want to find the minimum / maximum value in the array.

np.min() / np.max()

np.min(a)

np.max(a)

10

Let's apply aggregate functions on 2D array.

a = np.arange(12).reshape(3, 4)
a

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
np.sum(a) # sums all the values present in the array

66

What if we want to aggregate the elements row-wise or column-wise?


By setting the axis parameter

What will np.sum(a, axis=0) do?

np.sum(a, axis=0) adds together values in different rows


axis = 0 → Changes will happen along the vertical axis
Summation of values happen in the vertical direction.
Rows collapse/merge when we do axis=0 .

np.sum(a, axis=0)

array([12, 15, 18, 21])

What if we specify axis=1 ?

np.sum(a, axis=1) adds together values in different columns


axis = 1 → Changes will happen along the horizontal axis
Summation of values happen in the horizontal direction.
Columns collapse/merge when we do axis=1 .

np.sum(a, axis=1)

array([ 6, 22, 38])

Logical Operations


What if we want to check whether "any" element of the array follows a specific condition?

np.any()

returns True if at least one element of the array satisfies the provided condition.

Imagine you have a shopping list with items you need to buy, but you're not sure if you have enough money to buy everything.

You want to check if there's at least one item on your list that you can afford.

In this case, you can use np.any :

import numpy as np

# Prices of items on your shopping list


prices = np.array([50, 45, 25, 20, 35])

# Your budget
budget = 30

# Check if there's at least one item you can afford


can_afford = np.any(prices <= budget)

if can_afford:
    print("You can buy at least one item on your list!")
else:
    print("Sorry, nothing on your list fits your budget.")

You can buy at least one item on your list!


What if we want to check whether "all" the elements in our array follow a specific condition?

np.all()

returns True if all the elements of the array satisfy the provided condition.

Let's consider a scenario where you have a list of chores, and you want to make sure all the chores are done before you can play video games.

You can use np.all to check if all the chores are completed.

import numpy as np

# Chores status: 1 for done, 0 for not done


chores = np.array([1, 1, 1, 1, 0])

# Check if all chores are done


all_chores_done = np.all(chores == 1)

if all_chores_done:
    print("Great job! You've completed all your chores. Time to play!")
else:
    print("Finish all your chores before you can play.")

Finish all your chores before you can play.

Multiple conditions for .all() function -

a = np.array([1, 2, 3, 2])
b = np.array([2, 2, 3, 2])
c = np.array([6, 4, 4, 5])

((a <= b) & (b <= c)).all()

True

What if we want to update an array based on a condition?

Suppose you are given an array of integers and you want to update it based on following condition :

if element is > 0, change it to +1


if element < 0, change it to -1.

How will you do it?

arr = np.array([-3,4,27,34,-2, 0, -45,-11,4, 0 ])


arr

array([ -3, 4, 27, 34, -2, 0, -45, -11, 4, 0])

You can use masking to update the array.

arr[arr > 0] = 1
arr [arr < 0] = -1

arr

array([-1, 1, 1, 1, -1, 0, -1, -1, 1, 0])

There's also a numpy function which can help us with it.

np.where()

Syntax: np.where(condition, [x, y])


returns an ndarray whose elements are chosen from x or y depending on condition.

Suppose you have a list of product prices, and you want to apply a 10% discount to all products with prices above $50.
You can use np.where to adjust the prices.

import numpy as np

# Product prices
prices = np.array([45, 55, 60, 75, 40, 90])

# Apply a 10% discount to prices above $50


discounted_prices = np.where(prices > 50, prices * 0.9, prices)

print("Original prices:", prices)


print("Discounted prices:", discounted_prices)

Original prices: [45 55 60 75 40 90]


Discounted prices: [45. 49.5 54. 67.5 40. 81. ]

Notice that it didn't change the original array.

Use Case: Fitness data analysis


Imagine you are a Data Scientist at Fitbit

You've been given user data to analyse and find some insights which can be shown on the smart watch.

But why would we want to analyse the user data for designing the watch?

These insights from the user data can help the business make customer-oriented decisions for the product design.

Let's first look at the data we have gathered.

Notice that our data is structured in a tabular format.

Each column is known as a feature.


Each row is known as a record.

Basic EDA


Performing Exploratory Data Analysis (EDA) is like being a detective for numbers and information.
Imagine you have a big box of colorful candies. EDA is like looking at all the candies, counting how many of each color there are, and maybe
even making a pretty picture to show which colors you have the most of. This way, you can learn a lot about your candies without eating them
all at once!

So, EDA is about looking at your things, which is data in this case, to understand them better and find out interesting stuff about them.

Formally defining, Exploratory Data Analysis (EDA) is a process of examining, summarizing, and visualizing data sets to understand their main
characteristics, uncover patterns that helps analysts and data scientists gain insights into the data, make informed decisions, and guide further
analysis or modeling.

First, we will import numpy.

import numpy as np

Let's load the data that we saw earlier.

For this, we will use the .loadtxt() function.

!gdown https://drive.google.com/uc?id=1vk1Pu0djiYcrdc85yUXZ_Rqq2oZNcohd

Downloading...
From: https://drive.google.com/uc?id=1vk1Pu0djiYcrdc85yUXZ_Rqq2oZNcohd
To: /content/fit.txt
100% 3.43k/3.43k [00:00<00:00, 11.3MB/s]

data = np.loadtxt('/content/fit.txt', dtype='str')


data
array([...,
       ['14-12-2017', '7422', 'Happy', '243', '5', 'Active'],
['15-12-2017', '437', 'Neutral', '14', '3', 'Active'],
['16-12-2017', '1231', 'Neutral', '39', '4', 'Active'],
['17-12-2017', '1696', 'Sad', '55', '4', 'Inactive'],
['18-12-2017', '4921', 'Neutral', '158', '5', 'Active'],
['19-12-2017', '221', 'Sad', '7', '5', 'Active'],
['20-12-2017', '6500', 'Neutral', '213', '5', 'Active'],
['21-12-2017', '3575', 'Neutral', '116', '5', 'Active'],
['22-12-2017', '4061', 'Sad', '129', '5', 'Inactive'],
['23-12-2017', '651', 'Sad', '21', '5', 'Inactive'],
['24-12-2017', '753', 'Sad', '28', '4', 'Inactive'],
['25-12-2017', '518', 'Sad', '16', '3', 'Inactive'],
['26-12-2017', '5537', 'Happy', '180', '4', 'Active'],
['27-12-2017', '4108', 'Neutral', '138', '5', 'Active'],
['28-12-2017', '5376', 'Happy', '176', '5', 'Active'],
['29-12-2017', '3066', 'Neutral', '99', '4', 'Active'],
['30-12-2017', '177', 'Sad', '5', '5', 'Inactive'],
['31-12-2017', '36', 'Sad', '1', '3', 'Inactive'],
['01-01-2018', '299', 'Sad', '10', '3', 'Inactive'],
['02-01-2018', '1447', 'Neutral', '47', '3', 'Inactive'],
['03-01-2018', '2599', 'Neutral', '84', '2', 'Inactive'],
['04-01-2018', '702', 'Sad', '23', '3', 'Inactive'],
['05-01-2018', '133', 'Sad', '4', '2', 'Inactive'],
['06-01-2018', '153', 'Happy', '0', '8', 'Inactive'],
['07-01-2018', '500', 'Neutral', '0', '5', 'Active'],
['08-01-2018', '2127', 'Neutral', '0', '5', 'Inactive'],
['09-01-2018', '2203', 'Happy', '0', '5', 'Active']], dtype='<U10')

We provide the file name along with the dtype of data that we want to load in.

What's the shape of this data?

data.shape

(96, 6)

What's the dimensionality?

data.ndim

We can see that this is a 2-dimensional array.

There are 96 records and each record has 6 features.

These features are:

Date
Step Count
Mood
Calories Burned
Hours of Sleep
Activity Status

Notice that the above array is homogeneous, containing all the data as strings.

In order to work with strings, categorical data and numerical data, we'll have to save every feature separately.

How will we extract the features into separate variables?

For that, we first need some idea on how data is saved.

Let's see what the first element of data is.

data[0]

array(['06-10-2017', '5464', 'Neutral', '181', '5', 'Inactive'],


dtype='<U10')

Hmm.. this extracts a row, not a column.

Similarly, we can extract other specific rows.

data[1]
array(['07-10-2017', '6041', 'Sad', '197', '8', 'Inactive'], dtype='<U10')

We can also use slicing.

data[:5]

array([['06-10-2017', '5464', 'Neutral', '181', '5', 'Inactive'],


['07-10-2017', '6041', 'Sad', '197', '8', 'Inactive'],
['08-10-2017', '25', 'Sad', '0', '5', 'Inactive'],
['09-10-2017', '5461', 'Sad', '174', '4', 'Inactive'],
['10-10-2017', '6915', 'Neutral', '223', '5', 'Active']],
dtype='<U10')

Now, we want to place all the dates into a single entity.

How to do that?

One way is to just go ahead and fetch the column number 0 from all rows.
Another way is to take a transpose of data .

Let's see them both -

Approach 1

data[:, 0]

array(['06-10-2017', '07-10-2017', '08-10-2017', '09-10-2017',


'10-10-2017', '11-10-2017', '12-10-2017', '13-10-2017',
'14-10-2017', '15-10-2017', '16-10-2017', '17-10-2017',
'18-10-2017', '19-10-2017', '20-10-2017', '21-10-2017',
'22-10-2017', '23-10-2017', '24-10-2017', '25-10-2017',
'26-10-2017', '27-10-2017', '28-10-2017', '29-10-2017',
'30-10-2017', '31-10-2017', '01-11-2017', '02-11-2017',
'03-11-2017', '04-11-2017', '05-11-2017', '06-11-2017',
'07-11-2017', '08-11-2017', '09-11-2017', '10-11-2017',
'11-11-2017', '12-11-2017', '13-11-2017', '14-11-2017',
'15-11-2017', '16-11-2017', '17-11-2017', '18-11-2017',
'19-11-2017', '20-11-2017', '21-11-2017', '22-11-2017',
'23-11-2017', '24-11-2017', '25-11-2017', '26-11-2017',
'27-11-2017', '28-11-2017', '29-11-2017', '30-11-2017',
'01-12-2017', '02-12-2017', '03-12-2017', '04-12-2017',
'05-12-2017', '06-12-2017', '07-12-2017', '08-12-2017',
'09-12-2017', '10-12-2017', '11-12-2017', '12-12-2017',
'13-12-2017', '14-12-2017', '15-12-2017', '16-12-2017',
'17-12-2017', '18-12-2017', '19-12-2017', '20-12-2017',
'21-12-2017', '22-12-2017', '23-12-2017', '24-12-2017',
'25-12-2017', '26-12-2017', '27-12-2017', '28-12-2017',
'29-12-2017', '30-12-2017', '31-12-2017', '01-01-2018',
'02-01-2018', '03-01-2018', '04-01-2018', '05-01-2018',
'06-01-2018', '07-01-2018', '08-01-2018', '09-01-2018'],
dtype='<U10')

This gives all the dates.

Approach 2

data_t = data.T

Don't you think all the dates will now be present in the first element (index 0) of data_t ?

data_t[0]

array(['06-10-2017', '07-10-2017', '08-10-2017', '09-10-2017',


'10-10-2017', '11-10-2017', '12-10-2017', '13-10-2017',
'14-10-2017', '15-10-2017', '16-10-2017', '17-10-2017',
'18-10-2017', '19-10-2017', '20-10-2017', '21-10-2017',
'22-10-2017', '23-10-2017', '24-10-2017', '25-10-2017',
'26-10-2017', '27-10-2017', '28-10-2017', '29-10-2017',
'30-10-2017', '31-10-2017', '01-11-2017', '02-11-2017',
'03-11-2017', '04-11-2017', '05-11-2017', '06-11-2017',
'07-11-2017', '08-11-2017', '09-11-2017', '10-11-2017',
'11-11-2017', '12-11-2017', '13-11-2017', '14-11-2017',
'15-11-2017', '16-11-2017', '17-11-2017', '18-11-2017',
'19-11-2017', '20-11-2017', '21-11-2017', '22-11-2017',
'23-11-2017', '24-11-2017', '25-11-2017', '26-11-2017',
'27-11-2017', '28-11-2017', '29-11-2017', '30-11-2017',
'01-12-2017', '02-12-2017', '03-12-2017', '04-12-2017',
'05-12-2017', '06-12-2017', '07-12-2017', '08-12-2017',
'09-12-2017', '10-12-2017', '11-12-2017', '12-12-2017',
'13-12-2017', '14-12-2017', '15-12-2017', '16-12-2017',
'17-12-2017', '18-12-2017', '19-12-2017', '20-12-2017',
'21-12-2017', '22-12-2017', '23-12-2017', '24-12-2017',
'25-12-2017', '26-12-2017', '27-12-2017', '28-12-2017',
'29-12-2017', '30-12-2017', '31-12-2017', '01-01-2018',
'02-01-2018', '03-01-2018', '04-01-2018', '05-01-2018',
'06-01-2018', '07-01-2018', '08-01-2018', '09-01-2018'],
dtype='<U10')

Also, what will be the shape of data_t ?

data_t.shape

(6, 96)

Let's extract all the columns and save them in separate variables.

date, step_count, mood, calories_burned, hours_of_sleep, activity_status = data.T

step_count

array(['5464', '6041', '25', '5461', '6915', '4545', '4340', '1230', '61',


'1258', '3148', '4687', '4732', '3519', '1580', '2822', '181',
'3158', '4383', '3881', '4037', '202', '292', '330', '2209',
'4550', '4435', '4779', '1831', '2255', '539', '5464', '6041',
'4068', '4683', '4033', '6314', '614', '3149', '4005', '4880',
'4136', '705', '570', '269', '4275', '5999', '4421', '6930',
'5195', '546', '493', '995', '1163', '6676', '3608', '774', '1421',
'4064', '2725', '5934', '1867', '3721', '2374', '2909', '1648',
'799', '7102', '3941', '7422', '437', '1231', '1696', '4921',
'221', '6500', '3575', '4061', '651', '753', '518', '5537', '4108',
'5376', '3066', '177', '36', '299', '1447', '2599', '702', '133',
'153', '500', '2127', '2203'], dtype='<U10')

step_count.dtype

dtype('<U10')

Notice the data type of step_count and other variables.

It's a string type where U means Unicode string and 10 means a maximum length of 10 characters.

Why? Because Numpy type-casted all the data to strings.

Let's convert the data types of these variables.

Step Count

step_count = np.array(step_count, dtype='int')


step_count.dtype

dtype('int64')

step_count

array([5464, 6041, 25, 5461, 6915, 4545, 4340, 1230, 61, 1258, 3148,
4687, 4732, 3519, 1580, 2822, 181, 3158, 4383, 3881, 4037, 202,
292, 330, 2209, 4550, 4435, 4779, 1831, 2255, 539, 5464, 6041,
4068, 4683, 4033, 6314, 614, 3149, 4005, 4880, 4136, 705, 570,
269, 4275, 5999, 4421, 6930, 5195, 546, 493, 995, 1163, 6676,
3608, 774, 1421, 4064, 2725, 5934, 1867, 3721, 2374, 2909, 1648,
799, 7102, 3941, 7422, 437, 1231, 1696, 4921, 221, 6500, 3575,
4061, 651, 753, 518, 5537, 4108, 5376, 3066, 177, 36, 299,
1447, 2599, 702, 133, 153, 500, 2127, 2203])

What will be shape of this array?

step_count.shape

(96,)

We saw in the last class that since it is a 1D array, its shape will be (96,) .
If it were a 2D array, its shape would've been (96, 1) .
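For instance (a quick sketch), reshaping gives the 2-D column form:

step_count_col = step_count.reshape(-1, 1)   # 96 rows, 1 column
step_count_col.shape                         # (96, 1)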

Calories Burned

calories_burned = np.array(calories_burned, dtype='int')


calories_burned.dtype

dtype('int64')

Hours of Sleep

hours_of_sleep = np.array(hours_of_sleep, dtype='int')


hours_of_sleep.dtype

dtype('int64')

Mood

Mood is a categorical variable. As the name suggests, a categorical data type has two or more categories in it.

Let's check the values of mood variable -

mood

array(['Neutral', 'Sad', 'Sad', 'Sad', 'Neutral', 'Sad', 'Sad', 'Sad',


'Sad', 'Sad', 'Sad', 'Sad', 'Happy', 'Sad', 'Sad', 'Sad', 'Sad',
'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Neutral', 'Neutral',
'Happy', 'Neutral', 'Happy', 'Happy', 'Happy', 'Happy', 'Happy',
'Happy', 'Happy', 'Neutral', 'Happy', 'Happy', 'Happy', 'Happy',
'Happy', 'Happy', 'Happy', 'Happy', 'Happy', 'Happy', 'Neutral',
'Happy', 'Happy', 'Happy', 'Happy', 'Happy', 'Happy', 'Happy',
'Happy', 'Happy', 'Neutral', 'Sad', 'Happy', 'Happy', 'Happy',
'Happy', 'Happy', 'Happy', 'Happy', 'Sad', 'Neutral', 'Neutral',
'Sad', 'Sad', 'Neutral', 'Neutral', 'Happy', 'Neutral', 'Neutral',
'Sad', 'Neutral', 'Sad', 'Neutral', 'Neutral', 'Sad', 'Sad', 'Sad',
'Sad', 'Happy', 'Neutral', 'Happy', 'Neutral', 'Sad', 'Sad', 'Sad',
'Neutral', 'Neutral', 'Sad', 'Sad', 'Happy', 'Neutral', 'Neutral',
'Happy'], dtype='<U10')

np.unique(mood)

array(['Happy', 'Neutral', 'Sad'], dtype='<U10')

Activity Status

activity_status

array(['Inactive', 'Inactive', 'Inactive', 'Inactive', 'Active',


'Inactive', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Inactive', 'Inactive', 'Active', 'Inactive', 'Inactive',
'Inactive', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Inactive', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Active', 'Inactive', 'Inactive', 'Inactive', 'Inactive', 'Active',
'Inactive', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Active', 'Active', 'Active', 'Active', 'Active', 'Active',
'Active', 'Active', 'Active', 'Inactive', 'Inactive', 'Inactive',
'Inactive', 'Inactive', 'Inactive', 'Active', 'Active', 'Active',
'Active', 'Active', 'Active', 'Active', 'Active', 'Active',
'Active', 'Active', 'Active', 'Inactive', 'Active', 'Active',
'Inactive', 'Active', 'Active', 'Active', 'Active', 'Active',
'Inactive', 'Active', 'Active', 'Active', 'Active', 'Inactive',
'Inactive', 'Inactive', 'Inactive', 'Active', 'Active', 'Active',
'Active', 'Inactive', 'Inactive', 'Inactive', 'Inactive',
'Inactive', 'Inactive', 'Inactive', 'Inactive', 'Active',
'Inactive', 'Active'], dtype='<U10')

Since we've extracted from the same source array, we know that

mood[0] and step_count[0]


There is a connection between them, as they belong to the same record.

Also, we know that their length will be the same, i.e. 96

Now let's look at something really interesting.

Can we extract the step counts, when the mood was Happy?
step_count_happy = step_count[mood == 'Happy']

len(step_count_happy)

40

Let's also find for when the mood was Sad.

step_count_sad = step_count[mood == 'Sad']


step_count_sad

array([6041, 25, 5461, 4545, 4340, 1230, 61, 1258, 3148, 4687, 3519,
1580, 2822, 181, 6676, 3721, 1648, 799, 1696, 221, 4061, 651,
753, 518, 177, 36, 299, 702, 133])

len(step_count_sad)

29

Let's do the same for when the mood was Neutral.

step_count_neutral = step_count[mood == 'Neutral']


step_count_neutral

array([5464, 6915, 3158, 4383, 3881, 4037, 202, 292, 2209, 6041, 570,
1163, 2374, 2909, 7102, 3941, 437, 1231, 4921, 6500, 3575, 4108,
3066, 1447, 2599, 500, 2127])

len(step_count_neutral)

27

How can we collect data for when the mood was either happy or neutral?

step_count_happy_or_neutral = step_count[(mood == 'Neutral') | (mood == 'Happy')]


step_count_happy_or_neutral

array([5464, 6915, 4732, 3158, 4383, 3881, 4037, 202, 292, 330, 2209,
4550, 4435, 4779, 1831, 2255, 539, 5464, 6041, 4068, 4683, 4033,
6314, 614, 3149, 4005, 4880, 4136, 705, 570, 269, 4275, 5999,
4421, 6930, 5195, 546, 493, 995, 1163, 3608, 774, 1421, 4064,
2725, 5934, 1867, 2374, 2909, 7102, 3941, 7422, 437, 1231, 4921,
6500, 3575, 5537, 4108, 5376, 3066, 1447, 2599, 153, 500, 2127,
2203])

len(step_count_happy_or_neutral)

67

Let's try to compare step counts on bad mood days and good mood days.

# Average step count on Sad mood days -

np.mean(step_count_sad)

2103.0689655172414

# Average step count on Happy days -

np.mean(step_count_happy)

3392.725

# Average step count on Neutral days -

np.mean(step_count_neutral)

3153.777777777778

As you can see, this data tells us a lot about user behaviour.

This way we can analyze data and learn.


This is just the second class on numpy, we will learn many more concepts related to this, and pandas also.

Let's try to check the mood when step count was greater/lesser.

# mood when step count > 4000

np.unique(mood[step_count > 4000], return_counts = True)

(array(['Happy', 'Neutral', 'Sad'], dtype='<U10'), array([22, 9, 7]))

Out of 38 days when the step count was more than 4000, the user was feeling happy on 22 days.

# mood when step count <= 2000

np.unique(mood[step_count <= 2000], return_counts = True)

(array(['Happy', 'Neutral', 'Sad'], dtype='<U10'), array([13, 8, 18]))

Out of 39 days when the step count was 2000 or less, the user was feeling sad on 18 days.

This suggests that there may be a correlation between the Mood and Step Count .
Numpy 3

Content
Sorting
Matrix Multiplication

np.dot
@ operator
np.matmul

Vectorization
Broadcasting

Sorting
np.sort returns a sorted copy of an array.

import numpy as np

a = np.array([4, 7, 0, 3, 8, 2, 5, 1, 6, 9])
a

array([4, 7, 0, 3, 8, 2, 5, 1, 6, 9])

b = np.sort(a)
b

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

a # no change is reflected in the original array

array([4, 7, 0, 3, 8, 2, 5, 1, 6, 9])

We can directly call the sort method on an array, but it will change the original array, as it is an in-place operation.

a.sort() # sorting is performed inplace


a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Sorting in 2D array


a = np.array([[1,5,3], [2,5,7], [400, 200, 300]])
a

array([[ 1, 5, 3],
[ 2, 5, 7],
[400, 200, 300]])

np.sort(a, axis=0) # sorting every column

array([[ 1, 5, 3],
[ 2, 5, 7],
[400, 200, 300]])

np.sort(a, axis=1) # sorting every row

array([[ 1, 3, 5],
[ 2, 5, 7],
[200, 300, 400]])
Note: By default, the np.sort() function sorts along the last axis.

a = np.array([[23,4,43], [12, 89, 3], [69, 420, 0]])

np.sort(a) # default axis = -1 (last axis)

array([[ 4, 23, 43],


[ 3, 12, 89],
[ 0, 69, 420]])

Element-Wise Multiplication


Element-wise multiplication in NumPy involves multiplying corresponding elements of two arrays with the same shape to produce a new array
where each element is the product of the corresponding elements from the input arrays.

a = np.arange(1, 6)
a

array([1, 2, 3, 4, 5])

a * 5

array([ 5, 10, 15, 20, 25])

b = np.arange(6, 11)
b

array([ 6, 7, 8, 9, 10])

a * b

array([ 6, 14, 24, 36, 50])

Both arrays should have the same shape.

c = np.array([1, 2, 3])

a * c

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-3f6f667472ca> in <cell line: 1>()
----> 1 a * c

ValueError: operands could not be broadcast together with shapes (5,) (3,)


d = np.arange(12).reshape(3, 4)
e = np.arange(13, 25).reshape(3, 4)

print(d)
print(e)

[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[13 14 15 16]
[17 18 19 20]
[21 22 23 24]]

d * e

array([[ 0, 14, 30, 48],


[ 68, 90, 114, 140],
[168, 198, 230, 264]])

d * 5
array([[ 0, 5, 10, 15],
[20, 25, 30, 35],
[40, 45, 50, 55]])

Takeaway:

Array * Number -> WORKS


Array * Array (same shape) -> WORKS
Array * Array (different shape) -> DOES NOT WORK

Matrix Multiplication


Rule: Number of columns of the first matrix should be equal to number of rows of the second matrix.

(A,B) * (B,C) -> (A,C)


(3,4) * (4,3) -> (3,3)

Visual Demo: https://www.geogebra.org/m/ETHXK756

a = np.arange(1,13).reshape((3,4))
c = np.arange(2,14).reshape((4,3))

a.shape, c.shape

((3, 4), (4, 3))

a is of shape (3,4) and c is of shape (4,3). The output will be of shape (3,3).

# Using np.dot
np.dot(a,c)

array([[ 80, 90, 100],


[184, 210, 236],
[288, 330, 372]])

# Using np.matmul
np.matmul(a,c)

array([[ 80, 90, 100],


[184, 210, 236],
[288, 330, 372]])

# Using @ operator
a@c

array([[ 80, 90, 100],


[184, 210, 236],
[288, 330, 372]])

a@5

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-27-16572c98568d> in <cell line: 1>()
----> 1 a@5

ValueError: matmul: Input operand 1 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?


np.matmul(a, 5)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-28-875bf147741b> in <cell line: 1>()
----> 1 np.matmul(a, 5)

ValueError: matmul: Input operand 1 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?


np.dot(a, 5)

array([[ 5, 10, 15, 20],


[25, 30, 35, 40],
[45, 50, 55, 60]])

Important:

The dot() function supports multiplication of an array by a scalar value, which is not possible with matmul() .
Vector * Vector will work for matmul() , but Vector * Scalar won't.
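A small sketch of the difference (the values in the comments are what these calls return):

v = np.array([1, 2, 3])
w = np.array([4, 5, 6])

np.dot(v, w)      # 32 -- vector . vector works for both dot and matmul
np.matmul(v, w)   # 32
np.dot(v, 2)      # array([2, 4, 6]) -- multiplying by a scalar works only with dot
# np.matmul(v, 2) raises a ValueError, as seen above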

Vectorization
Vectorization in NumPy refers to performing operations on entire arrays or array elements simultaneously, which is significantly faster and more
efficient than using explicit loops.

a = np.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Note:

1d np array --> vector


2d np array --> matrix
3d onwards --> tensors

def random_operation(x):
    if x % 2 == 0:
        x += 2
    else:
        x -= 2

    return x

random_operation(a)

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-32-83503709589d> in <cell line: 1>()
----> 1 random_operation(a)

<ipython-input-31-1b21f73a20a9> in random_operation(x)
1 def random_operation(x):
----> 2 if x % 2 == 0:
3 x += 2
4 else:
5 x -= 2

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()


cool_operation = np.vectorize(random_operation)

type(cool_operation)
numpy.vectorize

np.vectorize()

It is a generalised function for vectorization.


It takes the function and returns an object (which acts like function but can take an array as input and perform the operations).

cool_operation(a)

Broadcasting
Broadcasting in NumPy is the automatic and implicit extension of array dimensions to enable element-wise operations between arrays with
different shapes.

Case 1: If the dimensions of both arrays are equal, element-wise addition is done.

a = np.tile(np.arange(0,40,10), (3,1))
a

array([[ 0, 10, 20, 30],


[ 0, 10, 20, 30],
[ 0, 10, 20, 30]])

Note:

numpy.tile(array, reps) constructs an array by repeating array the number of times given by reps along each dimension.
np.tile(array, (repetition_rows, repetition_cols))

a = a.T
a

array([[ 0, 0, 0],
[10, 10, 10],
[20, 20, 20],
[30, 30, 30]])

b = np.tile(np.arange(0,3), (4,1))
b

array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])

print(a.shape, b.shape)

(4, 3) (4, 3)

Since a and b have the same shape, they can be added without any issues.

a+b

array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])

Case 2: If the right array is 1-D and its length equals the number of columns of the left array, it is automatically tiled
(broadcast) along the rows.

a

array([[ 0, 0, 0],
[10, 10, 10],
[20, 20, 20],
[30, 30, 30]])

c = np.array([0,1,2])
c

array([0, 1, 2])

print(a.shape, c.shape)

(4, 3) (3,)

a + c

array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])

c was broadcasted along rows (vertically)


so that a and c can be made compatible

Case 3: If the left array is a column matrix (having only 1 column) and the right array is a row matrix, then both are tiled
so that element-wise addition becomes possible.

d = np.array([0,10,20,30]).reshape(4,1)
d

array([[ 0],
[10],
[20],
[30]])
c = np.array([0,1,2])
c

array([0, 1, 2])

print(d.shape, c.shape)

(4, 1) (3,)

d + c

array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])

d was stacked (broadcasted) along columns (horizontally)


c was stacked (broadcasted) along rows (vertically)

Will broadcasting work in this case?

a = np.arange(8).reshape(2,4)
a

array([[0, 1, 2, 3],
[4, 5, 6, 7]])

b = np.arange(16).reshape(4,4)
b

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])

a+b

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-49-ca730b97bf8a> in <cell line: 1>()
----> 1 a+b

ValueError: operands could not be broadcast together with shapes (2,4) (4,4)


Broadcasting in 2D Arrays

A + A (same shape) -> Works


A + B (1D) -> Works, provided B's length matches A's number of columns
A + number -> Works
A + A (different shape but still 2D) -> DOES NOT WORK

Is broadcasting possible in this case?

A = np.arange(1,10).reshape(3,3)
A

array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

B = np.array([-1, 0, 1])
B

array([-1, 0, 1])

A*B
array([[-1, 0, 3],
[-4, 0, 6],
[-7, 0, 9]])

Yes! Broadcasting is possible for all the operations.

A = np.arange(12).reshape(3, 4)
A

array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

B = np.array([1, 2, 3])
B

array([1, 2, 3])

A + B

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-55-151064de832d> in <cell line: 1>()
----> 1 A + B

ValueError: operands could not be broadcast together with shapes (3,4) (3,)


Why did it throw an error?

Are the numbers of dimensions the same for both arrays? No.

Shape of A ⇒ (3,4)
Shape of B ⇒ (3,)

Numpy's broadcasting rules, referred to below, are:

Rule 1: If the arrays differ in their number of dimensions, the shape of the smaller array is padded with 1s on its left side.
Rule 2: Any dimension of size 1 is stretched to match the size of the other array in that dimension.
Rule 3: If, in any dimension, the sizes disagree and neither is 1, broadcasting fails.

So, Rule 1 will be invoked to pad 1 to the shape of B.

So, the shape of B becomes (1,3).

Now, we check whether the broadcasting conditions are met.

Starting from the rightmost side,

the rightmost dimensions are not equal (4 and 3), and neither of them is 1.

Hence, broadcasting is not possible as per Rule 3 .
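As a sketch of Rule 2 in action, reshaping B so that the mismatched dimension becomes 1 makes the addition work (assuming A and B as defined above):

# B.reshape(3, 1) has shape (3, 1): comparing (3, 4) with (3, 1),
# the trailing 1 is stretched to 4, so broadcasting succeeds
print(A + B.reshape(3, 1))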

Question: Given two arrays,

1. Array A of shape (8, 1, 6, 1)


2. Array B of shape (7, 1, 5)

Is broadcasting possible in this case? If yes, what will be the shape of output?

Answer: Broadcasting possible; Shape will be (8, 7, 6, 5)

Explanation:

As number of dimensions are not equal, Rule 1 is invoked.

The shape of B becomes (1, 7, 1, 5)

Next, it checks whether broadcasting is possible.

A ⇒ (8 , 1, 6, 1)
B ⇒ (1, 7, 1, 5)

Rightmost dimension: one of the two sizes is 1 (1 vs 5) ⇒ compatible.


Next, comparing 6 and 1, one of the two sizes is 1 ⇒ compatible.
Similarly, one of the two sizes is 1 in both leading dimensions (1 vs 7 and 8 vs 1).
Hence, broadcasting is possible.

Now, as per Rule 2 , every dimension with value 1 is stretched to match the corresponding dimension of the other array.

A's dimensions of size 1 are stretched to 7 and 5.


B's dimensions of size 1 are stretched to 8 and 6.

So, the output shape becomes : (8, 7, 6, 5) .
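One quick way to sanity-check such answers (assuming NumPy 1.20+, which provides np.broadcast_shapes) is to ask numpy directly:

print(np.broadcast_shapes((8, 1, 6, 1), (7, 1, 5)))   # (8, 7, 6, 5)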


Numpy 4

Content
Shallow vs Deep Copy

view()
copy()

Array Splitting

split()
hsplit()
vsplit()

Array Stacking

hstack()
vstack()
concatenate()

Post Read - Image Manipulation

Views vs Copies (Shallow vs Deep Copy)


Numpy manages memory very efficiently,
which makes it really useful while dealing with large datasets.

But how does it manage memory so efficiently?

Let's create some arrays to understand what happens while using numpy.

import numpy as np

# We'll create a np array

a = np.arange(4)
a

array([0, 1, 2, 3])

# Reshape array `a` and store in `b`

b = a.reshape(2, 2)
b

array([[0, 1],
[2, 3]])

Now we will make some changes to our original array a .

a[0] = 100
a

array([100, 1, 2, 3])

What will the values be if we print array b ?

array([[100, 1],
[ 2, 3]])

Array b got automatically updated

This is an example of numpy using Shallow Copy of data.

What happens here?

Numpy re-uses data as much as possible instead of duplicating it.


This helps numpy to be efficient.

When we created b=a.reshape(2,2)

Numpy did NOT make a copy of a to store in b , as we can clearly see.


It is using the same data as in a .
It just looks different (reshaped) in b .
That is why, any changes in a automatically gets reflected in b .

Now, let's see an example where Numpy will create a Deep Copy of data.

a = np.arange(4)
a

array([0, 1, 2, 3])

# Create `c`

c = a + 2
c

array([2, 3, 4, 5])

# We make changes in `a`

a[0] = 100
a

array([100, 1, 2, 3])

c

array([2, 3, 4, 5])

np.shares_memory(a, c) # Deep Copy

False

As we can see, c did not get affected on changing a .

Because c was created by an arithmetic operation ( a + 2 ),
which makes a more permanent change to the data,
Numpy had to create a separate copy for c - i.e., a deep copy of array a for array c .

Conclusion:

Numpy is able to use same data for simpler operations like reshape → Shallow Copy.
It creates a copy of data where operations make more permanent changes to data → Deep Copy.

Is there a way to check whether two arrays are sharing memory or not?

Yes, np.shares_memory() function

a= np.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

b = a[::2]
b
array([0, 2, 4, 6, 8])

np.shares_memory(a,b)

True

Notice that Slicing creates shallow copies.

a[0] = 1000

b

array([1000, 2, 4, 6, 8])

a = np.arange(6)
a

array([0, 1, 2, 3, 4, 5])

b = a[a % 1 == 0]
b

array([0, 1, 2, 3, 4, 5])

b[0] = 10

a[0]

np.shares_memory(a,b)

False

Note:

Shallow Copy - Reshaping, Slicing...


Deep Copy - Arithmetic Operations, Masking...

a = np.arange(10)

a_shallow_copy = a.view()
# Creates a shallow copy of a

np.shares_memory(a_shallow_copy, a)

True

a_deep_copy = a.copy()
# Creates a deep copy of a

np.shares_memory(a_deep_copy, a)

False

.view()

Returns view of the original array.


Any changes made in new array will be reflected in original array.

Documentation: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.view.html

arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

view_arr = arr.view()
view_arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Let's modify the content of view_arr and check whether it modified the original array as well.

view_arr[4] = 420
view_arr

array([ 0, 1, 2, 3, 420, 5, 6, 7, 8, 9])

arr

array([ 0, 1, 2, 3, 420, 5, 6, 7, 8, 9])

np.shares_memory(arr, view_arr)

True

Notice that changes in view array are reflected in original array.

.copy()

Returns a copy of the array.


Changes made in new array are not reflected in the original array.

Documentation ( .copy() ): https://numpy.org/doc/stable/reference/generated/numpy.ndarray.copy.html#numpy.ndarray.copy

Documentation: ( np.copy() ): https://numpy.org/doc/stable/reference/generated/numpy.copy.html

arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

copy_arr = arr.copy()
copy_arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Let's modify the content of copy_arr and check whether it modified the original array as well.

copy_arr[3] = 45
copy_arr

array([ 0, 1, 2, 45, 4, 5, 6, 7, 8, 9])

arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

np.shares_memory(arr, copy_arr)

False


Notice that the contents of the original array were not modified when we changed our copy array.

keyboard_arrow_down Splitting
In addition to reshaping and selecting subarrays, it is often necessary to split arrays into smaller arrays or merge arrays into bigger arrays.

keyboard_arrow_down np.split()

Splits an array into multiple sub-arrays as views.

It takes an argument indices_or_sections .

If indices_or_sections is an integer, n, the array will be divided into n equal arrays along axis.

If such a split is not possible, an error is raised.

If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split.

If an index exceeds the dimension of the array along axis, an empty sub-array is returned correspondingly.

x = np.arange(9)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

np.split(x, 3)

[array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8])]

IMPORTANT REQUISITE

Number of elements in the array should be divisible by number of sections.

b = np.arange(10)
np.split(b, 3)

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-5033f171e13f> in <cell line: 2>()
1 b = np.arange(10)
----> 2 np.split(b, 3)

1 frames
/usr/local/lib/python3.10/dist-packages/numpy/lib/shape_base.py in split(ary, indices_or_sections, axis)
870 N = ary.shape[axis]
871 if N % sections:
--> 872 raise ValueError(
873 'array split does not result in an equal division') from None
874 return array_split(ary, indices_or_sections, axis)

ValueError: array split does not result in an equal division


b[0:-1] # drop the last element, leaving 9 elements (divisible by 3)
np.split(b[0:-1], 3)

[array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8])]

# Splitting on the basis of exact indices

c = np.arange(16)
np.split(c, [3, 5, 6])

[array([0, 1, 2]),
array([3, 4]),
array([5]),
array([ 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])]

keyboard_arrow_down np.hsplit()
Splits an array into multiple sub-arrays horizontally (column-wise).

x = np.arange(16.0).reshape(4, 4)
x

array([[ 0., 1., 2., 3.],


[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.],
[12., 13., 14., 15.]])

Think of it this way:

There are 2 axes to a 2D array

1. 1st axis - Vertical axis


2. 2nd axis - Horizontal axis

Along which axis are we splitting the array?

The split we want happens across the 2nd axis (Horizontal axis)
That is why we use hsplit()

So, try to think in terms of "whether the operation is happening along vertical axis or horizontal axis".

We are splitting the horizontal axis in this case.

np.hsplit(x, 2)

[array([[ 0., 1.],


[ 4., 5.],
[ 8., 9.],
[12., 13.]]),
array([[ 2., 3.],
[ 6., 7.],
[10., 11.],
[14., 15.]])]

np.hsplit(x, np.array([3, 6]))

[array([[ 0., 1., 2.],


[ 4., 5., 6.],
[ 8., 9., 10.],
[12., 13., 14.]]),
array([[ 3.],
[ 7.],
[11.],
[15.]]),
array([], shape=(4, 0), dtype=float64)]

keyboard_arrow_down np.vsplit()

Splits an array into multiple sub-arrays vertically (row-wise).

x = np.arange(16.0).reshape(4, 4)
x

array([[ 0., 1., 2., 3.],


[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.],
[12., 13., 14., 15.]])

Now, along which axis are we splitting the array?

The split we want happens across the 1st axis (Vertical axis)
That is why we use vsplit()

Again, always try to think in terms of "whether the operation is happening along vertical axis or horizontal axis".

We are splitting the vertical axis in this case.

np.vsplit(x, 2)

[array([[0., 1., 2., 3.],


[4., 5., 6., 7.]]),
array([[ 8., 9., 10., 11.],
[12., 13., 14., 15.]])]

np.vsplit(x, np.array([3]))

[array([[ 0., 1., 2., 3.],


[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]]),
array([[12., 13., 14., 15.]])]

keyboard_arrow_down Stacking
a = np.arange(1, 5)
b = np.arange(2, 6)
c = np.arange(3, 7)

keyboard_arrow_down np.vstack()

Stacks a list of arrays vertically (along axis 0 or 1st axis).


For example, given a list of row vectors, appends the rows to form a matrix.

np.vstack([b, c, a])

array([[2, 3, 4, 5],
[3, 4, 5, 6],
[1, 2, 3, 4]])

a = np.arange(1, 5)
b = np.arange(2, 4)
c = np.arange(3, 10)

np.vstack([b, c, a])

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-49-5148cb6ebc5f> in <cell line: 1>()
----> 1 np.vstack([b, c, a])

2 frames
/usr/local/lib/python3.10/dist-packages/numpy/core/overrides.py in concatenate(*args, **kwargs)

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the arra



keyboard_arrow_down np.hstack()

Stacks a list of arrays horizontally (along axis 1 or 2nd axis).

a = np.arange(5).reshape(5, 1)
a

array([[0],
[1],
[2],
[3],
[4]])

b = np.arange(15).reshape(5, 3)
b

array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])

np.hstack([a, b])

array([[ 0, 0, 1, 2],
[ 1, 3, 4, 5],
[ 2, 6, 7, 8],
[ 3, 9, 10, 11],
[ 4, 12, 13, 14]])

keyboard_arrow_down np.concatenate()

Can perform both vstack and hstack


Creates a new array by appending arrays after each other, along a given axis.

Provides similar functionality, but it takes a keyword argument axis that specifies the axis along which the arrays are to be concatenated.

The input arrays to concatenate() need to have the same number of dimensions; their shapes must match except along the concatenation axis.

a = np.array([1,2,3])
a

array([1, 2, 3])

b = np.array([[1,2,3], [4,5,6]])
b

array([[1, 2, 3],
[4, 5, 6]])

np.concatenate([a, b], axis = 0)

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-55-1a93c4fe21df> in <cell line: 1>()
----> 1 np.concatenate([a, b], axis = 0)

/usr/local/lib/python3.10/dist-packages/numpy/core/overrides.py in concatenate(*args, **kwargs)

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the


concatenate() can only work if both a and b have the same number of dimensions.

a = np.array([[1,2,3]])
b = np.array([[1,2,3], [4,5,6]])

np.concatenate([a, b], axis = 0) # axis = 0 -> vstack


array([[1, 2, 3],
[1, 2, 3],
[4, 5, 6]])

a = np.arange(6).reshape(3, 2)
b = np.arange(9).reshape(3, 3)

np.concatenate([a, b], axis = 1) # axis = 1 -> hstack

array([[0, 1, 0, 1, 2],
[2, 3, 3, 4, 5],
[4, 5, 6, 7, 8]])

a = np.array([[1,2], [3,4]])
b = np.array([[5,6,7,8]])

np.concatenate([a, b], axis = None)

# axis = None joins and converts to 1D

array([1, 2, 3, 4, 5, 6, 7, 8])

Question: What will be the output of this?

a = np.array([[1, 2], [3, 4]])


b = np.array([[5, 6]])
np.concatenate((a, b), axis=0)

a = np.array([[1, 2], [3, 4]])


a

array([[1, 2],
[3, 4]])

b = np.array([[5, 6]])
b

array([[5, 6]])

np.concatenate((a, b), axis=0)

array([[1, 2],
[3, 4],
[5, 6]])

How did it work?

Dimensions of a are 2 × 2
What are the dimensions of b ?

1-D array ?? - NO
Look carefully!!
b is a 2-D array of dimensions 1 × 2
axis = 0 ---> It's a vertical axis

So, changes will happen along vertical axis


So, b gets concatenated below a
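
A quick way to verify this (a small sketch reusing the same a and b from above) is to inspect their shapes and number of dimensions:

print(a.shape, b.shape, b.ndim) # (2, 2) (1, 2) 2 -> b really is 2-D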

Question: What will be the result of this concatenation operation?

a = np.array([[1, 2], [3, 4]])


b = np.array([[5, 6]])
np.concatenate((a, b.T), axis=1)

a = np.array([[1, 2], [3, 4]])


a
array([[1, 2],
[3, 4]])

b = np.array([[5, 6]])
b

array([[5, 6]])

np.concatenate((a, b.T), axis=1)

array([[1, 2, 5],
[3, 4, 6]])

What happened here?

Dimensions of a are again 2 × 2
Dimensions of b are again 1 × 2
So, dimensions of b.T will be 2 × 1

axis = 1 ---> It's a horizontal axis

So, changes will happen along horizontal axis


So, b.T gets concatenated horizontally to a

keyboard_arrow_down Extra-reading material


Object arrays
Image Manipulation

keyboard_arrow_down 01-Pandas-Lecture-McKinsey
keyboard_arrow_down Outline
Installation of Pandas

Importing pandas
Importing the dataset
Dataframe/Series

Basic ops on a DataFrame

df.info()
df.head()
df.tail()
df.shape()

Creating Dataframe from Scratch


Basic ops on columns

Different ways of accessing cols


Check for unique values
Rename column
Deleting column
Creating new column

Basic ops on rows

Implicit/Explicit index
df.index
Indexing in Series
Slicing in Series
loc/iloc

keyboard_arrow_down Installing Pandas


# !pip install pandas

keyboard_arrow_down Importing Pandas


You should be able to import Pandas after installing it

We'll import pandas as its alias name pd

import pandas as pd
import numpy as np

keyboard_arrow_down Introduction: Why to use Pandas?


How is it different from numpy ?

The major limitation of numpy is that it can only work with 1 datatype at a time

Most real-world datasets contain a mixture of different datatypes

Like names of places would be string but their population would be int

==> It is difficult to work with data having heterogeneous values using Numpy

Pandas can work with numbers and strings together
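
As a tiny illustration (the frame below is made up, not part of our dataset), a single DataFrame can hold strings and numbers side by side, with each column keeping its own dtype:

mini = pd.DataFrame({"place": ["a", "b"], "population": [100, 200]})
mini.dtypes # place -> object, population -> int64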


So let's see how we can use pandas
keyboard_arrow_down Imagine that you are a Data Scientist with McKinsey
McKinsey wants to understand the relation between GDP per capita and life expectancy and various trends for their clients.
The company has acquired data from multiple surveys in different countries in the past
This contains info of several years about:

country
population size
life expectancy
GDP per Capita
We have to analyse the data and draw inferences meaningful to the company

keyboard_arrow_down Reading dataset in Pandas


Link: https://drive.google.com/file/d/1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_/view?usp=sharing

!wget "https://drive.google.com/uc?export=download&id=1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_" -O mckinsey.csv

--2023-09-11 18:16:39-- https://drive.google.com/uc?export=download&id=1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_


Resolving drive.google.com (drive.google.com)... 173.194.213.102, 173.194.213.113, 173.194.213.139, ...
Connecting to drive.google.com (drive.google.com)|173.194.213.102|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0s-68-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/l1h3nkkdhuu158c3gc
Warning: wildcards not supported in HTTP.
--2023-09-11 18:16:40-- https://doc-0s-68-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/l1h
Resolving doc-0s-68-docs.googleusercontent.com (doc-0s-68-docs.googleusercontent.com)... 173.194.212.132, 2607:f8b0:400c
Connecting to doc-0s-68-docs.googleusercontent.com (doc-0s-68-docs.googleusercontent.com)|173.194.212.132|:443... connec
HTTP request sent, awaiting response... 200 OK
Length: 83785 (82K) [text/csv]
Saving to: ‘mckinsey.csv’

mckinsey.csv 100%[===================>] 81.82K --.-KB/s in 0.001s

2023-09-11 18:16:40 (76.9 MB/s) - ‘mckinsey.csv’ saved [83785/83785]

keyboard_arrow_down Now how should we read this dataset?

Pandas makes it very easy to work with these kinds of files

df = pd.read_csv('mckinsey.csv') # We are storing the data in df


df

country year population continent life_exp gdp_cap

0 Afghanistan 1952 8425333 Asia 28.801 779.445314

1 Afghanistan 1957 9240934 Asia 30.332 820.853030

2 Afghanistan 1962 10267083 Asia 31.997 853.100710

3 Afghanistan 1967 11537966 Asia 34.020 836.197138

4 Afghanistan 1972 13079460 Asia 36.088 739.981106

... ... ... ... ... ... ...

1699 Zimbabwe 1987 9216418 Africa 62.351 706.157306

1700 Zimbabwe 1992 10704340 Africa 60.377 693.420786

1701 Zimbabwe 1997 11404948 Africa 46.809 792.449960

1702 Zimbabwe 2002 11926563 Africa 39.989 672.038623

1703 Zimbabwe 2007 12311143 Africa 43.487 469.709298

1704 rows × 6 columns

keyboard_arrow_down Dataframe and Series


keyboard_arrow_down What can we observe from the above dataset ?
We can see that it has:

6 columns
1704 rows

What do you think is the datatype of df ?

type(df)

pandas.core.frame.DataFrame

Its a pandas DataFrame

keyboard_arrow_down What is a pandas DataFrame ?


It is a table-like representation of data in Pandas => Structured Data
Structured Data here can be thought of as tabular data in a proper order
Considered as counterpart of 2D-Matrix in Numpy

keyboard_arrow_down Now how can we access a column, say country of the dataframe?

df["country"]

0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
...
1699 Zimbabwe
1700 Zimbabwe
1701 Zimbabwe
1702 Zimbabwe
1703 Zimbabwe
Name: country, Length: 1704, dtype: object

As you can see we get all the values in the column country

keyboard_arrow_down Now what is the data-type of a column?

type(df["country"])

pandas.core.series.Series

Its a pandas Series

keyboard_arrow_down What is a pandas Series ?

Series in Pandas is what a Vector is in Numpy

What exactly does that mean?


It means a Series is a single column of data

Multiple Series stack together to form a DataFrame

Now we have understood what Series and DataFrames are

keyboard_arrow_down What if a dataset has 100 rows ... Or 100 columns ?

How can we find the datatype, name, total entries in each column ?

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 1704 non-null object
1 year 1704 non-null int64
2 population 1704 non-null int64
3 continent 1704 non-null object
4 life_exp 1704 non-null float64
5 gdp_cap 1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB

df.info() gives a list of columns with:

Name/Title of Columns
How many non-null values (blank cells) each column has
Type of values in each column - int, float, etc.

By default, it shows data-type as object for anything other than int or float - Will come back later
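
For instance (a quick sketch), checking a single column's dtype shows the string column reported as object while the numeric one keeps its numeric type:

print(df['country'].dtype, df['year'].dtype) # object int64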

keyboard_arrow_down Now what if we want to see the first few rows in the dataset ?

df.head()

country year population continent life_exp gdp_cap

0 Afghanistan 1952 8425333 Asia 28.801 779.445314

1 Afghanistan 1957 9240934 Asia 30.332 820.853030

2 Afghanistan 1962 10267083 Asia 31.997 853.100710

3 Afghanistan 1967 11537966 Asia 34.020 836.197138

4 Afghanistan 1972 13079460 Asia 36.088 739.981106


It Prints top 5 rows by default

We can also pass in number of rows we want to see in head()

df.head(20)

country year population continent life_exp gdp_cap

0 Afghanistan 1952 8425333 Asia 28.801 779.445314

1 Afghanistan 1957 9240934 Asia 30.332 820.853030

2 Afghanistan 1962 10267083 Asia 31.997 853.100710

3 Afghanistan 1967 11537966 Asia 34.020 836.197138

4 Afghanistan 1972 13079460 Asia 36.088 739.981106

5 Afghanistan 1977 14880372 Asia 38.438 786.113360

6 Afghanistan 1982 12881816 Asia 39.854 978.011439

7 Afghanistan 1987 13867957 Asia 40.822 852.395945

8 Afghanistan 1992 16317921 Asia 41.674 649.341395

9 Afghanistan 1997 22227415 Asia 41.763 635.341351

10 Afghanistan 2002 25268405 Asia 42.129 726.734055

11 Afghanistan 2007 31889923 Asia 43.828 974.580338

12 Albania 1952 1282697 Europe 55.230 1601.056136

13 Albania 1957 1476505 Europe 59.280 1942.284244

14 Albania 1962 1728137 Europe 64.820 2312.888958

15 Albania 1967 1984060 Europe 66.220 2760.196931

16 Albania 1972 2263554 Europe 67.690 3313.422188

17 Albania 1977 2509048 Europe 68.930 3533.003910

18 Albania 1982 2780097 Europe 70.420 3630.880722

19 Albania 1987 3075321 Europe 72.000 3738.932735

keyboard_arrow_down Similarly what if we want to see the last 20 rows ?

df.tail(20) #Similar to head


country year population continent life_exp gdp_cap

1684 Zambia 1972 4506497 Africa 50.107 1773.498265

1685 Zambia 1977 5216550 Africa 51.386 1588.688299

1686 Zambia 1982 6100407 Africa 51.821 1408.678565

1687 Zambia 1987 7272406 Africa 50.821 1213.315116

1688 Zambia 1992 8381163 Africa 46.100 1210.884633

1689 Zambia 1997 9417789 Africa 40.238 1071.353818

1690 Zambia 2002 10595811 Africa 39.193 1071.613938

1691 Zambia 2007 11746035 Africa 42.384 1271.211593

1692 Zimbabwe 1952 3080907 Africa 48.451 406.884115

1693 Zimbabwe 1957 3646340 Africa 50.469 518.764268

1694 Zimbabwe 1962 4277736 Africa 52.358 527.272182

1695 Zimbabwe 1967 4995432 Africa 53.995 569.795071

1696 Zimbabwe 1972 5861135 Africa 55.635 799.362176

1697 Zimbabwe 1977 6642107 Africa 57.674 685.587682

1698 Zimbabwe 1982 7636524 Africa 60.363 788.855041

1699 Zimbabwe 1987 9216418 Africa 62.351 706.157306

1700 Zimbabwe 1992 10704340 Africa 60.377 693.420786

1701 Zimbabwe 1997 11404948 Africa 46.809 792.449960

1702 Zimbabwe 2002 11926563 Africa 39.989 672.038623

1703 Zimbabwe 2007 12311143 Africa 43.487 469.709298

keyboard_arrow_down How can we find the shape of the dataframe?

df.shape

(1704, 6)

Similar to Numpy, it gives No. of Rows and Columns -- Dimensions

Now we know how to do some basic operations on dataframes

df.head(3) # We take the first 3 rows to create our dataframe

country year population continent life_exp gdp_cap

0 Afghanistan 1952 8425333 Asia 28.801 779.445314

1 Afghanistan 1957 9240934 Asia 30.332 820.853030

2 Afghanistan 1962 10267083 Asia 31.997 853.100710

keyboard_arrow_down Basic operations on columns


Now what operations can we do using columns?

Maybe add a column


or delete a column
or we can rename the column too

and so on.

We can see that our dataset has 6 cols

keyboard_arrow_down But what if our dataset has 20 cols ? ... or 100 cols ? We can't see their names in one go.

How can we get the names of all these cols ?


We can do it in two ways:
1. df.columns
2. df.keys

df.columns # using attribute `columns` of dataframe

Index(['country', 'year', 'population', 'continent', 'life_exp', 'gdp_cap'], dtype='object')

df.keys() # using method keys() of dataframe

Index(['country', 'year', 'population', 'continent', 'life_exp', 'gdp_cap'], dtype='object')

Note:

Here, Index is a pandas class used to store the row/column labels (the "address") of a Series/DataFrame

It is an Immutable sequence used for indexing and alignment.
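
Because it is immutable, an individual label cannot be reassigned in place; trying to do so raises an error (a minimal sketch):

df.columns[0] = 'nation' # raises TypeError: Index does not support mutable operations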

df['country'].head() # Gives values in Top 5 rows pertaining to the key

Pandas DataFrame and Series are specialised dictionaries

keyboard_arrow_down But what is so "special" about this dictionary?

It can take multiple keys

df[['country', 'life_exp']].head()

country life_exp

0 Afghanistan 28.801

1 Afghanistan 30.332

2 Afghanistan 31.997

3 Afghanistan 34.020

4 Afghanistan 36.088

And what if we pass a single column name?

df[['country']].head()

country

0 Afghanistan

1 Afghanistan

2 Afghanistan

3 Afghanistan

4 Afghanistan

Note:

Notice how this output type is different from our earlier output using df['country']

==> df['country'] gives a Series while df[['country']] gives a DataFrame
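
We can confirm this by checking the types (a small sketch):

print(type(df['country'])) # <class 'pandas.core.series.Series'>
print(type(df[['country']])) # <class 'pandas.core.frame.DataFrame'>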

Now that we know how to access columns, lets answer some questions

keyboard_arrow_down How can we find the countries that have been surveyed ?
We can find the unique vals in the country col

How can we find unique values in a column?

df['country'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',


'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium',
'Benin', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
'Colombia', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.',
'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Czech Republic',
'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Ethiopia',
'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana',
'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti',
'Honduras', 'Hong Kong, China', 'Hungary', 'Iceland', 'India',
'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Korea, Dem. Rep.',
'Korea, Rep.', 'Kuwait', 'Lebanon', 'Lesotho', 'Liberia', 'Libya',
'Madagascar', 'Malawi', 'Malaysia', 'Mali', 'Mauritania',
'Mauritius', 'Mexico', 'Mongolia', 'Montenegro', 'Morocco',
'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Netherlands',
'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Norway', 'Oman',
'Pakistan', 'Panama', 'Paraguay', 'Peru', 'Philippines', 'Poland',
'Portugal', 'Puerto Rico', 'Reunion', 'Romania', 'Rwanda',
'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia',
'Sierra Leone', 'Singapore', 'Slovak Republic', 'Slovenia',
'Somalia', 'South Africa', 'Spain', 'Sri Lanka', 'Sudan',
'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Taiwan',
'Tanzania', 'Thailand', 'Togo', 'Trinidad and Tobago', 'Tunisia',
'Turkey', 'Uganda', 'United Kingdom', 'United States', 'Uruguay',
'Venezuela', 'Vietnam', 'West Bank and Gaza', 'Yemen, Rep.',
'Zambia', 'Zimbabwe'], dtype=object)

keyboard_arrow_down Now what if you also want to check the count of each country in the dataframe?

df['country'].value_counts()

Afghanistan 12
Pakistan 12
New Zealand 12
Nicaragua 12
Niger 12
..
Eritrea 12
Equatorial Guinea 12
El Salvador 12
Egypt 12
Zimbabwe 12
Name: country, Length: 142, dtype: int64

Note:

value_counts() shows the output in decreasing order of frequency

keyboard_arrow_down What if we want to change the name of a column ?


We can rename the column by:

passing the dictionary with old_name:new_name pair


specifying axis=1

df.rename({"population": "Population", "country":"Country" }, axis = 1)

Country year Population continent life_exp gdp_cap

0 Afghanistan 1952 8425333 Asia 28.801 779.445314

1 Afghanistan 1957 9240934 Asia 30.332 820.853030

2 Afghanistan 1962 10267083 Asia 31.997 853.100710

3 Afghanistan 1967 11537966 Asia 34.020 836.197138

4 Afghanistan 1972 13079460 Asia 36.088 739.981106

... ... ... ... ... ... ...

1699 Zimbabwe 1987 9216418 Africa 62.351 706.157306

1700 Zimbabwe 1992 10704340 Africa 60.377 693.420786

1701 Zimbabwe 1997 11404948 Africa 46.809 792.449960

1702 Zimbabwe 2002 11926563 Africa 39.989 672.038623

1703 Zimbabwe 2007 12311143 Africa 43.487 469.709298

1704 rows × 6 columns


Alternatively, we can also rename the column without using axis

by using the column parameter

df.rename(columns={"country":"Country"})

Country year population continent life_exp gdp_cap

0 Afghanistan 1952 8425333 Asia 28.801 779.445314

1 Afghanistan 1957 9240934 Asia 30.332 820.853030

2 Afghanistan 1962 10267083 Asia 31.997 853.100710

3 Afghanistan 1967 11537966 Asia 34.020 836.197138

4 Afghanistan 1972 13079460 Asia 36.088 739.981106

... ... ... ... ... ... ...

1699 Zimbabwe 1987 9216418 Africa 62.351 706.157306

1700 Zimbabwe 1992 10704340 Africa 60.377 693.420786

1701 Zimbabwe 1997 11404948 Africa 46.809 792.449960

1702 Zimbabwe 2002 11926563 Africa 39.989 672.038623

1703 Zimbabwe 2007 12311143 Africa 43.487 469.709298

1704 rows × 6 columns

If we try and check the original dataframe df

df

country year population continent life_exp gdp_cap

0 Afghanistan 1952 8425333 Asia 28.801 779.445314

1 Afghanistan 1957 9240934 Asia 30.332 820.853030

2 Afghanistan 1962 10267083 Asia 31.997 853.100710

3 Afghanistan 1967 11537966 Asia 34.020 836.197138

4 Afghanistan 1972 13079460 Asia 36.088 739.981106

... ... ... ... ... ... ...

1699 Zimbabwe 1987 9216418 Africa 62.351 706.157306

1700 Zimbabwe 1992 10704340 Africa 60.377 693.420786

1701 Zimbabwe 1997 11404948 Africa 46.809 792.449960

1702 Zimbabwe 2002 11926563 Africa 39.989 672.038623

1703 Zimbabwe 2007 12311143 Africa 43.487 469.709298

1704 rows × 6 columns

We can clearly see that the column names are still the same and have not changed. So the changes don't happen in the original dataframe unless
we specify a parameter called inplace

We can set it inplace by setting the inplace argument = True

df.rename({"country": "Country"}, axis = 1, inplace = True)


df
Country year population continent life_exp gdp_cap

0 Afghanistan 1952 8425333 Asia 28.801 779.445314

1 Afghanistan 1957 9240934 Asia 30.332 820.853030

2 Afghanistan 1962 10267083 Asia 31.997 853.100710

3 Afghanistan 1967 11537966 Asia 34.020 836.197138

4 Afghanistan 1972 13079460 Asia 36.088 739.981106

... ... ... ... ... ... ...

1699 Zimbabwe 1987 9216418 Africa 62.351 706.157306

1700 Zimbabwe 1992 10704340 Africa 60.377 693.420786

1701 Zimbabwe 1997 11404948 Africa 46.809 792.449960

1702 Zimbabwe 2002 11926563 Africa 39.989 672.038623

1703 Zimbabwe 2007 12311143 Africa 43.487 469.709298

1704 rows × 6 columns

Note

.rename has default value of axis=0


If two columns have the same name, then df['column'] will display both columns
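
For example (dup below is a hypothetical frame, not our dataset), selecting a duplicated label returns both columns at once:

dup = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'x'])
dup['x'] # returns a DataFrame containing both 'x' columns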

Now let's try another way of accessing column vals

df.Country

0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
...
1699 Zimbabwe
1700 Zimbabwe
1701 Zimbabwe
1702 Zimbabwe
1703 Zimbabwe
Name: Country, Length: 1704, dtype: object

This however doesn't work every time

What do you think could be the problems with using attribute style for accessing the columns?

Problems such as

if the column names are not strings

Starting with number: E.g., 2nd


Contains a space: E.g., Roll Number
or if the column names conflict with methods of the DataFrame

E.g. shape

It is generally better to avoid this way of accessing columns
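
For instance (tmp below is a hypothetical frame for illustration), if a column is literally named shape, attribute access returns the DataFrame's shape attribute instead of the column:

tmp = pd.DataFrame({"shape": ["circle", "square"]})
print(tmp.shape) # (2, 1) -> the built-in attribute wins
print(tmp['shape']) # the actual column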

Are all the columns in our data necessary?

We already know the continents in which each country lies


So we don't need this column

keyboard_arrow_down How can we delete cols in pandas dataframe ?


df.drop('continent', axis=1)
Country year population life_exp gdp_cap

0 Afghanistan 1952 8425333 28.801 779.445314

1 Afghanistan 1957 9240934 30.332 820.853030

2 Afghanistan 1962 10267083 31.997 853.100710

3 Afghanistan 1967 11537966 34.020 836.197138

4 Afghanistan 1972 13079460 36.088 739.981106

... ... ... ... ... ...

1699 Zimbabwe 1987 9216418 62.351 706.157306

1700 Zimbabwe 1992 10704340 60.377 693.420786

1701 Zimbabwe 1997 11404948 46.809 792.449960

1702 Zimbabwe 2002 11926563 39.989 672.038623

1703 Zimbabwe 2007 12311143 43.487 469.709298

1704 rows × 5 columns

The drop function takes two parameters:

The column name


The axis

By default the value of axis is 0

An alternative to the above approach is using the "columns" parameter as we did in rename

df.drop(columns=['continent'])

Country year population life_exp gdp_cap

0 Afghanistan 1952 8425333 28.801 779.445314

1 Afghanistan 1957 9240934 30.332 820.853030

2 Afghanistan 1962 10267083 31.997 853.100710

3 Afghanistan 1967 11537966 34.020 836.197138

4 Afghanistan 1972 13079460 36.088 739.981106

... ... ... ... ... ...

1699 Zimbabwe 1987 9216418 62.351 706.157306

1700 Zimbabwe 1992 10704340 60.377 693.420786

1701 Zimbabwe 1997 11404948 46.809 792.449960

1702 Zimbabwe 2002 11926563 39.989 672.038623

1703 Zimbabwe 2007 12311143 43.487 469.709298

1704 rows × 5 columns

As you can see, column continent is dropped

keyboard_arrow_down Has the column permanently been deleted?

df.head()

Country year population continent life_exp gdp_cap

0 Afghanistan 1952 8425333 Asia 28.801 779.445314

1 Afghanistan 1957 9240934 Asia 30.332 820.853030

2 Afghanistan 1962 10267083 Asia 31.997 853.100710

3 Afghanistan 1967 11537966 Asia 34.020 836.197138

4 Afghanistan 1972 13079460 Asia 36.088 739.981106

NO, the column continent is still there

Do you see what's happening here?


We only got a view of dataframe with column continent dropped

keyboard_arrow_down How can we permanently drop the column?

We can either re-assign it

df = df.drop('continent', axis=1)

OR

We can set parameter inplace=True

By default, inplace=False

df.drop('continent', axis=1, inplace=True)

df.head() #we print the head to check

Country year population life_exp gdp_cap

0 Afghanistan 1952 8425333 28.801 779.445314

1 Afghanistan 1957 9240934 30.332 820.853030

2 Afghanistan 1962 10267083 31.997 853.100710

3 Afghanistan 1967 11537966 34.020 836.197138

4 Afghanistan 1972 13079460 36.088 739.981106

Now we can see the column continent is permanently dropped

keyboard_arrow_down Now similarly, what if we want to create a new column?


We can either

use values from existing columns

OR

create our own values

How to create a column using values from an existing column?

df["year+7"] = df["year"] + 7
df.head()

Country year population life_exp gdp_cap year+7

0 Afghanistan 1952 8425333 28.801 779.445314 1959

1 Afghanistan 1957 9240934 30.332 820.853030 1964

2 Afghanistan 1962 10267083 31.997 853.100710 1969

3 Afghanistan 1967 11537966 34.020 836.197138 1974

4 Afghanistan 1972 13079460 36.088 739.981106 1979

As we see, a new column year+7 is created from the column year

We can also use values from two columns to form a new column

keyboard_arrow_down Which two columns can we use to create a new column gdp ?

df['gdp']=df['gdp_cap'] * df['population']
df.head()

Country year population life_exp gdp_cap year+7 gdp

0 Afghanistan 1952 8425333 28.801 779.445314 1959 6.567086e+09

1 Afghanistan 1957 9240934 30.332 820.853030 1964 7.585449e+09

2 Afghanistan 1962 10267083 31.997 853.100710 1969 8.758856e+09

3 Afghanistan 1967 11537966 34.020 836.197138 1974 9.648014e+09

4 Afghanistan 1972 13079460 36.088 739.981106 1979 9.678553e+09


As you can see

An additional column has been created

Values in this column are product of respective values in gdp_cap and population

What other operations we can use?


Subtraction, Addition, etc.

keyboard_arrow_down How can we create a new column from our own values?
We can create a list

OR

We can create a Pandas Series from a list/numpy array for our new column

df["Own"] = [i for i in range(1704)] # count of these values should be correct


df

Country year population life_exp gdp_cap year+7 gdp Own

0 Afghanistan 1952 8425333 28.801 779.445314 1959 6.567086e+09 0

1 Afghanistan 1957 9240934 30.332 820.853030 1964 7.585449e+09 1

2 Afghanistan 1962 10267083 31.997 853.100710 1969 8.758856e+09 2

3 Afghanistan 1967 11537966 34.020 836.197138 1974 9.648014e+09 3

4 Afghanistan 1972 13079460 36.088 739.981106 1979 9.678553e+09 4

... ... ... ... ... ... ... ... ...

1699 Zimbabwe 1987 9216418 62.351 706.157306 1994 6.508241e+09 1699

1700 Zimbabwe 1992 10704340 60.377 693.420786 1999 7.422612e+09 1700

1701 Zimbabwe 1997 11404948 46.809 792.449960 2004 9.037851e+09 1701

1702 Zimbabwe 2002 11926563 39.989 672.038623 2009 8.015111e+09 1702

1703 Zimbabwe 2007 12311143 43.487 469.709298 2014 5.782658e+09 1703

1704 rows × 8 columns

Now that we know how to create new cols lets see some basic ops on rows

Before that lets drop the newly created cols

df.drop(columns=["Own",'gdp', 'year+7'], axis = 1, inplace = True)


df

Country year population life_exp gdp_cap

0 Afghanistan 1952 8425333 28.801 779.445314

1 Afghanistan 1957 9240934 30.332 820.853030

2 Afghanistan 1962 10267083 31.997 853.100710

3 Afghanistan 1967 11537966 34.020 836.197138

4 Afghanistan 1972 13079460 36.088 739.981106

... ... ... ... ... ...

1699 Zimbabwe 1987 9216418 62.351 706.157306

1700 Zimbabwe 1992 10704340 60.377 693.420786

1701 Zimbabwe 1997 11404948 46.809 792.449960

1702 Zimbabwe 2002 11926563 39.989 672.038623

1703 Zimbabwe 2007 12311143 43.487 469.709298

1704 rows × 5 columns

keyboard_arrow_down Working with Rows


keyboard_arrow_down Just like columns, do rows also have labels?

YES

Notice the indexes in bold against each row

Lets see how can we access these indexes

df.index.values

array([ 0, 1, 2, ..., 1701, 1702, 1703])

keyboard_arrow_down Can we change row labels (like we did for columns)?

What if we want to start indexing from 1 (instead of 0)?

df.index = list(range(1, df.shape[0]+1)) # create a list of indexes of same length


df

Country year population life_exp gdp_cap

1 Afghanistan 1952 8425333 28.801 779.445314

2 Afghanistan 1957 9240934 30.332 820.853030

3 Afghanistan 1962 10267083 31.997 853.100710

4 Afghanistan 1967 11537966 34.020 836.197138

5 Afghanistan 1972 13079460 36.088 739.981106

... ... ... ... ... ...

1700 Zimbabwe 1987 9216418 62.351 706.157306

1701 Zimbabwe 1992 10704340 60.377 693.420786

1702 Zimbabwe 1997 11404948 46.809 792.449960

1703 Zimbabwe 2002 11926563 39.989 672.038623

1704 Zimbabwe 2007 12311143 43.487 469.709298

1704 rows × 5 columns

As you can see the indexing is now starting from 1 instead of 0.

keyboard_arrow_down Explicit and Implicit Indices


keyboard_arrow_down What are these row labels/indices exactly ?

They can be called identifiers of a particular row

Specifically known as explicit indices

Additionally, can series/dataframes also use python-style indexing?


YES

The python style indices are known as implicit indices

How can we access explicit index of a particular row?


Using df.index[]
Takes implicit index of row to give its explicit index

df.index[1] # Implicit index 1 gives explicit index 2

2

keyboard_arrow_down But why not use just implicit indexing ?

Explicit indices can be changed to any value of any datatype

Eg: Explicit Index of 1st row can be changed to First


Or, something like a floating point value, say 1.0
df.index = np.arange(1, df.shape[0]+1, dtype='float')
df

Country year population life_exp gdp_cap

1.0 Afghanistan 1952 8425333 28.801 779.445314

2.0 Afghanistan 1957 9240934 30.332 820.853030

3.0 Afghanistan 1962 10267083 31.997 853.100710

4.0 Afghanistan 1967 11537966 34.020 836.197138

5.0 Afghanistan 1972 13079460 36.088 739.981106

... ... ... ... ... ...

1700.0 Zimbabwe 1987 9216418 62.351 706.157306

1701.0 Zimbabwe 1992 10704340 60.377 693.420786

1702.0 Zimbabwe 1997 11404948 46.809 792.449960

1703.0 Zimbabwe 2002 11926563 39.989 672.038623

1704.0 Zimbabwe 2007 12311143 43.487 469.709298

1704 rows × 5 columns

As we can see, the indices are floating point values now

Now to understand string indices, let's take a small subset of our original dataframe

sample = df.head()
sample

Country year population life_exp gdp_cap

1.0 Afghanistan 1952 8425333 28.801 779.445314

2.0 Afghanistan 1957 9240934 30.332 820.853030

3.0 Afghanistan 1962 10267083 31.997 853.100710

4.0 Afghanistan 1967 11537966 34.020 836.197138

5.0 Afghanistan 1972 13079460 36.088 739.981106

keyboard_arrow_down Now what if we want to use string indices?

sample.index = ['a', 'b', 'c', 'd', 'e']


sample

Country year population life_exp gdp_cap

a Afghanistan 1952 8425333 28.801 779.445314

b Afghanistan 1957 9240934 30.332 820.853030

c Afghanistan 1962 10267083 31.997 853.100710

d Afghanistan 1967 11537966 34.020 836.197138

e Afghanistan 1972 13079460 36.088 739.981106

This shows us we can use almost anything as our explicit index

Now let's reset our indices back to integers

df.index = np.arange(1, df.shape[0]+1, dtype='int')

keyboard_arrow_down What if we want to access any particular row (say first row)?

Let's first see for one column

Later, we can generalise the same for the entire dataframe


ser = df["Country"]
ser.head(20)

1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
5 Afghanistan
6 Afghanistan
7 Afghanistan
8 Afghanistan
9 Afghanistan
10 Afghanistan
11 Afghanistan
12 Afghanistan
13 Albania
14 Albania
15 Albania
16 Albania
17 Albania
18 Albania
19 Albania
20 Albania
Name: Country, dtype: object

We can simply use its indices much like we do in a numpy array

So, how will we then access the thirteenth element (or say the thirteenth row)?

ser[12]

'Afghanistan'

keyboard_arrow_down And what about accessing a subset of rows (say 6th:15th) ?

ser[5:15]

6 Afghanistan
7 Afghanistan
8 Afghanistan
9 Afghanistan
10 Afghanistan
11 Afghanistan
12 Afghanistan
13 Albania
14 Albania
15 Albania
Name: Country, dtype: object

This is known as slicing

Notice something different though?

Indexing in Series used explicit indices


Slicing however used implicit indices

Let's try the same for the dataframe now

keyboard_arrow_down So how can we access a row in a dataframe?

df[0]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in
get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:

4 frames
pandas/_libs/hashtable_class_helper.pxi in
pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in
pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError Traceback (most recent call last)


/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in
get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 0

Notice, that this syntax is exactly same as how we tried accessing a column

===> df[x] looks for column with name x

keyboard_arrow_down How can we access a slice of rows in the dataframe?

df[5:15]

Woah, so the slicing works

===> Indexing in dataframe looks only for explicit indices


===> Slicing, however, checked for implicit indices

This can be a cause for confusion

To avoid this, pandas provides special indexers, loc and iloc

Let's look at them one by one

keyboard_arrow_down loc and iloc


keyboard_arrow_down 1. loc

Allows indexing and slicing that always references the explicit index

df.loc[1]

Country Afghanistan
year 1952
population 8425333
life_exp 28.801
gdp_cap 779.445314
Name: 1, dtype: object

df.loc[1:3]

Country year population life_exp gdp_cap

1 Afghanistan 1952 8425333 28.801 779.445314

2 Afghanistan 1957 9240934 30.332 820.853030

3 Afghanistan 1962 10267083 31.997 853.100710

Did you notice something strange here?


The range is inclusive of end point for loc

Row with Label 3 is included in the result

keyboard_arrow_down 2. iloc
Allows indexing and slicing that always references the implicit Python-style index

df.iloc[1]

Country Afghanistan
year 1957
population 9240934
life_exp 30.332
gdp_cap 820.85303
Name: 2, dtype: object

keyboard_arrow_down Now will iloc also consider the range inclusive?

df.iloc[0:2]

Country year population life_exp gdp_cap

1 Afghanistan 1952 8425333 28.801 779.445314

2 Afghanistan 1957 9240934 30.332 820.853030

NO

Because iloc works with implicit Python-style indices

It is important to know about these conceptual differences

Not just b/w loc and iloc , but in general while working in DS and ML

Which one should we use ?

Generally explicit indexing is considered to be better than implicit


But it is recommended to always use both loc and iloc to avoid any confusions

keyboard_arrow_down What if we want to access multiple non-consecutive rows at same time ?

For eg: rows 1, 10, 100

df.iloc[[1, 10, 100]]

As we see, we can just pack the indices in [] and pass them to loc or iloc

keyboard_arrow_down What about negative index?

Which would work between iloc and loc ?
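
A quick sketch of what happens (assuming our integer row labels start at 1, so -1 is not a label): iloc counts from the end, while loc looks for a label -1 and fails:

df.iloc[-1] # works -> the last row, selected by position
df.loc[-1] # raises KeyError, since no row has the label -1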


02-Pandas-Lecture-IMDB

keyboard_arrow_down Content
Working with both rows and columns

Pandas built-in operations

Sorting

Concatenation

Merge

Introduction to IMDB dataset

Merging the dataframes


Feature Exploration
Fetching data using pandas

keyboard_arrow_down Working with Rows and Columns together


keyboard_arrow_down Reading dataset

We will be using our earlier McKinsey dataset for now

Link:https://drive.google.com/file/d/1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_/view?usp=sharing

!wget "https://drive.google.com/uc?export=download&id=1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_" -O mckinsey.csv

--2023-09-13 15:17:05-- https://drive.google.com/uc?export=download&id=1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_


Resolving drive.google.com (drive.google.com)... 142.251.16.100, 142.251.16.102, 142.251.16.113, ...
Connecting to drive.google.com (drive.google.com)|142.251.16.100|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0s-68-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/1faj8dbj13vdr93ggo
Warning: wildcards not supported in HTTP.
--2023-09-13 15:17:05-- https://doc-0s-68-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/1fa
Resolving doc-0s-68-docs.googleusercontent.com (doc-0s-68-docs.googleusercontent.com)... 172.253.122.132, 2607:f8b0:4004
Connecting to doc-0s-68-docs.googleusercontent.com (doc-0s-68-docs.googleusercontent.com)|172.253.122.132|:443... connec
HTTP request sent, awaiting response... 200 OK
Length: 83785 (82K) [text/csv]
Saving to: ‘mckinsey.csv’

mckinsey.csv 100%[===================>] 81.82K --.-KB/s in 0.009s

2023-09-13 15:17:05 (8.77 MB/s) - ‘mckinsey.csv’ saved [83785/83785]

import pandas as pd
import numpy as np

df = pd.read_csv('mckinsey.csv')

keyboard_arrow_down Now how can we add a row to our dataframe?


There are multiple ways to do this:

append()
loc/iloc

keyboard_arrow_down How can we add a row using the append() method?


new_row = {'Country': 'India', 'year': 2000,'life_exp':37.08,'population':13500000,'gdp_cap':900.23}
df.append(new_row)

<ipython-input-4-714c78525e27>:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in
df.append(new_row)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-714c78525e27> in <cell line: 2>()
1 new_row = {'Country': 'India', 'year': 2000,'life_exp':37.08,'population':13500000,'gdp_cap':900.23}
----> 2 df.append(new_row)

1 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in _append(self, other, ignore_index, verify_integrity, sort
9778 if isinstance(other, dict):
9779 if not ignore_index:
-> 9780 raise TypeError("Can only append a dict if ignore_index=True")
9781 other = Series(other)
9782 if other.name is None and not ignore_index:

TypeError: Can only append a dict if ignore_index=True


Why are we getting an error here?

It's saying the ignore_index parameter needs to be set to True. This parameter tells Pandas to ignore the existing index and create a new
one based on the length of the resulting DataFrame.

new_row = {'Country': 'India', 'year': 2000,'life_exp':37.08,'population':13500000,'gdp_cap':900.23}


df = df.append(new_row, ignore_index=True)
df

<ipython-input-5-39ca58b35231>:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in
df = df.append(new_row, ignore_index=True)
country year population continent life_exp gdp_cap Country

0 Afghanistan 1952 8425333 Asia 28.801 779.445314 NaN

1 Afghanistan 1957 9240934 Asia 30.332 820.853030 NaN

2 Afghanistan 1962 10267083 Asia 31.997 853.100710 NaN

3 Afghanistan 1967 11537966 Asia 34.020 836.197138 NaN

4 Afghanistan 1972 13079460 Asia 36.088 739.981106 NaN

... ... ... ... ... ... ... ...

1700 Zimbabwe 1992 10704340 Africa 60.377 693.420786 NaN

1701 Zimbabwe 1997 11404948 Africa 46.809 792.449960 NaN

1702 Zimbabwe 2002 11926563 Africa 39.989 672.038623 NaN

1703 Zimbabwe 2007 12311143 Africa 43.487 469.709298 NaN

1704 NaN 2000 13500000 NaN 37.080 900.230000 India

1705 rows × 7 columns

Perfect! So now our row is added at the bottom of the dataframe

But Please Note that:

append() doesn't mutate the dataframe.

It does not change the DataFrame, but returns a new DataFrame with the row appended.

Another method would be by using loc:

We will need to provide the position at which we will add the new row

keyboard_arrow_down What do you think this positional value would be?

df.loc[len(df.index)] = ['India',2000 ,13500000,37.08,900.23] # len(df.index) since we will add at the last row
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-29b4966da254> in <cell line: 1>()
----> 1 df.loc[len(df.index)] = ['India',2000 ,13500000,37.08,900.23] # len(df.index) since we will add at the last row

2 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/indexing.py in _setitem_with_indexer_missing(self, indexer, value)
2158 # must have conforming columns
2159 if len(value) != len(self.obj.columns):
-> 2160 raise ValueError("cannot set a row with mismatched columns")
2161
2162 value = Series(value, index=self.obj.columns, name=indexer)

ValueError: cannot set a row with mismatched columns


df

country year population continent life_exp gdp_cap Country

0 Afghanistan 1952 8425333 Asia 28.801 779.445314 NaN

1 Afghanistan 1957 9240934 Asia 30.332 820.853030 NaN

2 Afghanistan 1962 10267083 Asia 31.997 853.100710 NaN

3 Afghanistan 1967 11537966 Asia 34.020 836.197138 NaN

4 Afghanistan 1972 13079460 Asia 36.088 739.981106 NaN

... ... ... ... ... ... ... ...

1700 Zimbabwe 1992 10704340 Africa 60.377 693.420786 NaN

1701 Zimbabwe 1997 11404948 Africa 46.809 792.449960 NaN

1702 Zimbabwe 2002 11926563 Africa 39.989 672.038623 NaN

1703 Zimbabwe 2007 12311143 Africa 43.487 469.709298 NaN

1704 NaN 2000 13500000 NaN 37.080 900.230000 India

1705 rows × 7 columns

The loc assignment failed this time because the earlier append() used the key 'Country' (capital C) while the original column is 'country', so the
dataframe now has 7 columns and our list of 5 values no longer matches.

When the number of values does match the number of columns, df.loc[len(df.index)] = [...] simply adds the row at the bottom, even if an identical
row already exists.

What can you infer from that?

DataFrames allow us to feed duplicate rows into the data

keyboard_arrow_down Now, can we also use iloc?


Adding a row at a specific index position will replace the existing row at that position.

df.iloc[len(df.index)-1] = ['India', 2000,13500000,37.08,900.23]


df

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-c6ff7fd1a207> in <cell line: 1>()
----> 1 df.iloc[len(df.index)-1] = ['India', 2000,13500000,37.08,900.23]
2 df

2 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/indexing.py in _setitem_with_indexer_split_path(self, indexer, value,
1877
1878 else:
-> 1879 raise ValueError(
1880 "Must have equal len keys and value "
1881 "when setting with an iterable"

ValueError: Must have equal len keys and value when setting with an iterable


keyboard_arrow_down What if we try to add the row with a new index?


df.iloc[len(df.index)] = ['India', 2000,13500000,37.08,900.23]

---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-9-1ad11b3daf34> in <cell line: 1>()
----> 1 df.iloc[len(df.index)] = ['India', 2000,13500000,37.08,900.23]

1 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/indexing.py in _has_valid_setitem_indexer(self, indexer)
1516 elif is_integer(i):
1517 if i >= len(ax):
-> 1518 raise IndexError("iloc cannot enlarge its target object")
1519 elif isinstance(i, dict):
1520 raise IndexError("iloc cannot enlarge its target object")

IndexError: iloc cannot enlarge its target object


Why are we getting an error ?

For using iloc to add a row, the dataframe must already have a row in that position.

If a row is not available, you’ll see this IndexError

Please Note:

When using the loc[] attribute, it’s not mandatory that a row already exists with a specific label.
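
A minimal sketch with a small hypothetical frame - loc happily creates a row under a brand-new label, which is exactly what iloc refused to do:

tiny = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
tiny.loc[len(tiny.index)] = [5, 6] # adds a new row with label 2
tiny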

keyboard_arrow_down Now what if we want to delete a row ?


Use df.drop()

If you remember we specified axis=1 for columns

We can modify this for rows

We can use axis=0 for rows

keyboard_arrow_down Does the drop() method use positional indices or labels?

What do you think by looking at code for deleting column?


We had to specify column title

So drop() uses labels, NOT positional indices

Let's drop the row with label 3

df

country year population continent life_exp gdp_cap Country

0 Afghanistan 1952 8425333 Asia 28.801 779.445314 NaN

1 Afghanistan 1957 9240934 Asia 30.332 820.853030 NaN

2 Afghanistan 1962 10267083 Asia 31.997 853.100710 NaN

3 Afghanistan 1967 11537966 Asia 34.020 836.197138 NaN

4 Afghanistan 1972 13079460 Asia 36.088 739.981106 NaN

... ... ... ... ... ... ... ...

1700 Zimbabwe 1992 10704340 Africa 60.377 693.420786 NaN

1701 Zimbabwe 1997 11404948 Africa 46.809 792.449960 NaN

1702 Zimbabwe 2002 11926563 Africa 39.989 672.038623 NaN

1703 Zimbabwe 2007 12311143 Africa 43.487 469.709298 NaN

1704 NaN 2000 13500000 NaN 37.080 900.230000 India

1705 rows × 7 columns

# After dropping the row


df = df.drop(3, axis=0)
df
country year population continent life_exp gdp_cap Country

0 Afghanistan 1952 8425333 Asia 28.801 779.445314 NaN

1 Afghanistan 1957 9240934 Asia 30.332 820.853030 NaN

2 Afghanistan 1962 10267083 Asia 31.997 853.100710 NaN

4 Afghanistan 1972 13079460 Asia 36.088 739.981106 NaN

5 Afghanistan 1977 14880372 Asia 38.438 786.113360 NaN

... ... ... ... ... ... ... ...

1700 Zimbabwe 1992 10704340 Africa 60.377 693.420786 NaN

1701 Zimbabwe 1997 11404948 Africa 46.809 792.449960 NaN

1702 Zimbabwe 2002 11926563 Africa 39.989 672.038623 NaN

1703 Zimbabwe 2007 12311143 Africa 43.487 469.709298 NaN

1704 NaN 2000 13500000 NaN 37.080 900.230000 India

1704 rows × 7 columns

Now we see that row with label 3 is deleted

We now have rows with labels 0, 1, 2, 4, 5, ...

keyboard_arrow_down Now df.loc[4] and df.iloc[4] will give different rows

df.loc[4] # The 4th row is printed

Country Afghanistan
year 1972
population 13079460
life_exp 36.088
gdp_cap 739.981106
Name: 4, dtype: object

df.iloc[4] # The 5th row is printed

Country Afghanistan
year 1977
population 14880372
life_exp 38.438
gdp_cap 786.11336
Name: 5, dtype: object

Why did this happen?

It is because the loc function selects rows using row labels (0, 1, 2, 4, etc.) whereas the iloc function selects rows using their integer positions
(starting from 0 and going up by one for each row).

So for iloc the 5th row starting from 0 index was printed

keyboard_arrow_down And how can we drop multiple rows?

df.drop([1, 2, 4], axis=0) # drops rows with labels 1, 2, 4

Let's reset our indices now

df.reset_index(drop=True,inplace=True) # Since we removed a row earlier, we reset our indices

Now if you remember, the last two rows were duplicates.

keyboard_arrow_down How can we deal with these duplicate rows?


Let's create some more duplicate rows to understand this

df.loc[len(df.index)] = ['India',2000,13500000,37.08,900.23]
df.loc[len(df.index)] = ['Sri Lanka',2022 ,130000000,80.00,500.00]
df.loc[len(df.index)] = ['Sri Lanka',2022 ,130000000,80.00,500.00]
df.loc[len(df.index)] = ['India',2000 ,13500000,80.00,900.23]
df
keyboard_arrow_down Now how can we check for duplicate rows?

Use duplicated() method on the DataFrame

df.duplicated()

It outputs True if an entire row is identical to a previous row.

However, it is not practical to see a list of True and False

We can use the Pandas loc data selector to extract those duplicate rows

# Extract duplicate rows


df.loc[df.duplicated()]

The boolean mask df.duplicated() selects the duplicate rows

Equivalently, we could write df.loc[df.duplicated(), :], where the second argument : selects all columns

keyboard_arrow_down Now how can we remove these duplicate rows ?

We can use drop_duplicates() of Pandas for this

df.drop_duplicates()

keyboard_arrow_down But how can we decide among all duplicate rows which ones we want to keep ?

Here we can use argument keep:

This controls how to consider duplicate values.

It has only three distinct values

first
last
False

The default is ‘first’.

If first , this considers the first occurrence as unique and the rest of the identical rows as duplicates.

df.drop_duplicates(keep='first')

If last , this considers the last occurrence as unique and the rest of the identical rows as duplicates.

df.drop_duplicates(keep='last')

If False , this considers all identical rows as duplicates.

df.drop_duplicates(keep=False)

keyboard_arrow_down What if you want to look for duplicates only in a few columns?

We can use the argument subset to mention the list of columns which we want to use.

df.drop_duplicates(subset=['Country'],keep='first')

keyboard_arrow_down How can we slice the dataframe into, say, first 4 rows and first 3 columns?

We can use iloc

df.iloc[0:4, 0:3]
country year population

0 Afghanistan 1952 8425333

1 Afghanistan 1957 9240934

2 Afghanistan 1962 10267083

3 Afghanistan 1967 11537966

Pass in 2 different ranges for slicing - one for row and one for column just like Numpy

Recall, iloc doesn't include the end index while slicing

keyboard_arrow_down Can we do the same thing with loc ?

df.loc[1:5, 1:4]

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-494208dc7680> in <cell line: 1>()
----> 1 df.loc[1:5, 1:4]

8 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in _maybe_cast_slice_bound(self, label, side, kind)
6621 # reject them, if index does not contain label
6622 if (is_float(label) or is_integer(label)) and label not in self:
-> 6623 raise self._invalid_indexer("slice", label)
6624
6625 return label

TypeError: cannot do slice indexing on Index with these indexers [1] of type int


keyboard_arrow_down Why doesn't slicing using integer indices work with loc ?

Recall, we need to work with explicit labels while using loc

df.loc[1:5, ['country','life_exp']]

country life_exp

1 Afghanistan 30.332

2 Afghanistan 31.997

3 Afghanistan 34.020

4 Afghanistan 36.088

5 Afghanistan 38.438

keyboard_arrow_down We can mention ranges using column labels as well in loc

df.loc[1:5, 'year':'population']

year population

1 1957 9240934

2 1962 10267083

3 1967 11537966

4 1972 13079460

5 1977 14880372

keyboard_arrow_down How can we get specific rows and columns?

df.iloc[[0,10,100], [0,2,3]]
country population continent

0 Afghanistan 8425333 Asia

10 Afghanistan 25268405 Asia

100 Bangladesh 70759295 Asia

We pass in those specific indices packed in []

keyboard_arrow_down Can we do step slicing?

Yes, just like we did in Numpy

df.iloc[1:10:2]

country year population continent life_exp gdp_cap

1 Afghanistan 1957 9240934 Asia 30.332 820.853030

3 Afghanistan 1967 11537966 Asia 34.020 836.197138

5 Afghanistan 1977 14880372 Asia 38.438 786.113360

7 Afghanistan 1987 13867957 Asia 40.822 852.395945

9 Afghanistan 1997 22227415 Asia 41.763 635.341351

keyboard_arrow_down Does step slicing work for loc too?

YES

df.loc[1:10:2]

country year population continent life_exp gdp_cap

1 Afghanistan 1957 9240934 Asia 30.332 820.853030

3 Afghanistan 1967 11537966 Asia 34.020 836.197138

5 Afghanistan 1977 14880372 Asia 38.438 786.113360

7 Afghanistan 1987 13867957 Asia 40.822 852.395945

9 Afghanistan 1997 22227415 Asia 41.763 635.341351

keyboard_arrow_down Pandas built-in operations


Let's select the feature 'life_exp'

le = df['life_exp']
le

0 28.801
1 30.332
2 31.997
3 34.020
4 36.088
...
1699 62.351
1700 60.377
1701 46.809
1702 39.989
1703 43.487
Name: life_exp, Length: 1704, dtype: float64
keyboard_arrow_down How can we find the mean of the col life_exp ?
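
The hidden cell most likely just calls the built-in mean() method (a minimal sketch):

le.mean() # ≈ 59.4744, the same value we get below from le.sum() / le.count()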

keyboard_arrow_down What other operations can we do?


sum()
count()
min()
max()

... and so on

Note:

We can see more methods by pressing "tab" after le.

le.sum()

101344.44467999999

le.count()

1704

keyboard_arrow_down What will we get if we divide sum() by count() ?

le.sum() / le.count()

59.474439366197174

It gives the mean of life expectancy

keyboard_arrow_down Sorting
If you notice, life_exp col is not sorted

keyboard_arrow_down How can we perform sorting in pandas ?

df.sort_values(['life_exp'])

country year population continent life_exp gdp_cap

1292 Rwanda 1992 7290203 Africa 23.599 737.068595

0 Afghanistan 1952 8425333 Asia 28.801 779.445314

552 Gambia 1952 284320 Africa 30.000 485.230659

36 Angola 1952 4232095 Africa 30.015 3520.610273

1344 Sierra Leone 1952 2143249 Africa 30.331 879.787736

... ... ... ... ... ... ...

1487 Switzerland 2007 7554661 Europe 81.701 37506.419070

695 Iceland 2007 301931 Europe 81.757 36180.789190

802 Japan 2002 127065841 Asia 82.000 28604.591900

671 Hong Kong, China 2007 6980412 Asia 82.208 39724.978670

803 Japan 2007 127467972 Asia 82.603 31656.068060

1704 rows × 6 columns

Rows get sorted based on values in life_exp column

By default, values are sorted in ascending order

keyboard_arrow_down How can we sort the rows in descending order?

df.sort_values(['life_exp'], ascending=False)
country year population continent life_exp gdp_cap

803 Japan 2007 127467972 Asia 82.603 31656.068060

671 Hong Kong, China 2007 6980412 Asia 82.208 39724.978670

802 Japan 2002 127065841 Asia 82.000 28604.591900

695 Iceland 2007 301931 Europe 81.757 36180.789190

1487 Switzerland 2007 7554661 Europe 81.701 37506.419070

... ... ... ... ... ... ...

1344 Sierra Leone 1952 2143249 Africa 30.331 879.787736

36 Angola 1952 4232095 Africa 30.015 3520.610273

552 Gambia 1952 284320 Africa 30.000 485.230659

0 Afghanistan 1952 8425333 Asia 28.801 779.445314

1292 Rwanda 1992 7290203 Africa 23.599 737.068595

1704 rows × 6 columns

Now the rows are sorted in descending order

keyboard_arrow_down Can we do sorting on multiple columns?

YES

df.sort_values(['year', 'life_exp'])

country year population continent life_exp gdp_cap

0 Afghanistan 1952 8425333 Asia 28.801 779.445314

552 Gambia 1952 284320 Africa 30.000 485.230659

36 Angola 1952 4232095 Africa 30.015 3520.610273

1344 Sierra Leone 1952 2143249 Africa 30.331 879.787736

1032 Mozambique 1952 6446316 Africa 31.286 468.526038

... ... ... ... ... ... ...

71 Australia 2007 20434176 Oceania 81.235 34435.367440

1487 Switzerland 2007 7554661 Europe 81.701 37506.419070

695 Iceland 2007 301931 Europe 81.757 36180.789190

671 Hong Kong, China 2007 6980412 Asia 82.208 39724.978670

803 Japan 2007 127467972 Asia 82.603 31656.068060

1704 rows × 6 columns

What exactly happened here?

Rows were first sorted based on 'year'

Then, rows with the same value of 'year' were sorted based on 'life_exp'
For example, all the 1952 rows appear together, sorted among themselves by life_exp

This way, we can do multi-level sorting of our data

keyboard_arrow_down How can we have different sorting orders for different columns in multi-level sorting?

df.sort_values(['year', 'life_exp'], ascending=[False, True])

country year population continent life_exp gdp_cap

1463 Swaziland 2007 1133066 Africa 39.613 4513.480643

1043 Mozambique 2007 19951656 Africa 42.082 823.685621

1691 Zambia 2007 11746035 Africa 42.384 1271.211593

1355 Sierra Leone 2007 6144562 Africa 42.568 862.540756

887 Lesotho 2007 2012649 Africa 42.592 1569.331442

... ... ... ... ... ... ...

408 Denmark 1952 4334000 Europe 70.780 9692.385245

1464 Sweden 1952 7124673 Europe 71.860 8527.844662

1080 Netherlands 1952 10381988 Europe 72.130 8941.571858

684 Iceland 1952 147962 Europe 72.490 7267.688428

1140 Norway 1952 3327728 Europe 72.670 10095.421720

1704 rows × 6 columns

Just pack True and False for respective columns in a list []

keyboard_arrow_down Concatenating DataFrames


keyboard_arrow_down Let's use a mini use-case of users and messages

users --> Stores the user details - IDs and Names of users

users = pd.DataFrame({"userid":[1, 2, 3], "name":["sharadh", "shahid", "khusalli"]})


users

userid name

0 1 sharadh

1 2 shahid

2 3 khusalli
msgs --> Stores the messages users have sent - User IDs and messages

msgs = pd.DataFrame({"userid":[1, 1, 2, 4], "msg":['hmm', "acha", "theek hai", "nice"]})


msgs

userid msg

0 1 hmm

1 1 acha

2 2 theek hai

3 4 nice

keyboard_arrow_down Can we combine these 2 DataFrames to form a single DataFrame?

pd.concat([users, msgs])

userid name msg

0 1 sharadh NaN

1 2 shahid NaN

2 3 khusalli NaN

0 1 NaN hmm

1 1 NaN acha

2 2 NaN theek hai

3 4 NaN nice

How exactly did concat work?

By default, axis=0 (row-wise) for concatenation

userid , being same in both DataFrames, was combined into a single column

First, the rows of the users dataframe were placed, with the values of column msg as NaN
Then, the rows of the msgs dataframe were placed, with the values of column name as NaN

The original indices of the rows were preserved

keyboard_arrow_down Now how can we make the indices unique for each row?

pd.concat([users, msgs], ignore_index = True)

userid name msg

0 1 sharadh NaN

1 2 shahid NaN

2 3 khusalli NaN

3 1 NaN hmm

4 1 NaN acha

5 2 NaN theek hai

6 4 NaN nice

keyboard_arrow_down How can we concatenate them horizontally?

pd.concat([users, msgs], axis=1)


userid name userid msg

0 1.0 sharadh 1 hmm

1 2.0 shahid 1 acha

2 3.0 khusalli 2 theek hai

3 NaN NaN 4 nice

As you can see here:

Both the dataframes are combined horizontally (column-wise)


It gives 2 userid columns with the same label but different positional (implicit) indices
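
Because the label userid is now duplicated, selecting it by name returns both columns; a minimal sketch (combined is just an illustrative name):

combined = pd.concat([users, msgs], axis=1)
combined['userid']   # returns a DataFrame with BOTH userid columns, since the label appears twice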

keyboard_arrow_down Merging Dataframes


So far we have only concatenated and not merged data

But what is the difference between concat and merge ?

concat

simply stacks multiple DataFrame together along an axis

merge

combines dataframes in a smart way based on values in shared columns

keyboard_arrow_down How can we know the name of the person who sent a particular message?

We need information from both the dataframes

So can we use pd.concat() for combining the dataframes ?

No

pd.concat([users, msgs], axis=1)

userid name userid msg

0 1.0 sharadh 1 hmm

1 2.0 shahid 1 acha

2 3.0 khusalli 2 theek hai

3 NaN NaN 4 nice

What are the problems with concat here?

concat simply combined/stacked the dataframes horizontally


If you notice, userid 3 from the users dataframe is stacked against userid 2 from the msgs dataframe
This way of stacking doesn't help us gain any insights
=> pd.concat() does not work according to the values in the columns

We need to merge the data

keyboard_arrow_down How can we join the dataframes ?

users.merge(msgs, on="userid")

userid name msg

0 1 sharadh hmm

1 1 sharadh acha

2 2 shahid theek hai

Notice that users has a userid = 3 but msgs does not

When we merge these dataframes the userid = 3 is not included


Similarly, userid = 4 is not present in users , and thus not included
Only the userid common in both dataframes is shown

What type of join is this?

Inner Join

Remember joins from SQL?

The on parameter specifies the key , similar to primary key in SQL
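
By default merge performs an inner join, so writing it out explicitly gives the same result as above:

users.merge(msgs, on="userid", how="inner")   # how="inner" is the default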

keyboard_arrow_down Now what join we want to use to get info of all the users and all the messages?

users.merge(msgs, on = "userid", how="outer")

userid name msg

0 1 sharadh hmm

1 1 sharadh acha

2 2 shahid theek hai

3 3 khusalli NaN

4 4 NaN nice

Note:

All missing values are replaced with NaN

keyboard_arrow_down And what if we want the info of all the users in the dataframe?

users.merge(msgs, on = "userid",how="left")

userid name msg

0 1 sharadh hmm

1 1 sharadh acha

2 2 shahid theek hai

3 3 khusalli NaN

keyboard_arrow_down Similarly, what if we want all the messages and info only for the users who sent a message?
users.merge(msgs, on = "userid", how="right")

userid name msg

0 1 sharadh hmm

1 1 sharadh acha

2 2 shahid theek hai

3 4 NaN nice

Note,

NaN in name can be thought of as an anonymous message

But sometimes the column names might be different even if they contain the same data

Let's rename our users column userid to id

users.rename(columns = {"userid": "id"}, inplace = True)


users

id name

0 1 sharadh

1 2 shahid

2 3 khusalli

keyboard_arrow_down Now, how can we merge the 2 dataframes when the key has a different name ?

users.merge(msgs, left_on="id", right_on="userid")

id name userid msg

0 1 sharadh 1 hmm

1 1 sharadh 1 acha

2 2 shahid 2 theek hai

Here,

left_on : Specifies the key of the 1st dataframe (users here)

right_on : Specifies the key of the 2nd dataframe (msgs here)
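
Note that with left_on/right_on both key columns (id and userid) are kept in the result; if that is redundant, one of them can be dropped afterwards:

users.merge(msgs, left_on="id", right_on="userid").drop(columns="userid")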

keyboard_arrow_down IMDB Movie Business Use-case (Introduction)


Imagine you are working as a Data Scientist for an Analytics firm

Your task is to analyse some movie trends for a client

IMDB has an online database of information related to movies

The database contains info spanning several years about:

Movies
Rating
Director
Popularity
Revenue & Budget

keyboard_arrow_down Lets download and read the IMDB dataset

File1: https://drive.google.com/file/d/1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd/view?usp=sharing
File2: https://drive.google.com/file/d/1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm/view?usp=sharing
import pandas as pd
import numpy as np

!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd

Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: /content/movies.csv
100% 112k/112k [00:00<00:00, 57.4MB/s]

!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm

Downloading...
From: https://drive.google.com/uc?id=1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
To: /content/directors.csv
100% 65.4k/65.4k [00:00<00:00, 62.8MB/s]

Here we have two csv files

movies.csv
directors.csv

movies = pd.read_csv('movies.csv')
movies.head()

Unnamed: 0 id budget popularity revenue title vote_average vote_count director_id year month day

0 0 43597 237000000 150 2787965087 Avatar 7.2 11800 4762 2009 Dec Thursday

Pirates of
the
1 1 43598 300000000 139 961000000 Caribbean: 6.9 4500 4763 2007 May Saturday
At World's
End

2 2 43599 245000000 107 880674609 Spectre 6.3 4466 4764 2015 Oct Monday

So what kind of questions can we ask from this dataset?

Top 10 most popular movies, using popularity


Or find some highest rated movies, using vote_average
We can find number of movies released per year too
Or maybe we can find highest budget movies in a year using both budget and year

But can we ask more interesting/deeper questions?


Do you think we can find the most productive directors?
Which directors produce high budget films?
Highest and lowest rated movies for every month in a particular year?

Notice, there's a column Unnamed: 0 which represents nothing but the index of a row.

keyboard_arrow_down How to get rid of this Unnamed: 0 col?

movies = pd.read_csv('movies.csv', index_col=0)


movies.head()

id budget popularity revenue title vote_average vote_count director_id year month day

0 43597 237000000 150 2787965087 Avatar 7.2 11800 4762 2009 Dec Thursday

Pirates of the
1 43598 300000000 139 961000000 Caribbean: At World's 6.9 4500 4763 2007 May Saturday
End

2 43599 245000000 107 880674609 Spectre 6.3 4466 4764 2015 Oct Monday

3 43600 250000000 112 1084939099 The Dark Knight Rises 7.6 9106 4765 2012 Jul Monday

index_col=0 explicitly states to treat the first column as the index

The default value is index_col=None
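
An equivalent clean-up, if the file has already been read without index_col, is to drop the column afterwards (movies_alt is just an illustrative name):

movies_alt = pd.read_csv('movies.csv').drop(columns=['Unnamed: 0'])
movies_alt.head()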

movies.shape
(1465, 11)
keyboard_arrow_down Pandas - 3
Content

Apply()
Grouping

groupby()

Group based Aggregates


Group based Filtering
Group based Apply

keyboard_arrow_down Importing Data


Let's first import our data and prepare it as we did in the last lecture

import pandas as pd
import numpy as np

!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm

movies = pd.read_csv('movies.csv', index_col=0)


directors = pd.read_csv('directors.csv',index_col=0)

Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: /content/movies.csv
100% 112k/112k [00:00<00:00, 77.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
To: /content/directors.csv
100% 65.4k/65.4k [00:00<00:00, 73.5MB/s]

keyboard_arrow_down IMDB Movie Business Use-case (Continued...)


In the previous lecture (Pandas-2), we concluded that:

Movie dataset contains info about movies, release, popularity, ratings and the director ID
Director dataset contains detailed info about the director

In this lecture we begin to perform some operations on the data

keyboard_arrow_down Merging the director and movie data


Now, how can we know the details about the Director of a particular movie?

We will have to merge these datasets

So, on which column should we merge the dfs ?

We will use the ID columns (representing unique director) in both the datasets

If you observe,

=> the director_id values in movies are taken from the id column of the directors dataframe

Thus we can merge our dataframes using these two columns as keys

Before that, let's first check the number of unique director values in our movies data

keyboard_arrow_down How do we get the number of unique directors in movies ?

movies['director_id'].nunique()

199
Recall,

we had learnt about nunique earlier

Similarly, for unique directors in the directors df

directors['id'].nunique()

2349

Summary:

Movies Dataset: 1465 rows, but only 199 unique directors


Directors Dataset: 2349 unique directors (= no of rows)

What can we infer from this?

=> The directors in movies appear to be a subset of the directors in directors (let's verify this)

keyboard_arrow_down Now, how can we check if all director_id values are present in id ?

movies['director_id'].isin(directors['id'])

0 True
1 True
2 True
3 True
5 True
...
4736 True
4743 True
4748 True
4749 True
4768 True
Name: director_id, Length: 1465, dtype: bool

The isin() method checks if the Dataframe column contains the specified value(s).

How is isin different from Python in ?

in works for one element at a time


isin does this for all the values in the column

If you notice,

This is like a boolean "mask"


It returns a boolean Series with the same index as the column
For rows whose director_id value is present in id it returns True, else False

keyboard_arrow_down How can we check if there is any False here?

np.all(movies['director_id'].isin(directors['id']))

True
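
Since isin() returns a boolean Series, the same check can also be written with the Series' own all() method:

movies['director_id'].isin(directors['id']).all()   # True -> every director_id has a match in directors['id']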

Let's finally merge our dataframes

Do we need to keep all the rows for movies?

YES

Do we need to keep all the rows of directors?

NO

only the ones for which we have a corresponding row in movies

keyboard_arrow_down So which join type do you think we should apply here ?

We can use LEFT JOIN


data = movies.merge(directors, how='left', left_on='director_id',right_on='id')
data

id_x budget popularity revenue title vote_average vote_count director_id year month day direct

0 43597 237000000 150 2787965087 Avatar 7.2 11800 4762 2009 Dec Thursday James

Pirates of
the
1 43598 300000000 139 961000000 Caribbean: 6.9 4500 4763 2007 May Saturday Gore
At World's
End

2 43599 245000000 107 880674609 Spectre 6.3 4466 4764 2015 Oct Monday Sam

The Dark
Ch
3 43600 250000000 112 1084939099 Knight 7.6 9106 4765 2012 Jul Monday
Rises

Spider-
4 43602 258000000 115 890871626 5.9 3576 4767 2007 May Tuesday S
Man 3

... ... ... ... ... ... ... ... ... ... ... ...

The Last
1460 48363 0 3 321952 7.9 64 4809 1978 May Monday Martin
Waltz

1461 48370 27000 19 3151130 Clerks 7.4 755 5369 1994 Sep Tuesday Ke

1462 48375 0 7 0 Rampage 6.0 131 5148 2009 Aug Friday

1463 48376 0 3 0 Slacker 6.4 77 5535 1990 Jul Friday Richard

El
1464 48395 220000 14 2040920 6.6 238 5097 1992 Sep Friday
Mariachi R

1465 rows × 14 columns

Notice, two strange new id columns id_x and id_y .

keyboard_arrow_down What do you think these newly created cols are?

Since a column named id is present in both the dataframes

id_x represents id values from movie df


id_y represents id values from directors df

Do you think any column is redundant here and can be dropped?

id_y is redundant as it is the same as director_id


And we don't require director_id any further

So we can simply drop these features

data.drop(['director_id','id_y'],axis=1,inplace=True)
data.head()

id_x budget popularity revenue title vote_average vote_count year month day director_name gender

0 43597 237000000 150 2787965087 Avatar 7.2 11800 2009 Dec Thursday James Cameron Male

Pirates of
the
1 43598 300000000 139 961000000 Caribbean: 6.9 4500 2007 May Saturday Gore Verbinski Male
At World's
End

2 43599 245000000 107 880674609 Spectre 6.3 4466 2015 Oct Monday Sam Mendes Male

The Dark
Christopher

keyboard_arrow_down Apply
Apply a function along an axis of the DataFrame or Series

Task: we want to convert our gender column data to numerical format

Basically,

0 for Male
1 for Female
keyboard_arrow_down How can we encode the column?

Let's first write a function to do it for a single value

def encode(data):
if data == "Male":
return 0
else:
return 1

keyboard_arrow_down Now how can we apply this function to the whole column?

data['gender'] = data['gender'].apply(encode)
data

id_x budget popularity revenue title vote_average vote_count year month day

0 43597 237000000 150 2787965087 Avatar 7.2 11800 2009 Dec Thursda

1 43598 300000000 139 961000000 Pirates of the Caribbean: At World's End 6.9 4500 2007 May Saturda

2 43599 245000000 107 880674609 Spectre 6.3 4466 2015 Oct Monda

3 43600 250000000 112 1084939099 The Dark Knight Rises 7.6 9106 2012 Jul Monda

4 43602 258000000 115 890871626 Spider-Man 3 5.9 3576 2007 May Tuesda

... ... ... ... ... ... ... ... ... ... .

1460 48363 0 3 321952 The Last Waltz 7.9 64 1978 May Monda

1461 48370 27000 19 3151130 Clerks 7.4 755 1994 Sep Tuesda

1462 48375 0 7 0 Rampage 6.0 131 2009 Aug Frida

1463 48376 0 3 0 Slacker 6.4 77 1990 Jul Frida

1464 48395 220000 14 2040920 El Mariachi 6.6 238 1992 Sep Frida

1465 rows × 12 columns

Notice how this is similar to using vectorization in Numpy

We can thus use apply to run a function over every value of a column
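
As an aside, the same encoding could also be done with map and a dictionary; a minimal sketch, assuming the column still holds the raw strings "Male"/"Female" (i.e. run this instead of the apply above, not after it):

data['gender'] = data['gender'].map({"Male": 0, "Female": 1})   # maps each value via the dictionary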

keyboard_arrow_down Applying a function using apply on multiple columns

Finding the sum of revenue and budget per movie

data[['revenue', 'budget']].apply(np.sum)

revenue 209866997305
budget 70353617179
dtype: int64

We can pass multiple cols by packing them within []

But there's a mistake here. We wanted our results per movie (per row)

But, we are getting the sum of the columns

keyboard_arrow_down Applying function with apply on rows using the axis

data[['revenue', 'budget']].apply(np.sum, axis=1)

0 3024965087
1 1261000000
2 1125674609
3 1334939099
4 1148871626
...
1460 321952
1461 3178130
1462 0
1463 0
1464 2260920
Length: 1465, dtype: int64

Every row of revenue was added to the same row of budget

What does this axis mean in apply ?

axis = 0 => it will apply to each column

axis = 1 => each row

By default axis = 0

=> apply() can be applied on any dataframe along any particular axis
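
A quick way to remember the axis argument (same columns as above, np.mean instead of np.sum):

data[['revenue', 'budget']].apply(np.mean)          # axis=0 (default): one value per column
data[['revenue', 'budget']].apply(np.mean, axis=1)  # axis=1: one value per row (per movie)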

keyboard_arrow_down Similarly, how can I find profit per movie (revenue-budget)?

def prof(x): # We define a function to calculate profit


return x['revenue']-x['budget']
data['profit'] = data[['revenue', 'budget']].apply(prof, axis = 1)
data

id_x budget popularity revenue title vote_average vote_count year month day director_name gend

0 43597 237000000 150 2787965087 Avatar 7.2 11800 2009 Dec Thursday James Cameron

Pirates of
the
1 43598 300000000 139 961000000 Caribbean: 6.9 4500 2007 May Saturday Gore Verbinski
At World's
End

2 43599 245000000 107 880674609 Spectre 6.3 4466 2015 Oct Monday Sam Mendes

The Dark
Christopher
3 43600 250000000 112 1084939099 Knight 7.6 9106 2012 Jul Monday
Nolan
Rises

Spider-
4 43602 258000000 115 890871626 5.9 3576 2007 May Tuesday Sam Raimi
Man 3

... ... ... ... ... ... ... ... ... ... ... ...

The Last
1460 48363 0 3 321952 7.9 64 1978 May Monday Martin Scorsese
Waltz

1461 48370 27000 19 3151130 Clerks 7.4 755 1994 Sep Tuesday Kevin Smith

1462 48375 0 7 0 Rampage 6.0 131 2009 Aug Friday Uwe Boll

1463 48376 0 3 0 Slacker 6.4 77 1990 Jul Friday Richard Linklater

El Robert
1464 48395 220000 14 2040920 6.6 238 1992 Sep Friday
Mariachi Rodriguez

1465 rows × 13 columns
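
Since revenue and budget are plain numeric columns, the same profit column can also be computed with vectorized arithmetic, which is typically much faster than apply with axis=1:

data['profit'] = data['revenue'] - data['budget']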

keyboard_arrow_down Grouping
What is Grouping ?

Simply put, it can be understood through three steps - Split, Apply, Combine


1. Split: Breaking up and grouping a DataFrame depending on the value of the specified key.

keyboard_arrow_down Group based Aggregates


2. Apply: Computing some function, usually an aggregate, transformation, or filtering, within the individual groups.

3. Combine: Merge the results of these operations into an output array.

Note: These steps are conceptual - pandas performs all of them for us inside groupby


We use different aggregate functions like mean, sum, min, max, count etc. on columns while grouping.

keyboard_arrow_down Grouping data director-wise

data.groupby('director_name')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f71cad3ead0>

Notice,

It's a DataFrameGroupBy type object


NOT a DataFrame type object

keyboard_arrow_down Number of groups our data is divided into?

data.groupby('director_name').ngroups

199

Based on this grouping, we can find which row indices belong to which group

data.groupby('director_name').groups

{'Adam McKay': [176, 323, 366, 505, 839, 916], 'Adam Shankman': [265, 300, 350, 404, 458, 843, 999, 1231], 'Alejandro
González Iñárritu': [106, 749, 1015, 1034, 1077, 1405], 'Alex Proyas': [95, 159, 514, 671, 873], 'Alexander Payne':
[793, 1006, 1101, 1211, 1281], 'Andrew Adamson': [11, 43, 328, 501, 947], 'Andrew Niccol': [533, 603, 701, 722, 1439],
'Andrzej Bartkowiak': [349, 549, 754, 911, 924], 'Andy Fickman': [517, 681, 909, 926, 973, 1023], 'Andy Tennant': [314,
320, 464, 593, 676, 885], 'Ang Lee': [99, 134, 748, 840, 1089, 1110, 1132, 1184], 'Anne Fletcher': [610, 650, 736, 789,
1206], 'Antoine Fuqua': [310, 338, 424, 467, 576, 808, 818, 1105], 'Atom Egoyan': [946, 1128, 1164, 1194, 1347, 1416],
'Barry Levinson': [313, 319, 471, 594, 878, 898, 1013, 1037, 1082, 1143, 1185, 1345, 1378], 'Barry Sonnenfeld': [13,
48, 90, 205, 591, 778, 783], 'Ben Stiller': [209, 212, 547, 562, 850], 'Bill Condon': [102, 307, 902, 1233, 1381],
'Bobby Farrelly': [352, 356, 481, 498, 624, 630, 654, 806, 928, 972, 1111], 'Brad Anderson': [1163, 1197, 1350, 1419,
1430], 'Brett Ratner': [24, 39, 188, 207, 238, 292, 405, 456, 920], 'Brian De Palma': [228, 255, 318, 439, 747, 905,
919, 1088, 1232, 1261, 1317, 1354], 'Brian Helgeland': [512, 607, 623, 742, 933], 'Brian Levant': [418, 449, 568, 761,
860, 1003], 'Brian Robbins': [416, 441, 669, 962, 988, 1115], 'Bryan Singer': [6, 32, 33, 44, 122, 216, 297, 1326],
'Cameron Crowe': [335, 434, 488, 503, 513, 698], 'Catherine Hardwicke': [602, 695, 724, 937, 1406, 1412], 'Chris
Columbus': [117, 167, 204, 218, 229, 509, 656, 897, 996, 1086, 1129], 'Chris Weitz': [17, 500, 794, 869, 1202, 1267],
'Christopher Nolan': [3, 45, 58, 59, 74, 565, 641, 1341], 'Chuck Russell': [177, 410, 657, 1069, 1097, 1339], 'Clint
Eastwood': [369, 426, 447, 482, 490, 520, 530, 535, 645, 727, 731, 786, 787, 899, 974, 986, 1167, 1190, 1313], 'Curtis
Hanson': [494, 579, 606, 711, 733, 1057, 1310], 'Danny Boyle': [527, 668, 1083, 1085, 1126, 1168, 1287, 1385], 'Darren
Aronofsky': [113, 751, 1187, 1328, 1363, 1458], 'Darren Lynn Bousman': [1241, 1243, 1283, 1338, 1440], 'David Ayer':
[50, 273, 741, 1024, 1146, 1407], 'David Cronenberg': [541, 767, 994, 1055, 1254, 1268, 1334], 'David Fincher': [62,
213, 253, 383, 398, 478, 522, 555, 618, 785], 'David Gordon Green': [543, 862, 884, 927, 1376, 1418, 1432, 1459],
'David Koepp': [443, 644, 735, 1041, 1209], 'David Lynch': [583, 1161, 1264, 1340, 1456], 'David O. Russell': [422,
556, 609, 896, 982, 989, 1229, 1304], 'David R. Ellis': [582, 634, 756, 888, 934], 'David Zucker': [569, 619, 965,
1052, 1175], 'Dennis Dugan': [217, 260, 267, 293, 303, 718, 780, 977, 1247], 'Donald Petrie': [427, 507, 570, 649, 858,
894, 1106, 1331], 'Doug Liman': [52, 148, 251, 399, 544, 1318, 1451], 'Edward Zwick': [92, 182, 346, 566, 791, 819,
825], 'F. Gary Gray': [308, 402, 491, 523, 697, 833, 1272, 1380], 'Francis Ford Coppola': [487, 559, 622, 646, 772,
1076, 1155, 1253, 1312], 'Francis Lawrence': [63, 72, 109, 120, 679], 'Frank Coraci': [157, 249, 275, 451, 577, 599,
963], 'Frank Oz': [193, 355, 473, 580, 712, 813, 987], 'Garry Marshall': [329, 496, 528, 571, 784, 893, 1029, 1169],
'Gary Fleder': [518, 667, 689, 867, 981, 1165], 'Gary Winick': [258, 797, 798, 804, 1454], 'Gavin O'Connor': [820, 841,
939, 953, 1444], 'George A. Romero': [250, 1066, 1096, 1278, 1367, 1396], 'George Clooney': [343, 450, 831, 966, 1302],
'George Miller': [78, 103, 233, 287, 1250, 1403, 1450], 'Gore Verbinski': [1, 8, 9, 107, 119, 633, 1040], 'Guillermo
del Toro': [35, 252, 419, 486, 1118], 'Gus Van Sant': [595, 1018, 1027, 1159, 1240, 1311, 1398], 'Guy Ritchie': [124,
215, 312, 1093, 1225, 1269, 1420], 'Harold Ramis': [425, 431, 558, 586, 788, 1137, 1166, 1325], 'Ivan Reitman': [274,
643, 816, 883, 910, 935, 1134, 1242], 'James Cameron': [0, 19, 170, 173, 344, 1100, 1320], 'James Ivory': [1125, 1152,
1180, 1291, 1293, 1390, 1397], 'James Mangold': [140, 141, 557, 560, 829, 845, 958, 1145], 'James Wan': [30, 617, 1002,
1047, 1337, 1417, 1424], 'Jan de Bont': [155, 224, 231, 270, 781], 'Jason Friedberg': [812, 1010, 1012, 1014, 1036],
'Jason Reitman': [792, 1092, 1213, 1295, 1299], 'Jaume Collet-Serra': [516, 540, 640, 725, 1011, 1189], 'Jay Roach':
[195, 359, 389, 397, 461, 703, 859, 1072], 'Jean-Pierre Jeunet': [423, 485, 605, 664, 765], 'Joe Dante': [284, 525,
638, 1226, 1298, 1428], 'Joe Wright': [85, 432, 553, 803, 814, 855], 'Joel Coen': [428, 670, 691, 707, 721, 889, 906,
980, 1157, 1238, 1305], 'Joel Schumacher': [128, 184, 348, 484, 572, 614, 652, 764, 876, 886, 1108, 1230, 1280], 'John
Carpenter': [537, 663, 686, 861, 938, 1028, 1080, 1102, 1329, 1371], 'John Glen': [601, 642, 801, 847, 864], 'John
Landis': [524, 868, 1276, 1384, 1435], 'John Madden': [457, 882, 1020, 1249, 1257], 'John McTiernan': [127, 214, 244,
351, 534, 563, 648, 782, 838, 1074], 'John Singleton': [294, 489, 732, 796, 1120, 1173, 1316], 'John Whitesell': [499,
632, 763, 1119, 1148], 'John Woo': [131, 142, 264, 371, 420, 675, 1182], 'Jon Favreau': [46, 54, 55, 382, 759, 1346],
'Jon M. Chu': [100, 225, 810, 1099, 1186], 'Jon Turteltaub': [64, 180, 372, 480, 760, 846, 1171], 'Jonathan Demme':
[277, 493, 1000, 1123, 1215], 'Jonathan Liebesman': [81, 143, 339, 1117, 1301], 'Judd Apatow': [321, 710, 717, 865,
881], 'Justin Lin': [38, 123, 246, 1437, 1447], 'Kenneth Branagh': [80, 197, 421, 879, 1094, 1277, 1288], 'Kenny
Ortega': [412, 852, 1228, 1315, 1365], 'Kevin Reynolds': [53, 502, 639, 1019, 1059], ...}

keyboard_arrow_down Now what if we want to extract data of a particular group from this list?

data.groupby('director_name').get_group('Alexander Payne')

id_x budget popularity revenue title vote_average vote_count year month day director_name gende

About
793 45163 30000000 19 105834556 6.7 362 2002 Dec Friday Alexander Payne
Schmidt

The
1006 45699 20000000 40 177243185 6.7 934 2011 Sep Friday Alexander Payne
Descendants

1101 46004 16000000 23 109502303 Sideways 6.9 478 2004 Oct Friday Alexander Payne

1211 46446 12000000 29 17654912 Nebraska 7.4 636 2013 Sep Saturday Alexander Payne

1281 46813 0 13 0 Election 6.7 270 1999 Apr Friday Alexander Payne

keyboard_arrow_down Extending this to finding an aggregate metric of the data

How can we find the count of movies by each director?

data.groupby('director_name')['title'].count()

director_name
Adam McKay 6
Adam Shankman 8
Alejandro González Iñárritu 6
Alex Proyas 5
Alexander Payne 5
..
Wes Craven 10
Wolfgang Petersen 7
Woody Allen 18
Zack Snyder 7
Zhang Yimou 6
Name: title, Length: 199, dtype: int64

keyboard_arrow_down Finding multiple aggregations of any feature

Finding the very first year and the latest year a director released a movie i.e basically the min and max of year column, grouped by director

data.groupby(['director_name'])["year"].aggregate(['min', 'max'])
# note: can also use .agg instead of .aggregate (both are same)
min max

director_name

Adam McKay 2004 2015

Adam Shankman 2001 2012

Alejandro González Iñárritu 2000 2015

Alex Proyas 1994 2016

Alexander Payne 1999 2013

... ... ...

Wes Craven 1984 2011

Wolfgang Petersen 1981 2006

Woody Allen 1977 2013

Zack Snyder 2004 2016

Zhang Yimou 2002 2014

199 rows × 2 columns

keyboard_arrow_down Group based Filtering


Group based filtering allows us to filter rows from each group by using conditional statements on each group rather than the whole
dataframe.

keyboard_arrow_down Finding the details of the movies by high budget directors

Let's assume,

high budget director -> any director with at least one movie with a budget of 100M or more

1. We can get the highest budget movie data of every director

data_dir_budget = data.groupby("director_name")["budget"].max().reset_index()
data_dir_budget.head()

director_name budget

0 Adam McKay 100000000

1 Adam Shankman 80000000

2 Alejandro González Iñárritu 135000000

3 Alex Proyas 140000000

4 Alexander Payne 30000000

2. We can filter out the director names whose highest budget is at least 100M

# budget is stored in dollars, so the 100M threshold is written as 100_000_000
names = data_dir_budget.loc[data_dir_budget["budget"] >= 100_000_000, "director_name"]

3. Finally, we can filter out the details of the movies by these directors

data.loc[data['director_name'].isin(names)]
id_x budget popularity revenue title vote_average vote_count year month day director_name gend

0 43597 237000000 150 2787965087 Avatar 7.2 11800 2009 Dec Thursday James Cameron

Pirates of
the
1 43598 300000000 139 961000000 Caribbean: 6.9 4500 2007 May Saturday Gore Verbinski
At World's
End

2 43599 245000000 107 880674609 Spectre 6.3 4466 2015 Oct Monday Sam Mendes

The Dark
Christopher
3 43600 250000000 112 1084939099 Knight 7.6 9106 2012 Jul Monday
Nolan
Rises

Spider-
4 43602 258000000 115 890871626 5.9 3576 2007 May Tuesday Sam Raimi
Man 3

... ... ... ... ... ... ... ... ... ... ... ...

The Last
1460 48363 0 3 321952 7.9 64 1978 May Monday Martin Scorsese
Waltz

1461 48370 27000 19 3151130 Clerks 7.4 755 1994 Sep Tuesday Kevin Smith

1462 48375 0 7 0 Rampage 6.0 131 2009 Aug Friday Uwe Boll

1463 48376 0 3 0 Slacker 6.4 77 1990 Jul Friday Richard Linklater

El Robert
1464 48395 220000 14 2040920 6.6 238 1992 Sep Friday
Mariachi Rodriguez

1465 rows × 13 columns

keyboard_arrow_down Filtering groups in a single go using Lambda Function


# same 100M (i.e. 100_000_000) threshold as above, applied per group
data.groupby('director_name').filter(lambda x: x["budget"].max() >= 100_000_000)

id_x budget popularity revenue title vote_average vote_count year month day director_name gend

0 43597 237000000 150 2787965087 Avatar 7.2 11800 2009 Dec Thursday James Cameron

Pirates of
the
1 43598 300000000 139 961000000 Caribbean: 6.9 4500 2007 May Saturday Gore Verbinski
At World's
End

2 43599 245000000 107 880674609 Spectre 6.3 4466 2015 Oct Monday Sam Mendes

The Dark
Christopher
3 43600 250000000 112 1084939099 Knight 7.6 9106 2012 Jul Monday
Nolan
Rises

Spider-
4 43602 258000000 115 890871626 5.9 3576 2007 May Tuesday Sam Raimi
Man 3

... ... ... ... ... ... ... ... ... ... ... ...

The Last
1460 48363 0 3 321952 7.9 64 1978 May Monday Martin Scorsese
Waltz

1461 48370 27000 19 3151130 Clerks 7.4 755 1994 Sep Tuesday Kevin Smith

1462 48375 0 7 0 Rampage 6.0 131 2009 Aug Friday Uwe Boll

1463 48376 0 3 0 Slacker 6.4 77 1990 Jul Friday Richard Linklater

El Robert
1464 48395 220000 14 2040920 6.6 238 1992 Sep Friday
Mariachi Rodriguez

1465 rows × 13 columns

Notice what's happening here?

We first group data by director and then use groupby().filter function


Groups are filtered out if they do not satisfy the boolean criterion specified by the function
This is called Group Based Filtering

NOTE

We are filtering the groups here and not the rows


==> The result is not a groupby object but a regular pandas DataFrame with the filtered groups eliminated
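
A quick way to confirm this: the filtered object is an ordinary DataFrame, and only the high budget directors survive (a small sketch reusing the 100M threshold from above; high_budget is just an illustrative name):

high_budget = data.groupby('director_name').filter(lambda x: x["budget"].max() >= 100_000_000)
type(high_budget)                        # pandas.core.frame.DataFrame, not a GroupBy object
high_budget['director_name'].nunique()   # number of directors left after the group filter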

keyboard_arrow_down Group based Apply


Applying a function on grouped objects

keyboard_arrow_down Filtering risky movies?

Let's assume we call a movie risky if

its budget is higher than the average revenue of its director

We can subtract each director's average revenue from the budget column, for every movie of that director

def func(x):
# a boolean returning function for whether the movie is risky or not
x["risky"] = x["budget"] - x["revenue"].mean() >= 0
return x

data_risky = data.groupby("director_name", group_keys=False).apply(func)

# setting group_keys=True keeps the group key in the returned dataset (this will be the default in a future version of pandas)
# keep it False if you want the original row-indexed behaviour

data_risky

id_x budget popularity revenue title vote_average vote_count year month day director_name gend

0 43597 237000000 150 2787965087 Avatar 7.2 11800 2009 Dec Thursday James Cameron

Pirates of
the
1 43598 300000000 139 961000000 Caribbean: 6.9 4500 2007 May Saturday Gore Verbinski
At World's
End

2 43599 245000000 107 880674609 Spectre 6.3 4466 2015 Oct Monday Sam Mendes

The Dark
Christopher
3 43600 250000000 112 1084939099 Knight 7.6 9106 2012 Jul Monday
Nolan
Rises

Spider-
4 43602 258000000 115 890871626 5.9 3576 2007 May Tuesday Sam Raimi
Man 3

... ... ... ... ... ... ... ... ... ... ... ...

The Last
1460 48363 0 3 321952 7.9 64 1978 May Monday Martin Scorsese
Waltz

1461 48370 27000 19 3151130 Clerks 7.4 755 1994 Sep Tuesday Kevin Smith

1462 48375 0 7 0 Rampage 6.0 131 2009 Aug Friday Uwe Boll

1463 48376 0 3 0 Slacker 6.4 77 1990 Jul Friday Richard Linklater

El Robert
1464 48395 220000 14 2040920 6.6 238 1992 Sep Friday
Mariachi Rodriguez

1465 rows × 14 columns

What did we do here?

Defined a custom function


Grouped data according to director_name
Subtracted each group's mean revenue from the budget of every movie in that group
Used apply with the custom function on the grouped data

Let's see if there are any risky movies

data_risky.loc[data_risky["risky"]]
id_x budget popularity revenue title vote_average vote_count year month day director_name ge

Quantum
7 43608 200000000 107 586090727 6.1 2965 2008 Oct Thursday Marc Forster
of Solace

Pirates of
the
Caribbean:
12 43614 380000000 135 1045713802 6.4 4948 2011 May Saturday Rob Marshall
On
Stranger
Tides

Robin
15 43618 200000000 37 310669540 6.2 1398 2010 May Wednesday Ridley Scott
Hood

20 43624 209000000 64 303025485 Battleship 5.5 2114 2012 Apr Wednesday Peter Berg

X-Men:
24 43630 210000000 3 459359555 The Last 6.3 3525 2006 May Wednesday Brett Ratner
Stand
keyboard_arrow_down Pandas - 4
Content

Multi-indexing
Restructuring data

pd.melt()
pd.pivot()
pd.pivot_table()
pd.cut()

keyboard_arrow_down Importing Data


import pandas as pd
import numpy as np

!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm

movies = pd.read_csv('movies.csv', index_col=0)


directors = pd.read_csv('directors.csv',index_col=0)

data = movies.merge(directors, how='left', left_on='director_id',right_on='id')


data.drop(['director_id','id_y'],axis=1,inplace=True)

Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: /content/movies.csv
100% 112k/112k [00:00<00:00, 15.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
To: /content/directors.csv
100% 65.4k/65.4k [00:00<00:00, 86.7MB/s]

keyboard_arrow_down Multi-Indexing
keyboard_arrow_down Task: Which director would be considered the most productive ?

Should we decide based on the number of movies released by a director?

Or

take quality into consideration as well?

Or

consider the amount of business the movie is doing?

To simplify,

Lets calculate who has directed maximum number of movies

data.groupby(['director_name'])['title'].count().sort_values(ascending=False)

director_name
Steven Spielberg 26
Clint Eastwood 19
Martin Scorsese 19
Woody Allen 18
Robert Rodriguez 16
..
Paul Weitz 5
John Madden 5
Paul Verhoeven 5
John Whitesell 5
Kevin Reynolds 5
Name: title, Length: 199, dtype: int64

Steven Spielberg has directed the maximum number of movies. But does that make Steven the most productive director?
Chances are, he might have been active for more years than the other directors

keyboard_arrow_down Calculating active years for every director

We can subtract the min of year from the max of year

data_agg = data.groupby(['director_name'])[["year", "title"]].aggregate({"year":['min','max'], "title": "count"})


data_agg

year title

min max count

director_name

Adam McKay 2004 2015 6

Adam Shankman 2001 2012 8

Alejandro González Iñárritu 2000 2015 6

Alex Proyas 1994 2016 5

Alexander Payne 1999 2013 5

... ... ... ...

Wes Craven 1984 2011 10

Wolfgang Petersen 1981 2006 7

Woody Allen 1977 2013 18

Zack Snyder 2004 2016 7

Zhang Yimou 2002 2014 6

199 rows × 3 columns

Notice,

director_name column has turned into row labels


There are multiple levels for the column names

This is called Multi-index Dataframe

keyboard_arrow_down What is Multi-index Dataframe ?

It can have multiple levels of indexes along a dimension

the number of dimensions remains the same though => still 2D


Multi-level indexes are possible both for rows and columns

data_agg.columns #Printing the columns for better clarity

MultiIndex([( 'year', 'min'),


( 'year', 'max'),
('title', 'count')],
)

The level-1 column names are year and title
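
A single column of such a multi-index frame can be pulled out by indexing with a (level-0, level-1) tuple:

data_agg[('year', 'min')].head()   # the min year of each director, as a regular Series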

keyboard_arrow_down What would happen if we print the col year of this multi-index dataframe?

data_agg["year"]
min max

director_name

Adam McKay 2004 2015

Adam Shankman 2001 2012

Alejandro González Iñárritu 2000 2015

Alex Proyas 1994 2016

Alexander Payne 1999 2013

... ... ...

Wes Craven 1984 2011

Wolfgang Petersen 1981 2006

Woody Allen 1977 2013

Zack Snyder 2004 2016

Zhang Yimou 2002 2014

199 rows × 2 columns

keyboard_arrow_down How can we convert multi-level back to only one level of columns?

Example: year_min , year_max , title_count

data_agg = data.groupby(['director_name'])[["year","title"]].aggregate(
{"year":['min', 'max'], "title": "count"}) #The column names are not aligned properly

data_agg.columns = ['_'.join(col) for col in data_agg.columns]


data_agg

year_min year_max title_count

director_name

Adam McKay 2004 2015 6

Adam Shankman 2001 2012 8

Alejandro González Iñárritu 2000 2015 6

Alex Proyas 1994 2016 5

Alexander Payne 1999 2013 5

... ... ... ...

Wes Craven 1984 2011 10

Wolfgang Petersen 1981 2006 7

Woody Allen 1977 2013 18

Zack Snyder 2004 2016 7

Zhang Yimou 2002 2014 6

199 rows × 3 columns

Since these were tuples, we can just join them

# another, simpler method using named aggregation

data.groupby('director_name')[['year', 'title']].aggregate(
year_max=('year','max'),
year_min=('year','min'),
title_count=('title','count')
)
year_max year_min title_count

director_name

Adam McKay 2015 2004 6

Adam Shankman 2012 2001 8

Alejandro González Iñárritu 2015 2000 6

Alex Proyas 2016 1994 5

Alexander Payne 2013 1999 5

... ... ... ...

Wes Craven 2011 1984 10

Wolfgang Petersen 2006 1981 7

Woody Allen 2013 1977 18

Zack Snyder 2016 2004 7

Zhang Yimou 2014 2002 6

199 rows × 3 columns

Columns look good, but we may want to turn back the row labels into a proper column as well

keyboard_arrow_down converting row labels into a column using reset_index

data_agg.reset_index()

director_name year_min year_max title_count

0 Adam McKay 2004 2015 6

1 Adam Shankman 2001 2012 8

2 Alejandro González Iñárritu 2000 2015 6

3 Alex Proyas 1994 2016 5

4 Alexander Payne 1999 2013 5

... ... ... ... ...

194 Wes Craven 1984 2011 10

195 Wolfgang Petersen 1981 2006 7

196 Woody Allen 1977 2013 18

197 Zack Snyder 2004 2016 7

198 Zhang Yimou 2002 2014 6

199 rows × 4 columns

keyboard_arrow_down Using the new features, can we find the most productive director?

First calculate how many years the director has been active.

data_agg["yrs_active"] = data_agg["year_max"] - data_agg["year_min"]


data_agg
year_min year_max title_count yrs_active

director_name

Adam McKay 2004 2015 6 11

Adam Shankman 2001 2012 8 11

Alejandro González Iñárritu 2000 2015 6 15

Alex Proyas 1994 2016 5 22

Alexander Payne 1999 2013 5 14

... ... ... ... ...

Wes Craven 1984 2011 10 27

Wolfgang Petersen 1981 2006 7 25

Woody Allen 1977 2013 18 36

Zack Snyder 2004 2016 7 12

Zhang Yimou 2002 2014 6 12

199 rows × 4 columns

Then calculate the rate of directing movies as title_count / yrs_active

data_agg["movie_per_yr"] = data_agg["title_count"] / data_agg["yrs_active"]


data_agg

year_min year_max title_count yrs_active movie_per_yr

director_name

Adam McKay 2004 2015 6 11 0.545455

Adam Shankman 2001 2012 8 11 0.727273

Alejandro González Iñárritu 2000 2015 6 15 0.400000

Alex Proyas 1994 2016 5 22 0.227273

Alexander Payne 1999 2013 5 14 0.357143

... ... ... ... ... ...

Wes Craven 1984 2011 10 27 0.370370

Wolfgang Petersen 1981 2006 7 25 0.280000

Woody Allen 1977 2013 18 36 0.500000

Zack Snyder 2004 2016 7 12 0.583333

Zhang Yimou 2002 2014 6 12 0.500000

199 rows × 5 columns

Now finally sort the values

data_agg.sort_values("movie_per_yr", ascending=False)
year_min year_max title_count yrs_active movie_per_yr

director_name

Tyler Perry 2006 2013 9 7 1.285714

Jason Friedberg 2006 2010 5 4 1.250000

Shawn Levy 2002 2014 11 12 0.916667

Robert Rodriguez 1992 2014 16 22 0.727273

Adam Shankman 2001 2012 8 11 0.727273

... ... ... ... ... ...

Lawrence Kasdan 1985 2012 5 27 0.185185

Luc Besson 1985 2014 5 29 0.172414

Robert Redford 1980 2010 5 30 0.166667

Sidney Lumet 1976 2006 5 30 0.166667

Michael Apted 1980 2010 5 30 0.166667

199 rows × 5 columns

Conclusion:

Tyler Perry turns out to be the most productive director by this metric
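
The same answer can be read off directly with idxmax, which returns the index label (here the director_name) of the maximum value:

data_agg["movie_per_yr"].idxmax()   # 'Tyler Perry'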

keyboard_arrow_down Importing our data


For this topic we will be using data of a few drugs being developed by Pfizer

Link: https://drive.google.com/file/d/173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ/view?usp=sharing

!gdown 173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ

Downloading...
From: https://drive.google.com/uc?id=173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
To: /content/Pfizer_1.csv
100% 1.51k/1.51k [00:00<00:00, 8.41MB/s]

What is the data about?

Temperature (K)
Pressure (P)

are recorded at an interval of 1 hour every day to monitor drug stability in a drug development test

==> These data points are thus used to identify the optimal set of parameter values for the stability of the drugs

keyboard_arrow_down Now, Let's explore this dataset

data = pd.read_csv('Pfizer_1.csv')
data
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00

0 15-10-2020 diltiazem hydrochloride Temperature 23.0 22.0 NaN 21.0 21.0 22 23.0 21.0 22.0

1 15-10-2020 diltiazem hydrochloride Pressure 12.0 13.0 NaN 11.0 13.0 14 16.0 16.0 24.0

2 15-10-2020 docetaxel injection Temperature NaN 17.0 18.0 NaN 17.0 18 NaN NaN 23.0

3 15-10-2020 docetaxel injection Pressure NaN 22.0 22.0 NaN 22.0 23 NaN NaN 27.0

4 15-10-2020 ketamine hydrochloride Temperature 24.0 NaN NaN 27.0 NaN 26 25.0 24.0 23.0

5 15-10-2020 ketamine hydrochloride Pressure 8.0 NaN NaN 7.0 NaN 9 10.0 11.0 10.0

6 16-10-2020 diltiazem hydrochloride Temperature 34.0 35.0 36.0 36.0 37.0 38 37.0 38.0 39.0

7 16-10-2020 diltiazem hydrochloride Pressure 18.0 19.0 20.0 21.0 22.0 23 24.0 25.0 25.0

8 16-10-2020 docetaxel injection Temperature 46.0 47.0 NaN 48.0 48.0 49 50.0 52.0 55.0

9 16-10-2020 docetaxel injection Pressure 23.0 24.0 NaN 25.0 26.0 27 28.0 29.0 28.0

10 16-10-2020 ketamine hydrochloride Temperature 8.0 9.0 10.0 NaN 11.0 12 12.0 11.0 NaN

11 16-10-2020 ketamine hydrochloride Pressure 12.0 12.0 13.0 NaN 15.0 15 15.0 15.0 NaN

12 17-10-2020 diltiazem hydrochloride Temperature 20.0 19.0 19.0 18.0 17.0 16 15.0 NaN 13.0

13 17-10-2020 diltiazem hydrochloride Pressure 3.0 4.0 4.0 4.0 6.0 8 9.0 NaN 9.0

14 17-10-2020 docetaxel injection Temperature 12.0 13.0 14.0 15.0 16.0 17 18.0 19.0 20.0

15 17-10-2020 docetaxel injection Pressure 20.0 22.0 22.0 22.0 22.0 23 25.0 26.0 27.0

16 17-10-2020 ketamine hydrochloride Temperature 13.0 14.0 15.0 16.0 17.0 18 19.0 20.0 21.0

17 17-10-2020 ketamine hydrochloride Pressure 8.0 9.0 10.0 11.0 11.0 12 12.0 11.0 12.0

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 18 non-null object
1 Drug_Name 18 non-null object
2 Parameter 18 non-null object
3 1:30:00 16 non-null float64
4 2:30:00 16 non-null float64
5 3:30:00 12 non-null float64
6 4:30:00 14 non-null float64
7 5:30:00 16 non-null float64
8 6:30:00 18 non-null int64
9 7:30:00 16 non-null float64
10 8:30:00 14 non-null float64
11 9:30:00 16 non-null float64
12 10:30:00 18 non-null int64
13 11:30:00 16 non-null float64
14 12:30:00 18 non-null int64
dtypes: float64(9), int64(3), object(3)
memory usage: 2.2+ KB

keyboard_arrow_down Melting in Pandas


As we saw earlier, the dataset has 18 rows and 15 columns

If you notice further, you'll see:

The columns are 1:30:00 , 2:30:00 , 3:30:00 , ... so on

Temperature and Pressure of each date is in a separate row

keyboard_arrow_down Can we restructure our data into a better format?

Maybe we can have a column for time , with timestamps as the column value

Where will the Temperature/Pressure values go?

We can similarly create one column containing the values of these parameters

==> "Melt" timestamp columns into two columns - timestamp and corresponding values

How can we restructure our data so that every row corresponds to a single reading?
pd.melt(data, id_vars=['Date', 'Parameter', 'Drug_Name'])

Date Parameter Drug_Name variable value

0 15-10-2020 Temperature diltiazem hydrochloride 1:30:00 23.0

1 15-10-2020 Pressure diltiazem hydrochloride 1:30:00 12.0

2 15-10-2020 Temperature docetaxel injection 1:30:00 NaN

3 15-10-2020 Pressure docetaxel injection 1:30:00 NaN

4 15-10-2020 Temperature ketamine hydrochloride 1:30:00 24.0

... ... ... ... ... ...

211 17-10-2020 Pressure diltiazem hydrochloride 12:30:00 14.0

212 17-10-2020 Temperature docetaxel injection 12:30:00 23.0

213 17-10-2020 Pressure docetaxel injection 12:30:00 28.0

214 17-10-2020 Temperature ketamine hydrochloride 12:30:00 24.0

215 17-10-2020 Pressure ketamine hydrochloride 12:30:00 15.0

216 rows × 5 columns

This converts our data from wide to long format

Notice that id_vars are the set of columns which remain unmelted

How does pd.melt() work?

Pass in the DataFrame


Pass in the column names to not melt

But we can provide better names to these new columns

keyboard_arrow_down How can we rename the columns "variable" and "value" as per our original dataframe?

data_melt = pd.melt(data,id_vars = ['Date', 'Drug_Name', 'Parameter'],


var_name = "time",
value_name = 'reading')

data_melt

Date Drug_Name Parameter time reading

0 15-10-2020 diltiazem hydrochloride Temperature 1:30:00 23.0

1 15-10-2020 diltiazem hydrochloride Pressure 1:30:00 12.0

2 15-10-2020 docetaxel injection Temperature 1:30:00 NaN

3 15-10-2020 docetaxel injection Pressure 1:30:00 NaN

4 15-10-2020 ketamine hydrochloride Temperature 1:30:00 24.0

... ... ... ... ... ...

211 17-10-2020 diltiazem hydrochloride Pressure 12:30:00 14.0

212 17-10-2020 docetaxel injection Temperature 12:30:00 23.0

213 17-10-2020 docetaxel injection Pressure 12:30:00 28.0

214 17-10-2020 ketamine hydrochloride Temperature 12:30:00 24.0

215 17-10-2020 ketamine hydrochloride Pressure 12:30:00 15.0

216 rows × 5 columns

Conclusion

The labels of the timestamp columns are conveniently melted into a single column - time

All the values are retained in the column reading

The labels of columns such as 1:30:00 , 2:30:00 have now become categories of the time column

The values from the columns we melted are stored in the reading column
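
A quick sanity check that nothing was lost in the melt: the original frame had 18 rows and 12 timestamp columns, so the long format should hold 18 × 12 = 216 readings:

data_melt.shape   # (216, 5) -> 18 rows x 12 timestamp columns = 216 readings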
keyboard_arrow_down Pivot
Now suppose we want to convert our data back to wide format

The reason could be to restore the original structure for storage or some other purpose.

Notice:

The variables Date , Drug_Name and Parameter will remain same

The column names will be extracted from the column time

The values will be extracted from the column reading

keyboard_arrow_down How can we restructure our data back to the original wide format, before it was melted?

data_melt.pivot(index=['Date','Drug_Name','Parameter'], # Column to use to make new frame’s index


columns = 'time', # Column to use to make new frame’s columns
values='reading') # Columns to use for populating new frame’s values.

time 10:30:00 11:30:00 12:30:00 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:0

Date Drug_Name Parameter

15-10-2020 diltiazem hydrochloride Pressure 18.0 19.0 20.0 12.0 13.0 NaN 11.0 13.0 14

Temperature 20.0 20.0 21.0 23.0 22.0 NaN 21.0 21.0 22

docetaxel injection Pressure 26.0 29.0 28.0 NaN 22.0 22.0 NaN 22.0 23

Temperature 23.0 25.0 25.0 NaN 17.0 18.0 NaN 17.0 18

ketamine hydrochloride Pressure 9.0 9.0 11.0 8.0 NaN NaN 7.0 NaN 9

Temperature 22.0 21.0 20.0 24.0 NaN NaN 27.0 NaN 26

16-10-2020 diltiazem hydrochloride Pressure 24.0 NaN 27.0 18.0 19.0 20.0 21.0 22.0 23

Temperature 40.0 NaN 42.0 34.0 35.0 36.0 36.0 37.0 38

docetaxel injection Pressure 28.0 29.0 30.0 23.0 24.0 NaN 25.0 26.0 27

Temperature 56.0 57.0 58.0 46.0 47.0 NaN 48.0 48.0 49

ketamine hydrochloride Pressure 16.0 17.0 18.0 12.0 12.0 13.0 NaN 15.0 15

Temperature 13.0 14.0 15.0 8.0 9.0 10.0 NaN 11.0 12

17-10-2020 diltiazem hydrochloride Pressure 11.0 13.0 14.0 3.0 4.0 4.0 4.0 6.0 8

Temperature 14.0 11.0 10.0 20.0 19.0 19.0 18.0 17.0 16

docetaxel injection Pressure 28.0 29.0 28.0 20.0 22.0 22.0 22.0 22.0 23

Temperature 21.0 22.0 23.0 12.0 13.0 14.0 15.0 16.0 17

ketamine hydrochloride Pressure 13.0 14.0 15.0 8.0 9.0 10.0 11.0 11.0 12

Temperature 22.0 23.0 24.0 13.0 14.0 15.0 16.0 17.0 18

Notice,

pivot() is the exact opposite of melt

We are getting multiple indices here, but we can get a single index again using reset_index

data_melt.pivot(index=['Date','Drug_Name','Parameter'],
columns = 'time',
values='reading').reset_index()
time Date Drug_Name Parameter 10:30:00 11:30:00 12:30:00 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:

0 15-10-2020 diltiazem hydrochloride Pressure 18.0 19.0 20.0 12.0 13.0 NaN 11.0 13.0

1 15-10-2020 diltiazem hydrochloride Temperature 20.0 20.0 21.0 23.0 22.0 NaN 21.0 21.0

2 15-10-2020 docetaxel injection Pressure 26.0 29.0 28.0 NaN 22.0 22.0 NaN 22.0

3 15-10-2020 docetaxel injection Temperature 23.0 25.0 25.0 NaN 17.0 18.0 NaN 17.0

4 15-10-2020 ketamine hydrochloride Pressure 9.0 9.0 11.0 8.0 NaN NaN 7.0 NaN

5 15-10-2020 ketamine hydrochloride Temperature 22.0 21.0 20.0 24.0 NaN NaN 27.0 NaN

6 16-10-2020 diltiazem hydrochloride Pressure 24.0 NaN 27.0 18.0 19.0 20.0 21.0 22.0

7 16-10-2020 diltiazem hydrochloride Temperature 40.0 NaN 42.0 34.0 35.0 36.0 36.0 37.0

8 16-10-2020 docetaxel injection Pressure 28.0 29.0 30.0 23.0 24.0 NaN 25.0 26.0

9 16-10-2020 docetaxel injection Temperature 56.0 57.0 58.0 46.0 47.0 NaN 48.0 48.0

10 16-10-2020 ketamine hydrochloride Pressure 16.0 17.0 18.0 12.0 12.0 13.0 NaN 15.0

11 16-10-2020 ketamine hydrochloride Temperature 13.0 14.0 15.0 8.0 9.0 10.0 NaN 11.0

12 17-10-2020 diltiazem hydrochloride Pressure 11.0 13.0 14.0 3.0 4.0 4.0 4.0 6.0

13 17-10-2020 diltiazem hydrochloride Temperature 14.0 11.0 10.0 20.0 19.0 19.0 18.0 17.0

14 17-10-2020 docetaxel injection Pressure 28.0 29.0 28.0 20.0 22.0 22.0 22.0 22.0

15 17-10-2020 docetaxel injection Temperature 21.0 22.0 23.0 12.0 13.0 14.0 15.0 16.0

16 17-10-2020 ketamine hydrochloride Pressure 13.0 14.0 15.0 8.0 9.0 10.0 11.0 11.0

17 17-10-2020 ketamine hydrochloride Temperature 22.0 23.0 24.0 13.0 14.0 15.0 16.0 17.0

data_melt.head()

Date Drug_Name Parameter time reading

0 15-10-2020 diltiazem hydrochloride Temperature 1:30:00 23.0

1 15-10-2020 diltiazem hydrochloride Pressure 1:30:00 12.0

2 15-10-2020 docetaxel injection Temperature 1:30:00 NaN

3 15-10-2020 docetaxel injection Pressure 1:30:00 NaN

4 15-10-2020 ketamine hydrochloride Temperature 1:30:00 24.0

Now if you notice,

We are using 2 rows to log readings for a single experiment.

keyboard_arrow_down Can we further restructure our data into dividing the Parameter column into T/P?

A format like:

Date | time | Drug_Name | Pressure | Temperature

would be really suitable

We want to split one single column into multiple columns

How can we divide the Parameter column again?


data_tidy = data_melt.pivot(index=['Date','time', 'Drug_Name'],
columns = 'Parameter',
values='reading')

data_tidy
Parameter Pressure Temperature

Date time Drug_Name

15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0

docetaxel injection 26.0 23.0

ketamine hydrochloride 9.0 22.0

11:30:00 diltiazem hydrochloride 19.0 20.0

docetaxel injection 29.0 25.0

... ... ... ... ...

17-10-2020 8:30:00 docetaxel injection 26.0 19.0

ketamine hydrochloride 11.0 20.0

9:30:00 diltiazem hydrochloride 9.0 13.0

docetaxel injection 27.0 20.0

ketamine hydrochloride 12.0 21.0

108 rows × 2 columns

We can use reset_index() to remove the multi-index

data_tidy = data_tidy.reset_index()
data_tidy

Parameter Date time Drug_Name Pressure Temperature

0 15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0

1 15-10-2020 10:30:00 docetaxel injection 26.0 23.0

2 15-10-2020 10:30:00 ketamine hydrochloride 9.0 22.0

3 15-10-2020 11:30:00 diltiazem hydrochloride 19.0 20.0

4 15-10-2020 11:30:00 docetaxel injection 29.0 25.0

... ... ... ... ... ...

103 17-10-2020 8:30:00 docetaxel injection 26.0 19.0

104 17-10-2020 8:30:00 ketamine hydrochloride 11.0 20.0

105 17-10-2020 9:30:00 diltiazem hydrochloride 9.0 13.0

106 17-10-2020 9:30:00 docetaxel injection 27.0 20.0

107 17-10-2020 9:30:00 ketamine hydrochloride 12.0 21.0

108 rows × 5 columns

We can remove the leftover Parameter label on the columns by setting it to None

data_tidy.columns.name = None

data_tidy.head()

Date time Drug_Name Pressure Temperature

0 15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0

1 15-10-2020 10:30:00 docetaxel injection 26.0 23.0

2 15-10-2020 10:30:00 ketamine hydrochloride 9.0 22.0

3 15-10-2020 11:30:00 diltiazem hydrochloride 19.0 20.0

4 15-10-2020 11:30:00 docetaxel injection 29.0 25.0

keyboard_arrow_down Pivot Table


Now suppose we want to find some insights, like the day-wise mean temperature

keyboard_arrow_down Can we use pivot to find the day-wise mean value of temperature for each drug?

data_tidy.pivot(index=['Drug_Name'],
columns = 'Date',
values=['Temperature'])
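
pivot alone will complain here, because each (Drug_Name, Date) pair has several time readings (duplicate entries). This is where pivot_table comes in: it aggregates the duplicates, and its aggfunc defaults to the mean. A minimal sketch:

pd.pivot_table(data_tidy,
               index='Drug_Name',
               columns='Date',
               values='Temperature',
               aggfunc='mean')   # day-wise mean temperature for each drug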
keyboard_arrow_down Pandas - 5
Content

Dealing with Missing Values

None and nan values


isna() and isnull()
String method in pandas
Handling datetime
Writing to a file

keyboard_arrow_down Importing and preparing data


import pandas as pd
import numpy as np

!gdown 173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ

data = pd.read_csv('Pfizer_1.csv')
data_melt = pd.melt(data,id_vars = ['Date', 'Drug_Name', 'Parameter'],
var_name = "time",
value_name = 'reading')
data_tidy = data_melt.pivot(index=['Date','time', 'Drug_Name'],
columns = 'Parameter',
values='reading')
data_tidy = data_tidy.reset_index()
data_tidy.columns.name = None

Downloading...
From: https://drive.google.com/uc?id=173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
To: /content/Pfizer_1.csv
100% 1.51k/1.51k [00:00<00:00, 6.05MB/s]

data.head()

Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00

0 15-10-2020 diltiazem hydrochloride Temperature 23.0 22.0 NaN 21.0 21.0 22 23.0 21.0 22.0

1 15-10-2020 diltiazem hydrochloride Pressure 12.0 13.0 NaN 11.0 13.0 14 16.0 16.0 24.0

2 15-10-2020 docetaxel injection Temperature NaN 17.0 18.0 NaN 17.0 18 NaN NaN 23.0

3 15-10-2020 docetaxel injection Pressure NaN 22.0 22.0 NaN 22.0 23 NaN NaN 27.0

4 15-10-2020 ketamine hydrochloride Temperature 24.0 NaN NaN 27.0 NaN 26 25.0 24.0 23.0

data_melt.head()

Date Drug_Name Parameter time reading

0 15-10-2020 diltiazem hydrochloride Temperature 1:30:00 23.0

1 15-10-2020 diltiazem hydrochloride Pressure 1:30:00 12.0

2 15-10-2020 docetaxel injection Temperature 1:30:00 NaN

3 15-10-2020 docetaxel injection Pressure 1:30:00 NaN

4 15-10-2020 ketamine hydrochloride Temperature 1:30:00 24.0

data_tidy.head()

Date time Drug_Name Pressure Temperature

0 15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0

1 15-10-2020 10:30:00 docetaxel injection 26.0 23.0

2 15-10-2020 10:30:00 ketamine hydrochloride 9.0 22.0

3 15-10-2020 11:30:00 diltiazem hydrochloride 19.0 20.0

4 15-10-2020 11:30:00 docetaxel injection 29.0 25.0


keyboard_arrow_down Handling Missing Values
If you notice, there are many "NaN" values in our data

data.head()

Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00

0 15-10-2020 diltiazem hydrochloride Temperature 23.0 22.0 NaN 21.0 21.0 22 23.0 21.0 22.0

1 15-10-2020 diltiazem hydrochloride Pressure 12.0 13.0 NaN 11.0 13.0 14 16.0 16.0 24.0

2 15-10-2020 docetaxel injection Temperature NaN 17.0 18.0 NaN 17.0 18 NaN NaN 23.0

3 15-10-2020 docetaxel injection Pressure NaN 22.0 22.0 NaN 22.0 23 NaN NaN 27.0

4 15-10-2020 ketamine hydrochloride Temperature 24.0 NaN NaN 27.0 NaN 26 25.0 24.0 23.0

keyboard_arrow_down What are these "NaN" values?

They are basically missing values

What are missing values?


A Missing Value signifies an empty cell/no data

There can be 2 kinds of missing values:

1. None
2. NaN (short for Not a Number)

What's the difference between "None" and "NaN"?


The difference mainly lies in their datatype

type(None)

NoneType

type(np.nan)

float

None is used for missing values in a column with non-number entries

e.g. strings

NaN occurs in columns with number entries

Note:

Pandas uses these values nearly interchangeably, converting between them where appropriate, based on column datatype

pd.Series([1, np.nan, 2, None])

0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64

For numerical types, Pandas changes None to NaN type

pd.Series(["1", "np.nan", "2", None])

0 1
1 np.nan
2 2
3 None
dtype: object

pd.Series(["1", "np.nan", "2", np.nan])

0 1
1 np.nan
2 2
3 NaN
dtype: object

For object type, the None is preserved and not changed to NaN

Now we have the basic idea about missing values

keyboard_arrow_down How to know the count of missing values for each row/column?

data.isna().head()

Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00 10:30:00 11

0 False False False False False True False False False False False False False

1 False False False False False True False False False False False False False

2 False False False True False False True False False True True False False

3 False False False True False False True False False True True False False

4 False False False False True True False True False False False False False

We can also use isnull to get the same results

data.isnull().head()

Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00 10:30:00 11

0 False False False False False True False False False False False False False

1 False False False False False True False False False False False False False

2 False False False True False False True False False True True False False

3 False False False True False False True False False True True False False

4 False False False False True True False True False False False False False

keyboard_arrow_down But, why do we have two methods, "isna" and "isnull" for the same operation?

isnull() is just an alias for isna()

pd.isnull

<function pandas.core.dtypes.missing.isna(obj: 'object') -> 'bool | npt.NDArray[np.bool_] | NDFrame'>

pd.isna

<function pandas.core.dtypes.missing.isna(obj: 'object') -> 'bool | npt.NDArray[np.bool_] | NDFrame'>

As we can see, the function signature is the same for both

isna() returns a boolean dataframe, with each cell as a boolean value

This value corresponds to whether the cell has a missing value

On top of this, we can use .sum() to find the count

data.isna().sum()

Date 0
Drug_Name 0
Parameter 0
1:30:00 2
2:30:00 2
3:30:00 6
4:30:00 4
5:30:00 2
6:30:00 0
7:30:00 2
8:30:00 4
9:30:00 2
10:30:00 0
11:30:00 2
12:30:00 0
dtype: int64

This gives us the total number of missing values in each column

keyboard_arrow_down Can we also get the number of missing values in each row?

data.isna().sum(axis=1)

0 1
1 1
2 4
3 4
4 3
5 3
6 1
7 1
8 1
9 1
10 2
11 2
12 1
13 1
14 0
15 0
16 0
17 0
dtype: int64

Note:

By default, sum() uses axis=0, i.e. it sums down each column

keyboard_arrow_down We have identified the null count, but how do we deal with them?

We have two options:

delete the rows/columns containing the null values


fill the missing values with some data/estimate

Let's first look at deleting the rows

How can we drop rows containing null values?

data.dropna()

Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00

14 17-10-2020 docetaxel injection Temperature 12.0 13.0 14.0 15.0 16.0 17 18.0 19.0 20.0

15 17-10-2020 docetaxel injection Pressure 20.0 22.0 22.0 22.0 22.0 23 25.0 26.0 27.0

16 17-10-2020 ketamine hydrochloride Temperature 13.0 14.0 15.0 16.0 17.0 18 19.0 20.0 21.0

17 17-10-2020 ketamine hydrochloride Pressure 8.0 9.0 10.0 11.0 11.0 12 12.0 11.0 12.0

Rows with even a single missing value have been deleted
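Note: dropna() also takes a few optional parameters to control how aggressive the deletion is. A minimal sketch (the threshold and column names below are illustrative choices, not values used in the lecture):

# Drop a row only if ALL of its values are missing
data.dropna(how='all')

# Keep only rows that have at least 12 non-null values
data.dropna(thresh=12)

# Look for missing values only in specific columns
data.dropna(subset=['1:30:00', '2:30:00'])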

keyboard_arrow_down What if we want to delete the columns having missing value?

data.dropna(axis=1)
Date Drug_Name Parameter 6:30:00 10:30:00 12:30:00

0 15-10-2020 diltiazem hydrochloride Temperature 22 20 21

1 15-10-2020 diltiazem hydrochloride Pressure 14 18 20

2 15-10-2020 docetaxel injection Temperature 18 23 25

3 15-10-2020 docetaxel injection Pressure 23 26 28

4 15-10-2020 ketamine hydrochloride Temperature 26 22 20

5 15-10-2020 ketamine hydrochloride Pressure 9 9 11

6 16-10-2020 diltiazem hydrochloride Temperature 38 40 42

7 16-10-2020 diltiazem hydrochloride Pressure 23 24 27

8 16-10-2020 docetaxel injection Temperature 49 56 58

9 16-10-2020 docetaxel injection Pressure 27 28 30

10 16-10-2020 ketamine hydrochloride Temperature 12 13 15

11 16-10-2020 ketamine hydrochloride Pressure 15 16 18

12 17-10-2020 diltiazem hydrochloride Temperature 16 14 10

13 17-10-2020 diltiazem hydrochloride Pressure 8 11 14

14 17-10-2020 docetaxel injection Temperature 17 21 23

15 17-10-2020 docetaxel injection Pressure 23 28 28

16 17-10-2020 ketamine hydrochloride Temperature 18 22 24

17 17-10-2020 ketamine hydrochloride Pressure 12 13 15

=> Every column which had even a single missing value has been deleted

keyboard_arrow_down But what are the problems with deleting rows/columns?

One of the major problems:

loss of data

Instead of dropping, it would be better to fill the missing values with some data

How can we fill the missing values with some data?

data.fillna(0).head()

Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00

0 15-10-2020 diltiazem hydrochloride Temperature 23.0 22.0 0.0 21.0 21.0 22 23.0 21.0 22.0

1 15-10-2020 diltiazem hydrochloride Pressure 12.0 13.0 0.0 11.0 13.0 14 16.0 16.0 24.0

2 15-10-2020 docetaxel injection Temperature 0.0 17.0 18.0 0.0 17.0 18 0.0 0.0 23.0

3 15-10-2020 docetaxel injection Pressure 0.0 22.0 22.0 0.0 22.0 23 0.0 0.0 27.0

4 15-10-2020 ketamine hydrochloride Temperature 24.0 0.0 0.0 27.0 0.0 26 25.0 24.0 23.0

What is fillna(0) doing?

It fills all missing values with 0

We can do the same on a particular column too

data['2:30:00'].fillna(0)

0 22.0
1 13.0
2 17.0
3 22.0
4 0.0
5 0.0
6 35.0
7 19.0
8 47.0
9 24.0
10 9.0
11 12.0
12 19.0
13 4.0
14 13.0
15 22.0
16 14.0
17 9.0
Name: 2:30:00, dtype: float64

keyboard_arrow_down What other values can we use to fill the missing values?

We can use some kind of estimator too

An estimator like mean or median

How would you calculate the mean of the column 2:30:00 ?

data['2:30:00'].mean()

18.8125

Now let's fill the NaN values with the mean value of the column

data['2:30:00'].fillna(data['2:30:00'].mean())

0 22.0000
1 13.0000
2 17.0000
3 22.0000
4 18.8125
5 18.8125
6 35.0000
7 19.0000
8 47.0000
9 24.0000
10 9.0000
11 12.0000
12 19.0000
13 4.0000
14 13.0000
15 22.0000
16 14.0000
17 9.0000
Name: 2:30:00, dtype: float64

But this doesn't feel right. What could be wrong with this?

Can we use the mean of all compounds as average for our estimator?

Different drugs have different characteristics


We can't simply do an average and fill the null values

Then what could be a solution here?

We could fill the null values of respective compounds with their respective means

keyboard_arrow_down How can we form a column with mean temperature of respective compounds?

We can use apply that we learnt earlier

Let's first create a function to calculate the mean

def temp_mean(x):
    x['Temperature_avg'] = x['Temperature'].mean()  # We will name the new col Temperature_avg
    return x

Now we can form a new column based on the average values of temperature for each drug

data_tidy=data_tidy.groupby(["Drug_Name"], group_keys=False).apply(temp_mean)
data_tidy
Date time Drug_Name Pressure Temperature Temperature_avg

0 15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0 24.848485

1 15-10-2020 10:30:00 docetaxel injection 26.0 23.0 30.387097

2 15-10-2020 10:30:00 ketamine hydrochloride 9.0 22.0 17.709677

3 15-10-2020 11:30:00 diltiazem hydrochloride 19.0 20.0 24.848485

4 15-10-2020 11:30:00 docetaxel injection 29.0 25.0 30.387097

... ... ... ... ... ... ...

103 17-10-2020 8:30:00 docetaxel injection 26.0 19.0 30.387097

104 17-10-2020 8:30:00 ketamine hydrochloride 11.0 20.0 17.709677

105 17-10-2020 9:30:00 diltiazem hydrochloride 9.0 13.0 24.848485

106 17-10-2020 9:30:00 docetaxel injection 27.0 20.0 30.387097

107 17-10-2020 9:30:00 ketamine hydrochloride 12.0 21.0 17.709677

108 rows × 6 columns

Now we fill the null values in Temperature using this new column!

data_tidy['Temperature'].fillna(data_tidy["Temperature_avg"], inplace=True)
data_tidy

Date time Drug_Name Pressure Temperature Temperature_avg

0 15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0 24.848485

1 15-10-2020 10:30:00 docetaxel injection 26.0 23.0 30.387097

2 15-10-2020 10:30:00 ketamine hydrochloride 9.0 22.0 17.709677

3 15-10-2020 11:30:00 diltiazem hydrochloride 19.0 20.0 24.848485

4 15-10-2020 11:30:00 docetaxel injection 29.0 25.0 30.387097

... ... ... ... ... ... ...

103 17-10-2020 8:30:00 docetaxel injection 26.0 19.0 30.387097

104 17-10-2020 8:30:00 ketamine hydrochloride 11.0 20.0 17.709677

105 17-10-2020 9:30:00 diltiazem hydrochloride 9.0 13.0 24.848485

106 17-10-2020 9:30:00 docetaxel injection 27.0 20.0 30.387097

107 17-10-2020 9:30:00 ketamine hydrochloride 12.0 21.0 17.709677

108 rows × 6 columns

data_tidy.isna().sum()

Date 0
time 0
Drug_Name 0
Pressure 13
Temperature 0
Temperature_avg 0
dtype: int64

Great!!

We have removed the null values of our Temperature column

Let's do the same for Pressure

def pr_mean(x):
    x['Pressure_avg'] = x['Pressure'].mean()
    return x

data_tidy = data_tidy.groupby(["Drug_Name"]).apply(pr_mean)
data_tidy['Pressure'].fillna(data_tidy["Pressure_avg"], inplace=True)
data_tidy
<ipython-input-27-df55c441df36>:4: FutureWarning: Not prepending group keys to the result index of transform-like apply.
To preserve the previous behavior, use

>>> .groupby(..., group_keys=False)

To adopt the future behavior and silence this warning, use

>>> .groupby(..., group_keys=True)


data_tidy=data_tidy.groupby(["Drug_Name"]).apply(pr_mean)
Date time Drug_Name Pressure Temperature Temperature_avg Pressure_avg

0 15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0 24.848485 15.424242

1 15-10-2020 10:30:00 docetaxel injection 26.0 23.0 30.387097 25.483871

2 15-10-2020 10:30:00 ketamine hydrochloride 9.0 22.0 17.709677 11.935484

3 15-10-2020 11:30:00 diltiazem hydrochloride 19.0 20.0 24.848485 15.424242

4 15-10-2020 11:30:00 docetaxel injection 29.0 25.0 30.387097 25.483871

... ... ... ... ... ... ... ...

103 17-10-2020 8:30:00 docetaxel injection 26.0 19.0 30.387097 25.483871

104 17-10-2020 8:30:00 ketamine hydrochloride 11.0 20.0 17.709677 11.935484

105 17-10-2020 9:30:00 diltiazem hydrochloride 9.0 13.0 24.848485 15.424242

106 17-10-2020 9:30:00 docetaxel injection 27.0 20.0 30.387097 25.483871

107 17-10-2020 9:30:00 ketamine hydrochloride 12.0 21.0 17.709677 11.935484

108 rows × 7 columns

data_tidy.isna().sum()

Date 0
time 0
Drug_Name 0
Pressure 0
Temperature 0
Temperature_avg 0
Pressure_avg 0
dtype: int64

keyboard_arrow_down Pandas Cut


Sometimes, we would want our data to be in categorical format instead of continuous data.

Let's say, instead of knowing the specific test values for a month, I want to know their type. It depends on the level of granularity we want - Low,
Medium, High, V High

We could have defined more (or fewer) categories

But how can bucketisation of continuous data help?

Because we can get the count of different categories


We can get an idea of which bin/category (range of values) most of the temperature values lie in.

Let's try to use this on our Temperature column to categorise the data into bins

But, to define categories, let's first check the min and max temperature values

data_tidy
Date time Drug_Name Pressure Temperature Temperature_avg Pressure_avg

0 15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0 24.848485 15.424242

1 15-10-2020 10:30:00 docetaxel injection 26.0 23.0 30.387097 25.483871

2 15-10-2020 10:30:00 ketamine hydrochloride 9.0 22.0 17.709677 11.935484

3 15-10-2020 11:30:00 diltiazem hydrochloride 19.0 20.0 24.848485 15.424242

4 15-10-2020 11:30:00 docetaxel injection 29.0 25.0 30.387097 25.483871

... ... ... ... ... ... ... ...

103 17-10-2020 8:30:00 docetaxel injection 26.0 19.0 30.387097 25.483871

104 17-10-2020 8:30:00 ketamine hydrochloride 11.0 20.0 17.709677 11.935484

105 17-10-2020 9:30:00 diltiazem hydrochloride 9.0 13.0 24.848485 15.424242

106 17-10-2020 9:30:00 docetaxel injection 27.0 20.0 30.387097 25.483871

107 17-10-2020 9:30:00 ketamine hydrochloride 12.0 21.0 17.709677 11.935484

108 rows × 7 columns

print(data_tidy['Temperature'].min(), data_tidy['Temperature'].max())

8.0 58.0

Min value = 8, Max value is 58.

Let's keep some buffer for future values and take the range from 5-60 (instead of 8-58)
Let's divide this range into 4 bins of 10-15 values each

temp_points = [5, 20, 35, 50, 60]


temp_labels = ['low','medium','high','very_high'] # Here labels define the severity of the resultant output of the test
data_tidy['temp_cat'] = pd.cut(data_tidy['Temperature'], bins=temp_points, labels=temp_labels)
data_tidy.head()

Date time Drug_Name Pressure Temperature Temperature_avg Pressure_avg temp_cat

0 15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0 24.848485 15.424242 low

1 15-10-2020 10:30:00 docetaxel injection 26.0 23.0 30.387097 25.483871 medium

2 15-10-2020 10:30:00 ketamine hydrochloride 9.0 22.0 17.709677 11.935484 medium

3 15-10-2020 11:30:00 diltiazem hydrochloride 19.0 20.0 24.848485 15.424242 low

4 15-10-2020 11:30:00 docetaxel injection 29.0 25.0 30.387097 25.483871 medium

data_tidy['temp_cat'].value_counts()

low 50
medium 38
high 15
very_high 5
Name: temp_cat, dtype: int64
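Note that pd.cut uses the fixed bin edges we supplied, so the bin counts can be quite uneven, as seen above. If we instead wanted bins with roughly equal counts, pandas also provides pd.qcut, which cuts at quantiles. A small sketch (the column name temp_qcat is just an illustrative choice):

# Split Temperature into 4 quantile-based bins with roughly equal counts
data_tidy['temp_qcat'] = pd.qcut(data_tidy['Temperature'], q=4,
                                 labels=['low', 'medium', 'high', 'very_high'])
data_tidy['temp_qcat'].value_counts()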

keyboard_arrow_down String function and motivation for datetime


keyboard_arrow_down What kind of questions can we use string methods for?

Find rows which contain a particular string

Say,

How can you filter rows containing "hydrochloride" in their drug name?

data_tidy.loc[data_tidy['Drug_Name'].str.contains('hydrochloride')].head()
Date time Drug_Name Pressure Temperature Temperature_avg Pressure_avg temp_cat

0 15-10-2020 10:30:00 diltiazem hydrochloride 18.0 20.0 24.848485 15.424242 low

2 15-10-2020 10:30:00 ketamine hydrochloride 9.0 22.0 17.709677 11.935484 medium

3 15-10-2020 11:30:00 diltiazem hydrochloride 19.0 20.0 24.848485 15.424242 low

5 15-10-2020 11:30:00 ketamine hydrochloride 9.0 21.0 17.709677 11.935484 medium

6 15-10-2020 12:30:00 diltiazem hydrochloride 20.0 21.0 24.848485 15.424242 medium

So in general, we will be using the following format:

> Series.str.function()

Series.str can be used to access the values of the series as strings and apply several methods to it.
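A few other commonly used Series.str methods, shown here as a quick sketch on the Drug_Name column (outputs omitted):

data_tidy['Drug_Name'].str.upper()                        # convert to upper case
data_tidy['Drug_Name'].str.startswith('keta')             # boolean mask, True for ketamine rows
data_tidy['Drug_Name'].str.replace('injection', 'inj')    # substring replacement
data_tidy['Drug_Name'].str.len()                          # length of each string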

Now suppose we want to form a new column based on the year of the experiments.

keyboard_arrow_down How can we form a column containing the year?

data_tidy['Date'].str.split('-')

0 [15, 10, 2020]


1 [15, 10, 2020]
2 [15, 10, 2020]
3 [15, 10, 2020]
4 [15, 10, 2020]
...
103 [17, 10, 2020]
104 [17, 10, 2020]
105 [17, 10, 2020]
106 [17, 10, 2020]
107 [17, 10, 2020]
Name: Date, Length: 108, dtype: object

To extract the year we need to select the last element of each list

data_tidy['Date'].str.split('-').apply(lambda x:x[2])

0 2020
1 2020
2 2020
3 2020
4 2020
...
103 2020
104 2020
105 2020
106 2020
107 2020
Name: Date, Length: 108, dtype: object

But there are certain problems with this approach:

The dtype of the output is still an object, we would prefer a number type
The date format will not always be day-month-year; it can vary

Thus, to work with such date-time type of data, we can use a special method of pandas

keyboard_arrow_down Datetime
keyboard_arrow_down How can we handle date-time data-types?

We can do this using the to_datetime() function of pandas


It takes as input:

Array/Scalars with values having proper date/time format


dayfirst : Indicating if the day comes first in the date format used
yearfirst : Indicates if year comes first in the date format
Let's first merge our Date and time columns into a new timestamp column

data_tidy['timestamp'] = data_tidy['Date']+ " "+ data_tidy['time']

data_tidy.drop(['Date', 'time'], axis=1, inplace=True)

data_tidy.head()

Drug_Name Pressure Temperature Temperature_avg Pressure_avg temp_cat timestamp

0 diltiazem hydrochloride 18.0 20.0 24.848485 15.424242 low 15-10-2020 10:30:00

1 docetaxel injection 26.0 23.0 30.387097 25.483871 medium 15-10-2020 10:30:00

2 ketamine hydrochloride 9.0 22.0 17.709677 11.935484 medium 15-10-2020 10:30:00

3 diltiazem hydrochloride 19.0 20.0 24.848485 15.424242 low 15-10-2020 11:30:00

4 docetaxel injection 29.0 25.0 30.387097 25.483871 medium 15-10-2020 11:30:00

Let's convert our timestamp column now

data_tidy['timestamp'] = pd.to_datetime(data_tidy['timestamp']) # will leave to explore how you can mention datetime format

data_tidy

Drug_Name Pressure Temperature Temperature_avg Pressure_avg temp_cat timestamp

0 diltiazem hydrochloride 18.0 20.0 24.848485 15.424242 low 2020-10-15 10:30:00

1 docetaxel injection 26.0 23.0 30.387097 25.483871 medium 2020-10-15 10:30:00

2 ketamine hydrochloride 9.0 22.0 17.709677 11.935484 medium 2020-10-15 10:30:00

3 diltiazem hydrochloride 19.0 20.0 24.848485 15.424242 low 2020-10-15 11:30:00

4 docetaxel injection 29.0 25.0 30.387097 25.483871 medium 2020-10-15 11:30:00

... ... ... ... ... ... ... ...

103 docetaxel injection 26.0 19.0 30.387097 25.483871 low 2020-10-17 08:30:00

104 ketamine hydrochloride 11.0 20.0 17.709677 11.935484 low 2020-10-17 08:30:00

105 diltiazem hydrochloride 9.0 13.0 24.848485 15.424242 low 2020-10-17 09:30:00

106 docetaxel injection 27.0 20.0 30.387097 25.483871 low 2020-10-17 09:30:00

107 ketamine hydrochloride 12.0 21.0 17.709677 11.935484 medium 2020-10-17 09:30:00

108 rows × 7 columns

data_tidy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 108 entries, 0 to 107
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Drug_Name 108 non-null object
1 Pressure 108 non-null float64
2 Temperature 108 non-null float64
3 Temperature_avg 108 non-null float64
4 Pressure_avg 108 non-null float64
5 temp_cat 108 non-null category
6 timestamp 108 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(4), object(1)
memory usage: 10.3+ KB

The type of timestamp column has been changed to datetime from object
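If the date format is known in advance, we can also make the parsing explicit with the format parameter (or with dayfirst) instead of letting pandas infer it. A small sketch using one of the date strings from our data (the strftime pattern below is an assumed spelling of the format, not something used in the lecture code):

# Explicit parsing of day-first strings like "15-10-2020 10:30:00"
pd.to_datetime("15-10-2020 10:30:00", format='%d-%m-%Y %H:%M:%S')

# Equivalent using dayfirst, letting pandas infer the rest
pd.to_datetime("15-10-2020 10:30:00", dayfirst=True)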

Now, Let's look at a single timestamp using Pandas

keyboard_arrow_down How can we extract information from a single timestamp using Pandas?

ts = data_tidy['timestamp'][0]
ts

Timestamp('2020-10-15 10:30:00')
keyboard_arrow_down Extracting individual information from date

ts.year, ts.month, ts.day, ts.month_name()

(2020, 10, 15, 'October')

... and so on

We can similarly extract minutes and seconds
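For example, a quick sketch on the same timestamp:

ts.hour, ts.minute, ts.second   # (10, 30, 0) for the timestamp above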

keyboard_arrow_down This data parsing from string to date-time makes it easier to work with data

We can use this data from the columns as a whole using .dt object

data_tidy['timestamp'].dt

<pandas.core.indexes.accessors.DatetimeProperties object at 0x7c2e78c72b60>

dt gives properties of values in a column

From the DatetimeProperties of the 'timestamp' column, we can extract the year

data_tidy['timestamp'].dt.year

0 2020
1 2020
2 2020
3 2020
4 2020
...
103 2020
104 2020
105 2020
106 2020
107 2020
Outline

Uses/necessity of matplotlib
Tencent Use Case
Anatomy

Figure
Types of Data visualization
Univariate Data Visualization

Categorical:

Bar chart
Countplot
Pie Chart

Continuous

Histogram
KDE
Box and Whiskers Plot

Plots Presentation:

https://docs.google.com/presentation/d/1DkLTjTe6YmGbDHtr4v9Jso553DlCuP3cfSnwvUN1mgE/edit?usp=sharing

Summary/Agenda

Where is all Data Visualization helpful? Why?


Exploratory - EDA
Explanatory - Storytelling

What is the Science in Data Visualization?


Anatomy of a plot/chart
How to use the right plot/chart for given data?

What is the Art in Data Visualization?


Choose the right scale, labels, tick labels
Identify and remove clutters in the plot
Ways to highlight information in the plot

keyboard_arrow_down Importing Matplotlib and Seaborn


We don't need to import the entire library but just its submodule pyplot

We'll use the alias name plt

What is pyplot ?
pyplot is a sub-module for visualization in matplotlib
Think of it as high-level API which makes plotting an easy task
Data Scientists stick to using pyplot only unless they want to create something totally new.

For seaborn, we will be importing the whole seaborn library as alias sns

What is seaborn?
Seaborn is another visualization library which uses matplotlib in the backend for plotting

What then is the major difference between matplotlib and seaborn?
Seaborn uses fascinating themes and reduces the number of code lines by doing a lot of work in the backend
While matplotlib is used to plot basic plots and add more functionality on top of that
Seaborn is built on top of Pandas and Matplotlib

As we proceed through the lecture, we will see the difference between both the libraries

import matplotlib.pyplot as plt
import seaborn as sns

Before we dive into learning these libraries, lets answer some general questions

Why do we even need to visualize data? When do I even need to visualise?

Two reasons/scopes

Exploratory - I can’t see certain patterns just by crunching numbers (avg, rates, %ages)
Explanatory - I have the numbers crunched and insights ready, but I'd like a visual aid for storytelling

Let's talk about the Science of Data Visualisation

Data

Rows: Samples, Data-points, Records


Columns: Features, Variables

How many kinds of data do we have?

At the fundamental level, it's just two types:

Numerical/Continuous
Categorical

Categorical can be further divided into:

Ordinal: Categorical Data with an order (E.g. low, medium, high)


Non-ordinal/nominal: Categorical Data without any order (example gender as Male/Female)

keyboard_arrow_down Video Games Analysis


You are a data scientist at "Tencent Games".

You need to analyze what kind of games they should start creating to get higher success in the market.

!wget https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/021/299/original/final_vg1_-_final_vg_%281%29.csv?16708

--2024-02-06 16:30:07-- https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/021/299/original/final_vg1_-_fin


Resolving d2beiqkhq929f0.cloudfront.net (d2beiqkhq929f0.cloudfront.net)... 13.35.37.7, 13.35.37.102, 13.35.37.159, ...
Connecting to d2beiqkhq929f0.cloudfront.net (d2beiqkhq929f0.cloudfront.net)|13.35.37.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2041483 (1.9M) [text/plain]
Saving to: ‘final_vg.csv’

final_vg.csv 100%[===================>] 1.95M 2.75MB/s in 0.7s

2024-02-06 16:30:08 (2.75 MB/s) - ‘final_vg.csv’ saved [2041483/2041483]

import pandas as pd
import numpy as np
data = pd.read_csv('final_vg.csv')
data.head()

Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sal

0 2061 1942 NES 1985.0 Shooter Capcom 4.569217 3.033887 3.439352 1.9916

1 9137 ¡Shin Chan Flipa en colores! DS 2007.0 Platform 505 Games 2.076955 1.493442 3.033887 0.3948

2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.493442 0.4086

3 8359 .hack//G.U. Vol.1//Rebirth PS2 2006.0 Role-Playing Namco Bandai Games 2.031986 1.389856 3.228043 0.3948

4 7109 .hack//G.U. Vol.2//Reminisce PS2 2006.0 Role-Playing Namco Bandai Games 2.792725 2.592054 1.440483 1.4934

If you notice,

Columns like Platform , Genre are categorical


While columns like NA_Sales , Global_Sales , Rank are continuous

On noticing further,
Platform is of nominal type, no proper order between the categories
Year is of ordinal type, there's an order to the categories

keyboard_arrow_down Introduction to Matplotlib


Let's learn to create a basic plot using plt

Now say, we want to draw a curve passing through 3 points:

(0, 3)
(1, 5)
(2, 9)

How can we draw a curve using matplotlib ?


By using plt.plot() function

x_val = [0, 1, 2]
y_val = [3, 5, 9]
plt.plot(x_val, y_val)

[<matplotlib.lines.Line2D at 0x7cc05d0c3f40>]

What can we observe from this plot?

plt.plot() automatically decided the scale of the plot

It also prints the type of object matplotlib.lines.Line2D

While this command decided a lot of things for you, you can customise each of these by understanding components of a matplotlib plot

keyboard_arrow_down Anatomy of Matplotlib


Woah! There is a lot of information in this image. Let's understand them one at a time.

Figure: The overall window or page that everything is drawn on.

You can create multiple independent Figures in Jupyter.
If you run the code in a terminal, separate windows will pop up.

Axes: The area on which the data is plotted with functions such as plot(). You can add multiple Axes to a figure, each representing one plot.

Axis: Simply the x-axis and y-axis

x-label: Name of x-axis

y-label: Name of y-axis

Major ticks: subdivide the axis into major units. They appear by default during plotting

Minor ticks: subdivide the major tick units. They are hidden by default and can be toggled on.

Title: Title of each plot (Axes), giving information about the same

Legend: describes the elements in the plot, blue and green curves in this case

Suptitle: The common title of all the plots

These are the major components of a matplotlib plot

Now, how to choose the right plot?

Firstly, it depends on what your question of interest is

When the question is clear:

How many variables are involved?
Whether the variable(s) are numerical or categorical?

1 Variable - Univariate Analysis
2 Variables - Bivariate Analysis
2+ Variables - Multivariate Analysis

PS: Bivariate counts under multivariate, but let's keep it separate for ease of communication

What are the possible cases?

Univariate
Numerical
Categorical

Bivariate
Numerical-Numerical
Numerical-Categorical
Categorical-Categorical

Multivariate
Let’s start with 3 and then we can generalize

Numerical-Numerical-Categorical
Categorical-Categorical-Numerical
Categorical-Categorical-Categorical
Numerical-Numerical-Numerical

We will work on these one by one

keyboard_arrow_down Univariate Data Visualization - Categorical Data


What kind of questions we may want to ask for a categorical variable?

Questions like:

What is the Distribution/Frequency of the data across different categories?


What proportion does a particular category constitute?

...and so on

Let's take the categorical column "Genre"

keyboard_arrow_down How can we find the top-5 genres?

Recall, how could we get this data using pandas?

cat_counts = data['Genre'].value_counts()
cat_counts

Action 3316
Sports 2400
Misc 1739
Role-Playing 1488
Shooter 1310
Adventure 1286
Racing 1249
Platform 886
Simulation 867
Fighting 848
Strategy 681
Puzzle 582
Name: Genre, dtype: int64

keyboard_arrow_down Now what kind of plot can we use to visualize this information?

We can perhaps plot categories on X-axis and their corresponding frequencies on Y-axis
Such a chart is called a Bar Chart or a Count Plot
We can also plot horizontally when there are many categories

Bar Chart
The data is binned here into categories

How can we draw a Bar plot?


Using plt.bar()

x_bar=cat_counts.index
y_bar=cat_counts
plt.bar(x_bar,y_bar)

<BarContainer object of 12 artists>

The names seem to be overlapping a lot

keyboard_arrow_down How can we handle overlapping labels?


Maybe decrease the font size (not preferred though)
Or maybe increase the figure size
Or rotate the labels

How can we change the plot size?

plt.figure(figsize=(12,8))
plt.bar(x_bar,y_bar)

<BarContainer object of 12 artists>

keyboard_arrow_down And how can we rotate the tick labels, also maybe increase the fontsize of the same?

plt.figure(figsize=(12,8))
plt.bar(x_bar,y_bar)
plt.xticks(rotation=90, fontsize=12)
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[Text(0, 0, 'Action'),
Text(1, 0, 'Sports'),
Text(2, 0, 'Misc'),
Text(3, 0, 'Role-Playing'),
Text(4, 0, 'Shooter'),
Text(5, 0, 'Adventure'),
Text(6, 0, 'Racing'),
Text(7, 0, 'Platform'),
Text(8, 0, 'Simulation'),
Text(9, 0, 'Fighting'),
Text(10, 0, 'Strategy'),
Text(11, 0, 'Puzzle')])

If you notice, the bars are quite wide by default (plt.bar uses a default width of 0.8)

keyboard_arrow_down Can we change the width of these bars?

# same code
plt.figure(figsize=(10,8))
plt.bar(x_bar,y_bar,width=0.2)
plt.xticks(rotation = 90, fontsize=12)
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[Text(0, 0, 'Action'),
Text(1, 0, 'Sports'),
Text(2, 0, 'Misc'),
Text(3, 0, 'Role-Playing'),
Text(4, 0, 'Shooter'),
Text(5, 0, 'Adventure'),
Text(6, 0, 'Racing'),
Text(7, 0, 'Platform'),
Text(8, 0, 'Simulation'),
Text(9, 0, 'Fighting'),
Text(10, 0, 'Strategy'),
Text(11, 0, 'Puzzle')])

keyboard_arrow_down What additional styling can we add to the bars?

We can change colour of bars


We can add a title to the axes
We can also add x and y labels

plt.figure(figsize=(10,8))
plt.bar(x_bar,y_bar,width=0.2,color='orange')
plt.title('Games per Genre',fontsize=15)
plt.xlabel('Genre',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.xticks(rotation = 90, fontsize=12)
plt.yticks(fontsize=12)
(array([ 0., 500., 1000., 1500., 2000., 2500., 3000., 3500.]),
[Text(0, 0.0, '0'),
Text(0, 500.0, '500'),
Text(0, 1000.0, '1000'),
Text(0, 1500.0, '1500'),
Text(0, 2000.0, '2000'),
Text(0, 2500.0, '2500'),
Text(0, 3000.0, '3000'),
Text(0, 3500.0, '3500')])

If you notice, there's always some text printed before the plots.

This is the return value of the plotting function, describing the objects in the plot

keyboard_arrow_down How can we remove the text printed before the plot and just display the plot?

Using plt.show() at the end

plt.figure(figsize=(10,8))
plt.bar(x_bar,y_bar,width=0.2,color='orange')
plt.title('Games per Genre',fontsize=15)
plt.xlabel('Genre',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.xticks(rotation = 90, fontsize=12)
plt.yticks(fontsize=12)
plt.show()
How can we draw a bar-chart in Seaborn?

In Seaborn, the same plot is called a countplot.


Countplot automatically does even the counting of frequencies for you

keyboard_arrow_down Why not called a barplot?

There is another function in Seaborn called barplot which serves a different purpose - we will discuss it later

sns.countplot(x = 'Genre', data = data, order=data['Genre'].value_counts().index, color='cornflowerblue')


plt.xticks(rotation=90)
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[Text(0, 0, 'Action'),
Text(1, 0, 'Sports'),
Text(2, 0, 'Misc'),
Text(3, 0, 'Role-Playing'),
Text(4, 0, 'Shooter'),
Text(5, 0, 'Adventure'),
Text(6, 0, 'Racing'),
Text(7, 0, 'Platform'),
Text(8, 0, 'Simulation'),
Text(9, 0, 'Fighting'),
Text(10, 0, 'Strategy'),
Text(11, 0, 'Puzzle')])

The top 5 genres are action, sports, misc, role-playing, and shooter
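As noted earlier, when the category names are long, plotting them horizontally avoids rotating the tick labels altogether. A minimal sketch of the same countplot, just with the axes swapped:

sns.countplot(y='Genre', data=data, order=data['Genre'].value_counts().index, color='cornflowerblue')
plt.show()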

keyboard_arrow_down Pie charts


What if instead of the actual frequencies, I want to see the proportion of the categories relative to each other?
Say, we want to compare the distribution/proportion of sales across the different regions?

Which plot can we use for this?

A pie-chart!

sales_data = data[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']]


region_sales = sales_data.T.sum(axis='columns')

plt.pie(region_sales,
labels=region_sales.index,
startangle=90,
explode=(0.2,0,0,0))
plt.show()
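plt.pie() can also annotate each wedge with its percentage share using the autopct parameter. A small variation of the plot above (the format string just rounds to one decimal place):

plt.pie(region_sales,
        labels=region_sales.index,
        autopct='%1.1f%%',   # show each region's share as a percentage
        startangle=90)
plt.show()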
keyboard_arrow_down Univariate Data Visualisation - Numerical Data
What kind of questions we may have regarding a numerical variable?
1. How is the data distributed? Say distribution of number of games published in a year.
2. Is the data skewed? Are there any outliers? - Extremely high selling games maybe?
3. How much percentage of data is below/above a certain number?
4. Some special numbers - Min, Max, Mean, Median, nth percentile?

Now say, you want to find the distribution of games released every year.

Unlike barplot, to see the distribution we will need to bin the data.

keyboard_arrow_down How can we understand popularity of video games year by year?

Histogram

plt.hist(data['Year'])
plt.show()

The curve is left skewed, with a lot more games being published in 2005-2015

This shows that games became highly popular in the last 1-2 decades, which could point to increased usage of the internet worldwide!

If you notice, histograms are basically frequency charts

We can also vary the number of bins, the default number of bins is 10

So if we wanted to see this data per decade, we would need the 40 years split into 4 bins.

plt.hist(data['Year'], bins=4)
plt.show()

We can also get the data of each bin, such as range of the boundaries, values, etc.

count, bins, _ = plt.hist(data['Year'])

count

array([ 112., 70., 92., 449., 1274., 2440., 3921., 5262., 2406.,
355.])

bins

array([1980., 1984., 1988., 1992., 1996., 2000., 2004., 2008., 2012.,


2016., 2020.])

keyboard_arrow_down Now what do these count and bins mean?

bins provides bin edges


counts provides the corresponding counts

What is the length of count?

10

What should be the length of bins?


10 + 1 = 11

keyboard_arrow_down How can we plot histogram in Seaborn?


sns.histplot(data['Year'], bins=10)

<Axes: xlabel='Year', ylabel='Count'>

Notice,

The boundaries are more defined than matplotlib's plotting


The x and y axis are labelled automatically

keyboard_arrow_down Kernel Density Estimate (KDE) Plot


A KDE plot, similar to a histogram, is a method for visualizing distributions

But instead of bars, KDE represents data using a continuous probability density curve

Now, Why do we even need KDE plots?


Compared to histogram, KDE produces a plot which is less cluttered and more interpretable
Think of it as a smoothened version of histogram

Let's plot KDE using seaborn 's kdeplot

sns.kdeplot(data['Year'])

<Axes: xlabel='Year', ylabel='Density'>

Can you notice the difference between KDE and histogram?

Y-Axis has probability density estimation instead of count

You can read more about this on:


https://en.wikipedia.org/wiki/Kernel_density_estimation
https://www.youtube.com/watch?v=DCgPRaIDYXA

keyboard_arrow_down Boxplot
Now say I want to find the typical earnings of a game when it is published.

Or maybe find the aggregates like median, min, max and percentiles of the data.

What kind of plot can we use to understand the typical earnings from a game?

Box Plot

keyboard_arrow_down What exactly is a Box Plot?

A box plot or box-and-whisker plot shows the distribution of quantitative data


It facilitates comparisons between attributes, or across levels of a categorical attribute.

The box: Shows the quartiles of the dataset

The whiskers: Show the rest of the distribution

Let's go through the terminology one-by-one

Box plots show the five-number summary of data:

1. Minimum score
2. First (lower) quartile
3. Median
4. Third (upper) quartile
5. Maximum score

Minimum Score
It is the lowest value, excluding outliers

It is shown at the end of bottom whisker

Lower Quartile
25% of values fall below the lower quartile value

It is also known as the first quartile.

Median
Median marks the mid-point of the data

Half the scores are greater than or equal to this value and half are less.

It is sometimes known as the second quartile.

Upper Quartile
75% of the values fall below the upper quartile value

It is also known as the third quartile.

Maximum Score
It is the highest value, excluding outliers

It is shown at the end of upper whisker.

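To tie this terminology to code, a single boxplot of global sales could look like the sketch below (a minimal example on the same dataset, not part of the original lecture code):

sns.boxplot(y=data['Global_Sales'])
plt.title('Distribution of Global Sales')
plt.show()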

keyboard_arrow_down Content
Bivariate
Continous-Continous

Line plot
Styling and Labelling
Scatterplot

Categorical-Categorical

Dodged countplot
Stacked countplot

Categorical-Continuous

Multiple BoxPlots

Subplots

keyboard_arrow_down Importing the data


!wget https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/021/299/original/final_vg1_-_final_vg_%281%29.csv?16708

--2024-02-06 16:33:16-- https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/021/299/original/final_vg1_-_fin


Resolving d2beiqkhq929f0.cloudfront.net (d2beiqkhq929f0.cloudfront.net)... 18.172.139.94, 18.172.139.46, 18.172.139.61,
Connecting to d2beiqkhq929f0.cloudfront.net (d2beiqkhq929f0.cloudfront.net)|18.172.139.94|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2041483 (1.9M) [text/plain]
Saving to: ‘vgsales.csv’

vgsales.csv 100%[===================>] 1.95M --.-KB/s in 0.07s

2024-02-06 16:33:16 (28.3 MB/s) - ‘vgsales.csv’ saved [2041483/2041483]

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('vgsales.csv')
data.head()

Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sal

0 2061 1942 NES 1985.0 Shooter Capcom 4.569217 3.033887 3.439352 1.9916

1 9137 ¡Shin Chan Flipa en colores! DS 2007.0 Platform 505 Games 2.076955 1.493442 3.033887 0.3948

2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.493442 0.4086

3 8359 .hack//G.U. Vol.1//Rebirth PS2 2006.0 Role-Playing Namco Bandai Games 2.031986 1.389856 3.228043 0.3948

4 7109 .hack//G.U. Vol.2//Reminisce PS2 2006.0 Role-Playing Namco Bandai Games 2.792725 2.592054 1.440483 1.4934

data.describe()

Rank Year NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales

count 16652.000000 16381.000000 16652.000000 16652.000000 16652.000000 16652.000000 16652.000000

mean 8283.409620 2006.390513 2.752314 1.996875 2.499677 1.151829 8.457873

std 4794.471477 5.863261 1.327002 1.322972 1.164023 1.054813 3.717756

min 1.000000 1980.000000 0.140000 0.010000 0.000000 -0.474276 0.240000

25% 4129.750000 2003.000000 1.781124 1.087977 1.781124 0.394830 5.580341

50% 8273.500000 2007.000000 2.697415 1.714664 2.480356 0.491870 7.536614

75% 12436.250000 2010.000000 3.677290 2.795123 3.176299 1.781124 11.227334

max 16600.000000 2020.000000 8.725452 8.367985 12.722984 7.358020 30.555862


keyboard_arrow_down Bivariate Data Visualization
keyboard_arrow_down Continuous-Continuous
So far we have been analyzing only a single feature.

But what if we want to visualize two features at once?

What kind of questions can we ask regarding a continuous-continuous pair of features?

Maybe show the relation between two features, like how do the sales vary over the years?
Or show how the features are associated, positively or negatively?

...And so on

Let's go back to the line plot we plotted at the very beginning

keyboard_arrow_down Line Plot


keyboard_arrow_down
How can we plot the sales trend over the years for the longest running game?

First, let's find the longest running game

data['Name'].value_counts()

Ice Hockey 41
Baseball 17
Need for Speed: Most Wanted 12
Ratatouille 9
FIFA 14 9
..
Indy 500 1
Indy Racing 2000 1
Indycar Series 2005 1
inFAMOUS 1
Zyuden Sentai Kyoryuger: Game de Gaburincho!! 1
Name: Name, Length: 11493, dtype: int64

Great, so Ice Hockey is the longest running game

Let's try to find the sales trend in North America of the same across the years

ih = data.loc[data['Name']=='Ice Hockey']
sns.lineplot(x='Year', y='NA_Sales', data=ih)

<Axes: xlabel='Year', ylabel='NA_Sales'>

What can we infer from this graph?

The sales across North America seem to have been boosted in the years of 1995-2005
Post 2010 though, the sales seem to have taken a dip

Line plots are great for representing trends such as the above, over time

keyboard_arrow_down Style and Labelling


We already learnt in barplot how to add titles, x-label and y-label

Let's add the same here

plt.title('Ice Hockey Sales Trend')


plt.xlabel('Year')
plt.ylabel('Sales')
sns.lineplot(x='Year', y='NA_Sales', data=ih)
plt.show()

keyboard_arrow_down Now what if we want to change the colour of the curve ?

sns.lineplot() contains an argument color

It takes as argument a matplotlib color

OR

as string for some defined colours like:

black: k / black
red: r / red etc

But what all colours can we use ?

Matplotlib provides a lot of colours

Check the documentation for more colours

https://matplotlib.org/2.0.2/api/colors_api.html

plt.title('Ice Hockey Sales Trend')


plt.xlabel('Year')
plt.ylabel('Sales')
sns.lineplot(x='Year', y='NA_Sales', data=ih, color='r')
plt.show()
Now, lets say we only want to show the values from years 1990-2000

keyboard_arrow_down How can we limit our plot to only the last decade of the 20th century?

This requires changing the range of x-axis

But how can we change the range of an axis in matplotlib ?


We can use:

plt.xlim() : x-axis
plt.ylim() : y-axis

These functions take the same 2 arguments:

1. left : Starting point of range


2. right : End point of range

plt.title('Ice Hockey Sales Trend')


plt.xlabel('Year')
plt.ylabel('NA Sales')
plt.xlim(left=1995,right=2010)
sns.lineplot(x='Year', y='NA_Sales', data=ih)
plt.show()

So far we have visualised a single plot to understand it

What if we want to compare it with some other plot?


Say, we want to compare the same sales trend between two games

Ice Hockey
Baseball

Let's first plot the trend for "Baseball"

baseball = data.loc[data['Name']=='Baseball']
sns.lineplot(x='Year', y='NA_Sales', data=baseball)

<Axes: xlabel='Year', ylabel='NA_Sales'>

Now, to compare these, we will have to draw both plots in the same figure

keyboard_arrow_down How can we plot multiple plots in the same figure?

sns.lineplot(x='Year', y='NA_Sales', data=ih)


sns.lineplot(x='Year', y='NA_Sales', data=baseball)

<Axes: xlabel='Year', ylabel='NA_Sales'>

We can simply call sns.lineplot() multiple times

Observe:

Seaborn automatically created 2 plots with different colors

keyboard_arrow_down However, how can we know which colour belongs to which plot?

sns.lineplot() has another argument label to do so


We can simply set the label of each plot
sns.lineplot(x='Year', y='NA_Sales', data=ih, label='Ice Hockey')
sns.lineplot(x='Year', y='NA_Sales', data=baseball, label='Baseball')

<Axes: xlabel='Year', ylabel='NA_Sales'>

We can also pass these labels in plt.legend() as a list in the order plots are done

sns.lineplot(x='Year', y='NA_Sales', data=ih)


sns.lineplot(x='Year', y='NA_Sales', data=baseball)
plt.legend(['Ice Hockey','Baseball'])
plt.show()

keyboard_arrow_down Now can we change the position of the legend, say, to bottom-right corner?

Matplotlib automatically decides the best position for the legends

But we can also change it using the loc parameter

loc takes input as 1 of following strings:

upper center
upper left
upper right
lower right ... etc

sns.lineplot(x='Year', y='NA_Sales', data=ih)


sns.lineplot(x='Year', y='NA_Sales', data=baseball)
plt.legend(['Ice Hockey','Baseball'], loc='lower right')
plt.show()
keyboard_arrow_down Now what if we want the legend to be outside the plot?

Maybe the plot is too congested to show the legend

We can use the same loc parameter for this too

sns.lineplot(x='Year', y='NA_Sales', data=ih)


sns.lineplot(x='Year', y='NA_Sales', data=baseball)
plt.legend(['Ice Hockey','Baseball'], loc=(-0.5,0.5))
plt.show()

The pair of floats signify the (x,y) coordinates for the legend

==> From this we can conclude loc takes two types of arguments:

The location in the form of string


The location in the form of coordinates

keyboard_arrow_down What if we want to add other stylings to legends ?

For eg:

Specify the number of rows/cols

Use the parameter ncol for this


The number of rows is decided automatically

Decide if we want the box of legends to be displayed

Use the bool param frameon


and so on.

sns.lineplot(x='Year', y='NA_Sales', data=ih)


sns.lineplot(x='Year', y='NA_Sales', data=baseball)
plt.legend(['Ice Hockey','Baseball'], loc='lower right', ncol = 2, frameon = False)
plt.show()

Now say we want to highlight a point on our curve.

For example:

keyboard_arrow_down How can we highlight the maximum "Ice Hockey" sales across all years?

Let's first find this point

print(max(ih['NA_Sales']))

0.9

If we observe, this point lies around the year 2004-05

Now we need to add text to this point (2004,0.9)

keyboard_arrow_down How can we add text to a point in a figure ?

sns.lineplot(x='Year', y='NA_Sales', data=ih)


sns.lineplot(x='Year', y='NA_Sales', data=baseball)
plt.legend(['Ice Hockey','Baseball'], loc='lower left')
plt.text(2004,max(ih['NA_Sales']), 'Maximum Ice Hockey Sales')
plt.show()
By using plt.text()

Pass in the x and y coordinates where we want the text to appear

Pass in the text string

We can also use plt.grid() to show the grid layout in the background

sns.lineplot(x='Year', y='NA_Sales', data=ih)


sns.lineplot(x='Year', y='NA_Sales', data=baseball)
plt.legend(['Ice Hockey','Baseball'], loc='lower left')
plt.text(2004,max(ih['NA_Sales']), 'Maximum Ice Hockey Sales')
plt.grid()
plt.show()

Note:

We can pass in parameters inside plt.grid() to control its density, colour of grid lines, etc.

You can look it up later on how to customize the grid

https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.grid.html

keyboard_arrow_down Scatter Plot


Now suppose we want to find the relation between Rank and Sales of all games.

keyboard_arrow_down Are Rank and Sales positively or negatively correlated?

In this case, unlike a line plot, there may be multiple points on the y-axis for each point on the x-axis

data.head()

Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sal

0 2061 1942 NES 1985.0 Shooter Capcom 4.569217 3.033887 3.439352 1.9916

1 9137 ¡Shin Chan Flipa en colores! DS 2007.0 Platform 505 Games 2.076955 1.493442 3.033887 0.3948

2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.493442 0.4086

3 8359 .hack//G.U. Vol.1//Rebirth PS2 2006.0 Role-Playing Namco Bandai Games 2.031986 1.389856 3.228043 0.3948

4 7109 .hack//G.U. Vol.2//Reminisce PS2 2006.0 Role-Playing Namco Bandai Games 2.792725 2.592054 1.440483 1.4934

keyboard_arrow_down How can we plot the relation between Rank and Global Sales ?

Can we use lineplot?

Let's try it out


sns.lineplot(data=data, x='Rank', y='Global_Sales')

<Axes: xlabel='Rank', ylabel='Global_Sales'>

The plot itself looks very messy and it's hard to find any patterns from it.

keyboard_arrow_down Is there any other way we can visualize this relation?

Use scatter plot

sns.scatterplot(data=data, x='Rank', y='Global_Sales')

<Axes: xlabel='Rank', ylabel='Global_Sales'>

Compared to lineplot, we are able to see the patterns and points more distinctly now!

Notice,

The two variables are negatively correlated with each other

As the rank number increases, the sales tend to go down, implying that games with a better (numerically lower) rank have higher sales overall!

Scatter plots help us visualize these relations and find any patterns in the data

Key Takeaways:

For Continuous-Continuous Data => Scatter Plot, Line Plot

Sometimes, people also like to display the linear trend between two variables - Regression Plot, do check that
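For reference, such a linear trend can be drawn with seaborn's regplot. A hedged sketch (the alpha value is just an illustrative choice to reduce overplotting):

sns.regplot(data=data, x='Rank', y='Global_Sales', scatter_kws={'alpha': 0.3})
plt.show()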

If you notice, Genres , Publisher and Platform are categorical values

Since we have a lot of categories of each of them, we will use the top 3 of each to make our analysis easier

top3_pub = data['Publisher'].value_counts().index[:3]
top3_gen = data['Genre'].value_counts().index[:3]
top3_plat = data['Platform'].value_counts().index[:3]
top3_data = data.loc[(data["Publisher"].isin(top3_pub)) & (data["Platform"].isin(top3_plat)) & (data['Genre'].isin(top3_gen))]
top3_data

Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sal

2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.4934

13 2742 [Prototype 2] PS3 2012.0 Action Activision 3.978349 3.727034 0.8488

16 1604 [Prototype] PS3 2009.0 Action Activision 4.569217 4.108402 1.1872

19 1741 007: Quantum of Solace PS3 2008.0 Action Activision 4.156030 4.346074 1.0879

21 4501 007: Quantum of Solace PS2 2008.0 Action Activision 3.228043 2.738800 2.5855

... ... ... ... ... ... ... ... ...

16438 14938 Yes! Precure 5 Go Go Zenin Shu Go! Dream Festival DS 2008.0 Action Namco Bandai Games 1.087977 0.592445 1.0879

16479 10979 Young Justice: Legacy PS3 2013.0 Action Namco Bandai Games 2.186589 1.087977 3.4090

16601 11802 ZhuZhu Pets: Quest for Zhu DS 2011.0 Misc Activision 2.340740 1.525543 3.1038

16636 9196 Zoobles! Spring to Life! DS 2011.0 Misc Activision 2.697415 1.087977 2.7607

16640 9816 Zubo DS 2008.0 Misc Electronic Arts 2.592054 1.493442 1.4934

617 rows × 11 columns

keyboard_arrow_down Categorical-Categorical
Earlier we saw how to work with a continuous-continuous pair of data

Now let's come to the second type of pair of data: Categorical-Categorical

What questions comes to your mind when we say categorical-categorical pair?

Questions related to distribution of a category within another category

What is the distribution of genres for top-3 publishers?


Which platforms do these top publishers use?

Which plot can we use to show distribution of one category with respect to another?

-> We can have multiple bars for each category

These multiple bars can be stacked together - Stacked Countplot

Or

Can be placed next to each other - Dodged Countplot

keyboard_arrow_down Dodged Count Plot


keyboard_arrow_down
How can we compare the top 3 platforms these publishers use?

We can use a dodged countplot in this case

plt.figure(figsize=(10,8))
sns.countplot(x='Publisher',hue='Platform',data=top3_data)
plt.ylabel('Count of Games')
Text(0, 0.5, 'Count of Games')

What can we infer from the dodged countplot?

EA releases PS2 games way more than any other publisher, or even platform!
Activision has almost the same count of games for all 3 platforms
EA is leading in PS3 and PS2, but Namco leads when it comes to DS platform

keyboard_arrow_down Stacked Countplot


keyboard_arrow_down
How can we visualize the distribution of genres for top-3 publishers?

We can use a stacked countplot

Do refer the postread for the code on how to plot it
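For completeness, one common way to build a stacked countplot (not necessarily the postread's approach) is to pivot the counts with pd.crosstab and use pandas' stacked bar plotting, as in this sketch:

# Counts of each Genre within each Publisher, drawn as stacked bars
counts = pd.crosstab(top3_data['Publisher'], top3_data['Genre'])
counts.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.ylabel('Count of Games')
plt.show()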

But stacked countplots can be misleading

Some may find it difficult to tell whether a segment starts from the baseline or sits on top of the segment below it

How do we decide between a Stacked countplot and a Dodged countplot?

Stacked countplots are a good way to represent totals


While dodged countplots help us to compare values between various categories, and within a category itself too

keyboard_arrow_down Continuous-Categorical
Now let's look at our 3rd type of data pair

What kind of questions we may have regarding a continuous-categorical pair?

We might want to calculate some numbers category-wise

Like What is the average sales for every genre?


Or we might be interested in checking the distribution of the data category-wise

What is the distribution of sales for the top3 publishers?

What kind of plot can we make for every category?

-> Either KDE plot or Box Plot per category

keyboard_arrow_down Boxplot
keyboard_arrow_down
What is the distribution of sales for the top3 publishers?

sns.boxplot(x='Publisher', y='Global_Sales', data=top3_data)


plt.xticks(rotation=90,fontsize=12)
plt.yticks(fontsize=12)
plt.title('Sales for top3 publisher', fontsize=15)
plt.show()

What can we infer from this plot?

The overall sales of EA is higher, with a much larger spread than other publishers
Activision doesn't have many outliers, and if you notice, even though the spread is smaller than EA's, the median is almost the same

Barplot

What if we want to compare the sales between the genres?


We have to use:

Genre (categorical)
Mean of global sales per genre (numerical)

keyboard_arrow_down How to visualize which genres bring higher average global sales?

sns.barplot(data=top3_data, x="Genre", y="Global_Sales", estimator=np.mean)

<Axes: xlabel='Genre', ylabel='Global_Sales'>

If you remember, we had earlier seen EA had a larger market share of sales

Along with this fact, the majority of games EA made were sports

This is consistent with Sports having a high market share in the industry, as shown in the bar chart

keyboard_arrow_down Subplots
So far we have shown only 1 plot using plt.show()

Say, we want to plot the trend of NA and every other region separately in a single figure

How can we plot multiple smaller plots at the same time?


We will use subplots, i.e., divide the figure into smaller plots

We will be using plt.subplots() It takes mainly 2 arguments:

1. No. of rows we want to divide our figure into

2. No. of columns we want to divide our figure into

It returns 2 things:

Figure
Numpy Matrix of subplots

fig = plt.figure(figsize=(15,10))
sns.scatterplot(x=top3_data['NA_Sales'], y=top3_data['EU_Sales'])
fig.suptitle('Main title')
plt.show()
fig = plt.figure(figsize=(15,10))

plt.subplot(2, 3, 1)
sns.scatterplot(x='NA_Sales', y='EU_Sales', data=top3_data)

plt.subplot(2, 3, 3)
sns.scatterplot(x='NA_Sales', y='JP_Sales', data=top3_data, color='red')

fig.suptitle('Main title')
Text(0.5, 0.98, 'Main title')

fig, ax = plt.subplots(2, 2, figsize=(15,10))


ax[0,0].scatter(top3_data['NA_Sales'], top3_data['EU_Sales'])
ax[0,1].scatter(top3_data['NA_Sales'], top3_data['JP_Sales'])
ax[1,0].scatter(top3_data['NA_Sales'], top3_data['Other_Sales'])
ax[1,1].scatter(top3_data['NA_Sales'], top3_data['Global_Sales'])
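Each element of the ax matrix is a regular Axes object, so titles can be set per subplot. A small extension of the code above (the titles are just illustrative):

ax[0, 0].set_title('NA vs EU')
ax[0, 1].set_title('NA vs JP')
ax[1, 0].set_title('NA vs Other')
ax[1, 1].set_title('NA vs Global')
fig.suptitle('NA Sales vs other regions')
plt.tight_layout()
plt.show()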
keyboard_arrow_down Content
Multivariate Data Visualization

CCN
CNN
NNN
CCC

JointPlot
Pairplots
Correlation and heatmap

keyboard_arrow_down Importing the data

!wget https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/021/299/original/final_vg1_-_final_vg_%281%29.csv?16708

--2024-02-06 16:35:17-- https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/021/299/original/final_vg1_-_fin


Resolving d2beiqkhq929f0.cloudfront.net (d2beiqkhq929f0.cloudfront.net)... 13.35.37.31, 13.35.37.159, 13.35.37.102, ...
Connecting to d2beiqkhq929f0.cloudfront.net (d2beiqkhq929f0.cloudfront.net)|13.35.37.31|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2041483 (1.9M) [text/plain]
Saving to: ‘vgsales.csv’

vgsales.csv 100%[===================>] 1.95M 2.94MB/s in 0.7s

2024-02-06 16:35:19 (2.94 MB/s) - ‘vgsales.csv’ saved [2041483/2041483]

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('vgsales.csv')
data.head()

Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sal

0 2061 1942 NES 1985.0 Shooter Capcom 4.569217 3.033887 3.439352 1.9916

1 9137 ¡Shin Chan Flipa en colores! DS 2007.0 Platform 505 Games 2.076955 1.493442 3.033887 0.3948

2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.493442 0.4086

3 8359 .hack//G.U. Vol.1//Rebirth PS2 2006.0 Role-Playing Namco Bandai Games 2.031986 1.389856 3.228043 0.3948

4 7109 .hack//G.U. Vol.2//Reminisce PS2 2006.0 Role-Playing Namco Bandai Games 2.792725 2.592054 1.440483 1.4934

If you remember, Genres , Publisher and Platform were categorical values

Hence similar to last lecture, we will use top 3 of each to make our analysis easier

top3_pub = data['Publisher'].value_counts().index[:3]
top3_gen = data['Genre'].value_counts().index[:3]
top3_plat = data['Platform'].value_counts().index[:3]
top3_data = data.loc[(data["Publisher"].isin(top3_pub)) & (data["Platform"].isin(top3_plat)) & (data['Genre'].isin(top3_gen))]
top3_data

Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sal

2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.4934

13 2742 [Prototype 2] PS3 2012.0 Action Activision 3.978349 3.727034 0.8488

16 1604 [Prototype] PS3 2009.0 Action Activision 4.569217 4.108402 1.1872

19 1741 007: Quantum of Solace PS3 2008.0 Action Activision 4.156030 4.346074 1.0879

21 4501 007: Quantum of Solace PS2 2008.0 Action Activision 3.228043 2.738800 2.5855

... ... ... ... ... ... ... ... ...

16438 14938 Yes! Precure 5 Go Go Zenin Shu Go! Dream Festival DS 2008.0 Action Namco Bandai Games 1.087977 0.592445 1.0879

16479 10979 Young Justice: Legacy PS3 2013.0 Action Namco Bandai Games 2.186589 1.087977 3.4090

16601 11802 ZhuZhu Pets: Quest for Zhu DS 2011.0 Misc Activision 2.340740 1.525543 3.1038

16636 9196 Zoobles! Spring to Life! DS 2011.0 Misc Activision 2.697415 1.087977 2.7607

16640 9816 Zubo DS 2008.0 Misc Electronic Arts 2.592054 1.493442 1.4934

617 rows × 11 columns

keyboard_arrow_down Multivariate
Let's try to add a 3rd variable on top of the plots we have seen so far

keyboard_arrow_down NNC

How can we visualize the correlation between NA and EU, but for different genres?
Here, we have two numerical and one categorical variable!

Numerical-Numerical → Scatterplot, need to add info about one categorical variable

Numerical-Categorical → Boxplot, need to add info about one numerical variable

Let's ask two questions


Is it Possible to add information about a continuous variable upon boxplots?

Perhaps No

Is it Possible to add information about a categorical variable on scatterplot?

Yes, use colors

Solution: Scatterplot with color

plt.figure(figsize=(7,7))
sns.scatterplot(x='NA_Sales', y='EU_Sales',hue='Publisher',data=top3_data)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('NA Sales',fontsize=15)
plt.ylabel('EU Sales',fontsize=15)
plt.title('NA vs EU, per Genre', fontsize=15)
plt.show()

Inferences:
If we see this plot, we can notice now that Namco has lower sales correlation, while Activision has a concentrated positive correlation
EA also has positive correlation, but it's more spread compared to Activision

keyboard_arrow_down CCN
Now, how will you visualize Global Sales for each publisher, but separated by Genres?
We have two categorical and one numerical data here!

Categorical-Categorical → Stacked Barplot, need to add info about one continuous feature

Categorical-Numerical → Boxplots, need to add categorical variable

Which one is easier and possible? We can add one categorical variable by “dodging” multiple boxplots

Solution: Dodged Boxplots

plt.figure(figsize=(15,10))
sns.boxplot(x='Publisher',y='Global_Sales',hue='Genre',data=top3_data)
plt.xlabel('Genre', fontsize=12)
plt.ylabel('Global Sales', fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('Global Sales for each Publisher, Genre wise', fontsize=15)
plt.show()

Inferences:

Namco has lower median sales in every Genre as compared to all publishers

Looking at the Action Genre, even though EA and Activision have almost similar medians, Action is more spread out for EA

An interesting thing to notice here is that, for each of the three publishers, three different genre of games have higher sales median:
Namco: Action
Activision: Misc
EA: Sports

keyboard_arrow_down NNN
So far we have seen how NA and EU are correlated with each other.

But how can we compare the data when we have 3 numerical variables?

Say, the question is, how does rank affect the relationship between NA and JP Sales?
We have used scatter plot for two numerical features, we have two options here

Make a 3D Scatterplot

→ nice for 3D viz, but tough to report/show in static setting


Add info about 3rd feature on the 2D scatter plot itself

→ Bubble Chart

plt.figure(figsize=(15,10))
sns.scatterplot(x='NA_Sales', y='JP_Sales', size='Rank', sizes=(1, 200), data=data)

plt.xlabel('NA Sales', fontsize=10)
plt.ylabel('JP Sales', fontsize=10)
plt.title('NA vs JP Sales, based on ranking of games', fontsize=15)
plt.show()

Inferences:
Interestingly, we can notice that games with a larger Rank value (i.e., ranked lower) sit at the lower end of the sales scale, while the top-ranked
games (small Rank values) are the ones with high sales
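As mentioned above, the other option was a 3D scatterplot; a minimal sketch (assuming the same data DataFrame used for the bubble chart) is shown below, though it is much harder to read in a static report:

from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection (not needed on recent Matplotlib versions)

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(projection='3d')

# one point per game: NA sales vs JP sales vs its rank
ax.scatter(data['NA_Sales'], data['JP_Sales'], data['Rank'])
ax.set_xlabel('NA Sales')
ax.set_ylabel('JP Sales')
ax.set_zlabel('Rank')
plt.show()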

keyboard_arrow_down Joint Plot


keyboard_arrow_down Let's see a few more plots that we can visualize using seaborn

Joint Plot
It draws a plot of two variables

It combines a joint view (scatter, KDE, hexbin, regression, ...) with the marginal distributions (histogram/KDE) of each variable in the same figure.

Let's check it out


We will take NA_Sales as x-coordinates and EU_Sales as y-coordinates

We can select from different values for parameter kind and it will plot accordingly

"scatter" | "kde" | "hist" | "hex" | "reg" | "resid"

We will set the parameter kind to 'reg' here

sns.jointplot(x='NA_Sales', y='EU_Sales', data=top3_data, kind='reg')

plt.show()

keyboard_arrow_down As we can see here:

With kind='reg', jointplot shows a scatterplot (with a fitted regression line), along with histograms and KDE curves, in the same figure

The scatter shows the ( NA_Sales , EU_Sales ) pairs as (x, y) points

The marginal histograms and KDEs show the separate distributions of NA_Sales and EU_Sales in the data
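As a quick aside (a minimal sketch, assuming the same top3_data DataFrame), any of the other kind values listed above can be swapped in, for example a hexbin joint plot:

# hexagonal binning in the joint area, histograms on the margins
sns.jointplot(x='NA_Sales', y='EU_Sales', data=top3_data, kind='hex')
plt.show()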

We can also add hue to Joint Plot


Let's check how the 3 Genres of games are distributed in terms of NA_Sales and EU_Sales

sns.jointplot(x='NA_Sales', y='EU_Sales', data=top3_data, hue='Genre')


plt.show()
keyboard_arrow_down Pair Plot
pairplot() in seaborn creates a grid of Axes by default

Each numeric attribute in the data is shared across the y-axes of a single row and the x-axes of a single column.

It displays a scatterplot between each pair of attributes in the data, with a different hue for each category

Since the diagonal plots have the same attribute on both the x and y axis, they are treated differently:
A univariate distribution plot is drawn to show the marginal distribution of the data in each column.

Let's check it out

sns.pairplot(data=top3_data, hue='Genre')
plt.show()
Notice that:

It is like a scatterplot of video games with hue='Genre'

But the scatter is plotted between every pair of attributes

Colour legends for each genre category are given on the right side

It shows relation between each pair of attributes

Diagonal plots are different from scatterplots

because the x and y axes carry the same attribute

Diagonal plots show a univariate curve for each attribute, category-wise


It is also possible to show only a subset of variables, or to plot different variables on the rows and columns
Feel free to experiment with this on your own; a small sketch is given below
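For instance (a minimal sketch, assuming top3_data as before), the vars parameter of pairplot() restricts the grid to chosen columns, while x_vars / y_vars let you put different variables on the rows and columns:

# only the three regional sales columns, still coloured by Genre
sns.pairplot(data=top3_data, vars=['NA_Sales', 'EU_Sales', 'JP_Sales'], hue='Genre')
plt.show()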

keyboard_arrow_down Finding correlations among attributes


We can find the level of correlation b/w different attributes (variables)

But what exactly is a correlation?


Two variables are correlated when they tend to change in the same (or opposite) direction

We can check the coefficient of correlation using corr()

top3_data.corr(numeric_only=True)

Rank Year NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales

Rank 1.000000 0.328705 -0.873726 -0.735711 0.115459 -0.857567 -0.911721

Year 0.328705 1.000000 -0.354256 -0.178026 0.055864 -0.239876 -0.280351

NA_Sales -0.873726 -0.354256 1.000000 0.617483 -0.233315 0.794353 0.856300

EU_Sales -0.735711 -0.178026 0.617483 1.000000 -0.208249 0.771105 0.864147

JP_Sales 0.115459 0.055864 -0.233315 -0.208249 1.000000 -0.355825 -0.014193

Other_Sales -0.857567 -0.239876 0.794353 0.771105 -0.355825 1.000000 0.878816

Global_Sales -0.911721 -0.280351 0.856300 0.864147 -0.014193 0.878816 1.000000

The higher the MAGNITUDE of the coefficient of correlation, the more strongly the variables are correlated

The sign just indicates the direction of change

+ means that as one variable increases, the other tends to increase as well
- means that as one variable increases, the other tends to decrease, and vice versa
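To check a single pair directly (a minimal sketch, assuming top3_data), Series.corr() returns the Pearson coefficient between two columns:

# Pearson correlation between NA and EU sales
top3_data['NA_Sales'].corr(top3_data['EU_Sales'])   # ≈ 0.62, matching the entry in the matrix above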

keyboard_arrow_down As you can see, Global Sales and Rank have the strongest correlation, with a coefficient of about -0.91

Let's plot it using scatter plot

sns.scatterplot(x= 'Global_Sales', y= 'Rank', data = top3_data)


plt.show()

As Global Sales increase, Rank decreases (i.e., the game moves towards the top of the ranking), which is consistent with the negative correlation

keyboard_arrow_down But Remember


Correlation does NOT mean Causation

We cannot conclude that a change in the values of one variable is causing the change in the values of the other variable

Now, Let's look at a way to visualize correlation among variables

Heat Map
A heat map plots rectangular data as a color-encoded matrix.

The stronger the colour, the stronger the correlation between the variables

Let's plot a Heat Map using correlation coefficient matrix generated using corr()

sns.heatmap(top3_data.corr(numeric_only=True), cmap="Blues", annot=True)


plt.show()


annot=True writes the correlation coefficient inside each cell

keyboard_arrow_down You can change the colours of cells in Heat Map if you like

There are a lot of options available!

print(plt.colormaps())

['magma', 'inferno', 'plasma', 'viridis', 'cividis', 'twilight', 'twilight_shifted', 'turbo', 'Blues', 'BrBG', 'BuGn', '

sns.heatmap(top3_data.corr(numeric_only=True), cmap="coolwarm", annot=True)


plt.show()

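keyboard_arrow_down Subplots

For context, here is a minimal sketch of the kind of 2x2 grid discussed next, built with plt.subplots (assuming top3_data as before; the exact columns plotted are just illustrative):

fig, ax = plt.subplots(2, 2, figsize=(15, 10))

# each plot is drawn on one axes object, addressed by its (row, column) index
sns.scatterplot(x='NA_Sales', y='EU_Sales', data=top3_data, ax=ax[0, 0])
sns.scatterplot(x='NA_Sales', y='JP_Sales', data=top3_data, ax=ax[0, 1])
sns.scatterplot(x='NA_Sales', y='Other_Sales', data=top3_data, ax=ax[1, 0])
sns.scatterplot(x='NA_Sales', y='Global_Sales', data=top3_data, ax=ax[1, 1])

plt.show()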

Notice that we are using 2 numbers for each plot

Think of the subplots as a 2x2 grid, with the two numbers denoting the row, column coordinates of each subplot

keyboard_arrow_down What is this ax parameter exactly?

print(ax)

[[<Axes: > <Axes: >]


[<Axes: > <Axes: >]]

Notice,

It's a 2x2 matrix of multiple axes objects

We are plotting each plot on a single axes object.

Hence, we are using a 2D notation to access each grid/axes object of the subplot

Instead of accessing the individual axes using ax[0, 0] , ax[1, 0] , there is another method we can use too

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

plt.figure(figsize=(20,12)).suptitle("NA Sales vs regions",fontsize=20)


# Using a 2x3 subplot
plt.subplot(2, 3, 1)
sns.scatterplot(x='NA_Sales', y='EU_Sales', data=top3_data)

plt.subplot(2, 3, 3)
sns.scatterplot(x='NA_Sales', y='JP_Sales', data=top3_data, color='red')
plt.subplot(2, 3, 4)
sns.scatterplot(x='NA_Sales', y='Other_Sales', data=top3_data, color='green')

plt.subplot(2, 3, 6)
sns.scatterplot(x='NA_Sales', y='Global_Sales', data=top3_data, color='orange')

plt.show()

suptitle() adds a title to the whole figure

We need to observe a few things here

1. The 3rd parameter defines the position of the plot
2. The position/numbering starts from 1
3. It goes row-wise, from the start of a row to its end
4. Empty subplot positions don't show any axes

keyboard_arrow_down But how do we know which plot belongs to which comparison?

Basically, the context of each plot

We can use title , x/y labels and every other functionality for the subplots too

plt.figure(figsize=(20,12)).suptitle("NA Sales vs regions",fontsize=20)


# Using a 2x3 subplot
plt.subplot(2, 3, 1)
sns.scatterplot(x='NA_Sales', y='EU_Sales', data=top3_data)
plt.title('NA vs EU Sales', fontsize=12)
plt.xlabel('NA', fontsize=12)
plt.ylabel('EU', fontsize=12)

plt.subplot(2, 3, 3)
sns.scatterplot(x='NA_Sales', y='JP_Sales', data=top3_data, color='red')
plt.title('NA vs JP Sales', fontsize=12)
plt.xlabel('NA', fontsize=12)
plt.ylabel('JP', fontsize=12)

plt.subplot(2, 3, 4)
sns.scatterplot(x='NA_Sales', y='Other_Sales', data=top3_data, color='green')
plt.title('NA vs Other Region Sales', fontsize=12)
plt.xlabel('NA', fontsize=12)
plt.ylabel('Other', fontsize=12)

plt.subplot(2, 3, 6)
sns.scatterplot(x='NA_Sales', y='Global_Sales', data=top3_data, color='orange')
plt.title('NA vs Global Sales', fontsize=12)
plt.xlabel('NA', fontsize=12)
plt.ylabel('Global', fontsize=12)

plt.show()

keyboard_arrow_down What if we want a single plot to span the full height of the figure?

Think of this in terms of a grid.

Currently we are dividing our figure into 2 rows and 3 columns


But we want one plot to occupy the middle column, i.e. grid cells 2 and 5

Those two cells together form a single column

So the problem simplifies to plotting in the second column of a 1-row, 3-column subplot grid

plt.figure(figsize=(20,12)).suptitle("Video Games Sales Dashboard",fontsize=20)


# Using a 2x3 subplot
plt.subplot(2, 3, 1)
sns.scatterplot(x='NA_Sales', y='EU_Sales', data=top3_data)
plt.title('NA vs EU Sales', fontsize=12)
plt.xlabel('NA', fontsize=12)
plt.ylabel('EU', fontsize=12)
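Continuing this dashboard, a minimal sketch (assuming the same top3_data columns) of how the remaining panels could be added, with the middle column claimed via plt.subplot(1, 3, 2) so that one plot spans both rows:

plt.subplot(2, 3, 4)
sns.scatterplot(x='NA_Sales', y='JP_Sales', data=top3_data, color='red')
plt.title('NA vs JP Sales', fontsize=12)

# a 1x3 grid has the same overall width as the 2x3 grid, so position 2 of the
# 1x3 grid covers cells 2 and 5 of the 2x3 grid, i.e. the full middle column
plt.subplot(1, 3, 2)
sns.boxplot(x='Publisher', y='Global_Sales', data=top3_data)
plt.title('Global Sales per Publisher', fontsize=12)

plt.subplot(2, 3, 3)
sns.scatterplot(x='NA_Sales', y='Other_Sales', data=top3_data, color='green')
plt.title('NA vs Other Region Sales', fontsize=12)

plt.subplot(2, 3, 6)
sns.scatterplot(x='NA_Sales', y='Global_Sales', data=top3_data, color='orange')
plt.title('NA vs Global Sales', fontsize=12)

plt.show()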
