Numpy 1 Merged
Content
Introduction to DAV
Python Lists vs Numpy Array
Importing Numpy
Why use Numpy?
Numpy
Pandas
Matplotlib & Seaborn
2. DAV-2: Probability Statistics
3. DAV-3: Hypothesis Testing
Because of this heterogeneity, in Python lists the data elements are not stored together in memory (RAM).
On the other hand, Numpy only stores homogeneous data, i.e. a Numpy array cannot contain mixed data types.
It will either upcast all elements to a common data type or raise an error.
Speed
In fact, with Numpy, although we write our code in Python, behind the scenes the heavy lifting is implemented in the C programming language, which makes it faster.
Because of this, a Numpy array will be significantly faster than a Python list at performing the same operation.
This is very important to us because, in data science, we deal with huge amounts of data.
Properties
In-built Functions
Slicing
import numpy as np
Note:
Numpy comes pre-installed on Google Colab, so it is already available here.
However, when working in an environment that does not have it installed, you'll have to install it the first time.
This can be done with the command: !pip install numpy
type(a)
list
The basic approach here would be to iterate over the list and square each element.
We can convert any list a into a Numpy array using the array() function.
b = np.array(a)
b
array([1, 2, 3, 4, 5])
type(b)
numpy.ndarray
Now, how can we get the square of each element in the same Numpy array?
b**2
But are the clean syntax and ease of writing the only benefits we are getting here?
l = range(1000000)
343 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It took approx 300 ms per loop to iterate and square all elements from 0 to 999,999
l = np.array(range(1000000))
%timeit l**2
778 µs ± 100 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Notice that the Numpy operation took only around 800 µs per loop, i.e. it is hundreds of times faster.
arr1 = np.array(range(1000000))
arr1.ndim
Numpy arrays have another property called shape that tells us number of elements across every dimension.
arr1.shape
(1000000,)
This means that the array arr1 has 1000000 elements in a single dimension.
[[ 1 2 3]
[ 4 5 6]
[10 11 12]]
What do you think will be the shape & dimension of this array?
arr2.ndim
arr2.shape
(3, 3)
ndim specifies the number of dimensions of the array i.e. 1D (1), 2D (2), 3D (3) and so on.
shape returns the exact shape in all dimensions, that is (3,3) which implies 3 in axis 0 and 3 in axis 1.
np.arange()
We can pass starting point, ending point (not included in the array) and step-size.
Syntax: np.arange(start, stop, step)
arr2 = np.arange(1, 5)
arr2
array([1, 2, 3, 4])
arr2_step = np.arange(1, 5, 2)
arr2_step
array([1, 3])
array([1, 2, 3, 4])
Similarly, what will happen when we run the following code? Will it give an error?
array([1, 2, 3, 4])
arr5 = np.array([1, 2, 3, 4], dtype="float")
arr5
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-26-bdb627c3c07e> in <cell line: 1>()
----> 1 np.array(["Shivank", "Bipin", "Ritwik"], dtype=float)
Since alphabetic strings cannot be converted to floats, it naturally raises an error.
We can also convert the data type with the astype() method.
arr = arr.astype('float64')
print(arr)
Indexing
Similar to Python lists
m1 = np.arange(12)
m1
11
m1 = np.array([100,200,300,400,500,600])
m1[[2,3,4,1,2,2]]
Did you notice how single index can be repeated multiple times when giving list of indexes?
Note:
If you want to extract multiple indices, you need to use two sets of square brackets [[ ]]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-34-0ec34089038e> in <cell line: 1>()
----> 1 m1[2,3,4,1,2,2]
IndexError: too many indices for array: array is 1-dimensional, but 6 were indexed
Slicing
Similar to Python lists
m1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m1
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m1[:5]
array([1, 2, 3, 4, 5])
m1[-5:-1]
array([6, 7, 8, 9])
array([], dtype=int64)
m1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m1 < 6
array([ True, True, True, True, True, False, False, False, False,
False])
m1[[True, True, True, True, True, False, False, False, False, False]]
array([1, 2, 3, 4, 5])
Now, let's use this to filter or mask values from our array.
m1[m1 < 6]
array([1, 2, 3, 4, 5])
m1[m1%2 == 0]
array([ 2, 4, 6, 8, 10])
You've been asked to analyze user survey data and report NPS to the management.
This form asks you to fill in feedback regarding how you are liking the services of Scaler in terms of a numerical score.
This is known as the Likelihood to Recommend Survey.
It is widely used by different companies and service providers to evaluate their performance and customer satisfaction.
Range of NPS
Promoters are highly likely to recommend your product or service, hence bringing in more business,
whereas Detractors are likely to recommend against your product or service's usage, hence driving business down.
These insights can help a business make customer-oriented decisions and improve the product.
Even at Scaler, every month we randomly reach out to our learners over a call and try to understand their experience.
Based on the feedback received, we sometimes end up getting really good insights and act on them.
Dataset: https://drive.google.com/file/d/1c0ClC8SrPwJq5rrkyMKyPn80nyHcFikK/view?usp=sharing
type(score)
numpy.ndarray
score[:5]
score.shape
(1167,)
% Promoters
% Detractors
In order to calculate % Promoters and % Detractors, we need the counts of promoters and detractors.
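A minimal sketch of getting those counts with boolean masks, assuming the usual NPS convention (scores of 9-10 are Promoters, scores of 0-6 are Detractors); NPS itself is then % Promoters minus % Detractors:
detractors = score[score <= 6]   # assumption: a detractor gave a score of 6 or below
promoters = score[score >= 9]    # assumption: a promoter gave a score of 9 or 10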
# Number of detractors -
num_detractors = len(detractors)
num_detractors
332
# Number of promoters -
num_promoters = len(promoters)
num_promoters
609
total = len(score)
total
1167
# % of detractors -
28.449014567266495
# % of promoters -
52.185089974293064
23.73607540702657
# Rounding off upto 2 decimal places -
np.round(nps, 2)
output 23.74
Numpy 2
Content
Working with 2D arrays (Matrices)
Transpose
Indexing
Slicing
Fancy Indexing (Masking)
Aggregate Functions
Logical Operations
np.any()
np.all()
np.where()
import numpy as np
a = np.array(range(16))
a
a.shape
(16,)
a.ndim
Using reshape()
a.reshape(8, 2)
array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11],
[12, 13],
[14, 15]])
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
a.reshape(4, 5)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-05ad01dfd0f5> in <cell line: 1>()
----> 1 a.reshape(4, 5)
a.reshape(8, -1)
array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11],
[12, 13],
[14, 15]])
Notice that Numpy automatically figured out what the -1 argument should be, given that the first argument is 8.
We can also put -1 as the first argument: as long as the other dimension is given, Numpy will calculate the one marked -1.
a.reshape(-1, -1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-decf4fe03d74> in <cell line: 1>()
----> 1 a.reshape(-1, -1)
a = a.reshape(8, 2)
Explanation: len(ndarray) gives you the size of the first dimension (for a 2D array, the number of rows).
len(a)
len(a[0])
2
Transpose
Let's create a 2D numpy array.
a = np.arange(12).reshape(3,4)
a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
a.shape
(3, 4)
It basically means that the no. of rows is interchanged by no. of cols, and vice-versa.
a.T
array([[ 0, 4, 8],
[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11]])
a.T.shape
(4, 3)
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
a[1][2]
m1 = np.arange(1,10).reshape((3,3))
m1
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
m1[1, 1] # m1[row,column]
m1 = np.array([100,200,300,400,500,600])
m1[2, 3]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-23-963ce94bbe14> in <cell line: 1>()
----> 1 m1[2, 3]
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
Note:
Therefore, you cannot use the same syntax for 1D arrays, as you did with 2D arrays, and vice-versa.
However with a little tweak in this code, we can access elements of m1 at different positions/indices.
m1[[2, 3]]
array([300, 400])
How will you print the diagonal elements of the following 2D array?
m1 = np.arange(9).reshape((3,3))
m1
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
array([0, 4, 8])
When a list of indices is provided for both rows and cols, for example m1[[0,1,2],[0,1,2]], Numpy picks the elements at positions (0,0), (1,1) and (2,2), i.e. the diagonal.
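For reference, both of the following produce the diagonal above (np.diag is an alternative not shown in the lecture):
m1[[0, 1, 2], [0, 1, 2]]
np.diag(m1)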
m1 = np.arange(12).reshape(3,4)
m1
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
array([[0, 1],
[4, 5],
[8, 9]])
array([[ 1, 2],
[ 5, 6],
[ 9, 10]])
m1 = np.arange(12).reshape(3, 4)
m1
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
m1 < 6
m1[m1 < 6]
array([0, 1, 2, 3, 4, 5])
np.sum()
a = np.arange(1, 11)
a
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
np.sum(a)
55
What if we want to find the average value or median value of all the elements in an array?
np.mean()
np.mean(a)
5.5
Now, we want to find the minimum / maximum value in the array.
np.min() / np.max()
np.min(a)
np.max(a)
10
a = np.arange(12).reshape(3, 4)
a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
np.sum(a) # sums all the values present in the array
66
np.sum(a, axis=0)
np.sum(a, axis=1)
np.any()
returns True if at least one element of the argument array (or condition) evaluates to True.
Imagine you have a shopping list with items you need to buy, but you're not sure if you have enough money to buy everything.
You want to check if there's at least one item on your list that you can afford.
import numpy as np
# Prices of the items on your list (example values; the original array is not shown)
prices = np.array([45, 20, 35, 60])
# Your budget
budget = 30
# True if at least one item costs no more than the budget
can_afford = np.any(prices <= budget)
if can_afford:
    print("You can buy at least one item on your list!")
else:
    print("Sorry, nothing on your list fits your budget.")
np.all()
returns True if all the elements in the argument arrays follow the provided condition.
Let's consider a scenario where you have a list of chores, and you want to make sure all the chores are done before you can play video games.
You can use np.all to check if all the chores are completed.
import numpy as np
# Status of each chore (example values): True means the chore is done
chores_done = np.array([True, True, False, True])
all_chores_done = np.all(chores_done)  # True only if every chore is done
if all_chores_done:
    print("Great job! You've completed all your chores. Time to play!")
else:
    print("Finish all your chores before you can play.")
a = np.array([1, 2, 3, 2])
b = np.array([2, 2, 3, 2])
c = np.array([6, 4, 4, 5])
True
Suppose you are given an array of integers and you want to update it based on the following condition:
arr[arr > 0] = 1
arr[arr < 0] = -1
arr
np.where()
Suppose you have a list of product prices, and you want to apply a 10% discount to all products with prices above $50.
You can use np.where to adjust the prices.
import numpy as np
# Product prices
prices = np.array([45, 55, 60, 75, 40, 90])
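A sketch of applying the discount with np.where, using the threshold and rate from the prompt above:
discounted = np.where(prices > 50, prices * 0.9, prices)   # 10% off items priced above $50
discounted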
You've been given user data to analyse and find insights which can be shown on the smart watch.
But why would we want to analyse the user data for designing the watch?
These insights from the user data can help the business make customer-oriented decisions for the product design.
So, EDA is about looking at your things, which is data in this case, to understand them better and find out interesting stuff about them.
Formally defining, Exploratory Data Analysis (EDA) is a process of examining, summarizing, and visualizing data sets to understand their main characteristics and uncover patterns, which helps analysts and data scientists gain insights into the data, make informed decisions, and guide further analysis or modeling.
import numpy as np
!gdown https://drive.google.com/uc?id=1vk1Pu0djiYcrdc85yUXZ_Rqq2oZNcohd
Downloading...
From: https://drive.google.com/uc?id=1vk1Pu0djiYcrdc85yUXZ_Rqq2oZNcohd
To: /content/fit.txt
100% 3.43k/3.43k [00:00<00:00, 11.3MB/s]
We provide the file name along with the dtype of data that we want to load in.
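The loading call itself is not shown above; one plausible way (a sketch, assuming fit.txt is whitespace-separated with no extra header handling needed) is:
data = np.loadtxt('fit.txt', dtype=str)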
data.shape
(96, 6)
data.ndim
Date
Step Count
Mood
Calories Burned
Hours of Sleep
Activity Status
Notice that the above array is homogeneous, containing all the data as strings.
In order to work with strings, categorical data and numerical data, we'll have to save every feature separately.
data[0]
data[1]
array(['07-10-2017', '6041', 'Sad', '197', '8', 'Inactive'], dtype='<U10')
data[:5]
How to do that?
One way is to just go ahead and fetch the column number 0 from all rows.
Another way is to take a transpose of data.
Approach 1
data[:, 0]
Approach 2
data_t = data.T
Don't you think all the dates will now be present in the first (i.e. index 0th element) of data_t ?
data_t[0]
data_t.shape
(6, 96)
Let's extract all the columns and save them in separate variables.
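One way to do this (a sketch; the calorie and sleep variable names are illustrative) is to unpack the rows of the transposed array:
date, step_count, mood, calories_burned, hours_of_sleep, activity_status = data_t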
step_count
step_count.dtype
dtype('<U10')
It's a string dtype, where U means Unicode string and 10 means a maximum length of 10 characters.
Step Count
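The step counts are stored as strings, so they are presumably converted to integers with astype before the checks below:
step_count = step_count.astype(int)
step_count.dtype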
dtype('int64')
step_count
array([5464, 6041, 25, 5461, 6915, 4545, 4340, 1230, 61, 1258, 3148,
4687, 4732, 3519, 1580, 2822, 181, 3158, 4383, 3881, 4037, 202,
292, 330, 2209, 4550, 4435, 4779, 1831, 2255, 539, 5464, 6041,
4068, 4683, 4033, 6314, 614, 3149, 4005, 4880, 4136, 705, 570,
269, 4275, 5999, 4421, 6930, 5195, 546, 493, 995, 1163, 6676,
3608, 774, 1421, 4064, 2725, 5934, 1867, 3721, 2374, 2909, 1648,
799, 7102, 3941, 7422, 437, 1231, 1696, 4921, 221, 6500, 3575,
4061, 651, 753, 518, 5537, 4108, 5376, 3066, 177, 36, 299,
1447, 2599, 702, 133, 153, 500, 2127, 2203])
step_count.shape
(96,)
We saw in last class that since it is a 1D array, its shape will be (96, ) .
If it were a 2D array, its shape would've been (96, 1) .
Calories Burned
dtype('int64')
Hours of Sleep
dtype('int64')
Mood
Mood belongs to categorical data type. As the name suggests, categorical data type has two or more categories in it.
mood
np.unique(mood)
Activity Status
activity_status
Since we've extracted from the same source array, we know that the entries of all these columns stay aligned, i.e. the ith value of each column belongs to the same day.
Can we extract the step counts, when the mood was Happy?
step_count_happy = step_count[mood == 'Happy']
len(step_count_happy)
40
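The other two mood masks are built the same way (assuming the labels 'Sad' and 'Neutral' appear in mood):
step_count_sad = step_count[mood == 'Sad']
step_count_neutral = step_count[mood == 'Neutral']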
array([6041, 25, 5461, 4545, 4340, 1230, 61, 1258, 3148, 4687, 3519,
1580, 2822, 181, 6676, 3721, 1648, 799, 1696, 221, 4061, 651,
753, 518, 177, 36, 299, 702, 133])
len(step_count_sad)
29
array([5464, 6915, 3158, 4383, 3881, 4037, 202, 292, 2209, 6041, 570,
1163, 2374, 2909, 7102, 3941, 437, 1231, 4921, 6500, 3575, 4108,
3066, 1447, 2599, 500, 2127])
len(step_count_neutral)
27
How can we collect data for when the mood was either happy or neutral?
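We need an element-wise OR between the two boolean masks, which in Numpy is the | operator (a sketch):
step_count_happy_or_neutral = step_count[(mood == 'Happy') | (mood == 'Neutral')]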
array([5464, 6915, 4732, 3158, 4383, 3881, 4037, 202, 292, 330, 2209,
4550, 4435, 4779, 1831, 2255, 539, 5464, 6041, 4068, 4683, 4033,
6314, 614, 3149, 4005, 4880, 4136, 705, 570, 269, 4275, 5999,
4421, 6930, 5195, 546, 493, 995, 1163, 3608, 774, 1421, 4064,
2725, 5934, 1867, 2374, 2909, 7102, 3941, 7422, 437, 1231, 4921,
6500, 3575, 5537, 4108, 5376, 3066, 1447, 2599, 153, 500, 2127,
2203])
len(step_count_happy_or_neutral)
67
Let's try to compare step counts on bad mood days and good mood days.
np.mean(step_count_sad)
2103.0689655172414
np.mean(step_count_happy)
3392.725
np.mean(step_count_neutral)
3153.777777777778
As you can see, this data tells us a lot about user behaviour.
Let's try to check the mood when step count was greater/lesser.
Out of 38 days when step count was more than 4000, user was feeling happy on 22 days.
Out of 39 days, when step count was less than 2000, user was feeling sad on 18 days.
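A sketch of how such counts can be obtained from the masks (the day counts quoted above are from the original analysis):
mood_high = mood[step_count > 4000]
print(len(mood_high), np.sum(mood_high == 'Happy'))   # days with >4000 steps, and how many were 'Happy'
mood_low = mood[step_count < 2000]
print(len(mood_low), np.sum(mood_low == 'Sad'))       # days with <2000 steps, and how many were 'Sad'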
This suggests that there may be a correlation between the Mood and Step Count.
Numpy 3
Content
Sorting
Matrix Multiplication
np.dot
@ operator
np.matmul
Vectorization
Broadcasting
Sorting
np.sort returns a sorted copy of an array.
import numpy as np
a = np.array([4, 7, 0, 3, 8, 2, 5, 1, 6, 9])
a
array([4, 7, 0, 3, 8, 2, 5, 1, 6, 9])
b = np.sort(a)
b
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
array([4, 7, 0, 3, 8, 2, 5, 1, 6, 9])
We can also call the sort() method directly on the array, but it changes the original array, as it is an in-place operation.
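For example, calling sort() on a modifies it directly (and returns None):
a.sort()
a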
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
array([[ 1, 5, 3],
[ 2, 5, 7],
[400, 200, 300]])
array([[ 1, 5, 3],
[ 2, 5, 7],
[400, 200, 300]])
array([[ 1, 3, 5],
[ 2, 5, 7],
[200, 300, 400]])
Note: By default, the np.sort() functions sorts along the last axis.
a = np.arange(1, 6)
a
array([1, 2, 3, 4, 5])
a * 5
b = np.arange(6, 11)
b
array([ 6, 7, 8, 9, 10])
a * b
c = np.array([1, 2, 3])
a * c
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-3f6f667472ca> in <cell line: 1>()
----> 1 a * c
ValueError: operands could not be broadcast together with shapes (5,) (3,)
d = np.arange(12).reshape(3, 4)
e = np.arange(13, 25).reshape(3, 4)
print(d)
print(e)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[13 14 15 16]
[17 18 19 20]
[21 22 23 24]]
d * e
d * 5
array([[ 0, 5, 10, 15],
[20, 25, 30, 35],
[40, 45, 50, 55]])
Takeaway:
a = np.arange(1,13).reshape((3,4))
c = np.arange(2,14).reshape((4,3))
a.shape, c.shape
a is of shape (3,4) and c is of shape (4,3). The output will be of shape (3,3).
# Using np.dot
np.dot(a,c)
# Using np.matmul
np.matmul(a,c)
# Using @ operator
a@c
a@5
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-27-16572c98568d> in <cell line: 1>()
----> 1 a@5
ValueError: matmul: Input operand 1 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?
np.matmul(a, 5)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-28-875bf147741b> in <cell line: 1>()
----> 1 np.matmul(a, 5)
ValueError: matmul: Input operand 1 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?
np.dot(a, 5)
Important:
The dot() function supports multiplication of an array by a scalar value, which is not possible with matmul().
Array * Array works with matmul(), but Array * Scalar doesn't.
Vectorization
Vectorization in NumPy refers to performing operations on entire arrays or array elements simultaneously, which is significantly faster and more
efficient than using explicit loops.
a = np.arange(10)
a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Note:
def random_operation(x):
if x % 2 == 0:
x += 2
else:
x -= 2
return x
random_operation(a)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-32-83503709589d> in <cell line: 1>()
----> 1 random_operation(a)
<ipython-input-31-1b21f73a20a9> in random_operation(x)
1 def random_operation(x):
----> 2 if x % 2 == 0:
3 x += 2
4 else:
5 x -= 2
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
cool_operation = np.vectorize(random_operation)
type(cool_operation)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-33-6717d289c693> in <cell line: 1>()
----> 1 type(cool_operation)
np.vectorize()
cool_operation(a)
Broadcasting
Broadcasting in NumPy is the automatic and implicit extension of array dimensions to enable element-wise operations between arrays with
different shapes.
Case 1: If the dimensions of both matrices are equal, element-wise addition is done.
a = np.tile(np.arange(0,40,10), (3,1))
a
Note:
numpy.tile(array, reps) constructs an array by repeating A the number of times given by reps along each dimension.
np.tile(array, (repetition_rows, repetition_cols))
a=a.T
a
array([[ 0, 0, 0],
[10, 10, 10],
[20, 20, 20],
[30, 30, 30]])
b = np.tile(np.arange(0,3), (4,1))
b
array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])
print(a.shape, b.shape)
(4, 3) (4, 3)
Since a and b have the same shape, they can be added without any issues.
a+b
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
Case 2: If the right array is 1-D and has the same number of columns as the left array, it is automatically tiled along the rows.
array([[ 0, 0, 0],
[10, 10, 10],
[20, 20, 20],
[30, 30, 30]])
c = np.array([0,1,2])
c
array([0, 1, 2])
print(a.shape, c.shape)
(4, 3) (3,)
a + c
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
Case 3: If the left array is a column matrix (only 1 column) and the right array is a row matrix, both are tiled so that element-wise addition becomes possible.
d = np.array([0,10,20,30]).reshape(4,1)
d
array([[ 0],
[10],
[20],
[30]])
c = np.array([0,1,2])
c
array([0, 1, 2])
print(d.shape, c.shape)
(4, 1) (3,)
d + c
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
a = np.arange(8).reshape(2,4)
a
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
b = np.arange(16).reshape(4,4)
b
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
a+b
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-49-ca730b97bf8a> in <cell line: 1>()
----> 1 a+b
ValueError: operands could not be broadcast together with shapes (2,4) (4,4)
A = np.arange(1,10).reshape(3,3)
A
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
B = np.array([-1, 0, 1])
B
array([-1, 0, 1])
A*B
array([[-1, 0, 3],
[-4, 0, 6],
[-7, 0, 9]])
A = np.arange(12).reshape(3, 4)
A
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
B = np.array([1, 2, 3])
B
array([1, 2, 3])
A + B
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-55-151064de832d> in <cell line: 1>()
----> 1 A + B
ValueError: operands could not be broadcast together with shapes (3,4) (3,)
Shape of A ⇒ (3,4)
Shape of B ⇒ (3,)
Is broadcasting possible in this case? If yes, what will be the shape of output?
Explanation:
A ⇒ (8, 1, 6, 1)
B ⇒ (1, 7, 1, 5)
Now, as per Rule 2, every dimension with value 1 is stretched to match the corresponding dimension of the other array, so the broadcast result has shape (8, 7, 6, 5).
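We can verify the resulting shape programmatically (a sketch; np.broadcast_shapes is available in recent NumPy versions):
np.broadcast_shapes((8, 1, 6, 1), (1, 7, 1, 5))   # -> (8, 7, 6, 5)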
Content
Shallow vs Deep Copy
view()
copy()
Array Splitting
split()
hsplit()
vsplit()
Array Stacking
hstack()
vstack()
concatenate()
Let's create some arrays to understand what happens while using numpy.
import numpy as np
a = np.arange(4)
a
array([0, 1, 2, 3])
b = a.reshape(2, 2)
b
array([[0, 1],
[2, 3]])
a[0] = 100
a
array([100, 1, 2, 3])
array([[100, 1],
[ 2, 3]])
Now, let's see an example where Numpy will create a Deep Copy of data.
a = np.arange(4)
a
array([0, 1, 2, 3])
# Create `c`
c = a + 2
c
array([2, 3, 4, 5])
a[0] = 100
a
array([100, 1, 2, 3])
array([2, 3, 4, 5])
False
Because c = a + 2 is an arithmetic operation, it makes a more permanent change to the data.
So, Numpy had to create a separate copy for c, i.e. a deep copy of array a for array c.
Conclusion:
Numpy is able to use same data for simpler operations like reshape → Shallow Copy.
It creates a copy of data where operations make more permanent changes to data → Deep Copy.
Is there a way to check whether two arrays are sharing memory or not?
a= np.arange(10)
a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
b = a[::2]
b
array([0, 2, 4, 6, 8])
np.shares_memory(a,b)
True
a[0] = 1000
array([1000, 2, 4, 6, 8])
a = np.arange(6)
a
array([0, 1, 2, 3, 4, 5])
b = a[a % 1 == 0]
b
array([0, 1, 2, 3, 4, 5])
b[0] = 10
a[0]
np.shares_memory(a,b)
False
Note:
a = np.arange(10)
a_shallow_copy = a.view()
# Creates a shallow copy of a
np.shares_memory(a_shallow_copy, a)
True
a_deep_copy = a.copy()
# Creates a deep copy of a
np.shares_memory(a_deep_copy, a)
False
.view()
Documentation: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.view.html
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
view_arr = arr.view()
view_arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Let's modify the content of view_arr and check whether it modified the original array as well.
view_arr[4] = 420
view_arr
arr
np.shares_memory(arr, view_arr)
True
.copy()
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
copy_arr = arr.copy()
copy_arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Let's modify the content of copy_arr and check whether it modified the original array as well.
copy_arr[3] = 45
copy_arr
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.shares_memory(arr, copy_arr)  # note: the copied array is copy_arr
False
Splitting
In addition to reshaping and selecting subarrays, it is often necessary to split arrays into smaller arrays or merge arrays into bigger arrays.
np.split()
If indices_or_sections is an integer, n, the array will be divided into n equal arrays along axis.
If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split.
If an index exceeds the dimension of the array along axis, an empty sub-array is returned correspondingly.
x = np.arange(9)
x
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
np.split(x, 3)
IMPORTANT REQUISITE: when splitting into n parts, the array length must be divisible by n, otherwise np.split() raises an error (see below).
b = np.arange(10)
np.split(b, 3)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-5033f171e13f> in <cell line: 2>()
1 b = np.arange(10)
----> 2 np.split(b, 3)
1 frames
/usr/local/lib/python3.10/dist-packages/numpy/lib/shape_base.py in split(ary, indices_or_sections, axis)
870 N = ary.shape[axis]
871 if N % sections:
--> 872 raise ValueError(
873 'array split does not result in an equal division') from None
874 return array_split(ary, indices_or_sections, axis)
b[0:-1]
np.split(b[0:-1], 3)
c = np.arange(16)
np.split(c, [3, 5, 6])
[array([0, 1, 2]),
array([3, 4]),
array([5]),
array([ 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])]
np.hsplit()
Splits an array into multiple sub-arrays horizontally (column-wise).
x = np.arange(16.0).reshape(4, 4)
x
The split we want happens across the 2nd axis (Horizontal axis)
That is why we use hsplit()
So, try to think in terms of "whether the operation is happening along vertical axis or horizontal axis".
np.hsplit(x, 2)
np.vsplit()
x = np.arange(16.0).reshape(4, 4)
x
The split we want happens across the 1st axis (Vertical axis)
That is why we use vsplit()
Again, always try to think in terms of "whether the operation is happening along vertical axis or horizontal axis".
np.vsplit(x, 2)
np.vsplit(x, np.array([3]))
Stacking
a = np.arange(1, 5)
b = np.arange(2, 6)
c = np.arange(3, 7)
np.vstack()
np.vstack([b, c, a])
array([[2, 3, 4, 5],
[3, 4, 5, 6],
[1, 2, 3, 4]])
a = np.arange(1, 5)
b = np.arange(2, 4)
c = np.arange(3, 10)
np.vstack([b, c, a])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-49-5148cb6ebc5f> in <cell line: 1>()
----> 1 np.vstack([b, c, a])
2 frames
/usr/local/lib/python3.10/dist-packages/numpy/core/overrides.py in concatenate(*args, **kwargs)
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the arra
a = np.arange(5).reshape(5, 1)
a
array([[0],
[1],
[2],
[3],
[4]])
b = np.arange(15).reshape(5, 3)
b
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
np.hstack([a, b])
array([[ 0, 0, 1, 2],
[ 1, 3, 4, 5],
[ 2, 6, 7, 8],
[ 3, 9, 10, 11],
[ 4, 12, 13, 14]])
np.concatenate()
Provides similar functionality, but it takes a keyword argument axis that specifies the axis along which the arrays are to be concatenated.
The input arrays to concatenate() need to have dimensions at least equal to the dimensions of the output array.
a = np.array([1,2,3])
a
array([1, 2, 3])
b = np.array([[1,2,3], [4,5,6]])
b
array([[1, 2, 3],
[4, 5, 6]])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-55-1a93c4fe21df> in <cell line: 1>()
----> 1 np.concatenate([a, b], axis = 0)
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the
concatenate() can only work if both a and b have the same number of dimensions.
a = np.array([[1,2,3]])
b = np.array([[1,2,3], [4,5,6]])
a = np.arange(6).reshape(3, 2)
b = np.arange(9).reshape(3, 3)
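The output below presumably comes from concatenating along the columns, since a is (3, 2) and b is (3, 3):
np.concatenate([a, b], axis=1)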
array([[0, 1, 0, 1, 2],
[2, 3, 3, 4, 5],
[4, 5, 6, 7, 8]])
a = np.array([[1,2], [3,4]])
b = np.array([[5,6,7,8]])
array([1, 2, 3, 4, 5, 6, 7, 8])
array([[1, 2],
[3, 4]])
b = np.array([[5, 6]])
b
array([[5, 6]])
array([[1, 2],
[3, 4],
[5, 6]])
Dimensions of a is 2 ×2
What is the dimensions of b ?
1-D array ?? - NO
Look carefully!!
b is a 2-D array of dimensions 1 ×2
axis = 0 ---> It's a vertical axis
b = np.array([[5, 6]])
b
array([[5, 6]])
array([[1, 2, 5],
[3, 4, 6]])
Dimensions of a is again 2 ×2
Dimensions of b is again 1 ×2
So, Dimensions of b.T will be 2 × 1
01-Pandas-Lecture-McKinsey
Outline
Installation of Pandas
Importing pandas
Importing the dataset
Dataframe/Series
df.info()
df.head()
df.tail()
df.shape
Implicit/Explicit index
df.index
Indexing in Series
Slicing in Series
loc/iloc
import pandas as pd
import numpy as np
The major limitation of numpy is that it can only work with 1 datatype at a time
Like names of places would be string but their population would be int
==> It is difficult to work with data having heterogeneous values using Numpy
country
population size
life expectancy
GDP per Capita
We have to analyse the data and draw inferences meaningful to the company
6 columns
1704 rows
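The dataframe is loaded with read_csv (assuming the same mckinsey.csv file that is used later in this notebook):
df = pd.read_csv('mckinsey.csv')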
type(df)
pandas.core.frame.DataFrame
Now how can we access a column, say country of the dataframe?
df["country"]
0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
...
1699 Zimbabwe
1700 Zimbabwe
1701 Zimbabwe
1702 Zimbabwe
1703 Zimbabwe
Name: country, Length: 1704, dtype: object
As you can see we get all the values in the column country
type(df["country"])
pandas.core.series.Series
How can we find the datatype, name, total entries in each column ?
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 1704 non-null object
1 year 1704 non-null int64
2 population 1704 non-null int64
3 continent 1704 non-null object
4 life_exp 1704 non-null float64
5 gdp_cap 1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
Name/Title of Columns
How many non-null values (i.e., cells that are not blank) each column has
Type of values in each column - int, float, etc.
By default, it shows data-type as object for anything other than int or float - Will come back later
Now what if we want to see the first few rows in the dataset?
df.head()
df.head(20)
df.shape
(1704, 6)
and so on.
But what if our dataset has 20 cols? ... or 100 cols? We can't see their names in one go.
Note:
Here, Index is a pandas class used to store the row/column labels of a Series/DataFrame.
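The column labels themselves can be listed with the columns attribute:
df.columns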
df[['country', 'life_exp']].head()
country life_exp
0 Afghanistan 28.801
1 Afghanistan 30.332
2 Afghanistan 31.997
3 Afghanistan 34.020
4 Afghanistan 36.088
df[['country']].head()
country
0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
Note:
Notice how this output type is different from our earlier output using df['country']
Now that we know how to access columns, lets answer some questions
How can we find the countries that have been surveyed?
We can find the unique vals in the country col
df['country'].unique()
Now what if you also want to check the count of each country in the dataframe?
df['country'].value_counts()
Afghanistan 12
Pakistan 12
New Zealand 12
Nicaragua 12
Niger 12
..
Eritrea 12
Equatorial Guinea 12
El Salvador 12
Egypt 12
Zimbabwe 12
Name: country, Length: 142, dtype: int64
Note:
df.rename(columns={"country":"Country"})
df
We can clearly see that the column names are still the same and have not changed. So the changes don't happen in the original dataframe unless we specify a parameter called inplace.
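For example (a sketch), the rename sticks only when inplace=True is passed (or when the result is assigned back to df):
df.rename(columns={"country": "Country"}, inplace=True)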
Note
df.Country
0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
...
1699 Zimbabwe
1700 Zimbabwe
1701 Zimbabwe
1702 Zimbabwe
1703 Zimbabwe
Name: Country, Length: 1704, dtype: object
What do you think could be the problems with using attribute style for accessing the columns?
Problems arise, for example, when a column name clashes with an existing DataFrame attribute or method (e.g. shape) or contains spaces; in such cases attribute-style access won't work.
An alternative to the above approach is using the "columns" parameter as we did in rename
df.drop(columns=['continent'])
df.head()
df = df.drop('continent', axis=1)
OR
By default, inplace=False
OR
df["year+7"] = df["year"] + 7
df.head()
We can also use values from two columns to form a new column
Which two columns can we use to create a new column gdp?
df['gdp']=df['gdp_cap'] * df['population']
df.head()
Values in this column are product of respective values in gdp_cap and population
How can we create a new column from our own values?
We can create a list
OR
We can create a Pandas Series from a list/numpy array for our new column
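A sketch of both options; the column name and values here are purely illustrative:
df['is_surveyed'] = [True] * len(df)        # from a plain Python list
df['ones'] = pd.Series(np.ones(len(df)))    # from a Pandas Series built on a numpy array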
Now that we know how to create new cols lets see some basic ops on rows
YES
df.index.values
Now to understand string indices, let's take a small subset of our original dataframe
sample = df.head()
sample
What if we want to access any particular row (say first row)?
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
5 Afghanistan
6 Afghanistan
7 Afghanistan
8 Afghanistan
9 Afghanistan
10 Afghanistan
11 Afghanistan
12 Afghanistan
13 Albania
14 Albania
15 Albania
16 Albania
17 Albania
18 Albania
19 Albania
20 Albania
Name: Country, dtype: object
So, how will we then access the thirteenth element (or say, the thirteenth row)?
ser[12]
'Afghanistan'
ser[5:15]
6 Afghanistan
7 Afghanistan
8 Afghanistan
9 Afghanistan
10 Afghanistan
11 Afghanistan
12 Afghanistan
13 Albania
14 Albania
15 Albania
Name: Country, dtype: object
df[0]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in
get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
4 frames
pandas/_libs/hashtable_class_helper.pxi in
pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in
pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError: 0
Notice, that this syntax is exactly same as how we tried accessing a column
df[5:15]
Allows indexing and slicing that always references the explicit index
df.loc[1]
Country Afghanistan
year 1952
population 8425333
life_exp 28.801
gdp_cap 779.445314
Name: 1, dtype: object
df.loc[1:3]
2. iloc
Allows indexing and slicing that always references the implicit Python-style index
df.iloc[1]
Country Afghanistan
year 1957
population 9240934
life_exp 30.332
gdp_cap 820.85303
Name: 2, dtype: object
df.iloc[0:2]
NO
Not just b/w loc and iloc , but in general while working in DS and ML
As we see, We can just pack the indices in [] and pass it in loc or iloc
Content
Working with both rows and columns
Sorting
Concatenation
Merge
Link:https://drive.google.com/file/d/1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_/view?usp=sharing
import pandas as pd
import numpy as np
df = pd.read_csv('mckinsey.csv')
append()
loc/iloc
<ipython-input-4-714c78525e27>:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in
df.append(new_row)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-714c78525e27> in <cell line: 2>()
1 new_row = {'Country': 'India', 'year': 2000,'life_exp':37.08,'population':13500000,'gdp_cap':900.23}
----> 2 df.append(new_row)
1 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in _append(self, other, ignore_index, verify_integrity, sort
9778 if isinstance(other, dict):
9779 if not ignore_index:
-> 9780 raise TypeError("Can only append a dict if ignore_index=True")
9781 other = Series(other)
9782 if other.name is None and not ignore_index:
It's saying the ignore_index parameter needs to be set to True. This parameter tells Pandas to ignore the existing index and create a new one based on the length of the resulting DataFrame.
<ipython-input-5-39ca58b35231>:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in
df = df.append(new_row, ignore_index=True)
country year population continent life_exp gdp_cap Country
It does not change the DataFrame, but returns a new DataFrame with the row appended.
We will need to provide the position at which we will add the new row
df.loc[len(df.index)] = ['India',2000 ,13500000,37.08,900.23] # len(df.index) since we will add at the last row
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-29b4966da254> in <cell line: 1>()
----> 1 df.loc[len(df.index)] = ['India',2000 ,13500000,37.08,900.23] # len(df.index) since we will add at the last row
2 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/indexing.py in _setitem_with_indexer_missing(self, indexer, value)
2158 # must have conforming columns
2159 if len(value) != len(self.obj.columns):
-> 2160 raise ValueError("cannot set a row with mismatched columns")
2161
2162 value = Series(value, index=self.obj.columns, name=indexer)
df
The new row was added but the data has been duplicated
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-c6ff7fd1a207> in <cell line: 1>()
----> 1 df.iloc[len(df.index)-1] = ['India', 2000,13500000,37.08,900.23]
2 df
2 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/indexing.py in _setitem_with_indexer_split_path(self, indexer, value,
1877
1878 else:
-> 1879 raise ValueError(
1880 "Must have equal len keys and value "
1881 "when setting with an iterable"
ValueError: Must have equal len keys and value when setting with an iterable
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-9-1ad11b3daf34> in <cell line: 1>()
----> 1 df.iloc[len(df.index)] = ['India', 2000,13500000,37.08,900.23]
1 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/indexing.py in _has_valid_setitem_indexer(self, indexer)
1516 elif is_integer(i):
1517 if i >= len(ax):
-> 1518 raise IndexError("iloc cannot enlarge its target object")
1519 elif isinstance(i, dict):
1520 raise IndexError("iloc cannot enlarge its target object")
For using iloc to add a row, the dataframe must already have a row in that position.
Please Note:
When using the loc[] attribute, it’s not mandatory that a row already exists with a specific label.
df
Country Afghanistan
year 1972
population 13079460
life_exp 36.088
gdp_cap 739.981106
Name: 4, dtype: object
Country Afghanistan
year 1977
population 14880372
life_exp 38.438
gdp_cap 786.11336
Name: 5, dtype: object
It is because the loc function selects rows using row labels (0, 1, 2, 4, etc.) whereas the iloc function selects rows using their integer positions (starting from 0 and going up by one for each row).
So for iloc, the 5th row starting from index 0 was printed.
df.loc[len(df.index)] = ['India',2000,13500000,37.08,900.23]
df.loc[len(df.index)] = ['Sri Lanka',2022 ,130000000,80.00,500.00]
df.loc[len(df.index)] = ['Sri Lanka',2022 ,130000000,80.00,500.00]
df.loc[len(df.index)] = ['India',2000 ,13500000,80.00,900.23]
df
Now how can we check for duplicate rows?
df.duplicated()
df.drop_duplicates()
But how can we decide among all duplicate rows which ones we want to keep?
first
last
False
If first , this considers first value as unique and rest of the same values as duplicate.
df.drop_duplicates(keep='first')
If last , This considers last value as unique and rest of the same values as duplicate.
df.drop_duplicates(keep='last')
df.drop_duplicates(keep=False)
What if you want to look for duplicates only in a few columns?
We can use the argument subset to mention the list of columns which we want to use.
df.drop_duplicates(subset=['Country'],keep='first')
How can we slice the dataframe into, say, first 4 rows and first 3 columns?
df.iloc[0:4, 0:3]
country year population
Pass in 2 different ranges for slicing - one for row and one for column just like Numpy
df.loc[1:5, 1:4]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-494208dc7680> in <cell line: 1>()
----> 1 df.loc[1:5, 1:4]
8 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in _maybe_cast_slice_bound(self, label, side, kind)
6621 # reject them, if index does not contain label
6622 if (is_float(label) or is_integer(label)) and label not in self:
-> 6623 raise self._invalid_indexer("slice", label)
6624
6625 return label
TypeError: cannot do slice indexing on Index with these indexers [1] of type int
Why doesn't slicing with integer indices work with loc?
Because loc references the explicit labels, and this DataFrame's column labels are strings (like 'country'), not integers, so an integer slice over the columns has no matching labels.
df.loc[1:5, ['country','life_exp']]
country life_exp
1 Afghanistan 30.332
2 Afghanistan 31.997
3 Afghanistan 34.020
4 Afghanistan 36.088
5 Afghanistan 38.438
df.loc[1:5, 'year':'population']
year population
1 1957 9240934
2 1962 10267083
3 1967 11537966
4 1972 13079460
5 1977 14880372
df.iloc[[0,10,100], [0,2,3]]
country population continent
df.iloc[1:10:2]
YES
df.loc[1:10:2]
le = df['life_exp']
le
0 28.801
1 30.332
2 31.997
3 34.020
4 36.088
...
1699 62.351
1700 60.377
1701 46.809
1702 39.989
1703 43.487
Name: life_exp, Length: 1704, dtype: float64
... and so on
Note:
le.sum()
101344.44467999999
le.count()
1704
le.sum() / le.count()
59.474439366197174
Sorting
If you notice, life_exp col is not sorted
df.sort_values(['life_exp'])
df.sort_values(['life_exp'], ascending=False)
country year population continent life_exp gdp_cap
YES
df.sort_values(['year', 'life_exp'])
Then, rows with the same value of 'year' were sorted based on 'life_exp'.
For Example
How can we have different sorting orders for different columns in multi-level sorting?
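We can pass a list of booleans to ascending, one flag per column (a sketch):
df.sort_values(['year', 'life_exp'], ascending=[True, False])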
users --> Stores the user details - IDs and Names of users
userid name
0 1 sharadh
1 2 shahid
2 3 khusalli
msgs --> Stores the messages users have sent - User IDs and messages
userid msg
0 1 hmm
1 1 acha
2 2 theek hai
3 4 nice
pd.concat([users, msgs])
0 1 sharadh NaN
1 2 shahid NaN
2 3 khusalli NaN
0 1 NaN hmm
1 1 NaN acha
3 4 NaN nice
userid , being same in both DataFrames, was combined into a single column
First the rows of the users dataframe were placed, with the values of column msg as NaN.
Then the rows of the msgs dataframe were placed, with the values of column name as NaN.
Now how can we make the indices unique for each row?
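Passing ignore_index=True to concat renumbers the rows, which presumably produced the output below:
pd.concat([users, msgs], ignore_index=True)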
0 1 sharadh NaN
1 2 shahid NaN
2 3 khusalli NaN
3 1 NaN hmm
4 1 NaN acha
6 4 NaN nice
concat
merge
How can we know the name of the person who sent a particular message?
No
users.merge(msgs, on="userid")
0 1 sharadh hmm
1 1 sharadh acha
Inner Join
Now what join do we want to use to get info of all the users and all the messages?
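This calls for an outer join, which keeps rows from both sides (presumably what produced the output below):
users.merge(msgs, on="userid", how="outer")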
0 1 sharadh hmm
1 1 sharadh acha
3 3 khusalli NaN
4 4 NaN nice
Note:
And what if we want the info of all the users in the dataframe?
users.merge(msgs, on = "userid",how="left")
0 1 sharadh hmm
1 1 sharadh acha
3 3 khusalli NaN
Similarly, what if we want all the messages and info only for the users who sent a message?
users.merge(msgs, on = "userid", how="right")
0 1 sharadh hmm
1 1 sharadh acha
3 4 NaN nice
Note,
But sometimes the column names might be different even if they contain the same data
id name
0 1 sharadh
1 2 shahid
2 3 khusalli
Now, how can we merge the 2 dataframes when the key has a different name?
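We can name the key on each side with left_on and right_on (a sketch matching the output below):
users.merge(msgs, left_on="id", right_on="userid")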
0 1 sharadh 1 hmm
1 1 sharadh 1 acha
Here,
Movies
Rating
Director
Popularity
Revenue & Budget
File1: https://drive.google.com/file/d/1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd/view?usp=sharing
File2: https://drive.google.com/file/d/1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm/view?usp=sharing
import pandas as pd
import numpy as np
!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: /content/movies.csv
100% 112k/112k [00:00<00:00, 57.4MB/s]
!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
Downloading...
From: https://drive.google.com/uc?id=1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
To: /content/directors.csv
100% 65.4k/65.4k [00:00<00:00, 62.8MB/s]
movies.csv
directors.csv
movies = pd.read_csv('movies.csv')
movies.head()
(Output: the first five rows of movies, with columns Unnamed: 0, id, budget, popularity, revenue, title, vote_average, vote_count, director_id, year, month and day.)
Notice, there's a column Unnamed: 0 which represents nothing but the index of a row.
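One way to get rid of it (a sketch) is to drop the column, or to re-read the file with index_col=0:
movies = movies.drop(columns=['Unnamed: 0'])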
(Output: the same preview with Unnamed: 0 removed, i.e. columns id, budget, popularity, revenue, title, vote_average, vote_count, director_id, year, month, day.)
movies.shape
(1465, 11)
Pandas - 3
Content
Apply()
Grouping
groupby()
import pandas as pd
import numpy as np
!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: /content/movies.csv
100% 112k/112k [00:00<00:00, 77.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
To: /content/directors.csv
100% 65.4k/65.4k [00:00<00:00, 73.5MB/s]
Movie dataset contains info about movies, release, popularity, ratings and the director ID
Director dataset contains detailed info about the director
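Both files are read into dataframes first (the file names come from the downloads above):
movies = pd.read_csv('movies.csv')
directors = pd.read_csv('directors.csv')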
We will use the ID columns (representing unique director) in both the datasets
If you observe, director_id in movies and id in directors refer to the same director.
Thus we can merge our dataframes using these two columns as keys.
Before that, lets first check number of unique director values in our movies data
movies['director_id'].nunique()
199
Recall,
directors['id'].nunique()
2349
Summary:
Now, how can we check if all director_id values are present in id?
movies['director_id'].isin(directors['id'])
0 True
1 True
2 True
3 True
5 True
...
4736 True
4743 True
4748 True
4749 True
4768 True
Name: director_id, Length: 1465, dtype: bool
The isin() method checks if the Dataframe column contains the specified value(s).
If you notice, every value above is True. Let's confirm it holds for all rows:
np.all(movies['director_id'].isin(directors['id']))
True
YES
NO
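The merged dataframe shown below is presumably built with something like the following; the default suffixes turn the two id columns into id_x and id_y:
data = movies.merge(directors, left_on='director_id', right_on='id')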
(Output: the merged dataframe, i.e. all the movies columns plus director_name and gender from directors; the two id columns appear as id_x and id_y.)
data.drop(['director_id','id_y'],axis=1,inplace=True)
data.head()
(Output: data.head() after dropping director_id and id_y, showing the movie details plus director_name and gender.)
Apply
Apply a function along an axis of the DataFrame or Series
Basically, we want to encode the gender column as:
0 for Male
1 for Female
How can we encode the column?
def encode(data):
if data == "Male":
return 0
else:
return 1
Now how can we apply this function to the whole column?
data['gender'] = data['gender'].apply(encode)
data
(Output: data with the gender column now encoded as 0/1.)
data[['revenue', 'budget']].apply(np.sum)
revenue 209866997305
budget 70353617179
dtype: int64
But there's a mistake here. We wanted our results per movie (per row)
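Passing axis=1 applies the function across the columns of each row, which gives the per-movie sums below:
data[['revenue', 'budget']].apply(np.sum, axis=1)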
0 3024965087
1 1261000000
2 1125674609
3 1334939099
4 1148871626
...
1460 321952
1461 3178130
1462 0
1463 0
1464 2260920
Length: 1465, dtype: int64
By default axis = 0
=> apply() can be applied on any dataframe along any particular axis
(Output: a preview of the data dataframe.)
Grouping
What is Grouping ?
data.groupby('director_name')
Notice,
data.groupby('director_name').ngroups
199
Based on this grouping, we can find which rows (index labels) belong to which group.
data.groupby('director_name').groups
{'Adam McKay': [176, 323, 366, 505, 839, 916], 'Adam Shankman': [265, 300, 350, 404, 458, 843, 999, 1231], 'Alejandro
González Iñárritu': [106, 749, 1015, 1034, 1077, 1405], 'Alex Proyas': [95, 159, 514, 671, 873], 'Alexander Payne':
[793, 1006, 1101, 1211, 1281], 'Andrew Adamson': [11, 43, 328, 501, 947], 'Andrew Niccol': [533, 603, 701, 722, 1439],
'Andrzej Bartkowiak': [349, 549, 754, 911, 924], 'Andy Fickman': [517, 681, 909, 926, 973, 1023], 'Andy Tennant': [314,
320, 464, 593, 676, 885], 'Ang Lee': [99, 134, 748, 840, 1089, 1110, 1132, 1184], 'Anne Fletcher': [610, 650, 736, 789,
1206], 'Antoine Fuqua': [310, 338, 424, 467, 576, 808, 818, 1105], 'Atom Egoyan': [946, 1128, 1164, 1194, 1347, 1416],
'Barry Levinson': [313, 319, 471, 594, 878, 898, 1013, 1037, 1082, 1143, 1185, 1345, 1378], 'Barry Sonnenfeld': [13,
48, 90, 205, 591, 778, 783], 'Ben Stiller': [209, 212, 547, 562, 850], 'Bill Condon': [102, 307, 902, 1233, 1381],
'Bobby Farrelly': [352, 356, 481, 498, 624, 630, 654, 806, 928, 972, 1111], 'Brad Anderson': [1163, 1197, 1350, 1419,
1430], 'Brett Ratner': [24, 39, 188, 207, 238, 292, 405, 456, 920], 'Brian De Palma': [228, 255, 318, 439, 747, 905,
919, 1088, 1232, 1261, 1317, 1354], 'Brian Helgeland': [512, 607, 623, 742, 933], 'Brian Levant': [418, 449, 568, 761,
860, 1003], 'Brian Robbins': [416, 441, 669, 962, 988, 1115], 'Bryan Singer': [6, 32, 33, 44, 122, 216, 297, 1326],
'Cameron Crowe': [335, 434, 488, 503, 513, 698], 'Catherine Hardwicke': [602, 695, 724, 937, 1406, 1412], 'Chris
Columbus': [117, 167, 204, 218, 229, 509, 656, 897, 996, 1086, 1129], 'Chris Weitz': [17, 500, 794, 869, 1202, 1267],
'Christopher Nolan': [3, 45, 58, 59, 74, 565, 641, 1341], 'Chuck Russell': [177, 410, 657, 1069, 1097, 1339], 'Clint
Eastwood': [369, 426, 447, 482, 490, 520, 530, 535, 645, 727, 731, 786, 787, 899, 974, 986, 1167, 1190, 1313], 'Curtis
Hanson': [494, 579, 606, 711, 733, 1057, 1310], 'Danny Boyle': [527, 668, 1083, 1085, 1126, 1168, 1287, 1385], 'Darren
Aronofsky': [113, 751, 1187, 1328, 1363, 1458], 'Darren Lynn Bousman': [1241, 1243, 1283, 1338, 1440], 'David Ayer':
[50, 273, 741, 1024, 1146, 1407], 'David Cronenberg': [541, 767, 994, 1055, 1254, 1268, 1334], 'David Fincher': [62,
213, 253, 383, 398, 478, 522, 555, 618, 785], 'David Gordon Green': [543, 862, 884, 927, 1376, 1418, 1432, 1459],
'David Koepp': [443, 644, 735, 1041, 1209], 'David Lynch': [583, 1161, 1264, 1340, 1456], 'David O. Russell': [422,
556, 609, 896, 982, 989, 1229, 1304], 'David R. Ellis': [582, 634, 756, 888, 934], 'David Zucker': [569, 619, 965,
1052, 1175], 'Dennis Dugan': [217, 260, 267, 293, 303, 718, 780, 977, 1247], 'Donald Petrie': [427, 507, 570, 649, 858,
894, 1106, 1331], 'Doug Liman': [52, 148, 251, 399, 544, 1318, 1451], 'Edward Zwick': [92, 182, 346, 566, 791, 819,
825], 'F. Gary Gray': [308, 402, 491, 523, 697, 833, 1272, 1380], 'Francis Ford Coppola': [487, 559, 622, 646, 772,
1076, 1155, 1253, 1312], 'Francis Lawrence': [63, 72, 109, 120, 679], 'Frank Coraci': [157, 249, 275, 451, 577, 599,
963], 'Frank Oz': [193, 355, 473, 580, 712, 813, 987], 'Garry Marshall': [329, 496, 528, 571, 784, 893, 1029, 1169],
'Gary Fleder': [518, 667, 689, 867, 981, 1165], 'Gary Winick': [258, 797, 798, 804, 1454], 'Gavin O'Connor': [820, 841,
939, 953, 1444], 'George A. Romero': [250, 1066, 1096, 1278, 1367, 1396], 'George Clooney': [343, 450, 831, 966, 1302],
'George Miller': [78, 103, 233, 287, 1250, 1403, 1450], 'Gore Verbinski': [1, 8, 9, 107, 119, 633, 1040], 'Guillermo
del Toro': [35, 252, 419, 486, 1118], 'Gus Van Sant': [595, 1018, 1027, 1159, 1240, 1311, 1398], 'Guy Ritchie': [124,
215, 312, 1093, 1225, 1269, 1420], 'Harold Ramis': [425, 431, 558, 586, 788, 1137, 1166, 1325], 'Ivan Reitman': [274,
643, 816, 883, 910, 935, 1134, 1242], 'James Cameron': [0, 19, 170, 173, 344, 1100, 1320], 'James Ivory': [1125, 1152,
1180, 1291, 1293, 1390, 1397], 'James Mangold': [140, 141, 557, 560, 829, 845, 958, 1145], 'James Wan': [30, 617, 1002,
1047, 1337, 1417, 1424], 'Jan de Bont': [155, 224, 231, 270, 781], 'Jason Friedberg': [812, 1010, 1012, 1014, 1036],
'Jason Reitman': [792, 1092, 1213, 1295, 1299], 'Jaume Collet-Serra': [516, 540, 640, 725, 1011, 1189], 'Jay Roach':
[195, 359, 389, 397, 461, 703, 859, 1072], 'Jean-Pierre Jeunet': [423, 485, 605, 664, 765], 'Joe Dante': [284, 525,
638, 1226, 1298, 1428], 'Joe Wright': [85, 432, 553, 803, 814, 855], 'Joel Coen': [428, 670, 691, 707, 721, 889, 906,
980, 1157, 1238, 1305], 'Joel Schumacher': [128, 184, 348, 484, 572, 614, 652, 764, 876, 886, 1108, 1230, 1280], 'John
Carpenter': [537, 663, 686, 861, 938, 1028, 1080, 1102, 1329, 1371], 'John Glen': [601, 642, 801, 847, 864], 'John
Landis': [524, 868, 1276, 1384, 1435], 'John Madden': [457, 882, 1020, 1249, 1257], 'John McTiernan': [127, 214, 244,
351, 534, 563, 648, 782, 838, 1074], 'John Singleton': [294, 489, 732, 796, 1120, 1173, 1316], 'John Whitesell': [499,
632, 763, 1119, 1148], 'John Woo': [131, 142, 264, 371, 420, 675, 1182], 'Jon Favreau': [46, 54, 55, 382, 759, 1346],
'Jon M. Chu': [100, 225, 810, 1099, 1186], 'Jon Turteltaub': [64, 180, 372, 480, 760, 846, 1171], 'Jonathan Demme':
[277, 493, 1000, 1123, 1215], 'Jonathan Liebesman': [81, 143, 339, 1117, 1301], 'Judd Apatow': [321, 710, 717, 865,
881], 'Justin Lin': [38, 123, 246, 1437, 1447], 'Kenneth Branagh': [80, 197, 421, 879, 1094, 1277, 1288], 'Kenny
Ortega': [412, 852, 1228, 1315, 1365], 'Kevin Reynolds': [53, 502, 639, 1019, 1059], ...}
keyboard_arrow_down Now what if we want to extract the data of a particular group from this output?
data.groupby('director_name').get_group('Alexander Payne')
id_x budget popularity revenue title vote_average vote_count year month day director_name ...
793 45163 30000000 19 105834556 About Schmidt 6.7 362 2002 Dec Friday Alexander Payne
1006 45699 20000000 40 177243185 The Descendants 6.7 934 2011 Sep Friday Alexander Payne
1101 46004 16000000 23 109502303 Sideways 6.9 478 2004 Oct Friday Alexander Payne
1211 46446 12000000 29 17654912 Nebraska 7.4 636 2013 Sep Saturday Alexander Payne
1281 46813 0 13 0 Election 6.7 270 1999 Apr Friday Alexander Payne
data.groupby('director_name')['title'].count()
director_name
Adam McKay 6
Adam Shankman 8
Alejandro González Iñárritu 6
Alex Proyas 5
Alexander Payne 5
..
Wes Craven 10
Wolfgang Petersen 7
Woody Allen 18
Zack Snyder 7
Zhang Yimou 6
Name: title, Length: 199, dtype: int64
Finding the very first year and the latest year a director released a movie, i.e. the min and max of the year column, grouped by director
data.groupby(['director_name'])["year"].aggregate(['min', 'max'])
# note: can also use .agg instead of .aggregate (both are same)
min max
director_name
Let's assume:
high budget director -> any director with at least one movie with budget > 100M
data_dir_budget = data.groupby("director_name")["budget"].max().reset_index()
data_dir_budget.head()
director_name budget
2. We can then filter out the director names whose max budget is > 100M (a sketch of this step follows below)
3. Finally, we can filter out the details of the movies by these directors
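The cell defining names for step 2 isn't shown above; a minimal sketch (the variable name names is taken from the code in the next cell):
names = data_dir_budget.loc[data_dir_budget['budget'] > 100000000, 'director_name']  # directors whose costliest movie crossed 100M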
data.loc[data['director_name'].isin(names)]
id_x budget popularity revenue title vote_average vote_count year month day director_name ...
0 43597 237000000 150 2787965087 Avatar 7.2 11800 2009 Dec Thursday James Cameron
1 43598 300000000 139 961000000 Pirates of the Caribbean: At World's End 6.9 4500 2007 May Saturday Gore Verbinski
2 43599 245000000 107 880674609 Spectre 6.3 4466 2015 Oct Monday Sam Mendes
3 43600 250000000 112 1084939099 The Dark Knight Rises 7.6 9106 2012 Jul Monday Christopher Nolan
4 43602 258000000 115 890871626 Spider-Man 3 5.9 3576 2007 May Tuesday Sam Raimi
... ... ... ... ... ... ... ... ... ... ... ...
1460 48363 0 3 321952 The Last Waltz 7.9 64 1978 May Monday Martin Scorsese
1461 48370 27000 19 3151130 Clerks 7.4 755 1994 Sep Tuesday Kevin Smith
1462 48375 0 7 0 Rampage 6.0 131 2009 Aug Friday Uwe Boll
1464 48395 220000 14 2040920 El Mariachi 6.6 238 1992 Sep Friday Robert Rodriguez
NOTE
We can subtract the average revenue of a director from budget col, for each director
def func(x):
    # a boolean returning function for whether the movie is risky or not
    x["risky"] = x["budget"] - x["revenue"].mean() >= 0
    return x
# setting group_keys=True, keeps the group key in the returned dataset (will be default in future version of pandas)
# keep it False if want the normal behaviour
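The apply call that produces data_risky isn't shown above; a minimal sketch, assuming group_keys=False as discussed in the comments:
data_risky = data.groupby('director_name', group_keys=False).apply(func)  # adds the boolean 'risky' column per director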
data_risky
id_x budget popularity revenue title vote_average vote_count year month day director_name ...
0 43597 237000000 150 2787965087 Avatar 7.2 11800 2009 Dec Thursday James Cameron
1 43598 300000000 139 961000000 Pirates of the Caribbean: At World's End 6.9 4500 2007 May Saturday Gore Verbinski
2 43599 245000000 107 880674609 Spectre 6.3 4466 2015 Oct Monday Sam Mendes
3 43600 250000000 112 1084939099 The Dark Knight Rises 7.6 9106 2012 Jul Monday Christopher Nolan
4 43602 258000000 115 890871626 Spider-Man 3 5.9 3576 2007 May Tuesday Sam Raimi
... ... ... ... ... ... ... ... ... ... ... ...
1460 48363 0 3 321952 The Last Waltz 7.9 64 1978 May Monday Martin Scorsese
1461 48370 27000 19 3151130 Clerks 7.4 755 1994 Sep Tuesday Kevin Smith
1462 48375 0 7 0 Rampage 6.0 131 2009 Aug Friday Uwe Boll
1464 48395 220000 14 2040920 El Mariachi 6.6 238 1992 Sep Friday Robert Rodriguez
data_risky.loc[data_risky["risky"]]
id_x budget popularity revenue title vote_average vote_count year month day director_name ...
7 43608 200000000 107 586090727 Quantum of Solace 6.1 2965 2008 Oct Thursday Marc Forster
12 43614 380000000 135 1045713802 Pirates of the Caribbean: On Stranger Tides 6.4 4948 2011 May Saturday Rob Marshall
15 43618 200000000 37 310669540 Robin Hood 6.2 1398 2010 May Wednesday Ridley Scott
20 43624 209000000 64 303025485 Battleship 5.5 2114 2012 Apr Wednesday Peter Berg
24 43630 210000000 3 459359555 X-Men: The Last Stand 6.3 3525 2006 May Wednesday Brett Ratner
keyboard_arrow_down Pandas - 4
Content
Multi-indexing
Restructuring data
pd.melt()
pd.pivot()
pd.pivot_table()
pd.cut()
!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: /content/movies.csv
100% 112k/112k [00:00<00:00, 15.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
To: /content/directors.csv
100% 65.4k/65.4k [00:00<00:00, 86.7MB/s]
keyboard_arrow_down Multi-Indexing
keyboard_arrow_down Task: Which director would be considered the most productive?
To simplify,
data.groupby(['director_name'])['title'].count().sort_values(ascending=False)
director_name
Steven Spielberg 26
Clint Eastwood 19
Martin Scorsese 19
Woody Allen 18
Robert Rodriguez 16
..
Paul Weitz 5
John Madden 5
Paul Verhoeven 5
John Whitesell 5
Kevin Reynolds 5
Name: title, Length: 199, dtype: int64
Steven Spielberg has directed the maximum number of movies. But does that make Steven the most productive director?
Chances are, he might have been active for more years than the other directors
year title
director_name
Notice,
keyboard_arrow_down What would happen if we print the col year of this multi-index dataframe?
data_agg["year"]
min max
director_name
keyboard_arrow_down How can we convert multi-level back to only one level of columns?
data_agg = data.groupby(['director_name'])[["year", "title"]].aggregate(
    {"year": ['min', 'max'], "title": "count"})  # the column names are not aligned properly (still multi-level)
director_name
data.groupby('director_name')[['year', 'title']].aggregate(
    year_max=('year', 'max'),
    year_min=('year', 'min'),
    title_count=('title', 'count')
)
year_max year_min title_count
director_name
Columns look good, but we may want to turn the row labels back into a proper column as well
data_agg.reset_index()
keyboard_arrow_down Using the new features, can we find the most productive director?
First calculate how many years the director has been active.
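The cells computing the two new columns aren't shown above; a minimal sketch using the column names that appear in the output below (whether to add 1 to the active span is an assumption):
data_agg['yrs_active'] = data_agg['year_max'] - data_agg['year_min'] + 1    # years the director has been active
data_agg['movie_per_yr'] = data_agg['title_count'] / data_agg['yrs_active']  # movies released per active year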
data_agg.sort_values("movie_per_yr", ascending=False)
year_min year_max title_count yrs_active movie_per_yr
director_name
Conclusion:
Link: https://drive.google.com/file/d/173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ/view?usp=sharing
!gdown 173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
Downloading...
From: https://drive.google.com/uc?id=173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
To: /content/Pfizer_1.csv
100% 1.51k/1.51k [00:00<00:00, 8.41MB/s]
Temperature (K) and Pressure (P)
are recorded at an interval of 1 hour every day to monitor drug stability in a drug development test
==> These data points are then used to identify the optimal set of parameter values for the stability of the drugs
data = pd.read_csv('Pfizer_1.csv')
data
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00
0 15-10-2020 diltiazem hydrochloride Temperature 23.0 22.0 NaN 21.0 21.0 22 23.0 21.0 22.0
1 15-10-2020 diltiazem hydrochloride Pressure 12.0 13.0 NaN 11.0 13.0 14 16.0 16.0 24.0
2 15-10-2020 docetaxel injection Temperature NaN 17.0 18.0 NaN 17.0 18 NaN NaN 23.0
3 15-10-2020 docetaxel injection Pressure NaN 22.0 22.0 NaN 22.0 23 NaN NaN 27.0
4 15-10-2020 ketamine hydrochloride Temperature 24.0 NaN NaN 27.0 NaN 26 25.0 24.0 23.0
5 15-10-2020 ketamine hydrochloride Pressure 8.0 NaN NaN 7.0 NaN 9 10.0 11.0 10.0
6 16-10-2020 diltiazem hydrochloride Temperature 34.0 35.0 36.0 36.0 37.0 38 37.0 38.0 39.0
7 16-10-2020 diltiazem hydrochloride Pressure 18.0 19.0 20.0 21.0 22.0 23 24.0 25.0 25.0
8 16-10-2020 docetaxel injection Temperature 46.0 47.0 NaN 48.0 48.0 49 50.0 52.0 55.0
9 16-10-2020 docetaxel injection Pressure 23.0 24.0 NaN 25.0 26.0 27 28.0 29.0 28.0
10 16-10-2020 ketamine hydrochloride Temperature 8.0 9.0 10.0 NaN 11.0 12 12.0 11.0 NaN
11 16-10-2020 ketamine hydrochloride Pressure 12.0 12.0 13.0 NaN 15.0 15 15.0 15.0 NaN
12 17-10-2020 diltiazem hydrochloride Temperature 20.0 19.0 19.0 18.0 17.0 16 15.0 NaN 13.0
13 17-10-2020 diltiazem hydrochloride Pressure 3.0 4.0 4.0 4.0 6.0 8 9.0 NaN 9.0
14 17-10-2020 docetaxel injection Temperature 12.0 13.0 14.0 15.0 16.0 17 18.0 19.0 20.0
15 17-10-2020 docetaxel injection Pressure 20.0 22.0 22.0 22.0 22.0 23 25.0 26.0 27.0
16 17-10-2020 ketamine hydrochloride Temperature 13.0 14.0 15.0 16.0 17.0 18 19.0 20.0 21.0
17 17-10-2020 ketamine hydrochloride Pressure 8.0 9.0 10.0 11.0 11.0 12 12.0 11.0 12.0
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 18 non-null object
1 Drug_Name 18 non-null object
2 Parameter 18 non-null object
3 1:30:00 16 non-null float64
4 2:30:00 16 non-null float64
5 3:30:00 12 non-null float64
6 4:30:00 14 non-null float64
7 5:30:00 16 non-null float64
8 6:30:00 18 non-null int64
9 7:30:00 16 non-null float64
10 8:30:00 14 non-null float64
11 9:30:00 16 non-null float64
12 10:30:00 18 non-null int64
13 11:30:00 16 non-null float64
14 12:30:00 18 non-null int64
dtypes: float64(9), int64(3), object(3)
memory usage: 2.2+ KB
Maybe we can have a column for time , with timestamps as the column value
We can similarly create one column containing the values of these parameters
==> "Melt" timestamp columns into two columns - timestamp and corresponding values
How can we restructure our data into having every row corresponding to a single reading?
pd.melt(data, id_vars=['Date', 'Parameter', 'Drug_Name'])
keyboard_arrow_down How can we rename the columns "variable" and "value" as per our original dataframe?
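We can pass var_name and value_name to pd.melt, as is also done in the recap cell further below:
data_melt = pd.melt(data, id_vars=['Date', 'Drug_Name', 'Parameter'],
                    var_name='time',
                    value_name='reading')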
data_melt
Conclusion
The labels of the timestamp columns are conveniently melted into a single column - time (named variable by default)
The labels of columns such as 1:30:00, 2:30:00 have now become the categories of this column
The values from the columns we melted are stored in the reading column (named value by default)
keyboard_arrow_down Pivot
Now suppose we want to convert our data back to wide format
The reason could be to maintain the structure for storing or some other purpose.
Notice:
keyboard_arrow_down How can we restructure our data back to the original wide format, before it was melted?
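The pivot producing the table below is the same call shown a little later, just without the reset_index():
data_melt.pivot(index=['Date', 'Drug_Name', 'Parameter'],
                columns='time',
                values='reading')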
time 10:30:00 11:30:00 12:30:00 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:0
15-10-2020 diltiazem hydrochloride Pressure 18.0 19.0 20.0 12.0 13.0 NaN 11.0 13.0 14
docetaxel injection Pressure 26.0 29.0 28.0 NaN 22.0 22.0 NaN 22.0 23
ketamine hydrochloride Pressure 9.0 9.0 11.0 8.0 NaN NaN 7.0 NaN 9
16-10-2020 diltiazem hydrochloride Pressure 24.0 NaN 27.0 18.0 19.0 20.0 21.0 22.0 23
docetaxel injection Pressure 28.0 29.0 30.0 23.0 24.0 NaN 25.0 26.0 27
ketamine hydrochloride Pressure 16.0 17.0 18.0 12.0 12.0 13.0 NaN 15.0 15
17-10-2020 diltiazem hydrochloride Pressure 11.0 13.0 14.0 3.0 4.0 4.0 4.0 6.0 8
docetaxel injection Pressure 28.0 29.0 28.0 20.0 22.0 22.0 22.0 22.0 23
ketamine hydrochloride Pressure 13.0 14.0 15.0 8.0 9.0 10.0 11.0 11.0 12
Notice,
We are getting a multi-level index here, but we can get a single index again using reset_index
data_melt.pivot(index=['Date','Drug_Name','Parameter'],
columns = 'time',
values='reading').reset_index()
time Date Drug_Name Parameter 10:30:00 11:30:00 12:30:00 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:
0 15-10-2020 diltiazem hydrochloride Pressure 18.0 19.0 20.0 12.0 13.0 NaN 11.0 13.0
1 15-10-2020 diltiazem hydrochloride Temperature 20.0 20.0 21.0 23.0 22.0 NaN 21.0 21.0
2 15-10-2020 docetaxel injection Pressure 26.0 29.0 28.0 NaN 22.0 22.0 NaN 22.0
3 15-10-2020 docetaxel injection Temperature 23.0 25.0 25.0 NaN 17.0 18.0 NaN 17.0
4 15-10-2020 ketamine hydrochloride Pressure 9.0 9.0 11.0 8.0 NaN NaN 7.0 NaN
5 15-10-2020 ketamine hydrochloride Temperature 22.0 21.0 20.0 24.0 NaN NaN 27.0 NaN
6 16-10-2020 diltiazem hydrochloride Pressure 24.0 NaN 27.0 18.0 19.0 20.0 21.0 22.0
7 16-10-2020 diltiazem hydrochloride Temperature 40.0 NaN 42.0 34.0 35.0 36.0 36.0 37.0
8 16-10-2020 docetaxel injection Pressure 28.0 29.0 30.0 23.0 24.0 NaN 25.0 26.0
9 16-10-2020 docetaxel injection Temperature 56.0 57.0 58.0 46.0 47.0 NaN 48.0 48.0
10 16-10-2020 ketamine hydrochloride Pressure 16.0 17.0 18.0 12.0 12.0 13.0 NaN 15.0
11 16-10-2020 ketamine hydrochloride Temperature 13.0 14.0 15.0 8.0 9.0 10.0 NaN 11.0
12 17-10-2020 diltiazem hydrochloride Pressure 11.0 13.0 14.0 3.0 4.0 4.0 4.0 6.0
13 17-10-2020 diltiazem hydrochloride Temperature 14.0 11.0 10.0 20.0 19.0 19.0 18.0 17.0
14 17-10-2020 docetaxel injection Pressure 28.0 29.0 28.0 20.0 22.0 22.0 22.0 22.0
15 17-10-2020 docetaxel injection Temperature 21.0 22.0 23.0 12.0 13.0 14.0 15.0 16.0
16 17-10-2020 ketamine hydrochloride Pressure 13.0 14.0 15.0 8.0 9.0 10.0 11.0 11.0
17 17-10-2020 ketamine hydrochloride Temperature 22.0 23.0 24.0 13.0 14.0 15.0 16.0 17.0
data_melt.head()
keyboard_arrow_down Can we further restructure our data into dividing the Parameter column into T/P?
A format like:
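This is what pivoting data_melt on the Parameter column produces; the call (repeated in the recap of the next section) is:
data_tidy = data_melt.pivot(index=['Date', 'time', 'Drug_Name'],
                            columns='Parameter',
                            values='reading')
data_tidy = data_tidy.reset_index()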
data_tidy
Parameter Pressure Temperature
data_tidy.columns.name = None
data_tidy.head()
keyboard_arrow_down Can we use pivot to find the day-wise mean value of temperature for each drug?
data_tidy.pivot(index=['Drug_Name'],
columns = 'Date',
values=['Temperature'])
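Note that .pivot cannot aggregate: since each (Drug_Name, Date) pair has several readings (one per time), the call above would complain about duplicate entries. The intended tool for this question is pd.pivot_table, which aggregates the duplicates; a minimal sketch:
data_tidy.pivot_table(index='Drug_Name',
                      columns='Date',
                      values='Temperature',
                      aggfunc='mean')  # day-wise mean temperature per drug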
keyboard_arrow_down Pandas - 5
Content
!gdown 173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
data = pd.read_csv('Pfizer_1.csv')
data_melt = pd.melt(data,id_vars = ['Date', 'Drug_Name', 'Parameter'],
var_name = "time",
value_name = 'reading')
data_tidy = data_melt.pivot(index=['Date','time', 'Drug_Name'],
columns = 'Parameter',
values='reading')
data_tidy = data_tidy.reset_index()
data_tidy.columns.name = None
Downloading...
From: https://drive.google.com/uc?id=173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
To: /content/Pfizer_1.csv
100% 1.51k/1.51k [00:00<00:00, 6.05MB/s]
data.head()
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00
0 15-10-2020 diltiazem hydrochloride Temperature 23.0 22.0 NaN 21.0 21.0 22 23.0 21.0 22.0
1 15-10-2020 diltiazem hydrochloride Pressure 12.0 13.0 NaN 11.0 13.0 14 16.0 16.0 24.0
2 15-10-2020 docetaxel injection Temperature NaN 17.0 18.0 NaN 17.0 18 NaN NaN 23.0
3 15-10-2020 docetaxel injection Pressure NaN 22.0 22.0 NaN 22.0 23 NaN NaN 27.0
4 15-10-2020 ketamine hydrochloride Temperature 24.0 NaN NaN 27.0 NaN 26 25.0 24.0 23.0
data_melt.head()
data_tidy.head()
data.head()
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00
0 15-10-2020 diltiazem hydrochloride Temperature 23.0 22.0 NaN 21.0 21.0 22 23.0 21.0 22.0
1 15-10-2020 diltiazem hydrochloride Pressure 12.0 13.0 NaN 11.0 13.0 14 16.0 16.0 24.0
2 15-10-2020 docetaxel injection Temperature NaN 17.0 18.0 NaN 17.0 18 NaN NaN 23.0
3 15-10-2020 docetaxel injection Pressure NaN 22.0 22.0 NaN 22.0 23 NaN NaN 27.0
4 15-10-2020 ketamine hydrochloride Temperature 24.0 NaN NaN 27.0 NaN 26 25.0 24.0 23.0
1. None
2. NaN (short for Not a Number)
type(None)
NoneType
type(np.nan)
float
E.g.-strings
Note:
Pandas uses these values nearly interchangeably, converting between them where appropriate, based on column datatype
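The series shown below can be reproduced with a small sketch (the exact values used in the original cells are an assumption):
pd.Series([1, None, 2, np.nan])                # numeric data: None is converted to NaN, dtype float64
pd.Series([1, np.nan, 2, None], dtype=object)  # object data: None is preserved as-is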
0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64
0 1
1 np.nan
2 2
3 None
dtype: object
0 1
1 np.nan
2 2
3 NaN
dtype: object
For object type, the None is preserved and not changed to NaN
keyboard_arrow_down How to know the count of missing values for each row/column?
data.isna().head()
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00 10:30:00 11
0 False False False False False True False False False False False False False
1 False False False False False True False False False False False False False
2 False False False True False False True False False True True False False
3 False False False True False False True False False True True False False
4 False False False False True True False True False False False False False
data.isnull().head()
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00 10:30:00 11
0 False False False False False True False False False False False False False
1 False False False False False True False False False False False False False
2 False False False True False False True False False True True False False
3 False False False True False False True False False True True False False
4 False False False False True True False True False False False False False
keyboard_arrow_down But, why do we have two methods, "isna" and "isnull", for the same operation?
pd.isnull is simply an alias of pd.isna - both do exactly the same thing; the two names exist for historical reasons (isnull mirrors the naming used in R dataframes)
pd.isna
data.isna().sum()
Date 0
Drug_Name 0
Parameter 0
1:30:00 2
2:30:00 2
3:30:00 6
4:30:00 4
5:30:00 2
6:30:00 0
7:30:00 2
8:30:00 4
9:30:00 2
10:30:00 0
11:30:00 2
12:30:00 0
dtype: int64
keyboard_arrow_down Can we also get the number of missing values in each row?
data.isna().sum(axis=1)
0 1
1 1
2 4
3 4
4 3
5 3
6 1
7 1
8 1
9 1
10 2
11 2
12 1
13 1
14 0
15 0
16 0
17 0
dtype: int64
Note:
keyboard_arrow_down We have identified the null count, but how do we deal with them?
data.dropna()
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00
14 17-10-2020 docetaxel injection Temperature 12.0 13.0 14.0 15.0 16.0 17 18.0 19.0 20.0
15 17-10-2020 docetaxel injection Pressure 20.0 22.0 22.0 22.0 22.0 23 25.0 26.0 27.0
16 17-10-2020 ketamine hydrochloride Temperature 13.0 14.0 15.0 16.0 17.0 18 19.0 20.0 21.0
17 17-10-2020 ketamine hydrochloride Pressure 8.0 9.0 10.0 11.0 11.0 12 12.0 11.0 12.0
data.dropna(axis=1)
Date Drug_Name Parameter 6:30:00 10:30:00 12:30:00
=> Every column which had even a single missing value has been deleted
This leads to a loss of data
Instead of dropping, it would be better to fill the missing values with some data
data.fillna(0).head()
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00
0 15-10-2020 diltiazem hydrochloride Temperature 23.0 22.0 0.0 21.0 21.0 22 23.0 21.0 22.0
1 15-10-2020 diltiazem hydrochloride Pressure 12.0 13.0 0.0 11.0 13.0 14 16.0 16.0 24.0
2 15-10-2020 docetaxel injection Temperature 0.0 17.0 18.0 0.0 17.0 18 0.0 0.0 23.0
3 15-10-2020 docetaxel injection Pressure 0.0 22.0 22.0 0.0 22.0 23 0.0 0.0 27.0
4 15-10-2020 ketamine hydrochloride Temperature 24.0 0.0 0.0 27.0 0.0 26 25.0 24.0 23.0
data['2:30:00'].fillna(0)
0 22.0
1 13.0
2 17.0
3 22.0
4 0.0
5 0.0
6 35.0
7 19.0
8 47.0
9 24.0
10 9.0
11 12.0
12 19.0
13 4.0
14 13.0
15 22.0
16 14.0
17 9.0
Name: 2:30:00, dtype: float64
keyboard_arrow_down What other values can we use to fill the missing values ?
data['2:30:00'].mean()
18.8125
Now let's fill the NaN values with the mean value of the column
data['2:30:00'].fillna(data['2:30:00'].mean())
0 22.0000
1 13.0000
2 17.0000
3 22.0000
4 18.8125
5 18.8125
6 35.0000
7 19.0000
8 47.0000
9 24.0000
10 9.0000
11 12.0000
12 19.0000
13 4.0000
14 13.0000
15 22.0000
16 14.0000
17 9.0000
Name: 2:30:00, dtype: float64
But this doesn't feel right. What could be wrong with this?
Should we really use the mean across all compounds as our estimate? Different drugs are recorded at very different temperatures
It would be better to fill the null values of each compound with that compound's own mean
keyboard_arrow_down How can we form a column with mean temperature of respective compounds?
def temp_mean(x):
    x['Temperature_avg'] = x['Temperature'].mean()  # we will name the new col Temperature_avg
    return x
Now we can form a new column based on the average values of temperature for each drug
data_tidy=data_tidy.groupby(["Drug_Name"], group_keys=False).apply(temp_mean)
data_tidy
Date time Drug_Name Pressure Temperature Temperature_avg
Now we fill the null values in Temperature using this new column!
data_tidy['Temperature'].fillna(data_tidy["Temperature_avg"], inplace=True)
data_tidy
data_tidy.isna().sum()
Date 0
time 0
Drug_Name 0
Pressure 13
Temperature 0
Temperature_avg 0
dtype: int64
Great!!
def pr_mean(x):
    x['Pressure_avg'] = x['Pressure'].mean()
    return x
data_tidy=data_tidy.groupby(["Drug_Name"]).apply(pr_mean)
data_tidy['Pressure'].fillna(data_tidy["Pressure_avg"], inplace=True)
data_tidy
<ipython-input-27-df55c441df36>:4: FutureWarning: Not prepending group keys to the result index of transform-like apply.
To preserve the previous behavior, use
data_tidy.isna().sum()
Date 0
time 0
Drug_Name 0
Pressure 0
Temperature 0
Temperature_avg 0
Pressure_avg 0
dtype: int64
Let's say, instead of knowing the exact recorded values, we only want to know their category. It depends on the level of granularity we want to have - Low,
Medium, High, Very High
Let's try to use this on our Temperature column to categorise the data into bins
But, to define the categories, let's first check the min and max temperature values
data_tidy
Date time Drug_Name Pressure Temperature Temperature_avg Pressure_avg
print(data_tidy['Temperature'].min(), data_tidy['Temperature'].max())
8.0 58.0
Let's keep some buffer for future values and take the range from 5-60 (instead of 8-58)
Let's divide this data into 4 bins of roughly 10-15 values each
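The pd.cut call itself isn't shown above; a minimal sketch (the exact bin edges are an assumption, chosen to span the 5-60 range, with labels matching the categories in the counts below):
data_tidy['temp_cat'] = pd.cut(data_tidy['Temperature'],
                               bins=[5, 20, 35, 50, 60],
                               labels=['low', 'medium', 'high', 'very_high'])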
data_tidy['temp_cat'].value_counts()
low 50
medium 38
high 15
very_high 5
Name: temp_cat, dtype: int64
Say,
How can you filter rows containing "hydrochloride" in their drug name?
data_tidy.loc[data_tidy['Drug_Name'].str.contains('hydrochloride')].head()
Date time Drug_Name Pressure Temperature Temperature_avg Pressure_avg temp_cat
> Series.str.function()
Series.str can be used to access the values of the series as strings and apply several methods to it.
Now suppose we want to form a new column based on the year of the experiments?
data_tidy['Date'].str.split('-')
To extract the year we need to select the last element of each list
data_tidy['Date'].str.split('-').apply(lambda x:x[2])
0 2020
1 2020
2 2020
3 2020
4 2020
...
103 2020
104 2020
105 2020
106 2020
107 2020
Name: Date, Length: 108, dtype: object
The dtype of the output is still object; we would prefer a numeric type
Also, the date format will not always be day-month-year; it can vary
Thus, to work with such date-time data, we can use a special method of pandas
keyboard_arrow_down Datetime
keyboard_arrow_down How can we handle date-time data-types?
data_tidy.head()
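The cell that builds the timestamp column isn't shown; one plausible sketch (combining the Date and time strings and dropping the originals is an assumption about how the original notebook did it):
data_tidy['timestamp'] = data_tidy['Date'] + ' ' + data_tidy['time']
data_tidy = data_tidy.drop(columns=['Date', 'time'])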
data_tidy['timestamp'] = pd.to_datetime(data_tidy['timestamp']) # will leave to explore how you can mention datetime format
data_tidy
103 docetaxel injection 26.0 19.0 30.387097 25.483871 low 2020-10-17 08:30:00
104 ketamine hydrochloride 11.0 20.0 17.709677 11.935484 low 2020-10-17 08:30:00
105 diltiazem hydrochloride 9.0 13.0 24.848485 15.424242 low 2020-10-17 09:30:00
106 docetaxel injection 27.0 20.0 30.387097 25.483871 low 2020-10-17 09:30:00
107 ketamine hydrochloride 12.0 21.0 17.709677 11.935484 medium 2020-10-17 09:30:00
data_tidy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 108 entries, 0 to 107
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Drug_Name 108 non-null object
1 Pressure 108 non-null float64
2 Temperature 108 non-null float64
3 Temperature_avg 108 non-null float64
4 Pressure_avg 108 non-null float64
5 temp_cat 108 non-null category
6 timestamp 108 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(4), object(1)
memory usage: 10.3+ KB
The type of timestamp column has been changed to datetime from object
keyboard_arrow_down How can we extract information from a single timestamp using Pandas?
ts = data_tidy['timestamp'][0]
ts
Timestamp('2020-10-15 10:30:00')
keyboard_arrow_down Extracting individual information from date
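For example, a few of the attributes and methods available on a single Timestamp:
ts.year        # 2020
ts.month       # 10
ts.day         # 15
ts.hour        # 10
ts.day_name()  # 'Thursday'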
... and so on
keyboard_arrow_down This data parsing from string to date-time makes it easier to work with data
We can access these components for a whole column at once using the .dt accessor
data_tidy['timestamp'].dt
data_tidy['timestamp'].dt.year
0 2020
1 2020
2 2020
3 2020
4 2020
...
103 2020
104 2020
105 2020
106 2020
107 2020
Outline
Uses/necessity of matplotlib
Tencent Use Case
Anatomy
Figure
Types of Data visualization
Univariate Data Visualization
Categorical:
Bar chart
Countplot
Pie Chart
Continuous
Histogram
KDE
Box and Whiskers Plot
Plots Presentation:
https://docs.google.com/presentation/d/1DkLTjTe6YmGbDHtr4v9Jso553DlCuP3cfSnwvUN1mgE/edit?usp=sharing
Summary/Agenda
What is pyplot ?
pyplot is a sub-module for visualization in matplotlib
Think of it as high-level API which makes plotting an easy task
Data Scientists stick to using pyplot only unless they want to create something totally new.
For seaborn, we will be importing the whole seaborn library as alias sns
What is seaborn?
Seaborn is another visualization library which uses matplotlib in the backend for plotting
What is the major difference then between matplotlib and seaborn?
Seaborn provides attractive default themes and reduces the number of lines of code by doing a lot of work in the backend
Matplotlib, on the other hand, is used to draw basic plots and then add more functionality on top of them
Seaborn is built on top of Pandas and Matplotlib
As we proceed through the lecture, we will see the difference between both the libraries
import matplotlib.pyplot as plt
import seaborn as sns
Before we dive into learning these libraries, let's answer some general questions
Two reasons/scopes
Exploratory - I can't see certain patterns just by crunching numbers (avg, rates, %ages)
Explanatory - I have the numbers crunched and the insights ready, but I'd like a visual story for communication
Data
Numerical/Continuous
Categorical
You need to analyze what kind of games they should start creating to get higher success in the market.
!wget https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/021/299/original/final_vg1_-_final_vg_%281%29.csv?16708
import pandas as pd
import numpy as np
data = pd.read_csv('final_vg.csv')
data.head()
Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sal
0 2061 1942 NES 1985.0 Shooter Capcom 4.569217 3.033887 3.439352 1.9916
1 9137 ¡Shin Chan Flipa en colores! DS 2007.0 Platform 505 Games 2.076955 1.493442 3.033887 0.3948
2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.493442 0.4086
3 8359 .hack//G.U. Vol.1//Rebirth PS2 2006.0 Role-Playing Namco Bandai Games 2.031986 1.389856 3.228043 0.3948
4 7109 .hack//G.U. Vol.2//Reminisce PS2 2006.0 Role-Playing Namco Bandai Games 2.792725 2.592054 1.440483 1.4934
If you notice,
On noticing further,
Platform is of nominal type, no proper order between the categories
Year is of ordinal type, there's an order to the categories
(0, 3)
(1, 5)
(2, 9)
x_val = [0, 1, 2]
y_val = [3, 5, 9]
plt.plot(x_val, y_val)
[<matplotlib.lines.Line2D at 0x7cc05d0c3f40>]
While this command decided a lot of things for you, you can customise each of these by understanding components of a matplotlib plot
Bivariate
Numerical-Numerical
Numerical-Categorical
Categorical-Categorical
Multivariate
Let’s start with 3 and then we can generalize
Numerical-Numerical-Categorical
Categorical-Categorical-Numerical
Categorical-Categorical-Categorical
Numerical-Numerical-Numerical
Questions like:
...and so on
cat_counts = data['Genre'].value_counts()
cat_counts
Action 3316
Sports 2400
Misc 1739
Role-Playing 1488
Shooter 1310
Adventure 1286
Racing 1249
Platform 886
Simulation 867
Fighting 848
Strategy 681
Puzzle 582
Name: Genre, dtype: int64
keyboard_arrow_down Now what kind of plot can we use to visualize this information?
We can perhaps plot categories on X-axis and their corresponding frequencies on Y-axis
Such chart is called a Bar Chart or a Count Plot
Can also plot horizontally when the #categories are many
Bar Chart
The data is binned here into categories
x_bar=cat_counts.index
y_bar=cat_counts
plt.bar(x_bar,y_bar)
plt.figure(figsize=(12,8))
plt.bar(x_bar,y_bar)
keyboard_arrow_down And how can we rotate the tick labels, also maybe increase the fontsize of the same?
plt.figure(figsize=(12,8))
plt.bar(x_bar,y_bar)
plt.xticks(rotation=90, fontsize=12)
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[Text(0, 0, 'Action'),
Text(1, 0, 'Sports'),
Text(2, 0, 'Misc'),
Text(3, 0, 'Role-Playing'),
Text(4, 0, 'Shooter'),
Text(5, 0, 'Adventure'),
Text(6, 0, 'Racing'),
Text(7, 0, 'Platform'),
Text(8, 0, 'Simulation'),
Text(9, 0, 'Fighting'),
Text(10, 0, 'Strategy'),
Text(11, 0, 'Puzzle')])
# same code
plt.figure(figsize=(10,8))
plt.bar(x_bar,y_bar,width=0.2)
plt.xticks(rotation = 90, fontsize=12)
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[Text(0, 0, 'Action'),
Text(1, 0, 'Sports'),
Text(2, 0, 'Misc'),
Text(3, 0, 'Role-Playing'),
Text(4, 0, 'Shooter'),
Text(5, 0, 'Adventure'),
Text(6, 0, 'Racing'),
Text(7, 0, 'Platform'),
Text(8, 0, 'Simulation'),
Text(9, 0, 'Fighting'),
Text(10, 0, 'Strategy'),
Text(11, 0, 'Puzzle')])
plt.figure(figsize=(10,8))
plt.bar(x_bar,y_bar,width=0.2,color='orange')
plt.title('Games per Genre',fontsize=15)
plt.xlabel('Genre',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.xticks(rotation = 90, fontsize=12)
plt.yticks(fontsize=12)
(array([ 0., 500., 1000., 1500., 2000., 2500., 3000., 3500.]),
[Text(0, 0.0, '0'),
Text(0, 500.0, '500'),
Text(0, 1000.0, '1000'),
Text(0, 1500.0, '1500'),
Text(0, 2000.0, '2000'),
Text(0, 2500.0, '2500'),
Text(0, 3000.0, '3000'),
Text(0, 3500.0, '3500')])
If you notice, there's always some text printed before the plots.
keyboard_arrow_down How can we remove the text printed before the plot and just display the plot?
plt.figure(figsize=(10,8))
plt.bar(x_bar,y_bar,width=0.2,color='orange')
plt.title('Games per Genre',fontsize=15)
plt.xlabel('Genre',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.xticks(rotation = 90, fontsize=12)
plt.yticks(fontsize=12)
plt.show()
How can we draw a bar-chart in Seaborn?
There is another function in Seaborn called barplot which has some other purpose - discuss later
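A count plot in Seaborn does the counting for us directly from the raw column; a minimal sketch:
plt.figure(figsize=(10, 8))
sns.countplot(x='Genre', data=data)
plt.xticks(rotation=90, fontsize=12)
plt.show()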
The top 5 genres are Action, Sports, Misc, Role-Playing, and Shooter
A pie-chart!
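The construction of region_sales isn't shown above; one plausible sketch (the regional column names are taken from the dataframe header, and summing them is an assumption):
region_sales = data[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].sum()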
plt.pie(region_sales,
labels=region_sales.index,
startangle=90,
explode=(0.2,0,0,0))
plt.show()
keyboard_arrow_down Univariate Data Visualisation - Numerical Data
What kind of questions we may have regarding a numerical variable?
1. How is the data distributed? Say distribution of number of games published in a year.
2. Is the data skewed? Are there any outliers? - Extremely high selling games maybe?
3. How much percentage of data is below/above a certain number?
4. Some special numbers - Min, Max, Mean, Median, nth percentile?
Now say, you want to find the distribution of games released every year.
Unlike barplot, to see the distribution we will need to bin the data.
Histogram
plt.hist(data['Year'])
plt.show()
The curve is left skewed, with a lot more games being published in 2005-2015
This shows that games became highly popular in the last couple of decades, which could point to the increased usage of the internet worldwide!
We can also vary the number of bins, the default number of bins is 10
So if we would need to see this data per decade, we would need 40 years in 4 bins.
plt.hist(data['Year'], bins=4)
plt.show()
We can also get the data of each bin, such as range of the boundaries, values, etc.
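plt.hist returns the bin counts, the bin edges and the bar patches, which is presumably where the count and bins values below come from; a minimal sketch (variable names guessed from the output labels):
count, bins, patches = plt.hist(data['Year'])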
count
array([ 112., 70., 92., 449., 1274., 2440., 3921., 5262., 2406.,
355.])
bins
10
Notice,
But instead of bars, KDE represents data using a continuous probability density curve
sns.kdeplot(data['Year'])
keyboard_arrow_down Boxplot
Now say I want to find the typical earnings of a game when it is published.
Or maybe find the aggregates like median, min, max and percentiles of the data.
What kind of plot can we use to understand the typical earnings from a game?
Box Plot
A box plot shows the distribution of quantitative attributes in a way that makes it easy to compare them across levels of a categorical attribute.
It summarises the data with five numbers:
1. Minimum score
2. First (lower) quartile
3. Median
4. Third (upper) quartile
5. Maximum score
Minimum Score
It is the lowest value, excluding outliers
Lower Quartile
25% of values fall below the lower quartile value
Median
Median marks the mid-point of the data
Half the scores are greater than or equal to this value and half are less.
Upper Quartile
75% of the values fall below the upper quartile value
Maximum Score
It is the highest value, excluding outliers
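The boxplot of game earnings itself isn't shown above; a minimal sketch using the Global_Sales column from the dataset:
sns.boxplot(x=data['Global_Sales'])
plt.show()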
Line plot
Styling and Labelling
Scatterplot
Categorical-Categorical
Dodged countplot
Stacked countplot
Categorical-Continuous
Multiple BoxPlots
Subplots
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('vgsales.csv')
data.head()
Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sal
0 2061 1942 NES 1985.0 Shooter Capcom 4.569217 3.033887 3.439352 1.9916
1 9137 ¡Shin Chan Flipa en colores! DS 2007.0 Platform 505 Games 2.076955 1.493442 3.033887 0.3948
2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.493442 0.4086
3 8359 .hack//G.U. Vol.1//Rebirth PS2 2006.0 Role-Playing Namco Bandai Games 2.031986 1.389856 3.228043 0.3948
4 7109 .hack//G.U. Vol.2//Reminisce PS2 2006.0 Role-Playing Namco Bandai Games 2.792725 2.592054 1.440483 1.4934
data.describe()
Maybe show relation between two features, like how does the sales vary over the years?
Or show how are the features associated, positively or negatively?
...And so on
data['Name'].value_counts()
Ice Hockey 41
Baseball 17
Need for Speed: Most Wanted 12
Ratatouille 9
FIFA 14 9
..
Indy 500 1
Indy Racing 2000 1
Indycar Series 2005 1
inFAMOUS 1
Zyuden Sentai Kyoryuger: Game de Gaburincho!! 1
Name: Name, Length: 11493, dtype: int64
Let's try to find the sales trend in North America of the same across the years
ih = data.loc[data['Name']=='Ice Hockey']
sns.lineplot(x='Year', y='NA_Sales', data=ih)
The sales across North America seem to have been boosted in the years of 1995-2005
Post 2010 though, the sales seem to have taken a dip
Line plots are great for representing trends, such as the one above, over time
Colours can be specified either by single-letter shorthand OR by full name:
black: 'k' / 'black'
red: 'r' / 'red', etc.
https://matplotlib.org/2.0.2/api/colors_api.html
keyboard_arrow_down How can we limit our plot to only the last decade of 20th century?
plt.xlim() : x-axis
plt.ylim() : y-axis
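For example, to restrict the x-axis to 1990-2000 (a sketch):
sns.lineplot(x='Year', y='NA_Sales', data=ih)
plt.xlim(1990, 2000)
plt.show()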
Ice Hockey
Baseball
baseball = data.loc[data['Name']=='Baseball']
sns.lineplot(x='Year', y='NA_Sales', data=baseball)
Now, to compare these, we will have to draw both plots in the same figure
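A minimal sketch of drawing both trends in one figure (passing label= to each plot is one of the two labelling options mentioned below):
sns.lineplot(x='Year', y='NA_Sales', data=ih, label='Ice Hockey')
sns.lineplot(x='Year', y='NA_Sales', data=baseball, label='Baseball')
plt.legend()
plt.show()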
Observe:
We can also pass these labels in plt.legend() as a list in the order plots are done
keyboard_arrow_down Now can we change the position of the legend, say, to bottom-right corner?
upper center
upper left
upper right
lower right ... etc
The pair of floats signifies the (x, y) coordinates for the legend
==> From this we can conclude loc takes two types of arguments:
a location string, e.g. 'lower right'
a pair of floats, e.g. (0.9, 0.1), giving the legend's (x, y) position
keyboard_arrow_down How can we highlight the maximum "Ice Hockey" sales across all years ?
print(max(ih['NA_Sales']))
0.9
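One way to highlight it (a sketch, not necessarily how the original notebook did it) is to draw a horizontal reference line at that value:
sns.lineplot(x='Year', y='NA_Sales', data=ih)
plt.axhline(y=max(ih['NA_Sales']), color='r', linestyle='--')
plt.show()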
We can also use plt.grid() to show the grid layout in the background
Note:
We can pass in parameters inside plt.grid() to control its density, colour of grid lines, etc.
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.grid.html
In this case, unlike a line plot, there may be multiple points on the y-axis for each point on the x-axis
data.head()
Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sal
0 2061 1942 NES 1985.0 Shooter Capcom 4.569217 3.033887 3.439352 1.9916
1 9137 ¡Shin Chan Flipa en colores! DS 2007.0 Platform 505 Games 2.076955 1.493442 3.033887 0.3948
2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.493442 0.4086
3 8359 .hack//G.U. Vol.1//Rebirth PS2 2006.0 Role-Playing Namco Bandai Games 2.031986 1.389856 3.228043 0.3948
4 7109 .hack//G.U. Vol.2//Reminisce PS2 2006.0 Role-Playing Namco Bandai Games 2.792725 2.592054 1.440483 1.4934
keyboard_arrow_down How can we plot the relation between Rank and Global Sales ?
The plot itself looks very messy and it's hard to find any patterns from it.
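The scatter plot being compared to here isn't shown; a minimal sketch:
sns.scatterplot(x='Rank', y='Global_Sales', data=data)
plt.show()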
Compared to lineplot, we are able to see the patterns and points more distinctly now!
Notice,
As the rank number increases, the sales tend to go down, implying that the top-ranked games (lower rank numbers) have higher sales overall!
Scatter plots help us visualize these relations and find any patterns in the data
Key Takeaways:
Sometimes, people also like to display the linear trend between two variables - Regression Plot, do check that
Since we have a lot of categories of each of them, we will use top 3 of each to make our analysis easier
top3_pub = data['Publisher'].value_counts().index[:3]
top3_gen = data['Genre'].value_counts().index[:3]
top3_plat = data['Platform'].value_counts().index[:3]
top3_data = data.loc[(data["Publisher"].isin(top3_pub)) & (data["Platform"].isin(top3_plat)) & (data['Genre'].isin(top3_gen))]
top3_data
2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.4934
19 1741 007: Quantum of Solace PS3 2008.0 Action Activision 4.156030 4.346074 1.0879
21 4501 007: Quantum of Solace PS2 2008.0 Action Activision 3.228043 2.738800 2.5855
16438 14938 Yes! Precure 5 Go Go Zenin Shu Go! Dream Festival DS 2008.0 Action Namco Bandai Games 1.087977 0.592445 1.0879
16479 10979 Young Justice: Legacy PS3 2013.0 Action Namco Bandai Games 2.186589 1.087977 3.4090
16601 11802 ZhuZhu Pets: Quest for Zhu DS 2011.0 Misc Activision 2.340740 1.525543 3.1038
16636 9196 Zoobles! Spring to Life! DS 2011.0 Misc Activision 2.697415 1.087977 2.7607
16640 9816 Zubo DS 2008.0 Misc Electronic Arts 2.592054 1.493442 1.4934
keyboard_arrow_down Categorical-Categorical
Earlier we saw how to work with a continuous-continuous pair of data
Which plot can we use to show the distribution of one category with respect to another?
-> We can have multiple bars for each category
Or
plt.figure(figsize=(10,8))
sns.countplot(x='Publisher',hue='Platform',data=top3_data)
plt.ylabel('Count of Games')
Text(0, 0.5, 'Count of Games')
EA releases PS2 games way more than any other publisher, or even platform!
Activision has almost the same count of games for all 3 platforms
EA is leading in PS3 and PS2, but Namco leads when it comes to DS platform
Some may find a stacked plot difficult to read, since it is not obvious whether each segment starts from the baseline or sits on top of the one below it
keyboard_arrow_down Continuous-Categorical
Now let's look at our 3rd type of data pair
keyboard_arrow_down Boxplot
keyboard_arrow_down What is the distribution of sales for the top 3 publishers?
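The plot itself isn't shown here; a minimal sketch:
plt.figure(figsize=(10, 8))
sns.boxplot(x='Publisher', y='Global_Sales', data=top3_data)
plt.show()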
The overall sales of EA are higher, with a much larger spread than the other publishers
Activision doesn't have many outliers, and if you notice, even though its spread is smaller than EA's, the median is almost the same
Barplot
Genre (categorical)
Mean of global sales per genre (numerical)
keyboard_arrow_down How to visualize which genres bring higher average global sales?
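Seaborn's barplot aggregates the numerical variable per category (the default estimator is the mean); a minimal sketch:
plt.figure(figsize=(10, 8))
sns.barplot(x='Genre', y='Global_Sales', data=data)
plt.xticks(rotation=90)
plt.show()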
If you remember, we had earlier seen that EA had a larger market share of sales
This supports the observation that Sports has a high market share in the industry, as shown in the bar chart
keyboard_arrow_down Subplots
So far we have shown only 1 plot using plt.show()
Say, we want to plot the trend of NA and every other region separately in a single figure
plt.subplots() returns 2 things:
Figure
Numpy Matrix of subplots
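A minimal sketch of the call (the 2x2 grid size is an assumption based on the grid discussed later in this section):
fig, ax = plt.subplots(2, 2, figsize=(15, 10))  # fig is the Figure, ax is a 2x2 array of Axes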
fig = plt.figure(figsize=(15,10))
sns.scatterplot(x=top3_data['NA_Sales'], y=top3_data['EU_Sales'])
fig.suptitle('Main title')
plt.show()
fig = plt.figure(figsize=(15,10))
plt.subplot(2, 3, 1)
sns.scatterplot(x='NA_Sales', y='EU_Sales', data=top3_data)
plt.subplot(2, 3, 3)
sns.scatterplot(x='NA_Sales', y='JP_Sales', data=top3_data, color='red')
fig.suptitle('Main title')
Text(0.5, 0.98, 'Main title')
CCN
CNN
NNN
CCC
JointPlot
Pairplots
Correlation and heatmap
!wget https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/021/299/original/final_vg1_-_final_vg_%281%29.csv?16708
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('vgsales.csv')
data.head()
Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sal
0 2061 1942 NES 1985.0 Shooter Capcom 4.569217 3.033887 3.439352 1.9916
1 9137 ¡Shin Chan Flipa en colores! DS 2007.0 Platform 505 Games 2.076955 1.493442 3.033887 0.3948
2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.493442 0.4086
3 8359 .hack//G.U. Vol.1//Rebirth PS2 2006.0 Role-Playing Namco Bandai Games 2.031986 1.389856 3.228043 0.3948
4 7109 .hack//G.U. Vol.2//Reminisce PS2 2006.0 Role-Playing Namco Bandai Games 2.792725 2.592054 1.440483 1.4934
Hence similar to last lecture, we will use top 3 of each to make our analysis easier
top3_pub = data['Publisher'].value_counts().index[:3]
top3_gen = data['Genre'].value_counts().index[:3]
top3_plat = data['Platform'].value_counts().index[:3]
top3_data = data.loc[(data["Publisher"].isin(top3_pub)) & (data["Platform"].isin(top3_plat)) & (data['Genre'].isin(top3_gen))]
top3_data
2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.4934
19 1741 007: Quantum of Solace PS3 2008.0 Action Activision 4.156030 4.346074 1.0879
21 4501 007: Quantum of Solace PS2 2008.0 Action Activision 3.228043 2.738800 2.5855
16438 14938 Yes! Precure 5 Go Go Zenin Shu Go! Dream Festival DS 2008.0 Action Namco Bandai Games 1.087977 0.592445 1.0879
16479 10979 Young Justice: Legacy PS3 2013.0 Action Namco Bandai Games 2.186589 1.087977 3.4090
16636 9196 Zoobles! Spring to Life! DS 2011.0 Misc Activision 2.697415 1.087977 2.7607
16640 9816 Zubo DS 2008.0 Misc Electronic Arts 2.592054 1.493442 1.4934
keyboard_arrow_down Multivariate
Let's try to add a 3rd variable on top of the plots we have seen so far
keyboard_arrow_down NNC
How can we visualize the correlation between NA and EU, but separated by publisher?
Here, we have two numerical and one categorical variable!
Perhaps No
plt.figure(figsize=(7,7))
sns.scatterplot(x='NA_Sales', y='EU_Sales',hue='Publisher',data=top3_data)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('NA Sales',fontsize=15)
plt.ylabel('EU Sales',fontsize=15)
plt.title('NA vs EU, per Genre', fontsize=15)
plt.show()
Inferences:
If we look at this plot, we can notice that Namco's sales show a weaker correlation, while Activision shows a concentrated positive correlation
EA also shows a positive correlation, but it is more spread out compared to Activision
keyboard_arrow_down CCN
Now, how will you visualize Global Sales for each publisher, but separated by Genres?
We have two categorical and one numerical data here!
Categorical-Categorical → Stacked Barplot, need to add info about one continuous feature
Which one is easier and possible? We can add one categorical variable by “dodging” multiple boxplots
plt.figure(figsize=(15,10))
sns.boxplot(x='Publisher',y='Global_Sales',hue='Genre',data=top3_data)
plt.xlabel('Genre', fontsize=12)
plt.ylabel('Global Sales', fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('Global Sales for each Publisher, Genre wise', fontsize=15)
plt.show()
Inferences:
Namco has a lower median sales in every Genre compared to the other publishers
Looking at the Action Genre, even though EA and Activision have almost similar medians, Action is more spread out for EA
An interesting thing to notice here is that, for each of the three publishers, a different genre of games has the highest sales median:
Namco: Action
Activision: Misc
EA: Sports
keyboard_arrow_down NNN
So far we have seen how NA and EU are correlated with each other.
But how can we compare the data when we have 3 numerical variables?
Say, the question is, how does rank affect the correlation between NA and EU Sales?
We have used scatter plot for two numerical features, we have two options here
Make a 3D Scatterplot
→ Bubble Chart
plt.figure(figsize=(15,10))
# sns.scatterplot(x=data['NA_Sales'], y=data['JP_Sales'],data=top3_data, size=data['Rank'], sizes=(1, 200))
sns.scatterplot(x='NA_Sales', y='JP_Sales', size='Rank', sizes=(1, 200), data=data)
plt.xlabel('NA_Sales',fontsize=10)
plt.ylabel('JP Sales', fontsize=10)
plt.title('NA vs JP Sales, based on ranking of games', fontsize=15)
plt.show()
Inferences:
Now, interestingly, we can notice that games with larger rank numbers (i.e. ranked lower) sit at the lower end of the sales scale, while the top-ranked games are
higher on the sales side
Joint Plot
It draws a plot of two variables
We can select from different values for parameter kind and it will plot accordingly
jointplot plots scatter, histogram and KDE in the same graph when we set kind=reg
The histogram and KDE show the separate (marginal) distributions of NA_Sales and EU_Sales in the data
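A minimal sketch of the call (kind='reg' is the option described above):
sns.jointplot(x='NA_Sales', y='EU_Sales', data=top3_data, kind='reg')
plt.show()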
Each numeric attribute in data is shared across the y-axes across a single row and the x-axes across a single column.
It displays a scatterplot between each pair of attributes in the data with different hue for each category
Since, the diagonal plots belong to same attribute at both x and y axis, they are treated differently
A univariate distribution plot is drawn to show the marginal distribution of the data in each column.
sns.pairplot(data=top3_data, hue='Genre')
plt.show()
Notice that:
Colour Legends for each genre category are given on right side
top3_data.corr()
Higher the MAGNITUDE of coefficient of correlation, more the variables are correlated
+ means increase in value of one variable causes increase in value of other variable
- means increase in value of one variable causes decrease in value of other variable, and vice versa
keyboard_arrow_down As you can see, Global Sales and Rank have the highest correlation coeff of -0.91
We cannot conclude that change in values of a variable is causing change in values of other variable
Heat Map
A heat map plots rectangular data as a color-encoded matrix.
Let's plot a Heat Map using correlation coefficient matrix generated using corr()
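A minimal sketch (annot=True writes the coefficients into the cells; the colormap choice is arbitrary):
sns.heatmap(top3_data.corr(numeric_only=True),  # numeric_only skips the text columns on newer pandas
            annot=True, cmap='viridis')
plt.show()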
keyboard_arrow_down You can change the colours of cells in Heat Map if you like
print(plt.colormaps())
['magma', 'inferno', 'plasma', 'viridis', 'cividis', 'twilight', 'twilight_shifted', 'turbo', 'Blues', 'BrBG', 'BuGn', '
Think of subplots as a 2x2 grid, with the two numbers denoting the row, column coordinate of each subplot
print(ax)
Notice,
Hence, we are using a 2D notation to access each grid/axes object of the subplot
Instead of accessing the individual axes using ax[0, 0], ax[1, 0], there is another method we can use too
plt.subplot(2, 3, 3)
sns.scatterplot(x='NA_Sales', y='JP_Sales', data=top3_data, color='red')
plt.subplot(2, 3, 4)
sns.scatterplot(x='NA_Sales', y='Other_Sales', data=top3_data, color='green')
plt.subplot(2, 3, 6)
sns.scatterplot(x='NA_Sales', y='Global_Sales', data=top3_data, color='orange')
plt.show()
We can use title , x/y label and every other functionality for the subplots too
plt.subplot(2, 3, 3)
sns.scatterplot(x='NA_Sales', y='JP_Sales', data=top3_data, color='red')
plt.title('NA vs JP Sales', fontsize=12)
plt.xlabel('NA', fontsize=12)
plt.ylabel('JP', fontsize=12)
plt.subplot(2, 3, 4)
sns.scatterplot(x='NA_Sales', y='Other_Sales', data=top3_data, color='green')
plt.title('NA vs Other Region Sales', fontsize=12)
plt.xlabel('NA', fontsize=12)
plt.ylabel('Other', fontsize=12)
plt.subplot(2, 3, 6)
sns.scatterplot(x='NA_Sales', y='Global_Sales', data=top3_data, color='orange')
plt.title('NA vs Global Sales', fontsize=12)
plt.xlabel('NA', fontsize=12)
plt.ylabel('Global', fontsize=12)
plt.show()
keyboard_arrow_down What if we want a single plot to span the full height of the figure?
So, this problem can be simplified to plotting it across the second column of a 1-row, 3-column subplot grid