Numpy 1 Merged
Content
Introduction to DAV
Python Lists vs Numpy Array
Importing Numpy
Why use Numpy?
Numpy
Pandas
Matplotlib & Seaborn
2. DAV-2: Probability Statistics
3. DAV-3: Hypothesis Testing
Because of this heterogeneity, in Python lists the data elements are not stored together in memory (RAM).
On the other hand, Numpy only stores homogeneous data, i.e. a Numpy array cannot contain mixed data types.
It will either upcast all elements to a common data type or raise an error.
Speed
In fact, with Numpy, although we write our code in Python, behind the scenes the heavy lifting is implemented in the C programming language, which makes it faster.
Because of this, a Numpy array will be significantly faster than a Python list at performing the same operation.
This is very important to us because, in data science, we deal with huge amounts of data.
Properties
In-built Functions
Slicing
import numpy as np
Note:
Numpy comes pre-installed on Google Colab, so it is already available here.
However, when working in an environment that does not have it installed, you'll have to install it the first time.
This can be done with the command: !pip install numpy
type(a)
list
The basic approach here would be to iterate over the list and square each element.
We can convert any list a into a Numpy array using the array() function.
b = np.array(a)
b
array([1, 2, 3, 4, 5])
type(b)
numpy.ndarray
Now, how can we get the square of each element in the same Numpy array?
b**2
But are the clean syntax and ease of writing the only benefits we are getting here?
l = range(1000000)
343 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It took approx 300 ms per loop to iterate and square all elements from 0 to 999,999
l = np.array(range(1000000))
%timeit l**2
778 µs ± 100 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Notice that the Numpy operation took only around 800 µs per loop, i.e. it is hundreds of times faster.
arr1 = np.array(range(1000000))
arr1.ndim
Numpy arrays have another property called shape that tells us number of elements across every dimension.
arr1.shape
(1000000,)
This means that the array arr1 has 1000000 elements in a single dimension.
[[ 1 2 3]
[ 4 5 6]
[10 11 12]]
What do you think will be the shape & dimension of this array?
arr2.ndim
arr2.shape
(3, 3)
ndim specifies the number of dimensions of the array i.e. 1D (1), 2D (2), 3D (3) and so on.
shape returns the exact shape in all dimensions, that is (3,3) which implies 3 in axis 0 and 3 in axis 1.
np.arange()
We can pass starting point, ending point (not included in the array) and step-size.
Syntax: np.arange(start, stop, step)
arr2 = np.arange(1, 5)
arr2
array([1, 2, 3, 4])
arr2_step = np.arange(1, 5, 2)
arr2_step
array([1, 3])
array([1, 2, 3, 4])
Similarly, what will happen when we run the following code? Will it give an error?
array([1, 2, 3, 4])
arr5 = np.array([1, 2, 3, 4], dtype="float")
arr5
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-26-bdb627c3c07e> in <cell line: 1>()
----> 1 np.array(["Shivank", "Bipin", "Ritwik"], dtype=float)
Since alphabetic strings cannot be converted to floats, it naturally raises an error.
We can also convert the data type with the astype() method.
arr = arr.astype('float64')
print(arr)
Indexing
Similar to Python lists
m1 = np.arange(12)
m1
11
m1 = np.array([100,200,300,400,500,600])
m1[[2,3,4,1,2,2]]
Did you notice how single index can be repeated multiple times when giving list of indexes?
Note:
If you want to extract multiple indices, you need to use two sets of square brackets [[ ]]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-34-0ec34089038e> in <cell line: 1>()
----> 1 m1[2,3,4,1,2,2]
IndexError: too many indices for array: array is 1-dimensional, but 6 were indexed
Slicing
Similar to Python lists
m1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m1
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m1[:5]
array([1, 2, 3, 4, 5])
m1[-5:-1]
array([6, 7, 8, 9])
array([], dtype=int64)
m1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m1 < 6
array([ True, True, True, True, True, False, False, False, False,
False])
m1[[True, True, True, True, True, False, False, False, False, False]]
array([1, 2, 3, 4, 5])
Now, let's use this to filter or mask values from our array.
m1[m1 < 6]
array([1, 2, 3, 4, 5])
m1[m1%2 == 0]
array([ 2, 4, 6, 8, 10])
You've been asked to analyze user survey data and report NPS to the management.
This form asks you to fill in feedback regarding how you are liking the services of Scaler in terms of a numerical score.
This is known as the Likelihood to Recommend Survey.
It is widely used by different companies and service providers to evaluate their performance and customer satisfaction.
Range of NPS
Promoters are highly likely to recommend your product or service, hence bringing in more business,
whereas Detractors are likely to recommend against your product or service's usage, hence driving business down.
These insights can help a business make customer-oriented decisions and improve the product.
Even at Scaler, every month we randomly reach out to our learners over a call and try to understand their experience.
Based on the feedback received, we sometimes end up getting really good insights and act on them.
Dataset: https://drive.google.com/file/d/1c0ClC8SrPwJq5rrkyMKyPn80nyHcFikK/view?usp=sharing
type(score)
numpy.ndarray
score[:5]
score.shape
(1167,)
% Promoters
% Detractors
In order to calculate % Promoters and % Detractors, we need the counts of promoters and detractors.
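A minimal sketch of getting those counts with boolean masks, assuming the usual NPS convention (scores of 9-10 are Promoters, scores of 0-6 are Detractors); NPS itself is then % Promoters minus % Detractors:
detractors = score[score <= 6]   # assumption: a detractor gave a score of 6 or below
promoters = score[score >= 9]    # assumption: a promoter gave a score of 9 or 10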
# Number of detractors -
num_detractors = len(detractors)
num_detractors
332
# Number of promoters -
num_promoters = len(promoters)
num_promoters
609
total = len(score)
total
1167
# % of detractors -
28.449014567266495
# % of promoters -
52.185089974293064
23.73607540702657
# Rounding off upto 2 decimal places -
np.round(nps, 2)
output 23.74
Numpy 2
Content
Working with 2D arrays (Matrices)
Transpose
Indexing
Slicing
Fancy Indexing (Masking)
Aggregate Functions
Logical Operations
np.any()
np.all()
np.where()
import numpy as np
a = np.array(range(16))
a
a.shape
(16,)
a.ndim
Using reshape()
a.reshape(8, 2)
array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11],
[12, 13],
[14, 15]])
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
a.reshape(4, 5)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-05ad01dfd0f5> in <cell line: 1>()
----> 1 a.reshape(4, 5)
a.reshape(8, -1)
array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11],
[12, 13],
[14, 15]])
Notice that Numpy automatically figured out what the -1 argument should be, given that the first argument is 8.
We can also put -1 as the first argument: as long as the other dimension is given, Numpy will calculate the one marked -1.
a.reshape(-1, -1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-decf4fe03d74> in <cell line: 1>()
----> 1 a.reshape(-1, -1)
a = a.reshape(8, 2)
Explanation: len(ndarray) gives you the size of the first dimension (for a 2D array, the number of rows).
len(a)
len(a[0])
2
Transpose
Let's create a 2D numpy array.
a = np.arange(12).reshape(3,4)
a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
a.shape
(3, 4)
It basically means that the no. of rows is interchanged by no. of cols, and vice-versa.
a.T
array([[ 0, 4, 8],
[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11]])
a.T.shape
(4, 3)
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
a[1][2]
m1 = np.arange(1,10).reshape((3,3))
m1
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
m1[1, 1] # m1[row,column]
m1 = np.array([100,200,300,400,500,600])
m1[2, 3]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-23-963ce94bbe14> in <cell line: 1>()
----> 1 m1[2, 3]
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
Note:
Therefore, you cannot use the same syntax for 1D arrays, as you did with 2D arrays, and vice-versa.
However with a little tweak in this code, we can access elements of m1 at different positions/indices.
m1[[2, 3]]
array([300, 400])
How will you print the diagonal elements of the following 2D array?
m1 = np.arange(9).reshape((3,3))
m1
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
array([0, 4, 8])
When a list of indices is provided for both rows and cols, for example m1[[0,1,2],[0,1,2]], Numpy picks the elements at positions (0,0), (1,1) and (2,2), i.e. the diagonal.
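For reference, both of the following produce the diagonal above (np.diag is an alternative not shown in the lecture):
m1[[0, 1, 2], [0, 1, 2]]
np.diag(m1)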
m1 = np.arange(12).reshape(3,4)
m1
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
array([[0, 1],
[4, 5],
[8, 9]])
array([[ 1, 2],
[ 5, 6],
[ 9, 10]])
m1 = np.arange(12).reshape(3, 4)
m1
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
m1 < 6
m1[m1 < 6]
array([0, 1, 2, 3, 4, 5])
np.sum()
a = np.arange(1, 11)
a
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
np.sum(a)
55
What if we want to find the average value or median value of all the elements in an array?
np.mean()
np.mean(a)
5.5
Now, we want to find the minimum / maximum value in the array.
np.min() / np.max()
np.min(a)
np.max(a)
10
a = np.arange(12).reshape(3, 4)
a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
np.sum(a) # sums all the values present in the array
66
np.sum(a, axis=0)
np.sum(a, axis=1)
np.any()
returns True if at least one element of the argument array (or condition) evaluates to True.
Imagine you have a shopping list with items you need to buy, but you're not sure if you have enough money to buy everything.
You want to check if there's at least one item on your list that you can afford.
import numpy as np
# Prices of the items on your list (example values; the original array is not shown)
prices = np.array([45, 20, 35, 60])
# Your budget
budget = 30
# True if at least one item costs no more than the budget
can_afford = np.any(prices <= budget)
if can_afford:
    print("You can buy at least one item on your list!")
else:
    print("Sorry, nothing on your list fits your budget.")
np.all()
returns True if all the elements in the argument arrays follow the provided condition.
Let's consider a scenario where you have a list of chores, and you want to make sure all the chores are done before you can play video games.
You can use np.all to check if all the chores are completed.
import numpy as np
# Status of each chore (example values): True means the chore is done
chores_done = np.array([True, True, False, True])
all_chores_done = np.all(chores_done)  # True only if every chore is done
if all_chores_done:
    print("Great job! You've completed all your chores. Time to play!")
else:
    print("Finish all your chores before you can play.")
a = np.array([1, 2, 3, 2])
b = np.array([2, 2, 3, 2])
c = np.array([6, 4, 4, 5])
True
Suppose you are given an array of integers and you want to update it based on the following condition:
arr[arr > 0] = 1
arr[arr < 0] = -1
arr
np.where()
Suppose you have a list of product prices, and you want to apply a 10% discount to all products with prices above $50.
You can use np.where to adjust the prices.
import numpy as np
# Product prices
prices = np.array([45, 55, 60, 75, 40, 90])
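A sketch of applying the discount with np.where, using the threshold and rate from the prompt above:
discounted = np.where(prices > 50, prices * 0.9, prices)   # 10% off items priced above $50
discounted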
You've been given user data to analyse and find insights which can be shown on the smart watch.
But why would we want to analyse the user data for designing the watch?
These insights from the user data can help the business make customer-oriented decisions for the product design.
So, EDA is about looking at your things, which is data in this case, to understand them better and find out interesting stuff about them.
Formally defining, Exploratory Data Analysis (EDA) is a process of examining, summarizing, and visualizing data sets to understand their main characteristics and uncover patterns, which helps analysts and data scientists gain insights into the data, make informed decisions, and guide further analysis or modeling.
import numpy as np
!gdown https://drive.google.com/uc?id=1vk1Pu0djiYcrdc85yUXZ_Rqq2oZNcohd
Downloading...
From: https://drive.google.com/uc?id=1vk1Pu0djiYcrdc85yUXZ_Rqq2oZNcohd
To: /content/fit.txt
100% 3.43k/3.43k [00:00<00:00, 11.3MB/s]
We provide the file name along with the dtype of data that we want to load in.
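The loading call itself is not shown above; one plausible way (a sketch, assuming fit.txt is whitespace-separated with no extra header handling needed) is:
data = np.loadtxt('fit.txt', dtype=str)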
data.shape
(96, 6)
data.ndim
Date
Step Count
Mood
Calories Burned
Hours of Sleep
Activity Status
Notice that the above array is homogeneous, containing all the data as strings.
In order to work with strings, categorical data and numerical data, we'll have to save every feature separately.
data[0]
data[1]
array(['07-10-2017', '6041', 'Sad', '197', '8', 'Inactive'], dtype='<U10')
data[:5]
How to do that?
One way is to just go ahead and fetch the column number 0 from all rows.
Another way is to take a transpose of data.
Approach 1
data[:, 0]
Approach 2
data_t = data.T
Don't you think all the dates will now be present in the first (i.e. index 0th element) of data_t ?
data_t[0]
data_t.shape
(6, 96)
Let's extract all the columns and save them in separate variables.
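One way to do this (a sketch; the calorie and sleep variable names are illustrative) is to unpack the rows of the transposed array:
date, step_count, mood, calories_burned, hours_of_sleep, activity_status = data_t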
step_count
step_count.dtype
dtype('<U10')
It's a string dtype, where U means Unicode string and 10 means a maximum length of 10 characters.
Step Count
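The step counts are stored as strings, so they are presumably converted to integers with astype before the checks below:
step_count = step_count.astype(int)
step_count.dtype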
dtype('int64')
step_count
array([5464, 6041, 25, 5461, 6915, 4545, 4340, 1230, 61, 1258, 3148,
4687, 4732, 3519, 1580, 2822, 181, 3158, 4383, 3881, 4037, 202,
292, 330, 2209, 4550, 4435, 4779, 1831, 2255, 539, 5464, 6041,
4068, 4683, 4033, 6314, 614, 3149, 4005, 4880, 4136, 705, 570,
269, 4275, 5999, 4421, 6930, 5195, 546, 493, 995, 1163, 6676,
3608, 774, 1421, 4064, 2725, 5934, 1867, 3721, 2374, 2909, 1648,
799, 7102, 3941, 7422, 437, 1231, 1696, 4921, 221, 6500, 3575,
4061, 651, 753, 518, 5537, 4108, 5376, 3066, 177, 36, 299,
1447, 2599, 702, 133, 153, 500, 2127, 2203])
step_count.shape
(96,)
We saw in last class that since it is a 1D array, its shape will be (96, ) .
If it were a 2D array, its shape would've been (96, 1) .
Calories Burned
dtype('int64')
Hours of Sleep
dtype('int64')
Mood
Mood belongs to categorical data type. As the name suggests, categorical data type has two or more categories in it.
mood
np.unique(mood)
Activity Status
activity_status
Since we've extracted from the same source array, we know that the entries of all these columns stay aligned, i.e. the ith value of each column belongs to the same day.
Can we extract the step counts, when the mood was Happy?
step_count_happy = step_count[mood == 'Happy']
len(step_count_happy)
40
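The other two mood masks are built the same way (assuming the labels 'Sad' and 'Neutral' appear in mood):
step_count_sad = step_count[mood == 'Sad']
step_count_neutral = step_count[mood == 'Neutral']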
array([6041, 25, 5461, 4545, 4340, 1230, 61, 1258, 3148, 4687, 3519,
1580, 2822, 181, 6676, 3721, 1648, 799, 1696, 221, 4061, 651,
753, 518, 177, 36, 299, 702, 133])
len(step_count_sad)
29
array([5464, 6915, 3158, 4383, 3881, 4037, 202, 292, 2209, 6041, 570,
1163, 2374, 2909, 7102, 3941, 437, 1231, 4921, 6500, 3575, 4108,
3066, 1447, 2599, 500, 2127])
len(step_count_neutral)
27
How can we collect data for when the mood was either happy or neutral?
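We need an element-wise OR between the two boolean masks, which in Numpy is the | operator (a sketch):
step_count_happy_or_neutral = step_count[(mood == 'Happy') | (mood == 'Neutral')]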
array([5464, 6915, 4732, 3158, 4383, 3881, 4037, 202, 292, 330, 2209,
4550, 4435, 4779, 1831, 2255, 539, 5464, 6041, 4068, 4683, 4033,
6314, 614, 3149, 4005, 4880, 4136, 705, 570, 269, 4275, 5999,
4421, 6930, 5195, 546, 493, 995, 1163, 3608, 774, 1421, 4064,
2725, 5934, 1867, 2374, 2909, 7102, 3941, 7422, 437, 1231, 4921,
6500, 3575, 5537, 4108, 5376, 3066, 1447, 2599, 153, 500, 2127,
2203])
len(step_count_happy_or_neutral)
67
Let's try to compare step counts on bad mood days and good mood days.
np.mean(step_count_sad)
2103.0689655172414
np.mean(step_count_happy)
3392.725
np.mean(step_count_neutral)
3153.777777777778
As you can see, this data tells us a lot about user behaviour.
Let's try to check the mood when step count was greater/lesser.
Out of 38 days when step count was more than 4000, user was feeling happy on 22 days.
Out of 39 days, when step count was less than 2000, user was feeling sad on 18 days.
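A sketch of how such counts can be obtained from the masks (the day counts quoted above are from the original analysis):
mood_high = mood[step_count > 4000]
print(len(mood_high), np.sum(mood_high == 'Happy'))   # days with >4000 steps, and how many were 'Happy'
mood_low = mood[step_count < 2000]
print(len(mood_low), np.sum(mood_low == 'Sad'))       # days with <2000 steps, and how many were 'Sad'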
This suggests that there may be a correlation between the Mood and Step Count.
Numpy 3
Content
Sorting
Matrix Multiplication
np.dot
@ operator
np.matmul
Vectorization
Broadcasting
Sorting
np.sort returns a sorted copy of an array.
import numpy as np
a = np.array([4, 7, 0, 3, 8, 2, 5, 1, 6, 9])
a
array([4, 7, 0, 3, 8, 2, 5, 1, 6, 9])
b = np.sort(a)
b
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
array([4, 7, 0, 3, 8, 2, 5, 1, 6, 9])
We can also call the sort() method directly on the array, but it changes the original array, as it is an in-place operation.
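For example, calling sort() on a modifies it directly (and returns None):
a.sort()
a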
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
array([[ 1, 5, 3],
[ 2, 5, 7],
[400, 200, 300]])
array([[ 1, 5, 3],
[ 2, 5, 7],
[400, 200, 300]])
array([[ 1, 3, 5],
[ 2, 5, 7],
[200, 300, 400]])
Note: By default, the np.sort() functions sorts along the last axis.
a = np.arange(1, 6)
a
array([1, 2, 3, 4, 5])
a * 5
b = np.arange(6, 11)
b
array([ 6, 7, 8, 9, 10])
a * b
c = np.array([1, 2, 3])
a * c
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-3f6f667472ca> in <cell line: 1>()
----> 1 a * c
ValueError: operands could not be broadcast together with shapes (5,) (3,)
d = np.arange(12).reshape(3, 4)
e = np.arange(13, 25).reshape(3, 4)
print(d)
print(e)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[13 14 15 16]
[17 18 19 20]
[21 22 23 24]]
d * e
d * 5
array([[ 0, 5, 10, 15],
[20, 25, 30, 35],
[40, 45, 50, 55]])
Takeaway:
a = np.arange(1,13).reshape((3,4))
c = np.arange(2,14).reshape((4,3))
a.shape, c.shape
a is of shape (3,4) and c is of shape (4,3). The output will be of shape (3,3).
# Using np.dot
np.dot(a,c)
# Using np.matmul
np.matmul(a,c)
# Using @ operator
a@c
a@5
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-27-16572c98568d> in <cell line: 1>()
----> 1 a@5
ValueError: matmul: Input operand 1 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?
np.matmul(a, 5)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-28-875bf147741b> in <cell line: 1>()
----> 1 np.matmul(a, 5)
ValueError: matmul: Input operand 1 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?
np.dot(a, 5)
Important:
The dot() function supports multiplication of an array by a scalar value, which is not possible with matmul().
Array * Array works with matmul(), but Array * Scalar doesn't.
Vectorization
Vectorization in NumPy refers to performing operations on entire arrays or array elements simultaneously, which is significantly faster and more
efficient than using explicit loops.
a = np.arange(10)
a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Note:
def random_operation(x):
if x % 2 == 0:
x += 2
else:
x -= 2
return x
random_operation(a)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-32-83503709589d> in <cell line: 1>()
----> 1 random_operation(a)
<ipython-input-31-1b21f73a20a9> in random_operation(x)
1 def random_operation(x):
----> 2 if x % 2 == 0:
3 x += 2
4 else:
5 x -= 2
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
cool_operation = np.vectorize(random_operation)
type(cool_operation)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-33-6717d289c693> in <cell line: 1>()
----> 1 type(cool_operation)
np.vectorize()
cool_operation(a)
Broadcasting
Broadcasting in NumPy is the automatic and implicit extension of array dimensions to enable element-wise operations between arrays with
different shapes.
Case 1: If the dimensions of both matrices are equal, element-wise addition is done.
a = np.tile(np.arange(0,40,10), (3,1))
a
Note:
numpy.tile(array, reps) constructs an array by repeating A the number of times given by reps along each dimension.
np.tile(array, (repetition_rows, repetition_cols))
a=a.T
a
array([[ 0, 0, 0],
[10, 10, 10],
[20, 20, 20],
[30, 30, 30]])
b = np.tile(np.arange(0,3), (4,1))
b
array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])
print(a.shape, b.shape)
(4, 3) (4, 3)
Since a and b have the same shape, they can be added without any issues.
a+b
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
Case 2: If the right array is 1-D and has the same number of columns as the left array, it is automatically tiled along the rows.
array([[ 0, 0, 0],
[10, 10, 10],
[20, 20, 20],
[30, 30, 30]])
c = np.array([0,1,2])
c
array([0, 1, 2])
print(a.shape, c.shape)
(4, 3) (3,)
a + c
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
Case 3: If the left array is a column matrix (only 1 column) and the right array is a row matrix, both are tiled so that element-wise addition becomes possible.
d = np.array([0,10,20,30]).reshape(4,1)
d
array([[ 0],
[10],
[20],
[30]])
c = np.array([0,1,2])
c
array([0, 1, 2])
print(d.shape, c.shape)
(4, 1) (3,)
d + c
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
a = np.arange(8).reshape(2,4)
a
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
b = np.arange(16).reshape(4,4)
b
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
a+b
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-49-ca730b97bf8a> in <cell line: 1>()
----> 1 a+b
ValueError: operands could not be broadcast together with shapes (2,4) (4,4)
A = np.arange(1,10).reshape(3,3)
A
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
B = np.array([-1, 0, 1])
B
array([-1, 0, 1])
A*B
array([[-1, 0, 3],
[-4, 0, 6],
[-7, 0, 9]])
A = np.arange(12).reshape(3, 4)
A
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
B = np.array([1, 2, 3])
B
array([1, 2, 3])
A + B
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-55-151064de832d> in <cell line: 1>()
----> 1 A + B
ValueError: operands could not be broadcast together with shapes (3,4) (3,)
Shape of A ⇒ (3,4)
Shape of B ⇒ (3,)
Is broadcasting possible in this case? If yes, what will be the shape of output?
Explanation:
A ⇒ (8, 1, 6, 1)
B ⇒ (1, 7, 1, 5)
Now, as per Rule 2, every dimension with value 1 is stretched to match the corresponding dimension of the other array, so the broadcast result has shape (8, 7, 6, 5).
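We can verify the resulting shape programmatically (a sketch; np.broadcast_shapes is available in recent NumPy versions):
np.broadcast_shapes((8, 1, 6, 1), (1, 7, 1, 5))   # -> (8, 7, 6, 5)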
Content
Shallow vs Deep Copy
view()
copy()
Array Splitting
split()
hsplit()
vsplit()
Array Stacking
hstack()
vstack()
concatenate()
Let's create some arrays to understand what happens while using numpy.
import numpy as np
a = np.arange(4)
a
array([0, 1, 2, 3])
b = a.reshape(2, 2)
b
array([[0, 1],
[2, 3]])
a[0] = 100
a
array([100, 1, 2, 3])
array([[100, 1],
[ 2, 3]])
Now, let's see an example where Numpy will create a Deep Copy of data.
a = np.arange(4)
a
array([0, 1, 2, 3])
# Create `c`
c = a + 2
c
array([2, 3, 4, 5])
a[0] = 100
a
array([100, 1, 2, 3])
array([2, 3, 4, 5])
False
Because c = a + 2 is an arithmetic operation, it makes a more permanent change to the data.
So, Numpy had to create a separate copy for c, i.e. a deep copy of array a for array c.
Conclusion:
Numpy is able to use same data for simpler operations like reshape → Shallow Copy.
It creates a copy of data where operations make more permanent changes to data → Deep Copy.
Is there a way to check whether two arrays are sharing memory or not?
a= np.arange(10)
a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
b = a[::2]
b
array([0, 2, 4, 6, 8])
np.shares_memory(a,b)
True
a[0] = 1000
array([1000, 2, 4, 6, 8])
a = np.arange(6)
a
array([0, 1, 2, 3, 4, 5])
b = a[a % 1 == 0]
b
array([0, 1, 2, 3, 4, 5])
b[0] = 10
a[0]
np.shares_memory(a,b)
False
Note:
a = np.arange(10)
a_shallow_copy = a.view()
# Creates a shallow copy of a
np.shares_memory(a_shallow_copy, a)
True
a_deep_copy = a.copy()
# Creates a deep copy of a
np.shares_memory(a_deep_copy, a)
False
.view()
Documentation: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.view.html
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
view_arr = arr.view()
view_arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Let's modify the content of view_arr and check whether it modified the original array as well.
view_arr[4] = 420
view_arr
arr
np.shares_memory(arr, view_arr)
True
.copy()
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
copy_arr = arr.copy()
copy_arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Let's modify the content of copy_arr and check whether it modified the original array as well.
copy_arr[3] = 45
copy_arr
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.shares_memory(arr, copy_arr)  # note: the copied array is copy_arr
False
Splitting
In addition to reshaping and selecting subarrays, it is often necessary to split arrays into smaller arrays or merge arrays into bigger arrays.
np.split()
If indices_or_sections is an integer, n, the array will be divided into n equal arrays along axis.
If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split.
If an index exceeds the dimension of the array along axis, an empty sub-array is returned correspondingly.
x = np.arange(9)
x
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
np.split(x, 3)
IMPORTANT REQUISITE: when splitting into n parts, the array length must be divisible by n, otherwise np.split() raises an error (see below).
b = np.arange(10)
np.split(b, 3)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-5033f171e13f> in <cell line: 2>()
1 b = np.arange(10)
----> 2 np.split(b, 3)
1 frames
/usr/local/lib/python3.10/dist-packages/numpy/lib/shape_base.py in split(ary, indices_or_sections, axis)
870 N = ary.shape[axis]
871 if N % sections:
--> 872 raise ValueError(
873 'array split does not result in an equal division') from None
874 return array_split(ary, indices_or_sections, axis)
b[0:-1]
np.split(b[0:-1], 3)
c = np.arange(16)
np.split(c, [3, 5, 6])
[array([0, 1, 2]),
array([3, 4]),
array([5]),
array([ 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])]
np.hsplit()
Splits an array into multiple sub-arrays horizontally (column-wise).
x = np.arange(16.0).reshape(4, 4)
x
The split we want happens across the 2nd axis (Horizontal axis)
That is why we use hsplit()
So, try to think in terms of "whether the operation is happening along vertical axis or horizontal axis".
np.hsplit(x, 2)
np.vsplit()
x = np.arange(16.0).reshape(4, 4)
x
The split we want happens across the 1st axis (Vertical axis)
That is why we use vsplit()
Again, always try to think in terms of "whether the operation is happening along vertical axis or horizontal axis".
np.vsplit(x, 2)
np.vsplit(x, np.array([3]))
Stacking
a = np.arange(1, 5)
b = np.arange(2, 6)
c = np.arange(3, 7)
np.vstack()
np.vstack([b, c, a])
array([[2, 3, 4, 5],
[3, 4, 5, 6],
[1, 2, 3, 4]])
a = np.arange(1, 5)
b = np.arange(2, 4)
c = np.arange(3, 10)
np.vstack([b, c, a])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-49-5148cb6ebc5f> in <cell line: 1>()
----> 1 np.vstack([b, c, a])
2 frames
/usr/local/lib/python3.10/dist-packages/numpy/core/overrides.py in concatenate(*args, **kwargs)
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the arra
a = np.arange(5).reshape(5, 1)
a
array([[0],
[1],
[2],
[3],
[4]])
b = np.arange(15).reshape(5, 3)
b
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
np.hstack([a, b])
array([[ 0, 0, 1, 2],
[ 1, 3, 4, 5],
[ 2, 6, 7, 8],
[ 3, 9, 10, 11],
[ 4, 12, 13, 14]])
np.concatenate()
Provides similar functionality, but it takes a keyword argument axis that specifies the axis along which the arrays are to be concatenated.
The input arrays to concatenate() need to have dimensions at least equal to the dimensions of the output array.
a = np.array([1,2,3])
a
array([1, 2, 3])
b = np.array([[1,2,3], [4,5,6]])
b
array([[1, 2, 3],
[4, 5, 6]])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-55-1a93c4fe21df> in <cell line: 1>()
----> 1 np.concatenate([a, b], axis = 0)
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the
concatenate() can only work if both a and b have the same number of dimensions.
a = np.array([[1,2,3]])
b = np.array([[1,2,3], [4,5,6]])
a = np.arange(6).reshape(3, 2)
b = np.arange(9).reshape(3, 3)
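The output below presumably comes from concatenating along the columns, since a is (3, 2) and b is (3, 3):
np.concatenate([a, b], axis=1)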
array([[0, 1, 0, 1, 2],
[2, 3, 3, 4, 5],
[4, 5, 6, 7, 8]])
a = np.array([[1,2], [3,4]])
b = np.array([[5,6,7,8]])
array([1, 2, 3, 4, 5, 6, 7, 8])
array([[1, 2],
[3, 4]])
b = np.array([[5, 6]])
b
array([[5, 6]])
array([[1, 2],
[3, 4],
[5, 6]])
Dimensions of a is 2 ×2
What is the dimensions of b ?
1-D array ?? - NO
Look carefully!!
b is a 2-D array of dimensions 1 ×2
axis = 0 ---> It's a vertical axis
b = np.array([[5, 6]])
b
array([[5, 6]])
array([[1, 2, 5],
[3, 4, 6]])
Dimensions of a is again 2 ×2
Dimensions of b is again 1 ×2
So, Dimensions of b.T will be 2 × 1
01-Pandas-Lecture-McKinsey
Outline
Installation of Pandas
Importing pandas
Importing the dataset
Dataframe/Series
df.info()
df.head()
df.tail()
df.shape
Implicit/Explicit index
df.index
Indexing in Series
Slicing in Series
loc/iloc
import pandas as pd
import numpy as np
The major limitation of numpy is that it can only work with 1 datatype at a time
Like names of places would be string but their population would be int
==> It is difficult to work with data having heterogeneous values using Numpy
country
population size
life expectancy
GDP per Capita
We have to analyse the data and draw inferences meaningful to the company
6 columns
1704 rows
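The dataframe is loaded with read_csv (assuming the same mckinsey.csv file that is used later in this notebook):
df = pd.read_csv('mckinsey.csv')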
type(df)
pandas.core.frame.DataFrame
Now how can we access a column, say country of the dataframe?
df["country"]
0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
...
1699 Zimbabwe
1700 Zimbabwe
1701 Zimbabwe
1702 Zimbabwe
1703 Zimbabwe
Name: country, Length: 1704, dtype: object
As you can see we get all the values in the column country
type(df["country"])
pandas.core.series.Series
How can we find the datatype, name, total entries in each column ?
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 1704 non-null object
1 year 1704 non-null int64
2 population 1704 non-null int64
3 continent 1704 non-null object
4 life_exp 1704 non-null float64
5 gdp_cap 1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
Name/Title of Columns
How many non-null values (i.e., cells that are not blank) each column has
Type of values in each column - int, float, etc.
By default, it shows data-type as object for anything other than int or float - Will come back later
Now what if we want to see the first few rows in the dataset?
df.head()
df.head(20)
df.shape
(1704, 6)
and so on.
But what if our dataset has 20 cols? ... or 100 cols? We can't see their names in one go.
Note:
Here, Index is a pandas class used to store the row/column labels of a Series/DataFrame.
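The column labels themselves can be listed with the columns attribute:
df.columns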
df[['country', 'life_exp']].head()
country life_exp
0 Afghanistan 28.801
1 Afghanistan 30.332
2 Afghanistan 31.997
3 Afghanistan 34.020
4 Afghanistan 36.088
df[['country']].head()
country
0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
Note:
Notice how this output type is different from our earlier output using df['country']
Now that we know how to access columns, lets answer some questions
How can we find the countries that have been surveyed?
We can find the unique vals in the country col
df['country'].unique()
Now what if you also want to check the count of each country in the dataframe?
df['country'].value_counts()
Afghanistan 12
Pakistan 12
New Zealand 12
Nicaragua 12
Niger 12
..
Eritrea 12
Equatorial Guinea 12
El Salvador 12
Egypt 12
Zimbabwe 12
Name: country, Length: 142, dtype: int64
Note:
df.rename(columns={"country":"Country"})
df
We can clearly see that the column names are still the same and have not changed. So the changes don't happen in the original dataframe unless we specify a parameter called inplace.
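For example (a sketch), the rename sticks only when inplace=True is passed (or when the result is assigned back to df):
df.rename(columns={"country": "Country"}, inplace=True)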
Note
df.Country
0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
...
1699 Zimbabwe
1700 Zimbabwe
1701 Zimbabwe
1702 Zimbabwe
1703 Zimbabwe
Name: Country, Length: 1704, dtype: object
What do you think could be the problems with using attribute style for accessing the columns?
Problems arise, for example, when a column name clashes with an existing DataFrame attribute or method (e.g. shape) or contains spaces; in such cases attribute-style access won't work.
An alternative to the above approach is using the "columns" parameter as we did in rename
df.drop(columns=['continent'])
df.head()
df = df.drop('continent', axis=1)
OR
By default, inplace=False
OR
df["year+7"] = df["year"] + 7
df.head()
We can also use values from two columns to form a new column
Which two columns can we use to create a new column gdp?
df['gdp']=df['gdp_cap'] * df['population']
df.head()
Values in this column are product of respective values in gdp_cap and population
How can we create a new column from our own values?
We can create a list
OR
We can create a Pandas Series from a list/numpy array for our new column
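A sketch of both options; the column name and values here are purely illustrative:
df['is_surveyed'] = [True] * len(df)        # from a plain Python list
df['ones'] = pd.Series(np.ones(len(df)))    # from a Pandas Series built on a numpy array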
Now that we know how to create new cols lets see some basic ops on rows
YES
df.index.values
Now to understand string indices, let's take a small subset of our original dataframe
sample = df.head()
sample
What if we want to access any particular row (say first row)?
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
5 Afghanistan
6 Afghanistan
7 Afghanistan
8 Afghanistan
9 Afghanistan
10 Afghanistan
11 Afghanistan
12 Afghanistan
13 Albania
14 Albania
15 Albania
16 Albania
17 Albania
18 Albania
19 Albania
20 Albania
Name: Country, dtype: object
So, how will we then access the thirteenth element (or say, the thirteenth row)?
ser[12]
'Afghanistan'
ser[5:15]
6 Afghanistan
7 Afghanistan
8 Afghanistan
9 Afghanistan
10 Afghanistan
11 Afghanistan
12 Afghanistan
13 Albania
14 Albania
15 Albania
Name: Country, dtype: object
df[0]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in
get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
4 frames
pandas/_libs/hashtable_class_helper.pxi in
pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in
pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError: 0
Notice, that this syntax is exactly same as how we tried accessing a column
df[5:15]
Allows indexing and slicing that always references the explicit index
df.loc[1]
Country Afghanistan
year 1952
population 8425333
life_exp 28.801
gdp_cap 779.445314
Name: 1, dtype: object
df.loc[1:3]
2. iloc
Allows indexing and slicing that always references the implicit Python-style index
df.iloc[1]
Country Afghanistan
year 1957
population 9240934
life_exp 30.332
gdp_cap 820.85303
Name: 2, dtype: object
df.iloc[0:2]
NO
Not just b/w loc and iloc , but in general while working in DS and ML
As we see, We can just pack the indices in [] and pass it in loc or iloc
Content
Working with both rows and columns
Sorting
Concatenation
Merge
Link:https://drive.google.com/file/d/1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_/view?usp=sharing
import pandas as pd
import numpy as np
df = pd.read_csv('mckinsey.csv')
append()
loc/iloc
<ipython-input-4-714c78525e27>:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in
df.append(new_row)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-714c78525e27> in <cell line: 2>()
1 new_row = {'Country': 'India', 'year': 2000,'life_exp':37.08,'population':13500000,'gdp_cap':900.23}
----> 2 df.append(new_row)
1 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in _append(self, other, ignore_index, verify_integrity, sort
9778 if isinstance(other, dict):
9779 if not ignore_index:
-> 9780 raise TypeError("Can only append a dict if ignore_index=True")
9781 other = Series(other)
9782 if other.name is None and not ignore_index:
It's saying the ignore_index parameter needs to be set to True. This parameter tells Pandas to ignore the existing index and create a new one based on the length of the resulting DataFrame.
<ipython-input-5-39ca58b35231>:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in
df = df.append(new_row, ignore_index=True)
country year population continent life_exp gdp_cap Country
It does not change the DataFrame, but returns a new DataFrame with the row appended.
We will need to provide the position at which we will add the new row
df.loc[len(df.index)] = ['India',2000 ,13500000,37.08,900.23] # len(df.index) since we will add at the last row
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-29b4966da254> in <cell line: 1>()
----> 1 df.loc[len(df.index)] = ['India',2000 ,13500000,37.08,900.23] # len(df.index) since we will add at the last row
2 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/indexing.py in _setitem_with_indexer_missing(self, indexer, value)
2158 # must have conforming columns
2159 if len(value) != len(self.obj.columns):
-> 2160 raise ValueError("cannot set a row with mismatched columns")
2161
2162 value = Series(value, index=self.obj.columns, name=indexer)
df
The new row was added but the data has been duplicated
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-c6ff7fd1a207> in <cell line: 1>()
----> 1 df.iloc[len(df.index)-1] = ['India', 2000,13500000,37.08,900.23]
2 df
2 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/indexing.py in _setitem_with_indexer_split_path(self, indexer, value,
1877
1878 else:
-> 1879 raise ValueError(
1880 "Must have equal len keys and value "
1881 "when setting with an iterable"
ValueError: Must have equal len keys and value when setting with an iterable
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-9-1ad11b3daf34> in <cell line: 1>()
----> 1 df.iloc[len(df.index)] = ['India', 2000,13500000,37.08,900.23]
1 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/indexing.py in _has_valid_setitem_indexer(self, indexer)
1516 elif is_integer(i):
1517 if i >= len(ax):
-> 1518 raise IndexError("iloc cannot enlarge its target object")
1519 elif isinstance(i, dict):
1520 raise IndexError("iloc cannot enlarge its target object")
For using iloc to add a row, the dataframe must already have a row in that position.
Please Note:
When using the loc[] attribute, it’s not mandatory that a row already exists with a specific label.
df
Country Afghanistan
year 1972
population 13079460
life_exp 36.088
gdp_cap 739.981106
Name: 4, dtype: object
Country Afghanistan
year 1977
population 14880372
life_exp 38.438
gdp_cap 786.11336
Name: 5, dtype: object
It is because the loc function selects rows using row labels (0, 1, 2, 4, etc.) whereas the iloc function selects rows using their integer positions (starting from 0 and going up by one for each row).
So for iloc, the 5th row starting from index 0 was printed.
df.loc[len(df.index)] = ['India',2000,13500000,37.08,900.23]
df.loc[len(df.index)] = ['Sri Lanka',2022 ,130000000,80.00,500.00]
df.loc[len(df.index)] = ['Sri Lanka',2022 ,130000000,80.00,500.00]
df.loc[len(df.index)] = ['India',2000 ,13500000,80.00,900.23]
df
Now how can we check for duplicate rows?
df.duplicated()
df.drop_duplicates()
But how can we decide among all duplicate rows which ones we want to keep?
first
last
False
If first , this considers first value as unique and rest of the same values as duplicate.
df.drop_duplicates(keep='first')
If last , This considers last value as unique and rest of the same values as duplicate.
df.drop_duplicates(keep='last')
df.drop_duplicates(keep=False)
What if you want to look for duplicates only in a few columns?
We can use the argument subset to mention the list of columns which we want to use.
df.drop_duplicates(subset=['Country'],keep='first')
How can we slice the dataframe into, say, first 4 rows and first 3 columns?
df.iloc[0:4, 0:3]
country year population
Pass in 2 different ranges for slicing - one for row and one for column just like Numpy
df.loc[1:5, 1:4]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-494208dc7680> in <cell line: 1>()
----> 1 df.loc[1:5, 1:4]
8 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in _maybe_cast_slice_bound(self, label, side, kind)
6621 # reject them, if index does not contain label
6622 if (is_float(label) or is_integer(label)) and label not in self:
-> 6623 raise self._invalid_indexer("slice", label)
6624
6625 return label
TypeError: cannot do slice indexing on Index with these indexers [1] of type int
Why doesn't slicing with integer indices work with loc?
Because loc references the explicit labels, and this DataFrame's column labels are strings (like 'country'), not integers, so an integer slice over the columns has no matching labels.
df.loc[1:5, ['country','life_exp']]
country life_exp
1 Afghanistan 30.332
2 Afghanistan 31.997
3 Afghanistan 34.020
4 Afghanistan 36.088
5 Afghanistan 38.438
df.loc[1:5, 'year':'population']
year population
1 1957 9240934
2 1962 10267083
3 1967 11537966
4 1972 13079460
5 1977 14880372
df.iloc[[0,10,100], [0,2,3]]
country population continent
df.iloc[1:10:2]
YES
df.loc[1:10:2]
le = df['life_exp']
le
0 28.801
1 30.332
2 31.997
3 34.020
4 36.088
...
1699 62.351
1700 60.377
1701 46.809
1702 39.989
1703 43.487
Name: life_exp, Length: 1704, dtype: float64
... and so on
Note:
le.sum()
101344.44467999999
le.count()
1704
le.sum() / le.count()
59.474439366197174
Sorting
If you notice, life_exp col is not sorted
df.sort_values(['life_exp'])
df.sort_values(['life_exp'], ascending=False)
country year population continent life_exp gdp_cap
YES
df.sort_values(['year', 'life_exp'])
Then, rows with the same value of 'year' were sorted based on 'life_exp'.
For Example
How can we have different sorting orders for different columns in multi-level sorting?
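We can pass a list of booleans to ascending, one flag per column (a sketch):
df.sort_values(['year', 'life_exp'], ascending=[True, False])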
users --> Stores the user details - IDs and Names of users
userid name
0 1 sharadh
1 2 shahid
2 3 khusalli
msgs --> Stores the messages users have sent - User IDs and messages
userid msg
0 1 hmm
1 1 acha
2 2 theek hai
3 4 nice
pd.concat([users, msgs])
0 1 sharadh NaN
1 2 shahid NaN
2 3 khusalli NaN
0 1 NaN hmm
1 1 NaN acha
3 4 NaN nice
userid , being same in both DataFrames, was combined into a single column
First the rows of the users dataframe were placed, with the values of column msg as NaN.
Then the rows of the msgs dataframe were placed, with the values of column name as NaN.
Now how can we make the indices unique for each row?
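Passing ignore_index=True to concat renumbers the rows, which presumably produced the output below:
pd.concat([users, msgs], ignore_index=True)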
0 1 sharadh NaN
1 2 shahid NaN
2 3 khusalli NaN
3 1 NaN hmm
4 1 NaN acha
6 4 NaN nice
concat
merge
How can we know the name of the person who sent a particular message?
No
users.merge(msgs, on="userid")
0 1 sharadh hmm
1 1 sharadh acha
Inner Join
Now what join do we want to use to get info of all the users and all the messages?
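This calls for an outer join, which keeps rows from both sides (presumably what produced the output below):
users.merge(msgs, on="userid", how="outer")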
0 1 sharadh hmm
1 1 sharadh acha
3 3 khusalli NaN
4 4 NaN nice
Note:
And what if we want the info of all the users in the dataframe?
users.merge(msgs, on = "userid",how="left")
0 1 sharadh hmm
1 1 sharadh acha
3 3 khusalli NaN
Similarly, what if we want all the messages and info only for the users who sent a message?
users.merge(msgs, on = "userid", how="right")
0 1 sharadh hmm
1 1 sharadh acha
3 4 NaN nice
Note,
But sometimes the column names might be different even if they contain the same data
id name
0 1 sharadh
1 2 shahid
2 3 khusalli
Now, how can we merge the 2 dataframes when the key has a different name?
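We can name the key on each side with left_on and right_on (a sketch matching the output below):
users.merge(msgs, left_on="id", right_on="userid")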
0 1 sharadh 1 hmm
1 1 sharadh 1 acha
Here,
Movies
Rating
Director
Popularity
Revenue & Budget
File1: https://drive.google.com/file/d/1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd/view?usp=sharing
File2: https://drive.google.com/file/d/1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm/view?usp=sharing
import pandas as pd
import numpy as np
!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: /content/movies.csv
100% 112k/112k [00:00<00:00, 57.4MB/s]
!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
Downloading...
From: https://drive.google.com/uc?id=1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
To: /content/directors.csv
100% 65.4k/65.4k [00:00<00:00, 62.8MB/s]
movies.csv
directors.csv
movies = pd.read_csv('movies.csv')
movies.head()
(Output: the first five rows of movies, with columns Unnamed: 0, id, budget, popularity, revenue, title, vote_average, vote_count, director_id, year, month and day.)
Notice, there's a column Unnamed: 0 which represents nothing but the index of a row.
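One way to get rid of it (a sketch) is to drop the column, or to re-read the file with index_col=0:
movies = movies.drop(columns=['Unnamed: 0'])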
(Output: the same preview with Unnamed: 0 removed, i.e. columns id, budget, popularity, revenue, title, vote_average, vote_count, director_id, year, month, day.)
movies.shape
(1465, 11)
Pandas - 3
Content
Apply()
Grouping
groupby()
import pandas as pd
import numpy as np
!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: /content/movies.csv
100% 112k/112k [00:00<00:00, 77.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
To: /content/directors.csv
100% 65.4k/65.4k [00:00<00:00, 73.5MB/s]
Movie dataset contains info about movies, release, popularity, ratings and the director ID
Director dataset contains detailed info about the director
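Both files are read into dataframes first (the file names come from the downloads above):
movies = pd.read_csv('movies.csv')
directors = pd.read_csv('directors.csv')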
We will use the ID columns (representing unique director) in both the datasets
If you observe, director_id in movies and id in directors refer to the same director.
Thus we can merge our dataframes using these two columns as keys.
Before that, lets first check number of unique director values in our movies data
movies['director_id'].nunique()
199
Recall,
directors['id'].nunique()
2349
Summary:
Now, how can we check if all director_id values are present in id?
movies['director_id'].isin(directors['id'])
0 True
1 True
2 True
3 True
5 True
...
4736 True
4743 True
4748 True
4749 True
4768 True
Name: director_id, Length: 1465, dtype: bool
The isin() method checks if the Dataframe column contains the specified value(s).
If you notice, every value above is True. Let's confirm it holds for all rows:
np.all(movies['director_id'].isin(directors['id']))
True
YES
NO
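The merged dataframe shown below is presumably built with something like the following; the default suffixes turn the two id columns into id_x and id_y:
data = movies.merge(directors, left_on='director_id', right_on='id')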
(Output: the merged dataframe, i.e. all the movies columns plus director_name and gender from directors; the two id columns appear as id_x and id_y.)
data.drop(['director_id','id_y'],axis=1,inplace=True)
data.head()
(Output: data.head() after dropping director_id and id_y, showing the movie details plus director_name and gender.)
Apply
Apply a function along an axis of the DataFrame or Series
Basically, we want to encode the gender column as:
0 for Male
1 for Female
How can we encode the column?
def encode(data):
if data == "Male":
return 0
else:
return 1
Now how can we apply this function to the whole column?
data['gender'] = data['gender'].apply(encode)
data
(Output: data with the gender column now encoded as 0/1.)
data[['revenue', 'budget']].apply(np.sum)
revenue 209866997305
budget 70353617179
dtype: int64
But there's a mistake here. We wanted our results per movie (per row)
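Passing axis=1 applies the function across the columns of each row, which gives the per-movie sums below:
data[['revenue', 'budget']].apply(np.sum, axis=1)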
0 3024965087
1 1261000000
2 1125674609
3 1334939099
4 1148871626
...
1460 321952
1461 3178130
1462 0
1463 0
1464 2260920
Length: 1465, dtype: int64
By default axis = 0
=> apply() can be applied on any dataframe along any particular axis
(Output: a preview of the data dataframe.)
Grouping
What is Grouping ?
data.groupby('director_name')
Notice,
data.groupby('director_name').ngroups
199
Based on this grouping, we can find which rows (index labels) belong to which group.
data.groupby('director_name').groups
{'Adam McKay': [176, 323, 366, 505, 839, 916], 'Adam Shankman': [265, 300, 350, 404, 458, 843, 999, 1231], 'Alejandro
González Iñárritu': [106, 749, 1015, 1034, 1077, 1405], 'Alex Proyas': [95, 159, 514, 671, 873], 'Alexander Payne':
[793, 1006, 1101, 1211, 1281], 'Andrew Adamson': [11, 43, 328, 501, 947], 'Andrew Niccol': [533, 603, 701, 722, 1439],
'Andrzej Bartkowiak': [349, 549, 754, 911, 924], 'Andy Fickman': [517, 681, 909, 926, 973, 1023], 'Andy Tennant': [314,
320, 464, 593, 676, 885], 'Ang Lee': [99, 134, 748, 840, 1089, 1110, 1132, 1184], 'Anne Fletcher': [610, 650, 736, 789,
1206], 'Antoine Fuqua': [310, 338, 424, 467, 576, 808, 818, 1105], 'Atom Egoyan': [946, 1128, 1164, 1194, 1347, 1416],
'Barry Levinson': [313, 319, 471, 594, 878, 898, 1013, 1037, 1082, 1143, 1185, 1345, 1378], 'Barry Sonnenfeld': [13,
48, 90, 205, 591, 778, 783], 'Ben Stiller': [209, 212, 547, 562, 850], 'Bill Condon': [102, 307, 902, 1233, 1381],
'Bobby Farrelly': [352, 356, 481, 498, 624, 630, 654, 806, 928, 972, 1111], 'Brad Anderson': [1163, 1197, 1350, 1419,
1430], 'Brett Ratner': [24, 39, 188, 207, 238, 292, 405, 456, 920], 'Brian De Palma': [228, 255, 318, 439, 747, 905,
919, 1088, 1232, 1261, 1317, 1354], 'Brian Helgeland': [512, 607, 623, 742, 933], 'Brian Levant': [418, 449, 568, 761,
860, 1003], 'Brian Robbins': [416, 441, 669, 962, 988, 1115], 'Bryan Singer': [6, 32, 33, 44, 122, 216, 297, 1326],
'Cameron Crowe': [335, 434, 488, 503, 513, 698], 'Catherine Hardwicke': [602, 695, 724, 937, 1406, 1412], 'Chris
Columbus': [117, 167, 204, 218, 229, 509, 656, 897, 996, 1086, 1129], 'Chris Weitz': [17, 500, 794, 869, 1202, 1267],
'Christopher Nolan': [3, 45, 58, 59, 74, 565, 641, 1341], 'Chuck Russell': [177, 410, 657, 1069, 1097, 1339], 'Clint
Eastwood': [369, 426, 447, 482, 490, 520, 530, 535, 645, 727, 731, 786, 787, 899, 974, 986, 1167, 1190, 1313], 'Curtis
Hanson': [494, 579, 606, 711, 733, 1057, 1310], 'Danny Boyle': [527, 668, 1083, 1085, 1126, 1168, 1287, 1385], 'Darren
Aronofsky': [113, 751, 1187, 1328, 1363, 1458], 'Darren Lynn Bousman': [1241, 1243, 1283, 1338, 1440], 'David Ayer':
[50, 273, 741, 1024, 1146, 1407], 'David Cronenberg': [541, 767, 994, 1055, 1254, 1268, 1334], 'David Fincher': [62,
213, 253, 383, 398, 478, 522, 555, 618, 785], 'David Gordon Green': [543, 862, 884, 927, 1376, 1418, 1432, 1459],
'David Koepp': [443, 644, 735, 1041, 1209], 'David Lynch': [583, 1161, 1264, 1340, 1456], 'David O. Russell': [422,
556, 609, 896, 982, 989, 1229, 1304], 'David R. Ellis': [582, 634, 756, 888, 934], 'David Zucker': [569, 619, 965,
1052, 1175], 'Dennis Dugan': [217, 260, 267, 293, 303, 718, 780, 977, 1247], 'Donald Petrie': [427, 507, 570, 649, 858,
894, 1106, 1331], 'Doug Liman': [52, 148, 251, 399, 544, 1318, 1451], 'Edward Zwick': [92, 182, 346, 566, 791, 819,
825], 'F. Gary Gray': [308, 402, 491, 523, 697, 833, 1272, 1380], 'Francis Ford Coppola': [487, 559, 622, 646, 772,
1076, 1155, 1253, 1312], 'Francis Lawrence': [63, 72, 109, 120, 679], 'Frank Coraci': [157, 249, 275, 451, 577, 599,
963], 'Frank Oz': [193, 355, 473, 580, 712, 813, 987], 'Garry Marshall': [329, 496, 528, 571, 784, 893, 1029, 1169],
'Gary Fleder': [518, 667, 689, 867, 981, 1165], 'Gary Winick': [258, 797, 798, 804, 1454], 'Gavin O'Connor': [820, 841,
939, 953, 1444], 'George A. Romero': [250, 1066, 1096, 1278, 1367, 1396], 'George Clooney': [343, 450, 831, 966, 1302],
'George Miller': [78, 103, 233, 287, 1250, 1403, 1450], 'Gore Verbinski': [1, 8, 9, 107, 119, 633, 1040], 'Guillermo
del Toro': [35, 252, 419, 486, 1118], 'Gus Van Sant': [595, 1018, 1027, 1159, 1240, 1311, 1398], 'Guy Ritchie': [124,
215, 312, 1093, 1225, 1269, 1420], 'Harold Ramis': [425, 431, 558, 586, 788, 1137, 1166, 1325], 'Ivan Reitman': [274,
643, 816, 883, 910, 935, 1134, 1242], 'James Cameron': [0, 19, 170, 173, 344, 1100, 1320], 'James Ivory': [1125, 1152,
1180, 1291, 1293, 1390, 1397], 'James Mangold': [140, 141, 557, 560, 829, 845, 958, 1145], 'James Wan': [30, 617, 1002,
1047, 1337, 1417, 1424], 'Jan de Bont': [155, 224, 231, 270, 781], 'Jason Friedberg': [812, 1010, 1012, 1014, 1036],
'Jason Reitman': [792, 1092, 1213, 1295, 1299], 'Jaume Collet-Serra': [516, 540, 640, 725, 1011, 1189], 'Jay Roach':
[195, 359, 389, 397, 461, 703, 859, 1072], 'Jean-Pierre Jeunet': [423, 485, 605, 664, 765], 'Joe Dante': [284, 525,
638, 1226, 1298, 1428], 'Joe Wright': [85, 432, 553, 803, 814, 855], 'Joel Coen': [428, 670, 691, 707, 721, 889, 906,
980, 1157, 1238, 1305], 'Joel Schumacher': [128, 184, 348, 484, 572, 614, 652, 764, 876, 886, 1108, 1230, 1280], 'John
Carpenter': [537, 663, 686, 861, 938, 1028, 1080, 1102, 1329, 1371], 'John Glen': [601, 642, 801, 847, 864], 'John
Landis': [524, 868, 1276, 1384, 1435], 'John Madden': [457, 882, 1020, 1249, 1257], 'John McTiernan': [127, 214, 244,
351, 534, 563, 648, 782, 838, 1074], 'John Singleton': [294, 489, 732, 796, 1120, 1173, 1316], 'John Whitesell': [499,
632, 763, 1119, 1148], 'John Woo': [131, 142, 264, 371, 420, 675, 1182], 'Jon Favreau': [46, 54, 55, 382, 759, 1346],
'Jon M. Chu': [100, 225, 810, 1099, 1186], 'Jon Turteltaub': [64, 180, 372, 480, 760, 846, 1171], 'Jonathan Demme':
[277, 493, 1000, 1123, 1215], 'Jonathan Liebesman': [81, 143, 339, 1117, 1301], 'Judd Apatow': [321, 710, 717, 865,
881], 'Justin Lin': [38, 123, 246, 1437, 1447], 'Kenneth Branagh': [80, 197, 421, 879, 1094, 1277, 1288], 'Kenny
Ortega': [412, 852, 1228, 1315, 1365], 'Kevin Reynolds': [53, 502, 639, 1019, 1059], ...}
keyboard_arrow_down Now what if we want to extract the data of a particular group from this output?
data.groupby('director_name').get_group('Alexander Payne')
id_x budget popularity revenue title vote_average vote_count year month day director_name ...
793 45163 30000000 19 105834556 About Schmidt 6.7 362 2002 Dec Friday Alexander Payne
1006 45699 20000000 40 177243185 The Descendants 6.7 934 2011 Sep Friday Alexander Payne
1101 46004 16000000 23 109502303 Sideways 6.9 478 2004 Oct Friday Alexander Payne
1211 46446 12000000 29 17654912 Nebraska 7.4 636 2013 Sep Saturday Alexander Payne
1281 46813 0 13 0 Election 6.7 270 1999 Apr Friday Alexander Payne
data.groupby('director_name')['title'].count()
director_name
Adam McKay 6
Adam Shankman 8
Alejandro González Iñárritu 6
Alex Proyas 5
Alexander Payne 5
..
Wes Craven 10
Wolfgang Petersen 7
Woody Allen 18
Zack Snyder 7
Zhang Yimou 6
Name: title, Length: 199, dtype: int64
Finding the very first year and the latest year a director released a movie, i.e. the min and max of the year column, grouped by director
data.groupby(['director_name'])["year"].aggregate(['min', 'max'])
# note: can also use .agg instead of .aggregate (both are same)
min max
director_name
Let's assume:
high budget director -> any director with at least one movie with budget > 100M
data_dir_budget = data.groupby("director_name")["budget"].max().reset_index()
data_dir_budget.head()
director_name budget
2. We can then filter out the director names whose max budget is > 100M (a sketch of this step follows below)
3. Finally, we can filter out the details of the movies by these directors
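The cell defining names for step 2 isn't shown above; a minimal sketch (the variable name names is taken from the code in the next cell):
names = data_dir_budget.loc[data_dir_budget['budget'] > 100000000, 'director_name']  # directors whose costliest movie crossed 100M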
data.loc[data['director_name'].isin(names)]
id_x budget popularity revenue title vote_average vote_count year month day director_name ...
0 43597 237000000 150 2787965087 Avatar 7.2 11800 2009 Dec Thursday James Cameron
1 43598 300000000 139 961000000 Pirates of the Caribbean: At World's End 6.9 4500 2007 May Saturday Gore Verbinski
2 43599 245000000 107 880674609 Spectre 6.3 4466 2015 Oct Monday Sam Mendes
3 43600 250000000 112 1084939099 The Dark Knight Rises 7.6 9106 2012 Jul Monday Christopher Nolan
4 43602 258000000 115 890871626 Spider-Man 3 5.9 3576 2007 May Tuesday Sam Raimi
... ... ... ... ... ... ... ... ... ... ... ...
1460 48363 0 3 321952 The Last Waltz 7.9 64 1978 May Monday Martin Scorsese
1461 48370 27000 19 3151130 Clerks 7.4 755 1994 Sep Tuesday Kevin Smith
1462 48375 0 7 0 Rampage 6.0 131 2009 Aug Friday Uwe Boll
1464 48395 220000 14 2040920 El Mariachi 6.6 238 1992 Sep Friday Robert Rodriguez
NOTE
We can subtract the average revenue of a director from budget col, for each director
def func(x):
    # a boolean returning function for whether the movie is risky or not
    x["risky"] = x["budget"] - x["revenue"].mean() >= 0
    return x
# setting group_keys=True, keeps the group key in the returned dataset (will be default in future version of pandas)
# keep it False if want the normal behaviour
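The apply call that produces data_risky isn't shown above; a minimal sketch, assuming group_keys=False as discussed in the comments:
data_risky = data.groupby('director_name', group_keys=False).apply(func)  # adds the boolean 'risky' column per director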
data_risky
id_x budget popularity revenue title vote_average vote_count year month day director_name ...
0 43597 237000000 150 2787965087 Avatar 7.2 11800 2009 Dec Thursday James Cameron
1 43598 300000000 139 961000000 Pirates of the Caribbean: At World's End 6.9 4500 2007 May Saturday Gore Verbinski
2 43599 245000000 107 880674609 Spectre 6.3 4466 2015 Oct Monday Sam Mendes
3 43600 250000000 112 1084939099 The Dark Knight Rises 7.6 9106 2012 Jul Monday Christopher Nolan
4 43602 258000000 115 890871626 Spider-Man 3 5.9 3576 2007 May Tuesday Sam Raimi
... ... ... ... ... ... ... ... ... ... ... ...
1460 48363 0 3 321952 The Last Waltz 7.9 64 1978 May Monday Martin Scorsese
1461 48370 27000 19 3151130 Clerks 7.4 755 1994 Sep Tuesday Kevin Smith
1462 48375 0 7 0 Rampage 6.0 131 2009 Aug Friday Uwe Boll
1464 48395 220000 14 2040920 El Mariachi 6.6 238 1992 Sep Friday Robert Rodriguez
data_risky.loc[data_risky["risky"]]
id_x budget popularity revenue title vote_average vote_count year month day director_name ...
7 43608 200000000 107 586090727 Quantum of Solace 6.1 2965 2008 Oct Thursday Marc Forster
12 43614 380000000 135 1045713802 Pirates of the Caribbean: On Stranger Tides 6.4 4948 2011 May Saturday Rob Marshall
15 43618 200000000 37 310669540 Robin Hood 6.2 1398 2010 May Wednesday Ridley Scott
20 43624 209000000 64 303025485 Battleship 5.5 2114 2012 Apr Wednesday Peter Berg
24 43630 210000000 3 459359555 X-Men: The Last Stand 6.3 3525 2006 May Wednesday Brett Ratner
keyboard_arrow_down Pandas - 4
Content
Multi-indexing
Restructuring data
pd.melt()
pd.pivot()
pd.pivot_table()
pd.cut()
!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: /content/movies.csv
100% 112k/112k [00:00<00:00, 15.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
To: /content/directors.csv
100% 65.4k/65.4k [00:00<00:00, 86.7MB/s]
keyboard_arrow_down Multi-Indexing
keyboard_arrow_down Task: Which director would be considered the most productive?
To simplify,
data.groupby(['director_name'])['title'].count().sort_values(ascending=False)
director_name
Steven Spielberg 26
Clint Eastwood 19
Martin Scorsese 19
Woody Allen 18
Robert Rodriguez 16
..
Paul Weitz 5
John Madden 5
Paul Verhoeven 5
John Whitesell 5
Kevin Reynolds 5
Name: title, Length: 199, dtype: int64
Steven Spielberg has directed the maximum number of movies. But does that make Steven the most productive director?
Chances are, he might have been active for more years than the other directors
year title
director_name
Notice,
keyboard_arrow_down What would happen if we print the col year of this multi-index dataframe?
data_agg["year"]
min max
director_name
keyboard_arrow_down How can we convert multi-level back to only one level of columns?
data_agg = data.groupby(['director_name'])[["year", "title"]].aggregate(
    {"year": ['min', 'max'], "title": "count"})  # the column names are not aligned properly (still multi-level)
director_name
data.groupby('director_name')[['year', 'title']].aggregate(
    year_max=('year', 'max'),
    year_min=('year', 'min'),
    title_count=('title', 'count')
)
year_max year_min title_count
director_name
Columns look good, but we may want to turn the row labels back into a proper column as well
data_agg.reset_index()
keyboard_arrow_down Using the new features, can we find the most productive director?
First calculate how many years the director has been active.
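The cells computing the two new columns aren't shown above; a minimal sketch using the column names that appear in the output below (whether to add 1 to the active span is an assumption):
data_agg['yrs_active'] = data_agg['year_max'] - data_agg['year_min'] + 1    # years the director has been active
data_agg['movie_per_yr'] = data_agg['title_count'] / data_agg['yrs_active']  # movies released per active year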
data_agg.sort_values("movie_per_yr", ascending=False)
year_min year_max title_count yrs_active movie_per_yr
director_name
Conclusion:
Link: https://drive.google.com/file/d/173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ/view?usp=sharing
!gdown 173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
Downloading...
From: https://drive.google.com/uc?id=173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
To: /content/Pfizer_1.csv
100% 1.51k/1.51k [00:00<00:00, 8.41MB/s]
Temperature (K) and Pressure (P)
are recorded at an interval of 1 hour every day to monitor drug stability in a drug development test
==> These data points are then used to identify the optimal set of parameter values for the stability of the drugs
data = pd.read_csv('Pfizer_1.csv')
data
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00
0 15-10-2020 diltiazem hydrochloride Temperature 23.0 22.0 NaN 21.0 21.0 22 23.0 21.0 22.0
1 15-10-2020 diltiazem hydrochloride Pressure 12.0 13.0 NaN 11.0 13.0 14 16.0 16.0 24.0
2 15-10-2020 docetaxel injection Temperature NaN 17.0 18.0 NaN 17.0 18 NaN NaN 23.0
3 15-10-2020 docetaxel injection Pressure NaN 22.0 22.0 NaN 22.0 23 NaN NaN 27.0
4 15-10-2020 ketamine hydrochloride Temperature 24.0 NaN NaN 27.0 NaN 26 25.0 24.0 23.0
5 15-10-2020 ketamine hydrochloride Pressure 8.0 NaN NaN 7.0 NaN 9 10.0 11.0 10.0
6 16-10-2020 diltiazem hydrochloride Temperature 34.0 35.0 36.0 36.0 37.0 38 37.0 38.0 39.0
7 16-10-2020 diltiazem hydrochloride Pressure 18.0 19.0 20.0 21.0 22.0 23 24.0 25.0 25.0
8 16-10-2020 docetaxel injection Temperature 46.0 47.0 NaN 48.0 48.0 49 50.0 52.0 55.0
9 16-10-2020 docetaxel injection Pressure 23.0 24.0 NaN 25.0 26.0 27 28.0 29.0 28.0
10 16-10-2020 ketamine hydrochloride Temperature 8.0 9.0 10.0 NaN 11.0 12 12.0 11.0 NaN
11 16-10-2020 ketamine hydrochloride Pressure 12.0 12.0 13.0 NaN 15.0 15 15.0 15.0 NaN
12 17-10-2020 diltiazem hydrochloride Temperature 20.0 19.0 19.0 18.0 17.0 16 15.0 NaN 13.0
13 17-10-2020 diltiazem hydrochloride Pressure 3.0 4.0 4.0 4.0 6.0 8 9.0 NaN 9.0
14 17-10-2020 docetaxel injection Temperature 12.0 13.0 14.0 15.0 16.0 17 18.0 19.0 20.0
15 17-10-2020 docetaxel injection Pressure 20.0 22.0 22.0 22.0 22.0 23 25.0 26.0 27.0
16 17-10-2020 ketamine hydrochloride Temperature 13.0 14.0 15.0 16.0 17.0 18 19.0 20.0 21.0
17 17-10-2020 ketamine hydrochloride Pressure 8.0 9.0 10.0 11.0 11.0 12 12.0 11.0 12.0
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 18 non-null object
1 Drug_Name 18 non-null object
2 Parameter 18 non-null object
3 1:30:00 16 non-null float64
4 2:30:00 16 non-null float64
5 3:30:00 12 non-null float64
6 4:30:00 14 non-null float64
7 5:30:00 16 non-null float64
8 6:30:00 18 non-null int64
9 7:30:00 16 non-null float64
10 8:30:00 14 non-null float64
11 9:30:00 16 non-null float64
12 10:30:00 18 non-null int64
13 11:30:00 16 non-null float64
14 12:30:00 18 non-null int64
dtypes: float64(9), int64(3), object(3)
memory usage: 2.2+ KB
Maybe we can have a column for time , with timestamps as the column value
We can similarly create one column containing the values of these parameters
==> "Melt" timestamp columns into two columns - timestamp and corresponding values
How can we restructure our data into having every row corresponding to a single reading?
pd.melt(data, id_vars=['Date', 'Parameter', 'Drug_Name'])
keyboard_arrow_down How can we rename the columns "variable" and "value" as per our original dataframe?
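We can pass var_name and value_name to pd.melt, as is also done in the recap cell further below:
data_melt = pd.melt(data, id_vars=['Date', 'Drug_Name', 'Parameter'],
                    var_name='time',
                    value_name='reading')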
data_melt
Conclusion
The labels of the timestamp columns are conveniently melted into a single column - time (named variable by default)
The labels of columns such as 1:30:00, 2:30:00 have now become the categories of this column
The values from the columns we melted are stored in the reading column (named value by default)
keyboard_arrow_down Pivot
Now suppose we want to convert our data back to wide format
The reason could be to maintain the structure for storing or some other purpose.
Notice:
keyboard_arrow_down How can we restructure our data back to the original wide format, before it was melted?
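The pivot producing the table below is the same call shown a little later, just without the reset_index():
data_melt.pivot(index=['Date', 'Drug_Name', 'Parameter'],
                columns='time',
                values='reading')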
time 10:30:00 11:30:00 12:30:00 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:0
15-10-2020 diltiazem hydrochloride Pressure 18.0 19.0 20.0 12.0 13.0 NaN 11.0 13.0 14
docetaxel injection Pressure 26.0 29.0 28.0 NaN 22.0 22.0 NaN 22.0 23
ketamine hydrochloride Pressure 9.0 9.0 11.0 8.0 NaN NaN 7.0 NaN 9
16-10-2020 diltiazem hydrochloride Pressure 24.0 NaN 27.0 18.0 19.0 20.0 21.0 22.0 23
docetaxel injection Pressure 28.0 29.0 30.0 23.0 24.0 NaN 25.0 26.0 27
ketamine hydrochloride Pressure 16.0 17.0 18.0 12.0 12.0 13.0 NaN 15.0 15
17-10-2020 diltiazem hydrochloride Pressure 11.0 13.0 14.0 3.0 4.0 4.0 4.0 6.0 8
docetaxel injection Pressure 28.0 29.0 28.0 20.0 22.0 22.0 22.0 22.0 23
ketamine hydrochloride Pressure 13.0 14.0 15.0 8.0 9.0 10.0 11.0 11.0 12
Notice,
We are getting a multi-level index here, but we can get a single index again using reset_index
data_melt.pivot(index=['Date','Drug_Name','Parameter'],
columns = 'time',
values='reading').reset_index()
time Date Drug_Name Parameter 10:30:00 11:30:00 12:30:00 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:
0 15-10-2020 diltiazem hydrochloride Pressure 18.0 19.0 20.0 12.0 13.0 NaN 11.0 13.0
1 15-10-2020 diltiazem hydrochloride Temperature 20.0 20.0 21.0 23.0 22.0 NaN 21.0 21.0
2 15-10-2020 docetaxel injection Pressure 26.0 29.0 28.0 NaN 22.0 22.0 NaN 22.0
3 15-10-2020 docetaxel injection Temperature 23.0 25.0 25.0 NaN 17.0 18.0 NaN 17.0
4 15-10-2020 ketamine hydrochloride Pressure 9.0 9.0 11.0 8.0 NaN NaN 7.0 NaN
5 15-10-2020 ketamine hydrochloride Temperature 22.0 21.0 20.0 24.0 NaN NaN 27.0 NaN
6 16-10-2020 diltiazem hydrochloride Pressure 24.0 NaN 27.0 18.0 19.0 20.0 21.0 22.0
7 16-10-2020 diltiazem hydrochloride Temperature 40.0 NaN 42.0 34.0 35.0 36.0 36.0 37.0
8 16-10-2020 docetaxel injection Pressure 28.0 29.0 30.0 23.0 24.0 NaN 25.0 26.0
9 16-10-2020 docetaxel injection Temperature 56.0 57.0 58.0 46.0 47.0 NaN 48.0 48.0
10 16-10-2020 ketamine hydrochloride Pressure 16.0 17.0 18.0 12.0 12.0 13.0 NaN 15.0
11 16-10-2020 ketamine hydrochloride Temperature 13.0 14.0 15.0 8.0 9.0 10.0 NaN 11.0
12 17-10-2020 diltiazem hydrochloride Pressure 11.0 13.0 14.0 3.0 4.0 4.0 4.0 6.0
13 17-10-2020 diltiazem hydrochloride Temperature 14.0 11.0 10.0 20.0 19.0 19.0 18.0 17.0
14 17-10-2020 docetaxel injection Pressure 28.0 29.0 28.0 20.0 22.0 22.0 22.0 22.0
15 17-10-2020 docetaxel injection Temperature 21.0 22.0 23.0 12.0 13.0 14.0 15.0 16.0
16 17-10-2020 ketamine hydrochloride Pressure 13.0 14.0 15.0 8.0 9.0 10.0 11.0 11.0
17 17-10-2020 ketamine hydrochloride Temperature 22.0 23.0 24.0 13.0 14.0 15.0 16.0 17.0
data_melt.head()
keyboard_arrow_down Can we further restructure our data into dividing the Parameter column into T/P?
A format like:
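This is what pivoting data_melt on the Parameter column produces; the call (repeated in the recap of the next section) is:
data_tidy = data_melt.pivot(index=['Date', 'time', 'Drug_Name'],
                            columns='Parameter',
                            values='reading')
data_tidy = data_tidy.reset_index()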
data_tidy
Parameter Pressure Temperature
data_tidy.columns.name = None
data_tidy.head()
keyboard_arrow_down Can we use pivot to find the day-wise mean value of temperature for each drug?
data_tidy.pivot(index=['Drug_Name'],
columns = 'Date',
values=['Temperature'])
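Note that .pivot cannot aggregate: since each (Drug_Name, Date) pair has several readings (one per time), the call above would complain about duplicate entries. The intended tool for this question is pd.pivot_table, which aggregates the duplicates; a minimal sketch:
data_tidy.pivot_table(index='Drug_Name',
                      columns='Date',
                      values='Temperature',
                      aggfunc='mean')  # day-wise mean temperature per drug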
keyboard_arrow_down Pandas - 5
Content
!gdown 173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
data = pd.read_csv('Pfizer_1.csv')
data_melt = pd.melt(data,id_vars = ['Date', 'Drug_Name', 'Parameter'],
var_name = "time",
value_name = 'reading')
data_tidy = data_melt.pivot(index=['Date','time', 'Drug_Name'],
columns = 'Parameter',
values='reading')
data_tidy = data_tidy.reset_index()
data_tidy.columns.name = None
Downloading...
From: https://drive.google.com/uc?id=173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
To: /content/Pfizer_1.csv
100% 1.51k/1.51k [00:00<00:00, 6.05MB/s]
data.head()
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00
0 15-10-2020 diltiazem hydrochloride Temperature 23.0 22.0 NaN 21.0 21.0 22 23.0 21.0 22.0
1 15-10-2020 diltiazem hydrochloride Pressure 12.0 13.0 NaN 11.0 13.0 14 16.0 16.0 24.0
2 15-10-2020 docetaxel injection Temperature NaN 17.0 18.0 NaN 17.0 18 NaN NaN 23.0
3 15-10-2020 docetaxel injection Pressure NaN 22.0 22.0 NaN 22.0 23 NaN NaN 27.0
4 15-10-2020 ketamine hydrochloride Temperature 24.0 NaN NaN 27.0 NaN 26 25.0 24.0 23.0
data_melt.head()
data_tidy.head()
data.head()
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00
0 15-10-2020 diltiazem hydrochloride Temperature 23.0 22.0 NaN 21.0 21.0 22 23.0 21.0 22.0
1 15-10-2020 diltiazem hydrochloride Pressure 12.0 13.0 NaN 11.0 13.0 14 16.0 16.0 24.0
2 15-10-2020 docetaxel injection Temperature NaN 17.0 18.0 NaN 17.0 18 NaN NaN 23.0
3 15-10-2020 docetaxel injection Pressure NaN 22.0 22.0 NaN 22.0 23 NaN NaN 27.0
4 15-10-2020 ketamine hydrochloride Temperature 24.0 NaN NaN 27.0 NaN 26 25.0 24.0 23.0
1. None
2. NaN (short for Not a Number)
type(None)
NoneType
type(np.nan)
float
E.g.-strings
Note:
Pandas uses these values nearly interchangeably, converting between them where appropriate, based on column datatype
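The series shown below can be reproduced with a small sketch (the exact values used in the original cells are an assumption):
pd.Series([1, None, 2, np.nan])                # numeric data: None is converted to NaN, dtype float64
pd.Series([1, np.nan, 2, None], dtype=object)  # object data: None is preserved as-is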
0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64
0 1
1 np.nan
2 2
3 None
dtype: object
0 1
1 np.nan
2 2
3 NaN
dtype: object
For object type, the None is preserved and not changed to NaN
keyboard_arrow_down How to know the count of missing values for each row/column?
data.isna().head()
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00 10:30:00 11
0 False False False False False True False False False False False False False
1 False False False False False True False False False False False False False
2 False False False True False False True False False True True False False
3 False False False True False False True False False True True False False
4 False False False False True True False True False False False False False
data.isnull().head()
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00 10:30:00 11
0 False False False False False True False False False False False False False
1 False False False False False True False False False False False False False
2 False False False True False False True False False True True False False
3 False False False True False False True False False True True False False
4 False False False False True True False True False False False False False
keyboard_arrow_down But, why do we have two methods, "isna" and "isnull", for the same operation?
pd.isnull is simply an alias of pd.isna - both do exactly the same thing; the two names exist for historical reasons (isnull mirrors the naming used in R dataframes)
pd.isna
data.isna().sum()
Date 0
Drug_Name 0
Parameter 0
1:30:00 2
2:30:00 2
3:30:00 6
4:30:00 4
5:30:00 2
6:30:00 0
7:30:00 2
8:30:00 4
9:30:00 2
10:30:00 0
11:30:00 2
12:30:00 0
dtype: int64
keyboard_arrow_down Can we also get the number of missing values in each row?
data.isna().sum(axis=1)
0 1
1 1
2 4
3 4
4 3
5 3
6 1
7 1
8 1
9 1
10 2
11 2
12 1
13 1
14 0
15 0
16 0
17 0
dtype: int64
Note:
keyboard_arrow_down We have identified the null count, but how do we deal with them?
data.dropna()
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00
14 17-10-2020 docetaxel injection Temperature 12.0 13.0 14.0 15.0 16.0 17 18.0 19.0 20.0
15 17-10-2020 docetaxel injection Pressure 20.0 22.0 22.0 22.0 22.0 23 25.0 26.0 27.0
16 17-10-2020 ketamine hydrochloride Temperature 13.0 14.0 15.0 16.0 17.0 18 19.0 20.0 21.0
17 17-10-2020 ketamine hydrochloride Pressure 8.0 9.0 10.0 11.0 11.0 12 12.0 11.0 12.0
data.dropna(axis=1)
Date Drug_Name Parameter 6:30:00 10:30:00 12:30:00
=> Every column which had even a single missing value has been deleted
This leads to a loss of data
Instead of dropping, it would be better to fill the missing values with some data
data.fillna(0).head()
Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00
0 15-10-2020 diltiazem hydrochloride Temperature 23.0 22.0 0.0 21.0 21.0 22 23.0 21.0 22.0
1 15-10-2020 diltiazem hydrochloride Pressure 12.0 13.0 0.0 11.0 13.0 14 16.0 16.0 24.0
2 15-10-2020 docetaxel injection Temperature 0.0 17.0 18.0 0.0 17.0 18 0.0 0.0 23.0
3 15-10-2020 docetaxel injection Pressure 0.0 22.0 22.0 0.0 22.0 23 0.0 0.0 27.0
4 15-10-2020 ketamine hydrochloride Temperature 24.0 0.0 0.0 27.0 0.0 26 25.0 24.0 23.0
data['2:30:00'].fillna(0)
0 22.0
1 13.0
2 17.0
3 22.0
4 0.0
5 0.0
6 35.0
7 19.0
8 47.0
9 24.0
10 9.0
11 12.0
12 19.0
13 4.0
14 13.0
15 22.0
16 14.0
17 9.0
Name: 2:30:00, dtype: float64
keyboard_arrow_down What other values can we use to fill the missing values ?
data['2:30:00'].mean()
18.8125
Now let's fill the NaN values with the mean value of the column
data['2:30:00'].fillna(data['2:30:00'].mean())
0 22.0000
1 13.0000
2 17.0000
3 22.0000
4 18.8125
5 18.8125
6 35.0000
7 19.0000
8 47.0000
9 24.0000
10 9.0000
11 12.0000
12 19.0000
13 4.0000
14 13.0000
15 22.0000
16 14.0000
17 9.0000
Name: 2:30:00, dtype: float64
But this doesn't feel right. What could be wrong with this?
Should we really use the mean across all compounds as our estimate? Different drugs are recorded at very different temperatures
It would be better to fill the null values of each compound with that compound's own mean
keyboard_arrow_down How can we form a column with mean temperature of respective compounds?
def temp_mean(x):
    x['Temperature_avg'] = x['Temperature'].mean()  # we will name the new col Temperature_avg
    return x
Now we can form a new column based on the average values of temperature for each drug
data_tidy=data_tidy.groupby(["Drug_Name"], group_keys=False).apply(temp_mean)
data_tidy
Date time Drug_Name Pressure Temperature Temperature_avg
Now we fill the null values in Temperature using this new column!
data_tidy['Temperature'].fillna(data_tidy["Temperature_avg"], inplace=True)
data_tidy
data_tidy.isna().sum()
Date 0
time 0
Drug_Name 0
Pressure 13
Temperature 0
Temperature_avg 0
dtype: int64
Great!!
def pr_mean(x):
    x['Pressure_avg'] = x['Pressure'].mean()
    return x
data_tidy=data_tidy.groupby(["Drug_Name"]).apply(pr_mean)
data_tidy['Pressure'].fillna(data_tidy["Pressure_avg"], inplace=True)
data_tidy
<ipython-input-27-df55c441df36>:4: FutureWarning: Not prepending group keys to the result index of transform-like apply.
To preserve the previous behavior, use
data_tidy.isna().sum()
Date 0
time 0
Drug_Name 0
Pressure 0
Temperature 0
Temperature_avg 0
Pressure_avg 0
dtype: int64
Let's say, instead of knowing the exact recorded values, we only want to know their category. It depends on the level of granularity we want to have - Low,
Medium, High, Very High
Let's try to use this on our Temperature column to categorise the data into bins
But, to define the categories, let's first check the min and max temperature values
data_tidy
Date time Drug_Name Pressure Temperature Temperature_avg Pressure_avg
print(data_tidy['Temperature'].min(), data_tidy['Temperature'].max())
8.0 58.0
Let's keep some buffer for future values and take the range from 5-60 (instead of 8-58)
Let's divide this data into 4 bins of roughly 10-15 values each
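The pd.cut call itself isn't shown above; a minimal sketch (the exact bin edges are an assumption, chosen to span the 5-60 range, with labels matching the categories in the counts below):
data_tidy['temp_cat'] = pd.cut(data_tidy['Temperature'],
                               bins=[5, 20, 35, 50, 60],
                               labels=['low', 'medium', 'high', 'very_high'])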
data_tidy['temp_cat'].value_counts()
low 50
medium 38
high 15
very_high 5
Name: temp_cat, dtype: int64
Say,
How can you filter rows containing "hydrochloride" in their drug name?
data_tidy.loc[data_tidy['Drug_Name'].str.contains('hydrochloride')].head()
Date time Drug_Name Pressure Temperature Temperature_avg Pressure_avg temp_cat
> Series.str.function()
Series.str can be used to access the values of the series as strings and apply several methods to it.
Now suppose we want to form a new column based on the year of the experiments?
data_tidy['Date'].str.split('-')
To extract the year we need to select the last element of each list
data_tidy['Date'].str.split('-').apply(lambda x:x[2])
0 2020
1 2020
2 2020
3 2020
4 2020
...
103 2020
104 2020
105 2020
106 2020
107 2020
Name: Date, Length: 108, dtype: object
The dtype of the output is still object; we would prefer a numeric type
Also, the date format will not always be day-month-year; it can vary
Thus, to work with such date-time data, we can use a special method of pandas
keyboard_arrow_down Datetime
keyboard_arrow_down How can we handle date-time data-types?
data_tidy.head()
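The cell that builds the timestamp column isn't shown; one plausible sketch (combining the Date and time strings and dropping the originals is an assumption about how the original notebook did it):
data_tidy['timestamp'] = data_tidy['Date'] + ' ' + data_tidy['time']
data_tidy = data_tidy.drop(columns=['Date', 'time'])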
data_tidy['timestamp'] = pd.to_datetime(data_tidy['timestamp']) # will leave to explore how you can mention datetime format
data_tidy
103 docetaxel injection 26.0 19.0 30.387097 25.483871 low 2020-10-17 08:30:00
104 ketamine hydrochloride 11.0 20.0 17.709677 11.935484 low 2020-10-17 08:30:00
105 diltiazem hydrochloride 9.0 13.0 24.848485 15.424242 low 2020-10-17 09:30:00
106 docetaxel injection 27.0 20.0 30.387097 25.483871 low 2020-10-17 09:30:00
107 ketamine hydrochloride 12.0 21.0 17.709677 11.935484 medium 2020-10-17 09:30:00
data_tidy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 108 entries, 0 to 107
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Drug_Name 108 non-null object
1 Pressure 108 non-null float64
2 Temperature 108 non-null float64
3 Temperature_avg 108 non-null float64
4 Pressure_avg 108 non-null float64
5 temp_cat 108 non-null category
6 timestamp 108 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(4), object(1)
memory usage: 10.3+ KB
The type of timestamp column has been changed to datetime from object
keyboard_arrow_down How can we extract information from a single timestamp using Pandas?
ts = data_tidy['timestamp'][0]
ts
Timestamp('2020-10-15 10:30:00')
keyboard_arrow_down Extracting individual information from date
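For example, a few of the attributes and methods available on a single Timestamp:
ts.year        # 2020
ts.month       # 10
ts.day         # 15
ts.hour        # 10
ts.day_name()  # 'Thursday'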
... and so on
keyboard_arrow_down This data parsing from string to date-time makes it easier to work with data
We can access these components for a whole column at once using the .dt accessor
data_tidy['timestamp'].dt
data_tidy['timestamp'].dt.year
0 2020
1 2020
2 2020
3 2020
4 2020
...
103 2020
104 2020
105 2020
106 2020
107 2020
Outline
Uses/necessity of matplotlib
Tencent Use Case
Anatomy
Figure
Types of Data visualization
Univariate Data Visualization
Categorical:
Bar chart
Countplot
Pie Chart
Continuous
Histogram
KDE
Box and Whiskers Plot
Plots Presentation:
https://docs.google.com/presentation/d/1DkLTjTe6YmGbDHtr4v9Jso553DlCuP3cfSnwvUN1mgE/edit?usp=sharing
Summary/Agenda
What is pyplot ?
pyplot is a sub-module for visualization in matplotlib
Think of it as high-level API which makes plotting an easy task
Data Scientists stick to using pyplot only unless they want to create something totally new.
For seaborn, we will be importing the whole seaborn library as alias sns
What is seaborn?
Seaborn is another visualization library which uses matplotlib in the backend for plotting
What is the major difference then between matplotlib and seaborn?
Seaborn provides attractive default themes and reduces the number of lines of code by doing a lot of work in the backend
Matplotlib, on the other hand, is used to draw basic plots and then add more functionality on top of them
Seaborn is built on top of Pandas and Matplotlib
As we proceed through the lecture, we will see the difference between both the libraries
import matplotlib.pyplot as plt
import seaborn as sns
Before we dive into learning these libraries, let's answer some general questions
Two reasons/scopes
Exploratory - I can't see certain patterns just by crunching numbers (avg, rates, %ages)
Explanatory - I have the numbers crunched and the insights ready, but I'd like a visual story for communication
Data
Numerical/Continuous
Categorical
You need to analyze what kind of games they should start creating to get higher success in the market.
!wget https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/021/299/original/final_vg1_-_final_vg_%281%29.csv?16708
import pandas as pd
import numpy as np
data = pd.read_csv('final_vg.csv')
data.head()
Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sal
0 2061 1942 NES 1985.0 Shooter Capcom 4.569217 3.033887 3.439352 1.9916
1 9137 ¡Shin Chan Flipa en colores! DS 2007.0 Platform 505 Games 2.076955 1.493442 3.033887 0.3948
2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.493442 0.4086
3 8359 .hack//G.U. Vol.1//Rebirth PS2 2006.0 Role-Playing Namco Bandai Games 2.031986 1.389856 3.228043 0.3948
4 7109 .hack//G.U. Vol.2//Reminisce PS2 2006.0 Role-Playing Namco Bandai Games 2.792725 2.592054 1.440483 1.4934
If you notice,
On noticing further,
Platform is of nominal type, no proper order between the categories
Year is of ordinal type, there's an order to the categories
(0, 3)
(1, 5)
(2, 9)
x_val = [0, 1, 2]
y_val = [3, 5, 9]
plt.plot(x_val, y_val)
[<matplotlib.lines.Line2D at 0x7cc05d0c3f40>]
While this command decided a lot of things for you, you can customise each of these by understanding components of a matplotlib plot
Bivariate
Numerical-Numerical
Numerical-Categorical
Categorical-Categorical
Multivariate
Let’s start with 3 and then we can generalize
Numerical-Numerical-Categorical
Categorical-Categorical-Numerical
Categorical-Categorical-Categorical
Numerical-Numerical-Numerical
Questions like:
...and so on
cat_counts = data['Genre'].value_counts()
cat_counts
Action 3316
Sports 2400
Misc 1739
Role-Playing 1488
Shooter 1310
Adventure 1286
Racing 1249
Platform 886
Simulation 867
Fighting 848
Strategy 681
Puzzle 582
Name: Genre, dtype: int64
keyboard_arrow_down Now what kind of plot can we use to visualize this information?
We can perhaps plot categories on X-axis and their corresponding frequencies on Y-axis
Such chart is called a Bar Chart or a Count Plot
Can also plot horizontally when the #categories are many
Bar Chart
The data is binned here into categories
x_bar=cat_counts.index
y_bar=cat_counts
plt.bar(x_bar,y_bar)
plt.figure(figsize=(12,8))
plt.bar(x_bar,y_bar)
keyboard_arrow_down And how can we rotate the tick labels, also maybe increase the fontsize of the same?
plt.figure(figsize=(12,8))
plt.bar(x_bar,y_bar)
plt.xticks(rotation=90, fontsize=12)
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[Text(0, 0, 'Action'),
Text(1, 0, 'Sports'),
Text(2, 0, 'Misc'),
Text(3, 0, 'Role-Playing'),
Text(4, 0, 'Shooter'),
Text(5, 0, 'Adventure'),
Text(6, 0, 'Racing'),
Text(7, 0, 'Platform'),
Text(8, 0, 'Simulation'),
Text(9, 0, 'Fighting'),
Text(10, 0, 'Strategy'),
Text(11, 0, 'Puzzle')])
# same code
plt.figure(figsize=(10,8))
plt.bar(x_bar,y_bar,width=0.2)
plt.xticks(rotation = 90, fontsize=12)
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[Text(0, 0, 'Action'),
Text(1, 0, 'Sports'),
Text(2, 0, 'Misc'),
Text(3, 0, 'Role-Playing'),
Text(4, 0, 'Shooter'),
Text(5, 0, 'Adventure'),
Text(6, 0, 'Racing'),
Text(7, 0, 'Platform'),
Text(8, 0, 'Simulation'),
Text(9, 0, 'Fighting'),
Text(10, 0, 'Strategy'),
Text(11, 0, 'Puzzle')])
plt.figure(figsize=(10,8))
plt.bar(x_bar,y_bar,width=0.2,color='orange')
plt.title('Games per Genre',fontsize=15)
plt.xlabel('Genre',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.xticks(rotation = 90, fontsize=12)
plt.yticks(fontsize=12)
(array([ 0., 500., 1000., 1500., 2000., 2500., 3000., 3500.]),
[Text(0, 0.0, '0'),
Text(0, 500.0, '500'),
Text(0, 1000.0, '1000'),
Text(0, 1500.0, '1500'),
Text(0, 2000.0, '2000'),
Text(0, 2500.0, '2500'),
Text(0, 3000.0, '3000'),
Text(0, 3500.0, '3500')])
If you notice, there's always some text printed before the plots.
keyboard_arrow_down How can we remove the text printed before the plot and just display the plot?
plt.figure(figsize=(10,8))
plt.bar(x_bar,y_bar,width=0.2,color='orange')
plt.title('Games per Genre',fontsize=15)
plt.xlabel('Genre',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.xticks(rotation = 90, fontsize=12)
plt.yticks(fontsize=12)
plt.show()
How can we draw a bar-chart in Seaborn?
There is another function in Seaborn called barplot which has some other purpose - discuss later
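A count plot in Seaborn does the counting for us directly from the raw column; a minimal sketch:
plt.figure(figsize=(10, 8))
sns.countplot(x='Genre', data=data)
plt.xticks(rotation=90, fontsize=12)
plt.show()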
The top 5 genres are Action, Sports, Misc, Role-Playing, and Shooter
A pie-chart!
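The construction of region_sales isn't shown above; one plausible sketch (the regional column names are taken from the dataframe header, and summing them is an assumption):
region_sales = data[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].sum()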
plt.pie(region_sales,
labels=region_sales.index,
startangle=90,
explode=(0.2,0,0,0))
plt.show()
keyboard_arrow_down Univariate Data Visualisation - Numerical Data
What kind of questions we may have regarding a numerical variable?
1. How is the data distributed? Say distribution of number of games published in a year.
2. Is the data skewed? Are there any outliers? - Extremely high selling games maybe?
3. How much percentage of data is below/above a certain number?
4. Some special numbers - Min, Max, Mean, Median, nth percentile?
Now say, you want to find the distribution of games released every year.
Unlike barplot, to see the distribution we will need to bin the data.
Histogram
plt.hist(data['Year'])
plt.show()
The curve is left skewed, with a lot more games being published in 2005-2015
This shows that games became highly popular in the last couple of decades, which could point to the increased usage of the internet worldwide!
We can also vary the number of bins, the default number of bins is 10
So if we would need to see this data per decade, we would need 40 years in 4 bins.
plt.hist(data['Year'], bins=4)
plt.show()
We can also get the data of each bin, such as range of the boundaries, values, etc.
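plt.hist returns the bin counts, the bin edges and the bar patches, which is presumably where the count and bins values below come from; a minimal sketch (variable names guessed from the output labels):
count, bins, patches = plt.hist(data['Year'])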
count
array([ 112., 70., 92., 449., 1274., 2440., 3921., 5262., 2406.,
355.])
bins
10
Notice,
But instead of bars, KDE represents data using a continuous probability density curve
sns.kdeplot(data['Year'])
keyboard_arrow_down Boxplot
Now say I want to find the typical earnings of a game when it is published.
Or maybe find the aggregates like median, min, max and percentiles of the data.
What kind of plot can we use to understand the typical earnings from a game?
Box Plot
A box plot shows the distribution of quantitative attributes in a way that makes it easy to compare them across levels of a categorical attribute.
It summarises the data with five numbers:
1. Minimum score
2. First (lower) quartile
3. Median
4. Third (upper) quartile
5. Maximum score
Minimum Score
It is the lowest value, excluding outliers
Lower Quartile
25% of values fall below the lower quartile value
Median
Median marks the mid-point of the data
Half the scores are greater than or equal to this value and half are less.
Upper Quartile
75% of the values fall below the upper quartile value
Maximum Score
It is the highest value, excluding outliers
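The boxplot of game earnings itself isn't shown above; a minimal sketch using the Global_Sales column from the dataset:
sns.boxplot(x=data['Global_Sales'])
plt.show()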
Line plot
Styling and Labelling
Scatterplot
Categorical-Categorical
Dodged countplot
Stacked countplot
Categorical-Continuous
Multiple BoxPlots
Subplots
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('vgsales.csv')
data.head()
Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sal
0 2061 1942 NES 1985.0 Shooter Capcom 4.569217 3.033887 3.439352 1.9916
1 9137 ¡Shin Chan Flipa en colores! DS 2007.0 Platform 505 Games 2.076955 1.493442 3.033887 0.3948
2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.493442 0.4086
3 8359 .hack//G.U. Vol.1//Rebirth PS2 2006.0 Role-Playing Namco Bandai Games 2.031986 1.389856 3.228043 0.3948
4 7109 .hack//G.U. Vol.2//Reminisce PS2 2006.0 Role-Playing Namco Bandai Games 2.792725 2.592054 1.440483 1.4934
data.describe()
Maybe show relation between two features, like how does the sales vary over the years?
Or show how are the features associated, positively or negatively?
...And so on
data['Name'].value_counts()
Ice Hockey 41
Baseball 17
Need for Speed: Most Wanted 12
Ratatouille 9
FIFA 14 9
..
Indy 500 1
Indy Racing 2000 1
Indycar Series 2005 1
inFAMOUS 1
Zyuden Sentai Kyoryuger: Game de Gaburincho!! 1
Name: Name, Length: 11493, dtype: int64
Let's try to find the sales trend in North America of the same across the years
ih = data.loc[data['Name']=='Ice Hockey']
sns.lineplot(x='Year', y='NA_Sales', data=ih)
The sales across North America seem to have been boosted in the years of 1995-2005
Post 2010 though, the sales seem to have taken a dip
Line plots are great for representing trends, such as the one above, over time
Colours can be specified either by single-letter shorthand OR by full name:
black: 'k' / 'black'
red: 'r' / 'red', etc.
https://matplotlib.org/2.0.2/api/colors_api.html
keyboard_arrow_down How can we limit our plot to only the last decade of 20th century?
plt.xlim() : x-axis
plt.ylim() : y-axis
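For example, to restrict the x-axis to 1990-2000 (a sketch):
sns.lineplot(x='Year', y='NA_Sales', data=ih)
plt.xlim(1990, 2000)
plt.show()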
Ice Hockey
Baseball
baseball = data.loc[data['Name']=='Baseball']
sns.lineplot(x='Year', y='NA_Sales', data=baseball)
Now, to compare these, we will have to draw both plots in the same figure
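A minimal sketch of drawing both trends in one figure (passing label= to each plot is one of the two labelling options mentioned below):
sns.lineplot(x='Year', y='NA_Sales', data=ih, label='Ice Hockey')
sns.lineplot(x='Year', y='NA_Sales', data=baseball, label='Baseball')
plt.legend()
plt.show()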
Observe:
We can also pass these labels in plt.legend() as a list in the order plots are done
keyboard_arrow_down Now can we change the position of the legend, say, to bottom-right corner?
upper center
upper left
upper right
lower right ... etc
The pair of floats signifies the (x, y) coordinates for the legend
==> From this we can conclude loc takes two types of arguments:
a location string, e.g. 'lower right'
a pair of floats, e.g. (0.9, 0.1), giving the legend's (x, y) position
keyboard_arrow_down How can we highlight the maximum "Ice Hockey" sales across all years ?
print(max(ih['NA_Sales']))
0.9
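One way to highlight it (a sketch, not necessarily how the original notebook did it) is to draw a horizontal reference line at that value:
sns.lineplot(x='Year', y='NA_Sales', data=ih)
plt.axhline(y=max(ih['NA_Sales']), color='r', linestyle='--')
plt.show()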
We can also use plt.grid() to show the grid layout in the background
Note:
We can pass in parameters inside plt.grid() to control its density, colour of grid lines, etc.
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.grid.html
In this case, unlike a line plot, there may be multiple points on the y-axis for each point on the x-axis
data.head()
Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sal
0 2061 1942 NES 1985.0 Shooter Capcom 4.569217 3.033887 3.439352 1.9916
1 9137 ¡Shin Chan Flipa en colores! DS 2007.0 Platform 505 Games 2.076955 1.493442 3.033887 0.3948
2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.493442 0.4086
3 8359 .hack//G.U. Vol.1//Rebirth PS2 2006.0 Role-Playing Namco Bandai Games 2.031986 1.389856 3.228043 0.3948
4 7109 .hack//G.U. Vol.2//Reminisce PS2 2006.0 Role-Playing Namco Bandai Games 2.792725 2.592054 1.440483 1.4934
keyboard_arrow_down How can we plot the relation between Rank and Global Sales ?
The plot itself looks very messy and it's hard to find any patterns from it.
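The scatter plot being compared to here isn't shown; a minimal sketch:
sns.scatterplot(x='Rank', y='Global_Sales', data=data)
plt.show()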
Compared to lineplot, we are able to see the patterns and points more distinctly now!
Notice,
As the rank number increases, the sales tend to go down, implying that the top-ranked games (lower rank numbers) have higher sales overall!
Scatter plots help us visualize these relations and find any patterns in the data
Key Takeaways:
Sometimes, people also like to display the linear trend between two variables - Regression Plot, do check that
Since we have a lot of categories of each of them, we will use top 3 of each to make our analysis easier
top3_pub = data['Publisher'].value_counts().index[:3]
top3_gen = data['Genre'].value_counts().index[:3]
top3_plat = data['Platform'].value_counts().index[:3]
top3_data = data.loc[(data["Publisher"].isin(top3_pub)) & (data["Platform"].isin(top3_plat)) & (data['Genre'].isin(top3_gen))]
top3_data
2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.4934
19 1741 007: Quantum of Solace PS3 2008.0 Action Activision 4.156030 4.346074 1.0879
21 4501 007: Quantum of Solace PS2 2008.0 Action Activision 3.228043 2.738800 2.5855
16438 14938 Yes! Precure 5 Go Go Zenin Shu Go! Dream Festival DS 2008.0 Action Namco Bandai Games 1.087977 0.592445 1.0879
16479 10979 Young Justice: Legacy PS3 2013.0 Action Namco Bandai Games 2.186589 1.087977 3.4090
16601 11802 ZhuZhu Pets: Quest for Zhu DS 2011.0 Misc Activision 2.340740 1.525543 3.1038
16636 9196 Zoobles! Spring to Life! DS 2011.0 Misc Activision 2.697415 1.087977 2.7607
16640 9816 Zubo DS 2008.0 Misc Electronic Arts 2.592054 1.493442 1.4934
keyboard_arrow_down Categorical-Categorical
Earlier we saw how to work with a continuous-continuous pair of data
Which plot can we use to show the distribution of one category with respect to another?
-> We can have multiple bars for each category
Or
plt.figure(figsize=(10,8))
sns.countplot(x='Publisher',hue='Platform',data=top3_data)
plt.ylabel('Count of Games')
Text(0, 0.5, 'Count of Games')
EA releases PS2 games way more than any other publisher, or even platform!
Activision has almost the same count of games for all 3 platforms
EA is leading in PS3 and PS2, but Namco leads when it comes to DS platform
Some may find a stacked plot difficult to read, since it is not obvious whether each segment starts from the baseline or sits on top of the one below it
keyboard_arrow_down Continuous-Categorical
Now let's look at our 3rd type of data pair
keyboard_arrow_down Boxplot
keyboard_arrow_down What is the distribution of sales for the top 3 publishers?
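The plot itself isn't shown here; a minimal sketch:
plt.figure(figsize=(10, 8))
sns.boxplot(x='Publisher', y='Global_Sales', data=top3_data)
plt.show()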
The overall sales of EA are higher, with a much larger spread than the other publishers
Activision doesn't have many outliers, and if you notice, even though its spread is smaller than EA's, the median is almost the same
Barplot
Genre (categorical)
Mean of global sales per genre (numerical)
keyboard_arrow_down How to visualize which genres bring higher average global sales?
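Seaborn's barplot aggregates the numerical variable per category (the default estimator is the mean); a minimal sketch:
plt.figure(figsize=(10, 8))
sns.barplot(x='Genre', y='Global_Sales', data=data)
plt.xticks(rotation=90)
plt.show()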
If you remember, we had earlier seen that EA had a larger market share of sales
This supports the observation that Sports has a high market share in the industry, as shown in the bar chart
keyboard_arrow_down Subplots
So far we have shown only 1 plot using plt.show()
Say, we want to plot the trend of NA and every other region separately in a single figure
plt.subplots() returns 2 things:
Figure
Numpy Matrix of subplots
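A minimal sketch of the call (the 2x2 grid size is an assumption based on the grid discussed later in this section):
fig, ax = plt.subplots(2, 2, figsize=(15, 10))  # fig is the Figure, ax is a 2x2 array of Axes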
fig = plt.figure(figsize=(15,10))
sns.scatterplot(x=top3_data['NA_Sales'], y=top3_data['EU_Sales'])
fig.suptitle('Main title')
plt.show()
fig = plt.figure(figsize=(15,10))
plt.subplot(2, 3, 1)
sns.scatterplot(x='NA_Sales', y='EU_Sales', data=top3_data)
plt.subplot(2, 3, 3)
sns.scatterplot(x='NA_Sales', y='JP_Sales', data=top3_data, color='red')
fig.suptitle('Main title')
Text(0.5, 0.98, 'Main title')
CCN
CNN
NNN
CCC
JointPlot
Pairplots
Correlation and heatmap
!wget https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/021/299/original/final_vg1_-_final_vg_%281%29.csv?16708
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('vgsales.csv')
data.head()
Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sal
0 2061 1942 NES 1985.0 Shooter Capcom 4.569217 3.033887 3.439352 1.9916
1 9137 ¡Shin Chan Flipa en colores! DS 2007.0 Platform 505 Games 2.076955 1.493442 3.033887 0.3948
2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.493442 0.4086
3 8359 .hack//G.U. Vol.1//Rebirth PS2 2006.0 Role-Playing Namco Bandai Games 2.031986 1.389856 3.228043 0.3948
4 7109 .hack//G.U. Vol.2//Reminisce PS2 2006.0 Role-Playing Namco Bandai Games 2.792725 2.592054 1.440483 1.4934
Hence similar to last lecture, we will use top 3 of each to make our analysis easier
top3_pub = data['Publisher'].value_counts().index[:3]
top3_gen = data['Genre'].value_counts().index[:3]
top3_plat = data['Platform'].value_counts().index[:3]
top3_data = data.loc[(data["Publisher"].isin(top3_pub)) & (data["Platform"].isin(top3_plat)) & (data['Genre'].isin(top3_gen))]
top3_data
2 14279 .hack: Sekai no Mukou ni + Versus PS3 2012.0 Action Namco Bandai Games 1.145709 1.762339 1.4934
19 1741 007: Quantum of Solace PS3 2008.0 Action Activision 4.156030 4.346074 1.0879
21 4501 007: Quantum of Solace PS2 2008.0 Action Activision 3.228043 2.738800 2.5855
16438 14938 Yes! Precure 5 Go Go Zenin Shu Go! Dream Festival DS 2008.0 Action Namco Bandai Games 1.087977 0.592445 1.0879
16479 10979 Young Justice: Legacy PS3 2013.0 Action Namco Bandai Games 2.186589 1.087977 3.4090
16636 9196 Zoobles! Spring to Life! DS 2011.0 Misc Activision 2.697415 1.087977 2.7607
16640 9816 Zubo DS 2008.0 Misc Electronic Arts 2.592054 1.493442 1.4934
keyboard_arrow_down Multivariate
Let's try to add a 3rd variable on top of the plots we have seen so far
keyboard_arrow_down NNC
How can we visualize the correlation between NA and EU, but separated by publisher?
Here, we have two numerical and one categorical variable!
Perhaps No
plt.figure(figsize=(7,7))
sns.scatterplot(x='NA_Sales', y='EU_Sales',hue='Publisher',data=top3_data)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('NA Sales',fontsize=15)
plt.ylabel('EU Sales',fontsize=15)
plt.title('NA vs EU, per Genre', fontsize=15)
plt.show()
Inferences:
If we look at this plot, we can notice that Namco's sales show a weaker correlation, while Activision shows a concentrated positive correlation
EA also shows a positive correlation, but it is more spread out compared to Activision
keyboard_arrow_down CCN
Now, how will you visualize Global Sales for each publisher, but separated by Genres?
We have two categorical and one numerical data here!
Categorical-Categorical → Stacked Barplot, need to add info about one continuous feature
Which one is easier and possible? We can add one categorical variable by “dodging” multiple boxplots
plt.figure(figsize=(15,10))
sns.boxplot(x='Publisher',y='Global_Sales',hue='Genre',data=top3_data)
plt.xlabel('Genre', fontsize=12)
plt.ylabel('Global Sales', fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('Global Sales for each Publisher, Genre wise', fontsize=15)
plt.show()
Inferences:
Namco has a lower median sales in every Genre compared to the other publishers
Looking at the Action Genre, even though EA and Activision have almost similar medians, Action is more spread out for EA
An interesting thing to notice here is that, for each of the three publishers, a different genre of games has the highest sales median:
Namco: Action
Activision: Misc
EA: Sports
keyboard_arrow_down NNN
So far we have seen how NA and EU are correlated with each other.
But how can we compare the data when we have 3 numerical variables?
Say, the question is, how does rank affect the correlation between NA and EU Sales?
We have used scatter plot for two numerical features, we have two options here
Make a 3D Scatterplot
→ Bubble Chart
plt.figure(figsize=(15,10))
# sns.scatterplot(x=data['NA_Sales'], y=data['JP_Sales'],data=top3_data, size=data['Rank'], sizes=(1, 200))
sns.scatterplot(x='NA_Sales', y='JP_Sales', size='Rank', sizes=(1, 200), data=data)
plt.xlabel('NA_Sales',fontsize=10)
plt.ylabel('JP Sales', fontsize=10)
plt.title('NA vs JP Sales, based on ranking of games', fontsize=15)
plt.show()
Inferences:
Now, interestingly, we can notice that games with larger rank numbers (i.e. ranked lower) sit at the lower end of the sales scale, while the top-ranked games are
higher on the sales side
Joint Plot
It draws a plot of two variables
We can select from different values for parameter kind and it will plot accordingly
jointplot plots scatter, histogram and KDE in the same graph when we set kind=reg
The histogram and KDE show the separate (marginal) distributions of NA_Sales and EU_Sales in the data
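A minimal sketch of the call (kind='reg' is the option described above):
sns.jointplot(x='NA_Sales', y='EU_Sales', data=top3_data, kind='reg')
plt.show()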
Each numeric attribute in data is shared across the y-axes across a single row and the x-axes across a single column.
It displays a scatterplot between each pair of attributes in the data with different hue for each category
Since, the diagonal plots belong to same attribute at both x and y axis, they are treated differently
A univariate distribution plot is drawn to show the marginal distribution of the data in each column.
sns.pairplot(data=top3_data, hue='Genre')
plt.show()
Notice that:
Colour Legends for each genre category are given on right side
top3_data.corr()
Higher the MAGNITUDE of coefficient of correlation, more the variables are correlated
+ means increase in value of one variable causes increase in value of other variable
- means increase in value of one variable causes decrease in value of other variable, and vice versa
keyboard_arrow_down As you can see, Global Sales and Rank have the highest correlation coeff of -0.91
We cannot conclude that change in values of a variable is causing change in values of other variable
Heat Map
A heat map plots rectangular data as a color-encoded matrix.
Let's plot a Heat Map using correlation coefficient matrix generated using corr()
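A minimal sketch (annot=True writes the coefficients into the cells; the colormap choice is arbitrary):
sns.heatmap(top3_data.corr(numeric_only=True),  # numeric_only skips the text columns on newer pandas
            annot=True, cmap='viridis')
plt.show()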
keyboard_arrow_down You can change the colours of cells in Heat Map if you like
print(plt.colormaps())
['magma', 'inferno', 'plasma', 'viridis', 'cividis', 'twilight', 'twilight_shifted', 'turbo', 'Blues', 'BrBG', 'BuGn', '
Think of subplots as a 2x2 grid, with the two numbers denoting the row, column coordinate of each subplot
print(ax)
Notice,
Hence, we are using a 2D notation to access each grid/axes object of the subplot
Instead of accessing the individual axes using ax[0, 0], ax[1, 0], there is another method we can use too
plt.subplot(2, 3, 3)
sns.scatterplot(x='NA_Sales', y='JP_Sales', data=top3_data, color='red')
plt.subplot(2, 3, 4)
sns.scatterplot(x='NA_Sales', y='Other_Sales', data=top3_data, color='green')
plt.subplot(2, 3, 6)
sns.scatterplot(x='NA_Sales', y='Global_Sales', data=top3_data, color='orange')
plt.show()
We can use title , x/y label and every other functionality for the subplots too
plt.subplot(2, 3, 3)
sns.scatterplot(x='NA_Sales', y='JP_Sales', data=top3_data, color='red')
plt.title('NA vs JP Sales', fontsize=12)
plt.xlabel('NA', fontsize=12)
plt.ylabel('JP', fontsize=12)
plt.subplot(2, 3, 4)
sns.scatterplot(x='NA_Sales', y='Other_Sales', data=top3_data, color='green')
plt.title('NA vs Other Region Sales', fontsize=12)
plt.xlabel('NA', fontsize=12)
plt.ylabel('Other', fontsize=12)
plt.subplot(2, 3, 6)
sns.scatterplot(x='NA_Sales', y='Global_Sales', data=top3_data, color='orange')
plt.title('NA vs Global Sales', fontsize=12)
plt.xlabel('NA', fontsize=12)
plt.ylabel('Global', fontsize=12)
plt.show()
keyboard_arrow_down What if we want a single plot to span the full height of the figure?
So, this problem can be simplified to plotting it across the second column of a 1-row, 3-column subplot grid