DataAnalytic-04 - NumPy & Pandas
Data Analytic
Adhi Harmoko Saputro
Content
NumPy
• Create Array
• Manipulating Array Shapes
• Stacking Arrays
• Partitioning Arrays
• Changing Data Type
• Creating Views & Copies
• Slicing Arrays
• Broadcasting Arrays
Pandas
• Creating DataFrames
• Understanding DataFrames
• Reading & Querying
• Describing DataFrames
• Grouping & Joining DataFrames
• Working with Missing Values
• Creating Pivot Tables
• Dealing with Dates
NumPy Arrays
Create Array
• Create an array using the array() function with a list of items
• Possible data types are bool, int, float, long, double, and long double
Code:
# Creating an array
import numpy as np
a = np.array([2,4,6,8,10])
print(a)
Output:
[ 2 4 6 8 10]
Create Array
• Creates an evenly spaced NumPy array using the arange([start,] stop[, step]) function
• The start is the initial value of the range (optional, defaults to 0)
• The stop is the end of the range (compulsory) and is not included in the result
• The step is the increment between values (optional, defaults to 1)
• With integer arguments and a step of 1, the last generated value is one less than the stop parameter value
Code:
# Creating an array using arange()
import numpy as np
a = np.arange(1,11)
print(a)
Output:
[ 1 2 3 4 5 6 7 8 9 10]
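A minimal sketch of the step and default arguments described above. The values are chosen only for illustration.

```python
import numpy as np

# arange(start, stop, step): stop itself is never included
evens = np.arange(2, 11, 2)
print(evens)        # [ 2  4  6  8 10]

# With a single argument, start defaults to 0 and step to 1
first_five = np.arange(5)
print(first_five)   # [0 1 2 3 4]
```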
Create Array
• The zeros() function creates an array for a given dimension with all zeroes
• The ones() function creates an array for a given dimension with all ones
• The full() function generates an array with constant values
• The eye() function creates an identity matrix
• The random() function (np.random.random) creates an array of any given dimension filled with random floats in the range [0, 1)
Create Array
Code:
import numpy as np
Output:
All Zeros Array
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
All Ones Array
[[1. 1.]
[1. 1.]]
Constant Array
[[4 4]
[4 4]]
4x4 Identity Matrix
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
Random Values Array
[[0.52715194 0.21717546 0.77645954]
[0.6971321 0.28391831 0.88782967]
[0.25132285 0.17128033 0.22925874]]
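The output above is consistent with calls such as the following. This is a sketch only: the shapes and the fill value 4 are inferred from the printed arrays, and the random values will differ on every run.

```python
import numpy as np

zeros = np.zeros((3, 3))          # 3x3 array of 0.0
ones = np.ones((2, 2))            # 2x2 array of 1.0
const = np.full((2, 2), 4)        # 2x2 array filled with the constant 4
identity = np.eye(4)              # 4x4 identity matrix
rand = np.random.random((3, 3))   # 3x3 floats drawn from [0, 1)

for label, arr in [("All Zeros Array", zeros),
                   ("All Ones Array", ones),
                   ("Constant Array", const),
                   ("Identity Matrix", identity),
                   ("Random Values Array", rand)]:
    print(label)
    print(arr)
```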
Index positions of a 2x2 array:
[0,0] [0,1]
[1,0] [1,1]
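import numpy as np
# Create a 2x2 array (values taken from the output below)
a = np.array([[5, 6], [7, 8]])
print(a)
print(a[0,0])
print(a[0,1])
print(a[1,0])
print(a[1,1])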
Output:
[[5 6]
[7 8]]
5
6
7
8
bool: Boolean type that takes True or False values (NumPy stores each value in one byte)
half / float16: half-precision float (sign bit, 5-bit exponent, 10-bit mantissa)
single (float): platform-defined single-precision float (typically sign bit, 8-bit exponent, 23-bit mantissa)
double: platform-defined double-precision float (typically sign bit, 11-bit exponent, 52-bit mantissa)
float complex: complex number represented by two single-precision floats (real and imaginary components)
double complex: complex number represented by two double-precision floats (real and imaginary components)
long double complex: complex number represented by two extended-precision floats (real and imaginary components)
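import numpy as np
# Convert values between types; the labels match the output below
print("float64(21) :", np.float64(21))
print("int8(21.0) :", np.int8(21.0))
print("bool(21) :", np.bool_(21))
print("bool(0) :", np.bool_(0))
print("bool(21.0) :", np.bool_(21.0))
print("single(True) :", np.single(True))
print("single(False):", np.single(False))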
Output:
float64(21) : 21.0
int8(21.0) : 21
bool(21) : True
bool(0) : False
bool(21.0) : True
single(True) : 1.0
single(False): 0.0
Create Array
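• Create an array with a specific data type using the dtype argument
Code:
import numpy as np
# Assumed reconstruction: the defining line is missing from the source; this call matches the output below
arr = np.arange(1, 11, dtype=float)
print(arr)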
Output:
[ 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.]
# Create an array
arr = np.arange(12)
# Reshape the 12-element array into 4 rows and 3 columns
new_arr = arr.reshape(4, 3)
print(new_arr)
Output:
[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]
# Reshape the same array into 3 rows and 4 columns
new_arr2 = arr.reshape(3, 4)
print(new_arr2)
Output:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
Output:
[[1 2 3]
[4 5 6]
[7 8 9]]
Output:
[1 2 3 4 5 6 7 8 9]
Output:
[[1 4 7]
[2 5 8]
[3 6 9]]
Output:
[[1 2 3 4 5 6 7 8 9]]
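The four outputs above are consistent with common shape operations on a 3x3 array holding 1 to 9. The sketch below is an assumption about which calls produce them: flatten(), ravel(), transpose(), and reshape() are standard NumPy methods, but the slides' own code for these steps is not shown.

```python
import numpy as np

arr = np.arange(1, 10).reshape(3, 3)   # 3x3 array holding 1..9
print(arr)

flat = arr.flatten()        # 1-D copy of the data
print(flat)                 # [1 2 3 4 5 6 7 8 9]

raveled = arr.ravel()       # 1-D result, a view when possible
print(raveled)              # [1 2 3 4 5 6 7 8 9]

transposed = arr.transpose()   # rows become columns
print(transposed)

row = arr.reshape(1, 9)     # one row, still two-dimensional
print(row)                  # [[1 2 3 4 5 6 7 8 9]]
```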
Stacking Arrays
• Stacking means joining arrays of the same dimensions along an axis
Horizontal Stacking
• Arrays of the same dimensions are joined along the horizontal axis using the hstack() and concatenate() functions
Vertical Stacking
• Arrays of the same dimensions are joined along the vertical axis using the vstack() and concatenate() functions
Depth Stacking
• Arrays of the same dimensions are joined along a third axis (depth) using the dstack() function
Column Stacking
• A sequence of one-dimensional arrays is stacked as columns into a single two-dimensional array using the column_stack() function
Horizontal Stacking
Code:
import numpy as np
# Create one 3x3 array
arr1 = np.arange(1,10).reshape(3,3)
print(arr1)
# Create a second 3x3 array
arr2 = 2*arr1
print(arr2)
# Perform horizontal stacking along the x axis
arr3 = np.hstack((arr1, arr2))
print(arr3)
Output:
[[1 2 3]        # 1st array
[4 5 6]
[7 8 9]]
[[ 2 4 6]        # 2nd array
[ 8 10 12]
[14 16 18]]
[[ 1 2 3 2 4 6]        # Horizontal stacking array
[ 4 5 6 8 10 12]
[ 7 8 9 14 16 18]]
Horizontal Stacking
Code:
# Horizontal stacking using concatenate() function
arr4 = np.concatenate((arr1, arr2), axis=1)
print(arr4)
Output:
[[ 1 2 3 2 4 6]        # Horizontal stacking array
[ 4 5 6 8 10 12]
[ 7 8 9 14 16 18]]
Vertical Stacking
Code:
# Vertical stacking: arrays of the same dimensions are joined along the vertical axis
arr5 = np.vstack((arr1, arr2))
print(arr5)
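A minimal sketch showing that vstack() and concatenate() with axis=0 give the same result. It assumes the same arr1 and arr2 as the horizontal stacking example (a 3x3 array of 1 to 9 and its double).

```python
import numpy as np

arr1 = np.arange(1, 10).reshape(3, 3)
arr2 = 2 * arr1

stacked = np.vstack((arr1, arr2))                     # rows of arr2 go below arr1
concatenated = np.concatenate((arr1, arr2), axis=0)   # same join along axis 0

print(stacked)
print(stacked.shape)                           # (6, 3)
print(np.array_equal(stacked, concatenated))   # True
```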
Depth Stacking
Code:
# Depth stacking: arrays of the same dimensions are joined along a third axis (depth)
arr7 = np.dstack((arr1, arr2))
print(arr7)
Output:
[[[ 1 2]
[ 2 4]
[ 3 6]]
[[ 4 8]
[ 5 10]
[ 6 12]]
[[ 7 14]
[ 8 16]
[ 9 18]]]
Column Stacking
Code:
# Create 1-D array
arr1 = np.arange(4,7)
print(arr1)
Output:
[4 5 6]
[8 10 12]
[[ 4 8]
[ 5 10]        # Two one-dimensional arrays stacked column-wise
[ 6 12]]
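The three outputs above can be reproduced with a sketch like the following. The second array and the name arr_col_stack are assumptions inferred from the printed values; column_stack() is the standard NumPy function for this operation.

```python
import numpy as np

arr1 = np.arange(4, 7)    # [4 5 6]
arr2 = 2 * arr1           # [ 8 10 12]

# Each 1-D array becomes one column of the 2-D result
arr_col_stack = np.column_stack((arr1, arr2))
print(arr_col_stack)
```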
Partitioning Arrays
• Arrays can be partitioned into multiple sub-arrays
• NumPy provides vertical, horizontal, and depth-wise split functions
• By default an array is split into equally sized sub-arrays, but the split locations can also be specified
• Horizontal splitting: the array is divided into N equal sub-arrays along the horizontal axis using the hsplit() function
• Vertical splitting: the array is divided into N equal sub-arrays along the vertical axis using the vsplit() and split() functions
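Splitting at a specified location, as mentioned above, can be sketched by passing a list of indices instead of a count. The array and the split positions here are illustrative choices.

```python
import numpy as np

arr = np.arange(1, 10).reshape(3, 3)

# A list of indices splits at those positions rather than into equal parts
left, right = np.hsplit(arr, [1])    # columns 0:1 and columns 1:3
top, bottom = np.vsplit(arr, [2])    # rows 0:2 and rows 2:3

print(left.shape, right.shape)   # (3, 1) (3, 2)
print(top.shape, bottom.shape)   # (2, 3) (1, 3)
```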
Horizontal Splitting
Code:
# Create an array
arr = np.arange(1,10).reshape(3,3)
print(arr)
# Perform horizontal splitting: divides the array into three sub-arrays,
# each of which is a column of the original array
arr_hor_split = np.hsplit(arr, 3)
print(arr_hor_split)
Output:
[[1 2 3]
[4 5 6]
[7 8 9]]
[array([[1],
[4],
[7]]),
array([[2],
[5],
[8]]),
array([[3],
[6],
[9]])]
Vertical Splitting
Code:
# Vertical split: divides the array into three sub-arrays,
# each of which is a row of the original array
arr_ver_split = np.vsplit(arr, 3)
print(arr_ver_split)
# split() with axis=0 performs the same operation as the vsplit() function
arr_split = np.split(arr,3,axis=0)
print(arr_split)
Output:
[array([[1, 2, 3]]), array([[4, 5, 6]]), array([[7, 8, 9]])]
[array([[1, 2, 3]]), array([[4, 5, 6]]), array([[7, 8, 9]])]
# print array
print("Float Array:", arr)
Output:
Integer Array: [[1 2 3]
[4 5 6]
[7 8 9]]
Float Array: [[1. 2. 3.]
[4. 5. 6.]
[7. 8. 9.]]
Changed Datatype: float64
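Output of this form can be produced with a sketch like the one below. It assumes the conversion is done with astype(), the standard NumPy method for changing an array's data type; the starting array is inferred from the printed values.

```python
import numpy as np

arr = np.arange(1, 10).reshape(3, 3)
print("Integer Array:", arr)

# astype() returns a new array with the requested data type
arr = arr.astype(float)
print("Float Array:", arr)
print("Changed Datatype:", arr.dtype)
```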
Views
• A view references the original base array and is treated as a shallow copy
• A view uses the same memory content as the original array
• Modifications in a view affect the original data
• Views use the concept of shared memory
Copies
• A copy is a separate object and is treated as a deep copy
• A copy stores the array in another memory location
• Modifications in a copy do not affect the original array
• Copies require extra space compared to views
• Copies are slower than views
Output:
[[1 2]
[3 4]]
Original Array : 2231251409616
Assignment : 2231251409616
Deep Copy : 2232047625840
Shallow Copy(View): 2232047625552
The original array and the assigned array have the same object ID, meaning both point to the same object
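A minimal sketch of the difference between assignment, a view, and a copy. The variable names are illustrative, and the id() values will differ from the ones printed above on every run.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])
assigned = arr        # same object, no new array is created
view = arr.view()     # new array object that shares arr's memory
copy = arr.copy()     # new array object with its own memory

print("Original Array    :", id(arr))
print("Assignment        :", id(assigned))
print("Deep Copy         :", id(copy))
print("Shallow Copy(View):", id(view))

# Writing through the view changes the original; writing to the copy does not
view[0, 0] = 99
copy[1, 1] = -1
print(arr)   # [[99  2] [ 3  4]]
```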
Slicing Arrays
• Slicing in NumPy is similar to slicing Python lists
• Indexing is used to select a single value
• Slicing is used to select multiple values from an array
• NumPy arrays also support negative indexing and slicing
• The negative sign indicates the opposite direction: indexing starts from the right-hand side with a starting value of -1
Negative index: -10 -9 -8 -7 -6 -5 -4 -3 -2 -1
Positive index: 0 1 2 3 4 5 6 7 8 9
Array value: 0 1 2 3 4 5 6 7 8 9
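Selecting single values with positive and negative indices, as laid out above, can be sketched as follows. The array matches the ten-element example used in the slicing code.

```python
import numpy as np

arr = np.arange(10)

print(arr[3])       # 3, fourth element from the left
print(arr[-1])      # 9, first element from the right
print(arr[-10])     # 0, same element as arr[0]
print(arr[-3:-1])   # [7 8], negative indices also work in slices
```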
Slicing Arrays
Code:
# Create NumPy Array
arr = np.arange(10)
print(arr)
print(arr[3:6])
Output:
[0 1 2 3 4 5 6 7 8 9]
[3 4 5]
Slicing Arrays
Code:
# Create NumPy Array
arr = np.arange(10)
print(arr)
print(arr[3:])
Output:
[0 1 2 3 4 5 6 7 8 9]
[3 4 5 6 7 8 9]
Slicing Arrays
Code:
# Create NumPy Array
arr = np.arange(10)
print(arr)
# Select values from the third value from the right side of the array to the end
print(arr[-3:])
Output:
[0 1 2 3 4 5 6 7 8 9]
[7 8 9]
Slicing Arrays
Code:
# Create NumPy Array
arr = np.arange(10)
print(arr)
print(arr[2:7:2])
Output:
[0 1 2 3 4 5 6 7 8 9]
[2 4 6]
Broadcasting Arrays
• Python lists do not support direct vectorized arithmetic operations
• NumPy offers faster vectorized array operations compared to loop-based operations on Python lists
• All the looping operations are performed in C instead of Python, which makes them faster
• Broadcasting checks a set of rules for applying binary functions, such as addition, subtraction, and multiplication, to arrays of different shapes
Broadcasting Arrays
Code:
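import numpy as np
# Create two 2x2 arrays (values inferred from the printed sums and products)
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
# Add two matrices: the addition of two arrays of the same size
print(arr1+arr2)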
Output:
[[ 6 8]
[10 12]]
Broadcasting Arrays
Code:
# Multiply two matrices
print(arr1*arr2)
Output:
[[ 5 12]
[21 32]]
[[4 5]
[6 7]]
[[ 3 6]
[ 9 12]]
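The last two arrays in the output are consistent with adding the scalar 3 to arr1 and multiplying arr1 by 3, which is broadcasting of a scalar across a 2x2 array. The sketch below assumes arr1 holds 1 to 4, as the printed sums and products imply; the 1-D row example is an added illustration.

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])

# A scalar is broadcast to every element
added = arr1 + 3
scaled = arr1 * 3
print(added)    # [[4 5] [6 7]]
print(scaled)   # [[ 3  6] [ 9 12]]

# A 1-D array of length 2 is broadcast across each row
row = np.array([10, 20])
shifted = arr1 + row
print(shifted)  # [[11 22] [13 24]]
```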
Pandas DataFrames
• The pandas library is designed to work with panel or tabular data
• Fast, highly efficient, and productive tool for manipulating and analyzing string, numeric,
datetime, and time-series data
• Provides data structures such as DataFrames and Series
import pandas as pd
df = pd.DataFrame()  # an empty DataFrame, as the output indicates
# Header of dataframe.
df.head()
Output:
An empty DataFrame
Create DataFrame
Code:
import pandas as pd
# Create a DataFrame from a dictionary of lists
data = {'Name': ['Vijay', 'Sundar', 'Satyam', 'Indira'], 'Age': [23, 45, 46, 52]}
df = pd.DataFrame(data)
# Header of dataframe.
df.head()
Create DataFrame
Code:
import pandas as pd
# Pandas DataFrame from a list of dicts.
# Initialise data as a list of dicts, one dict per row.
data = [{'Name': 'Vijay', 'Age': 23}, {'Name': 'Sundar', 'Age': 25}, {'Name': 'Shankar', 'Age': 26}]
df = pd.DataFrame(data)
df.head()
Pandas Series
• Pandas Series is a one-dimensional sequential data structure
• Able to handle any type of data, such as string, numeric, datetime, Python lists, and
dictionaries with labels and indexes
• Series is one of the columns of a DataFrame
• Create a Series using a Python dictionary, NumPy array, and scalar value
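The three ways of creating a Series listed above can be sketched as follows. The values are borrowed from the outputs shown in this section; the variable names are illustrative, and the integer dtype of the array-based Series depends on the platform.

```python
import numpy as np
import pandas as pd

# From a Python dictionary: keys become the index labels
from_dict = pd.Series({0: 'Ajay', 1: 'Jay', 2: 'Vijay'})

# From a NumPy array: a default integer index 0..n-1 is generated
from_array = pd.Series(np.array([51, 65, 48, 59, 68]))

# From a scalar: the value is repeated for every index label
from_scalar = pd.Series(10, index=[0, 1, 2, 3, 4, 5])

print(from_dict)
print(from_array)
print(from_scalar)
```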
# Show series
series
Output:
0 Ajay
1 Jay
2 Vijay
dtype: object
Output:
0 51
1 65
2 48
3 59
4 68
dtype: int32
Output:
0 10
1 10
2 10
3 10
4 10
5 10
dtype: int64
Pandas Series
• The pandas Series data structure shares some of the common attributes of DataFrames and
also has a name attribute
Code:
# Show the shape of DataFrame
print("Shape:", df.shape)
Output:
Shape: (202, 9)
Pandas Series
• Check the column list of a DataFrame:
Code:
# Check the column list of DataFrame
print("List of Columns:", df.columns)
Output:
List of Columns: Index(['Country', 'CountryID', 'Continent', 'Adolescent fertility rate (%)',
'Adult literacy rate (%)',
'Gross national income per capita (PPP international $)',
'Net primary school enrolment ratio female (%)',
'Net primary school enrolment ratio male (%)',
'Population (in thousands) total'],
dtype='object')
Pandas Series
• Check the data types of DataFrame columns
Code:
# Show the datatypes of columns
print("Data types:", df.dtypes)
Output:
Data types: Country object
CountryID int64
Continent int64
Adolescent fertility rate (%) float64
Adult literacy rate (%) float64
Gross national income per capita (PPP international $) float64
Net primary school enrolment ratio female (%) float64
Net primary school enrolment ratio male (%) float64
Population (in thousands) total float64
dtype: object
# Select a series
country_series = df['Country']
# Show the last five rows
country_series.tail()
Output:
197 Vietnam
198 West Bank and Gaza
199 Yemen
200 Zambia
201 Zimbabwe
Name: Country, dtype: object
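The attributes used in this part can be tried on a small self-contained example. The DataFrame below is hypothetical: the country names are borrowed from the output above, while the 'Value' column and its numbers are arbitrary placeholders, not data from the original dataset.

```python
import pandas as pd

# Hypothetical stand-in for the dataset used in the slides
df = pd.DataFrame({
    'Country': ['Yemen', 'Zambia', 'Zimbabwe'],
    'Value': [1.0, 2.0, 3.0],   # placeholder numbers
})

print("Shape:", df.shape)                     # Shape: (3, 2)
print("List of Columns:", list(df.columns))
print("Data types:", df.dtypes)

# Selecting one column returns a Series that carries the column name
country_series = df['Country']
print(country_series.name)                    # Country
print(country_series.tail(2))
```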
Output:
DatetimeIndex(['2023-09-01', '2023-09-02', '2023-09-03', '2023-09-04',
'2023-09-05', '2023-09-06', '2023-09-07', '2023-09-08',
'2023-09-09', '2023-09-10', '2023-09-11', '2023-09-12',
'2023-09-13', '2023-09-14', '2023-09-15', '2023-09-16',
'2023-09-17', '2023-09-18', '2023-09-19', '2023-09-20',
'2023-09-21', '2023-09-22', '2023-09-23', '2023-09-24',
'2023-09-25', '2023-09-26', '2023-09-27', '2023-09-28',
'2023-09-29', '2023-09-30', '2023-10-01', '2023-10-02',
'2023-10-03', '2023-10-04', '2023-10-05', '2023-10-06',
'2023-10-07', '2023-10-08', '2023-10-09', '2023-10-10',
'2023-10-11', '2023-10-12', '2023-10-13', '2023-10-14',
'2023-10-15'],
dtype='datetime64[ns]', freq='D')
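A daily index like the one above can be generated with pd.date_range(). The call below is a sketch under the assumption that a start date and a number of periods were used; passing a start and an end date would give the same 45 dates.

```python
import pandas as pd

# 45 consecutive days starting on 2023-09-01
dates = pd.date_range('2023-09-01', periods=45, freq='D')

print(dates)
print(len(dates))   # 45
```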
Pandas Documentation:
https://pandas.pydata.org/docs/user_guide/timeseries.html
Converting Dates
Code:
# Convert argument to datetime
pd.to_datetime('1/1/1970')
Output:
Timestamp('1970-01-01 00:00:00')
Converting Dates
Code:
# Convert argument to datetime in specified format
pd.to_datetime(['20200101', '20200102'], format='%Y%m%d')
Output:
DatetimeIndex(['2020-01-01', '2020-01-02'], dtype='datetime64[ns]', freq=None)
Output:
ParserError: Unknown string format: not a date
ParserError: Unknown string format: not a date present at position 1
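Errors like these arise when to_datetime() meets a string it cannot parse. The sketch below uses an illustrative input list, since the original call is not shown. The exact exception class and message vary between pandas versions, so the example catches ValueError, which the parsing errors derive from. The errors='coerce' option is a documented alternative that yields NaT instead of raising.

```python
import pandas as pd

values = ['2020-01-01', 'not a date']   # illustrative input

# Default behaviour (errors='raise'): an unparseable string raises an error
try:
    pd.to_datetime(values)
    raised = False
except ValueError as exc:
    raised = True
    print(type(exc).__name__, ":", exc)

# errors='coerce' replaces unparseable entries with NaT
coerced = pd.to_datetime(values, errors='coerce')
print(coerced)
```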
Thank You