Python for Data Science (PDS)
Unit-02
Capturing, Preparing
and Working with data
Outline
Looping
Basic File IO in Python
NumPy V/S Pandas (what to use?)
NumPy
Pandas
www.kaggle.com
From part1 .. Why Python?
A huge number of additional open-source libraries
Some of these libraries are listed below.
NumPy for scientific computing
pandas for performing data analysis
SciPy for engineering applications, science, and mathematics
scikit-learn for machine learning
matplotlib for plotting charts and graphs
BeautifulSoup for parsing HTML and XML
Django for server-side web development
And many more..
Unit 02 – Overview of Python and Data Analysis 3
Basic IO operations in Python
Before we can read or write a file, we have to open it using Python's built-in open() function.
syntax
fileobject = open(filename , accessmode)
filename is the name of the file we want to open.
accessmode determines the mode in which the file is opened (the possible values are listed below).
Mode  Description
r     Read only (default)
rb    Read only in binary format
r+    Read and write both
rb+   Read and write both in binary format
w     Write only (creates the file if it does not exist)
wb    Write only in binary format (creates the file if it does not exist)
w+    Read and write both (creates the file if it does not exist)
wb+   Read and write both in binary format (creates the file if it does not exist)
a     Opens the file to append; if the file does not exist, creates it for writing
ab    Append in binary format; if the file does not exist, creates it for writing
a+    Append; if the file does not exist, creates it for reading and writing both
ab+   Append with reading and writing both, in binary format
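The difference between text and binary modes can be seen in a minimal sketch like the one below. The file name modes_demo.txt and the temporary directory are assumptions made only for this illustration.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory (not part of the slides).
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(path, 'w') as f:       # 'w' creates the file because it does not exist yet
    f.write('abc')

with open(path, 'r') as f:       # text mode: read() returns a str
    text_data = f.read()

with open(path, 'rb') as f:      # binary mode: read() returns bytes
    binary_data = f.read()

print(type(text_data), type(binary_data))
```

Text modes decode the file contents into strings, while the b variants hand back the raw bytes unchanged.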
Example : Read file in Python
read(size) reads up to size characters from the file (bytes in binary mode); if we don't specify size, it returns the whole file.
readfile.py
    f = open('demofile.txt')
    data = f.read()
    print(data)
demofile.txt
    Hello! Welcome to demofile.txt
    This file is for testing purposes.
    Good Luck!
readlines() method returns a list of the lines in the file.
readlines.py
    f = open('demofile.txt')
    lines = f.readlines()
    print(lines)
OUTPUT
    ['Hello! Welcome to demofile.txt\n', 'This file is for testing purposes.\n', 'Good Luck!']
We can use a for loop to get each line separately. Each line keeps its trailing '\n' and print() adds one more, so a blank line appears between the lines.
readlinesfor.py
    f = open('demofile.txt')
    lines = f.readlines()
    for l in lines :
        print(l)
OUTPUT
    Hello! Welcome to demofile.txt

    This file is for testing purposes.

    Good Luck!
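As a small sketch of the reading methods above, the example below reads a fixed number of characters with read(size) and then strips the trailing newline from each line. It writes its own copy of demofile.txt into a temporary directory, which is an assumption made so that the example is self-contained.

```python
import os
import tempfile

# Recreate the demo file in a temporary directory (assumed location, for illustration).
path = os.path.join(tempfile.mkdtemp(), 'demofile.txt')
with open(path, 'w') as f:
    f.write('Hello! Welcome to demofile.txt\n'
            'This file is for testing purposes.\n'
            'Good Luck!')

with open(path) as f:
    first_five = f.read(5)       # only the first 5 characters
    rest = f.read()              # everything after those 5 characters

with open(path) as f:
    # Iterating over the file object yields one line at a time;
    # rstrip('\n') removes the trailing newline kept by each line.
    clean_lines = [line.rstrip('\n') for line in f]

print(first_five)
for line in clean_lines:
    print(line)
```

Because the newlines are stripped before printing, no blank lines appear between the printed lines.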
How to write path?
We can pass a relative path as the argument to open(); alternatively, we can pass an absolute path.
To specify an absolute path,
In Windows, f = open('D:\\folder\\subfolder\\filename.txt')
In macOS & Linux, f = open('/user/folder/subfolder/filename.txt')
We are supposed to close the file once we are done using it, by calling its close() method.
closefile.py
    f = open('demofile.txt')
    data = f.read()
    print(data)
    f.close()
fileusingwith.py
    with open('demofile.txt') as f :
        data = f.read()
        print(data)
When we open a file using with, we need not close it ourselves; it is closed automatically when the with block ends.
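The closing behaviour of with can be checked through the closed attribute of the file object, as in the sketch below. The temporary directory and the sample text are assumptions for illustration; the last two lines also show that a raw string is an alternative way to write a Windows path without doubling the backslashes.

```python
import os
import tempfile

# Hypothetical demo file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'demofile.txt')
with open(path, 'w') as f:
    f.write('Good Luck!')

with open(path) as f:
    data = f.read()
    closed_inside = f.closed     # still open inside the with block

closed_after = f.closed          # closed automatically after the block

# Two spellings of the same Windows path string (no file is opened here).
escaped = 'D:\\folder\\subfolder\\filename.txt'
raw = r'D:\folder\subfolder\filename.txt'

print(closed_inside, closed_after, escaped == raw)
```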
Example : Write file in Python
write() method writes the specified data to the file.
readdemo.py
    with open('demofile.txt','a') as f :
        f.write('Hello world')
If we open the file in 'w' mode, it overwrites the existing contents of the file, or creates a new file if the file does not exist.
If we open the file in 'a' mode, it appends the data at the end of the existing file, or creates a new file if the file does not exist.
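The difference between the two modes can be sketched as follows. The file lives in a temporary directory and the strings written are arbitrary; both are assumptions for the demonstration.

```python
import os
import tempfile

# Hypothetical demo file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'demofile.txt')

with open(path, 'w') as f:       # file does not exist yet, so it is created
    f.write('first')

with open(path, 'w') as f:       # 'w' discards the old contents
    f.write('second')

with open(path) as f:
    after_w = f.read()

with open(path, 'a') as f:       # 'a' keeps the old contents and adds to the end
    f.write(' third')

with open(path) as f:
    after_a = f.read()

print(after_w)
print(after_a)
```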
Reading CSV files without any library functions
A comma-separated values file is a delimited text file that uses a comma to separate values.
Each line of the file is a data record, and each record consists of one or more fields separated by commas.
Example : Book1.csv
    studentname,enrollment,cpi
    abcd,123456,8.5
    bcde,456789,2.5
    cdef,321654,7.6
readlines.py
    with open('Book1.csv') as f :
        rows = f.readlines()
        isFirstLine = True
        for r in rows :
            if isFirstLine :
                # skip the header row
                isFirstLine = False
                continue
            cols = r.split(',')
            print('Student Name = ', cols[0], end=" ")
            print('\tEn. No. = ', cols[1], end=" ")
            print('\tCPI = \t', cols[2])
We can use Microsoft Excel to access CSV files.
In the later sessions we will access CSV files using different libraries, but we can also access CSV files without any libraries. (Not recommended)
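A slightly tidier variant of the manual approach is sketched below: it reads the header with readline(), strips the trailing newline from every row, and pairs each value with its column name. The temporary copy of Book1.csv and the names records and average_cpi are assumptions introduced only for this example.

```python
import os
import tempfile

# Recreate Book1.csv in a temporary directory so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'Book1.csv')
with open(path, 'w') as f:
    f.write('studentname,enrollment,cpi\n'
            'abcd,123456,8.5\n'
            'bcde,456789,2.5\n'
            'cdef,321654,7.6\n')

records = []
with open(path) as f:
    header = f.readline().strip().split(',')    # first line holds the column names
    for line in f:                               # remaining lines are data records
        cols = line.strip().split(',')           # strip() drops the trailing newline
        records.append(dict(zip(header, cols)))

# All fields are read as strings, so numeric columns must be converted explicitly.
average_cpi = sum(float(r['cpi']) for r in records) / len(records)

print(records[0])
print(average_cpi)
```

A plain split(',') breaks as soon as a field itself contains a comma inside quotes, which is one reason the manual approach is not recommended.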
NumPy v/s Pandas
Developers built pandas on top of NumPy,
as a result every task we perform using pandas also goes through NumPy.
To obtain the benefits of pandas, we pay a performance penalty: some testers report that pandas can be up to 100 times slower than NumPy for a similar task.
Nowadays computer hardware is powerful enough to absorb this cost in most cases,
but when speed of execution is essential, NumPy is usually the better choice.
We use pandas to make writing code easier and faster, and to reduce potential coding errors.
Pandas provides rich time-series functionality, data alignment, NA-friendly statistics, groupby, merge, and similar methods; if we use NumPy, we have to implement all of these manually.
So,
if we want performance we should use NumPy,
if we want ease of coding we should use pandas.
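One of the conveniences listed above, NA-friendly statistics, can be illustrated with a short sketch. It assumes that both numpy and pandas are installed; the sample values are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Made-up CPI values with one missing entry.
values = [8.5, np.nan, 7.6]

numpy_mean = np.array(values).mean()       # NaN propagates through the plain NumPy mean
pandas_mean = pd.Series(values).mean()     # pandas skips missing values by default
manual_mean = np.nanmean(np.array(values)) # in NumPy we must ask for NaN handling explicitly

print(numpy_mean, pandas_mean, manual_mean)
```

The pandas call gives the mean of the two available values without any extra code, whereas with NumPy the missing value has to be handled by the programmer.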
1- NumPy
NumPy (Numerical Python) is a Python library for working with arrays.
Many scientific and data-analysis libraries in Python rely on NumPy as one of their main building blocks.
NumPy provides functions for domains such as linear algebra and Fourier transforms.
NumPy is fast because its core operations run in compiled C code.
Install :
conda install numpy
pip install numpy
NumPy Array
The most important object defined in NumPy is an N-dimensional array type called ndarray.
It describes a collection of items of the same type. Items in the collection can be accessed using a zero-based index.
An instance of the ndarray class can be constructed in many different ways; a basic ndarray can be created as below.
syntax
    import numpy as np
    a = np.array(list | tuple)
numpyarray.py
    import numpy as np
    a = np.array(['Andalus','Institute','Sanaa'])
    print(type(a))
    print(a)
Output
    <class 'numpy.ndarray'>
    ['Andalus' 'Institute' 'Sanaa']
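Because every item in an ndarray has the same type, NumPy converts mixed input to one common type. The sketch below, with made-up values, shows this through the dtype attribute.

```python
import numpy as np

ints = np.array([1, 2, 3])          # all integers: an integer dtype is chosen
mixed = np.array([1, 2.5, 3])       # int and float mixed: everything becomes float
texts = np.array([1, 'a'])          # number and string mixed: everything becomes a string

print(ints.dtype, mixed.dtype, texts.dtype)
print(mixed)
print(texts)
```

The exact integer dtype depends on the platform, but the array always ends up with a single dtype for all of its items.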
NumPy Array (Cont.)
arange(start,end,step) function creates a NumPy array starting from start up to end (not included) with the specified step.
numpyarange.py
    import numpy as np
    b = np.arange(0,10,1)
    print(b)
Output
    [0 1 2 3 4 5 6 7 8 9]
zeros(shape) function returns a NumPy array of the given shape, filled with zeros.
numpyzeros.py
    import numpy as np
    c = np.zeros(3)
    print(c)
    c1 = np.zeros((3,3)) # shape has to be given as a tuple
    print(c1)
Output
    [0. 0. 0.]
    [[0. 0. 0.]
     [0. 0. 0.]
     [0. 0. 0.]]
ones(shape) function returns a NumPy array of the given shape, filled with ones.
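A minimal sketch of ones, mirroring the zeros example; the variable names d and d1 are chosen only for this illustration.

```python
import numpy as np

d = np.ones(3)            # one-dimensional array of three ones
d1 = np.ones((2, 3))      # two-dimensional shape has to be given as a tuple

print(d)
print(d1)
print(d1.shape, d1.dtype)
```

As with zeros, the values are floating-point numbers by default.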
NumPy Array (Cont.)
eye(n) function creates a 2-D NumPy array with ones on the diagonal and zeros elsewhere.
numpyeye.py
    import numpy as np
    b = np.eye(3)
    print(b)
Output
    [[1. 0. 0.]
     [0. 1. 0.]
     [0. 0. 1.]]
linspace(start,stop,num) function returns num evenly spaced numbers over the specified interval, with stop included.
numpylinspace.py
    import numpy as np
    c = np.linspace(0,1,11)
    print(c)
Output
    [0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
Note: in the arange function we give start, stop & step, whereas in the linspace function we give start, stop & the number of elements we want.
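The contrast in the note can be sketched side by side on the same interval; the interval and the variable names are assumptions for illustration.

```python
import numpy as np

by_step = np.arange(0, 1, 0.25)     # step of 0.25, stop value 1 is NOT included
by_count = np.linspace(0, 1, 5)     # exactly 5 values, stop value 1 IS included

print(by_step)      # 4 values: 0, 0.25, 0.5, 0.75
print(by_count)     # 5 values: 0, 0.25, 0.5, 0.75, 1
```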
Array Shape in NumPy
We can get the shape of an ndarray using its shape property.
numpyarange.py
    import numpy as np
    b = np.zeros((3,3))
    print(b.shape)
Output
    (3, 3)
We can also reshape the array using the reshape method of ndarray.
numpyarange.py
    import numpy as np
    re1 = np.random.randint(1,100,10)
    re2 = re1.reshape(5,2)
    print(re2)
Output (values are random and differ on each run)
    [[29 55]
     [44 50]
     [25 53]
     [59  6]
     [93  7]]
Note: the number of elements in the original array must equal rows * cols of the new shape.
Example : here we have an old one-dimensional array of 10 elements and the new shape is (5,2),
so 5 * 2 = 10, which means it is a valid reshape.
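The element-count rule can be checked with a short sketch. It uses arange instead of random values so that the result is repeatable; the use of -1 to let NumPy infer one dimension is an addition not covered on the slide.

```python
import numpy as np

values = np.arange(10)              # 10 elements: 0 .. 9
grid = values.reshape(5, 2)         # valid: 5 * 2 == 10
auto = values.reshape(-1, 5)        # -1 asks NumPy to infer the missing dimension (2 here)

try:
    values.reshape(3, 4)            # invalid: 3 * 4 == 12, not 10
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(grid.shape, auto.shape, reshape_failed)
```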
NumPy Random
rand(p1,p2,...,pn) function creates an n-dimensional array of the given shape filled with random floats drawn from a uniform distribution over [0, 1); if we do not specify any parameter, it returns a single random float.
numpyrand.py
    import numpy as np
    r1 = np.random.rand()
    print(r1)
    r2 = np.random.rand(3,2) # dimensions are passed directly, not as a tuple
    print(r2)
Output (values are random and differ on each run)
    0.23937253208490505
    [[0.58924723 0.09677878]
     [0.97945337 0.76537675]
     [0.73097381 0.51277276]]
randint(low,high,num) function creates a one-dimensional array of num random integers from low (included) up to high (not included).
numpyrandint.py
    import numpy as np
    r3 = np.random.randint(1,100,10)
    print(r3)
Output (values are random and differ on each run)
    [78 78 17 98 19 26 81 67 23 24]
We can reshape the array into any compatible shape using the reshape method, which we learned in the previous slide.
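Since the output changes on every run, it can help to fix the seed while experimenting. The sketch below uses np.random.seed, which is not covered on the slide; the seed value 42 is arbitrary.

```python
import numpy as np

np.random.seed(42)                          # fix the seed so the sequence is repeatable
first = np.random.randint(1, 100, 10)

np.random.seed(42)                          # same seed again: same numbers again
second = np.random.randint(1, 100, 10)

floats = np.random.rand(3, 2)               # uniform floats in [0, 1)

print(np.array_equal(first, second))
print(first.min() >= 1, first.max() < 100)
print(floats.shape)
```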
Aggregations
min() function returns the minimum value of the ndarray. There are two ways to call it, as a method of the array or as a NumPy function; examples of both are given below.
numpymin.py
    import numpy as np
    l = [1,5,3,8,2,3,6,7,5,2,9,11,2,5,3,4,8,9,3,1,9,3]
    a = np.array(l)
    print('Min way1 = ',a.min())
    print('Min way2 = ',np.min(a))
Output
    Min way1 = 1
    Min way2 = 1
max() function returns the maximum value of the ndarray. The same two ways work for max; examples of both are given below.
numpymax.py
    import numpy as np
    l = [1,5,3,8,2,3,6,7,5,2,9,11,2,5,3,4,8,9,3,1,9,3]
    a = np.array(l)
    print('Max way1 = ',a.max())
    print('Max way2 = ',np.max(a))
Output
    Max way1 = 11
    Max way2 = 11
Aggregations (Cont.)
NumPy supports many aggregation functions such as min, max, argmin, argmax, sum, mean, std, etc. argmin and argmax return the index of the first minimum and maximum value, respectively.
numpymin.py
    import numpy as np
    l = [7,5,3,1,8,2,3,6,11,5,2,9,10,2,5,3,7,8,9,3,1,9,3]
    a = np.array(l)
    print('Min = ',a.min())
    print('ArgMin = ',a.argmin())
    print('Max = ',a.max())
    print('ArgMax = ',a.argmax())
    print('Sum = ',a.sum())
    print('Mean = ',a.mean())
    print('Std = ',a.std())
Output
    Min = 1
    ArgMin = 3
    Max = 11
    ArgMax = 8
    Sum = 122
    Mean = 5.304347826086956
    Std = 3.042235771223635
Using axis argument with aggregate functions
When we apply an aggregate function to a multidimensional ndarray without an axis argument, it aggregates over all elements of the array, across every dimension (axis).
numpyaxis.py
    import numpy as np
    array2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
    print('sum = ',array2d.sum())
Output
    sum = 45
If we want the sum of each row or each column, we can use the axis argument with the aggregate functions.
numpyaxis.py
    import numpy as np
    array2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
    print('sum (cols)= ',array2d.sum(axis=0)) # vertical: one sum per column
    print('sum (rows)= ',array2d.sum(axis=1)) # horizontal: one sum per row
Output
    sum (cols) = [12 15 18]
    sum (rows) = [6 15 24]
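The axis argument works the same way with the other aggregate functions. A short sketch on the same 3 x 3 array, with variable names chosen only for this example:

```python
import numpy as np

array2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

col_max = array2d.max(axis=0)       # maximum of each column
row_min = array2d.min(axis=1)       # minimum of each row
row_mean = array2d.mean(axis=1)     # mean of each row

print(col_max)      # [7 8 9]
print(row_min)      # [1 4 7]
print(row_mean)     # [2. 5. 8.]
```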
Single V/S Double bracket notations
There are two ways to access an element of a multi-dimensional array; an example of both methods is given below.
numpybrackets.py
    import numpy as np
    arr = np.array([['a','b','c'],['d','e','f'],['g','h','i']])
    print('double = ',arr[2][1]) # double bracket notation
    print('single = ',arr[2,1]) # single bracket notation
Output
    double = h
    single = h
Both methods are valid and give exactly the same answer here, but single bracket notation is recommended, because double bracket notation first creates a temporary sub-array of the third row and then fetches the second column from it.
Single bracket notation is also easier to read and write while programming.
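The two notations agree when both indices are integers, but they can differ once a slice is involved, because the second bracket is applied to the result of the first. A small sketch on the same array, with hypothetical variable names:

```python
import numpy as np

arr = np.array([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])

# Double brackets: first take rows 0..1, then take row 1 OF THAT RESULT.
chained = arr[0:2][1]

# Single bracket: rows 0..1 and column 1, selected in one step.
combined = arr[0:2, 1]

print(chained)      # ['d' 'e' 'f']
print(combined)     # ['b' 'e']
```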
Slicing ndarray
Slicing in python means taking elements from one given index to another given index.
Similar to Python List, we can use same syntax array[start:end:step] to slice ndarray.
Default start is 0
Default end is length of the array
Default step is 1
numpyslice1d.py
    import numpy as np
    arr = np.array(['a','b','c','d','e','f','g','h'])
    print(arr[2:5])
    print(arr[:5])
    print(arr[5:])
    print(arr[2:7:2])
    print(arr[::-1])
Output
    ['c' 'd' 'e']
    ['a' 'b' 'c' 'd' 'e']
    ['f' 'g' 'h']
    ['c' 'e' 'g']
    ['h' 'g' 'f' 'e' 'd' 'c' 'b' 'a']
Array Slicing Example
Example :
          C-0  C-1  C-2  C-3  C-4
    R-0     1    2    3    4    5
    R-1     6    7    8    9   10
a = R-2    11   12   13   14   15
    R-3    16   17   18   19   20
    R-4    21   22   23   24   25

a[2][3] =
a[2,3] =
a[2] =
a[0:2] =
a[0:2:2] =
a[::-1] =
a[1:3,1:3] =
a[3:,:3] =
a[:,::-1] =
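The expressions above can be checked by building the same 5 x 5 array and evaluating each one. The sketch below assumes the array is created with arange and reshape; the dictionary named results is introduced only to print every expression next to its value.

```python
import numpy as np

a = np.arange(1, 26).reshape(5, 5)      # rows R-0..R-4 hold 1..5, 6..10, ..., 21..25

results = {
    'a[2][3]':    a[2][3],              # row 2, then element 3
    'a[2,3]':     a[2, 3],              # same element in one step
    'a[2]':       a[2],                 # whole row R-2
    'a[0:2]':     a[0:2],               # rows R-0 and R-1
    'a[0:2:2]':   a[0:2:2],             # rows 0..1 with step 2: only R-0, still 2-D
    'a[::-1]':    a[::-1],              # rows in reverse order
    'a[1:3,1:3]': a[1:3, 1:3],          # rows 1..2 and columns 1..2
    'a[3:,:3]':   a[3:, :3],            # rows 3..4 and columns 0..2
    'a[:,::-1]':  a[:, ::-1],           # every row with its columns reversed
}

for expression, value in results.items():
    print(expression, '=')
    print(value)
```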
NumPy Arithmetic Operations
numpyop.py
    import numpy as np
    arr1 = np.array([[1,2,3],[1,2,3],[1,2,3]])
    arr2 = np.array([[4,5,6],[4,5,6],[4,5,6]])

    arradd1 = arr1 + 2 # addition of matrix with scalar
    arradd2 = arr1 + arr2 # addition of two matrices
    print('Addition Scalar = ', arradd1)
    print('Addition Matrix = ', arradd2)

    arrsub1 = arr1 - 2 # subtraction of scalar from matrix
    arrsub2 = arr1 - arr2 # subtraction of two matrices
    print('Subtraction Scalar = ', arrsub1)
    print('Subtraction Matrix = ', arrsub2)
    arrdiv1 = arr1 / 2 # division of matrix by scalar
    arrdiv2 = arr1 / arr2 # element-wise division of two matrices
    print('Division Scalar = ', arrdiv1)
    print('Division Matrix = ', arrdiv2)
Output
    Addition Scalar = [[3 4 5]
     [3 4 5]
     [3 4 5]]
    Addition Matrix = [[5 7 9]
     [5 7 9]
     [5 7 9]]
    Subtraction Scalar = [[-1 0 1]
     [-1 0 1]
     [-1 0 1]]
    Subtraction Matrix = [[-3 -3 -3]
     [-3 -3 -3]
     [-3 -3 -3]]
    Division Scalar = [[0.5 1. 1.5]
     [0.5 1. 1.5]
     [0.5 1. 1.5]]
    Division Matrix = [[0.25 0.4 0.5 ]
     [0.25 0.4 0.5 ]
     [0.25 0.4 0.5 ]]
NumPy Arithmetic Operations (Cont.)
numpyop.py
    import numpy as np
    # same arrays as in the previous example
    arr1 = np.array([[1,2,3],[1,2,3],[1,2,3]])
    arr2 = np.array([[4,5,6],[4,5,6],[4,5,6]])
    arrmul1 = arr1 * 2 # multiply matrix with scalar
    arrmul2 = arr1 * arr2 # element-wise multiplication of two matrices
    print('Multiply Scalar = ', arrmul1)
    # Note : * is element-wise, it is not matrix multiplication
    print('Multiply Matrix = ', arrmul2)
    # In order to do matrix multiplication
    arrmatmul = np.matmul(arr1,arr2)
    print('Matrix Multiplication = ',arrmatmul)
    # OR
    arrdot = arr1.dot(arr2)
    print('Dot = ',arrdot)
    # OR
    arrpy3dot5plus = arr1 @ arr2
    print('Python 3.5+ support = ',arrpy3dot5plus)
Output
    Multiply Scalar = [[2 4 6]
     [2 4 6]
     [2 4 6]]
    Multiply Matrix = [[ 4 10 18]
     [ 4 10 18]
     [ 4 10 18]]
    Matrix Multiplication = [[24 30 36]
     [24 30 36]
     [24 30 36]]
    Dot = [[24 30 36]
     [24 30 36]
     [24 30 36]]
    Python 3.5+ support = [[24 30 36]
     [24 30 36]
     [24 30 36]]
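The difference between element-wise and matrix multiplication is easier to see with non-square arrays, where the two operations do not even accept the same shapes. The arrays in this sketch are made up for the illustration.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])        # shape (2, 3)
n = np.array([[1, 2], [3, 4], [5, 6]])      # shape (3, 2)

product = m @ n                              # matrix multiplication: (2, 3) @ (3, 2) -> (2, 2)

try:
    m * n                                    # element-wise needs compatible shapes
    elementwise_failed = False
except ValueError:
    elementwise_failed = True

print(product)
print(elementwise_failed)
```

For matrix multiplication, the number of columns of the first array must match the number of rows of the second; element-wise multiplication instead requires shapes that match or can be broadcast together.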
Outline
Looping (Pandas)
Series
Data Frames
Accessing text, CSV, Excel files using pandas
Accessing SQL Database
Missing Data
Group By
Merging, Joining & Concatenating
Operations
2- Pandas
Pandas is an open source library built on top of NumPy.
It allows for fast data cleaning, preparation and analysis.
It excels in performance and productivity.
It also has built-in visualization features.
It can work with the data from wide variety of sources.
Install :
conda install pandas
OR pip install pandas