PYTHON LIBRARIES: NUMPY AND PANDAS
American University of Sharjah
Prepared by Dr Tamer Shanableh, CSE
Material mainly based on “Python for Programmers” by Paul Deitel and Harvey Deitel, Pearson;
Illustrated edition, ISBN-10 : 0135224330
Python Libraries
Python has many software libraries that can be imported into your program.
A software library is a collection of pre-written code, so that programmers do not reinvent the wheel.
You have previously used “import math”, where math is the name of the math library in Python.
Popular libraries in Python:
• Python Libraries for Data Science
• Python Libraries for Data Processing and Model Deployment
• 1) Pandas
• 2) NumPy
• 3) SciPy
• 4) scikit-learn
• 5) PyCaret
• 6) TensorFlow
• 7) OpenCV
• Python Libraries for Data Mining and Data Scraping
• 8) SQLAlchemy
• 9) Scrapy
• 10) BeautifulSoup
• Python Libraries for Data Visualization
• 11) Matplotlib
• 12) Ggplot
• 13) Plotly
• 14) Altair
• 15) seaborn
Source: https://www.projectpro.io/article/top-5-libraries-for-data-science-in-python/196
Importing libraries
Import the whole library:
import numpy
myarr = numpy.array([1,2,3,4])
OR: Import the whole library with an alias:
import numpy as np
myarr = np.array([1,2,3,4])
Importing a specific object
OR: Import a specific submodule, function or object (here the pyplot submodule of matplotlib, with an alias):
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Create a line plot
plt.plot(x, y)
The NumPy Library
• NumPy is a popular open-source library in Python for data
science and AI
• A standard way for working with numeric data in Python
• It can be used for creating and manipulating N-dimensional
arrays
• 1D for lists of numbers
• 3D for images (R,G,B)
• 4D for videos (a sequence of 3D images)
Creating NumPy Arrays
¨ Start by importing the numpy library
import numpy as np
# Create a 1D array
numpy_array = np.array( [10,20,30] )
# Create a 1D array from a list of numbers
data = [10, 20, 30, 40, 50]
numpy_array = np.array(data)
Numpy 2D arrays
Think of 2D arrays as an “Array of Arrays”
import numpy as np
arr_2D = np.array([
[10, 20, 30, 4],
[2, 8, 2, 4],
[30, 12, 67, 44],
[24, 10, 32, 0]
])
print(arr_2D)
print('Shape: ', arr_2D.shape) #prints the dimensions of the array
Reshaping NumPy Arrays
¨ You can use the np reshape function to transform a 1D array into a
multidimensional array (row-wise)
¨ Example: we can reshape a 12-element 1D array into a 4x3 2D array
¨ Clearly, reshaping a 12-element 1D array into a 4x4 2D array does not work
and generates an error.
import numpy as np
arr = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
print('arr contains: \n', arr)
arr_2D = arr.reshape(4,3)
print('arr_2D contains: \n', arr_2D)
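The size rule behind reshaping can be checked directly. The sketch below is only an illustration: it uses np.arange to build the same 12 values, lets NumPy infer the number of rows with -1, and catches the error raised by the 4x4 request. The variable names arr_4x3 and reshape_failed are made up for this example.

```python
import numpy as np

arr = np.arange(1, 13)  # the 12 values 1..12

# -1 asks NumPy to infer that dimension from the array size (12 / 3 = 4 rows)
arr_4x3 = arr.reshape(-1, 3)
print(arr_4x3.shape)

# A 4x4 target needs 16 elements, so NumPy raises a ValueError for 12 elements
try:
    arr.reshape(4, 4)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print('reshape(4, 4) failed:', err)
```

Catching ValueError this way is just one option for showing the failure without stopping the notebook.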
Transposing NumPy Arrays
¨ You can use the np transpose function to replace rows with columns in a 2D array
¨ The first row becomes the first column, the second row becomes the second column
and so forth…
import numpy as np
arr = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
print('arr contains: \n', arr)
arr_2D = arr.reshape(4,3)
print('arr_2D contains: \n', arr_2D)
#------------------------------------
arr_2D_transposed = np.transpose(arr_2D)
print('arr_2D_transposed contains: \n',
arr_2D_transposed)
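As a small complement, NumPy arrays also expose the transpose through the .T attribute. The sketch below assumes the same 4x3 array as above (built here with np.arange) and checks that element [i, j] moves to [j, i].

```python
import numpy as np

arr_2D = np.arange(1, 13).reshape(4, 3)  # 4 rows, 3 columns

# .T is a shorthand for np.transpose on a 2D array
transposed = arr_2D.T
print(arr_2D.shape, '->', transposed.shape)

# The element in row 3, column 0 ends up in row 0, column 3
print(arr_2D[3, 0], transposed[0, 3])
```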
NumPy Sorting (FYI)
#Numpy Example: sort method
import numpy as np
arr_2D = np.array([
[10, 20, 30, 4],
[2, 8, 2, 4],
[30, 12, 67, 44],
[24, 10, 32, 0]
])
print(arr_2D)
#Use the function np.sort(name of array, axis to sort: None|0|1)
rst = np.sort(arr_2D,axis=None)
print('sort the whole Array: \n', rst)
# Sort row-wise
rst = np.sort(arr_2D,axis=1)
print('Row-wise sorting: \n',rst)
# Sort column-wise
rst = np.sort(arr_2D,axis=0)
print('Column-wise sorting: \n',rst)
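Two details of np.sort are easy to miss: axis=None flattens the array before sorting, and np.sort returns a sorted copy rather than changing the original. The sketch below uses a smaller 2x4 array chosen for illustration; reversing with [::-1] is one possible way to get descending order.

```python
import numpy as np

arr_2D = np.array([[10, 20, 30, 4],
                   [2, 8, 2, 4]])

# axis=None flattens the 2x4 array into 8 values, then sorts them
flat_sorted = np.sort(arr_2D, axis=None)
print(flat_sorted)

# Reversing the sorted result gives descending order
descending = flat_sorted[::-1]
print(descending)

# np.sort returned a copy, so the original array is unchanged
print(arr_2D)
```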
NumPy Calculation Functions
¨ We will use the sum, min, max, mean, std and var functions on NumPy arrays
import numpy as np
grades = np.array([[87,96, 70], [100, 87, 90], [94,77,
90],[100, 81, 82]])
print('The grades are: \n', grades)
total = grades.sum(axis=1) # row-wise (named total to avoid shadowing Python's built-in sum)
print('Summation row-wise:\n',total)
total = grades.sum(axis=0) # col-wise
print('Summation col-wise:\n',total)
total = grades.sum(axis=None) # all
print('Summation of all grades:\n',total)
NumPy Calculation Functions
¨ We will use the sum, min, max, mean, std and var functions on NumPy arrays
import numpy as np
grades = np.array([[87,96, 70], [100, 87, 90], [94,77,
90], [100, 81, 82]])
print('The grades are: \n', grades)
lowest = grades.min(axis=1) # row-wise (named lowest to avoid shadowing Python's built-in min)
print('min row-wise:\n',lowest)
lowest = grades.min(axis=0) # col-wise
print('min col-wise:\n',lowest)
lowest = grades.min(axis=None) # all
print('min of all grades:\n',lowest)
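The remaining functions (max, mean, std and var) follow the same axis convention. The sketch below reuses the grades array; the result variable names are chosen for this example only. Note that NumPy's std and var compute the population versions by default (ddof=0).

```python
import numpy as np

grades = np.array([[87, 96, 70], [100, 87, 90], [94, 77, 90], [100, 81, 82]])

row_means = grades.mean(axis=1)   # one mean per row
col_means = grades.mean(axis=0)   # one mean per column
overall_max = grades.max()        # no axis: the whole array
overall_std = grades.std()        # population standard deviation (ddof=0)
overall_var = grades.var()        # population variance (ddof=0)

print('Mean row-wise:\n', row_means)
print('Mean col-wise:\n', col_means)
print('Max of all grades:', overall_max)
print('Std and var of all grades:', overall_std, overall_var)
```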
Indexing and Slicing - 1
# You can access individual elements and individual rows in a NumPy array
import numpy as np
grades = np.array([[87,96, 70], [100, 87, 90], [94,77,
90],[100,81, 82]])
print('The grades are: \n', grades)
#Select one grade using: grades[row index, col index]
print('grades[0,0] = ', grades[0,0])
print('grades[1,2] = ', grades[1,2])
#Select one row of grades using: grades[row index]
print('grades[3] = ', grades[3])
Indexing and Slicing - 2
#You can select multiple rows from a NumPy array
import numpy as np
grades = np.array([[87,96, 70],[100, 87, 90],[94, 77,
90], [100, 81, 82]])
print('The grades are: \n', grades)
#Select multiple sequential rows of grades using: grades[row index from : row index to]
print('grades[0:2] = \n', grades[0:2]) #up to but not including row 2
Indexing and Slicing - 3
¨ You can select a subset of columns in NumPy arrays
¨ grades[:,0] means select all rows, column 0
¨ grades[:, 0:2] means select all rows, columns 0,1 (up to but
not including 2)
import numpy as np
grades = np.array([[87, 96, 70], [100, 87, 90],
[94, 77, 90], [100, 81, 82]])
print('The grades are: \n', grades)
print('First column | grades[:,0] = \n', grades[:,0])
print('Last 2 columns| grades[:, 1:3] = \n', grades[:,1:3])
Adapted from https://www.w3resource.com/python-exercises/numpy/python-numpy-exercise-104.php
Indexing and Slicing - 4
¨ Python allows negative indices in arrays
¨ One particularly important case is accessing the last column using the negative column index -1
import numpy as np
grades = np.array([[87, 96, 70], [100, 87, 90], [94, 77,
90], [100, 81, 82]])
print('The grades are: \n', grades)
print('First column | grades[:,0] = \n', grades[:,0])
print('Last column | grades[:, -1] = \n', grades[:,-1])
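Row slices, column slices and negative indices can be combined in a single expression. The sketch below is one illustrative combination on the same grades array; the variable names are made up for this example.

```python
import numpy as np

grades = np.array([[87, 96, 70], [100, 87, 90], [94, 77, 90], [100, 81, 82]])

# Rows 1 and 2 (up to but not including 3), last two columns
middle_rows_last_two_cols = grades[1:3, -2:]
print(middle_rows_last_two_cols)

# A negative row index counts from the end: -1 is the last row
last_row = grades[-1]
print(last_row)
```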
Pandas: Series and DataFrames
Pandas Series and DataFrames | 1
¨ NumPy arrays are optimized for homogeneous numeric data
¨ However, in ML applications, we need to provide:
¤ Support for heterogeneous types (e.g. numeric and strings)
¤ Support for missing data
¤ Support for headers and indices (see next slide)
¨ Pandas is the commonly used library for dealing with such data
¨ It provides support for:
¤ Series: for 1D collections (enhanced 1D array)
¤ DataFrames: for 2D collections (enhanced 2D array)
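The three needs listed above can be seen in a few lines. This is a minimal sketch with made-up student data (the names, grades and index labels s1..s3 are hypothetical): a NumPy array forces mixed values into one type, while a DataFrame keeps one type per column, marks missing values as NaN and carries headers and an index.

```python
import numpy as np
import pandas as pd

# A NumPy array holds one type only: mixing numbers and text turns everything into strings
mixed = np.array([87, 'Ali', 94.5])
print(mixed.dtype)

# A DataFrame keeps a separate type per column and supports missing values (NaN)
students = pd.DataFrame(
    {'name': ['Ali', 'Sara', 'Omar'], 'grade': [87.0, np.nan, 94.0]},
    index=['s1', 's2', 's3'])
print(students)

missing_count = students['grade'].isna().sum()   # number of missing grades
mean_grade = students['grade'].mean()            # NaN values are skipped by default
print(missing_count, mean_grade)
```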
Pandas Series and DataFrames | 2
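[Figure: a Series shown as an “Index” column and a “value” column, and a DataFrame shown as an “Index” column followed by columns with headers. The first column is called the “index”; the rest of the columns are called “values”.]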
Pandas Series | 1
¨ An enhanced 1D array
¨ Can be indexed using integers like NumPy or strings
import pandas as pd
grades = pd.Series([87, 100, 94])
print('Grades Series:\n',grades)
print('First grade: ',grades[0])
Output (index and value):
0 87
1 100
2 94
First grade: 87
Pandas Series | 2
¨ Provides statistical functions like count, mean, min, max and std
¨ For a full numerical summary you can use the describe function
import pandas as pd
grades = pd.Series([87, 100, 94])
print('Grades Series:\n',grades)
print('Count: ', grades.count())
print('Mean: ' , grades.mean())
print('Min: ' , grades.min())
print('Max: ' , grades.max())
print('Std: ' , grades.std())
# for an overall summary you can use:
print('Description:\n',grades.describe())
Series with a Custom Index
¨ You can use custom indices with the index argument
import pandas as pd
grades = pd.Series([87, 100, 94],
index=['First', 'Second', 'final'])
print(grades)
Output:
First 87
Second 100
final 94
Accessing Series Using String Indices
¨ A Series with custom indices, as in the previous example, can be accessed via square brackets [ ] containing a custom index value:
import pandas as pd
grades = pd.Series([87, 100, 94], index=['First', 'Second', 'final'])
print('Grade of first = ',grades['First'])
# or by position with .iloc (plain grades[0] on a string index is deprecated in recent pandas versions)
print('Grade of first = ',grades.iloc[0])
#--You can also access all values and all indices
print('Series values are: ', grades.values)
print('Series indices are: ', grades.index)
Output:
Grade of first = 87
Grade of first = 87
Series values are: [ 87 100 94]
Series indices are: Index(['First', 'Second', 'final'], dtype='object')
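When a Series has string indices, it can help to state explicitly whether you mean a label or a position. The sketch below uses the .loc (label) and .iloc (position) accessors on the same grades Series; the result variable names are chosen for this example.

```python
import pandas as pd

grades = pd.Series([87, 100, 94], index=['First', 'Second', 'final'])

by_label = grades.loc['Second']      # access by index label
by_position = grades.iloc[1]         # access by integer position
print(by_label, by_position)

# Label slices include BOTH ends; position slices exclude the end
label_slice = grades.loc['First':'Second']
position_slice = grades.iloc[0:2]
print(label_slice)
print(position_slice)
```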
DataFrames
¨ Enhanced 2D arrays
¨ Can have custom indices and headers
¨ Each column in a DataFrame is a Series
Creating DataFrames From Files
• Pandas provides a read_csv() function to read data stored as a .csv file into a pandas DataFrame.
• Pandas supports many different file formats including csv and excel:
• myDataFrame = pd.read_csv("myfile.csv")
• Or myDataFrame = pd.read_excel("myfile.xlsx")
• To save data from DataFrames to files use:
• myDataFrame.to_csv("myOutputFile.csv")
• Or myDataFrame.to_excel("myOutputFile.xlsx")
• After reading a file, you can display the first 5 rows using myDataFrame.head() and the last 5 rows using myDataFrame.tail()
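A write-then-read round trip is a quick way to try these functions without needing an existing file. The sketch below is an illustration only: the DataFrame contents and the file name grades.csv are made up, and a temporary folder is used so nothing is left on disk. Passing index=False to to_csv is a common choice that avoids writing the index as an extra column.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'name': ['Ali', 'Sara'], 'grade': [87, 94]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'grades.csv')   # hypothetical file name
    df.to_csv(csv_path, index=False)                 # save without the index column
    loaded = pd.read_csv(csv_path)                   # read the file back

print(loaded.head())    # first rows (up to 5)
print(loaded.tail(1))   # last row
```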
Creating DataFrames From Files in Colab (FYI)
[Screenshot: the Colab file browser, showing the button to click to upload a file and the uploaded file.]
df2.to_csv('testFileToWrite.csv') # this will create an output file with .csv extension
Creating DataFrames From Internet Files | 1
• We will use the Iris sample data, which contains information on 150
Iris flowers, 50 each from one of three Iris species: Setosa,
Versicolour, and Virginica.
• Each flower is characterized by five attributes:
1. sepal_length in centimeters
2. sepal_width in centimeters
3. petal_length in centimeters
4. petal_width in centimeters
5. class: the type of the flower (Setosa, Versicolour, Virginica), which is the last column in the DataFrame
Data is available online at: https://archive.ics.uci.edu/dataset/53/iris
Iris Flowers Dataset
Creating DataFrames From Internet Files | 2
import pandas as pd
#The argument header=None says that this dataset does not contain a header yet, so we will add one next
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',header=None)
#You can then add column headers
data.columns=['sepal_length','sepal_width','petal_length','petal_width','class']
#And display the first 5 rows to make sure that the reading is successful
data.head()
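If the network is not available, the effect of header=None can still be tried offline. The sketch below is an assumption-laden stand-in: it feeds read_csv three header-less rows through io.StringIO instead of the URL, and uses the names argument of read_csv as an alternative way to supply the column headers while reading.

```python
import io
import pandas as pd

# Three header-less rows in the same layout as the Iris file (in-memory stand-in for the URL)
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n")

column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# header=None: the first line is data, not a header; names= supplies the headers directly
data_sample = pd.read_csv(sample, header=None, names=column_names)
print(data_sample.head())
```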
Creating DataFrames From Internet Files | 3
The output:
Accessing DataFrame’s Columns and Rows | 1
#Access one column using a header’s name
print('petal_length columns:\n',data['petal_length'])
Output:
petal_length columns:
0 1.4
1 1.4
2 1.3
3 1.5
4 1.4
...
145 5.2
146 5.0
147 5.2
148 5.4
149 5.1
#Access one row using the .iloc function
print('\n\nFirst row:')
print(data.iloc[0])
Output:
First row:
sepal_length 5.1
sepal_width 3.5
petal_length 1.4
petal_width 0.2
class Iris-setosa
Accessing DataFrame’s Columns and Rows | 2
#Access a sequential slice of rows using the .iloc function
print('\n\nFirst 5 rows:')
print(data.iloc[0:5]) # up to but not including 5
First 5 rows:
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
Accessing DataFrame’s Columns and Rows | 3
#Access a sequential slice of rows and columns using the .iloc function
print('\n\nFirst 5 rows and first 2 columns:')
#print up to but not including row 5, up to but not including col 2
#.iloc[ rows from:to , cols from:to ]
print(data.iloc[0:5 , 0:2 ])
First 5 rows and first 2 columns:
sepal_length sepal_width
0 5.1 3.5
1 4.9 3.0
2 4.7 3.2
3 4.6 3.1
4 5.0 3.6
Accessing DataFrame’s Columns and Rows | 4
#Access a slice of rows and a list of columns using the .iloc function
print('\n\nFirst 5 rows; columns 0, 1 and the last column:')
#print up to but not including row 5, and cols 0,1 and the last column
#.iloc[ rows from:to , [cols indices] ]
print(data.iloc[0:5 , [0,1,-1]])
sepal_length sepal_width class
0 5.1 3.5 Iris-setosa
1 4.9 3.0 Iris-setosa
2 4.7 3.2 Iris-setosa
3 4.6 3.1 Iris-setosa
4 5.0 3.6 Iris-setosa
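The same selection can also be written with labels instead of positions. The sketch below builds a three-row stand-in for the Iris DataFrame (values taken from the first rows shown above) and compares .iloc with .loc; note that a .loc row slice includes its end label.

```python
import pandas as pd

# Small stand-in for the Iris DataFrame (first three rows, three of the columns)
data = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7],
    'sepal_width': [3.5, 3.0, 3.2],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-setosa']})

# By position: rows 0 and 1 (2 is excluded), first and last column
by_position = data.iloc[0:2, [0, -1]]

# By label: rows 0 through 1 (1 is INCLUDED), columns named explicitly
by_label = data.loc[0:1, ['sepal_length', 'class']]

print(by_position)
print(by_label)
```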
DataFrames Boolean Indexing | 1
¨ Pandas provides a powerful selection feature called Boolean indexing
¨ That is, you can use Boolean expressions that return True/False to filter a DataFrame
¨ Let us start by extracting the numeric data from our DataFrame:
data_numeric = data.iloc[:, 0:4]
data_numeric.head()
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
DataFrames Boolean Indexing | 2
# from the previous slide
data_numeric = data.iloc[:, 0:4]
#Filter the dataFrame, locate values >= 5.0
rst = data_numeric[data_numeric >= 5.0]
rst
• Pandas checks every element to determine whether its value is greater than or equal to 5.0
• If True then it includes it in the new DataFrame (rst in the example above).
• Elements for which the condition is False are represented as NaN (not a number) in the new DataFrame
Output:
sepal_length sepal_width petal_length petal_width
0 5.1 NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 5.0 NaN NaN NaN
... ... ... ... ...
145 6.7 NaN 5.2 NaN
146 6.3 NaN 5.0 NaN
147 6.5 NaN 5.2 NaN
148 6.2 NaN 5.4 NaN
149 5.9 NaN 5.1 NaN
150 rows × 4 columns
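It can help to look at the intermediate True/False table that the comparison produces. The sketch below uses a three-row, two-column stand-in for data_numeric (values chosen for illustration); the names mask and kept_per_column are made up for this example.

```python
import pandas as pd

# Small stand-in for data_numeric
data_numeric = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 5.0],
    'sepal_width': [3.5, 3.0, 3.6]})

# The comparison alone gives a DataFrame of True/False values, one per element
mask = data_numeric >= 5.0
print(mask)

# Indexing with the mask keeps the True elements and puts NaN elsewhere
rst = data_numeric[mask]
print(rst)

# count() ignores NaN, so it tells how many elements passed the test in each column
kept_per_column = rst.count()
print(kept_per_column)
```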
DataFrames Boolean Indexing | 3
• In Boolean expressions you can use:
• AND which is the & operator
• OR which is the | operator
data_numeric = data.iloc[:, 0:4]
rst = data_numeric[data_numeric >= 5.0]
rst.head()
#Other examples (data_numeric >= 3.0) AND (data_numeric <= 5.0):
rst = data_numeric[(data_numeric >= 3.0) & (data_numeric <= 5.0)]
rst.head()
#Other examples (data_numeric < 3.0) OR (data_numeric > 5.0):
rst = data_numeric[(data_numeric < 3.0) | (data_numeric > 5.0)]
rst.head()
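Two practical notes on these operators, sketched below on a made-up Series called widths: each comparison needs its own parentheses because & and | bind more tightly than >= and <=, and Python's keywords and/or do not work element-wise on pandas objects. The ~ operator (NOT) is shown as an addition to the two operators above.

```python
import pandas as pd

widths = pd.Series([2.5, 3.0, 3.6, 5.2])   # illustrative values

# Each comparison is wrapped in parentheses before combining with &
in_range = (widths >= 3.0) & (widths <= 5.0)
print(in_range.tolist())

# ~ negates a Boolean Series (NOT)
out_of_range = ~in_range
print(out_of_range.tolist())

# The keyword 'and' asks for a single True/False, which is ambiguous for a Series
try:
    widths >= 3.0 and widths <= 5.0
    keyword_and_failed = False
except ValueError as err:
    keyword_and_failed = True
    print('Using "and" failed:', err)
```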
DataFrames Boolean Indexing | 4
• You can use the .loc function to filter rows according to a Boolean criterion
import pandas as pd
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',header=None)
data.columns=['sepal_length','sepal_width','petal_length','petal_width','class']
#Select rows where sepal_length >= 5.0
rst = data.loc[ data.sepal_length >= 5.0 ]
print('Select row where sepal_length >= 5.0')
print(rst.head())
#Select rows where sepal_length >= 5.0 AND sepal_width >= 3.5
rst = data.loc[ (data.sepal_length >= 5.0) & (data.sepal_width >= 3.5)]
print('Select row where sepal_length >= 5.0 & data.sepal_width >= 3.5')
print(rst.head())
DataFrames Boolean Indexing | 4
Output:
Select row where sepal_length >= 5.0
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
Select row where sepal_length >= 5.0 & data.sepal_width >= 3.5
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
14 5.8 4.0 1.2 0.2 Iris-setosa
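.loc also accepts a second argument, so rows can be filtered and columns chosen in one step. The sketch below uses a five-row stand-in built from the first Iris rows shown earlier; notice that the original row labels (0 and 4) are kept in the result.

```python
import pandas as pd

# Stand-in for the first five Iris rows (three of the columns)
data = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 4.6, 5.0],
    'sepal_width': [3.5, 3.0, 3.2, 3.1, 3.6],
    'class': ['Iris-setosa'] * 5})

# Rows: Boolean criterion; columns: a list of header names
rows_wanted = (data['sepal_length'] >= 5.0) & (data['sepal_width'] >= 3.5)
rst = data.loc[rows_wanted, ['sepal_length', 'class']]
print(rst)
```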
DataFrames Statistics | 1
¨ Similar to Series, you can use the describe() function to print out statistics.
¨ In DataFrames, the statistics are calculated by column (for the numeric columns only).
data.describe()
Output:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
DataFrames Statistics | 2
¨ Similar to Series, you can use the mean(), min(), max(), std() and var() functions
¨ In DataFrames, the statistics are calculated by column. Pass numeric_only=True to restrict them to the numeric columns; in recent pandas versions, mean() and std() otherwise raise an error on a string column such as class.
print('Avg per col:')
print(data.mean(numeric_only=True))
print('Std per col:')
print(data.std(numeric_only=True))
print('Min per col:')
print(data.min(numeric_only=True))
print('Max per col:')
print(data.max(numeric_only=True))
Output:
Avg per col:
sepal_length 5.843333
sepal_width 3.054000
petal_length 3.758667
petal_width 1.198667
Std per col:
sepal_length 0.828066
sepal_width 0.433594
petal_length 1.764420
petal_width 0.763161
…
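The sketch below tries the per-column statistics on a three-row stand-in that includes the string column class. It assumes a recent pandas version, where numeric_only=True is the explicit way to skip non-numeric columns; select_dtypes is shown as one possible way to compute a statistic per row (axis=1) on the numeric part only.

```python
import numpy as np
import pandas as pd

# Small stand-in for the Iris DataFrame, including the string column
data = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7],
    'sepal_width': [3.5, 3.0, 3.2],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-setosa']})

# numeric_only=True skips the string column explicitly
col_means = data.mean(numeric_only=True)
col_stds = data.std(numeric_only=True)
col_vars = data.var(numeric_only=True)
print(col_means)

# Keep only the numeric columns, then average across each row
numeric = data.select_dtypes(include='number')
row_means = numeric.mean(axis=1)
print(row_means)
```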
DataFrames <-> NumPy | 1
¨ There are cases where you need to convert a DataFrame into a NumPy Array and
vice versa
¨ This is needed in machine learning tasks like classification and regression that you
will study next
¨ Let us start by converting a DataFrame into a NumPy array using the to_numpy() function
import pandas as pd
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',header=None)
data.columns=['sepal_length','sepal_width','petal_length','petal_width', 'class']
#Convert a dataFrame into a numPy array
numpy_from_dataFrame = data.to_numpy()
#print(numpy_from_dataFrame)
#OR: Convert the first 4 columns of a dataFrame into a numPy array
numpy_from_dataFrame = data.iloc[:, 0:4].to_numpy()
#print(numpy_from_dataFrame)
DataFrames <-> NumPy | 2
Output of data.to_numpy():
[[5.1 3.5 1.4 0.2 'Iris-setosa']
[4.9 3.0 1.4 0.2 'Iris-setosa']
[4.7 3.2 1.3 0.2 'Iris-setosa']
[4.6 3.1 1.5 0.2 'Iris-setosa']
[5.0 3.6 1.4 0.2 'Iris-setosa']
[5.4 3.9 1.7 0.4 'Iris-setosa']
[4.6 3.4 1.4 0.3 'Iris-setosa']
[5.0 3.4 1.5 0.2 'Iris-setosa']
[4.4 2.9 1.4 0.2 'Iris-setosa']
[4.9 3.1 1.5 0.1 'Iris-setosa']
[5.4 3.7 1.5 0.2 'Iris-setosa']
[4.8 3.4 1.6 0.2 'Iris-setosa']
…
Output of data.iloc[:, 0:4].to_numpy():
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]
[5.4 3.7 1.5 0.2]
[4.8 3.4 1.6 0.2]
…
DataFrames <-> NumPy | 3
¨ To convert a NumPy array into a DataFrame we can use the constructor pd.DataFrame()
¨ Notice how you can add columns (which are the headers), using the argument columns=[…]
#The five column names require the five-column array, i.e. the result of data.to_numpy()
numpy_from_dataFrame = data.to_numpy()
dataFrame_from_numpy = pd.DataFrame(numpy_from_dataFrame, columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width','class'])
dataFrame_from_numpy.head()
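One thing worth checking after a conversion is the element type (dtype) of the resulting array. The sketch below uses a three-row stand-in for the Iris DataFrame: converting all columns, including the string column, gives a generic object array, while converting only the numeric columns gives a float array, which is what numerical code usually expects.

```python
import numpy as np
import pandas as pd

# Small stand-in for the Iris DataFrame
data = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7],
    'sepal_width': [3.5, 3.0, 3.2],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-setosa']})

all_columns = data.to_numpy()                 # numbers and strings together
numeric_part = data.iloc[:, 0:2].to_numpy()   # numeric columns only
print(all_columns.dtype)
print(numeric_part.dtype)

# Going back: the number of column names must match the number of array columns
rebuilt = pd.DataFrame(numeric_part, columns=['sepal_length', 'sepal_width'])
print(rebuilt)
```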
Other Ways of Creating DataFrames – 1
import pandas as pd
df = pd.DataFrame(
{
"Name":["Braund, Mr. Owen Harris", "Allen, Mr. William Henry", "Bonnell, Miss. Elizabeth"],
"Age":[22, 35, 58],
"Gender":["male","male", "female"]
}
)
print(df)
df.describe()
Output:
Name Age Gender
0 Braund, Mr. Owen Harris 22 male
1 Allen, Mr. William Henry 35 male
2 Bonnell, Miss. Elizabeth 58 female
Age
count 3.000000
mean 38.333333
std 18.230012
min 22.000000
25% 28.500000
50% 35.000000
75% 46.500000
max 58.000000
Source: https://pandas.pydata.org/
Other Ways of Creating DataFrames – 2
#You can create a DataFrame from an existing dictionary as follows
import pandas as pd
my_dictionary={
"Name": [
"Dr. Sami Batata",
"Prof. Marwa Halawah",
"Mr. Fawzi Kamal"
],
"Age": [29, 40, 60],
"Gender": ["male", "female", "male"]
}
df = pd.DataFrame( my_dictionary)
print(df)
The dictionary’s keys become the column names (headers).
The values become the element values in the corresponding column.
Output:
Name Age Gender
0 Dr. Sami Batata 29 male
1 Prof. Marwa Halawah 40 female
2 Mr. Fawzi Kamal 60 male
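A related layout, sketched below, is a list of dictionaries in which each dictionary describes one row rather than one column. The index labels p1 and p2 are hypothetical and only show that the index argument also works for DataFrames.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names
records = [
    {"Name": "Dr. Sami Batata", "Age": 29},
    {"Name": "Prof. Marwa Halawah", "Age": 40},
]

df = pd.DataFrame(records, index=['p1', 'p2'])   # custom row labels (hypothetical)
print(df)

# Rows can then be accessed by label with .loc
print(df.loc['p2', 'Age'])
```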