Python Pandas and Matplotlib 7


ISOM 3400 – PYTHON FOR BUSINESS ANALYTICS

7. Pandas and Matplotlib


Yingpeng Robin Zhu

JUL 06, 2022

1
Pandas

2
Pandas Introduction

 What is Pandas?
 Pandas is a Python library used for working with data sets
 It has functions for analyzing, cleaning, exploring, and manipulating data
 The name "Pandas" refers to both "Panel Data" and "Python Data
Analysis"; the library was created by Wes McKinney in 2008
 Anaconda already includes Pandas, so no additional installation is
needed

 Why Use Pandas?


 Pandas allows us to analyze big data and draw conclusions based on statistical
theories
 Pandas can clean messy data sets, and make them readable and relevant
 Relevant data is very important in data science
3
Pandas Installation (in case Pandas is not installed)

 Installation of Pandas
 If you have Python and pip already installed on a system, installing Pandas is
easy: run pip install pandas from the command line

 If you have Python but have not installed pip, install pip first
(https://phoenixnap.com/kb/install-pip-windows), and then install Pandas

4
Pandas Introduction
 Let’s create a Pandas data frame first:
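The code from the original slide is not part of this text, so the following is only a minimal sketch of creating a DataFrame from a dictionary. The column names and values are hypothetical and chosen for illustration.

```python
import pandas as pd

# Hypothetical workout data, used only for illustration.
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}

# Each dictionary key becomes a column; rows get the default index 0, 1, 2.
df = pd.DataFrame(data)
print(df)
```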

5
Pandas Introduction

 Data sets in Pandas are usually multi-dimensional tables, called DataFrames


 A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array,
or a table with rows and columns
 What does a DataFrame look like? A preview
[Figure: preview of a DataFrame, with its columns, rows, index, and data labeled]

6
Locate Row

 As you can see from the result above, the DataFrame is like a table with rows and
columns
 Pandas uses the .loc[] attribute (with square brackets) to return one or more specified row(s)
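A short sketch of row selection with .loc, assuming a small hypothetical DataFrame with the default integer index. Passing one label gives a Series; passing a list of labels gives a DataFrame.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

first_row = df.loc[0]       # one label -> a Series holding that row
two_rows = df.loc[[0, 1]]   # a list of labels -> a DataFrame

print(first_row)
print(two_rows)
```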

8
Locate Column

 You can access the values of a column by giving the column name
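As an illustration, assuming the same hypothetical DataFrame as before, a single column name in square brackets returns a Series and a list of names returns a DataFrame.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

calories = df["calories"]            # one column -> Series
subset = df[["duration", "calories"]]  # list of columns -> DataFrame

print(calories)
print(subset)
```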

9
Locate Items

 You can locate a specific item in the data frame with .at[row_index, column_name]
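A minimal sketch of .at on a hypothetical DataFrame. The row label comes first, then the column name; the same syntax can also assign a new value.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

value = df.at[1, "duration"]   # row label 1, column "duration"
print(value)

df.at[1, "duration"] = 42      # .at can also set a single value
```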

10
Pandas Read CSV

 A simple way to store big data sets is to use CSV files (comma-separated values files).
 CSV files contain plain text and are a well-known format that can be read by almost
any tool, including Pandas
 The data that we have created could also be imported directly from a .csv file

11
Load Files Into a DataFrame

 You can load only selected columns by passing their names to the usecols parameter

12
Load Files Into a DataFrame

 read_csv() parameters
 You can control how the data is read by adjusting the parameters
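The sketch below shows read_csv() with and without extra parameters. To stay self-contained it reads from an in-memory string instead of a file on disk; the CSV contents are made up, and with a real file you would pass its path as the first argument.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the contents are made up for illustration.
csv_text = "Duration,Pulse,Calories\n60,110,409.1\n45,109,282.4\n30,103,\n"

# Read everything (the empty Calories cell becomes NaN).
df_all = pd.read_csv(io.StringIO(csv_text))

# Read only two columns (usecols) and only the first two data rows (nrows).
df_part = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Duration", "Calories"],
    nrows=2,
)

print(df_all)
print(df_part)
```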

13
Data Cleaning

 Data cleaning means fixing bad data in your data set


 Bad data could be
 Empty cells
 Data in wrong format
 Wrong data
 Duplicates
 Now, please open our sample data: data_lab7_demo_dirtydata.csv
 The data set contains some empty cells ("Date" in row 24, and "Calories" in row 20
and 30)
 The data set contains data in the wrong format ("Date" in row 28)
 The data set contains wrong data ("Duration" in row 9)
 The data set contains duplicates (row 13 and 14)

14
Data Cleaning

 Remove Rows (e.g., rows that include empty cells)

o If you want to change the original DataFrame, use the inplace = True
argument
o With dropna(inplace = True), no new DataFrame is returned; instead, all
rows containing NULL values are removed from the original DataFrame

Note: By default, the dropna() method returns a new DataFrame, and will not
change the original
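The difference between the default behavior and inplace = True can be sketched as follows. The DataFrame is hypothetical and contains one empty cell.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one empty (NaN) cell.
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, np.nan, 195.1]})

cleaned = df.dropna()       # new DataFrame; df itself still has 3 rows
print(len(df), len(cleaned))

df.dropna(inplace=True)     # returns None and modifies df itself
print(len(df))
```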
Data Cleaning

 Replace Empty Values


 Another way of dealing with empty cells is to insert a new value instead
 This way you do not have to delete entire rows just because of some empty cells
 The fillna() method allows us to replace empty cells with a value

16
Data Cleaning

 Replace Only For Specified Columns


 Most of the time, we want to fill the empty cells of a specific column with that column's mean or median
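A small sketch of both uses of fillna(), on hypothetical data: first filling every empty cell with one value, then filling a single column with its own mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data with empty (NaN) cells.
df = pd.DataFrame({
    "Duration": [60.0, np.nan, 30.0],
    "Calories": [400.0, np.nan, 200.0],
})

# Replace every empty cell in the whole DataFrame with one value.
filled_all = df.fillna(0)

# Replace empty cells in one column only, using that column's mean.
mean_calories = df["Calories"].mean()   # NaN is skipped: (400 + 200) / 2
df["Calories"] = df["Calories"].fillna(mean_calories)

print(filled_all)
print(df)
```

Use median() in place of mean() when the column has outliers.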

17
Data Cleaning

 Discovering Duplicates
 Duplicate rows are rows that have been registered more than one time

 To discover duplicates, we can use the duplicated() method


 The duplicated() method returns a Boolean value for each row
 Removing Duplicates
 To remove duplicates, use the drop_duplicates() method
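For illustration, with a hypothetical DataFrame whose last row repeats the second one:

```python
import pandas as pd

# Hypothetical data; the third row duplicates the second.
df = pd.DataFrame({"Duration": [60, 45, 45], "Pulse": [110, 109, 109]})

flags = df.duplicated()          # True for a row that repeats an earlier row
deduped = df.drop_duplicates()   # keeps the first occurrence of each row

print(flags)
print(deduped)
```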

18
Data Cleaning

19
Python Matplotlib

20
What is Matplotlib?

21
What is Matplotlib?

 Installation of Matplotlib
 Python distributions like Anaconda already have Matplotlib installed
 Import Matplotlib
 Once Matplotlib is installed, import it in your applications by adding the
statement: import matplotlib
 Most of the Matplotlib utilities lie under the pyplot submodule, and are usually
imported under the plt alias: import matplotlib.pyplot as plt

22
First Plot Example

 Draw a line in a diagram from position (0,0) to position (10,500)
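The slide's code is not included in this text; a minimal sketch that draws the described line, using the usual plt alias, could look like this:

```python
import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([0, 10])    # x-coordinates of the start and end point
ypoints = np.array([0, 500])   # y-coordinates of the start and end point

lines = plt.plot(xpoints, ypoints)
plt.show()
```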

23
Matplotlib Plotting

 Plotting x and y points


 The plot() function is used to draw points (markers) in a diagram
 By default, the plot() function draws a line from point to point
 The function takes parameters for specifying points in the diagram
 Parameter 1 is an array containing the points on the x-axis or the horizontal axis.
(e.g., np.array([0,10]))
 Parameter 2 is an array containing the points on the y-axis or the vertical axis (e.g.,
np.array([0,500]))
 If we need to plot a line from (1, 3) to (8, 10), we have to pass two arrays [1, 8] and
[3, 10] to the plot function
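The (1, 3) to (8, 10) example can be sketched as below. The second call, which passes the format string "o" to draw only the markers without a connecting line, is an addition for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([1, 8])    # x-coordinates: 1 and 8
ypoints = np.array([3, 10])   # y-coordinates: 3 and 10

line = plt.plot(xpoints, ypoints)           # default: a line from point to point
markers = plt.plot(xpoints, ypoints, "o")   # "o": markers only, no line
plt.show()
```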

24
Matplotlib Plotting

25
Matplotlib Plotting

 You can plot as many points as you like, just make sure you have the same number of
points on both axes

26
Default X-Points

 If we do not specify the points on the x-axis, they will get the default values 0, 1, 2, 3, …
(depending on the length of the y-points)
 So, if we take the same example as above, and leave out the x-points, the diagram will
look like this
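A short sketch of the default x-points, with hypothetical y-values. Only the y-array is passed, so Matplotlib uses 0, 1, 2, 3 on the x-axis.

```python
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])   # hypothetical values; four points

lines = plt.plot(ypoints)           # x defaults to 0, 1, 2, 3
print(lines[0].get_xdata())
plt.show()
```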

27
Matplotlib Plotting

 You can use the keyword argument marker to emphasize each point with a specified
marker

28
Matplotlib Plotting

 You can use the keyword argument linestyle, or shorter ls, to change the style of the
plotted line

29
Matplotlib Plotting

 You can use the keyword argument color or the shorter c to set the color of the line
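The marker, linestyle, and color keyword arguments can be combined in one call. The values below are hypothetical; "o", "dotted", and "r" are just one possible choice of marker, style, and color.

```python
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])   # hypothetical values

# Circle markers, a dotted line, and red color.
lines = plt.plot(ypoints, marker="o", linestyle="dotted", color="r")
plt.show()
```

The short forms ls="dotted" and c="r" are equivalent to linestyle and color.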

31
Matplotlib Plotting

 Multiple Lines: You can plot as many lines as you like by simply adding more plt.plot()
calls

Note: we only specified the points on the y-axis, meaning that the points on the x-axis got the default values (0, 1, 2, 3, ...)
Matplotlib Plotting

 You can also plot many lines by adding the points for the x- and y-axis for each line in
the same plt.plot() function, so that the x- and y- values come in pairs.
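Both ways of drawing several lines are sketched below with hypothetical values: one plt.plot() call per line, and a single call that receives x/y pairs.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3])
y1 = np.array([3, 8, 1, 10])   # hypothetical values for the first line
y2 = np.array([6, 2, 7, 11])   # hypothetical values for the second line

# Option 1: one plt.plot() call per line (x defaults to 0, 1, 2, 3).
plt.plot(y1)
plt.plot(y2)

# Option 2: a new figure with both lines given as x/y pairs in one call.
plt.figure()
lines = plt.plot(x, y1, x, y2)
plt.show()
```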

33
Matplotlib Labels and Title
 With Pyplot, you can use the xlabel() and ylabel() functions to set a label for the x- and
y-axis
 With Pyplot, you can use the title() function to set a title for the plot

34
Matplotlib Labels and Title
 With Pyplot, you can use label="name" within the plt.plot() function to label a line
 Remember to call the legend() function, otherwise the labels are not displayed

35
Matplotlib Labels and Title
 Add Grid Lines to a Plot
 With Pyplot, you can use the grid() function to add grid lines to the plot
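Axis labels, title, line labels with a legend, and grid lines can be combined as in this sketch. The data and all label texts are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

quarters = np.array([1, 2, 3, 4])

# label= names each line; the names appear once legend() is called.
plt.plot(quarters, quarters * 10, label="Product A")
plt.plot(quarters, quarters * 15, label="Product B")

plt.xlabel("Quarter")
plt.ylabel("Sales")
plt.title("Sales by Quarter")
plt.legend()
plt.grid()
plt.show()
```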

36
Matplotlib Subplot
 Display Multiple Plots (1 row, 2 columns)
[Figure: one row, two columns; the first plot and the second plot sit side by side]

37
Matplotlib Subplot
 Display Multiple Plots (2 rows, 1 column)
[Figure: two rows, one column; the first plot sits above the second plot]

38
Matplotlib Subplot
 Display Multiple Plots (as many as you want)
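A sketch of the one-row, two-column layout, assuming the slides use plt.subplot(rows, columns, index); the plotted values are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3])

plt.subplot(1, 2, 1)                      # 1 row, 2 columns, first plot
plt.plot(x, np.array([3, 8, 1, 10]))

plt.subplot(1, 2, 2)                      # 1 row, 2 columns, second plot
plt.plot(x, np.array([10, 20, 30, 40]))

plt.show()
```

For two rows and one column, use plt.subplot(2, 1, 1) and plt.subplot(2, 1, 2) instead.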

39
Matplotlib Scatter
 Creating Scatter Plots
 With Pyplot, you can use the scatter() function to draw a scatter plot
 The scatter() function plots one dot for each observation. It needs two arrays of the
same length, one for the values of the x-axis, and one for values on the y-axis

40
Matplotlib Scatter
 Change color and size of the markers

Note: make sure the arrays for colors and sizes have the same
length as the arrays for the x- and y-axis
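A scatter plot with per-point colors and sizes might look like the sketch below. All values are hypothetical; c and s are the scatter() keyword arguments for color and marker size.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([5, 7, 8, 7, 2])
y = np.array([99, 86, 87, 88, 111])

# One color and one size per observation (same length as x and y).
colors = ["red", "green", "blue", "orange", "purple"]
sizes = np.array([20, 50, 100, 200, 400])

points = plt.scatter(x, y, c=colors, s=sizes)
plt.show()
```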

41
Matplotlib Bars
 With Pyplot, you can use the bar() function to draw bar graphs

42
Matplotlib Bars
 You can use plt.savefig("path") to save the generated image

Note: Call plt.savefig() before plt.show(); if the figure is saved after it
has been shown, the saved image may be blank
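A minimal sketch of drawing a bar chart and saving it before showing it. The categories, values, and file name are hypothetical, and the image is written to the system's temporary directory only to keep the example self-contained.

```python
import os
import tempfile

import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]   # hypothetical categories
values = [3, 8, 1, 10]              # hypothetical values

plt.bar(categories, values)

# Save first, then show.
out_path = os.path.join(tempfile.gettempdir(), "bar_chart.png")
plt.savefig(out_path)
plt.show()
```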

43
Horizontal Bars
 If you want the bars to be displayed horizontally instead of vertically, use the barh()
function

44
Horizontal Bars
 Use the keyword argument color to set the color of the bars
 Use the keyword argument width (for bar()) or height (for barh()) to set the thickness of the bars
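The sketch below draws the same hypothetical data as vertical and as horizontal bars, setting the color and the bar thickness in each case.

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]   # hypothetical categories
values = [3, 8, 1, 10]              # hypothetical values

# Vertical bars: thickness is controlled by width.
bars_v = plt.bar(categories, values, color="green", width=0.5)

# Horizontal bars in a new figure: thickness is controlled by height.
plt.figure()
bars_h = plt.barh(categories, values, color="orange", height=0.3)
plt.show()
```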

45
Matplotlib Histograms
 Histogram
 A histogram is a graph showing frequency distributions
 It is a graph showing the number of observations within each given interval
 Say you ask for the height of 250 people, you might end up with a histogram like
this:

46
Matplotlib Histograms
 Create Histogram
 In Matplotlib, we use the hist() function to create histograms
 The hist() function will use an array of numbers to create a histogram, the array is
sent into the function as an argument
 We use NumPy to randomly generate an array with 250 values, where the values
will concentrate around 170, and the standard deviation is 10
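A sketch of this example, assuming the array is generated with np.random.normal (mean 170, standard deviation 10, 250 values). The values are random, so the histogram differs from run to run.

```python
import matplotlib.pyplot as plt
import numpy as np

# 250 random heights centered on 170 with standard deviation 10.
heights = np.random.normal(170, 10, 250)

# hist() returns the bar counts, the bin edges, and the drawn patches.
counts, bin_edges, patches = plt.hist(heights)
plt.show()
```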

48
Matplotlib Pie Charts
 Create Pie Charts
 With Pyplot, you can use the pie() function to draw pie charts
By default, the plotting of the first wedge starts from the x-axis and moves
counterclockwise

49
Matplotlib Pie Charts
 Labels, titles, and percentages

50
Matplotlib Pie Charts
 Explode
 Maybe you want one of the wedges to stand out? The explode parameter allows
you to do that
 The explode parameter, if specified, and not None, must be an array with one value
for each wedge
 Each value represents how far from the center each wedge is displayed
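Labels, a title, percentages, explode, and shadow can be combined in one pie() call, as in this sketch. The shares and label texts are hypothetical; autopct is the pie() parameter that formats the percentage text.

```python
import matplotlib.pyplot as plt
import numpy as np

shares = np.array([35, 25, 25, 15])               # hypothetical shares
labels = ["Apples", "Bananas", "Cherries", "Dates"]
explode = [0.2, 0, 0, 0]                          # pull out the first wedge only

wedges, texts, autotexts = plt.pie(
    shares,
    labels=labels,
    autopct="%1.1f%%",   # show each wedge's percentage with one decimal
    explode=explode,
    shadow=True,
)
plt.title("Fruit Sales")
plt.show()
```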

51
Matplotlib Pie Charts
 Explode with Shadow

52
Python in Business Analytics

53
Class Objectives

Revisit important concepts of machine learning


 What is Machine Learning?
 What is linear regression and how does it work?
 What are the evaluation metrics for linear regression?
 How to train and interpret a linear regression model using Scikit-learn?

54
What is Machine Learning?

“Machine Learning is a field of study that gives computers the ability
to learn without being programmed.”
---- Samuel, A. (1959)

“An umbrella of a specific set of algorithms that all have one specific
purpose to learn to detect certain patterns from data.”
---- neurospace

55
What is Machine Learning?

 Machine Learning is making the computer learn from studying data and statistics
 A computer program that analyses data and learns to predict the outcome

https://www.youtube.com/watch?v=nKW8Ndu7Mjw

56
What is Machine Learning?

 Semi-automated extraction of
knowledge from data
 Start with questions that might be
answerable using data
 Exploit different types of models to
provide insights into the data using
computers
 Still requires human judgement and
decision-making

57
Two Main Categories of ML

 Supervised Learning
 Is an email spam? How do sales differ
by gender?
 There is a specific outcome we are trying to
predict (Label)
 Unsupervised learning
 Extracting structure from data, or finding the
best representation of the data
 Segment grocery shoppers to clusters with
similar behavior
 Recommend movies/music based on past
viewing data
 There is no right or wrong answer
58
Two Main Categories of ML

[Figure: supervised learning vs. unsupervised learning]


59
How Does Supervised Learning Work?

 Step One: Model training (train a machine learning model using labeled data)

 Step Two: Model prediction on new data (for which the label is unknown)

 Step Three: Evaluate the accuracy of the model (percentage of correct
predictions using labeled data)
60
Some Concepts
                         Features           Target/Label
Dataset    Student ID    Attendance   GPA   Grade (real value)   Predicted value
Training   20465532      0.95         4.0   95
Training   20339901      0.82         3.8   88
Training   20567789      0.5          2.2   60
Training   20339912      1.0          3.5   98
…          …             …            …     …
Testing    20429981      0.90         3.9   93                   90
Testing    20890012      0.89         2.5   85                   86

61
Supervised Learning Terminology

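Features – the values we observe and use to predict the response

 Also known as predictors, attributes, inputs, independent variables
 Each row of feature values is an observation (also known as a sample, instance, or record)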

Response – the value we try to predict


 Also known as target, outcome, label, dependent variable

 The task of supervised learning


 For a new observation, given features, we want to predict the label of this
observation

62
Supervised Learning Algorithms

Supervised learning in which the response is continuous
 Linear regression

Supervised learning in which the response is categorical
 Logistic Regression
 K-nearest Neighbors Classifiers
 Naïve Bayes Classifier

63
Linear Regression

A machine learning model that can be used to predict continuous variables,
such as sales, stock prices, and amounts
 Runs quickly
 No tuning required
 Easily understandable
 It is well-known and well-documented

64
Linear Regression

It assumes a linear relationship between features and response

So it may not generate good predictions if the underlying relationship is
nonlinear

65
Linear Regression Example

Assume a person’s IQ is jointly determined by his/her father’s IQ and
his/her mother’s IQ according to the following regression function:

66
Linear Regression Example

 Question 1: how do we get this function?

 Question 2: how accurate is the prediction? (i.e., how to evaluate?)



67
Evaluation Metrics For Regression

Mean absolute error (MAE) – the mean of the absolute value of the errors

Mean squared error (MSE) – the mean of the squared errors

Root mean squared error (RMSE) – the square root of the mean of the
squared errors
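The three metrics can be computed by hand with NumPy, as in this sketch. It uses the two testing rows from the Some Concepts table (real grades 93 and 85, predicted grades 90 and 86); the variable names are chosen for illustration.

```python
import numpy as np

y_true = np.array([93, 85])   # real values
y_pred = np.array([90, 86])   # predicted values

errors = y_true - y_pred            # [3, -1]

mae = np.mean(np.abs(errors))       # (3 + 1) / 2 = 2.0
mse = np.mean(errors ** 2)          # (9 + 1) / 2 = 5.0
rmse = np.sqrt(mse)                 # sqrt(5), about 2.236

print(mae, mse, rmse)
```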

68
Scikit-learn for Model Development

69
Scikit-learn Requirement

Features and response are separate objects

Features and response should be NumPy arrays

 Pandas DataFrames and Series are built on top of NumPy arrays, so they can be used as well

Features and response should have specific shapes

70
Scikit-learn 7-step Modeling

 Step 1: define features and response columns
 Step 2: split data into training vs. test data
 Step 3: import the model you want to use
 Step 4: instantiate the model
 Step 5: fit your model with training data
 Step 6: make predictions for the test data
 Step 7: estimate the accuracy of the model
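The seven steps above can be sketched with a linear regression as follows. The data set is synthetic and generated only for illustration (the column names, coefficients, noise level, and the 25% test split are all assumptions), so this is not the notebook used in class.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression      # Step 3: import the model
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): grade = 20 + 50*attendance + 8*gpa + noise
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "attendance": rng.uniform(0.5, 1.0, n),
    "gpa": rng.uniform(2.0, 4.0, n),
})
df["grade"] = 20 + 50 * df["attendance"] + 8 * df["gpa"] + rng.normal(0, 2, n)

# Step 1: define features (2-D) and response (1-D)
X = df[["attendance", "gpa"]]
y = df["grade"]

# Step 2: split data into training vs. test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Step 4: instantiate the model
model = LinearRegression()

# Step 5: fit the model with training data
model.fit(X_train, y_train)

# Step 6: make predictions for the test data
y_pred = model.predict(X_test)

# Step 7: estimate the accuracy of the model
mae = metrics.mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print(model.intercept_, model.coef_, mae, rmse)
```

The fitted intercept_ and coef_ give the regression function; a lower MAE or RMSE on the test data means more accurate predictions.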

71
Jupyter Notebook

72
