Python Pandas and Matplotlib 7
Pandas
Pandas Introduction
What is Pandas?
Pandas is a Python library used for working with data sets
It has functions for analyzing, cleaning, exploring, and manipulating data
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis"; the library was created by Wes McKinney in 2008
Anaconda already has Pandas installed, so no additional installation is needed
Installation of Pandas
If Python and pip are already installed on your system, installing Pandas takes a single command: pip install pandas
If you have Python but have not installed pip, install pip first (see https://phoenixnap.com/kb/install-pip-windows), and then install Pandas
Let’s create a Pandas data frame first:
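The code from this slide was not preserved, so the following is only a minimal sketch of creating a DataFrame from a dictionary. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column name
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}

# Build the DataFrame; pandas adds a default row index 0, 1, 2
df = pd.DataFrame(data)
print(df)
```

Printing the DataFrame shows the index on the left and one column per dictionary key.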
[Figure: the printed DataFrame, with its rows, index, and data labeled]
Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns
Pandas uses the .loc attribute, indexed with square brackets as in df.loc[0], to return one or more specified rows
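A short sketch of row lookup with .loc, using a small made-up DataFrame (the column names and values are assumptions, not the slide's data):

```python
import pandas as pd

# Hypothetical data with the default index 0, 1, 2
df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# A single index label returns that row as a Series
first_row = df.loc[0]

# A list of labels returns the selected rows as a DataFrame
first_two = df.loc[[0, 1]]

print(first_row)
print(first_two)
```

Note the square brackets: .loc is indexed, not called like a function.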
Locate Column
You can access the values of a column by its name, for example df["column_name"]
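As an illustration of column access, here is a minimal sketch on made-up data (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# One column name in square brackets returns a Series
durations = df["duration"]

# A list of column names returns a DataFrame with just those columns
subset = df[["duration", "calories"]]

print(durations)
print(subset)
```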
Locate Items
You can locate a specific item in the data frame with .at[row_index, column_name]
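A small sketch of .at on made-up data. It reads a single cell and, as an aside not shown on the slide, can also assign to one:

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# Read the cell in row 1, column "calories"
value = df.at[1, "calories"]
print(value)

# .at can also be used to overwrite a single cell
df.at[1, "calories"] = 400
```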
Pandas Read CSV
A simple way to store big data sets is to use CSV files (comma-separated values).
CSV files contain plain text in a well-known format that can be read by almost any tool, including Pandas
The data that we created above could also be imported directly from a .csv file
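The round trip can be sketched as follows. The file name data.csv and the data are assumptions. The file is written to a temporary directory so the example is self-contained.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# Write the hypothetical data to a CSV file in a temporary directory
csv_path = os.path.join(tempfile.mkdtemp(), "data.csv")
df.to_csv(csv_path, index=False)  # index=False: do not store the row index

# Read the file back into a new DataFrame
loaded = pd.read_csv(csv_path)
print(loaded)
```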
Load Files Into a DataFrame
read_csv() parameters
You can control how the data is read by adjusting the parameters of read_csv()
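The slide's parameter list was not preserved, so this sketch shows only a few commonly used read_csv() parameters. The semicolon-separated text is made up and stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated content standing in for a real file
raw = "id;name;score\n1;Ann;90\n2;Bob;85\n3;Cy;70\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                  # field delimiter (the default is ",")
    index_col="id",           # use the id column as the row index
    usecols=["id", "score"],  # read only these columns
    nrows=2,                  # read only the first two data rows
)
print(df)
```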
Data Cleaning
Note: By default, the dropna() method returns a new DataFrame, and will not change the original
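The note about dropna() can be checked with a short sketch. The data below is made up and contains two missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values (NaN) in two different rows
df = pd.DataFrame({"calories": [420, np.nan, 390], "duration": [50, 40, np.nan]})

# dropna() returns a new DataFrame without the rows that contain NaN
cleaned = df.dropna()

print(df.shape)       # the original still has all three rows
print(cleaned.shape)  # only the complete row remains

# To replace the original, assign the result back: df = df.dropna()
```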
Discovering Duplicates
Duplicate rows are rows that have been registered more than once
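One way to find and remove duplicate rows is sketched below with duplicated() and drop_duplicates(). The data is made up, and the slide's own example may have differed.

```python
import pandas as pd

# Hypothetical data: the third row repeats the first
df = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "score": [90, 85, 90]})

# duplicated() marks each row that repeats an earlier row
mask = df.duplicated()
print(mask)

# drop_duplicates() returns a new DataFrame without the repeated rows
unique_rows = df.drop_duplicates()
print(unique_rows)
```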
Python Matplotlib
What is Matplotlib?
Installation of Matplotlib
A Python distribution such as Anaconda already has Matplotlib installed
Import Matplotlib
Once Matplotlib is installed, import it in your applications by adding the import statement: import matplotlib
Most of the Matplotlib utilities lie under the pyplot submodule, which is usually imported under the plt alias: import matplotlib.pyplot as plt
First Plot Example
Matplotlib Plotting
You can plot as many points as you like, just make sure you have the same number of
points on both axes
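A minimal sketch of plotting points, with made-up coordinates. The Agg backend line is an assumption so the example runs without a display. In an interactive session you would drop it and call plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption for headless runs)
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical points: x and y must contain the same number of values
xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])

fig = plt.figure()
(line,) = plt.plot(xpoints, ypoints)  # plot() returns a list of Line2D objects
# plt.show() would display the figure in an interactive session
```

If the two arrays have different lengths, plt.plot() raises a ValueError.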
Default X-Points
If we do not specify the points on the x-axis, they will get the default values 0, 1, 2, 3, ...
(depending on the length of the y-points)
So, if we take the same example as above and leave out the x-points, the diagram will
look like this
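The default x-values can be inspected directly, as in this sketch with made-up y-points (the Agg backend line is an assumption for running without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption for headless runs)
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])

fig = plt.figure()
(line,) = plt.plot(ypoints)  # no x-points given

# The x-values default to 0, 1, 2, 3 (one per y-point)
print(line.get_xdata())
```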
Matplotlib Plotting
You can use the keyword argument marker to emphasize each point with a specified
marker
You can use the keyword argument linestyle, or shorter ls, to change the style of the
plotted line
You can use the keyword argument color or the shorter c to set the color of the line
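The marker, linestyle, and color keyword arguments can be combined in one call. This sketch uses made-up data and arbitrary style choices, and the Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption for headless runs)
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])

fig = plt.figure()
# Long keyword names: circle markers, dotted line, red color
(line_a,) = plt.plot(ypoints, marker="o", linestyle="dotted", color="r")

# Short aliases ls and c: star markers, dashed line, green color
(line_b,) = plt.plot(ypoints + 2, marker="*", ls="--", c="green")
```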
Multiple Lines: You can plot as many lines as you like by simply adding more plt.plot()
calls
Note: we only specified the points on the y-axis, meaning that the points on the x-axis got the default values (0, 1, 2, 3, ...)
You can also plot many lines by adding the points for the x- and y-axis for each line in
the same plt.plot() function, so that the x- and y- values come in pairs.
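Both ways of drawing several lines are sketched here with made-up values (the Agg backend line is an assumption for running without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption for headless runs)
import matplotlib.pyplot as plt
import numpy as np

y1 = np.array([3, 8, 1, 10])
y2 = np.array([6, 2, 7, 11])

# Option 1: one plt.plot() call per line (x defaults to 0, 1, 2, 3)
fig1 = plt.figure()
plt.plot(y1)
plt.plot(y2)

# Option 2: x/y pairs for both lines in a single plt.plot() call
x1 = np.array([0, 1, 2, 3])
x2 = np.array([0, 1, 2, 3])
fig2 = plt.figure()
lines = plt.plot(x1, y1, x2, y2)
```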
Matplotlib Labels and Title
With Pyplot, you can use the xlabel() and ylabel() functions to set a label for the x- and
y-axis
With Pyplot, you can use the title() function to set a title for the plot
With Pyplot, you can use label="name" within the plt.plot() method to label a line
Remember to call the legend() function, otherwise the labels are not displayed
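Axis labels, a title, and a legend can be put together as in this sketch. The label texts and data are made up, and the Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption for headless runs)
import matplotlib.pyplot as plt
import numpy as np

weeks = np.array([1, 2, 3, 4])
sales_a = np.array([3, 8, 1, 10])
sales_b = np.array([6, 2, 7, 11])

fig = plt.figure()
plt.plot(weeks, sales_a, label="Product A")  # label is used by the legend
plt.plot(weeks, sales_b, label="Product B")

plt.xlabel("Week")
plt.ylabel("Units sold")
plt.title("Weekly sales")
legend = plt.legend()  # without this call the labels are not drawn

ax = plt.gca()  # current axes, handy for inspecting what was set
```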
Add Grid Lines to a Plot
With Pyplot, you can use the grid() function to add grid lines to the plot
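A brief sketch of grid() on made-up data (the Agg backend line is an assumption for running without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption for headless runs)
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])

fig = plt.figure()
plt.plot(ypoints)
plt.grid()  # grid lines for both axes
# plt.grid(axis="y") would draw only the horizontal grid lines

ax = plt.gca()
```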
Matplotlib Subplot
Display Multiple Plots (one row, two columns)
Display Multiple Plots (two rows, one column)
Display Multiple Plots (as many as you want)
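The subplot layouts above can be sketched with plt.subplot(rows, columns, index). The data and titles are made up, and the Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption for headless runs)
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])

# One row, two columns: the third argument selects the plot being drawn
fig_row = plt.figure()
plt.subplot(1, 2, 1)
plt.plot(x, y)
plt.title("First")
plt.subplot(1, 2, 2)
plt.plot(x, y * 2)
plt.title("Second")

# Two rows, three columns: six plots in one figure
fig_grid = plt.figure()
for i in range(6):
    plt.subplot(2, 3, i + 1)  # subplot indices start at 1
    plt.plot(x, y * (i + 1))
plt.suptitle("Six plots")
```

For two rows and one column, use plt.subplot(2, 1, 1) and plt.subplot(2, 1, 2).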
Matplotlib Scatter
Creating Scatter Plots
With Pyplot, you can use the scatter() function to draw a scatter plot
The scatter() function plots one dot for each observation. It needs two arrays of the
same length, one for the values of the x-axis, and one for values on the y-axis
Change color and size of the markers
Note: make sure the arrays for colors and sizes have the same
length as the arrays for the x- and y-axis
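A sketch of a scatter plot with per-point colors and sizes, passed through the c and s keyword arguments. All values are made up, and the Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption for headless runs)
import matplotlib.pyplot as plt
import numpy as np

# Five observations: every array below has five entries
x = np.array([5, 7, 8, 7, 2])
y = np.array([99, 86, 87, 88, 111])
colors = ["red", "green", "blue", "orange", "purple"]
sizes = np.array([20, 50, 100, 200, 500])

fig = plt.figure()
points = plt.scatter(x, y, c=colors, s=sizes)  # one dot per observation
```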
Matplotlib Bars
With Pyplot, you can use the bar() function to draw bar graphs
You can use plt.savefig("path") to save the generated image
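A minimal sketch of drawing a bar graph and saving it. The categories, values, and the file name bars.png are assumptions. The Agg backend line lets the example run without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption for headless runs)
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]
values = [3, 8, 1, 10]

fig = plt.figure()
bars = plt.bar(categories, values)  # one vertical bar per category

# Save the current figure; the file extension selects the image format
out_path = os.path.join(tempfile.mkdtemp(), "bars.png")
plt.savefig(out_path)
```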
Horizontal Bars
If you want the bars to be displayed horizontally instead of vertically, use the barh()
function
Use the keyword argument color to set the color of the bars
Use the keyword argument width (for bar()) or height (for barh()) to set the thickness of the bars
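The color and thickness arguments are sketched below for both orientations. The data and the value 0.3 are made up, and the Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption for headless runs)
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]
values = [3, 8, 1, 10]

# Horizontal bars: thickness is controlled by height
fig_h = plt.figure()
hbars = plt.barh(categories, values, color="red", height=0.3)

# Vertical bars: thickness is controlled by width
fig_v = plt.figure()
vbars = plt.bar(categories, values, color="green", width=0.3)
```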
Matplotlib Histograms
Histogram
A histogram is a graph showing frequency distributions
It is a graph showing the number of observations within each given interval
Say you ask for the height of 250 people, you might end up with a histogram like
this:
Create Histogram
In Matplotlib, we use the hist() function to create histograms
The hist() function takes an array of numbers, passed in as an argument, and builds the
histogram from it
We use NumPy to randomly generate an array of 250 values drawn from a normal
distribution with mean 170 and standard deviation 10
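A sketch of the histogram described above. The seeded generator is an assumption made for reproducibility, and the Agg backend line lets the example run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption for headless runs)
import matplotlib.pyplot as plt
import numpy as np

# 250 values from a normal distribution: mean 170, standard deviation 10
rng = np.random.default_rng(seed=0)  # fixed seed so runs are repeatable
heights = rng.normal(loc=170, scale=10, size=250)

fig = plt.figure()
# hist() returns the count per bin, the bin edges, and the drawn bars
counts, bin_edges, patches = plt.hist(heights)

print(counts)  # number of observations within each interval
```

By default hist() uses 10 bins, so there are 11 bin edges.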
Matplotlib Pie Charts
Create Pie Charts
With Pyplot, you can use the pie() function to draw pie charts
By default, the first wedge starts at the x-axis and the wedges are drawn
counterclockwise
Labels, titles, and percentages
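A pie chart with labels, a title, and percentages can be sketched as follows. The shares and names are made up, autopct is the pie() parameter that formats the percentage text, and the Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption for headless runs)
import matplotlib.pyplot as plt

shares = [35, 25, 25, 15]
names = ["Apples", "Bananas", "Cherries", "Dates"]

fig = plt.figure()
# With autopct set, pie() also returns the percentage text objects
wedges, texts, autotexts = plt.pie(shares, labels=names, autopct="%1.0f%%")
plt.title("Fruit share")

print([t.get_text() for t in autotexts])
```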
Explode
Maybe you want one of the wedges to stand out? The explode parameter allows
you to do that
The explode parameter, if specified, and not None, must be an array with one value
for each wedge
Each value represents how far from the center each wedge is displayed
Explode with Shadow
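A sketch of explode combined with shadow=True. The data and the offset 0.2 are made up, and the Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption for headless runs)
import matplotlib.pyplot as plt

shares = [35, 25, 25, 15]
names = ["Apples", "Bananas", "Cherries", "Dates"]

# One value per wedge: pull only the first wedge 0.2 away from the center
explode = [0.2, 0, 0, 0]

fig = plt.figure()
wedges, texts = plt.pie(shares, labels=names, explode=explode, shadow=True)
```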
Python in Business Analytics
Class Objectives
What is Machine Learning?
“An umbrella of a specific set of algorithms that all have one specific
purpose to learn to detect certain patterns from data.”
---- neurospace
Machine Learning is making the computer learn from studying data and statistics
A computer program that analyses data and learns to predict the outcome
https://www.youtube.com/watch?v=nKW8Ndu7Mjw
Semi-automated extraction of knowledge from data
Start with questions that might be answerable using data
Exploit different types of models to provide insights into the data using computers
Still requires human judgement and decision-making
Two Main Categories of ML
Supervised Learning
Is an email spam? How do sales differ by gender?
There is a specific outcome we are trying to predict (Label)
Unsupervised Learning
Extracting structure from data, or finding the best way to represent it
Segment grocery shoppers into clusters with similar behavior
Recommend movies/music based on past viewing data
There is no right or wrong answer
Supervised Learning Terminology
[Figure labels: Real Value, Predicted Value]
Supervised Learning Algorithms
Linear Regression
Linear Regression Example
Evaluation Metrics For Regression
Mean absolute error (MAE) – the mean of the absolute values of the errors
Root mean squared error (RMSE) – the square root of the mean of the squared errors
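Both metrics can be computed directly from their definitions, as in this NumPy sketch. The real and predicted values are made up for illustration.

```python
import numpy as np

# Hypothetical real values and model predictions
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred  # [0.5, 0.0, -1.5, -1.0]

# MAE: mean of the absolute values of the errors
mae = np.mean(np.abs(errors))  # (0.5 + 0 + 1.5 + 1) / 4 = 0.75

# RMSE: square root of the mean of the squared errors
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt(3.5 / 4) = sqrt(0.875)

print(mae, rmse)
```

Because the errors are squared before averaging, RMSE penalizes large errors more heavily than MAE does.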
Scikit-learn for Model Development
Scikit-learn Requirement
Scikit-learn 7-step Modeling
Jupyter Notebook