Machine Learning with Python
A Practical Introduction
Jadi for Maktabkhooneh / 1400
Machine Learning and Python
• Based on the IBM course on edX, taught by
Saeed Aghabozorgi, Data Scientist at IBM.
Welcome
• We are going to talk about ML and how it helps in different areas (loans,
segmentation, medicine, recommendations, …)
• We will use Python libraries to create models, say building a model to estimate
the CO2 emissions of cars using scikit-learn, or predicting customer churn
• All code is provided in Jupyter notebooks
• After this course you will have new skills like regression, classification,
clustering, scikit-learn, NumPy, pandas AND new projects, especially if you start
working on datasets that are freely available on the internet.
Intro
• Say we want to understand if this is a fraudulent
transaction or not, is this a malignant or benign
cell, what should I show next to this customer, …
• This can be done by ML by looking at some
characteristics of the data.
• clean the data
• select the proper algorithm
• train the model
• predict new cases
Intro
Machine Learning is the subfield of computer
science that gives “computers the ability to learn
without being explicitly programmed”
- Arthur Samuel, who coined the phrase in 1959
Intro
Examples:
- CO2 emission (Regression)
- Is this cancer? (Classification)
- Bank Loans (Clustering)
- Anomaly detection (credit card fraud)
- Netflix recommendations (recommenders)
Intro
AI (mimics human intelligence)
- Computer vision
- Language Processing
- Creativity
ML (Subset of AI, more statistical)
- Classification
- Clustering
- Neural Network
Revolution in ML (a special field of ML)
- Deep Learning
Intro
Python
- You should know the basics
- It is easy
- Libraries like NumPy & pandas + scikit-learn
are used, with a quick intro
- This course is Python-based. You could do it
with anything else… but why? :D
Supervised vs.
Unsupervised
- Supervised: we “teach the model” with labeled data, and only then can the model
predict unknown or future instances
Supervised vs.
Unsupervised
- Unsupervised: the model works on its own to discover information.
Regression
Regression Intro
• regression is the process of
predicting a continuous value
• Independent (x, descriptor, ...) vs.
dependent (y, goal, prediction,
...) variables
• y is continuous
Regression Intro
Model
Regression Intro
Types
• Simple (only one independent)
• Linear
• Non-Linear
• Multiple (multiple independent)
• Linear
• Non-Linear
Regression Intro
Samples
• Household Price
• Customer Satisfaction
• Sales Forecast
• Employment Income
Regression Intro
Algorithms
• Ordinal
• Poisson
• Fast Forest quantile
• Linear, Polynomial, Lasso, Stepwise, Ridge
• Bayesian Linear
• Neural Network
• Decision Forest
• Boosted decision tree
• K-nearest neighbors
Simple Linear
Regression
• Can we predict CO2 emission
from just one of the independent
variables? (this is why we call it Simple)
• Let's try engine size...
Simple Linear
Regression
• The relationship is obvious
• There is a line; we assume a straight line
• We can predict the emission for, say, a car with
an engine size of 2.4
• ŷ is the predicted value of the dependent
variable
• x1 is the independent variable
• Theta 0 and theta 1 are the parameters of the line
• Theta 1 is known as the slope or gradient of the
fitted line and theta 0 is known as the intercept
• Theta 0 and theta 1 are also called the
coefficients of the linear equation
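As a minimal sketch of fitting this with scikit-learn (the column names follow the course's fuel-consumption dataset, but treat the path and names as assumptions):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("FuelConsumption.csv")     # assumed dataset path
    X = df[["ENGINESIZE"]]                      # independent variable x1
    y = df["CO2EMISSIONS"]                      # dependent variable y
    model = LinearRegression().fit(X, y)
    print(model.intercept_, model.coef_)        # theta 0 and theta 1
    print(model.predict([[2.4]]))               # predicted emission for engine size 2.4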
Simple Linear
Regression
MSE
• The residual error for each point is the
distance of the prediction from the
actual point, so the Mean Squared
Error (MSE) should be minimized
• Minimum MSE can be achieved
with two methods: Math or
Optimization
Simple Linear
Regression
MSE (Math)
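For reference, the math behind this slide (the standard least-squares formulas for the simple linear case):

    MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2

    \theta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
    \qquad \theta_0 = \bar{y} - \theta_1 \bar{x}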
Simple Linear
Regression
Pros
• Very Fast
• Easy to understand and interpret
• No need for parameter tuning
(say like in KNN)
Model Evaluation
• The goal is to build a model that accurately
predicts unknown cases.
• You need to evaluate it to see how
much you can trust your
model/predictions
• Two main methods:
• Train and Test on Same data
• Train / Test split
• Regression Evaluation Metrics
Model Evaluation
Train and Test on Same data
• High "training accuracy"
• not always good
• overfitting the data
(say capture noise and
produce non
generalized model)
• Low "out of sample
accuracy"
• Important to have
Model Evaluation
Train/Test split
• Mutually exclusive split
• More accurate on
out-of-sample data
• Afterwards, retrain your
model with the testing set
included, as you don't
want to lose potentially
valuable data
• Results depend on which
rows the data happens to be
trained and tested on
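A minimal sketch of the split with scikit-learn (the 80/20 ratio is an illustrative choice):

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    # hold out 20% of the rows as a mutually exclusive test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_train, y_train)   # train only on the training set
    print(model.score(X_test, y_test))                 # out-of-sample R^2 on unseen rows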
Model Evaluation
Evaluation Metrics
• Used to explain the performance of a
model
• Say, comparing actual values with predicted ones
• The error of the model is the difference
between the data points and the
trend line generated by the algorithm
• There are different metrics (next slide),
but the choice is based on the
model, data type, domain, ...
Model Evaluation
Errors
• Mean absolute error (MAE)
• Mean squared error (MSE)
• Root mean squared error (RMSE); interpretable in the
same units as the response vector (y units)
• Relative absolute error (RAE): the total absolute error
normalized by that of a trivial predictor (the mean)
• Relative squared error (RSE)
• R²: popular metric for the accuracy of your model;
represents how close the data values are to the fitted
regression line. The higher, the better
Let's see some
libraries!
• Notebook
• Numpy
• Matplotlib
• pandas
Lab: Simple Linear
Regression
• ML0101EN-Reg-Simple-Linear-Regression-Co2.ipynb
Multiple Linear
Regression
• Simple / Multiple
• Much the same as simple
• Usages:
• find the strength of the effect of each independent variable
• predict the impact of a change in one of the independent
variables
Multiple Linear
Regression
Formula
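The formula on this slide is the standard multiple-linear form:

    \hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta^T X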
Multiple Linear
Regression
Finding parameters
• Again we can compute the MSE
• The best model is the one with the minimized MSE
• One method is called Ordinary Least Squares (OLS)
• linear algebra
• fine for small data, but slow for large datasets
(say, more than 10K samples)
• The other is optimization algorithms
• Gradient Descent (starts with random parameters, then
changes them over multiple iterations)
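A minimal NumPy sketch of gradient descent for linear regression (the learning rate and iteration count are arbitrary illustrative choices):

    import numpy as np

    def gradient_descent(X, y, lr=0.01, iters=1000):
        """Minimize MSE for y ≈ X @ theta (bias column added here)."""
        Xb = np.c_[np.ones(len(X)), X]                   # add x0 = 1 for theta_0
        theta = np.random.randn(Xb.shape[1])             # start with random parameters
        for _ in range(iters):
            grad = 2 / len(y) * Xb.T @ (Xb @ theta - y)  # derivative of the MSE
            theta -= lr * grad                           # step against the gradient
        return theta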
Multiple Linear
Regression
Some notes
• Try to have a theoretical justification when choosing
the independent variables; too many Xs might
result in overfitting
• Xs do not need to be continuous. If they are not,
try to assign values (like 1 and 2) to the categories
• There needs to be a linear relationship. Test your
Xs with scatter plots or use your logic. If the
relationship displayed in your scatter plot is not
linear, then you need to use non-linear
regression.
Lab: Multiple Linear
Regression
• ML0101EN-Reg-Mulitple-Linear-Regression-Co2.ipynb
Non-Linear
Regression
Non Linear
regression
Polynomial
• Different types
• If you have x², you can
define a new feature as x2 = x². So it
can be represented as a special
case of multiple linear
regression. This is called
polynomial regression.
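A sketch of this trick with scikit-learn: PolynomialFeatures builds the x² column, then a plain linear model fits it (degree 2 is an illustrative choice):

    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    poly = PolynomialFeatures(degree=2)              # expands [x] into [1, x, x^2]
    X_poly = poly.fit_transform(X_train)             # the new "independent" columns
    model = LinearRegression().fit(X_poly, y_train)  # ordinary multiple linear fit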
Non Linear
regression
Non Linear
• Models a non-linear relationship
between the Xs and Y
• Y is a non-linear function of the
parameters Theta
Non Linear
regression
Notes
• How to tell if it is non-linear:
• plot Y against each X and look /
calculate the correlation coefficient
• use non-linear if you cannot solve
it with linear regression
• How to model the data if it is non-linear:
• polynomial regression
• non-linear regression
• transform the data!
Lab: Polynomial
Regression
• ML0101EN-Reg-Polynomial-Regression-Co2.ipynb
Lab: Non-Linear
Regression
• ML0101EN-Reg-NoneLinearRegression.ipynb
Part Three
Classification
Classification
Intro
• Understand Classification
• Understand different methods
such as KNN, Decision Trees,
Logistic Regression and SVM
• Apply on datasets
• Evaluate
Classification
Intro
• Supervised
• Categorizing unknown items into
classes
• The target is categorical with
discrete values (the model is called a classifier)
• Binary (2 values) vs. multi-class
Classification
Intro
• Loan (age, income, loan size,
previous records, ...)
• Churn (age, address, income,
equip, data usage, calls, ...)
• Spam / Important email
• Handwriting/Speech recognition
• Biometric identification
Classification
Intro
• Decision Trees (ID3, C4.5, C5.0)
• Naive Bayes
• Linear Discriminant Analysis
• K-Nearest Neighbor
• Logistic Regression
• Neural Networks
• Support Vector Machines
Classification
KNN
• Pick a K
• Calculate the distance of the unknown
point from all cases
• Predict based on the K nearest points
• How to measure "distance"? (Euclidean can
be one way)
• How to choose K? (low -> noise & overfitting;
high -> too general). Try different Ks
with the test set and see which K works best.
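A minimal scikit-learn sketch (K = 4 is just an illustrative starting point):

    from sklearn.neighbors import KNeighborsClassifier

    k = 4
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)  # Euclidean by default
    y_hat = knn.predict(X_test)   # majority label among the k nearest training points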
Classification
KNN
• KNN can also be used to compute a
continuous target (regression)
• Say, find the 3 closest cases
and take the median
Classification
KNN Evaluation
• Evaluation explains the
performance of our model
• On test data we have y and ŷ
• There are different model
evaluation metrics: Jaccard
index, F1-score, and Log Loss.
Classification
KNN Evaluation / Jaccard Index
• Jaccard Index
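For reference, the index on this slide: the size of the intersection divided by the size of the union of the actual and predicted label sets:

    J(y, \hat{y}) = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|}
                  = \frac{|y \cap \hat{y}|}{|y| + |\hat{y}| - |y \cap \hat{y}|}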
Classification
KNN Evaluation / F1-Score
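For reference, the standard formulas behind this slide: per class, precision and recall come from the confusion matrix and are combined into the F1-score:

    \text{Precision} = \frac{TP}{TP + FP}, \qquad
    \text{Recall} = \frac{TP}{TP + FN}, \qquad
    F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}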
Classification
KNN Evaluation / LogLoss
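For reference, log loss measures how far each predicted probability ŷ is from the actual 0/1 label y; lower is better:

    \text{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]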
Lab: KNN
Classification
Decision Trees / Intro
Internal node (test), branch (result
of test) & leaf (class)
1. Choose an attribute from the dataset
2. Calculate the significance of the
attribute in the splitting of the data
3. Split the data based on the value of the
best attribute
4. Repeat!
Classification
Decision Trees / Building
• Decision trees are built using recursive
partitioning to classify the data
(Cholesterol? Sex? ...)
• The best attribute? The one with the most information gain
• Information gain is the information that can increase the
level of certainty after splitting
• IG = entropy before the split - weighted entropy after the split
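A small NumPy sketch of the computation (the class counts are made-up numbers for illustration):

    import numpy as np

    def entropy(counts):
        p = np.array(counts) / np.sum(counts)
        return -np.sum(p * np.log2(p))       # assumes no zero counts

    # made-up example: 14 samples (9 vs 5) split into two branches of 7
    before = entropy([9, 5])
    after = (7/14) * entropy([6, 1]) + (7/14) * entropy([3, 4])  # weighted by branch size
    info_gain = before - after               # pick the attribute maximizing this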
Lab: Decision Trees
Classification
Logistic Regression/ Intro
• Who is leaving, and why?
• Close to regression, but
here Y is a categorical
(here, binary) value
• All Xs should be
continuous, or converted to
continuous values
Classification
Logistic Regression/ Intro
• Predicting a disease
• chance of mortality based on a situation
• halting a subscription
• purchase
• failure of a product
• ...
Classification
Logistic Regression/ Intro
• The target should be categorical (or better, binary)
• We need the probability of the prediction
• We need a linear decision boundary (a line, or
even a polynomial)
• We need to understand the impact of the
features (is a theta close to 0, or is it high?)
Classification
Logistic Regression vs Linear Regression
• On the previous data, try linear
regression with age vs. income
• Now repeat, trying age vs.
churn: it looks odd, and we would
need a step function as a threshold
Classification
Logistic Regression vs Linear Regression /
Sigmoid
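For reference, the sigmoid squeezes θᵀx into a (0, 1) probability, which is what replaces the step function:

    \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}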
Classification
Logistic Regression Training
• Cost function
• We have to minimize the cost
• Can be done via the derivative, but it is difficult
Classification
Logistic Regression Training
• We can define a new cost function!
• For it, there are more approaches to minimize
the function, say Gradient Descent (an iterative
technique)
Classification
Logistic Regression Training
• Gradient descent is
an iterative approach
to finding the
minimum of a
function. It uses the
derivative of a cost
function to change the
parameter values to
minimize the cost or
error.
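A minimal scikit-learn sketch (the C value and solver are illustrative choices; the library runs the optimization for you):

    from sklearn.linear_model import LogisticRegression

    lr = LogisticRegression(C=0.01, solver="liblinear").fit(X_train, y_train)
    y_hat = lr.predict(X_test)          # predicted class labels
    y_prob = lr.predict_proba(X_test)   # the probabilities that log loss evaluates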
Lab: Logistic Regression
Classification
Support Vector Machines
• Supervised
• A classifier based on a
separator
• Maps data to a
high-dimensional space so a
hyperplane separator can
be drawn
• Lots of real-world data is
linearly non-separable,
but what if we go to a
higher dimension? ;)
Classification
Support Vector Machines
• But… how do we move
to n dimensions?
• There are different
kernel functions
• Our libraries will do it;
we will just compare kernels
• How do we find the
hyperplane?
Classification
Support Vector Machines
• To find the
hyperplane, we look
for the largest
margins from the
support vectors
• Can also be solved
using gradient
descent
• Once learned, we
can just check a new
data point, see if it is
above the line or
below it, and decide
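A minimal scikit-learn sketch; the RBF kernel is one common choice to compare against others such as 'linear' or 'poly':

    from sklearn import svm

    clf = svm.SVC(kernel="rbf").fit(X_train, y_train)  # the library handles the kernel mapping
    y_hat = clf.predict(X_test)                        # which side of the hyperplane?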
Classification
Support Vector Machines
• Pros
• accurate in high dimensional spaces
• memory efficient
• Cons
• Prone to over-fitting if we have lots of features
• No probability estimation
• Not computationally efficient for large datasets (n > 1000)
Classification
Support Vector Machines
• Image recognition
• Text Category Assignment
• spam
• category
• sentiment analysis
• Gene Expression Classification
• Outlier detection and clustering
Lab: SVM
Clustering
Intro
• Partitioning a customer base into groups of individuals based on shared
characteristics
• Allows a business to target different groups (high profit & low risk, …)
• We can cross-reference the groups with their purchases
Clustering
Intro
• Finding “clusters” in datasets; unsupervised
• Cluster: a group of data points or objects in a
dataset that are similar to other objects in the
group, and dissimilar to data points in other
clusters
• Different from classification:
• the data does not need to be labeled
• prediction is not the goal
Clustering
Intro / Samples
• Retail & Marketing: identify buying patterns / recommendation systems
• Banking: Fraud detection / identify clusters (loyal, churn, …)
• Insurance: Fraud detection / Risk
• Publication: auto-categorize / recommend
• Medicine: characterize behaviour
• Biology: group genes / cluster genetic markers (family ties)
Clustering
Intro / Where
• Exploratory data analysis
• summary generation
• outlier detection
• finding duplicates
• pre-processing step
Clustering
Intro / algorithms
• Partitioned-based (K-means, K-Median, Fuzzy
c-means, …): sphere like clusters / Medium or large
data
• Hierarchical (Agglomerative, Divisive): Trees of clusters
/ small size datasets
• Density-based (DBSCAN): arbitrary shaped / good for
special clusters or noisy data
Clustering
K Means
• Unsupervised; divides data into K non-overlapping
subsets/clusters without any internal cluster structure
Clustering
K Means
• We need to understand the similarity
and dissimilarity.
• Goal: minimize intra-cluster
distances ( Dis(x1, x2) ) and
maximize inter-cluster distances
( Dis(c1, c2) )
• It is always good to Normalize!
• different formulas: Euclidean,
Cosine, Average distance, … so first
understand the domain knowledge
Clustering
K Means
• Decide the number of clusters (K)
• Initialize K “centroids” with:
• random points from the dataset, or
• completely random points
• Assign each customer to the closest centroid, creating the
distance matrix
• Update each centroid to the mean of its data points
• Continue until the centroids stop moving
• Notes:
• iterative
• does not guarantee the best result: it may get stuck in a local
optimum, but it is fast, so we can run it many times!
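A minimal scikit-learn sketch (K = 3 is illustrative; n_init reruns the algorithm with different random initializations, which is exactly the "run it many times" note above):

    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=3, n_init=12).fit(X)
    labels = km.labels_               # cluster assignment per data point
    centers = km.cluster_centers_     # final centroid positions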
Clustering
K Means / More Points
• Review the algorithm
• But how can we evaluate it?
• External: compare with the ground truth
• Internal: average distance between data points within a cluster, or the
distance between clusters
• Choosing K is difficult, so we run with different Ks and check the accuracy
(say, the mean distance inside a cluster), BUT increasing K always
reduces this. So we use the elbow method.
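A sketch of the elbow method: plot the within-cluster distance against K and look for the bend (inertia_ is scikit-learn's sum of squared distances to the closest centroid):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    inertias = [KMeans(n_clusters=k, n_init=12).fit(X).inertia_ for k in range(1, 10)]
    plt.plot(range(1, 10), inertias, marker="o")   # the "elbow" suggests a good K
    plt.xlabel("K"); plt.ylabel("within-cluster sum of squares")
    plt.show()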
Clustering
K Means / More Points
• Partition based
• unsupervised
• medium and large datasets (relatively efficient)
• sphere like clusters
• K should be known / guessed
Clustering
K Means / LAB
Clustering
Hierarchical / Intro
• 48,000 genetic markers produce this chart
based on similarity
• A hierarchy of clusters, where each node is a
cluster consisting of the clusters of its
daughter nodes
Clustering
Hierarchical / Intro
• Divisive is top down, so you start with
all observations in a large cluster and
break it down into smaller pieces.
• Agglomerative is the opposite of
divisive. So it is bottom up, where
each observation starts in its own
cluster and pairs of clusters are
merged together as they move up the
hierarchy.
Clustering
Hierarchical / Intro
• Finding the similarity of
city locations in
Canada
• The dendrogram's y-axis is
the similarity (distance) level
• We can cut the dendrogram at
some height to get
N clusters (say 3)
Clustering
Hierarchical / More
• We should be able to calculate distances between data
points (again say age, BMI, BP)
• but also need the distance “between” clusters:
• Single Linkage Clustering: Minimum distance
• Complete Linkage Clustering: Maximum distance
• Average Linkage Clustering: average of distances from
each point to all other points
• Centroid Linkage Clustering: centroids of clusters
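A minimal SciPy sketch (average linkage is one of the four options above; X is an assumed feature matrix):

    from scipy.cluster import hierarchy

    Z = hierarchy.linkage(X, method="average")  # average linkage clustering
    hierarchy.dendrogram(Z)                     # the tree of merges; cut it to get N clusters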
Clustering
Hierarchical / More
• Pros
• Works when the number of clusters is unknown
• Easy to implement
• Produces useful dendrograms; good for understanding the data
• Cons
• The algorithm can never undo a previous step
• Long runtimes
• Sometimes difficult to identify the number of clusters
from the dendrogram (especially for large datasets)
Clustering
Hierarchical / More
• Hierarchical vs. K-Means
• Can be slower
• Does not require the number of clusters to run
• Gives more than one partitioning (depending on where you cut)
• Always generates the same clusters
Clustering
Hierarchical / Lab
Clustering
DBSCAN
• K-Means will assign
every data point to a
cluster; there are no outliers
• Density-based clustering
finds dense areas
and separates out the
outliers. Good for
anomaly detection
• Density: the number of
points within a radius
Clustering
DBSCAN
• The DBSCAN algorithm is
effective for tasks like
class identification
• Effective even in the
presence of noise
• e.g., grouping locations with
similar weather in dense areas
Clustering
DBSCAN
• Two parameters: R (the radius of the neighborhood) and M
(the minimum number of points within the radius R)
Clustering
DBSCAN
• Point types:
• Core: within the
neighborhood of the point
there are at least M
points
• Border:
• fewer than M points in its
neighborhood
• reachable from a core
point
• Outlier: neither a core
point nor a border point
Clustering
DBSCAN
• Arbitrarily shaped clusters
• Robust to outliers
• Does not require specification of the number of
clusters
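A minimal scikit-learn sketch (eps plays the role of the radius and min_samples the role of M; the values are illustrative):

    from sklearn.cluster import DBSCAN

    db = DBSCAN(eps=0.3, min_samples=5).fit(X)
    labels = db.labels_    # cluster index per point; -1 marks outliers (noise)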
Clustering
DBSCAN / Lab
Recommenders
Intro
• People's tastes follow patterns (say, books)
• Recommender systems capture the patterns of people's behaviour and use them to
predict what else they might want or like
• Many applications: Netflix, Amazon, Facebook, Twitter, news, Digikala,
SnapFood
• Broader exposure -> more usage
Recommenders
Intro / types
• Memory Based
• Uses the entire user-item dataset to generate a recommendation
• Uses statistical techniques to approximate users or items (Pearson
Correlation, Cosine Similarity, Euclidean Distance, …)
• Model Based
• Develops a model of users in an attempt to learn their preferences
• Models can be created using ML techniques like regression, clustering,
classification, ...
Recommenders
Content Based
• Works based on user profiles
• Works with user ratings (likes, views, …) and then finds the similarity between
the content of those items (tags, categories, genres, …)
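A minimal NumPy sketch of the idea: weight the genre matrix of watched items by the user's ratings to build a profile, then score unseen candidates (all numbers and genres are made up):

    import numpy as np

    # rows = watched movies, columns = one-hot genres [comedy, drama, sci-fi] (made up)
    watched_genres = np.array([[1, 0, 1],
                               [0, 1, 1],
                               [1, 1, 0]])
    ratings = np.array([8, 4, 9])         # the user's ratings of those movies

    profile = ratings @ watched_genres    # weighted genre profile of the user
    profile = profile / profile.sum()     # normalize

    candidate = np.array([1, 0, 1])       # genre vector of an unseen movie
    score = candidate @ profile           # higher score -> stronger recommendation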
Recommenders
Content Based
• LAB
Recommenders
Collaborative Filtering
• User-based
• Based on users' similarity or neighborhoods
• Finds the similarity between users (say, their rating histories)
• Item-based
• Based on item similarity
Recommenders
Collaborative Filtering / User Based
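A minimal sketch of the user-based idea with made-up ratings: measure similarity to the active user with Pearson correlation over co-rated items, then predict an unseen rating as a similarity-weighted average of the neighbours' ratings:

    import numpy as np
    from scipy.stats import pearsonr

    # made-up ratings on four movies everyone has rated
    active = np.array([9, 7, 8, 5])
    neighbours = {"u1": np.array([8, 7, 9, 4]),
                  "u2": np.array([3, 9, 2, 8])}
    unseen = {"u1": 8, "u2": 3}   # their ratings of the movie we want to predict

    sims = {u: pearsonr(active, r)[0] for u, r in neighbours.items()}
    pred = (sum(sims[u] * unseen[u] for u in sims)
            / sum(abs(s) for s in sims.values()))   # weighted average of ratings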
Recommenders
Collaborative Filtering / Item Based
Recommenders
Collaborative Filtering / Challenges
• Data sparsity
• A large number of users, but each rates only a limited number of items
• Cold start
• What if a new user joins the system? What if a new item is added?
• Scalability
• Performance drops as items/users increase; the matrix becomes larger
and larger
• There are solutions… like hybrid recommenders
Recommenders
Collaborative Filtering
• LAB