Machine Learning with Python
A Practical Introduction
Jadi for Maktabkhooneh / 1400
Machine Learning and Python
• Based on the IBM course on edX, taught by
Saeed Aghabozorgi, Data Scientist at IBM.
Welcome
• We are going to talk about ML and how it helps in different areas (loans,
segmentation, medicine, recommendations, …)
• We will use Python libraries to create models, say building a model to estimate
the CO2 emissions of cars using scikit-learn, or predicting customer churn
• All code is provided in Jupyter notebooks
• After this course you will have new skills like regression, classification,
clustering, scikit-learn, NumPy, pandas AND new projects, especially if you start
working on datasets that are freely available on the internet.
Intro
• Say we want to understand if this is a fraudulent
transaction or not, is this a malignant or benign
cell, what should I show next to this customer, …
• This can be done by ML by looking at some
characteristics of the data.
• clean the data
• select the proper algorithm
• train the model
• predict new cases
Intro
Machine Learning is the subfield of computer
science that gives “computers the ability to learn
without being explicitly programmed”
- Arthur Samuel, who coined the phrase in 1959
Intro
Examples:
- CO2 emission (Regression)
- Is this cancer? (Classification)
- Bank Loans (Clustering)
- Anomaly detection (credit card fraud)
- Netflix recommendations (recommenders)
Intro
AI (mimics human intelligence)
- Computer vision
- Language Processing
- Creativity
ML (Subset of AI, more statistical)
- Classification
- Clustering
- Neural Network
Revolution in ML (a special field of ML)
- Deep Learning
Intro
Python
- You should know the basics
- It is easy
- Libraries like NumPy & pandas + scikit-learn
are used, with a quick intro
- This course is Python-based. You could do it
with anything else… but why? :D
Supervised vs.
Unsupervised
- Supervised: we “teach the model” with labeled data, and only then can the model
predict unknown or future instances
Supervised vs.
Unsupervised
- Unsupervised: the model works on its own to discover information.
Regression
Regression Intro
• regression is the process of
predicting a continuous value
• Independent (x, descriptor, ...) vs.
dependent (y, goal, prediction,
...) variables
• y is continuous
Regression Intro
Model
Regression Intro
Types
• Simple (only one independent)
• Linear
• Non-Linear
• Multiple (multiple independent)
• Linear
• Non-Linear
Regression Intro
Samples
• Household Price
• Customer Satisfaction
• Sales Forecast
• Employment Income
Regression Intro
Algorithms
• Ordinal
• Poisson
• Fast Forest quantile
• Linear, Polynomial, Lasso, Stepwise, Ridge
• Bayesian Linear
• Neural Network
• Decision Forest
• Boosted decision tree
• K-nearest neighbors
Simple Linear
Regression
• Can we predict CO2 emission
from just one of the independent
variables? (this is why we call it Simple)
• Let's try engine size...
Simple Linear
Regression
• The relationship is obvious
• There is a line; we assume a straight line
• We can predict the emission for, say, a car with
an engine size of 2.4
• ŷ is the predicted value of the dependent
variable
• x1 is the independent variable
• Theta 0 and theta 1 are the parameters of the line
• Theta 1 is known as the slope or gradient of the
fitted line and theta 0 is known as the intercept
• Theta 0 and theta 1 are also called the
coefficients of the linear equation
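As a minimal sketch of fitting this with scikit-learn (the column names follow the course's fuel-consumption dataset, but treat the path and names as assumptions):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("FuelConsumption.csv")     # assumed dataset path
    X = df[["ENGINESIZE"]]                      # independent variable x1
    y = df["CO2EMISSIONS"]                      # dependent variable y
    model = LinearRegression().fit(X, y)
    print(model.intercept_, model.coef_)        # theta 0 and theta 1
    print(model.predict([[2.4]]))               # predicted emission for engine size 2.4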
Simple Linear
Regression
MSE
• The residual error for each point is the
distance of the prediction from the
actual point, so the Mean Squared
Error (MSE) should be minimized
• Minimum MSE can be achieved
with two methods: Math or
Optimization
Simple Linear
Regression
MSE (Math)
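For reference, the math behind this slide (the standard least-squares formulas for the simple linear case):

    MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2

    \theta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
    \qquad \theta_0 = \bar{y} - \theta_1 \bar{x}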
Simple Linear
Regression
Pros
• Very Fast
• Easy to understand and interpret
• No need for parameter tuning
(say like in KNN)
Model Evaluation
• The goal is to build a model that accurately
predicts unknown cases.
• You need to evaluate it to see how
much you can trust your
model/predictions
• Two main methods:
• Train and Test on Same data
• Train / Test split
• Regression Evaluation Metrics
Model Evaluation
Train and Test on Same data
• High "training accuracy"
• not always good
• overfitting the data
(say capture noise and
produce non
generalized model)
• Low "out of sample
accuracy"
• Important to have
Model Evaluation
Train/Test split
• Mutually exclusive split
• More accurate on
out-of-sample data
• Afterwards, retrain your
model with the testing set
included, as you don't
want to lose potentially
valuable data
• Results depend on which
rows the data happens to be
trained and tested on
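A minimal sketch of the split with scikit-learn (the 80/20 ratio is an illustrative choice):

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    # hold out 20% of the rows as a mutually exclusive test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_train, y_train)   # train only on the training set
    print(model.score(X_test, y_test))                 # out-of-sample R^2 on unseen rows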
Model Evaluation
Evaluation Metrics
• Used to explain the performance of a
model
• Say, comparing actual values with predicted ones
• The error of the model is the difference
between the data points and the
trend line generated by the algorithm
• There are different metrics (next slide),
but the choice is based on the
model, data type, domain, ...
Model Evaluation
Errors
• Mean absolute error (MAE)
• Mean squared error (MSE)
• Root mean squared error (RMSE); interpretable in the
same units as the response vector (y units)
• Relative absolute error (RAE): the total absolute error
normalized by that of a trivial predictor (the mean)
• Relative squared error (RSE)
• R²: popular metric for the accuracy of your model;
represents how close the data values are to the fitted
regression line. The higher, the better
Let's see some
libraries!
• Notebook
• Numpy
• Matplotlib
• pandas
Lab: Simple Linear
Regression
• ML0101EN-Reg-Simple-Linear-Regression-Co2.ipynb
Multiple Linear
Regression
• Simple / Multiple
• Much the same as simple
• Usages:
• find the strength of the effect of each independent variable
• predict the impact of a change in one of the independent
variables
Multiple Linear
Regression
Formula
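The formula on this slide is the standard multiple-linear form:

    \hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta^T X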
Multiple Linear
Regression
Finding parameters
• Again we can compute the MSE
• The best model is the one with the minimized MSE
• One method is called Ordinary Least Squares (OLS)
• linear algebra
• fine for small data, but slow for large datasets
(say, more than 10K samples)
• The other is optimization algorithms
• Gradient Descent (starts with random parameters, then
changes them over multiple iterations)
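A minimal NumPy sketch of gradient descent for linear regression (the learning rate and iteration count are arbitrary illustrative choices):

    import numpy as np

    def gradient_descent(X, y, lr=0.01, iters=1000):
        """Minimize MSE for y ≈ X @ theta (bias column added here)."""
        Xb = np.c_[np.ones(len(X)), X]                   # add x0 = 1 for theta_0
        theta = np.random.randn(Xb.shape[1])             # start with random parameters
        for _ in range(iters):
            grad = 2 / len(y) * Xb.T @ (Xb @ theta - y)  # derivative of the MSE
            theta -= lr * grad                           # step against the gradient
        return theta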
Multiple Linear
Regression
Some notes
• Try to have a theoretical justification when choosing
the independent variables; too many Xs might
result in overfitting
• Xs do not need to be continuous. If they are not,
try to assign values (like 1 and 2) to the categories
• There needs to be a linear relationship. Test your
Xs with scatter plots or use your logic. If the
relationship displayed in your scatter plot is not
linear, then you need to use non-linear
regression.
Lab: Multiple Linear
Regression
• ML0101EN-Reg-Mulitple-Linear-Regression-Co2.ipynb
Non-Linear
Regression
Non Linear
regression
Polynomial
• Different types
• If you have x², you can
define a new feature as x2 = x². So it
can be represented as a special
case of multiple linear
regression. This is called
polynomial regression.
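A sketch of this trick with scikit-learn: PolynomialFeatures builds the x² column, then a plain linear model fits it (degree 2 is an illustrative choice):

    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    poly = PolynomialFeatures(degree=2)              # expands [x] into [1, x, x^2]
    X_poly = poly.fit_transform(X_train)             # the new "independent" columns
    model = LinearRegression().fit(X_poly, y_train)  # ordinary multiple linear fit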
Non Linear
regression
Non Linear
• Models a non-linear relationship
between the Xs and Y
• Y is a non-linear function of the
parameters Theta
Non Linear
regression
Notes
• How to tell if it is non-linear:
• plot Y against each X and look /
calculate the correlation coefficient
• use non-linear if you cannot solve
it with linear regression
• How to model the data if it is non-linear:
• polynomial regression
• non-linear regression
• transform the data!
Lab: Polynomial
Regression
• ML0101EN-Reg-Polynomial-Regression-Co2.ipynb
Lab: Non-Linear
Regression
• ML0101EN-Reg-NoneLinearRegression.ipynb
Part Three
Classification
Classification
Intro
• Understand Classification
• Understand different methods
such as KNN, Decision Trees,
Logistic Regression and SVM
• Apply on datasets
• Evaluate
Classification
Intro
• Supervised
• Categorizing unknown items into
classes
• The target is categorical with
discrete values (the model is called a classifier)
• Binary (2 values) vs. multi-class
Classification
Intro
• Loan (age, income, loan size,
previous records, ...)
• Churn (age, address, income,
equip, data usage, calls, ...)
• Spam / Important email
• Handwriting/Speech recognition
• Biometric identification
Classification
Intro
• Decision Trees (ID3, C4.5, C5.0)
• Naive Bayes
• Linear Discriminant Analysis
• K-Nearest Neighbor
• Logistic Regression
• Neural Networks
• Support Vector Machines
Classification
KNN
• Pick a K
• Calculate the distance of the unknown
point from all cases
• Predict based on the K nearest points
• How to measure "distance"? (Euclidean can
be one way)
• How to choose K? (low -> noise & overfitting;
high -> too general). Try different Ks
with the test set and see which K works best.
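A minimal scikit-learn sketch (K = 4 is just an illustrative starting point):

    from sklearn.neighbors import KNeighborsClassifier

    k = 4
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)  # Euclidean by default
    y_hat = knn.predict(X_test)   # majority label among the k nearest training points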
Classification
KNN
• KNN can also be used to compute a
continuous target (regression)
• Say, find the 3 closest cases
and take the median
Classification
KNN Evaluation
• Evaluation explains the
performance of our model
• On test data we have y and ŷ
• There are different model
evaluation metrics: Jaccard
index, F1-score, and Log Loss.
Classification
KNN Evaluation / Jaccard Index
• Jaccard Index
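For reference, the index on this slide: the size of the intersection divided by the size of the union of the actual and predicted label sets:

    J(y, \hat{y}) = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|}
                  = \frac{|y \cap \hat{y}|}{|y| + |\hat{y}| - |y \cap \hat{y}|}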
Classification
KNN Evaluation / F1-Score
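For reference, the standard formulas behind this slide: per class, precision and recall come from the confusion matrix and are combined into the F1-score:

    \text{Precision} = \frac{TP}{TP + FP}, \qquad
    \text{Recall} = \frac{TP}{TP + FN}, \qquad
    F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}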
Classification
KNN Evaluation / LogLoss
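For reference, log loss measures how far each predicted probability ŷ is from the actual 0/1 label y; lower is better:

    \text{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]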
Lab: KNN
Classification
Decision Trees / Intro
Internal node (test), branch (result
of test) & leaf (class)
1. Choose an attribute from the dataset
2. Calculate the significance of the
attribute in the splitting of the data
3. Split the data based on the value of the
best attribute
4. Repeat!
Classification
Decision Trees / Building
• Decision trees are built using recursive
partitioning to classify the data
(Cholesterol? Sex? ...)
• The best attribute? The one with the most information gain
• Information gain is the information that can increase the
level of certainty after splitting
• IG = entropy before the split - weighted entropy after the split
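A small NumPy sketch of the computation (the class counts are made-up numbers for illustration):

    import numpy as np

    def entropy(counts):
        p = np.array(counts) / np.sum(counts)
        return -np.sum(p * np.log2(p))       # assumes no zero counts

    # made-up example: 14 samples (9 vs 5) split into two branches of 7
    before = entropy([9, 5])
    after = (7/14) * entropy([6, 1]) + (7/14) * entropy([3, 4])  # weighted by branch size
    info_gain = before - after               # pick the attribute maximizing this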
Lab: Decision Trees
Classification
Logistic Regression/ Intro
• Who is leaving, and why?
• Close to regression, but
here Y is a categorical
(here, binary) value
• All Xs should be
continuous, or converted to
continuous values
Classification
Logistic Regression/ Intro
• Predicting a disease
• chance of mortality based on a situation
• halting a subscription
• purchase
• failure of a product
• ...
Classification
Logistic Regression/ Intro
• The target should be categorical (or better, binary)
• We need the probability of the prediction
• We need a linear decision boundary (a line, or
even a polynomial)
• We need to understand the impact of the
features (is a theta close to 0, or is it high?)
Classification
Logistic Regression vs Linear Regression
• On the previous data, try linear
regression with age vs. income
• Now repeat, trying age vs.
churn: it looks odd, and we would
need a step function as a threshold
Classification
Logistic Regression vs Linear Regression /
Sigmoid
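For reference, the sigmoid squeezes θᵀx into a (0, 1) probability, which is what replaces the step function:

    \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}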
Classification
Logistic Regression Training
• Cost function
• We have to minimize the cost
• Can be done via the derivative, but it is difficult
Classification
Logistic Regression Training
• We can define a new cost function!
• For it, there are more approaches to minimize
the function, say Gradient Descent (an iterative
technique)
Classification
Logistic Regression Training
• Gradient descent is
an iterative approach
to finding the
minimum of a
function. It uses the
derivative of a cost
function to change the
parameter values to
minimize the cost or
error.
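A minimal scikit-learn sketch (the C value and solver are illustrative choices; the library runs the optimization for you):

    from sklearn.linear_model import LogisticRegression

    lr = LogisticRegression(C=0.01, solver="liblinear").fit(X_train, y_train)
    y_hat = lr.predict(X_test)          # predicted class labels
    y_prob = lr.predict_proba(X_test)   # the probabilities that log loss evaluates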
Lab: Logistic Regression
Classification
Support Vector Machines
• Supervised
• A classifier based on a
separator
• Maps data to a
high-dimensional space so a
hyperplane separator can
be drawn
• Lots of real-world data is
linearly non-separable,
but what if we go to a
higher dimension? ;)
Classification
Support Vector Machines
• But… how do we move
to n dimensions?
• There are different
kernel functions
• Our libraries will do it;
we will just compare kernels
• How do we find the
hyperplane?
Classification
Support Vector Machines
• To find the
hyperplane, we look
for the largest
margins from the
support vectors
• Can also be solved
using gradient
descent
• Once learned, we
can just check a new
data point, see if it is
above the line or
below it, and decide
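A minimal scikit-learn sketch; the RBF kernel is one common choice to compare against others such as 'linear' or 'poly':

    from sklearn import svm

    clf = svm.SVC(kernel="rbf").fit(X_train, y_train)  # the library handles the kernel mapping
    y_hat = clf.predict(X_test)                        # which side of the hyperplane?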
Classification
Support Vector Machines
• Pros
• accurate in high dimensional spaces
• memory efficient
• Cons
• Prone to over-fitting if we have lots of features
• No probability estimation
• Not computationally efficient for large datasets (n > 1000)
Classification
Support Vector Machines
• Image recognition
• Text Category Assignment
• spam
• category
• sentiment analysis
• Gene Expression Classification
• Outlier detection and clustering
Lab: SVM
Clustering
Intro
• Partitioning a customer base into groups of individuals based on shared
characteristics
• Allows a business to target different groups (high profit & low risk, …)
• We can cross-reference the groups with their purchases
Clustering
Intro
• Finding “clusters” in datasets; unsupervised
• Cluster: a group of data points or objects in a
dataset that are similar to other objects in the
group, and dissimilar to data points in other
clusters
• Different from classification:
• the data does not need to be labeled
• prediction is not the goal
Clustering
Intro / Samples
• Retail & Marketing: identify buying patterns / recommendation systems
• Banking: Fraud detection / identify clusters (loyal, churn, …)
• Insurance: Fraud detection / Risk
• Publication: auto-categorize / recommend
• Medicine: characterize behaviour
• Biology: group genes / cluster genetic markers (family ties)
Clustering
Intro / Where
• Exploratory data analysis
• summary generation
• outlier detection
• finding duplicates
• pre-processing step
Clustering
Intro / algorithms
• Partitioned-based (K-means, K-Median, Fuzzy
c-means, …): sphere like clusters / Medium or large
data
• Hierarchical (Agglomerative, Divisive): Trees of clusters
/ small size datasets
• Density-based (DBSCAN): arbitrary shaped / good for
special clusters or noisy data
Clustering
K Means
• Unsupervised; divides data into K non-overlapping
subsets/clusters without any internal cluster structure
Clustering
K Means
• We need to understand the similarity
and dissimilarity.
• Goal: minimize intra-cluster
distances ( Dis(x1, x2) ) and
maximize inter-cluster distances
( Dis(c1, c2) )
• It is always good to Normalize!
• different formulas: Euclidean,
Cosine, Average distance, … so first
understand the domain knowledge
Clustering
K Means
• Decide the number of clusters (K)
• Initialize K “centroids” with:
• random points from the dataset, or
• completely random points
• Assign each customer to the closest centroid, creating the
distance matrix
• Update each centroid to the mean of its data points
• Continue until the centroids stop moving
• Notes:
• iterative
• does not guarantee the best result: it may get stuck in a local
optimum, but it is fast, so we can run it many times!
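A minimal scikit-learn sketch (K = 3 is illustrative; n_init reruns the algorithm with different random initializations, which is exactly the "run it many times" note above):

    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=3, n_init=12).fit(X)
    labels = km.labels_               # cluster assignment per data point
    centers = km.cluster_centers_     # final centroid positions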
Clustering
K Means / More Points
• Review the algorithm
• But how can we evaluate it?
• External: compare with the ground truth
• Internal: average distance between data points within a cluster, or the
distance between clusters
• Choosing K is difficult, so we run with different Ks and check the accuracy
(say, the mean distance inside a cluster), BUT increasing K always
reduces this. So we use the elbow method.
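A sketch of the elbow method: plot the within-cluster distance against K and look for the bend (inertia_ is scikit-learn's sum of squared distances to the closest centroid):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    inertias = [KMeans(n_clusters=k, n_init=12).fit(X).inertia_ for k in range(1, 10)]
    plt.plot(range(1, 10), inertias, marker="o")   # the "elbow" suggests a good K
    plt.xlabel("K"); plt.ylabel("within-cluster sum of squares")
    plt.show()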
Clustering
K Means / More Points
• Partition based
• unsupervised
• medium and large datasets (relatively efficient)
• sphere like clusters
• K should be known / guessed
Clustering
K Means / LAB
Clustering
Hierarchical / Intro
• 48,000 genetic markers produce this chart
based on similarity
• A hierarchy of clusters, where each node is a
cluster consisting of the clusters of its
daughter nodes
Clustering
Hierarchical / Intro
• Divisive is top down, so you start with
all observations in a large cluster and
break it down into smaller pieces.
• Agglomerative is the opposite of
divisive. So it is bottom up, where
each observation starts in its own
cluster and pairs of clusters are
merged together as they move up the
hierarchy.
Clustering
Hierarchical / Intro
• Finding the similarity of
city locations in
Canada
• The dendrogram's y-axis is
the similarity (distance) level
• We can cut the dendrogram at
some height to get
N clusters (say 3)
Clustering
Hierarchical / More
• We should be able to calculate distances between data
points (again say age, BMI, BP)
• but also need the distance “between” clusters:
• Single Linkage Clustering: Minimum distance
• Complete Linkage Clustering: Maximum distance
• Average Linkage Clustering: average of distances from
each point to all other points
• Centroid Linkage Clustering: centroids of clusters
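A minimal SciPy sketch (average linkage is one of the four options above; X is an assumed feature matrix):

    from scipy.cluster import hierarchy

    Z = hierarchy.linkage(X, method="average")  # average linkage clustering
    hierarchy.dendrogram(Z)                     # the tree of merges; cut it to get N clusters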
Clustering
Hierarchical / More
• Pros
• Works when the number of clusters is unknown
• Easy to implement
• Produces useful dendrograms; good for understanding the data
• Cons
• The algorithm can never undo a previous step
• Long runtimes
• Sometimes difficult to identify the number of clusters
from the dendrogram (especially for large datasets)
Clustering
Hierarchical / More
• Hierarchical vs. K-Means
• Can be slower
• Does not require the number of clusters to run
• Gives more than one partitioning (depending on where you cut)
• Always generates the same clusters
Clustering
Hierarchical / Lab
Clustering
DBSCAN
• K-Means will assign
every data point to a
cluster; there are no outliers
• Density-based clustering
finds dense areas
and separates out the
outliers. Good for
anomaly detection
• Density: the number of
points within a radius
Clustering
DBSCAN
• The DBSCAN algorithm is
effective for tasks like
class identification
• Effective even in the
presence of noise
• e.g., grouping locations with
similar weather in dense areas
Clustering
DBSCAN
• Two parameters: R (the radius of the neighborhood) and M
(the minimum number of points within the radius R)
Clustering
DBSCAN
• Point types:
• Core: within the
neighborhood of the point
there are at least M
points
• Border:
• fewer than M points in its
neighborhood
• reachable from a core
point
• Outlier: neither a core
point nor a border point
Clustering
DBSCAN
• Arbitrarily shaped clusters
• Robust to outliers
• Does not require specification of the number of
clusters
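A minimal scikit-learn sketch (eps plays the role of the radius and min_samples the role of M; the values are illustrative):

    from sklearn.cluster import DBSCAN

    db = DBSCAN(eps=0.3, min_samples=5).fit(X)
    labels = db.labels_    # cluster index per point; -1 marks outliers (noise)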
Clustering
DBSCAN / Lab
Recommenders
Intro
• People's tastes follow patterns (say, books)
• Recommender systems capture the patterns of people's behaviour and use them to
predict what else they might want or like
• Many applications: Netflix, Amazon, Facebook, Twitter, news, Digikala,
SnapFood
• Broader exposure -> more usage
Recommenders
Intro / types
• Memory Based
• Uses the entire user-item dataset to generate a recommendation
• Uses statistical techniques to approximate users or items (Pearson
Correlation, Cosine Similarity, Euclidean Distance, …)
• Model Based
• Develops a model of users in an attempt to learn their preferences
• Models can be created using ML techniques like regression, clustering,
classification, ...
Recommenders
Content Based
• Works based on user profiles
• Works with user ratings (likes, views, …) and then finds the similarity between
the content of those items (tags, categories, genres, …)
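A minimal NumPy sketch of the idea: weight the genre matrix of watched items by the user's ratings to build a profile, then score unseen candidates (all numbers and genres are made up):

    import numpy as np

    # rows = watched movies, columns = one-hot genres [comedy, drama, sci-fi] (made up)
    watched_genres = np.array([[1, 0, 1],
                               [0, 1, 1],
                               [1, 1, 0]])
    ratings = np.array([8, 4, 9])         # the user's ratings of those movies

    profile = ratings @ watched_genres    # weighted genre profile of the user
    profile = profile / profile.sum()     # normalize

    candidate = np.array([1, 0, 1])       # genre vector of an unseen movie
    score = candidate @ profile           # higher score -> stronger recommendation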
Recommenders
Content Based
• LAB
Recommenders
Collaborative Filtering
• User-based
• Based on users' similarity or neighborhoods
• Finds the similarity between users (say, their rating histories)
• Item-based
• Based on item similarity
Recommenders
Collaborative Filtering / User Based
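A minimal sketch of the user-based idea with made-up ratings: measure similarity to the active user with Pearson correlation over co-rated items, then predict an unseen rating as a similarity-weighted average of the neighbours' ratings:

    import numpy as np
    from scipy.stats import pearsonr

    # made-up ratings on four movies everyone has rated
    active = np.array([9, 7, 8, 5])
    neighbours = {"u1": np.array([8, 7, 9, 4]),
                  "u2": np.array([3, 9, 2, 8])}
    unseen = {"u1": 8, "u2": 3}   # their ratings of the movie we want to predict

    sims = {u: pearsonr(active, r)[0] for u, r in neighbours.items()}
    pred = (sum(sims[u] * unseen[u] for u in sims)
            / sum(abs(s) for s in sims.values()))   # weighted average of ratings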
Recommenders
Collaborative Filtering / Item Based
Recommenders
Collaborative Filtering / Challenges
• Data sparsity
• A large number of users, but each rates only a limited number of items
• Cold start
• What if a new user joins the system? What if a new item is added?
• Scalability
• Performance drops as items/users increase; the matrix becomes larger
and larger
• There are solutions… like hybrid recommenders
Recommenders
Collaborative Filtering
• LAB