0 ratings0% found this document useful (0 votes) 51 views43 pagesChapter 4. Classification Algorithms-Stud
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content,
claim it here.
Available Formats
Download as PDF or read online on Scribd
Chapter 4. Classification in Machine
Learning
Abebe B, PhDClassifications in Machine learning
* Classification is a supervised machine learning method where the
model tries to predict the correct label of a given input data.
* In classification, the model is fully trained using the training data, and
then it is evaluated on test data before being used to perform prediction
on new unseen data.
* The prediction task is a classification when the target variable is discrete.
An application is the identification of the underlying sentiment of a piece
of text.*Linear classification techniques
+ Linear classifier
*Perceptron
* Logistic regression
*Linear SVM
¢ Naive Bayes
¢Non-linear classification techniques
*k-nearest neighbors
*Non-linear SVM
«Neural networks
* Decision trees
«Random forestLinear vs Non-linear Techniques
* Linear vs Non-linear Techniques
Ay Linear separetic
B lon linear separable
e
For some tasks, input data
can be linearly separable,
and linear classifiers can be
suitably applied
For other tasks, linear - ee #7
classifiers may have ae.7e
difficulties to produce hal ae
adequate decision *
boundaries = %6 8 &
4Non-linear Techniques
* Linear vs Non-linear Techniques
* Non-linear classification
= Features z; are obtained as non-linear functions of the inputs x;
* It results in non-linear decision boundaries
* Can deal with non-linearly separable data
Inputs: x; = [%n1_ Xn2]
os
a 2. 98
Features: 2) = [1 %n2 Xni*%n2 Xn nl
se
Outputs: f(x;,W,b) = Wz, +b* There are four main classification tasks in Machine learning: binary,
multi-class, multi-label, and imbalanced classifications.
* Binary Classification:-
* In a binary classification task, the goal is to classify the input data into
two mutually exclusive categories.
* The training data in such a situation is labeled in a binary format: true
and false; positive and negative; O and 1; spam and not spam, etc.
depending on the problem being tackled.
* For instance, we might want to detect whether a given image is a truck
ora boat.
* Logistic Regression and Support Vector Machines algorithms are natively
designed for binary classifications. However, other algorithms such as K-Nearest
Neighbors and Decision Trees can also be used for binary classification.Binary Classification Problem
*N iid training samples: {x;,, Cy}
* Class label: c,, € {0,1}
* Feature vector: X € R¢
* Focus on modeling conditional probabilities P(C|X)
* Needs to be followed by a decision step* Multi-Class Classification:
The multi-class
classification has at least
two mutually exclusive
class labels, where the goal
is to predict to which class
a given input example
belongs to.
For example, using a model to
identify animal types in images
from an encyclopedia is a
multiclass classification
example because there are
many different animal
classifications that each image
can be classified as.
Example
Mee oT teeta)
e cece
ened
ear’* Most of the binary classification algorithms can be also used for
multi-class classification. These algorithms include but are not
limited to:
*Random Forest, Naive Bayes, K-Nearest Neighbors,
Gradient Boosting , SVM, Logistic Regression.
* Unlike binary classification, multi-class classification does not
have the notion of normal and abnormal outcomes. Instead,
examples are classified as belonging to one among a range of
known classes.
Example:
Classifications of types of crops, Classification of types of music.Binary vs Multi-class Classification
* A classification problem with only 2 classes is referred to as binary
classification
* The output labels are 0 or 1
+ Eg, benign or malignant tumor, spam or no-spam email
* Aproblem with 3 or more classes is referred to as multi-class classification
Binary classification: Multi-class classification:
x A x
x
nl oN | SOKExx
o.O0 oa
° O90
% %Linear Vs Non-linear Classification
+ Linear Classification refers to categorizing a set of data points into a discrete class
based ona linear combination of its explanatory variables.
+ Non-Linear Classification refers to categorizing those instances that are not linearly
separable.
Linear NonlinearS.No
Linear Classification
Non-Linear Classification
Linear Classification refers to categorizing
a set of data points into a discrete class
based on a linear combination of its
explanatory variables.
Non-Linear Classification refers to
categorizing those instances that
are not linearly separable.
hyperplane.
1. | Itis possible to classify data with a Itis not easy to classify data with a
straight line. straight line.
1. | Datais classified with the help of a The utilization of kernels is made to
transform non-separable data into
separable data.K-Nearest Neighbor(KNN) Algorithm
* K-NN algorithm stores all the available data and classifies a new data
point based on the similarity.
* This means when new data appears then it can be easily classified into a
well suite category by using K- NN algorithm.
* It is also called a lazy learner algorithm because it does not learn from
the training set immediately instead it stores the dataset and at the time
of classification, it performs an action on the dataset.
* KNN algorithm at the training phase just stores the dataset and when it
gets new data, then it classifies that data into a category that is much
similar to the new data.* The K-NN working can be explained on the
basis of the below algorithm:
* Step-1: Select the number K of the
neighbors KAN alot Test se)
* Step-2: Calculate the Euclidean distance * Ra
of K number of neighbors ?
* Step-3: Take the K nearest neighbors as
per the calculated Euclidean distance.
* Step-4: Among these k neighbors, count
the number of the data points in each
category.
* Step-5: Assign the new data points to
that category for which the number of
the neighbor is maximum.
* Step-6: Our model is ready.k-Nearest Neighbors Classifier
¢ k-Nearest Neighbors approach considers multiple
neighboring data points to classify a test data point
+ E.g., 3nearest neighbors
o The test example in the figure is the + mark
o The class of the test example is obtained by voting (based on the
distance to the 3 closest points)Advantages of KNN Algorithm:
* It is simple to implement.
* It is robust to the noisy training data
* It can be more effective if the training data is large.
* There’s no need to build a model, tune several parameters, or
make additional assumptions.
Disadvantages of KNN Algorithm:
* Always needs to determine the value of K which may be complex
some time.
* The computation cost is high because of calculating the distance
between the data points for all the training samples.
* To select the K that’s right for your data, we run the KNN algorithm
several times with different values of K and choose the K that
reduces the number of errors we encounter while maintaining the
algorithm’s ability to accurately make predictions when it’s given
data it hasn’t seen before.Linear Classifier
« Find a linear function f of the inputs +; that separates the classes
f%,W,b) =Wxj+b
+ Use pairs of inputs and labels to find the weights matrix Wand
the bias vector b
o The weights and biases are the parameters of the function f
= Several methods have been used to find the optimal set of
parameters of a linear classifier
o Acommon method of choice is the Perceptron algorithm,
where the parameters are updated until a minimal error is
reached (single layer, does not use backpropagation)
+ Linear classifier is a simple approach, but it is a building block of
advanced classification algorithms, such as SVM and neural
networks
o Earlier multi-layer neural networks were referred to as multi-
layer perceptrons (MLPs)Linear Classifier
¢ The decision boundary is linear
= Astraight line in 2D, a flat plane in 3D, a hyperplane
in 3D and higher dimensional space
e Example: classify an input image
« The selected parameters in this example are not
good, because the predicted cat score is low
stretch pixels into single column
J 02 |-05| 01 | 20 56 14 | -96.8 | cat score
15] 43] 21] 00} |231] 4 32) 437.9 | dog score
input image © 9) 02) 08 ee a 61.95 | ship score
Ww 2 b F(2i;W,b)
rySupport Vector Machines
= How to find the best decision boundary?
© All lines in the figure correctly separate the 2
classes
o The line that is farthest from all training
examples will have better generalization
capabilities
= SVM solves an optimization problem:
o First, identify a decision boundary that correctly
classifies the examples
© Next, increase the geometric margin between the
boundary and all examples
= The data points that define the maximum %
margin width are called support vectors
* Find Wand b by solving:
min off
st.yj(ar-x, +b) 21, Va,Support Vector Machines
* The line that maximizes the
minimum margin is a good bet.
* The model class of “hyper-planes
with a margin of m” has a low VC
dimension if m is big.
* This maximum-margin separator is
determined by a subset of the
datapoints.
* Datapoints in this subset are
called “support vectors”.
«It will be useful computationally
if only a small fraction of the
datapoints are support vectors,
because we use the support
vectors to decide which side of
the separator a test case is on.
413/2023 aveve 8, Pro
The support vectors are
indicated by the circles
around them.Training a linear SVM
* To find the maximum margin separator, we have to solve the
following optimization problem:
w.x°+h>+1 for positive cases
w.x°+h<-1 for negativecases
and \\w\\? is as small as possible
* This is tricky but it’s a convex problem. There is only one optimum
and we can find it without fiddling with learning rates or weight
decay or early stopping.
* Don’t worry about the optimization problem. It has been solved.
Its called quadratic programming.
* It takes time proportional to N42 which is really bad for very big
datasets
* so for big datasets we end up doing approximate optimization!Testing a linear SVM
* The separator is defined as the set of points for which:
wxtb=0
soif w.x°+b>0 say its a positive case
and if w.x° +b<0 say its a negative caseNon-linear Support Vector Machines
* Linear vs Non-linear Techniques
© Non-linear SVM
+ The original input space is mapped to a higher-dimensional feature space where
the training set is linearly separable
* Define a non-linear kernel function to calculate a non-linear decision boundary in
the original feature spaceLinear vs Nonlinear separable data
Non linearly separable dataThis type of separator best provides the classification.
But
e tis quite difficult to train a model like this .
e This is termed as Regularisation parameter.A CofA
Tuning . Regularization
Parameters . Gamma
. Margin
SVM¢ Margin Margin is the perpendicular distance between the closest
data points and the Hyperplane ( on both sides )
+ The best optimised line ( hyperplane ) with maximum margin is
termed as Margin Maximal Hyperplane.
The closest points where the margin distance is calculated are
considered as the support vectors.
Maximum Support vectors,
margin
decision
hyperplane Sy
\. Margin is
maximizedRegularization
Also the ‘ C ‘ parameter in Python’s SkLearn Library
Optimises SVM classifier to avoid misclassifying the
data
C—- large Margin of hyperplane — small
C— small Margin of hyperplane — large
e misclassification(possible)
C —> large , chance of overfit
C —> small, chance of underfitting
a ‘
> * ;
SSp + + :
> ;
Se F< :
= - 4 >
x1 1
Large value for Smail value for
persmeterc parameter ©¢ Gamma
* Defines how far influences the calculation of of plausible line of
separation.
* Low gamma -----> points far from plausible line are considered for
calculation
* High gamma -----> points close to plausible line are considered for
calculation
High Gamma Value Low Gamma ValueKernels
* Mathematical functions for transforming data using some linear algebra
+ Different SVM algorithms use different types of kernel functions
Various kernels available
Example :
Use on)
Kernel function dot product of n- dimensional inputs° Pros:
* It works really well with clear margin of separation
* Itis effective in high dimensional spaces.
= It is effective in cases where number of dimensions is greater than the number
of samples.
= It uses a subset of training points in the decision function (called support
vectors), 50 it is also memory efficient.
© Cons:
* It doesn't perform well, when we have large data set because the required
training time is higher
* Italso doesn’t perform very well, when the data set has more noise i.e. target
classes are overlapping
* SVM doesn't ody provide probability estimates, these are calculated using,
an expensive five-fold cross-validation. It is related SVC method of Python
scikit-learn library.
¢ Applications :
= 1. Face detection
= 2. Text and hypertext categorization
= 3. Classification of images 4. Bioinformatics
» 5, Handwriting recognition
= 6. Protein fold and remote homology detection
= 7. Generalized predictive control(GPC)Linear Kernel
As the name suggests, the kernel separates the data linearly as it is the one-dimensional
kemel
Alinear kernel is fast as compared to other kernels because in the linear
kernel we need only regularization parameter and in other kernels, we
need the addition of the y parameter to performing the grid search.
¢ Equation of Linear Kernel shown below:
Where, x and y = input column vectors
k(z,y) =a'y
* The degree=1 and coef0=0 if the coef is zero then the kernel is homogeneous
means it maps the data in a compact linear representation.Polynomial Kernel
* As the name suggests, this kernel deals with the degree/order of the features.
The relationship between features is not linear but rather a curve, due to this
the linear kernel is not suited to get its maximum result because the residue
will be more between observed and predicted value.
* The working function of this kernel is less effective as compared to other
kernels.
+ Equation of Polynomial Kernel shown below: Eek T. d
k(x, y) = (yx y+ eo)
Where d = kernel degreeSigmoid Kernel
* This kernel is mostly used in neural networks or perceptron in machine
learning. To classify the classes in the data it works as an activation function.
The curve in this kernel is also called the cumulative distribution function that
goes from Oto 1.
+ Equation of sigmoid Kernel shown below: vty a)
* Where, x and y = input column vectors, ¥ = Slope, and C0 = interceptRBF (radial basis kernel) Kernel
+ The most using kernel in the machine learning algorithm to classify the data
without knowing the data types and try to separate the classes smoothly.
* The other kernels are not trying to scale well on a huge number of input
features.
* Equation of RBF Kernel shown below:
k(x, y) = exp(—yllz — yl?)
+ Where, x and y = input column vectors, Y = ¥ = 02, kernel of variance.Laplacian Kernel
° The laplacian kernel is from the family of RBF kernel and it can be used in
noiseless data. It is very less affected by the changes in the data and also it has
some similar features with the exponential kernel, as we will discuss later in
this article.
+ The application of this kernel is mainly in image processing to detect edges of
the objects by the name of Laplacian over Gaussian Filter (LoG).
+ Equation of Laplacian Kernel shown below:
k(z,y) = exp(—y||z — ylli)
* Where, x and y = input column vectors, | |x-y| | =Manhattan distance metric
cca
Bonksepal width (cm)
sepal width (cm)
SVC with linear kernel
sepal length (cm)
SVC with RBF kernel
sepal length (cm)
LinearSVC (linear kernel)
sepal width (cm)
sepal length (cm)
SVC with polynomial (degree 3) kernel
sepal width (cm)
sepal length (cm)Naive Bayes Classifier
* Baive Bayes Classifier is one of the simple and most effective
Classification algorithms which helps in building the fast
machine learning models that can make quick predictions.
* It is a probabilistic classifier, which means it predicts on the
basis of the probability of an object.
+ A Naive Bayes classifier is a probabilistic machine learning model that’s
used for classification task. The crux of the classifier is based on the
Bayes theorem.
ol
Works based on Bayes’ theorem
«Why its is called Naive?
~ Because it assumes that the presence of a. particular feature
in a class is unrelated to the presence of any other feature
«Easy to build
«Useful for very large data setsThe theorem can be stated mathematically as follow:
P(B| A) P(A)
P(B)
P(A) and P(B) are the probabilities of observing A and B without regard
to each other. Also known as Prior Probability.
P(A| B) =
’
P(A | B), a conditional (Posterior) probability, is the probability of
observing event A given that B is true.
P(B | A) is the conditional (Posterior)probability of observing event B
given that A is true.
So, how does naive bayes classifier work based on thi:Let D be a training set of tuples and each tuple is represented by n-dimensional
attribute vector, X = (x1, x2,...., xn)
Suppose that there are m classes, C7, C2...., Cm. Given a tuple, X, the classifier will
predict that X belongs to the class having the highest posterior probability, conditioned
on X. That is, the Naive Bayesian classifier predicts that tuple X belongs to the class Ci
if and only if
P(Ci|X) > P(G|X) for 1sjsmj#i.
By Bayes’ theorem
_ PUXIC)PCG)
PIGIX) = Fay
P(X) is constant for all classes, only P(X|C;)P(Cj)needs to be maximized
To reduce computation in evaluating PCX|G), the naive assumption of
class-conditional independence is made. This presumes that the attributes’ values are
conditionally independent of one another, given the class label of the tuple (i.e, that
there are no dependence relationships among the attributes). This assumption is
called class conditional independence.
Thus,
Pic) =[] Peaic
ra
= P(x |C)) x Plea|G) x + * PIC)
anspaoes ”Given all the previous patient's symptoms and
How Naive diagnosis
Bayes chills runnynose headache fever flu?
Works? Y N Mild Y N
(Hands on y y No N y
Calculation) Y N Sirong y
N yi Mid Y y
N N No. N N
N Y Strong Y Y
2 Y Strong N N
Y Y Mild Y Y
Does the patient with the following symptoms have
the flu?
chills runnynose headache fever flu?
Y N Mild Y 2How Naive
Bayes
el ces
(Hands on
Calculation)
ete R
First, we compute all possible individual
probabilities conditioned on the target attribute
(flu)
P(Fu=Y)
P(chills=Njflu=Y)
P(runny nose=Y|flu=Y)
P(runny nose=Niflu=Y)
P(headache=Mild|flu=Y)
P(headache=No}flu=Y)
P(headache=Strongiflu=Y)
P(fever=Yiflu=Y)
P(fever=Niflu=Y)
0.625
06
0.4
0.8
0.2
04
02
04
08
P(Flu=N)
P(chills=¥|Nu=N)
P(chills=N}flu=N)
P(runny nose=Y|flu=N)
P(runny nose=Niflu=N)
P(headache=Mildjfiu=N)
P(headache=No|flu=N)
P(headache=Strong|flu=N)
P(fever=Yiflu=N)
P(fever=Niflu=N)
0.375
0.333
0.666
0.333
0.666
0.333
0.333
0.333
0.333
0.666Structure of Logistic Regression
* Li : mn
Linear Model: wo + Di w; * x;
Bias Weight Weight per Feature
+ y* = sigmoid(wo + YW; * x)
E 1 "| Predict 1
* sigmoid(z) = Bex :
------ 4 —- ——> Threshold =
~ 9
Example: Predict 0 Threshold = .5
Wo Wy We
Est x2
25 a 1
ofojel|e
ofR}ole