Assignment 2
Explain the following classification algorithms in machine learning: Logistic Regression, k-Nearest
Neighbors, Decision Trees, Support Vector Machine, Naive Bayes, Gradient Boosting.
Each of these classification algorithms has its strengths and weaknesses, and their performance can vary
depending on the nature of the data and the specific problem at hand. Choosing the right algorithm
often involves experimentation and consideration of the characteristics of the dataset.
Logistic Regression
This is a linear model for classification (a generalized linear model rather than ordinary linear
regression) that predicts the probability of an input belonging to a certain class. It passes a linear
combination of the input features through a logistic function (also called a sigmoid function) to obtain
a value between 0 and 1, which represents the probability of the positive class. The predicted class is
then determined by comparing the probability with a threshold value, usually 0.5. Logistic regression is
often used for binary classification problems, such as spam detection, credit default prediction, etc.
Logistic regression is a supervised learning algorithm used for classification tasks. It works by fitting a
sigmoid function to the training data, which allows it to predict the probability of a given data point
belonging to a particular class. Logistic regression is a relatively simple algorithm to understand and
implement, but it can be very effective for a wide range of classification problems.
In simple terms
Logistic Regression is a simple and widely used classification algorithm for binary and multiclass
classification problems.
It models the probability that a given input belongs to a particular class using the logistic
function, which outputs values between 0 and 1.
It's a linear model that finds the best-fitting hyperplane to separate the classes by optimizing a
cost function like cross-entropy.
Logistic Regression is interpretable and efficient but may not perform well with complex data
distributions.
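The idea of mapping a linear score to a probability and then thresholding it can be sketched in a few lines. This is only an illustration: the weights and bias below are hand-picked, hypothetical values, not parameters fitted to any dataset.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, features):
    """Probability of the positive class for one sample under a linear model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_class(weights, bias, features, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical parameters chosen by hand for illustration only
weights, bias = [0.8, -0.4], -0.2
print(predict_proba(weights, bias, [2.0, 1.0]))
print(predict_class(weights, bias, [2.0, 1.0]))
```

In a real model the weights and bias would be learned by minimizing the cross-entropy cost mentioned above.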
k-Nearest Neighbors
This is a type of lazy learning algorithm that does not build a model from the training data, but instead
stores the data and uses a similarity measure (such as Euclidean distance) to find the k most similar
instances to a new input. The predicted class is then determined by the majority vote of the k nearest
neighbors, or by a weighted vote based on the distance. k-Nearest Neighbors is often used for multi-class
classification problems, such as image recognition, text categorization, etc.
KNN is another supervised learning algorithm for classification. It works by finding the k most similar
data points in the training set to a new data point, and then predicting the class of the new data point
based on the classes of the k nearest neighbors. KNN is a very simple algorithm to implement, but it can
be very effective for classification problems, especially when the training data is well-labeled.
In simple terms
k-NN is a non-parametric and instance-based classification algorithm.
It classifies data points based on the majority class among their k-nearest neighbors in the
feature space.
The choice of the value 'k' influences the algorithm's performance and can be selected using
techniques like cross-validation.
k-NN is simple and can handle non-linear decision boundaries but can be sensitive to the choice
of distance metric and the curse of dimensionality.
Decision Trees
Decision trees are supervised learning algorithms for classification and regression tasks. They work by
building a tree-like structure that represents the relationships between the features of the data and the
target variable. Decision trees are easy to interpret and can capture non-linear relationships between
features, although an unpruned tree tends to overfit noisy data.
This is a type of eager learning algorithm that builds a tree-like structure from the training data, where
each node represents a feature, each branch represents a decision rule, and each leaf represents a class
label. The predicted class is then determined by following the path from the root node to the leaf node
that matches the input features. Decision trees are often used for both binary and multi-class
classification problems, such as medical diagnosis, customer segmentation, etc.
In simple terms
Decision Trees are tree-like structures where each node represents a decision or a test on a
feature, and each branch represents the outcome of that test.
It recursively splits the data based on features to classify the data into different classes.
Decision Trees are interpretable and can handle both categorical and numerical data but may
suffer from overfitting if not pruned properly.
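How a tree chooses a single split can be illustrated with the Gini impurity, one common splitting criterion (entropy is another). The sketch below searches one numeric feature for the best threshold; the ages and labels are made-up values.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for a more mixed node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Find the split 'value <= t' that minimises the weighted Gini impurity."""
    best_t, best_score = None, float("inf")
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Made-up data: age of a user and whether they bought a product
ages = [22, 25, 30, 35, 46, 52]
bought = [0, 0, 0, 1, 1, 1]
print(best_threshold(ages, bought))
```

A full tree repeats this search recursively on each side of the chosen split and over every feature.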
Support Vector Machine
This is a type of linear classifier that finds the optimal hyperplane that separates the data into two
classes with the maximum margin. The hyperplane is defined by a subset of data points called support
vectors, which are the closest to the boundary. The predicted class is then determined by the side of the
hyperplane that the input falls on. Support vector machines can also handle non-linear classification
problems by using kernel functions that transform the data into higher-dimensional spaces. Support
vector machines are often used for complex classification problems, such as face detection, handwriting
recognition, etc.
SVMs are supervised learning algorithms for classification and regression tasks. They work by finding a
hyperplane that separates the data into two classes with the largest possible margin. SVMs are very
effective for classification problems, especially when the data is high-dimensional and sparse.
In simple terms
SVM is a powerful classification algorithm that aims to find a hyperplane that best separates
classes while maximizing the margin between them.
It can handle both linear and non-linear classification by using different kernel functions, such as
the linear, polynomial, or radial basis function (RBF) kernels.
SVMs are effective in high-dimensional spaces and are less prone to overfitting due to the
margin concept.
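For a linear SVM, the roles of the hyperplane and the margin can be made concrete with a small sketch. The weight vector w and offset b below are hypothetical values chosen by hand; a real SVM would obtain them by solving an optimization problem.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Predict +1 or -1 from the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the margin boundaries w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def hinge_loss(w, b, x, y):
    """Zero when x is on the correct side and outside the margin, positive otherwise."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))

# Hypothetical hyperplane x1 + x2 - 3 = 0
w, b = [1.0, 1.0], -3.0
print(classify(w, b, [3.0, 2.0]), classify(w, b, [1.0, 1.0]))
print(margin_width(w))
print(hinge_loss(w, b, [2.0, 1.5], 1))  # correct side, but inside the margin
```

Maximizing the margin corresponds to making the norm of w small while keeping the hinge loss on the training points low.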
Naive Bayes
This is a type of probabilistic classifier that applies Bayes’ theorem to calculate the posterior probability
of each class given the input features. It assumes that the features are conditionally independent given
the class, which simplifies the computation and reduces the data requirements. The predicted class is
then determined by the class with the highest posterior probability. Naive Bayes is often used for text
classification problems, such as sentiment analysis, spam filtering, etc.
Naive Bayes is a supervised learning algorithm for classification tasks. It works by assuming that the
features of the data are independent of each other given the class, and then uses Bayes' theorem to
predict the class of a new data point. Naive Bayes is a very simple algorithm to implement, and it can be
very effective for classification problems, especially when the features are close to conditionally
independent.
In simple terms
Naive Bayes is a probabilistic classifier based on Bayes' theorem with the "naive" assumption
that features are conditionally independent.
It's particularly useful for text classification and spam detection.
Despite its simplicity and the independence assumption, Naive Bayes often performs surprisingly
well in practice.
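A tiny word-count (multinomial) Naive Bayes for text shows how the prior and the per-word likelihoods combine. This is a minimal sketch with add-one (Laplace) smoothing; the four training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def predict(model, text):
    """Pick the class with the highest log posterior (up to a shared constant)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities for unseen class/word pairs
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data
docs = ["win money now", "free money offer", "meeting at noon", "lunch meeting today"]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(docs, labels)
print(predict(model, "free money"))
print(predict(model, "meeting today"))
```

Summing log probabilities instead of multiplying probabilities is the independence assumption at work, and it also avoids numerical underflow on long documents.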
Gradient Boosting
This is a type of ensemble learning algorithm that combines multiple weak learners (usually shallow
decision trees) into a strong learner by iteratively adding new learners that correct the errors of the
previous ones. Each new learner is fitted to the negative gradient of a loss function that measures the
difference between the actual and predicted classes, which amounts to gradient descent in function
space. The final prediction is obtained by summing the outputs of all the learners (scaled by a learning
rate) and converting the total score into a class. Gradient boosting is often used for high-performance
classification problems, such as fraud detection, ranking systems, etc.
In simple terms
Gradient Boosting is an ensemble method that combines multiple weak learners (typically
decision trees) to create a strong learner.
It builds trees sequentially, with each tree correcting the errors made by the previous ones.
Gradient Boosting implementations such as XGBoost are highly effective and often win
machine learning competitions; AdaBoost is a closely related, earlier boosting algorithm.
They can handle complex relationships in the data but may require careful hyperparameter
tuning.
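The sequential error-correcting idea can be sketched with one-split trees (stumps) on a single feature. This is a simplified illustration, not how libraries implement it: it uses squared loss, for which the negative gradient is simply the residual, whereas real classifiers usually use log loss. The data, number of rounds and learning rate are arbitrary choices.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree: one threshold and one value per side."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, t, left_value, right_value)
    return best[1:]  # (threshold, left value, right value)

def stump_predict(stump, x):
    t, left_value, right_value = stump
    return left_value if x <= t else right_value

def fit_boosting(xs, ys, n_rounds=20, learning_rate=0.5):
    """Sequentially add stumps, each fitted to the current residuals."""
    base = sum(ys) / len(ys)  # initial constant prediction
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - F(x)
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate

def boosted_score(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def mean_squared_error(model, xs, ys):
    return sum((y - boosted_score(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data with one label (x=4) that breaks the simple pattern
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 1, 1, 1]
model = fit_boosting(xs, ys)
print(mean_squared_error(model, xs, ys))
print([1 if boosted_score(model, x) >= 0.5 else 0 for x in xs])
```

The learning rate shrinks each stump's contribution; smaller values usually need more rounds, which is one of the hyperparameters that requires tuning.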
Here are some example use cases for each of the classification algorithms discussed above:
Logistic Regression:
Predicting whether a customer is likely to churn
Predicting whether a loan applicant is likely to default
Predicting whether a medical patient is likely to have a particular disease
k-Nearest Neighbors:
Decision Trees:
Naive Bayes:
Spam filtering
Sentiment analysis
Document classification
Gradient Boosting:
The best classification algorithm to use for a particular problem will depend on a number of factors,
including the nature of the data, the complexity of the problem, and the desired performance metrics.
However, the algorithms discussed above are a good starting point for most classification problems.
If the data is simple and the problem is not too complex, logistic regression or k-nearest
neighbors may be a good choice.
If the data is noisy or complex, decision trees or support vector machines may be a better
choice.
If the data is high-dimensional and sparse, support vector machines are a good choice.
If the features of the data are approximately independent given the class, Naive Bayes may be a good choice.
If you need the best possible performance, gradient boosting is a good choice.
It is also important to note that there is no one-size-fits-all solution to machine learning
problems. It is often necessary to experiment with different algorithms and parameters to find
the best solution for a particular problem.
Supervised Machine Learning algorithms can be broadly classified into Regression and Classification
algorithms. Regression algorithms predict continuous output values, whereas Classification algorithms
predict categorical values.
The Classification algorithm is a Supervised Learning technique that is used to identify the category of
new observations on the basis of training data. In Classification, a program learns from the given dataset
or observations and then classifies each new observation into one of a number of classes or groups, such
as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes are also called targets, labels, or
categories.
Unlike regression, the output variable of Classification is a category, not a numeric value, such as
"Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised learning
technique, it takes labeled input data, which means each input comes with the corresponding output.
The main goal of a Classification algorithm is to identify the category of a given observation, and these
algorithms are mainly used to predict the output for categorical data.
Classification algorithms can be better understood using the below diagram. In the below diagram, there
are two classes, class A and Class B. These classes have features that are similar to each other and
dissimilar to other classes.
The algorithm which implements the classification on a dataset is known as a classifier. There are two
types of Classifications:
Binary Classifier: If the classification problem has only two possible outcomes, then it is called a Binary
Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-class Classifier: If a classification problem has more than two outcomes, then it is called a
Multi-class Classifier.
Classification learners can also be divided into two categories:
1. Lazy Learners: A Lazy Learner first stores the training dataset and waits until it receives the test
dataset. Classification is then done on the basis of the most closely related data stored in the
training dataset. It takes less time in training but more time for predictions. Example: K-NN algorithm.
2. Eager Learners: Eager Learners develop a classification model based on a training dataset before
receiving a test dataset. Unlike Lazy Learners, an Eager Learner takes more time in learning and less
time in prediction. Example: Decision Trees, Naïve Bayes, ANN.
Linear Models
Logistic Regression
Non-linear Models
K-Nearest Neighbors
Kernel SVM
Naïve Bayes
Once our model is completed, it is necessary to evaluate its performance, whether it is a Classification or
Regression model. For evaluating a Classification model, we have the following ways:
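1. Log Loss or Cross-Entropy Loss:
It is used for evaluating the performance of a classifier whose output is a probability value between
0 and 1.
For a good binary Classification model, the value of log loss should be near 0.
The value of log loss increases as the predicted probability deviates from the actual label.
A lower log loss indicates more accurate and better calibrated predicted probabilities.
For binary classification, the cross-entropy for one sample is calculated as:
$-\left(y\log(p) + (1-y)\log(1-p)\right)$
where y is the actual label (0 or 1) and p is the predicted probability of class 1.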
2. Confusion Matrix:
The confusion matrix provides us a matrix/table as output and describes the performance of the
model.
The matrix summarizes the prediction results, giving the total number of correct and incorrect
predictions. For a binary classifier, the matrix looks like the table below:
                     Actual Positive    Actual Negative
Predicted Positive   True Positive      False Positive
Predicted Negative   False Negative     True Negative
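The four cells of a binary confusion matrix, and the accuracy derived from them, can be counted by hand as in the sketch below. The label vectors are made-up values used only to show the counting.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted positive, actually positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        elif actual == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Made-up actual and predicted labels
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)
print(accuracy(tp, fp, fn, tn))
```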
3. AUC-ROC curve:
ROC stands for Receiver Operating Characteristic and AUC stands for Area Under the Curve.
The ROC curve is a graph that shows the performance of the classification model at different thresholds.
The AUC-ROC curve is used to visualize the performance of a binary classification model; a multi-class
model can be handled by plotting one curve per class (one-vs-rest).
The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and FPR (False
Positive Rate) on the X-axis.
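The points of a ROC curve come from sweeping the decision threshold over the predicted scores, and the AUC is the area under those points. The sketch below assumes binary labels 0/1 with at least one example of each class; the scores are made-up values.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold step by step."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up labels and predicted scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
print(points)
print(auc(points))
```

An AUC of 1.0 means the scores rank every positive above every negative, while 0.5 corresponds to random ranking.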
Classification algorithms can be used in different places. Below are some popular use cases of
Classification Algorithms:
Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
Supervised Learning technique. It is used for predicting the categorical dependent variable using a given
set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must
be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False, etc. but instead of
giving the exact value as 0 and 1, it gives the probabilistic values which lie between 0 and 1.
Logistic Regression is similar to Linear Regression except in how they are used. Linear
Regression is used for solving Regression problems, whereas Logistic regression is used for solving
classification problems.
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function, whose
output is bounded between the two limiting values 0 and 1.
The curve from the logistic function indicates the likelihood of something such as whether the cells are
cancerous or not, a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.
Logistic Regression can be used to classify the observations using different types of data and can easily
determine the most effective variables used for the classification. The below image is showing the
logistic function:
Note: Logistic regression uses the concept of predictive modeling as regression; therefore, it is called
logistic regression, but is used to classify samples; Therefore, it falls under the classification algorithm.
• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The output of logistic regression must be between 0 and 1 and cannot go beyond this limit,
so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function or the
logistic function.
• In logistic regression, we use the concept of the threshold value, which decides whether a probability
is mapped to class 0 or class 1: values above the threshold are assigned to 1, and values below the
threshold are assigned to 0.
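The Logistic regression equation can be obtained from the Linear Regression equation. The mathematical
steps to get Logistic Regression equations are given below:
We know the equation of a straight line can be written as:
$y = b_0 + b_1x_1 + b_2x_2 + \dots + b_nx_n$
In Logistic Regression y can be between 0 and 1 only, so let's divide y by (1-y):
$\frac{y}{1-y}$, which is 0 for y = 0 and tends to infinity as y approaches 1
But we need a range between $-\infty$ and $+\infty$, so we take the logarithm, and the equation becomes:
$\log\left[\frac{y}{1-y}\right] = b_0 + b_1x_1 + b_2x_2 + \dots + b_nx_n$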
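It may help to see why the log-odds form leads back to the sigmoid function. Writing z for the linear combination of the inputs (the symbol z is introduced here only for brevity) and solving the log-odds relation for y gives:

```latex
\log\frac{y}{1-y} = z
\;\Longrightarrow\;
\frac{y}{1-y} = e^{z}
\;\Longrightarrow\;
y = \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}}
```

So y is the sigmoid of the linear score, which is why it always stays between 0 and 1.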
On the basis of the categories, Logistic Regression can be classified into three types:
Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the
dependent variable, such as "cat", "dogs", or "sheep"
Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent
variables, such as "low", "Medium", or "High".
To understand the implementation of Logistic Regression in Python, we will use the below example:
Example: There is a dataset given which contains the information of various users obtained from
social networking sites. A car-making company has recently launched a new SUV, so the
company wants to check how many users from the dataset want to purchase the car.
For this problem, we will build a Machine Learning model using the Logistic regression algorithm. The
dataset is shown in the below image. In this problem, we will predict the purchased variable (Dependent
Variable) by using age and salary (Independent variables).
Steps in Logistic Regression: To implement the Logistic Regression using Python, we will use the same
steps as we have done in previous topics of Regression. Below are the steps:
1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can use it in
our code efficiently. It will be the same as we have done in Data pre-processing topic. The code for this is
given below:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
# importing datasets
data_set= pd.read_csv('user_data.csv')
By executing the above lines of code, we will get the dataset as the output. Consider the given image:
Now, we will extract the dependent and independent variables from the given dataset. Below is the code
for it:
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
In the above code, we have taken [2, 3] for x because our independent variables are age and salary,
which are at index 2, 3. And we have taken 4 for y variable because our dependent variable is at index 4.
The output will be:
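Now we will split the dataset into a training set and test set, and then scale the features so that
age and salary are on comparable scales. Below is the code for it:
# splitting the dataset into training and test set (test_size=0.25 is an assumed split)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)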
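The reason fit_transform is called on the training set while only transform is called on the test set is that the scaling statistics should come from the training data alone. A minimal standard-library sketch of the same idea follows; the helper names and the ages are hypothetical, and the population standard deviation is used, which is also what StandardScaler computes.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training data."""
    return mean(column), pstdev(column)

def transform(column, params):
    """Standardize values with previously learned statistics."""
    mu, sigma = params
    return [(value - mu) / sigma for value in column]

# Made-up ages
train_ages = [20, 30, 40, 50]
test_ages = [35, 60]

params = fit_scaler(train_ages)       # statistics come from the training set only
print(transform(train_ages, params))  # scaled training data: mean 0, std 1
print(transform(test_ages, params))   # the test set reuses the training statistics
```

Scaling the test set with its own mean and standard deviation would leak information about the test data and make the two sets inconsistent.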
We have prepared our dataset, and now we will train the model using the training set. For
fitting the model to the training set, we will import the LogisticRegression class from
the sklearn library.
After importing the class, we will create a classifier object and use it to fit the logistic
regression model. Below is the code for it:
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)
Output: By executing the above code, we will get the below output:
Out[5]:
warm_start=False)
Our model is well trained on the training set, so we will now predict the result by using test set data.
Below is the code for it:
y_pred= classifier.predict(x_test)
In the above code, we have created a y_pred vector to predict the test set result.
Output: By executing the above code, a new vector (y_pred) will be created under the variable explorer
option. It can be seen as:
The above output image shows the corresponding predicted users who want to purchase or not
purchase the car.
Now we will create the confusion matrix here to check the accuracy of the classification. To create it, we
need to import the confusion_matrix function of the sklearn library. After importing the function, we will
call it and store the result in a new variable cm. The function takes two main parameters, y_true (the
actual values) and y_pred (the values predicted by the classifier). Below is the code for it:
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
Output:
By executing the above code, a new confusion matrix will be created. Consider the below image:
We can find the accuracy of the predicted result by interpreting the confusion matrix. From the above
output, we can see 65+24= 89 correct predictions and 8+3= 11 incorrect predictions, i.e., an accuracy
of 89% on the 100 test samples.
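Finally, we will visualize the training set result. To visualize the result, we will use the ListedColormap
class of the matplotlib library. Below is the code for it:
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
                     nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=ListedColormap(('purple', 'green'))(i), label=j)
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()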
In the above code, we have imported the ListedColormap class of the Matplotlib library to create the
colormap for visualizing the result. We have created two new variables x_set and y_set to
stand in for x_train and y_train. After that, we have used the nm.meshgrid command to create a
rectangular grid that extends from 1 below the minimum to 1 above the maximum of each (scaled)
feature. The grid points are spaced at a resolution of 0.01.
To create a filled contour, we have used the mtp.contourf command, which creates regions of the provided
colors (purple and green). To this function we have passed the output of classifier.predict on every grid
point, so each region shows the class predicted by the classifier.
Output: By executing the above code, we will get the below output:
The graph can be explained in the below points:
In the above graph, we can see that there are some Green points within the green region and Purple
points within the purple region.
All these data points are the observation points from the training set, which shows the result for
purchased variables.
This graph is made by using two independent variables i.e., Age on the x-axis and Estimated salary on the
y-axis.
The purple point observations are those for which purchased (dependent variable) is 0, i.e., users
who did not purchase the SUV car.
The green point observations are those for which purchased (dependent variable) is 1, i.e., users
who purchased the SUV car.
We can also estimate from the graph that users who are younger with a low salary did not purchase
the car, whereas older users with a high estimated salary purchased the car.
But there are some purple points in the green region (Buying the car) and some green points in the
purple region (Not buying the car). These are observations that the linear boundary misclassifies, such
as younger users with a high estimated salary who purchased the car, or older users with a low
estimated salary who did not.
We have successfully visualized the training set result for the logistic regression, and our goal for this
classification is to divide the users who purchased the SUV car and who did not purchase the car. So from
the output graph, we can clearly see the two regions (Purple and Green) with the observation points.
The Purple region is for those users who didn't buy the car, and Green Region is for those users who
purchased the car.
Linear Classifier:
As we can see from the graph, the decision boundary is a straight line, because Logistic Regression is a
linear model. In further topics, we will learn about non-linear classifiers.
Our model is well trained using the training dataset. Now, we will visualize the result for new
observations (Test set). The code for the test set will remain same as above except that here we will
use x_test and y_test instead of x_train and y_train. Below is the code for it:
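from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
                     nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=ListedColormap(('purple', 'green'))(i), label=j)
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()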
Output:
The above graph shows the test set result. As we can see, the graph is divided into two regions (Purple
and Green). Most Green observations are in the green region, and most Purple observations are in the
purple region, so we can say the model predicts well. Some of the green and purple data points are in
the wrong regions; these are the errors we have already counted using the confusion
matrix (11 incorrect outputs).
Hence our model is pretty good and ready to make new predictions for this classification problem.
K-Nearest Neighbor is one of the simplest Machine Learning algorithms based on Supervised Learning
technique.
K-NN algorithm assumes the similarity between the new case/data and available cases and puts the new
case into the category that is most similar to the available categories.
K-NN algorithm stores all the available data and classifies a new data point based on the similarity. This
means when new data appears, it can be easily classified into a well-suited category by using the K-NN
algorithm.
K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the
Classification problems.
K-NN is a non-parametric algorithm, which means it does not make any assumption about the
underlying data distribution.
It is also called a lazy learner algorithm because it does not learn from the training set immediately;
instead it stores the dataset and, at the time of classification, performs an action on the dataset.
At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it classifies
that data into the category that is most similar to the new data.
Example: Suppose, we have an image of a creature that looks similar to cat and dog, but we want to
know either it is a cat or dog. So for this identification, we can use the KNN algorithm, as it works on a
similarity measure. Our KNN model will find the similar features of the new data set to the cats and dogs
images and based on the most similar features it will put it in either cat or dog category.
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1, so
this data point will lie in which of these categories. To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular dataset.
Consider the below diagram:
How does K-NN work? The K-NN working can be explained on the basis of the below algorithm:
Suppose we have a new data point and we need to put it in the required category. Consider the below
image:
Firstly, we will choose the number of neighbors, so we will choose the k=5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the
distance between two points, which we have already studied in geometry. For two points $(x_1, y_1)$
and $(x_2, y_2)$ it can be calculated as:
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
By calculating the Euclidean distance we got the nearest neighbors, as three nearest neighbors in
category A and two nearest neighbors in category B. Consider the below image:
As we can see the 3 nearest neighbors are from category A, hence this new data point must belong to
category A.
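The steps above (choose K, compute distances, take the K nearest points, vote) can be sketched from scratch as follows. The points are made up so that, as in the example, the query has three nearest neighbors in category A and two in category B.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=5):
    """Majority vote among the k training points closest to the query."""
    by_distance = sorted(zip(train_points, train_labels),
                         key=lambda pair: euclidean(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up 2-D points in two categories
points = [(1, 1), (2, 2), (2, 3), (4, 4), (5, 4), (6, 6), (7, 7)]
labels = ["A", "A", "A", "B", "B", "B", "B"]

# The 5 nearest neighbors of (3, 3) are three A points and two B points
print(knn_predict(points, labels, (3, 3), k=5))
```

With two classes, an odd K avoids tied votes; this sketch otherwise resolves a tie in favor of the label that appears first among the nearest neighbors.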
Below are some points to remember while selecting the value of K in the K-NN algorithm:
There is no particular way to determine the best value for "K", so we need to try some values to
find the best out of them. A commonly used starting value for K is 5.
A very low value for K such as K=1 or K=2 can be noisy and make the model sensitive to outliers.
Large values for K smooth out noise, but they can blur the boundary between classes.
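One simple way to try several values of K is leave-one-out validation: each point is classified using all the other points, and the share of correct answers is compared across K. This is only a sketch on made-up data that contains one deliberately mislabeled point; it relies on math.dist, available in Python 3.8 and later.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k):
    """Majority vote among the k points closest to the query."""
    nearest = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(points, labels, k):
    """Classify each point from all the others and report the hit rate."""
    hits = 0
    for i in range(len(points)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(rest_points, rest_labels, points[i], k) == labels[i]:
            hits += 1
    return hits / len(points)

# Two made-up clusters plus one noisy point labeled B inside cluster A
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (6, 6), (6, 7), (7, 6), (7, 7),
          (2, 1.5)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]

for k in (1, 3):
    print(k, leave_one_out_accuracy(points, labels, k))
```

On this data K=1 lets the single noisy point mislead its neighbors, while K=3 outvotes it, which illustrates why very small values of K are sensitive to outliers.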
Advantages of K-NN algorithm:
It is simple to implement.
It is robust to noisy training data when K is reasonably large.
It can be more effective if the training data is large.
Disadvantages of K-NN algorithm:
It always needs a value of K to be determined, which may be complex at times.
The computation cost is high because the distance to every training sample must be calculated
for each prediction.
To do the Python implementation of the K-NN algorithm, we will use the same problem and dataset
which we have used in Logistic Regression. But here we will improve the performance of the model.
Below is the problem description:
Problem for K-NN Algorithm: There is a car manufacturer company that has manufactured a new SUV
car. The company wants to show ads to the users who are interested in buying that SUV. So for this
problem, we have a dataset that contains information about multiple users collected through a social
network. The dataset contains lots of information, but we will consider Estimated Salary and Age as the
independent variables and Purchased as the dependent variable. Below is the dataset:
Steps to implement the K-NN algorithm:
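The Data Pre-processing step will remain exactly the same as Logistic Regression. Below is the code for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
# importing datasets
data_set= pd.read_csv('user_data.csv')
# extracting independent and dependent variables
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# splitting the dataset into training and test set (test_size=0.25 is an assumed split)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)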
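By executing the above code, our dataset is imported to our program and pre-processed. After
feature scaling our test dataset will look like:
From the above output image, we can see that our data is successfully scaled.
Now we will fit the K-NN classifier to the training data. To do this, we import the KNeighborsClassifier
class from sklearn.neighbors and create the classifier object. The parameters used are:
n_neighbors=5: The number of neighbors to use, matching the K=5 chosen above.
metric='minkowski': This is the default parameter and it decides the distance between the points.
p=2: With the Minkowski metric, p=2 is equivalent to the standard Euclidean distance.
And then we will fit the classifier to the training data. Below is the code for it:
from sklearn.neighbors import KNeighborsClassifier
classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)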
Output: By executing the above code, we will get the output as:
Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
weights='uniform')
Predicting the Test Result: To predict the test set result, we will create a y_pred vector as we did in
Logistic Regression. Below is the code for it:
y_pred= classifier.predict(x_test)
Output:
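Creating the Confusion Matrix: Now we will create the confusion matrix for our K-NN model to check the
accuracy of the classifier. Below is the code for it:
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
In the above code, we have imported the confusion_matrix function and called it, storing the result in
the variable cm.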
Output: By executing the above code, we will get the matrix as below:
In the above image, we can see there are 64+29= 93 correct predictions and 3+4= 7 incorrect
predictions, whereas, in Logistic Regression, there were 11 incorrect predictions. So we can say that the
performance of the model is improved by using the K-NN algorithm.
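from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
                     nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=ListedColormap(('red', 'green'))(i), label=j)
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()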
Output:
The output graph is different from the graph which we obtained in Logistic Regression. It can be
understood in the below points:
As we can see, the graph is showing red points and green points. The green points are for
Purchased(1) and the red points are for not Purchased(0).
The graph is showing an irregular boundary instead of a straight line or a smooth curve because
K-NN assigns each region to the majority class of its nearest neighbors.
The graph has classified users in the correct categories, as most of the users who didn't buy the SUV are
in the red region and users who bought the SUV are in the green region.
The graph is showing a good result, but there are still some green points in the red region and red points
in the green region. This is not a big issue, since a boundary that fit every training point exactly would
be overfitting the training data.
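from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
                     nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=ListedColormap(('red', 'green'))(i), label=j)
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()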
Output:
The above graph is showing the output for the test data set. As we can see in the graph, the predicted
output is good, as most of the red points are in the red region and most of the green points are in
the green region.
However, there are a few green points in the red region and a few red points in the green region. These
are the incorrect predictions that we have observed in the confusion matrix (7 incorrect outputs).
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is
used for Classification as well as Regression problems. However, primarily, it is used for Classification
problems in Machine Learning.
The goal of the SVM algorithm is to find the best line or decision boundary that separates n-dimensional space into classes, so that new data points can be placed in the correct category. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, which is where the name Support Vector Machine comes from. Consider the diagram below, in which two different categories are separated by a decision boundary (hyperplane):
Example: SVM can be understood with the example used for the KNN classifier. Suppose we see a strange cat that also has some features of a dog, and we want a model that can accurately identify whether it is a cat or a dog. Such a model can be built with the SVM algorithm. We first train the model on many images of cats and dogs so that it learns their distinguishing features, and then we test it on this strange creature. The SVM draws a decision boundary between the two classes (cat and dog) using the extreme cases of each class (the support vectors). On the basis of the support vectors, it classifies the creature as a cat. Consider the diagram below:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be divided into two classes by a single straight line, the data is called linearly separable, and the classifier used is called a linear SVM classifier.
Non-linear SVM: Non-linear SVM is used for data that is not linearly separable. If a dataset cannot be divided into classes by a straight line, the data is called non-linear, and the classifier used is called a non-linear SVM classifier.
Hyperplane: There can be multiple lines/decision boundaries that separate the classes in n-dimensional space, but we need to find the one that best classifies the data points. This best boundary is known as the hyperplane of SVM.
The dimension of the hyperplane depends on the number of features in the dataset. With 2 features (as shown in the image), the hyperplane is a straight line; with 3 features, it is a 2-dimensional plane.
We always choose the hyperplane with the maximum margin, that is, the maximum distance between the hyperplane and the nearest data points of each class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect its position are called support vectors. They are so named because they support the hyperplane.
Linear SVM:
The working of the SVM algorithm can be understood with an example. Suppose we have a dataset with two tags (green and blue) and two features, x1 and x2. We want a classifier that can classify each pair of coordinates (x1, x2) as either green or blue. Consider the image below:
Since this is a 2-d space, we can separate the two classes with a straight line. However, many different lines can separate them. Consider the image below:
The SVM algorithm finds the best line or decision boundary; this best boundary is called the hyperplane. The algorithm finds the points of each class that lie closest to the boundary. These points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
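The idea of comparing candidate boundaries by their margin can be sketched in a few lines of plain Python. This is only a geometric illustration, not how SVM is trained in practice: the four data points and the two candidate lines are hypothetical values chosen for the example, and the helper names are not part of any library.

```python
import math

def signed_distance(w, b, point):
    """Signed distance from `point` to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm

def margin_and_support_vectors(w, b, points, tol=1e-9):
    """Return the margin (smallest absolute distance) and the points attaining it."""
    distances = [abs(signed_distance(w, b, p)) for p in points]
    margin = min(distances)
    support = [p for p, d in zip(points, distances) if d - margin <= tol]
    return margin, support

# Hypothetical toy data: two "green" points and two "blue" points.
green = [(1.0, 1.0), (0.0, 2.0)]
blue = [(3.0, 3.0), (4.0, 2.0)]
points = green + blue

# Two candidate separating lines: x1 + x2 - 4 = 0 and x1 - 2.5 = 0.
candidates = {"diagonal": ((1.0, 1.0), -4.0), "vertical": ((1.0, 0.0), -2.5)}

# Both lines separate the classes; the one with the larger margin is preferred.
margins = {name: margin_and_support_vectors(w, b, points)[0]
           for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```

Both lines classify the toy points correctly, but the diagonal line keeps every point farther away, so it is the better boundary in the maximum-margin sense.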
Non-Linear SVM:
If the data is linearly separable, we can separate it with a straight line, but for non-linear data we cannot draw a single straight line. Consider the image below:
To separate these data points, we need to add one more dimension. For linear data we used two dimensions, x and y, so for non-linear data we add a third dimension, z. It can be calculated as:
z = x^2 + y^2
By adding the third dimension, the sample space becomes as shown in the image below:
SVM now divides the dataset into classes in the following way. Consider the image below:
Since we are in 3-d space, the boundary looks like a plane parallel to the x-axis. If we convert it back to 2-d space with z = 1, it becomes a circle of radius 1:
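A small sketch can make the effect of the extra dimension concrete. The two rings of points below are hypothetical toy data (one class near the origin, one farther out); no straight line separates them in the x-y plane, but after adding z = x^2 + y^2 the plane z = 1 does.

```python
import math

def lift(point):
    """Map a 2-D point (x, y) to 3-D by adding z = x^2 + y^2."""
    x, y = point
    return (x, y, x * x + y * y)

def ring(radius, n=8):
    """n points evenly spaced on a circle of the given radius (toy data)."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

inner = ring(0.5)   # hypothetical class A: points near the origin
outer = ring(2.0)   # hypothetical class B: points on a larger circle

z_inner = [lift(p)[2] for p in inner]
z_outer = [lift(p)[2] for p in outer]

# In the lifted space the plane z = 1 separates the two classes.
separable = max(z_inner) < 1.0 < min(z_outer)
print(separable)
```

Projected back to two dimensions, the plane z = 1 corresponds to the circle x^2 + y^2 = 1, which is the curved boundary described above.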
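Now we will implement the SVM algorithm using Python. We will use the same user_data dataset that we used for Logistic Regression and KNN classification.
Up to the data pre-processing step, the code remains the same. Below is the code:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
# importing the dataset
data_set = pd.read_csv('user_data.csv')
# extracting the independent variables (columns 2 and 3) and the dependent variable (column 4)
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values
# splitting the dataset into training and test sets (test_size and random_state are assumed values)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)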
Executing the above code pre-processes the data. The resulting dataset is:
The scaled output for the test set is:
Fitting the SVM classifier to the training set:
Next, we fit the SVM classifier to the training set. To create the classifier, we import the SVC class from the sklearn.svm library. Below is the code for it:
from sklearn.svm import SVC  # Support Vector Classifier
classifier = SVC(kernel='linear')
classifier.fit(x_train, y_train)
In the above code, we have used kernel='linear' because we are creating an SVM for linearly separable data. The kernel can be changed for non-linear data. We then fit the classifier to the training dataset (x_train, y_train).
y_pred= classifier.predict(x_test)
After getting the y_pred vector, we can compare the result of y_pred and y_test to check the difference
between the actual value and predicted value.
Output: Below is the output for the prediction of the test set:
Output:
As we can see in the above output image, there are 66+24 = 90 correct predictions and 8+2 = 10 incorrect predictions. Therefore we can say that our SVM model improved compared to the Logistic Regression model.
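The arithmetic behind these counts can be checked with a short sketch. The four numbers are the ones quoted above; which off-diagonal cell holds the 8 and which holds the 2 is an assumption made only for this illustration, and it does not affect the accuracy.

```python
# Confusion-matrix counts quoted in the text. Rows are actual classes and
# columns are predicted classes; the placement of 8 and 2 is assumed.
confusion = [[66, 2],
             [8, 24]]

correct = sum(confusion[i][i] for i in range(len(confusion)))   # diagonal cells
total = sum(sum(row) for row in confusion)
incorrect = total - correct
accuracy = correct / total

print(f"correct={correct}, incorrect={incorrect}, accuracy={accuracy:.2f}")
```

Correct predictions sit on the diagonal of the matrix, so accuracy is the diagonal sum divided by the total number of test observations.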
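# x1, x2 are the mesh-grid coordinates built from the training features (x_set, y_set = x_train, y_train)
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# Plot the observations of each class in its own colour
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=('red', 'green')[i], label=j)
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()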
Output:
As we can see, the above output looks similar to the Logistic Regression output. The hyperplane is a straight line because we used a linear kernel in the classifier; as discussed above, in 2-d space the SVM hyperplane is a straight line.
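# x1, x2 are the mesh-grid coordinates built from the test features (x_set, y_set = x_test, y_test)
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# Plot the observations of each class in its own colour
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=('red', 'green')[i], label=j)
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()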
Output:
As we can see in the above output image, the SVM classifier has divided the users into two regions (Purchased or Not purchased). Consistent with the earlier plots, users who purchased the SUV are in the green region with the green scatter points, and users who did not purchase the SUV are in the red region with the red scatter points. The hyperplane separates the two classes, Purchased and Not purchased.
Naïve Bayes Classifier Algorithm
Naïve Bayes is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems.
The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.
It is a probabilistic classifier, which means it predicts on the basis of the probability that an object belongs to a class.
Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
The name Naïve Bayes is made up of two words, Naïve and Bayes, which can be described as:
Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature contributes individually to identifying it as an apple, without depending on the others.
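Bayes' Theorem:
Bayes' theorem, also known as Bayes' rule or Bayes' law, is used to determine the probability of a hypothesis given prior knowledge. It depends on conditional probability. The formula for Bayes' theorem is:
P(A|B) = P(B|A)*P(A)/P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the likelihood: the probability of the evidence given that the hypothesis is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.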
The working of the Naïve Bayes classifier can be understood with the help of the example below:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether or not we should play on a particular day according to the weather conditions. To solve this problem, we convert the dataset into a frequency table, build a likelihood table from the frequencies, and then apply Bayes' theorem to calculate the posterior probability:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
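Frequency table for the weather conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table for the weather conditions:
Weather No Yes P(Weather)
Overcast 0 5 5/14=0.36
Rainy 2 2 4/14=0.29
Sunny 2 3 5/14=0.36
All 4/14=0.29 10/14=0.71
Applying Bayes' theorem:
P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes)= 3/10= 0.3
P(Sunny)= 5/14= 0.36
P(Yes)= 10/14= 0.71
So P(Yes|Sunny)= (3/10)*(10/14)/(5/14)= 3/5= 0.60
P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No)= 2/4= 0.5
P(No)= 4/14= 0.29
P(Sunny)= 5/14= 0.36
So P(No|Sunny)= (2/4)*(4/14)/(5/14)= 2/5= 0.40
Since P(Yes|Sunny) > P(No|Sunny), the player can play on a sunny day.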
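The same calculation can be reproduced directly from the 14 observations in the Outlook/Play table. The sketch below uses only the Python standard library; the function name posterior is chosen for this example and is not part of any package.

```python
from collections import Counter

# The 14 (Outlook, Play) observations from the table above.
data = [
    ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
    ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
    ("Overcast", "Yes"), ("Overcast", "Yes"),
]

def posterior(outlook, play, rows):
    """P(play | outlook) = P(outlook | play) * P(play) / P(outlook)."""
    n = len(rows)
    class_counts = Counter(p for _, p in rows)      # frequency of Yes / No
    outlook_counts = Counter(o for o, _ in rows)    # frequency of each outlook
    joint_counts = Counter(rows)                    # frequency of each (outlook, play) pair
    likelihood = joint_counts[(outlook, play)] / class_counts[play]
    prior = class_counts[play] / n
    evidence = outlook_counts[outlook] / n
    return likelihood * prior / evidence

p_yes = posterior("Sunny", "Yes", data)
p_no = posterior("Sunny", "No", data)
print(round(p_yes, 2), round(p_no, 2))
```

Because the two posteriors are computed from the same evidence term P(Sunny), they sum to 1, and the class with the larger value is the prediction.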
Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a data point.
Naive Bayes assumes that all features are independent of one another, so it cannot learn relationships between features.
There are three types of Naive Bayes model, which are given below:
Gaussian: The Gaussian model assumes that features follow a normal distribution. If predictors take continuous values instead of discrete ones, the model assumes that these values are sampled from a Gaussian distribution.
Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, that is, deciding which category (such as sports, politics, or education) a particular document belongs to. The classifier uses the frequency of words as the predictors.
Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present in a document or not. This model is also well known for document classification tasks.
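To illustrate the Gaussian variant, here is a minimal from-scratch sketch. It is not the scikit-learn implementation; the function names and the six training rows (age, salary in thousands) are hypothetical and chosen only to show how per-class means and variances turn into a prediction.

```python
import math
from statistics import mean, pvariance

def log_gaussian(x, mu, var):
    """Log of the normal density with mean mu and variance var, evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def fit(samples):
    """samples: list of (features, label). Store prior, means and variances per class."""
    model = {}
    for label in {lab for _, lab in samples}:
        rows = [x for x, lab in samples if lab == label]
        columns = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(samples),
            "means": [mean(c) for c in columns],
            # small constant avoids division by zero for constant features
            "vars": [pvariance(c) + 1e-9 for c in columns],
        }
    return model

def predict(model, x):
    """Return the class with the highest log posterior (up to a shared constant)."""
    def score(params):
        return math.log(params["prior"]) + sum(
            log_gaussian(xi, mu, var)
            for xi, mu, var in zip(x, params["means"], params["vars"]))
    return max(model, key=lambda label: score(model[label]))

# Hypothetical toy data: (age, salary in thousands) -> purchased (1) or not (0)
train = [((22, 20), 0), ((25, 30), 0), ((30, 25), 0),
         ((45, 90), 1), ((50, 80), 1), ((48, 100), 1)]
model = fit(train)
print(predict(model, (24, 28)), predict(model, (47, 85)))
```

The per-feature log densities are simply added, which is exactly the naïve independence assumption described earlier.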
Python Implementation of the Naïve Bayes algorithm:
Now we will implement the Naive Bayes algorithm using Python. We will use the "user_data" dataset that we used for the other classification models, so that we can easily compare the Naive Bayes model with them.
Steps to implement:
Data pre-processing step: In this step, we pre-process/prepare the data so that we can use it efficiently in our code. It is similar to the data pre-processing we did before. The code for this is given below:
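# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set (test_size and random_state are assumed values)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)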
In the above code, we have loaded the dataset into our program using "dataset = pd.read_csv('user_data.csv')". The loaded dataset is divided into a training set and a test set, and then the feature variables are scaled.
After the pre-processing step, now we will fit the Naive Bayes model to the Training set. Below is the
code for it:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
In the above code, we have used the GaussianNB classifier to fit it to the training dataset. We can also
use other classifiers as per our requirement.
Now we will predict the test set result. For this, we will create a new predictor variable y_pred, and will
use the predict function to make the predictions.
y_pred = classifier.predict(x_test)
Output:
The above output shows the prediction vector y_pred alongside the real vector y_test. We can see that some predictions differ from the real values; these are the incorrect predictions.
Now we will check the accuracy of the Naive Bayes classifier using the Confusion matrix. Below is the
code for it:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Output:
As we can see in the above confusion matrix output, there are 7+3= 10 incorrect predictions, and
65+25=90 correct predictions.
Next, we will visualize the training set result using Naïve Bayes Classifier. Below is the code for it:
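# X1, X2 are the mesh-grid coordinates built from the training features (x_set, y_set = x_train, y_train)
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
# Plot the observations of each class in its own colour
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=('red', 'green')[i], label=j)
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()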
Output:
In the above output we can see that the Naïve Bayes classifier has separated the data points with a smooth boundary. The boundary is curved rather than straight because we used the GaussianNB classifier in our code.
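# X1, X2 are the mesh-grid coordinates built from the test features (x_set, y_set = x_test, y_test)
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
# Plot the observations of each class in its own colour
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=('red', 'green')[i], label=j)
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()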
Output:
The above output is the final output for the test set. As we can see, the classifier has created a Gaussian curve to divide the "purchased" and "not purchased" classes. There are some wrong predictions, which we counted in the confusion matrix, but it is still a pretty good classifier.
We are probably living in the most defining period of human history. The period when computing moved
from large mainframes to PCs to the cloud. But what makes it defining is not what has happened but
what is coming our way in years to come. What makes this period exciting and enthralling for someone
like me is the democratization of the various tools, techniques, and machine learning algorithms that
followed the boost in computing. Welcome to the world of data science.
Today, as a data scientist, I can build data-crunching machines with complex algorithms for a few dollars
per hour. But reaching here wasn’t easy! I had my dark days and nights.
Supervised Learning
How it works: This algorithm consists of a target/outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps input data to desired outputs. The training process continues until the model achieves the desired level of accuracy on the training data. Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression, etc.
Unsupervised Learning
How it works: In this algorithm, we do not have any target or outcome variable to predict/estimate (such data is called unlabeled data). It is used for recommendation systems or for clustering populations into different groups. Clustering algorithms are widely used for segmenting customers into different groups for specific interventions. Examples of Unsupervised Learning: Apriori algorithm, K-means clustering.
Reinforcement Learning
How it works: Using this algorithm, the machine is trained to make specific decisions. The machine is exposed to an environment where it trains itself continually using trial and error. It learns from past experience and tries to capture the best possible knowledge to make accurate business decisions. Example of Reinforcement Learning: Markov Decision Process
1. Linear Regression
It is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on one or more continuous variables. Here, we establish the relationship between the independent and dependent variables by fitting the best line. This best-fit line is known as the regression line and is represented by the linear equation Y = a*X + b.
The best way to understand linear regression is to relive this experience of childhood. Let us say you ask
a child in fifth grade to arrange people in his class by increasing the order of weight without asking them
their weights! What do you think the child will do? He/she would likely look (visually analyze) at the
height and build of people and arrange them using a combination of these visible parameters. This is
linear regression in real life! The child has actually figured out that height and build would be correlated
to weight by a relationship, which looks like the equation above.
In this equation:
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
The coefficients a and b are derived by minimizing the sum of the squared distances between the data points and the regression line.
Look at the below example. Here we have identified the best-fit line having linear
equation y=0.2811x+13.9. Now using this equation, we can find the weight, knowing the height of a
person.
Linear Regression is mainly of two types: Simple Linear Regression and Multiple Linear Regression.
Simple Linear Regression is characterized by one independent variable. And, Multiple Linear
Regression(as the name suggests) is characterized by multiple (more than 1) independent variables.
When a straight line does not fit the data well, you can fit a curve instead; this is known as polynomial or curvilinear regression.
Here’s a coding window to try out your hand and build your own linear regression model:
Python:
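A minimal sketch of simple linear regression, assuming a single predictor and using the closed-form least-squares solution for the slope a and intercept b. The heights and weights are hypothetical values generated from a known line so that the fit can be checked by eye; they are not the data behind the example equation above.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for the line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept makes the line pass through the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Hypothetical heights (cm) and weights (kg), generated from y = 0.5*x - 20
heights = [150, 160, 170, 180, 190]
weights = [0.5 * h - 20 for h in heights]

a, b = fit_line(heights, weights)
predicted = a * 175 + b   # predicted weight for a height of 175 cm
print(a, b, predicted)
```

In practice a library such as scikit-learn would be used, but the two formulas above are all that a one-variable least-squares fit requires.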
R Code:
#Identify feature and response variable(s); values must be numeric
# x_train, y_train and x_test are placeholders for your own data
x <- data.frame(x_train, y_train)
# Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
summary(linear)
#Predict Output
predicted <- predict(linear, data.frame(x_test))
2. Logistic Regression
Don’t get confused by its name! It is a classification algorithm, not a regression algorithm. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). In simple words, it predicts the probability of the occurrence of an event by fitting data to a logistic function. Hence, it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1 (as expected).
Let’s say your friend gives you a puzzle to solve. There are only 2 outcome scenarios: either you solve it, or you don’t. Now imagine that you are being given a wide range of puzzles/quizzes in an attempt to understand which subjects you are good at. The outcome of this study would be something like this: if you are given a trigonometry-based tenth-grade problem, you are 70% likely to solve it. On the other hand, if it is a fifth-grade history question, the probability of getting an answer is only 30%. This is what Logistic Regression provides you.
Coming to the math, the log odds of the outcome are modeled as a linear combination of the predictor variables:
odds = p/(1-p)
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0 + b1*X1 + b2*X2 + ... + bk*Xk
Here p is the probability of the outcome of interest and b0, ..., bk are the coefficients fitted to the data.
Now, you may ask, why take a log? For the sake of simplicity, let’s just say that this is one of the best mathematical ways to replicate a step function. I can go into more details, but that would defeat the purpose of this article.
Build your own logistic regression model in Python here and check the accuracy:
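As a small illustration of the log-odds relationship, the sketch below converts a linear score into a probability with the logistic (sigmoid) function and back again with the logit. The coefficients and the input are hypothetical values chosen for the example; in a real model they would be learned from the training data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log odds of probability p; the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

def predict_proba(x, coefficients, intercept):
    """P(y = 1 | x) for a model whose log odds are intercept + sum(b_i * x_i)."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return sigmoid(z)

# Hypothetical coefficients, chosen only for illustration
p = predict_proba([2.0, 1.0], coefficients=[0.8, -0.5], intercept=-0.3)
label = 1 if p >= 0.5 else 0   # 0.5 is the usual decision threshold
print(round(p, 3), label)
```

Applying logit to the predicted probability recovers the linear score, which is why the model is said to be linear in the log odds.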
R Code:
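# x_train, y_train and x_test are placeholders for your own data
x <- data.frame(x_train, y_train)
# Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x, family = "binomial")
summary(logistic)
#Predict Output (type = "response" returns probabilities)
predicted <- predict(logistic, data.frame(x_test), type = "response")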
3. Decision Tree
This is one of my favorite algorithms, and I use it quite frequently. It is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets, based on the most significant attributes/independent variables, so that the groups are as distinct as possible. For more details, you can read Decision Tree Simplified.
In the image above, you can see that the population is classified into four different groups based on multiple attributes to identify ‘if they will play or not’. To split the population into groups that differ from one another as much as possible, it uses various techniques like Gini, Information Gain, Chi-square, and entropy.
The best way to understand how the decision tree works, is to play Jezzball – a classic game from
Microsoft (image below). Essentially, you have a room with moving walls and you need to create walls
such that the maximum area gets cleared off without the balls.
So, every time you split the room with a wall, you are trying to create 2 different populations within the
same room. Decision trees work in a very similar fashion by dividing a population into as different groups
as possible.
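One of the splitting criteria mentioned above, Gini impurity, can be sketched as follows. The search over equality splits is a simplified stand-in for what a real decision-tree library does, and the six weather rows are hypothetical toy data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(groups):
    """Impurity of a split: each group's Gini weighted by its share of the rows."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)

def best_split(rows, labels):
    """Try each (feature, value) equality split; return the one with lowest impurity."""
    best = None
    for feature in range(len(rows[0])):
        for value in {row[feature] for row in rows}:
            left = [lab for row, lab in zip(rows, labels) if row[feature] == value]
            right = [lab for row, lab in zip(rows, labels) if row[feature] != value]
            score = weighted_gini([left, right])
            if best is None or score < best[0]:
                best = (score, feature, value)
    return best

# Hypothetical toy data: (outlook, wind) -> decision
rows = [("Overcast", "Calm"), ("Overcast", "Windy"), ("Sunny", "Calm"),
        ("Sunny", "Windy"), ("Rainy", "Calm"), ("Rainy", "Windy")]
labels = ["Play", "Play", "Play", "Play", "Stay", "Stay"]

best = best_split(rows, labels)
print(best)
```

Splitting on whether the outlook is Rainy produces two pure groups, so that split has impurity 0 and would be chosen first; a full tree repeats this search inside each resulting group.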
R Code:
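library(rpart)
# x_train, y_train and x_test are placeholders for your own data
x <- data.frame(x_train, y_train = as.factor(y_train))
# grow tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
#Predict Output
predicted <- predict(fit, data.frame(x_test), type = "class")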
4. SVM (Support Vector Machine)
It is a classification method. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.
For example, if we only had two features, such as the height and hair length of an individual, we’d first plot these two variables in two-dimensional space, where each point has two coordinates. (The points that lie closest to the separating boundary are known as support vectors.)
Now, we will find a line that splits the data between the two differently classified groups. This will be the line for which the distance to the closest point in each of the two groups is as large as possible. If there are more variables, a hyperplane is used to separate the classes.
In the example shown above, the line that splits the data into two differently classified groups is the black line, since the two closest points are the farthest from it. This line is our classifier. A new data point is then classified according to the side of the line on which it lands.
Think of this algorithm as playing Jezz Ball in n-dimensional space. The tweaks in the game are:
You can draw lines/planes at any angle (rather than just horizontal or vertical as in the classic game)
The objective of the game is to segregate balls of different colors in different rooms.
R Code:
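library(e1071)
# x_train, y_train and x_test are placeholders for your own data
x <- data.frame(x_train, y_train = as.factor(y_train))
# Fitting model
fit <- svm(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted <- predict(fit, data.frame(x_test))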
5. Naive Bayes
The Naive Bayesian model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes can sometimes outperform more sophisticated classification methods.
Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c):
P(c|x) = P(x|c) * P(c) / P(x)
Here,
P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
P(c) is the prior probability of the class.
P(x|c) is the likelihood, which is the probability of the predictor given the class.
P(x) is the prior probability of the predictor.
Example: Let’s understand it using an example. Below is a training data set of weather and the corresponding target variable, ‘Play.’ Now, we need to classify whether players will play or not based on weather conditions. Let’s follow the below steps to perform it.
Step 1: Convert the data set into a frequency table.
Step 2: Create a Likelihood table by finding the probabilities, such as Overcast probability = 0.29 and probability of playing = 0.64.
Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
Problem: Players will play if the weather is sunny. Is this statement correct?
We can solve it using the method discussed above, so P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60. This is higher than P(No | Sunny) = 0.40, so the statement is correct.
Naive Bayes uses a similar method to predict the probability of different classes based on various
attributes. This algorithm is mostly used in text classification and with problems having multiple classes.
R Code:
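library(e1071)
# x_train, y_train and x_test are placeholders for your own data
x <- data.frame(x_train, y_train = as.factor(y_train))
# Fitting model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)
#Predict Output
predicted <- predict(fit, data.frame(x_test))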
6. kNN (k-Nearest Neighbors)
It can be used for both classification and regression problems. However, it is more widely used for classification problems in industry. K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k neighbors. A new case is assigned to the class that is most common among its K nearest neighbors, as measured by a distance function.
These distance functions can be Euclidean, Manhattan, Minkowski, and Hamming distances. The first three are used for continuous variables, and the fourth (Hamming) for categorical variables. If K = 1, the case is simply assigned to the class of its nearest neighbor. At times, choosing K turns out to be a challenge when performing kNN modeling.
KNN can easily be mapped to our real lives. If you want to learn about a person about whom you have no information, you might find out about their close friends and the circles they move in, and so gain access to information about them!
Python Code:
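A minimal from-scratch sketch of the majority-vote rule, assuming Euclidean distance and a small hypothetical training set with two classes, "A" and "B". It is meant to show the mechanics only; a library implementation would add efficient neighbor search and tie handling.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, query, k=3):
    """train: list of (point, label). Majority vote among the k nearest points."""
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
         ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B")]

near_a = knn_predict(train, (1.2, 1.4), k=3)
near_b = knn_predict(train, (6.4, 6.1), k=3)
print(near_a, near_b)
```

Note that there is no training step: the stored cases are the model, and all the work happens at prediction time.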
R Code:
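library(class)
# x_train, y_train and x_test are placeholders for your own data
# Fitting model and predicting in one step: knn() labels each row of x_test
# by a vote of its k nearest training rows (k = 5 is an example value)
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)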
7. K-Means
It is a type of unsupervised algorithm which solves the clustering problem. Its procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters). Data points inside a cluster are homogeneous, and heterogeneous with respect to other clusters.
Remember figuring out shapes from ink blots? K-means is somewhat similar to this activity. You look at the shape and spread to decipher how many different clusters/populations are present!
In K-means, we have clusters, and each cluster has its own centroid. The sum of the squared differences between the centroid and the data points within a cluster constitutes the within-cluster sum of squares for that cluster. When the sum of squares values for all the clusters are added, the result is the total within-cluster sum of squares for the cluster solution.
We know that as the number of clusters increases, this value keeps on decreasing, but if you plot the result, you may see that the sum of squared distances decreases sharply up to some value of k and then much more slowly after that. This point (the "elbow") gives the optimum number of clusters.
Python Code:
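The sketch below implements the basic assign-and-update loop (Lloyd's algorithm) in plain Python and reports the total within-cluster sum of squares for several values of k. The twelve points form three hypothetical groups, and the deterministic choice of starting centroids is a simplification made for reproducibility; real implementations usually start from random or k-means++ centroids.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with a deterministic start (evenly spaced points as centroids)."""
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    wcss = sum(squared_distance(p, centroids[i])
               for i, cluster in enumerate(clusters) for p in cluster)
    return centroids, wcss

# Hypothetical toy data: three tight groups of four points each
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (1, 8), (1, 9), (2, 8), (2, 9)]

wcss = {k: kmeans(points, k)[1] for k in (1, 2, 3, 4)}
print(wcss)
```

On this data the sum of squares falls steeply until k = 3 and only slightly afterwards, which is the elbow pattern described above.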
R Code:
library(cluster)
8. Random Forest
Random Forest is a trademarked term for an ensemble of decision trees. In Random Forest, we’ve got a collection of decision trees (also known as a “Forest”). To classify a new object based on its attributes, each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
Each tree is planted and grown as follows:
If the number of cases in the training set is N, then a sample of N cases is taken at random but with replacement. This sample will be the training set for growing the tree.
If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node. The value of m is held constant during the forest growth.
Python Code:
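The three ingredients described above (a bootstrap sample of the N cases, a random subset of m features, and a majority vote over the trees) can be sketched with the standard library alone. This is not a full random forest, since no trees are actually grown here; the function names, the row ids, and the list of votes are hypothetical and used only for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw N rows at random with replacement from a training set of N rows."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, m, rng):
    """Pick m of the M feature indices at random (m << M in practice)."""
    return sorted(rng.sample(range(n_features), m))

def majority_vote(predictions):
    """The forest's answer is the class that most trees voted for."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
rows = list(range(10))            # stand-in ids for 10 training cases
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=8, m=3, rng=rng)
forest_answer = majority_vote(["yes", "no", "yes", "yes", "no"])
print(sample, features, forest_answer)
```

Because sampling is done with replacement, some cases usually appear more than once in a tree's training set while others are left out, which is what makes the trees differ from one another.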
R Code:
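library(randomForest)
# x_train, y_train and x_test are placeholders for your own data
x <- data.frame(x_train, y_train = as.factor(y_train))
# Fitting model (ntree = 500 is an example value)
fit <- randomForest(y_train ~ ., data = x, ntree = 500)
summary(fit)
#Predict Output
predicted <- predict(fit, data.frame(x_test))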
9. Dimensionality Reduction Algorithms
In the last 4-5 years, there has been an exponential increase in data capturing at every possible stage. Corporates, government agencies, and research organizations are not only coming up with new sources but are also capturing data in great detail.
For example, e-commerce companies are capturing more details about customers, such as their demographics, web crawling history, what they like or dislike, purchase history, feedback, and many others, to give them more personalized attention than your nearest grocery shopkeeper.
As data scientists, the data we are offered also consists of many features. This sounds good for building a robust model, but there is a challenge: how would you identify the highly significant variable(s) out of 1000 or 2000? In such cases, dimensionality reduction algorithms help us, along with various other techniques like Decision Tree, Random Forest, PCA (principal component analysis), Factor Analysis, identification based on the correlation matrix, missing value ratio, and others.
Python Code:
R Code:
library(stats)
Now, let’s look at the 4 most commonly used gradient boosting algorithms.
GBM
GBM is a boosting algorithm used when we deal with plenty of data and want a prediction with high predictive power. Boosting is an ensemble learning technique that combines the predictions of several base estimators in order to improve robustness over a single estimator. It combines multiple weak or average predictors to build a strong predictor. Boosting algorithms often work well in data science competitions like Kaggle, AV Hackathon, and Crowd Analytix.
Python Code:
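To show how weak predictors are combined, here is a from-scratch sketch of gradient boosting for a regression task with squared-error loss, where each new weak learner (a one-split "stump") is fitted to the residuals of the current ensemble. It is a teaching sketch rather than the algorithm of any particular library; the data, the number of rounds, and the learning rate are assumed values.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature, minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly add a stump fitted to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data: a noisy step from about 1 to about 3
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

baseline = gradient_boost(xs, ys, n_rounds=0)   # predicts the mean only
boosted = gradient_boost(xs, ys, n_rounds=50)
print(mse(baseline, xs, ys), mse(boosted, xs, ys))
```

Each stump on its own is a poor model, but adding many of them with a small learning rate drives the training error far below that of the mean-only baseline.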
R Code:
library(caret)
x <- cbind(x_train,y_train)
# Fitting model
Gradient Boosting Classifier and Random Forest are two different tree-ensemble classifiers (the first uses boosting, the second uses bagging), and people often ask about the difference between these two algorithms.
XGBoost
Another classic gradient-boosting algorithm, known to be the decisive choice between winning and losing in some Kaggle competitions, is XGBoost. It has very high predictive power, which makes it a strong choice when accuracy matters. It offers both a linear model and a tree learning algorithm, and is reported to be almost 10x faster than earlier gradient boosting techniques.
One of the most interesting things about XGBoost is that it is also called a regularized boosting technique. Regularization helps to reduce overfitting. XGBoost also has support for a range of languages such as Scala, Java, R, Python, Julia, and C++.
It supports various objective functions, including regression, classification, and ranking. It also supports distributed training on many machines, including GCE, AWS, Azure, and Yarn clusters. XGBoost can be integrated with Spark, Flink, and other cloud dataflow systems, and has built-in cross-validation at each iteration of the boosting process.
Python Code:
R Code:
require(caret)
x <- cbind(x_train,y_train)
# Fitting model
TrainControl <- trainControl( method = "repeatedcv", number = 10, repeats = 4)
OR
Light GBM
Light GBM is a gradient-boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient.
It is a fast, high-performance gradient-boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine-learning tasks. It was developed under the Distributed Machine Learning Toolkit Project of Microsoft.
Since Light GBM is based on decision tree algorithms, it splits the tree leaf-wise with the best fit, whereas other boosting algorithms split the tree depth-wise or level-wise. When growing the same leaf, the leaf-wise algorithm can reduce more loss than the level-wise algorithm, which can result in better accuracy than other boosting algorithms typically achieve.
Python Code:
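import numpy as np
import lightgbm as lgb
# Synthetic data assumed for illustration: 500 samples, 10 features, binary label
train_data = lgb.Dataset(np.random.rand(500, 10), label=np.random.randint(2, size=500))
param = {'num_leaves': 31, 'objective': 'binary', 'metric': 'auc'}
bst = lgb.train(param, train_data, num_boost_round=10)
bst.save_model('model.txt')
ypred = bst.predict(np.random.rand(7, 10))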
R Code:
library(RLightGBM)
data(example.binary)
#Parameters
lgbm.data.setField(handle.data, "label", y)
lgbm.booster.train(handle.booster, num_iterations, 5)
#Predict
#Test accuracy
If you’re familiar with the Caret package in R, this is another way of implementing the LightGBM.
require(caret)
require(RLightGBM)
data(iris)
model <-caretModel.LGBM()
print(fit)
library(Matrix)
print(fit)
Cat boost
CatBoost is an open-source machine learning algorithm from Yandex. It can easily integrate with deep learning frameworks like Google’s TensorFlow and Apple’s Core ML. The best part about CatBoost is that it does not require extensive data training like other ML models and can work on a variety of data formats, without compromising robustness.
CatBoost can automatically deal with categorical variables without raising type conversion errors, which lets you focus on tuning your model rather than sorting out trivial errors. Make sure you handle missing data well before you proceed with the implementation.
Python Code:
import pandas as pd
import numpy as np
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
train.fillna(-999, inplace=True)
test.fillna(-999,inplace=True)
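#Creating a training set for modeling and validation set to check model performance
X = train.drop(['Item_Outlet_Sales'], axis=1)
y = train.Item_Outlet_Sales
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor
# The split, categorical-column, and model lines are assumed; the original omits them
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.7, random_state=1234)
categorical_features_indices = np.where(X.dtypes != float)[0]
model = CatBoostRegressor(loss_function='RMSE')
model.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_validation, y_validation), plot=True)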
submission = pd.DataFrame()
submission['Item_Identifier'] = test['Item_Identifier']
submission['Outlet_Identifier'] = test['Outlet_Identifier']
submission['Item_Outlet_Sales'] = model.predict(test)
R Code:
set.seed(1)
require(titanic)
require(caret)
require(catboost)
tt <- titanic::titanic_train[complete.cases(titanic::titanic_train),]
grid <- expand.grid(depth = c(4, 6, 8),learning_rate = 0.1,iterations = 100, l2_leaf_reg = 1e-3, rsm =
0.95, border_count = 64)
print(report)
print(importance)
End Note
By now, I am sure you would have an idea of commonly used machine learning algorithms. My sole
intention behind writing this article and providing the codes in R and Python is to get you started right
away. If you are keen to master machine learning algorithms, start right away. Take up problems, develop
a physical understanding of the process, apply these codes, and watch the fun!
Key Takeaways
We are now familiar with some of the most common ML algorithms used in the industry.
We’ve covered the advantages and disadvantages of various ML algorithms.
We’ve also learned the basic implementation details in R and Python languages.
CLASSIFICATION ALGORITHM
The Classification algorithm is a Supervised Learning technique that uses training data to determine the
category of new observations. A program in Classification learns from a given dataset or observations
and then classifies new observations into one of several classes or groups. For example, Yes or No, 0 or 1,
Spam or No Spam, cat or dog, and so on. Classes are also known as targets/labels or categories.
Unlike regression, the output variable of Classification is a category, not a value, such as “Green or Blue”,
“fruit or animal”, etc. Since the Classification algorithm is a Supervised learning technique, it takes
labeled input data, which means each input comes with its corresponding output.
In a Classification algorithm, the model tries to predict the correct label of a given input. The model is
first trained on the training data and then evaluated on test data before being used to make predictions
on new, unseen data.
The main goal of the Classification algorithm is to identify the category of a given observation, and these
algorithms are mainly used to predict the output for categorical data.
Classification algorithms can be better understood using the below diagram. In the below diagram, there
are two classes, Class A and Class B. The points within each class have features that are similar to each
other and dissimilar to those of the other class.
For instance, an algorithm can learn to predict whether a given email is spam or ham (no spam).
Before diving into the classification concept, we will first understand the difference between the two
types of learners in classification: lazy and eager learners. Then we will clarify the misconception
between classification and regression.
There are two types of learners in machine learning classification: lazy and eager learners.
Eager learners are machine learning algorithms that first build a model from the training dataset before
making any prediction on future datasets. They spend more time during training because they learn the
model parameters (such as weights) up front in order to generalize well, but they require less time to
make predictions.
Most machine learning algorithms are eager learners, and below are some examples:
Logistic Regression.
Support Vector Machine.
Decision Trees.
Artificial Neural Networks.
Lazy learners, or instance-based learners, on the other hand, do not create any model immediately from
the training data, and this is where the lazy aspect comes from. They just memorize the training data,
and each time a prediction is needed, they search the whole training data for the nearest neighbors,
which makes them very slow during prediction. Some examples of this kind are:
K-Nearest Neighbor.
Case-based reasoning.
However, data structures such as Ball trees and KD-trees can be used to speed up the neighbor search and
reduce the prediction latency.
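As a minimal sketch of the point above, assuming scikit-learn is available and using synthetic data made up for illustration, the same K-Nearest Neighbors classifier can be asked to search with brute force, a KD-tree, or a Ball tree. The predictions stay the same; only the search strategy changes.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

predictions = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    # The algorithm argument only changes how the neighbors are searched
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    knn.fit(X, y)
    predictions[algorithm] = knn.predict([[0.0, 0.0], [4.0, 4.0]])
    print(algorithm, predictions[algorithm])
```

The tree-based searches tend to pay off on larger, low-dimensional datasets; on small data the difference is usually negligible.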
There are four main classification tasks in Machine learning: binary, multi-class, multi-label, and
imbalanced classifications.
Binary Classification
In a binary classification task, the goal is to classify the input data into two mutually exclusive categories.
The training data in such a situation is labeled in a binary format: true and false; positive and negative; 0
and 1; spam and not spam, etc., depending on the problem being tackled. For instance, we might want to
detect whether a given image is a truck or a boat.
Logistic Regression and Support Vector Machines algorithms are natively designed for binary
classifications. However, other algorithms such as K-Nearest Neighbors and Decision Trees can also be
used for binary classification.
Multi-Class Classification
The multi-class classification, on the other hand, has more than two mutually exclusive class labels, where
the goal is to predict which class a given input example belongs to. In the following case, the model
correctly classified the image to be a plane.
Most of the binary classification algorithms can be also used for multi-class classification. These
algorithms include but are not limited to:
Random Forest
Naive Bayes
K-Nearest Neighbors
Gradient Boosting
SVM
Logistic Regression.
Multi-Label Classification
In multi-label classification tasks, we try to predict 0 or more classes for each input example. In this case,
there is no mutual exclusion because the input example can have more than one label.
Such a scenario can be observed in different domains, such as auto-tagging in Natural Language
Processing, where a given text can contain multiple topics. Similarly, in computer vision, an image can
contain multiple objects, as illustrated below: the model predicted that the image contains a plane, a
boat, a truck, and a dog.
A plain binary or multi-class model outputs exactly one label per example, so it cannot be used directly for
multi-label classification. However, most algorithms used for those standard classification tasks have
specialized versions for multi-label classification, such as multi-label variants of decision trees, random
forests, and gradient boosting.
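For example, scikit-learn's decision trees accept a label matrix with one column per label, which is one way to get a multi-label model. The sketch below assumes scikit-learn and uses synthetic data with two independent labels invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Two labels that can both be present: label 0 depends on feature 0, label 1 on feature 1
Y = np.column_stack([(X[:, 0] > 0).astype(int), (X[:, 1] > 0).astype(int)])

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, Y)  # Y has shape (n_samples, n_labels)
pred = clf.predict([[2.0, 2.0], [-2.0, -2.0], [2.0, -2.0]])
print(pred)  # one row per example, one 0/1 column per label
```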
Imbalanced Classification
For the imbalanced classification, the number of examples is unevenly distributed across classes, meaning
that we can have more of one class than the others in the training data. Let’s consider the following
3-class classification scenario where the training data contains: 60% trucks, 25% planes, and 15% boats.
Conventional predictive models such as Decision Trees, Logistic Regression, etc. may not be
effective when dealing with an imbalanced dataset, because they can be biased toward predicting the
class with the highest number of observations and treat the classes with fewer observations as noise.
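One common mitigation, sketched here under the assumption that scikit-learn is available, is to weight each class inversely to its frequency. The labels below simply mirror the 60% / 25% / 15% split from the scenario above.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 100 labels in the 60 / 25 / 15 proportions of the example
y = np.array(["truck"] * 60 + ["plane"] * 25 + ["boat"] * 15)
classes = np.unique(y)  # sorted: boat, plane, truck

# "balanced" weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
weight_by_class = dict(zip(classes, weights))
print(weight_by_class)
```

Rare classes receive larger weights, so their errors count more during training. Several scikit-learn classifiers, including LogisticRegression and DecisionTreeClassifier, accept class_weight="balanced" to apply the same weighting internally.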
Classification algorithms can be further divided into two main categories:
Linear Models
Logistic Regression
Support Vector Machines
Non-linear Models
K-Nearest Neighbors
Kernel SVM
Naïve Bayes
Decision Tree Classification
Random Forest Classification
1. LOGISTIC REGRESSION
Logistic regression is kind of like linear regression, but is used when the dependent variable is not a
number but something else (e.g., a “yes/no” response). It’s called regression but performs classification
based on the regression and it classifies the dependent variable into either of the classes.
Logistic regression is used for prediction of output which is binary, as stated above. For example, if a
credit card company builds a model to decide whether or not to issue a credit card to a customer, it will
model for whether the customer is going to “default” or “not default” on their card.
Linear Regression
Firstly, a linear model is fitted to the relationship between the variables; its output is the score z.
The logistic function is then applied to z to get the probability of belonging to either class. The
threshold for classification is usually set at 0.5.
h_theta(x) = 1 / (1 + e^(-z))
Equivalently, z gives the log of the odds, i.e., the log of the ratio between the probability of the event
occurring and the probability of it not occurring. In the end, it classifies the variable based on the higher
probability of either class.
Here, z is a linear combination of features and their associated weights, plus a bias term:
z = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b
Since gradient descent is the algorithm used to train the model, the first step is to define a cost function
(or loss function).
This function should tell us how much the predictions of our model deviate from the actual outcomes.
J(theta) = -(1/m) * sum_{i=1..m} [ y^(i) * log(h_theta(x^(i))) + (1 - y^(i)) * log(1 - h_theta(x^(i))) ]
In the equation of J(theta), Y represents the actual target value (y^(i) for the i-th example), h_theta is our
model’s output, and m is the number of training examples. h_theta will be explained down below. For now,
let us assume that our model already has a way to make predictions and that h_theta is defined.
These predictions will lie between 0 and 1. So, we’ll get a probability as an output.
Part 1 : When Y = 1
When the actual target is 1, we want our model’s prediction to be as close to 1 as possible. So, our cost
function should increase the penalty as our model’s prediction goes farther away from 1 and towards 0.
Our model’s penalty should decrease as its prediction comes nearer to 1. So, our objective now is to
define a function for this purpose,
and that function is nothing but -log(h_theta(x)).
Consider the y-axis to be the cost and the x-axis to be the model’s prediction. Note: our model’s
prediction won’t exceed 1 and won’t go below 0, so we do not need to worry about values outside that range.
When the model’s prediction is closer to 1, the penalty is closer to 0. As it moves further from 1 and towards
0, the penalty increases. So, this function can be used when the actual target is 1.
Part 2 : When Y = 0
Similarly, when Y is equal to 0, we want our model’s predictions to be as close to 0 as possible, which
means a lower penalty for values closer to 0 and a higher penalty for values farther away from 0 and
towards 1.
So, the appropriate function for this is -log(1-h_theta(x)). This is the second part of the cost function.
Consider the X-axis to be the value our model predicts and the Y-axis to be the penalty that the model
gets, assuming that the actual target is 0.
The 2 parts of the cost function are prepared. To ensure that only the first part is active when y=1 and
only the second part is active when y=0, we multiply the first part by y and the second part by (1-y).
In the end, we get the cost function J(theta) mentioned in fig 2.1, highlighted in blue.
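The two-part penalty can be checked numerically. The sketch below assumes the usual binary cross-entropy form of the cost, the mean of -[y log(h) + (1 - y) log(1 - h)]; the function names and the small clipping constant eps are illustrative choices, not part of the original text.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(y, h, eps=1e-12):
    """Average of -[y*log(h) + (1-y)*log(1-h)] over all examples."""
    h = np.clip(h, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

y = np.array([1, 0])
good = cross_entropy_cost(y, np.array([0.9, 0.1]))  # confident and correct
bad = cross_entropy_cost(y, np.array([0.1, 0.9]))   # confident and wrong
print(good, bad)
```

Predictions close to the true labels give a small cost, while confident wrong predictions are penalised heavily, which is the behaviour described in Part 1 and Part 2.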
Now that we have defined a cost function, the aim is to find the optimal w and b that minimise
this cost function for our dataset. This is where gradient descent comes in. By doing this, the model
learns the parameters that reduce its penalty, thus making much more accurate predictions.
We would like to find how the cost changes with respect to w and b, so that we can change the original w
and b slowly to get the optimal parameters.
The derivation of the gradient of the logistic regression cost function is shown in the below figures.
After finding the gradients, we subtract them, scaled by a learning rate, from the original w and b. We
subtract so that the parameters move in the direction opposite to the slope, which makes sure the cost is
decreasing.
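A minimal sketch of this update loop is shown below. It assumes the standard gradients of the cross-entropy cost, dJ/dw = X^T (h - y) / m and dJ/db = mean(h - y); the learning rate, iteration count, and synthetic data are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iters):
        h = sigmoid(X @ w + b)       # current predictions
        dw = X.T @ (h - y) / m       # gradient of the cost w.r.t. w
        db = np.mean(h - y)          # gradient of the cost w.r.t. b
        w -= lr * dw                 # step against the gradient
        b -= lr * db
    return w, b

# Two synthetic clusters (illustrative data only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w, b = fit_logistic(X, y)
accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
print(w, b, accuracy)
```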
The cost function tells us how much our model deviates from the ideal model that we can create. So,
optimizing the parameters to reduce this cost function will give us a good classifier, assuming that the
points are linearly separable and some other minor factors.
The K-NN algorithm is one of the simplest classification algorithms. It predicts the class of a new sample
point from the labeled data points that lie closest to it. K-NN is a non-parametric, lazy learning
algorithm. It classifies new cases based on a similarity measure (i.e., distance functions).
Mathematical Function:
In KNN, the algorithm predicts the class or value of a data point by considering the K-nearest data points
in the training dataset. The basic idea is to find the majority class (for classification) or compute the
average (for regression) of the K-nearest neighbors.
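The majority-vote idea can be written in a few lines. This is a minimal sketch assuming Euclidean distance; the function name knn_predict and the toy points are made up for illustration.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Return the majority class among the k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

label_near_a = knn_predict(X_train, y_train, np.array([1.2, 1.4]))
label_near_b = knn_predict(X_train, y_train, np.array([6.4, 6.2]))
print(label_near_a, label_near_b)
```

For regression, the vote would be replaced by the average of the neighbors' target values.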
Optimization:
KNN doesn’t involve optimization of parameters like other machine learning algorithms (e.g., linear
regression or neural networks). Instead, it stores the entire training dataset in memory and performs
predictions based on the similarity between data points. The main computational cost in KNN is the
search for the K-nearest neighbors when making predictions. This process can be optimized using data
structures like KD-trees or Ball trees for efficient nearest neighbor search
Cost Function:
Since KNN doesn’t have model parameters to optimize and it doesn’t involve a cost function during
training, it doesn’t have a cost function in the same sense that algorithms like logistic regression or
neural networks do. KNN is a non-parametric algorithm, meaning it doesn’t make any underlying
assumptions about the data distribution.
The “cost” or performance evaluation in KNN is typically done using metrics such as accuracy, F1-score,
mean squared error (for regression), or other suitable evaluation metrics for the specific task. These
metrics are used to assess how well the KNN algorithm is performing on the given dataset during testing
or validation.
Support Vector Machines (SVMs) are a powerful class of supervised machine learning algorithms used for
both classification and regression tasks. They are based on the concept of decision planes that define
decision boundaries. A decision plane (hyperplane) is one that separates sets of objects having different
class memberships. SVMs aim to find the hyperplane that best separates the data into different classes
while maximizing the margin between the classes. Here, I’ll provide an overview of the theory,
mathematical concepts, optimization, cost function, kernel functions, parameter tuning, and accuracy in
SVMs.
Linear Separability:
SVMs are based on the concept of linear separability, which means they work best when the data can be
cleanly separated by a hyperplane in the feature space.
Hyperplane:
In a binary classification problem, a hyperplane is a decision boundary that separates the data into two
classes. The equation of a hyperplane in a feature space is given by:
w^T x + b = 0
Here, w is the weight vector, x is the feature vector, and b is the bias term.
Margin:
The margin is the distance between the hyperplane and the nearest data point from either class. SVM
aims to maximize this margin.
Support Vectors:
Support vectors are the data points that lie closest to the hyperplane and are used in determining the
margin and decision boundary.
Optimization:
The primary goal of SVM is to find the parameters w and b that define the hyperplane while maximizing
the margin. This is done by solving a constrained optimization problem. The objective is to minimize the
norm of the weight vector (||w||) while satisfying the following constraints for each data point:
Minimize (1/2) ||w||^2
Subject to y^(i) (w^T x^(i) + b) >= 1 for all i
Here, y^(i) is the class label (+1 or -1) of the i-th data point and x^(i) is its feature vector.
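Cost Function:
When the data is not perfectly separable, SVM minimizes a soft-margin cost based on the hinge loss:
J(w, b) = (1/2) ||w||^2 + C * sum_i max(0, 1 - y^(i) (w^T x^(i) + b))
Where:
max(0, 1 - y^(i) (w^T x^(i) + b)) is the hinge loss of the i-th data point.
C is the regularization parameter that weights the hinge loss against the margin term.
The hinge loss encourages the correct classification with a margin of at least 1.
The objective in SVM is to minimize this hinge loss while regularizing the norm of the weight vector w.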
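To make the hinge loss concrete, the sketch below evaluates a soft-margin objective of the assumed form (1/2)||w||^2 + C * sum of max(0, 1 - y (w^T x + b)) on three hand-picked points. The weights, bias, and points are illustrative values, not a trained model.

```python
import numpy as np

def svm_objective(w, b, X, y, C=1.0):
    """Soft-margin SVM cost: margin term plus C times the total hinge loss."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)  # zero once the margin is at least 1
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)

X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1, -1, 1])
w, b = np.array([1.0, 0.0]), 0.0

hinge_terms = np.maximum(0.0, 1.0 - y * (X @ w + b))
total = svm_objective(w, b, X, y)
print(hinge_terms, total)
```

The first two points lie outside the margin and contribute no loss. The third is classified correctly but sits inside the margin, so it is still penalised.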
Kernel Function:
In cases where the data is not linearly separable in the original feature space, SVM can still be applied by
using a kernel function. Kernel functions allow SVM to implicitly map the data to a higher-dimensional
feature space where it might become linearly separable.
Common kernel functions include the linear kernel, polynomial kernel, radial basis function (RBF) kernel,
and sigmoid kernel.
SVMs have several important hyperparameters that can impact their performance, including:
C: The regularization parameter, which controls the trade-off between maximizing the margin and
minimizing the classification error on the training data.
Parameter tuning is crucial for achieving high accuracy with SVMs. This is often done through techniques
like grid search or random search to find the best combination of hyperparameters that yield the highest
accuracy on a validation set or through cross-validation.
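A grid search over C and the kernel can be sketched with scikit-learn's GridSearchCV. The parameter grid, the 5-fold setting, and the synthetic data below are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Two synthetic clusters labelled -1 and +1 (illustrative data only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (40, 2)), rng.normal(2, 1, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)

# Every combination of C and kernel is scored with 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```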
Example: SVM for Binary Classification
Problem Statement: We want to classify data points into two classes, “Blue” and “Red,” based on two
features, “X1” and “X2.”
Let’s generate some synthetic data points for this example. We’ll create two classes, “Blue” and “Red,” in
a 2D space.
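import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
X = np.random.randn(20, 2)
# Labels are an assumption (missing from the original): +1 when X1 + X2 > 0, otherwise -1
Y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
plt.scatter(X[:, 0], X[:, 1], c=np.where(Y == 1, "blue", "red"))
plt.xlabel("X1")
plt.ylabel("X2")
plt.show()
The generated data consists of 20 points, with X1 and X2 as features. Positive class (1) is marked in blue,
and negative class (-1) is marked in red.
Next, we’ll split the data into a training set and a testing set.
from sklearn.model_selection import train_test_split
# The split parameters are assumed; the original call is missing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
Now, we’ll create and train an SVM classifier on the training data.
from sklearn import svm
clf = svm.SVC(kernel='linear')
# Train the classifier on the training data
clf.fit(X_train, Y_train)
We can now use the trained SVM to make predictions on the test data.
Y_pred = clf.predict(X_test)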
Random Forest Classification is a popular ensemble machine learning algorithm used for both
classification and regression tasks. It is an ensemble method based on decision trees and is known for its
high predictive accuracy, robustness, and ability to handle large datasets. Random Forests are
constructed by training multiple decision trees and aggregating their predictions to make more accurate
and stable predictions.
Here are key components and concepts related to Random Forest Classification:
Ensemble Method:
Random Forest is an ensemble learning method, which means it combines the predictions of multiple
base learners (decision trees in this case) to improve overall predictive accuracy and reduce overfitting.
Decision Trees:
Random Forests are built from individual decision trees. Each decision tree is trained on a random subset
of the data and features. This randomness helps reduce overfitting.
The algorithm uses a technique called bagging, where each decision tree is trained on a bootstrapped
sample (randomly selected with replacement) from the training data. This creates diversity among the
individual trees.
During the construction of each decision tree, a random subset of features is selected at each split point.
This decorrelates the trees and reduces the risk of overfitting.
Voting or Averaging:
In the case of classification, Random Forests typically use a majority voting scheme, where each tree’s
prediction is counted, and the class with the most votes becomes the final prediction. For regression, it
averages the predictions of individual trees.
Optimization:
Random Forests do not involve optimization in the traditional sense because they are an ensemble
method, and each decision tree is built independently. The optimization occurs during the training of
individual decision trees, where they seek to split the data at each node in a way that maximizes
information gain (for classification) or minimizes mean squared error (for regression).
Cost Function:
There is no global cost function for Random Forests. The cost functions are used within each decision
tree to guide the splitting process. Common cost functions for decision trees include Gini impurity and
entropy for classification tasks and mean squared error for regression tasks.
Accuracy:
Random Forests are known for their high predictive accuracy. The ensemble nature of the algorithm
reduces overfitting and helps capture complex relationships in the data.
Hyperparameter Tuning:
Random Forests have hyperparameters that can be tuned to optimize model performance. Some key
hyperparameters include the number of trees in the forest (n_estimators), the maximum depth of each
tree (max_depth), the minimum number of samples required to split an internal node
(min_samples_split), and the maximum number of features to consider at each split (max_features).
Random Forests employ a technique called out-of-bag (OOB) error estimation. During the training of
each tree, a portion of the data is not used (out-of-bag samples). These OOB samples can be used to
estimate the model’s performance without the need for a separate validation set. OOB error can be used
as a criterion for hyperparameter tuning.
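In scikit-learn, the out-of-bag estimate is available by passing oob_score=True. The sketch below assumes scikit-learn and uses the Iris dataset; the number of trees and the random seed are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True scores each sample using only the trees that did not see it
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
oob_accuracy = rf.oob_score_
print(f"OOB accuracy: {oob_accuracy:.3f}")
```

No separate validation split is needed here, because every tree leaves out part of the data through bootstrapping.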
Mathematical Formulation:
While the Random Forest algorithm doesn’t involve a single global mathematical optimization function,
it does involve mathematical calculations at each step of the decision tree construction, including feature
selection and splitting, as described earlier.
The power of Random Forest comes from the aggregation of multiple decision trees, which reduces
overfitting and increases predictive accuracy. The ensemble approach combines the strength of multiple
models while mitigating their weaknesses, resulting in a more robust and optimized classifier or
regressor.
Example:
Let’s consider a simple example using the famous Iris dataset, which is a classification problem aiming to
predict the species of iris flowers based on their features (sepal length, sepal width, petal length, and
petal width). We will use scikit-learn, a popular Python library, to demonstrate Random Forest
Classification:
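from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
# Split and model settings are assumed; the original lines are missing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")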
Decision Tree Classification is a supervised machine learning algorithm used for both classification and
regression tasks. It is a non-linear model that makes decisions by recursively splitting the data into
subsets based on the values of input features. Decision trees are hierarchical structures composed of
nodes and branches, where each internal node represents a feature and a decision rule, and each leaf
node represents a class label (for classification) or a predicted value (for regression).
Here are key components and concepts related to Decision Tree Classification:
Ensemble methods involve combining the predictions of multiple decision trees to improve overall
predictive accuracy and robustness. Two common ensemble methods based on decision trees are
Random Forest and Gradient Boosting.
Optimization:
Decision trees are optimized during their construction to find the best feature and split point that result
in the most informative decision at each node. The optimization goal is typically to minimize impurity or
entropy for classification tasks or minimize mean squared error for regression tasks.
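Cost Function:
For classification, the cost function is typically Gini impurity or entropy, which measure how mixed the
classes are within a node. For regression, the cost function is typically Mean Squared Error (MSE), which
measures the variance of target values.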
Accuracy:
Decision trees aim to maximize accuracy by partitioning the data into subsets that are as homogeneous
as possible with respect to the target variable. The accuracy of a decision tree is evaluated on a
validation or test dataset and is a measure of how well the model generalizes to new, unseen data.
Hyperparameter Tuning:
Decision trees have hyperparameters that can be tuned to optimize model performance and prevent
overfitting. Common hyperparameters include:
Min Samples Split: The minimum number of samples required to split a node.
Min Samples Leaf: The minimum number of samples required to be at a leaf node.
Max Features: The maximum number of features to consider when finding the best split.
Criterion: The cost function used for splitting nodes (e.g., Gini impurity, entropy, or MSE).
Decision trees are constructed by recursively splitting the data into subsets based on feature values in a
way that optimizes a cost function. The optimization process involves mathematical calculations at each
node of the tree. Here’s an overview of the mathematical calculations and optimization steps involved in
building a decision tree:
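For classification tasks, the most common splitting criteria to optimize are Gini impurity and entropy. For
regression tasks, mean squared error (MSE) is commonly used.
a. Gini Impurity (Classification):
The Gini impurity measures the probability of misclassifying a randomly chosen element. It is calculated
for a node t as:
Gini(t) = 1 - sum_i p(i|t)^2
Where:
p(i|t) is the proportion of samples in node t that belong to class i, and the sum runs over all classes.
b. Entropy (Classification):
Entropy measures the level of impurity or disorder in the data. It is calculated for a node t as:
Entropy(t) = - sum_i p(i|t) * log2(p(i|t))
Where:
p(i|t) is the proportion of samples in node t that belong to class i, and the sum runs over all classes.
c. Mean Squared Error (Regression):
For regression tasks, the mean squared error (MSE) is used as the cost function. It measures the variance
of target values within a node. The goal is to minimize MSE when splitting a node.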
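The two classification criteria are easy to compute by hand. The sketch below assumes the standard definitions, Gini = 1 - sum of p^2 and entropy = -sum of p * log2(p), where p is the class proportion in the node. The helper names are illustrative.

```python
import numpy as np

def class_proportions(labels):
    """Fraction of samples in the node that belong to each class."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    p = class_proportions(labels)
    return -np.sum(p * np.log2(p))

pure = ["a", "a", "a", "a"]    # one class only
mixed = ["a", "a", "b", "b"]   # evenly split between two classes
print(gini(pure), entropy(pure))
print(gini(mixed), entropy(mixed))
```

A pure node scores 0 on both measures, while an evenly split two-class node reaches the two-class maximum of 0.5 (Gini) and 1 bit (entropy).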
2. Optimization:
The optimization process involves finding the best feature and split point that minimizes the chosen cost
function (Gini impurity, entropy, or MSE). At each node, the algorithm considers each feature and
evaluates the cost of splitting the data based on different split points. The split that results in the lowest
impurity (for classification) or lowest MSE (for regression) is chosen.
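Mathematically, for each candidate feature F and split point S, the cost function is calculated, and the
feature-split pair with the lowest cost is selected as the optimal split:
(F*, S*) = argmin_{F, S} Cost(F, S)
where Cost(F, S) is the impurity (or MSE) of the two child nodes, weighted by the number of samples in each.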
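The search over feature and split-point pairs can be sketched as an exhaustive loop. This is a simplified illustration that assumes Gini impurity as the cost and uses midpoints between sorted feature values as candidate thresholds. Real libraries use faster implementations.

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Try every feature and midpoint threshold; keep the lowest weighted Gini."""
    best = (None, None, float("inf"))  # (feature index, threshold, cost)
    n = len(y)
    for f in range(X.shape[1]):
        values = np.unique(X[:, f])    # sorted unique values of this feature
        for s in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, f] <= s], y[X[:, f] > s]
            cost = (len(left) * gini(left) + len(right) * gini(right)) / n
            if cost < best[2]:
                best = (f, s, cost)
    return best

X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0],
              [10.0, 2.0], [11.0, 6.0], [12.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])
feature, threshold, cost = best_split(X, y)
print(feature, threshold, cost)
```

On this toy data the first feature separates the classes perfectly at 6.5, so the selected split has zero impurity.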
3. Recursive Splitting:
Once the optimal split is found, the data is divided into child nodes, and the optimization process is
applied recursively to each child node until a stopping criterion is met. Common stopping criteria include
reaching a maximum depth, having too few samples in a node, or achieving perfect purity (all samples
belong to the same class for classification).
4. Pruning (Optional):
After constructing the decision tree, pruning can be applied to reduce the complexity of the tree and
prevent overfitting. Pruning involves removing branches of the tree that do not significantly improve
predictive accuracy.
5. Prediction:
To make a prediction for a new data point, it traverses the decision tree from the root node down to a
leaf node, following the splits based on the feature values. The leaf node’s class label (for classification)
or predicted value (for regression) is used as the final prediction.
Example:
Let’s consider a simple example of a Decision Tree Classification problem using Python and the Iris
dataset:
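from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X, y = iris.data, iris.target
# Split and model settings are assumed; the original lines are missing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
y_pred = dt_classifier.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")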
Naïve Bayes
The naive Bayes classifier is based on Bayes’ theorem with independence assumptions between
predictors (i.e., it assumes the presence of a feature in a class is unrelated to any other feature). Even if
these features depend on each other, or upon the existence of the other features, all of these properties
are treated as contributing independently to the probability. Thus, the name naive Bayes.
Based on naive Bayes, Gaussian naive Bayes is used for classification when the features are assumed to
follow a Gaussian (normal) distribution.
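P(yellow) = 10/17
P(green) = 7/17
P(?) = 4/17
3. Calculate Likelihood
P(data/class) = Number of similar observations to the class/Total no. of points in the class.
P(?/yellow) = 1/10
P(?/green) = 3/7
Applying Bayes’ theorem, P(class/data) = P(data/class) * P(class) / P(data), gives the posterior probabilities:
P(yellow/?) = (1/10 * 10/17) / (4/17) = 1/4
P(green/?) = (3/7 * 7/17) / (4/17) = 3/4
The point is assigned to the class with the higher posterior probability; here, with 75% probability, the
point belongs to class green.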
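The arithmetic of this example can be reproduced with exact fractions. The sketch assumes likelihoods of 1/10 for yellow and 3/7 for green, which are the values consistent with the class sizes (10 yellow, 7 green) and with the 75% result.

```python
from fractions import Fraction

prior = {"yellow": Fraction(10, 17), "green": Fraction(7, 17)}
# Assumed likelihoods: similar observations / points in the class
likelihood = {"yellow": Fraction(1, 10), "green": Fraction(3, 7)}
evidence = Fraction(4, 17)  # P(?)

# Bayes' theorem: posterior = likelihood * prior / evidence
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
predicted = max(posterior, key=posterior.get)
print(posterior, predicted)
```

The two posteriors sum to 1, and green is selected with probability 3/4.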
Multinomial, Bernoulli naive Bayes are the other models used in calculating probabilities. Thus, a naive
Bayes model is easy to build, with no complicated iterative parameter estimation, which makes it
particularly useful for very large datasets.
Conditional Independence Assumption: Naive Bayes assumes that features are conditionally
independent given the class label. This simplifying assumption greatly reduces the computational
complexity of the algorithm.
Theory:
The main idea behind Naive Bayes is to compute the posterior probability of each class for a given set of
features and select the class with the highest probability as the predicted class.
Optimization:
There is no iterative optimization process in Naive Bayes, as there are no weights to be learned by
gradient-based training. Instead, Naive Bayes estimates its parameters (the class priors and the
per-class feature likelihoods) directly from the training data. The model is relatively simple and
computationally efficient.
Cost Function:
Naive Bayes doesn’t have a cost function in the traditional sense, as it doesn’t involve iterative parameter
optimization. The decision boundary is determined by the probabilistic calculations, and the class with
the highest posterior probability is selected as the prediction.
Accuracy:
Accuracy is a common metric used to evaluate the performance of Naive Bayes. It measures the
proportion of correctly classified instances out of all instances in the dataset. However, the choice of
evaluation metrics may vary depending on the specific problem and class imbalance.
Example:
Here’s a simplified example of text classification using the Naive Bayes algorithm, specifically for spam
email detection. In this example, we’ll use the scikit-learn library in Python:
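from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
emails = ["Get a free iPhone now!", "Meeting at 3 PM today.", "Discounts on shoes.", "Important information inside."]
# Labels are assumed (1 = spam, 0 = not spam); the original omits them
labels = [1, 0, 1, 0]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
# Split parameters are assumed; the original call is missing
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
y_pred = nb_classifier.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")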
Gradient boosting classifier is a boosting ensemble method. Boosting is a way to combine (ensemble)
weak learners, primarily to reduce prediction bias. Instead of creating a pool of predictors, as in bagging,
boosting produces a cascade of them, where each output is the input for the following learner. Typically,
in a bagging algorithm, trees are grown in parallel to get the average prediction across all trees, where
each tree is built on a sample of the original data. Gradient boosting, on the other hand, takes a sequential
approach to obtaining predictions instead of parallelizing the tree building process. In gradient boosting,
each decision tree predicts the error of the previous decision tree, thereby reducing (boosting) the
error step by step.
3. Build another shallow decision tree that predicts the residuals based on all the independent variables
4. Update the original predictions with the new prediction multiplied by the learning rate
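Let’s see how the maths works out for the Gradient Boosting algorithm. Say we have mean squared error
(MSE) as the loss, defined as:
Loss = MSE = (1/n) * sum_{i=1..n} (y_i - y_i^p)^2
where y_i is the i-th target value and y_i^p is the i-th prediction.
We want predictions such that our loss function (MSE) is minimum. By using gradient descent and
updating our predictions based on a learning rate, we can find the values where MSE is minimum:
y_i^p = y_i^p - alpha * d(Loss)/d(y_i^p) = y_i^p + alpha * (2/n) * (y_i - y_i^p)
where alpha is the learning rate and (y_i - y_i^p) is the residual.
So, we are basically updating the predictions such that the sum of our residuals is close to 0 (or
minimum) and predicted values are sufficiently close to actual values.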
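The residual-fitting loop can be sketched with shallow regression trees. This is a simplified illustration for the MSE loss, assuming scikit-learn's DecisionTreeRegressor as the weak learner; the data, tree depth, learning rate, and number of rounds are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy sine curve (illustrative data only)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

learning_rate, n_rounds = 0.1, 100
prediction = np.full_like(y, y.mean())  # start from the mean of the target
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    residuals = y - prediction                           # errors of the current model
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)        # shrunken correction
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(initial_mse, final_mse)
```

Each round fits a small tree to what the current model still gets wrong, so the training error shrinks gradually. In practice the number of rounds is chosen on validation data to avoid overfitting.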
The logic behind gradient boosting is simple (it can be understood intuitively, without using mathematical
notation). I expect that whoever is reading this post might be familiar with simple linear regression
modeling.
When a linear regression includes an intercept, the sum of its residuals is 0, and the residuals are
assumed to be spread randomly around zero.
Now think of these residuals as mistakes committed by our predictor model. Although, tree-based
models (considering decision tree as base models for our gradient boosting here) are not based on such
assumptions, but if we think logically (not statistically) about this assumption, we might argue that, if we
are able to see some pattern of residuals around 0, we can leverage that pattern to fit a model.
So, the intuition behind the gradient boosting algorithm is to repeatedly leverage the patterns in the
residuals to strengthen a model with weak predictions and make it better. Once we reach a stage where the
residuals do not have any pattern that could be modeled, we can stop modeling residuals (otherwise it
might lead to overfitting). Algorithmically, we are minimizing our loss function, such that the test loss
reaches its minimum.
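The residual-fitting loop described above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: it assumes a single numeric feature, one-split "stumps" as the weak learners, squared-error loss, and made-up toy data. The names fit_stump, predict_stump, and gradient_boost are hypothetical helpers introduced only for this sketch.

```python
# Minimal gradient boosting sketch for squared-error loss (illustrative only).
# Assumptions: one numeric feature, one-split stumps as weak learners, toy data.

def fit_stump(xs, residuals):
    """Find the threshold split that best fits the residuals (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_value, right_value)
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    preds = [base] * len(ys)
    stumps, history = [], [mse(ys, preds)]
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # current mistakes
        stump = fit_stump(xs, residuals)                # model the mistakes
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]            # shrunken update
        stumps.append(stump)
        history.append(mse(ys, preds))
    return base, stumps, history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
base, stumps, history = gradient_boost(xs, ys)
print(f"Training MSE before boosting: {history[0]:.4f}")
print(f"Training MSE after boosting:  {history[-1]:.4f}")
```

The history list records the training loss after every round. In practice the stopping point would be chosen by watching the loss on held-out data, since the training loss keeps falling even after the model starts to overfit.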
In summary,
• We first model data with simple models and analyze data for errors.
• These errors signify data points that are difficult to fit by a simple model.
• Then for later models, we particularly focus on those hard-to-fit data to get them right.
• In the end, we combine all the predictors by giving some weights to each predictor.
Metrics to Measure Classification Model Performance
1. CONFUSION MATRIX
A confusion matrix is a table that is often used to describe the performance of a classification model on a
set of test data for which the true values are known. For a binary classifier, the table holds the four
possible combinations of predicted and actual values.
A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true
negative is an outcome where the model correctly predicts the negative class.
The terms false positive and false negative describe the two ways in which a classifier can be wrong.
A false positive is an outcome where the model incorrectly predicts
the positive class. And a false negative is an outcome where the model incorrectly predicts
the negative class. The more predictions that fall on the main diagonal (true positives and true negatives),
the better the model, whereas the other diagonal holds the misclassifications.
False Positive
False positive (type I error): when you reject a true null hypothesis.
This is an example in which the model mistakenly predicted the positive class. For example, the model
inferred that a particular email message was spam (the positive class), but that email message was
actually not spam. It acts as a warning sign: the mistake should be rectified, but in many applications it is
a less serious concern than a false negative.
False Negative
False negative (type II error): when you fail to reject a false null hypothesis.
This is an example in which the model mistakenly predicted the negative class. For example, the model
inferred that a particular email message was not spam (the negative class), but that email message
actually was spam. It acts as a danger sign: the mistake should be rectified early, as in many applications
it is more serious than a false positive.
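The four outcomes described above can be counted directly from a list of actual labels and a list of predicted labels. The sketch below is illustrative only: the helper name confusion_counts and the toy labels (1 = spam, 0 = not spam) are assumptions made for this example.

```python
# Count the four confusion-matrix cells for a binary classifier.
# Toy data: 1 = spam (positive class), 0 = not spam (negative class).

def confusion_counts(y_true, y_pred, positive=1):
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted spam, really spam
            else:
                fp += 1   # predicted spam, really not spam (type I error)
        else:
            if actual == positive:
                fn += 1   # predicted not spam, really spam (type II error)
            else:
                tn += 1   # predicted not spam, really not spam
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'FP': 1, 'TN': 5, 'FN': 1}
```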
From the confusion matrix, we can infer accuracy, precision, recall and F-1 score.
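Accuracy
Accuracy is the fraction of all predictions that the model got right:
Accuracy = (TP + TN) / (TP + TN + FP + FN)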
Accuracy alone doesn’t tell the full story when working with a class-imbalanced data set, where there is
a significant disparity between the number of positive and negative labels. Precision and recall are better
metrics for evaluating class-imbalanced problems.
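A small made-up example shows why accuracy can mislead on imbalanced data. It assumes a hypothetical data set with 95 negatives and 5 positives and a trivial "model" that always predicts the negative class.

```python
# Accuracy on an imbalanced toy data set (95 negatives, 5 positives).
# The "model" here is a hypothetical baseline that always predicts negative.

y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Fraction of the actual positives that the model managed to find.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
positives_found = true_positives / sum(y_true)

print(f"Accuracy: {accuracy:.2f}")                 # Accuracy: 0.95
print(f"Positives found: {positives_found:.2f}")   # Positives found: 0.00
```

The baseline scores 95 percent accuracy while failing to identify a single positive example.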
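Precision
Out of all the instances predicted as positive, precision is how many are actually positive:
Precision = TP / (TP + FP)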
Recall
Out of all the actual positive instances, recall is how many we predicted correctly. It is also called
sensitivity or true positive rate (TPR):
Recall = TP / (TP + FN)
F-1 Score
It is often convenient to combine precision and recall into a single metric called the F-1 score, particularly
if you need a simple way to compare two classifiers. The F-1 score is the harmonic mean of precision and
recall:
F-1 = 2 × (Precision × Recall) / (Precision + Recall)
The regular mean treats all values equally, while the harmonic mean gives much more weight to low
values thereby punishing the extreme values more. As a result, the classifier will only get a high F-1 score
if both recall and precision are high.
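The difference between the regular mean and the harmonic mean is easy to see with numbers. The sketch below uses made-up counts for a classifier with perfect precision but very low recall; the function name precision_recall_f1 is a hypothetical helper.

```python
# Precision, recall and F-1 from confusion-matrix counts (toy numbers).

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A classifier that is never wrong when it says "positive",
# but finds only 10 of the 100 actual positives.
precision, recall, f1 = precision_recall_f1(tp=10, fp=0, fn=90)
arithmetic_mean = (precision + recall) / 2

print(f"Precision: {precision:.2f}")              # Precision: 1.00
print(f"Recall: {recall:.2f}")                    # Recall: 0.10
print(f"Arithmetic mean: {arithmetic_mean:.2f}")  # Arithmetic mean: 0.55
print(f"F-1 score: {f1:.2f}")                     # F-1 score: 0.18
```

The arithmetic mean of 0.55 hides the poor recall, while the F-1 score of about 0.18 stays close to the weaker of the two values.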
2. RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE AND AREA UNDER THE CURVE (AUC)
The ROC curve is an important classification evaluation tool. It shows how well the model separates the
two classes by plotting the true positive rate against the false positive rate as the decision threshold
varies. If the classifier is outstanding, the true positive rate rises quickly while the false positive rate
stays low, and the area under the curve will be close to one. If the classifier is similar to random guessing,
the true positive rate will increase linearly with the false positive rate, giving an AUC of about 0.5. The
higher the AUC, the better the model.
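As a rough illustration, the ROC points and the AUC can be computed by hand from predicted scores. This is a sketch under stated assumptions: the labels and scores are made up, and roc_points and area_under_curve are hypothetical helper names. The curve is traced by lowering the threshold one score at a time, and the area is computed with the trapezoidal rule.

```python
# ROC curve points and AUC from predicted scores (toy data, illustrative only).

def roc_points(y_true, scores):
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all examples tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    # Trapezoidal rule over consecutive (x, y) points.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(f"AUC: {auc:.3f}")  # AUC: 0.889
```

The value 8/9 matches the interpretation of AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score: 8 of the 9 pairs here are ranked correctly.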
3. CUMULATIVE ACCURACY PROFILE CURVE
The CAP of a model represents the cumulative number of positive outcomes along the y-axis versus the
corresponding cumulative number of a classifying parameter along the x-axis. The CAP is distinct from
the receiver operating characteristic (ROC), which plots the true-positive rate against the false-positive
rate. The CAP curve is used less often than the ROC curve.
Consider a model that predicts whether a customer will purchase a product. If a customer is selected at
random, there is a 50 percent chance they will buy the product. The cumulative number of customers
who buy would rise linearly toward a maximum value corresponding to the total
number of buyers. This distribution is called the “random” CAP. It is the blue line in the above
diagram. A perfect prediction, on the other hand, determines exactly which customers will buy the
product, such that the maximum number of customers buying the product is reached with the minimum
number of customers selected. This produces a steep line on the CAP curve that stays flat
once the maximum is reached, which is the “perfect” CAP. It’s also called the “ideal” line and is the grey
line in the figure above.
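The CAP curve can be traced by ranking customers from the highest to the lowest predicted score and accumulating the buyers found so far. The sketch below is illustrative only: the labels and scores are made up, and cap_curve is a hypothetical helper name.

```python
# Cumulative accuracy profile from predicted scores (toy data, illustrative only).

def cap_curve(y_true, scores):
    # Rank customers from highest to lowest score, keeping only their labels.
    ranked = [label for _, label in
              sorted(zip(scores, y_true), key=lambda pair: -pair[0])]
    total_positives = sum(y_true)
    curve, found = [(0.0, 0.0)], 0
    for i, label in enumerate(ranked, start=1):
        found += label
        # (fraction of customers contacted, fraction of buyers captured)
        curve.append((i / len(ranked), found / total_positives))
    return curve


y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]   # 1 = customer bought the product
scores = [0.95, 0.85, 0.9, 0.5, 0.6, 0.2, 0.8, 0.3, 0.1, 0.05]

model_curve = cap_curve(y_true, scores)
perfect_curve = cap_curve(y_true, y_true)  # scoring by the true label ranks all buyers first

print(model_curve[4])    # (0.4, 0.75)
print(perfect_curve[4])  # (0.4, 1.0)
```

After contacting the top 40 percent of customers, this model has captured 75 percent of the buyers. A random ordering would be expected to capture about 40 percent, and the perfect ordering captures all of them.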
The confusion matrix for a multi-class classification problem can help you determine mistake patterns.
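A multi-class confusion matrix can be built the same way as the binary one, with one row per actual class and one column per predicted class. The sketch below uses made-up labels, and the helper name confusion_matrix is an assumption for this example (it is a plain-Python stand-in, not the scikit-learn function of the same name).

```python
# Multi-class confusion matrix in plain Python (toy labels, illustrative only).
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    counts = Counter(zip(y_true, y_pred))
    # Rows are actual classes, columns are predicted classes.
    return [[counts[(actual, predicted)] for predicted in labels]
            for actual in labels]


labels = ["cat", "dog", "bird"]
y_true = ["cat", "cat", "cat", "dog", "dog", "dog", "bird", "bird", "bird"]
y_pred = ["cat", "dog", "cat", "dog", "dog", "dog", "bird", "cat", "bird"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>5}: {row}")
```

Off-diagonal cells reveal the mistake patterns: in this toy data one cat is mistaken for a dog and one bird for a cat, while dogs are always classified correctly.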
Classification algorithms are widely used in various fields and applications where the goal is to categorize
or classify data into predefined classes or categories. Here are some common use cases for classification
algorithms:
Email Spam Detection: Classify emails as “spam” or “not spam” to filter out unwanted emails
from users’ inboxes.
Sentiment Analysis: Analyze text data from social media, reviews, or customer feedback to
classify sentiment as positive, negative, or neutral.
Medical Diagnosis: Diagnose diseases or medical conditions based on patient data, medical tests,
and symptoms.
Credit Risk Assessment: Determine the creditworthiness of loan applicants by classifying them as
low, medium, or high-risk borrowers.
Image Classification: Categorize images into predefined classes, such as recognizing objects in
photos or detecting anomalies in medical images.
Natural Language Processing (NLP): Categorize documents or text data into topics or
genres for content recommendation or organizing information.
Customer Churn Prediction: Predict whether customers are likely to churn (leave) a service or
product, such as a subscription service or a mobile app.
Fraud Detection: Identify fraudulent transactions, activities, or behaviors in financial systems,
insurance claims, or online platforms.
Speech Recognition: Classify spoken words or phrases into text, enabling voice assistants and
transcription services.
Anomaly Detection: Detect anomalies or outliers in data, such as network intrusion detection or
manufacturing quality control.
Document Classification: Automatically classify documents into categories, such as news articles,
legal documents, or research papers.
Recommendation Systems: Recommend products, movies, music, or content to users based on
their preferences, behaviors, or historical data.
Species Identification: Identify species of plants or animals based on observations, images, or
genetic data.
Quality Control: Inspect and classify manufactured products as defective or non-defective based
on quality control data.
Credit Card Transaction Fraud Detection: Detect fraudulent credit card transactions by classifying
them as legitimate or suspicious.
Intrusion Detection in Cybersecurity: Monitor network traffic and classify it as normal or
potentially malicious, identifying cyber threats and attacks.
Employee Attrition Prediction: Predict whether employees are likely to leave a company based
on historical HR data.
Customer Segmentation: Segment customers into groups based on their behavior,
demographics, or purchasing habits for targeted marketing campaigns.
Handwriting Recognition: Recognize handwritten text or characters and classify them into
alphanumeric characters.
Fault Detection in Manufacturing: Detect and classify faults or defects in manufacturing
processes or products, improving product quality.
These are just a few examples of the many applications of classification algorithms across various
domains. Classification plays a fundamental role in machine learning and data analysis, enabling
automated decision-making and pattern recognition in diverse fields.