Assignment 2



Explain the following classification algorithms in machine learning: Logistic Regression, k-Nearest
Neighbors, Decision Trees, Support Vector Machine, Naive Bayes, Gradient Boosting.

Each of these classification algorithms has its strengths and weaknesses, and their performance can vary
depending on the nature of the data and the specific problem at hand. Choosing the right algorithm
often involves experimentation and consideration of the characteristics of the dataset.

Logistic Regression

Despite its name, this is a classification model rather than a type of linear regression: it predicts the
probability of an input belonging to a certain class. It applies a logistic function (also called a sigmoid
function) to a linear combination of the input features, producing a value between 0 and 1 that
represents the probability of the positive class. The predicted class is then determined by comparing
the probability with a threshold value, usually 0.5. Logistic regression is often used for binary
classification problems, such as spam detection, credit default prediction, etc.

Logistic regression is a supervised learning algorithm used for classification tasks. It works by fitting a
sigmoid function to the training data, which allows it to predict the probability of a given data point
belonging to a particular class. Logistic regression is a relatively simple algorithm to understand and
implement, but it can be very effective for a wide range of classification problems.

In simple terms

• Logistic Regression is a simple and widely used classification algorithm for binary and multiclass
classification problems.
• It models the probability that a given input belongs to a particular class using the logistic
function, which outputs values between 0 and 1.
• It's a linear model that finds the best-fitting hyperplane to separate the classes by optimizing a
cost function like cross-entropy.
• Logistic Regression is interpretable and efficient but may not perform well with complex data
distributions.

k-Nearest Neighbors (KNN)

This is a type of lazy learning algorithm that does not build a model from the training data, but instead
stores the data and uses a similarity measure (such as Euclidean distance) to find the k most similar
instances to a new input. The predicted class is then determined by the majority vote of the k nearest
neighbors, or by a weighted vote based on the distance. k-Nearest Neighbors is often used for multi-class
classification problems, such as image recognition, text categorization, etc.

KNN is another supervised learning algorithm for classification. It works by finding the k most similar
data points in the training set to a new data point, and then predicting the class of the new data point
based on the classes of the k nearest neighbors. KNN is a very simple algorithm to implement, but it can
be very effective for classification problems, especially when the training data is well-labeled.

In simple terms

• k-NN is a non-parametric and instance-based classification algorithm.
• It classifies data points based on the majority class among their k-nearest neighbors in the
feature space.
• The choice of the value 'k' influences the algorithm's performance and can be selected using
techniques like cross-validation.
• k-NN is simple and can handle non-linear decision boundaries but can be sensitive to the choice
of distance metric and the curse of dimensionality.

Decision Trees

Decision trees are supervised learning algorithms for classification and regression tasks. They work by
building a tree-like structure that represents the relationships between the features of the data and the
target variable. Decision trees are easy to interpret and can be very effective for classification problems,
especially when the data is noisy or complex.

This is a type of eager learning algorithm that builds a tree-like structure from the training data, where
each node represents a feature, each branch represents a decision rule, and each leaf represents a class
label. The predicted class is then determined by following the path from the root node to the leaf node
that matches the input features. Decision trees are often used for both binary and multi-class
classification problems, such as medical diagnosis, customer segmentation, etc.

In simple terms

• Decision Trees are tree-like structures where each node represents a decision or a test on a
feature, and each branch represents the outcome of that test.
• They recursively split the data based on features to classify the data into different classes.
• Decision Trees are interpretable and can handle both categorical and numerical data but may
suffer from overfitting if not pruned properly.
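
To make the idea of "splitting on a feature" concrete, the sketch below scores candidate splits with Gini impurity. Gini is only one common criterion and is an assumption here, since the text above does not name one; the `rows` and `labels` data and both function names are hypothetical.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(rows, labels, feature, threshold):
    """Weighted Gini impurity after splitting on rows[i][feature] <= threshold."""
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    n = len(labels)
    return sum(len(part) / n * gini(part) for part in (left, right) if part)

# Hypothetical data: one feature (age) and a yes/no label.
rows = [[22], [25], [47], [52]]
labels = ["no", "no", "yes", "yes"]

print(gini(labels))                          # impurity before any split
print(split_impurity(rows, labels, 0, 30))   # a split that separates the classes
print(split_impurity(rows, labels, 0, 23))   # a worse split
```

A tree-building procedure would try many such thresholds, keep the one with the lowest impurity, and repeat on each side of the split.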

Support Vector Machines (SVMs)

This is a type of linear classifier that finds the optimal hyperplane that separates the data into two
classes with the maximum margin. The hyperplane is defined by a subset of data points called support
vectors, which are the closest to the boundary. The predicted class is then determined by the side of the
hyperplane that the input falls on. Support vector machines can also handle non-linear classification
problems by using kernel functions that transform the data into higher-dimensional spaces. Support
vector machines are often used for complex classification problems, such as face detection, handwriting
recognition, etc.

SVMs are supervised learning algorithms for classification and regression tasks. They work by finding a
hyperplane that separates the data into two classes with the largest possible margin. SVMs are very
effective for classification problems, especially when the data is high-dimensional and sparse.

In simple terms

• SVM is a powerful classification algorithm that aims to find a hyperplane that best separates
classes while maximizing the margin between them.
• It can handle both linear and non-linear classification by using different kernel functions, such as
the linear, polynomial, or radial basis function (RBF) kernels.
• SVMs are effective in high-dimensional spaces and are less prone to overfitting due to the
margin concept.
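
The effect of mapping data into a higher-dimensional space can be illustrated with a small sketch. This is not a trained SVM: the feature map `(x, x^2)`, the weights `w`, and the offset `b` are chosen by hand for hypothetical data, and a real kernel computes such a mapping implicitly.

```python
# Hypothetical 1-D points: the positive class lies on both sides of the
# negative class, so no single threshold on x separates them.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]

def feature_map(x):
    """Explicit map from 1-D to 2-D: x -> (x, x^2)."""
    return (x, x * x)

def linear_classifier(point, w=(0.0, 1.0), b=-2.5):
    """Sign of w . point + b; the hand-picked hyperplane here is x^2 = 2.5."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return 1 if score >= 0 else -1

preds = [linear_classifier(feature_map(x)) for x in xs]
print(preds)
print(preds == ys)
```

In the mapped space a straight line separates the classes, which is the situation an SVM with a non-linear kernel exploits.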

Naive Bayes

This is a type of probabilistic classifier that applies Bayes’ theorem to calculate the posterior probability
of each class given the input features. It assumes that the features are conditionally independent given
the class, which simplifies the computation and reduces the data requirements. The predicted class is
then determined by the class with the highest posterior probability. Naive Bayes is often used for text
classification problems, such as sentiment analysis, spam filtering, etc.

Naive Bayes is a supervised learning algorithm for classification tasks. It works by assuming that the
features of the data are conditionally independent of each other given the class, and then uses Bayes'
theorem to predict the class of a new data point. Naive Bayes is a very simple algorithm to implement,
and it can be very effective for classification problems, especially when the features are close to
independent.

In simple terms

• Naive Bayes is a probabilistic classifier based on Bayes' theorem with the "naive" assumption
that features are conditionally independent.
• It's particularly useful for text classification and spam detection.
• Despite its simplicity and the independence assumption, Naive Bayes often performs surprisingly
well in practice.
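
A minimal sketch of the posterior calculation for text, assuming a bag-of-words model with add-one (Laplace) smoothing. The training sentences, the `fit` and `predict` helpers, and the "spam"/"ham" labels are hypothetical and only illustrate the idea.

```python
import math
from collections import Counter

# Toy training data (hypothetical, for illustration only).
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

def fit(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.split():
            counts[word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(text, class_counts, word_counts, vocab):
    """Return the class with the highest log-posterior (up to a constant)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)          # log prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue                               # skip unseen words
            # Add-one smoothing avoids zero probabilities.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)   # independence assumption: add log-likelihoods
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
print(predict("free money", *model))
print(predict("meeting tomorrow", *model))
```

Working in log space turns the product of per-word probabilities into a sum, which avoids numerical underflow on longer texts.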

Gradient Boosting

This is a type of ensemble learning algorithm that combines multiple weak learners (usually decision
trees) into a strong learner by iteratively adding new learners that correct the errors of the previous
ones. Each new learner is fitted to the gradient of a loss function that measures the difference
between the actual and predicted classes, so adding it acts as a step of gradient descent on that loss.
The final prediction is determined by the sum of the scaled outputs of all the learners. Gradient
boosting is often used for high-performance classification problems, such as fraud detection, ranking
systems, etc.

In simple terms

• Gradient Boosting is an ensemble method that combines multiple weak learners (typically
decision trees) to create a strong learner.
• It builds trees sequentially, with each tree correcting the errors made by the previous ones.
• Gradient boosting libraries such as XGBoost are highly effective and often win machine learning
competitions; AdaBoost is a closely related, earlier boosting algorithm.
• They can handle complex relationships in the data but may require careful hyperparameter
tuning.
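
The sequential error-correcting idea can be sketched with one-split "stumps" fitted to residuals. This sketch assumes squared-error loss, for which the residual is the negative gradient; real classifiers usually use a log loss instead. The data, `fit_stump`, `boost`, and the chosen `learning_rate` are hypothetical.

```python
# Toy 1-D data (hypothetical): the target is 0 for small x and 1 for large x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

def fit_stump(xs, residuals):
    """Find the threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in xs:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)            # initial constant prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

model = boost(xs, ys)
print([round(model(x), 3) for x in xs])
```

Thresholding the output at 0.5 turns this into a class prediction; the learning rate controls how much each new stump is allowed to correct.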

Example Use Cases

Here are some example use cases for each of the classification algorithms discussed above:

Logistic Regression:

• Predicting whether a customer is likely to churn
• Predicting whether a loan applicant is likely to default
• Predicting whether a medical patient is likely to have a particular disease

k-Nearest Neighbors:

• Classifying images of handwritten digits
• Recommending products to customers based on their past purchases
• Diagnosing diseases based on a patient's symptoms

Decision Trees:

• Predicting whether a website visitor is likely to make a purchase
• Predicting whether a stock price is likely to go up or down
• Fraud detection

Support Vector Machines:

• Classifying images of objects
• Text classification
• Image retrieval

Naive Bayes:

• Spam filtering
• Sentiment analysis
• Document classification

Gradient Boosting:

• Click-through rate prediction
• Ad recommendation
• Risk assessment

Which Algorithm Should You Use?

The best classification algorithm to use for a particular problem will depend on a number of factors,
including the nature of the data, the complexity of the problem, and the desired performance metrics.
However, the algorithms discussed above are a good starting point for most classification problems.

Here are some general guidelines for choosing a classification algorithm:

• If the data is simple and the problem is not too complex, logistic regression or k-nearest
neighbors may be a good choice.
• If the data is noisy or complex, decision trees or support vector machines may be a better
choice.
• If the data is high-dimensional and sparse, support vector machines are a good choice.
• If the features of the data are truly independent, Naive Bayes may be a good choice.
• If you need the best possible performance, gradient boosting is a good choice.

It is also important to note that there is no one-size-fits-all solution to machine learning
problems. It is often necessary to experiment with different algorithms and parameters to find
the best solution for a particular problem.

Classification Algorithm in Machine Learning

As we know, Supervised Machine Learning algorithms can be broadly classified into Regression and
Classification algorithms. Regression algorithms predict continuous output values, whereas to predict
categorical values we need Classification algorithms.

What is the Classification Algorithm?

The Classification algorithm is a Supervised Learning technique that is used to identify the category of
new observations on the basis of training data. In Classification, a program learns from the given dataset
or observations and then classifies new observations into a number of classes or groups, such as Yes or
No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can also be called targets, labels, or categories.

Unlike regression, the output variable of Classification is a category, not a numeric value, such as
"Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised Learning
technique, it takes labeled input data, which means each input comes with its corresponding output.

In a classification algorithm, an input variable (x) is mapped to a discrete output (y):

y = f(x), where y = categorical output

A well-known example of an ML classification algorithm is an Email Spam Detector.

The main goal of a Classification algorithm is to identify the category of a given observation, and these
algorithms are mainly used to predict the output for categorical data.

Classification algorithms can be better understood using the below diagram. In the below diagram, there
are two classes, Class A and Class B. Members of each class have features that are similar to each other
and dissimilar to those of the other class.

The algorithm which implements the classification on a dataset is known as a classifier. There are two
types of classification:

Binary Classifier: If the classification problem has only two possible outcomes, then it is called a Binary
Classifier.

Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.

Multi-class Classifier: If a classification problem has more than two outcomes, then it is called a
Multi-class Classifier.

Examples: classification of types of crops, classification of types of music.

Learners in Classification Problems:

In classification problems, there are two types of learners:

1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test
dataset. Classification is then done on the basis of the most closely related data stored in the
training dataset. It takes less time in training but more time for predictions.

Example: K-NN algorithm, Case-based reasoning

2. Eager Learners: Eager learners develop a classification model based on a training dataset before
receiving a test dataset. In contrast to lazy learners, an eager learner takes more time in learning and
less time in prediction.

Example: Decision Trees, Naïve Bayes, ANN.

Types of ML Classification Algorithms:


Classification algorithms can be broadly divided into two categories:

Linear Models

• Logistic Regression
• Support Vector Machines

Non-linear Models

• K-Nearest Neighbors
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification

Evaluating a Classification model:

Once our model is completed, it is necessary to evaluate its performance, whether it is a Classification
or a Regression model. For evaluating a Classification model, we have the following ways:

1. Log Loss or Cross-Entropy Loss:

It is used for evaluating the performance of a classifier whose output is a probability value between
0 and 1.

For a good binary classification model, the value of log loss should be close to 0.

The value of log loss increases as the predicted probability deviates from the actual label.

A lower log loss means the predicted probabilities are closer to the actual labels.

For binary classification, the cross-entropy for a single observation can be calculated as:

$-\big(y \log(p) + (1 - y) \log(1 - p)\big)$

where y = actual output (0 or 1) and p = predicted probability of class 1.
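
As a rough sketch, the formula can be averaged over several observations. The function name `binary_log_loss`, the clipping constant `eps`, and the sample values are assumptions made for illustration.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; clipping by eps avoids log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]   # probabilities close to the labels
confident_wrong = [0.1, 0.9, 0.2, 0.8]   # probabilities far from the labels

print(binary_log_loss(y_true, confident_right))
print(binary_log_loss(y_true, confident_wrong))
```

The second call gives a much larger loss, which matches the statement that log loss grows as predictions move away from the actual labels.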

2. Confusion Matrix:

• The confusion matrix provides a matrix/table as output and describes the performance of the
model.

• It is also known as the error matrix.

The matrix summarizes the prediction results, giving the total number of correct and incorrect
predictions. The matrix looks like the table below:

                     Actual Positive    Actual Negative

Predicted Positive   True Positive      False Positive

Predicted Negative   False Negative     True Negative
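
The four cells of the table can be counted directly from lists of actual and predicted labels. This is a small sketch with hypothetical labels; `confusion_counts` is an illustrative helper, not a library function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(tp, fp, fn, tn, accuracy)
```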

3. AUC-ROC curve:

ROC stands for Receiver Operating Characteristic, and AUC stands for Area Under the Curve.

The ROC curve is a graph that shows the performance of the classification model at different thresholds.

The curve is defined for binary classifiers; for a multi-class model, one curve is typically drawn per
class using a one-vs-rest scheme.

The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and FPR (False Positive Rate) on
the X-axis.
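
Each point on the curve comes from one threshold. The sketch below computes (FPR, TPR) pairs for a few thresholds; the scores, labels, thresholds, and the helper `roc_points` are hypothetical.

```python
def roc_points(y_true, scores, thresholds):
    """Return a list of (FPR, TPR) pairs, one per threshold."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    points = []
    for t in thresholds:
        # An observation is predicted positive when its score reaches t.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores, thresholds=[0.0, 0.3, 0.5, 0.9])
print(points)
```

Lowering the threshold moves the point toward the top-right corner (everything predicted positive); raising it moves the point toward the origin.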

Use cases of Classification Algorithms

Classification algorithms can be used in different places. Below are some popular use cases of
Classification Algorithms:

• Email spam detection
• Speech recognition
• Identification of cancer tumor cells
• Drug classification
• Biometric identification, etc.

Logistic Regression in Machine Learning

Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
Supervised Learning technique. It is used for predicting the categorical dependent variable using a given
set of independent variables.
Logistic regression predicts the output of a categorical dependent variable, so the outcome must be a
categorical or discrete value such as Yes or No, 0 or 1, True or False. Instead of giving the exact value
0 or 1, however, it gives probabilistic values that lie between 0 and 1.

Logistic Regression is very similar to Linear Regression except in how it is used. Linear Regression is
used for solving regression problems, whereas Logistic Regression is used for solving classification
problems.

In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic function whose
output is bounded by two limiting values (0 and 1).

The curve from the logistic function indicates the likelihood of something, such as whether cells are
cancerous or not, or whether a mouse is obese or not based on its weight.

Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.

Logistic Regression can be used to classify the observations using different types of data and can easily
determine the most effective variables used for the classification. The below image is showing the
logistic function:

Note: Logistic regression uses the concept of predictive modeling as regression; therefore, it is called
logistic regression, but is used to classify samples; Therefore, it falls under the classification algorithm.

Logistic Function (Sigmoid Function):

• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
• It maps any real value into another value within the range 0 to 1.
• The output of logistic regression must lie between 0 and 1 and cannot go beyond these limits,
so it forms an "S"-shaped curve. This S-shaped curve is called the sigmoid function or the
logistic function.
• In logistic regression, we use a threshold value to turn the probability into a class: values above
the threshold are assigned to class 1, and values below it are assigned to class 0.
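
A short sketch of the sigmoid and the threshold rule described in the points above. The function names and the choice to treat a probability exactly equal to the threshold as class 1 are assumptions.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 4), classify(z))
```

Large negative inputs give probabilities near 0, large positive inputs give probabilities near 1, and an input of 0 gives exactly 0.5.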

Assumptions for Logistic Regression:

• The dependent variable must be categorical in nature.
• The independent variables should not exhibit multicollinearity, i.e., they should not be highly
correlated with one another.

Logistic Regression Equation:

The Logistic Regression equation can be obtained from the Linear Regression equation. The mathematical
steps to get the Logistic Regression equation are given below:

We know the equation of the straight line can be written as:

$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$

In Logistic Regression, y can be between 0 and 1 only, so let's divide y by (1 - y), which gives the odds:

$\frac{y}{1 - y}$, which is 0 for y = 0 and tends to infinity as y approaches 1

But we need a range between -infinity and +infinity, so we take the logarithm of the odds, and the
equation becomes:

$\log\left(\frac{y}{1 - y}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$

The above equation is the final equation for Logistic Regression.
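
The relationship between the linear part and the probability can be checked numerically: taking the log-odds of the sigmoid output recovers the linear combination. The coefficients `b0`, `b1` and the input `x` below are hypothetical values chosen for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(y):
    """Log-odds: log(y / (1 - y)), defined for 0 < y < 1."""
    return math.log(y / (1.0 - y))

# Hypothetical coefficients and a single input, for illustration only.
b0, b1, x = -1.5, 0.8, 3.0

z = b0 + b1 * x      # linear part, can be any real number
y = sigmoid(z)       # probability, strictly between 0 and 1
print(z, y, logit(y))   # logit(y) recovers z up to rounding
```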

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:

Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.

Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the
dependent variable, such as "cat", "dogs", or "sheep"

Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent
variables, such as "low", "Medium", or "High".

Python Implementation of Logistic Regression (Binomial)

To understand the implementation of Logistic Regression in Python, we will use the below example:

Example: A dataset contains information about various users obtained from social networking sites. A
car-making company has recently launched a new SUV and wants to check how many users from the
dataset want to purchase the car.
For this problem, we will build a Machine Learning model using the Logistic Regression algorithm. The
dataset is shown in the below image. In this problem, we will predict the purchased variable (dependent
variable) by using age and salary (independent variables).

Steps in Logistic Regression: To implement the Logistic Regression using Python, we will use the same
steps as we have done in previous topics of Regression. Below are the steps:

• Data Pre-processing step


• Fitting Logistic Regression to the Training set
• Predicting the test result
• Test accuracy of the result(Creation of Confusion matrix)
• Visualizing the test set result.

1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can use it in
our code efficiently. It will be the same as we have done in Data pre-processing topic. The code for this is
given below:

#Data Pre-processing Step

# importing libraries

import numpy as nm
import matplotlib.pyplot as mtp

import pandas as pd

#importing datasets

data_set= pd.read_csv('user_data.csv')

By executing the above lines of code, we will get the dataset as the output. Consider the given image:

Now, we will extract the dependent and independent variables from the given dataset. Below is the code
for it:

#Extracting Independent and dependent Variable

x= data_set.iloc[:, [2,3]].values

y= data_set.iloc[:, 4].values
In the above code, we have taken [2, 3] for x because our independent variables are age and salary,
which are at index 2, 3. And we have taken 4 for y variable because our dependent variable is at index 4.
The output will be:

Now we will split the dataset into a training set and test set. Below is the code for it:

# Splitting the dataset into training and test set.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

The output for this is given below:


For test
set:

For training set:


In logistic regression, we apply feature scaling so that age and salary, which are on very different
scales, contribute comparably to the model. Here we scale only the independent variables, because the
dependent variable takes only the values 0 and 1. Below is the code for it:

#feature Scaling

from sklearn.preprocessing import StandardScaler

st_x= StandardScaler()

x_train= st_x.fit_transform(x_train)

x_test= st_x.transform(x_test)
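
For intuition about what the scaling code above computes, standardization can be sketched in plain Python: the mean and standard deviation come from the training rows only and are then reused for the test rows. The rows below are hypothetical, and `fit_scaler` and `transform` are illustrative helpers rather than scikit-learn functions.

```python
import math

def fit_scaler(rows):
    """Compute per-column mean and population standard deviation."""
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(n_cols)]
    stds = [math.sqrt(sum((r[c] - means[c]) ** 2 for r in rows) / n)
            for c in range(n_cols)]
    return means, stds

def transform(rows, means, stds):
    """Subtract the training mean and divide by the training std."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

# Hypothetical (age, salary) rows, for illustration only.
train_rows = [[20, 20000], [30, 40000], [40, 60000]]
test_rows = [[30, 80000]]

means, stds = fit_scaler(train_rows)
scaled_train = transform(train_rows, means, stds)
scaled_test = transform(test_rows, means, stds)   # uses training statistics
print(scaled_train)
print(scaled_test)
```

Reusing the training statistics on the test set mirrors calling fit_transform on x_train and only transform on x_test.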

The scaled output is given below:


2. Fitting Logistic Regression to the Training set:

We have prepared our dataset well, and now we will train the model using the training set. To fit the
model to the training set, we will import the LogisticRegression class from the sklearn library.

After importing the class, we will create a classifier object and use it to fit the model to the training
set. Below is the code for it:

#Fitting Logistic Regression to the training set

from sklearn.linear_model import LogisticRegression

classifier= LogisticRegression(random_state=0)

classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the below output:

Out[5]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,

intercept_scaling=1, l1_ratio=None, max_iter=100,

multi_class='warn', n_jobs=None, penalty='l2',


random_state=0, solver='warn', tol=0.0001, verbose=0,

warm_start=False)

Hence our model is well fitted to the training set.

3. Predicting the Test Result

Our model is well trained on the training set, so we will now predict the result by using test set data.
Below is the code for it:

#Predicting the test set result

y_pred= classifier.predict(x_test)

In the above code, we have created a y_pred vector to predict the test set result.

Output: By executing the above code, a new vector (y_pred) will be created under the variable explorer
option. It can be seen as:
The above output image shows the corresponding predicted users who want to purchase or not
purchase the car.

4. Test Accuracy of the result

Now we will create the confusion matrix to check the accuracy of the classification. To create it, we
need to import the confusion_matrix function of the sklearn library. After importing the function, we
call it and store the result in a new variable cm. The function takes two main parameters: y_true (the
actual values) and y_pred (the predicted values returned by the classifier). Below is the code for it:

#Creating the Confusion matrix

from sklearn.metrics import confusion_matrix

cm= confusion_matrix(y_test, y_pred)

Output:

By executing the above code, a new confusion matrix will be created. Consider the below image:

We can find the accuracy of the predicted result by interpreting the confusion matrix. From the above
output, 65 + 24 = 89 predictions are correct and 8 + 3 = 11 are incorrect.
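
The accuracy implied by these counts can be worked out in a few lines. Which off-diagonal count is the false positives and which is the false negatives is not stated in the text, so the sketch only uses the totals.

```python
correct = 65 + 24     # diagonal entries of the confusion matrix
incorrect = 8 + 3     # off-diagonal entries
total = correct + incorrect

accuracy = correct / total
print(f"{correct} of {total} correct, accuracy = {accuracy:.2f}")
```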

5. Visualizing the training set result

Finally, we will visualize the training set result. To visualize the result, we will use ListedColormap class of
matplotlib library. Below is the code for it:

#Visualizing the training set result

from matplotlib.colors import ListedColormap


x_set, y_set = x_train, y_train

x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))

mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple','green' )))

mtp.xlim(x1.min(), x1.max())

mtp.ylim(x2.min(), x2.max())

for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color = ListedColormap(('purple', 'green'))(i), label = j)

mtp.title('Logistic Regression (Training set)')

mtp.xlabel('Age')

mtp.ylabel('Estimated Salary')

mtp.legend()

mtp.show()

In the above code, we have imported the ListedColormap class of the Matplotlib library to create the
colormap for visualizing the result. We have created two new variables, x_set and y_set, as aliases
for x_train and y_train. After that, we have used the nm.meshgrid command to create a rectangular
grid that extends from 1 below the minimum to 1 above the maximum of each scaled feature. The grid
points are spaced at a resolution of 0.01.

To create a filled contour, we have used the mtp.contourf command, which creates regions of the
provided colors (purple and green). To this function we have passed the output of classifier.predict
for every grid point, so each region shows the class predicted by the classifier.

Output: By executing the above code, we will get the below output:

The graph can be explained in the below points:

In the above graph, we can see that there are some Green points within the green region and Purple
points within the purple region.

All these data points are the observation points from the training set, which shows the result for
purchased variables.

This graph is made by using two independent variables i.e., Age on the x-axis and Estimated salary on the
y-axis.

The purple points are observations for which purchased (dependent variable) is 0, i.e., users who did
not purchase the SUV.

The green points are observations for which purchased (dependent variable) is 1, i.e., users who
purchased the SUV.

We can also see from the graph that younger users with a low salary mostly did not purchase the car,
whereas older users with a high estimated salary mostly purchased it.

But there are some purple points in the green region (buying the car) and some green points in the
purple region (not buying the car). These are users whose actual decision differs from what the
classifier predicts, for example a younger user with a high estimated salary who purchased the car, or
an older user with a low estimated salary who did not.

The goal of the classifier:

We have successfully visualized the training set result for the logistic regression, and our goal for this
classification is to divide the users who purchased the SUV car and who did not purchase the car. So from
the output graph, we can clearly see the two regions (Purple and Green) with the observation points.
The Purple region is for those users who didn't buy the car, and Green Region is for those users who
purchased the car.

Linear Classifier:
As we can see from the graph, the decision boundary is a straight line, i.e., linear in nature, because
Logistic Regression is a linear model. In further topics, we will learn about non-linear classifiers.

Visualizing the test set result:

Our model is well trained using the training dataset. Now, we will visualize the result for new
observations (Test set). The code for the test set will remain same as above except that here we will
use x_test and y_test instead of x_train and y_train. Below is the code for it:

#Visualizing the test set result

from matplotlib.colors import ListedColormap

x_set, y_set = x_test, y_test

x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))

mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple','green' )))

mtp.xlim(x1.min(), x1.max())

mtp.ylim(x2.min(), x2.max())

for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color = ListedColormap(('purple', 'green'))(i), label = j)

mtp.title('Logistic Regression (Test set)')

mtp.xlabel('Age')

mtp.ylabel('Estimated Salary')

mtp.legend()

mtp.show()

Output:

The above graph shows the test set result. As we can see, the graph is divided into two regions (Purple
and Green). Most green observations are in the green region, and most purple observations are in the
purple region, so we can say the model predicts well. Some of the green and purple data points fall in
the opposite region; these are the misclassifications already counted in the confusion matrix
(11 incorrect outputs).

Hence our model is pretty good and ready to make new predictions for this classification problem.

K-Nearest Neighbor(KNN) Algorithm for Machine Learning

K-Nearest Neighbor is one of the simplest Machine Learning algorithms based on Supervised Learning
technique.

The K-NN algorithm assumes that similar cases lie close to each other, and it puts a new case into the
category that is most similar to the available categories.

The K-NN algorithm stores all the available data and classifies a new data point based on similarity.
This means that when new data appears, it can easily be classified into a well-suited category by using
the K-NN algorithm.

The K-NN algorithm can be used for Regression as well as for Classification, but it is mostly used for
Classification problems.

K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying
data distribution.

It is also called a lazy learner algorithm because it does not learn from the training set immediately;
instead, it stores the dataset and, at the time of classification, performs an action on the dataset.

At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it classifies
that data into the category that is most similar to the new data.

Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want
to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, as it works
on a similarity measure. Our KNN model will compare the features of the new image with those of the cat
and dog images and, based on the most similar features, put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which
of these categories does this data point lie? To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
Consider the below diagram:

How does K-NN work?

The working of K-NN can be explained on the basis of the below algorithm:

• Step-1: Select the number K of neighbors.
• Step-2: Calculate the Euclidean distance from the new data point to every point in the training data.
• Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
• Step-4: Among these K neighbors, count the number of data points in each category.
• Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
• Step-6: Our model is ready.
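
The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the library implementation used later: the function name knn_predict and the toy points are assumptions made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(x_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    x_train = np.asarray(x_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((x_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: count the labels among the neighbors and pick the most common
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: category "A" clusters near (1, 1), "B" near (5, 5)
points = [[1, 1], [1, 2], [2, 1], [5, 5], [5, 6], [6, 5]]
labels = ["A", "A", "A", "B", "B", "B"]
prediction = knn_predict(points, labels, [2, 2], k=3)
print(prediction)
```

Note that nothing is "trained" here: all the work happens at prediction time, which is why K-NN is called a lazy learner.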

Suppose we have a new data point and we need to put it in the required category. Consider the below
image:

Firstly, we will choose the number of neighbors, so we will choose the k=5.

Next, we will calculate the Euclidean distance between the new data point and the existing data points. The Euclidean distance is the straight-line distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2), it can be calculated as:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

By calculating the Euclidean distance we get the five nearest neighbors: three in category A and two in category B. Consider the below image:

As the majority (3 of the 5) of the nearest neighbors are from category A, this new data point is assigned to category A.

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:

• There is no particular way to determine the best value for "K", so we need to try several values and keep the one that performs best on held-out data. K=5 is a common starting point (it is also the default in scikit-learn).
• A very low value for K, such as K=1 or K=2, can be noisy and makes the model sensitive to outliers.
• Larger values of K smooth out noise, but a K that is too large blurs the boundary between the classes and can underfit.
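
One way to "try some values" is cross-validation. The sketch below is a hedged example: it uses synthetic data from make_classification as a stand-in (an assumption, not the dataset used later in this document), and the list of candidate K values is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; user_data.csv is not used here)
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

candidate_ks = [1, 3, 5, 7, 9, 11, 15]
mean_scores = {}
for k in candidate_ks:
    # Scaling lives inside the pipeline so each fold is scaled on its own training part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Mean 5-fold cross-validated accuracy for this value of K
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best K:", best_k)
```

The best K depends on the data, so the value printed here says nothing about the best K for another dataset.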

Advantages of KNN Algorithm:

• It is simple to implement.
• With a suitably large K, it is fairly robust to noisy training data.
• It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

• The value of K always needs to be determined, which can sometimes be complex.
• The computation cost is high because the distance from the new data point to all the training samples must be calculated.

Python implementation of the KNN algorithm

To do the Python implementation of the K-NN algorithm, we will use the same problem and dataset which we have used in Logistic Regression, so that the performance of the two models can be compared. Below is the problem description:

Problem for K-NN Algorithm: A car manufacturer has built a new SUV and wants to show ads to the users who are interested in buying it. For this problem, we have a dataset that contains information about multiple users collected through a social network. The dataset contains a lot of information, but we will use Estimated Salary and Age as the independent variables and Purchased as the dependent variable. Below is the dataset:

Steps to implement the K-NN algorithm:

• Data Pre-processing step
• Fitting the K-NN algorithm to the Training set
• Predicting the test result
• Test accuracy of the result (Creation of Confusion matrix)
• Visualizing the test set result.

Data Pre-Processing Step:

The Data Pre-processing step will remain exactly the same as Logistic Regression. Below is the code for it:

# importing libraries

import numpy as nm

import matplotlib.pyplot as mtp

import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable

x= data_set.iloc[:, [2,3]].values

y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling

from sklearn.preprocessing import StandardScaler

st_x= StandardScaler()

x_train= st_x.fit_transform(x_train)

x_test= st_x.transform(x_test)

By executing the above code, our dataset is imported into our program and pre-processed. After feature scaling our test dataset will look like:

From the above output image, we can see that our data is successfully scaled.

Fitting K-NN classifier to the Training data:


Now we will fit the K-NN classifier to the training data. To do this we will import the KNeighborsClassifier class from the sklearn.neighbors module. After importing the class, we will create the classifier object of the class. The parameters of this class will be:

n_neighbors: The number of neighbors the algorithm uses. The default value is 5.

metric='minkowski': This is the default metric, and it decides how the distance between the points is measured.

p=2: The power parameter of the Minkowski metric; p=2 makes it equivalent to the standard Euclidean metric.
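
To see why p=2 corresponds to the Euclidean metric, the Minkowski distance can be written out by hand. The helper name minkowski_distance below is an assumption for this sketch; it is not a scikit-learn function.

```python
import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))


d_euclidean = minkowski_distance([0, 0], [3, 4], p=2)  # straight-line distance
d_manhattan = minkowski_distance([0, 0], [3, 4], p=1)  # sum of absolute differences
print(d_euclidean, d_manhattan)
```

With p=2 the formula reduces to the square root of the summed squared differences, and with p=1 it becomes the Manhattan distance.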

And then we will fit the classifier to the training data. Below is the code for it:

#Fitting K-NN classifier to the training set

from sklearn.neighbors import KNeighborsClassifier

classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2 )

classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the output as:

Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

Predicting the Test Result: To predict the test set result, we will create a y_pred vector as we did in
Logistic Regression. Below is the code for it:

#Predicting the test set result

y_pred= classifier.predict(x_test)

Output:

The output for the above code will be:


Creating the Confusion Matrix:
Now we will create the Confusion Matrix for our K-NN model to see the accuracy of the classifier. Below
is the code for it:

#Creating the Confusion matrix

from sklearn.metrics import confusion_matrix

cm= confusion_matrix(y_test, y_pred)

In the above code, we have imported the confusion_matrix function, called it, and stored the result in the variable cm.

Output: By executing the above code, we will get the matrix as below:

In the above image, we can see there are 64+29= 93 correct predictions and 3+4= 7 incorrect
predictions, whereas, in Logistic Regression, there were 11 incorrect predictions. So we can say that the
performance of the model is improved by using the K-NN algorithm.
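
The accuracy can be read off the confusion matrix directly. The sketch below uses the counts quoted above; which off-diagonal cell holds the 3 and which holds the 4 is an assumption, since the original output image is not available.

```python
import numpy as np

# Counts quoted in the text; the placement of the two off-diagonal
# values is assumed, as the original output image is missing.
cm = np.array([[64, 4],
               [3, 29]])

correct = int(np.trace(cm))   # diagonal cells = correct predictions
total = int(cm.sum())         # all test samples
accuracy = correct / total
print(correct, total - correct, accuracy)
```

The accuracy does not depend on how the two off-diagonal values are arranged, because only their sum enters the calculation.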

Visualizing the Training set result:


Now, we will visualize the training set result for K-NN model. The code will remain same as we did in
Logistic Regression, except the name of the graph. Below is the code for it:

# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
# Colour every grid point by the class predicted for it
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# Overlay the actual observations, one colour per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the below graph:

The output graph is different from the graph which we obtained in Logistic Regression. It can be understood in the below points:

As we can see, the graph shows red points and green points. The green points are for the Purchased(1) class and the red points are for the not Purchased(0) class.

The graph shows an irregular boundary instead of a straight line or a smooth curve, because K-NN assigns each region to the majority class among its nearest neighbors.

The graph has classified users in the correct categories, as most of the users who didn't buy the SUV are in the red region and most of the users who bought the SUV are in the green region.

The graph shows a good result, but there are still some green points in the red region and red points in the green region. This is not a big issue: a boundary that bent to capture every single training point would be overfitting.

Hence our model is well trained.

Visualizing the Test set result:


After training the model, we will now test it on a new dataset, i.e., the test dataset. The code remains the same except for some minor changes: x_train and y_train are replaced by x_test and y_test. Below is the code for it:

# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
# Colour every grid point by the class predicted for it
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# Overlay the actual observations, one colour per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above graph shows the output for the test data set. As we can see in the graph, the predicted output is good, as most of the red points are in the red region and most of the green points are in the green region.

However, there are a few green points in the red region and a few red points in the green region. These are the incorrect predictions that we observed in the confusion matrix (7 incorrect outputs).

Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is
used for Classification as well as Regression problems. However, primarily, it is used for Classification
problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we have used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) using the extreme cases (support vectors) of each class. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Types of SVM

SVM can be of two types:

• Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by using a single straight line, such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified by using a straight line, such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional
space, but we need to find out the best decision boundary that helps to classify the data points. This
best boundary is known as the hyperplane of SVM.

The dimension of the hyperplane depends on the number of features in the dataset: if there are 2 features (as shown in image), the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, which means the maximum distance between the hyperplane and the nearest data points of each class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.

How does SVM work?


Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset
that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that
can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below image:

Since this is a 2-d space, we can easily separate these two classes by just using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the points from both classes that lie closest to the boundary. These points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane.
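
The support vectors and the margin can be inspected on a fitted model. The sketch below is an illustration on hypothetical toy data (four hand-picked points, not the dataset used later); the very large C is an assumption chosen to approximate a hard-margin SVM.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: one class on the line x1 = 0, the other on x1 = 2
X = np.array([[0, 0], [0, 1], [2, 0], [2, 1]], dtype=float)
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM on separable data
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                  # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]
margin = 1.0 / np.linalg.norm(w)  # distance from the hyperplane to the closest points

print("w =", w, "b =", b)
print("support vectors:")
print(clf.support_vectors_)
print("margin =", margin)
```

For this data the separating line lies midway between the two columns of points, so the distance from the line to the closest points comes out close to 1.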

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:

z = x^2 + y^2

By adding the third dimension, the sample space will become as in the below image:

So now, SVM will divide the dataset into classes in the following way. Consider the below image:

Since we are in 3-d space, the boundary looks like a plane parallel to the x-y plane. If we convert it to 2-d space with z=1, then it will become as:

Hence, we get a circle of radius 1 in the case of non-linear data.
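
The effect of the extra dimension can be checked numerically. In this sketch the two classes are hypothetical: points sampled inside and outside the unit circle, with radii chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def ring(radius_low, radius_high, n):
    """Sample n points whose distance from the origin lies in [radius_low, radius_high)."""
    r = rng.uniform(radius_low, radius_high, n)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])


inner = ring(0.0, 0.8, 50)   # hypothetical class 0: inside the unit circle
outer = ring(1.2, 2.0, 50)   # hypothetical class 1: outside the unit circle

# Third dimension from the text: z = x^2 + y^2
z_inner = (inner ** 2).sum(axis=1)
z_outer = (outer ** 2).sum(axis=1)

# In (x, y, z) space the flat plane z = 1 separates the two classes
print(z_inner.max() < 1.0, z_outer.min() > 1.0)
```

No straight line in the (x, y) plane separates these two classes, but after adding z a single flat plane does, and that plane maps back to a circle in 2-d.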

Python Implementation of Support Vector Machine

Now we will implement the SVM algorithm using Python. Here we will use the same user_data dataset which we have used in Logistic Regression and KNN classification.

Data Pre-processing step

Till the Data pre-processing step, the code will remain the same. Below is the code:

#Data Pre-processing Step

# importing libraries

import numpy as nm

import matplotlib.pyplot as mtp

import pandas as pd

#importing datasets

data_set= pd.read_csv('user_data.csv')
#Extracting Independent and dependent Variable

x= data_set.iloc[:, [2,3]].values

y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling

from sklearn.preprocessing import StandardScaler

st_x= StandardScaler()

x_train= st_x.fit_transform(x_train)

x_test= st_x.transform(x_test)

After executing the above code, the data will be pre-processed. The code will give the dataset as:

The scaled output for the test set will be:

Fitting the SVM classifier to the training set:

Now the SVM classifier will be fitted to the training set. To create the SVM classifier, we will import the SVC class from the sklearn.svm module. Below is the code for it:

from sklearn.svm import SVC # "Support vector classifier"

classifier = SVC(kernel='linear', random_state=0)

classifier.fit(x_train, y_train)

In the above code, we have used kernel='linear', as here we are creating an SVM with a linear decision boundary. However, we can change the kernel for non-linear data. We then fitted the classifier to the training dataset (x_train, y_train).

Output:

Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=0,
    shrinking=True, tol=0.001, verbose=False)


The model performance can be altered by changing the value of C (regularization factor), gamma, and kernel.
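
As a hedged illustration of how much the kernel choice can matter, the sketch below compares a linear and an RBF kernel on synthetic concentric-circle data from make_circles. The data and the parameter values are assumptions for this example; it does not use the user_data dataset.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic concentric-circle data (an assumption, not the user_data dataset)
X, y = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0, gamma="scale", random_state=0)
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)  # accuracy on the held-out test set

print(scores)
```

On data like this, where one class surrounds the other, a straight line cannot separate the classes, so the RBF kernel is expected to score clearly higher than the linear one.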

Predicting the test set result:


Now, we will predict the output for test set. For this, we will create a new vector y_pred. Below is the
code for it:

#Predicting the test set result

y_pred= classifier.predict(x_test)

After getting the y_pred vector, we can compare the result of y_pred and y_test to check the difference
between the actual value and predicted value.

Output: Below is the output for the prediction of the test set:

Creating the confusion matrix:


Now we will see the performance of the SVM classifier, that is, how many incorrect predictions there are compared with the Logistic Regression classifier. To create the confusion matrix, we need to import the confusion_matrix function from the sklearn.metrics module. After importing the function, we will call it and store the result in a new variable cm. The function takes two main parameters: y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:

#Creating the Confusion matrix

from sklearn.metrics import confusion_matrix

cm= confusion_matrix(y_test, y_pred)

Output:

As we can see in the above output image, there are 66+24= 90 correct predictions and 8+2= 10 incorrect predictions. Therefore we can say that our SVM model improved slightly as compared to the Logistic Regression model, which made 11 incorrect predictions.

Visualizing the training set result:


Now we will visualize the training set result, below is the code for it:

from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
# Colour every grid point by the class predicted for it
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# Overlay the actual observations, one colour per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the output as:

As we can see, the above output appears similar to the Logistic Regression output. In the output, we got a straight line as the hyperplane because we have used a linear kernel in the classifier. We have also discussed above that for 2-d space, the hyperplane in SVM is a straight line.

Visualizing the test set result:

# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
# Colour every grid point by the class predicted for it
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# Overlay the actual observations, one colour per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the output as:

As we can see in the above output image, the SVM classifier has divided the users into two regions (Purchased or Not purchased). Users who purchased the SUV (class 1) are in the green region with the green scatter points, and users who did not purchase the SUV (class 0) are in the red region with the red scatter points. The hyperplane has divided the two classes into the Purchased and Not purchased categories.

Naïve Bayes Classifier Algorithm

Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used for
solving classification problems.

It is mainly used in text classification that includes a high-dimensional training dataset.

Naïve Bayes Classifier is one of the simplest and most effective classification algorithms, which helps in building fast machine learning models that can make quick predictions.

It is a probabilistic classifier, which means it predicts on the basis of the probability of an object belonging to each class.

Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.

Why is it called Naïve Bayes?

The name Naïve Bayes is made up of two words, Naïve and Bayes, which can be described as:

Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature individually contributes to identifying it as an apple, without depending on the others.

Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:

Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability of a hypothesis with prior knowledge. It depends on the conditional probability.

The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A given the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence B given that hypothesis A is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.


P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether or not we should play on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:

• Convert the given dataset into frequency tables.
• Generate the likelihood table by finding the probabilities of the given features.
• Now, use Bayes' theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the below dataset:

      Outlook     Play
0     Rainy       Yes
1     Sunny       Yes
2     Overcast    Yes
3     Overcast    Yes
4     Sunny       No
5     Rainy       Yes
6     Sunny       Yes
7     Overcast    Yes
8     Rainy       No
9     Sunny       No
10    Sunny       Yes
11    Rainy       No
12    Overcast    Yes
13    Overcast    Yes

Frequency table for the Weather Conditions:

Weather     Yes    No
Overcast    5      0
Rainy       2      2
Sunny       3      2
Total       10     4

Likelihood table weather condition:

Weather     No             Yes              P(Weather)
Overcast    0              5                5/14 = 0.35
Rainy       2              2                4/14 = 0.29
Sunny       2              3                5/14 = 0.35
All         4/14 = 0.29    10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60


P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|No)= 2/4= 0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence on a sunny day, the player can play the game.
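
The same calculation can be reproduced from the raw dataset. This is a minimal sketch using exact fractions; the helper name posterior is an assumption for this example.

```python
from collections import Counter
from fractions import Fraction

# (Outlook, Play) pairs copied from the dataset above
data = [
    ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
    ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
    ("Overcast", "Yes"), ("Overcast", "Yes"),
]

n = len(data)
play_counts = Counter(play for _, play in data)            # frequency of Yes / No
outlook_counts = Counter(outlook for outlook, _ in data)   # frequency of each outlook
joint_counts = Counter(data)                               # frequency of each (outlook, play) pair


def posterior(play, outlook):
    """P(play | outlook) = P(outlook | play) * P(play) / P(outlook)."""
    likelihood = Fraction(joint_counts[(outlook, play)], play_counts[play])
    prior = Fraction(play_counts[play], n)
    evidence = Fraction(outlook_counts[outlook], n)
    return likelihood * prior / evidence


p_yes = posterior("Yes", "Sunny")
p_no = posterior("No", "Sunny")
print(p_yes, p_no)
```

With exact fractions the two posteriors are 3/5 and 2/5 and sum to 1; the values 0.60 and 0.41 in the hand calculation differ slightly only because the intermediate probabilities were rounded.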

Advantages of Naïve Bayes Classifier:

Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.

It can be used for binary as well as multi-class classification.

It often performs well in multi-class prediction compared with other algorithms.

It is a popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship
between features.

Applications of Naïve Bayes Classifier:

• It is used for credit scoring.
• It is used in medical data classification.
• It can be used in real-time predictions because the Naïve Bayes classifier is an eager learner.
• It is used in text classification, such as spam filtering and sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if
predictors take continuous values instead of discrete, then the model assumes that these values are
sampled from the Gaussian distribution.

Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, that is, deciding which category (such as Sports, Politics, Education, etc.) a particular document belongs to. The classifier uses the frequency of words as the predictors.

Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also well known for document classification tasks.
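
As a small illustration of the Multinomial model on word frequencies, the sketch below classifies short documents. The mini-corpus and its labels are invented for this example, and the CountVectorizer plus MultinomialNB pipeline is one common way to set this up, not the only one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus, invented for illustration only
docs = [
    "the team won the football match",
    "the player scored a goal in the match",
    "the government passed a new law",
    "the minister announced the election date",
]
labels = ["Sports", "Sports", "Politics", "Politics"]

# CountVectorizer turns each document into word counts (the predictors)
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

predictions = model.predict(["the team scored a goal", "the election law"])
print(predictions)
```

Replacing MultinomialNB with BernoulliNB would make the model use only the presence or absence of each word rather than its count.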

Python Implementation of the Naïve Bayes algorithm:

Now we will implement a Naive Bayes Algorithm using Python. So for this, we will use the
"user_data" dataset, which we have used in our other classification model. Therefore we can easily
compare the Naive Bayes model with the other models.

Steps to implement:

• Data Pre-processing step
• Fitting Naive Bayes to the Training set
• Predicting the test result
• Test accuracy of the result (Creation of Confusion matrix)
• Visualizing the test set result.

Data Pre-processing step:

In this step, we will pre-process/prepare the data so that we can use it efficiently in our code. It is the same as the data pre-processing we did for the earlier models. The code for this is given below:

# Importing the libraries

import numpy as nm

import matplotlib.pyplot as mtp

import pandas as pd

# Importing the dataset

dataset = pd.read_csv('user_data.csv')

x = dataset.iloc[:, [2, 3]].values

y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

# Feature Scaling

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

In the above code, we have loaded the dataset into our program using dataset = pd.read_csv('user_data.csv'). The loaded dataset is divided into training and test sets, and then we have scaled the feature variables.

The output for the dataset is given as:

Fitting Naive Bayes to the Training Set:

After the pre-processing step, now we will fit the Naive Bayes model to the Training set. Below is the code for it:

# Fitting Naive Bayes to the Training set

from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()

classifier.fit(x_train, y_train)

In the above code, we have used the GaussianNB classifier to fit it to the training dataset. We can also
use other classifiers as per our requirement.

Output:

Out[6]: GaussianNB(priors=None, var_smoothing=1e-09)

Prediction of the test set result:

Now we will predict the test set result. For this, we will create a new predictor variable y_pred, and will
use the predict function to make the predictions.

# Predicting the Test set results

y_pred = classifier.predict(x_test)

Output:

The above output shows the prediction vector y_pred and the real vector y_test. We can see that some predictions are different from the real values; these are the incorrect predictions.

Creating Confusion Matrix:

Now we will check the accuracy of the Naive Bayes classifier using the Confusion matrix. Below is the
code for it:

# Making the Confusion Matrix

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

Output:

As we can see in the above confusion matrix output, there are 7+3= 10 incorrect predictions, and 65+25= 90 correct predictions.

Visualizing the training set result:

Next, we will visualize the training set result using Naïve Bayes Classifier. Below is the code for it:

# Visualising the Training set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
# Colour every grid point by the class predicted for it
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
# Overlay the actual observations, one colour per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

In the above output we can see that the Naïve Bayes classifier has segregated the data points with a fine boundary. The boundary is a smooth curve because we have used the GaussianNB classifier in our code.

Visualizing the Test set result:

# Visualising the Test set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
# Colour every grid point by the class predicted for it
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
# Overlay the actual observations, one colour per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above output is the final output for the test set data. As we can see, the classifier has created a curved boundary to divide the "purchased" and "not purchased" classes. There are some wrong predictions, which we have counted in the confusion matrix, but it is still a pretty good classifier.

Top 10 Machine Learning Algorithms (with Python and R Codes)


Google’s self-driving cars and robots get a lot of press, but the company’s real future is in machine
learning, the technology that enables computers to get smarter and more personal.

We are probably living in the most defining period of human history. The period when computing moved
from large mainframes to PCs to the cloud. But what makes it defining is not what has happened but
what is coming our way in years to come. What makes this period exciting and enthralling for someone
like me is the democratization of the various tools, techniques, and machine learning algorithms that
followed the boost in computing. Welcome to the world of data science.

Today, as a data scientist, I can build data-crunching machines with complex algorithms for a few dollars
per hour. But reaching here wasn’t easy! I had my dark days and nights.

Learning Objectives

• Major focus on commonly used machine learning techniques and algorithms.
• Algorithms covered: Linear regression, logistic regression, Naive Bayes, kNN, Random forest, etc.
• Learn both theory and implementation of the machine learning algorithms in R and Python.


Supervised Learning Algorithms

How it works: This algorithm consists of a target/outcome variable (or dependent variable) which is to be
predicted from a given set of predictors (independent variables). Using this set of variables, we generate
a function that maps input data to desired outputs. The training process continues until the model
achieves the desired level of accuracy on the training data. Examples of Supervised Learning:
Regression, Decision Tree, Random Forest, KNN, Logistic Regression, etc.

Unsupervised Learning Algorithms

How it works: In this algorithm, we do not have any target or outcome variable to predict or estimate
(the data is unlabeled). It is used for recommendation systems or for clustering a population into
different groups. Clustering algorithms are widely used for segmenting customers into different groups
for specific interventions. Examples of Unsupervised Learning: Apriori algorithm, K-means clustering.

Reinforcement Learning Algorithms

How it works: Using this algorithm, the machine is trained to make specific decisions. The machine is
exposed to an environment where it trains itself continually using trial and error. This machine learns
from past experience and tries to capture the best possible knowledge to make accurate business
decisions. Example of Reinforcement Learning: Markov Decision Process

1. Linear Regression

It is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on a continuous
variable(s). Here, we establish the relationship between independent and dependent variables by fitting
the best line. This best-fit line is known as the regression line and is represented by a linear equation Y=
a*X + b.

The best way to understand linear regression is to relive this experience of childhood. Let us say you ask
a child in fifth grade to arrange people in his class by increasing the order of weight without asking them
their weights! What do you think the child will do? He/she would likely look (visually analyze) at the
height and build of people and arrange them using a combination of these visible parameters. This is
linear regression in real life! The child has actually figured out that height and build would be correlated
to weight by a relationship, which looks like the equation above.

In this equation:

Y – Dependent Variable

a – Slope

X – Independent variable

b – Intercept

These coefficients a and b are derived by minimizing the sum of the squared vertical distances
between the data points and the regression line.
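
As a rough sketch of how a and b can be obtained, the closed-form least-squares solution for a single
predictor can be written in plain Python. The data points are made up for illustration (they lie exactly
on Y = 2*X + 1) and the function name fit_line is an assumption, not something from the article.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b that minimize the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # The best-fit line always passes through the point of means
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data lying exactly on Y = 2*X + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print("slope:", a, "intercept:", b)
```

With real, noisy data the recovered a and b will not match any "true" line exactly; they are simply the
values that make the squared error as small as possible.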

Look at the below example. Here we have identified the best-fit line having linear
equation y=0.2811x+13.9. Now using this equation, we can find the weight, knowing the height of a
person.

Linear Regression is mainly of two types: Simple Linear Regression and Multiple Linear Regression.
Simple Linear Regression is characterized by one independent variable, and Multiple Linear
Regression (as the name suggests) is characterized by multiple (more than 1) independent variables.
The best-fit relationship need not be a straight line: you can also fit a curve, which is known as
polynomial or curvilinear regression.

R Code:

#Load Train and Test datasets

#Identify feature and response variable(s); values must be numeric (vectors or data frame columns in R)

x_train <- input_variables_values_training_datasets

y_train <- target_variables_values_training_datasets

x_test <- input_variables_values_test_datasets

x <- cbind(x_train,y_train)

# Train the model using the training sets and check score

linear <- lm(y_train ~ ., data = x)

summary(linear)

#Predict Output

predicted= predict(linear,x_test)

2. Logistic Regression

Don’t get confused by its name! It is a classification algorithm, not a regression algorithm. It is used to
estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent
variable(s). In simple words, it predicts the probability of the occurrence of an event by fitting data to
a logistic function. Hence, it is also known as logit regression. Since it predicts a probability, its output
values lie between 0 and 1 (as expected).

Again, let us try and understand this through a simple example.

Let’s say your friend gives you a puzzle to solve. There are only 2 outcome scenarios – either you solve it,
or you don’t. Now imagine that you are being given a wide range of puzzles/quizzes in an attempt to
understand which subjects you are good at. The outcome of this study would be something like this – if
you are given a trigonometry-based tenth-grade problem, you are 70% likely to solve it. On the other
hand, if it is a fifth-grade history question, the probability of getting an answer is only 30%. This is what
Logistic Regression provides you.

Coming to the math, the log odds of the outcome are modeled as a linear combination of the predictor
variables.

odds= p/ (1-p) = probability of event occurrence / probability of not event occurrence

ln(odds) = ln(p/(1-p))

logit(p) = ln(p/(1-p)) = b0+b1X1+b2X2+b3X3....+bkXk


Above, p is the probability of the presence of the characteristic of interest. It chooses parameters that
maximize the likelihood of observing the sample values rather than that minimize the sum of squared
errors (like in ordinary regression).
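
To make the formulas above concrete, here is a minimal sketch of the logit and its inverse (the
logistic, or sigmoid, function) in plain Python. The coefficient values passed to predict_probability
are hypothetical; in practice they would come from maximum-likelihood fitting.

```python
import math


def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))


def sigmoid(z):
    """Inverse of logit: maps a linear score z back to a probability."""
    return 1 / (1 + math.exp(-z))


def predict_probability(x, coefficients, intercept):
    """p such that logit(p) = b0 + b1*X1 + ... + bk*Xk."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return sigmoid(z)


# Hypothetical coefficients b1 = 0.5, b2 = -1.0 and intercept b0 = 0.0
p = predict_probability([2.0, 1.0], coefficients=[0.5, -1.0], intercept=0.0)
print("predicted probability:", p)
print("round trip:", sigmoid(logit(0.7)))
```

Because sigmoid undoes logit, a model that is linear in the log odds always yields a probability
strictly between 0 and 1.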

Now, you may ask, why take a log? For the sake of simplicity, let’s just say that this is one of the best
mathematical ways to replicate a step function. I could go into more detail, but that would defeat the
purpose of this article.


R Code:

x <- cbind(x_train,y_train)

# Train the model using the training sets and check score

logistic <- glm(y_train ~ ., data = x,family='binomial')

summary(logistic)

#Predict Output

predicted= predict(logistic,x_test,type="response") # type="response" returns probabilities instead of log odds

3. Decision Tree
This is one of my favorite algorithms, and I use it quite frequently. It is a type of supervised learning
algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and
continuous dependent variables. In this algorithm, we split the population into two or more
homogeneous sets. This is done based on the most significant attributes/ independent variables to make
as distinct groups as possible. For more details, you can read Decision Tree Simplified.

Source: statsexchange

In the image above, you can see that population is classified into four different groups based on multiple
attributes to identify ‘if they will play or not’. To split the population into different heterogeneous groups,
it uses various techniques like Gini, Information Gain, Chi-square, and entropy.
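
As an illustration of two of the measures named above, the sketch below computes Gini impurity,
entropy, and the information gain of a split in plain Python. The function names and the tiny
'yes'/'no' label lists are assumptions chosen for the example.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 means a pure group)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits (0 means a pure group)."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]
split = [["yes", "yes"], ["no", "no"]]   # a split that separates the classes perfectly
print("gini:", gini(parent))
print("entropy:", entropy(parent))
print("information gain:", information_gain(parent, split))
```

A tree-growing procedure would evaluate candidate splits with a measure like these and keep the one
that leaves the purest child groups.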

The best way to understand how the decision tree works, is to play Jezzball – a classic game from
Microsoft (image below). Essentially, you have a room with moving walls and you need to create walls
such that the maximum area gets cleared off without the balls.
So, every time you split the room with a wall, you are trying to create 2 different populations within the
same room. Decision trees work in a very similar fashion by dividing a population into as different groups
as possible.

R Code:

library(rpart)

x <- cbind(x_train,y_train)

# grow tree

fit <- rpart(y_train ~ ., data = x,method="class")

summary(fit)

#Predict Output

predicted= predict(fit,x_test,type="class") # type="class" returns predicted labels rather than class probabilities

4. SVM (Support Vector Machine)

It is a classification method. In the SVM algorithm, we plot each data item as a point in n-dimensional
space (where n is the number of features you have), with the value of each feature being the value of a
particular coordinate.

For example, if we only had two features like the Height and Hair length of an individual, we’d first plot
these two variables in two-dimensional space where each point has two coordinates. (The points that
end up closest to the separating line are known as Support Vectors.)

Now, we will find a line that splits the data between the two differently classified groups of data.
This will be the line for which the distance to the closest point in each of the two groups is as large
as possible. If there are more variables, a hyperplane is used to separate the classes.

In the example shown above, the line which splits the data into two differently classified groups is
the black line, since the two closest points are the farthest away from it. This line is our classifier.
A new data point is then assigned to a class according to which side of the line it lands on.
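
The idea of preferring the line with the largest gap can be sketched as follows. This is not an SVM
solver: it only compares three hand-picked candidate lines (all hypothetical) on a made-up data set
and keeps the one whose closest point is farthest away, then classifies new points by side.

```python
import math


def signed_distance(point, w, b):
    """Signed distance from a point to the line w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm


def margin(points, w, b):
    """Distance from the line to the closest data point."""
    return min(abs(signed_distance(p, w, b)) for p in points)


def classify(point, w, b):
    """Class +1 on one side of the line, -1 on the other."""
    return 1 if signed_distance(point, w, b) >= 0 else -1


# Made-up data: the first two points belong to class -1, the last two to class +1
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 3.0), (5.0, 3.0)]

# Hypothetical candidate separating lines, each given as (w, b)
line_a = ((1.0, 0.0), -3.0)   # x = 3
line_b = ((1.0, 0.0), -2.5)   # x = 2.5
line_c = ((1.0, 1.0), -5.0)   # x + y = 5
candidates = [line_a, line_b, line_c]

best = max(candidates, key=lambda line: margin(points, *line))
print("best line:", best, "margin:", margin(points, *best))
print("new point (5, 4) ->", classify((5.0, 4.0), *best))
```

A real SVM searches over all possible lines (or hyperplanes) rather than a short list, but the
selection criterion is the same.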

Think of this algorithm as playing Jezz Ball in n-dimensional space. The tweaks in the game are:

 You can draw lines/planes at any angle (rather than just horizontal or vertical as in the classic game).
 The objective of the game is to segregate balls of different colors in different rooms.
 And the balls are not moving.

R Code:

library(e1071)

x <- cbind(x_train,y_train)

# Fitting model

fit <-svm(y_train ~ ., data = x)

summary(fit)

#Predict Output

predicted= predict(fit,x_test)

5. Naive Bayes

It is a classification technique based on Bayes’ theorem with an assumption of independence between
predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a
class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an
apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or
upon the existence of the other features, a naive Bayes classifier would consider all of these properties
to independently contribute to the probability that this fruit is an apple.

The Naive Bayesian model is easy to build and particularly useful for very large data sets. Along with
simplicity, Naive Bayes can sometimes outperform even highly sophisticated classification methods.

Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c), P(x), and P(x|c). Look
at the equation below:

P(c|x) = P(x|c) * P(c) / P(x)

Here,

P(c|x) is the posterior probability of class (target) given predictor (attribute).

P(c) is the prior probability of the class.

P(x|c) is the likelihood which is the probability of the predictor given the class.
P(x) is the prior probability of the predictor.

Example: Let’s understand it using an example. Below is a training data set of weather and the
corresponding target variable, ‘Play.’ Now, we need to classify whether players will play or not based on
weather conditions. Let’s follow the below steps to perform it.

Step 1: Convert the data set to a frequency table.

Step 2: Create a Likelihood table by finding the probabilities like Overcast probability = 0.29 and
probability of playing is 0.64.

Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability for each class. The
class with the highest posterior probability is the outcome of the prediction.

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using above discussed method, so P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P (Sunny)

Here we have P (Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes)= 9/14 = 0.64

Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is higher than P (No | Sunny) = 0.40, so the
prediction is that players will play.
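
The arithmetic above can be checked with a few lines of Python. The sketch uses exact fractions
built from the counts quoted in the text (3/9, 9/14, 5/14); the helper name posterior is an assumption.

```python
from fractions import Fraction


def posterior(likelihood, prior, evidence):
    """Bayes theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood * prior / evidence


p_sunny_given_yes = Fraction(3, 9)   # P(Sunny | Yes)
p_yes = Fraction(9, 14)              # P(Yes)
p_sunny = Fraction(5, 14)            # P(Sunny)

p_yes_given_sunny = posterior(p_sunny_given_yes, p_yes, p_sunny)
p_no_given_sunny = 1 - p_yes_given_sunny

print("P(Yes | Sunny) =", float(p_yes_given_sunny))
print("P(No  | Sunny) =", float(p_no_given_sunny))
print("prediction:", "Yes" if p_yes_given_sunny > p_no_given_sunny else "No")
```

Working with fractions avoids the small rounding differences that appear when the two-decimal
values 0.33, 0.64 and 0.36 are multiplied directly.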

Naive Bayes uses a similar method to predict the probability of different classes based on various
attributes. This algorithm is mostly used in text classification and with problems having multiple classes.


R Code:

library(e1071)

x <- cbind(x_train,y_train)

# Fitting model

fit <-naiveBayes(y_train ~ ., data = x)

summary(fit)

#Predict Output
predicted= predict(fit,x_test)

6. kNN (k- Nearest Neighbors)

It can be used for both classification and regression problems. However, it is more widely used in
classification problems in the industry. K nearest neighbors is a simple algorithm that stores all available
cases and classifies new cases by a majority vote of its k neighbors. The case is assigned to the class
most common amongst its K nearest neighbors, measured by a distance function.

These distance functions can be Euclidean, Manhattan, Minkowski, and Hamming distances. The first
three functions are used for continuous variables, and the fourth one (Hamming) for categorical
variables. If K = 1, then the case is simply assigned to the class of its nearest neighbor. At times, choosing
K turns out to be a challenge while performing kNN modeling.
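
A minimal sketch of the majority-vote rule described above, in plain Python. The training points,
labels, and function names are made up for illustration; a production implementation would normally
come from a library.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_points, train_labels, query, k=3, distance=euclidean):
    """Return the most common label among the k training points closest to query."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well-separated groups
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (2, 2), k=3))
print(knn_predict(train_points, train_labels, (6, 5), k=3, distance=manhattan))
```

Note that all the work happens at prediction time: every query is compared against every stored
training point, which is why kNN becomes slow on large data sets.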

KNN can easily be mapped to our real lives. If you want to learn about a person about whom you have
no information, you might like to find out about their close friends and the circles they move in, and
gain access to their information that way!

Things to consider before selecting kNN:

 KNN is computationally expensive.
 Variables should be normalized, or else higher-range variables can bias it.
 Spend more effort on the pre-processing stage (e.g., outlier and noise removal) before going for kNN.


R Code:

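library(class)

# Fitting model and predicting

# knn() has no separate training step: it takes the training set, the test set and the training labels together and returns the predicted classes directly

predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)

summary(predicted)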

7. K-Means

It is a type of unsupervised algorithm which solves the clustering problem. Its procedure follows a simple
and easy way to group a given data set into a certain number of clusters (assume k clusters). Data
points inside a cluster are homogeneous, and heterogeneous with respect to the other clusters.

Remember figuring out shapes from ink blots? k means is somewhat similar to this activity. You look at
the shape and spread to decipher how many different clusters/populations are present!

How K-means forms clusters:

 K-means picks k points, known as centroids, one for each cluster.
 Each data point forms a cluster with the closest centroid, i.e., k clusters.
 It finds the centroid of each cluster based on the existing cluster members. Here we have new
centroids.
 As we have new centroids, repeat steps 2 and 3. Find the closest distance for each data point
from the new centroids and associate it with the new k clusters. Repeat this process until
convergence occurs, i.e., the centroids do not change.
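
The steps above can be sketched in plain Python as follows. To keep the run reproducible, the
sketch takes the first k points as the starting centroids; that choice, the function name, and the toy
data are assumptions (real implementations usually pick the starting centroids at random).

```python
def kmeans(points, k, max_iter=100):
    """Return (centroids, clusters) after iterating until the centroids stop changing."""
    centroids = [points[i] for i in range(k)]   # step 1: simple deterministic start
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Step 3: recompute each centroid as the mean of its cluster members
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)       # keep an empty cluster's centroid in place
        # Step 4: stop when nothing moved, otherwise repeat
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
print("centroids:", centroids)
print("clusters:", clusters)
```

Even though both starting centroids sit in the same corner of the toy data, a couple of
assign-and-update rounds are enough to pull one of them over to the second group.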

How to determine the value of K:

In K-means, we have clusters, and each cluster has its own centroid. The sum of the squares of the
differences between the centroid and the data points within a cluster constitutes the sum-of-squares
value for that cluster. When the sum-of-squares values for all the clusters are added, the result is the
total within-cluster sum of squares for the cluster solution.

We know that as the number of clusters increases, this value keeps on decreasing, but if you plot the
result, you may see that the sum of squared distances decreases sharply up to some value of k and then
much more slowly after that. Here, we can find the optimum number of clusters.
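
To illustrate the "sharp drop, then slow decrease" pattern, the sketch below computes the total
within-cluster sum of squares for three cluster solutions of a small one-dimensional data set. The data
and the hand-chosen groupings for k = 1, 2, 3 are assumptions made for the example; in practice each
grouping would come from running K-means.

```python
def within_cluster_ss(cluster):
    """Sum of squared differences between each point and the cluster centroid."""
    centroid = sum(cluster) / len(cluster)
    return sum((x - centroid) ** 2 for x in cluster)


def total_within_ss(clusters):
    """Total within-cluster sum of squares for one cluster solution."""
    return sum(within_cluster_ss(c) for c in clusters)


# Hand-chosen cluster solutions for the data 1, 2, 3, 10, 11, 12
solutions = {
    1: [[1, 2, 3, 10, 11, 12]],
    2: [[1, 2, 3], [10, 11, 12]],
    3: [[1, 2], [3], [10, 11, 12]],
}
totals = {k: total_within_ss(clusters) for k, clusters in solutions.items()}
for k, value in totals.items():
    print("k =", k, "total within-cluster sum of squares =", value)
```

Going from k = 1 to k = 2 removes almost all of the within-cluster spread, while k = 3 adds very
little, so the "elbow" for this toy data is at k = 2.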


R Code:

library(cluster)

fit <- kmeans(X, 3) # 3 cluster solution

8. Random Forest

Random Forest is a trademarked term for an ensemble of decision trees. In Random Forest, we’ve got
a collection of decision trees (hence the “Forest”). To classify a new object based on its attributes,
each tree gives a classification, and we say the tree “votes” for that class. The forest chooses
the classification having the most votes (over all the trees in the forest).

Each tree is planted & grown as follows:

If the number of cases in the training set is N, then a sample of N cases is taken at random but with
replacement. This sample will be the training set for growing the tree.

If there are M input variables, a number m<<M is specified such that at each node, m variables are
selected at random out of the M, and the best split on this m is used to split the node. The value of m is
held constant during the forest growth.

Each tree is grown to the largest extent possible. There is no pruning.
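
The three ingredients described above (sampling N cases with replacement, choosing m of the M
variables at a node, and majority voting) can be sketched with the standard library alone. No actual
trees are grown here; the function names, the seed, and the example votes are assumptions.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw N rows at random, with replacement, from a training set of N rows."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, m, rng):
    """Pick m of the M feature indices to consider for the split at one node."""
    return rng.sample(range(n_features), m)


def majority_vote(tree_predictions):
    """The forest's answer is the class that received the most tree votes."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)                    # fixed seed so the run is repeatable
rows = list(range(10))                    # stand-in for N = 10 training cases
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=8, m=3, rng=rng)
winner = majority_vote(["yes", "no", "yes", "yes", "no"])

print("bootstrap sample:", sample)
print("features tried at this node:", features)
print("forest prediction:", winner)
```

Because the sample is drawn with replacement, some cases typically appear more than once and
others not at all, which is what makes the individual trees differ from one another.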


R Code:

library(randomForest)

x <- cbind(x_train,y_train)

# Fitting model

fit <- randomForest(y_train ~ ., data = x,ntree=500)

summary(fit)

#Predict Output

predicted= predict(fit,x_test)

9. Dimensionality Reduction Algorithms

In the last 4-5 years, there has been an exponential increase in data capturing at every possible stage.
Corporates/ Government Agencies/ Research organizations are not only coming up with new sources,
but also, they are capturing data in great detail.

For example, E-commerce companies are capturing more details about customers like their
demographics, web crawling history, what they like or dislike, purchase history, feedback, and many
others to give them personalized attention more than your nearest grocery shopkeeper.

As data scientists, the data we are offered also consists of many features. This sounds good for building
a robust model, but there is a challenge: how would you identify the highly significant variable(s) out of
1000 or 2000? In such cases, dimensionality reduction algorithms help us, along with various other
techniques like Decision Tree, Random Forest, PCA (principal component analysis), Factor Analysis,
identification based on the correlation matrix, missing value ratio, and others.


R Code:

library(stats)

pca <- princomp(train, cor = TRUE)

train_reduced <- predict(pca,train)

test_reduced <- predict(pca,test)

10. Gradient Boosting Algorithms

Now, let’s look at the 4 most commonly used gradient boosting algorithms.

GBM

GBM is a boosting algorithm used when we deal with plenty of data and want predictions with high
predictive power. Boosting is an ensemble technique that combines the predictions of several base
estimators in order to improve robustness over a single estimator. It combines multiple
weak or average predictors to build a strong predictor. These boosting algorithms often work well in
data science competitions like Kaggle, AV Hackathon, and Crowd Analytix.
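
How weak predictors add up to a strong one can be sketched with a toy gradient-boosting loop for
regression with squared error. Each round fits a one-split "stump" to the current residuals and adds a
damped copy of it to the model. The data, the learning rate, the number of rounds, and the function
names are all assumptions; this is an illustration, not the GBM package's implementation.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regression stump on one feature (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Return a predict(x) function built from a constant plus n_rounds stumps."""
    base = sum(ys) / len(ys)                 # start from the mean prediction
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 2.0, 2.0, 5.0, 5.0]
model = gradient_boost(xs, ys)
mse_baseline = mse(ys, [sum(ys) / len(ys)] * len(ys))
mse_boosted = mse(ys, [model(x) for x in xs])
print("baseline MSE:", mse_baseline)
print("boosted MSE:", mse_boosted)
```

A single stump can only produce two distinct output values, yet the sum of many stumps follows the
three-level target closely, which is the "weak learners into a strong learner" idea in miniature.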


R Code:

library(caret)

x <- cbind(x_train,y_train)

# Fitting model

fitControl <- trainControl( method = "repeatedcv", number = 4, repeats = 4)

fit <- train(y_train ~ ., data = x, method = "gbm", trControl = fitControl,verbose = FALSE)

predicted= predict(fit,x_test,type= "prob")[,2]

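Gradient Boosting Classifier and Random Forest are two different tree-based ensemble classifiers, and
people often ask about the difference between these two algorithms. In short, Random Forest grows its
trees independently on bootstrap samples and lets them vote (bagging), whereas gradient boosting
grows trees one after another, each new tree correcting the errors of the trees before it.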

XGBoost

Another classic gradient-boosting algorithm that’s known to be the decisive choice between winning and
losing in some Kaggle competitions is XGBoost. It has very high predictive power, making it a strong
choice when accuracy matters. It offers both a linear model and a tree learning algorithm, and is
reported to be almost 10x faster than older gradient boosting implementations.

One of the most interesting things about XGBoost is that it is also called a regularized boosting
technique. This helps to reduce overfitting. It also has support for a range of languages such
as Scala, Java, R, Python, Julia, and C++.

It supports various objective functions, including regression, classification, and ranking. It supports
distributed training on many machines, including GCE, AWS, Azure, and Yarn clusters. XGBoost can also
be integrated with Spark, Flink, and other cloud dataflow systems, and has built-in cross-validation at
each iteration of the boosting process.


R Code:

require(caret)

x <- cbind(x_train,y_train)

# Fitting model
TrainControl <- trainControl( method = "repeatedcv", number = 10, repeats = 4)

model<- train(y_train ~ ., data = x, method = "xgbLinear", trControl = TrainControl,verbose = FALSE)

# OR

model<- train(y_train ~ ., data = x, method = "xgbTree", trControl = TrainControl,verbose = FALSE)

predicted <- predict(model, x_test)

Light GBM

Light GBM is a gradient-boosting framework that uses tree-based learning algorithms. It is designed to
be distributed and efficient with the following advantages:

 Faster training speed and higher efficiency
 Lower memory usage
 Better accuracy
 Parallel and GPU learning supported
 Capable of handling large-scale data

The framework is a fast and high-performance gradient-boosting one based on decision tree algorithms
used for ranking, classification, and many other machine-learning tasks. It was developed under the
Distributed Machine Learning Toolkit Project of Microsoft.

Since Light GBM is based on decision tree algorithms, it splits the tree leaf-wise with the best fit,
whereas other boosting algorithms split the tree depth-wise or level-wise rather than leaf-wise. So when
growing on the same leaf node in Light GBM, the leaf-wise algorithm can reduce more loss than the
level-wise algorithm, which often results in better accuracy than other boosting algorithms achieve.

Also, it is surprisingly fast, hence the word ‘Light.’

Python Code:

import numpy as np

import lightgbm as lgb

data = np.random.rand(500, 10) # 500 entities, each contains 10 features

label = np.random.randint(2, size=500) # binary target

train_data = lgb.Dataset(data, label=label)

# 'test.svm' is assumed to be an existing validation file in LibSVM format

test_data = train_data.create_valid('test.svm')

# the number of trees is set by num_round below, so it is not repeated in param

param = {'num_leaves':31, 'objective':'binary'}

param['metric'] = 'auc'

num_round = 10

bst = lgb.train(param, train_data, num_round, valid_sets=[test_data])

bst.save_model('model.txt')

# 7 entities, each contains 10 features

data = np.random.rand(7, 10)

ypred = bst.predict(data)

R Code:

library(RLightGBM)

data(example.binary)

#Parameters

num_iterations <- 100

config <- list(objective = "binary", metric="binary_logloss,auc", learning_rate = 0.1, num_leaves = 63,
               tree_learner = "serial", feature_fraction = 0.8, bagging_freq = 5, bagging_fraction = 0.8,
               min_data_in_leaf = 50, min_sum_hessian_in_leaf = 5.0)

#Create data handle and booster

handle.data <- lgbm.data.create(x)

lgbm.data.setField(handle.data, "label", y)

handle.booster <- lgbm.booster.create(handle.data, lapply(config, as.character))


#Train for num_iterations iterations and eval every 5 steps

lgbm.booster.train(handle.booster, num_iterations, 5)

#Predict

pred <- lgbm.booster.predict(handle.booster, x.test)

#Test accuracy

sum(y.test == (pred > 0.5)) / length(y.test)

#Save model (can be loaded again via lgbm.booster.load(filename))

lgbm.booster.save(handle.booster, filename = "/tmp/model.txt")

If you’re familiar with the Caret package in R, this is another way of implementing the LightGBM.

require(caret)

require(RLightGBM)

data(iris)

model <-caretModel.LGBM()

fit <- train(Species ~ ., data = iris, method=model, verbosity = 0)

print(fit)

y.pred <- predict(fit, iris[,1:4])

library(Matrix)

model.sparse <- caretModel.LGBM.sparse()

#Generate a sparse matrix

mat <- Matrix(as.matrix(iris[,1:4]), sparse = T)


fit <- train(data.frame(idx = 1:nrow(iris)), iris$Species, method = model.sparse, matrix = mat, verbosity =
0)

print(fit)

CatBoost

CatBoost is an open-source machine learning algorithm from Yandex. It can easily integrate with
deep learning frameworks like Google’s TensorFlow and Apple’s Core ML. The best part about CatBoost
is that it does not require extensive training data like some other ML models and can work on a variety
of data formats, without compromising on robustness.

CatBoost can automatically deal with categorical variables without showing the type conversion error,
which helps you to focus on tuning your model better rather than sorting out trivial errors. Make sure
you handle missing data well before you proceed with the implementation.

Python Code:

import pandas as pd

import numpy as np

from catboost import CatBoostRegressor

#Read training and testing files

train = pd.read_csv("train.csv")

test = pd.read_csv("test.csv")

#Imputing missing values for both train and test

train.fillna(-999, inplace=True)

test.fillna(-999,inplace=True)

#Creating a training set for modeling and validation set to check model performance

X = train.drop(['Item_Outlet_Sales'], axis=1)

y = train.Item_Outlet_Sales
from sklearn.model_selection import train_test_split

X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.7, random_state=1234)

categorical_features_indices = np.where(X.dtypes != float)[0] # np.float was removed from recent NumPy versions; the builtin float is equivalent

#importing library and building model

model=CatBoostRegressor(iterations=50, depth=3, learning_rate=0.1, loss_function='RMSE')

model.fit(X_train, y_train,cat_features=categorical_features_indices,eval_set=(X_validation,
y_validation),plot=True)

submission = pd.DataFrame()

submission['Item_Identifier'] = test['Item_Identifier']

submission['Outlet_Identifier'] = test['Outlet_Identifier']

submission['Item_Outlet_Sales'] = model.predict(test)

R Code:

set.seed(1)

require(titanic)

require(caret)

require(catboost)

tt <- titanic::titanic_train[complete.cases(titanic::titanic_train),]

data <- as.data.frame(as.matrix(tt), stringsAsFactors = TRUE)


drop_columns = c("PassengerId", "Survived", "Name", "Ticket", "Cabin")

x <- data[,!(names(data) %in% drop_columns)]

y <- data[,c("Survived")]

fit_control <- trainControl(method = "cv", number = 4,classProbs = TRUE)

grid <- expand.grid(depth = c(4, 6, 8),learning_rate = 0.1,iterations = 100, l2_leaf_reg = 1e-3, rsm = 0.95, border_count = 64)

report <- train(x, as.factor(make.names(y)),method = catboost.caret,verbose = TRUE, preProc = NULL,tuneGrid = grid, trControl = fit_control)

print(report)

importance <- varImp(report, scale = FALSE)

print(importance)

End Note

By now, I am sure you would have an idea of commonly used machine learning algorithms. My sole
intention behind writing this article and providing the codes in R and Python is to get you started right
away. If you are keen to master machine learning algorithms, start right away. Take up problems, develop
a physical understanding of the process, apply these codes, and watch the fun!

Key Takeaways

 We are now familiar with some of the most common ML algorithms used in the industry.
 We’ve covered the advantages and disadvantages of various ML algorithms.
 We’ve also learned the basic implementation details in R and Python languages.

CLASSIFICATION ALGORITHM

What is the Classification Algorithm?

The Classification algorithm is a Supervised Learning technique that uses training data to determine the
category of new observations. A program in Classification learns from a given dataset or observations
and then classifies new observations into one of several classes or groups. For example, Yes or No, 0 or 1,
Spam or No Spam, cat or dog, and so on. Classes are also known as targets/labels or categories.

Unlike regression, the output variable of Classification is a category, not a value, such as “Green or Blue”,
“fruit or animal”, etc. Since the Classification algorithm is a Supervised learning technique, it takes
labeled input data, which means it contains input with the corresponding output.

In a Classification algorithm, the model tries to predict the correct label of a given input data. The
model is fully trained using the training data, and then it is evaluated on test data
before being used to perform prediction on new unseen data.

In a classification algorithm, a discrete output function (y) is mapped to the input variable (x).

y=f(x), where y = categorical output

The best example of an ML classification algorithm is Email Spam Detector.

The main goal of the Classification algorithm is to identify the category of a given dataset, and these
algorithms are mainly used to predict the output for the categorical data.

Classification algorithms can be better understood using the below diagram. In the below diagram, there
are two classes, class A and Class B. These classes have features that are similar to each other and
dissimilar to other classes.

For instance, an algorithm can learn to predict whether a given email is spam or ham (no spam).

Before diving into the classification concept, we will first understand the difference between the two
types of learners in classification: lazy and eager learners. Then we will clarify the misconception
between classification and regression.

Lazy Learners Vs. Eager Learners

There are two types of learners in machine learning classification: lazy and eager learners.

Eager learners are machine learning algorithms that first build a model from the training dataset before
making any prediction on future datasets. They spend more time during the training process because of
their eagerness to have a better generalization during the training from learning the weights, but they
require less time to make predictions.

Most machine learning algorithms are eager learners, and below are some examples:

 Logistic Regression.
 Support Vector Machine.
 Decision Trees.
 Artificial Neural Networks.

Lazy learners or instance-based learners, on the other hand, do not create any model immediately from
the training data, and this is where the lazy aspect comes from. They just memorize the training data,
and each time there is a need to make a prediction, they search for the nearest neighbor from the whole
training data, which makes them very slow during prediction. Some examples of this kind are:

 K-Nearest Neighbor.
 Case-based reasoning.

However, some data structures, such as BallTrees and KDTrees, can be used to improve the prediction
latency.

Different Types of Classification Tasks in Machine Learning

There are four main classification tasks in Machine learning: binary, multi-class, multi-label, and
imbalanced classifications.

Binary Classification

In a binary classification task, the goal is to classify the input data into two mutually exclusive categories.
The training data in such a situation is labeled in a binary format: true and false; positive and negative; 0
and 1; spam and not spam, etc. depending on the problem being tackled. For instance, we might want to
detect whether a given image is a truck or a boat.

Logistic Regression and Support Vector Machines algorithms are natively designed for binary
classifications. However, other algorithms such as K-Nearest Neighbors and Decision Trees can also be
used for binary classification.

Multi-Class Classification

Multi-class classification, on the other hand, has more than two mutually exclusive class labels, where
the goal is to predict which class a given input example belongs to. In the following case, the model
correctly classified the image to be a plane.

Most of the binary classification algorithms can be also used for multi-class classification. These
algorithms include but are not limited to:

 Random Forest
 Naive Bayes
 K-Nearest Neighbors
 Gradient Boosting
 SVM
 Logistic Regression.

Multi-Label Classification

In multi-label classification tasks, we try to predict 0 or more classes for each input example. In this case,
there is no mutual exclusion because the input example can have more than one label.

Such a scenario can be observed in different domains, such as auto-tagging in Natural Language
Processing, where a given text can contain multiple topics. Similarly, in computer vision, an image can
contain multiple objects, as illustrated below: the model predicted that the image contains a plane, a
boat, a truck, and a dog.

It is not possible to use multi-class or binary classification models directly to perform multi-label
classification. However, most algorithms used for those standard classification tasks have their
specialized versions for multi-label classification. We can cite:

 Multi-label Decision Trees
 Multi-label Gradient Boosting
 Multi-label Random Forests

Imbalanced Classification

For imbalanced classification, the number of examples is unevenly distributed across the classes, meaning
that we can have more of one class than the others in the training data. Let’s consider the following
3-class classification scenario where the training data contains 60% trucks, 25% planes, and 15% boats.

The imbalanced classification problem could occur in the following scenario:

 Fraudulent transaction detections in financial industries


 Rare disease diagnosis
 Customer churn analysis

Using conventional predictive models such as Decision Trees, Logistic Regression, etc. could not be
effective when dealing with an imbalanced dataset, because they might be biased toward predicting the
class with the highest number of observations, and considering those with fewer numbers as noise.
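To make the bias toward the majority class concrete, here is a minimal sketch using only the Python standard
library. The label counts are hypothetical and simply mirror the 60/25/15 split mentioned above; the
“model” is a deliberately trivial one that always predicts the most frequent class.

```python
from collections import Counter

# Hypothetical labels mirroring the 60% / 25% / 15% split described in the text.
labels = ["truck"] * 60 + ["plane"] * 25 + ["boat"] * 15

# A trivial "model" that ignores its input and always predicts the majority class.
majority_class = Counter(labels).most_common(1)[0][0]
predictions = [majority_class] * len(labels)

# Overall accuracy looks acceptable even though two classes are never predicted.
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

def recall(cls):
    """Fraction of the true examples of `cls` that were predicted as `cls`."""
    total = sum(t == cls for t in labels)
    hits = sum(p == t == cls for p, t in zip(predictions, labels))
    return hits / total

print(f"accuracy: {accuracy:.2f}")
for cls in ("truck", "plane", "boat"):
    print(f"recall[{cls}]: {recall(cls):.2f}")
```

The accuracy equals the share of the majority class, while the recall of the two smaller classes is zero,
which is why accuracy alone is a poor guide on imbalanced data.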

Types of ML Classification Algorithms:

Classification algorithms can be broadly divided into two categories:

 Linear Models
     Logistic Regression
     Support Vector Machines
 Non-linear Models
     K-Nearest Neighbors
     Kernel SVM
     Naïve Bayes
     Decision Tree Classification
     Random Forest Classification

1. LOGISTIC REGRESSION

Logistic regression is similar to linear regression, but it is used when the dependent variable is
categorical rather than numeric (e.g., a “yes/no” response). Despite being called regression, it performs
classification: it models the probability of each class and assigns the example to one of the classes.

Logistic regression is used to predict a binary output, as stated above. For example, if a
credit card company builds a model to decide whether or not to issue a credit card to a customer, it will
model whether the customer is going to “default” or “not default” on their card.

Linear Regression

First, a linear combination of the input variables is computed, as in linear regression, to get the model’s
raw score. The threshold on the resulting probability for assigning a class is assumed to be 0.5.

Logistic Sigmoid Function

The logistic function is applied to the output of the linear model to get the probability of the example
belonging to the positive class:

sigma(z) = 1 / (1 + e^(-z))

Equivalently, z is the log of the odds, that is, the log of the ratio between the probability of the event
occurring and the probability of it not occurring: z = log(p / (1 - p)). In the end, the example is
assigned to the class with the higher probability.

Here, z is a linear combination of features and their associated weights, plus a bias term:

z = w^Tx + b
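As a small illustration, the sketch below computes z = w^Tx + b for one example and passes it through the
logistic function 1 / (1 + e^(-z)). The function names (sigmoid, predict_proba) and the weights and inputs
are made up for the example; they are not taken from any particular library.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Probability of the positive class for one example, with z = w . x + b."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)

# Illustrative (made-up) weights, bias, and input features.
w, b = [2.0, -1.0], 0.5
x = [1.0, 3.0]

p = predict_proba(x, w, b)        # here z = 2.0 - 3.0 + 0.5 = -0.5
label = 1 if p >= 0.5 else 0      # apply the 0.5 threshold
print(p, label)
```

Note that math.exp can overflow for very large negative z; a production implementation would guard against
that, which this sketch does not.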

Intuition behind Logistic Regression Cost Function

Because gradient descent is the algorithm used to train the model, the first step is to define a cost
function (loss function). This function should tell us how much the predictions of our model deviate from
the actual outcomes.

J(theta) = -(1/m) * sum_{i=1..m} [ y^(i) * log(h_theta(x^(i))) + (1 - y^(i)) * log(1 - h_theta(x^(i))) ]

In the equation of J(theta), y represents the actual target value, m is the number of training examples,
and h_theta is our model’s output. h_theta will be explained below. For now, let us assume that our model
already has a way to make predictions and that h_theta is defined.
These predictions lie between 0 and 1, so the output can be read as a probability.

Part 1: When Y = 1

When the actual target is 1, we want our model’s prediction to be as close to 1 as possible. So our cost
function should increase the penalty as the prediction moves away from 1 and towards 0, and decrease it
as the prediction comes nearer to 1. A function with exactly this behaviour is -log(h_theta(x)).

Consider the y-axis to be the cost and the x-axis to be the model’s prediction. Note that the prediction
never exceeds 1 or goes below 0, so we do not need to worry about values outside that range.

When the model’s prediction is close to 1, the penalty is close to 0. As it moves away from 1 and towards
0, the penalty increases. So, this function can be used when the actual target is 1.

Part 2: When Y = 0

Similarly, when Y is equal to 0, we want our model’s predictions to be as close to 0 as possible. This
means a lower penalty for values closer to 0 and a higher penalty for values farther away from 0 and
towards 1. The appropriate function for this is -log(1 - h_theta(x)), which is the second part of the cost
function.

Consider the x-axis to be the value our model predicts and the y-axis to be the penalty that the model
gets, assuming that the original target is 0.

The two parts of the cost function are now prepared. To ensure that only the first part is active when
y = 1 and only the second part is active when y = 0, we multiply the first part by y and the second part
by (1 - y) in the cost function.
In the end, we get the cost function mentioned in fig 2.1, highlighted in blue.
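The two-part cost described above can be sketched in a few lines of Python. The function name log_loss and
the small clipping constant eps are choices made for this illustration (eps only keeps log() away from
zero); the formula itself is the usual binary cross-entropy, averaged over the examples.

```python
import math

def log_loss(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy over a list of examples.

    y_true holds the actual targets (0 or 1); y_prob holds the predicted
    probabilities h_theta(x) of the positive class.
    """
    total = 0.0
    for y, h in zip(y_true, y_prob):
        h = min(max(h, eps), 1.0 - eps)   # avoid log(0)
        # The y term is active when y == 1, the (1 - y) term when y == 0.
        total += -(y * math.log(h) + (1 - y) * math.log(1.0 - h))
    return total / len(y_true)

print(log_loss([1], [0.9]))   # target 1, confident and correct: small penalty
print(log_loss([1], [0.1]))   # target 1, confident and wrong: large penalty
print(log_loss([0], [0.1]))   # target 0, confident and correct: small penalty
```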

Gradient Descent and Cost Function Derivatives

Now that we have defined a cost function, the aim is to find the optimal w and b that minimise
this cost function for our data set. This is where gradient descent comes in. By doing this, the model
learns the parameters that reduce its penalty, thus making more accurate predictions.
We would like to find how the cost changes with respect to w and b, so that we can change the original w
and b gradually to reach the optimal parameters.
The derivation of the gradient of the logistic regression cost function is shown in the figures below. It
yields:

dJ/dw = (1/m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i)) * x^(i)
dJ/db = (1/m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i))

After finding the gradients, we subtract them, scaled by a learning rate alpha, from the original w and b:
w := w - alpha * dJ/dw and b := b - alpha * dJ/db. We subtract so that the parameters move in the
direction opposite to the slope, which makes sure the cost is decreasing.
The cost function tells us how much our model deviates from the most ideal model that we
can create. So, making sure that the parameters are optimized in a way that reduces this cost function will
ensure that we get a good classifier, assuming that the points are linearly separable and some other
minor factors.
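The following is a minimal sketch of batch gradient descent for logistic regression, written from scratch
with the standard library. The data set, the learning rate of 0.1, and the 2000 iterations are arbitrary
choices for illustration, and the helper names (predict, cost, gradient_step) are hypothetical. The
gradients used are the standard ones for the cross-entropy cost: the mean of (prediction - target) * x for
w and the mean of (prediction - target) for b.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    """Predicted probability of class 1 for one example."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def cost(X, y, w, b):
    """Average cross-entropy cost over the data set."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = predict(x_i, w, b)
        total -= y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return total / len(X)

def gradient_step(X, y, w, b, lr):
    """One gradient descent update of w and b."""
    m = len(X)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x_i, y_i in zip(X, y):
        error = predict(x_i, w, b) - y_i
        for j, x_ij in enumerate(x_i):
            grad_w[j] += error * x_ij / m
        grad_b += error / m
    # Move against the gradient so that the cost goes down.
    new_w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
    return new_w, b - lr * grad_b

# Tiny made-up data set with one feature: small values are class 0, large are class 1.
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]

w, b = [0.0], 0.0
history = []
for _ in range(2000):
    history.append(cost(X, y, w, b))
    w, b = gradient_step(X, y, w, b, lr=0.1)

print(f"cost: {history[0]:.3f} -> {history[-1]:.3f}")
print([1 if predict(x_i, w, b) >= 0.5 else 0 for x_i in X])
```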

2. K-NEAREST NEIGHBORS (K-NN)

K-NN is one of the simplest classification algorithms. Given training data points that are already labeled
with their classes, it predicts the class of a new sample point from the classes of the points nearest to
it. K-NN is a non-parametric, lazy learning algorithm. It classifies new cases based on a similarity
measure (i.e., a distance function).

Mathematical Function:

In KNN, the algorithm predicts the class or value of a data point by considering the K-nearest data points
in the training dataset. The basic idea is to find the majority class (for classification) or compute the
average (for regression) of the K-nearest neighbors.

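For classification, the prediction is typically done by a majority vote:

y_hat = mode{ y_i : x_i in N_K(x) }

where N_K(x) is the set of the K training points closest to the query point x.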
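A brute-force version of this majority vote can be sketched as follows. It assumes Euclidean distance
(math.dist, available from Python 3.8) and uses a made-up two-class data set; the function name
knn_predict is hypothetical.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest neighbors."""
    # Compute the distance from x_new to every training point, then sort by distance.
    distances = sorted(
        (math.dist(x, x_new), label) for x, label in zip(X_train, y_train)
    )
    # Count the class labels of the k closest points.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up training data: two well-separated groups.
X_train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
y_train = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(X_train, y_train, (2.0, 2.0)))
print(knn_predict(X_train, y_train, (6.0, 6.5)))
```

There is no training step: all the work happens at prediction time, which is why K-NN is called a lazy
learner. An odd k is a common choice for two classes because it avoids tied votes.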

Optimization:

KNN doesn’t involve optimization of parameters like other machine learning algorithms (e.g., linear
regression or neural networks). Instead, it stores the entire training dataset in memory and performs
predictions based on the similarity between data points. The main computational cost in KNN is the
search for the K-nearest neighbors when making predictions. This process can be optimized using data
structures like KD-trees or Ball trees for efficient nearest neighbor search

Cost Function:

Since KNN doesn’t have model parameters to optimize and it doesn’t involve a cost function during
training, it doesn’t have a cost function in the same sense that algorithms like logistic regression or
neural networks do. KNN is a non-parametric algorithm, meaning it doesn’t make any underlying
assumptions about the data distribution.

The “cost” or performance evaluation in KNN is typically done using metrics such as accuracy, F1-score,
mean squared error (for regression), or other suitable evaluation metrics for the specific task. These
metrics are used to assess how well the KNN algorithm is performing on the given dataset during testing
or validation.

SUPPORT VECTOR MACHINE (SVM)

Support Vector Machines (SVMs) are a powerful class of supervised machine learning algorithms used for
both classification and regression tasks. They are based on the concept of decision planes that define
decision boundaries. A decision plane (hyperplane) is one that separates sets of objects having different
class memberships. SVMs aim to find the hyperplane that best separates the data into different classes
while maximizing the margin between the classes. The following is an overview of the theory, mathematical
concepts, optimization, cost function, kernel functions, parameter tuning, and accuracy in SVMs.

Theory and Mathematical Concepts:

Linear Separability:

SVMs are based on the concept of linear separability, which means they work best when the data can be
cleanly separated by a hyperplane in the feature space.

Hyperplane:

In a binary classification problem, a hyperplane is a decision boundary that separates the data into two
classes. The equation of a hyperplane in a feature space is given by:

w^Tx + b = 0

Here, w is the weight vector, x is the feature vector, and b is the bias term.

Margin:

The margin is the distance between the hyperplane and the nearest data point from either class. SVM
aims to maximize this margin.

Support Vectors:

Support vectors are the data points that lie closest to the hyperplane and are used in determining the
margin and decision boundary.

Optimization:

The primary goal of SVM is to find the parameters w and b that define the hyperplane while maximizing
the margin. This is done by solving a constrained optimization problem. With the constraints below, the
distance from the hyperplane to the nearest data point is 1/||w||, so maximizing the margin is equivalent
to minimizing the norm of the weight vector (||w||) while satisfying the following constraints for each
data point:

 For positive class data points: w^Tx + b ≥ 1
 For negative class data points: w^Tx + b ≤ -1

The optimization problem can be formulated as:

Minimize (1/2) * ||w||^2

Subject to y^(i) * (w^Tx^(i) + b) ≥ 1 for every i

Here, y^(i) is the class label (+1 or -1) of the i-th data point.

Cost Function:

The cost function in SVM is typically referred to as the hinge loss:

L(y, f(x)) = max(0, 1 - y * f(x))

Where:

L is the hinge loss.

y is the true class label (either +1 or -1).

f(x) is the decision function, which is w^Tx + b in SVM.

The hinge loss encourages the correct classification with a margin of at least 1.

The objective in SVM is to minimize this hinge loss while regularizing the norm of the weight vector w.
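The hinge loss and the regularized objective can be written out directly. The sketch below uses the common
soft-margin form, 0.5 * ||w||^2 + C * (sum of hinge losses); this exact weighting is an assumption of the
example, and the function names are hypothetical.

```python
def decision_function(x, w, b):
    """f(x) = w . x + b"""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def hinge_loss(y, fx):
    """Zero when y * f(x) >= 1 (correct side, outside the margin), linear otherwise."""
    return max(0.0, 1.0 - y * fx)

def svm_objective(X, y, w, b, C=1.0):
    """Soft-margin objective: 0.5 * ||w||^2 + C * sum of hinge losses."""
    regularizer = 0.5 * sum(wj * wj for wj in w)
    data_term = sum(
        hinge_loss(yi, decision_function(xi, w, b)) for xi, yi in zip(X, y)
    )
    return regularizer + C * data_term

print(hinge_loss(+1, 2.0))    # correct, outside the margin
print(hinge_loss(+1, 0.5))    # correct, but inside the margin
print(hinge_loss(-1, 0.5))    # wrong side of the hyperplane
print(svm_objective([[2.0, 0.0]], [1], w=[1.0, 0.0], b=0.0))
```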

Kernel Function:

In cases where the data is not linearly separable in the original feature space, SVM can still be applied by
using a kernel function. Kernel functions allow SVM to implicitly map the data to a higher-dimensional
feature space where it might become linearly separable.

Common kernel functions include the linear kernel, polynomial kernel, radial basis function (RBF) kernel,
and sigmoid kernel.
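As an illustration, three of these kernels can be written as plain functions of two feature vectors. The
default values for degree, coef0, and gamma are arbitrary choices for this sketch, not recommended
settings.

```python
import math

def linear_kernel(a, b):
    """Plain dot product of the two vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def polynomial_kernel(a, b, degree=2, coef0=1.0):
    """(a . b + coef0) ** degree"""
    return (linear_kernel(a, b) + coef0) ** degree

def rbf_kernel(a, b, gamma=0.5):
    """exp(-gamma * ||a - b||^2): 1 for identical points, near 0 for distant ones."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)

u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v))
print(polynomial_kernel(u, v))
print(rbf_kernel(u, u), rbf_kernel(u, v))
```

Each kernel returns a similarity score that the SVM uses in place of the dot product, so the mapping to
the higher-dimensional space never has to be computed explicitly.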

Parameter Tuning and Accuracy:

SVMs have several important hyperparameters that can impact their performance, including:

C: The regularization parameter, which controls the trade-off between maximizing the margin and
minimizing the classification error on the training data.

Choice of kernel function and kernel-specific parameters.

The width of the margin, which is not set directly but follows from C and the kernel parameters.

Parameter tuning is crucial for achieving high accuracy with SVMs. This is often done through techniques
like grid search or random search to find the combination of hyperparameters that yields the highest
accuracy on a validation set or through cross-validation.

Example: SVM for Binary Classification

Problem Statement: We want to classify data points into two classes, “Blue” and “Red,” based on two
features, “X1” and “X2.”

Step 1: Generate Synthetic Data

Let’s generate some synthetic data points for this example. We’ll create two classes, “Blue” and “Red,” in
a 2D space.

import numpy as np
import matplotlib.pyplot as plt

# Create random data: 20 points with 2 features each
np.random.seed(0)
X = np.random.randn(20, 2)
# Label a point +1 if X1 + X2 > 0, otherwise -1
Y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# Visualize the data (with the "bwr_r" colormap, class +1 is drawn in blue and class -1 in red)
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap="bwr_r")
plt.xlabel("X1")
plt.ylabel("X2")
plt.title("Synthetic Data for Binary Classification")
plt.show()

The generated data consists of 20 points, with X1 and X2 as features. Positive class (1) is marked in blue,
and negative class (-1) is marked in red.

Step 2: Split the Data

Next, we’ll split the data into a training set and a testing set.

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

Step 3: Train the SVM

Now, we’ll create and train an SVM classifier on the training data.

from sklearn import svm

# Create an SVM classifier with a linear kernel
clf = svm.SVC(kernel='linear')

# Train the classifier on the training data
clf.fit(X_train, Y_train)

Step 4: Make Predictions

We can now use the trained SVM to make predictions on the test data.

Y_pred = clf.predict(X_test)

Random Forest Classification:

Random Forest Classification is a popular ensemble machine learning algorithm used for both
classification and regression tasks. It is an ensemble method based on decision trees and is known for its
high predictive accuracy, robustness, and ability to handle large datasets. Random Forests are
constructed by training multiple decision trees and aggregating their predictions to make more accurate
and stable predictions.

Here are key components and concepts related to Random Forest Classification:

Ensemble Method:

Random Forest is an ensemble learning method, which means it combines the predictions of multiple
base learners (decision trees in this case) to improve overall predictive accuracy and reduce overfitting.

Decision Trees:

Random Forests are built from individual decision trees. Each decision tree is trained on a random subset
of the data and features. This randomness helps reduce overfitting.

Bagging (Bootstrap Aggregating):

The algorithm uses a technique called bagging, where each decision tree is trained on a bootstrapped
sample (randomly selected with replacement) from the training data. This creates diversity among the
individual trees.

Random Feature Selection:

During the construction of each decision tree, a random subset of features is selected at each split point.
This decorrelates the trees and reduces the risk of overfitting.

Voting or Averaging:

In the case of classification, Random Forests typically use a majority voting scheme, where each tree’s
prediction is counted, and the class with the most votes becomes the final prediction. For regression, the
forest averages the predictions of the individual trees.
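The bagging and voting steps can be sketched without any library. In the snippet below, bootstrap_sample
draws row indices with replacement, and majority_vote combines the class predictions of several trees.
Both function names and the list of per-tree votes are made up for illustration; no actual trees are
trained here.

```python
import random
from collections import Counter

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (the bagging step)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)                 # fixed seed so the run is repeatable
indices = bootstrap_sample(10, rng)    # some rows repeat, others are left out
print(sorted(indices))

# Hypothetical predictions of five trees for a single flower.
tree_votes = ["setosa", "versicolor", "setosa", "setosa", "virginica"]
print(majority_vote(tree_votes))
```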

Optimization:

Random Forests do not involve optimization in the traditional sense because they are an ensemble
method, and each decision tree is built independently. The optimization occurs during the training of
individual decision trees, where they seek to split the data at each node in a way that maximizes
information gain (for classification) or minimizes mean squared error (for regression).

Cost Function:

There is no global cost function for Random Forests. The cost functions are used within each decision
tree to guide the splitting process. Common cost functions for decision trees include Gini impurity and
entropy for classification tasks and mean squared error for regression tasks.

Accuracy:

Random Forests are known for their high predictive accuracy. The ensemble nature of the algorithm
reduces overfitting and helps capture complex relationships in the data.

Hyperparameter Tuning:

Random Forests have hyperparameters that can be tuned to optimize model performance. Some key
hyperparameters include the number of trees in the forest (n_estimators), the maximum depth of each
tree (max_depth), the minimum number of samples required to split an internal node
(min_samples_split), and the maximum number of features to consider at each split (max_features).

Out-of-Bag (OOB) Error:

Random Forests employ a technique called out-of-bag (OOB) error estimation. During the training of
each tree, a portion of the data is not used (out-of-bag samples). These OOB samples can be used to
estimate the model’s performance without the need for a separate validation set. OOB error can be used
as a criterion for hyperparameter tuning.
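How much data ends up out-of-bag can be checked with a short simulation. When n rows are sampled with
replacement n times, the chance that a given row is never picked is (1 - 1/n)^n, which approaches 1/e
(about 37%) for large n. The sketch below compares that value with a simulated average; the row count
and the number of repetitions are arbitrary.

```python
import random

def oob_fraction(n_rows, rng):
    """Fraction of rows that never appear in one bootstrap sample."""
    in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
    return 1 - len(in_bag) / n_rows

rng = random.Random(0)
n = 1000

# Average the out-of-bag fraction over many bootstrap samples.
simulated = sum(oob_fraction(n, rng) for _ in range(200)) / 200
expected = (1 - 1 / n) ** n

print(f"simulated: {simulated:.3f}, expected: {expected:.3f}")
```

So each tree sees roughly 63% of the distinct rows, and the remaining rows act as a built-in validation
set for that tree.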

Mathematical Formulation:

While the Random Forest algorithm doesn’t involve a single global mathematical optimization function,
it does involve mathematical calculations at each step of the decision tree construction, including feature
selection and splitting, as described earlier.

The power of Random Forest comes from the aggregation of multiple decision trees, which reduces
overfitting and increases predictive accuracy. The ensemble approach combines the strength of multiple
models while mitigating their weaknesses, resulting in a more robust and optimized classifier or
regressor.

Example:

Let’s consider a simple example using the famous Iris dataset, which is a classification problem aiming to
predict the species of iris flowers based on their features (sepal length, sepal width, petal length, and
petal width). We will use scikit-learn, a popular Python library, to demonstrate Random Forest
Classification:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Random Forest Classifier with 100 trees
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier on the training data
rf_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Decision Tree Classifier:

Decision Tree Classification is a supervised machine learning algorithm used for both classification and
regression tasks. It is a non-linear model that makes decisions by recursively splitting the data into
subsets based on the values of input features. Decision trees are hierarchical structures composed of
nodes and branches, where each internal node represents a feature and a decision rule, and each leaf
node represents a class label (for classification) or a predicted value (for regression).

Here are key components and concepts related to Decision Tree Classification:

Ensemble Method of Decision Trees:

Ensemble methods involve combining the predictions of multiple decision trees to improve overall
predictive accuracy and robustness. Two common ensemble methods based on decision trees are
Random Forest and Gradient Boosting.

Optimization:

Decision trees are optimized during their construction to find the best feature and split point that result
in the most informative decision at each node. The optimization goal is typically to minimize impurity or
entropy for classification tasks or minimize mean squared error for regression tasks.

Cost Function:

For classification, common cost functions for splitting nodes include:

Gini Impurity: Measures the probability of misclassifying a randomly chosen element.

Entropy: Measures the level of impurity or disorder in the data.

For regression, the cost function is typically Mean Squared Error (MSE), which measures the variance of
target values.

Accuracy:

Decision trees aim to maximize accuracy by partitioning the data into subsets that are as homogeneous
as possible with respect to the target variable. The accuracy of a decision tree is evaluated on a
validation or test dataset and is a measure of how well the model generalizes to new, unseen data.

Hyperparameter Tuning:

Decision trees have hyperparameters that can be tuned to optimize model performance and prevent
overfitting. Common hyperparameters include:

Max Depth: The maximum depth or height of the tree.

Min Samples Split: The minimum number of samples required to split a node.

Min Samples Leaf: The minimum number of samples required to be at a leaf node.

Max Features: The maximum number of features to consider when finding the best split.

Criterion: The cost function used for splitting nodes (e.g., Gini impurity, entropy, or MSE).

Decision trees are constructed by recursively splitting the data into subsets based on feature values in a
way that optimizes a cost function. The optimization process involves mathematical calculations at each
node of the tree. Here’s an overview of the mathematical calculations and optimization steps involved in
building a decision tree:

1. Splitting Criterion and Impurity:

For classification tasks, the most common splitting criteria to optimize are Gini impurity and entropy. For
regression tasks, mean squared error (MSE) is commonly used.

a. Gini Impurity (Classification):

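The Gini impurity measures the probability of misclassifying a randomly chosen element if it were labeled
at random according to the class distribution in the node. It is calculated for a node t as:

Gini(t) = 1 - sum_{i=1..C} (p_i)^2

Where: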

C is the number of classes.

p_i is the proportion of samples in class i at node t.

The goal is to minimize the Gini impurity when splitting a node.

b. Entropy (Classification):

Entropy measures the level of impurity or disorder in the data. It is calculated for a node t as:

Entropy(t) = -sum_{i=1..C} p_i * log2(p_i)

Where:

C is the number of classes.

p_i is the proportion of samples in class i at node t.

The goal is to minimize the entropy when splitting a node.

c. Mean Squared Error (Regression):

For regression tasks, the mean squared error (MSE) is used as the cost function. It measures the variance
of target values within a node. The goal is to minimize MSE when splitting a node.
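The two classification criteria are easy to compute from the class labels in a node. The sketch below
uses the usual definitions, Gini = 1 - sum(p_i^2) and entropy = -sum(p_i * log2(p_i)); the helper name
class_proportions is made up for the example.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion p_i of each class present in the node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_proportions(labels))

pure_node = ["a", "a", "a", "a"]
mixed_node = ["a", "a", "b", "b"]

print(gini(pure_node), entropy(pure_node))     # both are 0 for a pure node
print(gini(mixed_node), entropy(mixed_node))   # both are largest for an even mix
```

Both measures are zero for a pure node and grow as the classes become more evenly mixed, which is why a
good split is one that lowers them.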

2. Optimization:

The optimization process involves finding the best feature and split point that minimizes the chosen cost
function (Gini impurity, entropy, or MSE). At each node, the algorithm considers each feature and
evaluates the cost of splitting the data based on different split points. The split that results in the lowest
impurity (for classification) or lowest MSE (for regression) is chosen.

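Mathematically, for each candidate feature F and split point S, the cost function is calculated, and the
feature-split pair with the lowest cost is selected as the optimal split:

(F*, S*) = argmin over (F, S) of [ (N_left / N) * Cost(left) + (N_right / N) * Cost(right) ]

where N_left and N_right are the numbers of samples sent to the left and right child nodes, and
N = N_left + N_right.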
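A brute-force search for the best split might look like the sketch below. It assumes Gini impurity as the
cost, weights each child by its share of the samples, and tries the midpoints between consecutive
distinct feature values as candidate split points. The function name best_split and the tiny data set are
made up for illustration.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(X, y):
    """Return (cost, feature_index, threshold) of the lowest-cost split."""
    n = len(y)
    best = None
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [yi for row, yi in zip(X, y) if row[f] <= threshold]
            right = [yi for row, yi in zip(X, y) if row[f] > threshold]
            # Weighted average of the child impurities.
            cost = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or cost < best[0]:
                best = (cost, f, threshold)
    return best

# Made-up data: the second feature separates the two classes perfectly.
X = [[2.0, 1.0], [3.0, 1.5], [1.0, 4.0], [2.5, 5.0]]
y = ["a", "a", "b", "b"]

print(best_split(X, y))
```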

3. Recursive Splitting:

Once the optimal split is found, the data is divided into child nodes, and the optimization process is
applied recursively to each child node until a stopping criterion is met. Common stopping criteria include
reaching a maximum depth, having too few samples in a node, or achieving perfect purity (all samples
belong to the same class for classification).

4. Pruning (Optional):

After constructing the decision tree, pruning can be applied to reduce the complexity of the tree and
prevent overfitting. Pruning involves removing branches of the tree that do not significantly improve
predictive accuracy.

5. Prediction:

To make a prediction for a new data point, the algorithm traverses the decision tree from the root node
down to a leaf node, following the splits based on the feature values. The leaf node’s class label (for
classification) or predicted value (for regression) is used as the final prediction.
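The traversal can be illustrated with a tree stored as nested dictionaries. The structure used here
(keys "feature", "threshold", "left", "right", and "label") is a hypothetical representation chosen for
the sketch, not the format of any particular library.

```python
# A small hand-built tree: internal nodes test one feature against a threshold,
# leaf nodes carry a class label.
tree = {
    "feature": 1, "threshold": 2.75,
    "left": {"label": "a"},
    "right": {
        "feature": 0, "threshold": 2.0,
        "left": {"label": "b"},
        "right": {"label": "c"},
    },
}

def predict(node, x):
    """Walk from the root to a leaf, going left when x[feature] <= threshold."""
    while "label" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(predict(tree, [2.0, 1.0]))
print(predict(tree, [1.0, 4.0]))
print(predict(tree, [2.5, 5.0]))
```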

Example:

Let’s consider a simple example of a Decision Tree Classification problem using Python and the Iris
dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Decision Tree Classifier (max_depth=3 limits tree growth to reduce overfitting)
dt_classifier = DecisionTreeClassifier(max_depth=3, random_state=42)

# Train the classifier on the training data
dt_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = dt_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Naïve Bayes

The naive Bayes classifier is based on Bayes’ theorem with an independence assumption between
predictors (i.e., it assumes that the presence of a feature in a class is unrelated to the presence of any
other feature). Even if these features depend on each other, or upon the existence of the other features,
all of these properties are treated as contributing to the probability independently. Thus the name naive
Bayes.

Gaussian naive Bayes is the variant used for classification when the features are continuous and assumed
to follow a Gaussian (normal) distribution within each class.

P(class|data) = P(data|class) * P(class) / P(data)

P(class|data) is the posterior probability of the class (target) given the predictor (attribute), i.e., the
probability that a data point belongs to the class given its observed features. This is the value that we
are looking to calculate.

P(class) is the prior probability of class.

P(data|class) is the likelihood, which is the probability of predictor given class.

P(data) is the prior probability of predictor or marginal likelihood.

Naive Bayes Steps

1. Calculate Prior Probability

P(class) = Number of data points in the class/Total no. of observations

P(yellow) = 10/17

P(green) = 7/17

2. Calculate Marginal Likelihood

P(data) = Number of data points similar to observation/Total no. of observations

P(?) = 4/17

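The marginal likelihood is the same for both classes, so it appears in the calculation of both posterior
probabilities.

3. Calculate Likelihood

P(data|class) = Number of similar observations in the class/Total no. of points in the class

P(?|yellow) = 1/10

P(?|green) = 3/7

4. Posterior Probability for Each Class

P(yellow|?) = P(?|yellow) * P(yellow) / P(?) = (1/10 * 10/17) / (4/17) = 1/4 = 25%

P(green|?) = P(?|green) * P(green) / P(?) = (3/7 * 7/17) / (4/17) = 3/4 = 75%

5. Classification

The point is assigned to the class with the higher posterior probability. Here the posterior probability
of green is 75%, so the point is classified as green.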
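The five steps can be reproduced with exact fractions. The sketch assumes that, of the 4 observations
similar to the new point, 1 belongs to the 10 yellow points and 3 belong to the 7 green points; this is
the split implied by the 4/17 marginal likelihood and the 75% result. The variable names are made up for
the example.

```python
from fractions import Fraction

n_total = 17
class_counts = {"yellow": 10, "green": 7}
# Assumed split of the 4 similar observations between the two classes.
similar_counts = {"yellow": 1, "green": 3}

# Step 1: prior probability of each class.
prior = {c: Fraction(n, n_total) for c, n in class_counts.items()}

# Step 2: marginal likelihood of the observation.
marginal = Fraction(sum(similar_counts.values()), n_total)

# Step 3: likelihood of the observation within each class.
likelihood = {c: Fraction(similar_counts[c], class_counts[c]) for c in class_counts}

# Step 4: posterior probability via Bayes' theorem.
posterior = {c: likelihood[c] * prior[c] / marginal for c in class_counts}

# Step 5: pick the class with the highest posterior.
prediction = max(posterior, key=posterior.get)

print(posterior)
print(prediction)
```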

Multinomial and Bernoulli naive Bayes are other variants used for calculating the probabilities. A naive
Bayes model is easy to build, with no complicated iterative parameter estimation, which makes it
particularly useful for very large datasets.

Conditional Independence Assumption: Naive Bayes assumes that features are conditionally
independent given the class label. This simplifying assumption greatly reduces the computational
complexity of the algorithm.

Theory:

The main idea behind Naive Bayes is to compute the posterior probability of each class for a given set of
features and select the class with the highest probability as the predicted class.

Optimization:

There is no explicit iterative optimization process in Naive Bayes. Its parameters (the class priors and
the per-class feature probabilities) are not learned by minimizing a loss; they are estimated directly
from the training data, for example by counting. This makes the model relatively simple and
computationally efficient.

Cost Function:

Naive Bayes doesn’t have a cost function in the traditional sense, as training doesn’t involve iterative
parameter optimization. The decision boundary is determined by the probabilistic calculations and the
class with the highest posterior probability is selected as the prediction.

Accuracy:

Accuracy is a common metric used to evaluate the performance of Naive Bayes. It measures the
proportion of correctly classified instances out of all instances in the dataset. However, the choice of
evaluation metrics may vary depending on the specific problem and class imbalance.

Example:

Here’s a simplified example of text classification using the Naive Bayes algorithm, specifically for spam
email detection. In this example, we’ll use the scikit-learn library in Python:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample dataset (emails and labels); far too small for a meaningful accuracy, illustration only
emails = ["Get a free iPhone now!", "Meeting at 3 PM today.", "Discounts on shoes.",
          "Important information inside."]
labels = ["spam", "not spam", "spam", "not spam"]

# Convert text data to numerical features (word counts) using CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)

# Create a Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()

# Train the classifier on the training data
nb_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = nb_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

GRADIENT BOOSTING CLASSIFICATION

Gradient boosting classifier is a boosting ensemble method. Boosting is a way to combine (ensemble)
weak learners, primarily to reduce prediction bias. Instead of creating a pool of independent predictors,
as in bagging, boosting produces a cascade of them, where each learner is fitted using the output of the
previous ones. Typically, in a bagging algorithm the trees are grown in parallel, each on a sample of the
original data, and their predictions are averaged. Gradient boosting, on the other hand, takes a
sequential approach instead of parallelizing the tree-building process. In gradient boosting, each
decision tree predicts the error (residual) of the trees before it, and adding its prediction reduces that
error. For squared-error loss the residual is proportional to the negative gradient of the loss, which
gives the method its name.

Working of Gradient Boosting

1. Initialize predictions with a simple decision tree

2. Calculate the residual values

3. Build another shallow decision tree that predicts the residuals based on all the independent variables

4. Update the original predictions with the new prediction multiplied by the learning rate

5. Repeat steps two through four for a certain number of iterations.

Let’s see how the maths works out for the gradient boosting algorithm. Say we have mean squared error
(MSE) as the loss, defined as:

MSE = (1/n) * sum_{i=1..n} (y_i - y_hat_i)^2

where y_i is the actual value and y_hat_i is the current prediction for the i-th example.

We want predictions such that our loss function (MSE) is minimized. By using gradient descent and
updating our predictions based on a learning rate, we can find the values where MSE is minimum:

y_hat_i := y_hat_i - alpha * d(MSE)/d(y_hat_i) = y_hat_i + alpha * (2/n) * (y_i - y_hat_i)

where alpha is the learning rate. So, we are basically updating the predictions such that the residuals
move towards 0 and the predicted values get sufficiently close to the actual values.
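The loop described above can be sketched for a one-feature regression problem. To keep the example short,
it departs from step 1 in one respect: the initial prediction is the mean of the targets rather than a
tree. Each weak learner is a decision stump (a single split) fitted to the current residuals. All names,
the data, the 20 rounds, and the learning rate of 0.3 are made up for illustration.

```python
def fit_stump(x, residuals):
    """Fit the best single-split regressor (least squares) to the residuals."""
    best = None
    values = sorted(set(x))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)

def gradient_boost(x, y, n_rounds=20, learning_rate=0.3):
    base = sum(y) / len(y)              # simplification: start from the mean
    pred = [base] * len(y)
    stumps, losses = [], [mse(y, pred)]
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]      # step 2
        stump = fit_stump(x, residuals)                       # step 3
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi)                # step 4
                for xi, pi in zip(x, pred)]
        losses.append(mse(y, pred))
    model = lambda xi: base + learning_rate * sum(s(xi) for s in stumps)
    return model, losses

# Made-up data that roughly follows y = x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

model, losses = gradient_boost(x, y)
print(f"MSE: {losses[0]:.3f} -> {losses[-1]:.3f}")
print(model(2.0), model(5.0))
```

The training loss never increases from one round to the next here, because each stump is a least-squares
fit to the residuals and the learning rate is below 1.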

Intuition behind Gradient Boosting

The logic behind gradient boosting is simple and can be understood intuitively, without using
mathematical notation. I expect that whoever is reading this post is familiar with simple linear
regression modeling.

A basic assumption of linear regression is that the sum of its residuals is 0, i.e. the residuals should be
spread randomly around zero.

Now think of these residuals as mistakes committed by our predictor model. Tree-based
models (considering decision trees as the base models for our gradient boosting here) are not based on
such assumptions. But if we think logically (not statistically) about this assumption, we might argue that
if we are able to see some pattern in the residuals around 0, we can leverage that pattern to fit a model.

So, the intuition behind gradient boosting algorithm is to repetitively leverage the patterns in residuals
and strengthen a model with weak predictions and make it better. Once we reach a stage where
residuals do not have any pattern that could be modeled, we can stop modeling residuals (otherwise it
might lead to overfitting). Algorithmically, we are minimizing our loss function, such that test loss
reaches its minima.

In summary,
• We first model data with simple models and analyze data for errors.
• These errors signify data points that are difficult to fit by a simple model.
• Then for later models, we particularly focus on those hard-to-fit data to get them right.
• In the end, we combine all the predictors by giving some weights to each predictor.

Metrics to Measure Classification Model Performance

1. CONFUSION MATRIX

A confusion matrix is a table that is often used to describe the performance of a classification model on a
set of test data for which the true values are known. It is a table with four different combinations of
predicted and actual values in the case for a binary classifier.

A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true
negative is an outcome where the model correctly predicts the negative class.

False Positive and False Negative

The terms false positive and false negative are used in determining how well the model is predicting with
respect to classification. A false positive is an outcome where the model incorrectly predicts
the positive class. And a false negative is an outcome where the model incorrectly predicts
the negative class. The main diagonal of the confusion matrix holds the correct predictions (true
positives and true negatives), so the more values on it, the better the model; the other diagonal holds
the errors (false positives and false negatives).

False Positive

False positive (type I error): rejecting a true null hypothesis.
This is an example in which the model mistakenly predicted the positive class. For example, the model
inferred that a particular email message was spam (the positive class), but that email message was
actually not spam. How serious this mistake is depends on the application: it is a warning sign that
should be rectified, but in many problems (medical screening, for example) it is less costly than a false
negative.

False Negative

False negative (type II error): failing to reject a false null hypothesis.

This is an example in which the model mistakenly predicted the negative class. For example, the model
inferred that a particular email message was not spam (the negative class), but that email message
actually was spam. In applications where missing a positive case is costly (a missed disease or a missed
fraud, for example), this is a danger sign that should be rectified early, as it is more serious than a false
positive.
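
The four outcomes described above can be counted directly from the true and predicted labels. The following is a minimal sketch with made-up labels, where 1 is assumed to be the positive class; the function name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for a binary classifier."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1   # correctly predicted positive
        elif predicted == positive:
            fp += 1   # predicted positive, actually negative (type I error)
        elif actual == positive:
            fn += 1   # predicted negative, actually positive (type II error)
        else:
            tn += 1   # correctly predicted negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Made-up labels: 1 = spam, 0 = not spam.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```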

Accuracy, Precision, Recall and F-1 Score

From the confusion matrix, we can infer accuracy, precision, recall and F-1 score.

Accuracy

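Accuracy is the fraction of predictions our model got right:

Accuracy = Number of correct predictions / Total number of predictions

In terms of the confusion matrix, accuracy can also be written as

Accuracy = (TP + TN) / (TP + TN + FP + FN)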

Accuracy alone doesn’t tell the full story when working with a class-imbalanced data set, where there is
a significant disparity between the number of positive and negative labels. Precision and recall are better
metrics for evaluating class-imbalanced problems.
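
A small example may make the class-imbalance problem concrete. The data below is hypothetical: 95 negative and 5 positive labels, scored against a "model" that always predicts the negative class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for a, p in zip(y_true, y_pred) if a == p)
    return correct / len(y_true)

# Hypothetical imbalanced data set: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100      # a model that never predicts the positive class

acc = accuracy(y_true, always_negative)
positives_found = sum(1 for a, p in zip(y_true, always_negative)
                      if a == 1 and p == 1)
print(acc, positives_found)      # 0.95 0
```

The accuracy looks high even though not a single positive example was found, which is why accuracy alone can mislead on imbalanced data.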

Precision

Out of all the examples the model predicted as positive, precision is the fraction that are actually
positive.

Precision = TP / (TP + FP)

Precision should be as high as possible.

Recall

Out of all the examples that are actually positive, recall is the fraction the model predicted correctly. It
is also called sensitivity or true positive rate (TPR).

Recall = TP / (TP + FN)

Recall should be as high as possible.
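
As a rough sketch, precision (true positives divided by all predicted positives) and recall (true positives divided by all actual positives) can be computed from confusion-matrix counts as follows. The counts are made up, and returning 0.0 when a denominator is zero is an assumed convention, not a universal rule.

```python
def precision(tp, fp):
    # Of everything predicted positive, how much really was positive?
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Of everything actually positive, how much did the model find?
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts: 3 true positives, 1 false positive, 2 false negatives.
tp, fp, fn = 3, 1, 2
p = precision(tp, fp)   # 3 / 4
r = recall(tp, fn)      # 3 / 5
print(p, r)             # 0.75 0.6
```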

F-1 Score

It is often convenient to combine precision and recall into a single metric called the F-1 score, particularly
if you need a simple way to compare two classifiers. The F-1 score is the harmonic mean of precision and
recall:

F1 = (2 * Precision * Recall) / (Precision + Recall)

The regular mean treats all values equally, while the harmonic mean gives much more weight to low
values thereby punishing the extreme values more. As a result, the classifier will only get a high F-1 score
if both recall and precision are high.
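
The difference between the two means is easy to see numerically. The sketch below uses made-up precision and recall values; the function name `f1_score` is chosen for illustration, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifier with very high precision but very low recall.
p, r = 0.9, 0.1
arithmetic = (p + r) / 2      # regular mean
harmonic = f1_score(p, r)     # F-1 score
print(round(arithmetic, 2), round(harmonic, 2))   # 0.5 0.18

# A balanced classifier keeps a high F-1 score.
balanced = f1_score(0.8, 0.8)
```

The regular mean hides the poor recall, while the F-1 score is pulled down toward the lower of the two values.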
2. RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE AND AREA UNDER THE CURVE (AUC)

The ROC curve is an important classification evaluation tool. It shows how well the model separates the
positive class from the negative class by plotting the true positive rate against the false positive rate as
the decision threshold is varied. If the classifier is outstanding, the true positive rate rises steeply while
the false positive rate stays low, and the area under the curve will be close to one. If the classifier is
similar to random guessing, the true positive rate will increase linearly with the false positive rate, giving
an area of about 0.5. The higher the AUC, the better the model.
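
One way to see where the curve comes from is to sweep the decision threshold over the model's scores and record the true positive rate and false positive rate at each step. The following is a minimal sketch, assuming labels are 0/1 with at least one example of each class, and using made-up scores; `roc_points` and `auc` are hypothetical helper names, and the area is computed with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up labels and predicted scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
area = auc(roc_points(y_true, scores))            # 8/9, about 0.889

# A classifier that ranks every positive above every negative has AUC 1.
perfect_area = auc(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
```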
3. CUMULATIVE ACCURACY PROFILE CURVE

The CAP of a model represents the cumulative number of positive outcomes along the y-axis versus the
corresponding cumulative number of classified examples along the x-axis. The CAP is distinct from
the receiver operating characteristic (ROC), which plots the true-positive rate against the false-positive
rate. The CAP curve is used less often than the ROC curve.
Consider a model that predicts whether a customer will purchase a product. If a customer is selected at
random, there is a 50 percent chance they will buy the product. The cumulative number of customers
who buy would rise linearly toward a maximum value corresponding to the total number of buyers. This
distribution is called the “random” CAP. It is the blue line in the above diagram. A perfect prediction, on
the other hand, determines exactly which customers will buy the product, such that the maximum
number of buyers is reached with the minimum number of customers selected. This produces a steep
line on the CAP curve that stays flat once the maximum is reached, which is the “perfect” CAP. It is also
called the “ideal” line and is the grey line in the figure above.
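
The three CAP lines discussed above (model, random, perfect) can be computed as in the sketch below. The customer data and scores are made up, and `cap_curve` is a hypothetical helper name: customers are ranked by predicted score, and the running count of actual buyers is recorded.

```python
def cap_curve(y_true, scores):
    """Cumulative number of buyers found as customers are selected in score order."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    captured = [0]
    for _, label in ranked:
        captured.append(captured[-1] + label)
    return captured

# Hypothetical data: 3 buyers (1) among 8 customers, with model scores.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.8, 0.3, 0.6, 0.5, 0.2, 0.1]

n = len(y_true)
total_buyers = sum(y_true)

model_cap = cap_curve(y_true, scores)                          # the model's CAP
perfect_cap = cap_curve(y_true, y_true)                        # all buyers selected first
random_cap = [k * total_buyers / n for k in range(n + 1)]      # straight line

print(model_cap)    # [0, 1, 2, 2, 3, 3, 3, 3, 3]
print(perfect_cap)  # [0, 1, 2, 3, 3, 3, 3, 3, 3]
```

A useful model's curve lies between the random line and the perfect line; the closer it is to the perfect line, the better.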

The confusion matrix for a multi-class classification problem can help you determine mistake patterns.
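
For the multi-class case, a simple way to surface mistake patterns is to count every (actual, predicted) pair and look at the off-diagonal entries. This is a minimal sketch with made-up labels; the function name `confusion_matrix` here is a local helper, not a library call.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(y_true, y_pred))

# Made-up three-class example.
y_true = ["cat", "dog", "bird", "cat", "dog", "cat", "bird"]
y_pred = ["cat", "cat", "bird", "cat", "dog", "dog", "bird"]

matrix = confusion_matrix(y_true, y_pred)

# Off-diagonal entries are the mistakes: here cats and dogs get confused
# with each other, while birds are always classified correctly.
mistakes = {pair: n for pair, n in matrix.items() if pair[0] != pair[1]}
print(mistakes)  # {('dog', 'cat'): 1, ('cat', 'dog'): 1}
```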

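For a binary classifier, the confusion matrix reduces to four cells:

                     Predicted positive     Predicted negative
Actual positive      True positive (TP)     False negative (FN)
Actual negative      False positive (FP)    True negative (TN)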

Uses of Classification Algorithms:

Classification algorithms are widely used in various fields and applications where the goal is to categorize
or classify data into predefined classes or categories. Here are some common use cases for classification
algorithms:

 Email Spam Detection: Classify emails as “spam” or “not spam” to filter out unwanted emails
from users’ inboxes.
 Sentiment Analysis: Analyze text data from social media, reviews, or customer feedback to
classify sentiment as positive, negative, or neutral.
 Medical Diagnosis: Diagnose diseases or medical conditions based on patient data, medical tests,
and symptoms.
 Credit Risk Assessment: Determine the creditworthiness of loan applicants by classifying them as
low, medium, or high-risk borrowers.
 Image Classification: Categorize images into predefined classes, such as recognizing objects in
photos or detecting anomalies in medical images.
 Natural Language Processing (NLP): Categorize documents or text data into topics, genres, or
categories for content recommendation or organizing information.
 Customer Churn Prediction: Predict whether customers are likely to churn (leave) a service or
product, such as a subscription service or a mobile app.
 Fraud Detection: Identify fraudulent transactions, activities, or behaviors in financial systems,
insurance claims, or online platforms.
 Speech Recognition: Classify spoken words or phrases into text, enabling voice assistants and
transcription services.
 Anomaly Detection: Detect anomalies or outliers in data, such as network intrusion detection or
manufacturing quality control.
 Document Classification: Automatically classify documents into categories, such as news articles,
legal documents, or research papers.
 Recommendation Systems: Recommend products, movies, music, or content to users based on
their preferences, behaviors, or historical data.
 Species Identification: Identify species of plants or animals based on observations, images, or
genetic data.
 Quality Control: Inspect and classify manufactured products as defective or non-defective based
on quality control data.
 Credit Card Transaction Fraud Detection: Detect fraudulent credit card transactions by classifying
them as legitimate or suspicious.
 Intrusion Detection in Cybersecurity: Monitor network traffic and classify it as normal or
potentially malicious, identifying cyber threats and attacks.
 Employee Attrition Prediction: Predict whether employees are likely to leave a company based
on historical HR data.
 Customer Segmentation: Segment customers into groups based on their behavior,
demographics, or purchasing habits for targeted marketing campaigns.
 Handwriting Recognition: Recognize handwritten text or characters and classify them into
alphanumeric characters.
 Fault Detection in Manufacturing: Detect and classify faults or defects in manufacturing
processes or products, improving product quality.

These are just a few examples of the many applications of classification algorithms across various
domains. Classification plays a fundamental role in machine learning and data analysis, enabling
automated decision-making and pattern recognition in diverse fields.

Logistic Regression Analysis

Regressor Algorithm VS Classifier Algorithm
