Unit 4

Download as pdf or txt
Download as pdf or txt
You are on page 1of 40

Clustering

Introduction to Clustering: It is basically a type of unsupervised learning method. An unsupervised


learning method is a method in which we draw references from datasets consisting of input data
without labeled responses. Generally, it is used as a process to find meaningful structure, explanatory
underlying processes, generative features, and groupings inherent in a set of examples.

Clustering is the task of dividing the population or data points into a number of groups such that data
points in the same groups are more similar to other data points in the same group and dissimilar to
the data points in other groups. It is basically a collection of objects on the basis of similarity and
dissimilarity between them.

For example The data points in the graph below clustered together can be classified into one single
group. We can distinguish the clusters, and we can identify that there are 3 clusters in the below
picture.

It is not necessary for clusters to be spherical as depicted below:


DBSCAN: Density-based Spatial Clustering of Applications with Noise

These data points are clustered by using the basic concept that the data point lies within the given
constraint from the cluster center. Various distance methods and techniques are used for the
calculation of the outliers.

Why Clustering?

Clustering is very much important as it determines the intrinsic grouping among the unlabelled data
present. There are no criteria for good clustering. It depends on the user, and what criteria they may
use which satisfy their need. For instance, we could be interested in finding representatives for
homogeneous groups (data reduction), finding “natural clusters” and describing their unknown
properties (“natural” data types), in finding useful and suitable groupings (“useful” data classes) or in
finding unusual data objects (outlier detection). This algorithm must make some assumptions that
constitute the similarity of points and each assumption make different and equally valid clusters.

Clustering Methods:

Density-Based Methods: These methods consider the clusters as the dense region having some
similarities and differences from the lower dense region of the space. These methods have good
accuracy and the ability to merge two clusters. Example DBSCAN (Density-Based Spatial Clustering of
Applications with Noise), OPTICS (Ordering Points to Identify Clustering Structure), etc.

Hierarchical Based Methods: The clusters formed in this method form a tree-type structure based on
the hierarchy. New clusters are formed using the previously formed one. It is divided into two category

Agglomerative (bottom-up approach)

Divisive (top-down approach)

Examples CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing Clustering and
using Hierarchies), etc.

Partitioning Methods: These methods partition the objects into k clusters and each partition forms one
cluster. This method is used to optimize an objective criterion similarity function such as when the
distance is a major parameter example K-means, CLARANS (Clustering Large Applications based upon
Randomized Search), etc.

Grid-based Methods: In this method, the data space is formulated into a finite number of cells that
form a grid-like structure. All the clustering operations done on these grids are fast and independent
of the number of data objects example STING (Statistical Information Grid), wave cluster, CLIQUE
(CLustering In Quest), etc.

Clustering Algorithms: K-means clustering algorithm – It is the simplest unsupervised learning


algorithm that solves clustering problem.K-means algorithm partitions n observations into k clusters
where each observation belongs to the cluster with the nearest mean serving as a prototype of the
cluster.
Applications of Clustering in different fields:

Marketing: It can be used to characterize & discover customer segments for marketing purposes.

Biology: It can be used for classification among different species of plants and animals.

Libraries: It is used in clustering different books on the basis of topics and information.

Insurance: It is used to acknowledge the customers, their policies and identifying the frauds.

City Planning: It is used to make groups of houses and to study their values based on their geographical
locations and other factors present.

Earthquake studies: By learning the earthquake-affected areas we can determine the dangerous zones.

Image Processing: Clustering can be used to group similar images together, classify images based on
content, and identify patterns in image data.

Genetics: Clustering is used to group genes that have similar expression patterns and identify gene
networks that work together in biological processes.

Finance: Clustering is used to identify market segments based on customer behavior, identify patterns
in stock market data, and analyze risk in investment portfolios.

Customer Service: Clustering is used to group customer inquiries and complaints into categories,
identify common issues, and develop targeted solutions.

Manufacturing: Clustering is used to group similar products together, optimize production processes,
and identify defects in manufacturing processes.

Medical diagnosis: Clustering is used to group patients with similar symptoms or diseases, which helps
in making accurate diagnoses and identifying effective treatments.

Fraud detection: Clustering is used to identify suspicious patterns or anomalies in financial


transactions, which can help in detecting fraud or other financial crimes.

Traffic analysis: Clustering is used to group similar patterns of traffic data, such as peak hours, routes,
and speeds, which can help in improving transportation planning and infrastructure.
Social network analysis: Clustering is used to identify communities or groups within social networks,
which can help in understanding social behavior, influence, and trends.

Cybersecurity: Clustering is used to group similar patterns of network traffic or system behavior, which
can help in detecting and preventing cyberattacks.

Climate analysis: Clustering is used to group similar patterns of climate data, such as temperature,
precipitation, and wind, which can help in understanding climate change and its impact on the
environment.

Sports analysis: Clustering is used to group similar patterns of player or team performance data, which
can help in analyzing player or team strengths and weaknesses and making strategic decisions.

Crime analysis: Clustering is used to group similar patterns of crime data, such as location, time, and
type, which can help in identifying crime hotspots, predicting future crime trends, and improving crime
prevention strategies.

K means Clustering – Introduction


K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering problems
in machine learning or data science. In this topic, we will learn what is K-means clustering algorithm,
how the algorithm works, along with the Python implementation of k-means clustering.

What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that
each dataset belongs only one group that has similar properties.

It allows us to cluster the data into different groups and a convenient way to discover the categories of
groups in the unlabeled dataset on its own without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data point and their corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters, and
repeats the process until it does not find the best clusters. The value of k should be predetermined in
this algorithm.

The k-means clustering algorithm mainly performs two tasks:

Determines the best value for K center points or centroids by an iterative process.
Assigns each data point to its closest k-center. Those data points which are near to the particular k-
center, create a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select random K points or centroids. (It can be other from the input dataset).

Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third steps, which means reassign each datapoint to the new closest centroid of
each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given
below:
Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into different clusters.
It means here we will try to group these datasets into two different clusters.

We need to choose some random k points or centroid to form the cluster. These points can be either
the points from the dataset or any other point. So, here we are selecting the below two points as k
points, which are not the part of our dataset. Consider the below image:

Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute
it by applying some mathematics that we have studied to calculate the distance between two points.
So, we will draw a median between both the centroids. Consider the below image:
From the above image, it is clear that points left side of the line is near to the K1 or blue centroid, and
points to the right of the line are close to the yellow centroid. Let's color them as blue and yellow for
clear visualization.

As we need to find the closest cluster, so we will repeat the process by choosing a new centroid. To
choose the new centroids, we will compute the center of gravity of these centroids, and will find new
centroids as below:

Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same process of
finding a median line. The median will be like below image:
From the above image, we can see, one yellow point is on the left side of the line, and two blue points
are right to the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, so we will again go to the step-4, which is finding new centroids or
K-points.

We will repeat the process by finding the center of gravity of centroids, so the new centroids will be as
shown in the below image:

As we got the new centroids so again will draw the median line and reassign the data points. So, the
image will be:

We can see in the above image; there are no dissimilar data points on either side of the line, which
means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final clusters will be
as shown in the below image:

How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends upon highly efficient clusters that it
forms. But choosing the optimal number of clusters is a big task. There are some different ways to find
the optimal number of clusters, but here we are discussing the most appropriate method to find the
number of clusters or value of K. The method is given below:
Elbow Method

The Elbow method is one of the most popular ways to find the optimal number of clusters. This method
uses the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the
total variations within a cluster. The formula to calculate the value of WCSS (for 3 clusters) is given
below:

WCSS= ∑Pi in Cluster1 distance(Pi C1)2 +∑Pi in Cluster2distance(Pi C2)2+∑Pi in CLuster3 distance(Pi C3)2

In the above formula of WCSS,

∑Pi in Cluster1 distance(Pi C1)2: It is the sum of the square of the distances between each data point
and its centroid within a cluster1 and the same for the other two terms.

To measure the distance between data points and centroid, we can use any method such as Euclidean
distance or Manhattan distance.

To find the optimal value of clusters, the elbow method follows the below steps:

It executes the K-means clustering on a given dataset for different K values (ranges from 1-10).

For each value of K, calculates the WCSS value.

Plots a curve between calculated WCSS values and the number of clusters K.

The sharp point of bend or a point of the plot looks like an arm, then that point is considered as the best
value of K.

Since the graph shows the sharp bend, which looks like an elbow, hence it is known as the elbow
method. The graph for the elbow method looks like the below image:
What is K-means Clustering?

Unsupervised Machine Learning is the process of teaching a computer to use unlabeled, unclassified
data and enabling the algorithm to operate on that data without supervision. Without any previous data
training, the machine’s job in this case is to organize unsorted data according to parallels, patterns, and
variations.

What is the objective of k-means clustering?

The goal of clustering is to divide the population or set of data points into a number of groups so that
the data points within each group are more comparable to one another and different from the data points
within the other groups. It is essentially a grouping of things based on how similar and different they are
to one another.

How k-means clustering works?

We are given a data set of items, with certain features, and values for these features (like a
vector). The task is to categorize those items into groups. To achieve this, we will use the K-
means algorithm, an unsupervised learning algorithm. ‘K’ in the name of the algorithm
represents the number of groups/clusters we want to classify our items into.

(It will help if you think of items as points in an n-dimensional space). The algorithm will
categorize the items into k groups or clusters of similarity. To calculate that similarity, we will
use the Euclidean distance as a measurement.

The algorithm works as follows:

1. First, we randomly initialize k points, called means or cluster centroids.


2. We categorize each item to its closest mean, and we update the mean’s coordinates,
which are the averages of the items categorized in that cluster so far.
3. We repeat the process for a given number of iterations and at the end, we have our
clusters.

The “points” mentioned above are called means because they are the mean values of the
items categorized in them. To initialize these means, we have a lot of options. An intuitive
method is to initialize the means at random items in the data set. Another method is to initialize
the means at random values between the boundaries of the data set (if for a feature x, the
items have values in [0,3], we will initialize the means with values for x at [0,3]).
K-Means Clustering in R
Airbnb Listings Dataset
All the content given here will revolve around Airbnb rental listings in Cape Town,
available on DataLab. The data contains different types of information, like the hosts,
the price, the number of reviews, the coordinates and so on.
Even if the dataset seems to provide deep information about vacation rentals, there
are still some clear questions that can be answered. For example, let’s imagine that
Airbnb requests to its data scientists to investigate the segmentation of the listings
within this platform in Cape Town.
Every town is different from the other and has different exigences depending on the
culture of the people living there. The identification of clusters in this city may be useful
to extract new insights, that can increase the satisfaction of actual customers with
suitable Customer Loyalty Strategies, avoiding customer churn. At the same time, it’s
also crucial to attract new people with appropriate promotions.
Visualize Airbnb data
Before applying k-means, we would like to investigate more on the relationship
between the variables by taking a look at the correlation matrix. For convenience of
display, we round the digits to two decimal places.
Visualize Airbnb data
Before applying k-means, we would like to investigate more on the relationship
between the variables by taking a look at the correlation matrix. For convenience of
display, we round the digits to two decimal places.
library(dplyr)
airbnb |>
select(price, minimum_nights, number_of_reviews) |>
cor(use = "pairwise.complete.obs") |>
round(2)

This code uses the dplyr package in R to perform some data manipulation and analysis
on an Airbnb dataset.
First, the library(dplyr) line loads the dplyr package into the R environment.
Next, the airbnb dataset is piped (|>) into a series of functions. The select() function is
used to choose only the columns price, minimum_nights, and number_of_reviews.
The cor() function is then used to calculate the pairwise correlation between these
three columns, using the pairwise.complete.obs argument to handle any missing data.
Finally, the round() function is used to round the resulting correlation matrix to two
decimal places.
Overall, this code is performing a simple analysis of the relationship between price,
minimum nights, and number of reviews in the Airbnb dataset.

From the output, there seems to be a slight negative relationship between price and
the number of reviews: The higher the price, the lower the number of reviews. The
same for minimum nights and a number of reviews. Since the minimum nights don’t
have a big impact on the price, we would just like to investigate more on how price and
number of reviews are related:
library(ggplot2)
ggplot(data, aes(number_of_reviews, price, color = room_type, shape = room_type))
+
geom_point(alpha = 0.25) +
xlab("Number of reviews") +
ylab("Price")

Code Explanation

This code uses the ggplot2 library in R to create a scatter plot of the number_of_reviews and
price variables from a dataset called data. The color and shape aesthetics are set to the
room_type variable, which means that the points on the plot will be colored and shaped based
on the different values of room_type. The geom_point function is used to add the points to
the plot, with an alpha value of 0.25 to make the points slightly transparent. The xlab and ylab
functions are used to add labels to the x and y axes, respectively.
The scatterplot shows that the cost of accommodations, in particular entire houses, is higher
when there is a small number of reviews, and seems to decrease as the number of reviews
grows.
Normalization

Before fitting the model, there is a further step to do. k-means is sensitive to variables that
have incomparable units, leading to misleading results. In this example, the number of reviews
is tens or hundreds, but the price is tens of thousands. Without any data processing, the
differences in price would appear to be larger than differences in reviews, but we want these
variables to be treated equally.
To avoid this problem, the variables need to be transformed to be on a similar scale. In this
way, they can be compared correctly using the distance metric.

There are different methods to tackle this issue. The most known and used is standardization,
which consists in subtracting the average value from the feature value and, then, dividing it
by its standard deviation. This technique will allow obtaining features with a mean of 0 and a
deviation of 1.

You can scale the variables with the scale() function. Since this returns a matrix, the code is
cleaner using a base-R style rather than a tidyverse style.
airbnb[, c("price", "number_of_reviews")] = scale(airbnb[, c("price", "number_of_reviews")])
This code scales the "price" and "number_of_reviews" columns in the "airbnb" data frame using the
scale() function in R. The scale() function standardizes the values in a vector by subtracting the mean
and dividing by the standard deviation.

The code uses the [, c("price", "number_of_reviews")] syntax to select only the "price" and
"number_of_reviews" columns from the "airbnb" data frame. The = operator assigns the scaled values
back to these columns in the "airbnb" data frame.

Overall, this code is useful for normalizing the values in these columns so that they can be more easily
compared and analyzed.

Fitting and evaluating the model

We can finally identify the clusters of listings with k-means. For getting started, let’s try performing k-
means by setting 3 clusters and nstart equal to 20. This last parameter is needed to run k-means with
20 different random starting assignments and, then, R will automatically choose the best results total
within-cluster sum of squares. We also set a seed to replicate the same results every time we run the
code.

# Get the two columns of interest


airbnb_2cols <- data[, c("price", "number_of_reviews")]
set.seed(123)
km.out <- kmeans(airbnb_2cols, centers = 3, nstart = 20)
km.out

Code Explanation

This code is written in R and performs k-means clustering on a dataset called "airbnb_2cols".
First, the code selects two columns of interest from the dataset "data" and assigns them to a
new variable called "airbnb_2cols". The two columns selected are "price" and
"number_of_reviews".
Next, the code sets a seed value for reproducibility and performs k-means clustering on the
"airbnb_2cols" dataset. The "centers" parameter specifies the number of clusters to create,
which in this case is set to 3. The "nstart" parameter specifies the number of times to run the
algorithm with different initial centroids. The results of the clustering are stored in a variable
called "km.out".
Finally, the code prints the results of the clustering to the console. The output shows the
cluster centers for each of the three clusters, as well as the number of observations in each
cluster.

Output:
From the output, we can observe that three different clusters have been found with sizes 785,
37 and 16069. For each cluster, the squared distances between the observations to the
centroids are calculated. So, each observation will be assigned to one of the three clusters.

Even if it seems like a good result, the best way to find the best model is to try different models
with a different number of clusters. So, we need to start from a model with a single cluster,
after we try a model with two clusters and so on. All this procedure needs to be tracked using
a graphical representation, called scree plot, in which the number of clusters is plotted on the
x-axis, while WCSS is on the y-axis.

In this case study, we build 10 k-means models, each of these will have a different number of
clusters, reaching a maximum of 10 clusters. Moreover, we are going to use only a part of the
dataset. So, we include only the price and the number of reviews. To plot the scree plot, we
need to save the total within-cluster sum of squares of all the models into the variable wss.
This code is written in R and it is used to create a scree plot to determine the optimal
number of clusters for a k-means clustering algorithm.

First, the variable n_clusters is set to 10, which is the maximum number of clusters to
consider.

Then, an empty vector wss is created to store the within-cluster sum of squares for each
number of clusters.

The set.seed(123) function sets the random seed to ensure reproducibility of the results.
Next, a for loop is used to fit a k-means clustering model for each possible number of
clusters from 1 to n_clusters. The kmeans() function is used to fit the model, with the
centers argument set to the current number of clusters and nstart set to 20 to ensure that
the algorithm is run multiple times with different starting points. The within-cluster sum of
squares is then saved to the wss vector for each number of clusters.
After the loop, a data frame wss_df is created with the number of clusters and the
corresponding within-cluster sum of squares.

Finally, a scree plot is created using ggplot2 with the number of clusters on the x-axis and
the within-cluster sum of squares on the y-axis. The geom_point() and geom_line() functions
are used to create the plot, and the scale_x_continuous() function is used to set the breaks
on the x-axis. The resulting plot shows the elbow point, which indicates the optimal number
of clusters to use for the k-means clustering algorithm.

scree_plot +
geom_hline(
yintercept = wss,
linetype = 'dashed',
col = c(rep('#000000',4),'#FF0000', rep('#000000', 5))
)

This code adds a horizontal line to a scree plot in R. The geom_hline() function is used to add
a horizontal line to the plot. The yintercept argument specifies the y-coordinate of the line,
which is set to the value of wss. The linetype argument specifies the type of line to be dashed.
The col argument specifies the color of the line, which is set to black for the first four segments
and red for the fifth segment, with black for the remaining five segments. The + operator is
used to add the horizontal line to the scree plot.
If you take a look again, the decision to make seems much more clear than before, don’t you
think? From this visualization, we can say that the best choice is to set up the number of
clusters equal to 5. After k=5, the improvements of the models seem to reduce sharply.

# Select number of clusters


k <- 5
set.seed(123)
# Build model with k clusters: km.out
km.out <- kmeans(airbnb_2cols, centers = k, nstart = 20)

This code is written in R and it performs k-means clustering on a dataset called "airbnb_2cols".
The first line of code sets the number of clusters to 5 and assigns it to the variable "k".
The second line sets the random seed to 123, which ensures that the results of the clustering
will be reproducible.
The third line uses the "kmeans" function to perform the clustering. The "airbnb_2cols"
dataset is passed as the first argument, and the "centers" parameter is set to "k", which means
that the algorithm will try to find 5 clusters. The "nstart" parameter is set to 20, which means
that the algorithm will run 20 times with different initial starting points to try to find the best
clustering solution.
The output of the "kmeans" function is assigned to the variable "km.out", which contains
information about the clustering solution, such as the cluster centers and the assignments of
each data point to a cluster.

We can try again to visualize the scatterplot between the price and the number of reviews.
We also colour the points based on the cluster id:
data$cluster_id <- factor(km.out$cluster)
ggplot(data, aes(number_of_reviews, price, color = cluster_id)) +
geom_point(alpha = 0.25) +
xlab("Number of reviews") +
ylab("Price")

This code is written in R.


The first line of code creates a new column in the "data" data frame called "cluster_id" and
assigns it the factor values of the "km.out$cluster" object. This is done to assign each data
point in the "data" data frame to a specific cluster based on the k-means clustering algorithm.
The second line of code creates a scatter plot using ggplot2 library. The x-axis represents the
"number_of_reviews" column in the "data" data frame, the y-axis represents the "price"
column in the "data" data frame, and the color of each point is determined by the "cluster_id"
column in the "data" data frame.
The third and fourth lines of code add labels to the x and y axes of the plot, respectively.
The "alpha" parameter in the "geom_point" function sets the transparency of the points to
0.25, making it easier to see overlapping points.
We can observe that:

There are a few listings that have high prices and low reviews, coloured in green and fuchsia
We also have a group of data points with both low prices and reviews, marked in red
After that, the price seems to stabilize as soon as the number of reviews increases.
When will K-Means fail?
Keep attention that k-means is simple and easy to apply, but it’s not always the best choice to
segment data into groups since it may fail. There is the assumption that the clusters are
spherical and, then, it does a good job with these cases, while groups with different sizes and
densities tend not to be captured well by this algorithm.
When these conditions are not respected, it’s preferable to find other alternative approaches,
like DBSCAN and BIRCH. Clustering in Machine Learning: 5 Essential Clustering Algorithms
provides a great overview of clustering approaches in case you want to dig deep.
Decision Tree
A Decision Tree learning is a predictive modeling approach. It is used to address classification
problems in statistics, data mining, and machine learning. It is having a tree-like structure upside
down and represents decisions or for decision-making. It can handle high dimension data and have
good accuracy. Decision tree algorithm is used in many applications such as medical production,
manufacturing, financial analysis, etc., For example, decision trees can be used in predicting the price
of a house based on size and number of rooms or predicting if a person is male or female based on
height and weight, and so on…

A decision tree can be represented as below. The topmost node is called the root node which has no
incoming edges. An internal node represents a test or an attribute and each branch represents an
outcome of a test and each terminal node or leaf holds a class. It has one incoming edge and has two
or more outgoing edges. Terminal node or Leaf node represents a class node and has exactly one
incoming node and no outgoing node.

Each node in the decision tree is a condition on the feature. Which is designed to split the dataset
into similar response values ends up in the same dataset.

CART — Classification and Regression Trees

Tree analogy is generally represented by CART known as Classification And Regression Tree. CART is
simple to understand, interpret, visualize and requires little effort for data preparation. Moreover, it
performs feature selection. Regression trees are mainly used when the target variable is numerical.
Here value obtained by a terminal node is always the mean or average of the responses falling in that
region. As a result, if any unseen data or observation will predict with the mean value. Classification
is used when the target variable is categorical. Here value obtained by a terminal node is the mode
of response falling in that region and any unseen data or observation in this region will make a
prediction based on the mode value.
Even though CART is simple and has great advantages, but it can lead to overfitting if data is not
properly handled. Moreover, it can lead to instability, if there is a small variation in data.

While growing a tree below points are to be considered :

Features to choose

Conditions for splitting

To know where to stop

Pruning

The decision to make a strategic split heavily affects the accuracy of the tree and the decision criteria
for regression and classification trees will be different. Entropy/Information gain or Gini Index can be
used for choosing the best split. Entropy and Information gain go hand in hand.

For a given dataset with different features, to decide which feature to be considered as the root node
and which feature should be the next decision node and so on, information gain of each feature
should be known. The feature which has maximum information gain will be considered as the root
node. To calculate information gain first we should calculate the entropy.

Entropy :

Entropy is a measure of disorder or impurity in the given dataset.

In the decision tree, messy data are split based on values of the feature vector associated with each
data point. With each split, the data becomes more homogenous which will decrease the entropy.
However, some data in some nodes will not be homogenous, where the entropy value will not be
small. The higher the entropy, the harder it is to draw any conclusion. When the tree finally reaches
the terminal or leaf node maximum purity is added.

For a dataset that has C classes and the probability of randomly choosing data from class, i is Pi. Then
entropy E(S) can be mathematically represented as

If we have a dataset of 10 observations belonging to two classes YES and NO. If 6 observations
belong to the class, YES, and 4 observations belong to class NO, then entropy can be written as
below.
Decision Tree Classification

Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It
is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.

In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas Leaf
nodes are the output of those decisions and do not contain any further branches.

The decisions or the test are performed on the basis of features of the given dataset.

It is a graphical representation for getting all the possible solutions to a problem/decision


based on given conditions.

It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.

In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.

A decision tree simply asks a question, and based on the answer (Yes/No), it further split
the tree into subtrees.

Below diagram explains the general structure of a decision tree:


Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine
learning model. Below are the two reasons for using the Decision tree:

Decision Trees usually mimic human thinking ability while making a decision, so it is easy
to understand.

The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terminologies

Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.

Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.

Branch/Sub Tree: A tree formed by splitting the tree.

Pruning: Pruning is the process of removing the unwanted branches from the tree.

Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree. This algorithm compares the values of root attribute with the
record (real dataset) attribute and, based on the comparison, follows the branch and
jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-
nodes and move further. It continues the process until it reaches the leaf node of the tree.
The complete process can be better understood using the below algorithm:

Step-1: Begin the tree with the root node, says S, which contains the complete dataset.

Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
Step-3: Divide the S into subsets that contains possible values for the best attributes.

Step-4: Generate the decision tree node, which contains the best attribute.

Step-5: Recursively make new decision trees using the subsets of the dataset created in
step -3. Continue this process until a stage is reached where you cannot further classify
the nodes and called the final node as a leaf node.

Example: Suppose there is a candidate who has a job offer and wants to decide whether
he should accept the offer or Not. So, to solve this problem, the decision tree starts with
the root node (Salary attribute by ASM). The root node splits further into the next decision
node (distance from the office) and one leaf node based on the corresponding labels. The
next decision node further gets split into one decision node (Cab facility) and one leaf
node. Finally, the decision node splits into two leaf nodes (Accepted offers and Declined
offer). Consider the below diagram:

Attribute Selection Measures

While implementing a Decision tree, the main issue arises that how to select the best
attribute for the root node and for sub-nodes. So, to solve such problems there is a
technique which is called as Attribute selection measure or ASM. By this measurement,
we can easily select the best attribute for the nodes of the tree. There are two popular
techniques for ASM, which are:

• Information Gain
• Gini Index
1. Information Gain:

Information gain is the measurement of changes in entropy after the segmentation of a


dataset based on an attribute.

It calculates how much information a feature provides us about a class.

According to the value of information gain, we split the node and build the decision tree.

A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using
the below formula:

Information Gain= Entropy(S)- [(Weighted Avg) *Entropy(each feature)

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies


randomness in data. Entropy can be calculated as:

Entropy(s)= -P(yes)log2 P(yes)- P(no) log2 P(no)

Where,

S= Total number of samples

P(yes)= probability of yes

P(no)= probability of no

2. Gini Index:

Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.

An attribute with the low Gini index should be preferred as compared to the high Gini
index.

It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.

Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the
optimal decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the
learning tree without reducing accuracy is known as Pruning. There are mainly two types
of tree pruning technology used:

• Cost Complexity Pruning


• Reduced Error Pruning.

Advantages of the Decision Tree

• It is simple to understand as it follows the same process which a human follow


while making any decision in real-life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

• The decision tree contains lots of layers, which makes it complex.


• It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
• For more class labels, the computational complexity of the decision tree may
increase.
• Implementation in R
• Input Data
• We will use the R in-built data set named readingSkills to create a decision tree. It
describes the score of someone's readingSkills if we know the variables
"age","shoesize","score" and whether the person is a native speaker or not.

Here is the sample data.

# Load the party package. It will automatically load other


# dependent packages.
library(party)
# Print some records from data set readingSkills.
print(head(readingSkills))

When we execute the above code, it produces the following result and chart –
nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
...............................
...............................

When we execute the above code, it produces the following result –


Bayes' Theorem:

Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional
probability.

The formula for Bayes' theorem is given as:

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the probability of
a hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Naïve Bayes Classifier Algorithm

Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes


theorem and used for solving classification problems.

It is mainly used in text classification that includes a high-dimensional training dataset.

Naïve Bayes Classifier is one of the simple and most effective Classification algorithms
which helps in building the fast machine learning models that can make quick predictions.

It is a probabilistic classifier, which means it predicts on the basis of the probability of an


object.

Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm is comprised of two words Naïve and Bayes, Which can be
described as:

Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. Such as if the fruit is identified on the
bases of color, shape, and taste, then red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identify that it is an apple without
depending on each other.

Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Working of Naïve Bayes' Classifier:


Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and corresponding target variable


"Play". So using this dataset we need to decide that whether we should play or not on a
particular day according to the weather conditions. So to solve this problem, we need to
follow the below steps:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, then the Player should play or not?

Solution: To solve this, first consider the below dataset:

Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No

9 Sunny No

10 Sunny Yes

11 Rainy No

12 Overcast Yes

13 Overcast Yes
Frequency table for the Weather Conditions:

Weather Yes No

Overcast 5 0

Rainy 2 2

Sunny 3 2

Total 10 5
Likelihood table weather condition:

Weather No Yes

Overcast 0 5 5/14= 0.35

Rainy 2 2 4/14=0.29

Sunny 2 3 5/14=0.35

All 4/14=0.29 10/14=0.71

The naive Bayes classifier introduces the assumption that all variables vj are conditionally
independent. This is a naive assumption, hence the name, since variables often depend
on each other. This assumption simplifies the bayes equation to

where

x = Vector of n attributes

P(x|c) = Likelihood which is the probability of predictor given class c

P(c) = Prior probability of class c

P(x) = Prior probability of predictor x


Applying Bayes'theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|NO)= 2/4=0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

So as we can see from the above calculation that P(Yes|Sunny)>P(No|Sunny)

Hence on a Sunny day, Player can play the game.

Advantages of Naïve Bayes Classifier:

• Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other Algorithms.
• It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


• Naive Bayes assumes that all features are independent or unrelated, so it cannot
learn the relationship between features.
• Applications of Naïve Bayes Classifier:
• It is used for Credit Scoring.
• It is used in medical data classification.
• It can be used in real-time predictions because Naïve Bayes Classifier is an eager
learner.
• It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:


There are three types of Naive Bayes Model, which are given below:

Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes
that these values are sampled from the Gaussian distribution.
Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomial distributed. It is primarily used for document classification problems, it
means a particular document belongs to which category such as Sports, Politics,
education, etc.
The classifier uses the frequency of words for the predictors.
Bernoulli: The Bernoulli classifier works similar to the Multinomial classifier, but the
predictor variables are the independent Booleans variables. Such as if a particular word
is present or not in a document. This model is also famous for document classification
tasks.

Implementation:

Now we will implement a Naive Bayes Algorithm using Python. So for this, we will use the
"user_data" dataset, which we have used in our other classification model. Therefore we
can easily compare the Naive Bayes model with the other models.

Steps to implement:

1. Data Pre-processing step


2. Fitting Naive Bayes to the Training set
3. Predicting the test result
4. Test accuracy of the result(Creation of Confusion matrix)
5. Visualizing the test set result.

Naive Bayes algorithm Process Flow


Take an example, Imagine because of current weather, cricket match will happen or not?
Now, we need to classify whether players will play the match or not based on weather
conditions.

Convert the data set into a frequency table


Create a Likelihood table by finding the probabilities like play the match or not
Based on the Naive Bayes equation calculate the posterior probability for each class. The
highest posterior probability in each class is the outcome of the prediction.
It is easy to use and fast to predict class of test data set.
It perform well in case of categorical input variables compared to numerical variable(s).
Its required independent predictor variables for better performance.

Load libraries
library(naivebayes)
library(dplyr)
library(ggplot2)
library(psych)
Getting Data
data <- read.csv("D:/RStudio/NaiveClassifiaction/binary.csv", header = T)
head(data)
Launch Thickness Appearance Spreading Rank
0 6 9 8 2
0 5 8 7 2
0 8 7 7 2
0 8 8 9 1
0 9 8 7 2
0 7 7 7 2
Let us understand the dataset, the dataset contains 5 columns.

Share point & R integration

Launch- Response variable, 0 indicates product not launched and 1 indicates product is
launched

Thickness-product thickness score

Appearance-product appearance score

Spreading- product spreading score

Rank-Rank of the product

Frequency Identification
Let’s calculate the frequency of response variable under each rank. The minimum
frequency of each class is 5 required for analysis.

xtabs(~Launch+Rank, data = data)


Rank
Rank
Launch 1 2 3
0 12 21 13
1 21 15 13
In this all-cell frequencies are greater than 5 and ideal for further analysis.

Now just look at each variable class based on str function

str(data)
data.frame': 95 obs. of 5 variables:
$ Launch : int 0 0 0 0 0 0 0 0 0 0 ...
$ Thickness : int 6 5 8 8 9 7 8 8 8 8 ...
$ ColourAndAppearance: int 9 8 7 8 8 7 9 7 9 9 ...
$ EaseOfSpreading : int 8 7 7 9 7 7 8 7 9 8 ...
$ Rank : int 2 2 2 1 2 2 2 2 1 2 ...
Now you can see the data frame contains 95 (still small dataset you can try Naive Bayes
for large datasets) observations of 5 variables. The columns Launch and Rank stored as
integer variables. If these two variables appearing as integer needs to convert into factor
variables.

tidyverse in R complete tutorial

data$Rank <- as.factor(data$Rank)


data$Launch <- as.factor(data$Launch)
When we are doing naïve Bayes classification one of the assumptions is to independent
variables are not highly correlated. In this case, remove the rank column and test the
correlation of the predictor variables.

Visualization
pairs.panels(data[-1])

Low correlation was observed between independent variables.

15 Essential packages in R

Visualize the data based on ggplot


data %>%
ggplot(aes(x=Launch, y=Thickness, fill = Launch)) +
geom_boxplot() +theme_bw()+
ggtitle("Box Plot")
Product got highest score in the thickness got launched in the market.
data %>%
ggplot(aes(x=Launch, y=Appearance, fill = Launch)) +
geom_boxplot() +theme_bw()+
ggtitle("Box Plot")

data %>%
ggplot(aes(x=Launch, y=Spreading, fill = Launch)) +
geom_boxplot() +theme_bw()+
ggtitle("Box Plot")
Data Partition
Let’s create train and test data sets for training the model and testing.
set.seed(1234)
ind <- sample(2, nrow(data), replace = T, prob = c(0.8, 0.2))
train <- data[ind == 1,]
test <- data[ind == 2,]
Naive Bayes Classification
Naive Bayes Classification in R
model <- naive_bayes(Launch ~ ., data = train, usekernel = T)
model plot(model)
You can try usekernel = T without also, based on model accuracy you can adjust the
same.
Conclusion
Based on Naive Bayes Classification in R, misclassification is around 14% in test data.
You can increase model accuracy in the train test while adding more observations.

You might also like