Unit IV
Voting Classifier
The voting classifier is an ensemble learning method that combines several base models to
produce the final, optimal prediction. The base models can independently use different algorithms,
such as KNN, random forests, logistic regression, etc., to predict their individual outputs.
The voting classifier is divided into hard voting and soft voting.
Hard voting
Hard voting is also known as majority voting. Each base classifier is fed the training data and
trained individually. The models predict the output class independently of each other, and the
final output is the class predicted by the majority of the models.
Soft voting
In soft voting, the base classifiers are fed the training data and predict the classes out of the
m possible classes. Each base classifier independently assigns a probability of occurrence to each
class. The per-class probabilities are then averaged, and the final output is the class with the
highest average probability.
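As a rough sketch of both schemes (assuming scikit-learn is available; the dataset, base models, and hyperparameters below are made up purely for illustration), hard and soft voting can be configured through the voting parameter of VotingClassifier:

```python
# Hard vs. soft voting with scikit-learn (illustrative settings only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)

base_models = [
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
]

# Hard voting: each base model casts one vote; the majority class wins.
hard_clf = VotingClassifier(estimators=base_models, voting="hard").fit(X, y)

# Soft voting: per-class probabilities are averaged; the class with the
# highest average probability wins.
soft_clf = VotingClassifier(estimators=base_models, voting="soft").fit(X, y)

print(hard_clf.predict(X[:5]), soft_clf.predict(X[:5]))
```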
Bagging
• Bagging, also known as bootstrap aggregation, is an ensemble learning method that is
commonly used to reduce variance within a noisy dataset.
• In bagging, a random sample of data in a training set is selected with replacement,
meaning that the individual data points can be chosen more than once.
How bagging works
1. Bootstrapping: Bagging leverages a bootstrapping sampling technique to create
diverse samples. This resampling method generates different subsets of the training dataset
by selecting data points at random and with replacement. This means that each time you
select a data point from the training dataset, you are able to select the same instance
multiple times; as a result, a value or instance may be repeated twice (or more) in a
sample.
2. Parallel training: These bootstrap samples are then trained independently and in
parallel with each other using weak or base learners.
3. Aggregation: Finally, depending on the task (that is, regression or classification), an
average or a majority of the predictions is taken to compute a more accurate estimate.
In the case of regression, an average is taken of all the outputs predicted by the
individual learners; this is known as soft voting. For classification problems, the class
with the majority of the votes is accepted; this is known as hard voting or majority
voting. (A minimal code sketch follows this list.)
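A minimal bagging sketch (assuming scikit-learn; the data and settings are only illustrative, and the default base learner is a decision tree):

```python
# Bagging with scikit-learn: each base learner sees a bootstrap sample.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 50 base learners, each trained on a sample drawn with replacement
# (bootstrap=True); their predictions are aggregated by majority vote.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X, y)
print(bag.predict(X[:5]))
```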
Boosting
The basic principle behind the working of the boosting algorithm is to generate multiple weak
learners and combine their predictions to form one strong rule.
E.g., AdaBoost
• This method does not use bootstrapping. Instead, it creates different decision trees
with a single split (depth one), called decision stumps.
• The number of decision stumps it makes depends on the number of features in the
dataset.
• Suppose there are M features; then AdaBoost will create M decision stumps.
• 1. We will assign an equal sample weight to each observation.
• 2. We will create M decision stumps, for M number of features.
• 3. Out of all M decision stumps, we first have to select the single best decision stump. For
selecting it, we will calculate either the entropy or the Gini index, and the stump with the
lower entropy (i.e., the one that is less disordered) will be selected.
• 4. Now, after the first decision stump is built, the algorithm evaluates this stump and
checks how many observations it has misclassified.
• 5. Suppose that out of N observations, the first decision stump has misclassified T
observations.
• 6. From this, we calculate the total error (TE), which is equal to T/N.
• 7. Next, we calculate the performance of the first decision stump:
Performance of stump = 1/2 · ln((1 − TE) / TE)
• 8. Now we will update the weights assigned before. To do this, we first update the weights
of the observations that the stump misclassified: the weights of wrongly classified
observations are increased, while the weights of correctly classified observations are
reduced.
• 9. The update uses the formula: new weight = old weight × e^(performance of stump) for
misclassified observations, and new weight = old weight × e^(−performance of stump) for
correctly classified observations.
• 10. Applying this to each observation (increasing the misclassified weights and decreasing
the correctly classified ones) gives the final weights.
• 11. However, these weights are not normalized, that is, their sum is not equal to one. To
normalize them, we sum the final weights and divide each one by that sum.
• 12. After this, we have to build our second decision stump. For this, we will create class
intervals (buckets) from the normalized weights.
• 13. After that, we need a sample dataset on which the second weak model can be trained.
To create it, we run N iterations. On each iteration, a random number between 0 and 1 is
drawn and compared with the class intervals; the row whose interval contains that random
number is selected for the new sample dataset. The new sample dataset therefore also
contains N observations.
• 14. This whole process continues for M decision stumps, and the resulting sequence of
stumps, taken together, forms the final model. (A small numerical sketch of the
weight-update step is given after this list.)
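Below is a minimal NumPy sketch of the weight-update step (steps 6 to 11 above). The stump predictions are made up purely for illustration, and equal initial weights are assumed as in step 1:

```python
# AdaBoost weight update for one decision stump (toy example).
import numpy as np

y_true  = np.array([1, 1, -1, -1, 1, -1, 1, -1])
y_stump = np.array([1, -1, -1, -1, 1, 1, 1, -1])   # 2 of 8 misclassified
N = len(y_true)

weights = np.full(N, 1.0 / N)               # step 1: equal sample weights
misclassified = y_true != y_stump

TE = weights[misclassified].sum()           # step 6: total error (T/N here)
performance = 0.5 * np.log((1 - TE) / TE)   # step 7: performance of stump

# Steps 8-9: increase misclassified weights, decrease the correct ones.
weights[misclassified]  *= np.exp(performance)
weights[~misclassified] *= np.exp(-performance)

weights /= weights.sum()                    # step 11: normalize to sum to 1
print(performance, weights)
```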
Stacking
• Bagging and boosting use homogeneous weak learners for the ensemble.
• Stacking often considers heterogeneous weak learners, learns them in parallel, and
combines them by training a meta-learner to output a prediction based on the different
weak learners’ predictions.
• The meta-learner takes the base models’ predictions as its input features, with the target
being the ground-truth values in the data D.
• It attempts to learn how to best combine the input predictions to make a better output
prediction (see the sketch after this list).
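A minimal stacking sketch (assuming scikit-learn; the base models and meta-learner below are illustrative choices, not the only possible ones):

```python
# Stacking: heterogeneous base learners plus a meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Heterogeneous weak learners, trained in parallel.
base_models = [
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]

# The meta-learner receives the base models' predictions as features
# and learns how to combine them into the final prediction.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X, y)
print(stack.predict(X[:5]))
```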
Unsupervised learning
• Unsupervised machine learning models are given unlabeled data and allowed to
discover patterns and insights without any explicit guidance or instruction.
• Better suited for more complex processing tasks, such as organizing large datasets into
clusters.
• They are useful for identifying previously undetected patterns in data and can help
identify features useful for categorizing data.
• An unsupervised learning algorithm will go through the data and identify patterns in
the data points. For instance, it might group data by temperature or similar weather
patterns.
• There are three types of unsupervised learning tasks: clustering, association rules, and
dimensionality reduction.
Clustering
• Clustering is a technique for exploring raw, unlabeled data and breaking it down into
groups (or clusters) based on similarities or differences.
• It is used in a variety of applications, including customer segmentation, fraud detection,
and image analysis.
• Clustering algorithms split data into natural groups by finding similar structures or
patterns in uncategorized data.
• Exclusive clustering: Data is grouped in a way where a single data point can only exist
in one cluster. This is also referred to as “hard” clustering. A common example of
exclusive clustering is the K-means clustering algorithm, which partitions data points
into a user-defined number K of clusters.
• Overlapping clustering: Data is grouped in a way where a single data point can exist
in two or more clusters with different degrees of membership. This is also referred to
as “soft” clustering.
• Hierarchical clustering: Data is divided into distinct clusters based on similarities,
which are then repeatedly merged and organized based on their hierarchical
relationships. There are two main types of hierarchical clustering: agglomerative and
divisive clustering. This method is also referred to as hierarchical cluster analysis
(HCA).
• Probabilistic clustering: Data is grouped into clusters based on the probability of each
data point belonging to each cluster. This approach differs from the other methods,
which group data points based on their similarities to others in a cluster.
Association
• Association rule mining is a rule-based approach to reveal interesting relationships
between data points in large datasets.
• Unsupervised learning algorithms search for frequent if-then associations, also called
rules, to discover correlations and co-occurrences within the data and the different
connections between data objects. (A toy sketch of support and confidence counting
follows this list.)
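As a toy sketch of the idea (plain Python with made-up transactions; in practice a dedicated Apriori-style implementation would be used), the support and confidence of simple if-then rules can be counted directly:

```python
# Counting support and confidence of "a -> b" rules over toy transactions.
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

item_counts, pair_counts = Counter(), Counter()
for t in transactions:
    item_counts.update(t)
    pair_counts.update(combinations(sorted(t), 2))

n = len(transactions)
for (a, b), count in pair_counts.items():
    support = count / n                   # how often a and b occur together
    confidence = count / item_counts[a]   # estimate of P(b | a) for rule a -> b
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```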
Dimensionality reduction
• Dimensionality reduction is an unsupervised learning technique that reduces the
number of features, or dimensions, in a dataset.
• Dimensionality reduction extracts important features from the dataset, reducing the
number of irrelevant or random features present. This method uses principal component
analysis (PCA) and singular value decomposition (SVD) algorithms to reduce the
number of data inputs without compromising the integrity of the properties in the
original data. (A minimal PCA sketch follows this list.)
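A minimal PCA sketch (assuming scikit-learn; the random data and the choice of 3 components are illustrative):

```python
# Reducing 10 features to 3 principal components with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 samples, 10 original features

pca = PCA(n_components=3)               # keep the 3 directions of greatest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (200, 3)
print(pca.explained_variance_ratio_)    # variance captured by each component
```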
Real-world unsupervised learning examples
• Anomaly detection: Unsupervised clustering can process large datasets and discover
data points that are atypical in a dataset.
• Recommendation engines: Using association rules, unsupervised machine learning can
help explore transactional data to discover patterns or trends that can be used to drive
personalized recommendations for online retailers.
• Customer segmentation: Unsupervised learning is also commonly used to generate
buyer persona profiles by clustering customers’ common traits or purchasing behaviors.
These profiles can then be used to guide marketing and other business strategies.
• Fraud detection: Unsupervised learning is useful for anomaly detection, revealing
unusual data points in datasets. These insights can help uncover events or behaviors
that deviate from normal patterns in the data, revealing fraudulent transactions or
unusual behavior like bot activity.
• Natural language processing (NLP): Unsupervised learning is commonly used for
various NLP applications, such as categorizing articles in news sections, text translation
and classification, or speech recognition in conversational interfaces.
• Genetic research: Genetic clustering is another common unsupervised learning
example. Hierarchical clustering algorithms are often used to analyze DNA patterns and
reveal evolutionary relationships.
K-Means Clustering
• K-means clustering is a popular unsupervised machine learning algorithm used for
partitioning a dataset into a pre-defined number of clusters.
• The goal is to group similar data points together and discover underlying patterns or
structures within the data.
• K-means is a centroid-based (distance-based) algorithm, in which we calculate
distances to assign a point to a cluster.
• In K-Means, each cluster is associated with a centroid.
K-Means Clustering Procedure
1. Initialization: Start by randomly selecting K points from the dataset. These points will
act as the initial cluster centroids.
2. Assignment: For each data point in the dataset, calculate the distance between that point
and each of the K centroids. Assign the data point to the cluster whose centroid is closest
to it. This step effectively forms K clusters.
3. Update centroids: Once all data points have been assigned to clusters, recalculate the
centroids of the clusters by taking the mean of all data points assigned to each cluster.
4. Repeat: Repeat steps 2 and 3 until convergence. Convergence occurs when the centroids
no longer change significantly or when a specified number of iterations is reached.
5. Final Result: Once convergence is achieved, the algorithm outputs the final cluster
centroids and the assignment of each data point to a cluster. (A small NumPy sketch of
these steps follows.)
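A minimal NumPy sketch of the procedure above (toy 2-D data and a fixed iteration cap are assumed; in practice a library implementation such as scikit-learn's KMeans would be used):

```python
# K-means from scratch on toy data, following steps 1-5 above.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 3, 6)])
K = 3

# Step 1: initialization - pick K random points as the initial centroids.
centroids = X[rng.choice(len(X), K, replace=False)]

for _ in range(10):
    # Step 2: assignment - attach each point to its nearest centroid.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Step 3: update - move each centroid to the mean of its cluster.
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

    # Step 4: repeat until the centroids no longer change (convergence).
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)   # Step 5: final cluster centroids
```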
Gaussian Mixture Models (GMM)
Suppose there are three Gaussian functions, hence K = 3. Each Gaussian explains the data contained
in one of the three clusters available. The mixing coefficients πk are themselves probabilities and
must meet this condition:

π1 + π2 + … + πK = 1, i.e., ∑k πk = 1
The Gaussian density function is given by:

N(x | μ, Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) · exp(−(1/2) (x − μ)^T Σ^(−1) (x − μ))   (1)

where x represents our data points, D is the number of dimensions of each data point, and μ and Σ
are the mean and covariance, respectively.
We will also find it useful to take the log of this equation, which is given by:

ln N(x | μ, Σ) = −(D/2) ln(2π) − (1/2) ln|Σ| − (1/2) (x − μ)^T Σ^(−1) (x − μ)   (2)
If we differentiate this equation with respect to the mean and covariance and then equate it to
zero, then we will be able to find the optimal values for these parameters, and the solutions will
correspond to the Maximum Likelihood Estimates (MLE) for this setting.
First, let’s suppose we want to know what is the probability that a data point xn comes from
Gaussian k. We can express this as:

p(zk = 1 | xn)
In this case, z is a latent variable that takes only two possible values. It is one when x came from
Gaussian k, and zero otherwise. Actually, we don’t get to see this z variable in reality, but
knowing its probability of occurrence will be useful in helping us determine the Gaussian
mixture parameters, as we discuss later.
Likewise, we can state the following:

p(zk = 1) = πk
Which means that the overall probability of observing a point that comes from Gaussian k is
actually equivalent to the mixing coefficient for that Gaussian. This makes sense, because the
bigger the Gaussian is, the higher we would expect this probability to be. Now let z be the set
of all possible latent variables zk, hence:

z = {z1, z2, …, zK}

We know beforehand that each zk occurs independently of the others and that it can only take the
value of one when k is equal to the cluster the point comes from. Therefore:

p(z) = p(z1 = 1)^z1 · p(z2 = 1)^z2 ⋯ p(zK = 1)^zK = ∏k πk^zk
These expressions, together with the Bayes rule, will help us determine this probability. From the
product rule of probabilities, we know that:

p(xn, z) = p(xn | z) p(z), where p(xn | z) = ∏k N(xn | μk, Σk)^zk

To obtain p(xn) we marginalize over z, that is, we sum up the terms on z, hence:

p(xn) = ∑k πk N(xn | μk, Σk)
This is the equation that defines a Gaussian Mixture, and you can clearly see that it depends on
all the parameters that we mentioned previously! To determine the optimal values for these we need
to determine the maximum likelihood of the model. We can find the likelihood as the joint
probability of all observations xn, defined by:

p(X) = ∏n ∑k πk N(xn | μk, Σk)

Like we did for the original Gaussian density function, let’s apply the log to each side of the
equation:

ln p(X) = ∑n ln ∑k πk N(xn | μk, Σk)   (3)

Finally, applying the Bayes rule to the expressions above gives the posterior probability, or
responsibility, that data point xn came from Gaussian k:

γ(znk) = πk N(xn | μk, Σk) / ∑j πj N(xn | μj, Σj)   (4)
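As a small sketch of how this mixture density and its log-likelihood can be evaluated numerically (assuming NumPy and SciPy; the three components below are made up):

```python
# Evaluating p(x_n) = sum_k pi_k * N(x_n | mu_k, Sigma_k) and ln p(X).
import numpy as np
from scipy.stats import multivariate_normal

X = np.random.default_rng(0).normal(size=(100, 2))   # toy data, D = 2

pis  = [0.3, 0.4, 0.3]                                # mixing coefficients, sum to 1
mus  = [np.array([0.0, 0.0]), np.array([2.0, 2.0]), np.array([-2.0, 1.0])]
covs = [np.eye(2), np.eye(2), np.eye(2)]

# Mixture density for every data point.
p_x = sum(pi * multivariate_normal.pdf(X, mean=mu, cov=cov)
          for pi, mu, cov in zip(pis, mus, covs))

# Log-likelihood: ln p(X) = sum_n ln sum_k pi_k * N(x_n | mu_k, Sigma_k).
log_likelihood = np.log(p_x).sum()
print(log_likelihood)
```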
EM algorithm
• The Expectation-Maximization (EM) algorithm is defined as the combination of
various unsupervised machine learning algorithms, which is used to determine the local
maximum likelihood estimates (MLE) or maximum a posteriori estimates
(MAP) for unobservable variables in statistical models.
• It is a technique to find maximum likelihood estimates when latent variables are
present, and it is typically applied to what is called a latent variable model.
• A latent variable model consists of both observable and unobservable variables where
observable can be predicted while unobserved are inferred from the observed variable.
These unobservable variables are known as latent variables.
• Expectation step (E - step): It involves the estimation (guess) of all missing values in
the dataset so that after completing this step, there should not be any missing value.
• Maximization step (M - step): This step involves the use of estimated data in the E-
step and updating the parameters.
• Repeat E-step and M-step until the convergence of the values occurs.
Let us now define the steps that the general EM algorithm will follow.
Step 1: Initialise θ accordingly. For instance, we can use the results obtained by a previous K-
Means run as a good starting point for our algorithm.
Step 2 (Expectation step): Evaluate

Q(θ*, θ) = E[ln p(X, Z | θ*)] = ∑Z p(Z | X, θ) ln p(X, Z | θ*)   (5)
Well, actually we have already found p(Z | X, θ). Remember the γ expression we ended up with
in the previous section? For better visibility, let’s bring our earlier equation (4) here:

γ(znk) = πk N(xn | μk, Σk) / ∑j πj N(xn | μj, Σj)   (4)

For Gaussian Mixture Models, the expectation step boils down to calculating the value of γ in
(4) by using the old parameter values. Now if we replace (4) in (5), we will have:

Q(θ*, θ) = ∑Z γ(znk) ln p(X, Z | θ*)   (6)
Sounds good, but we are still missing p(X, Z | θ*). How can we find it? Well, actually it’s not
that difficult. It is just the complete likelihood of the model, including both X and Z, and we can
find it by using the following expression:

p(X, Z | θ*) = ∏n ∏k πk^znk N(xn | μk, Σk)^znk

which is the result of calculating the joint probability of all observations and latent variables
and is an extension of our initial derivations for p(x). The log of this expression is given by:

ln p(X, Z | θ*) = ∑n ∑k znk [ln πk + ln N(xn | μk, Σk)]   (7)
Nice! And we have finally gotten rid of this troublesome logarithm that affected the summation
in (3). With all of this in place, it will be much easier for us to estimate the parameters by just
maximizing Q with respect to the parameters, but we will deal with this in the maximization
step. Besides, remember that the latent variable znk will only be 1 once every time the summation
is evaluated. With that knowledge, we can easily get rid of it as needed for our derivations.
Finally, we can replace (7) in (6) to get:

Q(θ*, θ) = ∑n ∑k γ(znk) [ln πk* + ln N(xn | μk*, Σk*)]   (8)
In the maximization step, we will find the revised parameters of the mixture. For this purpose,
we will need to make Q a restricted maximization problem and thus we will add a Lagrange
multiplier to (8). Let’s now review the maximization step.
Step 3 (Maximization step): Find the revised parameters θ* using:

θ* = argmax Q(θ*, θ)

where

Q(θ*, θ) = ∑n ∑k γ(znk) [ln πk* + ln N(xn | μk*, Σk*)]

which is what we ended up with in the previous step. However, Q should also take into account
the restriction that all π values must sum up to one. To do so, we will need to add a suitable
Lagrange multiplier. Therefore, we should rewrite (8) in this way:

Q(θ*, θ) = ∑n ∑k γ(znk) [ln πk* + ln N(xn | μk*, Σk*)] − λ (∑k πk* − 1)
And now we can easily determine the parameters by using maximum likelihood. Let’s now take
the derivative of Q with respect to πk* and set it equal to zero:

∂Q/∂πk* = ∑n γ(znk) / πk* − λ = 0

Then, by rearranging the terms and applying a summation over k to both sides of the
equation, we find that λ = N, and we obtain:

πk* = ∑n γ(znk) / N
Similarly, if we differentiate Q with respect to μ and Σ, equate the derivative to zero, and then
solve for the parameters by making use of the log-likelihood equation (2) we defined, we obtain:

μk* = ∑n γ(znk) xn / ∑n γ(znk)

Σk* = ∑n γ(znk) (xn − μk*)(xn − μk*)^T / ∑n γ(znk)
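To tie the E-step and M-step formulas together, here is a minimal NumPy/SciPy sketch of EM for a Gaussian mixture (the toy data, random initialization, and fixed number of iterations are assumptions of the example):

```python
# EM for a 2-component Gaussian mixture, following the formulas above.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(100, 2)) for m in (0.0, 4.0)])
N, D = X.shape
K = 2

# Step 1: initialize the parameters (e.g., from a K-means run; random here).
pis  = np.full(K, 1.0 / K)
mus  = X[rng.choice(N, K, replace=False)]
covs = np.array([np.eye(D) for _ in range(K)])

for _ in range(20):
    # E-step: responsibilities gamma(znk), as in equation (4).
    dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], covs[k])
                            for k in range(K)])
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: revised parameters pi_k*, mu_k*, Sigma_k*.
    Nk = gamma.sum(axis=0)
    pis = Nk / N
    mus = (gamma.T @ X) / Nk[:, None]
    for k in range(K):
        diff = X - mus[k]
        covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]

print(pis, mus, sep="\n")
```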