Data Mining Techniques

Support Vector Machine Algorithm

A Support Vector Machine (SVM) is a powerful machine learning algorithm widely used for both linear
and nonlinear classification, as well as regression and outlier detection tasks. SVMs are highly
adaptable, making them suitable for various applications such as text classification, image
classification, spam detection, handwriting identification, gene expression analysis, face detection,
and anomaly detection.

SVMs are particularly effective because they focus on finding the maximum separating
hyperplane between the different classes in the target feature, making them robust for both binary and
multiclass classification. In this outline, we will explore the Support Vector Machine (SVM) algorithm,
its applications, and how it effectively handles both linear and nonlinear classification, as well
as regression and outlier detection tasks.

Support Vector Machine

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for
both classification and regression tasks. While it can be applied to regression problems, SVM is best
suited for classification tasks. The primary objective of the SVM algorithm is to identify the optimal
hyperplane in an N-dimensional space that can effectively separate data points into different classes in
the feature space. The algorithm ensures that the margin between the closest points of different classes,
known as support vectors, is maximized.

The dimension of the hyperplane depends on the number of features. For instance, if there are two
input features, the hyperplane is simply a line, and if there are three input features, the hyperplane
becomes a 2-D plane. As the number of features increases beyond three, the complexity of visualizing
the hyperplane also increases.
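To make the geometry concrete, here is a minimal sketch (assuming scikit-learn and NumPy are available; the two-feature toy data is invented for illustration) that fits a linear SVM and reads off the hyperplane w·x + b = 0, its support vectors, and the margin width:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data with two features (x1, x2):
# class 0 points cluster near the origin, class 1 points are shifted away.
X = np.array([[1.0, 1.0], [1.5, 0.5], [0.5, 1.5], [1.0, 2.0],
              [4.0, 4.0], [4.5, 3.5], [3.5, 4.5], [4.0, 5.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A linear kernel with a very large C approximates a hard-margin SVM.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the separating line w.x + b = 0
b = clf.intercept_[0]
print("hyperplane: %.3f*x1 + %.3f*x2 + %.3f = 0" % (w[0], w[1], b))
print("support vectors:\n", clf.support_vectors_)
print("margin width: %.3f" % (2 / np.linalg.norm(w)))
```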

Consider two independent variables, x1 and x2, and one dependent variable represented as either a
blue circle or a red circle.

• In this scenario, the hyperplane is a line because we are working with two features (x1 and x2).

• There are multiple lines (or hyperplanes) that can separate the data points.

• The challenge is to determine the best hyperplane that maximizes the separation margin
between the red and blue circles.
Figure: Linearly separable data points

From the figure above, it is clear that there are multiple lines (our hyperplane here is a line because we are considering only two input features, x1 and x2) that segregate our data points, i.e., classify the red and blue circles. So how do we choose the best line, or more generally the best hyperplane, to segregate our data points?

How Does the Support Vector Machine Algorithm Work?

One reasonable choice for the best hyperplane in a Support Vector Machine (SVM) is the one that
maximizes the separation margin between the two classes. The maximum-margin hyperplane, also
referred to as the hard margin, is selected based on maximizing the distance between the hyperplane
and the nearest data point on each side.
Figure: Multiple hyperplanes separating the data from two classes

So we choose the hyperplane whose distance to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, or hard margin. So, from the figure above, we choose L2. Now let's consider a scenario like the one shown below.

Figure: Selecting a hyperplane for data with an outlier

Here we have one blue ball inside the boundary of the red balls. So how does SVM classify this data? It's simple: the blue ball lying among the red ones is an outlier of the blue class. The SVM algorithm ignores such outliers and still finds the hyperplane that maximizes the margin, which makes SVM robust to outliers.

Figure: The most optimal hyperplane

So for this kind of data, SVM finds the maximum margin as it did for the previous datasets, and in addition it adds a penalty each time a point crosses the margin. The margins in such cases are called soft margins. When the dataset requires a soft margin, the SVM tries to minimize (1/margin) + λ·(∑penalty), where λ controls the trade-off between the two terms. Hinge loss is the most commonly used penalty: if there is no violation there is no hinge loss, and if there is a violation the hinge loss is proportional to the distance of the violation.
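To illustrate the penalty concretely, here is a rough sketch using only NumPy (the decision values below are made up for illustration): hinge loss is zero for points that sit on the correct side outside the margin and grows linearly with how far a point violates the margin.

```python
import numpy as np

def hinge_loss(y, scores):
    """Hinge loss per sample: max(0, 1 - y * f(x)), with labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * scores)

y = np.array([+1, +1, -1, -1])            # true labels
scores = np.array([2.3, 0.4, -1.7, 0.2])  # hypothetical decision values f(x) = w.x + b

print(hinge_loss(y, scores))   # [0.  0.6 0.  1.2]
# 1st and 3rd points: correctly classified outside the margin -> no penalty.
# 2nd point: inside the margin -> small penalty; 4th point: misclassified -> larger penalty.
```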

Till now, we were talking about linearly separable data (the group of blue balls and the group of red balls can be separated by a straight line). What do we do if the data are not linearly separable?

Figure: Original 1D dataset for classification


Say our data looks like the figure above. SVM solves this by creating a new variable using a kernel. We call a point on the line xi and create a new variable yi as a function of its distance from the origin o. If we plot this, we get something like the figure shown below.

Figure: Mapping the 1D data to 2D so that the two classes become separable

In this case, the new variable y is created as a function of distance from the origin. A non-linear function
that creates a new variable is referred to as a kernel.
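A minimal sketch of this idea (NumPy only; the 1D points are invented): mapping each point x to the pair (x, x²), so that the new variable is a function of the distance from the origin, makes the two classes separable by a straight line in the new 2D space.

```python
import numpy as np

# Hypothetical 1D data: the "inner" class sits around the origin, the "outer"
# class sits on both sides, so no single threshold on x separates them.
x = np.array([-4.0, -3.5, -3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0, 3.5, 4.0])
label = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1])

# Kernel-style feature map: each point x_i becomes (x_i, y_i) with y_i = x_i**2,
# i.e. y_i is a function of the distance from the origin.
features_2d = np.column_stack([x, x ** 2])

# In the new 2D space the classes are separable by the horizontal line y = 4:
print((features_2d[:, 1] > 4) == (label == 1))   # all True for this toy data
```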

Support Vector Machine Terminology

• Hyperplane: The hyperplane is the decision boundary used to separate data points of different classes in a feature space. For linear classification, it is the linear equation w·x + b = 0.

• Support Vectors: Support vectors are the closest data points to the hyperplane. These points are
critical in determining the hyperplane and the margin in Support Vector Machine (SVM).

• Margin: The margin refers to the distance between the support vector and the hyperplane. The
primary goal of the SVM algorithm is to maximize this margin, as a wider margin typically results
in better classification performance.

• Kernel: The kernel is a mathematical function used in SVM to map input data into a higher-
dimensional feature space. This allows the SVM to find a hyperplane in cases where data points
are not linearly separable in the original space. Common kernel functions include linear,
polynomial, radial basis function (RBF), and sigmoid.

• Hard Margin: A hard margin refers to the maximum-margin hyperplane that perfectly separates
the data points of different classes without any misclassifications.
• Soft Margin: When data contains outliers or is not perfectly separable, SVM uses the soft
margin technique. This method introduces a slack variable for each data point to allow some
misclassifications while balancing between maximizing the margin and minimizing violations.

• C: The C parameter in SVM is a regularization term that balances margin maximization and the penalty for misclassifications. A higher C value imposes a stricter penalty for margin violations, leading to a smaller margin but fewer misclassifications (a short sketch illustrating this appears after this list).

• Hinge Loss: The hinge loss is a common loss function in SVMs. It penalizes misclassified points or
margin violations and is often combined with a regularization term in the objective function.

• Dual Problem: The dual problem in SVM involves solving for the Lagrange multipliers associated
with the support vectors. This formulation allows for the use of the kernel trick and facilitates
more efficient computation.
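As promised above, here is a brief sketch of the effect of the C parameter (scikit-learn assumed; the overlapping toy data is randomly generated for illustration): a small C tolerates margin violations and yields a wider margin, while a large C penalizes violations heavily and narrows the margin.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical overlapping 2D data: two Gaussian blobs that are not perfectly separable.
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=2.0, scale=1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    # A wider margin corresponds to a smaller ||w||; small C trades violations for margin.
    print("C=%-6g  margin width=%.3f  #support vectors=%d"
          % (C, 2 / np.linalg.norm(w), len(clf.support_vectors_)))
```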
Apriori Algorithm

The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. It applies an iterative, level-wise search in which frequent k-itemsets are used to find (k+1)-itemsets.

To improve the efficiency of this level-wise generation of frequent itemsets, an important property called the Apriori property is used, which reduces the search space.

Apriori Property –
Every non-empty subset of a frequent itemset must itself be frequent. The key concept behind the Apriori algorithm is the anti-monotonicity of the support measure. Apriori assumes that:

All subsets of a frequent itemset must be frequent (the Apriori property).

If an itemset is infrequent, all of its supersets will be infrequent.
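A minimal sketch of how this property prunes candidates (plain Python; the itemsets are illustrative): a candidate k-itemset can be discarded as soon as any of its (k-1)-subsets is missing from the previous frequent level.

```python
from itertools import combinations

def has_infrequent_subset(candidate, previous_frequent):
    """Return True if any (k-1)-subset of the candidate is not frequent (Apriori property)."""
    k = len(candidate)
    return any(frozenset(sub) not in previous_frequent
               for sub in combinations(candidate, k - 1))

# Illustrative example: suppose {I3, I4} was NOT frequent at level 2.
L2 = {frozenset(s) for s in [("I1", "I2"), ("I1", "I3"), ("I2", "I3"), ("I2", "I4")]}

print(has_infrequent_subset(("I1", "I2", "I3"), L2))  # False -> keep the candidate
print(has_infrequent_subset(("I2", "I3", "I4"), L2))  # True  -> prune it ({I3, I4} missing)
```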

Before we start working through the algorithm, go through the basic definitions explained in the previous post. Consider the following dataset; we will find the frequent itemsets and generate association rules for them. Assume that the minimum support count is 2 and the minimum confidence is 60%.

Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).
(II) Compare each candidate item's support count with the minimum support count (here min_support = 2); if a candidate's support count is less than min_support, remove that item. This gives us the itemset L1.
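Here is a small sketch of Step 1 in plain Python. The transaction list is a hypothetical stand-in, since the original table is not reproduced here.

```python
from collections import Counter

# Hypothetical transaction database (the actual table from the example is not shown here).
transactions = [
    {"I1", "I2", "I5"},
    {"I2", "I4"},
    {"I2", "I3"},
    {"I1", "I2", "I4"},
    {"I1", "I3"},
]
min_support_count = 2

# C1: support count of every individual item.
C1 = Counter(item for t in transactions for item in t)

# L1: items whose support count meets the minimum support count.
L1 = {item: count for item, count in C1.items() if count >= min_support_count}
print(L1)   # e.g. {'I1': 3, 'I2': 4, 'I3': 2, 'I4': 2} -- I5 is dropped (count 1)
```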

Step-2: K=2

• Generate the candidate set C2 by joining L1 with itself (this is called the join step). The condition for joining two itemsets from Lk-1 is that they must have (K-2) elements in common; a short sketch of this join step appears at the end of this step.

• Check whether all subsets of each itemset are frequent and, if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)

• Now find the support count of these itemsets by searching the dataset.

(II) Compare each candidate's (C2) support count with the minimum support count (here min_support = 2); if a candidate's support count is less than min_support, remove that itemset. This gives us the itemset L2.
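Here is the promised sketch of the join step (plain Python): two (K-1)-itemsets are joined only when their union has exactly K items, which is equivalent to requiring that they share (K-2) items.

```python
from itertools import combinations

def apriori_join(prev_frequent, k):
    """Join L(k-1) with itself: keep unions of size k whose two parents share k-2 items.

    prev_frequent: iterable of frozensets, each of size k-1.
    """
    candidates = set()
    for a, b in combinations(prev_frequent, 2):
        union = a | b
        if len(union) == k:          # equivalent to |a intersect b| == k - 2
            candidates.add(union)
    return candidates

# Example with K = 2: join the frequent 1-itemsets.
L1 = [frozenset({"I1"}), frozenset({"I2"}), frozenset({"I3"})]
print(sorted(sorted(c) for c in apriori_join(L1, 2)))
# [['I1', 'I2'], ['I1', 'I3'], ['I2', 'I3']]
```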

Step-3:

o Generate the candidate set C3 by joining L2 with itself (join step). The condition for joining itemsets from Lk-1 is that they must have (K-2) elements in common, so for L2 the first element should match. The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, and {I2, I3, I5}.

o Check whether all subsets of these itemsets are frequent and, if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, and {I1, I3}, which are all frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly, check every itemset.)

o Find the support count of the remaining itemsets by searching the dataset.

(II) Compare each candidate's (C3) support count with the minimum support count (here min_support = 2); if a candidate's support count is less than min_support, remove that itemset. This gives us the itemset L3.

Step-4:

o Generate the candidate set C4 by joining L3 with itself (join step). The condition for joining itemsets from Lk-1 (here K=4) is that they should have (K-2) elements in common, so for L3 the first two items should match.

o Check whether all subsets of these itemsets are frequent. (Here the only itemset formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent.) So there is no itemset in C4.

o We stop here because no further frequent itemsets are found.

Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that, we need to calculate the confidence of each rule.

Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.

Confidence(A->B)=Support_count(A∪B)/Support_count(A)

So here, taking one frequent itemset as an example, we will show the rule generation.
Itemset {I1, I2, I3} //from L3
So the rules can be:
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%

So, if the minimum confidence is 50%, the first three rules can be considered strong association rules.
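The same confidence calculations can be reproduced in a few lines of plain Python, using the support counts from the worked example above:

```python
# Support counts taken from the worked example above.
support = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I3"}): 6,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 4, frozenset({"I2", "I3"}): 4,
    frozenset({"I1", "I2", "I3"}): 2,
}

def confidence(antecedent, consequent):
    """Confidence(A -> B) = support_count(A U B) / support_count(A)."""
    a = frozenset(antecedent)
    return support[a | frozenset(consequent)] / support[a]

min_confidence = 0.5
for A, B in [({"I1", "I2"}, {"I3"}), ({"I1", "I3"}, {"I2"}), ({"I2", "I3"}, {"I1"}),
             ({"I1"}, {"I2", "I3"}), ({"I2"}, {"I1", "I3"}), ({"I3"}, {"I1", "I2"})]:
    c = confidence(A, B)
    print(sorted(A), "=>", sorted(B), " confidence = %.1f%%" % (100 * c),
          "(strong)" if c >= min_confidence else "")
```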

Limitations of Apriori Algorithm


The Apriori algorithm can be slow. Its main limitation is the time required to hold and process a vast number of candidate sets when there are many frequent itemsets, a low minimum support, or large itemsets; in other words, it is not an efficient approach for very large datasets. For example, if there are 10^4 frequent 1-itemsets, the algorithm needs to generate more than 10^7 candidate 2-itemsets, which must then be tested and accumulated. Furthermore, to detect a frequent pattern of size 100, i.e. {v1, v2, ..., v100}, it has to generate on the order of 2^100 candidate itemsets, making candidate generation costly and time-consuming. The algorithm must check a large number of candidate itemsets and scan the database repeatedly to find them, so Apriori becomes very slow and inefficient when memory is limited and the number of transactions is large.
