PRu 4


Linear Classifier

In the field of machine learning, the goal of statistical classification is to use an object's
characteristics to identify which class (or group) it belongs to. A linear classifier achieves this by
making a classification decision based on the value of a linear combination of the characteristics.
A linear classifier is a model that decides which discrete class a data point belongs to, based on
a linear combination of its explanatory variables. As an example, details about a dog such as
weight, height, colour and other features could be combined by a model to decide its breed. The
effectiveness of these models lies in their ability to find a mathematical combination of features
that groups data points together when they have the same class and separates them when they have
different classes, providing us with clear boundaries for how to classify.

If each instance belongs to one and only one class, then our input data can be divided into decision
regions separated by decision boundaries.

Discriminant Functions
A two-category classifier with a discriminant function of the form (1) uses the following rule:
Decide ω1 if g(x) > 0 and ω2 if g(x) < 0
⇔ Decide ω1 if w^T x > -w0 and ω2 otherwise
If g(x) = 0 ⇒ x can be assigned to either class
• The equation g(x) = 0 defines the decision surface that separates points assigned to the
category ω1 from points assigned to the category ω2
• When g(x) is linear, the decision surface is a hyperplane
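
The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the weight vector `w`, the threshold `w0` and the test points are hypothetical values chosen only to exercise each branch of the rule.

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^T x + w0."""
    return float(np.dot(w, x) + w0)

def decide(x, w, w0):
    """Return 'omega1' if g(x) > 0, 'omega2' if g(x) < 0, else 'boundary'."""
    value = g(x, w, w0)
    if value > 0:
        return "omega1"
    if value < 0:
        return "omega2"
    return "boundary"  # x lies on the decision surface; either class may be assigned

# Hypothetical weights and threshold, for illustration only
w = np.array([1.0, 2.0])
w0 = -4.0

print(decide(np.array([3.0, 2.0]), w, w0))  # g = 3 + 4 - 4 = 3
print(decide(np.array([0.0, 1.0]), w, w0))  # g = 0 + 2 - 4 = -2
print(decide(np.array([2.0, 1.0]), w, w0))  # g = 2 + 2 - 4 = 0
```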
Decision Hyperplanes
• A hyperplane is a subspace whose dimension is one less than that of its ambient space. For
example, if a space is 3-dimensional then its hyperplanes are the 2-dimensional planes,
while if the space is 2-dimensional, its hyperplanes are the 1-dimensional lines.
• A decision boundary is a hypersurface that partitions the underlying vector space into two
sets, one for each class. A general (curved) hypersurface in a low-dimensional space can become
a hyperplane once the data are mapped into a space of much higher dimension.
• In the original low-dimensional space, a linear decision boundary is a hyperplane: 'plane' has
the meaning of straight and flat, so it is a line or a plane that separates the data sets. When
you apply a non-linear operation to map your data to a new feature space, the decision boundary
is still a hyperplane in that space, but it is no longer flat in the original space.

Linear Discriminant Functions and Decision Hyperplanes
Let us once more focus on the two-class case and consider linear discriminant functions. Then the
respective decision hypersurface in the l-dimensional feature space is a hyperplane, that is
$$g(x) = w^T x + w_0 = 0$$
where w = [w1, w2, …, wl]^T is known as the weight vector and w0 as the threshold. If x1, x2 are two
points on the decision hyperplane, then the following is valid
$$0 = w^T x_1 + w_0 = w^T x_2 + w_0 \;\Rightarrow\; w^T (x_1 - x_2) = 0$$
Since the difference vector x1 – x2 lies in the decision hyperplane (for any x1, x2 on it), it is
apparent from Eq. (3.2) that the vector w is orthogonal to the decision hyperplane.
Figure shows the corresponding geometry (for w1 > 0, w2 > 0, w0 < 0). Recalling our high school
math, it is easy to see that the quantities entering in the figure are given by

$$d = \frac{|w_0|}{\sqrt{w_1^2 + w_2^2}} \quad \text{and} \quad z = \frac{|g(x)|}{\sqrt{w_1^2 + w_2^2}}$$

In other words, |g(x)| is a measure of the Euclidean distance of the point x from the decision
hyperplane. On one side of the plane g(x) takes positive values and on the other negative. In the
special case that w0 = 0, the hyperplane passes through the origin.
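
The two distances and the orthogonality of w can be checked numerically. The sketch below assumes NumPy and uses a hypothetical two-dimensional hyperplane with w1 > 0, w2 > 0 and w0 < 0, as in the geometry described above; the specific numbers are illustrative only.

```python
import numpy as np

# Hypothetical hyperplane 3*x1 + 4*x2 - 10 = 0 (w1 > 0, w2 > 0, w0 < 0)
w = np.array([3.0, 4.0])
w0 = -10.0

def g(x):
    """Linear discriminant g(x) = w^T x + w0."""
    return float(w @ x + w0)

norm_w = np.linalg.norm(w)       # sqrt(w1^2 + w2^2) = 5

d = abs(w0) / norm_w             # distance of the hyperplane from the origin
x = np.array([6.0, 3.0])         # an arbitrary point off the hyperplane
z = abs(g(x)) / norm_w           # distance of x from the hyperplane

# Two points that lie on the hyperplane (g is zero at both)
p1 = np.array([2.0, 1.0])
p2 = np.array([-2.0, 4.0])
dot = float(w @ (p1 - p2))       # w is orthogonal to any in-plane difference vector

print(d, z, dot)  # 2.0 4.0 0.0
```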
The Perceptron algorithm
https://www.youtube.com/watch?v=1XkjVl-j8MM
• The Perceptron algorithm is a two-class (binary) classification machine learning algorithm.
• It is a type of neural network model, perhaps the simplest type of neural network model.
• It consists of a single node or neuron that takes a row of data as input and predicts a class
label. This is achieved by calculating the weighted sum of the inputs and a bias (set to 1).
The weighted sum of the inputs of the model is called the activation.
Activation = Weights * Inputs + Bias
If the activation is above 0.0, the model will output 1.0; otherwise, it will output 0.0.
Predict 1: If Activation > 0.0
Predict 0: If Activation <= 0.0
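
As a minimal sketch of this prediction rule in plain Python: the function names and the example weights, inputs and bias below are assumptions made for illustration, not values from the source.

```python
def activation(weights, inputs, bias):
    """Activation = Weights * Inputs + Bias (a weighted sum plus the bias)."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def predict(weights, inputs, bias):
    """Output 1.0 if the activation is above 0.0, otherwise 0.0."""
    return 1.0 if activation(weights, inputs, bias) > 0.0 else 0.0

# Hypothetical parameters, for illustration only
weights = [0.5, -0.6]
bias = 0.1

print(predict(weights, [1.0, 0.5], bias))  # activation is about 0.3, so 1.0
print(predict(weights, [0.2, 1.0], bias))  # activation is about -0.4, so 0.0
```

Note that an activation of exactly 0.0 falls into the "Predict 0" branch, matching the rule above.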

What is the Perceptron model in Machine Learning?


• Perceptron is a Machine Learning algorithm for supervised learning of various binary
classification tasks. Further, Perceptron is also understood as an Artificial Neuron or neural
network unit that helps to detect patterns in input data, for example in business intelligence.

• The Perceptron model is also regarded as one of the simplest types of Artificial Neural
networks. It is a supervised learning algorithm for binary classifiers, and we can consider
it a single-layer neural network with four main parameters, i.e., input values,
weights and bias, net sum, and an activation function.

Input Nodes or Input Layer:


This is the primary component of Perceptron which accepts the initial data into the system for
further processing. Each input node contains a real numerical value.
Weight and Bias:
The weight parameter represents the strength of the connection between units and is another of
the most important Perceptron components. The larger the magnitude of a weight, the more strongly
the associated input influences the output. Further, the bias can be considered as the intercept
in a linear equation.
Activation Function:
This is the final important component; it determines whether the neuron will fire or not. The
activation function can be considered primarily as a step function.
Types of Activation functions:
• Sign function
• Step function, and
• Sigmoid function
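
The three activation functions listed above might be written as follows. This is a sketch using common textbook definitions; in particular, mapping an input of exactly zero to +1 in the sign function and to 0 in the step function is an assumed convention, since the source does not specify it.

```python
import math

def sign(a):
    """Sign function: +1 for non-negative input, -1 otherwise (assumed convention at 0)."""
    return 1.0 if a >= 0 else -1.0

def step(a):
    """Step function: 1 if the input is above 0, otherwise 0."""
    return 1.0 if a > 0 else 0.0

def sigmoid(a):
    """Sigmoid function: a smooth squashing of the input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

for a in (-2.0, 0.0, 2.0):
    print(a, sign(a), step(a), round(sigmoid(a), 3))
```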

Perceptron models are divided into two types. These are as follows:
• Single-layer Perceptron Model
• Multi-layer Perceptron model

Advantages of Multi-Layer Perceptron:


• A multi-layered perceptron model can be used to solve complex non-linear problems.
• It works well with both small and large input data.
• It helps us to obtain quick predictions after the training.
• It helps to obtain the same accuracy ratio with large as well as small data.
Disadvantages of Multi-Layer Perceptron:

• In Multi-layer perceptron, computations are difficult and time-consuming.


• In a multi-layer Perceptron, it is difficult to determine how much each independent variable
affects the dependent variable.
• The model functioning depends on the quality of the training.

Characteristics of Perceptron
• Perceptron is a machine learning algorithm for supervised learning of binary classifiers.
• In Perceptron, the weight coefficient is automatically learned.
• Initially, weights are multiplied with input features, and the decision is made whether the
neuron is fired or not.
• The activation function applies a step rule to check whether the weighted sum is greater
than zero.
• The linear decision boundary is drawn, enabling the distinction between the two linearly
separable classes +1 and -1.
• If the weighted sum of all input values is more than the threshold value, the neuron produces
an output signal; otherwise, no output is shown.
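
The notes state that the weights are learned automatically but do not give the update rule. The sketch below assumes the classic perceptron learning rule (on a misclassified sample, add learning_rate * label * input to the weights) with labels +1 and -1; the function names, the learning rate and the tiny AND-style data set are all hypothetical.

```python
def predict(weights, bias, x):
    """Step rule on the weighted sum: +1 if it is above zero, otherwise -1."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else -1

def train_perceptron(samples, labels, learning_rate=1.0, max_epochs=100):
    """Classic perceptron rule (an assumption; the source gives no update rule)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            if predict(weights, bias, x) != y:
                # Move the decision boundary toward the misclassified point
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
                mistakes += 1
        if mistakes == 0:   # every sample classified correctly: stop early
            break
    return weights, bias

# Hypothetical linearly separable data: logical AND with labels +1 / -1
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, -1, -1, 1]

weights, bias = train_perceptron(samples, labels)
print([predict(weights, bias, x) for x in samples])
```

Because the two classes are linearly separable, the loop stops once a full pass makes no mistakes; on data that is not linearly separable it would simply run until `max_epochs`.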

Mean Square Error Estimate


The mean squared error (MSE) tells you how close a regression line is to a set of points. It does this
by taking the distances from the points to the regression line (these distances are the “errors”) and
squaring them. The squaring is necessary to remove any negative signs. It also gives more weight
to larger differences. It’s called the mean squared error as you’re finding the average of a set of
errors. The lower the MSE, the better the prediction.

Mean Squared Error Example


MSE = (1/n) * Σ (actual – forecast)^2
Where:
• n = number of items,
• Σ = summation notation,
• actual = original or observed y-value,
• forecast = y-value from regression.

General steps to calculate the MSE from a set of X and Y values:


• Find the regression line.
• Insert your X values into the linear regression equation to find the new Y values (Y’).
• Subtract the new Y value from the original to get the error.
• Square the errors.
• Add up the squared errors (the Σ in the formula is summation notation).
• Find the mean.

Example Problem:
Find the MSE for the following set of values: (43,41), (44,45), (45,49), (46,47), (47,44).
Step 1: Find the regression line. A least-squares fit to these points gives the regression line
y = 9.2 + 0.8x.
Step 2: Find the new Y’ values:
• 9.2 + 0.8(43) = 43.6
• 9.2 + 0.8(44) = 44.4
• 9.2 + 0.8(45) = 45.2
• 9.2 + 0.8(46) = 46
• 9.2 + 0.8(47) = 46.8
Step 3: Find the error (Y – Y’):
• 41 – 43.6 = -2.6
• 45 – 44.4 = 0.6
• 49 – 45.2 = 3.8
• 47 – 46 = 1
• 44 – 46.8 = -2.8
Step 4: Square the Errors:
• (-2.6)^2 = 6.76
• (0.6)^2 = 0.36
• (3.8)^2 = 14.44
• (1)^2 = 1
• (-2.8)^2 = 7.84
Step 5: Add all of the squared errors up: 6.76 + 0.36 + 14.44 + 1 + 7.84 = 30.4.
Step 6: Find the mean squared error:
• 30.4 / 5 = 6.08
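
The worked example can be reproduced in plain Python. In this sketch the helper names `fit_line` and `mean_squared_error` are illustrative; the regression line is recomputed with the usual least-squares formulas, which is an assumption about how the line in Step 1 was obtained.

```python
def fit_line(xs, ys):
    """Least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def mean_squared_error(actual, forecast):
    """MSE = (1/n) * sum of squared (actual - forecast) differences."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

xs = [43, 44, 45, 46, 47]
ys = [41, 45, 49, 47, 44]

intercept, slope = fit_line(xs, ys)                  # Step 1
forecast = [intercept + slope * x for x in xs]       # Step 2
mse = mean_squared_error(ys, forecast)               # Steps 3 to 6

print(round(intercept, 2), round(slope, 2), round(mse, 2))  # 9.2 0.8 6.08
```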
Stochastic Approximation of LMS Algorithm
LMS or Gradient Descent
• The least mean square (LMS) algorithm uses a technique called the “method of steepest
descent” and continuously refines its estimate by updating filter weights. Through the
principle of algorithm convergence, the least mean square algorithm provides
particular learning curves useful in machine learning theory and implementation.
• Gradient Descent is a generic optimization algorithm capable of finding optimal
solutions to a wide range of problems.
• An important parameter of Gradient Descent (GD) is the size of the steps,
determined by the learning rate hyperparameter. If the learning rate is too small,
then the algorithm will have to go through many iterations to converge, which will
take a long time, and if it is too high, we may overshoot the optimal value.
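
The effect of the learning rate can be seen on a toy problem. The sketch below minimises the hypothetical function f(w) = (w - 3)^2, chosen only because its minimum at w = 3 is known; the three learning rates are illustrative.

```python
def gradient_descent(learning_rate, steps=50, start=0.0):
    """Minimise f(w) = (w - 3)**2, whose gradient is 2 * (w - 3)."""
    w = start
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w -= learning_rate * grad   # step against the gradient
    return w

# Too small: still far from 3 after 50 steps.
# Moderate: very close to 3.
# Too large: each step overshoots further and the iterates diverge.
for lr in (0.001, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```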
Stochastic Gradient Descent
• The word ‘stochastic’ means a system or process linked with a random probability. Hence,
in Stochastic Gradient Descent, a few samples (in the simplest form, a single sample) are
selected randomly instead of the whole data set for each iteration. In Gradient Descent,
there is a term called “batch” which denotes the total number of samples from a dataset that
is used for calculating the gradient for each iteration.
• So, in SGD, we find the gradient of the cost function of a single example at each
iteration instead of the sum of the gradients of the cost function over all the examples.
• In SGD, since only one sample from the dataset is chosen at random for each iteration, the
path taken by the algorithm to reach the minimum is usually noisier than that of standard
Gradient Descent. This matters little, as long as we reach the minimum, and with a
significantly shorter training time.
• Stochastic gradient descent is an optimization algorithm often used in machine learning
applications to find the model parameters that correspond to the best fit between
predicted and actual outputs. It’s an inexact but powerful technique.
• Stochastic gradient descent is widely used in machine learning applications. Combined with
backpropagation, it’s dominant in neural network training applications.
Algorithm
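No algorithm listing survives at this point in the notes. As a stand-in, here is a minimal sketch of stochastic (single-sample) LMS updates for a one-dimensional linear model y ≈ w*x + b. The function name, learning rate, step count and noise-free data are assumptions made for illustration, not the author's listing.

```python
import random

def lms_sgd(samples, learning_rate=0.1, steps=5000, seed=0):
    """Fit y ≈ w*x + b using one randomly chosen sample per update."""
    rng = random.Random(seed)        # fixed seed so the run is repeatable
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = rng.choice(samples)   # pick a single training example at random
        error = y - (w * x + b)      # prediction error on that example
        # Gradient of 0.5 * error**2 with respect to (w, b) is (-error * x, -error),
        # so stepping against it gives the LMS update below.
        w += learning_rate * error * x
        b += learning_rate * error
    return w, b

# Hypothetical noise-free data lying on the line y = 2x + 1
samples = [(i / 10, 2 * (i / 10) + 1) for i in range(11)]

w, b = lms_sgd(samples)
print(round(w, 3), round(b, 3))
```

Each update uses the gradient of the squared error on one sample only, which is what makes the path noisy but each iteration cheap.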
Sum of Error Estimate/ Standard Error Estimation
Standard Error Meaning
The standard error is one of the mathematical tools used in statistics to estimate the variability. It
is abbreviated as SE. The standard error of a statistic or an estimate of a parameter is the standard
deviation of its sampling distribution. We can define it as an estimate of that standard deviation.
Standard Error Formula
The SE formula indicates how accurately a sample describes a population. The deviation of the
sample mean from the population mean is given as:
SE = S / √n

Where S is the standard deviation and n is the number of observations.

Standard Error of Estimate (SEE)


The standard error of the estimate measures the accuracy of predictions. It is denoted as SEE.
The regression line minimizes the sum of squared deviations of prediction, which is also known
as the sum of squares error. SEE is the square root of the average squared deviation.
The deviation of estimates from intended values is given by the standard error of estimate
formula.

Where xi stands for data values, x bar is the mean value and n is the sample size.

How to calculate Standard Error


Step 1: Note the number of measurements (n) and determine the sample mean (μ). It is the
average of all the measurements.
Step 2: Determine how much each measurement varies from the mean.
Step 3: Square all the deviations determined in step 2 and add altogether: Σ (xi – μ) ²
Step 4: Divide the sum from step 3 by one less than the total number of measurements (n -1).
Step 5: Take the square root of the obtained number, which is the standard deviation (σ).
Step 6: Finally, divide the standard deviation obtained by the square root of the number of
measurements (n) to get the standard error of your estimate.
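
The six steps translate directly into Python. This is a sketch with a hypothetical data set; the cross-check uses `statistics.stdev` from the standard library, which computes the sample standard deviation with the same n - 1 divisor as Step 4.

```python
import math
import statistics

def standard_error(values):
    """Standard error of the mean, following Steps 1 to 6 above."""
    n = len(values)                                   # Step 1: count ...
    mean = sum(values) / n                            # ... and sample mean
    deviations = [x - mean for x in values]           # Step 2
    sum_sq = sum(dev ** 2 for dev in deviations)      # Step 3
    variance = sum_sq / (n - 1)                       # Step 4
    std_dev = math.sqrt(variance)                     # Step 5
    return std_dev / math.sqrt(n)                     # Step 6

# Hypothetical measurements, for illustration only
data = [2, 4, 4, 4, 5, 5, 7, 9]

se = standard_error(data)
se_check = statistics.stdev(data) / math.sqrt(len(data))
print(se, se_check)
```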
Standard Error Example
Calculate the standard error of the given data:
Y: 5, 10, 12, 15, 20
Solution: First we have to find the mean of the given data;
Mean = (5+10+12+15+20)/5 = 62/5 = 12.4
Now, the standard deviation can be calculated as;
S = √( Σ (Yi – Mean)² / (n – 1) )
  = √( (54.76 + 5.76 + 0.16 + 6.76 + 57.76) / 4 ) = √(125.2 / 4) = √31.3

After solving the above equation, we get;


• S ≈ 5.59
Therefore, SE can be estimated with the formula;
• SE = S/√n
• SE = 5.59/√5 ≈ 2.50
