MLunit 2 Mynotes
UNIT-II
1. Distance-based Methods
Distance-based algorithms are machine learning algorithms that classify queries
by computing distances between those queries and a number of internally stored
exemplars. The exemplars closest to the query have the largest influence on the
classification assigned to it.
Euclidean Distance
Euclidean distance is the most commonly used distance measure; in most cases,
when people talk about distance, they mean Euclidean distance, and it is also
known simply as distance. When data is dense or continuous, this is the best
proximity measure. The Euclidean distance between two points is the length of
the straight-line path connecting them, and the Pythagorean theorem gives this
distance: for points (x1, y1) and (x2, y2),
Euclidean distance = √((x2 − x1)² + (y2 − y1)²)
Manhattan Distance
The Manhattan distance between two different points (x1, y1) and (x2, y2) is
Manhattan distance = |x2 − x1| + |y2 − y1|
Diagrammatically, it corresponds to traversing from point A to point B along
axis-parallel (grid-like) segments rather than along the straight line
connecting them.
Minkowski Distance
The generalized form of the Euclidean and Manhattan distances is the Minkowski
distance. For points X = (x1, …, xn) and Y = (y1, …, yn) it can be expressed as
Minkowski distance = (Σᵢ |xi − yi|^p)^(1/p)
where p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.
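As a quick illustration, all three distances can be computed with one function.
This is a minimal Python sketch; the function name and the sample points are
made up for this example:

```python
def minkowski_distance(x, y, p):
    """Minkowski distance between two equal-length points.

    p = 1 reduces to Manhattan distance, p = 2 to Euclidean distance.
    """
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

a, b = (1, 2), (4, 6)                 # made-up sample points
print(minkowski_distance(a, b, p=1))  # Manhattan: |4-1| + |6-2| = 7.0
print(minkowski_distance(a, b, p=2))  # Euclidean: sqrt(3^2 + 4^2) = 5.0
```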
2. Nearest Neighbours
The abbreviation KNN stands for “K-Nearest Neighbour”. It is a supervised
machine learning algorithm. The algorithm can be used to solve both classification
and regression problem statements.
KNN calculates the distance from the unknown data point to all points in the
training data and filters out the ones with the shortest distances to it. As a
result, it is often referred to as a distance-based algorithm.
In order to correctly classify the results, we must first determine the value of K
(Number of Nearest Neighbours).
When the value of K is set to an even number, a situation may arise in which
the elements from both groups are equal; for example, with K = 4, the four
nearest neighbours may split two and two between the classes.
In this condition, the model would be unable to make a definite classification
and would randomly assign one of the two classes to the new, unknown data
point.
Choosing an odd value for K is preferred because such a state of equality
between the two classes can never occur: one of the two groups will always be
in the majority. This is why the value of K is selected as odd.
The majority vote works like an election: the public votes for the candidate
with whom they feel more connected, and when the votes for all of the
candidates have been recorded, the candidate with the most votes is declared
the election's winner. In the same way, the query point is assigned the class
that wins the vote among its K nearest neighbours.
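The whole procedure fits in a few lines of Python. The sketch below is
illustrative only: the function name, toy points, and labels are made up, not
taken from any library.

```python
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Squared Euclidean distance from the query to every training point
    dists = [sum((a - b) ** 2 for a, b in zip(p, query)) for p in train_points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up data: two clusters labelled "A" and "B"; k is odd to avoid ties
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(points, labels, query=(2, 2), k=3))  # -> "A"
```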
Decision Trees
A decision tree is one of the predictive modelling approaches used in
statistics, data mining, and machine learning.
Decision trees are constructed via an algorithmic approach that identifies ways to
split a data set based on different conditions. It is one of the most widely used
and practical methods for supervised learning. Decision Trees are a non-
parametric supervised learning method used for both classification and
regression tasks.
Tree models where the target variable can take a discrete set of values are called
classification trees. Decision trees where the target variable can take continuous
values (typically real numbers) are called regression trees. Classification And
Regression Tree (CART) is a general term covering both.
Approach to making a decision tree
While making a decision tree, at each node of the tree we ask different types
of questions. Based on the question asked, we calculate the information gain
corresponding to it.
Information Gain
Information gain measures how much a question reduces the impurity
(uncertainty) of the data when the node is split on it; the question with the
highest information gain is chosen for the split.
Pure
Pure means that, in a selected sample of the dataset, all of the data belongs
to the same class.
Impure
Impure means that the data is a mixture of different classes.
Definition of Gini Impurity
Gini Impurity is a measurement of the likelihood of an incorrect classification of
a new instance of a random variable, if that new instance were randomly
classified according to the distribution of class labels from the data set.
Steps (applied recursively at each node):
1. Get the list of rows (the dataset) that are taken into consideration for
making the decision tree.
2. Calculate the uncertainty of our dataset, i.e. its Gini impurity, or how
mixed up the data is.
3. Generate the list of all questions that need to be asked at that node.
4. Partition the rows into True rows and False rows based on each question
asked.
5. Calculate the information gain based on the Gini impurity and the partition
of data from the previous step.
6. Update the highest information gain seen for the questions asked so far.
7. Update the best question based on the information gain (higher information
gain wins).
8. Divide the node on the best question, and repeat from step 1 until we get
pure nodes (leaf nodes); a sketch of steps 2 and 5 follows this list.
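Steps 2 and 5 can be sketched as follows. The helper names and the toy fruit
rows are illustrative assumptions, not part of any particular library:

```python
from collections import Counter

def gini(rows):
    """Gini impurity of a list of rows; the class label is the last column."""
    counts = Counter(row[-1] for row in rows)
    n = len(rows)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def info_gain(true_rows, false_rows, current_impurity):
    """Impurity reduction from partitioning a node into two child nodes."""
    p = len(true_rows) / (len(true_rows) + len(false_rows))
    return current_impurity - p * gini(true_rows) - (1 - p) * gini(false_rows)

# Made-up rows: [diameter, class label]
rows = [[2.1, "apple"], [1.3, "lemon"], [3.5, "apple"], [0.9, "lemon"]]
true_rows = [r for r in rows if r[0] >= 2.0]   # question: diameter >= 2.0?
false_rows = [r for r in rows if r[0] < 2.0]
print(info_gain(true_rows, false_rows, gini(rows)))  # 0.5: a perfect split
```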
Disadvantages of Decision Tree
Prone to overfitting.
Require some kind of measurement as to how well they are doing.
Need to be careful with parameter tuning.
Can create biased learned trees if some classes dominate.
Overfitting is one of the major problems for every model in machine learning.
If a model is overfitted, it will generalize poorly to new samples. To keep a
decision tree from overfitting, we remove the branches that make use of
features having low importance. This method is called pruning or post-pruning.
In this way we reduce the complexity of the tree, and hence improve predictive
accuracy through the reduction of overfitting.
Pruning should reduce the size of a learning tree without reducing predictive
accuracy as measured by a cross-validation set. There are 2 major Pruning
techniques.
Minimum Error: The tree is pruned back to the point where the cross-
validated error is a minimum.
Smallest Tree: The tree is pruned back slightly further than the minimum
error. Technically the pruning creates a decision tree with cross-
validation error within 1 standard error of the minimum error.
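As one concrete way to experiment with this, scikit-learn exposes
cost-complexity pruning through the ccp_alpha parameter of its decision tree.
The sketch below assumes scikit-learn is installed and uses its bundled iris
sample purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Larger ccp_alpha prunes more branches, yielding smaller trees
for alpha in (0.0, 0.01, 0.05):
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    print(alpha, tree.get_n_leaves(), tree.score(X_te, y_te))
```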
Naïve Bayes
The Naïve Bayes algorithm is made up of two words, "Naïve" and "Bayes", which
can be described as:
Naïve: it assumes that the occurrence of a certain feature is independent of
the occurrence of the other features.
Bayes: it is based on Bayes' theorem. Bayes' theorem is also known as Bayes'
Rule or Bayes' law; it is used to determine the probability of a hypothesis
with prior knowledge, and it depends on conditional probability.
The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) · P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the
observed event B;
P(B|A) is the likelihood: the probability of the evidence B given that
hypothesis A is true;
P(A) is the prior probability of the hypothesis before observing the evidence;
P(B) is the marginal probability of the evidence.
Applications of the Naïve Bayes classifier include:
Spam detection
Sentiment (emotional) analysis
Article categorization
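As a minimal sketch of the spam-detection use case, assuming scikit-learn is
available; the four messages and their labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up tiny corpus with labels: 1 = spam, 0 = ham
messages = ["win cash now", "free prize claim now",
            "lunch at noon?", "see you at the meeting"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(messages)           # word-count features
model = MultinomialNB().fit(X, labels)    # learns P(word | class) from counts

print(model.predict(vec.transform(["claim your free cash"])))  # likely [1]
```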
Linear Models
Linear regression is one of the easiest and most popular Machine Learning
algorithms. It is a statistical method that is used for predictive analysis. Linear
regression makes predictions for continuous/real or numeric variables
such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent
variable (y) and one or more independent variables (x), hence the name linear
regression. Since linear regression shows a linear relationship, it finds how
the value of the dependent variable changes according to the value of the
independent variable.
The linear regression model provides a sloped straight line representing the
relationship between the variables.
Mathematically, we can represent a linear regression as:
y = a0 + a1·x + ε
Here,
y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
The values for x and y variables are training datasets for Linear Regression model
representation.
Linear regression can be further divided into two types of the algorithm:
Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Simple
Linear Regression.
The key point in Simple Linear Regression is that the dependent variable
must be a continuous/real value.
However, the independent variable can be measured on continuous or categorical
values.
Simple Linear regression algorithm has mainly two objectives:
• Model the relationship between the two variables. Such as the relationship
between Income and expenditure, experience and Salary, etc.
• Forecasting new observations. Such as Weather forecasting according to
temperature, Revenue of a company according to the investments in a year, etc.
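The coefficients a0 and a1 can be fitted in closed form by ordinary least
squares. The sketch below assumes NumPy is available; the experience/salary
numbers are made-up training data:

```python
import numpy as np

# Made-up training data: years of experience (x) vs. salary in thousands (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])

# Closed-form least-squares estimates for y = a0 + a1*x + error
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()

print(f"y = {a0:.2f} + {a1:.2f}*x")  # the fitted regression line
print(a0 + a1 * 6.0)                 # forecast a new observation at x = 6
```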
Binary Classification
In binary classification, the model chooses between exactly two possible
classes (for example, spam vs. not spam).
Multiclass Classification
In multiclass classification, the model chooses among more than two possible
classes (for example, recognizing which of the ten digits appears in an image,
as in MNIST below).
MNIST
The MNIST database (Modified National Institute of Standards and Technology
database) is a large database of handwritten digits, widely used for training
and testing in the field of machine learning.
It was created by "re-mixing" the samples from NIST's original datasets.
The creators felt that since NIST's training dataset was taken from American
Census Bureau employees, while the testing dataset was taken from American high
school students, it was not well-suited for machine learning experiments.
Furthermore, the black and white images from NIST were normalized to fit into a
28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.
The MNIST database contains 60,000 training images and 10,000 testing images.
Half of the training set and half of the test set were taken from NIST's training
dataset, while the other half of the training set and the other half of the test set were
taken from NIST's testing dataset. The original creators of the database keep a list
of some of the methods tested on it. In their original paper, they use a support-
vector machine to get an error rate of 0.8%.
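As a sketch, the 60,000/10,000 split can be verified by loading the database
through Keras (this assumes TensorFlow is installed; keras.datasets.mnist is
one common distribution of MNIST):

```python
from tensorflow.keras.datasets import mnist

# Downloads the data on first use and returns the standard train/test split
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape)  # (60000, 28, 28): 60,000 grayscale 28x28 training images
print(x_test.shape)   # (10000, 28, 28): 10,000 testing images
print(y_train[:5])    # digit labels in the range 0-9
```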