Machine Learning
Arthur Samuel, a pioneer in the field of artificial intelligence and computer gaming, coined the term "Machine Learning", defining it as the "field of study that gives computers the capability to learn without being explicitly programmed".
Model: A model is a specific representation learned from data by applying some machine
learning algorithm. A model is also called a hypothesis.
Feature: A feature is an individual measurable property of our data. A set of numeric features
can be conveniently described by a feature vector. Feature vectors are fed as input to the
model. For example, in order to predict a fruit, there may be features like color, smell,
taste, etc.
Target(Label): A target variable or label is the value to be predicted by our model. For the
fruit example discussed in the features section, the label with each set of input would be the
name of the fruit like apple, orange, banana, etc.
Training: The idea is to give a set of inputs (features) and their expected outputs (labels), so that after training we have a model (hypothesis) that will then map new data to one of the categories it was trained on.
Prediction: Once our model is ready, it can be fed a set of inputs to which it will provide a
predicted output(label).
Types of Learning
• Supervised Learning
• Unsupervised Learning
• Semi-Supervised Learning
1. Supervised Learning: Supervised learning is when the model is trained on a labelled dataset, i.e. one that contains both the input and the output parameters. In this type of learning, both the training and validation datasets are labelled.
Types of Supervised Learning:
• Classification
• Regression
Commonly used supervised learning algorithms:
• Linear Regression
• Nearest Neighbors
• Decision Trees
• Random Forest
Unsupervised Learning:
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns and differences, without any prior training on the data. Unsupervised machine learning is more challenging than supervised learning due to the absence of labels.
Types of Unsupervised Learning:
• Clustering
• Association
Clustering: A clustering problem is where you want to discover the inherent groupings
in the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy Y.
The most basic disadvantage of any supervised learning algorithm is that the dataset has to be hand-labeled, either by a Machine Learning Engineer or a Data Scientist. This is a very costly process, especially when dealing with large volumes of data. The most basic disadvantage of unsupervised learning is that its application spectrum is limited.
Semi-supervised machine learning:
To counter these disadvantages, the concept of Semi-Supervised Learning was introduced. In this
type of learning, the algorithm is trained upon a combination of labeled and unlabeled data.
Typically, this combination will contain a very small amount of labeled data and a very large
amount of unlabeled data.
Intuitively, one may imagine the three types of learning algorithms as follows: supervised learning is where a student is under the supervision of a teacher at both home and school; unsupervised learning is where a student has to figure out a concept on their own; and semi-supervised learning is where a teacher teaches a few concepts in class and gives homework questions based on similar concepts.
REGRESSION
# simple linear regression (train is assumed to be a DataFrame split from the dataset)
from sklearn import linear_model
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(train_x, train_y)
print('Coefficients:', regr.coef_)  # the coefficients
print('Intercept:', regr.intercept_)
Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the values of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes the outcome, target or criterion variable).
Simple linear regression
• Predict CO2 emission from the engine size of all cars
  - Independent variable (x): engine size
  - Dependent variable (y): CO2 emission
Multiple linear regression
• Predict CO2 emission from the engine size and number of cylinders of all cars
  - Independent variables (x): engine size, cylinders
  - Dependent variable (y): CO2 emission
from sklearn import linear_model
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE','CYLINDERS']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(train_x, train_y)
# The coefficients
print('Coefficients:', regr.coef_)
print('Intercept:', regr.intercept_)
Polynomial regression
from sklearn.preprocessing import PolynomialFeatures
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])
poly = PolynomialFeatures(degree=2)
train_x_poly = poly.fit_transform(train_x)
train_x_poly.shape
fit_transform takes our x values and outputs our data raised from power 0 to power 2 (since we set the degree of our polynomial to 2). In our example, each engine-size value x is expanded to [1, x, x²]. Now we can treat it as a 'linear regression' problem; polynomial regression is therefore considered a special case of traditional multiple linear regression, and we can use the same mechanism, i.e. the LinearRegression() estimator, to solve it:
clf = linear_model.LinearRegression()
clf.fit(train_x_poly, train_y)
# The coefficients
print('Coefficients:', clf.coef_)
print('Intercept:', clf.intercept_)
Decision tree regression
A decision tree builds regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy), each representing a value of the attribute tested. A leaf node (e.g., Hours Played) represents a decision on the numerical target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
Decision tree regression observes the features of an object and trains a model in the structure of a tree to predict future data and produce meaningful continuous output. Continuous output means that the output/result is not discrete, i.e. it is not represented by just a discrete, known set of numbers or values.
Discrete output example: a weather prediction model that predicts whether or not there will be rain on a particular day.
Continuous output example: a profit prediction model that states the probable profit that can be generated from the sale of a product.
Code:
# import the regressor
from sklearn.tree import DecisionTreeRegressor
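Continuing from the import above, a minimal sketch of fitting the regressor might look like the following; the arrays X and y are small hypothetical values (engine size vs. CO2 emissions) used only for illustration and are not part of the original dataset:
# import the regressor
from sklearn.tree import DecisionTreeRegressor
import numpy as np

# hypothetical training data: engine size (feature) vs CO2 emissions (continuous target)
X = np.array([[1.5], [2.0], [2.4], [3.5], [3.7], [4.0]])
y = np.array([177, 221, 244, 255, 260, 271])

# create the regressor and fit it to the training data
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X, y)

# predict a continuous value for a new engine size
print(regressor.predict([[3.0]]))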
Random forest regression
The Random Forest is one of the most effective machine learning models for predictive analytics, making it an industrial workhorse for machine learning. The random forest model is a type of additive model that makes predictions by combining decisions from a sequence of base models, where each base model is a simple decision tree. This broad technique of using multiple models to obtain better predictive performance is called model ensembling. In random forests, all the base models are constructed independently, each using a different subsample of the data.
Approach:
1. Pick K data points at random from the training set.
2. Build the decision tree associated with those K data points.
3. Choose the number Ntree of trees you want to build and repeat steps 1 and 2.
4. For a new data point, make each one of your Ntree trees predict the value of Y for the data point, and assign the new data point the average across all of the predicted Y values.
Code
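The original code listing is not reproduced here; below is a minimal, hedged sketch of the approach using scikit-learn's RandomForestRegressor, where n_estimators plays the role of Ntree and the data is hypothetical:
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# hypothetical training data: one feature and a continuous target
X = np.array([[1.5], [2.0], [2.4], [3.5], [3.7], [4.0]])
y = np.array([177, 221, 244, 255, 260, 271])

# n_estimators is the number of trees (Ntree); each tree is grown on a random bootstrap sample
regr = RandomForestRegressor(n_estimators=100, random_state=0)
regr.fit(X, y)

# the prediction for a new point is the average of the individual trees' predictions
print(regr.predict([[3.0]]))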
LOGISTIC REGRESSION
On the basis of the categories of the target variable, logistic regression can be classified into three types:
• Binomial: the target variable can have only 2 possible types, "0" or "1", which may represent "win" vs "loss", "pass" vs "fail", "dead" vs "alive", etc.
• Multinomial: the target variable can have 3 or more possible types which are not ordered (i.e. the types have no quantitative significance), like "disease A" vs "disease B" vs "disease C".
• Ordinal: it deals with target variables with ordered categories. For example, a test score can be categorized as "very poor", "poor", "good" or "very good". Here each category can be given a score like 0, 1, 2, 3.
• Start with binary class problems
How do we develop a classification algorithm?
• Tumour size vs malignancy (0 or 1)
• We could use linear regression
• Then threshold the classifier output (i.e. anything over some value is yes, else no)
• In our example below, linear regression with thresholding seems to work
• We can see above that this does a reasonable job of stratifying the data points into one of two classes
• But what if we had a single Yes with a very small tumour?
• This would lead to classifying all the existing yeses as nos
• Another issue with linear regression:
• We know y is 0 or 1
• The hypothesis can give values larger than 1 or less than 0
• So, logistic regression generates a value that is always between 0 and 1
• Logistic regression is a classification algorithm - don't be confused
Hypothesis representation
• What function is used to represent our hypothesis in classification?
• We want our classifier to output values between 0 and 1
• When using linear regression we did hθ(x) = θT x
• For the classification hypothesis representation we do hθ(x) = g(θT x)
• Where we define g(z) = 1 / (1 + e^-z)
• z is a real number
• This is the sigmoid function, or the logistic function
• If we combine these equations we can write out the hypothesis as hθ(x) = 1 / (1 + e^-(θT x))
• What does the sigmoid function look like? It is an S-shaped curve that tends to 0 as z → -∞, tends to 1 as z → +∞, and equals 0.5 at z = 0
When our hypothesis (hθ(x)) outputs a number, we treat that value as the estimated probability that
y=1 on input x
• Example
• If X is a feature vector with x0 = 1 (as always) and x1 = tumourSize
• hθ(x) = 0.7
• Tells a patient they have a 70% chance of a tumor being malignant
hθ(x) = P(y=1|x ; θ)
• What does this mean?
• Probability that y=1, given x, parameterized by θ
•Since this is a binary classification task we know y = 0 or 1
• So the following must be true
• P(y=1|x ; θ) + P(y=0|x ; θ) = 1
• P(y=0|x ; θ) = 1 - P(y=1|x ; θ)
Decision boundary
• This gives a better sense of what the hypothesis function is computing
• One way of using the sigmoid function is:
• When the probability of y being 1 is greater than 0.5, we predict y = 1
• Else we predict y = 0
• When exactly is hθ(x) greater than 0.5?
• Look at the sigmoid function: g(z) is greater than or equal to 0.5 when z is greater than or equal to 0
• So if z is positive, g(z) is greater than 0.5
• Since z = θT x, when θT x >= 0 then hθ(x) >= 0.5
• So what we've shown is that the hypothesis predicts y = 1 when θT x >= 0
• The corollary is that when θT x <= 0 the hypothesis predicts y = 0
• Let's use this to better understand how the hypothesis makes its predictions
Consider,
hθ(x) = g(θ0 + θ1x1 + θ2x2)
For example, if θ0 = -3, θ1 = 1 and θ2 = 1, then the hypothesis predicts y = 1 whenever -3 + x1 + x2 >= 0, i.e. whenever x1 + x2 >= 3. The line x1 + x2 = 3 is the decision boundary: it separates the region where we predict y = 1 from the region where we predict y = 0.
Cost function
• If we use the squared-error cost from linear regression with the sigmoid hypothesis, J(θ) is non-convex, with many local optima
• To get around this we need a different, convex Cost() function, which means we can apply gradient descent:
• Cost(hθ(x), y) = -log(hθ(x)) if y = 1
• Cost(hθ(x), y) = -log(1 - hθ(x)) if y = 0
The above two functions can be compressed into a single function, i.e.
Cost(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x))
Gradient Descent
Now the question arises: how do we reduce the cost value? This can be done by using Gradient Descent. The main goal of gradient descent is to minimize the cost value, i.e. min J(θ).
Now, to minimize our cost function we need to run the gradient descent update on each parameter, i.e.
θj := θj - α ∂J(θ)/∂θj (simultaneously for all j)
Gradient descent has an analogy in which we imagine ourselves at the top of a mountain valley, left stranded and blindfolded; our objective is to reach the bottom of the hill. Feeling the slope of the terrain around you is what anyone would do. This action is analogous to calculating the gradient, and taking a step is analogous to one iteration of the update to the parameters.
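As a rough sketch (not from the original notes), the update rule above can be implemented directly with NumPy for logistic regression. Here X is assumed to already contain the bias column x0 = 1, y holds 0/1 labels, and the learning rate alpha and iteration count are arbitrary illustrative choices:
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    # X: (m, n) matrix with x0 = 1 as the first column; y: (m,) labels in {0, 1}
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)             # h_theta(x) for every training example
        gradient = (X.T @ (h - y)) / m     # partial derivatives of J(theta)
        theta -= alpha * gradient          # simultaneous update of every theta_j
    return theta

# tiny illustrative dataset: bias column plus one feature (e.g. tumour size)
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0, 0, 1, 1])
theta = gradient_descent(X, y)
print(theta, sigmoid(X @ theta))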
Multiclass classification problems
•Getting logistic regression for multiclass classification using one vs. all
•Multiclass - more than yes or no (1 or 0)
• Classification with multiple classes for assignment
•Given a dataset with three classes, how do we get a learning algorithm to work?
• Use one vs. all classification make binary classification work for multiclass classification
• One vs. all classification
• Split the training set into three separate binary classification problems
• i.e. create a new fake training set
• Triangle (1) vs crosses and squares (0): hθ(1)(x) = P(y=1 | x; θ)
• Crosses (1) vs triangles and squares (0): hθ(2)(x) = P(y=1 | x; θ)
• Squares (1) vs crosses and triangles (0): hθ(3)(x) = P(y=1 | x; θ)
• Train a logistic regression classifier hθ(i)(x) for each class i to predict the probability that y = i
• On a new input x, to make a prediction, pick the class i that maximizes hθ(i)(x), as in the sketch below
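A short sketch of one-vs-all with scikit-learn is shown below; OneVsRestClassifier fits one binary logistic regression classifier per class and, for a new x, picks the class whose classifier reports the highest probability. The points and labels are purely illustrative:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
import numpy as np

# illustrative 2-feature points belonging to three classes (0, 1, 2)
X = np.array([[1, 1], [1, 2], [5, 5], [6, 5], [9, 1], [9, 2]])
y = np.array([0, 0, 1, 1, 2, 2])

# one binary classifier h_theta^(i)(x) is fitted for each class i
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, y)

# prediction picks the class with the largest estimated probability
print(ovr.predict([[8, 1]]))
print(ovr.predict_proba([[8, 1]]))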
K-Nearest Neighbors
This algorithm classifies cases based on their similarity to other cases.
In K-Nearest Neighbors, data points that are near each other are said to be
neighbors.
Similar cases with the same class labels are near each other.
Thus, the distance between two cases is a measure of their dissimilarity.
There are different ways to calculate the similarity or conversely,
the distance or dissimilarity of two data points.
For example, this can be done using Euclidean distance.
The K-Nearest Neighbors algorithm works as follows:
- Calculate the distance from the new case (the holdout) to each of the cases in the dataset.
- Search for the K observations in the training data that are nearest to the measurements of the unknown data point.
- Predict the response of the unknown data point using the most popular response value from the K nearest neighbors.
There are two parts of this algorithm that might be a bit confusing: how to compute the similarity between cases (discussed above) and how to choose the right value of K. A low value of K produces a highly complex model, which might result in overfitting; the prediction process is then not generalized enough to be used for out-of-sample cases.
Out-of-sample data is data that is outside of the data set used to train the model.
In other words, it cannot be trusted to be used for prediction of unknown samples. It's important to remember that
overfitting is bad, as we want a general model that works for any data, not just the data used for training.
Now, on the opposite side of the spectrum, if we choose a very high value of K such as K equals 20,
then the model becomes overly generalized.
So, how can we find the best value for K?
The general solution is to reserve a part of your data for testing the accuracy of the model.
Once you've done so, choose K equals one and then use the training part for modeling and calculate the accuracy of
prediction using all samples in your test set.
Repeat this process increasing the K and see which K is best for your model.
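A hedged sketch of that procedure with scikit-learn is shown below, assuming feature and label arrays are available; here a tiny hypothetical dataset is used, KNeighborsClassifier is evaluated on a held-out test set for several values of K, and the best K is the one with the highest accuracy:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# hypothetical data: two well-separated groups of points
X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

# try several values of K and keep the one with the best test-set accuracy
for k in range(1, 4):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, accuracy_score(y_test, knn.predict(X_test)))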
Advantages of KNN
1. No Training Period: KNN is called Lazy Learner (Instance based learning). It does not learn
anything in the training period. It does not derive any discriminative function from the training
data. In other words, there is no training period for it. It stores the training dataset and learns
from it only at the time of making real time predictions. This makes the KNN algorithm much
faster than other algorithms that require training e.g. SVM, Linear Regression etc.
2. Since the KNN algorithm requires no training before making predictions, new data can be
added seamlessly which will not impact the accuracy of the algorithm.
3. KNN is very easy to implement. There are only two parameters required to implement KNN i.e.
the value of K and the distance function (e.g. Euclidean or Manhattan etc.)
Disadvantages of KNN
1. Does not work well with large datasets: in large datasets, the cost of calculating the distance between the new point and each existing point is huge, which degrades the performance of the algorithm.
2. Does not work well with high dimensions: the KNN algorithm doesn't work well with high-dimensional data because, with a large number of dimensions, it becomes difficult for the algorithm to calculate the distance in each dimension.
3. Sensitive to noisy data, missing values and outliers: KNN is sensitive to noise in the dataset. We need to manually impute missing values and remove outliers.
SUPPORT VECTOR MACHINE(SVM)
A Support Vector Machine is a supervised algorithm that can classify cases by finding a
separator.
SVM works by first mapping data to a high dimensional feature space so that data points can
be categorized, even when the data are not linearly separable.
Then, a separator is estimated for the data. The data should be transformed in such a
way that a separator could be drawn as a hyperplane.
Therefore, the SVM algorithm outputs an optimal hyperplane that categorizes new examples.
DATA TRANSFORMATION
For the sake of simplicity, imagine that our dataset is one-dimensional, i.e. we have only one feature x. As you can see, it is not linearly separable. However, we can transform it into a two-dimensional space: for example, we can increase the dimensionality of the data by mapping x into a new space using a function that outputs x and x squared.
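In practice this mapping is usually done implicitly through a kernel rather than by hand. Below is a minimal sketch with scikit-learn's SVC on hypothetical one-dimensional data that is not linearly separable (the inner points form one class, the outer points the other); a degree-2 polynomial kernel plays the role of the x, x-squared mapping described above:
from sklearn import svm
import numpy as np

# one-dimensional data that is not linearly separable:
# class 1 sits in the middle, class 0 on both sides
X = np.array([[-4], [-3], [-1], [0], [1], [3], [4]])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# a degree-2 polynomial kernel implicitly works in a space containing x^2,
# where a separating hyperplane can be found
clf = svm.SVC(kernel='poly', degree=2)
clf.fit(X, y)
print(clf.predict([[0.5], [5]]))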
ADVANTAGES
- Accurate in high-dimensional spaces
- Memory efficient
DISADVANTAGES
- Not suited to large datasets (training is computationally expensive)
- Prone to overfitting if the number of features is much greater than the number of samples
APPLICATIONS
- Image recognition
- Spam detection
NAIVE BAYES CLASSIFIERS
A collection of classification algorithms
Principle of Naive Bayes Classifier:
A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks. The crux of the classifier is based on Bayes' theorem.
It is not a single algorithm but a family of algorithms where all of them share a
common principle, i.e. every pair of features being classified is independent of
each other.
Example:
Let us take an example to get some better intuition. Consider the problem of playing golf. The dataset is represented as below.
We classify whether the day is suitable for playing golf, given the features of the day. The columns represent these features and the rows represent individual entries. If we take the first row of the dataset, we can observe that the day is not suitable for playing golf if the outlook is rainy, the temperature is hot, the humidity is high and it is not windy. We make two assumptions here. First, as stated above, we consider that these predictors are independent: if the temperature is hot, it does not necessarily mean that the humidity is high. Second, all the predictors are assumed to have an equal effect on the outcome: the day being windy does not have more importance in deciding whether to play golf or not.
According to this example, Bayes' theorem can be rewritten as:
P(y | x1, ..., xn) = [P(x1 | y) P(x2 | y) ... P(xn | y) P(y)] / [P(x1) P(x2) ... P(xn)]
where y is the class variable ("play golf") and x1, ..., xn are the features of the day. Now, you can obtain the values for each term by looking at the dataset and substitute them into the equation. For all entries in the dataset the denominator does not change; it remains static. Therefore, the denominator can be removed and a proportionality can be introduced:
P(y | x1, ..., xn) ∝ P(y) P(x1 | y) P(x2 | y) ... P(xn | y)
In our case, the class variable (y) has only two outcomes, yes or no, but there could be cases where the classification is multiclass. In either case we need to find the class y with the maximum probability:
y = argmax over y of P(y) P(x1 | y) P(x2 | y) ... P(xn | y)
Using the above function, we can obtain the class, given the predictors.
We need to find P(xi | yj) for each xi in X and yj in y. All these calculations have
been demonstrated in the tables below:
So, in the figure above, we have calculated P(xi | yj) for each xi in X and yj in y manually in tables 1-4. For example, the probability of playing golf given that the temperature is cool, i.e. P(temp. = cool | play golf = Yes), is 3/9.
Also, we need to find the class probabilities P(y), which have been calculated in table 5. For example, P(play golf = Yes) = 9/14.
So now, we are done with our pre-computations and the classifier is ready!
Let us test it on a new set of features (let us call it today):
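The probability tables themselves are not reproduced here, but the same kind of test can be sketched in code. Below is a hedged example using scikit-learn's CategoricalNB on a small hand-made, golf-style table (the rows are illustrative, not the exact dataset above), predicting whether to play on a new day "today":
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# illustrative golf-style data (not the exact table from the example above)
data = pd.DataFrame({
    'Outlook':     ['Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny', 'Overcast'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool'],
    'Humidity':    ['High', 'High', 'High', 'Normal', 'Normal', 'Normal'],
    'Windy':       ['False', 'True', 'False', 'False', 'True', 'True'],
    'PlayGolf':    ['No', 'No', 'Yes', 'Yes', 'Yes', 'Yes'],
})

# encode the categorical features as integers for CategoricalNB
enc = OrdinalEncoder()
X = enc.fit_transform(data.drop(columns='PlayGolf'))
y = data['PlayGolf']

clf = CategoricalNB()
clf.fit(X, y)

# "today": sunny outlook, hot temperature, normal humidity, not windy
today = enc.transform([['Sunny', 'Hot', 'Normal', 'False']])
print(clf.predict(today), clf.predict_proba(today))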
Types of Naive Bayes Classifier:
Multinomial Naive Bayes: the predictors are discrete counts, for example word frequencies in a document; this variant is mostly used for document classification problems.
Bernoulli Naive Bayes: this is similar to multinomial naive Bayes, but the predictors are boolean variables. The parameters that we use to predict the class variable take up only the values yes or no, for example whether a word occurs in the text or not.
Gaussian Naive Bayes: when the predictors take up a continuous value and are not discrete, we assume that these values are sampled from a Gaussian distribution.
Decision Tree Classification (ID3)
Step 4b: A branch with entropy more than 0 needs further splitting.
Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
Decision Tree to Decision Rules
A decision tree can easily be transformed to a set of rules by mapping from the
root node to the leaf nodes one by one.
Limitations to Decision Trees
Decision trees tend to have high variance when they utilize different training and
test sets of the same data, since they tend to overfit on training data. This leads to
poor performance on unseen data. Unfortunately, this limits the usage of decision
trees in predictive modeling.
To overcome these problems we use ensemble methods: we can create models that use underlying (weak) decision trees as a foundation for producing powerful results, and this is what the Random Forest algorithm does.
Random forest
Definition:
The random forest algorithm is a supervised classification algorithm based on decision trees. Random forests (also known as random decision forests) are a popular ensemble method that can be used to build predictive models for both classification and regression problems.
By ensemble we mean (in the random forest context) the collective decision of different decision trees. In a random forest, we make a prediction about the class not simply based on one decision tree, but by an (almost) unanimous prediction made by 'K' decision trees.
Construction:
'K' individual decision trees are built from the given dataset by randomly dividing the dataset and the feature subspace through a process called bootstrap aggregation (bagging), which is random selection with replacement. Generally about 2/3rd of the dataset (row-wise) is selected by bagging, and on that selected data we perform what is called attribute bagging. Attribute bagging selects 'm' features from the given M features (this process is also called random subspace creation); generally the value of 'm' is the square root of M. We select, say, 10 such subsets of m features, build 10 decision trees based on them, and test them on the remaining 1/3rd of the dataset. We then select the best decision tree of these, and repeat the whole process 'K' times to build 'K' such decision trees.
Classification:
Prediction in a random forest (a collection of 'K' decision trees) is truly an ensemble: each decision tree predicts the class of the instance, and the class that is predicted most often is returned.
Using the Random Forest Classifier:
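A minimal sketch with scikit-learn's RandomForestClassifier is given below; the bundled iris dataset is used purely for illustration, n_estimators corresponds to the 'K' trees, and max_features='sqrt' is the m = sqrt(M) attribute-bagging rule described above:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 trees, each grown on a bootstrap sample, considering sqrt(M) features per split
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
clf.fit(X_train, y_train)

# each tree votes for a class; the most frequent class is returned
print(accuracy_score(y_test, clf.predict(X_test)))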
K-means Algorithm
K-means is the simplest among unsupervised learning algorithms. It works on the principle of k-means clustering: the number of clustered groups (clusters) for a given set of data is represented by a variable 'k'. For each cluster, a centroid (the arithmetic mean of all the data points that belong to that cluster) is defined.
The centroid is a data point present at the centre of each cluster (considering Euclidean distance). The trick is to define the centroids far away from each other so that the variation is less. After this, each data point is assigned to the nearest centroid, such that the sum of the squared distances between the data points and their cluster's centroid is at a minimum.
Algorithm
1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid or mean of all objects in each cluster.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.
The Euclidean distance between two points in either the plane or 3-dimensional
space measures the length of a segment connecting the two points.
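A hedged sketch of the algorithm with scikit-learn's KMeans on a few hypothetical 2-D points is shown below; n_clusters is the predefined k:
from sklearn.cluster import KMeans
import numpy as np

# hypothetical 2-D points forming two obvious groups
X = np.array([[1, 2], [1, 4], [2, 3], [8, 8], [9, 9], [8, 10]])

# k = 2 clusters; fit() iteratively assigns points to the nearest centroid and recomputes centroids
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)            # cluster index of each training point
print(kmeans.cluster_centers_)   # final centroids
print(kmeans.predict([[0, 0], [10, 10]]))  # assign new points to the nearest centroid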
The k-means clustering algorithm has been found to be very useful in grouping new data. Some practical applications which use k-means clustering are sensor measurements, activity monitoring in a manufacturing process, audio detection and image segmentation.
Disadvantages of K-means:
K-means forms spherical clusters only. The algorithm fails when the data is not spherical (i.e. does not have the same variance in all directions).
K-means is sensitive to outliers. Outliers can skew the clusters to a very large extent.
K-means requires one to specify the number of clusters, and there is no universal method for choosing the best value.
Hierarchical Clustering Algorithms
Last but not least are the hierarchical clustering algorithms. These algorithms sort clusters into an order based on the hierarchy of similarity between data observations. Hierarchical clustering is categorised into two types: divisive (top-down) clustering and agglomerative (bottom-up) clustering.
Agglomerative hierarchical clustering: in this technique, each data point is initially considered an individual cluster. At each iteration, similar clusters merge with other clusters until one cluster (or K clusters) remains.
Divisive hierarchical clustering: this is exactly the opposite of agglomerative hierarchical clustering. We consider all the data points as a single cluster and, in each iteration, we separate from the cluster the data points which are not similar. Each data point which is separated is considered an individual cluster.
Most of the hierarchical algorithms such as single linkage, complete linkage,
median linkage, Ward’s method, among others, follow the agglomerative
approach.
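A brief sketch using scikit-learn's AgglomerativeClustering (bottom-up) on hypothetical 2-D points is given below; the linkage parameter selects one of the similarity approaches discussed next ('single' for MIN, 'complete' for MAX, 'average' for group average):
from sklearn.cluster import AgglomerativeClustering
import numpy as np

# hypothetical 2-D points
X = np.array([[1, 2], [1, 4], [2, 3], [8, 8], [9, 9], [8, 10]])

# merge clusters bottom-up until 2 clusters remain; 'average' linkage = group average
clustering = AgglomerativeClustering(n_clusters=2, linkage='average').fit(X)
print(clustering.labels_)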
Calculating the similarity between two clusters is important in order to merge or divide clusters. There are certain approaches which are used to calculate this similarity:
MIN: also known as the single-linkage algorithm. The similarity of two clusters C1 and C2 is defined as the minimum of the similarities between points Pi and Pj such that Pi belongs to C1 and Pj belongs to C2.
This approach can separate non-elliptical shapes as long as the gap between two
clusters is not small.
MIN approach cannot separate clusters properly if there is noise between clusters.
MAX: Also known as the complete linkage algorithm, this is exactly opposite to
the MIN approach. The similarity of two clusters C1 and C2 is equal to
the maximum of the similarity between points Pi and Pj such that Pi belongs to
C1 and Pj belongs to C2.
MAX approach does well in separating clusters if there is noise between clusters
but Max approach tends to break large clusters.
Group Average: take all pairs of points, compute their similarities and calculate the average of the similarities. The group average approach does well in separating clusters if there is noise between them, but it is a less popular technique in the real world.
Limitations of hierarchical clustering:
There is no mathematical objective function for hierarchical clustering.
All the approaches for calculating the similarity between clusters have their own disadvantages.
Hierarchical clustering has high space and time complexity, hence it cannot be used when we have huge amounts of data.
Density-based spatial clustering of applications with noise (DBSCAN) is a well-known data clustering algorithm that is commonly used in data mining and machine learning.
Unlike K-means, DBSCAN does not require the user to specify the number of clusters to be generated.
DBSCAN can find clusters of any shape; a cluster doesn't have to be circular.
DBSCAN can identify outliers.
The basic idea behind the density-based clustering approach is derived from a human intuitive clustering method. By looking at the figure below, one can easily identify four clusters along with several points of noise, because of the differences in the density of the points.
DBSCAN algorithm has two parameters:
ɛ: The radius of our neighborhoods around a data point p.
minPts: The minimum number of data points we want in a neighborhood to define a cluster.
Using these two parameters, DBSCAN categorizes the data points into three categories:
Core points: a data point p is a core point if Nbhd(p, ɛ) [the ɛ-neighborhood of p] contains at least minPts points, i.e. |Nbhd(p, ɛ)| >= minPts.
Border points: a data point q is a border point if Nbhd(q, ɛ) contains fewer than minPts data points, but q is reachable from some core point p.
Outliers: a data point o is an outlier if it is neither a core point nor a border point. Essentially, this is the "other" class.
The steps to the DBSCAN algorithm are:
Pick a point at random that has not been assigned to a cluster or been designated
as an outlier. Compute its neighborhood to determine if it’s a core point. If yes,
start a cluster around this point. If no, label the point as an outlier.
Once we find a core point and thus a cluster, expand the cluster by adding all directly-reachable points to it. Perform "neighborhood jumps" to find all density-reachable points and add them to the cluster. If an outlier is added, change that point's status from outlier to border point.
Repeat these two steps until all points are either assigned to a cluster or
designated as an outlier.
Below is the DBSCAN clustering algorithm in pseudocode:
DBSCAN(dataset, eps, MinPts){
   # cluster index
   C = 1
   for each unvisited point p in dataset {
      mark p as visited
      # find neighbors
      N = find the neighboring points of p within distance eps
      if |N| >= MinPts:
         # p is a core point: start a new cluster around it
         add p to cluster C
         for each point p' in N {
            if p' is not visited:
               mark p' as visited
               N' = find the neighboring points of p' within distance eps
               if |N'| >= MinPts:
                  N = N U N'
            if p' is not a member of any cluster:
               add p' to cluster C
         }
         C = C + 1
      else:
         mark p as an outlier
   }
}
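The same algorithm is available in scikit-learn; a minimal sketch on hypothetical points is shown below, where eps corresponds to ɛ, min_samples to minPts, and points labelled -1 are the outliers:
from sklearn.cluster import DBSCAN
import numpy as np

# hypothetical points: two dense groups plus one far-away outlier
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# eps is the neighborhood radius, min_samples is minPts
db = DBSCAN(eps=3, min_samples=2).fit(X)
print(db.labels_)   # cluster index per point; -1 marks noise/outliers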
CROSS VALIDATION:
Cross-validation is a technique in which we train our model using a subset of the dataset and then evaluate it using the complementary subset of the dataset.
The three steps involved in cross-validation are as follows:
1. Split the dataset into a training set and a test set.
2. Train the model using the training set.
3. Test the model using the test set.
USE: to get a good estimate of out-of-sample accuracy.
Even when we use this train/test split technique, we get variation in accuracy depending on how the data happens to be split; to address this we use the K-fold cross-validation technique.
In K-fold cross-validation, we split the dataset into k subsets (known as folds), then we perform training on k-1 of the subsets and leave one subset out for the evaluation of the trained model. We iterate k times, with a different subset reserved for testing each time.
Code in python for k-cross validation:
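A short, hedged sketch using scikit-learn's cross_val_score is shown below; the bundled iris dataset and a decision tree are used purely for illustration:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# cv=5 splits the data into 5 folds; each fold is used once as the test set
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)          # accuracy on each of the 5 folds
print(scores.mean())   # average cross-validated accuracy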
The final value of the classifier is the class which is predicted the most often among all the classifiers (for a classification problem); for a regression problem, the final value is the average of all the regressor values obtained from the sequential trees.
Code in python for XGBOOST
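A hedged sketch using the xgboost package's scikit-learn wrapper (XGBClassifier) is shown below, assuming the xgboost library is installed; the bundled breast-cancer dataset is used only for illustration:
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# each new tree is fitted to the errors of the previous trees (boosting);
# the final prediction combines the outputs of all the sequential trees
model = XGBClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)

print(accuracy_score(y_test, model.predict(X_test)))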