ML - Module 1


Machine Learning

Module 1

Introduction: Basic definitions, Linear Algebra, Statistical learning theory, types of
learning, hypothesis space and Inductive bias, evaluation and cross validation,
Optimization.

Module 2

Statistical Decision Theory, Bayesian Learning (ML, MAP, Bayes estimates, Conjugate
priors), Linear Regression, Ridge Regression, Lasso, Principal Component Analysis,
Partial Least Squares.

Module 3

Linear Classification, Logistic Regression, Linear Discriminant Analysis, Quadratic
Discriminant Analysis, Perceptron, Support Vector Machines + Kernels, Artificial
Neural Networks + Back Propagation, Decision Trees, Bayes Optimal Classifier, Naive
Bayes.

Module 4

Hypothesis testing, Ensemble Methods, Bagging, AdaBoost, Gradient Boosting,
Clustering, K-means, K-medoids, Density-based, Hierarchical, Spectral.

Module 5

Expectation Maximization, GMMs, Learning theory, Intro to Reinforcement Learning,
Bayesian Networks.
Module 1
Introduction: Basic definitions
Data: Any unprocessed fact, value, text, sound, or picture that has not yet been
interpreted and analysed.
Instance (sample/record): An instance is a single data item (row) available for analysis.
Feature (attribute/property/field): These are the attributes that describe each of
the instances in the dataset.
Feature Engineering: The process, carried out before building the prediction model,
of analysing, cleaning and structuring the data fields.

Model: After training the system (that is, after detecting patterns in the data), a
model is created to make predictions.

Inductive Learning is where we are given examples of a function in the form of data
(x) and the output of the function (f(x)). The goal of inductive learning is to learn the
function for new data (x). This is the general theory behind supervised learning.

Dataset: It is the raw material of the prediction system. This is the historical data
used to train the system that detects patterns. The dataset is composed of instances,
and each instance of features (factors, characteristics or properties).
Different Forms of Data
 Numeric Data: Quantitative/Numerical data can be expressed in numerical
values, which makes it countable. For example, the price of a phone and the
height or weight of a person fall under quantitative data.
 Categorical Data: Qualitative or Categorical Data is data that can’t be
measured or counted in the form of numbers. These types of data are sorted
by category, not by number. The gender of a person, i.e., male, female, or
others, is qualitative data.

Splitting data in Machine Learning


 Training Data: The part of data we use to train our model. This is the data that
your model actually sees(both input and output) and learns from.
 Validation Data: The part of data used for frequent evaluation of the model as it
is fit on the training dataset, and for tuning the involved hyperparameters
(parameters set before the model begins learning). This data plays its part while
the model is actually training.
 Test Data: Once our model is completely trained, test data provides an unbiased
evaluation. When we feed in the inputs of test data, our model will predict some
values (without seeing the actual output). After prediction, we evaluate the model
by comparing its predictions with the actual output present in the test data. A
minimal splitting sketch is shown below.
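As an illustration only (the notes do not prescribe any particular tool), here is a minimal sketch of a 70/15/15 train/validation/test split; it assumes NumPy is available and uses made-up toy data.

```python
import numpy as np

# Toy dataset: 100 instances with 3 features each and a binary label (made up for illustration).
rng = np.random.default_rng(seed=42)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Shuffle the indices so the split is random.
indices = rng.permutation(len(X))

# 70% training, 15% validation, 15% test.
n_train = int(0.70 * len(X))
n_val = int(0.15 * len(X))

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

X_train, y_train = X[train_idx], y[train_idx]   # seen by the model (inputs and outputs)
X_val, y_val = X[val_idx], y[val_idx]           # frequent evaluation / hyperparameter tuning
X_test, y_test = X[test_idx], y[test_idx]       # held back for the final unbiased evaluation

print(len(X_train), len(X_val), len(X_test))    # 70 15 15
```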

Traditional Programming Vs Machine Learning


Traditional Programming: Data and a program are run on the computer to produce the
output.

Machine Learning: Data and output are run on the computer to create a program. This
program can then be used in traditional programming.

Steps used in Machine Learning


There are 5 basic steps used to perform a machine learning task:

1. Collecting data: Gathering past data and storing it in Excel, Access, text files,
etc. forms the foundation of the future learning. The better the variety, density
and volume of relevant data, the better the learning prospects for the machine
become.

2. Preparing the data: One needs to spend time analysing and determining the
quality of data and then taking steps for fixing issues such as missing data and
treatment of outliers.

3. Training a model: This step involves choosing the appropriate algorithm and
representation of data in the form of the model.
4. Evaluating the model: To test the accuracy, test data is used. This step
determines the precision in the choice of the algorithm based on the outcome.

5. Improving the performance: This step might involve choosing a different
model altogether or introducing more variables to augment the efficiency.

Statistical learning theory

Statistical learning theory is a framework for machine learning that draws from
statistics and functional analysis. It deals with finding a predictive function
based on the data presented. The main idea in statistical learning theory is to
build a model that can draw conclusions from data and make predictions.
Statistical learning theory is the broad framework for studying the concept of
inference in both supervised and unsupervised machine learning. Inference
covers the entire spectrum of machine learning, from gaining knowledge,
making predictions or decisions and constructing models from a set of labelled
or unlabelled data.
A statistical model defines the relationships between a dependent and
independent variable.

Types of Learning
Machine learning
Machine learning (ML): Machine learning is a subset of AI, which enables the
machine to automatically learn from data, improve performance from past
experiences, and make predictions.

Machine learning is a method by which a computer program can “automatically
learn and improve from experience without being explicitly programmed.”

A more technical definition was given by Tom M. Mitchell (1997): “A computer
program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E.” Example: A handwriting recognition problem.
Task T: recognizing and classifying handwritten words within images.
Performance measure P: percentage of words correctly classified, accuracy.
Training experience E: a data-set of handwritten words with given
classifications.

Based on the methods and way of learning, ML is divided into mainly four types.

1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning

Shortcomings of Machine Learning

1. Machine learning is not based on knowledge. Machines are driven by data, not
human knowledge. As a result, “intelligence” is dictated by the volume of data
you have to train it with. Machine learning cannot attain human-level
intelligence.
2. Machine learning models are difficult to train. Time, resources and massive
data sets are needed to create data models, and the process involves manually
pre-tagging and categorizing data sets.
3. Machine learning is prone to data issues. Data quality, data labelling and
model building are some of the data-related problems for ML success.
4. Machine learning is often biased. Machine learning systems are known for
operating in a black box, meaning you have no visibility into how the machine
learns and makes decisions. Thus, if you identify an instance of bias, there is no
way to identify what caused it. Your only recourse is to retrain the algorithm
with additional data, but there is no guarantee that this will resolve the issue.

Applications of Machine learning


1. Image Recognition: It is used to identify objects, persons, places, images etc.
2. Speech Recognition: Google assistant, Siri, Cortana, Alexa etc.
3. Virtual Personal Assistant: Google assistant, Alexa, Siri, Cortana etc.
4. Traffic prediction: Google maps
5. Product recommendations
6. Self-driving cars
7. Email Spam and Malware Filtering
8. Online Fraud Detection
9. Medical Diagnosis
10. Stock Market trading

Types of Machine learning

1. Supervised Machine Learning


Supervised learning: Supervised learning is the type of machine learning in which
machines are trained using well-"labelled" training data, and on the basis of that data,
machines predict the output.

o Labelled Data: The labelled data means some input data is already tagged
with the correct output.
o In supervised learning, the training data provided to the machines works as the
supervisor that teaches the machines to predict the output correctly.

Supervised learning is classified into two categories of algorithms:

1. Regression
2. Classification

1. Regression: Regression algorithms are used if there is a relationship between the
input variable and the output variable. It is used for the prediction of continuous
variables, such as Weather forecasting, Market Trends, etc. Below are some
Regression algorithms.

o Linear Regression
o Polynomial Regression
o Lasso Regression

2. Classification: Classification algorithms are used when the output variable is
categorical, which means the output falls into discrete classes such as Yes-No,
Male-Female, etc. A minimal sketch using Logistic Regression is shown after the
list below.

o Logistic Regression
o Support vector Machines
o Naive Bayes
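As a hedged illustration only (the notes do not name a specific library), here is a minimal supervised-learning sketch that assumes scikit-learn is installed; it trains Logistic Regression, one of the classification algorithms listed above, on synthetic labelled data.

```python
# Minimal supervised-learning sketch (assumes scikit-learn): labelled data -> train -> predict.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic labelled dataset: inputs X and the "correct output" tags y (made up for illustration).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)                 # the labelled data acts as the "supervisor"

print(model.predict(X[:5]))     # predicted classes for the first five instances
print(model.score(X, y))        # training accuracy, for illustration only
```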

Advantages of Supervised learning

o With the help of supervised learning, the model can predict the output on the
basis of prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning models help us to solve various real-world problems such
as fraud detection, spam filtering, risk assessment, image classification, etc.
o Helps to optimize performance criteria with the help of experience.

Disadvantages of supervised learning

o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is
different from the training dataset.
o In supervised learning, we need enough knowledge about the classes of
objects.
o Classifying big data can be challenging.
o Training for supervised learning needs a lot of computation time.

Applications of Supervised Learning

o Image Segmentation: Supervised Learning algorithms are used in image
segmentation. In this process, image classification is performed on different
image data with pre-defined labels.
o Medical Diagnosis: Supervised algorithms are also used in the medical field
for diagnosis purposes. It is done by using medical images and past labelled
data with labels for disease conditions. With such a process, the machine can
identify a disease for the new patients.
o Fraud Detection: Supervised Learning classification algorithms are used for
identifying fraud transactions, fraud customers, etc. It is done by using historic
data to identify the patterns that can lead to possible fraud.
o Spam detection: In spam detection & filtering, classification algorithms are
used. These algorithms classify an email as spam or not spam. The spam
emails are sent to the spam folder.
o Speech Recognition: Supervised learning algorithms are also used in speech
recognition. The algorithm is trained with voice data, and various
identifications can be done using the same, such as voice-activated
passwords, voice commands, etc.
o BioInformatics: BioInformatics involves storing biological information about
humans, such as fingerprints, iris texture, earlobes and so on.
2. Unsupervised Machine Learning

Unsupervised learning: Unsupervised learning is a type of machine learning in which
models are trained using an unlabelled dataset and are allowed to act on that data
without any supervision.

Unsupervised learning is classified into two categories of algorithms:

1. Clustering
2. Association

1. Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in a group and have little or no similarity
with the objects of another group (a minimal clustering sketch is shown after the
algorithm lists below).

o K-means clustering
o Principal Component Analysis
o DBSCAN Algorithm

2. Association: An association rule is an unsupervised learning method which is used
for finding relationships between variables in a large database. A typical
example of an association rule is Market Basket Analysis, e.g., people who buy
item X also tend to purchase item Y.

o Apriori Algorithm
o FP-Growth algorithm
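For illustration, a minimal clustering sketch, assuming scikit-learn is available; it runs K-means (from the list above) on synthetic, unlabelled data.

```python
# Minimal clustering sketch (assumes scikit-learn). No labels are given to the model.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Unlabelled data: 300 points drawn around 3 centres (the centres are unknown to the model).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # each point is assigned to one of 3 clusters

print(labels[:10])                   # cluster index of the first ten points
print(kmeans.cluster_centers_)       # the learned cluster centres
```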

Advantages of Unsupervised Learning

o Unsupervised learning is used for more complex tasks as compared to
supervised learning because, in unsupervised learning, we don't have labelled
input data.
o It is easy to get unlabelled data in comparison to labelled data.
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much more similar to how a human learns to think from
their own experiences, which makes it closer to real AI.
o It can help in scenarios where we don’t know how many classes, or which
classes, the data is divided into.
Disadvantages of Unsupervised Learning

o Unsupervised learning is intrinsically more difficult than supervised learning as
it does not have corresponding output.
o It is less accurate as input data is not labeled, and algorithms do not know the
exact output in advance.
o The information obtained by the algorithm may not always correspond to the
output class that we required.
o The user has to understand and map the output obtained with the
corresponding labels.

Applications of unsupervised learning


o News Sections: Google News uses unsupervised learning to categorize
articles on the same story from various online news outlets. For example,
the results of a presidential election could be categorized under their label
for “US” news.
o Computer vision: Unsupervised learning algorithms are used for visual
perception tasks, such as object recognition.
o Medical imaging: Unsupervised machine learning provides essential features
to medical imaging devices, such as image detection, classification and
segmentation, used in radiology and pathology to diagnose patients
accurately.
o Anomaly detection: Unsupervised learning models can comb through large
amounts of data and discover atypical data points within a dataset. These
anomalies can raise awareness around faulty equipment, human error, or
breaches in security.
o Customer personas: Defining customer personas makes it easier to
understand common traits and business clients' purchasing habits.
Unsupervised learning allows businesses to build better buyer persona
profiles, enabling organizations to align their product messaging more
appropriately.
o Recommendation Engines: Using past purchase behaviour data,
unsupervised learning can help to discover data trends that can be used to
develop more effective cross-selling strategies. This is used to make relevant
add-on recommendations to customers during the checkout process for
online retailers.
o Latent variable models: these models can be used for data preprocessing, such
as reducing the number of features or decomposing a dataset into multiple
components.
Difference b/w Supervised and Unsupervised Learning

o Input Data: Supervised learning uses known and labelled data as input;
unsupervised learning uses unknown and unlabelled data as input.
o Computational Complexity: Supervised learning is computationally very complex;
unsupervised learning has less computational complexity.
o Real-Time Analysis: Supervised learning uses off-line analysis; unsupervised
learning uses real-time analysis of data.
o Number of Classes: In supervised learning the number of classes is known; in
unsupervised learning it is not known.
o Accuracy of Results: Supervised learning gives accurate and reliable results;
unsupervised learning gives moderately accurate and reliable results.

Machine Learning Algorithm Map


3. Semi-Supervised Learning
Semi-Supervised learning: Semi-Supervised learning is a type of Machine Learning
algorithm that represents the intermediate ground between Supervised and
Unsupervised learning algorithms. It uses the combination of labeled and unlabeled
datasets during the training period.

Assumptions followed by Semi-Supervised Learning


To work with the unlabeled dataset, there must be a relationship between the
objects. To understand this, semi-supervised learning uses any of the following
assumptions:

o Continuity Assumption: As per the continuity assumption, the objects near
each other tend to share the same group or label.
o Cluster assumptions: In this assumption, data are divided into different
discrete clusters. Further, the points in the same cluster share the output
label.
o Manifold assumption: This assumption states that the data lie on a manifold of
far fewer dimensions than the input space, which makes it possible to use
distances and densities defined on that manifold.

Semi-Supervised Learning Algorithms


o Semi-Supervised Support Vector Machines (S3VM)
o Transductive support vector machine (TSVM)
o Graph Based Techniques

Applications of Semi-supervised Learning

o Machine translation: Teaching algorithms to translate language based on less
than a full dictionary of words.

o Fraud detection: Identifying cases of fraud when you only have a few positive
examples.

o Labeling data: Algorithms trained on small data sets can learn to apply data
labels to larger sets automatically.

o Speech Analysis

o Web content classification


o Text document classifier

o Protein(DNA) sequence classification

4. Reinforcement Learning

Reinforcement Learning: Reinforcement Learning is a feedback-based machine
learning technique in which an agent learns to behave in an environment by
performing actions and seeing the results of those actions. The agent learns
automatically from this feedback and improves its performance. The agent learns
through a process of trial and error, based on experience. Finding the shortest
route between two points on a map is a typical reinforcement learning use case.

Main points in Reinforcement learning –

o Input: The input should be an initial state from which the model will start
o Output: There are many possible outputs as there are a variety of solutions
to a particular problem
o Training: Training is based upon the input; the model will return a state and the
user will decide to reward or punish the model based on its output.
o The model continues to learn.
o The best solution is decided based on the maximum reward.
There are mainly two types of reinforcement learning, which are:
1. Positive Reinforcement
2. Negative Reinforcement

1. Positive Reinforcement: Positive reinforcement learning means adding
something to increase the tendency that the expected behaviour will occur again. It
impacts the behaviour of the agent positively and increases the strength of the
behaviour.

This type of reinforcement can sustain the changes for a long time, but too much
positive reinforcement may lead to an overload of states that can diminish the
results.

2. Negative Reinforcement: Negative reinforcement learning is the opposite of
positive reinforcement, as it increases the tendency that the specific behaviour will
occur again by avoiding a negative condition.

It can be more effective than positive reinforcement depending on the situation
and behaviour, but it provides only enough reinforcement to meet the minimum
required behaviour.

Reinforcement Learning Algorithms

o Q-Learning
o State-Action-Reward-State-Action (SARSA)
o Deep Q Neural Network (DQN)

Advantages of Reinforcement Learning


o It helps you to find which situation needs an action
o Helps you to discover which action yields the highest reward over the longer
period.
o Reinforcement learning provides the learning agent with a reward function.
o It also allows it to figure out the best method for obtaining large rewards.

Disadvantages of Reinforcement Learning


o Feature/reward design can be very involved.
o Parameters may affect the speed of learning.
o Realistic environments can have partial observability.
o Too much Reinforcement may lead to an overload of states which can
diminish the results.
o Realistic environments can be non-stationary.
Reinforcement Learning Applications

o Robotics: RL is used in robot navigation, walking, juggling, etc. Robots can
learn to perform tasks in the physical world using this technique.
o Control: RL can be used for adaptive control, such as factory processes,
admission control in telecommunications, and helicopter piloting.
o Game Playing: RL has been used to teach bots to play a number of video
games such as tic-tac-toe, chess, etc.
o Chemistry: RL can be used for optimizing the chemical reactions.
o Business: RL is now used for business strategy planning.

o Resource management: Given finite resources and a defined goal, RL can help
enterprises plan out how to allocate resources.

o Manufacturing: In various automobile manufacturing companies, the robots
use deep reinforcement learning to pick goods and put them in some
containers.

o Finance Sector: RL is currently used in the finance sector for evaluating
trading strategies.

Hypothesis Space and Inductive Bias


Hypothesis: A hypothesis is a function that best describes the target in supervised
machine learning. The hypothesis that an algorithm comes up with depends upon
the data and also upon the restrictions and bias that we have imposed on
the data.

Hypothesis Space: Hypothesis space is the set of all possible legal hypotheses. This is
the set from which the machine learning algorithm determines the single best
hypothesis that describes the target function or the outputs.

Inductive Bias: The inductive bias (also known as learning bias) of a learning
algorithm is the set of assumptions that the learner uses to predict outputs.

In other words, inductive bias refers to a set of (explicit or implicit)
assumptions made by a learning algorithm in order to perform induction, that
is, to generalize a finite set of observations (training data) into a general model
of the domain.

Evaluation and Cross Validation


Evaluation
Evaluation: Model evaluation is the process of using different evaluation metrics to
understand a machine learning model’s performance, as well as its strengths and
weaknesses.

Confusion / Classification metrics


When performing classification predictions, there are four types of outcomes that
could occur.

o True positives occur when your system predicts that an observation belongs to a
class and it actually does belong to that class.
o True negatives occur when your system predicts that an observation does not
belong to a class and it does not belong to that class.
o False positives occur when you predict an observation belongs to a class when in
reality it does not. Also known as a type I error.
o False negatives occur when you predict an observation does not belong to a
class when in fact it does. Also known as a type II error.
The three main metrics used to evaluate a classification model are accuracy,
precision, and recall.

Accuracy : Accuracy is defined as the percentage of correct predictions for the test
data. It can be calculated easily by dividing the number of correct predictions by the
number of total predictions.

Precision: Precision is defined as the fraction of relevant examples (true positives)
among all of the examples which were predicted to belong to a certain class:
Precision = TP / (TP + FP).

It is the percentage of positive instances out of the total predicted positive instances.
Here the denominator (TP + FP) is the number of instances the model predicted as
positive over the whole given dataset. Take it as finding out ‘how often the model is
right when it says it is right’.

Recall/Sensitivity/True Positive Rate: Recall is defined as the fraction of examples
which were predicted to belong to a class with respect to all of the examples that
truly belong to the class:
Recall = TP / (TP + FN).

It is the percentage of positive instances out of the total actual positive instances.
Therefore the denominator (TP + FN) here is the actual number of positive
instances present in the dataset. Take it as finding out ‘how many actual positives
the model missed while showing the right ones’.
Specificity: The percentage of negative instances out of the total actual negative
instances: Specificity = TN / (TN + FP). Therefore the denominator (TN + FP) here is
the actual number of negative instances present in the dataset. It is similar to recall
but the focus is on the negative instances, like finding out how many healthy
patients did not have cancer and were told they don’t have cancer.

F1 score: It is the harmonic mean of precision and recall,
F1 = 2 * (Precision * Recall) / (Precision + Recall). This takes the contribution of
both, so the higher the F1 score, the better. Note that, due to the product in the
numerator, if one goes low, the final F1 score goes down significantly. So a model
does well on F1 score if the predicted positives are actually positives (precision) and
it doesn't miss out on positives by predicting them negative (recall).
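The metrics above can be computed directly from the four confusion-matrix counts. Below is a small sketch using made-up counts purely for illustration.

```python
# Hypothetical confusion-matrix counts (made-up numbers, for illustration only).
TP, TN, FP, FN = 40, 45, 5, 10

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # correct predictions / all predictions
precision   = TP / (TP + FP)                    # how often the model is right when it says "positive"
recall      = TP / (TP + FN)                    # how many actual positives the model found
specificity = TN / (TN + FP)                    # how many actual negatives the model found
f1          = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} specificity={specificity:.2f} f1={f1:.2f}")
```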

Validation

Validation: Validation is the process of verifying whether the mathematical results
describing the relationships between variables are acceptable as descriptions of the
data. Just after model building, an error estimation for the model is made on the
training dataset, which is called the evaluation of residuals. In this step, i.e., the
Evaluate Residuals step, we find the training error by finding the difference between
the predicted output and the original output.

Bias & Variance

Bias: The bias is known as the difference between the prediction of the values by the
ML model and the correct value. Being high in bias gives a large error in training as
well as testing data. It is recommended that an algorithm should always be low biased
to avoid the problem of under-fitting.
With high bias, the predicted data is in a straight-line format, thus not fitting the
data in the data set accurately. Such fitting is known as Under-fitting of Data. This
happens when the hypothesis is too simple or linear in nature.

When the Bias is high, assumptions made by our model are too basic, the model
can’t capture the important features of our data. This means that our model
hasn’t captured patterns in the training data and hence cannot perform well on
the testing data too.

Some examples of machine learning algorithms with low bias are Decision
Trees, k-Nearest Neighbours and Support Vector Machines. At the same time,
algorithms with high bias are Linear Regression, Linear Discriminant Analysis
and Logistic Regression.

Ways to reduce High Bias

o Increase the input features as the model is underfitted.
o Decrease the regularization term.
o Use more complex models, such as including some polynomial features.

Variance: The variability of model prediction for a given data point, which tells us the
spread of our data, is called the variance of the model. A model with high variance
has a very complex fit to the training data and thus is not able to fit accurately on the
test data. As a result, such models perform very well on training data but have high
error rates on test data.

When a model is high on variance, it is said to over-fit the data.
Over-fitting is fitting the training set accurately via a complex curve and a high-order
hypothesis, but it is not the solution, as the error with unseen data is high. While
training a model, variance should be kept low.

Variance indicates how much the estimate of the target function will alter if
different training data were used. In other words, variance describes how much
a random variable differs from its expected value. Contrary to bias, the Variance
is when the model takes into account the fluctuations in the data i.e. the noise
as well.

Some examples of machine learning algorithms with low variance are Linear
Regression, Logistic Regression, and Linear Discriminant Analysis. At the same
time, algorithms with high variance are Decision Trees, Support Vector Machines,
and k-Nearest Neighbours.

Ways to Reduce High Variance


o Reduce the input features or number of parameters as the model is over-fitted.
o Do not use an overly complex model.
o Increase the training data.
o Increase the regularization term.

Different Combinations of Bias-Variance


There are four possible combinations of bias and variances, which are represented by
the below diagram.

1. Low-Bias, Low-Variance: The combination of low bias and low variance shows an
ideal machine learning model. However, it is not possible practically.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions are
inconsistent but accurate on average. This case occurs when the model learns
with a large number of parameters and hence leads to over-fitting.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are
consistent but inaccurate on average. This case occurs when a model does not
learn well from the training dataset or uses a small number of parameters. It
leads to under-fitting problems in the model.
4. High-Bias, High-Variance: With high bias and high variance, predictions are
inconsistent and also inaccurate on average.

Bias Variance Trade-off


If our model is too simple and has very few parameters then it may have high bias
and low variance. On the other hand, if our model has a large number of parameters
then it is going to have high variance and low bias. So we need to find the right/good
balance without over-fitting and under-fitting the data.

This just ensures that we capture the essential patterns in our model while
ignoring the noise present in it. This is called the Bias-Variance Trade-off. It helps
optimize the error in our model and keeps it as low as possible. An optimized
model will be sensitive to the patterns in our data, but at the same time will
be able to generalize to new data.

This trade-off in complexity is why there is a trade-off between bias and
variance: an algorithm can’t be more complex and less complex at the same
time. The ideal trade-off is the model complexity at which the total error is lowest.

Under-fitting & Over-fitting

Under-fitting: Under-fitting occurs when the model has fewer features and hence a
statistical model or machine learning algorithm cannot capture the underlying trend
of the data. Intuitively, under-fitting occurs when the model or the algorithm does
not fit the data well enough. Under-fitting is often a result of an excessively simple
model. This model has high bias.

Over-fitting: Over-fitting occurs when the model has complex functions and hence
the statistical model or machine learning algorithm captures the noise of the data.
Intuitively, over-fitting occurs when the model or the algorithm fits the data too well
but is not able to generalize to predict new data. Over-fitting a model results in good
accuracy for the training data set but poor results on new data sets. This model has
high variance.

There are three main options to address the issue of over-fitting:

1. Reduce the number of features: Manually select which features to keep. By
doing so, we may miss some important information if we throw away some
features.
2. Regularization: Keep all the features, but reduce the magnitude of the weights W.
Regularization works well when we have a lot of slightly useful features (see the
sketch after this list).
3. Early stopping: When we are training a learning algorithm iteratively such as
using gradient descent, we can measure how well each iteration of the model
performs. Up to a certain number of iterations, each iteration improves the
model. After that point, however, the model’s ability to generalize can
weaken as it begins to over-fit the training data.
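As a sketch of option 2 above, and assuming scikit-learn and NumPy are available, ridge regression is one common way to keep all features while shrinking the magnitude of the weights W; the data here are made up purely for illustration.

```python
# Sketch of regularization: Ridge keeps all features but shrinks the weights via the alpha penalty.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))                 # few samples, many features: easy to over-fit
y = X[:, 0] + 0.1 * rng.normal(size=30)       # only the first feature really matters

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)           # larger alpha = stronger shrinkage

print(np.linalg.norm(plain.coef_))            # weight magnitude of the unregularized fit
print(np.linalg.norm(ridge.coef_))            # typically noticeably smaller with regularization
```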
Cross-Validation

Cross-Validation: It is a statistical method that is used to find the performance of
machine learning models. It is used to protect our model against overfitting in a
predictive model. In cross-validation, we partition our dataset into a fixed number
of folds (or partitions), run the analysis on each fold, and then average the overall
error estimate.
Types of Cross-Validation Techniques

Cross-validation techniques can be divided into two broad categories:

1. Non-Exhaustive methods
2. Exhaustive methods

1. Non-exhaustive Methods
These methods do not include all the ways of splitting the original dataset.

Hold out Validation approach: To use this approach, we first separate our
initial dataset into two parts – training data and testing data. Then, we train the
model on the training data and see its performance on the unseen data. We
usually shuffle the data randomly before splitting (a short sketch follows the pros
and cons below).

Pros
 This approach is Fully independent of the data.
 This approach only needs to be run once so has lower computational costs.
Cons
 The Performance leads to a higher variance if we have a dataset of smaller
size.
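A short hold-out sketch, assuming scikit-learn: shuffle, split once into training and testing parts, train, and score on the unseen part.

```python
# Hold-out validation sketch (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# shuffle=True reorders the data randomly before the single split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))   # performance on the unseen hold-out part
```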
k-Fold Cross-Validation: In this approach, we divide the data set into k number of
subsets and the holdout method is repeated k number of times.

The procedure has a single parameter called k that refers to the number of
groups that a given data sample is to be split into. As such, the procedure is
often called k-fold cross-validation. When a specific value for k is chosen, it
may be used in place of k in the reference to the model, such as k=10
becoming 10-fold cross-validation.
The value for k is chosen such that each train/test group of data samples is
large enough to be statistically representative of the broader dataset.

Here is the algorithm to follow (a sketch implementing these steps is shown after
the pros and cons below):


1. Randomly divide your entire dataset into k numbers of folds.
2. For each fold in your dataset, build your model on k – 1 folds of the dataset
and test the model to find the performance for the kth fold.
3. Repeat this until each of the k-folds has become the test set exactly once.
4. Finally, the average of your k accuracies is called the cross-validation accuracy
and it will serve as our performance metric for the model.

Pros
 Models may not be affected much if there are some outliers present in the
dataset.
 It helps us to overcome the problem of variability.
 This method results in a less biased model compared to other methods since
every observation has the chance of appearing in both train and test sets.
 The Best approach if we have a limited amount of input data.
Cons
 Imbalanced datasets will impact our model.
 Requires k times as much computation to evaluate, since the training algorithm
has to be rerun from scratch k times to complete the k folds.
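Here is a sketch that follows the four steps above, assuming scikit-learn and NumPy; the dataset is synthetic and used only for illustration.

```python
# Sketch of the four k-fold steps above (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)   # step 1: random division into k folds
scores = []

for train_idx, test_idx in kf.split(X):                # steps 2-3: each fold is the test set once
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])              # build the model on k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))  # evaluate on the held-out fold

print(np.mean(scores))   # step 4: cross-validation accuracy = average of the k accuracies
```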
Stratified K-Fold Cross Validation: It tries to address the problem of the K-Fold
approach. The splitting of data into folds may be governed by criteria such as
ensuring that each fold has the same proportion of observations with a given
categorical value, such as the class outcome value. This is called stratified cross-
validation.

Stratification: It is the process of rearranging the data such that each of the
folds is a good representative of the whole dataset with respect to different
classes.

Pros

 It can improve different models using hyper-parameter tuning.


 Helps us compare models.
 It helps in reducing both Bias and Variance.
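A short stratified k-fold sketch, assuming scikit-learn; it checks that each test fold keeps roughly the same class proportions as a deliberately imbalanced toy dataset.

```python
# Stratified k-fold sketch (assumes scikit-learn): each fold keeps the class proportions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy data: roughly 90% of class 0 and 10% of class 1.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):   # note: y is needed for stratification
    print(np.bincount(y[test_idx]))           # each test fold has roughly 36 of class 0 and 4 of class 1
```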

2. Exhaustive Methods

These methods test our model on all the possible ways to divide our original dataset
into training and a validation set.

Leave one out Cross-Validation (LOOCV): In this approach, we take out only one
data point from the available dataset for each iteration to test the model and train
the model on the rest of the data. This process iterates for each of the data points.
Pros

 Since we make use of all data points, hence the bias will be less.

Cons

 Higher execution time since we repeat the cross-validation process n times
(where n is the number of observations in the dataset).
 This leads to higher variation in testing model effectiveness because we test
our model against only one data point. So, our results get highly influenced by
the data point. For example, if the data point is an outlier, it can lead to
higher variation.
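A brief LOOCV sketch, assuming scikit-learn: with n observations there are n train/test rounds, each testing on a single held-out point.

```python
# Leave-one-out sketch (assumes scikit-learn): n iterations, one test point each time.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=50, random_state=0)   # small n keeps the n runs cheap

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())   # 50 single-point evaluations, then their average
```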

Optimization
Optimisation: Machine learning optimisation is the process of iteratively improving
the accuracy of a machine learning model, lowering the degree of error. Optimisation
is measured through a loss or cost function, which is typically a way of defining the
difference between the predicted and actual value of the data. Machine learning
models aim to minimise this loss function, or lower the gap between prediction and
the reality of the output data.

Optimization is the process where we train the model iteratively that results in a
maximum and minimum function evaluation. We compare the results in every
iteration by changing the hyperparameters in each step until we reach the
optimum results.

Parameters and hyperparameters of the model


o You need to set hyperparameters before starting to train the model. They
include the number of clusters, the learning rate, etc. Hyperparameters describe
the structure of the model.

o On the other hand, the parameters of the model are obtained during the
training. There is no way to get them in advance. Examples are weights and
biases for neural networks. This data is internal to the model and changes based
on the inputs.
Top optimization techniques in machine learning
Gradient Descent and Stochastic Gradient Descent are some of the most
important optimization techniques.

Gradient Descent Algorithm


Maxima and Minima: A maximum is the largest and a minimum is the smallest value
of a function within a given range.

Global Maxima and Minima: It is the maximum value and minimum value
respectively on the entire domain of the function

Local Maxima and Minima: It is the maximum value and minimum value respectively
of the function within a given range.

There can be only one global minimum and maximum, but there can be more than
one local minimum and maximum.

Gradient Descent: Gradient Descent is an optimization algorithm and it finds out the
local minima of a differentiable function. It is a minimization algorithm that
minimizes a given function.
Let’s see the geometric intuition of Gradient Descent:

Slope of Y=X²

Let’s take an example graph of a parabola, Y=X²

Here, the minimum is at the origin (0, 0). The slope is tan θ. So the slope on the right
side is positive, as 0 < θ < 90 and its tan θ is a positive value. The slope on the left side
is negative, as 90 < θ < 180 and its tan θ is a negative value.

Slope of points as moved towards minima


One important observation in the graph is that the slope changes its sign from
positive to negative at minima. As we move closer to the minima, the slope reduces.

So, how does the Gradient Descent Algorithm work?

Objective: Calculate X*, the local minimum of the function Y=X².

 Pick an initial point X₀ at random.
 Calculate X₁ = X₀ − r*[df/dx at X₀]. r is the Learning Rate (we’ll discuss r in the
Learning Rate section). Let us take r = 1. Here, df/dx is nothing but the gradient.
 Calculate X₂ = X₁ − r*[df/dx at X₁].
 Calculate for all the points: X₁, X₂, X₃, …, Xᵢ₋₁, Xᵢ.
 General formula for calculating the local minimum: Xᵢ = Xᵢ₋₁ − r*[df/dx at Xᵢ₋₁].
 When (Xᵢ − Xᵢ₋₁) is small, i.e., when Xᵢ₋₁ and Xᵢ converge, we stop the iteration and
declare X* = Xᵢ (a small implementation sketch follows this list).
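A small implementation sketch of the iteration above for Y = X², whose gradient is df/dx = 2X. The decay factor is an optional extra (not part of the steps above) that shrinks r each iteration, as suggested in the Learning Rate discussion below.

```python
# Gradient descent on Y = X^2, whose gradient is df/dx = 2*X.
def gradient_descent(x0, r=0.1, decay=1.0, tol=1e-6, max_iter=1000):
    x = x0
    for _ in range(max_iter):
        grad = 2 * x                 # df/dx at the current point
        x_new = x - r * grad         # X_i = X_(i-1) - r * [df/dx at X_(i-1)]
        if abs(x_new - x) < tol:     # stop when successive points converge
            return x_new
        x = x_new
        r *= decay                   # optionally shrink the learning rate each step
    return x

print(gradient_descent(x0=5.0, r=0.1))   # approaches the minimum X* = 0
```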

Learning Rate: Learning Rate is a hyperparameter or tuning parameter that
determines the step size at each iteration while moving towards the minimum of the
function. For example, if r = 0.1 in the initial step, it can be taken as r = 0.01 in the
next step. Likewise it can be reduced exponentially as we iterate further.

What happens if we keep r value as constant:

Oscillation Problem

In the above example, we took r = 1. As we calculate the points Xᵢ, Xᵢ₊₁, Xᵢ₊₂, … to find
the local minimum X*, we can see that the iterate oscillates between X = -0.5 and
X = 0.5. When we keep r constant, we end up with an oscillation problem. So, we have
to reduce the ‘r’ value with each iteration (demonstrated in the short sketch below).
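A quick numerical check of the oscillation described above: on Y = X² with a constant r = 1 the update flips the sign of X every step, so the iterates bounce between +0.5 and -0.5, while a smaller constant r shrinks X towards the minimum.

```python
# Constant learning rate r = 1 on Y = X^2: X_new = X - 1 * (2*X) = -X, so the
# iterates just flip sign and oscillate between +X0 and -X0 (here +/-0.5) forever.
x = 0.5
for _ in range(6):
    x = x - 1.0 * (2 * x)
    print(x)            # -0.5, 0.5, -0.5, 0.5, ...

# A smaller constant r (e.g. 0.1) shrinks X towards the minimum instead.
x = 0.5
for _ in range(6):
    x = x - 0.1 * (2 * x)
    print(round(x, 4))  # 0.4, 0.32, 0.256, ...
```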
Hyperparameters decide the bias-variance trade-off. When the r value is low, it could
overfit the model and cause high variance. When the r value is high, it could underfit
the model and cause high bias. We can find the correct r value with the cross-validation
technique: plot a graph with different learning rates, check the training loss for each
value and choose the one with minimum loss.

However, classical gradient descent will not work well when there are several
local minima. Once it finds the first minimum, it simply stops searching, because
the algorithm only finds a local minimum; it is not made to find the global one.

In gradient descent, we proceed forward with steps of the same size. If you
choose a learning rate that is too large, the algorithm will jump around
without getting closer to the right answer. If it’s too small, the computation will
start to mimic an exhaustive search, which is, of course, inefficient.
