ML - Module 1
Module 1
Module 2
Statistical Decision Theory, Bayesian Learning (ML, MAP, Bayes estimates, Conjugate
priors), Linear Regression, Ridge Regression, Lasso, Principal Component Analysis,
Partial Least Squares.
Module 3
Module 4
Module 5
Model: After training the system (that is, after detecting patterns in the data), a
model is created to make predictions.
Inductive Learning is where we are given examples of a function in the form of inputs (x) and the corresponding outputs of the function (f(x)). The goal of inductive learning is to learn the function so that it can be applied to new data (x). This is the general theory behind supervised learning.
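A minimal sketch of inductive learning with NumPy (assumed available): we observe pairs (x, f(x)), fit a simple model, and apply it to unseen x. The underlying function f(x) = 2x + 1 is purely illustrative.

```python
import numpy as np

# Examples of the function: inputs x and their outputs f(x) = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
fx = 2 * x + 1

# Induce the function from the examples by fitting a straight line.
coeffs = np.polyfit(x, fx, deg=1)

# Apply the learned function to new, unseen data.
print(np.polyval(coeffs, 10.0))  # approximately 21.0
```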
Dataset: It is the raw material of the prediction system. This is the historical data used to train the system that detects patterns. The dataset is composed of instances, and each instance is composed of factors, characteristics or properties (features).
Different Forms of Data
Numeric Data: Quantitative/Numerical data can be expressed in numerical values, which makes it countable. For example, the price of a phone and the height or weight of a person fall under quantitative data.
Categorical Data: Qualitative or Categorical Data is data that can’t be
measured or counted in the form of numbers. These types of data are sorted
by category, not by number. The gender of a person, i.e., male, female, or
others, is qualitative data.
Machine Learning: Data and the desired output are run on the computer to create a program. This program can then be used in the way a hand-written program is used in traditional programming.
1. Collecting data: Gathering past data and storing it in Excel, Access, text files, etc. forms the foundation of future learning. The better the variety, density and volume of relevant data, the better the learning prospects for the machine become.
2. Preparing the data: One needs to spend time analysing and determining the
quality of data and then taking steps for fixing issues such as missing data and
treatment of outliers.
3. Training a model: This step involves choosing the appropriate algorithm and
representation of data in the form of the model.
4. Evaluating the model: To test the accuracy, test data is used. This step
determines the precision in the choice of the algorithm based on the outcome.
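The four steps above can be sketched with scikit-learn (an assumption; any comparable library would do). The dataset and algorithm choices here are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Collecting data: a built-in dataset stands in for Excel/text files.
X, y = load_iris(return_X_y=True)

# 2. Preparing the data: hold out a portion of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 3. Training a model: choose an algorithm and fit it to the training data.
model = DecisionTreeClassifier().fit(X_train, y_train)

# 4. Evaluating the model: measure accuracy on the unseen test data.
print(accuracy_score(y_test, model.predict(X_test)))
```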
Statistical learning theory is a framework for machine learning that draws from
statistics and functional analysis. It deals with finding a predictive function
based on the data presented. The main idea in statistical learning theory is to
build a model that can draw conclusions from data and make predictions.
Statistical learning theory is the broad framework for studying the concept of inference in both supervised and unsupervised machine learning. Inference covers the entire spectrum of machine learning, from gaining knowledge and making predictions or decisions to constructing models from a set of labelled or unlabelled data.
A statistical model defines the relationships between a dependent and
independent variable.
Types of Learning
Machine learning
Machine learning (ML): Machine learning is a subset of AI, which enables the
machine to automatically learn from data, improve performance from past
experiences, and make predictions.
Based on the methods and way of learning, ML is divided into mainly four types.
1. Machine learning is not based on knowledge. Machines are driven by data, not human knowledge. As a result, “intelligence” is dictated by the volume of data you have to train it with. Machine learning cannot attain human-level intelligence.
2. Machine learning models are difficult to train. Time, resources and massive
data sets are needed to create data models, and the process involves manually
pre-tagging and categorizing data sets.
3. Machine learning is prone to data issues. Data quality, data labelling and model building are some of the data-related problems that affect ML success.
4. Machine learning is often biased. Machine learning systems are known for
operating in a black box, meaning you have no visibility into how the machine
learns and makes decisions. Thus, if you identify an instance of bias, there is no way to identify what caused it. Your only recourse is to retrain the algorithm with additional data, but there is no guarantee that this will resolve the issue.
o Labelled Data: The labelled data means some input data is already tagged
with the correct output.
o In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly.
1. Regression
2. Classification
o Linear Regression
o Polynomial Regression
o Lasso Regression
o Logistic Regression
o Support Vector Machines
o Naive Bayes
o With the help of supervised learning, the model can predict the output on the
basis of prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such
as fraud detection, spam filtering, Risk Assessment, Image classification, etc.
o Helps to optimize performance criteria with the help of experience.
o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is
different from the training dataset.
o In supervised learning, we need enough knowledge about the classes of objects.
o Classifying big data can be challenging.
o Training for supervised learning needs a lot of computation time.
1. Clustering
2. Association
1. Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in a group and have little or no similarity with the objects of another group.
o K-means clustering
o Principal Component Analysis
o DBSCAN Algorithm
o Apriori Algorithm
o FP-Growth algorithm
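As a concrete example of clustering, here is a minimal K-means sketch with scikit-learn (assumed available); the points and the number of clusters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six unlabelled points that form two visually separate groups.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Group them into two clusters based on similarity (distance).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # centre of each cluster
```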
Computational Complexity: Supervised learning is very complex; unsupervised learning has less computational complexity.
Real Time: Supervised learning uses off-line analysis; unsupervised learning uses real-time analysis of data.
o Fraud detection: Identifying cases of fraud when you only have a few positive
examples.
o Labeling data: Algorithms trained on small data sets can learn to apply data
labels to larger sets automatically.
o Speech Analysis
3. Reinforcement Learning
o Input: The input should be an initial state from which the model will start
o Output: There are many possible outputs as there are a variety of solutions
to a particular problem
o Training: Training is based upon the input; the model will return a state and the user will decide to reward or punish the model based on its output.
o The model continues to learn.
o The best solution is decided based on the maximum reward.
There are mainly two types of reinforcement learning, which are:
1. Positive Reinforcement
2. Negative Reinforcement
This type of reinforcement can sustain the changes for a long time, but too much positive reinforcement may lead to an overload of states, which can diminish the results.
o Q-Learning
o State-Action-Reward-State-Action (SARSA)
o Deep Q Neural Network (DQN)
o Resource management: Given finite resources and a defined goal, RL can help
enterprises plan out how to allocate resources.
o Finance Sector: The RL is currently used in the finance sector for evaluating
trading strategies.
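A tabular Q-learning sketch (the first algorithm listed above). The states, actions and the sample transition are hypothetical placeholders; only the update rule itself is the point.

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # Q-table of state-action values
alpha, gamma = 0.1, 0.9               # learning rate and discount factor

def q_update(state, action, reward, next_state):
    # One Q-learning update:
    # Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])

# Example transition: in state 0, action 1 earned reward 1.0 and led to state 2.
q_update(state=0, action=1, reward=1.0, next_state=2)
print(Q[0])
```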
Hypothesis Space: Hypothesis space is the set of all possible legal hypotheses. This is the set from which the machine learning algorithm determines the single best hypothesis that describes the target function or the outputs.
Inductive Bias: The inductive bias (also known as learning bias) of a learning
algorithm is the set of assumptions that the learner uses to predict outputs.
o True positives occur when your system predicts that an observation belongs to a
class and it actually does belong to that class.
o True negatives occur when your system predicts that an observation does not
belong to a class and it does not belong to that class.
o False positives occur when you predict an observation belongs to a class when in reality it does not. Also known as a type I error.
o False negatives occur when you predict an observation does not belong to a class when in fact it does. Also known as a type II error.
The three main metrics used to evaluate a classification model are accuracy,
precision, and recall.
Accuracy : Accuracy is defined as the percentage of correct predictions for the test
data. It can be calculated easily by dividing the number of correct predictions by the
number of total predictions.
F1 score: It is the harmonic mean of precision and recall. It takes the contribution of both, so the higher the F1 score, the better. Because of the product in the numerator, if either one goes low, the final F1 score goes down significantly. So a model does well on the F1 score if the predicted positives are actually positives (precision) and it does not miss positives by predicting them as negative (recall).
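In terms of the confusion-matrix counts defined above (TP, TN, FP, FN), the standard formulas for these metrics are:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},\qquad \text{Precision} = \frac{TP}{TP + FP},\qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$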
Validation
Bias: The bias is the difference between the values predicted by the ML model and the correct values. High bias gives a large error on training as well as testing data. It is recommended that an algorithm should always be low-biased to avoid the problem of under-fitting.
With high bias, the predictions take a straight-line form and thus do not fit the data in the data set accurately. Such fitting is known as under-fitting of data. This happens when the hypothesis is too simple or linear in nature.
When the bias is high, the assumptions made by our model are too basic and the model can’t capture the important features of our data. This means that our model hasn’t captured patterns in the training data and hence cannot perform well on the testing data either.
Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours and Support Vector Machines. At the same time, algorithms with high bias are Linear Regression, Linear Discriminant Analysis and Logistic Regression.
Variance: The variability of model prediction for a given data point, which tells us the spread of our predictions, is called the variance of the model. A model with high variance has a very complex fit to the training data and thus is not able to fit accurately on the test data. As a result, such models perform very well on training data but have high error rates on test data.
Variance indicates how much the estimate of the target function will alter if
different training data were used. In other words, variance describes how much
a random variable differs from its expected value. Contrary to bias, the Variance
is when the model takes into account the fluctuations in the data i.e. the noise
as well.
Some examples of machine learning algorithms with low variance are Linear Regression, Logistic Regression, and Linear Discriminant Analysis. At the same time, algorithms with high variance are Decision Trees, Support Vector Machines, and k-Nearest Neighbours.
1. Low-Bias, Low-Variance: The combination of low bias and low variance shows an
ideal machine learning model. However, it is not possible practically.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model learns with a large number of parameters and hence leads to over-fitting.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses a small number of parameters. It leads to under-fitting problems in the model.
4. High-Bias, High-Variance: With high bias and high variance, predictions are
inconsistent and also inaccurate on average.
This just ensures that we capture the essential patterns in our model while ignoring the noise present in it. This is called the Bias-Variance Trade-off. It helps
optimize the error in our model and keeps it as low as possible. An optimized
model will be sensitive to the patterns in our data, but at the same time will
be able to generalize to new data.
Under-fitting: Under-fitting occurs when the model has fewer features and hence a
statistical model or machine learning algorithm cannot capture the underlying trend
of the data. Intuitively, under-fitting occurs when the model or the algorithm does
not fit the data well enough. Under-fitting is often a result of an excessively simple
model. This model has high bias.
Over-fitting: Over-fitting occurs when the model has complex functions and hence
the statistical model or machine learning algorithm captures the noise of the data.
Intuitively, over-fitting occurs when the model or the algorithm fits the data too well
but is not able to generalize to predict new data. Over-fitting a model results in good accuracy on the training data set but poor results on new data sets. This model has high variance.
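A rough sketch of both failure modes using NumPy (assumed available): polynomials of different degree are fit to noisy samples of a quadratic. Degree 1 under-fits (too simple, high bias), while a very high degree chases the noise (high variance). The data is synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = x**2 + rng.normal(scale=0.05, size=x.shape)    # true pattern plus noise

for degree in (1, 2, 9):
    coeffs = np.polyfit(x, y, deg=degree)          # fit a polynomial of this degree
    train_error = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, round(train_error, 5))           # training error keeps shrinking,
                                                   # but high degrees fit the noise
```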
1. Non-Exhaustive methods
2. Exhaustive methods
1. Non-exhaustive Methods
These methods do not include all the ways of splitting the original dataset.
Hold out Validation approach: To use this approach, we first separate our initial dataset into two parts – training data and testing data. Then we train the model on the training data and check its performance on the unseen test data. The data is shuffled randomly before splitting.
Pros
The evaluation is fully independent of the training data.
This approach only needs to be run once, so it has lower computational cost.
Cons
The performance estimate has higher variance if the dataset is of smaller size.
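A minimal hold-out sketch with scikit-learn (an assumption; the dataset and model are illustrative): shuffle, split, fit on the training part, and score on the held-out part.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Shuffle and split: 70% for training, 30% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))   # performance on the unseen data
```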
k-Fold Cross-Validation: In this approach, we divide the data set into k number of
subsets and the holdout method is repeated k number of times.
The procedure has a single parameter called k that refers to the number of
groups that a given data sample is to be split into. As such, the procedure is
often called k-fold cross-validation. When a specific value for k is chosen, it
may be used in place of k in the reference to the model, such as k=10
becoming 10-fold cross-validation.
The value for k is chosen such that each train/test group of data samples is
large enough to be statistically representative of the broader dataset.
Pros
Models may not be affected much if there are some outliers present in the
dataset.
It helps us to overcome the problem of variability.
This method results in a less biased model compared to other methods since
every observation has the chance of appearing in both train and test sets.
The Best approach if we have a limited amount of input data.
Cons
Imbalanced datasets will impact our model.
Requires k times as much computation to evaluate, since the training algorithm has to be rerun from scratch k times to complete the k folds.
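A sketch of 10-fold cross-validation with scikit-learn's cross_val_score (assumed available); every observation appears in a test fold exactly once, and the k scores are averaged.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# k = 10: train and evaluate the model 10 times on different train/test splits.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(scores.mean(), scores.std())   # average accuracy and its spread across folds
```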
Stratified K-Fold Cross Validation: It tries to address the problem of the K-Fold
approach. The splitting of data into folds may be governed by criteria such as
ensuring that each fold has the same proportion of observations with a given
categorical value, such as the class outcome value. This is called stratified cross-
validation.
Stratification: It is the process of rearranging the data such that each of the
folds is a good representative of the whole dataset with respect to different
classes.
Pros
Each fold is a good representative of the whole dataset, so it handles imbalanced class distributions better than plain k-fold cross-validation.
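A sketch of stratified splitting with scikit-learn's StratifiedKFold (assumed available); each test fold keeps roughly the same class proportions as the full dataset.

```python
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    print(Counter(y[test_idx]))   # class counts stay balanced in every fold
```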
2. Exhaustive Methods
These methods test our model on all the possible ways to divide our original dataset
into training and a validation set.
Leave one out Cross-Validation (LOOCV): In this approach, we take out only one
data point from the available dataset for each iteration to test the model and train
the model on the rest of the data. This process iterates for each of the data points.
Pros
Since we make use of all data points, hence the bias will be less.
Cons
It is computationally very expensive, since the model has to be trained and tested once for every single data point in the dataset.
Optimization
Optimisation: Machine learning optimisation is the process of iteratively improving
the accuracy of a machine learning model, lowering the degree of error. Optimisation
is measured through a loss or cost function, which is typically a way of defining the
difference between the predicted and actual value of data. Machine learning models
aim to minimise this loss function, or lower the gap between the predicted and the actual output data.
Optimization is the process where we train the model iteratively, which results in evaluating a maximum or minimum of a function. We compare the results in every iteration by changing the hyperparameters in each step until we reach the optimum result.
o On the other hand, the parameters of the model are obtained during the
training. There is no way to get them in advance. Examples are weights and
biases for neural networks. This data is internal to the model and changes based
on the inputs.
Top optimization techniques in machine learning
Gradient Descent and Stochastic Gradient Descent are some of the most important optimization techniques.
Global Maxima and Minima: These are the maximum value and minimum value respectively over the entire domain of the function.
Local Maxima and Minima: These are the maximum value and minimum value respectively of the function within a given range.
There can be only one global minimum and maximum, but there can be more than one local minimum and maximum.
Gradient Descent: Gradient Descent is an optimization algorithm that finds a local minimum of a differentiable function. It is a minimization algorithm used to minimize a given function.
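Written as an update rule, with r as the learning rate (the step size discussed below), gradient descent repeatedly moves each point against the gradient:

$$x_{i+1} = x_i - r \cdot \frac{df}{dx}(x_i)$$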
Let’s see the geometric intuition of Gradient Descent:
(Figure: slope of y = x²)
Here, the minimum is at the origin (0, 0). The slope is tan θ. The slope on the right side is positive, since 0° < θ < 90° and tan θ is positive there; the slope on the left side is negative, since 90° < θ < 180° and tan θ is negative there.
Oscillation Problem
In the above example, we took r = 1. As we calculate the points xᵢ, xᵢ₊₁, xᵢ₊₂, … to find the local minimum x*, we can see that the iterates oscillate between x = -0.5 and x = 0.5. When we keep r constant, we end up with an oscillation problem, so we have to reduce the value of r with each iteration.
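A small sketch of this behaviour in plain Python, assuming f(x) = x² so that f'(x) = 2x: with r = 1 the update x ← x − r·2x simply flips the sign of x forever, while a small r shrinks x steadily towards the minimum at 0.

```python
def gradient_descent(x0, r, steps=6):
    # Run a few gradient descent steps on f(x) = x**2, whose gradient is 2x.
    x = x0
    path = [x]
    for _ in range(steps):
        x = x - r * 2 * x      # update: x <- x - r * f'(x)
        path.append(x)
    return path

print(gradient_descent(0.5, r=1.0))   # [0.5, -0.5, 0.5, -0.5, ...]  oscillates
print(gradient_descent(0.5, r=0.1))   # values shrink steadily towards 0
```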
Hyperparameters decide the bias-variance tradeoff. When r value is low, it could
overfit the model and cause high variance. When r value is high, it could underfit
the model and cause high bias. We can find the correct r value with Cross
Validation technique. Plot a graph with different learning rates and check for the
training loss with each value and choose the one with minimum loss.
However, classical gradient descent will not work well when there are multiple local minima. Having found the first minimum, we simply stop searching, because the algorithm only finds a local minimum; it is not designed to find the global one.
In gradient descent, we proceed forward with steps of the same size. If you
choose a learning rate that is too large, the algorithm will be jumping around
without getting closer to the right answer. If it’s too small, the computation will start mimicking an exhaustive search, which is, of course, inefficient.