P-2.1.2 Cross Validation and Regularization


University Institute of Engineering

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Bachelor of Engineering (Computer Science & Engineering)
Subject Name : Machine Learning
Subject Code: CST-316
Topic: Cross Validation and Regularization
Lecture-2.1.2
By : Baljeet Kaur Nagra
DISCOVER . LEARN . EMPOWER
Course Outcomes
CO-1: Identify the deep learning algorithms
which are more appropriate for various
types of learning tasks in various domains.

CO-2: To study learning processes:
supervised and unsupervised, deterministic
and statistical knowledge of deep learners,
and ensemble learning.

CO-3: To provide a comprehensive
foundation to Deep Neural Networks and
Optimization methodology with applications
to Pattern Recognition, Computer Vision.

Course Objectives

• To provide a comprehensive foundation to Deep Neural Networks and Optimization
methodology with applications to Pattern Recognition, Computer Vision.
• To study learning processes: supervised and unsupervised, deterministic and statistical
knowledge of deep learners, and ensemble learning.
• To understand the history and development of artificial neural network and deep
neural networks.
• To understand modern techniques and practical trends of deep learning.

CONTENTS
• Cross validation
• Steps
• Types- train-test split, Hold out, K-fold, Stratified k-fold, Leave p-out

• Regularization
• Working
• Techniques-L1,L2

• No free lunch theorem

Validation
• The process of deciding whether the numerical results quantifying hypothesized
relationships between variables are acceptable as descriptions of the data is known
as validation.
• Generally, an error estimate for the model is made after training, better known as
evaluation of residuals.
• In this process, a numerical estimate of the difference between predicted and original
responses is computed, also called the training error.
• However, this only tells us how well the model does on the data used to train it.
• It is still possible that the model is underfitting or overfitting the data.
• So, the problem with this evaluation technique is that it gives no indication of
how well the learner will generalize to an independent, unseen data set.
• Estimating this generalization performance is the purpose of cross validation.
Cross Validation
• To evaluate the performance of any machine learning model, we need to test it on some
unseen data.
• Based on the model’s performance on unseen data, we can say whether the model is
under-fitting, over-fitting, or well generalised.
• Cross validation (CV) is one of the techniques used to test the effectiveness of a machine
learning model; it is also a re-sampling procedure used to evaluate a model when we have
limited data.
• To perform CV, we keep aside a sample/portion of the data that is not used
to train the model, and later use this sample for testing/validating.

Cross Validation
• In machine learning, we cannot fit the model on the training data and simply assume that
the model will work accurately on real data.
• We must make sure that the model has learned the correct patterns from the data and is
not picking up too much noise.
• For this purpose, we use the cross-validation technique.
• Cross-validation is a technique in which we train the model on a subset of the data set
and then evaluate it on the complementary subset.

Steps
• Reserve some portion of the sample data set.
• Train the model using the rest of the data set.
• Test the model using the reserved portion of the data set.

Methods of Cross validation

• Train test split

• Hold out Method

• K-Fold Cross Validation

• Stratified K-fold Cross Validation

• Leave One Out Cross Validation


Train Test Split
• In this approach we randomly split the
complete data into training and test sets.
• We then train the model on the training set
and use the test set for validation; a typical
split is 70:30 or 80:20.
• With this approach there is a possibility of
high bias if we have limited data, because the
model misses the information in the data
that was not used for training.
• If the data set is large and the test and
training samples have the same distribution,
this approach is acceptable.
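A random train/test split can be sketched in a few lines of plain Python. The function name `train_test_split`, its parameters, and the toy data below are illustrative assumptions, not part of the lecture; libraries such as scikit-learn provide their own helpers for this.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Randomly split `samples` into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(samples)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))                     # hypothetical data set of 10 samples
train, test = train_test_split(data, test_fraction=0.3)  # a 70:30 split
print(len(train), len(test))               # 7 3
```

Every sample lands in exactly one of the two sets, so the test set stays unseen during training.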

Hold Out Cross Validation
• A basic remedy for evaluating only on training data is to remove a part of the training
data and use it to get predictions from the model trained on the rest of the data.
• The error estimate then tells how the model is doing on unseen data, i.e. the validation
set.
• This is a simple kind of cross validation technique, also known as the holdout
method.
• Although this method adds no computational overhead and is better than evaluating on
the training data alone, it still suffers from high variance.
• This is because it is not certain which data points will end up in the validation set,
and the result might be entirely different for different sets.
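The variance of the holdout estimate can be illustrated with a deliberately tiny example. The data (nine regular values plus one outlier), the "model" (predict the training mean), and the helper `holdout_mse` are all made up for illustration; the validation sets are fixed by hand rather than drawn at random so the effect is easy to follow.

```python
def mean(values):
    return sum(values) / len(values)

def holdout_mse(ys, val_idx):
    """Fit a mean predictor on the non-held-out targets; return validation MSE."""
    held_out = set(val_idx)
    train = [y for i, y in enumerate(ys) if i not in held_out]
    prediction = mean(train)               # the whole "model" is one number
    return mean([(ys[i] - prediction) ** 2 for i in val_idx])

# Hypothetical targets; the last value is an outlier.
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0]

# Three different choices of validation set on the same data.
scores = [holdout_mse(ys, idx) for idx in ([0, 1, 2], [3, 4, 5], [7, 8, 9])]
print(scores)  # the estimate changes a lot depending on where the outlier lands
```

The same data and the same model give very different error estimates depending only on which points were held out, which is the motivation for averaging over several splits.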

K-fold Cross Validation
• The procedure has a single parameter called k that refers to the number of groups that a
given data sample is to be split into.
• As such, the procedure is often called k-fold cross-validation.
• When a specific value for k is chosen, it may be used in place of k in the name of the
procedure, such as k=10 becoming 10-fold cross-validation.
• If k=5, the dataset is divided into 5 equal parts and the process below runs 5
times, each time with a different holdout set:
• 1. Take one group as the holdout or test data set
• 2. Take the remaining groups as the training data set
• 3. Fit a model on the training set and evaluate it on the test set
• 4. Retain the evaluation score and discard the model
• The k retained scores are then summarized, typically by their mean.
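The four steps above can be sketched in plain Python. The helper names `k_fold_indices` and `cross_val_mse`, the mean-predictor "model", and the toy targets are assumptions made for illustration. When k does not divide the sample size, this sketch spreads the remainder over the first folds, which is one possible choice; in practice the data is usually shuffled before folding.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]                # step 1: holdout group
        train_idx = indices[:start] + indices[start + size:]  # step 2: the rest
        yield train_idx, test_idx
        start += size

def cross_val_mse(ys, k):
    """Run k-fold CV for a mean predictor and return the k retained scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train_idx) / len(train_idx)  # step 3: fit
        mse = sum((ys[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
        scores.append(mse)                  # step 4: keep the score, drop the model
    return scores

ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]  # hypothetical targets
scores = cross_val_mse(ys, k=5)
print(len(scores), sum(scores) / len(scores))  # number of folds, mean CV score
```

Each sample is used for testing exactly once and for training k-1 times.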

K-fold Cross Validation
• The value for k is chosen such that each train/test group of data samples is large enough
to be statistically representative of the broader dataset.
• A value of k=10 is very common in the field of applied machine learning, and is
recommended if you are struggling to choose a value for your dataset.
• If a value for k is chosen that does not evenly split the data sample, then one group will
contain the remainder of the examples.
• It is preferable to split the data sample into k groups with the same number of samples,
so that the model skill scores are comparable across groups.

Stratified K-fold Cross Validation
• Same as K-Fold Cross Validation,
with one slight difference.
• The splitting of data into folds may
be governed by criteria such as
ensuring that each fold has the same
proportion of observations with a
given categorical value, such as the
class outcome value.
• This is called stratified cross-validation.
• For example, the folds can be
stratified on the basis of gender
(M or F).
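One simple way to build stratified folds is to deal the samples of each class round-robin across the folds. This is only a sketch: the function `stratified_folds` and the label list are illustrative assumptions, and real implementations typically also shuffle within each class and balance fold sizes more carefully.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds so class proportions stay similar."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal this class's samples across the folds one at a time.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

labels = ["M"] * 6 + ["F"] * 3          # hypothetical sample: 6 M and 3 F
folds = stratified_folds(labels, k=3)
for fold in folds:
    print(sorted(labels[i] for i in fold))  # each fold: ['F', 'M', 'M']
```

Every fold keeps the 2:1 ratio of the full sample, which a purely random split of nine points would not guarantee.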

Leave p Out Cross Validation (LPOCV)
• This approach leaves p data points out
of the training data, i.e. if there are n data
points in the original sample, then n-p
samples are used to train the model
and p points are used as the
validation set.
• This is repeated for all combinations
in which the original sample can be
separated this way, and the error
is then averaged over all trials to give
the overall effectiveness.
• The number of possible
combinations is C(n, p) = n! / (p! (n-p)!),
which equals the number of data points n
only when p = 1.
Leave p Out Cross Validation
• This method is exhaustive in the sense that it needs to train and validate the model for all
possible combinations, and for moderately large p it can become computationally
infeasible.
• A particular case of this method is p = 1. This is known as leave one out cross
validation (LOOCV).
• LOOCV is generally preferred over leave-p-out with larger p because it is far less
computationally intensive: the number of possible combinations equals the number
of data points in the original sample, n.
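The cost of the exhaustive procedure is easy to check by enumerating the splits. The generator `leave_p_out_splits` below is an illustrative sketch built on the standard library; the sample sizes are arbitrary.

```python
from itertools import combinations
from math import comb

def leave_p_out_splits(n_samples, p):
    """Yield every (train_idx, val_idx) pair with exactly p validation points."""
    indices = range(n_samples)
    for val_idx in combinations(indices, p):
        held_out = set(val_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, list(val_idx)

n = 6
print(sum(1 for _ in leave_p_out_splits(n, 1)), comb(n, 1))  # 6 6   (leave one out)
print(sum(1 for _ in leave_p_out_splits(n, 2)), comb(n, 2))  # 15 15
print(comb(100, 5))  # 75287520 model fits for n = 100, p = 5
```

With p = 1 the number of fits grows linearly in n, while for larger p it grows as the binomial coefficient, which quickly becomes impractical.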

Regularization
• The word regularize means to make things regular or acceptable.
• This is exactly what we use it for.
• Regularization refers to techniques used to reduce the error by fitting a function
appropriately on the given training set while avoiding overfitting.
• Regularization tunes the function by adding an additional penalty
term to the error function.
• The additional term controls an excessively fluctuating function so that the
coefficients don’t take extreme values.
• This technique of keeping a check on, or reducing, the values of the coefficients is called
shrinkage, or weight decay in the case of neural networks.

Working
• The basic idea is to penalize the complex models i.e. adding a complexity term that
would give a bigger loss for complex models. To understand it, let’s consider a simple
relation for linear regression. Mathematically, it is stated as below:
• Y ≈ β0 + β1X1 + β2X2 + …+ βpXp
• Here Y represents the learned relation and β represents the coefficient estimates for
different variables or predictors(X).

• Now, in order to fit a model that accurately predicts the value of Y, we require a loss
function and optimized parameters, i.e. bias and weights.
• The loss function generally used for linear regression is the residual sum of
squares (RSS). For the linear regression relation stated above, it is given by:
• RSS = Σi (yi − β0 − Σj βj xij)², where i = 1, …, n indexes the observations and
j = 1, …, p indexes the predictors.

Working
• RSS can also be called the linear regression objective without regularization.
• The model learns by means of this loss function.
• Based on the training data, it adjusts the weights (coefficients).
• If the dataset is noisy, the model can overfit and the estimated coefficients won’t
generalize to unseen data.
• This is where regularization comes into action. It shrinks these learned estimates
towards zero by penalizing the magnitude of the coefficients.

Techniques
• There are two main regularization techniques, namely
• Ridge Regression and
• Lasso Regression.

• They both differ in the way they assign a penalty to the coefficients.

• A regression model that uses the L1 regularization technique is called Lasso
Regression, and a model that uses L2 is called Ridge Regression.

Ridge Regression (L2 Regularization)

• Ridge regression modifies the RSS by adding a shrinkage quantity:
RSS + λ Σj βj², with the sum running over j = 1, …, p.
• The coefficients are estimated by minimizing this function.
• Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of
our model.

Ridge Regression (L2 Regularization)
• The increase in flexibility of a model is represented by an increase in its coefficients, and if we
want to minimize the above function, then these coefficients need to be small.
• This is how the ridge regression technique prevents coefficients from rising too high.
• Also, notice that we shrink the estimated association of each variable with the response,
except the intercept β0.
• This intercept is a measure of the mean value of the response when xi1 = xi2 = … = xip = 0.
• When λ = 0, the penalty term has no effect, and the estimates produced by ridge regression
are equal to the least squares estimates.
• However, as λ→∞, the impact of the shrinkage penalty grows, and the ridge regression
coefficient estimates approach zero.
• As can be seen, selecting a good value of λ is critical.
• The penalty used by this method is the squared L2 norm of the coefficient vector, which is why
ridge regression is also called L2 regularization.
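The effect of λ can be seen in the one-predictor case, where ridge regression has a simple closed form: with the intercept left unpenalized, the slope is Sxy / (Sxx + λ), where Sxy and Sxx are the centered cross-product and sum of squares. The function `ridge_fit_1d` and the toy data are illustrative assumptions, not the lecture's own code.

```python
def ridge_fit_1d(xs, ys, lam):
    """Minimize sum (y - b0 - b1*x)^2 + lam * b1^2; the intercept b0 is not penalized."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = sxy / (sxx + lam)            # lam = 0 recovers least squares
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]           # hypothetical data on the line y = 1 + 2x
for lam in (0.0, 10.0, 1000.0):
    print(lam, ridge_fit_1d(xs, ys, lam))
# lam = 0    -> (1.0, 2.0), the least squares fit
# lam = 10   -> (3.0, 1.0), the slope is halved
# lam = 1000 -> slope close to 0, intercept close to the mean of y (5.0)
```

The slope shrinks towards zero as λ grows, while the intercept moves towards the mean response, matching the behaviour described for λ = 0 and λ→∞.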
Ridge Regression (L2 Regularization)
• The coefficients that are produced by the standard least squares method are scale
equivariant, i.e. if we multiply each input by c then the corresponding coefficients are scaled
by a factor of 1/c.
• Therefore, regardless of how the predictor is scaled, the multiplication of predictor and
coefficient(Xjβj) remains the same. 
• However, this is not the case with ridge regression, and therefore, we need to standardize
the predictors or bring the predictors to the same scale before performing ridge regression. 
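The difference in scaling behaviour can be checked numerically in the one-predictor case. The helper `slope`, the scaling factor c = 10, and the data are assumptions chosen for illustration; the closed form Sxy / (Sxx + λ) is the one-predictor ridge slope with an unpenalized intercept.

```python
def slope(xs, ys, lam):
    """One-predictor slope: lam = 0 is least squares, lam > 0 is ridge."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / (sxx + lam)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]           # hypothetical data
c = 10.0
scaled_xs = [c * x for x in xs]          # same predictor in different units

ols, ols_scaled = slope(xs, ys, 0.0), slope(scaled_xs, ys, 0.0)
ridge, ridge_scaled = slope(xs, ys, 10.0), slope(scaled_xs, ys, 10.0)

print(ols, c * ols_scaled)      # equal: least squares is scale equivariant
print(ridge, c * ridge_scaled)  # differ: the ridge fit depends on the units of x
```

Because the penalty acts on the raw coefficient, changing the units of a predictor changes how strongly it is penalized, which is why predictors are standardized first.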

Lasso Regression (L1 Regularization)

• Lasso is another variation, in which the following function is minimized:
RSS + λ Σj |βj|, with the sum running over j = 1, …, p.
• This variation differs from ridge regression only in how it penalizes large
coefficients.
• It uses |βj| (the modulus) instead of the square of βj as its penalty. In statistics, this is known
as the L1 norm.
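For a single predictor, both penalties have closed-form solutions, which makes the difference concrete: the lasso slope is obtained by soft-thresholding, while the ridge slope is a rescaling. The functions below and the toy data are illustrative assumptions; with several predictors the lasso has no closed form in general and is usually solved iteratively.

```python
def soft_threshold(value, threshold):
    """Shrink `value` towards zero by `threshold`, clipping at exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def centered_sums(xs, ys):
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy, sxx

def lasso_slope(xs, ys, lam):
    """Slope minimizing sum (y - b0 - b1*x)^2 + lam * |b1|."""
    sxy, sxx = centered_sums(xs, ys)
    return soft_threshold(sxy, lam / 2.0) / sxx

def ridge_slope(xs, ys, lam):
    """Slope minimizing sum (y - b0 - b1*x)^2 + lam * b1^2."""
    sxy, sxx = centered_sums(xs, ys)
    return sxy / (sxx + lam)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]           # hypothetical data
for lam in (0.0, 10.0, 40.0, 100.0):
    print(lam, lasso_slope(xs, ys, lam), ridge_slope(xs, ys, lam))
# lasso: 2.0, 1.5, 0.0, 0.0   ridge: 2.0, 1.0, 0.4, then about 0.18
```

The lasso slope reaches exactly zero once λ is large enough, whereas the ridge slope only gets smaller.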

L1 and L2 Regularization
• Ridge regression can be thought of as solving a constrained problem in which the sum of
squares of the coefficients is less than or equal to s.
• The lasso can be thought of as solving a constrained problem in which the sum of the moduli
of the coefficients is less than or equal to s.
• Here, s is a constant that exists for each value of the shrinkage factor λ.
• These inequalities are also referred to as constraint functions.
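Written out, the two constrained problems look as follows. This is a standard way of stating them, using the usual RSS for linear regression with n observations and p predictors; the indexing is an assumption chosen to match the notation of the earlier slides.

```latex
% Ridge regression: least squares restricted to an L2 ball
\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2
\quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \le s

% Lasso: least squares restricted to an L1 ball
\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2
\quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le s
```

A small s corresponds to a large λ (strong shrinkage), and a large s corresponds to a small λ.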

L1 and L2 Regularization
• Consider a problem with 2 parameters.
• Then, according to the above formulation, the ridge regression constraint is β1² + β2² ≤ s.
• This implies that the ridge regression coefficients are the point with the smallest RSS (loss
function) among all points that lie within the circle given by β1² + β2² ≤ s.
• Similarly, for lasso, the constraint becomes |β1|+|β2| ≤ s.
• This implies that the lasso coefficients are the point with the smallest RSS (loss function)
among all points that lie within the diamond given by |β1|+|β2| ≤ s.

L1 and L2 Regularization

• The figure for this slide shows the constraint functions (green areas) for lasso (left) and ridge
regression (right), along with contours of the RSS (red ellipses).
• All points on a given ellipse share the same value of RSS.
• For a very large value of s, the green regions contain the center of the ellipses, making the
coefficient estimates of both regression techniques equal to the least squares estimates.
• But this is not the case in the figure.
L1 and L2 Regularization
• In this case, the lasso and ridge regression coefficient
estimates are given by the first point at which an ellipse
contacts the constraint region. 
• Since ridge regression has a circular constraint with
no sharp points, this intersection will not generally
occur on an axis, and so the ridge regression
coefficient estimates will generally all be non-zero.
• However, the lasso constraint has corners at each of
the axes, and so the ellipse will often intersect the
constraint region at an axis. When this occurs, one of
the coefficients will equal zero.
• In higher dimensions (where there are many more
than 2 parameters), many of the coefficient estimates
may equal zero simultaneously.
Conclusion
• This sheds light on an obvious disadvantage of ridge regression, which is model
interpretability.
• It will shrink the coefficients of the least important predictors very close to zero, but it will
never make them exactly zero.
• In other words, the final model will include all predictors.
• However, in the case of the lasso, the L1 penalty has the effect of forcing some of the
coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently
large.
• Therefore, the lasso method also performs variable selection and is said to yield sparse
models.

What does Regularization Achieve
• A standard least squares model tends to have some variance in it, i.e. the model won’t
generalize well to a data set different from its training data.
• Regularization significantly reduces the variance of the model without a substantial
increase in its bias.
• So the tuning parameter λ, used in the regularization techniques described above, controls the
impact on bias and variance.
• As the value of λ rises, it reduces the values of the coefficients and thus reduces the variance.
• Up to a point, this increase in λ is beneficial, as it only reduces the variance (hence
avoiding overfitting) without losing any important properties in the data.
• But beyond a certain value, the model starts losing important properties, giving rise to bias in
the model and thus underfitting. Therefore, the value of λ should be carefully selected.
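One common way to select λ is to compare cross-validated error over a grid of candidate values. The sketch below combines leave one out cross validation with a one-predictor ridge fit; the data, the grid, and the helper names are made up for illustration, and the closed form assumes an unpenalized intercept.

```python
def ridge_fit_1d(xs, ys, lam):
    """One-predictor ridge fit with an unpenalized intercept."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = sxy / (sxx + lam)
    return y_mean - beta1 * x_mean, beta1

def loocv_mse(xs, ys, lam):
    """Leave one out CV: fit on all points but one, score on the left-out point."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        beta0, beta1 = ridge_fit_1d(train_x, train_y, lam)
        errors.append((ys[i] - (beta0 + beta1 * xs[i])) ** 2)
    return sum(errors) / len(errors)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.5, 2.9, 4.2, 7.4, 8.1, 11.3]     # hypothetical noisy data, roughly y = 1 + 2x
grid = [0.0, 0.1, 1.0, 10.0, 100.0]      # candidate values of lambda

scores = {lam: loocv_mse(xs, ys, lam) for lam in grid}
best_lam = min(scores, key=scores.get)
for lam in grid:
    print(lam, round(scores[lam], 3))
print("selected lambda:", best_lam)
```

On this toy data a very large λ gives a clearly worse validation error than no penalty, because the slope is shrunk so far that the model underfits.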

No Free Lunch Theorem
• The No Free Lunch Theorem (NFLT) implies that no single algorithm is universally
superior to every other algorithm across all problems.
• There are, generally speaking, two No Free Lunch (NFL) theorems: one for machine learning
and one for search and optimization.
• These two theorems are related and tend to be bundled into one general axiom (the folklore
theorem).
• As stated by David Wolpert and William G. Macready, “If an algorithm performs better than
random search on some class of problems then it must perform worse than random search on
the remaining problems.”
• This means that if an algorithm is particularly adept at solving one class of problems,
then that algorithm is fitted to recognize the pattern of that particular problem class.
• An algorithm works well with a specific dataset because it is responsive to the unique qualities
of that dataset; another dataset might have other unique challenges that will not be
solved by the same algorithm.
No Free Lunch Theorem
• In his 1996 paper The Lack of A Priori
Distinctions Between Learning Algorithms,
David Wolpert demonstrates that for any two
algorithms, A and B, there are as many scenarios
where A will perform worse than B as there are
where A will outperform B.
• This even holds true when one of the given
algorithms is random guessing. Wolpert proved
that for all possible domains (all possible
problem instances drawn from a uniform
probability distribution), the average
performance for algorithms A and B is the same.
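The averaging argument can be illustrated on a toy domain small enough to enumerate. The setup below (four input points, binary labels, two seen and two unseen points, and the two learners) is entirely an illustrative assumption and not Wolpert's own construction; it only mirrors the idea of averaging over all possible target functions with equal weight.

```python
from itertools import product

train_x, test_x = (0, 1), (2, 3)          # two seen points, two unseen points

def always_zero(train_labels):
    """Learner A: ignores the training data and always predicts 0."""
    return 0

def majority(train_labels):
    """Learner B: predicts the majority training label (ties go to 1)."""
    return 1 if 2 * sum(train_labels) >= len(train_labels) else 0

def average_accuracy(learner):
    """Mean accuracy on unseen points over all 16 binary labelings of 4 points."""
    targets = list(product((0, 1), repeat=4))
    total = 0.0
    for target in targets:
        guess = learner([target[x] for x in train_x])
        total += sum(1 for x in test_x if target[x] == guess) / len(test_x)
    return total / len(targets)

print(average_accuracy(always_zero), average_accuracy(majority))  # 0.5 0.5
```

When every labeling of the unseen points is equally likely, the training labels carry no information about them, so both learners average exactly 0.5. A learner only gains an edge when the problems it faces are not uniformly distributed, which is where assumptions about the data come in.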

No Free Lunch Theorem
• The two most important things to take away from the No Free Lunch theorems are:
• Always check your assumptions before relying on a model or search algorithm.
• There is no “super algorithm” that will work perfectly for all datasets.
• The No Free Lunch theorems were not written to tell you what to do in different scenarios.
They were specifically written to counter claims along the lines of:
• “This machine learning algorithm/optimization strategy is the best, always and forever, for
all scenarios.”

No Free Lunch Theorem
• Models are simplifications of a specific component of reality (observed with data). To
simplify reality, a machine learning algorithm or statistical model needs to make assumptions
and introduce bias (known specifically as inductive or learning bias).
• Bias-free learning is futile because a learner that makes no a priori assumptions will have no
rational basis for creating estimates when provided new, unseen input data.
• The assumptions of an algorithm will work for some data sets but fail for others. This
phenomenon is important for understanding the concepts of underfitting and the bias/variance
tradeoff.
• The combination of your data and a randomly selected machine learning model is not
enough to make accurate or meaningful predictions about the future or unknown outcomes.
• You, the human, will need to make assumptions about the nature of your data and the world
we live in.
• Playing an active role in making assumptions will only strengthen your models and make
them more useful, even if they are wrong.
References
• Books and Journals
• Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai
Ben-David-Cambridge University Press 2014
• Introduction to machine Learning – the Wikipedia Guide by Osman Omer.

• Video Link-
• https://www.youtube.com/watch?v=9f-GarcDY58
• https://www.youtube.com/watch?v=GwIo3gDZCVQ

• Web Link-
• https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f
• https://towardsdatascience.com/regularization-an-important-concept-in-machine-learning-5891628907ea
• https://www.kdnuggets.com/2019/09/no-free-lunch-data-science.html
• https://towardsdatascience.com/a-blog-about-lunch-and-data-science-how-there-is-no-such-a-thing-as-free-lunch-e46fd57c7f27
THANK YOU
