How to Improve the Model


How to improve the model

• More data may be required
• Data needs to have more diversity
• Algorithm needs longer training
• More hidden layers or hidden units are required
• Add regularization
• Change the neural network architecture (e.g., the activation functions)
• There are many other considerations you can think of...
Distribution of data for improving the accuracy of the model

Training set – which you run your learning algorithm on.

Dev (development) set – which you use to tune parameters, select features, and make other decisions regarding the learning algorithm. Sometimes also called the hold-out cross validation set.

Test set – which you use to evaluate the performance of the algorithm, but not to make any decisions regarding what learning algorithm or parameters to use.

Partition your data into different categories:

Training Set | Dev (Hold-Out Cross Validation) Set | Test Set

Traditional-style partitioning: 70/30 or 60/20/20.

But in the era of deep learning the split may even go down to 99/0.5/0.5.

If the data size is 1,000,000, then 5,000 examples will still be there in each of the dev and test sets.
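
As a sketch, a 99/0.5/0.5 split of a hypothetical dataset can be carved out with a few lines of NumPy (the data here is random placeholder data):

```python
import numpy as np

# Hypothetical dataset: 1,000,000 examples with 10 features each
rng = np.random.default_rng(seed=0)
X = rng.standard_normal((1_000_000, 10))
y = rng.integers(0, 2, size=1_000_000)

# Shuffle once, then carve out 99% train, 0.5% dev, 0.5% test
idx = rng.permutation(len(X))
n_dev = n_test = len(X) // 200            # 0.5% each -> 5,000 examples

train_idx = idx[: len(X) - n_dev - n_test]
dev_idx   = idx[len(X) - n_dev - n_test : len(X) - n_test]
test_idx  = idx[len(X) - n_test :]

X_train, y_train = X[train_idx], y[train_idx]
X_dev,   y_dev   = X[dev_idx],   y[dev_idx]
X_test,  y_test  = X[test_idx],  y[test_idx]

print(len(X_train), len(X_dev), len(X_test))   # 990000 5000 5000
```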
Importance of Choosing dev and test sets wisely

• The purpose of the dev and test sets is to direct your team toward the most important changes to make to the machine learning system.

• It is very important that the dev and test sets reflect the data you expect to get in the future and want to do well on.

• A bad distribution will severely restrict the analysis of why the test data is not giving good results.
Importance of Dev Set

• If your team improves the classifier’s accuracy from 95.0% to 95.1%, you might not be able to detect that 0.1% improvement from playing with the app.
• Having a dev set and metric allows you to very quickly detect which
ideas are successfully giving you small (or large) improvements, and
therefore lets you quickly decide what ideas to keep refining, and
which ones to discard.
Data Distribution Mismatch
It is naturally good for the data in all the sets to come from the same distribution.

For example, the housing data comes from Mumbai while we are trying to predict house prices in Chandigarh.

Otherwise you may waste a lot of time improving performance on the dev set, only to find out that it does not work well for the test set.

Sometimes we have only two partitions of the data; in that case they are called the train/dev or train/test sets.
Using a single Evaluation Metric
You should be clear about what you are trying to achieve and what you
are trying to tune
Classifier | Precision (%) | Recall (%)
A | 95 | 90
B | 98 | 85

Precision – of the examples recognized as true, how many are actually true.

Recall – of all the true examples, how many have been correctly extracted.
Precision and Recall
• Total days – 30 (one month)
• Actual rainy days – 10
• Model predicts rain on 9 days (5 are correct and 4 are incorrect)
• Precision = 5/9 (How many selected items are relevant?)
• Recall = 5/10 (How many relevant items are selected?)
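
The same numbers can be reproduced with a few lines of Python:

```python
# Rain example from above: 30 days, rain actually fell on 10 of them,
# the model predicted rain on 9 days, of which 5 were correct.
true_positives = 5          # predicted rain and it rained
false_positives = 4         # predicted rain but it did not rain
false_negatives = 10 - 5    # rainy days the model missed

precision = true_positives / (true_positives + false_positives)   # 5/9 ≈ 0.556
recall = true_positives / (true_positives + false_negatives)      # 5/10 = 0.5

print(f"Precision: {precision:.3f}, Recall: {recall:.3f}")
```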
Using a single Evaluation Metric
You should be clear about what you are trying to achieve and what you
are trying to tune
Classifier | Precision (%) | Recall (%) | F1 Score
A | 95 | 90 | 92.4
B | 98 | 85 | 91

F1 Score = harmonic mean of Precision and Recall = 2 / (1/Precision + 1/Recall) = (2 · Precision · Recall) / (Precision + Recall)
Optimize one parameter and satisfy others
Classifier | Accuracy (%) | Running Time | Safety | False Positive | …
A | 90 | 20 ms | No | |
B | 92 | 80 ms | Yes | |
C | 96 | 2000 ms | Yes | |

Maximize ????? Subject to ????? And ????? And…

So a few of them can be satisficing metrics.

e.g. if we say that the running time needs to be at most 100 ms, then running time is a satisficing metric and accuracy can be the optimizing metric (see the sketch below).
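
A minimal sketch of this “optimize one metric, satisfice the rest” rule, using the (hypothetical) numbers from the table above:

```python
# Hypothetical classifier results from the table above
classifiers = {
    "A": {"accuracy": 90, "running_time_ms": 20},
    "B": {"accuracy": 92, "running_time_ms": 80},
    "C": {"accuracy": 96, "running_time_ms": 2000},
}

MAX_RUNNING_TIME_MS = 100   # satisficing metric: running time must be <= 100 ms

# Keep only the classifiers that satisfy the constraint ...
feasible = {name: m for name, m in classifiers.items()
            if m["running_time_ms"] <= MAX_RUNNING_TIME_MS}

# ... then pick the one with the best optimizing metric (accuracy)
best = max(feasible, key=lambda name: feasible[name]["accuracy"])
print(best)   # "B": highest accuracy among the models that are fast enough
```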
False Positive and False Negative
• Amazon Echo listening for “Alexa”; Apple Siri listening for “Hey Siri”; Android listening for “Okay Google”.
• We care about the false positive rate (the frequency with which the system wakes up even when no one said the wakeword) as well as the false negative rate (how often it fails to wake up when someone says the wakeword).
• One goal is to minimize the false negative rate (optimizing metric), subject to there being no more than one false positive every 24 hours of operation (satisficing metric).
Bias/Variance

The algorithm’s error rate on the training set is the algorithm’s bias.

How much worse the algorithm does on the dev (or test) set than on the training set is the algorithm’s variance.
Bias/Variance
 | Train Set Error (%) | Dev Set Error (%) | Diagnosis
Case 1 | 1 | 9 | High Variance (Overfitting)
Case 2 | 12 | 13 | High Bias (Underfitting)
Case 3 | 8 | 16 | High Bias, High Variance (Underfitting)
Case 4 | 1 | 1.5 | Low Bias, Low Variance (Good fit)
A multi-dimensional system can have high bias in some areas and high variance in other areas of the system, resulting in a high-bias, high-variance issue.
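
A rough sketch of this diagnosis in Python, using the numbers from the table above; the 5-point cut-offs are illustrative, not a standard rule:

```python
# Train/dev error pairs from the table above (assuming ~0% optimal error)
cases = {
    "case 1": (1.0, 9.0),
    "case 2": (12.0, 13.0),
    "case 3": (8.0, 16.0),
    "case 4": (1.0, 1.5),
}

BIAS_THRESHOLD = 5.0      # illustrative cut-offs, not universal rules
VARIANCE_THRESHOLD = 5.0

for name, (train_err, dev_err) in cases.items():
    bias = train_err                 # error on the training set
    variance = dev_err - train_err   # how much worse we do on the dev set
    diagnosis = []
    if bias > BIAS_THRESHOLD:
        diagnosis.append("high bias")
    if variance > VARIANCE_THRESHOLD:
        diagnosis.append("high variance")
    print(name, bias, variance, diagnosis or ["looks fine"])
```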
High Bias
• Increase the model size (such as the number of neurons/layers): allows you to fit the training set better. If you find that this increases variance, then use regularization, which will usually eliminate the increase in variance. (A sketch follows this list.)

• Modify input features based on insights from error analysis: create additional features that help the algorithm eliminate a particular category of errors. These new features could help with both bias and variance.

• Reduce or eliminate regularization (L2, L1 regularization, dropout): reduces avoidable bias, but increases variance.

• Modify the model architecture (such as the neural network architecture) so that it is more suitable for your problem: this can affect both bias and variance.
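
As an illustration, the first and third remedies might look like the following Keras sketch: a larger network with only a weak L2 penalty. The layer sizes, input dimension, and penalty strength are all placeholder values, not recommendations.

```python
import tensorflow as tf

# Placeholder input size and a deliberately weak L2 penalty (less regularization -> lower bias)
N_FEATURES = 20
weak_l2 = tf.keras.regularizers.l2(1e-5)

# A larger network (more layers/units) so that the training set can be fit better
model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_FEATURES,)),
    tf.keras.layers.Dense(256, activation='relu', kernel_regularizer=weak_l2),
    tf.keras.layers.Dense(256, activation='relu', kernel_regularizer=weak_l2),
    tf.keras.layers.Dense(128, activation='relu', kernel_regularizer=weak_l2),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```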
High Variance
• Add more training data: the simplest and most reliable way to address variance, so long as you have access to significantly more data and enough computational power to process the data.

• Add regularization (L2, L1 regularization, dropout): this technique reduces variance but increases bias. (A sketch follows this list.)

• Add early stopping (stop gradient descent early, based on dev set error): reduces variance but increases bias.

• Modify the model architecture (such as the neural network architecture) so that it is more suitable for your problem: this affects both bias and variance.
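
A minimal Keras sketch of the regularization remedy, adding an L2 penalty and dropout to a hypothetical network; the penalty strength, dropout rate, and layer sizes are placeholders:

```python
import tensorflow as tf

l2_penalty = tf.keras.regularizers.l2(0.01)   # illustrative penalty strength

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                                   # placeholder input size
    tf.keras.layers.Dense(128, activation='relu', kernel_regularizer=l2_penalty),
    tf.keras.layers.Dropout(0.2),              # randomly drop 20% of units during training
    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=l2_penalty),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```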
Often compare with human level performance
Examples: image recognition, spam classification.

• Ease of obtaining data from human labelers.
• Error analysis can draw on human intuition.
• Human-level performance can be used to estimate the optimal error rate and also to set a “desired error rate”.
Tasks where we don’t compare with human level performance

Examples: picking a book to recommend to you; picking an ad to show a user on a website; predicting the stock market.

• It is harder to obtain labels.
• Human intuition is harder to count on.
• It is hard to know what the optimal error rate and a reasonable desired error rate are.
Regularization

Any modification we make to the learning algorithm that is intended to reduce the generalization error, but not its training error.

This technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.
Norm (L1 and L2)

Ridge Regression (L2) – uses the L2 norm (Euclidean distance).

Lasso Regression (L1) – uses the L1 norm (Manhattan or city block distance).
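
In formulas (standard definitions, with w as the weight vector and λ as the regularization strength):

```latex
% L2 norm (Euclidean distance) and the Ridge objective
\|w\|_2 = \sqrt{\textstyle\sum_{i=1}^{n} w_i^2},
\qquad
\min_{w}\; \sum_{j}\bigl(y_j - x_j^{\top} w\bigr)^2 + \lambda \|w\|_2^2

% L1 norm (Manhattan / city block distance) and the Lasso objective
\|w\|_1 = \textstyle\sum_{i=1}^{n} |w_i|,
\qquad
\min_{w}\; \sum_{j}\bigl(y_j - x_j^{\top} w\bigr)^2 + \lambda \|w\|_1
```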


Dropout regularization

[Figure: a small neural network with inputs x1–x4 and an output, used to illustrate dropping out hidden units]
Dropout regularization: prevents overfitting
This technique has also become popular recently. We drop out some of the hidden units for specific training examples. Different hidden units may be switched off for different examples, and in different iterations of the optimization different units may be dropped randomly.

The dropout rates can also be different for different layers. So we can select specific layers which have a higher number of units and may be contributing more towards overfitting; these are suitable for higher dropout rates.

For some of the layers the dropout rate can be 0, which means no dropout.

Layer-wise dropout rates, e.g.: 0, 0.2, 0.2, 0, 0, 0 (see the sketch below).
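
A sketch of layer-wise dropout in Keras, loosely following the 0 / 0.2 / 0.2 / 0 pattern: dropout is applied only after the larger hidden layers. All layer sizes and rates here are hypothetical.

```python
import tensorflow as tf

# Dropout only after the two larger hidden layers (rate 0.2);
# the smaller layers get no dropout (rate 0).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                     # hypothetical input size
    tf.keras.layers.Dense(64, activation='relu'),    # rate 0 -> no Dropout layer
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.2),                    # large layer, more prone to overfitting
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation='relu'),    # rate 0 again
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
```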
Dropout
• Dropout also helps in spreading out the weights across all layers, as the system will be reluctant to put too much weight on any specific node. So it helps in shrinking the weights and has an adaptive effect on them.
• Dropout has a similar effect to L2 regularization in reducing overfitting.
• We don’t use dropout for test examples.
• We also need to bump up (scale) the values at the output of each layer to compensate for the dropped units (see the sketch below).
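
A minimal NumPy sketch of this scaling, done the common “inverted dropout” way: activations are scaled up by 1/keep_prob during training, so nothing needs to change at test time. The activation matrix `a` here is just random placeholder data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def dropout_forward(a, keep_prob, training):
    """Apply (inverted) dropout to a layer's activations `a`."""
    if not training or keep_prob >= 1.0:
        return a                               # no dropout at test time
    mask = rng.random(a.shape) < keep_prob     # keep each unit with probability keep_prob
    return (a * mask) / keep_prob              # scale up so expected activations match

a = rng.standard_normal((4, 5))                # hypothetical activations (batch of 4)
a_train = dropout_forward(a, keep_prob=0.8, training=True)
a_test  = dropout_forward(a, keep_prob=0.8, training=False)
```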
Early stopping

[Figure: training error and dev set error plotted against the number of iterations; the dev set error eventually starts to rise while the training error keeps decreasing]
Early Stopping
Sometimes the dev set error goes down and then it starts going up. So you may decide to stop where the curve has started taking a different turn.

By stopping halfway we also reduce the number of iterations to train and the computation time.

Early stopping does not go well with orthogonalization because it contradicts our original objective of optimizing (w, b) to the minimum possible cost function.

We are stopping the process of optimization in between in order to take care of the overfitting, which is a different objective than optimization.
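
In Keras, this behaviour is usually obtained with the EarlyStopping callback, as in the sketch below; the tiny model, synthetic data, and patience value are placeholders.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data; in practice use your own train and dev sets.
rng = np.random.default_rng(seed=0)
X_train, y_train = rng.standard_normal((1000, 20)), rng.integers(0, 2, 1000)
X_dev,   y_dev   = rng.standard_normal((200, 20)),  rng.integers(0, 2, 200)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Stop when the dev (validation) loss has not improved for `patience` epochs,
# and roll back to the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5, restore_best_weights=True)

model.fit(X_train, y_train,
          validation_data=(X_dev, y_dev),
          epochs=200,
          callbacks=[early_stop])
```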
