Medical Insurance Prediction Slides


PROJECT OVERVIEW

• The objective of this case study is to predict the health insurance cost incurred by individuals based on their age, gender, BMI, number of children, smoking habits, and geo-location.

Photo credit: https://www.publicdomainpictures.net/en/view-image.php?image=279909&picture=medical-insurance
PROJECT OVERVIEW

• The available features are:

o age: age of the primary beneficiary
o sex: insurance contractor gender
o bmi: Body mass index (ideally 18.5 to 24.9)
o children: Number of children covered by health insurance / Number of dependents
o smoker: Whether the beneficiary smokes
o region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
• Target (output):
o charges: Individual medical costs billed by health insurance

• Data Source: https://www.kaggle.com/mirichoi0218/insurance
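• A minimal sketch of loading and inspecting this dataset with pandas (assuming insurance.csv from the Kaggle link above has been downloaded into the working directory):

```python
import pandas as pd

# Load the Kaggle insurance dataset (file name assumed to be insurance.csv)
insurance_df = pd.read_csv("insurance.csv")

# Columns: age, sex, bmi, children, smoker, region, charges
print(insurance_df.head())
print(insurance_df.describe())
```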


MULTIPLE LINEAR
REGRESSION
RECALL SIMPLE LINEAR REGRESSION?

• The goal is to obtain a relationship (model) between two variables only, such as age and insurance cost, for example.
$$y = b + m x$$

Dependent variable (y): insurance cost / premium ($). Independent variable (x): age (years).

(Figure: scatter plot of insurance cost ($) vs. age (years), with the fitted line as the model — the goal.)
MULTIPLE LINEAR REGRESSION:
INTUITION
• Multiple Linear Regression examines the relationship between more than two variables.
• Recall that Simple Linear Regression is a statistical model that examines the linear relationship between two variables only.
• Each independent variable has its own corresponding coefficient.

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

Dependent variable (y): insurance cost ($). Independent variables (x₁ … xₙ): age, smoking habits, region, etc.
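• As a hedged sketch (not the slides' exact code), this multiple linear regression can be fit with scikit-learn, one-hot encoding the categorical features; insurance_df is the DataFrame loaded earlier:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# One-hot encode the categorical features (sex, smoker, region)
X = pd.get_dummies(insurance_df.drop(columns="charges"), drop_first=True)
y = insurance_df["charges"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)       # learns b0 (intercept) and b1..bn
y_pred = model.predict(X_test)    # predictions used by the metrics below

print("b0:", model.intercept_)
print("b1..bn:", dict(zip(X.columns, model.coef_)))
```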
REGRESSION
METRICS AND KPIs
REGRESSION METRICS: HOW TO ASSESS
MODEL PERFORMANCE?
• After model fitting, we would like to assess the performance of the model by comparing model
predictions to actual (True) data
$$\text{Residual (Error)} = \hat{y}_i - y_i$$

where $\hat{y}_i$ is the estimated/predicted value and $y_i$ is the actual value.

(Figure: scatter plot of insurance cost ($) vs. age (years), showing the residuals between actual points and the fitted line.)
REGRESSION METRICS: MEAN
ABSOLUTE ERROR (MAE)
• Mean Absolute Error (MAE) is obtained by calculating the absolute difference between the model predictions and the
true (actual) values
• MAE is a measure of the average magnitude of error generated by the regression model
• The mean absolute error (MAE) is calculated as follows:

$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|$$

• MAE is calculated by following these steps:


1. Calculate the residual of every data point
2. Calculate the absolute value (to get rid of the sign)
3. Calculate the average of all absolute residuals
• If MAE is zero, this indicates that the model predictions are perfect.
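• A small sketch of these three steps in NumPy, checked against scikit-learn's helper (y_test and y_pred come from the regression sketch earlier):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

residuals = y_pred - y_test          # step 1: residual of every data point
abs_residuals = np.abs(residuals)    # step 2: absolute value removes the sign
mae = abs_residuals.mean()           # step 3: average of the absolute residuals

print(mae, mean_absolute_error(y_test, y_pred))  # both agree
```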
REGRESSION METRICS: MEAN
SQUARE ERROR (MSE)
• Mean Square Error (MSE) is very similar to the Mean Absolute Error (MAE), but instead of using absolute values, the squares of the differences between the model predictions and the true values (training dataset) are calculated.
• MSE values are generally larger than MAE values, since the residuals are squared.
• In the case of data outliers, MSE becomes much larger compared to MAE.
• In MSE, the error increases in a quadratic fashion, while in MAE the error increases in a proportional fashion.
• In MSE, since the error is squared, any prediction error is heavily penalized.
• The MSE is calculated as follows:

$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

• MSE is calculated by following these steps:


1. Calculate the residual for every data point
2. Calculate the squared value of the residuals
3. Calculate the average of results from step #2
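• The same steps for MSE, as a sketch (same assumed y_test/y_pred as above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

residuals = y_pred - y_test   # step 1: residual for every data point
squared = residuals ** 2      # step 2: squaring penalizes large errors heavily
mse = squared.mean()          # step 3: average of the squared residuals

print(mse, mean_squared_error(y_test, y_pred))
```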
REGRESSION METRICS: ROOT MEAN
SQUARE ERROR (RMSE)
• Root Mean Square Error (RMSE) represents the standard deviation of the residuals (i.e., the differences between the model predictions and the true values (training data)).
• RMSE is more easily interpreted than MSE because the RMSE units match the units of the output.
• RMSE provides an estimate of how widely the residuals are dispersed.
• The RMSE is calculated as follows:

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}$$

• RMSE is calculated by following these steps:


1. Calculate the residual for every data point
2. Calculate the squared value of the residuals
3. Calculate the average of the squared residuals
4. Obtain the square root of the result
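• A one-line sketch of the four RMSE steps; note the result is in the units of the target (dollars here):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(np.mean((y_pred - y_test) ** 2))   # steps 1-4 combined
print(rmse, np.sqrt(mean_squared_error(y_test, y_pred)))
```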
REGRESSION METRICS: MEAN ABSOLUTE
PERCENTAGE ERROR (MAPE)
• MAE values can range from 0 to infinity, which makes the result difficult to interpret relative to the training data.
• Mean Absolute Percentage Error (MAPE) is the equivalent of MAE, but provides the error in percentage form and therefore overcomes that limitation.
• MAPE might exhibit some limitations if a data point value is zero (since a division operation is involved).
• The MAPE is calculated as follows:

$$\text{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{\hat{y}_i - y_i}{y_i} \right|$$
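• A sketch of MAPE in NumPy (it fails if any true value is zero, matching the limitation above):

```python
import numpy as np

# Percentage error relative to the true values
mape = np.mean(np.abs((y_pred - y_test) / y_test)) * 100
print(f"MAPE: {mape:.2f}%")
```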
REGRESSION METRICS: MEAN
PERCENTAGE ERROR (MPE)
• MPE is similar to MAPE, but without the absolute-value operation.
• MPE is useful to provide insight into how many positive errors there are as compared to negative ones.
• The MPE is calculated as follows (using the same residual convention as above):

$$\text{MPE} = \frac{100\%}{N} \sum_{i=1}^{N} \frac{\hat{y}_i - y_i}{y_i}$$
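• The same sketch without the absolute value; with this sign convention, a positive MPE means the model over-predicts on average:

```python
import numpy as np

# Positive and negative errors can cancel, unlike in MAPE
mpe = np.mean((y_pred - y_test) / y_test) * 100
print(f"MPE: {mpe:.2f}%")
```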
REGRESSION METRICS
AND KPIs – PART #2
REGRESSION METRICS: R SQUARE (R²) -
COEFFICIENT OF DETERMINATION
• R-square or the coefficient of determination represents the proportion of variance (of y) that has been explained
by the independent variables in the model.
• If R² = 0.8, this means that 80% of the increase in insurance cost is due to the increase in the age of the applicant.
(Figure: scatter plot of insurance cost vs. age.)
REGRESSION METRICS: R SQUARE (R²) -
COEFFICIENT OF DETERMINATION
• R-square represents the proportion of variance of the dependent variable (y) that has been explained by the independent variables.
• R-square provides an insight into the goodness of fit.
• It gives a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance.
• The maximum value is 1.
• A constant model that always predicts the expected value of y, disregarding the input features, will have an R² score of 0.0.
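• For reference, a common formulation of R² and a sketch of computing it with scikit-learn:

$$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$$

```python
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)   # 1.0 is perfect; 0.0 matches a constant mean model
print(f"R²: {r2:.3f}")
```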
REGRESSION METRICS:
ADJUSTED R SQUARE (ADJUSTED R²)
• If R² = 0.8, this means that 80% of the increase in medical insurance cost is due to the increase in the applicant's age.
• Let’s add another ‘useless’ independent variable, say “color of car”, to the Z-axis. (Note that we are trying to predict the medical insurance cost, not the car insurance cost!)
• Now R² increases anyway, even though the new variable is useless.

(Figure: 3D scatter plot of insurance cost vs. age vs. color of applicant’s car.)
REGRESSION METRICS: ADJUSTED R
SQUARE (ADJUSTED R²)
• One limitation of R² is that it increases as independent variables are added to the model, which is misleading since some added variables might be useless, with minimal significance.
• Adjusted R² overcomes this issue by adding a penalty if we make an attempt to add an independent variable that does not improve the model.
• Adjusted R² is a modified version of R² that takes into account the number of predictors in the model.
• If useless predictors are added to the model, adjusted R² will decrease.
• If useful predictors are added to the model, adjusted R² will increase.

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

• k is the number of independent variables and n is the number of samples.
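• A small sketch applying the formula above (n and k are read off the test design matrix from the earlier regression sketch):

```python
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
n, k = X_test.shape                                 # n samples, k predictors
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(f"Adjusted R²: {adjusted_r2:.3f}")
```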
ARTIFICIAL NEURAL
NETWORKS FOR
REGRESSION
NEURON MATHEMATICAL MODEL

• The neuron collects signals from input channels called dendrites, processes the information in its nucleus, and then generates an output along a long, thin branch called the axon.
(Diagram: biological neuron; the dendrites carry inputs X1, X2, X3 with weights W1, W2, W3 to the nucleus, and the axon carries the output.)
DO YOU REMEMBER OUR FIRST
NEURON MODEL?
• The bias allows the activation function curve to be shifted up or down.
• Number of adjustable parameters = 4 (3 weights and 1 bias).
• Activation function “F”.
(Diagram: inputs/independent variables X1, X2, X3 with weights W1, W2, W3 and bias b feed the activation function F.)

$$y = f(X_1 W_1 + X_2 W_2 + X_3 W_3 + b)$$
SINGLE NEURON MODEL IN ACTION!

• Let’s assume a unit step activation function.

• The activation function is used to map the input to the range (0, 1).

Given bias b = 0, weights W1 = 0.7, W2 = 0.1, W3 = 0.3, and inputs X1 = 1, X2 = 3, X3 = 4:

$$y = f(1 \times 0.7 + 3 \times 0.1 + 4 \times 0.3 + 0) = f(2.2) = 1$$

(the unit step outputs 1 since 2.2 > 0).
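• A tiny NumPy sketch of this forward pass, using the values above:

```python
import numpy as np

def unit_step(z):
    # Unit step activation: 1 for positive input, 0 otherwise
    return np.where(z > 0, 1, 0)

x = np.array([1.0, 3.0, 4.0])   # inputs X1, X2, X3
w = np.array([0.7, 0.1, 0.3])   # weights W1, W2, W3
b = 0.0                         # bias

z = np.dot(w, x) + b            # weighted sum = 2.2
print(unit_step(z))             # -> 1
```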
ACTIVATION
FUNCTIONS
ACTIVATION FUNCTIONS

• SIGMOID:
o Takes a number and squashes it to a value between 0 and 1.
o Converts large negative numbers to 0 and large positive numbers to 1.
o Generally used in the output layer.

• Photo credit: https://commons.wikimedia.org/wiki/File:Sigmoid-function.svg


• Photo Credit: https://fr.m.wikipedia.org/wiki/Fichier:MultiLayerNeuralNetworkBigger_english.png
• Photo Credit: https://commons.wikimedia.org/wiki/File:Logistic-curve.svg
ACTIVATION FUNCTIONS

• RELU (RECTIFIED LINEAR UNITS):


o If the input x < 0, the output is 0; if x > 0, the output is x.
o ReLU does not saturate (for positive inputs), so it avoids the vanishing gradient problem.
o It uses simple thresholding, so it is computationally efficient.
o Generally used in hidden layers.

• Photo credit: https://commons.wikimedia.org/wiki/File:ReLU_and_Nonnegative_Soft_Thresholding_Functions.svg


• Photo Credit: https://fr.m.wikipedia.org/wiki/Fichier:MultiLayerNeuralNetworkBigger_english.png
ACTIVATION FUNCTIONS

• HYPERBOLIC TANGENT ACTIVATION FUNCTION:


o “Tanh” is similar to sigmoid, but converts numbers to between -1 and 1.
o Unlike sigmoid, tanh outputs are zero-centered (range: -1 to 1).
o Tanh suffers from the vanishing gradient problem, so it kills gradients when saturated.
o In practice, tanh is preferable over sigmoid.

• Photo credit: https://commons.wikimedia.org/wiki/File:Hyperbolic_Tangent.svg


• Photo Credit: https://fr.m.wikipedia.org/wiki/Fichier:MultiLayerNeuralNetworkBigger_english.png
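• A compact NumPy sketch of the three activation functions just described:

```python
import numpy as np

def sigmoid(z):
    # Squashes to (0, 1); saturates for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # 0 for z < 0, identity for z > 0; simple thresholding
    return np.maximum(0.0, z)

def tanh(z):
    # Zero-centered squashing to (-1, 1)
    return np.tanh(z)

z = np.linspace(-5, 5, 5)
print(sigmoid(z), relu(z), tanh(z), sep="\n")
```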
MULTI-NEURON
MODEL (MULTI-LAYER
PERCEPTRON MODEL)
MULTI-LAYER PERCEPTRON NETWORK

• The network is represented by a matrix of weights, inputs and outputs.


• Total number of adjustable parameters = 8: weights = 6, biases = 2.

Matrix representation:

$$P = \begin{bmatrix} P_1 \\ P_2 \\ P_3 \end{bmatrix}, \qquad W = \begin{bmatrix} W_{11} & W_{12} & W_{13} \\ W_{21} & W_{22} & W_{23} \end{bmatrix}, \qquad b = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}$$

$$a = f(W P + b)$$

$$a_1 = f(P_1 W_{11} + P_2 W_{12} + P_3 W_{13} + b_1)$$
$$a_2 = f(P_1 W_{21} + P_2 W_{22} + P_3 W_{23} + b_2)$$

(Diagram: two neurons n1 and n2, each receiving inputs P1, P2, P3 through its weights and producing outputs a1 and a2.)
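• A sketch of this two-neuron layer as a single matrix operation (shapes follow the slide: W is 2×3, P is 3×1, b is 2×1; the input values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

P = np.array([[0.5], [1.0], [2.0]])   # 3 inputs
W = np.random.randn(2, 3)             # 6 weights: 2 neurons x 3 inputs
b = np.random.randn(2, 1)             # 2 biases, one per neuron

a = sigmoid(W @ P + b)                # a = f(W P + b), shape (2, 1)
print(a)                              # outputs a1 and a2
```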
MULTI-LAYER PERCEPTRON NETWORK

• Let’s connect multiple of these neurons in a multi-layer fashion.

• The more hidden layers, the “deeper” the network gets.
$$P = \begin{bmatrix} P_1 \\ P_2 \\ \vdots \\ P_{N_1} \end{bmatrix}, \qquad W = \begin{bmatrix} W_{11} & W_{12} & \cdots & W_{1,N_1} \\ W_{21} & W_{22} & \cdots & W_{2,N_1} \\ \vdots & & \ddots & \vdots \\ W_{m-1,1} & W_{m-1,2} & \cdots & W_{m-1,N_1} \\ W_{m,1} & W_{m,2} & \cdots & W_{m,N_1} \end{bmatrix}$$

Non-linear sigmoid activation function:

$$\varphi(w) = \frac{1}{1 + e^{-w}}$$

m: number of neurons in the hidden layer; $N_1$: number of inputs.

(Diagram: node (n+1, i) representation.)
HOW DO ANNS
TRAIN?
ANN TRAINING PROCESS

(Diagram: training inputs X are fed through the network to produce a predicted output; the error between the predicted and desired (true) output Y is used to update the network weights, while a performance measure (mean square error) is tracked over epochs or time.)

DIVIDE DATA INTO TRAINING AND TESTING

• The dataset is generally divided into 80% for training and 20% for testing.
• Sometimes we might include a cross-validation dataset as well, and then divide the data into 60%, 20%, and 20% segments for training, validation, and testing, respectively (numbers may vary).
1. Training set: used for gradient calculation and weight updates.
2. Validation set:
o Used for cross-validation, which is performed to assess training quality as training proceeds.
o Cross-validation is implemented to overcome over-fitting (over-training). Over-fitting occurs when the algorithm focuses on training-set details at the cost of losing generalization ability.
o The trained network's MSE might be small during training, but during testing the network may exhibit poor generalization performance.
3. Testing set: used for testing the trained network.

(Diagram: an 80% training / 20% testing split, and a 60% training / 20% validation / 20% testing split.)
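• A sketch of both splits with scikit-learn (the 60/20/20 version chains two splits; X and y as defined earlier):

```python
from sklearn.model_selection import train_test_split

# Simple 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 60/20/20: hold out 40%, then halve the holdout into validation and test
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42)
```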
GRADIENT DESCENT
GRADIENT DESCENT

• Gradient descent is an optimization algorithm used to obtain the optimized network weight and bias values.
• It works by iteratively trying to minimize the cost function.
• It works by calculating the gradient of the cost function and moving in the negative direction until the local/global minimum is reached.
• If the positive direction of the gradient is taken instead, a local/global maximum is reached.

Photo Credit: https://commons.wikimedia.org/wiki/File:Gradient_descent_method.png


Photo Credit: https://commons.wikimedia.org/wiki/File:Gradient_descent.png
LEARNING RATE

• The size of the steps taken is called the learning rate.
• If the learning rate increases, the area covered in the search space increases, so we might reach the global minimum faster.
• However, we can overshoot the target.
• For small learning rates, training will take much longer to reach the optimized weight values.
GRADIENT DESCENT

• Let’s assume that we want to obtain the optimal values for the parameters ‘m’ and ‘b’.

(Diagram: training inputs X — “these are my training data (inputs and output)” — feed the machine learning model $y = b + m x$; the predicted output is compared against the actual (true) output Y, and the resulting error is used to update the weights (parameters). The goal is to find the best parameters.)

• We need to first formulate a cost function, as follows:

$$\text{Cost Function: } f(m, b) = \frac{1}{N} \sum_{i=1}^{N} (\text{error})^2 = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$
GRADIENT DESCENT

$$\text{Loss Function: } f(m, b) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

(Figure: loss (sum of squared residuals) plotted against the parameters (m, b); the optimal point is the global minimum.)

GRADIENT DESCENT WORKS AS FOLLOWS:

1. Calculate the gradient (derivative) of the loss function.
2. Pick random values for the weights (m, b) and substitute them in.
3. Calculate the step size (how much are we going to update the parameters?).
4. Update the parameters and repeat.

*Note: in reality, this graph is 3D and has three axes: one for m, one for b, and one for the sum of squared residuals.
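• A minimal NumPy sketch of these four steps for the model y = b + m·x (synthetic data; the learning rate plays the step-size role discussed earlier):

```python
import numpy as np

# Synthetic training data roughly following y = 2x + 1
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 0.5, 100)

m, b = rng.random(), rng.random()     # step 2: random initial parameters
learning_rate = 0.01

for _ in range(2000):
    error = (b + m * x) - y           # residuals y_hat - y
    grad_m = 2 * np.mean(error * x)   # step 1: dLoss/dm
    grad_b = 2 * np.mean(error)       # step 1: dLoss/db
    m -= learning_rate * grad_m       # steps 3-4: update and repeat
    b -= learning_rate * grad_b

print(m, b)  # should approach 2 and 1
```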
BACKPROPAGATION
BACK PROPAGATION

• Backpropagation is a method used to train ANNs by calculating the gradients needed to update the network weights.
• It is commonly used by the gradient descent optimization algorithm to adjust the weights of neurons by calculating the gradient of the loss function.
(Diagram: Step 1: forward propagation → Step 2: error calculation → Step 3: back propagation → Step 4: weight update.)

BACK PROPAGATION

• Backpropagation Phase 1: propagation

o Forward propagation through the network to generate the output value(s).
o Calculation of the cost (error term).
o Propagation of the output activations back through the network, using the training pattern's target, in order to generate the deltas (the differences between the targeted and actual output values).

BACK PROPAGATION

• Phase 2: weight update

o Calculate each weight's gradient.
o A ratio (percentage) of the weight's gradient is subtracted from the weight.
o This ratio influences the speed and quality of learning and is called the learning rate. The greater the ratio, the faster the neurons train, but the lower the ratio, the more accurate the training is.
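• To make the four steps concrete, a minimal sketch of training a single sigmoid neuron with backpropagation (one input, squared-error loss; all values are illustrative, not the slides' exact network):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, target = 0.5, 0.8    # illustrative training pattern
w, b = 0.1, 0.0         # initial weight and bias
learning_rate = 0.5

for _ in range(500):
    z = w * x + b                               # step 1: forward propagation
    y_hat = sigmoid(z)
    error = y_hat - target                      # step 2: error calculation
    dloss_dz = 2 * error * y_hat * (1 - y_hat)  # step 3: back propagation
    w -= learning_rate * dloss_dz * x           # step 4: weight update
    b -= learning_rate * dloss_dz               #         (ratio = learning rate)

print(sigmoid(w * x + b))  # approaches the 0.8 target
```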

BACK PROPAGATION ADDITIONAL
READING MATERIAL
• “Backpropagation neural networks: A tutorial” by Barry J. Wythoff
• “Improved backpropagation learning in neural networks with windowed momentum”, International Journal of Neural Systems, vol. 12, no. 3-4, pp. 303-318.
