05 Logistic Regression
Classification
In layman's terms, classification is the process of taking a collection of data and organizing it into groups or categories based on some common characteristics.

For example, if you have a dataset of images, a classification algorithm might be used to automatically organize the images into groups based on the objects or scenes they depict.

This can be done using a training dataset that includes labeled examples of the different classes or categories.

The algorithm uses this training data to learn the characteristics that distinguish each class, and then applies what it has learned to categorize new data.
In more detailed terms, classification is a type of supervised learning algorithm, which means that it is trained on a dataset that includes both the input data and the corresponding labels.

The training dataset is used to teach the algorithm about the relationship between the input data and the labels.
This can be done using a variety of techniques, such as decision trees, support vector machines, and neural networks. Once the algorithm has been trained, it can be used to make predictions on new, unseen data. For example, if you have a dataset of images, the classification algorithm might take an image as input and predict which class or category it belongs to, such as cat or dog.
The algorithm's ability to make accurate predictions can be evaluated using a separate test dataset, which
includes input data and known labels. This allows you to measure the performance of the algorithm and
make any necessary adjustments to improve its accuracy.
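The train-then-evaluate workflow described above can be sketched in a few lines. The nearest-centroid classifier and all data points below are made up purely for illustration:

```python
# A minimal sketch of the train/evaluate workflow described above,
# using a toy nearest-centroid classifier (all data here is made up).

def train(points, labels):
    """Learn one centroid (mean point) per class from labeled training data."""
    centroids = {}
    for label in set(labels):
        members = [p for p, y in zip(points, labels) if y == label]
        centroids[label] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids

def predict(centroids, point):
    """Assign the class whose centroid is closest to the point."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(centroids[label], point))

# Labeled training data: two well-separated classes.
X_train = [(1.0, 1.2), (0.8, 1.0), (5.0, 5.1), (5.2, 4.9)]
y_train = ["cat", "cat", "dog", "dog"]

# A separate test set with known labels, used only for evaluation.
X_test = [(1.1, 0.9), (4.8, 5.0)]
y_test = ["cat", "dog"]

model = train(X_train, y_train)
accuracy = sum(predict(model, x) == y for x, y in zip(X_test, y_test)) / len(y_test)
print(accuracy)  # → 1.0
```

The key point is the separation of data: the model never sees `X_test` during training, so the accuracy measures performance on genuinely unseen inputs.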
-neeeeeeeeeee
Some basic examples of classification problems in machine learning include:

1. Medical diagnosis: predicting a patient's medical condition based on their symptoms and test results.
2. Image classification: classifying an image as containing a particular object or not, such as a dog or a cat.

These are just a few examples of classification problems that can be tackled using machine learning. Other examples include language translation, speech recognition, and many more.
Types of Classification

There are several types of classification data, including:
1. Binary classification: the goal is to predict one of two possible classes, such as spam or not spam.
2. Multiclass classification: the goal is to predict one of more than two classes, such as the objects present in an image.
3. Multilabel classification: the goal is to predict multiple classes for each input, such as all of the objects present in an image.
4. Imbalanced classification: the goal is to predict a class that is relatively rare in the training data, such as fraudulent transactions in a dataset where most transactions are legitimate.

There are many different types of classification algorithms (e.g., decision trees, support vector machines, and neural networks), and the appropriate algorithm for a particular problem will depend on the specific characteristics of the data and the goals of the model.
In addition to the types of classification data mentioned above, there are a few other types that are worth mentioning. These include:

1. Ordinal classification: the classes have a natural order, such as low, medium, and high.
2. Hierarchical classification: the classes are organized in a tree-like structure.
Again, the appropriate type of classification data for a particular problem will depend on the specific
characteristics of the data and the goals of the machine learning model.
Problem Statement
The problem of predicting whether an individual will default on their credit card payment, on the basis of their annual income and monthly credit card balance, is a supervised learning problem in the field of machine learning.

In this problem, the goal is to build a model that can take in an individual's annual income and monthly credit card balance as input, and predict whether or not they will default on their credit card payment.

The model will be trained on a labeled dataset that contains examples of individuals and whether or not they defaulted on their credit card payment. This labeled training data will be used to learn the relationship between an individual's annual income and monthly credit card balance, and their likelihood of defaulting on their payment.

Once the model is trained, it can be used to make predictions on new, unseen data. This could be used, for example, by a credit card company to assess the risk of a potential customer defaulting on their payment, and to make decisions about whether to approve or deny their credit card application.

Overall, this problem involves using machine learning to build a model that can predict an individual's likelihood of defaulting on their credit card payment, based on their annual income and monthly credit card balance.

It appears that individuals who defaulted tended to have higher credit card balances than those who did not. We learn how to build a model to predict default (Y) for any given value of balance (X1) and income (X2).
Why not Linear Regression?
You have a problem to solve: predicting whether a person will default or not, given their income.

[Figure: default probability (0 = no default, 1 = default) plotted against income (INR); a line of best fit drawn using linear regression predicts P = 0.8 for a given person's income.]

The problem: linear regression is highly affected by the inclusion of outliers.
[Figure: adding an outlier at a very high income (≈ 900K INR) tilts the fitted line, so some points near the 0.5 threshold are wrongly classified; the line also produces negative probabilities.]
We want to model Pr(default = Yes | Balance), whose output values must fall in [0, 1] (the classification threshold, e.g. 0.5, is changeable).

A linear model p(X) = β0 + β1X is not sensible for probabilities: p(X) < 0 for some values of X (and p(X) > 1 for others), whereas we need a model that gives outputs between 0 and 1 for all values of X.
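A quick numerical sketch of this failure mode, assuming made-up balances and 0/1 default labels: fitting an ordinary least-squares line to binary labels produces "probabilities" below zero at low balances.

```python
# Sketch of why a straight line is a poor model for probabilities:
# fit least squares to 0/1 labels and inspect the fitted values.
# The balances and labels below are made-up illustration data.

balances = [100.0, 300.0, 500.0, 1500.0, 1800.0, 2000.0]
defaults = [0, 0, 0, 1, 1, 1]  # 1 = default

n = len(balances)
mean_x = sum(balances) / n
mean_y = sum(defaults) / n
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(balances, defaults)) / \
     sum((x - mean_x) ** 2 for x in balances)
b0 = mean_y - b1 * mean_x

p_hat = b0 + b1 * 50.0   # "probability" of default at a very low balance
print(p_hat)  # negative: the linear fit is not a valid probability
```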
The logistic function solves this:

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))
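The formula above can be evaluated directly; the coefficients β0 = -10 and β1 = 0.01 below are arbitrary illustration values, not estimates from any real data.

```python
import math

# Evaluating the logistic function p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X)).
# The coefficients are arbitrary illustration values.

def logistic(x, b0=-10.0, b1=0.01):
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

print(logistic(0))      # low balance  → probability near 0, never below it
print(logistic(2000))   # high balance → probability near 1, never above it
```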
For low balances we now predict the probability of default as close to, but never below, zero. Likewise, for high balances the predicted probabilities are never above one.

Odds

E.g.: A survey of 250 customers of an automobile dealership asked the customers if they would recommend the service department to a friend.

n = 250; x = 210; p̂ = 210/250 = 0.84

Odds are simply the ratio of the probabilities of the two possible outcomes:

Odds = p̂ / (1 - p̂) = 0.84 / 0.16 = 5.25, i.e. about 5:1

So a customer is about five times more likely to recommend the service department than not.
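The odds calculation for the dealership survey is a one-liner:

```python
# Odds for the dealership survey: 210 of 250 customers said yes.
p = 210 / 250            # estimated probability of a recommendation
odds = p / (1 - p)       # ratio of the two outcome probabilities
print(p, round(odds, 2))  # → 0.84 5.25
```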
Logistic Regression
The logistic function is defined as

p(X) = 1 / (1 + e^-(β0 + β1X))

where β0 and β1 are regression coefficients estimated from the data, p(X) is the predicted probability that Y = 1, and e is Euler's number.
Derivation: the model is derived from the Bernoulli distribution, a probability distribution that models the outcome of a single binary trial, with P(Y = 1 | X) = p(X).

Odds of observing success: the ratio p(X) / (1 - p(X)).

logit(p(X)) = log( p(X) / (1 - p(X)) ) = β0 + β1X

Take the inverse of the logit transformation, which maps log-odds back to a probability:

p(X) = 1 / (1 + exp(-[β0 + β1X])) = e^(β0 + β1X) / (1 + e^(β0 + β1X))
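The logit/inverse relationship above can be sanity-checked numerically; the probe values below are arbitrary.

```python
import math

# Checking numerically that the logit and the logistic function are inverses:
# logit(p) = log(p / (1 - p)) maps a probability to log-odds, and the
# logistic function maps log-odds back to a probability.

def logit(p):
    return math.log(p / (1 - p))

def logistic(z):
    return 1 / (1 + math.exp(-z))

for p in (0.1, 0.5, 0.9):
    assert abs(logistic(logit(p)) - p) < 1e-12

# The two algebraic forms of the logistic function agree as well:
z = 1.7
assert abs(math.exp(z) / (1 + math.exp(z)) - logistic(z)) < 1e-12
print("ok")
```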
Which form do we generally use?
The choice between the two forms of the logistic function is largely a matter of convention and personal preference. Some people find the first form, which involves the exponential function, to be more intuitive and easier to work with, while others prefer the second form, which involves the negative exponential function.

In practice, most software packages that implement logistic regression use the first form of the logistic function, p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X)), because it has some computational advantages. Specifically, it avoids the need to compute negative exponentials, which can be computationally expensive and may cause numerical problems.
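A sketch of the numerical issue behind this: for large |z|, one of the two exponentials overflows, so robust implementations choose, per input, the algebraic form that only exponentiates a non-positive number. (This piecewise trick is a common practice, not something prescribed by the notes above.)

```python
import math

# For large |z| one exponential overflows, so pick the form that only
# ever exponentiates a non-positive number.

def sigmoid_stable(z):
    if z >= 0:
        return 1 / (1 + math.exp(-z))   # e^(-z) is safe for z >= 0
    ez = math.exp(z)                    # e^z is safe for z < 0
    return ez / (1 + ez)

print(sigmoid_stable(1000))    # → 1.0, no overflow
print(sigmoid_stable(-1000))   # → 0.0, no overflow

# The naive form e^z / (1 + e^z) overflows for large positive z:
try:
    math.exp(1000) / (1 + math.exp(1000))
except OverflowError:
    print("naive form overflowed")
```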
Example:
Suppose we have a dataset of 100 customers who have applied for a loan, and we want to model the probability of a customer defaulting on the loan (i.e., Y = 1) as a function of their credit score (i.e., X). The credit scores range from 500 to 800, and we have the following logistic regression model:

logit(p(X)) = β0 + β1X, with β0 = -10.0 and β1 = 0.01

Given these coefficients, we can calculate the predicted probability of default for a customer with a credit score of 700. First calculate the logit of the predicted probability, then apply the logistic function to the logit:

logit(p(X)) = β0 + β1X = -10.0 + 0.01 × 700 = -3.0
p(X) = 1 / [1 + exp(-logit(p(X)))] = 1 / [1 + exp(3.0)] ≈ 0.047

Therefore, the predicted probability of default for a customer with a credit score of 700 is about 0.047, or 4.7%. This means that we estimate there is a 4.7% chance that the customer will default on the loan, given their credit score of 700.
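The arithmetic can be checked directly. The coefficients β0 = -10 and β1 = 0.01 are assumed illustration values, chosen so that the logit at a score of 700 is -3:

```python
import math

# Worked check: with assumed coefficients b0 = -10 and b1 = 0.01,
# the logit for a credit score of 700 is -10 + 0.01 * 700 = -3.

b0, b1 = -10.0, 0.01
score = 700
logit = b0 + b1 * score
p = 1 / (1 + math.exp(-logit))
print(logit, round(p, 3))  # logit = -3.0, p ≈ 0.047
```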
Estimating the coefficients: To use the logistic hypothesis function, we need to estimate the coefficients β0 and β1 in the logistic function.

Maximum likelihood estimation: The most common method for estimating the coefficients is maximum likelihood estimation, which finds the values of β0 and β1 that maximise the likelihood of observing the data given the logistic hypothesis function.

Example: Let's say we have a dataset of 100 customers who have applied for a loan, and we want to model the probability of a customer defaulting on the loan as a function of their credit score. The logistic hypothesis function is:

logit(p(X)) = β0 + β1X

where p(X) is the predicted probability of default given the credit score X, β0 is the intercept, and β1 is the slope.

Coefficients: The coefficients β0 and β1 represent the intercept and the slope of the logistic regression model, respectively. They describe the relationship between the log-odds of the probability of default and the credit score.

Interpretation: For example, if β1 is positive, it means that as the credit score increases, the log-odds of defaulting on the loan increase. If β0 is negative, it means that the baseline log-odds of defaulting on the loan (when X = 0) are negative.

Conclusion: By estimating the coefficients, we can use the logistic hypothesis function to make predictions of the probability of default given the credit score, and gain valuable insights into the relationship between the binary response and the credit score.
Estimating the Regression Coefficients
"We used the least squares approach to estimate the coefficients of linear regression."
"We could use least squares to fit the logistic model, but generally we use the more general method of maximum likelihood."

We will learn about the difference between least squares and maximum likelihood estimation.

Intuition behind this: we seek estimates of β0 and β1 such that the predicted probability p(xi) is a number close to one for all individuals who defaulted, and a number close to zero for individuals who did not.
Maximum Likelihood Method
The maximum likelihood method is a statistical technique for estimating the parameters of a statistical model based on the observed data. The goal of the maximum likelihood method is to find the values of the parameters that maximize the probability of the observed data given the model.

The specific form of the likelihood function depends on the form of the model and the distribution of the data. For example, in logistic regression, the model is based on the logistic function and the data are assumed to be independently and identically distributed according to a Bernoulli distribution. In this case, the likelihood function has the form:

L(β) = ∏i P(Yi | Xi) = ∏i p(xi)^yi (1 - p(xi))^(1 - yi)

Suppose we are interested in predicting the probability of a customer defaulting. We can use logistic regression to model the relationship between the account balance, credit score, and the probability of default. In this case, the logistic regression model would have the form:

p(X) = 1 / (1 + exp(-[β0 + β1·Balance + β2·CreditScore]))
Probability Distribution
Bernoulli distribution:
Suppose you have an experiment that results in one of two possible outcomes: success or failure (a binary response). Then

P(Y = 1) = p, P(Y = 0) = 1 - p

or, compactly,

P(Y = y) = p^y (1 - p)^(1 - y) for y = 0, 1
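The compact form is just a selector: it evaluates to p when y = 1 and to 1 - p when y = 0.

```python
# The Bernoulli pmf P(Y = y) = p^y * (1 - p)^(1 - y) selects p when
# y = 1 and 1 - p when y = 0.

def bernoulli_pmf(y, p):
    return p ** y * (1 - p) ** (1 - y)

p = 0.3
print(bernoulli_pmf(1, p))  # → 0.3
print(bernoulli_pmf(0, p))  # → 0.7
```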
Maximum likelihood estimation

Why does this help estimate the coefficients? The labels are binary, and the prediction will be either 0 or 1, which means:

Y ~ Bern(p), where p = σ(z) = 1 / (1 + e^-z) (the sigmoid function) and z = β0 + β1x
Bernoulli likelihood

P(Y = y | X = x) = p(x)^y (1 - p(x))^(1 - y)
                 = [1 / (1 + e^-(β0 + β1x))]^y · [1 - 1 / (1 + e^-(β0 + β1x))]^(1 - y)

So for any data point i:

P(yi | xi) = p(xi)^yi (1 - p(xi))^(1 - yi)
Example
Suppose we have a dataset of 4 customers who have applied for a loan, and we want to model the probability of a customer defaulting on the loan (i.e., Y = 1) as a function of their credit score (i.e., X). The credit scores and binary response variables are given in the following table:

Customer  X    Y
1         600  1
2         550  0
3         700  1
4         650  0

Let's assume that we have estimated coefficients β0 and β1 from the data using MLE, giving the following predicted probabilities p(X) = 1 / (1 + exp(-(β0 + β1X))):
For customer 1, with a credit score of 600, the predicted probability of default is p(600) = 0.9525.
For customer 2, with a credit score of 550, the predicted probability of default is p(550) = 0.0478.
For customer 3, with a credit score of 700, the predicted probability of default is p(700) = 0.9975.
For customer 4, with a credit score of 650, the predicted probability of default is p(650) = 0.7876.
Given the predicted probabilities of default, we can calculate the likelihood of observing the data for each customer:

For customer 1, with binary response y = 1, the likelihood of observing the data is p(600) = 0.9525.
For customer 2, with binary response y = 0, the likelihood is 1 - p(550) = 0.9522.
For customer 3, with binary response y = 1, the likelihood is p(700) = 0.9975.
For customer 4, with binary response y = 0, the likelihood is 1 - p(650) = 0.2124.

The overall likelihood is the product: 0.9525 × 0.9522 × 0.9975 × 0.2124 ≈ 0.1921.
The likelihood value of 0.1921 in this example represents the probability of observing the binary response variables for all 4 customers, given the credit scores X and the estimated coefficients β0 and β1. A higher likelihood value indicates a better fit of the logistic regression model to the data set, as it means that the estimated coefficients are more likely to have generated the observed data.

A likelihood value of 0.1921 is not considered a good fit of the logistic regression model to the data, as it indicates a relatively low probability of observing the binary response variables, given the credit scores X and estimated coefficients β0 and β1. A better fit of the model to the data would typically result in a higher likelihood value.
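The product can be recomputed from the four predicted probabilities and observed labels in the example:

```python
# Recomputing the likelihood for the four customers from their
# predicted default probabilities and observed labels.

p_hat = [0.9525, 0.0478, 0.9975, 0.7876]   # predicted P(default)
y =     [1,      0,      1,      0]        # observed outcomes

likelihood = 1.0
for p, yi in zip(p_hat, y):
    likelihood *= p ** yi * (1 - p) ** (1 - yi)

print(likelihood)  # ≈ 0.192, matching the value in the example
```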
Log Likelihood
ℓ(β0, β1) = Σi [ yi log(p(xi)) + (1 - yi) log(1 - p(xi)) ]

We choose β0 and β1 to maximize this log likelihood, which fits the model.
Gradient Ascent
- - - - - - - - - - - - -
Gradient ascent is a method for finding the maximum of a function. It is used in logistic regression to find the best fit to the data.
Imagine you are hiking up a mountain. The goal is to reach the highest point of the mountain, the peak. You start at the bottom of the mountain and need to find the best path to the peak.
Gradient ascent works by iteratively updating the values of the parameters in the direction of the steepest ascent of the log likelihood. The gradient of the log likelihood tells you the direction of steepest ascent, and you move in that direction.

Think of the gradient as a compass. The gradient points you in the direction of the steepest ascent, just like a compass points you in the direction of north. You keep following the gradient until you reach the maximum log likelihood.
In logistic regression, gradient ascent is used to find the maximum log likelihood and the best-fit model to the data. The log likelihood is a measure of how well the model fits the data, and the goal is to find the values of the parameters that maximize it. Gradient ascent achieves this by iteratively updating the parameter values in the direction of the steepest ascent of the log likelihood.

So we apply the gradient ascent algorithm to find the beta values that maximize our log likelihood.
For j = 0, 1, 2, ..., n, repeat:

βj(new) = βj(old) + α · ∂ℓ(β)/∂βj

where α is the learning rate and ∂ℓ(β)/∂βj is the partial derivative of the log likelihood. The sign is + because we want to maximize.

With p(xi) = 1 / (1 + e^-z) and z = β0 + β1x, the gradient works out to

∂ℓ(β)/∂βj = Σi (yi - p(xi)) xij

which has the same form as in linear regression.
From a geometrical perspective, the derivative of the log likelihood can be thought of as the slope of the log likelihood function. The optimization algorithm updates the values of β0 and β1 in the direction of the steepest ascent, i.e., the direction in which the log likelihood increases fastest.
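A minimal sketch of gradient ascent on made-up one-feature data, following the update βj ← βj + α Σi (yi - p(xi)) xij (the data, learning rate, and iteration count are all illustrative assumptions):

```python
import math

# Gradient-ascent fit of logistic regression on made-up data,
# following the update b_j <- b_j + alpha * sum_i (y_i - p_i) * x_ij.

def sigmoid(z):
    # Numerically safe form: only exponentiate non-positive numbers.
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1 + ez)

# Toy data: larger x tends to go with y = 1.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0,   0,   0,   1,   1,   1]

b0, b1 = 0.0, 0.0
alpha = 0.1
for _ in range(2000):
    grad0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
    grad1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
    b0 += alpha * grad0   # step in the direction of steepest ascent
    b1 += alpha * grad1

# The fitted model should separate the two groups.
print(sigmoid(b0 + b1 * 1.0) < 0.5, sigmoid(b0 + b1 * 3.5) > 0.5)
```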
With Gradient Descent
There's a way to use gradient DESCENT just like we used in linear regression, but we need a bit of a change in our optimization problem. So we introduce the NLL.

The negative log likelihood (NLL) is used to measure the goodness of fit of the logistic regression model to the data. The goal in logistic regression is then to find the values of beta that minimize the NLL, which is equivalent to maximizing the likelihood function.
Minimizing the negative log likelihood rather than maximizing the log likelihood:

NLL(β) = -Σi [ yi log(ŷi) + (1 - yi) log(1 - ŷi) ]

We want to minimize NLL(β): this follows the convention of loss functions approaching 0 as the model gets better.
Repeat:

βj := βj - α Σi (ŷi - yi) xij
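The same kind of fit can be done by descending on the NLL with the update rule above; this sketch on made-up data just checks that the loss falls as the model improves:

```python
import math

# Gradient descent on the negative log likelihood:
# b_j <- b_j - alpha * sum_i (p_i - y_i) * x_ij should drive NLL down.

def sigmoid(z):
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1 + ez)

def nll(b0, b1, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

b0, b1, alpha = 0.0, 0.0, 0.1
nll_start = nll(b0, b1, xs, ys)
for _ in range(500):
    grad0 = sum(sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys))
    grad1 = sum((sigmoid(b0 + b1 * x) - y) * x for x, y in zip(xs, ys))
    b0 -= alpha * grad0   # minus sign: we minimize the NLL
    b1 -= alpha * grad1
nll_end = nll(b0, b1, xs, ys)

print(nll_start > nll_end)  # → True: loss falls as the model improves
```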
Assumptions of Logistic Regression
Binary response: The response variable should be binary, meaning it can take on only two values, such as 0 or 1.

Independence of observations: The observations should be independent of each other. This means that the outcome of one observation should not affect the outcome of another observation.

Linearity in the log-odds: The log-odds of the response should have a linear relationship with the predictor variables. This means that the log-odds of the response can be estimated as a linear combination of the predictor variables.

Large sample size: The sample size should be large enough to accurately estimate the coefficients in the model.
For example, consider a study that examines the association between a person's smoking status (binary response) and their age (predictor variable). The independence of observations assumption would require that the outcome of one person's smoking status does not affect another person's smoking status. The linearity in the log-odds assumption would require that the log-odds of a person's smoking status can be estimated as a linear function of their age.