05 Logistic Regression

Classification

In layman's terms, classification is the process of taking a collection of data and organizing it into groups or categories based on some common characteristics.


For example, if you have a dataset of images, a classification algorithm might be used to automatically organize the images into groups based on the objects or scenes they depict.

This can be done using a training dataset that includes labeled examples of the different classes or categories. The algorithm uses this training data to learn the characteristics that distinguish each class, and then uses this knowledge to make predictions on new, unseen data.


In more detailed terms, classification is a type of supervised learning algorithm, which means that it is trained on a dataset that includes both the input data and the corresponding labels. The training dataset is used to teach the algorithm about the relationship between the input data and the labels.

This can be done using a variety of techniques, such as decision trees, support vector machines, and neural networks. Once the algorithm has been trained, it can be used to make predictions on new, unseen data. For example, if you have a dataset of images, the classification algorithm might take an image as input and predict which class or category it belongs to, such as cat or dog.

The algorithm's ability to make accurate predictions can be evaluated using a separate test dataset, which includes input data and known labels. This allows you to measure the performance of the algorithm and make any necessary adjustments to improve its accuracy.
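To make this train-and-evaluate workflow concrete, here is a minimal sketch in Python using scikit-learn. The library, dataset, and split settings are illustrative assumptions, not something these notes prescribe.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# A small labeled dataset (a stand-in for the image data discussed above)
X, y = load_iris(return_X_y=True)

# Hold out a separate test set with known labels for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train a classifier on the labeled training data
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Measure performance on new, unseen data
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))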
Some basic examples of classification problems in machine learning include:

1. Spam detection: classify an email as either spam or not spam.
2. Sentiment analysis: classify a piece of text as expressing positive or negative sentiment.
3. Fraud detection: classify transactions as either fraudulent or legitimate.
4. Medical diagnosis: classify a patient's medical condition based on their symptoms and test results.
5. Image classification: classify an image as containing a particular object or not, such as a dog or a cat.

These are just a few examples of classification problems that can be tackled using machine learning. Other examples include language translation, speech recognition, and many more.

There are several types of classification data, including:

1. Binary classification: the goal is to predict one of two possible classes, such as spam or not spam.
2. Multiclass classification: the goal is to predict one of more than two classes, such as the objects present in an image.
3. Multilabel classification: the goal is to predict multiple classes for each input, such as the objects present in an image.
4. Imbalanced classification: the goal is to predict a class that is relatively rare in the training data, such as detecting fraudulent transactions in a dataset where most transactions are legitimate.

There are many different types of classification algorithms, and the appropriate algorithm for a particular problem will depend on the specific characteristics of the data and the goals of the model. Some common types of classification algorithms include decision trees, support vector machines, and neural networks.

In addition to the types of classification data mentioned above, there are a few other types that are worth mentioning. These include:

1. Ordinal classification: the goal is to predict a class that has a natural ordering, such as low, medium, and high.
2. Hierarchical classification: the goal is to predict a class that belongs to a hierarchy or tree-like structure, such as the different levels of an organisational chart.
3. Nominal classification: the goal is to predict a class that has no inherent ordering, such as the different colours of a rainbow.

Again, the appropriate type of classification data for a particular problem will depend on the specific characteristics of the data and the goals of the machine learning model.
Predicting Credit Card Default
The problem of predicting whether an individual will default on their credit card payment, on the basis of their annual income and monthly credit card balance, is a supervised learning problem in the field of machine learning.

In this problem, the goal is to build a model that can take in an individual's annual income and monthly credit card balance as input, and predict whether or not they will default on their credit card payment.

The model will be trained on a labeled dataset that contains examples of individuals and whether or not they defaulted on their credit card payment. This labeled training data will be used to learn the relationship between an individual's annual income and monthly credit card balance, and their likelihood of defaulting on their payment.

Once the model is trained, it can be used to make predictions on new, unseen data. This could be used, for example, by a credit card company to assess the risk of a potential customer defaulting on their payment, and to make decisions about whether to approve or deny their credit card application.

Overall, this problem involves using machine learning to build a model that can predict an individual's likelihood of defaulting on their credit card payment, based on their annual income and monthly credit card balance.

[Figure: scatter plot of balance vs. income; the individuals who defaulted in a given month are shown in orange, and those who did not in blue.]

It appears that individuals who defaulted tended to have higher credit card balances than those who did not. We learn how to build a model to predict default (y) for any given value of balance (x1) and income (x2).

Linear Regression

You have a problem to solve: predicting whether a person will default or not, given their income. Encode the response as a probability: 1 = not default, 0 = default.

[Figure: predicted probability of default vs. income (INR); a line of best fit is drawn using linear regression in such a way that the distance between the line and all the points is minimum.]

E.g. the line predicts P = 0.8 for some income; since 0.8 > 0.5, the person will not default.

If y >= 0.5, then the person does not default.
If y < 0.5, then the person defaults.
Then where's the problem? Linear regression is highly affected by the inclusion of outliers.

[Figure: with an outlier added at a very high income, the fitted line is dragged so that some individuals ("Yes, the person defaulted" / "No, the person hasn't defaulted") fall on the wrong side of the 0.5 threshold and are wrongly classified; the straight line can also produce negative probabilities.]

Modelling the probability that y belongs to a category:

Pr(default = Yes | balance), with output values in [0, 1].

E.g. if P(default | balance) > 0.5, then the prediction would be yes. The threshold is changeable!

So, how should we model the relationship between p(X) = Pr(y = 1 | X) and X?

A linear hypothesis, p(X) = β0 + β1X, has a problem with this approach: for balances close to 0 we can predict a negative probability, and the probability can be greater than 1. A probability must fall between 0 and 1, so these predictions are not sensible, regardless of credit card balance. If we use a straight line:

p(X) < 0 for some values of X
p(X) > 1 for some values of X

To fix this problem, we must model p(X) using a function that gives outputs between 0 and 1 for all values of X.

The Logistic Function

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))

[Figure: logistic curve of predicted probability of default vs. income (INR). For low balances we now predict the probability of default as close to, but never below, zero; likewise, predicted probabilities are never above one.]
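As a quick sketch, the logistic function can be computed directly in Python; the coefficient values below are made up purely to show that the outputs always stay between 0 and 1.

import math

def logistic(x, b0, b1):
    # p(X) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

b0, b1 = -10.0, 0.005  # illustrative coefficients (assumed, not estimated)
for balance in [0, 1000, 2000, 3000, 4000]:
    print(balance, round(logistic(balance, b0, b1), 4))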

Log Odds

E.g. recommending the service: a survey of 250 customers of an automobile dealership asked whether they would recommend the service department to a friend. The number who said yes was 210.

n = 250; x = 210; p̂ = x/n = 0.84 (proportion)

Logistic regression works with odds rather than proportions. Odds are simply the ratio of the probabilities of the two possible outcomes:

odds = p̂ / (1 - p̂) = 0.84 / 0.16 = 5.25, i.e. about 5:1

a roughly five times higher probability of a customer recommending the service department than not recommending it.
Another example: the sample proportion of women who are Instagram users is given as 61.08%, and the proportion of men as 43.98%.

odds = 0.6108 / (1 - 0.6108) = 1.5694 (for women)
odds = 0.4398 / (1 - 0.4398) = 0.7851 (for men)
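The same odds arithmetic as a short sketch in Python:

def odds(p):
    return p / (1.0 - p)  # ratio of the probabilities of the two outcomes

print(round(odds(0.84), 2))    # dealership survey: 0.84/0.16 = 5.25, about 5:1
print(round(odds(0.6108), 4))  # women who are Instagram users: 1.5694
print(round(odds(0.4398), 4))  # men who are Instagram users: 0.7851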

Logistic Regression

The logistic function is defined as

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))

where β0 and β1 are regression coefficients estimated from data, e is Euler's number, and p(X) is the predicted probability that the prediction will be 1.

Derivation:

The logistic function is derived from the Bernoulli distribution, the probability distribution that models the outcome of a single trial that can result in either "success" or "failure".

Let p(X) be the probability of observing a "success" outcome (y = 1) given X. Then the probability of observing a "failure" outcome (y = 0) is 1 - p(X).

The odds of observing success are the ratio

odds(Y = 1 | X) = p(X) / (1 - p(X))

Odds can take any value from 0 to +infinity. They are transformed to log odds (the logit), which takes values from negative infinity to positive infinity:

logit(p(X)) = log( p(X) / (1 - p(X)) ) = β0 + β1X

Taking the inverse of the logit transformation maps the log odds back to the probability. Substituting the logit yields

p(X) = 1 / (1 + exp(-logit(p(X)))) = 1 / (1 + exp(-(β0 + β1X)))

which is the same as

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))

the equation simplified by multiplying both numerator and denominator by exp(β0 + β1X).
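A small numeric check of this derivation: applying the logit and then its inverse returns the original probability (the probability values below are arbitrary).

import math

def logit(p):
    return math.log(p / (1.0 - p))     # log odds

def inv_logit(t):
    return 1.0 / (1.0 + math.exp(-t))  # logistic function

for p in [0.1, 0.5, 0.9]:
    t = logit(p)                       # maps (0, 1) to (-inf, +inf)
    print(p, round(t, 4), round(inv_logit(t), 4))  # the inverse recovers p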

So, what do we generally use?

The two forms of the logistic function, p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X)) and p(X) = 1 / (1 + exp(-(β0 + β1X))), are mathematically equivalent and produce identical results.

Both forms map the linear combination of predictor variables and their corresponding regression coefficients to the range [0, 1], which ensures that the predicted values are valid probabilities.

The choice between the two forms of the logistic function is largely a matter of convention and personal preference. Some people find the first form, which involves the exponential function, more intuitive and easier to work with, while others prefer the second form, which involves the negative exponential.

In practice, most software packages that implement logistic regression use the first form, p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X)), because it has some computational advantages. Specifically, it avoids the need to compute the negative exponential, which can be computationally expensive and may introduce numerical stability issues in some cases.

Example:

Suppose we have a dataset of 100 customers who have applied for a loan, and we want to model the probability of a customer defaulting on the loan (i.e., y = 1) as a function of their credit score (i.e., X). The credit scores range from 500 to 800, and we have the following logistic regression model:

logit(p(X)) = β0 + β1X

where p(X) is the predicted probability of default given the credit score X, β0 is the intercept, and β1 is the slope.

Let's say that we have estimated the following coefficients from the data using maximum likelihood estimation: β0 = -9.95 and β1 = 0.01.

Given these coefficients, we can calculate the predicted probability of default for a customer with a credit score of 700 using the following steps:

1. Calculate the logit of the predicted probability:
   logit(p(X)) = β0 + β1X = -9.95 + 0.01 × 700 = -2.95

2. Apply the logistic function to the logit:
   p(X) = 1 / [1 + exp(-logit(p(X)))] = 1 / [1 + exp(2.95)] ≈ 0.0498

Therefore, the predicted probability of default for a customer with a credit score of 700 is about 0.0498, or 4.98%. This means that we estimate there is a 4.98% chance that the customer will default on the loan, given their credit score of 700.
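The same two steps in Python. Note that the coefficient values here are assumptions chosen to reproduce the roughly 4.98% figure above, so treat the exact numbers as illustrative.

import math

b0, b1 = -9.95, 0.01                  # assumed coefficients
x = 700                               # credit score

logit_p = b0 + b1 * x                 # step 1: -2.95
p = 1.0 / (1.0 + math.exp(-logit_p))  # step 2: ~0.0497, matching the ~4.98% above
print(round(logit_p, 2), round(p, 4))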

The Coefficients in Logistic Regression
Introduction: In logistic regression, we use the logistic hypothesis function to model the relationship between the binary response variable and the predictor variables.

Estimating the coefficients: To use the logistic hypothesis function, we need to estimate the coefficients, which are the values of β0 and β1 in the logistic function.

Maximum likelihood estimation: The most common method for estimating the coefficients is maximum likelihood estimation, which finds the values of β0 and β1 that maximise the likelihood of observing the data given the logistic hypothesis function.

Example: Let's say we have a dataset of 100 customers who have applied for a loan, and we want to model the probability of a customer defaulting on the loan as a function of their credit score. The logistic hypothesis function is:

logit(p(X)) = β0 + β1X

where p(X) is the predicted probability of default given the credit score X, β0 is the intercept, and β1 is the slope.

Coefficients: The coefficients β0 and β1 represent the intercept and the slope of the logistic regression model, respectively. They describe the relationship between the log-odds of the probability of default and the credit score.

Interpretation: For example, if β1 is positive, it means that as the credit score increases, the log-odds of defaulting on the loan increase. If β0 is negative, it means that the baseline log-odds of defaulting on the loan are lower, even for a credit score of zero.

Conclusion: By estimating the coefficients, we can use the logistic hypothesis function to make predictions of the probability of default given the credit score, and gain valuable insights into the relationship between the binary response variable and the predictor variable.

Estimating the Regression Coefficients

· We used the least squares approach to estimate the coefficients of linear regression.
· We could use least squares to fit the logistic model, but generally we use the more general method of maximum likelihood. (We will learn about the difference between least squares and maximum likelihood estimation.)

Intuition behind this: we seek estimates of β0 and β1 such that the predicted probability p̂(x_i) of default for each individual corresponds as closely as possible to the individual's observed default status.

That is, find β0 and β1 such that plugging them into p(x) yields a number close to 1 for all individuals who defaulted, and a number close to 0 for all individuals who didn't.

Maximum Likelihood Method

The maximum likelihood method is a statistical technique for estimating the parameters of a statistical model based on the observed data. The goal of the maximum likelihood method is to find the values of the parameters that maximize the probability of the observed data given the model.

The specific form of the likelihood function depends on the form of the model and the distribution of the data. For example, in logistic regression, the model is based on the logistic function and the data are assumed to be independently and identically distributed according to a Bernoulli distribution. In this case, the likelihood function has the form:

L(β) = Π_i P(y_i | x_i; β) = Π_i p(x_i)^(y_i) · (1 - p(x_i))^(1 - y_i)

where p(x_i) is the predicted probability for observation i, and y_i is the observed label (1 = defaulted, 0 = not defaulted).

For example, suppose we are interested in predicting the probability of a customer defaulting on a credit card given their account balance and credit score. We have a dataset containing the account balance, credit score, and default status for a sample of customers. We can use logistic regression to model the relationship between the account balance, the credit score, and the probability of default. In this case, the logistic regression model would have the form logit(p(X)) = β0 + β1 × balance + β2 × score.

Probability Distribution

Bernoulli distribution. Suppose you have:
· an experiment
· the experiment results in one of two possible outcomes: success or failure (a binary response)
· P(success) = p and P(failure) = 1 - p

Let Y = 1 if a success occurs and Y = 0 if a failure occurs. Then Y has a Bernoulli distribution with

P(Y = y) = p^y · (1 - p)^(1 - y), for y = 0, 1

so that P(Y = 1) = p and P(Y = 0) = 1 - p. This is known as the PMF (probability mass function).
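As a tiny sketch, the Bernoulli PMF in Python, reusing the dealership proportion from earlier as the success probability:

def bernoulli_pmf(y, p):
    # P(Y = y) = p^y * (1 - p)^(1 - y), for y in {0, 1}
    return (p ** y) * ((1.0 - p) ** (1 - y))

p = 0.84                    # e.g. P(recommend) from the survey above
print(bernoulli_pmf(1, p))  # P(Y = 1) = p = 0.84
print(bernoulli_pmf(0, p))  # P(Y = 0) = 1 - p = 0.16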

Likelihood Estimation

This helps to estimate the coefficients. Why? The labels are binary and the prediction will be one of them, 0 or 1, which means:

Y ~ Bernoulli(p), where p = e^z / (1 + e^z) = 1 / (1 + e^(-z)), the sigmoid function, with z = β0 + β1x.

For any data point,

P(Y = y | X = x) = p^y · (1 - p)^(1 - y)
                 = ( e^(β0 + β1x) / (1 + e^(β0 + β1x)) )^y · ( 1 - e^(β0 + β1x) / (1 + e^(β0 + β1x)) )^(1 - y)

We can then write the likelihood of all the data points, the "likelihood of independent training labels":

L(β) = Π_i P(Y = y_i | X = x_i) = Π_i p(x_i)^(y_i) · (1 - p(x_i))^(1 - y_i)
Example:

Suppose we have a dataset of 4 customers who have applied for a loan, and we want to model the probability of a customer defaulting on the loan (i.e., y = 1) as a function of their credit score (i.e., X). The credit scores and binary response variables are given in the following table:

Customer   X     Y
1          600   1
2          550   0
3          700   1
4          650   0

Let's assume that we have estimated the following coefficients from the data using MLE: β0 = -3.0 and β1 = 0.01.

Given these coefficients, we can calculate the predicted probability of default for each customer using p(x) = 1 / [1 + exp(-(β0 + β1x))]. For customer 1, with a credit score of 600, this gives p(x) = 1 / [1 + exp(-(-3.0 + 0.01 × 600))] ≈ 0.9525. The predicted probabilities for the four customers are:

Customer 1 (score 600): p(x) = 0.9525
Customer 2 (score 550): p(x) = 0.0478
Customer 3 (score 700): p(x) = 0.9975
Customer 4 (score 650): p(x) = 0.7876

Given the predicted probabilities of default, we can calculate the likelihood of observing the data for each customer as p(x)^y · (1 - p(x))^(1 - y):

For customer 1, with binary response y = 1: 0.9525^1 × (1 - 0.9525)^(1 - 1) = 0.9525
For customer 2, with binary response y = 0: 0.0478^0 × (1 - 0.0478)^(1 - 0) = 0.9522
For customer 3, with binary response y = 1: 0.9975^1 × (1 - 0.9975)^(1 - 1) = 0.9975
For customer 4, with binary response y = 0: 0.7876^0 × (1 - 0.7876)^(1 - 0) = 0.2124

L(β0, β1) = Π_i p(x_i)^(y_i) · (1 - p(x_i))^(1 - y_i) = 0.9525 × 0.9522 × 0.9975 × 0.2124 ≈ 0.1921

The likelihood value of 0.1921 in this example represents the probability of observing the binary response variables for all 4 customers, given the credit scores X and the estimated coefficients β0 and β1. A higher likelihood value indicates a better fit of the logistic regression model to the dataset, as it means that the estimated coefficients are more likely to have generated the observed data.

That said, a likelihood value of 0.1921 is not considered a good fit of the logistic regression model to the data, as it indicates a relatively low probability of observing the binary response variables given the credit scores X and the estimated coefficients β0 and β1. A better fit of the model to the data would typically result in a higher likelihood value.
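A short sketch reproducing this likelihood computation from the four predicted probabilities stated above:

# Predicted default probabilities and observed labels for the 4 customers
p_hat = [0.9525, 0.0478, 0.9975, 0.7876]
y     = [1, 0, 1, 0]

likelihood = 1.0
for p, yi in zip(p_hat, y):
    likelihood *= (p ** yi) * ((1.0 - p) ** (1 - yi))  # p^y * (1-p)^(1-y)

print(likelihood)  # ~0.1921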
Log Likelihood

ℓ(β0, β1) = Σ_i [ y_i · log(p(x_i)) + (1 - y_i) · log(1 - p(x_i)) ]

Why log likelihood?

Simplification of calculations
· Taking the logarithm of a product of probabilities transforms it into a sum of logarithms, which is easier to work with and less prone to numerical errors.

Convenient optimization
· The log likelihood is a continuous and differentiable function, making it easier to optimize using gradient descent or other optimization algorithms.
· The objective is to find the values of β0 and β1 that maximise the log likelihood (equivalent to minimizing the negative log likelihood).

Additivity property
· Provides a convenient way to interpret the goodness of fit of the model to the data.
· A higher log likelihood indicates a better fit of the model to the data, while a lower log likelihood indicates a poorer fit.
· It can be used to compare different models, with the model with the highest log likelihood typically considered the best model.
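Continuing the four-customer example, a quick sketch showing that the log likelihood is the sum of per-observation log terms, and equals the log of the likelihood product computed earlier:

import math

p_hat = [0.9525, 0.0478, 0.9975, 0.7876]
y     = [1, 0, 1, 0]

log_lik = sum(yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
              for p, yi in zip(p_hat, y))

print(log_lik)           # ~ -1.6494
print(math.log(0.1921))  # ~ -1.6497 (log of the rounded product above)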

Gradient Ascent

Gradient ascent is a method for finding the maximum of a function. It is used in logistic regression to find the best fit to the data.

Imagine you are hiking up a mountain. The goal is to reach the highest point of the mountain, which is the peak. You start at the bottom of the mountain and need to find the best path to the peak. Just like hiking up a mountain, gradient ascent finds the best path to the maximum of a function. The function represents the log likelihood in logistic regression, and the goal is to find the maximum log likelihood, which corresponds to the best-fit model.

Gradient ascent works by iteratively updating the values of the parameters in the direction of the steepest ascent of the log likelihood. The gradient of the log likelihood tells you the direction of steepest ascent. You move in that direction by a small step size at each iteration.

Think of the gradient as a compass. The gradient points you in the direction of the steepest ascent, just like a compass points you in the direction of north. You keep following the gradient until you reach the maximum log likelihood.

In logistic regression, gradient ascent is used to find the maximum log likelihood and the best-fit model to the data. The log likelihood is a measure of how well the model fits the data, and the goal is to find the values of the parameters that maximize the log likelihood. Gradient ascent helps us achieve this by iteratively updating the values of the parameters in the direction of the steepest ascent of the log likelihood.

So we apply the gradient ASCENT algorithm in order to find the beta values which maximize our log likelihood.

Gradient Ascent for learning β

Repeat (for j = 0, 1, 2, ..., n):

β_j(new) = β_j(old) + α · ∂ℓ(β)/∂β_j

where α is the learning rate and ∂ℓ(β)/∂β_j is the partial derivative of the log likelihood

ℓ(β) = Σ_i [ y_i · log(p(x_i)) + (1 - y_i) · log(1 - p(x_i)) ]

with

p(x_i) = 1 / (1 + e^(-z)), z = β0 + β1x.

We add the gradient term because we want to maximise. The partial derivative works out to

∂ℓ(β)/∂β_j = Σ_i ( y_i - p(x_i) ) · x_i^(j)

the same form as in linear regression. Gradient ascent is the same as gradient descent, just with a '+' sign.
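A minimal sketch of this update rule on the four-customer data from the earlier example. The feature scaling, learning rate, and iteration count are implementation choices assumed here for stability, not values from the notes.

import math

# Toy data from the example above; credit scores scaled by 1/100 so a
# fixed learning rate stays stable
xs = [6.00, 5.50, 7.00, 6.50]
ys = [1, 0, 1, 0]

def p_hat(x, b0, b1):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # sigmoid of b0 + b1*x

def log_lik(b0, b1):
    return sum(y * math.log(p_hat(x, b0, b1)) +
               (1 - y) * math.log(1.0 - p_hat(x, b0, b1))
               for x, y in zip(xs, ys))

b0, b1 = 0.0, 0.0
alpha = 0.01  # learning rate (assumed)

for _ in range(5000):
    # Gradient of the log likelihood: sum_i (y_i - p(x_i)) * x_i^(j)
    g0 = sum(y - p_hat(x, b0, b1) for x, y in zip(xs, ys))
    g1 = sum((y - p_hat(x, b0, b1)) * x for x, y in zip(xs, ys))
    b0 += alpha * g0  # ascent: step WITH the gradient
    b1 += alpha * g1

print(b0, b1, log_lik(b0, b1))  # the log likelihood grows toward its maximum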


Why take the derivative?

· The derivative of the log likelihood is taken with respect to the coefficients β0 and β1 to understand how changing the values of the coefficients affects the log likelihood.
· The optimization algorithm (e.g., gradient descent) uses the derivative of the log likelihood to determine the direction of the step that maximizes the log likelihood, and iteratively updates the values of β0 and β1 until the log likelihood is maximized.
· The derivative of the log likelihood provides information about the rate of change of the log likelihood with respect to the coefficients, which is used to find the values of β0 and β1 that maximize the log likelihood and provide the best fit of the model to the data.
· From a geometrical perspective, the derivative of the log likelihood can be thought of as the slope of the log likelihood function. The optimization algorithm updates the values of β0 and β1 in the direction of the steepest ascent (i.e., the direction of the highest slope) until the log likelihood is maximized.

With Gradient Descent

There's a way to use gradient DESCENT just like we used in linear regression, but we need a bit of a change in our optimization problem. So we introduce the NLL.

The negative log likelihood (NLL) is used to measure the goodness of fit of the logistic regression model to the data. The NLL is calculated by taking the negative logarithm of the likelihood function. The goal in logistic regression is to find the values of beta that minimize the NLL, which is equivalent to maximizing the likelihood function.

NLL(β) = - Σ_{i=1..m} [ y_i · log(ŷ_i) + (1 - y_i) · log(1 - ŷ_i) ]

Basically, NLL(β) = -ℓ(β), and we want to find min NLL(β).

Minimizing the negative log likelihood rather than maximizing the likelihood function has several advantages: it is much easier to reason about the loss this way, and it is consistent with the convention of loss functions approaching 0 as the model gets better.

Repeat:

β_j = β_j - α · Σ_i ( ŷ_i - y_i ) · x_i^(j)

Notice this is the same update as in linear regression! Here we want to minimize.
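The same toy training loop written as gradient descent on the NLL; with ŷ_i = p(x_i), the update above is just the gradient ascent step from before with the sign flipped (same assumed data scaling and learning rate as in the earlier sketch):

import math

xs = [6.00, 5.50, 7.00, 6.50]  # scaled credit scores, as before
ys = [1, 0, 1, 0]

def p_hat(x, b0, b1):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def nll(b0, b1):
    # NLL(beta) = -sum_i [ y_i*log(y_hat_i) + (1 - y_i)*log(1 - y_hat_i) ]
    return -sum(y * math.log(p_hat(x, b0, b1)) +
                (1 - y) * math.log(1.0 - p_hat(x, b0, b1))
                for x, y in zip(xs, ys))

b0, b1, alpha = 0.0, 0.0, 0.01
for _ in range(5000):
    g0 = sum(p_hat(x, b0, b1) - y for x, y in zip(xs, ys))        # (y_hat - y)
    g1 = sum((p_hat(x, b0, b1) - y) * x for x, y in zip(xs, ys))
    b0 -= alpha * g0  # descent: step AGAINST the gradient of the NLL
    b1 -= alpha * g1

print(b0, b1, nll(b0, b1))  # the NLL shrinks toward its minimum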

Assumptions of Logistic Regression

Binary response: The response variable should be binary, meaning it can take on only two values, such as 0 or 1.

Independence of observations: The observations should be independent of each other. This means that the outcome of one observation should not affect the outcome of another observation.

Linearity in the log-odds: The log-odds of the response should have a linear relationship with the predictor variables. This means that the log-odds of the response can be estimated as a linear combination of the predictor variables.

Large sample size: The sample size should be large enough to accurately estimate the coefficients in the model.

For example, consider a study that examines the association between a person's smoking status (binary response) and their age (predictor variable). The independence of observations assumption would require that the outcome of one person's smoking status does not affect another person's smoking status. The linearity in the log-odds assumption would require that the log-odds of a person's smoking status can be estimated as a linear function of their age.
