
2. Background
2.1. Statistics Definition
Statistics is a scientific discipline concerned with the
collection, analysis, interpretation, and presentation of
data. It employs probability theory to establish methods for
gathering and manipulating statistical information,
enabling researchers to draw meaningful conclusions. The
field is broadly categorized into descriptive and inferential
statistics. Descriptive statistics focuses on summarizing
and organizing data, while inferential statistics involves
making predictions or generalizations about a larger
population based on a sample dataset.

2.2. Logistic Regression Definition


Logistic regression (or logit regression) is a method for
estimating the probability of a discrete outcome from a
given dataset of independent variables. It is the
appropriate regression analysis to conduct when the
dependent variable is binary (taking the value 0 or 1).
Like all regression analyses, logistic regression is a
predictive analysis. It is used to describe data and to
explain the relationship between one binary dependent
variable and one or more independent variables. The
technique is similar to linear regression and can be used
to predict probabilities for classification problems.
In logistic regression, a logit transformation is applied to
the odds, that is, the probability of success divided by the
probability of failure. This is also commonly known as the
log odds, or the natural logarithm of the odds, and the
logistic function is represented by the following formula:

P(X) = 1 / (1 + e^(−Xβ))

- P: "success probability", the probability of the
dependent variable equaling a success/case rather
than a failure/non-case (probability of a 1).
- Xβ = β0 + β1X1 + β2X2 + ... + βkXk: the linear predictor
(the log odds).
- k: the number of predictors.
- Xi: the independent variables.
- β0: the intercept.
- βi: the coefficient of Xi.
With one X variable, the theoretical model for P has an
elongated sigmoidal shape with asymptotes at 0 and 1,
although in sample estimates we may not see this shape
if the range of the variable is limited.
After the model has been computed, it is best practice to
evaluate how well the model predicts the dependent
variable, which is called goodness of fit. The Hosmer–
Lemeshow test is a popular method to assess model fit
and will be discussed in a later section.
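The logistic function above can be sketched directly; a minimal implementation (the coefficient values in the example call are illustrative, not fitted):

```python
import math

def logistic(x, beta0, betas):
    """Logistic function P(X) = 1 / (1 + e^(-Xβ)),
    where Xβ = β0 + β1*x1 + ... + βk*xk."""
    xb = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-xb))

# With one predictor, P traces an S-shaped curve between 0 and 1;
# at Xβ = 0 the probability is exactly 0.5.
print(logistic([0.0], 0.0, [1.0]))  # 0.5
```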

2.3. Model Explanation


2.3.1. Logistic Regression vs. Linear Regression
Linear regression models are used to identify the
relationship between a continuous dependent variable and
one or more independent variables. When there is only one
independent variable and one dependent variable, it is
known as simple linear regression; as the number of
independent variables increases, it is referred to as
multiple linear regression. Each type of linear regression
seeks to plot a line of best fit through a set of data
points, typically calculated using the least squares
method.
Similar to linear regression, logistic regression is also used
to estimate the relationship between a dependent variable
and one or more independent variables, but it predicts a
categorical variable rather than a continuous one. The
model delivers a binary or dichotomous outcome limited
to two possibilities: yes/no, 0/1, or true/false. The unit of
measure also differs from linear regression: the model
produces a probability, and the logit function transforms
the S-curve into a straight line.
Because it solves classification problems, logistic
regression is used in this project to estimate the
probability that water is of acceptable quality: the model
relates variables such as pH and hardness, measured on
3,276 different water bodies, to a potability label, and
predicts whether the water is safe for human consumption.
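The modelling setup described above can be sketched as follows. The feature names (pH, hardness) mirror the project's water dataset, but the data generated here is synthetic and the potability rule is an invented assumption, purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
ph = rng.normal(7.0, 1.0, n)           # hypothetical pH readings
hardness = rng.normal(180.0, 30.0, n)  # hypothetical hardness values
# Invented rule: water near neutral pH is more likely potable.
potable = (np.abs(ph - 7.0) + rng.normal(0, 0.5, n) < 1.0).astype(int)

X = np.column_stack([ph, hardness])
model = LogisticRegression().fit(X, potable)
probs = model.predict_proba(X)[:, 1]  # P(potable = 1) for each sample
print(probs[:3])
```

The fitted coefficients play the role of β0, ..., βk in the formula of Section 2.2.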

2.3.2. Overall Model Evaluation


Given that the Logistic Regression model is constructed
based on a sample drawn from the population, it is
susceptible to sampling error. Consequently, a hypothesis
test must be conducted to conclude that the relationship
between the predictor variable (x) and the response
variable (y) is statistically significant.
Statistical hypotheses:
- Null hypothesis (H₀): β₁ = 0
- Alternative hypothesis (H₁): β₁ ≠ 0
Subsequently, we calculate the overall model Chi-square
statistic using the formula:

χ² = Null deviance − Residual deviance

with degrees of freedom:

df = Null degrees of freedom − Residual degrees of freedom

The model's p-value is then determined as 1 − F(χ²; df),
where F is the chi-square cumulative distribution function
with df degrees of freedom. If the p-value is smaller than
the significance level (p < 0.05), the model is effective in
predicting probabilities. Conversely, if the p-value is
greater than the significance level (p > 0.05), the predictor
variable (x) and the response variable (y) exhibit no
significant relationship (and the predictor can be
disregarded).
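The test above can be computed in a few lines. The deviance and degree-of-freedom values below are illustrative placeholders, not results from the project data:

```python
from scipy.stats import chi2

# Illustrative values, as reported by a typical GLM summary.
null_deviance, residual_deviance = 4520.3, 4390.8
null_df, residual_df = 3275, 3266

chi_sq = null_deviance - residual_deviance   # overall model chi-square
df = null_df - residual_df                   # degrees of freedom
p_value = chi2.sf(chi_sq, df)                # sf = 1 - CDF
print(chi_sq, df, p_value)
```

A p-value below 0.05 would lead us to reject H₀ and conclude the model is significant overall.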

2.3.3. ROC–AUC Method to Evaluate Model Accuracy

Determining Prediction Errors

              Predicted = 0        Predicted = 1
Reality = 0   TN (True Negative)   FP (False Positive)
Reality = 1   FN (False Negative)  TP (True Positive)
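Sensitivity and specificity follow directly from the confusion-matrix counts; the cell counts below are made up for illustration:

```python
# Illustrative confusion-matrix counts.
tn, fp, fn, tp = 80, 20, 10, 90

sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
print(sensitivity, specificity)  # 0.9 0.8
```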

Trade-off between Sensitivity and Specificity


Threshold value t:
- If P(y=1) ≥ t: predict y = 1
- If P(y=1) < t: predict y = 0
Case 1 (TH1): If you want to reduce false positives,
choose a high threshold t (e.g., t = 0.7): fewer
customers are predicted to default.
- Higher specificity, lower sensitivity.
Case 2 (TH2): If you want to reduce false negatives,
choose a low threshold t (e.g., t = 0.3): more
customers are predicted to default.
- Lower specificity, higher sensitivity.


Choosing the threshold t depends on which error is more
costly in the model:
- Let a be the cost of a customer defaulting when
predicted not to default (a false negative).
- Let b be the cost of a customer not defaulting but
being predicted to default (a false positive).
- Total cost = a × (number of false negatives) + b ×
(number of false positives).
- Choose the threshold value that minimizes total cost.
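The cost-minimizing threshold can be found by a simple scan. The probabilities, labels, and costs a, b below are invented for illustration:

```python
def total_cost(probs, labels, t, a, b):
    """Total cost = a * (false negatives) + b * (false positives)
    when predicting y = 1 whenever P(y=1) >= t."""
    fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
    return a * fn + b * fp

probs  = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2]  # illustrative P(y=1) scores
labels = [0,   0,   1,    1,   1,    0]
a, b = 5.0, 1.0  # assume a missed default costs 5x a false alarm

# Scan candidate thresholds and keep the cheapest one.
best_t = min((t / 100 for t in range(1, 100)),
             key=lambda t: total_cost(probs, labels, t, a, b))
print(best_t, total_cost(probs, labels, best_t, a, b))
```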
*Receiver Operating Characteristic curve (ROC): represents
the trade-off between sensitivity and specificity as the
threshold value changes.
- X axis: false positive rate (1 − specificity)
- Y axis: sensitivity
As sensitivity increases, the false positive rate increases,
and vice versa.

*Area Under the ROC Curve (AUC):
- The area under the ROC curve.
- AUC represents the model's predictive ability; it is a
more reliable measure than raw accuracy when the
sample is unbalanced.
- AUC = 1: perfect model.
- AUC = 0.5: random model.
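Both quantities are available off the shelf; a minimal sketch with invented labels and scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

labels = [0, 0, 1, 1, 0, 1, 0, 1]                  # illustrative truth
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]  # illustrative P(y=1)

# fpr = 1 - specificity (x axis), tpr = sensitivity (y axis)
fpr, tpr, thresholds = roc_curve(labels, scores)
auc = roc_auc_score(labels, scores)  # 1.0 = perfect, 0.5 = random
print(auc)
```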
