Regression Notes
https://colab.research.google.com/drive/1mUIzN8OPBpXJcimtDpkddVeoB4aabwqm?usp=sharing
Regression Types
Linear Regression
Linear regression is a widely recognized and fundamental method in predictive
modeling. Its primary objective is to establish a linear relationship between the
dependent variable (target variable) and one or more independent variables
(predictors). This technique identifies the line (or hyperplane in higher dimensions) that
optimally represents the linear correlation between the independent variables (X) and
the continuous dependent variable (y). The equation for a simple linear regression is
given below:
y = b₀ + b₁X (1)
Equation (1) represents the best-fitting straight line through your data.
• y – The dependent variable (what you're trying to predict).
• b₀ – The y-intercept (the point where the line crosses the y-axis). This represents
the predicted value of y when X is zero. In practical terms, it may not always be
meaningful if X rarely or never equals zero.
• b₁ – The slope (the steepness or tilt of the line). This tells you how much y
changes (on average) for every one-unit change in X.
• X – The independent variable (what you're using to make the prediction).
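As a minimal sketch of fitting equation (1), assuming NumPy and scikit-learn are available (both come preinstalled in Colab) and using made-up example data:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical example data; X must be 2-D (n_samples, n_features) for scikit-learn
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

model = LinearRegression().fit(X, y)
print("intercept b0:", model.intercept_)        # estimated b0
print("slope b1:", model.coef_[0])              # estimated b1
print("prediction at X = 6:", model.predict([[6]])[0])

The fitted intercept_ and coef_ correspond to b₀ and b₁ in equation (1).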
Multiple Linear Regression
Multiple linear regression (MLR) extends simple linear regression to incorporate
multiple predictors, allowing for a more comprehensive analysis of the factors that
influence the dependent variable. This statistical technique is fundamental in predictive
analytics, enabling the exploration of complex relationships between variables across
various fields, from economics to engineering.
Key Assumptions:
Linearity – There is a linear relationship between the independent and dependent
variables.
No multicollinearity – The independent variables should not be too highly
correlated with each other.
Independence – Observations must be independent of one another.
Normality of Residuals – Errors (residuals) should follow a normal distribution.
Homoscedasticity – The variance of errors is constant throughout the data.
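A minimal sketch of a multiple linear regression fit, assuming the statsmodels package and using made-up data with two predictors (the variable names and values are illustrative only):

import numpy as np
import statsmodels.api as sm

# Hypothetical data: two predictors and one continuous target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                             # columns: X1, X2
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

X_const = sm.add_constant(X)                              # adds the intercept column
model = sm.OLS(y, X_const).fit()
print(model.summary())                                    # coefficients, R-squared, F-statistic, Durbin-Watson

The summary output reports several of the goodness-of-fit measures discussed later in these notes.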
Logistic Regression
Logistic regression is used for binary outcomes (yes/no, true/false). It predicts the
probability of the occurrence of an event by fitting data to a logistic curve, modeled by
the logistic function P(Y = 1) = 1 / (1 + e^−(b₀ + b₁X)). The model operates on the log
odds of the probabilities, transforming the linear combination of inputs into a probability
using the logistic function.
Key Assumptions:
No Multicollinearity – This refers to a situation in which two or more predictor
variables (also known as independent variables) in a regression model are highly
correlated. This correlation can cause problems because it becomes difficult to
determine the individual effect of each predictor on the dependent variable.
Linearity of independent variables and Log Odds – The model assumes that the
log odds of the dependent variable (a transformation of the probability that a
particular event occurs) is a linear combination of the independent variables.
Large Sample Size – Logistic regression relies on maximum likelihood estimation,
which requires a reasonably large sample to produce stable coefficient estimates.
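A minimal sketch using scikit-learn's LogisticRegression on made-up binary-outcome data (the values are illustrative only):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one predictor and a 0/1 target
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
# predict_proba returns [P(Y=0), P(Y=1)] for each row
print("P(Y=1 | X=2.2):", clf.predict_proba([[2.2]])[0, 1])
print("b0:", clf.intercept_[0], "b1:", clf.coef_[0, 0])

Note that scikit-learn applies L2 regularization by default, so the fitted b₀ and b₁ are not exactly the unpenalized maximum likelihood estimates.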
Polynomial Regression
When data shows that the relationship between variables is curvilinear,
polynomial regression, an extension of linear regression, is used. This method can
model a non-linear relationship within the data. In polynomial regression, the dependent
variable Y is expressed as an nth-degree polynomial in X, as shown in the equation
below:
Y = β₀ + β₁X + β₂X² + ⋯ + βₙXⁿ + ε (2)
Equation (2) shows that β₀, β₁, …, βₙ are coefficients to be estimated, and ε
represents the model's error term, assumed normally distributed.
Key Assumptions:
Linearity in Parameters – Polynomial regression models a non-linear relationship
in terms of the dependent and independent variables but assumes linearity in the
parameters, meaning the response variable Y is a linear function of the
coefficients and terms of the independent variables.
Independence of Errors – The residuals (errors) from the model must be
independent of each other, implying that the error at one predictor value does not
affect the error at another value.
Homoscedasticity – The variance of the error terms should be constant across all
levels of the independent variables, as changing variance can lead to biased or
misleading results.
Normality of Residuals – For inference purposes, the residuals of the model
should be normally distributed, which is necessary for assessing the statistical
significance of coefficients using standard t-tests and F-tests.
No Multicollinearity (if multiple predictors) – Polynomial regression with multiple
predictors assumes minimal multicollinearity between the predictors to avoid
inflated variance and unreliable coefficient estimates.
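A minimal sketch of a degree-2 polynomial regression using scikit-learn's PolynomialFeatures with LinearRegression, on made-up curvilinear data:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical data: y roughly follows a quadratic in X plus noise
rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1.0 + 0.5 * X[:, 0] + 2.0 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=50)

# The model stays linear in the coefficients; only the features are expanded to [1, X, X²]
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print("prediction at X = 1.5:", poly_model.predict([[1.5]])[0])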
Goodness of Fit
Goodness of fit measures how well the regression model's predicted values
match the observed data. It assesses the model's explanatory power and the accuracy
of predictions. The following are the primary metrics used to evaluate goodness of fit in
regression models.
R-Squared (R2)
R-squared represents the proportion of the variance in the dependent variable
that is predictable from the independent variables. The equation for the R2 is given
below:
R² = 1 − SSres / SStot (3)
where:
SSres is the sum of squares of residuals (errors)
SStot is the total sum of squares (variance of the dependent variable)
An R2 value close to 1 indicates that a large proportion of the variance in the
dependent variable is explained by the model, while a value close to 0 indicates that the
model explains very little of the variance. For example, an R2 value of 0.8 means 80%
of the variance in the dependent variable is explained by the independent variables.
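The calculation in equation (3) can be sketched directly with NumPy, using made-up observed and predicted values:

import numpy as np

# Hypothetical observed and predicted values
y_obs = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

ss_res = np.sum((y_obs - y_pred) ** 2)          # sum of squared residuals
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                        # equation (3)
print("R-squared:", r2)

The same value is returned by sklearn.metrics.r2_score(y_obs, y_pred).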
Adjusted R-Squared
The Adjusted R-squared modifies the R2 value to account for the number of
predictors in the model. It prevents overestimation of the model's explanatory power,
especially in models with multiple independent variables. The formula for the adjusted
R2 is given below:
Adjusted R² = 1 − [(1 − R²)(n − 1)] / (n − k − 1) (4)
where:
n is the number of observations
k is the number of predictors
Unlike the R2, the adjusted R-squared can decrease if the added predictors do
not improve the model significantly. It is a more accurate measure of the model's
explanatory power. In a model with a high number of predictors, a high adjusted R-
squared indicates that the additional predictors contribute meaningful information.
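A small sketch of equation (4), with hypothetical values for R², n, and k:

def adjusted_r2(r2, n, k):
    """Adjusted R-squared per equation (4): n observations, k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical model: R² = 0.85 with 100 observations and 5 predictors
print(adjusted_r2(0.85, n=100, k=5))   # about 0.842, slightly below R², penalizing the extra predictors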
Root Mean Squared Error (RMSE)
RMSE measures the average magnitude of the prediction errors, representing
the square root of the average of the squared differences between observed and
predicted values. The formula for RMSE is given below:
RMSE = √[ Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / n ] (5)
where:
yᵢ are the observed values
ŷᵢ are the predicted values
n is the number of observations
Lower RMSE values indicate better fit, as the model's predictions are closer to
the actual values. RMSE is sensitive to outliers, as larger errors have a more significant
impact on the value. For example, an RMSE of 2.5 means that, on average, the
predictions are 2.5 units away from the actual values.
Mean Squared Error (MSE)
MSE is the average of the squared differences between observed and predicted
values, MSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / n, which is simply the square of the RMSE. Like RMSE,
lower MSE values indicate a better fit. However, MSE is more sensitive to large errors
due to the squaring of differences. For example, an MSE of 4.0 indicates that the
average squared difference between observed and predicted values is 4.
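Both RMSE and MSE can be sketched in a few lines of NumPy, using made-up observed and predicted values:

import numpy as np

# Hypothetical observed and predicted values
y_obs = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 11.5, 16.0, 18.5])

mse = np.mean((y_obs - y_pred) ** 2)   # mean squared error
rmse = np.sqrt(mse)                    # equation (5): RMSE is the square root of the MSE
print("MSE:", mse, "RMSE:", rmse)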
F-Statistic and p-Value
The F-statistic tests the overall significance of the regression model. It compares
the model with no predictors (intercept-only model) to the model with predictors.
F = (R² / k) / [(1 − R²) / (n − k − 1)] (8)
A higher F-statistic value indicates that the model is a better fit than the intercept-
only model. The p-value associated with the F-statistic indicates the probability that the
observed data could occur under the null hypothesis (no relationship between
dependent and independent variables). A p-value less than 0.05 suggests that the
model is statistically significant at the 5% significance level.
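A sketch of equation (8) with hypothetical values, using SciPy's F distribution to obtain the corresponding p-value:

from scipy import stats

# Hypothetical model: R² = 0.6 with n = 50 observations and k = 3 predictors
r2, n, k = 0.6, 50, 3
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))   # equation (8)
p_value = stats.f.sf(f_stat, k, n - k - 1)     # upper-tail probability under the null hypothesis
print("F:", f_stat, "p-value:", p_value)

Here the p-value is far below 0.05, so the hypothetical model would be judged statistically significant.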
Residual Analysis
Residuals are the differences between observed and predicted values. Analyzing
residuals helps to check the assumptions of the regression model. Scatter plots of
residuals versus predicted values or the independent variables can be used to inspect
them. Residuals should be randomly scattered with no discernible pattern. Patterns may
indicate issues such as non-linearity, heteroscedasticity, or outliers. For example, a
funnel-shaped pattern in the residuals suggests heteroscedasticity, where the variance of
the errors is not constant.
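A minimal sketch of a residuals-versus-predicted plot with Matplotlib, using made-up observed and predicted values:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical observed and predicted values from some fitted model
y_obs = np.array([10.0, 12.0, 15.0, 20.0, 22.0, 25.0])
y_pred = np.array([11.0, 11.5, 16.0, 18.5, 22.5, 24.0])
residuals = y_obs - y_pred

plt.scatter(y_pred, residuals)
plt.axhline(0, color="red", linestyle="--")   # residuals should scatter randomly around zero
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()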
Durbin-Watson Statistic
The Durbin-Watson statistic tests for the presence of autocorrelation in the
residuals from a regression analysis. The Durbin-Watson statistic ranges from 0 to 4; a
value of 2 indicates no autocorrelation. Values approaching 0 indicate positive
autocorrelation, while values approaching 4 indicate negative autocorrelation.
Autocorrelation indicates that the residuals are not independent, violating one of the
key assumptions of linear regression. For example, a Durbin-Watson statistic of 1.5
suggests some positive autocorrelation in the residuals.
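The statistic can be computed with statsmodels' durbin_watson function, here sketched on a small set of made-up residuals:

import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Hypothetical residuals from a fitted regression model
residuals = np.array([0.5, -0.2, 0.1, -0.4, 0.3, -0.1, 0.2, -0.3])
dw = durbin_watson(residuals)   # values near 2 suggest no autocorrelation
print("Durbin-Watson statistic:", dw)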