Regression Notes
https://colab.research.google.com/drive/1mUIzN8OPBpXJcimtDpkddVeoB4aabwqm?usp=sharing
Regression Types
Linear Regression
Linear regression is a widely recognized and fundamental method in predictive
modeling. Its primary objective is to establish a linear relationship between the
dependent variable (target variable) and one or more independent variables
(predictors). This technique identifies the line (or hyperplane in higher dimensions) that
optimally represents the linear correlation between the independent variables (X) and
the continuous dependent variable (y). The equation for a simple linear regression is
given below:
y = b₀ + b₁X (1)
Equation (1) represents the best-fitting straight line through your data.
• y – The dependent variable (what you're trying to predict).
• b₀ – The y-intercept (the point where the line crosses the y-axis). This represents
the predicted value of y when X is zero. In practical terms, it may not always be
meaningful if X rarely or never equals zero.
• b₁ – The slope (the steepness or tilt of the line). This tells you how much y
changes (on average) for every one-unit change in X.
• X – The independent variable (what you're using to make the prediction).
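As a minimal sketch of fitting equation (1), assuming NumPy and scikit-learn are available (both come preinstalled in Colab) and using made-up example data:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical example data; X must be 2-D (n_samples, n_features) for scikit-learn
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

model = LinearRegression().fit(X, y)
print("intercept b0:", model.intercept_)        # estimated b0
print("slope b1:", model.coef_[0])              # estimated b1
print("prediction at X = 6:", model.predict([[6]])[0])

The fitted intercept_ and coef_ correspond to b₀ and b₁ in equation (1).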
Multiple Linear Regression
Multiple linear regression (MLR) extends simple linear regression to incorporate
multiple predictors, allowing for a more comprehensive analysis of the factors that
influence the dependent variable. This statistical technique is fundamental in predictive
analytics, enabling the exploration of complex relationships between variables across
various fields, from economics to engineering.
Key Assumptions:
Linearity – There is a linear relationship between the independent and dependent
variables.
No multicollinearity – The independent variables should not be too highly
correlated with each other.
Independence – Observations must be independent of one another.
Normality of Residuals – Errors (residuals) should follow a normal distribution.
Homoscedasticity – The variance of errors is constant throughout the data.
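A minimal sketch of a multiple linear regression fit, assuming the statsmodels package and using made-up data with two predictors (the variable names and values are illustrative only):

import numpy as np
import statsmodels.api as sm

# Hypothetical data: two predictors and one continuous target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                             # columns: X1, X2
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

X_const = sm.add_constant(X)                              # adds the intercept column
model = sm.OLS(y, X_const).fit()
print(model.summary())                                    # coefficients, R-squared, F-statistic, Durbin-Watson

The summary output reports several of the goodness-of-fit measures discussed later in these notes.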
Logistic Regression
Logistic regression is used for binary outcomes (yes/no, true/false). It predicts the
probability of the occurrence of an event by fitting data to a logistic curve, modeled by
the logistic function P(Y = 1) = 1 / (1 + e^−(b₀ + b₁X)). The model operates on the log
odds of the probabilities, transforming the linear combination of inputs into a probability
using the logistic function.
Key Assumptions:
No Multicollinearity – This refers to a situation in which two or more predictor
variables (also known as independent variables) in a regression model are highly
correlated. This correlation can cause problems because it becomes difficult to
determine the individual effect of each predictor on the dependent variable.
Linearity of independent variables and Log Odds – The model assumes that the
log odds of the dependent variable (a transformation of the probability that a
particular event occurs) is a linear combination of the independent variables.
Large Sample Size – Logistic regression relies on maximum likelihood estimation,
which requires a reasonably large sample to produce stable coefficient estimates.
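A minimal sketch using scikit-learn's LogisticRegression on made-up binary-outcome data (the values are illustrative only):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one predictor and a 0/1 target
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
# predict_proba returns [P(Y=0), P(Y=1)] for each row
print("P(Y=1 | X=2.2):", clf.predict_proba([[2.2]])[0, 1])
print("b0:", clf.intercept_[0], "b1:", clf.coef_[0, 0])

Note that scikit-learn applies L2 regularization by default, so the fitted b₀ and b₁ are not exactly the unpenalized maximum likelihood estimates.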
Polynomial Regression
When data shows that the relationship between variables is curvilinear,
polynomial regression, an extension of linear regression, is used. This method can
model a non-linear relationship within the data. In polynomial regression, the dependent
variable Y is expressed as an nth-degree polynomial in X, as shown in the equation
below:
Y = β₀ + β₁X + β₂X² + ⋯ + βₙXⁿ + ε (2)
Equation (2) shows that β₀, β₁, …, βₙ are coefficients to be estimated, and ε
represents the model's error term, assumed normally distributed.
Key Assumptions:
Linearity in Parameters – Polynomial regression models a non-linear relationship
in terms of the dependent and independent variables but assumes linearity in the
parameters, meaning the response variable Y is a linear function of the
coefficients and terms of the independent variables.
Independence of Errors – The residuals (errors) from the model must be
independent of each other, implying that the error at one predictor value does not
affect the error at another value.
Homoscedasticity – The variance of the error terms should be constant across all
levels of the independent variables, as changing variance can lead to biased or
misleading results.
Normality of Residuals – For inference purposes, the residuals of the model
should be normally distributed, which is necessary for assessing the statistical
significance of coefficients using standard t-tests and F-tests.
No Multicollinearity (if multiple predictors) – Polynomial regression with multiple
predictors assumes minimal multicollinearity between the predictors to avoid
inflated variance and unreliable coefficient estimates.
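A minimal sketch of a degree-2 polynomial regression using scikit-learn's PolynomialFeatures with LinearRegression, on made-up curvilinear data:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical data: y roughly follows a quadratic in X plus noise
rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1.0 + 0.5 * X[:, 0] + 2.0 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=50)

# The model stays linear in the coefficients; only the features are expanded to [1, X, X²]
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print("prediction at X = 1.5:", poly_model.predict([[1.5]])[0])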
Goodness of Fit
Goodness of fit measures how well the regression model's predicted values
match the observed data. It assesses the model's explanatory power and the accuracy
of predictions. The following are the primary metrics used to evaluate goodness of fit in
regression models.
R-Squared (R2)
R-squared represents the proportion of the variance in the dependent variable
that is predictable from the independent variables. The equation for the R2 is given
below:
R² = 1 − SSres / SStot (3)
where:
SSres is the sum of squares of residuals (errors)
SStot is the total sum of squares (variance of the dependent variable)
An R2 value close to 1 indicates that a large proportion of the variance in the
dependent variable is explained by the model, while a value close to 0 indicates that the
model explains very little of the variance. For example, an R2 value of 0.8 means 80%
of the variance in the dependent variable is explained by the independent variables.
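The calculation in equation (3) can be sketched directly with NumPy, using made-up observed and predicted values:

import numpy as np

# Hypothetical observed and predicted values
y_obs = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

ss_res = np.sum((y_obs - y_pred) ** 2)          # sum of squared residuals
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                        # equation (3)
print("R-squared:", r2)

The same value is returned by sklearn.metrics.r2_score(y_obs, y_pred).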
Adjusted R-Squared
The Adjusted R-squared modifies the R2 value to account for the number of
predictors in the model. It prevents overestimation of the model's explanatory power,
especially in models with multiple independent variables. The formula for the adjusted
R2 is given below:
Adjusted R² = 1 − [(1 − R²)(n − 1)] / (n − k − 1) (4)
where:
n is the number of observations
k is the number of predictors
Unlike the R2, the adjusted R-squared can decrease if the added predictors do
not improve the model significantly. It is a more accurate measure of the model's
explanatory power. In a model with a high number of predictors, a high adjusted R-
squared indicates that the additional predictors contribute meaningful information.
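A small sketch of equation (4), with hypothetical values for R², n, and k:

def adjusted_r2(r2, n, k):
    """Adjusted R-squared per equation (4): n observations, k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical model: R² = 0.85 with 100 observations and 5 predictors
print(adjusted_r2(0.85, n=100, k=5))   # about 0.842, slightly below R², penalizing the extra predictors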
Root Mean Squared Error (RMSE)
RMSE measures the average magnitude of the prediction errors, representing
the square root of the average of the squared differences between observed and
predicted values. The formula for RMSE is given below:
RMSE = √[ Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / n ] (5)
where:
yᵢ are the observed values
ŷᵢ are the predicted values
n is the number of observations
Lower RMSE values indicate better fit, as the model's predictions are closer to
the actual values. RMSE is sensitive to outliers, as larger errors have a more significant
impact on the value. For example, an RMSE of 2.5 means that, on average, the
predictions are 2.5 units away from the actual values.
Mean Squared Error (MSE)
MSE is the average of the squared differences between observed and predicted
values, MSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / n, which is simply the square of the RMSE. Like RMSE,
lower MSE values indicate a better fit. However, MSE is more sensitive to large errors
due to the squaring of differences. For example, an MSE of 4.0 indicates that the
average squared difference between observed and predicted values is 4.
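Both RMSE and MSE can be sketched in a few lines of NumPy, using made-up observed and predicted values:

import numpy as np

# Hypothetical observed and predicted values
y_obs = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 11.5, 16.0, 18.5])

mse = np.mean((y_obs - y_pred) ** 2)   # mean squared error
rmse = np.sqrt(mse)                    # equation (5): RMSE is the square root of the MSE
print("MSE:", mse, "RMSE:", rmse)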
F-Statistic and p-Value
The F-statistic tests the overall significance of the regression model. It compares
the model with no predictors (intercept-only model) to the model with predictors.
F = (R² / k) / [(1 − R²) / (n − k − 1)] (8)
A higher F-statistic value indicates that the model is a better fit than the intercept-
only model. The p-value associated with the F-statistic indicates the probability that the
observed data could occur under the null hypothesis (no relationship between
dependent and independent variables). A p-value less than 0.05 suggests that the
model is statistically significant at the 5% significance level.
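A sketch of equation (8) with hypothetical values, using SciPy's F distribution to obtain the corresponding p-value:

from scipy import stats

# Hypothetical model: R² = 0.6 with n = 50 observations and k = 3 predictors
r2, n, k = 0.6, 50, 3
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))   # equation (8)
p_value = stats.f.sf(f_stat, k, n - k - 1)     # upper-tail probability under the null hypothesis
print("F:", f_stat, "p-value:", p_value)

Here the p-value is far below 0.05, so the hypothetical model would be judged statistically significant.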
Residual Analysis
Residuals are the differences between observed and predicted values. Analyzing
residuals helps to check the assumptions of the regression model. Scatter plots of
residuals versus predicted values or the independent variables can be used to inspect
them. Residuals should be randomly scattered with no discernible pattern. Patterns may
indicate issues such as non-linearity, heteroscedasticity, or outliers. For example, a
funnel-shaped pattern in the residuals suggests heteroscedasticity, where the variance of
the errors is not constant.
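A minimal sketch of a residuals-versus-predicted plot with Matplotlib, using made-up observed and predicted values:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical observed and predicted values from some fitted model
y_obs = np.array([10.0, 12.0, 15.0, 20.0, 22.0, 25.0])
y_pred = np.array([11.0, 11.5, 16.0, 18.5, 22.5, 24.0])
residuals = y_obs - y_pred

plt.scatter(y_pred, residuals)
plt.axhline(0, color="red", linestyle="--")   # residuals should scatter randomly around zero
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()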
Durbin-Watson Statistic
The Durbin-Watson statistic tests for the presence of autocorrelation in the
residuals from a regression analysis. The Durbin-Watson statistic ranges from 0 to 4; a
value of 2 indicates no autocorrelation. Values approaching 0 indicate positive
autocorrelation, while values approaching 4 indicate negative autocorrelation.
Autocorrelation indicates that the residuals are not independent, violating one of the
key assumptions of linear regression. For example, a Durbin-Watson statistic of 1.5
suggests some positive autocorrelation in the residuals.
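The statistic can be computed with statsmodels' durbin_watson function, here sketched on a small set of made-up residuals:

import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Hypothetical residuals from a fitted regression model
residuals = np.array([0.5, -0.2, 0.1, -0.4, 0.3, -0.1, 0.2, -0.3])
dw = durbin_watson(residuals)   # values near 2 suggest no autocorrelation
print("Durbin-Watson statistic:", dw)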