04 - Notebook4 - Additional Information
4. The estimator object is trained with the training data: rfe = rfe.fit(X_train, y_train).
Now we have a fitted estimator object.
5. We now reduce the features to the 10 most relevant features that have been identified
using the estimator (see the sketch after this list).
The data set with 10 features (instead of the original 30 features) is used for linear regression.
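A minimal sketch of these two steps, assuming the notebook's X_train, y_train and X_test data frames and assuming a LinearRegression estimator (the actual estimator used in the notebook may differ):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

estimator = LinearRegression()
rfe = RFE(estimator, n_features_to_select=10)    # keep the 10 most relevant features
rfe = rfe.fit(X_train, y_train)                  # step 4: train the estimator

# step 5: reduce the data to the selected features
X_train_reduced = X_train.loc[:, rfe.support_]
X_test_reduced = X_test.loc[:, rfe.support_]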
2 Multivariate Linear Regression
A linear regression with several variables / features is called multivariate. In order to formulate
the hypothesis as one single dot product, a dummy input attribute $x_{j,0}$ with the value
of 1 is introduced. Its weight is the intercept of the function (slide 10 of the slides on linear
regression, Lecture Week 4).
Often this additional variable is named intercept; in our notebook it was named const and
added to the data frame with the following line of code:
X = sm.add_constant(X)
where sm refers to the module statsmodels.api.
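A short sketch of how the constant is added and the OLS model is fitted; the names X (features) and y (price) are taken to match the notebook, the rest is an assumption:

import statsmodels.api as sm

X = sm.add_constant(X)      # adds the column 'const' with the value 1 in every row
model = sm.OLS(y, X)        # ordinary least squares, price as dependent variable
results = model.fit()
print(results.summary())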
3.1 General Information and Results
At first we look at the upper part of the summary:
It names the dependent variable, which is price. This is the variable we predict using
linear regression.
The method that was applied to develop the predictor, i.e. the machine learning model, is
specified; in our case it is OLS, that is ordinary least squares.
The date and time of the calculation are specified, i.e. when the model was built.
In the next line the number of observations N is specified. This is the number of
data samples / data points that were used to build the model.
Now we come to the line after the next one, Df Model. Df refers to degrees of
freedom. This is the number of independent variables / parameters p our model
was built with. In our case we used 10 features, therefore the degree of freedom of the
model is 10.
The degrees of freedom of the residuals, that is, of the error terms, are calculated as the
number of observations minus the df of the model minus 1: N − p − 1.
The OLS result is normally derived with a nonrobust covariance. The module offers the
method OLSResults.get_robustcov_results() if an application requires the calculation
based on a robust covariance.
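The quantities from this upper part of the summary are also available as attributes of the fitted results object; a sketch, assuming the results object from the code above:

print(results.nobs)        # number of observations N
print(results.df_model)    # degrees of freedom of the model (our 10 features)
print(results.df_resid)    # degrees of freedom of the residuals, N - p - 1
robust = results.get_robustcov_results(cov_type="HC1")   # robust covariance variant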
R-squared, $R^2$, indicates how much of the variation of the dependent variable can be
explained using the model.
$R^2$ does not decrease if more variables / parameters are introduced, even though the degrees of
freedom for the errors are reduced. This is corrected by calculating the adjusted $R^2$
(see StatQuest):
$$R^2_{\text{adjusted}} = 1 - \frac{SS(\text{fit}) / (N - p - 1)}{SS(\text{mean}) / (N - 1)}$$
Formula after [Walpole et al., 2007], formulation adapted to StatQuest.
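As an illustration, the adjusted $R^2$ can be recomputed from the sums of squares and compared to the value statsmodels reports; a sketch, assuming y and the results object from above:

import numpy as np

ss_fit = np.sum(results.resid ** 2)        # SS(fit), sum of squared residuals
ss_mean = np.sum((y - y.mean()) ** 2)      # SS(mean), total sum of squares
N, p = results.nobs, results.df_model

r2_adj = 1 - (ss_fit / (N - p - 1)) / (ss_mean / (N - 1))
print(r2_adj, results.rsquared_adj)        # the two values should agree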
The F-statistic tests the null hypothesis that all coefficients are 0, i.e. that the variables are insignificant.
Prob (F-statistic) indicates the probability that the null hypothesis is true. If the null
hypothesis is rejected, the set of variables is significant. In order to determine the
probability we need the df of the model as well as the df of the residuals.
Log-likelihood, AIC and BIC measure how well the model fits the data. When
comparing models, a model with a higher log-likelihood is better, as is a model with
a smaller AIC and BIC.
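These test statistics and information criteria can be read directly from the results object:

print(results.fvalue)     # F-statistic
print(results.f_pvalue)   # Prob (F-statistic)
print(results.llf)        # Log-likelihood
print(results.aic)        # AIC
print(results.bic)        # BIC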
For each coefficient the estimated value is specified. Of course this value is only an estimate, therefore
additional information is provided. The std err column specifies the standard error of the coefficient.
The t-distribution is often used in "problems that deal with inference about the population
mean" [Walpole et al., 2007]. In order to determine whether the value is small enough, the prob-
ability value for the t value is specified: the probability of the null hypothesis that the variable
has no effect. In the last two columns the confidence interval of the coefficient is specified. With
a 95% probability the true value of the coefficient lies inside this interval.
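The columns of the coefficient table correspond to the following attributes and methods of the results object:

print(results.params)      # coef: the estimated coefficients
print(results.bse)         # std err: standard error of each coefficient
print(results.tvalues)     # t: the t statistic
print(results.pvalues)     # P>|t|: probability of the null hypothesis
print(results.conf_int())  # 95% confidence interval, columns [0.025, 0.975]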
Feature Selection In Notebook 4 the information about the coefficients is used for feature
selection. Here the const, i.e. the intercept, is not included. Variables whose probability of
the t value (the p value) is too high can be removed, as sketched below. Please note that after the removal of a variable,
the linear regression must be repeated. A high p value indicates that a specific variable is
insignificant in the presence of the other variables [Walpole et al., 2007].
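A minimal sketch of such a backward elimination by p value; the threshold of 0.05 and the loop structure are assumptions and not necessarily the exact procedure of the notebook. X is assumed to be a pandas data frame, and the const column is never removed:

import statsmodels.api as sm

def backward_elimination(X, y, threshold=0.05):
    X = sm.add_constant(X)
    while True:
        results = sm.OLS(y, X).fit()
        pvalues = results.pvalues.drop('const')   # do not consider the intercept
        if pvalues.empty or pvalues.max() <= threshold:
            return results
        X = X.drop(columns=pvalues.idxmax())      # remove the worst variable and refit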
4 Collinearity
Using the results of OLS, we check the significance of variables. The variance inflation factor
helps to identify variables that have a high linear correlation. In linear regression we assume
independence of the variables. If the predicting variables themselves are highly correlated, the
model becomes harder to interpret and understand. Therefore it is often recom-
mended to remove those variables.
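A sketch of how the variance inflation factor can be computed with statsmodels; X is assumed to be the feature data frame including the constant column:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)    # features with a high VIF are candidates for removal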
References
[Walpole et al., 2007] Walpole, R. E., Myers, R. H., Myers, S. L., and Ye, K. (2007). Probability
& Statistics for Engineers & Scientists. Pearson International Edition, 8th edition.