04 - Notebook4 - Additional Information


Intro DS and ML

Prof. Dr. Christina B. Claß


SoSe 2022
christina.class@eah-jena.de
Office: 03.00.16

Additional Information regarding Notebook 4


In our notebook 4, the author explains the steps to develop and improve a linear regression model.
Different Python modules provide different implementations of linear regression, and she uses
several of these methods.

1 Feature Selection / "Curse of Dimensionality"


Having too many features / variables makes machine learning complex and also prone to over-
fitting. Therefore, in a first step we might wish to eliminate some variables that are not relevant.
We can use some results of our visualization steps.
In the notebook the author uses sklearn.feature_selection.RFE and
sklearn.linear_model.LinearRegression.

1. A linear regression model is created / instantiated: lm = LinearRegression()

2. The model is trained: lm.fit(X_train, y_train)

3. An estimator object is created to determine the importance of each feature:
   rfe = RFE(lm, n_features_to_select=10). Of course we need to specify the model for
   which the estimator is created (here the trained linear regression model) as well as the
   number of interesting features (here: 10).

4. The estimator object is fitted to the training data: rfe = rfe.fit(X_train, y_train).
   Now we have a fitted estimator object.

5. We now reduce the data to the 10 most relevant features that have been identified
   by the estimator.

(a) She proposes a verbose way:

    • First she displays a list with the name of each feature, whether it belongs to the
      selected 10 features (True or False), and its rank. Note that all selected features
      share the first rank. The other features are ranked according to their importance,
      starting with rank 2.
    • Then she reduces the data to the relevant columns:
      X_train_rfe = X_train[X_train.columns[rfe.support_]]. The result is a data
      frame, so the usual DataFrame methods remain available.

(b) We could equally reduce the number of features using
    X_train_rfe2 = rfe.transform(X_train). Please note that this method returns a numpy
    array and not a data frame, i.e. methods such as head() cannot be applied.

The data set with 10 features (instead of the original 30 features) is then used for linear regression;
the whole sequence of calls is sketched below.
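The following sketch summarizes the steps above in one place (assuming, as in the notebook, a
pandas DataFrame X_train and a target y_train; the names X_train_rfe and X_train_rfe2 are
taken from the notebook):

    # sketch of the feature-selection steps described above
    from sklearn.linear_model import LinearRegression
    from sklearn.feature_selection import RFE

    lm = LinearRegression()                  # 1. instantiate the linear regression model
    lm.fit(X_train, y_train)                 # 2. train it on the training data

    rfe = RFE(lm, n_features_to_select=10)   # 3. estimator that ranks the features
    rfe = rfe.fit(X_train, y_train)          # 4. fit the estimator on the training data

    # 5 (a): keep only the 10 selected columns (result is a DataFrame)
    X_train_rfe = X_train[X_train.columns[rfe.support_]]

    # 5 (b): transform() returns a numpy array, not a DataFrame
    X_train_rfe2 = rfe.transform(X_train)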

2 Multivariate Linear Regression
A linear regression with several variables / features is called multivariate. In order to formulate
the hypothesis as one single dot product, a dummy input attribute x_j,0 with the value 1 is
introduced. Its weight is the intercept of the function (slide 10 of the slides on linear regression,
Lecture Week 4).
Often this additional variable is named intercept; in our notebook it was named const and
added to the data frame with the following line of code:
X = sm.add_constant(X)
where sm refers to the module statsmodels.api.
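A minimal sketch of what this call does (the toy column name area is only an illustrative
assumption, not a column of the notebook's data set):

    import pandas as pd
    import statsmodels.api as sm

    X = pd.DataFrame({"area": [50, 80, 120]})   # toy feature matrix
    X = sm.add_constant(X)                       # prepends a column named "const" containing 1.0
    print(X.head())                              # each row now starts with const = 1.0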

3 The Summary Statistics


For linear regression the author of the notebook uses the OLS class from the module statsmodels.api.
Linear regression is implemented in several Python modules, but statsmodels.api provides a
detailed summary (similar to other languages and tools).
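A minimal sketch of how such a summary can be produced (assuming the reduced training data
X_train_rfe and the target y_train from the feature-selection step; the names model and results
are only illustrative):

    import statsmodels.api as sm

    X_train_sm = sm.add_constant(X_train_rfe)   # add the constant column (see Section 2)
    model = sm.OLS(y_train, X_train_sm)         # ordinary least squares
    results = model.fit()
    print(results.summary())                    # the summary discussed in the following subsections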

3.1 General Information and Results
At first we look at the upper part of the summary:

This summary contains the main information about the linear regression:

• It names the dependent variable, which is price. This is the variable we predict using
  linear regression.

• The method that was applied to build the predictor / machine learning model is specified;
  in our case it is OLS, i.e. ordinary least squares.

• The date and time of the calculation are specified, i.e. when the model was built.

• In the next line the number of observations N is specified. This is the number of
  data samples / data points that were used to build the model.

• Now we come to the line after the next one, the Df Model. Df refers to degrees of
  freedom. This is the number of independent variables / parameters p our model
  was built with. In our case we used 10 features, therefore the degrees of freedom of the
  model are 10.

• Degrees of freedom of the residuals, that is of the error terms. They are calculated as the
  number of observations minus the df of the model minus 1, N − p − 1.

• The OLS result is normally derived with a nonrobust covariance. The module offers the
  method OLSResults.get_robustcov_results() if an application requires the calculation
  based on a robust covariance.

• R-squared, R², indicates how much of the variation of the dependent variable can be
  explained by the model.

• R² does not decrease if more variables / parameters are introduced, even though the
  degrees of freedom for the errors are reduced. This is corrected by calculating the
  adjusted R²:

      R² = (SS(mean) − SS(fit)) / SS(mean) = 1 − SS(fit) / SS(mean)

  (see StatQuest).

      R²_adjusted = 1 − (SS(fit) / (N − p − 1)) / (SS(mean) / (N − 1))

  Formula after [Walpole et al., 2007], formulation adapted to StatQuest. A small numeric
  sketch of both formulas follows after this list.

• The F-statistic tests the null hypothesis that all variable coefficients are 0, i.e. that the
  variables are insignificant.

• Prob (F-statistic) indicates the probability that this null hypothesis is true. If the null
  hypothesis is rejected, the set of variables is significant. In order to determine the
  probability we need the df of the model as well as the df of the residuals.

• Log-likelihood, AIC and BIC indicate how well the model explains the data. When
  comparing models, a model with a higher log-likelihood is better, as is a model with
  a smaller AIC and BIC.
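As a small numeric sketch of the two R² formulas above (the values are made up and are not
taken from the notebook):

    import numpy as np

    y_true = np.array([10.0, 12.0, 14.0, 13.0, 16.0])   # toy observed values
    y_pred = np.array([10.5, 11.5, 14.5, 13.0, 15.5])   # toy model predictions
    N, p = len(y_true), 2                                # 5 observations, 2 features (assumed)

    ss_mean = np.sum((y_true - y_true.mean()) ** 2)      # SS(mean)
    ss_fit  = np.sum((y_true - y_pred) ** 2)             # SS(fit)

    r2          = 1 - ss_fit / ss_mean
    r2_adjusted = 1 - (ss_fit / (N - p - 1)) / (ss_mean / (N - 1))
    print(r2, r2_adjusted)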

3.2 The Coefficients


In the second part, information regarding the coefficients is summarized.

For each coefficient the estimated value is specified. Of course this value is only an estimate, therefore
additional information is provided. The std err specifies the standard error of the coefficient.
The t-distribution is often used in "problems that deal with inference about the population
mean" [Walpole et al., 2007]. To judge whether a coefficient is significant, the probability of its
t value is specified, i.e. the probability of the null hypothesis that the variable has no effect.
In the last two columns the confidence interval of the coefficient is specified. With
a 95% probability the true value of the coefficient lies inside this interval.
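The same quantities can also be read directly from the fitted results object (a sketch, assuming
the results object from the OLS sketch above):

    print(results.params)       # estimated coefficients (coef column)
    print(results.bse)          # standard errors (std err column)
    print(results.tvalues)      # t statistics (t column)
    print(results.pvalues)      # probabilities of the t values (P>|t| column)
    print(results.conf_int())   # 95% confidence intervals ([0.025, 0.975] columns)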

3.3 The Distribution Information


The third part provides various information about the distribution of the residuals. Linear
regression is based on the assumption that the residuals are normally distributed with a mean
of zero.
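One simple way to inspect this assumption (a sketch; resid is the residual attribute of a fitted
statsmodels result, and the histogram is only one of several possible checks):

    residuals = results.resid        # observed values minus fitted values
    print(residuals.mean())          # should be close to zero

    import matplotlib.pyplot as plt
    plt.hist(residuals, bins=30)     # rough visual check of normality
    plt.title("Distribution of residuals")
    plt.show()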

Feature Selection In Notebook 4 the information about the coefficients is used for feature
selection. Here the const, i.e. the intercept, is not included. Variables whose probability of
the t value is too high can be removed. Please note that after the removal of a variable,
the linear regression must be repeated, as sketched below. A high p value indicates that a specific
variable is insignificant in the presence of the other variables [Walpole et al., 2007].
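A sketch of one such removal step (the significance level of 0.05 and the helper variable names
are assumptions, not taken from the notebook):

    # drop the least significant variable (excluding the constant) and refit
    pvalues = results.pvalues.drop("const")          # ignore the intercept
    worst = pvalues.idxmax()                         # variable with the highest p value

    if pvalues[worst] > 0.05:                        # assumed significance level
        X_train_sm = X_train_sm.drop(columns=[worst])
        results = sm.OLS(y_train, X_train_sm).fit()  # repeat the linear regression
        print(results.summary())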

4 Collinearity
Using the results of OLS, we check the significance of the variables. The variance inflation factor
helps to identify variables that have a high linear correlation with other variables. In linear
regression we assume independence of the variables. If the predicting variables themselves are
highly correlated, the model becomes harder to interpret and understand. Therefore it is often
recommended to remove such variables.
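The variance inflation factor can be computed with statsmodels (a sketch; the DataFrame
X_train_sm including the constant column is assumed from the OLS sketch above, and the
threshold mentioned in the comment is a common rule of thumb, not a value from the notebook):

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # one VIF value per column; values far above roughly 5-10 hint at collinearity
    vif = pd.DataFrame({
        "feature": X_train_sm.columns,
        "VIF": [variance_inflation_factor(X_train_sm.values, i)
                for i in range(X_train_sm.shape[1])],
    })
    print(vif.sort_values("VIF", ascending=False))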

References
[Walpole et al., 2007] Walpole, R. E., Myers, R. H., Myers, S. L., and Ye, K. (2007). Probability
& Statistics for Engineers & Scientists. Pearson International Edition, 8th edition.
