3. Linear Regression


ZG 512

Supervised Learning
Linear Regression
Dr Arindam Roy
BITS Pilani, Pilani Campus
Types of Machine Learning



Regression

Predict the value of a given continuous-valued variable based on the values of other variables,
assuming a linear or nonlinear model of dependency.
Extensively studied in statistics and in the neural-networks field.
Examples:
◦ Predicting sales amounts of new product based on advertising expenditure.
◦ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
◦ Predicting price of a house based on its attributes



Variables

Sales is the Dependent Variable
• Also known as the Response or Target
• Generically referred to as Y

TV, Radio and Paper are the independent variables
• Also known as features, inputs, or predictors
• Generically referred to as X (or X1, X2, X3)

#   TV      Radio   Paper   Sales
1   230.1   37.8    69.2    22.1
2   44.5    39.3    45.1    10.4
3   17.2    45.9    69.3    9.3
4   151.5   41.3    58.5    18.5
5   180.8   10.8    58.4    12.9
6   8.7     48.9    75.0    7.2



Matrix X and Vector y

The Advertising data set (shown above) has 4 variables and 6 observations.
The variable names are "TV", "Radio", "Paper" and "Sales".

p = 3 (the number of independent variables)
n = 6 (the number of observations)

X represents the input data set; X is a 6 × 3 matrix.
y represents the output variable; y is a 6 × 1 vector.



Matrix X and Vector y (continued)

X is a 6 × 3 matrix (X6×3) and y is a 6 × 1 vector (y6×1).

Xi represents the ith observation; Xi is the row vector (xi1, xi2, ..., xip).
xj represents the jth variable; xj is the column vector (x1j, x2j, ..., xnj).
yi represents the ith observation of the output variable; y is the vector (y1, y2, ..., yn).



A Linear Model

The linear model is an important example of a parametric model:

f(X) = β0 + β1X1 + β2X2 + ... + βpXp

• A linear model is specified in terms of p + 1 parameters: β0, β1, ..., βp.
• We estimate the parameters by fitting the model to training data.
• Although it is almost never correct, a linear model often serves as a good and interpretable
  approximation to the unknown true function f(X).
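As a quick illustrative sketch (not from the slides): once the β's are known, a prediction is just the intercept plus a weighted sum of the inputs. The coefficient values below are made up for illustration, not fitted values.

import numpy as np

# Made-up coefficients for illustration only (beta0 = intercept, one beta per predictor)
beta0 = 2.9
betas = np.array([0.046, 0.189, -0.001])   # hypothetical weights for TV, Radio, Paper

x = np.array([230.1, 37.8, 69.2])          # one observation of the predictors

# f(X) = beta0 + beta1*X1 + beta2*X2 + ... + betap*Xp
y_hat = beta0 + betas @ x
print(y_hat)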



A Linear Model (Parametric)

The linear model is an example of a parametric model:

f(X) = β0 + β1X1 + β2X2 + ... + βpXp

• The model is specified in terms of p + 1 parameters: β0, β1, ..., βp.
• We estimate the parameters by fitting the model to training data.
• Simple Linear Regression: only one x variable.
• Multiple Linear Regression: many x variables.



A Linear Model

Y = β0 + β1X1 + β2X2 + ... + βpXp + ε

• β's: unknown constants, known as coefficients or parameters
• βj: the average effect on Y of a unit increase in Xj, holding all other predictors fixed
• ε is the error term; it captures measurement errors and missing variables
• ε is a random variable independent of X
• E(ε) = 0

In the advertising example, the model becomes

sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε

f is said to represent the systematic information that X provides about Y.



Regression Assumptions

1. E(ε) = 0
2. ε is normally distributed
3. Var(ε) is constant for all values of the independent variables (homoscedasticity)
4. The values of ε are independent (no serial correlation or autocorrelation)
5. There is no (or little) multicollinearity among the independent variables
6. The model adequately captures the relationship (see the diagnostic sketch below)
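A minimal sketch of checking two of these assumptions on the residuals of a fitted model, using scipy and statsmodels; the residuals here are simulated stand-ins, and in practice you would use model.resid from your fitted model.

import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# Stand-in residuals for illustration; replace with model.resid from a fitted OLS model
rng = np.random.default_rng(0)
resid = rng.normal(size=100)

# Assumption 2: normality of the errors (Shapiro-Wilk; a small p-value suggests non-normality)
stat, p_value = stats.shapiro(resid)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")

# Assumption 4: no autocorrelation (Durbin-Watson statistic close to 2 indicates independence)
print(f"Durbin-Watson: {durbin_watson(resid):.2f}")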



Effects of Multicollinearity on the Model

1. The standard errors of the estimates get inflated. p-values may get inflated because the
   t statistic is underestimated, so an independent variable that is statistically significant
   may appear statistically insignificant based on its p-value.
2. The sign of a regression coefficient may flip: instead of a negative value you might get a
   positive value, and vice versa.
3. Adding or removing a variable, or even a single observation, may cause large changes in the
   regression coefficient estimates.



Multicollinearity and VIF
• X1 and X2 may each be significant when included separately, but together the effect of both variables shrinks. Multicollinearity exists when there is correlation among the independent variables in a multiple regression model; this can adversely affect the regression results.
• Multicollinearity does not reduce the explanatory power of the model, but it does reduce the statistical significance of the independent variables.
• Test for multicollinearity: the Variance Inflation Factor (VIF); see the sketch below.
  • VIF equal to 1: the variable is not correlated with the others
  • VIF between 1 and 5: moderately correlated
  • VIF greater than 5: highly correlated

Solutions to multicollinearity
1. Drop unnecessary variables
2. Advanced techniques: Ridge / Lasso / Stepwise / Principal Components Regression
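A minimal sketch of computing VIFs with statsmodels; the small DataFrame below just reuses the six Advertising observations shown earlier, purely for illustration.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# The six Advertising observations from the earlier slides
df = pd.DataFrame({
    "TV":    [230.1, 44.5, 17.2, 151.5, 180.8, 8.7],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8, 48.9],
    "Paper": [69.2, 45.1, 69.3, 58.5, 58.4, 75.0],
})

X = sm.add_constant(df)              # VIFs are usually computed with an intercept column
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)                          # VIF > 5 flags a highly correlated predictor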



Homoscedasticity vs Heteroscedasticity
• Are the residuals spread equally along the range of the predictors?
• A residuals-vs-fitted plot should show a roughly horizontal line with equally spread points.
• In the second plot (not reproduced here), this is not the case: the variability (variance) of
  the residuals increases with the value of the fitted outcome variable, suggesting non-constant
  variance in the residual errors (heteroscedasticity).
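A minimal sketch of the residuals-vs-fitted plot described above, using simulated data whose noise grows with x so the funnel shape is visible (matplotlib and statsmodels assumed available).

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Tiny illustrative fit; in practice use your own X and y
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2 + 3 * x + rng.normal(scale=x, size=100)   # noise grows with x -> heteroscedastic

model = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(model.fittedvalues, model.resid, alpha=0.6)
plt.axhline(0, color="red", linewidth=1)        # residuals should scatter evenly around this line
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted: a funnel shape indicates heteroscedasticity")
plt.show()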



Types of Regression Models

Simple Regression: one predictor
  (Education) x  →  y (Income)

Multiple Regression: several predictors
  (Education) x1
  (Soft Skills) x2   →  y (Income)
  (Experience) x3
  (Age) x4



Direct Solution Method
Least Squares Method (Ordinary Least Squares or OLS)

• Slope of the estimated regression equation:

  b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

• y-intercept of the estimated regression equation:

  b0 = ȳ − b1·x̄

where:
  xi = value of the independent variable for the ith observation
  yi = value of the dependent variable for the ith observation
  x̄ = mean value of the independent variable
  ȳ = mean value of the dependent variable
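A minimal numpy sketch of these two formulas (one possible implementation, for illustration):

import numpy as np

def ols_fit(x, y):
    """Slope b1 and intercept b0 from the least-squares formulas above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1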



Exercise
Kumar's Electronics periodically has a special week-long sale. As part of the advertising
campaign, Kumar runs one or more TV commercials during the weekend preceding the sale.
Data from a sample of 5 previous sales are shown below.

# of TV Ads (x)   # of Cars Sold (y)
1                 14
3                 24
2                 18
1                 17
3                 27



Solution
       x     y     xi − x̄   yi − ȳ   (xi − x̄)(yi − ȳ)   (xi − x̄)²
       1     14      -1       -6             6               1
       3     24       1        4             4               1
       2     18       0       -2             0               0
       1     17      -1       -3             3               1
       3     27       1        7             7               1
Sum    10    100      0        0            20               4
Mean   2     20

• Slope of the estimated regression equation: b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 20 / 4 = 5
• y-intercept of the estimated regression equation: b0 = ȳ − b1·x̄ = 20 − 5 × 2 = 10
• Estimated regression equation: ŷ = b0 + b1x = 10 + 5x
• Prediction: if 5 ads are run, ŷ = 10 + 5 × 5 = 35 cars sold; if 15 ads are run, ŷ = 10 + 5 × 15 = 85 (an extrapolation beyond the observed data).
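A quick numpy check of the hand computation (illustrative):

import numpy as np

ads  = np.array([1, 3, 2, 1, 3], dtype=float)
cars = np.array([14, 24, 18, 17, 27], dtype=float)

# Least-squares slope and intercept from the OLS formulas
b1 = np.sum((ads - ads.mean()) * (cars - cars.mean())) / np.sum((ads - ads.mean()) ** 2)
b0 = cars.mean() - b1 * ads.mean()
print(b0, b1)                          # 10.0  5.0, matching the hand computation

# Predicted cars sold if 5 or 15 ads are run
print(b0 + b1 * np.array([5, 15]))     # [35. 85.]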
Evaluation of Regression Model



Goodness of Fit



Coefficient of Determination (R-squared)

• The proportion of the variance in the dependent variable that can be explained by the
  independent variable(s).
• R² = 1 − SSE / SST, where SSE is the sum of squared residuals and SST is the total sum of
  squares Σ(yi − ȳ)².
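A minimal sketch computing R² for the TV-ads exercise above, both from the definition and with scikit-learn's r2_score (scikit-learn assumed available):

import numpy as np
from sklearn.metrics import r2_score

ads  = np.array([1, 3, 2, 1, 3], dtype=float)
cars = np.array([14, 24, 18, 17, 27], dtype=float)
y_hat = 10 + 5 * ads                   # fitted values from the exercise: y = 10 + 5x

sse = np.sum((cars - y_hat) ** 2)      # residual sum of squares
sst = np.sum((cars - cars.mean()) ** 2)
print(1 - sse / sst)                   # R-squared from the definition
print(r2_score(cars, y_hat))           # same value via scikit-learn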



Exercise (already covered)

• Simple Linear Regression using Excel
• Multiple Linear Regression using Excel
• Multiple Linear Regression using statsmodels
• Multiple Linear Regression using scikit-learn



Qualitative Predictors

One-Hot Encoding (Dummy Variables); a minimal pandas sketch follows the steps.

1. Find the count of distinct values, say n, in the column.
2. Create n new columns, each named after one of the distinct values.
3. Encode 0 or 1 under those columns depending on the value in that observation.
4. Avoid this encoding when the count of distinct values is high.
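A minimal pandas sketch of the steps above; the column and city names are made up for illustration.

import pandas as pd

# Hypothetical data: 'City' is a qualitative predictor with 3 distinct values
df = pd.DataFrame({"City": ["Pilani", "Goa", "Hyderabad", "Goa"]})

# One column per distinct value, encoded 0/1 (dtype=int keeps 0/1 instead of booleans)
dummies = pd.get_dummies(df["City"], prefix="City", dtype=int)
print(dummies)

# Dropping one dummy column avoids perfect multicollinearity with the intercept
dummies_for_regression = pd.get_dummies(df["City"], prefix="City", drop_first=True, dtype=int)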



Label Encoding

1. Find the count of distinct values, say n, in the column.
2. Determine the natural order of the values.
3. Encode the values 0, 1, 2, ... based on that order (see the sketch below).
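A minimal pandas sketch, with an illustrative 'Education' column and an assumed ordering:

import pandas as pd

# Hypothetical ordered qualitative predictor
df = pd.DataFrame({"Education": ["School", "Masters", "Bachelors", "School"]})

# Step 2: decide the order; Step 3: encode as 0, 1, 2, ... following that order
order = ["School", "Bachelors", "Masters"]
df["Education_encoded"] = pd.Categorical(df["Education"], categories=order, ordered=True).codes
print(df)   # School -> 0, Bachelors -> 1, Masters -> 2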



Extensions of the Linear Model
Removing the additive assumption: interactions
• In our previous analysis (see the Jupyter notebook) of the Advertising data, we assumed that
  the effect on sales of increasing one advertising medium is independent of the amount spent on
  the other media.
• For example, the linear model

  sales = β0 + β1 × TV + β2 × radio + β3 × newspaper

  states that the average effect on sales of a one-unit increase in TV is always β1, regardless
  of the amount spent on radio.



Interactions — continued
• But suppose that spending money on radio advertising actually increases the effectiveness of
  TV advertising, so that the slope term for TV should increase as radio increases.
• In this situation, given a fixed budget of $100,000, spending half on radio and half on TV may
  increase sales more than allocating the entire amount to either TV or to radio.
• In marketing, this is known as a synergy effect; in statistics it is referred to as an
  interaction effect.
• Many real-world relationships are not purely additive; the effect of one variable on the
  outcome can depend on the level of another variable. Interaction terms allow the model to
  capture these complexities.
Interaction in the Advertising data?

[Figure: 3-D surface of Sales as a function of TV and Radio.]

When levels of either TV or radio are low, the true sales are lower than predicted by the linear
model. But when advertising is split between the two media, the model tends to underestimate
sales.



Modelling interactions: Advertising data

The model takes the form

sales = β0 + β1 × TV + β2 × radio + β3 × (radio × TV) + ε
      = β0 + (β1 + β3 × radio) × TV + β2 × radio + ε

Results:

             Coefficient   Std. Error   t-statistic   p-value
Intercept      6.7502        0.248         27.23      < 0.0001
TV             0.0191        0.002         12.70      < 0.0001
radio          0.0289        0.009          3.24        0.0014
TV×radio       0.0011        0.000         20.73      < 0.0001
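A minimal sketch of fitting this interaction model with the statsmodels formula API; the file name Advertising.csv and the column names are assumptions about how the data is stored.

import pandas as pd
import statsmodels.formula.api as smf

# Assumed file/column names for the Advertising data set
df = pd.read_csv("Advertising.csv")

# 'TV * radio' expands to TV + radio + TV:radio, i.e. both main effects plus the interaction
model = smf.ols("sales ~ TV * radio", data=df).fit()
print(model.summary())        # coefficient table like the one above
print(model.rsquared)         # R-squared of the interaction model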



Interpretation

• The results in this table suggest that interactions are important.
• The p-value for the interaction term TV×radio is extremely low, indicating strong evidence
  for HA: β3 ≠ 0.
• The R² for the interaction model is 96.8%, compared to only 89.7% for the model that predicts
  sales using TV and radio without an interaction term.



Interpretation

• This means that (96.8 − 89.7) / (100 − 89.7) = 69% of the variability in sales that remains
  after fitting the additive model has been explained by the interaction term.
• The coefficient estimates in the table suggest that an increase in TV advertising of $1,000 is
  associated with increased sales of (β̂1 + β̂3 × radio) × 1000 = 19 + 1.1 × radio units.
• An increase in radio advertising of $1,000 is associated with an increase in sales of
  (β̂2 + β̂3 × TV) × 1000 = 29 + 1.1 × TV units.



Hierarchy

• Sometimes an interaction term has a very small p-value, but the associated main effects (in
  this case, TV and radio) do not.
• The hierarchy principle: if we include an interaction in a model, we should also include the
  main effects, even if the p-values associated with their coefficients are not significant.



Hierarchy

• The rationale for this principle is that interactions are hard to interpret in a model without
  main effects; their meaning is changed.
• Specifically, the interaction terms also contain main effects if the model has no main-effect
  terms.



Non-linear effects of predictors

[Figure: polynomial regression of miles per gallon on horsepower for the Auto data, showing
linear, degree-2 and degree-5 fits over horsepower values from roughly 50 to 200.]



The figure suggests that

mpg = β0 + β1 × horsepower + β2 × horsepower² + ε

may provide a better fit.

             Coefficient   Std. Error   t-statistic   p-value
Intercept      56.9001       1.8004        31.6       < 0.0001
horsepower     -0.4662       0.0311       -15.0       < 0.0001
horsepower²     0.0012       0.0001        10.1       < 0.0001
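A minimal sketch of fitting this quadratic model with the statsmodels formula API; the file name Auto.csv and its column names are assumed.

import pandas as pd
import statsmodels.formula.api as smf

# Assumed file/column names for the Auto data set ('mpg' and 'horsepower' columns)
auto = pd.read_csv("Auto.csv")

# I(horsepower**2) adds the squared term; the model remains linear in the coefficients
quad = smf.ols("mpg ~ horsepower + I(horsepower**2)", data=auto).fit()
print(quad.params)            # intercept, linear and quadratic coefficients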

