3. Linear Regression


ZG 512

Supervised Learning
Linear Regression
Dr Arindam Roy
BITS Pilani, Pilani Campus
Types of Machine Learning



Regression

Predict the value of a given continuous-valued variable based on the values of other variables,
assuming a linear or nonlinear model of dependency.
Extensively studied in statistics and in the neural-networks field.
Examples:
◦ Predicting sales amounts of new product based on advertising expenditure.
◦ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
◦ Predicting price of a house based on its attributes



Variables

Sales is the Dependent Variable
• Also known as the Response or Target
• Generically referred to as Y

TV, Radio and Paper are the independent variables
• Also known as features, inputs, or predictors
• Generically referred to as X (or X1, X2, X3)

#   TV      Radio   Paper   Sales
1   230.1   37.8    69.2    22.1
2   44.5    39.3    45.1    10.4
3   17.2    45.9    69.3    9.3
4   151.5   41.3    58.5    18.5
5   180.8   10.8    58.4    12.9
6   8.7     48.9    75.0    7.2



Matrix X and Vector y

The Advertising data set (shown above) has 4 variables and 6 observations.
The variable names are "TV", "Radio", "Paper" and "Sales".

p = 3 (the number of independent variables)
n = 6 (the number of observations)

X represents the input data set; X is a 6 × 3 matrix.
y represents the output variable; y is a 6 × 1 vector.



Matrix X and Vector y (continued)

X is a 6 × 3 matrix (X6×3) and y is a 6 × 1 vector (y6×1).

Xi represents the ith observation; Xi is the row vector (xi1, xi2, ..., xip).
xj represents the jth variable; xj is the column vector (x1j, x2j, ..., xnj).
yi represents the ith observation of the output variable; y is the vector (y1, y2, ..., yn).



A Linear Model

The linear model is an important example of a parametric model:

f(X) = β0 + β1X1 + β2X2 + ... + βpXp

• A linear model is specified in terms of p + 1 parameters: β0, β1, ..., βp.
• We estimate the parameters by fitting the model to training data.
• Although it is almost never correct, a linear model often serves as a good and interpretable
  approximation to the unknown true function f(X).
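As a quick illustrative sketch (not from the slides): once the β's are known, a prediction is just the intercept plus a weighted sum of the inputs. The coefficient values below are made up for illustration, not fitted values.

import numpy as np

# Made-up coefficients for illustration only (beta0 = intercept, one beta per predictor)
beta0 = 2.9
betas = np.array([0.046, 0.189, -0.001])   # hypothetical weights for TV, Radio, Paper

x = np.array([230.1, 37.8, 69.2])          # one observation of the predictors

# f(X) = beta0 + beta1*X1 + beta2*X2 + ... + betap*Xp
y_hat = beta0 + betas @ x
print(y_hat)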



A Linear Model (Parametric)

The linear model is an example of a parametric model:

f(X) = β0 + β1X1 + β2X2 + ... + βpXp

• The model is specified in terms of p + 1 parameters: β0, β1, ..., βp.
• We estimate the parameters by fitting the model to training data.
• Simple Linear Regression: only one x variable.
• Multiple Linear Regression: many x variables.



A Linear Model

Y = β0 + β1X1 + β2X2 + ... + βpXp + ε

• β's: unknown constants, known as coefficients or parameters
• βj: the average effect on Y of a unit increase in Xj, holding all other predictors fixed
• ε is the error term; it captures measurement errors and missing variables
• ε is a random variable independent of X
• E(ε) = 0

In the advertising example, the model becomes

sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε

f is said to represent the systematic information that X provides about Y.



Regression Assumptions

1. E(ε) = 0
2. ε is normally distributed
3. Var(ε) is constant for all values of the independent variables (homoscedasticity)
4. The values of ε are independent (no serial correlation or autocorrelation)
5. There is no (or little) multicollinearity among the independent variables
6. The model adequately captures the relationship (see the diagnostic sketch below)
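A minimal sketch of checking two of these assumptions on the residuals of a fitted model, using scipy and statsmodels; the residuals here are simulated stand-ins, and in practice you would use model.resid from your fitted model.

import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# Stand-in residuals for illustration; replace with model.resid from a fitted OLS model
rng = np.random.default_rng(0)
resid = rng.normal(size=100)

# Assumption 2: normality of the errors (Shapiro-Wilk; a small p-value suggests non-normality)
stat, p_value = stats.shapiro(resid)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")

# Assumption 4: no autocorrelation (Durbin-Watson statistic close to 2 indicates independence)
print(f"Durbin-Watson: {durbin_watson(resid):.2f}")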



Effects of Multicollinearity on the Model

1. The standard errors of the estimates get inflated. p-values may get inflated because the
   t statistic is underestimated, so an independent variable that is statistically significant
   may appear statistically insignificant based on its p-value.
2. The sign of a regression coefficient may flip: instead of a negative value you might get a
   positive value, and vice versa.
3. Adding or removing a variable, or even a single observation, may cause large changes in the
   regression coefficient estimates.



Multicollinearity and VIF
• X1 and X2 may each be significant when included separately, but together the effect of both variables shrinks. Multicollinearity exists when there is correlation among the independent variables in a multiple regression model; this can adversely affect the regression results.
• Multicollinearity does not reduce the explanatory power of the model, but it does reduce the statistical significance of the independent variables.
• Test for multicollinearity: the Variance Inflation Factor (VIF); see the sketch below.
  • VIF equal to 1: the variable is not correlated with the others
  • VIF between 1 and 5: moderately correlated
  • VIF greater than 5: highly correlated

Solutions to multicollinearity
1. Drop unnecessary variables
2. Advanced techniques: Ridge / Lasso / Stepwise / Principal Components Regression
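A minimal sketch of computing VIFs with statsmodels; the small DataFrame below just reuses the six Advertising observations shown earlier, purely for illustration.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# The six Advertising observations from the earlier slides
df = pd.DataFrame({
    "TV":    [230.1, 44.5, 17.2, 151.5, 180.8, 8.7],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8, 48.9],
    "Paper": [69.2, 45.1, 69.3, 58.5, 58.4, 75.0],
})

X = sm.add_constant(df)              # VIFs are usually computed with an intercept column
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)                          # VIF > 5 flags a highly correlated predictor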



Homoscedasticity vs Heteroscedasticity
• Are the residuals spread equally along the range of the predictors?
• A residuals-vs-fitted plot should show a roughly horizontal line with equally spread points.
• In the second plot (not reproduced here), this is not the case: the variability (variance) of
  the residuals increases with the value of the fitted outcome variable, suggesting non-constant
  variance in the residual errors (heteroscedasticity).
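A minimal sketch of the residuals-vs-fitted plot described above, using simulated data whose noise grows with x so the funnel shape is visible (matplotlib and statsmodels assumed available).

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Tiny illustrative fit; in practice use your own X and y
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2 + 3 * x + rng.normal(scale=x, size=100)   # noise grows with x -> heteroscedastic

model = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(model.fittedvalues, model.resid, alpha=0.6)
plt.axhline(0, color="red", linewidth=1)        # residuals should scatter evenly around this line
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted: a funnel shape indicates heteroscedasticity")
plt.show()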



Types of Regression Models

Simple Regression: one predictor
  (Education) x  →  y (Income)

Multiple Regression: several predictors
  (Education) x1
  (Soft Skills) x2   →  y (Income)
  (Experience) x3
  (Age) x4



Direct Solution Method
Least Squares Method (Ordinary Least Squares or OLS)

• Slope of the estimated regression equation:

  b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

• y-intercept of the estimated regression equation:

  b0 = ȳ − b1·x̄

where:
  xi = value of the independent variable for the ith observation
  yi = value of the dependent variable for the ith observation
  x̄ = mean value of the independent variable
  ȳ = mean value of the dependent variable
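A minimal numpy sketch of these two formulas (one possible implementation, for illustration):

import numpy as np

def ols_fit(x, y):
    """Slope b1 and intercept b0 from the least-squares formulas above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1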



Exercise
Kumar's Electronics periodically has a special week-long sale. As part of the advertising
campaign, Kumar runs one or more TV commercials during the weekend preceding the sale.
Data from a sample of 5 previous sales are shown below.

# of TV Ads (x)   # of Cars Sold (y)
1                 14
3                 24
2                 18
1                 17
3                 27



Solution
       x     y     xi − x̄   yi − ȳ   (xi − x̄)(yi − ȳ)   (xi − x̄)²
       1     14      -1       -6             6               1
       3     24       1        4             4               1
       2     18       0       -2             0               0
       1     17      -1       -3             3               1
       3     27       1        7             7               1
Sum    10    100      0        0            20               4
Mean   2     20

• Slope of the estimated regression equation: b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 20 / 4 = 5
• y-intercept of the estimated regression equation: b0 = ȳ − b1·x̄ = 20 − 5 × 2 = 10
• Estimated regression equation: ŷ = b0 + b1x = 10 + 5x
• Prediction: if 5 ads are run, ŷ = 10 + 5 × 5 = 35 cars sold; if 15 ads are run, ŷ = 10 + 5 × 15 = 85 (an extrapolation beyond the observed data).
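A quick numpy check of the hand computation (illustrative):

import numpy as np

ads  = np.array([1, 3, 2, 1, 3], dtype=float)
cars = np.array([14, 24, 18, 17, 27], dtype=float)

# Least-squares slope and intercept from the OLS formulas
b1 = np.sum((ads - ads.mean()) * (cars - cars.mean())) / np.sum((ads - ads.mean()) ** 2)
b0 = cars.mean() - b1 * ads.mean()
print(b0, b1)                          # 10.0  5.0, matching the hand computation

# Predicted cars sold if 5 or 15 ads are run
print(b0 + b1 * np.array([5, 15]))     # [35. 85.]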
Evaluation of Regression Model



Goodness of Fit



Coefficient of Determination (R-squared)

• The proportion of the variance in the dependent variable that can be explained by the
  independent variable(s).
• R² = 1 − SSE / SST, where SSE is the sum of squared residuals and SST is the total sum of
  squares Σ(yi − ȳ)².
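A minimal sketch computing R² for the TV-ads exercise above, both from the definition and with scikit-learn's r2_score (scikit-learn assumed available):

import numpy as np
from sklearn.metrics import r2_score

ads  = np.array([1, 3, 2, 1, 3], dtype=float)
cars = np.array([14, 24, 18, 17, 27], dtype=float)
y_hat = 10 + 5 * ads                   # fitted values from the exercise: y = 10 + 5x

sse = np.sum((cars - y_hat) ** 2)      # residual sum of squares
sst = np.sum((cars - cars.mean()) ** 2)
print(1 - sse / sst)                   # R-squared from the definition
print(r2_score(cars, y_hat))           # same value via scikit-learn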



Exercise (already covered)

• Simple Linear Regression using Excel
• Multiple Linear Regression using Excel
• Multiple Linear Regression using statsmodels
• Multiple Linear Regression using scikit-learn



Qualitative Predictors

One-Hot Encoding (Dummy Variables); a minimal pandas sketch follows the steps.

1. Find the count of distinct values, say n, in the column.
2. Create n new columns, each named after one of the distinct values.
3. Encode 0 or 1 under those columns depending on the value in that observation.
4. Avoid this encoding when the count of distinct values is high.
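A minimal pandas sketch of the steps above; the column and city names are made up for illustration.

import pandas as pd

# Hypothetical data: 'City' is a qualitative predictor with 3 distinct values
df = pd.DataFrame({"City": ["Pilani", "Goa", "Hyderabad", "Goa"]})

# One column per distinct value, encoded 0/1 (dtype=int keeps 0/1 instead of booleans)
dummies = pd.get_dummies(df["City"], prefix="City", dtype=int)
print(dummies)

# Dropping one dummy column avoids perfect multicollinearity with the intercept
dummies_for_regression = pd.get_dummies(df["City"], prefix="City", drop_first=True, dtype=int)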



Label Encoding

1. Find the count of distinct values, say n, in the column.
2. Determine the natural order of the values.
3. Encode the values 0, 1, 2, ... based on that order (see the sketch below).
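A minimal pandas sketch, with an illustrative 'Education' column and an assumed ordering:

import pandas as pd

# Hypothetical ordered qualitative predictor
df = pd.DataFrame({"Education": ["School", "Masters", "Bachelors", "School"]})

# Step 2: decide the order; Step 3: encode as 0, 1, 2, ... following that order
order = ["School", "Bachelors", "Masters"]
df["Education_encoded"] = pd.Categorical(df["Education"], categories=order, ordered=True).codes
print(df)   # School -> 0, Bachelors -> 1, Masters -> 2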



Extensions of the Linear Model
Removing the additive assumption: interactions
• In our previous analysis (see the Jupyter notebook) of the Advertising data, we assumed that
  the effect on sales of increasing one advertising medium is independent of the amount spent on
  the other media.
• For example, the linear model

  sales = β0 + β1 × TV + β2 × radio + β3 × newspaper

  states that the average effect on sales of a one-unit increase in TV is always β1, regardless
  of the amount spent on radio.



Interactions — continued
• But suppose that spending money on radio advertising actually increases the effectiveness of
  TV advertising, so that the slope term for TV should increase as radio increases.
• In this situation, given a fixed budget of $100,000, spending half on radio and half on TV may
  increase sales more than allocating the entire amount to either TV or to radio.
• In marketing, this is known as a synergy effect; in statistics it is referred to as an
  interaction effect.
• Many real-world relationships are not purely additive; the effect of one variable on the
  outcome can depend on the level of another variable. Interaction terms allow the model to
  capture these complexities.
Interaction in the Advertising data?

[Figure: 3-D surface of Sales as a function of TV and Radio.]

When levels of either TV or radio are low, the true sales are lower than predicted by the linear
model. But when advertising is split between the two media, the model tends to underestimate
sales.



Modelling interactions: Advertising data

The model takes the form

sales = β0 + β1 × TV + β2 × radio + β3 × (radio × TV) + ε
      = β0 + (β1 + β3 × radio) × TV + β2 × radio + ε

Results:

             Coefficient   Std. Error   t-statistic   p-value
Intercept      6.7502        0.248         27.23      < 0.0001
TV             0.0191        0.002         12.70      < 0.0001
radio          0.0289        0.009          3.24        0.0014
TV×radio       0.0011        0.000         20.73      < 0.0001
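A minimal sketch of fitting this interaction model with the statsmodels formula API; the file name Advertising.csv and the column names are assumptions about how the data is stored.

import pandas as pd
import statsmodels.formula.api as smf

# Assumed file/column names for the Advertising data set
df = pd.read_csv("Advertising.csv")

# 'TV * radio' expands to TV + radio + TV:radio, i.e. both main effects plus the interaction
model = smf.ols("sales ~ TV * radio", data=df).fit()
print(model.summary())        # coefficient table like the one above
print(model.rsquared)         # R-squared of the interaction model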



Interpretation

• The results in this table suggest that interactions are important.
• The p-value for the interaction term TV×radio is extremely low, indicating strong evidence
  for HA: β3 ≠ 0.
• The R² for the interaction model is 96.8%, compared to only 89.7% for the model that predicts
  sales using TV and radio without an interaction term.



Interpretation

• This means that (96.8 − 89.7) / (100 − 89.7) = 69% of the variability in sales that remains
  after fitting the additive model has been explained by the interaction term.
• The coefficient estimates in the table suggest that an increase in TV advertising of $1,000 is
  associated with increased sales of (β̂1 + β̂3 × radio) × 1000 = 19 + 1.1 × radio units.
• An increase in radio advertising of $1,000 is associated with an increase in sales of
  (β̂2 + β̂3 × TV) × 1000 = 29 + 1.1 × TV units.



Hierarchy

• Sometimes an interaction term has a very small p-value, but the associated main effects (in
  this case, TV and radio) do not.
• The hierarchy principle: if we include an interaction in a model, we should also include the
  main effects, even if the p-values associated with their coefficients are not significant.



Hierarchy

• The rationale for this principle is that interactions are hard to interpret in a model without
  main effects; their meaning is changed.
• Specifically, the interaction terms also contain main effects if the model has no main-effect
  terms.



Non-linear effects of predictors

[Figure: polynomial regression of miles per gallon on horsepower for the Auto data, showing
linear, degree-2 and degree-5 fits over horsepower values from roughly 50 to 200.]



The figure suggests that

mpg = β0 + β1 × horsepower + β2 × horsepower² + ε

may provide a better fit.

             Coefficient   Std. Error   t-statistic   p-value
Intercept      56.9001       1.8004        31.6       < 0.0001
horsepower     -0.4662       0.0311       -15.0       < 0.0001
horsepower²     0.0012       0.0001        10.1       < 0.0001
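A minimal sketch of fitting this quadratic model with the statsmodels formula API; the file name Auto.csv and its column names are assumed.

import pandas as pd
import statsmodels.formula.api as smf

# Assumed file/column names for the Auto data set ('mpg' and 'horsepower' columns)
auto = pd.read_csv("Auto.csv")

# I(horsepower**2) adds the squared term; the model remains linear in the coefficients
quad = smf.ols("mpg ~ horsepower + I(horsepower**2)", data=auto).fit()
print(quad.params)            # intercept, linear and quadratic coefficients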

