3. Linear Regression
Supervised Learning
Linear Regression
BITS Pilani Dr Arindam Roy
Pilani Campus
Types of Machine Learning
The Advertising data set has 4 variables and 6 observations.
The variable names are “TV”, “Radio”, “Paper” and “Sales”.
p = 3 (the number of independent variables)
n = 6 (the number of observations)

#    TV      Radio   Paper   Sales
1    230.1   37.8    69.2    22.1
2    44.5    39.3    45.1    10.4
3    17.2    45.9    69.3    9.3
4    151.5   41.3    58.5    18.5
5    180.8   10.8    58.4    12.9
6    8.7     48.9    75      7.2
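
As a minimal sketch, the six observations can be stored in plain Python and n and p read off from the shape of the data. The variable names (`rows`, `X`, `y`) are illustrative choices, not part of the slides; the split assumes "Sales" is the dependent variable and the other three columns are the independent variables.

```python
# The six Advertising observations listed above.
# Column order: TV, Radio, Paper, Sales.
columns = ["TV", "Radio", "Paper", "Sales"]
rows = [
    (230.1, 37.8, 69.2, 22.1),
    (44.5, 39.3, 45.1, 10.4),
    (17.2, 45.9, 69.3, 9.3),
    (151.5, 41.3, 58.5, 18.5),
    (180.8, 10.8, 58.4, 12.9),
    (8.7, 48.9, 75.0, 7.2),
]

# Assumption: Sales is the dependent variable, the rest are independent.
X = [row[:3] for row in rows]   # TV, Radio, Paper
y = [row[3] for row in rows]    # Sales

n = len(rows)          # number of observations
p = len(columns) - 1   # number of independent variables

print(f"n = {n}, p = {p}")
```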
1. E(ε) = 0
2. ε is normally distributed
3. Var(ε) is constant for all values of the independent variables (Homoscedasticity)
4. The values of ε are independent (No Serial Correlation or Autocorrelation)
5. There is no (or little) multicollinearity among the independent variables
6. The model adequately captures the relationship
1. The standard errors of the regression coefficients get inflated. This shrinks
the t statistics, so the p-values get inflated. An IV that is statistically significant
might be shown as statistically insignificant according to its p-value.
2. The sign of a regression coefficient may be reversed. Instead of a negative value, you
might get a positive value, and vice versa.
3. Adding/removing a variable or even an observation may result in large
variation in the regression coefficient estimates.
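
One simple way to screen for the problem described above is to look at pairwise correlations among the independent variables. The sketch below does this for the TV, Radio and Paper columns of the six Advertising observations. The helper `pearson` and the dictionary names are illustrative, not from the slides. Pairwise correlation only detects collinearity between two variables at a time; a variance inflation factor is the more usual diagnostic when several variables are involved. With only six observations the numbers are indicative at best.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Independent variables from the six Advertising observations.
predictors = {
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8, 8.7],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8, 48.9],
    "Paper": [69.2, 45.1, 69.3, 58.5, 58.4, 75.0],
}

# Correlation for every pair of independent variables.
correlations = {
    (a, b): pearson(predictors[a], predictors[b])
    for a, b in combinations(predictors, 2)
}

for (a, b), r in correlations.items():
    print(f"{a} vs {b}: r = {r:.3f}")
```

A correlation close to +1 or -1 between two independent variables would suggest that one of them is a candidate for being dropped.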
Solutions to multicollinearity
1. Drop unnecessary variables
2. Advanced techniques: Ridge / Lasso / Stepwise / Principal Components Regression
x (Education), y (Income)
Multiple Regression
x1 (Education)
x3 (Experience)
x4 (Age)
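$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$
where:
$x_i$ = value of the independent variable for the ith observation
$y_i$ = value of the dependent variable for the ith observation
$\bar{x}$ = mean value of the independent variable
$\bar{y}$ = mean value of the dependent variable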
• Slope for the Estimated Regression Equation:
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{20}{4} = 5$$
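
The least-squares formulas for the slope and the intercept can be sketched in a few lines of Python. The function name `fit_simple_regression` is illustrative. The slides do not show the data behind the value 20/4, so the five observations below are hypothetical, chosen only so that the two sums come out as 20 and 4.

```python
def fit_simple_regression(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of cross-deviations; denominator: sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data (not from the slides), chosen so that
# sum((x - x_bar) * (y - y_bar)) = 20 and sum((x - x_bar) ** 2) = 4.
xs = [1, 3, 2, 1, 3]
ys = [14, 24, 18, 17, 27]

b0, b1 = fit_simple_regression(xs, ys)
print(f"b1 = {b1}, b0 = {b0}")
```

For this illustrative data the slope is 20/4 = 5, and with a mean of 2 for x and 20 for y the intercept is 20 - 5 × 2 = 10.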
[Figure: Sales plotted against TV and Radio]
Results:
             Coefficient   Std. Error   t-statistic   p-value
Intercept    6.7502        0.248        27.23         < 0.0001
TV           0.0191        0.002        12.70         < 0.0001
radio        0.0289        0.009        3.24          0.0014
TV×radio     0.0011        0.000        20.73         < 0.0001
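
Reading the table as a fitted model with an interaction term, sales = 6.7502 + 0.0191·TV + 0.0289·radio + 0.0011·TV·radio, the sketch below shows how a prediction would be computed and how the effect of TV changes with radio. The function names `predict_sales` and `tv_slope` are illustrative, and the input values in the usage lines are arbitrary, not taken from the slides.

```python
# Coefficients copied from the results table above.
INTERCEPT = 6.7502
B_TV = 0.0191
B_RADIO = 0.0289
B_TV_RADIO = 0.0011  # interaction term TV x radio


def predict_sales(tv, radio):
    """Predicted sales for given TV and radio values under the interaction model."""
    return INTERCEPT + B_TV * tv + B_RADIO * radio + B_TV_RADIO * tv * radio


def tv_slope(radio):
    """Change in predicted sales per one-unit increase in TV, at a fixed radio value."""
    return B_TV + B_TV_RADIO * radio


# Arbitrary illustrative inputs.
print(predict_sales(100, 10))
print(tv_slope(0), tv_slope(10))
```

Because of the interaction term, the slope with respect to TV is not a single number: it grows by 0.0011 for every one-unit increase in radio.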
[Figure: Polynomial regression on Auto data. Miles per gallon versus Horsepower, with linear, degree 2 and degree 5 fits.]