Lecture 4


Linear Regression
The Model

The model has a deterministic component and a probabilistic component.

[Figure: house cost (vertical axis) plotted against house size (horizontal axis).]

Most lots sell for $25,000, and building a house costs about $75 per square foot:

House cost = 25000 + 75(Size)
However, house costs vary even among houses of the same size! Since cost behaves unpredictably, we add a random component ε:

House cost = 25000 + 75(Size) + ε

[Figure: house cost against house size, with individual costs scattered around the line.]
• The first order linear model

Y = β0 + β1X + ε

Y = dependent variable
X = independent variable
β0 = Y-intercept
β1 = slope of the line
ε = error variable

β0 and β1 are unknown population parameters and are therefore estimated from the data.

[Figure: a line in the X-Y plane with Y-intercept β0 and slope β1 = Rise/Run.]
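The two components of the model can be separated in a short simulation. This is only a sketch built on the house cost example: the intercept 25000 and the slope 75 come from the slide, while the error standard deviation `SIGMA`, the house size and the seed are hypothetical values chosen for illustration.

```python
import random

BETA0 = 25000.0   # Y-intercept: lot price, from the slide's example
BETA1 = 75.0      # slope: building cost per square foot, from the slide's example
SIGMA = 10000.0   # hypothetical standard deviation of the error (not from the slides)

def deterministic_cost(size):
    """Deterministic component: beta0 + beta1 * size."""
    return BETA0 + BETA1 * size

def simulated_cost(size, rng):
    """Full model: deterministic component plus a normal error with mean zero."""
    return deterministic_cost(size) + rng.gauss(0.0, SIGMA)

rng = random.Random(0)   # fixed seed so the run is repeatable
size = 2000              # hypothetical house size in square feet
print(deterministic_cost(size))   # 175000.0
for _ in range(3):
    # houses of the same size still come out with different costs
    print(round(simulated_cost(size, rng)))
```

Every simulated house has the same size, so all of the spread in the printed costs comes from ε.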
Estimating the Coefficients

• The estimates are determined by
– drawing a sample from the population of interest,
– calculating sample statistics,
– producing a straight line that cuts into the data.

• Question: What should be considered a good line?

[Figure: scatter of sample points in the X-Y plane.]
The Least Squares (Regression) Line

A good line is one that minimizes the sum of squared differences between the points and the line.
Let us compare two lines drawn through the points (1, 2), (2, 4), (3, 1.5) and (4, 3.2).

First line (heights 1, 2, 3, 4 at X = 1, 2, 3, 4):
Sum of squared differences = (2 - 1)² + (4 - 2)² + (1.5 - 3)² + (3.2 - 4)² = 7.89

Second line (horizontal, at height 2.5):
Sum of squared differences = (2 - 2.5)² + (4 - 2.5)² + (1.5 - 2.5)² + (3.2 - 2.5)² = 3.99

The smaller the sum of squared differences, the better the fit of the line to the data.
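The comparison of the two lines can be checked numerically. The sketch below assumes the first line is Y = X, which matches the heights 1, 2, 3, 4 used in the sum, and takes the horizontal line at 2.5; the function names are illustrative.

```python
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]   # the four points from the figure

def sum_sq_diff(line, pts):
    """Sum of squared vertical differences between the points and a line."""
    return sum((y - line(x)) ** 2 for x, y in pts)

def first_line(x):
    return x          # assumed form: heights 1, 2, 3, 4 at X = 1, 2, 3, 4

def second_line(x):
    return 2.5        # the horizontal line

ssd_first = sum_sq_diff(first_line, points)
ssd_second = sum_sq_diff(second_line, points)
print(round(ssd_first, 2), round(ssd_second, 2))   # 7.89 3.99
```

By this criterion the horizontal line fits these four points better than the first line does.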
The Estimated Coefficients

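To calculate the estimates of the line coefficients that minimize the sum of squared differences between the data points and the line, use the formulas:

b1 = s_xy / s_x²
b0 = ȳ − b1·x̄

where s_xy is the sample covariance of X and Y, s_x² is the sample variance of X, and x̄ and ȳ are the sample means.

The regression equation that estimates the equation of the first order linear model is:

ŷ = b0 + b1·x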
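A minimal sketch of the calculation, using the standard least squares formulas b1 = cov(X, Y) / var(X) and b0 = mean(Y) − b1·mean(X). The four data points are borrowed from the two-line comparison and serve only as an illustration; `least_squares` is a hypothetical helper name.

```python
from statistics import mean, variance

def least_squares(xs, ys):
    """Return (b0, b1) for the line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # sample covariance of X and Y (divisor n - 1)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    b1 = s_xy / variance(xs)    # variance() is the sample variance (divisor n - 1)
    b0 = y_bar - b1 * x_bar
    return b0, b1

xs = [1, 2, 3, 4]        # illustrative data
ys = [2, 4, 1.5, 3.2]
b0, b1 = least_squares(xs, ys)
print(round(b0, 3), round(b1, 3))
```

For these points the fitted line is approximately ŷ = 2.4 + 0.11x.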
The Simple Linear Regression Line

• A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
• A random sample of 100 cars is selected, and the data recorded.
• Find the regression line.

Independent variable X: odometer reading. Dependent variable Y: selling price.

• Solution
– Solving by hand: calculate a number of statistics, where n = 100.
Error Variable: Required Conditions
• The error ε is a critical part of the regression model.
• Four requirements involving the distribution of ε must be satisfied.
– The probability distribution of ε is normal.
– The mean of ε is zero: E(ε) = 0.
– The standard deviation of ε is σε, the same for all values of X.
– The errors associated with different values of Y are all independent.
Assessing the Model

• The least squares method produces a regression line whether or not there is a linear relationship between X and Y.
• Consequently, it is important to assess how well the linear model fits the data.
• Several methods are used to assess the model. All are based on the sum of squares for errors, SSE.
Sum of Squares for Errors
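• This is the sum of squared differences between the points and the regression line.
• It can serve as a measure of how well the line fits the data. SSE is defined by

SSE = Σ (yi − ŷi)²

– A shortcut formula

SSE = (n − 1)(s_y² − s_xy² / s_x²)

where s_x² and s_y² are the sample variances of X and Y, and s_xy is their sample covariance.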
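The definition and the shortcut can be checked against each other. The sketch below assumes the standard forms SSE = Σ(yi − ŷi)² and SSE = (n − 1)(s_y² − s_xy²/s_x²), and uses four illustrative data points rather than the lecture's data set.

```python
from statistics import mean, variance

xs = [1, 2, 3, 4]        # illustrative data, not the lecture's data set
ys = [2, 4, 1.5, 3.2]
n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
s_x2, s_y2 = variance(xs), variance(ys)   # sample variances (divisor n - 1)

b1 = s_xy / s_x2             # least squares slope
b0 = y_bar - b1 * x_bar      # least squares intercept

# Definition: sum of squared residuals about the fitted line
sse_direct = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
# Shortcut: needs only the sample variances and the covariance
sse_shortcut = (n - 1) * (s_y2 - s_xy ** 2 / s_x2)
print(round(sse_direct, 4), round(sse_shortcut, 4))
```

Both routes give about 3.807 for these points. The shortcut is convenient by hand because it avoids computing every residual.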
Standard Error of Estimate
• The mean error is equal to zero.
• If σε is small, the errors tend to be close to zero (close to the mean error). Then the model fits the data well.
• Therefore, we can use σε as a measure of the suitability of using a linear model.
• An estimator of σε is given by sε:

sε = √(SSE / (n − 2))
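A short sketch of the estimator, assuming the usual form sε = √(SSE / (n − 2)). The data points are illustrative, and the coefficients 2.4 and 0.11 are the least squares estimates for those four points; the function name is hypothetical.

```python
import math

def standard_error_of_estimate(xs, ys, b0, b1):
    """s_eps = sqrt(SSE / (n - 2)) for the fitted line y-hat = b0 + b1 * x."""
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))

xs = [1, 2, 3, 4]        # illustrative data
ys = [2, 4, 1.5, 3.2]
b0, b1 = 2.4, 0.11       # least squares estimates for this toy data set
s_eps = standard_error_of_estimate(xs, ys, b0, b1)
y_bar = sum(ys) / len(ys)
print(round(s_eps, 3), round(s_eps / y_bar, 3))   # s_eps and its size relative to mean(Y)
```

The divisor is n − 2 because two coefficients, b0 and b1, are estimated from the same data.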
• Example
Calculate the standard error of estimate for the previous example, and describe what it tells you about the model fit.
• Solution
Substitute the statistics calculated before into the formula for sε.

It is hard to assess the model based on sε, even when compared with the mean value of Y.
Testing the Slope
• When no linear relationship exists between two variables, the regression
line should be horizontal.

[Figure: two scatter plots.
Linear relationship: different inputs (X) yield different outputs (Y). The slope is not equal to zero.
No linear relationship: different inputs (X) yield the same output (Y). The slope is equal to zero.]
• We can draw inference about β1 from b1 by testing
H0: β1 = 0
H1: β1 ≠ 0 (or < 0, or > 0)
• The test statistic is

t = (b1 − β1) / s_b1, where s_b1 = sε / √((n − 1) s_x²)

is the standard error of b1.
• If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n − 2.
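The test statistic can be computed step by step. This sketch assumes the standard forms t = (b1 − β1) / s_b1 and s_b1 = sε / √((n − 1) s_x²); the function name, its `beta1_null` parameter and the data are illustrative.

```python
import math
from statistics import mean, variance

def slope_t_statistic(xs, ys, beta1_null=0.0):
    """Return (t, degrees of freedom) for testing H0: beta1 = beta1_null."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x2 = variance(xs)                      # sample variance of X
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    b1 = s_xy / s_x2                         # least squares slope
    b0 = y_bar - b1 * x_bar                  # least squares intercept
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_eps = math.sqrt(sse / (n - 2))         # standard error of estimate
    s_b1 = s_eps / math.sqrt((n - 1) * s_x2) # standard error of b1
    return (b1 - beta1_null) / s_b1, n - 2

t, df = slope_t_statistic([1, 2, 3, 4], [2, 4, 1.5, 3.2])
print(round(t, 3), df)
```

The resulting t is then compared with a critical value of the Student t distribution with n − 2 degrees of freedom, taken from a table or a statistics package. The Python standard library does not provide t quantiles, so that step is left out here.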
• To understand the significance of the coefficient of determination, R², note:

[Figure: two data points (X1, Y1) and (X2, Y2) of a certain sample are shown, with the overall variability in Y split into the part captured by the regression model and the error.]

Total variation in Y = Variation explained by the regression line + Unexplained variation (error)

Variation in Y = SSR + SSE
• R² measures the proportion of the variation in Y that is explained by the variation in X.

• R² takes on any value between zero and one.
– R² = 1: Perfect match between the line and the data points.
– R² = 0: There is no linear relationship between X and Y.
Find the coefficient of determination; what does this statistic tell you
about the model?
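The decomposition of the variation in Y can be turned into a small calculation. The sketch assumes R² = SSR / (total variation in Y), with SSR = total variation − SSE; the function name and both data sets are illustrative and are not the lecture's example.

```python
from statistics import mean

def r_squared(xs, ys):
    """Coefficient of determination of the least squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)      # total variation in Y
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    ssr = ss_yy - sse                              # explained by the line
    return ssr / ss_yy

print(round(r_squared([1, 2, 3, 4], [3, 5, 7, 9]), 3))       # points exactly on a line
print(round(r_squared([1, 2, 3, 4], [2, 4, 1.5, 3.2]), 3))   # widely scattered points
```

The first data set lies exactly on a line and gives R² = 1. The second gives a value close to zero, meaning the line explains almost none of the variation in Y.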
