Simple Regression Model: Erbil Technology Institute


Simple Regression Model

Erbil Technology Institute

Subject: Statistics

Supervised By: Dr. Mahdi S. Raza

Prepared by: Hevin Sahdulla Jamil

INTRODUCTION

It is general statistical practice to report standard errors associated with estimated model parameters. Typically, model parameters are estimated from a single data source, but there are many examples in which two or more sources are used. A common example occurs in chemistry and physics applications, where nonlinear models are defined in terms of well-known constants (e.g., Planck's constant and Faraday's constant) whose estimates have been previously established. The uncertainties associated with such physical constants are generally very small. However, in other applications some of the model parameters must be estimated from external data sources, and the uncertainties associated with these estimates can be quite large. We shall call the parameters estimated by external data sets "input" parameters. In routine applications, the variability associated with the estimation of input parameters is often ignored.

Regression model

In practice, researchers first select a model they would like to estimate and then use their chosen
method (e.g., ordinary least squares) to estimate the parameters of that model. Regression models
involve the following components (a small simulation sketch follows the list):

• The unknown parameters, often denoted as a scalar or vector (commonly written β).

• The independent variables, which are observed in the data and are often denoted as a vector Xi (where i denotes a row of data).

• The dependent variable, which is observed in the data and is often denoted using the scalar Yi.

• The error terms, which are not directly observed in the data and are often denoted using the scalar ei.
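
To make these components concrete, here is a minimal simulation sketch in R; the parameter values, sample size, and variable names below are illustrative assumptions, not values taken from the text.

# Simulate yi = b0 + b1*xi + ei and recover the parameters (all values assumed)
set.seed(1)                        # reproducibility
n  <- 100                          # sample size (assumed)
b0 <- 2; b1 <- 0.5                 # "true" unknown parameters (assumed)
x  <- runif(n, 0, 10)              # independent variable: observed
e  <- rnorm(n, mean = 0, sd = 1)   # error terms: not directly observed
y  <- b0 + b1 * x + e              # dependent variable: observed
fit <- lm(y ~ x)                   # ordinary least squares estimates of b0 and b1
coef(fit)                          # estimated intercept and slope, close to 2 and 0.5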

What is Simple Linear Regression?

Simple linear regression is a statistical method that allows us to summarize and study relationships
between two continuous (quantitative) variables:

• One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.

• The other variable, denoted y, is regarded as the response, outcome, or dependent variable.

Because the other terms are used less frequently today, we'll use the "predictor" and "response" terms
to refer to the variables encountered in this course. The other terms are mentioned only to make you
aware of them should you encounter them. Simple linear regression gets its adjective "simple," because
it concerns the study of only one predictor variable. In contrast, multiple linear regression, which we
study later in this course, gets its adjective "multiple," because it concerns the study of two or more
predictor variables.

Types of relationships

Before proceeding, we must clarify what types of relationships we won't study in this course,
namely, deterministic (or functional) relationships. Here is an example of a deterministic relationship.
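
A familiar illustration is the relationship between temperature in degrees Fahrenheit (F) and degrees Celsius (C): F = (9/5)C + 32. Given a value of C, the value of F is determined exactly, and every (C, F) pair falls precisely on the line, with no scatter. Statistical relationships, by contrast, show an overall trend together with scatter of the response values around that trend, and it is these relationships that simple linear regression addresses.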

The Multiple Regression Model

The multiple regression model extends the basic concept of the simple regression model discussed in Chapters 4 and 5. A multiple regression model enables us to estimate the effect on Yi of changing a regressor X1i if the remaining regressors X2i, X3i, …, Xki do not vary. In fact, we have already performed estimation of the multiple regression model using R in the previous section. The interpretation of the coefficient on the student-teacher ratio is the effect on test scores of a one-unit change in the student-teacher ratio if the percentage of English learners is kept constant.

Just like in the simple regression model, we assume the true relationship between Yi and X1i, X2i, …, Xki to be linear. On average, this relation is given by the population regression function

E(Yi | X1i = x1, X2i = x2, X3i = x3, …, Xki = xk) = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βkxk.   (6.3)

As in the simple regression model, the relation Yi = β0 + β1X1i + β2X2i + β3X3i + ⋯ + βkXki does not hold exactly, since there are disturbing influences on the dependent variable Yi that we cannot observe as explanatory variables. Therefore we add an error term ui, which represents deviations of the observations from the population regression line, to (6.3). This yields the population multiple regression model

Yi = β0 + β1X1i + β2X2i + β3X3i + ⋯ + βkXki + ui,   i = 1, …, n.
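
For reference, a minimal sketch of how an object like mult.mod could have been created is shown below; the data set and column names (CASchools, score, STR, english) are assumptions inferred from the context of the source text, not stated in this document.

# Hypothetical fit behind the coefficient table shown below (names assumed):
# CASchools: data frame with a test score, the student-teacher ratio (STR),
# and the percentage of English learners (english)
mult.mod <- lm(score ~ STR + english, data = CASchools)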

summary(mult.mod)$coef
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 686.0322445 7.41131160 92.565565 3.871327e-280
#> STR -1.1012956 0.38027827 -2.896026 3.978059e-03
#> english -0.6497768 0.03934254 -16.515882 1.657448e-47

Simple Linear Regression Model & Interpretation

• Regression model
• Regression line

Example: Relationship between diesel oil consumption rates measured by two methods

x = rate measured by the drain-weigh method

y = rate measured by the CI-trace method

LS Estimates of Model Parameters

Least squares (LS) estimation

– estimates the regression parameters by minimizing the SSE (sum of squared errors), as sketched below

– The resulting line is called the regression line
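
A minimal sketch of this calculation in R, using made-up x and y values (the diesel-oil data themselves are not given in the text):

# Least-squares estimates from their closed-form expressions (x, y values assumed)
x <- c(1.2, 2.1, 3.4, 4.0, 5.3)
y <- c(1.0, 2.3, 3.1, 4.4, 5.0)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope estimate
b0 <- mean(y) - b1 * mean(x)                                      # intercept estimate
c(b0, b1)
coef(lm(y ~ x))   # the same estimates from R's built-in least-squares fit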

Inference in Linear Regression

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. Every value of the independent variable x is associated with a value of the dependent variable y. The variable y is assumed to be normally distributed with mean μy and variance σ². The least-squares regression line y = b0 + b1x is an estimate of the true population regression line, μy = β0 + β1x. This line describes how the mean response μy changes with x. The observed values for y vary about their means μy and are assumed to have the same standard deviation σ. The fitted values b0 and b1 estimate the true intercept and slope of the population regression line.

Since the observed values for y vary about their means μy, the statistical model includes a term for this variation. In words, the model is expressed as DATA = FIT + RESIDUAL, where the "FIT" term represents the expression β0 + β1x. The "RESIDUAL" term represents the deviations of the observed values y from their means μy, which are normally distributed with mean 0 and variance σ². The notation for the model deviations is ε.

In formal terms, the model for linear regression is the following:

Given n pairs of observations (x1, y1), (x2, y2), ..., (xn, yn), the observed response is yi = β0 + β1xi + εi.

In the least-squares model, the best-fitting line for the observed data is calculated by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies on the fitted line exactly, then its vertical deviation is 0). Because the deviations are first squared, then summed, there are no cancellations between positive and negative values. The least-squares estimates b0 and b1 are usually computed by statistical software. They are expressed by the following equations:

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

b0 = ȳ − b1x̄
The computed values for b0 and b1 are unbiased estimators of β0 and β1, and are normally distributed with standard deviations that may be estimated from the data.

The values fit by the equation b0 + b1xi are denoted ŷi, and the residuals ei are equal to yi − ŷi, the difference between the observed and fitted values. The sum of the residuals is equal to zero.

The variance σ² may be estimated by s² = Σei² / (n − 2), also known as the mean squared error (or MSE). The estimate s of the standard deviation σ is the square root of the MSE.
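
The sketch below checks these facts numerically on assumed data values: the residuals from a least-squares fit sum to (essentially) zero, and s² = Σei²/(n − 2) matches the squared residual standard error reported by R.

# Numerical check of the residual and MSE facts above (data values assumed)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 6.9)
fit <- lm(y ~ x)
e <- resid(fit)                 # residuals ei = yi - yhat_i
sum(e)                          # essentially zero (up to rounding error)
n  <- length(y)
s2 <- sum(e^2) / (n - 2)        # mean squared error, s^2
sqrt(s2) - summary(fit)$sigma   # essentially zero: matches R's residual standard error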

History

The earliest form of regression was the method of least squares, which was published by Legendre in
1805, and by Gauss in 1809. Legendre and Gauss both applied the method to the problem of
determining, from astronomical observations, the orbits of bodies about the Sun (mostly comets, but
also later the then newly discovered minor planets). Gauss published a further development of the
theory of least squares in 1821, including a version of the Gauss–Markov theorem.

The term "regression" was coined by Francis Galton in the nineteenth century to describe a biological
phenomenon. The phenomenon was that the heights of descendants of tall ancestors tend to regress
down towards a normal average (a phenomenon also known as regression toward the mean). For
Galton, regression had only this biological meaning, but his work was later extended by Udny
Yule and Karl Pearson to a more general statistical context.[11][12] In the work of Yule and Pearson,
the joint distribution of the response and explanatory variables is assumed to be Gaussian. This
assumption was weakened by R.A. Fisher in his works of 1922 and 1925. Fisher assumed that
the conditional distribution of the response variable is Gaussian, but the joint distribution need not be.
In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.

In the 1950s and 1960s, economists used electromechanical desk "calculators" to calculate regressions.
Before 1970, it sometimes took up to 24 hours to receive the result from one regression.

References

1. Necessary Condition Analysis

2. David A. Freedman (27 April 2009). Statistical Models: Theory and Practice. Cambridge
University Press. ISBN 978-1-139-47731-4.

3. R. Dennis Cook; Sanford Weisberg. Criticism and Influence Analysis in Regression, Sociological Methodology, Vol. 13 (1982), pp. 313–361.

4. A.M. Legendre. Nouvelles méthodes pour la détermination des orbites des comètes, Firmin
Didot, Paris, 1805. “Sur la Méthode des moindres quarrés” appears as an appendix.

5. Chapter 1 of: Angrist, J. D., & Pischke, J. S. (2008). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.
