Ma724 - 38
If the variables X and Y in a bivariate distribution are related we will find that the points
in the scatter diagram will cluster around some curve called the Curve of Regression. If
the curve is a straight line then it is called the Line of Regression and there is said to be
a linear regression between the variables. The line of regression is the line which gives
the best estimate to the value of one variable for any specific value of the other. In fact,
there are two such lines: one giving the best possible mean values of Y for each specified
value of X, and the other giving the best possible mean values of X for each specified value
of Y. The former is known as the line of regression of Y on X and the latter as the line of
regression of X on Y. Thus the line of regression is the line of best fit and is obtained
using the principle of least squares.
Note: The principle of least squares consists in minimizing the sum of squares of the
deviations of the actual values of Y from their estimated values as given by the line of
best fit.
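For example, if the line of best fit is Y = a + bX, the principle requires minimizing
S = ∑(Yi − a − bXi)² over a and b. Setting ∂S/∂a = 0 and ∂S/∂b = 0 gives the normal
equations ∑Y = na + b∑X and ∑XY = a∑X + b∑X², whose solution yields the line (A) below.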
Thus, the line of regression of Y on X is given by:

    (Y − Ȳ) = r (σY / σX) (X − X̄)                                   (A)

and, similarly, the line of regression of X on Y is given by:

    (X − X̄) = r (σX / σY) (Y − Ȳ)                                   (B)

where r is the sample correlation coefficient and σX and σY are the standard deviations
of X and Y respectively.
We denote the factors:

    bYX = r (σY / σX), called the regression coefficient of Y on X,
    bXY = r (σX / σY), called the regression coefficient of X on Y.
NOTE:
1. Whenever we have to estimate Y for a given value of X, i.e., when Y is dependent and
   X is independent, we use equation (A); otherwise we use equation (B).
2. In the case of perfect correlation i.e., r = ±1, the two lines of regression
coincide. Thus we have only one line.
3. The correlation coefficient is the geometric mean of the two regression coefficients,
   i.e., r = ±√(bXY · bYX), the sign of r being the common sign of the two regression
   coefficients.
4. Both the lines of regression pass through the point (X̄, Ȳ) of sample means.
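As a quick illustration (not part of the original notes), the sketch below computes the
two regression coefficients and the correlation coefficient from paired observations using
the formulas above; the data shown are arbitrary.

import numpy as np

def regression_summary(x, y):
    """Return bYX, bXY, r and the sample means for paired data."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    r = np.corrcoef(x, y)[0, 1]          # sample correlation coefficient
    sx, sy = x.std(), y.std()            # standard deviations of X and Y
    byx = r * sy / sx                    # regression coefficient of Y on X
    bxy = r * sx / sy                    # regression coefficient of X on Y
    return byx, bxy, r, x.mean(), y.mean()

# Arbitrary illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
byx, bxy, r, xbar, ybar = regression_summary(x, y)
# Line of regression of Y on X:  Y - ybar = byx * (X - xbar)   ...(A)
# Line of regression of X on Y:  X - xbar = bxy * (Y - ybar)   ...(B)
# Check: r equals ±sqrt(byx * bxy)
print(byx, bxy, r, np.sqrt(byx * bxy))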
Problems
1. Obtain the line of regression of Y on X for the following data and estimate the most
probable value of Y when X is 70.
Item No x y
1 40 2.5
2 70 6.0
3 50 4.5
4 60 5.0
5 80 4.5
6 50 2.0
7 90 5.5
8 40 3.0
9 60 4.5
10 60 3.0
Solution:
The line of regression of Y on X is given by:

    (Y − Ȳ) = r (σY / σX) (X − X̄)
Item No    X      Y     dx = X − 60   dy = Y − 4.5    dx²    dx·dy
   1       40     2.5       −20           −2.0         400      40
   2       70     6.0        10            1.5         100      15
   3       50     4.5       −10            0.0         100       0
   4       60     5.0         0            0.5           0       0
   5       80     4.5        20            0.0         400       0
   6       50     2.0       −10           −2.5         100      25
   7       90     5.5        30            1.0         900      30
   8       40     3.0       −20           −1.5         400      30
   9       60     4.5         0            0.0           0       0
  10       60     3.0         0           −1.5           0       0
                        ∑dx = 0      ∑dy = −4.5   ∑dx² = 2400   ∑dx·dy = 140
With A = 60 and B = 4.5 we have,

    x̄ = A + (∑dx)/n = 60 + 0 = 60   and   ȳ = B + (∑dy)/n = 4.5 + (−4.5/10) = 4.05

    σx² = (∑dx²)/n − (∑dx/n)²   and   σy² = (∑dy²)/n − (∑dy/n)²

    r = [n ∑dx·dy − (∑dx)(∑dy)] / √{[n ∑dx² − (∑dx)²] [n ∑dy² − (∑dy)²]}

so that the regression coefficient of Y on X is

    r (σy/σx) = [∑dx·dy − (∑dx)(∑dy)/n] / [∑dx² − (∑dx)²/n]
              = (140 − 0) / (2400 − 0)
              = 140/2400 ≈ 0.0583
Thus, the required line of regression of Y on X is:

    Y − 4.05 = 0.0583 (X − 60),   i.e.,   Y = 0.0583 X + 0.55

and the most probable value of Y when X = 70 is Y ≈ 0.0583 × 70 + 0.55 ≈ 4.63.
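As a quick cross-check (a sketch, not part of the original solution), the following
computes the same quantities directly from the data using deviations from the assumed
means A = 60 and B = 4.5:

import numpy as np

x = np.array([40, 70, 50, 60, 80, 50, 90, 40, 60, 60], dtype=float)
y = np.array([2.5, 6.0, 4.5, 5.0, 4.5, 2.0, 5.5, 3.0, 4.5, 3.0])
dx, dy = x - 60, y - 4.5
n = len(x)

# bYX = [∑dx·dy − (∑dx)(∑dy)/n] / [∑dx² − (∑dx)²/n]
byx = (dx @ dy - dx.sum() * dy.sum() / n) / (dx @ dx - dx.sum() ** 2 / n)
xbar, ybar = x.mean(), y.mean()
print(xbar, ybar, byx)            # 60.0, 4.05, ~0.0583
print(ybar + byx * (70 - xbar))   # estimate of Y at X = 70, ~4.63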
3. The regression equations of two variables X and Y are X=0.7Y+5.2 and Y=0.3X+2.8.
Find the means of the variables and the correlation coefficient.
Solution:
Since both lines of regression pass through the point (X̄, Ȳ), we have

    X̄ = 0.7 Ȳ + 5.2   and   Ȳ = 0.3 X̄ + 2.8

Solving these simultaneously we get,

    X̄ = 9.06 and Ȳ = 5.518
Now, regression coefficient of Y on X is: byx =0.3
regression coefficient of X on Y is: bxy =0.7
∴ r = √(byx · bxy) = √(0.3 × 0.7) = √0.21 ≈ 0.46
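A small sketch (not part of the original solution) that reproduces this result by solving
the two regression equations as a linear system and applying r = √(byx · bxy):

import numpy as np

# X = 0.7 Y + 5.2 and Y = 0.3 X + 2.8, rewritten as
#  1.0·X − 0.7·Y = 5.2
# −0.3·X + 1.0·Y = 2.8
A = np.array([[1.0, -0.7], [-0.3, 1.0]])
b = np.array([5.2, 2.8])
xbar, ybar = np.linalg.solve(A, b)   # point of intersection = (X̄, Ȳ)
r = np.sqrt(0.3 * 0.7)               # positive sign: both coefficients are positive
print(xbar, ybar, r)                 # ~9.06, ~5.52, ~0.458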
Height of father x:  165  160  170  163  173  158  178  168  173  170  175  180
Height of son y:     173  168  173  165  175  168  173  165  180  170  173  178

Obtain the two regression lines and hence obtain r, the correlation coefficient.
Solve it!
Multiple-Linear Regression:
In many real-life situations the dependent variable Y can be predicted more adequately
using more than one independent variable, say X1, X2, X3, …, Xk. Typically one may have a
relationship of the type:

    Y = a + b1 X1 + b2 X2 + ⋯ + bk Xk

One can obtain the coefficients a, b1, b2, …, bk, as earlier, by using the Principle of
Least Squares and hence obtain the line of regression. This is called multiple regression.
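A minimal sketch, with made-up data, of fitting Y = a + b1 X1 + b2 X2 by least squares
(here via numpy's least-squares solver):

import numpy as np

# Made-up observations of Y against two predictors X1 and X2
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y  = np.array([3.1, 3.9, 7.2, 7.8, 10.1])

# Design matrix with a leading column of ones for the intercept a
A = np.column_stack([np.ones_like(X1), X1, X2])
(a, b1, b2), *_ = np.linalg.lstsq(A, Y, rcond=None)
print(a, b1, b2)                 # fitted coefficients
Y_hat = a + b1 * X1 + b2 * X2    # predicted values of Y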
Non-linear Regression:
Quite often it is observed that the values of the dependent variable estimated (predicted)
by linear regression are poor. One reason could be that the variables are far from being
linearly related, and a curvilinear relationship may be more appropriate.
For example, instead of Y = a + bX, a relationship of second degree, say of the type
Y = a + bX + cX², may be more appropriate. By using the Principle of Least Squares, one
may obtain the coefficients a, b and c and hence arrive at the regression equation, which
is non-linear.
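A minimal sketch, again with made-up data, of a second-degree (curvilinear) fit
Y = a + bX + cX² obtained by least squares:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 4.9, 9.8, 17.2, 26.1, 37.5])

# np.polyfit returns coefficients with the highest power first: [c, b, a]
c, b, a = np.polyfit(X, Y, 2)
Y_hat = a + b * X + c * X ** 2   # estimated (predicted) values of Y
print(a, b, c)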