LDC INSTITUTE OF TECHNICAL STUDIES
CLASS NOTES
M.B.A 1ST YEAR (SEMESTER-2)
SUBJECT CODE- KMBN104
BUSINESS STATISTICS AND ANALYSIS
UNIT-3

LDC GROUP OF INSTITUTIONS OFFERS: B.TECH., MBA, MCA, POLYTECHNIC DIPLOMA, ITI, BBA & BCA

Measurement of Correlation: Karl Pearson’s Method, Spearman’s Rank Correlation

Karl Pearson’s Coefficient of Correlation is a widely used mathematical method in which a numerical expression is used to calculate the degree and direction of the relationship between linearly related variables.

Pearson’s method, popularly known as the Pearsonian Coefficient of Correlation, is the most extensively used quantitative method in practice. The coefficient of correlation is denoted by “r”.

If the relationship between two variables X and Y is to be ascertained, then the following formula is used:

r = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² × Σ(Y − Ȳ)²]

where X̄ and Ȳ are the means of X and Y.
Properties of Coefficient of Correlation

 The value of the coefficient of correlation (r) always lies between −1 and +1. Such as:
r = +1, perfect positive correlation
r = −1, perfect negative correlation
r = 0, no correlation
 The coefficient of correlation is independent of the origin and scale. By origin, it means that if any constant is subtracted from the given values of X and Y, the value of “r” remains unchanged. By scale, it means that there is no effect on the value of “r” if the values of X and Y are divided or multiplied by any constant.
 The coefficient of correlation is the geometric mean of the two regression coefficients. Symbolically, it is represented as:

r = ±√(bxy × byx)


 The coefficient of correlation is “zero” when the variables X and Y are independent. However, the converse is not true.

Assumptions of Karl Pearson’s Coefficient of Correlation

1. The relationship between the variables is “linear”, which means that when the two variables are plotted, the plotted points form a straight line.

2. A large number of independent causes affect the variables under study, so as to form a normal distribution. For example, variables like price, demand, supply, etc. are affected by such factors that a normal distribution is formed.

3. The variables are independent of each other.

Note: The coefficient of correlation measures not only the magnitude of correlation but also its direction. For example, r = −0.67 shows that the correlation is negative, because the sign is “−”, and the magnitude is 0.67.
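To illustrate, the formula can be computed directly. This is a minimal Python sketch (not part of the original notes) using made-up data:

```python
import math

def pearson_r(x, y):
    """r = sum((X - Xbar)(Y - Ybar)) / sqrt(sum((X - Xbar)^2) * sum((Y - Ybar)^2))"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]))   # 1.0  (perfect positive)
print(pearson_r([1, 2, 3, 4], [40, 30, 20, 10]))   # -1.0 (perfect negative)
```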

SPEARMAN RANK CORRELATION


Spearman rank correlation is a non-parametric test that is used to
measure the degree of association between two variables. The Spearman
rank correlation test does not carry any assumptions about the
distribution of the data and is the appropriate correlation analysis when
the variables are measured on a scale that is at least ordinal.

The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while Pearson’s correlation assesses linear relationships, Spearman’s correlation assesses monotonic relationships (whether linear or not). If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.

Intuitively, the Spearman correlation between two variables will be high when observations have a similar (or identical, for a correlation of 1) rank (i.e. relative position of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully opposed, for a correlation of −1) rank between the two variables.

The following formula is used to calculate the Spearman rank correlation:

ρ = 1 − (6 Σdi²) / (n(n² − 1))

where:

ρ = Spearman rank correlation
di = the difference between the ranks of corresponding variables
n = number of observations

Assumptions

The assumptions of the Spearman correlation are that the data must be at least ordinal and that the scores on one variable must be monotonically related to the scores on the other variable.
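The rank-based computation can be sketched in Python (illustrative, not part of the original notes); this minimal version assumes no repeated data values, matching the perfect-correlation caveat above:

```python
def simple_ranks(values):
    """1-based ranks; assumes no repeated data values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """rho = 1 - 6 * sum(di^2) / (n * (n^2 - 1))"""
    rx, ry = simple_ranks(x), simple_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return 1 - 6 * d2 / (n * (n * n - 1))

# A monotonic but non-linear relationship still yields a perfect rho.
print(spearman_rho([1, 2, 3, 4], [1, 8, 27, 64]))    # 1.0
print(spearman_rho([1, 2, 3, 4], [64, 27, 8, 1]))    # -1.0
```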

Properties of Correlation Coefficient

The following are the main properties of the correlation coefficient.

1. Coefficient of Correlation lies between −1 and +1:

The coefficient of correlation cannot take a value less than −1 or more than +1. Symbolically,

−1 ≤ r ≤ +1, or |r| ≤ 1.

2. Coefficients of Correlation are independent of Change of Origin:

This property reveals that if we subtract any constant from all the values
of X and Y, it will not affect the coefficient of correlation.

3. Coefficient of Correlation possesses the property of symmetry:

The degree of relationship between two variables is symmetric, as shown below:

rxy = ryx

4. Coefficient of Correlation is independent of Change of Scale:

This property reveals that if we divide or multiply all the values of X and Y by a constant, it will not affect the coefficient of correlation.

5. Coefficient of correlation measures only linear correlation between X and Y.


6. If two variables X and Y are independent, the coefficient of correlation between them will be zero.
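A numerical check of these properties (an illustrative Python sketch with made-up data, not part of the original notes):

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [2, 4, 5, 7, 9]
y = [3, 5, 4, 8, 9]

r = pearson_r(x, y)
assert -1 <= r <= 1                           # property 1: bounded by ±1
assert pearson_r(y, x) == r                   # property 3: symmetry, rxy = ryx
shifted = [v - 2 for v in x]                  # property 2: change of origin
scaled = [v * 10 for v in x]                  # property 4: change of scale
assert abs(pearson_r(shifted, y) - r) < 1e-12
assert abs(pearson_r(scaled, y) - r) < 1e-12
print(round(r, 3))
```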


Regression: Meaning, Assumptions, Regression Line

REGRESSION

Regression is a statistical measurement used in finance, investing, and other disciplines that attempts to determine the strength of the relationship between one dependent variable (usually denoted by Y) and a series of other changing variables (known as independent variables).

Regression helps investment and financial managers to value assets and to understand the relationships between variables, such as commodity prices and the stocks of businesses dealing in those commodities.

Regression Explained

The two basic types of regression are linear regression and multiple linear regression, although there are non-linear regression methods for more complicated data and analysis. Linear regression uses one independent variable to explain or predict the outcome of the dependent variable Y, while multiple regression uses two or more independent variables to predict the outcome.

Regression can help finance and investment professionals, as well as professionals in other businesses. Regression can also help predict sales for a company based on weather, previous sales, GDP growth, or other types of conditions. The capital asset pricing model (CAPM) is an often-used regression model in finance for pricing assets and discovering costs of capital.

The general form of each type of regression is:

 Linear regression: Y = a + bX + u
 Multiple regression: Y = a + b1X1 + b2X2 + b3X3 + … +
btXt + u

Where:

Y = the variable that you are trying to predict (dependent variable).

X = the variable that you are using to predict Y (independent variable).

a = the intercept.

b = the slope.

u = the regression residual.

Regression takes a group of random variables, thought to be predicting Y, and tries to find a mathematical relationship between them. This relationship is typically in the form of a straight line (linear regression) that best approximates all the individual data points. In multiple regression, the separate variables are differentiated by using numbers with subscripts.
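The least-squares fit of the line Y = a + bX can be sketched as follows (an illustrative Python snippet, not part of the original notes, using the standard deviation-from-mean formulas):

```python
def fit_line(x, y):
    """Least-squares estimates for Y = a + bX + u:
    b = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2),  a = Ybar - b * Xbar."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])   # data lies exactly on y = 1 + 2x
print(a, b)   # 1.0 2.0
```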

ASSUMPTIONS IN REGRESSION

 Independence: The residuals are serially independent (no autocorrelation).
 The residuals are not correlated with any of the independent (predictor) variables.
 Linearity: The relationship between the dependent variable and each of the independent variables is linear.
 Mean of Residuals: The mean of the residuals is zero.
 Homogeneity of Variance: The variance of the residuals is constant at all levels of the independent variables.
 Errors in Variables: The independent (predictor) variables are measured without error.
 Model Specification: All relevant variables are included in the model; no irrelevant variables are included in the model.


 Normality: The residuals are normally distributed. This assumption is needed for valid tests of significance but not for estimation of the regression coefficients.

REGRESSION LINE

Definition: The regression line is the line that best fits the data, such that the overall distance from the line to the points (variable values) plotted on a graph is the smallest. In other words, a line used to minimize the squared deviations of predictions is called the regression line.

There are as many regression lines as there are variables. Suppose we take two variables, say X and Y; then there will be two regression lines:

 Regression line of Y on X: This gives the most probable values of Y for the given values of X.
 Regression line of X on Y: This gives the most probable values of X for the given values of Y.

The algebraic expressions of these regression lines are called the Regression Equations. There will be two regression equations for the two regression lines.

The correlation between the variables depends on the distance between these two regression lines: the nearer the regression lines are to each other, the higher the degree of correlation; the farther apart they are, the lower the degree of correlation.

The correlation is said to be either perfect positive or perfect negative when the two regression lines coincide, i.e. only one line exists. In case the variables are independent, the correlation will be zero, and the lines of regression will be at right angles, i.e. parallel to the X axis and the Y axis.

Note: The regression lines cut each other at the point of the averages of X and Y. This means that if, from the point where the lines intersect, a perpendicular is drawn to the X axis, we get the mean value of X. Similarly, if a horizontal line is drawn to the Y axis, we get the mean value of Y.

Properties of Regression Coefficients

The constant ‘b’ in the regression equation (Ye = a + bX) is called the Regression Coefficient. It determines the slope of the line, i.e. the change in the value of Y corresponding to a unit change in X, and therefore it is also called a “Slope Coefficient.”


Properties of Regression Coefficient

1. The correlation coefficient is the geometric mean of the two regression coefficients. Symbolically, it can be expressed as:

r = ±√(bxy × byx)

2. The value of the coefficient of correlation cannot exceed unity, i.e. 1. Therefore, if one of the regression coefficients is greater than unity, the other must be less than unity.
3. The signs of both the regression coefficients will be the same, i.e. they will be either both positive or both negative. Thus, it is not possible for one regression coefficient to be negative while the other is positive.
4. The coefficient of correlation will have the same sign as the regression coefficients: if the regression coefficients have a positive sign, then “r” will be positive, and vice-versa.
5. The average value of the two regression coefficients will be greater than or equal to the value of the correlation coefficient. Symbolically, it can be represented as:

(bxy + byx) / 2 ≥ r

6. The regression coefficients are independent of the change of origin, but not of the scale. By origin, we mean that there will be no effect on the regression coefficients if any constant is subtracted from the values of X and Y. By scale, we mean that if the values of X and Y are multiplied or divided by some constant, then the regression coefficients will also change.

Thus, all these properties should be kept in mind while solving for the
regression coefficients.
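These properties can be verified numerically; a Python sketch on made-up data (illustrative only, not part of the original notes):

```python
import math

def coeffs(x, y):
    """Return (byx, bxy, r) computed from deviations about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    byx = sxy / sxx          # regression coefficient of Y on X
    bxy = sxy / syy          # regression coefficient of X on Y
    r = sxy / math.sqrt(sxx * syy)
    return byx, bxy, r

byx, bxy, r = coeffs([2, 4, 5, 7, 9], [3, 5, 4, 8, 9])
assert abs(math.sqrt(byx * bxy) - abs(r)) < 1e-12   # property 1: r is the geometric mean
assert (byx > 0) == (bxy > 0)                       # property 3: same sign
assert (byx + bxy) / 2 >= abs(r)                    # property 5: average >= r
print(round(byx, 3), round(bxy, 3), round(r, 3))
```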


Correlation and Regression Analysis

Correlation Analysis

Correlation is a measure of association between two variables. The variables are not designated as dependent or independent. The two most popular correlation coefficients are Spearman’s correlation coefficient rho and Pearson’s product-moment correlation coefficient.

When calculating a correlation coefficient for ordinal data, select Spearman’s technique. For interval or ratio-type data, use Pearson’s technique.

The value of a correlation coefficient can vary from minus one to plus one. A minus one indicates a perfect negative correlation, while a plus one indicates a perfect positive correlation. A correlation of zero means there is no relationship between the two variables. When there is a negative correlation between two variables, as the value of one variable increases, the value of the other variable decreases, and vice versa. In other words, for a negative correlation, the variables work opposite each other. When there is a positive correlation between two variables, as the value of one variable increases, the value of the other variable also increases. The variables move together.

The standard error of a correlation coefficient is used to determine the confidence intervals around a true correlation of zero. If your correlation coefficient falls outside of this range, then it is significantly different from zero. The standard error can be calculated for interval or ratio-type data (i.e., only for Pearson’s product-moment correlation).


The significance (probability) of the correlation coefficient is determined from the t-statistic. The probability of the t-statistic indicates whether the observed correlation coefficient occurred by chance if the true correlation is zero. In other words, it asks whether the correlation is significantly different from zero. When the t-statistic is calculated for Spearman’s rank-difference correlation coefficient, there must be at least 30 cases before the t-distribution can be used to determine the probability. If there are fewer than 30 cases, you must refer to a special table to find the probability of the correlation coefficient.

Example
A company wanted to know if there is a significant relationship between
the total number of salespeople and the total number of sales. They
collect data for five months.

Variable 1 Variable 2

207 6907

180 5991

220 6810

205 6553

190 6190

——————————–

Correlation coefficient = .921
Standard error of the coefficient = .068
t-test for the significance of the coefficient = 4.100
Degrees of freedom = 3
Two-tailed probability = .0263
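Recomputing the reported statistics (an illustrative Python sketch, not part of the original notes; small rounding differences from the table are expected):

```python
import math

x = [207, 180, 220, 205, 190]       # total number of salespeople
y = [6907, 5991, 6810, 6553, 6190]  # total number of sales

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)                   # Pearson correlation
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)  # t-test for significance, df = n - 2
print(round(r, 3), n - 2)   # 0.921 3
print(round(t, 3))
```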

Another Example

Respondents to a survey were asked to judge the quality of a product on a four-point Likert scale (excellent, good, fair, poor). They were also asked to judge the reputation of the company that made the product on a three-point scale (good, fair, poor). Is there a significant relationship between respondents’ perceptions of the company and their perceptions of the quality of the product?

Since both variables are ordinal, Spearman’s method is chosen. The first variable is the rating for the quality of the product. Responses are coded as 4=excellent, 3=good, 2=fair, and 1=poor. The second variable is the perceived reputation of the company and is coded 3=good, 2=fair, and 1=poor.

Variable 1 Variable 2

4 3

2 2

1 2

3 3

4 3

1 1

2 1

——————————————-

Correlation coefficient rho = .830
t-test for the significance of the coefficient = 3.332
Number of data pairs = 7

Probability must be determined from a table because of the small sample size.
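The rho reported above can be reproduced with the rank-difference formula; this Python sketch (illustrative, not part of the original notes) assigns tied values the average of their rank positions:

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

quality = [4, 2, 1, 3, 4, 1, 2]     # 4=excellent ... 1=poor
reputation = [3, 2, 2, 3, 3, 1, 1]  # 3=good ... 1=poor

rq, rr = average_ranks(quality), average_ranks(reputation)
d2 = sum((a - b) ** 2 for a, b in zip(rq, rr))
n = len(quality)
rho = 1 - 6 * d2 / (n * (n * n - 1))
print(f"{rho:.3f}")   # 0.830
```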

Regression Analysis
Simple regression is used to examine the relationship between one
dependent and one independent variable. After performing an analysis,
the regression statistics can be used to predict the dependent variable
when the independent variable is known. Regression goes beyond
correlation by adding prediction capabilities.

LDC GROUP OF INSTITUTIONS OFFERS: B.TECH., MBA, MCA, POLYTECHNIC DIPLA, ITI, BBA & BCA
CLASS NOTES

People use regression on an intuitive level every day. In business, a well-dressed man is thought to be financially successful. A mother knows that more sugar in her children’s diet results in higher energy levels. The ease of waking up in the morning often depends on how late you went to bed the night before. Quantitative regression adds precision by developing a mathematical formula that can be used for predictive purposes.

For example, a medical researcher might want to use body weight (independent variable) to predict the most appropriate dose for a new drug (dependent variable). The purpose of running the regression is to find a formula that fits the relationship between the two variables. Then you can use that formula to predict values for the dependent variable when only the independent variable is known. A doctor could prescribe the proper dose based on a person’s body weight.

The regression line (known as the least squares line) is a plot of the
expected value of the dependent variable for all values of the
independent variable. Technically, it is the line that “minimizes the
squared residuals”. The regression line is the one that best fits the data
on a scatterplot.

Using the regression equation, the dependent variable may be predicted from the independent variable. The slope of the regression line (b) is defined as the rise divided by the run. The y intercept (a) is the point on the y axis where the regression line would intercept the y axis. The slope and y intercept are incorporated into the regression equation. The intercept is usually called the constant, and the slope is referred to as the coefficient. Since the regression model is usually not a perfect predictor, there is also an error term in the equation.

In the regression equation, y is always the dependent variable and x is always the independent variable. Here are three equivalent ways to mathematically describe a linear regression model.

y = intercept + (slope × x) + error

y = constant + (coefficient × x) + error

y = a + bx + e

The significance of the slope of the regression line is determined from the
t-statistic. It is the probability that the observed correlation coefficient
occurred by chance if the true correlation is zero. Some researchers prefer
to report the F-ratio instead of the t-statistic. The F-ratio is equal to the t-
statistic squared.

LDC GROUP OF INSTITUTIONS OFFERS: B.TECH., MBA, MCA, POLYTECHNIC DIPLA, ITI, BBA & BCA
CLASS NOTES

The t-statistic for the significance of the slope is essentially a test to determine if the regression model (equation) is usable. If the slope is significantly different from zero, then we can use the regression model to predict the dependent variable for any value of the independent variable.

On the other hand, take an example where the slope is zero. It has no prediction ability because, for every value of the independent variable, the prediction for the dependent variable would be the same. Knowing the value of the independent variable would not improve our ability to predict the dependent variable. Thus, if the slope is not significantly different from zero, don’t use the model to make predictions.

The coefficient of determination (r-squared) is the square of the correlation coefficient. Its value may vary from zero to one. It has the advantage over the correlation coefficient that it may be interpreted directly as the proportion of variance in the dependent variable that can be accounted for by the regression equation. For example, an r-squared value of .49 means that 49% of the variance in the dependent variable can be explained by the regression equation; the other 51% is unexplained.

The standard error of the estimate for regression measures the amount of
variability in the points around the regression line. It is the standard
deviation of the data points as they are distributed around the regression
line. The standard error of the estimate can be used to develop
confidence intervals around a prediction.

Example

A company wants to know if there is a significant relationship between its advertising expenditures and its sales volume. The independent variable is the advertising budget and the dependent variable is sales volume. A lag time of one month will be used because sales are expected to lag behind actual advertising expenditures. Data was collected for a six-month period. All figures are in thousands of dollars. Is there a significant relationship between advertising budget and sales volume?

Indep. Var. Depen. Var

4.2 27.1

6.1 30.4

3.9 25.0


5.7 29.7

7.3 40.1

5.9 28.8

————————————————–

Model: y = 9.873 + (3.682 × x) + error

Standard error of the estimate = 2.637
t-test for the significance of the slope = 3.961
Degrees of freedom = 4
Two-tailed probability = .0149
r-squared = .797

You might make a statement in a report like this: A simple linear regression was performed on six months of data to determine if there was a significant relationship between advertising expenditures and sales volume. The t-statistic for the slope was significant at the .05 critical alpha level, t(4)=3.96, p=.015. Thus, we reject the null hypothesis and conclude that there was a positive significant relationship between advertising expenditures and sales volume. Furthermore, 79.7% of the variability in sales volume could be explained by advertising expenditures.

