10-Correlation and Linear Regression


Correlation and Regression Analysis
Week
Introduction
General multiple regression equation:

$$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \cdots + b_k x_k$$
Multiple standard error of estimate
What is correlation analysis?
o Correlation analysis is a group of techniques to measure the relationship between
two variables
o The basic idea of correlation analysis is to report the relationship between two
variables
o The usual first step is to plot the data in a scatter diagram
Example: The sales manager of North American Copier Sales, which has a large sales
force throughout the United States and Canada, wants to determine whether there is
a relationship between the number of sales calls made in a month and the number of
copiers sold that month. A random sample of 15 sales representatives was collected.
o Observation: As the number of sales calls increases, it appears the number of
copiers sold also increases.
o Dependent variable: the variable that is being predicted or estimated
o Independent variable: provides the basis for estimating or predicting the dependent
variable
o Independent variable: number of sales calls; dependent variable: number of copiers
sold
Originated by Karl Pearson about 1900, the correlation coefficient describes the
strength of the relationship between two sets of interval-scaled or ratio-scaled variables.
Designated r, it is often referred to as Pearson’s r and as the Pearson product-moment
correlation coefficient. It can assume any value from −1.00 to +1.00 inclusive. A correlation
coefficient of −1.00 or +1.00 indicates perfect correlation.

The scatter diagram shows graphically that the sales representatives who make
more calls tend to sell more copiers.
The Correlation Coefficient
Correlation coefficient: A measure of the strength of the linear relationship between
two variables.

Characteristics of the correlation coefficient:


1. The sample correlation coefficient is identified by the lowercase letter r.
2. It shows the direction and strength of the linear relationship between two interval or ratio-scale variables.
3. It ranges from −1 to +1, inclusive.
4. A value near 0 indicates there is little linear relationship between the variables.
5. A value near 1 indicates a direct or positive linear relationship between the variables.
6. A value near −1 indicates an inverse or negative linear relationship between the variables.
Correlation coefficient:
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$$

For the copier sales data:
$$r = \frac{6672}{(15 - 1)(42.76)(12.89)} = 0.865$$
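As a quick check of the arithmetic, the following Python sketch plugs the summary figures from the slide into the formula. The function name `correlation_from_summary` is an illustrative choice and not part of the source.

```python
def correlation_from_summary(sum_cross_products, n, s_x, s_y):
    """Pearson's r from the sum of cross-products and the sample standard deviations."""
    return sum_cross_products / ((n - 1) * s_x * s_y)

# Summary values taken from the copier sales example
r = correlation_from_summary(sum_cross_products=6672, n=15, s_x=42.76, s_y=12.89)
print(round(r, 3))  # 0.865
```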

Interpretation:
o direct relationship
o the association is strong

Important! A strong correlation shows that there is a relationship or association
between the two variables, not that a change in one causes a change in the other.
- The correlation coefficient is also unaffected by the units of the two variables. For example, if we had used
hundreds of copiers sold instead of the number sold, the correlation coefficient would be the same. The
correlation coefficient is independent of the scale used because the term Σ(x − x̄)(y − ȳ) is divided by the sample
standard deviations.
- The correlation of 0.865 indicates a strong positive linear association between the variables. Ms. Bancer, the sales
manager, would be correct to encourage the sales personnel to make that extra sales call because the number of sales
calls made is related to the number of copiers sold.
- If there is a strong relationship (say, .97) between two variables, we are tempted to assume that an increase or
decrease in one variable causes a change in the other variable. For example, historically, the consumption of
Georgia peanuts and the consumption of aspirin have a strong correlation. However, this does not indicate that
an increase in the consumption of peanuts caused the consumption of aspirin to increase.
- Relationships such as these are called spurious correlations. What we can conclude when we find two variables
with a strong correlation is that there is a relationship or association between the two variables, not that a
change in one causes a change in the other.
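The unit-independence point in the notes can be illustrated with a small Python sketch. The five data pairs below are hypothetical and chosen only for illustration; they are not the 15 observations from the copier example, and `pearson_r` is an assumed helper name.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation: sum of cross-products divided by (n - 1) * s_x * s_y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

# Hypothetical illustrative data (NOT the data from the example)
calls = [20, 40, 60, 80, 100]
copiers = [30, 35, 50, 45, 70]
copiers_in_hundreds = [c / 100 for c in copiers]  # same quantity, different unit

r_units = pearson_r(calls, copiers)
r_hundreds = pearson_r(calls, copiers_in_hundreds)
print(math.isclose(r_units, r_hundreds))  # rescaling y leaves r unchanged
```

Rescaling y changes both the cross-product sum and s_y by the same factor, so the ratio is unchanged.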
Testing the significance of the correlation coefficient
Only 15 salespeople were sampled. Could it be that the
correlation in the population is actually 0?
Did the computed r come from a population of paired
observations with zero correlation?

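Step 1: H0: ρ = 0
        H1: ρ ≠ 0
Step 2: α = 0.05
Step 3: The test statistic is t, because the population standard deviation σ is unknown and the sample size is less than 30.

Step 4: Decision rule: with n − 2 = 15 − 2 = 13 degrees of freedom, reject H0 if t < −2.160 or t > 2.160.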
Step 5:
$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.865\sqrt{15 - 2}}{\sqrt{1 - 0.865^2}} = 6.216$$

H0 is rejected at the 0.05 significance level.

Step 6:
There is evidence that the correlation in the population is not zero. For the sales
manager, this indicates a correlation between the number of sales calls
made and the number of copiers sold in the population of salespeople.
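The test can be reproduced with a short Python sketch. The critical value 2.160 is the two-tailed t-table entry for α = 0.05 and 13 degrees of freedom; it is hard-coded here as an assumption rather than computed, and the function name is illustrative.

```python
import math

def t_for_correlation(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.865, 15
t = t_for_correlation(r, n)

# Assumed table value: two-tailed, alpha = 0.05, df = 13
critical_value = 2.160
reject_h0 = abs(t) > critical_value

print(round(t, 3), reject_h0)  # 6.216 True
```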
Regression analysis
Regression equation: An equation that expresses the linear relationship between two
variables.

Regression analysis: The technique used to develop the regression equation and
provide estimates of the dependent variable y.
In regression analysis, our objectives are to:
o use the data to position a line that best represents the relationship between the
two variables
o calculate the values of a (y-intercept) and b (slope of the line) to develop a linear
equation ($\hat{y} = bx + a$) that best fits the data

Least squares principle: A mathematical procedure that uses the data to position a
line with the objective of minimizing the sum of the squares of the vertical distances
between the actual y values and the predicted values of y.
We would prefer a method that results in a single, best regression line. This method is called the least squares
principle. It gives what is commonly referred to as the “best-fitting” line.
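To make the least squares principle concrete, here is a minimal Python sketch. It assumes the standard closed-form solution for a single predictor and uses hypothetical data (not the copier sample); the function names are illustrative. It fits the line and then checks that nudging the intercept or slope only increases the sum of squared vertical distances.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x using the closed-form least squares solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope; algebraically equal to r * s_y / s_x
    a = mean_y - b * mean_x    # intercept
    return a, b

def sum_squared_errors(a, b, xs, ys):
    """Sum of squared vertical distances between actual y and predicted y."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical illustrative data (NOT the data from the example)
xs = [20, 40, 60, 80, 100]
ys = [30, 35, 50, 45, 70]

a, b = least_squares_line(xs, ys)
best = sum_squared_errors(a, b, xs, ys)

# Any nudged line should have a larger sum of squared errors
for da, db in [(1, 0), (-1, 0), (0, 0.05), (0, -0.05)]:
    print(sum_squared_errors(a + da, b + db, xs, ys) > best)
```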
Example: Recall the example involving North American Copier Sales. The sales
manager gathered information on the number of sales calls made and the number of
copiers sold for a random sample of 15 sales representatives. As a part of her
presentation at the upcoming sales meeting, Ms. Bancer, the sales manager, would
like to offer specific information about the relationship between the number of sales
calls and the number of copiers sold. Use the least squares method to determine a
linear equation to express the relationship between the two variables. What is the
expected number of copiers sold by a representative who made 100 calls?
Step 1: Find b
$$b = r\,\frac{s_y}{s_x} = 0.865 \left(\frac{12.89}{42.76}\right) = 0.2608$$

Step 2: Find a
$$a = \bar{y} - b\bar{x} = 45 - 0.2608(96) = 19.9632$$

Step 3: Regression equation
$$\hat{y} = 0.2608x + 19.9632$$

So if a salesperson makes 100 calls, he or she can expect to sell
$$\hat{y} = 0.2608(100) + 19.9632 = 46.0432$$
copiers.
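The three steps can be checked with a few lines of Python. This sketch uses only the summary values given in the example; the slope is rounded to four decimals to mirror the hand calculation, and the `predict` helper is an illustrative name.

```python
# Summary values from the copier sales example
r, s_x, s_y = 0.865, 42.76, 12.89
mean_x, mean_y = 96, 45

# Step 1: slope, rounded to four decimals as in the hand calculation
b = round(r * s_y / s_x, 4)

# Step 2: intercept
a = mean_y - b * mean_x

# Step 3: regression equation y-hat = a + b*x
def predict(calls):
    """Expected number of copiers sold for a given number of sales calls."""
    return a + b * calls

print(b, round(a, 4), round(predict(100), 4))  # 0.2608 19.9632 46.0432
```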

But is the independent variable a useful predictor?
Testing the significance of the slope
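Step 1: H0: β ≤ 0
        H1: β > 0

Step 2: α = 0.05
Step 3: The test statistic is t, because the population standard deviation σ is unknown and the sample size is less than 30.

Step 4: Decision rule: with n − 2 = 15 − 2 = 13 degrees of freedom, reject H0 if t > 1.771.

Step 5: Given that s_b = 0.042 and b = 0.2606 (the slope obtained before r is rounded to 0.865; the hand calculation with the rounded r gives 0.2608):
$$t = \frac{b - 0}{s_b} = \frac{0.2606 - 0}{0.042} = 6.205$$
Since 6.205 > 1.771, reject H0.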

Step 6:
There is evidence that the slope is greater than 0. The
independent variable, number of sales calls, is useful in
estimating copier sales.
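The slope test can be sketched the same way in Python. The critical value 1.771 is the one-tailed t-table entry for α = 0.05 and 13 degrees of freedom, hard-coded here as an assumption; b and s_b are the values given in Step 5.

```python
b = 0.2606     # sample slope used in Step 5
s_b = 0.042    # standard error of the slope, as given

# t statistic for H0: beta <= 0
t = (b - 0) / s_b

# Assumed table value: one-tailed, alpha = 0.05, df = 15 - 2 = 13
critical_value = 1.771
reject_h0 = t > critical_value

print(round(t, 3), reject_h0)  # 6.205 True
```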
References
Lind, D.A., Marchal, W.G. & Wathen, S.A. (2015) Statistical Techniques in Business and
Economics, 17th Edition. McGraw-Hill
