Pradytha Galuh Putranti - 2304220013 - SSD - B ING-STAT

English for Statistics
REGRESSION ANALYSIS AND CORRELATION ANALYSIS

Pradytha Galuh Putranti (2304220013)

Semarang State University


Year 2023/2024
REGRESSION ANALYSIS
A. Regression Analysis
The aim of this chapter is to show how to calculate an estimated regression
equation that explains the relationship between two variables. The discussion
covers simple linear regression, in which the relationship between two variables
is expressed, usually quite adequately, as a straight line. The regression equation
is then used to estimate the value of one variable at a given value of the other
variable; in other words, the regression equation is used for forecasting.

B. Functional Relationships Between Variables


In regression analysis, the independent variables are denoted by
$X_1, X_2, \ldots, X_k$ ($k \ge 1$),
while the dependent variable is denoted by $Y$.
In general, the regression model or equation for a population can be written in
the form
$$\mu_{y.x_1,x_2,\ldots,x_k} = f(X_1, X_2, \ldots, X_k \mid \theta_1, \theta_2, \ldots, \theta_m)$$
with $\theta_1, \theta_2, \ldots, \theta_m$ the parameters of the regression.

The regression model for a population with one independent variable, commonly
known as simple linear regression, is
$$\mu_{y.x} = \theta_1 + \theta_2 X$$
In this case the parameters are $\theta_1$ and $\theta_2$.

Based on a sample, the population regression equation in formula (IX.1) will be
determined or estimated.
This is done by estimating the parameters $\theta_1, \theta_2, \ldots, \theta_m$. For the case of
simple linear regression, the parameters $\theta_1$ and $\theta_2$ need to be estimated. If $\theta_1$ and
$\theta_2$ are estimated by $a$ and $b$, then the regression equation based on the sample is
$$\hat{Y} = a + bX$$
A regression with X as the independent variable and Y as the dependent variable
is called a regression of Y on X.

The quadratic or parabolic population regression model for one independent variable
with parameters $\theta_1$, $\theta_2$, and $\theta_3$ is
$$\mu_{y.x,x^2} = \theta_1 + \theta_2 X + \theta_3 X^2$$
Based on a random sample, the parameters $\theta_1$, $\theta_2$, and $\theta_3$ are estimated
through the equation
$$\hat{Y} = a + bX + cX^2$$
where $a$, $b$, and $c$ are calculated from the research data and are the estimates
of $\theta_1$, $\theta_2$, and $\theta_3$ respectively.

The regression equation can be determined from observational data in the following ways.
1. Freehand Method
This method uses a scatter diagram to visualize the observations, with
the independent variable X plotted on the horizontal axis and the dependent
variable Y on the vertical axis. Its benefit is that it helps identify the
relationship between the two variables and the type of regression equation. If the
points lie around a straight line, a linear regression is appropriate, whereas if
they lie around a curve, the regression is nonlinear. The relationship
between the variables can be positive, negative, or show no particular pattern.
Scatter diagrams thus support the analysis with a visual check.
2. Least Squares Method for Linear Regression
This method is based on the requirement that the sum of the squared
distances between the points and the regression line being sought must be as
small as possible. For a population with one independent variable,
the linear regression model is
$$\mu_{y.x} = \theta_1 + \theta_2 X$$
The parameters $\theta_1$ and $\theta_2$ are estimated by $a$ and $b$, so that the regression
equation based on sample data is
$$\hat{Y} = a + bX$$
The regression coefficients $a$ and $b$ for linear regression can be calculated
using the formulas
$$a = \frac{(\sum Y_i)(\sum X_i^2) - (\sum X_i)(\sum X_i Y_i)}{n \sum X_i^2 - (\sum X_i)^2}$$
$$b = \frac{n \sum X_i Y_i - (\sum X_i)(\sum Y_i)}{n \sum X_i^2 - (\sum X_i)^2}$$

If the coefficient $b$ is calculated first, the coefficient $a$ can also be determined
using the formula
$$a = \bar{Y} - b\bar{X}$$
where $\bar{X}$ and $\bar{Y}$ are the means of variables X and Y respectively.

In linear regression, the coefficient $b$ is the average change in Y for
every one-unit change in X. Y increases with X if $b$ is positive and
decreases if $b$ is negative.
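The sum formulas for $a$ and $b$ translate directly into a few lines of code. The following is a minimal sketch in Python; the function name `fit_line` and the four data points are hypothetical illustrations, not part of the original material.

```python
def fit_line(xs, ys):
    """Return (a, b) for Y-hat = a + b*X using the least squares sum formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all X values are equal, so the slope is undefined")
    a = (sum_y * sum_xx - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom
    return a, b


# Hypothetical data lying exactly on the line Y = 1 + 2X.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)

# Cross-check with the shortcut a = Y-bar - b * X-bar.
a_shortcut = sum(ys) / len(ys) - b * sum(xs) / len(xs)
print(a_shortcut)
```

Both routes to $a$ give the same value, which is a quick way to catch arithmetic slips when the sums are computed by hand.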

C. Various Variances Associated with Simple Linear Regression

The first assumption concerns the prediction error, the difference
$e = Y - \hat{Y}$, which arises because the observed value of the dependent
variable Y is not necessarily equal to the expected value $\hat{Y}$ obtained from
the regression fitted to the observations (sample). In the population, the prediction error is assumed
to be a random variable that follows a normal distribution with mean zero and variance
$\sigma^2$.
The second assumption is that, for each given value of X, the dependent variable Y is
independent and normally distributed with mean $(\theta_1 + \theta_2 X)$ and variance $\sigma^2_{y.x}$. The
variance $\sigma^2_{y.x}$ is assumed to be the same for each value of X.
D. Regression Standard Error and Simple Regression Coefficients

a. Standard error of the regression
$$S_e = \sqrt{\frac{\sum Y^2 - a \sum Y - b \sum XY}{n - 2}}$$

b. Standard error of the regression coefficient $a$
$$S_a = S_e \sqrt{\frac{\sum X^2}{n \sum X^2 - (\sum X)^2}}$$

c. Standard error of the regression coefficient $b$
$$S_b = \frac{S_e}{\sqrt{\sum X^2 - \frac{(\sum X)^2}{n}}}$$
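As a rough illustration of the three standard errors, the sketch below computes $S_e$, $S_a$, and $S_b$ from raw data in Python. The function name and the five data points are hypothetical; the formula used for $S_a$ is the standard least squares one, $S_a = S_e\sqrt{\sum X^2 / (n\sum X^2 - (\sum X)^2)}$.

```python
import math


def standard_errors(xs, ys):
    """Return (S_e, S_a, S_b) for the simple linear regression of Y on X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sum_xx - sum_x ** 2
    b = (n * sum_xy - sum_x * sum_y) / denom
    a = sum_y / n - b * sum_x / n
    # Residual sum of squares; clamp at zero to guard against rounding noise.
    ss_res = max(0.0, sum_yy - a * sum_y - b * sum_xy)
    s_e = math.sqrt(ss_res / (n - 2))
    s_a = s_e * math.sqrt(sum_xx / denom)
    s_b = s_e / math.sqrt(sum_xx - sum_x ** 2 / n)
    return s_e, s_a, s_b


# Hypothetical illustration data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
s_e, s_a, s_b = standard_errors(xs, ys)
print(s_e, s_a, s_b)
```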
E. Linear Regression Hypothesis Testing
(Table formula)
a. Test of independence using the F statistic
$$F = \frac{S^2_{reg}}{S^2_{res}}$$
This test is used to determine whether there is a significant linear
relationship between the independent variable and the dependent variable.
b. Test of the linearity of the regression model using the F statistic
$$F = \frac{S^2_{TC}}{S^2_e}$$
This test is carried out to determine whether a linear model is suitable
for describing the relationship between the independent and dependent variables.
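The independence test can be sketched numerically as follows. This is a minimal Python illustration that assumes the usual decomposition for simple regression: the regression sum of squares has 1 degree of freedom and the residual sum of squares has $n - 2$. The function name and data are hypothetical, and the linearity (lack-of-fit) test is not covered because it requires repeated observations at the same X values.

```python
def f_independence(xs, ys):
    """Return F = S2_reg / S2_res for the simple linear regression of Y on X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    ss_reg = b * (sum_xy - sum_x * sum_y / n)  # explained by the slope
    ss_total = sum_yy - sum_y ** 2 / n         # total variation around the mean
    ss_res = ss_total - ss_reg                 # left unexplained
    s2_reg = ss_reg / 1                        # 1 degree of freedom
    s2_res = ss_res / (n - 2)                  # n - 2 degrees of freedom
    return s2_reg / s2_res


# Hypothetical illustration data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
f_value = f_independence(xs, ys)
print(f_value)
```

The resulting F value is compared with the F table value for 1 and $n - 2$ degrees of freedom at the chosen significance level.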
F. Non-Linear Regression
a. Quadratic Parabola Model
The general equation of this model is estimated by
$$\hat{Y} = a + bX + cX^2$$
The coefficients $a$, $b$, $c$ must be determined from the
observational data. Using the least squares method, $a$, $b$, $c$ are obtained
by solving the system of equations:

$$\sum Y_i = na + b \sum X_i + c \sum X_i^2$$

$$\sum X_i Y_i = a \sum X_i + b \sum X_i^2 + c \sum X_i^3$$

$$\sum X_i^2 Y_i = a \sum X_i^2 + b \sum X_i^3 + c \sum X_i^4$$

b. Cubic Parabola Model

The general equation of this model is estimated by
$$\hat{Y} = a + bX + cX^2 + dX^3$$
with coefficients $a$, $b$, $c$, $d$ calculated from the observational data. The system
of equations that must be solved to determine $a$, $b$, $c$, $d$ is:

$$\sum Y_i = na + b \sum X_i + c \sum X_i^2 + d \sum X_i^3$$

$$\sum X_i Y_i = a \sum X_i + b \sum X_i^2 + c \sum X_i^3 + d \sum X_i^4$$

$$\sum X_i^2 Y_i = a \sum X_i^2 + b \sum X_i^3 + c \sum X_i^4 + d \sum X_i^5$$

$$\sum X_i^3 Y_i = a \sum X_i^3 + b \sum X_i^4 + c \sum X_i^5 + d \sum X_i^6$$
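The quadratic and cubic systems share one pattern: the entry in row $i$, column $j$ of the coefficient matrix is $\sum X^{i+j}$ and the right-hand side is $\sum X^i Y$. The sketch below builds and solves these normal equations in Python for any degree. The helper names are hypothetical, and plain Gaussian elimination is used only to keep the example self-contained.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * coeffs = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("the normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (aug[r][n] - tail) / aug[r][r]
    return coeffs


def fit_polynomial(xs, ys, degree):
    """Return [a, b, c, ...] for Y-hat = a + bX + cX^2 + ... via normal equations."""
    size = degree + 1
    power_sums = [sum(x ** p for x in xs) for p in range(2 * degree + 1)]
    matrix = [[float(power_sums[i + j]) for j in range(size)] for i in range(size)]
    rhs = [float(sum((x ** i) * y for x, y in zip(xs, ys))) for i in range(size)]
    return solve_linear_system(matrix, rhs)


# Hypothetical data generated from Y = 1 + 2X + 3X^2.
xs = [-2, -1, 0, 1, 2, 3]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
quadratic = fit_polynomial(xs, ys, 2)
print(quadratic)

# Hypothetical data generated from Y = 1 - X + 0.5X^3.
ys_cubic = [1 - x + 0.5 * x ** 3 for x in xs]
cubic = fit_polynomial(xs, ys_cubic, 3)
print(cubic)
```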

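c. Exponential Model
The general equation of this model is estimated by
$$\hat{Y} = ab^X$$
Taking logarithms gives a form that is linear in X:
$$\log \hat{Y} = \log a + (\log b)X$$
so $\log a$ and $\log b$ can be calculated with the linear regression formulas:
$$\log a = \frac{\sum \log Y_i}{n} - (\log b)\left(\frac{\sum X_i}{n}\right)$$
$$\log b = \frac{n(\sum X_i \log Y_i) - (\sum X_i)(\sum \log Y_i)}{n \sum X_i^2 - (\sum X_i)^2}$$
The exponential model is also written in the form
$$\hat{Y} = ae^{bX}$$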
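A minimal Python sketch of the log-transform approach for the form $\hat{Y} = ab^X$ follows. It assumes all Y values are positive (otherwise the logarithm is undefined); the function name and the data are hypothetical.

```python
import math


def fit_exponential(xs, ys):
    """Return (a, b) for Y-hat = a * b**X by regressing log10(Y) on X."""
    n = len(xs)
    logs = [math.log10(y) for y in ys]  # requires every Y > 0
    sum_x = sum(xs)
    sum_l = sum(logs)
    sum_xx = sum(x * x for x in xs)
    sum_xl = sum(x * l for x, l in zip(xs, logs))
    log_b = (n * sum_xl - sum_x * sum_l) / (n * sum_xx - sum_x ** 2)
    log_a = sum_l / n - log_b * (sum_x / n)
    return 10 ** log_a, 10 ** log_b


# Hypothetical data generated from Y = 2 * 3**X.
xs = [0, 1, 2, 3, 4]
ys = [2 * 3 ** x for x in xs]
a_exp, b_exp = fit_exponential(xs, ys)
print(a_exp, b_exp)
```

Note that least squares is applied to $\log Y$, not to $Y$ itself, so the fitted curve minimizes squared errors on the logarithmic scale.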
d. Geometric Model
The general equation of this model is estimated by
$$\hat{Y} = aX^b$$
$$\log \hat{Y} = \log a + b \log X$$

$$\log a = \frac{\sum \log Y_i}{n} - b\,\frac{\sum \log X_i}{n}$$
$$b = \frac{n(\sum \log X_i \log Y_i) - (\sum \log X_i)(\sum \log Y_i)}{n \sum \log^2 X_i - (\sum \log X_i)^2}$$
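For the geometric model both variables are transformed, so the fit is an ordinary linear regression of $\log Y$ on $\log X$. A minimal Python sketch, assuming all X and Y values are positive and using hypothetical names and data:

```python
import math


def fit_geometric(xs, ys):
    """Return (a, b) for Y-hat = a * X**b by regressing log10(Y) on log10(X)."""
    n = len(xs)
    log_x = [math.log10(x) for x in xs]  # requires every X > 0
    log_y = [math.log10(y) for y in ys]  # requires every Y > 0
    sum_lx = sum(log_x)
    sum_ly = sum(log_y)
    sum_lxlx = sum(u * u for u in log_x)
    sum_lxly = sum(u * v for u, v in zip(log_x, log_y))
    b = (n * sum_lxly - sum_lx * sum_ly) / (n * sum_lxlx - sum_lx ** 2)
    log_a = sum_ly / n - b * (sum_lx / n)
    return 10 ** log_a, b


# Hypothetical data generated from Y = 2 * X**1.5.
xs = [1, 2, 4, 8]
ys = [2 * x ** 1.5 for x in xs]
a_geo, b_geo = fit_geometric(xs, ys)
print(a_geo, b_geo)
```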

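e. Logistic Model
The simplest logistic model can be estimated by
$$\hat{Y} = \frac{1}{ab^X}$$
$$\log\left(\frac{1}{\hat{Y}}\right) = \log a + (\log b)X$$

$$\log a = \frac{\sum \log\left(\frac{1}{Y_i}\right)}{n} - (\log b)\left(\frac{\sum X_i}{n}\right)$$
$$\log b = \frac{n\left(\sum X_i \log\left(\frac{1}{Y_i}\right)\right) - (\sum X_i)\left(\sum \log\left(\frac{1}{Y_i}\right)\right)}{n \sum X_i^2 - (\sum X_i)^2}$$
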

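f. Hyperbola Model
The simple general equation for the hyperbola model can be written in the form
$$\hat{Y} = \frac{1}{a + bX}$$
$$\frac{1}{\hat{Y}} = a + bX$$

$$a = \frac{\left(\sum \frac{1}{Y_i}\right)(\sum X_i^2) - (\sum X_i)\left(\sum \frac{X_i}{Y_i}\right)}{n \sum X_i^2 - (\sum X_i)^2}$$
$$b = \frac{n \sum \frac{X_i}{Y_i} - (\sum X_i)\left(\sum \frac{1}{Y_i}\right)}{n \sum X_i^2 - (\sum X_i)^2}$$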
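The hyperbola formulas are the linear regression formulas with $1/Y$ in place of $Y$. A minimal Python sketch, assuming no Y value is zero and using hypothetical names and data:

```python
def fit_hyperbola(xs, ys):
    """Return (a, b) for Y-hat = 1 / (a + b*X) by regressing 1/Y on X."""
    n = len(xs)
    inv_y = [1.0 / y for y in ys]  # requires every Y != 0
    sum_x = sum(xs)
    sum_inv = sum(inv_y)
    sum_xx = sum(x * x for x in xs)
    sum_x_inv = sum(x * v for x, v in zip(xs, inv_y))
    denom = n * sum_xx - sum_x ** 2
    a = (sum_inv * sum_xx - sum_x * sum_x_inv) / denom
    b = (n * sum_x_inv - sum_x * sum_inv) / denom
    return a, b


# Hypothetical data generated from Y = 1 / (1 + 2X).
xs = [1, 2, 3, 4]
ys = [1.0 / (1 + 2 * x) for x in xs]
a_hyp, b_hyp = fit_hyperbola(xs, ys)
print(a_hyp, b_hyp)
```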
G. Multiple Linear Regression
Previously we discussed the linear relationship between two variables X and Y using
the linear regression equation $\hat{Y} = a + bX$.
In practice, observational data often involve more than two variables.
For example, rice yield (Y) is influenced by fertilizer use ($X_1$), rice field area ($X_2$),
and rainfall ($X_3$). In general, the observed Y may be influenced by the
independent variables $X_1, X_2, \ldots, X_k$, and the regression equation is estimated by
$$\hat{Y} = a + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k$$
The coefficients $a, b_1, b_2, \ldots, b_k$ are obtained by solving the system of equations

$$an + b_1 \sum X_1 + b_2 \sum X_2 + \ldots + b_k \sum X_k = \sum Y$$

$$a \sum X_1 + b_1 \sum X_1^2 + b_2 \sum X_1 X_2 + \ldots + b_k \sum X_1 X_k = \sum X_1 Y$$

$$a \sum X_2 + b_1 \sum X_2 X_1 + b_2 \sum X_2^2 + \ldots + b_k \sum X_2 X_k = \sum X_2 Y$$

$$\vdots$$

$$a \sum X_k + b_1 \sum X_k X_1 + b_2 \sum X_k X_2 + \ldots + b_k \sum X_k^2 = \sum X_k Y$$
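The normal equations for multiple regression can be assembled mechanically by treating the constant as a predictor that always equals 1. The Python sketch below does this for any number of predictors. The function names and the small data set are hypothetical, and plain Gaussian elimination is used only to keep the example free of external libraries.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * coeffs = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (aug[r][n] - tail) / aug[r][r]
    return coeffs


def fit_multiple(rows, ys):
    """Return [a, b1, ..., bk]; each row of `rows` holds [X1, ..., Xk]."""
    design = [[1.0] + [float(v) for v in row] for row in rows]
    size = len(design[0])
    # Entry (i, j) is the sum of X_i * X_j, with X_0 = 1 for the constant.
    matrix = [[sum(d[i] * d[j] for d in design) for j in range(size)]
              for i in range(size)]
    rhs = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(size)]
    return solve_linear_system(matrix, rhs)


# Hypothetical data generated from Y = 1 + 2*X1 + 3*X2.
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefficients = fit_multiple(rows, ys)
print(coefficients)
```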

Example of a linear regression question:


The heights and weights of 26 students are listed below.
Height (cm) Weight (kg)
162 48.0
158 46.0
170 58.1
167 53.1
159 46.8
160 47.0
170 63.2
163 52.7
164 59.7
158 47.1
164 58.4
158 46.5
156 46.0
161 58.3
163 50.7
160 50.6
168 60.3
159 47.0
156 46.9
162 49.7
159 46.9
164 56.1
167 58.0
158 47.0
163 56.0
160 49.8
Determine the linear regression between height (X) and weight (Y)!
Solution:
$\sum X_i = 4209$, $\sum Y_i = 1349.8$, $\sum X_i Y_i = 218682.4$,
$\sum X_i^2 = 681777$, $\sum Y_i^2 = 70816.51$, $n = 26$
Calculate the coefficient b for the regression of Y on X:
$$b = \frac{n \sum X_i Y_i - (\sum X_i)(\sum Y_i)}{n \sum X_i^2 - (\sum X_i)^2} = \frac{26(218682.4) - (4209)(1349.8)}{26(681777) - (4209)^2} = 0.42$$

$$a = \bar{Y} - b\bar{X} = \frac{1349.8}{26} - (0.42)\frac{4209}{26} = -16.08$$

The regression of Y on X has the equation:

$$\hat{Y} = -16.08 + 0.42X$$

Calculate the linear regression of X on Y, with the equation $\hat{X} = c + dY$:

$$c = \frac{(\sum X_i)(\sum Y_i^2) - (\sum Y_i)(\sum X_i Y_i)}{n \sum Y_i^2 - (\sum Y_i)^2} = \frac{(4209)(70816.51) - (1349.8)(218682.4)}{26(70816.51) - (1349.8)^2} = 149.94$$

$$d = \frac{n \sum X_i Y_i - (\sum X_i)(\sum Y_i)}{n \sum Y_i^2 - (\sum Y_i)^2} = \frac{26(218682.4) - (4209)(1349.8)}{26(70816.51) - (1349.8)^2} = 0.23$$

The linear regression of X on Y has the equation:

$$\hat{X} = 149.94 + 0.23Y$$
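The arithmetic of this example can be replayed from the summary sums quoted in the solution. The Python sketch below takes those sums as given rather than recomputing them from the table; the variable names are hypothetical. It also shows that the intercept depends on whether $b$ is rounded before it is substituted into $a = \bar{Y} - b\bar{X}$.

```python
# Summary sums as reported in the solution (taken as given).
n = 26
sum_x, sum_y = 4209.0, 1349.8
sum_xx, sum_yy, sum_xy = 681777.0, 70816.51, 218682.4

cross = n * sum_xy - sum_x * sum_y

# Regression of Y on X: Y-hat = a + b*X.
b = cross / (n * sum_xx - sum_x ** 2)
a_unrounded = sum_y / n - b * sum_x / n
a_rounded_b = sum_y / n - round(b, 2) * sum_x / n  # b rounded to 2 decimals first

# Regression of X on Y: X-hat = c + d*Y.
d = cross / (n * sum_yy - sum_y ** 2)
c = sum_x / n - d * sum_y / n

print(round(b, 2), round(a_rounded_b, 2), round(a_unrounded, 2))
print(round(d, 2), round(c, 2))
```

Carrying more decimals of $b$ through the calculation changes the intercept noticeably, because $\bar{X}$ is large relative to $\bar{Y}$; it is therefore safer to round only the final results.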
Simple Linear Regression Analysis


Simple linear regression analysis is used to determine the influence of, or linear relationship between, one
independent variable and one dependent variable. Case example: a student wants to investigate whether
production costs influence the sales level of a company. A sample of 12
months is taken. The data obtained are as follows:

Data on Production Costs and Sales Levels

Production cost (Rp)    Sales level (Rp)
57500000                87600000
50800000                82500000
41300000                76900000
43600000                85400000
48200000                89300000
58400000                92100000
59000000                92600000
46800000                91300000
52900000                95700000
53700000                98300000
50800000                97400000
55400000                99300000
In this case, production cost is the independent variable and sales level is the dependent variable.
A simple linear regression analysis is carried out to determine the effect of production cost
on sales level, followed by the classical regression assumption tests. Analysis steps in
SPSS:
1. Input and define the data.
2. Click Analyze >> Regression >> Linear.
3. Enter the Production Cost variable into the Independent(s) box and the Sales Level variable into
the Dependent box. Next, click the Statistics button. The Statistics dialog box will appear.
4. Check Durbin-Watson, then click the Continue button. In the previous dialog box,
click the Plots button. The Plots dialog box will appear.

5. Enter SRESID in the Y box and ZPRED in the X box, then check Normal probability plot.
Next, click the Continue button, then OK.
Interpretation of Output Results
Output Variables Entered/Removed
The output shows that the independent variable included in the model is Production Cost, the
dependent variable is Sales Level, and no variables were removed. The regression method used is
Enter.

Output Model Summary

R is the multiple correlation, namely the correlation between two or more independent variables and the
dependent variable. In simple regression, R is the simple correlation (Pearson
correlation) between variables X and Y. Here R is 0.580, meaning the correlation between
production cost and sales level is 0.580. This indicates a moderately strong positive relationship;
the closer the value is to 1, the stronger the relationship.
R Square (R²), the square of R, is the coefficient of determination. Expressed
as a percentage, it gives the contribution of the
independent variable to the variation in the dependent variable. The R² value is 0.336, meaning that
production cost accounts for 33.6% of the variation in sales level, while the remainder
is influenced by other variables not included in this model.
Adjusted R Square is the R Square adjusted for the number of independent variables; its value here is 0.270. It also shows the contribution of
the independent variable to the dependent variable. Adjusted R Square is usually used
to measure this contribution when more than two independent variables are used in the
regression. Standard Error of the Estimate is a measure of prediction error, with a value of 5788229.847.
This means that the error in predicting the sales level is Rp 5788229.847.
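The relationship between R, R Square, and Adjusted R Square can be checked with the usual adjustment formula, $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}$, where $k$ is the number of independent variables. This formula is not stated in the text above and is supplied here as an assumption about how the reported value is obtained; the function name is hypothetical.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


r = 0.580                 # R from the Model Summary output
r_squared = r ** 2        # coefficient of determination
adjusted = adjusted_r_squared(r_squared, n=12, k=1)
print(round(r_squared, 3), round(adjusted, 3))
```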
ANOVA output
ANOVA or analysis of variance, namely testing the regression coefficients together (F test) to test the
significance of the influence of several independent variables on the dependent variable. This analysis
is more appropriately applied to multiple regression.
Output Coefficients
Unstandardized Coefficients are coefficient values that have not been standardized;
they are expressed in the units of the dependent variable, for example Rp or %. Coefficient
B consists of a constant (the value of Y if X = 0) and a regression coefficient (the
increase or decrease in variable Y per unit of variable X); these values enter the
linear regression equation. The Standard Error is the estimated standard deviation of a coefficient
estimate, that is, how much the estimate would vary from sample to sample. It is used to find the calculated t by
dividing the coefficient by its standard error.
Standardized Coefficients are coefficient values that have been standardized;
the closer the Beta coefficient is to 0, the weaker the relationship between the variables. To find out
whether the results are significant, the calculated t is compared with the t table value.
Significance is the probability of making an error in reaching a decision.
If the test uses a significance level of 0.05, the maximum chance of error is 5%; in
other words, we are 95% confident that the decision is correct. The regression equation for simple
linear regression is as follows.
$$Y' = a + bX$$
where:
Y': predicted value of the dependent variable
a: constant, namely the value of Y' if X = 0
b: regression coefficient, namely the increase or decrease in variable Y per unit
of variable X
X: independent variable
The values in the output are then entered into the regression equation as follows:
$$Y' = 55414271.26 + 0.685X$$
The meaning of these numbers is as follows:
- The constant (a) is 55414271.26. This means that if production costs are 0,
the sales level is IDR 55414271.26.

- The regression coefficient of the production cost variable (b) is positive, namely 0.685. This means that
for every IDR 1 increase in production costs, the sales level increases by IDR 0.685.
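The constant and regression coefficient reported by SPSS can be reproduced from the twelve monthly observations with the least squares sum formulas. The Python sketch below is only a cross-check under that assumption; the variable names are hypothetical. Integer arithmetic keeps the large sums exact until the final division.

```python
# Monthly production costs (X) and sales levels (Y) from the data table.
cost = [57500000, 50800000, 41300000, 43600000, 48200000, 58400000,
        59000000, 46800000, 52900000, 53700000, 50800000, 55400000]
sale = [87600000, 82500000, 76900000, 85400000, 89300000, 92100000,
        92600000, 91300000, 95700000, 98300000, 97400000, 99300000]

n = len(cost)
sum_x, sum_y = sum(cost), sum(sale)
sum_xx = sum(x * x for x in cost)
sum_xy = sum(x * y for x, y in zip(cost, sale))

denom = n * sum_xx - sum_x ** 2
b = (n * sum_xy - sum_x * sum_y) / denom             # regression coefficient
a = (sum_y * sum_xx - sum_x * sum_xy) / denom        # constant

print(round(a, 2), round(b, 3))

# Predicted sales level for a hypothetical production cost of Rp 50,000,000.
predicted = a + b * 50000000
print(round(predicted, 2))
```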

t test
The t test is used here to find out whether production costs have a significant effect on sales
levels. The test uses a significance level of 0.05 and is two-sided. The test steps are as follows:
1. Formulate the hypotheses
Ho: Production costs have no effect on sales levels.
Ha: Production costs have an effect on sales levels.
2. Determine the t count and significance
From the output, the t count is 2.252 and the significance is 0.048.
3. Determine the t table
The t table value is read from the statistical table at a significance of 0.05/2 = 0.025 with degrees of
freedom df = n − 2 = 12 − 2 = 10, giving a t table value of 2.228 (see the t table
attachment).
4. Testing criteria
If −t table ≤ t count ≤ t table, then Ho is accepted.
If t count < −t table or t count > t table, then Ho is rejected.
5. Based on significance:
If significance > 0.05, then Ho is accepted.
If significance < 0.05, then Ho is rejected.
6. Draw conclusions
Because t count > t table (2.252 > 2.228) and significance < 0.05 (0.048 < 0.05), Ho is
rejected, so it can be concluded that production costs have an effect on sales levels.
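In simple linear regression the t statistic for the slope equals $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, so the t count can be approximated from the reported R alone. The Python sketch below applies the two-sided decision rule with the t table value quoted above; the function names are hypothetical, and the small difference from the SPSS value comes from R being rounded to three decimals.

```python
import math


def t_from_r(r, n):
    """t statistic for testing zero correlation (or zero slope) with n pairs."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def two_sided_decision(t_count, t_table):
    """Apply the two-sided criteria: reject Ho when |t count| exceeds t table."""
    if -t_table <= t_count <= t_table:
        return "Ho is accepted"
    return "Ho is rejected"


t_count = t_from_r(0.580, 12)   # R from the Model Summary, n = 12 months
decision = two_sided_decision(t_count, 2.228)
print(round(t_count, 3), decision)
```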

Classic Regression Assumption Test:


The classical assumption test is a statistical requirement that must be met in multiple linear regression
analysis based on ordinary least squares (OLS). So regression analysis that is not based on OLS does
not require classical assumption requirements, for example logistic regression or ordinal regression .
Likewise, not all classical assumption tests must be carried out in linear regression analysis, for
example the multicollinearity test is not carried out in simple linear regression analysis and the
autocorrelation test does not need to be applied to cross sectional data .
a. Residual normality test
The residual normality test is used to test whether the residuals from the regression are
normally distributed. A good regression model has residuals that are normally
distributed. The method used here is graphical, namely looking at how the points are distributed around the
diagonal line on the Normal P-P Plot of Regression Standardized Residual. As a basis for decision
making, if the points spread around the line and follow the diagonal, the residuals are
normal. The normality test results appear in the regression output and are displayed as
follows:
Conclusion: the graph shows that the points are spread around the line and follow the
diagonal, so the residuals are normally distributed.

b. Autocorrelation Test
Autocorrelation is correlation between observations ordered by time or place. A
good regression model should not have autocorrelation. The test method is the Durbin-Watson test
(DW test), with the following decision criteria:

Durbin-Watson test
Ho: ρ = 0 (no autocorrelation)
Ha: ρ ≠ 0 (there is autocorrelation)
d value: 0.037 (located in the Reject Ho area)
n = 12
k' = number of independent variables, excluding the intercept
dl = 0.971
du = 1.331
4 − du = 4 − 1.331 = 2.669
4 − dl = 4 − 0.971 = 3.029
significance level = 5%
Conclusion: because d = 0.037 is less than dl = 0.971, the DW value lies in the positive autocorrelation region.
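The Durbin-Watson statistic and its decision regions can be sketched as follows. The statistic is $d = \sum (e_t - e_{t-1})^2 / \sum e_t^2$, computed on the residuals in time order. The function names and the four residuals are hypothetical, and the decision rule is the conventional five-region rule built from dl and du.

```python
def durbin_watson(residuals):
    """d = sum of squared successive differences / sum of squared residuals."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def dw_decision(d, dl, du):
    """Classify d using the lower (dl) and upper (du) table bounds."""
    if d < dl:
        return "positive autocorrelation"
    if d > 4 - dl:
        return "negative autocorrelation"
    if du < d < 4 - du:
        return "no autocorrelation"
    return "inconclusive"


# Values reported in the text: d = 0.037, dl = 0.971, du = 1.331.
reported = dw_decision(0.037, dl=0.971, du=1.331)
print(reported)

# Hypothetical residuals that alternate in sign, giving a large d.
d_example = durbin_watson([1.0, -1.0, 1.0, -1.0])
print(d_example, dw_decision(d_example, dl=0.971, du=1.331))
```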

c. Heteroscedasticity Test
Heteroscedasticity means that the residual variance is not the same for all observations in the regression
model. A good regression model should not have heteroscedasticity. Here the heteroscedasticity test is
carried out using the graphical method, namely by looking at the pattern of points on the regression
scatterplot. The criteria for decision making are:
- If there is a certain pattern, such as the points forming a regular pattern (wavy, or widening then
narrowing), then heteroscedasticity occurs.
- If there is no clear pattern and the points spread above and below 0 on the Y axis, then
heteroscedasticity does not occur.
The results of the heteroscedasticity test appear in the regression output and are
displayed as follows:
The output shows that the points do not form a clear pattern and spread above
and below 0 on the Y axis, so it can be concluded that heteroscedasticity does not occur in
the regression model.
CORRELATION ANALYSIS

Correlation analysis is the study of the degree (strength) of the relationship
between two or more variables. The measure of this degree of relationship is called the
correlation coefficient. Simply put, correlation analysis is a way to find out whether there is
a relationship between variables. The correlation coefficient is a number that
shows the direction and strength of the relationship between two or more variables. The
direction is expressed as a positive or negative relationship.

The direction of the relationship is positive, meaning:

• If the value of a variable is increased, it will increase the value of other variables.
• If the value of a variable is decreased, it will decrease the value of other variables.

The direction of the relationship is negative, meaning:

• If the value of a variable is increased, it will decrease the value of other variables.
• If the value of a variable is decreased, it will increase the value of other variables.

Strength of the relationship
• The strength of the relationship is expressed as a number between 0 and 1 (in absolute value).
The number 0 indicates that there is no relationship. The number 1 indicates a
perfect relationship.
• For more details, see the following table of levels of correlation and
strength of relationship:
Example

• If r = -1, it means perfect negative correlation. This indicates that there is an inverse
relationship between variable X and variable Y where if variable X increases, then
variable Y decreases.
• If r = +1, it means perfect positive correlation. This indicates that there is a
unidirectional relationship between variable X and variable Y, where if variable X
increases, variable Y also increases.

Correlation coefficient

• The correlation coefficient has a range from -1 to +1


• The correlation coefficient can be judged from the scatter of the plotted
points for the two variables.
• The smaller the correlation coefficient (in absolute value), the greater the error in making predictions.

Correlation Techniques

The following are guidelines for choosing a correlation technique based on the type of data
used:
Example of a correlation question:
The following table shows the authoritarianism scores and social struggle scores of
12 students:

Student   Authoritarianism   Social struggle
A         82                 42
B         98                 46
C         87                 39
D         40                 37
E         116                65
F         113                88
G         111                86
H         83                 56
I         85                 62
J         126                92
K         106                54
L         117                81

Based on the data, is there a relationship between the authoritarianism
score and the social struggle score of the students? (Test for
significance at α = 0.01.)
Answer :

The following table shows the authoritarianism scores (X) and social struggle scores (Y) of the 12
students, together with the squares and products needed for the calculation:

Student   X      Y     X²       Y²      XY
A         82     42    6724     1764    3444
B         98     46    9604     2116    4508
C         87     39    7569     1521    3393
D         40     37    1600     1369    1480
E         116    65    13456    4225    7540
F         113    88    12769    7744    9944
G         111    86    12321    7396    9546
H         83     56    6889     3136    4648
I         85     62    7225     3844    5270
J         126    92    15876    8464    11592
K         106    54    11236    2916    5724
L         117    81    13689    6561    9477
Total     1164   748   118958   51056   76566

$$r_{xy} = \frac{12(76566) - (1164)(748)}{\sqrt{12(118958) - (1164)^2}\,\sqrt{12(51056) - (748)^2}}$$
$$= \frac{918792 - 870672}{\sqrt{1427496 - 1354896}\,\sqrt{612672 - 559504}} = \frac{48120}{\sqrt{72600}\,\sqrt{53168}} = \frac{48120}{62128.87} = 0.7745$$

$$r_{xy} \approx 0.77$$
$$t_0 = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.7745\sqrt{12 - 2}}{\sqrt{1 - (0.7745)^2}} = 3.87$$
α = 0.01, df = 12 − 2 = 10, t table → t(0.005; 10) = 3.17
Conclusion: reject Ho because $|t_0| > t_{\alpha/2}$, that is, 3.87 > 3.17.
Meaning: there is a significant relationship between the authoritarianism score and the
social struggle score.
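Hand calculations with this many products are easy to get wrong, so it helps to recompute the correlation directly from the twelve score pairs. The Python sketch below does so with the product-moment formula and then forms the t statistic; the variable and function names are hypothetical.

```python
import math

# Authoritarianism (X) and social struggle (Y) scores for students A to L.
x_scores = [82, 98, 87, 40, 116, 113, 111, 83, 85, 126, 106, 117]
y_scores = [42, 46, 39, 37, 65, 88, 86, 56, 62, 92, 54, 81]


def pearson_r(xs, ys):
    """Product-moment correlation computed from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_xx - sum_x ** 2) * math.sqrt(n * sum_yy - sum_y ** 2)
    return numerator / denominator


n = len(x_scores)
sum_xy = sum(x * y for x, y in zip(x_scores, y_scores))
r = pearson_r(x_scores, y_scores)
t0 = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
t_table = 3.17  # t(0.005; 10) as quoted in the text

print(sum_xy, round(r, 4), round(t0, 2))
print("reject Ho" if abs(t0) > t_table else "accept Ho")
```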
Regression and Correlation Hypothesis Testing with Microsoft Excel

Case:
The following is sales data from snack companies:
X : percentage increase in advertising costs
Y: percentage increase in sales results
X 1 2 4 5 7 9 10 12
Y 2 4 5 7 8 10 12 14
Determine the regression equation from the sales data.
Solution:
1. Type the data to be analyzed, then name the columns Advertising Costs and Sales Results.

2. Click the Data menu → Data Analysis → Regression.

3. The Regression dialog box will open.
4. In the Input section, enter the data range for variable Y (Sales Results) and variable
X (Advertising Costs) by selecting the corresponding cells. The other
fields can be left as they are.
5. In the Output section, click the New Worksheet Ply option to display the analysis results
on a new worksheet, or enter an Output Range to display the analysis results on the
same worksheet.
6. In the Residuals section, you can select the Residuals and Residual Plots options to see the
residual values of the model and their plot.

7. Click OK, and the following output will appear.

In general, the results of the regression analysis are
arranged in three tables. The output shows that the constant
(Intercept) is 1.267 and the b coefficient (X Variable) is 1.037. The coefficient of
determination appears in the first table, Regression Statistics, which displays an R
Square value of 0.984.
This means that 98.4% of the variation in the Sales Results variable can be
explained by changes in the Advertising Costs variable, while 1.6% is caused
by other variables that are not observed. From the
regression results, the following regression equation is obtained:
Ŷ = a + bX = 1.267 + 1.037X, where Y is Sales Results and X is Advertising Costs.
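The Excel output can be cross-checked outside the spreadsheet with the same least squares formulas. The Python sketch below recomputes the intercept, the slope, and R Square from the eight data pairs; the variable names are hypothetical, and in simple regression R Square is taken as the square of the correlation between X and Y.

```python
# Percentage increase in advertising costs (X) and in sales results (Y).
advertising = [1, 2, 4, 5, 7, 9, 10, 12]
sales = [2, 4, 5, 7, 8, 10, 12, 14]

n = len(advertising)
sum_x, sum_y = sum(advertising), sum(sales)
sum_xx = sum(x * x for x in advertising)
sum_yy = sum(y * y for y in sales)
sum_xy = sum(x * y for x, y in zip(advertising, sales))

cross = n * sum_xy - sum_x * sum_y
slope = cross / (n * sum_xx - sum_x ** 2)
intercept = sum_y / n - slope * sum_x / n
r_square = cross ** 2 / ((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))

print(round(intercept, 3), round(slope, 3), round(r_square, 3))
```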

Once it is known that there is a relationship (correlation) between the independent variable and
the dependent variable, we can then measure how strong the relationship between the two
variables is. The steps for correlation analysis using Microsoft Excel are as follows.
1. Type the data to be analyzed, then name the columns Advertising Costs and Sales Results.

2. Click the Data menu → Data Analysis → Correlation.

3. The Correlation dialog box will open. In the Input section, enter the data range for the
two variables whose relationship will be tested, namely the Advertising Costs variable and
the Sales Results variable, by selecting the corresponding cells. Under Grouped By, select the
Columns option and check the Labels in first row box to display the label description.
4. In the Output Options section, click the Output Range option and click an empty cell
on the worksheet to display the analysis results on the same worksheet. Then click OK.

5. The correlation analysis will give the following results.

The output shows the correlation between the Advertising Costs variable
and Sales Results. A value of about 0.992 is consistent with the R Square of 0.984 reported in the regression output, since in simple regression R Square is the square of the correlation. Given the large value of the correlation between these two
variables, it can be concluded that Advertising Costs have a very strong and positive
relationship with Sales Results. When advertising costs increase, sales
results tend to increase linearly as well.
