Simple Linear Regression (1)
Business
• Analysis of Data – Methods
• Correlation
• Simple Linear Regression
WHAT IS BUSINESS ANALYTICS?
Analytics is a data-driven decision-making
approach to business problems.
UNIVARIATE ANALYSIS
• The term univariate analysis refers to the analysis of one variable. You
can remember this because the prefix "uni" means "one."
• There are three common ways to perform univariate analysis:
1. Summary Statistics
2. Frequency Distributions
3. Charts
BIVARIATE ANALYSIS
• The purpose of bivariate analysis is to understand the relationship
between two variables.
• There are three common ways to perform bivariate analysis:
1. Scatterplots
2. Correlation Coefficients
3. Simple Linear Regression
MULTIVARIATE ANALYSIS
• The term multivariate analysis refers to the analysis of more than two variables.
• There are three common ways to perform multivariate analysis:
1. Scatterplot Matrix
2. Multiple Linear Regression
3. Pair Plot
Univariate Analysis
1. Summary Statistics
• Measures of central tendency: these numbers describe where the centre
of a dataset is located. Examples include the mean and the median.
• Mean (the average value): 3.8
• Median (the middle value): 4
2. Frequency Distributions
• This allows us to quickly see that the
most frequent household size is 4.
3. Charts
• Boxplot
• Histogram
• Pie Chart
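The first two univariate methods above can be sketched with Python's standard library. The household sizes below are a hypothetical sample I chose so the numbers match the slide's summary values (mean 3.8, median 4, most frequent size 4):

```python
from collections import Counter
from statistics import mean, median

# Hypothetical household-size data (assumed, chosen to be consistent
# with the slide's summary statistics: mean 3.8, median 4, mode 4)
sizes = [1, 2, 4, 4, 4, 3, 5, 6, 4, 5]

# 1. Summary statistics: measures of central tendency
print("Mean:", mean(sizes))      # 3.8
print("Median:", median(sizes))  # 4.0

# 2. Frequency distribution: counts per household size
freq = Counter(sizes)
print("Most frequent size:", freq.most_common(1)[0][0])  # 4
```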
BIVARIATE ANALYSIS
• Two variables:
• (1) Hours spent studying and
• (2) Exam score received by 20 different students:
• 1. Scatterplots
• A scatterplot offers a visual way to perform bivariate analysis. It
allows us to visualize the relationship between two variables by
placing the value of one variable on the x-axis and the value of
the other variable on the y-axis.
CORRELATION
Positive Correlation – When the variables change in the same direction (both increase or decrease in
parallel), we say they are positively correlated. E.g. price of goods and demand, hot weather and cold-drink
consumption, etc.
Negative Correlation – When the variables change in opposite directions (one increases while the other
decreases), we say they are negatively correlated. E.g. alcohol consumption and lifespan, smartphone usage and
battery life, etc.
Zero Correlation – We say the variables are uncorrelated when there is no relationship between them
(correlation = 0). E.g. HR recruits and temperature, paper production and beverages, etc.
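These categories can be checked numerically with Pearson's r, computed from scratch below. As an illustration I reuse the TV-ads data from the regression example later in these slides; r comes out strongly positive:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient from deviations about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# TV ads vs. cars sold (data from the least squares example in these slides)
ads = [1, 3, 2, 1, 3]
cars = [14, 24, 18, 17, 27]
print(round(pearson_r(ads, cars), 4))  # 0.9366 -> strong positive correlation
```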
TYPES OF CORRELATION
STANDARD RANGE OF CORRELATION COEFFICIENT
LINEAR CORRELATION – PEARSON'S CORRELATION COEFFICIENT
• Limitations:
Ø Pearson assumes all features are independent.
Ø Pearson identifies only linear correlations.
• Formula in Excel: =CORREL(array1, array2)
• A scatter plot is a simple graph in which the data of two continuous variables are plotted
against each other.
• It examines the relationship between two variables and checks the degree of
association between them.
• One variable is called the independent variable and the other is called the
dependent variable.
• The degree of association between variables is known as correlation.
• A scatter diagram is one way of finding the extent of the relationship between two
quantitative variables.
• However, this method only indicates that there is a relationship between two
variables, not the extent to which they are related.
REGRESSION
• Linear Regression
• Logistic Regression
Introduction to Regression Analysis
• Predict the value of a dependent variable based on the value of at least one
independent variable.
• Explain the impact of changes in an independent variable on the dependent
variable.
• The variable being predicted is called the dependent variable and is denoted by y.
LINEAR REGRESSION
Regression is a statistical measurement that attempts to determine the strength of
the relationship between a dependent variable and a series of independent variables.
Linear regression is a simple statistical regression method used for
predictive analysis; it models the relationship between continuous variables.
Linear regression shows the linear relationship between the independent variable
(X-axis) and the dependent variable (Y-axis).
If there is a single input variable (x), the model is called simple linear
regression; if there is more than one input variable, it is called multiple
linear regression.
SIMPLE LINEAR REGRESSION MODEL
y = β0 + β1x + ε
where:
β0 and β1 are called parameters of the model,
ε is a random variable called the error term.
Simple Linear Regression Equation
E(y) = β0 + β1x
Estimated Simple Linear Regression Equation
ŷ = b0 + b1x
The simple linear regression equation is the theoretical form with unknown population parameters (β0 and β1).
The estimated simple linear regression equation is the practical, data-driven model in which we estimate those
parameters as b0 and b1 using sample data.
ESTIMATION PROCESS
• Sample statistics b0 and b1 provide estimates of the population parameters β0 and β1.
• Estimated regression equation: ŷ = b0 + b1x
SIMPLE LINEAR REGRESSION EQUATION – POSSIBLE RELATIONSHIPS
• Positive linear relationship: the slope β1 is positive (E(y) increases as x increases).
• Negative linear relationship: the slope β1 is negative (E(y) decreases as x increases).
• No relationship: the slope β1 is zero (the regression line is horizontal).
INTERCEPT AND SLOPE – EXAMPLE
• The intercept (often labelled the constant) is the expected mean value of
Y when all X = 0.
• Start with a regression equation with one predictor, X.
• If X can equal 0, the intercept is simply the expected mean
value of Y at that value.
• If X never equals 0, the intercept has no meaningful interpretation.
• The slope indicates the steepness of the line: it is the rate of change
in y as x changes.
• Example: y = 2 + 5x. The slope is +5: when x increases by 1, y increases by 5. The
y-intercept is 2.
• Example: y = 7.2 − 0.4x. The slope is −0.4: when x increases by 1, y decreases by 0.4.
The y-intercept is 7.2.
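A minimal sketch of the two example lines (the equations are reconstructed from the slide's slope and intercept values), showing how the slope controls the change in y per unit change in x:

```python
def line(b0, b1):
    """Return y = b0 + b1*x as a function of x."""
    return lambda x: b0 + b1 * x

f = line(2, 5)       # intercept 2, slope +5
g = line(7.2, -0.4)  # intercept 7.2, slope -0.4

print(f(0), f(1))  # 2 7   -> y rises by 5 per unit of x
print(g(0), g(1))  # 7.2 6.8 -> y falls by 0.4 per unit of x
```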
LEAST SQUARES METHOD
The Least Squares Method is a mathematical approach used in regression analysis to find the best-
fitting line (or curve) for a set of data points by minimizing the sum of the squares of the differences
(the "residuals") between the observed values and the values predicted by the model.
The least squares method is a procedure for using sample data to find the estimated regression
equation:
min Σ(yi − ŷi)²
where:
yi = observed value of the dependent variable for the ith observation
ŷi = estimated value of the dependent variable for the ith observation
LEAST SQUARES METHOD
Calculate the slope and intercept for the estimated regression equation.
• Slope:
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
where:
xi = value of the independent variable for the ith observation
yi = value of the dependent variable for the ith observation
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable
• Intercept:
b0 = ȳ − b1x̄
• Estimated regression equation: ŷ = b0 + b1x
Simple Linear Regression – Example
TV Ads (x)   Cars Sold (y)   x − x̄   y − ȳ   (x − x̄)(y − ȳ)   (x − x̄)²
    1             14           −1      −6            6              1
    3             24            1       4            4              1
    2             18            0      −2            0              0
    1             17           −1      −3            3              1
    3             27            1       7            7              1
Σx = 10, Σy = 100, x̄ = 2, ȳ = 20
Σ(x − x̄)(y − ȳ) = 20, Σ(x − x̄)² = 4
ESTIMATED REGRESSION EQUATION
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 20/4 = 5
b0 = ȳ − b1x̄ = 20 − 5(2) = 10
ŷ = 10 + 5x
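The least squares calculation above can be sketched in a few lines of Python; the TV-ads data reproduces the slope b1 = 5 (and the intercept b0 = ȳ − b1x̄ = 10):

```python
def least_squares(x, y):
    """Estimate intercept b0 and slope b1 by the least squares formulas."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
          / sum((xi - x_mean) ** 2 for xi in x))
    b0 = y_mean - b1 * x_mean
    return b0, b1

ads = [1, 3, 2, 1, 3]
cars = [14, 24, 18, 17, 27]
b0, b1 = least_squares(ads, cars)
print(f"y-hat = {b0} + {b1}x")  # y-hat = 10.0 + 5.0x
```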
FIND THE BEST FIT LINE
• When working with linear regression, our main goal is to find the best
fit line, meaning the error between predicted and actual values should
be minimized. The best fit line will have the least error.
• For Linear Regression, we use the Mean Squared Error (MSE) cost
function, which is the average squared error between the predicted and
actual values.
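A sketch of the MSE cost function, applied here to the TV-ads line ŷ = 10 + 5x fitted earlier (b1 = 5, b0 = ȳ − b1x̄ = 10):

```python
def mse(x, y, b0, b1):
    """Mean squared error between observed y and predictions b0 + b1*x."""
    errors = [(yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)]
    return sum(errors) / len(errors)

ads = [1, 3, 2, 1, 3]
cars = [14, 24, 18, 17, 27]
print(mse(ads, cars, b0=10, b1=5))  # 2.8
```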
• Multiple R. The Correlation Coefficient, which measures the strength of the linear relationship
between two variables. It can take any value between −1 and 1, and its absolute value indicates
the strength of the relationship: the larger the absolute value, the stronger the relationship.
• R Square. The Coefficient of Determination, used as an indicator of goodness of
fit. It gives the proportion of the variation in the dependent variable that is explained by the
regression line, calculated from the sums of squared deviations of the data from the mean.
Its range is 0 to 1.
• Adjusted R Square. R square adjusted for the number of independent variables in
the model. Use this value instead of R square for multiple regression analysis.
R² = 0 means there is no linear relationship between the predictor variable x and the response
variable y; R² = 1 means there is a perfect linear relationship between x and y.
Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
where:
SST = total sum of squares = Σ(yi − ȳ)²
SSR = sum of squares due to regression = Σ(ŷi − ȳ)²
SSE = sum of squares due to error = Σ(yi − ŷi)²
Relationship
SST = SSR + SSE
Coefficient of Determination
r2 = SSR/SST
where:
SSR = sum of squares due to regression
SST = total sum of squares
Coefficient of Determination (TV ads example)
r² = SSR/SST = 0.8772
Correlation coefficient: rxy = (sign of b1) · √r² = +0.9366
SUMS OF SQUARES – RESTAURANT EXAMPLE
SST (Total Sum of Squares) = Σ(yi − ȳ)²
SSE (Error Sum of Squares) = Σ(yi − ŷi)²
SSR (Sum of Squares due to Regression) = Σ(ŷi − ȳ)²
SST = SSR + SSE, so SSR = SST − SSE
Population and sales data:
Restaurant   Population (x)   Sales (y)   y − ȳ   (y − ȳ)²
    1              2              58       −72       5184
    2              6             105       −25        625
    3              8              88       −42       1764
    4              8             118       −12        144
    5             12             117       −13        169
    6             16             137         7         49
    7             20             157        27        729
    8             20             169        39       1521
    9             22             149        19        361
   10             26             202        72       5184
ȳ = 130, SST = 15730
SSR = SST − SSE = 15730 − 1530 = 14200
R² = SSR/SST = 14200/15730 = 0.9027
Correlation Coefficient (Multiple R) = √0.9027 = 0.9501
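The sums of squares for the restaurant example can be verified directly. The fitted line ŷ = 60 + 5x is an assumption on my part, derived by applying the least squares formulas to this data:

```python
pop   = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

y_mean = sum(sales) / len(sales)    # 130.0
y_hat  = [60 + 5 * x for x in pop]  # fitted values from y-hat = 60 + 5x

sst = sum((y - y_mean) ** 2 for y in sales)              # total sum of squares
ssr = sum((yh - y_mean) ** 2 for yh in y_hat)            # regression sum of squares
sse = sum((y - yh) ** 2 for y, yh in zip(sales, y_hat))  # error sum of squares

print(sst, ssr, sse)        # 15730.0 14200.0 1530
print(round(ssr / sst, 4))  # 0.9027  -> R squared
```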
ADJUSTED R-SQUARED
Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1)
where df = n − k − 1, i.e. n = df + k + 1 (n = number of observations, k = number of independent variables).
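A sketch of the adjusted R² adjustment, checked against the Excel summary output that appears at the end of these slides (R² = 0.288889488, n = 25, k = 1):

```python
def adjusted_r2(r2, n, k):
    """Adjust R squared for the number of independent variables k.

    Uses df = n - k - 1 in the denominator, so n = df + k + 1.
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Values taken from the SUMMARY OUTPUT slide later in this deck
print(round(adjusted_r2(0.288889488, n=25, k=1), 4))  # 0.258
```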
Model Assumptions
In a simple linear regression equation, the mean or expected value of y is a linear function of x:
E(y) = β0 + β1x.
If the value of β1 is zero, E(y) = β0 and y does not depend on x; if β1 is not equal to zero, we conclude
that the two variables are related.
Thus, to test for a significant regression relationship, we conduct a hypothesis test to determine whether
the value of β1 is zero.
Two tests are commonly used, and both require an estimate of the variance of ε in the regression model:
• t-test
• F-test
TESTING FOR SIGNIFICANCE
• An estimate of σ²:
s² = MSE = SSE/(n − 2)
where:
SSE = Σ(yi − ŷi)² = Σ(yi − b0 − b1xi)²
TESTING FOR SIGNIFICANCE
• An estimate of σ:
s = √MSE = √(SSE/(n − 2))
Estimate of variance (Mean Square Error), restaurant example:
s² = MSE = SSE/(n − 2) = 1530/8 = 191.25
s = √191.25 = 13.829
Mean Square due to Regression:
MSR = SSR / (number of independent variables)
Testing for Significance: F Test
Hypotheses:
H0: β1 = 0
Ha: β1 ≠ 0
Test statistic:
F = MSR/MSE
Testing for Significance: F Test
Rejection rule: reject H0 if
p-value < α or F > Fα
where:
Fα is based on an F distribution with
1 degree of freedom in the numerator and
n − 2 degrees of freedom in the denominator.
F Test
F = MSR/MSE = 14200/191.25 = 74.25
The F distribution table shows that, with one degree of freedom in the numerator and n − 2 = 10 − 2 = 8 degrees of
freedom in the denominator, F = 11.26 provides an area of 0.01 in the upper tail. The area in the upper tail
of the F distribution corresponding to the test statistic F = 74.25 must therefore be less than 0.01, so the
p-value must be less than 0.01.
Because the p-value is less than α = 0.01, we reject H0 and conclude that a significant relationship exists
between the size of the student population and quarterly sales.
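The F statistic above can be recomputed from the sums of squares already derived for the restaurant example (SSR = 14200, SSE = 1530, n = 10, one independent variable):

```python
ssr, sse = 14200, 1530
n, k = 10, 1           # observations, independent variables

msr = ssr / k          # mean square due to regression
mse = sse / (n - 2)    # mean square error = 191.25
f_stat = msr / mse

print(mse, round(f_stat, 2))  # 191.25 74.25
```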
TESTING FOR SIGNIFICANCE: T TEST
• Hypotheses:
H0: β1 = 0
Ha: β1 ≠ 0
• Test statistic:
t = b1 / sb1, where sb1 = s / √Σ(xi − x̄)²
Testing for Significance: t Test
Rejection rule: reject H0 if p-value < α or |t| ≥ tα/2
where:
tα/2 is based on a t distribution with n − 2 degrees of freedom.
Estimated standard deviation of b1 (restaurant example):
sb1 = s / √Σ(xi − x̄)² = 13.829/√568 = 0.5803
t = b1/sb1 = 5/0.5803 = 8.62
The t distribution table shows that, with n − 2 = 10 − 2 = 8 degrees of freedom,
t = 3.355 provides an area of .005 in the upper tail. Since 8.62 > 3.355, the
p-value is less than 0.01.
So reject H0.
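The t statistic follows from the same numbers; Σ(xi − x̄)² = 568 is computed here directly from the restaurant population data:

```python
from math import sqrt

b1 = 5
mse = 1530 / (10 - 2)  # mean square error = 191.25
s = sqrt(mse)          # estimate of sigma, about 13.829

x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
x_mean = sum(x) / len(x)                   # 14.0
ssx = sum((xi - x_mean) ** 2 for xi in x)  # 568.0

s_b1 = s / sqrt(ssx)   # estimated standard deviation of b1
t_stat = b1 / s_b1

print(round(s_b1, 4), round(t_stat, 2))  # 0.5803 8.62
```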
CONFIDENCE INTERVAL FOR β1
Actual and Predicted Values (ŷ = 60 + 5x)
Restaurant   Population (x)   Sales (y)   Predicted (ŷ)
    1              2              58            70
    2              6             105            90
    3              8              88           100
    4              8             118           100
    5             12             117           120
    6             16             137           140
    7             20             157           160
    8             20             169           160
    9             22             149           170
   10             26             202           190
QUESTIONS 1
X Y
34 43
42 50
70 90
110 130
250 315
1. Find the values of b0 and b1 for the given values of X and Y for simple linear
regression.
2. Create the equation for simple linear regression using the b0 and b1 values.
3. Find the value of Y for a given value of X = 140.
Question - A telecommunications company conducted a study on the annual revenue and the
number of cell towers operated by five network providers. The aim is to determine if annual
revenue can be predicted based on the number of cell towers in service.
Questions:
(a) Develop a regression equation to estimate the annual revenue based on the number of cell towers
in service.
(b) Is there any positive association between annual revenue and the number of cell towers in
service? Explain.
(c) Forecast the annual revenue if the number of cell towers in service is 12.
Question
The management team of an energy company is evaluating the effectiveness of their marketing
campaigns by analyzing the relationship between the advertising expenses and annual revenue. They
calculate two key metrics:
If a regression equation is given by y=2+3x, what will be the predicted value of y when x=4?
•A) 12
•B) 14
•C) 10
•D) 8
If you calculate an R2 value of 0.9 for a simple linear regression model, how would you
interpret this?
•A) 90% of the variance in the dependent variable is explained by the independent variable.
•B) There is a 90% chance that the model is correct.
•C) 90% of the independent variable is explained by the dependent variable.
•D) There is no association between the variables.
ANOVA
              df    SS             MS             F             Significance F
Regression     1     659.7773672    659.7773672    9.343777228   0.005591082
Residual      23    1624.062633      70.61141882
Total         24    2283.84
Regression Statistics
Multiple R          = √(R Square)
R Square            = SSR/SST = 1 − (SSE/SST)
Adjusted R Square   = R Square adjusted for the number of independent variables
Standard Error      = √MSE
Observations        = 25
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.537484407
R Square            0.288889488
Adjusted R Square   0.257971639
Standard Error      8.403060087
Observations        25

ANOVA
              df    SS             MS             F             Significance F
Regression     1     659.7773672    659.7773672    9.343777228   0.005591082
Residual      23    1624.062633      70.61141882
Total         24    2283.84

            Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%
Intercept   33.56672       21.1197553       1.589351517   0.125634061   −10.1228274   77.25626
Promote     0.679119       0.222169496      3.056759269   0.005591082   0.219526048   1.138711