Simple Linear Regression (1)

The document discusses business analytics (BA) as a data-driven approach to decision-making, focusing on data analysis methods including univariate, bivariate, and multivariate analyses. It explains correlation types and regression analysis, particularly simple linear regression, as tools for understanding relationships between variables and predicting outcomes. Key concepts include the use of statistical models to identify trends and make data-informed business decisions.

Business Analytics - II

• Analysis of Data – Methods
• Correlation
• Simple Linear Regression
WHAT IS BA?

• Analytics is the data-driven decision-making approach for a business problem.
• Data analysis includes data description, data inference, and the search for relationships in data.
• Business analytics (BA) is a set of disciplines and technologies for solving business problems using data analysis and statistical models.
• BA takes in and processes historical business data, analyses that data to identify trends, patterns, and root causes, and makes data-driven business decisions based on those insights.
• It is the science of analysing data to find patterns that help in developing strategies.
ANALYSIS OF DATA
• Univariate analysis
• Bivariate analysis
• Multivariate analysis

• The term univariate analysis refers to the analysis of one variable. You can remember this because the prefix "uni" means "one."
• The purpose of univariate analysis is to understand the distribution of values for a single variable.
• Bivariate analysis: the analysis of two variables.
• Multivariate analysis: the analysis of more than two variables.
UNIVARIATE ANALYSIS
• There are three common ways to perform univariate analysis:
1. Summary statistics
2. Frequency distributions
3. Charts

BIVARIATE ANALYSIS
• The purpose of bivariate analysis is to understand the relationship between two variables.
• There are three common ways to perform bivariate analysis:
1. Scatterplots
2. Correlation coefficients
3. Simple linear regression

MULTIVARIATE ANALYSIS
• The term multivariate analysis refers to the analysis of more than two variables.
• There are three common ways to perform multivariate analysis:
1. Scatterplot matrix
2. Multiple linear regression
3. Pair plot
Univariate Analysis

• Suppose we choose to perform univariate analysis on the variable Household Size:

1. Summary statistics
• Measures of central tendency: these numbers describe where the centre of a dataset is located. Examples include the mean and the median.
• Mean (the average value): 3.8
• Median (the middle value): 4

2. Frequency distributions
• A frequency distribution allows us to quickly see that the most frequent household size is 4.

3. Charts
• Boxplot
• Histogram
• Pie chart
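As a quick illustration of these summary statistics, here is a minimal Python sketch. The household_sizes list is hypothetical (not the slide's data), chosen so that it reproduces the values above (mean 3.8, median 4, most frequent size 4):

```python
from statistics import mean, median
from collections import Counter

# Hypothetical household sizes chosen to reproduce the slide's summary
# statistics: mean = 3.8, median = 4, most frequent size = 4.
household_sizes = [1, 2, 3, 4, 4, 4, 4, 5, 5, 6]

print("Mean:", mean(household_sizes))                   # 3.8
print("Median:", median(household_sizes))               # 4
print("Frequency distribution:", Counter(household_sizes))
```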
BIVARIATE ANALYSIS

• Suppose we have a dataset of (X, Y) pairs.
• If we plotted these (X, Y) pairs on a scatterplot, it would look like the chart on the slide.
• Based on the scatterplot we can tell that there is a positive association between variables X and Y: when X increases, Y tends to increase as well.
BIVARIATE ANALYSIS

• Two variables: (1) hours spent studying and (2) exam score received, for 20 different students.

1. Scatterplots
• A scatterplot offers a visual way to perform bivariate analysis. It allows us to visualize the relationship between two variables by placing the value of one variable on the x-axis and the value of the other variable on the y-axis.
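To make this concrete, here is a minimal matplotlib sketch. The hours/scores arrays are hypothetical stand-ins, not the slide's 20-student dataset:

```python
import matplotlib.pyplot as plt

# Hypothetical (hours studied, exam score) pairs for illustration only.
hours = [1, 2, 2, 3, 4, 5, 6, 7, 8, 9]
scores = [52, 55, 60, 61, 66, 70, 74, 79, 83, 88]

plt.scatter(hours, scores)                 # one point per student
plt.xlabel("Hours spent studying (x)")
plt.ylabel("Exam score (y)")
plt.title("Scatterplot for bivariate analysis")
plt.show()
```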
CORRELATION

• Correlation is a statistical tool used to measure the relationship between two or more variables, i.e. the degree to which the variables are associated with each other. In simpler words, it measures the closeness of the relationship. For example, price and supply, demand and supply, and income and expenditure are correlated.
TYPES OF CORRELATION

Positive correlation – When the variables change in the same direction (both increase or both decrease in parallel), we call them positively correlated. For example, price of a good and its supply, hot weather and cold-drink consumption, etc.

Negative correlation – When the variables change in opposite directions (one increases while the other decreases), we call them negatively correlated. For example, alcohol consumption and lifespan, smartphone usage and battery life, etc.

Zero correlation – When there is no relationship between the variables (correlation = 0), we call them uncorrelated. For example, HR recruits and temperature, paper production and beverages, etc.
STANDARD RANGE OF CORRELATION COEFFICIENT
(Figure: the correlation coefficient ranges from −1 to +1.)
LINEAR CORRELATION – PEARSON'S CORRELATION COEFFICIENT

• Used to measure the strength of association between two continuous features.
• Both positive and negative correlations are useful.

• Steps:
  - Compute Pearson's correlation coefficient for each feature.
  - Sort according to the score.
  - Retain the highest-ranked features; discard the lowest ranked.

• Limitations:
  - Pearson assumes all features are independent.
  - Pearson identifies only linear correlations.

• Formula in Excel:
CORREL(First Array, Second Array), e.g. CORREL(A1:A5, B1:B5)
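The same coefficient can be computed outside Excel. Here is a minimal Python sketch using NumPy, with the X/Y values from Question 1 later in the deck used as sample arrays:

```python
import numpy as np

# Sample arrays standing in for the Excel ranges A1:A5 and B1:B5
# (values taken from Question 1 later in the deck).
a = np.array([34, 42, 70, 110, 250])
b = np.array([43, 50, 90, 130, 315])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry
# is Pearson's r, the same value Excel's CORREL would return.
r = np.corrcoef(a, b)[0, 1]
print(f"Pearson's r = {r:.4f}")
```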
Correlational Analysis – Scatter Diagram

• A scatter plot is a simple graph in which the data of two continuous variables are plotted against each other.
• It examines the relationship between two variables and checks the degree of association between them.
• One variable is called the independent variable and the other is called the dependent variable.
• The degree of association between variables is known as correlation.
• The scatter diagram is one way of finding the extent of the relationship between two quantitative variables.
• However, this method only indicates that there is a relationship between two variables, not the extent to which they are related.
Regression

• Linear Regression
• Logistic Regression
Introduction to Regression Analysis

• Regression analysis is a form of predictive modelling technique.
• It estimates the relationships between two or more variables: how does the dependent variable change when one of the independent variables changes?
• It predicts the value of a dependent variable based on the value of at least one independent variable.
• It explains the impact of changes in an independent variable on the dependent variable.
• The variable being predicted is called the dependent variable and is denoted by y.
• The variables being used to predict the value of the dependent variable are called the independent variables and are denoted by x.

LINEAR REGRESSION

Regression is a statistical measurement that attempts to determine the strength of the relationship between a dependent variable and a series of independent variables.

Linear regression is a simple statistical regression method used for predictive analysis; it shows the relationship between continuous variables, i.e. the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis).

If there is a single input variable (x), the model is called simple linear regression; if there is more than one input variable, it is called multiple linear regression.
SIMPLE LINEAR REGRESSION MODEL

• The equation that describes how y is related to x and an error term is called the regression model.
• The simple linear regression model is:

y = β0 + β1x + ε

where:
β0 and β1 are called parameters of the model,
ε is a random variable called the error term.
Simple Linear Regression Equation

• The simple linear regression equation is:

E(y) = β0 + β1x

• The graph of the regression equation is a straight line.
• β0 is the y-intercept of the regression line.
• β1 is the slope of the regression line.
• E(y) is the expected value of y for a given x value.
Estimated Simple Linear Regression Equation

• The estimated simple linear regression equation is:

ŷ = b0 + b1x

• The graph is called the estimated regression line.
• b0 is the y-intercept of the line.
• b1 is the slope of the line.
• ŷ is the estimated value of y for a given x value.

• The simple linear regression model is the theoretical form with unknown population parameters β0 and β1.
• The estimated simple linear regression equation is the practical, data-driven model in which we estimate those parameters as b0 and b1 using sample data.
ESTIMATION PROCESS

Regression model: y = β0 + β1x + ε
Regression equation: E(y) = β0 + β1x
Unknown parameters: β0, β1

Sample data: (x1, y1), (x2, y2), ..., (xn, yn)

Sample statistics b0 and b1 give the estimated regression equation ŷ = b0 + b1x;
b0 and b1 provide estimates of β0 and β1.
SIMPLE LINEAR REGRESSION EQUATION

• Positive linear relationship: the regression line E(y) = β0 + β1x slopes upward; the slope β1 is positive.
• Negative linear relationship: the regression line slopes downward; the slope β1 is negative.
• No relationship: the regression line is horizontal; the slope β1 is 0.
INTERCEPT AND SLOPE – EXAMPLE

• The intercept (often labelled the constant) is the expected mean value of Y when all X = 0.
• Start with a regression equation with one predictor, X.
• If X sometimes equals 0, the intercept is simply the expected mean value of Y at that value.
• If X never equals 0, then the intercept has no practical meaning.
• The slope indicates the steepness of the line: it is the rate of change in y as x changes.
• Example: for the line y = 2 + 5x, the slope is +5 (when x increases by 1, y increases by 5) and the y-intercept is 2.
• Example: for the line y = 7.2 − 0.4x, the slope is −0.4 (when x increases by 1, y decreases by 0.4) and the y-intercept is 7.2.
LEAST SQUARES METHOD

The least squares method is a mathematical approach used in regression analysis to find the best-fitting line (or curve) for a set of data points by minimizing the sum of the squares of the differences (the "residuals") between the observed values and the values predicted by the model.

The least squares method is a procedure for using sample data to find the estimated regression equation.

Example: estimate the salary of an employee based on years of experience.
Here, years of experience is the independent variable and the salary of the employee is the dependent variable, so y = salary and x = experience.
LEAST SQUARES METHOD

• Least squares criterion:

min Σ(yᵢ − ŷᵢ)²

where:
yᵢ = observed value of the dependent variable for the ith observation
ŷᵢ = estimated value of the dependent variable for the ith observation
LEAST SQUARES METHOD
Calculate the slope and intercept for the estimated regression equation.

• Slope for the estimated regression equation:

b1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

where:
xᵢ = value of the independent variable for the ith observation
yᵢ = value of the dependent variable for the ith observation
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable
Least Squares Method

• y-intercept for the estimated regression equation:

b0 = ȳ − b1x̄

• Substituting b0 and b1 gives the estimated regression equation ŷ = b0 + b1x.
Simple Linear Regression

 Example: Reed Auto Sales

Reed Auto periodically has a special week-long sale.


As part of the advertising campaign Reed runs one or
more television commercials during the weekend
preceding the sale. Data from a sample of 5 previous
sales are shown on the next slide.
Simple Linear Regression

• Example: Reed Auto Sales

TV Ads (x)  Cars Sold (y)  x − x̄  y − ȳ  (x − x̄)(y − ȳ)  (x − x̄)²
    1            14          −1     −6          6             1
    3            24           1      4          4             1
    2            18           0     −2          0             0
    1            17          −1     −3          3             1
    3            27           1      7          7             1

Σx = 10, Σy = 100, x̄ = 2, ȳ = 20
Σ(xᵢ − x̄)(yᵢ − ȳ) = 20, Σ(xᵢ − x̄)² = 4
ESTIMATED REGRESSION EQUATION

• Slope for the estimated regression equation:

b1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = 20 / 4 = 5

• y-intercept for the estimated regression equation:

b0 = ȳ − b1x̄ = 20 − 5(2) = 10

• Estimated regression equation:

ŷ = 10 + 5x

Using this equation we can find the estimated value of y for any x.

(R output of the fitted model shown on the slide.)
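For reference, here is a minimal Python sketch of the least squares formulas above, applied to the Reed Auto data to reproduce b1 = 5 and b0 = 10:

```python
import numpy as np

# Reed Auto data from the slides: TV ads (x) and cars sold (y).
x = np.array([1, 3, 2, 1, 3])
y = np.array([14, 24, 18, 17, 27])

x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates from the formulas above.
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

print(f"b1 = {b1}, b0 = {b0}")                      # b1 = 5.0, b0 = 10.0
print("Prediction for x = 3 ads:", b0 + b1 * 3)     # 25.0 cars
```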
ANOTHER EXAMPLE – SALES DEPEND ON THE POPULATION

Population and Sales data


Restaurant Population Sales
1 2 58
2 6 105
3 8 88
4 8 118
5 12 117
6 16 137
7 20 157
8 20 169
9 22 149
10 26 202
Population and Sales data

Restaurant  Population (x)  Sales (y)  x − x̄  y − ȳ  (x − x̄)(y − ȳ)  (x − x̄)²
    1             2            58       −12    −72        864           144
    2             6           105        −8    −25        200            64
    3             8            88        −6    −42        252            36
    4             8           118        −6    −12         72            36
    5            12           117        −2    −13         26             4
    6            16           137         2      7         14             4
    7            20           157         6     27        162            36
    8            20           169         6     39        234            36
    9            22           149         8     19        152            64
   10            26           202        12     72        864           144
Mean: x̄ = 14, ȳ = 130        Sum: Σ(xᵢ − x̄)(yᵢ − ȳ) = 2840, Σ(xᵢ − x̄)² = 568

b1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = 2840 / 568 = 5
b0 = ȳ − b1x̄ = 130 − 5 × 14 = 60

Estimated regression equation: ŷ = 60 + 5x

Based on this equation, find the predicted value of Sales for each restaurant, e.g.:
ŷ1 = 60 + 5 × 2 = 70 (actual value is 58)
ŷ6 = 60 + 5 × 16 = 140 (actual value is 137)
Find the predicted value for all restaurants.
FIND THE BEST FIT LINE

• When working with linear regression, our main goal is to find the best-fit line, meaning the error between the predicted and actual values should be minimized. The best-fit line has the least error.
• For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average squared error between the predicted and actual values.

(Graph of the regression equation ŷ = 60 + 5x.)
DATA ANALYSIS TOOL FOR REGRESSION (EXCEL)

1. On the Data tab, in the Analysis group, click Data Analysis.
2. Select Regression and click OK.
3. Select the Y Range. This is the response variable (also called the dependent variable).
4. Select the X Range. These are the explanatory variables (also called independent variables). These columns must be adjacent to each other.
5. Check Labels.
6. Click in the Output Range box and select a cell.
7. Check Residuals.
8. Click OK.

Excel produces the Summary Output (rounded to 3 decimal places).
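An equivalent summary can be produced outside Excel. This is a hedged sketch using the statsmodels library (not used in the deck) on the Population and Sales data; its summary table reports R-squared, the standard error, and ANOVA-style statistics comparable to Excel's output:

```python
import numpy as np
import statsmodels.api as sm

# Population and Sales data from the slides.
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

X = sm.add_constant(x)        # adds the intercept column
model = sm.OLS(y, X).fit()    # ordinary least squares fit

print(model.params)           # intercept ~ 60, slope ~ 5
print(model.summary())        # R-squared, standard error, F statistic, etc.
```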
INFERENCE FROM THE REGRESSION MODEL

• Multiple R: the correlation coefficient, which measures the strength of the linear relationship between two variables. The correlation coefficient can be any value between −1 and 1, and its absolute value indicates the strength of the relationship: the larger the absolute value, the stronger the relationship.
• R Square: the coefficient of determination, used as an indicator of goodness of fit. It shows the proportion of the variation in the dependent variable that is explained by the regression line, and is calculated from the sums of squares (the squared deviations of the data from the mean). The range is 0 to 1.
• Adjusted R Square: the R square adjusted for the number of independent variables in the model. Use this value instead of R square for multiple regression analysis.
• R-squared increases even if an added independent variable is insignificant. Adjusted R-squared increases only when the independent variable is significant and affects the dependent variable. Adjusted R² is always less than or equal to R².
• As the sample size increases, the difference between adjusted R-squared and R-squared shrinks.
• Standard Error: another goodness-of-fit measure that shows the precision of the regression analysis; the smaller the number, the more certain you can be about the regression equation. While R² represents the percentage of the variance of the dependent variable that is explained by the model, the standard error is an absolute measure that shows the average distance by which the data points fall from the regression line.
Coefficient of Determination (R Square)

• The coefficient of determination measures the proportion of the variance in the response variable y that can be predicted using the predictor variable x. It shows the accuracy of the regression line.
• It measures the goodness of fit.
• The value of the coefficient of determination varies from 0 to 1: 0 means there is no linear relationship between the predictor variable x and the response variable y, and 1 means there is a perfect linear relationship between x and y.
• R² is calculated using sums of squares (SS).

Types of sums of squares:
1. Regression sum of squares (SSR)
2. Residual (error) sum of squares (SSE)
3. Total sum of squares (SST)

r² = SSR / SST
COEFFICIENT OF DETERMINATION

• Relationship among SST, SSR, and SSE:

SST = SSR + SSE
Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)²

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination

 The coefficient of determination is:

r2 = SSR/SST

where:
SSR = sum of squares due to regression
SST = total sum of squares
Coefficient of Determination

r² = SSR/SST = 100/114 = 0.8772

The regression relationship is very strong; 87.72% of the variability in the number of cars sold can be explained by the linear relationship between the number of TV ads and the number of cars sold.

SST = SSR + SSE = 100 + 14 = 114

Sample Correlation Coefficient

r_xy = (sign of b1) √(coefficient of determination) = (sign of b1) √r²

where b1 is the slope of the estimated regression equation ŷ = b0 + b1x.

The sign of b1 in the equation ŷ = 10 + 5x is "+", so:

r_xy = +√0.8772 = +0.9366
SSE (Error Sum of Squares)

Population and Sales data

Restaurant  Population (x)  Sales (y)  Predicted Sales (ŷ = 60 + 5x)  Error (y − ŷ)  Squared Error
    1             2            58               70                       −12            144
    2             6           105               90                        15            225
    3             8            88              100                       −12            144
    4             8           118              100                        18            324
    5            12           117              120                        −3              9
    6            16           137              140                        −3              9
    7            20           157              160                        −3              9
    8            20           169              160                         9             81
    9            22           149              170                       −21            441
   10            26           202              190                        12            144

SSE = Σ(yᵢ − ŷᵢ)² = 1530
SST (Total Sum of Squares)

Population and Sales data

Restaurant  Population (x)  Sales (y)  y − ȳ  (y − ȳ)²
    1             2            58       −72     5184
    2             6           105       −25      625
    3             8            88       −42     1764
    4             8           118       −12      144
    5            12           117       −13      169
    6            16           137         7       49
    7            20           157        27      729
    8            20           169        39     1521
    9            22           149        19      361
   10            26           202        72     5184

ȳ = 130, SST = Σ(yᵢ − ȳ)² = 15730

SST = SSR + SSE, so SSR = SST − SSE = 15730 − 1530 = 14200

R² = SSR/SST = 14200/15730 = 0.9027
Correlation Coefficient (Multiple R)

Correlation coefficient = (sign of slope b1) √(coefficient of determination) = (sign of b1) √R²
= +√0.9027 = +0.9501

R² = SSR/SST = 14200/15730 = 0.9027
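These sums of squares are easy to verify directly. A minimal Python sketch for the Population and Sales example, reproducing SSE = 1530, SST = 15730, R² ≈ 0.9027, and r ≈ +0.9501:

```python
import numpy as np

# Population and Sales data with the fitted line y_hat = 60 + 5x.
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])
y_hat = 60 + 5 * x

sse = np.sum((y - y_hat) ** 2)         # 1530
sst = np.sum((y - y.mean()) ** 2)      # 15730
ssr = sst - sse                        # 14200

r_squared = ssr / sst                  # ~0.9027
r = np.sign(5) * np.sqrt(r_squared)    # ~+0.9501 (sign follows the slope b1)

print(sse, sst, ssr, round(r_squared, 4), round(r, 4))
```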
Adjusted R-squared

• Degrees of freedom: df = n − k − 1, so n = df + k + 1, where n is the number of observations and k is the number of independent variables.
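The slide does not write out the adjustment itself; the standard formula is Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1). A small sketch applying it to the Population and Sales example (R² ≈ 0.9027, n = 10, k = 1):

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Standard adjustment: penalizes R^2 for the number of predictors k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Population and Sales example: R^2 ~ 0.9027, n = 10 observations, k = 1 predictor.
print(round(adjusted_r_squared(0.9027, 10, 1), 4))   # ~0.8905
```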
Model Assumptions

1. E(ε) = 0, implying that the model's predictions are unbiased.
2. The variance of the error terms is constant across all values of X (homoscedasticity); the spread or dispersion of the errors should not increase or decrease with X.
3. The error terms are assumed to be independent of each other – no autocorrelation.
4. The error terms are assumed to be normally distributed with a mean of zero.
TESTING FOR SIGNIFICANCE

• To test for a significant regression relationship, we must conduct a hypothesis test to determine whether the value of β1 is zero.
• Two tests are commonly used: the t test and the F test.
• Both the t test and the F test require an estimate of σ², the variance of ε in the regression model.
Test of Significance

• In a simple linear regression equation, the mean or expected value of y is a linear function of x: E(y) = β0 + β1x.
• If the value of β1 is zero, then E(y) = β0 + (0)x = β0. In this case the mean value of y does not depend on the value of x, and hence we would conclude that x and y are not linearly related.
• Alternatively, if the value of β1 is not equal to zero, we would conclude that the two variables are related.
• Thus, to test for a significant regression relationship, we must conduct a hypothesis test to determine whether the value of β1 is zero.
• Two tests are commonly used; both require an estimate of the variance of ε in the regression model:
  - t test
  - F test
TESTING FOR SIGNIFICANCE

• An estimate of σ²:

The mean square error (MSE) provides the estimate of σ², and the notation s² is also used.

s² = MSE = SSE / (n − 2)

where:
SSE = Σ(yᵢ − ŷᵢ)² = Σ(yᵢ − b0 − b1xᵢ)²
TESTING FOR SIGNIFICANCE

• An estimate of σ:

To estimate σ we take the square root of s². The resulting s is called the standard error of the estimate.

s = √MSE = √(SSE / (n − 2))
Estimate of Variance – Mean Square Error

s² = MSE = SSE / (n − 2) = 1530 / 8 = 191.25

Standard error of the estimate:

s = √MSE = √191.25 = 13.829

Mean square due to regression:

MSR = SSR / (number of independent variables)
Testing for Significance: F Test

• Hypotheses:
H0: β1 = 0
Ha: β1 ≠ 0

• Test statistic:
F = MSR / MSE

• Rejection rule:
Reject H0 if the p-value < α or F > Fα,
where Fα is based on an F distribution with 1 degree of freedom in the numerator and n − 2 degrees of freedom in the denominator.
F Test

F = MSR / MSE = 14200 / 191.25 = 74.25

The F distribution table shows that with 1 degree of freedom in the numerator and n − 2 = 10 − 2 = 8 degrees of freedom in the denominator, F = 11.26 provides an area of 0.01 in the upper tail. Thus the area in the upper tail of the F distribution corresponding to the test statistic F = 74.25 must be less than 0.01, so the p-value must be less than 0.01.

Because the p-value is less than α = 0.01, we reject the null hypothesis and conclude that a significant relationship exists between the size of the population and quarterly sales.
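The same test can be reproduced programmatically; a minimal sketch using SciPy's F distribution with the sums of squares from the Population and Sales example:

```python
from scipy import stats

# From the Population and Sales example: SSR = 14200, SSE = 1530, n = 10, k = 1.
ssr, sse, n, k = 14200, 1530, 10, 1

msr = ssr / k               # mean square due to regression
mse = sse / (n - k - 1)     # mean square error = 191.25
f_stat = msr / mse          # ~74.25

# Upper-tail p-value from the F distribution with (1, n - 2) degrees of freedom.
p_value = stats.f.sf(f_stat, k, n - k - 1)
print(round(f_stat, 2), p_value)   # p-value well below 0.01
```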
TESTING FOR SIGNIFICANCE: t TEST

• Hypotheses:
H0: β1 = 0
Ha: β1 ≠ 0

• Test statistic:
t = b1 / s_b1, where s_b1 = s / √Σ(xᵢ − x̄)²
Testing for Significance: t Test

• Rejection rule:
Reject H0 if the p-value < α, or if t < −t(α/2) or t > t(α/2),
where t(α/2) is based on a t distribution with n − 2 degrees of freedom.

• Estimated standard deviation of b1:
s_b1 = s / √Σ(xᵢ − x̄)² = 13.829 / √568 = 0.5803

• Test statistic:
t = b1 / s_b1 = 5 / 0.5803 = 8.62

The t distribution table shows that with n − 2 = 10 − 2 = 8 degrees of freedom, t = 3.355 provides an area of 0.005 in the upper tail. Since 8.62 > 3.355, the p-value is less than 0.01, so we reject H0.
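A minimal Python sketch of this t test, reusing s = 13.829 and Σ(xᵢ − x̄)² = 568 from the worked example:

```python
import numpy as np
from scipy import stats

b1 = 5          # estimated slope
s = 13.829      # standard error of the estimate
sxx = 568       # sum of (x_i - x_bar)^2
n = 10

s_b1 = s / np.sqrt(sxx)     # ~0.5803
t_stat = b1 / s_b1          # ~8.62

# Two-tailed p-value from the t distribution with n - 2 degrees of freedom.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(round(t_stat, 2), p_value)
```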
CONFIDENCE INTERVAL FOR β1

• We can use a 95% confidence interval for β1 to test the hypotheses just used in the t test.
• H0 is rejected if the hypothesized value of β1 is not included in the confidence interval for β1.

• The form of a confidence interval for β1 is:

b1 ± t(α/2) · s_b1

where b1 is the point estimator, t(α/2) · s_b1 is the margin of error, and t(α/2) is the t value providing an area of α/2 in the upper tail of a t distribution with n − 2 degrees of freedom.
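A small sketch of this interval for the Population and Sales example (b1 = 5, s_b1 ≈ 0.5803, n = 10):

```python
from scipy import stats

b1, s_b1, n = 5, 0.5803, 10
alpha = 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # ~2.306 for 8 degrees of freedom
margin = t_crit * s_b1

print(b1 - margin, b1 + margin)   # interval excludes 0, so H0: beta1 = 0 is rejected
```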
Anova Table
SOME CAUTIONS ABOUT THE INTERPRETATION OF SIGNIFICANCE TESTS

• Rejecting H0: β1 = 0 and concluding that the relationship between x and y is significant does not enable us to conclude that a cause-and-effect relationship is present between x and y.
• Just because we are able to reject H0: β1 = 0 and demonstrate statistical significance does not enable us to conclude that the relationship between x and y is linear.
R Output – Interpretation

• df is the number of degrees of freedom.
• Regression df = total number of parameters − 1 (2 − 1 = 1).
• Residual df = total number of observations − total number of parameters (10 − 2 = 8).
Population and Sales data – Actual and Predicted Values

Restaurant  Population (x)  Sales (y)  Predicted Sales (ŷ = 60 + 5x)
    1             2            58              70
    2             6           105              90
    3             8            88             100
    4             8           118             100
    5            12           117             120
    6            16           137             140
    7            20           157             160
    8            20           169             160
    9            22           149             170
   10            26           202             190
QUESTIONS 1

X Y
34 43
42 50
70 90
110 130
250 315

1. Find the value for b0 and b1 for the given value of X and Y for Simple Linear
regression.
2. Create the equation for simple linear regression with the help of the b0 and b1 value
3. Find the value of Y for a given value of X = 140
Question - A telecommunications company conducted a study on the annual revenue and the
number of cell towers operated by five network providers. The aim is to determine if annual
revenue can be predicted based on the number of cell towers in service.

Company      Cell Towers in Service (100s)   Annual Revenue ($ million)
NetworkOne                25                            50
ConnectHub                20                            42
SkyNet                    18                            38
StreamLink                15                            32
OmniComm                  10                            25

Questions:
(a) Develop a regression equation to estimate the annual revenue based on the number of cell towers
in service.
(b) Is there any positive association between annual revenue and the number of cell towers in
service? Explain.
(c) Forecast the annual revenue if the number of cell towers in service is 12.
Question

The management team of an energy company is evaluating the effectiveness of their marketing
campaigns by analyzing the relationship between the advertising expenses and annual revenue. They
calculate two key metrics:

•Sum of Squares due to Error (SSE) = 15


•Sum of Squares due to Regression (SSR) = 25

1. Calculate the R2 value and write the interpretation.


2. Calculate Multiple R and write the interpretation (with respect to the strength of correlation –
positive or negative and strong or weak )
In a simple linear regression equation y = β0 + β1 x, what does β1 represent?

•A) The intercept of the regression line


•B) The slope of the regression line
•C) The predicted value of y when x=0
•D) The residual

Which of the following statements is true regarding the correlation coefficient r?


•A) r can range from 0 to 1
•B) r indicates the strength and direction of a linear relationship
•C) r determines the slope of the regression line
•D) r has no connection to linear regression
In a simple linear regression model, if the sum of squares due to error (SSE) is high compared to
the sum of squares due to regression (SSR), what does this indicate?
A) The model fits the data well.
B) The model does not fit the data well.
C) There is a strong linear relationship between the variables.
D) The independent variable perfectly predicts the dependent variable.

If a regression equation is given by y=2+3x, what will be the predicted value of y when x=4?

•A) 12
•B) 14
•C) 10
•D) 8
If you calculate an R2 value of 0.9 for a simple linear regression model, how would you
interpret this?

•A) 90% of the variance in the dependent variable is explained by the independent variable.
•B) There is a 90% chance that the model is correct.
•C) 90% of the independent variable is explained by the dependent variable.
•D) There is no association between the variables.

What is the main purpose of the intercept term β0 in a regression model?


•A) It measures the slope of the line.
•B) It represents the change in y for a one-unit change in x.
•C) It represents the expected value of y when x=0.
•D) It measures the goodness of fit of the model.
QUESTIONS 2

ANOVA
df SS MS F Significance F
Regression 1 659.7773672 659.7773672 9.343777228 0.005591082
Residual 23 1624.062633 70.61141882
Total 24 2283.84

Regression Statistics
Multiple R          (= √(R Square))
R Square            (= SSR/SST, or 1 − SSE/SST)
Adjusted R Square
Standard Error      (= √MSE)
Observations        25

1. Find regression statistics using the given ANOVA table.


2. Is this model accepted or rejected? Justify your answer.
QUESTIONS 3

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.537484407
R Square 0.288889488
Adjusted R Square 0.257971639
Standard Error 8.403060087
Observations 25

ANOVA
df SS MS F Significance F
Regression 1 659.7773672 659.7773672 9.343777228 0.005591082
Residual 23 1624.062633 70.61141882
Total 24 2283.84

              Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%
Intercept     33.56672       21.1197553       1.589351517   0.125634061   -10.1228274   77.25626
Promote       0.679119       0.222169496      3.056759269   0.005591082   0.219526048   1.138711

The regression equation has the form y = mx + c; substitute the intercept and slope values to form the equation.
1. Create the linear regression equation.
2. Find the value of y for X = 95.
EXAMPLE

Usage (x) Expense (y) Estimated regression equation is y = 0.20 + 2.60x


1 3
2 7
3 5 1. Compute – SSE, SSR and SST
4 11 2. Compute R square
5 14 3. Compute Correlation Coefficient
EXAMPLE – SIMPLE LINEAR REGRESSION

Population and Sales data


Population
Restaurant (x) Sales (y)
1 2 58
2 6 105
3 8 88
4 8 118
5 12 117
Hints: b1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)², b0 = ȳ − b1x̄, ŷ = b0 + b1x

1. Find the regression equation for the given dataset.
2. Based on the equation, identify the type of relationship (correlation – positive, negative, or no relation), based on the value of b1.
3. Find the sales for population = 20.
EXAMPLE – SIMPLE LINEAR REGRESSION

DF Sum Sq Mean Sq F value P value


Experience 1 1600
Residuals 8 120

1. Find the value of SST.
2. Find MSE and MSR.
3. Find the R² value for the given output.
4. Find the F-test value.
5. Interpret based on R² (the coefficient of determination).
6. Interpret based on the p-value.

Interpretation (from the earlier Population and Sales example): the value of R² is 0.902, which is close to 1, so the model is significant; the input variable population explains about 90% of the variation in the output. The p-value is less than 0.05, so the model is significant and the impact of the input variable population on the output is high.
THANK YOU
