Module III (Part II)(Regression and Time Series)
Module-III
What is Regression?
A way of predicting the value of one variable from another.
‒ It is a hypothetical model of the relationship between two variables.
‒ The model used here is a linear one.
‒ Regression is a statistical procedure that determines the equation for the
straight line that best fits a specific set of data.
• Any straight line can be represented by an equation of the form Y = bX + a, where
b and a are constants.
• The value of b is called the slope constant and determines the direction and
degree to which the line is tilted.
• The value of a is called the Y-intercept and determines the point where the line
crosses the Y-axis.
Main Objectives
Two main objectives:
Establish if there is a relationship between two variables
‒ Specifically, establish if there is a statistically significant relationship
between the two.
‒ Example: Income and expenditure, wage and gender, etc.
Forecast new observations.
‒ Can we use what we know about the relationship to forecast unobserved
values?
‒ Example: What will our sales be over the next quarter?
Variable’s Roles
Variables
Dependent
‒ This is the variable whose values we want to explain or forecast.
‒ Its values depend on something else.
‒ We denote it as Y.
Independent
‒ This is the variable that explains the other one.
‒ Its values are independent.
‒ We denote it as X.
Y=mX+c
A Linear Equation
You may remember one of these.
‒ y = a + bx
‒ y = mx + b
• In this regression discussion, we just use a different notation:
‒ y = β0 + β1x,
• where β0 is called the intercept and β1 is called the coefficient or slope
• The values of the regression parameters β0 and β1 are not known.
• We estimate them from data.
• β1 indicates the change in the mean response per unit increase in X.
To fit the regression line, a statistical approach known as the least squares method is used.
Least Squares Principle
• The least squares principle
• Dots are actual values of Y
• Asterisks are the predicted values of Y for a given value of X
Linear Regression cont…
The linear regression model provides a sloped straight line representing the relationship between the variables.
Consider the below image:
Linear Regression cont…
The calculation of b and a is as follows:
If b > 0, then x (predictor) and y (target) have a positive relationship, i.e. an increase in x will increase y.
If b < 0, then x (predictor) and y (target) have a negative relationship, i.e. an increase in x will decrease y.
If the sum of squared errors is taken as the metric to evaluate the model, then the goal is to obtain the line that best reduces the error.
Regression Line
We will write an estimated regression line based on sample data as
ŷ = b0 + b1x
The method of least squares chooses the values of b0 and b1 to minimize the sum of squared errors
SSE = Σi=1..n (yi − ŷi)² = Σi=1..n (yi − b0 − b1xi)²
The resulting estimates are
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = [n Σxiyi − (Σxi)(Σyi)] / [n Σxi² − (Σxi)²]
and
b0 = ȳ − b1x̄
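As a concrete illustration, here is a minimal Python sketch of these closed-form estimates. The helper name fit_simple_linear and the sample data (the four advertising/sales pairs visible in the next example) are our own illustrative choices, not part of the slides.

```python
# Minimal sketch: closed-form least squares estimates for b0 and b1,
# implementing the formulas above.
def fit_simple_linear(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
         sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar  # b0 = y_bar - b1 * x_bar
    return b0, b1

# Illustrative data: the four (x, y) pairs shown in the next example.
x = [41, 54, 64, 71]
y = [1250, 1380, 1575, 1650]
b0, b1 = fit_simple_linear(x, y)
print(f"y_hat = {b0:.2f} + {b1:.2f} x")
```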
Estimation of Mean Response
Fitted regression line can be used to estimate the mean value of y for a given value of x.
Example :
• The weekly advertising expenditure (x) and weekly sales (y) are presented in the following
table (the first four of the ten observations are shown):

y (sales)    x (expenditure)
1250         41
1380         54
1575         64
1650         71
…            …

From the data table we have:
n = 10, Σx = 564, Σx² = 32604, x̄ = 56.4, ȳ = 1436.5
b1 = 10.8
b0 = 1436.5 − 10.8(56.4) ≈ 828
Point Estimation of Mean Response
• The estimated regression function is:
ŷ = 828 + 10.8x
Sales = 828 + 10.8 × Expenditure
What happens if we change the intercept?
y = 4 + 2x
y = 9 + 2x
y = -2 + 2x
What happens if we change the slope?
y = 4 + 2x
y = 4 + 5x
y = 4 + 0x = 4
y = 4 - 3x
But, the world is not linear!
A fitted line such as y = 4 + 2x never matches the data exactly; the true value follows
y = β0 + β1x + ε
where ε is the random error term.
Simple Linear Regression Model
For a model with one predictor:
ŷ = b0 + b1x
Coefficient formula: b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
Intercept calculation: b0 = ȳ − b1x̄
Exploring ‘b1’
• If b1 > 0, then x (predictor) and y (target) have a positive relationship, i.e. an increase in x will increase y.
• If b1 < 0, then x (predictor) and y (target) have a negative relationship, i.e. an increase in x will decrease y.
Exploring ‘b0’
• If the model's data does not include x = 0, then a prediction at x = 0, which consists only of b0, is meaningless. For example, in a dataset that relates height (x) and weight (y), taking x = 0 (a height of 0) leaves only the b0 value in the equation, which is meaningless since in real life height and weight can never be zero.
• If the data does include x = 0, then b0 is the average of all predicted values when x = 0.
• If there is no b0 term, the regression line is forced to pass through the origin, and both the regression coefficients and the predictions will be biased.
Simple Linear Regression Model
Linear Regression Line
A straight line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:
Linear Regression cont…
The linear regression will thus be: Predicted (Y) = 8.3272 + 2.1466X
The above equation can be used to predict the volume of sales for an
insurance company given its number of agents. Thus, if a company has 1000
agents (10 hundreds), the predicted value of sales will be around
8.3272 + 2.1466 × 10 ≈ 29.79.
In summary, linear regression consists of the following steps:
Collect a sample of independent and dependent variable values.
Compute b and a.
Use these values to formulate the linear regression equation.
Given new values for X, predict the value of Y.
The larger and better the sample of data, the more accurate the regression
model will be, leading to more accurate forecasts.
Simple Linear Regression Model
Finding the best fit line:
When working with linear regression, our main goal is to find the best fit line,
which means the error between the predicted values and the actual values should be
minimized. The best fit line will have the least error.
Data for Linear Regression Example
Multicollinearity: When the predictors are highly correlated with each other, the variables
are said to be multicollinear. Many regression techniques assume that multicollinearity
is not present in the dataset, because it causes problems in ranking variables based on
their importance and makes it difficult to select the most important independent variable
(factor).
• Variance (Error in test data): Variance specifies the amount by which the
prediction would change if different training data were used. In
simple words, variance tells how much a random variable differs
from its expected value.
Under-fitting: When the model performs so poorly that it is unable to fit even the
training set well, it is said to be under-fitting the data. This is also known as the
problem of high bias. Bias is the amount by which a model's prediction differs
from the target value.
Terminologies
Reasons for under-fitting:
• High bias and low variance
• The size of the training dataset used is not enough.
• The model is too simple.
• Training data is not cleaned and also contains noise in it.
Techniques to reduce under-fitting:
• Increase model complexity
• Increase the number of features by performing feature engineering.
• Remove noise from the data.
• Increase the number of epochs or the duration of training to
get better results.
Terminologies
Reasons for over-fitting are as follows:
• High variance and low bias
• The model is too complex
• The size of the training data is too small
Techniques to reduce over-fitting:
• Train with more data.
• Reduce model complexity.
• Use cross-validation.
• Remove features.
• Stop the training early.
The General Idea of Regression Models
Simple regression considers the relation between a single explanatory variable and a
response variable. A simple regression model (one independent variable) fits a
regression line in 2-dimensional space.
Multiple Linear Regression with Two Independent
Variables
The formula for a multiple linear regression is:
y = β0+ β1x1 + β2x2 + e
where, y = the predicted value of the dependent variable
β0 = the y-intercept (value of y when all other parameters are set to 0)
β1x1 = the regression coefficient (β1) of the first independent variable (x1) multiplied by that variable
β2x2 = the regression coefficient (β2) of the second independent variable (x2) multiplied by that variable
e = model error
β1 and β2 are obtained by solving the least squares normal equations, and β0 is then calculated as β0 = ȳ − β1x̄1 − β2x̄2.
Case Study: Data Preparation for MR
Multiple Linear Regression Models (cont..)
Least Squares Estimation of the Parameters
• The least squares function is given by
L(β0, β1, …, βk) = Σi=1..n (yi − β0 − Σj=1..k βj xij)²
• Setting the partial derivatives of L with respect to each βj to zero yields the least squares normal equations.
• The solution to the normal equations gives the least squares estimators of the
regression coefficients.
Multiple Linear Regression Models (cont..)
Specifically, we will fit the multiple linear regression model
y = β0 + β1x1 + β2x2 + ε
where ε is the random error term, y is the pull strength, x1 is the wire length, and x2 is the die height.
Multiple Linear Regression Models (cont..)
Matrix Approach to Multiple Linear Regression
Example:
We illustrate fitting the multiple regression model
y = β0 + β1x1 + β2x2 + ε
where y is the observed pull strength for a wire bond, x1 is the wire length, and x2
is the die height. The 25 observations are in the table.
Now, we will use the matrix approach to fit the regression model to the data
provided. In matrix notation the model is y = Xβ + ε, and the least squares estimate is
β̂ = (X′X)⁻¹X′y
where the matrix X and the vector y for this model are built from the 25 observations.
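A minimal numpy sketch of this matrix computation; since the 25-observation table is not reproduced here, the five data rows below are hypothetical stand-ins.

```python
import numpy as np

# Sketch: least squares via the normal equations, beta_hat = (X'X)^(-1) X'y.
x1 = np.array([2.0, 8.0, 11.0, 10.0, 4.0])         # wire length (hypothetical)
x2 = np.array([50.0, 110.0, 120.0, 550.0, 200.0])  # die height (hypothetical)
y = np.array([9.95, 24.45, 31.75, 35.00, 18.00])   # pull strength (hypothetical)

X = np.column_stack([np.ones_like(x1), x1, x2])  # design matrix with intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # solves (X'X) beta = X'y
print(beta_hat)                                  # [b0, b1, b2]
```

Using np.linalg.solve on the normal equations avoids explicitly inverting X′X, which is numerically more stable than computing the inverse.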
Non-Linear Regression
In the case of linear and multiple linear regression, the dependent variable is linearly
dependent on the independent variable(s). But in several situations the relationship is not
so simple: the two variables might be related in a non-linear way.
This may be the case where the results from the correlation analysis show no linear
relationship but these variables might still be closely related.
If the result of the data analysis show that there is a non-linear (also known as curvilinear)
association between the two variables, then the need is to develop a non-linear regression
model.
The non-linear data can be handled in 2 ways:
Use of polynomial rather than linear regression model
Transform the data and then use linear regression model
Non-Linear Regression
Product Increase in sale in% (Y) Discount in %(X)
A 3.05 10
B 7.62 15
C 12.19 20
D 20.42 25
E 28.65 30
F 42.06 35
G 55.47 40
H 74.68 45
I 93.88 50
Non-Linear Regression
The scatter diagram of sales increase for various discount percentage looks as follows:
The value of r is 0.97, which indicates a very strong, almost perfect, positive correlation, and the data values appear to
form a slight curve.
Non-Linear Regression
Polynomials are equations that involve powers of the independent variables. Second degree (quadratic), third
degree (cubic), and n-degree polynomial functions:
Second degree: y = β0 + β1x + β2x² + e
Third degree: y = β0 + β1x + β2x² + β3x³ + e
n degree: y = β0 + β1x + β2x² + β3x³ + … + βnxⁿ + e
Where:
β0 is the intercept of the regression model
β1, β2, β3 are the coefficients of the predictors.
How to find the right degree of the equation?
As we increase the degree of the model, it tends to increase the performance of the model. However, increasing the
degree also increases the risk of over-fitting and under-fitting the data. So, one of the following approaches can be
adopted (see the sketch after this list):
Forward Selection: This method increases the degree until it is significant enough to define the best possible
model.
Backward Elimination: This method decreases the degree until it is significant enough to define the best possible
model.
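A minimal sketch of comparing polynomial degrees on the discount/sales table above with numpy; the degree grid and the SSE criterion are our illustrative choices.

```python
import numpy as np

# Fit polynomials of increasing degree and compare the sum of squared errors.
x = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50])   # Discount % (X)
y = np.array([3.05, 7.62, 12.19, 20.42, 28.65,
              42.06, 55.47, 74.68, 93.88])           # Increase in sale % (Y)

for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, degree)   # least squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    sse = np.sum((y - y_hat) ** 2)
    print(f"degree {degree}: SSE = {sse:.2f}")
```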
Non-Linear Regression
The techniques for fitting a polynomial model in one variable can be extended to fitting
polynomial models in two or more independent variables.
A second-order polynomial is most often used in practice, and its model with two independent variables is
specified by:
y = β0 + β1x1 + β2x2 + β11x1² + β22x2² + β12x1x2 + e
This is also termed a response surface. Response surface methodology is used to fit such
models and helps in designing experiments. This topic is generally covered in the design of
experiments.
Class work
Define the second-order polynomial model with two independent variables.
Define the second-order polynomial model with three independent variables.
Define the third-order polynomial model with two independent variables.
Non-Linear Regression
Tools in software such as SAS and Excel, or in languages such as Python and R, can estimate the values
of the predictor coefficients such as β0 and β1 and fit a curve in a non-linear fashion to the given data.
The following figure depicts the graph of increase in sale vs. discount, with the fitted curve.
Non-Linear Regression
An R² of 1 indicates that the regression model perfectly fits the data, while an R² of 0 indicates that the model
does not fit the data at all.
R² is calculated as follows:
R² = 1 − (SSres / SStot)
where SSres = Σ(yi − ŷi)² is the residual sum of squares and SStot = Σ(yi − ȳ)² is the total sum of squares.
In the example, a value of 0.99 for R² indicates that a quadratic model is a good fit for the data.
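A short sketch of this R² computation in Python, applied to the quadratic fit of the discount data (the helper name r_squared is ours):

```python
import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

x = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50])
y = np.array([3.05, 7.62, 12.19, 20.42, 28.65, 42.06, 55.47, 74.68, 93.88])
y_hat = np.polyval(np.polyfit(x, y, 2), x)   # quadratic fit
print(round(r_squared(y, y_hat), 2))         # close to 0.99 for these data
```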
Non-Linear Regression
Data points on a scattergram
Non-Linear Regression
To overcome under-fitting, we need to increase the complexity
of the model.
Non-Linear Regression
A comparison of fitting linear, quadratic and cubic curves on the dataset.
For degree = 20, the model also captures the noise in the data. This is an example of over-fitting.
Non-Linear Regression
The Bias vs Variance trade-off
Non-Linear Regression
Another preferable way to perform non-linear regression is to try to transform the data in order to
make the relationship between the two variables more linear and then use a regression model
rather than a polynomial one. Transformations aim to make a non-linear relationship between two
variables more linear so that it can be described by a linear regression model.
Non-Linear Regression
Application of Square root (√Y)
Non-Linear Regression
Square root transformation
In a similar fashion, logarithm and negative reciprocal transformations can be applied to the dependent
variable, followed by the application of a linear regression model.
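A minimal sketch of the transform-then-fit approach, here with the square root transformation applied to the discount data from earlier in this section; inverting the transform by squaring is the key step.

```python
import numpy as np

x = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50])
y = np.array([3.05, 7.62, 12.19, 20.42, 28.65, 42.06, 55.47, 74.68, 93.88])

# Fit a straight line to sqrt(Y) instead of Y ...
b1, b0 = np.polyfit(x, np.sqrt(y), 1)   # polyfit returns highest degree first
# ... then square the linear prediction to undo the transformation.
y_pred = (b0 + b1 * x) ** 2
print(np.round(y_pred, 2))
```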
Logistic Regression
• In linear regression, the Y variable is always a continuous variable.
• If the Y variable is categorical, you cannot use linear regression to model it.
• So what would you do when Y is a categorical variable with 2 classes?
• Logistic regression can be used to model and solve such problems, also called binary
classification problems.
• Logistic Regression is one of the most commonly used Machine Learning algorithms; it
is used to model a binary variable that takes only 2 values – 0 and 1.
• The objective of Logistic Regression is to develop a mathematical equation that can give
us a score in the range of 0 to 1.
Logistic Regression (Why?)
When the response variable has only 2 possible values, it is desirable to have a model that predicts the value either
as 0 or 1 or as a probability score that ranges between 0 and 1.
Linear regression does not have this capability: if you use linear regression to model a binary response
variable, the resulting model may not restrict the predicted Y values to within 0 and 1.
Logistic Regression
Logistic Regression was used in the biological sciences in the early twentieth century. It was then used in many
social science applications.
Logistic Regression is used when the dependent variable (target) is categorical. For example:
To predict whether an email is spam (1) or not (0). If the model infers a value of 0.932 on a particular
email message, it implies a 93.2% probability that the email message is spam; equivalently, a 6.8%
probability that it is not spam.
Whether the tumor is malignant (1) or not (0)
Logistic Regression (Example)
Spam Detection: Spam detection is a binary classification problem where we are given an email and we need to
classify whether or not it is spam. If the email is spam, we label it 1; if it is not spam, we label it 0.
Tumour Prediction: A Logistic Regression classifier may be used to identify whether a tumour is malignant or if
it is benign. Several medical imaging techniques are used to extract various features of tumours. For instance,
the size of the tumour, the affected body area, etc. These features are then fed to a Logistic Regression classifier
to identify if the tumour is malignant or if it is benign.
Health : Predicting if a given mass of tissue is benign or malignant
Marketing : Predicting if a given user will buy an insurance product or not
Banking : Predicting if a customer will default on a loan.
• Dichotomous categorical response variable Y
• e.g. Success/Failure, Remission/No Remission, Survived/Died, CHD/No CHD, Low Birth Weight/Normal
Birth Weight, etc…
Logistic Regression
It measures the relationship between the categorical dependent variable and one or more independent
variables by estimating probabilities using a logistic function which is the cumulative logistic distribution.
Since the predicted values are probabilities restricted to (0, 1), a logistic regression model
only predicts the probability of a particular outcome given the values of the existing data.
Example: A group of 20 students spends between 0 and 6 hours studying for an exam. How does the number
of hours spent studying affect the probability of the student passing the exam? The reason for using logistic
regression for this problem is that the values of the dependent variable, pass and fail, while represented by "1"
and "0", are not cardinal numbers. If the problem was changed so that pass/fail was replaced with the grade
0–100 (cardinal numbers), then simple regression analysis could be used. The table shows the number of
hours each student spent studying, and whether they passed (1) or failed (0).
Logistic Regression cont…
In logistic regression, we don't directly fit a straight line to the data as in linear regression. Instead, we fit an S-shaped
curve, called the sigmoid or logistic curve. A logistic regression curve showing the probability of
passing an exam versus hours studying is shown below.
The Y-axis goes from 0 to 1. This is because the sigmoid function's output always lies between 0 and 1,
which fits very well with the goal of classifying samples into two different categories (fail or pass).
The sigmoid function is sigmoid(x) = 1 / (1 + e⁻ˣ), where x is the weighted sum of the independent variable, i.e.
x = β0 + β1xi, where xi is an individual instance of the independent variable.
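A two-line Python version of this function, for reference:

```python
import math

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)); output always lies strictly between 0 and 1
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))   # 0.5
print(sigmoid(6.0))   # close to 1
print(sigmoid(-6.0))  # close to 0
```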
Logistic Regression cont…
Consider a model with one predictor X1 and one binary response variable Y, and let p = P(Y = 1 | X1 = x)
denote the probability of success. p must meet two criteria: (i) it must always be positive, and (ii) it must
always be less than or equal to 1.
We assume a linear relationship between the independent variable and the logit of the event Y = 1. In
statistics, the logit is the logarithm of the odds, i.e. p / (1 − p). This linear relationship can be written in the
following mathematical form, where ℓ is the logit, b is the base of the logarithm, and β0, β1 are the
parameters of the model:
ℓ = logb(p / (1 − p)) = β0 + β1x1
Solving for p gives p = Sb(β0 + β1x1), where Sb is the sigmoid function with base b. In some cases it can be
easier to communicate results by working in base 2, base 10, or the exponential constant e.
In reference to the students example, solving the equation with a software tool and taking the base as e, the
coefficients are β0 = −4.0777 and β1 = 1.5046.
Logistic Regression cont…
For example, for a student who studies 2 hours, entering the value Hours = 2 into the equation gives an
estimated probability of passing the exam of 0.26.
Similarly, for a student who studies 4 hours, the estimated probability of passing the exam is 0.87.
The following table shows the probability of passing the exam for several values of hours studied.
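The probabilities quoted above follow directly from the fitted coefficients; a minimal sketch (the helper name p_pass is ours):

```python
import math

# p(pass) = sigmoid(b0 + b1 * hours) with b0 = -4.0777, b1 = 1.5046 (from the slides)
def p_pass(hours):
    z = -4.0777 + 1.5046 * hours       # the logit (log-odds)
    return 1.0 / (1.0 + math.exp(-z))  # the sigmoid maps it to a probability

print(round(p_pass(2), 2))  # 0.26
print(round(p_pass(4), 2))  # 0.87
```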
Time Series
Examples of time series data: the profit of a company, the daily petrol price, hourly electricity consumption.
Time Series Analysis Examples
Economics: Gross Domestic Product (GDP), Consumer Price Index (CPI), and unemployment rates over
periods of time.
Time Series Model cont…
Any time series is a composition of many individual component time series. Some of these components are
predictable, whereas other components may be almost random and difficult to predict.
This calls for decomposition methods that generate the individual component series from the original
series. Decomposing a series into such components enables us to analyse the behaviour of each component and
then build the forecast by reassembling the individual components. This improves the accuracy of the final
forecast.
Example: A typical sales time series.
Time Series Model Component
Time series models are characterized by four components:
Trend component
Seasonal component
Cyclical component
Irregular component
Time Series Model Component contd...
I. Trend component
The trend shows the general tendency of the data to increase or decrease during a long
period of time. A trend is a smooth, general, long-term, average tendency.
It is not always necessary that the increase or decrease is in the same direction
throughout the given period of time.
The tendencies may increase, decrease or remain stable in different sections of time,
but the overall trend must be upward, downward or stable.
Population, agricultural production, items manufactured, the number of births and
deaths, the number of industries or factories, and the number of schools or colleges are
examples showing some kind of tendency of movement.
Time Series Model Component contd...
I. Trend component: upward, downward, stable
Time Series Model Component contd...
II. Seasonal component
These are the rhythmic forces which operate in a regular and periodic manner over a span of less than a
year. They have the same or almost the same pattern during a period of 12 months. This variation will be
present in a time series if the data are recorded hourly, daily, weekly, quarterly, or monthly.
These variations come into play either because of the natural forces or person-made conventions.
Time Series Model Component cont…
III. Cyclical component
The variations in a time series which repeat over a span of more than one year are the
cyclic variations. This oscillatory movement has a period of oscillation of more than a year, and one
complete period is a cycle. This cyclic movement is sometimes called the ‘Business Cycle’.
‘Business Cycle’: It is a four-phase cycle comprising the phases of prosperity, recession,
depression, and recovery. The cyclic variation may be regular but not periodic. The upswings and
the downswings in business depend upon the joint nature of the economic forces and the
interaction between them.
Time Series Model Component contd...
IV. Irregular component
These are variations that are not regular; they are purely random or irregular. The fluctuations
are unforeseen, uncontrollable, unpredictable, and erratic. Such forces include
earthquakes, wars, floods, famines, and other disasters.
Time Series Model Component contd...
Time Series Model Component cont…
Decomposition Model
Mathematical representation of the decomposition approach is Yt = f(Tt, St, Ct, It) where Yt is the time series value at
time t. Tt, St, Ct, and It are the trend, seasonal, cyclic and irregular component value at time t respectively.
There are 3 types of decomposition model:
Additive model
Multiplicative model
Mixed model
Additive model
According to this model, a time series is expressed as Yt = Tt + St + Ct + It
The model is appropriate when the amplitude of both the seasonal and irregular variations does not change
as the level of the trend rises or falls.
This model assumes that all four components of the time series act independently of each other.
Multiplicative model
According to this model, a time series is expressed as Yt = Tt * St * Ct * It
The model is appropriate when the amplitude of both the seasonal and irregular variations increases as
the level of the trend rises.
The model assumes that the various components operate proportionately to each other.
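In practice these decompositions can be computed with library routines; a sketch using statsmodels' seasonal_decompose (the file name sales.csv and its column names are hypothetical). Note that this routine folds the cyclical component into the trend and residual.

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly sales series; file and column names are stand-ins.
series = pd.read_csv("sales.csv", index_col="month", parse_dates=True)["sales"]

# Additive decomposition:        Yt = Tt + St + It
result = seasonal_decompose(series, model="additive", period=12)
# Multiplicative decomposition:  Yt = Tt * St * It
# result = seasonal_decompose(series, model="multiplicative", period=12)

print(result.trend.head())  # Tt (cyclical variation is folded into trend/residual)
result.plot()               # panels for observed, trend, seasonal, residual
```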
Decomposition Model cont…
Decomposition Model cont…
III. Mixed model
The time series analysis can also be done using the model as:
Yt = Tt + St * Ct * It
Yt = Tt * St + Ct * It
Home Work
How to determine if a time series has a trend component?
How to determine if a time series has a seasonal component?
How to determine if a time series has both a trend and seasonal component?
Time Series Forecasting Model
Time series forecasting is required to make scientific predictions based on historical time-stamped
data. It involves building models [called Time Series Forecasting Models] through
historical analysis and using them to make observations and drive future strategic decision-making.
1.1 Simple Average Model
The simple average method uses the mean of all the past values to forecast the next
value. This method is of little use in most practical scenarios.
It is used when the time series has attained some level of stability and no
longer depends on any external parameters.
In sales forecasting, this happens only when the product for which the forecast is
needed is at a mature stage in its life cycle.
The Averaging Model is represented as follows, where Ft+1 is the forecasted value at time
instant t+1, t is the current time and Yi is the value of the series at time instant i:
Ft+1 = (Y1 + Y2 + … + Yt) / t
Averaging Model
A manager of a warehouse wants to know how much a typical supplier delivers
in 10-dollar units. He/she has taken a sample of 12 suppliers at random,
obtaining the results shown in the table.

Supplier: 1  2  3  4   5  6   7   8  9   10  11  12
Amount:   9  8  9  12  9  12  11  7  13  9   11  10

The computed mean of the amounts is 10, and hence the manager decides to
use this as the estimate for the expenditure of a typical supplier.
It is more reasonable to assume that the recent points in the past are better
predictors than the whole history. This is particularly true for sales forecasting.
Every product has a life cycle: an initial stage, a middle volatile period, a more
or less stable mature stage, and an end stage (e.g. the keypad phone as a product).
Hence, a better method of forecasting is to use Moving Averages (MAs).
Time Series Forecasting Model
(Moving Average Model)
1.2 Moving Average Model
The MA approach calculates the average of a finite number of past observations and then employs that average as the
forecast for the next period.
The number of sample observations to be included in the calculation of the average is specified at the start of the
process. The term MA refers to the fact that as a new observation becomes available, a new average is calculated by
dropping the oldest observation in order to include the newest one.
An MA of order k, represented as MA(k), is calculated as:
MA(k) = (Yt + Yt−1 + … + Yt−k+1) / k
MA(3), MA(5) and MA(12) are commonly used for monthly data, and MA(4) is normally used for quarterly data.
MA(4) and MA(12) average out the seasonality factors in quarterly and monthly data respectively.
The advantage of the MA method is that its data requirement is very small.
The major disadvantage is that it assumes the data to be stationary.
MA is also called a simple moving average.
Moving Averages (MAs) cont…
Month:      1   2   3    4    5    6    7    8    9    10   11   12
Demand Yi:  89  57  144  221  177  280  223  286  212  275  188  312

MA(3) = (Y10 + Y11 + Y12) / 3 [most recent 3 values]
      = (275 + 188 + 312) / 3 = 258.33

MA(6) = (223 + 286 + 212 + 275 + 188 + 312) / 6 = 249.33

MA(12) = (89 + 57 + 144 + 221 + 177 + 280 + 223 + 286 + 212 + 275 + 188 + 312) / 12 = 205.33

Home Work
Calculate: MA(5), MA(4), MA(10)
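A minimal sketch reproducing the worked MA calculations above (the homework values can be obtained the same way; the helper name moving_average is ours):

```python
demand = [89, 57, 144, 221, 177, 280, 223, 286, 212, 275, 188, 312]

def moving_average(series, k):
    # MA(k): mean of the k most recent observations
    return sum(series[-k:]) / k

print(round(moving_average(demand, 3), 2))   # 258.33
print(round(moving_average(demand, 6), 2))   # 249.33
print(round(moving_average(demand, 12), 2))  # 205.33
```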
Time Series Forecasting Model
(Exponential Smoothing Model)
2.1 Exponential Smoothing Model
The extension to the MA method is to have a weighted MA: whereas in simple moving averages the
past observations are weighted equally, exponential smoothing assigns exponentially decreasing
weights as the observations get older. In other words, recent observations are given relatively more
weight in forecasting than older observations.
In the case of moving averages, the weights assigned to the observations are the same and equal to
1/N. In exponential smoothing, however, there are one or more smoothing parameters to be
determined (or estimated), and these choices determine the weights assigned to the observations.
This class of techniques consists of a range of methods:
Simple exponential smoothing (SES) / weighted moving average: used for data with no trend
or seasonality.
The more sophisticated and widely used Holt's or Holt-Winters' method, which is able to provide forecasts for data
that exhibit both seasonality and trend.
In all these methods, the observations are weighted in an exponentially decreasing manner as they
become older.
Simple Exponential Smoothing
For any time period t, the smoothed value St [the forecast value] is found by computing
St = α * yt−1 + (1 − α) * St−1
Let us expand this basic recurrence equation by substituting for St−1:
St = α * yt−1 + (1 − α) * [α * yt−2 + (1 − α) * St−2]
   = α * yt−1 + α * (1 − α) * yt−2 + (1 − α)² * St−2
   = α * (1 − α)^(1−1) * yt−1 + α * (1 − α)^(2−1) * yt−2 + (1 − α)² * St−2
This illustrates the exponential behaviour: the weights α * (1 − α)^t decrease geometrically with the age of the observation.
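A minimal sketch of this recurrence in Python, initialized with S2 = y1 as in the worked example later in this section (whose 12 observations are reused here):

```python
def ses(y, alpha):
    # Returns {t: S_t} for t = 2..n, using S_t = alpha*y_{t-1} + (1-alpha)*S_{t-1}
    smoothed = {2: y[0]}                  # S_2 = y_1
    for t in range(3, len(y) + 1):
        # y[t - 2] is y_{t-1} in the slides' 1-based notation
        smoothed[t] = alpha * y[t - 2] + (1 - alpha) * smoothed[t - 1]
    return smoothed

y = [71, 70, 69, 68, 64, 65, 72, 78, 75, 75, 75, 70]   # data from the later example
s = ses(y, alpha=0.1)
print(round(s[3], 2), round(s[4], 2))   # 70.9, 70.71 (matches the worked table)
```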
Simple Exponential Smoothing Problem
For α = 0.75 and α = 0.25, find the forecast for March and justify which α is best.
Simple Exponential Smoothing cont…
What is the best value for α?
The speed at which the older responses are dampened (smoothed) is a function of the value of α. When α is
close to 1, dampening is quick; when α is close to 0, dampening is slow. This is illustrated in the table
below.

α     (1−α)   (1−α)²   (1−α)³   (1−α)⁴
0.9   0.1     0.01     0.001    0.0001
0.5   0.5     0.25     0.125    0.0625
0.1   0.9     0.81     0.729    0.6561
Error calculation
The error is calculated as Et = yt – St
(i.e. difference of actual and smooth/forecast at time t)
Then the squared error is calculated, i.e. ESt = Et * Et.
Then, the sum of the squared errors (SSE) is calculated, i.e. SSE = ΣESi for i = 2 to n, where n is the number of
observations.
Then, the mean of the squared errors is calculated, i.e. MSE = SSE / (n − 1).
The best value for α is the one that results in the smallest MSE.
Simple Exponential Smoothing cont…
Let us illustrate this principle with an example. Consider the following data set consisting of 12 observations taken over
time with α as 0.1:
Time  yt  St                                  Et                  ESt
1     71
2     70  0.1 * 71 + (1 − 0.1) * 71 = 71      70 − 71 = −1.00     (−1.00)² = 1.00
3     69  0.1 * 70 + (1 − 0.1) * 71 = 70.9    69 − 70.9 = −1.90   (−1.90)² = 3.61
4     68  70.71                               −2.71               7.34
5     64  70.44                               −6.44               41.47
6     65  69.80                               −4.80               23.04
7     72  69.32                               2.68                7.18
8     78  69.58                               8.42                70.90
9     75  70.43                               4.57                20.88
10    75  70.88                               4.12                16.97
11    75  71.29                               3.71                13.76
12    70  71.67                               −1.67               2.79
Simple Exponential Smoothing cont…
The sum of the squared errors is SSE = 208.94, and the mean of the squared errors is MSE = SSE/11 = 19.0.
In a similar fashion, the MSE was again calculated for α = 0.5 and turned out to be _______, so in this case
we would prefer _____________.
Can we do better?
We could apply the proven trial-and-error method. This is an iterative procedure beginning with a
range of α between 0.1 and 0.9.
We determine the best initial choice for α and then search between α−Δ and α+Δ. We could repeat
this perhaps one more time to find the best α to 3 decimal places.
In general, most well-designed statistical software programs can find the value of α that minimizes the MSE.
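A grid search over α, scoring each candidate by the MSE defined above, is easy to sketch (the function name mse_for_alpha is ours):

```python
def mse_for_alpha(y, alpha):
    s = y[0]                       # S_2 = y_1
    sse = 0.0
    for t in range(1, len(y)):     # errors from t = 2 onward
        sse += (y[t] - s) ** 2     # E_t = y_t - S_t
        s = alpha * y[t] + (1 - alpha) * s   # S_{t+1}
    return sse / (len(y) - 1)      # MSE = SSE / (n - 1)

y = [71, 70, 69, 68, 64, 65, 72, 78, 75, 75, 75, 70]
best = min((a / 10 for a in range(1, 10)), key=lambda a: mse_for_alpha(y, a))
print(best, round(mse_for_alpha(y, best), 2))   # best alpha on the grid, its MSE
```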
Time Series Forecasting Model
(Holt’s Method)
2.2 Holt’s Method
Holt (1957) extended simple exponential smoothing to allow the forecasting of data with a trend
but with no seasonality.
This method involves a forecast equation and two smoothing equations (i.e. one for the level and one
for the trend).
The k-step-ahead forecast function for a given time series X is
X̂t+k = ℓt + k * bt
where, at time t, ℓt denotes an estimate of the level of the series and
bt denotes an estimate of the trend (slope) of the time series.
The equation for level is
ℓt = α * yt + (1 – α) * (ℓt-1 + bt-1)
The equation for trend is
bt = β * (ℓt - ℓt-1) + (1- β) * bt-1
where,
α is the smoothing parameter for the level, 0≤α≤1
β is the smoothing parameter for the trend, 0≤β≤1
Reasonable starting values for the level and slope are ℓ1 = X1 and b1 = X2 − X1.
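A minimal sketch of these recursions; the series values and the α, β choices below are hypothetical.

```python
# Holt's linear method: level and trend recursions, then a k-step-ahead forecast.
def holt(x, alpha, beta, k=1):
    level, trend = x[0], x[1] - x[0]     # start-up values: l_1 = X_1, b_1 = X_2 - X_1
    for obs in x[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (level + trend)   # level equation
        trend = beta * (level - prev_level) + (1 - beta) * trend  # trend equation
    return level + k * trend             # forecast: X_hat(t+k) = l_t + k * b_t

x = [10, 12, 13, 15, 16, 18]             # hypothetical trending series
print(round(holt(x, alpha=0.8, beta=0.2, k=2), 2))
```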
Evaluation of Forecasting Accuracy
What makes a good forecast? Of course, a good forecast is an accurate forecast.
A forecast “error” is the difference between an observed value and its forecast. The “error”
does not mean a mistake, it means the unpredictable part of an observation.
Error measure plays an important role in calibrating and refining forecasting model/method
and helps the analyst to improve forecasting method.
Mean Square Error (MSE)
MSE is defined as the mean (average) of the squares of the differences between the actual and estimated values.
Mathematically it is represented as:
MSE = (1/n) * Σt=1..n (X(t) − X′(t))²

Month:               1   2   3   4   5   6   7   8   9   10  11  12
Actual Demand:       42  45  49  55  57  60  62  58  54  50  44  40
Forecasted Demand:   44  46  48  50  55  60  64  60  53  48  42  38
Error:               -2  -1  1   5   2   0   -2  -2  1   2   2   2
Squared Error:       4   1   1   25  4   0   4   4   1   4   4   4

Here, MSE = (4 + 1 + 1 + 25 + 4 + 0 + 4 + 4 + 1 + 4 + 4 + 4) / 12 = 56 / 12 ≈ 4.67.
Mean Absolute Percentage Error (MAPE)
MAPE is defined as:
MAPE = (100/n) * Σt=1..n |X(t) − X′(t)| / X(t)
Here, X′(t) represents the forecasted data value of point t and X(t) represents the actual data value of point t. Calculate
MAPE for the below dataset.

Month:               1   2   3   4   5   6   7   8   9   10  11  12
Actual Demand:       42  45  49  55  57  60  62  58  54  50  44  40
Forecasted Demand:   44  46  48  50  55  60  64  60  53  48  42  38
MAPE is commonly used because it’s easy to interpret and easy to explain. For example, a MAPE value of 11.5%
means that the average difference between the forecasted value and the actual value is 11.5%.
The lower the value for MAPE, the better a model is able to forecast values e.g. a model with a MAPE of 2% is more
accurate than a model with a MAPE of 10%.
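A short sketch computing both error measures for the demand table above:

```python
actual   = [42, 45, 49, 55, 57, 60, 62, 58, 54, 50, 44, 40]
forecast = [44, 46, 48, 50, 55, 60, 64, 60, 53, 48, 42, 38]
n = len(actual)

# MSE: mean of squared errors; MAPE: mean absolute error as a % of actuals.
mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n
mape = 100 / n * sum(abs(a - f) / a for a, f in zip(actual, forecast))
print(round(mse, 2))    # 4.67
print(f"{mape:.1f}%")   # about 3.6% for these data
```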