Module III (Part II)(Regression and Time Series)
Module-III
What is Regression?
A way of predicting the value of one variable from another.
‒ It is a hypothetical model of the relationship between two variables.
‒ The model used here is a linear one.
‒ Regression is a statistical procedure that determines the equation for the
straight line that best fits a specific set of data.
• Any straight line can be represented by an equation of the form Y = bX + a, where
b and a are constants.
• The value of b is called the slope constant and determines the direction and
degree to which the line is tilted.
• The value of a is called the Y-intercept and determines the point where the line
crosses the Y-axis.
Main Objectives
Two main objectives:
Establish if there is a relationship between two variables
‒ Specifically, establish if there is a statistically significant relationship
between the two.
‒ Example: Income and expenditure, wage and gender, etc.
Forecast new observations.
‒ Can we use what we know about the relationship to forecast unobserved
values?
‒ Example: What will our sales be over the next quarter?
Variable’s Roles
Variables
Dependent
‒ This is the variable whose values we want to explain or forecast.
‒ Its values depend on something else.
‒ We denote it as Y.
Independent
‒ This is the variable that explains the other one.
‒ Its values are independent.
‒ We denote it as X.
Y=mX+c
A Linear Equation
You may remember one of these.
‒ y = a + bx
‒ y = mx + b
• In this regression discussion, we just use a different notation:
‒ y = β0 + β1x,
• where β0 is called the intercept and β1 is called the coefficient or slope
• The values of the regression parameters β0 and β1 are not known.
• We estimate them from data.
• β1 indicates the change in the mean response per unit increase in X.
To fit the regression line, a statistical approach known as the least squares method is used.
Least Squares Principle
• The least squares principle
• Dots are actual values of Y
• Asterisks are the predicted values of Y for a given value of X
Linear Regression cont…
The linear regression model provides a sloped straight line representing the relationship between the variables.
Consider the below image:
Linear Regression cont…
The calculation of b and a is as follows:
If b > 0, then x (predictor) and y (target) have a positive relationship, i.e. an increase in x will increase y.
If b < 0, then x (predictor) and y (target) have a negative relationship, i.e. an increase in x will decrease y.
If the sum of squared errors is taken as the metric to evaluate the model, then the goal is to obtain the line that best reduces the error.
Regression Line
We will write an estimated regression line based on sample data as
ŷ = b0 + b1x
The method of least squares chooses the values of b0 and b1 to minimize the sum of squared errors
SSE = Σi=1..n (yi − ŷi)² = Σi=1..n (yi − b0 − b1xi)²
The resulting estimates are
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = [n Σxiyi − (Σxi)(Σyi)] / [n Σxi² − (Σxi)²]
and
b0 = ȳ − b1x̄
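As a concrete illustration, here is a minimal Python sketch of these closed-form estimates. The helper name fit_simple_linear and the sample data (the four advertising/sales pairs visible in the next example) are our own illustrative choices, not part of the slides.

```python
# Minimal sketch: closed-form least squares estimates for b0 and b1,
# implementing the formulas above.
def fit_simple_linear(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
         sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar  # b0 = y_bar - b1 * x_bar
    return b0, b1

# Illustrative data: the four (x, y) pairs shown in the next example.
x = [41, 54, 64, 71]
y = [1250, 1380, 1575, 1650]
b0, b1 = fit_simple_linear(x, y)
print(f"y_hat = {b0:.2f} + {b1:.2f} x")
```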
Estimation of Mean Response
Fitted regression line can be used to estimate the mean value of y for a given value of x.
Example :
• The weekly advertising expenditure (x) and weekly sales (y) are presented in the following
table (the first four of the ten observations are shown):

y (sales)    x (expenditure)
1250         41
1380         54
1575         64
1650         71
…            …

From the data table we have:
n = 10, Σx = 564, Σx² = 32604, x̄ = 56.4, ȳ = 1436.5
b1 = 10.8
b0 = 1436.5 − 10.8(56.4) ≈ 828
Point Estimation of Mean Response
• The estimated regression function is:
ŷ = 828 + 10.8x
Sales = 828 + 10.8 × Expenditure
What happens if we change the intercept?
y = 4 + 2x
y = 9 + 2x
y = -2 + 2x
What happens if we change the slope?
y = 4 + 2x
y = 4 + 5x
y = 4 + 0x = 4
y = 4 - 3x
But, the world is not linear!
A fitted line such as y = 4 + 2x never matches the data exactly; the true value follows
y = β0 + β1x + ε
where ε is the random error term.
Simple Linear Regression Model
For a model with one predictor:
ŷ = b0 + b1x
Coefficient formula: b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
Intercept calculation: b0 = ȳ − b1x̄
Exploring ‘b1’
• If b1 > 0, then x (predictor) and y (target) have a positive relationship, i.e. an increase in x will increase y.
• If b1 < 0, then x (predictor) and y (target) have a negative relationship, i.e. an increase in x will decrease y.
Exploring ‘b0’
• If the model's data does not include x = 0, then a prediction at x = 0, which consists only of b0, is meaningless. For example, in a dataset that relates height (x) and weight (y), taking x = 0 (a height of 0) leaves only the b0 value in the equation, which is meaningless since in real life height and weight can never be zero.
• If the data does include x = 0, then b0 is the average of all predicted values when x = 0.
• If there is no b0 term, the regression line is forced to pass through the origin, and both the regression coefficients and the predictions will be biased.
Simple Linear Regression Model
Linear Regression Line
A straight line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:
Linear Regression cont…
The linear regression will thus be: Predicted (Y) = 8.3272 + 2.1466X
The above equation can be used to predict the volume of sales for an
insurance company given its number of agents. Thus, if a company has 1000
agents (10 hundreds), the predicted value of sales will be around
8.3272 + 2.1466 × 10 ≈ 29.79.
In summary, linear regression consists of the following steps:
Collect a sample of independent and dependent variable values.
Compute b and a.
Use these values to formulate the linear regression equation.
Given new values for X, predict the value of Y.
The larger and better the sample of data, the more accurate the regression
model will be, leading to more accurate forecasts.
Simple Linear Regression Model
Finding the best fit line:
When working with linear regression, our main goal is to find the best fit line,
which means the error between the predicted values and the actual values should be
minimized. The best fit line will have the least error.
Data for Linear Regression Example
Multicollinearity: When the predictors are highly correlated with each other, the variables
are said to be multicollinear. Many regression techniques assume that multicollinearity
is not present in the dataset, because it causes problems in ranking variables based on
their importance and makes it difficult to select the most important independent variable
(factor).
• Variance (Error in test data): Variance specifies the amount by which the
prediction would change if different training data were used. In
simple words, variance tells how much a random variable differs
from its expected value.
Under-fitting: When the model performs so poorly that it is unable to fit even the
training set well, it is said to be under-fitting the data. This is also known as the
problem of high bias. Bias is the amount by which a model's prediction differs
from the target value.
Terminologies
Reasons for under-fitting:
• High bias and low variance
• The size of the training dataset used is not enough.
• The model is too simple.
• Training data is not cleaned and also contains noise in it.
Techniques to reduce under-fitting:
• Increase model complexity
• Increase the number of features by performing feature engineering.
• Remove noise from the data.
• Increase the number of epochs or the duration of training to
get better results.
Terminologies
Reasons for over-fitting are as follows:
• High variance and low bias
• The model is too complex
• The size of the training data is too small
Techniques to reduce over-fitting:
• Train with more data.
• Reduce model complexity.
• Use cross-validation.
• Remove features.
• Stop the training early.
The General Idea of Regression Models
Simple regression considers the relation between a single explanatory variable and a
response variable. A simple regression model (one independent variable) fits a
regression line in 2-dimensional space.
Multiple Linear Regression with Two Independent
Variables
The formula for a multiple linear regression is:
y = β0+ β1x1 + β2x2 + e
where, y = the predicted value of the dependent variable
β0 = the y-intercept (value of y when all other parameters are set to 0)
β1x1 = the regression coefficient (β1) of the first independent variable (x1) multiplied by that variable
β2x2 = the regression coefficient (β2) of the second independent variable (x2) multiplied by that variable
e = model error
β1 and β2 are obtained by solving the least squares normal equations, and β0 is then calculated as β0 = ȳ − β1x̄1 − β2x̄2.
Case Study: Data Preparation for MR
Multiple Linear Regression Models (cont..)
Least Squares Estimation of the Parameters
• The least squares function is given by
L(β0, β1, …, βk) = Σi=1..n (yi − β0 − Σj=1..k βj xij)²
• Setting the partial derivatives of L with respect to each βj to zero yields the least squares normal equations.
• The solution to the normal equations gives the least squares estimators of the
regression coefficients.
Multiple Linear Regression Models (cont..)
Specifically, we will fit the multiple linear regression model
y = β0 + β1x1 + β2x2 + ε
where ε is the random error term, y is the pull strength, x1 is the wire length, and x2 is the die height.
Multiple Linear Regression Models (cont..)
Matrix Approach to Multiple Linear Regression
Example:
We illustrate fitting the multiple regression model
y = β0 + β1x1 + β2x2 + ε
where y is the observed pull strength for a wire bond, x1 is the wire length, and x2
is the die height. The 25 observations are in the table.
Now, we will use the matrix approach to fit the regression model to the data
provided. In matrix notation the model is y = Xβ + ε, and the least squares estimate is
β̂ = (X′X)⁻¹X′y
where the matrix X and the vector y for this model are built from the 25 observations.
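A minimal numpy sketch of this matrix computation; since the 25-observation table is not reproduced here, the five data rows below are hypothetical stand-ins.

```python
import numpy as np

# Sketch: least squares via the normal equations, beta_hat = (X'X)^(-1) X'y.
x1 = np.array([2.0, 8.0, 11.0, 10.0, 4.0])         # wire length (hypothetical)
x2 = np.array([50.0, 110.0, 120.0, 550.0, 200.0])  # die height (hypothetical)
y = np.array([9.95, 24.45, 31.75, 35.00, 18.00])   # pull strength (hypothetical)

X = np.column_stack([np.ones_like(x1), x1, x2])  # design matrix with intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # solves (X'X) beta = X'y
print(beta_hat)                                  # [b0, b1, b2]
```

Using np.linalg.solve on the normal equations avoids explicitly inverting X′X, which is numerically more stable than computing the inverse.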
Non-Linear Regression
In the case of linear and multiple linear regression, the dependent variable is linearly
dependent on the independent variable(s). But in several situations the relationship is not
so simple: the two variables might be related in a non-linear way.
This may be the case where the results from the correlation analysis show no linear
relationship but these variables might still be closely related.
If the result of the data analysis show that there is a non-linear (also known as curvilinear)
association between the two variables, then the need is to develop a non-linear regression
model.
The non-linear data can be handled in 2 ways:
Use of polynomial rather than linear regression model
Transform the data and then use linear regression model
Non-Linear Regression
Product Increase in sale in% (Y) Discount in %(X)
A 3.05 10
B 7.62 15
C 12.19 20
D 20.42 25
E 28.65 30
F 42.06 35
G 55.47 40
H 74.68 45
I 93.88 50
Non-Linear Regression
The scatter diagram of sales increase for various discount percentage looks as follows:
The value of r is 0.97, which indicates a very strong, almost perfect, positive correlation, and the data values appear to
form a slight curve.
Non-Linear Regression
Polynomials are equations that involve powers of the independent variables. Second degree (quadratic), third
degree (cubic), and n-degree polynomial functions:
Second degree: y = β0 + β1x + β2x² + e
Third degree: y = β0 + β1x + β2x² + β3x³ + e
n degree: y = β0 + β1x + β2x² + β3x³ + … + βnxⁿ + e
Where:
β0 is the intercept of the regression model
β1, β2, β3 are the coefficients of the predictors.
How to find the right degree of the equation?
As we increase the degree of the model, it tends to increase the performance of the model. However, increasing the
degree also increases the risk of over-fitting and under-fitting the data. So, one of the following approaches can be
adopted (see the sketch after this list):
Forward Selection: This method increases the degree until it is significant enough to define the best possible
model.
Backward Elimination: This method decreases the degree until it is significant enough to define the best possible
model.
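A minimal sketch of comparing polynomial degrees on the discount/sales table above with numpy; the degree grid and the SSE criterion are our illustrative choices.

```python
import numpy as np

# Fit polynomials of increasing degree and compare the sum of squared errors.
x = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50])   # Discount % (X)
y = np.array([3.05, 7.62, 12.19, 20.42, 28.65,
              42.06, 55.47, 74.68, 93.88])           # Increase in sale % (Y)

for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, degree)   # least squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    sse = np.sum((y - y_hat) ** 2)
    print(f"degree {degree}: SSE = {sse:.2f}")
```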
Non-Linear Regression
The techniques for fitting a polynomial model in one variable can be extended to fitting
polynomial models in two or more independent variables.
A second-order polynomial is most often used in practice, and its model with two independent variables is
specified by:
y = β0 + β1x1 + β2x2 + β11x1² + β22x2² + β12x1x2 + e
This is also termed a response surface. Response surface methodology is used to fit such
models and helps in designing experiments. This topic is generally covered in the design of
experiments.
Class work
Define the second-order polynomial model with two independent variables.
Define the second-order polynomial model with three independent variables.
Define the third-order polynomial model with two independent variables.
Non-Linear Regression
Tools in software such as SAS and Excel, or in languages such as Python and R, can estimate the values
of the predictor coefficients such as β0 and β1 and fit a curve in a non-linear fashion to the given data.
The following figure depicts the graph of increase in sale vs. discount, with the fitted curve.
Non-Linear Regression
An R² of 1 indicates that the regression model perfectly fits the data, while an R² of 0 indicates that the model
does not fit the data at all.
R² is calculated as follows:
R² = 1 − (SSres / SStot)
where SSres = Σ(yi − ŷi)² is the residual sum of squares and SStot = Σ(yi − ȳ)² is the total sum of squares.
In the example, a value of 0.99 for R² indicates that a quadratic model is a good fit for the data.
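A short sketch of this R² computation in Python, applied to the quadratic fit of the discount data (the helper name r_squared is ours):

```python
import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

x = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50])
y = np.array([3.05, 7.62, 12.19, 20.42, 28.65, 42.06, 55.47, 74.68, 93.88])
y_hat = np.polyval(np.polyfit(x, y, 2), x)   # quadratic fit
print(round(r_squared(y, y_hat), 2))         # close to 0.99 for these data
```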
Non-Linear Regression
Data points on a scattergram
Non-Linear Regression
To overcome under-fitting, we need to increase the complexity
of the model.
Non-Linear Regression
A comparison of fitting linear, quadratic and cubic curves on the dataset.
For degree = 20, the model also captures the noise in the data. This is an example of over-fitting.
Non-Linear Regression
The Bias vs Variance trade-off
Non-Linear Regression
Another preferable way to perform non-linear regression is to try to transform the data in order to
make the relationship between the two variables more linear and then use a regression model
rather than a polynomial one. Transformations aim to make a non-linear relationship between two
variables more linear so that it can be described by a linear regression model.
Non-Linear Regression
Application of Square root (√Y)
Non-Linear Regression
Square root transformation
In a similar fashion, logarithm and negative reciprocal transformations can be applied to the dependent
variable, followed by the application of a linear regression model.
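A minimal sketch of the transform-then-fit approach, here with the square root transformation applied to the discount data from earlier in this section; inverting the transform by squaring is the key step.

```python
import numpy as np

x = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50])
y = np.array([3.05, 7.62, 12.19, 20.42, 28.65, 42.06, 55.47, 74.68, 93.88])

# Fit a straight line to sqrt(Y) instead of Y ...
b1, b0 = np.polyfit(x, np.sqrt(y), 1)   # polyfit returns highest degree first
# ... then square the linear prediction to undo the transformation.
y_pred = (b0 + b1 * x) ** 2
print(np.round(y_pred, 2))
```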
Logistic Regression
• In linear regression, the Y variable is always a continuous variable.
• If the Y variable is categorical, you cannot use linear regression to model it.
• So what would you do when Y is a categorical variable with 2 classes?
• Logistic regression can be used to model and solve such problems, also called binary
classification problems.
• Logistic Regression is one of the most commonly used Machine Learning algorithms; it
is used to model a binary variable that takes only 2 values – 0 and 1.
• The objective of Logistic Regression is to develop a mathematical equation that can give
us a score in the range of 0 to 1.
Logistic Regression (Why?)
When the response variable has only 2 possible values, it is desirable to have a model that predicts the value either
as 0 or 1 or as a probability score that ranges between 0 and 1.
Linear regression does not have this capability: if you use linear regression to model a binary response
variable, the resulting model may not restrict the predicted Y values to within 0 and 1.
Logistic Regression
Logistic Regression was used in the biological sciences in the early twentieth century. It was then used in many
social science applications.
Logistic Regression is used when the dependent variable (target) is categorical. For example:
To predict whether an email is spam (1) or not (0). If the model infers a value of 0.932 on a particular
email message, it implies a 93.2% probability that the email message is spam; equivalently, a 6.8%
probability that it is not spam.
Whether the tumor is malignant (1) or not (0)
Logistic Regression (Example)
Spam Detection: Spam detection is a binary classification problem where we are given an email and we need to
classify whether or not it is spam. If the email is spam, we label it 1; if it is not spam, we label it 0.
Tumour Prediction: A Logistic Regression classifier may be used to identify whether a tumour is malignant or if
it is benign. Several medical imaging techniques are used to extract various features of tumours. For instance,
the size of the tumour, the affected body area, etc. These features are then fed to a Logistic Regression classifier
to identify if the tumour is malignant or if it is benign.
Health : Predicting if a given mass of tissue is benign or malignant
Marketing : Predicting if a given user will buy an insurance product or not
Banking : Predicting if a customer will default on a loan.
• Dichotomous categorical response variable Y
• e.g. Success/Failure, Remission/No Remission, Survived/Died, CHD/No CHD, Low Birth Weight/Normal
Birth Weight, etc…
Logistic Regression
It measures the relationship between the categorical dependent variable and one or more independent
variables by estimating probabilities using a logistic function which is the cumulative logistic distribution.
Since the predicted values are probabilities restricted to (0, 1), a logistic regression model
only predicts the probability of a particular outcome given the values of the existing data.
Example: A group of 20 students spends between 0 and 6 hours studying for an exam. How does the number
of hours spent studying affect the probability of the student passing the exam? The reason for using logistic
regression for this problem is that the values of the dependent variable, pass and fail, while represented by "1"
and "0", are not cardinal numbers. If the problem was changed so that pass/fail was replaced with the grade
0–100 (cardinal numbers), then simple regression analysis could be used. The table shows the number of
hours each student spent studying, and whether they passed (1) or failed (0).
Logistic Regression cont…
In logistic regression, we don't directly fit a straight line to the data as in linear regression. Instead, we fit an S-shaped
curve, called the sigmoid or logistic curve. A logistic regression curve showing the probability of
passing an exam versus hours studying is shown below.
The Y-axis goes from 0 to 1. This is because the sigmoid function's output always lies between 0 and 1,
which fits very well with the goal of classifying samples into two different categories (fail or pass).
The sigmoid function is sigmoid(x) = 1 / (1 + e⁻ˣ), where x is the weighted sum of the independent variable, i.e.
x = β0 + β1xi, where xi is an individual instance of the independent variable.
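A two-line Python version of this function, for reference:

```python
import math

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)); output always lies strictly between 0 and 1
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))   # 0.5
print(sigmoid(6.0))   # close to 1
print(sigmoid(-6.0))  # close to 0
```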
Logistic Regression cont…
Consider a model with one predictor X1 and one binary response variable Y, and let p = P(Y = 1 | X1 = x)
denote the probability of success. p must meet two criteria: (i) it must always be positive, and (ii) it must
always be less than or equal to 1.
We assume a linear relationship between the independent variable and the logit of the event Y = 1. In
statistics, the logit is the logarithm of the odds, i.e. p / (1 − p). This linear relationship can be written in the
following mathematical form, where ℓ is the logit, b is the base of the logarithm, and β0, β1 are the
parameters of the model:
ℓ = logb(p / (1 − p)) = β0 + β1x1
Solving for p gives p = Sb(β0 + β1x1), where Sb is the sigmoid function with base b. In some cases it can be
easier to communicate results by working in base 2, base 10, or the exponential constant e.
In reference to the students example, solving the equation with a software tool and taking the base as e, the
coefficients are β0 = −4.0777 and β1 = 1.5046.
Logistic Regression cont…
For example, for a student who studies 2 hours, entering the value Hours = 2 into the equation gives an
estimated probability of passing the exam of 0.26.
Similarly, for a student who studies 4 hours, the estimated probability of passing the exam is 0.87.
The following table shows the probability of passing the exam for several values of hours studied.
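The probabilities quoted above follow directly from the fitted coefficients; a minimal sketch (the helper name p_pass is ours):

```python
import math

# p(pass) = sigmoid(b0 + b1 * hours) with b0 = -4.0777, b1 = 1.5046 (from the slides)
def p_pass(hours):
    z = -4.0777 + 1.5046 * hours       # the logit (log-odds)
    return 1.0 / (1.0 + math.exp(-z))  # the sigmoid maps it to a probability

print(round(p_pass(2), 2))  # 0.26
print(round(p_pass(4), 2))  # 0.87
```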
Time Series
Examples of time series data: the profit of a company, the daily petrol price, hourly electricity consumption.
Time Series Analysis Examples
Economics: Gross Domestic Product (GDP), Consumer Price Index (CPI), and unemployment rates over
periods of time.
Time Series Model cont…
Any time series is a composition of many individual component time series. Some of these components are
predictable, whereas other components may be almost random and difficult to predict.
This calls for decomposition methods that generate the individual component series from the original
series. Decomposing a series into such components enables us to analyse the behaviour of each component and
then build the forecast by reassembling the individual components. This improves the accuracy of the final
forecast.
Example: A typical sales time series.
Time Series Model Component
Time series models are characterized by four components:
Trend component
Seasonal component
Cyclical component
Irregular component
Time Series Model Component contd...
I. Trend component
The trend shows the general tendency of the data to increase or decrease during a long
period of time. A trend is a smooth, general, long-term, average tendency.
It is not always necessary that the increase or decrease is in the same direction
throughout the given period of time.
The tendencies may increase, decrease or remain stable in different sections of time,
but the overall trend must be upward, downward or stable.
Population, agricultural production, items manufactured, the number of births and
deaths, the number of industries or factories, and the number of schools or colleges are
examples showing some kind of tendency of movement.
Time Series Model Component contd...
I. Trend component: upward, downward, stable
Time Series Model Component contd...
II. Seasonal component
These are the rhythmic forces which operate in a regular and periodic manner over a span of less than a
year. They have the same or almost the same pattern during a period of 12 months. This variation will be
present in a time series if the data are recorded hourly, daily, weekly, quarterly, or monthly.
These variations come into play either because of the natural forces or person-made conventions.
Time Series Model Component cont…
III. Cyclical component
The variations in a time series which repeat over a span of more than one year are the
cyclic variations. This oscillatory movement has a period of oscillation of more than a year, and one
complete period is a cycle. This cyclic movement is sometimes called the ‘Business Cycle’.
‘Business Cycle’: It is a four-phase cycle comprising the phases of prosperity, recession,
depression, and recovery. The cyclic variation may be regular but not periodic. The upswings and
the downswings in business depend upon the joint nature of the economic forces and the
interaction between them.
Time Series Model Component contd...
IV. Irregular component
These are variations that are not regular; they are purely random or irregular. The fluctuations
are unforeseen, uncontrollable, unpredictable, and erratic. Such forces include
earthquakes, wars, floods, famines, and other disasters.
Time Series Model Component contd...
Time Series Model Component cont…
Decomposition Model
Mathematical representation of the decomposition approach is Yt = f(Tt, St, Ct, It) where Yt is the time series value at
time t. Tt, St, Ct, and It are the trend, seasonal, cyclic and irregular component value at time t respectively.
There are 3 types of decomposition model:
Additive model
Multiplicative model
Mixed model
Additive model
According to this model, a time series is expressed as Yt = Tt + St + Ct + It
The model is appropriate when the amplitude of both the seasonal and irregular variations does not change
as the level of the trend rises or falls.
This model assumes that all four components of the time series act independently of each other.
Multiplicative model
According to this model, a time series is expressed as Yt = Tt * St * Ct * It
The model is appropriate when the amplitude of both the seasonal and irregular variations increases as
the level of the trend rises.
The model assumes that the various components operate proportionately to each other.
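In practice these decompositions can be computed with library routines; a sketch using statsmodels' seasonal_decompose (the file name sales.csv and its column names are hypothetical). Note that this routine folds the cyclical component into the trend and residual.

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly sales series; file and column names are stand-ins.
series = pd.read_csv("sales.csv", index_col="month", parse_dates=True)["sales"]

# Additive decomposition:        Yt = Tt + St + It
result = seasonal_decompose(series, model="additive", period=12)
# Multiplicative decomposition:  Yt = Tt * St * It
# result = seasonal_decompose(series, model="multiplicative", period=12)

print(result.trend.head())  # Tt (cyclical variation is folded into trend/residual)
result.plot()               # panels for observed, trend, seasonal, residual
```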
Decomposition Model cont…
Decomposition Model cont…
III. Mixed model
The time series analysis can also be done using the model as:
Yt = Tt + St * Ct * It
Yt = Tt * St + Ct * It
Home Work
How to determine if a time series has a trend component?
How to determine if a time series has a seasonal component?
How to determine if a time series has both a trend and seasonal component?
Time Series Forecasting Model
Time series forecasting is required to make scientific predictions based on historical time-stamped
data. It involves building models [called Time Series Forecasting Models] through
historical analysis and using them to make observations and drive future strategic decision-making.
1.1 Simple Average Model
The simple average method uses the mean of all the past values to forecast the next
value. This method is of little use in most practical scenarios.
It is used when the time series has attained some level of stability and no
longer depends on any external parameters.
In sales forecasting, this happens only when the product for which the forecast is
needed is at a mature stage in its life cycle.
The Averaging Model is represented as follows, where Ft+1 is the forecasted value at time
instant t+1, t is the current time and Yi is the value of the series at time instant i:
Ft+1 = (Y1 + Y2 + … + Yt) / t
Averaging Model
A manager of a warehouse wants to know how much a typical supplier delivers
in 10-dollar units. He/she has taken a sample of 12 suppliers at random,
obtaining the results shown in the table.

Supplier: 1  2  3  4   5  6   7   8  9   10  11  12
Amount:   9  8  9  12  9  12  11  7  13  9   11  10

The computed mean of the amounts is 10, and hence the manager decides to
use this as the estimate for the expenditure of a typical supplier.
It is more reasonable to assume that the recent points in the past are better
predictors than the whole history. This is particularly true for sales forecasting.
Every product has a life cycle: an initial stage, a middle volatile period, a more
or less stable mature stage, and an end stage (e.g. the keypad phone as a product).
Hence, a better method of forecasting is to use Moving Averages (MAs).
Time Series Forecasting Model
(Moving Average Model)
1.2 Moving Average Model
The MA approach calculates the average of a finite number of past observations and then employs that average as the
forecast for the next period.
The number of sample observations to be included in the calculation of the average is specified at the start of the
process. The term MA refers to the fact that as a new observation becomes available, a new average is calculated by
dropping the oldest observation in order to include the newest one.
An MA of order k, represented as MA(k), is calculated as:
MA(k) = (Yt + Yt−1 + … + Yt−k+1) / k
MA(3), MA(5) and MA(12) are commonly used for monthly data, and MA(4) is normally used for quarterly data.
MA(4) and MA(12) average out the seasonality factors in quarterly and monthly data respectively.
The advantage of the MA method is that its data requirement is very small.
The major disadvantage is that it assumes the data to be stationary.
MA is also called a simple moving average.
Moving Averages (MAs) cont…
Month:      1   2   3    4    5    6    7    8    9    10   11   12
Demand Yi:  89  57  144  221  177  280  223  286  212  275  188  312

MA(3) = (Y10 + Y11 + Y12) / 3 [most recent 3 values]
      = (275 + 188 + 312) / 3 = 258.33

MA(6) = (223 + 286 + 212 + 275 + 188 + 312) / 6 = 249.33

MA(12) = (89 + 57 + 144 + 221 + 177 + 280 + 223 + 286 + 212 + 275 + 188 + 312) / 12 = 205.33

Home Work
Calculate: MA(5), MA(4), MA(10)
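A minimal sketch reproducing the worked MA calculations above (the homework values can be obtained the same way; the helper name moving_average is ours):

```python
demand = [89, 57, 144, 221, 177, 280, 223, 286, 212, 275, 188, 312]

def moving_average(series, k):
    # MA(k): mean of the k most recent observations
    return sum(series[-k:]) / k

print(round(moving_average(demand, 3), 2))   # 258.33
print(round(moving_average(demand, 6), 2))   # 249.33
print(round(moving_average(demand, 12), 2))  # 205.33
```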
Time Series Forecasting Model
(Exponential Smoothing Model)
2.1 Exponential Smoothing Model
The extension to the MA method is to have a weighted MA: whereas in simple moving averages the
past observations are weighted equally, exponential smoothing assigns exponentially decreasing
weights as the observations get older. In other words, recent observations are given relatively more
weight in forecasting than older observations.
In the case of moving averages, the weights assigned to the observations are the same and equal to
1/N. In exponential smoothing, however, there are one or more smoothing parameters to be
determined (or estimated), and these choices determine the weights assigned to the observations.
This class of techniques consists of a range of methods:
Simple exponential smoothing (SES) / weighted moving average: used for data with no trend
or seasonality.
The more sophisticated and widely used Holt's or Holt-Winters' method, which is able to provide forecasts for data
that exhibit both seasonality and trend.
In all these methods, the observations are weighted in an exponentially decreasing manner as they
become older.
Simple Exponential Smoothing
For any time period t, the smoothed value St [the forecast value] is found by computing
St = α * yt−1 + (1 − α) * St−1
Let us expand this basic recurrence equation by substituting for St−1:
St = α * yt−1 + (1 − α) * [α * yt−2 + (1 − α) * St−2]
   = α * yt−1 + α * (1 − α) * yt−2 + (1 − α)² * St−2
   = α * (1 − α)^(1−1) * yt−1 + α * (1 − α)^(2−1) * yt−2 + (1 − α)² * St−2
This illustrates the exponential behaviour: the weights α * (1 − α)^t decrease geometrically with the age of the observation.
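A minimal sketch of this recurrence in Python, initialized with S2 = y1 as in the worked example later in this section (whose 12 observations are reused here):

```python
def ses(y, alpha):
    # Returns {t: S_t} for t = 2..n, using S_t = alpha*y_{t-1} + (1-alpha)*S_{t-1}
    smoothed = {2: y[0]}                  # S_2 = y_1
    for t in range(3, len(y) + 1):
        # y[t - 2] is y_{t-1} in the slides' 1-based notation
        smoothed[t] = alpha * y[t - 2] + (1 - alpha) * smoothed[t - 1]
    return smoothed

y = [71, 70, 69, 68, 64, 65, 72, 78, 75, 75, 75, 70]   # data from the later example
s = ses(y, alpha=0.1)
print(round(s[3], 2), round(s[4], 2))   # 70.9, 70.71 (matches the worked table)
```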
Simple Exponential Smoothing Problem
For α = 0.75 and α = 0.25, find the forecast for March and justify which α is best.
Simple Exponential Smoothing cont…
What is the best value for α?
The speed at which the older responses are dampened (smoothed) is a function of the value of α. When α is
close to 1, dampening is quick; when α is close to 0, dampening is slow. This is illustrated in the table
below.

α     (1−α)   (1−α)²   (1−α)³   (1−α)⁴
0.9   0.1     0.01     0.001    0.0001
0.5   0.5     0.25     0.125    0.0625
0.1   0.9     0.81     0.729    0.6561
Error calculation
The error is calculated as Et = yt – St
(i.e. difference of actual and smooth/forecast at time t)
Then the squared error is calculated, i.e. ESt = Et * Et.
Then, the sum of the squared errors (SSE) is calculated, i.e. SSE = ΣESi for i = 2 to n, where n is the number of
observations.
Then, the mean of the squared errors is calculated, i.e. MSE = SSE / (n − 1).
The best value for α is the one that results in the smallest MSE.
Simple Exponential Smoothing cont…
Let us illustrate this principle with an example. Consider the following data set consisting of 12 observations taken over
time with α as 0.1:
Time  yt  St                                  Et                  ESt
1     71
2     70  0.1 * 71 + (1 − 0.1) * 71 = 71      70 − 71 = −1.00     (−1.00)² = 1.00
3     69  0.1 * 70 + (1 − 0.1) * 71 = 70.9    69 − 70.9 = −1.90   (−1.90)² = 3.61
4     68  70.71                               −2.71               7.34
5     64  70.44                               −6.44               41.47
6     65  69.80                               −4.80               23.04
7     72  69.32                               2.68                7.18
8     78  69.58                               8.42                70.90
9     75  70.43                               4.57                20.88
10    75  70.88                               4.12                16.97
11    75  71.29                               3.71                13.76
12    70  71.67                               −1.67               2.79
Simple Exponential Smoothing cont…
The sum of the squared errors is SSE = 208.94, and the mean of the squared errors is MSE = SSE/11 = 19.0.
In a similar fashion, the MSE was again calculated for α = 0.5 and turned out to be _______, so in this case
we would prefer _____________.
Can we do better?
We could apply the proven trial-and-error method. This is an iterative procedure beginning with a
range of α between 0.1 and 0.9.
We determine the best initial choice for α and then search between α−Δ and α+Δ. We could repeat
this perhaps one more time to find the best α to 3 decimal places.
In general, most well-designed statistical software programs can find the value of α that minimizes the MSE.
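A grid search over α, scoring each candidate by the MSE defined above, is easy to sketch (the function name mse_for_alpha is ours):

```python
def mse_for_alpha(y, alpha):
    s = y[0]                       # S_2 = y_1
    sse = 0.0
    for t in range(1, len(y)):     # errors from t = 2 onward
        sse += (y[t] - s) ** 2     # E_t = y_t - S_t
        s = alpha * y[t] + (1 - alpha) * s   # S_{t+1}
    return sse / (len(y) - 1)      # MSE = SSE / (n - 1)

y = [71, 70, 69, 68, 64, 65, 72, 78, 75, 75, 75, 70]
best = min((a / 10 for a in range(1, 10)), key=lambda a: mse_for_alpha(y, a))
print(best, round(mse_for_alpha(y, best), 2))   # best alpha on the grid, its MSE
```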
Time Series Forecasting Model
(Holt’s Method)
2.2 Holt’s Method
Holt (1957) extended simple exponential smoothing to allow the forecasting of data with a trend
but with no seasonality.
This method involves a forecast equation and two smoothing equations (i.e. one for the level and one
for the trend).
The k-step-ahead forecast function for a given time series X is
X̂t+k = ℓt + k * bt
where, at time t, ℓt denotes an estimate of the level of the series and
bt denotes an estimate of the trend (slope) of the time series.
The equation for level is
ℓt = α * yt + (1 – α) * (ℓt-1 + bt-1)
The equation for trend is
bt = β * (ℓt - ℓt-1) + (1- β) * bt-1
where,
α is the smoothing parameter for the level, 0≤α≤1
β is the smoothing parameter for the trend, 0≤β≤1
Reasonable starting values for the level and slope are ℓ1 = X1 and b1 = X2 − X1.
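A minimal sketch of these recursions; the series values and the α, β choices below are hypothetical.

```python
# Holt's linear method: level and trend recursions, then a k-step-ahead forecast.
def holt(x, alpha, beta, k=1):
    level, trend = x[0], x[1] - x[0]     # start-up values: l_1 = X_1, b_1 = X_2 - X_1
    for obs in x[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (level + trend)   # level equation
        trend = beta * (level - prev_level) + (1 - beta) * trend  # trend equation
    return level + k * trend             # forecast: X_hat(t+k) = l_t + k * b_t

x = [10, 12, 13, 15, 16, 18]             # hypothetical trending series
print(round(holt(x, alpha=0.8, beta=0.2, k=2), 2))
```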
Evaluation of Forecasting Accuracy
What makes a good forecast? Of course, a good forecast is an accurate forecast.
A forecast “error” is the difference between an observed value and its forecast. The “error”
does not mean a mistake, it means the unpredictable part of an observation.
Error measure plays an important role in calibrating and refining forecasting model/method
and helps the analyst to improve forecasting method.
Mean Square Error (MSE)
MSE is defined as the mean (average) of the squares of the differences between the actual and estimated values.
Mathematically it is represented as:
MSE = (1/n) * Σt=1..n (X(t) − X′(t))²

Month:               1   2   3   4   5   6   7   8   9   10  11  12
Actual Demand:       42  45  49  55  57  60  62  58  54  50  44  40
Forecasted Demand:   44  46  48  50  55  60  64  60  53  48  42  38
Error:               -2  -1  1   5   2   0   -2  -2  1   2   2   2
Squared Error:       4   1   1   25  4   0   4   4   1   4   4   4

Here, MSE = (4 + 1 + 1 + 25 + 4 + 0 + 4 + 4 + 1 + 4 + 4 + 4) / 12 = 56 / 12 ≈ 4.67.
Mean Absolute Percentage Error (MAPE)
MAPE is defined as:
MAPE = (100/n) * Σt=1..n |X(t) − X′(t)| / X(t)
Here, X′(t) represents the forecasted data value of point t and X(t) represents the actual data value of point t. Calculate
MAPE for the below dataset.

Month:               1   2   3   4   5   6   7   8   9   10  11  12
Actual Demand:       42  45  49  55  57  60  62  58  54  50  44  40
Forecasted Demand:   44  46  48  50  55  60  64  60  53  48  42  38
MAPE is commonly used because it’s easy to interpret and easy to explain. For example, a MAPE value of 11.5%
means that the average difference between the forecasted value and the actual value is 11.5%.
The lower the value for MAPE, the better a model is able to forecast values e.g. a model with a MAPE of 2% is more
accurate than a model with a MAPE of 10%.
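A short sketch computing both error measures for the demand table above:

```python
actual   = [42, 45, 49, 55, 57, 60, 62, 58, 54, 50, 44, 40]
forecast = [44, 46, 48, 50, 55, 60, 64, 60, 53, 48, 42, 38]
n = len(actual)

# MSE: mean of squared errors; MAPE: mean absolute error as a % of actuals.
mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n
mape = 100 / n * sum(abs(a - f) / a for a, f in zip(actual, forecast))
print(round(mse, 2))    # 4.67
print(f"{mape:.1f}%")   # about 3.6% for these data
```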