White Paper On Regression
On 07/09/2010
By
Ashish Maheshwari
(B09004)
Statistical Modelling:
Statistical modelling involves the appropriate application of statistical techniques, each of
which requires certain assumptions in order to perform hypothesis tests, interpret the data and
reach valid conclusions. Data from experiments, product testing, simulation, surveys, and
statistical process and quality control must be analysed appropriately before results can be
determined and conclusions drawn. Results from experiments or testing must be obtained by
following established statistical procedures, including experimental design and the
appropriate use of statistical analysis and modelling techniques. Such results can then be
reproduced, within sampling error, by repeating the experiment.
Benefits:
• Application of appropriate statistical analysis techniques
• Development of appropriate conclusions and key learning from the data
• Ensuring results address experimental objectives
• Maximizing information gained from the data
• Maximizing chances of the experiment being successful
Techniques:
1. Statistical analysis and modelling techniques
2. Descriptive techniques
3. Data graphs, plots and exploratory data analysis
4. Multiple linear regression analysis
5. Logistic regression
6. Time series analysis
7. Discriminant analysis
8. Factor analysis
9. Cluster analysis
10. Multivariate analysis
11. Nonparametric analysis
12. Experimental design
Care must be taken when historical data is used to estimate the regression equation.
Conditions can change and violate one or more of the assumptions on which the regression
analysis depends.
Another error may arise when some variables depend on time. Suppose a firm uses regression
analysis to determine the relationship between the number of employees and production
volume. If the observations used in the analysis extend back several years, the resulting
regression line may be too steep because it fails to recognise the effect of changing
technology.
5. Relationships that have no common bond:
When applying regression analysis, people sometimes find a relationship between two
variables that in fact have no common bond.
For example, one could find a statistical relationship between the miles per gallon achieved
by eight different cars and the distances from Earth to the other eight planets. But because
there is no common bond between gas mileage and the distance to other planets, this
relationship would be meaningless.
6. Finding things that do not exist:
If one runs a large number of regressions between many pairs of variables, some
interesting-looking relationships will appear by chance alone. For example, one might find a
strong statistical relationship between your income and the amount of beer consumed in the
US, or even between the length of a freight train and the weather. But in neither case is
there a factor common to both variables. Hence, such relationships are meaningless.
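The point can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the number of pairs (1,000), the sample size (8, echoing the eight-car example above), the 0.7 threshold and the random seed are all arbitrary choices, and every variable is pure random noise with no common factor.

```python
import numpy as np

# ASSUMPTION: 1,000 pairs of unrelated variables with 8 observations each.
# All values are random noise, so any "relationship" found is spurious.
rng = np.random.default_rng(0)
n_pairs, n_obs = 1000, 8

x = rng.normal(size=(n_pairs, n_obs))
y = rng.normal(size=(n_pairs, n_obs))

# Pearson correlation coefficient r for each pair (one pair per row).
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
r = (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))

# Count the pairs that look strongly related even though none are.
strong = int((np.abs(r) > 0.7).sum())
print(f"pairs with |r| > 0.7: {strong} of {n_pairs}")
print(f"largest |r| found: {np.abs(r).max():.2f}")
```

With so few observations per pair, a noticeable share of unrelated pairs clears the threshold, which is why a high correlation found by searching through many regressions needs a plausible common factor before it is trusted.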
7. Misinterpreting r and r2:
Social Model:
1. Health survey:
Take the tuberculosis (TB) data from the National Family Health Survey as an example. To
study how reporting a TB infection and seeking treatment vary for men and women across
socio-economic characteristics, multivariate logistic regression is applied to find the
significant factors explaining TB reporting and treatment-seeking.
2. Analysis of urbanisation:
Take the projection of China's urbanisation level as an example. It can be projected by
applying a linear regression model or an S-curve regression model.
The linear formula is: Ut = a0 + a1*t
where t, the year, is the independent variable and Ut, the urbanisation level in year t, is
the dependent variable.
Based on the urbanisation levels for the period 1983-1999 under the 1990 census definition,
the constants in this formula are estimated, giving the linear regression equation:
Ut = -1026.54 + 0.529*t
The statistical features of this equation are as follows:
R2 = 0.98, F = 714.46, Sig F = 0.00000,
which indicates that the model is statistically significant.
Source:(www.iiasa.ac.at/admin)
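As a worked illustration, the fitted line can be evaluated for a given year. The sketch below uses only the coefficients quoted above; the function name is hypothetical, and the unit of Ut is not stated in the text. The last lines are a rough consistency check that ASSUMES one observation per year for 1983-1999 (n = 17), which the text does not state explicitly; for a simple regression, R2 = F / (F + n - 2).

```python
def urbanisation_level(year: float) -> float:
    """Fitted linear model quoted in the text: Ut = -1026.54 + 0.529 * t."""
    return -1026.54 + 0.529 * year


# Evaluate the fitted line inside and just beyond the estimation period.
for year in (1990, 1999, 2000):
    print(year, round(urbanisation_level(year), 2))

# Consistency check between the reported F and R2.
# ASSUMPTION: annual data for 1983-1999, i.e. n = 17 observations.
n = 17
f_stat = 714.46
r2_from_f = f_stat / (f_stat + (n - 2))
print(round(r2_from_f, 2))
```

Evaluating at t = 2000 gives 31.46, and under the n = 17 assumption the reported F implies an R2 that rounds to the reported 0.98.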
3. Land use change scenario projections:
If the study area includes all the countries in the world, future proportions of artificial
surfaces per region are derived from projections of population and GDP, using a regression
model. A linear regression model links the proportion of artificial surfaces per region to
population and gross domestic product per capita, with the country and the urban type as
additional factors.
a. Coefficient of determination:
In statistics, the coefficient of determination, R2, is used in the context of statistical
models whose main purpose is the prediction of future outcomes on the basis of other related
information. It is the proportion of variability in a data set that is accounted for by the
statistical model. It provides a measure of how well future outcomes are likely to be
predicted by the model. There are several different definitions of R2, which are only
sometimes equivalent. One class of such cases includes that of linear regression. In this
case, R2 is simply the square of the sample correlation coefficient between the outcomes and
their predicted values or, in the case of simple linear regression, between the outcome and
the values being used for prediction. In such cases, the values vary from 0 to 1. The closer
R2 is to 1, the more of the variability the model explains; the closer it is to 0, the less
it explains.
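The equivalence described above can be checked numerically. The following is a minimal sketch using made-up data (the x and y values are hypothetical and chosen only for illustration). It fits a simple linear regression with NumPy and compares R2 computed from sums of squares with the two squared correlations mentioned in the text.

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# Least-squares line; polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

# R2 as the proportion of variability accounted for by the model.
ss_res = np.sum((y - y_hat) ** 2)       # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation
r2 = 1.0 - ss_res / ss_tot

# Squared sample correlations: outcome vs predictor, outcome vs fitted value.
r_xy = np.corrcoef(x, y)[0, 1]
r_yhat = np.corrcoef(y, y_hat)[0, 1]

print(r2, r_xy**2, r_yhat**2)
```

For a least-squares line with an intercept, the three printed values agree up to floating-point rounding.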
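[Data table: monthly closing prices of Orissa Cements Limited and the Sensex. Only the values 1.613675, 1.86311, 94% and 25% survive.]
(Source: www.bseindia.com)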
The data above shows the monthly closing price of Orissa Cements Limited (OCL) from March 04 to
February 10, alongside the Sensex data for the same period. By running a regression analysis on
this data, we can calculate the Beta of the stock.
When analysts use the capital asset pricing model (CAPM), they generally use regression to
calculate Beta. Beta is used to calculate the cost of capital for a company. It helps in valuing
a company and in further equity research and recommendations to investors.
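A minimal sketch of the Beta calculation follows. The prices are hypothetical and are NOT the OCL or Sensex data, and the sketch assumes the regression is run on period-over-period returns rather than on price levels. Beta is the slope of the regression of stock returns on index returns, which equals the covariance of the two return series divided by the variance of the index returns.

```python
import numpy as np

# Hypothetical monthly closing prices (NOT the OCL / Sensex data).
stock_prices = np.array([100.0, 104.0, 101.0, 108.0, 112.0, 109.0, 118.0])
index_prices = np.array([5000.0, 5100.0, 5050.0, 5200.0, 5300.0, 5250.0, 5400.0])

# ASSUMPTION: simple period-over-period returns are the regression inputs.
stock_ret = stock_prices[1:] / stock_prices[:-1] - 1.0
index_ret = index_prices[1:] / index_prices[:-1] - 1.0

# Regress stock returns (Y) on index returns (X); the slope is Beta.
beta, alpha = np.polyfit(index_ret, stock_ret, 1)

# Equivalent closed form: cov(index, stock) / var(index).
beta_cov = np.cov(index_ret, stock_ret, ddof=1)[0, 1] / np.var(index_ret, ddof=1)

print(f"beta from regression: {beta:.4f}")
print(f"beta from cov/var:    {beta_cov:.4f}")
```

A Beta above 1 means the stock's returns move more than proportionally with the index, and a Beta below 1 means they move less.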
• Hypothesis 1:
• Hypothesis 2: The stock price of the company is more sensitive than the Sensex.
Since the statistics behind regression may overwhelm some users, Microsoft Excel includes
regression tools in the standard copy of the software. Below, Excel 7.0 is used to illustrate how
easily the regression can be calculated.
Step 1:
Step 2:
Step 3:
Obtain data for the dependent variable and the independent variable from past periods. For this
business model, we use the OCL stock price as well as the Sensex, from March 04 to February 10.
Step 4:
Run the regression to assess the level of fit. To complete the regression analysis, we first need
to enable an add-in that comes with the standard version of Excel (the Analysis ToolPak). Once the
data has been entered, select the data to be analysed and run the regression tool to open the
Regression dialog box. Keep in mind that the Y range is the dependent variable and the X range is
the independent variable.
Regression Statistics
Multiple R           0.554717
R Square             0.307711
Adjusted R Square    0.297677
Standard Error       0.173784
Observations         71

ANOVA
              df    SS
Regression     1    0.926242
Residual      69    2.083865
Total         70    3.010107

              Coefficients    Standard Error
Intercept     -0.00916        0.021124
X Variable 1   1.357989       0.245213

Notes on the output:
1. R Square: basic R2 statistic.
2. Adjusted R Square: R2 statistic for analysis purposes.
3. Standard Error: standard error for each regression.
4. Regression SS: total sum of squared regression.
5. Residual SS: total sum of squared errors.
6. Total SS: total sum of squares.

The performance of the Sensex reflects the collective performance of the thirty company stocks
that make up the index on the BSE. We assume here that the volatility of the Sensex affects the
stock price of a company. If an increase in the Sensex increases the stock price, there is a
positive correlation between them, and vice versa.
Y=0.2305x+0.0159
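The figures in the regression output are tied together by a few identities, which makes it possible to cross-check them. The sketch below is only a check, not part of the original Excel workflow. The input values are the sums of squares, degrees of freedom and slope estimate as read from the output above (reassembling digits that were split across lines, which is an assumption), and the variable names are hypothetical.

```python
import math

# Values as read from the Excel regression output (ASSUMED reading of the
# split digits): observations, degrees of freedom, sums of squares, slope.
n_obs = 71
df_regression, df_residual = 1, 69
ss_regression, ss_residual, ss_total = 0.926242, 2.083865, 3.010107
slope, slope_se = 1.357989, 0.245213

# R Square: share of total variation explained by the regression.
r_square = ss_regression / ss_total

# Adjusted R Square: penalises R Square for the number of predictors.
adj_r_square = 1 - (1 - r_square) * (n_obs - 1) / df_residual

# Standard Error of the regression: root of the residual mean square.
standard_error = math.sqrt(ss_residual / df_residual)

# F statistic, and the t statistic of the slope (t squared equals F when
# there is a single predictor).
f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)
t_slope = slope / slope_se

print(round(r_square, 6), round(adj_r_square, 6), round(standard_error, 6))
print(round(f_stat, 2), round(t_slope**2, 2))
```

The recomputed R Square, Adjusted R Square and Standard Error agree with the reported 0.307711, 0.297677 and 0.173784 to within rounding, and the slope of about 1.36 is the Beta estimate.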
Executive Summary:
The linear regression model above gives the Beta of a company's stock, which in turn indicates the
volatility of that stock. It also shows how the stock is performing in the market and whether it
is in accordance with the economic growth of the country. It shows whether the Sensex return in a
given period has a positive or a negative impact on the company's stock return in the same
period.
Business Model:
Year    No. of cars sold    Fuel price per barrel (Rs)    1/fuel price per barrel (1/Rs)    Per capita income
2002    6626387             1112.67                       0.000898738                       19040
2003    6240526             1292.85                       0.000773487                       20989
2004    6814554             1702.16                       0.000587491                       23241
2005    7338314             2177.74                       0.000459191                       20813
2006    8036010             2643.91                       0.000378228                       23222
2007    8534690             2605.88                       0.000383748                       29382
2008    9237780             4258.39                       0.00023483                        37490
I have taken data on the number of Toyota cars sold, the fuel price per barrel and per capita
income for the years 2002 to 2008.
Source:
The business model in this context is to find out how the sale of Toyota cars depends on fuel
price and per capita income. From this model we can forecast the sale of Toyota cars.
Hypothesis 1:
Hypothesis 2:
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.949342
R Square             0.901249
Adjusted R Square    0.851874
Standard Error       421834.6
Observations         7

ANOVA
              df    SS          MS          F           Significance F
Regression     2    6.5E+12     3.25E+12    18.25304    0.009752
Residual       4    7.12E+11    1.78E+11
Total          6    7.21E+12
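R2 is 0.90, which is near 1. This indicates that fuel price and per capita income together
explain about 90% of the variation in the sale of Toyota cars. The model can be written as
Y = 6958610 - 2.5E+09*x1 + 77.99742*x2
where Y is the number of cars sold, x1 is 1/fuel price per barrel and x2 is per capita income.
The single-variable trend lines are:
Y = -4E+09x + 1E+07 (x = 1/fuel price per barrel)
Y = 149.56x + 4E+06 (x = per capita income)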
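The regression can be reproduced outside Excel. The sketch below refits the model by ordinary least squares with NumPy, using the seven rows of the data table. It ASSUMES that the two regressors are 1/fuel price per barrel and per capita income, which is the reading consistent with the size of the reported coefficients. The helper `predict_sales` and the inputs in the usage line are hypothetical.

```python
import numpy as np

# Data from the table above (2002-2008).
cars = np.array([6626387, 6240526, 6814554, 7338314,
                 8036010, 8534690, 9237780], dtype=float)
fuel = np.array([1112.67, 1292.85, 1702.16, 2177.74,
                 2643.91, 2605.88, 4258.39])
income = np.array([19040, 20989, 23241, 20813,
                   23222, 29382, 37490], dtype=float)

# ASSUMPTION: regressors are an intercept, 1/fuel price and per capita income.
X = np.column_stack([np.ones_like(cars), 1.0 / fuel, income])
coef, *_ = np.linalg.lstsq(X, cars, rcond=None)

# R2 from the residual and total sums of squares.
fitted = X @ coef
r2 = 1 - np.sum((cars - fitted) ** 2) / np.sum((cars - cars.mean()) ** 2)


def predict_sales(fuel_price: float, per_capita_income: float) -> float:
    """Forecast cars sold from a fuel price (Rs per barrel) and an income level."""
    return coef[0] + coef[1] / fuel_price + coef[2] * per_capita_income


print("intercept, b1 (1/fuel price), b2 (income):", coef)
print("R2:", round(r2, 4))
# Hypothetical inputs, for illustration only.
print("forecast:", round(predict_sales(4000.0, 38000.0)))
```

With only seven observations and two predictors, the fit leaves four residual degrees of freedom, so forecasts from this model should be read as rough indications.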
Executive Summary:
The model above gives an idea of the expected sale of Toyota cars next year. Fuel price and per
capita income are taken as the independent variables, and it is easy to obtain expected values
for both. Putting those values into the model gives the expected sale of Toyota cars next year.
The model assumes that the sale of Toyota cars depends only on these two variables, which may or
may not be true. A further limitation is that the model is applicable only in India.