Cheat Sheet
Parameter Estimates
α = true (population) intercept; a = sample estimate of the intercept
β = true (population) slope; b = sample estimate of the slope
ε = true (population) error; e = sample residual
True regression equation (for the total population): Y = α + βX + ε
Estimated regression equation (for the sample): Ŷ = a + bX, with residual e = Y − Ŷ

Centering
R²
% of the variance in Y that is explained by X
Measures goodness of fit
R² = Regression Sum of Squares (RSS) / Total Sum of Squares (TSS) = Σ(ŷi − ȳ)² / Σ(yi − ȳ)²
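A minimal numpy sketch of this ratio on made-up toy data (the numbers and variable names are only illustrative):

```python
import numpy as np

# Toy data, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# Fit y = a + b*x by least squares.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

rss = np.sum((y_hat - y.mean()) ** 2)   # Regression Sum of Squares
tss = np.sum((y - y.mean()) ** 2)       # Total Sum of Squares
r_squared = rss / tss                   # share of the variance in Y explained by X
print(round(r_squared, 3))
```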
Problems with R²:
1) Not a measure of the magnitude of the relationship between X & Y
2) Dependent on the standard deviations of X & Y, so it cannot be compared across samples
3) Biased in small samples

Bias and Efficiency
1. Biased slope estimate: over an infinite number of samples, the estimate will not equal the true population value
2. Want efficient estimators: unbiased estimators with the least variance; as sample size increases, the variance usually decreases
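A small simulation sketch of both points, under an assumed data-generating process (the true intercept and slope of 1 and 2, the normal errors, and the sample sizes are all assumptions chosen for illustration): averaged over many samples the OLS slope sits near the true value, and its sampling variance shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
true_a, true_b = 1.0, 2.0           # assumed population intercept and slope

def ols_slope(n):
    x = rng.normal(size=n)
    y = true_a + true_b * x + rng.normal(size=n)   # population model with error
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

for n in (20, 200):
    slopes = np.array([ols_slope(n) for _ in range(2000)])
    # Mean sits near true_b (unbiased); variance is smaller for larger n.
    print(n, round(slopes.mean(), 3), round(slopes.var(), 4))
```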
Forcing the Intercept Through the Origin
Used if theory predicts that when X = 0, Y should = 0
Not a good idea:
1) Changes the slope (the strength of the relationship)
2) Can't test H0: α = 0
3) Won't work if you really have a curvilinear relationship
4) Maybe it doesn't make sense to talk about the line at all if X = 0
5) You may have a bad sample, and forcing the line makes it appear significant
6) If you force the line you deny yourself the chance to see if there is something wrong with the model and whether the model actually predicts an intercept of 0
7) The costs of leaving a in are minor compared to taking it out
8) R², slope, and intercept all change; difficult to interpret

Functional Transformations of the Independent Variable
Used if there is a non-linear relationship between X & Y; examples: log(X), √X

Standardized Estimates (Beta Weights)
The change in Y, in standard deviation units, brought about by a one standard deviation unit change in X
Units are lost in the transformation (now in std dev units)
Hard to convey meaning to the reader (because of the std dev units)
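A sketch of the beta-weight computation on made-up toy data (the numbers are only illustrative): rescale b by the ratio of standard deviations, or equivalently regress the z-scored variables on each other.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])   # toy predictor
y = np.array([5.0, 9.0, 11.0, 16.0, 19.0, 24.0]) # toy outcome

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_weight = b * x.std(ddof=1) / y.std(ddof=1)  # slope in std dev units

# Same number from regressing the z-scored variables.
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
beta_check = np.sum(zx * zy) / np.sum(zx ** 2)
print(round(beta_weight, 3), round(beta_check, 3))
```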
[Decision flowchart with nodes "Are they severe?" (No → Stop) and "Explain as anomalies?" (Yes → Delete, explain in footnote).]
Multicollinearity (MC)
X1 can be predicted if the values of X2 and X3 are known
Can't tell which variable is actually having an impact
It exists in degrees, and its magnitude determines whether or not it is a problem
Inflates the standard errors: variables look less significant than they really are
Diagnose it using VIF (variance inflation factor) scores: scores of 4 or 5 are usually the cut-off point for problems; higher scores are problematic
A high auxiliary R² (> .75) indicates high MC: you can explain a lot of X1 with the other variables
What to do about it:
  Get more data; pool data (but that is problematic)
  Combine variables (ex: socioeconomic status combines education, income, and occupational prestige, which on their own tend to be highly correlated)
  Drop one X: don't do this; you only added it because it was theoretically important, and if you drop a relevant right-hand-side (RHS) variable you get biased parameter estimates
  It is acceptable to run 2 models: 1) with all variables, 2) with some variables dropped, showing that the model could be misestimating; you are giving full information (important)

Standard Error of Parameter Estimate
se(b) = √[ (Σei² / (n − (k+1))) / Σ(xi − x̄)² ]

Variance Inflation Factor (VIF)
VIF = 1 / (1 − aux R²)

Finding the Standard Error
Standard error of the regression: √[ Σei² / (n − (k+1)) ]
se(b) = (standard error of the regression) / √[ Σ(xi − x̄)² ]

Miscellaneous Info
Adjusted R²; Standard Error of the Regression
Sample size is the total df from the ANOVA table + 1
If the std error of the estimate is inflated, t will drop and p goes up: you keep H0 when it should be rejected
If the std error of the estimate is deflated, t will go up and p goes down: you reject H0 when it should be kept
b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
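A numpy sketch tying the slope, standard-error, and VIF formulas above together on simulated data (the data-generating process, the sample size, and the second collinear predictor are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 1                          # observations, number of regressors

x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # assumed data-generating process

# Slope and intercept.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# Standard error of the parameter estimate.
e = y - (a + b * x)
sigma2_hat = np.sum(e ** 2) / (n - (k + 1))       # Σei²/(n-(k+1))
se_b = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))

# VIF from an auxiliary regression: here x regressed on an assumed collinear companion x2.
x2 = x + rng.normal(scale=0.5, size=n)
b_aux = np.sum((x2 - x2.mean()) * (x - x.mean())) / np.sum((x2 - x2.mean()) ** 2)
x_hat = x.mean() + b_aux * (x2 - x2.mean())
aux_r2 = np.sum((x_hat - x.mean()) ** 2) / np.sum((x - x.mean()) ** 2)
vif = 1.0 / (1.0 - aux_r2)
print(round(b, 3), round(se_b, 3), round(vif, 2))
```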
Missing Values

Heteroskedasticity
Inflated standard error (you conclude there is less significance): if the observations with small var(e) are located away from mean X, the std error is too large and you are underconfident that b ≠ 0
Deflated standard error (you conclude there is more significance): if the observations with large var(e) are located away from mean X, the reported std error will be too low
How to diagnose it: scatter plot of the residuals; Goldfeld-Quandt; Glejser; White's test (all below)
[Figure: example residual plots labeled "Autocorrelation" and "Heteroskedasticity", showing the residual variance patterns]

Autocorrelation
Residuals are correlated; usually happens with time-series data; can inflate/deflate standard errors
Diagnosing:
1. Scatterplot: look for a pattern in the residuals
2. Regress residuals on the previous residuals: ei = ρ·ei-1 + νi, looking for a significant ρ (see the sketch after the table)
How much bias is indicated?
ρ     % bias induced
.0    0%
.2    3%
.5    8%
.8    19%
.9    29%
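A sketch of step 2 under an assumed AR(1) process (the true ρ of 0.5 and the simulated series are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho = 200, 0.5                      # assumed sample size and true ρ

# Simulate AR(1) residuals: e_i = ρ·e_{i-1} + ν_i.
e = np.zeros(n)
for i in range(1, n):
    e[i] = rho * e[i - 1] + rng.normal()

# Regress e_i on e_{i-1} (no intercept, as in the diagnostic equation).
current, lagged = e[1:], e[:-1]
rho_hat = np.sum(lagged * current) / np.sum(lagged ** 2)
print(round(rho_hat, 2))               # should sit near the assumed ρ of 0.5
```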
3. Durbin-Watson scores: d̂ = 2 − 2ρ̂, so d̂ runs from 0 (ρ̂ = 1) through 2 (ρ̂ = 0, no AC) to 4 (ρ̂ = −1)
[Figure: the Durbin-Watson scale from 0 to 4; values near 2 → no AC, fail to reject; values near 0 or 4 → AC, reject]
Durbin-Watson in practice: save it in SPSS, look up the critical value

Goldfeld-Quandt (heteroskedasticity diagnostic)
Order the observations by the suspect x
Throw out the middle observations
Run 2 models using all the original x's
F = mean residual SS1 / mean residual SS2
Limitations: can't check the middle observations; only works for linear relationships
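A sketch of the Goldfeld-Quandt steps on simulated data (the rising error variance, the one-third split, and the sample size are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 90

# Assumed heteroskedastic data: error variance grows with the suspect x.
x = np.sort(rng.uniform(1.0, 10.0, size=n))
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x, size=n)

def mean_resid_ss(xs, ys):
    b = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)
    a = ys.mean() - b * xs.mean()
    e = ys - (a + b * xs)
    return np.sum(e ** 2) / (len(xs) - 2)          # Σei²/(n-(k+1)) with k = 1

# Order by the suspect x, drop the middle third, fit the two ends separately.
low_x, low_y = x[:n // 3], y[:n // 3]
high_x, high_y = x[-(n // 3):], y[-(n // 3):]
F = mean_resid_ss(high_x, high_y) / mean_resid_ss(low_x, low_y)
print(round(F, 2))    # well above 1 suggests heteroskedasticity; compare to an F table
```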
F-test limitations
Does not tell you which variable is significant, just that something is
If you have an insignificant F, all the b's are insignificant

Glejser (heteroskedasticity diagnostic)
Save the residuals
Run a new regression with the absolute residual as the DV and the suspect x as the only IV
If the new parameter estimate is significant, you have heteroskedasticity: the suspect x can predict the residuals
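A sketch of the Glejser steps on simulated data (the data-generating process is assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100

# Assumed data with error variance rising in x (the "suspect" variable).
x = rng.uniform(1.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.4 * x, size=n)

# First-stage OLS, then save the absolute residuals.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
abs_e = np.abs(y - (a + b * x))

# Glejser step: |e| as the DV, the suspect x as the only IV; test the slope.
g = np.sum((x - x.mean()) * (abs_e - abs_e.mean())) / np.sum((x - x.mean()) ** 2)
resid2 = abs_e - (abs_e.mean() - g * x.mean()) - g * x
se_g = np.sqrt((np.sum(resid2 ** 2) / (n - 2)) / np.sum((x - x.mean()) ** 2))
print(round(g / se_g, 2))   # a large t suggests x predicts the residuals (hetero)
```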
White's test
Save the residuals
Regress the squared residuals on all the IVs, their squares (but not dummies), and all potential interactions
n·R² from that regression is distributed χ², with df = # of regressors
Limitations: could be overkill; doesn't tell you where the problem is; and since you have so many variables, some could be randomly significant, so you could diagnose a problem when there really isn't one

Logit
OLS is not for a dichotomous DV
OLS predictions can be > 1 or < 0, which are not options; the actual values are only 0 & 1
It induces heteroskedasticity: the residuals are clustered in the middle, but there are no actual values there
Choice functions tend to be S-shaped
You can't model a probability as a straight line; if you do, you misspecify the model and bias the parameter estimates
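A sketch of why an S-shaped choice function is used (the coefficients below are made up): the logistic curve keeps predicted probabilities between 0 and 1, where a straight line does not.

```python
import numpy as np

# Assumed coefficients for illustration.
a, b = -3.0, 1.5
x = np.linspace(-2.0, 6.0, 9)

linear_prob = a + b * x                            # straight line: escapes [0, 1]
logit_prob = 1.0 / (1.0 + np.exp(-(a + b * x)))    # S-shaped: always in (0, 1)

for xi, lp, pp in zip(x, linear_prob, logit_prob):
    print(f"x={xi:5.2f}  linear={lp:6.2f}  logit={pp:5.3f}")
```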
WLS
Assumes that the best information in the data is in the observations with the least variance in their error terms
Weights some observations more than others
Divide everything by the hetero variable
Pure heteroskedasticity should not bias the parameter estimates, but it could be an indication of measurement/specification problems, and correcting for it could bias the parameter estimates
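A sketch of the divide-by-the-hetero-variable idea on simulated data (here the error standard deviation is assumed proportional to a variable z, so dividing every term by z is the weighting):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200

# Assumed setup: error standard deviation proportional to z, the "hetero variable".
z = rng.uniform(1.0, 5.0, size=n)
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=z, size=n)

# WLS here = divide every term (including the constant) by z, then run OLS.
X = np.column_stack([np.ones(n) / z, x / z])
coef, *_ = np.linalg.lstsq(X, y / z, rcond=None)
print(np.round(coef, 2))   # roughly the assumed intercept 1 and slope 2
```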