STATA Training Session 2
General Linear Regression
$\hat{\beta} = (X'X)^{-1} X'Y$: the least squares estimates of the regression coefficients, where $X$ is $n \times p$ and $Y$ is $n \times 1$.
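The least squares formula can be checked by hand in Mata. This is a minimal sketch that uses the auto dataset shipped with Stata as a stand-in (an assumption; it is not the course data), with price, mpg and weight chosen only for illustration.

```stata
* Stand-in data: the auto dataset that ships with Stata (not the course data)
sysuse auto, clear
regress price mpg weight

mata:
y = st_data(., "price")
// design matrix: the two regressors plus a column of ones for the intercept
X = st_data(., ("mpg", "weight")), J(st_nobs(), 1, 1)
// least squares estimates (X'X)^(-1) X'Y
b = invsym(X'X) * X'y
// print as a row; order is mpg, weight, constant
b'
// copy the mpg coefficient back to Stata for comparison
st_numscalar("b_mpg", b[1])
end

display "regress: " _b[mpg] "   by hand: " scalar(b_mpg)
```

The two displayed numbers should agree up to rounding, since regress solves the same normal equations.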
Step 1: Examine data
graph matrix ret mktrf smb hml rf umd
[Figure: scatterplot matrix of ret, mktrf (excess return on the market), smb (small-minus-big return), hml (high-minus-low return), rf (risk-free return rate: one-month Treasury bill rate) and umd (momentum factor)]
General Linear Regression
Step 2: Perform Linear Regression
regress ret mktrf smb hml rf umd year
regress: to perform linear regression
sw regress ret mktrf smb hml rf umd year, pe(0.05)
sw: to perform stepwise regression
pe(0.05): to specify the significance level of the F-test for addition to the model; terms
with a p-value less than 0.05 are eligible for inclusion.
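In Stata 10 and later the same selection is written with the stepwise prefix. The sketch below assumes the factor data are already in memory; the backward-elimination threshold of 0.10 is a hypothetical value chosen for illustration.

```stata
* Forward selection: add terms whose p-value is below 0.05
stepwise, pe(0.05): regress ret mktrf smb hml rf umd year

* Backward elimination: remove terms whose p-value is 0.10 or above
* (0.10 is an illustrative threshold, not taken from the course material)
stepwise, pr(0.10): regress ret mktrf smb hml rf umd year
```

Forward and backward selection need not end at the same model, so it is worth comparing the two sets of retained terms.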
General Linear Regression
Step 3: Post-estimation Statistics
vif //variance inflation factors
rvfplot //plot residuals against fitted values
predict fit //store fitted values
predict sdres, rstandard //store standardized residuals
pnorm sdres //standardized normal probability plot of the standardized residuals
twoway scatter sdres fit //plot standardized residuals against fitted values
predict cook, cooksd //store Cook's distance statistics
//list the observations whose Cook's distance is above the suggested cut-off point (4/n, here n = 108)
list year ret cook if cook>4/108
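The cut-off need not be typed by hand. A minimal sketch, assuming the full regression is refitted first: e(N) holds the estimation sample size, and the !missing() condition matters because Stata treats missing values as larger than any number, so they would otherwise pass the `>` test.

```stata
regress ret mktrf smb hml rf umd year
capture drop cook                 // remove an earlier copy, if any
predict cook, cooksd              // Cook's distance for each observation
* flag observations above 4/n, excluding missing values of cook
list year ret cook if cook > 4/e(N) & !missing(cook)
```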
[Figure: rvfplot output, Residuals against Fitted values]
[Figure: pnorm output, Normal F[(sdres-m)/s] against Empirical P[i] = i/(N+1)]
General Linear Regression
Exercise 2
1. Repeat the analysis described in this section after removing the possible
outliers identified by Cook's distance.
2. After finishing Q1, repeat the analysis but treat the variable year as
categorical.
Hint: use the command
xi: sw regress ret mktrf smb hml rf umd i.year, pe(0.05)
Logistic Model
Binary logistic model: dichotomous response outcomes
e.g.: presence or absence of an event
Ordinal logistic model: ordinal response variable with more than two
ordered categories
e.g.: a 5-point Likert scale
Multinomial logistic model: nominal response variable with more
than two categories
e.g.: different types of programs in school
Binary Logistic Regression
General Form of Model
$$\pi_i = E(y_i \mid x_i)$$
$$\text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}$$
$$\pi_i = \frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)}$$
$\exp(\beta_k)$ is the odds ratio of $y = 1$ when $x_k$ increases by one unit and all other
covariates remain the same.
Binary responses are typically coded as 1 for the event of interest, and 0 for the
opposite event.
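The odds-ratio interpretation of exp(beta_k) can be verified numerically. The sketch below uses hypothetical coefficients (b0 = -1, b1 = 0.5), chosen only for illustration, and compares the ratio of the odds at x = 1 and x = 0 with exp(b1).

```stata
* Hypothetical coefficients, for illustration only
scalar b0 = -1
scalar b1 = 0.5

* Probabilities at x = 0 and x = 1 from the inverse logit
scalar pi0 = invlogit(b0)
scalar pi1 = invlogit(b0 + b1)

* Ratio of the odds at x = 1 to the odds at x = 0
scalar odds_ratio = (pi1/(1 - pi1)) / (pi0/(1 - pi0))

display "odds ratio from probabilities: " odds_ratio
display "exp(b1):                       " exp(b1)
```

Both lines display the same value, whatever intercept is chosen, which is why the odds ratio does not depend on the levels of the other covariates.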
Binary Logistic Regression
Description of Data
Aim: identify people with a high chance of defaulting on a bank loan. We have
700 records from the bank database (bankloan.csv).
Variable name   Variable information
age             Age in years
ed              Level of education
                1 = didn't complete high school; 2 = high school degree;
                3 = college degree; 4 = undergraduate; 5 = postgraduate
employ          Years with current employer
address         Years at current address
income          Household income in thousands
debtinc         Debt to income ratio (*100)
creddebt        Credit card debt in thousands
othdebt         Other debts in thousands
default         Previously defaulted (1 = Yes; 0 = No)
Binary Logistic Regression
Step 1: Import and examine data
insheet using bankloan.csv
d
browse
codebook default
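From Stata 13 onward, insheet has been superseded by import delimited. A minimal sketch of the same import step, assuming bankloan.csv is in the current working directory:

```stata
* Read the comma-separated file, replacing any data in memory
import delimited using bankloan.csv, clear
describe          // the unabbreviated form of d
count             // number of records read
```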
tabstat age employ address income debtinc creddebt othdebt, by(default)
table ed, c(mean income mean age mean debtinc mean creddebt mean othdebt) by(default)
Binary Logistic Regression
Step 2: Construct logistic model
logistic default age ed employ income address
estimates store model1
logistic default age ed employ income address debtinc creddebt othdebt
lrtest model1 .
sw logit default age address employ income debtinc creddebt othdebt, pe(0.05)
logistic: reports odds ratios.
logit: reports parameter coefficients.
estimates store: saves the current estimation results (log likelihood
and all the estimates) under a name.
lrtest: reports the p-value of the likelihood-ratio test.
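The likelihood-ratio comparison can also be written with both models named, which avoids relying on `.` for the most recent fit. A sketch assuming the bank loan data are in memory; the names reduced and full are arbitrary labels.

```stata
* Reduced model: demographic variables only
logit default age ed employ income address
estimates store reduced

* Full model: adds the three debt variables
logit default age ed employ income address debtinc creddebt othdebt
estimates store full

* Likelihood-ratio test of the reduced model against the full model
lrtest reduced full
display "LR chi2 = " r(chi2) ", df = " r(df) ", p = " r(p)
```

Adding the `or` option to logit displays odds ratios, matching the output of logistic.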
Binary Logistic Regression
Step 3: Post-estimation statistics
predict prob
predict resi, rstandard
hist resi
estat gof
estat gof: goodness-of-fit test
[Figure: histogram of resi; x-axis: standardized Pearson residual, y-axis: Density]
estat classification
The output gives a summary of correct predictions, a summary of incorrect
predictions, and the overall success rate.
These are calculated using 50% as the cut-off point for positive
predictions.
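The cut-off can be changed with the cutoff() option. A sketch assuming the four-covariate model used elsewhere in this section; the value 0.3 is a hypothetical choice for illustration.

```stata
logit default address employ debtinc creddebt
estat classification                  // default cut-off of 0.5
estat classification, cutoff(0.3)     // illustrative lower cut-off
```

Lowering the cut-off classifies more people as likely defaulters, which raises sensitivity at the cost of specificity.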
gen z=_b[debtinc]*debtinc+_b[employ]*employ+_b[creddebt]*creddebt+_b[address]*address
line prob z, sort
[Figure: Pr(default) plotted against z]
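The linear predictor can also be obtained directly from predict, which includes the constant term and does not require typing the selected variables by hand. A sketch assuming the same four-covariate model; phat and xb are hypothetical variable names.

```stata
logit default address employ debtinc creddebt
predict phat            // predicted probability of default
predict xb, xb          // linear predictor, including _b[_cons]
line phat xb, sort      // logistic curve: phat equals invlogit(xb)
```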
gen empcat=employ>5
logit default address empcat debtinc creddebt
postgr3 debtinc, by(empcat) //you need to install postgr3 package
[Figure: predicted values against debtinc, one curve for empcat == 0 and one for empcat == 1]
postgr3: graphs the predicted values, holding all other variables
constant at specified values (the default is the mean).
The marginal impact of debtinc on the predicted probability is higher for
people with short service (empcat == 0) than for those with long service
(empcat == 1) in their current company.
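A similar graph can be drawn with the built-in margins and marginsplot commands (Stata 12 and later), without installing postgr3. This is a sketch, not a reproduction of the postgr3 output; the grid 0(5)40 for debtinc is an assumption based on the axis range of the figure.

```stata
capture drop empcat
generate byte empcat = employ > 5 if !missing(employ)
logit default address empcat debtinc creddebt

* Predicted probabilities over a grid of debtinc for each empcat,
* holding address and creddebt at their means
margins, at(debtinc=(0(5)40) empcat=(0 1)) atmeans
marginsplot, xdimension(debtinc)
```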
Binary Logistic Regression
Exercise 3
1. Explore the use of the commands lroc and lsens to diagnose the model, and interpret
the results.
lroc: graphs the ROC curve and calculates the area under the curve.
lsens: graphs sensitivity and specificity versus probability cutoff.
2. Predict the probability of default on a bank loan for a person with a
debt/income ratio of 22.7, 2 years with the current employer, 16 years living at the
current address, and 1.21 thousand in credit card debt.
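One possible way to approach both questions is sketched below. It assumes a model with only the four covariates named in Q2; the fitted model in your own analysis may differ.

```stata
logit default debtinc employ address creddebt

lroc        // ROC curve and area under the curve
lsens       // sensitivity and specificity against the probability cut-off

* Q2 by hand: linear predictor at the given values, then the inverse logit
scalar xb_q2 = _b[_cons] + _b[debtinc]*22.7 + _b[employ]*2 ///
    + _b[address]*16 + _b[creddebt]*1.21
display "Pr(default) = " invlogit(scalar(xb_q2))

* Q2 with margins: the same prediction, with a standard error
margins, at(debtinc=22.7 employ=2 address=16 creddebt=1.21)
```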
Ordinal Logistic Model
General Form of Model
$$\text{Logit}(p_1) = \log\left(\frac{p_1}{1-p_1}\right) = \beta_{10} + \beta'x$$
$$\text{Logit}(p_1+p_2) = \log\left(\frac{p_1+p_2}{1-(p_1+p_2)}\right) = \beta_{20} + \beta'x$$
$$\vdots$$
$$\text{Logit}(p_1+p_2+\dots+p_k) = \log\left(\frac{p_1+p_2+\dots+p_k}{1-(p_1+p_2+\dots+p_k)}\right) = \beta_{k0} + \beta'x$$
and $p_1 + p_2 + \dots + p_k + p_{k+1} = 1$.
$\exp(\beta_k)$ represents the odds ratio of $y > a_s$, for any $s$, when $x_k$ increases by one unit and all
other covariates remain the same.
Ordered responses with k categories can be formulated as a threshold model.
Ordinal Logistic Model
Construct model
recode income (min/20=1 "<20") (20/30=2 "20-29") (30/40=3 "30-39") (40/50=4 "40-49") ///
    (50/max=5 "above 50"), generate(inccat)
codebook inccat
xi: ologit inccat age i.ed employ debtinc, or
listcoef, help
ologit: to perform ordered logistic
regression.
listcoef: to obtain odds ratios and the change
in odds for a one-standard-deviation change in each variable.
xi: omodel logit inccat age i.ed employ debtinc
brant, detail
Testing the parallel regression assumption
(proportional odds assumption):
omodel: to perform a likelihood-ratio
test.
brant: to perform the Brant test.
prtab employ //predicted probabilities for each of the values of the variable specified
prvalue, x(_Ied_2=1) //predicted probabilities for selected values of variables
prvalue, x(_Ied_2=1 age=28 employ=3 debtinc=10)
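In Stata 11 and later, factor-variable notation makes the xi: prefix unnecessary, and margins gives predicted probabilities without user-written commands. A sketch assuming inccat has been created as above; outcomes 1 and 5 are picked only as examples.

```stata
ologit inccat age i.ed employ debtinc, or

* Predicted probability of the lowest and highest income categories
* for ed = 2, age = 28, employ = 3, debtinc = 10
margins, at(ed=2 age=28 employ=3 debtinc=10) predict(outcome(1))
margins, at(ed=2 age=28 employ=3 debtinc=10) predict(outcome(5))
```

Here ed=2 plays the role of _Ied_2=1 in the prvalue call.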
Multinomial Logistic Model
xi: mlogit inccat age i.ed employ debtinc
listcoef
fitstat
prtab _Ied_2
predict p1 p2 p3 p4 p5
summarize p1 p2 p3 p4 p5
sort employ
twoway connect p1 p5 employ, msym(i i)
[Figure: Pr(inccat==1) and Pr(inccat==5) plotted against employ]
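Two options are often useful with mlogit: baseoutcome() sets the reference category and rrr reports relative-risk ratios instead of coefficients. The sketch below assumes all five income categories are observed; the names q1 to q5 and qsum are hypothetical, and the final step checks that the predicted probabilities sum to one.

```stata
mlogit inccat age i.ed employ debtinc, baseoutcome(1) rrr

* One predicted probability per outcome category
predict q1 q2 q3 q4 q5

* The five probabilities should sum to 1 for every observation
egen qsum = rowtotal(q1 q2 q3 q4 q5)
summarize qsum if !missing(q1)
```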
Logistic Model
Exercise 4
1. Try to construct probit models.
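A possible starting point for the exercise, assuming the same variables as in the logistic sections: each probit command mirrors its logit counterpart and replaces the logistic link with the standard normal one.

```stata
* Binary probit, counterpart of logit
probit default address employ debtinc creddebt

* Ordered probit, counterpart of ologit
oprobit inccat age i.ed employ debtinc

* Multinomial probit, counterpart of mlogit (can be slow to fit)
mprobit inccat age i.ed employ debtinc
```

Probit coefficients are on a different scale from logit coefficients, so compare the models through predicted probabilities rather than raw coefficients.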
End