Stata Tutorial 13 v2 0
by Manfred W. Keil
to Accompany
Introduction to Econometrics
by James H. Stock and Mark W. Watson
-----------------------------------------------------------------------------------------------------------------
1. STATA: INTRODUCTION
2. CROSS-SECTIONAL DATA
Interactive Use: Data Input and Simple Data Analysis
a) The Easy and Tedious Way: Manual Data Entry
b) Summary Statistics
c) Graphical Presentations
d) Simple Regression
e) Entering Data from a Spreadsheet
f) Importing Data Files directly into STATA
g) Multiple Regression Model
h) Data Transformations
Batch (Do-Files)
4. FINAL NOTE
-----------------------------------------------------------------------------------------------------------------
1. STATA: INTRODUCTION
This tutorial will introduce you to a statistical and econometric software package called
STATA. The tutorial is an introduction to some of the most commonly used features in
STATA. These features were used by the authors of your textbook to generate the statistical
analysis reported in Chapters 3-9 (Stock and Watson, 2015). The tutorial provides the
necessary background to reproduce the results of Chapters 3-9 and to carry out related
exercises. It does not cover panel data (Chapter 10), binary dependent variables (Chapter 11),
instrumental variable analysis (Chapter 12), or time-series analysis (Chapters 14-16).
The most current professional version is STATA 13. Both STATA 12 and STATA 13 are
sufficiently similar so that those who only have access to STATA 12 can also use this tutorial.
As with many statistical packages, newer versions of a program allow you to use more
advanced and recently developed techniques that you, as a first time user, most likely will not
encounter in a first course of statistics or econometrics. There are several versions of STATA
12, such as STATA/IC, STATA/SE, and STATA/MP. The difference is basically in terms of
the number of variables STATA can handle and the speed at which information is processed.
Most users will probably work with the Intercooled (IC) version.
STATA runs on Windows (2000, 2003, XP, Vista, Server 2008, or Windows 7), Mac, and Unix
platforms. It is produced by StataCorp in College Station, TX. You can read
about the various products at the firm's Web site, www.stata.com. There are 20
manuals that can be purchased with STATA 13, although subsets can be bought separately.
Perhaps the most useful of these are the User's Guide and the Base Reference Manual, which
can simply be downloaded. You can order STATA by calling (800) 782-8272 or by filling out
a form at www.stata.com/order/quote-request/student/. In addition, if you purchase the Student
Version, you can acquire STATA at a steep discount. Prices vary, but you could get a
perpetual license for STATA/IC for $189, or a six-month license for as low as $69 (a
business/single user pays $1,695 to purchase STATA). There is even a free 30-day
evaluation copy of STATA.
Econometrics deals with three types of data: cross-sectional data, time series data, and panel
(longitudinal) data (see Chapter 1 of Stock and Watson (2015)). In a cross-section you
analyze data from multiple entities at a single point in time. In a time series you observe the
behavior of a single entity over multiple time periods. This can range from high frequency data
such as financial data (hours, days); to data observed at somewhat lower (monthly)
frequencies, such as industrial production, inflation, and unemployment rates; to quarterly data
(GDP) or annual (historical) data. One big difference between cross-sectional and time series
analysis is that the order of the observation numbers does not matter in cross-sections. With
time series, you would lose some of the most interesting features of the data if you shuffled the
observations. Finally, panel data can be viewed as a combination of cross-sectional and time
series data, since multiple entities are observed at multiple time periods. STATA allows you to
work with all three types of data.
STATA is most commonly used for cross-sectional and panel data in academia, business, and
government, but you can also work with it relatively easily when you analyze time-series data.
STATA allows you to store results within a program and to retrieve these results for further
calculations later. Remember how you calculated confidence intervals in statistics, say for a
population mean? Basically you needed the sample mean, the standard error, and some value
from a statistical table. In STATA, you can calculate the mean and standard deviation of a
sample and then temporarily store these. You then work with these numbers in a standard
formula for confidence intervals. In addition, STATA provides the required numbers from the
relevant distribution (normal, $\chi^2$, F, etc.).
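To make the idea of stored results concrete, here is a minimal sketch of a 95% confidence interval for a population mean. It is only an illustration: it assumes the example data set auto.dta that ships with STATA and its variable mpg, neither of which is used elsewhere in this tutorial, and the scalar names are my own.

```stata
* Sketch: 95% confidence interval for a mean, built from stored results.
* Assumption: auto.dta (shipped with STATA) and its variable mpg.
sysuse auto, clear
quietly summarize mpg
* summarize leaves its results in r(); copy them into scalars
scalar xbar  = r(mean)
scalar se    = r(sd)/sqrt(r(N))
* critical value from the t distribution with N-1 degrees of freedom
scalar tcrit = invttail(r(N)-1, 0.025)
scalar ci_lo = xbar - tcrit*se
scalar ci_hi = xbar + tcrit*se
display "95% confidence interval: " ci_lo " to " ci_hi
```

The same pattern (run a command, pick up r(mean), r(sd) and r(N), then apply the textbook formula) should carry over to any variable in your own data.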
While STATA is truly interactive, you will run a program sooner rather than later in a
batch mode.
Interactive use: you type a STATA command in the STATA Command Window (see
below) and hit the Return/Enter key on your keyboard. STATA executes the command
and the results are displayed in the STATA Results Window. Then you enter the next
command, STATA executes it, and so forth, until the analysis is complete. Even the
simplest statistical analysis typically will involve several STATA commands.
Batch mode: all of the commands for the analysis are listed in a file, and STATA is told
to read the file and execute all of the commands. These files are called Do-Files and are
saved using a .do suffix.
In the good old days the equivalent of writing a Do-File was to submit a batch of cards, each
card containing a single command (now line), to a technician, who would use a card reader to
enter these into the computer. The computer would then execute the sequence of statements.
(You typically stored this batch of cards in a filing cabinet, with a rubber band around it,
and the batch was referred to as a file or deck of cards.) While you will work at first in
interactive mode by clicking on buttons or
writing single line commands, you will very soon discover the advantage of running your
regressions in batch mode. This method allows you to see the history of commands, and you
can also analyze where exactly things went wrong if there are problems (errors) with any of
your commands. This tutorial will initially explain the interactive use of STATA since it is
more intuitive. However, we will switch as soon as it makes sense into the batch mode and you
should seriously try to do your research/class work using this mode (Do-Files).
STATA produces highly professional-looking graphs and charts. However, it requires some
practice to generate these. A separate manual (Graphics) is devoted solely to this topic. Since
STATA works in a Windows format, it allows you to cut and paste the data into other
Windows-based programs, such as Word or WordPerfect.
Finally, there is a warning about the limitations of this tutorial. The purpose is to help you gain
an initial understanding of how to work with STATA. I hope that the tutorial looks less
daunting than the manuals. However, it cannot replace the accompanying manuals, which you
will have to consult for more detailed questions (alternatively use Help within the program).
Feel free to provide me with feedback on how the tutorial can be improved for future
generations of students (mkeil@claremontmckenna.edu). Colleagues of mine and I have
decided to set up a Wiki run by students but supervised by faculty at my academic
institution. We have found that the wisdom of crowds often produces valuable information
for those who follow. This is, of course, just a suggestion. Finally you may want to think about
working with statistical software as learning a new language: practicing it routinely will result
in improvement. If you set it aside for too long, you will only remember the most important
lines but will forget the important details. Another danger of tutorials like this is that you
simply follow the instructions and when you are done, you do not remember the commands. It
is therefore a good idea to keep a separate sheet and to write down commands and examples of
them if you think you will use them later. I will give you short exercises so that you can
practice the commands on your own.
2. CROSS-SECTIONAL DATA
Interactive Use: Data Input and Simple Data Analysis
Let's get started. Click on the STATA icon to begin your session, or choose STATA 13 from
your START window. Once you have started STATA, you will see a large window containing
several smaller windows. At this point you can load a data set or enter data (described below)
and begin the statistical analysis.
testscr    str     school
606.8      19.5    1
631.1      20.1    2
631.4      21.5    3
631.8      20.1    4
631.9      20.4    5
632        22.4    6
632        22.9    7
638.5      19.1    8
638.7      20.2    9
639.3      19.7    10
After entering the data, double-click the grey box at the top of the first column. This will
cause the following box to appear at the bottom right of your screen:
In the Name box, replace var1 with the name of the first column variable, here testscr. In the
Label box, you may want to enter information that helps you remember how the data was
created originally or as information for others who may subsequently work with your data. I
suggest you enter here
Avg test score (=(read_scr+math_scr)/2)
Do a similar operation for the second column, that is, rename var2 as str. Similarly, you could
enter as the label for str
Student teacher ratio (enrl_tot/teachers)
Finally, call the third column school.
After completing this task, the Data Editor screen should look as follows:
Next close the box. Note that your commands to edit the data now appear in the Results Box,
your command to edit is listed in the Command Box, and your newly created variables are
shown in the variable list on the upper right-hand side:
Entering data in this way is very tedious, and you will make data input errors frequently. You
will see below how to enter data directly from a spreadsheet or an ASCII file, which are the
most common forms of data you will receive in the future.
b) Summary Statistics
For the moment, let's just see if we are working with the same data set. Type in the following
command
sum testscr str, detail
sum stands for summarize and the option detail gives you a more extensive list of summary
statistics for each of the variables you have entered. These include the median and certain
percentiles of the frequency distribution. You will learn later that you can also obtain summary
statistics for a subset of your data by adding an if or in command following the variable name.
The summary statistics are explained in Chapter 2 of your textbook (for example, Kurtosis is
defined in equation (2.15) on page 25 in Stock and Watson (2015)).
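As a preview of if and in, here is a small sketch. It assumes that the ten-observation data set with the variables testscr, str and school is in memory; the cutoff of 20 and the range 1/5 are arbitrary choices for illustration.

```stata
* Assumes the ten-observation data set (testscr, str, school) is in memory.
* if restricts the command to observations that satisfy a condition
summarize testscr if str < 20, detail
* in restricts the command to a range of observation numbers
summarize testscr str in 1/5
```

Note that if and in go after the variable names but before the comma that introduces the options.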
If your summary statistics differ, then check the data again. To return to the data observations,
edit the data using the Data Editor. Once you have located the data problem, click on the
observation and change it. After correcting the problem, press the preserve button again.
After entering the data, there are various things you can do with it. You may want to keep a
hard copy of what you just entered. If so, click on the Print button. This will print the entire
output of what you have produced so far.
In general, it is a good idea to save the data and your work frequently in some form. Many of
us have learned through multiple painful experiences how easy it is to lose hours of work by
not backing up data/results in some fashion. To save the data set you created, either press the
Save button or click on File and then Save As. Follow the usual Windows format for saving
files (drives, directories, file type, etc.). If you save datasets in STATA readable format, then
you should use the extension .dta. Once you have saved your work, you can call it up the
next time you intend to use it by clicking on File and then Open. Try these operations by
saving the current workfile under the name SW13smpl.dta.
c) Graphical Presentations
Most often it is a good idea to generate graphs (pictures) to get some feel for the data. You
will be able to detect outliers which may be the result of data entry errors or you will be able to
see if the data makes sense. Although STATA offers many graphing options, we will only go
through a few commonly used ones here.1
There are three graphs that you will use most often:
histograms;
line graphs, where one or more variables are plotted across entities (these will become
more important in time series analysis when you are plotting variables over time);
scatterplots (crossplots), where one variable is graphed against another.
The purpose of histograms is to display absolute or relative frequencies for a single variable. In
general, the command is
histogram varname, percent title( )
The percent option produces relative frequencies, and the title option adds whatever name
you place between ( ) to the top of the graph.
You can either save the graph you have generated, or copy and paste it into another Windows
based document, such as Word (replacing percent with frequency would have resulted in
absolute, rather than relative, frequencies being plotted; there are other options for you to
explore, such as the number of classes (bins) to choose, etc.).
Try
histogram testscr, percent title(Testscores)
[Figure: histogram titled Testscores; Y axis: Percent; X axis: Avg test score (=(read_scr+math_scr)/2)]
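The options mentioned above can be tried out as follows. This is a sketch that assumes testscr is in memory; the choice of five bins and the file name testscores are mine and purely illustrative.

```stata
* Assumes testscr is in memory.
* frequency plots absolute counts; bin(5) asks for five classes
histogram testscr, frequency bin(5) title(Testscores)
* save the graph in STATA's own format (hypothetical file name)
graph save testscores, replace
```

Try a few different values in bin( ) to see how much the picture depends on the number of classes.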
To create a line graph in a cross section, you can add a third variable in your data set which
takes on the number of the observation (here: 1, 2, 3, ..., 10), in this case, the variable school
that we created.
Let's plot the student-teacher ratio for the first 10 observations using the scatter command. The
command is followed by the two variables you would like to see plotted, where the first one
appears on the Y axis and the second on the X axis.
scatter varname1 varname2
plots variable 1 against variable 2. Try this with the student-teacher ratio and the variable
school.
The resulting graph just gives you the data points here. There are two ways to make this more
informative. One is to connect the points by using the line command followed by the two
variable names. Alternatively, you can use the twoway connected command to have both the
points and the lines displayed.
Try both here:
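In case you get stuck, the three variants can be sketched as follows, assuming the variables str and school from the data entered earlier are in memory. The axis titles in the last command are optional and simply anticipate the graph you are asked to produce.

```stata
* Assumes str and school are in memory.
* points only
scatter str school
* lines only
line str school
* points and lines together, with axis titles added
twoway connected str school, ytitle("Student-Teacher Ratio") xtitle("School District")
```

Each command replaces the previous graph, so look at one before typing the next.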
After the graph appears, you can edit it using the Graph Editor (either use File and then Start
Graph Editor or push the Graph Editor button). Alter the graph until it looks like the one
below. Some of the alterations can be made in the resulting dialog boxes.
[Graph 1: Student-Teacher Ratio (Y axis) plotted against School District (X axis)]
Frequently you will be interested either in causal relationships between variables or in the
ability of one variable to forecast another. As a result, it is a good idea to plot two variables in
the same graph.
The first way to look for a relationship is to plot the observations of both variables. This can be
done by generalizing the command twoway connected to include more than two variable names
(one for the Y axis and one for the X axis). Try this here with
twoway connected str testscr school
The resulting graph is pretty uninformative, since test scores and student-teacher ratios are on a
different scale. You can allow for two (or more) scales by entering the following command:
twoway (scatter str school, c(l) yaxis(1)) (scatter testscr school, c(l) yaxis(2))
This command instructs STATA to use two Y axes, one for the student-teacher ratio on the left
side of the graph, and the other for test scores on the right side of the graph. You may want to
beautify the resulting graph by using the graph editor. See if you can produce something like
the graph below:
[Graph 2: Test Scores and Student-Teacher Ratio Across 10 School Districts; Y axes: Student-Teacher Ratio and Avg test score; X axis: School District]
To get an even better idea about the relationship, you can display a two-dimensional
relationship in a scatterplot (see page 92 of your Stock and Watson (2015) textbook). Given
our discussion above, you could simply use the command scatter testscr str. However, you
may want to see what a fitted line through that scatter plot would look like, in which case you
have to modify the command slightly:
scatter testscr str || lfit testscr str
where || is the key | typed twice.
This will result in the following graph (after beautification):
[Graph 3: scatterplot of Test Scores (Y axis) against Student-Teacher Ratio (X axis), with the line of Fitted values]
(Not to worry about the positive slope here. Remember, this is a sample, and a very small one
at that. After all, you may get 10 heads in 10 flips of a coin.)
d) Simple Regression
There is a commonly held belief among many parents that lower student-teacher ratios will
result in better student performance. Consequently, in California, for example, all K-3 classes
were reduced to a maximum student-teacher ratio of 20 (Class Size Reduction Act, CSR) in
the late 90s. This comes at a cost, of course. Initially, it was $1.8 billion a year. With dollar
figures as big as these (ask yourself, if you laid down a dollar bill every second, how many
years would it take to reach 1 billion?), the natural question arises whether or not it is worth it.
That is why you are analyzing the effect of reducing student-teacher ratios in Chapters 4-9 of
the Stock and Watson textbook.
For the 10 school districts in our sample, we seem to have found a positive relationship
between larger classes and student performance. Not to worry: we will soon work with all 420
observations from the California School Data Set, and we will then find the negative
relationship you have seen in the textbook. For now, we are more concerned about learning
techniques in STATA.
In the previous section, we included a regression line in the scatterplot, something that you
should have encountered towards the end of your statistics course. However, the graph of the
regression line does not allow you to make quantitative statements about the relationship; you
want to know the exact values of the slope and the intercept. For example, in general
applications, you may want to predict the effect of an increase by one in the explanatory
variable (here the student-teacher ratio) on the dependent variable (here the test scores).
To answer the questions relating to the more precise nature of the relationship between class
size and student performance, you need to estimate the regression intercept and slope. A
regression line is little else than fitting a line through the observations in the scatterplot
according to some principle. You could, for example, draw a line from the test score for the
lowest student-teacher ratio to the test score for the highest student-teacher ratio, ignoring all
the observations in between. Or you could sort the data by student-teacher ratio and split the
sample in half so that the observations with the lowest five student-teacher ratios are in one set,
and the observations with the highest five student-teacher ratios are in the other set. For each of
the two sets you could calculate the average student-teacher ratio and the corresponding
average test score, and then connect the two resulting points. Or you could just eyeball the
relationship. Some of these principles have better properties than others to infer the true
underlying (population) relationship from the given sample. The principle of Ordinary Least
Squares (OLS), for example, will give you desirable properties under certain restrictive
assumptions that are discussed in Chapter 4 of the Stock/Watson textbook.
Back to computing. If the dependent variable, Y, is only determined by a single explanatory
variable X in a linear fashion of the type
$Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, 2, \ldots, N$
with u representing the error, or random disturbance, not accounted for by the linear
equation, then the task is to find a value for $\beta_0$ and $\beta_1$. If you had values for these
coefficients, then $\beta_1$ describes the effect of a unit increase in X on Y.
Often a regression line is a linear approximation to an underlying complicated relationship and
the intercept $\beta_0$ only has a useful meaning if observations around X=0 occur in the data. As we
have seen in the scatterplot above, there are no observations around the student-teacher ratio of
zero, and it is therefore better not to interpret the numerical value of the intercept at all. Your
professor most likely will give you a serious penalty in the exam for interpreting the intercept
here because with no students present, there is no score to record. (What would be the function
of the teacher in that case?)
There are various ways to estimate the regression line. The command for regressing a variable
Y on a constant (intercept) and another variable X is:
reg Y X
where reg is short for regress.
According to these results, lowering the student-teacher ratio by one student per class results in
a decrease of 0.6 points, on average, in the district wide test score. Using the notation of your
textbook, you should display the results as follows:
$\widehat{TestScore} = 618.9 + 0.61 \times STR$, $R^2 = 0.007$, $SER = 9.8$
(51.1) (2.33)
Note that the result for the 10 chosen school districts is quite different from the sample of all
420 school districts. However, as pointed out before, this is a rather small sample and the
regression $R^2$ is quite low. As a matter of fact, in Chapter 5 of your textbook, you will learn
that the above slope coefficient is not statistically significant.
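If you want to check these numbers without the Data Editor, the following sketch enters the ten observations again, runs the regression, and displays the stored coefficients. The temporary name stratio is my own (str can also be read as a string storage type by input), and the display lines are just one way of retrieving results after regress.

```stata
* Sketch: regression of testscr on str for the ten-district sample.
clear
input testscr stratio school
606.8 19.5 1
631.1 20.1 2
631.4 21.5 3
631.8 20.1 4
631.9 20.4 5
632   22.4 6
632   22.9 7
638.5 19.1 8
638.7 20.2 9
639.3 19.7 10
end
* stratio is a temporary (hypothetical) name for the student-teacher ratio
rename stratio str
regress testscr str
* coefficients and fit statistics are stored after estimation
display "slope     = " _b[str]
display "intercept = " _b[_cons]
display "R-squared = " e(r2)
display "SER       = " e(rmse)
```

The displayed slope, intercept, R-squared and SER should be close to the values reported in the equation above.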
When you are done, you are ready to save the file. Name it caschool.dta.
You can now reproduce Equation (4.7) from the textbook. Use the regression command you
previously learned to generate the following output.
. reg testscr str, r

Linear regression                               Number of obs =     420
                                                F(  1,   418) =   19.26
                                                Prob > F      =  0.0000
                                                R-squared     =  0.0512
                                                Root MSE      =  18.581

             |               Robust
     testscr |      Coef.   Std. Err.       t    P>|t|   95% CI upper limit
-------------+--------------------------------------------------------------
         str |  -2.279808   .5194892    -4.39   0.000          -1.258671
       _cons |    698.933   10.36436    67.44   0.000           719.3057
(You can find the standard errors and the distribution of the estimators on p. 131 of the Stock
and Watson (2015) textbook. The regression $R^2$, sum of squared residuals (SSR), and standard
error of the regression (SER) are presented in Key Concept 4.3.)
separations between words, and therefore will only read the filename up until the first space or
symbol, and then considers the rest to be a separate command.
Note: In order to insheet data, there must be no data already stored in memory. To get rid of
any data that is already stored, type the command
clear
before insheeting.
Once you have insheeted your data, you should see this reflected in your Results box and your
variables should appear in your Variables List box. You can type edit to see your data in the
data editor.
To save your data as a STATA file, click on File on the upper toolbar, then select Save As.
When you save your file, make sure it is saved as a .dta file. This type of file can only be
opened in STATA. Alternatively, you can type the command
save (filename)
where (filename) is the directory location and name of your file. If you have a previous version
of this saved already, to overwrite the old version add replace after the save command. For
example:
save "C:\My Documents\test.dta", replace
(the quotation marks are needed here because the path contains a space)
If you wish to save a file that has been previously saved in the same directory location as the
previous version, you may use the command save, replace.
Note: When you save a STATA dataset, you are really only saving the dataset as it exists at the
time you chose to save. You are not retaining any of the analysis you may have conducted,
such as running regressions or testing for the statistical significance of coefficients. However,
if you have changed the data since opening the file, such as edited observations, these changes
will be reflected.
As an exercise, copy the caschool.xls or caschool.xlsx data file from the Stock and Watson
website and save the Excel file in some subdirectory on your computer as a .csv file. Then
import the data set using the insheet command. Finally run the simple regression of testscr on
str and check that your output contains 420 observations and corresponds to the STATA
regression output in the previous section.
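One way to set up this exercise is sketched below. It assumes that you saved the spreadsheet as caschool.csv in your current working directory and that the first row of the file holds the variable names; adjust the file name and location to your own setup.

```stata
* Assumption: caschool.csv sits in the current working directory,
* with the variable names in the first row.
clear
insheet using "caschool.csv", comma names
* check that 420 observations were read
describe
regress testscr str, robust
```

If describe does not report 420 observations, open the .csv file in a text editor and look for stray rows or symbols before trying again.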
ASCII data
You can also import data from an ASCII file (text file). This assumes that you either saved data
from a different source as an ASCII file or that you received data in ASCII file format. The file
must be organized with one observation in each row, and the variables in the data set must be
in separate columns.
Using the infile command, type the name of the variable that represents each column, followed
by the file name.
For example, consider an ASCII dataset that looks as follows:
ahe
10.75 12
16.50 16
married
6
3
1
0
0
0
..
12.10 12
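A sketch of the infile command follows. The file name earnings.txt and the two-variable list are hypothetical and only illustrate the syntax: the variable list must name every column of your file, in order, and the sketch assumes that the file contains numbers only, without a row of variable names.

```stata
* Hypothetical file and variable list; the file is assumed to hold
* numbers only, one observation per row, with no header row.
clear
infile ahe married using "earnings.txt"
* look at the first few observations to check that the columns line up
list in 1/5
```

If the first listed observation looks wrong, the most likely cause is a mismatch between the number of names you typed and the number of columns in the file.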
STATA dataset
Data files that have been saved in STATA format carry the extension .dta.
To open a dataset that is already saved as a .dta file, you can either go to File and then Open to
select your dataset, or you can type the command
use (filename)
This will open your dataset into STATA, as long as you have changed your working directory
to the location on your computer where the data file is stored. The command to change the
working directory is
cd C:\(location)
Here are two tricks that will be of help down the road.
(i) If you are not sure how to type in the location of your data file, just right-click on
your Start button and select Explore. Then find your data set. Next right-click on
the data set and choose Properties. A new window opens up. Copy the Location.
Return to the Command Window in STATA and type use and then paste the
location. Add \ and the name of the file, including the extension. Then finish the
command with , clear.
Here is an example from my computer:
use C:\ClaremontLectures\ECON125\STATA\baseb.dta, clear
(ii) The clear command is very important. It erases previous data, if there was any,
from memory. I, and others, have wasted time trying to find errors in programming
simply by not clearing memory. Even if you don't understand the reason, the advice
is always to include the clear command when you read in a new data set.
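Putting the working directory and the clear option together, a typical start of a session might look like the sketch below. The folder C:\statafiles is only an example, and the sketch assumes that caschool.dta has been saved there.

```stata
* Hypothetical folder; replace it with the location of your own files.
cd "C:\statafiles"
* show the current working directory to confirm the change
pwd
* load the data set, erasing whatever was in memory before
use caschool.dta, clear
describe
```

Once the working directory is set, you only need to type the file name, not the full location.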
You can try doing this with the caschool.dta data set from the Stock and Watson website.
Simply save that data set on your computer, then double click on it. This will open STATA
with the data loaded already. Obviously this is the easiest method to import data into STATA.
Regardless of which method you use to import data, it is always a good idea to inspect the
data to check if there are some abnormalities. To do this, click on the Data Editor
(Browse) button below the drop down menus.
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i, \quad i = 1, \dots, n.$
To estimate the coefficients of the multiple regression model, you proceed in a similar way as
in the simple regression model. The difference is that you now need to list the additional
explanatory variables. In general, the command is:
reg Y X1 X2 ... Xk, (options)
where (options) can be omitted (this is the default and gives you homoskedasticity-only
standard errors) or can be replaced by various possible entries (e.g. r for
heteroskedasticity-robust standard errors).
See if you can reproduce the following regression output, which corresponds to Column 5 in
Table 7.1 of the Stock and Watson (2015) textbook (page 241). The option used below is (r) to
produce heteroskedasticity-robust standard errors (STATA refers to these as Robust Standard
Errors).
. reg testscr str el_pct meal_pct calw_pct, r

Linear regression                               Number of obs =     420
                                                F(  4,   415) =  361.68
                                                Prob > F      =  0.0000
                                                R-squared     =  0.7749
                                                Root MSE      =  9.0843

             |               Robust
     testscr |      Coef.   Std. Err.       t    P>|t|   95% CI upper limit
-------------+--------------------------------------------------------------
         str |  -1.014353   .2688613    -3.77   0.000          -.4858534
      el_pct |  -.1298219   .0362579    -3.58   0.000          -.0585498
    meal_pct |  -.5286191   .0381167   -13.87   0.000          -.4536932
    calw_pct |  -.0478537   .0586541    -0.82   0.415           .0674424
       _cons |   700.3918   5.537418   126.48   0.000           711.2767
 ( 1)  str = 0
 ( 2)  el_pct = 0
 ( 3)  meal_pct = 0
 ( 4)  calw_pct = 0

       F(  4,   415) =  361.68
            Prob > F =    0.0000
Note that the F-statistic is identical to the same statistic listed in the regression output.
See if you can generate the F-statistic of 5.43 following Equation (7.6) in the Stock and
Watson (2015) text and listed at the bottom of page 223 (restrict the coefficients of STR and
Expn to be zero).
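A sketch of how the joint tests can be run with the test command after a regression is given below. It assumes that caschool.dta is in your working directory. The variable name expn_stu for expenditures per student is an assumption on my part, so check it with describe first; my reading that Equation (7.6) includes STR, Expn and the percentage of English learners should likewise be checked against the textbook.

```stata
* Assumes caschool.dta is in the working directory.
use caschool.dta, clear
* joint test that all four slope coefficients are zero
regress testscr str el_pct meal_pct calw_pct, robust
test str el_pct meal_pct calw_pct
* exercise: joint test on STR and Expn
* (expn_stu is an assumed variable name; verify it with describe)
regress testscr str expn_stu el_pct, robust
test str expn_stu
```

The test command always refers to the most recent regression, so run it immediately after the regression it belongs to.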
h) Data Transformations
So far, we have only used data in regressions that already existed in some file that we either
created or used. Almost always, you will be required to transform some of the raw data that
you received before you run a regression. In STATA you transform variables by using the
gen (as in generate) command. For example, Chapter 8 of the Stock/Watson textbook
introduces the polynomial regression model, logarithms, and interactions between variables.
Let's reproduce Equations (8.2), (8.11), (8.18), and (8.37) here. The following commands
generate the necessary variables2:
gen avginc2=avginc^2
gen avginc3=avginc^3
gen lavginc=log(avginc)
gen ltestscr=log(testscr)
gen strpctel=str*el_pct
Note how the commands and generated variables are displayed in STATA, including those in
red when you make a mistake in the command (e.g. genr instead of gen).
2 For example, I have generated a variable called avginc2, and assigned it to be the square of the previously
defined variable avginc. Note that I am generating variable names that are self-explanatory. They could have
been called variable1, variable2, variable3, etc. but it is a good idea to create variable names that you can
remember.
Next run the four regressions using the same technique as for multiple regression analysis.
Finally save your workfile again and exit STATA.
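For orientation, the four regressions could be sketched as follows. This assumes that caschool.dta is loaded and that the gen commands above have been run. Which regressors belong to which equation is my reading of the variables generated above, not something spelled out in this tutorial, so compare each output with the corresponding equation in Chapter 8.

```stata
* Assumes caschool.dta is in memory and avginc2, avginc3, lavginc and
* strpctel have been generated as shown above.
* quadratic in income
regress testscr avginc avginc2, robust
* cubic in income
regress testscr avginc avginc2 avginc3, robust
* test scores on the logarithm of income
regress testscr lavginc, robust
* interaction between the student-teacher ratio and the share of English learners
regress testscr str el_pct strpctel, robust
```

The variable ltestscr generated above is not used in this sketch; you will need it for regressions with the logarithm of test scores as the dependent variable.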
S
Exercisse
One off the probleems with thee type of tu
utorial you are workingg on is thatt you just foollow
instructtions withou
ut internalizzing them. A typical sttudent will finish the ttutorial withh few
problem
ms but then
n little is rettained. If I asked you tto retrieve a data set aand to run a few
regresssions, for exaample, would you be ablle to do that?? Or would yyou say how
w do I do thiis?
Lets see how mucch you undeerstood. Go to the Stockk and Watsoon website ffor the 3rd eddition
(http://w
www.pearso
onhighered.ccom/stock_w
watson). Ennter the SStudent Ressources in the
Compa
anion Web Site,
S
and dow
wnload the CPS
C data set for Chapterr 8 (Data Setts for Repliccating
Empirical Results: CPS Data Used
U
in Chap
pter 8). Nexxt open it in STATA3
3 Note for STATA 12 users: if you just double click on the cps_ch8.dta file, an error message will occur that
tells you that insufficient memory was allocated. Before you open the cps_ch8.dta file, increase your memory
(usually set at 1 MB by default). You can do this by typing in the command
set mem 10m
which increases the memory to 10 megabytes. In general, make sure to set the memory large enough to handle the
data set, but small enough for your computer to handle the program (use k for kilobyte, m for megabyte, and g for
gigabyte).
Why do you think your results differ from those listed in the table? What if you found a way
to restrict your sample to only include individuals who are at least 30 but not older than 64? To
find a way to restrict your sample, look for Help and the if command. Then, restricting your
sample to those individuals in that age group, replicate columns (1) to (3). For column (4),
define potential experience as the Mincer experience variable (age - years of education - 6).
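As a hint for this exercise, the following is a minimal sketch of how an if restriction works. It deliberately does not use the CPS file: it assumes the example dataset nlsw88 that ships with STATA, and its variables age, grade (years of schooling) and wage merely stand in for the corresponding CPS variables, whose names may differ.

```stata
* Sketch: restricting the estimation sample with if.
* Assumption: the shipped example dataset nlsw88 stands in for the CPS data.
sysuse nlsw88, clear

* Use only observations with age between 35 and 40 (inclusive)
regress wage grade if age >= 35 & age <= 40, robust

* inrange() expresses the same restriction more compactly
regress wage grade if inrange(age, 35, 40), robust

* Potential (Mincer) experience: age - years of education - 6
generate potexp = age - grade - 6
summarize potexp if inrange(age, 35, 40)
```

The two regress commands should report the same number of observations, which is a quick way to check that the restriction does what you intended.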
Batch Files
So far, you have either clicked on buttons in STATA or used the Command Window to type
executable statements (commands one by one, or line by line). But what if you wanted to keep
a permanent record of all the transformations you made, regressions you tried, graphs you
created, etc.? In that case, you would need to create a program that consists of a list of line
commands similar to those that you used in the Command Window previously. After having
created such a program, which is a text or ASCII file, you can then execute (run) it and
view the output afterwards (if the program did not contain any errors). Batch files can also
include loops and conditional branching (if you don't know what these are, not to worry).
Batch files in STATA are called Do-Files.
Using STATA in batch mode has two important advantages over using STATA interactively:
1) The Do-File provides an audit trail for your work. The file provides an exact record of
each STATA command.
2) Even the best computer programmers will make typing or other errors when using
STATA. When a command contains an error, it won't be executed by STATA, or
worse, it will be executed but produce the wrong result. Following an error, it is often
necessary to start the analysis from the beginning. If you are using STATA
interactively, you must retype all of the commands. If you are using a Do-File, then you
only need to correct the command containing the error and rerun the file.
Let's create such a program. Click on the New Do-File Editor button. This opens the STATA Do-File Editor box.
Type in the following commands exactly as they appear.
log using \statafiles\stata1.log, replace
use \statafiles\caschool.dta
describe
generate income = avginc*1000
summarize income
log close
exit
Line 4: This line tells STATA to create a new variable called income (a shorter version of the
command is gen instead of generate). The new variable is constructed by
multiplying the variable avginc by 1000. The variable avginc is contained in the dataset
and is the average household income in a school district expressed in thousands of
dollars. The new variable income will be the average household income expressed in
dollars instead of thousands of dollars.
Line 5: This line tells STATA to compute some summary statistics (a shorter version of the
command is sum instead of summarize). STATA will produce the mean, standard
deviation, etc.
Line 6: This line closes the file stata1.log which contains the output.
Line 7: This line tells STATA that the program has ended.
As long as you have replaced the path in line 1 and line 2 with the relevant paths from the
computer you are working on, and if you downloaded/saved the California Test Score Data
Set, then we are good to go. Save the Do-File, using the .do suffix. Next execute this Do-File
by first opening STATA on your computer. Next, click on the File menu, then Do, and then
select the stata1.do file you just saved. This will run or execute the program.
(Alternatively, you can run the program, or even just part of the program, by hitting the
Execute (do) button in the Do-file Editor.)
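A Do-File can also be started by typing a command, which is convenient once you are used to the Command Window. The lines below are a minimal sketch; the folder C:\statafiles is only an assumption taken from the example paths in this tutorial, so replace it with your own location.

```stata
* Sketch: run a saved Do-File from the Command Window.
* Assumption: stata1.do was saved in C:\statafiles (use your own path).
cd "C:\statafiles"
do stata1.do
```

Typing doedit stata1.do instead opens the same file in the Do-File Editor so that you can change it before running it again.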
You will be able to see the program being executed in the Results Window. Since the execution
will not fit into one screen, you can scroll up and see everything that happened during the
run. Sometimes (although not here) you may see that the program execution pauses, and that
--more--
is displayed at the bottom of the Results Window. If this happens, push any key on the
keyboard and execution will continue.
To exit STATA, click on the usual exit button at the top right of STATA (alternatively click on
File and then Exit.) STATA will ask you if you really want to exit, and you will respond Yes.
Your output has been saved in stata1.log and you can look at it by opening the file with any
text editor (Notepad, for example) or in Word/WordPerfect. Here is what you should see:
----------------------------------------------------------------------------
      name:  <unnamed>
       log:  yourpathhere
  log type:  text
 opened on:  yourdateandtimehere

. use C:\yourpathhere

. describe

Contains data from C:\yourpathhere\caschool.dta
  obs:           420
 vars:            13                          yourdatehere
 size:        20,160
----------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------
enrl_tot        int     %8.0g
teachers        float   %8.0g
calw_pct        float   %8.0g
meal_pct        float   %8.0g
computer        int     %8.0g
testscr         float   %8.0g
comp_stu        float   %8.0g
expn_stu        float   %8.0g
str             float   %8.0g
avginc          float   %8.0g
el_pct          float   %8.0g
read_scr        float   %8.0g
math_scr        float   %8.0g
----------------------------------------------------------------------------
Sorted by:

. generate income = avginc*1000

. summarize income

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       420    15316.59     7225.89       5335      55328

. log close
      name:  <unnamed>
       log:  C:\yourpathhere\stata1.log
  log type:  text
You now have an initial idea of how to work with Do-Files in STATA. The rest of this part of
the tutorial will guide you through further commands and make the initial Do-File more
complex.
I suggest that you continue to work with the batch file you just created and add
new lines to this program (if you use the .pdf version of this tutorial or have printed the tutorial
using a color printer, then the new commands will appear in red).
#delimit;
******************************************;
* Administrative Commands;
******************************************;
set more off;
clear;
log using C:\statafiles\stata1.log,replace;
******************************************;
* Read in the data set;
*******************************************;
use C:\statafiles\caschool.dta;
des;
*******************************************;
* Transform Data and Create New Variables;
*******************************************;
* Construct Average District Income in $s;
*******************************************;
gen income = avginc*1000;
*******************************************;
* carry out statistical analysis;
*******************************************;
* summary statistics for Income;
*******************************************;
sum
income;
*******************************************;
* end of program;
*******************************************;
log close;
exit;
The new version of the Do-File carries out exactly the same calculations as before. However it
uses four features of STATA for more complicated analysis. The first new command is
#delimit ;
This command tells STATA that each STATA command ends with a semicolon. If STATA
does not see a semicolon at the end of the line, then it assumes that the command carries over
to the following line. This is useful because complicated commands in STATA are often too
long to fit on a single line. (Make sure to place a ; at the end of the seven old commands.)
The above Do-File contains an example of a STATA command written on two lines: near the
bottom of the file you see the command sum income written on two lines. STATA combines
these two lines into one command because the first line does not end with a semicolon. While
two lines are not necessary for this command, some STATA commands can get quite long, so
it is good to get used to employing this feature.
A word of warning: if you use the #delimit ; command, it is critical that you end each
command with a semicolon. Forgetting the semicolon on even a single line means that the
Do-File will not run properly (again, don't forget to add the seven ; in the first version of the
program).
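To see the semicolon rule at work on a command that really is long, consider the following sketch of a Do-File fragment. It assumes that caschool.dta sits in the current working directory; the choice of regressors is only for illustration. The last lines use #delimit cr, which switches STATA back to treating the end of each line as the end of a command.

```stata
* Sketch of a Do-File fragment (run it from the Do-File Editor, not line by line).
* Assumption: caschool.dta is in the current working directory.
#delimit ;
use caschool.dta, clear;
* A long command may be spread over several lines until the semicolon;
regress testscr str
        el_pct meal_pct,
        robust;
#delimit cr
* After #delimit cr, each line is again one complete command
summarize testscr
```

Switching back with #delimit cr is optional; it simply lets you mix both styles in one file.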
The second new feature of the above Do-File is that many of the lines begin with an asterisk.
STATA ignores the text that comes after *, so that these lines can be used for comments or
to describe what the commands that follow are doing. Note that each of these lines ends with a
semicolon. Without the semicolon, STATA would include the next line as part of the text
description.
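Besides the asterisk, STATA Do-Files accept two other comment styles. The sketch below shows all three; it is written without #delimit ; (so no semicolons are needed) and, purely as an assumption for a runnable example, uses the auto dataset that ships with STATA.

```stata
* A whole-line comment starts with an asterisk.
sysuse auto, clear
summarize price   // a comment can also follow a command on the same line
/* A block comment
   can span several lines. */
describe price mpg
```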
A final new feature in the program is the command
set more off
This command eliminates the need to hit a key on your keyboard in the case when STATA fills
the Results Window and stops displaying further results (the -- more -- would appear).
Run the program and have a look at the new log file.
Next, change the previous version of the Do-File by adding commands until the new version
looks as follows (again, new commands can be seen in red if your tutorial displays colors):
#delimit ;
*********************************************************;
*Administrative Commands;
*********************************************************;
set more off;
clear;
log using C:\STATA\stata1.log, replace;
*********************************************************;
*Read in the Dataset;
*********************************************************;
use C:\STATA\caschool.dta;
des;
*********************************************************;
*Transform Data and Create New Variables;
*********************************************************;
***** Construct Average District Income in $s;
gen income = avginc*1000;
***** Define variables for subset of data;
gen testscr_lo = testscr if (str<20);
gen testscr_hi = testscr if (str>=20);
*********************************************************;
*Carry Out Statistical Analysis;
*********************************************************;
***** Summary Statistics for Income;
sum
income;
sum testscr;
ttest testscr=0;
ttest testscr_lo=0;
ttest testscr_hi=0;
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
*Repeat the Analysis using STR = 19;
*********************************************************;
replace testscr_lo=testscr if (str<19);
replace testscr_hi=testscr if (str>=19);
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
*End of Program;
*********************************************************;
log close;
exit;
1)
New variables are created using only a portion of the dataset. Two of the variables in the
dataset are testscr (the average test score in a school district) and str (the district's
average class size or student-teacher ratio). The STATA command
gen testscr_lo = testscr if (str<20)
creates a new variable testscr_lo that is equal to testscr, but this variable is only defined
for districts that have an average class size of less than twenty students (that is, for
which str < 20).
The statement str<20 is an example of a relational operation. STATA uses several relational
and logical operators:
<     less than
>     greater than
<=    less than or equal to
>=    greater than or equal to
==    equal to
~=    not equal to
&     and
|     or
(Note that "=" is not the same as "==". "=" assigns a value to a variable, while "==" tests for
the equality of a variable to a value).
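The operators can be combined freely in if conditions and in generate statements. The following is a minimal sketch, assuming caschool.dta is in the current working directory; the cutoff values (10, 16, 24) and the variable name small are arbitrary choices for illustration.

```stata
* Sketch: relational and logical operators in if conditions and in generate.
* Assumption: caschool.dta is in the current working directory.
use caschool.dta, clear

* Districts with small classes AND more than 10 percent English learners
count if str < 20 & el_pct > 10

* Districts with very small OR very large classes
summarize testscr if str < 16 | str >= 24

* == tests for equality, while = assigns a value
count if el_pct == 0
generate byte small = (str < 20)
tabulate small
```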
2)
The ttest command constructs tests and confidence intervals for the mean of a
population or for the difference between two means (see Stock and Watson, 2015, pp. 71-91).
The command is used in two different ways in the program.
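The first is
ttest testscr=0

. ttest testscr=0

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 testscr |     420    654.1565    .9297082    19.05335    652.3291     655.984
------------------------------------------------------------------------------
    mean = mean(testscr)                                          t = 703.6149
Ho: mean = 0                                     degrees of freedom =      419

    Ha: mean < 0                 Ha: mean != 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000
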
This command computes the sample mean and standard deviation of the variable testscr,
computes a t-test that the population mean is equal to zero, and computes a 95%
confidence interval for the population mean. (In this example, the t-test that the
population mean of test scores is equal to zero is not really of interest, but the
confidence interval for the mean is what we are looking for in this example.) The same
command is then used for testscr_lo and testscr_hi (see Sections 3.2 and 3.3 in Stock and
Watson (2015)).
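To see where the confidence interval comes from, it can be rebuilt by hand from the sample mean, the standard error and the t critical value. This is only a sketch, assuming caschool.dta is in the current working directory; the scalar names se_mean and tcrit are made up for the example.

```stata
* Sketch: rebuild the 95% confidence interval reported by ttest.
* Assumption: caschool.dta is in the current working directory.
use caschool.dta, clear
quietly summarize testscr

* Standard error of the sample mean and the t critical value (N-1 degrees of freedom)
scalar se_mean = r(sd)/sqrt(r(N))
scalar tcrit   = invttail(r(N) - 1, 0.025)

display "Mean           = " r(mean)
display "Standard error = " se_mean
display "95% CI lower   = " (r(mean) - tcrit*se_mean)
display "95% CI upper   = " (r(mean) + tcrit*se_mean)
```

The two bounds should agree with the interval printed by ttest testscr=0.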
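The second form of the command is
ttest testscr_lo=testscr_hi, unequal unpaired

------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
testsc~o |     238    657.3513    1.254794    19.35801    654.8793    659.8232
testsc~i |     182    649.9788    1.323379    17.85336    647.3676    652.5901
---------+--------------------------------------------------------------------
combined |     420    654.1565    .9297082    19.05335    652.3291     655.984
---------+--------------------------------------------------------------------
    diff |             7.37241    1.823689                3.787296    10.95752
------------------------------------------------------------------------------
                                                                 t =   4.0426
                                                degrees of freedom =  403.607

                              Ha: diff != 0
                         Pr(|T| > |t|) = 0.0001
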
Executing this statement will test the hypothesis that testscr_lo and testscr_hi come from
populations with the same mean. That is, the command computes the t-statistic for the
null hypothesis that the (population) mean of test scores for districts with class sizes less
than 20 students is the same as the mean of test scores for districts with class sizes
of 20 or more students. The command uses two options that are listed after the
comma in the command. These options are unequal and unpaired. The option unequal
tells STATA that the variances in the two populations may not be the same. The option
unpaired tells STATA that the observations are for different districts, that is, these are
not panel data representing the same entity at two different time periods (see Section 3.4
in Stock and Watson (2015)).
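The same comparison can also be run without creating two separate variables, by telling ttest which variable defines the two groups. The following is a sketch under the assumption that caschool.dta is in the current working directory; the variable name small is hypothetical.

```stata
* Sketch: the unequal-variance test of equal means using a grouping variable.
* Assumption: caschool.dta is in the current working directory.
use caschool.dta, clear
generate byte small = (str < 20)
ttest testscr, by(small) unequal
```

With by(), STATA reports the difference as the mean of group 0 minus the mean of group 1, so the sign of the difference and of the t-statistic is reversed relative to the output above, while the magnitude should be the same.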
3)
A third new feature in the Do-File is the command replace. This appears near the
bottom of the file. Here, the analysis is to be carried out again, but using 19 as the cutoff
for small classes. Since the variables testscr_lo and testscr_hi already exist (they were
defined by the gen command earlier in the program), STATA cannot generate variables
with the same name. Instead, the command replace is used to replace the existing series
with new series. In essence, the command instructs the program to overwrite the
previously stored data.
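One point is worth checking when replace is combined with if: only the observations that satisfy the condition are overwritten, and all other observations keep their old values. Districts with a student-teacher ratio between 19 and 20 would therefore still carry a value in testscr_lo from the earlier cutoff. The sketch below shows one possible way to rebuild both variables completely; it uses the cond() function, which is not part of the Do-File above, and assumes caschool.dta is in the current working directory.

```stata
* Sketch: rebuild both subset variables for a cutoff of 19.
* Assumption: caschool.dta is in the current working directory.
use caschool.dta, clear
generate testscr_lo = testscr if (str < 20)
generate testscr_hi = testscr if (str >= 20)

* cond(c, a, b) returns a when c is true and b otherwise,
* so every observation is reassigned (missing where it does not belong)
replace testscr_lo = cond(str < 19, testscr, .)
replace testscr_hi = cond(str >= 19, testscr, .)

* No district should now appear in both groups
count if !missing(testscr_lo) & !missing(testscr_hi)
ttest testscr_lo = testscr_hi, unequal unpaired
```

If the count is zero, the two groups are disjoint and the test compares districts below 19 with districts at 19 or above.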
You are now ready to execute (run) the program as done before.
As before, change the previous version of the Do-File by adding commands until the new
version looks as follows (again, new commands can be seen in red if your tutorial displays
colors):
#delimit ;
*********************************************************;
*Administrative Commands;
*********************************************************;
set more off;
clear;
log using \statafiles\stata1.log, replace;
*********************************************************;
*Read in the Dataset;
*********************************************************;
use \statafiles\caschool.dta;
des;
*********************************************************;
*Transform Data and Create New Variables;
*********************************************************;
***** Construct Average District Income in $s;
gen income = avginc*1000;
***** Define variables for subset of data;
gen testscr_lo = testscr if (str<20);
gen testscr_hi = testscr if (str>=20);
*********************************************************;
*Carry Out Statistical Analysis;
*********************************************************;
***** Summary Statistics for Income;
sum
income;
*********************************************************;
***** Table 4.1 *****;
*********************************************************;
sum str testscr, detail;
*********************************************************;
***** Figure 4.2 *****;
*********************************************************;
twoway scatter testscr str || lfit testscr str;
*********************************************************;
***** Correlation *****;
*********************************************************;
cor str testscr;
*********************************************************;
***** Equation 4.11 and 5.8 *****;
*********************************************************;
reg testscr str, robust;
*********************************************************;
***** Equation 5.18 *****;
gen d = (str<20);
reg testscr d, r;
*********************************************************;
sum testscr;
ttest testscr=0;
ttest testscr_lo=0;
ttest testscr_hi=0;
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
*Repeat the Analysis using STR = 19;
*********************************************************;
replace testscr_lo=testscr if (str<19);
replace testscr_hi=testscr if (str>=19);
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
*End of Program;
*********************************************************;
log close;
exit;
The new commands reproduce some of the empirical results shown in Chapters 4 and 5 of
Stock and Watson (2015). There are several features of STATA included in the new commands
which have not been used in the previous examples:
1)
The summarize command (sum) now includes the option detail, which provides
more detailed summary statistics. The command is written as
sum str testscr, detail
This command tells STATA to compute summary statistics for the two variables str and
testscr. The option detail produces detailed summary statistics that include, for
example, the percentiles that are reported in Table 4.1 on p. 113 of Stock and Watson
(2015).
2)
The command
twoway scatter testscr str || lfit testscr str
constructs a scatterplot of testscr versus str and includes the estimated regression line for
the simple regression of the California Test Score Data Set, shown on p. 116 of Stock
and Watson (2015).
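Axis titles make the graph easier to read. The following sketch adds them; it assumes caschool.dta is in the current working directory, the title texts are only suggestions, and the /// continuation marks require that the lines are run from a Do-File rather than typed into the Command Window.

```stata
* Sketch: scatterplot with fitted line and labelled axes.
* Assumption: caschool.dta is in the current working directory.
use caschool.dta, clear
twoway (scatter testscr str) (lfit testscr str), ///
    ytitle("Test score") xtitle("Student-teacher ratio") ///
    legend(off)
```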
3)
The command
cor str testscr
tells STATA to compute the correlation between the student teacher ratio and test
scores.
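4)
Next you will reproduce equations (4.11) and (5.8) in Stock and Watson (2011) by using
reg testscr str, robust
This command regresses testscr on str. The option robust tells STATA to compute
heteroskedasticity-robust standard errors.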
5)
The final innovation over the previous version of the Do-File is contained in the two
commands following the line Equation 5.18. First a binary (sometimes referred to as
dummy or indicator) variable d is created using the STATA command
gen d = (str<20)
This variable is equal to 1 if the expression in parentheses is true, that is, when the
student-teacher ratio is less than 20. Otherwise it is equal to 0, in other words, when the
expression is false, or when the student-teacher ratio is greater than or equal to 20.
STATA allows you to use any of the relational operators defined above. The final
regression command tells STATA to run a regression of test scores on the binary variable
just created. The output reproduces equation (5.18) on p. 155 of Stock and Watson (2011).
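A regression on a single binary variable is a comparison of two group means: the intercept is the mean for d = 0 and the intercept plus the slope is the mean for d = 1. The sketch below, which assumes caschool.dta is in the current working directory, makes this visible by printing both quantities next to the group means.

```stata
* Sketch: a regression on a binary variable reproduces the two group means.
* Assumption: caschool.dta is in the current working directory.
use caschool.dta, clear
generate byte d = (str < 20)
regress testscr d, robust

display "Mean test score, str >= 20: " _b[_cons]
display "Mean test score, str <  20: " (_b[_cons] + _b[d])

* Group means computed directly, for comparison
tabulate d, summarize(testscr)
```

The two displayed numbers should match the group means shown in the two-sample ttest output earlier in this tutorial.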
Run the program now and look at the output in the log-file.
The upcoming Do-File will be the last program in this tutorial. Having understood all five
should give you a solid grounding in STATA programming. As before, there are several
commands added to the previous version of the Do-File. Add these commands to your older
version until the new version looks as follows (new commands can be seen in red if your
tutorial displays colors):
#delimit ;
*********************************************************;
*Administrative Commands;
*********************************************************;
set more off;
clear;
log using \statafiles\stata1.log, replace;
*********************************************************;
*Read in the Dataset;
*********************************************************;
use \statafiles\caschool.dta;
des;
*********************************************************;
*Transform Data and Create New Variables;
*********************************************************;
***** Construct Average District Income in $s;
gen income = avginc*1000;
***** Define variables for subset of data;
The file produces several of the empirical results from Chapter 7 of Stock and Watson (2015).
As before, some commands have been abbreviated when there is no possibility of confusion.
The file uses abbreviations for STATA commands throughout (generate becomes gen, regress
turns into reg, etc.).
In essence there are two new commands:
1)
The first new command involves the testing of restrictions in equation 7.6 (page 221 of
Stock and Watson (2015)). The command
reg testscr str expn_stu el_pct, r
instructs STATA to compute the regression. The command vce asks STATA to print out
the estimated variances and covariances of the estimated regression coefficients. The
command
test str expn_stu
gets STATA to carry out the joint test that the coefficients on str and expn_stu are both
equal to zero.
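The sequence described above can be sketched as follows, assuming caschool.dta is in the current working directory. The sketch uses estat vce, the documented postestimation command in recent STATA versions for displaying the covariance matrix, in place of vce; r(F) and r(p) are the results that test leaves behind after a regression.

```stata
* Sketch: regression, covariance matrix of the coefficients, and a joint test.
* Assumption: caschool.dta is in the current working directory.
use caschool.dta, clear
regress testscr str expn_stu el_pct, robust

* Estimated variances and covariances of the coefficients
estat vce

* Joint test that the coefficients on str and expn_stu are both zero
test str expn_stu
display "F = " r(F) ",  p-value = " r(p)
```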
2) The second new command is in the analysis of Table 7.1 on page 241 of Stock and
Watson (2015). When STATA computes an OLS regression, it computes the adjusted
R-squared ($\bar{R}^2$) as described in Section 6.4, page 197 of Stock and Watson (2015).
However, STATA does not display all of the results it computes, including the adjusted
R-squared (when the r option is invoked). The command
display "Adjusted Rsquared = " e(r2_a)
instructs STATA to print out (display) the adjusted R-squared. Whatever appears
between the two quotation marks (" ") will be displayed in your output (you did not
have to display the words "Adjusted Rsquared" but could have chosen anything else, such
as "My Measure of Fit"). However, e(r2_a) tells STATA where to retrieve the stored
result from and cannot be changed. The adjusted R-squared is not the only statistic that
STATA stores and does not display. You can use the Help function or look in the
Reference volume under Saved Results for the reg command to find other statistics.
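A quick way to see everything that a regression stores is the command ereturn list. The following sketch, which assumes caschool.dta is in the current working directory, displays the adjusted R-squared together with two other stored results and then lists all of them.

```stata
* Sketch: retrieving stored results after a regression.
* Assumption: caschool.dta is in the current working directory.
use caschool.dta, clear
regress testscr str expn_stu el_pct, robust

display "Adjusted Rsquared = " e(r2_a)
display "Rsquared          = " e(r2)
display "Observations      = " e(N)

* Show every result the regression left behind
ereturn list
```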
programs specifically designed for time series data, and the web site contains EViews and
RATS programs for Chapters 14-16, as well as a tutorial for EViews.
log using filename [, append replace]. This opens the file given by filename as a
log file for STATA output. The options append and replace are used when there
is already a file with the same name. With append, STATA will append the
output to the bottom of the existing file. With replace, STATA will replace the
existing file with the new output file.
log close. This closes the current log file.
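A common pattern at the top of a Do-File is sketched below. It assumes caschool.dta is in the current working directory and writes the log there as well. The prefix capture, which is not used elsewhere in this tutorial, lets the Do-File continue even when no log is open to close.

```stata
* Sketch: opening and closing a log file safely.
* Assumption: caschool.dta is in the current working directory.
capture log close
log using stata1.log, replace text

use caschool.dta, clear
describe

log close
```

Without the first line, rerunning a Do-File that stopped with an error can fail because the log from the earlier run is still open.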
set mat #
sets the maximum number of variables that can be used in a regression. The default
maximum is 40. If you have a huge number of observations and want to run a
regression with 45 variables, then you will need to use the command, where # is a
number greater than 45.
set memory #m
is used in Windows and Unix versions of STATA to set the amount of memory used by
the program. For details, see the discussion within the tutorial.
set more off
tells STATA not to pause and display the --more-- message in the Results Window.
Data Management
describe
describes the contents of data in memory or on disk. A related command is describe
using filename, which describes the dataset in filename
drop list of variables
this deletes/erases the variables in list of variables from the current STATA session.
For example, drop str testscr will delete the two variables str and testscr
keep list of variables
deletes/erases all of the variables from the current STATA session except those in list of
variables. Alternatively, it keeps the variables in the list and drops everything else. For
example, keep str testscr will keep the two variables str and testscr and deletes all of
the other variables in the current STATA session.
list list of variables
tells STATA to print all of the observations for the variables listed in list of variables.
save filename [, replace]
tells STATA to save the dataset that is currently in memory as a file with name
filename. The option replace tells STATA that it may replace any other file with the
name filename.
use filename
tells STATA to load a dataset from the file filename.
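The data management commands above are typically used together. The following is a minimal sketch, assuming caschool.dta is in the current working directory; the output file name caschool_small.dta is hypothetical. Saving under a new name leaves the original dataset untouched.

```stata
* Sketch: inspect a dataset, keep a few variables, and save a smaller copy.
* Assumption: caschool.dta is in the current working directory.
use caschool.dta, clear
describe

* Print only the first five observations of two variables
list testscr str in 1/5

* Keep three variables and drop everything else
keep testscr str el_pct

* Save under a new (hypothetical) name so the original file is not overwritten
save caschool_small.dta, replace
```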
Examples:
generate newts = testscr/100
creates a new variable called newts that is constructed as the variable testscr divided by 100.
replace testscr = testscr/100
changes the variable testscr so that all observations are divided by 100.
You can use the standard arithmetic operations of addition (+), subtraction (-), multiplication
(*), division (/), and exponentiation (^) in generate/replace commands. For example,
generate ts_squared = testscr*testscr
creates a new variable ts_squared as the square of testscr. (This could also have been
accomplished by using the command gen ts_squared = testscr^2.)
You can also use relational operators to construct binary variables. For example, in the fourth
batch file, the following command was included
gen d = (str<20);
This created the binary variable d that was equal to 1 when str<20 and was equal to 0
otherwise.
Standard functions can also be used. Three of the most useful are:
abs(x)    the absolute value of x
exp(x)    the exponential function of x
ln(x)     the natural logarithm of x
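These functions are used inside generate or replace like any other expression. The sketch below assumes caschool.dta is in the current working directory; the new variable names are made up for the example.

```stata
* Sketch: standard functions inside generate.
* Assumption: caschool.dta is in the current working directory.
use caschool.dta, clear

* Natural logarithm of average district income
generate ln_avginc = ln(avginc)

* Absolute distance of the student-teacher ratio from 20
generate dev_str = abs(str - 20)

* exp() undoes ln(), so check should equal avginc up to rounding
generate check = exp(ln_avginc)
summarize avginc check dev_str
```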
Statistical Operations
cor list of variables
tells STATA to compute the correlation between each of the variables in list of
variables
twoway scatter var1 var2 || lfit var1 var2
produces a scatter plot of var1 on the Y-axis and var2 on the X-axis. If the || lfit part is
included then the fitted OLS line is also displayed
predict newvarname [, residuals]
when this command follows the regress command, the OLS predicted values or
residuals are calculated and stored under the name newvarname. When the option
residuals is used, the residuals are computed; otherwise the predicted values are
computed and placed into newvarname.
Example:
reg testscr str expn_stu el_pct, r
predict tshat
predict uhat, residuals
Here, testscr is regressed on str, expn_stu, el_pct (first command); the fitted values are saved
and stored under the name tshat (second command), and the residuals are saved under the
name uhat (third command).
is used with a list of variables, then summary statistics are computed for all variables in
the list. If the option detail is used, more detailed summary statistics (including
percentiles) are computed.
Examples:
sum testscr str
computes summary statistics for the variables testscr and str.
sum testscr str, detail
computes detailed summary statistics for the variables testscr and str.
test
this command is used to test hypotheses about regression coefficients. It can be used to
test many types of hypotheses. The most common use of this command is to carry out a
joint test that several coefficients are equal to zero. Used this way, the form of the
command is test list of variables where the test is to be carried out on the coefficients
corresponding to the variables given in list of variables.
Example:
reg testscr str expn_stu el_pct, r
test str expn_stu
Here testscr is regressed on str, expn_stu, and el_pct (first command), and a joint test of
the hypothesis that the coefficient on str and expn_stu are jointly equal to zero is carried
out (second command).
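The test command also accepts other linear hypotheses. The lines below sketch a few forms, assuming caschool.dta is in the current working directory; the particular hypotheses are chosen only to show the syntax and have no special economic meaning.

```stata
* Sketch: other linear hypotheses with test.
* Assumption: caschool.dta is in the current working directory.
use caschool.dta, clear
regress testscr str expn_stu el_pct, robust

* A single coefficient equal to a specific value
test str = -1

* Two restrictions tested jointly, written out explicitly
test (str = 0) (el_pct = 0)

* Equality of two coefficients
test str = el_pct
```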
ttest
this command is used to test a hypothesis about the mean or the difference between
two means. The command has several forms. Here are a few:
ttest varname = # [if expression] [, level(#)]
Here you test the null hypothesis that the population mean of the series varname is
equal to #. When if expression is used, then the test is computed using observations for
which expression is true. The option level(#) is the desired level of the confidence
interval. If this option is not used, then a confidence level of 95% is used.
Examples:
ttest testscr = 0;
tests the null hypothesis that the population mean of testscr is equal to 0 and computes a 95%
confidence interval.
ttest testscr = 0, level(90);
tests the null hypothesis that the population mean of testscr is equal to 0 and computes a 90%
confidence interval.
ttest testscr = 0 if (str<20);
tests the null hypothesis that the population mean of testscr is equal to 0 and computes a 95%
confidence interval only using observations for which str<20.
ttest varname1 = varname2 [if expression] [, level(#) unpaired unequal]
tests the null hypothesis that the population mean of series varname1 is equal to the
population mean of series varname2. When if expression is used, then the test is
computed using observations for which expression is true. The option level(#) is the
desired level of the confidence interval. If this option is not used, then a confidence
level of 95% is used. The option unpaired means that the observations are not paired
(they are not panel data), and the option unequal means that the population variances
may not be the same. (Section 3.4 of Stock and Watson (2011) describes the equality of
means tests under the unpaired and unequal assumptions.)
Examples:
ttest testscr_lo=testscr_hi, une unp;
tests the null hypothesis that the population mean of testscr_lo is equal to the population mean
of testscr_hi and computes a 95% confidence interval. Calculations are performed using the
unequal variance and unpaired formula of Section 3.4 of Stock and Watson (2011).
ttest ts_lostr=ts_histr if elq1==1, unp une;
tests the null hypothesis that the population mean of ts_lostr is equal to the population mean of
ts_histr, and computes a 95% confidence interval. Calculations are performed for those
observations for which elq1 is equal to 1. Calculations are performed using the unequal
variance and unpaired formula of Section 3.4 of Stock and Watson (2011).
4. FINAL NOTE
For a complete list of commands, consult the STATA User's Guide and Base Reference
Manuals (3 volumes). In addition, there are more detailed manuals on Graphics, Time Series,
Data Management, etc. Alternatively, you may want to use the Help command inside
STATA. It will display details of STATA commands including options. Under the Search
tab, for example, you will find most of what you are looking for. As mentioned before, this
tutorial is not intended to replace the Reference or User's Guide. The best way to learn how to
use the program is to spend some time exploring and working with it. For a nice visual
introduction to the manuals, go to www.youtube.com/embed/xWJTFtWhQc4.
STATA replication batch files for all the results in the Stock/Watson textbook are available
from the Web site. You are invited to download these and study them.
There are many other tutorials on STATA available to you on the internet. If you prefer a
visual one, then www.ats.ucla.edu/stat/stata/notes/default.htm may be a
good place to start. STATA has its own YouTube series and you can find it at
www.stata.com/links/video-tutorials/.