Introduction To Stata 8
Svend Juul
Preface
Stata is a software package designed for data management and statistical analysis; the primary
target group is academia.
This booklet is intended mainly for the beginner, but knowledge of fundamental Windows
functions is necessary. Only basic commands and a few more advanced analyses are shown.
You will find a few examples of output, and you can find more in the manuals. The booklet
does not replace the manuals.
You communicate with Stata by commands, either by typing them or by using the menus to
create them. Every action is elicited by a command, giving you a straightforward
documentation of what you did. This mode of operation is a contrast to working with
spreadsheets where you can do many things almost by intuition, but where the documentation
of what you did – including the sequence of actions – may be extremely difficult to
reconstruct.
Managing and analysing data is more than issuing commands, one at a time. In my booklet
Take good care of your data¹ I give advice on documentation and safe data management, with
SPSS and Stata examples. The main message is: Keep the audit trail, primarily for your own
sake, secondarily to enable external audit or monitoring. These considerations are also
reflected in this booklet.
To users with Stata 7 experience: You can go on working as usual, and your version 7 do-files
should work as usual. However, most users will love the new menu system and the much-
improved graphics.
To users with SPSS experience: By design Stata is lean compared to SPSS. Concerning
statistical capabilities Stata can do a lot more. Stata has a vivid exchange of ideas and
experiences with the academic users while SPSS increasingly targets the business world. Just
compare the home pages www.spss.com and www.stata.com. Or try to ask a question or give
a suggestion to each of the companies. In section 15.9 I show some SPSS commands and their
Stata counterparts.
Vince Wiggins of Stata Corporation has given helpful advice on the graphics section.
Svend Juul
1) Juul S. Take good care of your data. Aarhus, 2003. (download from
www.biostat.au.dk/teaching/software).
Notation in this booklet
Stata commands are shown like this:
tabulate agegr sex , chi2
tabulate and chi2 are Stata words, shown with italics; agegr and sex are variable
information, shown with ordinary typeface.
In the output examples you will often see the commands preceded by a period:
. tabulate agegr sex , chi2
This is how commands look in output, but you should not type the period yourself when
entering a command.
Optional parts of commands are shown in light typeface and enclosed in light typeface square
brackets. Square brackets may also be part of a Stata command; in that case they are shown in
the usual bold typeface. Comments are shown with light typeface:
save c:\dokumenter\proj1\alfa1.dta [ , replace]
summarize bmi [fweight=pop] // Weights in square brackets
In the Stata manuals it is assumed that you use c:\data for all of your Stata files. I strongly
discourage that suggestion. I am convinced that files should be stored in folders reflecting the
subject, not the program used; otherwise you could easily end up confused. You will therefore
see that I always give a full path in the examples when opening (use) or saving files (save).
1. Installing, customizing and updating Stata
1.1. Installing Stata [GSW] 1
No big deal, just insert the CD and follow the instructions. By default Stata will be installed in
the c:\Stata folder. 'Official' ado-files will be put in c:\Stata\ado, 'unofficial' in
c:\ado. To get information about the locations, enter the Stata command sysdir. The
folders created typically are:
c:\Stata the main program
c:\Stata\ado\base 'official' ado-files as shipped with Stata
c:\Stata\ado\updates 'official' ado-file updates
c:\ado\plus downloaded 'unofficial' ado-files
c:\ado\personal for your own creations (ado-files, profile.do etc.)
The profile.do file [GSW] A7
If you put a profile.do file in the c:\ado\personal folder, the commands will be
executed automatically each time you open Stata. Write your profile.do using any text-
editor, e.g. Stata's Do-file editor or NoteTab (see appendix 3) – but not Word or WordPerfect.
// c:\ado\personal\profile.do
set logtype text // simple text output log
log using c:\tmp\stata.log , replace // open output log
cmdlog using c:\tmp\cmdlog.txt , append // open command log
// See a more elaborate profile.do in section 16
set logtype text writes plain ASCII text (not SMCL) in the output log, to enable
displaying it in e.g. NoteTab.
log opens Stata's log file (stata.log) to receive the full output; the replace option
overwrites old output. The folder c:\tmp must exist beforehand.
cmdlog opens Stata's command log file (cmdlog.txt); the append option keeps the
command log from previous sessions and lets you examine and re-use past commands.
If you somehow lost the setting, you can easily recreate it by:
Prefs ► Load windowing preferences
By default the Results window displays a lot of colours; to me they generate more smoke than
light. I chose to highlight error messages only and to underline links:
Prefs ► General preferences ► Result colors ► Color Scheme: Custom 1
Result, Standard, Input: White
Errors: Strong yellow, bold
Hilite: White, bold
Link: Light blue, underlined
Background: Dark blue or black
In each window (see section 2) you may click the upper left window button; one option is to
select font for that type of window. Select a fixed width font, e.g. Courier New 9 pt.
2. Windows in Stata [GSW] 2
I suggest arranging the main windows as shown below. Once you have made your choices:
Prefs ► Save Windowing Preferences
[Screenshot of the Stata windows. Review (upper left) lists the commands issued: sysuse auto.dta;
summarize; generate gp100m = 100/mpg; label variable gp100m "Gallons per 100 miles";
regress gp100m weight. Variables (lower left) lists the variables in auto.dta with their labels,
from make to gp100m. Stata Results (right) shows the output of the last three commands, ending
with the regress table. Stata Command (bottom) holds the current command line:
regress gp100m weight.]
Review window
This window displays the most recent commands. Click a command in the Review window to
paste it to the Command line window where you may edit and execute it. You may also scroll
through past commands using the PgUp and PgDn keys.
Save past commands to a do-file by clicking the upper left Review window button and: Save
Review Contents. Another option is to use the command log file (cmdlog.txt; section 1.2).
Variables window
You see a list of the variables in memory. Paste a variable name to the Command line window
by clicking it. You may also paste a variable name to an active dialog field.
Results window
This is the primary display of the output. Its size is limited, and you can't edit it. You may
print the contents, but in general my suggestion to let NoteTab print the log-file is handier.
Viewer window [GSW] 3
The main use of this window is viewing help files (see help and search, section 4). You
may select part of the viewer window for printing, using the mouse – but unfortunately not
the keyboard – to highlight it.
From the Viewer window you may also:
• find and install SJ, STB, and user-written programs from the net
• review, manage, and uninstall user-written programs
Stata also suggests that you use the Viewer window for viewing and printing output (the log
file), but it does not work well, and I find it much handier to use a general text editor (see
section 3) for examining, editing and printing output.
3. Suggested mode of operation
The recommendations in this section only partly follow what is recommended in the [GSW]
manual; I try to explain why, when my recommendations differ.
Do-files [U] 19
See examples of do-files in section 16. A do-file is a series of commands to be executed in
sequence. For any major task this is preferable to entering single commands because:
• You make sure that the commands are executed in the sequence intended.
• If you discover an error you can easily correct it and re-run the do-file.
• The do-file serves as documentation of what you did.
Use the do-file editor or any text editor like NoteTab (see appendix 3) to enter the commands.
I prefer NoteTab to the do-file editor, because I have direct access to several do-files and the
output in one place, although I cannot execute commands directly from NoteTab.
You may, after having issued a number of more or less successful commands, click the upper
left Review window button to save the contents of the Review window as a do-file. The
command log file (cmdlog.txt) may be used for the same purpose.
The do command
Find and execute a do-file by clicking: File ► Do...
or by entering path and filename in the command window:
do c:\dokumenter\proj1\alpha.do
From the do-file editor you may also execute the entire current file, or a highlighted part of it,
by clicking the do-button (number two from the right). The disadvantage of this method is
that the name of your do-file is not reflected in the output. I recommend issuing a do
command with full path and name of the do-file, for reasons of documentation.
3.2. Handling output
Stata's output facilities are less than optimal. In this section I show how you can use the third-
party program NoteTab to handle output for editing and printing.
The size of the Results window buffer is restricted, and you only have access to the last few
pages of output. To increase the buffer size (default 32,000 bytes) permanently:
set scrollbufsize 200000
The full log (c:\tmp\stata.log) is a copy of what you saw in the Results window. I use it
to inspect, edit and print output in NoteTab. I selected plain ASCII text for the full log; it is
overwritten next time you start Stata or when you issue the newlog command.
The nt command gives you rapid access to your output in NoteTab. See appendix 3 on how
to create both commands.
The command log (c:\tmp\cmdlog.txt) includes all commands issued. It is cumulative,
i.e. new commands are added to the file, which is not overwritten next time Stata is opened.
You may use it instead of the Review window to see and recover previous commands.
Error messages
Stata's short error messages include a code, e.g. r(131). The code is a link, and by clicking
it you get more clues. Some error messages, however, are not very informative.
5. Stata file types and names [U] 14.6
6. Variables
A Stata data set is rectangular; here is one with five observations and four variables:
                          Variables
                 obsno   age   height   weight
                   1     27     178      74
                   2     54     166      67
Observations       3     63     173      85
                   4     36     182      81
                   5     57     165      90
Stata is case-sensitive
Variable names may include lowercase and uppercase letters, but Stata is case-sensitive: sex
and Sex are two different variable names. Throughout this booklet I use lowercase variable
names; anything else would be confusing.
compress can reduce the size of your dataset considerably by finding the most economical
way of storage.
Numeric formats [U] 15.5.1
The default is General format, presenting values as readably and precisely as possible. In most
cases you need not bother with numeric formats, but you may specify:
format dollars kroner %6.2f
Unfortunately no data entry program accepts .a in a numeric field. In EpiData you might
choose the codes -1 to -3 (provided, of course, that they could not be valid codes) and let
Stata recode them:
recode _all (-1=.a)(-2=.b)(-3=.c)
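The recoding can be tried out on a small made-up data set. This is only a sketch: the variables id, q1 and q2 and their values are invented for the illustration, and the lines are meant to be run as a do-file.

```stata
// Hypothetical data: -1, -2 and -3 stand for three kinds of missing
clear
input id q1 q2
1  1 -1
2 -2  2
3  3 -3
end
recode _all (-1=.a)(-2=.b)(-3=.c)   // as in the text; id has no negative codes
list
summarize q1 q2                     // .a, .b and .c are excluded like ordinary missing
```

After the recode each variable has one missing code left out of the calculations, but the reason for missingness is still visible in the listing.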
7. Command syntax [U] 14.1
Stata is case-sensitive, and all Stata commands are lowercase. Variable names may include
lowercase and uppercase letters, but sex and Sex are two different variable names.
Throughout this booklet I use lowercase variable names; anything else would be confusing.
Uppercase variable names are sometimes used within programs (ado-files, see section
15.7) to avoid confusion with the variable names in the data set.
The general syntax of Stata commands can be written like this:
[prefix:] command [varlist][if expression][in range][weight][, options]
Command examples
Here are examples with the command summarize (mean, minimum, maximum etc.):
Prefix     Command     Varlist      Qualifiers    Options    Comments
           summarize                                         No varlist: all variables
           summarize   _all                                  _all: all variables
           summarize   sex age                               Two variables
           summarize   sex-weight                            Variables in sequence
           summarize   pro*                                  All variables starting with pro
           summarize   age          if sex==1                Males only
           summarize   bmi          in 1/10                  First 10 observations
           summarize   bmi          [fweight=n]              Weighted observations
           sort        sex                                   Separate table for each sex.
by sex:    summarize   bmi                                   Data must be sorted first.
           summarize   bmi                        , detail   Option: detail
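The same syntax elements can be tried on auto.dta, the example data shipped with Stata. This is a sketch only; the variables (price, mpg, rep78, trunk, turn, foreign) stand in for the sex, age and bmi variables of the table.

```stata
sysuse auto, clear            // example data shipped with Stata
summarize                     // no varlist: all variables
summarize price-rep78         // variables in sequence: price mpg rep78
summarize t*                  // all variables starting with t: trunk turn
summarize mpg if foreign==1   // qualifier: foreign cars only
summarize mpg in 1/10         // qualifier: first 10 observations
summarize mpg , detail        // option: detail
```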
In commands that have a dependent variable, it is the first in the varlist:
oneway bmi sex              bmi is the dependent variable
regress bmi sex age         bmi is the dependent variable
scatter weight height       scatterplot, weight is y-axis
tabulate expos case         The first variable defines the rows
by and bysort prefix [U] 14.5
Makes a command display results for subgroups of the data. Data must be pre-sorted:
sort sex
by sex: summarize age height weight
or, in one line:
bysort sex: summarize age height weight
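What happens when the data are not pre-sorted can be seen with auto.dta. In this sketch foreign plays the role of sex; capture noisily is used so that the do-file continues after the expected error.

```stata
sysuse auto, clear
sort mpg                                    // sorted by mpg, not by foreign
capture noisily by foreign: summarize mpg   // refused: data are not sorted by foreign
display _rc                                 // return code 5 (not sorted)
bysort foreign: summarize mpg               // sorts first, then one table per group
```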
The purpose of comments is to make do-files more readable to yourself; Stata ignores
whatever you write in them.
// C:\DOKUMENTER\PROFILE.DO executes when opening Stata
summarize bmi , detail // Body mass index
8. Getting data into Stata [U] 24; [GSW] 7
If you already have a disk file with this name, your request will be rejected unless you specify
the replace option. Only use the replace option if you really want to overwrite data.
You may also enter data directly in Stata's data window (not recommended; see section 2 and
[GSW] 6, 9).
Reading ASCII data [U] 24
You may read a tab-separated ASCII file with variable names in row 1 by the command:
insheet using c:\dokumenter\p1\a.txt , tab
If you have a comma-separated file without variable names in row 1 the command is:
insheet id type sold price using c:\dokumenter\p1\a.txt , comma
insheet assumes that all data belonging to one observation are in one line.
Reading fixed format data [R] infix; [R] infile (fixed format)
In fixed format data the information on each variable is determined by the position in the line.
The blank type in observation 2 will be read as missing.
1 2 47 51.23
2 793 199.70
Fixed format data can also be read by infile; to do this a dictionary file must be created,
specifying variable names and positions etc. See [R] infile (fixed format).
9. Documentation commands [GSW] 8
Stata does not need documentation commands; you need the documentation yourself. The
output becomes more legible, and the risk of errors when interpreting the output is reduced.
It is wise to include the creation date, to ensure that you analyse the most recent version.
Use informative variable labels, but make them short; they are sometimes abbreviated in output.
Use informative labels, but make them short; value labels are often abbreviated to 12
characters in output.
Most often you will use the same name for the variable and its label:
label define sex 1 male 2 female
label values sex sex
but the separate definition of the label enables you to reuse it:
label define yesno 1 yes 2 no
label values q1 yesno
label values q2 yesno
If you want to correct a label definition or add new labels, use the modify option:
label define sex 9 "unknown sex" , modify
adds the label for code 9 to the existing label definition.
In output Stata unfortunately displays either the codes or the value labels, and you often need
to see them both, to avoid mistakes. You may solve this by including the codes in the labels;
this happens automatically with:
numlabel _all , add
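The labelling commands of this section can be put together as follows. The data set is made up for the illustration (id, sex and q1 are hypothetical variables); the commands are the ones described above.

```stata
// Hypothetical three-person data set
clear
input id sex q1
1 1 1
2 2 2
3 9 1
end
label define sex 1 "male" 2 "female"
label values sex sex
label define sex 9 "unknown sex" , modify   // add a code to the existing definition
label define yesno 1 "yes" 2 "no"
label values q1 yesno
numlabel _all , add                         // prefix every value label with its code
tabulate sex                                // rows now show code and label together
label list
```

After numlabel the table rows show the code and the label together, which makes it harder to mistake one category for another.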
See label definitions
See the value label definitions by:
label list or
labelbook
The notes are kept in the data set and can be seen by:
notes
Notes are cumulative; old notes are not discarded (and that is nice).
10. Modifying data
Don't misinterpret the title of this section: Never modify your original data, but add
modifications by generating new variables from the original data. Not documenting
modifications may lead to serious trouble. Therefore:
• Modifications should always be made with a do-file whose name reflects what it does:
gen.alfa2.do generates alfa2.dta.
• The first command in the do-file reads data (e.g. use, infix).
• The last command saves the modified data set with a new name (save).
• The do-file should be 'clean', i.e. not include commands irrelevant to the modifications.
See examples of modifying do-files in section 16 and in Take good care of your data.
10.1. Calculations
Operators in expressions [GSW] 12; [U] 16.2
Arithmetic operators
generate alcohol = beers+wines+spirits
generate bmi = weight/(height^2)
The precedence order of arithmetic operators is as shown in the table: power before
multiplication and division, before addition and subtraction. Control the order by parentheses;
however the parentheses in the last command were not necessary since power takes
precedence over division – but they didn't harm either.
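The precedence rules are easy to check with display, which evaluates an expression without touching the data. The numbers below are arbitrary and chosen only for the illustration.

```stata
display 2 + 3*4^2        // power, then multiplication, then addition: 50
display (2 + 3)*4^2      // parentheses are evaluated first: 80
display 80/1.8^2         // power before division ...
display 80/(1.8^2)       // ... so the parentheses change nothing
```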
Logical expressions can be true or false; a value of 0 means false, any other value (including
missing values) means true. This means that with sex coded 1 for males and 0 for females
and no missing values, the second command could have been written as:
summarize age if sex
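The condition "no missing values" matters, because a missing value counts as true. A small sketch with made-up data (three persons, one with sex missing) shows the difference:

```stata
// Hypothetical data: the third person has missing sex
clear
input sex age
1 30
0 40
. 50
end
count if sex          // 2: sex==1 and the missing value both count as true
count if sex == 1     // 1: only the male
```

If missing values may occur, the explicit comparison is therefore the safer way to write the condition.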
generate; replace [R] generate
Generate a new variable by:
generate bmi=weight/(height^2)
If the target variable (bmi) already exists in the data set, use replace:
replace bmi=weight/(height^2)
Besides the standard operators there are a number of functions: [R] functions
generate y=abs(x)         absolute value |x|
gen y=exp(x)              exponential, e^x
gen y=ln(x)               natural logarithm
gen y=log10(x)            base 10 logarithm
gen y=sqrt(x)             square root
gen y=int(x)              integer part of x. int(5.8) = 5
gen y=round(x)            nearest integer. round(5.8) = 6
gen y=round(x, 0.25)      round(5.8, 0.25) = 5.75
gen y=mod(x1,x2)          modulus; the remainder after dividing x1 by x2
gen y=max(x1,...xn)       maximum value of arguments
gen y=min(x1,...xn)       minimum value of arguments
gen y=sum(x)              cumulative sum across observations, from first to current obs.
gen y=_n                  _n is the observation number
gen y=_N                  _N is the number of observations in the data set
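A few of the functions can be tried directly with display, and sum(), _n and _N on a tiny data set. This is a sketch; the variable names x, cumx and total are chosen for the illustration.

```stata
display int(5.8)            // 5
display round(5.8)          // 6
display round(5.8, 0.25)    // 5.75
display mod(17, 11)         // 6

clear
set obs 5                   // five empty observations
generate x = _n             // 1, 2, 3, 4, 5
generate cumx = sum(x)      // running total: 1, 3, 6, 10, 15
generate total = _N         // 5 in every observation
list
```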
recode [R] recode
Changes a variable's values, e.g. for grouping a continuous variable into few groups. The
inverted sequence ensures that age 55.00 (at the birthday) goes to category 4:
recode age (55/max=4)(35/55=3)(15/35=2)(min/15=1) , generate(agegr)
Very important: The generate option creates a new variable with the recoded
information; without generate the original information in age will be destroyed.
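How the boundary values are handled can be checked on a few made-up ages. The sketch relies on recode applying the first rule that matches a value, which is why the sequence of the rules matters.

```stata
// Hypothetical ages, including the boundary values 15, 35 and 55
clear
input age
14.9
15
34.9
35
55
70
end
recode age (55/max=4)(35/55=3)(15/35=2)(min/15=1) , generate(agegr)
list      // agegr: 1 2 2 3 4 4; age itself is unchanged
```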
Other examples:
recode expos (2=0)                             Leave other values unchanged
recode expos (2=0) , gen(exp2)                 Values not recoded transferred unchanged
recode expos (2=0) if sex==1 , gen(exp3)       Values not recoded (sex != 1) set to missing
recode expos (2=0) if sex==1 , gen(exp4) copy  Values not recoded (sex != 1) unchanged
recode expos (2=0)(1=1)(else=.)                Recode remaining values to missing (.)
recode expos (missing=9)                       Recode any missing (., .a, .b etc.) to 9
age agegrp
0 ≤ age < 5 0
5 ≤ age < 15 5
15 ≤ age < 25 15
.. ..
85 ≤ age < 120 85
for
Enables you with few command lines to repeat a command. To do the modulus 11 test for
Danish CPR numbers (see section 15.3) first multiply the digits by 4,3,2,7,6,5,4,3,2,1; next
sum these products; finally check whether the sum can be divided by 11. The CPR numbers
were split into 10 one-digit numbers c1-c10:
generate test = 0
for C in varlist c1-c10 \ X in numlist 4/2 7/1 : ///
replace test = test + C*X
replace test = mod(test,11) // Remainder after division by 11
list id cpr test if test !=0
C and X are stand-in variables (names to be chosen by yourself; note the use of capital
letters to distinguish from existing variables), to be sequentially substituted by the elements in
the corresponding list. Each list must be declared by type; there are four types: newlist
(list of new variables), varlist (list of existing variables), numlist (list of numbers),
anylist (list of words).
for is not documented in the manuals any more. The foreach and forvalues
commands partially replace it, but they don't handle parallel lists as shown above. See section
15.7.
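One possible workaround for the parallel lists, shown here only as a sketch, is to let forvalues run over the positions 1 to 10 and pick the matching weight from a local macro with the word extended function. The two digit strings are made up for the test; the macro name weights is an assumption of this example.

```stata
// Hypothetical digit strings: the first passes the modulus 11 test, the second fails
clear
input id c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
1 0 1 0 1 0 0 0 0 0 1
2 0 6 0 5 4 0 1 2 3 4
end
local weights "4 3 2 7 6 5 4 3 2 1"
generate test = 0
forvalues i = 1/10 {
    local w : word `i' of `weights'      // i-th weight goes with the i-th digit
    replace test = test + c`i'*`w'
}
replace test = mod(test,11)              // remainder after division by 11
list id test if test != 0                // lists id 2 only
```

The loop index ties the two lists together, which is what the parallel lists did in the for command.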
10.2. Selections
Observations dropped can only be returned to memory with a new use command. However,
preserve and restore (documented in [P]) let you obtain a temporary selection:
preserve // preserve a copy of the data currently in memory
keep if sex == 1
calculations
analyses
restore // reload the preserved dataset
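A runnable version of the skeleton above, using auto.dta and foreign as a stand-in for sex (a sketch only):

```stata
sysuse auto, clear
preserve                  // keep a copy of all 74 observations
keep if foreign == 1      // temporary selection
summarize mpg price
restore                   // the full data set is back in memory
count                     // 74
```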
The new sequence of variables will be as defined. Any variables not mentioned will follow
after the variables mentioned.
10.4. Sorting data
sort [R] sort, [R] gsort
To sort your data according to mpg (primary key) and weight (secondary key):
sort mpg weight
sort only sorts in ascending order; gsort is more flexible, but slower. To sort by mpg
(ascending) and weight (descending) the command is:
gsort mpg -weight
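To see the result of the mixed sort order, list the first observations of auto.dta after sorting (a short sketch):

```stata
sysuse auto, clear
gsort mpg -weight                  // mpg ascending; within equal mpg, heaviest car first
list make mpg weight in 1/8
```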
From a patient register you have information about hospital admissions, one or more per
person, identified by cpr and admdate (admission date). You want to construct the
following variables: obsno (observation number), persno (internal person ID), admno
(admission number), admtot (patient's total number of admissions).
. use c:\dokumenter\proj1\alfa1.dta
. sort cpr admdate
. gen obsno=_n // _n is the observation number
. by cpr: gen admno=_n // _n is obs. number within each cpr
. by cpr: gen admtot=_N // _N is total obs. within each cpr
. sort admno cpr // all admno==1 first
. gen persno=_n if admno==1 // give each person number if admno==1
. sort obsno // original sort order
. replace persno=persno[_n-1] if persno >=. // replace missing persno
. save c:\dokumenter\proj1\alfa2.dta
. list cpr admdate obsno persno admno admtot in 1/7
cpr admdate obsno persno admno admtot
1. 0605401234 01.05.1970 1 1 1 3
2. 0605401234 06.05.1970 2 1 2 3
3. 0605401234 06.05.1971 3 1 3 3
4. 0705401234 01.01.1970 4 2 1 1
5. 0705401235 01.01.1970 5 3 1 1
6. 0805402345 01.01.1970 6 4 1 2
7. 0805402345 10.01.1970 7 4 2 2
. summarize persno // number of persons (max persno)
. anycommand if admno==1 // first admissions
. anycommand if admno==admtot // last admissions
. tab1 admtot if admno==1 // distribution of n of admissions
You may also create a keyfile linking cpr and persno, and remove cpr from your
analysis file. See example 16b in Take good care of your data.
10.6. Combining files [U] 25
Both files must be sorted beforehand by the matching key (lbnr in the example above), and
the matching key must have the same name in both data sets. Apart from the matching key the
variable names should be different. Below A and B symbolize the variable set in the input
files, and numbers represent the matching key. Missing information is shown by . (period):
fila     filb     filab    _merge
1A       1B       1AB      3
2A                2A.      1
         3B       3.B      2
4A1      4B       4A1B     3
4A2               4A2B     3
Stata creates the variable _merge which takes the value 1 if only data set 1 (fila)
contributes, 2 if only data set 2 (filb) contributes, and 3 if both sets contribute. Check for
mismatches by:
tab1 _merge
list lbnr _merge if _merge < 3
For lbnr 4 there were two observations in fila, but only one in filb. The result was two
observations with the information from filb assigned to both of them. This lets you
distribute information, e.g. about doctors, to each of their patients, if that is what you desire.
But what if the duplicate lbnr 4 was an error? To check for duplicate id's before merging,
sort and compare with the previous observation:
sort lbnr
list lbnr if lbnr==lbnr[_n-1]
Another way to check for and list observations with duplicate id's is:
duplicates report lbnr
duplicates list lbnr
merge is a lot more flexible than described here; see [R] merge.
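The diagram can be reproduced with two miniature files. This is a sketch under stated assumptions: the variables a and b and their values are invented, a temporary file stands in for filb, and the merge command is written in the Stata 8 syntax used in this booklet (later Stata releases introduced a different merge syntax).

```stata
// Hypothetical filb: one observation per lbnr
clear
input lbnr b
1 10
3 30
4 40
end
sort lbnr
tempfile filb
save `filb'

// Hypothetical fila: lbnr 4 occurs twice
clear
input lbnr a
1 1
2 2
4 41
4 42
end
sort lbnr
duplicates report lbnr          // reveals the duplicate lbnr before merging
merge lbnr using `filb'
tab1 _merge
list lbnr a b _merge
```

Both observations with lbnr 4 receive the value of b from filb, as in the diagram.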
10.7. Reshaping data
collapse [R] collapse
You want to create an aggregated data set, not with the characteristics of each individual, but
of groups of individuals. One situation might be to characterize physicians by number of
patient contacts, another to make a reduced data set for Poisson regression (see section 13):
. // gen.stcollaps.cancer2.do
. use c:\dokumenter\proj1\stsplit.cancer2.dta , clear
. collapse (sum) risktime died , by(agegr drug)
. save c:\dokumenter\proj1\stcollaps.cancer2.dta
. summarize
Variable | Obs Mean Std. Dev. Min Max
-------------+-----------------------------------------------------
drug | 15 2 .8451543 1 3
agegr | 15 55 7.319251 45 65
risktime | 15 4.133334 3.01067 .3581161 10.88906
died | 15 2.066667 2.374467 0 8
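The first situation mentioned, characterizing physicians by their number of patient contacts, might be sketched like this. The data are made up; doctor, patient and contacts are hypothetical variable names.

```stata
// Hypothetical contact file: one observation per patient contact
clear
input doctor patient
1 101
1 102
2 103
1 104
end
generate contacts = 1
collapse (sum) contacts , by(doctor)    // one observation per doctor
list
```

Since collapse replaces the data in memory, it belongs in a do-file that reads the original data and saves the result under a new name.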
11. Description and analysis
This section gives information on the simpler statistical commands with examples of output.
Obtain detailed information on the distribution of selected variables by the detail option:
summarize price , detail
Find the nice modification summvl, also displaying variable labels, by: findit summvl.
. summvl
Variable Obs Mean Std.Dev Min Max Label
-------------------------------------------------------------------------
id 37 19 10.8244 1 37 identification number
type 37 1.89189 .965625 1 4 type of wine
price 35 46.58 16.3041 11.95 78.95 price per 75 cl bottle
rating 35 2.51429 .919444 1 4 quality rating
Stata's listing facilities are clumsy when you want to list many variables simultaneously. Find
and install the useful alternative slist by: findit slist.
tab1
tab1 gives one-way tables (frequency tables) for one or more variables:
. tab1 rating type (table for type not shown)
-> tabulation of rating
quality |
rating | Freq. Percent Cum.
--------------+-----------------------------------
1. poor | 6 17.14 17.14
2. acceptable | 9 25.71 42.86
3. good | 16 45.71 88.57
4. excellent | 4 11.43 100.00
--------------+-----------------------------------
Total | 35 100.00
You can request a three-way table (a two-way table for each value of nation) with:
bysort nation: tabulate rating type
The command:
tab2 rating type nation
gives three two-way tables, one for each of the combinations of the variables. But beware:
you can easily produce a huge number of tables.
[P] foreach
Imagine that you want 10 two-way tables: each of the variables q1-q10 by sex. With
tabulate you must issue 10 commands to obtain the result desired. If you call tab2 with
11 variables you get 55 two-way tables: all possible combinations of the 11 variables. The
foreach command (see section 15.7) lets you circumvent the problem:
foreach Q of varlist q1-q10 {
tabulate `Q' sex
}
The local macro Q is a stand-in for q1 to q10, and the loop generates 10 commands:
tabulate q1 sex
tabulate q2 sex etc.
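The loop can be tried on auto.dta. In this sketch rep78 and headroom stand in for q1-q10 and foreign for sex; the macro name V is chosen freely.

```stata
sysuse auto, clear
foreach V of varlist rep78 headroom {
    tabulate `V' foreign       // one two-way table per variable in the list
}
```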
tabi [R] tabulate
tabi is an 'immediate' command (see section 15.5) enabling you to analyse a table without
first creating a data set. Just enter the cell contents, delimiting the rows by \ (backslash):
. tabi 10 20 \ 17 9 , chi exact
| col
row | 1 2 | Total
-----------+----------------------+----------
1 | 10 20 | 30
2 | 17 9 | 26
-----------+----------------------+----------
Total | 27 29 | 56
Pearson chi2(1) = 5.7308 Pr = 0.017
Fisher's exact = 0.031
1-sided Fisher's exact = 0.016
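As a check of the output, the Pearson chi-square for a 2×2 table can be computed by hand from the cells and the margins, chi2 = N(ad-bc)²/(r1·r2·c1·c2). The sketch below uses the cells from the example; the scalar name x2 is chosen for the illustration.

```stata
// Cells: a=10 b=20 (row 1), c=17 d=9 (row 2); margins 30, 26, 27, 29; N=56
scalar x2 = 56*(10*9 - 20*17)^2 / (30*26*27*29)
display "Pearson chi2(1) = " %6.4f x2             // 5.7308, as in the output
display "Pr = " %5.3f chi2tail(1, x2)             // 0.017
```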
All procedures perform stratified analysis (Mantel-Haenszel). cc gives odds ratios for each
stratum and the Mantel-Haenszel estimate of the common odds ratio. The test of homogeneity
tests whether the odds ratio estimates could reflect a common odds ratio.
If you want to stratify by more than one variable, the following command is useful:
egen racesex=group(race sex)
cc case exposed , by(racesex)
The immediate commands do not perform stratified analysis. Here is an example with cci; just
enter the four cells (a b c d) of the 2×2 table:
cci 10 20 17 9 , woolf
11.2. Continuous variables
oneway [R] oneway
compares means between two or more groups (analysis of variance):
oneway price type [ , tabulate noanova]
The table, but not the test, could also be obtained by:
tabulate type , summarize(price) [R] tabsum
The col(stat) option lets the statistics form the columns; without it the statistics would
have formed the rows. The format() option lets you decide the display format. The
statistics are:
n, mean, sum, min, max, range, sd, var, cv (coefficient of variation), semean, skew
(skewness), kurt (kurtosis), p1, p5, p10, p25, p50 (or median), p75, p90, p95, p99, q
(quartiles: p25, p50, p75), and iqr (interquartile range).
ttest [R] ttest
T-test for comparison of means for continuous normally distributed variables:
ttest bmi , by(sex)                  Standard t-test, equal variances assumed
ttest bmi , by(sex) unequal          Unequal variances (see sdtest)
ttest prebmi==postbmi                Paired comparison of two variables
ttest prebmi==postbmi , unpaired     Unpaired comparison of two variables
ttest bmidiff==0                     One-sample t-test
ttesti 32 1.35 .27 50 1.77 .33       Immediate command. Input n, mean and SD for each group
       n1 m1   sd1 n2 m2   sd2
Distribution diagnostics
Diagnostic plots: [R] diagplots
pnorm bmi Normal distribution (P-P plot)
qnorm bmi Normal distribution (Q-Q plot)
Formal test for normal distribution: [R] swilk
swilk bmi Test for normal distribution
Test for equal variances: [R] sdtest
sdtest bmi , by(sex) Compare SD between two groups
sdtest prebmi==postbmi Compare two variables
Bartlett's test for equal variances is displayed by oneway, see above.
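A possible sequence for a two-group comparison, sketched with auto.dta (mpg stands in for bmi and foreign for sex): check the distribution and the variances first, then choose the t-test accordingly.

```stata
sysuse auto, clear
swilk mpg                           // formal test for normal distribution
sdtest mpg , by(foreign)            // equal variances in the two groups?
ttest mpg , by(foreign)             // equal variances assumed
ttest mpg , by(foreign) unequal     // if sdtest suggests unequal variances
```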
Non-parametric tests
For an overview of tests available, in the Viewer window command line enter:
search nonparametric
12. Regression analysis
Performing regression analysis with Stata is easy. Defining regression models that make sense
is more complex. Especially consider:
• If you look for causes, make sure your model is meaningful. Don't include independent
variables that represent steps in the causal pathway; it may create more confounding
than it prevents. Automatic selection procedures are available in Stata (see [R] sw), but
they may seduce the user to non-thinking. I will not describe them.
• If your hypothesis is non-causal and you only look for predictors, logical requirements
are more relaxed. But make sure you really are looking at predictors, not consequences
of the outcome.
• Take care with closely associated independent variables, e.g. education and social class.
Including both may obscure more than illuminate.
xi: [R] xi
The xi: prefix handles categorical variables in regression models. From a five-level
categorical variable xi: generates four indicator variables; in the regression model they are
referred to by the i. prefix to the original variable name:
xi: regress bmi sex i.agegrp
You may also use xi: to include interaction terms:
xi: regress bmi age i.sex i.treat i.treat*i.sex
By default the first (lowest) category will be omitted, i.e. be the reference group. You may,
before the analysis, select agegrp 3 to be the reference by defining a 'characteristic':
char agegrp[omit] 3
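A sketch with auto.dta, where the five-level variable rep78 stands in for agegrp. The names of the indicator variables are generated by xi; describe shows which ones were created.

```stata
sysuse auto, clear
char rep78[omit] 3                  // category 3 becomes the reference group
xi: regress mpg weight i.rep78      // xi creates indicator variables for rep78
describe _I*                        // the indicator variables created by xi
```

The coefficients of the indicator variables are then differences from category 3.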
12.2. Logistic regression
logistic [R] logistic
A standard logistic regression with ck as the dependent variable:
logistic ck sex smoke speed alc
The dependent variable (ck) must be coded 0/1 (no/yes). If the independent variables are also
coded 0/1 the interpretation of odds ratios is straightforward, otherwise the odds ratios must
be interpreted per unit change in the independent variable.
The xi: prefix applies as described in section 12.1:
xi: logistic ck sex i.agegrp i.smoke
xi: logistic ck i.sex i.agegrp i.smoke i.sex*i.smoke
After running logistic, obtain a classification table, including sensitivity and specificity,
with a cut-off point of your choice:
lstat , cutoff(0.3)
Repeat lstat with varying cut-off points or, smarter, use lsens to see sensitivity and
specificity with varying cutoff points:
lsens
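Put together, a post-estimation sequence could look like the sketch below. lroc is not mentioned above; it is a standard Stata command that draws the ROC curve and reports the area under it, and it is added here only as a natural companion to lsens.

```stata
logistic ck sex smoke speed alc
lstat , cutoff(0.3)     // classification table at one cut-off point
lsens                   // sensitivity and specificity across cut-off points
lroc                    // ROC curve and the area under the curve
```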
13. Survival analysis and related issues
st [ST] manual
The st family of commands includes a number of facilities, described in the Survival
Analysis manual [ST]. Here I describe the stset and stsplit commands and give a few
examples. The data is cancer1.dta, a modification of the cancer.dta sample data
accompanying Stata.
The observation starts at randomization (agein); the data set includes these variables:
. summvl // summvl is a summarize displaying variable labels.
// Get it by: findit summvl
Variable Obs Mean Std.Dev Min Max Label
------------------------------------------------------------------------
lbnr 48 24.5 14 1 48 Patient ID
drug 48 1.875 .841099 1 3 Drug type (1=placebo)
drug01 48 .583333 .498224 0 1 Drug: placebo or active
agein 48 56.398 5.6763 47.0955 67.8746 Age at randomization
ageout 48 57.6896 5.45418 49.0122 68.8737 Age at death or cens.
risktime 48 1.29167 .854691 .083333 3.25 Years to death or cens.
died 48 .645833 .483321 0 1 1 if patient died
. summarize
Variable | Obs Mean Std. Dev. Min Max
-------------+-----------------------------------------------------
lbnr | 48 24.5 14 1 48
.... |
risktime | 48 1.291667 .8546908 .0833333 3.25
died | 48 .6458333 .4833211 0 1
_st | 48 1 0 1 1
_d | 48 .6458333 .4833211 0 1
_t | 48 1.291667 .8546908 .0833333 3.25
_t0 | 48 0 0 0 0
. save c:\dokumenter\proj1\st.cancer1.dta
Four new variables were created, and the st'ed data set is prepared for a number of incidence
rate and survival analyses:
_st 1 if the observation includes valid survival time information, otherwise 0
_d 1 if the event occurred, otherwise 0 (censoring)
_t time or age at observation end (here: risktime)
_t0 time or age at observation start (here: 0)
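A data set with these four variables could be produced, and then analysed, along the following lines. This is a sketch under assumptions: the stset line is inferred from the summary above (where _t equals risktime and _t0 is 0), and the choice of analyses is only an illustration.

```stata
// Sketch only: the stset line is inferred from the summarize output above.
use c:\dokumenter\proj1\cancer1.dta , clear
stset risktime , failure(died==1) id(lbnr)
stsum , by(drug)              // time at risk and incidence rates by drug
sts graph , by(drug)          // Kaplan-Meier curves by drug
stcox drug01                  // Cox regression: active drug vs. placebo
```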
Including age in the analysis
stset data with ageout as the time-of-exit variable, agein as the time-of-entry
variable:
. // c:\dokumenter\proj1\gen.st.cancer2.do
. use c:\dokumenter\proj1\cancer1.dta , clear
. stset ageout , enter(time agein) failure(died==1) id(lbnr)
. summarize
Variable | Obs Mean Std. Dev. Min Max
-------------+-----------------------------------------------------
lbnr | 48 24.5 14 1 48
... |
_st | 48 1 0 1 1
_d | 48 .6458333 .4833211 0 1
_t | 48 57.61966 5.444583 49.87939 68.70284
_t0 | 48 56.328 5.659862 47.97637 67.73915
. save c:\dokumenter\proj1\st.cancer2.dta
The sts and stcox analyses as shown above now must be interpreted as age-adjusted
(delayed entry analysis). Summary of time at risk and age-specific incidence rates:
stptime , at(45(5)70) by(drug) // 5 year age intervals
The data now has 61 observations with events and risktime distributed to the proper age
intervals. Describe risktime etc. by:
bysort drug: stsum , by(agegr)
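Splitting observations into age intervals is the job of stsplit. The sketch below assumes that the split was made at the same five-year limits as in the stptime command and that the result was saved as stsplit.cancer2.dta; the exact command used is not shown here.

```stata
// Sketch only: split each patient's observation time at ages 45, 50, ..., 70.
// agegr is the new variable holding the lower limit of each age interval.
use c:\dokumenter\proj1\st.cancer2.dta , clear
stsplit agegr , at(45(5)70)
save c:\dokumenter\proj1\stsplit.cancer2.dta
```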
poisson [R] poisson
The stsplit.cancer2.dta data set above can be used for Poisson regression with a little
more preparation. died and risktime must be replaced as shown. You also may collapse
the file to a table with one observation for each age group and drug (see section 10.7):
. // c:\dokumenter\proj1\gen.stcollaps.cancer2.do
. use c:\dokumenter\proj1\stsplit.cancer2.dta , clear
. replace died = _d
. replace risktime = _t - _t0
. summarize
Variable | Obs Mean Std. Dev. Min Max
-------------+-----------------------------------------------------
.... |
risktime | 61 1.016394 .7343081 .0833321 2.75
died | 61 .5081967 .5040817 0 1
_st | 61 1 0 1 1
_d | 61 .5081967 .5040817 0 1
_t | 61 56.87054 5.563221 49.01218 68.8737
_t0 | 61 55.85415 5.610502 47.09552 67.87458
agegr | 61 54.01639 5.832357 45 65
. collapse (sum) risktime died , by(agegr drug)
. summarize
Variable | Obs Mean Std. Dev. Min Max
-------------+-----------------------------------------------------
drug | 15 2 .8451543 1 3
agegr | 15 55 7.319251 45 65
risktime | 15 4.133334 3.01067 .3581161 10.88906
died | 15 2.066667 2.374467 0 8
. save c:\dokumenter\proj1\stcollaps.cancer2.dta
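With the collapsed table, a Poisson regression could be requested as sketched below. The model itself (drug and age group as categorical predictors) is my assumption for the illustration; exposure() gives the time at risk, and irr displays incidence rate ratios rather than coefficients.

```stata
// Sketch only: the choice of predictors is an assumption.
use c:\dokumenter\proj1\stcollaps.cancer2.dta , clear
xi: poisson died i.drug i.agegr , exposure(risktime) irr
```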
14. Graphs
14.1. Introduction
The purpose of this section is to help you understand the fundamentals of Stata 8 graphs, and
to enable you to create and modify them.
With Stata's dialogs you can easily define a graph. Once you have made your choices, press
[Submit] rather than [OK]; this gives the opportunity to modify your choices after having
looked at the result.
This was the easy part. For the purpose of analysis you can do most things with the dialogs.
Look at the illustrations in this section to get some ideas of the types and names of graphs. At
www.ats.ucla.edu/stat/stata/Library/GraphExamples/default.htm you find a number of graph
examples with the commands used.
The following more complex stuff illustrates how to make graphs ready for publication.
–o–
Stata can produce high-quality graphs, suited for publication. However, the first edition of the
Graphics manual is complicated to use, to say the least; don't feel inferior if you get lost in the
maze while looking up information. The on-line help works better, once you understand the
general principles.
Use the Graphics manual to see examples of graphs, but skip the style and options specifi-
cations unless you are very dedicated.
The style of the graphs presented here is different from the manual style; I attempted to hit a
leaner mainstream style used in most scientific journals. The graphs are based upon my
schemes lean1 with a framed plot area and no gridlines and lean2 with no frame but
with gridlines. Find the schemes used by findit lean schemes. See more on this issue
under Schemes, section 14.9.
You will meet some critical remarks in this section. However:
• Stata's graphics is a very versatile system; you can create almost whatever you want,
except (fortunately) 3-D effects.
• The Stata people are very open to criticism and suggestions, and the users' input no
doubt will give inspiration to improved design, accessibility and documentation.
14.2. The anatomy of graphs
Figure 1 shows the most important elements of a graph. The graph area is the entire figure,
including everything, while the plot area is the central part, defined by the axes.
[Figure 1. The anatomy of a graph. Elements shown: title, subtitle, plot area with two scatterplots (first plot, second plot), legend, y-axis title, x-axis title, and a note placed in the outer region or background.]
A graph consists of several elements: Title, legend, axes, and one or more plots, e.g. two
scatterplots within the same plot area; Figure 1 includes two scatterplots.
Below is the command that generated Figure 1 (except the dashed outer frame). The elements
of the command will be explained later.
sysuse auto.dta // open auto.dta accompanying Stata
set scheme lean1
twoway (scatter mpg weight if foreign==0) ///
(scatter mpg weight if foreign==1) ///
, ///
title("Title: Figure 1") ///
subtitle("Subtitle: The anatomy of a graph") ///
ytitle("Y-axis title") xtitle("X-axis title") ///
note("Note: This is the outer region or background") ///
legend(title("Legend") , ///
label(1 "first plot") label(2 "second plot")) ///
text(35 3400 "Plot area")
This is the syntax style generated by the dialogs, and I will stick to it.
Unfortunately the Graphics manual frequently uses another, less transparent style:
graph-command plot-command , plot-options || plot-command , plot-options || , graph-options
Clue: Put a || where the standard syntax has a ) parenthesis closing a plot specification.
When letting the dialog generate a simple scatterplot command, the result is like this:
twoway (scatter mpg weight)
twoway defines the graph type; scatter defines a plot in the graph. You could enter the
same in the command window, but Stata also understands this short version:
scatter mpg weight
The variable list (e.g. mpg weight) in most graph commands may have one or more
dependent (y-) variables, and one independent (x-) variable, which comes last.
Graph commands may have options; as in other Stata commands a comma precedes the
options. title() is an option to the twoway graph command:
twoway (scatter mpg weight) , title("74 car makes")
Options may have sub-options. size() is a sub-option to the title() option; here it lets
the title text size be 80% of the default size:
twoway (scatter mpg weight) , title("74 car makes" , size(*0.8))
Warning: Options don't tolerate a space between the option keyword and the parenthesis,
like the following (□ denotes a blank character):
title□("74 car makes")
The error message may be confusing, e.g. 'Unmatched quotes' or 'Option not allowed'.
Advice: Graph commands tend to include a lot of nested parentheses, and you may make
errors (I often do). In the Do-file editor, place the cursor after an opening parenthesis and
enter [Ctrl]+B, to see the balancing closing parenthesis. In NoteTab you can use [Ctrl]+M
(match) in the same way.
You can, however, determine the aspect ratio of the plot area (the y/x axis ratio) by the
aspect() option. To obtain a square plot area:
twoway (scatter mpg weight) , ysize(3) xsize(4) aspect(1)
Ticks, labels and gridlines
Stata sets reasonable ticks and labels at the axes; you may also define them yourself. The
following command sets a tick and a label for every 20 years at the x-axis, minor ticks divide
each major interval in two. The y-axis has a log scale; tick marks are defined.
twoway (line incidence year) , ///
xlabel(1900(20)2000) xmtick(##2) ///
yscale(log) ylabel(1 2 5 10 20 50 100)
If you use the s2color, s2mono or lean2 scheme, the default is horizontal gridlines and
no vertical gridlines. To drop horizontal and include vertical gridlines (hardly a good idea in
this case):
... , xlabel( , grid) ylabel(1 2 5 10 20 50 100 , nogrid)
If you want to display decimal commas rather than periods, give the Stata command:
set dp comma
Plotregion margin
By default twoway graphs include a margin between the extreme plot values and the axes, to
avoid symbols touching axes. If you want a zero margin – as in the twoway line plot,
section 14.7 – include:
... , plotregion(margin(zero))
The twoway line plot, section 14.7, illustrates an alternative placement of the legend:
... , legend(label(1 "Males") label(2 "Females") ring(0) pos(8))
A text block is placed in the plot area by giving its y and x coordinates; place(c) (the
default) means that the coordinates apply to the center of the text block; place(se) that
they apply to the block's southeast corner. See example in Figure 1 and the twoway line
plot, section 14.7:
... , text(90 69 "1999-2000")
Marker symbols
Markers are defined by symbol: msymbol(), outline colour: mlcolor(), fill colour:
mfcolor() and size: msize(). To define a hollow circle:
twoway (scatter mpg weight , msymbol(Oh))
A hollow circle (Oh) is transparent. Obtain a circle with a non-transparent white fill by:
twoway (scatter mpg weight , msymbol(O) mfcolor(white))
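The four marker options can be combined. A small sketch with auto.dta; the particular symbol, colours and size are arbitrary choices made for the illustration:

```stata
// Sketch only: diamond markers with black outline, light gray fill, large size.
sysuse auto , clear
twoway (scatter mpg weight , msymbol(D) mlcolor(black) ///
        mfcolor(gs12) msize(large))
```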
Connecting lines
The twoway line and twoway connected examples, section 14.7, use connecting
lines; here the clpattern() and clwidth() options apply:
twoway (line m1840-k1999 age , clpattern( - l - l - l ))
The default connect-style is a straight line. Obtain a step-curve like a Kaplan-Meier plot by:
twoway (line cum time , connect(J))
Bars
Bar graphs (twoway bar) and range plots use bar outlines; here the blpattern() and
blwidth() options apply. The colour of the bar fill is defined by the bfcolor() option:
... , bar(1, bfcolor(gs9)) bar(2, bfcolor(gs14))
14.7. Examples
On the following pages you find illustrations of some important graph types, including the
commands that generated the graphs. The appearance is different from the manual's graphs; it
was determined by my schemes lean1 and lean2, described in section 14.9.
For each graph you see the do-file that made it, including the data for the graph or a use
command. I suggest letting do-files generating graphs always start with a gph. prefix, for
easy identification.
In the illustrations I reduced the graph size by the xsize() and ysize() options. This,
however, leads to too small text and symbols, and I enlarged them by the scale() option.
twoway graphs have continuous x- and y-axes. Many plot-types fit in twoway graphs;
exceptions are graph bar, graph box and graph pie.
histogram
[Figure: histogram. X-axis: Birthweight, grams (1000 to 5000); y-axis: N of children (0 to 80).]
A histogram depicts the distribution of a continuous variable. The y-axis may reflect a count
(frequency), a density or a percentage; the corresponding normal curve may be overlaid.
Histograms are documented in [R] histogram and in [G] graph twoway histogram.
// c:\dokumenter\...\gph.birthweight.do
use "C:\dokumenter\...\newborns.dta" , clear
set scheme lean2
histogram bweight ///
, ///
frequency ///
normal ///
start(750) width(250) ///
xlabel(1000(500)5000) ///
xmticks(##2) ///
xtitle("Birthweight, grams") ///
ytitle("N of children") ///
plotregion(margin(b=0)) ///
xsize(4) ysize(2.3) scale(1.4)
graph bar
[Figure: bar graph. X-axis: Age (16-24, 25-44, 45-66, 67-79, 80+); y-axis: Prevalence (per cent); legend: Males, Females.]
// c:\dokumenter\...\gph.diabetes prevalence.do
clear
input str5 age m f
16-24 .9 .2
25-44 .8 .8
45-66 3.8 2.9
67-79 8.2 5.4
80+ 9.1 7.2
end
set scheme lean2
graph bar m f ///
, ///
over(age) ///
b1title("Age") ///
ytitle("Prevalence (per cent)") ///
legend( label(1 "Males") label(2 "Females") ) ///
xsize(4) ysize(2.3) scale(1.4)
For some reason the xtitle() option is not valid for bar graphs. To generate an x-axis title
you may, however, use b1title() instead.
Bar fill colours are assigned automatically according to the scheme. This option would
generate a very dark fill for females:
... , bar(2 , bfcolor(gs3))
In bar graphs the x-axis is categorical, the y-axis continuous. In the example variables m and
f defined the heights of the bars, but actually graph bar used the default mean function,
as if the command were (with one observation per bar the result is the same):
graph bar (mean) m f , over(age)
With the auto.dta data you could generate bars for the number of domestic and foreign
cars by:
graph bar (count) mpg , over(foreign)
Actually what is counted is the number of non-missing values of mpg.
Bar graphs are documented in [G] graph bar and [G] graph twoway bar.
twoway scatter
[Figure: scatterplot. X-axis: Weight (lbs.); y-axis: Mileage (mpg); legend: Domestic, Foreign.]
// c:\dokumenter\...\gph.mpg_weight.do
clear
sysuse auto
set scheme lean2
twoway ///
(scatter mpg weight if foreign==0, msymbol(Oh)) ///
(scatter mpg weight if foreign==1, msymbol(O)) ///
, ///
legend(label(1 "Domestic") label(2 "Foreign")) ///
xsize(4) ysize(2.3) scale(1.4)
Twoway graphs have continuous x- and y-axes; scatter is the "mother" of twoway graphs.
A graph with one plot has no legend; this one with two plots has one. The default legend texts
often need to be replaced by short, distinct texts, as here. Since the xtitle() and
ytitle() options were not specified, Stata used the variable labels as axis titles.
twoway line
[Figure: line plot. X-axis: Age (0 to 100); y-axis: Per cent surviving (0 to 100); curves labelled 1840-49, 1901-05 and 1999-2000; legend: Females, Males.]
// c:\dokumenter\...\gph.DKsurvival.do
use c:\dokumenter\...\DKsurvival.dta , clear
sort age // Data must be sorted by the x-axis variable
list in 1/3, clean // List to show the data structure
age m1840 k1840 m1901 k1901 m1999 k1999
1. 0 100.00 100.00 100.00 100.00 100.00 100.00
2. 1 84.47 86.76 86.93 89.59 99.16 99.37
3. 2 80.58 83.11 85.22 87.89 99.08 99.32
set scheme lean1
twoway ///
(line m1840-k1999 age , clpattern( - l - l - l )) ///
, ///
plotregion(margin(zero)) ///
xtitle("Age") ///
ytitle("Per cent surviving") ///
legend(label(1 "Males") label(2 "Females") order(2 1) ///
ring(0) pos(8)) ///
text(91 72 "1999-2000") ///
text(77 48 "1901-05") ///
text(49 40 "1840-49") ///
xsize(3.3) ysize(2.3) scale(1.4)
A line plot is a variation of scatterplot without markers, but with connecting lines. This graph
includes six line plots, requested by one plot-specification with six y- and one x-variable. The
clpattern() option defines the six connected-line patterns.
Make sure data are sorted according to the x-axis variable; otherwise the result is nonsense.
The example shows how to include text in a graph and how to position the legend within the
plot area (see section 14.5 on placement of graph elements).
Twoway graphs by default include "empty" space between the axes and the extreme plot
values. The graph option plotregion(margin(zero)) lets the plot start right at the axes.
twoway connected; twoway rcap
[Figure: connected plots with capped range bars. X-axis: SF-36 subscale (PF, RP, BP, GH, VT, SF, RE, MH); y-axis: Mean score (20 to 100); legend: Observed, Expected, 95% CI.]
// c:\dokumenter\...\gph.SF36a.do
clear
input scale n obs sd norm
1 139 60.81 27.35 70.77
2 139 37.65 42.06 62.01
...
8 139 73.06 21.54 79.99
end
generate se=sd/sqrt(n)
generate ci1=obs+1.96*se
generate ci2=obs-1.96*se
label define scale 1 "PF" 2 "RP" 3 "BP" 4 "GH" 5 "VT" 6 "SF" 7 "RE" 8 "MH"
label values scale scale
set scheme lean1
twoway ///
(connected obs scale , msymbol(O) clpattern(l)) ///
(connected norm scale , msymbol(O) mfcolor(white) clpattern(-)) ///
(rcap ci1 ci2 scale) ///
, ///
ytitle("Mean score") ///
xtitle("SF-36 subscale") ///
xlabel(1(1)8 , valuelabel noticks) ///
xscale(range(0.5 8.5)) ///
legend(label(1 "Observed") label(2 "Expected") label(3 "95% CI")) ///
xsize(4) ysize(2.3) scale(1.4)
This graph includes three plots; two connected and one rcap. In twoway plots both axes
are continuous, so you could not have a categorical variable (PF, RP etc.) at the x-axis.
Solution: use a numerical variable and use value labels to indicate the meaning. This graph
style is frequently used to present SF-36 results, although connecting lines may be illogical
when displaying eight qualitatively different scales.
xscale(range(0.5 8.5)) increased the distance between plot symbols and plot margin.
rcap does not calculate confidence intervals for you; it is up to you to provide two y- and
one x-value for each confidence interval. rspike would have plotted intervals without caps.
twoway rspike
[Figure: horizontal spikes crossed by a vertical reference line. X-axis title: Cross-sectional study.]
// c:\dokumenter\...\gph.length_bias.do
clear
set obs 20
gen x=_n
gen y1=x
gen y2=y1+2
replace y2=y1+8 if mod(x,2)==0
set scheme lean2
twoway (rspike y1 y2 x , horizontal blwidth(*1.5)) ///
, ///
yscale(off) ylabel(, nogrid) ytitle("") ///
xlabel(none) xtitle("Cross-sectional study") ///
xline(14.5) ///
xsize(3.7) ysize(2.3) scale(1.4)
The purpose of this graph is to illustrate length bias: a cross-sectional (prevalence) study may
mislead you. Cases with short duration (due to successful treatment or high case fatality) are
underrepresented in a cross-sectional sample.
rspike is in the twoway r* family: range plots, like rcap shown before; this time it is
horizontal.
In range plots and droplines (next page) the lines technically are bar outlines, and options are
blcolor(), blpattern() etc.; hence the blwidth(*1.5) to make the spikes wider
than the default.
It is easy to create one or more reference lines; use xline() and yline().
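A minimal sketch with auto.dta; the positions of the two reference lines are arbitrary values chosen for the illustration:

```stata
// Sketch only: a horizontal line at mpg = 20 and a vertical line at weight = 3000.
sysuse auto , clear
twoway (scatter mpg weight) , yline(20) xline(3000)
```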
twoway dropline
[Figure: horizontal dropline plot. X-axis: Years after diagnosis (0 to 7); y-axis: Patient number (1 to 20); legend: Deaths, Censorings.]
// c:\dokumenter\...\gph.obstime.do
use "c:\dokumenter\...\cohort1.dta" , clear
list patient time died in 1/5 , clean
patient time died
1. 1 0.578 1
2. 2 0.867 1
3. 3 1.235 1
4. 4 1.374 0
5. 5 1.437 1
set scheme lean2
twoway ///
(dropline time patient if died==1, horizontal msymbol(S)) ///
(dropline time patient if died==0, horizontal ///
msymbol(S) mfcolor(white)) ///
, ///
plotregion(margin(zero)) ///
ytitle("Patient number") ///
yscale(range(0 22)) ///
ylabel(1 5 10 15 20 , nogrid) ///
xtitle("Years after diagnosis") ///
xlabel(0(1)7) ///
legend(label(1 "Deaths") label(2 "Censorings") ring(0)) ///
xsize(3.7) ysize(2.5) scale(1.3)
In a dropline plot a line 'drops' from a marker perpendicularly to the x- or y-axis. Droplines
technically are bar outlines, like range plots, and their appearance is controlled by
blpattern(), blcolor() and blwidth().
The marker for censorings is a square with white fill, not a hollow square, to prevent the
dropline from being visible within the marker.
twoway function
// c:\dokumenter\...\gph.normal.do
set scheme lean2
twoway ///
(function y=normden(x) , range(-3.5 3.5) ///
droplines(-1.96 -1 0 1 1.96)) ///
, ///
plotregion(margin(zero)) ///
yscale(off) ylabel(, nogrid) ///
xlabel(-3 -1.96 -1 0 1 1.96 3 , format(%4.2f)) ///
xtitle("Standard deviations from mean") ///
xsize(3) ysize(2.3) scale(1.4)
twoway function gives you the opportunity to visualize any mathematical function. The
result has no relation to the actual data in memory. The range() option is necessary; it
defines the x-axis range.
Other examples:
An identity line, to be overlaid in a scatterplot comparing two measurements:
twoway ///
(scatter sbp2 sbp1) ///
(function y=x , range(sbp1))
A parabola:
twoway (function y=x^2 , range(-2 2))
graph matrix
[Figure: scatterplot matrix titled "graph matrix". Variables: Price, Mileage (mpg), Weight (lbs.), Length (in.).]
// c:\dokumenter\...\gph.matrix.do
sysuse auto , clear
set scheme lean1
graph matrix price mpg weight length ///
, ///
title(graph matrix) ///
mlwidth(*0.7) ///
xsize(5) ysize(4)
Matrix scatterplots are useful for analysis, but are infrequently used for publication.
The lean1 and lean2 schemes by default use a small hollow circle as marker in matrix
scatterplots. Here mlwidth(*0.7) made the marker outline thinner.
The upper right cells are redundant rotated images of the lower left cells; omit them by:
graph matrix price mpg weight length , half
14.8. Saving, displaying and printing graphs
Save a graph
The active graph can be saved as a .gph file:
graph save "c:\dokumenter\...\DKsurvival.gph" [, asis replace]
The asis option saves a 'frozen' graph; it is displayed as is, regardless of scheme settings.
Without this option you save a 'live' graph: you may display it again, maybe using a different
scheme or modifying its size. The manual states that you may edit it, but that is not the case.
My firm recommendation:
Rarely save graph files; always save a do-file for each graph with a name that tells what it
does, e.g. gph.DKsurvival.do. Let all graph-defining do-files start with a gph. prefix,
for easy identification. The do-file documents what you did, you can edit it to modify the
graph, and you can modify the do-file to create another graph. Remember to include the data
or a use command reading the data used. This advice also applies when you initially
defined the graph command with a graph dialog.
The scale() option is useful to increase marker and text size, e.g. for a slide show.
xsize() and ysize() modify the size of the graph area (arguments in inches), and
scheme() lets you display a graph under a different scheme – but that sometimes fails.
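The sequence of saving, re-opening and redisplaying a graph could look like the sketch below. The file name mpg_weight.gph and the option values are hypothetical; graph use brings a saved graph back, and graph display redraws the current graph with the options mentioned above.

```stata
// Sketch only: file name and option values are hypothetical.
sysuse auto , clear
twoway (scatter mpg weight)
graph save mpg_weight.gph , replace              // save a 'live' graph
graph use mpg_weight.gph                         // display the saved graph again
graph display , scale(1.5) xsize(4) ysize(3)     // larger text, e.g. for a slide
```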
Copying and printing 'smooth' coloured or gray areas sometimes gives poor results, and a
raster-pattern is preferable. This is a printer issue, not a Stata issue; in this respect modern
printers are worse than older ones. On my old HP LaserJet 1100 printer the LaserJet III printing
mode translates gray areas to raster-patterns, which copy and print nicely. You may need to
experiment.
If you don't want Stata's logo printed on each graph in future:
graph set print logo off
Select Enhanced Metafile (EMF) or Windows Metafile (WMF); which one works best depends
on your system and printer; take a critical look at the results.
This note uses the schemes lean1 and lean2; they are modifications to s1mono and
s2mono. Most scientific journals use a lean graph style – or at least require that graphs
submitted are lean and black-and-white. If you are interested, download and install both
schemes (use the command findit lean schemes).3
The difference between the two is that lean1 has a framed plot area, but no gridlines, while
the opposite is the case for lean2. Section 14.7 includes examples using both schemes.
To select a scheme:
set scheme lean2
To create a scheme with your own preferences use Stata's do-file editor or another text editor
to enter the options you want in your own personal scheme (e.g. myscheme) and save it as
c:\ado\personal\myscheme.scheme. Scheme terminology differs from graph command
terminology; documentation is forthcoming.3
3
Juul S. Lean mainstream schemes for Stata 8 graphics. The Stata Journal 2003; 3: 295-301.
15. Miscellaneous
15.1. Memory considerations [U] 7
In Intercooled Stata a data set can have a maximum of 2,000 variables (Stata/SE: 32,000).
Stata keeps the entire data set in memory, and the number of observations is limited by the
memory allocated. The memory must be allocated before you open (use) a data set.
As described in section 1.2 the initial default memory is defined in the Stata icon. To change
this to 15 MB, right-click the icon, select Properties, and change the path field text to:
c:\stata\wstata.exe /m15.
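The memory allocation can also be changed from within Stata for the current session. A sketch for the Stata version described here; the 15 MB figure just mirrors the example above:

```stata
// Sketch only: applies to Stata versions where memory is allocated manually.
clear                 // memory can only be changed when no data are in memory
set memory 15m        // allocate 15 MB for this session
memory                // report how the memory is used
```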
15.2. String variables [U] 15.4; [U] 26
Throughout this text I have demonstrated the use of numeric variables, but Stata also handles
string (text) variables. It is almost always easier and more flexible to use numeric variables,
but sometimes you might need string variables. String values must be enclosed in quotes:
replace ph=45 if nation == "Danish"
"Danish", "danish", and "DANISH" are different string values.
A string can include any character, including digits; however, number strings are not interpreted
by their numeric value, only as a sequence of characters. Strings are sorted in dictionary
sequence, except that all uppercase letters come before lowercase letters; digits come before letters.
This principle also applies to relations: "12" < "2" < "A" < "AA" < "Z" < "a".
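You can check such relations directly with display; a true relation is shown as 1, a false one as 0. The strings are just the examples from the text:

```stata
display ("12" < "2")              // 1 (true): "1" sorts before "2"
display ("Z" < "a")               // 1 (true): uppercase sorts before lowercase
display ("Danish" == "danish")    // 0 (false): case matters
```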
Numbers to strings
You want the numeric variable cprnum converted to a string variable cprstr:
generate str10 cprstr = string(cprnum , "%10.0f")
You may isolate part of a string variable by the substr function. The arguments are: source
string, start position, length. In the following a3 will be characters 2 to 4 of strvar:
generate str3 a3 = substr(strvar,2,3)
You may substitute characters within a string. In an example above the string variable
cprstr was created from the numeric variable cprnum. However, for persons with a
leading 0 in the CPR number the string will start with a blank, not a 0. This can be
remedied by:
replace cprstr = subinstr(cprstr," ","0",1)
The upper function converts lower case to upper case characters; the lower function does
the opposite. Imagine that ICD-10 codes had been entered inconsistently, the same code
sometimes as E10.1, sometimes as e10.1. These are different strings, and you want them to
be the same (E10.1):
replace icd10 = upper(icd10)
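The string functions can be tried out with display before you use them on a variable. In this sketch the string values are made up for the illustration:

```stata
display substr("E10.1",2,3)                 // characters 2 to 4: 10.
display upper("e10.1")                      // E10.1
display subinstr(" 101501234"," ","0",1)    // first blank replaced: 0101501234
```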
What did we obtain? Two variables: the string variable scode1 with 26 values (A to Z) and
a numeric variable ncode4 (0.0-99.9). Now identify diabetes (E10.0-E14.9) by:
generate diab=0
replace diab=1 if scode1=="E" & ncode4>=10 & ncode4<15
If you received ASCII data, the same result could have been obtained by letting e.g. the infix
command read the same data twice as different types:
infix id 1-4 str5 scode 5-9 str1 scode1 5 ncode4 6-9 ///
using c:\dokumenter\...\list1.txt
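For data already in memory, a letter part and a number part like scode1 and ncode4 could be derived from one string variable with substr and real. This is a sketch under assumptions: icd10 is assumed to hold codes such as "E10.1", and the commands are not necessarily the ones used to create the variables described above.

```stata
// Sketch only: icd10 is assumed to hold codes such as "E10.1".
generate str1 scode1 = substr(icd10,1,1)      // the letter: "E"
generate ncode4 = real(substr(icd10,2,4))     // the number: 10.1
```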
15.3. Dates. Danish CPR numbers
Another option is to enter the date as a string (sbdate) and translate it to a date variable:
infix str10 sbdate 1-10 using c:\dokumenter\p1\datefile.txt
generate bdate = date(sbdate,"dmy") // "dmy" defines sequence
format bdate %dD.N.CY
The date function 'understands' most input formats: 17jan2001, 17/1/2001,
17.1.2001, 17 01 2001, but not 17012001. However todate, a user-written function,
handles this situation; find and download it by: findit todate.
You may extract day, month and year from a date variable (bdate):
generate bday = day(bdate)
gen bmonth = month(bdate)
gen byear = year(bdate)
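Stata stores a date as the number of days since 1 January 1960, so you can calculate with dates as with any other number. A sketch; visitdate is a hypothetical second date variable:

```stata
// Sketch only: visitdate is a hypothetical date variable.
display mdy(1,17,2001)                          // 14992 days after 1 January 1960
generate age = (visitdate - bdate)/365.25       // age in years at the visit
```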
Or you can extract key information from a CPR number read as one string variable (cprstr):
generate bday = real(substr(cprstr,1,2))
gen bmon = real(substr(cprstr,3,2))
gen byear = real(substr(cprstr,5,2))
gen control = real(substr(cprstr,7,4))
gen pos7 = real(substr(cprstr,7,1)) // to find century
Before creating bdate you must decide the century of birth; see the rules below:
generate century = 19
replace century = 20 if pos7 >= 4 & byear <= 36
replace century = 18 if pos7 >= 5 & pos7 <= 8 & byear >= 58
replace byear = 100*century + byear
generate bdate = mdy(bmon,bday,byear)
The information on sex can be extracted from control; the mod function calculates the
remainder after division by 2 (male=1, female=0):
generate sex = mod(control,2)
I developed an ado-file (cprcheck.ado) that extracts birth date and sex information and checks
the validity of a CPR number. Find and download it by:
findit cprcheck
15.4. Random samples, simulations
15.5. Immediate commands [U] 22
An 'immediate' command requires tabular or aggregated input; data in memory are not
affected. The immediate commands tabi, cci, csi, iri and ttesti are mentioned in
section 11, and sampsi (sample size estimation) in section 15.6.
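A few immediate commands in use, as a sketch. All numbers are made up for the illustration; you type the cell counts or summary statistics directly in the command:

```stata
// Sketch only: all numbers are invented.
tabi 10 20 \ 30 40 , chi2       // 2x2 table with chi-square test
cci 10 20 30 40                 // case-control data: odds ratio
csi 10 20 30 40                 // cohort data: risk ratio
ttesti 20 50 14 40 60 10        // n, mean and SD for each of two groups
```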
15.6. Sample size and study power
sampsi [R] sampsi
Sample size and study power estimation are pre-study activities: What are the consequences
of different decisions and assumptions for sample size and study power?
Example: Sample size estimation for comparison of means, unequal SDs and sample sizes:
sampsi 50 60 , sd1(14) sd2(10) ratio(2)
sampsi also handles trials with repeated measurements, see [R] sampsi.
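The same command estimates study power when you state the sample sizes instead. In this sketch the group sizes of 30 and 60 and the power of 0.90 are assumptions chosen for the illustration:

```stata
// Sketch only: group sizes and power are invented.
sampsi 50 60 , sd1(14) sd2(10) n1(30) n2(60)         // power with given sample sizes
sampsi 50 60 , sd1(14) sd2(10) ratio(2) power(0.9)   // sample size for 90% power
```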
15.7. ado-files [U] 20-21, [P] (Programming manual)
An ado-file is a program. Most users will never write programs themselves, but just use
existing programs. If you are a freak, read more in the User's Guide ([U] 20-21) and the
programming manual [P]. Save user-written programs in c:\ado\personal. To see the
locations of all ado-files issue the command sysdir.
The simplest form of an .ado file is a single command or a do-file with a leading program
define command and a terminating end command. There must be a new line after the
terminating end.
Here is an example to demonstrate that creating your own commands is not that impossible.
Just enter datetime in the command window, and the date and time is displayed:
. datetime
9 Feb 2003 16:54:15
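An ado-file behind such a command could be as short as the sketch below, saved as datetime.ado in c:\ado\personal. The program body is my assumption of how it might be written; $S_DATE and $S_TIME are Stata's system macros holding the current date and time.

```stata
program define datetime
    // Sketch only: hypothetical datetime.ado displaying date and time
    display "$S_DATE $S_TIME"
end
```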
Two ado-files useful for the interaction between Stata and NoteTab are shown in appendix 3.
Q is a local macro (see [U] 21.3); foreach defines it as a stand-in for the variables q1 to
q10, and the sequence generates ten tabulate commands. The local macro is in effect
only within the braces {} which must be placed as shown.
When referring to the local macro Q it must be enclosed in single quotes: `Q'. In the
manuals single quotes are shown differently; but the opening quote is ` (accent grave), and
the ending quote the simple '.
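A loop of the kind described could look like this sketch. The variable sex as the second variable in each table is an assumption; q1 to q10 are the variables mentioned above.

```stata
// Sketch only: generates ten tabulate commands, one for each of q1 to q10.
foreach Q of varlist q1-q10 {
    tabulate `Q' sex
}
```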
15.8. Exchange of data with other programs
Beware: Translation between programs may go wrong, and you should check carefully, e.g. by
comparing the output from SPSS' DESCRIPTIVES and Stata's summarize. Especially
compare the number of valid values for each variable and take care with missing values and
date variables.
Frequently used SPSS commands and the similar Stata commands

SPSS command                      Similar Stata command

Data in and out
DATA LIST                         infile; infix; insheet
GET FILE                          use
SAVE OUTFILE                      save

Calculations
COMPUTE                           generate; replace; egen
IF (sex=1) y=2.                   generate y=2 if sex==1
RECODE a (5 thru 9=5) INTO agr.   recode a (5/9=5) , generate(agr)
DO REPEAT ... END REPEAT          for; foreach; forvalues
SELECT IF                         keep if; drop if
TEMPORARY.                        command if sex==1
SELECT IF (sex=1).
SAMPLE 0.1.                       sample 10
SPLIT FILE                        by...:
WEIGHT                            Weights can be included in most commands; see section 7.

Analysis
DESCRIPTIVES                      summarize
FREQUENCIES                       tabulate; tab1
CROSSTABS                         tabulate; tab2
MEANS bmi BY agegrp.              oneway bmi agegrp , tabulate
T-TEST                            ttest
LIST                              list
WRITE                             outfile; outsheet

Advanced
SORT CASES BY                     sort
AGGREGATE                         collapse
ADD FILES                         append
MATCH FILES                       merge
16. Do-file examples
Here follow short examples of do-files doing typical things. Find more examples in Take
good care of your data. All major work should be done with do-files rather than by entering
single commands because:
1. The do-file serves as documentation for what you did.
2. If you discover an error, you can easily correct the do-file and re-run it.
3. You are certain that commands are executed in the sequence intended.
Example 1 generates the first Stata version of the data, and example 2 generates a modified
version. I call both do-files vital in the sense that they document modifications to the data.
Such do-files are part of the documentation and they should be stored safely. Safe storage also
means safe retrieval, and they should have names telling what they do. My principle is this:
In example 1 gen.wine.do generates wine.dta. In example 2 gen.visit12a.do
generates visit12a.dta. This is different from example 3 where no new data are
generated, only output. This do-file is not vital in the same sense as examples 1 and 2, and it
should not have the gen. prefix (the Never Cry Wolf principle).
As mentioned in section 3 I prefer to use NoteTab rather than the do-file editor for creating
do-files. The last command in a do-file must be terminated by a carriage return; otherwise
Stata cannot 'see' the command.
Example 1. gen.wine.do generates Stata data set wine.dta from ASCII file
// gen.wine.do creates wine.dta 13.5.2001
infix id 1-3 type 4 price 5-10 rating 11 ///
using c:\dokumenter\wines\wine.txt
// Add variable labels
label variable id "Identification number"
lab var type "Type of wine"
lab var price "Price per 75 cl bottle"
lab var rating "Quality rating"
// Add value labels
label define type 1 "red" 2 "white" 3 "rosé" 4 "undetermined"
label values type type
lab def rating 1 "poor" 2 "acceptable" 3 "good" 4 "excellent"
lab val rating rating
// Add data set label
label data "wine.dta created from wine.txt, 13.5.2001"
save c:\dokumenter\wines\wine.dta
Example 3. Analyse Stata data
// winedes.do Descriptive analysis of the wine data 14.5.2001
use c:\dokumenter\wines\wine.dta
describe
codebook
summarize
tab1 type rating
tabulate type rating , chi2 exact
oneway price rating , tabulate
Compared to the profile.do suggested in section 1.2, this version adds a time stamp to
the command log file (cmdlog.txt). This means better possibilities to reconstruct previous
work.
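One way to obtain such a time stamp is sketched below. This is not necessarily the author's version: the folder c:\tmp is an assumption, and the sketch writes the time stamp with Stata's file command before the command log is opened.

```stata
// profile.do  -- sketch; the folder c:\tmp is an assumed location
// Append a time stamp line to the command log file before opening it
tempname fh
file open `fh' using c:\tmp\cmdlog.txt, write text append
file write `fh' "* Session started: $S_DATE $S_TIME" _n
file close `fh'
// Open the command log in append mode so that earlier sessions are kept
cmdlog using c:\tmp\cmdlog.txt, append
```

The time stamp line starts with an asterisk, so Stata treats it as a comment if commands from the command log are later re-run as a do-file.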
Appendix 1
The Scandinavian sales agent for Stata and StatTransfer is Metrika (www.metrika.se).
Students and employees at University of Aarhus and Aarhus University Hospital can purchase
Stata at a special discount rate. Other educational institutions may have similar arrangements.
Various local information concerning Stata and other software may be found at:
www.biostat.au.dk/teaching/software.
Appendix 2
EpiData files
If your dataset has the name first, you will work with three files:
first.qes is the definition file where you define variable names and entry fields.
first.rec is the data file in EpiInfo 6 format.
first.chk is the checkfile defining variable labels, legal values and conditional jumps.
Suggested options
Before starting for the first time, set general preferences (File > Options). I recommend:
[Define Data]: You get the EpiData editor where you define variable names, labels, and
formats. If the name of your dataset is first, save the definition file as first.qes:
FIRST.QES My first try with EpiData.
• The first word is the variable name, the following text becomes the variable label.
• ## indicates a two-digit numeric field.
• ##.# indicates a four-digit numeric field with one decimal.
• ___ indicates a three-character string variable.
• <dd/mm/yyyy> indicates a date.
• <today-dmy> indicates an automatic variable: the date of entering the observation.
• Text not preceding a field definition ("1 male 2 female"; "======="; "Page 2")
serves as instructions etc. while entering data.
Variable names can have up to 8 characters a-z (but not æøå) and 0-9; they must start
with a letter. Avoid special characters, also avoid _ (underscore). If you use Stata for
analysis remember that Stata is case-sensitive (always use lowercase variable names).
[Add checks]: You do not have to write the actual code yourself, but may use the menu
system. The information is stored in a checkfile (first.chk) which is structured as below.
* FIRST.CHK                   Good idea to include the checkfile name as a comment.
LABELBLOCK
  LABEL sexlbl                Create the label definition sexlbl. You might give it the
    1 Male                    name sex – but e.g. a label definition n0y1 (0 No; 1 Yes)
    2 Female                  might define a common label for many variables.
  END
END
sex
  COMMENT LEGAL USE sexlbl    Use the sexlbl label definition for sex. Entries other than
  JUMPS                       1, 2, and nothing will be rejected.
    1 bdate                   If you enter 1 for sex, you will jump to the variable
  END                         bdate; see the menu below.
END
The meaning of the Menu dialog box is not obvious at first sight, and I will explain a little:
[Enter data]: You see a data entry form as you defined it; it is straightforward. With the
options suggested the active field shifts colour to yellow, making it easy for you to see where
you are.
As an assurance against typing mistakes you may enter part or all of the data a second time in
a second file and compare the contents of file1 and file2.
[Document] lets you create a codebook, including variable and value labels and checking
rules. The codebook shown below displays structure only, to be compared with your primary
codebook; you also have the option to display information about the data entered.
[Export]: Finally you can export your data to a statistical analysis programme. The .rec file is
in EpiInfo 6 format, and EpiData creates dBase, Excel, Stata, SPSS and SAS files. Variable
and value labels are transferred to Stata, SAS and SPSS files, but not to spreadsheets.
Appendix 3