Biostatistics in Public Health Using STATA (Introduction)
Erick L. Suárez
Cynthia M. Pérez
Graciela M. Nogueras
Camille Moreno-Gorrín
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2016 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
To our loved ones
Contents
Preface
Acknowledgments
Authors
1 Basic Commands
  1.1 Introduction
  1.2 Entering Stata
  1.3 Taskbar
  1.4 Help
  1.5 Stata Working Directories
  1.6 Reading a Data File
  1.7 insheet Procedure
  1.8 Types of Files
  1.9 Data Editor
2 Data Description
  2.1 Most Useful Commands
  2.2 list Command
  2.3 Mathematical and Logical Operators
  2.4 generate Command
  2.5 recode Command
  2.6 drop Command
  2.7 replace Command
  2.8 label Command
  2.9 summarize Command
  2.10 do-file Editor
  2.11 Descriptive Statistics and Graphs
  2.12 tabulate Command
3 Graph Construction
  3.1 Introduction
  3.2 Box Plot
  3.3 Histogram
  3.4 Bar Chart
Erick L. Suárez
University of Puerto Rico
Cynthia M. Pérez
University of Puerto Rico
Graciela M. Nogueras
MD Anderson Cancer Center
Camille Moreno-Gorrín
University of Puerto Rico
Acknowledgments
Authors
Chapter 1
Basic Commands
1.1 Introduction
Stata is a computer program designed to perform a wide range of statistical procedures.
The basic procedures include the calculation of summary measures, the construction of
graphs, and the description of frequency distributions using contingency tables. Stata can
also estimate the parameters of generalized linear models and survival analysis models,
using either uncorrelated or correlated data, and it can perform arithmetic operations on
matrices. Its ability to import and export databases in the Excel format adds to its
versatility. The program is regularly used in biostatistics courses in schools of public
health in different countries, and it is often cited as one of the main programs used for
statistical analysis in scientific publications related to public health research.
This chapter will provide an introduction to the Stata program, version 14.0.
We assume that readers of this book have a basic knowledge of both biostatistics
and epidemiology.
1.3 Taskbar
The taskbar provides access to the menus common to Windows-based programs, such as
File, Edit, Data, Graphics, and Statistics; these options are found in the upper part
of the main window. The most frequently used icon is the Data Editor icon, with which
it is possible to enter values and identify the variables in a given project. The Graphics
option provides access to the windows used to generate different types of graphs. The
Statistics option allows the user to carry out statistical procedures through menus that
execute the corresponding commands. Below the taskbar are icons that allow the user to
open, save, and print files, along with icons that facilitate the viewing of graphs (Figure 1.2).
1.4 Help
One of the most useful features of Stata is its help system, which allows the
user to find commands and their syntax according to specific needs. Help can be
accessed by clicking on the “New Viewer” icon on the toolbar or by typing either
help or the letter h in the command area, followed by a keyword that represents
the topic about which the user requires more information (see Figure 1.3).
For example, to obtain help on analysis of variance, type
help anova
or
h anova
Upon entering those commands, a specific window for ANOVA will appear (see
Figure 1.4).
1.5 Stata Working Directories
It is important to keep working files in a directory other than the default
directory that Stata assigns, because files located in the default directory may be
removed during regular program updates.
To create a particular directory, use the mkdir command; the cd command is then
used to navigate to that directory. The sequence of commands to create a directory is
as follows:
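A minimal sketch of such a sequence is given here. The directory name students is only an illustration, and the directory is assumed to be created under the current working directory:

```stata
* Show the current working directory
pwd
* Create a new directory (the name "students" is an assumption for illustration)
mkdir "students"
* Move into the new directory and confirm the change
cd "students"
pwd
```

Note that mkdir stops with an error if a directory of that name already exists.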
To use Stata in the new working directory, you need to restart the program
and immediately move to the desired directory. For example, assuming that the
working directory is named “students” and that it is located in your computer’s
Documents folder, the following command will take you to that folder:
cd "/Users/Documents/students"
For the latter, on the other hand, it is necessary to click the Open icon and
browse to the folder that contains the working file. The describe command can be
used to view the information contained in the data file, including the number of
observations, the number of variables, and the file size, as shown below (assuming
that the active database contains the anthropometric measurements of 10 subjects):
describe
Output
. describe
Contains data
  obs:            10
 vars:             5
 size:           200
-----------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------------------
var1            float   %9.0g
var2            float   %9.0g
var3            float   %9.0g
var4            float   %9.0g
var5            float   %9.0g
-----------------------------------------------------------------------------
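As an illustrative sketch, the same kind of summary can be obtained for any database in memory. The example below uses the auto dataset that ships with Stata; the file name students.dta in the last comment is hypothetical:

```stata
* Load one of the example datasets shipped with Stata, replacing data in memory
sysuse auto, clear
* Summarize the contents of the dataset (observations, variables, storage types)
describe
* A file saved in the working directory would be opened in the same way, e.g.:
* use "students.dta", clear
```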
The clear option placed after the comma (above) removes any database that is
already in memory. Stata does not read a new database while another one is
open. The clear command can also be used on its own to remove the database in
memory, thereby clearing the way to use a new one.
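The following sketch illustrates this behavior under stated assumptions: it first writes a small comma-separated file from the auto dataset shipped with Stata (the file name auto_subset.csv is hypothetical) and then reads it back with insheet. In Stata 13 and later, import delimited supersedes insheet, although insheet continues to work.

```stata
* Start from an example dataset shipped with Stata
sysuse auto, clear
* Write three variables to a comma-separated text file in the working directory
outsheet make price mpg using "auto_subset.csv", comma replace
* Read the text file back; clear discards the data currently in memory
insheet using "auto_subset.csv", comma clear
describe
```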
.ado programs
.gph graphs
1.9 Data Editor
To access the Data Editor window (Figure 1.5), click the “Edit” icon on
the taskbar located in the main window.
At the beginning of the data entry process, the program automatically assigns a
name to the column that defines each variable (var1, var2, …, vark). This name can
be changed in the Variables Manager window after clicking the Data Editor icon,
using the box “Name” (Figure 1.6). To return to the main window of Stata, you
close or minimize the Data Editor window.
Constructing a user-friendly database requires that each variable be named in
such a way as to be easy to identify. A descriptive label can be added using the “Label”
box in the Properties window. When building a database, the values of a variable can
be represented by codes. The coding of the variables can be done using the “Value
Label” option. With this option, numerical codes can stand for alphanumeric categories,
thereby allowing better management of the database. This coding can be done in the
Variables Manager window. The steps to do this are as follows:
1. Click “Manage” in the Variables Manager window, and a new window appears
(Figure 1.7). Then click “Create Label” to assign each code a label.
2. After creating the value labels, return to the Variables Manager window, in
which you will be able to assign labels to each variable in the “Label” box (if they
were not assigned previously in the Properties window) (Figure 1.8).
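Variable names and variable labels can also be set from the command area. The sketch below assumes two columns with the default names var1 and var2; the new names and label texts are illustrative choices, not the authors' own:

```stata
clear
* Enter two observations under the default names var1 and var2
input var1 var2
1 28
2 32
end
* Rename the columns (the new names are illustrative)
rename var1 id
rename var2 age
* Attach a descriptive label to each variable
label variable id "Subject identifier"
label variable age "Age in years"
describe
```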
To continue working in Stata after having created a database, the user needs to
ensure that the data have been saved. To that end, the user will need to assign a
name to the file to continue working on the database. Clicking on “File” (on the
toolbar) followed by “Save As” (on the subsequent dropdown menu) begins this
process. After that, select the working folder or directory and assign a name to the
database. The default file extension is .dta.
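Saving can also be done from the command area. In this sketch the file name mydata.dta is hypothetical, and the auto dataset shipped with Stata stands in for the user's database:

```stata
* Load an example dataset so that there is something to save
sysuse auto, clear
* Save in the current working directory; replace overwrites a file of the same name
save "mydata.dta", replace
* Reopen the saved file later
use "mydata.dta", clear
```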
Chapter 2
Data Description
Output
. list in 5/10
     +----------------------------------+
     | var1   var2   var3   var4   var5 |
     |----------------------------------|
  5. |    5     45     56   1.52      1 |
  6. |    6     36     87   1.46      1 |
  7. |    7     30     78   1.44      1 |
  8. |    8     29     77   1.56      1 |
  9. |    9     27     67   1.52      0 |
     |----------------------------------|
 10. |   10     29     63   1.52      1 |
     +----------------------------------+
Symbol Definition
Usually, these operators are used with the if qualifier to select observations according to
the values of specific variables. For example, to display only those observations in which
the age is below 30, the command line is as follows:
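A command of the following form would produce such a listing. This is a sketch that assumes the 10-subject database, with the variables id, age, weikg, and heimt, is in memory; the second command is an added illustration of combining conditions:

```stata
* Assumes the 10-subject database with id, age, weikg, and heimt is in memory
* List only the subjects younger than 30 years
list id age weikg heimt if age < 30
* Conditions can be combined with & (and) and | (or); == tests equality
list id age if age >= 25 & age < 30
```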
Output
     +--------------------------+
     | id   age   weikg   heimt |
     |--------------------------|
  1. |  1    28      59    1.55 |
  3. |  3    25      76     1.6 |
  4. |  4    26      65    1.78 |
  8. |  8    29      77    1.56 |
  9. |  9    27      67    1.52 |
     |--------------------------|
 10. | 10    29      63    1.52 |
     +--------------------------+
The asterisk symbol (*) at the beginning of a line is also used to insert a comment in
Stata programming; for example:
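In the sketch below, every line that begins with an asterisk is a comment that Stata ignores. The data are the measurements of the 10 subjects used in this chapter; entering them with the input command is an assumption made for illustration:

```stata
* Enter the anthropometric measurements of the 10 subjects
clear
input id age weikg heimt sex
 1 28 59 1.55 0
 2 32 35 1.35 0
 3 25 76 1.60 0
 4 26 65 1.78 0
 5 45 56 1.52 1
 6 36 87 1.46 1
 7 30 78 1.44 1
 8 29 77 1.56 1
 9 27 67 1.52 0
10 29 63 1.52 1
end
* Display all observations
list
```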
     +--------------------------------+
     | id   age   weikg   heimt   sex |
     |--------------------------------|
  1. |  1    28      59    1.55     0 |
  2. |  2    32      35    1.35     0 |
  3. |  3    25      76     1.6     0 |
  4. |  4    26      65    1.78     0 |
  5. |  5    45      56    1.52     1 |
     |--------------------------------|
  6. |  6    36      87    1.46     1 |
  7. |  7    30      78    1.44     1 |
  8. |  8    29      77    1.56     1 |
  9. |  9    27      67    1.52     0 |
 10. | 10    29      63    1.52     1 |
     +--------------------------------+
To compute and display the body mass index (bmi) of each participant, the following
commands are executed:
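A sketch of such commands follows. It assumes that bmi is defined as weight in kilograms divided by the square of height in meters, which agrees with the values listed in this section (for example, 59/1.55² ≈ 24.56 for the first subject):

```stata
* Assumes weikg (weight in kg) and heimt (height in m) are in memory
* Body mass index = weight / height squared
generate bmi = weikg / heimt^2
* Display the new variable next to the subject identifier
list id bmi
```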
After using the list command, you can see that a new variable, named bmi, has been
created:
Output
     +---------------+
     | id        bmi |
     |---------------|
  1. |  1   24.55775 |
  2. |  2   19.20439 |
  3. |  3    29.6875 |
  4. |  4   20.51509 |
  5. |  5   24.23823 |
     |---------------|
  6. |  6   40.81441 |
  7. |  7   37.61574 |
  8. |  8   31.64037 |
  9. |  9   28.99931 |
 10. | 10   27.26801 |
     +---------------+
gen bmig=bmi
recode bmig 18.5/24.9=1 25/29.9=2 30/max=3
list id bmig
Output
     +-----------+
     | id   bmig |
     |-----------|
  1. |  1      1 |
  2. |  2      1 |
  3. |  3      2 |
  4. |  4      1 |
  5. |  5      1 |
     |-----------|
  6. |  6      3 |
  7. |  7      3 |
  8. |  8      3 |
  9. |  9      2 |
 10. | 10      2 |
     +-----------+
drop bmi
After the list command, the results will be the same as that reported with the replace
command.
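As a sketch of how generate and replace can build categories comparable to those produced by recode, consider the following. It must be run while bmi is still in memory; the variable name bmicat is hypothetical, and half-open intervals are used here so that no value falls between categories:

```stata
* Assumes bmi is in memory (run before bmi is dropped)
generate bmicat = .
replace bmicat = 1 if bmi >= 18.5 & bmi < 25
replace bmicat = 2 if bmi >= 25 & bmi < 30
* Missing values are larger than any number in Stata, so exclude them explicitly
replace bmicat = 3 if bmi >= 30 & !missing(bmi)
list id bmicat
```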
In addition, the label command gives descriptive names to the categories of the variables
by combining the label define and label values commands. The label define command is
used to create a set of labels, one for each code. Then, the label values command
is used to relate the categories of a variable to the labels defined with the label define
command. For example, the command lines that are used to label the codes of the
variables sex and bmig are as follows:
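A sketch of such command lines is shown below. The label names sexlbl and bmilbl are hypothetical, and the coding (0 = Male, 1 = Female; 1 = Normal, 2 = Overweight, 3 = Obese) is assumed from the listings in this chapter:

```stata
* Assumes sex (coded 0/1) and bmig (coded 1/2/3) are in memory
* Create the sets of labels, one label per code
label define sexlbl 0 "Male" 1 "Female"
label define bmilbl 1 "Normal" 2 "Overweight" 3 "Obese"
* Relate each variable to its set of labels
label values sex sexlbl
label values bmig bmilbl
list id sex bmig
```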
After using the list command, the following output will be displayed:
     +----------------------------+
     | id        sex         bmig |
     |----------------------------|
  1. |  1       Male   Overweight |
  2. |  2       Male       Normal |
  3. |  3       Male   Overweight |
  4. |  4       Male       Normal |
  5. |  5     Female       Normal |
     |----------------------------|
  6. |  6     Female        Obese |
  7. |  7     Female        Obese |
  8. |  8     Female        Obese |
  9. |  9       Male   Overweight |
 10. | 10     Female   Overweight |
     +----------------------------+
If you want to eliminate a label that was previously assigned to a variable, the label
drop command must be used, as follows:
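A small self-contained sketch of this step follows; the label name sexlbl and the two-observation dataset are assumptions made for illustration:

```stata
clear
input sex
0
1
end
label define sexlbl 0 "Male" 1 "Female"
label values sex sexlbl
* Detach the value label from the variable, then delete its definition
label values sex .
label drop sexlbl
* Check which value labels remain defined
label list
```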
Output
Output
The detail option can be written at the end of the command line, after a comma, to
obtain more detailed information about quantitative variables in the database.
For example, to obtain detailed information on the distribution of the variable bmi,
the following command line can be used:
sum bmi, detail
Output
                             bmi
---------------------------------------------------------------
      Percentiles      Smallest
 1%     19.20439       19.20439
 5%     19.20439       20.51509
10%     19.85974       24.23823       Obs                  10
25%     24.23823       24.55775       Sum of Wgt.          10
distribution. Based on the interquartile range (iqr), the output indicates that the middle
50% of the bmi values span a range of no more than 7.4 units.
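The interquartile range can also be computed directly from the results that summarize stores after it runs. This is a sketch that assumes the variable bmi is in memory:

```stata
* Assumes bmi is in memory
quietly summarize bmi, detail
* r(p25) and r(p75) hold the 25th and 75th percentiles
display "IQR = " r(p75) - r(p25)
```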
tab bmig
Output
In this example, 30% of the study group was categorized as being obese and 40%
as being overweight.
The tab command can be used to report contingency tables that, in turn, can be
used to report the frequency distribution, with the option of including percentages
by column and row. For example, to describe the association between the variables
bmig and sex (see the previous database), use the tab command, as follows:
tab bmig sex, co
Output
+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

           |          sex
      bmig |      Male     Female |     Total
-----------+----------------------+----------
    Normal |         2          1 |         3
           |     40.00      20.00 |     30.00
-----------+----------------------+----------
Overweight |         3          1 |         4
           |     60.00      20.00 |     40.00
-----------+----------------------+----------
     Obese |         0          3 |         3
           |      0.00      60.00 |     30.00
-----------+----------------------+----------
     Total |         5          5 |        10
           |    100.00     100.00 |    100.00
The results show that 80% of women are categorized as being either overweight
or obese, while 60% of men are categorized as being overweight, with none being
categorized as being obese. Only 30% of the subjects (both sexes) are categorized as
being of normal weight.
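Row percentages can be requested in the same way. The sketch below assumes the labeled variables bmig and sex are in memory and shows a one-way table followed by a two-way table with both column and row percentages:

```stata
* Assumes bmig and sex are in memory
* One-way frequency table of the bmi categories
tabulate bmig
* Two-way table with both column and row percentages
tabulate bmig sex, column row
```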