
Data Analysis with Stata 12

Tutorial

Updated: November 2012


Stata 12: Data Analysis

Table of Contents
Section 1: Introduction
1.1 About this Document
1.2 Documentation
1.3 Accessing Stata
1.4 Getting Help
Section 2: The Example Dataset
Section 3: Descriptive Statistics and Graphs
3.1 Introduction
3.2 Univariate Descriptives
3.3 Graphical Displays
3.4 Bivariate Descriptives
Section 4: Comparing Means (T-Test, ANOVA, ANCOVA)
4.1 Introduction
4.2 One- and Two-Sample T-Tests
4.3 ANOVA
4.4 ANCOVA
Section 5: Linear Regression
5.1 Introduction
5.2 Simple Linear Regression
5.3 Multiple Linear Regression
5.4 Marginal Means
Section 6: Conclusion

The Department of Statistics and Data Sciences, The University of Texas at Austin

Section 1: Introduction
1.1 About this Document

This document is an introduction to using Stata 12 for data analysis. Stata is a software
package popular in the social sciences for manipulating and summarizing data and
conducting statistical analyses. This is the second of two Stata tutorials, both of which are
based on the 12th version of Stata, although most commands discussed can be used in
earlier versions as well.

The following sections provide information on running a variety of statistical tests and
inference procedures. Readers with at least some basic statistical knowledge are best
suited for these tutorials, although we do attempt to explain each process in as much
detail as possible. In this tutorial, we also assume that the reader is familiar with the
Stata interface, importing and exporting files, and running basic data manipulation
commands. If this is not the case, please see our “Getting Started” tutorial before
continuing.

1.2 Documentation

Similar to the SAS statistical software package, Stata can be intimidating to first-time
users who are not familiar with the syntax language. However, Stata 12 has drop-down
menu options for most analytic, graphical, and statistical commands (similar to, but not as
extensive as, SPSS). As tempting as the drop-down menus are, we still recommend that
you become familiar with the Stata syntax as it is more efficient and leads to fewer errors.
However, we do present both options whenever possible.

Among the many reasons why we prefer to use syntax over the drop-down menus is the
extent of support material to turn to when you run into problems with your code. First
and foremost, we recommend using the “help” feature within Stata itself (described in
detail in Section 8 of the “Getting Started” tutorial). Additionally, you can use the
following:

1) Stata manuals (some are available at the PCL for check-out)
2) Stata’s own website, which has a modest number of FAQs in the support section:
http://stata.com/support/faqs/
3) The Department of Statistics and Data Sciences website, which has more answers
to FAQs: https://stat.utexas.edu/software-faqs/stata

1.3 Accessing Stata

If you are a faculty, student, or staff member at the University of Texas at Austin, you
may access Stata 12 in the following ways:


1) License a copy from ITS Software Distribution Services
(http://www.utexas.edu/its/sds).
2) Stata is also available at certain labs around campus, and your department
may also provide it via a server or in a lab room. Check with your advisor
or chair on the availability of Stata in your department.

1.4 Getting Help

If you are a member of UT-Austin, you can schedule an appointment with a statistical
consultant or send e-mail to stat.consulting@austin.utexas.edu. See
stat.utexas.edu/consulting/ for more details about consulting services, as well as answers
to frequently asked questions about Stata and other topics.


Section 2: The Example Dataset


Throughout this document, we will be using a dataset called cars_1993.xls, which was
used in the previous tutorial and contains various characteristics, such as price and miles-
per-gallon, of 92 cars. In order to follow along with the examples, please download this
data by clicking HERE.
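Once the file is downloaded, it can be loaded from the Command window. The lines below are only a minimal sketch: they assume that cars_1993.xls sits in Stata's current working directory and that the first row of the sheet holds the variable names. Neither detail is stated in this tutorial, so adjust the path and options to match your copy of the file.

```stata
* Assumption: the file is in the current working directory and
* row 1 of the sheet contains the variable names
import excel "cars_1993.xls", firstrow clear

* List the variables and their storage types to confirm the import
describe
```

Running describe right after the import is a quick way to see which variables came in as text (strings), which matters for the ANOVA and regression commands later in this tutorial.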

Note that this is also the same example dataset we use in the “SAS: Getting Started”
tutorial, and the file is actually one of the example datasets from SAS, which provides
information about the cars_1993 file and is represented below:

Name: cars_1993

Reference: This represents a subset of the information reported in the 1993
Cars Annual Auto Issue published by Consumer Reports and from Pace New Car
and Truck 1993 Buying Guide.

Description: A random sample of 92 1993 model cars is contained in this data
set. The information for each car includes: manufacturer, model, type (small,
compact, sporty, midsize, large, or van), price (in thousands of dollars), city mpg,
highway mpg, engine size (liters), horsepower, fuel tank size (gallons), weight
(pounds), and origin (US or non-US). The data are excellent for doing descriptive
statistics by groups or an ANOVA or regression with price as the response
variable. Note that violations of the assumptions are probably present and
transformation of the response variable is most likely necessary.

Below is what the file should look like once you download and open it in Excel:


Section 3: Descriptive Statistics and Graphs

3.1 Introduction

Almost all analytic procedures begin with running descriptive statistics on the data.
Doing this familiarizes you with the properties of your dataset, including mean values,
measures of spread, and the frequency of observations for different values of categorical
variables. The following section explores the commands in Stata 12 that summarize data,
both numerically and graphically, for both quantitative and qualitative variables.

3.2 Univariate Descriptives

As seen in the first tutorial, the summarize command will output the mean, standard
deviation, minimum, maximum, and the number of observations for a specified numeric
variable or set of variables:

You can get more detail on those variables by adding the detail option after
the list of variables. The output will contain common percentiles and the variance,
skewness, and kurtosis statistics (related to the second, third, and fourth moments of the
distributions of the variables). Below is the example with the three variables from above.
The output continues past the main window, which you can see by hitting Spacebar or
almost any other key:
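As a minimal sketch of the two forms described above, the commands might look like the following. The choice of Price, CityMPG, and HighwayMPG is an assumption made for illustration; the tutorial does not name the three variables it summarized.

```stata
* Basic summary: observations, mean, standard deviation, min, max
* (the three variables are an illustrative assumption)
summarize Price CityMPG HighwayMPG

* Same variables with percentiles, variance, skewness, and kurtosis
summarize Price CityMPG HighwayMPG, detail
```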


These skewness and kurtosis statistics can be hard to interpret. If you are testing for the
normality of a variable and need a p-value for these measures, use the sktest command,
shown below for the Price variable:
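A sketch of that test, assuming the variable is named Price as elsewhere in this tutorial:

```stata
* Skewness and kurtosis test for normality of Price
sktest Price
```

The output reports separate p-values for skewness and kurtosis along with a joint test.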

From the output, we see that Price is significantly skewed (and we can see it is positively
skewed from the value of 0.99 in the previous output) but the kurtosis is not significant.
Having a significant skewness or kurtosis suggests that a variable is not normally
distributed. You may further confirm this by viewing a histogram of the variable (see
Section 3.3).

These summary statistics can also be run by going to Data > Describe Data >
Summary Statistics… To obtain the detailed output, simply click the “Display
additional statistics” option:

The tabstat command can also output many of the same statistics.
However, you must list each statistic that you want in the output after the command.
If you are using syntax, we recommend summarize, detail because you do not have to
specify each statistic you want.
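For comparison, a tabstat call that requests a set of statistics explicitly could be sketched as follows. The variables and the particular statistics listed are illustrative assumptions, not the authors' own example.

```stata
* Each statistic must be named in the statistics() option
tabstat Price CityMPG, statistics(n mean sd median min max skewness kurtosis)
```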

For categorical variables, the tabulate command will output a frequency table of every
response (as seen below for the Origin variable). You can abbreviate this command with
simply tab:
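A sketch of both forms of the command, assuming the variable is named Origin as in the rest of this tutorial:

```stata
* Frequency table of Origin
tabulate Origin

* Equivalent abbreviated form
tab Origin
```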

We can see that the dataset is roughly split in half in terms of US-made cars versus
foreign-made cars. You can also run the tabulate command by going to Statistics >
Summaries, tables, and tests > Tables.


3.3 Graphical Displays

This section presents how to display a single numeric or categorical variable, as well as a
pair of two variables. You should select the type of graph you want based on the type of
variable or variables you wish to display visually.

For a single numeric variable, you can make a histogram with the hist command (short
for histogram). You can enter the syntax shown in the picture below, or go to
Graphics > Histogram. Unless you specify otherwise, Stata will choose a default
number of bins, which is displayed in the output window:
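A minimal sketch of the histogram syntax follows. The bin count of 10 in the second command is an arbitrary value chosen for illustration, and the normal option simply overlays a normal density curve for visual comparison.

```stata
* Histogram of Price with Stata's default number of bins
histogram Price

* Same histogram with 10 bins (arbitrary choice) and a normal curve overlaid
histogram Price, bin(10) normal
```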

After seeing the Price histogram, you might want to inspect a normal quantile-quantile
plot (QQ-plot), which compares the distribution of the variable to a normal distribution.
You can do this with the following command:

qnorm Price


[Figure: normal quantile plot of Price (y-axis) against Inverse Normal (x-axis)]

The above plot confirms that Price is skewed right (positively skewed) and departs from a
normal distribution. To present this numerically, you can ask Stata for the skewness and
kurtosis statistics, including p-values, as we did in Section 3.2.

Another way to display a continuous variable is with a box plot. Often, researchers want
to compare the distribution of a continuous variable for two or more different groups (for
example, when running an ANOVA procedure). Again, you can produce these with
either syntax or by going to Graphics > Box Plot. Below, we show the boxplots for
vehicle price based on origin (US or non-US):

graph box Price, over(Origin)


We can see from above that US-made cars have less variation in price, with several
expensive outliers. However, the median price of US cars is roughly the same as that of
non-US cars.

Stata 12 has many other ways to graphically display single variables, including pie charts
and bar graphs for categorical variables. For a list of all of these options, go to the
Graphics menu.

For graphically displaying relationships between two variables, go to Graphics >
Two-way Graph… In the example below, we show the syntax and output for a scatterplot of
engine size and horsepower:

twoway (scatter Horsepower EngineSize), ytitle(Horsepower) xtitle(Engine Size)


3.4 Bivariate Descriptives

Stata can also quickly and easily provide bivariate descriptive statistics, such as
correlations, partial correlations, and covariances. All of these can be found in the
Statistics > Summaries, tables, and tests > Summary and descriptive statistics
menu. Below is an example of a correlation matrix for four variables in our cars dataset:
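The three kinds of statistics mentioned above can be sketched in syntax as follows. The four variables are an assumption made for illustration, since the tutorial does not list which ones appear in its matrix.

```stata
* Correlation matrix (the four variables are an illustrative assumption)
correlate Price CityMPG Horsepower EngineSize

* Covariance matrix for the same variables
correlate Price CityMPG Horsepower EngineSize, covariance

* Partial correlations of Price with each other variable,
* holding the remaining listed variables constant
pcorr Price CityMPG Horsepower EngineSize
```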

You can also visually compare the distribution of two continuous variables to see if they
are similar. This could be an important step in many types of analyses, such as ANOVA
and non-parametric comparison tests of two or more groups.


qqplot CityMPG HighwayMPG

[Figure: Quantile-Quantile Plot of CityMPG (y-axis) against HighwayMPG (x-axis)]

From the above plot, we can see that the miles-per-gallon for these cars in the city has
roughly the same shape as on the highway, although there is a “shift,” meaning a
different mean value. You can see this from the very nearly linear pattern of the dots in the
above graph (indicating a similar shape of the distributions of the two variables), and from
how they fall below the line in the graph, which is where they would fall if the distributions
were positioned over the same mean value.


Section 4: Comparing Means (T-Test, ANOVA, ANCOVA)

4.1 Introduction

Now that you know how to run preliminary descriptive statistics on your data, the next
step is inevitably to run statistical tests to determine if your hypotheses are correct or not.
This section describes the procedures in Stata that test the equality of means of a
continuous variable from two or more groups. The remaining sections of this tutorial
dive into more complicated statistical tests.

4.2 One- and Two-Sample T-Tests

A t-test is a useful technique for comparing the mean value of a group against some
hypothesized mean (one-sample) or of two separate sets of numbers against each other
(two-sample). The result of these tests provides you with a statistic which can be used to
determine whether the difference between two means is statistically significant. Two-
sample t-tests can be used either to compare two independent groups (known as an
independent-samples t-test) or to compare observations from two measurement occasions
for the same individuals (a paired comparison t-test).

To conduct a t-test, you must have a continuous variable which is drawn from a normally
distributed population (see the previous section for ways to test this). For the examples
below, you can alternatively use the Statistics > Summaries, tables, and tests >
Classical tests menu.

First, we show an example of a one-sample t-test. Below, we test that the mean price for
domestic cars is $15,000. Note that we can add “if” conditions to the ttest command
(without that option, we would be testing the price for all cars in the dataset):

ttest Price == 15 if Origin == "US"


From this analysis, we see that the mean price of US-made cars is about 18.5 thousand
dollars, which is significantly different from our hypothesized mean of 15 thousand
dollars (p-value = 0.003). Note that Stata also gives a 95% confidence interval of the
mean price of US-made cars by default, and since it does not include our hypothesized
value of 15, it also tells us that we can reject the null hypothesis.

When conducting a two-sample t-test, you must test the assumption of equality of
variances in the two groups that are being compared. If you have more than two groups
that you want to compare, you must use an ANOVA (see next section) and also test that
the variances are equal across all groups.

Below is an example of a two-sample t-test where we test the difference in city miles-per-
gallon between domestic and foreign-made cars. Note that the output of the ttest
command does not include a test of equal variances, so we must run that first ourselves
with the sdtest command:

sdtest CityMPG, by(Origin)

Since the two-tailed p-value is less than 0.05, we must reject the null hypothesis, which in
this case is that the variances are equal. Therefore, we must include the unequal option
at the end of our ttest statement, which will adjust the degrees of freedom used in the
analysis (Satterthwaite calculation) to correct for unequal variances. If our sdtest were
not significant, we would use the command below without the unequal at the end:

ttest CityMPG, by(Origin) unequal


Note that the top of this output reads “with unequal variances,” where it would say “with
equal variances” if we did not include the unequal statement in our command. This is a
good check if you forget to test for equality of variances prior to running your t-test.
From the p-value at the bottom center, we see that there is a significant difference
between the city miles-per-gallon for domestic versus foreign cars. We can also see that
the 95% confidence interval of the difference of the means does not contain zero.

4.3 ANOVA

You can use a one-way ANOVA if you want to test the difference in a continuous,
normally-distributed variable among two or more groups. Similar to t-tests, you must
also test the equality of variances across the groups you compare. Luckily, Stata
automatically tests for this when you use the oneway command, so you do not have to
remember to do that ahead of time.

There are two ways to run a one-way ANOVA in Stata. By using the oneway command,
you will get the automatic test of the equality of variances. If you use the more common
anova command, you will not get the assumption test by default. However, the oneway
test does not output the residual sum of squares, which the anova command does.

Below we test if the weight of cars is equal among all types (compact, midsize, etc.).
You can also use the Statistics > Linear models > ANOVA/MANOVA > Analysis
of variance and covariance menu instead of running the command directly:

oneway Weight type


The output tells us that the variances among the different types of cars are unequal.
However, ANOVAs are somewhat robust against violations of this assumption, and
since the p-value is very close to 0.05, we don’t see a problem with the analysis (and
therefore wouldn’t suggest a non-parametric alternative to ANOVA).

The p-value for the ANOVA is <0.0001, meaning that there is a difference in weight
among the different types of vehicles. In other words, we can reject the null hypothesis
that all types of vehicles have equal mean weights. This does not necessarily mean that
all types have different means from each other, but that there is at least one type that
differs from the rest.

In order to get the marginal means, you must run the anova command. After running
anova Weight type, you can use the margins command to get the marginal means of
weight for each type of vehicle:
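A sketch of that two-step sequence is shown below. It assumes that type is stored as a numeric variable, since anova does not accept string factors.

```stata
* One-way ANOVA of Weight by vehicle type
* (assumes type is a numeric factor variable)
anova Weight type

* Marginal (predicted) mean Weight for each level of type
margins type
```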

To run a two-way ANOVA, which will test differences in two different categorical
variables, you must use the anova command and specify two categorical variables after
the continuous dependent variable. Unlike most other statistical software packages, Stata
will not automatically run this test if the categorical variable or variables are formatted as
text. Therefore, to add our Origin variable to the model, which is coded as “US” or “non-
US,” we must first create a coded numeric variable that corresponds to those two values:

gen OriginDummy = (Origin == "US")

Now that we have “OriginDummy,” we can run the two-way ANOVA (note that by using
the “##” in between our two factors, Stata will include both main effects as well as their
interaction term):
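A sketch of the two-way model, assuming the numeric indicator is the OriginDummy variable created by the gen command above and that type is also numeric:

```stata
* Two-way ANOVA: main effects of type and OriginDummy plus their interaction
anova Weight type##OriginDummy
```

Writing type##OriginDummy is shorthand for listing type, OriginDummy, and type#OriginDummy separately.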

From the above output, we can see that the origin of the car is not significant, and neither
is the interaction between origin and type. However, type is significant (p-value<0.0001),
as well as the overall model, which can be found on the top line of the ANOVA table.

4.4 ANCOVA

Suppose you have a continuous variable that you need to control for within your ANOVA
procedure. Such a model is referred to as an ANCOVA, since you are adding a covariate,
or continuous independent variable, to the model. The way to run an ANCOVA is very
simple, but you must remember one important point: you need to tell Stata that a variable
in your anova statement is continuous or it will treat it as another categorical factor.

You denote continuous independent variables within the anova command by placing a
“c.” in front of them. In the example below, we run the one-way ANOVA where we see
if a car’s weight varies significantly based on what type it is, but while controlling for the
size of its fuel tank:
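One way to write such a model, including the interaction between the factor and the covariate, is sketched below. It assumes the covariate is named FuelTank and that type is numeric; treat it as an illustration rather than the exact command behind the output discussed next.

```stata
* ANCOVA with a factor-by-covariate interaction;
* the c. prefix marks FuelTank as continuous
anova Weight type c.FuelTank type#c.FuelTank
```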


We can see that type remains very significant (p-value < 0.0001) even when we control
for the size of the fuel tank. Note that we tested for the interaction between type and
FuelTank, which we must do whenever we run an ANCOVA. One of the assumptions of
an ANCOVA test is that the effect (slope) of the covariate does not vary among the groups
of the categorical factor or factors. Since the interaction term is not significant
(p-value=0.31), we see that the assumption is not violated.

If the interaction were significant, we would need to use a different approach to analyze
this data, such as a mixed model. However, since it is not significant, we now run the
ANCOVA without the interaction term to get our final result:
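A sketch of the reduced model, under the same naming assumptions (FuelTank as the covariate, type as a numeric factor):

```stata
* Final ANCOVA: type as a factor, FuelTank as a continuous covariate,
* with no interaction term
anova Weight type c.FuelTank
```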


Section 5: Linear Regression

5.1 Introduction

Stata 12 has the capability of running a great variety of different types of regression
models (linear and non-linear, parametric and non-parametric, etc.). This section focuses
on linear regression, both with a single independent variable and with multiple
independent variables.

5.2 Simple Linear Regression

Let us model the linear relationship between engine size of the vehicles and their city
miles-per-gallon. Below is the code for running the linear regression, but you can
alternatively go to Statistics > Linear models and related > Linear regression:
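A sketch of the regression, assuming the variables are named CityMPG and EngineSize as elsewhere in this tutorial. The dependent variable is listed first:

```stata
* Simple linear regression of city miles-per-gallon on engine size
regress CityMPG EngineSize
```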

Stata outputs quite a lot of information for even this simple model. At the top, we see an
ANOVA table of the entire model. To the right of that are some fit statistics, including
the overall F-test corresponding to the ANOVA table and R-squared. The bottom table
presents the estimated coefficients of the independent variable EngineSize and the
intercept, their standard errors, and the t-statistic and associated p-values. Finally, the
table includes a 95% confidence interval for each estimate.

We can interpret these results to say that a vehicle’s engine size is significantly
associated with its city miles-per-gallon. For each additional one-liter increase in engine
size, the vehicle’s predicted city miles-per-gallon decreases by roughly 3.84.


There are many options available for the regress command, which are described under
help regress.

5.3 Multiple Linear Regression

Adding more independent variables into your linear model is as simple as listing them in
your statement (or adding them in the window if you are using the drop-down menus).
For example, let’s also consider the horsepower and origin of the vehicles in estimating
the city miles-per-gallon.

One drawback to Stata is that it does not automatically create a dummy variable (or set of
dummy variables) when you use a categorical independent variable. It will not allow any
string values in a regress command (or most other regression functions or anova
procedures, as we saw in Section 4.3). Therefore, you must create your own numeric
version of any categorical variable you wish to put in the model, which we show with the
following example for Origin:

gen OriginDummy = (Origin == "US")

Now, we can use OriginDummy in our model, but because it represents a categorical
variable, we tell Stata this by including an “i.” in front of it. If this variable had more than
two categories, then Stata would output the estimates of each category with respect to the
reference category (whichever has the lowest code, usually zero) in the bottom table of
the output.
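A sketch of the multiple regression described above, assuming OriginDummy has been created with the gen command shown earlier:

```stata
* Multiple regression: two continuous predictors plus a categorical
* indicator, flagged as a factor with the i. prefix
regress CityMPG EngineSize Horsepower i.OriginDummy
```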

From the output, we can see that while controlling for engine size and horsepower, the
origin of the car does not significantly impact the city miles-per-gallon (p-value = 0.43).


5.4 Marginal Means

A common way to further explore the effect of a categorical independent variable is to
look at the marginal means for each group. This is easily done with the margins
command, which can be run after various types of regression commands and will report
on the most recently estimated model:
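A sketch of that call, assuming the most recent estimation is the multiple regression from Section 5.3 with i.OriginDummy in the model:

```stata
* Predicted mean CityMPG for each level of OriginDummy, averaging
* over the observed values of the other predictors
margins OriginDummy
```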

Following what we saw in the original regression output, the city miles-per-gallon do not
significantly differ between US and foreign cars. This is consistent with the fact that the
95% confidence intervals for the two groups overlap with each other.

You can also get the marginal means for continuous independent variables. Although
this is usually not very useful in regular linear regressions, it can be in nonlinear
regression models, such as logistic, and the command is the same regardless of the type
of model you run:

margins , at(Horsepower=(120(10)220)) vsquish

This command will output the marginal mean city miles-per-gallon of cars with values of
Horsepower between 120 and 220, in increments of 10. The vsquish option just
suppresses the empty space in the output and makes it easier to read:


The top portion of the output specifies the values of Horsepower at which the predicted
means are being calculated and the bottom table contains the actual estimates at each of
those intervals. You can see that for each increase of 10 in horsepower, the mean city
miles-per-gallon decreases by about 0.39, which is equal to 10 times the coefficient
estimate in the original regression output.
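As a rough check of that arithmetic, the stored coefficient can be scaled directly. This sketch assumes the multiple regression from Section 5.3 is still the most recent estimation in memory (running margins without the post option does not replace it):

```stata
* _b[Horsepower] is the Horsepower coefficient from the last regress;
* multiplying by 10 gives the predicted change per 10 horsepower
display 10 * _b[Horsepower]
```

The displayed value should match the step between adjacent rows of the margins table.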


Section 6: Conclusion
Hopefully this tutorial has taught you how to run common statistical procedures with
Stata 12 and what options are available to test assumptions and make interpretations
easier to understand. Stata has the capability of running more complex models, including
multilevel models, which are described in our “Multilevel Modeling Tutorial.”

If you have any questions on the material presented here, or about other procedures in
Stata that might be more appropriate for your data, please feel free to contact us at
stat.consulting@austin.utexas.edu. If you have a question about Stata or other statistical
software packages, feel free to set up an appointment with one of our consultants by
visiting http://stat.utexas.edu/consulting/free-consulting.
