
Short Guides to Microeconometrics Fall 2010

Universitat Pompeu Fabra Kurt Schmidheiny

The Bootstrap

1 Introduction

The bootstrap is a method to derive properties (standard errors, confidence intervals and critical values) of the sampling distribution of estimators. It is very similar to Monte Carlo techniques (see the corresponding handout). However, instead of fully specifying the data generating process (DGP), we use information from the sample. In short, the bootstrap takes the sample (the values of the independent and dependent variables) as the population and the estimates from the sample as true values. Instead of drawing from a specified distribution (such as the normal) by a random number generator, the bootstrap draws with replacement from the sample. It therefore takes the empirical distribution function (the step function) as the true distribution function. In the example of a linear regression model, the sample provides the empirical distribution for the dependent variable, the independent variables and the error term, as well as values for the constant, the slope and the error variance. The great advantage compared to Monte Carlo methods is that we make assumptions neither about the distributions nor about the true values of the parameters.

The bootstrap is typically used for consistent but biased estimators. In most cases we know the asymptotic properties of these estimators, so we could use asymptotic theory to derive the approximate sampling distribution. That is what we usually do when using, for example, maximum likelihood estimators. The bootstrap is an alternative way to produce approximations for the true small sample properties. So why (or when) would we use the bootstrap? There are two main reasons:

1a) The asymptotic sampling distribution is very difficult to derive.

1b) The asymptotic sampling distribution is too difficult to derive for me. This might apply to many multi-stage estimators. Example: the two-stage estimator of the Heckman sample selection model.

1c) The asymptotic sampling distribution is too time-consuming and error-prone for me. This might apply to forecasts or statistics that are (nonlinear) functions of the estimated model parameters. Example: elasticities calculated from slope coefficients.

2) The bootstrap produces better approximations for some properties. It can be shown that bootstrap approximations converge faster for certain statistics¹ than the approximations based on asymptotic theory. These bootstrap approximations are called asymptotic refinements. Example: the t-statistic of a mean or a slope coefficient.

Note that both asymptotic theory and the bootstrap only provide approximations for finite sample properties. The bootstrap produces consistent approximations of the sampling distribution for a variety of estimators, such as the mean, the median, and the coefficients in OLS and most econometric models. However, there are estimators (e.g. the maximum) for which the bootstrap fails to produce consistent properties.

This handout covers the nonparametric bootstrap with paired sampling. This method is appropriate for randomly sampled cross-section data. Data from complex random sampling procedures (e.g. stratified sampling) require special attention; see the handout on Clustering. Time-series data and panel data also require more sophisticated bootstrap techniques.

¹ These statistics are called asymptotically pivotal, i.e. their asymptotic distributions are independent of the data and of the true parameter values. This applies, for example, to all statistics with the standard normal or chi-squared as limiting distribution.

Version: 25-10-2010, 20:15
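The core resampling idea described in the introduction can be sketched in a few lines of code (Python is used here purely for illustration; the handout's own examples, further below, are in Stata, and all names in this sketch are ours):

```python
import random

def bootstrap_sample(x, y):
    """Draw one paired bootstrap sample: N pairs (x_i, y_i) drawn with replacement."""
    n = len(y)
    idx = [random.randrange(n) for _ in range(n)]  # observation indices, with replacement
    return [x[i] for i in idx], [y[i] for i in idx]

random.seed(42)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
xb, yb = bootstrap_sample(x, y)
# (xb, yb) has the same number of observations as the original sample;
# some (x_i, y_i) pairs appear several times and others never.
```

Note that the pairs (x_i, y_i) are kept together when resampling; this is the "paired" sampling that makes the method appropriate for randomly sampled cross-section data.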

2 The Method: Nonparametric Bootstrap

2.1 Bootstrap Samples

Consider a sample with n = 1, ..., N independent observations of a dependent variable y and K + 1 explanatory variables x. A paired bootstrap sample is obtained by independently drawing N pairs (xi, yi) from the observed sample with replacement. The bootstrap sample has the same number of observations; however, some observations appear several times and others never. The bootstrap involves drawing a large number B of bootstrap samples. An individual bootstrap sample is denoted (x*_b, y*_b), where x*_b is an N × (K + 1) matrix and y*_b an N-dimensional column vector of the data in the b-th bootstrap sample.

2.2 Bootstrap Standard Errors

The empirical standard deviation of a series of bootstrap replications of θ̂ can be used to approximate the standard error se(θ̂) of an estimator θ̂.

1. Draw B independent bootstrap samples (x*_b, y*_b) of size N from (x, y). Usually B = 100 replications are sufficient.

2. Estimate the parameter of interest for each bootstrap sample: θ̂*_b for b = 1, ..., B.

3. Estimate se(θ̂) by

   se*(θ̂) = sqrt[ 1/(B-1) · Σ_{b=1}^{B} (θ̂*_b - θ̄*)² ],  where θ̄* = 1/B · Σ_{b=1}^{B} θ̂*_b.

The whole covariance matrix V(θ̂) of a vector θ̂ is estimated analogously. In case the estimator is consistent and asymptotically normally distributed, bootstrap standard errors can be used to construct approximate confidence intervals and to perform asymptotic tests based on the normal distribution.

2.3 Confidence Intervals Based on Bootstrap Percentiles

We can construct a two-sided equal-tailed (1-α) confidence interval for an estimate θ̂ from the empirical distribution function of a series of bootstrap replications. The (α/2) and the (1-α/2) empirical percentiles of the bootstrap replications are used as lower and upper confidence bounds. This procedure is called the percentile bootstrap.

1. Draw B independent bootstrap samples (x*_b, y*_b) of size N from (x, y). It is recommended to use B = 1000 or more replications.

2. Estimate the parameter of interest for each bootstrap sample: θ̂*_b for b = 1, ..., B.

3. Order the bootstrap replications of θ̂ such that θ̂*_1 ≤ ... ≤ θ̂*_B. The lower and upper confidence bounds are the B·α/2-th and B·(1-α/2)-th ordered elements, respectively. For B = 1000 and α = 5% these are the 25th and 975th ordered elements. The estimated (1-α) confidence interval of θ̂ is [θ̂*_{B·α/2}, θ̂*_{B·(1-α/2)}].

Note that these confidence intervals are in general not symmetric.

2.4 Bootstrap Hypothesis Tests

The approximate confidence interval in section 2.3 can be used to perform an approximate two-sided test of a null hypothesis of the form H0: θ = θ0. The null hypothesis is rejected at the significance level α if θ0 lies outside the two-tailed (1-α) confidence interval.
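As a complement to the Stata examples later in the handout, the procedures of sections 2.2 and 2.3 can be sketched in Python (an illustration under assumptions of ours, using the sample mean as the estimator θ̂; all names are ours):

```python
import random
import statistics

def bootstrap_replications(data, estimator, B):
    """Steps 1-2: draw B bootstrap samples and estimate the statistic on each."""
    n = len(data)
    reps = []
    for _ in range(B):
        sample = [data[random.randrange(n)] for _ in range(n)]  # with replacement
        reps.append(estimator(sample))
    return reps

random.seed(1)
data = [2.3, 1.9, 3.1, 2.8, 2.2, 3.5, 1.7, 2.9, 3.3, 2.4]

# Section 2.2, step 3: empirical standard deviation of the replications
# (usually B = 100 replications are sufficient for standard errors).
se_star = statistics.stdev(bootstrap_replications(data, statistics.mean, B=100))

# Section 2.3: percentile confidence interval (B = 1000 or more recommended).
B, alpha = 1000, 0.05
reps = sorted(bootstrap_replications(data, statistics.mean, B=B))
lower = reps[int(B * alpha / 2) - 1]        # 25th ordered element
upper = reps[int(B * (1 - alpha / 2)) - 1]  # 975th ordered element
```

Note that `statistics.stdev` uses the 1/(B-1) convention of step 3, and that the resulting percentile interval [lower, upper] need not be symmetric around the sample estimate.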

2.5 The bootstrap-t

Assume that we have consistent estimates of θ and se(θ̂) at hand and that the asymptotic distribution of the t-statistic is the standard normal:

   t = (θ̂ - θ0) / se(θ̂)  →d  N(0, 1).

Then we can calculate approximate critical values from percentiles of the empirical distribution of a series of bootstrap replications of the t-statistic.

1. Consistently estimate θ and se(θ̂) using the observed sample: θ̂, se(θ̂).

2. Draw B independent bootstrap samples (x*_b, y*_b) of size N from (x, y). It is recommended to use B = 1000 or more replications.

3. Estimate the t-value assuming θ0 = θ̂ for each bootstrap sample:

   t*_b = (θ̂*_b - θ̂) / se*_b(θ̂)  for b = 1, ..., B

where θ̂*_b and se*_b(θ̂) are estimates of the parameter and its standard error using the bootstrap sample.

4. Order the bootstrap replications of t such that t*_1 ≤ ... ≤ t*_B. The lower and the upper critical values are then the B·α/2-th and B·(1-α/2)-th elements, respectively. For B = 1000 and α = 5% these are the 25th and 975th ordered elements:

   t*_{α/2} = t*_{B·α/2},  t*_{1-α/2} = t*_{B·(1-α/2)}

4'. Alternatively, order the bootstrap replications of t such that |t*|_1 ≤ ... ≤ |t*|_B. The absolute critical value is then the B·(1-α)-th element. For B = 1000 and α = 5% this is the 950th ordered element. The lower and upper critical values are, respectively:

   t*_{α/2} = -|t*|_{B·(1-α)},  t*_{1-α/2} = |t*|_{B·(1-α)}

The symmetric bootstrap-t is the preferred method for bootstrap hypothesis testing, as it makes use of the faster convergence of t-statistics relative to asymptotic approximations (i.e. critical values from the t- or standard normal tables). The bootstrap-t procedure can also be used to create confidence intervals using bootstrap critical values instead of the ones from the standard normal tables:

   [θ̂ + t*_{α/2} · se(θ̂), θ̂ + t*_{1-α/2} · se(θ̂)]

The confidence interval from the bootstrap-t is not necessarily better than the percentile method; however, it is consistent with bootstrap-t hypothesis testing.
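The bootstrap-t steps above can be sketched in Python (again an illustration under assumptions of ours, with the sample mean as θ̂, so that se*_b is the usual standard error of a mean; all names are ours):

```python
import math
import random
import statistics

def t_replications(data, B):
    """Step 3: t*_b = (theta*_b - theta_hat) / se*_b for each bootstrap sample."""
    n = len(data)
    theta_hat = statistics.mean(data)
    reps = []
    for _ in range(B):
        s = [data[random.randrange(n)] for _ in range(n)]  # paired sampling
        se_b = statistics.stdev(s) / math.sqrt(n)          # standard error within sample b
        reps.append((statistics.mean(s) - theta_hat) / se_b)
    return reps

random.seed(2)
data = [2.3, 1.9, 3.1, 2.8, 2.2, 3.5, 1.7, 2.9, 3.3, 2.4]
B, alpha = 1000, 0.05

# Step 4: asymmetric critical values from the ordered replications.
t = sorted(t_replications(data, B))
t_lower = t[int(B * alpha / 2) - 1]        # 25th ordered element
t_upper = t[int(B * (1 - alpha / 2)) - 1]  # 975th ordered element

# Step 4': symmetric critical value from the ordered absolute replications.
t_abs = sorted(abs(v) for v in t)
t_sym = t_abs[int(B * (1 - alpha)) - 1]    # 950th ordered element
```

The bootstrap-t confidence interval is then [θ̂ + t_lower·se(θ̂), θ̂ + t_upper·se(θ̂)], with -t_sym and t_sym substituted for the symmetric variant.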
These critical values can now be used in otherwise usual t-tests for θ. Note that the asymmetric lower and upper critical values from step 4 generally differ in absolute value, while the symmetric ones do not.

3 Implementation in Stata 11.0

Stata has very conveniently implemented the bootstrap for cross-section data. Bootstrap sampling and summarizing the results are done automatically by Stata. The Stata commands are shown for the example of a univariate regression of a variable y on x.

Case 1: Bootstrap standard errors are implemented as an option in the Stata command

Many Stata estimation commands such as regress have a built-in vce option to calculate bootstrap covariance estimates. For example,

   regress y x, vce(bootstrap, reps(100))

runs B = 100 bootstrap iterations of a linear regression and reports bootstrap standard errors along with confidence intervals and p-values based on the normal approximation and bootstrap standard errors. The postestimation command

   regress y x, vce(bootstrap, reps(1000))
   estat bootstrap, percentile

reports confidence bounds based on bootstrap percentiles rather than the normal approximation. Remember that it is recommended to use at least B = 1000 replications for bootstrap percentiles. The percentiles to be reported are defined with the confidence level option. For example, the 0.5% and 99.5% percentiles that create the 99% confidence interval are reported by

   regress y x, vce(bootstrap, reps(1000)) level(99)
   estat bootstrap, percentile

Case 2: The statistic of interest is returned by a single Stata command

The command

   bootstrap, reps(100): reg y x

runs B = 100 bootstrap iterations of a linear regression and reports bootstrap standard errors along with confidence intervals and p-values based on the normal approximation and bootstrap standard errors. The postestimation command estat bootstrap is used to report confidence intervals based on bootstrap percentiles from e.g. B = 1000 replications:

   bootstrap, reps(1000): reg y x
   estat bootstrap, percentile

We can select a specific statistic to be recorded in the bootstrap iterations. For example, the slope coefficient only:

   bootstrap _b[x], reps(100): reg y x

By default, Stata records the whole coefficient vector _b. Any value returned by a Stata command (see ereturn list) can be selected. We can also record functions of returned statistics. For example, the following commands create bootstrap critical values at the 5% significance level of the t-statistic for the slope coefficient:

   reg y x
   scalar b = _b[x]
   bootstrap t=((_b[x]-b)/_se[x]), reps(1000): reg y x, level(95)
   estat bootstrap, percentile

The respective symmetric critical values at the 5% significance level are calculated by

   reg y x
   scalar b = _b[x]
   bootstrap t=abs((_b[x]-b)/_se[x]), reps(1000): reg y x, level(90)
   estat bootstrap, percentile

We can save the bootstrap replications of the selected statistics in a normal Stata .dta file to further investigate the bootstrap sampling distribution. For example,

   bootstrap b=_b[x], reps(1000) saving(bs_b, replace): reg y x
   use bs_b, replace
   histogram b

shows the bootstrap histogram of the sampling distribution of the slope coefficient. Note: it is important that all observations with missing values are dropped from the dataset before using the bootstrap command. Missing values will lead to different bootstrap sample sizes.

Case 3: The statistic of interest is calculated in a series of Stata commands

The first task is to define a program that produces the statistic of interest for a single sample. This program might involve several estimation commands and intermediate results. For example, the following program calculates the t-statistic centered at the original estimate in a regression of y on x:

   program tstat, rclass
     reg y x
     return scalar t = (_b[x]-b)/_se[x]
   end

The last line of the program specifies the value that is investigated in the bootstrap, (_b[x]-b)/_se[x], which will be returned under the name t. The definition of the program can be directly typed into the command window or be part of a do-file. The program should now be tested by typing

   reg y x
   scalar b = _b[x]
   tstat
   return list

The bootstrap is then performed by the Stata commands

   reg y x
   scalar b = _b[x]
   bootstrap t=r(t), reps(1000): tstat
   estat bootstrap, percentile

As in case 2, the bootstrap results can be saved and evaluated manually. For example,

   reg y x
   scalar b = _b[x]
   bootstrap t=r(t), reps(1000) saving(bs_t): tstat
   use bs_t, replace
   centile t, centile(2.5, 97.5)
   gen t_abs = abs(t)
   centile t_abs, centile(95)

reports both asymmetric and symmetric critical values at the 5% significance level for t-tests on the slope coefficient.

See also ...

There is much more about the bootstrap than presented in this handout. Instead of paired resampling there is residual resampling, which is often used in a time-series context. There is also a parametric bootstrap. The bootstrap can also be used to reduce the small-sample bias of an estimator by bias corrections. The m out of n bootstrap is used to overcome some bootstrap failures. A method very similar to the bootstrap is the jackknife.

References

Efron, Bradley and Robert J. Tibshirani (1993), An Introduction to the Bootstrap, Boca Raton: Chapman & Hall. [A fairly advanced but nicely and practically explained comprehensive text by the inventor of the bootstrap.]

Brownstone, David and Robert Valletta (2001), The Bootstrap and Multiple Imputations: Harnessing Increased Computing Power for Improved Statistical Tests, Journal of Economic Perspectives, 15(4), 129-141. [An intuitive plea for the use of the bootstrap.]

Cameron, A. C. and P. K. Trivedi (2005), Microeconometrics: Methods and Applications, Cambridge University Press. Section 7.8 and chapter 11.

Horowitz, Joel L. (1999), The Bootstrap, in: Handbook of Econometrics, Vol. 5. [A very advanced description of the (asymptotic) properties of the bootstrap.]

Wooldridge, Jeffrey M. (2009), Introductory Econometrics: A Modern Approach, 4th ed., South-Western. Appendix 6A. [A first glance at the bootstrap.]
