The Economics and Econometrics of Active Labor Market Programs
James J. Heckman, University of Chicago
Robert J. LaLonde, Michigan State University
and
Jeffrey A. Smith, University of Western Ontario
Prepared for the Handbook of Labor Economics, Volume III, Orley Ashenfelter and
David Card, editors. We thank Susanne Ackum Agell for her helpful comments on Scandinavian active labor market programs and Costas Meghir for very helpful comments on
Sections 1-7.
The Economics and Econometrics of Active Labor Market Programs
Contents
1. Introduction
2. Public Job Training and Active Labor Market Policies
3. The Evaluation Problem and the Parameters of Interest in Evaluating Social Programs
3.1 The Evaluation Problem
3.2 The Counterfactuals of Interest
3.3 The Counterfactuals Most Commonly Estimated in the Literature
3.4 Is Treatment on the Treated an Interesting Economic Parameter?
4. The Prototypical Solutions to the Evaluation Problem
4.1 The Before-After Estimator
4.2 The Difference-in-Differences Estimator
4.3 The Cross Section Estimator
5. Social Experiments
5.1 How Social Experiments Solve the Evaluation Problem
5.2 Intention to Treat and Substitution Bias
5.3 Social Experiments in Practice
5.3.1 Two Important Social Experiments
5.3.2 The Practical Importance of Dropping Out and Substitution
5.3.3 Additional Problems Common to All Evaluations
6. Econometric Models of Outcomes and Program Participation
6.1 Uses of Economic Models
6.2 Prototypical Models of Earnings and Program Participation
6.3 Expected Present Value of Earnings Maximization
6.3.1 Common Treatment Effect
6.3.2 A Separable Representation
6.3.3 Variable Treatment Effect
6.3.4 Imperfect Credit Markets
6.3.5 Training as a Form of Job Search
6.4 The Role of Program Eligibility Rules in Determining Participation
6.5 Administrative Discretion and the Efficiency and Equity of Training Provision
6.6 The Conflict between the Economic Approach to Program Evaluation and the Modern Approach to Social Experiments
7. Non-experimental Evaluations
7.1 The Problem of Causal Inference in Non-experimental Evaluations
7.2 Constructing a Comparison Group
7.3 Econometric Evaluation Estimators
7.4 Identification Assumptions for Cross-Section Estimators
7.4.1 The Method of Matching
7.4.2 Index Sufficient Methods and the Classical Econometric Selection Model
7.4.3 The Method of Instrumental Variables
7.4.4 The Instrumental Variable Estimator as a Matching Estimator
7.4.5 IV Estimators and the Local Average Treatment Effect
7.4.6 Regression Discontinuity Estimators
7.5 Using Aggregate Time Series Data on Cohorts of Participants to Evaluate Programs
7.6 Panel Data Estimators
7.6.1 Analysis of the Common Coefficient Model
7.6.2 The Fixed Effects Method
7.6.3 $U_t$ Follows a First-Order Autoregressive Process
7.6.4 $U_t$ is Covariance Stationary
7.6.5 Repeated Cross-Section Analogs of Longitudinal Procedures
7.6.6 The Fixed Effect Model
7.6.7 The Error Process Follows a First-Order Autoregression
7.6.8 Covariance Stationary Errors
7.6.9 The Anomalous Properties of First Difference or Fixed Effect Models
7.6.10 Robustness of Panel Data Methods in the Presence of Heterogeneous
Responses to Treatment
7.6.11 Panel Data Estimators as Matching Estimators
7.7 Robustness to Biased Sampling Plans
7.7.1 The IV Estimator and Choice-Based Sampling
7.7.2 The IV Estimator and Contamination Bias
7.7.3 Repeated Cross-Section Methods with Unknown Training Status and Choice-Based Sampling
7.8 Bounding and Sensitivity Analysis
8. Econometric Practice
8.1 Data Sources
8.1.1 Using Existing General Survey Data Sets
8.1.2 Using Administrative Data
8.1.3 Collecting New Survey Data
8.1.4 Combining Data Sources
8.2 Characterizing Selection Bias
8.3 A Simulation Study of the Sensitivity of Nonexperimental Methods
8.3.1 A Model of Earnings and Program Participation
8.3.2 The Data Generating Process
8.3.3 The Estimators We Examine
8.3.4 Results from the Simulations
8.4 Specification Testing and the Fallacy of Alignment
9. Indirect Effects, Displacement, and General Equilibrium Treatment Effects
9.1 Review of Traditional Approaches to Displacement and Substitution
9.2 General Equilibrium Approaches
9.2.1 Davidson and Woodbury
9.2.2 Heckman, Lochner, and Taber
9.3 Summary on General Equilibrium Approaches
10. A Survey of Empirical Findings
10.1 The Objectives of Program Evaluations
10.2 The Impact of Government Programs on Labor Market Outcomes
10.3 The Findings from U.S. Social Experiments
10.4 The Findings from Non-experimental Evaluations of U.S. Programs
10.5 The Findings from European Evaluations
11. Conclusions
1 Introduction
Public provision of job training, of wage subsidies and of job search assistance is a feature
of the modern welfare state. These activities are cornerstones of European “active labor
market policies,” and have been a feature of U.S. social welfare policy for more than three
decades. Such policies also have been advocated as a way to soften the shocks administered to the labor markets of former East Bloc and Latin American economies currently in transition to market-based systems.
A central characteristic of the modern welfare state is a demand for “objective” knowledge about the effects of various government tax and transfer programs. Different parties benefit and lose from such programs. Assessments of these benefits and losses often play critical roles in policy decision-making. Recently, interest in evaluation has been elevated as many economies with modern welfare states have floundered, and as the costs of running welfare states have escalated.
This chapter examines the evidence on the effectiveness of welfare state active labor market policies, such as training, job search assistance, and job subsidy policies, and the methods used to obtain the evidence on their effectiveness. Our methodological discussion of alternative approaches to evaluating programs has more general interest. Few U.S. government programs have received such intensive scrutiny, and have been subject to so many different
types of evaluation methodologies as has governmentally-supplied job training. In part, this is because short-run measures of the outcomes of government training programs are more easily obtained and more readily accepted. Outcomes such as earnings, employment, and educational and occupational attainment are all more easily measured than the outcomes of health and public school education programs. In addition, short-run measures of the outcomes of training programs are more closely linked to the “treatment” of training. In public school and health programs, a variety of inputs over the life cycle often give rise to measured outcomes. For these programs, attribution of specific effects to specific causes is more problematic.
A major focus of this chapter is on the general lessons learned from over thirty years
of experience in evaluating government training programs. Most of our lessons come from
American studies because the U.S. government has been much more active in promoting
evaluations than have other governments, and the results from the evaluations are often
used to expand – or contract – government programs. We demonstrate that recent studies
in Europe indicate that the basic patterns and lessons from the American case apply more
generally.
The two relevant empirical questions in this literature are (i) adjusting for their lower skills and abilities, do participants in government employment and training programs benefit from these programs? and (ii) are these programs worthwhile social investments? As currently constituted, these programs are often ineffective on both counts. For most groups of participants, the benefits are modest, and at worst participation in government programs is harmful. Moreover, many programs and initiatives cannot pass a cost-benefit test. Even when programs are cost effective, they are rarely associated with a large-scale improvement in skills. But, at the same time, there is substantial heterogeneity in the impacts of these programs. For some groups these programs appear to generate significant benefits both to the participants and to society.
We believe that there are two reasons why the private and social gains from these programs are generally small. First, the per-capita expenditures on participants are usually small relative to the deficits that these programs are being asked to address. In order for such interventions to generate large gains, they would have to be associated with very large internal rates of return. Moreover, these returns would have to be larger than those estimated for private sector training (Mincer, 1993). Another reason that the gains from these programs are generally low is that these services are targeted toward relatively unskilled and less able individuals. Evidence on the complementarity between the returns to training and skill in the private sector suggests that the returns to training in the public sector should be relatively low.
We also survey the main methodological lessons learned from thirty years of evaluation
activity conducted mainly in the United States. We have identified eight lessons from the
evaluation literature that we believe should guide practice in the future. First, there are
many parameters of interest in evaluating any program. This multiplicity of parameters
results in part because of the heterogeneous impacts of these programs. As a result of
this heterogeneity, some popular estimators that are well-suited for estimating one set
of parameters are poorly suited for estimating others. Understanding that responses to
the same measured treatment are heterogeneous across people, that measured treatments themselves are heterogeneous, that in many cases people participate in programs based in part on this heterogeneity, and that econometric estimators should allow for this possibility,
is an important insight of the modern literature that challenges traditional approaches to
program evaluation. Because of this heterogeneity, many di¤erent parameters are required
to answer the interesting evaluation questions.
Second, there is inherently no method of choice for conducting program evaluations.
The choice of an appropriate estimator should be guided by the economics underlying the
problem, the data that are available or that can be acquired, and the evaluation question
being addressed.
A third lesson from the evaluation literature is that better data help a lot. The data available to most analysts have been exceedingly crude, as we document below. Too much
has been asked of econometric methods to remedy the defects of the underlying data. When
certain features of the data are improved, the evaluation problem becomes much easier. The
best solution to the evaluation problem lies in improving the quality of the data on which
evaluations are conducted and not in the development of formal econometric methods to
circumvent inadequate data.
Fourth, it is important to compare comparable people. Many non-experimental evaluations identify the parameter of interest by comparing observationally different persons, using extrapolations based on inappropriate functional forms to make incomparable people comparable. A major advantage of nonparametric methods for solving the problem of selection bias is that, rigorously applied, they force analysts to compare only comparable people.
Fifth, evidence that different non-experimental estimators produce different estimates of the same parameter does not indicate that non-experimental methods cannot address the underlying self-selection problem in the data. Instead, different estimates obtained from different estimators simply indicate that different estimators address the selection problem in different ways and that non-random participation in social programs is an important problem that deserves more attention in its own right. Different methods produce the same estimates only if there is no problem of selection bias.
Sixth, a corollary lesson, derived from lessons three, four and five, is that the message from LaLonde’s (1986) influential study of nonexperimental estimators has been misunderstood. Once analysts define bias clearly, compare comparable people, know a little about the unemployment histories of trainees and comparison group members, administer the same questionnaire to both groups, and place them in the same local labor market, much of the bias in using nonexperimental methods is attenuated. Variability in estimates across estimators arises from the fact that different nonexperimental estimators solve the selection problem under different assumptions, and these assumptions are often incompatible with each other. Only if there is no selection bias would all evaluation estimators identify the same parameter.
Seventh, three decades of experience with social experimentation have enhanced our understanding of the benefits and limitations of this approach to program evaluation. Like all evaluation methods, this method is based on implicit identifying assumptions. Experimental methods estimate the effect of the program compared to no program at all when they are used to evaluate the effect of a program for which there are few good substitutes. They are less effective when evaluating ongoing programs, in part because they appear to disrupt established bureaucratic procedures. The threat of disruption leads local bureaucrats to oppose their adoption. To the extent that programs are disrupted, the program evaluated by the method is not the ongoing program that one seeks to evaluate. The parameter estimated in experimental evaluations is often not of primary interest to policy makers and researchers, and in any event has to be more carefully interpreted than is commonly done in most public policy discussions. However, if there is no disruption, and the other problems that plague experiments are absent, the evidence from social experiments provides a benchmark for learning about the performance of alternative non-experimental methods.
Eighth, and finally, programs implemented at a national or regional level affect both participants and nonparticipants. The current practice in the entire “treatment effect” literature is to ignore the indirect effects of programs on nonparticipants by assuming they are negligible. This practice can produce substantially misleading estimates of program impacts if indirect effects are substantial. To account for the impacts of programs on both participants and nonparticipants, general equilibrium frameworks are required when programs substantially affect the economy.
The remainder of the chapter is organized as follows. In Section 2, we distinguish among several types of active labor market policies and describe the types of employment and training services offered both in the U.S. and in Europe, their approximate costs, and their intended effects. We introduce the evaluation problem in Section 3. We discuss the importance of heterogeneity in the response to treatment for defining counterfactuals of interest. We consider what economic questions the most widely used counterfactuals answer. In Section 4, we present three prototypical solutions to the problem cast in terms of mean impacts. These prototypes are generalized throughout the rest of this chapter, but three basic principles introduced in this section underlie all approaches to program evaluation when the
parameters of interest are means or conditional means. In Section 5, we present conditions
under which social experiments solve the evaluation problem and assess the effectiveness of
social experiments as a tool for evaluating employment and training programs. In Section
6, we outline two prototypical models of program participation and outcomes that represent
the earliest and the latest thinking in the literature. We demonstrate the implications of
these decision rules for the choice of an econometric evaluation estimator. We discuss the
empirical evidence on the determinants of participation in government training programs.
The econometric models used to evaluate the impact of training programs in nonexperimental settings are described in Section 7. The interplay between the economics of
program participation and the choice of an appropriate evaluation estimator is stressed. In
Section 8, we discuss some of the lessons learned from implementing various approaches to
evaluation. Included in this section are the results of a simulation analysis based on the
empirical model of Ashenfelter and Card (1985), where we demonstrate the sensitivity of the
performance of alternative estimators to assumptions about heterogeneity in impact among
persons and other data generating processes of the underlying econometric model. We also
reexamine LaLonde’s (1986) evidence on the performance of nonexperimental estimators
and reinterpret the main lessons from his study.
Section 9 discusses the problems that arise in using microeconomic methods to evaluate
programs with macroeconomic consequences. A striking example of the problems that
can arise from this practice is provided. Two empirically operational general equilibrium
frameworks are presented, and the lessons from applying them in practice are summarized.
Section 10 surveys the findings from the non-experimental literature, and contrasts them
with those from experimental evaluations. We conclude in Section 11 by surveying the main
methodological lessons learned from the program evaluation literature on job training.
2 Public Job Training and Active Labor Market Policies
Many government policies affect employment and wages. The “active labor market” policies
we analyze have two important features that distinguish them from general policies, such
as income taxes, that also a¤ect the labor market. First, they are targeted toward the
unemployed or toward those with low skills or little work experience who have completed
(usually at a low level) their formal schooling. Second, the policies are aimed at promoting
employment and/or wage growth among this population, rather than just providing income
support.
Table 2.1 describes the set of policies we consider. This set includes: (a) classroom training (CT), consisting of basic education to remedy deficiencies in general skills or vocational training to provide the skills necessary for particular jobs; (b) subsidized employment with public or private employers (WE), which includes public service employment (wholly subsidized temporary government jobs) and work experience (subsidized entry-level jobs at public or non-profit employers designed to introduce young people to the world of work) as well as wage supplements and fixed payments to private firms for hiring new workers; (c) subsidies to private firms for the provision of on-the-job training (OJT); (d) training in how to obtain a job; and (e) in-kind subsidies to job search such as referrals to employers and free access to job listings. Policies (d) and (e) fall under the general heading of job search assistance (JSA), which also includes the job matching services provided by the U.S. Employment Service and similar agencies in other countries.
As we argue in more detail below, distinguishing the types of training provided is important for two reasons. First, different types of training often imply different economic models of training participation and impact, and therefore different econometric estimation strategies. Second, because most existing training programs provide a mix of these services, heterogeneity in the impact of training becomes an important practical concern. As we show in Section 7, this heterogeneity has important implications for the choice of econometric methods for evaluating active labor market policies.
We do not analyze privately supplied job training despite its greater quantitative importance to modern economies (see Heckman, Lochner and Taber, 1998a, or Mincer, 1962, 1993). For example, in the United States, Jacob Mincer has estimated that such training amounts to approximately 4 to 5 percent of GDP annually. Despite the magnitude of this investment, there are surprisingly few publicly-available studies of the returns to private job training, and many of those that are available do not control convincingly for the non-random allocation of training among private sector workers. Governments demand publicly-justified evaluations of training programs, while private firms, to the extent that they formally evaluate their training programs, keep their findings to themselves. An emphasis on objective, publicly accessible evaluations is a distinctive feature of the modern welfare state, especially in an era of limited funds and public demands for accountability.
Table 2.2 presents the amount spent on active labor market policies by a number of
OECD countries. Most OECD countries provide some mix of the employment and training
services described in Table 2.1. Differences among countries include the relative emphasis
on each type of service, the particular populations targeted for service, the total resources
spent on the programs, how resources are allocated among programs and the extent to
which employment and training services are integrated with other programs such as unemployment insurance or social assistance. In addition, although the programs we study are
funded by governments, they are not always conducted by governments, especially in the
U.S. and the U.K. In decentralized training systems, private …rms and local organizations
play an important role in providing employment and training services.
Table 2.2 reveals that many OECD countries spend substantial sums on active labor market policies. In nearly all countries, total expenditures are more than one-third of total expenditures on unemployment benefits, and some countries’ expenditures on active labor market policies exceed those on unemployment benefits. Usually only a fraction of these expenditures is for CT. Further, even in countries that emphasize classroom training, governments spend substantial sums on other active labor market policies. Denmark spends 1 percent of its GDP on CT for adults, the most of any OECD country. However, this expenditure amounts to only 40 percent of its total spending on active labor market programs. Only in Canada is the fraction spent on CT larger. At the opposite extreme, Japan and the U.S. spend only 0.03 percent and 0.04 percent, respectively, of their GDP on CT. However, as the table shows, these two countries also spend the smallest share of GDP on active labor market policies.
The low percentage of GDP spent on active labor market programs in the U.S. has
led some researchers to comment on the irony that despite these low expenditures, U.S.
programs have been evaluated more extensively and over a longer period of time than
programs elsewhere (Haveman and Saks, 1985; Björklund, 1993). Indeed, much of what is
known about the impacts of these programs and many of the methodological developments
associated with evaluating them come from U.S. evaluations.[1]
[1] However, the level of total expenditure in the U.S. is still quite large. Relative total expenditures on active labor market policies can be inferred from Table 2.2 using the relative sizes of each economy compared with the U.S. For example, the German economy is somewhat less than one-fourth the size of the U.S. economy, and the French, Italian and British economies are approximately one-sixth the size of the U.S. economy. Accordingly, training expenditures are somewhat greater in Germany and France, about the same in Italy, and less in the United Kingdom than in the U.S. See OECD, Employment Outlook (1996), Table 1.1, p. 2.
We now consider in detail each type of employment and training service in Table 2.1. This discussion motivates the consideration of alternative economic models of program participation and impact in Sections 6 and 7, and our focus on heterogeneity in program impacts. It also provides a context for the empirical literature on the impact of these programs that we review in Section 10.
The first category listed in Table 2.1 is classroom training. In many countries, CT represents the largest fraction of government expenditures on active labor market policy, and most of that expenditure is devoted to vocational training. Even in the U.S., where remedial programs aimed at high school dropouts and other low-skill individuals play a larger role than elsewhere, most CT programs provide vocational training. By design, most CT programs in the OECD are of limited duration. For example, in Denmark CT typically lasts 2 to 4 weeks (Jensen, et al., 1993), while the typical duration is four months in Sweden and three months in the United Kingdom and the United States. Per capita expenditures on such training vary substantially, with a training slot costing approximately $7,500 in Sweden and between $2,000 and $3,000 in the United States.[2] The Swedish figures include stipends for participants while the U.S. figures do not.
An important difference among OECD countries that provide CT is the extent to which the training is relatively standardized and therefore less tailored to the requirements of firms or the market in general. In the 1980s and early 1990s, the Nordic countries usually provided CT in government training centers that use standardized materials and teaching methods. However, the emphasis has shifted recently, especially in Sweden, toward decentralized and firm-based training. In the United Kingdom and the U.S., the provision of CT is highly decentralized and its content depends on the choices made by local councils of business, political, and labor leaders. The local councils receive funding from the federal government and then subcontract for CT with private vocational and proprietary schools and local community colleges. Due to this highly decentralized structure, both participant characteristics and training content can vary substantially among locales, which suggests that the impact of training is likely to vary substantially across individuals in evaluations of such programs.
The second category of services listed in Table 2.1 is wage and employment subsidies. This category encompasses several different specific services which we group together due to their analytic similarity. The simplest example of this type of policy provides subsidies to private firms for hiring workers in particular groups. These subsidies may take the form of a fixed amount for each new employee hired or some fraction of the employee’s wage for a period of time. In the U.S., the Targeted Jobs Tax Credit is an example of this type of program. Heckman, Lochner, Smith and Taber (1997) discuss the empirical evidence on
the effectiveness of wage and employment subsidies in greater detail.

[2] Unless otherwise indicated, all monetary units are expressed in 1997 U.S. dollars.
Temporary work experience (WE) usually targets low-skilled youth or adults with poor employment histories and provides them with a job lasting 3 to 12 months in the public or nonprofit sector. The idea of these programs is to ease the transition of these groups into regular jobs by helping them learn about the world of work and develop good work habits. Such programs constitute a very small proportion of U.S. training initiatives, but substantial fractions of services provided to youth in countries such as France (TUC) and the United Kingdom (Community Programmes). In public sector employment (PSE) programs, governments create temporary public sector jobs. These jobs usually require some amount of skill and are aimed at unemployed adults with recent work experience rather than youth or the disadvantaged. Except for a brief period during the late 1970s, they have not been used in the United States since the Depression era. However, they have been and remain an important component of active labor market policy in several European countries.
The third category in Table 2.1 is subsidized on-the-job training at private firms. The goal of subsidized OJT programs is to induce employers to provide job-relevant skills, including firm-specific skills, to disadvantaged workers. In the U.S., employers receive a 50 percent wage subsidy for up to six months; in the U.K. employers receive a lump sum per week (O’Higgins, 1994). Although evidence is limited and firm training is difficult to measure, there is a widespread view that these programs in fact provide little training, even informal on-the-job training, and are better characterized as work experience or wage subsidy programs (e.g., Breen, 1988; Hutchinson and Church, 1989).[3] Survey responses by employers who have hired or sponsored OJT trainees suggest that they value the program for its help in reducing the costs associated with hiring and retaining suitable employees more than for the opportunity to increase the skills of new workers (Begg, et al., 1991).
For purposes of evaluation, it is almost always impossible to distinguish those OJT experiences from which new skills were acquired from those that amounted to work experience or a wage subsidy without a training component. In addition, because OJT is provided by individual employers, this indeterminacy is not simply a program-specific feature, but holds among individuals within the same program. Consequently, OJT programs will likely have heterogeneous effects, and the impact, if any, of these programs will result from some combination of learning by doing, the usual training provided by the firm to new workers
and incremental training beyond that provided to unsubsidized workers.

[3] The provision of subsidized OJT is particularly hard to monitor, both because on-the-job training has proven difficult to measure with survey methods (Barron, Berger and Black, 1997) and because trainees often do not perceive that they have been treated any differently than their co-workers who are not subsidized. In fact, both groups may have received substantial amounts of informal on-the-job training. For evidence of the importance of informal on-the-job training in the U.S., see Barron, Black and Lowenstein (1989).
The fourth category of services in Table 2.1 is job search assistance. The purpose of these services is to facilitate the matching process between workers and firms, both by reducing time unemployed and by increasing match quality. The programs are usually operated by the national or local employment service, but sometimes may be subcontracted out to third parties. Included under this category are direct placement in vacant jobs, employer referrals, in-kind subsidies to search such as free access to job listings and telephones for contacting employers, career counseling, and instruction in job search skills. The last of these, which often includes instruction in general social skills, was developed in the U.S., but is now used in the U.K., Sweden, and recently France (Björklund and Regner, 1996, p. 24). In recent years, JSA has become more popular due to its low cost, usually just a few hundred dollars per participant, and relatively solid record of performance (which we discuss in detail in Section 10).
To conclude this section, we discuss five features of employment and training programs that should be kept in mind when evaluating them. First, as the operation of these programs has become more decentralized in OECD countries, differences have emerged between how these programs were designed and how they are implemented (Hollister and Freedman, 1988). Actual practice can deviate substantially from explicit written policy.[4] Therefore, the evaluator must be careful to characterize the program as implemented when assessing its impacts.
Second, participants often receive services from more than one category in Table 2.1. For example, classroom training in vocational skills might be followed by job search assistance. In the U.K., the Youth Training Scheme (now Youth Training) was explicitly designed to combine OJT with 13 weeks of CT. Some expensive programs combine several of the services listed in Table 2.1 into a single package. For example, in the U.S. the Job Corps program for youth combines classroom training with work experience and job search assistance in a residential setting at a current cost of around $19,000 per participant. Many available survey data sets do not identify all the services received by a participant. In this case, the practice of combining various types of training, particularly when combinations are tailored to the needs of individual trainees as in the U.S. JTPA program, constitutes another source of heterogeneity in the impact of training. Even when administrative data are available that identify the services received, isolating the impact of particular individual services often proves difficult or impossible in practice, due to the small samples receiving particular combinations of services or due to difficulties in determining the process by which
individuals come to receive particular service combinations.

[4] For example, see Breen (1988) and Hollister and Freedman (1990) describing the implementation of WEP in Ireland, and Hollister and Freedman (1990) and Leigh (1995) describing the implementation of JTPA in the United States.
Third, certain features of active labor market programs affect individuals’ decisions to participate in training. In some countries, such as Sweden and the United Kingdom, participation in training is a condition for receiving unemployment benefits rather than less generous social assistance payments. In the U.S., participation is sometimes required by a court order in lieu of alternative punishment.
Fourth, program administrators often have considerable discretion over whom they admit into government training programs. This discretion results from the fact that the number of applicants often exceeds the number of available training positions. It has long been a feature of U.S. programs, but also has characterized programs in Austria, Denmark, Germany, Norway, and the United Kingdom (Björklund and Regner, 1996; Westergard-Nielsen, 1993; Kraus, et al., 1997). Consequently, when modeling participation in training, it may be important to account for not only individual incentives, but also those of the program operators. In Section 6, we discuss the incentives facing program operators and how they affect the characteristics of participants in government training programs.
Finally, the different types of services require different economic models of program participation and impact. For example, the standard human capital model captures the essence of individual decisions to invest in vocational skills (CT). It provides little guidance regarding behavior toward job search assistance or wage subsidies. In Section 6 we describe economic models of participation in alternative programs and discuss their implications for evaluation research.
3 The Evaluation Problem and the Parameters of Interest in Evaluating Social Programs

3.1 The Evaluation Problem
Constructing counterfactuals is the central problem in the literature on evaluating social programs. In the simplest form of the evaluation problem, persons are imagined as being able to occupy one of two mutually exclusive states: “0” for the untreated state and “1” for the treated state. Treatment is associated with participation in the program being evaluated.[5] Associated with each state is an outcome, or set of outcomes. It is easiest to think of each state as consisting of only a single outcome measure, such as earnings, but just as easily, we can use the framework to model vectors of outcomes such as earnings, employment and participation in welfare programs. In the models presented in Section 6, we study an entire vector of earnings or employment at each age that results from program participation.
We can express these outcomes as a function of conditioning variables, $X$. Denote the potential outcomes by $Y_0$ and $Y_1$, corresponding to the untreated and treated states. Each person has a $(Y_0, Y_1)$ pair. Assuming that means exist, we may write the (vector of) outcomes in each state as

(3.1a)  $Y_0 = \mu_0(X) + U_0$

(3.1b)  $Y_1 = \mu_1(X) + U_1$

where $E(Y_0 \mid X) = \mu_0(X)$ and $E(Y_1 \mid X) = \mu_1(X)$. To simplify the notation, we keep the conditioning on $X$ implicit unless it serves to clarify the exposition by making it explicit. The potential outcome actually realized depends on decisions made by individuals, firms, families or government bureaucrats. This model of potential outcomes is variously attributed to Fisher (1935), Neyman (1935), Roy (1951), Quandt (1972, 1988) or Rubin (1974).
To focus on main ideas, throughout most of this chapter we assume $E(U_1 \mid X) = E(U_0 \mid X) = 0$, although as we note at several places in this paper, this is not strictly required. For many of the estimators that we consider in this chapter we allow for the more general case

$Y_0 = g_0(X) + U_0$

$Y_1 = g_1(X) + U_1$

where $E(U_0 \mid X) \neq 0$ and $E(U_1 \mid X) \neq 0$. Then $\mu_0(X) = g_0(X) + E(U_0 \mid X)$ and $\mu_1(X) = g_1(X) + E(U_1 \mid X)$.[6] Thus $X$ is not necessarily exogenous in the ordinary econometric usage of that term. These conditions do not imply that $E(U_1 - U_0 \mid X, D = 1) = 0$. $D$ may depend on $U_1$, $U_0$ or $U_1 - U_0$ and $X$.

[5] In this paper, we only consider a two potential state model in order to focus on the main ideas. Heckman (1998a) develops a multiple state model of potential outcomes for a large number of mutually exclusive states. The basic ideas in his work are captured in the two outcome models we present here.

[6] For example, an exogeneity assumption is not required when using social experiments to identify $E(Y_1 - Y_0 \mid X, D = 1)$.
Note also that $Y$ may be a vector of outcomes or a time series of potential outcomes, $(Y_{0t}, Y_{1t})$ for $t = 1, \ldots, T$, on the same type of variable. We will encounter the latter case when we analyze panel data on outcomes. In this case, there is usually a companion set of $X$ variables which we will sometimes assume to be strictly exogenous in the conventional econometric meaning of that term: $E(U_{0t} \mid X) = 0$ and $E(U_{1t} \mid X) = 0$, where $X = (X_1, \ldots, X_T)$. In defining a sequence of “treatment on the treated” parameters, $E(Y_{1t} - Y_{0t} \mid X, D = 1)$, $t = 1, \ldots, T$, this assumption allows us to abstract from any dependence between $U_{1t}$, $U_{0t}$ and $X$. It excludes differences in $U_{1t}$ and $U_{0t}$ arising from $X$ dependence and allows us to focus on differences in outcomes solely attributable to $D$. While convenient, this assumption is overly strong.
However, we stress that the exogeneity assumption in either cross section or panel contexts is only a matter of convenience and is not strictly required. What is required for an interpretable definition of the “treatment on the treated” parameter is avoiding conditioning on $X$ variables caused by $D$, even holding $Y^P = ((Y_{01}, Y_{11}), \ldots, (Y_{0T}, Y_{1T}))$ fixed, where $Y^P$ is the vector of potential outcomes. More precisely, we require that for the conditional density of the data

$f(X \mid D, Y^P) = f(X \mid Y^P),$

i.e., we require that the realization of $D$ does not determine $X$ given the vector of potential outcomes. Otherwise, the parameter $E(Y_1 - Y_0 \mid X, D = 1)$ does not capture the full effect of treatment on the treated as it operates through all channels, and certain other technical problems discussed in Heckman (1998a) arise. In order to obtain $E(Y_{1t} - Y_{0t} \mid X_c, D = 1)$ defined on subsets of $X$, say $X_c$, simply integrate out $E(Y_{1t} - Y_{0t} \mid X, D = 1)$ against the density $f(\tilde{X}_c \mid D = 1)$, where $\tilde{X}_c$ is the portion of $X$ not in $X_c$: $X = (X_c, \tilde{X}_c)$.
Note, finally, that the choice of a base state “0” is arbitrary. Clearly the roles of “0” and “1” can be reversed. In the case of human capital investments, there is a natural base state. But for many other evaluation problems the choice of a base is arbitrary. Assumptions appropriate for one choice of “0” and “1” need not carry over to the opposite choice. With this cautionary note in mind, we proceed as if a well-defined base state exists.
In many problems it is convenient to think of “0” as a benchmark “no treatment” state. The gain to the individual of moving from “0” to “1” is given by

(3.2)  $\Delta = Y_1 - Y_0.$

If one could observe both $Y_0$ and $Y_1$ for the same person at the same time, the gain $\Delta$ would be known for each person. The fundamental evaluation problem arises because we do not know both coordinates of $(Y_1, Y_0)$, and hence $\Delta$, for anybody. All approaches to solving this problem attempt to estimate the missing data. These attempts to solve the evaluation problem differ in the assumptions they make about how the missing data are related to the available data, and what data are available. Most approaches to evaluation in the social sciences accept the impossibility of constructing $\Delta$ for anyone. Instead, the evaluation problem is redefined from the individual level to the population level, to estimate the mean of $\Delta$, or some other aspect of the distribution of $\Delta$, for various populations of interest. The question becomes: what features of the distribution of $\Delta$ should be of interest, and for what populations should they be defined?
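To make the missing-data character of the problem concrete, consider the following minimal simulation sketch (in Python, with purely hypothetical distributions and parameter values of our choosing). It generates both coordinates of $(Y_0, Y_1)$, and hence $\Delta$, for each person, something no real data set provides:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical potential-outcome model: Y_j = mu_j + U_j, j = 0, 1.
mu0, mu1 = 10.0, 12.0
U0 = rng.normal(0, 2, n)
U1 = U0 + rng.normal(0, 1, n)      # U1 != U0: heterogeneous gains
Y0, Y1 = mu0 + U0, mu1 + U1
delta = Y1 - Y0                    # individual gain, never observed in practice

# Participation may depend on the (unobserved) gain: selection on delta.
D = (delta + rng.normal(0, 1, n) > 2.0).astype(int)

# The analyst observes only (Y, D), where Y = D*Y1 + (1-D)*Y0.
Y = D * Y1 + (1 - D) * Y0

print("E(delta)       =", delta.mean())           # population mean gain
print("E(delta | D=1) =", delta[D == 1].mean())   # mean gain among participants
```

Both printed quantities are population-level counterfactuals. Neither is directly computable from the observed pairs $(Y, D)$ alone; recovering them requires the identifying assumptions discussed in the remainder of this chapter.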
3.2 The Counterfactuals of Interest
There are many possible counterfactuals of interest for evaluating a social program. One might like to compare the state of the world in the presence of the program to the state of the world if the program were operated in a different way, or to the state of the world if the program did not exist at all, or to the state of the world if alternative programs were used to replace the present program. A full evaluation entails an enumeration of all outcomes of interest for all persons, both in the current state of the world and in all the alternative states of interest, and a mechanism for valuing the outcomes in the different states.

Outcomes of interest in program evaluations include the direct benefits received, the level of behavioral variables for participants and nonparticipants, and the payments for the program, for both participants and nonparticipants, including taxes levied to finance a publicly provided program. These measures would be displayed for each individual in the economy to characterize each state of the world.
In a Robinson Crusoe economy, participation in a program is a well-defined event. In a modern economy, almost everyone participates in each social program either directly or indirectly. A training program affects more than the trainees. It also affects the persons with whom the trainees compete in the labor market, the firms that hire them and the taxpayers who finance the program. The impact of the program depends on the number and composition of the trainees. Participation in a program does not mean the same thing for all people.
The traditional evaluation literature usually defines the effect of participation to be the effect of the program on participants explicitly enrolled in the program. These are the “Direct Effects.” They exclude the effects of a program that do not flow from direct participation, known as the “Indirect Effects.” This distinction appears in the pioneering work of H. G. Lewis on measuring union relative wage effects (Lewis, 1963). His insights apply more generally to all evaluation problems in social settings.
There may be indirect effects for both direct participants and direct nonparticipants. Thus a direct participant may pay taxes to support the program just as persons who do not directly participate may also pay taxes. A firm may be an indirect beneficiary of the lower wages resulting from an expansion of the trained workforce. The conventional econometric and statistical literature ignores the indirect effects of programs and equates “treatment” outcomes with the direct outcome $Y_1$ in the program state and “no treatment” with the direct outcome $Y_0$ in the no program state.
Determining all outcomes in all states is not enough to evaluate a program. Another aspect of the evaluation problem is the valuation of the outcomes. In a democratic society, aggregation of the evaluations and the outcomes in a form useful for social deliberations also is required. Different persons may value the same state of the world differently even if they experience the same “objective” outcomes and pay the same taxes. Preferences may be interdependent. Redistributive programs exist, in part, because of altruistic or paternalistic preferences. Persons may value the outcomes of other persons either positively or negatively. Only if one person’s preferences are dominant (the idealized case of a social planner with a social welfare function) is there a unique evaluation of the outcomes associated with each possible state from each possible program.
The traditional program evaluation literature assumes that the valuation of the direct effects of the program boils down to the effect of the program on GDP. This assumption ignores the important point that different persons value the same outcomes differently and that the democratic political process often entails coalitions of persons who value outcomes in different ways. Both efficiency and equity considerations may receive different weights from different groups. Different mechanisms for aggregating evaluations and resolving social conflicts exist in different societies. Different types of information are required to evaluate a program under different modes of social decision making.

Both for pragmatic and political reasons, government social planners, statisticians or policy makers may value objective output measures differently than the persons or institutions being evaluated. The classic example is the value of nonmarket time (Greenberg, 1997). Traditional program evaluations exclude such valuations largely because of the difficulty of imputing the value and quantity of nonmarket time. By doing this, however, these evaluations value labor supply in the market sector at the market wage, but value labor supply in the nonmarket sector at a zero wage. By contrast, individuals value labor supply in the nonmarket sector at their reservation wage. In this example, two different sets of preferences value the same outcomes differently. In evaluating a social program in a society that places weight on individual preferences, it is appropriate to recognize personal evaluations and that the same outcome may be valued in different ways by different social actors.
Programs that embody redistributive objectives inherently involve different groups. Even if the taxpayers and the recipients of the benefits of a program have the same preferences, their valuations of a program will, in general, differ. Altruistic considerations often motivate such programs. These often entail private valuations of distributions of program impacts - how much recipients gain over what they would experience in the absence of the program. (See Heckman and Smith, 1993, 1995, 1998a and Heckman, Smith and Clements, 1997.)
Answers to many important evaluation questions require knowledge of the distribution of program gains, especially for programs that have a redistributive objective or programs for which altruistic motivations play a role in motivating the existence of the program. Let $D = 1$ denote direct participation in the program and $D = 0$ denote direct nonparticipation. To simplify the argument in this section, ignore any indirect effects. From the standpoint of a detached observer of a social program who takes the base state values (denoted “0”) as those that would prevail in the absence of the program, it is of interest to know, among other things,

(A) the proportion of people taking the program who benefit from it:
$\Pr(Y_1 > Y_0 \mid D = 1) = \Pr(\Delta > 0 \mid D = 1)$;

(B) the proportion of the total population benefiting from the program:
$\Pr(Y_1 > Y_0 \mid D = 1) \cdot \Pr(D = 1) = \Pr(\Delta > 0 \mid D = 1) \cdot \Pr(D = 1)$;

(C) selected quantiles of the impact distribution:
$\inf_{\Delta} \{\Delta : F(\Delta \mid D = 1) > q\}$, where $q$ is a quantile of the distribution and where “inf” denotes the smallest attainable value of $\Delta$ that satisfies the condition stated in the braces;

(D) the distribution of gains at selected base state values:
$F(\Delta \mid D = 1, Y_0 = y_0)$;

(E) the increase in the level of outcomes above a certain threshold $\bar{y}$ due to a policy:
$\Pr(Y_1 > \bar{y} \mid D = 1) - \Pr(Y_0 > \bar{y} \mid D = 1)$.
Measure (A) is of interest in determining how widely program gains are distributed among participants. Participants in the political process with preferences over distributions of program outcomes would be unlikely to assign the same weight to two programs with the same mean outcome, one of which produced favorable outcomes for only a few persons while the other distributed gains more broadly. When considering a program, it is of interest to determine the proportion of participants who are harmed as a result of program participation, indicated by $\Pr(Y_1 < Y_0 \mid D = 1)$. Negative mean impact results might be acceptable if most participants gain from the program. These features of the outcome distribution are likely to be of interest to evaluators even if the persons studied do not know their $Y_0$ and $Y_1$ values in advance of participating in the program.
Measure (B) is the proportion of the entire population that benefits from the program, assuming that the costs of financing the program are broadly distributed and are not perceived to be related to the specific program being evaluated. If voters have correct expectations about the joint distribution of outcomes, it is of interest to politicians to determine how widely program benefits are distributed. At the same time, large program gains received by a few persons may make it easier to organize interest groups in support of a program than if the same gains are distributed more widely.
Evaluators interested in the distribution of program benefits would be interested in measure (C). Evaluators who take a special interest in the impact of a program on recipients in the lower tail of the base state distribution would find measure (D) of interest. It reveals how the distribution of gains depends on the base state for participants. Measure (E) provides the answer to the question “does the distribution of outcomes for the participants dominate the distribution of outcomes they would have experienced if they did not participate?” (See Heckman, Smith and Clements, 1997; and Heckman and Smith, 1998a.) Expanding the scope of the discussion to evaluate the indirect effects of the program makes it more likely that estimating distributional impacts is an important part of conducting program evaluations.
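Given knowledge of the joint distribution of $(Y_0, Y_1, D)$, measures (A)-(E) are straightforward to compute. The sketch below (our construction, with hypothetical distributions; for measure (D) we condition on a range of base state values rather than a single point $y_0$) illustrates each measure on simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical joint distribution of potential outcomes and participation.
Y0 = 10 + rng.normal(0, 2, n)
Y1 = 12 + 0.8 * (Y0 - 10) + rng.normal(0, 1.5, n)
delta = Y1 - Y0
D = (delta + rng.normal(0, 1, n) > 2).astype(bool)

d1 = delta[D]                                     # gains among participants

print("(A) Pr(delta > 0 | D=1)          =", (d1 > 0).mean())
print("(B) Pr(delta > 0 | D=1) Pr(D=1)  =", (d1 > 0).mean() * D.mean())
print("(C) quantiles of F(delta | D=1)  =", np.quantile(d1, [0.1, 0.5, 0.9]))

# (D) gains at low base-state values: condition on Y0 below its participant q25.
y0 = np.quantile(Y0[D], 0.25)
print("(D) median gain | D=1, Y0 <= q25 =", np.median(delta[D & (Y0 <= y0)]))

# (E) increase in the proportion above a threshold ybar among participants.
ybar = 11.0
print("(E) Pr(Y1>ybar|D=1)-Pr(Y0>ybar|D=1) =",
      (Y1[D] > ybar).mean() - (Y0[D] > ybar).mean())
```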
3.3 The Counterfactuals Most Commonly Estimated in the Literature
The evaluation problem in its most general form for distributions of outcomes is formidable and is not considered in depth either in this chapter or in the literature. (Heckman and Smith, 1998a, and Heckman, Smith and Clements, 1997, consider identification and estimation of counterfactual distributions.) Instead, in this chapter we focus on counterfactual means, and consider a form of the problem in which analysts have access to information on persons who are in one state or the other at any time, and for certain time periods there are some persons in both states, but there is no information on any single person who is in both states at the same time. As discussed in Heckman (1998a) and Heckman and Smith (1998a), a crucial assumption in the traditional evaluation literature is that the no treatment state approximates the no program state. This would be true if indirect effects are negligible.
Most of the empirical work in the literature on evaluating government training programs focuses on means, and in particular on one mean counterfactual: the mean direct effect of treatment on those who take treatment. The transition from the individual to the group level counterfactual recognizes the inherent impossibility of observing the same person in both states at the same time. By dealing with aggregates, rather than individuals, it is sometimes possible to estimate group impact measures even though it may be impossible to measure the impacts of a program on any particular individual. To see this point more formally, consider the switching regression model with two regimes denoted by “1” and “0” (Quandt, 1972). The observed outcome $Y$ is given by

(3.3)  $Y = DY_1 + (1 - D)Y_0.$

When $D = 1$ we observe $Y_1$; when $D = 0$ we observe $Y_0$.
To cast the foregoing model in a more familiar-looking form, and to distinguish it from conventional regression models, express the means in (3.1a) and (3.1b) in more familiar linear regression form:

$E(Y_j \mid X) = \mu_j(X) = X\beta_j, \quad j = 0, 1.$

With these expressions, substitute from (3.1a) and (3.1b) into (3.3) to obtain

$Y = D(\mu_1(X) + U_1) + (1 - D)(\mu_0(X) + U_0).$

Rewriting,

$Y = \mu_0(X) + D(\mu_1(X) - \mu_0(X) + U_1 - U_0) + U_0.$

Using the linear regression representation, we obtain

(3.4)  $Y = X\beta_0 + D(X(\beta_1 - \beta_0) + U_1 - U_0) + U_0.$

Observe that from the definition of a conditional mean, $E(U_0 \mid X) = 0$ and $E(U_1 \mid X) = 0$.
The parameter most commonly invoked in the program evaluation literature, although not the one actually estimated in social experiments, or in most nonexperimental evaluations, is the effect of randomly picking a person with characteristics $X$ and moving that person from “0” to “1”:

$E(Y_1 - Y_0 \mid X) = E(\Delta \mid X).$

In terms of the switching regression model, this parameter is the coefficient on $D$ in the “regression” non-error component of the following equation:

(3.5)  $Y = \mu_0(X) + D(\mu_1(X) - \mu_0(X)) + \{U_0 + D(U_1 - U_0)\}$
$\quad\;\; = \mu_0(X) + D(E(\Delta \mid X)) + \{U_0 + D(U_1 - U_0)\}$
$\quad\;\; = X\beta_0 + DX(\beta_1 - \beta_0) + \{U_0 + D(U_1 - U_0)\},$

where the term in braces is the “error.”
If the model is specialized so that there are $K$ regressors plus an intercept, with $\beta_1 = (\beta_{10}, \ldots, \beta_{1K})$ and $\beta_0 = (\beta_{00}, \ldots, \beta_{0K})$, where the intercepts occupy the first position, and the slope coefficients are the same in both regimes,

$\beta_{1j} = \beta_{0j} = \beta_j, \quad j = 1, \ldots, K,$

and $\beta_{00} = \beta_0$ and $\beta_{10} - \beta_{00} = \alpha$, the parameter under consideration reduces to $\alpha$:

(3.6)  $E(Y_1 - Y_0 \mid X) = \beta_{10} - \beta_{00} = \alpha.$

The regression model for this special case may be written as

(3.7)  $Y = X\beta + D\alpha + \{U_0 + D(U_1 - U_0)\}.$
It is nonstandard from the standpoint of elementary econometrics because the error term has a component that switches on or off with $D$. In general, its mean is not zero because $E[U_0 + D(U_1 - U_0)] = E(U_1 - U_0 \mid D = 1)\Pr(D = 1)$. If $U_1 - U_0$, or variables statistically dependent on it, help determine $D$, then $E(U_1 - U_0 \mid D = 1) \neq 0$. Intuitively, if persons who have high gains ($U_1 - U_0$) are more likely to appear in the program, then this term is positive.
In practice, most non-experimental and experimental studies do not estimate $E(\Delta \mid X)$. Instead, most nonexperimental studies estimate the effect of treatment on the treated, $E(\Delta \mid X, D = 1)$. This parameter conditions on participation in the program as follows:

(3.8)  $E(\Delta \mid X, D = 1) = E(Y_1 - Y_0 \mid X, D = 1) = X(\beta_1 - \beta_0) + E(U_1 - U_0 \mid X, D = 1).$

It is the coefficient on $D$ in the non-error component of the following regression equation:

(3.9)  $Y = \mu_0(X) + D(E(\Delta \mid X, D = 1)) + \{U_0 + D[(U_1 - U_0) - E(U_1 - U_0 \mid X, D = 1)]\}$
$\quad\;\; = X\beta_0 + D(X(\beta_1 - \beta_0) + E(U_1 - U_0 \mid X, D = 1)) + \{U_0 + D[(U_1 - U_0) - E(U_1 - U_0 \mid X, D = 1)]\}.$
$E(\Delta \mid X, D = 1)$ is a nonstandard parameter in conventional econometrics. It combines “structural” parameters $X(\beta_1 - \beta_0)$ with the means of the unobservables, $E(U_1 - U_0 \mid X, D = 1)$. It measures the average gain in the outcome for persons who choose to participate in a program compared to what they would have experienced in the base state. It computes the average gain in terms of both observables and unobservables. It is the latter that makes the parameter look nonstandard. Most econometric activity is devoted to separating $\beta_0$ and $\beta_1$ from the effects of the regressors on $U_1$ and $U_0$. Parameter (3.8) combines these effects.

This parameter is implicitly defined conditional on the current levels of participation in the program in society at large. Thus it recognizes social interaction. But at any point in time the aggregate participation level is just a single number, and the composition of trainees is fixed. From a single cross section of data, it is not possible to estimate how variation in the levels and composition of participants in a program affects the parameter.
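The divergence between $E(\Delta \mid X)$ and $E(\Delta \mid X, D = 1)$ is easy to exhibit numerically. The sketch below (hypothetical coefficient values of our choosing) simulates the switching regression model (3.4) with participation based partly on the unobserved gain $U_1 - U_0$; the two parameters then differ, and a least squares regression of $Y$ on $D$ and $X$ recovers neither:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Switching regression (3.4) with one regressor and regime-specific intercepts.
X = rng.normal(1, 1, n)
U0 = rng.normal(0, 1, n)
U1 = rng.normal(0, 1, n)                 # U1 != U0: idiosyncratic gains
Y0 = 2.0 + 1.0 * X + U0
Y1 = 3.0 + 1.5 * X + U1

# Participation depends partly on the unobserved gain U1 - U0.
D = ((U1 - U0) + rng.normal(0, 0.5, n) > 0.5).astype(int)
Y = D * Y1 + (1 - D) * Y0

delta = Y1 - Y0
print("E(delta)       =", delta.mean())           # random-assignment parameter
print("E(delta | D=1) =", delta[D == 1].mean())   # treatment on the treated

# Naive OLS of Y on (1, X, D) recovers neither parameter here.
Z = np.column_stack([np.ones(n), X, D])
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print("OLS coefficient on D =", coef[2])
```

In this design $E(\Delta) = 1 + 0.5\,E(X) = 1.5$, while $E(\Delta \mid D = 1)$ exceeds it because participants are selected on $U_1 - U_0$, and the OLS coefficient is further contaminated by the dependence between $D$ and $U_0$.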
The two evaluation parameters we have just presented are the same if we assume that $U_1 - U_0 = 0$, so the unobservables are common across the two states. From (3.9) we now have $Y_1 - Y_0 = \mu_1(X) - \mu_0(X) = X(\beta_1 - \beta_0)$. The difference between potential outcomes in the two states is a function of $X$ but not of unobservables. Further specializing the model to one of intercept differences (i.e., $Y_1 - Y_0 = \alpha$) requires that the difference between potential outcomes be a constant. The associated regression can be written as the familiar-looking dummy variable regression model:

(3.10)  $Y = X\beta + D\alpha + U, \quad \text{where } E(U) = 0.$
The parameter $\alpha$ is easy to interpret as a standard structural parameter, and specification (3.10) looks conventional. In fact, model (3.10) dominates the conventional evaluation literature. As we document below, the validity of many conventional instrumental variables methods and longitudinal estimation strategies is contingent on this specification. The conventional econometric evaluation literature focuses on $\alpha$, or more rarely $X(\beta_1 - \beta_0)$, and the selection problem arises from the correlation between $D$ and $U$.
While familiar, the framework of (3.10) is very special. Potential outcomes $(Y_1, Y_0)$ differ only by a constant ($Y_1 - Y_0 = \alpha$); the best $Y_1$ is also the best $Y_0$. All people gain or lose the same amount in going from "0" to "1". There is no heterogeneity in gains. Even in the more general case, with $\mu_1(X)$ and $\mu_0(X)$ distinct, or $\beta_1 \neq \beta_0$ in the linear regression representation, so long as $U_1 = U_0$ among people with the same $X$, there is no heterogeneity in the gain from moving from "0" to "1". This assumed absence of heterogeneity in response to treatment is strong. When tested, it is almost always rejected (see Heckman, Smith and Clements, 1997, and the evidence presented below).
There is one case with $U_1 \neq U_0$ in which the two parameters of interest are still equal even though there is dispersion in the gain $\Delta$. This case occurs when
$$E(U_1 - U_0 \mid X, D = 1) = 0. \tag{3.11}$$
Condition (3.11) arises when, conditional on $X$, $D$ does not explain or predict $U_1 - U_0$. It could arise if agents who select into state "1" from "0" either do not know, or do not act on, $U_1 - U_0$ or information dependent on $U_1 - U_0$ in making their decision to participate in the program. Ex post there is heterogeneity, but ex ante it is not acted on in determining participation in the program.
When the gain does not affect individuals' decisions to participate in the program, the error terms (the terms in braces in (3.7) and (3.9)) have conventional properties. The only bias in estimating the coefficients on $D$ in these regression models arises from the dependence between $U_0$ and $D$, just as the only source of bias in the common coefficient model is the covariance between $U$ and $D$ when $E(U \mid X) = 0$. To see this point, take the expectation of the terms in braces in (3.7) and (3.9), respectively, to obtain
$$E(U_0 + D(U_1 - U_0) \mid X, D) = E(U_0 \mid X, D)$$
and
$$E(U_0 + D[(U_1 - U_0) - E(U_1 - U_0 \mid X, D = 1)] \mid X, D) = E(U_0 \mid X, D).$$
A problem that remains when condition (3.11) holds is that the $D$ component in the error terms contributes a component of variance to the model and so makes the model heteroscedastic:
$$\operatorname{Var}(U_0 + D(U_1 - U_0) \mid X, D) = \operatorname{Var}(U_0 \mid X, D) + 2\operatorname{Cov}(U_0,\, U_1 - U_0 \mid X, D)\,D + \operatorname{Var}(U_1 - U_0 \mid X, D)\,D.$$
The distinction between a model with $U_1 = U_0$ and one with $U_1 \neq U_0$ is fundamental to understanding modern developments in the program evaluation literature. When $U_1 = U_0$ and we condition on $X$, everyone with the same $X$ has the same treatment effect. The evaluation problem greatly simplifies, and one parameter answers all of the conceptually distinct evaluation questions we have posed. "Treatment on the treated" is the same as the effect of taking a person at random and putting him or her into the program. The distributional questions (A)-(E) all have simple answers because everyone with the same $X$ has the same $\Delta$. Equation (3.10) is amenable to analysis by conventional econometric methods. Eliminating the covariance between $D$ and $U$ is the central problem in this model.
When $U_1 \neq U_0$ but (3.11) characterizes the program being evaluated, most of the familiar econometric intuition remains valid. This is the "random coefficient" model, with the coefficient on $D$ "random" (from the standpoint of the observing economist) but uncorrelated with $D$. The central problem in this model is the covariance between $U_0$ and $D$, and the only additional econometric problem arises in accounting for heteroscedasticity to obtain the right standard errors for the coefficients. In this case, the response to treatment varies among persons with the same $X$ values, but the mean effect of treatment on the treated and the effect of treatment on a randomly chosen person are the same.
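The following sketch (illustrative only; every number is an assumption) simulates this random-coefficient case under condition (3.11): the gain varies across persons but is independent of participation, so a simple mean comparison recovers the mean effect, while the error variance switches with $D$ as in the expression above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

u0 = rng.normal(0.0, 1.0, n)
gain = 2.0 + rng.normal(0.0, 1.5, n)   # heterogeneous effect, mean 2.0 (assumed)
d = rng.binomial(1, 0.4, n)            # participation independent of the gain
y = u0 + d * gain                      # outcome with a random coefficient on D

# The mean contrast recovers E(gain) = 2.0 because (3.11) holds here ...
print("estimated effect:", y[d == 1].mean() - y[d == 0].mean())

# ... but the residual variance differs by treatment status (heteroscedasticity):
print("Var(Y | D=0):", y[d == 0].var(), " Var(Y | D=1):", y[d == 1].var())
```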
In the general case, when $U_1 \neq U_0$ and (3.11) no longer holds, we enter a new world not covered in the traditional econometric evaluation literature. A variety of different treatment effects can be defined. Conventional econometric procedures often break down or require substantial modification. The error term for the model (3.5) has a nonzero mean.⁷ Both error terms are heteroscedastic. The distinctions among these three models — (a) the coefficient on $D$ is fixed (given $X$) for everyone; (b) the coefficient on $D$ is variable (given $X$), but does not help determine program participation; and (c) the coefficient on $D$ is variable (given $X$) and does help determine program participation — are fundamental to this chapter and to the entire literature on program evaluation.

⁷ $E[U_0 + D(U_1 - U_0) \mid X] = E(U_1 - U_0 \mid X, D = 1)\Pr(D = 1 \mid X) \neq 0.$
3.4 Is Treatment on the Treated an Interesting Economic Parameter?
What economic question does parameter (3.2) answer? How does it relate to the conventional parameter of interest in cost-benefit analysis, the effect of a program on GDP? To relate parameter (3.2) to the parameters needed to perform traditional cost-benefit analysis, it is fruitful to consider a more general framework. Following our previous discussion, we consider two discrete states or sectors corresponding to direct participation and nonparticipation, and a vector of policy variables $\varphi$ that affect the outcomes in both states and the allocation of persons to states or sectors. The policy variables may be discrete or continuous. Our framework departs from the conventional treatment effect literature in allowing for general equilibrium effects.
Assuming that costless lump-sum transfers are possible, that a single social welfare function governs the distribution of resources, and that prices reflect true opportunity costs, traditional cost-benefit analysis (see, e.g., Harberger, 1971) seeks to determine the impact of programs on the total output of society. Efficiency becomes the paramount criterion in this framework, with the distributional aspects of policies assumed to be taken care of by lump-sum transfers and taxes engineered by an enlightened social planner. In this framework, impacts on total output are the only objects of interest in evaluating programs; the distribution of program impacts is assumed to be irrelevant. This framework is favorable to the use of mean outcomes to evaluate social programs.
Within the context of the simple framework discussed in Section 3.1, let $Y_1$ and $Y_0$ be individual output, which trades at a constant relative price of "1" set externally and not affected by the decisions of the agents we analyze. Alternatively, assume that the policies we consider do not alter relative prices. Let $\varphi$ be a vector of policy variables which operate on all persons; these generate indirect effects. Let $c(\varphi)$ be the social cost of $\varphi$ denominated in "0" units, where $c(0) = 0$ and $c$ is convex and increasing in $\varphi$. Let $N_1(\varphi)$ be the number of persons in state "1" and $N_0(\varphi)$ the number of persons in state "0". The total output of society is
$$N_1(\varphi)E(Y_1 \mid D = 1, \varphi) + N_0(\varphi)E(Y_0 \mid D = 0, \varphi) - c(\varphi),$$
where $N_1(\varphi) + N_0(\varphi) = \bar{N}$ is the total number of persons in society. For simplicity, we assume that all persons have the same person-specific characteristics $X$. The vector $\varphi$ is general enough to include financial incentive variables for participation in the program as well as mandates that assign persons to a particular state. A policy may benefit some and harm others.
Assume for convenience that the treatment choice and mean outcome functions are differentiable, and for the sake of argument further assume that $\varphi$ is a scalar. Then the change in output in response to a marginal increase in $\varphi$ from any given position is
$$\Delta(\varphi) = \frac{\partial N_1(\varphi)}{\partial \varphi}\left[E(Y_1 \mid D = 1, \varphi) - E(Y_0 \mid D = 0, \varphi)\right] + N_1(\varphi)\frac{\partial E(Y_1 \mid D = 1, \varphi)}{\partial \varphi} + N_0(\varphi)\frac{\partial E(Y_0 \mid D = 0, \varphi)}{\partial \varphi} - \frac{\partial c(\varphi)}{\partial \varphi}. \tag{3.12}$$
The first term arises from the transfer of persons across sectors induced by the policy change. The second term arises from changes in output within each sector induced by the policy change. The third term is the marginal social cost of the change.
In principle, this measure could be estimated from time-series data on the change in aggregate GDP occurring after the program parameter $\varphi$ is varied. Assuming a well-defined social welfare function, and making the additional assumption that prices are constant at initial values, an increase in GDP evaluated at base-period prices raises social welfare provided that feasible bundles can be constructed from the output after the social program parameter is varied so that all losers can be compensated (see, e.g., Laffont, 1989, p. 155, or the comprehensive discussion in Chipman and Moore, 1976).
If marginal policy changes have no effect on intra-sector mean output, the within-sector derivative terms in (3.12) are zero. In this case, the parameters of interest for evaluating the impact of the policy change on GDP are:
(i) $\partial N_1(\varphi)/\partial \varphi$, the number of people entering or leaving state 1;
(ii) $E(Y_1 \mid D = 1, \varphi) - E(Y_0 \mid D = 0, \varphi)$, the mean output difference between sectors;
(iii) $\partial c(\varphi)/\partial \varphi$, the social marginal cost of the policy.
It is revealing that nowhere on this list are the parameters that receive the most attention in the econometric policy evaluation literature (see, e.g., Heckman and Robb, 1985a). These are "the effect of treatment on the treated,"
(a) $E(Y_1 - Y_0 \mid D = 1, \varphi)$,
or
(b) $E(Y_1 \mid \varphi = \bar{\varphi}) - E(Y_0 \mid \varphi = 0)$, where $\varphi = \bar{\varphi}$ sets $N_1(\bar{\varphi}) = \bar{N}$. This is the effect of universal coverage for the program.
Parameter (ii) can be estimated by taking simple mean differences between the outputs in the two sectors; no adjustment for selection bias is required. Parameter (i) can be obtained from knowledge of the net movement of persons across sectors in response to the policy change, something usually neglected in micro policy evaluation (for exceptions, see Moffitt, 1992, or Heckman, 1992). Parameter (iii) can be obtained from cost data; full social marginal costs should be included in the computation of this term. The typical micro evaluation neglects all three terms. Costs are rarely collected and gross outcomes are typically reported; entry effects are neglected; and term (ii) is usually "adjusted" to avoid selection bias when, in fact, no adjustment is needed to estimate the impact of the program on GDP.
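A worked numeric illustration may help fix ideas. The sketch below (all numbers are hypothetical) assembles parameters (i)-(iii) into the marginal GDP effect implied by (3.12) when intra-sector mean outputs do not respond to the policy:

```python
# Hypothetical numbers, purely to illustrate the GDP accounting in (3.12)
# when the within-sector derivative terms are zero.
dN1_dphi = 1_000.0       # (i)  persons drawn into state 1 per unit of phi (assumed)
mean_gap = 1_500.0       # (ii) E(Y1 | D=1) - E(Y0 | D=0), in dollars (assumed)
dc_dphi = 1_200_000.0    # (iii) social marginal cost of the policy (assumed)

marginal_gdp_effect = dN1_dphi * mean_gap - dc_dphi
print(marginal_gdp_effect)   # 300000.0: this marginal policy change raises output
```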
It is informative to place additional structure on this model. Doing so leads to a representation of a criterion that is widely used in the literature on microeconomic program evaluation, and also establishes a link with the models of program participation used in later sections of this chapter. Assume a binary-choice random utility framework. Suppose that agents make choices based on net utility, and that policies affect participant utility through an additively separable term $k(\varphi)$ that is assumed scalar and differentiable. Net utility is
$$U = X + k(\varphi),$$
where $k$ is monotonic in $\varphi$ and the joint distributions of $(Y_1, X)$ and $(Y_0, X)$ are $F(y_1, x)$ and $F(y_0, x)$, respectively. The underlying variables are assumed to be continuously distributed. In the special case of the Roy model of self-selection (see Heckman and Honoré, 1990, for one discussion), $X = Y_1 - Y_0$, and
$$D = \mathbf{1}(U \geq 0) = \mathbf{1}(X \geq -k(\varphi)),$$
where $\mathbf{1}$ is the indicator function ($\mathbf{1}(Z > 0) = 1$ if $Z > 0$; $= 0$ otherwise).
Then
$$N_1(\varphi) = \bar{N}\Pr(U \geq 0) = \bar{N}\int_{-k(\varphi)}^{\infty} f(x)\,dx$$
and
$$N_0(\varphi) = \bar{N}\Pr(U < 0) = \bar{N}\int_{-\infty}^{-k(\varphi)} f(x)\,dx.$$
Total output is
$$\bar{N}\int_{-\infty}^{\infty}\int_{-k(\varphi)}^{\infty} y_1\, f(y_1, x \mid \varphi)\,dx\,dy_1 + \bar{N}\int_{-\infty}^{\infty}\int_{-\infty}^{-k(\varphi)} y_0\, f(y_0, x \mid \varphi)\,dx\,dy_0 - c(\varphi).$$
Under standard conditions (see, e.g., Royden, 1968), we may differentiate this expression to obtain the marginal change in output with respect to a change in $\varphi$:
$$\begin{aligned} \Delta(\varphi) = {}& \bar{N}k'(\varphi)f_x(-k(\varphi))\left[E(Y_1 \mid D = 1, x = -k(\varphi), \varphi) - E(Y_0 \mid D = 0, x = -k(\varphi), \varphi)\right] \\ & + \bar{N}\left[\int_{-\infty}^{\infty}\int_{-k(\varphi)}^{\infty} y_1\, \frac{\partial f(y_1, x \mid \varphi)}{\partial \varphi}\,dx\,dy_1 + \int_{-\infty}^{\infty}\int_{-\infty}^{-k(\varphi)} y_0\, \frac{\partial f(y_0, x \mid \varphi)}{\partial \varphi}\,dx\,dy_0\right] - \frac{\partial c(\varphi)}{\partial \varphi}. \end{aligned} \tag{3.13}$$
This model has a well-defined margin, $X = -k(\varphi)$, which is the utility of the marginal entrant into the program. The utility of the participant might be distinguished from the objective of the social planner, who seeks to maximize total output. The first set of terms corresponds to the gain arising from the movement of persons at the margin (the term in brackets), weighted by the proportion of the population at the margin, $k'(\varphi)f_x(-k(\varphi))$, times the number of people in the population. This term is the net gain from switching sectors. The expression in brackets in the first term is a limit form of the "local average treatment effect" of Imbens and Angrist (1994), which we discuss further in our treatment of instrumental variables in Section 7.4.3. The second set of terms is the intra-sector change in output resulting from a policy change; it includes both direct and indirect effects. This second set of terms is ignored in most evaluation studies. It describes how people who do not switch sectors are affected by the policy. The third term is the direct marginal social cost of the policy change. It includes the cost of administering the program plus the opportunity cost of the consumption foregone to raise the taxes used to finance the program. Below we demonstrate the empirical importance of accounting for the full social costs of programs.
At an optimum, $\Delta(\varphi) = 0$, provided standard second-order conditions are satisfied: marginal benefit equals marginal cost. We can use either a cost-based measure of marginal benefit or a benefit-based measure of cost to evaluate the marginal gains or marginal costs of the program, respectively.
Observe that the local average treatment effect is simply the effect of treatment on the treated for persons at the margin ($X = -k(\varphi)$):
$$E(Y_1 \mid D = 1, X = -k(\varphi), \varphi) - E(Y_0 \mid D = 0, X = -k(\varphi), \varphi) = E(Y_1 - Y_0 \mid D = 1, X = -k(\varphi), \varphi). \tag{3.14}$$
This expression is obvious once it is recognized that the set $X = -k(\varphi)$ is the indifference set: persons in that set are indifferent between participating in the program and not participating. The Imbens and Angrist (1994) parameter is a marginal version of the "treatment on the treated" evaluation parameter for gross outcomes. This parameter is one of the ingredients required to produce an evaluation of the impact of a marginal change in the social program on total output, but it ignores costs and the effect of a change in the program on the outcomes of persons who do not switch sectors.⁸
The conventional evaluation parameter,
$$E(Y_1 - Y_0 \mid D = 1, x, \varphi),$$
does not incorporate costs, does not correspond to a marginal change, and includes rents accruing to persons. This parameter is in general inappropriate for evaluating the effect of a policy change on GDP. However, under certain conditions which we now specify, it is informative about the gross gain accruing to the economy from the existence of a program at level $\tilde{\varphi}$ compared to the alternative of shutting it down. This is the information required for an "all or nothing" evaluation of a program.
The appropriate criterion for an all-or-nothing evaluation of a policy at level $\varphi = \tilde{\varphi}$ is
$$\begin{aligned} A(\tilde{\varphi}) = {}& \{N_1(\tilde{\varphi})E(Y_1 \mid D = 1, \varphi = \tilde{\varphi}) + N_0(\tilde{\varphi})E(Y_0 \mid D = 0, \varphi = \tilde{\varphi}) - c(\tilde{\varphi})\} \\ & - \{N_1(0)E(Y_1 \mid D = 1, \varphi = 0) + N_0(0)E(Y_0 \mid D = 0, \varphi = 0)\}, \end{aligned}$$
where $\varphi = 0$ corresponds to the case where there is no program, so that $N_1(0) = 0$ and $N_0(0) = \bar{N}$. If $A(\tilde{\varphi}) > 0$, total output is increased by establishing the program at level $\tilde{\varphi}$.
In the special case where the outcome in the benchmark state "0" is the same whether or not the program exists,
$$E(Y_0 \mid D = 0, \varphi = \tilde{\varphi}) = E(Y_0 \mid D = 0, \varphi = 0). \tag{3.15}$$
This condition defines the absence of general equilibrium effects in the base state, so the no-program state for nonparticipants is the same as the nonparticipation state. Assumption (3.15) is what enables analysts to generalize from partial equilibrium to general equilibrium settings. Recalling that $\bar{N} = N_1(\varphi) + N_0(\varphi)$, when (3.15) holds we have
$$A(\tilde{\varphi}) = N_1(\tilde{\varphi})E(Y_1 - Y_0 \mid D = 1, \varphi = \tilde{\varphi}) - c(\tilde{\varphi}).^9 \tag{3.16}$$
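As a concrete illustration of (3.16), the following sketch (every number is an assumption, not an estimate from the literature) computes the all-or-nothing criterion from the treatment-on-the-treated parameter, the number of participants, and total program cost:

```python
# Hypothetical all-or-nothing evaluation using (3.16); all values assumed.
N1 = 10_000               # participants at program level phi-tilde
tt_effect = 800.0         # E(Y1 - Y0 | D=1, phi=phi-tilde), in dollars
total_cost = 6_000_000.0  # c(phi-tilde), the full social cost of the program

A = N1 * tt_effect - total_cost
print(A)   # 2000000.0 > 0: establishing the program raises total output
```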
Given costless redistribution of the benefits, the output-maximizing choice of $\varphi$ also maximizes social welfare. For this important case, which is applicable to small-scale social programs with partial participation, the "treatment on the treated" measure on which we focus in this chapter is justified. For evaluating the effect of marginal variation or "fine-tuning" of existing policies, the measure $\Delta(\varphi)$ is more appropriate.¹⁰
⁸ Heckman and Smith (1998a) and Heckman (1997) present comprehensive discussions of the Imbens and Angrist (1994) parameter. We discuss this parameter further in Section 7. One important difference between their parameter and the traditional treatment-on-the-treated parameter is that the latter excludes variables like $\varphi$ from the conditioning set, while the Imbens-Angrist parameter includes them.
⁹ Condition (3.15) is stronger than what is required to justify (3.16). The condition only has to hold for the subset of the population ($N_0(\varphi)$ in number) who would not participate in the presence of the program.
¹⁰ Björklund and Moffitt (1987) estimate both the marginal gross gain and the average gross gain from participating in a program. However, they do not present estimates of marginal or average costs.
4 Prototypical Solutions to the Evaluation Problem
An evaluation entails making some comparison between "treated" and "untreated" persons. This section considers three widely used comparisons for estimating the impact of treatment on the treated, $E(Y_1 - Y_0 \mid X, D = 1)$. All use some form of comparison to construct the required counterfactual $E(Y_0 \mid X, D = 1)$. Data on $E(Y_1 \mid X, D = 1)$ are available from program participants. A person who has participated in a program is paired with an "otherwise comparable" person or set of persons who have not participated in it; the set may contain just one person. In most applications of the method, the paired partner is not literally assumed to be a replica of the treated person in the untreated state, although some panel data evaluation estimators make such an assumption. Thus, in general, $\Delta = Y_1 - Y_0$ is not estimated exactly. Instead, the outcome of the paired partners is treated as a proxy for $Y_0$ for the treated individual, and the population mean difference between treated and untreated persons is estimated by averaging over all pairs. The method can be applied symmetrically to nonparticipants to estimate what they would have earned had they participated. For that problem the challenge is to find $E(Y_1 \mid X, D = 0)$, since the data on nonparticipants identify $E(Y_0 \mid X, D = 0)$.

A major difficulty with the application of this method is providing some objective way of demonstrating that a candidate partner or set of partners is "otherwise comparable." Many econometric and statistical methods are available for adjusting differences between persons receiving treatment and potential matching partners; we discuss them in Section 7.
4.1 The Before-After Estimator
In the empirical literature on program evaluation, the most commonly used evaluation strategy compares a person with himself or herself. This comparison strategy is based on longitudinal data. It exploits the intuitively appealing idea that persons can be in both states at different times, and that outcomes measured in one state at one time are good proxies for outcomes in the same state at other times, at least for the no-treatment state. This idea motivates the simple "before-after" estimator, which is still widely used. Its econometric descendant is the fixed effect estimator without a comparison group.

The method assumes access either (i) to longitudinal data on outcomes measured before and after a program for a person who participates in it, or (ii) to repeated cross-section data from the same population where at least one cross section is from a period prior to the program. To incorporate time into our analysis, we introduce "$t$" subscripts. Let $Y_{1t}$ be the post-program earnings of a person who participates in the program. When longitudinal data are available, $Y_{0t'}$ is the pre-program outcome of the person.
For simplicity, assume that program participation occurs only at time period $k$, where $t > k > t'$. The before-after estimator uses pre-program earnings $Y_{0t'}$ to proxy the no-treatment outcome in the post-program period. In other words, the underlying identifying assumption is
$$E(Y_{0t} - Y_{0t'} \mid D = 1) = 0. \tag{4.A.1}$$
If this assumption is valid, the before-after estimator is given by
$$(\bar{Y}_{1t} - \bar{Y}_{0t'})_1, \tag{4.1}$$
where the subscript "1" denotes conditioning on $D = 1$ and the bars denote sample means.

To see how this estimator works, observe that for each individual the gain from the program may be written as
$$Y_{1t} - Y_{0t} = (Y_{1t} - Y_{0t'}) + (Y_{0t'} - Y_{0t}).$$
The second term, $(Y_{0t'} - Y_{0t})$, is the approximation error. If this term averages out to zero, we may estimate the impact of participation on those who participate in a program by subtracting participants' mean pre-program earnings from their mean post-program earnings. These means may also be defined for different values of participants' characteristics, $X$.
The before-after estimator does not literally require longitudinal data to identify the means (Heckman and Robb, 1985a,b). As long as the approximation error averages out, repeated cross-sectional data that sample the same population over time, but not necessarily the same persons, suffice to construct a before-after estimate. An advantage of this approach is that it requires information only on participants and their pre-participation histories to evaluate the program.
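As a concrete illustration, the following sketch (hypothetical data and variable names) computes the before-after estimate (4.1) from a participant sample with pre- and post-program earnings; assumption (4.A.1) holds by construction here, so the estimate is close to the assumed true effect:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Hypothetical participant earnings: pre-program (t') and post-program (t).
y_pre = rng.normal(10_000, 2_000, n)                      # Y_{0t'} for D = 1
true_effect = 1_500.0                                     # assumed impact
y_post = y_pre + true_effect + rng.normal(0, 2_000, n)    # Y_{1t} for D = 1

# Before-after estimate (4.1): mean post minus mean pre among participants.
print(y_post.mean() - y_pre.mean())   # close to 1500 since (4.A.1) holds here
```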
The major drawback of this estimator is its reliance on the assumption that the approximation errors average out. This assumption requires that, among participants, the mean outcome in the no-treatment state be the same in $t$ and $t'$. Changes in the overall state of the economy between $t$ and $t'$, or changes in the life-cycle position of a cohort of participants, can violate this assumption.
A good example of a case in which assumption (4.A.1) is likely violated is provided by the work of Ashenfelter (1978). Ashenfelter observed that prior to enrollment in a training program, participants experience a decline in their earnings. Later research demonstrates that Ashenfelter's "dip" is a common feature of the pre-program earnings of participants in government training programs; see Figures 4.1 to 4.6, which show the dip for a variety of programs in different countries. If this decline in earnings is transitory, and earnings is a mean-reverting process so that the dip is eventually restored even in the absence of participation in the program, and if period $t'$ falls in the period of transitorily low earnings, then the approximation error will not average out. In this case, the before-after estimator overstates the average effect of training on the trained, attributing to the program mean reversion that would have occurred in any event. On the other hand, if the decline is permanent, the before-after estimator is unbiased for the parameter of interest; in that case, any improvement in earnings is properly attributable to the program. Another potential defect of this estimator is that it attributes to the program any trend in earnings due to macro or life-cycle factors.
Two different approaches have been used to address these problems with the before-after estimator. One controversial method generalizes the before-after estimator by making use of many periods of pre-program data and extrapolating from the period before $t'$ to generate the counterfactual state in period $t$. It assumes that $Y_{0t}$ and $Y_{0t'}$ can be adjusted to equality using data on the same person, or the same populations of persons, followed over time. As an example, suppose that $Y_{0t}$ is a function of $t$, or of $t$-dated variables. If we have access to enough data on pre-program outcomes prior to date $t'$ to extrapolate post-program outcomes $Y_{0t}$, and if there are no errors of extrapolation, or if it is safe to assume that such errors average out to zero across persons in period $t$, one can replace the missing data, or at least averages of the missing data, with extrapolated values. This method is appropriate if population mean outcomes evolve as deterministic functions of time or of macroeconomic variables like unemployment. This procedure is discussed further in Section 7.5.¹¹ The second approach augments the before-after estimator with a comparison group of nonparticipants; it leads to the difference-in-differences estimator, which we discuss next.
4.2 The Difference-in-Differences Estimator
A more widely used approach to the evaluation problem assumes access either (i) to longitudinal data or (ii) to repeated cross-section data on nonparticipants in periods $t$ and $t'$. If the mean change in the no-program outcome measure is the same for participants and nonparticipants, i.e., if the following assumption is valid:
$$E(Y_{0t} - Y_{0t'} \mid D = 1) = E(Y_{0t} - Y_{0t'} \mid D = 0), \tag{4.A.2}$$
then the difference-in-differences estimator,
$$(\bar{Y}_{1t} - \bar{Y}_{0t'})_1 - (\bar{Y}_{0t} - \bar{Y}_{0t'})_0, \qquad t > k > t', \tag{4.2}$$
is valid for $E(\Delta_t \mid D = 1) = E(Y_{1t} - Y_{0t} \mid D = 1)$, where $\Delta_t = Y_{1t} - Y_{0t}$, because
$$E[(\bar{Y}_{1t} - \bar{Y}_{0t'})_1 - (\bar{Y}_{0t} - \bar{Y}_{0t'})_0] = E(\Delta_t \mid D = 1).^{12}$$
¹¹ See also Heckman and Robb (1985a), pp. 210-215.
¹² The proof is immediate. Make the following decomposition:
$$(\bar{Y}_{1t} - \bar{Y}_{0t'})_1 = (\bar{Y}_{1t} - \bar{Y}_{0t})_1 + (\bar{Y}_{0t} - \bar{Y}_{0t'})_1.$$
The claim follows upon taking expectations.
If assumption (4.A.2) is valid, the change in the outcome measure in the comparison group serves to benchmark common year or age effects among participants.

Because we cannot form the change in outcomes between the treated and untreated states, the expression
$$(Y_{1t} - Y_{0t'})_1 - (Y_{0t} - Y_{0t'})_0$$
cannot be formed for anyone, although we can form one or the other of these terms for everyone. Thus, we cannot use the difference-in-differences estimator to identify the distribution of gains without making further assumptions.¹³ Like the before-after estimator, the difference-in-differences estimator for means (4.2) can be implemented on repeated cross sections. It is not necessary to sample the same persons in periods $t$ and $t'$, just persons from the same populations.
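The following minimal sketch (hypothetical data; all magnitudes assumed) implements (4.2) as four sample means, letting a common macro trend difference out between participants and nonparticipants:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

trend = 700.0          # common year effect, differenced out under (4.A.2)
true_effect = 1_500.0  # assumed impact of treatment on the treated

# Participants (D = 1) and nonparticipants (D = 0), pre (t') and post (t).
y1_pre = rng.normal(10_000, 2_000, n)
y1_post = y1_pre + trend + true_effect + rng.normal(0, 500, n)
y0_pre = rng.normal(11_000, 2_000, n)
y0_post = y0_pre + trend + rng.normal(0, 500, n)

did = (y1_post.mean() - y1_pre.mean()) - (y0_post.mean() - y0_pre.mean())
print(did)   # close to 1500; a before-after estimate would be biased by the trend
```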
Ashenfelter's dip provides an example of a case in which assumption (4.A.2) is likely to be violated. If $Y$ is earnings, $t'$ is measured at the time of a transitory earnings dip, and nonparticipants do not experience the dip, then (4.A.2) is violated, because the time path of no-program earnings between $t'$ and $t$ differs between participants and nonparticipants. In this example, the difference-in-differences estimator overstates the average impact of training on the trainees.
4.3 The Cross-Section Estimator
A third estimator compares the mean outcomes of participants and nonparticipants at time $t$. This estimator is sometimes called the cross-section estimator. It does not compare the same persons, because by hypothesis a person cannot be in both states at the same time. Because of this fact, cross-section estimators cannot estimate the distribution of gains unless additional assumptions are invoked beyond those required to estimate mean impacts.

The key identifying assumption for the cross-section estimator of the mean is that
$$E(Y_{0t} \mid D = 1) = E(Y_{0t} \mid D = 0), \tag{4.A.3}$$
i.e., that on average, persons who do not participate in the program have the same no-treatment outcome as those who do participate. If this assumption is valid, the cross-section estimator is given by
¹³ One assumption that identifies the distribution of gains is that $(Y_{1t} - Y_{0t})_1$ is independent of $(Y_{0t} - Y_{0t'})_1$ and that the distribution of $(Y_{0t} - Y_{0t'})_1$ is the same as the distribution of $(Y_{0t} - Y_{0t'})_0$. Then the results on deconvolution in Heckman, Smith and Clements (1997) can be applied. See their paper for details.
$$(\bar{Y}_{1t})_1 - (\bar{Y}_{0t})_0. \tag{4.3}$$
This estimator is valid under assumption (4.A.3) because
$$E((\bar{Y}_{1t})_1 - (\bar{Y}_{0t})_0) = E(\Delta_t \mid D = 1).^{14}$$
If persons enter the program based on outcome measures in the post-program state, then assumption (4.A.3) will be violated. The assumption is satisfied if participation in the program is unrelated to outcomes in the no-program state in the post-program period. Thus, it is possible for Ashenfelter's dip to characterize the data on earnings in the pre-program period and yet for (4.A.3) to be satisfied. Moreover, as long as the macro economy and the aging process operate identically on participants and nonparticipants, the cross-section estimator is not vulnerable to the problems that plague the before-after estimator.
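Under (4.A.3), the cross-section estimate is just a contemporaneous mean difference. A minimal sketch with assumed data, in which (4.A.3) holds by construction:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical post-program earnings at time t; participants and
# nonparticipants share the same no-treatment mean, so (4.A.3) holds.
y_treated = rng.normal(11_500, 2_000, 50_000)     # Y_{1t} for D = 1
y_untreated = rng.normal(10_000, 2_000, 50_000)   # Y_{0t} for D = 0

print(y_treated.mean() - y_untreated.mean())      # estimates E(Delta_t | D=1), ~1500
```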
The cross-section estimator (4.3), the difference-in-differences estimator (4.2), and the before-after estimator (4.1) comprise the trilogy of conventional non-experimental evaluation estimators. All of these estimators can be defined conditional on observable characteristics $X$. Conditioning on $X$ or on additional "instrumental" variables makes it more likely that modified versions of assumptions (4.A.3), (4.A.2), or (4.A.1) will be satisfied, but this is not guaranteed. If, for example, the distribution of $X$ characteristics differs between participants ($D = 1$) and nonparticipants ($D = 0$), conditioning on $X$ may eliminate systematic differences in outcomes between the two groups. Using modern nonparametric procedures, it is possible to exploit each of the identifying conditions to estimate nonparametric versions of all three estimators. On the other hand, if the difference between participants and nonparticipants is due to unobservables, conditioning may accentuate, rather than eliminate, the differences between participants and nonparticipants in the no-program state.¹⁵
The three estimators exploit three different principles, but all are based on making some comparison. The assumptions that justify one method will not, in general, justify the others. All of the estimators considered in this chapter exploit one of these three principles. They extend the simple mean differences just discussed by making a variety of adjustments to the means. Throughout the rest of the chapter, we organize our discussion of alternative estimators around how they modify the simple mean differences used in the three intuitive estimators to account for nonstationary environments and different regressors in the different comparison groups. We first consider social experimentation and how it constructs the counterfactuals used in policy evaluations.
¹⁴ Proof: decompose
$$(\bar{Y}_{1t})_1 - (\bar{Y}_{0t})_0 = (\bar{Y}_{1t})_1 - (\bar{Y}_{0t})_1 + (\bar{Y}_{0t})_1 - (\bar{Y}_{0t})_0$$
and take expectations invoking assumption (4.A.3).
¹⁵ Thus, if $|E(Y_0 \mid D = 1) - E(Y_0 \mid D = 0)| = M$, there is no guarantee that $|E(Y_0 \mid D = 1, X) - E(Y_0 \mid D = 0, X)| < M$. For some values of $X$, the gap could widen.
5 Social Experiments
Randomization is one solution to the evaluation problem. Recent years have witnessed increasing use of experimental designs to evaluate North American employment and training programs. This approach has been less common in Europe, though a small number of experiments have been conducted in Britain, Norway, and Sweden. When the appropriate qualifications are omitted, the impact estimates from these social experiments are easy for analysts to calculate and for policymakers to understand (see, e.g., Burtless, 1995). As a result of this apparent simplicity, evidence from social experiments has had an important impact on the design of U.S. welfare and training programs.¹⁶ Because of the importance of experimental designs in this literature, in this section we show how they solve the evaluation problem, describe how they have been implemented in practice, and discuss their advantages and limitations.
5.1 How Social Experiments Solve the Evaluation Problem
An important lesson of this section is that social experiments, like other evaluation methods, provide estimates of the parameters of interest only under certain behavioral and statistical assumptions. To see this, let "*" denote outcomes in the presence of random assignment. Thus, conditional on $X$, for each person we have $(Y_1^*, Y_0^*, D^*)$ in the presence of random assignment and $(Y_1, Y_0, D)$ when the program operates normally without randomization. Let $R = 1$ if a person for whom $D^* = 1$ is randomized into the program and $R = 0$ if the person is randomized out. Thus, $R = 1$ corresponds to the experimental treatment group and $R = 0$ to the experimental control group.

The essential assumption required to use randomization to solve the evaluation problem for estimating the mean effect of treatment on the treated is that
$$E(Y_1^* - Y_0^* \mid X, D^* = 1) = E(Y_1 - Y_0 \mid X, D = 1). \tag{5.A.1}$$
A stronger set of conditions, not strictly required, is
$$E(Y_1^* \mid X, D^* = 1) = E(Y_1 \mid X, D = 1) \tag{5.A.2a}$$
and
$$E(Y_0^* \mid X, D^* = 1) = E(Y_0 \mid X, D = 1). \tag{5.A.2b}$$
Assumptions (5.A.2a) and (5.A.2b) state that the means from the treatment and control groups generated by random assignment produce the desired population parameters. With certain exceptions discussed below, these assumptions rule out changes in the impact of participation due to the presence of random assignment, as well as changes in the process of program participation. The first of these assumptions can in principle be tested by comparing the outcomes of participants under a regime of randomization with the outcomes of participants under the usual regime.

¹⁶ We discuss this evidence in Section 10.
If (5.A.2a) is true, then for the population with $D = 1$ and $R = 1$ we can identify
$$E(Y_1 \mid X, D = 1, R = 1) = E(Y_1 \mid X, D = 1).$$
Under (5.A.2a), information sufficient to estimate this mean without bias is routinely produced from data collected on participants in social programs. The new information produced by an experiment comes from those randomized out of the program. Using the experimental control group, it is possible to estimate
$$E(Y_0 \mid X, D = 1, R = 0) = E(Y_0 \mid X, D = 1).$$
Thus, experiments produce data that satisfy assumption (4.A.3), and simple application of the cross-section estimator identifies
$$E(\Delta \mid X, D = 1) = E(Y_1 - Y_0 \mid X, D = 1).$$
Within the context of the model of equation (3.10), an experiment that satisfies (5.A.1), or (5.A.2a) and (5.A.2b), does not make $D$ orthogonal to $U$. It simply equates the bias in the two groups $R = 1$ and $R = 0$. Thus, in the model of equation (3.1), under (5.A.2a),
$$E(Y \mid X, D = 1, R = 1) = g_1(X) + E(U_1 \mid X, D = 1)$$
and
$$E(Y \mid X, D = 1, R = 0) = g_0(X) + E(U_0 \mid X, D = 1).^{17}$$
Rewriting the first conditional mean, we obtain
$$E(Y \mid X, D = 1, R = 1) = g_1(X) + E(U_1 - U_0 \mid X, D = 1) + E(U_0 \mid X, D = 1).$$
Subtracting the second mean from the first eliminates the common selection-bias component $E(U_0 \mid X, D = 1)$, so
$$E(Y \mid X, D = 1, R = 1) - E(Y \mid X, D = 1, R = 0) = g_1(X) - g_0(X) + E(U_1 - U_0 \mid X, D = 1).$$
When model (3.1) is specialized to one of intercept differences, as in (3.10), this parameter simplifies to $\alpha$. Notice that the method of social experiments does not set either $E(U_1 \mid X, D = 1)$ or $E(U_0 \mid X, D = 1)$ equal to zero. Rather, it balances the selection bias in the treatment and control groups.
¹⁷ Notice that in this section we allow for the more general model $Y_0 = g_0(X) + U_0$, $Y_1 = g_1(X) + U_1$, where $E(U_0 \mid X) \neq 0$ and $E(U_1 \mid X) \neq 0$.
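A small simulation (hypothetical numbers throughout) illustrates the balancing argument: selection on $U_0$ biases the nonexperimental cross-section contrast, while randomizing the $D^* = 1$ applicants into treatment and control arms leaves the same selection bias in both arms, so the experimental contrast recovers the effect of treatment on the treated:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400_000

u0 = rng.normal(0.0, 1.0, n)
u1 = u0 + rng.normal(1.0, 1.0, n)           # assumed mean gain of 1.0
d = (u0 + rng.normal(0.0, 1.0, n) > 1.0)    # selection on U0: participants differ in levels

y1, y0 = 2.0 + u1, 1.0 + u0                 # g1(X) = 2, g0(X) = 1 (scalars for simplicity)
tt = (y1 - y0)[d].mean()                    # true effect of treatment on the treated

# Nonexperimental contrast: contaminated by E(U0 | D=1) - E(U0 | D=0).
naive = y1[d].mean() - y0[~d].mean()

# Experiment: randomize the D = 1 applicants into R = 1 (treated) and R = 0 (control).
r = rng.binomial(1, 0.5, d.sum()).astype(bool)
experimental = y1[d][r].mean() - y0[d][~r].mean()

print(f"TT: {tt:.3f}  naive: {naive:.3f}  experimental: {experimental:.3f}")
```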
Stronger assumptions must be made to identify the distribution of impacts, $F(\Delta \mid D = 1)$. Without invoking further assumptions, data from experiments, like data from nonexperimental sources, cannot identify the distribution of impacts, because the same person is not observed in both states at the same time (Heckman, 1992; Heckman, Smith and Clements, 1997; Heckman and Smith, 1993, 1995, 1998a).
If assumption (5.A.1), or assumptions (5.A.2a) and (5.A.2b), fail to hold because the program participation probabilities are affected, so that $D^*$ and $D$ differ, then the composition of the participant population differs in the presence of random assignment. In two important special cases, experimental data still provide unbiased estimates of the effect of treatment on the treated. First, if the effect of training is the same for everyone, changing the composition of the participants has no effect, because the parameter of interest is the same for all possible participant populations (Heckman, 1992). This assumption is sometimes called the common treatment effect assumption and, letting $i$ denote a variable value for individual $i$, may be formally expressed as
$$Y_{1i} - Y_{0i} = \Delta_i \equiv \Delta \quad \text{for all } i. \tag{5.A.3}$$
This assumption is equivalent to setting $U_1 = U_0$ in (3.9). Assumption (5.A.3) can be defined conditionally on observed characteristics, so we may write $\Delta = \Delta(X)$. Notice, however, that in this case, if randomization induces persons with certain $X$ values not to participate in the program, then estimates of $\Delta(X)$ can be obtained only for the $X$ values possessed by persons who participate in the program. In this case (5.A.1) is satisfied but (5.A.2a) and (5.A.2b) are not.
The second special case in which experimental data still provide unbiased estimates of the effect of treatment on the treated arises when decisions about training are not affected by the realized gain from participating in the program. This case could arise if potential trainees know $E(\Delta \mid X)$ but not $\Delta$ at the time participation decisions are made. Formally, the second condition is
$$E(\Delta \mid X, D = 1) = E(\Delta \mid X), \tag{5.A.4}$$
which is equivalent to condition (3.11) in the model (3.9). If either (5.A.3) or (5.A.4) holds, the simple experimental mean-difference estimator is unbiased for $E(\Delta \mid X, D = 1)$.
Randomization improves on the nonexperimental cross-section estimator even if there is no selection bias. In an experiment, for all values of $X$ for which $D = 1$, one can identify
$$E(\Delta \mid X, D = 1) = E(Y_1 - Y_0 \mid X, D = 1).^{18}$$
Using assumption (4.A.3) in an ordinary nonexperimental evaluation, there may be values of $X$ such that $\Pr(D = 1 \mid X) = 1$; that is, there may be values of $X$ with no comparison group members. Randomization avoids this difficulty by balancing the distribution of $X$ values in the treatment and control groups (Heckman, 1996). At the same time, however, random assignment conditional on $D = 1$ cannot provide estimates of $\Delta(X)$ for values of $X$ such that $\Pr(D = 1 \mid X) = 0$.

¹⁸ Replace "$E$" with "$F$" in (5.A.2a) and (5.A.2b) to obtain one necessary condition.
The stage of potential program participation at which randomization is applied (eligibility, application, or acceptance into a program) determines what can be learned from a social experiment. For randomization conditional on acceptance into a program ($D = 1$), we can estimate the effect of treatment on the treated,
$$E(\Delta \mid X, D = 1) = E(Y_1 - Y_0 \mid X, D = 1),$$
using simple experimental means. We cannot estimate the effect of randomly selecting a person to go into the program,
$$E(\Delta \mid X) = E(Y_1 - Y_0 \mid X),$$
by using simple experimental means unless one of two conditions prevails. The first condition is the common effect assumption (5.A.3); this assumption is explicit in the widely used dummy endogenous variable model (Heckman, 1978). The second condition is that embodied in assumption (5.A.4): participation decisions are independent of the person-specific component of the impact. In both cases, the mean impact of treatment on a randomly selected person is the same as the mean impact of treatment on the treated.
In the general case, it is difficult to estimate the effect of randomly assigning a person with characteristics $X$ to go into a program, because persons randomized into a program cannot be compelled to participate in it. In order to secure compliance, it may be necessary to compensate or persuade persons to participate. For example, in many U.S. social experiments, program operators threaten to reduce participants' social assistance benefits if they refuse to participate in training. Such actions, even if successful, alter the environment in which persons operate and may make it impossible to estimate $E(\Delta \mid X)$ using experimental means. One assumption that guarantees compliance is the existence of a "compensation" or "punishment" level $c$ such that
$$\Pr(D = 1 \mid X, c) = 1 \tag{5.A.5a}$$
and
$$E(\Delta \mid X, c) = E(\Delta \mid X). \tag{5.A.5b}$$
The first part of the assumption guarantees that a person with characteristics $X$ can be "bribed" or "persuaded" to participate in the program. The second part guarantees that the compensation $c$ does not affect the outcome being evaluated.¹⁹ If $c$ is a monetary payment, it would be optimal from the standpoint of an experimental analyst to find the minimal value of $c$ that satisfies these conditions.

¹⁹ Observe that the value of $c$ is not necessarily unique.
Randomization of eligibility is sometimes proposed as a less disruptive alternative to randomization conditional on $D = 1$. Randomizing eligibility avoids the application and screening costs that are incurred when accepted individuals are randomized out of a program. Because the randomization is performed outside of training centers, it also avoids some of the political costs that have accompanied the use of the experimental method.

Consider a population of persons who are ordinarily eligible for the program, and randomize eligibility within this population. Let $e = 1$ if a person retains eligibility and $e = 0$ if a person becomes ineligible. Assume that eligibility does not disturb the underlying structure of the random variables $(Y_0, Y_1, D, X)$ and that $\Pr(D = 1 \mid X) \neq 0$. Then Heckman (1996) shows that
$$\frac{E(Y \mid X, e = 1) - E(Y \mid X, e = 0)}{\Pr(D = 1 \mid X, e = 1)} = E(\Delta \mid X, D = 1).$$
Randomization of eligibility produces samples that can be used to identify $E(\Delta \mid X, D = 1)$ and also to recover $\Pr(D = 1 \mid X)$; the latter is not recovered from samples that condition on $D = 1$ (Heckman, 1992; Moffitt, 1992). Without additional assumptions of the sort previously discussed, randomization of eligibility will not, in general, identify $E(\Delta \mid X)$.
5.2 Intention to Treat and Substitution Bias
The objective of most experimental designs is to estimate the conditional mean impact of training, $E(\Delta \mid X, D = 1)$. However, in many experiments a significant fraction of the treatment group drops out of the program and does not receive the services being evaluated.²⁰ In general, in the presence of dropping out, $E(\Delta \mid X, D = 1)$ cannot be identified using comparisons of means. Instead, the experimental mean difference estimates the mean effect of the offer of treatment, or what is sometimes called the "intent to treat." For many purposes this is the policy-relevant parameter: it is informative about how the availability of a program affects participant outcomes. Attrition is a normal feature of an ongoing program.

²⁰ Using the analysis in the preceding subsection, dropping out by experimental treatment group members could be reduced by compensating them for completing training.

To obtain an estimate of the impact of training on those who actually receive it, additional assumptions are required beyond (5.A.1), or (5.A.2a) and (5.A.2b). Let $T$ be an indicator for actual receipt of treatment, with $T = 1$ for persons actually receiving training and $T = 0$ otherwise. Let $T^*$ be a similarly defined latent variable for control group members, indicating whether or not they would have actually received training had they been in the treatment group. Define
$$E(\Delta \mid X, D = 1, R = 1, T = 1) = E(\Delta \mid X, D = 1, T = 1)$$
as the mean impact of training on those members of the treatment group who actually receive it. This parameter equals the original parameter of interest, $E(\Delta \mid X, D = 1)$, only in the special cases where (5.A.3), the common effect assumption, holds, or where an analog to (5.A.4) holds, so that the decision of treatment group members to drop out is independent of $\Delta - E(\Delta)$, the person-specific component of their impact.
A consistent estimate of the impact of training on those who actually received it can be obtained under the assumption that the mean outcome of the treatment group dropouts is the same as that of their analogs in the control group, so that
$$E(Y \mid X, D = 1, R = 1, T = 0) = E(Y \mid X, D = 1, R = 0, T^* = 0). \tag{5.A.6}$$
Note that this assumption rules out situations where the treatment group dropouts receive potentially valuable partial treatment. Under (5.A.6),
$$\frac{E(Y \mid X, D = 1, R = 1) - E(Y \mid X, D = 1, R = 0)}{\Pr(T = 1 \mid X, D = 1, R = 1)} \tag{5.1}$$
identifies the mean impact of training on those who receive it.²¹ This estimator scales up the experimental mean difference by the fraction of the treatment group receiving training. When all treatment group members receive training, the denominator equals one and the estimator reduces to the simple experimental mean difference. Estimator (5.1) also shows that the simple mean-difference estimator is downward biased for the mean impact of training on the trained when there are dropouts from the treatment group, because the denominator then lies between zero and one. Heckman, Smith and Taber (1998) present methods for estimating distributions of outcomes and for testing the identifying assumptions in the presence of dropping out. They present evidence on the validity of the assumptions that justify (5.1) in the National JTPA Study data.
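A minimal sketch of the scale-up in (5.1), the estimator associated with Bloom (1984), using assumed sample moments:

```python
# Hypothetical experimental moments illustrating (5.1).
mean_treatment_group = 10_900.0   # E(Y | D=1, R=1), includes dropouts
mean_control_group = 10_000.0     # E(Y | D=1, R=0)
receipt_rate = 0.60               # Pr(T=1 | D=1, R=1): fraction actually trained

intent_to_treat = mean_treatment_group - mean_control_group
impact_on_trained = intent_to_treat / receipt_rate
print(intent_to_treat, impact_on_trained)   # 900.0 scaled up to 1500.0
```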
In an experimental evaluation, the converse problem can also arise for the control group. In an ideal experiment, no control group member would receive either the experimental treatment or a close substitute for it from other sources. In practice, a significant fraction of controls often receives similar services from other sources. In this situation, the mean earnings of control group members no longer correspond to $E(Y_0 \mid X, D = 1)$, and neither the experimental mean-difference estimator nor the adjusted estimator (5.1) identifies the impact of training relative to no training for those who receive it. However, under certain conditions discussed in Section 3, the experimental estimate can be interpreted as the mean incremental effect of the program relative to a world in which it does not exist.

²¹ See, e.g., Mallar (1978), Bloom (1984) and Heckman, Smith and Taber (1998).
As in the case of treatment group dropouts, identifying the impact of training on the trained in the presence of control group substitution requires additional assumptions beyond (5.A.1), or (5.A.2a) and (5.A.2b). Let $S = 1$ denote control group members receiving substitute training from alternative sources, let $S = 0$ denote control group members receiving no training, and let $Y_2$ be the outcome conditional on receipt of alternative training.

Consider the general case with both treatment group dropping out and control group substitution. In this context, one approach is to invoke the assumptions required to apply the non-experimental techniques described in Section 7 to the treatment group data, in order to estimate the impact of the training being evaluated on those who receive it. Heckman, Hohmann, Khoo and Smith (1998) employ this and other strategies using data from the National JTPA Study.
Alternatively, two other assumptions allow use of the control group data to estimate the impact of training on the trained. The first is a generalized common effect assumption; restoring the subscript $i$ to distinguish individuals,
$$Y_{1i} - Y_{0i} = Y_{2i} - Y_{0i} = \Delta_i \equiv \Delta \quad \text{for all } i. \tag{5.A.3′}$$
This assumption states (a) that the impact of the program being evaluated is the same as the impact of substitute programs for each person, and (b) that all persons respond in exactly the same way to the program (a common effect assumption). The second assumption is a generalized version of (5.A.4):
$$E(Y_1 - Y_0 \mid X, D = 1, T = 1, R = 1) = E(Y_2 - Y_0 \mid X, D = 1, S = 1, R = 0). \tag{5.A.4′}$$
This assumption states that the mean impact of the training being evaluated on treatment group members who do not drop out equals the mean impact of substitute training on those control group members who receive it. Both (5.A.3′) and (5.A.4′) are strong assumptions. To be plausible, either would require evidence that the training received by treatment group members was similar in content and duration to that received by control group members. Note that (5.A.3′) implies (5.A.4′). Under either assumption, the ratio
$$\frac{E(Y \mid X, D = 1, R = 1) - E(Y \mid X, D = 1, R = 0)}{\Pr(T = 1 \mid X, D = 1, R = 1) - \Pr(S = 1 \mid X, D = 1, R = 0)} \tag{5.2}$$
identifies the mean impact of training on those who receive it, in both the experimental treatment and control groups, provided that the denominator is not zero. The similarity of estimator (5.2) to the instrumental variable estimator defined in Section 7 is not accidental: under assumption (5.A.3′) or (5.A.4′), random assignment is a valid instrument for training, because it is correlated with training receipt but not with any other determinants of the outcome $Y$. Without one of these assumptions, random assignment is not, in general, a valid instrument (Heckman, 1997; Heckman, Hohmann, Khoo and Smith, 1998). To see this point, consider a model in which individuals know their gain from training, but the treatment group, because it has access to the program being evaluated, faces a lower cost of training. In this case, controls are less likely to be trained, but the mean gross impact would be larger among control trainees than among treatment trainees. Drawing on the analysis of Section 7, this correlation violates the condition required for the IV estimator to identify the parameter of interest.
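With control group substitution, the scale-up in (5.1) generalizes to the Wald-type ratio in (5.2). A sketch with assumed moments:

```python
# Hypothetical moments illustrating (5.2) with control group substitution.
mean_treatment_group = 10_900.0   # E(Y | D=1, R=1)
mean_control_group = 10_000.0     # E(Y | D=1, R=0)
treatment_receipt = 0.60          # Pr(T=1 | D=1, R=1)
control_substitution = 0.30       # Pr(S=1 | D=1, R=0)

impact = (mean_treatment_group - mean_control_group) / (treatment_receipt - control_substitution)
print(impact)   # 900 / 0.30 = 3000.0: impact of training relative to no training
```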
5.3 Social Experiments in Practice
In this subsection we discuss how social experiments operate in practice. We present empirical evidence on some of the theoretical issues surrounding social experiments discussed in the preceding subsections and provide a context for the discussion of the experimental evidence on the impact of training in Section 10. To make the discussion concrete, we focus on two of the best-known U.S. social experiments: the National Supported Work (NSW) Demonstration (Hollister et al., 1984) and the recent National JTPA Study (NJS).²² We begin with a brief discussion of the implementation of these two experiments.

²² See, among others, Doolittle and Traeger (1990), Bloom et al. (1993) and Orr et al. (1994).
5.3.1 Two Important Social Experiments
The NSW Demonstration was one of the first employment and training experiments. It tested the effect of 9 to 18 months of guaranteed work experience in unskilled occupations on groups of long-term AFDC (welfare) recipients, ex-drug addicts, ex-criminal offenders, and economically disadvantaged youths at 10 sites across the U.S. These jobs were in a sheltered environment in which productivity standards were gradually raised over time and participants met frequently with program counselors to discuss grievances and performance.

The NSW enrollment process began with a referral, usually by a welfare agency, drug rehabilitation agency, or prisoners' assistance society. Program operators then interviewed potential participants and eliminated any persons that they believed "would be disruptive to their programs" (Hollister et al., 1984, p. 35). Following this screening, a third party randomly assigned one-half of the qualified applicants to the treatment group. The remainder were assigned to the control group and prevented from receiving NSW services. Although the controls could not receive NSW services, program administrators could not prevent them from receiving other training services in their community, such as those offered under another widely available training program with the acronym CETA. Follow-up data on the experimental treatment and control groups were collected via both surveys and administrative earnings records.

In contrast to the NSW, the NJS sought to evaluate the effectiveness of an ongoing training program. From the start, the goal of evaluating an ongoing program without significantly disrupting its operations, and thereby violating assumption (5.A.1) or assumptions (5.A.2a) and (5.A.2b), posed significant problems. The first of these arose
in selecting the training centers at which random assignment would take place. Initially, evaluators planned to use a random sample of the nearly 600 U.S. JTPA training sites. Randomly choosing the evaluation sites would enhance the "external validity" of the experiment, the extent to which its findings can be generalized to the population of JTPA training centers. Yet it was difficult to persuade local administrators to participate in an evaluation that required them to deny services at random to eligible applicants. When only four of the randomly selected sites or their alternates agreed to participate, the study was redesigned to include a "diverse" group of 16 centers willing to participate in a random assignment study (see Doolittle and Traeger, 1990, or the summary of their analysis presented in Hotz, 1992). Evaluators had to contact 228 JTPA training centers to obtain these sixteen volunteers.²³ The option of forcing centers to participate was rejected because of the importance of securing the cooperation of local administrators in preserving the integrity of random assignment. Such concerns are not without foundation: the integrity of an experimental training evaluation in Norway was undermined by the behavior of local operators (Torp et al., 1993).
Concerns about disrupting normal program operations, and thereby violating (5.A.1) or (5.A.2a)-(5.A.2b), also led to an unusual approach to the evaluation of the specific service types provided by JTPA. This program offers a personalized mix of employment and training services, including all those listed in Table 2.1 with the exception of public service employment. During their enrollment in the program, participants may receive two or more of these services in sequence, where the sequence may depend on the participant's success or failure in the services provided first. As a result of this heterogeneous, fluid structure, it was impossible to conduct random assignment conditional on (planned) receipt of particular services or sets of services without changing the character of the program. Instead, JTPA staff recommended particular services for each potential participant prior to random assignment, and impact estimates were calculated conditional on these recommendations. In particular, the recommendations were grouped into three "treatment streams": the "CT-OS stream," which included persons recommended for classroom training (CT), and possibly other services (OS), but not on-the-job training (OJT); the "OJT stream," which included persons recommended for OJT (and possibly other services) but not CT; and the "other stream," which included the rest of the admitted applicants, most of whom ended up receiving only job search assistance. Note that this issue did not arise in the NSW, which provided a single service to all of its participants. In the NJS, follow-up data on earnings, employment and other outcomes were obtained from both surveys and multiple administrative data sources.

²³ Very large training centers (e.g., Los Angeles) and small, rural centers were excluded from the study design from the outset of the center enrollment process, for administrative and cost reasons, respectively. The final set of 16 training centers received a total of $1 million in payments to cover the cost of participating in the experiment.
5.3.2 The Practical Importance of Dropping Out and Substitution
The most important problems affecting social experiments are treatment group dropout and control group substitution. These problems are not unique to experiments: persons drop out of programs whether or not they are experimentally evaluated, and there is no evidence that the rate of dropping out increases during an experimental evaluation. Most programs have good substitutes, so that the effect of a program as typically estimated is relative to the full range of activities in which nonparticipants engage. Experiments exacerbate this problem by creating a pool of persons who attempted to take training and who then flock to substitute programs when they are placed in the experimental control group.

Table 5.1 demonstrates the practical importance of these problems in experimental evaluations by reporting the rates of treatment group dropout and control group substitution from a variety of social experiments. It reveals that the fraction of treatment group members receiving program services is often less than 0.7, and sometimes less than 0.5. Furthermore, the observed characteristics of the treatment group members who drop out often differ from those of the members who remain and receive program services.²⁴ As for substitution, Table 5.1 shows that as many as 40 percent of the controls in some experiments received substitute services elsewhere. In an ideal experiment, all treatment group members receive the treatment and there is no control group substitution, so that the difference between the fractions of treatments and controls that receive the treatment equals 1.0. In practice, this difference is often well below 1.0.
The extent of both substitution and dropout depends on the characteristics of the treatment being evaluated and on the local program environment. In the NSW, where the treatment was relatively unique and of high enough quality to be clearly perceived as valuable by participants, dropout and substitution rates were low enough to approximate the ideal case. In contrast, in the NJS and other evaluations of programs that provide low-cost services widely available from other sources, substitution and dropout rates are high.²⁵

²⁴ For the NSW, see LaLonde (1984); for the NJS, see Smith (1992).
²⁵ For the NJS, Table 5.1 reveals the additional complication that estimates of the rate of training receipt in the treatment and control groups depend on the data source used to make the calculation. In particular, because many treatment group members do not report training that administrative records show they received, dropout rates measured using only the survey data are substantially higher than those that combine the survey and administrative data. At the same time, because administrative data are not available on control group training receipt (other than for the very small number of persons who defeated the experimental protocol), using only self-report data on controls but the combined data for the treatment group will likely overstate the difference in service receipt levels between the two groups.

In the NJS, the substitution problem is accentuated by the fact that JTPA relies on
outside vendors to provide most of its training. Many of these vendors, such as community
colleges, provide the same training to the general public, often with subsidies from other
government programs such as Pell Grants. In addition, in order to help in recruiting sites
to participate in the NJS, evaluators allowed them to provide control group members with
a list of alternative training providers in the community. Of the 16 sites in the NJS, 14
took advantage of this opportunity to alert control group members to substitute training
opportunities.
To see the effect of high rates of dropping out and substitution on the interpretation of the experimental evidence, consider Project Independence. The unadjusted experimental impact estimate is $264 over the 2-year follow-up period, while application of the IV estimator that uses sample moments in place of (5.2) yields an adjusted impact estimate of $1,100 ($264/0.24). The first estimate indicates the mean impact of the offer of treatment relative to the other employment and training opportunities available in the community. Under assumptions (5.A.3′) or (5.A.4′), the latter estimate indicates the impact of training relative to no training in both the treatment and control groups. Under these assumptions, the high rates of dropping out and substitution suggest that the experimental mean difference estimate is strongly downward biased as an estimate of the impact of treatment on the treated, the primary parameter of policy interest.
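The adjustment is a simple rescaling of the experimental mean difference by the treatment-control gap in service receipt. A minimal sketch of the arithmetic follows; the 0.60 and 0.36 receipt rates are hypothetical, since only their difference, the 0.24 used above, matters:

```python
def bloom_adjusted_impact(itt, receipt_rate_treat, receipt_rate_control):
    """Scale the experimental mean difference (the 'intention to treat' effect)
    by the treatment-control difference in the fraction actually receiving
    services, as in the IV estimator that uses sample moments in place of (5.2)."""
    receipt_gap = receipt_rate_treat - receipt_rate_control
    return itt / receipt_gap

# Project Independence: a $264 mean difference and a 0.24 receipt-rate gap
# imply an adjusted impact of $1,100. The 0.60/0.36 split is illustrative only.
print(bloom_adjusted_impact(264.0, 0.60, 0.36))  # -> 1100.0
```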
A problem unique to experimental evaluations is violation of (5.A.1), or of (5.A.2a) and (5.A.2b), which produces what Heckman (1992) and Heckman and Smith (1993, 1995) call "randomization bias." In the NJS, this problem took the form of concerns that expanding the pool of accepted applicants, which was required to keep the number of participants at normal levels while creating a control group, would change the process of selection of persons into the program. Specifically, training centers were concerned that the additional recruits brought in during the experiment would be less motivated and harder to train, and would therefore benefit less from the program. Concerns about this problem were frequently cited by training centers that declined to participate in the NJS (Doolittle and Traeger, 1990). To partially allay these concerns, random assignment was changed from the 1:1 ratio that minimizes the sampling variance of the experimental impact estimator to a 2:1 ratio of treatments to controls.
Although we have no direct evidence on the empirical importance of changes in participation patterns for measured outcomes during the NJS, there is some indirect evidence about the validity of (5.A.1) or (5.A.2a) and (5.A.2b) in this instance. First of all, a number of training centers in the NJS streamlined their intake processes during the experiment, sometimes with the help of an intake consulting firm whose services were subsidized as part of the evaluation. In so doing, they generally reduced the number of visits and other costs paid by potential trainees, thereby including among those randomly assigned less motivated persons than were normally served. Second, some training centers asked for, and received, additional temporary reductions in the random assignment ratio during the course of the experiment when they experienced difficulties recruiting sufficient qualified applicants to keep the program operating at normal levels.
A second problem unique to experiments involves obtaining experimental estimates of the effects of individual components of services provided in sequence as part of a single program. Experimental designs can readily determine how access to a bundle of services affects participants' earnings. More difficult is the question of how participation at each stage influences earnings when participants can drop out during the sequence. Providing an experimental answer to this question requires randomization at each stage of the sequence.26 In a program with several stages, this would lead to a proliferation of treatments and either large (and costly) samples or insufficient sample sizes. In practice, such sequential randomization has not been attempted in evaluating job training programs.

26 Alternatively, in a program with three stages, program administrators might randomly assign eligible participants to one of several treatment groups, with the first group receiving only stage 1 services, the second receiving stage 1 and stage 2 services, and the third receiving services from all three stages. However, a problem may arise with this scheme if participants assigned to the second and third stages of the program at some point decline to participate. In that case, the design described in the text would be more effective.
A final problem unique to experimental designs is that even under ideal conditions, they are unable to answer many questions of interest besides the narrow "treatment on the treated" parameter. For example, it is not possible in practice to obtain simple experimental estimates of the duration of post-random-assignment employment, due to post-random-assignment selection problems (Ham and LaLonde, 1990). An elaborate analysis of self-selection, of the sort that social experiments seek to avoid, is required. As another example, consider estimating the impact of training on wage rates. The problem that arises in this case is that we observe wages only for those employed following random assignment. If the experimental treatment affects employment, then the sample of employed treatment group members will have different observed and unobserved characteristics than the employed controls. In general, we would expect the persons without wages to be less skilled. The experimental impact estimate cannot separate differences between the distributions of observed wages in the treatment and control groups that result from the effect of the program on wage rates from differences that result from its effect on selection into employment. Under these circumstances, only non-experimental methods such as those discussed in Section 7 can provide an answer to the question of interest.
5.3.3 Additional Problems Common to All Evaluations
There are a number of other problems that arise in both social experiments and non-experimental evaluations. Solving these problems in an experimental setting requires analysts to make the same types of choices (and assumptions) that are required in a non-experimental analysis. An important point of this subsection is that experimental impact estimates are sensitive to these choices in the same way as non-experimental estimates. A related concern is that experimental evaluations should, but often do not, include sensitivity analyses indicating the effect of the choices made on the impact estimates obtained.
The first common evaluation problem arises from imperfect data. Different survey instruments can yield different measures of the same variable for the same person in a given time period (see Smith, 1997a,b, and the citations therein). For example, self-reported measures of earnings or welfare receipt from surveys typically differ from administrative measures covering the same period (LaLonde and Maynard, 1987; Bloom, et al., 1993). As we discuss in Section 8, in the case of earnings, data sources commonly used for evaluation research differ in the types of earnings covered, the presence or absence of top-coding and the extent of missing or incorrect values. The evaluator must trade off these factors when choosing which data source to rely on. Whatever the data source used, the analyst must make decisions about how to handle outliers and missing values.
To underscore the point that experimental impacts for the same program can differ due to different choices about data sources and data handling, we compare the impact estimates for the NJS presented in the two official experimental impact reports, Bloom, et al. (1993) and Orr, et al. (1994).27 As shown in Table 5.2, these two reports give substantially different estimates of the impact of JTPA training for the same demographic groups over the same time period. The differences result from different decisions about whom to include in the evaluation sample, how to combine earnings information from surveys and administrative data, how to treat seemingly anomalous reports of overtime earnings in the survey data, and so on. Several of the point estimates differ substantially, as do the implications about the relative effectiveness of the three treatment streams for adult women. The estimated 18-month impact for adult women in the "other services" stream triples from the 18-month impact report to the 30-month impact report, making it the service with the largest estimated impact despite the low average cost of the services provided to persons in this stream.

27 A complete discussion of the impact estimates from the NJS appears in Section 10.
The second problem common to experimental and non-experimental evaluations is sample attrition. Note that sample attrition is not the same as dropping out of the program: both control and treatment group members can attrit from the sample, and treatment group members who drop out of the program will often remain in the data. In the NSW, attrition from the evaluation sample by the 18-month follow-up interview was 10 percent for the adult women, but more than 30 percent for the male participants. In the NJS, sample attrition by the 18-month follow-up was 12 percent for the adult women and approximately 20 percent for the adult males. Such high rates of attrition are common among the disadvantaged due to relatively frequent changes in residence and other difficulties with making follow-up contacts.
Sample attrition poses a problem for experimental evaluations when it is correlated with individual characteristics or with the impact of treatment conditional on characteristics. In practice, persons with poorer labor market characteristics tend to have higher attrition rates (see, e.g., Brown, 1979). Even if attrition affects the treatment and control groups in the same way, the experiment estimates the mean impact of the program only for those who remain in the sample. Usually, attrition rates are both non-random and larger for controls than for treatments. In this case, the experimental estimate of the training effect is biased because individuals' experimental status, R, is correlated with their likelihood of being in the sample. In this setting, experimental evaluations become non-experimental evaluations because evaluators must make some assumption to deal with selection bias.
6 Econometric Models of Outcomes and Program Participation
The economic approach to program evaluation is based on estimating behavioral relationships that can be applied to evaluate policies not yet implemented. A focus on invariant behavioral relationships is the cornerstone of the econometric approach. Economic relationships provide frameworks within which empirical knowledge can be accumulated across different studies. They offer guidance on the specification of empirical relationships for any given study and on the type of data required to estimate a behaviorally-motivated evaluation model. Alternative empirical evaluation strategies can be judged, in part, by the economic justification for them. Estimators that make economically implausible or empirically unjustified assumptions about behavior should receive little support.

The approach to evaluation guided by economic models stands in contrast with the case-by-case approach of statistics, which at best offers intuitive frameworks for motivating estimators. The emphasis in statistics is on particular estimators and not on the models motivating the estimators. The output of such case-by-case studies often does not cumulate. Since no articulated behavioral theory is used in this approach, it is not helpful in organizing evidence across studies or in suggesting explanatory variables or behaviorally-motivated empirical relationships for a given study. It produces estimated parameters that are very difficult to use in answering well-posed evaluation questions.
All economic evaluation models have two ingredients: (a) a model of outcomes and (b) a model of program participation. This section presents several prototypical econometric models. The first was developed by Heckman (1978) to rationalize the evidence in Ashenfelter (1978). The second rationalizes the evidence presented in Heckman and Smith (1998b) and Heckman, Ichimura, Smith and Todd (1998).
6.1 Uses of Economic Models

There are several distinct uses of economic models. (1) They suggest lists of explanatory variables that might belong in both the outcome and participation equations. (2) They sometimes suggest plausible "exclusion restrictions" - variables that influence participation but do not directly influence outcomes - that can be used to help identify models in the presence of self-selection by participants. (3) They sometimes suggest specific functional forms for estimating equations, motivated by a priori theory or by cumulated empirical wisdom.
6.2 Prototypical Models of Earnings and Program Participation

To simplify the discussion, and to start where the published literature currently stops, assume that persons have only one period in their lives - period k - in which they have the chance to take job training. From the beginning of economic life, t = 1, up through t = k, persons have one outcome associated with the no-training state "0":

Y_{0j},  j = 1, ..., k.

After period k, there are two potential outcomes, corresponding to the training outcome (denoted "1") and the no-training outcome ("0"):

(Y_{0j}, Y_{1j}),  j = k+1, ..., T,

where T is the end of economic life.

Persons participate in training only if they apply to a program and are accepted into it. Several decision makers may be involved: individuals, family members and bureaucrats. Let D = 1 if a person participates in a program; D = 0 otherwise. Then the full description of participation and potential outcomes is

(6.1)  (D; Y_{0t}, t = 1, ..., k; (Y_{0t}, Y_{1t}), t = k+1, ..., T).

As before, observed outcomes after period k can be written as a switching regression model:

Y_t = D Y_{1t} + (1 - D) Y_{0t},  t > k.

The most familiar model, and the one that is most widely used in the training program evaluation literature, assumes that program participation decisions are based on individual maximization of the expected present value of earnings. It ignores family and bureaucratic influences on participation decisions.
6.3 Expected Present Value of Earnings Maximization

In period k, a prospective trainee seeks to maximize the expected present value of earnings. Earnings is the outcome of interest. The information available to the agent in period k is I_k. The cost of program participation consists of two components: c (direct costs) and foregone earnings during the period. Training takes one period to complete. Assume that credit markets are perfect, so that agents can lend and borrow freely at interest rate r. The expected present value of earnings maximizing decision rule is to participate in the program (D = 1) if

(6.2)  E[ Σ_{j=1}^{T-k} Y_{1,k+j}/(1+r)^j - c - Σ_{j=0}^{T-k} Y_{0,k+j}/(1+r)^j | I_k ] ≥ 0,

and not to participate in the program (D = 0) if this inequality does not hold. In (6.2), the expectations are computed with respect to the information available to the person in period k (I_k). It is important to notice that the expectations in (6.2) are the private expectations of the decision maker. They may or may not conform to the expectations computed against the true ex ante distribution. Note further that I_k may differ among persons in the same environment or may differ among environments. Many variables external to the model may belong in the information sets of persons. Thus friends, relatives and other channels of information may affect personal expectations.28

28 A sharp contrast between a model of perfect certainty and a model of uncertainty is that the latter introduces the possibility of incorporating many more "explanatory variables" in the model in addition to the direct objects of the theory.
The following are consequences of this decision rule. (a) Older persons, and persons with higher discount rates, are less likely to take training. (b) Earnings prior to time period k are irrelevant for determining participation in the program except for their value in forecasting future earnings (i.e., except as they enter the person's information set I_k). (c) Only current costs and the discounted gain in earnings determine participation in the program. Persons with lower foregone earnings and lower direct costs of program participation are more likely to go into the program. (d) Any dependence between the realized (measured) income at date t and D is induced by the decision rule. It is the relationship between the expected outcomes at the time decisions are made and the realized outcomes that generates the structure of the bias of any econometric estimator of the model. This framework underlies much of the empirical work in the literature on evaluating job training programs (see, e.g., Ashenfelter, 1978; Bassi, 1983, 1984; Ashenfelter and Card, 1985). We now consider various specializations of it.
6.3.1 Common Treatment Effect

As discussed in Section 3, the common treatment effect model is implicitly assumed in much of the literature evaluating job training programs. It assumes that Y_{1t} - Y_{0t} = α_t, t > k, where α_t is a common constant for everyone. Another version writes α_t as a function of X, α_t(X). We take this model as a point of departure for our analysis. It was first presented in Heckman (1978); Ashenfelter and Card (1985) and Heckman and Robb (1985a, 1986a) develop it. In this model, the effect of treatment on the treated and the effect of randomly assigning a person to treatment come to the same thing, i.e., E(Y_{1t} - Y_{0t} | X, D = 1) = E(Y_{1t} - Y_{0t} | X), since the difference between the two income streams is the same for all persons with the same X characteristics. Under this model, decision rule (6.2) specializes to the discrete choice model
(6.3)  D = 1  if  E[ Σ_{j=1}^{T-k} α_{k+j}/(1+r)^j - c - Y_{0k} | I_k ] ≥ 0;  D = 0 otherwise.

If the α_{k+j} are constant in all periods and T is large (T → ∞), the criterion simplifies to

(6.4)  D = 1  if  E[ α/r - c - Y_{0k} | I_k ] ≥ 0;  D = 0 otherwise.
Even though agents are assumed to be farsighted, and to possess the ability to make accurate forecasts, the decision rule is simple. Persons compare current costs (both direct costs c and foregone earnings Y_{0k}) with expected future rewards, E[ Σ_{j=1}^{T-k} α_{k+j}/(1+r)^j | I_k ]. Future rewards are the same for everyone of the same age and with the same discount rate. Future values of Y_{0t} do not directly determine participation given Y_{0k}. The link between D and Y_{0t}, t > k, comes through the dependence with Y_{0k} and any dependence on the cost c. If one knew, or could proxy, Y_{0k} and c, one could condition on these variables and eliminate selective differences between participants and nonparticipants. Since returns are identical across persons, only variation across persons in the direct cost and foregone earnings components determines the variation in the probability of program participation. Assuming that c and Y_{0k} are unobserved by the econometrician, but known to the agent making the decision to go into training,
Pr(D = 1) = Pr( Σ_{j=1}^{T-k} α_{k+j}/(1+r)^j > c + Y_{0k} ).

In the case of an infinite-horizon, temporally-constant treatment effect, α, the expression simplifies to

Pr(D = 1) = Pr( α/r ≥ c + Y_{0k} ).
This simple model is rich enough to be consistent with Ashenfelter's dip. As discussed in Section 4, the "dip" refers to the pattern that the earnings of program participants decline just prior to their participation in the program. If earnings are temporarily low in enrollment period k, and c does not offset Y_{0k}, persons with low earnings in the enrollment period enter the program. Since the return is the same for everyone, it is low opportunity costs or low tuition that drive program participation in this model. If α, c or Y_{0k} depend on observed characteristics, one can condition on those characteristics in constructing the probability of program participation.
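To make the selection mechanism behind the dip concrete, consider the following simulation sketch. It is purely illustrative: the AR(1) transitory earnings process, the cost distribution and all parameter values are hypothetical choices of ours, not estimates from any program. Participation follows rule (6.4), and participants' mean earnings trace out a dip centered on the enrollment period:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, k = 50_000, 10, 5            # persons, periods, enrollment period k
alpha, r = 1_200.0, 0.10           # common per-period gain and discount rate
rho, sigma = 0.7, 1_000.0          # AR(1) transitory earnings parameters

# No-training earnings: permanent component plus an AR(1) transitory component.
perm = rng.normal(10_000.0, 2_000.0, size=n)
u = np.zeros((n, T))
for t in range(1, T):
    u[:, t] = rho * u[:, t - 1] + rng.normal(0.0, sigma, size=n)
y0 = perm[:, None] + u

# Decision rule (6.4): enroll if alpha/r >= c + Y_0k.
c = rng.normal(4_000.0, 2_000.0, size=n)   # direct costs, known to the agent
d = (alpha / r) >= c + y0[:, k]

# Mean earnings of eventual participants dip around period k: low Y_0k (and
# low c) selects persons into training, and the AR(1) process mean-reverts.
for t in range(T):
    print(t, round(y0[d, t].mean()), round(y0[~d, t].mean()))
```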
This model is an instance of a more general approach to modelling behavior that is used in the economic evaluation literature. Write the net utility of program participation to the decision maker as IN. An individual participates in the program (D = 1) if and only if IN > 0. Adopting a separable specification, we may write

IN = H(X) - V.

In terms of the previous example, H(X) = Σ_{j=1}^{T-k} α_{k+j}/(1+r)^j is a constant, and V = c + Y_{0k}.
The probability that D = 1 given X is

(6.5)  Pr(D = 1 | X) = Pr(V < H(X) | X).

If V is stochastically independent of X, we obtain the important special case

Pr(D = 1 | X) = Pr(V < H(X)),

which is widely assumed in econometric studies of discrete choice.29 If V is normal with mean μ_1 and variance σ_V², then

(6.6)  Pr(D = 1 | X) = Pr(V < H(X)) = Φ( (H(X) - μ_1)/σ_V ),

where Φ is the cumulative distribution function of a standard normal random variable. If V is a standardized logit,

Pr(D = 1 | X) = exp(H(X)) / (1 + exp(H(X))).

29 Conditions for the existence of a discrete choice random utility representation of a choice process are given in McLennan (1990).
Although these functional forms are traditional, they are restrictive and are not required by the econometric approach. Conditions for nonparametric identifiability of Pr(D = 1 | X) given different assumptions about the dependence of X and V are presented in Cosslett (1983) and Matzkin (1992). Cosslett (1983), Matzkin (1993) and Ichimura (1993) consider nonparametric estimation of H and of the distribution of V. Lewbel (1998) demonstrates how discrete choice models can be identified under much weaker assumptions than independence between X and V. Under certain conditions, information about agents' decisions to participate in a training program can be informative about their preferences and the outcomes of a program.
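As a small illustration of the two traditional functional forms, the sketch below evaluates (6.6) and its logit counterpart for a hypothetical linear index H(X) = Xβ; the index specification and all coefficient values are our own illustrative choices:

```python
import numpy as np
from scipy.stats import norm

def p_probit(h, mu_v=0.0, sigma_v=1.0):
    """Pr(D = 1 | X) when V ~ N(mu_v, sigma_v^2), as in (6.6)."""
    return norm.cdf((h - mu_v) / sigma_v)

def p_logit(h):
    """Pr(D = 1 | X) when V is a standardized logit."""
    return np.exp(h) / (1.0 + np.exp(h))

# Hypothetical linear index H(X) = X @ beta for three persons.
X = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, -0.8]])
beta = np.array([-0.5, 1.0])
h = X @ beta
print(p_probit(h), p_logit(h))
```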
Heckman and Smith (1998a) demonstrate conditions under which knowledge of the self-selection decisions of agents embodied in Pr(D = 1 | X) is informative about the value of Y_1 relative to Y_0. In the Roy model (see, e.g., Heckman and Honoré, 1990), IN = Y_1 - Y_0 = (μ_1(X) - μ_0(X)) + (U_1 - U_0). Assuming that X is independent of U_1 - U_0, it is possible, from the self-selection decisions of persons into a program, to estimate μ_1(X) - μ_0(X) up to scale, where the scale is [Var(U_1 - U_0)]^{1/2}. This is a standard result in discrete choice theory. Thus in the Roy model it is possible to recover E(Y_1 - Y_0 | X) up to scale just from knowledge of the choice probability. Under additional assumptions on the support of X, Heckman and Smith (1998a) demonstrate that it is possible to recover the full joint distribution F(y_0, y_1 | X) and to answer all of the evaluation questions about means and distributions posed in Section 3. Under more general self-selection rules, it is still possible to infer the personal valuations of a program from observing selection into the program and attrition from it. The Roy model is the one case where personal evaluations of a program, as revealed by the choice behavior of the agents studied, coincide with the "objective" evaluations based on Y_1 - Y_0.
Within the context of a choice-theoretic model, it is of interest to consider the assumptions that justify the three intuitive evaluation estimators introduced in Section 4, starting with the cross-section estimator (3.3), which is valid if assumption (4.A.3) is correct. Given decision rule (6.3), under what conditions is it plausible to assume that

(4.A.3)  E(Y_{0t} | D = 1) = E(Y_{0t} | D = 0),  t > k,

so that cross-section comparisons identify the true program effect? (Recall that in a model with homogeneous treatment impacts, the various mean treatment effects all come to the same thing.) We assume that evaluators observe neither costs nor Y_{0k} for trainees.

Assumption (4.A.3) would be satisfied in period t if

E( Y_{0t} | Σ_{j=1}^{T-k} α_{k+j}/(1+r)^j - c - Y_{0k} ≥ 0 ) = E( Y_{0t} | Σ_{j=1}^{T-k} α_{k+j}/(1+r)^j - c - Y_{0k} < 0 ),  t > k.
One way this condition can be satisfied is if earnings are distributed independently over time (Y_{0k} independent of Y_{0t}, t > k) and direct costs c are independent of Y_{0t}, t > k. More generally, only mean independence with respect to c + Y_{0k} is required.30 If the dependence in earnings vanishes for earnings measured more than ℓ periods apart (e.g., if earnings are a moving average of order ℓ), then assumption (4.A.3) would be satisfied in periods t > k + ℓ.

Considerable evidence indicates that earnings have an autoregressive component (see, e.g., Ashenfelter, 1978; Ashenfelter and Card, 1985; MaCurdy, 1982; Farber and Gibbons, 1994). Thus (4.A.3) seems implausible except in special cases.31 Moreover, if stipends (a component of c) are determined in part by current and past income because they are targeted toward low-income workers, then (4.A.3) is unlikely to be satisfied.

30 Formally, it is required that E(Y_{0t} | c + Y_{0k}) not depend on c and Y_{0k} for all t > k.
31 Note, however, that much of this evidence is for log earnings and not earnings levels.
Access to better information sometimes makes it more likely that a version of assumption (4.A.3) will be satisfied if it is revised to condition on observables X:

(4.A.3′)  E(Y_{0t} | D = 1, X) = E(Y_{0t} | D = 0, X).
In this example, let X = (c, Y_{0k}). If we observe Y_{0k} for everyone, and can condition on it, and if c is independent of Y_{0t} given Y_{0k}, then

E(Y_{0t} | D = 1, Y_{0k}) = E( Y_{0t} | Σ_{j=1}^{T-k} α_{k+j}/(1+r)^j - Y_{0k} ≥ c, Y_{0k} ) = E(Y_{0t} | Y_{0k}) = E(Y_{0t} | D = 0, Y_{0k}).

Then for common values of Y_{0k}, assumption (4.A.3′) is satisfied for X = Y_{0k}.

Ironically, using too much information may make it difficult to satisfy (4.A.3′). To see this, suppose that we observe both c and Y_{0k} and set X = (c, Y_{0k}). Now

E(Y_{0t} | D = 1, (c, Y_{0k})) = E(Y_{0t} | c, Y_{0k})  and  E(Y_{0t} | D = 0, (c, Y_{0k})) = E(Y_{0t} | c, Y_{0k}),

because c and Y_{0k} perfectly predict D. But (4.A.3′) is not satisfied, because decision rule (6.3) perfectly partitions the (c, Y_{0k}) space into disjoint sets: there are no common values of X = (c, Y_{0k}) at which both participants and nonparticipants are observed. In this case, the "regression discontinuity design" estimator of Campbell and Stanley (1966) is appropriate. We discuss this estimator in Section 7.4.6 below.
If we assume that

0 < Pr(D = 1 | X) < 1,

we rule out the phenomenon of perfect predictability of D given X. This condition guarantees that persons with the same X values have a positive probability of being both participants and nonparticipants.32 Ironically, having too much information may be a bad thing. We need some "random" variation that places observationally equivalent people in both states. The existence of this fortuitous randomization lies at the heart of the method of matching.

32 This is one of two conditions that Rosenbaum and Rubin (1983) call "strong ignorability" and that are central to the validity of matching. We discuss these conditions further in Section 7.3.
Next consider assumption (4.A.1). It is satisfied in this example if, in a time-homogeneous environment, a "fixed effect" or "components of variance" structure characterizes Y_{0t}, so that there is an invariant random variable φ such that Y_{0t} can be written as

(6.7)  Y_{0t} = β_t + φ + U_{0t}  for all t,  with  E(U_{0t} | φ) = 0  for all t,

where the U_{0t} are mutually independent and c is independent of U_{0t}. If Y_{0t} is earnings, then φ is "permanent income" and the U_{0t} are "transitory deviations" around it. Then, using (6.3), for t > k > t′ we have

E(Y_t - Y_{t′} | D = 1) = α_t + β_t - β_{t′},

since

E(U_{0t} | D = 1) - E(U_{0t′} | D = 1) = 0.
From the assumption of time homogeneity, β_t = β_{t′}. Thus assumption (4.A.1) is satisfied, and the before-after estimator identifies α_t. It is clearly not necessary to assume that the U_{0t} are mutually independent, just that

(6.8)  E(U_{0t} - U_{0t′} | D = 1) = 0,

i.e., that the innovation U_{0t} - U_{0t′} is mean independent of U_{0k} + c. In terms of the economics of the model, it is required that participation not depend on transitory innovations in earnings in periods t and t′. For decision model (6.3), this condition is satisfied as long as U_{0k} is independent of U_{0t} and U_{0t′}, or as long as U_{0k} + c is mean independent of both terms.
If, however, the U_{0t} are serially correlated, then (4.A.1) will generally not be satisfied. Thus if a transitory decline in earnings persists over several time periods (as seems to be true as a consequence of Ashenfelter's dip), so that (U_{0t}, U_{0t′}) are stochastically dependent on U_{0k}, then it is unlikely that the key identifying assumption is satisfied. One special case where it is satisfied, developed by Heckman (1978) and Heckman and Robb (1985a) and applied by Ashenfelter and Card (1985) and Finifter (1987), among others, invokes a "symmetric differences" assumption. If t and t′ are symmetrically aligned about k (so that t = k + ℓ and t′ = k - ℓ), and conditional expectations forward and backward are symmetric, so that

(6.9)  E(U_{0t} | c + β_k + U_{0k}) = E(U_{0t′} | c + β_k + U_{0k}),

then assumption (4.A.1) is satisfied. This identifying condition motivates the symmetric differences estimator discussed in Section 7.6.
Some evidence on non-stationary wage growth presented by Farber and Gibbons (1994), MaCurdy (1982), Topel and Ward (1992) and others suggests that earnings can be approximated by a "random walk" specification. If

(6.10)  Y_{0t} = β_t + η + Σ_{j=0}^{t} ν_j,

where the ν_j are mean-zero, mutually independent and identically distributed random variables independent of η, then (6.8) and (6.9) will not generally be satisfied. Thus even if conditional expectations are linear, both forward and backward, it does not follow that (4.A.1) will hold. Let the variances of η and of the ν_j be finite, assume that E(η) = 0, and suppose that c is independent of all the ν_j and of η. Then

E(U_{0t} | c + β_k + U_{0k}) = [(σ_η² + kσ_ν²) / (σ_c² + σ_η² + kσ_ν²)] (c + U_{0k} - E(c))

and

E(U_{0t′} | c + β_k + U_{0k}) = [(σ_η² + t′σ_ν²) / (σ_c² + σ_η² + kσ_ν²)] (c + U_{0k} - E(c)).

These two expressions are not equal unless σ_ν² = 0.
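A quick Monte Carlo check of these two expressions is given below; the unit variances and the particular periods t > k > t′ are arbitrary illustrative choices, and under joint normality the OLS slope coincides with the conditional-expectation coefficient:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, t_post, t_pre = 1_000_000, 5, 8, 2   # arbitrary periods with t > k > t'
s_eta = s_nu = s_c = 1.0                   # illustrative unit variances

eta = rng.normal(0.0, s_eta, n)
walk = rng.normal(0.0, s_nu, (n, 9)).cumsum(axis=1)  # sum_{j=0}^{t} nu_j
c = rng.normal(0.0, s_c, n)

u = lambda t: eta + walk[:, t]   # U_0t: deviation of Y_0t from beta_t
z = c + u(k)                     # conditioning variable c + U_0k

# OLS slopes of U_0t and U_0t' on c + U_0k: the forward and backward
# coefficients differ whenever sigma_nu^2 > 0, so (6.9) fails.
print(np.polyfit(z, u(t_post), 1)[0], np.polyfit(z, u(t_pre), 1)[0])
```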
A more general model that is consistent with the evidence reported in the literature writes

Y_{0t} = μ_{0t}(X) + η + U_{0t},

where

U_{0t} = Σ_{j=1}^{k} ρ_{0j} U_{0,t-j} + Σ_{j=1}^{m} m_{0j} ν_{t-j},

so that U_{0t} is an autoregression of order k with a moving average of length m, and where the ν_{t-j} have mean zero at all leads and lags and are uncorrelated with η. Some authors, like MaCurdy (1982) or Farber and Gibbons (1994), allow the coefficients (ρ_{0j}, m_{0j}) to depend on t and do not require that the innovations be identically distributed over time. For the logarithm of white male earnings in the United States, MaCurdy (1982) finds that a model with a permanent component (η), plus one autoregressive coefficient (k = 1) and two moving average terms (m = 2), describes his data.33 Farber and Gibbons (1994) report similar evidence.

These time series models suggest generalizations of the before-after estimator that exploit the longitudinal structure of earnings processes but work with more general types of differences that align future and past earnings. These are developed at length in Heckman and Robb (1985, 1986), Heckman (1998a) and in Section 7.6.

33 The estimated value of ρ_{01} is close to 1, so that the model is close to a random walk in levels of log earnings.
If there are "time effects," so that β_t ≠ β_{t′}, (4.A.1) will not be satisfied. Before-after estimators will confound time effects with program gains. The "difference-in-differences" estimator circumvents this problem for models in which (4.A.1) is satisfied for the unobservables of the model but β_t ≠ β_{t′}. Note, however, that in order to apply this assumption it is necessary that time effects be additive in some transformation of the dependent variable and identical across participants and nonparticipants. If they are not, then (4.A.2) will not be satisfied.

For example, if the decision rule for program participation is such that persons with lower life cycle wage growth paths are admitted into the program, or persons who are more vulnerable to the national economy are trained, then the assumption of common time (or age) effects across participants and nonparticipants will be inappropriate, and the difference-in-differences estimator will not identify true program impacts.
6.3.2 A Separable Representation

In implementing econometric evaluation strategies, it is common to control for observed characteristics X. Invoking a separability assumption, we write the outcome equation for Y_{0t} as

Y_{0t} = g_{0t}(X) + U_{0t},

where g_{0t} is a behavioral relationship and U_{0t} has a finite mean conditional on X. A parallel expression can be written for Y_{1t}:

Y_{1t} = g_{1t}(X) + U_{1t}.

The function g_{0t}(X) is a structural relationship that may or may not differ from μ_{0t}(X), the conditional mean. It is a ceteris paribus relationship that informs us of the effect of changes in X on Y_{0t}, holding U_{0t} constant. Throughout this chapter we distinguish μ_{1t} from g_{1t} and μ_{0t} from g_{0t}. For the latter pair, we allow for the possibility that E(U_{1t} | X) ≠ 0 and E(U_{0t} | X) ≠ 0. Separability enables us to isolate the effect of self-selection, as it operates through the "error term," from the structural outcome equation:

(6.11a)  E(Y_{0t} | D = 0, X) = g_{0t}(X) + E(U_{0t} | D = 0, X),
(6.11b)  E(Y_{1t} | D = 1, X) = g_{1t}(X) + E(U_{1t} | D = 1, X).

The g_{0t}(X) and g_{1t}(X) functions are invariant across different conditioning schemes and decision rules, provided that X is available to the analyst. One can borrow knowledge of these functions from other studies collected under different conditioning rules, including the conditioning rules that define the samples used in social experiments. Although the conditional means of the errors differ across studies, the g_{0t}(X) and analogous g_{1t}(X) functions are invariant across studies. If they can be identified, they can be meaningfully compared across studies, unlike the treatment on the treated parameter which, when responses to treatment are heterogeneous and acted on by agents, differs across programs with different decision rules and different participant compositions.
A special case of this representation is the basis for an entire literature. Suppose that

(P.1)  the random utility representation (6.5) is valid.

Further, suppose that

(P.2)  (U_{0t}, U_{1t}, V) ⊥ X  ("⊥" denotes stochastic independence),

and finally assume that

(P.3)  the distribution function of V, F_V, is strictly increasing in V.

Then

(6.12a)  E(U_{0t} | D = 1, X) = K_{0t}(Pr(D = 1 | X))

and

(6.12b)  E(U_{1t} | D = 1, X) = K_{1t}(Pr(D = 1 | X)).34

The mean error term is a function of P, the probability of participation in the program. This special case receives empirical support in Heckman, Ichimura, Smith and Todd (1998) and Heckman, Ichimura and Todd (1997). It enables analysts to characterize the dependence between U_{0t} and X by the dependence of U_{0t} on Pr(D = 1 | X), which is a scalar function of X. As a practical matter, this greatly reduces the empirical task of estimating selection models. Instead of having to explore all possible dependence relationships between U and X, the analyst can confine attention to the more manageable task of exploring the dependence between U and Pr(D = 1 | X). An investigation of the effect of conditioning on program eligibility rules or self-selection on Y_{0t} thus comes down to an investigation of the effect of the conditioning on Y_{0t} as it operates through the probability P. This motivates a focus on the determinants of participation in the program in order to understand selection bias, and is the basis for the "control function" estimators developed in Section 7.
34 The proof is immediate; we follow Heckman (1980) and Heckman and Robb (1985a, 1986b). The proof of (6.12b) follows by similar reasoning. Assume that U_{0t} and V are jointly continuous random variables, with density f(U_{0t}, V | X). From (P.2),

f(U_{0t}, V | X) = f(U_{0t}, V).

Thus

E(U_{0t} | X, D = 1) = [ ∫_{-∞}^{∞} U_{0t} ∫_{-∞}^{H(X)} f(U_{0t}, V) dV dU_{0t} ] / [ ∫_{-∞}^{H(X)} f(V) dV ].

Now

Pr(D = 1 | X) = ∫_{-∞}^{H(X)} f(V) dV.

Inverting, we obtain

H(X) = F_V^{-1}(Pr(D = 1 | X)).

Thus

E(U_{0t} | X, D = 1) = [ ∫_{-∞}^{∞} U_{0t} ∫_{-∞}^{F_V^{-1}(Pr(D=1|X))} f(U_{0t}, V) dV dU_{0t} ] / Pr(D = 1 | X) ≡ K_{0t}(Pr(D = 1 | X)).
If, however, (P.2) is not satisfied, then the separable representation is not valid, and it is necessary to know more than the probability of participation to characterize E(U_{0t} | X, D = 1). In this case it is necessary to characterize both the dependence between U_{0t} and X given D = 1 and the probability of participation.
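To illustrate (6.12a) in the textbook case, the following sketch adopts the additional assumption, not made in the text, that (U_{0t}, V) are joint normal, in which case K_{0t} takes the familiar inverse-Mills-ratio form. The Monte Carlo check confirms that the conditional mean error depends on X only through P = Pr(D = 1 | X):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 1_000_000
# Hypothetical joint normal errors: V standard normal, Corr(U0, V) = rho.
rho, sigma_u = 0.5, 1.0
v = rng.standard_normal(n)
u0 = rho * sigma_u * v + np.sqrt(1.0 - rho**2) * sigma_u * rng.standard_normal(n)

for h in [-1.0, 0.0, 1.5]:           # three values of the index H(X)
    d = v < h                         # participation: D = 1 iff V < H(X)
    p = norm.cdf(h)                   # P = Pr(D = 1 | X)
    k0 = -rho * sigma_u * norm.pdf(norm.ppf(p)) / p   # K_0(P), closed form
    # Simulated E(U0 | D=1, X) matches K_0(P): a function of P alone.
    print(round(p, 3), round(u0[d].mean(), 3), round(k0, 3))
```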
6.3.3 Variable Treatment Effect

A more general version of the decision rule given by (6.2) allows (Y_{0t}, Y_{1t}) to be a pair of random variables with no necessary restriction connecting them. In this more general case,

α_t = Y_{1t} - Y_{0t},  t > k,

is a random variable. In this case, as previously discussed in Section 3, there is a distinction between the parameter "the mean effect of treatment on the treated" and the "mean effect of randomly assigning a person with characteristics X into the program."
In one important case, discussed in Heckman and Robb (1985a), the two parameters have the same ex post mean value even if the treatment effect α_t is heterogeneous after conditioning on X. Suppose that α_t is unknown to the agent at the time enrollment decisions are made. The agent forecasts α_t using the information available in his or her information set I_k; E(α_t | I_k) is the private expectation of gain by the agent. If the ex post gains of participants with characteristics X are the same as what the ex post gains of nonparticipants would have been had they participated, then the two parameters are the same. This would arise if both participants and nonparticipants have the same ex ante expected gains,

E(α_t | D = 1, I_k) = E(α_t | D = 0, I_k) = E(α_t | I_k),

and if

E[E(α_t | I_k) | X, D = 1] = E[E(α_t | I_k) | X, D = 0],

where the outer expectations are computed with respect to the observed ex post distribution of X. This condition requires that the information in the participant's decision set have the same relationship to X as it does for nonparticipants. The interior expectations in the preceding expression are subjective; the exterior expectations are computed with respect to distributions of objectively observed characteristics. The condition for the two parameters to be the same is

E[E(α_t | I_k, D = 1) | X, D = 1] = E[E(α_t | I_k, D = 0) | X, D = 0].
As long as the ex post objective expectation of the subjective expectations is the same, the two parameters, E(α_t | X, D = 1) and E(α_t | X), are the same. This condition would be satisfied if, for example, all agents, irrespective of their X values, place themselves at the mean of the objective distribution, i.e.,

E(α_t | I_k, D = 1) = E(α_t | I_k, D = 0) = ᾱ_t

(see, e.g., Heckman and Robb, 1985a). Differences across persons in program participation are then generated by factors other than potential outcomes. In this case, the ex post surprise,

α_t - ᾱ_t,

does not depend on X or D, in the sense that

E(α_t - ᾱ_t | X, D = 1) = 0,

so

E(Y_{1t} - Y_{0t} | X, D = 1) = ᾱ_t.

This discussion demonstrates the importance of understanding the decision rule and its relationship to measured outcomes in formulating an evaluation model. If agents do not make their decisions based on the unobserved components of gains from the program, or on variables statistically related to those components, the analysis of the common coefficient model presented in Section 6.3.1 remains valid even if there is variability in U_{1t} - U_{0t}. If agents anticipate the gains, and base decisions on them at least in part, then a different analysis is required.
The conditions for the absence of bias for one parameter differ from the conditions for the absence of bias for another. The difference between the "random assignment" parameter E(Y_{1t} - Y_{0t} | X) and the "treatment on the treated" parameter is the gain in the unobservables in going from one state to the other:

E(U_{1t} - U_{0t} | X, D = 1) = E(Δ_t | X, D = 1) - E(Δ_t | X),

where Δ_t = Y_{1t} - Y_{0t}. The only way to avoid bias for both mean parameters is to have E(U_{1t} - U_{0t} | X, D = 1) = 0.

Unlike the other estimators, the before-after estimators are not robust to time effects that are common across participants and nonparticipants. The difference-in-differences and cross-section estimators are unbiased under different conditions. The cross-section estimator, for the period-t common effect and for the "treatment on the treated" parameter in the variable-effect version of the model, requires that the mean unobservables in the no-program state be the same for participants and nonparticipants. The difference-in-differences estimator requires a balance of the bias in the change in the unobservables from period t′ to period t. If the cross-section conditions for the absence of bias are satisfied for all t, then the assumption justifying the difference-in-differences estimator is satisfied.
However, the converse is not true. Even if the conditions for the absence of bias in the difference-in-differences estimator are satisfied, the conditions for the absence of bias in the cross-section estimator are not necessarily satisfied. Moreover, failure of the difference-in-differences condition for the absence of bias does not imply failure of the condition for the absence of bias for the cross-section estimator. Ashenfelter's dip provides an empirically relevant example of this point. If t′ is measured during the period of the dip, but the dip is mean-reverting in post-program periods, then the condition for the absence of cross-section bias could be satisfied because, post-program, there may be no selective differences between participants and nonparticipants.
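A small simulation sketch makes the asymmetry concrete. All numbers are hypothetical: selection operates only on a purely transitory period-k shock, so the dip mean-reverts and the post-program cross-section contrast is approximately unbiased, while the before-after and difference-in-differences contrasts, which difference against dip-period earnings, are not:

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha = 200_000, 1_000.0      # persons; true common treatment effect

# No-training earnings: common level plus purely transitory shocks;
# the period-k shock generates both the dip and selection into training.
level = 10_000.0
u_pre = rng.normal(0.0, 2_000.0, n)    # transitory shock in t' = k
u_post = rng.normal(0.0, 2_000.0, n)   # independent shock in t > k
d = (level + u_pre) < 9_000.0          # enroll if dip-period earnings are low

y_pre = level + u_pre
y_post = level + u_post + alpha * d

before_after = y_post[d].mean() - y_pre[d].mean()            # biased upward
cross_section = y_post[d].mean() - y_post[~d].mean()         # approx. alpha
diff_in_diff = before_after - (y_post[~d].mean() - y_pre[~d].mean())  # biased
print(round(before_after), round(cross_section), round(diff_in_diff))
```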
6.3.4 Imperfect Credit Markets

How robust is the analysis of Sections 6.2 and 6.3, and in particular the conditions for bias, to alternative specifications of decision rules and of the economic environments in which individuals operate? To answer this question, we first reexamine the decision rule after dropping our assumption of perfect credit markets. There are many ways to model imperfect credit markets. The most extreme approach assumes that persons consume their earnings each period. This changes decision rule (6.2) and produces a new interpretation of the conditions for the absence of bias. Let G denote a time-separable strictly concave utility function and let β be a subjective discount factor. Suppose that persons have an exogenous income flow η_t per period. Expected utility maximization given the information I_k produces the following program participation rule:

(6.13)  D = 1  if  E[ Σ_{j=1}^{T-k} β^j { G(Y_{1,k+j} + η_{k+j}) - G(Y_{0,k+j} + η_{k+j}) } + G(η_k - c) - G(Y_{0k} + η_k) | I_k ] ≥ 0;  D = 0 otherwise.
As in the previous cases, earnings prior to time period k are relevant only for forecasting future earnings (i.e., as elements of I_k). However, decision rule (6.2) is fundamentally altered in this case: future earnings in both states determine participation in a different way, and common components of earnings in the two states do not difference out unless G is a linear function.35

35 Due to the nonlinearity of G, there are wealth effects in the decision to take training.
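Before turning to the permanent-transitory case, a small numerical sketch illustrates the point, with log utility standing in for the strictly concave G and all income figures hypothetical. Adding a component common to earnings in both states in every post-training period leaves the decision unchanged under a linear G but reverses it under log utility, a wealth effect:

```python
import numpy as np

def participates(y0k, y1_fut, y0_fut, eta, c, G, beta=0.95):
    """Decision rule (6.13) with period-by-period consumption (perfect
    foresight is used for clarity): discounted utility gain from training."""
    future = sum(beta**(j + 1) * (G(y1 + eta) - G(y0 + eta))
                 for j, (y1, y0) in enumerate(zip(y1_fut, y0_fut)))
    return future + G(eta - c) - G(y0k + eta) >= 0.0

log_G = np.log                  # strictly concave utility
lin_G = lambda x: x             # linear utility, for comparison

base = dict(y0k=1.0, y1_fut=[12.0] * 5, y0_fut=[10.0] * 5, eta=5.0, c=0.5)
shift = dict(base, y1_fut=[y + 20.0 for y in base["y1_fut"]],
             y0_fut=[y + 20.0 for y in base["y0_fut"]])

# The +20 common to both states in every future period differences out under
# linear G (decision unchanged) but flips the decision under log G.
for G in (lin_G, log_G):
    print(participates(**base, G=G), participates(**shift, G=G))
```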
Consider the permanent-transitory model of equation (6.7). That model is favorable to the application of longitudinal before-after estimators. Suppose that the U_{0t} are independent and identically distributed, and that there is a common-effect model. Condition (6.8) is not satisfied in a perfect foresight environment when there are credit constraints, or in an environment in which the U_{0t} can be partially forecast,36 because for t > k > t′,

E(U_{0t} | X, D = 1) ≠ 0

even though

E(U_{0t′} | X, D = 1) = 0,

so

E(U_{0t} - U_{0t′} | X, D = 1) ≠ 0.

The before-after estimator is now biased. So is the difference-in-differences estimator. If, however, the U_{0t} are not known, and cannot be partially forecast, then condition (6.8) is valid, and both the before-after and difference-in-differences estimators are unbiased.

Even in a common-effect model with Y_{0t} (or U_{0t}) independently and identically distributed, the cross-section estimator is biased for periods t > k in an environment of perfect certainty with credit constraints, because D depends on Y_{0t} through decision rule (6.13). On the other hand, if Y_{0t} is not forecastable with respect to the information in I_k, the cross-section estimator is unbiased.

The analysis in this subsection and the previous subsections has major implications for a certain style of evaluation research. Understanding the stochastic model of the outcome process is not enough. It is also necessary to know how decision makers process information and make decisions about program participation.

36 "Partially forecastable" means that some component of U_{0t} resides in the information set I_k. That is, letting f(y | x) denote the density of y given x, f(U_{0t} | I_k) ≠ f(U_{0t}), so that I_k predicts U_{0t} in this sense. One could define "moment forecastability" using conditional expectations of certain moments of a function φ: if E(φ(U_{0t}) | I_k) ≠ E(φ(U_{0t})), then φ(U_{0t}) is partially moment forecastable using the information in I_k. More formally, a random variable is fully forecastable if the σ-algebra generating U_{0t} is contained in the σ-algebra of I_k. It is partially forecastable if the complement of the projection of the σ-algebra of U_{0t} onto the σ-algebra of I_k is not the empty set. It is fully unforecastable if the projection of the σ-algebra of U_{0t} onto the σ-algebra of I_k is the empty set.
6.3.5 Training as a Form of Job Search

Heckman and Smith (1998b) find that among persons eligible for the JTPA program, the unemployed are much more likely to enter the program than are other eligible persons. Persons are defined to be unemployed if they are not working but report themselves as actively seeking work. The relationship uncovered by Heckman and Smith is not due to eligibility requirements: in the United States, unemployment is not a precondition for participation in the program.
Several previous studies suggest that Ashenfelter's dip results from changes in labor force status rather than from declines in wages or hours among those who work. Using even a crude measure of employment, namely whether a person was employed at all during a calendar year, Card and Sullivan (1988) observed that the employment rates of U.S. CETA training participants declined prior to their entering training.37 Their evidence suggests that changes in labor force dynamics, rather than changes in earnings, may be a more precise way to characterize participation in training.

37 Ham and LaLonde (1990) report the same result using semi-monthly employment rates for adult women participating in the NSW.
Heckman and Smith (1998b) show that whether a person is employed, unemployed (not employed and looking for work), or out of the labor force is a powerful predictor of participation in training programs. Moreover, they find that recent changes in labor force status are important determinants of participation for all demographic groups. In particular, eligible persons who have just become unemployed, either through job loss or through re-entry into the labor force, have the highest probabilities of participation. For women, divorce, another form of job termination, predicts who goes into training. Among those who are either employed or out of the labor force, persons who have recently entered these states have much higher program participation probabilities than persons who have been in those states for some time. Their evidence is formalized in the model presented in this section.
The previous models that we have considered are formulated in terms of levels of costs and earnings: when opportunity costs are low, or tuition costs are low, persons are more likely to enter training. The model presented here recognizes that changes in labor force states account for participation in training. Low earnings levels are a subsidiary predictor of program participation, overshadowed in empirical importance by unemployment dynamics in the analyses of Heckman and Smith (1998b).

Persons with zero earnings differ substantially in their participation probabilities depending on their recent labor force status histories. Yet in models based on pre-training earnings dynamics, such as the one presented in Section 6.3, such persons are assumed to behave identically irrespective of their labor market histories.

The importance of labor force status histories is also not surprising, given that many employment and training services, such as job search assistance, on-the-job training at private firms, and direct placement, are designed to lead to immediate employment. By providing these services, these programs function as a form of job search for many participants. Recognizing this role of active labor market policies is an important development in recent research. It indicates that in many cases participation in active labor market programs should not be modeled as if it were a schooling decision, as we have modeled it in the preceding sections.
In this section, we summarize the evidence on the determinants of participation in the program and construct a simple economic model in which job search makes two contributions to labor market prospects: (a) it raises the rate of arrival of job offers, and (b) it improves the distribution of wages, in the sense of giving agents a stochastically dominant wage distribution compared to the one they face without search. Training is one form of unemployment that facilitates job search. Different training options produce different job prospects, characterized by different wage and layoff distributions. Searchers might participate in programs that subsidize the rate of arrival of job offers (JSA, as described in Section 2) or that improve the distribution from which wage offers are drawn (i.e., basic educational and training investments).

Instead of motivating participation in training with a standard human capital model, we motivate participation as a form of search among options. Because JSA constitutes a large component of active labor market policy, it is of interest to see how the decision rule is altered if enhanced job search, rather than human capital accumulation, is the main factor motivating individuals' participation in these programs.
Our model is based on the idea that in program j, wage offers arrive from a distribution F_j at rate λ_j. Persons pay c_j to sample from F_j (the costs can be negative). Assume that the arrival times are statistically independent of the wage offers, and that arrival times and wage offers from one search option are independent of the wages and arrival times of the other search options. At any point in time, persons pick the search option with the highest expected return. To simplify the analysis, suppose that all distributions are time invariant, and denote by N the value of nonmarket time. Persons can select among any of J options, denoted by j. Associated with each option is a rate at which jobs appear, λ_j. Let the discount rate be r. These parameters may vary among persons, but for simplicity we assume that they are constant for the same person over time. This heterogeneity among persons produces differences among choices of training options and differences in the decision to undertake training.
In the unemployed state, a person receives a nonmarket benefit, N. The choice among the training and job search options can be written in "Gittins index" form (see, e.g., Berry and Fristedt, 1986). Under our assumptions, being in the nonmarket state has constant per-period value N irrespective of the search option selected. Letting V_j^e be the value of employment arising from search option j, the value of being unemployed under training option j is

(6.14a)  V_j^u = N - c_j + (λ_j/(1+r)) E_j max[V_j^e, V_j^u] + ((1-λ_j)/(1+r)) V_j^u.

The first term, N - c_j, is the value of nonmarket time minus the j-specific cost of search. The second term is the discounted product of the probability that an offer arrives next period under the jth option and the expected value of the maximum of the two options: work (valued at V_j^e) or continued search while unemployed (valued at V_j^u). The third term is the probability that the person continues to search times the value of doing so. In a stationary environment, if it is optimal to search from j today, it is optimal to do so tomorrow.
Let σ_{je} be the exogenous rate at which jobs disappear. For a job holder, the value of employment is V_j^e:

(6.14b)  V_j^e = Y_j + ((1-σ_{je})/(1+r)) V_j^e + (σ_{je}/(1+r)) E_j[max(V_N, V_j^u)],

where V_j^u is the value of optimal job search under j. The expression consists of the current flow of earnings, Y_j, plus the discounted (1/(1+r)) expected value of employment, V_j^e, times the probability that the job is retained, (1 - σ_{je}). The third term arises from the possibility that a person loses his or her job (this happens with probability σ_{je}) times the expected value of the maximum of the search option, V_j^u, and the nonmarket option, V_N.
To simplify this expression, assume that V_j^u > V_N. If this is not so, the person would never search under any training option in any event. In this case, V_j^e simplifies to

V_j^e = Y_j + ((1-σ_{je})/(1+r)) V_j^e + (σ_{je}/(1+r)) V_j^u,

so

(6.14c)  V_j^e = (σ_{je}/(r+σ_{je})) V_j^u + ((1+r) Y_j)/(r+σ_{je}).

Substituting (6.14c) into (6.14a), we obtain, after some rearrangement,

V_j^u = [ (1+r)(N - c_j) + λ_j E_j(V_j^e | V_j^e > V_j^u) Pr(Y_j > V_j^u (r/(1+r))) ] / [ r + λ_j Pr(Y_j > V_j^u (r/(1+r))) ].

In deriving this expression, we assume that the environment is stationary, so that the optimal policy at time t is also the optimal policy at t′, provided that the state variables are the same in each period.
The optimal search strategy is

ĵ = argmax_j {V_j^u},

provided that V_j^u > V_N for at least one j. The lower is c_j and the higher is λ_j, the more attractive is option j. The better is F_j, in the sense that F_j stochastically dominates F_{j′} (F_j(x) < F_{j′}(x)) so that more of the mass of F_j lies in the upper portion of the distribution, the more attractive is option j. Given the search options available to individuals, enrollment in a job training program may be the most effective option.
The probability that training under option j lasts t_j periods or more is

Pr(T_j ≥ t_j) = [ 1 - λ_j(1 - F_j(V_j^u (r/(1+r)))) ]^{t_j},

where 1 - λ_j(1 - F_j(V_j^u (r/(1+r)))) is the sum of the probability of receiving no offer, (1 - λ_j), and the probability of receiving an offer that is not acceptable, λ_j F_j(V_j^u (r/(1+r))). This model is nonlinear in the basic parameters. Because of this nonlinearity, many estimators that rely on additive separability of the unobservables, such as difference-in-differences or the fixed effect schemes for eliminating unobservables, are ineffective evaluation estimators.
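To see how the model can be put to work, the following sketch solves (6.14a) for each option by fixed-point iteration, using (6.14c) to value employment. The option menu, the lognormal offer distributions and all parameter values are hypothetical illustrations, not features of any actual program:

```python
import numpy as np

rng = np.random.default_rng(3)

def v_unemployed(N, c, lam, sigma_e, r, wage_draws, tol=1e-8):
    """Solve (6.14a) for V_u^j by fixed-point iteration, valuing employment
    with (6.14c): V_e(w) = (sigma_e * V_u + (1 + r) * w) / (r + sigma_e)."""
    v_u = 0.0
    while True:
        v_e = (sigma_e * v_u + (1.0 + r) * wage_draws) / (r + sigma_e)
        v_new = (N - c + (lam / (1.0 + r)) * np.maximum(v_e, v_u).mean()
                 + ((1.0 - lam) / (1.0 + r)) * v_u)
        if abs(v_new - v_u) < tol:
            return v_new
        v_u = v_new

r, N = 0.05, 2.0
options = {                 # (c_j, lambda_j, sigma_je, median wage offer)
    "no program":   (0.0, 0.3, 0.10, 5.0),
    "JSA":          (0.5, 0.6, 0.10, 5.0),  # subsidizes the arrival rate
    "basic skills": (2.0, 0.3, 0.08, 6.5),  # improves the offer distribution
}
values = {j: v_unemployed(N, c, lam, sig, r,
                          rng.lognormal(np.log(med), 0.4, 200_000))
          for j, (c, lam, sig, med) in options.items()}
best = max(values, key=values.get)          # optimal strategy j-hat
print(best, {j: round(v, 1) for j, v in values.items()})
```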
This simple model summarizes the available empirical evidence on job training programs. (a) It rationalizes variability in the length of time that persons with identical characteristics spend in training: persons receive different wage offers at different times and leave the program to accept those offers at different dates. (b) It captures the notion that training programs might raise the rate of job arrivals, the λ_j (an essential function of "job search assistance" programs), or produce skills by improving the F_j, or both. (c) It accounts for recidivism into training programs: as jobs are terminated (at rate σ_{je}), persons re-enter the program to search for a replacement job. Recidivism is an important feature of major job training programs. Trott and Baj (1993) estimate that as many as 20 percent of all JTPA program participants in Northern Illinois have been in the program at least twice, with the modal number of spells being three. This has important implications for the contamination bias problem that we discuss in Section 7.7.
A less attractive feature of the model is that persons do not switch search strategies. This is a consequence of the assumed stationarity of the environment and the assumption that agents know both arrival rates and wage offer distributions. Relaxing the stationarity assumption produces switching among strategies, which seems to be consistent with the evidence. A more general, but less analytically tractable, model allows for learning about wage offer distributions as in Weitzman (1979). In such a model, persons may switch strategies as they learn about the arrival rates or the wage offers obtained under a given strategy. The learning can take place within each type of program and may also entail word-of-mouth learning from fellow trainees taking the option.
Weitzman’s model captures this idea in a very simple way and falls within the Gittins index framework. The basic idea is as follows. Persons have J search options. They pick the option with the highest value and take a draw from it. They accept the draw if the value of the realized draw is better than the expected value of the best remaining option; otherwise they try out that option. If the draws from the J options are independently distributed, a Gittins-index strategy describes this policy. In this framework, unemployed persons may try a variety of options, including job training, before they take a job or drop out of the labor force.
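The sketch below simulates this sequential strategy in a deliberately simplified form: options are ranked by their expected values rather than by properly computed Weitzman reservation prices or Gittins indices, and the normal draw distributions and all numbers are illustrative assumptions. It is meant only to make the stopping logic above concrete.

```python
import numpy as np

def sequential_search(means, sds, seed=0):
    """Try options in descending order of expected value; stop once the
    best draw in hand beats the expected value of the best untried option.

    A simplified stand-in for a Weitzman/Gittins policy: ranking is by
    means, not reservation prices, so it is illustrative only.
    """
    rng = np.random.default_rng(seed)
    order = np.argsort(means)[::-1]          # most promising option first
    best_val, best_j = -np.inf, None
    for k, j in enumerate(order):
        draw = rng.normal(means[j], sds[j])  # sample the value of option j
        if draw > best_val:
            best_val, best_j = draw, j
        next_mean = means[order[k + 1]] if k + 1 < len(order) else -np.inf
        if best_val >= next_mean:            # nothing untried looks better
            break
    return best_j, best_val

# Three hypothetical options, e.g., direct search, training, a fallback.
j, v = sequential_search(np.array([10.0, 12.0, 6.0]), np.array([2.0, 4.0, 0.5]))
print(f"stopped at option {j} with realized value {v:.2f}")
```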
One could also extend this model to allow the value of nonmarket time, N, to become stochastic. If N fluctuates, persons would enter or exit the labor force depending on the value of N. Adding this feature captures the employment dynamics of trainees described by Card and Sullivan (1988).
In this more general model, shocks to the value of leisure or termination of previous jobs make persons contemplate taking training. Whether or not they do so depends on the value of training compared to the value of other strategies for finding jobs. Allowing for these considerations produces a model broadly consistent with the evidence presented in Heckman and Smith (1998b) that persons enter training as a consequence of displacement from both the market and nonmarket sectors.
The full details of this model remain to be developed (see Heckman and Smith, 1999, for a start). We suggest that future analyses of program participation be based on this empirically more concordant model. For the rest of this chapter, however, we take decision rule (6.3) as canonical in order to motivate and justify the choice of alternative econometric estimators. We urge our readers to modify our analysis to incorporate the lessons of the framework of labor force dynamics sketched here.
6.4
The Role of Program Eligibility Rules in Determining Participation
Several institutional features of most training programs suggest that the participation rule is more complex than that characterized by the simple model presented above in Section 6.2. For example, eligibility for training is often based on a set of objective criteria, such as current or past earnings being below some threshold. In this instance, individuals can take training at time k only if they have had low earnings, regardless of its potential benefit to them. For example, enrollees satisfy

(6.15)   α/r − Y_{ik} − c_i > 0   and the eligibility rule   Y_{i,k−1} < K,

where K is a cutoff level. More general eligibility rules can be analyzed in the same framework.
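Read literally, (6.15) combines a private cost-benefit calculation with an administrative screen, and the sketch below encodes it directly. The function name and all argument values are hypothetical: α is the per-period earnings gain from training (so α/r is its perpetuity value), Y_{ik} the forgone earnings, c_i the direct cost, and K the earnings cutoff.

```python
def enrolls(alpha, r, Y_ik, c_i, Y_prev, K):
    """Decision rule (6.15): enroll only if the discounted gain from
    training exceeds forgone earnings plus costs AND last period's
    earnings fall below the eligibility cutoff K."""
    wants_training = alpha / r - Y_ik - c_i > 0   # private benefit test
    is_eligible = Y_prev < K                      # administrative screen
    return wants_training and is_eligible

# A person who would benefit, but whose prior earnings exceed K, is screened out.
print(enrolls(alpha=50.0, r=0.05, Y_ik=600.0, c_i=100.0, Y_prev=1200.0, K=1000.0))
```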
The universality of Ashenfelter’s dip in pre-program earnings among program participants occurs despite the substantial variation in eligibility rules among training programs. This suggests that earnings or employment dynamics drive the participation process and that Ashenfelter’s dip is not an artifact of eligibility rules. Few major training programs in the United States have required earnings declines to qualify for program eligibility. Certain CETA programs in the late 1970s required participants to be unemployed during the period just prior to enrollment, while NSW required participants to be unemployed at the date of enrollment. MDTA contained no eligibility requirements, but restricted training stipends to persons who were unemployed or “underemployed.”38 For the JTPA program, eligibility has been confined to the economically disadvantaged (defined by low family income over the past six months, participation in a cash welfare program or Food Stamps, or being a foster child or disabled). There is also a 10 percent “audit window” of eligibility for persons facing other unspecified “barriers to employment.”

38 Eligibility for CETA varied by subprogram. CETA’s controversial Public Sector Employment (PSE) program required participants to have experienced a minimum number of days of unemployment or “underemployment” just prior to enrollment. In general, persons became eligible for other CETA programs by having a low income or limited ability in English. Considerable discretion was left to the states and training centers to determine who enrolled in the program. By contrast, the NSW eligibility requirements were quite specific. Adult women had to be on AFDC at the time of enrollment, have received AFDC for 30 of the last 36 months, and have a youngest child age six years or older. Youth in the NSW had to be age 17-20 years with no high school diploma or equivalency degree and not have been in school in the past six months. In addition, fifty percent of youth participants had to have had some contact with the criminal justice system (Hollister, et al., 1984).
It is possible that Ashenfelter’s dip results simply from the mechanical operation of program eligibility rules that condition on recent earnings. Such rules select individuals with particular types of earnings patterns into the eligible population. To illustrate this point, consider the monthly earnings of adult males who were eligible for JTPA in a given month from the 1986 panel of the U.S. Survey of Income and Program Participation (SIPP). For most people, eligibility is determined by family earnings over the past six months. The mean monthly earnings of adult males appear in Figure 4.1, aligned relative to month ‘k,’ the month when eligibility is measured. The figure reveals a dip in the mean earnings of adult male eligibles centered in the middle of the six-month window over which family income is measured when determining JTPA eligibility.
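This mechanical selection effect is easy to reproduce in a simulation. The sketch below generates mean-reverting monthly earnings, screens on average earnings over a six-month window ending at month k − 1, and reports the mean earnings of the selected group; the dip it produces is centered inside the screening window, as in the SIPP series. The AR(1) process and every parameter value are illustrative assumptions, not estimates from the SIPP.

```python
import numpy as np

rng = np.random.default_rng(42)
n, T, k = 50_000, 36, 24                  # persons, months, eligibility month

# Mean-reverting log earnings around person-specific means (illustrative).
mu = rng.normal(7.0, 0.5, n)
log_e = np.empty((n, T))
log_e[:, 0] = mu + rng.normal(0, 0.4, n)
for t in range(1, T):
    log_e[:, t] = mu + 0.8 * (log_e[:, t - 1] - mu) + rng.normal(0, 0.4, n)
earn = np.exp(log_e)

# Eligibility screen: low average earnings over months k-6 .. k-1.
window_mean = earn[:, k - 6:k].mean(axis=1)
eligible = window_mean < np.quantile(window_mean, 0.25)

# Mean earnings of eligibles by month relative to k: the dip is centered
# in the middle of the screening window, not at k itself.
profile = earn[eligible].mean(axis=0)
for t in range(k - 8, k + 3):
    print(f"month k{t - k:+d}: {profile[t]:7.1f}")
```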
Figure 4.1 also displays the mean earnings of adult males in the experimental control group from the NJS.39 The earnings dip for the controls, who applied and were admitted into the program, is larger than that for the sample of JTPA eligibles from the SIPP. Moreover, this dip reaches its minimum during month ‘k’ rather than three or four months earlier, as the operation of the eligibility rules alone would suggest. The substantial difference between the mean earnings patterns of JTPA participants and eligibles implies that Ashenfelter’s dip does not result from the mechanical operation of program eligibility rules.40

39 Such data were collected at four of the 16 training centers that participated in the study.

40 Devine and Heckman (1996) present certain nonstationary family income processes that can generate Ashenfelter’s dip from the application of JTPA eligibility rules. However, in their empirical work they find a dip centered at k − 3 or k − 4 for adult men and adult women, but no dip for male and female youth.
6.5
Administrative Discretion and the Efficiency and Equity of Training Provision
Training participation also often depends on discretionary choices made by program operators. Recent research focuses on how program operators allocate training services among groups and on how administrative performance standards affect the allocation of these services. The main question that arises in these studies is the potential trade-off between equity and efficiency, and the potential conflict between social objectives and program operators’ incentives. An efficiency criterion that seeks to maximize the social return to public training investments, regardless of the implications for income distribution, implies focusing training resources on those groups for whom the impact is largest (per dollar spent). In contrast, equity and redistributive criteria dictate focusing training resources on groups who are most in “need” of services.
These goals of efficiency and equity are written into the U.S. Job Training Partnership Act.41 Whether or not these twin goals conflict with each other depends on the empirical relationship between initial skill levels and the impact of training. As we discuss below in Section 10, the impact of training appears to vary on the basis of observable characteristics, such as sex, age, race and what practitioners call “barriers to employment”: low schooling, lack of employment experience and so on. These twin goals would be in conflict if the largest social returns resulted from training the most job-ready applicants.

41 A related issue involves differences in the types of services provided to different groups conditional on participation in a program. The U.S. General Accounting Office (1991) finds such differences alarming in the JTPA program. Smith (1992) argues that they result from differences across groups in readiness for immediate employment and in the availability of income support during classroom training.
In recent years, especially in the United States, policymakers have used administrative performance standards to assess the success of program operators at different training sites. Under JTPA, these standards are based primarily on the average employment rates and average wage rates of trainees shortly after they leave training. The target levels for each site are adjusted using a regression model that attempts to hold constant features of the environment over which the local training site has no control, such as racial composition.42 Sites whose performance exceeds these standards may be rewarded with additional funding; those that fall below may be sanctioned. The use of such performance standards, instead of measures of the impact of training, raises the issue of “cream-skimming” by program operators (Bassi, 1984). Program staff concerned solely with their site’s performance relative to the standard should admit into the program applicants who are likely to be employed at good wages (the “cream”) regardless of whether or not they benefit from the program. By contrast, they should avoid applicants who are less likely to be employed after leaving training or who have low expected wages, even if the impact of the training on such persons is likely to be large. The implications of cream-skimming for equity are clear: if it exists, program operators are directing resources away from those most in need. However, its implications for efficiency depend on the empirical relationship between short-term outcome levels and long-term impacts. If the applicants who are likely to be subsequently employed are also those who benefit the most from the program, then performance standards indirectly encourage the efficient provision of training services.43

42 See Heckman and Smith (1997d) and the essays in Heckman (1998b) for more detailed descriptions of the JTPA performance standards system. Similar systems based on the JTPA system now form a part of most U.S. training programs.

43 Heckman and Smith (1997d) discuss this issue in greater depth. The discussion in the text presumes that the costs of training provided to different groups are roughly equal.
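To make the adjustment logic concrete, the sketch below fits a regression of site outcomes on site characteristics and treats the fitted values as regression-adjusted targets, rewarding sites that beat their own prediction. This is only a stylized rendering of the idea described above; the variables, functional form, and data are all hypothetical, not the actual JTPA adjustment model.

```python
import numpy as np

rng = np.random.default_rng(1)
S = 200                                      # hypothetical training sites

# Site characteristics outside the operator's control (illustrative):
# e.g., fraction minority, local unemployment rate (standardized).
x = rng.normal(size=(S, 2))
true_beta = np.array([-0.05, -0.08])
# Observed performance: employment rate at termination.
perf = 0.55 + x @ true_beta + rng.normal(0, 0.04, S)

# Regression-adjusted target: performance predicted from site
# characteristics alone.
X = np.column_stack([np.ones(S), x])
coef, *_ = np.linalg.lstsq(X, perf, rcond=None)
target = X @ coef

# Sites above their adjusted target would earn incentive funding.
print(f"share exceeding adjusted target: {np.mean(perf > target):.2f}")
```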
A small literature examines the empirical importance of cream-skimming in JTPA programs. Anderson, et al. (1991) and Anderson, et al. (1993) look for evidence of cream-skimming by comparing the observable characteristics of JTPA participants and of individuals eligible for JTPA. They report evidence of cream-skimming, defined in their study as the case in which individuals with fewer barriers to employment have differentially higher probabilities of participating in training. However, this finding may result not from cream-skimming by JTPA staff, but because, among those in the JTPA-eligible population, more employable persons self-select into training.44

44 Program staff often have some control over who applies through their decisions about where and how much to publicize the program. However, this control is much less important than their ability to select among program applicants.
Two more recent studies address this problem. Using data from the NJS, Heckman and Smith (1998e) decompose the process of participation in JTPA into a series of stages. They find that much of what appears to be cream-skimming in simple comparisons between participants’ and eligibles’ characteristics is self-selection. For example, high school dropouts are very unlikely to be aware of JTPA and as a result are unlikely ever to apply. To assess the role of cream-skimming, Heckman, Smith and Taber (1996) study a sample of applicants from one of the NJS training centers. They find that program staff at this training center do not cream-skim, and appear instead to favor the hard-to-serve when deciding whom to admit into the program. Such evidence suggests that cream-skimming may not be of major empirical importance, perhaps because the social service orientation of JTPA staff moderates the incentives provided by the performance standards system, or because of local political incentives to serve more disadvantaged groups. For programs in Norway, Aakvik (1998) finds strong evidence of negative selection of participants on outcomes. Heinrich (1998) reports just the opposite for a job training program in the United States. At this stage, no universal generalization about bureaucratic behavior regarding cream-skimming is possible.
Studies based on the NJS also provide evidence on the implications of cream-skimming, even if it were to exist. Heckman, Smith and Clements (1997) find that, except for those who are very unlikely to be employed, the impact of training does not vary with the expected levels of employment or earnings in the absence of training. This finding indicates that the impact on efficiency of cream-skimming (or, alternatively, the efficiency cost of serving the hard-to-serve) is low. Similarly, Heckman and Smith (1998d) find little empirical relationship between the outcome measures used in the JTPA performance standards system and experimental estimates of the impact of JTPA training. These findings suggest that cream-skimming has little impact on efficiency, and that administrative performance standards, to the extent that they affect who is served, do little to increase either the efficiency or the equity of training provision.
6.6
The Conflict between the Economic Approach to Program Evaluation and the Modern Approach to Social Experiments
We have already noted in Section 5 that under ideal conditions, social experiments identify E(Y_1 − Y_0 | X, D = 1). Without further assumptions and econometric manipulation, they do not answer the other evaluation questions posed in Section 3. As a consequence of the self-selected nature of the samples generated by social experiments, the data produced from them are far from ideal for estimating the structural parameters of behavioral models. This makes it difficult to generalize findings across experiments or to use experiments to identify the policy-invariant structural parameters that are required for econometric policy evaluation.
To see this, recall that social experiments balance bias, but they do not eliminate the dependence between U_0 and D or between U_1 and D. Thus from experiments conducted under ideal conditions, we can recover the conditional densities f(y_0 | X, D = 1) and f(y_1 | X, D = 1). From nonparticipants we can recover f(y_0 | X, D = 0). It is the density f(y_0 | X, D = 1) that is the new information produced by social experiments. The other densities are available from observational data. All of these densities condition on choices. Knowledge of the conditional means

E(Y_0 | X, D = 1) = g_0(X) + E(U_0 | X, D = 1)

and

E(Y_1 | X, D = 1) = g_1(X) + E(U_1 | X, D = 1)

does not allow us to separately identify the structure (g_0(X), g_1(X)) from the conditional error terms without invoking the usual assumptions made in the nonexperimental selection literature. Moreover, the error processes for U_0 and U_1 conditional on D = 1 are fundamentally different from those in the population at large if participation in the program depends, in part, on U_0 and U_1.
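A small simulation makes the point. In the sketch below, illustrative constants stand in for g_0(X) and g_1(X) at a fixed X, individuals self-select on their idiosyncratic gains, and randomization occurs among the self-selected applicants. The experimental contrast recovers E(Y_1 − Y_0 | D = 1), yet the control-group mean differs from g_0 because E(U_0 | D = 1) ≠ 0; all numbers are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500_000
g0, g1 = 10.0, 11.0                      # structural parts at a fixed X (illustrative)
U0 = rng.normal(0.0, 1.0, n)
U1 = 0.5 * U0 + rng.normal(0.0, 1.0, n)  # correlated unobservables

# Self-selection: apply when the perceived individual gain is positive.
applies = (g1 + U1) - (g0 + U0) > 0.0
# Randomization among applicants only, as in a social experiment.
assigned_treat = rng.random(n) < 0.5

treat_mean = (g1 + U1)[applies & assigned_treat].mean()
ctrl_mean = (g0 + U0)[applies & ~assigned_treat].mean()

print(f"experimental impact estimate: {treat_mean - ctrl_mean:.3f}")
print(f"control mean {ctrl_mean:.3f} vs. structural g0 = {g0:.1f}")
# The contrast estimates E(Y1 - Y0 | D=1), but the control mean lies below
# g0: randomization balances bias without removing selection on U0.
```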
For these reasons, evidence from social experiments on programs with different participation and eligibility rules does not cumulate in any interpretable way. The estimated treatment effects reported from the experiments combine structure and error in different ways, and the conditional means of the outcomes bear no simple relationship to g_0(X) or g_1(X) (Xβ_0 and Xβ_1 in a linear regression setting). Thus it is not possible, without conducting a nonexperimental selection study, to relate the conditional means or regression functions obtained from a social experiment to a core set of policy-invariant structural parameters. Ham and LaLonde (1996) present one of the few attempts to recover structural parameters from a randomized experiment, where randomization was administered at the stage at which persons applied to and were accepted into the program. The complexity of their analysis is revealing about the difficulty of recovering structural parameters from social experiments.

In bypassing the need to specify economic models, many recent social experiments produce evidence that is not informative about them. They generate choice-based, endogenously stratified samples that are difficult to use in addressing any economic question apart from the narrow question of determining the impact of treatment on the treated for one program with one set of participation and eligibility rules.