Econometrics 5


UNIT- 5 PANEL DATA REGRESSION MODELS

STRUCTURE

5.0 Learning Objective

5.1 Introduction

5.2 Panel Data

5.3 Pooled OLS Regression

5.4 Fixed Effect Least Squares

5.5 Dummy Variable Model

5.6 Fixed effect within group (WG) Estimator

5.7 The Random effects model

5.8 Summary

5.9 Keywords

5.10 Learning Activity

5.11 Unit End Questions

5.12 References

5.0 LEARNING OBJECTIVES

 This module 'PANEL DATA REGRESSION MODELS' introduces the key concepts of panel data, pooled OLS regression, and fixed effect least squares.
 This module will help the students to understand the concepts of the dummy variable model, the fixed effect within-group (WG) estimator, and the random effects model.

5.1 INTRODUCTION

In panel data the same cross-sectional unit is surveyed over time. In short, panel data have space as well as time dimensions. There are other names for panel data, such as pooled data (pooling of time series and cross-sectional observations), combination of time series and cross-section data, micropanel data, longitudinal data (a study over time of a variable or group of subjects), event history analysis (e.g., studying the movement over time of subjects through successive states or conditions), and cohort analysis (e.g., following the career path of 1965 graduates of a business school).

Let us take an example using the market prices of wheat and wheat production in 20 states of India for two years, say 1950 and 2000. For a given year, the observations on production and prices constitute a cross-section sample. For a given state, there are two time series observations on production and prices. Thus we have (20 × 2) = 40 panel observations on wheat production and prices. Regression models based on such panel data are known as panel data regression models.

In contrast to standard linear regression models, panel data regression can handle the dependence of the dependent variable on unobserved, individual-specific factors that might otherwise result in biased estimators. In this unit, we go over the key theoretical underpinnings of the subject as well as step-by-step guidance for creating a panel data regression model in Python.

Two points are worth keeping in mind. First, an integrated panel data regression model is not easy to understand or to explain simply. Second, panel data regression is somewhat more difficult to conduct in Python than, say, in R, but that does not make it any less effective. Let us get started by defining panel data and explaining why it is so useful.

Panel data is a two-dimensional notion in which the same individuals are observed repeatedly over a range of time periods.

In general, cross-sectional and time-series data may be combined to create panel data. Cross-sectional data consist of one observation of several objects and their accompanying characteristics at one particular moment (i.e. each object is observed only once). Time-series data repeatedly observe a single object across time. By gathering information on the same set of objects across time, panel data combine both types of features in a single model.
Panel data are a type of longitudinal data, that is, data collected at several points in time. Longitudinal data come in three main forms:

Time series data. Numerous observations (large T) on a single unit or a small number of units (small N). Examples include aggregate national data and stock price series.

Pooled cross sections. Several independent samples drawn from the same population at different times (large N). Examples include the General Social Survey, extracts from the US Decennial Census, and the Current Population Survey.

Panel data. Multiple observations (small T) on each of two or more units (large N). Examples include household and individual panel studies (PSID, NLSY, ANES), information on businesses and organizations at various times, and regional data compiled over time.

This unit serves as a basic introduction to panel data analysis. In particular, it focuses on the linear error components model.

5.2 PANEL DATA

A panel data set contains data that is collected over a period of time for one or more uniquely
identifiable individuals or “things”. In panel data terminology, each individual or “thing” for
which data is collected is called a unit.

Here are three real world examples of panel data sets:

The Framingham Heart Study: The Framingham heart study is a long running experiment that
was started in 1948 in the city of Framingham, Massachusetts. Each year, health data from
5000+ individuals is being captured with the goal of identifying risk factors for cardiovascular
disease. In this data set, the unit is a person.

The Grunfeld Investment Data: This is a popular research data set that contains corporate
performance data of 10 US companies that was accumulated over a period of 20 years. In this
data set, the unit is a company.

The British Household Panel Survey: This is a survey of a sample of British households. Since
1991, members of each sampled household were asked a set of questions and their responses
were recorded. The same sample of households was interviewed again each subsequent year.
The goal of the survey is to analyze the effects of socioeconomic changes happening in Britain
on British households. In this data set, the unit is a household.

While building a panel data set, researchers measure one or more parameters called variables
for each unit and record their values in a tabular format. Examples of variables are sex, race,
weight and lipid levels for individuals or employee count, outstanding shares and EBITDA
for companies. Notice that some variables may change across time periods, while others stay
constant.

What results from this data collection exercise is a three-dimensional data set in which each row represents a unique unit, each column contains the data from one of the measured variables for that unit, and the z-axis contains the sequence of time periods over which the unit has been tracked.

Panel data sets arise out of longitudinal studies in which the researchers wish to study the impact of the measured variables on one or more response variables, such as the yearly investment made by a company or the GDP growth of a country.

In statistics and econometrics, panel data or longitudinal data are multi-dimensional data
involving measurements over time. Panel data contain observations of multiple phenomena
obtained over multiple time periods for the same firms or individuals.
Time series and cross-sectional data can be thought of as special cases of panel data that are in one dimension only (one panel member or individual for the former, one time point for the latter). Data is broadly classified according to the number of dimensions. A data set containing
observations on a single phenomenon observed over multiple time periods is called time
series. In time series data, both the values and the ordering of the data points have meaning.
A data set containing observations on multiple phenomena observed at a single point in time
is called cross-sectional. In cross-sectional data sets, the values of the data points have
meaning, but the ordering of the data points does not. A data set containing observations on
multiple phenomena observed over multiple time periods is called panel data. Panel Data
aggregates all the individuals, and analyzes them in a period of time. Alternatively, the second
dimension of data may be some entity other than time. For example, when there is a sample
of groups, such as siblings or families, and several observations from every group, the data
are panel data. Whereas time series and cross-sectional data are both one-dimensional, panel
data sets are two-dimensional.

A study that uses panel data is called a longitudinal study or panel study.

Statistically, panel data are simply two-dimensional data.

The number of dimensions provides a useful framework for categorizing data. Time series refers to any collection of data that spans numerous time periods and focuses on a particular subject; both the values and the ordering of the data points are significant. The term cross-sectional refers to a data collection that includes information about more than one phenomenon but was gathered at a single point in time; the values of the data points are meaningful, but the order in which they appear has no bearing on the interpretation of the data. Panel data refers to a data set that includes information about several phenomena over several time periods, and is used to study groups of units through time. The second dimension of the data may also be something other than time: whenever there are several observations from each group in a sample, as with families or siblings, the data are said to be panel data. Panel data sets are two-dimensional, as opposed to one-dimensional time series data and cross-sectional data.

Data sets which have a panel design include:

 Russia Longitudinal Monitoring Survey (RLMS)

 German Socio-Economic Panel (SOEP)

 Household, Income and Labor Dynamics in Australia Survey (HILDA)

 British Household Panel Survey (BHPS)

 Survey of Family Income and Employment (SoFIE)

 Survey of Income and Program Participation (SIPP)

 Lifelong Labor Market Database (LLMDB)

 Longitudinal Internet Studies for the Social sciences (LISS)

 Panel Study of Income Dynamics (PSID)

 Korean Labor and Income Panel Study (KLIPS)

 China Family Panel Studies (CFPS)

 German Family Panel (pairfam)

 National Longitudinal Surveys (NLSY)

 Labor Force Survey (LFS)

 Korean Youth Panel (YP)

 Korean Longitudinal Study of Aging (KLoS)

Reasons for using Panel Data

1. Panel data can take explicit account of individual-specific heterogeneity ("individual" here means related to the micro unit).
2. By combining data in two dimensions, panel data gives more data variation, less collinearity and more degrees of freedom.
3. Panel data is better suited than cross-sectional data for studying the dynamics of change. For example, it is well suited to understanding transition behaviour, such as company bankruptcy or merger.
4. It is better at detecting and measuring effects which cannot be observed in either cross-section or time-series data.
5. Panel data enables the study of more complex behavioural models, for example the effects of technological change or economic cycles.
6. Panel data can minimize the effects of aggregation bias that arise from aggregating firms into broad groups.

Example:

Fig. 5.1: Panel Data

In the example above (Fig. 5.1), two datasets with a panel structure are shown. Individual characteristics (income, age, sex) are collected for different persons and different years. In the first dataset, two persons (1, 2) are observed every year for three years (2016, 2017, 2018). In the second dataset, three persons (1, 2, 3) are observed two times (person 1), three times (person 2), and one time (person 3), respectively, over three years (2016, 2017, 2018); in particular, person 1 is not observed in year 2018 and person 3 is not observed in 2016 or 2018.

A balanced panel (e.g., the first dataset above) is a dataset in which each panel member (i.e.,
person) is observed every year. Consequently, if a balanced panel contains N panel members
and T periods, the number of observations (n) in the dataset is necessarily n = N×T.

An unbalanced panel (e.g., the second dataset above) is a dataset in which at least one panel
member is not observed every period. Therefore, if an unbalanced panel contains N panel
members and T periods, then the following strict inequality holds for the number of
observations (n) in the dataset: n < N×T.

Both datasets above are structured in the long format, where one row holds one observation per time period. Another way to structure panel data would be the wide format, where one row represents one observational unit for all points in time; for the example, the wide format would have only two (first example) or three (second example) rows of data, with additional columns for each time-varying variable (income, age).
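As an illustration, the sketch below moves a small panel between the two formats with pandas; the column names (person, year, income, age) mirror the example above but the numbers are made up.

# Minimal long-to-wide reshaping sketch with pandas; data and names are illustrative.
import pandas as pd

long_df = pd.DataFrame({
    "person": [1, 1, 1, 2, 2, 2],
    "year":   [2016, 2017, 2018, 2016, 2017, 2018],
    "income": [1300, 1500, 1800, 2000, 2100, 2200],
    "age":    [27, 28, 29, 38, 39, 40],
})

# Long format: one row per person-year observation.
# Wide format: one row per person, one column per (variable, year) pair.
wide_df = long_df.pivot(index="person", columns="year", values=["income", "age"])
print(wide_df)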

Advantages of Panel Data

1. Since panel data relate to individuals, firms, states, countries, etc., over time, there is bound to be heterogeneity in these units. The techniques of panel data estimation can take such heterogeneity explicitly into account by allowing for individual-specific variables, as we shall show shortly. We use the term individual in a generic sense to include microunits such as individuals, firms, states, and countries.

2. By combining time series of cross-section observations, panel data give “more informative
data, more variability, less collinearity among variables, more degrees of freedom and more
efficiency.”

3. By studying the repeated cross section of observations, panel data are better suited to study the dynamics of change. Spells of unemployment, job turnover, and labor mobility are better studied with panel data.

4. Panel data can better detect and measure effects that simply cannot be observed in pure cross-
section or pure time series data. For example, the effects of minimum wage laws on
employment and earnings can be better studied if we include successive waves of minimum
wage increases in the federal and/or state minimum wages.

5. Panel data enables us to study more complicated behavioral models. For example, phenomena such as economies of scale and technological change can be better handled by panel data than by pure cross-section or pure time series data.

6. By making data available for several thousand units, panel data can minimize the bias that
might result if we aggregate individuals or firms into broad aggregates.

Balanced and unbalanced panel data

If each cross-sectional unit has the same number of time series observations, then such a panel (data) is called a balanced panel. In the present example we have a balanced panel, as each unit in the sample has 20 observations. If the number of observations differs among panel members, we call such a panel an unbalanced panel.
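Whether a long-format panel is balanced can be checked directly by comparing n with N × T; the small helper below is a sketch, with the column names unit and year standing in for whatever identifiers a given dataset uses.

# Sketch of a balanced-panel check for a long-format pandas DataFrame.
import pandas as pd

def is_balanced(df: pd.DataFrame, unit_col: str = "unit", time_col: str = "year") -> bool:
    n_units = df[unit_col].nunique()
    n_periods = df[time_col].nunique()
    no_duplicates = not df.duplicated([unit_col, time_col]).any()
    # Balanced panel: every unit observed in every period, so n = N * T
    return no_duplicates and len(df) == n_units * n_periods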

5.3 POOLED OLS REGRESSION

In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (the values of the variable being observed) in the given dataset and those predicted by the linear function of the independent variables.

Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the
dependent variable, between each data point in the set and the corresponding point on the
regression surface—the smaller the differences, the better the model fits the data. The
resulting estimator can be expressed by a simple formula, especially in the case of a simple
linear regression, in which there is a single regressor on the right side of the regression
equation.

The OLS estimator is consistent when the regressors are exogenous, and—by the Gauss–
Markov theorem—optimal in the class of linear unbiased estimators when the errors are
homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides
minimum-variance mean-unbiased estimation when the errors have finite variances. Under
the additional assumption that the errors are normally distributed, OLS is the maximum
likelihood estimator.

Pooled OLS assumes that there are no unique attributes of individuals within the measurement set and no universal effects across time. However, pooled regression may result in heterogeneity bias, as illustrated below:

Fig. 5.2: Pooled OLS Regression
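As a hedged illustration, pooled OLS can be run in Python with statsmodels on the Grunfeld investment panel mentioned in Section 5.2; the regression simply stacks all firm-year observations and ignores the panel structure, which is exactly what exposes it to heterogeneity bias.

# Pooled OLS sketch on the Grunfeld panel shipped with statsmodels.
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Columns: invest, value, capital, firm, year (10 firms over 20 years)
grunfeld = sm.datasets.grunfeld.load_pandas().data

# All firm-year rows are pooled; firm-specific heterogeneity is ignored
pooled = smf.ols("invest ~ value + capital", data=grunfeld).fit()
print(pooled.summary())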

5.4 FIXED EFFECT LEAST SQUARES

In statistics, a fixed effects model is a statistical model in which the model parameters are
fixed or non-random quantities. This is in contrast to random effects models and mixed models
in which all or some of the model parameters are considered as random variables. In
many applications, including econometrics and biostatistics, a fixed effects model refers to a
regression model in which the group means are fixed (non-random) as opposed to a random
effects model in which the group means are a random sample from a population. Generally,
data can be grouped according to several observed factors. The group means could be modeled
as fixed or random effects for each grouping. In a fixed effects model each group mean is a
group-specific fixed quantity.

Such models assist in controlling for omitted variable bias due to unobserved heterogeneity
when this heterogeneity is constant over time. This heterogeneity can be removed from the data
through differencing, for example by subtracting the group-level average over time, or by
taking a first difference which will remove any time invariant components of the model.

There are two common assumptions made about the individual specific effect: the random
effects assumption and the fixed effects assumption. The random effects assumption is that
the individual-specific effects are uncorrelated with the independent variables. The fixed
effect assumption is that the individual-specific effects are correlated with the independent
variables. If the random effects assumption holds, the random effects estimator is more
efficient than the fixed effects estimator. However, if this assumption does not hold, the
random effects estimator is not consistent. The Durbin–Wu–Hausman test is often used to
discriminate between the fixed and the random effects models. In panel data, where
longitudinal observations exist for the same subject, fixed effects represent the subject-
specific means. In panel data analysis the term fixed effects estimator (also known as the
within estimator) is used to refer to an estimator for the coefficients in the regression model
including those fixed effects (one time-invariant intercept for each subject).
Regression analysis typically employs the method of least squares to approximate the solution
of overdetermined systems (sets of equations with more equations than unknowns) by
minimizing the sum of squares of the residuals (a residual is the difference between an
observed value and the fitted value provided by a model).

Data fitting is the primary use case. Simple regression and least-squares approaches run into
trouble when the issue involves significant uncertainties in the independent variable (the x
variable); in such circumstances, the methodology necessary for fitting errors-in-variables
models may be considered in place of that for least-squares.

Whether or not the residuals are linear in all unknowns classifies a least-squares issue as either
linear (or ordinary) or nonlinear. In statistical regression analysis, the linear least-squares issue
arises; it may be solved in closed form. Iterative refinement is commonly used to solve a
nonlinear issue, with the system being approximated by a linear one at each iteration.

In polynomial least squares, the dependent variable is modelled as a polynomial in the independent variable, and the fit minimizes the deviations of the observations from the fitted curve.

Least-squares estimates and maximum-likelihood estimates are equivalent when the observations come from an exponential family with identity as its natural sufficient statistic (as with the normal, exponential, Poisson, and binomial distributions) and mild regularity conditions are fulfilled. The least squares approach may also be derived as a method of moments estimator.

Although the following explanation focuses mostly on linear functions, least squares may also be used for far broader classes of functions. For example, the least-squares approach can be used to fit a generalized linear model by repeatedly applying a local quadratic approximation to the likelihood (using the Fisher information).

Although the least-squares approach is commonly ascribed to Carl Friedrich Gauss (1795),
who made important theoretical contributions to the method and may have used it in the past,
the formal discovery and publication of the method occurred much later, in 1805, by Adrien-
Marie Legendre.

Consider the panel regression model yit = β′xit + μi + εit, where μi is a time-invariant, unobserved individual-specific component. A fixed effects regression consists of subtracting the time mean from each variable in the model and then estimating the resulting transformed model by Ordinary Least Squares. This procedure, known as the "within" transformation, allows one to drop the unobserved component and consistently estimate β. Analytically, the model becomes

ỹit = β′x̃it + ε̃it

where ỹit = yit − ȳi with ȳi = (1/T) Σt yit (and the same for x, μ, and ε). Because μi is fixed over time, we have μi − μ̄i = 0.

This procedure is numerically identical to including N – 1 dummies in the regression,


suggesting intuitively that a fixed effects regression accounts for unobserved individual
heterogeneity by means of individual specific intercepts. In other words, the slopes of the
regression are common across units (the coefficients of x1, x 2, …, x K) whereas the intercept
is allowed to vary.
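To make the within transformation concrete, here is a small demeaning helper in pandas; the column names are placeholders, and the demeaned variables can then be passed to ordinary least squares.

# Hand-rolled "within" (demeaning) transformation; regressing the demeaned y on the
# demeaned x by OLS is numerically identical to including N - 1 entity dummies.
import pandas as pd

def within_transform(df: pd.DataFrame, entity_col: str, cols: list) -> pd.DataFrame:
    out = df.copy()
    # Subtract each entity's time mean from its observations, column by column
    out[cols] = df[cols] - df.groupby(entity_col)[cols].transform("mean")
    return out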

One drawback of the fixed effects procedure is that the within transformation does not allow
one to include time-invariant independent variables in the regression, because they get
eliminated similarly to the fixed unobserved component. In addition, parameter estimates are
likely to be imprecise if the time series dimension is limited.

Under classical assumptions, the fixed effects estimator is consistent (with N → ∞ and T fixed) both when E(xjit μi) = 0 and when E(xjit μi) ≠ 0, where j = 1, …, K. It is efficient when all the explanatory variables are correlated with μi. However, it is less efficient than the random effects estimator when E(xjit μi) = 0.

The consistency property requires the strict exogeneity of x. However, this property is not satisfied when the estimated model includes a lagged dependent variable, as in yit = α yit−1 + β′xit + μi + εit.

This suggests the adoption of instrumental variables or Generalized Method of Moments


techniques in order to obtain consistent estimates. However, a large time dimension T assures
consistency even in the case of the dynamic specification above.

Sometimes the true model includes unobserved shocks that are common to all units i but time-varying. In this case, the model includes an additional time-specific error component that can be controlled for simply by including time dummies in the equation.

A typical application of a fixed effects regression is in the context of wage equations. Let us assume that we are interested in assessing the impact of years of education (in logs) e on wages (in logs) w when the ability of individuals a is not observed. The true model is then

wit = β0 + β1 eit + vit

where vit = ai + εit. Given that unobserved ability is likely to be correlated with education, the composite stochastic error v is also correlated with the regressor and the estimate of β1 will be biased. However, since innate ability does not change over time, if our data set is longitudinal we can use a fixed effects estimator to obtain a consistent estimate of β1. Applying the within transformation to the preceding equation we end up with

w̃it = β1 ẽit + ε̃it

where we have eliminated the time-invariant unobserved component ai. Since E(ẽit ε̃it) = 0, the model now satisfies the classical assumptions and we can estimate it by Ordinary Least Squares.
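In Python, the within estimator is available through the linearmodels package (assuming it is installed); the sketch below applies it to the Grunfeld panel, with EntityEffects playing the role of the firm-specific intercepts.

# Fixed effects (within) estimation sketch; requires `pip install linearmodels`.
import statsmodels.api as sm
from linearmodels.panel import PanelOLS

grunfeld = sm.datasets.grunfeld.load_pandas().data
grunfeld["year"] = grunfeld["year"].astype(int)        # integer time index
data = grunfeld.set_index(["firm", "year"])            # (entity, time) MultiIndex

# EntityEffects adds one time-invariant intercept per firm (the within estimator)
fe = PanelOLS.from_formula("invest ~ value + capital + EntityEffects", data=data).fit()
print(fe.summary)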

5.5 DUMMY VARIABLE MODEL

In statistics and econometrics, particularly in regression analysis, a dummy variable is one that takes only the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome. Dummy variables can be thought of as numeric stand-ins for qualitative facts in a regression model, sorting data into mutually exclusive categories (such as smoker and non-smoker).

A dummy independent variable (also called a dummy explanatory variable) which for some
observation has a value of 0 will cause that variable's coefficient to have no role in influencing
the dependent variable, while when the dummy takes on a value 1 its coefficient acts to alter
the intercept. For example, suppose membership in a group is one of the qualitative variables
relevant to a regression. If group membership is arbitrarily assigned the value of 1, then all
others would get the value 0. Then the intercept would be the constant term for non-members
but would be the constant term plus the coefficient of the membership dummy in the case of
group members.

Incorporating a dummy independent variable-

Dummy variables are incorporated in the same way as quantitative variables are included (as
explanatory variables) in regression models. For example, if we consider a Mincer-type
regression model of wage determination, wherein wages are dependent on gender (qualitative)
and years of education (quantitative):

wage = β0 + δ0 female + β1 educ + u

where u is the error term. In the model, female = 1 when the person is female and female = 0 when the person is male. δ0 can be interpreted as the difference in wages between females and males, holding education constant. Thus, δ0 helps to determine whether there is discrimination in wages between males and females. For example, if δ0 > 0 (positive coefficient), then women earn a higher wage than men (keeping other factors constant). The coefficients attached to the dummy variables are called differential intercept coefficients. Graphically, the model corresponds to an intercept shift between females and males; the case δ0 < 0 arises when men earn a higher wage than women.
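The intercept-shift interpretation can be checked numerically. The sketch below uses simulated data, so the variable names (wage, female, educ) and the true value of δ0 are assumptions made purely for illustration.

# Hypothetical illustration of a differential intercept coefficient; the data are
# simulated and the true delta_0 is set to -2.0 by construction.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "female": rng.integers(0, 2, n),          # 1 = female, 0 = male
    "educ":   rng.integers(10, 21, n),        # years of education
})
df["wage"] = 5.0 - 2.0 * df["female"] + 1.5 * df["educ"] + rng.normal(0, 3, n)

# The coefficient on 'female' estimates delta_0: the wage gap holding educ fixed
model = smf.ols("wage ~ female + educ", data=df).fit()
print(model.params)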

Dummy variables may be extended to more complex cases. For example, seasonal effects may be captured by creating dummy variables for each of the seasons: D1 = 1 if the observation is for summer, and equals zero otherwise; D2 = 1 if and only if autumn, otherwise equals zero; D3 = 1 if and only if winter, otherwise equals zero; and D4 = 1 if and only if spring, otherwise equals zero. In panel data, fixed effects estimator dummies are created for each of the units in cross-sectional data (e.g. firms or countries) or for each period in a pooled time series. However, in such regressions either the constant term has to be removed, or one of the dummies has to be removed, with its associated category becoming the base category against which the others are assessed, in order to avoid the dummy variable trap:

The constant term in all regression equations is a coefficient multiplied by a regressor equal to
one. When the regression is expressed as a matrix equation, the matrix of regressors then
consists of a column of ones (the constant term), vectors of zeros and ones (the dummies), and
possibly other regressors. If one includes both male and female dummies, say, the sum of these
vectors is a vector of ones, since every observation is categorized as either male or female. This
sum is thus equal to the constant term's regressor, the first vector of ones. As a result, the
regression equation will be unsolvable, even by the typical pseudoinverse method. In other
words: if both the vector-of-ones (constant term) regressor and an exhaustive set of dummies
are present, perfect multicollinearity occurs, and the system of equations formed by the
regression does not have a unique solution. This is referred to as the dummy variable trap. The
trap can be avoided by removing either the constant term or one of the offending dummies. The
removed dummy then becomes the base category against which the other categories are
compared.
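In practice, the trap is usually avoided mechanically; for example, pandas can generate the dummies while dropping one category, as in the hypothetical sketch below.

# Avoiding the dummy variable trap: drop_first=True drops one dummy, whose
# category becomes the base category. Data and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({"educ": [12, 16, 12, 18],
                   "sex":  ["female", "male", "female", "male"]})

# Only sex_male is created; 'female' is the omitted base category, so a
# constant term can be kept without perfect multicollinearity.
X = pd.get_dummies(df, columns=["sex"], drop_first=True)
print(X)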

Dependent dummy variable models

Analysis of dependent dummy variable models can be done through different methods. One such method is the usual OLS method, which in this context is called the linear probability model. An alternative method is to assume that there is an unobservable continuous latent variable Y* and that the observed dichotomous variable Y = 1 if Y* > 0, and 0 otherwise. This is the underlying concept of the logit and probit models. These models are discussed in brief below.

Linear probability model



An ordinary least squares model in which the dependent variable Y is a dichotomous dummy, taking the values of 0 and 1, is the linear probability model (LPM). Suppose we consider the following regression:

Yi = α1 + α2Xi + ui

where
X = family income
Y = 1 if a house is owned by the family, 0 if a house is not owned by the family.

The model is called the linear probability model because the conditional mean of Yi given Xi, written as E(Yi | Xi), is interpreted as the conditional probability that the event will occur for that value of Xi, that is, Pr(Yi = 1 | Xi), and this probability is a linear function of Xi. In this example, E(Yi | Xi) gives the probability of a house being owned by a family whose income is given by Xi.

Now, using the OLS assumption E(ui | Xi) = 0, we get E(Yi | Xi) = α1 + α2Xi.
Some problems are inherent in the LPM model (a short estimation sketch follows this list):
 The regression line will not be a well-fitted one and hence measures of significance, such as R2, will not be reliable.
 Models that are analyzed using the LPM approach will have heteroscedastic disturbances.
 The error term will have a non-normal distribution.
 The LPM may give predicted values of the dependent variable that are greater than 1 or less than 0. This will be difficult to interpret as the predicted values are intended to be probabilities, which must lie between 0 and 1.
 There might exist a non-linear relationship between the variables of the LPM model, in which case the linear regression will not fit the data accurately.
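A brief LPM sketch with simulated data follows; heteroscedasticity-robust standard errors are used because, as noted above, LPM disturbances are heteroscedastic. The variable names (income, owns_house) are hypothetical.

# Linear probability model sketch on simulated data; owns_house is a 0/1 dummy.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
income = rng.normal(60, 20, 500)               # hypothetical family income
p_true = np.clip(0.01 * income, 0, 1)          # true P(owns_house = 1)
owns_house = rng.binomial(1, p_true)

X = sm.add_constant(income)
lpm = sm.OLS(owns_house, X).fit(cov_type="HC1")     # HC1 = robust standard errors
print(lpm.params)                                   # fitted values approximate Pr(Y=1 | X)
print(lpm.predict(X).min(), lpm.predict(X).max())   # predictions may fall outside [0, 1]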

In statistics and econometrics, particularly in regression analysis, a dummy variable is one that takes only the value 0 or 1 to indicate the absence or presence of some categorical influence that may be expected to affect the outcome. In a regression model, dummy variables serve as numerical surrogates for qualitative data by classifying it into mutually exclusive groups (such as smoker and non-smoker).

A dummy independent variable (sometimes called a dummy explanatory variable) which for some observation has a value of 0 causes that variable's coefficient to play no role in determining the dependent variable for that observation, whereas when the dummy takes on a value of 1 its coefficient acts to shift the intercept. For example, assume membership in a group is one of the qualitative variables relevant to a regression. If group membership is assigned the value of 1, then all others would get the value 0. Then the intercept would be the constant term for non-members but would be the constant term plus the coefficient of the membership dummy in the case of group members.

Dummy variables are employed often in time series analysis with regime
switching, seasonal analysis and qualitative data applications.
A dummy variable is a numerical variable used in regression analysis to represent
subgroups of the sample in your study. In research design, a dummy variable is
often used to distinguish different treatment groups. In the simplest case, we
would use a 0,1-dummy variable where a person is given a value of 0 if they are
in the control group or a 1 if they are in the treated group. Dummy variables are
useful because they enable us to use a single regression equation to represent
multiple groups. This means that we don’t need to write out separate equation
models for each subgroup. The dummy variables act like ‘switches’ that turn
various parameters on and off in an equation. Another advantage of a 0,1 dummy-
coded variable is that even though it is a nominal-level variable you can treat it
statistically like an interval-level variable (if this made no sense to you, you
probably should refresh your memory on levels of measurement). For instance, if
you take an average of a 0,1 variable, the result is the proportion of 1s in the
distribution.
yi = β0 + β1Zi + ei

where:

 yi is the outcome score of the ith unit,
 β0 is the coefficient for the intercept,
 β1 is the coefficient for the slope,
 Zi is:
o 1 if the ith unit is in the treatment group;
o 0 if the ith unit is in the control group;
 ei is the residual for the ith unit.

As an example of a dummy variable, think about the basic regression model for a two-group randomized experiment where the sole assessment is done after the fact. A comparison of posttest means for two groups using this model is equivalent to a t-test or a one-way ANOVA. Key to the model is the estimate of the difference between the groups, given by the term β1. Using this straightforward example, we can show how to use dummy variables to solve for the individual subgroup equations, and then how to subtract the equations for each subgroup to get an estimate of the gap between them. By employing dummy variables, we can pack a great deal of information into a single equation. Here, the point is simply that the treatment group differs from the control group by β1.

The first step towards seeing this is to figure out the equation for our two groups separately. Z = 0 in the control group. When we plug this into the equation, under the assumption that the error component averages to zero, we get the intercept β0 as the predicted value for the control group. We can then calculate the treatment group line by replacing Z with 1, again assuming the error term averages to zero. According to the formula, the predicted value for the treatment group is the sum of the two beta values, β0 + β1.

Having completed step 1, we can now move on to calculating the difference between the groups. The gap corresponds to the difference between the equations we derived for the two groups above. In other words, the difference between groups may be determined by subtracting their respective equations: (β0 + β1) − β0 = β1. So, for this model, the difference between the two groups is β1.

By following the two steps outlined above, you can work out how dummy variables are being used to represent the various subgroup equations in any regression model that uses dummy variables:

1. Substitute the dummy values into the equation to get an equation specific to each subgroup.
2. Compare the resulting subgroup equations to examine the differences between the groups.
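The two steps can be verified numerically; in the simulated sketch below (hypothetical data), the coefficient on the treatment dummy reproduces the difference between the treatment and control group means.

# Simulated two-group experiment: the coefficient beta_1 on the 0/1 dummy Z
# equals the difference between the treatment and control group means.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
Z = np.repeat([0, 1], 200)                     # 0 = control, 1 = treatment
y = 50 + 5 * Z + rng.normal(0, 10, Z.size)     # true beta_0 = 50, beta_1 = 5

fit = sm.OLS(y, sm.add_constant(Z)).fit()
print(fit.params)                              # [beta_0 hat, beta_1 hat]
print(y[Z == 1].mean() - y[Z == 0].mean())     # same number as beta_1 hat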

5.6 FIXED EFFECT WITHIN GROUP (WG) ESTIMATOR

Fixed effects are variables that are constant across individuals; these variables, like age, sex, or ethnicity, don't change or change at a constant rate over time. They have fixed effects; in other words, any change they cause to an individual is the same. For example, any effects from being a woman, a person of color, or a 17-year-old will not change over time.

It could be argued that these variables could change over time. For example, take women in the workplace: Forbes reports that the glass ceiling is cracking. However, the wheels of change are extremely slow (there was a 26-year gap between Britain's first woman prime minister, Margaret Thatcher, and the country's second woman prime minister, Theresa May). Therefore, for purposes of research and experimental design, these variables are treated as constant.

In a fixed effects model, random variables are treated as though they were non-random, or fixed. For example, in regression analysis, fixed effects regression fixes (holds constant) average effects for whatever variable you think might affect the outcome of your analysis. Fixed effects models do have some limitations. For example, they can't control for variables that vary over time (like income level or employment status). However, these variables can be included in the model by including dummy variables for time or space units. This may seem like a good idea, but the more dummy variables you introduce, the more the "noise" in the model is controlled for; this could lead to over-dampening the model, reducing the useful as well as the useless information.

Such models assist in controlling for unobserved heterogeneity when this heterogeneity is constant over time: typically, ethnicity and the year and location of birth are heterogeneous variables a fixed effects model can control for. This constant heterogeneity is the fixed effect for this individual. This constant can be removed from the data, for example by subtracting each individual's means from each of his observations before estimating the model.
A random effects model makes the additional assumption that the individual effects are
randomly distributed. It is thus not the opposite of a fixed effects model, but a special case. If
the random effects assumption holds, the random effects model is more efficient than the fixed
effects model. However, if this additional assumption does not hold (i.e., if the Hausman test
fails), the random effects model is not consistent.

5.7 THE RANDOM EFFECTS MODEL

In statistics, a random effects model, also called a variance components model, is a statistical
model where the model parameters are random variables. It is a kind of hierarchical linear
model, which assumes that the data being analyzed are drawn from a hierarchy of different
populations whose differences relate to that hierarchy. In econometrics, random effects
models are used in the analysis of hierarchical or panel data when one assumes no fixed effects
(it allows for individual effects). The random effects model is a special case of the fixed effects
model.

Contrast this to the biostatistics definitions, as biostatisticians use "fixed" and "random"
effects to respectively refer to the population-average and subject-specific effects (and where
the latter are generally assumed to be unknown, latent variables).

Random effects models assist in controlling for unobserved heterogeneity when the heterogeneity is constant over time and not correlated with the independent variables. This constant can be removed from longitudinal data through differencing, since taking a first difference will remove any time-invariant components of the model.
Two common assumptions can be made about the individual specific effect: the random
effects assumption and the fixed effects assumption. The random effects assumption is that
the individual unobserved heterogeneity is uncorrelated with the independent variables. The
fixed effect assumption is that the individual specific effect is correlated with the independent
variables.

If the random effects assumption holds, the random effects model is more efficient than the fixed effects model. However, if this assumption does not hold, the random effects model is not consistent.

In general, random effects are efficient, and should be used (over fixed effects) if the assumptions underlying them are believed to be satisfied. For random effects to work in the school example, it is necessary that the school-specific effects be uncorrelated with the other covariates of the model. This can be tested by running fixed effects, then random effects, and doing a Hausman specification test. If the test rejects, then random effects is biased and fixed effects is the correct estimation procedure.
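A hedged sketch of this workflow in Python, again using the Grunfeld panel and the linearmodels package, is shown below; the Hausman statistic is assembled by hand from the textbook formula rather than taken from a built-in routine.

# Fixed vs. random effects plus a hand-rolled Hausman-type comparison;
# assumes the linearmodels package is installed.
import numpy as np
import statsmodels.api as sm
from scipy import stats
from linearmodels.panel import PanelOLS, RandomEffects

grunfeld = sm.datasets.grunfeld.load_pandas().data
grunfeld["year"] = grunfeld["year"].astype(int)
data = grunfeld.set_index(["firm", "year"])

fe = PanelOLS.from_formula("invest ~ value + capital + EntityEffects", data=data).fit()
re = RandomEffects.from_formula("invest ~ 1 + value + capital", data=data).fit()

common = ["value", "capital"]                       # slopes shared by both models
b = (fe.params[common] - re.params[common]).to_numpy()
V = (fe.cov.loc[common, common] - re.cov.loc[common, common]).to_numpy()
hausman = float(b @ np.linalg.inv(V) @ b)
p_value = stats.chi2.sf(hausman, df=len(common))
print(hausman, p_value)    # a small p-value favours the fixed effects model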

To quantify the impact of immeasurable personal traits like perseverance or savviness, the Random
Effects regression model is commonly employed. Similar effects at the individual level are common in
panel data analyses. The Random Effects model, along with the Fixed Effect regression model, is a
popular method for investigating how different factors influence the panel data set's response.


Covariate effects in mixed-effects models may be estimated using the full random-effects model (FREM). The covariates are characterized in the model by their means and standard deviations, and the covariate effects are captured by estimated covariances between parameters and covariates. Methods based on estimating fixed effects may experience performance drops in some settings (e.g., correlated covariates whose effects cannot be simultaneously identified in fixed-effects methods), whereas this strategy is robust to such problems. The covariate-parameter relationship can be modified by employing FREM covariate parameterization and transforming covariate data records.
It was demonstrated that the four relations used in this implementation (linear, log-linear, exponential,
and power) yield estimates that are comparable to those obtained using a fixed-effects design. Both real
and simulated data with and without non-normally distributed and strongly correlated variables were
used to compare FREM to technically identical full fixed-effects models (FFEMs). Based on these
studies, it is clear that both FREM and FFEM work admirably in the studied instances, with FREM
providing somewhat more precise estimates of parameter interindividual variability (IIV).
Moreover, FREM provides the distinct benefit of allowing a single estimation to produce
covariate impact coefficient estimates and IIV estimates for any subset of the analyzed
variables, including the influence of each covariate separately. Covariate effects can be
communicated in a fashion that is not dependent on other covariates, or the model can be
applied to data sets with varying sets of accessible covariates.

A random effects model, also called a variance components model, is a kind of hierarchical linear model. It assumes that the data describe a hierarchy of different populations whose differences are constrained by the hierarchy. The fixed effects model is a special case.

Simple example-

Suppose m large elementary schools are chosen randomly from among millions in a large country. Then n pupils are chosen randomly from among those at each such school. Their scores on a standard aptitude test are ascertained. Let Yij be the score of the jth pupil at the ith school. Then

Yij = μ + Ui + Wij

where μ is the average of all scores in the whole population, Ui is the deviation of the average of all scores at the ith school from the average in the whole population, and Wij is the deviation of the jth pupil's score from the average score at the ith school.
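A small simulation of this variance components structure makes the two error layers explicit; the numbers of schools and pupils and the variances below are arbitrary illustrative choices.

# Simulating Y_ij = mu + U_i + W_ij for m schools and n pupils per school.
import numpy as np

rng = np.random.default_rng(42)
m, n = 30, 25
mu = 500.0                                # population-average test score
U = rng.normal(0.0, 20.0, size=m)         # school effects U_i
W = rng.normal(0.0, 40.0, size=(m, n))    # pupil-level deviations W_ij
Y = mu + U[:, None] + W                   # Y[i, j]: score of pupil j at school i

print(Y.mean(), Y.var())                  # overall variance is roughly 20**2 + 40**2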

5.8 SUMMARY

In statistics and econometrics, panel data or longitudinal data are multi-dimensional data
involving measurements over time. Panel data contain observations of multiple phenomena
obtained over multiple time periods for the same firms or individuals.
In panel data the same cross-sectional unit (industry, firm, country) is surveyed over time, so we have data which are pooled over space as well as time.

Dependence of the dependent variable on unobserved, individual-specific factors can lead to biased estimators in standard linear regression models, but these dependencies can be handled with panel data regression. This unit has outlined how to construct a panel data regression model in Python and explained the most significant theory behind the topic.

Two points are worth repeating: first, a simple and clear explanation of an integrated panel data regression model is hard to come by; second, although it is possible to execute panel data regression in Python, it is not as user-friendly as it is, say, in R. This, however, does not diminish the effectiveness of the method.

• With panel data we can study different issues:


- Cross sectional variation (unobservable in time series data) vs.
Time series variation (unobservable in cross sectional data)
- Heterogeneity (observable and unobservable individual heterogeneity)
- Hierarchical structures (say, zip code, city and state effects)
- Dynamics in economic behavior
- Individual/Group effects (individual effects)
- Time effects

Regression using panel data may mitigate omitted variable bias when there is no information on variables that correlate with both the regressors of interest and the dependent variable, and when these variables are constant in the time dimension or across entities. Provided that panel data are available, panel regression methods may improve upon multiple regression models, which produce results that are not internally valid in such a setting.

This unit covered the following topics:
- notation for panel data
- fixed effects regression using time and/or entity fixed effects
- computation of standard errors in fixed effects regression models

For applications, one can make use of the dataset Fatalities from the AER package (Christian Kleiber and Zeileis 2020), a panel dataset reporting annual state-level observations on U.S. traffic fatalities for the period 1982 through 1988. Such applications analyze whether there are effects of alcohol taxes and drunk driving laws on road fatalities and, if present, how strong these effects are.

5.9 KEYWORDS

 1. Regression analysis - Regression analysis is a set of statistical methods used for the


estimation of relationships between a dependent variable and one or more
independent variables. It can be utilized to assess the strength of the relationship
between variables and for modeling the future relationship between them.

 2. Panel regressions - Panel regression is a modeling method adapted to panel data, also called longitudinal or cross-sectional time-series data. It is widely used in econometrics,
where the behavior of statistical units (i.e. panel units) is followed across time. Those
units can be firms, countries, states, etc.

 3. Estimator - In statistics, an estimator is a rule for calculating an estimate of a given


quantity based on observed data: thus the rule (the estimator), the quantity of interest
(the estimand) and its result (the estimate) are distinguished. For example, the sample
mean is a commonly used estimator of the population mean.

 4. Accounting - Accounting is the process of documenting a business's financial


transactions. These transactions are compiled, examined, and reported to oversight
organizations, regulatory bodies, and tax collecting organizations as part of the
accounting process. A company's activities, financial condition, and cash flows are
summarized in the financial statements that are used in accounting. They provide a
succinct overview of financial events across an accounting period.

 5. Finance - Finance is defined as the management of money and includes activities such as
investing, borrowing, lending, budgeting, saving, and forecasting. There are three
main types of finance: (1) personal, (2) corporate, and (3) public/government.

5.10 LEARNING ACTIVITY

1. Write a short note on panel data.


__________________________________________________________________________
__________________________________________________________________________
2. Explain Fixed Effect Least Squares
__________________________________________________________________________
__________________________________________________________________________
3. Give details of Random Effects Model
__________________________________________________________________________
__________________________________________________________________________

5.11 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions

1. Which cross-sectional dependency test is more appropriate for smaller time and large cross-
section macro panel?

2. How can a single-step estimation of a True Random Effects technical efficiency model with inefficiency determinants be carried out?

3. What is panel data?

4. What are the different types of panel data regression?

5. What kind of data are required for panel analysis?

Long Questions

1. Explain the random effects model.

2. Explain briefly the dummy variable model.

3. What is pooled OLS regression? Explain with an example.

4. What are the reasons for using Panel Data?

5. Explain fixed effect model in detail.

B. Multiple Choice Questions

1.Which of the following is a disadvantage of the fixed effects approach to estimating a panel
model?

a. The model is likely to be technical to estimate

b. The approach may not be valid if the composite error term is correlated with one or more of
the explanatory variables

c. The number of parameters to estimate may be large, resulting in a loss of degrees of freedom

d. The fixed effects approach can only capture cross-sectional heterogeneity and not temporal
variation in the dependent variable.

2. The "within transform" involves

a. Taking the average values of the variables

b. Subtracting the mean of each entity away from each observation on that entity

c. Estimating a panel data model using least squares dummy variables

d. Using both time dummies and cross-sectional dummies in a fixed effects panel model

3. The fixed effects panel model is also sometimes known as

a. A seemingly unrelated regression model

b. The least squares dummy variables approach

c. The random effects model

d. Heteroscedasticity and autocorrelation consistent

4. Which of the following is a disadvantage of the random effects approach to estimating a


panel model?

a. The approach may not be valid if the composite error term is correlated with one or more of
the explanatory variables

b. The number of parameters to estimate may be large, resulting in a loss of degrees of freedom

c. The random effects approach can only capture cross-sectional heterogeneity and not
temporal variation in the dependent variable.

d. All of (a) to (c) are potential disadvantages of the random effects approach.

5. Which of the following are advantages of the use of panel data over pure cross-sectional or
pure time-series modelling? (i) The use of panel data can increase the number of degrees of
freedom and therefore the power of tests

(ii) The use of panel data allows the average value of the dependent variable to vary either
cross-sectionally or over time or both

(iii) The use of panel data enables the researcher to allow the estimated relationship between the independent and dependent variables to vary either cross-sectionally or over time or both

a. (i) only

b. (i) and (ii) only

c. (ii) only

d. (i), (ii), and (iii)

Answers

1-c, 2-b, 3-b, 4-a, 5-b

5.12 REFERENCES

 Gujarati, D., Porter, D. C. and Gunasekhar, C. (2012). Basic Econometrics (Fifth Edition). McGraw Hill Education.
 Anderson, D. R., D. J. Sweeney and T. A. Williams (2011). Statistics for Business and Economics. 12th Edition, Cengage Learning India Pvt. Ltd.
 Wooldridge, Jeffrey M., Introductory Econometrics: A Modern Approach, Third Edition, Thomson South-Western, 2007.
 Johnston, J., Econometric Methods, 3rd Edition, McGraw Hill, New York, 1994.
 Ramanathan, Ramu, Introductory Econometrics with Applications, Harcourt Academic Press, 2002 (IGM Library Call No. 330.0182 R14I).
 Koutsoyiannis, A., The Theory of Econometrics, 2nd Edition, ESL

