Dummy Variables: Nominal Scale
Regression models often need to account for qualitative attributes such as
gender or race. For example, holding all other factors constant, female workers
are found to earn less than their male counterparts, or nonwhite workers are
found to earn less than whites. One way we could “quantify” such attributes is
by constructing artificial variables that take on values of 1 or 0, with 1
indicating the presence (or possession) of that attribute and 0 indicating its
absence.
Variables that assume such 0 and 1 values are called dummy variables.
Example:
Suppose we want to find out if the average annual salary (AAS) of public school
teachers differs among the three geographical regions of the country.
Geographical regions are (1) Northeast and North Central, (2) South, (3) West.
If you take the simple arithmetic average of the average salaries of the teachers in
the three regions, you will find that these averages for the three regions are as
follows: $24,424.14 (Northeast and North Central), $22,894 (South), and
$26,158.62 (West). These numbers look different, but are they statistically
different from one another? There are various statistical techniques to compare
two or more mean values, which generally go by the name of analysis of
variance. But the same objective can be accomplished within the framework of
regression analysis.
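To see this, consider the following model:
Yi = β1 + β2D2i + β3D3i + ui
where Yi = average salary of public school teachers for observation i,
D2i = 1 if the observation is in the Northeast and North Central region and 0
otherwise, and D3i = 1 if the observation is in the South and 0 otherwise. The
West receives no dummy and therefore serves as the base category.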
The “slope” coefficients β2 and β3 tell by how much the mean salaries of
teachers in the Northeast and North Central and in the South differ from the mean
salary of teachers in the West. But how do we know if these differences are
statistically significant?
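The comparison of group means through a dummy regression can be sketched numerically. The block below is only an illustration under stated assumptions: the nine salary figures and the region labels are hypothetical (they are not the data behind the averages quoted above), and the t ratios use the classical OLS variance formula. It shows that the intercept equals the mean of the base category (West), that each dummy coefficient equals the difference between a region's mean and the West's mean, and that the t ratio attached to each dummy is what one would inspect to judge significance.

```python
import numpy as np

# Hypothetical salaries in dollars; NOT the data used in the text.
salary = np.array([24000., 25000., 26000.,   # Northeast and North Central
                   22000., 23000., 21000.,   # South
                   27000., 26000., 28000.])  # West
region = np.array(["NE", "NE", "NE", "S", "S", "S", "W", "W", "W"])

# West is the base category, so it receives no dummy.
d2 = (region == "NE").astype(float)  # 1 if Northeast and North Central
d3 = (region == "S").astype(float)   # 1 if South
X = np.column_stack([np.ones_like(salary), d2, d3])

# OLS estimates of (beta1, beta2, beta3).
beta, *_ = np.linalg.lstsq(X, salary, rcond=None)

# Classical OLS standard errors and t ratios.
resid = salary - X @ beta
dof = len(salary) - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_ratio = beta / se

west_mean = salary[region == "W"].mean()
ne_mean = salary[region == "NE"].mean()
south_mean = salary[region == "S"].mean()

print("intercept (West mean):", beta[0])
print("NE minus West:", beta[1], "t =", t_ratio[1])
print("South minus West:", beta[2], "t =", t_ratio[2])
```

With real data, the same t ratios would be compared against the appropriate critical value to decide whether a regional difference is statistically significant.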
The overall conclusion is that, statistically, the mean salaries of public
school teachers in the West and in the Northeast and North Central are
about the same, but the mean salary of teachers in the South is
statistically significantly lower, by about $3,265.
From the preceding discussion, it is clear that all one has to do is see if
the coefficients attached to the various dummy variables are
individually statistically significant. This example also shows how easy
it is to incorporate qualitative, or dummy, regressors in regression
models.
Caution in the Use of Dummy Variables
1. If a qualitative variable has m categories, introduce only (m − 1) dummy
variables. If you do not follow this rule, you will fall into what is called the
dummy variable trap.
2. The category for which no dummy variable is assigned is known as the
base, benchmark, control, comparison, reference, or omitted category.
And all comparisons are made in relation to the benchmark category.
3. The intercept value (β1) represents the mean value of the benchmark
category.
4. The coefficients attached to the dummy variables are known as the
differential intercept coefficients because they tell by how much the
intercept of the category that receives the value of 1 differs from the
intercept coefficient of the benchmark category.
5. If a qualitative variable has more than one category, as in our illustrative
example, the choice of the benchmark category is strictly up to the
researcher.
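6. There is a way to circumvent the dummy variable trap: introduce as many
dummy variables as the number of categories of that variable, provided we do
not introduce the intercept in such a model. Thus, if we drop the intercept
term and include a dummy for each of the three regions, the model becomes
Yi = β1D1i + β2D2i + β3D3i + ui
where D1i, D2i, and D3i equal 1 for the West, the Northeast and North Central,
and the South, respectively, and 0 otherwise. Each coefficient is then the
mean salary of its own region.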
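Both the trap and the no-intercept workaround can be checked numerically. The sketch below uses hypothetical salary figures (not the data from the text) and assumes one dummy per region. With an intercept and all three dummies, the dummy columns add up to the intercept column, so the design matrix loses rank. Dropping the intercept restores full rank, and each coefficient becomes the mean of its own region.

```python
import numpy as np

# Hypothetical salaries in dollars; NOT the data used in the text.
salary = np.array([24000., 25000., 26000.,   # Northeast and North Central
                   22000., 23000., 21000.,   # South
                   27000., 26000., 28000.])  # West
region = np.array(["NE", "NE", "NE", "S", "S", "S", "W", "W", "W"])

# One dummy per category (m = 3 dummies for m = 3 regions).
dummies = np.column_stack(
    [(region == r).astype(float) for r in ("W", "NE", "S")]
)

# Dummy variable trap: intercept plus all three dummies.
# The three dummy columns sum to the column of ones (perfect collinearity).
X_trap = np.column_stack([np.ones(len(salary)), dummies])
rank_trap = np.linalg.matrix_rank(X_trap)   # less than the 4 columns

# Workaround: keep all three dummies but drop the intercept.
rank_no_intercept = np.linalg.matrix_rank(dummies)
coef, *_ = np.linalg.lstsq(dummies, salary, rcond=None)

print("columns with intercept:", X_trap.shape[1], "rank:", rank_trap)
print("region means (W, NE, S):", coef)
```

The no-intercept form reports the means directly, whereas the form with an intercept and (m − 1) dummies reports differences from the base category, which is usually the more convenient form for testing.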
EXAMPLE:
Once you go beyond one qualitative variable, you have to pay close
attention to the category that is treated as the base category, since all
comparisons are made in relation to that category. This is especially
important when you have several qualitative regressors, each with several
categories.
REGRESSION WITH A MIXTURE OF QUANTITATIVE AND
QUALITATIVE REGRESSORS: THE ANCOVA MODELS
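Consider the model
Yi = α1 + α2D2i + α3D3i + α4(D2iD3i) + βXi + ui
where Yi = hourly wage, Xi = education, D2i = 1 if female and 0 otherwise, and
D3i = 1 if nonwhite and 0 otherwise. Setting D2i = 1 and D3i = 1 gives
E(Yi | D2i = 1, D3i = 1, Xi) = (α1 + α2 + α3 + α4) + βXi
which is the mean hourly wage function for female nonwhite workers.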
Observe that
α2 = differential effect of being a female
α3 = differential effect of being a nonwhite
α4 = differential effect of being a female nonwhite
which shows that the mean hourly wage of female nonwhite workers differs
(by α4) from the mean hourly wage of female workers or of nonwhite workers.
If, for instance, all three differential dummy coefficients are negative, this
would imply that female nonwhite workers earn much lower mean hourly
wages than female or nonwhite workers as compared with the base
category, which in the present example is male white.
EXAMPLE 9.5
AVERAGE HOURLY EARNINGS IN RELATION TO EDUCATION,
GENDER, AND RACE
Let us first present the regression results based on the model without the
interaction term. We obtained the following results:
Yi = −0.2610 − 2.3606D2i − 1.7327D3i + 0.8028Xi
t = (−0.2357)** (−5.4873)* (−2.1803)* (9.9094)*
R2 = 0.2032 n = 528
The results show, ceteris paribus, the average hourly earnings of females are
lower by about $2.36, and the average hourly earnings of nonwhite workers are
also lower by about $1.73.
We now consider the results of the model that includes the interaction dummy.
Yi = −0.2610 − 2.3606D2i − 1.7327D3i + 2.1289D2iD3i + 0.8028Xi
t = (−0.2357)** (−5.4873)* (−2.1803)* (1.7420)** (9.9095)*
R2 = 0.2032 n = 528
Holding the level of education constant, if you add the three dummy coefficients
you will obtain −1.964 (= −2.3606 − 1.7327 + 2.1289), which means that the mean
hourly wages of nonwhite female workers are lower by about $1.96, which is
between the value of −2.3606 (gender difference alone) and −1.7327 (race
difference alone).
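The arithmetic behind this comparison can be laid out as a small calculation. The sketch below takes the three dummy coefficients exactly as printed in the estimated equation with the interaction term; the function name `differential` is a hypothetical helper introduced only for illustration. It returns the wage differential of each group relative to the base category (male white), holding education constant.

```python
# Dummy coefficients as printed in the estimated equation with interaction.
A_FEMALE = -2.3606       # coefficient on D2 (female)
A_NONWHITE = -1.7327     # coefficient on D3 (nonwhite)
A_INTERACTION = 2.1289   # coefficient on D2*D3 (female and nonwhite)


def differential(female: int, nonwhite: int) -> float:
    """Hourly wage differential relative to male white workers,
    holding education constant (hypothetical helper)."""
    return (A_FEMALE * female
            + A_NONWHITE * nonwhite
            + A_INTERACTION * female * nonwhite)


for female, nonwhite, label in [(0, 0, "male white (base)"),
                                (1, 0, "female white"),
                                (0, 1, "male nonwhite"),
                                (1, 1, "female nonwhite")]:
    print(f"{label}: {differential(female, nonwhite):.4f}")
```

Because the interaction coefficient is positive, the combined differential for female nonwhite workers is smaller in absolute value than the simple sum of the gender and race differentials.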
THE USE OF DUMMY VARIABLES IN SEASONAL ANALYSIS
In the seasonal regression model, Yt = sales (in thousands) and the D’s are the
quarterly dummies, each taking a value of 1 in the relevant quarter and 0
otherwise.
How do we obtain the deseasonalized time series of sales? This can be
done easily. You estimate the values of Y from the model for each observation
and subtract them from the actual values of Y; that is, you obtain (Yt − Ŷt),
which are simply the residuals from the regression.
What do these residuals represent? They represent the remaining
components of the sales time series, namely, the trend, cycle, and random
components.
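The deseasonalizing step can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the twelve quarterly sales figures are hypothetical, and the model is taken to have one dummy per quarter and no intercept, which is one of the two equivalent ways of specifying the seasonal dummies. The fitted values are then the quarterly means, and the residuals are the deseasonalized series.

```python
import numpy as np

# Hypothetical quarterly sales (in thousands) for three years; NOT real data.
sales = np.array([10., 14., 12., 20.,
                  11., 15., 13., 21.,
                  12., 16., 14., 22.])
quarter = np.tile(np.arange(4), 3)  # 0, 1, 2, 3 repeated for each year

# One dummy per quarter, no intercept (assumed specification).
D = (quarter[:, None] == np.arange(4)).astype(float)

# OLS on the dummies: each coefficient is that quarter's mean sales.
alpha, *_ = np.linalg.lstsq(D, sales, rcond=None)

fitted = D @ alpha               # seasonal component
deseasonalized = sales - fitted  # residuals: trend, cycle, random components

print("quarterly means:", alpha)
print("deseasonalized series:", deseasonalized)
```

In this made-up series the residuals rise steadily from year to year, which is the trend that remains once the seasonal pattern has been removed.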