Handout 2 - Stat 301
In this handout, we will discuss various sampling techniques that can be used to select potential
respondents to a survey.
INTRODUCTION TO SURVEYS
These days, surveys are used everywhere and for many reasons. For example, surveys are
commonly used to track the following:
Surveys (and observational studies, in general) can be broadly classified into two types:
Note that many surveys will provide data that can serve both of the above purposes.
SOME TERMINOLOGY
Target Population – Ask yourself the following question: to whom do you want to generalize
the results? This group is the target population. Ideally, this is the group from which you’d
like to sample.
Study Population - Sometimes, the target population is hard to access, and only some of the
target population will be available for the study. This (typically nonrandom) subset of the
target population is known as the study population (or the accessible population).
Sampling Frame - The sampling frame is a list of all elements (called sampling units) of the
study population that will be used for sampling. Sometimes the units are people, sometimes they
are households, etc. The frame is needed so that every sampling unit in the population is
identified and has an equal opportunity for selection.
Introduction to Surveys and Sampling Techniques STAT 301
Sample – The sample consists of the elements from the study population that are actually
selected to participate in the study. When a survey study involves people, these elements are
typically referred to as subjects or participants. Finally, note that not necessarily all of the
subjects selected for a sample will choose to participate in the study!
In his book Sampling Techniques, William G. Cochran outlines the many steps that are involved
in planning a survey.
In future handouts, we will discuss in detail many aspects of designing a survey. For now,
however, we will focus primarily on the sampling process.
SAMPLING METHODS
The goal is to obtain a good sample so that we can draw sound conclusions about the
population. It is essential that our sample be representative of the population!
“The Literary Digest conducted polls regarding the presidential elections in 1920, 1924, 1928, and
1932, making correct predictions of the winner for each. In the 1936 election they decided to
conduct their most ambitious poll and collected responses from over two million people. Using
this information they predicted that Alf Landon would win the 1936 presidential election over
Franklin Roosevelt. Franklin Roosevelt, however, ended up winning the presidency with 61
percent of the votes. The Literary Digest was left wondering what went wrong. There was also a
gentleman named George Gallup who also conducted a poll and correctly predicted that
Roosevelt would win the presidency. You might wonder what George Gallup did to end up
with a correct prediction while the Literary Digest seemed to bobble after years of correct
predictions. It turns out the difference in the polls conducted was the manner in which the
samples were chosen. The Literary Digest used telephone directories and lists of automobile
owners to select their participants who consisted mostly of wealthy individuals, whereas Gallup
tried to get a sample which represented characteristics of the population. Most of the lower
class individuals voted in favor of Roosevelt because he was proposing the New Deal recovery
program, which was very desirable since the country was just coming out of the worst
economic recession they had seen at the time.”
Source: Babbie, Earl. The Basics of Social Research, 5th Edition. 2011. Wadsworth, p. 204 – 205.
The key to making a correct prediction in the 1936 presidential election was the ability to get a
representative sample of the population of interest. Sampling theory makes the process of
obtaining a representative sample and then estimating parameters of interest extremely
efficient. Methods of sample selection and estimation have been developed to provide the most
precise estimates at the lowest cost. Next, we will discuss some of these sampling techniques.
Nonprobability Sampling – This does not involve random selection, and the probability of
selecting each unit from the population is either unknown or, in some cases, zero.
Next, we will discuss commonly used probability sampling methods. In general, we’ll use the
following notation:
The simplest and most common probability sample is known as a simple random sample (SRS).
When obtaining a simple random sample, researchers select n units from N objects such that
each possible sample of size n has an equal chance of being selected.
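The definition above can be sketched in a few lines of Python. This is only an illustration: the class list of N = 40 students and the function name `simple_random_sample` are made up for the example, and the seed is fixed only so the draw can be repeated.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n units from the sampling frame without replacement."""
    rng = random.Random(seed)
    # rng.sample selects n distinct units so that every possible
    # subset of size n is equally likely to be chosen.
    return rng.sample(frame, n)


# Hypothetical sampling frame: a class list of N = 40 students.
class_list = [f"student_{i:02d}" for i in range(1, 41)]

srs = simple_random_sample(class_list, n=10, seed=301)
print(srs)
```

Note that the selection depends only on the random draw, not on where a student appears in the list or on any characteristic of the student.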
Questions:
2. Suppose a professor takes the first 10 students on the class list (which is organized
alphabetically) and surveys them to find out their opinions of the course so far. Is this a
simple random sample?
3. What if the professor uses the class list but this time randomly chooses a letter from the
English alphabet and selects for the sample those students whose last name begins with
that letter?
4. Suppose that instead of using the class list, the professor picks a digit at random and
selects for the sample those students with phone numbers ending in that digit. Is this a
simple random sample?
In general, simple random sampling is simple to accomplish and easy to explain. It is a fair
sampling method, and it’s reasonable to assume the sample is representative of the population
(and therefore we can generalize the results from the sample back to the population). This
method, however, is not the most efficient method of sampling. Also, by luck of the draw in
some cases, we may not get a good representation of the subgroups in the population. To deal
with these issues, alternative sampling methods are often used.
The next method ensures that we get a good representation of the subgroups in the population.
A stratified random sample is obtained by first dividing the population into homogeneous
subgroups (called strata) and then taking a simple random sample from each subgroup.
Variables often used to create strata include age, gender, ethnicity, socioeconomic status,
diagnosis, geographic region, etc.
There are two approaches to obtaining stratified random samples: proportionate and
disproportionate.
To obtain a disproportionate stratified random sample, the subgroup sample sizes are not set in
proportion to each subgroup’s share of the study population.
For example, in some studies, researchers may want to oversample minority groups so that they
can ensure enough minority subjects are obtained to draw conclusions about those subgroups.
Note that when obtaining overall estimates regarding the population, the researchers would
adjust the estimates based on the fact that some groups were oversampled. This type of
complex analysis is beyond the scope of this course, but it’s important to recognize that it is
used by statisticians with expertise in surveys and sampling.
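As a rough illustration of the proportionate approach, the Python sketch below assumes the usual reading of "proportionate": each stratum's sample size is proportional to its share of the population. The population of 1,000 people, the three strata, and the function name are all hypothetical.

```python
import random
from collections import defaultdict


def proportionate_stratified_sample(units, stratum_of, n, seed=None):
    """Take a simple random sample within each stratum, sized by its share."""
    rng = random.Random(seed)

    # Step 1: divide the population into strata.
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    # Step 2: draw an SRS from each stratum.
    total = len(units)
    sample = {}
    for name, members in strata.items():
        # Rounding can make the sizes sum to slightly more or less than n.
        n_h = round(n * len(members) / total)
        sample[name] = rng.sample(members, n_h)
    return sample


# Hypothetical population: (id, stratum) pairs.
population = (
    [(i, "undergraduate") for i in range(500)]
    + [(i, "graduate") for i in range(500, 800)]
    + [(i, "staff") for i in range(800, 1000)]
)

sample = proportionate_stratified_sample(population, lambda u: u[1], n=50, seed=301)
sizes = {name: len(members) for name, members in sample.items()}
print(sizes)
```

A disproportionate design would replace the computed `n_h` with sizes chosen by the researcher (for example, a larger `n_h` for a small subgroup), and the overall estimates would then need to be adjusted for the oversampling.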
The stratified random sampling procedure comes with the following advantages and
disadvantages.
• They ensure that the sample is representative of the population in terms of the
stratification variable. Essentially, they “protect” us from the unlikely but still possible
scenario of obtaining a biased sample through simple random sampling.
• They generally lead to more precision on estimates than do simple random samples, as
long as the groups are homogeneous. How much the stratification helps depends on the
relationship between the variable(s) used to create strata and the outcomes of interest in
the study. The stronger this relationship, the more we gain from using a stratified
sampling method.
• They allow us to obtain estimates with a certain precision for specific subgroups of the
population.
• They are more complex and require slightly more effort than simple random samples.
• The strata must be carefully defined.
Questions:
1. What stratification variable was used?
Questions: What stratification variable(s) might be used for the following situations?
The next sampling method we’ll discuss is another example of a probability sampling method.
Note that if the units in the sampling frame are randomly ordered, this is essentially the same as
simple random sampling. In this case, we could safely assume the sample is representative of
the population. The main advantage of systematic random sampling is that it can be quickly
and easily implemented; the main disadvantage is that the method might lead to bias if there
are hidden periodicities in the sampling frame.
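A minimal Python sketch of systematic random sampling follows. It assumes the usual procedure: compute the interval k = N // n, pick a random starting point among the first k units, and then take every k-th unit after that. The frame of 100 numbered units and the function name are hypothetical.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Select every k-th unit of the frame after a random start."""
    rng = random.Random(seed)
    k = len(frame) // n        # sampling interval (requires n <= N)
    start = rng.randrange(k)   # random start among the first k units
    return [frame[start + i * k] for i in range(n)]


# Hypothetical sampling frame: units numbered 1 through 100.
frame = list(range(1, 101))

sys_sample = systematic_sample(frame, n=10, seed=301)
print(sys_sample)
```

Only the starting point is random; every other selection is fixed by the interval. This is why a pattern in the frame that repeats every k units can bias the sample.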
A cluster random sample is obtained by first dividing the study population into clusters
(typically geographically). Once clusters have been formed, a simple random sample of clusters
is taken. All units within the sampled clusters then form the sample.
• They make the sampling process much more efficient, especially when the population is
dispersed across a wide geographic region.
• They are the least representative of the population out of all of the types of probability
samples discussed so far. Units within a cluster may have similar characteristics, which
could lead to an over- or under-representation of certain characteristics in the sample.
• They are typically associated with high sampling errors.
• They lead to a violation of the assumption of independence, so the analysis of data
collected from cluster sampling methods is slightly more complex.
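The cluster sampling procedure described above can be sketched in Python as follows. The eight city blocks of five households each, and the function name, are invented for the illustration.

```python
import random


def cluster_sample(clusters, m, seed=None):
    """Take an SRS of m clusters and keep every unit in the chosen clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)  # randomness is at the cluster level
    return {name: clusters[name] for name in chosen}


# Hypothetical study population: 8 city blocks with 5 households each.
blocks = {
    f"block_{b}": [f"block_{b}_house_{h}" for h in range(1, 6)]
    for b in range(1, 9)
}

clustered = cluster_sample(blocks, m=3, seed=301)
households = [h for members in clustered.values() for h in members]
print(sorted(clustered), len(households))
```

Because whole clusters enter the sample together, households from the same block are not selected independently of one another, which is the source of the analysis complications noted in the list above.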
Now that we have discussed the four basic types of probability sampling, consider one last
probability sampling approach: multi-stage random sampling.
Definition
A sampling error is the discrepancy between the sample statistic and the population parameter
of interest that is due to random fluctuations in the data that occur when the sample is selected.
When a probability sampling method is used, we can obtain an estimate of the sampling error
which represents the magnitude of the uncertainty regarding the obtained parameter estimate.
In general, for simple random samples, the sampling error is computed as follows (note that
this was probably called the standard error in your introductory statistics course):
$$\text{Sampling Error} = \sqrt{\frac{\text{Sample Variance}}{n}}$$
Once this has been calculated, it can be used to construct a confidence interval. Once again, this
standard formula for calculating the sampling error is based on the assumption that the sample
was drawn using simple random sampling. When another probability sampling method has
been used, the sampling error may actually be slightly higher (or lower) than indicated by the
standard formula. For example, the sampling error for cluster sampling will typically be higher
than the sampling error for simple random sampling. For stratified designs, on the other hand, the
sampling error will typically be lower. The overall message is this: care should be taken to use the
correct standard errors in an analysis. Once again, these analyses are beyond the scope of this
course; however, you should recognize the importance of their use by professional statisticians.
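For a simple random sample, the calculation can be sketched in Python as below. The ten survey responses are made-up data, and the multiplier 1.96 assumes a normal approximation for a 95% confidence interval; with a sample this small, a t multiplier would usually be used instead.

```python
import math
import statistics


def sampling_error(values):
    """Standard error of the mean under simple random sampling."""
    s2 = statistics.variance(values)   # sample variance (n - 1 denominator)
    return math.sqrt(s2 / len(values))


# Hypothetical responses on a 1-to-5 scale from an SRS of n = 10 people.
responses = [4, 5, 3, 4, 2, 5, 4, 3, 4, 4]

mean = statistics.mean(responses)
se = sampling_error(responses)

# Approximate 95% confidence interval: estimate +/- 1.96 standard errors.
ci = (mean - 1.96 * se, mean + 1.96 * se)
print(mean, se, ci)
```

This formula applies only to simple random samples; data from a cluster or stratified design would need a standard error that reflects that design.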
The bigger issue is this: traditional probability theory does not apply to nonprobability samples, and so
the sampling error can’t be estimated. Therefore, no valid confidence intervals can be
obtained, either! Probability sampling methods give us much more confidence that we have
represented the population well, and they allow for the estimation of sampling errors (and thus
margins of error and confidence intervals). Sometimes, however, it is not feasible to take
probability samples. In such cases, the researchers may use one of the following
nonprobability sampling methods.
Convenience Sampling
Obtaining a convenience sample involves selecting the most readily available elements of a
population for a study.
For example, in a clinical study, researchers may use patients that are easily accessed. Or, a
study may ask for (or even recruit and provide monetary compensation to) volunteers. In such
cases, there is concern about whether these samples are representative of the population (and
maybe even suspicion that they are not). For example, one concern regarding opinion polls is
that people who volunteer might tend to be more interested in (and therefore more
knowledgeable or opinionated about) the survey topic than the general public.
Purposive Sampling
When obtaining a purposive sample, researchers “hand-pick” units from the study population
that they believe to be representative of the population.
For example, in order to study consumer preferences of Caucasian females age 30-40 years old,
market researchers may go to a shopping mall and seek out individuals who seem to fit this
category. Or, suppose researchers want to estimate the average amount a shopper spends in one
visit to the Mall of America. If purposive sampling is used, the researcher will look around and
use their own judgment to sample shoppers who they feel are representative of the population.
Purposive sampling allows researchers to reach a targeted sample quickly; however, samples
may not be representative of the population because the sample is likely to overweight
subgroups of the population that are more readily available.
Snowball Sampling
This is a variant of purposive sampling. To obtain a snowball sample, a few members of the
population who meet criteria for inclusion in the study are selected, and they are asked to
provide names of further subjects they know who also meet the criteria.
This will most likely yield unrepresentative results; in some cases, however, it may be the only
way to reach a population that is hard to access.
Example: Snowball Sample to Investigate Reasons for Drug Use Amongst Young People
Quota Sampling
A quota sample is selected non-randomly according to some fixed quota which is set to reflect
certain characteristics of the population.
For example, if a population consists of 40% men and 60% women and a sample of size 100 is
desired, researchers recruit the first 40 men and 60 women who meet the inclusion criteria. Like
stratified sampling, this ensures selection of adequate numbers of subjects from each subgroup;
unlike stratified sampling, it does not involve random selection. The sample, therefore, may not be
representative of the population.
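The 40 men / 60 women example can be sketched in Python to show where the randomness is missing. The stream of 300 arrivals and the function name are hypothetical; the point is that subjects are taken in whatever order they happen to show up.

```python
def quota_sample(arrivals, quotas):
    """Recruit people in arrival order until every group's quota is filled."""
    remaining = dict(quotas)
    recruited = []
    for person in arrivals:
        group = person["group"]
        if remaining.get(group, 0) > 0:
            recruited.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):
            break  # all quotas are filled
    return recruited


# Hypothetical arrival order: every third person is a man.
arrivals = [
    {"id": i, "group": "man" if i % 3 == 0 else "woman"}
    for i in range(300)
]

quota = quota_sample(arrivals, {"man": 40, "woman": 60})
counts = {g: sum(p["group"] == g for p in quota) for g in ("man", "woman")}
print(counts)
```

No random number generator appears anywhere in the sketch: the quotas fix the subgroup counts, but who fills them is determined entirely by availability.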
There are two types of errors that typically occur with survey studies:
Non-sampling Errors – This refers to any other deviations from the population parameters that
are not due to sampling error. These are much harder to quantify.
Non-sampling Errors
Coverage Error – This includes both undercoverage and overcoverage. Undercoverage occurs
when some members of the population are inadequately represented in the sample. A classic
example of undercoverage is the Literary Digest voter survey, which predicted that Alfred
Landon would beat Franklin Roosevelt in the 1936 presidential election. The survey sample
suffered from undercoverage of low-income voters, who tended to be Democrats. Also, in
many surveys of households, subgroups such as members of the military or persons who are
institutionalized tend to be undercovered. Overcoverage occurs when some members of the
population are overly represented in the sample. The classic example of this is when the sampling
frame contains duplicate records.
Some researchers just ignore these types of error, which isn’t necessarily a good idea. Others
are careful to redefine their study population or make corrections to their sampling frame.
Finally, some consult with professional statisticians with survey expertise who can use
advanced statistical techniques to adjust for coverage errors.
Non-Response Error - Sometimes, individuals chosen for the sample are unwilling or unable to
participate in the survey. Nonresponse bias is the bias that results when respondents differ in
meaningful ways from nonrespondents. The Literary Digest survey illustrates this problem as
well: respondents tended to be Landon supporters, and nonrespondents tended to be Roosevelt
supporters. Since only 25% of the sampled voters actually completed the mail-in survey, survey
results overestimated voter support for Alfred Landon.
Again, some researchers just ignore this type of error, which isn’t necessarily a good idea.
Others consult with professional statisticians with survey expertise who can use advanced
statistical techniques (such as data imputation) to adjust for non-response errors.
Voluntary Response Error - This occurs when individuals select themselves to participate in a
poll. In this case, we obtain information from only those who feel strongly enough to respond.
An example would be call-in radio shows that solicit audience participation in surveys on
controversial topics (abortion, affirmative action, gun control, etc.). The resulting sample tends
to over-represent individuals who have strong opinions.
Measurement Error – This type of error occurs when surveys do not measure what they were
intended to measure. This can happen when the respondent doesn’t understand the question or
doesn’t answer it truthfully, or when the interviewer makes a mistake when recording the
respondent’s answers. This could also happen if the survey suffers from the effects of question
wording.
The following diagram shows where these errors appear in the sampling process.