Sampling Distribution
Kaustav Banerjee
Decision Sciences Area, IIM Lucknow
the sampling unit is a college. Alternatively, after selecting a college from all the colleges in
Lucknow, if a sample of students is selected from each selected college, then the colleges are
the first stage sampling units and the students are the second stage sampling units.
Sampling frame: list of all college students in Lucknow or list of all colleges in Lucknow. In
other words, the list of all sampling units in the population.
Sample: this is a collection of sampling units drawn from a frame or frames. A sample could
be drawn according to convenience, or it could be drawn in such a way that every possible
sample has an equal chance of being selected. The former is called a convenience sample and
the latter is called a random sample. A random sample is often preferred because it avoids bias
in selection and usually results in a representative sample. A convenience sample in this example
could be a sample of students from the nearby colleges. For drawing a random sample, on
the other hand, a list of all college students (a sampling frame) needs to be created, and then
a sample of students is selected from the list using some random mechanism.
Parameter: a parameter is a population characteristic of our interest. In our example it
is the average based on data from all college students in Lucknow. It is usually a fixed
unknown number. Other population characteristics that may be of our interest are the standard
deviation of the times spent by the students in the population, and the percentage of students
in the population spending more than 4 hours, etc.
Statistic: a statistic is a sample analogue of the parameter like sample average or sample
standard deviation. A statistic is calculated on the basis of sample observations. In the
context of our example, the average and standard deviation of times spent in social media
sites by the selected students, percentage of selected students spending more than four hours
in social media sites etc. are examples of statistics. For estimating the unknown parameter,
the sample value of the associated statistic is used. For example in estimating population
average, the value of the sample average observed for the selected sample is used.
The first column gives the number of hours, second column gives 5 parameter values obtained
on the basis of all the 10 students. Next we select a random sample of 5 students listed as
Sample 1 and obtain the same quantities on the basis of the selected students, reported in
Statistic 1. Taking a fresh sample in Sample 2 we repeat the exercise reported in Statistic 2.
The message is that parameters remain constant, as they are based on all the sampling units
of the population. A statistic, however, is a sample/data-driven quantity: its value varies with
the sample.
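The contrast between a fixed parameter and a varying statistic can be sketched in a few lines of Python. The ten values of `population` below are hypothetical illustrative hours, not the data of the table referred to above, and the helper `summarise` is a name introduced only for this sketch.

```python
import random
import statistics

# Hypothetical daily hours on social media for a population of 10 students
# (illustrative values only, not the data from the original table).
population = [1.5, 2.0, 4.5, 3.0, 5.0, 2.5, 6.0, 1.0, 3.5, 4.0]

def summarise(hours):
    """Return (mean, standard deviation, % of values above 4 hours)."""
    mean = statistics.mean(hours)
    sd = statistics.pstdev(hours)  # divisor is the number of values supplied
    pct_over_4 = 100 * sum(h > 4 for h in hours) / len(hours)
    return mean, sd, pct_over_4

# Parameters: computed once from all 10 students, so they never change.
parameters = summarise(population)

rng = random.Random(2024)              # seeded only to make the run repeatable
sample_1 = rng.sample(population, 5)   # random sample of 5, without replacement
sample_2 = rng.sample(population, 5)   # a fresh sample of 5
statistic_1 = summarise(sample_1)
statistic_2 = summarise(sample_2)

print("Parameters :", parameters)
print("Statistic 1:", statistic_1)
print("Statistic 2:", statistic_2)
```

Rerunning with a different seed changes the two statistic lines but not the parameter line. A sample standard deviation with divisor n - 1 (`statistics.stdev`) is a common alternative to the divisor used here.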
2 Collection of Data
Two methods are usually followed in collecting data for research: conducting experiments
or conducting surveys (often using data already collected by some agency, known as syndicated
data). Data of the former kind are said to be collected through an experimental study and
data of the latter kind through an observational study.
We now consider collection of data through surveys. If the data are collected from each
and every sampling unit of the population, the method is known as complete enumeration or
the census method of data collection. Naturally, if the data are collected through the census
method, one can in theory find the true value of the parameter. However, this statement rests
on an implicit assumption that is very far from what happens in reality. The assumption is
that the collected data are free from errors,
which is never (!!!) true.
On the other hand, in estimation through sample surveys we only have access to
the selected sample observations, which are a part of the whole population. The other part
of the population remains unobserved. Thus the value of the sample statistic, which we use
as an estimate of the unknown value of the parameter, could be near to or far from the true
value of the parameter, depending on whether the sample is a good or a bad representation of
the population. So one of the important issues in a sample survey is to select a sample which
represents the population well, so that an estimate with good accuracy is obtained.
Better representation, and hence more accuracy, could be achieved by (i) increasing the sample
size or (ii) adopting a better method of selecting the sample, that is, a better sampling
design. Representation, and hence accuracy, also depends on (iii) how homogeneous the
population is. But controlling homogeneity is beyond our reach.
directory, list of e-mail addresses) that is used for drawing sample, does not match up perfectly
with the sampling frame of the target population.
It could be due to non-response. This is considered to be one of the most serious and
frequently occurring errors in any survey. In a personal interview it arises in one of three
ways: the inability to contact the sampled unit (a person, a household, an organization, etc.;
in actual surveys, substituting a next-door neighbour is common but is not a good idea),
the inability of the respondent to come up with an answer to the question of interest (for
example, asking someone who may not have any clue about the impact of a policy decision),
or refusal to answer (perhaps out of fear, or an intention not to divulge). A good survey
should attempt to obtain some information about the group of non-respondents in order to
understand how different from or similar to the group of respondents they are.
Besides these, there could be errors of observation. These may be due to the respondent’s
reporting error: the respondent may simply not remember correctly, or may
not understand the question properly, as when the head of the household is asked the number of
literates in the household (the meaning of literate may not be clear to the respondent).
In addition, errors could arise from the inability of the interviewers to obtain honest
responses, from bad design of the questionnaire (it has been observed that the ordering
and wording of questions, and whether a question is open ended or closed
ended, lead to a lot of variation in responses), from coding errors, etc. Any kind of
error other than sampling error is known as non-sampling error.
It is believed that for a moderately large sample survey the non-sampling error contributes
around 70-80% of the total error. Finally, the non-sampling error increases with the increase
in sample size.
What do you think? With reference to the above discussion, why do you
think sample survey could be a better choice than census?
Q1 Following SRSWR, what is the probability of drawing a particular sequence of n (say
n = 10) units in order, out of N (say N = 60) units in the population?
Q2 Following SRSWOR, what is the probability of drawing a particular sequence of n (say
n = 10) units in order, out of N (say N = 60) units in the population?
Q3 In SRSWR, what is the probability that any particular unit (say the second unit) is
present in a sample of size n, drawn from a population of N units?
Q4 In SRSWOR, what is the probability that any particular unit (say the second unit) is
present in a sample of size n, drawn from a population of N units?
You can use an Excel function such as RANDBETWEEN(min, max) to generate random numbers
between min and max. A random number table is a sequence of digits generated using a
mechanism as discussed above so that in the long run the table contains all the digits 0,1,2,...,9
in approximately equal proportions, with no trends in the pattern in which the digits are
generated. Thus if a digit is drawn at random from the random number table the chance of
getting any digit is 1/10.
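As a rough Python counterpart to the spreadsheet function, the sketch below draws one sample with replacement (SRSWR) and one without replacement (SRSWOR). The frame of labels 1 to N is hypothetical, the sizes N = 60 and n = 10 are borrowed from Q1 and Q2, and the seed is fixed only so that the run can be repeated.

```python
import random

rng = random.Random(7)           # fixed seed, only for repeatability
N, n = 60, 10                    # population and sample sizes as in Q1-Q2
frame = list(range(1, N + 1))    # hypothetical frame: units labelled 1..N

# One random label between 1 and N, both endpoints included.
one_label = rng.randint(1, N)

# SRSWR: every draw is made from the full frame, so repeats are possible.
srswr = [rng.randint(1, N) for _ in range(n)]

# SRSWOR: a drawn unit is not returned, so all n labels are distinct.
srswor = rng.sample(frame, n)

print("SRSWR :", srswr)
print("SRSWOR:", srswor)
```

Here `random.randint` includes both endpoints, and `random.sample` never returns the same unit twice, which is what distinguishes the two schemes.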
For example consider the problem of drawing a random sample of customers of Flipkart from
its customer base. Flipkart’s customer base is not only very large but dynamic too. For all
practical purposes the population could be approximately considered as an infinite population.
If Flipkart selects a sample of customers by picking a purchaser every 5 seconds during a
grand sale lasting 36 hours, the customers in the sample could be dependent, because
customers buying during the same sale could exhibit similar buying behaviour. One should take
care to avoid dependence whenever domain knowledge suggests that it is present.
Question: Does it intuitively make sense to assume that sampling from an infinite population
is equivalent to sampling from a finite population with replacement?
Sample Mean $\bar{X} = n^{-1} \sum_{i=1}^{n} X_i$ under Finite Population
$$E(\bar{X}) = \mu; \qquad SD(\bar{X}) = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}} \quad \text{(SRSWOR)}$$
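To get a feel for the size of the correction factor, the SRSWOR standard deviation can be evaluated for a few population sizes. This is only a sketch: the function name `sd_sample_mean` and the values of `sigma`, `n` and `N` are hypothetical choices made for illustration.

```python
import math

def sd_sample_mean(sigma, n, N=None):
    """SD of the sample mean: sigma / sqrt(n), multiplied by the
    finite population correction sqrt((N - n) / (N - 1)) when N is given."""
    sd = sigma / math.sqrt(n)
    if N is not None:
        sd *= math.sqrt((N - n) / (N - 1))
    return sd

sigma, n = 10.0, 100             # hypothetical population SD and sample size
for N in (200, 2_000, 200_000):  # hypothetical population sizes
    f = n / N                    # sampling fraction
    print(f"N={N:>7}  f={f:.4f}  SD={sd_sample_mean(sigma, n, N):.4f}")
print("uncorrected sigma/sqrt(n):", sd_sample_mean(sigma, n))
```

As N grows with n held fixed, the corrected value approaches the uncorrected σ/√n.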
1. Notice that for SRSWR and SRSWOR the sample mean and the sample proportion are
random variables and hence they have a probability distribution.
2. Do you notice that the accuracy of the sample mean does not depend on the population size
if the sampling fraction is very small? Usually, if it is less than 0.05, the finite population
correction (fpc) $\sqrt{(N-n)/(N-1)}$ is taken as 1. Do you think the above fact to be
counter-intuitive?
3. As an implication of the above formulas one could very nicely interpret the impact of
sample size, of population heterogeneity and the role of sampling fraction f = n/N on
accuracy of sample mean as an estimator of population mean. Please explain.
4. For SRSWR what alterations would be required to the formula?
1. As an implication of the above formulas one could very nicely interpret the impact of
sample size, of population heterogeneity and the role of sampling fraction f = n/N on
accuracy of sample proportion as an estimator of population proportion. Please explain.
2. For SRSWR what alterations would be required to the formula?
5.3 Sampling from an Arbitrary Population
Example: As a promotion strategy of its brand a cell phone company decides to offer a
discount of either Rs. 5000 or Rs. 3000 or Rs. 2000 to the first 10000 customers e-ordering a
particular model on its website. The price of the phone is Rs. 10000. As soon as a customer
places an order the discount amount will be flashed and will be deducted from the price. To
decide on the discount to be offered to a customer the company decides to use the following
random mechanism. When an order is placed, a digit from 0 to 9 will be selected at
random. If the chosen random digit is either 0 or 1, the offered discount will be Rs. 5000; if
it is between 2 and 4, it will be Rs. 3000; and otherwise it will be Rs. 2000.
1. Suppose a customer (among the first 10000) places an order for a phone. Find the
probability distribution of the price of the phone for the customer and also find its mean
and standard deviation.
2. Suppose a customer (among the first 9000) decides to place an order for two such phones.
Find the probability distribution of the total (average) price of the two phones for the
customer. Also find its mean and standard deviation.
3. If a customer (a local shop owner, among the first 5000 customers) places an order for
40 cell phones, find the mean and standard deviation of the total (average) price
of the phones. Find an approximation to its probability distribution and then find the
probability that the average price is (i) less than or equal to Rs. 6000, (ii) more than Rs. 7000
and (iii) between Rs. 6000 and Rs. 8000.
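One way to check a hand calculation for part 1 is to enumerate the ten equally likely digits. The sketch below assumes that "between 2 and 4" includes both 2 and 4. The names `discount_for_digit` and `price_dist` are introduced here only for illustration.

```python
import math
from fractions import Fraction

LIST_PRICE = 10000

def discount_for_digit(d):
    """Discount rule from the example (2 and 4 assumed inclusive)."""
    if d in (0, 1):
        return 5000
    if 2 <= d <= 4:
        return 3000
    return 2000

# Each digit 0..9 has probability 1/10; add up probability by final price.
price_dist = {}
for d in range(10):
    price = LIST_PRICE - discount_for_digit(d)
    price_dist[price] = price_dist.get(price, Fraction(0)) + Fraction(1, 10)

mean_price = sum(x * p for x, p in price_dist.items())
var_price = sum((x - mean_price) ** 2 * p for x, p in price_dist.items())
sd_price = math.sqrt(var_price)

for x in sorted(price_dist):
    print(f"P(price = {x}) = {price_dist[x]}")
print("mean =", mean_price, " SD =", round(sd_price, 2))
```

Exact fractions are used so that the probabilities are not affected by floating point rounding. The same enumeration over pairs of digits would give the distribution for two phones, provided the two digits are drawn independently.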
How do we find an approximation to the probability distribution of the average price for the
cell phones ordered by the shop owner? The next result provides a surprising answer.
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \text{ is approximately } N(0, 1).$$
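Applied to the shop owner's order of 40 phones, the approximation can be sketched as follows. The sketch assumes that the 40 discounts are drawn independently, that "between 2 and 4" is inclusive so the three prices have probabilities 0.2, 0.3 and 0.5, and that no continuity correction is applied. The helper `normal_cdf` is written here with `math.erf` and is not part of the original notes.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Price distribution for one phone, read off the discount rule in the example.
prices = {5000: 0.2, 7000: 0.3, 8000: 0.5}
mu = sum(x * p for x, p in prices.items())
sigma = math.sqrt(sum((x - mu) ** 2 * p for x, p in prices.items()))

n = 40
se = sigma / math.sqrt(n)        # SD of the average price of n phones

def prob_mean_at_most(x):
    """Approximate P(average price <= x) using Z = (x - mu) / se."""
    return normal_cdf((x - mu) / se)

p_le_6000 = prob_mean_at_most(6000)
p_gt_7000 = 1.0 - prob_mean_at_most(7000)
p_6000_to_8000 = prob_mean_at_most(8000) - prob_mean_at_most(6000)

print(f"mu = {mu:.0f}, se = {se:.2f}")
print(f"P(avg <= 6000)         ~ {p_le_6000:.3g}")
print(f"P(avg > 7000)          ~ {p_gt_7000:.3f}")
print(f"P(6000 <= avg <= 8000) ~ {p_6000_to_8000:.3f}")
```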
manufacturing process are set at 400 mg and 0.5 mg. The production supervisor knows from
his experience that the standard deviation of the process rarely changes. However, he feels
that continuous monitoring of the process is necessary for checking the stability of the mean
of the process. A consultant suggested that he implement the following procedure. In every
hour during a shift, a sample of 100 capsules is to be selected, and if the average content of the
sample falls below 399.90 or above 400.10, stop the process and hunt for the trouble.
1. What is the probability of a false alarm if this procedure is followed?
2. Assuming that the mean has actually shifted to 400.1, what is the probability that the
shift will be detected using such a sample?
3. What is the probability that the change in mean will remain undetected after two such
samples are inspected since the beginning of morning shift?
4. What is the probability that it remains undetected in the first two and gets detected at
the inspection of the third sample?
5. Suppose the process produces 10000 capsules per hour. What is the expected number of
capsules produced that will violate the norm of the regulatory authority till the change
in mean is detected in case of 3 above?
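Questions 1 to 4 can be explored numerically along the following lines. The sketch assumes that 400 mg and 0.5 mg are the process mean and standard deviation (the sentence introducing them is incomplete here), that the mean of 100 capsules is approximately normal, and that successive hourly samples are independent. The helpers `normal_cdf` and `prob_alarm` are names chosen for this illustration.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

sigma, n = 0.5, 100              # assumed process SD (mg) and sample size
lower, upper = 399.90, 400.10    # stop the process outside these limits
se = sigma / math.sqrt(n)        # SD of the sample mean

def prob_alarm(true_mean):
    """P(sample mean < lower or sample mean > upper) for a given process mean."""
    below = normal_cdf((lower - true_mean) / se)
    above = 1.0 - normal_cdf((upper - true_mean) / se)
    return below + above

false_alarm = prob_alarm(400.0)  # Q1: mean still on target
detect = prob_alarm(400.1)       # Q2: mean has shifted to 400.1
miss_two = (1.0 - detect) ** 2   # Q3: undetected in two independent samples
miss_two_then_detect = miss_two * detect   # Q4: detected at the third sample

print(f"false alarm            ~ {false_alarm:.4f}")
print(f"detection after shift  ~ {detect:.4f}")
print(f"undetected twice       ~ {miss_two:.4f}")
print(f"detected at 3rd sample ~ {miss_two_then_detect:.4f}")
```

Question 5 is left out of the sketch because it depends on the regulatory norm, which is not stated in this part of the notes.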
Application 2 in Statistical Quality Control: For assessing the quality of lots sent by
vendors, the quality control departments usually devise sampling inspection plans for taking
a decision on whether to accept or reject a lot. Suppose the lot size is 100 (N ), then the
sampling inspection plan specifies a sample size, say, 10 (n) that needs to be selected from the
lot without replacement and if the sample contains more than, say, 1 (c) defective items (again
the number specified by the sampling plan) the decision would be to reject the lot, otherwise
accept it. Sampling is often the only option if the testing is destructive in nature.
For designing sampling inspection plans the interests of both the consumer and the vendor
are to be protected. Since the decision to accept or reject a lot is taken on the basis of a
sample, there is a chance that even if the lot quality is good (bad) the lot may get rejected
(accepted). The vendor, to protect himself from rejection of good lots, imposes a condition
like: ‘if a lot has 5% (p1) defective items, the chance of rejecting such a lot should not exceed
10%’ (V Risk). Let us call it the vendor’s risk. On the other hand, for reducing the chance of
accepting a bad lot, the consumer imposes a condition like: ‘the chance of accepting a lot with
10% (p2) defective items should not exceed 10%’ (C Risk). Let us call this the consumer’s risk.
Let us now consider a problem just to illustrate the above. Suppose N = 20; n = 5; c = 0;
p1 = 5%; V Risk = 10%; p2 = 10%; C Risk = 10%.
1. Does this sampling plan fulfill Vendor’s risk?
2. Does this sampling plan fulfill Consumer’s risk?
3. If actual number of defectives in the lot is 4, what is the chance of accepting such a lot?
4. Repeat 1–2 for N = 10^3; n = 20; c = 2; p1 = 5%; V Risk = 10%; p2 = 10%; C Risk = 10%
5. Repeat 3 with actual number of defectives 40.
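Since the sample is drawn without replacement, the number of defectives in it follows a hypergeometric distribution, and the acceptance probability can be computed directly. The sketch below covers the first plan (N = 20, n = 5, c = 0). The function name `prob_accept` is hypothetical, and converting 5% and 10% of 20 into 1 and 2 defective items is an interpretation of the percentages.

```python
from math import comb

def prob_accept(N, n, c, D):
    """P(at most c defectives in a sample of n drawn without replacement
    from a lot of N items containing D defectives)."""
    total = comb(N, n)
    ways = sum(comb(D, k) * comb(N - D, n - k) for k in range(min(c, D, n) + 1))
    return ways / total

N, n, c = 20, 5, 0

# 5% of 20 items is 1 defective; 10% of 20 items is 2 defectives.
p_reject_at_p1 = 1 - prob_accept(N, n, c, D=1)   # compare with V Risk = 10%
p_accept_at_p2 = prob_accept(N, n, c, D=2)       # compare with C Risk = 10%
p_accept_4_def = prob_accept(N, n, c, D=4)       # lot with 4 defectives

print(f"P(reject | 1 defective)  = {p_reject_at_p1:.4f}")
print(f"P(accept | 2 defectives) = {p_accept_at_p2:.4f}")
print(f"P(accept | 4 defectives) = {p_accept_4_def:.4f}")
```

The same function can be called with other values of N, n, c and D to study a different plan.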
enough. When sample sizes are small, we must assume normality for all practical purposes.
Note that there are procedures to validate the assumption of normality, to be taken up in due
course. The following result helps to tackle problems when the sample size is not so large.