Weighted finite population sampling to maximize entropy
Biometrika (1994), 81, 3, pp. 457-69
Printed in Great Britain
We propose and study a simple way to do this that appears to have been overlooked in the sample survey literature.
We find it convenient to use indicator variables to represent samples. Thus the random sample may be denoted by X, where
$$ X = (X_1, \ldots, X_N) $$
and the random variable X_i takes the value 1 or 0 according as the ith unit is in or out of the sample, for i = 1, ..., N. Let
$$ D_n = \{ x = (x_1, \ldots, x_N) : x_i = 0 \text{ or } 1, \text{ and } x_1 + \cdots + x_N = n \}. $$
The random vector X takes values in D_n. We denote the probability density function of a typical sampling scheme by p(x) for any vector x ∈ D_n, where p(x) > 0 and ∑_{x∈D_n} p(x) = 1. The associated probability that the sample includes the ith unit is
$$ \pi_i = \operatorname{pr}(X_i = 1) = \sum_{x \in D_n} x_i\, p(x) \qquad (i = 1, \ldots, N). \tag{2} $$
It is obvious that rescaling the w_i by a positive constant multiplier determines the same p(x) but that modulo rescaling the p(x) corresponding to distinct w are distinct. It is less obvious that the coverage probabilities determined by (2) are in one-one correspondence with the weights w_i modulo scaling, but we show in § 2 that this correspondence is a direct consequence of standard exponential family theory. The notation w_i = e^{θ_i} makes the exponential family connection obvious because (3) can be rewritten as
$$ p(x) = \prod_{i=1}^{N} w_i^{x_i} \Big/ \sum_{y \in D_n} \prod_{i=1}^{N} w_i^{y_i} \;\propto\; \exp\Big( \sum_{i=1}^{N} \theta_i x_i \Big) \qquad (x \in D_n). \tag{4} $$
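As a concrete illustration of (3) and (4), the following Python sketch (ours, not part of the paper) enumerates D_n for a tiny population and evaluates p(x) ∝ ∏_i w_i^{x_i}; the weights used are arbitrary illustrative values. It also checks numerically that rescaling the weights leaves p(x) unchanged.

```python
# Illustrative sketch only: enumerate D_n and evaluate the maximum entropy
# density p(x) proportional to prod_{i in sample} w_i.  Weights are made up.
from itertools import combinations
import math

def max_entropy_probs(w, n):
    """Return {sample: p(sample)} over all samples of size n from {0,...,N-1}."""
    samples = list(combinations(range(len(w)), n))
    raw = {s: math.prod(w[i] for i in s) for s in samples}
    total = sum(raw.values())              # the normalising constant R(n, S)
    return {s: v / total for s, v in raw.items()}

if __name__ == "__main__":
    w = [0.5, 1.0, 2.0, 4.0]               # illustrative weights
    p = max_entropy_probs(w, n=2)
    print(round(sum(p.values()), 12))      # 1.0: p is a density on D_n
    p2 = max_entropy_probs([3.0 * x for x in w], n=2)
    print(max(abs(p[s] - p2[s]) for s in p))   # ~0: rescaling w leaves p unchanged
```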
The relation between w and π is a special case of that between natural and mean-value parameterizations for an exponential family. The following result can be proved by using Theorem 3.6 of Brown (1986, p. 74).

THEOREM 1. For any vector π satisfying (1), there exists a vector w for the maximum entropy model subject to the constraint (2), and w is unique up to rescaling.
To compute w from π, we recast (2) in the form of a set of equations (5) below, and solve these iteratively as in (7). Throughout the paper we use the following notation: S = {1, ..., N}, capital letters such as A, B or C for subsets of S, A^c = S\A for the complement of A in S, and |A| for the number of elements of A. Also
$$ R(k, C) = \sum_{B \subset C,\; |B| = k} \; \prod_{i \in B} w_i $$
for any nonempty set C ⊂ S and 1 ≤ k ≤ |C|, R(0, C) = 1 and R(k, C) = 0 for any k > |C|.
The following Proposition 1 follows immediately from the definitions.

PROPOSITION 1. For any nonempty set C ⊂ S and 1 ≤ k ≤ |C|:
(a) ∑_{j∈C} w_j R(k − 1, C\{j}) = k R(k, C),
(b) ∑_{j∈C} R(k, C\{j}) = (|C| − k) R(k, C),
(c) ∑_{i=0}^{k} R(i, C) R(k − i, C^c) = R(k, S).
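For computation, R(k, C) can be evaluated by adding one unit of C at a time, using R(k, C ∪ {j}) = R(k, C) + w_j R(k − 1, C). The Python sketch below (ours, with arbitrary illustrative weights) does this and numerically checks identities (a)-(c).

```python
# Sketch only: evaluate R(k, C) = sum over k-subsets B of C of prod_{i in B} w_i
# by the one-unit-at-a-time recursion, then check Proposition 1 numerically.
def R(k, C, w):
    if k < 0 or k > len(C):
        return 0.0
    r = [1.0] + [0.0] * k                  # r[m] = R(m, units processed so far)
    for j in C:
        for m in range(k, 0, -1):          # descend so each unit enters once
            r[m] += w[j] * r[m - 1]
    return r[k]

if __name__ == "__main__":
    w = {1: 0.3, 2: 0.7, 3: 1.1, 4: 2.0, 5: 0.5}   # illustrative weights
    S = set(w)
    C, k = {1, 2, 3, 4}, 2
    # (a) sum_j w_j R(k-1, C\{j}) = k R(k, C)
    print(sum(w[j] * R(k - 1, C - {j}, w) for j in C), k * R(k, C, w))
    # (b) sum_j R(k, C\{j}) = (|C| - k) R(k, C)
    print(sum(R(k, C - {j}, w) for j in C), (len(C) - k) * R(k, C, w))
    # (c) sum_i R(i, C) R(k-i, C^c) = R(k, S)
    print(sum(R(i, C, w) * R(k - i, S - C, w) for i in range(k + 1)), R(k, S, w))
```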
Using this notation, we may rewrite (2) as
$$ \pi_i = \frac{w_i\, R(n-1, \{i\}^c)}{R(n, S)} \qquad (i = 1, \ldots, N). \tag{5} $$
Note that an application of (a) in Proposition 1 proves that the right-hand sides of (5) sum to n, consistent with the sum of the π_i as in (1). Thus for fixed n, there are N − 1 linearly independent relations among the N relations of (5). Without loss of generality, we assume that π_1 ≤ π_2 ≤ ... ≤ π_N and let w_N = π_N. Dividing each of the first N − 1 equations of (5) by the Nth equation on both sides and rearranging the terms for each equation, we get the following set of restricted equations,
$$ w_i = \frac{\pi_i\, R(n-1, \{N\}^c)}{R(n-1, \{i\}^c)} \quad (i = 1, \ldots, N-1), \qquad w_N = \pi_N. \tag{6} $$
Although a closed-form solution to (6) seems impossible, the equations can be solved as a fixed-point problem by using an iterative procedure. Specifically, the following updating scheme provides a solution of (6):
$$ w_i^{(t+1)} = \frac{\pi_i\, R(n-1, \{N\}^c)^{(t)}}{R(n-1, \{i\}^c)^{(t)}} \quad (i = 1, \ldots, N-1), \qquad w_N^{(t)} = \pi_N, \tag{7} $$
where R(·, ·)^{(t)} denotes the R function evaluated at the current weights w^{(t)}.
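The following Python sketch (ours) implements the iteration as read from (6) and (7): the R functions are re-evaluated at the current weights at each step, w_N is held at π_N, and the π_i are assumed to be in ascending order with sum n; the stopping rule based on successive relative change is our own simplification.

```python
# Sketch of the fixed-point iteration for w given pi (assumptions as stated
# in the lead-in; not the authors' code).
def R(k, C, w):
    if k < 0 or k > len(C):
        return 0.0
    r = [1.0] + [0.0] * k
    for j in C:
        for m in range(k, 0, -1):
            r[m] += w[j] * r[m - 1]
    return r[k]

def weights_from_pi(pi, n, tol=1e-10, max_iter=2000):
    """Iterate w_i <- pi_i R(n-1,{N}^c)/R(n-1,{i}^c), starting from w = pi."""
    N = len(pi)
    S = set(range(N))
    w = dict(enumerate(pi))                 # starting point w^(0) = pi
    for _ in range(max_iter):
        top = R(n - 1, S - {N - 1}, w)      # R(n-1, {N}^c) at current weights
        new = {i: pi[i] * top / R(n - 1, S - {i}, w) for i in range(N - 1)}
        new[N - 1] = pi[N - 1]              # w_N is held fixed at pi_N
        if max(abs(new[i] / w[i] - 1.0) for i in range(N)) < tol:
            return new
        w = new
    return w

if __name__ == "__main__":
    pi = [0.1, 0.4, 0.7, 0.8]               # the coverage probabilities of Example 2
    w = weights_from_pi(pi, n=2)
    print([round(w[i], 5) for i in range(len(pi))])
```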
In this section, we discuss procedures for drawing a sample from the maximum entropy model. Given the w_i, there is an explicit formula for each possible sample in D_n, but the large number, N!/{(N − n)! n!}, of choices renders impractical a naive approach of a single multinomial draw from N!/{(N − n)! n!} cells. Therefore, we consider draw-by-draw selection procedures, where we draw one unit at a time until n units are obtained.
We call a selection procedure 'forward' if it selects n units from the population as the sample, or 'backward' if it removes N − n units from the population and takes the remaining n units as the sample. Noticing that, for any x ∈ D_n,
$$ p(x) = \prod_{i \in A_x} w_i \Big/ R(n, S) \;\propto\; \prod_{i \in A_x^c} w_i^{-1}, $$
where A_x = {i : x_i = 1}, it is obvious that for any 'forward' procedure, there is a corresponding 'backward' procedure, which selects unsampled units in the same way using w_i^{-1} instead of w_i.
We also distinguish among procedures by whether or not a procedure requires n fixed
in advance. In the context of sample surveys, the π_i are usually prespecified. Thus the
sample size n and the wi are fixed in advance. However, it is possible in some applications
that the wi are prespecified and different sample sizes are to be experimented with. In this
case, a selection procedure that does not depend on n is desired.
The output of a draw-by-draw procedure is represented by A_1, A_2, ..., A_n, where A_0 = ∅ and A_k ⊂ S denotes the set of selected indices after k draws. The following are two 'forward' procedures, one for fixed n and the other for nonfixed n. The 'backward' versions of these procedures can be defined accordingly.
Procedure 1 (forward, n fixed). At the kth draw (k = 1, ..., n), a unit j ∈ A_{k−1}^c is selected with probability
$$ P_1(j, A_{k-1}^c) = \frac{w_j\, R(n-k, A_{k-1}^c \setminus \{j\})}{(n-k+1)\, R(n-k+1, A_{k-1}^c)}. $$
Using the relation w_i = p_i/(1 − p_i), that is p_i = w_i/(1 + w_i), the function R can be written as
$$ R(k, C) = \operatorname{pr}\Big( \sum_{i \in C} Z_i = k \Big) \Big/ \prod_{i \in C} (1 - p_i), $$
where the Z_i are independent Bernoulli variables with pr(Z_i = 1) = p_i, for any nonempty set C ⊂ S and 0 ≤ k ≤ |C|. Thus P_1 has the interpretation
$$ \prod_{k=1}^{n} P_1(i_k, A_{k-1}^c) = \prod_{k=1}^{n} \frac{w_{i_k}\, R(n-k, A_{k-1}^c \setminus \{i_k\})}{(n-k+1)\, R(n-k+1, A_{k-1}^c)} = \frac{\big( \prod_{k=1}^{n} w_{i_k} \big) R(0, A_n^c)}{n!\, R(n, S)} = \frac{1}{n!}\, \operatorname{pr}(X_t = 1,\ t \in A_n). $$
Since there are n! different ways of ordering the indices i_1, ..., i_n, the probability of obtaining the units i_1, ..., i_n without regard to order is exactly pr(X_t = 1, t ∈ A_n).
Suppose that π_{i_1⋯i_k} is the kth-order inclusion probability for the units i_1, ..., i_k to be in a sample of size n from the maximum entropy model. Then a property of Procedure 1 is that
$$ \operatorname{pr}(A_k = \{ i_1, \ldots, i_k \}) = \pi_{i_1 \cdots i_k} \Big/ \binom{n}{k}. \tag{8} $$
Although P_1 can be calculated directly from R functions, the computation can be much simplified by noticing that P_1(j, S) = π_j/n and using the following formula recursively for the consecutive draws.

LEMMA 2. For any 1 ≤ k ≤ n − 1 and j ∈ A_k^c,
Procedure 1 can be realized using the following algorithm, which requires O(nN)
operations.
1. For j = 1, ..., N, calculate P_1(j, S), which is given by π_j/n. Then draw unit i_1 according to the probability P_1(i_1, S).
2. If n > 1, then A_0 ← ∅, A_1 ← {i_1}, k ← 2, go to 3; otherwise stop.
3. For all j ∈ A_{k−1}^c, calculate P_1(j, A_{k−1}^c) from P_1(j, A_{k−2}^c) and P_1(i_{k−1}, A_{k−2}^c) using (10). Then draw unit i_k according to the probability P_1(i_k, A_{k−1}^c).
4. If k < n, then A_k ← A_{k−1} ∪ {i_k}, k ← k + 1, go to 3; otherwise stop.
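For illustration, the Python sketch below (ours) draws a fixed-size sample in the spirit of Procedure 1, taking the kth-draw probabilities proportional to w_j R(n − k, A_{k−1}^c\{j}) as in the display above; it recomputes the R functions from scratch at every draw instead of using the recursion (10) of Lemma 2, so it trades efficiency for simplicity.

```python
# Sketch only: draw-by-draw 'forward' sampler for fixed n; R recomputed at
# each draw rather than updated recursively.
import random

def R(k, C, w):
    if k < 0 or k > len(C):
        return 0.0
    r = [1.0] + [0.0] * k
    for j in C:
        for m in range(k, 0, -1):
            r[m] += w[j] * r[m - 1]
    return r[k]

def draw_sample(w, n, rng=random):
    """Return a set of n unit labels drawn from the maximum entropy model."""
    remaining = set(w)                       # A_{k-1}^c, units not yet selected
    sample = set()                           # A_{k-1}
    for k in range(1, n + 1):
        units = sorted(remaining)
        weights = [w[j] * R(n - k, remaining - {j}, w) for j in units]
        j = rng.choices(units, weights=weights)[0]
        sample.add(j)
        remaining.remove(j)
    return sample

if __name__ == "__main__":
    w = {1: 0.05193, 2: 0.2355, 3: 0.52, 4: 0.8}   # the weights of Example 2
    counts = {}
    for _ in range(20000):
        s = frozenset(draw_sample(w, 2))
        counts[s] = counts.get(s, 0) + 1
    # empirical pair frequencies; these should approximate w_i w_j / R(2, S)
    print({tuple(sorted(s)): round(c / 20000, 3) for s, c in counts.items()})
```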
Procedure 2 (forward, n nonfixed). At the kth draw (k = 1, ..., n), a unit j ∈ A_{k−1}^c is selected with probability
$$ P_2(j, A_{k-1}^c) = w_j \sum_{i=0}^{k-1} \frac{R(i, A_{k-1})\, R(k-1-i, A_{k-1}^c \setminus \{j\})}{(k-i)\, R(k, S)}. $$
By (a) and (c) in Proposition 1,
$$ \sum_{j \in A_{k-1}^c} P_2(j, A_{k-1}^c) = \sum_{i=0}^{k-1} \frac{R(i, A_{k-1})}{(k-i)\, R(k, S)} \Big\{ \sum_{j \in A_{k-1}^c} w_j R(k-i-1, A_{k-1}^c \setminus \{j\}) \Big\} = \sum_{i=0}^{k-1} \frac{R(i, A_{k-1})\, R(k-i, A_{k-1}^c)}{R(k, S)} = 1. $$
Thus P_2(·, A_{k−1}^c) is a probability density on A_{k−1}^c. Now we show by induction that a random sample selected by Procedure 2 is a sample from the maximum entropy model.
Let γ_k be the random index of the unit selected at the kth draw. Assuming
$$ \operatorname{pr}(A_{k-1} = A) = \frac{R(k-1, A)}{R(k-1, S)} $$
for any A ⊂ S with |A| = k − 1, which is true for k = 2 by the definition of P_2, we show that pr(A_k = B) = R(k, B)/R(k, S) for any B ⊂ S with |B| = k. By (b) and (c) in Proposition 1,
where π_ij is the second-order inclusion probability for both the units i and j to be in the sample. An alternative expression that is valid only when n is fixed is derived by Yates & Grundy (1953):
$$ V(\hat{Y}) = \sum_{i < j} (\pi_i \pi_j - \pi_{ij}) \Big( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \Big)^2. $$
Yates & Grundy (1953) also suggest the following estimator of V(Ŷ), which is unbiased if π_ij ≠ 0 for all pairs i ≠ j, and which is for use when n is fixed:
$$ v(\hat{Y}) = \sum_{i < j} \frac{\pi_i \pi_j - \pi_{ij}}{\pi_{ij}} \Big( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \Big)^2, $$
where the sum is over all pairs of sampled units.
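In code, these standard formulas read as follows (a minimal sketch; the function names and the dictionary-based inputs y, pi and pi2 are our own illustration, not notation from the paper).

```python
# Sketch of the Horvitz-Thompson estimator and the Yates & Grundy variance
# estimator, for a fixed-size design with supplied inclusion probabilities.
def horvitz_thompson(y, pi, sample):
    """HT estimator of the population total: sum over sampled i of y_i / pi_i."""
    return sum(y[i] / pi[i] for i in sample)

def yates_grundy(y, pi, pi2, sample):
    """v(Y_hat): sum over sampled pairs of (pi_i pi_j - pi_ij)/pi_ij (y_i/pi_i - y_j/pi_j)^2."""
    s = sorted(sample)
    v = 0.0
    for a in range(len(s)):
        for b in range(a + 1, len(s)):
            i, j = s[a], s[b]
            d = y[i] / pi[i] - y[j] / pi[j]
            v += (pi[i] * pi[j] - pi2[(i, j)]) / pi2[(i, j)] * d * d
    return v
```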
The inequality
$$ \pi_{ij} \leq \pi_i \pi_j \qquad (i \neq j) \tag{13} $$
is a sufficient condition for the Yates & Grundy variance estimator v(Ŷ) to be always nonnegative. By some algebra, it can be shown that (13) is a sufficient condition for a sampling procedure without replacement to be more efficient, i.e. to have smaller variance, than sampling with replacement. Few plans have this nice property, especially when n > 2. For example, the procedure by Hartley & Rao (1962) satisfies (13) only when n = 2. Furthermore, for v(Ŷ) to be unbiased and definable for all samples, the condition that π_ij > 0 for any i ≠ j is also desirable.
Property 3. A direct consequence of applying Property 2 to (11) and (12) is as follows:
$$ \pi_i \propto y_i \iff V(\hat{Y}) = 0 \iff v(\hat{Y}) = 0 \tag{14} $$
for all possible samples.
For an arbitrary unequal-probability without-replacement sampling scheme, π_i ∝ y_i is only sufficient for V(Ŷ) = 0 and v(Ŷ) = 0. Therefore, the resulting estimator of the variance of Ŷ, v(Ŷ), may be zero even when Ŷ is clearly not a constant. The result in (14) implies, however, that the maximum entropy model displays the desirable property that V(Ŷ) or v(Ŷ) is zero if and only if the π_i are proportional to the y_i.
Property 4 (rotatability). In continuing surveys conducted at regular intervals, or in multi-stage sampling, it is often desired that the sampling units should not stay in the survey
indefinitely, but rather that old units that have been in the survey for a specified period
should be rotated, i.e. replaced, by new units. It is also desirable that the new units are
selected in such a way that the new sample follows the same probability distribution as
the old one. We find that doing rotations under the maximum entropy model is especially
convenient. To see this, we present two schemes. Let A denote the set of the indices in the
current sample.
SCHEME 1.
Step 1. Draw a unit i ∈ A with uniform probability, and draw a unit j ∈ A^c with uniform probability.
Step 2. Replace the unit i by the unit j with the acceptance probability
$$ \min\Big( 1, \frac{w_j}{w_i} \Big). $$
Step 3. If the replacement is accepted, take A ∪ {j}\{i} as the new sample and stop; otherwise go back to Step 1.
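A minimal Python sketch of one rotation by Scheme 1, assuming the min(1, w_j/w_i) acceptance probability given above (which is the ratio p(A ∪ {j}\{i})/p(A) under (3)); this is an illustration, not the authors' code.

```python
# Sketch only: one accepted rotation step of Scheme 1.
import random

def rotate_once(sample, w, rng=random):
    """Repeat Steps 1-3 until a replacement is accepted; return the new sample."""
    A = set(sample)
    inside = sorted(A)
    complement = sorted(set(w) - A)
    while True:
        i = rng.choice(inside)                       # Step 1: uniform over A
        j = rng.choice(complement)                   #         uniform over A^c
        if rng.random() < min(1.0, w[j] / w[i]):     # Step 2: Metropolis acceptance
            return (A - {i}) | {j}                   # Step 3: accepted swap

if __name__ == "__main__":
    w = {1: 0.05193, 2: 0.2355, 3: 0.52, 4: 0.8}
    print(rotate_once({3, 4}, w))
```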
Scheme 1 is in fact a Metropolis–Hastings-type algorithm (Smith & Roberts, 1993). When all the weights are equal, this algorithm reduces to the celebrated Bernoulli–Laplace model. It can be shown that the acceptance rate of Scheme 1 is
$$ \operatorname{pr}(\text{accept}) = \frac{2 \sum_{i=1}^{N} (N - i)\, \pi_i - n(n-1)}{n(N-n)}, $$
where the π_i are assumed to be in ascending order. Two extreme cases: pr(accept) = 1 when the π_i are all equal, and pr(accept) = 0 when the largest n of the π_i are 1 and the rest are 0.
SCHEME 2.
Step 1. Select a unit i ∈ A^c using Procedure 2 and then select a unit j ∈ A ∪ {i} using Procedure 2′, where Procedure 2 is described in § 3 and Procedure 2′ is the 'backward' version of Procedure 2.
Step 2. If i ≠ j, take A ∪ {i}\{j} as the new sample and stop; otherwise go back to Step 1.
In Scheme 2, we use Procedure 2 to add a unit to the current sample A and then use Procedure 2′ to remove a unit from A plus this new unit. If the unit added and the unit removed are different, we accept the final sample as the new sample; otherwise, we repeat the procedure until a sample different from A is obtained. Procedure 1 cannot be used in place of Procedure 2 for adding units to the current sample because P_1(·, A^c) is not defined when |A| = n. There is no simple form for the acceptance rate of Scheme 2, though we know that, in the same extreme cases as for Scheme 1 above, it is (N − n)/(N − n + 1) and 0, respectively.
Example 1. We use the following simulation to illustrate the properties of the mapping between w and π described in Lemma 1. Let N = 100. A vector π* is generated uniformly from the simplex
and its corresponding w* is found from (6) via the iterative procedure described in (7).
For convenience of comparison, this particular w* is used as the weights for five different maximum entropy models with sample size n = 2, 5, 15, 30, 50, respectively. The coverage probabilities π for each of these five models are obtained by using (5). Then by using (7), each of these five π's is converted to a w, which should be the same as w* up to a scalar. We use max_i |w_i^{(t)}/w̃_i − 1| < 0.01 as the stopping rule, where w^{(t)} is the value of w at step t and w̃ = w* π_N/w*_N is the fixed point. The results are summarized in Table 1.
The procedure does not take many iterations even when π_N is close to 1 and π_1 is close to 0. Although the value of w_N differs from experiment to experiment, as it is always set to be π_N, the ratios w_N/w_i (i = 1, ..., N − 1) remain nearly the same for all five experiments, which is explained by the fact that w is determined up to rescaling (Theorem 1). The phenomenon that the ratio w_1/π_1 approaches 1 as N/n gets large is supported by (d) of Lemma 1.
Example 2. Let N = 4 and n = 2. The π_i are 0.1, 0.4, 0.7 and 0.8, respectively. The corresponding w_i found from (6) are 0.05193, 0.2355, 0.52 and 0.8, respectively. The second-order inclusion probabilities π_ij are given in Table 2.
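The quantities of this example can be recovered from the weights with a few lines of code. The sketch below (ours) uses (5) for the π_i together with π_ij = w_i w_j R(n − 2, {i, j}^c)/R(n, S), which follows from the definition of the model and is the comparison used in the proof of Property 2; with the weights quoted above, the π_i returned should be close to 0.1, 0.4, 0.7 and 0.8.

```python
# Sketch only: first- and second-order inclusion probabilities from weights.
def R(k, C, w):
    if k < 0 or k > len(C):
        return 0.0
    r = [1.0] + [0.0] * k
    for j in C:
        for m in range(k, 0, -1):
            r[m] += w[j] * r[m - 1]
    return r[k]

def inclusion_probs(w, n):
    S = set(w)
    denom = R(n, S, w)
    pi = {i: w[i] * R(n - 1, S - {i}, w) / denom for i in S}
    pi2 = {(i, j): w[i] * w[j] * R(n - 2, S - {i, j}, w) / denom
           for i in S for j in S if i < j}
    return pi, pi2

if __name__ == "__main__":
    w = {1: 0.05193, 2: 0.2355, 3: 0.52, 4: 0.8}     # weights quoted in Example 2
    pi, pi2 = inclusion_probs(w, 2)
    print({i: round(p, 3) for i, p in pi.items()})
    print({k: round(p, 4) for k, p in pi2.items()})
```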
Partial support was provided by an Army Research Office Grant and a National Science
Foundation Grant to Harvard University.
APPENDIX
Proofs
The proofs presented in the Appendix are only sketches. Detailed proofs can be found in a technical report by X.-H. Chen at the Department of Statistics, Harvard University.
PROPOSITION 2. We have R(k − 2, C) R(k, C) < {R(k − 1, C)}^2 for any C ⊂ S and 2 ≤ k ≤ |C|.

Proof. Each term in R(k − 2, C) R(k, C) has the form w_{i_1}^2 ⋯ w_{i_j}^2 w_{i_{j+1}} ⋯ w_{i_{2k−j−2}}, where 0 ≤ j ≤ k − 2, corresponding to the case when both R(k − 2, C) and R(k, C) choose w_{i_1}, ..., w_{i_j} and, in addition, k − j − 2 of w_{i_{j+1}}, ..., w_{i_{2k−j−2}} are from R(k − 2, C) and the other k − j are from R(k, C). This particular term appears
As N/n → ∞, the right-hand side of (A2) approaches π_i/π_1. Thus, for any pair i ≠ j, we have w_i/w_j → π_i/π_j as N/n → ∞.
PROPOSITION 3. Let 𝒟 be a convex subset of ℝ^m, and g = (g_1, ..., g_m) a function g : ℝ^m → ℝ^m. Given any two vectors x and y in 𝒟, let l(x, y) = {λx + (1 − λ)y : 0 ≤ λ ≤ 1}. If each g_i is
(a) continuous at every point of l(x, y),
(b) differentiable at every interior point of l(x, y),
then
$$ \| g(x) - g(y) \| \leq \sup_{z \in l(x, y)} \| g'(z) \| \, \| x - y \|, $$
where ||·|| denotes the l_∞ norm; i.e. for a vector x, ||x|| = max_i |x_i|, and for a matrix A = (a_ij), ||A|| = max_i (|a_{i1}| + ... + |a_{im}|).

Proof. Use the mean value theorem for multivariate functions and basic calculus.
Proof of Theorem 2. (a) By Theorem 1, there is a unique w̃ = (w̃_1, ..., w̃_N) for the maximum entropy model if we let w̃_N = π_N. By Lemma 1, w̃_i/w̃_N ≤ π_i/π_N. Thus w̃_i ≤ π_i and (6) has exactly one solution in W.
(b) Let x_i = log w_i for each i ∈ S,
$$ g_i(x) = \log \pi_i + \log R(n-1, \{N\}^c) - \log R(n-1, \{i\}^c) $$
for i = 1, ..., N − 1, and g_N(x) = log π_N. For convenience, we consider the logarithmic transformation of (6), which can be written as x̃ = g(x̃), where x̃_i = log w̃_i for each i ∈ S. It can be checked that X and g satisfy all conditions in Proposition 3 and g(X) ⊂ X. It is obvious that ∂g_i(x)/∂x_j = 0 when i = N or j = N. It can be easily checked that
$$ \frac{\partial g_i(x)}{\partial x_j} = \begin{cases} \dfrac{w_i R(n-2, \{N, i\}^c)}{R(n-1, \{N\}^c)}, & \text{if } i = j \neq N; \\[2ex] \dfrac{w_j R(n-2, \{N, j\}^c)}{R(n-1, \{N\}^c)} - \dfrac{w_j R(n-2, \{i, j\}^c)}{R(n-1, \{i\}^c)}, & \text{if } i, j \text{ and } N \text{ are distinct.} \end{cases} $$
Using Proposition 1 and Proposition 2, we can show that
$$ \sum_{j=1}^{N} \Big| \frac{\partial g_i(x)}{\partial x_j} \Big| = \frac{w_N R(n-2, \{i, N\}^c)}{R(n-1, \{i\}^c)} \leq \frac{w_N R(n-1, \{N\}^c)}{R(n, S)}. $$
Using Proposition 2, it can be shown that w_N R(n − 1, {N}^c)/R(n, S) is decreasing in w_i for each i = 1, ..., N − 1. Thus
$$ \sup_{x \in X} \| g'(x) \| = \sup_{x \in X} \Big\{ \max_{1 \leq i \leq N} \sum_{j=1}^{N} \Big| \frac{\partial g_i(x)}{\partial x_j} \Big| \Big\} \leq \sup_{x \in X} \frac{w_N R(n-1, \{N\}^c)}{R(n, S)} = \pi_N. $$
The last step is obtained from the Nth equation of (5), since w̃ is a solution of (5). By Proposition 3,
$$ \| g(x) - g(\tilde{x}) \| \leq \pi_N \| x - \tilde{x} \| \qquad (x \in X). \tag{A3} $$
It is well known that the inequality in (A3) is a sufficient condition for any fixed-point procedure to converge. Thus the sequence x^{(t)} (t = 1, 2, ...) converges to x̃, since x̃ is the unique fixed point in X. Accordingly, the sequence w^{(t)} (t = 1, 2, ...) converges monotonically and geometrically to w̃, and the convergence rate is bounded by π_N. Although we do not know w̃ beforehand, we can always choose π ∈ W as the starting point.
Proof of Theorem 3. It is easy to check that (8) holds for k = 1. For any k ≥ 2, define
Proof of Property 2. Since the w_i are all positive, it is easy to see that π_ij > 0. Comparing π_ij and π_i π_j term by term, it is seen that we only need to show
$$ R(n, S)\, R(n-2, \{i, j\}^c) \leq R(n-1, \{i\}^c)\, R(n-1, \{j\}^c). $$
By combinatorial expansion, it can be shown that
$$ R(n, S)\, R(n-2, \{i, j\}^c) - R(n-1, \{i\}^c)\, R(n-1, \{j\}^c) = R(n-2, \{i, j\}^c)\, R(n, \{i, j\}^c) - \{ R(n-1, \{i, j\}^c) \}^2 \leq 0. $$
The last step is a direct consequence of Proposition 2 and thus completes the proof.
REFERENCES
BROWN, L. D. (1986). Fundamentals of Statistical Exponential Families (with Applications in Statistical Decision Theory). Hayward, CA: Institute of Mathematical Statistics.
CHAUDHURI, A. & VOS, J. W. E. (1988). Unified Theory and Strategies of Survey Sampling. New York: Elsevier Science.
DARROCH, J. N. & RATCLIFF, D. (1972). Generalized iterative scaling for log-linear models. Ann. Math. Statist. 43, 1470-80.
DEMING, W. E. & STEPHAN, F. F. (1940). On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Statist. 11, 427-44.
HANIF, M. & BREWER, K. R. W. (1980). Sampling with unequal probabilities without replacement: A review. Int. Statist. Rev. 48, 317-35.
HANSEN, M. H. & HURWITZ, W. N. (1943). On the theory of sampling from a finite population. Ann. Math. Statist. 14, 333-62.
HARTLEY, H. O. & RAO, J. N. K. (1962). Sampling with unequal probabilities and without replacement. Ann. Math. Statist. 33, 350-74.
HORVITZ, D. G. & THOMPSON, D. J. (1952). A generalization of sampling without replacement from a finite universe. J. Am. Statist. Assoc. 47, 663-85.
JOE, H. (1990). A winning strategy for lotto games? Can. J. Statist. 18, 233-44.
SMITH, A. F. M. & ROBERTS, G. O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. R. Statist. Soc. B 55, 3-23.
STERN, H. & COVER, T. M. (1989). Maximum entropy and the lottery. J. Am. Statist. Assoc. 84, 980-85.
YATES, F. & GRUNDY, P. M. (1953). Selection without replacement from within strata with probability proportional to size. J. R. Statist. Soc. B 15, 253-61.