Biometrika (1994), 81, 3, pp. 457-69
Printed in Great Britain

Weighted finite population sampling to maximize entropy


BY XIANG-HUI CHEN, ARTHUR P. DEMPSTER AND JUN S. LIU

Department of Statistics, Harvard University, Cambridge, Massachusetts 02138, U.S.A.

SUMMARY

Attention is drawn to a method of sampling a finite population of N units with unequal probabilities and without replacement. The method was originally proposed by Stern & Cover (1989) as a model for lotteries. The method can be characterized as maximizing entropy given coverage probabilities $\pi_i$, or equivalently as having the probability of a selected sample proportional to the product of a set of 'weights' $w_i$. We show the essential uniqueness of the $w_i$ given the $\pi_i$, and describe practical, geometrically convergent algorithms for computing the $w_i$ from the $\pi_i$. We present two methods for stepwise selection of sampling units, and corresponding schemes for removal of units that can be used in connection with sample rotation. Inclusion probabilities of any order can be written explicitly in closed form. Second-order inclusion probabilities $\pi_{ij}$ satisfy the condition $0 < \pi_{ij} < \pi_i \pi_j$, which guarantees the Yates & Grundy variance estimator to be unbiased, definable for all samples and always nonnegative for any sample size.
Some key words: Exponential family; Independent Bernoulli trials; Iterative proportional fitting; Maximum
entropy; Rotatability; Sampling with unequal probabilities and without replacement; Survey sampling;
Weighted sampling.

1. INTRODUCTION

Random sampling of n distinct units from a population of N units may be called weighted sampling when the probabilities associated with the $N!/\{(N-n)!\,n!\}$ possible choices are not all equal. A problem often considered in the literature, e.g. Hanif & Brewer (1980), Chaudhuri & Vos (1988, pp. 143-346), is to define a particular weighted sampling scheme subject to prespecification of the marginal probabilities $\pi_i$ that the sample includes the ith population unit, where
$$0 < \pi_i < 1 \quad (i = 1, \ldots, N), \qquad \sum_{i=1}^{N} \pi_i = n. \tag{1}$$

We propose and study a simple way to do this that appears to have been overlooked in the sample survey literature.

We find it convenient to use indicator variables to represent samples. Thus the random sample may be denoted by $X = (X_1, \ldots, X_N)$, where the random variable $X_i$ takes the value 1 or 0 according as the ith unit is in or out of the sample, for $i = 1, \ldots, N$. Let
$$D_n = \{x = (x_1, \ldots, x_N) : x_i = 0 \text{ or } 1, \text{ and } x_1 + \cdots + x_N = n\}.$$
The random vector $X$ takes values in $D_n$. We denote the probability density function of a typical sampling scheme by $p(x)$ for any vector $x \in D_n$, where $p(x) > 0$ and $\sum p(x) = 1$. The associated probability that the sample includes the ith unit is
$$\pi_i = \operatorname{pr}(X_i = 1) = \sum_{x \in D_n : x_i = 1} p(x) \qquad (i = 1, \ldots, N), \tag{2}$$
where the $\pi_i$ satisfy (1).


The particular family of sampling schemes that we propose can be defined in any of three ways that we show to be equivalent.

Method 1. Pick any vector of weights, $w = (w_1, \ldots, w_N)$, where $w_i > 0$ for $i = 1, \ldots, N$, and define
$$p(x) = \prod_{i=1}^{N} w_i^{x_i} \Big/ \sum_{y \in D_n} \prod_{i=1}^{N} w_i^{y_i} \qquad (x \in D_n). \tag{3}$$
It is obvious that rescaling the $w_i$ by a positive constant multiplier determines the same $p(x)$, but that modulo rescaling the $p(x)$ corresponding to distinct $w$ are distinct. It is less obvious that the coverage probabilities determined by (2) are in one-one correspondence with the weights $w_i$ modulo scaling, but we show in § 2 that this correspondence is a direct consequence of standard exponential family theory. The notation $w_i = e^{\theta_i}$ makes the exponential family connection obvious, because (3) can be rewritten as
$$p(x) \propto \exp\Big(\sum_{i=1}^{N} \theta_i x_i\Big)$$
in terms of the parameters $\theta = (\theta_1, \ldots, \theta_N)$, which are determined up to an additive constant.
Method 2. Pick any vector of coverage probabilities $\pi = (\pi_1, \ldots, \pi_N)$, where the $\pi_i$ satisfy (1), and choose $p(x)$ subject to the constraints (2) to maximize the entropy
$$-\sum_{x \in D_n} p(x) \log p(x).$$
It is easily proved that if a weight vector $w$, or its corresponding exponent vector $\theta$, can be determined such that $p(x)$ defined by Method 1 matches the $\pi$ given for Method 2, then this Method 1 choice is the unique maximum entropy scheme proposed as Method 2. A more general form of this result is proved by Darroch & Ratcliff (1972).

This model was first proposed by Stern & Cover (1989), and further generalized by Joe (1990), for determining optimal lottery strategies. We address in § 2 the existence and uniqueness of $w$, modulo scaling, given $\pi$.

The third characterization that we suggest is a trivial variant of Method 1.
Method 3. Pick any vector of probabilities $p = (p_1, \ldots, p_N)$, where $0 < p_i < 1$ for $i = 1, \ldots, N$, and define $Z = (Z_1, \ldots, Z_N)$ to be independent Bernoulli trials with probabilities $p_1, \ldots, p_N$. Then define the sampling distribution of $X$ to be the conditional distribution of $Z$ given $\sum Z_i = n$. It is evident that Method 3 gives the same sampling scheme as Method 1 if and only if the $w_i$ are proportional to $p_i/(1 - p_i)$.
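To illustrate this equivalence numerically, here is a minimal Python sketch (an illustration of ours, not part of the paper; the vectors chosen are arbitrary). It enumerates $D_n$ under Method 1 and compares against rejection sampling under Method 3:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.7])      # illustrative Bernoulli probabilities p_i
w = p / (1 - p)                    # Method 1 weights, w_i = p_i / (1 - p_i)
N, n = len(p), 2

# Method 1: enumerate D_n and compute p(x) proportional to prod_i w_i^{x_i}.
support = list(combinations(range(N), n))
exact = np.array([w[list(s)].prod() for s in support])
exact /= exact.sum()

# Method 3: draw independent Bernoulli Z_i, condition on sum(Z) = n by
# rejection, and compare empirical frequencies with the Method 1 values.
counts = dict.fromkeys(support, 0)
accepted = 0
while accepted < 100_000:
    z = rng.random(N) < p
    if z.sum() == n:
        counts[tuple(int(i) for i in np.flatnonzero(z))] += 1
        accepted += 1

for s, q in zip(support, exact):
    print(s, round(q, 4), round(counts[s] / accepted, 4))  # should agree
```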
Thus the base model for all three methods described above is
$$p(x) = \prod_{i=1}^{N} w_i^{x_i} \Big/ \sum_{y \in D_n} \prod_{i=1}^{N} w_i^{y_i} \propto \exp\Big(\sum_{i=1}^{N} \theta_i x_i\Big) \qquad (x \in D_n), \tag{4}$$
where $w$, or equivalently $\theta$, may be determined by $\pi$ through (2). We refer to (4) as the maximum entropy model.
In § 2, we show that, for any proper vector $\pi$ satisfying (1), the corresponding $w$ exists and is determined up to a scaling factor. Moreover, the vector $w$ can be easily found via a modification of the iterative proportional fitting procedure (Deming & Stephan, 1940), which converges monotonically and geometrically with a rate bounded by $\max_i \pi_i$. Each iteration requires $O(n^2 N)$ operations. We demonstrate several other properties of the mapping between $w$ and $\pi$.

In § 3, we describe several procedures for selecting a sample of n distinct units from a population of N units under the maximum entropy model. Each procedure selects or removes population units one by one until a sample of n distinct units is obtained. The simplest of these procedures requires $O(nN)$ operations.

In § 4, we discuss the application of the maximum entropy model in survey sampling, considering in particular estimation of the population total $Y = \sum y_i$, where $y_i$ is a characteristic associated with the ith individual in the population. We exhibit desirable properties of the Horvitz & Thompson (1952) estimator of Y guaranteed by the maximum entropy model. We also give two schemes for sample rotation.

In § 5, we give two numerical examples of simulation to illustrate the properties of the mapping between $w$ and $\pi$.

The proofs for some important results are sketched in the Appendix. For details, a technical report is available from X.-H. Chen upon request.

2. THE MAPPING BETWEEN w AND π

The relation between $w$ and $\pi$ is a special case of that between the natural and mean-value parameterizations of an exponential family. The following result can be proved by using Theorem 3.6 of Brown (1986, p. 74).

THEOREM 1. For any vector $\pi$ satisfying (1), there exists a vector $w$ for the maximum entropy model subject to the constraint (2), and $w$ is unique up to rescaling.
To compute $w$ from $\pi$, we recast (2) in the form of a set of equations (5) below, and solve these iteratively as in (7). Throughout the paper we use the following notation: $S = \{1, \ldots, N\}$; capital letters such as A, B or C for subsets of S; $A^c = S \setminus A$ for the complement of A in S; and $|A|$ for the number of elements of A. Also
$$R(k, C) = \sum_{\substack{B \subset C \\ |B| = k}} \prod_{i \in B} w_i$$
for any nonempty set $C \subset S$ and $1 \le k \le |C|$; $R(0, C) = 1$; and $R(k, C) = 0$ for any $k > |C|$.

The following Proposition 1 follows immediately from the definitions.

PROPOSITION 1. For any nonempty set $C \subset S$ and $1 \le k \le |C|$:
(a) $\sum_{j \in C} w_j R(k-1, C \setminus \{j\}) = k R(k, C)$;
(b) $\sum_{j \in C} R(k, C \setminus \{j\}) = (|C| - k) R(k, C)$;
(c) $\sum_{i=0}^{k} R(i, C) R(k-i, C^c) = R(k, S)$.
Using this notation, we may rewrite (2) as
$$\pi_i = \frac{w_i R(n-1, \{i\}^c)}{R(n, S)} \qquad (i = 1, \ldots, N). \tag{5}$$
Note that an application of (a) in Proposition 1 proves that the right-hand sides of (5) sum to n, consistent with the sum of the $\pi_i$ as in (1). Thus, for fixed $\pi$, there are N - 1 linearly independent relations among the N relations of (5). Without loss of generality, we assume that $\pi_1 \le \pi_2 \le \cdots \le \pi_N$ and let $w_N = \pi_N$. Dividing each of the first N - 1 equations of (5) by the Nth equation on both sides and rearranging the terms for each equation, we get the following set of restricted equations,
$$w_i = \frac{\pi_i R(n-1, \{N\}^c)}{R(n-1, \{i\}^c)} \qquad (i = 1, \ldots, N-1), \qquad w_N = \pi_N. \tag{6}$$
Although a closed-form solution to (6) seems impossible, the equations can be solved as a fixed-point problem by using an iterative procedure. Specifically, the following updating scheme provides a solution of (6):
$$w_i^{(t+1)} = \frac{\pi_i R(n-1, \{N\}^c)}{R(n-1, \{i\}^c)}\bigg|_{w = w^{(t)}} \quad (i = 1, \ldots, N-1), \qquad w_N^{(t+1)} = \pi_N, \tag{7}$$
where $w^{(t)} = (w_1^{(t)}, \ldots, w_N^{(t)})$.


THEOREM 2. Define $W = \{w : 0 < w_i \le \pi_i,\ i = 1, \ldots, N-1;\ w_N = \pi_N\}$. Then:
(a) the set of equations in (6) has a unique solution, $w^*$, in W;
(b) starting from $w^{(0)} = \pi$, the sequence (7) of vectors $w^{(t)}$ $(t = 1, 2, \ldots)$ converges monotonically and geometrically to $w^*$ with a rate bounded by $\pi_N$.

Remark. The iterative procedure defined by (7) is a variant of the iterative proportional fitting algorithm first proposed by Deming & Stephan (1940). Imagine a $2^N$ contingency table initialized with a uniform distribution over the subset of cells such that n of the N binary variables take the value 1 and the remaining N - n take the value 0. As originally proposed, the Deming-Stephan procedure would use the formula (7) for $i = 1, \ldots, N$ to define a cycle, updating the table N times within each cycle, after each $w_i^{(t)}$ is modified into $w_i^{(t+1)}$, whereas we fix the normalization of the $w_i$ by holding $w_N = \pi_N$, where $\pi_N$ is the largest $\pi_i$, and hence we update the $w_i$ in each cycle only for $i = 1, \ldots, N-1$. While Deming-Stephan is known to increase entropy monotonely at each step (Darroch & Ratcliff, 1972), further properties such as those in Theorem 2 do not apply to Deming-Stephan. In practice, the procedure in (7) takes far fewer iterations to converge than Deming-Stephan.

Our algorithm is also more direct and much faster than the one proposed by Stern & Cover (1989), which uses the generalized iterative scaling algorithm of Darroch & Ratcliff (1972).
LEMMA 1. For any $i, j \in S$, the following properties hold:
(a) $\pi_i = \pi_j \Leftrightarrow w_i = w_j$;
(b) $\pi_i > \pi_j \Leftrightarrow w_i > w_j$;
(c) $\pi_i > \pi_j \Leftrightarrow w_i/w_j > \pi_i/\pi_j$;
(d) if $c_1 \le \pi_k \le c_2$ for all $\pi_k$ and some $c_1, c_2 \in (0, 1)$, then $w_i/w_j \to \pi_i/\pi_j$ as $N/n \to \infty$.

The naive approach to computing R by adding all the terms in its definition is inefficient and often prohibitively expensive. The following result enables us to calculate the function R recursively and relatively cheaply.
THEOREM 3. Define $T(i, C) = \sum_{j \in C} w_j^i$ for any $i \ge 1$ and $C \subset S$. Then, for any $1 \le k \le |C|$,
$$R(k, C) = \frac{1}{k} \sum_{i=1}^{k} (-1)^{i+1} T(i, C) R(k-i, C). \tag{8}$$
When $w_i = 1$ for all i,
$$R(k, C) = \binom{|C|}{k}, \qquad T(i, C) = |C|,$$
and (8) gives the following combinatorial formula:
$$\binom{|C|}{k} = \frac{|C|}{k} \sum_{i=1}^{k} (-1)^{i+1} \binom{|C|}{k-i}.$$
If $T(i, C)$ and $R(k-i, C)$ $(i = 1, \ldots, k)$ are all available, it requires O(k) operations to calculate R(k, C) using (8). By induction, if $T(i, C)$ $(i = 1, \ldots, k)$ are all available, it requires $O(k^2)$ operations to get R(k, C). For each iteration modifying $w^{(t)}$ into $w^{(t+1)}$ in (7), we need to compute $R(n-1, \{j\}^c)$ $(j = 1, \ldots, N)$, which requires $O(n^2 N)$ operations if $T(i, \{j\}^c)$ $(i = 1, \ldots, n-1;\ j = 1, \ldots, N)$ are all available. However, it takes only O(nN) operations to get all $T(i, \{j\}^c)$ using the formula $T(i, \{j\}^c) = T(i, S) - w_j^i$. In total, each iteration in (7) needs $O(n^2 N)$ operations.
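The following minimal Python sketch (our code; the names `R` and `solve_weights` are not from the paper) implements the recursion (8) for the R functions, which is Newton's identity for elementary symmetric polynomials, together with the updating scheme (7), using the leave-one-out identity $T(i, \{j\}^c) = T(i, S) - w_j^i$:

```python
import numpy as np

def R(k, C, w):
    """R(k, C): sum over all k-subsets B of C of prod_{i in B} w_i,
    computed with the recursion (8) of Theorem 3 from the power sums
    T(i, C) = sum_{j in C} w_j**i."""
    wC = np.array([w[j] for j in C], dtype=float)
    if k < 0 or k > len(wC):
        return 0.0
    T = [np.sum(wC ** i) for i in range(k + 1)]
    Rv = [1.0] + [0.0] * k                      # Rv[m] holds R(m, C)
    for m in range(1, k + 1):
        Rv[m] = sum((-1) ** (i + 1) * T[i] * Rv[m - i]
                    for i in range(1, m + 1)) / m
    return Rv[k]

def solve_weights(pi, n, tol=1e-10, max_iter=10_000):
    """Fixed-point iteration (7): given coverage probabilities pi (sorted
    ascending, summing to n), return weights w normalized so w_N = pi_N."""
    pi = np.asarray(pi, dtype=float)
    N = len(pi)
    w = pi.copy()                               # starting point w(0) = pi
    for _ in range(max_iter):
        # leave-one-out power sums: T(i, {j}^c) = T(i, S) - w_j**i
        TS = [np.sum(w ** i) for i in range(n)]
        Rm = np.zeros((n, N)); Rm[0] = 1.0      # Rm[m, j] = R(m, {j}^c)
        for m in range(1, n):
            Rm[m] = sum((-1) ** (i + 1) * (TS[i] - w ** i) * Rm[m - i]
                        for i in range(1, m + 1)) / m
        w_new = pi * Rm[n - 1, N - 1] / Rm[n - 1]   # updating scheme (7)
        w_new[N - 1] = pi[N - 1]                    # normalization w_N = pi_N
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w
```

Each pass of the loop in `solve_weights` costs $O(n^2 N)$ operations, matching the count given above.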

3. SELECTION PROCEDURES

In this section, we discuss procedures for drawing a sample from the maximum entropy model. Given the $w_i$, there is an explicit formula for the probability of each possible sample in $D_n$, but the large number, $N!/\{(N-n)!\,n!\}$, of choices renders impractical a naive approach of a single multinomial draw from $N!/\{(N-n)!\,n!\}$ cells. Therefore, we consider draw-by-draw selection procedures, where we draw one unit at a time until n units are obtained. We call a selection procedure 'forward' if it selects n units from the population as the sample, or 'backward' if it removes N - n units from the population and takes the remaining n units as the sample. Noticing that, for any $x \in D_n$,
$$p(x) = \prod_{i \in A_x} w_i \Big/ R(n, S) \propto \prod_{i \in A_x^c} w_i^{-1},$$
where $A_x = \{i : x_i = 1\}$, it is obvious that for any 'forward' procedure there is a corresponding 'backward' procedure, which selects unsampled units in the same way using $w_i^{-1}$ instead of $w_i$.

We also distinguish among procedures by whether or not a procedure requires n to be fixed in advance. In the context of sample surveys, the $\pi_i$ are usually prespecified; thus the sample size n and the $w_i$ are fixed in advance. However, it is possible in some applications that the $w_i$ are prespecified and different sample sizes are to be experimented with. In this case, a selection procedure that does not depend on n is desired.

The output of a draw-by-draw procedure is represented by $A_0, A_1, \ldots, A_n$, where $A_0 = \emptyset$ and $A_k \subset S$ denotes the set of selected indices after k draws. The following are two 'forward' procedures, one for fixed n and the other for n not fixed. The 'backward' versions of these procedures can be defined accordingly.
Procedure 1 (forward, n fixed). At the kth draw $(k = 1, \ldots, n)$, a unit $j \in A_{k-1}^c$ is selected with probability
$$P_1(j, A_{k-1}^c) = \frac{w_j R(n-k, A_{k-1}^c \setminus \{j\})}{(n-k+1)\, R(n-k+1, A_{k-1}^c)}.$$
Using the relation $w_i \propto p_i/(1 - p_i)$, the function R can be written as
$$R(k, C) \propto \operatorname{pr}\Big(\sum_{i \in C} Z_i = k\Big)$$
for any nonempty set $C \subset S$ and $0 \le k \le |C|$, the constant of proportionality depending only on C. Thus $P_1$ has the interpretation
$$P_1(j, A_{k-1}^c) = \frac{1}{n-k+1} \operatorname{pr}\Big(Z_j = 1 \,\Big|\, \sum_{i \in A_{k-1}^c} Z_i = n-k+1\Big),$$
where $Z_1, \ldots, Z_N$ are independent Bernoulli trials as defined in Method 3 of § 1.


It is easy to see by (a) in Proposition 1 that $P_1(\cdot, A_{k-1}^c)$ is a probability density on $A_{k-1}^c$. To see that a random sample of n units selected by Procedure 1 is a sample from the maximum entropy model, we first compute the probability of choosing an ordered set of indices $i_1, \ldots, i_n$ using Procedure 1, where in this case $A_k = \{i_1, \ldots, i_k\}$ for $k = 1, \ldots, n$:
$$\prod_{k=1}^{n} P_1(i_k, A_{k-1}^c) = \prod_{k=1}^{n} \frac{w_{i_k} R(n-k, A_k^c)}{(n-k+1)\, R(n-k+1, A_{k-1}^c)} = \Big(\prod_{k=1}^{n} w_{i_k}\Big) \frac{R(0, A_n^c)}{n!\, R(n, S)} = \frac{1}{n!} \operatorname{pr}(X_t = 1,\ t \in A_n).$$
Since there are n! different ways of ordering the indices $i_1, \ldots, i_n$, the probability of obtaining the units $i_1, \ldots, i_n$ without regard to order is exactly $\operatorname{pr}(X_t = 1,\ t \in A_n)$.

Suppose that $\pi_{i_1 \cdots i_k}$ is the kth-order inclusion probability for the units $i_1, \ldots, i_k$ to be in a sample of size n from the maximum entropy model. Then a property of Procedure 1 is that
$$\operatorname{pr}(A_k = \{i_1, \ldots, i_k\}) = \binom{n}{k}^{-1} \pi_{i_1 \cdots i_k}, \qquad \pi_{i_1 \cdots i_k} = \Big(\prod_{t=1}^{k} w_{i_t}\Big) \frac{R(n-k, \{i_1, \ldots, i_k\}^c)}{R(n, S)}. \tag{9}$$
Although $P_1$ can be calculated directly from the R functions, the computation can be much simplified by noticing that $P_1(j, S) = \pi_j/n$ and using the following formula recursively for the consecutive draws.

LEMMA 2. For any $1 \le k \le n-1$ and $j \in A_k^c$ with $w_j \ne w_{i_k}$,
$$P_1(j, A_k^c) = \frac{w_{i_k} P_1(j, A_{k-1}^c) - w_j P_1(i_k, A_{k-1}^c)}{(n-k)(w_{i_k} - w_j)\, P_1(i_k, A_{k-1}^c)}. \tag{10}$$
Procedure 1 can be realized using the following algorithm, which requires O(nN) operations.
1. For $j = 1, \ldots, N$, calculate $P_1(j, S)$, which is given by $\pi_j/n$. Then draw unit $i_1$ according to the probability $P_1(i_1, S)$.
2. If n > 1, then $A_0 \leftarrow \emptyset$, $A_1 \leftarrow \{i_1\}$, $k \leftarrow 2$, go to 3; otherwise stop.
3. For all $j \in A_{k-1}^c$, calculate $P_1(j, A_{k-1}^c)$ from $P_1(j, A_{k-2}^c)$ and $P_1(i_{k-1}, A_{k-2}^c)$ using (10). Then draw unit $i_k$ according to the probability $P_1(i_k, A_{k-1}^c)$.
4. If k < n, then $A_k \leftarrow A_{k-1} \cup \{i_k\}$, $k \leftarrow k+1$, go to 3; otherwise stop.
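A compact Python sketch of Procedure 1 (ours; it reuses the `R` helper from the § 2 sketch and evaluates the draw probabilities directly from the R functions, trading the cheaper recursion (10) for simplicity):

```python
import numpy as np

def procedure1(w, n, rng):
    """Draw n distinct units from the maximum entropy model with weights w."""
    N = len(w)
    remaining = list(range(N))                 # the set A_{k-1}^c
    sample = []
    for k in range(1, n + 1):
        # P1(j, C) = w_j R(n-k, C\{j}) / {(n-k+1) R(n-k+1, C)}
        denom = (n - k + 1) * R(n - k + 1, remaining, w)
        probs = np.array([w[j] * R(n - k, [u for u in remaining if u != j], w)
                          for j in remaining]) / denom
        j = int(rng.choice(remaining, p=probs))
        sample.append(j)
        remaining.remove(j)
    return sample

# e.g. procedure1(solve_weights([0.1, 0.4, 0.7, 0.8], 2), 2,
#                 np.random.default_rng(0))
```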
Procedure 2 (forward, n not fixed). At the kth draw $(k = 1, \ldots, n)$, a unit $j \in A_{k-1}^c$ is selected with probability
$$P_2(j, A_{k-1}^c) = \sum_{i=0}^{k-1} \frac{w_j R(i, A_{k-1})\, R(k-i-1, A_{k-1}^c \setminus \{j\})}{(k-i)\, R(k, S)}.$$
By (a) and (c) in Proposition 1,
$$\sum_{j \in A_{k-1}^c} P_2(j, A_{k-1}^c) = \sum_{i=0}^{k-1} \frac{R(i, A_{k-1})}{(k-i)\, R(k, S)} \sum_{j \in A_{k-1}^c} w_j R(k-i-1, A_{k-1}^c \setminus \{j\}) = \sum_{i=0}^{k-1} \frac{R(i, A_{k-1})\, R(k-i, A_{k-1}^c)}{R(k, S)} = 1.$$

Thus $P_2(\cdot, A_{k-1}^c)$ is a probability density on $A_{k-1}^c$. Now we show by induction that a random sample selected by Procedure 2 is a sample from the maximum entropy model. Let $\gamma_k$ be the random index of the unit selected at the kth draw. Assuming
$$\operatorname{pr}(A_{k-1} = A) = \frac{R(k-1, A)}{R(k-1, S)}$$
for any $A \subset S$ with $|A| = k-1$, which is true for $k = 2$ by the definition of $P_2$, we show that $\operatorname{pr}(A_k = B) = R(k, B)/R(k, S)$ for any $B \subset S$ with $|B| = k$. By (b) and (c) in Proposition 1,
$$\operatorname{pr}(A_k = B) = \sum_{j \in B} \frac{R(k-1, B \setminus \{j\})}{R(k-1, S)} \sum_{i=0}^{k-1} \frac{w_j R(k-i-1, B^c)\, R(i, B \setminus \{j\})}{(k-i)\, R(k, S)} = \frac{R(k, B)}{R(k, S)}.$$
By induction, the probability of obtaining a sample $A_n$ is
$$R(n, A_n)/R(n, S) = \operatorname{pr}(X_t = 1,\ t \in A_n).$$
It is evident from the proof above that, using Procedure 2,
$$\operatorname{pr}(A_k = B) = \frac{R(k, B)}{R(k, S)} \qquad (k = 1, 2, \ldots)$$
for any $B \subset S$ with $|B| = k$. Thus Procedure 2 does not depend on n.
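A Python sketch of a single draw of Procedure 2 (our illustration of the expression for $P_2$ displayed above, reusing the `R` helper from § 2; treat it as a sketch rather than the authors' algorithm):

```python
import numpy as np

def procedure2_draw(w, selected, rng):
    """Add the kth unit to `selected` (a list of k-1 indices), drawing j in
    the complement with probability P2(j, A_{k-1}^c) as displayed above."""
    N = len(w)
    k = len(selected) + 1
    comp = [u for u in range(N) if u not in selected]   # A_{k-1}^c
    RkS = R(k, list(range(N)), w)
    probs = []
    for j in comp:
        cj = [u for u in comp if u != j]
        probs.append(sum(w[j] * R(i, selected, w) * R(k - i - 1, cj, w)
                         / ((k - i) * RkS) for i in range(k)))
    return int(rng.choice(comp, p=np.array(probs)))

# Repeated draws give pr(A_k = B) = R(k, B)/R(k, S) for every k, so the
# same stream of draws yields a maximum entropy sample whatever n is.
```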


The two procedures have different uses. By using (10), Procedure 1 requires fewer operations than Procedure 2, but cannot be used when n is not fixed. Procedure 2 is useful for doing rotations in survey sampling. The preference between forward and backward procedures depends on the scale of n: evidently, forward procedures are preferred when n < N/2, while backward procedures are preferred when n > N/2. When the $w_i$ are all equal, both procedures reduce to simple random sampling without replacement.

4. APPLICATION TO SURVEY SAMPLING


In the context of survey sampling, weighted sampling is often called sampling with unequal probabilities and without replacement. Its use in survey sampling was first suggested by Hansen & Hurwitz (1943). Hanif & Brewer (1980) reviewed and classified over 50 different such schemes. The goal is typically to estimate the population total $Y = \sum y_i$ for a finite population of N units, where $y_i$ is a characteristic associated with the ith unit. A commonly-used unbiased estimator of Y, as proposed by Horvitz & Thompson (1952), is
$$\hat{Y} = \sum_{i=1}^{N} \frac{X_i y_i}{\pi_i},$$
where $X = (X_1, \ldots, X_N)$ represents a random sample of n distinct units drawn from the population. The variance of $\hat{Y}$ is given by
$$V(\hat{Y}) = \sum_{i=1}^{N} \sum_{j=1}^{N} (\pi_{ij} - \pi_i \pi_j) \frac{y_i y_j}{\pi_i \pi_j}, \tag{11}$$
where $\pi_{ij}$ is the second-order inclusion probability for both the units i, j to be in the sample, and $\pi_{ii} = \pi_i$. An alternative expression that is valid only when n is fixed is derived by Yates & Grundy (1953):
$$V(\hat{Y}) = \sum_{i < j} (\pi_i \pi_j - \pi_{ij}) \Big(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\Big)^2. \tag{12}$$
Yates & Grundy (1953) also suggest the following estimator of $V(\hat{Y})$, which is unbiased if $\pi_{ij} \ne 0$ for all pairs $i \ne j$ and for use when n is fixed:
$$v(\hat{Y}) = \sum_{i < j} X_i X_j \frac{\pi_i \pi_j - \pi_{ij}}{\pi_{ij}} \Big(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\Big)^2.$$
The maximum entropy model has the following properties.

Property 1. The inclusion probabilities of any order are uniquely determined by the $\pi_i$ and can be explicitly expressed in closed form. For example,
$$\pi_{ij} = w_i w_j R(n-2, \{i, j\}^c)/R(n, S).$$
Thus $V(\hat{Y})$ and $v(\hat{Y})$ can be evaluated directly from the $w_i$. In general, the kth-order $(1 \le k \le n)$ inclusion probability for the units $i_1, \ldots, i_k$ to be in the sample is as given in (9). As we have shown in § 2, these inclusion probabilities can be easily calculated from the $w_i$.
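For instance, a short Python sketch (our helper names `pi_pair` and `yates_grundy`, building on the `R` helper of § 2) evaluates $\pi_{ij}$ in closed form and the Yates-Grundy estimator on a realized sample:

```python
import numpy as np
from itertools import combinations

def pi_pair(i, j, w, n):
    """pi_ij = w_i w_j R(n-2, {i,j}^c) / R(n, S)  (Property 1)."""
    S = list(range(len(w)))
    comp = [u for u in S if u not in (i, j)]
    return w[i] * w[j] * R(n - 2, comp, w) / R(n, S, w)

def yates_grundy(sample, y, pi, w, n):
    """v(Y_hat): sum over pairs i < j in the drawn sample of
       (pi_i pi_j - pi_ij)/pi_ij * (y_i/pi_i - y_j/pi_j)**2."""
    v = 0.0
    for i, j in combinations(sample, 2):
        pij = pi_pair(i, j, w, n)
        v += (pi[i] * pi[j] - pij) / pij * (y[i] / pi[i] - y[j] / pi[j]) ** 2
    return v
```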
Property 2. For the maximum entropy model, $0 < \pi_{ij} < \pi_i \pi_j$ for any pair $i \ne j$.

It is easy to see that the condition
$$\pi_{ij} \le \pi_i \pi_j \quad \text{for all pairs } i \ne j \tag{13}$$
is a sufficient condition for the Yates & Grundy variance estimator $v(\hat{Y})$ to be always nonnegative. By some algebra, it can be shown that (13) is a sufficient condition for a sampling procedure without replacement to be more efficient, i.e. to have smaller variance, than sampling with replacement. Few plans have this nice property, especially when n > 2. For example, the procedure of Hartley & Rao (1962) satisfies (13) only when n = 2. Furthermore, for $v(\hat{Y})$ to be unbiased and definable for all samples, the condition that $\pi_{ij} > 0$ for any $i \ne j$ is also desirable.

Property 3. A direct consequence of applying Property 2 to (11) and (12) is as follows:
$$\pi_i \propto y_i \Leftrightarrow V(\hat{Y}) = 0 \Leftrightarrow v(\hat{Y}) = 0 \tag{14}$$
for all possible samples.
For an arbitrarily given scheme of sampling with unequal probabilities and without replacement, $\pi_i \propto y_i$ is only sufficient for $V(\hat{Y}) = 0$ and $v(\hat{Y}) = 0$. Therefore, the resulting estimator of the variance of $\hat{Y}$, $v(\hat{Y})$, may be zero even when $\hat{Y}$ is clearly not a constant. The result in (14) implies, however, that the maximum entropy model displays the desirable property that $V(\hat{Y})$ or $v(\hat{Y})$ is zero if and only if the $\pi_i$ are proportional to the $y_i$.
Property 4 (rotatability). In continuing surveys conducted at regular intervals, or in multi-stage sampling, it is often desired that the sampling units should not stay in the survey indefinitely, but rather that old units that have been in the survey for a specified period should be rotated, i.e. replaced, by new units. It is also desirable that the new units are selected in such a way that the new sample follows the same probability distribution as the old one. We find that doing rotations under the maximum entropy model is especially convenient. To see this, we present two schemes. Let A denote the set of the indices in the current sample.

SCHEME 1.
Step 1. Draw a unit $i \in A$ with uniform probability, and draw a unit $j \in A^c$ with uniform probability.
Step 2. Replace the unit i by the unit j with the acceptance probability
$$\min(1, w_j/w_i).$$
Step 3. If the replacement is accepted, take $A \cup \{j\} \setminus \{i\}$ as the new sample and stop; otherwise go back to Step 1.
Scheme 1 is in fact a Metropolis-Hastings-type algorithm (Smith & Roberts, 1993). When all the weights are equal, this algorithm reduces to the celebrated Bernoulli-Laplace model. It can be shown that the acceptance rate of Scheme 1 is
$$\operatorname{pr}(\text{accept}) = \frac{2}{n(N-n)} \sum_{i < j} (\pi_i - \pi_{ij}),$$
where the $\pi_i$ are assumed to be in ascending order. Two extreme cases: pr(accept) = 1 when the $\pi_i$ are all equal, and pr(accept) = 0 when the largest n of the $\pi_i$ are 1 and the rest are 0.
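A minimal Python sketch of one step of Scheme 1 (ours; the acceptance probability $\min(1, w_j/w_i)$ is as in Step 2 above, which leaves the maximum entropy distribution invariant under the symmetric uniform proposal):

```python
import numpy as np

def scheme1_rotate(sample, w, rng):
    """Replace one unit of `sample` (an iterable of indices) by a new one."""
    N = len(w)
    A = set(sample)
    while True:
        i = int(rng.choice(sorted(A)))                       # uniform i in A
        out = [u for u in range(N) if u not in A]
        j = int(rng.choice(out))                             # uniform j in A^c
        if rng.random() < min(1.0, w[j] / w[i]):             # accept the swap
            return (A - {i}) | {j}
```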
SCHEME 2.
Step 1. Select a unit $i \in A^c$ using Procedure 2, and then select a unit $j \in A \cup \{i\}$ using Procedure 2', where Procedure 2 is described in § 3 and Procedure 2' is the 'backward' version of Procedure 2.
Step 2. If $i \ne j$, take $A \cup \{i\} \setminus \{j\}$ as the new sample and stop; otherwise go back to Step 1.

In Scheme 2, we use Procedure 2 to add a unit to the current sample A and then use Procedure 2' to remove a unit from A plus this new unit. If the unit added and the unit removed are different, we accept the final sample as the new sample; otherwise, we repeat the procedure until a sample different from A is obtained. Procedure 1 cannot be used in place of Procedure 2 for adding units to the current sample because $P_1(\cdot, A^c)$ is not defined when $|A| = n$. There is no simple form for the acceptance rate of Scheme 2, though we know that, in the same extreme cases as for Scheme 1 above, it is $(N-n)/(N-n+1)$ and 0, respectively.
5. NUMERICAL EXAMPLES

Example 1. We use the following simulation to illustrate the properties of the mapping between $w$ and $\pi$ described in Lemma 1. Let N = 100. A vector $\pi^*$ is generated uniformly from the simplex of vectors satisfying (1), and its corresponding $w^*$ is found from (6) via the iterative procedure described in (7). For convenience of comparison, this particular $w^*$ is used as the weights for five different maximum entropy models with sample sizes n = 2, 5, 15, 30 and 50, respectively. The coverage probabilities $\pi$ for each of these five models are obtained by using (5). Then, by using (7), each of these five $\pi$'s is converted to a $w$, which should be the same as $w^*$ up to a scalar. We use $\max_i |w_i^{(t)}/\tilde{w}_i - 1| < 0.01$ as the stopping rule, where $w^{(t)}$ is the value of w at step t and $\tilde{w} = w^* \pi_N/w_N^*$ is the fixed point. The results are summarized in Table 1.

Table 1. Number of iterations for computing w from π, and the range of w and π, for sample sizes n = 2, 5, 15, 30, 50

n | Number of iterations | π_100 (w_100) | π_1 | w_1 | w_1/π_1
[table entries not recovered in this copy]

The procedure does not take many iterations even when $\pi_N$ is close to 1 and $\pi_1$ is close to 0. Although the value of $w_N$ differs from experiment to experiment, as it is always set equal to $\pi_N$, the ratios $w_N/w_i$ $(i = 1, \ldots, N-1)$ remain nearly the same for all five experiments, which is explained by the fact that w is determined only up to rescaling (Theorem 1). The phenomenon that the ratio $w_1/\pi_1$ approaches 1 as N/n gets large is supported by (d) of Lemma 1.
Example 2. Let N = 4 and n = 2. The $\pi_i$ are 0.1, 0.4, 0.7 and 0.8, respectively. The corresponding $w_i$ found from (6) are 0.05193, 0.2355, 0.52 and 0.8, respectively. The second-order inclusion probabilities $\pi_{ij}$ are given in Table 2.

Table 2. Second-order inclusion probabilities $\pi_{ij}$
[table entries not recovered in this copy]
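As a numerical check, the § 2 sketch reproduces the weights quoted in Example 2 (a usage illustration of the hypothetical `solve_weights` and `pi_pair` helpers; the printed $\pi_{12}$ value follows from Property 1 and the quoted weights):

```python
import numpy as np

pi = [0.1, 0.4, 0.7, 0.8]
w = solve_weights(pi, n=2)
print(np.round(w, 5))                  # approx [0.05193, 0.2355, 0.52, 0.8]
print(round(pi_pair(0, 1, w, 2), 5))   # pi_12, approx 0.01514
```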

ACKNOWLEDGEMENT

Partial support was provided by an Army Research Office grant and a National Science Foundation grant to Harvard University.
APPENDIX
Proofs
The proofs presented in this Appendix are only sketches. Detailed proofs can be found in a technical report by X.-H. Chen at the Department of Statistics, Harvard University.
PROPOSITION 2. We have $R(k-2, C)\, R(k, C) < \{R(k-1, C)\}^2$ for any $C \subset S$ and $2 \le k \le |C|$.

Proof. Each term in $R(k-2, C)R(k, C)$ has the form $w_{i_1}^2 \cdots w_{i_j}^2 w_{i_{j+1}} \cdots w_{i_{2k-j-2}}$, where $0 \le j \le k-2$, corresponding to the case when both $R(k-2, C)$ and $R(k, C)$ choose $w_{i_1}, \ldots, w_{i_j}$ and, in addition, $k-j-2$ of $w_{i_{j+1}}, \ldots, w_{i_{2k-j-2}}$ are from $R(k-2, C)$ and the other $k-j$ are from $R(k, C)$. This particular term appears
$$\binom{2k-2j-2}{k-j}$$
times in $R(k-2, C)R(k, C)$. By analogy, the same term appears
$$\binom{2k-2j-2}{k-j-1}$$
times in $\{R(k-1, C)\}^2$. Since the binomial coefficient $\binom{2i}{m} = (2i)!/\{(2i-m)!\,m!\}$ reaches its maximum when $m = i$, we find
$$\binom{2k-2j-2}{k-j} < \binom{2k-2j-2}{k-j-1},$$
and thus $R(k-2, C)R(k, C) < \{R(k-1, C)\}^2$.
Proof of Lemma 1. Take the ratio of the ith equation and the jth equation of (5):
$$\frac{\pi_i}{\pi_j} = \frac{w_i R(n-1, \{i\}^c)}{w_j R(n-1, \{j\}^c)} = \frac{w_i w_j R(n-2, \{i,j\}^c) + w_i R(n-1, \{i,j\}^c)}{w_j w_i R(n-2, \{i,j\}^c) + w_j R(n-1, \{i,j\}^c)}. \tag{A1}$$
Then (a) and (b) can be directly deduced from (A1). Since the right-hand side of (A1) is less than $w_i/w_j$ if and only if $w_i \ge w_j$, (c) is also straightforward from the condition $\pi_i \ge \pi_j$ and (a) and (b). By manipulating the equation in (A1) for $\pi_1$ instead of $\pi_j$ and using (c), we can show
$$\frac{\pi_i}{\pi_1} \le \frac{w_i}{w_1} \le \frac{\pi_i}{\pi_1} \Big\{1 + \frac{w_i R(n-2, \{i, 1\}^c)}{R(n-1, \{i, 1\}^c)}\Big\}. \tag{A2}$$
As $N/n \to \infty$, the right-hand side of (A2) approaches $\pi_i/\pi_1$. Thus, for any pair $i \ne j$, we have $w_i/w_j \to \pi_i/\pi_j$ as $N/n \to \infty$.
PROPOSITION 3. Let $\mathscr{X}$ be a convex subset of $\mathbb{R}^m$, and $g = (g_1, \ldots, g_m)$ a function $g : \mathbb{R}^m \to \mathbb{R}^m$. Given any two vectors x and y in $\mathscr{X}$, let $l(x, y) = \{\lambda x + (1-\lambda) y : 0 \le \lambda \le 1\}$. If each $g_i$ is
(a) continuous at every point of $l(x, y)$,
(b) differentiable at every interior point of $l(x, y)$,
then
$$\|g(x) - g(y)\| \le \sup_{z \in l(x, y)} \|g'(z)\| \, \|x - y\|,$$
where $\|\cdot\|$ denotes the $l_\infty$ norm; i.e. for a vector x, $\|x\| = \max_i |x_i|$, and for a matrix $A = (a_{ij})$, $\|A\| = \max_i (|a_{i1}| + \cdots + |a_{im}|)$.

Proof. Use the mean value theorem for multivariate functions and basic calculus.
Proof of Theorem 2. (a) By Theorem 1, there is a unique $\tilde{w} = (\tilde{w}_1, \ldots, \tilde{w}_N)$ for the maximum entropy model if we let $\tilde{w}_N = \pi_N$. By Lemma 1, $\tilde{w}_i/\tilde{w}_N \le \pi_i/\pi_N$. Thus $\tilde{w}_i \le \pi_i$, and (6) has exactly one solution in W.

(b) Let $x_i = \log w_i$ for each $i \in S$,
$$g_i(x) = \log \pi_i + \log R(n-1, \{N\}^c) - \log R(n-1, \{i\}^c)$$
for $i = 1, \ldots, N-1$, and $g_N(x) = \log \pi_N$. For convenience, we consider the logarithmic transformation of (6), which can be written as
$$x = g(x),$$
where $x = (x_1, \ldots, x_N)$ and $g(x) = (g_1(x), \ldots, g_N(x))$. Define
$$\mathscr{X} = \{x : \tilde{x}_i \le x_i \le \log \pi_i\ (i = 1, \ldots, N-1),\ x_N = \log \pi_N\},$$
where $\tilde{x}_i = \log \tilde{w}_i$ for each $i \in S$. It can be checked that $\mathscr{X}$ and g satisfy all the conditions in Proposition 3 and that $g(\mathscr{X}) \subset \mathscr{X}$. It is obvious that $\partial g_i(x)/\partial x_j = 0$ when $i = N$ or $j = N$. It can be easily checked that
$$\frac{\partial g_i(x)}{\partial x_j} = \begin{cases} \dfrac{w_i R(n-2, \{N, i\}^c)}{R(n-1, \{N\}^c)} & \text{if } i = j \ne N, \\[2ex] \dfrac{w_j R(n-2, \{N, j\}^c)}{R(n-1, \{N\}^c)} - \dfrac{w_j R(n-2, \{i, j\}^c)}{R(n-1, \{i\}^c)} & \text{if } i, j \text{ and } N \text{ are distinct.} \end{cases}$$
Using Propositions 1 and 2, we can show that
$$\sum_{j=1}^{N} \bigg|\frac{\partial g_i(x)}{\partial x_j}\bigg| = \frac{w_N R(n-2, \{i, N\}^c)}{R(n-1, \{i\}^c)} < \frac{w_N R(n-1, \{N\}^c)}{R(n, S)}.$$
Using Proposition 2, it can be shown that $w_N R(n-1, \{N\}^c)/R(n, S)$ is decreasing in $w_i$ for each $i = 1, \ldots, N-1$. Thus
$$\sup_{x \in \mathscr{X}} \|g'(x)\| = \sup_{x \in \mathscr{X}} \max_{1 \le i \le N} \sum_{j=1}^{N} \bigg|\frac{\partial g_i(x)}{\partial x_j}\bigg| \le \frac{\tilde{w}_N R(n-1, \{N\}^c)}{R(n, S)}\bigg|_{w = \tilde{w}} = \pi_N < 1. \tag{A3}$$
The last step is obtained from the Nth equation of (5), since $\tilde{w}$ is a solution of (5). By Proposition 3,
$$\|g(x) - g(y)\| \le \pi_N \|x - y\| \qquad (x, y \in \mathscr{X}).$$
It is well known that the inequality in (A3) is a sufficient condition for a fixed-point procedure to converge. Thus the sequence $x^{(t)}$ $(t = 1, 2, \ldots)$ converges to $\tilde{x}$, since $\tilde{x}$ is the unique fixed point in $\mathscr{X}$. Accordingly, the sequence $w^{(t)}$ $(t = 1, 2, \ldots)$ converges monotonically and geometrically to $\tilde{w}$, and the convergence rate is bounded by $\pi_N$. Although we do not know $\tilde{w}$ beforehand, we can always choose $\pi \in W$ as the starting point.
Proof of Theorem 3. It is easy to check that (8) holds for k = 1. For any $k \ge 2$, define
$$Q_d = \sum_{i \in C} w_i^d R(k-d, C \setminus \{i\}) \qquad (d = 1, \ldots, k).$$
It can be shown that, for $d = 1, \ldots, k-1$,
$$Q_d = T(d, C) R(k-d, C) - Q_{d+1}. \tag{A4}$$
Using (A4) repeatedly for $d = 1, \ldots, k-1$, we get
$$Q_1 = T(1, C)R(k-1, C) - Q_2 = T(1, C)R(k-1, C) - \{T(2, C)R(k-2, C) - Q_3\} = \cdots = T(1, C)R(k-1, C) - T(2, C)R(k-2, C) + \cdots + (-1)^{k-1} Q_k.$$
By Proposition 1 and the definition of R,
$$Q_1 = \sum_{i \in C} w_i R(k-1, C \setminus \{i\}) = kR(k, C), \qquad Q_k = \sum_{i \in C} w_i^k R(0, C \setminus \{i\}) = T(k, C).$$
Thus
$$R(k, C) = \frac{1}{k} Q_1 = \frac{1}{k} \sum_{i=1}^{k} (-1)^{i+1} T(i, C) R(k-i, C).$$

Proof of Property 2. Since the $w_i$ are all positive, it is easy to see that $\pi_{ij} > 0$. Comparing $\pi_{ij}$ and $\pi_i \pi_j$ term by term, it is seen that we only need to show
$$R(n, S)\, R(n-2, \{i, j\}^c) < R(n-1, \{i\}^c)\, R(n-1, \{j\}^c).$$
By combinatorial expansion, it can be shown that
$$R(n, S)\, R(n-2, \{i, j\}^c) - R(n-1, \{i\}^c)\, R(n-1, \{j\}^c) = R(n-2, \{i, j\}^c)\, R(n, \{i, j\}^c) - \{R(n-1, \{i, j\}^c)\}^2 < 0.$$
The last step is a direct consequence of Proposition 2, and this completes the proof.

REFERENCES

BROWN, L. D. (1986). Fundamentals of Statistical Exponential Families (with Applications in Statistical Decision Theory). Hayward, CA: Institute of Mathematical Statistics.
CHAUDHURI, A. & VOS, J. W. E. (1988). Unified Theory and Strategies of Survey Sampling. New York: Elsevier Science.
DARROCH, J. N. & RATCLIFF, D. (1972). Generalized iterative scaling for log-linear models. Ann. Math. Statist. 43, 1470-80.
DEMING, W. E. & STEPHAN, F. F. (1940). On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Statist. 11, 427-44.
HANIF, M. & BREWER, K. R. W. (1980). Sampling with unequal probabilities without replacement: A review. Int. Statist. Rev. 48, 317-35.
HANSEN, M. H. & HURWITZ, W. N. (1943). On the theory of sampling from a finite population. Ann. Math. Statist. 14, 333-62.
HARTLEY, H. O. & RAO, J. N. K. (1962). Sampling with unequal probabilities and without replacement. Ann. Math. Statist. 33, 350-74.
HORVITZ, D. G. & THOMPSON, D. J. (1952). A generalization of sampling without replacement from a finite universe. J. Am. Statist. Assoc. 47, 663-85.
JOE, H. (1990). A winning strategy for lotto games? Can. J. Statist. 18, 233-44.
SMITH, A. F. M. & ROBERTS, G. O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. R. Statist. Soc. B 55, 3-23.
STERN, H. & COVER, T. M. (1989). Maximum entropy and the lottery. J. Am. Statist. Assoc. 84, 980-85.
YATES, F. & GRUNDY, P. M. (1953). Selection without replacement from within strata with probability proportional to size. J. R. Statist. Soc. B 15, 253-61.

[Received June 1993. Revised April 1994]
