Detection: R.G. Gallager
Detection
R.G. Gallager
1 Introduction
Detection, decision making, hypothesis testing, and decoding are synonyms. The word
detection refers to the effort to decide whether some phenomenon is present or not on the
basis of some observations. For example, a radar system uses the data to detect whether
or not a target is present; a quality control system attempts to detect whether a unit
is defective; a medical test detects whether a given disease is present. The meaning has
been extended in the communication field to detect which one, among a set of mutually
exclusive alternatives, is correct. Decision making is, again, the process of deciding
between a number of mutually exclusive alternatives. Hypothesis testing is the same, and
here the mutually exclusive alternatives are called hypotheses. Decoding is the process
of mapping the received signal into one of the possible set of code words or transmitted
symbols. We use the word hypotheses for the possible choices in what follows, since the
word conjures up the appropriate intuitive image.
These problems will be studied initially in a purely probabilistic setting. That is, there is
a probability model within which each hypothesis is an event. These events are mutually
exclusive and collectively exhaustive, i.e., the sample outcome of the experiment lies in
one and only one of these events, which means that in each performance of the experi-
ment, one and only one hypothesis is correct. Assume there are m hypotheses, numbered
0, 1, . . . , m − 1, and let H be a random variable whose sample value is the correct hy-
pothesis j, 0 ≤ j ≤ m − 1 for that particular sample point. The probability of hypothesis
j, pH (j), is denoted pj and is usually referred to as the a priori probability of j. There
is also a random variable (rv) Y , called the observation. This is the data on which the
decision must be based. We observe a sample value y of Y , and on the basis of that
observation, we want to make a decision between the possible hypotheses. For this and
the following section, the observation could equally well be a complex random variable, a
random vector, or a chance variable.
Before discussing how to make decisions, it is important to understand when and why deci-
sions must be made. As an example, suppose we conclude, on the basis of the observation,
that hypothesis 0 is correct with probability 2/3 and hypothesis 1 with probability 1/3.
Simply making a decision on hypothesis 0 and forgetting about the probabilities seems to
be throwing away much of the information that we have gathered. The problem is that
sometimes choices must be made. In a communication system, the user wants to receive
the message rather than a set of probabilities. In a control system, the controls must
occasionally take action. Similarly managers must occasionally choose between courses
of action, between products, and between people to hire. In a sense, it is by making
decisions that we return from the world of mathematical probability models to the world
being modeled.
There are a number of possible criteria to use in making decisions, and initially, we
assume that the criterion is to maximize the probability of choosing correctly. That is,
when the experiment is performed, the resulting sample point maps into a sample value
j for H and into a sample value y for Y . The decision maker observes y (but not j)
and maps y into a decision Ĥ(y). The decision is correct if Ĥ(y) = j. In principle,
maximizing the probability of choosing correctly is almost trivially simple. Given y, we
calculate pH|Y (j | y) for each j, 0 ≤ j ≤ m − 1. This is the probability that j is the correct
hypothesis conditional on y. Thus the rule for maximizing the probability of being correct
is to choose Ĥ(y) to be that j for which pH|Y (j | y) is maximized. This is denoted
$$\hat{H}(y) = \arg\max_j \; p_{H|Y}(j \mid y), \qquad (1)$$
where arg max_j means the argument j that maximizes the function. If the maximum is
not unique, it makes no difference to the probability of being correct which maximizing
j is chosen. To be explicit, we will choose the smallest maximizing j (see footnote 1). The conditional
probability pH|Y (j | y) is called an a posteriori probability, and thus the decision rule in
(1) is called the maximum a posteriori probability (MAP) rule.
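The MAP rule can be sketched for a discrete observation as follows. This is a minimal illustration, not part of the original notes: the names `map_decision`, `priors`, and `likelihoods` are hypothetical, and the numerical model is made up. For a fixed y, the a posteriori probability of j is proportional to p_j times the probability of y given j, so the sketch maximizes that product and breaks ties in favor of the smallest j, as in the text.

```python
def map_decision(priors, likelihoods, y):
    """Return the smallest j maximizing priors[j] * likelihoods[j][y].

    priors[j] is the a priori probability p_j; likelihoods[j][y] is the
    probability of observing y given hypothesis j (illustrative names).
    """
    scores = [p * lik[y] for p, lik in zip(priors, likelihoods)]
    # list.index returns the first (smallest) index attaining the maximum.
    return scores.index(max(scores))


# Hypothetical two-hypothesis model with three possible observations.
priors = [0.5, 0.5]
likelihoods = [[0.7, 0.2, 0.1],   # p(y | H=0)
               [0.1, 0.2, 0.7]]   # p(y | H=1)
decisions = [map_decision(priors, likelihoods, y) for y in range(3)]
```

For the middle observation the two products are equal, so the tie-breaking convention selects hypothesis 0.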
When we want to distinguish between different decision rules, we denote the MAP decision
rule in (1) as Ĥ_MAP(y). Since the MAP rule maximizes the probability of correct decision
for each sample value y, it also maximizes the probability of correct decision averaged over
all y. To see this analytically, let Ĥ_A(y) be an arbitrary decision rule. Since Ĥ_MAP(y) maximizes
pH|Y (j | y) over j,
$$p_{H|Y}(\hat{H}_{MAP}(y) \mid y) - p_{H|Y}(\hat{H}_A(y) \mid y) \ge 0 \quad \text{for any rule } A \text{ and all } y. \qquad (2)$$
Taking the expected value of the first term on the left over the observation Y , we get the
probability of correct decision using the MAP decision rule. The expected value of the
second term on the left is the probability of correct decision using the rule A. Thus, taking
the expected value of (2) over Y shows that the MAP rule maximizes the probability of
correct decision over the observation space. The above results are very simple, but also
important and fundamental. We summarize them in the following theorem.
Theorem 1.1 The MAP rule, given in (1), maximizes the probability of correct decision
for each observed sample value y and also maximizes the overall probability of correct
decision.
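Theorem 1.1 can be checked by brute force on a small discrete model. The sketch below is only an illustration under assumed numbers (three hypotheses, three observation values, all made up): it enumerates every deterministic decision rule and confirms that none has a larger probability of correct decision than the MAP rule.

```python
from itertools import product


def prob_correct(priors, likelihoods, rule):
    """Probability that rule[y] equals the true hypothesis."""
    return sum(priors[rule[y]] * likelihoods[rule[y]][y]
               for y in range(len(rule)))


# Hypothetical model: 3 hypotheses, 3 observation values.
priors = [0.6, 0.3, 0.1]
likelihoods = [[0.5, 0.3, 0.2],
               [0.2, 0.3, 0.5],
               [0.1, 0.1, 0.8]]
n_obs = 3

# MAP rule: for each y, the smallest j maximizing p_j * p(y | j).
map_rule = tuple(
    max(range(3), key=lambda j: priors[j] * likelihoods[j][y])
    for y in range(n_obs)
)
map_value = prob_correct(priors, likelihoods, map_rule)

# Best value over all 3**3 deterministic rules.
best = max(prob_correct(priors, likelihoods, rule)
           for rule in product(range(3), repeat=n_obs))
```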
Footnote 1: As discussed in section 4, it is sometimes desirable to add some randomness into the choice of a
maximizing j.
Before discussing the implications and use of the MAP rule, we review the assumptions
that have been made. First, we assumed a probability model in which the probability
assignment is known, and in which, for each performance of the experiment, one and only
one hypothesis is correct. This conforms very well to the communication model in which
a transmitter sends one of a set of possible signals, and the receiver, given signal plus
noise, makes a decision on the signal actually sent. It does not always conform well to
a scientific experiment attempting to verify the existence of some new phenomenon; in
such situations, there is often no sensible way to model a priori probabilities. We discuss
detection in the absence of a priori probabilities in section 4.
The next assumption was that maximizing the probability of correct decision is an ap-
propriate decision criterion. In many situations, the cost of a wrong decision is highly
asymmetric. For example, when testing for a treatable but deadly disease, making an
error when the disease is present is far more costly than making an error when the disease
is not present. It is easy to extend the theory to account for relative costs of errors, and
we do that later.
The next few sections are restricted to the case of binary hypotheses, (m = 2). This allows
us to understand most of the important ideas but simplifies the notation considerably.
Later we consider an arbitrary number of hypotheses.
2 Binary Detection
Assume a probability model in which the correct hypothesis H is a binary random variable
with values 0 and 1 and a priori probabilities p0 and p1 . In the communication context,
the a priori probabilities are usually modeled as equi-probable, but occasionally there are
multi-stage detection processes in which the result of the first stage can be summarized by
a new set of a priori probabilities. Thus we continue to allow arbitrary a priori probabilities.
Let Y be a rv whose conditional probability density, fY |H (y | j), is initially assumed to be
finite and non-zero for all real y and for j equal to both 0 and 1. The modifications for
discrete Y , complex Y , or vector Y are straight-forward and are illustrated by examples.
The conditional densities fY |H (y | j), j = 0, 1 are called likelihoods in the jargon of hypoth-
esis testing. The marginal density of Y is given by fY (y) = p0 fY |H (y | 0) + p1 fY |H (y | 1).
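The a posteriori probability of H, for j = 0 or 1, is given by
$$p_{H|Y}(j \mid y) = \frac{p_j f_{Y|H}(y \mid j)}{f_Y(y)}. \qquad (3)$$
Writing out the MAP rule (1) for this binary case,
$$\frac{p_0 f_{Y|H}(y \mid 0)}{f_Y(y)} \;\overset{\hat{H}=0}{\underset{\hat{H}=1}{\begin{smallmatrix}\ge\\<\end{smallmatrix}}}\; \frac{p_1 f_{Y|H}(y \mid 1)}{f_Y(y)}. \qquad (4)$$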
This “equation” indicates that the MAP decision is 0 if the left side is greater than or
equal to the right, and is 1 if the left side is less than the right. Choosing the decision to
be 0 when equality holds is arbitrary and does not affect the probability of being correct.
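Canceling fY (y) and rearranging,
$$\Lambda(y) = \frac{f_{Y|H}(y \mid 0)}{f_{Y|H}(y \mid 1)} \;\overset{\hat{H}=0}{\underset{\hat{H}=1}{\begin{smallmatrix}\ge\\<\end{smallmatrix}}}\; \frac{p_1}{p_0} = \eta. \qquad (5)$$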
Another, usually simpler, approach is to work directly with the likelihood ratio. Since
Λ(y) is a function of the observed sample value y, we can define the likelihood ratio random
variable Λ(Y ) in the usual way, i.e., for every sample point ω, Y (ω) is the corresponding
sample value y, and Λ(Y ) is then shorthand for Λ(Y (ω)). In the same way, Ĥ(Y ) (or
more briefly Ĥ) is the decision random variable. In these terms, (5) states that Ĥ = 0 if and only if Λ(Y ) ≥ η.
Thus,
Pr{e | H=0} = Pr{Ĥ=1 | H=0} = Pr{Λ(Y ) < η | H=0}. (9)
Pr{e | H=1} = Pr{Ĥ=0 | H=1} = Pr{Λ(Y ) ≥ η | H=1}. (10)
A sufficient statistic is defined as a function of the observation y from which the likelihood
ratio can be calculated. For example, y itself, Λ(y), and any one to one function of Λ(y)
are sufficient statistics. Λ(y) and functions of it are often simpler to work with than
y in calculating the probability of error. This will be particularly true when we get to
vector or process observations, since Λ(y) will then still be a one dimensional variable.
We have seen that the MAP rule (and, as we find later, essentially any sensible decision
rule) can be specified in terms of the likelihood ratio. Thus, once a sufficient statistic has
been calculated from the observed vector, the observed vector has no further value. For
example, we see from (9) and (10) that the conditional error probabilities are determined
simply from the conditional distribution functions of the likelihood ratio. We will often
find that the log likelihood ratio, LLR(Y ) = ln[Λ(Y )] is even more convenient to work
with than Λ(Y ). We next look at a simple but very important example of binary MAP
detection.
Figure 1: The source produces a binary digit which is mapped into ±a. This is
modulated into a waveform, WGN is added, the resultant waveform is demodulated
and sampled, resulting in a noisy received value y. Based on this observation the
receiver makes a decision on the source output.
We have simplified 2-PAM by assuming that only a single binary symbol is sent (rather
than a sequence over time) and that only the single observed sample y corresponding to
that input is observed. We will see later that these simplifications are unnecessary, but
they let us understand the problem in the simplest possible context. The observation rv
Y (i.e., the channel output) is a + Z or −a + Z, depending on whether H = 0 or 1. Thus,
conditional on H = 0, Y ∼ N (a, N0 /2) and, conditional on H = 1, Y ∼ N (−a, N0 /2).
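The likelihoods are therefore
$$f_{Y|H}(y \mid 0) = \frac{1}{\sqrt{\pi N_0}} \exp\left[\frac{-(y-a)^2}{N_0}\right]; \qquad f_{Y|H}(y \mid 1) = \frac{1}{\sqrt{\pi N_0}} \exp\left[\frac{-(y+a)^2}{N_0}\right].$$
The likelihood ratio is the ratio of these likelihoods, and given by
$$\Lambda(y) = \exp\left[\frac{-(y-a)^2 + (y+a)^2}{N_0}\right] = \exp\left[\frac{4ay}{N_0}\right]. \qquad (11)$$
Substituting this into the threshold test (5),
$$\exp\left[\frac{4ay}{N_0}\right] \;\overset{\hat{H}=0}{\underset{\hat{H}=1}{\begin{smallmatrix}\ge\\<\end{smallmatrix}}}\; \frac{p_1}{p_0} = \eta. \qquad (12)$$
Taking the logarithm of both sides,
$$LLR(y) = \left[\frac{4ay}{N_0}\right] \;\overset{\hat{H}=0}{\underset{\hat{H}=1}{\begin{smallmatrix}\ge\\<\end{smallmatrix}}}\; \ln(\eta). \qquad (13)$$
Dividing both sides by 4a/N0 (taking a > 0),
$$y \;\overset{\hat{H}=0}{\underset{\hat{H}=1}{\begin{smallmatrix}\ge\\<\end{smallmatrix}}}\; \frac{N_0 \ln(\eta)}{4a}. \qquad (14)$$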
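As a numerical check on the chain from (4) to (14), the following sketch compares the simple threshold on y with the direct comparison of p0 f(y | 0) against p1 f(y | 1). The function names and the parameter values (a, N0, p0, p1) are illustrative assumptions, and a > 0 is assumed.

```python
import math


def gaussian_pdf(y, mean, variance):
    """Density of N(mean, variance) evaluated at y."""
    return (math.exp(-(y - mean) ** 2 / (2 * variance))
            / math.sqrt(2 * math.pi * variance))


def threshold_decision(y, a, n0, p0, p1):
    """Test (14): decide 0 when y >= N0 ln(eta) / (4a), with eta = p1/p0."""
    eta = p1 / p0
    return 0 if y >= n0 * math.log(eta) / (4 * a) else 1


def map_by_definition(y, a, n0, p0, p1):
    """Test (4) with f_Y(y) canceled: compare p0 f(y|0) with p1 f(y|1)."""
    left = p0 * gaussian_pdf(y, a, n0 / 2)
    right = p1 * gaussian_pdf(y, -a, n0 / 2)
    return 0 if left >= right else 1


# Illustrative parameters (not from the notes).
a, n0, p0, p1 = 1.0, 2.0, 0.25, 0.75
threshold = n0 * math.log(p1 / p0) / (4 * a)
ys = [k / 10 for k in range(-30, 31)]
agree = all(threshold_decision(y, a, n0, p0, p1)
            == map_by_definition(y, a, n0, p0, p1) for y in ys)
```

With these values the threshold is (1/2) ln 3, roughly 0.55, so the more probable hypothesis 1 is chosen even for some positive y.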
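[Figure 2: the conditional densities of Y given H=1 and H=0, centered at −a and a respectively. The threshold (N0/4a) ln η separates the region decided as Ĥ=1 (to its left) from the region decided as Ĥ=0 (to its right); the tail of the H=1 density beyond the threshold is Pr{Ĥ = 0 | H = 1}.]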
In terms of Figure 2, this is the tail of either Gaussian distribution from the point 0 where
they cross. We shall see that this equation keeps reappearing in different guises, and it
will soon seem like a completely obvious result for a variety of detection problems.
Figure 3: The source produces a binary digit; 0 is mapped into b and 1 is mapped
into b′. After modulation, addition of WGN and demodulation, the noisy received
observation is y. The receiver maps y into the detected output Ĥ.
Under H=0, the observation Y is given by Y = a + Z; under H=1, Y = −a + Z.
Thus, under H = 0, Y is a k-tuple of Gaussian rv's whose means are given by a1, . . . , ak
and whose fluctuations are iid. Thus,
$$f_{Y|H}(y \mid 0) = \frac{1}{(\pi N_0)^{k/2}} \exp\left(\sum_{j=1}^{k} \frac{-(y_j - a_j)^2}{N_0}\right) = \frac{1}{(\pi N_0)^{k/2}} \exp\left(\frac{-\|y - a\|^2}{N_0}\right).$$
This test involves the observation y only in terms of the inner product ⟨y, a⟩. This
says that the MAP test is linear in y , and that it involves the different components of
the observation linearly according to the size of the signal in that direction. This is not
surprising, since the two hypotheses are separated more by the larger components of a
than by the smaller.
The left side of (21) is the size of the projection of the observation onto the signal; the
decision is then based on that projection. This is illustrated geometrically in Figure 4.
This result is rather natural; the noise has equal variance in all directions, and only the
noise in the direction of the signal should be relevant in detecting the signal. The geometry
of the situation is particularly clear in the ML case. The noise is spherically symmetric
around the origin, and the likelihoods depend only on the distance from the origin. The
ML detection rule is then equivalent to choosing the hypothesis closest to the received
point; the set of points equidistant from the two hypotheses, as illustrated in Figure 4, is
simply the perpendicular bisector between them; the sign of the projection of y onto a
determines which side of that perpendicular bisector y lies on.
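The equivalence between the sign of the projection and minimum distance in the ML case can be illustrated numerically. The sketch below assumes η = 1 (so the threshold is 0), uses a made-up signal vector a, and draws arbitrary test points; the function names are hypothetical.

```python
import random


def inner(x, y):
    """Real inner product of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def ml_by_projection(y, a):
    """ML test (eta = 1): decide 0 when <y, a> >= 0."""
    return 0 if inner(y, a) >= 0 else 1


def ml_by_distance(y, a):
    """Decide 0 when y is at least as close to +a as to -a."""
    d0 = sum((yi - ai) ** 2 for yi, ai in zip(y, a))
    d1 = sum((yi + ai) ** 2 for yi, ai in zip(y, a))
    return 0 if d0 <= d1 else 1


rng = random.Random(0)
a = [1.0, -2.0, 0.5]                      # illustrative signal vector
trials = [[rng.gauss(0, 2) for _ in a] for _ in range(1000)]
agree = all(ml_by_projection(y, a) == ml_by_distance(y, a) for y in trials)
```

The agreement follows from ‖y + a‖² − ‖y − a‖² = 4⟨y, a⟩, so the two rules can differ only through rounding when y lies essentially on the perpendicular bisector.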
Another way of viewing (21), and perhaps the simplest, is to view it in a different co-
ordinate system. That is, choose φ1 = a/‖a‖ as one element of an orthonormal basis
for the k-vectors and choose another k−1 orthonormal vectors by the Gram-Schmidt
procedure. In this new co-ordinate system, ⟨y, a⟩ = ‖a‖⟨y, φ1⟩ so the left side of (21) is
simply ⟨y, φ1⟩, i.e., the size of the projection onto a. This might not look any simpler
at first, but what we have done is convert a k-dimensional problem into a 1-dimensional
problem. Expressed in terms of the 1-dimensional problem in (13), the observation y in
(13) is ⟨y, φ1⟩, and the signal a in (13) is ‖a‖. The noise in the other k − 1 directions
is independent both of the signal and the noise in the signal direction and thus cancels
out in the likelihood ratio (as proven by the derivation of the MAP equation). This is
sometimes called the theorem of irrelevance, although we come back later to discuss this
further.
Figure 4: Decision regions for binary signals in WGN. A vector y on the threshold
boundary is shown. The distance from y to a is d = ‖y − a‖. Similarly the distance
to −a is d′ = ‖y + a‖. From (20), d² − d′² is proportional to the log likelihood ratio
ln Λ(y). From the Pythagorean theorem, however, d² − d′² = d̃² − d̃′², where d̃ and d̃′
are the distances from a and −a to the projection of y on the straight line generated
by a. This demonstrates geometrically why it is only the projection of y onto the line
generated by a that is important.
$$\frac{\langle Y, a\rangle}{\|a\|} = -\|a\| + \langle Z, \phi_1\rangle$$
The mean and variance of this, given H=1, are −‖a‖ and N0/2. Thus, ⟨Y, a⟩/‖a‖ is
N(−‖a‖, N0/2). From (21), the probability of error given H=1 is the probability that
this rv exceeds N0 ln(η)/(4‖a‖). This is the probability that the noise ⟨Z, φ1⟩ is greater than
‖a‖ + N0 ln(η)/(4‖a‖). Normalizing as in subsection 3.1,
$$\Pr\{e \mid H=1\} = Q\left(\frac{\|a\|}{\sqrt{N_0/2}} + \frac{\sqrt{N_0/2}\,\ln \eta}{2\|a\|}\right)$$
By the same argument,
$$\Pr\{e \mid H=0\} = Q\left(\frac{\|a\|}{\sqrt{N_0/2}} - \frac{\sqrt{N_0/2}\,\ln \eta}{2\|a\|}\right)$$
It can be seen that this is the same answer as we get from (15) and (16) by first putting the
problem in a coordinate system where a is collinear with one of the coordinate vectors.
The energy per bit is Eb = ‖a‖², so that (17) and (18) follow as before. This is not
surprising, of course, since we have seen that this vector decision problem is identical to
the scalar problem when we use the appropriate basis. Note that we have established one
extra fact by going through the vector case - observations in directions orthogonal to the
signal are in fact irrelevant. This is not surprising since the noise in those directions is
independent both of the noise in the signal direction and of the signal itself.
For most communication problems, the a priori probabilities are assumed to be equal so
that η = 1. Thus, as in (19),
$$\Pr\{e\} = Q\left(\sqrt{\frac{2E_b}{N_0}}\right) \qquad (22)$$
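The conditional error probabilities and their ML special case can be evaluated numerically. This is a sketch under assumptions: Q(x) is taken to be the standard Gaussian tail probability, computed here from the complementary error function, and the names `q_function` and `error_probabilities` as well as the parameter values are illustrative.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = Pr{N(0,1) > x}."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def error_probabilities(norm_a, n0, eta):
    """Return (Pr{e|H=1}, Pr{e|H=0}) for antipodal signals in WGN."""
    sigma = math.sqrt(n0 / 2)                      # noise standard deviation
    shift = sigma * math.log(eta) / (2 * norm_a)   # threshold offset term
    return (q_function(norm_a / sigma + shift),
            q_function(norm_a / sigma - shift))


# Illustrative values: ||a|| = 1.5, N0 = 1.
norm_a, n0 = 1.5, 1.0
pe1, pe0 = error_probabilities(norm_a, n0, eta=1.0)

# ML case: both conditional error probabilities equal Q(sqrt(2 Eb / N0)).
eb = norm_a ** 2
pe_ml = q_function(math.sqrt(2 * eb / n0))

# A threshold eta > 1 favors hypothesis 1.
pe1_biased, pe0_biased = error_probabilities(norm_a, n0, eta=3.0)
```

Raising η above 1 lowers the error probability given H=1 and raises it given H=0, as the signs of the two correction terms indicate.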
This gives us a useful sanity check: the probability of error does not depend on the
orthonormal coordinate basis. As before, the error probability depends only on the distance
between the signals. The energy per bit, however, is Eb = ‖a‖² + ‖c‖² so that the center
point c contributes to the energy used, but not to the error probability.
The likelihood under H=1 is the same except for changing the sign of a. Using the result
of (21), the threshold test is
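$$\frac{\langle y, a\rangle}{\|a\|} \;\overset{\hat{H}=0}{\underset{\hat{H}=1}{\begin{smallmatrix}\ge\\<\end{smallmatrix}}}\; \frac{N_0 \ln(\eta)}{4\|a\|}. \qquad (23)$$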
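This threshold test involves the inner product ⟨y, a⟩ over the 2k dimensional real vector
space. This inner product, written out in terms of the real and imaginary parts of u and
v, is
$$\langle y, a\rangle = \sum_{j=1}^{k} \left[\Re(u_j)\Re(v_j) + \Im(u_j)\Im(v_j)\right] = \sum_{j=1}^{k} \Re(u_j^* v_j) = \Re(\langle v, u\rangle)$$
In terms of the complex vectors, the threshold test (23) then becomes
$$\frac{\Re(\langle v, u\rangle)}{\|u\|} \;\overset{\hat{H}=0}{\underset{\hat{H}=1}{\begin{smallmatrix}\ge\\<\end{smallmatrix}}}\; \frac{N_0 \ln(\eta)}{4\|u\|}. \qquad (24)$$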
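The identity between the 2k-dimensional real inner product and the real part of the complex inner product is easy to confirm numerically. The sketch below uses made-up complex vectors u and v and hypothetical helper names; it follows the convention ⟨v, u⟩ = Σ v_j u_j*.

```python
def real_inner_2k(u, v):
    """Inner product of the 2k-dimensional real vectors built from u and v."""
    return sum(uj.real * vj.real + uj.imag * vj.imag for uj, vj in zip(u, v))


def complex_inner(v, u):
    """Complex inner product <v, u> = sum_j v_j * conj(u_j)."""
    return sum(vj * uj.conjugate() for vj, uj in zip(v, u))


# Illustrative complex 3-vectors.
u = [1 + 2j, -0.5j, 3 + 0j]
v = [0.5 - 1j, 2 + 2j, -1 + 0.25j]

lhs = real_inner_2k(u, v)          # stacked real and imaginary parts
rhs = complex_inner(v, u).real     # real part of the complex inner product
```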
In the complex vector space, we can define an orthonormal basis in which φ1 = u/‖u‖.
Then ⟨v, u⟩/‖u‖ is the (complex valued) component of v (properly scaled) in the direction
of u. Viewing the complex plane as a two dimensional real space, then, ℜ(⟨v, u⟩)/‖u‖
is the further projection of this two dimensional value in the direction of u. We will
interpret this further when we discuss detection for QAM signals.
The other results and interpretations of the last subsection remain unchanged. In partic-
ular, since ‖u‖ = ‖a‖, the error probability results are given by
$$\Pr\{e \mid H=1\} = Q\left(\frac{\|u\|}{\sqrt{N_0/2}} + \frac{\sqrt{N_0/2}\,\ln \eta}{2\|u\|}\right) \qquad (25)$$
$$\Pr\{e \mid H=0\} = Q\left(\frac{\|u\|}{\sqrt{N_0/2}} - \frac{\sqrt{N_0/2}\,\ln \eta}{2\|u\|}\right) \qquad (26)$$
For the ML case, recognizing that ‖u‖² = Eb, we have the familiar result
$$\Pr\{e\} = Q\left(\sqrt{\frac{2E_b}{N_0}}\right) \qquad (27)$$
Assume that the sample functions of Z(t) are L2 functions and represent Z(t) as
$$Z(t) = \sum_j Z_j \phi_j(t) \quad \text{where} \quad Z_j = \int Z(t)\,\phi_j^*(t)\,dt,$$
where the Zj are independent and, over the range of interest, are iid with iid real and
imaginary parts, each N (0, N0 /2).
The output is modelled as
$$V(t) = \sum_j V_j \phi_j(t) \quad \text{where} \quad V_j = \pm u_j + Z_j$$
Initially assume that, for any given k, the sample values v1, . . . , vk of V1, . . . , Vk are ob-
served. If we view the input as u′(k) = (u1, . . . , uk)^T, this is then the exact problem
that we solved in the previous subsection. The norm ‖u‖ in that solution is given here
as
$$\|u'(k)\| = \sqrt{\sum_{j=1}^{k} |u_j|^2}.$$
The observation is equivalent to observing the projection
of the received waveform on the space spanned by φ1(t), . . . , φk(t). As k is increased,
the observed waveform approaches v(t) in the L2 sense and the waveform Σ_{j=1}^{k} uj φj(t)
approaches u(t) in the L2 sense. We thus have
$$\langle v, u\rangle = \int v(t)\,u^*(t)\,dt = \lim_{k\to\infty} \sum_{j=1}^{k} v_j u_j^*$$
$$\|u\|^2 = \int |u(t)|^2\,dt = \lim_{k\to\infty} \sum_{j=1}^{k} |u_j|^2$$
The MAP test and associated error probabilities are then given by (24)-(27). We have
been slightly cavalier in going to the limit above. The real issue here, however, is not
a mathematical issue but rather a modeling issue. We have assumed that the noise
variables are iid over the region of interest, and the region of interest is the set of degrees
of freedom over which uj is non-zero. A simpler way to understand this modeling issue is
to use an orthonormal expansion in which φ1 (t) = u(t)/kuk. In this expansion, the only
modeling assumption is that the noise in the other degrees of freedom (i.e., Z2 , Z3 , . . .) is
independent of Z1 .
This says that the received signal should be passed through the filter matched to u(t) and
sampled before performing the threshold test. Recall that for QAM modulation, we used a
modulating pulse that was orthonormal to its time shifts. The received lowpass waveform
was then passed through a matched filter to avoid intersymbol interference. This says
that the matched filter is also necessary to achieve maximum likelihood detection (i.e., to
minimize error probability with equiprobable signals). We will come back shortly to look
more carefully at the issue of sending successive signals.
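A discrete-time analogue may help make the matched filter statement concrete. The sketch below is an assumption-laden illustration, not the continuous-time receiver of the notes: waveforms are replaced by short made-up sample sequences, and the matched filter is taken to be the time-reversed conjugate of u. Sampling the filter output at the right instant then yields the inner product ⟨v, u⟩ used in the threshold test (24).

```python
def convolve(x, h):
    """Full discrete convolution of two finite sequences."""
    out = [0j] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for k, hk in enumerate(h):
            out[i + k] += xi * hk
    return out


# Illustrative sampled signal u and noisy received samples v.
u = [1 + 1j, 2 - 1j, -1 + 0.5j]
v = [0.9 + 1.2j, 2.1 - 0.8j, -1.3 + 0.4j]

# Matched filter: h[n] = conj(u[N-1-n]), the time-reversed conjugate of u.
matched = [x.conjugate() for x in reversed(u)]
output = convolve(v, matched)

# Sampling at index N-1 gives sum_j v_j * conj(u_j) = <v, u>.
sample = output[len(u) - 1]
direct = sum(vj * uj.conjugate() for vj, uj in zip(v, u))
```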
We can now interpret the results of this section in terms of binary communication. We
are still assuming a “one-shot” system in which only one binary digit is sent. We have just
shown that the optimal receiver starts with a matched filter, which projects the output
onto the complex one dimensional space of the signal. The ML decision is then to choose
which of the two possible signals is closest to that complex output (this is what the real
value operation in (24) accomplishes). As illustrated in Figure 4, it is equivalent to view
this one dimensional complex matched filter output as a two dimensional real value and
project it onto the direction of u. Finally, it is also equivalent to directly choose whether
u or −u is closer in Euclidean distance to v .
At one level then, we have shown that ML detection for binary antipodal signals is accom-
plished by simply choosing the hypothesized signal closest to the received signal. That is,
ML detection is minimum distance detection. We have also seen, however, that minimum
distance detection can be accomplished by first projecting the received signal onto the
complex one dimensional space of the signal viewed at baseband. It is easy to see that
it is also accomplished by first going from a continuous time output to a discrete time
output and then using the vector results above directly. Finally, if u(t) is real and we pass
it through the matched filter, the sampled output corresponds to the model of subsection
3.1.
It is important to note that the performance of binary antipodal communication in WGN
depends only on the energy of the waveform and not on the structure of the signal. In a
sense, this is what WGN means. It is uniform in all directions, so that all signal choices
(within the range over which the noise is white) behave the same way.
We now have all the machinery needed to proceed to m-ary signal sets rather than binary
signal sets. Before doing that, we investigate the effect of varying the threshold in binary
threshold tests. In most conventional data transmission situations, using ML decisions
(i.e., a unit threshold) is sufficient. However, there are many closely related problems,
such as in radar and in channel measurement, where a more general approach is needed.
Define q0(0) as lim_{η→0} q0(η) and q1(0) as lim_{η→0} q1(η). Clearly q0(0) = 0 and in typical
situations q1(0) = 1. More generally, q1(0) = Pr{Λ(Y) > 0 | H=1}. In other words, q1(0) is
less than 1 if there is some set of observations that are impossible under H=0 but have
positive probability under H=1. Similarly, define q0(∞) as lim_{η→∞} q0(η) and q1(∞) as
lim_{η→∞} q1(η). We have q0(∞) = Pr{Λ(Y) < ∞ | H=0} and q1(∞) = 0.
Finally, for an arbitrary test A, threshold or not, denote Pr{e | H=0} as q0 (A) and
Pr{e | H=1} as q1 (A).
Using (28), we can plot q0 (η) and q1 (η) as parametric functions of η; we call this the
error curve (see footnote 3). Figure 5 illustrates this error curve for a typical detection problem such
as (17) and (18) for antipodal binary signalling. We have already observed that, as the
threshold η is increased, the set of y mapped into Ĥ=0 decreases, thus increasing q0 (η)
and decreasing q1 (η). Thus, as η increases from 0 to ∞, the curve in Figure 5 moves from
the lower right to the upper left.
[Figure 5: the error curve, with q0(∞) and 1 marked on the axes.]
Figure 5 also shows a straight line of slope −η through the point (q1(η), q0(η)) on the
error curve. The following lemma shows why this line is important.
Lemma 1: For each η, 0<η<∞, the line of slope −η through the point (q1(η), q0(η))
lies on or beneath all other points (q1(η′), q0(η′)) on the error curve, and also lies on or beneath
(q1(A), q0(A)) for all tests A.
Before proving this lemma, we give an example of the error curve for a discrete
observation space.
Example of Discrete Observations: Figure 6 shows the error curve for an example in
which the hypotheses 0 and 1 are again mapped 0 → +a and 1 → −a. The observation
Y , however, can take on only the four discrete values +3, +1, −1, −3, with the conditional
probabilities for H=0 and 1 given in the figure. Since Λ(y) takes on only four possible
values, the conditional distribution function of Λ(Y ) conditional on either H=1 or H=0 is
constant except for jumps at its four possible values, 1/4, 2/3, 3/2, 4. This means that the
threshold test can change only at those four values. Thus the threshold test is constant
over each of the ranges 0 < η ≤ 1/4, 1/4 < η ≤ 2/3, 2/3 < η ≤ 3/2, 3/2 < η ≤ 4,
and η > 4. For 0 < η ≤ 1/4, the test chooses Ĥ=0 for all y. Consequently,
q1(η) = 1 and q0(η) = 0 over this range. Similarly, for 1/4 < η ≤ 2/3, the test chooses
Ĥ=1 for y = −3 and Ĥ = 0 otherwise, resulting in q1(η) = 0.6 and q0(η) = 0.1 over this
range. Successive ranges of η lead to more observation values being decoded as Ĥ=1.
Footnote 3: In the radar field, one often plots 1 − q0(η) as a function of q1(η). This is called the receiver operating
characteristic (ROC). If one flips the error curve vertically around the point 1/2, the ROC results.
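The points of the error curve for this discrete example can be computed directly. The conditional probabilities appear only in the figure, so the values below are an assumption: they are chosen to reproduce the likelihood ratios 1/4, 2/3, 3/2, 4 and the values q1 = 0.6, q0 = 0.1 quoted for thresholds just above 1/4. Exact fractions are used to avoid rounding at the threshold values.

```python
from fractions import Fraction as F

# Assumed conditional probabilities of Y given H=0 and H=1.
p_given_0 = {3: F(4, 10), 1: F(3, 10), -1: F(2, 10), -3: F(1, 10)}
p_given_1 = {3: F(1, 10), 1: F(2, 10), -1: F(3, 10), -3: F(4, 10)}


def error_point(eta):
    """Return (q1, q0) for the threshold test: decide 0 iff Lambda(y) >= eta."""
    # q0 = Pr{decide 1 | H=0}: observations whose likelihood ratio is below eta.
    q0 = sum(p for y, p in p_given_0.items() if p / p_given_1[y] < eta)
    # q1 = Pr{decide 0 | H=1}: observations whose likelihood ratio is >= eta.
    q1 = sum(p for y, p in p_given_1.items() if p_given_0[y] / p >= eta)
    return q1, q0


ratios = sorted(p_given_0[y] / p_given_1[y] for y in p_given_0)

# One threshold from each of the five ranges on which the test is constant.
points = [error_point(eta) for eta in (F(1, 8), F(1, 2), F(1), F(2), F(5))]
```

Under these assumed probabilities the five points are (1, 0), (0.6, 0.1), (0.3, 0.3), (0.1, 0.6) and (0, 1), written as (q1, q0).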
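[Figure 6: the error curve for the discrete observation example, with the conditional probabilities of Y given H=0 and H=1.]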
The lemma shows that if the error curve gives q0 (η) as a differentiable function of q1 (η)
(as in the case of Figure 5), then the line of slope −η through (q1 (η), q0 (η)) is a tangent,
at point (q1 (η), q0 (η)), to the error curve. Thus in what follows we denote this line as the
η-tangent to the error curve. Note that the error curve of Figure 6 is not really a curve
at all, but the η-tangent, as defined above and illustrated in the figure for η = 2/3, still
lies on or beneath all points of the error curve and all achievable points (q1 (A), q0 (A)), as
proven above.
Since, for each test A, the point (q1(A), q0(A)) lies on or above each η-tangent, it also
lies on or above the supremum of these η-tangents over 0 < η < ∞. It also follows,
then, that for each η′, 0 < η′ < ∞, (q1(η′), q0(η′)) lies on or above this supremum. Since
(q1(η′), q0(η′)) also lies on the η′-tangent, it lies on or beneath the supremum, and thus
must lie on the supremum. We conclude that each point of the error curve lies on the
supremum of the η-tangents.
Although all points of the error curve lie on the supremum of the η-tangents, all points
of the supremum are not necessarily points of the error curve, as seen from Figure 6. We
shall see shortly, however, that all points on the supremum are achievable by a simple
extension of threshold tests. Thus we call this supremum the extended error curve. For
the example in Figure 5 the extended error curve is the same as the error curve itself. For
the discrete example in Figure 6, the extended error curve is shown in Figure 7.
Figure 7: The extended error curve for the discrete observation example of Figure
6. From Lemma 1, for each slope −η, the η-tangent touches the error curve. Thus,
the line joining two adjacent points on the error curve must be an η-tangent for its
particular slope, and therefore must lie on the extended error curve.
To understand the discrete case better, assume that the extended error curve has a
straight line portion of slope −η* and horizontal extent γ. This implies that the distribu-
tion function of Λ(Y ) given H=1 has a discontinuity of magnitude γ at η*. Thus there
is a set Y* of one or more y with Λ(y) = η*, Pr{Y* | H=1} = γ, and Pr{Y* | H=0} = η*γ.
For a MAP test with threshold η*, the overall error probability is not affected by whether
y ∈ Y* is detected as Ĥ=0 or Ĥ=1. Our convention is to detect y ∈ Y* as Ĥ=0, which
corresponds to the lower right point on the straight line portion of the extended error
curve. The opposite convention, detecting y ∈ Y* as Ĥ=1, reduces the error probability
given H=1 by γ and increases the error probability given H=0 by η*γ, i.e., it corresponds
to the upper left point on the straight line portion of the extended error curve.
Note that when we were interested in MAP detection, it made no difference how y ∈ Y*
was detected for the threshold η*. For the Neyman-Pearson test, however, it makes a great
deal of difference since q0(η*) and q1(η*) are changed. In fact, we can achieve any point
on the straight line in question by detecting y ∈ Y* randomly, increasing the probability
of choosing Ĥ=0 to approach the lower right end point. In other words, the extended
error curve is the curve relating q1 to q0 using a randomized threshold test. For a given
η*, of course, only those y ∈ Y* are detected randomly.
To summarize, the Neyman-Pearson test is a randomized threshold test. For a constraint
α on Pr{e|H=1}, we choose the point α on the abscissa of the extended error curve and
achieve the corresponding ordinate as the minimum Pr{e|H=0}. If that point on the
extended error curve lies within a straight line segment of slope −η*, a randomized test is
used for those observations with likelihood ratio η*.
Since the extended error curve is a supremum of straight lines, it is a convex function.
Since these straight lines all have negative slope, it is a monotonic decreasing function (see footnote 2).
Thus, Figures 5 and 7 represent the general behavior of extended error curves, with the
slight possible exception mentioned above that the end points need not have one of the
error probabilities equal to 1.
The following theorem summarizes the results about Neyman-Pearson tests.
Theorem 4.1 The extended error curve is convex and strictly decreasing between
(q1(∞), q0(∞)) and (q1(0), q0(0)). For a constraint α on Pr{e|H=1}, the minimum value
of Pr{e|H=0} is given by the ordinate of the extended error curve corresponding to the
abscissa α and is achieved by a randomized threshold test.
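A randomized threshold test meeting a constraint α can be sketched for a discrete observation as follows. This is an illustration under assumptions: the function name `neyman_pearson` is hypothetical, the conditional probabilities are made-up values with likelihood ratios 4, 3/2, 2/3 and 1/4, and every observation is assumed to have nonzero probability under H=1. Observations are assigned to Ĥ=0 in order of decreasing likelihood ratio until the allowance α for Pr{e | H=1} is used up, and the observation at the boundary is decided randomly.

```python
from fractions import Fraction as F


def neyman_pearson(p_given_0, p_given_1, alpha):
    """Return ({y: Pr{decide 0 | y}}, q0) minimizing q0 subject to q1 <= alpha."""
    # Largest likelihood ratio first: these observations favor H=0 most.
    order = sorted(p_given_0,
                   key=lambda y: p_given_0[y] / p_given_1[y], reverse=True)
    budget = alpha            # remaining allowance for Pr{decide 0 | H=1}
    decide_0 = {}
    for y in order:
        # Decide 0 with probability 1 if affordable, else with a fraction.
        share = min(F(1), budget / p_given_1[y])
        decide_0[y] = share
        budget -= share * p_given_1[y]
    q0 = 1 - sum(p_given_0[y] * decide_0[y] for y in order)
    return decide_0, q0


# Assumed conditional probabilities (illustrative).
p_given_0 = {3: F(4, 10), 1: F(3, 10), -1: F(2, 10), -3: F(1, 10)}
p_given_1 = {3: F(1, 10), 1: F(2, 10), -1: F(3, 10), -3: F(4, 10)}

rule, q0 = neyman_pearson(p_given_0, p_given_1, F(2, 10))
```

With α = 0.2 the observation y = 1 is decided as Ĥ=0 half of the time, and the resulting q0 = 0.45 lies on the straight line joining the points (0.1, 0.6) and (0.3, 0.3), written as (q1, q0).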
There is one more interesting variation on the theme of threshold tests. If the a priori
probabilities are unknown, we might want to minimize the maximum probability of error.
That is, we visualize choosing a test followed by nature choosing H to maximize the prob-
ability of error. Our objective is to minimize the probability of error under this worst case
assumption. The resulting test is called a minmax test. It can be seen geometrically from
Figures 5 or 7 that the minmax test is the randomized threshold test at the intersection
of the extended error curve with a 45° line from the origin.
If there is symmetry between H = 0 and H = 1 (as in the Gaussian case), then the
extended error curve will be symmetric around the 45° line, and the threshold
will be at η = 1 (i.e., the ML test is also the minmax test). This is an important result
for Gaussian communication problems, since it says that ML detection, i.e., minimum
distance detection is robust in the sense of not depending on the input probabilities. If
we know the a priori probabilities, we can do better than the ML test, but we can do no
worse.
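For the symmetric Gaussian case, the minmax property of the ML threshold can be illustrated with a simple grid search. The sketch assumes the antipodal error probability expressions with Q taken as the standard Gaussian tail, uses illustrative values ‖a‖ = 1 and N0 = 1, and searches only over a finite grid of thresholds, so it is a numerical check rather than a proof.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = Pr{N(0,1) > x}."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def worst_case_error(norm_a, n0, eta):
    """max(q0, q1) for the threshold test with threshold eta."""
    sigma = math.sqrt(n0 / 2)
    shift = sigma * math.log(eta) / (2 * norm_a)
    q1 = q_function(norm_a / sigma + shift)   # Pr{e | H=1}
    q0 = q_function(norm_a / sigma - shift)   # Pr{e | H=0}
    return max(q0, q1)


# Grid of thresholds eta = exp(k/10); k = 0 gives eta = 1 exactly.
etas = [math.exp(k / 10) for k in range(-30, 31)]
best_eta = min(etas, key=lambda eta: worst_case_error(1.0, 1.0, eta))
```

Moving η away from 1 in either direction lowers one conditional error probability but raises the other, so the larger of the two is smallest at η = 1.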
Footnote 2: To be more precise, it is strictly decreasing between the end points (q1(∞), q0(∞)) and (q1(0), q0(0)).