T/ Iitsch,: Christina Breining, Dreiseitd, Etkrhard Xansler, Scheder, SCMDQ Andjan Henning

Christina Breining, Pia Dreiseitd, EtKrhard Xansler, kidreas Mader,
Be t\;iitsch,Henning &der, mas Scheder, Gerhard S c M d q and Jan Tilp
he problem of acoustic echo cancella-

tion arises wherever a loudspeaker
and a microphone are placed such
that the microphone picks up the sig-
nal radiated by the loudspeaker and its reflec-
tions at the borders of the enclosure. As a result,
the electroacoustic circuit may become unstable
and produce howling. In the case oftelecommu-
nication systems, the users are annoyed by listen-
ing to their own speech delayed by the
round-trip time of the system. To avoid these
problems, the attenuation of the acoustic path
between the loudspeaker and microphone must
be sufficiently high. Classically, this is achieved
by acoustic means. In the case of a public address
system, e.g., highly directionally sensitive loud-
speakers and microphones or arrays of loud-
speakers and microphones are located so that
they do not "see" each other. In telephone sys-
tems, the loudspealcer and the microphone are
combined in a handset providing an attenuation
ofat least 45 dB. Hands-free telephony was orig-
inally made possible by degrading the
fd-duplcx channel to a half-duplcx channcl by
the use of voice-activated switches. Controlling
these switches, however, turns out to be am ex-
tremely difficult task. communication over such
systems requii-es highly disciplined speakers.
With increasingly more signal processing
power becoming available in recent years, the
application of an adaptive filter parallel to the
loudspeaker-enclosure-microphone system
(LEM system) became economically feasible
(Fig. 1j. If one succeeds in matching the impulse
42 JULY 1999
response c(n) of the filter exactly with the impulse re-
sponse g(%)of the LEM system, the signals x(n)and e(.)
will be perfectly dccoupled without any disturbing effects
to the users of the electroacoustic system.
In this contribution, we mainly deal with hands-flee
telephone system. These are monophonic systems, and
their characteristics are replated by the International
Telecommunication Union (ITU),The most severe re-
strictions for signal processing are the tolerable delay
times: only 2 ms are allowed for stationary telephones Matrices :re mdicated by capital boldface letters. Noms
and only 39 ms for mobile telephones. In audio or video are alwaF L, norms. Frequently used abbreviations are
confirence systems, stereophonic audio systems are dcsir- listed in Table 1.
able. Because signals in both channels are highly corre-
lated, there is no unique solution for the iinpulse Scenario
responses of the echo-cancellation filters to be applied in
both channels. Healing aids form a special class of audio
systems. The problem of acoustic feedback arises when Loudspeaker-Enclosure-Microphone(LEM)
closed car molds slip out of the ear or when open molds Systems
are used. The LEM system here is characterized by a very The central elements in acoustic echo cancellation are a
sniall enclosure leading to a short impulse response. Fi- 1oudspeak.er and a microphone placed within one cnclo-
nite-iiupi~se-response(FIR) echo-cancellation filters sure. For low Found pressure and 110overload of the AID
converters, this system may be modeled with sufficient
need only on the order of 10 coefficients. On the other
hand, hearing aids are battery powered. Therefore, power
consumption of the signal processing hardware is a major
issue. The time delay introduced by echo cancellation
should not exceed 1 nis. Finally, voice-control systems are
gaining more attention. Here the microphone output is
fed into a speech-recognition system requiring a mini-
mum signal-to-noise ratio. In applications where loud-
speaker signals are present, like computer games or Local
remote control of television sets, cancellation of the sfni Speech
Signal
acoustical echoes can significantly improve the recogni-
tion rates. vfni Local
NOlSe
The problem of acoustic echo control has gained con-
Far-End
siderable attention during the last decade. Sunreys con- Speaker
taining further references may be foundin [32] and [23].
This article is divided into three major sections: The 1. Loud~,~eaker-enclosure-microphone (LEM) system with
echo-cancellationfilter (ECQ and the notation usedin this article
first explaius the properties of LEM systems and how to
model them by an adaptive filter. Furthermore, it briefly
describes the properties of the speech signals involved.
The second, and main section, is devoted to the solution
of the echo cancellation problem by the application of
adaptive filters. In the third section, we show how to cope
with iinplemcntatioii problems caused by the need to use
inexpensive signal prcicessiiig hardware.
Notation
Throughout this article, we use notation according to
Fig. 1. scala^ va?iahles (signals) are written with l o w
crcasc lctrcrs [e.g., x ( n ) , d ( n ) ] . For vcctom we use
lowercase boldface letters, e.g., c(n) , g(n),
0 50 100 150 200 250
Time (ms)
4%)= (c[) (75)> cl (u),
" "cN-i (n))' ,
A 2. Impuke response measuredin an office(sampling fre-
quency+ kHz).
JULY 1999 IEEE SIGNAL PROCESSING MACPiZlNE 43

~
. __
accuracy as a linear system. Moving objects or changing .. .
Table
~.
1. Frequently
.
Used .Abbreviations.
. .-. . .
the temperature results in a time-variable impulse re-
sponse. Its specificshape depends on the size of the enclo- AID Analog/digital
sure, the reflection properties of its boundaries, and the
position of objects (especiallythe loudspeaker and the mi- AEC Adaptive echo canceller
crophone) within the enclosure. Depending on the appli-
cation, it may be possible to design this system such that
the reverberation time is small, resulting in a short im-
pulse response. Examples of this solutioii are telecoirunu-
nication studios. On the other hand, electronic means are
the only tools to provide hands-free communication out
of ordinary ofice rooms or cars, for example.
In general, the acoustic coupling within an enclosure is
formed by a direct path between the loudspeaker and the
microphone, and a very large number of echo paths. The
impulse response can be described by a sequence of delta ERLE 1 Echo-return loss enhancement I
impulses delayed proportionally to the geometrical length
of the related path. Reflectivity of thc boundaries of the en-
closure and the path Ieiigtli determiiie the impulse ampli-
tude [ 3 ] .The reverberation time of an office is typically on
the order of a few hundred milliseconds, of the interior of a
car, a few tens of milliseconds. Figs. 2 and 3 show impulse
responses of LEM systems measured in an ofice and in a ITU 1 InternationalTelecommunication Union I
car. The microphone signals have been sampled at an
8-kHz rate. These impulse responses are highly sensitive to LEM LoudspLahr-enclosure-inicrophonr
any changes within the LEM system. This is explained by
the fact that, assuming a sound velocity of 343 m/s and an
8-lcHz sampling frequency, the distance traveled between
two sampling instants is 4.3 cm. Therefore, a 4.3-cm
change in the length of an echo path moves the related im-
pulse by one sampling interval. Thus, the need for an adap-
tive echo-cancellation fdter (ECF) is evident.
MLP Mulnlayer pcicepuon
The question of the optimal structure of the ECF has
been discussed extensively in the literature. Because a long NLMS Normalized LMS
impulse response must be modeled, a recursive (IIR) filter
seems best suited at first g h e e . At sccond glance, h o w Power spcctral density
ever, the impulse response cxhibitsa highly detailedand ir-
regular shape. To achieve a sufficiently good match, the Recursive least squares
replica must offer a large number of adjustable parameters.
Therefore, an IIR filter does not show an advantage SAEC Stereophonic xoustic echo cancellation
against anon-recursive (FIR) filter [48], [ 5 3 ] .Theirref'ut-
able argument for preferring an FIR filter, however, is its SNR Signal-to-noiseratio
guaranteed stability during adaptation. A measure to ex-
VAU Voice actmiry detcctor
press the effect of echo canccllatioii is the so-called
echo-return loss enhancement (ERLE):
XLMS Extended LMS
where d ( n )is the microphone output signal, assuming the and

loudspaker is the only signal source within the enclosure, N-L
and d(n)describes the ECF output. Denoting the impulse i(n)= cp(n - i).
response of the LEM system and the filter by g resp. c -3s- 2 4
(3)
suming time invariance for the moment-it follows that
where N 1is the degree of the (FIR) ECF. Assuming,
~
for simplicity, a stationary whitc input x(n)(Fig. l),the

ERLE can be expressed as
44 IEEE SIGNAL PROCESSING MAGAZINE JULY 1999

ERLE = 10 log 0.4
0.3
0.2
0.1
An upper boundary for the effect of a filter of degree
N - 1 i.e., a transversal filter with N taps, cai be calcu- 0
lated by assuming a perfect match of the ECF
-0.1
ci = o if o r O < i < N - l . (51 -0.2
In this case we have -0.3
-0.4
EIUE,n,,(N)=lOlog E,,O: dB, 0 10 20 30 40 50 60 70 80 90 100
c, Bi (6) I
Time (ms)
A 3. Impdse response measured in a passenger car (sampling

For further simplification, we (quite realistically)assume frequeney=d kHz).
an exponential decay of the impulse response g:
1 ~ , I = e - ~f”o r i t O , (7)
where 0 < U < 1. Inserting ( 7 )into (4) rcsults in Microphone
ERLE,,, ( N j = 10 log e2“. dB = 8.7 aN dK. (8)

Distance from the Dummy
to the Fix Paint:
The parameter a is related to thc reverberation time T,
being defined as the time necessary for a 60-dB decay of Measurement 1: 40 cm
the sound cnergy after switching off the sound source. Measurement 2: 90 cm
Ncglecting the time constants of the loudspeaker and
the microphone, a calculation equivalent to the one
~
Reflector,.:’
above leads to
8.7 aN, dK = 60 dB, (9)

where N , = T,fs,fsdenoting thc sampling frequency. A 4. Measurement setup for the impulse response.
Inserting the parameter n from ( 9 )into (8) leads to
ms. Short-time variances may differ by more than 40 dB
[43]. Sampling frequencies range from 8 IcHz in telc-
phone systcms, up to about 40 liHz in high-fidelity sys-
tems. Even in the case of the 8-kHz sampling frequency,
where T , = N / f,. This result shows that the maximally coiisecutivc samples are highly correlated. The normal-
achicvable ERLE depends linearly on the order N - 1of ized autocorrelation coefficient of neighboring samples
the ECF. An echo attcnuation of 60 dB requires an im- assumes values in the range of 0.8 to O.Y. Short-time
pulse responsc length oftlie FIR ECF of at k a t TR. autocorrelation matrices very often become singular.
Thus, special precautions are necessary to prevent insta-
bility in :ilgoritlms that use-dircctly or indirectlythe
Speech Signals
inverse of the autocorrelation matrix. To summarize,
The pcrforniance of an adaptive filter depends crucially
speech signals pro\,ide a nonpersistent excitation for most
on the properties of the input and thc error signal. In the
adaptive algorithms.
application considcrcd here, we have to dcal primarily
with speech signals additidy disturbed by other speech
signals (in the case of double talk) and by noise. Per- Adaptiive Algorithms
forming signal p c c s s i n g 011 this typc of sigiral turns out
to be very difficult. Speech is characterized by nearly peri- In this section, we examine the convergence and traclung
odic segments, by noisc-like segments, and by pauses. behavior of three popular adaptive algorithms. For con-
The signal envelope fluctuates greatly. In speech proccss- vergence behavior, the initial adjustment of the adaptive
ing it is widely accepted that parameters derived from a filter to r’ne LEM impulse response is evaluated, whereas
speech signal have to be updated at intenrals of about 20 thc tracking behavior indicates how well thc adaptive fil-
JULY 1999 IEEE SIGNAL PROCESSING MAGAZINE 45

ter can f d o w room-impulse-response changes. We com- with normalized impulse responses:
pare the couvergence speed of the adaptive algorithms for
stationary (white and colored noise) and nonstationary
(speech) excitation siguals.
This relatively high value shows that movements of the

Simulution Environment local speaker lead to large changes ofthe room impulse re-
For the simulations considered in the followiug para- sponse. Therefore, good tracking behavior of the adap-
graphs, the LEM system is modeled with the help of two tive algorithms is important when dealing with echo
measured impulse responses. We start by adapting the fil- cancellation problems.
ter to the first impulse response to evaluate the conver- Other expressions, such as system distance or filter
gence speed and then, after 5 seconds, switch to the room mismatch, map be used equivdently for system error.
model of the secoud impulse response. We switch be-
cause we want to cxamine the tracking behavior of the al-
gorithms that is the reaction to room-impulse-response Short Description
changes. of the Implemented Algorithms
The LEM impulse responses were measured in an or-
dinary office with a floor space of 18.2 m2 a i d a height of The Normalized Least Mean Square (NLMS) Algorithm
2.77 m. Rcsides the volume, the impulse response is ex- The NLMS algorithm belongs to the category of stochas-
tremely depeudent on the size and the furnishing of the tic gradieut algorithms. The idea on which the NLMS al-
room. A room with concrete walls has a much longer re- gorithm is based and its adaptation rule are given in [34],
verberation time, which rcsults in a much longer impulse Table la.
response, than in a furnished room with carpeted floors
and wood-paneled walls. The office, whose impulse re-
sponse was measured, exhibits a reverberation time (echo
rcdnctiou of 60 dR) of about TIE= 290 ms. It has a car-
peted floor, two large windows, and is furnished like a
standard office.
Measurement Setup
To measure the impulse response, we use the arrange-
ment according to the ITU-T-P34 [42] recommcnda-
tion (Fig. 4).
A dummy models the user ofthe hands-free telephone.
Thc two impuise responses are measured for two differ-
em positions of the dummy without any othcr changes ui
the room between the two measurements. The measure-
ment signals are sampled at 8 liHz with a 16-bitA/D con-
vcrter. Floating-point operations arc used for the
calculation of the impulse responses. In Fig. 5, the first A 5.Measured impulse response.
impulse rcsporise is meawred for a distance of 40 cm be-
tween the dunimy aud the marlied fx point.
The direct echo path is clearly visible at fdtertap 16,
which is equivalent to a sound path of about 80 cm, corre-
sponding to h e loudspeal~er-microphone-distancein-
cluding small A/D-converter &laps.
Fig. 6 depicts the spectrum of the impulsc response
with its dominant lowfrequency components. Hence, we
can see that high frequencies are better absorbed thau low
ones. The second impulse response is meamred for a dis-
tance of 90 cm between the dummy and the fix point. Its
properties are comparable to the curvc shown in Fig. 5.
The system error norm betwcen the two measured im-
pulse responses is given by:
Frequency (Hz)
L 6. Spectrum of the impulse response

Analysis shows that this algorithm converges for
step-size0 < ~ ( P z<) 2 in case ofnndismrbed error signals, In coistrast to white noise, the
that is, for
autocorrelation matrix of speech
s(n)= V(?%) = 0, signals differs considerably from
Where the identity matrix.
y(nj = .(E)’ g(nj + s(n)+ v(n) (Fig. 1).
higher fir the fast versioiis, and is especially significant in
The chosen step-size must be smaller whcn local noise is
fixed-point implementations.
present (see “Control of Adaptive Filters”).
The choice of two values is cnrcial for the stability of
The convergence speed depends on the length of the
thc RLS algorithm:
adaptivc filter. In [27],it is shown that, in the case of
white Gaussian noise, the numher of adaptation stcps A The initialization of the inverse of the autocorrclation
necessary to reach a certain degree of compensation is matrix.
proportional to the filter length. A The choice of the hrgctting factor h.
In comparison to white noisc excitation, the conver- A simple and weU-known possibility is to initialize tlie
gence speed decreases significantlywhen speech is used as inverse of the autocorrelation matrix with a nndtiple of
excitation. In contrast to white noise, the autocorrelation tlie idcntity matrix (I):
matrix of speech signals differs considerably from the
identity matrix. This is an iniportant property, because R ’ (n = 0) = 6-l I
the convergence speed of the N I M S algorithm depends
on the relation of the largest to the smallest eigenvalue of Eor stationary input signals, one should choose
the autocorrelation matrix. The larger this ratio is, the 6 2 0.01 G:, where 0: is tlic variance of the excitation sig-
smaller the convergence spced. We uscd tlic step-size nal L40J, [64]. The choice of 6 is niuch more critical for
p ( n )= lfor the followingsiiiiulations, because 1is the 013- nonstationary excitation. Thercforc, an initialization with
timal step-size whcn the adaptation is not degraded by lo- 6 = 6F (1 h j is recommended, where 6 is the variance
-
cal signals (s(nj= (n)= 0). The small residual noisc of the excitation signal during tlie initialization period
component due to tlic ihct that we use longer impulse rc- 1601. A value for 6 that is too small cau lead to instabilities
sponsss to niodel the LEM tan than for the adaptive of the algorithm (overshoot phenomenon 14111j, whereas
filter is neglected here and influences only thc final mis- values that are too large reduce the convcrgcnce speed, es-
alignment. pecially dixing the start up of the adaptation.
The choice of thc forgetting factor h also influences the
stability, .the convergence, and the tracking bchavior of
The Affine Projection (AP) Algorithm the algorithm. We get tlic bcst tracking behavior for val-
The AP algorithm can be considered as an extension of
the NLMS algorithm. Its adaptation rule is given in 1341,
+
ues of about A = 1- (T,: filter Icngth). For reasons of
stability, however, a choice of h between 1 - & and
Tdbk 5C. 1 - is I-ecommended[60].
A convergence analysis can be peifbrnicd similarly to
For thc simulations we uscd 6 = 0.01 .‘and h = 0.999
that fix the NTMS algorithm, also resulting in the condi-
for stationary excitation and 6 = 1006: and h = 0.9999
tion0 < ~ ( n< 2) for the undisturbed case. Also compara-
for speech excitation.
blc to the NLMS algorithm is the decrease of
convergence speed with increasing filter length.
One can show that the computational complexity for Length and Final Minimum Nlisalignment of
the AP algorithm is N,, times higher [O(N,,Nj] than the Adaptive Filter
for the NLMS algorithm [O(Arj],wlicrc AT, denotes the The algoritlms are testcd with :in LEV model, with a n
order of the Al’ algorithm. impulse rrsponsc being somewhat longer than tlic adap-
For the same reasons as the NLMS algorithm, we tive filter. This is done for two reasons:
choose tlie step-size,u(n)= 1for the followingsimulatioiis. A In a real application, it is not possible to modcl tlic ai-
tire echo path with an adaptivc filtcr.
The Recursive-Least-Sqoores (RLS) Algorithm A For coinparison of the adaptive algorithms, thc initial
Tlic KLS algoi-irhnii I ~ c l ~ i ~
tog another
s class of algv- convergence and tracking behavior are of special interest.
ritlxns and is based on the minimization of the weighted Adaptation curvcs that excced -60 dR arc not realistic for
squarcd crror sum. Explanations ofthe algorithm and its echo-cancellation applications.
adaptation rule are given in [34], Table 43. We decided to test the algorithms with two different
Tlic KLS algorithm involves the risk of instability in- filter Iengtlis ( N = 256 and 1024) hcing realistic repre-
herent in its recursive adaptation rule. This risk is even sentatives for a car cabin and an office, respectively. Thc

impulse responses used for the simulations are described
in “Measurement Setup.”
To limit the final misalignment to values between 4 0
and -60 dB,the impulse response of the LEM system is
-
truncated to Z = 259 and 1030. The theoretical minimum
misalignment is given by
with1 I xi 1 =fi being the Z, norm andg(n) being nor-

malized (12). The theoretical minimum misalignment is
reached if the cotfficients of the adaptive fdter c(n) are
4 5 ~ 2 4 6 8
identical to the first N filter coefficients of the room n i d e l Time (s)
g(n). Thus, the minininm misalignments are: ~~
A 7.Convergence of the indicatedalgorithms (N = 256) for

(N dB
C O ~ V , , , ~ ~ = 256) = 4 0 . 3 white excitation (NLMS and AP: p(n) = 1, RLS: h = 0,999).
conv,, (N = 1024) = -578 dB. (14)
For the following comparkons of the adaptation curves it
is necessary to talie these values into account.
As the described choices of the adaptive filter and LEM
model lengths are identical for the two adaptive algorithms,
they are compared under the same, and thus fair, conditions.
In the following sections, we note that the algorithms
do not converge to the same steady-state solution. This is
caused by the more or less sensitive reactions of the algo-
rithms to the small error signal generated by the longer
LEM model in comparisoii to the adaptive filter.
The aim of this section is to investigate the conver-
gence and tracking behavior of the adaptive algorithms
without local disturbance. In order to obtain the highest
convergence speed, the best choice for the step-size is
one. We face the fact that the algorithms do not converge
Time (s)
to die same steady state as it is the task of a step-size con-
trol (see “Applications to Acoustic Echo Control”) to A 8. Convergence of the indicated algorithms (N= 1024) far
choose the step-size appropriately. One task of such a white excitation (NLMS andAP:p(n) = 1, RLS: h = 0,999).
control is to reduce the step-size when the adaptation has
nearly converged. Thus, it is possible to reach the theoret-
ical misalignment.
Convergence o f the Algorithms

In the following, we compare the convergence of the de-
scribed adaptive algorithms with the use ofthe system er-
ror norm
lolog(Ilc(n)-g(n)Il2 1 3 (15)
which is the squared norm of the difference between the
LEM model g(n) and the adaptive filter c(n). Supposing
that the lengths of g(n)and c(n) are different, the shorter
filterisaeropaddedforthecalcnlationofthenorm (13).
The special aim of our investigations is to underlint
the dependence of the described adaptive algorithms on
the input signals and filter lengths. We will see and work Frequency (Hz) I
out that the best choice for an algorithm is closely related A 9. Power spectral density (PSD) of the colored noise generated
to the individual application problem. by IIR filtering with 15 LPC coefficients.

Stationary input Signals
Comparing tlic NLMS with the AI’ algorithm for white
-5
iioisc excitation (Figs. 7 and S), no remai-kablcdifference
-g -- AP Order 2
is visible. The reason is that the described advantages of
tlic AP compared to the NIMS algorithm have no effect
- -10
fur white noise excitation. As expected, the convergence

g
z
-15
spccd decreases with increasing filter length. 0 -20

For the NI,MS algorithm, die minimum misalign- t
ments ofthe adaptive filters do not reach their tlicorcticd E
L
-25
limits (14).This happens because a disturbing sigual is

E
-30
gcncratcd by the filter coefticients ofthe LEiM model that
cxcceds the length of the adaptive filter. This disturbing -35
signal worsens thc final alignment cif the filter. Thus, a I
step-sire cotitrol is required. 2 4 6 8 10 12
Time (s)
The advantages of the Ap algorithm coinpared to tlic
NI.MS algorithm become obvious for colored excitation I IO. Coni,ergence of the indicated algorithms (N = 256) for col-
signals. The colored iioisc signal used for the simulations ored excitation (NLMS and AP:p(n) = 1 , RLS: h = 0.999).
is generated by Ali-filtering a white noise signal with 15
linear-prcdictivc-codiiig (LI’C) coefficients of a typical
speech signal. Hence, the colot-ed noise signal exhibits a
0
povvcr spectral density (PSI)) comparable tu the PSI1 ofa
- NLMS
11 signal (Fig. 9). l i i contrast to the very slow con-
verging NLiMS dgorithm, the second-order AF dgo-
rithin adapts m u c h fastcr (Figs. 10 and 11). The
performance improvement gained by increasing the or-
der (if the AP algorithm decreases with the ordcr of the
AP algorithm.
For stationary excitation, the convergence speed ofthe
IUS algorithm is higher than that ofthe NLiMS and the
AP algorithm. Additionally, it reaches the theorctical
iniiiinium misalignment of tlic adaptive filter for white
excitation.
The simulations confirm thc statement that tlie RLS
algorithm is cspccially suitable for colored, but stationary
excitation signals, hccausc the difference in convergence
11. Convergence of the indicated algorithms (N = 1024) for
speed for white and colorcd excitation is negligible.
colored excitation (NLMS andAP:p(n) = 1, RLS: h = 0.999).
Speech Exdation rithin ordcr from 5 to 10 for speech, in contrast to colored

To smooth the adaptation cnrves for speech excitation noise excitation. The rcasun for this effect is that an LPC
(Pigs. 12 and 13), we average over 10 different speech analysis of ordcr 15 is uot sufficient to completely dc-
signals. scribe the correlations of a speech signal.
The results indicate that the convcrgeiicc speed of the The RLS algorithm also converges much more slowly
NLMS algorithm for speech excitation is approximately for speech cxcitatioii than for stationary input signals.
comparable to its convergence speed for colored noise cx- This is especially d u e t u initialization of t h e
citation with a I’SD comparahlc to speech. For the Al’ al- antocorrelation matrix with a large 6, which is necessary
gorithm, however, the filter adapts much more slo\vly to ensure stability and to reduce the risk ofthe overshoot
compared to its adaptation speed with colored excitation. phenomenon 1411. Ncvertlielcss, the initial convergence
Comparing tlie AP algorithm for different orders, we of the RlLS algorithm is faster than for the other algo-
achieve, aiialogous to colored excitation, the largest gain rithms described.
in convergence spccd when we increase the ordcr ofthe
AP algorithm from 1 (NLMSalgorithms) to 2. The addi-
tional improvcrncnt gained by an incrcoscd ordcr of the Tracking Behavior
AP algorithm becomes progressively sinaller with all ill- There is ino noticeable difference between the traclung be-
creasing AI’ ordcr. This improvement, however, is still havior and the initial convcrgcncc of the NIMS and the
greater for speech than for the colored excitation signal. AP algorithm. The RLS algorithm, howwcr, behaves dif-
For this comparison, it is necessary to look at the incrcase ferently. The tracking ofthe KI.S algorithmis remarltably
of convergence speed when augmenting the Al’ algo- slower than the initial convergence, because the RLS al-

~
. .. .. .., .. .
0 2 4 6 8 10 12
Time (s) Time (s)
A 12. Convergence of the indfcated algorithms (N = 256) for A 13. Convergence of the indicatedolgorithms (N = 1024) for
speech excitation (NLMS andAP: v ( n ) = 1, RLS: h = 0,9999). speech excitation (NLMS andAP:K(n) = 1, RLS: h = 0,9999)
gorithm is based on the minimization of an exponentially again means a tradeoff between processing power and
weighted sum [26].The larger the resulting memory, performance.
that is, the closer h is to 1,the more slowly the U S algo- There are various possibilities for placing the
rithm can follow room impulse changes. On the other decorrelation filters within the system. For identification
hand, a forgetting factor close to 1 is necessary to ensure purposes, it would be sufficient to simply filter the excita-
stability for speech excitation. For a choice of tion signal. But because this signal also serves as loud-
h = 1 - lo4, the RLS algorithm is only superior to the speaker signal and is, therefore, transmitted into the
AP algorithm up to order 2. room, pre-processing is not possible. One can move this
In summary, the NLMS algorithm converges very linear system into the filter branch and also across the
slowly for correlated excitation signals. Therefore, the AP LEM system (Fig. 14).Because the filter is still in the sig-
algorithm leads to much better results. With respect to nal path to the far-end listener, one has to apply an inverse
the initial convergence, the U S algorithm shows the filter after the calculation of the error value. To invert the
fastest convergence. However, this does not bold in the linear prediction, one would have to use a recursive filter
case of trachig, where the RLS algorithm is inferior to and apply the same coefficients as in the prediction fdter.
AP algorithms of higher orders. Because a linear prediction error filter is minimum phase,
its inverse is always causal and, therefore, stable [ 361.
One can omit the decorrelation filter (and, conse-
NLMS Algorithm Using Decorrelation Filters quently, its inverse) in the signal path, in which case, the
Even though various lunds of adaptive algorithms arc adaptive filter also must model the inverse decorrelation
theoreticallv, aoolicable for acoustic echo-cancellationfi- filter [78].
I I
ters, in most cases, a simple and robust algorithm outper- A different approach for the application of
forms more sophisticated solutions. Therefore, in most decorrelationfiters, especiallyif adaptive decorrelationfi-
applications with limited precision and processing ters are used, is the implementation of an auxiliary loop for
power, the normalized LMS algorithm is applied. How- adaptation, asshowninFig. 15 [28], [65].Becausetheco-
ever, as it was shown in the previous section, the adapta- efficients ofthe echo-cancellationfilter are now copiedinto
tion performance of the NLMS algorithm, in the case of the echo-cancellation filter in the signal path, there is no
speech excitation, is rather poor due to the strong correla- more need for a decorrelationfdter or its inverse in the sig-
tion of speech signals. One way to overcome this Frob- nal path. In the case offmed-pointproccssing,where recur-
lem, is to pre-whiten or decorrelate the incoming sive fiters of a high order may cause stability problems, this
excitation signal before passing it to the adaptive algo- is of considerable advantage. iMoreover, for adaptive
rithm [36]-especially in thc field of echo cancellation, decorrelation filters it is preferable not to have constantly
decorrelation fdters are widely applied [28], [77], [78]. changing decorrelation fiters in the path of the estimated
Decorrelationfilters are predictor error filters, with their echo signal. Otherwise, the data vectors within the
coefficients matched to the correlation properties of the echo-cancellation fdter would have to be newly calculated
speech signal. Because speech signals are nonstationary, every time the decorrelation filter is updated which could
one has to decide whether the prediction coefficients lead to distorrioiis in the near-end signal.
should be adapted periodically to be exactly matched with Implementing this second path, however, results in an
the current section of the speech signal, or simply to the increased processor load in terms of both memory and
long-term statistics of speech. This, as we will see later, computational load. The filtering operation of the echo

canceller must bc performed twice, once with the original
signa1 of the far-end speaker, and once with the
decorrelated signal. This also means a second huffer for
tlic decorrelated excitation signal.
Fixed or Adaptive Decorrelatian Filters?

I I
Fixed decorrelation filtcrs can only compensate the mean
14. Placement of decorrelation and inverse decorrelation filters.
powcr spectral density of speech. The coefficients are,
therefore, calculated usiug a representative set of speech
sequences. This, of coursc, can never be an optimal match
to the present excitation signal. Therefore, the speed of
convcrgcnce can not be as fast as with an adaptive renewal
of tlie coefficients. The great advantage of fixed
decorrelation filters, however, is thcir ven-low coinputa-
tional complexity
Improved performance is achieved by the use of adap-
tive decorrelation filters. The filter coefficients arc
adapted periodically to the short-term characteristics of
the excitation signal 1281, [65], 1771.The coefficicntsfor
the decorrclatioii filter are calculated using the
Lcvinson-Dnrbin algorithm [36]. It is important, how-
ever, to stop tlic recursion for computing the cocfficients
of thc decorrelation filter when they outnnmbcr tlie rank 15.Auxiliary loop for the adaptation algorithm; the inverse
of tlie input autocorrelation matrix, because, in this case, decorrelation is no longer necessary
the autocorrelation matrix becoiiies singular. Due to the
non-stationarity of speech signals, this length can v a n must be used. However, as depicted in Fig. 16, at least 10
with time. However, because the Levinson-lhrbinalgo- coefficientsshould be implemented.
rithm is rank recursive,one can always fall baclito the pre- Also the number of data points needed for proper esti-
ceding order. mation of the correlation coefficients incrrascs with the
One should note that the increase of computatioual ordcr of the decorrclation filtcr. Hence, one has to make
complexity is not negligible. In addition to the computa- sure that the block size required for estimation of the cor-
tion oftlie new decorrelation coefficients,both the excita- rekation coefficients docs not exceed the timc interval in
tion vector and its vector uorin have to be updated. If which speech signals can be assumed to be stationary.
necessary, one can coinpensate for this additional compu- Simulations have shown that, for this application, oue
tational load by skipping updates ofthe echo canceller co- should consider the order of the decorrelation filters in
efficients. tlie range of 15 to 25.
Discussion of Predictor Order Choice of Coefficients for Fixed Decorrelation Filters

When discussing the order ofthe decorrelation filter, onc As was shown in Fig. 16, the gain yielded by only a
has to distinguish again between fmd and adaptivc first-order decorrelation filter is considerable. We now
decorrelation filters. For fixed decorrelation firers, an or- want to cxamiiie the influence of this prediction coeffi-
der higher than two seems to be useless, because only a cient on the spced of convergence. Because speech signals
long-term characteristic c m be modeled. For adaptive have a low-pass characteristic, the first order
decorrclation films, liowcver, additional gain can be decorrelation filtcr will be a simple high-pass filter. If an
achievcd by further enlarging t h e number of analysis is performed for a representative sct ofspeech sig-
decorrclation filter coefficients, Simulation results have nals, including both male and female voices, a inean value
been published by several authors [65], [ 7 8 ] . of about 0.9 is calculated for the dccorrelation coefficient.
Fig. 16 shows the iinprovcment in the speed of coii- Fig. 17shows that the misalignment does not reflcct a
vergence gained by decorrelation filters of various orders. large influence of the prediction coefficient. All convcr-
The effect of first-order filters is rcmarhhlc. With only grnce curves lie within tlic range of a few decibels of final
this very simplest furin of dccorrelation, a decrease in the system error norm if an average of 10 male and female
residual system error norni of about 10 dR was achieved voices is taken as excitation signal. I n Fig. 18, liowever,
in our simulation. In contrast, tlie additional gain of a sec- the difference betwecn an excitation with either male or
ond order, but still fixed decorrelation filter is negligible. female voices becomes obvious. Because thc malc voice,
When faster convcrgencc is required, adaptive filters in general, has more low-frequency components, it seems

matrix, because all rows or columns are linearly depend-
ent. Therefore, A(%)can never equal
RI:= [X'(n)X(nj]-'.
Despite this fact, one can show that the two algorithms are
equivaleut Xthe coefficients a(%)are adapted in each sample
period as the fonvard prediction error filter. In tllw case, the
first row or first colurmi of Ri aid A(%)are equal, leading
sensible to use higher decorrclatioii coefficicnrs. For an to the same adaptation rule for both algorithms.
excitation with a female voice, the adaptation perfor- Comparing the siinulation results of both the AP algo-
malice shows the opposite behavior. A switch betweell rithm and the NLMS algorithm with decorrelated ecita-
two fured coefficients for male and female speakers may, tion (Fig. 19), the perforinance of the latter algorithm
therefore, be of great advantage and overcome this mis- appears to be siniilar to an Ap algorithm of lower order.
alignment. Affine projection, however, has a superior performance,
especially in a quiet environnieiit without background
noise, rvhcre very-low system errors can be achieved.
Comparison of Decorrelotion to More Complex However, because the adaptation process of the
Adaplotion Algorithms decorrclatioii coefficients is only pcrfornicd periodically,
We have shown thus far that the use of decorrelation fil-
ters improves the performance of the NLMS algorithm
enormously. However, we still have to compare this tech-
nique to the more sophisticated adaptation algorithms.
Whcn we look closely at decorrelation filters, we sec that
this proccssing has an effect similar to the affine projec-
tion algorithm. Combining the decorrelation filters and
the NLMS algorithm delivers both thc AP algorithm and
the NLMS algorithm with decorrelation filters exhibiting
thc samc structure. As a result, we can concentrate on
these two algorithms.
Rccalliiig the adaptation equation of the NLMS algo-
rithm [34]
-35 .-.ip~L.J
0 2 4 6 8 10 12
Time ( s )
we rcplace e(%)with its decorrelated equivalent
P 16. Convergence of the NLMS algorithm using decorrelalion fi
ters with speech excifation.
er (n)= a' (*)e(%)
and x(n) with
x,(n)=[x(n),[x(n-1),...I a(n)=X(nja(nj
to yield
a(n)a"(n)
c(n + 1)= c(*j + f i ( n ) X ( n ) e(*)
. . . . . . . . . .
with
.........
Comparing (18) to the adaptation rule of the affine pro- Time ( s )

jection [34] (Table Sc), the only diffcrence appears innia- A 17. First-order decorrelation, excitafion with speech sequences
trix A(%).However, a(nj a T(nj always delivers a singular of bofh male and female speakers.

once every 20 ms, the decorrelation tcchiquc still appears
to be computationally less expensivc. Especially in the case
of limited processing power and, cvcii more importmtly,
finite word length, the tcchiquc of using decorrelated sig-
nals to enhance the performance of a very simple, but ro-
bust algorithm provcs to be superior to the stabilizing of
more sophisticatedalgorithms such as the R I 5 or AP algo-
rithms. In contmt to the latter nvo algorithms, the NLiMS
algorithm is easier to control, especially in a distorted env-
ronmcnt. Hcuce, for an applicationsuchas in the cabuiofa
car, it is advisablc to perform the adaptation with the ro-
bust NTMS algorithm.
For vcry-lowcost implementations, we suggest a
first-orderdecorrelationfiter with a fLxed coefficiciit. Ifa lit-
Time (s) tle more processing power is available, an adaptive
I
18. First-order decorrelation, excitation with speech sequences decorrelation filter oforder around 10-15in a auxiliary loop
of male and female speakers, respectively. iniplciiientdtion (Fig. 15) appcars to be advantageous.
Applications to Acoustic Echo Control
Subband Processing
Acoustic-echo-control systems can be rcalized either in
full-band or subband structures [46]. Both approaches
show adx-antages aid disadvantages, especially with re-
spect to computational complexity and signal delap. In
this section, subband stixctures will he introduced and
compared to full-band systems.
By using filterbanlo, the signal x(n) of the far-end
speaker and the microphone signal y(%)are split up into
several subbands (Fig. 20). Dcpcndiiig on the properties
of tlic 1ow-p"ss and baiidpass filters, the sampling rate in
thc subbands caii be reduced. According to this reduc-
tion, the length of the adaptive filters caii be also lowered.
A 19. Convergencecomporison between the NLMS algorithm Instead of filtering the fd-baud signal and adapting one
with adaptive decorrelation filters and the AP algorithm of 2nd
filter at the high sampliiig rate, Q (number of subbands)
and 5th orders.
convolutions and adaptations with
x(n)
1 - snbsamplcd sienals are pcrformed in
c
parallel at reduced sampling rates. The

Analvsis Filterbank 1 1 final error signale(?*)is obtained by re-
combining aiid upsampling the
subband signals e\,@%) ( v = 0...+).
Operating with subsampled signals
leads to a reduction of computational
complexity \vhich is onc of the main
advantages of subband structures.
computational Complexity
In the fnU-band structure, the number
of multiplications required to perform
Synthesis Filterbank Analysis Filterbank the convolution and the adautation
L.-.
k 20. Structure of an acoustic-echo-controlsystem operating in subbands. By using two
' nsing, e.g.. thc NLMS-algorithm and
an adaptive filter of letigrh N is given
analysis filterbanks, the for-end speaker and the microphone signol are decomposed
into Q subbonds. In each subband, a digital replica of the echo siqnal is qenerated by
by U,,
~ .
(N.
)= 2hr. Assuming
. . .
that Q
o n v h t i o n ofthe subband excitation s&ols ondthe estimoted~ubbandimpulse re- subbands a r e llsed W1t''
sponses. After subtrodion of the measured and the estimofed subbond echo signals, bandwidths and equal subsanipling
the full-band error signal is synthesized using a third filterbank

subband, and rednces the convcrgence speed. When using
notch fitcrs, the perfect reconstruction property of the
fdterbank is lost. Nevertheless, these problems can be kept
to a minimum by “fne tuning” aftcr a rough startup design.
Speed of Convergence
For whte input signals, adaptingwith die NLVS algorithm
(see ‘The Normalized Least Mean Square Algorithm”)
factors Y < the num_ber of m~dtiplicationsin the x(wj e(.)
subband structure O,sn(N)can be denoted as c(n + 1)= c(n)+ b(%)
9.
BSB( N )= 2 N The factor 2Qin the numerator results X T (9%) x(n)’
from the computation ofQ / 2 complex-valued convolu-
tions and adaptations. The subband signals are and using a step-sizep(n)= I, the reduction of the mean
subsampled by a factor Y, which reduces the coinplexity square error (in fiill-band) can be approximated by [27]
byr. The subsampled convolutions and adaptations have
to be computed only everyr-th sample period, further re-
ducing the complexity by the factor r .
BsB( N )does not include the computational overhead
associated with the fdtcrbanks. By choosing a polyphase After r adaptations, this quotient can be written as
FIR filterbank as described in [ 101and [74], with Q = 16
channels and a prototype low-pass *filter of length
N, = 256, the additional complexity Ox,al,sB (N)of two
analysis filterbanlis and one synthesis fdterbank can be de-
noted as
In subbands, assuming that there is no distortion caused

by the filterbank,this ratio for thc same period oftime is
Therefore, the total number of multiplications is:
Full-band: 0, (N)= 2N,
Nu, Q
Subband: 0,~,,(N)=3--3-ldQ+4Ni. Q
Y 7“ Y
Especially for large N , both striictures exhibit nearly the
fihcrbmk ’ Gfilrcn”g
&F same speed of convergence. If speech is used as an input
signal, the convergeuce of an adaptive filtcr depends on
In Fig. 21, the number of multiplications per sample pe- the cigenvalue spread of the correlation matrix of the in-
riod versus the length of an adaptive full-band filter is put signal, which is bound by the quotient of the maxi-
shown. Depending on the subsampling factorr, the effec- mum and the minimum of the power spectral density
tive reduction of computatioual load starts at a filter [36]. In ideal subband StnIChIres, this quotient is re-
length of approximately 100 coefficients (for Q= 16). duced, leading to a faster convergence. However, this ad-
vantage is partially lost by nonideal filters, which increase
the eigenvalue spread. For properly designed filterbanla,
Design of Filterbanks both effects nearly compensate. Using speech as the input
The overall quality of the subband echo control system signal, the speed of convergence is about the same for
strongly relies on the quality of the filterbatik.Therefore, both hll-band and subband strncmrcs.
many papers on this topic have been published in recent
years. Excellcut tutorials about filterbanks and multirate
systems can be found in [21], [71], and [72]. Art#icio/ Delay
In designing die fiterbank, a compromise between firer By transforming a causal full-band impulse response into
length, subsampling rate, stop-band attenuation, and a parallel system o f subband impulse responses, it can be
aliasingmust be found. The filter should be as shortSI; possi- shown that the resulting subband impulse responses have
ble, which leads to a nonzero uansition band. To diminish noncausal taps [46], [75]. In order to reach a satisfactory
aliasing, die subsampling factor must be decreased or, in echo reduction, some of these noncausal taps should also
critically subsmipled filterbanks (r = M ) , a cascade of be modeled by delaying the microphone signal.The arti-
notch filters should be implemented. Decreasing thc ficial delay also increases the requirements for echo atten-
subsamplingrate reduces the aliaingterms, but also distom uation [66], due to the increased echo sensitivity of the
the flatness of the fiterbank transfer function in each hunian ear.

In Fig. 22, nvo system ideiitifications are presented. A frequencies. AII example of choosing the subband filter
measured impulse response of an office (reverberation orders to cope with a limited amount of memo7 for the
time -300 ms) was used for the simulation. A filterbank filter coefficients can also be seen in Fig. 23; the rectan-
with 16 channels and a subsampling factor of 13 was ini- glcs represent the subband filters, and the length of cach
plemented and white noise was used as the cxcitation sig- rectangle is proportional to its filter order.
nal. The length of each adaptive subhaiid filter was set to Each adaptive subband filter can be controlled by its
100. Without any delay in the microphone path, the echo own controlling and detection unit. For this reason, the
could only be reduced by -25 dB.With a compensation system in Fig. 20 has independent step-size control units
delay of 8 ins (64 ful-band samples), an ERLE of -33 in each channel. The double-tall-dcrection (see “A Corre-
dB was achieved. lation-Based Double-Talk Detcction,”) opcrates only in
the band with the largest signal power, which reduces tlic
detection error. It should he noted that the amilabilihr of
Controlling the System and Ifs Parameterization
an improved system control increases the spccd of‘con-
The subbarid structure offers the systcm designer an addi-
vergence, an additional significant advantage of subband
tional degree of freedom. Dctectors and control mecha-
structures.
nisms (“Control of Adaptive Filters”) can be
To conclude the consideration on multirate techniqiics
implemented separatelyfor each chanticl. Also, the lcngth
in the form of subband structures, the computational
of thc adaptive filters can be adjusted per channel accord-
ing to the statistical properties of tlic excitation signal complexity due to adaptive filtering can be reduced con-
(speech) and the room impulse response. siderably. The subhaiid structure offers an additional de-
gree of freedoin in system design. All necessary forms of
The orders of the subhand filters are usually chosen ac-
control and detection can operate independently in each
cording to the power spectral density ofthe excitation sig-
channel, which increases the speed of convergence. The
nal. Speech has most of its energy in the low-frequency
price to be paid for these advantages is a significant delay
bands, so the corrcspondiiig subband filters have higher
introduced into the signal patli.
orders than the high-frequency echo-cancellation filters.
Also in the case of white cxcitation, a noiiutiiform choicc
ofthe filter orders can lead to tlic smallestpower ofthe re- Elock Processing for Large Adaptive Filters
sidual echo signal. In most offccs, high-frequency ab- Long-time-domain-based adaptive filtcrs require huge
sorbing materials are uscd (e.g., carpets, curtains, etc.). In amounts of processing power due to their algorithmic
Fig. 23, a time-frequency analysis of an impulsc response complexity For many real-time applications, such as
of an office is shown. In the high-frcquency range, the acoustic echo or noise control, algorithms with rcduced
LEM system absorbs the energy much faster than at OW numerical complexity are necessary. To remedy the com-
plexity problem, we can use adaptive filters based on
\umber of Multiplications
Per Sample
2500
Fullband Structur
2000
1500
1000
500
Subband Structure
0 150 0.5 1 1.5 2 2.5 3 3.5 4
0 200 400 600 800 1000 1200 1400 1600 Seconds
N
I A 22. Convergencewith 0 and 8 ms (64 full-bond tabs) of artifi-
A 2 1. Number of multiplications per somple period in full-bond ciol delay. Here the results of two system identifications are
and subband mode. For the subbond structure, a filterbank presented. For both identifications, the same excitation signol
with Q = 16 channels and a prototype low-pass filter length (white noise) ond the same set ofporamefers were used. With-
N,, = 256 is used. Only for small filter length N does the out any ortificial delay, an ERLE of about 25 dB (difference be-
full-band structure need fewer multiplications than the tween the power of the microphone signal and the power of
subband structure (due to the filterbank overhead). For lorge the error signal) can be ochieved. By introducing a delay of 8
N, subband processing is superior to full-bond processing in ms in the microphone poth, on increase of about 8 dB in the
terms of computationol complexity ERLE is achieved.

Input Signal Filter Impulse Response
Input Signal Filter Impulse Response
Time (ms)
A 23. Tine-frequency ana/ysis of (I LEM system impulse re-

sponse. The energy in the high frequencies is absorbed by
most LEM systems faster than the energy located in the low
frequencies. Red represents high-energy and blue represents Input Signal Filter Impulse Response
low-energy a r e a I. 1 I .. 1
block processing, which may work in the time or the fre-

quency domain. In general, block-processing algorithms
collect Hinput signal samples beforc they calculate a block
of B output signal samples. Consequently, the filter is
adapted o d y once every H sampling instants.
In tbe filter part of the adaptive filter, the error signal
vector
A 24. Exompie forpartitioning the filter impulse response into
e a (xB)= y R(n8)- xi,R c N(nBj (19) WO parts: a) input signal and FIR filter, b) delayed input signal
with partitioned FIR filter, and c) input signal and delay ele-
is calculated by using the excitation signal matrix oforder ment with partitioned FIR filter.
NXB
For both implemeiitationmethods, the highest com-
= [x, ( n n ... , XA, w- B + 1)l.
x.v,n($4 (20) plesityreduction is achievedwiththc blocklengthBeqnal
to the filter length N [56], [57].
In the adaptation part of the adaptive filter, the filter
impulse response is updated:
Reduction of Signul Deluy
c,(nB+Bj=c,~(nBj+y(nB)X,,,jnBje,(nB). (21) We iiow discuss the signal delay iutroduced by the block
processing algorithm A nuniber of 6: sampling intervals are
required to colicct an input signal block, m d B sampling
Complexity Reduction times are used to perform the calculation itself. Because
The matrix-vector products in (19) and (21j can either room impulse responses can be severalhundred or thousand
be calculated with reduced numerical complexity in the coefficieiitsloug (compare witb Pig. 5 ) , a numerically ef&
time domain [54], [57]or in the frequency domain cient blodc-processingalgorithm with block length B equal
/161, 1681. to the filter lciigth N introduces a large sigilal delay. For ex-
The time-domain implemeiitation uses a radix-2 de- ample, using a sampling rate of 16liHzand a fdter len,gh of
composition of the input signal to reduce thc numerical 2048 coefficients would lead to a signal delay of 256 ms.
complexity [57]. The frcquency-domain implementation This is unacceptable for most communication systems.
employs the “overlap-and-add” or “overlap-and-save” To reduce the signal delay, we partition the filter im-
sectioning methods [20j, [59]for the input signal se- pulse response uniformly into Asmaller parts of length U
quence, which are based on a fast convolution scheme. [24], [ Z S j. In Fig. 24, the output signal ofcach subfilter is
Using frequency-domain methods, the calcula- calculated, which is thcn added to the other output signals
tion-intensive matrix-vector mnltiplication can be traus- to form the ovcraU error signal
formed i n t o element-wise multiplications of A-l
complex-valued vectors and some Discrete Fourier e R(n8)= y o (nB)- (nB - aUj cS’(nB),
Transforms (1)FTs). Y =U (22)

whcre
’
c$ (%B)= IC,,. (nB),. .. ,c, + I L,-l (nIC)]’ (23)
are the filter impulse responses of the subfilters.
The adaptation part of the filter can be partitioned sim-
ilarly to the filter part: ~ ~ ~of the ~ y ~ and
v a r i $tates ~ t e ~
the $tep-$iz
tation steps are actually oversized, which, in turn, can lead
to an instability of the adaptive filter. There are several
quantities leading to a disturbance on the local side:
A The signal of the local speaker leads to an essential dis-
turbance of the adaptive filter. Activity of both the local
and the far-end spealccr is called double-tall<.
A A permanent local noise (e.g., background noise in a
(22) and (24) can be implemented cficiently using the car cabin) disturbs the error signal.
methods described in “Complexity Reduction.” A It is not possible to imitate the complete impulse re-
By sectioning the filter impulsc response, we are able to sponse due to the large number of filter coefficicnts re-
adjust the algorithm parameters, e.g., the block le@ B, quired. The residual echo caused by the part of tlie
the length of the subfiltersU , and thc number of subfilters system, that cannot be modeled can be interpreted as ad-
A according to the requircineiits of our application, e.g., ditional local noise.
the signal delay and thc computational complexity. A Fixed-point DSPs are often used in implemeiitations to
limit the cost. The quantization noisc of the fxed-point
arithmetics can also be considered as additional local
Control of Adaptive Filters noise, which adversely affects the stability of the filter.
In each of these cases, one must reduce the adaptation
Rationale step by using a smaller step-size. However, a permanent
In order to achicve high-yuality echo cancellation, it is small step-size may still be too high in .the case o f a large
nccessary to ensure both a snfficicntspeed of convergencc local disturbance, and it slows down the convcrgence
and stability oftlie adaptive filter. A high speed ofconver- speed when no local disturbance is present. Therefore the
gence requires large adaptation steps, however, oversized stcp-size should be connded.
steps lead to a divcrgence ofthc filter coefficients. Another problem is the time variance of the LEM sys-
For most of the adaptive algorithms, the adaptation tem. A variation of the impulse rcsponsc ofthe LEM sys-
step, given by the filter update, is chosen to bc propor- tcm also leads to a filter mismatch and, therefore, to a
tional to the undisturbed error signal E(%) and the rapid increase of the error signal. In contrast to a large er-
stcp-six ~ ( 1 2 ) .The undisturbed error signal is described ror signal caused by a local disturbance, tlie step-size
by the error*signal without any local disturbance should not be lowercd in this case. The adaptive filter
[E(%) = d(n)- d(n)].Howcver, this undisturbed crror sig- should be adjusted rapidly.to the modified LEM impulse
nal is only available in simulations, so that the disnirbed response, which makes a large stcp-six necessay.
error signal e(%)must be used instead, resulting in the fol- To cope with thesc problems, we need to design a con-
lowing equation for the filter update: trol mechanism to detcrminc the various statcs of the sys-
tcm and to choose the step-size accordingly. (The state of
c(n + 1)- c(n)= p(n)e(%) r(n) , the system is, for instance, dcscribed by the activity or
non-activity of the speakers. This notation will be ex-
where r(n)govcrns the direction of tlie filter updatc. For plained i n “Control of Speech Enhancement Algorithms
instance, in thc case of the LMS algorithm, this direction in the State Spacc.”)Two examples of such mcchanisins
is given by thc direction ofthc vector x(n). will bc presented in “A Correlation-Eased Double-Tall<
A constant step-size~ ( $ 2 is) assumed until thc step-size Dctection” and “Step-Size Control Based on Delay Coef-
control is prescnted. If any disturbance exists on thc local ficients.” Other approaches to control the adaptive algo-
side, thcre will be a malfunction of the adaptive algorithm rithm are proposed in (51,[29],[35]. However, for the
due to approximation of tlie uiidisturbcd crror signal by prcscnt, fully reliable algorithms are not yet avdahle.
the disturbed error signal.Thc error signal grows with the Similarly,most ofthese control algorithms are able to de-
local disnirbancc. (In die remaining section, the disturbed tect just one paramcter of the acoustic echo canceller.For
error signal is incant whcn referring to the error signal.) instance, some algorithms are available that can detect a
The adaptive algorithm tnistakcs this increase of the error doublc-talk situation. There arc othcr algorithms that es-
signal for an increase of the residual echo. Thus, the adap- timatc the step-size without knowledge about the dou-
JULY 1999 IEEE SIGNAL PROCtiSSlNG MAGAZINE 57

vin) models upon which the definition of optimality is based.
A well-known optimality definition was introduced in
[76].For this deduction, the local and far-end signals are
modeled as white, short-time stationary, mutually
uticorrelated processes, aiid the step-size is designed to
minimize the mean-squared system error, which is, in this
case, equivalent to minimizing the mean-squared-error
power. Based on these assumptions, the optimum
l I
step-size is calculated as
I 25. Schematic for the deloy coefffcientsmethod.
ble-talk situation. Thus, a combination of these

algorithms is helpful. This combination is usually gener-
ated hcuristically. In “A Combined Step-Size Control Although the signal model is obviously uIirealistic, this
Using Neural Nenvorks and Fuzzy Logic,” new methods step-size is already quite useful. An intuitive explanation
are prcsentcd that combine the control algorithms sys- for the result is a t hand: the more noise is contained in the
tematically error signal, the more this error misleads the filter, and
the smaller the step-size should be. Many linown step-size
control methods are based on (2S),and several mcthods
Optimum Step-Size have been proposed which try to estimate the power of
As explained in the previous section, step-size control is the adaptation error. Two of these methods are explained
indispensable for an adequate performance of acoustic in the following two sections, respectively.
echo cancellers. Even algorithms that are designed to
cope with non-white or nonstationary excitation need
such a control mechanism, regardless of their conver- A Correlation-BasedDouble-Talk Detection
gence speed. This is due to the time-variable nature of the In the rationale for step-size controls, it was mentioned
LEM impulse response, and to the bursty local speech sig- that the activity of the local speaker leads to abnormal be-
nal, which acts as distortion. Only a step-size control can havior of the adaptive algorithm. The casc of the active lo-
provide sufficient attenuation in a slowly changing envi- cal spealier and the inactive far-end spealcer can be easily
ronment, prevent divergence in the presence of local treated by observing the signal of the far-end speaker. If
specch, and enable fast tracking when sudden changes of this signal falls below a predetermined threshold, the adap-
the LEM impulse response occur. tation should be stopped because there is no adaptation
A step-size control mechanism is different for each possible without ai excitation signal. However, the case
adaptive algorithm. Thus, this section will only describe when both the local and the far-end spealrer are active is
the common idea of these mechanisms. This idea will be more complicated, because the locally generated signals
explained with the example of an acoustic echo canceller cannot be measured in a real-time system. The only input
that uses theNLMS algorithm, as implementedin anum- on the local side is the microphone signal, which is indeed a
ber of real-time systems [17], [38], [63]. As shown in superposition of the local signal, the echo signal, and the
[SI,the step-sizecontrol for h i s algorithm can be used in background noise. So it is impossible to detect the activity
the same manner fur the modified NLMS algorithm that of the local signal by observing the microphone signal
is adapted to the special characteristicsofthe LEM system only.Even an increase ofthe microphone signal can not be
[SI]. Other adaptation algoritluns, such as the RLS algo- interpreted as an activity of the local speaker, because a
rithm, are also used in acoustic echo cancellation (AEC) variation ofthe LEM system also leads to such an increase.
with a step-size control based on similar principles to In this section, a method is presented that estimates the
those of NLMS stcp-size control [ 9 ] . In general, the corrclation between the loudspeaker and the microphone
mcthods described here can bc easily adapted to many ad- signal. Whcn a high correlation value between these two
aptation algoi-ithms. signals is detected, a low local signal power can be de-
The step-size control problem consists of two main duced. Assuming this low local signal power, the adapta-
parts: First, ail optimal behavior of the step-size must be tion can be enabled by selecting the step-size accordingly.
defined for the application ofAEC. This will be discussed Ifthe correlation value is low, the step-size should be low-
in die following paragraphs. Second, this behavior must ered to disablethe adaptation. The simplestwayto do this
be realized in the best possible way with the information is to choose a constant step-size for single-talk situations
available in real time, which will be the subject of the fol- (high correlation value) and a step-size equal to zero for
lowing sections. double-talk or high background noise situations (low
Even for the NLMS adaptation algorithm, different correlation value),
ways have been proposed to determine an optimum Various methods to estimate the correlation value are
step-size. These solutions result from different signal published 1301, [79]. In [38], a modified correlation

value p(n) is proposed which is v c similar
~ to a normal-
ized short-term correlation between the loudspeaker aud
the microphone signal:
A large local background noise leads to a permanent

p(n) itself yields values between 0 and 1.Whenever p(n) detection of double-talk. Therefore, the estimated error
exceeds a predetermined threshold, a non-double-talk sit- signal cannot be used reliably.That means that the corre-
uation is assumed. latiombased double-talk detection can only be used for
To be robust in the face of a time delay of the LF.M sys- hands-free telephoue systems in environments with mod-
tem, tlie maximum correlation value for different delays 1 erate levels of background noise (for instance in offices).
is calculated,assuming that tlic direct echo signal is m z i - In environnients with high-bacligrourid-noise lcvcls, a
mally correlated with the excitation signal. The correla- combinatiou with other detectors is necessary.
tion term must be estimated for a limited number of
samples M , where the estimation quality is proportional
to M . However, with a large estimation window, the de- Step-Size Control Based On Delay Coefficients
tection becomes slower, which can lead to a divergence of For estimating the step-size according to (25),the dis-
the coefficients of the echo-canceling filter in the mean- turbed error signal e(n)and the undisturbed error signal
time. Reducing the number of samples enlarges the vari- e(.) is required. The power of the disturbed error signal
auce of this estimation value. Therefore, we have a e(n)can be estimated easily, by recursive square averag-
tradeoffbetween the estimation quality and the detection ing, for instance. H o w e \ ~ r ,the estimation of the
delay. The faster the algorithm or the larger tlie step-siae, undisturbed error signal+) is much more complicated.
the worse the consequence of a time-delayed double-talk In the previous section, a method to determine
detection will be for the stability of the adaptive filter. non-double-talk situations was descrilxd, in order to esti-
The discretisation of the time delay and the length of mate the undisturbed error signal. The method presented
the room impulse response cause a large attenuation of in this section attempts to estimate the iundisnirbcd error
the higher fiequencies of the correlation function be- signal directly [ E ] ,[76]. It is based on the assumption
tween the loudspeaker and the microphone signal. There- that the coefficient error is spread uniformly over the co-
fore, a reliable detection of non-double-talk situatioiis is efficients of the filter. Furtherniorc, all these coefficients
only possible in voiced segments of the excitation signal. must adapt toward their optimnm value with the same
Similarly, these are the segments containing most of the mean speed of convergence. Regarding thesc assump-
tions, one can infer the overall adaptation error from the
signal energy. There are results available 1371 that use
luiowlcdge ofthe error of some coefficients. Therefore,
only the frequency band below 1kHz, leading to a more
the microphone sigual is delayed for a few taps (about
reliable correlation value. It cau also be shown that, in the
16-32 taps), which can be interpreted as a delay of the
case of unvoiced segments of speech signals, the corrcla-
LEM impulse response (Fig. 25). Under these condi-
tion value is as low as i n the case ofdouble-talk situations.
tions, the first N , coefficients of the a.daptirc filter, de-
Consequently, the adaptation is disabled. This is tolerable noted as cT(nj, should converge tovvard zero. That is,
because it is not possible to adapt the filter during these supposing a reasonable non-zero initialization, this seg-
segments of low energy. Hence, the resulting low nieut of the coefficient vector leads to an estimate of the
step-size is even helpful. system error norm:
Therefore, the most important parameters influenc-
ing the behavior of the cosrelation are given by the
threshold to detect non-double-talk situations, the num-
ber of samples M , the time delay I , and the frequency
band ofthe signals in which the correlation value should with N , representing the number of (delaycoefficients,
be detected. and N denoting the number nf filter coefficients for
The correlationvalue can be successfullyused to detect modeling the room impnlse rcsponsc. With regard to
non-double-talli situations. Cousidcring thcsc situations the systcni-error norm, the mean-squared corfiicient er-
ofa small local disturbance, it is possible to estimate the ror and hence the mean-squared undisturbed error sig-
nndirtzwbed error signal and the echo attenuation. In turn, nal can be estimated. Assuming a short-time stationary
a step-size can be determined. Othenvisc, when dou- white excitation and no correlation between the excita-
ble-talk is detected or high-baclground~noisesituations t i o n signal a n d t h e system error vector, t h e
exist, a step-sire close to zero can be chosen. mean-squared error signal results in

20
10
(26)
0
Taking (25) together with (26), a rule for the step-size
can be specified. -1 0
The above algorithm works quite well for
time-invariant room impulsc responses. It can deal with
different levels of background noise, and correctly sets the
step-sizeto very smallvalues during double-tallcsituations.
However, problems arise due to changes of the room im-
pulse response: assuming a low system-errornorm, the de-
lay coefficients are close to zero. This is interpreted by this
method as a low adaptation error. Therefore, the step-size
is chosen close to zero. A change of the room impulse re- A 26. Analysis of a step-size control method by utilizng optimum
sponse that appears at this instant will not affect the delay step-size and cost function.
coefficients. Hence, the step-sizeis kept veqwnall. There-
fore, the adaptation filter is unable to adapt toward this trol. Special voice activity detectors (VAD)are added for
modified room impuke response, resulting in frozen filter this purpose, mostly on the basis of excitation signal
coefficients.In the case of a higher backgronnd-noiselevel, power estimation. When a veryhigh threshold for speech
this problem is less severe, due to the variation of the dela17 activity detectionis chosen, even unreliable step-size coil-
coefficients caused by the local noise. trol methods can avoid divergence of the filter, but the
I n summary, o n e can say that the de- real-time convergence speed decreases significantly.
lay-fiter-coefficients method functions adequately in a To enable the systematic optimization of the complete
noisy environment, except for the ability to dctect a step-size control, a second optimum step-size is intro-
change of the LEM system. Therefore, combining with duced in [ 1I]. It is deduced from deterministic equations
other methods that reliably detect changes of the room instead of statistical signal models. No assumptions are
impulse response is helpful. needed about the characteristics of the signals, which
makes die calculation valid for every possible far-end or
local signal. Moreover, because the exact signals are
A Combined Sfep-Size Control Using Neural known during simulations, this optimum step-size can be
Networks And Fuzzy Logic calculated in simulation environments, so that step-size
control methods can be analyzed in predetermined situa-
tions over short time intervals. This step-size is derived by
Deterministic Derivation of an Optimal Step-Size minimizing the Euclidean distance between the filter vec-
For the systematic design of control methods, a detailed tor and the LEM impulse response, which results in:
analysis of the methods aud their behavior is helpful. The
(%)of(25),whichis based on averysimple
step-size p,>pt,l
speech model, does not allow for such observations. The
assumptions used to derive this step-size are far from fid-
filled for voiced speech, whose short-time power spectral To calculate this step-size, the value and sign of the ad-
density is decidedly more important at lower frequencies aptation error must be Imown. As stated before, this in-
than at higher ones, and whose average short-time power formation is available in a simnulation, but it c m o t be
and spectrum change significantly during transient estimated easily in real time. On the other hand, the fact
phases between the phonemes. Obviously, for optimum that it is mcasurable in simulations is very useful for evalu-
step-size control in the presence of speech, &opt,, ( n ) ating the performance of step-size control methods in a
should not be regarded as the reference. Moreover, esti- rcalistic environment. The comparison of the step-size
mation of the mean adaptation error power does not pro- produced by the step-size control method and the opti-
duce exact results when speech signals are involved. This mum imtantaneous step-size indicates whether the con-
is due to the nonstationarity of speech, which limits the trol method generally over- or underestimates the
estimationinterval. It can actually he shown that using an step-size in certain situations.This informationcan be ex-
estimate of & up',, (92) on the signals available in the siinula- ploited to tune the parameters of a step-size control
tion does not even guarantee convergence when speech method, or to analyze it in detail.
excites the adaptive filter [ H I . Therefore, in real-time ap- Moreover, a cost function can be derived from
plications, the step-size is usually set to zero during criti- (n)that provides a ranlung of estimation errors ac-
cal intervals such as speech pauses, at the beginning and cordiug to the damage they cause to the echo attenuation.
end of which the adaptation is especially difficult to coil- It is denoted as:
60 lEEE SIGNAL PROCESSING MAGAZINE JULY 1999

a quieter environment, tlie decision threshold should be
altered, because in this casc, it is possible to detect speech
utterances much carlier, at a much lo\icr risk ofmisintcr-
Both optimum step-size and cost function support pretiiig pauses due to speech. Ingeneral, the detectors for
each other as suitable tools for the analysis of step-size certain situations, or states, [if the hands-free tclepbonc
control methods. The correspondence bcnveen cost and are sensitive to the overall situation, so that all the charac-
system error iiorm is shown clearly in Fig. 26, but the cost teristics of the state should he lcnown in order to establish
function is also szusitive to the underestimation of the a good control algorithm. The main states can be drawn
step-size at the beginning of the adaptation, and thus, in state space (Fig. 27).
provides more information than the system error iiorm. Control for spccch~eilhanceinciitalgorithms can be
For a detailed derivation and discussioii of the equations pzrfornied iu n w stages: First, the states must be de-
introduced in this section, the reader is referred to [13]. tected. This information is of general use for the speech
enhancement algorithms. Sccond, every
speech-ciihaiicemesit algorithm has to evaluate the state
Control of Speech Enhoncernent information, and derive the values of i t s control parame-
Algorithms in the Stote Spoce ters from it. The diagram shown in Fig. 27 and the system
In most step-size control units, a V D is used to distin- description given in the previous section serin to indicate
guish speech from speech pauses. But this is only one of a the use of Markov models for this task. i.e., the state de-
iiumber of distinctions necessary in the hanh-frcc tele- tection should h e cuabled to influence tlie parameters for
plionc. For adaptive loss-control algorithms that must en- funire dccisiiins. This idca has been exploited in [15]
sure a certain closed-loop attenuation value, the far-ciid However, a Markov inodel implies that^ the control algo-
and local speakers' activity must be recognized as different rithm uses its own state-detcction rcsults as information
cases to decide in which channel the adaptive loss must bc for future detections, i.e., that it is recursive. This a p
applied. The performance of noise-reduction algorithms proach has not heeii coilsidered for more than two state
can also hc improved when informatioii about the differ- detectors because it would become exceedingly complex
cut speech activity situations is a\dablc. The most difficult to ensure the stability of such a system.
and critical detection is that of simultaneous activiry of Kccausc stcp-size control mcthods are not reliable in all
both speakers, i.e., double-tali, because the residual echo possiblesituations [ 131, statcdetectionis gciierallyamul-
can easily he mistaken for a local speech signal. tiplc-input/multiple-otitput decision problem. It has of-
Voice-activity detectors can be improved by adjusting teii been solved in rule-based approaches, where all the
thcir decisi(inthreshold to the background noise levcl, es- relevantthresholds havetobcdefinedb?hand [27],[37],
pecially when the VAD algorithm is mainly based on the but this appi-oach has been applied almost exclusively for
power of the signal under observation. When thc back the distinction of a small subset of the states (sin-
ground noise levcl is vel-yhigh, as in a car whose engine is glc-tall~/double-talk)and detectors. The reason for this i s
ruining, speech activity can only be recognized when it is an incrcasing complexity in the defiuition of rules and
sufficieutly loud. Therefore, the begiiiniiig of speech tit- thresholds with a larger number of input values.
terances is often not detected, which is very disturbing Because the detectors arc basedon estimates ovcr small
\die11adaptive loss controts are applied. Consequently, in time intervals, their outputs may be erroneous, and may
Local Speaker
Inactive
- Local Speaker
Active
lead to the wrong control commands.
The importance of the correct detec-
tion of certain situations varies fix the
Inner Cube different spcccli~erthancemcntalgo-
Insufficient Excitation rithms. For example, double-talk de-
Outer Cube
tection must not bt: delayed when its
Sufficient Excitation result is used for acoustic echo canccl-
lation, whereas thr delay is less impor-
taut for adaptive loss control. This
implies that, given a certain variancc in
thc results ofthe estimators, the deci-
Step Gain siou thresholds for the distinction cif
0
., 1 several states should be dependent 011
the algorithm that uses thc detection
result. hdditionallv, the values of the
A 27. The hands-free telephone set in state space. The diogram focuses on those states parameters usua'JT
with for-end speaker activity Distinguishing these states is very difficultbecause the Over a contillUous so that the
far-end and the local speech signal can interfere. However, it is extremely important for crisp set of states must be re-mapped
the Derformance of the acoustic echo conceller. to continuous values. Both of these

facts suggest the use of a fuzzy algorithm, which provides
reliability information for its decisions. An example of
such an approach is described in [ 121, and a more general
version will be described in the next section. A simpler
version ofkizzy logic control that is easy to implement for
real-time applications is described in [13].
Combined State Detection a n d Step-Size Control Using

Fuzzy Learning Vector Quantization
A fi~zzyde-based system [ 131 can be designed easily and
quickly if experience with the input-output relationships of
the system exists, and if this relationship does not require
large and complex rule-bases. Optimization of such a sys- -
m LVQ. Unsupervised Retraining
tem must be done experimentally and by hand. Unforn- E O
nately, the parameters are nonliuealy coupled, which -r
y_
-
-5
- Single Talk No Background Noise
.-..-..,-. Dou%ieTalk.
.-.-.Sin leTaik’CarNoise(SNR0dBj
No Background N ~ i s e
-,
0 Double Talk, Car Noise (SNR 0 dB)
makes it very dificult to optimize them, especially with a
large number of inputs. Therefore, if a larger number of Y
-10
-
detectors is available, or if a larger number ofstates is to be
distinguished, this optimization can no longer be carried -15
z
out by hand in reasonable time. For the case combining a
large set ofstep-size methods, we examine an approach us-
g -20
W
ing &zzy learning vector quantization, which has the addi- E
I -25
tional advantage of providing state information. 2 0 2 4 6 8 1 0 1 2 1 4 1 6
m
In a first approach, every state is assigned one step-size. Time (s)
I
Therefore, the stcp-size coutrol is actually reduced to a
A 28. Top: results of the delay coefficients step-size control
multi-inputimulti-output decision, where the decision method. Bottom: results for the combination of step-size con-
can be made by a pattern-recognition algorithm such as trolmethods by LVQ. The system reacts correctly in the case of
learning vector quantization (LVQ) [47]. It can be used a change in the LEM impulse response, in contrast to applying
to automaticallypartition the input space spanned by the the delay coefficients method onlv
input values, i.e., the results ofthe step-size control meth-
ods, into regions that are assigned to a state. For reasons considerably if a suitable transformation for each input
of robumiess, a fuzzy algorithm is applied here that is value is found. Another consequence of the LVQ ap-
similar to the ones proposed in [45].The advantage of proach is that some states are very difficult to separate. In
such an approach is that the results of the state detector particular, these are the very-highly distorted double-talk
can be evaluated on a test data set, and that the parameters
state and the state of initial adaptation, because the adap-
that must be tuned for step-size optimization are linearly
tatiou error is speech-like and of rather high power in
combined, so that their tuning is much easier than in a
both cases. In our implementation, an additional state
normal fuzzy system. The nining can even be supported
was introduced to incorporate the special case of initial
by the cost function presented in “Deterministic Deriva-
tion of an Optimal Step-Size.”Moreover, the arithmetic adaptation, thus circumventing the problem of the LVQ
to be performed in real time only consists of a number of recognizingdouble-talk at the beginning. This indicates a
multiplications a i d accumulationsfor which a digital sig- low step-size, and keeps the adaptation from starting at
nalprocessor is optimized. On the other hand, we have to all. Results of the step-size control by fuzzy LVQ with
train the state vectors first, then assign a step-size to each niue state vectors arc shown in Fig. 28. The simulations
state. In order to provide training data with sufficient were carried out with speech as excitation and distortion.
echo attenuation, the firsc training cyclc is carricd out For the case of background noisc, car noise was used at an
with data obtained by adaptation with the optimal average SNR level of 0 dB.At about the 12th second, a
step-size of (27). In a second cycle, the LVQ is trained change in the LEM impulse response was siniulated as
online, i.e., the step-size that it produces is used for the measured during the movement of a dummy in front of
adaptation and thus, influences the training data. By this the hands-free telephone. The change occurred during an
approach, the situations provided by the training data be- interval of about one second. This environment was cho-
come increasingly similar to real-time situatioils. sen because a gradual change is much harder to detect
Because the membership functions are calculated from than a sudden oiie, and because this kind of change is
the Euclidean distance between the input vector and the more typical in real time. For details of the implementa-
state vector, criteria with large output values influence the tion, see [14]. To facilitate the use ofthis method, radial
state detection more strongly than those with small out- basis functions [55], [73]can be usedinstead ofthe fuzzy
put values. Therefore, the performance can be iticreaxd system to implement a fuzzy-neural system and thus

achieve automatic tuning of the parameters, very similar x2( n )help to reduce this effect to a certain but limited
to that of a typical feed-hrward neural network, while anionnt.
preserving the state information and interpretability of A With an ill-conditioned correlationmatrixR,,x, (l),the
the sirstem. misalignmciit of the ECFs is much ~ o r s efor the
dual-channel case than for the single-channel case.
A Because x (n)and x (n)are correlated, the optimal im-
Adaptive Filters for Stereophonic Acoustic ,
pulse responses cl (n)and c (n)arc not uniquely defined.
Echo Cancellation Consequently, the filter adaptations may have to follow
dislocations of the signal source in the far-end room.
Rationale and Genera/ Setup
Monophonic hands-free telcphone systems do not pro- Proposed Methods
vide listeners with spatialsound injnmation. Stereophonic In general, any imaginable method should avoid affecting
devices comprising two separate acoustic transmission stereophonic pcrception. Several proposals for a solution
channels that allow listeners to distinguzsb benveeiz dzffkent of the just-listcd problems have been madc. According to
speakers are used to enhance sound realism and to satisfy [SI, any adaptive algorithm should he backed by a
customers’ demands for teleprescnce. decorrelating component in order to enhance the condi-
A stereophonic structure exhibits W O loudspeakers as tioning of the correlation matrix R x , k(1)
2 and to get a ro-
well as two micrciphones, both in tlie local and in the bust solution that is no longer dcpendem on far-end
far-end room. As a direct consequence, four echo paths room properties.
(i.e., impulse responses) must be identified individually
for each room. Thus, an increased amount of processing Decorrelorion of the Excitation Signals
power inust be devoted to this task as compared to a
Theaddition ofindependentrundmnoise to each channel may
monophonic hands-free system. help to ~ C ~ L I Cthe
C correlationofthe stereo sign& andthere-
For the salcc of clarity, only one half of the echo path
fore enhance the performance ofthe ECEs. Unfomiately,
system in the local room is shown in Fig. 29. It is assumed an improvement of the conditioning of the correlationma-
that neither local noise nor local speech signals are prcs- trix Rxlx2 (/) requires rather high and thus disturbing noise
ent. In the far-end room, one speaker’s voice in conjunc-
levcls [69]. A possible solution to this dilemma may be to
tion with ambient noise (which is not depicted in Fig. 29) use the auditory masking properties of the human ear. The
is picked up by two microphones. The disturbed speech idca consists of adding spectrally shaped random noise to
signals xI (n)and x 2 ( n )are transmitted in two distinct each cliannel that is masked by the londspealcr input signals
paths to the local rtiom and radiatcd by two londspcakers. [331. The additionalcosts for this method can be lept rela-
Two adaptive filters with impulse responses c, (!a) and tively low when frcquency-domainadaptive algorithms for
c , (n),respectively, try io imitate the echo signal y ( n ) by
the ECFs are applied.
an appropriate output d(n),in order to cancel thc echo. Further research has been directed toward thc applica-
tion of noizlinear mmfownations [SI. In principle, a s m d
Prob/ems signal derived by a nonlinear function (c.g., half-wave rec-
In an analogy to the situation for a single-channel device, tification) of the signalitself is addcd to ‘eachexcitation sig-
the ECFs have finite-length impulse responses c, ( n )and
c 2 (n).With such confined sets of filter coefficients, it is
impossible to perfectly model the infinite impulse re-
sponscs g, (n)and g, ( n )oftlie echo paths. Therefore, any
solution of tlie identification problem is biased or “mis-
aligiied” due to iinderinodcliaation.
I n addition, a dual-chaiinel structure as depicted in
Fig. 29 prises the fcillowing problems with regard to the
adaptation ofthe ECFs [ X I :
A The excitation signals x I ( n ) and x 2 ( n ) arc strongly
cross-correlatedbecause they are filtcrcdvrrsions ofa com-
mon sonrce signal. Due to diis, tlie correlation matrix
A 29. Basic scheme for stereophonic (IcOust,icecho cancellotion

(SAEC). A speaker (source) in the far-end room is active. Due
to acoustic coupling between the loudspeakers and micro-
is ill-conditioned and consequently, the adaptivc behav- phone, an echo signal y ( n ) is coupled basck from the local to
ior of the adaptation algorithm is degraded. However, the far-end room. Two filters with impulse responses c , ( n ) and
noise components in the excitation signals x, (n) and c,(n), respectively are adapted in order to cancel the echo.

nal. The distortion iutroduccd by the use of such
noidinearities is hardly perceptible. In combination with
complementary comb-filtering, the approach of
nonlinearly transforming the excitation signals provides
improved convergence of the ECFs without coniproniis-
ing spatial realism or increasing computational coinplexity
[7]. This method is based on the observatiou that, in gen- with p representing a fxed correlation coeEcient that
eral, the signal energy below 1IiHz is decisive for auditory scales the sum squared cross-chaimcl coefficient and
localizationa i d that coinplenientar).comb-fdtering in fre- p , ( n )= x: ( n )x, (n),i= 1,2, beiiig the sum of squared
quency rcgions above 1lcHz does not distort stereophonic input data in the two channels. I denotes the unit matrix.
perception. In the frequency band below 1 1-2, a nonhi- For 0 < a , < 1,0 < p < 1, and0 5 y << 1, the leaky XUMS
ear transforniatioii is carried out. Subsampling in this spec- algorithm is stable.
tral range allorus the fast-converging frequency-domain The decoupled update tcrm for thc coefficients of the
RLS algorithm to bc applied with a modest computational first filter in (30), uiider consideration of (31),shows the
complexity hthc high-frequency range, two completnen- decorrelation property of this algorithm:
tary comb fdters perfectly cancel the correlation of the ste-
aE
reo signals. Therefore, the NLMS algorithm is appropriate c,(n + 1 ) = (1 - y ) q (n)+
in t h i s frequency range. P,IP22 -p2r,,
Another method for decorrelating stereo signals pro-
poses theperiodic dehy of the exchatinns&nd in one cbannel
by a simple two-tap filter with time-variable coefficients
that are colitrolled by a periodic function [44].
As a means of decorrclation, leakage can be lliterpretcd
as the addition of white noise to the input signals.
Dual-Chonnel Adoptive Algorithms
Compared to thc addition of white noise, leakage is im-
In stereo applications, the time-doirjainNLMS algomthm plemented directly in the algorithm and will, therefore,
is sensitive to the conditioning of the correlation matrix not affect the characteristics of die input signals.
R x l x(I).
3 In order to improve the speed of convergence The ovcrall complexity is kept at a modest number o f 6
and, at thc same time, to reduce the computational coin- N additions and 8N multiplications. Thus, this approach
plcxity, the NLMS algorithm in thc frequency domain is a is particularly effective for large filter lengths N, though it
better candidate for a stereophonic system. leads to a slight 2N increase of the computational cost in
TheRLS a&oTitbm may be viewed as a reference of per- comparison with the conveiitionai XLMS algorithm.
formancc [31].With this algorithm, the meansquared er- Sevcral authors iiivestigatc die applicability ofpvojec-
ror converges independently of the conditioning of the tion algovitbm (see “The Affine Projection Algorithm”)
correlation matrix Rxlx2 (1). Its major drawback is high to stereophonic devices. By exploiting the time variance
computational complexity. of the short-time cross-correlation fuictions, projection
An LMS-likc algorithm, the so-called eXtended L M S algorithms in full-band or subbands provide efficient so-
( X L X S ) algorithm, can be derived from the two-chamicl lutions with respect to the speed of coiivergence and thc
RLS algorithm [6]. It inhercntly provides a partial residual misalignnieiit [52], [67]. Their high coniputa-
decorrelation of the inputs z, (n) and x, (n),while tional complexity, however, prohibits them from being
roughly approximating the correlation matrix R,,,, (I). implemented. Fast versions of projection algorithms ac-
Therefore, this algorithm is lcss sensitive to corrclated in- complish a minimum complexity of 6N+O(P), whcreP
put signals than the coiivetiuonal LMS algorithm or the represents the projection order [4].
NLMS algorithm. By introducing leakage, which is pro-
posed for stabilizing systems and alleviating “stalling” of
adaptive cocfficients, the filter update equatiou of the Implementation
huky XLMS nlgovrithin may be denoted as [39]: Implemeiitors of acoustic echo cancellers have to deal
with the demands of hands-frce telephone set users, the
regulations of appropriate authorities, and limitations by
the iiiarlceting division of the manufacturer. Combining
the intercsts of these threc groups can bc hard because
with the leakage factory and the stcp-sizeai:The XLMS their priorities are different (Fig. 30).
algorithm explicitly takes the cross-correlation between Hands-frec telephone set users arc primarily interested
the two input signals xI(n)and x, (n)into account. This in using their telephone in a natural way. They do not
can be observed by the expansion of the inatriv M, a want to
(rather simplified) two-channel correlation matrix, that is Hear echoes.
defined by the following equation: Re affected by a half-duplex communication.

Minimizing the cost can affect other :aims undesirably
as well. Table 2 evaluates these measures with regard to
the conflicting intcrcsts.
In the following sections, some methods will be shown
on how to cope with restricted arithmetic; finite
quantization; and reduced processing power due to coil-
flictiiig intcrests of users, providcrs oE hardware, and
standardization authorities.
I 1
s1 30. Conflicting priorities between different interest groups.
finite Word Length
Be forced to single-talk discipline. Most algorithms used fiir adaptive filtering arc derived
Be troubled by insufficient comprehensibility. using the assumption of infinite precision. Tcsting these
Technical details, e.g., the adaptation algorithm or the algorithms with simulations or real-tiine implementa-
amount cif roorn~imptrlse-respoiisecancellation, do not tions reveals that this assumption is invalid. Therefore,
interest them. differences benveen the expected bchxvior of the algo-
Rcgulations by appropriate authorities allow tclecom- rithm and the result obtained by finite !precision nlay he
niuiication providers to offer ctimpatible services. Ac- observed. Circumstances (e.g., aim, ambition) determine
cording to ITU, hands-free telephone sets whethcr these deviations can be neglected or not.
*. Must guarantce an echo attenuation of at lcast 45 dB
Thcre are several reasons for deviations from the ex-
during single-talk. pected results [19], [49]. Most of than are due t u the fi-
nite precision that is inhercnt in electronic calculations.
a. Arc not allowcd to delay the signal more than 2 ins for
stationary phones and more than 39 ms for mobile
phones. Why Fixed Point?
Uccausc hands-free telephone set implementors must The majority of todafs DSPs provide fied-poiititaritlimetic
fulfill tliese requirements prescribed by the ITU, it is difii- and quantize data with 16 bit resolution lti81. As mentioned
cult for thcm tu offcr greatest comfort to their customers. previously in this chapter, low cost is a major criterion for
For manufacturers, the following points are important: manufacturers. Therefore, fixed-point DSPs arc preferred to
floating-pointDSPs for consumer products.
Package size (smaller products, "size docs matter") An important drawback offied-point 13Sl's is the im
Power consumption (e.g., for mobile products) tcrnal number represcntation. Quantization with 16 bits
d. Competitiveness (pcrformance, number of fcatures) is quite popular and well suited for raw spccch data.
It should be noted that the first three points are to be However, for scluared values of speech signals and room
ininimizcd, whereas competiti\wicss is to be maximized. impulsc response filter taps, quantization with 16 bits
Becausc, for most marketing decisions, cost plays a domi- does not sutfice. Especially for the application in AEC,
nant role, this can result in sonic restrictions on hardware the quantization of thc filter taps limits the acliicvable rc-
available fix adaptive filters: sidual echo attenuation.
A Simple arithmetic (fied point rather than floating
Quantization of Filter Tops
Two effects enlarge the residual echo of adaptive filtcrs if'
implemented with fixed-point arithmetic. First, the
Established technology qtIantization error of each filter tap cau Ibe secn as an addi-
Mcasurc Cost Package S i 7 ~ Power Consumption
Simple arithmetic
_________
U u v
Small memory I U U
I .ow pro'cssing powcr L -L U
Estahlishcd technology L n n ~

tional noise tern1 that degrades the achievable residual
echo if no step-size control is used. However, this effect
does not play the most important role for 16-bit
quantization, unless the number of taps is huge. And, if
the second term on the right of the NLMS equation (16)
is smaller than the quantization interval, the correspond-
ing filter tap freezes. This “stopping” phenomenon oc-
curs especially ify(n) is very small. This conficts with the
rule that one cau decrcase the excess mean squared error
by decreasing the step-size p(n). With fixed-point arith-
metic, it can be advantageous to leave p(n) at a higher 0 2 4 6 8 10 12
value [ H I . Time (s)
Freezing coefficients can be ohsenred easily when ex-
A 3 1. Convergence of €5-NLMS olgorithm. filter length: 2000
amining a specific algorithm that LISCS typical properties
taps; room length: 2044 taps. Excitation: white Goossian noise.
of room impulse rcsponses to speed up thc convergence
-: floating point; - -:fixedpoint (16 bit).
ofhighorderadaptivefilters [50], [51]. Tlieupdateequa-
tion of the “exponentially weighted step-sizc (ES)”
16
NLMS algorithm is as follows:
14 . . . . . . . . . . . . ..~... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
... : .............................. . . : ................
where A is a diagonal matrix that assigns exponentially

decreasing step-sizes to the coefticients:
A = diag@, 1 5 .
>...,a, (34)
with a s = a y ’ for 0 < i < N,and y being an exponential
i
attenuation ratio with the same logarithmic decay as the
impulse response (0 < y < 1).
,
A 32. Logarithmic representation of room impulse response
Because the ES-NLMS algorithm uses very sinall I g(i)l (solid line) and its quantized envelope (dashed line).
step-sizes for some coefficients, the stopping phenomenon
significantlydecreases the performance of the adaptation
when implemented with quantized filter taps. Fig. 31
shows converging cumes of the ES-NLiMS algorithm with
fdter taps in floating-point and fixed-point precision.
This behavior can be improved by malung use of a typ-
ical property of room impulse responses. As shown in where e , ( n )is the error signal obtained using scaled filter
Fig. 5 , the average magnitudc of the room impulse re- coefficients.
sponse decays exponentially. Fig. 32 shows the magni- Assumiug infinitc precision, the scaled filter update
tude of the room impulse response and its quantized equation and the conventional NLiMS equation are
envelope. Because adaptive algorithms try to identify the equivalent, despite the fact that the normal error signal
room impulse response g(n)with the filter c(n),the coef- e(%)and the error used in filter scaling, e, (n),can differ
ficients at the tail of c(n)only make use of a portion of the during adaptation. By using the diagonal scaling inatriv
available bits. The bits above the dashed line in Fig. 32 re- B, the filter vector c(n)adapts to a “faked” room impulse.
main unused. This “waste” of bits would not occur if thc Using (36) and (37)improves the performance of the
shape of the room impulse response was more uniform. learning curve with fEed-point coefficients in Fig. 31 to
To achieve the desired behavior, one can introduce a almost the same level as the curve with floating point co-
scaled filter vector c ,(n).Its relation to the normal filter effkients. Additional information about filter scaling can
vector c(n)can be described by using the diagonal scaling be fouud in [61].
matrix B:
c(n)= B c , (n). Reduced Processing Power

(35)
Apart from quantization effects, the performance of
Using this equation, thc error calculation and the filter high-order adaptive filters for echo cancellation is limited
update can be written as: by the processing power that can be used. Restrictions
usually arise from economic considerations, despite the
e, ( n ) = u ( n ) - F . c ,(njlTx(n) (36) fact that there arc extremely powerhl DSPs ou the mar-

lcet at the time of this writing. These DSPs, combined trol also helps to increase the spccch recognition rate in
with high-level programming tools, allow short voice control systems, The dcmand fix such systems will
time-to-market cycles. certainly grow. Examples from the domain of consu~ner
Low processing power is still an issue fix echo cancel- products include the control of audio and,video sets and a
lation, because the high filter order can be very demand- wide variety of computer applications. Thc use of voice
ing for DSPs. This is cspecially true if, due to the control and hands-free communication devices also can
architecture of a specific DSP, the update of one coeffi- considcrably ease life for physically handkapped people.
cient requires several processing cycles. The adaptation of the ccho cancellation filter has to
There arc sevcral ways to reduce the complexity. One cope with nonstationary and colored excitation signals
very popular method is to use subband processing algo- and time-variant LEM systems. This highly unfavorable
rithms ("Subband Processing"). This is appropriate only environment requires a special tailoring of adaptation al-
if the additional delay introduced is tolerable. Another gorithms.
way to reduce nuncrical complexity is to use block pro- In this article, we have discussed the application of
cessing algorithms ( "Rlock Processing for Large Adap- high-order adaptive filters to the problem of acoustical
tive Filters"). Here, the delay problem of subband echo cancellation. We described means to achievc robust
processing can be eased by partitioning the filter vector. perforniaice. We further presented methods for reducing
For tasks that do not allow additional delay and require computational complexity that allow implcmentations in
high-order filters (500 to 1500 taps, as in acoustic echo lowcost, fncd-point digital signal processors.
cancellation),ftd-band algorithms may be modified. Sev- Progress in technology mill allow the use of more so-
eral algorithm with reduced complexity, based on partial phisticated algorithms at lower cost i n the ncar futurc.
update of the coefficients, have been introduced. Therefore, research and development activitiesin the area
The coefficientsto be updated can be selected in differ- of acoustic echo control will continue oil a high Ievel.
ent ways: the periodic NLMS algorithm [70] and other
partial update algorithms, such as the sequential NLMS Chvistina Breining, Pia Lhiseitel, Ebeihwd Hiinslei;
1221 or the seqnential block NLMS, for example, use a Andveas M ~ d wBemhavd
, Nitsch, Hemkg Pudev, Thomas
static scheme to update some predefined coefficients of Schevtlev, Gevhand Schmidt, m d f a n Till,a x with the Signal
the filter vector. The reduction in complexity also ac- Theory Group at the Electrical Engineming lkpartmcnt
counts for a reduction of convergence specd. In fact, by of Darnistadt University of Technology, Darmstadt, Ger-
reducing the number of coefficients updated in every many. Their work focuses on problems of acoustic echo
sample period by a ccrtain factor, the adaptation perfor- cancellation and speech enhancement. h i d e from the de-
mance, in t e r m of convergencc speed, decreases by ap- sign and analysis of algorithms, emphasis is put on
proximately the same factor, compared to the real-time implementations in real-world em'lrolllllellts.
conrrntional NLMS algorithm. This can be an important
drawback, especially for acoustic echo cancellers, if insuf-
ficient convergence is noticeable.
Another algorithm tries to prcsenx the performance
of the regular NLMS algorithm by updating coefficients
with the largest update term [l], [Z]. These coefficients
are casy to detect by sorting the magnitude values ofthe
vector ~ ( 1 %and
) selecting thc coefficients associated with
the highest entries in this list. The results obtained with
this algorithm are very close to the M I update, but the dy-
namic update scheme has a high demand for memory,
namely mice the filter Icngth. Because small meinoy re-
quirements can be very important for commercial prod-
ucts, this can be prohibitive for its use in industrial
hands-free telephone sets. By dividing the input vector
x(n)into several sections and updating the cocfficientsre-
lated to the most excited sections, this disadvantage can
be circumventcd [62].
Conclusions
The control of acoustical echoes is an important h c t i o n
in speech processing systems. Hands-free telephones, for
example, not only enhance user comfort, but also increase
safety when used while driving a car. Acoustic ccho con-


T/ Iitsch,: Christina Breining, Dreiseitd, Etkrhard Xansler, Scheder, SCMDQ Andjan Henning

Uploaded by

Copyright:

Available Formats

T/ Iitsch,: Christina Breining, Dreiseitd, Etkrhard Xansler, Scheder, SCMDQ Andjan Henning

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

T/ Iitsch,: Christina Breining, Dreiseitd, Etkrhard Xansler, Scheder, SCMDQ Andjan Henning

Uploaded by

Copyright:

Available Formats

Christina Breining, Pia Dreiseitd, EtKrhard Xansler, kidreas Mader,

Be t\;iitsch,Henning &der, mas Scheder, Gerhard S c M d q and Jan Tilp

he problem of acoustic echo cancella-

JULY 1999 IEEE SIGNAL PROCESSING MACPiZlNE 43

where d ( n )is the microphone output signal, assuming the and

for simplicity, a stationary whitc input x(n)(Fig. l),the

44 IEEE SIGNAL PROCESSING MAGAZINE JULY 1999

In this case we have -0.3

A 3. Impdse response measured in a passenger car (sampling

ERLE,,, ( N j = 10 log e2“. dB = 8.7 aN dK. (8)

8.7 aN, dK = 60 dB, (9)

JULY 1999 IEEE SIGNAL PROCESSING MAGAZINE 45

This relatively high value shows that movements of the

L 6. Spectrum of the impulse response

46 IEEE SIGNAL PROCESSING MAGAZINE JULY 1999

JULY 1999 IEEE SIGNAL PROCESSING MAGAZINE 47

with1 I xi 1 =fi being the Z, norm andg(n) being nor-

A 7.Convergence of the indicatedalgorithms (N = 256) for

Convergence o f the Algorithms

48 IEEE SIGNAL PROCESSING MAGAZINE JULY 1999

fur white noise excitation. As expected, the convergence

spccd decreases with increasing filter length. 0 -20

limits (14).This happens because a disturbing sigual is

Speech Exdation rithin ordcr from 5 to 10 for speech, in contrast to colored

JULY 1999 IEEE SIGNAL PROCESSING MAGAZINE 49

50 IEEE SIGNAL PROCESSING MAGAZINE JULY 1999

Fixed or Adaptive Decorrelatian Filters?

Discussion of Predictor Order Choice of Coefficients for Fixed Decorrelation Filters

JULY 1999 IEEE SIGNAL PROCESSING MAGAZINE 51

and x(n) with

Comparing (18) to the adaptation rule of the affine pro- Time ( s )

52 IEEE SIGNAL PROCESSING MAGAZINE JULY 1999

Applications to Acoustic Echo Control

parallel at reduced sampling rates. The

JULY 1999 IEEE SIGNAL PROCESSING MAGAZINE 53

In subbands, assuming that there is no distortion caused

Full-band: 0, (N)= 2N,

54 IEEE SIGNAL PROCESSING MAGAZINE JULY 1999

JULY 1999 IEEE SIGNAL PROCESSING MAGAZINE 55

Input Signal Filter Impulse Response

A 23. Tine-frequency ana/ysis of (I LEM system impulse re-

block processing, which may work in the time or the fre-

56 IEEE SIGNAL PROCESSING MAGAZINE JULY 1999

JULY 1999 IEEE SIGNAL PROCtiSSlNG MAGAZINE 57

ble-talk situation. Thus, a combination of these

58 IEEE SIGNAL PROCESSING MAGAZINE JULY 1999

A large local background noise leads to a permanent

JULY 1999 IEEE SIGNAL PROCESSING MAGAZINE 59

60 lEEE SIGNAL PROCESSING MAGAZINE JULY 1999

JULY 1999 IEEE SIGNAL PROCESSING MAGAZINE 61

Combined State Detection a n d Step-Size Control Using

62 IEEE SIGNAL PROCESSING MAGAZINE JULY 1999

A 29. Basic scheme for stereophonic (IcOust,icecho cancellotion

JULY 1999 IEEE SIGNAL PROCESSING MAGAZINE 63

64 IEEE SIGNAL PROCESSING MAGAZINE JULY 1999

Mcasurc Cost Package S i 7 ~ Power Consumption

JULY 1999 IEEE SIGNAL PROCESSING MAGAZINE 65

where A is a diagonal matrix that assigns exponentially