T/ Iitsch,: Christina Breining, Dreiseitd, Etkrhard Xansler, Scheder, SCMDQ Andjan Henning
42 JULY 1999
response c(n) of the filter exactly with the impulse response g(n) of the LEM system, the signals x(n) and e(n) will be perfectly decoupled without any disturbing effects to the users of the electroacoustic system.
In this contribution, we mainly deal with hands-free telephone systems. These are monophonic systems, and their characteristics are regulated by the International Telecommunication Union (ITU). The most severe restrictions for signal processing are the tolerable delay times: only 2 ms are allowed for stationary telephones and only 39 ms for mobile telephones. In audio or video conference systems, stereophonic audio systems are desirable. Because signals in both channels are highly correlated, there is no unique solution for the impulse responses of the echo-cancellation filters to be applied in both channels. Hearing aids form a special class of audio systems. The problem of acoustic feedback arises when closed ear molds slip out of the ear or when open molds are used. The LEM system here is characterized by a very small enclosure leading to a short impulse response. Finite-impulse-response (FIR) echo-cancellation filters need only on the order of 10 coefficients. On the other hand, hearing aids are battery powered. Therefore, power consumption of the signal processing hardware is a major issue. The time delay introduced by echo cancellation should not exceed 1 ms. Finally, voice-control systems are gaining more attention. Here the microphone output is fed into a speech-recognition system requiring a minimum signal-to-noise ratio. In applications where loudspeaker signals are present, like computer games or remote control of television sets, cancellation of the acoustical echoes can significantly improve the recognition rates.
The problem of acoustic echo control has gained considerable attention during the last decade. Surveys containing further references may be found in [32] and [23].
This article is divided into three major sections: The first explains the properties of LEM systems and how to model them by an adaptive filter. Furthermore, it briefly describes the properties of the speech signals involved. The second, and main, section is devoted to the solution of the echo cancellation problem by the application of adaptive filters. In the third section, we show how to cope with implementation problems caused by the need to use inexpensive signal processing hardware.
Notation
Throughout this article, we use notation according to Fig. 1. Scalar variables (signals) are written with lowercase letters [e.g., x(n), d(n)]. For vectors we use lowercase boldface letters, e.g., c(n), g(n),
$$\mathbf{c}(n) = \left(c_0(n),\, c_1(n),\, \ldots,\, c_{N-1}(n)\right)^T.$$
Matrices are indicated by capital boldface letters. Norms are always L2 norms. Frequently used abbreviations are listed in Table 1.
[Fig. 1. Loudspeaker-enclosure-microphone (LEM) system with echo-cancellation filter (ECF) and the notation used in this article. Diagram labels: local speech signal, local noise, far-end speaker.]
[Fig. 2. Impulse response measured in an office (sampling frequency 8 kHz). Axis: Time (ms).]
Scenario
Loudspeaker-Enclosure-Microphone (LEM) Systems
The central elements in acoustic echo cancellation are a loudspeaker and a microphone placed within one enclosure. For low sound pressure and no overload of the A/D converters, this system may be modeled with sufficient
accuracy as a linear system. Moving objects or changing the temperature results in a time-variable impulse response. Its specific shape depends on the size of the enclosure, the reflection properties of its boundaries, and the position of objects (especially the loudspeaker and the microphone) within the enclosure. Depending on the application, it may be possible to design this system such that the reverberation time is small, resulting in a short impulse response. Examples of this solution are telecommunication studios. On the other hand, electronic means are the only tools to provide hands-free communication out of ordinary office rooms or cars, for example.
In general, the acoustic coupling within an enclosure is formed by a direct path between the loudspeaker and the microphone, and a very large number of echo paths. The impulse response can be described by a sequence of delta impulses delayed proportionally to the geometrical length of the related path. Reflectivity of the boundaries of the enclosure and the path length determine the impulse amplitude [3]. The reverberation time of an office is typically on the order of a few hundred milliseconds; that of the interior of a car, a few tens of milliseconds. Figs. 2 and 3 show impulse responses of LEM systems measured in an office and in a car. The microphone signals have been sampled at an 8-kHz rate. These impulse responses are highly sensitive to any changes within the LEM system. This is explained by the fact that, assuming a sound velocity of 343 m/s and an 8-kHz sampling frequency, the distance traveled between two sampling instants is 4.3 cm. Therefore, a 4.3-cm change in the length of an echo path moves the related impulse by one sampling interval. Thus, the need for an adaptive echo-cancellation filter (ECF) is evident.
Table 1. Frequently Used Abbreviations.
A/D    Analog/digital
AEC    Adaptive echo canceller
ERLE   Echo-return loss enhancement
ITU    International Telecommunication Union
LEM    Loudspeaker-enclosure-microphone
MLP    Multilayer perceptron
NLMS   Normalized LMS
PSD    Power spectral density
RLS    Recursive least squares
SAEC   Stereophonic acoustic echo cancellation
SNR    Signal-to-noise ratio
VAD    Voice activity detector
XLMS   Extended LMS
The question of the optimal structure of the ECF has been discussed extensively in the literature. Because a long impulse response must be modeled, a recursive (IIR) filter seems best suited at first glance. At second glance, however, the impulse response exhibits a highly detailed and irregular shape. To achieve a sufficiently good match, the replica must offer a large number of adjustable parameters. Therefore, an IIR filter does not show an advantage against a nonrecursive (FIR) filter [48], [53]. The irrefutable argument for preferring an FIR filter, however, is its guaranteed stability during adaptation. A measure to express the effect of echo cancellation is the so-called echo-return loss enhancement (ERLE):
An upper boundary for the effect of a filter of degree N - 1, i.e., a transversal filter with N taps, can be calculated by assuming a perfect match of the ECF,
$$c_i = g_i \quad \text{for } 0 \le i \le N-1. \tag{5}$$
$$\mathrm{ERLE}_{\max}(N) = 10 \log \frac{\sum_{i=0}^{\infty} g_i^2}{\sum_{i=N}^{\infty} g_i^2}\ \mathrm{dB}, \tag{6}$$
$$|g_i| = e^{-a i} \quad \text{for } i \ge 0, \tag{7}$$
where 0 < a < 1. Inserting (7) into (4) results in [...]
[Figure residue: impulse-response plot, axis "Time (ms)"; measurement sketch with labels "Microphone" and "Reflector".]
above leads to [...]
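As a numerical cross-check of the bound (6) under the exponential envelope (7), the short sketch below evaluates the ratio of total to unmodeled energy directly and compares it with the geometric-series result 20 a N log10(e) dB. That closed form is derived here from (6) and (7) and is not quoted from the article; the decay constant `a`, the truncation length, and the function name `erle_max_db` are assumptions chosen only for illustration.

```python
import math

def erle_max_db(g, n_taps):
    """Bound (6): energy of the whole response over the energy of the
    tail that a filter with n_taps coefficients cannot model."""
    total = sum(v * v for v in g)
    tail = sum(v * v for v in g[n_taps:])
    return 10.0 * math.log10(total / tail)

# Hypothetical decay constant; 8000 taps stand in for the infinite sum.
a = 0.005
g = [math.exp(-a * i) for i in range(8000)]

for n_taps in (256, 1024):
    numeric = erle_max_db(g, n_taps)
    # Geometric series: sum ratio equals exp(2*a*N), hence 20*a*N*log10(e) dB.
    closed_form = 20.0 * a * n_taps * math.log10(math.e)
    print(n_taps, round(numeric, 3), round(closed_form, 3))
```

Under this envelope assumption the achievable ERLE grows linearly with the number of taps, which is one way to see why long filters are needed for slowly decaying room responses.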
Measurement Setup
To measure the impulse response, we use the arrangement according to the ITU-T P.34 [42] recommendation (Fig. 4).
A dummy models the user of the hands-free telephone. The two impulse responses are measured for two different positions of the dummy without any other changes in the room between the two measurements. The measurement signals are sampled at 8 kHz with a 16-bit A/D converter. Floating-point operations are used for the calculation of the impulse responses. In Fig. 5, the first impulse response is measured for a distance of 40 cm between the dummy and the marked fix point.
The direct echo path is clearly visible at filter tap 16, which is equivalent to a sound path of about 80 cm, corresponding to the loudspeaker-microphone distance including small A/D-converter delays.
Fig. 6 depicts the spectrum of the impulse response with its dominant low-frequency components. Hence, we can see that high frequencies are better absorbed than low ones. The second impulse response is measured for a distance of 90 cm between the dummy and the fix point. Its properties are comparable to the curve shown in Fig. 5.
[Fig. 5. Measured impulse response.]
[Fig. 6: spectrum plot, axis "Frequency (Hz)".]
The system error norm between the two measured impulse responses is given by:
[...] local signals (s(n) = n(n) = 0). The small residual noise component due to the fact that we use longer impulse responses to model the LEM system than for the adaptive filter is neglected here and influences only the final misalignment.
The Affine Projection (AP) Algorithm
The AP algorithm can be considered as an extension of the NLMS algorithm. Its adaptation rule is given in [34], Table 5C.
A convergence analysis can be performed similarly to that for the NLMS algorithm, also resulting in the condition 0 < μ(n) < 2 for the undisturbed case. Also comparable to the NLMS algorithm is the decrease of convergence speed with increasing filter length.
One can show that the computational complexity for the AP algorithm is N_AP times higher [O(N_AP N)] than for the NLMS algorithm [O(N)], where N_AP denotes the order of the AP algorithm.
For the same reasons as for the NLMS algorithm, we choose the step-size μ(n) = 1 for the following simulations.
The Recursive-Least-Squares (RLS) Algorithm
The RLS algorithm belongs to another class of algorithms and is based on the minimization of the weighted squared error sum. Explanations of the algorithm and its adaptation rule are given in [34], Table 43.
The RLS algorithm involves the risk of instability inherent in its recursive adaptation rule. This risk is even [...]
[...] of the excitation signal during the initialization period [60]. A value for δ that is too small can lead to instabilities of the algorithm (overshoot phenomenon [41]), whereas values that are too large reduce the convergence speed, especially during the start-up of the adaptation.
The choice of the forgetting factor λ also influences the stability, the convergence, and the tracking behavior of the algorithm. We get the best tracking behavior for values of about λ = 1 - [...] (N: filter length). For reasons of stability, however, a choice of λ between 1 - [...] and 1 - [...] is recommended [60].
For the simulations we used δ = 0.01 σ_x² and λ = 0.999 for stationary excitation and δ = 100 σ_x² and λ = 0.9999 for speech excitation.
[Fig. 9. Power spectral density (PSD) of the colored noise generated by IIR filtering with 15 LPC coefficients. Axis: Frequency (Hz).]
Length and Final Minimum Misalignment of the Adaptive Filter
The algorithms are tested with an LEM model whose impulse response is somewhat longer than the adaptive filter. This is done for two reasons:
▲ In a real application, it is not possible to model the entire echo path with an adaptive filter.
▲ For comparison of the adaptive algorithms, the initial convergence and tracking behavior are of special interest. Adaptation curves that exceed -60 dB are not realistic for echo-cancellation applications.
We decided to test the algorithms with two different filter lengths (N = 256 and 1024), being realistic representatives for a car cabin and an office, respectively. The [...] truncated to lengths of 259 and 1030. The theoretical minimum misalignment is given by
$$10 \log \left\| \mathbf{c}(n) - \mathbf{g}(n) \right\|^2, \tag{15}$$
which is the squared norm of the difference between the LEM model g(n) and the adaptive filter c(n). Supposing that the lengths of g(n) and c(n) are different, the shorter filter is zero padded for the calculation of the norm (13).
The special aim of our investigations is to underline the dependence of the described adaptive algorithms on the input signals and filter lengths. We will see and work out that the best choice for an algorithm is closely related to the individual application problem.
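The misalignment measure with zero padding of the shorter vector can be sketched in a few lines. This is a minimal illustration, not the authors' simulation code; the function name `misalignment_db` and the toy coefficient values are assumptions.

```python
import math

def misalignment_db(c, g):
    """10*log10 of the squared L2 norm of c - g.
    The shorter vector is zero padded to the length of the longer one."""
    n = max(len(c), len(g))
    c_pad = list(c) + [0.0] * (n - len(c))
    g_pad = list(g) + [0.0] * (n - len(g))
    dist = sum((ci - gi) ** 2 for ci, gi in zip(c_pad, g_pad))
    return 10.0 * math.log10(dist)  # undefined if c and g coincide exactly

# Hypothetical 5-tap LEM model and a 3-tap filter that matches its first taps.
g = [1.0, 0.5, 0.25, 0.1, 0.05]
c = [1.0, 0.5, 0.25]
print(misalignment_db(c, g))
```

Even with a perfect match of the modeled taps, the unmodeled tail of g sets a floor on the misalignment, which is the effect the two truncated test responses are meant to expose.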
[Fig. 12. Convergence of the indicated algorithms (N = 256) for speech excitation (NLMS and AP: μ(n) = 1, RLS: λ = 0.9999). Axis: Time (s).]
[Fig. 13. Convergence of the indicated algorithms (N = 1024) for speech excitation (NLMS and AP: μ(n) = 1, RLS: λ = 0.9999). Axis: Time (s).]
[...] algorithm is based on the minimization of an exponentially weighted sum [26]. The larger the resulting memory, that is, the closer λ is to 1, the more slowly the RLS algorithm can follow room impulse changes. On the other hand, a forgetting factor close to 1 is necessary to ensure stability for speech excitation. For a choice of λ = 1 - 10^-4, the RLS algorithm is only superior to the AP algorithm up to order 2.
In summary, the NLMS algorithm converges very slowly for correlated excitation signals. Therefore, the AP algorithm leads to much better results. With respect to the initial convergence, the RLS algorithm shows the fastest convergence. However, this does not hold in the case of tracking, where the RLS algorithm is inferior to AP algorithms of higher orders.
NLMS Algorithm Using Decorrelation Filters
Even though various kinds of adaptive algorithms are theoretically applicable for acoustic echo-cancellation filters, in most cases, a simple and robust algorithm outperforms more sophisticated solutions. Therefore, in most applications with limited precision and processing power, the normalized LMS algorithm is applied. However, as was shown in the previous section, the adaptation performance of the NLMS algorithm, in the case of speech excitation, is rather poor due to the strong correlation of speech signals. One way to overcome this problem is to pre-whiten or decorrelate the incoming excitation signal before passing it to the adaptive algorithm [36]; especially in the field of echo cancellation, decorrelation filters are widely applied [28], [77], [78]. Decorrelation filters are predictor error filters, with their coefficients matched to the correlation properties of the speech signal. Because speech signals are nonstationary, one has to decide whether the prediction coefficients should be adapted periodically to be exactly matched with the current section of the speech signal, or simply to the long-term statistics of speech. This, as we will see later, again means a tradeoff between processing power and performance.
There are various possibilities for placing the decorrelation filters within the system. For identification purposes, it would be sufficient to simply filter the excitation signal. But because this signal also serves as loudspeaker signal and is, therefore, transmitted into the room, pre-processing is not possible. One can move this linear system into the filter branch and also across the LEM system (Fig. 14). Because the filter is still in the signal path to the far-end listener, one has to apply an inverse filter after the calculation of the error value. To invert the linear prediction, one would have to use a recursive filter and apply the same coefficients as in the prediction filter. Because a linear prediction error filter is minimum phase, its inverse is always causal and stable [36].
One can omit the decorrelation filter (and, consequently, its inverse) in the signal path, in which case the adaptive filter also must model the inverse decorrelation filter [78].
A different approach for the application of decorrelation filters, especially if adaptive decorrelation filters are used, is the implementation of an auxiliary loop for adaptation, as shown in Fig. 15 [28], [65]. Because the coefficients of the echo-cancellation filter adapted in the auxiliary loop are now copied into the echo-cancellation filter in the signal path, there is no more need for a decorrelation filter or its inverse in the signal path. In the case of fixed-point processing, where recursive filters of a high order may cause stability problems, this is of considerable advantage. Moreover, for adaptive decorrelation filters it is preferable not to have constantly changing decorrelation filters in the path of the estimated echo signal. Otherwise, the data vectors within the echo-cancellation filter would have to be newly calculated every time the decorrelation filter is updated, which could lead to distortions in the near-end signal.
Implementing this second path, however, results in an increased processor load in terms of both memory and computational load. The filtering operation of the echo [...]
[...] sensible to use higher decorrelation coefficients. For an excitation with a female voice, the adaptation performance shows the opposite behavior. A switch between two fixed coefficients for male and female speakers may, therefore, be of great advantage and overcome this misalignment.
Comparison of Decorrelation to More Complex Adaptation Algorithms
We have shown thus far that the use of decorrelation filters improves the performance of the NLMS algorithm enormously. However, we still have to compare this technique to the more sophisticated adaptation algorithms. When we look closely at decorrelation filters, we see that this processing has an effect similar to the affine projection algorithm. Combining the decorrelation filters and the NLMS algorithm shows that the AP algorithm and the NLMS algorithm with decorrelation filters exhibit the same structure. As a result, we can concentrate on these two algorithms.
Recalling the adaptation equation of the NLMS algorithm [34]
$$\mathbf{c}(n+1) = \mathbf{c}(n) + \mu(n)\,\frac{\mathbf{x}(n)\,e(n)}{\mathbf{x}^T(n)\,\mathbf{x}(n)},$$
we replace e(n) with its decorrelated equivalent
$$e_f(n) = \mathbf{a}^T(n)\,\mathbf{e}(n)$$
and the excitation vector with
$$\mathbf{x}_f(n) = [\mathbf{x}(n),\, \mathbf{x}(n-1),\, \ldots]\,\mathbf{a}(n) = \mathbf{X}(n)\,\mathbf{a}(n)$$
to yield
$$\mathbf{c}(n+1) = \mathbf{c}(n) + \mu(n)\,\mathbf{X}(n)\,\frac{\mathbf{a}(n)\,\mathbf{a}^T(n)}{\mathbf{a}^T(n)\,\mathbf{X}^T(n)\,\mathbf{X}(n)\,\mathbf{a}(n)}\,\mathbf{e}(n)$$
with [...]
$$\mathbf{R} := \left[\mathbf{X}^T(n)\,\mathbf{X}(n)\right]^{-1}.$$
Despite this fact, one can show that the two algorithms are equivalent if the coefficients a(n) are adapted in each sample period as the forward prediction error filter. In this case, the first row or first column of R and A(n) are equal, leading to the same adaptation rule for both algorithms.
Comparing the simulation results of both the AP algorithm and the NLMS algorithm with decorrelated excitation (Fig. 19), the performance of the latter algorithm appears to be similar to an AP algorithm of lower order. Affine projection, however, has a superior performance, especially in a quiet environment without background noise, where very low system errors can be achieved. However, because the adaptation process of the decorrelation coefficients is only performed periodically, [...]
[Fig. 16. Convergence of the NLMS algorithm using decorrelation filters with speech excitation. Axis: Time (s).]
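The effect of decorrelation on NLMS convergence can be illustrated with a toy identification run. The sketch below is not the authors' setup: it assumes a first-order autoregressive excitation with a known coefficient `rho`, a fixed first-order prediction error filter `a = [1, -rho]` applied to both the excitation and the microphone signal, no local signals, and a short random echo path. All names and parameter values are hypothetical.

```python
import random

def nlms_identify(x, d, n_taps, mu=1.0, eps=1e-8):
    """Plain NLMS; returns the coefficient vector after one pass over x and d."""
    c = [0.0] * n_taps
    buf = [0.0] * n_taps                     # excitation vector, newest sample first
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        e = dn - sum(ci * bi for ci, bi in zip(c, buf))
        norm = sum(bi * bi for bi in buf) + eps
        c = [ci + mu * e * bi / norm for ci, bi in zip(c, buf)]
    return c

def fir(h, x):
    """Causal FIR filtering of x with impulse response h (zero initial state)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def sq_dist(c, g):
    return sum((ci - gi) ** 2 for ci, gi in zip(c, g))

rng = random.Random(0)
n_taps, n_samples, rho = 16, 3000, 0.95
g = [rng.uniform(-1.0, 1.0) * 0.8 ** i for i in range(n_taps)]  # hypothetical echo path

# Strongly correlated excitation: first-order autoregressive process.
x, prev = [], 0.0
for _ in range(n_samples):
    prev = rho * prev + rng.gauss(0.0, 1.0)
    x.append(prev)
d = fir(g, x)                                # echo only, no local speech or noise

a = [1.0, -rho]                              # fixed prediction error filter (assumed known)
x_f, d_f = fir(a, x), fir(a, d)              # decorrelated excitation and microphone signal

plain = sq_dist(nlms_identify(x, d, n_taps), g)
decorrelated = sq_dist(nlms_identify(x_f, d_f, n_taps), g)
print(plain, decorrelated)
```

Because the prediction error filter is linear, filtering the microphone signal is equivalent to exciting the same echo path with the whitened signal, so the second run sees a white input and converges much faster. With real speech the predictor coefficients are neither known nor constant, which is where the fixed-versus-adaptive tradeoff discussed in the text comes in.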
Subband Processing
Acoustic-echo-control systems can be realized either in full-band or subband structures [46]. Both approaches show advantages and disadvantages, especially with respect to computational complexity and signal delay. In this section, subband structures will be introduced and compared to full-band systems.
By using filterbanks, the signal x(n) of the far-end speaker and the microphone signal y(n) are split up into several subbands (Fig. 20). Depending on the properties of the low-pass and bandpass filters, the sampling rate in the subbands can be reduced. According to this reduction, the length of the adaptive filters can also be lowered. Instead of filtering the full-band signal and adapting one filter at the high sampling rate, Q (number of subbands) convolutions and adaptations with subsampled signals are performed in parallel.
[Fig. 19. Convergence comparison between the NLMS algorithm with adaptive decorrelation filters and the AP algorithm of 2nd and 5th orders.]
[Fig. 20. Structure of an acoustic-echo-control system operating in subbands. By using two analysis filterbanks, the far-end speaker and the microphone signal are decomposed into Q subbands. In each subband, a digital replica of the echo signal is generated by convolution of the subband excitation signals and the estimated subband impulse responses. After subtraction of the measured and the estimated subband echo signals, the full-band error signal is synthesized using a third filterbank.]
Computational Complexity
In the full-band structure, the number of multiplications required to perform the convolution and the adaptation using, e.g., the NLMS algorithm and an adaptive filter of length N is given by O_FB(N) = 2N. Assuming that Q subbands are used with equal bandwidths and equal subsampling
factors r, the number of multiplications in the subband structure O_SB(N) can be denoted as
$$O_{SB}(N) = 2N\,\frac{2Q}{r^2}.$$
The factor 2Q in the numerator results from the computation of Q/2 complex-valued convolutions and adaptations. The subband signals are subsampled by a factor r, which reduces the complexity by r. The subsampled convolutions and adaptations have to be computed only every r-th sample period, further reducing the complexity by the factor r.
O_SB(N) does not include the computational overhead associated with the filterbanks. By choosing a polyphase FIR filterbank as described in [10] and [74], with Q = 16 channels and a prototype low-pass filter of length N_LP = 256, the complexity including the additional load of two analysis filterbanks and one synthesis filterbank can be denoted as
$$\text{Subband:}\quad O_{\mathrm{total,SB}}(N) = \underbrace{3\,\frac{N_{LP}}{r} + 3\,\frac{Q}{r}\,\mathrm{ld}\,Q}_{\text{filterbank}} + \underbrace{4N\,\frac{Q}{r^2}}_{\text{filtering and adaptation}}.$$
In Fig. 21, the number of multiplications per sample period versus the length of an adaptive full-band filter is shown. Depending on the subsampling factor r, the effective reduction of computational load starts at a filter length of approximately 100 coefficients (for Q = 16).
Design of Filterbanks
The overall quality of the subband echo control system strongly relies on the quality of the filterbank. Therefore, many papers on this topic have been published in recent years. Excellent tutorials about filterbanks and multirate systems can be found in [21], [71], and [72].
In designing the filterbank, a compromise between filter length, subsampling rate, stop-band attenuation, and aliasing must be found. The filter should be as short as possible, which leads to a nonzero transition band. To diminish aliasing, the subsampling factor must be decreased or, in critically subsampled filterbanks (r = M), a cascade of notch filters should be implemented. Decreasing the subsampling rate reduces the aliasing terms, but also distorts the flatness of the filterbank transfer function in each [...]
Speed of Convergence
For white input signals, adapting with the NLMS algorithm (see "The Normalized Least Mean Square Algorithm")
$$\mathbf{c}(n+1) = \mathbf{c}(n) + \mu(n)\,\frac{\mathbf{x}(n)\,e(n)}{\mathbf{x}^T(n)\,\mathbf{x}(n)},$$
and using a step-size μ(n) = 1, the reduction of the mean square error (in full-band) can be approximated by [27]
[...]
After r adaptations, this quotient can be written as
[...]
Especially for large N, both structures exhibit nearly the same speed of convergence. If speech is used as an input signal, the convergence of an adaptive filter depends on the eigenvalue spread of the correlation matrix of the input signal, which is bounded by the quotient of the maximum and the minimum of the power spectral density [36]. In ideal subband structures, this quotient is reduced, leading to a faster convergence. However, this advantage is partially lost by nonideal filters, which increase the eigenvalue spread. For properly designed filterbanks, both effects nearly compensate. Using speech as the input signal, the speed of convergence is about the same for both full-band and subband structures.
Artificial Delay
By transforming a causal full-band impulse response into a parallel system of subband impulse responses, it can be shown that the resulting subband impulse responses have noncausal taps [46], [75]. In order to reach a satisfactory echo reduction, some of these noncausal taps should also be modeled by delaying the microphone signal. The artificial delay also increases the requirements for echo attenuation [66], due to the increased echo sensitivity of the human ear.
[Fig. 21. Number of multiplications per sample period in full-band and subband mode. For the subband structure, a filterbank with Q = 16 channels and a prototype low-pass filter length N_LP = 256 is used. Only for small filter length N does the full-band structure need fewer multiplications than the subband structure (due to the filterbank overhead). For large N, subband processing is superior to full-band processing in terms of computational complexity. Axis: N.]
[Fig. 22. Convergence with 0 and 8 ms (64 full-band taps) of artificial delay. Here the results of two system identifications are presented. For both identifications, the same excitation signal (white noise) and the same set of parameters were used. Without any artificial delay, an ERLE of about 25 dB (difference between the power of the microphone signal and the power of the error signal) can be achieved. By introducing a delay of 8 ms in the microphone path, an increase of about 8 dB in the ERLE is achieved. Axis: Seconds.]
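The complexity comparison shown in Fig. 21 can be explored numerically. The sketch below is only an illustration: the subband expression is a reading of the damaged formula in this copy (filterbank overhead 3 N_LP/r + 3 (Q/r) ld Q plus 4 N Q / r² for filtering and adaptation), and the subsampling factors tried here are hypothetical, since the article's value of r is not recoverable from the text. Function names are assumptions.

```python
import math

def full_band_mults(n):
    """Multiplications per sample for convolution plus NLMS adaptation: 2N."""
    return 2 * n

def subband_mults(n, q=16, r=12, n_lp=256):
    """Filterbank overhead plus subband filtering and adaptation
    (reconstructed expression, see the lead-in)."""
    filterbank = 3 * n_lp / r + 3 * (q / r) * math.log2(q)
    filtering = 4 * n * q / r ** 2
    return filterbank + filtering

def crossover_length(q=16, r=12, n_lp=256):
    """Smallest filter length N for which the subband structure is cheaper.
    Terminates only if 4*q/r**2 < 2, i.e. if subband filtering alone is cheaper."""
    n = 1
    while subband_mults(n, q, r, n_lp) >= full_band_mults(n):
        n += 1
    return n

for r in (8, 10, 12, 14):                    # hypothetical subsampling factors
    print(r, crossover_length(r=r), round(subband_mults(1024, r=r)))
```

For these assumed values the break-even length lies between a few dozen and roughly a hundred coefficients, in line with the order of magnitude stated in the text, and the advantage of the subband structure grows with N.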
$$\mathbf{c}_k(nB) = \left[c_{kL}(nB),\, \ldots,\, c_{(k+1)L-1}(nB)\right]^T \tag{23}$$
are the filter impulse responses of the subfilters.
The adaptation part of the filter can be partitioned similarly to the filter part:
[...]
(22) and (24) can be implemented efficiently using the methods described in "Complexity Reduction."
By sectioning the filter impulse response, we are able to adjust the algorithm parameters, e.g., the block length B, the length of the subfilters, and the number of subfilters, according to the requirements of our application, e.g., the signal delay and the computational complexity.
Control of Adaptive Filters
Rationale
In order to achieve high-quality echo cancellation, it is necessary to ensure both a sufficient speed of convergence and stability of the adaptive filter. A high speed of convergence requires large adaptation steps; however, oversized steps lead to a divergence of the filter coefficients.
For most of the adaptive algorithms, the adaptation step, given by the filter update, is chosen to be proportional to the undisturbed error signal ε(n) and the step-size μ(n). The undisturbed error signal is described by the error signal without any local disturbance [ε(n) = d(n) - d̂(n)]. However, this undisturbed error signal is only available in simulations, so that the disturbed error signal e(n) must be used instead, resulting in the following equation for the filter update:
$$\mathbf{c}(n+1) - \mathbf{c}(n) = \mu(n)\,e(n)\,\mathbf{r}(n),$$
where r(n) governs the direction of the filter update. For instance, in the case of the LMS algorithm, this direction is given by the direction of the vector x(n).
A constant step-size μ(n) is assumed until the step-size control is presented. If any disturbance exists on the local side, there will be a malfunction of the adaptive algorithm due to approximation of the undisturbed error signal by the disturbed error signal. The error signal grows with the local disturbance. (In the remaining section, the disturbed error signal is meant when referring to the error signal.) The adaptive algorithm mistakes this increase of the error signal for an increase of the residual echo. Thus, the adaptation steps are actually oversized, which, in turn, can lead to an instability of the adaptive filter. There are several quantities leading to a disturbance on the local side:
▲ The signal of the local speaker leads to an essential disturbance of the adaptive filter. Activity of both the local and the far-end speaker is called double-talk.
▲ A permanent local noise (e.g., background noise in a car cabin) disturbs the error signal.
▲ It is not possible to imitate the complete impulse response due to the large number of filter coefficients required. The residual echo caused by the part of the system that cannot be modeled can be interpreted as additional local noise.
▲ Fixed-point DSPs are often used in implementations to limit the cost. The quantization noise of the fixed-point arithmetic can also be considered as additional local noise, which adversely affects the stability of the filter.
In each of these cases, one must reduce the adaptation step by using a smaller step-size. However, a permanent small step-size may still be too high in the case of a large local disturbance, and it slows down the convergence speed when no local disturbance is present. Therefore, the step-size should be controlled.
Another problem is the time variance of the LEM system. A variation of the impulse response of the LEM system also leads to a filter mismatch and, therefore, to a rapid increase of the error signal. In contrast to a large error signal caused by a local disturbance, the step-size should not be lowered in this case. The adaptive filter should be adjusted rapidly to the modified LEM impulse response, which makes a large step-size necessary.
To cope with these problems, we need to design a control mechanism to determine the various states of the system and to choose the step-size accordingly. (The state of the system is, for instance, described by the activity or non-activity of the speakers. This notation will be explained in "Control of Speech Enhancement Algorithms in the State Space.") Two examples of such mechanisms will be presented in "A Correlation-Based Double-Talk Detection" and "Step-Size Control Based on Delay Coefficients." Other approaches to control the adaptive algorithm are proposed in [5], [29], [35]. However, for the present, fully reliable algorithms are not yet available. Similarly, most of these control algorithms are able to detect just one parameter of the acoustic echo canceller. For instance, some algorithms are available that can detect a double-talk situation. There are other algorithms that estimate the step-size without knowledge about the double-talk [...]
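The influence of a local disturbance on the adaptation can be made concrete with a small experiment. The sketch below runs an NLMS-type update (the filter update above with r(n) = x(n) / xᵀ(n)x(n)) against a hypothetical echo path with white excitation and additive local noise, once with a large and once with a small constant step-size, and compares the remaining squared system distance. The echo path, noise level, and function name are assumptions made for this example only.

```python
import random

def nlms_final_distance(mu, n_taps=16, n_samples=4000, noise_std=0.3, seed=1):
    """Run NLMS with a constant step-size mu under local noise and return the
    squared system distance averaged over the last quarter of the run."""
    rng = random.Random(seed)
    g = [0.7 ** i for i in range(n_taps)]    # hypothetical LEM impulse response
    c = [0.0] * n_taps
    buf = [0.0] * n_taps                     # excitation vector, newest sample first
    tail = []
    for n in range(n_samples):
        buf = [rng.gauss(0.0, 1.0)] + buf[:-1]
        echo = sum(gi * bi for gi, bi in zip(g, buf))
        local = rng.gauss(0.0, noise_std)    # local disturbance
        # Disturbed error signal: residual echo plus local disturbance.
        e = echo + local - sum(ci * bi for ci, bi in zip(c, buf))
        norm = sum(bi * bi for bi in buf) + 1e-8
        c = [ci + mu * e * bi / norm for ci, bi in zip(c, buf)]
        if n >= 3 * n_samples // 4:
            tail.append(sum((gi - ci) ** 2 for gi, ci in zip(g, c)))
    return sum(tail) / len(tail)

large = nlms_final_distance(mu=1.0)
small = nlms_final_distance(mu=0.1)
print(large, small)
```

The small step-size ends with a noticeably lower system distance under disturbance, but it also converges more slowly when the filter is far from the LEM response, which is the conflict a step-size control has to resolve.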
[...] (26)
Taking (25) together with (26), a rule for the step-size can be specified.
The above algorithm works quite well for time-invariant room impulse responses. It can deal with different levels of background noise, and correctly sets the step-size to very small values during double-talk situations. However, problems arise due to changes of the room impulse response: assuming a low system-error norm, the delay coefficients are close to zero. This is interpreted by this method as a low adaptation error. Therefore, the step-size is chosen close to zero. A change of the room impulse response that appears at this instant will not affect the delay coefficients. Hence, the step-size is kept very small. Therefore, the adaptive filter is unable to adapt toward this modified room impulse response, resulting in frozen filter coefficients. In the case of a higher background-noise level, this problem is less severe, due to the variation of the delay coefficients caused by the local noise.
In summary, one can say that the delay-filter-coefficients method functions adequately in a noisy environment, except for the ability to detect a change of the LEM system. Therefore, combining it with other methods that reliably detect changes of the room impulse response is helpful.
A Combined Step-Size Control Using Neural Networks and Fuzzy Logic
Deterministic Derivation of an Optimal Step-Size
For the systematic design of control methods, a detailed analysis of the methods and their behavior is helpful. The step-size μ_opt,1(n) of (25), which is based on a very simple speech model, does not allow for such observations. The assumptions used to derive this step-size are far from fulfilled for voiced speech, whose short-time power spectral density is decidedly more pronounced at lower frequencies than at higher ones, and whose average short-time power and spectrum change significantly during transient phases between the phonemes. Obviously, for optimum step-size control in the presence of speech, μ_opt,1(n) should not be regarded as the reference. Moreover, estimation of the mean adaptation error power does not produce exact results when speech signals are involved. This is due to the nonstationarity of speech, which limits the estimation interval. It can actually be shown that using an estimate of μ_opt,1(n) on the signals available in the simulation [...]
[Fig. 26. Analysis of a step-size control method by utilizing optimum step-size and cost function.]
[...] control. Special voice activity detectors (VAD) are added for this purpose, mostly on the basis of excitation signal power estimation. When a very high threshold for speech activity detection is chosen, even unreliable step-size control methods can avoid divergence of the filter, but the real-time convergence speed decreases significantly.
To enable the systematic optimization of the complete step-size control, a second optimum step-size is introduced in [11]. It is deduced from deterministic equations instead of statistical signal models. No assumptions are needed about the characteristics of the signals, which makes the calculation valid for every possible far-end or local signal. Moreover, because the exact signals are known during simulations, this optimum step-size can be calculated in simulation environments, so that step-size control methods can be analyzed in predetermined situations over short time intervals. This step-size is derived by minimizing the Euclidean distance between the filter vector and the LEM impulse response, which results in:
[...]
To calculate this step-size, the value and sign of the adaptation error must be known. As stated before, this information is available in a simulation, but it cannot be estimated easily in real time. On the other hand, the fact that it is measurable in simulations is very useful for evaluating the performance of step-size control methods in a realistic environment. The comparison of the step-size produced by the step-size control method and the optimum instantaneous step-size indicates whether the control method generally over- or underestimates the step-size in certain situations. This information can be exploited to tune the parameters of a step-size control [...]
tion does not even guarantee convergence when speech method, or to analyze it in detail.
excites the adaptive filter [ H I . Therefore, in real-time ap- Moreover, a cost function can be derived from
plications, the step-size is usually set to zero during criti- (n)that provides a ranlung of estimation errors ac-
cal intervals such as speech pauses, at the beginning and cordiug to the damage they cause to the echo attenuation.
end of which the adaptation is especially difficult to coil- It is denoted as:
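The idea of a deterministic optimum step-size that is computable only in simulation can be illustrated with a small sketch. It assumes a plain NLMS update, c(n+1) = c(n) + mu e(n) x(n) / ||x(n)||^2, which is an assumption made here for illustration. Under that assumption, minimizing the squared Euclidean distance between the filter vector and the (simulated, hence known) impulse response gives mu = eps(n) / e(n), where eps(n) is the undistorted adaptation error and e(n) the measured error including the local signal. The names `simulate_optimal_step`, `eps`, and the synthetic decaying impulse response are hypothetical, and the expression is not claimed to match the source's equation in notation or form.

```python
import random


def simulate_optimal_step(num_taps=8, num_samples=400, noise_std=0.5, seed=0):
    """Run NLMS with the simulation-only step-size eps(n)/e(n).

    Returns the squared system distance ||g - c(n)||^2 after every update.
    All signals are synthetic; the impulse response g is a hypothetical
    exponentially decaying sequence, not a measured room response.
    """
    rng = random.Random(seed)
    g = [rng.gauss(0.0, 1.0) * 0.7 ** i for i in range(num_taps)]
    c = [0.0] * num_taps          # adaptive filter coefficients
    x = [0.0] * num_taps          # excitation vector (most recent sample first)
    distances = []
    for _ in range(num_samples):
        x = [rng.gauss(0.0, 1.0)] + x[:-1]
        d = [gi - ci for gi, ci in zip(g, c)]        # system error vector
        eps = sum(di * xi for di, xi in zip(d, x))   # undistorted adaptation error
        e = eps + rng.gauss(0.0, noise_std)          # measured error incl. local signal
        power = sum(xi * xi for xi in x)
        if power > 0.0 and e != 0.0:
            mu = eps / e                             # assumed deterministic optimum
            c = [ci + mu * e * xi / power for ci, xi in zip(c, x)]
        distances.append(sum((gi - ci) ** 2 for gi, ci in zip(g, c)))
    return distances
```

Because mu needs both the value and the sign of eps(n), which depend on the true impulse response, the sketch only works inside a simulation. With this choice the system distance cannot increase from one sample to the next, even when the local signal is strong, which is what makes such a step-size usable as a reference for judging practical control methods.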
▲ 27. The hands-free telephone set in state space (diagram labels: local speaker inactive or active; insufficient or sufficient excitation; inner cube and outer cube; step gain from 0 to 1). The diagram focuses on those states with far-end speaker activity. Distinguishing these states is very difficult because the far-end and the local speech signal can interfere. However, it is extremely important for the performance of the acoustic echo canceller.

lead to the wrong control commands. The importance of the correct detection of certain situations varies for the different speech enhancement algorithms. For example, double-talk detection must not be delayed when its result is used for acoustic echo cancellation, whereas the delay is less important for adaptive loss control. This implies that, given a certain variance in the results of the estimators, the decision thresholds for the distinction of several states should be dependent on the algorithm that uses the detection result. Additionally, the values of the parameters usually vary over a continuous range, so that the crisp set of states must be re-mapped to continuous values. Both of these
makes it very difficult to optimize them, especially with a large number of inputs. Therefore, if a larger number of detectors is available, or if a larger number of states is to be distinguished, this optimization can no longer be carried out by hand in reasonable time. For the case of combining a large set of step-size methods, we examine an approach using fuzzy learning vector quantization, which has the additional advantage of providing state information.

In a first approach, every state is assigned one step-size. Therefore, the step-size control is actually reduced to a multi-input/multi-output decision, where the decision can be made by a pattern-recognition algorithm such as learning vector quantization (LVQ) [47]. It can be used to automatically partition the input space spanned by the input values, i.e., the results of the step-size control methods, into regions that are assigned to a state. For reasons of robustness, a fuzzy algorithm is applied here that is similar to the ones proposed in [45]. The advantage of such an approach is that the results of the state detector can be evaluated on a test data set, and that the parameters that must be tuned for step-size optimization are linearly combined, so that their tuning is much easier than in a normal fuzzy system. The tuning can even be supported by the cost function presented in “Deterministic Derivation of an Optimal Step-Size.” Moreover, the arithmetic to be performed in real time only consists of a number of multiplications and accumulations for which a digital signal processor is optimized. On the other hand, we have to train the state vectors first, then assign a step-size to each state. In order to provide training data with sufficient echo attenuation, the first training cycle is carried out with data obtained by adaptation with the optimal step-size of (27). In a second cycle, the LVQ is trained online, i.e., the step-size that it produces is used for the adaptation and thus influences the training data. By this approach, the situations provided by the training data become increasingly similar to real-time situations.

Because the membership functions are calculated from the Euclidean distance between the input vector and the state vector, criteria with large output values influence the state detection more strongly than those with small output values. Therefore, the performance can be increased considerably if a suitable transformation for each input value is found. Another consequence of the LVQ approach is that some states are very difficult to separate. In particular, these are the very highly distorted double-talk state and the state of initial adaptation, because the adaptation error is speech-like and of rather high power in both cases. In our implementation, an additional state was introduced to incorporate the special case of initial adaptation, thus circumventing the problem of the LVQ recognizing double-talk at the beginning. Such a decision would indicate a low step-size and keep the adaptation from starting at all.

Results of the step-size control by fuzzy LVQ with nine state vectors are shown in Fig. 28. The simulations were carried out with speech as excitation and distortion. For the case of background noise, car noise was used at an average SNR level of 0 dB. At about the 12th second, a change in the LEM impulse response was simulated as measured during the movement of a dummy in front of the hands-free telephone. The change occurred during an interval of about one second. This environment was chosen because a gradual change is much harder to detect than a sudden one, and because this kind of change is more typical in real time. For details of the implementation, see [14]. To facilitate the use of this method, radial basis functions [55], [73] can be used instead of the fuzzy system to implement a fuzzy-neural system and thus

▲ 28. Top: results of the delay coefficients step-size control method. Bottom: results for the combination of step-size control methods by LVQ. The system reacts correctly in the case of a change in the LEM impulse response, in contrast to applying the delay coefficients method only. (Legend: double talk; single talk, car noise (SNR 0 dB); no background noise; double talk, car noise (SNR 0 dB). Horizontal axis: time (s), 0 to 16.)
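The mapping from detector outputs to a step-size by fuzzy state memberships can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [14]: the membership formula (inverse squared Euclidean distance, normalized, in the style of fuzzy c-means) and the names `memberships`, `step_size`, and `fuzziness` are hypothetical. What it keeps from the text is that memberships depend on the Euclidean distance between the input vector and each state vector, and that the per-state step-sizes are combined linearly, so the run-time arithmetic is only multiply-accumulate operations.

```python
def memberships(inputs, state_vectors, fuzziness=2.0):
    """Fuzzy membership of `inputs` in each state.

    Memberships are derived from squared Euclidean distances to the
    state vectors and sum to one (assumed fuzzy c-means style rule).
    """
    dists = [sum((a - b) ** 2 for a, b in zip(inputs, s)) for s in state_vectors]
    if 0.0 in dists:
        # Input coincides with a state vector: crisp membership.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    weights = [d ** (-1.0 / (fuzziness - 1.0)) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


def step_size(inputs, state_vectors, state_steps, fuzziness=2.0):
    """Linear combination of per-state step-sizes, weighted by membership."""
    u = memberships(inputs, state_vectors, fuzziness)
    return sum(ui * mu for ui, mu in zip(u, state_steps))
```

For example, with two hypothetical states at [0, 0] (step-size 0) and [1, 1] (step-size 1), an input halfway between them yields a step-size of 0.5. Because the distance is not normalized per input, a detector with a large output range dominates the decision, which is why the text recommends a suitable transformation of each input value.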
[Table fragment. Criteria rows: simple arithmetic; small memory; low processing power; established technology. Column headings and cell entries are not recoverable.]
$$\mathbf{A} = \operatorname{diag}(\alpha_0, \alpha_1, \ldots, \alpha_{N-1}) \qquad (34)$$

with $\alpha_i = \alpha_0 \gamma^i$ for $0 < i < N$, and $\gamma$ being an exponential attenuation ratio with the same logarithmic decay as the impulse response ($0 < \gamma < 1$).
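A short sketch may help to see what the diagonal step-size matrix of (34) does in practice. It assumes that the matrix simply replaces the scalar step-size in a normalized LMS update, so that coefficient i is adapted with its own step-size alpha_0 gamma^i. The update form and the names `es_step_sizes` and `es_nlms_update` are assumptions made for illustration.

```python
def es_step_sizes(alpha0, gamma, num_taps):
    """Diagonal entries alpha_i = alpha0 * gamma**i of the step-size matrix."""
    return [alpha0 * gamma ** i for i in range(num_taps)]


def es_nlms_update(c, x, e, alphas):
    """One normalized update with an individual step-size per coefficient.

    c: current filter coefficients, x: excitation vector,
    e: error sample, alphas: diagonal of the step-size matrix.
    """
    power = sum(xi * xi for xi in x)
    if power == 0.0:
        return list(c)  # no excitation, leave the filter unchanged
    return [ci + ai * e * xi / power for ci, ai, xi in zip(c, alphas, x)]
```

With alpha0 = 1 and gamma = 0.5, four taps receive the step-sizes 1, 0.5, 0.25, and 0.125. The late coefficients are therefore updated in very small increments, which is the property that interacts badly with coarse fixed-point quantization.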
Because the ES-NLMS algorithm uses very small step-sizes for some coefficients, the stopping phenomenon significantly decreases the performance of the adaptation when implemented with quantized filter taps. Fig. 31 shows convergence curves of the ES-NLMS algorithm with filter taps in floating-point and fixed-point precision.

This behavior can be improved by making use of a typical property of room impulse responses. As shown in Fig. 5, the average magnitude of the room impulse response decays exponentially. Fig. 32 shows the magnitude of the room impulse response and its quantized envelope. Because adaptive algorithms try to identify the room impulse response $g(n)$ with the filter $c(n)$, the coefficients at the tail of $c(n)$ only make use of a portion of the available bits. The bits above the dashed line in Fig. 32 remain unused. This “waste” of bits would not occur if the shape of the room impulse response were more uniform.

To achieve the desired behavior, one can introduce a scaled filter vector $c_s(n)$. Its relation to the normal filter vector $c(n)$ can be described by using the diagonal scaling matrix $\mathbf{B}$:

where $e_s(n)$ is the error signal obtained using scaled filter coefficients.

Assuming infinite precision, the scaled filter update equation and the conventional NLMS equation are equivalent, despite the fact that the normal error signal $e(n)$ and the error used in filter scaling, $e_s(n)$, can differ during adaptation. By using the diagonal scaling matrix $\mathbf{B}$, the filter vector $c(n)$ adapts to a “faked” room impulse response. Using (36) and (37) improves the performance of the learning curve with fixed-point coefficients in Fig. 31 to almost the same level as the curve with floating-point coefficients. Additional information about filter scaling can be found in [61].

▲ 32. Logarithmic representation of the room impulse response $|g(i)|$ (solid line) and its quantized envelope (dashed line).
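The “waste” of bits and the benefit of scaling can be made concrete with a small fixed-point experiment. The sketch below is only an illustration under loud assumptions: the impulse response is a synthetic exponentially decaying sequence, the scaling entries are chosen as b_i = gamma^i so that the stored (scaled) coefficients all have similar magnitude, and the relation "true coefficient = b_i times stored coefficient" is assumed rather than taken from the source. The names `quantize`, `stored_response`, and `lost_taps` are hypothetical.

```python
def quantize(value, frac_bits):
    """Round to a signed fixed-point grid with `frac_bits` fractional bits.

    The representable range is [-1, 1 - 2**-frac_bits]; values outside saturate.
    """
    step = 2.0 ** -frac_bits
    q = round(value / step) * step
    return max(-1.0, min(1.0 - step, q))


def stored_response(num_taps=64, gamma=0.8, frac_bits=8, scaled=True):
    """Return (true, reconstructed) coefficients of a decaying response."""
    true, recon = [], []
    for i in range(num_taps):
        sign = 1.0 if i % 2 == 0 else -1.0
        g_i = 0.9 * gamma ** i * sign            # hypothetical room response tap
        b_i = gamma ** i if scaled else 1.0      # assumed diagonal scaling entry
        stored = quantize(g_i / b_i, frac_bits)  # value kept in fixed-point memory
        true.append(g_i)
        recon.append(stored * b_i)               # undo the scaling when filtering
    return true, recon


def lost_taps(true, recon):
    """Count non-zero taps that were quantized to exactly zero."""
    return sum(1 for t, r in zip(true, recon) if t != 0.0 and r == 0.0)
```

Without scaling, the tail taps fall below half a quantization step and are stored as zero, so they cannot be represented or adapted. With scaling, every stored coefficient uses the full word length and the relative error stays small for all taps. A practical design would likely restrict the scaling entries to powers of two, in the spirit of the quantized envelope in Fig. 32, so that the rescaling reduces to bit shifts.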
Conclusions
The control of acoustical echoes is an important function
in speech processing systems. Hands-free telephones, for
example, not only enhance user comfort, but also increase
safety when used while driving a car. Acoustic echo con-