Application of Acoustic Echo Cancellation


Master Thesis IMIT/LECS/2007-63

Implementation of Acoustic Echo Cancellation For PC Applications Using MATLAB


Master of Science Thesis in System on Chip Design by Lu Lu
Stockholm, 05/2007
Supervisor: Temujin Gautama (NXP Software, Leuven, Belgium)
Examiner: Axel Jantsch (ICT/KTH, Stockholm, Sweden)

Abstract
Communication technology has changed considerably in recent years. Today people are increasingly interested in hands-free communication using a loudspeaker and a microphone instead of a conventional telephone handset. However, the strong acoustic coupling between the loudspeaker and the microphone produces a loud echo that makes conversation difficult. The solution to this problem is to eliminate the echo with an echo cancellation or echo suppression algorithm. However, traditional methods are not sufficient.

The objective of this thesis is to find a good echo removal algorithm that is capable of providing convincing results for PC applications. The basic components of an echo canceller are an adaptive filter and a double-talk detector. The adaptive filter estimates the echo path; based on this estimate, a replica of the echo is created and subtracted from the combination of the actual echo and the near-end speech signal. Double talk occurs when both ends are talking. The task of a double-talk detector is to sense the double talk and to stop the adaptation of the filter in order to avoid divergence. Since there has been a revolution in the field of personal computers in recent years, this work attempts to implement the acoustic echo canceller algorithm on a PC with the help of MATLAB.

Acknowledgement
First, I would like to thank my supervisor Temujin Gautama from NXP Software (Leuven, Belgium) for his patient help and friendly support throughout my work. Without him, this project would not exist. His advice and constant guidance have helped me through many difficult situations.

Special thanks go to my favorite professor, Luc Bienstman of GroupT Engineering School (Leuven, Belgium), who generously made a great effort to help make this thesis project possible and who, as always, gave me continuous encouragement and taught me how to regain confidence when I doubted myself.

Besides my advisers, I would also like to thank all my friends here in Leuven, who are always so supportive and never let me feel lonely.

Table of Contents:

CONCLUSION

CHAPTER I: INTRODUCTION
1.1 Need for Echo Cancellation
1.2 Basics of Echo Cancellation
1.2.1 System overview
1.2.2 Adaptive Filter
1.2.3 Double talk Detector
1.3 Measures of Performance
1.3.1 Echo Return Loss Enhancement (ERLE)
1.3.2 Near-end Attenuation (NEA)
1.4 Thesis Organization

CHAPTER II: ECHO CANCELLATION ALGORITHMS
2.1 Acoustic Echo Canceller
2.2 Acoustic Echo Suppressor
2.2.1 Noise Suppression with Spectral Subtraction
2.2.2 Acoustic Echo Suppression with Spectral Subtraction
2.2.3 Overlapping-windowed FFT
2.2.4 Comparison of AEC and AES
2.3 Adaptive filters
2.3.1 Wiener Filter
2.3.2 Least Mean Square Algorithm (LMS)
2.3.3 Normalized Least Mean Square Algorithm (NLMS)
2.3.4 Problem with NLMS algorithm
2.3.5 A Simplified Echo Path Model
2.4 Double Talk Detector
2.4.1 The Generic Doubletalk Detection Scheme
2.4.2 Geigel DTD
2.4.3 Normalized Cross-correlation (NCR DTD)
2.4.4 Variable Impulse Response (VIRE DTD)
2.4.5 Double talk detection performance evaluation

CHAPTER III: OTHER ISSUES
3.1 Room Impulse Response
3.1.1 Measure the Testing Room acoustics
3.2 Measure of the delay between the loudspeaker and the microphone
3.3 Noise issues
3.3.1 Typing noise cancellation based on cross-correlation
3.3.2 High Pass Filtering

CHAPTER IV: EVALUATION
4.1 Requirements of AEC
4.2 Speech Stimuli
4.3 Acoustic Echo Canceller
4.4 Acoustic Echo Suppressor Based on Spectral Subtraction
4.4.1 NLMS-Based AES
4.4.2 Coloration Effect Filter-Based AES
4.5 DTD Performance Evaluation
4.5.1 Geigel DTD
4.5.2 NCR DTD
4.5.3 VIRE DTD
4.6 AES with different DTD Algorithms
4.6.1 The Influence of the NES and the FES
4.6.2 The Noise Performance

CHAPTER V: CONCLUSION AND FURTHER WORK
5.1 Summary and Conclusion
5.2 Further Works

LIST OF ACRONYMS

LIST OF FIGURES

REFERENCES

APPENDIX

CONCLUSION
This thesis deals with the implementation of acoustic echo cancellation algorithms and their analysis based on simulations in MATLAB. It focuses on the Normalized Least Mean Square (NLMS) algorithm and on the method recently proposed by Christof Faller et al., which uses a simplified echo path model based on a frequency-domain coloration effect filter.

Since double-talk detection is an important part of successful acoustic echo cancellation, several double-talk detection methods are also studied and analyzed, including the Geigel algorithm, the Normalized Cross-correlation method (NCR DTD), and the Variable Impulse Response Double Talk Detector (VIRE DTD). Possible further work is discussed at the end of this thesis.

Key words: AEC, AES, NLMS, Coloration-effect filter (CF), DTD, Geigel, NCR, VIRE, SER, SNR, ERLE, NEA

CHAPTER I INTRODUCTION
Echo is a delayed and distorted version of an original sound or electrical signal that is reflected back to the source. If a reflected wave arrives very shortly after the direct sound, it is perceived as spectral distortion or reverberation. However, when the reflected wave arrives a few tens of milliseconds after the direct sound, it is heard as a distinct echo. In data communication, echo can cause significant transmission errors. In applications like hands-free telecommunication, echo can, in extreme conditions, make conversation impossible. Echo has been a major issue in communication networks. Hence this thesis is devoted to the investigation and development of an effective way to control the acoustic echo in hands-free communication.

This chapter gives a general review of the basic techniques in AEC such as the echo cancellation structure, the adaptive filter, double-talk detector and performance measures. Section 1.1 addresses the causes of echo and the echo cancellation environment. Section 1.2 details the basics of an acoustic echo removal system and the system structure of the echo cancellation process. Section 1.3 introduces the two important measures when evaluating the echo removal performance. Finally, the organization of the thesis is described.

1.1 Need for Echo Cancellation

There are two types of echo in telecommunication networks, namely electrical echo and acoustic echo. Electrical echo is caused by impedance mismatches at various points along the transmission medium. It can be found in the public switched telephone network (PSTN), mobile, and IP phone systems. Electrical echo arises at the hybrid connections located at the two-wire/four-wire PSTN conversion points, as shown in Figure 1.1. It is outside the scope of this thesis.

Figure 1.1: Hybrid Connections and the Resulting Electric Echo

However, the development of hands-free communication systems gave rise to another kind of echo, known as acoustic echo. The sound wave travels from the loudspeaker to the microphone through vibrations of the circuit or through the open air and generates an echo. Examples of such systems are mobile phones, VoIP calls using, for instance, Skype, and teleconferencing for meetings or remote education. Hands-free operation has gained more and more popularity in recent years, and it is the situation addressed in this thesis. The basic setup of a typical hands-free system is shown below:

Fig.1.2 Basic setup of a hands-free communication system

Each side of the communication process is called an end. The end remote from the speaker is called the far end (FE), and the near end (NE) refers to the end being measured. The acoustic echo is due to the coupling between the loudspeaker and the microphone. The speech of the far-end speaker is sent to the loudspeaker at the near end, is reflected from the floor, walls and other neighboring objects, and is then picked up by the near-end microphone and transmitted back to the far-end speaker, yielding an echo, as illustrated in Figure 1.3.

Fig 1.3: Generation of acoustic echo through direct coupling and reverberations

Acoustic echo can severely reduce conversation quality. Adaptive cancellation of such acoustic echoes has become very important in hands-free communication systems. However, not all echoes reduce voice quality. For telephone conversations to sound natural, callers must be able to hear themselves speaking. For this reason, a short instantaneous echo, termed side tone, is deliberately inserted. The side tone couples the caller's speech from the telephone mouthpiece to the earpiece so that the line sounds connected, and it also allows the speaker to adjust his or her own speaking level. Nevertheless, the necessity of the side tone in mobile phones has frequently been questioned. The reason is that side tone poses more difficult problems in the outdoor environment of the mobile phone. If you sneeze or blow into your microphone, or when there is wind noise, you hear it loud and clear. Hence nowadays the side tone in mobile phones is either eliminated or designed to be adjustable at the user's option.


1.2 Basics of Echo Cancellation

As stated in the previous section, undesired echoes need to be removed during telecommunication. From this point on, echo removal methods are investigated. Echo can be either cancelled in the time domain or suppressed in the frequency domain. In this section the system schematic of acoustic echo cancellation, which is the basic structure for all echo removal methods, is introduced first. Then the two major concerns in the echo cancelling process, the adaptive filter and the double-talk detector, are briefly reviewed.

1.2.1 System Overview

Since we know the original signal which goes to the loudspeaker, we can use it to predict and remove the signal picked up by the microphone. The process of doing this is called Acoustic Echo Cancellation.

Schematically, an AEC system can be described as in Figure 1.4. The remote speaker signal, referred to as the far-end signal and denoted $x(t)$, passes through the room acoustic filter $\mathbf{h}$, producing an acoustic echo $y(t)$. The microphone picks up the near-end speech signal $v(t)$ together with the echo. Disregarding the surrounding noise, the received signal $z(t)$ thus consists of both $v(t)$ and $y(t)$:

$$z(t) = y(t) + v(t) = f(x, \mathbf{h}) + v(t),$$

where $\mathbf{h}$ is the room acoustic filter.

The task of the AEC is to model the room acoustic path as well as possible with an adaptive filter $\hat{\mathbf{h}}_t$ and to remove the echo signal from the measured signal, yielding a residual signal $e(t)$ which ideally consists only of the near-end speech. The filter is required to be adaptive since the echo path in the room is most likely time-varying, for example because objects move or because the loudspeaker or microphone is moved from one place to another. To capture the full complexity of an acoustic echo path, the filter would need to be infinitely long, but a large filter order brings a high computational load. So evidently there is a trade-off between the complexity and the performance of the AEC.

The residual signal,

$$e(t) = z(t) - \hat{y}(t) = z(t) - \hat{\mathbf{h}}_t^T \mathbf{x}(t),$$

should only consist of the near-end signal, which is the case when the adaptive filter is close to the echo path: if $\hat{y}(t) \approx y(t)$, then $e(t) \approx v(t)$.

Fig. 1.4 General schematic of Acoustic Echo Cancellation

The adaptive filter uses the residual signal $e(t)$ as the error estimate to update its coefficients, but only if there is no near-end speech. When near-end speech is present, the estimated error is not correct, so it will distort the filter or even cause it to diverge. It is therefore important to determine whether near-end speech is present or not. Hence an acoustic echo canceller normally includes the adaptive filter, a double-talk detector to detect whether near-end speech is present, and possibly a nonlinear processor to eliminate the residual echo. Each of them is discussed later in this thesis.

1.2.2 Adaptive Filtering

There are two main types of digital filters: Finite Impulse Response (FIR) and Infinite Impulse Response (IIR). An IIR filter can normally achieve performance similar to that of an FIR filter with fewer coefficients and less computation. However, as the complexity of the filter grows, the order of the IIR filter increases considerably and the computational advantage becomes less pronounced. IIR filters can also suffer from instability. The filters used in AEC are therefore usually of the FIR type.

The adaptive filter is the critical part of the AEC which performs the work of estimating the echo path of the room to get a replica of the echo signal. It needs an adaptive update to adapt to the environmental change, for example, people moving in the room. An important issue of the adaptive filter is the convergence speed which measures how fast the filter converges to the best estimate of the room acoustic path.

Many adaptive filters have been derived and employed for AEC. In this thesis, we mainly study the standard Normalized Least Mean Square (NLMS) algorithm, which is old, has a low computational complexity and has proven to work well compared to many newer methods. The recently proposed simplified echo path model using a frequency-domain coloration filter is also studied and analyzed.

1.2.3 Double Talk Detection

One of the most difficult issues in AEC is to know when the filter should stop or slow down its adaptation. As discussed in the previous part, it is important to know whether near-end speech is present while the far-end signal is active. The situation in which both the near end and the far end are active is referred to as double talk. If double talk occurs, the error signal $e(t)$ contains not only the echo estimation error but also the near-end signal. If this signal is used to update the filter coefficients, the filter might create an artificial echo and even diverge. Detecting double talk is thus a vital yet difficult job, and it is the task of the double-talk detector.

There is a variety of double-talk detection methods. In this thesis, we consider three well-known ones: the Geigel algorithm, which is quite simple, the Normalized Cross-correlation method (NCR) and the Variance of Impulse Response algorithm (VIRE). All of them are implemented and compared later in this work.

1.3 Measures of Performance

To examine the performance of the echo removal algorithms in a standard way, some performance measures are required. The most important task of an AEC (or AES) is to suppress the echo, so it is necessary to know by how much the echo is reduced. During double talk, the near-end signal is affected by the cancellation or suppression process along with the undesired echo, so the amount of this attenuation is also of interest.

1.3.1 Echo Return Loss Enhancement (ERLE)

Echo Return Loss Enhancement (ERLE) is the most important measure of how much, in dB, the echo is suppressed by the acoustic echo cancellation. It is defined as the ratio, in dB, of the power of the original echo to the power of the residual echo signal after cancellation:

$$\mathrm{ERLE} = 10 \log_{10}\left(\frac{\sigma_z^2}{\sigma_r^2}\right),$$

where $\sigma_z^2$ is the power of the microphone signal and $\sigma_r^2$ is the power of the residual echo. A precise measurement of the ERLE should be performed on a portion where there is no near-end signal but only the echo. The higher the ERLE is, the better the AEC works.
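As a rough illustration of the ERLE definition, the following minimal sketch computes it from two sample sequences. The helper names `power` and `erle_db` and the synthetic test signal are assumptions made for this example and are not taken from the thesis; the power is simply estimated as the mean of the squared samples over a far-end-only segment.

```python
import math

def power(signal):
    """Mean power (average of the squared samples) of a sequence."""
    return sum(s * s for s in signal) / len(signal)

def erle_db(mic, residual):
    """ERLE in dB: 10*log10(power of microphone signal / power of residual echo).

    Both sequences are assumed to cover a far-end-only segment, so that the
    microphone signal contains echo but no near-end speech.
    """
    return 10.0 * math.log10(power(mic) / power(residual))

# Hypothetical far-end-only segment: the residual is the echo scaled by 0.1,
# i.e. the canceller removed all but one tenth of the echo amplitude.
mic = [math.sin(0.1 * n) for n in range(1000)]
residual = [0.1 * z for z in mic]
erle = erle_db(mic, residual)
```

An amplitude reduction by a factor of 10 corresponds to a power ratio of 100, so this example yields an ERLE of about 20 dB.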

1.3.2 Near-end Attenuation (NEA)

Near-end attenuation (NEA) is a measure of how much, in dB, the near-end signal is suppressed by the cancellation process in a double-talk situation. It is defined as the ratio of the power of the near-end signal after suppression to the power before suppression during double talk:

$$\mathrm{NEA} = 10 \log_{10}\left(\frac{\sigma_{aft}^2}{\sigma_{bef}^2}\right),$$

where $\sigma_{aft}^2$ is the power of the near-end speech during DT in the residual signal and $\sigma_{bef}^2$ is the power of the near-end signal during DT in the microphone signal.

To make the NEA calculation practical, the signals recorded in this thesis consist of three segments: far-end single talk, double talk and near-end single talk. The ERLE is calculated on the far-end-only segment. To calculate the NEA, we construct a second, synthetic signal from the recorded microphone signal in which the near-end speech during double talk is sign-inverted (in counter-phase). It is obtained by subtracting twice the near-end part, as shown in Figure 1.5. After passing both signals through the AEC, we subtract the two residual signals and divide the result by two, which gives the near-end speech during DT after AEC:

$$\text{Near\_end\_residual} = \frac{e(t)_{plus} - e(t)_{minus}}{2}, \qquad \text{Far\_end\_residual} = \frac{e(t)_{plus} + e(t)_{minus}}{2},$$

and $\sigma_{aft}^2$ is then taken as the power of Near_end_residual.


Evidently low near-end attenuation is desired.
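The plus/minus construction can be sketched as follows. This is only an illustration under stated assumptions: `toy_canceller` is a hypothetical stand-in for the AEC (it subtracts a fixed echo estimate and applies a fixed gain), and all signals are synthetic. The decomposition into near-end and far-end residuals is exact here because the stand-in acts linearly on the near-end component.

```python
import math

def power(signal):
    """Mean power of a sequence."""
    return sum(s * s for s in signal) / len(signal)

def split_residuals(e_plus, e_minus):
    """Half-difference gives the near-end part, half-sum the far-end part."""
    near = [(p - m) / 2.0 for p, m in zip(e_plus, e_minus)]
    far = [(p + m) / 2.0 for p, m in zip(e_plus, e_minus)]
    return near, far

def nea_db(near_after, near_before):
    """NEA in dB: near-end power after processing over power before."""
    return 10.0 * math.log10(power(near_after) / power(near_before))

def toy_canceller(mic, echo_estimate, gain=0.8):
    """Hypothetical stand-in for the AEC: subtract an echo estimate, then scale."""
    return [gain * (z - y) for z, y in zip(mic, echo_estimate)]

# Synthetic double-talk segment (assumed signals, not thesis recordings).
echo = [0.5 * math.sin(0.05 * n) for n in range(2000)]
near = [math.sin(0.31 * n) for n in range(2000)]
mic_plus = [y + v for y, v in zip(echo, near)]
# Sign-inverted version: subtract twice the near-end part.
mic_minus = [z - 2.0 * v for z, v in zip(mic_plus, near)]

echo_est = [0.9 * y for y in echo]          # imperfect echo estimate
e_plus = toy_canceller(mic_plus, echo_est)
e_minus = toy_canceller(mic_minus, echo_est)

near_res, far_res = split_residuals(e_plus, e_minus)
nea = nea_db(near_res, near)
```

With the assumed gain of 0.8, the near-end residual is 0.8 times the near-end speech, so the NEA equals 20 log10(0.8), roughly -1.9 dB.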

Fig 1.5 Composition of the signals used to calculate the NEA


1.4 Thesis Organization


This thesis focuses on two main issues of acoustic echo cancellation, namely the adaptation algorithm and the control of adaptation in double talk situation.

Chapter 2 presents the theoretical background. First it reviews and compares the two major ways to achieve echo cancellation, namely the acoustic echo canceller (AEC) in the time domain and the acoustic echo suppressor (AES) in the frequency domain. The adaptive filter, which is used to model the acoustic echo path, is the central part of the AEC, and much research effort has been devoted to it. LMS is an old, simple and proven algorithm which has turned out to work well in comparison with newer, more advanced algorithms. In this project, we use the normalized LMS (NLMS) for the main filter in the AEC, since NLMS is so far the most popular algorithm in practice owing to its computational simplicity. For the frequency-domain adaptive filtering method in the AES, the recently introduced simplified echo path method with a frequency-domain coloration effect filter is studied. After that, the generic double-talk detection scheme is outlined and several well-known double-talk detectors are discussed. The Geigel algorithm is simple and works well when the echo of the far-end signal is sufficiently weaker than the near-end speech; in other words, it relies on an assumption about the echo path, so in practice it is not widely applied to acoustic echo cancellation. The Normalized Cross-correlation method uses the normalized correlation between the far-end signal and the near-end signal, which gives more promising results than the Geigel algorithm. The Variance of Impulse Response algorithm is based on the fact that the presence of near-end speech causes dramatic variations of the filter taps, which can give good results. However, it is more sensitive to the microphone signal than the NCR. Finally, the measures and the receiver operating characteristic (ROC) curve used to evaluate the DTD are introduced.

Chapter 3 discusses other issues that occur during the echo cancellation process. Since the adaptive filter tries to mimic the room acoustics, it is interesting to find a strategy for measuring a room impulse response in order to get a general idea of what it looks like. Secondly, adaptive filtering algorithms normally require synchronization between the far-end and the near-end signals. There is necessarily a delay from the loudspeaker to the microphone, since the sound wave needs time to propagate. To estimate this delay, a method based on cross-correlation is adopted. Noise is also a big issue when the quality of the microphone is not very good, as is the case for the internal microphones of laptops. The noise includes the hard-disk and fan noise from the laptop itself, the typing noise from the near-end user, and the environmental noise. The typing noise is probably the most annoying one, since the keyboard is always close to the internal microphone in a laptop. A high-pass filter attenuates most of the noise well because most of the noise is concentrated at low frequencies. The nonlinear processor, as a possible part of an AEC, is introduced in general terms at the end of this chapter.

Chapter 4 is devoted to the evaluation of all the algorithms discussed above. Through a series of recordings and simulations in MATLAB, we try to find out which adaptive filtering and double-talk detection algorithms are better suited to the PC application. In chapter 5 the conclusion is drawn and possible future work is presented.

CHAPTER II ECHO CANCELLATION ALGORITHMS

In this chapter, the theoretical background for echo cancellation is reviewed. There are two common ways to remove acoustic echo. The basic method is the traditional Acoustic Echo Canceller (AEC), which was discussed schematically in section 1.2 and is covered again in section 2.1. The other is the Acoustic Echo Suppressor (AES). The AES for telephony applications is usually half-duplex: after comparing the signal strength of both ends, it completely shuts off the speech from the direction with the lower power. It is simple but not effective, and full-duplex communication is more comfortable for real-time conversations. Another AES method, derived from noise suppression based on spectral subtraction, makes full duplex possible and is introduced in section 2.2. In section 2.3, adaptive filtering algorithms are presented in detail, including the LMS and NLMS filters, which are derived from the Wiener optimal filter, and the coloration effect filter in the frequency domain. Different double-talk detection (DTD) methods and the DTD performance evaluation measures are discussed in section 2.4.

2.1 Acoustic Echo Canceller


The traditional solution to the acoustic echo problem is the acoustic echo canceller (AEC). An acoustic echo canceller removes the echo by modeling the echo path impulse response with an adaptive filter and subtracting the echo estimate from the microphone signal. The acoustic echo path is assumed to be a linear filter of length $L$,

$$\mathbf{h} = [h_1, h_2, h_3, \ldots, h_L]^T,$$

where $L$ is the length of the echo path and $(\cdot)^T$ denotes the transpose of a matrix or a vector. The microphone signal is then expressed as

$$z(k) = \mathbf{h}^T \mathbf{x}(k) + v(k) + n(k),$$

where $\mathbf{x}(k) = [x(k-L+1), x(k-L+2), \ldots, x(k)]^T$, so $\mathbf{h}^T \mathbf{x}(k)$ is the echo signal, $v(k)$ is the near-end speech and $n(k)$ stands for the ambient noise signal.

A modeling filter $\hat{\mathbf{h}} = [\hat{h}_1, \hat{h}_2, \hat{h}_3, \ldots, \hat{h}_L]^T$ is used to approximate the true echo path $\mathbf{h}$, where $L$ is the length of the filter. The echo estimate is

$$\hat{y}(k) = \hat{\mathbf{h}}^T \mathbf{x}(k).$$

Adaptive algorithms are used to search for the optimum $\hat{\mathbf{h}}$. Once the adaptive filter has converged, the residual signal is the echo-cancelled outgoing signal.

The echo signal can be cancelled successfully when the modeling filter approaches the true echo path. In practice, however, a modeling filter often differs from the true echo path due to complicated reasons such as speaker nonlinearities and environment changes, the lack of knowledge about the length of the echo path, and so on, resulting in residual echo signals.
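The effect of a modeling filter that is shorter than the true echo path can be illustrated with a small sketch. Everything here is an assumption made for the example: the tap values, the random far-end signal and the helper `fir_output` are hypothetical, near-end speech and noise are set to zero, and the taps are indexed in the usual convolution order (tap `i` multiplies `x[k - i]`), which is the reverse of the vector ordering used in the text.

```python
import random

def fir_output(taps, x, k):
    """Return sum_i taps[i] * x[k - i], treating samples before time 0 as zero."""
    return sum(t * x[k - i] for i, t in enumerate(taps) if k - i >= 0)

def power(signal):
    """Mean power of a sequence."""
    return sum(s * s for s in signal) / len(signal)

random.seed(0)
h_true = [0.5, 0.3, -0.2, 0.1]     # hypothetical echo path, length 4
h_short = [0.5, 0.3, -0.2]         # modeling filter that is one tap too short

x = [random.uniform(-1.0, 1.0) for _ in range(2000)]       # far-end signal
mic = [fir_output(h_true, x, k) for k in range(len(x))]    # echo only (v = n = 0)

# Residual e(k) = z(k) - y_hat(k) for a perfect and for a too-short model.
res_exact = [mic[k] - fir_output(h_true, x, k) for k in range(len(x))]
res_short = [mic[k] - fir_output(h_short, x, k) for k in range(len(x))]

residual_power_short = power(res_short)
```

With the perfect model the residual vanishes, while the truncated model leaves the contribution of the missing tap, `0.1 * x[k - 3]`, as residual echo.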

2.2 Acoustic Echo Suppressor

Unlike the AEC, an acoustic echo suppressor achieves echo attenuation in the frequency domain and works in a similar manner to traditional noise suppression algorithms. The AES can achieve results similar to those of the AEC, also in full duplex.

2.2.1 Noise Suppression with Spectral Subtraction

Noise suppression is introduced here because echo behaves in a similar way to noise, so methods for noise suppression are also of interest for echo elimination. Various speech enhancement techniques exist for the purpose of eliminating noise. Spectral subtraction is one of the methods used to enhance speech in the presence of noise. Spectral subtraction for noise suppression basically means that an estimate $|\hat{N}(f)|$ of the noise magnitude spectrum is subtracted from the instantaneous input magnitude spectrum $|X(f)|$. The noise can also be attenuated by a certain factor. The aim of this process is to obtain an audio signal which contains less noise than the original.

The basic flowchart of spectral subtraction is as follows:

Figure.2.1 Noise suppression with spectral subtraction

The noisy speech basically consists of two parts, the clean speech and the noise:

$$x(t) = s(t) + n(t).$$

After the Fourier transform,

$$X(f) = S(f) + N(f),$$

and the magnitude of the frequency spectrum can be approximately expressed as

$$|X(f)| \approx |S(f)| + |N(f)|.$$

So the magnitude of the clean speech can be calculated by subtracting the estimate of the average noise spectrum,

$$|S(f)| \approx |X(f)| - |\hat{N}(f)|,$$

and for the phase of the clean signal, the phase of the noisy speech is adopted:

$$\angle S(f) \approx \angle X(f).$$

Combining the amplitude and phase information, we get the estimate of the speech spectrum:

$$\hat{S}(f) = X_i(f) \left( \frac{\max\left(|X_i(f)|^{\alpha} - \beta\,|\hat{N}(f)|^{\alpha},\ 0\right)}{|X_i(f)|^{\alpha}} \right)^{1/\alpha} = G_i(f)\,X_i(f),$$

where $\alpha$ and $\beta$ are the design parameters that control the performance, and $i$ stands for the $i$-th frame, since the frequency-domain calculation needs the FFT, which is frame-based.
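The gain computation can be sketched for a single frame as follows. This assumes the common parametric form of spectral subtraction, with an exponent `alpha` and an oversubtraction factor `beta` as the two design parameters; these names, the default values and the function `subtraction_gain` are assumptions for illustration rather than the thesis's implementation. The inputs are per-bin magnitudes of the noisy input and of the noise estimate.

```python
def subtraction_gain(mag_in, mag_noise, alpha=2.0, beta=1.0):
    """Per-bin spectral subtraction gain.

    gain = (max(|X|^alpha - beta * |N|^alpha, 0) / |X|^alpha) ** (1 / alpha)
    Bins whose input magnitude is zero get a gain of zero.
    """
    gains = []
    for x_mag, n_mag in zip(mag_in, mag_noise):
        denom = x_mag ** alpha
        if denom == 0.0:
            gains.append(0.0)
            continue
        numer = max(denom - beta * n_mag ** alpha, 0.0)
        gains.append((numer / denom) ** (1.0 / alpha))
    return gains

# Three hypothetical bins: partly noisy, noise-dominated, and noise-free.
mag_in = [1.0, 0.5, 2.0]
mag_noise = [0.5, 0.5, 0.0]

gains_power = subtraction_gain(mag_in, mag_noise, alpha=2.0)      # power subtraction
gains_magnitude = subtraction_gain(mag_in, mag_noise, alpha=1.0)  # magnitude subtraction
```

In this sketch the noise-dominated bin is set to zero and the noise-free bin passes unchanged, while the first bin is attenuated to about 0.87 with `alpha = 2` and to 0.5 with `alpha = 1`.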

Short-time estimates of $|X_i(f)|$ fluctuate randomly in noise-only frames, resulting in randomly fluctuating gains $G_i(f)$. Statistical analysis shows that, after noise suppression, broadband noise is transformed into a signal composed of short-lived tones with randomly distributed frequencies, called musical noise, which sounds like a warbling or watery effect on the enhanced speech. These artifacts are due to randomly distributed spectral peaks in the residual noise spectrum. One possible way to solve this is to overestimate the average noise power in order to lower the peaks, but then the original speech signal might also be distorted.

Also, a lot of clicking noise occurs due to steep changes in the gain function. It can be removed by adding a gain smoothing function as follows:

$$G_{s,i} = (1 - \text{smooth\_factor})\,G_{s,i-1} + \text{smooth\_factor} \cdot G_i,$$

where the smooth factor determines the time constant of the exponentially smoothed gain function. To better understand how the smoothing works, consider the step gain function in Figure 2.2 (solid blue line). After applying the gain smoothing we get a smoothed version of the steep corners (dotted line). The smooth factor in this figure is 0.01. In this way, the sharp glitches in the gain function are eliminated.

Figure 2.2 Smoothing of a step function
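The smoothing of a step in the gain can be reproduced with a few lines. This is a minimal sketch assuming the usual first-order recursion, in which each smoothed gain is a weighted mix of the previous smoothed gain and the current raw gain; the function name `smooth_gain` and the initial value are assumptions for the example.

```python
def smooth_gain(gains, smooth_factor, initial=0.0):
    """Exponentially smooth a sequence of per-frame gains.

    Each output is (1 - smooth_factor) * previous output + smooth_factor * gain.
    """
    smoothed = []
    previous = initial
    for gain in gains:
        previous = (1.0 - smooth_factor) * previous + smooth_factor * gain
        smoothed.append(previous)
    return smoothed

# Step gain function: 10 frames at 0 followed by 100 frames at 1.
step = [0.0] * 10 + [1.0] * 100
smoothed = smooth_gain(step, smooth_factor=0.01)
```

After `n` frames at the upper level the smoothed gain has risen to `1 - 0.99**n`, so with a smooth factor of 0.01 it reaches only about 63 % of the step after 100 frames, which corresponds to a time constant of roughly 100 frames.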


2.2.2 Acoustic Echo Suppression with Spectral Subtraction

Echo suppression is basically performed in the same manner as noise suppression. Unlike AEC, an acoustic echo suppressor achieves echo attenuation through manipulating the magnitude spectrum of the microphone signal in the frequency domain, while leaving the phase spectrum untouched.

The adaptive filter in the AES works as the Noise density estimation unit in the noise suppressor, which is combined with FFT to produce an estimate of the echo magnitude spectrum. The echo spectrum estimate is used to form the gain function together with the spectrum of the microphone signal.
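$$G_i(f) = \left( \frac{\max\left(|Z(f)|^{\alpha} - \beta\,|\hat{Y}(f)|^{\alpha},\ 0\right)}{|Z(f)|^{\alpha}} \right)^{1/\alpha},$$

where $\alpha$ and $\beta$ are the design parameters that control the echo suppression performance. If the echo is underestimated, $\beta > 1$ is used, and $\beta < 1$ if it is overestimated.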

Multiplying the gain function with the spectrum of the microphone signal then gives the spectrum of the residual signal, which is supposed to be echo-free. After the inverse FFT, the echo-suppressed outgoing signal is obtained as

$$e(n) = F^{-1}[G_i(f)\,Z(f)],$$

where $F^{-1}(\cdot)$ denotes the inverse FFT.
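The complete per-frame suppression step (transform, gain, inverse transform) can be sketched as below. The sketch assumes a parametric gain with exponent `alpha` and factor `beta`, uses a naive O(N²) DFT from the standard library in place of an FFT to stay self-contained, and takes the echo magnitude estimate as a given input; `dft`, `idft` and `suppress_frame` are hypothetical helper names. The gain is real, so the phase of the microphone spectrum is left untouched.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
            for f in range(n)]

def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * math.pi * f * t / n)
                for f in range(n)).real / n
            for t in range(n)]

def suppress_frame(mic_frame, echo_mag_est, alpha=2.0, beta=1.0):
    """Apply a real-valued suppression gain to each bin and transform back."""
    out_spectrum = []
    for z, y_mag in zip(dft(mic_frame), echo_mag_est):
        denom = abs(z) ** alpha
        if denom == 0.0:
            gain = 0.0
        else:
            gain = (max(denom - beta * y_mag ** alpha, 0.0) / denom) ** (1.0 / alpha)
        out_spectrum.append(gain * z)   # scaling by a real gain keeps the phase
    return idft(out_spectrum)

# Hypothetical 16-sample microphone frame.
frame = [math.sin(2 * math.pi * 2 * t / 16) + 0.5 * math.cos(2 * math.pi * 5 * t / 16)
         for t in range(16)]

unchanged = suppress_frame(frame, [0.0] * 16)                    # no echo estimated
silenced = suppress_frame(frame, [abs(z) for z in dft(frame)])   # frame is all echo
```

When no echo is estimated the frame passes through unchanged, and when the echo estimate equals the microphone magnitude in every bin the output is zero.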

2.2.3 Overlapping-windowed FFT


For the transformation into the spectral domain, the choice of the data window and of the overlap is also important. Windowing a simple waveform like $\cos(\omega t)$ causes its Fourier transform to have non-zero values at frequencies other than $\omega$, an effect commonly called leakage. The rectangular window is the simplest window and has the best resolution, but it suffers most from leakage. Other windows such as the Hann, Hamming and Kaiser windows are more moderate. Hamming has more leakage than the other two. Kaiser can have the smallest leakage at the price of lower resolution. On the other hand, performing the FFT on a small windowed segment of the input introduces some distortion because of transient effects, so overlapped windowing is employed. The windows overlap in time, i.e. each window is shifted by only a part of the window size instead of the whole size. The FFT and the subsequent processing are then performed for every window. To restore the signal, the data reconstructed through the IFFT are summed up at the end.

Different windows are compared and the results are displayed in Figure 2.3, 2.4 and 2.5, to find out which could bring better result, namely less error. Half of the window is shifted per time and FFT and IFFT functions are processed. The error between the original and the recovered signal is displayed in the unit of dB.
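The round trip through windowing, transform, inverse transform and overlap-add with a half-window shift can be sketched as follows. The sketch makes two assumptions that are not taken from the thesis: it uses the periodic form of the Hann window, for which two half-overlapped windows sum exactly to one, and it uses a naive DFT in place of an FFT to remain self-contained. The helper names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
            for f in range(n)]

def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * math.pi * f * t / n)
                for f in range(n)).real / n
            for t in range(n)]

def hann_periodic(n):
    """Periodic Hann window: w[i] = 0.5 - 0.5 * cos(2*pi*i / n)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def overlap_add_roundtrip(signal, frame_len):
    """Window each frame, transform, inverse transform, and overlap-add.

    Frames are shifted by half the window length. Any spectral processing
    would take place between the forward and the inverse transform.
    """
    hop = frame_len // 2
    window = hann_periodic(frame_len)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        rebuilt = idft(dft(frame))
        for i in range(frame_len):
            out[start + i] += rebuilt[i]
    return out

signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(64)]
rebuilt = overlap_add_roundtrip(signal, frame_len=16)

# Samples covered by two overlapping frames (here indices 8..55) are restored.
interior_error = max(abs(rebuilt[n] - signal[n]) for n in range(8, 56))
```

Only the first and the last half frame, which are covered by a single window, are not restored. A different window definition or overlap would generally leave a reconstruction error in the interior as well.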

The Hann window has a relatively smaller error compared to the other two windows. The performance of the Kaiser window depends strongly on the value of beta: the larger beta is, the wider the main lobe and the smaller the side lobes become. According to simulations in MATLAB, a beta of around 5.8 gives the smallest error after performing FFT and IFFT, as shown in Figure 2.5.

Figure 2.3 Error caused by Hann-Windowing FFT

Figure 2.4 Error caused by Hamming-Windowing FFT

Figure 2.5 Error caused by Kaiser-Windowing FFT with different beta (panels: beta = 2.5, beta = 4.5, beta = 5.8, beta = 6.8)

In conclusion, the simulation results show that the Hann window gives the smallest reconstruction error. Hence, the Hann window is chosen to perform the overlap-add method in the AES process.

2.2.4 Comparison of AEC and AES


Both AEC and AES have their advantages and disadvantages. The AEC is a well-defined technique. When the modeling filter approaches the true echo path, an AEC can eliminate the echo signal successfully without introducing much distortion into the outgoing signal. In reality, however, the modeling filter often differs from the true echo path for various reasons: for example, the modeling filter may be shorter than the true echo path, the echo path may change, or there may be nonlinearities in the echo path. As a result, some residual echo may remain. In addition, an AEC is often computationally expensive. In comparison, an AES is able to achieve higher echo attenuation and is more robust. The AES algorithm may also have a lower computational complexity, as is the case for the simplified echo path method discussed later on. However, as in the noise suppressor, this technique sometimes introduces audible distortions into the outgoing signal.

2.3 Adaptive Filters


The adaptive filter is the central part of the AEC and is used to mimic the acoustic echo path. Numerous adaptive algorithms are applicable to acoustic echo cancellation, such as least mean squares (LMS), recursive least squares (RLS) and the affine projection algorithm (APA). LMS is an old, simple and proven algorithm which has turned out to work well in comparison with newer, more advanced algorithms. In this project we use the normalized LMS (NLMS) for the main filter in the AEC, since NLMS is so far the most popular algorithm in practice owing to its computational simplicity. In the following paragraphs the Normalized Least Mean Square algorithm, an adaptation process based on a linear FIR filter, is outlined. It aims at approximating the room acoustic path with the best possible model. For the frequency-domain adaptive filtering method in the AES, the recently introduced simplified echo path method with a frequency-domain coloration effect filter is studied, which has the advantage of lower computational complexity and greater robustness.

2.3.1 Wiener Filter


The Wiener filter is the optimum filter in the sense of the mean-squared error (MSE). It minimizes a cost function of the filter coefficients,

$$J(\mathbf{w}) = E\{e^2(k)\},$$

where $\mathbf{w}$ is the vector of filter coefficients and $E\{e^2(k)\}$ is the mean power of the error signal $e(k)$. The optimal filter coefficients reach the minimum of the cost function, $J(\mathbf{w}_{opt}) = \min E\{e^2(k)\}$.

The error signal is the difference between the desired signal and the output of the adaptive filter, $e(k) = d(k) - y(k)$, with $y(k) = \mathbf{w}^T\mathbf{x}(k)$ and $\mathbf{x}(k) = [x(k-L+1), x(k-L+2), \ldots, x(k)]^T$, where $L$ is the length of the filter.

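The squared error is

$$e^2(k) = d^2(k) - 2\mathbf{w}^T\mathbf{x}(k)d(k) + \mathbf{w}^T\mathbf{x}(k)\mathbf{x}^T(k)\mathbf{w}.$$

The auto-correlation matrix $\mathbf{R}$ is defined by

$$\mathbf{R} = E\{\mathbf{x}(k)\mathbf{x}^T(k)\},$$

and the cross-correlation vector is

$$\mathbf{p} = E\{\mathbf{x}(k)d(k)\}.$$

Assuming that the desired signal is real and wide-sense stationary with power $\sigma_d^2 = E\{d^2(k)\}$, the cost function can be written as

$$J(\mathbf{w}) = \sigma_d^2 - 2\mathbf{p}^T\mathbf{w} + \mathbf{w}^T\mathbf{R}\mathbf{w}.$$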

The minimum of this function is the point with zero gradient. The gradient of the cost function is

$$\nabla_{\mathbf{w}}\{J(\mathbf{w})\} = -2\mathbf{p} + 2\mathbf{R}\mathbf{w} = 2(\mathbf{R}\mathbf{w} - \mathbf{p}).$$

Setting the gradient to zero leads to the time-discrete Wiener-Hopf equation:

$$\mathbf{w}_{opt} = \mathbf{R}^{-1}\mathbf{p},$$
which gives the filter coefficients of a Wiener filter, optimal in the sense of the MSE. The Wiener filter is a linear optimum filter that depends on the known statistics R and p. In practice, R and p are not known exactly, and in an adaptive context they may vary slowly with time. The adaptive filter should track these changes in the statistics, and hence a changing optimum, so some approximations are necessary. One idea is to approximate R and p, which leads to the Recursive Least Squares (RLS) algorithm. Another is to approximate the gradient, as in the Least Mean Square (LMS) algorithm presented in the following section. The LMS algorithm was introduced much earlier than the RLS algorithm. RLS algorithms converge quickly, while LMS requires far fewer computations. For embedded software on a PC, the low cost of the LMS method is the more attractive benefit.
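As a minimal sketch of the Wiener solution, the following MATLAB fragment estimates R and p from data and solves the Wiener-Hopf equation. The system `h`, the lengths `N` and `L` and the white-noise input are hypothetical values chosen only for illustration; they are not taken from the thesis experiments.

```matlab
% Sketch: sample estimates of R and p, then w_opt from R*w = p.
rng(0);                          % reproducible noise
N = 5000;  L = 4;                % number of samples and filter length (illustrative)
h = [0.8; -0.4; 0.2; 0.1];       % hypothetical system, ordered like x(k) below
x = randn(N, 1);                 % white input signal
R = zeros(L);  p = zeros(L, 1);
for k = L:N
    xk = x(k-L+1:k);             % x(k) = [x(k-L+1), ..., x(k)]^T
    dk = h' * xk;                % desired signal (noise-free)
    R  = R + xk * xk';           % accumulate auto-correlation estimate
    p  = p + xk * dk;            % accumulate cross-correlation estimate
end
R = R / (N - L + 1);
p = p / (N - L + 1);
w_opt = R \ p;                   % solves R*w = p without forming inv(R)
```

In this noise-free case `w_opt` reproduces `h` up to rounding error. The backslash operator is used instead of an explicit inverse because it is numerically more reliable.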

2.3.2 Least Mean Square Algorithm (LMS)

The Least Mean Square algorithm is derived from the steepest descent method. Instead of going directly from the starting point to the optimum, it is easier to follow the gradient of the error function, which leads to the optimum iteratively. The gradient, shown in Figure 2.6, is a vector pointing in the steepest uphill direction on the error surface at a given point w(k). The filter coefficients are updated by taking a step opposite to the gradient direction, going locally downhill in the steepest direction to approach the optimum:
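$$\mathbf{w}(k+1) = \mathbf{w}(k) - c\,\nabla_{\mathbf{w}}\{J(\mathbf{w})\}$$

Replacing the expectations by their instantaneous values, the gradient becomes

$$\nabla_{\mathbf{w}}\{J(\mathbf{w})\} = -2\mathbf{x}(k)d(k) + 2\mathbf{x}(k)\mathbf{x}^T(k)\mathbf{w} = -2\mathbf{x}(k)[d(k) - \mathbf{x}^T(k)\mathbf{w}] = -2\mathbf{x}(k)e(k).$$

This leads to the Least Mean Square algorithm

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \mu\,\mathbf{x}(k)e(k),$$

where $\mu = 2c$ is defined as the step size. The step-size parameter controls the convergence speed of the filter.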

Figure 2.6 Gradient of the Error function
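The LMS recursion can be sketched in a few lines of MATLAB. The echo path `h`, the step size `mu` and the white-noise excitation are assumptions made for this illustration only.

```matlab
% Sketch: LMS identification of a short, noise-free FIR system.
rng(0);
N = 20000;  L = 4;
h = [0.8; -0.4; 0.2; 0.1];      % hypothetical echo path
x = randn(N, 1);                 % far-end excitation (white noise)
mu = 0.01;                       % step size (mu = 2c)
w = zeros(L, 1);                 % adaptive filter coefficients
e = zeros(N, 1);                 % error signal
for k = L:N
    xk   = x(k-L+1:k);           % current input vector x(k)
    d    = h' * xk;              % desired signal d(k)
    e(k) = d - w' * xk;          % e(k) = d(k) - w^T x(k)
    w    = w + mu * xk * e(k);   % step opposite to the instantaneous gradient
end
```

With unit-power input this step size is well inside the stable range, and `w` approaches `h` after a few hundred iterations.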

2.3.3 Normalized Least Mean Square Algorithm (NLMS)

The NLMS algorithm is derived from the LMS algorithm. Its motivation is that the power of the input signal varies with time, so the size of the coefficient update between two adjacent iterations varies as well, and with it the convergence speed. Convergence slows down for weak signals, while for loud signals the overshoot error increases. The idea is therefore to adjust the step-size parameter continuously according to the input power. Normalizing the step size by the current input power results in the Normalized Least Mean Square algorithm, with

$$\mu(k) = \frac{2\alpha}{\|\mathbf{x}(k)\|^2},$$

where $\alpha$ is again a design parameter that adjusts the convergence speed, with $0 < \alpha < 1$. NLMS usually converges much more quickly than LMS at very little extra cost, so it is very commonly used.
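The effect of the normalization can be illustrated with a short MATLAB sketch that runs the same NLMS filter on inputs of very different amplitude. The echo path, the value of `alpha` and the amplitudes in `scales` are hypothetical; the step is normalized by the instantaneous input energy `xk' * xk`.

```matlab
% Sketch: NLMS convergence does not depend on the input level.
rng(0);
N = 5000;  L = 4;
h = [0.8; -0.4; 0.2; 0.1];           % hypothetical echo path
alpha = 0.5;                          % design parameter, 0 < alpha < 1
scales = [0.01, 1, 100];              % input amplitudes to compare
misalignment = zeros(size(scales));
for s = 1:numel(scales)
    x = scales(s) * randn(N, 1);
    w = zeros(L, 1);
    for k = L:N
        xk = x(k-L+1:k);
        e  = h' * xk - w' * xk;                       % noise-free error
        w  = w + (2 * alpha / (xk' * xk)) * xk * e;   % normalized step
    end
    misalignment(s) = norm(w - h) / norm(h);
end
```

All three runs converge, whereas a fixed LMS step size tuned for one of these levels would be either far too slow or unstable for the others.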


2.3.4 The Problem with NLMS:

The performance of the fast-converging NLMS algorithm degrades severely when doubletalk or only near-end speech exists. The reason is that the update is calculated from a ratio between the error signal and the power of the far-end signal:

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \frac{2\alpha}{\|\mathbf{x}(k)\|^2}\,\mathbf{x}(k)e(k)$$

During pauses in doubletalk or when only near-end speech exists, the coefficients become highly unstable, since the input power approaches zero while the error signal remains relatively large because of the near-end signal. The filter weights start to diverge. The LMS algorithm does not suffer from this problem.

There are several possible solutions, described below:

1. Safety constant

One possibility is simply to add a safety constant $\delta$ to the denominator:

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \frac{2\alpha}{\delta + \|\mathbf{x}(k)\|^2}\,\mathbf{x}(k)e(k)$$

The value of $\delta$ influences the output quality: the larger the constant, the smaller the jitter of the weights, but the lower the ERLE.

2. Threshold

Another common and low-cost possibility is to apply a threshold to the input power. The weights are left unchanged when the input power is below the threshold, which avoids large jitter in the weights. This is essentially a far-end signal detector based on the input power.

$$\mathbf{w}(k+1) = \begin{cases} \mathbf{w}(k) + \dfrac{2\alpha}{\|\mathbf{x}(k)\|^2}\,\mathbf{x}(k)e(k) & \text{if } \|\mathbf{x}(k)\|^2 > \text{threshold} \\ \mathbf{w}(k) & \text{if } \|\mathbf{x}(k)\|^2 < \text{threshold} \end{cases}$$


3. Combination of LMS and NLMS

Both the safety constant and the input threshold depend on the input power. We therefore introduce a new idea that combines the advantages of NLMS and LMS. Two adaptive filters are adapted in parallel and combined with a factor $\lambda$ ($0 < \lambda < 1$). The two filters contribute fractions $\lambda$ and $1-\lambda$ to the calculation of the error signal:

$$e = z - \lambda y_1 - (1-\lambda)y_2,$$

where $y_1$ is the echo estimate of the NLMS filter and $y_2$ that of the LMS filter. The method tries to find the optimal combination of LMS and NLMS at each time instant, in order to achieve fast convergence and a relatively large ERLE while also gaining stability. To derive an update rule for $\lambda$, we proceed as for LMS and apply the steepest descent method to approach the minimum of the squared error:

$$\frac{\partial e^2}{\partial \lambda} = -(y_1 - y_2)\cdot 2\,[z - \lambda(y_1 - y_2) - y_2] = -(y_1 - y_2)\cdot 2e$$

$$\lambda_{i+1} = \lambda_i + c\,(y_1 - y_2)\cdot 2e$$

where c is a step-size parameter, as in the LMS algorithm.

Theoretically, $\lambda$ should be close to 1 during FE (far-end) single talk, which indicates the use of the NLMS algorithm, because the NLMS filter adapts faster and achieves a higher ERLE than LMS in this situation. During DT (double talk) and NE (near-end) single talk, $\lambda$ goes to 0, since the LMS algorithm does not suffer from the stability problem that NLMS has when NES (near-end speech) exists.

We tested this algorithm with a recorded signal consisting of three segments: far-end speech only, DT and near-end speech only. The value of the combination factor, plotted in Figure 2.7, behaves as expected.


Figure 2.7 plot ( = 0.01 = 0.8 )
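The combination of the two filters can be sketched as follows for a far-end single talk segment. This is only an illustration under stated assumptions: the echo path and all parameter values are hypothetical, each filter is assumed to adapt on its own error, a small constant `delta` protects the NLMS division, and the combination factor is clipped to [0, 1], which the description above does not specify.

```matlab
% Sketch: parallel NLMS and LMS filters with an adaptive combination factor.
rng(0);
N = 3000;  L = 4;
h = [0.8; -0.4; 0.2; 0.1];           % hypothetical echo path
x = randn(N, 1);                      % far-end signal, no near-end speech
alpha = 0.4;  mu = 0.005;  c = 0.01;  % illustrative step sizes
delta = 1e-6;                         % assumed guard against division by zero
w1 = zeros(L, 1);                     % NLMS filter
w2 = zeros(L, 1);                     % LMS filter
lambda = 0.5 * ones(N, 1);            % combination factor, started at 0.5
e = zeros(N, 1);
for k = L:N-1
    xk = x(k-L+1:k);
    z  = h' * xk;                                     % microphone signal (echo only)
    y1 = w1' * xk;   y2 = w2' * xk;                   % two echo estimates
    e(k) = z - lambda(k) * y1 - (1 - lambda(k)) * y2; % combined error
    w1 = w1 + (2 * alpha / (delta + xk' * xk)) * xk * (z - y1);
    w2 = w2 + mu * xk * (z - y2);
    lam = lambda(k) + c * (y1 - y2) * 2 * e(k);       % steepest descent step
    lambda(k+1) = min(max(lam, 0), 1);                % assumed clipping to [0, 1]
end
```

In this far-end-only run the factor moves from 0.5 towards 1, because the NLMS filter converges faster than the LMS filter.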

MATLAB simulations indicate that the new algorithm does not improve performance enough to justify the computational complexity it adds. In conclusion, for the same ERLE and according to the listening test results of the three methods, the safety constant method is chosen as an efficient solution that gives acceptable results.
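The behaviour of the first two solutions can be sketched in MATLAB by letting the far end fall almost silent while the near end starts talking. All signals and parameter values (`alpha`, `delta`, `thr`, the echo path `h`) are hypothetical and chosen only to make the effect visible.

```matlab
% Sketch: plain NLMS versus safety constant versus input-power threshold.
rng(0);
N = 4000;  L = 4;
h = [0.8; -0.4; 0.2; 0.1];                    % hypothetical echo path
x = [randn(N/2, 1); 1e-4 * randn(N/2, 1)];    % far end: active, then almost silent
v = zeros(N, 1);
v(N/2+L+1:end) = 0.5 * randn(N/2-L, 1);       % near end starts after the far end stops
alpha = 0.25;  delta = 0.1;  thr = 0.01;      % illustrative values
w_plain = zeros(L, 1);  w_safe = zeros(L, 1);  w_thr = zeros(L, 1);
for k = L:N
    xk = x(k-L+1:k);
    z  = h' * xk + v(k);                      % microphone = echo + near-end speech
    pw = xk' * xk;                            % input energy ||x(k)||^2
    w_plain = w_plain + (2*alpha / pw) * xk * (z - w_plain' * xk);
    w_safe  = w_safe  + (2*alpha / (delta + pw)) * xk * (z - w_safe' * xk);
    if pw > thr                               % adapt only when the far end is active
        w_thr = w_thr + (2*alpha / pw) * xk * (z - w_thr' * xk);
    end
end
err = [norm(w_plain - h), norm(w_safe - h), norm(w_thr - h)];
```

The plain NLMS weights are driven far away from `h` during the near-end segment, while both protected versions stay close to the converged solution.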

2.3.5 A Simplified Echo Path Model

A normal adaptive algorithm aims to approximate the real acoustic echo path and therefore inherently suffers from echo path changes and non-linearity. Christof Faller and Christophe Tournery recently proposed a new AES that does not need the complex computation of an acoustic echo path estimate. Instead of identifying the echo path impulse response, the proposed method estimates only the magnitude spectrum of the echo, which is what echo suppression requires. A filter mimicking the coloration effect of the echo path on the loudspeaker signal is adopted, and the gain filter for the AES is computed using this coloration effect filter. The proposed AES has low complexity and higher robustness because its estimate does not depend on the detailed physical echo path.

Coloration in an audio process means that some frequency ranges are attenuated or amplified while others are not. For the AES it is necessary to know which frequencies of the loudspeaker signal are attenuated, left unmodified or amplified. A typical room impulse response consists of the direct sound, which travels straight from the loudspeaker to the microphone, several early reflections, and then the late reflections, which form a long, dense tail, as shown in Figure 2.8. The dense late reflections hardly influence the magnitude of the frequency spectrum; the strong direct sound and the early reflections are what color the signal. To obtain the information needed for echo suppression it is therefore enough to consider only the direct sound and the early reflections, which reduces the computational complexity.

Figure 2.8 Typical room impulse response

A real-valued coloration effect filter $G_v(i,k)$, mimicking the spectral modification effect of the echo path on the loudspeaker signal, is estimated. To obtain an approximate echo magnitude spectrum, the estimated delay and the coloration effect filter are applied to the loudspeaker signal spectrum,

$$|\hat{Y}(i,k)| = G_v(i,k)\,|X_d(i,k)|,$$

where d is the delay in samples. Since it takes a certain amount of time for the loudspeaker signal to reach the microphone, the magnitude spectrum of the echo is calculated from the delayed loudspeaker signal.

The coloration effect filter is computed as the magnitude of the least squares estimator

$$G_v(i,k) = \left|\frac{E\{X_d^*(i,k)\,Y(i,k)\}}{E\{X_d^*(i,k)\,X_d(i,k)\}}\right|,$$

where $^*$ denotes the complex conjugate. Since the acoustic echo path is likely to vary in time, $G_v(i,k)$ is estimated iteratively as

$$G_v(i,k) = \left|\frac{a_{12}(i,k)}{a_{22}(i,k)}\right|,$$

where the expectations are replaced by recursive averages of the instantaneous products computed from the microphone spectrum $Z(i,k)$:

$$a_{12}(i,k) = \varepsilon\,X_d^*(i,k)\,Z(i,k) + (1-\varepsilon)\,a_{12}(i,k-1)$$

$$a_{22}(i,k) = \varepsilon\,X_d^*(i,k)\,X_d(i,k) + (1-\varepsilon)\,a_{22}(i,k-1)$$

and $\varepsilon \in [0,1]$ determines the time constant of the exponentially decaying estimation window. The magnitude spectrum of the echo signal is then used to form the gain filter, shown here in its basic magnitude-subtraction form:

$$G(i,k) = \frac{\max\left(|Z(i,k)| - |\hat{Y}(i,k)|,\ 0\right)}{|Z(i,k)|}$$

During double talk, the coloration filter will affect the near-end speech and may even diverge in the same way as the NLMS algorithm. To prevent this, a double talk detector (or near-end speech detector) may be needed to freeze the coloration effect filter while double talk exists.
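The recursive estimation of the coloration effect filter can be sketched in MATLAB on synthetic spectra. The coloration `C`, the number of bins and frames, the smoothing constant `epsilon` and the plain magnitude-subtraction gain are all assumptions made for this illustration; far-end single talk is simulated, so the microphone spectrum contains only echo.

```matlab
% Sketch: recursive coloration effect filter and gain filter per frequency bin.
rng(0);
nBins = 8;  nFrames = 200;
% Hypothetical echo path coloration (complex gain per bin)
C = linspace(1, 0.2, nBins)' .* exp(1j * 2 * pi * rand(nBins, 1));
epsilon = 0.1;                                 % smoothing constant in [0, 1]
a12 = zeros(nBins, 1);  a22 = zeros(nBins, 1);
for k = 1:nFrames
    Xd = randn(nBins, 1) + 1j * randn(nBins, 1);   % delayed loudspeaker spectrum
    Z  = C .* Xd;                                  % microphone spectrum (echo only)
    a12 = epsilon * conj(Xd) .* Z  + (1 - epsilon) * a12;
    a22 = epsilon * conj(Xd) .* Xd + (1 - epsilon) * a22;
    Gv   = abs(a12 ./ a22);                        % coloration effect filter
    Yhat = Gv .* abs(Xd);                          % estimated echo magnitude
    G    = max(abs(Z) - Yhat, 0) ./ max(abs(Z), realmin);  % gain filter
end
```

In this idealized case `Gv` equals the magnitude of `C` and the gain `G` is close to zero in every bin, so the echo is fully suppressed.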

2.4 Double Talk Detector

An important feature of an AEC is its capability to provide full-duplex service, which means that it allows both ends to speak simultaneously, namely the case of double talk (DT). During DT, the microphone signal used for adaptation contains not only the echo but also the near-end signal. This can make the adaptive filter diverge, since the near-end speech acts as strong uncorrelated noise for the adaptive algorithm. It is therefore necessary to detect when double talk occurs and to stop the adaptation process. This is done by a double talk detector.


2.4.1 The Generic Doubletalk Detection Scheme

The generic DTD is based on a detection statistic $\xi$, which is formed from the available signals, such as the loudspeaker signal, the microphone signal and the output signal. Double talk is declared or not by comparing $\xi$ with a preset threshold $T$. Once double talk is detected, filter adaptation is disabled for a minimum period of time $T_{hold}$. Adaptation is resumed once the detection statistic has indicated no DT continuously over a time $T_{hold}$.

There is a variety of double talk detectors, which differ in the algorithm used to calculate the decision statistic $\xi$. The most popular ones are the Geigel algorithm, the Normalized Cross-correlation (NCR) method and the Variance Impulse Response (VIRE) algorithm.

2.4.2 Geigel DTD

The most basic algorithm for double talk detection is the one originally developed by Geigel. It is a simple approach that compares the level of the received microphone signal with that of the far-end signal. Since the room acoustic path normally attenuates the far-end signal, DT is declared when the microphone signal divided by the maximum of the recent far-end samples is larger than a certain threshold.

The decision statistic is calculated as:

$$\xi = \frac{|z(t)|}{\max(|x(t)|, |x(t-1)|, \ldots, |x(t-N+1)|)}$$

If $\xi$ is larger than a preset threshold $T$, DT is deemed to be occurring, otherwise not:

$\xi > T$: double talk present
$\xi \le T$: double talk not present

The choice of T strongly affects the performance of the detector. In the MATLAB analysis, it can be found by plotting the decision variable and determining which threshold best distinguishes DT from the far-end signal. The Geigel detector has the benefit of being computationally simple and needing little memory. However, its detection performance is quite poor.
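A minimal MATLAB sketch of the Geigel detector is given below. The echo path `h`, the window length `Nwin`, the threshold `T` and the synthetic signals are hypothetical; the echo path is deliberately chosen so that its total gain stays below the threshold.

```matlab
% Sketch: Geigel double talk detector on synthetic signals.
rng(0);
N = 2000;  Nwin = 64;  T = 0.5;            % illustrative window length and threshold
x = randn(N, 1);                            % far-end signal
h = [0.2; 0.1; 0.05];                       % hypothetical attenuating echo path
y = filter(h, 1, x);                        % echo
v = zeros(N, 1);
v(1001:end) = 2 * randn(1000, 1);           % near-end speech in the second half
z = y + v;                                  % microphone signal
xi = zeros(N, 1);
dt = false(N, 1);
for t = Nwin:N
    xi(t) = abs(z(t)) / max(abs(x(t-Nwin+1:t)));   % decision statistic
    dt(t) = xi(t) > T;                             % double talk decision
end
```

During far-end single talk the statistic stays below the sum of the absolute echo path coefficients (0.35 here), so no false alarm occurs; once the near-end signal becomes active, many samples exceed the threshold.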

2.4.3 Normalized Cross-correlation (NCR) DTD

An alternative method is the normalized cross-correlation algorithm. The microphone signal z(t) can be expressed as the sum of the echo signal and the near-end speech (NES) signal, where we ignore the influence of noise for now:

$$z(t) = y(t) + v(t)$$

Suppose the echo path impulse response of the room is $\mathbf{h}$, so that the echo signal is $y(t) = \mathbf{h}^T\mathbf{x}(t)$. The power of the measured microphone signal can be written as

$$\sigma_z^2(t) = \mathbf{h}^T\mathbf{R}_{xx}\mathbf{h} + \sigma_v^2(t),$$

where $\mathbf{R}_{xx} = E\{\mathbf{x}\mathbf{x}^T\}$.

By definition, the cross-correlation between the loudspeaker signal and the echo is

$$\mathbf{r}_{xy} = E\{\mathbf{x}y\} = \mathbf{R}_{xx}\mathbf{h},$$

yielding

$$\mathbf{h} = \mathbf{R}_{xx}^{-1}\mathbf{r}_{xy}.$$

The power of the microphone signal can then be rewritten as

$$\sigma_z^2(t) = \mathbf{r}_{xy}^T\mathbf{R}_{xx}^{-1}\mathbf{r}_{xy} + \sigma_v^2(t).$$

When there is no NES present, i.e. $v(t) = 0$, then $z(t) = y(t)$ and

$$\sigma_z^2(t) = \mathbf{r}_{xz}^T\mathbf{R}_{xx}^{-1}\mathbf{r}_{xz}, \quad \text{with } \mathbf{r}_{xz} = E\{\mathbf{x}z\}.$$

The detection statistic is suggested as:

$$\xi^2 = \frac{\mathbf{r}_{xz}^T\mathbf{R}_{xx}^{-1}\mathbf{r}_{xz}}{\sigma_z^2(t)}.$$


The numerator is the power the measured signal would have if no near-end speech were present, whereas the denominator is the actual power of the measured signal. Thus, if no near-end speech is present, $\xi \approx 1$; otherwise $\xi < 1$.

The DT decision is formed as

$\xi < T$: double talk present
$\xi \ge T$: double talk not present

where T is selected between 0 and 1.

The NCR method is normally computationally infeasible, as it requires not only the estimation of the cross-correlation vector $\mathbf{r}_{xz}$ and the far-end covariance matrix $\mathbf{R}_{xx}$, but also the inversion of the covariance matrix. A practical approach is therefore adopted, in which the room echo path response is approximated by the response of the adaptive filter:

$$\mathbf{w} \approx \mathbf{h} = \mathbf{R}_{xx}^{-1}\mathbf{r}_{xy}$$

$$\xi^2 = \frac{\mathbf{r}_{xz}^T\mathbf{w}}{\sigma_z^2(t)} = \frac{\mathbf{w}^T\mathbf{R}_{xx}\mathbf{w}}{\sigma_z^2(t)} = \frac{\sigma_{\hat{y}}^2(t)}{\sigma_z^2(t)}$$

The numerator is the power of the estimated echo signal and the denominator is the actual microphone signal power. This is the cheap form of the normalized cross-correlation algorithm. Since we use the Hann-window-based overlap-add method, the decision statistic is calculated once per window frame.
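The cheap NCR statistic can be sketched per frame in MATLAB as the ratio between the estimated echo power and the microphone power. The sketch assumes a perfectly converged adaptive filter (`w = h`), rectangular non-overlapping frames instead of Hann windows, and hypothetical values for the echo path, frame length and threshold.

```matlab
% Sketch: frame-based cheap normalized cross-correlation DTD.
rng(0);
N = 4000;  frameLen = 200;  T = 0.9;        % illustrative frame length and threshold
h = [0.5; 0.3; -0.2];                        % hypothetical echo path
w = h;                                       % assumption: converged adaptive filter
x = randn(N, 1);                             % far-end signal
v = zeros(N, 1);
v(2001:end) = randn(2000, 1);                % near-end speech in the second half
z    = filter(h, 1, x) + v;                  % microphone signal
yhat = filter(w, 1, x);                      % estimated echo
nFrames = N / frameLen;
xi = zeros(nFrames, 1);
for m = 1:nFrames
    idx = (m-1) * frameLen + (1:frameLen);
    xi(m) = sqrt(sum(yhat(idx).^2) / sum(z(idx).^2));   % echo power / mic power
end
dt = xi < T;                                 % double talk when the ratio drops
```

In the far-end-only frames the statistic equals 1, and it drops well below the threshold as soon as near-end speech adds power to the microphone signal.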

2.4.4 Variance Impulse Response (VIRE DTD)

VIRE DTD is a recently introduced method based on the variance of the adaptive filter coefficients. Since the near-end speech acts as corrupting noise, it induces dramatic variations in the adaptive filter taps. The algorithm uses the maximum value of the adaptive filter coefficients as a measure of these fluctuations, smoothed with an exponential forgetting factor $\gamma$:

$$\xi(n) = \gamma\,\xi(n-1) + (1-\gamma)\,[h_{max}(n) - \bar{h}_{max}(n)]^2$$

where $h_{max}(n)$ is the maximum value of the filter coefficients and $\bar{h}_{max}(n)$ is its expected value, which is again formed with the exponential forgetting factor:

$$\bar{h}_{max}(n) = \gamma\,\bar{h}_{max}(n-1) + (1-\gamma)\,h_{max}(n)$$

$$h_{max}(n) = \max(h(0), h(1), \ldots, h(L-1))$$

By defining a threshold $T$ for the variation of the adaptive filter taps, the DT decision is made as follows:

$\xi > T$: double talk present
$\xi \le T$: double talk not present


The detection is again frame-based and is calculated once at the end of each frame, so the frame length used in this work is normally kept relatively small. It is also worth mentioning that the VIRE algorithm is more sensitive to the power of the near-end speech, as we will see later in the simulations.
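The VIRE statistic can be sketched in MATLAB by tracking the largest coefficient of a running NLMS filter once per frame. Everything here is an assumption for illustration: the echo path, the NLMS parameters, the frame length `frameLen` and the forgetting factor `gamma`.

```matlab
% Sketch: variance of the maximum filter coefficient as a DT statistic.
rng(0);
N = 8000;  L = 4;  frameLen = 40;  gamma = 0.9;   % illustrative values
h = [0.8; -0.4; 0.2; 0.1];                         % hypothetical echo path
x = randn(N, 1);                                   % far-end signal
v = zeros(N, 1);
v(N/2+1:end) = randn(N/2, 1);                      % near-end speech in the second half
alpha = 0.25;  delta = 1e-3;                       % NLMS parameters
w = zeros(L, 1);
nFrames = N / frameLen;
xi = zeros(nFrames, 1);
hbar = 0;  stat = 0;
for k = L:N
    xk = x(k-L+1:k);
    e  = h' * xk + v(k) - w' * xk;
    w  = w + (2 * alpha / (delta + xk' * xk)) * xk * e;
    if mod(k, frameLen) == 0                       % once at the end of each frame
        hmax = max(w);                             % maximum filter coefficient
        hbar = gamma * hbar + (1 - gamma) * hmax;  % smoothed expected value
        stat = gamma * stat + (1 - gamma) * (hmax - hbar)^2;
        xi(k / frameLen) = stat;
    end
end
```

After the initial convergence the statistic decays towards zero during far-end single talk and rises by orders of magnitude when the near-end speech disturbs the filter taps.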

2.4.5 Double Talk Detection Performance Evaluation

Certain criteria are needed to compare different types of DTD, since their performance cannot be compared directly when different thresholds are used. A systematic approach is also required to select the threshold value.

The criteria to evaluate DTD performance are as follows:

Probability of false alarm (Pf): the probability of declaring detection when DT does not exist.
Probability of detection (Pd): the probability of successful detection when DT does exist.
Probability of miss (Pm = 1 - Pd): the probability of detection failure when DT is present.

Pf is calculated when only the far-end signal is present, namely v = 0:

$$P_f = \frac{\sum (\zeta \wedge X_{active})}{N}$$

where $\zeta$ is the output decision of the DTD, $X_{active}$ is the output of the activity detector and N is the length of the entire far-end speech signal x, which is the first 15 seconds in our case.


The miss probability Pm is measured as the proportion of near-end speech that remains undetected when far-end speech also exists. Pm is a meaningful criterion for comparing different DTD methods fairly, because the disruptive effect of undetected double talk on an adaptive filter depends on the near-end speech that goes undetected.

$$P_m = 1 - \frac{\sum (\zeta \wedge X_{active} \wedge V_{active})}{\sum (X_{active} \wedge V_{active})}, \qquad P_d = 1 - P_m$$

where $\zeta$ is the output decision of the DTD, and $X_{active}$ and $V_{active}$ are the outputs of the activity detector for the far end and the near end respectively. The logical AND ensures that the miss probability is only counted when both NE and FE are active.
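The two probabilities can be computed from logical flag vectors, as in the following MATLAB sketch. The ten flags per vector are made-up values for illustration, and the far-end-only segment is assumed to be the first five entries.

```matlab
% Sketch: false alarm and miss probabilities from decision and activity flags.
dtd     = logical([0 0 1 0 0 1 1 0 1 1]);   % hypothetical DTD decisions
xActive = logical([1 1 1 1 1 1 1 1 0 0]);   % hypothetical far-end activity
vActive = logical([0 0 0 0 0 1 1 1 1 1]);   % hypothetical near-end activity

farOnly = 1:5;                               % segment with v = 0
Pf = sum(dtd(farOnly) & xActive(farOnly)) / numel(farOnly);

both = xActive & vActive;                    % both ends active (true double talk)
Pm = 1 - sum(dtd & both) / sum(both);
Pd = 1 - Pm;
```

Here one of the five far-end-only entries is a false alarm, and one of the three double talk entries is missed.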

A good detection method should maximize Pd while minimizing Pf, even at a low signal-to-noise ratio. In general, a higher Pd is achieved at the cost of a higher Pf, and the trade-off depends on the relative cost of a false alarm and a miss. A Receiver Operating Characteristic (ROC) curve, widely used to characterize detection schemes such as those in radar applications, plots Pm against Pf at different threshold values so that a threshold achieving a certain performance can be found. In addition, for a given probability of false alarm, one can plot the probability of miss as a function of the Signal-to-Echo Ratio (SER) or the Signal-to-Noise Ratio (SNR) to evaluate the DTD algorithm under different speech power or environmental noise conditions.


CHAPTER III: OTHER ISSUES

This chapter presents some issues beyond the echo cancellation theory. To decide the length of the adaptive filter, we need to know the length of the actual echo path that the filter is trying to model. Section 3.1 discusses room acoustics and how to measure them. Another important issue is the synchronization between the loudspeaker signal and the microphone signal, which is essential for the echo cancellation process and is covered in section 3.2. Last but not least, noise is always an important consideration for speech and audio applications, and hands-free communication has various noise sources.

3.1 Room Acoustics

The AEC is tested in real rooms. An AEC that works well in one room, however, might not perform adequately in another, because the acoustics of all rooms are different. This allows designers to test the AEC in the type of room it was designed for, but it also means that the user has the responsibility of determining whether the AEC will operate in his or her particular environment. An AEC solution designed to operate in an office may not work properly in a conference room. If an echo canceller works in one room and not in another, the most likely cause is a tail length that is too short for the second room.

The tail length of an AEC is the length of time over which it can cancel echoes, measured in ms. It is directly related to the reverberation time of the room: as the reverberation time increases, a longer tail length is needed. If the reverberation time is much longer than the tail length, a significant amount of the echo will remain audible.


There are two main factors that affect the reverberation time of a room. They are room size, and the materials used to construct the walls and objects in the room. Most sound is absorbed when it strikes walls or other surfaces. If materials are used that absorb sound well (such as carpet, curtains, or acoustic tile), the reverberation will die out more quickly than if the room contains mostly reflective materials (hard wood, or glass). If a room is small, the sound waves will bounce off the walls more frequently, and will be absorbed more quickly.

3.1.1 Measure the Testing Room acoustics

Since the job of the adaptive filter is to model the room acoustics, it is useful to have an idea of what they look like; in addition, the adaptive filter should be long enough to cover a sufficient part of the impulse response of the room in which it operates.

Typically the impulse response is the system response to a Dirac pulse, which theoretically has infinite amplitude at one time instant and is zero everywhere else. It is impossible, however, to produce a real Dirac pulse in practice. The room impulse response, namely the acoustic echo path in the room, can be measured in several ways.

As a conceptual method, consider a room with a balloon in it at point p. The balloon pops and makes a short bang, which (owing to its short duration) is similar to a Dirac delta, and the output h[n] is the sequence of the damped sound. Here h[n] depends on the location (point p) of the balloon. If we know h[n] at point p of the room, we know the impulse response of the room at point p, and it is then possible to predict the room's response to any sound produced at this point. However, it is not easy to apply this method at exactly our actual recording position. A sudden loud sound can be used as an approximation, but it will not give good results.

Another way is to excite the room with sine waves of different frequencies. Sine waves would give the best results, but the measurement is quite time-consuming. The third choice is to measure with a white noise signal, which in theory has a flat magnitude spectrum, so that the deconvolution in the time domain can be calculated easily as a division in the frequency domain. This method of recording white noise has the potential to give a good result, can be realized quite easily in MATLAB, and requires nothing other than a computer equipped with a microphone and a loudspeaker. This is the method that we have chosen.
[Plot: amplitude versus time (ms); sampling rate = 8 kHz]

Figure 3.1 Room Impulse Response in the Scream room in NXP Leuven

Our testing room measures 5 m x 6 m and has hard walls. As we can see, an adaptive filter length of approximately 16 ms is needed in our tests, namely 128 taps at an 8 kHz sampling rate.
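The white-noise measurement described above can be sketched in MATLAB with a simulated room. The impulse response `hTrue` is a hypothetical three-reflection response, and the "recording" is a noise-free convolution rather than a real microphone signal, so the sketch only illustrates the deconvolution step and the conversion from filter length in ms to taps.

```matlab
% Sketch: impulse response from a white-noise recording by spectral division.
rng(0);
fs = 8000;  N = 4096;
hTrue = zeros(128, 1);
hTrue([5 20 60]) = [0.5 -0.3 0.1];      % hypothetical room impulse response
x = randn(N, 1);                         % white noise played on the loudspeaker
y = conv(x, hTrue);                      % simulated microphone recording
nfft = N + numel(hTrue) - 1;             % long enough to avoid circular wrap-around
H = fft(y, nfft) ./ fft(x, nfft);        % deconvolution as a division
hEst = real(ifft(H));
hEst = hEst(1:128);                      % keep the first 128 taps

taps = round(16e-3 * fs);                % 16 ms at 8 kHz corresponds to 128 taps
```

With a real recording the division is sensitive to noise in bins where the excitation spectrum is weak, so averaging over several measurements would be a sensible precaution.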

3.2 Measure of the delay between the loudspeaker and the microphone

The estimation of the delay between the microphone and loudspeaker signals is required not only by the new AES algorithm proposed by Christof Faller and Christophe Tournery but also by the other AEC and AES algorithms, since all of them need the two signals to be synchronized.

To estimate the time delay, the cross-correlation is used in this thesis.

Cross-correlation (CC) is the standard way of measuring how two signals are correlated. The correlation is high if the microphone signal is similar to the reference loudspeaker signal, so the maximum of the CC indicates the lag at which the two signals are most strongly correlated. The cross-correlation can be related to the convolution as:

$$(x \star z)(n) = x^*(-n) * z(n)$$

where the time-reversed complex conjugate of one signal is used in the convolution. The convolution of two signals in the time domain can in turn be computed as a multiplication in the frequency domain and converted back to a time series by the IFFT. The MATLAB code is as follows:

    x_inv = [flipud(x(index)); zeros(fs, 1)];  % time-reverse the loudspeaker signal, then zero-pad
    z = [z(index); zeros(fs, 1)];              % zero padding for later calculation
    temp = fft(x_inv) .* fft(z);               % multiplication in frequency domain
    cc = abs(ifft(temp));                      % IFFT
    [value, lag] = max(cc);                    % index of the maximum of the CC result is the lag

To get a relatively accurate estimation, we calculate the delay for several frames and average the results.
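A self-contained sketch of the delay estimate on synthetic data is given below. The frame length `M`, the delay `trueDelay` and the attenuation are hypothetical. Because the loudspeaker frame is time-reversed before the convolution, the peak index is offset by the frame length, so the delay in samples is taken here as `lag - M`; this offset handling is an assumption of the sketch.

```matlab
% Sketch: delay estimation by FFT-based cross-correlation.
rng(0);
fs = 8000;  M = 4000;  trueDelay = 37;            % hypothetical values
x = randn(M, 1);                                   % loudspeaker frame
z = [zeros(trueDelay, 1); 0.5 * x(1:M-trueDelay)]; % delayed, attenuated copy
x_inv = [flipud(x); zeros(fs, 1)];                 % time-reversed and zero-padded
z_pad = [z; zeros(fs, 1)];                         % zero-padded microphone frame
cc = abs(ifft(fft(x_inv) .* fft(z_pad)));          % cross-correlation via FFT
[~, lag] = max(cc);                                % position of the CC maximum
delay = lag - M;                                   % remove the time-reversal offset
```

The zero padding of `fs` samples keeps delays between 0 and `fs` samples free of circular wrap-around.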

3.3 Noise issues

During hands-free communication on a PC, a lot of noise may exist and disturb the speech reaching the microphone. The noise problem deserves particular attention when the internal microphone of a laptop is used. The gain of a laptop's internal microphone is usually high so that it can pick up the near-end speech. Hence, a variety of noises, such as the hard disk, the laptop fan, typing on the keyboard, mouse clicks and various ambient noises, are likely to be picked up as well.

The possible noise sources when using the internal microphone of a laptop are illustrated in Figure 3.2. The hard drive and the cooling fan are close to the microphone, so their noise, together with other mechanical sounds, is transmitted to the internal microphone through vibrations. The clicking sounds of the keyboard and the mouse are also major noise sources. Since the keyboard is usually close to the microphone, the typing noise can be quite loud, which makes it the most annoying noise source in this case. The situation improves considerably when a good-quality external microphone is used. In our recordings, the Trust MC-1200 high-sensitivity external microphone lowered the noise floor by 10-13 dB compared to the internal microphone.

Figure 3.2 Common Laptop Noise Sources

3.3.1 Typing noise cancellation based on cross-correlation

As mentioned above, the internal microphone is likely to pick up a lot of noise from sources such as the hard disk and fan of the laptop and from the environment. The typing noise of the keyboard can be the most annoying of all, which motivates the search for a way to reduce it. The most direct way is to use an external microphone. There are also dedicated keyboards on the market that use special pads to reduce typing noise. What we investigate below is an algorithm that could be embedded into our AEC (or AES) algorithms to cancel the typing noise.

First we need to know what the typing sound looks like before dealing with it, so we recorded typing using the internal microphone of the laptop. Analysis of the recording shows that each typing sound generally consists of two separate parts, namely a press sound and a trailing release sound, as can be seen in Figure 3.3.

[Plots: one typing sound (press and release); typical press sound for the keyboard of our testing laptop; typical release sound for the keyboard of our testing laptop; horizontal axis: time (s)]

Figure 3.3 Typical look of a typing sound on the keyboard

The first idea that comes to mind is to use a cross-correlation algorithm to recognize the typing operation and then cancel it out. Two masks are needed, one for the press sound and one for the release sound, because the pause between them can vary from one keystroke to another. The cross-correlation is calculated with a shifting window and normalized so that it is independent of the input power. Once the cross-correlation value exceeds a certain threshold, as shown in Figure 3.4, a press or release action is considered to be happening, and a scaled mask is subtracted from the input signal. The way the mask is scaled is important. The projection operation gives a good estimate of one vector in the direction of another, as defined in:
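$$\mathrm{proj}_{\mathbf{B}}\mathbf{A} = \frac{\langle \mathbf{A}, \mathbf{B} \rangle}{\langle \mathbf{B}, \mathbf{B} \rangle}\,\mathbf{B}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product of two vectors:

$$\langle \mathbf{a}, \mathbf{b} \rangle = \sum_{i=1}^{n} a_i b_i \quad \text{or} \quad \langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{b}^T\mathbf{a}$$

The estimated typing noise is therefore calculated as the projection of the actual microphone signal onto the mask whenever typing is detected.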
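The scaling of the mask by projection can be sketched in MATLAB as follows. The mask and the microphone frame are made-up six-sample vectors; the frame is built as 0.7 times the mask plus a small component orthogonal to it.

```matlab
% Sketch: scale a typing mask by projecting the microphone frame onto it.
mask  = [0; 1; -2; 1.5; -0.5; 0];                  % hypothetical typing mask B
frame = 0.7 * mask + [0.1; 0; 0; 0; 0; -0.1];      % microphone frame A
scale    = (mask' * frame) / (mask' * mask);       % <A,B> / <B,B>
noiseEst = scale * mask;                           % proj_B A, the scaled mask
residual = frame - noiseEst;                       % signal after subtraction
```

The projection recovers the scale factor 0.7, and the residual is orthogonal to the mask, which is the best that a single fixed mask can achieve in the least squares sense.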

As we can observe from the residual signal, the typing noise cannot be completely cancelled and is still audible. This is because each individual keystroke can differ somewhat from the mask, which may indicate that there is no linear model for the typing noise. In the following we therefore try to model the noise with an LMS adaptive filter.

The input to the LMS system is a pulse train that triggers the typing action, which can be generated by the cross-correlation method described above. If a linear time-invariant model exists for the typing noise, the cancellation should perform well after the filter has converged; otherwise it will not.

Through simulation we observed that the typing noise estimate generated by the LMS filter matches the input signal well only occasionally; in other cases it leads or lags the input. We therefore conclude that no linear time-invariant model is applicable to typing noise cancellation: the noise varies with time. A much more sophisticated method is required, which is beyond the scope of this thesis.

[Plots: recorded typing noise z(t); cross-correlation result between z(t) and press mask, with press flag; cross-correlation result between z(t) and release mask, with release flag; residual typing noise; horizontal axis: samples (x 10^4)]

Figure 3.4 Typing noise cancellation

3.3.2 High Pass Filtering

Analysis of the recorded noise shows that most noise components, including the typing noise, are concentrated in the low-frequency part of the spectrum. A high-pass filter is therefore an efficient way to reduce the noise. A second-order Butterworth filter with a cutoff frequency of 200 Hz is adopted.
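As a sketch, the coefficients of such a filter can be derived in base MATLAB with the bilinear transform, assuming the 8 kHz sampling rate used elsewhere in this work. With the Signal Processing Toolbox, `butter(2, fc/(fs/2), 'high')` should give the same coefficients; the hand derivation is used here only to keep the example free of toolbox dependencies.

```matlab
% Sketch: second-order Butterworth high-pass filter, 200 Hz cutoff at 8 kHz.
fs = 8000;  fc = 200;
K   = tan(pi * fc / fs);                       % pre-warped cutoff
nrm = 1 / (1 + sqrt(2) * K + K^2);             % normalization so that a(1) = 1
b = nrm * [1, -2, 1];                          % numerator coefficients
a = [1, 2 * (K^2 - 1) * nrm, (1 - sqrt(2) * K + K^2) * nrm];   % denominator

% Check the magnitude response at the cutoff frequency (should be -3 dB)
zi = exp(-1j * 2 * pi * fc / fs);
gainAtCutoff = abs((b(1) + b(2)*zi + b(3)*zi^2) / (a(1) + a(2)*zi + a(3)*zi^2));

% Usage: zFiltered = filter(b, a, z);  with z the recorded microphone signal
```

The gain is zero at DC and equals 1/sqrt(2) at 200 Hz, as expected for a Butterworth design.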


CHAPTER IV EVALUATION

So far, we have studied and discussed two echo removal methods: the Acoustic Echo Canceller and the Acoustic Echo Suppressor. Most AEC products are based on the adaptive LMS or NLMS digital filter, a well-defined algorithm that has been used for years. To achieve larger echo attenuation without the help of other devices such as a nonlinear processor, the Acoustic Echo Suppressor based on spectral subtraction is a good option. Instead of combining NLMS with the FFT, an AES can also adopt a simplified echo path model in the frequency domain, with even lower computational complexity. To deal with the troublesome double talk situation, three double talk detection methods are presented: the Geigel, Normalized Cross-correlation and Variance Impulse Response algorithms.

In the following sections, the performance of the different methods is examined and compared using MATLAB simulations. Many parameters occur in the algorithms, e.g. the learning rate, the safety constant and the suppression factor in the AES. They all affect the performance of the AEC or AES in some way, so parameter tuning plays an important role in reaching a given target performance in the simulations.

The evaluation of the algorithms is primarily based on how much ERLE they achieve, since echo attenuation is the goal of an AEC. When different algorithms achieve similar ERLE, the initial convergence time and the near-end attenuation become more important. In real hands-free communication, the levels of the near-end and far-end speech vary widely. It is therefore necessary to check how each algorithm reacts to changes in the Signal-to-Echo Ratio, defined as the ratio between the power of the near-end speech and the power of the echo signal. Noise has long been an important issue for audio and speech systems, so the performance of each algorithm under different noise levels also needs to be evaluated.


4.1 Requirements for AEC

The performance evaluation of an AEC (or AES) solution is based on specifications and listening tests. As discussed in section 1.3, several measures exist for the evaluation of an AEC. The International Telecommunication Union (ITU) has specified criteria for a number of performance characteristics, such as the rate of convergence, the amount of cancellation and the bandwidth. Although these criteria are necessary, they are not sufficient to determine whether an AEC is good enough: the performance of an AEC is quite sensitive to location and noise, and a specification can only cover certain test environments. Hence, evaluation through auditory tests is necessary for a given application. At the end of the day, how an AEC sounds is the final criterion.

4.2 Speech Stimuli

In hands-free communication systems, the input signal is primarily speech, and the output signal consists of speech disturbed by noise and other speech signals. Speech has highly time-varying characteristics. It is not stationary, but can be approximated as stationary over short time intervals. Speech is sometimes quasi-periodic (e.g. vowels) and sometimes resembles noise (e.g. fricatives) or impulses (e.g. plosives). Speech also contains pauses. Speech signals are wide-band, with frequency content ranging from 100 Hz to more than 8 kHz. According to the sampling theorem, telephone-band audio signals (with frequencies between 300 Hz and 3400 Hz) should be sampled at a frequency equal to or greater than 6800 Hz (2 × 3400 Hz). In practice, telephone applications usually use a sampling rate of 8 kHz. The most popular choices for VoIP are 8 kHz and 16 kHz. A higher sampling rate improves the speech quality but also requires more bandwidth. Throughout our simulations, a sampling frequency of 8 kHz is used. In all, speech provides a non-persistent excitation for the adaptive filters used in AEC.

The speech stimuli consist of two channels: the channel to be played on the NE speaker (male voice) and the channel to be played on the FE (female voice). The recorded signals consist of three segments, FE single talk, double talk and NE single talk, which are used to examine the achieved ERLE and the performance during DT. Each segment has a length of 15 seconds, and there are pauses of 1 second in between.

Figure 4.1 Speech stimuli segmentation (FE and NE channel waveforms against time in seconds, with the FE single talk, double talk and NE single talk segments marked)

The recording is made with a DELL Latitude-D600 laptop. The near-end setup of a laptop user can differ from one user to another. Different configurations of internal or external microphones and internal or external loudspeakers differ in the nominal Signal-to-Echo Ratio (SER) and Signal-to-Noise Ratio (SNR). The SER is defined as the ratio of the power of the NES to the power of the echo in the recorded signal. The nominal SER is obtained from a recording made at common perceptual levels of NE and FE speech. When an external microphone is used, a higher nominal SER and SNR can be achieved than with the internal microphone and the same loudspeaker positions.
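The SER definition above can be sketched in a few lines of Python. The helper `scale_to_ser` shows one way of rescaling the near-end speech to reach a chosen SER while leaving the echo untouched; it assumes that the NES and the echo are available as separate sample sequences, which holds in simulation but not for a real mixed recording. All names are illustrative.

```python
import math

def mean_power(samples):
    """Average power: the mean of the squared samples."""
    return sum(v * v for v in samples) / len(samples)

def ser_db(nes, echo):
    """Signal-to-Echo Ratio in dB: NES power over echo power."""
    return 10.0 * math.log10(mean_power(nes) / mean_power(echo))

def scale_to_ser(nes, echo, target_db):
    """Return a scaled copy of the NES whose SER against `echo` is target_db."""
    # An amplitude gain g changes the power by g**2, hence the division by 20.
    gain = 10.0 ** ((target_db - ser_db(nes, echo)) / 20.0)
    return [gain * v for v in nes]

nes = [1.0, -1.0, 1.0, -1.0]
echo = [0.5, -0.5, 0.5, -0.5]
quiet_nes = scale_to_ser(nes, echo, -5.0)
print(round(ser_db(quiet_nes, echo), 6))
```

The gain is applied to the amplitude, so a change of the SER by a given number of dB requires that number divided by 20 in the exponent rather than by 10.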

4.3 Acoustic Echo Canceller

Firstly, the AEC based on a simple NLMS-adapted FIR filter is evaluated through parameter tuning. The learning rate is always an important parameter for NLMS, since it controls the convergence speed. Another parameter, the safety constant discussed in section 2.3.4, is added to the denominator of the NLMS coefficient adaptation equation to avoid divergence. To find a proper value for the safety constant, the ERLE is plotted against the safety constant for different learning rates, as shown in Figure 4.2. The simulation is performed with a signal recorded at common perceptual levels of NE and FE speech, using an external microphone and external loudspeakers. The nominal SER in this case is 5 dB. The filter length is 128.

Figure 4.2 How ERLE changes with the safety constant (the three lines correspond to learning rates of 1.0, 0.5 and 0.1; x-axis: safety constant, y-axis: ERLE in dB)

It is observed that the ERLE has a peak value, which is reasonable. When the safety constant is negligible, the step size is large, so the filter may over-adapt and take longer to reach the optimum. Conversely, when the safety constant significantly lowers the step size, the adaptation also takes more steps to approach the optimum. In both cases the ERLE is lower as a consequence of the slower adaptation. Hence, the ERLE peaks where the filter coefficients reach the optimum in the minimum number of steps.
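To make the role of the two parameters concrete, the following Python sketch shows a textbook NLMS update in which the safety constant is added to the input energy in the denominator of the step. It is an illustration under that assumption, not the MATLAB or mex implementation used for the simulations; the function name `nlms`, its argument names and the toy echo path are all hypothetical.

```python
import random

def nlms(far_end, mic, num_taps=128, learning_rate=0.5, safety=0.1):
    """Run an NLMS-adapted FIR filter; return (residual, final weights)."""
    weights = [0.0] * num_taps
    x_buf = [0.0] * num_taps              # recent far-end samples, newest first
    residual = []
    for x_t, z_t in zip(far_end, mic):
        x_buf = [x_t] + x_buf[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, x_buf))
        error = z_t - echo_estimate       # residual e(t) after cancellation
        # Normalised step: the safety constant keeps the denominator away
        # from zero when the far-end signal is weak.
        step = learning_rate * error / (safety + sum(x * x for x in x_buf))
        weights = [w + step * x for w, x in zip(weights, x_buf)]
        residual.append(error)
    return residual, weights

# Toy experiment: white far-end signal through a short, purely linear echo path.
rng = random.Random(1)
echo_path = [0.5, -0.3, 0.1]
far_end = [rng.gauss(0.0, 1.0) for _ in range(2000)]
mic = [sum(h * far_end[t - k] for k, h in enumerate(echo_path) if t - k >= 0)
       for t in range(len(far_end))]
residual, weights = nlms(far_end, mic, num_taps=4)
print([round(w, 3) for w in weights])
```

In this noise-free toy case the weights approach the echo path, with the extra fourth tap going to zero. With recorded speech, near-end signals and loudspeaker nonlinearities, convergence is far less clean, which is what the tuning above addresses.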


Next, the learning rate is tuned in a similar manner. The result is illustrated in Figure 4.3. As with the safety constant, higher learning rates result in over-adaptation and lower values result in overly fine adaptation; both slow down the convergence. There is an optimal value which leads to the fastest convergence.

Figure 4.3 The effect of Learning rate on ERLE (safety constant = 0.1; x-axis: Learning rate (2 x Alpha), y-axis: ERLE in dB)

The simulation results in Figure 4.4 show that the NLMS-based AEC can only achieve a limited amount of attenuation. One reason is that the adaptive filter can never model the echo path impulse response completely, due to its limited number of filter taps. Another important reason is that NLMS assumes a linear echo path, whereas in reality the loudspeaker-room-microphone path is nonlinear. The nonlinearities come from saturation effects in the amplifiers and loudspeakers. In listening tests, the residual echo is still audible. During double talk, although the influence on the near-end signal is small (1 to 2 dB near-end attenuation), the remaining echo is still significant. Figure 4.4 also shows the improvement brought by the safety constant. Without a safety constant, the filter becomes unstable when DT starts, so that a lot of jitter is observed.


Figure 4.4 Echo Cancellation Result of NLMS AEC under nominal SER (learning rate = 1; panels: recorded signal z(t), residual without safety constant, residual with safety constant = 0.1, against sample index. Notice that the y-axis of the panel without safety constant has a larger scale)

4.4 Acoustic Echo Suppressor Based on Spectral Subtraction

As seen above, the time-domain AEC can only achieve a very low ERLE. Hence, an echo suppression filter is added after the NLMS-based echo canceller. The echo estimate from the NLMS algorithm is transformed to the frequency domain using a Hann window and subtracted there. Over-suppression can be performed to gain a higher ERLE. An alternative is the simplified echo path, which only estimates the magnitude spectrum of the echo signal and leaves out the phase information to reduce the computational load.

4.4.1 NLMS-Based AES

There are a variety of parameters in an NLMS AES with which to adjust the performance, as discussed in section 2.2: the learning rate and the safety constant of the NLMS process, the parameter α and the echo suppression ratio β, and the smooth factor of the gain function. All of them involve a trade-off between echo rejection and speech distortion. In the following paragraphs, each parameter is tuned individually to investigate its effect. The test signal is still the nominal recording with 5 dB SER, made with an external microphone and external loudspeakers. With this recording, it is found that an ERLE of 30 dB is the minimum requirement to make the echo in the FE single talk section acceptable.

As discussed in the last section, the learning rate of the NLMS adaptive filter influences the ERLE as well as the NEA during DT. As observed in Figure 4.5, the larger the learning rate, the higher the ERLE and the NEA. In fact, the adaptation speed of the NLMS filter is not as crucial for the AES as it is for the AEC, because other parameters, e.g. the suppression ratio and the smooth factor of the gain function, can also tune the initial convergence time and the ERLE. Hence, unlike the NLMS AEC, which uses a learning rate of 1, a small learning rate is chosen here to ensure a low NEA.

Figure 4.5 The effect of learning rate on the ERLE and the NEA (α = 1, β = 25, safety factor = 0.1, smooth = 0.1)

The smooth factor, introduced in section 2.21, controls the fluctuation of the suppression gain function. Its influence on the ERLE and NEA is shown in Figure 4.6. Larger smoothing (a smaller smooth factor) results in a flatter suppression gain, which leads to attenuation of both the echo and the NES during DT. As observed in Figure 4.7, large smoothing (smooth factor = 0.01) strongly suppresses the NES during DT, so that the residual speech sounds natural yet has a very low volume, while low smoothing (smooth factor = 0.99) results in heavy clicking during DT, which sounds artificial. Hence, a smooth factor in the middle range should be chosen. We use a smooth factor of 0.3.

Figure 4.6 The effect of the smooth factor on the ERLE and the NEA (α = 1, β = 25, Learning rate = 0.2, safety factor = 0.1)

Figure 4.7 The effect of smooth factor on the NES during DT (panels: recorded signal z(t), residual signal with smooth = 0.01, residual signal with smooth = 0.99, against sample index)


Recalling the gain function of the AES,

$$G_i(f) = \left( \frac{\max\left( |Z(f)|^{\alpha} - \beta\,|\hat{Y}(f)|^{\alpha},\ 0 \right)}{|Z(f)|^{\alpha}} \right)^{1/\alpha},$$

we look for the optimal values of α and β. We use the minimization function in MATLAB to return, for a given α, the β value which minimizes the squared difference between the achieved ERLE and the desired value. In this way, a series of given α values yields a series of corresponding β values that achieve 30 dB ERLE, as shown in Figure 4.8. We choose α = 1, which brings the least computational load, so that β needs to be at least 27 to achieve 30 dB ERLE.

Figure 4.8 Alpha Beta values to achieve 30dB ERLE (Learning rate = 0.2, safety constant = 0.1, smooth factor = 0.3; beta plotted against alpha, with a zoomed-in inset)
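The search for β can be illustrated with a small Python sketch. It assumes the parametric spectral-subtraction form of the gain (the magnitudes raised to the power α, the echo estimate weighted by β, and the result floored at zero) and works on a single frame of magnitude values without gain smoothing. Instead of the MATLAB minimization function used in the thesis, it uses a plain bisection, which suffices here because the ERLE of this toy model grows monotonically with β. All function names are hypothetical.

```python
import math

def suppression_gain(z_mag, y_hat_mag, alpha=1.0, beta=1.0):
    """Per-bin gain ((max(|Z|^a - beta*|Y_hat|^a, 0)) / |Z|^a)^(1/a)."""
    gains = []
    for z, y in zip(z_mag, y_hat_mag):
        if z <= 0.0:
            gains.append(0.0)             # empty bin: nothing to pass through
            continue
        numerator = max(z ** alpha - beta * y ** alpha, 0.0)
        gains.append((numerator / z ** alpha) ** (1.0 / alpha))
    return gains

def erle_db(z_mag, gains):
    """ERLE of one echo-only frame after the gains have been applied."""
    p_in = sum(z * z for z in z_mag)
    p_out = sum((g * z) ** 2 for g, z in zip(gains, z_mag))
    return math.inf if p_out == 0.0 else 10.0 * math.log10(p_in / p_out)

def beta_for_erle(z_mag, y_hat_mag, alpha, target_db, lo=0.0, hi=1e4, iters=60):
    """Bisection for the smallest beta in [lo, hi] that reaches target_db."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        gains = suppression_gain(z_mag, y_hat_mag, alpha, mid)
        if erle_db(z_mag, gains) >= target_db:
            hi = mid
        else:
            lo = mid
    return hi

# One bin whose echo estimate captures half of the echo magnitude.
print(round(beta_for_erle([1.0], [0.5], alpha=1.0, target_db=20.0), 6))
```

With α = 1 the gain is 1 − 0.5β in this example, so 20 dB of attenuation (a gain of 0.1) is reached at β = 1.8. The values in Figure 4.8 come from the full system with real recordings and are therefore much larger.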

The simulation workload is quite high. Because for-loops run slowly in MATLAB, the NLMS is implemented as a mex-function, which is more than 10 times faster.

Simulation and listening tests show that the NLMS AES achieves a much higher ERLE than the NLMS AEC, at the cost of a higher computational load. During the DT period the echo is also inaudible, yet the NES is strongly affected, as seen in Figure 4.7. Some portions of the NES are attenuated more than others due to sharp changes of the suppression gain. This results in discontinuities in the residual signal during DT.

4.4.2 Coloration-Effect-Filter-Based AES

As introduced in section 2.3.5, the AES based on a simplified echo path magnitude spectrum has a much lower computational complexity.

Recalling the equations

$$G_v(i,k) = \frac{a_{12}(i,k)}{a_{22}(i,k)}$$

and

$$a_{12}(i,k) = \sigma\, E\{X_d^{*}(i,k)\, Z(i,k)\} + (1-\sigma)\, a_{12}(i,k-1)$$

$$a_{22}(i,k) = \sigma\, E\{X_d^{*}(i,k)\, X_d(i,k)\} + (1-\sigma)\, a_{22}(i,k-1)$$

here σ functions similarly to the learning rate in the NLMS algorithm, controlling the adaptation speed of the coloration-effect filter. If it is too large, the attenuation of both the echo and the NES is high, to the point that the NES is no longer audible. If it is too small, the initial convergence is too slow, as shown in Figure 4.9. The initial convergence time is defined as the time it takes for the echo to become totally silent.

The coloration-effect filter also suffers from the divergence problem when NES exists. Hence, a constant c is added to the denominator in the same way as in the NLMS algorithm:

$$G_v(i,k) = \frac{a_{12}(i,k)}{c + a_{22}(i,k)}$$

A large constant smooths the variation of the filter taps and slows down the convergence significantly, but brings less attenuation to the NES during DT, as illustrated in Figure 4.10. According to the simulation and listening results, a sigma of 0.05 and a safety constant of 0.01 are chosen to ensure relatively fast convergence and small attenuation of the NES.
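The recursion for one frequency bin can be sketched in Python as follows. The sketch assumes that k is the frame index, that the expectations are replaced by the instantaneous products of the current frame, and that the spectra are available as complex numbers. The function name and the toy values are illustrative only.

```python
def update_coloration_filter(a12, a22, x_d, z, sigma=0.05, c=0.01):
    """One recursive update of a12, a22 and the gain G_v for a single bin.

    a12, a22 : estimates from the previous frame
    x_d, z   : complex spectra of the delayed far-end and microphone signals
    sigma    : adaptation rate
    c        : safety constant added to the denominator
    """
    a12 = sigma * (x_d.conjugate() * z) + (1.0 - sigma) * a12
    a22 = sigma * (x_d.conjugate() * x_d).real + (1.0 - sigma) * a22
    g_v = a12 / (c + a22)
    return a12, a22, g_v

# Stationary toy input: the microphone bin is the far-end bin scaled by 0.5.
a12, a22 = 0.0, 0.0
x_d = complex(1.0, 1.0)
for _ in range(500):
    a12, a22, g_v = update_coloration_filter(a12, a22, x_d, 0.5 * x_d)
print(round(abs(g_v), 4))
```

With c = 0 the gain would settle at the true ratio of 0.5; the safety constant pulls it slightly below that value, which is the price paid for robustness during DT.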


Figure 4.9 Different convergence time corresponding to different sigma (α = 1, β = 30; panels: recorded echo, echo residual with sigma = 0.01 (slow convergence), echo residual with sigma = 0.05 (fast convergence), against sample index. Notice that the axis of the recorded echo has a larger scale)

Figure 4.10 The influence of the sigma to the ERLE and NEA with different safety constant values (α = 1, β = 30; curves for c = 0, 0.0001, 0.001 and 0.01)


Then the optimal values of α and β to gain 30 dB ERLE are computed again in the same way as for the NLMS AES, as shown in Figure 4.11. When α = 1, β = 27 is required.

Figure 4.11 Alpha Beta values to achieve 30dB ERLE (Sigma = 0.05, Safety constant = 0.01, smooth factor = 0.3; beta plotted against alpha, with a zoomed-in inset)

To draw a conclusion at this point, the performance of the AES is superior to that of the AEC. The AES gains a higher ERLE and is able to eliminate the echo completely. During DT, although the NES is affected as well, the background echo is no longer distinct. Simulation and listening tests show that the Coloration-effect filter AES achieves results similar to the NLMS AES with much less computational complexity. However, its convergence is slower than that of the NLMS AES.

4.5 DTD Performance Evaluation

All the algorithms discussed above suffer from the DT problem. The NES is attenuated to a greater or lesser degree by the different methods, especially by the AES, and many discontinuities occur during DT. Hence we introduce Double Talk Detectors into the AES. When DT is detected, the adaptation of the filter is frozen for a hold time and a lower β is adopted to reduce the NEA. A clear voice with some residual echo is preferable to not being able to hear the speaker properly. A longer hold time protects the NES better, yet leads to a long recovery time from DT to FE single talk, which may bring a boost of the echo. The hold time is normally chosen to be tens of milliseconds; a hold time of 32 ms is used throughout this thesis.

As stated in section 2.4.5, the performance of a DTD is evaluated by the probability of false alarm (Pf) during FE single talk and the probability of miss (Pm) during DT, as defined in Table 4.1. Based on the computed Pf and Pm, a plot referred to as the Receiver Operating Characteristic (ROC) curve is adopted as a criterion for comparing different algorithms. In the ROC curve, the probability of false alarm is plotted against the probability of miss as the threshold is varied. The curve tells us which threshold achieves a given DTD performance in terms of Pf and Pm, and vice versa. For example, we can find the threshold corresponding to a Pf of 0.1. A typical ROC curve is illustrated in Figure 4.12. It shows the trade-off between correct and faulty detections. The smaller the area enclosed by the ROC curve, meaning that low Pf and low Pm can be attained at the same time, the better the DTD performance.

               FE single talk     DT
DTD = 0        Correct            Miss
DTD = 1        False alarm        Correct

Table 4.1 Definition of False alarm and Miss

Figure 4.12 Typical ROC curve illustration
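The definitions in Table 4.1 translate directly into a counting procedure. The Python sketch below assumes that a ground-truth label (FE single talk or DT) is available for every sample, which is the case for the segmented test recording, and that the DTD declares double talk when its statistic exceeds the threshold; for a detector with the opposite convention the comparison is reversed. The function names are illustrative.

```python
def false_alarm_and_miss(decisions, is_double_talk):
    """Return (Pf, Pm) from per-sample DTD decisions and ground truth.

    decisions[t]      : True if the DTD declared double talk at sample t
    is_double_talk[t] : True during DT, False during FE single talk
    Both classes must be present, otherwise a division by zero occurs.
    """
    pairs = list(zip(decisions, is_double_talk))
    n_dt = sum(1 for _, dt in pairs if dt)
    n_st = len(pairs) - n_dt
    false_alarms = sum(1 for d, dt in pairs if d and not dt)
    misses = sum(1 for d, dt in pairs if not d and dt)
    return false_alarms / n_st, misses / n_dt

def roc_points(statistic, is_double_talk, thresholds):
    """Sweep the threshold and collect (threshold, Pf, Pm) triples."""
    points = []
    for thr in thresholds:
        decisions = [s > thr for s in statistic]
        pf, pm = false_alarm_and_miss(decisions, is_double_talk)
        points.append((thr, pf, pm))
    return points

statistic = [0.1, 0.2, 0.6, 0.9, 0.4, 0.8]
truth = [False, False, False, True, True, True]
for point in roc_points(statistic, truth, [0.3, 0.5, 0.7]):
    print(point)
```

Plotting Pm against Pf over a fine grid of thresholds gives the ROC curve, and the threshold for a required Pf can be read from the same list.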


4.5.1 Geigel DTD

We first implemented the simple Geigel scheme. The most important parameter for a DTD is the threshold. The threshold shown in Figure 4.13 classifies the whole FE single talk segment correctly, in order to have a uniform attenuation. Because different suppression factors are used for FE single talk and double talk in the AES, a false alarm during the FE single talk segment results in a sudden boost of the residual echo, which sounds annoying. Hence, a low probability of false alarm is required. However, part of the DT situations are then left undetected.

Figure 4.13 The detection statistic for Geigel algorithm for the nominal recording with external microphone and loudspeakers (decision variable in dB against sample index, with a possible threshold marked)

A higher threshold reduces the probability of false alarm at the price of a higher probability of miss during DT; a lower threshold yields more correct DT detections, yet may cause many false detections, which reduce the ERLE. The ROC curve of the Geigel algorithm is shown in Figure 4.14.

As stated in section 2.4.2, the Geigel DTD operates by comparing the magnitude of the received signal with that of the far-end signal. Recalling the decision rule

$$\frac{|z(t)|}{\max\{|x(t)|,\ |x(t-1)|,\ \ldots,\ |x(t-N+1)|\}} \begin{cases} > T & \text{Double talk present} \\ \le T & \text{Double talk not present} \end{cases}$$

it can be seen that it works well when the strength of the NES is much higher than that of the FES, namely when the SER is large, as illustrated in Figure 4.15. The SER is changed by adjusting the NES while keeping the nominal FES. The probability of false alarm is then almost constant, because the strength of the echo stays the same. As discussed before, the probability of false alarm should be as low as possible, so the threshold is chosen to attain zero probability of false alarm. When the SER is low, the chance of missing DT becomes higher, because most of the decision variable lies below the threshold due to the small NES.

Overall, the Geigel algorithm only works well when the echo path attenuates the FES and the NES is sufficiently strong. Hence, Geigel is not a strong candidate in reality, where both the echo path and the NES are unknown.
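A minimal Python sketch of the Geigel decision is given below. It compares the magnitude of the current microphone sample with the largest far-end magnitude over the last N samples. The comparison is written as a product rather than a ratio so that a silent far-end window does not cause a division by zero; this is a choice of this sketch, as are the function and argument names.

```python
def geigel_dtd(far_end, mic, threshold, window=128):
    """Per-sample Geigel decisions: True where double talk is declared."""
    decisions = []
    for t in range(len(mic)):
        start = max(0, t - window + 1)
        x_max = max(abs(v) for v in far_end[start:t + 1])
        # Equivalent to |z(t)| / x_max > T whenever x_max is positive.
        decisions.append(abs(mic[t]) > threshold * x_max)
    return decisions

far_end = [1.0, 0.5, 0.2, 0.1]
mic = [0.3, 0.2, 0.9, 0.05]
print(geigel_dtd(far_end, mic, threshold=0.5, window=2))
```

In the toy data only the third microphone sample is large compared with the recent far-end samples, so only that sample is flagged. A weak near-end talker would stay below the threshold, which is the miss behaviour discussed above.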

Figure 4.14 ROC curve of Geigel Algorithm under nominal SER (probability of miss against probability of false alarm)


Figure 4.15 Probability of miss decreases as the SER is increased by enlarging the amplitude of near-end speech (Pf = 0; probability of miss and probability of false alarm against SER in dB)

4.5.2 NCR DTD

The cheap-NCR algorithm is normally adopted for its efficient calculation. Recalling from section 2.4.3, the detection statistic of cheap NCR and its decision rule are

$$\frac{\sigma_{\hat{y}}^{2}(t)}{\sigma_{z}^{2}(t)} \begin{cases} < T & \text{Double Talk present} \\ \ge T & \text{Double Talk not present} \end{cases}$$

which is the ratio between the power of the estimated echo and the power of the microphone signal. Since it needs the time-domain echo estimate from the adaptive filter, it is easier to apply to the NLMS algorithm than to the simplified echo path. The convergence speed of the NLMS filter also needs to be lower, to slow down the false adaptation during DT and so obtain the right detection statistic. As shown in Figure 4.16, a smaller learning rate brings better results.
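The statistic can be sketched in Python as a ratio of short-term powers. The sketch assumes that both powers are estimated as sums of squares over a sliding rectangular window of the same length, and it returns 1.0 when the microphone window is silent; both choices are assumptions of this example rather than details taken from section 2.4.3.

```python
def cheap_ncr_statistic(echo_estimate, mic, window=128):
    """Ratio of estimated-echo power to microphone power per sample."""
    stats = []
    for t in range(len(mic)):
        start = max(0, t - window + 1)
        p_y = sum(v * v for v in echo_estimate[start:t + 1])
        p_z = sum(v * v for v in mic[start:t + 1])
        stats.append(p_y / p_z if p_z > 0.0 else 1.0)
    return stats

def cheap_ncr_dtd(echo_estimate, mic, threshold, window=128):
    """Double talk is declared where the statistic falls below the threshold."""
    return [s < threshold for s in cheap_ncr_statistic(echo_estimate, mic, window)]

# The echo estimate matches the microphone for two samples, then near-end
# speech raises the microphone level while the echo estimate stays the same.
echo_estimate = [0.1, 0.1, 0.1, 0.1]
mic = [0.1, 0.1, 0.3, 0.3]
print(cheap_ncr_dtd(echo_estimate, mic, threshold=0.5, window=2))
```

While the filter models the echo well and nobody speaks at the near end, the ratio stays close to one; near-end speech adds power to the denominator only, so the ratio drops and double talk is declared.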

The ROC curve is drawn as in Figure 4.16 to look for the threshold that achieves a low probability of false alarm. The performance of the NCR also depends on the window size. The larger the window size, the more precise the calculation of the power and the better the prediction, which is shown in Figure 4.17. The ROC with a window size of 800 improves the probability of miss by 10% over the one with a window size of 128, at the same probability of false alarm. However, a larger window size means longer computation time and a larger delay before output is produced, while real-time communication demands a low processing delay.
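To put the two window sizes into perspective, their duration at the 8 kHz sampling rate used in our simulations can be computed as in the short Python sketch below. The duration of the analysis window is only a rough indication of the delay; the actual processing delay depends on how the window is buffered. The function name is illustrative.

```python
def window_duration_ms(num_samples, sample_rate_hz=8000):
    """Length of an analysis window in milliseconds."""
    return 1000.0 * num_samples / sample_rate_hz

for size in (128, 800):
    print(size, "samples =", window_duration_ms(size), "ms")
```

A window of 128 samples spans 16 ms, whereas 800 samples span 100 ms, which is a considerable amount for a conversational system.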

Next, the variation of the probabilities against the SER is evaluated, as illustrated in Figure 4.18. The threshold is chosen to attain zero probability of false alarm and is fixed for all SER values. The probability of miss increases as the near-end signal gets weaker, because the detection statistic is then larger during double talk, so that fewer DT events are detected. Compared to the Geigel algorithm in Figure 4.15, the variation of the probability of miss against the SER is smaller for the cheap-NCR DTD, with a similar probability of false alarm and the same window size. In all, the NCR DTD is a more reliable method than the Geigel DTD, yet it requires slower adaptation and a longer processing delay to obtain better detection.

Figure 4.16 ROC curves of cheap-NCR DTD for NLMS AES with different Learning rate under nominal SER (learning rates 1.0, 0.2 and 0.1; probability of miss against probability of false alarm)


Figure 4.17 ROC curve of cheap-NCR DTD with different window size under nominal SER (learning rate = 0.1; window sizes 128 and 800; probability of miss against probability of false alarm)

Figure 4.18 Probability of miss increases as the SER is decreased by reducing the amplitude of near-end speech (learning rate = 0.1, window size = 128; probability of miss and probability of false alarm against SER in dB)


4.5.3 VIRE DTD

The VIRE DTD uses the maximum value of the adaptive filter as a measure of the fluctuations, as presented in section 2.4.4. The variations of the adaptive filter taps are high for both the NLMS filter and the coloration-effect filter, so the method can be applied to both algorithms. The faster the filter adapts, the larger the variations of the filter coefficients. A large learning rate makes the VIRE DTD work extremely well, as observed in Figure 4.19 and Figure 4.20. Although a high learning rate brings a large NEA, as shown in Figure 4.5, the DTD now protects the NES from being over-attenuated, so a large learning rate can be adopted to ensure good DTD performance. Hence, one advantage of the VIRE DTD is its inherently fast adaptation.

The ROC curves in Figures 4.19 and 4.20 also show that the VIRE DTD can achieve a very low probability of false alarm and probability of miss at the same time for both the NLMS and the coloration-effect filter AES, especially for the latter. The detection statistic of the VIRE algorithm for the coloration-effect filter AES, shown in Figure 4.21, makes it possible to draw a threshold that separates single talk and double talk completely, unlike the Geigel and NCR algorithms. The detection statistic during double talk is much larger than during the single-talk period, which leads to the excellent performance of the VIRE DTD.

Figure 4.19 ROC curves of VIRE DTD for NLMS AES with different Learning rate under nominal SER (window size = 128; learning rates 1.0, 0.2 and 0.1; probability of miss against probability of false alarm)


Figure 4.20 ROC curves of VIRE DTD for the Coloration-effect filter AES with different Learning rate under nominal SER (window size = 128; sigma = 0.05 and 0.01; probability of miss against probability of false alarm)

Figure 4.21 Detection decision obtained using the VIRE DTD for the Coloration-effect filter AES under nominal SER (detection statistic in dB against sample index over the FE single talk, double talk and NE single talk segments, with a possible threshold marked)

After tuning the SER by adjusting the power of the NES, similar results are obtained for the NLMS AES and the Coloration-effect filter AES. Although the performance of the VIRE DTD is excellent under the nominal SER, it varies dramatically with the SER, as we can see in Figure 4.22. When the echo is much larger than the near-end speech, the near-end signal is weak during DT. After the NLMS filter has converged during FE single talk, the filter weights are not influenced much by the NES during DT, so the variation of the filter coefficients is not large enough for the situation to be detected as DT, which leads to a high probability of miss. The VIRE DTD behaves in a similar manner in the Coloration-effect filter AES. The DTD performance is quite stable within a certain SER range, as illustrated in Figure 4.23. A lower threshold covers a larger SER operating range, because it allows a weaker NES during DT to still be separated from the FE single talk, as implied in Figure 4.21.

In conclusion, the VIRE DTD achieves the best detection performance among all the algorithms. In particular, the VIRE DTD embedded in the Coloration-effect filter AES provides the most promising results. Moreover, compared to the cheap-NCR algorithm, the VIRE DTD has the advantage of fast adaptation and convergence. However, the VIRE DTD is quite sensitive to the strength of the NES: it is hard to detect DT when the NES is too weak. Hence this method has certain limitations.

Figure 4.22 Probability of miss of the VIRE DTD in the NLMS AES varies dramatically with near-end speech (Pf = 0; probability of miss and probability of false alarm against SER in dB)


Figure 4.23 Performance of the VIRE DTD in the Coloration-effect filter AES (probability of miss against SER in dB for T = 0.03, 0.05 and 0.1; the probability of false alarm is zero for all three thresholds)

4.6 AES with Different DTD Algorithms

In this section, the improvement that each DTD algorithm brings to the AES is evaluated and compared. In an AES without a DTD, the same suppression ratio is used everywhere, which results in a large attenuation of the NES during DT and makes conversation during DT rather difficult. Hence, the DTD is adopted to detect the DT situation. When DT is detected, the filter adaptation is stopped and the suppression ratio is lowered in order to limit the NEA. However, a certain amount of echo attenuation is still required during the DT period. Hence, a different β is used to obtain 20 dB ERLE during the DT period, which is 10 dB lower than that during FE single talk.
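The control logic described above can be sketched in Python as follows. The sketch assumes that the hold time is counted in samples (32 ms at 8 kHz corresponds to 256 samples) and restarts at every new detection. The value of β used during DT is a hypothetical placeholder, since only the single talk value of 27 is stated in the text, and the function names are illustrative.

```python
def apply_hold(raw_decisions, hold_samples=256):
    """Extend every DTD detection by hold_samples further samples."""
    held, remaining = [], 0
    for detected in raw_decisions:
        if detected:
            remaining = hold_samples      # restart the hold timer
            held.append(True)
        else:
            held.append(remaining > 0)
            remaining = max(0, remaining - 1)
    return held

def control_signals(held, beta_single_talk=27.0, beta_double_talk=5.0):
    """Per sample: (adapt the filter?, suppression ratio beta to use).

    beta_double_talk = 5.0 is a placeholder, not a value from the thesis.
    """
    return [(not h, beta_double_talk if h else beta_single_talk) for h in held]

raw = [True, False, False, False, False]
held = apply_hold(raw, hold_samples=2)
print(held)
print(control_signals(held)[:4])
```

With a hold of two samples the single detection keeps the adaptation frozen for the detection sample and the two samples that follow, after which the single talk settings are restored.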

The performance of each method under the nominal SER is compared first, as shown in Figure 4.24. With the same 30 dB ERLE achieved during the FE single talk segment, it can be observed that all the DTD algorithms help to reduce the NEA during DT. Overall, the Coloration-effect filter AES results in less NEA than the NLMS AES. The results also show that the Coloration-effect filter AES with VIRE DTD performs best. This is expected because of the excellent detection capability of the VIRE DTD for the Coloration-effect filter AES. In auditory tests, the NES during DT processed by the Coloration-effect Filter AES with VIRE DTD sounds most natural, with the least attenuation and without discontinuities.

In practice, the near-end speech is an unknown variable, as is the far-end speech. The effects of the NES and the FES are evaluated separately in section 4.6.1. As discussed in section 3.3, various kinds of noise exist in hands-free communication. Some are stationary background noise, while others are abrupt sounds. The impact of stationary noise is easier to test than that of unpredictable sudden noise, and it is discussed in section 4.6.2.


Figure 4.24 Comparison of different AES methods under nominal SER (panels: original microphone signal z(t); NLMS AES without DTD, with cheap-NCR DTD, with Geigel DTD and with VIRE DTD; Coloration-effect Filter AES without DTD, with Geigel DTD and with VIRE DTD; all against sample index)


4.6.1 The Influence of the NES and the FES

The performance of each DTD algorithm against a varying NES has been studied in section 4.5. The ERLE is affected by the Pf in the FE single talk segment, while the NEA is determined by the Pm during DT. Figure 4.25 shows how the power of the NES influences the ERLE and NEA. The FES is kept the same, so that the detection statistics of each DTD during the FE single talk section are stable over the whole SER range. The threshold of each DTD is chosen to achieve zero Pf, to avoid echo boost. In this way, all methods use the same suppression ratio during FE single talk, so that a similar ERLE is achieved and maintained over the whole SER range. As observed before, the Pm increases as the NES diminishes, and therefore the NEA also rises. The results in Figure 4.25 confirm the analysis in section 4.5.

Now the nominal NES is kept the same and the FES is varied. The result is illustrated in Figure 4.26. The louder the FES, namely the larger the echo, the larger the ERLE that is achieved. However, it can be observed that the echo attenuation drops for the VIRE and Geigel algorithms when the echo is too large. Due to the amplification effect of the volume adjustment, the magnitude of the echo (z(t)) may be comparable to or even larger than that of the FES (x(t)). The performance of the Geigel algorithm and of the VIRE DTD for the Coloration-effect filter AES depends on the ratio between the microphone signal (z(t)) and the speaker signal (x(t)), and the Pf increases as z(t) grows. For the VIRE DTD in the NLMS AES, the variation of the filter taps during FE single talk becomes larger as the microphone signal increases, which also increases the Pf. As a result, the filter adaptation is slowed down and a low suppression ratio is used with the Geigel and VIRE DTDs. Hence, a significant amount of echo is left after cancellation. The NCR algorithm does not suffer from this problem, because it estimates the echo path based on both the FES and the echo signal, so that the ratio between the estimated echo and the actual echo is relatively stable, which means a stable DTD performance.


Figure 4.25 Performance variation against SER with varying NES (ERLE and NEA in dB against SER in dB for: NLMS AES; NLMS AES with NCR, Geigel and VIRE DTD; CF AES; CF AES with Geigel and VIRE DTD)

Figure 4.26 Performance variation against SER with varying FES (ERLE and NEA in dB against SER in dB for: NLMS AES; NLMS AES with NCR, Geigel and VIRE DTD; CF AES; CF AES with Geigel and VIRE DTD)


Figure 4.27 DTD performance variation caused by varying FES (probability of miss and probability of false alarm against SER in dB for the NCR DTD, Geigel DTD, VIRE DTD for CF and VIRE DTD for NLMS)

4.6.2 The Noise Performance

To examine the influence of the strength of stationary noise, white noise with adjusted power is added to the nominal recording. As we can see in Figure 4.28, the resulting ERLE drops as the noise power increases. As follows from the analysis above, the performance of the DTD determines how the AES behaves overall. The influence of the stationary noise on the DTD is shown in Figure 4.29. The decline of the ERLE is caused on the one hand by the increase of the Pf, and on the other hand by the larger noise contribution to the residual signal.

The NEA is almost constant because the variation of Pm is small. The Pf of the VIRE DTD for the Coloration-effect Filter AES barely changes, which guarantees a steady attenuation of the echo during FE single talk regardless of the noise strength. The Pm rises as the noise gets stronger, so that some discontinuities occur, as with the NCR and Geigel methods.
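The two DTD measures used throughout this comparison can be computed from per-sample decisions as in the following Python sketch. It assumes a simple counting definition: Pf is the fraction of samples without double talk that the detector flags, and Pm is the fraction of double-talk samples that it fails to flag. The function name and the example labels are hypothetical, and the ground-truth labels are assumed to cover only periods in which the far end is active.

```python
def dtd_error_rates(decisions, double_talk):
    """Return (Pf, Pm) from detector decisions and ground-truth double-talk labels.

    decisions[i]   -- True if the detector declared double talk at sample i
    double_talk[i] -- True if double talk was really present at sample i
    """
    false_alarms = sum(1 for d, t in zip(decisions, double_talk) if d and not t)
    misses = sum(1 for d, t in zip(decisions, double_talk) if t and not d)
    n_double_talk = sum(1 for t in double_talk if t)
    n_single_talk = len(double_talk) - n_double_talk
    # Guard against empty classes so the sketch never divides by zero.
    pf = false_alarms / n_single_talk if n_single_talk else 0.0
    pm = misses / n_double_talk if n_double_talk else 0.0
    return pf, pm

# Hypothetical labels: four far-end single-talk samples followed by four double-talk samples.
truth = [False, False, False, False, True, True, True, True]
flags = [False, True, False, False, True, True, False, True]
print(dtd_error_rates(flags, truth))  # (0.25, 0.25)
```

A high Pf freezes adaptation or lowers suppression during far-end single talk, which reduces the ERLE, while a high Pm lets the suppressor act on near-end speech, which shows up as attenuation and discontinuities.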

[Figure: NEA (dB) and ERLE (dB) versus SNR (dB). Curves: NLMS AES, NLMS AES with NCR DTD, NLMS AES with Geigel DTD, NLMS AES with VIRE DTD, CF AES, CF AES with Geigel DTD, CF AES with VIRE DTD]

Figure 4.28 Noise Performance under nominal SER

[Figure: Probability of false alarm and probability of miss versus SNR (dB). Curves: NCR DTD, Geigel DTD, VIRE DTD for CF, VIRE DTD for NLMS]

Figure 4.29 Comparison of DTD noise performance


CHAPTER V CONCLUSION & FURTHER WORK

5.1 Summary and Conclusion

Nowadays, conventional and hands-free telephones play an increasingly important role in meeting people's communication needs. One of the major problems in telecommunication over a telephone system is echo. This thesis is devoted to finding a solution for acoustic echo cancellation during hands-free conversation on a PC.

The basic echo canceller based on the well-known NLMS algorithm is studied first. The resulting ERLE of the basic AEC is so low that the residual echo is still audible. Therefore the acoustic echo suppressor is introduced, which is able to eliminate the echo completely. The NLMS algorithm is first adopted to calculate the magnitude spectrum of the echo signal in the AES, but its computational complexity is high. Then a recently proposed algorithm, which uses a coloration-effect filter to estimate the magnitude spectrum of the main portion of the acoustic path, is studied and modified to ease the calculation. The disadvantage of the Coloration-effect Filter AES is its slower adaptation compared to the NLMS AES.

Both AES algorithms are capable of making the echo inaudible during far-end single talk, but both suffer from near-end attenuation and discontinuity problems during double talk. Hence, Double Talk Detection algorithms are investigated, including the Geigel DTD, the (cheap) NCR DTD and the VIRE DTD. Each DTD algorithm is analyzed and evaluated individually based on two parameters: Pf and Pm. After that, the DTD algorithms are integrated into each AES method and compared. From the simulations and auditory tests, it is found that the Coloration-effect Filter AES with VIRE DTD gives the best result, with the least attenuation and without discontinuities. However, this conclusion only holds when the near-end signal picked up by the microphone is strong enough compared to the far-end speech. Also, the performance degrades only when the noise becomes stronger than a certain level.
In all, the work presented in this thesis is a successful attempt to find a software solution to the problem of echoes in the telecommunications environment. Furthermore, considerable effort has been devoted to the ways of tuning the parameters, to a general framework for evaluating and comparing different algorithms, and to the interpretation of the results. The proposed algorithm is a purely software approach that does not use any DSP hardware components. It can run on any PC with MATLAB installed. In addition, the results obtained were convincing. The audio quality of the output speech signals was highly satisfactory and validated the goals of this research.

5.2 Possible Further Work

The algorithm was tested entirely off-line. The test speech was recorded beforehand and used as input to the algorithm, and the output was examined after the simulation. Therefore, a real-time implementation for testing purposes would be the most interesting future work.

A high background noise level is annoying to the listener during a conversation and affects the performance of the algorithm. However, background noise is a natural part of a conversation and conveys the surrounding environment of the person we talk to. Hence, a noise suppression algorithm is needed to reduce the background noise to a comfortable level. Moreover, ways of handling music noise, which is trickier to solve, could also be studied in the future.

In practice, the echo may still be noticeable because of large variations in the echo path characteristics. Therefore, the response of the algorithm to echo path changes should be investigated and evaluated further.


LIST OF ACRONYMS

AEC: Acoustic Echo Canceller
AES: Acoustic Echo Suppresser
DT: Double Talk
DTD: Double Talk Detection (Detector)
FES: Far End Speech
LMS: Least Mean Square
NCR: Normalized Cross-correlation
NES: Near End Speech
NLMS: Normalized Least Mean Square
Pf: Probability of false alarm
Pm: Probability of miss
SER: Signal To Echo Ratio
SNR: Signal To Noise Ratio
VIRE: Variance Impulse Response


LIST OF FIGURES

Figure 1.1: Hybrid Connections and the Resulting Electric Echo
Figure 1.2: Basic setup of a hands-free communication system
Figure 1.3: Generation of acoustic echo through direct coupling and reverberations
Figure 1.4: General schematic of Acoustic Echo Cancellation
Figure 1.5: Composition of the signals used to calculate the NEA
Figure 2.1: Noise suppression with spectral subtraction
Figure 2.2: Smoothing of a step function
Figure 2.3: Error caused by Hann-Windowing FFT
Figure 2.4: Error caused by Hamming-Windowing FFT
Figure 2.5: Error caused by Kaiser-Windowing FFT with different beta
Figure 2.6: Gradient of the Error function
Figure 2.7: plot
Figure 2.8: Typical room impulse response
Figure 3.1: Room Impulse Response in the Scream room in NXP Leuven
Figure 3.2: Common Laptop Noise Sources
Figure 3.3: Typical look of a typing sound on the keyboard
Figure 3.4: Typing noise cancellation
Figure 4.1: Speech stimuli segmentation
Figure 4.2: How ERLE changes with the safety constant
Figure 4.3: The effect of Learning rate on ERLE (safety constant = 0.1)
Figure 4.4: Echo Cancellation Result of NLMS AEC under nominal SER
Figure 4.5: The effect of learning rate on the ERLE and the NEA
Figure 4.6: The effect of the smooth factor on the ERLE and the NEA
Figure 4.7: The effect of smooth factor on the NES during DT
Figure 4.8: Alpha Beta values to achieve 30dB ERLE
Figure 4.9: Different convergence time corresponding to different sigma
Figure 4.10: The influence of the sigma to the ERLE and NEA with different safety constant values
Figure 4.11: Alpha Beta values to achieve 30dB ERLE
Figure 4.12: Typical ROC curve illustration
Figure 4.13: The detection statistic for Geigel algorithm for the nominal recording with external microphone and loudspeakers
Figure 4.14: ROC curve of Geigel Algorithm under nominal SER
Figure 4.15: Probability of miss decreases as the SER is increased by enlarging the amplitude of near-end speech (Pf = 0)
Figure 4.16: ROC curves of cheap-NCR DTD for NLMS AES with different Learning rate under nominal SER
Figure 4.17: ROC curve of cheap-NCR DTD with different window size under nominal SER
Figure 4.18: Probability of miss increases as the SER is decreased by reducing the amplitude of near-end speech
Figure 4.19: ROC curves of VIRE DTD for NLMS AES with different Learning rate under nominal SER
Figure 4.20: ROC curves of VIRE DTD for the Coloration-effect filter AES with different Learning rate under nominal SER
Figure 4.21: Detection decision obtained using the VIRE DTD for the Coloration-effect filter AES under nominal SER
Figure 4.22: Probability of miss of the VIRE DTD in the NLMS AES varies dramatically with near-end speech
Figure 4.23: Performance of the VIRE DTD in the Coloration-effect filter AES
Figure 4.24: Comparison of different AES methods under nominal SER
Figure 4.25: Performance variation against SER with varying NES
Figure 4.26: Performance variation against SER with varying FES
Figure 4.27: DTD performance variation caused by varying FES
Figure 4.28: Noise Performance under nominal SER
Figure 4.29: Comparison of DTD noise performance



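Geigel DTD

Form detection statistic:

$$\xi = \frac{|z(t)|}{\max\left(|x(t)|, |x(t-1)|, \ldots, |x(t-N+1)|\right)}$$

Comparison with T:

$\xi > T$: Double talk present
$\xi \le T$: Double talk not present

Cheap NCR DTD

Form detection statistic:

$$\xi = \frac{\sigma_y^2(t)}{\sigma_z^2(t)}$$

Comparison with T:

$\xi < T$: Double talk present
$\xi \ge T$: Double talk not present

VIRE DTD

Form detection statistic:

$$\xi(n) = \lambda\,\xi(n-1) + (1-\lambda)\left[\gamma(n) - \mu(n)\right]^2$$
$$\mu(n) = \lambda\,\mu(n-1) + (1-\lambda)\,\gamma(n)$$
$$\gamma(n) = \max\left(h(0), h(1), \ldots, h(k-1)\right)$$

Comparison with T:

$\xi > T$: Double talk present
$\xi \le T$: Double talk not present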
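The Geigel decision rule can be sketched in a few lines of Python. The sketch assumes the statistic is the microphone magnitude |z(t)| divided by the largest far-end magnitude over the last N samples, with double talk declared when the statistic exceeds T. The function name, the window length, the threshold value 0.5 and the toy signals are hypothetical illustration values, not the settings used in this thesis.

```python
def geigel_dtd(x, z, window, threshold):
    """Per-sample Geigel decisions: True where double talk is declared.

    x -- far-end (loudspeaker) samples
    z -- microphone samples
    """
    decisions = []
    for t in range(len(z)):
        start = max(0, t - window + 1)
        far_peak = max(abs(v) for v in x[start:t + 1])
        # |z(t)| / far_peak > T, written without the division so that a
        # silent far-end window does not cause a division by zero.
        decisions.append(abs(z[t]) > threshold * far_peak)
    return decisions

# Far-end single talk: the microphone picks up only a delayed, scaled copy of x.
x = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
quiet_echo = [0.0] + [0.4 * v for v in x[:-1]]  # echo gain below the threshold
loud_echo = [0.0] + [0.8 * v for v in x[:-1]]   # echo gain above the threshold

print(any(geigel_dtd(x, quiet_echo, window=4, threshold=0.5)))     # False
print(all(geigel_dtd(x, loud_echo, window=4, threshold=0.5)[1:]))  # True
```

The second call shows the weakness discussed in Section 4.6: when the echo is amplified until its magnitude is comparable to the far-end signal, the statistic crosses the threshold during far-end single talk and every such sample counts as a false alarm.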
