Extracting Room Reverberation Time From Speech Using Artificial Neural Networks
A novel method to extract the reverberation time from reverberated speech utterances is
presented. In this study, speech utterances are restricted to pronounced digits; uncontrolled
discourse is not considered. The reverberation times considered are wide band, within the
frequency range of speech utterances. A multilayer feedforward neural network is trained
on speech examples with known reverberation times generated by a room simulator. The
speech signals are preprocessed by calculating short-term rms values. A second decision-
based neural network is added to improve the reliability of the predictions. In the retrieve
phase, the trained neural networks extract room reverberation times from speech signals
picked up in the rooms to an accuracy of 0.1 s. This provides an alternative to traditional
measurement methods and facilitates the occupied measurement of room reverberation times.
The fact that acousticians can make rather precise judgments concerning room reverberation times through listening to speech or music in a room suggests that the reverberation time might be obtained using cognitive models, provided that both the anechoic speech signals and the impulse responses of the room are statistically obtainable by means of machine learning through presenting a large number of training samples.

Inspired by the study of human brains, artificial neural networks (ANNs), or neural networks for short, are networks of a large number of primitive neurons. Neural networks have the capability of being trained to store and retrieve information and generalize problems from examples by altering internal connection weights. It has been proven that such networks can be trained to map complicated nonlinear functions to an arbitrarily predefined precision, provided that adequate computing power is used [3]. Moreover, it has been demonstrated that some neural network models are powerful signal processors or filters and can extract useful information from noisy data. As a result, artificial neural networks have been used widely as artificial intelligence means to solve classification, approximation, feature extraction, generalization, and signal processing problems in various application areas [4]. In particular, nonlinear multilayer feedforward network architectures are the most popular neural network model used widely to solve various engineering problems. When certain short-term memory mechanisms such as tapped delay lines are used, they are also powerful in handling temporal signals such as speech [5], [6]. Bringing these thoughts together, this paper proposes using artificial neural networks to extract reverberation times from speech utterances. This novel method utilizes "natural" test signals, that is, speech, and hence facilitates occupied measurement of reverberation times. In addition, examining whether artificial neural networks are capable of modeling cognitive behavior relating to room characteristics is of considerable academic interest.

1 OVERVIEW

… work. The reverberation times estimated by the neural network and the corresponding true reverberation times (teacher) are compared to obtain the errors. The training process consists in updating the internal synaptic weights of the neural network iteratively so that the errors between the true and the estimated reverberation times over all the training examples are minimized. In the retrieve phase, speech utterances as received by a microphone in the space are sent to the trained neural network via the preprocessor. The trained neural network then gives the reverberation time, as shown in Fig. 2.

Fig. 1. Block diagram of training the neural network. Reverberated speech examples (with known RTs) are used as input vectors; known RTs are used as teachers.

2 REVERBERATED SPEECH UTTERANCES AND THEIR ENVELOPES

2.1 Reverberation Time, Impulse Response, and Perceived Sound Pressure in a Room

The transmission characteristics of a room can be described by its impulse response (Fig. 3). The reverberation time is referred to as the period of time it takes for the sound pressure level in the room to be attenuated by 60 dB from the moment when the excitation stops. Because a signal-to-noise ratio higher than 60 dB is difficult to obtain, it is a common practice to take the -5-dB to -35-dB segment of the decay and calculate the reverberation time accordingly. The reverberation time can be calculated from the impulse response using the well-known Schroeder backward integration method.

The sound m(t) perceived at a listening or measurement position can be calculated through the convolution of the source signal s(t) and the impulse response h(t) of the room,

    m(t) = s(t) * h(t)    (1)

where the room impulse response is assumed linear and time invariant. This convolution relationship indicates that information on the room impulse response is contained in the perceived sound signals. Research into speech recognition and enhancement suggests that
speech signals can be treated as stochastic processes, and it is possible to utilize machine learning or artificial intelligence to identify the statistical features of speech [7], [8]. Furthermore, reverberation time is, in fact, a statistical feature of the impulse response. Extraction of the reverberation time from speech can therefore be regarded as finding a particular statistical feature from reverberated speech, which comprises both original speech and room impulse response information. This can be achieved by machine learning, provided a large number of training samples are available and the neural network algorithm can generalize training samples sufficiently to learn these statistical features. Consequently, to extract reverberation times from speech, the neural network must learn statistical features of both the impulse responses and the speech.

2.2 Buildup and Decay of Sound Pressure and Reverberation Time

In reverberation time extraction, only a low-frequency subspace of the reverberated speech is of interest. Envelopes of speech utterances have been used successfully to assess reverberation and other effects on speech transmission quality and speech intelligibility [9], [10]. Short-term rms values of speech signals can be used to …

1) … each individual word has different buildup and decay phases of its own.
2) The impulse responses of most real rooms are not exactly exponential; two or more decay rates are common.
3) Natural speech utterances happen at random instants in time and last for different periods.
4) The energy in a room may not have decreased to an insignificantly low value when new excitations start.
5) Ambient and other noises caused by the measurement system and produced by the audience may be significant, particularly in occupied measurements.

Fig. 5 shows an anechoic and two reverberated versions of the speech utterances "one, two, three, four," read by a narrator, and their rms envelopes. From these signatures the following can be seen.
1) The slopes of the rise and fall edges are related to the reverberation time. The longer the reverberation time, the slower the rise and decay.
2) Different speech utterances (dry speech) may have slightly different rise and fall slopes.

Fig. 3. Example of simulated impulse response. Plot generated by the stochastic model described in Section 4.1.
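The Schroeder backward integration and the -5-dB to -35-dB evaluation range described above can be illustrated with a short sketch. This is not the authors' code; the function name and the test decay are invented for illustration.

```python
import math

def schroeder_rt(h, fs, hi_db=-5.0, lo_db=-35.0):
    """Estimate the reverberation time of impulse response h (samples
    at rate fs) by Schroeder backward integration: integrate the tail
    energy, fit the -5-dB to -35-dB segment of the decay curve, and
    extrapolate the slope to 60 dB of decay."""
    decay = [0.0] * len(h)
    tail = 0.0
    for n in range(len(h) - 1, -1, -1):   # backward integration
        tail += h[n] * h[n]
        decay[n] = tail
    db = []                               # decay curve in dB re total energy
    for d in decay:
        if d <= 0.0:
            break
        db.append(10.0 * math.log10(d / decay[0]))
    n_hi = next(n for n, v in enumerate(db) if v <= hi_db)
    n_lo = next(n for n, v in enumerate(db) if v <= lo_db)
    slope = (db[n_lo] - db[n_hi]) * fs / (n_lo - n_hi)  # dB per second
    return -60.0 / slope

# Check against an ideal exponential decay, h(t) = exp(-6.9 t / RT),
# the form later given as Eq. (11); the estimate recovers RT closely.
fs = 8000
rt_true = 1.2
h = [math.exp(-6.9 * (n / fs) / rt_true) for n in range(3 * fs)]
print(round(schroeder_rt(h, fs), 2))  # recovers approximately 1.2
```

For an exactly exponential decay the Schroeder curve is a straight line in dB, so the fitted segment hardly matters; for real rooms with multiple decay rates the chosen segment determines which part of the decay the estimate reflects.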
3) The slopes, though indicating reverberation times, are very noisy.

An estimation of the reverberation time through simple examination of the slopes of buildup and decay edges using a ruler is impossible in the case of real speech. Neural networks, however, are inherently capable of extracting features from noisy signals. The tasks for the neural networks are twofold. First, the neural networks are expected to generalize impulse responses so that, once trained on a closed set of examples, they can correctly identify the reverberation times of rooms having any possible impulse responses in the retrieve phase. Second, the neural networks are also expected to learn the short-term energy variation feature of speech utterances through examples and perform averaging and generalization, so that the trained neural networks have some robustness to different utterances, leading eventually to source-independent measurement.

Fig. 5. Signatures of anechoic and reverberated speech utterances and their rms envelopes (amplitudes normalized to their maximum values for visual comparison). (a) RT = 0. (b) RT = 1 s. (c) RT = 4 s. Much noisier edges are found in envelopes of reverberated speech.

3 NEURAL NETWORK ARCHITECTURE AND DATA PREPROCESSOR

3.1 Architecture of the Neural Network

The neural network architecture for extracting reverberation times is illustrated in Fig. 6. It is a multilayered feedforward network with two hidden nonlinear layers. The input layer is simply an interface between the external data and the internal nonlinear neurons. In this particular case, the input layer bridges and distributes the output from the preprocessor to the first nonlinear hidden layer without additional processing of data. In theory, the more nonlinear neurons are used, the more powerful the network is. However, research results indicate that in most cases a network with two nonlinear layers is a wise choice, in the sense that excessive nonlinear layers may degrade the learning speed and make training difficult due to the presence of too many local minima and the reduced speed of error propagation [4]. In this project, empirical comparison found suitable a network with 40 neurons on the input layer, 20 nonlinear neurons on the first hidden layer, 5 nonlinear neurons on the second hidden layer, and one linear summation output neuron. All the neurons in the nonlinear hidden layers have the same structure, as depicted in Fig. 7 (the ith neuron). Each of these neurons has two parts, a linear summation basis function u(.) and a nonlinear activation function f(.). The nonlinear activation function f(.) is a sigmoid:

    a_i = 1 / (1 + e^(-u_i))    (2)

The notation w_ij represents the connection weight from the jth neuron to the ith neuron. The output layer has only one node and is a linear combination processor, that is, a neuron without a nonlinear activation function. The dynamic equation of such a network can therefore be expressed as

    u_i(l) = sum_{j=1}^{N_{l-1}} w_ij(l) a_j(l - 1) + b_i(l)    (4)

    a_i(l) = f(u_i(l)),    1 <= i <= N_l,    1 <= l <= L    (5)

where l represents the layer number (the input layer is presented as layer 0), and the input to the neural network is uniformly expressed as a_i(0). The training process of such a network can be carried out using the so-called back-propagation method [12], [13]. Back-propagation is a gradient-based optimization method and reveals reliable convergence features, tolerable convergence speed, and good generalization capability.
The back-propagation method of training a neural network is to apply different input data (examples) continuously onto the input layer of the network and compare the output of the neural network with the value of the teacher (true value) to obtain the error. The connection weights inside the neural network are then adjusted so as to minimize the overall errors between the outputs of the neural network and the teachers over all the training examples. The particular rule of training the network is to update the connection weights according to a gradient-type learning formula,

    w_ij^(m+1)(l) = w_ij^(m)(l) + Δw_ij^(m)(l)    (6)

with the mth training sample a^(m)(0) and corresponding teacher t^(m) pairs, so as to minimize the energy function E defined by

    E = (1/2) Σ_m [t^(m) − out^(m)]²    (7)

The back-propagation training process follows a chain rule:

    Δw_ij^(m)(l) = −η ∂E/∂w_ij^(m)(l)
                 = −η [∂E/∂a_i^(m)(l)] [∂a_i^(m)(l)/∂w_ij^(m)(l)]
                 = η δ_i^(m)(l) f′(u_i^(m)(l)) a_j^(m)(l − 1)    (8)

where the error signal δ_i^(m)(l) is defined as

    δ_i^(m)(l) = −∂E/∂a_i^(m)(l)    (9)

and the scalar η is the learning rate.

For simplicity, the bias value b_i is included in the weight vector by setting its input to 1 and using an extra weight w_0 to adjust it as appropriate. Thus the connection weights and biases can be treated identically in training.

3.2 Input Data and Preprocessing

The data preprocessor is designed to perform four functions:
1) Normalization of the signal energy so that the signal presentation is independent of the input level.
2) Implementation of a short-term memory mechanism.
3) Detection of the short-term average rms value or envelope of speech.
4) Conversion of the input vector to a suitable format for the neural network.

Speech signals are temporal sequences. A multitapped delay line with short-term rms value detectors is used to detect the short-term average energy changes of the perceived sounds, as illustrated in Fig. 8. In this application 2000-point overlapped rms detectors, providing 10 rms values per second, are found adequate for speech signals sampled at 16 kHz. This is equivalent to monitoring fluctuations lower than 5 Hz in a speech envelope. For a better resolution in differentiating early reflections, more rms detectors may be considered. As the problem is to extract the rate of change, it is natural to consider a differential preprocessor between the rms detectors and the input layer. This can be written as …

Fig. 7. Nonlinear neuron model. A linear summation basis function u(.) and a sigmoid activation function f(.) are used in a nonlinear neuron in the hidden layers.

Fig. 8. Block diagram of the preprocessor. A tapped delay line followed by rms detectors is used to preprocess received speech signals. The preprocessor output is fed into the ANN.
…ate band-pass filters may be applied to the speech signal and a related narrow-band reverberation time used as the teacher in the training phase. However, the limitation on extracting the reverberation time from speech is that only reverberation parameters within the frequency range of speech utterances can be extracted.

3.3 Capability of the Neural Network

Neural networks can be used as classifiers or approximators. Neural network-based classifiers have found wide application in pattern recognition. In these applications, a common approach is to use one decision-based subnet for each particular pattern, known as the one-class-in-one-network (OCON) model [4]. If reverberation time extraction is treated as a kind of classification problem (that is, a different subnet for a particular reverberation time), a large number of subnets will be needed. Another disadvantage is that the network will not be able to output continuous values of reverberation time. Therefore an approximation-based network and a corresponding training strategy are favored for the reverberation time extraction problem.

To identify the capability of the proposed neural network model, preliminary tests are carried out prior to training the network tediously on real speech signals and impulse responses. Extracting the rate of change from randomly positioned exponential segments is first studied. The training and test examples are simply generated by the convolution of pulse trains and exponential decay functions, as illustrated in Fig. 9. This is in analogy to the idealized noise-free situation described in Section 2.1. Equivalent reverberation times according to the equation [14]

    h(t) = exp(−6.9 t / RT)    (11)

are used as sample-teacher pairs to train the neural network.

Various comparisons are made. Investigation showed that such a network architecture is capable of extracting exponential rates of change. However, the neural model appears to be sensitive to the positions of the input pulses and not completely tolerant of misalignment. This finding is similar to the misalignment-sensitive problem identified in electrocardiogram recognition, as reported in [15], [16]. As a solution to this problem, an alignment algorithm was used in the application of ECG recognition [4]. Similarly, a speech utterance alignment algorithm that is used widely in speech recognition can be adopted in the application of reverberation time extraction. As an alternative to the alignment algorithm, an additional decision-based network is proposed here as a "watchdog." The watchdog network has an architecture similar to the main network, but a reduced network size and a modified output neuron. It has 40, 10, 5, and 1 neurons on the input, first nonlinear, second nonlinear, and output layers, respectively. The output neuron is nonlinear, having only two output states, 0 and 1.

Training of the additional neural network watchdog consists of two steps. The first step is to train the main network to be partially misalignment tolerant. Examples with certain amounts of position and width changes are included in the training set. Pulse widths and spaces between two pulses, as shown in Fig. 9, uniformly distributed from 0.35 to 0.65 s, are included in the training set. Up to this stage, when tested, the main network can respond correctly to examples that appear in the training set. However, when pulse widths and positions are significantly different from the cases included in the training set, large errors occur (more than 5 s in the worst case). The second step is to train the watchdog to identify input examples that exceed the misalignment tolerance of the main network. Fig. 10 is a block diagram showing the training method of the watchdog. Training examples are generated by the convolution of pulse trains with random positions and pulse widths and exponential decay functions. Each input datum from the data set is sent simultaneously to the trained main network and the watchdog. The output of the trained main network and the expected output are compared. If an error is smaller …

Fig. 10. Training of the watchdog decision-based neural network. The watchdog is trained to identify input cases from which the main network can satisfactorily extract RTs. In the diagram, the watchdog's teacher is T = 1 if |o − t| <= e, else T = 0, where o is the main network output, t the true RT, and e a tolerance.
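The idealized test examples of Section 3.3, pulse trains convolved with the exponential decay of Eq. (11), can be generated along the following lines. This is a simplified sketch with invented names; the sampling rate and pulse count are arbitrary choices for illustration.

```python
import math
import random

def exponential_decay(rt, fs, seconds=1.0):
    # Eq. (11): h(t) = exp(-6.9 t / RT)
    return [math.exp(-6.9 * (n / fs) / rt) for n in range(int(seconds * fs))]

def pulse_train(fs, rng, n_pulses=3, lo=0.35, hi=0.65):
    """Rectangular pulses whose widths and the spaces between them are
    drawn uniformly from [lo, hi] seconds, echoing the 0.35-0.65-s
    range used for the misalignment-tolerance training."""
    sig = []
    for _ in range(n_pulses):
        sig += [1.0] * int(rng.uniform(lo, hi) * fs)   # pulse
        sig += [0.0] * int(rng.uniform(lo, hi) * fs)   # space
    return sig

def convolve(x, h):
    # Direct-form convolution, Eq. (1): m = s * h.
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        if xi != 0.0:
            for j, hj in enumerate(h):
                y[i + j] += xi * hj
    return y

rng = random.Random(1)
fs = 1000          # a coarse rate keeps this toy example fast
rt_teacher = 0.8   # the known RT paired with this example as teacher
example = convolve(pulse_train(fs, rng), exponential_decay(rt_teacher, fs))
print(len(example))
```

Each such `(example, rt_teacher)` pair plays the role of a sample-teacher pair in the preliminary tests; randomizing pulse widths and spacings produces the misalignment variations the text describes.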
Anechoic speech utterances "one," "two," and "three," read by a female narrator, are used. The convolution of the anechoic speech utterances and 10 000 simulated room impulse responses with reverberation times from 0.1 to 5 s is performed to generate a large data set. Half of the data set is used as a training set, and the second half is used as a validation set. As generalization is an important concern in this application, validation is performed to test the trained network. The test patterns used for validation purposes are strictly selected from those patterns that have not been encountered in the training phase. When testing the robustness of the trained neural networks to speech utterances, two additional words, "four" and "five," are used in the experiment; a sufficiently large data set (10 000 examples) is used to ensure more rigorous validation tests.

4.2 Training

Training on speech utterances and simulated impulse responses is performed using a quasi-block-adaptive method. This is a balance between an aggressive data-adaptive method and a prudent block-adaptive method. The data-adaptive method updates the connection weights in a neural network based on one example-teacher pair, and therefore the weights are updated in each iteration. In some applications this method exhibits a faster convergence speed than the block-adaptive method. However, it lacks consistent numerical robustness. The block-adaptive method does not update the weights until the entire block of example-teacher pairs has been presented. In this study the so-called quasi-block-adaptive method works in such a way that weight updating happens after 50 training example pairs of different reverberation times have been presented. In particular, one quasi-block comprises examples with all the reverberation time values ranging from 0.1 to 5 s at steps of 0.1 s. The example of each individual reverberation value in a quasi-block is chosen randomly from the training set. This is, in fact, a random selection of utterance sequences and impulse response types having identical reverberation times. The training algorithm fol…

… trial and then recoded in C language to gain the full potential of computing power. Training the neural networks is quite a tedious job. It takes about one week on a 333-MHz Pentium II desktop PC to train the proposed neural network. To identify a suitable learning rate and the initial values for weights, several trials and restarts are often inevitable. Fortunately, training is a one-off process in developing the neural network models for this application. Once the neural networks are trained and validated, the connection weights become constant values in the networks. So in the retrieve phase, the neural networks have a very quick response to any input cases and extract room acoustic parameters in real time.

4.3 Robustness to Different Impulse Responses

In the experiment, single-syllable utterances "one," "two," "three," read by a female narrator, are recorded in an anechoic chamber at a sampling frequency of 16 kHz. The average length of each utterance is about 0.4 s. The source used to carry out the test is the 27 possible combinations of these three utterances, that is, "one … two … three," "one … three … two," "two … one … three," …, "three … three … three." The time intervals between these single utterances are on average 0.5 s. To tackle the time misalignment problem mentioned before, time-shifted patterns are included in the training set. The time shifting for spaces between two words is randomly arranged from 0.4 to 0.6 s. The learning process is illustrated in Fig. 12, where the y axis, according to Eq. (7), represents the summed square errors between teachers and outputs of the neural network over 50 randomly chosen samples from the validation set. It can be seen that the mean squared error, when tested using data from the validation set, decreases continuously when the number of training iterations increases. After 1 million iterations, for any arbitrary test data (data from the validation set), the error between the outputs of the neural network and the actual reverberation times is less than ±0.048 s. Fig. 13 shows the worst case of the test results, when the trained neural networks are tested using the data from the validation set.

The training and testing process has already incorporated the time-shifted samples, and therefore the trained network shows a certain tolerance to the time shifting of input signals. However, the time shift included in the training examples is not adequate to deal with uncontrolled speech; a preprocessor that aligns the measured speech utterances is still needed. The approach of using a decision-based neural network as a watchdog, discussed in the preceding section, is adopted. After the … networks on utterances, and results of a preliminary study of source independence are presented in Table 1. For each reverberation time within the range of 0.5 to
5 s, with a step size of 0.5 s, 100 different tests are performed, and the errors listed in Table 1 are the worst cases. It can be seen that the trained neural network does show a certain independence from the utterances used, but in the middle of the reverberation time range, unacceptably large errors occur. It is anticipated that when more utterance samples are presented in the training set, the neural network should learn the characteristics of speech utterances better and average them out to achieve better source-independent features.

4.5 Training on Early Decay Time

Early decay time (EDT) is another important objective acoustic parameter. It focuses on the initial 10-dB decay period and has been shown to correlate better with subjective responses in reverberant fields with continuous signals. The proposed method is also tried for extracting the early decay time from speech utterances in an identical way, but the input layer of the approximation network is extended to have 60 neurons. This is to ensure that the early decay part of the examples can be sufficiently sensed by the neural networks. Similar results are obtained. The maximum error of early decay time extraction, when tested using the words that appeared in the training set, is 0.056 s [see Fig. 13(b)].

5 DISCUSSION AND CONCLUSIVE REMARKS

This paper has presented a novel method to extract room reverberation times from single-syllable speech utterances (in this case, pronounced digits) using artificial neural networks. The limitation of the approximation-based multilayer feedforward network in this application can be circumvented by using a second decision-based watchdog neural network. The early decay time of rooms can be extracted from speech utterances in a similar fashion. In this study wide-band reverberation parameters within the frequency range of speech utterances are considered. Although frequency-dependent reverberation times can be extracted with added band-pass filters, the method is restricted to the estimation of reverberation parameters within the frequency range of speech utterances.

Two important concerns of the proposed method, robustness to different impulse responses and source independence, have been addressed. Test results show that the proposed neural networks can sufficiently generalize different impulse responses and therefore are robust to rooms with different types of impulse responses. When using utterances included in the training phase as excitations, the trained neural networks can achieve a better than 0.1-s resolution in reverberation time measurements. This can meet the precision requirement of practical use and can be implemented by playing back prerecorded anechoic speech utterances as excitations. An initial investigation also reveals that the proposed method has certain source-independent features. When two out of three words in the excitations are new words that have never appeared in the training set, the trained neural networks can still give coarse estimations of reverberation times. It is known that the envelope spectrum of anechoic speech taken over a long period (40 s) is relatively constant [9]. Reverberation modifies the envelopes of speech signals according to different reverberation times. The difference between the envelopes of anechoic speech and reverberated speech provides a modulation transfer function (MTF) from which reverberation times and a speech transmission index (STI) can be extracted. This indicates that the source independence feature of the proposed method could be improved provided that longer speech sequences and more words are used in both the training and the retrieve phases. Feeding long-term envelope spectra of reverberated speech into the neural network could be a practical solution, as this enables averaged information of speech envelopes over a sufficiently long period to be sent to neural networks of reasonable sizes.

The simulator used to generate training and validation examples is based on a stochastic model of diffuse fields. Although it is expected to generate any possible impulse responses (with various early reflection patterns) to form a superset of realistic impulse responses using the stochastic model, some acoustic conditions such as coupled rooms may not be well reflected in the data set. It is believed that this system would be more useful when more complicated training examples are included.

As a first step toward extracting room acoustic parameters from speech using artificial neural networks, this study has demonstrated the feasibility and potential of this new approach. As an alternative to traditional measurement methods, this new method can facilitate occupied measurements of reverberation times and early decay times. Moreover, once the networks are trained and the connection weights obtained, implementation of such a neural network-based measurement system is straightforward and can be realized on a hardware platform at low cost. This may lead to the development of a new type of instrumentation. The method presented here may also be useful in robust speech recognition: the estimated reverberation times provide important information for dereverberation. As a long-term objective, the authors are investigating novel approaches to realize source-independent room acoustic parameter identification. This may enable reverberation times to be extracted from nonrestricted speech.

6 ACKNOWLEDGMENT

The authors would like to acknowledge the support from the Engineering and Physical Sciences Research Council, UK (EPSRC grant reference GR/L89280) for funding this project. Inspiring discussions with Y. W. Lam, W. J. Davies, and M. West are gratefully appreciated.

7 REFERENCES

[1] L. Cremer and H. Muller, Principles and Applications of Room Acoustics, T. Schultz, Transl., vol. 1 (Applied Science, London, 1978), p. 194.
[2] H. Attias and C. E. Schreiner, "Blind Source Separation and Deconvolution: The Dynamic Component Analysis Algorithm," Neural Computation, vol. 10, pp. 1373-1424 (1998).
[3] G. Cybenko, "Approximation by Superpositions of a Sigmoidal Function," Mathematics of Control, Signals, and Syst., vol. 2, pp. 303-314 (1989).
[4] S. Y. Kung, Digital Neural Networks, Information and System Science ser. (Prentice-Hall, Englewood Cliffs, NJ, 1993).
[5] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, "Phoneme Recognition Using Time-Delay Neural Networks," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-37, pp. 328-339 (1989).
[6] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. (Prentice-Hall, Englewood Cliffs, NJ, 1999).
[7] Y. Ephraim, D. Malah, et al., "On the Application of Hidden Markov Models for Enhancing Noisy Speech," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-37, pp. 1846-1856 (1989 Dec.).
[8] B. Juang and L. R. Rabiner, "Mixture Autoregressive Hidden Markov Model for Speech Signals," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-33, pp. 1404-1413 (1985 Dec.).
[9] T. Houtgast and H. J. M. Steeneken, "Envelope Spectrum and Intelligibility of Speech in Enclosures," in Proc. IEEE-AFCRL 1972 Speech Conf., pp. 392-395.
[10] H. J. M. Steeneken and T. Houtgast, "A Physical Method for Measuring Speech Transmission Quality," J. Acoust. Soc. Am., vol. 67, pp. 318-326 (1980 Jan.).
[11] M. Barron, Auditorium Acoustics and Architectural Design (E & FN Spon, Chapman & Hall, London, 1998).
[12] D. E. Rumelhart, G. Hinton, et al., "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Exploration in the Microstructure of Cognition, vol. 1 (MIT Press, Cambridge, MA, 1986), pp. 318-362.
[13] M. Riedmiller, "Advanced Supervised Learning in Multi-layer Perceptrons: From Back Propagation to Adaptive Algorithms," Int. J. Computer Standards and Interfaces, special issue on neural networks, vol. 16, pp. 265-278 (1994).
[14] M. Schroeder, "Modulation Transfer Functions: Definition and Measurement," Acustica, vol. 49, pp. 179-182 (1981).
[15] Y. H. Hu, W. J. Thompkins, and Q. Xue, "Artificial Neural Networks for ECG Arrhythmia Monitoring," in Neural Networks for Signal Processing, vol. 2, S. Y. Kung, F. Fallside, J. A. Sorensen, and C. A. Kamm, Eds., Proc. 1992 IEEE Workshop (Helsingoer, Denmark, 1992), pp. 350-359.
[16] J. S. Taur and S. Y. Kung, "Prediction-Based Networks with ECG Application," in Proc. IEEE Int. Conf. on Neural Networks (San Francisco, CA, 1993), pp. 1920-1925.
[17] M. R. Schroeder and B. F. Logan, "'Colorless' Artificial Reverberation," J. Audio Eng. Soc., vol. 9, pp. 192-197 (1961 July).
[18] M. R. Schroeder, "Natural-Sounding Artificial Reverberation," J. Audio Eng. Soc., vol. 10, pp. 219-223 (1962 July).
[19] J. Dattorro, "Effect Design, Part 1: Reverberator and Other Filters," J. Audio Eng. Soc., vol. 45, pp. 660-684 (1997 Sept.).
[20] K. H. Kuttruff, "Auralization of Impulse Responses Modeled on the Basis of Ray-Tracing Results," J. Audio Eng. Soc., vol. 41, pp. 876-880 (1993 Nov.).
[21] H. Kuttruff, Room Acoustics (Elsevier Science Publishers, England, 1991).
THE AUTHORS

T. J. Cox, F. Li, P. Darlington

Trevor Cox is a senior lecturer at the School of Acoustics and Electronic Engineering at the University of Salford. His research and teaching interests center on room acoustics and digital signal processing. He is also tutor of the M.Sc. course in audio acoustics. A majority of Dr. Cox's research work is in the area of measurement, prediction, design, and characterization of room acoustic diffusers. He has numerous journal publications on diffusers, and his designs have been used by RPG Diffusor Systems, Inc. in a variety of spaces worldwide. He currently sits on two standards working groups concerning the characterization of diffusion, including the AES SC-04-02, of which he is vice-chair.

Dr. Cox's interest in neural networks began with his work on modeling nonlinear blast noise propagation. This paper represents some of his interests in applying neural networks to problems in room acoustics. His research work also includes more conventional numerical techniques, such as boundary element methods, and the subjective evaluation of performance space acoustics using quantitative and qualitative methods.

Dr. Cox is a member of the Audio Engineering Society and the Institute of Acoustics in the U.K. He is an associate editor of Acustica united with Acta Acustica and is also on the editorial board of the Institute of Acoustics Bulletin.

Francis Feng Li was born in 1963 in Shanghai, China, and holds a bachelor's degree in electronics and computer engineering and a master of philosophy in computer modeling of EMC. Before he went to the U.K., he spent ten years on the academic staff at Shanghai University, teaching a variety of courses and conducting research in his field. He is currently a research assistant to Dr. Trevor Cox and a Ph.D. candidate near completion. Mr. Li has a broad teaching and research interest in the area of computational intelligence, digital signal processing, acoustics, audio engineering, numerical methods, and instrumentation.

Paul Darlington is a director of Apply Dynamics Ltd., a company that provides consultancy, design, and training services in audio, acoustics, and electroacoustics. After receiving an undergraduate degree from the Institute of Sound and Vibration Research, University of Southampton, Dr. Darlington studied for a master's degree in electronics at the same university. He was then awarded a Ph.D. for his research into adaptive filters, conducted at ISVR. He lectured at the University of Wyoming for two years and for the next ten years was lecturer and senior lecturer in the Department of Acoustics and Audio Engineering, University of Salford.

Dr. Darlington's work is motivated by those engineering applications that involve both electronics and acoustics. He has worked in the areas of control, communications, instrumentation, electroacoustics, and signal processing.