BLIND CHANGE DETECTION FOR AUDIO SEGMENTATION

Mohamed Kamal Omar, Upendra Chaudhari, Ganesh Ramaswamy
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598, USA
mkomar,uvc,ganeshr@us.ibm.com

ABSTRACT

Automatic segmentation of audio streams according to speaker identities, environmental conditions, and channel conditions has become an important preprocessing step for speech recognition, speaker recognition, and audio data mining. In most previous approaches, the automatic segmentation was evaluated in terms of the performance of the final system, such as the word error rate for speech recognition systems. In many applications, such as online audio indexing and information retrieval systems, the actual boundaries of the segments are required. We therefore present an approach based on the cumulative sum (CuSum) algorithm for automatic segmentation which minimizes the missing probability for a given false alarm rate. In this paper, we compare the CuSum algorithm to the Bayesian information criterion (BIC) algorithm and a generalization of the Kolmogorov-Smirnov test for automatic segmentation of audio streams. We present a two-step variation of the three algorithms which improves the performance significantly. We also present a novel approach that combines hypothesized boundaries from the three algorithms to achieve the final segmentation of the audio stream. Our experiments on the 1998 Hub4 broadcast news data show that a variation of the CuSum algorithm significantly outperforms the other two approaches, and that combining the three approaches using a voting scheme slightly improves the performance compared to using the two-step variation of the CuSum algorithm alone.

1. INTRODUCTION

Many audio resources, such as broadcast news, contain different kinds of audio signals, such as speech, music, and noise, as well as different environmental and channel conditions. The performance of many applications based on these streams, such as speech recognition and audio indexing, degrades significantly due to the presence of the irrelevant portions of the audio stream. Therefore, segmenting the data into homogeneous portions according to type (speech, noise, music, etc.), speaker identity, environmental conditions, and channel conditions has become an important preprocessing step before these streams are used [1], [2], [3], [4], [5], [6].

Previous approaches to automatic segmentation of audio data can be classified into two categories: informed and blind. Informed approaches include both decoder-based and model-based algorithms. In decoder-based approaches, the input audio stream is first decoded using speech and silence models [7]; the desired segments are then produced using the silence locations generated by the decoder. In model-based approaches, different models are built to represent the different acoustic classes expected in the stream; the input audio stream is classified by maximum likelihood selection, and locations of change in the acoustic class are identified as segment boundaries [3]. In both cases, models trained on data representing all acoustic classes of interest are used in the automatic segmentation. Informed automatic segmentation is therefore limited to applications where a sufficient amount of training data is available for building the acoustic models, and it cannot generalize to acoustic conditions unseen in the training data. Also, approaches based solely on speech and silence models mainly detect silence locations, which do not necessarily correspond to boundaries between different acoustic segments.
In this paper, we focus on blind automatic segmentation techniques, which do not suffer from these limitations and therefore serve a wider range of applications. Blind change detection avoids the requirements of the informed approach by building models of the observations in a neighborhood of a candidate point under the two hypotheses of change and no change, and using a criterion based on the log likelihood ratio of these two models for automatic segmentation of the acoustic data. Examples of this approach are [1], [2], [4], and [5]. In [6], the combination of an informed approach and a blind approach was considered.

Most of the previous approaches had the goal of providing an input to a speech recognition or speaker adaptation system. Therefore, they evaluated their systems by comparing the word error rates achieved using the automatic and the manual segmentation, rather than the accuracy of the boundaries generated by the automatic segmentation [3], [4], [7]. Exceptions to this trend include [5] and work where the main focus is data indexing, as in [6]. In many applications, such as online audio indexing and information retrieval, the goal of the automatic segmentation algorithm is to detect the changes in the input audio stream while keeping the number of false alarms as low as possible. Unfortunately, current techniques for blind automatic segmentation, such as those using the Kullback-Leibler distance, the generalized likelihood ratio distance [2], or the Bayesian information criterion [5], optimize an objective function that is not directly related to minimizing the missing probability for a given false alarm rate. If we define the missing probability as the probability of not detecting a change within a reasonable period of time of a valid change in the stream, then minimizing the missing probability is equivalent to minimizing the duration between the detected change and the actual change, namely the detection time. In this paper, we use a variation of the CuSum algorithm, which minimizes the detection time for a given false alarm rate, to automatically segment an input audio stream [8]. We show that this variation significantly outperforms the Bayesian information criterion algorithm and a generalization of the nonparametric Kolmogorov-Smirnov test [9]. We also present a two-step variation of the three algorithms which improves the performance significantly. Finally, we introduce two approaches for combining the results of the three algorithms to achieve better and more robust segmentation.

In the next section, the three criteria used for automatic segmentation and the implementation of the corresponding algorithms are given. In Section 3, the algorithm used for combining the output of the three systems to generate the final segmentation is presented. The experiments performed to evaluate the different strategies are described in Section 4. Finally, Section 5 contains a discussion of the results and future research.

2. PROBLEM FORMULATION

The goal of our work is to search for a proper segmentation of a given audio signal such that each resulting segment is homogeneous: it belongs to one of the acoustic classes, such as speech, noise, or music, and to a single speaker and a single channel. In this section, we describe the actual implementation of the three algorithms and the assumptions made to make the estimation of the segmentation points efficient.

In the three algorithms, each frame of data is represented by a feature vector of cepstrum coefficients. Given an observation sequence of length n, the detection of a change is equivalent to accepting the hypothesis H1 of a change at time r ≤ n when testing it against the hypothesis H0 of no change (i.e., r > n). The following algorithm is used to detect the change points in the input audio stream using any of the three criteria (a sketch of this loop is given after the list):

1. Initialize the first observation index f with zero and the last index l with n0.
2. Detect whether there is a change, using one of the three algorithms, for the input sequence of observations.
3. If no change is detected, set l = l + n0; else set f = r and l = r + n0, where r is the location of the detected change.
4. If (l − f > 3n0), set f = l − 3n0.
5. If not at the end of the audio stream, go to 2.
6. End.
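The following is a minimal Python sketch of this sliding-window loop; it is not part of the original paper. The callback detect_change stands in for any of the three detectors described below and is assumed to return an interior change index within the window, or None; the function name, signature, and handling of the stream tail are assumptions for illustration only.

```python
import numpy as np

def segment_stream(features, detect_change, n0=300):
    """Sliding-window change detection loop (steps 1-6 above).

    features      : (T, M) array of per-frame feature vectors.
    detect_change : callable taking an (n, M) window and returning the index
                    of a detected change within the window (0 < r < n), or None.
    n0            : initial window growth in frames (assumption: 300, the
                    value used for CuSum/BIC in Section 4).
    """
    T = features.shape[0]
    boundaries = []
    f, l = 0, n0                          # step 1: first and last observation indices
    while l <= T:                         # step 5: stop at the end of the stream
        r = detect_change(features[f:l])  # step 2: test the current window
        if r is None or r <= 0:
            l = l + n0                    # step 3: no change, grow the window
        else:
            r_abs = f + r                 # step 3: restart just after the change
            boundaries.append(r_abs)
            f, l = r_abs, r_abs + n0
        if l - f > 3 * n0:                # step 4: cap the window length at 3*n0
            f = l - 3 * n0
    return boundaries                     # step 6
```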
In the following, we give details that are specific to the implementation of each algorithm.

2.1. Change Detection Using the CuSum Algorithm

Under the assumption that the sequence of log likelihood ratios, {l_i}_{i=1}^{n}, is an i.i.d. process, the CuSum algorithm is optimal in the sense of minimizing the detection time for a given false alarm rate [10]. This assumption is valid for many processes of interest, such as some random processes modeled by Markov chains and some autoregressive processes [11]. In the CuSum algorithm, the likelihood ratio of the conditional PDFs of the observations under the hypothesis H1 of a change at time r ≤ n and the hypothesis H0 of no change is estimated; the maximum of the sum of the log likelihood ratios over a given sequence of observations is then compared to a threshold to determine whether a boundary exists between two segments of the observation sequence. Given n observations, we compare

c_n = \max_{r} \sum_{k=r}^{n} l_k,    (1)

where l_k is the log likelihood ratio of observation k, to a threshold λ [8].

The CuSum algorithm assumes that the conditional PDFs of the observations under both the hypothesis H1 of a change at time r ≤ n and the hypothesis H0 of no change are known. In most automatic segmentation applications, this is not true. Therefore, we train a two-Gaussian mixture using the n observations in the given sequence. We initialize the two Gaussian components such that the mean of one corresponds to the mean of a few observations at the beginning of the sequence and the mean of the other corresponds to the mean of a few observations at the end of the sequence. Automatic segmentation using the CuSum algorithm is then reduced to a binary hypothesis testing problem. The two hypotheses of this problem are

H0 : z_{r*}, · · · , z_n ∼ N(µ0, Σ0), and
H1 : z_{r*}, · · · , z_n ∼ N(µ1, Σ1),

where r* = arg max_r \sum_{k=r}^{n} l_k, and l_k is the log likelihood ratio estimated using the two Gaussian components N(µ0, Σ0) and N(µ1, Σ1).
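Below is a minimal sketch of this CuSum-style test for one window; it is not taken from the paper. It simplifies the paper's two-Gaussian mixture training by fitting one full-covariance Gaussian to the frames at the start of the window and one to the frames at the end (the paper instead runs mixture training with this initialization), then thresholds the maximum cumulative log likelihood ratio of Eq. (1). The function name, the number of initialization frames, and the covariance regularizer are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def cusum_detect(window, threshold, init_frames=50):
    """Simplified CuSum change test over one window of feature vectors.

    window      : (n, M) array of cepstral feature vectors.
    threshold   : decision threshold lambda on c_n (Eq. 1).
    init_frames : frames at each end used to fit the two Gaussians
                  (a simplification of the paper's mixture training).
    Returns the hypothesized change index r*, or None if c_n <= threshold.
    """
    n, M = window.shape
    head, tail = window[:init_frames], window[-init_frames:]
    reg = 1e-6 * np.eye(M)  # small regularizer to keep covariances invertible
    g0 = multivariate_normal(head.mean(0), np.cov(head, rowvar=False) + reg)
    g1 = multivariate_normal(tail.mean(0), np.cov(tail, rowvar=False) + reg)
    # Per-frame log likelihood ratios l_k of the "change" vs "no change" models.
    llr = g1.logpdf(window) - g0.logpdf(window)
    # c_n = max_r sum_{k=r}^{n} l_k, via a reversed cumulative sum.
    tail_sums = np.cumsum(llr[::-1])[::-1]
    r_star = int(np.argmax(tail_sums))
    return r_star if tail_sums[r_star] > threshold else None
```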
2.2. Change Detection Using the BIC Algorithm

The Bayesian information criterion is based on the log likelihood ratio of two models representing the two hypotheses that the observation sequence contains two classes or a single class. It adds a penalty term to account for the difference in the number of parameters of the two models [12]. The parameters of both models are estimated using the maximum likelihood criterion. Given n observations, the BIC approach compares

b_n = \sum_{k=1}^{n} l_k - \frac{1}{2}(d_1 - d_2)\log(nM),    (2)

where d_1 and d_2 are the numbers of parameters of the two models and M is the dimension of the observation vector, to a threshold [5], [12].

We implemented the BIC algorithm using the same assumptions given in [5]. The conditional PDF of the observations under the hypothesis H1 of a change therefore consists of two Gaussian PDFs, both trained using maximum likelihood estimation: one is trained on the observations before the hypothesized boundary and the other on the observations after it. The conditional PDF of the observations under the hypothesis H0 of no change is modeled with a single Gaussian PDF trained using maximum likelihood estimation on all n observations. Detecting a change at time r using the BIC algorithm is then reduced to a binary hypothesis testing problem. The two hypotheses of this problem are

H0 : z_1, · · · , z_n ∼ N(µ0, Σ0), and
H1 : z_1, · · · , z_{r−1} ∼ N(µ1, Σ1); z_r, · · · , z_n ∼ N(µ2, Σ2),

where N(µ0, Σ0) is the Gaussian model trained using all n observations, N(µ1, Σ1) is trained using the first r observations, and N(µ2, Σ2) is trained using the last n − r observations. Since the model of the conditional PDF under the hypothesis H1 of a change depends on the location of the change, re-estimation of the model parameters is required for each new hypothesized boundary within the sequence of n observations. This problem is avoided in our CuSum implementation, as in that case both models are independent of the location of the hypothesized boundary.
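The following is a minimal sketch, not from the paper, of this BIC test for a single window. It fits full-covariance Gaussians by maximum likelihood, applies the penalty of Eq. (2), and searches over candidate split points. The parameter counts, the minimum segment size, the covariance regularizer, and the choice of zero as the decision threshold are assumptions for illustration.

```python
import numpy as np

def gaussian_loglik(x, mean, cov):
    """Total log likelihood of the rows of x under a full-covariance Gaussian."""
    M = x.shape[1]
    diff = x - mean
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
    return float(-0.5 * np.sum(quad + logdet + M * np.log(2 * np.pi)))

def bic_detect(window, min_size=20):
    """Simplified BIC change test (Eq. 2) over one window of feature vectors.

    Returns (best_r, best_bic); a change is hypothesized when best_bic > 0
    (zero taken as the threshold, an assumption not fixed by the paper).
    """
    n, M = window.shape
    d_gauss = M + M * (M + 1) // 2   # parameters of one full-covariance Gaussian
    d1, d2 = 2 * d_gauss, d_gauss    # two-Gaussian vs. single-Gaussian model
    reg = 1e-6 * np.eye(M)           # small regularizer for covariance estimates
    ll0 = gaussian_loglik(window, window.mean(0), np.cov(window, rowvar=False) + reg)
    best_r, best_bic = None, -np.inf
    for r in range(min_size, n - min_size):
        a, b = window[:r], window[r:]
        ll1 = (gaussian_loglik(a, a.mean(0), np.cov(a, rowvar=False) + reg) +
               gaussian_loglik(b, b.mean(0), np.cov(b, rowvar=False) + reg))
        bic = (ll1 - ll0) - 0.5 * (d1 - d2) * np.log(n * M)
        if bic > best_bic:
            best_r, best_bic = r, bic
    return best_r, best_bic
```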
2.3. Change Detection Using the Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test is a nonparametric test for a change in the input data [9]. It compares the maximum difference between the empirical CDFs of the data before and after the hypothesized change point to a threshold to determine whether this point is a valid boundary between two distinct classes. In other words, to test the validity of a boundary at observation k, the test compares

S_n = \sup_{z} \left| F_k(z) - G_{n-k}(z) \right|,    (3)

where

F_k(z) = \frac{1}{k} \sum_{j=1}^{k} \Theta(z - z_j),    (4)

G_{n-k}(z) = \frac{1}{n-k} \sum_{j=k+1}^{n} \Theta(z - z_j),    (5)

and Θ(·) is the unit step function, to a threshold α [8]. The Kolmogorov-Smirnov test was designed for one-dimensional observations. To generalize it to observation vectors of dimension M, we assume that the elements of the observation vector are statistically independent and replace the criterion of the Kolmogorov-Smirnov test with

S_n = \sup_{m} \sup_{s} \left| F_k^m(z_s^m) - G_{n-k}^m(z_s^m) \right|,    (6)

where

F_k^m(z_s^m) = \frac{1}{k} \sum_{j=1}^{k} \Theta(z_s^m - z_j^m),    (7)

and

G_{n-k}^m(z_s^m) = \frac{1}{n-k} \sum_{j=k+1}^{n} \Theta(z_s^m - z_j^m),    (8)

for m = 1, · · · , M, and the range of values of each dimension is quantized into a fixed number of bins, {z_s^m}_{s=1}^{S}, to be used in calculating the empirical CDFs.
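Here is a minimal sketch, not from the paper, of the generalized test of Eqs. (6)-(8): each dimension's range is quantized into a fixed number of bins, empirical CDFs are computed before and after each candidate split, and the largest discrepancy over bins and dimensions is compared to a threshold. The candidate-split loop, minimum segment size, and function name are assumptions; the five-bin quantization follows Section 4.

```python
import numpy as np

def ks_detect(window, alpha, num_bins=5, min_size=20):
    """Generalized Kolmogorov-Smirnov change test over one window (Eqs. 6-8).

    window   : (n, M) array of feature vectors.
    alpha    : decision threshold on the statistic S_n.
    num_bins : quantization bins per dimension (Section 4 uses five).
    Returns the best split index r, or None if max_r S_n(r) <= alpha.
    """
    n, M = window.shape
    # Quantization points {z_s^m}: a grid spanning each dimension's range.
    lo, hi = window.min(0), window.max(0)
    grid = np.linspace(lo, hi, num_bins)            # (S, M) values z_s^m
    # Indicator Theta(z_s^m - z_j^m) for every frame j, bin s, dimension m.
    ind = (window[None, :, :] <= grid[:, None, :])  # (S, n, M) boolean array
    best_r, best_stat = None, -np.inf
    for r in range(min_size, n - min_size):
        F = ind[:, :r, :].mean(axis=1)              # empirical CDF before the split
        G = ind[:, r:, :].mean(axis=1)              # empirical CDF after the split
        stat = np.max(np.abs(F - G))                # sup over bins s and dimensions m
        if stat > best_stat:
            best_r, best_stat = r, stat
    return best_r if best_stat > alpha else None
```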
3. AN ALGORITHM FOR COMBINING THE THREE SYSTEMS

Since the three approaches for automatic segmentation described above use different criteria and different models of the conditional PDFs of the observations under the hypotheses of a valid change and of no change, it is reasonable to expect these algorithms to employ complementary information for automatic change detection; therefore, combining the three approaches can improve the overall performance and robustness of the automatic change detection system. To combine the three approaches, we use each of them separately to generate a set of potential change points. The values of the three measures used in the three algorithms for change detection are then evaluated at every point of the three sets. Then, based on either a voting scheme or a likelihood ratio test of two models trained on the values of the three measurements of manually segmented data near and far from a valid change, respectively, the set of valid change points is selected from the collection of the three sets. The steps of the algorithm are:

1. Initialize the first observation index f with zero and the last index l with n0.
2. Detect whether there is a change, using the three algorithms, for the input sequence of observations.
3. Generate a list of candidate points from the union of the outputs of the three algorithms.
4. Calculate the values of the measurements of the three algorithms at every point of the candidate list.
5. Remove the invalid changes from the list using a voting scheme or a likelihood ratio test.
6. If the candidate list is empty, set l = l + n0; else set f = r and l = r + n0, where r is the location of the last change in the candidate list.
7. If (l − f > 3n0), set f = l − 3n0.
8. If not at the end of the audio stream, go to 2.
9. End.

4. EXPERIMENTS

We tested the three approaches, namely the CuSum algorithm, the BIC algorithm, and the generalized Kolmogorov-Smirnov test, on the automatic segmentation of the 1998 Hub4 broadcast news evaluation data. The data is sampled at 16 kHz and windowed into frames of 20 ms duration with an overlap of 10 ms. Nineteen cepstrum coefficients are calculated for each frame. We selected the initial number of observations to be tested for a change, n0, to be 300 frames for the CuSum and BIC algorithms and 400 frames for the generalized Kolmogorov-Smirnov test. For the generalized Kolmogorov-Smirnov test, we divided the range of values of each dimension of the observation vector into five bins, and the two empirical CDFs are compared in each of these five bins to find the maximum. The size of the testing data is 5.5 hours, containing approximately 625 homogeneous segments.

For the three algorithms, we tried adding a verification step in which the objective criterion is calculated at the candidate change points, using the knowledge of the previous and the next change point obtained in the first step, and then compared to a new threshold. The thresholds in our experiments are chosen empirically to minimize, on 2 hours of held-out data, the objective function

O = MP + 0.1 \cdot FA,    (9)

where MP is the missing probability and FA is the false alarm probability, under the constraint that the false alarm probability is less than 0.1. The missing probability is calculated by assuming a change is missed if no change was detected within one second of it.

Table 1 shows the results for one-step automatic segmentation using the three algorithms. It shows that our implementation of the CuSum algorithm significantly outperforms the BIC and generalized Kolmogorov-Smirnov tests. The BIC algorithm works better than the generalized Kolmogorov-Smirnov test, although the latter tends to give a lower false alarm probability.

Table 1. One-Step Automatic Segmentation
Algorithm            | Missing Prob. (%) | FA Prob. (%)
Kolmogorov-Smirnov   | 37.6              | 8.6
BIC                  | 35.9              | 9.4
CuSum                | 27.9              | 8.9

Table 2 shows that adding a verification step after the initial segmentation significantly improves the performance of both the BIC and the CuSum algorithms.

Table 2. Automatic Segmentation with a Verification Step
Algorithm            | Missing Prob. (%) | FA Prob. (%)
Kolmogorov-Smirnov   | 35.9              | 8.3
BIC                  | 32.6              | 9.1
CuSum                | 16.8              | 4.9

We also tested the combination algorithm described in the previous section. Table 3 shows that the combination based on the voting scheme significantly outperforms the combination based on the likelihood ratio test using models trained on manually segmented data. It also shows that the voting scheme slightly outperforms the best single automatic segmentation system, which uses our implementation of the CuSum algorithm with a verification step.

Table 3. Automatic Segmentation Using Systems Combination
Algorithm                    | Missing Prob. (%) | FA Prob. (%)
CuSum                        | 16.8              | 4.9
Voting Combination           | 15.6              | 7.3
Likelihood Ratio Combination | 16.7              | 6.9
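As a reference for the evaluation protocol used in this section, the following sketch (not from the paper) computes missing and false alarm probabilities with the one-second matching rule and the objective of Eq. (9). The paper only defines the missing probability explicitly; counting a hypothesized change with no reference change within the tolerance as a false alarm is an assumption, as are the function name and its inputs in seconds.

```python
import numpy as np

def evaluate_segmentation(hyp_times, ref_times, tolerance=1.0, fa_weight=0.1):
    """Missing probability, false alarm probability, and the objective of Eq. (9).

    hyp_times : hypothesized change times in seconds.
    ref_times : reference (manual) change times in seconds.
    tolerance : a reference change counts as missed if no hypothesized change
                lies within this many seconds of it (the paper uses one second).
    Assumption: a hypothesized change with no reference change within the
    tolerance is counted as a false alarm.
    """
    hyp = np.asarray(hyp_times, dtype=float)
    ref = np.asarray(ref_times, dtype=float)
    missed = sum(1 for t in ref
                 if hyp.size == 0 or np.min(np.abs(hyp - t)) > tolerance)
    false_alarms = sum(1 for t in hyp
                       if ref.size == 0 or np.min(np.abs(ref - t)) > tolerance)
    mp = missed / max(len(ref), 1)
    fa = false_alarms / max(len(hyp), 1)
    return mp, fa, mp + fa_weight * fa   # objective O = MP + 0.1 * FA
```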
5. RESULTS AND DISCUSSION

In this paper, we examined three approaches for blind automatic segmentation of audio streams. Our implementations of two of these approaches, namely the CuSum algorithm and the generalized Kolmogorov-Smirnov test, are novel and are applied here for the first time to automatic segmentation of audio streams. We also presented a two-step variation of the algorithms which improved the performance significantly. Finally, we presented two approaches for combining the scores of the three systems to achieve better performance and more robust segmentation of the audio stream.

Our results show that our implementation of the CuSum algorithm significantly outperforms other blind automatic segmentation techniques for audio data, such as the BIC algorithm and the generalized Kolmogorov-Smirnov approach. It also has the advantage of not having to re-estimate the conditional models at each potential segmentation point within the same window. The better performance can be attributed partially to the fact that the CuSum algorithm minimizes the detection time for a given false alarm rate; this objective is better suited to automatic segmentation applications than the objectives of both the BIC and the generalized Kolmogorov-Smirnov approaches. Combining the scores of the three systems using a voting scheme is significantly better than the likelihood ratio approach using models trained near the change points and others trained far from the change points. Combination using voting is slightly better than using the CuSum algorithm alone with a verification step, but at the expense of increased processing time. Further investigation of the effect of the type of input features and of the models of the conditional PDFs on segmentation performance will be the main goal of our future research. We will also consider other alternatives for combining the scores of the three algorithms.

6. REFERENCES

[1] H. Beigi, S. Maes, "Speaker, Channel, and Environment Change Detection," in Proceedings of the World Congress on Automation, pp. 18–22, 1998.
[2] H. Gish, N. Schmidt, "Text-independent speaker identification," IEEE Signal Processing Magazine, pp. 18–21, 1994.
[3] J. L. Gauvain, L. Lamel, "Audio Partitioning and Transcription for Broadcast Data Indexation," Multimedia Tools and Applications, vol. 14, no. 2, pp. 187–200, 2001.
[4] M. Siegler, U. Jain, B. Raj, R. Stern, "Automatic Segmentation, Classification, and Clustering of Broadcast News Audio," in DARPA Speech Recognition Workshop Proc., pp. 97–99, 1997.
[5] S. S. Chen, P. S. Gopalakrishnan, "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion," in DARPA Speech Recognition Workshop Proc., 1998.
[6] T. Kemp, M. Schmidt, M. Westphal, A. Waibel, "Strategies for Automatic Segmentation of Audio Data," in Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, vol. 3, pp. 1423–1426, 2000.
[7] B. Ramabhadran, J. Huang, U. Chaudhari, G. Iyengar, H. J. Nock, "Impact of Audio Segmentation and Segment Clustering on Automated Transcription Accuracy of Large Spoken Archives," in Proc. of EuroSpeech, pp. 2589–2593, 2003.
[8] M. Basseville, I. Nikiforov, Detection of Abrupt Changes: Theory and Application, Prentice-Hall, April 1993.
[9] J. Deshayes, D. Picard, "Off-line Statistical Analysis of Change-Point Models Using Non-Parametric and Likelihood Methods," in Detection of Abrupt Changes in Signals and Dynamical Systems, Springer-Verlag, 1986.
[10] A. N. Shiryaev, "The Problem of the Most Rapid Detection of a Disturbance in a Stationary Process," Soviet Math. Dokl., no. 2, pp. 795–799, 1961.
[11] G. V. Moustakides, "Quickest Detection of Abrupt Changes for a Class of Random Processes," IEEE Transactions on Information Theory, vol. 44, no. 5, September 1998.
[12] R. E. Kass, A. E. Raftery, "Bayes Factors," Technical Report no. 254, Department of Statistics, University of Washington, July 1994.