Abstract
In this paper, we present a new method for classifying voice pathologies. Reconstructed Phase Space (RPS) images are employed to represent the nonlinear dynamics of the signals, and a Convolutional Neural Network (CNN) is designed to automatically learn spatial features and a classification decision from the RPS images. Due to the large parameter space of the CNN, we augmented the Massachusetts Eye and Ear Infirmary (MEEI) database with synthetic training data obtained by slowing down or speeding up the audio signal. The proposed method was evaluated in the pairwise classification of 5 voice pathologies: paralysis, edema, nodule, polyp and keratosis. Experiments were also carried out on a broader pathology class, called benign lesion, consisting of nodule, polyp and cyst signals. Accuracies similar to state-of-the-art approaches support the relevance of the method. Best accuracy was achieved in the polyp vs. nodule classification. Data augmentation was beneficial to most of the classification experiments.
1 Introduction
The human voice is produced by airflow passing through the various constrictions of the vocal tract structures: the vocal folds, the shape of the larynx, the mouth, the nose, etc. As a result, voice signals carry unique information that may be used to identify individuals in biometric systems. Voice signals may also be used as input to systems designed to detect and classify anomalies in those structures, which makes them a valuable tool in healthcare. Currently, about 25% of the world population works in a profession that requires excessive vocalization [1], such as teachers, lawyers and singers. The size of this risk group emphasizes the importance of research on the classification of voice pathologies. The diagnosis of voice pathologies can be performed subjectively, by listening to the patient’s voice and deciding whether a pathology is present, or objectively, with laboratory tests that are usually more precise but highly invasive, causing discomfort to the patient.
Several studies in digital voice signal processing have been carried out with the purpose of evaluating the quality of the patient’s voice and assisting a specialist in the diagnosis of voice pathologies. Acoustic analysis of the voice can be an efficient tool for supporting the diagnosis of pathologies and has the advantage of being non-invasive. For feature extraction, several methods based on linear acoustic theory have been used to identify voice pathologies, such as jitter and shimmer [2], cepstral coefficients [3], mel-frequency cepstral coefficients [4] and others. These methods may not be suitable for the problem, because airflow propagation through the vocal tract is more likely to follow the rules of fluid dynamics, which lead to nonlinear models [5]. In recent years, researchers have proposed the use of chaos theory to identify voice pathologies. Features obtained from the Reconstructed Phase Space (RPS), such as Lyapunov exponents and correlation dimension, were adopted as a nonlinear model of the speech signal [6,7,8]. The Reconstructed Phase Space consists of a set of m-dimensional vectors. When m = 2, it is possible to generate a 2D image from the phase space trajectory.
Convolutional Neural Networks (CNN) are widely used for image classification. Voice signals can be segmented into multiple frames, which can, in turn, be used to train a CNN. However, a large number of input samples is needed to train a CNN properly, and usually only a small number of signals is available for specific pathologies. For example, the well-known Massachusetts Eye and Ear Infirmary (MEEI) database [9] contains only 19 signals with vocal fold nodule and 20 with vocal fold polyp. To overcome this difficulty, the speech signal may be segmented into small frames, with feature extraction performed on each of them, so that there is more data to train the CNN. In addition, data augmentation may be employed to artificially increase the training set size.
The main goal of this work is to propose and experimentally investigate a new method for voice pathology classification that trains a CNN with RPS trajectory images. This work is organized as follows: in the next section, we introduce relevant related work in this area. In Sect. 3, we present some theoretical fundamentals for the research. Section 4 contains a description of the experimental setup. Results and discussion are given in Sect. 5. Finally, Sect. 6 contains the conclusions and proposals for future work.
2 Related Work
Recent interest in the use of chaos theory for voice pathology analysis has focused on obtaining a feature vector from the Reconstructed Phase Space (RPS) and presenting it as input to a classifier. Costa et al. [6] generated recurrence plots from the RPS and extracted 7 features from the plots. They used Linear Discriminant Analysis (LDA) for classification. Ghasemzadeh et al. [7] calculated the Lyapunov spectrum from the RPS and obtained a 49-dimensional feature vector. As classifier, they used Support Vector Machines (SVM). Travieso et al. [8] obtained 10 features from the RPS to train a combined HMM-SVM classifier. Fang et al. [10] used 3 different features from the RPS to construct a 10-dimensional feature vector that was fed into an SVM classifier.
Only recently has deep learning been applied to classify voice pathologies. Frid et al. [11] used a CNN to discriminate between normal and Parkinson’s disease voices. Fang et al. [12] used Mel-cepstral coefficients as feature vectors to train a fully connected deep neural network to classify normal versus pathological voices. Harar et al. [13] used a CNN with a Time Distributed Layer in a similar problem.
Additional details about the related work are summarized in Table 1. Most of the works using chaos theory adopted SVM as classifier. Proprietary databases were used in 5 of the reviewed works. In the remaining papers, the publicly available MEEI [9] or SVD [14] databases were used, but the subsets of signals differed from one work to another. Because of this, it was not possible to perform a direct comparison between those works and ours.
3 Theoretical Review
In this section, we present a brief explanation about data augmentation methods, Reconstructed Phase Space and Convolutional Neural Networks.
3.1 Data Augmentation
Data augmentation is a widely used strategy for increasing the amount of training data. It consists of applying one or more deformations to the original signal, each resulting in a new synthetic signal [15]. These deformations are typically applied only to signals that belong to the training set of a classifier. A key concept of data augmentation is that deformations applied to labeled signals must not modify the semantic meaning of the labels [16]. For voice signals with some pathology, the challenge is to ensure that the deformation preserves the characteristics of the pathology. For example, a pitch variation occurs in the presence of a polyp, so a deformation that changes the pitch of the signal may affect the characteristics of the polyp signal. To perform data augmentation in this paper, we used the Time Stretching (TS) method, which consists of slowing down or speeding up the audio signal. Given an audio signal x(t), applying a factor \(\alpha \) yields the synthetic signal \(x(\alpha t)\).
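As an illustration of the mapping \(x(t) \rightarrow x(\alpha t)\), the following Python sketch resamples a discrete signal with linear interpolation. The function name `time_stretch` and the choice of linear interpolation are assumptions made for this example; the paper does not state how TS was implemented. Note that taking \(x(\alpha t)\) literally also shifts the pitch by the factor \(\alpha \).

```python
import math


def time_stretch(x, alpha):
    """Return samples of x(alpha * t) using linear interpolation.

    alpha > 1 speeds the signal up (fewer samples),
    alpha < 1 slows it down (more samples).
    Hypothetical helper: the interpolation scheme is an assumption.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if not x:
        return []
    last = len(x) - 1
    n_out = int(math.floor(last / alpha)) + 1
    out = []
    for n in range(n_out):
        pos = n * alpha                      # position in the original signal
        i = min(int(math.floor(pos)), last)  # left neighbour (clamped)
        frac = pos - i                       # distance to the left neighbour
        right = x[min(i + 1, last)]          # right neighbour (clamped)
        out.append((1.0 - frac) * x[i] + frac * right)
    return out


# One synthetic copy per factor, using the factors reported in Sect. 4.2.
signal = [math.sin(2 * math.pi * n / 50) for n in range(500)]
augmented = {a: time_stretch(signal, a) for a in (0.8, 0.9, 1.1, 1.2)}
```

With four factors, each training signal yields four synthetic signals in addition to the original.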
3.2 Reconstructed Phase Space
Phase space is an abstract space that represents the evolution of a dynamic system whose dimensions are state variables. The sequence of states constitutes the trajectory of the phase space. To analyze the phase space associated with a time series, such as a voice signal, it is first necessary to reconstruct the phase space of an appropriate size [18]. The most used method in the literature for phase space reconstruction is the time delay method [19] (Eq. 1).
\[ \xi _t = \left[ x(t),\; x(t+\tau ),\; \ldots ,\; x(t+(m-1)\tau ) \right] \qquad (1) \]
where x(t) is the time series, m is the embedding dimension, \(\tau \) is the optimum time delay and \(\xi _t\) represents the state at time t.
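A minimal Python sketch of the time delay method follows. It assumes the common forward-delay convention, in which each state is \(\xi _t = [x(t), x(t+\tau ), \ldots , x(t+(m-1)\tau )]\), and measures \(\tau \) in samples. The function name `delay_embed` is hypothetical.

```python
def delay_embed(x, m, tau):
    """Build the delay vectors [x(t), x(t+tau), ..., x(t+(m-1)tau)].

    x   : sequence of samples (the time series)
    m   : embedding dimension
    tau : time delay in samples
    Assumes the forward-delay convention (an assumption of this sketch).
    """
    if m < 1 or tau < 1:
        raise ValueError("m and tau must be positive integers")
    n_vectors = len(x) - (m - 1) * tau
    if n_vectors <= 0:
        return []
    return [tuple(x[t + k * tau] for k in range(m)) for t in range(n_vectors)]


# With m = 2 each state is a point (x(t), x(t+tau)) that can be drawn in a plane.
points = delay_embed([0, 1, 2, 3, 4], m=2, tau=2)
```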
The behavior of the trajectory in the reconstructed phase space represents the vocal dynamics. The more regular the reconstructed phase space, the more periodic the signal. The calculation of the time delay (\(\tau \)) is based on information theory: the mutual information curve of the signal is estimated as a function of the delay [20], and \(\tau \) is taken as the first minimum of that curve. For nonlinear systems, average mutual information plays the role that the correlation function plays for linear systems [21].
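The delay selection described above can be sketched in Python as follows. This is a simple histogram-based estimate of the mutual information between x(t) and x(t + lag). The number of bins, the maximum lag and all function names are assumptions of this example, not values taken from the paper.

```python
import math
from collections import Counter


def mutual_information(x, lag, bins=16):
    """Histogram estimate (in bits) of I(x(t); x(t + lag))."""
    if lag < 0 or lag >= len(x):
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0          # avoid zero width for constant signals

    def bin_of(v):
        return min(int((v - lo) / width), bins - 1)

    a = [bin_of(v) for v in x[:len(x) - lag]]
    b = [bin_of(v) for v in x[lag:]]
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2(c * n / (pa[i] * pb[j]))
               for (i, j), c in pab.items())


def first_minimum(curve):
    """Index of the first local minimum of a sequence, or None if there is none."""
    for k in range(1, len(curve) - 1):
        if curve[k] < curve[k - 1] and curve[k] <= curve[k + 1]:
            return k
    return None


def optimal_delay(x, max_lag=50, bins=16):
    """Lag (in samples) of the first minimum of the mutual information curve."""
    curve = [mutual_information(x, lag, bins) for lag in range(max_lag + 1)]
    return first_minimum(curve)          # curve[k] corresponds to lag k
```

The histogram estimator is sensitive to the number of bins and to the frame length, so in practice the curve is often smoothed before the first minimum is located.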
3.3 Convolutional Neural Networks
A CNN has two characteristic types of layers: convolutional and pooling. It is usually organized with an input layer, one or more pairs of convolutional and pooling layers, and one or more fully connected layers at the end of the network that perform the actual classification of the inputs. A CNN combines three ideas to ensure some level of shift and distortion invariance: local receptive fields, shared weights, and spatial subsampling [22]. Each neuron of the convolutional layer takes inputs from a small rectangular section of the previous layer and applies a filter. Filters are replicated across the entire input space to perform local feature extraction. The parameters of this layer are the size of the rectangular section (kernel size), the number of filters (number of feature maps), the stride and the padding. The stride indicates the amount by which the filter shifts, and padding consists of enlarging the image around its border with zero-valued pixels [22]. The pooling layer is designed to achieve spatial invariance by reducing the resolution of the feature maps. This layer takes inputs from the previous convolutional layer and generates a lower-resolution version, typically by taking the maximum filter activation within a small rectangular region. The parameters of the so-called max-pooling layer are the size of the region, the stride and the padding [15].
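To make the role of kernel size, stride and padding concrete, the sketch below computes the spatial size of a convolution output with the standard formula \(\lfloor (W + 2P - K)/S \rfloor + 1\) and applies max-pooling to a small feature map. The function names and the example values are illustrative only and do not describe the network used in this work.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    """Spatial size of a convolution output along one dimension."""
    return (in_size + 2 * padding - kernel) // stride + 1


def max_pool2d(fmap, size=2, stride=None):
    """Max-pooling over a 2-D feature map given as a list of rows (no padding)."""
    stride = stride or size              # non-overlapping regions by default
    rows = (len(fmap) - size) // stride + 1
    cols = (len(fmap[0]) - size) // stride + 1
    return [[max(fmap[r * stride + i][c * stride + j]
                 for i in range(size) for j in range(size))
             for c in range(cols)]
            for r in range(rows)]


feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
pooled = max_pool2d(feature_map, size=2)   # keeps the maximum of each 2 x 2 region
```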
4 Experiment
In this section, we describe the experiments of this work. Subsection 4.1 contains a description of the database. In Subsect. 4.2, we present the feature extraction methodology. The CNN architecture is presented in Subsect. 4.3. Finally, Subsect. 4.4 contains the experimental setup.
4.1 Database
The Massachusetts Eye and Ear Infirmary (MEEI) database [9] consists of 710 sustained /a/ vowel voice signals and 715 voice signals from the first 12 s of the Rainbow Passage, obtained from 777 subjects. Signals were acquired with low noise level, constant microphone distance, 16-bit resolution, and a sampling rate of 25,000 Hz for pathological voice signals and 50,000 Hz for normal voice signals. Altogether, 53 signals of the sustained vowel /a/ and 53 signals of the Rainbow Passage are normal voices, and the remainder are voices affected by various pathologies, ranging from vocal fold pathologies such as nodules and cysts to neurological pathologies such as Parkinson’s disease and stuttering. In the present work, we used normal signals (53), voice pathology signals (168), signals affected by vocal fold paralysis (67), vocal fold edema (44), vocal fold keratosis (26), vocal fold nodule (19) and vocal fold polyp (20). We also grouped nodule and polyp signals with cyst signals (4) in a class denominated Focal Benign Lesion of the Lamina Propria (43), according to [17]. For conciseness, we call this class Lesion.
4.2 Feature Extraction
First, we augmented the dataset using the approach described in Sect. 3.1, with \(\alpha = 0.8, 0.9, 1.1, 1.2\) for TS. After that, we segmented the signals using two different frame sizes: 20 ms and 10 pitch cycles, both with 50% overlap. Then, we obtained the reconstructed phase space for each frame using \(m = 2\) (see Eq. 1). Table 2 shows the total number of images per class without data augmentation. With data augmentation, the total is 5\(\times \) larger. Next, we generated an image for each frame from the phase space trajectory (see Fig. 1). In Figs. 1 and 2, gray levels are inverted for better visualization. For dimensionality reduction, each image was divided into non-overlapping boxes of N \(\times \) N pixels. Each box was assigned the number of trajectory pixels of the original image that fall within it. We then generated a grayscale image in which the pixel intensities represent the counts obtained within each of the boxes. In this work, we used boxes with N = 10 and 15 (Fig. 1).
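The box-counting reduction can be sketched in Python as below. The rasterisation step is a simplification that marks only the trajectory points (not the line segments joining them), and the image size, the shared axis scaling and the function names are assumptions of this example, since the paper does not specify them.

```python
def trajectory_image(points, size):
    """Rasterise 2-D phase space points into a size x size binary image.

    A pixel is 1 if at least one trajectory point falls in it.
    Both axes share one scale because they come from the same signal.
    """
    values = [v for p in points for v in p]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    img = [[0] * size for _ in range(size)]
    for px, py in points:
        col = min(int((px - lo) / span * size), size - 1)
        row = min(int((hi - py) / span * size), size - 1)  # larger values at the top
        img[row][col] = 1
    return img


def box_count(img, n):
    """Count non-zero pixels in each non-overlapping n x n box of the image."""
    rows, cols = len(img) // n, len(img[0]) // n
    return [[sum(1 for i in range(n) for j in range(n) if img[r * n + i][c * n + j])
             for c in range(cols)]
            for r in range(rows)]


binary = [[1, 0, 0, 0],
          [1, 1, 0, 0],
          [0, 0, 0, 1],
          [0, 0, 1, 1]]
reduced = box_count(binary, 2)   # each output pixel holds the count for one 2 x 2 box
```

With N = 10, for instance, each output pixel can take values from 0 to 100, which gives the gray levels of the reduced image.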
4.3 CNN Architecture
The network is designed with 9 layers, as follows: input \(\Rightarrow \) convolution (16 kernels, stride 3) \(\Rightarrow \) max-pooling (size 2) \(\Rightarrow \) convolution (32 kernels, stride 3) \(\Rightarrow \) max-pooling (size 2) \(\Rightarrow \) convolution (64 kernels, stride 3) \(\Rightarrow \) max-pooling (size 2) \(\Rightarrow \) dense ReLU (1000 neurons) \(\Rightarrow \) dense Softmax (2 neurons). Three convolutional layers were used to transform the input into a set of local features that become more abstract as the network depth increases. A pooling layer follows each convolutional layer to reduce dimensionality. As the last component of our network, there is a stack of 2 fully connected layers, ending in a Softmax layer with 2 neurons (one neuron for each class) for the final classification. For the convolutional layers, we used 16, 32 and 64 kernels of size 3, with the Rectified Linear Unit (ReLU) as activation function. Other designs were empirically evaluated, but gave poorer results.
4.4 Experimental Setup
We opted to perform pairwise classification experiments for the chosen pathologies, since this is a common approach to deal with imbalanced datasets, which is the case of MEEI. A final classification can be obtained from multiple pairwise classifiers via a majority vote scheme, but this is not within the scope of this paper. For each classification, the number of signals of each class used in the training set corresponded to 50% of the size of the smallest class. For the validation set, the number of signals of each class corresponded to 10% of the size of the smallest class, and the remaining signals were used for the test set. For example, in the Paralysis versus Polyp classification, 10 signals of each class were used for training, 2 of each class for validation, and 55 Paralysis and 8 Polyp signals for testing. This procedure was adopted to balance the number of signals of each class in the neural network training. The signals of each set were chosen randomly each time, and we ensured that no signal had segments in more than one of the sets. Each classification was performed 10 times and the results were averaged. We trained the proposed model using stochastic gradient descent with momentum, with a learning rate of 0.01 and momentum of 0.9. The batch size was K/10, where K is the total number of images in the training set. Training ended when there was no progress in validation loss for 5 epochs.
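The signal-level split can be sketched in Python as follows. Splitting whole signals before framing is what guarantees that no signal contributes segments to more than one set. The function name, the signal identifiers and the rounding rule (integer floor) are assumptions of this example. The paper does not state how fractional sizes, such as 50% of 19 signals, were rounded.

```python
import random


def split_signals(classes, rng, train_pct=50, val_pct=10):
    """Split signal identifiers of two classes into train/val/test sets.

    classes : dict mapping class name -> list of signal identifiers
    rng     : random.Random instance used for shuffling
    Train and validation sizes are percentages of the smallest class,
    so both classes contribute the same number of training signals.
    """
    n_min = min(len(ids) for ids in classes.values())
    n_train = n_min * train_pct // 100      # floor rounding (assumption)
    n_val = n_min * val_pct // 100
    splits = {"train": [], "val": [], "test": []}
    for label, ids in classes.items():
        shuffled = list(ids)
        rng.shuffle(shuffled)
        splits["train"] += [(s, label) for s in shuffled[:n_train]]
        splits["val"] += [(s, label) for s in shuffled[n_train:n_train + n_val]]
        splits["test"] += [(s, label) for s in shuffled[n_train + n_val:]]
    return splits


# Paralysis (67 signals) versus Polyp (20 signals), as in the example above.
classes = {
    "paralysis": [f"par_{i}" for i in range(67)],   # hypothetical identifiers
    "polyp": [f"pol_{i}" for i in range(20)],
}
splits = split_signals(classes, random.Random(0))
```

For this pair the sketch reproduces the sizes given in the text: 10 training and 2 validation signals per class, with 55 Paralysis and 8 Polyp signals left for testing.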
5 Results and Discussion
In a visual analysis of the RPS images, the trajectory of a normal voice (Fig. 2(a)) is more regular than the trajectories of pathological voices (Figs. 2(b) to (f)). This is because normal voices are more periodic than pathological voices. We can see that there are differences between the pathological classes, but there are also visual differences between signals of the same pathology from different subjects (Figs. 1 and 2(b)). The CNN classifier was trained to extract features from the images in order to identify the pathology. Only frames of the input signal, never the whole signal, were used for training. For experimental evaluation purposes, we created a meta-classification rule that outputs the most frequent class label among all classified frames of a particular signal.
The best results with and without data augmentation are presented in Table 3. The results were evaluated according to sensitivity (SE), specificity (SP) and accuracy (ACC). N is the box size. We can see that in most of the classifications the best results occurred with the 10-pitch-cycle frame. Also, in most cases the best results occurred when using data augmentation. Another highlight is the result of the polyp versus nodule classification: the accuracy was above 90%, with 100% specificity.
As in the case without data augmentation, most of the best results with data augmentation were obtained with the 10-pitch-cycle frame. In most cases, the results were better than those obtained without data augmentation, which is a good indicator that the augmentation strategy was beneficial to the CNN classifier. However, there is an exception: results involving the polyp class did not improve with data augmentation. The correct classification rates for polyp versus edema and polyp versus keratosis got worse, and the polyp-versus-paralysis and polyp-versus-nodule results did not change significantly. We suspect that the time stretching used for data augmentation modifies discriminative characteristics of the polyp signal. An investigation of this aspect is left as future work.
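The meta-classification rule amounts to a majority vote over the frame-level predictions of one signal. A minimal Python sketch is given below. The function name is hypothetical, and the tie-breaking behaviour (the label seen first among the tied ones wins) is an assumption, since the paper does not describe how ties were handled.

```python
from collections import Counter


def classify_signal(frame_labels):
    """Return the most frequent label among the frame-level predictions."""
    if not frame_labels:
        raise ValueError("at least one classified frame is required")
    # most_common orders equal counts by first occurrence (Python 3.7+).
    return Counter(frame_labels).most_common(1)[0][0]


signal_label = classify_signal(["polyp", "nodule", "polyp", "polyp", "nodule"])
```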
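For reference, the three measures can be computed from signal-level predictions as in the Python sketch below, using the standard definitions SE = TP/(TP + FN), SP = TN/(TN + FP) and ACC = (TP + TN)/(TP + TN + FP + FN). Which class of each pair is treated as the positive one is an assumption of this example, and the labels are made up for illustration only.

```python
def binary_metrics(y_true, y_pred, positive):
    """Sensitivity, specificity and accuracy for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    se = tp / (tp + fn) if tp + fn else float("nan")
    sp = tn / (tn + fp) if tn + fp else float("nan")
    acc = (tp + tn) / len(pairs) if pairs else float("nan")
    return {"SE": se, "SP": sp, "ACC": acc}


# Toy labels for illustration only; these are not results from the paper.
truth = ["polyp", "polyp", "polyp", "nodule", "nodule"]
preds = ["polyp", "polyp", "nodule", "nodule", "nodule"]
metrics = binary_metrics(truth, preds, positive="polyp")
```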
6 Conclusions and Future Work
In this paper, we proposed a novel method for the classification of voices affected by pathologies, based on RPS images and a CNN. In general, using a variable frame (10 cycles of pitch) to generate the RPS images yielded better results than using a fixed frame (20 ms). Data augmentation turned out to be a promising way to increase the correct classification rates of voice pathologies, but the chosen method adversely affected the classifications involving the polyp class. As future work, we will analyze polyp signals to understand how the time stretching method affected them, and propose additional data augmentation methods that do not affect the characteristics of the pathologies.
References
1. Al-Nasheri, A., Muhammad, G., Alsulaiman, M., Ali, Z.: Investigation of voice pathology detection and classification on different frequency regions using correlation functions. J. Voice 31(1), 3–15 (2016)
2. Cordeiro, H.T., Fonseca, J.M., Ribeiro, C.M.: Reinke’s Edema and Nodules identification in vowels using spectral features and pitch jitter. Procedia Technol. 17, 202–208 (2014)
3. Ali, Z., Elamvazuthi, I., Alsulaiman, M., Muhammad, G.: Automatic voice pathology detection with running speech by using estimation of auditory spectrum and cepstral coefficients based on the all-pole model. J. Voice 30(6), 757.e7–757.e19 (2016)
4. Salma, C., Asma, B., Aicha, B., Noureddine, E.: Recognition of pathological voices. In: IEEE International Multi-Conference on Systems, Signals & Devices (SSD14), Barcelona, pp. 1–6 (2014)
5. Teager, H.M., Teager, S.M.: Evidence for nonlinear sound production mechanisms in the vocal tract. In: Hardcastle, W.J., Marchal, A. (eds.) Speech Production and Speech Modelling. NATO ASI Series (Series D: Behavioural and Social Sciences), vol. 55, pp. 241–261. Springer, Dordrecht (1990). https://doi.org/10.1007/978-94-009-2037-8_10
6. Costa, W.C.A., Assis, F.M., Neto, B.G.A., Costa, S.C., Vieira, V.J.D.: Pathological voice assessment by recurrence quantification analysis. In: ISSNIP Biosignals and Biorobotics Conference (BRC), pp. 1–6 (2012)
7. Ghasemzadeh, H., Khass, M.T., Arjmandi, M.K., Pooyan, M.: Detection of vocal disorders based on phase space parameters and Lyapunov spectrum. Biomed. Signal Process. Control 22, 135–145 (2015)
8. Travieso, C.M., Alonso, J.B., Orozco-Arroyave, J.R., Vargas-Bonilla, J.F., Nöth, E., Ravelo-García, A.G.: Detection of different voice diseases based on the nonlinear characterization of speech signals. Expert Syst. Appl. 82, 184–195 (2017)
9. Kay Elemetrics Corp.: Disordered Voice Database, Version 1.03 (CDROM). MEEI, Voice and Speech Lab, Boston, MA, October 1994
10. Fang, C., Li, H., Ma, L., Zhang, M.: Intelligibility evaluation of pathological speech through multigranularity feature extraction and optimization. Comput. Math. Methods Med. 2017, 1–8 (2017). https://www.hindawi.com/journals/cmmm/2017/2431573/cta/
11. Frid, A., Kantor, A., Svechin, D., Manevitz, L.M.: Diagnosis of Parkinson’s disease from continuous speech using deep convolutional networks without manual selection of features. In: IEEE International Conference on the Science of Electrical Engineering (ICSEE), pp. 1–4 (2016)
12. Fang, S., et al.: Detection of pathological voice using cepstrum vectors: a deep learning approach. J. Voice (2018). https://www.sciencedirect.com/science/article/pii/S089219971730509X
13. Harar, P., Alonso-Hernandez, J.B., Mekyska, J., Galaz, Z., Burget, Z., Smekal, Z.: Voice pathology detection using deep learning: a preliminary study. In: International Conference and Workshop on Bioinspired Intelligence (IWOBI), pp. 1–4 (2017)
14. Barry, W.J., Pützer, M.: Saarbrucken voice database. Institute of Phonetics, University of Saarland (2016). http://www.stimmdatenbank.coli.uni-saarland.de/
15. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105 (2012)
16. Salamon, J., Bello, J.P.: Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Lett. 24(3), 279–283 (2017)
17. Verdolini, K., Rosen, C.A., Branski, R.C.: Classification Manual for Voice Disorders-I. Psychology Press, Oxon (2014)
18. Takens, F.: Detecting strange attractors in turbulence. In: Rand, D., Young, L.-S. (eds.) Dynamical Systems and Turbulence, Warwick 1980. LNM, vol. 898, pp. 366–381. Springer, Heidelberg (1981). https://doi.org/10.1007/BFb0091924
19. Packard, N.H.: Geometry from a time series. Phys. Rev. Lett. 45(9), 712 (1980)
20. Fraser, A., Swinney, H.: Independent coordinates for strange attractors from mutual information. Phys. Rev. A 33(2), 1134 (1986)
21. Li, W.: Mutual information functions versus correlation functions. J. Stat. Phys. 60(5–6), 823–837 (1990)
22. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
© 2019 Springer Nature Switzerland AG
Marinus, J.V.d.M.L., de Araújo, J.M.F.R., Gomes, H.M. (2019). Reconstructed Phase Space and Convolutional Neural Networks for Classifying Voice Pathologies. In: Vera-Rodriguez, R., Fierrez, J., Morales, A. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2018. Lecture Notes in Computer Science(), vol 11401. Springer, Cham. https://doi.org/10.1007/978-3-030-13469-3_92
DOI: https://doi.org/10.1007/978-3-030-13469-3_92
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-13468-6
Online ISBN: 978-3-030-13469-3