1 Introduction

The human voice is produced by airflow passing through the constrictions of the vocal tract: the vocal folds, the larynx, the mouth, the nose, and so on. As a result, voice signals carry information that is unique to each speaker and may be used to identify individuals in biometric systems. Voice signals may also be used as input for systems designed to detect and classify anomalies in those structures, which makes them a valuable tool in healthcare. Currently, about 25% of the world population works in a profession that requires excessive vocalization [1], such as teaching, law, and singing. The size of this risk group emphasizes the importance of research on the classification of voice pathologies. Voice pathologies can be diagnosed subjectively, by listening to the patient’s voice and deciding whether a pathology is present, or objectively, with laboratory tests that are usually more precise but highly invasive and cause discomfort to the patient.

Several studies in the digital processing of voice signals have aimed to evaluate the quality of the patient’s voice and to assist a specialist in the diagnosis of voice pathologies. Acoustic analysis of voice can be an efficient tool for supporting the diagnosis of pathologies and has the advantage of being non-invasive. For feature extraction, several methods based on linear acoustic theory have been used to identify voice pathologies, such as jitter and shimmer [2], cepstral coefficients [3], mel-frequency cepstral coefficients [4] and others. These methods may not be well suited to the problem, because airflow propagation through the vocal tract is more likely to follow the rules of fluid dynamics, which lead to nonlinear models [5]. In recent years, researchers have proposed the use of chaos theory to identify voice pathologies. Features obtained from the Reconstructed Phase Space (RPS), such as Lyapunov exponents and correlation dimension, have been adopted as a nonlinear model of the speech signal [6,7,8]. The RPS consists of a set of m-dimensional vectors. When m = 2, it is possible to generate a 2D image from the phase space trajectory.

Convolutional Neural Networks (CNNs) are widely used for image classification. Training a CNN properly requires a large number of input samples, but usually only a small number of signals is available for a specific pathology. For example, the well-known Massachusetts Eye and Ear Infirmary database (MEEI) [9] contains only 19 signals with vocal fold nodule and 20 with vocal fold polyp. To overcome this difficulty, the speech signal may be segmented into small frames, with feature extraction performed on each of them, so that more data is available to train the CNN. In addition, data augmentation may be employed to artificially increase the training set size.

The main goal of this work is to propose and experimentally investigate a new method for voice pathology classification that trains a CNN with images of RPS trajectories. This work is organized as follows. In Sect. 2, we review related work in this area. In Sect. 3, we present some theoretical fundamentals for the research. Section 4 contains a description of the experimental setup. Results and discussion are given in Sect. 5. Finally, Sect. 6 contains the conclusions and proposals for future work.

2 Related Work

Recent interest in the use of chaos theory for voice pathology analysis has focused on obtaining a feature vector from the Reconstructed Phase Space (RPS) and presenting it as input to a classifier. Costa et al. [6] generated recurrence plots from the RPS and extracted 7 features from the plots. They used Linear Discriminant Analysis (LDA) for classification. Ghasemzadeh et al. [7] calculated the Lyapunov spectrum from the RPS and obtained a 49-dimensional feature vector. As a classifier, they used Support Vector Machines (SVM). Travieso et al. [8] obtained 10 features from the RPS to train a combined HMM-SVM classifier. Fang et al. [10] used 3 different features from the RPS to construct a 10-dimensional feature vector that was fed into an SVM classifier.

Only recently has deep learning been applied to the classification of voice pathologies. Frid et al. [11] used a CNN to discriminate between normal voices and voices affected by Parkinson’s disease. Fang et al. [12] used mel-cepstral coefficients as feature vectors to train a fully connected deep neural network to classify normal versus pathological voices. Harar et al. [13] used a CNN with a time-distributed layer in a similar problem.

Additional details about the related work are summarized in Table 1. Most of the works using chaos theory adopted SVM as the classifier. Proprietary databases were used in 5 of the reviewed works. The remaining papers used the publicly available MEEI [9] or SVD [14] databases, but each selected a different set of signals. For this reason, it was not possible to perform a direct comparison between those works and ours.

Table 1. Best results of the related works.

3 Theoretical Review

In this section, we briefly explain data augmentation methods, the Reconstructed Phase Space and Convolutional Neural Networks.

3.1 Data Augmentation

Data augmentation is a widely used strategy for increasing the amount of training data. It consists of applying one or more deformations to the original signal, each of which results in a new synthetic signal [15]. These deformations are typically applied only to signals that belong to the training set of a classifier. A key assumption of data augmentation is that deformations applied to labeled signals do not modify the semantic meaning of the labels [16]. For voice signals with some pathology, the challenge is to ensure that the deformation preserves the characteristics of the pathology. For example, a pitch variation occurs in the presence of a polyp, so a deformation that changes the pitch of the signal may affect the characteristics of the polyp signal. To perform data augmentation in this paper, we used the Time Stretching (TS) method, which consists of slowing down or speeding up the audio signal. Given an audio signal x(t), applying a factor \(\alpha \) yields the synthetic signal \(x(\alpha t)\).
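The relation \(x(\alpha t)\) can be illustrated with a minimal sketch. It assumes a discretely sampled signal and linear interpolation between samples; the function name `time_stretch` and the interpolation scheme are illustrative choices, not details taken from the paper. Note that resampling in this literal way also shifts the pitch, whereas audio toolkits often implement time stretching in a pitch-preserving form, which is a different operation.

```python
import numpy as np

def time_stretch(x, alpha):
    """Return samples of x(alpha * t), using linear interpolation (an assumption).

    alpha > 1 speeds the signal up (fewer output samples);
    alpha < 1 slows it down (more output samples).
    """
    x = np.asarray(x, dtype=float)
    # Largest output length for which alpha * t stays inside the original signal.
    n_out = int(np.floor((len(x) - 1) / alpha)) + 1
    t_new = alpha * np.arange(n_out)  # positions alpha * t on the original index axis
    return np.interp(t_new, np.arange(len(x)), x)

# Example: the four factors used for augmentation in this work.
signal = np.sin(2 * np.pi * np.arange(1000) / 50.0)
augmented = {a: time_stretch(signal, a) for a in (0.8, 0.9, 1.1, 1.2)}
```

With \(\alpha < 1\) the synthetic signal is longer than the original, and with \(\alpha > 1\) it is shorter, so the number of frames obtained from each synthetic signal differs from that of the original.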

3.2 Reconstructed Phase Space

Phase space is an abstract space whose dimensions are the state variables of a dynamic system and in which the evolution of that system is represented. The sequence of states constitutes the trajectory in the phase space. To analyze the phase space associated with a time series, such as a voice signal, it is first necessary to reconstruct a phase space of appropriate dimension [18]. The most widely used method in the literature for phase space reconstruction is the time delay method [19] (Eq. 1).

$$\begin{aligned} \xi _t = \{x(t), x(t + \tau ), ..., x(t + (m - 1)\tau )\} \end{aligned}$$
(1)

where x(t) is the time series, m is the embedding dimension, \(\tau \) is the optimum time delay and \(\xi _t\) is the state at time t.
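Equation 1 can be sketched in a few lines for a sampled signal. This is a minimal illustration in which the delay \(\tau \) is expressed in samples; the function name `delay_embed` is hypothetical.

```python
import numpy as np

def delay_embed(x, m, tau):
    """Time-delay embedding (Eq. 1): row t is [x(t), x(t + tau), ..., x(t + (m - 1) tau)].

    tau is given in samples. Returns an array of shape (n_states, m).
    """
    x = np.asarray(x, dtype=float)
    n_states = len(x) - (m - 1) * tau
    if n_states <= 0:
        raise ValueError("signal too short for this m and tau")
    # Column k holds the signal delayed by k * tau samples.
    return np.column_stack([x[k * tau : k * tau + n_states] for k in range(m)])

# With m = 2, each state is a point (x(t), x(t + tau)) in the plane.
states = delay_embed(np.arange(6), m=2, tau=2)
```

For m = 2 the rows of the result are the 2D points of the trajectory, which is what makes it possible to draw the RPS as an image.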

The behavior of the trajectory in the reconstructed phase space represents the vocal dynamics. The more regular the reconstructed phase space, the more periodic the signal. The calculation of the time delay (\(\tau \)) is based on information theory: the mutual information curve of the signal is estimated as a function of the delay [20], and \(\tau \) is taken as the first minimum of that curve. For nonlinear systems, average mutual information plays the role that the correlation function plays for linear systems [21].
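The first-minimum criterion can be sketched as follows. The histogram-based estimator, the number of bins and the function names are assumptions made for illustration; the paper does not state which mutual information estimator was used.

```python
import numpy as np

def average_mutual_information(x, lag, bins=16):
    """Histogram estimate (in nats) of the mutual information between x(t) and x(t + lag)."""
    x = np.asarray(x, dtype=float)
    a, b = x[:-lag], x[lag:]
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()                # joint probabilities
    p_a = p_ab.sum(axis=1, keepdims=True)     # marginal of x(t), shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)     # marginal of x(t + lag), shape (1, bins)
    nz = p_ab > 0                             # skip empty cells to avoid log(0)
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))

def first_minimum_delay(x, max_lag=50, bins=16):
    """Return the first lag (in samples) at which the curve has a local minimum, or None."""
    ami = [average_mutual_information(x, lag, bins) for lag in range(1, max_lag + 1)]
    for i in range(1, len(ami) - 1):
        if ami[i] < ami[i - 1] and ami[i] <= ami[i + 1]:
            return i + 1  # ami[i] corresponds to lag i + 1
    return None
```

In practice the estimated curve can be noisy, so the result depends on the bin count and on the frame length; smoothing the curve before searching for the minimum is a common precaution.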

3.3 Convolutional Neural Networks

A CNN has two characteristic types of layers: convolutional and pooling. It is usually organized as an input layer, one or more pairs of convolutional and pooling layers, and one or more fully connected layers at the end of the network that perform the actual classification of the inputs. A CNN combines three ideas to ensure some level of shift and distortion invariance: local receptive fields, shared weights, and spatial subsampling [22]. Each neuron of a convolutional layer takes inputs from a small rectangular section of the previous layer and applies a filter. Filters are replicated across the entire input space to perform local feature extraction. The parameters of this layer are the size of the rectangular section (kernel size), the number of filters (number of feature maps), the stride and the padding. The stride is the amount by which the filter shifts, and padding consists of enlarging the image around its border with zero-valued pixels [22]. The pooling layer is designed to achieve spatial invariance by reducing the resolution of the feature maps. It takes inputs from the previous convolutional layer and generates a lower-resolution version, typically by taking the maximum filter activation within a small rectangular region. The parameters of this so-called max-pooling layer are the size of the region, the stride and the padding [15].
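The effect of kernel size, stride and padding on the resolution of a feature map can be made concrete with a small sketch. The output-size formula is the standard one for convolution and pooling windows; the function names and the loop-based max-pooling are illustrative and are not the implementation used in the experiments.

```python
import numpy as np

def output_size(n_in, kernel, stride=1, padding=0):
    """Output length along one axis for a window of the given size, stride and zero padding."""
    return (n_in + 2 * padding - kernel) // stride + 1

def max_pool2d(fmap, size=2, stride=2):
    """Max-pooling without padding: keep the largest activation in each window."""
    fmap = np.asarray(fmap)
    h = output_size(fmap.shape[0], size, stride)
    w = output_size(fmap.shape[1], size, stride)
    out = np.empty((h, w), dtype=fmap.dtype)
    for i in range(h):
        for j in range(w):
            window = fmap[i * stride : i * stride + size,
                          j * stride : j * stride + size]
            out[i, j] = window.max()
    return out

pooled = max_pool2d(np.arange(16).reshape(4, 4))  # 4 x 4 map reduced to 2 x 2
```

A pooling window of size 2 with stride 2 halves each spatial dimension, which is the resolution reduction described above.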

4 Experiment

In this section, we describe the experiments carried out in this work. Subsection 4.1 contains a description of the database. In Subsect. 4.2, we present the feature extraction methodology. The CNN architecture is presented in Subsect. 4.3. Finally, Subsect. 4.4 contains the experimental setup.

4.1 Database

The Massachusetts Eye and Ear Infirmary (MEEI) database [9] consists of 710 sustained /a/ vowel voice signals and 715 voice signals from the first 12 s of the Rainbow Passage, obtained from 777 subjects. Signals were acquired with a low noise level, constant microphone distance, 16-bit resolution, and a sampling rate of 25,000 Hz for pathological voice signals and 50,000 Hz for normal voice signals. Altogether, 53 signals of the sustained vowel /a/ and 53 signals of the Rainbow Passage are normal voices; the remainder are voices affected by various pathologies, ranging from vocal fold pathologies such as nodules and cysts to neurological conditions such as Parkinson’s disease and stuttering. In the present work, we used normal signals (53), voice pathology signals (168), and signals affected by vocal fold paralysis (67), vocal fold edema (44), vocal fold keratosis (26), vocal fold nodule (19) and vocal fold polyp (20). We also grouped nodule and polyp signals with cyst signals (4) in a class called Focal Benign Lesion of the Lamina Propria (43), according to [17]. For conciseness, we call this class Lesion.

4.2 Feature Extraction

First, we augmented the dataset using the approach described in Sect. 3.1, applying TS with \(\alpha = 0.8, 0.9, 1.1, 1.2\). After that, we segmented the signals using two different frame sizes: 20 ms and 10 pitch cycles, both with 50% overlap. Then, we obtained the reconstructed phase space for each frame using \(m = 2\) (see Eq. 1). Next, we generated an image for each frame from the phase space trajectory (see Fig. 1). Table 2 shows the total number of images per class without data augmentation. The total with data augmentation is 5\(\times \) larger. In Figs. 1 and 2, gray levels are inverted for better visualization. For dimensionality reduction, each image was divided into non-overlapping boxes of N \(\times \) N pixels. Each box was assigned the number of trajectory pixels of the original image that fall inside it. We then generated a grayscale image in which the pixel intensities represent the counts obtained for each box. In this work, we used boxes with N = 10 and N = 15 (Fig. 1).
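The framing and box-counting steps can be sketched as follows, for fixed-length frames only. Several details are assumptions made for illustration and are not stated in the paper: the trajectory is rasterized point by point (without drawing line segments between consecutive states), the points are assumed to be already scaled to [-1, 1], and image rows or columns that do not fill a complete box are discarded. All function names are hypothetical.

```python
import numpy as np

def frame_signal(x, frame_len, overlap=0.5):
    """Split x into frames of frame_len samples with the given fractional overlap."""
    x = np.asarray(x, dtype=float)
    hop = max(1, int(round(frame_len * (1 - overlap))))
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([x[s : s + frame_len] for s in starts])

def rasterize(points, size):
    """Binary size x size image of 2D points assumed to lie in [-1, 1] (an assumption)."""
    points = np.asarray(points, dtype=float)
    img = np.zeros((size, size), dtype=np.uint8)
    idx = np.clip(np.round((points + 1) / 2 * (size - 1)).astype(int), 0, size - 1)
    img[size - 1 - idx[:, 1], idx[:, 0]] = 1  # row 0 is the top of the image
    return img

def box_count(img, n):
    """Replace each non-overlapping n x n box by the number of nonzero pixels inside it."""
    h, w = img.shape
    h, w = h - h % n, w - w % n               # drop incomplete boxes (an assumption)
    blocks = img[:h, :w].reshape(h // n, n, w // n, n)
    return blocks.sum(axis=(1, 3))

frames = frame_signal(np.arange(10), frame_len=4)        # hop of 2 samples
reduced = box_count(np.ones((20, 20), dtype=np.uint8), 10)
```

At the 25,000 Hz sampling rate of the pathological signals, a 20 ms frame corresponds to 500 samples, so a 50% overlap gives a hop of 250 samples. Frames of 10 pitch cycles additionally require a pitch estimate, which this sketch does not cover.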

Table 2. Total of images per class.

Fig. 1. (a) Reconstructed Phase Space Trajectory of the first 20 ms of an Edema voice (CAC10AN) from MEEI database and image after dimensionality reduction using (b) N = 10, and (c) N = 15.

4.3 CNN Architecture

The network is designed with 9 layers, as follows: input \(\Rightarrow \) convolution (16 kernels, stride 3) \(\Rightarrow \) max-pooling (size 2) \(\Rightarrow \) convolution (32 kernels, stride 3) \(\Rightarrow \) max-pooling (size 2) \(\Rightarrow \) convolution (64 kernels, stride 3) \(\Rightarrow \) max-pooling (size 2) \(\Rightarrow \) dense ReLU (1000 neurons) \(\Rightarrow \) dense Softmax (2 neurons). Three convolutional layers were used to transform the input into a set of local features that become more abstract as the network depth increases. A pooling layer follows each convolutional layer to reduce dimensionality. The last component of our network is a stack of 2 fully connected layers ending in a Softmax layer with 2 neurons (one neuron for each class) for the final classification. For the convolutional layers, we used 16, 32 and 64 kernels of size 3, with the Rectified Linear Unit (ReLU) as the activation function. Other designs were evaluated empirically, but gave poorer results.

4.4 Experimental Setup

We opted to perform pairwise classification experiments for the chosen pathologies, since this is a common approach for dealing with imbalanced datasets such as MEEI. A final classification can be obtained from multiple pairwise classifiers via a majority vote scheme, but this is outside the scope of this paper. For each classification, the number of signals per class in the training set corresponded to 50% of the class with the fewest signals. For the validation set, the number of signals per class corresponded to 10% of the class with the fewest signals, and the remaining signals were used for the test set. For example, in the Paralysis versus Polyp classification, 10 signals of each class were used for training and 2 for validation, leaving 55 Paralysis and 8 Polyp signals for testing. This procedure balanced the number of signals of each class during neural network training. The signals for each set were chosen randomly each time, and we ensured that no signal had segments in more than one set. Each classification was performed 10 times and the results were averaged. We trained the proposed model using stochastic gradient descent with momentum, with a learning rate of 0.01 and a momentum of 0.9. The batch size was K/10, where K is the total number of images in the training set. Training ended when the validation loss showed no improvement for 5 epochs.
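The signal-level split described above can be sketched as follows. Rounding the 50% and 10% fractions down is an assumption (the paper does not specify how odd class sizes were handled), and the function names are hypothetical. Splitting at the level of signals, before framing, is what guarantees that no signal contributes frames to more than one set.

```python
import random

def split_sizes(n_class_a, n_class_b, train_pct=50, val_pct=10):
    """Per-class training and validation sizes, based on the smaller class.

    Fractions are rounded down (an assumption).
    """
    n_min = min(n_class_a, n_class_b)
    return n_min * train_pct // 100, n_min * val_pct // 100

def split_signals(signal_ids, n_train, n_val, rng):
    """Randomly partition the signals of one class into train, validation and test sets."""
    ids = list(signal_ids)
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train : n_train + n_val], ids[n_train + n_val :]

# Paralysis (67 signals) versus Polyp (20 signals), as in the example above.
rng = random.Random(0)
n_train, n_val = split_sizes(67, 20)
paralysis = split_signals(range(67), n_train, n_val, rng)
polyp = split_signals(range(20), n_train, n_val, rng)
```

For this pair the sketch reproduces the counts given in the text: 10 training and 2 validation signals per class, with 55 Paralysis and 8 Polyp signals left for testing.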

5 Results and Discussion

In a visual analysis of the RPS images, the trajectory of a normal voice (Fig. 2(a)) is more regular than the trajectories of pathological voices (Figs. 2(b) to (f)). This is because normal voices are more periodic than pathological voices. There are visible differences between the pathological classes, but there are also visible differences between signals of the same pathology from different subjects (Figs. 1 and 2(b)). The CNN classifier was trained to extract features from the images in order to identify the pathology. Only frames, rather than whole input signals, were used for training. For experimental evaluation purposes, we created a meta-classification rule that outputs the most frequent class label among all classified frames of a particular signal.
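The meta-classification rule amounts to a majority vote over frame-level predictions, as in the following minimal sketch. The function name is hypothetical, and the handling of ties (here, the label encountered first wins) is an assumption, since the paper does not specify it.

```python
from collections import Counter

def classify_signal(frame_labels):
    """Return the most frequent label among the frame-level predictions of one signal."""
    if not frame_labels:
        raise ValueError("no frames to vote on")
    # most_common(1) yields [(label, count)]; ties go to the label seen first.
    return Counter(frame_labels).most_common(1)[0][0]

decision = classify_signal(["polyp", "nodule", "polyp", "polyp", "nodule"])
```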

The best results with and without data augmentation are presented in Table 3. The results were evaluated according to sensitivity (SE), specificity (SP) and accuracy (ACC). N is the box size. In most of the classifications, the best results occurred with frames of 10 pitch cycles. Also, in most cases the best results occurred when using data augmentation. Another highlight is the result of the polyp versus nodule classification, where the accuracy was above 90%, with 100% specificity.
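For reference, the three measures follow the usual definitions based on the counts of a binary confusion matrix, sketched below. Which class of each pair is treated as positive is not stated in the text, and the counts in the example are made up purely for illustration; they are not results from this work.

```python
def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from binary confusion-matrix counts."""
    se = tp / (tp + fn)                    # fraction of positives correctly identified
    sp = tn / (tn + fp)                    # fraction of negatives correctly identified
    acc = (tp + tn) / (tp + fn + tn + fp)  # fraction of all signals correctly classified
    return se, sp, acc

# Hypothetical counts, for illustration only.
se, sp, acc = binary_metrics(tp=8, fn=2, tn=9, fp=1)
```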

As in the case without data augmentation, most of the best results with data augmentation were obtained with frames of 10 pitch cycles. In most cases the results were better than those obtained without data augmentation, which is a good indicator that the augmentation strategy was beneficial to the CNN classifier. However, there is an exception: results involving the polyp class did not improve with data augmentation. The correct classification rates for polyp versus edema and polyp versus keratosis got worse, while the polyp versus paralysis and polyp versus nodule results did not change significantly. We suspect that the time stretching used for data augmentation modifies discriminative characteristics of the polyp signal. An investigation of this aspect is left as future work.

Fig. 2. Reconstructed Phase Space Trajectory of the first 20 ms of (a) a Normal voice (AXH1NAL), (b) an Edema voice (CAK25AN), (c) a Paralysis voice (RAN30AN), (d) a Nodule voice (MXN24AN), (e) a Polyp voice (MPB23AN) and (f) a Keratosis voice (EMP27AN).

Table 3. Best results without and with data augmentation.

6 Conclusions and Future Work

In this paper, we proposed a novel method for the classification of voices affected by pathologies, based on RPS images and a CNN. In general, using a variable frame (10 pitch cycles) to generate the RPS images yielded better results than using a fixed frame (20 ms). Data augmentation turned out to be a promising way to increase the correct classification rates of voice pathologies, but the chosen method adversely affected the classifications involving the polyp class. As future work, we will analyze polyp signals to understand how the time stretching method affected them, and propose additional data augmentation methods that do not affect the characteristics of the pathologies.