This document describes building a speech emotion detection classifier. It discusses the need for such classifiers in safety systems to monitor worker mental states. It then outlines the process of collecting speech datasets, preparing and exploring the data, augmenting it, extracting features, building a convolutional neural network model, training it to classify emotions in speech, and evaluating the trained model's performance. The model achieved 70% overall accuracy, performing best on emotions such as surprise and anger, and could be improved with more data augmentation and different feature extraction methods.
Why do we need a speech emotion detection classifier?
● In a work safety system, emotion recognition can provide important information about the mental state of workers, which can be used to help prevent work accidents.

Dataset
● ravdess-emotional-speech-audio
● cremad
● toronto-emotional-speech-set-tess
● surrey-audiovisual-expressed-emotion-savee
● These four datasets can be downloaded for free from the internet.

Importing Libraries

Data Preparation
1. ravdess-emotional-speech-audio dataframe: a Python script walks the ravdess-emotional-speech-audio directory and collects the file paths and emotion labels into a dataframe.
2. cremad dataframe: the same script logic is applied to the cremad directory.
3. toronto-emotional-speech-set-tess dataframe: the same script logic is applied to the toronto-emotional-speech-set-tess directory.
4. surrey-audiovisual-expressed-emotion-savee dataframe: the audio files in this dataset are named in such a way that the prefix letters describe the emotion classes as follows:
'a' = 'anger', 'd' = 'disgust', 'f' = 'fear', 'h' = 'happiness', 'n' = 'neutral', 'sa' = 'sadness', 'su' = 'surprise'
A sketch of this loading step is shown below.
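As a rough illustration of the loading step for the SAVEE case (assuming the files sit in a flat directory and that each file name starts with the emotion prefix described above; real file names may also carry a speaker prefix, and the directory path, helper name, and column names are assumptions, not the exact original script):

```python
import os
import pandas as pd

def load_savee(savee_dir):
    """Build a dataframe of (emotion, file path) pairs from the SAVEE directory.

    The emotion is decoded from the prefix letters of each file name,
    e.g. 'sa01.wav' -> 'sad', 'h05.wav' -> 'happy'.
    """
    prefix_to_emotion = {
        'a': 'angry', 'd': 'disgust', 'f': 'fear',
        'h': 'happy', 'n': 'neutral', 'sa': 'sad', 'su': 'surprise',
    }
    emotions, paths = [], []
    for file_name in os.listdir(savee_dir):
        if not file_name.endswith('.wav'):
            continue
        # Two-letter prefixes ('sa', 'su') must be checked before one-letter ones.
        prefix = file_name[:2] if file_name[:2] in ('sa', 'su') else file_name[:1]
        emotions.append(prefix_to_emotion.get(prefix, 'unknown'))
        paths.append(os.path.join(savee_dir, file_name))
    return pd.DataFrame({'Emotions': emotions, 'Path': paths})

# The per-dataset dataframes are then concatenated into a single table, e.g.:
# savee_df = load_savee('surrey-audiovisual-expressed-emotion-savee/ALL/')
# data_df = pd.concat([ravdess_df, crema_df, tess_df, savee_df], axis=0, ignore_index=True)
```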
Data Visualisation and Exploration
First, let's plot the count of each emotion in our dataset. This step generates a count plot using the Seaborn library to visualize the distribution of the different emotions in the dataset; a sketch is shown below.
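A minimal sketch of this count plot, assuming the combined dataframe data_df with an Emotions column from the loading step above (variable and column names are assumptions):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Count plot of the emotion labels in the combined dataframe.
plt.figure(figsize=(10, 5))
sns.countplot(x='Emotions', data=data_df)
plt.title('Count of Emotions')
plt.xlabel('Emotion')
plt.ylabel('Count')
plt.show()
```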
Plot Waveplots and Spectrograms for Audio Signals (Fear)
Waveplots let us know the loudness of the audio at a given time. A spectrogram is a visual representation of the spectrum of frequencies of a sound or other signal as it varies with time, i.e. a representation of how the frequency content of an audio/music signal changes over time (a plotting sketch is given at the end of this section).

Plot Waveplots and Spectrograms for Audio Signals (Angry)
Plot Waveplots and Spectrograms for Audio Signals (Sad)
Plot Waveplots and Spectrograms for Audio Signals (Happy)

Data Augmentation
Data augmentation is the process by which we create new synthetic data samples by adding small perturbations to our initial training set. To generate synthetic data for audio, we can apply noise injection, time shifting, and changes in pitch and speed. The objective is to make our model invariant to those perturbations and to enhance its ability to generalize. For this to work, the added perturbations must preserve the same label as the original training sample. (For images, data augmentation can be performed by shifting the image, zooming, rotating, and so on.) First, let's check which augmentation techniques work better for our dataset (a sketch of these functions is given at the end of this section).
1. Simple Audio: plots the waveform of an audio signal and allows interactive playback of the corresponding audio file. This is useful for visualizing the structure of an audio signal or for checking the quality of an audio file.
2. Noise Injection: adds noise to an audio signal, plots its waveform, and allows interactive playback of the resulting noisy signal. This is useful for simulating real-world audio scenarios or for testing the robustness of audio processing algorithms to noise.
3. Stretching: applies time stretching to an audio signal, plots its waveform, and allows interactive playback of the resulting signal.
4. Shifting: shifts an audio signal by a certain number of frames, plots its waveform, and allows interactive playback of the resulting shifted signal. This is useful for modifying the timing or tempo of an audio signal or for aligning multiple audio signals.
5. Pitch: pitch is a useful feature in voice recognition systems because it can help distinguish between different speakers, identify emotional state, and aid speech recognition.

Feature Extraction
Extraction of features is a very important part of analyzing and finding relations between different things. As we already know, raw audio data cannot be understood by models directly, so we need to convert it into an understandable format, which is what feature extraction is used for.

Feature Extraction (Cont.)
This step collects feature and label data from multiple audio files so that the data can be manipulated and analyzed in tabular form (see the feature-extraction sketch at the end of this section).

Data Preparation

Modelling
Creates a convolutional neural network (CNN) using the Keras API for TensorFlow and compiles it for training with the Adam optimizer and categorical cross-entropy loss (see the model sketch at the end of this section).

Train a Neural Network Model
Trains the neural network with the ReduceLROnPlateau callback to adjust the learning rate during training. Using ReduceLROnPlateau can help the model achieve better convergence and reduce the risk of overfitting the training data (see the training sketch at the end of this section).

Displays Model Accuracy on Test Data and Displays Graphs of Loss and Accuracy in the Model Training Process
Predicts the labels of the test data using the trained model, and then converts the predicted labels and the actual labels back to their original categorical form.

Illustrative code sketches for the plotting, augmentation, feature extraction, modelling, and training steps described above follow.
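A possible sketch of the waveplot and spectrogram plotting step, using librosa and matplotlib (librosa.display.waveshow stands in for the older waveplot function; the helper name and data_df are assumptions):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def show_wave_and_spectrogram(path, emotion):
    """Plot the waveform and spectrogram of a single audio file."""
    data, sr = librosa.load(path)
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
    # Waveplot: amplitude (loudness) over time.
    librosa.display.waveshow(data, sr=sr, ax=ax1)
    ax1.set_title(f'Waveplot for audio with {emotion} emotion')
    # Spectrogram: frequency content over time, in decibels.
    stft_db = librosa.amplitude_to_db(np.abs(librosa.stft(data)))
    img = librosa.display.specshow(stft_db, sr=sr, x_axis='time', y_axis='hz', ax=ax2)
    ax2.set_title(f'Spectrogram for audio with {emotion} emotion')
    fig.colorbar(img, ax=ax2, format='%+2.0f dB')
    plt.tight_layout()
    plt.show()

# e.g. show_wave_and_spectrogram(data_df[data_df.Emotions == 'fear'].Path.iloc[0], 'fear')
```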
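A sketch of the four augmentation transforms; the noise factor, stretch rate, shift range, and pitch step values are illustrative defaults, not the exact values used originally:

```python
import numpy as np
import librosa

def noise(data, noise_factor=0.035):
    """Inject random white noise scaled by the signal's peak amplitude."""
    noise_amp = noise_factor * np.random.uniform() * np.amax(data)
    return data + noise_amp * np.random.normal(size=data.shape)

def stretch(data, rate=0.8):
    """Time-stretch the signal (change speed/duration without changing pitch)."""
    return librosa.effects.time_stretch(data, rate=rate)

def shift(data, sr=22050, max_shift_s=0.5):
    """Shift the signal left or right by a random number of samples."""
    shift_amt = int(np.random.uniform(-max_shift_s, max_shift_s) * sr)
    return np.roll(data, shift_amt)

def pitch(data, sr, n_steps=0.7):
    """Shift the pitch of the signal by a fractional number of semitones."""
    return librosa.effects.pitch_shift(data, sr=sr, n_steps=n_steps)
```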
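A sketch of the feature extraction step, averaging a few standard librosa features (zero-crossing rate, MFCCs, RMS energy, mel spectrogram) per clip and collecting them into a table; the exact feature set, the 2.5 s / 0.6 s load window, and the use of the noise and pitch helpers from the previous sketch are assumptions:

```python
import numpy as np
import pandas as pd
import librosa

def extract_features(data, sr):
    """Concatenate a few standard audio features into one flat vector."""
    zcr = np.mean(librosa.feature.zero_crossing_rate(y=data).T, axis=0)
    mfcc = np.mean(librosa.feature.mfcc(y=data, sr=sr).T, axis=0)
    rms = np.mean(librosa.feature.rms(y=data).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=data, sr=sr).T, axis=0)
    return np.hstack([zcr, mfcc, rms, mel])

def get_features(path):
    """Load a clip and return features for the original and augmented versions."""
    data, sr = librosa.load(path, duration=2.5, offset=0.6)
    return [
        extract_features(data, sr),                # original clip
        extract_features(noise(data), sr),         # noise-injected copy
        extract_features(pitch(data, sr), sr),     # pitch-shifted copy
    ]

# Build the tabular feature/label data from the combined dataframe.
X, y = [], []
for path, emotion in zip(data_df.Path, data_df.Emotions):
    for feat in get_features(path):
        X.append(feat)
        y.append(emotion)
features_df = pd.DataFrame(X)
features_df['labels'] = y
```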
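A sketch covering the Data Preparation and Modelling steps: label encoding, train/test split, scaling, and a small 1D CNN compiled with Adam and categorical cross-entropy. It continues from the X and y built above; the layer sizes and split settings are illustrative, not the original architecture:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from tensorflow import keras
from tensorflow.keras import layers

# Encode the emotion labels as one-hot vectors.
label_encoder = LabelEncoder()
Y = keras.utils.to_categorical(label_encoder.fit_transform(y))

# Split, standardize, and add the channel axis that Conv1D expects.
x_train, x_test, y_train, y_test = train_test_split(np.array(X), Y, test_size=0.2, random_state=42)
scaler = StandardScaler()
x_train = np.expand_dims(scaler.fit_transform(x_train), axis=2)
x_test = np.expand_dims(scaler.transform(x_test), axis=2)

# A small 1D CNN over the feature vector.
model = keras.Sequential([
    layers.Conv1D(256, kernel_size=5, padding='same', activation='relu',
                  input_shape=(x_train.shape[1], 1)),
    layers.MaxPooling1D(pool_size=5, strides=2, padding='same'),
    layers.Conv1D(128, kernel_size=5, padding='same', activation='relu'),
    layers.MaxPooling1D(pool_size=5, strides=2, padding='same'),
    layers.Dropout(0.2),
    layers.Conv1D(64, kernel_size=5, padding='same', activation='relu'),
    layers.MaxPooling1D(pool_size=5, strides=2, padding='same'),
    layers.Flatten(),
    layers.Dense(32, activation='relu'),
    layers.Dense(Y.shape[1], activation='softmax'),  # one unit per emotion class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
```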
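A sketch of the training and evaluation step with ReduceLROnPlateau, followed by the test accuracy and the loss/accuracy curves; the epoch count, batch size, and callback settings are assumptions:

```python
import matplotlib.pyplot as plt
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Halve the learning rate when the validation loss stops improving.
rlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, min_lr=1e-6, verbose=1)
history = model.fit(x_train, y_train,
                    validation_data=(x_test, y_test),
                    epochs=50, batch_size=64,
                    callbacks=[rlrp])

# Accuracy on the held-out test data.
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'Accuracy on test data: {test_acc:.2%}')

# Loss and accuracy curves from the training process.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(history.history['loss'], label='train loss')
ax1.plot(history.history['val_loss'], label='val loss')
ax1.set_title('Loss')
ax1.legend()
ax2.plot(history.history['accuracy'], label='train accuracy')
ax2.plot(history.history['val_accuracy'], label='val accuracy')
ax2.set_title('Accuracy')
ax2.legend()
plt.show()
```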
Creating a Confusion Matrix and Then Visualizing It Using a Heatmap
A confusion matrix is a table that is often used to evaluate the performance of a classification model. It shows the number of true positive, true negative, false positive, and false negative predictions for each class.
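A sketch of the prediction and confusion-matrix heatmap step, continuing from the hypothetical names in the earlier sketches (model, x_test, y_test, label_encoder):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report

# Predict class probabilities, then map predictions and ground truth back to label names.
pred_probs = model.predict(x_test)
y_pred = label_encoder.inverse_transform(np.argmax(pred_probs, axis=1))
y_true = label_encoder.inverse_transform(np.argmax(y_test, axis=1))

# Confusion matrix visualized as a heatmap.
cm = confusion_matrix(y_true, y_pred, labels=label_encoder.classes_)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=label_encoder.classes_,
            yticklabels=label_encoder.classes_)
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix')
plt.show()

print(classification_report(y_true, y_pred))
```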
CONCLUSION
We can see that our model is more accurate at predicting the surprise and angry emotions, which makes sense because audio files for these emotions differ from the others in many ways, such as pitch and speed. We achieved 70% accuracy overall on our test data, which is decent, but we can improve it further by applying more augmentation techniques and using other feature extraction methods.