Speech Emotion Recognition Using Deep Learning
https://doi.org/10.22214/ijraset.2022.42973
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 10 Issue V May 2022- Available at www.ijraset.com
Abstract: Identifying emotional states from the analysis of vocalisations is a challenging subject in the field of Human-Computer Interaction (HCI). In research on speech emotion recognition (SER), a wide range of approaches has been used to extract emotions from a variety of inputs, including several well-known methods of speech analysis and classification. Recent research has suggested deep learning algorithms as potential alternatives to the approaches traditionally used in SER. This article offers a summary of deep learning methodologies and discusses recent research that employs them to identify the emotions conveyed by verbal expressions. The analysis considers the emotions recorded in the databases that were utilised, the contributions made to speech emotion recognition, the limitations that were found, and the discoveries that were made.
Keywords: Speech emotions, Real-time Speech Classification, Transfer Learning, HCI, Bandwidth Reduction, SER, LSTM
I. INTRODUCTION
The identification of speech emotions has developed from a specialised field of application to an essential part of human-computer interaction (HCI) [1]. The goal of these systems is to make interaction between humans and machines more natural and straightforward by interpreting information conveyed verbally through voice contact rather than through conventional means of input, and by simplifying the task for the human listeners to whom a response may be given. Expressions of emotion gleaned from speech find use in contact-centre conversations, driver-assistance systems built into automobiles, medical applications, and spoken dialogue systems [2]. Additional problems with HCI systems cannot be ignored, however, especially as the testing of systems moves from the laboratory to real-world application. As a consequence, efforts are required to overcome these challenges and improve computer emotion recognition [3].
There is a possibility that the quality of voice portals or contact centres will be evaluated using an anger-detection method. Such a method makes it possible for service providers to tailor their offers to the emotions that their customers are currently experiencing. In the field of civil aviation, monitoring the stress levels of pilots may help lower the likelihood of an aircraft accident [4]. Many researchers have included emotion-detection modules in their products in order to attract and retain a larger number of users, for example in video games. In order to improve the overall quality of cloud-based gaming, Hossain et al. used the capacity to detect emotions and their effects on players' emotional states; the objective is to raise the level of involvement felt by players by customising the game to their internal emotional states. Within the domain of mental health, a psychiatric consultation chatbot that provides therapeutic care has been recommended [5].
A chatbot that engages in conversation and makes use of voice emotion detection to facilitate dialogue is yet another concept for an emotion-recognition application. A real-time SER application should strive for the optimal combination of low processing load, fast response times, and a high level of accuracy.
Emotions are among the most important parts of speech. The techniques of speech emotion recognition (SER) comprise feature extraction and feature classification. Excitation-source characteristics, as well as prosodic characteristics, have been developed by experts in the area of speech processing. The second step is to classify the extracted features, separating the data using both linear and non-linear classifiers.
Bayesian Networks (BN), the Maximum Likelihood Principle (MLP), and the Support Vector Machine (SVM) are among the linear techniques used most often for the identification of feelings and emotions. The voice signal, however, is generally regarded as non-stationary. As a direct consequence of this, non-linear classifiers are thought to achieve better results for SER.
The Gaussian Mixture Model (GMM) and the Hidden Markov Model (HMM) are two non-linear classifiers commonly used for SER. Classification often makes use of the most fundamental acoustic features: power-based characteristics such as Linear Predictor Coefficients (LPC), Mel Energy-spectrum Dynamic Coefficients (MEDC), and Mel Frequency Cepstral Coefficients (MFCC), in addition to cepstral coefficients derived from the Perceptual Linear Prediction (PLP) model, are often employed as efficient tools for identifying feelings from audio. Other methods that may be used for emotion identification include K-Nearest Neighbour (KNN), decision trees, and Principal Component Analysis (PCA).
II. OBJECTIVE
The basic goal of SER is to improve human interaction with machines and thereby advance human health and well-being. It is also employed in lie-detection sensors that can follow a person's state of mind [6]. Speech emotion detection has recently seen an uptick in its prevalence in both the medical and forensic scientific communities. In this work, pitch, timbre, and prosodic characteristics are used to identify seven distinct emotions.
A. Feature Extraction
We utilised 39 MFCC features (12 MFCC + energy, 12 delta MFCC + energy) within the scope of this research, in addition to the Zero Crossing Rate, the Teager Energy Operator, and the Harmonic-to-Noise Ratio.
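The paper includes no code for this step, so the following is only a minimal sketch of how such a feature matrix could be computed with librosa. It assumes the conventional 13 static + 13 delta + 13 delta-delta layout for a 39-dimensional MFCC vector (with the 0th coefficient serving as the energy term); the Teager Energy Operator and Harmonic-to-Noise Ratio are omitted because librosa has no built-in functions for them.

```python
import librosa
import numpy as np

def extract_features(path, sr=16000):
    """Per-frame 39-dim MFCC-based features plus zero-crossing rate."""
    y, sr = librosa.load(path, sr=sr)

    # 13 static MFCCs; coefficient 0 acts as a log-energy term.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    delta = librosa.feature.delta(mfcc)             # first-order deltas
    delta2 = librosa.feature.delta(mfcc, order=2)   # second-order deltas

    # Zero Crossing Rate, one value per frame (same default hop as MFCC).
    zcr = librosa.feature.zero_crossing_rate(y)

    # (40, n_frames): 39 MFCC-based rows stacked with the ZCR row.
    return np.vstack([mfcc, delta, delta2, zcr])
```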
C. Classification Model
SVMs were first developed for discrimination between two different classes; expanding them into a multi-class splitter has seen several developments. Multi-class SVMs find applications in many fields and have proven beneficial in classifying a wide range of data sources [18]. To solve a multi-class problem, the SVM first performs a decomposition into quite a few binary classifiers. Support vector machines are a kind of supervised machine learning with applications in data classification and forecasting. The method attempts to acquire a hyperplane that can be used to categorise the data with a very large margin, which serves to separate the training examples presented in the feature space. The kernel function is the component used most often: a variety of functions, including linear, polynomial, and RBF kernels, are used to categorise new values based on the training dataset and to conduct analysis on them. As a consequence, the only choice available when using this partition is to discover the relevant kernel function and adjust its settings to achieve the maximum level of detection possible [10]. When we switch the type of SVM kernel, we adjust the SVM classifier's parameters based on actual data in order to determine the user-chosen values r, d, and c, and thereby find which kernel parameters are optimal for our system.
D. LSTM Model
The LSTM is an advanced kind of recurrent neural network that has the ability to learn long-term trends in data, owing to the fact that its repeating module is composed of four layers that communicate with one another [19]. The accompanying figure depicts the four neural-network layers as yellow boxes, point-wise operators as green circles, and the output and cell state as blue circles. The LSTM module consists of three gates and a cell state; this construction gives each unit the ability to read some of the information, leave some of it unread, or store it [17]. The cell state in the LSTM determines whether data may travel through it by preventing most modifications to the units and permitting just a handful of interactions. Each unit is equipped with an input gate, an output gate, and a forget gate for adding and removing data based on the current cell state. The forget gate uses a sigmoid function to determine whether information from the prior cell state should be discarded. To control the flow of information into the current cell state, the input gate carries out a point-wise multiplication of 'sigmoid' and 'tanh' operations. Finally, the output gate chooses which information should be passed on to the next hidden state [20].
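To make the gate description concrete, the sketch below builds a small LSTM classifier in Keras over sequences of frame-level feature vectors. The sequence length, layer widths, and optimiser are illustrative assumptions; only the seven-class output follows from the seven emotions mentioned earlier, and this is not necessarily the paper's exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

N_FRAMES, N_FEATURES, N_EMOTIONS = 300, 40, 7  # assumed dimensions

def build_lstm_model():
    model = models.Sequential([
        # Input: one utterance as a sequence of per-frame feature vectors.
        layers.Input(shape=(N_FRAMES, N_FEATURES)),
        # The LSTM cell implements the input, forget, and output gates
        # described above; the final hidden state summarises the utterance.
        layers.LSTM(128),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.Dense(N_EMOTIONS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```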
V. RESULT
1) The accuracy graph shows that the trained model achieves very good accuracy.
The image below shows the waveform (waveshow) plot produced for one emotion, 'happy'.
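A waveform plot of this kind can be reproduced with librosa's waveshow; the file name below is a hypothetical stand-in for a recording labelled 'happy'.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("happy_sample.wav")  # hypothetical example file
librosa.display.waveshow(y, sr=sr)
plt.title("Waveform of a 'happy' utterance")
plt.tight_layout()
plt.show()
```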
VI. CONCLUSION
The purpose of this study is to provide a general perspective on recent developments in speech emotion recognition systems. The mission of SER research is to find new and improved ways to extract emotion from audio. Deep convolutional architectures capable of learning from representations of voice spectrograms are becoming increasingly common, and, together with recurrent network structures, they are widely acknowledged as a reliable foundation for SER systems. SER designs have become more intricate over the course of time, with a focus on obtaining emotionally meaningful information from the global context. According to the results of our research, the attention mechanism is capable of assisting SER, although improved system performance may not always manifest itself in observable ways.
REFERENCES
[1] Yenigalla, P.; Kumar, A.; Tripathi, S.; Singh, C.; Kar, S.; Vepa, J. Speech Emotion Recognition Using Spectrogram & Phoneme Embedding. In Proceedings of
the INTERSPEECH, Hyderabad, India, 2–6 September 2018.
[2] Koolagudi, S.G.; Murthy, Y.V.S.; Bhaskar, S.P. Choice of a classifier, based on properties of a dataset: Case study-speech emotion recognition. Int. J. Speech
Technol. 2018, 21, 167–183.
[3] Zhang, S.; Zhang, S.; Huang, T.; Gao, W. Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid
Matching. IEEE Trans. Multimed. 2018, 20, 1576–1590.
[4] Xi, Y.; Li, P.; Song, Y.; Jiang, Y.; Dai, L. Speaker to Emotion: Domain Adaptation for Speech Emotion Recognition with Residual Adapters. In Proceedings of
the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Lanzhou, China, 18–21 November
2019; pp. 513–518
[5] B. W. Schuller, “Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends,” Communications of the ACM, vol. 61, no. 5, pp.
90–99, 2018.
[6] M. S. Hossain and G. Muhammad, “Emotion recognition using deep learning approach from audio–visual emotional big data,” Information Fusion, vol. 49, pp.
69–78, 2019.
[7] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, “Speech recognition using deep neural networks: A systematic review,” IEEE Access, vol. 7, pp. 19143–19165, 2019.