
PROJECT REPORT (KEC-851)

On

EMOTION DETECTION USING DEEP LEARNING

Submitted for partial fulfillment of award of the degree of

Bachelor of Technology

In

Electronics and Communication Engineering

Submitted By

Aman Gangwar – 1901920310015


Ankit Raj – 1901920310022
Dubesh Chauhan – 1901920310055
Nidhish Kumar Singh - 1901920310095

Under the Guidance of

Dr. Mohan Singh


(Associate Professor)

Deptt. of Electronics and Communication Engineering


G. L. BAJAJ INSTITUTE OF TECHNOLOGY AND MANAGEMENT
Plot no. 2, Knowledge Park III, Gr. Noida
Session: 2022-23 (Even Sem)
DECLARATION

We certify that

1. The work contained in this Project Report is original and has been done by us under the
guidance of our supervisor.

2. The work has not been submitted to any other University or Institute for the award of any other
degree or diploma.

3. We have followed the guidelines provided by the University in preparing the Report.

4. We have conformed to the norms and guidelines in the Ethical Code of Conduct of the
University.

5. Whenever we used materials (data, theoretical analysis, figures, and texts) from other sources
we have given due credit to them by citing them in the text of the report and giving their details
in the references. Further, we have taken permission from the copyright owners of the sources,
whenever necessary.

Date: Signature

Place: Aman Gangwar


1901920310015
Signature

Ankit Raj
1901920310022
Signature

Dubesh Chauhan
1901920310055
Signature

Nidhish Kumar Singh


1901920310095

Deptt. of Electronics and Communication Engineering
G. L. BAJAJ INSTITUTE OF TECHNOLOGY AND MANAGEMENT
[Approved by AICTE, Govt. of India & Affiliated to A.K.T.U (Formerly U.P.T.U), Lucknow]

CERTIFICATE

Certified that Aman Gangwar(1901920310015), Ankit Raj(1901920310022), Dubesh


Chauhan(1901920310055), Nidhish Kumar Singh(1901920310095) have carried out the

project work (Project -II, KEC-851) presented in this report entitled “Emotion
Detection using Deep Learning” for the award of Bachelor of Technology in
Electronics and Communication Engineering during the Academic session 2022-23 from
Dr. A.P.J. Abdul Kalam Technical University (Formerly U.P.T.U), Lucknow. The
project embodies the results of the work and studies carried out by the students themselves, and
the contents of the report do not form the basis for the award of any other degree to the
candidates or to anybody else.

(Dr. Mohan Singh) (Dr. Piyush Yadav)


(Project Guide) (Project Coordinator)
(Associate Professor) (Associate Professor)
Deptt.of ECE Deptt.of ECE

(Dr. Satyendra Sharma)


HOD, Deptt.of ECE

Date:

ABSTRACT

Humans can effortlessly perceive emotions using their senses, whereas computer vision seeks to
mimic human vision by analyzing a digital image as input. For humans, detecting an emotion is
not a difficult task. Emotion can also be detected from the voice, for example detecting
‘stress’ with the help of parameters such as tone, pitch, pace, and volume; detecting emotion
simply by analyzing digital images is a comparatively novel approach. In this work, we design a
convolutional neural network model that can classify an input image into seven different
emotions: anger, disgust, fear, happiness, sadness, surprise, and neutral. To classify these
emotions, we implement Convolutional Neural Networks (CNNs), which can efficiently and
accurately extract semantic information from faces in an automatic manner. We also apply data
augmentation techniques in order to prevent overfitting and underfitting problems. The results
of this model show that it works better when it has a larger set of images to learn from. The
proposed model achieves an accuracy of more than 90%.
Here, we also recognize the age and gender of an individual using a model trained to predict
age and gender with a broad ResNet framework. The broad ResNet framework can be effectively
utilized for age and gender recognition tasks: its ability to capture complex features and
patterns, coupled with appropriate dataset preparation and training, enables accurate and
robust predictions of age and gender from images.

ACKNOWLEDGEMENT

In the absence of a mother, the birth of a child is not possible, and in the absence of a
teacher, the right path of knowledge is impossible. This project is by far the most significant
accomplishment in our lives, and it would have been impossible without the people who supported
us and believed in us.
We would like to extend our gratitude and sincere thanks to our honorable, esteemed guide
Dr. Mohan Singh (Associate Professor (ECE)), Department of Electronics and Communication
Engineering, GL Bajaj, Greater Noida, for his immeasurable guidance and the valuable time that
he devoted to this project. We sincerely thank him for his exemplary guidance and
encouragement. His trust and support inspired us in the most important moments of making the
right decisions, and we are glad to have worked with him.
We would also like to give very special thanks to our HOD, Dr. Satyendra Sharma.
Also, we would like to thank our project coordinator Dr. Piyush Yadav
(Associate Professor (ECE)) and our teachers for their support, help, and encouragement
during this work.
We would like to thank all our friends for the thoughtful and mind-stimulating
discussions we had, which prompted us to think beyond the obvious.
We have enjoyed their companionship so much during our stay at GL Bajaj, Greater
Noida. We would like to thank all those who made our stay in GL Bajaj, Greater Noida an
unforgettable and rewarding experience.
A boat held to its moorings will see the floods pass by; but detached from its moorings, it may
not survive the flood. The support of all the members of our families (especially our parents,
our sisters, and our brothers) motivated us to keep working even while facing the blues. We
dedicate this work to them.
Aman Gangwar (1901920310015)
Ankit Raj (1901920310022)
Dubesh Chauhan (1901920310055)

Nidhish Kumar Singh (1901920310095)

TABLE OF CONTENT

Page
No.
Declaration
ii
Certificate iii
Abstract iv
Acknowledgements v
Table of Contents vi-vii
List of Tables viii
List of Figures ix
Abbreviations x

CHAPTER 1: INTRODUCTION 1-8

1.1 Introduction 1
1.2 Objective 4
1.3 Motivation with problem formulation 6
1.4 Methodology 6
1.5 Structure of project report 6

CHAPTER 2: LITERATURE SURVEY 9-12

CHAPTER 3: PROPOSED METHODOLOGY 13-26
3.1 Proposed methodology 13
3.2 System Architecture 14
3.3 Data sets of speech emotion recognition 15
3.4 Speech Architecture of CNN 17
3.5 Feature of audio files 18
3.6 Classification model 19
3.7 Face emotions 20

3.8 CNN model 23
3.9 Feature extraction 24

CHAPTER 4: TECHNICAL SPECIFICATION 27-36
4.1 Classification using machine learning 27
4.2 Deep learning implementation 28
4.2.1 Neural network 30
4.2.2 Convolutional neural network 30
4.3 Technology learnt 31
4.4 System specification 36

CHAPTER 5: CONSTRAINTS, ALTERNATIVES & TRADEOFFS 37-42
5.1 System specification 37
5.2 Classification 37

CHAPTER 6: RESULT & DISCUSSION 43-65
6.1 Execution speed 43
6.2 Code 48
6.3 Result 53

CHAPTER 7: CONCLUSION & FUTURE WORK 66


7.1 Conclusion 66
7.2 Future Work 66
REFERENCES 67
USER MANUAL 71
LIST OF PUBLICATION 72
APPENDIX 73
CERTIFICATES 82

LIST OF FIGURES

Figure No. Figure Name Page No.

3.1 SYSTEM ARCHITECTURE 20
3.2 ARCHITECTURE OF CNN 21
3.3 PROPOSED TRAINING MODEL 22
3.4 PROPOSED TESTING MODEL 23
3.5 DROPOUT LAYER 25
3.6 PROPOSED FEATURE EXTRACTION METHOD 26
4.1 SEQUENCE DIAGRAM 27
4.2 USE CASE DIAGRAM 28
4.3 ACTIVITY DIAGRAM 29
4.4 COLLABORATION DIAGRAM 30
5.1 EXTRACTING IMAGES FROM DATASETS 32
5.2 SPEECH EMOTION LAYERS OF CNN 33
5.3 FACE EMOTION LAYERS OF CNN 35
5.4 RESULTING ACCURACY OF SPEECH EMOTION 36
5.5 MODEL LOSS 38
5.6 AUDIO RECORDING 39
5.7 RESULTS & OUTPUTS OF FACE EMOTION 40
5.8 OUTPUTS OF SPEECH EMOTION 41

LIST OF TABLES

Page No.

3.5.3 FACE DATASETS COUNT 16
4.2 COMPARISON OF ALGORITHM ACCURACY 35

LIST OF ACRONYMS

CNN CONVOLUTIONAL NEURAL NETWORK
MLP MULTILAYER PERCEPTRON
LSTM LONG SHORT-TERM MEMORY
MFCC MEL FREQUENCY CEPSTRUM COEFFICIENTS
LPCC LINEAR PREDICTION CEPSTRUM COEFFICIENTS
ANN ARTIFICIAL NEURAL NETWORK
SVM SUPPORT VECTOR MACHINE
RNN RECURRENT NEURAL NETWORK
PLP PERCEPTUAL LINEAR PREDICTION

CHAPTER 1 INTRODUCTION

Introduction
In recent years, the interaction between human beings and computers has been
continuously evolving, moving toward the goal of natural interaction. The most
expressive way for human beings to convey their feelings is through facial expressions.
Humans need little or no effort to detect and interpret faces and facial expressions in a
scene. Still, developing an automatic system to accomplish this task remains quite
difficult. There are several related problems: the classification of expressions (e.g., into
emotional classes), detecting image regions as faces, and extracting facial feature
information. A system that correctly performs these operations in the real world would
be a crucial step toward achieving human-like interaction between people and machines.
Predicting age, gender, and emotion is valuable for the social sciences and research. Researchers can
utilize deep learning and machine learning algorithms to efficiently gather demographic
information for studies, enabling deeper insights into how age, gender, and emotions
influence human behavior and decision-making processes. These predictive capabilities
contribute to a better understanding of societal dynamics and facilitate evidence-based
research. Human emotion detection is implemented in many areas requiring additional
security or information about the person. It can be seen as a second step to face detection
in which we may be required to set up a second layer of security: along with the face,
the emotion is also detected. This can be useful to verify that the person standing in front
of the camera is not just a two-dimensional representation [1]. Another important domain
where emotion detection matters is business promotion. Most businesses thrive on
customer responses to their products and offers. If an artificially intelligent system can
capture and identify real-time emotions based on a user's image or video, it can decide
whether the customer liked or disliked the product or
offer. We have seen that security is the main reason for identifying any person. It can be
based on fingerprint matching, voice recognition, passwords, retina detection, etc. Identifying
the intent of the person can also be important to avert threats. This can be helpful in
vulnerable areas like airports, concerts, and major public gatherings, which have
seen many breaches in recent years. Human emotions can be classified as: fear,
disgust, anger, surprise, sadness, happiness, and neutral. These emotions are very subtle. Facial
muscle contortions are very minimal and detecting these differences can be very
challenging as even a small difference results in different expressions [4]. Also,
expressions of different or even the same people might vary for the same emotion, as
emotions are hugely context dependent [7]. While we can focus on only those areas of the
face which display a maximum of emotions like around the mouth and eyes [3], how we
extract these gestures and categorize them is still an important question. Neural networks
and machine learning have been used for these tasks and have obtained good results.
Machine learning algorithms have proven to be very useful in pattern recognition and
classification. The most important aspects for any machine learning algorithm are the
features. In this report we will see how the features are extracted and modified for
algorithms like Support Vector Machines [1]. We will compare algorithms and the feature
extraction techniques from different papers.

The following categories of feelings are distinguished:

i. Anger: Here, we are recognizing anger using facial expressions, which is done using
Convolutional Neural Networks (CNNs), which is a type of deep learning algorithm
commonly used for image analysis tasks.

ii. Disgust: The success of disgust recognition with CNNs relies on factors such as the
quality and diversity of the training dataset, the model architecture, and the training process.

iii. Fear: Recognizing fear using facial expressions through Convolutional Neural Networks
(CNNs) follows a similar process as recognizing anger or disgust.

iv. Surprise: A diverse dataset is needed for a better result because surprise and fear
expressions are very similar to each other.

v. Happiness: It is easy to recognize the happiness expression since it is not similar to any
other facial expression.

vi. Neutral: Recognizing neutrality requires comparatively little computational power, since a
neutral expression has fewer distinctive features for the network to detect.

vii. Sad: To ensure the success of sadness recognition with CNNs, it is crucial to have a
diverse and well-labeled dataset that includes a range of facial expressions of sadness.

Figure. 1. Various emotions which are detected by the program [7]

The human emotion dataset can be a very good example to study the robustness and
nature of classification algorithms and how they perform for different types of datasets.
Usually before extraction of features for emotion detection, face detection algorithms are
applied on the image or the captured frame. We can generalize the emotion detection steps
as follows: 1) Dataset preprocessing 2) Face detection 3) Feature extraction 4) Classification
based on the features. In this work, we focus on the feature extraction technique and
emotion detection based on the extracted features. Section 2 focuses on some important
features related to the face. Section 3 gives information on the related work done in this
field. Related work covers many of the feature extraction techniques used until now. It
also covers some important algorithms which can be used for emotion detection in human
faces. Section 4 details the tools and libraries used in the implementation. Section 5 explains
the implementation of the proposed feature extraction and emotion detection framework.
Section 6 highlights the result of the experiment. Section 7 covers the conclusion and
future work.

Here, we also recognize the age and gender of an individual using a model trained to predict
age and gender with a broad ResNet framework. A few examples of its uses follow. Age and
gender prediction plays a significant role in content recommendation. Online platforms, such
as streaming services and e-commerce websites, leverage this information to offer personalized
recommendations to their users. By considering the user's age and gender, these platforms can
suggest movies, TV shows, products, or services that align with their preferences, enhancing
the overall user experience and driving engagement. Age and gender prediction also has a range
of practical applications across other domains. Within healthcare, it can be valuable in
various ways. For example, in telemedicine applications, automated identification of age and
gender can assist healthcare providers in gathering initial demographic information from
patients. This information can be used to deliver appropriate healthcare services and tailor
treatment plans to specific age groups and genders.

For predicting age and gender we use the Broad ResNet architecture, a variant of the ResNet
(Residual Network) model that has wider layers to improve feature representation. It is
commonly used in computer vision tasks such as age and gender identification. When using the
Broad ResNet architecture for age and gender identification, the model typically takes an image
as input and processes it through a series of convolutional layers, followed by pooling and
fully connected layers. A high-level overview of how age and gender identification can be
performed using the Broad ResNet architecture is as follows:

1) Input Image: The input image is fed into the Broad ResNet architecture. It is usually
preprocessed by resizing and normalizing the image to a fixed size.
2) Convolutional Layers: The image is passed through several convolutional layers. These layers
are responsible for extracting features from the image at different spatial scales. Each
convolutional layer applies a set of filters to the input image, generating feature maps that
highlight different patterns and textures.
3) Pooling Layers: After each set of convolutional layers, pooling layers are used to
downsample the feature maps. Pooling helps reduce the spatial dimensions of the feature maps
while retaining important information.
4) Fully Connected Layers: The output from the last pooling layer is flattened and fed into
fully connected layers. These layers are responsible for learning high-level representations of
the features extracted from the image. They capture complex relationships between features and
enable the model to make predictions.
5) Age and Gender Prediction: The fully connected layers are connected to separate output
layers for age and gender prediction. The number of nodes in the age output layer corresponds
to the number of age categories (or a single node for a continuous regression value).
Similarly, the gender output layer has nodes representing the different genders (e.g., male and
female).
6) Training: The model is trained using a labeled dataset where each image is annotated with
its corresponding age and gender. During training, the model's parameters are adjusted using
techniques like backpropagation and gradient descent to minimize the prediction error.
7) Inference: After training, the model can be used for age and gender identification on unseen
images. The input image is passed through the trained network, and the output nodes for age and
gender provide predictions. The age category or continuous value, and the gender with the
highest activation, are taken as the model's identification results.

When using the Broad ResNet architecture for age and gender identification, several factors can
influence the model's performance. Firstly, the exact architecture details may differ based on
specific implementations and modifications for this task; these adaptations could involve
altering the number and arrangement of convolutional layers, pooling layers, and fully
connected layers. Secondly, the quality and diversity of the training data are critical
considerations. Annotated datasets containing images labeled with age and gender information
are used to train the model, and ensuring that the training data is representative of the
target population is crucial for achieving accurate predictions. Thirdly, the techniques
employed during training significantly impact the model's performance: common techniques such
as backpropagation and gradient descent are used to adjust the model's parameters based on the
training data, enabling the model to learn from the data and minimize prediction errors. In
terms of model architecture, the age prediction component typically consists of an output layer
with nodes corresponding to specific age categories or a continuous regression value, while the
gender prediction component includes an output layer with nodes representing different genders,
such as male and female. During the inference phase, the trained model predicts the age and
gender of new, unseen images: the input image is processed by the trained network, and the
identification results are determined by the age category (or continuous value) and the gender
with the highest activation. It is crucial to note that the performance of the age and gender
identification model relies on various factors, including the specific architecture design, the
quality and representativeness of the training data, and the training techniques employed.
Continuous evaluation and validation are necessary to ensure accurate and reliable predictions.
The field of age and gender identification using the Broad ResNet architecture is continuously
evolving, with ongoing research and advancements refining the architecture and techniques to
achieve improved performance.
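To make the pipeline above concrete, the following sketch shows one possible way to wire a
wide-ResNet-style trunk with separate age and gender output heads using tf.keras. It is a
minimal illustration only: the block widths, depth, 64x64 input size, and 101 age bins are
assumptions made for this example and are not taken from the project's actual model.

# Minimal sketch (not the project's exact code) of a wide-ResNet-style network with two
# output heads for age and gender. Widths, depth, input size and age bins are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, Model

def residual_block(x, filters):
    """One basic residual block with a projection shortcut when channel counts differ."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.Activation("relu")(layers.Add()([y, shortcut]))

def build_age_gender_model(input_shape=(64, 64, 3), num_age_bins=101):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    # "Wide" residual stages: wider filter counts than a plain ResNet of this depth.
    for filters in (64, 128, 256):
        x = residual_block(x, filters)
        x = layers.MaxPooling2D()(x)
    x = layers.GlobalAveragePooling2D()(x)
    # Two separate heads share the same convolutional trunk.
    gender_out = layers.Dense(2, activation="softmax", name="gender")(x)
    age_out = layers.Dense(num_age_bins, activation="softmax", name="age")(x)
    model = Model(inputs, [gender_out, age_out])
    model.compile(optimizer="adam",
                  loss={"gender": "sparse_categorical_crossentropy",
                        "age": "sparse_categorical_crossentropy"},
                  metrics={"gender": "accuracy", "age": "accuracy"})
    return model

model = build_age_gender_model()
model.summary()

Treating age as a classification over many narrow bins, as in this sketch, is one common design
choice; a single regression output node is an equally valid alternative.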

For emotion prediction, CNNs can be trained on large datasets of facial expressions
or speech recordings with annotated emotion labels. The CNN learns to identify patterns and
features in the data that are indicative of specific emotions; this approach is called
supervised learning. The first step in emotion detection using CNNs is to preprocess the
data. For facial expressions, this involves extracting relevant features such as facial
landmarks, facial expressions, and head pose. For speech recordings, this involves
extracting features such as pitch, intonation, and spectral features. These features are then
fed into the CNN as input. The CNN consists of multiple layers, including convolutional
layers, pooling layers, and fully connected layers. In the convolutional layers, the CNN
applies filters to the input data to extract features. The pooling layers downsample the
feature maps to reduce the dimensionality of the data. The fully connected layers use the
extracted features to predict the emotion label. Training the CNN involves minimizing a
loss function that measures the difference between the predicted emotion label and the true
emotion label. This is done using an optimization algorithm such as stochastic gradient
descent (SGD). Once the CNN is trained, it can be used to predict the emotion label of new
facial expressions: the CNN takes the preprocessed data as input and outputs a probability
distribution over the possible emotion labels. One of the advantages of using CNNs for
emotion detection is their ability to automatically learn relevant features from raw data.
This eliminates the need for manual feature engineering, which can be time-consuming
and error-prone. Additionally, CNNs can capture complex patterns and relationships in the
data that may be difficult to capture using traditional machine learning methods.
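As a concrete illustration of such a CNN emotion classifier, the short Keras sketch below
builds a small network for 48x48 grayscale face crops with seven output classes (the
FER2013-style input mentioned later in this report). The layer sizes and dropout rate are
illustrative assumptions rather than the project's final architecture.

# Minimal sketch of a CNN emotion classifier for 48x48 grayscale faces (7 classes).
# Layer sizes and dropout rates are illustrative assumptions, not the report's final model.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_emotion_cnn(num_classes=7):
    model = models.Sequential([
        layers.Input(shape=(48, 48, 1)),
        layers.Conv2D(32, 3, activation="relu", padding="same"),   # feature extraction
        layers.MaxPooling2D(),                                     # downsample feature maps
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),                                       # mitigate overfitting
        layers.Dense(num_classes, activation="softmax"),           # emotion probabilities
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_emotion_cnn()
model.summary()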

There are several challenges in emotion detection using CNNs. One of the main
challenges is the availability of labeled data. Emotion detection datasets are often small
and may not represent the diversity of emotions and expressions in real-world situations.
Another challenge is the generalization of the CNN to new domains and contexts. The
CNN may not perform well on new data that is different from the training data. AI
algorithms for age, gender, and emotion prediction offer benefits but also face practical
challenges. Accuracy and bias issues may arise, especially for unusual individuals or subtle
expressions. Input quality affects performance, requiring standardized data. Privacy concerns
surround personal data usage. Ethical implications include discrimination and stereotype
reinforcement. Interpretation and contextual understanding may be limited. Overcoming
challenges requires research, fairness, transparency, and user education. Striving for
accuracy, robustness, and unbiased outcomes is crucial. It is also crucial to consider ethical
considerations and privacy protection when deploying AI systems for age, gender, and
emotion prediction. Responsible and fair use of these technologies must be upheld to
safeguard individuals' privacy rights and avoid any potential biases or discrimination.

Deep Learning

Deep learning is a subfield of machine learning that focuses on training artificial neural
networks to learn and make intelligent decisions from data. It is inspired by the structure and
function of the human brain, specifically the interconnected network of neurons.

At the core of deep learning are artificial neural networks, which consist of layers of
interconnected nodes, known as artificial neurons or units. These units receive inputs, apply
mathematical operations to them, and produce output activations. The connections between
the units have associated weights that determine the strength of influence one unit has on
another. Deep learning architectures are typically composed of multiple layers of
interconnected units, forming a deep neural network. The term "deep" refers to the depth of
the network, which is characterized by the number of layers it contains. Each layer performs
feature extraction and transformation, learning progressively more complex representations
of the input data as information flows through the network. Training a deep neural network
involves a two-step process: forward propagation and backpropagation. During forward
propagation, input data is fed into the network, and activations are calculated through the
layers until reaching the output layer. The calculated output is then compared to the desired
output to compute a loss or error value. In backpropagation, the error is propagated
backward through the network, and the weights and biases of the network are adjusted
iteratively to minimize the error. This process utilizes optimization algorithms, such as
stochastic gradient descent (SGD) or its variants, to update the weights and biases based on
the computed gradients of the loss function with respect to the network parameters. One of
the key advantages of deep learning is its ability to automatically learn and extract
hierarchical representations from raw data. By using multiple layers, deep neural networks
can learn to capture abstract and complex features from the input data, enabling them to
model highly intricate patterns and make accurate predictions. Deep learning has achieved
remarkable success in various domains, including computer vision, natural language
processing, speech recognition, and recommendation systems. Convolutional Neural
Networks (CNNs) have been particularly effective in computer vision tasks, while Recurrent
Neural Networks (RNNs) and their variants, such as Long Short-Term Memory (LSTM),
have excelled in sequential data processing. Moreover, deep learning has benefited from the
availability of large-scale labeled datasets, such as ImageNet and COCO, and advancements
in computational resources, particularly Graphics Processing Units (GPUs), which can
accelerate the training process. In recent years, deep learning has also seen the rise of
pretraining and transfer learning. Pretraining involves training a deep neural network on a
large dataset and then fine-tuning it on a specific task or domain. Transfer learning leverages
pretrained models to extract useful features and knowledge from one task and apply them to
another related task with limited labeled data.
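As a toy illustration of forward propagation, backpropagation, and a gradient-descent weight
update, the NumPy sketch below trains a one-hidden-layer network on made-up data; every size,
the learning rate, and the data itself are invented purely for the example.

# Toy forward-propagation / backpropagation example for a one-hidden-layer network,
# trained with plain gradient descent on random data. Everything here is illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                    # 100 samples, 4 features (made up)
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

W1, b1 = rng.normal(scale=0.1, size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.1, size=(8, 1)), np.zeros(1)
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):
    # Forward propagation
    h = np.tanh(X @ W1 + b1)                     # hidden activations
    p = sigmoid(h @ W2 + b2)                     # predicted probabilities
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    # Backpropagation (gradients of the cross-entropy loss)
    dz2 = (p - y) / len(X)
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (1 - h ** 2)            # derivative of tanh
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

    # Gradient-descent weight update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final loss:", loss)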

Figure. 2

ANN

Artificial Neural Networks (ANNs) are computational models inspired by the structure and
functioning of biological neural networks, such as the human brain. ANNs are a fundamental
component of deep learning and are widely used for various machine learning tasks,
including pattern recognition, classification, regression, and optimization. At its core, an
ANN consists of interconnected artificial neurons, also known as nodes or units. These
neurons receive inputs, apply mathematical operations to them, and produce an output
signal. The neurons are organized into layers, including an input layer, one or more hidden
layers, and an output layer. The connections between neurons in different layers have
associated weights, which determine the strength of the influence one neuron has on another.
The weights are initially assigned random values and are adjusted during the training process
to optimize the network's performance. The flow of information in an ANN is typically
forward, with inputs propagating through the layers to produce an output. Each neuron in a
layer computes a weighted sum of its inputs, applies an activation function, and passes the
result to the next layer. Common activation functions include sigmoid, ReLU (Rectified
Linear Unit), and tanh (hyperbolic tangent), among others. The learning process in ANNs
involves training the network on labeled data to adjust the weights and biases, enabling the
network to learn patterns and make accurate predictions. This is typically done using
supervised learning, where the network is presented with input-output pairs and updates its
weights based on the error or loss between its predicted outputs and the true outputs. The
training process often involves an optimization algorithm, such as stochastic gradient descent
(SGD), to iteratively update the weights and minimize the loss function. Backpropagation, a
technique used in ANNs, computes the gradients of the loss function with respect to the
network's weights, allowing for efficient weight updates during training. ANNs can have
various architectures, including feedforward neural networks, recurrent neural networks
(RNNs), convolutional neural networks (CNNs), and more. Feedforward neural networks,
also known as multilayer perceptrons (MLPs), have a simple structure with one or more
hidden layers and are effective for many tasks. RNNs are designed to handle sequential data
and have feedback connections that allow them to capture temporal dependencies. CNNs
excel in image and spatial data processing, utilizing convolutional layers to extract local
patterns and hierarchical representations. ANNs have been successful in various domains,
including computer vision, natural language processing, speech recognition, and robotics.
They have achieved state-of-the-art performance in tasks such as image classification, object
detection, machine translation, and sentiment analysis. However, ANNs also have some
limitations. They require large amounts of labeled training data to learn effectively, and
training can be computationally intensive, especially for deep architectures. Overfitting,
where the network becomes too specialized to the training data, is also a concern.
Regularization techniques, such as dropout and weight decay, are commonly used to mitigate
overfitting. Artificial Neural Networks are computational models inspired by biological
neural networks. They consist of interconnected artificial neurons organized into layers and
are trained on labeled data to learn patterns and make predictions. ANNs have achieved
impressive performance in various machine learning tasks and are a core component of deep
learning methodologies.
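To ground the description of a feedforward ANN (multilayer perceptron) trained with supervised
learning and regularized with dropout, here is a compact Keras sketch; the input size, layer
widths, dropout rate, and ten-class output are assumptions made only for this example.

# Minimal multilayer perceptron (feedforward ANN) sketch in Keras, including dropout
# regularization as discussed above. Input size, widths, and class count are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

mlp = models.Sequential([
    layers.Input(shape=(784,)),                     # e.g. a flattened 28x28 image
    layers.Dense(128, activation="relu"),           # hidden layer 1
    layers.Dropout(0.3),                            # regularization against overfitting
    layers.Dense(64, activation="tanh"),            # hidden layer 2 (different activation)
    layers.Dense(10, activation="softmax"),         # output probabilities over 10 classes
])
mlp.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
mlp.summary()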

Figure. 3.

RNN

Recurrent Neural Networks (RNNs) are a type of artificial neural network that excel in
processing sequential and temporal data. Unlike traditional feedforward neural networks,
RNNs have feedback connections, allowing information to persist and be passed from one
step to the next within the network. The key characteristic of RNNs is their ability to
maintain an internal memory or hidden state, which allows them to capture dependencies and
patterns over time. This memory is shared across all time steps, enabling the network to
process sequences of varying lengths and model the context and history of the data. The
basic building block of an RNN is a recurrent unit, often represented as a simple recurrent
neuron or a more complex variant such as the Long Short-Term Memory (LSTM) or Gated
Recurrent Unit (GRU). These recurrent units process an input at each time step, update their
hidden state based on the current input and the previous hidden state, and produce an output.
The forward pass in an RNN involves iterating over the input sequence one time step at a
time, updating the hidden state and producing outputs at each step. The hidden state at each
time step serves as a summary of the past information, capturing the context and
dependencies up to that point. During training, RNNs are trained using a technique called
backpropagation through time (BPTT). BPTT calculates the gradients of the loss function
with respect to the network parameters at each time step and propagates them backward
through time, enabling the network to learn from past inputs and adjust its weights. One
challenge with traditional RNNs is the vanishing gradient problem, where gradients diminish
as they are propagated back in time, making it difficult for the network to capture long-term
dependencies. This problem led to the development of more sophisticated variants such as
LSTM and GRU. LSTM introduces memory cells and gating mechanisms that allow the
network to selectively store and access information over time. It has a more complex
structure, including input gates, forget gates, and output gates, which control the flow of
information and help address the vanishing gradient problem. GRU is a simplified version of
LSTM that combines the input and forget gates into a single update gate, reducing the
number of parameters and computations compared to LSTM while still capturing long-term
dependencies effectively. RNNs have found success in various applications involving
sequential data, including natural language processing (NLP), speech recognition, machine
translation, sentiment analysis, and time series forecasting. They can process variable-length
inputs and generate outputs of varying lengths, making them suitable for tasks where the
order and context of the data are crucial. However, traditional RNNs have limitations in
modeling very long sequences due to the vanishing gradient problem and can struggle with
capturing long-term dependencies. In such cases, advanced architectures like Transformer
models have emerged as alternatives, which use self-attention mechanisms to capture global
dependencies more effectively. RNNs are a class of neural networks designed for sequential
data processing. They utilize feedback connections and maintain hidden states to capture
temporal dependencies and context. Variants like LSTM and GRU address the vanishing
gradient problem and have been successful in various tasks requiring sequence modeling and
prediction.
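The snippet below sketches how an LSTM-based RNN might be set up in Keras for a sequence
classification task, for example emotion labels over frames of audio features; the
40-dimensional frame features, the masking of padded steps, and the layer sizes are
illustrative assumptions rather than part of this project.

# Minimal LSTM sketch for sequence classification in Keras. The 40-dim frame features
# (e.g. MFCC-like vectors), masking, and layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

rnn = models.Sequential([
    layers.Input(shape=(None, 40)),         # variable-length sequences of 40-dim frames
    layers.Masking(mask_value=0.0),         # ignore zero-padded time steps
    layers.LSTM(64, return_sequences=True), # hidden state carries context across time
    layers.LSTM(32),                        # final hidden state summarizes the sequence
    layers.Dense(7, activation="softmax"),  # e.g. 7 emotion classes
])
rnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
rnn.summary()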

Figure. 4.

CNN

CNN, short for convolutional neural network, is a type of deep learning algorithm
widely used in image and video processing. This neural network architecture is
designed to automatically learn and extract essential features from input data,
particularly photos. An excellent example of CNN's recent and significant use is for
image classification on the ImageNet benchmark [21]. The CNN processes the input
data through a series of layers, including convolutional layers, pooling layers, and
fully connected layers. The convolutional layers use learned filters to extract crucial
features from the input data, while the pooling layers reduce the dimensionality of
the convolutional layer output by down sampling it. The fully connected layers use
the output of the convolutional and pooling layers to classify the input data.

We will look at CNNs in more detail in the next chapter.

Figure. 5.

Xception Architecture

The Xception architecture is a deep convolutional neural network (CNN) architecture that
was introduced by François Chollet in 2017. It is an extension of the Inception architecture
and stands for "Extreme Inception." Xception is known for its exceptional performance in
image classification tasks, achieving state-of-the-art results on various benchmarks. The key
idea behind the Xception architecture is to replace the traditional inception modules used in
the Inception network with depthwise separable convolutions. These depthwise separable
convolutions aim to capture both spatial and channel-wise dependencies in the input data
more effectively and efficiently. In traditional CNNs, convolutions are performed on both
the spatial dimensions and the channels simultaneously. In Xception, depthwise separable
convolutions decouple these operations. The first step involves performing a depthwise
convolution, which applies a separate convolutional filter to each input channel
independently. This captures spatial information. The second step is a pointwise convolution
that applies 1x1 filters to combine the output channels from the previous step. This allows
for cross-channel information exchange. The Xception architecture begins with an entry
flow, which consists of several convolutional and pooling layers to process the input image.
This initial part of the network performs basic feature extraction and downsampling,
reducing the spatial dimensions while increasing the number of channels. Following the
entry flow, the Xception architecture employs a series of residual-like modules in the middle
flow. These modules consist of several stacked depthwise separable convolutional layers.
The use of residual connections helps alleviate the vanishing gradient problem and allows
for better gradient flow during training. The exit flow of the Xception architecture comprises
a combination of global average pooling and fully connected layers. The global average
pooling layer aggregates the spatial information across each channel, reducing the spatial
dimensions to a single vector. This vector is then fed into fully connected layers for final
classification. The Xception architecture offers several advantages. By using depthwise
separable convolutions, it significantly reduces the number of parameters and computations
compared to traditional convolutional layers. This leads to a more efficient network with
fewer computational requirements. Additionally, the architecture can capture both local and
global dependencies more effectively, leading to improved performance in image
classification tasks. The Xception architecture has demonstrated excellent performance on
various image classification benchmarks, including the ImageNet dataset. It has also been
used as a feature extractor or pretrained backbone in transfer learning scenarios for various
computer vision tasks, such as object detection, semantic segmentation, and image
recognition. It's worth noting that while the Xception architecture has shown strong
performance in many scenarios, newer architectures such as EfficientNet and RegNet have
surpassed it in terms of both accuracy and efficiency. However, Xception remains a
significant milestone in the development of efficient and effective convolutional neural
network architectures for image classification.
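To illustrate the depthwise separable convolutions at the heart of Xception, the sketch below
contrasts a standard Conv2D layer with Keras' SeparableConv2D and then loads the pretrained
Xception model from keras.applications as a feature extractor; apart from the library defaults
(299x299 input, ImageNet weights), the usage shown is an illustrative assumption rather than
part of this project.

# Depthwise separable convolution vs. a standard convolution, plus a pretrained Xception
# backbone from keras.applications. Shapes and usage are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 299, 299, 3))

standard = layers.Conv2D(64, 3, padding="same")             # spatial + cross-channel at once
separable = layers.SeparableConv2D(64, 3, padding="same")   # depthwise conv, then 1x1 pointwise

print("standard:", standard(x).shape, standard.count_params(), "parameters")
print("separable:", separable(x).shape, separable.count_params(), "parameters")

# Pretrained Xception as a feature extractor (downloads ImageNet weights on first use).
backbone = tf.keras.applications.Xception(weights="imagenet", include_top=False,
                                          pooling="avg", input_shape=(299, 299, 3))
features = backbone(tf.keras.applications.xception.preprocess_input(x))
print("feature vector shape:", features.shape)   # (1, 2048)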

Figure. 6.

SqueezeNet Architecture

The SqueezeNet architecture is a lightweight convolutional neural network (CNN)


architecture that was proposed by Forrest N. Iandola et al. in 2016. It aims to achieve high
accuracy on image classification tasks while having a significantly smaller model size
compared to traditional CNN architectures. The key idea behind SqueezeNet is to reduce the
number of parameters in the network without sacrificing accuracy by employing several
innovative design choices. The building blocks of SqueezeNet are called Fire modules. A
Fire module consists of a squeeze layer and an expand layer. The squeeze layer comprises
1x1 convolutions that aim to reduce the number of input channels. This dimensionality
reduction helps in compressing the network while preserving valuable information. The
expand layer consists of a combination of 1x1 and 3x3 convolutions, which help in capturing
both local and global context. SqueezeNet introduces two types of Fire modules: the
"squeeze-and-expand" Fire module and the "expand-only" Fire module. The former has a
compression ratio of 1x1, meaning it reduces the number of input channels to one-fourth.
The latter has a compression ratio of 0.5x, which further reduces the number of channels by
half. By using these different Fire module configurations, SqueezeNet achieves a good
balance between model size reduction and preserving accuracy. SqueezeNet incorporates
down sampling operations to reduce the spatial dimensions of feature maps. It uses max
pooling with a stride of 2, which reduces the spatial size by half. This down sampling helps
in extracting and preserving essential features while reducing computational requirements.
Similar to other CNN architectures, SqueezeNet ends with fully connected layers for final
classification. However, to keep the model size small, SqueezeNet uses 1x1 convolutions
instead of fully connected layers. This reduces the number of parameters while achieving
similar classification performance. SqueezeNet achieves a significantly smaller model size
compared to traditional CNN architectures. The authors introduced techniques like
parameter sharing, aggressive down sampling, and reducing 3x3 filters to 1x1 filters to
compress the network. SqueezeNet has 50x fewer parameters compared to AlexNet while
achieving comparable accuracy on image classification tasks. The SqueezeNet architecture is
particularly suitable for scenarios with limited computational resources, such as mobile and
embedded devices. It offers a good trade-off between model size and accuracy, making it
efficient for real-time applications and deployments with restricted memory or processing
power. Since its introduction, SqueezeNet has been used in various applications, including
image classification, object detection, and semantic segmentation. It has served as a basis for
further research and has inspired the development of other lightweight architectures like
MobileNet and ShuffleNet. It's worth noting that while SqueezeNet provides an efficient
solution for model compression, more recent architectures like EfficientNet have achieved
even better trade-offs between model size and accuracy by incorporating advanced
techniques like compound scaling and neural architecture search.
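A minimal sketch of a SqueezeNet-style Fire module in Keras is shown below: a 1x1 "squeeze"
convolution followed by parallel 1x1 and 3x3 "expand" convolutions whose outputs are
concatenated, with a 1x1 convolution standing in for fully connected layers at the end. The
filter counts and the tiny ten-class head are illustrative choices, not the original paper's
exact configuration.

# Sketch of a SqueezeNet-style Fire module and a tiny network built from it.
# Filter counts and the 10-class head are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def fire_module(x, squeeze_filters=16, expand_filters=64):
    s = layers.Conv2D(squeeze_filters, 1, activation="relu", padding="same")(x)   # squeeze
    e1 = layers.Conv2D(expand_filters, 1, activation="relu", padding="same")(s)   # expand 1x1
    e3 = layers.Conv2D(expand_filters, 3, activation="relu", padding="same")(s)   # expand 3x3
    return layers.Concatenate()([e1, e3])

inputs = layers.Input(shape=(224, 224, 3))
x = layers.Conv2D(96, 7, strides=2, activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D(3, strides=2)(x)
x = fire_module(x, 16, 64)
x = fire_module(x, 16, 64)
x = layers.MaxPooling2D(3, strides=2)(x)
# Classification with a 1x1 convolution instead of fully connected layers, as described above.
x = layers.Conv2D(10, 1, activation="relu")(x)   # 10 classes, illustrative
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Activation("softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()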

Figure. 7.

Activation Function

Activation functions play a crucial role in artificial neural networks by introducing non-
linearity to the network's output. They determine whether a neuron should be activated or not
based on the input it receives.

Here are some commonly used activation functions and their characteristics:

1. Sigmoid Function: The sigmoid function is a smooth, S-shaped curve that maps the input to
a value between 0 and 1. It is given by the formula:

σ(x) = 1 / (1 + e^(-x))

The sigmoid function was widely used in the past but has lost popularity in recent years due to
some limitations. One of the main issues is the vanishing gradient problem, where gradients
become very small for large inputs, making it challenging for deep networks to learn
effectively. Sigmoid functions are still used in some cases, such as binary classification
tasks.

2. Hyperbolic Tangent (Tanh) Function: The hyperbolic tangent function is similar to the
sigmoid function but maps the input to a value between -1 and 1. It is given by the formula:

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Like the sigmoid function, tanh also suffers from the vanishing gradient problem. However,
it has the advantage of being zero-centered, making it easier for the network to learn
compared to the sigmoid function.

3. Rectified Linear Unit (ReLU) Function: The ReLU function is a piecewise linear function
that maps all negative input values to zero and leaves positive values unchanged. It is
defined as:

ReLU(x) = max(0, x)

ReLU has gained significant popularity due to its simplicity and effectiveness in deep neural
networks. It addresses the vanishing gradient problem and accelerates training by providing
faster convergence. However, ReLU can also suffer from the "dying ReLU" problem, where
some neurons become inactive and do not contribute to the learning process.

4. Leaky ReLU Function: The Leaky ReLU function is a variant of the ReLU function that
allows a small, non-zero gradient for negative inputs. It is defined as:

LeakyReLU(x) = max(ax, x)
where a is a small constant (e.g., 0.01). The non-zero gradient for negative inputs helps
address the dying ReLU problem and provides better performance in some cases.

5. Parametric ReLU (PReLU) Function: PReLU is another variant of the ReLU function that
introduces learnable parameters. It allows the network to learn the slope of the negative part
of the activation function. This enables the network to adapt the activation function to the
data during training.

6. Exponential Linear Unit (ELU) Function: The ELU function is a smooth approximation of
the ReLU function that handles negative inputs more gracefully. It introduces a negative
saturation region to allow negative values, which helps mitigate the dying ReLU problem.
The ELU function is defined as:

ELU(x) = x, if x > 0
ELU(x) = α * (e^x - 1), if x <= 0

where α is a positive constant controlling the slope of the function for negative inputs.

These are just a few examples of activation functions used in neural networks. Each
activation function has its own characteristics and affects the network's learning dynamics.
The choice of activation function depends on the specific problem, network architecture, and
empirical performance on the task at hand.
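For reference, the NumPy snippet below implements the activation functions listed above so
their behavior can be inspected directly; the α values of 0.01 for Leaky ReLU and 1.0 for ELU
are just commonly quoted example defaults.

# NumPy implementations of the activation functions discussed above.
# The alpha values are commonly quoted defaults, used here purely for illustration.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3, 3, 7)
for name, f in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
                ("leaky_relu", leaky_relu), ("elu", elu)]:
    print(f"{name:12s}", np.round(f(x), 3))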

Figure. 8.

SoftMax

SoftMax is an activation function commonly used in the output layer of a neural network,
particularly in multi-class classification problems. It takes a vector of real-valued inputs and
transforms them into a probability distribution over multiple classes. The SoftMax function
operates on a vector of values, typically the logits or pre-activation outputs from the previous
layer of the neural network. It exponentiates each value and normalizes them to ensure they
sum up to 1. The SoftMax function can be defined as follows for a vector x:

softmax(x_i) = exp(x_i) / sum(exp(x_j)), where the sum runs over j = 1, ..., C

where x_i represents the i-th element of the input vector, C is the total number of classes,
and the sum is taken over all elements of the input vector.

The main purpose of the SoftMax function is to convert raw scores or logits into
probabilities. By exponentiating the inputs and normalizing them, the SoftMax function
assigns a probability to each class, indicating the likelihood of the input belonging to that
class.

The SoftMax function guarantees that the resulting values sum up to 1, thereby forming a
valid probability distribution. Each output value represents the probability of the input
belonging to the corresponding class.

It converts the logits into probabilities, making it easier to interpret the output of the neural
network. The class with the highest probability is typically selected as the predicted class.
This function is differentiable, which allows for efficient backpropagation during the training
process. The gradients can be computed with respect to the input logits, enabling the network
to learn from the provided labels and update its weights accordingly. SoftMax amplifies the
differences between class probabilities, which can lead to a bias towards the class with the
highest score. In scenarios with imbalanced datasets, where some classes have significantly
more samples than others, this can result in poor performance. Techniques such as class
weighting or utilizing alternative loss functions can help address this issue. The SoftMax
function includes a temperature parameter (often denoted as T) that controls the smoothness
of the output probabilities. A higher temperature value (>1) produces a softer probability
distribution, while a lower temperature value (<1) produces a sharper distribution with more
emphasis on the highest-scoring class. SoftMax is primarily used in multi-class classification
tasks, where there are more than two mutually exclusive classes. It is often combined with
the cross-entropy loss function to compute the loss and gradients during training.

While SoftMax is commonly used in the output layer for classification, it is worth noting that
in some cases, alternative activation functions such as sigmoid or linear functions may be
used, depending on the nature of the problem (e.g., binary classification or regression).
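As a small worked example, the NumPy function below computes a numerically stable SoftMax with
the temperature parameter described above; the example logits are made up for illustration.

# Numerically stable SoftMax with a temperature parameter, as described above.
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = [2.0, 1.0, 0.1]
print(softmax(logits))                       # probabilities, sum to 1
print(softmax(logits, temperature=0.5))      # sharper distribution
print(softmax(logits, temperature=2.0))      # softer distribution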

Objective

The purpose of an emotion recognition system with a solid rate of accuracy is to personalize
different attributes for an individual, specifically to suit their interests. Consequently, we
decided to design this system so that we could detect a person's age, gender, and emotion using
various methods and algorithms of deep learning. For instance, an A.I. assistant could play
music based on the detected emotion rather than choosing randomly; similarly, a smart car could
slow down when the driver is angry or in agony. As a result, this application has much
potential in the real world, benefiting companies and even helping to ensure the safety of
consumers. Our objective is to employ Convolutional Neural Networks for facial expressions and
a Broad ResNet framework for identifying the age and gender of an individual at a glance at the
camera. Emotion recognition is the process of identifying a person's emotional state from
various cues, such as facial expressions, physiological signals, and gestures. In recent years,
emotion recognition has gained significant attention in the fields of artificial intelligence
(AI) and machine learning (ML). The ability to recognize emotions can help machines understand
human behavior and respond appropriately. Facial expressions, together with age and gender
determination, are the most commonly used modalities for emotion recognition. While facial
expressions have been studied extensively, age and gender recognition also provides rich
additional information, so it can be used as a reliable complementary source for emotion
recognition. In this report, we propose an emotion recognition system based on convolutional
neural networks (CNNs) for facial expressions. CNNs are a type of deep learning model that is
highly effective at image and audio processing tasks, and we propose to use CNNs to recognize
facial expressions from images. The proposed system has a wide range of potential applications,
including personalized marketing, health monitoring, and smart homes. For instance, a
personalized music player could be designed based on the user's emotion, automatically
selecting music that matches the user's mood. Similarly, a smart home could be designed to
adjust its lighting and temperature based on the user's emotional state. The first step in
building an emotion recognition system is to collect a large dataset of labeled data. The
dataset should contain a wide range of emotions and different facial expressions or speech
signals. The quality of the dataset is essential for the success of the system, and it should
be balanced in terms of the number of samples for each emotion. For facial expression
recognition, we propose to use a CNN model that is trained on the labeled dataset. The input to
the CNN is an image of the face, and the output is a probability distribution over the
different emotions. The model is trained using backpropagation, which adjusts the model
parameters to minimize the error
between the predicted output and the ground truth label. For predicting age and gender we use
the Broad ResNet architecture, a variant of the ResNet (Residual Network) model with wider
layers for richer feature representation; the complete age and gender identification pipeline
(input preprocessing, convolutional and pooling layers, fully connected layers, separate age
and gender output heads, supervised training, and inference on unseen images) is the same as
the one described earlier in this chapter. To achieve high accuracy in emotion recognition,
we propose to use a deep CNN model with multiple convolutional and pooling layers. The
deep model can extract high-level features from the input data, which can be used to
recognize emotions accurately. One of the challenges in emotion recognition is dealing with
inter-individual variability. Different people may express the same emotion differently,
which can affect the accuracy of the system. To address this issue, we propose to use
transfer learning, where we pretrain the CNN model on a large dataset of facial expressions
or speech emotions and fine-tune it on our dataset. Transfer learning can help the model
learn more general features that are not specific to a particular individual. In conclusion, we
propose an emotion recognition system based on CNNs for facial expressions and speech.
The proposed system has a wide range of potential applications, including personalized
marketing, health monitoring, and smart homes. The system can be trained using a large
labeled dataset, and the CNN model can be fine-tuned using transfer learning to improve
accuracy. We believe that emotion recognition is an essential area of research in AI and has
a significant potential for practical applications. Emotion detection has become an
important research topic in recent years due to its wide range of applications, such as in
human-computer interaction, marketing, and healthcare.

Problem Statement
Human feelings and intentions are expressed via facial expressions and speech, and deriving an
efficient and powerful feature set is the essential element of a facial expression recognition
system. Facial expressions convey non-verbal cues, which play a crucial role in interpersonal
relations. Automatic recognition of facial expressions can be an important part of natural
human-machine interfaces; it can also be used in behavioral science and in clinical practice.
An automatic Facial Expression Recognition system needs to solve the following problems:
detection and location of faces in a cluttered scene, facial feature extraction, and facial
expression classification.

Methodologies

The primary idea of the project is to recognize emotion and predict age and gender using
various methods and algorithms of deep learning. For age-gender classification, the dataset
was obtained from IMDb-WIKI, and for emotion detection, it was obtained from Kaggle's
FER2013 dataset. Two models are used in this design: one is trained to predict age and gender
using a broad ResNet framework, while the other is trained to recognize emotions using a
traditional CNN architecture. Compared to classifier-based approaches, our technique
exhibits higher classification accuracy for both age and gender.

Background
Recognizing age, gender, and facial expressions from images or videos is a challenging task
in computer vision and pattern recognition. It involves developing algorithms and models
that can automatically analyze visual cues to infer specific attributes related to age, gender,
and emotional expressions. Age Recognition aims to estimate the age of a person from visual
cues, such as facial appearance or other relevant features. It is a complex task because age
estimation is subjective and can vary across different cultures and individuals. Factors such
as wrinkles, facial contours, hair color, and skin texture are typically considered in age
estimation algorithms. Machine learning techniques, including deep learning models like
convolutional neural networks (CNNs), have been widely used for age recognition due to
their ability to capture intricate facial features. Gender Recognition focuses on determining
the gender of a person based on visual characteristics. It involves analyzing facial attributes
such as facial structure, hair length, and other gender-specific features. Similar to age
recognition, deep learning models, particularly CNNs, have shown promising results in
gender recognition tasks. These models are trained on large-scale datasets containing labeled
images of individuals, allowing them to learn discriminative patterns and features indicative
of gender. Facial Expression Recognition aims to detect and classify various emotional states
displayed by a person's face. It involves analyzing facial muscle movements and their
configurations to infer emotions such as happiness, sadness, anger, disgust, fear, and
surprise. Traditional approaches to facial expression recognition include extracting
handcrafted features like facial landmarks or texture descriptors and using machine learning
algorithms for classification. However, deep learning methods, particularly CNNs and
recurrent neural networks (RNNs), have demonstrated significant advancements in facial
expression recognition by directly learning expressive features from raw facial images or

sequences. To train models for age, gender, and facial expression recognition, large
annotated datasets are crucial. Researchers and organizations have created labeled datasets,
such as the Adience dataset for age and gender recognition and the Extended Cohn-Kanade
(CK+) dataset for facial expression recognition. These datasets provide diverse samples of
individuals across different ages, genders, and emotional expressions, facilitating the
development and evaluation of robust recognition models. In recent years, advancements in
deep learning, increased computational power, and the availability of large-scale datasets
have significantly improved the performance of age, gender, and facial expression
recognition systems. These technologies have applications in areas such as biometrics,
human-computer interaction, marketing, and surveillance, offering valuable insights and
enabling various practical use cases.

Optimizer

In the context of deep learning, an optimizer is an algorithm or method used to adjust


the weights of a neural network during the training process. The goal of an optimizer
is to minimize the loss function and find the set of weights that result in the best
performance of the model on the given task. Optimizers play a crucial role in
determining the speed and quality of convergence during training.

Here are some commonly used optimizers in deep learning:

1. Stochastic Gradient Descent (SGD): SGD is the basic and most widely used
optimizer in deep learning. It updates the weights in the opposite direction of the
gradient of the loss function with respect to the weights. The update rule for SGD is
given by:

W_new = W_old - learning_rate * gradient

where W_new and W_old are the new and old weight values, respectively,
learning_rate is a hyperparameter controlling the step size of the update, and the
gradient represents the derivative of the loss function with respect to the weights.

2. Momentum: Momentum is an extension of SGD that introduces a momentum term to


accelerate the convergence and overcome the issue of oscillations in the gradient
path. It adds a fraction of the previous weight update to the current update, allowing
the optimizer to move more consistently towards the minimum. The update rule for
momentum is given by:

V_new = momentum * V_old - learning_rate * gradient
W_new = W_old + V_new

where V_new and V_old represent the updated and previous velocity, respectively,
momentum is a hyperparameter controlling the contribution of the previous update,
and the other variables have the same meaning as in SGD.

3. AdaGrad: AdaGrad adapts the learning rate for each weight individually based on the
historical gradients. It increases the learning rate for infrequent features and
decreases it for frequent features. This helps the optimizer converge quickly in
directions with steep gradients and more slowly in flatter directions. The update rule
for AdaGrad is given by:

G_new = G_old + gradient^2
W_new = W_old - (learning_rate / sqrt(G_new + epsilon)) * gradient

where G_new and G_old represent the updated and previous squared gradients,
respectively, epsilon is a small constant added for numerical stability, and the other
variables have the same meaning as in SGD.

4. RMSprop: RMSprop is an extension of AdaGrad that addresses its aggressive and


monotonically decreasing learning rate. It introduces a decay term that limits the
accumulation of historical gradients. The update rule for RMSprop is given by:

G_new = decay * G_old + (1 - decay) * gradient^2
W_new = W_old - (learning_rate / sqrt(G_new + epsilon)) * gradient

where G_new and G_old represent the updated and previous squared gradients,
respectively, decay is a hyperparameter controlling the contribution of the previous
gradients, and the other variables have the same meaning as in AdaGrad.

5. Adam: Adam combines the benefits of both momentum and RMSprop. It utilizes
adaptive learning rates for each weight and also includes a momentum-like term.
Adam has become one of the most popular and widely used optimizers in deep
learning. The update rule for Adam is given by:

M_new = beta1 * M_old + (1 - beta1) * gradient
V_new = beta2 * V_old + (1 - beta2) * gradient^2
W_new = W_old - (learning_rate / (sqrt(V_new) + epsilon)) * M_new

where M_new and M_old represent the updated and previous first moments (mean), V_new and
V_old represent the updated and previous second moments (uncentered variance), and beta1 and
beta2 are the exponential decay rates that control how much past gradients contribute to these
estimates. In the bias-corrected form, M_new and V_new are additionally divided by
(1 - beta1^t) and (1 - beta2^t) respectively before the weight update, where t is the iteration
number.
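The update rules listed above can be written out directly in code. The following is a small NumPy sketch for illustration only (not project code); the function names and default hyperparameter values are assumptions chosen for the example, and the Adam step uses the bias-corrected form.

import numpy as np

def sgd_step(w, grad, lr=0.01):
    # plain SGD: move against the gradient
    return w - lr * grad

def momentum_step(w, v, grad, lr=0.01, momentum=0.9):
    # momentum: accumulate a velocity and move along it
    v = momentum * v - lr * grad
    return w + v, v

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Example: one Adam step on a toy parameter vector (values are made up).
w = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
grad = np.array([0.1, -0.2, 0.05])
w, m, v = adam_step(w, m, v, grad, t=1)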

Adam Optimizer

Adam (Adaptive Moment Estimation) optimizer is an adaptive learning rate optimization


algorithm commonly used in deep learning. It combines the concepts of momentum-based
optimization and root mean square (RMS) propagation to achieve efficient and effective
parameter updates during training.

Here's a detailed explanation of the Adam optimizer:

1. Gradient-Based Optimization: Adam optimizer, like other gradient-based optimization


algorithms, aims to minimize the loss function by iteratively updating the model parameters
based on the gradients of the loss with respect to the parameters.

2. Momentum-Based Optimization: Adam incorporates the concept of momentum, which helps


accelerate the convergence and overcome local minima. It utilizes a running average of the
gradients' first moment (mean) to keep track of the direction of the previous gradients.

3. RMS Propagation: In addition to momentum, Adam uses RMS propagation to adjust the
learning rate for each parameter individually. It maintains a running average of the gradients'
second moment (variance) to adaptively scale the learning rate.

4. Parameter Initialization: Adam initializes two variables, namely the first moment estimate
(m) and the second moment estimate (v), to zero vectors of the same size as the parameters
being optimized.

5. Update Rule: At each iteration, Adam calculates the current gradient of the parameters and
updates the estimates of the first and second moments using exponential decay rates, beta1
and beta2, respectively. These decay rates control the contribution of past gradients to the
current estimates.

6. Bias Correction: To account for the bias introduced by initializing the first and second
moment estimates to zero, Adam performs bias correction by adjusting the estimates in the
early iterations. This correction is necessary to make the initial updates more accurate.
7. Learning Rate: Adam adapts the learning rate for each parameter based on the estimates of
the first and second moments. The effective step is proportional to the first-moment estimate
divided by the square root of the second-moment estimate, so the update magnitude is
normalised by the recent gradient magnitudes.

8. Regularization: Adam supports L2 regularization by incorporating the L2 penalty term into


the parameter updates. This helps prevent overfitting by encouraging smaller weights in the
model.

Benefits of Adam optimizer:

 Efficient updates: The adaptive learning rate of Adam allows for efficient parameter updates,
especially in high-dimensional spaces.

 Robustness to noisy gradients: The running averages of the first and second moments help
reduce the impact of noisy gradients during training.

 Little manual tuning: Adam adjusts the learning rate automatically, reducing the need for
manual tuning of hyperparameters.

However, it's worth noting that Adam may not always be the optimal choice for all
scenarios. In some cases, other optimization algorithms such as SGD (Stochastic Gradient
Descent) or its variants may perform better, particularly with smaller datasets or certain
network architectures.

In practice, it is common to use Adam as a default optimizer choice for deep learning tasks
due to its overall effectiveness and ease of use. It has become a popular choice among
researchers and practitioners for a wide range of applications.
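As a usage note, the Adam optimizer can be instantiated in Keras roughly as follows. This is a sketch of the standard tf.keras API; the report does not list its exact hyperparameter values, so the numbers shown are the common defaults rather than the project's configuration.

from tensorflow import keras

# Adam with the usual default decay rates discussed above (assumed values).
adam = keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999, epsilon=1e-7)

# The optimizer is then passed to model.compile, for example:
# model.compile(optimizer=adam, loss="categorical_crossentropy", metrics=["accuracy"])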

Figure. 8.

Weight

In the context of deep learning, weights refer to the learnable parameters of a neural
network. They represent the strengths of connections between neurons in different layers of
the network. The weights determine how input data is transformed as it passes through the
network, allowing the network to learn and make predictions.

Here are some key points about weights in deep learning:

1. Initialization: Initially, the weights of a neural network are initialized randomly. Proper
initialization is crucial because it helps the network converge faster during training.
Common weight initialization techniques include random initialization from a uniform or
Gaussian distribution, Xavier initialization, and He initialization, which take into account the
dimensions of the weight matrices and the activation functions used.

2. Learnable Parameters: During training, the weights are adjusted to minimize the difference
between the predicted output of the network and the true target values. This is done through
the process of backpropagation, where gradients are computed with respect to the weights
using the chain rule of derivatives. The weights are updated in the direction that minimizes
the loss function using optimization algorithms like gradient descent, Adam, or RMSprop.

3. Weight Sharing: In some cases, weight sharing is employed to reduce the number of
parameters and improve the efficiency of the network. Weight sharing refers to using the
same set of weights for multiple connections in the network. This is commonly seen in
convolutional neural networks (CNNs) where the same filters are applied to different regions

of the input image.

4. Regularization: Regularization techniques are often applied to prevent overfitting, which


occurs when the model learns to perform well on the training data but fails to generalize to
new data. Regularization methods like L1 and L2 regularization, dropout, and batch
normalization help control the magnitudes of the weights and improve the network's
generalization ability.

5. Importance in Model Performance: The weights play a crucial role in determining the
performance of the deep learning model. Proper adjustment of weights allows the network to
learn meaningful representations from the input data, capture complex patterns, and make
accurate predictions. Well-tuned weights result in improved accuracy and generalization
capability.

6. Transfer Learning: Pre-trained weights from a previously trained model can be used as a
starting point for a new model or a different task. This is known as transfer learning and is
effective when the new task has limited labeled data. By utilizing pre-trained weights, the
network can benefit from the learned representations and potentially achieve better
performance with less training.

It's important to note that the weights in a deep learning model are updated iteratively during
training, and the learning process aims to find the optimal values that minimize the loss
function. Proper weight initialization, regularization, and optimization techniques are critical
for training deep learning models effectively and achieving good performance on the given
task.
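To make the initialization schemes mentioned in point 1 above concrete, the snippet below shows how He and Xavier (Glorot) initializers could be attached to layers in Keras. This is a generic sketch of the tf.keras API, not the project's actual layer definitions.

from tensorflow.keras import layers

# He initialization is commonly paired with ReLU activations, while Glorot
# (Xavier) uniform is the Keras default for dense layers.
conv = layers.Conv2D(64, 3, activation="relu", kernel_initializer="he_normal")
dense = layers.Dense(128, activation="tanh", kernel_initializer="glorot_uniform")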

Figure. 9.

Loss Function

Loss functions, also known as cost functions or objective functions, are an integral part of
deep learning models. They quantify the discrepancy between the predicted output of a
model and the true target values. The choice of a suitable loss function depends on the
specific task and the desired properties of the model.

Here are some commonly used loss functions in deep learning:

1. Mean Squared Error (MSE) Loss: The MSE loss is widely used in regression problems,
where the goal is to predict continuous values. It measures the average squared difference
between the predicted and true values. The formula for MSE is:

MSE = (1/n) * Σ(y_pred - y_true)^2

where y_pred is the predicted value, y_true is the true value, and n is the number of samples.

2. Binary Cross-Entropy Loss: Binary cross-entropy loss is commonly used in binary


classification problems, where there are two mutually exclusive classes. It calculates the
cross-entropy between the predicted probabilities and the true binary labels. The formula for
binary cross-entropy is:

BCE = - (y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))

where y_pred is the predicted probability, y_true is the true label (0 or 1).

3. Categorical Cross-Entropy Loss: Categorical cross-entropy loss is used in multi-class


classification problems, where there are more than two mutually exclusive classes. It
measures the dissimilarity between the predicted class probabilities and the true class labels.
The formula for categorical cross-entropy is:

CCE = - Σ(y_true * log(y_pred))

where y_pred is the predicted probability vector, y_true is the true one-hot encoded label
vector.

4. Sparse Categorical Cross-Entropy Loss: Sparse categorical cross-entropy loss is similar to


categorical cross-entropy, but it is used when the true labels are provided as integers rather
than one-hot encoded vectors. This is commonly used in scenarios where the number of
classes is large and encoding all labels as one-hot vectors would be memory-intensive.

5. Kullback-Leibler Divergence (KL Divergence) Loss: KL divergence loss measures the


difference between two probability distributions. It is often used in scenarios such as
variational autoencoders (VAEs) or generative adversarial networks (GANs) to compare the
similarity between the generated samples and the true data distribution.

6. Huber Loss: Huber loss is a robust loss function that is less sensitive to outliers compared to
the mean squared error. It combines quadratic and linear loss functions, providing a balance
between the two. It is often used in regression tasks where the presence of outliers can
significantly impact the performance.

These are just a few examples of loss functions commonly used in deep learning. The choice
of a loss function depends on the problem at hand, the nature of the data, and the desired
properties of the model. It is important to select a loss function that aligns with the specific
requirements and objectives of the task to optimize the model effectively during training.
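The MSE and categorical cross-entropy formulas given above can be checked directly on toy data. The snippet below is a small NumPy illustration, not project code; the example arrays are made up for demonstration.

import numpy as np

y_true = np.array([[0, 0, 1], [1, 0, 0]])            # one-hot labels
y_pred = np.array([[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]])  # predicted probabilities

# Mean squared error: average squared difference over all entries.
mse = np.mean((y_pred - y_true) ** 2)

# Categorical cross-entropy: per-sample -sum(y_true * log(y_pred)), averaged.
cce = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

print(f"MSE = {mse:.4f}, categorical cross-entropy = {cce:.4f}")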

Figure. 10.

DropOut Layer

Dropout is a regularization technique commonly used in deep learning models to


prevent overfitting. It involves randomly dropping out (setting to zero) a fraction of
the input units or neurons during the training phase. This forces the network to learn
more robust and generalized representations by reducing the reliance on any single
subset of neurons.

Here are the key details about dropout layers:

1. Purpose: The primary purpose of dropout is to improve the generalization ability of


deep learning models. By randomly dropping out neurons, dropout prevents co-
adaptation of neurons, where specific neurons become overly dependent on each
other for making predictions. This encourages the network to learn more independent
and diverse features, making it more robust to noise and variations in the input data.

2. Implementation: Dropout is typically implemented as a separate layer in the neural


network architecture, placed between the fully connected layers or convolutional
layers. During training, each neuron in the dropout layer has a probability (dropout
rate) of being temporarily "dropped out" or deactivated. The dropout rate is a
hyperparameter that determines the fraction of neurons to be dropped, typically
ranging from 0.2 to 0.5.

3. Inference Phase: During the inference phase, or when making predictions, dropout is
usually turned off and all neurons are used. To maintain the expected activation level, the
activations are rescaled to account for the fraction of neurons kept during training; in the
common "inverted dropout" implementation this scaling is applied during training instead, so
no adjustment is needed at inference time. Either way, the scaling ensures that predictions
made during inference are consistent with the expected output of the network.

4. Regularization Effect: Dropout acts as a regularization technique by introducing


noise or randomness into the network during training. By dropping out neurons,
dropout effectively reduces the model's capacity, preventing it from overfitting to the
training data. This regularization effect helps improve the model's ability to
generalize to unseen data, resulting in better performance on test or validation sets.

5. Training Time and Model Ensemble: Dropout increases the computational cost
during training because each forward pass requires randomly dropping out neurons.
However, it also provides a form of model ensemble during training. Since different
subsets of neurons are dropped out in each training iteration, the network effectively
trains multiple subnetworks, which can be seen as an ensemble of models. This
ensemble effect improves the robustness and stability of the learned representations.

6. Hyperparameter Tuning: The dropout rate is a hyperparameter that needs to be tuned


during model development. A dropout rate that is too low may not provide enough
regularization, leading to overfitting, while a dropout rate that is too high may result
in underfitting. The optimal dropout rate depends on the specific task, dataset, and
model architecture. It is typically determined through experimentation and cross-
validation.
Overall, dropout is a powerful regularization technique in deep learning that helps
prevent overfitting, improve generalization, and increase the robustness of the
learned representations. It is widely used in various types of neural networks,
including fully connected networks, convolutional neural networks (CNNs), and
recurrent neural networks (RNNs).
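In Keras, the placement described above (a dropout layer after convolutional or dense layers) looks roughly like the sketch below. The architecture shown is a toy example rather than the project's model, and the 0.25 rate is an assumed value within the 0.2-0.5 range mentioned above; Keras handles the inference-time scaling automatically.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(48, 48, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Dropout(0.25),           # active only during training
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.25),
    layers.Dense(7, activation="softmax"),
])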

Figure. 11.

1.5 Structure of the report


The entire report can be divided into three broad sections that are mentioned below:

1) Prefatory information
a) Title Page
 This includes the report’s title, name, and address of the organization conducting the
research.
 The name of the client to whom the report is to be submitted.
 And the date of submission
b) Declaration
 It states that the project group has submitted the results of their own thought, research, or
self-expression and there is no plagiarism.
c) Certificate
 The certificate states that the work has been carried out by this project group and has not
been submitted by any other project group of the institute for award or degree.
d) Acknowledgements
 Acknowledgments provide a way to thank those who supported or encouraged you in
research, writing, and other parts of developing reports and papers.

e) Table of Contents
 This covers the list of all the topics with their page numbers.
f) Abstract
 It is a very important part of this section which summarizes the problem, research design,
and the major findings and conclusions.
 It is like a mini report.
2) Main body
a) Introduction
 This chapter discusses the introduction, motivation, and need for a project in the real
world.
b) Literature Survey
 This section discusses all the past research work related to the field in which the project
has been made and mentions the work, shortcomings, improvements, etc. of past research
works.
c) Fundamentals of Project
 This Section Discusses the project made by team members including working, basics,
implementation, designing, etc.
d) Results and Discussion

 This section discusses the conclusions and research papers published while making the
project.
e) Conclusions and Further Works
 This section interprets the conclusion and further work that need to be done.
3) End section
a) References
 The references document the sources used by the researchers in writing the report, listed in
alphabetical order.
b) List of Publications
 It contains a list of all the research papers that the researchers have published in the
course of their study.
c) Certificates
 These contain the certificates that the researchers received for their research papers in the
course of their study.

CHAPTER 2

LITERATURE SURVEY

1. "Deep learning for sentiment analysis: A survey" by Lei Zhang et al. (2018). This paper
provides a comprehensive survey of deep learning techniques used for sentiment analysis.
The authors discuss various neural network architectures, such as convolutional neural
networks (CNNs), recurrent neural networks (RNNs), and their variants, that have been
applied to sentiment analysis. The paper also covers transfer learning and multi-task learning
techniques used for sentiment analysis. The authors conclude that deep learning techniques
have achieved state-of-the-art results in sentiment analysis and have the potential to
improve in the future.

2. "Emotion recognition using facial landmarks, Python, DLib and OpenCV" by Marcin
Pietranik (2018) In this paper, the author proposes a method for emotion detection using
facial landmarks detected using Python, DLib, and OpenCV. The author uses a
convolutional neural network (CNN) to classify the emotions based on the facial
landmarks. The results show that the proposed method achieves an accuracy of 91.4% on
the CK+ dataset and 84.8% on the JAFFE dataset, which are both widely used datasets for
emotion detection.

3. "Deep learning based emotion recognition from speech signals" by B. Yegnanarayana et


al. (2019) This paper proposes a deep learning-based approach for emotion detection from
speech signals. The authors use a combination of convolutional neural networks (CNNs)
and long short-term memory (LSTM) networks to learn the features from speech signals.
The proposed method achieves an accuracy of 65.33% on the IEMOCAP dataset, which is
a benchmark dataset for emotion detection from speech signals.

4. "Sentiment analysis using deep learning techniques: A review" by I. A. H. Akhtar et


al. (2020) This paper provides a review of deep learning techniques used for
sentiment analysis. The authors discuss various deep learning architectures, such as CNNs,
RNNs, and their variants, that have been applied to sentiment analysis. The paper also
covers transfer learning and multi-task learning techniques used for sentiment analysis.
The authors conclude that deep learning techniques have achieved state-of-the-art results in
sentiment analysis and have the potential to improve in the future.

5. "Emotion recognition using deep learning techniques: A review" by V. P. Desai et al.


(2021) This paper provides a review of deep learning techniques used for emotion
recognition. The authors discuss various deep learning architectures, such as CNNs, RNNs,
and their variants, that have been applied to emotion recognition. The paper also covers
transfer learning and multi-task learning techniques used for emotion recognition. The
authors conclude that deep learning techniques have shown promising results in emotion
recognition tasks and have the potential to improve in the future.

6. "Deep Learning Face Attributes in the Wild" (2015) by Rui Li et al. (2015) This paper
proposes Li et al. proposed a deep learning framework for age and gender estimation from
face images. They combined a multi-column deep neural network architecture with multi-
task learning to jointly predict age and gender.

7. "Deep Learning-Based Facial Expression Recognition: A Comprehensive Review". Facial


expression recognition (FER) is one of the most commonly used methods for emotion
detection. This paper provides a comprehensive review of deep learning-based FER
methods. It discusses various deep learning architectures used for FER, such as
Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Hybrid
models. The paper also reviews various datasets used for FER and compares the
performance of different deep learning models on these datasets. The review concludes with
some future research directions for FER using deep learning.

8. "Deep Emotion Recognition Using Convolutional Neural Networks". This paper proposes a
deep learning-based method for emotion recognition using CNNs. The proposed method first
extracts facial features using a pre-trained CNN and then feeds these features into a fully
connected neural network for classification. The authors evaluate their method on two
benchmark datasets, achieving state-of-the-art performance on both. The paper also provides a
comparative analysis of their method with other state-of-the-art methods and demonstrates the
effectiveness of their approach.

10. "Deep Age and Gender Estimation Based on Convolutional Neural Network"by Yang
Yang et al.(2017) : Yang et al. proposed a deep age and gender estimation model based on
the VGGNet architecture. They introduced a novel feature fusion approach and achieved
state-of-the-art performance on several benchmark datasets.

11. "Age and Gender Classification Using Convolutional Neural Networks" (2015)by Gil
Levi and Tal Hassner: Levi and Hassner developed a deep convolutional neural network
model for estimating age and gender from face images. They introduced a residual-based
architecture and trained the model on a large-scale dataset.

11. "Affective Computing: A Survey of Recent Trends, Challenges, and Applications".


Affective computing is an interdisciplinary field that focuses on developing computational
methods for recognizing, interpreting, and expressing emotions. This survey paper provides
an overview of recent trends, challenges, and applications in affective computing. The
authors discuss various approaches used in affective computing, such as physiological
signal-based methods, facial expression recognition, and speech emotion recognition. The
paper also reviews various applications of affective computing, such as healthcare,
education, and entertainment. The survey concludes with some future research directions for
affective computing.

CHAPTER 3
PROJECT DESCRIPTION AND GOALS

Our project's goal is to develop an emotion recognition system that can recognize a test
subject's emotion accurately in almost any situation. This design has two major components:
selection of the most prominent features, and the most accurate classification algorithm to
produce reliable results based on the features extracted.

Proposed Methodology

The hierarchical approach is used to determine the age, gender, and emotion of a person in
front of a camera. The first step involves using YOLOv2-tiny (You Only Look Once) to
detect the face of the individual. The resulting image frames are then passed through several
deep neural networks that have been trained on publicly available datasets. The process
includes two modules: one for emotion detection and another for age and gender
identification. The emotion module receives input from YOLOv2-tiny's output and uses
various hidden neural networks to map the facial characteristics onto the input image. The
output is produced using a SoftMax layer. The age-gender module receives the output from
the emotion module and stores it in an HDF5 file. The final output is a combination of both
modules and displays the age, gender, and emotion of the input data.

The model uses a webcam and OpenCV to capture frames and initially uses YOLOv2-tiny to
detect the face of the person in front of the camera. YOLOv2-tiny is a real-time object
detection tool. The network architecture comprises several deep neural networks, and each
frame is sent as input to this architecture. The Conv2D layer receives the input image first,
and the resulting output is fed into MaxPool 2D. The other hidden layers extract facial
features and carry them forward. The output of the emotion model is passed to the age-
gender module via an HDF5 file. This input goes through several hidden layers to extract
features from the face before being passed to global average pooling 2D, which computes
the average value of all values. The SoftMax layer then creates the final integrated result of
both modules and displays the age, gender, and emotion of the input data.
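A high-level sketch of this pipeline is given below. It assumes the OpenCV and tf.keras APIs; detect_faces is only a placeholder standing in for the YOLOv2-tiny face detector, and the two .h5 model file names are hypothetical, so this is an outline of the data flow rather than the report's actual code.

import cv2
from tensorflow import keras

def detect_faces(frame):
    # Placeholder for the YOLOv2-tiny face detector described above; it should
    # return a list of (x, y, w, h) face boxes for the given frame.
    return []

# Hypothetical file names for the two trained modules.
emotion_model = keras.models.load_model("emotion_model.h5")
age_gender_model = keras.models.load_model("age_gender_model.h5")

cap = cv2.VideoCapture(0)                     # webcam capture via OpenCV
ret, frame = cap.read()
if ret:
    for (x, y, w, h) in detect_faces(frame):
        face = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
        face = cv2.resize(face, (48, 48)).astype("float32") / 255.0
        emotion_probs = emotion_model.predict(face[None, ..., None])
        age_gender_out = age_gender_model.predict(face[None, ..., None])
cap.release()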

Based on the research of related works and the objectives of the project, the following
goals are to be achieved:

1. to analyze each dataset via rigorous exploration


2. to successfully extract feature subsets via feature selection
3. to perform classification using the five different algorithms on each subset
4. to construct an efficient CNN architecture
5. to perform classification of the subsets using the constructed CNN

System Architecture
Convolutional Neural Networks (CNNs) are used to differentiate speech samples based
on their emotion. Databases including RAVDESS and SAVEE are used to train and evaluate
the CNN models. Keras (TensorFlow's high-level API for building and training deep
learning models) is used as the programming framework to implement the CNN models. The
seven experimental arrangements of the present work are explained in this section.

FIGURE 2.2.1 SYSTEM ARCHITECTURE

Figure. 12.

YOLO

YOLO (You Only Look Once) is a state-of-the-art object detection algorithm introduced by
Joseph Redmon et al. in 2015. It revolutionized real-time object detection by combining high
accuracy and impressive speed in a single framework. YOLO takes an input image and
directly predicts the bounding boxes and class probabilities of multiple objects within the
image.

Here's a detailed explanation of YOLO:

1. Grid-based Detection: YOLO divides the input image into a grid of cells. Each cell is
responsible for predicting bounding boxes and class probabilities for objects that fall within
it. The size of the grid can be adjusted based on the desired level of granularity in object
detection.

2. Single Forward Pass: Unlike traditional object detection algorithms that rely on multiple
stages or region proposals, YOLO performs object detection in a single forward pass of a
convolutional neural network (CNN). This enables YOLO to achieve real-time inference
speeds.

3. Anchor Boxes: YOLO uses anchor boxes to predict bounding boxes. Anchor boxes are pre-
defined boxes of different sizes and aspect ratios. YOLO predicts the offsets (i.e., shifts)
from anchor boxes to adjust them to match the shape and size of the actual objects in the
image.

4. Prediction Encoding: Each grid cell in YOLO predicts multiple bounding boxes and their
associated class probabilities. For each bounding box, YOLO predicts the x and y
coordinates, width and height, confidence score, and class probabilities. The confidence
score reflects the likelihood of an object being present in the bounding box, and the class
probabilities represent the probability of the object belonging to different predefined classes.

5. Non-Maximum Suppression: After the predictions are generated, YOLO applies non-
maximum suppression to eliminate duplicate and overlapping bounding boxes. Non-
maximum suppression selects the bounding box with the highest confidence score and
suppresses overlapping boxes with lower scores, resulting in a final set of non-overlapping
and highly confident bounding boxes.

6. Training: YOLO is trained on labeled datasets, where each image is annotated with bounding
box coordinates and class labels. During training, YOLO uses a loss function that combines
localization loss (how well the predicted bounding box aligns with the ground truth) and

classification loss (how well the predicted class probabilities match the ground truth). The
network is then optimized using techniques like backpropagation and gradient descent to
minimize the loss.

7. YOLO Variants: Since its introduction, YOLO has undergone several improvements and
variations. YOLOv2 introduced architectural changes like anchor boxes and feature pyramid
networks for better accuracy and multi-scale object detection. YOLOv3 further improved
upon YOLOv2 with the addition of more layers, larger input size options, and improved
feature extraction. YOLOv4 and YOLOv5 continued to refine the architecture with
advancements in network design, data augmentation, and training strategies.

YOLO has become a popular choice for real-time object detection applications due to its
impressive performance and speed. It has been successfully used in various domains,
including autonomous driving, surveillance, robotics, and more. The continuous
development of YOLO and its variants has pushed the boundaries of real-time object
detection and significantly contributed to the field of computer vision.
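The non-maximum suppression step described in point 5 above can be sketched in plain NumPy as follows. This is a generic greedy NMS implementation provided for illustration, not YOLO's internal code; box coordinates are assumed to be [x1, y1, x2, y2].

import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidence values.
    order = scores.argsort()[::-1]            # process boxes from highest score down
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        # intersection of the current best box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]    # drop boxes overlapping too much
    return keep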

Figure. 13.

Datasets Of Age, gender Recognition


The IMDb-WIKI dataset is a widely used and publicly available dataset extensively used in
age estimation and face recognition research. It is composed of face images collected from
IMDb (Internet Movie Database) and Wikipedia. This dataset has become a standard
benchmark for training and evaluating models for age estimation, gender classification, and

related tasks. The dataset was compiled by gathering images of celebrities from IMDb,
which offers an extensive database of actors and actresses, and from Wikipedia, which
includes a diverse range of individuals from various fields. The dataset contains age
annotations for a substantial number of images, obtained from publicly available information
like birthdates mentioned on IMDb or Wikipedia pages. With its wide variety of images,
including professional headshots and casual photographs, the dataset captures different
lighting conditions, facial expressions, and poses, enabling the training of models robust to
real-world variations. The dataset is typically split into training, validation, and testing
subsets, facilitating model development, hyperparameter tuning, and final evaluation. While
the dataset offers rich face images for age and gender analysis, it does come with some
challenges. Age annotations may not always be precise, which can introduce potential
inaccuracies in the dataset. Moreover, the dataset primarily focuses on celebrities, which
might limit its representation of the general population. Nonetheless, due to its availability
and comprehensive nature, the IMDb-WIKI dataset has been widely used in various research
studies to develop and evaluate age estimation and gender classification algorithms.

Feature Extraction
The proposed method based on the Broad ResNet architecture extracts key features from
facial images for age and gender recognition. By leveraging deep convolutional layers, the
model captures intricate details such as edges, textures, and facial landmarks, along with
higher-level characteristics like expressions and facial structures. The extension of the
architecture with broader convolutional filters enhances the model's ability to capture global
information and context in the images, contributing to a more comprehensive representation
of facial features. Additionally, the increased model capacity allows for the extraction of rich
and complex features that are crucial for accurate age and gender recognition. Through
training on large-scale datasets like IMDb-WIKI, the model learns robust features that
generalize well to unseen facial images, resulting in improved performance. By utilizing
these extracted features, the proposed approach demonstrates its effectiveness in age and
gender classification tasks, leveraging the power of deep convolutional networks and the
advantages offered by the Broad ResNet architecture.

Haar-Cascade Classifier

Haar Cascade Classifier is a machine learning-based object detection algorithm developed


by Viola and Jones in 2001. It is particularly well-known for its application in face detection
but can also be used to detect other objects of interest.

The algorithm is based on the Haar-like features, which are simple rectangular features that
are calculated by subtracting the sum of pixel intensities in one region from the sum of pixel
intensities in another region. These Haar-like features are computed at different scales and
positions across an image.

The Haar Cascade Classifier consists of two main stages: training and detection.

1. Training:

 Positive Samples: Positive samples refer to the images containing the objects of interest,
such as faces. These images are labeled to indicate the presence of the object.

 Negative Samples: Negative samples refer to images that do not contain the objects of
interest. These images are labeled as background.

 Feature Selection: During the training process, a large number of Haar-like features are
computed on both positive and negative samples. The algorithm selects a small subset of
discriminative features that are most effective in distinguishing the object from the
background.

 AdaBoost: AdaBoost (Adaptive Boosting) is used to iteratively select the best features and
create a strong classifier. Each feature is assigned a weight based on its performance in
classifying the positive and negative samples. The weights are adjusted at each iteration to
focus on misclassified samples, allowing the classifier to improve over time.

 Cascade Classifiers: Multiple weak classifiers are combined into a strong classifier using a
cascade structure. Each weak classifier is trained using a subset of features, and the cascade
structure allows for efficient and fast detection by quickly rejecting regions that are unlikely
to contain the object.

2. Detection:

 Sliding Window: The detection process involves scanning a sliding window across the input
image at different scales and positions. The window size is varied to account for different
object sizes.

 Integral Image: To efficiently compute the Haar-like features, an integral image


representation is used. The integral image allows for the fast calculation of rectangular
feature sums by storing the cumulative sum of pixel intensities.

 Adaboost Classification: At each sliding window position, the computed Haar-like features
are evaluated using the trained AdaBoost classifier. The classifier determines whether the
window contains the object or not based on the learned weights of the selected features.

 Cascade Classification: The cascade structure is utilized to speed up the detection process.
The image regions that pass through the early stages of the cascade (which consists of fewer
and faster classifiers) are further evaluated by subsequent stages with more complex
classifiers. This allows for early rejection of regions that are unlikely to contain the object,
reducing the number of computations needed.

 Non-Maximum Suppression: After the cascade classifiers have been applied, a post-
processing step called non-maximum suppression is performed to eliminate duplicate
detections and select the most accurate bounding boxes for the detected objects.

Haar Cascade Classifier has been widely used for real-time object detection tasks due to its
efficiency and accuracy. While it was originally designed for face detection, it has been
successfully applied to detect various objects, including eyes, pedestrians, and vehicles.
However, one limitation of the Haar Cascade Classifier is that it may struggle with complex
object appearances or variations in scale, pose, or lighting conditions. Other more advanced
object detection algorithms, such as Faster R-CNN and YOLO, have since been developed to
address these limitations.
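For reference, face detection with OpenCV's bundled frontal-face Haar cascade typically looks like the sketch below; the input image path is an assumed example, and the detection parameters shown are common defaults rather than values taken from the report.

import cv2

# Load the frontal-face cascade shipped with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

image = cv2.imread("sample.jpg")              # assumed example image path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# detectMultiScale scans sliding windows at several scales, as described above.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces_detected.jpg", image)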

Figure. 13.

Designing the Dimensions of the Model
When designing the dimensions of the model based on the Broad ResNet architecture for age
and gender recognition, several considerations come into play. Firstly, the depth of the
model, which refers to the number of convolutional layers and residual blocks, needs to
strike a balance between model complexity and computational resources. The selection of
filter sizes is another crucial aspect, with smaller filter sizes often used in initial layers and
larger ones in subsequent layers to capture both local details and global context. Network
width, determined by the number of filters in each layer, should be carefully chosen to
balance model capacity and computational efficiency. Incorporating pooling and striding
operations can help reduce spatial dimensions and increase the receptive field. Lastly, the
addition of fully connected layers can transform the extracted features into age and gender
predictions. Regularization techniques like dropout and batch normalization can be
incorporated to prevent overfitting. It's important to experiment and fine-tune these design
choices based on dataset characteristics, computational constraints, and desired performance,
as specific adjustments may be needed for optimal results.

Model Training and Testing


The model is trained with the training dataset and tested with the test data set. Actual
values are compared with the predicted values. This comparison gives us the accuracy of
the model. The dataset is split into training and validation sets for model training and
evaluation. A CNN deep learning algorithm is used for age and gender prediction and
emotion recognition to achieve high accuracy. The architecture of the model is designed,
including the number of layers, their types (e.g., convolutional, recurrent), activation
functions, and connections. The model is trained using the extracted features as input and the
corresponding labels for age, gender, and emotions. Optimize hyperparameters, such as
learning rate, batch size, and regularization techniques, to improve the model's performance.
The model's performance is evaluated using appropriate metrics, such as accuracy, precision,
recall, or F1 score, on the validation set. Adjust the model architecture and hyperparameters
as needed.

Architecture of Cnn
In the current study, the deep neural network architecture implemented is a convolutional
neural network. In the proposed architecture, a max-pooling layer is placed after each
convolutional layer. To introduce non-linearity into the model, the Rectified Linear Unit
(ReLU) activation function is used in both the convolutional and fully connected layers.

Figure. 14.

Batch normalization is used to improve the stability of the neural network; it normalizes
the output of the preceding activation layer, reducing how much the hidden unit values shift
around and allowing each layer of the network to learn more independently. A dense layer is
used, in which all the neurons in a layer are connected to the neurons in the next layer;
it is a fully connected layer. A Softmax unit is used to compute the probability distribution
over the classes, and the number of Softmax outputs depends on the number of emotion classes
to be classified. The model took between 10 and 14 hours to train. CPUs consume a lot of time
to train the model; GPUs can be used instead to speed up the training process, since their
many cores accelerate computation and save considerable time. Figure 2 shows the convolutional
neural network architecture used in this model. A lighter CNN architecture is also used to
classify among a greater number of classes, and good results are obtained.

Feature Of IMDb-WIKI dataset


The IMDb-WIKI dataset offers several prominent features that make it highly valuable for
age and gender recognition research. Firstly, the dataset is characterized by its large-scale
nature, encompassing a significant number of face images collected from IMDb and
Wikipedia. This extensive data availability allows for robust model development and
comprehensive evaluation. Additionally, the dataset includes age annotations for a
substantial portion of the images, derived from publicly available information like birthdates
found on IMDb or Wikipedia pages. These age labels enable supervised learning approaches,
facilitating age estimation tasks. Another noteworthy feature is the dataset's diversity,
incorporating a wide range of individuals, including celebrities from IMDb and individuals
from various fields found on Wikipedia. This diversity in terms of age, ethnicity, gender, and
occupation enhances the dataset's representation of the general population, enabling models
to generalize better. Furthermore, the dataset encompasses a variety of face images,
capturing different lighting conditions, facial expressions, poses, and image qualities, thus
reflecting real-world complexities. Lastly, the IMDb-WIKI dataset is publicly available,
accessible through research repositories and dedicated websites, making it easily accessible
to researchers and practitioners in the field. Overall, the combination of its large-scale
nature, age annotations, diversity, image variety, and accessibility make the IMDb-WIKI
dataset a valuable resource for the development and evaluation of age and gender
recognition models.

Classification models
For age, gender recognition
Classification models using the IMDb-WIKI dataset for age and gender recognition typically
follow a specific architecture. The model begins with inputting face images from the dataset,
which are preprocessed for size consistency, alignment, and normalization. Convolutional
layers are then used to extract low-level visual features and capture relevant patterns from
the images. Activation functions like ReLU introduce non-linearity, while pooling layers
reduce spatial dimensions and retain important features. Fully connected layers follow,
allowing the model to learn higher-level representations and complex relationships between
features. For age classification, a SoftMax layer with nodes corresponding to age groups
predicts the age category. Similarly, a SoftMax layer with two nodes representing male and
female labels is added for gender classification. The model is trained using a suitable loss
function and optimized through algorithms like stochastic gradient descent or Adam.
Evaluation is conducted on a separate validation or test set using metrics such as accuracy,
precision, recall, and F1 score. It's important to customize the architecture and
hyperparameters based on the specific task requirements and to experiment and fine-tune the
model for optimal performance with the IMDb-WIKI dataset.

Data Pre-processing:
The dataset is divided into two sets, namely the Training Set and the Validation Set. The
training set is composed of 80% of the original dataset samples, and 20% of the samples are
reserved for validation. The training and validation sets contain 22966 and 5741 samples
respectively. Preprocessing is executed to prepare the images for the feature extraction
stage. A set of facial feature points is extracted from the images, and facial features are
then derived from those points. Different sets of facial features are used for the training
and validation classifiers.

Data Augmentation:
In order to avoid overfitting and improve recognition accuracy, we applied data augmentation
techniques to the training and validation samples. For each image we performed the following
transforms (see the sketch after this list):

1. Rescale (1/255)
2. Shear (0.2)
3. Zoom (0.2)
4. Flip (horizontal)
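A sketch of these transforms using Keras' ImageDataGenerator is shown below. This is assumed API usage for illustration; the project may have configured augmentation differently.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation matching the transforms listed above.
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
)
# Validation images are only rescaled, not distorted.
validation_datagen = ImageDataGenerator(rescale=1. / 255)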

Datasets:
Choosing an appropriate and suitable dataset is a very important part of the given problem.
In order to get the best results for this problem, we are using the FER-2013 dataset from the
Kaggle data science platform. FER-2013 is provided as fer2013.csv, which contains three
columns (emotion, pixels, usage). The table below shows the number of samples used for each
emotion class.

Class      No. of Samples
Angry      3995
Disgust    436
Fear       4097
Happy      7215
Neutral    4830
Sad        3171
Surprise   4965

TABLE 2: FACE DATASETS COUNT

Convolutional Neural Network (CNN) Model


In this section, we present the data flow structure of our CNN model for the facial emotion
detection problem. The model takes an input image of size 48 x 48 pixels. The architecture
consists of five convolutional layers and five max-pooling layers, together with two fully
connected layers, and the final layer classifies the image through a softmax activation
function. The output layer contains 7 neurons corresponding to the 7 emotion labels: angry,
disgust, fear, happy, sad, surprise and neutral.

The 5-layered CNN structure is represented in the table below. It consists of 5 convolution
layers and 5 max-pooling layers together with 2 fully connected layers and the output layer.
Our model uses the Rectified Linear Unit (ReLU), the most commonly used activation function,
applied to all the convolution and dense layers except the final (output) layer, which uses
the softmax function. A dropout layer is also applied after every convolution, max-pooling
and dense layer with a rate of 0.25.

The first convolution layer is composed of 64 filters of size 3 x 3. After each convolution
layer, a max-pooling layer with a standard pool size of 2 x 2 is applied. In the second layer
the number of filters is doubled compared to the preceding convolution layer, giving 128
filters with a 5 x 5 filter size; in the same way, the successive layers use (number of
filters, filter size) of (64, 3 x 3), (128, 5 x 5), (128, 3 x 3), (256, 5 x 5) and
(512, 3 x 3) respectively. A max-pooling layer of size 2 x 2 follows each of the five
convolution layers in order to capture every feature of the image. Around each pooling layer
we add a dropout layer that randomly selects neurons and ignores them in the next layer, in
order to avoid the overfitting problem. The subsequent layers are the fully connected (hidden)
layers of a conventional neural network; we use 2 FC layers plus the output layer, with 512,
1024 and 7 neurons respectively. The 7 output neurons represent the 7 labelled classes of
human emotions, as stated earlier.

Finally, the 7 output neurons give a probability distribution, and the class with the maximum
probability is taken as the final answer. A sketch of this architecture in Keras is given
below.
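The following Keras sketch assembles the layer sequence described above, using the listed filter counts and sizes, an assumed 'same' padding, and a dropout rate of 0.25 after each block. It is an approximation of the described architecture for illustration, not the report's exact implementation.

from tensorflow import keras
from tensorflow.keras import layers

def build_emotion_cnn(num_classes=7):
    model = keras.Sequential([keras.Input(shape=(48, 48, 1))])
    # Five convolution blocks with the filter counts / kernel sizes listed above.
    for filters, kernel in [(64, 3), (128, 5), (128, 3), (256, 5), (512, 3)]:
        model.add(layers.Conv2D(filters, kernel, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=2))
        model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dropout(0.25))
    model.add(layers.Dense(1024, activation="relu"))
    model.add(layers.Dropout(0.25))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

model = build_emotion_cnn()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])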

Convolution Network Layers and their terminologies:


Input Image:
Our model requires a 48 x 48 x 1 input image, where 48 x 48 is the height and width of the
image and 1 is the number of channels (channel = 1 for grayscale, 3 for RGB).

Pooling Layer:
The pooling layer selects a single value from each window of the specified dimensions. Two
popular pooling methods are max pooling and average pooling, where max pooling takes the
maximum value and average pooling takes the mean value of each window of the matrix.
Convolution Layer:
The essential purpose of the convolution layers is to analyse the visual imagery. In our
case, the first layer takes a 48 x 48 x 1 input image and applies 64 filters of size 3 x 3.
The output dimensions of each convolution layer are computed in the same way as described
above for the pooling layer.

Fully Connected Layer:


The fully connected layer uses the rectified linear unit activation function for the FC
layers, while the final layer uses the SoftMax function. It is essentially a hidden layer
from a typical neural network model.

Stride:
Stride is the step with which a window of specific dimensions travels over the matrix
(image) while computing the desired layer. For example, in max pooling, if we specify a
2 x 2 window, it will travel through the whole matrix and calculate the maximum value from
each 2 x 2 window.

Dropout Layer:
The dropout layer is a technique for preventing the neural network model from overfitting.
It randomly selects neurons and ignores them while training the model. In our case, the
model is trained with a dropout rate of 0.25.

Figure. 15.

Feature Selection

Feature Selection is the process of retrieving useful features from a group of data.
Oftentimes it is found that many features are redundant in the actual prediction of results;
they do not have any impact on the ML model. Such features unnecessarily consume computation
time and make the process more expensive. Algorithms which identify the optimal subset of
features, providing better results than all the features combined, are called feature
selection algorithms. These algorithms generally evaluate the features based on certain
metrics.

They are of three types:


⮚ Filter: These selection algorithms tend to use statistical measures to determine the
importance of features. The impact of a feature on the prediction is directly measured
as opposed to a combination of features. Tests such as the Chi-Square test are used to
determine the relevance of a feature. They do not use any ML algorithms to verify
metrics.

⮚ Wrapper: These algorithms iterate throughout the feature subset either from 1 to n or
vice versa to determine the optimal n features. Through either forward selection
(chooses the best feature and adds the next best feature till n features are achieved),
backward selection (eliminates the worst feature till n features are achieved), or
exhaustive selection (a combination of forward and backward), the best set of features
are chosen. Although these methods provide very good results, they are
computationally expensive.


⮚ Embedded: These algorithms employ a combination of Filter and Wrapper methods
and perform exhaustive searches using low computation. They claim to possess the
best abilities of both methods. Lasso, Ridge, and Elastic Nets are examples of these
methods. A short illustrative sketch of all three families follows below.
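The sketch below illustrates the three families with scikit-learn; it is not taken from the project pipeline, the dataset is synthetic, and the parameter values (k=5, alpha=0.05) are arbitrary placeholders:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression, Lasso

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Filter: rank features by a statistical test (chi2 needs non-negative inputs).
X_filter = SelectKBest(chi2, k=5).fit_transform(np.abs(X), y)

# Wrapper: recursively eliminate features based on a model's performance.
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit_transform(X, y)

# Embedded: Lasso drives uninformative feature weights to zero during training.
lasso = Lasso(alpha=0.05).fit(X, y)
selected = np.flatnonzero(lasso.coef_)

print(X_filter.shape, X_wrapper.shape, selected)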

CHAPTER 4

TECHNICAL SPECIFICATIONS

There are multiple aspects to designing the software for this project, since the complete
processing work is done in software, and ample time went into designing and testing it.
The Python program takes the audio input, processes that audio corpus, identifies features,
then compares them with the trained data frame and returns an approximate emotion for the
captured audio.

Classification Using Machine Learning


Machine Learning is a conglomeration of some aspects of data mining, statistical analyses,
and pattern recognition. Depending on the type of algorithm used for the problem,
different weights are assigned to the features which are responsible for the prediction of
results. In essence, an algorithm is a mathematical function. The algorithm is trained on a
part of the dataset, where it is allowed to match its function to the expected result. The
weights are adjusted and the outputs of the functions are approximated to correspond to a
particular class of the result. The model when fitted onto a test data, makes a prediction of
the class of the entry based on the functional output. These predictions are plotted against
the true values in a confusion matrix and evaluated based on several metrics including
accuracy.
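A small sketch of this evaluation step, using scikit-learn with illustrative labels rather than the project's actual predictions:

from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true and predicted emotion labels.
y_true = ['happy', 'sad', 'happy', 'angry', 'sad', 'happy']
y_pred = ['happy', 'sad', 'angry', 'angry', 'sad', 'happy']

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred, labels=['angry', 'happy', 'sad']))
print("accuracy:", accuracy_score(y_true, y_pred))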

Support vector machine


Support Vector Machines have a wide range of applications and are one of the most
versatile supervised learning algorithms. SVM represents the inputs as points in space and
tries to separate the various categories in hyper-space with as wide a gap as possible.
It distinguishes the categories by constructing a hyperplane or a set of hyperplanes in a
high- or infinite-dimensional space. In addition to its many flexibilities, there exists a technique in
SVM known as the ‘kernel trick’, which implicitly maps all inputs into higher feature
spaces.

SVM and the kernel trick do not scale well when a large number of inputs or features are
involved.
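A minimal sketch of an SVM with the RBF kernel trick, assuming generic feature vectors (synthetic data) rather than the project's extracted audio or image features:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The RBF kernel implicitly maps inputs into a higher-dimensional feature space.
clf = SVC(kernel='rbf', C=1.0, gamma='scale')
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))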

K-nearest neighbor classifier
The KNN algorithm plots all the inputs in feature space and calculates the distance
between two points using the Euclidean Distance formula as given below:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$


The points in the neighborhoods are weighted uniformly, and in this project, five nearest
neighbors are considered. This is a type of instance-based or lazy learning method.
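A minimal sketch matching the description above, with five uniformly weighted neighbours and Euclidean (Minkowski, p=2) distance; the data is synthetic, not the project's:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='minkowski', p=2)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))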

Decision tree classifier


Decision Trees use a non-parametric approach to solve classification problems. This
model infers decision rules based on the features and uses little to no data preprocessing.
The tree model can handle numerical as well as categorical data very easily. It creates a
tree consisting of if-else and other such conditional statements and predicts based on the
decisions at each node of the tree. Deeper trees are more complex, but sometimes the
classifier builds unnecessarily complex trees. Subtle differences in learning can create
entirely different trees which may provide varying results. The decisions, however, use a
white-box model and hence can be easily validated and interpreted.
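A minimal decision-tree sketch on synthetic data; the max_depth value is an illustrative choice, not a project setting, and export_text simply prints the learned if-else rules mentioned above:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=6, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

tree = DecisionTreeClassifier(max_depth=4, random_state=2)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))  # the white-box decision rules learned at each node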

Neural Network
Neural Networks are mechanisms which are modeled after the human brain. They consist
of layers of nodes which pass data from one layer to another. Each layer consists of a function
which assigns weights to the incoming nodes and performs certain mathematical
operations on them before forwarding it to the succeeding layer. Although Neural
Networks have been around a long time, they rose to prominence due to their application
in Deep Learning. Today they are primarily used for performing rigorous analyses on
large image datasets.

The name itself is derived from the neuron which is the basic cell in humans that relays
data via electrical synapses.

This project classifies the datasets via a Convolutional Neural Network Architecture. A
CNN has its origins in the connectivity pattern of neurons in the animal visual cortex. The
architecture consists of various layers or kernels which actually perform the learning
component. The following are important terms to know regarding the CNN Architecture:

1. Conv1D: This is a layer (kernel) which convolves a single spatial dimension input into a
tensor of outputs.

2. Dense: This layer applies the activation function on the incoming data and modifies the
weight of the nodes before feeding forward.

3. Activation: This is a function which is applied on a layer. These are popular mathematical
models which have proven to be beneficial in calculating the weights of nodes. This project
uses the tanh activation function, as described below.

4. Flatten: This function is usually used just before the output layer. It flattens the
multi-dimensional inputs into a single one-dimensional vector.

5. Batch Normalization: It standardizes the activations of the previous layer and keeps the mean
activation near 0 and the activation standard deviation near 1.

6. Dropout: This layer randomly sets input units to 0 to help prevent overfitting.
7. Input Layer: This is the first layer of the CNN where inputs are delivered.
8. Output Layer: This is the final layer of the CNN where predictions are made.
9. Hidden Layers: These are layers that fall between the input and output layers and perform
learning and forwarding of the nodes. The proposed CNN is to have one hidden layer
wedged between the input and output layers. The input layer is a Conv1D layer with 32
filters and a kernel size of 2. It uses a tanh activation function. Batch Normalization is
performed on this layer and a Dropout of 0.2 is performed. The hidden layer is also a
Conv1D layer identical in every way to the previous layer except for having 64 filters. The
nodes are first flattened and a Dense layer with 64 filters using the tanh activation function
forwards the data but with a Dropout of 0.5. The model is compiled using the following
functions:

10. Optimizer: This is a function used to change the attributes of the nodes in order to reduce
losses. The proposed CNN uses the Adam optimizer.

11. Loss: This function calculates the loss between the true and predicted values. The proposed
CNN uses the Binary Cross-Entropy loss function.

12. Metrics: The model is evaluated with respect to Accuracy during each Epoch
(iteration). A sketch assembling the architecture described in items 9 to 12 follows below.
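The sketch below assembles the stated hyper-parameters (Conv1D with 32 and 64 filters, kernel size 2, tanh activation, Batch Normalization, Dropouts of 0.2 and 0.5, Adam optimizer, binary cross-entropy); the input length of 40 features and the single sigmoid output neuron are assumptions made here for illustration, not values stated in the report:

from keras.models import Sequential
from keras.layers import Conv1D, BatchNormalization, Dropout, Flatten, Dense

n_features = 40  # assumed length of each input feature vector

model = Sequential()
# Input layer: Conv1D with 32 filters, kernel size 2, tanh activation.
model.add(Conv1D(32, kernel_size=2, activation='tanh', input_shape=(n_features, 1)))
model.add(BatchNormalization())
model.add(Dropout(0.2))
# Hidden layer: identical except for 64 filters.
model.add(Conv1D(64, kernel_size=2, activation='tanh'))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(64, activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))  # assumed output for binary cross-entropy
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()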

Convolutional Neural Network


Convolution layer
The convolutional layer is the basic building block of CNNs and it performs the
convolution operation on the input image using a set of learnable filters or kernels. The
convolutional layer consists of three main components: the input, the filters, and the
output.

Input: The input to the convolutional layer is a three-dimensional tensor of shape (W, H,
C), where W is the width of the input image, H is the height of the input image, and C is
the number of channels or color depth of the input image.

Filters: The filters or kernels are learnable parameters that are convolved with the input
image to extract features. The filters are small matrices of size (F, F, C), where F is the
filter size and C is the number of channels in the input image. The number of filters
determines the number of features that the convolutional layer will extract from the input
image.

Output: The output of the convolutional layer is a three-dimensional tensor of shape (W',
H', F), where W' and H' are the spatial dimensions of the output feature map and F is the
number of filters. The output feature map represents the activation of each filter at each
spatial location of the input image.
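For reference, the spatial output size can be computed with the standard convolution-arithmetic formula (a general textbook relation; the padding P and stride S are symbols introduced here, not parameters stated in this report):

$W' = \frac{W - F + 2P}{S} + 1, \qquad H' = \frac{H - F + 2P}{S} + 1$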

The convolutional layer performs the following steps:


Convolution: The input image is convolved with each filter to produce a feature map.
Activation: The output of the convolution operation is passed through an activation
function such as ReLU (Rectified Linear Unit) to introduce non-linearity into the network.
Pooling: The output feature map is down-sampled using a pooling operation such as max
pooling or average pooling to reduce the spatial dimensions of the feature map.

Convolutional Layer
The convolutional layer has several hyperparameters that are used to control its behavior
and performance. The main hyperparameters of the convolutional layer are:

Filter size: The filter size determines the size of the kernel or filter that is applied to the
input image. Typical filter sizes are 3x3, 5x5.

Pooling layer
1. Convolutional layers in a convolutional neural network deliberately apply learned
filters to input images in order to create feature maps that summarize the presence
of those features in the data.

2. Convolutional layers prove extremely powerful, and stacking convolutional layers
in deep models allows layers close to the input to learn low-level features
and layers deeper in the model to learn progressively higher-order, more abstract features, like
shapes or specific objects.

3. A limitation of the feature-map output of convolutional layers is that it records the
precise position of features in the input. This means small shifts in the position of a feature in
the input picture will result in a different feature map. This can
happen with re-cropping, rotation, shifting, and other minor changes to the input picture.

4. A straightforward way to address this issue, borrowed from signal processing, is down-sampling.
That is where a lower-resolution version of an input signal is made that still
contains the significant or essential structural elements, without the fine detail that may not
be as helpful to the task.

Spatial Pooling can be of different types:


1. Max Pooling selects the most significant value from each region of the rectified feature map.
2. Global Pooling can be used in a model to aggressively summarize the presence of an element in
a picture. It is also sometimes used in models as an alternative to using a
fully connected layer to transition from feature maps to an output prediction for the
model. A small Keras sketch follows below.
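A minimal Keras sketch contrasting max pooling for local down-sampling with global average pooling as an alternative to a fully connected layer; the filter counts are illustrative and not the project's configuration, although the 48 x 48 x 1 input and 7 classes match the report:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(48, 48, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))   # local down-sampling
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(GlobalAveragePooling2D())         # summarizes each feature map to a single value
model.add(Dense(7, activation='softmax'))   # 7 emotion classes
model.summary()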

Technologies Learnt
PYTHON
1. Python is a general-purpose interpreted, interactive, object-oriented, and high-level
programming language.

2. It was created by Guido van Rossum during 1985- 1990


3. It is named after “Monty Python's Flying Circus”, a BBC comedy series from the 1970s.

Why Learn Python?


1. Interpreted Language
2. It supports oops concept
3. It is a platform independent language
4. Line by line executer
5. No need to compile the program
6. Easy for beginners to learn

Python Applications

1. Web Development
2. Game Development
3. Machine Learning and Artificial Intelligence
4. Data Science and Data Visualization
5. Web Scraping Applications
6. Business Applications
7. Audio and Video Applications
8. Embedded Applications

Good to know
1. The most recent major version of Python is Python 3, which we shall be using in this
project. However, Python 2, although not being updated with anything other than security
updates, is still quite popular.

2. Python 2.0 was released in 2000, and the 2.x versions were the prevalent releases until
December 2008. At that time, the development team made the decision to release version
3.0, which contained a few relatively small but significant changes that were not
backward compatible with the 2.x versions. Python 2 and 3 are very similar, and some
features of Python 3 have been backported to Python 2. But in general, they remain not
quite compatible.

3. Both Python 2 and 3 have continued to be maintained and developed, with periodic release
updates for both. As of this writing, the most recent versions available are 2.7.15 and
3.6.5. However, an official End of Life date of January 1, 2020 has been
established for Python 2, after which time it will no longer be maintained.
4. Python is still maintained by a core development team at the Institute, and Guido is
still in charge, having been given the title of BDFL (Benevolent Dictator for Life) by the
Python community. The name Python, by the way, derives not from the snake, but from the
British comedy troupe Monty Python’s Flying Circus, of which Guido was, and
presumably still is, a fan. It is common to find references to Monty Python sketches and
movies scattered throughout the Python documentation.

It is possible to write Python in an Integrated Development Environment, such as Thonny,
PyCharm, NetBeans or Eclipse, which are particularly useful when managing larger
collections of Python files.

Python Syntax compared to other programming language


1. Python was designed for readability, and has some similarities to the English language
with influence from mathematics.

2. Python uses new lines to complete a command, as opposed to other programming languages
which often use semicolons or parentheses.

3. Python relies on indentation, using whitespace, to define scope, such as the scope of loops,
functions and classes. Other programming languages often use curly brackets for this
purpose. A small example follows this list.
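A small illustrative example of these points (newline-terminated statements, no semicolons, and indentation defining scope):

# The function body, loop body and if-block are delimited purely by indentation.
def count_positive(values):
    total = 0
    for v in values:
        if v > 0:
            total += 1
    return total

print(count_positive([3, -1, 4, 0]))  # prints 2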

HTML:
HTML stands for Hypertext Markup Language, which is a standard markup language for
creating webpages and web applications. We used HTML for designing the
webpages of the system being built. Web browsers normally receive HTML documents from a web
server, but in our case they come from local storage, and render the documents into multimedia
web pages. We have learnt many new HTML elements in addition to what
we learnt in class, and have implemented them in the project wherever
necessary.

CSS:
CSS stands for Cascading Style Sheets; CSS is a style sheet language that is used for
describing the presentation of a document written in a markup language like HTML. CSS
is a technology that is used alongside HTML and JavaScript.

CSS is designed to enable the separation of presentation and content, including layout,
colors, and fonts. CSS helps us to improve content accessibility, provide more flexibility
and control in the specification of presentation characteristics, enable multiple webpages
to share formatting by specifying the relevant CSS in a separate CSS file, and reduce
complexity and repetition in the structural content.

WEB APPLICATION
Flask is a lightweight WSGI web application framework. It is intended to make getting
started quick and simple, with the capacity to scale up to complex applications. It started as a
straightforward wrapper around Werkzeug and Jinja and has become one of the most well-
known Python web application frameworks. A minimal example appears after the highlights below.

Highlights
1. Built-in development server and fast debugger.
2. Integrated support for unit testing.
3. RESTful request dispatching.
4. Jinja2 templating.
5. Support for secure cookies (client-side sessions).
6. WSGI 1.0 compliant.
7. Unicode based.
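A minimal sketch of this kind of WSGI application with a single placeholder route; it is not the project's actual web application:

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    # Placeholder page; the real application would render the detection UI.
    return "Emotion detection demo"

if __name__ == "__main__":
    app.run(debug=True)  # built-in development server and debugger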

Libraries
Pandas
Pandas is a library which is primarily used for data analysis and manipulation. It parses
data in the form of pandas Data Frames. Some of its functionalities include reading,
sorting, merging, aggregating, and reshaping data. It has features to handle missing and
null values, and allows for data filtration as well. It also has capabilities to handle time-
series data.

NumPy
NumPy is a library that deals with large multi-dimensional arrays and matrices. It
performs high-level mathematical operations on these mathematical structures. The
NumPy array is a universal data structure to handle images, filter kernels, and feature
points.

Matplotlib
Matplotlib is Python’s plotting library and is used along with NumPy to create graphical
visualizations and plots. It has a MATLAB-like interface and is also designed to perform
functions equivalent to MATLAB on Python. It provides an object-oriented API to embed
plots into applications.

Seaborn
Seaborn is a statistical graphing library built on top of Matplotlib and is closely integrated
with pandas data structures. Visualization is its core functionality and it is heavily used for
exploratory data analysis and data understanding. It has a wide range of color-palettes and
is useful in visualizing a multitude of correlation plots and multivariate distributions.
Scikit-learn
Scikit-learn or sklearn is an open-source ML library written in and written for the Python
language. It consists of a collection of algorithms and resources that are useful while
performing classification, regression, or clustering on a dataset. It uses NumPy for high-
performance linear algebra. The algorithms used in this project: PT, SVC, KNN, GNB,
and DTC have been picked from sklearn's available ML algorithms.

TensorFlow
TensorFlow is an open-source library for performing a range of functions including
dataflow and differentiable programming. Developed by Google’s Brain Team, it is used
to develop neural network architectures and machine learning models. Additionally, Google has
developed a special hardware processing unit, the TPU, designed for machine learning.

System Specification
All experiments in this project were performed in Python. All classifications and neural
network constructions were performed on Google Colaboratory, which runs on a Python 3
Google Compute Engine backend with no hardware accelerators. These environments
were run on a 64-bit Windows 10 OS with an 8GB RAM and a 1TB HDD.

The following versions of the libraries were used:


1. Pandas 1.0.3
2. NumPy 1.0.4
3. Matplotlib 3.0.1
4. Seaborn 1.10.1
5. Sklearn 0.22.2
6. TensorFlow 2.2.0
7. Keras 2.3.0-tf
The platform versions are as follows:
1. Jupyter Notebook 6.0.3

2. Anaconda Navigator 1.9.12

3. Python 3.7.6

4. Google Colaboratory 1.0.0

CHAPTER 6

RESULTS & DISCUSSION

6.1 Execution Speed

While training, each step takes about 6 ms, which makes up around 5 seconds per
sample. Our dataset has around 1500 samples of audio, each of which lasts 3 seconds.

Training the whole data set takes around 2 hours, but it is a one-time cost. Testing live
samples takes around 118 ms per step, which is around 3.76 seconds in total.

Extracting Images from The Datasets:

FIGURE-11: EXTRACTING DATASET OF FACE EMOTION

Layers Of CNN For Speech Emotion:


After training numerous models, we got the best validation accuracy of 90% with 18
layers, using a CNN model.

FIGURE – 12: SPEECH LAYERS OF CNN

Face Emotion Recognition Layers Of CNN:

FIGURE -13: FACE EMOTION LAYERS OF CNN

Model Accuracy:

FIGURE – 14: RESULTING ACCURACY FOR SPEECH EMOTION

FIGURE-15: RESULTING ACCURACY OF FACE EMOTION

Model Loss

FIGURE 16: SPEECH EMOTION MODEL LOSS

FIGURE 17 FACE EMOTION MODEL LOSS

Appendix

Source Code:

SPEECH EMOTION CODE:

import matplotlib.pyplot as plt


import numpy as np
import pandas as pd
import seaborn as sns
import os

# Importing Deep Learning Libraries


from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.image import ImageDataGenerator
from keras.layers import (Dense, Input, Dropout, GlobalAveragePooling2D,
                          Flatten, Conv2D, BatchNormalization,
                          Activation, MaxPooling2D)
from keras.models import Model,Sequential
from keras.optimizers import Adam,SGD,RMSprop
picture_size = 48
folder_path = "../input/face-expression-recognition-dataset/images/"
expression = 'disgust'
plt.figure(figsize=(12, 12))
for i in range(1, 10, 1):
    plt.subplot(3, 3, i)
    img = load_img(folder_path + "train/" + expression + "/" +
                   os.listdir(folder_path + "train/" + expression)[i],
                   target_size=(picture_size, picture_size))
    plt.imshow(img)
plt.show()
batch_size = 128
datagen_train = ImageDataGenerator()
datagen_val = ImageDataGenerator()
train_set = datagen_train.flow_from_directory(folder_path+"train",
target_size = (picture_size,picture_size), color_mode = "grayscale",
batch_size=batch_size, class_mode='categorical', shuffle=True)
test_set = datagen_val.flow_from_directory(folder_path+"validation",
target_size = (picture_size,picture_size), color_mode = "grayscale",
batch_size=batch_size, class_mode='categorical', shuffle=False)
from keras.optimizers import Adam,SGD,RMSprop
no_of_classes = 7
model = Sequential()

#1st CNN layer


model.add(Conv2D(64,(3,3),padding = 'same',input_shape = (48,48,1)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size = (2,2)))
model.add(Dropout(0.25))

#2nd CNN layer
model.add(Conv2D(128, (5,5), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size = (2,2)))
model.add(Dropout (0.25))

#3rd CNN layer
model.add(Conv2D(512, (3,3), padding='same'))
model.add(BatchNormalization())

model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size = (2,2)))
model.add(Dropout (0.25))

#4th CNN layer


model.add(Conv2D(512,(3,3), padding='same'))

model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())

#Fully connected 1st layer
model.add(Dense(256))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.25))

#Fully connected 2nd layer
model.add(Dense(512))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.25))

model.add(Dense(no_of_classes, activation='softmax'))
opt = Adam(lr=0.0001)
model.compile(optimizer=opt,loss='categorical_crossentropy',
metrics=['accuracy'])
model.summary()
from keras.optimizers import RMSprop,SGD,Adam
from keras.callbacks import ModelCheckpoint, EarlyStopping,
ReduceLROnPlateau

checkpoint = ModelCheckpoint("./model.h5", monitor='val_acc', verbose=1,


save_best_only=True, mode='max')

early_stopping = EarlyStopping(monitor='val_loss',
min_delta=0, patience=3,
verbose=1,
restore_best_weights=True
)

reduce_learningrate = ReduceLROnPlateau(monitor='val_loss',
factor=0.2,
patience=3, verbose=1,
min_delta=0.0001)

callbacks_list = [early_stopping,checkpoint,reduce_learningrate]

epochs = 48

model.compile(loss='categorical_crossentropy',
optimizer = Adam(lr=0.001),
metrics=['accuracy'])
history = model.fit_generator(generator=train_set,
steps_per_epoch=train_set.n//train_set.batch_size, epochs=epochs,
validation_data = test_set,
validation_steps = test_set.n//test_set.batch_size, callbacks=callbacks_list
)
plt.style.use('dark_background')
plt.figure(figsize=(20, 10))
plt.subplot(1, 2, 1)

plt.suptitle('Optimizer : Adam', fontsize=10)


plt.ylabel('Loss', fontsize=16)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend(loc='upper right')
plt.subplot(1, 2, 2)
plt.ylabel('Accuracy', fontsize=16)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend(loc='lower right')
plt.show()

OUTPUT

FIGURE-18: AUDIO RECORDING

FACE EMOTION CODE:

from keras.models import load_model


from time import sleep
from tensorflow.keras.utils import img_to_array
from keras.preprocessing import image

import cv2
import numpy as np
face_classifier = cv2.CascadeClassifier(r'haarcascade_frontalface_default.xml')
classifier =load_model(r'model.h5')

emotion_labels = ['Angry','Disgust','Fear','Happy','Neutral', 'Sad', 'Surprise']

cap = cv2.VideoCapture(0)
print("done")
while True:
    _, frame = cap.read()
    labels = []
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_classifier.detectMultiScale(gray)

    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 255), 2)
        roi_gray = gray[y:y + h, x:x + w]
        roi_gray = cv2.resize(roi_gray, (48, 48), interpolation=cv2.INTER_AREA)

        if np.sum([roi_gray]) != 0:
            roi = roi_gray.astype('float') / 255.0
            roi = img_to_array(roi)
            roi = np.expand_dims(roi, axis=0)
            prediction = classifier.predict(roi)[0]
            label = emotion_labels[prediction.argmax()]
            label_position = (x, y)
            cv2.putText(frame, label, label_position, cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        else:
            cv2.putText(frame, 'No Faces', (30, 80), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

    cv2.imshow('Emotion Detector', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Results:
This section presents the results obtained when applying several deep learning models to
the classification task. The influence of several important hyper-parameters and design decisions is
analyzed, in particular: the model architecture, the features selected and the number of
samples extracted from the inputs.

In order to appreciate the detection quality of the different options, and considering the
highly unbalanced distribution of class labels, we provide the following performance
metrics for each option: accuracy, precision, recall, and F1. Considering all metrics, F1
can be considered the most important metric in this scenario. F1 is the harmonic mean of
precision and recall.
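For reference, the standard definition of the F1 score (a textbook formula, not a project-specific value):

$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$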

FIGURE 22:NEUTRAL EXPRESSION

FIGURE 23: HAPPY EMOTION DETECTED

FIGURE 24: SURPRISE EMOTION DETECTED

FIGURE 25 ANGRY EMOTION DETECTED

FIGURE 26: SAD EMOTION DETECTED

FIGURE 27: FEAR EMOTION DETECTED

FIGURE 28: CONFUSION MATRIX OF FACE EMOTIONS

CHAPTER 7

CONCLUSION AND FUTURE WORK

Conclusions
In this project, a LeNet-style architecture based on a six-layer convolutional neural
network is implemented to classify human facial expressions, i.e., sad, happy, surprise,
fear, anger, disgust, and neutral. The system has been evaluated using Accuracy, Precision,
Recall and F1-score. The classifier achieved an accuracy of 85.77%. Various
experiments were conducted by changing several parameters, such as the dimension of the
model, the number of epochs, and the partition ratio between training and test data sets.

Different accuracies were found for different experiments. A lighter CNN architecture with
80% of the data for training and 20% for testing gave good results compared to a deeper CNN
architecture while classifying among ten classes; the accuracy of this model was found to
be 80%. The performance of the deeper CNN model was found to be very good when the
classification was among two classes, because a greater number of
training samples was available per class. When the same model was used
to classify among ten classes, the training dataset was divided into ten labels, which led to
fewer training samples being available for each class.

Future Work
In future work, the model may be extended to colour images. This will
permit analysis of the efficacy of pre-trained models such as AlexNet or VGGNet
for facial emotion recognition. The limited number of training samples per class contributed to
the lower accuracy of the model; with a greater number of training samples available for each
class, and with the help of GPUs to speed up the training process, higher accuracy can be
achieved in future enhancements.

USER MANUAL

Speech Emotion Recognition


1. STEP 1: Set up a laptop with a working microphone and Anaconda tools installed on it.
2. STEP 2: Now, execute audiorecorder.py to record audio using a microphone.
3. STEP 3: Train the application if it is 1st time execution (run each cell in order)
4. STEP 4: Run the last few commands after training.
5. STEP 5: Application displays emotion of voice

Face Emotion Recognition


1. STEP 1: Set up a laptop with a working webcam and Anaconda tools installed on it.
2. STEP 2: Train the application if it is 1st time execution (run each cell in order) in
Jupyter notebook.
3. STEP 3: Now, execute faceemotionrecognition.py in command prompt
4. STEP 4: Register and Login then allow webcam to detect the face.
5. STEP 5: Application displays emotion of face

