Facial Emotion Recognition
INTRODUCTION
Facial emotion recognition is the process of detecting human emotions from facial expressions.
The human brain recognizes emotions automatically, and software has now been developed that can
recognize emotions as well. This technology keeps improving in accuracy, with the long-term aim
of reading emotions as well as our brains do.
Emotion detection and recognition (EDR) is a method for detecting and recognizing
human emotions by combining technological capabilities such as facial
recognition, speech and voice recognition, biosensing, machine learning, and pattern
recognition.
AI can detect emotions by learning what each facial expression means and applying that knowledge
to the new information presented to it. Emotional artificial intelligence, or emotion AI, is a
technology that is capable of reading, imitating, interpreting, and responding to human facial
expressions and emotions.
Emotions govern our daily lives; they are a big part of the human experience and inevitably they
affect our decision making. We tend to repeat actions that make us feel happy but we avoid those
that make us angry or sad.
Information spreads quickly via the Internet, a big part of it as text, and as we know, emotions
tend to intensify if left unaddressed.
Thanks to natural language processing, this subjective information can be extracted from written
sources such as reviews, recommendations, publications on social media, and transcribed
conversations, allowing us to understand the emotions expressed by the author of the text and
therefore act accordingly.
OBJECTIVE:
The objective of emotion recognition is to identify the emotions of a human. Emotion
can be captured either from the face or from verbal communication. In this work we focus on
identifying human emotion from facial expressions.
Extracting and understanding emotion is highly important for interaction between
humans and machines.
GOALS:
Technically, the project’s goal consists of training a deep neural network with labeled
images of static facial emotions. Later, this network could be used as part of a software system
to detect emotions in real time. Such software would allow robots to capture their
interlocutor’s inner state (to some extent). Machines can use this capability to
improve their interaction with humans by providing more adequate responses.
This project has been divided into two phases. The first phase consisted of using a
labeled facial emotion data set to train a deep learning network. The chosen data set is the
Extended Cohn-Kanade Database. Additionally, several network topologies were evaluated
to test their prediction accuracy. Convolutional neural networks were preferred in these
topologies given their strong results on computer vision
tasks.
THEORETICAL BACKGROUND:
2. Machine Learning:
Machine Learning (ML) is a subfield of Artificial Intelligence. A simple explanation of ML is
the one coined by Arthur Samuel in 1959: “... field of study that gives computers the ability
to learn without being explicitly programmed”. This statement provides a powerful insight into
the particular approach of this field. It differs completely from other fields where any new
feature has to be added by hand. For instance, in software development, when a new
requirement appears, a programmer has to write software to handle the new case. In
ML, this is not exactly the case. ML algorithms create models based on input data.
These models generate an output that is usually a set of predictions or decisions. Then,
when a new requirement appears, the model might be able to handle it or provide an
answer without the need to add new code.
ML is usually divided into three broad categories, each defined by how the learning
process is carried out by the learning system: supervised learning,
unsupervised learning, and reinforcement learning. In supervised learning, a model
receives a set of labeled inputs, meaning that each input comes with the class it
belongs to. The model tries to adapt itself so that it maps every input to the
corresponding output class. In unsupervised learning, on the other hand, the model
receives a set of unlabeled inputs and tries to learn from the data by exploring
patterns in it. Finally, in reinforcement learning an agent is rewarded or punished
according to the decisions it takes in order to achieve a goal.
In this project, our problem falls into the supervised learning category since the images to
be processed are labeled. In our case, the label is the emotion that the image represents.
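To make the idea of learning from labeled inputs concrete, the following is a minimal sketch of a supervised classifier, assuming a simple nearest-centroid rule rather than the deep network used in this project. The two-dimensional features, their values, and the function names `fit_centroids` and `predict` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import dist


def fit_centroids(samples, labels):
    """Average the feature vectors of each class (the 'learning' step)."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical 2-D features (for example mouth curvature, eyebrow height);
# these numbers are made up and are not taken from any data set.
train_x = [(0.9, 0.6), (0.8, 0.7), (0.1, 0.2), (0.2, 0.1)]
train_y = ["happy", "happy", "sad", "sad"]

centroids = fit_centroids(train_x, train_y)
prediction = predict(centroids, (0.85, 0.5))
```

The labels play the role of the emotion attached to each image: the model only sees input and label pairs during fitting, and is then asked to label an input it has not seen.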
Figure 2.3: Perceptron topology
Figure 2.3 introduces the topology of a perceptron. Fortunately, most ANN concepts
can be explained with this simple architecture. As can be seen, there
is a set of inputs, X1 to Xn, which forms the input layer. Each input Xi has
a corresponding weight, Wi. In the neuron (unit), a weighted sum of the inputs is computed.
A bias is also added to the neuron so it can implement a linear function with an offset.
Because the bias is independent of the inputs, it shifts the curve along the X-axis.
$$
y = f(t) = \sum_{i=1}^{n} X_i W_i + \theta
$$
After that, the result of f(t) is passed to an activation function. The activation function
defines the output of the node. As the perceptron is a binary classifier, the binary step
function is suitable for this topology. It outputs one of only two classes, 0 or 1.
$$
\text{Output} =
\begin{cases}
1, & y > 0 \\
0, & y \le 0
\end{cases}
$$
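The weighted sum and the step activation can be sketched in a few lines of Python. This is only an illustration: it assumes the conventional step function that outputs 1 for a positive sum and 0 otherwise, and the weights and bias are hypothetical values picked so that the unit behaves like a logical AND.

```python
def weighted_sum(inputs, weights, bias):
    """Compute y = sum_i X_i * W_i + bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def binary_step(y):
    """Assumed convention: output 1 for a positive sum, otherwise 0."""
    return 1 if y > 0 else 0


def perceptron_output(inputs, weights, bias):
    """Forward pass of a single perceptron: weighted sum, then activation."""
    return binary_step(weighted_sum(inputs, weights, bias))


# Hypothetical parameters that make the unit behave like a logical AND.
weights = [1.0, 1.0]
bias = -1.5
outputs = [perceptron_output([a, b], weights, bias) for a in (0, 1) for b in (0, 1)]
```

With these values only the input (1, 1) produces a positive sum, so only that input yields the class 1.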
Finally, the prediction is compared against the real value. The resulting error signal is
used to update the weights and improve the prediction results; in multilayer networks this
update is performed through backpropagation. During the 1960s, ANNs were a hot research
topic. However, a publication by Minsky and Papert in 1969 ended this golden era.
Their publication stated that a perceptron has several limitations, in particular that it
would not be suitable for performing general abstractions. Moreover, they argued that more
complex architectures derived from the perceptron would not be able to overcome these
limitations either.
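The error-driven weight update and the limitation of a single perceptron can be illustrated with a short sketch. It assumes the classic perceptron learning rule (not backpropagation), a step activation that outputs 1 for a positive sum, and hypothetical values for the learning rate and the number of epochs. AND is linearly separable and can be learned; XOR is not, so a single unit cannot classify all four points.

```python
def train_perceptron(samples, targets, epochs=20, learning_rate=0.1):
    """Perceptron learning rule: shift weights in proportion to the error."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in zip(samples, targets):
            total = sum(x * w for x, w in zip(inputs, weights)) + bias
            error = target - (1 if total > 0 else 0)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


def accuracy(samples, targets, weights, bias):
    """Fraction of samples whose predicted class matches the target."""
    hits = 0
    for inputs, target in zip(samples, targets):
        total = sum(x * w for x, w in zip(inputs, weights)) + bias
        hits += int((1 if total > 0 else 0) == target)
    return hits / len(samples)


points = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_targets = [0, 0, 0, 1]  # linearly separable
xor_targets = [0, 1, 1, 0]  # not linearly separable

and_acc = accuracy(points, and_targets, *train_perceptron(points, and_targets))
xor_acc = accuracy(points, xor_targets, *train_perceptron(points, xor_targets))
```

In this sketch the unit reaches full accuracy on AND, while on XOR at least one point is always misclassified, no matter how long training runs.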
4. Use of GPU:
The use of GPUs has become fundamental for training deep networks for
practical reasons. The main reason is the reduction in training time compared to
CPU training. While different speedups are reported depending on the network topology, a
speedup of around 10 times is common when using a GPU.
The difference between a CPU and a GPU lies in how they process tasks. CPUs are suited to
sequential, serial processing on a few cores. A GPU, on the other hand, has
a massively parallel architecture with thousands of small cores
designed to handle multiple tasks simultaneously. DL operations are therefore well suited to
GPU training, since they involve vector and matrix operations that can be handled in parallel.
Although only a limited number of experiments were conducted on a GPU during this project,
it is important to stress its practical importance in reducing training time.
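The reason matrix operations parallelize well is that each output row (indeed each output element) can be computed without knowing any other. The sketch below shows this independence in plain Python by handing rows to a pool of workers. It does not run on a GPU and is not meant to show a speedup; the function names and the worker count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def row_times_matrix(row, matrix):
    """One independent unit of work: a single output row of the product."""
    return [sum(a * b for a, b in zip(row, column)) for column in zip(*matrix)]


def matmul_sequential(left, right):
    """Compute the product one row after another, as a single core would."""
    return [row_times_matrix(row, right) for row in left]


def matmul_parallel(left, right, workers=4):
    """Same result, but rows are handed to workers that share no state."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: row_times_matrix(row, right), left))


left = [[1, 2], [3, 4], [5, 6]]
right = [[7, 8], [9, 10]]
sequential = matmul_sequential(left, right)
parallel = matmul_parallel(left, right)
```

Both versions return the same matrix. A GPU applies the same idea at a much finer grain, assigning output elements to thousands of cores.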
Implementation Framework:
1. TensorFlow:
TensorFlow (TF) is an open-source software library for machine learning written in Python
and C++. Its release some months ago received strong press coverage, mainly
because TF was developed by the Google Brain Team. Google has already been using
TF to improve tasks in several products, including speech recognition in
Google Now, search features in Google Photos, and the smart reply feature in Inbox by
Gmail. Some design decisions in TF have led to the framework's early adoption by a
large community. One of them is the ease of going from prototype to production: there is no
need to compile or modify the code to use it in a product. The framework is thus
intended not only as a research tool but also as a production one. Another main design aspect
is that there is no need to use a different API when working on CPU or GPU. Moreover,
computations can be deployed on desktops, servers, and mobile devices. A key
component of the library is the data flow graph. Expressing mathematical
computations as nodes and edges is a TF trademark. Nodes are usually the
mathematical operations, while edges define the input/output relationships between nodes.
Information travels around the graph as tensors, which are multidimensional arrays. Finally,
the nodes are allocated to devices, where they are executed asynchronously or in parallel
once all the required resources are ready.
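The data flow graph idea can be illustrated with a toy graph written in plain Python. This is not the TensorFlow API: the `Node` class, the `evaluate` function, and the use of nested lists as stand-ins for tensors are all assumptions made for this sketch. It builds the graph for y = W x + b first and only computes values when the graph is evaluated.

```python
class Node:
    """A graph node: an operation plus the nodes that feed it (its edges)."""

    def __init__(self, name, operation=None, inputs=()):
        self.name = name
        self.operation = operation
        self.inputs = tuple(inputs)


def evaluate(node, feed, cache=None):
    """Evaluate a node after all of its input nodes have been evaluated."""
    cache = {} if cache is None else cache
    if node.name in cache:
        return cache[node.name]
    if node.operation is None:  # placeholder: its value comes from feed
        value = feed[node.name]
    else:
        args = [evaluate(parent, feed, cache) for parent in node.inputs]
        value = node.operation(*args)
    cache[node.name] = value
    return value


def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def add(left, right):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(left, right)]


# Graph for y = W x + b; building it performs no computation.
x = Node("x")
W = Node("W")
b = Node("b")
product = Node("matvec", matvec, [W, x])
y = Node("add", add, [product, b])

result = evaluate(y, {"x": [1.0, 2.0], "W": [[1.0, 0.0], [0.0, 3.0]], "b": [0.5, 0.5]})
```

Separating graph construction from evaluation is what lets a framework decide where and when each node runs, for example on different devices once their inputs are available.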
Conclusions:
In this project, research was carried out on classifying facial emotions in static facial
images using deep learning techniques. This is a complex problem that has already been
approached several times with different techniques. While good results have been
achieved using feature engineering, this project focused on feature learning, which is one
of the promises of DL. Although the results achieved were not state-of-the-art, they were
slightly better than those of other techniques, including feature engineering. This suggests
that DL techniques will eventually be able to solve this problem given a sufficient number of
labeled examples. While feature engineering is not necessary, image pre-processing boosts
classification accuracy because it reduces noise in the input data. Nowadays, facial
emotion detection software still includes the use of feature engineering. A solution based
entirely on feature learning does not seem close yet because of a major limitation: the lack
of an extensive dataset of emotions. For instance, the ImageNet contest uses a dataset
containing 14 197 122 images. With a larger dataset, networks with a greater
capacity to learn features could be implemented, and emotion classification could then be
achieved by means of deep learning techniques.
SIGNATURE OF CANDIDATES:
1.) Rajesh Kumar
2.) Rhythem Sethi
3.) Sachin Kushwaha
NAMES OF CANDIDATES: