Performance Analysis and Scoring of The Singing Voice





Oscar Mayor, Jordi Bonada, Alex Loscos

Mayor et al.

Performance analysis and scoring of the singing voice



Music Technology Group (MTG), University Pompeu Fabra, Barcelona, SPAIN
Barcelona Music and Audio Technologies (BMAT), Barcelona, SPAIN

In this article we describe the approach we follow to analyze the performance of a singer when singing a reference
song. The idea is to rate the performance of a singer in the same way that a music tutor would, not only giving a
score but also giving feedback about how the user has performed regarding expression, tuning and tempo/timing
characteristics. We also discuss what visual feedback is relevant for the user. Segmentation at
an intra-note level is done using an algorithm based on untrained HMMs with probabilistic models built out of a set of
heuristic rules that determine regions and their probability of being expressive features. A real-time karaoke-like system
is presented where a user can sing and simultaneously visualize feedback and results of the performance. The
technology can be applied to a wide set of applications that range from pure entertainment to more serious education

Singing voice is considered to be the most expressive
musical instrument. Singing and expressing emotions
are strongly coupled, making it clearly distinguishable
when a singer performs sad, happy, tender, or

Many people have worked in the scientific field
of performance analysis of the singing voice, including
solo or polyphonic singing voice transcription [1,2],
score alignment [3,4] and expressivity [5], but there is a
lack of references on automatic expression detection,
transcription and categorization of singing
performances, which is the focus of this article.
We present here a tool for the automatic evaluation of a
singing voice performance with precise note
segmentation and expression detection. The application
includes a friendly graphical interface for visualization
of the analyzed song descriptors including pitch,
vibratos, portamentos, scoops, attacks, sustains and
We also compare our system with existing systems for
the evaluation of singing, including music teaching
applications and karaoke-like music games.

Figure 1: Overview of the performance analysis system

In our system, the analysis of the singing voice includes
first a note segmentation, which consists of aligning the
singing performance to a reference MIDI, and then an
expression segmentation, which is basically an
expression transcription of the performance, segmenting
each note into sub-regions (attack, release, sustain, vibrato
or transition) and assigning an expressive label to each
region. All these processes are based on a set of
descriptors and features extracted from the input audio.
Figure 1 shows an overview of the performance
analysis of the singing voice carried out by
our system. First we decide which features are most
relevant in the singing voice, and then we derive a
set of heuristic rules, based on the analysis descriptors
(pitch, energy, spectral coefficients, mel cepstrum
AES 35th International Conference, London, UK, 2009 February 11–13


coefficients and their derivatives) that can uniquely
identify each expression. This set of heuristic rules
constitutes the basis of a hypothetical probabilistic model
based on Hidden Markov Models (HMM). Later on, the
performance is automatically segmented into notes and
expression regions based on this hypothetical model, with
no training process [7] involved at all.
First of all the singing voice is analyzed (see figure 2)
and some time-domain descriptors are extracted,
including zero crossing, amplitude, and energy and its
derivative. Then a frequency-domain analysis is
performed and some spectral descriptors are computed,
such as LF energy, HF energy, filter bank (40 coefficients),
Mel Cepstrum (24 coefficients), spectral flatness, and
delta timbre, calculated as the average of the
derivatives of the Mel Cepstrum coefficients. The Mel
cepstrum derivative is very relevant for detecting timbre
changes in the singing performance since, in the singing
voice, note onsets commonly match changes in
phonetics. After the frequency-domain analysis, some
spectral algorithms are applied to extract high level
descriptors from the performance, including a spectral
peak and pitch detection algorithm and vibrato detection
based on the pitch analysis; some harmonic descriptors
such as stability and sinusoidality are also computed.
These descriptors will be the basis to establish a criterion
to segment into notes and detect expression from the

Figure 2: Analysis process
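
As an illustration of the kind of low-level descriptors mentioned above, the following is a minimal sketch, not the system's actual analysis code. It computes per-frame zero crossings, energy, the energy derivative and spectral flatness with NumPy. The function name, frame length, hop size and Hann window are assumptions made for the example.

```python
import numpy as np

def frame_descriptors(signal, frame_len=1024, hop=256):
    """Per-frame zero crossings, energy, energy derivative and spectral flatness.

    Frame length, hop size and window are illustrative assumptions.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    zero_cross = np.zeros(n_frames)
    energy = np.zeros(n_frames)
    flatness = np.zeros(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        # Time domain: number of sign changes and mean squared amplitude.
        signs = np.signbit(frame)
        zero_cross[i] = np.count_nonzero(signs[1:] != signs[:-1])
        energy[i] = np.mean(frame ** 2)
        # Frequency domain: flatness is the geometric mean of the power
        # spectrum divided by its arithmetic mean (near 0 for tonal frames).
        power = np.abs(np.fft.rfft(frame * window)) ** 2 + 1e-12
        flatness[i] = np.exp(np.mean(np.log(power))) / np.mean(power)
    # Frame-to-frame energy difference as a simple derivative estimate.
    energy_delta = np.diff(energy, prepend=energy[0])
    return {
        "zero_cross": zero_cross,
        "energy": energy,
        "energy_delta": energy_delta,
        "flatness": flatness,
    }
```

A sustained sung vowel would be expected to give low flatness and a steady energy, while consonants and breath noise push flatness up, which is one reason such descriptors are useful alongside pitch.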

One of the first steps needed before the automatic
analysis of the singing performance is to manually
analyze and categorize different expressive aspects or
expressive executions in singing performances. In most
cases the expression can be categorized using only the pitch
and amplitude evolution over time, but to
achieve better results than existing
systems we need to take into account more analysis
descriptors. We need to label different expressive
resources and extract the descriptors that best describe
the performance and best distinguish one
performance from another. This set of descriptors is
used in the real-time performance analysis of the
singing voice.

We perform note segmentation with prior
knowledge of the reference MIDI melody the user is
supposed to be singing, so we are aligning the MIDI
notes to the notes in the singing performance. As a
result of the segmentation we have the same notes
as the MIDI reference, but the onsets and durations are
adjusted or aligned to the performance of the user and
silences between notes are kept. Thus, if the
performance adds or drops notes with respect to
the MIDI reference, the alignment will be as
similar as possible to the reference, but without adding
or dropping notes. In figure 3 we can see the results of a
note alignment where three notes in the original score
(MIDI notes) are aligned to the user pitch (User notes).

Figure 3: Note alignment results

Based on observation of the analysis data, we decide
which features are most relevant in the singing voice
and then we derive the set of heuristic rules that best
identifies each expression uniquely. This set of rules is
used as a probabilistic model in the note segmentation
and expression transcription algorithms to automatically
segment the performance into notes and expression
regions.
Note alignment is performed using segmental HMMs
based on hypothetical probabilistic models. A sequence of
note and silence states given by a MIDI score represents
the melody of the song (see figure 4), and heuristic rules
determine the most probable path among all possible
paths in the Viterbi matrix. The resulting score is the
same as the reference MIDI but with the notes shortened
or lengthened and the onsets and pitch shifted to better
fulfill the rules applied in the segmentation/alignment
algorithm. See [6] for details.
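
The idea of searching for the most probable segmentation of the performance into the reference notes can be sketched with a small dynamic program. This is only an illustration under strong simplifying assumptions, not the algorithm of [6]: it handles note states only (no silences), keeps the reference pitches fixed, and uses a single hypothetical rule (squared pitch deviation in semitones). The names align_notes, sigma and min_len are invented for the example.

```python
import numpy as np

def align_notes(frame_pitch, ref_notes, sigma=1.0, min_len=2):
    """Return the onset frame of each reference note.

    frame_pitch: per-frame pitch in MIDI semitones.
    ref_notes:   reference note pitches, in score order.
    """
    pitch = np.asarray(frame_pitch, dtype=float)
    T, N = len(pitch), len(ref_notes)
    if T < N * min_len:
        raise ValueError("performance too short for the reference notes")

    def seg_logprob(start, end, note):
        # Hypothetical rule: frames of a segment should lie near the note pitch.
        return -0.5 * np.mean((pitch[start:end] - note) ** 2) / sigma ** 2

    # best[n, t]: best log-probability of the first n notes covering frames [0, t).
    best = np.full((N + 1, T + 1), -np.inf)
    back = np.zeros((N + 1, T + 1), dtype=int)
    best[0, 0] = 0.0
    for n in range(1, N + 1):
        for end in range(n * min_len, T + 1):
            for start in range((n - 1) * min_len, end - min_len + 1):
                if best[n - 1, start] == -np.inf:
                    continue
                # Transition probability is 1, so only segment costs accumulate.
                score = best[n - 1, start] + seg_logprob(start, end, ref_notes[n - 1])
                if score > best[n, end]:
                    best[n, end] = score
                    back[n, end] = start
    # Trace the segment boundaries back from the last note to the first.
    onsets, end = [], T
    for n in range(N, 0, -1):
        end = int(back[n, end])
        onsets.append(end)
    return onsets[::-1]
```

Because every reference note must receive exactly one segment, the result always contains the same notes as the reference, with only onsets and durations adapted to the performance.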

Figure 4: MIDI note and note model sequence.

The probability associated with each node is that of the
best path reaching that node, starting from the beginning
of the song. We distinguish two types of probability:
transition probability and cost probability. The
transition probability in our case is always 1, therefore
the path probability depends only on the cost
probability. The cost probability is computed using
heuristic rules that observe the voice descriptors.
Given an observation window (from start to end frame)
and a given note model, a set of rules is applied to the
voice descriptors and each rule computes a rule
probability. The cost probability is then the
product of all these rule probabilities.
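
The product of rule probabilities can be sketched as follows. The two rules shown (a pitch rule and an energy rule), their parameters and the frame dictionary layout are hypothetical; the actual rules of the system are not listed in this text.

```python
import math

def pitch_rule(window, note_pitch, tolerance=1.0):
    """Hypothetical rule: probability decays as the mean pitch leaves the note pitch."""
    mean_pitch = sum(f["pitch"] for f in window) / len(window)
    return math.exp(-0.5 * ((mean_pitch - note_pitch) / tolerance) ** 2)

def energy_rule(window, floor=0.01):
    """Hypothetical rule: a sung note should carry energy above a silence floor."""
    mean_energy = sum(f["energy"] for f in window) / len(window)
    return min(1.0, mean_energy / floor)

def cost_probability(window, note_pitch):
    """Cost probability of a note model over a window: product of all rule probabilities."""
    rule_probs = [pitch_rule(window, note_pitch), energy_rule(window)]
    return math.prod(rule_probs)
```

With this formulation a single badly violated rule pulls the whole cost probability toward zero, which is what makes each rule act as a veto on implausible segmentations.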

Once the best expression type path has been chosen, the
most probable label for each expression type must then
be estimated.

Expression categorization and transcription of the
performance are also carried out using segmental
HMMs based on hypothetical probabilistic models.
Expression paths are modeled as sequences of attack,
sustain, vibrato, release and transition states and their
possible connections. In addition, different labels can be
assigned to each state to distinguish between different
ways of performing. For instance, in the case of a transition,
some possible labels include scoop-up, portamento or
normal. These paths are considered by the expression
recognition module and the path with the highest
probability is the one chosen. The
probabilities come from heuristic rules based on the
analysis descriptors. See [6] for details.
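
Choosing the most probable expression path can be illustrated with a brute-force search over state sequences. This is a sketch only: the connection table ALLOWED is a guess made for the example (the real set of connections is defined in [6]), and the per-region probabilities are assumed to be supplied by the heuristic rules.

```python
from itertools import product

STATES = ["attack", "sustain", "vibrato", "release", "transition"]

# Hypothetical connections between expression states (an assumption, not from [6]).
ALLOWED = {
    "attack": {"sustain", "vibrato"},
    "sustain": {"vibrato", "release", "transition"},
    "vibrato": {"sustain", "release", "transition"},
    "transition": {"sustain", "vibrato"},
    "release": set(),
}

def best_expression_path(region_probs):
    """Pick the allowed state sequence with the highest product of probabilities.

    region_probs: one dict per region mapping state name to its cost probability.
    """
    best_path, best_prob = None, 0.0
    for path in product(STATES, repeat=len(region_probs)):
        if path[0] != "attack":
            continue  # assume every path starts with an attack
        if any(b not in ALLOWED[a] for a, b in zip(path, path[1:])):
            continue  # skip sequences using a connection that is not allowed
        prob = 1.0
        for state, probs in zip(path, region_probs):
            prob *= probs.get(state, 0.0)
        if prob > best_prob:
            best_path, best_prob = path, prob
    return best_path, best_prob
```

Exhaustive enumeration is only practical for a handful of regions; a Viterbi-style search over the same table would scale to full songs.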

In figure 5 all the possible expression paths for a
two-note segment are shown and the best expression path is
highlighted. These paths are evaluated by the expression
recognition module to pick the one with the highest
probability. Cost probabilities and transition
probabilities are calculated in the same way as for the
note alignment process.

After doing the analysis of the user performance, we
- Elemental ratings based on high level descriptors
like pitch and volume.
- Timing ratings based on note segmentation and
- Expression rating based on the expressive
transcription and categorization explained in
previous sections.
Results of the performance analysis are given as a score
from 0 to 100, including fundamental performance
ratings and the expression performance rating (see
figure 6).
The fundamental performance ratings are calculated by
comparing the pitch, volume and timing between the
user and the reference singer and also between the user
and the MIDI notes. This is done in order to give two
separate ratings, one for mimicry and the other for
comparison with a standard execution.
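
How a comparison could be mapped to a 0 to 100 rating is sketched below for pitch only. The linear mapping and the max_dev threshold are assumptions for illustration; the text does not specify the actual rating formula. The same function yields the two ratings described above, depending on whether the target is the reference singer's pitch (mimicry) or the MIDI notes (standard execution).

```python
import numpy as np

def pitch_rating(user_pitch, target_pitch, max_dev=2.0):
    """Map the mean absolute pitch deviation (in semitones) to a 0-100 rating.

    max_dev is a hypothetical deviation at which the rating reaches 0.
    """
    user = np.asarray(user_pitch, dtype=float)
    target = np.asarray(target_pitch, dtype=float)
    mean_dev = np.mean(np.abs(user - target))
    return float(100.0 * max(0.0, 1.0 - mean_dev / max_dev))

# Frame-aligned pitch tracks in MIDI semitones (illustrative values).
user = [60.0, 62.5, 64.0]
reference_singer = [60.0, 62.5, 64.0]
midi_notes = [60.0, 62.0, 64.0]

mimicry_rating = pitch_rating(user, reference_singer)
score_rating = pitch_rating(user, midi_notes)
```

In this toy case the user copies the reference singer exactly, so the mimicry rating is at its maximum while the rating against the MIDI notes is slightly lower.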

In order to build the expression model sequence while
performing in real time, first we need to complete the
alignment of the last performed note. Thus, in a real-time
context, the performance analysis has a one-note latency
before it can show any feedback to the user.











Figure 6: Performance rating overview











Figure 5: Expression path with expression labels

We do not give much relevance to this topic, as our
application is more focused on giving constant feedback
to the user about the performance rather than giving a
global accumulated score, which does not help the
singer to improve the performance. We have chosen
instead to give more relevance to the display of the
segments where the singer performs worst, giving the
chance to repeat those parts and improve singing skills.


The technology that we have developed can be applied
in many fields, from entertainment and games to more
education-focused applications.
6.1 Singing Education
In classical singing education, the master-apprentice
model is used, where the teacher gives instructions and
feedback on the performance to the student about:

- Acoustic quality
- Physiological aspect (posture of the vocal apparatus)


The way the teacher gives feedback to the student
uses imagery (e.g. sing as if through the top of your
head), which leads to a problem of ambiguous
interpretation and a big time lag between the student's
performance and the teacher's feedback.
There are some existing systems for computer singing
education that offer real-time visual feedback trying to
solve the problem of classical singing education stated
above. These systems include pitch trace, spectrogram,
larynx parameters, etc. Some examples of these systems
are: SINGAD (SINGing Assessment and Development)
[8], WinSINGAD [9], ALBERT (Acoustic and
Laryngeal Biofeedback Enhancement in Real Time)
[10] [11] and SING & SEE [12].

Our system can also be used to solve the lack of visual
feedback problem, offering pitch, visual note
segmentation and expression detection in real time.
Moreover, our system goes a step further by allowing
the user performance to be compared with the performance of a
professional, for instance the performance of the
teacher. These features make it a powerful tool
for singing education. The visual feedback that our
system gives to the user is explained in detail in
section 7.1.
6.2 Entertainment
Automatic scoring of the singing voice has become quite
popular in the past few years in games like Singstar
[13], Ultrastar [14], Karaoke Revolution [15], Lips [16]
and Rock Band [17]. However, the algorithms applied
in these video games are crude and far
from current voice analysis research in the
scientific community. With our system we offer a more
complex analysis of the singing voice, not only focused
on pitch and timing characteristics but also detecting
expressivity in the performance, which can be readily
applied to video games. In section 7.1 we enumerate
some musical video games that incorporate karaoke-style
gaming and we compare them with our system.

The software that has been developed includes an
offline tool for research development and a real-time
scoring application for user singing evaluation.
The offline tool can be used for manual segmentation
and expressive label editing of notes and intra-note
regions. This tool gives feedback by showing the
probability of each heuristic rule in the manual
segmentation given by the user. With the tool, the user
can change the segmentation and see whether the global
probability improves or not. This tool is also used to
display the descriptors calculated in the analysis process,
so the user can view the values of these descriptors at
any time point and change the heuristic rules to
improve the automatic note segmentation and
expression transcription results. In the bottom window
in figure 7 we can see the values of some
analysis descriptors over time, each descriptor in a
different color. Above this window we can see the pitch
curve of the performance, the results of the note
segmentation and expression transcription, as well as the
MIDI score.

Figure 7: Offline GUI tool

The real-time tool displays some analysis
descriptors, like pitch and the note and expression
transcription, as well as the reference song's notes and
lyrics to guide the performance while the user sings.
The display proposal is based on karaoke-style games,
with significant additional information about the
performance shown to the user, including some partial
and global scoring. The real-time scoring tool is
explained in more detail in section 7.1 together with a
comparison with other real-time scoring games.
One of the big drawbacks of karaoke games and
commercial karaoke video systems is the lack of visual
feedback that users get about the performance. Many
systems just show the lyrics, and only in some cases does the
user get vague information about the pitch and
duration of the notes to be performed. While
performing the song, these notes get highlighted when
you sing them in tune (see figure 8).

In this kind of system, only information about the
song phrase that the user is singing and all the reference
notes of the current phrase are shown on the screen, and
these are replaced by new notes when the next phrase comes
in. Other systems adopt a scrolling representation of
the melody, where notes flow from right to left across the
screen as in a platform game while the user is singing the
song. In figure 10 you can see an example of a scrolling
karaoke game. In this example the user only gets
feedback about the pitch of the performance at the
current time, and past information is lost as the notes are
sung so the user gets limited feedback about out of tune

Figure 8: Lips game

Figure 10: Karaoke Revolution game
This information is not enough if you want to use the
karaoke as a virtual singing teacher; moreover, when we
sing out of tune we don't have enough feedback to
determine how far we are from the correct pitch or
tempo of each of the notes. Some systems also give
information about the note performed when the user sings
out of tune, so the user is able to know whether they are singing
below or above the target pitch (see figure 9).

Figure 9: Singstar game in solo mode

Our singing scoring tool adopts a hybrid method,
adding more visual feedback for the user; at the same
time it gives the possibility to replay the performance to
view the parts of the song where the user has made
the most mistakes. Figure 11 shows the visual feedback
that our system offers. In the lower part, a global view of
the song is given, which allows the user to review the
performance after singing and see the errors made
in any part of the song. These parts are marked
in red so the user can see the problematic parts
of the song and quickly go to them. We also allow
repeating certain parts of the song to improve results.
In the middle part of the screen the lyrics are shown
and, above them, the MIDI notes of the score and the
transcribed notes from the performance of the user
scroll from right to left while
the song is performed. The fundamental pitch of a
reference singer and that of the user are also shown in real time, so
while performing you can see how close you are to the
reference, not only at a note level but also in more
detail at a frame level, comparing your pitch with the
reference one. Visualizing the pitch also allows you to
improve vibrato executions and other expressive aspects
where the pitch shape evolution is fundamental, like

scoop-up attacks, fall-down releases and different kinds
of transitions. This also helps the user to stay in tune, as
the desired scenario is for the pitch to be drawn over the
MIDI target notes: when this happens the user is in tune,
and the more distant the pitch is from the MIDI notes, the
more out of tune the user is. This visual feedback allows
the user to rapidly correct the tuning while singing.
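
The distance between the sung pitch and a MIDI target note can be quantified in cents, as in the short sketch below. It uses the standard MIDI convention that note 69 is A4 at 440 Hz; the 50-cent in-tune threshold and the function names are assumptions for illustration, not values taken from the system.

```python
import math

def hz_to_midi(freq_hz):
    """Convert a frequency in Hz to a fractional MIDI note number (A4 = 69 = 440 Hz)."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)

def cents_off(freq_hz, midi_note):
    """Signed deviation from the target note in cents (positive means sharp)."""
    return 100.0 * (hz_to_midi(freq_hz) - midi_note)

def is_in_tune(freq_hz, midi_note, threshold_cents=50.0):
    """Hypothetical in-tune test: within threshold_cents of the target note."""
    return abs(cents_off(freq_hz, midi_note)) <= threshold_cents
```

The sign of the deviation tells the singer whether they are above or below the target, which corresponds to the pitch curve being drawn above or below the MIDI note on screen.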
While the user sings, some detected expressive
resources performed by the reference singer are marked
on the screen with different signs. The user has to
imitate them to score high in the expressive rating. If the
user performs well regarding expression, these marks
get highlighted. If the user performs new expressive
resources not performed by the reference singer, user
marks appear on the screen.
In the upper part of the screen, the scoring of the song is
shown, divided into expressive rating, mimicry rating,


instance putting a very short sustain or not when the
stable part of a note is about 50 ms, or putting a long
sustain fall-down and a short normal release instead of a
short sustain and longer release fall-down. For this
reason, the expression transcription evaluation has been
done manually by an expert. More than 2500 expression
sub-regions have been manually checked, and more than
95% of the correctly note-segmented sub-regions were
considered to be correctly transcribed by the expert.
This research can be applied to education, to
entertainment or to so-called edutainment
(something in between the previous two). In the
education context, the impact will be significant, as this
research can be applied in music schools to provide
an evaluation tool that analyzes the performance of the
user, not only giving information about tuning and
tempo but also about expression, and compares it to
reference performances. In the entertainment arena, this
research can be applied to gaming; the most obvious
application is karaoke, but there are many other musical

The work described in this paper has been supported
and funded by the Music Technology Group of
Pompeu Fabra University and by Yamaha Corp.


