A System That Produces a MusicXML File From a Music Sheet Dealing With Polyphonic Music Arrangements in Image Format
This study presents a system that improves the generation of a MusicXML file from a music
sheet image by supporting an additional musical texture, polyphony, in which two or more
independent melodic lines sound simultaneously in a music sheet. The study applies deep
learning, using a convolutional neural network for image classification.
Music has been part of our lives for many years. It is an art form to whose evolution millions
of people have contributed, producing different forms and purposes throughout the world. It
connects people in different parts of the world with different cultures and lifestyles. Since
music is universal, people all around the world consider it a symbol of unity. Music can create
emotional reactions [1], and emotions, which are intense feelings directed at someone or
something [2], are possessed by every person in this world, which makes music a very
important factor in society.
To understand music is also to understand our society, which is why several studies have
been conducted to understand and discover the components of music more deeply. This is
where music theory comes in. Music theory is the study of the possibilities and practices of
music. Music theorists attempt to explain the compositional techniques composers use by
establishing rules and patterns [3].
Theoreticians often use music sheets to guide their studies, since a music sheet is a form of
musical notation that embodies the properties of a musical piece, such as pitch and duration
[4], which they need to know in order to analyze music properly. In studying music, a music
sheet plays an important role because it serves as a visual representation of a musical piece.
A music sheet can also guide musical performers and help them properly execute the music
that they are playing. Musicians also use music sheets to create and compose new music. A
music sheet can also be used by ordinary people who just want to learn music, as long as they
know how to read and understand it. A music sheet is very important in music because of all
the uses it offers, but without the ability to read it, it would be useless to us. That is why
being able to read a music sheet is very important.
In music theory, one of the basic elements of music is texture [5]. Texture describes how
much is going on in the music at any given moment: it might be made up of rhythm only, of a
melody line with chordal accompaniment, or of many interweaving melodies [5]. Four terms are
used to describe texture formally: monophony, homophony, heterophony, and polyphony [5].
This study focuses specifically on polyphony.
Since the study focuses on polyphonic music, there is a need to understand how polyphony is
expressed in a music sheet, and this is where musical notation comes in. The differences
between the musical notation present in a monophonic music sheet and that in a polyphonic
music sheet should be identified. Polyphony is more complex in nature because of the multiple
melodies it comprises [5], whereas monophony comprises only one melody, so a polyphonic
music sheet can be expected to contain more musical notation properties.
Existing studies have already focused on monophonic music, which is why this study chooses
the polyphonic texture. The system aims to improve on the existing work, which generates a
MusicXML file from a music sheet image.
Every person in this world can appreciate music, but learning music requires special
dedication. Learning music requires reading a music sheet, and doing so takes time spent
learning musical notation, which can be very tedious. Although some people learn faster, most
can have a very hard time learning it. All music can be represented by a music sheet;
therefore, people who can understand it have the advantage of learning music better, which
also means a better appreciation of music.
With the advancement of technology applied to music, there has been growth in software that
enables a music sheet to be read by computers. The music sheet then becomes a “digital
music sheet” that can be converted into different digital formats. In most cases, this is a MIDI
file, an audio file (WAV, MP3), or a MusicXML export. Among those formats, MusicXML, a
digital music sheet interchange and distribution format, is used by many programs on the
market.
Interest in Music Information Retrieval (MIR) has been increasing recently, and a large
amount of musical data is available on the internet [6] because of the digitization of these
musical works. Since the system uses digitized images, it allows further information to be
extracted from music sheet images: the musical notation is extracted through the generation
of the MusicXML format, thereby preserving this valuable information.
Almost all of the music we listen to today is polyphonic in texture [7], which means the music
has more than one melody playing at a time. Given that polyphony is intrinsic to music
nowadays, it is equally important to understand the musical notation associated with it and to
express it in MusicXML. Thus, to address the problem, this study will provide a system that
can generate a MusicXML file from a polyphonic music sheet image.
The objective of the study is to make a system that accepts a music sheet with
polyphonic texture in an image format and generates its equivalent MusicXML file. Specifically,
this study aims to produce a system that can:
Many musical works produced in the past, including music sheets, are still available only as
original manuscripts or as photocopies. The preservation of these works requires digitization
and transformation into a machine-readable format. Digitization has been commonly used as a
tool for this, offering easy duplication, distribution, and digital processing of this
information [8]. There are many ways to represent these works digitally; one example is the
Portable Network Graphics (PNG) format. Although having these music sheets in a digital
format preserves the information, it does not extract the musical semantics behind the
musical notation in those images, which means that the image can only be read or printed; it
cannot be played.
In the past, sharing music between different kinds of music software was difficult, since no
single music program can do everything equally well [9]. As time went by, music technology
developed a music interchange format called MIDI. But even with the invention of MIDI, music
notation could not be represented, since MIDI is only good for performance applications such
as sequencers.
The music technology community then came up with new music interchange formats such as
NIFF and SMDL, but these formats have limitations. NIFF does not work very well for general
interchange. It also represents music in a graphical format, which means it does not have a
concept of “C”, and another major impediment is that it is a binary format, which makes it
harder to debug [9]. SMDL, on the other hand, became too complex to work with.
Finally, MusicXML was created by Recordare LLC. MusicXML addressed the limitations of
other music notation interchange formats and was able to represent Western music notation.
To define it clearly, MusicXML is an interchange format for music notation, music analysis,
music information retrieval, and musical performance [10], and it has become the standard for
interactive music sheets. MusicXML is used by many applications to perform music-related
tasks such as playing back the music encoded in the format.
Programs have been developed through the years in which an image of a music sheet can be
converted to MusicXML. The most notable programs are Audiveris, SharpEye, Pdf2Music, and
Musa.
Audiveris is an open-source Optical Music Recognition software. It allows the user to import
scanned music sheets and export them in MusicXML format. Although it does what this study
aims for, it has some difficulties in reading music symbols such as beams [11]. Programs have
also been created to accept the MusicXML format and then play it accordingly. A good example
is the system created by Mr. Michael Alan V. Ygnacio in his special problem: “A system
that converts a music sheet with round notes and with shape notes”. The system accepts
the MusicXML generated by Audiveris and then converts the translated round notes to
shape notes. Since the system relies heavily on Audiveris for reading and rendering the music
sheet image to MusicXML, he could not modify the ability of Audiveris to accurately scan an
image of a music sheet. Mr. Ygnacio said that “Audiveris has difficulties on reading the music
symbols and other data within the scanned music sheet” [11].
SharpEye is an optical music recognition program that has an interactive graphical
environment for editing the symbolic music notation extracted from scanned music. It can scan
and convert a printed music sheet into a MusicXML file or a MIDI file [12]. It accepts only
bitmap and TIFF images. However, SharpEye has its drawbacks. According to Vrist, its output
shows missing flags on eighth notes, missing notes, and missing flats or sharps [13].
Pdf2Music offers pro and non-pro versions, but neither of them offers conversion of
images to MusicXML. The major drawback of Pdf2Music is that it generates poor MusicXML
from a music sheet in PDF format [13]: missing information and wrongly placed music
symbols can be seen in the MusicXML file.
Template Matching
Template Matching is a high-level machine vision technique that identifies the parts of
an image that match a predefined template. It has two main disadvantages. First, it needs the
template to be stored in memory for correlation, and second, it is sensitive to noise and
scale, which makes it prone to errors.
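The idea can be illustrated with a minimal sketch, assuming tiny binary images stored as lists of rows. The function name `match_template`, the 3x3 `NOTEHEAD` template, and the agreement-ratio score are illustrative assumptions, not the method used by any of the tools discussed here.

```python
def match_template(image, template):
    """Slide `template` over `image` and return (best_score, row, col).

    Both arguments are lists of rows holding 0/1 pixels. The score is the
    fraction of template pixels that agree with the image patch under it.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = (-1.0, 0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            agree = sum(
                image[r + i][c + j] == template[i][j]
                for i in range(th)
                for j in range(tw)
            )
            score = agree / (th * tw)
            if score > best[0]:
                best = (score, r, c)
    return best


# Hypothetical stored template and a small image that contains it.
NOTEHEAD = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
IMAGE = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

score, row, col = match_template(IMAGE, NOTEHEAD)

# Flip a single pixel to imitate scanner noise and match again.
NOISY = [line[:] for line in IMAGE]
NOISY[2][3] = 0
noisy_score, _, _ = match_template(NOISY, NOTEHEAD)
```

The clean image gives a perfect match, while one flipped pixel already lowers the score, which illustrates the sensitivity to noise. A symbol drawn at a different size would need its own stored template, which reflects the memory and scale disadvantages.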
A new system was developed by Ms. Ramie Jeanne Cortes in her special problem, “A
System that Produces a MusicXML File From a Music Sheet in Image Format”, to address the
limitations of the previously discussed tools. The system generates a MusicXML file from a
music sheet image, which is what this study aims for, but one particular limitation she stated
in her SP is that her system deals only with monophonic single-part arrangements, in which
only a single note is played at a time, that is, a solo melody [4]. This limits the possibilities
of extracting more information from a music sheet image.
The Challenge of Optical Music Recognition
Optical Music Recognition (OMR) is the problem of converting a scanned image of sheet
music into a symbolic representation like MusicXML [15] or MIDI. Almost all of the previous
works presented apply OMR in their systems. The most common OMR algorithms are
decomposed into four tasks:
1. Image Preprocessing
2. Segmentation
3. Object Recognition
4. Semantic Reconstruction
Although OMR systems have been researched thoroughly over the last few decades and
several commercial tools exist, the practical results are still far from ideal [16]. Since image
preprocessing is the initial step of OMR, it affects the succeeding processes, and if something
goes wrong in this stage, the recognition would be useless. Among the most important open
issues in OMR research are the lack of an available ground-truth database that could serve as
a benchmark [17] and the absence of common methodologies and metrics for comparing the
results of OMR systems [17]. Given these limitations of OMR, the study will explore another
approach that does not use OMR.
Given the previous works on generating the MusicXML format from a music sheet image, the
limitations and shortcomings of these systems are apparent. This implies the need for a new
system that generates a MusicXML file from a polyphonic music sheet image. Introducing a
new system that uses deep learning to identify music semantics in a music sheet image would
hopefully address the limitations and impediments of the tools presented.
Conceptual Framework
The following are the terms and concepts that are used throughout the study.
Deep Learning
Deep learning, also known as deep neural learning or deep neural network, is a
subset of machine learning in Artificial Intelligence (AI) that has networks capable of
learning unsupervised from data that is unstructured or unlabeled [18]. Essentially, it involves
feeding a large amount of input data into a computer system, which in turn can make decisions
about other data. Using deep learning, music sheet images will be used as the data fed into
the computer system. Later on, when the system is given another music sheet image as input,
it should be able to identify the music semantics.
Convolutional Neural Network
Convolutional Neural Networks are very similar to ordinary neural networks: they are
made up of neurons that have learnable weights and biases, and each neuron receives some
inputs, performs a dot product, and optionally follows it with a non-linearity [19]. A
convolutional neural network is essentially a black box that constructs features that would
otherwise have to be handcrafted, and the abstract features created from training are general
enough to account for variance. Training a CNN from scratch for the system would take a lot
of computing power and a lot of time, both of which are beyond my capabilities.
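The dot product followed by a non-linearity can be sketched as a single convolutional filter in plain Python. In a real CNN the kernel weights are learned during training; the `HORIZONTAL_LINE` kernel below is hand-picked purely for illustration, and the function name `conv2d_relu` is an assumption.

```python
def conv2d_relu(image, kernel, bias=0.0):
    """Slide `kernel` over `image` (no padding) and apply ReLU to each sum."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Dot product between the kernel weights and the image patch.
            total = bias
            for i in range(kh):
                for j in range(kw):
                    total += kernel[i][j] * image[r + i][c + j]
            # ReLU non-linearity: negative responses are clipped to zero.
            row.append(max(0.0, total))
        output.append(row)
    return output


# Hand-picked weights that respond to a thin horizontal line, such as a
# staff line. A trained network would learn such weights from data.
HORIZONTAL_LINE = [
    [-1, -1, -1],
    [2, 2, 2],
    [-1, -1, -1],
]
PATCH = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

feature_map = conv2d_relu(PATCH, HORIZONTAL_LINE)
```

The filter responds strongly where the line sits in the middle row of the window and gives zero elsewhere, which is the kind of low-level feature that early layers of a CNN tend to produce.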
Inception Model
An Inception model is a pre-trained CNN model. The Inception model was trained by Google
on the ImageNet dataset, which contains more than a million images across a thousand
categories. The use case of the study is the classification of music symbols in a polyphonic
music sheet image. Inception, however, was not trained on these music symbols, which is why
transfer learning is used.
Figure [19]
Transfer Learning
Transfer learning means applying what was learned in a previous training session to a new
training session. Looking at the Inception model above, when an image is fed as input, each
layer performs a series of operations on the data until the model outputs a label and a
classification percentage. Each layer is a different set of abstractions: the first layers
have essentially taught themselves edge detection, the middle layers shape detection, and the
layers get increasingly abstract up until the end.
The last few layers are the highest-level detectors for whole objects. For transfer
learning, the system will retrain just that last layer on features of the music symbols so that
it can add a representation of them to its repository of knowledge.
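Retraining only the last layer can be sketched as follows. This is a toy illustration under stated assumptions, not the Inception retraining procedure: `frozen_features` is a hypothetical stand-in for the pre-trained layers, whose behaviour never changes, and only the weights of a small softmax layer on top of it are learned.

```python
import math


def frozen_features(image):
    """Hypothetical stand-in for the pre-trained layers (never retrained)."""
    rows_with_ink = sum(1 for row in image if any(row))
    cols_with_ink = sum(1 for col in zip(*image) if any(col))
    return [cols_with_ink / len(image[0]), rows_with_ink / len(image)]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_scores(weights, x):
    return [sum(w * xi for w, xi in zip(ws, x)) for ws in weights]


def train_last_layer(features, labels, n_classes, epochs=50, lr=0.5):
    """Fit only the final softmax layer by gradient descent."""
    weights = [[0.0] * len(features[0]) for _ in range(n_classes)]
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(layer_scores(weights, x))
            for k in range(n_classes):
                grad = probs[k] - (1.0 if k == y else 0.0)
                for i, xi in enumerate(x):
                    weights[k][i] -= lr * grad * xi
    return weights


def predict(weights, image):
    scores = layer_scores(weights, frozen_features(image))
    return scores.index(max(scores))


CLASSES = ["notehead", "barline"]  # illustrative labels only
NOTEHEAD = [[0, 0, 0], [1, 1, 1], [1, 1, 1]]
BARLINE = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]

train_x = [frozen_features(NOTEHEAD), frozen_features(BARLINE)]
weights = train_last_layer(train_x, [0, 1], n_classes=len(CLASSES))

# A bar line in a different column that was not in the training set.
SHIFTED_BARLINE = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
guess = CLASSES[predict(weights, SHIFTED_BARLINE)]
```

Because the feature extractor stays fixed, only a handful of weights are trained, which is why this approach needs far less data and computation than training the whole network.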
Music Notations
The staff (plural staves) consists of five horizontal parallel lines and four spaces. Each of
those lines and spaces represents a different letter, named A-G, which in turn represents a
note, and the note sequence moves alphabetically up the staff [20]. In polyphonic music, two
or more independent melodies are played at the same time, which means more than one note
is active at the same time.
In the figure below, the music sheet can be identified as polyphonic in texture since it has
more than one staff, and the staves are often enclosed in a brace. In other cases where only
one staff can be seen, other indicators can be used.
Figure [21]
The pitch of a note is how high or low it sounds [21]. Key signatures affect pitch by
using sharps and flats. In the study, key signatures will be vital in determining each note's
equivalent value. The clef on the music sheet can be a treble clef or a bass clef. The treble
clef has the ornamental letter G on the far left side, and it encircles the “G” line on the
staff [20], while the bass clef is also referred to as the F clef [20]. Most music sheets have
these important notations, because the clef tells the letter name of the note and the key tells
whether the note is sharp, flat, or natural. Using these notations, we can identify the notes
and label them.
The notes' letter names are identified by their position on the staff, which makes it easy
to locate a note on the staff [22].
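The relation between clef, staff position, and letter name can be written as a short lookup. The sketch below assumes a numbering in which position 0 is the bottom line of the staff and each step up (line or space) adds 1; the function name `note_at` and this numbering are illustrative choices.

```python
LETTERS = "CDEFGAB"

# Letter name and octave of the bottom staff line for each clef.
CLEF_BOTTOM_LINE = {"treble": ("E", 4), "bass": ("G", 2)}


def note_at(clef, position):
    """Return (letter, octave) for a staff position.

    Position 0 is the bottom line, 1 the first space, 2 the second line,
    and so on. Negative values are positions below the staff.
    """
    letter, octave = CLEF_BOTTOM_LINE[clef]
    index = octave * 7 + LETTERS.index(letter) + position
    return LETTERS[index % 7], index // 7


second_line_treble = note_at("treble", 2)  # the line the G clef encircles
fourth_line_bass = note_at("bass", 6)      # the line the F clef marks
middle_c = note_at("treble", -2)           # first ledger line below
```

The same function covers both clefs, so a two-staff polyphonic system only needs to know which clef governs each staff.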
This study will implement image classification using deep learning through a
convolutional neural network, using the Inception model with the help of transfer learning.
These are the tasks: Data Acquisition, Data Labeling, Image Preprocessing, and generation of
MusicXML. The development process will follow the Analysis, Design, Implementation, and
Testing phases.
The following is the analysis of each of the tasks:
During the process of cropping, all five lines of the staff as well as the entire beam
of the note should be visible in the resulting picture. Since the study deals with
polyphonic music sheets, it is important to identify the properties that are unique to
polyphonic music: the number of staves and the notes played at the same time are
identified. An image ratio should be defined for the extracted images. Below is an
example image of a music sheet. (Further studies will be conducted in order to properly
classify the music symbol images.)
The resulting data set would then be composed of pictures of the properties
extracted from the music sheet image. Each of those images would then be labelled,
identifying each property according to the music sheet image. Since the study deals
with polyphonic music, the images are labelled according to the properties of polyphonic
music. The notes can be labelled by identifying the key signature of the score, the clef
it uses, and the position of the note on the staff.
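How the key signature affects a note label can be sketched as follows. The sketch assumes the key signature is given as a count of sharps (positive) or flats (negative), the same convention MusicXML uses for its fifths value. The label format (`F#4`, `Bb3`) and the function names are illustrative, and the rule that an accidental carries through the rest of a measure is left out for brevity.

```python
SHARP_ORDER = "FCGDAEB"  # order in which sharps are added to a key signature
FLAT_ORDER = "BEADGCF"   # order in which flats are added


def key_alterations(fifths):
    """Map each letter altered by the key signature to +1 (sharp) or -1 (flat)."""
    if fifths >= 0:
        return {step: 1 for step in SHARP_ORDER[:fifths]}
    return {step: -1 for step in FLAT_ORDER[:-fifths]}


def label_note(step, octave, fifths, accidental=None):
    """Build a label such as 'F#4' from letter, octave, and key signature.

    `accidental` is +1, 0 or -1 when a sharp, natural or flat is written
    next to the note, in which case it overrides the key signature.
    """
    if accidental is None:
        alter = key_alterations(fifths).get(step, 0)
    else:
        alter = accidental
    suffix = {1: "#", 0: "", -1: "b"}[alter]
    return f"{step}{suffix}{octave}"


in_d_major = label_note("F", 4, fifths=2)                  # two sharps: F and C
in_f_major = label_note("B", 3, fifths=-1)                 # one flat: B
natural_f = label_note("F", 4, fifths=2, accidental=0)     # written natural sign
```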
The digital images extracted during Data Acquisition have different sizes and
ratios. That is why the images have to be preprocessed so that they have similar
sizes and ratios.
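One way to bring crops of different shapes to a common size and ratio is to pad each crop to a square and then resample it. The sketch below assumes images stored as lists of pixel rows and uses nearest-neighbour sampling; the helper names and the target size of 4 are illustrative, and a real pipeline would more likely rely on an imaging library.

```python
def pad_to_square(image, background=0):
    """Centre the image on a square canvas so its aspect ratio becomes 1:1."""
    height, width = len(image), len(image[0])
    side = max(height, width)
    top = (side - height) // 2
    left = (side - width) // 2
    canvas = [[background] * side for _ in range(side)]
    for r in range(height):
        for c in range(width):
            canvas[top + r][left + c] = image[r][c]
    return canvas


def resize_nearest(image, size):
    """Resample a square image to size x size with nearest-neighbour lookup."""
    height, width = len(image), len(image[0])
    return [
        [image[r * height // size][c * width // size] for c in range(size)]
        for r in range(size)
    ]


def preprocess(image, size=4):
    return resize_nearest(pad_to_square(image), size)


WIDE_CROP = [[1, 1, 1, 1], [1, 1, 1, 1]]  # 2 rows x 4 columns
TALL_CROP = [[1, 0, 1]] * 6               # 6 rows x 3 columns

wide_out = preprocess(WIDE_CROP)
tall_out = preprocess(TALL_CROP)
```

Padding before resizing avoids stretching the symbol, so a tall stem and a wide beam keep their proportions in the final image.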
The classifier will be retrained with the newly labelled music symbol image dataset,
following the concept of transfer learning. The classifier would then return an
ordered list of the music symbols identified in the music sheet image.
The ordered list of music symbols will be converted into MusicXML format. The
generated output must follow the MusicXML Document Type Definition (DTD)
specification and use the tags common to all MusicXML files.
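A minimal sketch of this last step, using Python's standard `xml.etree.ElementTree` module, is given below. The element names follow the MusicXML format, where a second simultaneous melody in the same measure is written after a `backup` element that rewinds the time position. The notes themselves, the part name, and the helper `add_note` are made up for illustration, and the DOCTYPE declaration is omitted.

```python
import xml.etree.ElementTree as ET


def add_note(measure, step, octave, duration, note_type, voice):
    """Append one pitched note to a measure element."""
    note = ET.SubElement(measure, "note")
    pitch = ET.SubElement(note, "pitch")
    ET.SubElement(pitch, "step").text = step
    ET.SubElement(pitch, "octave").text = str(octave)
    ET.SubElement(note, "duration").text = str(duration)
    ET.SubElement(note, "voice").text = str(voice)
    ET.SubElement(note, "type").text = note_type
    return note


def build_score():
    score = ET.Element("score-partwise", version="3.1")
    part_list = ET.SubElement(score, "part-list")
    score_part = ET.SubElement(part_list, "score-part", id="P1")
    ET.SubElement(score_part, "part-name").text = "Music"

    part = ET.SubElement(score, "part", id="P1")
    measure = ET.SubElement(part, "measure", number="1")

    attributes = ET.SubElement(measure, "attributes")
    ET.SubElement(attributes, "divisions").text = "1"  # 1 division per quarter
    key = ET.SubElement(attributes, "key")
    ET.SubElement(key, "fifths").text = "0"
    time = ET.SubElement(attributes, "time")
    ET.SubElement(time, "beats").text = "4"
    ET.SubElement(time, "beat-type").text = "4"
    clef = ET.SubElement(attributes, "clef")
    ET.SubElement(clef, "sign").text = "G"
    ET.SubElement(clef, "line").text = "2"

    # First melody: one whole note filling the measure.
    add_note(measure, "C", 5, 4, "whole", voice=1)

    # Rewind to the start of the measure before writing the second melody.
    backup = ET.SubElement(measure, "backup")
    ET.SubElement(backup, "duration").text = "4"

    # Second melody: two half notes sounding under the whole note.
    add_note(measure, "E", 4, 2, "half", voice=2)
    add_note(measure, "G", 4, 2, "half", voice=2)
    return score


xml_text = ET.tostring(build_score(), encoding="unicode")
```

In the proposed system, the hard-coded notes would be replaced by the ordered list of music symbols returned by the classifier.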
[1] Vuust, P., & Kringelbach, M. L. “The Pleasure of Making Sense of Music”. Interdisciplinary
Science Reviews (2010): 166-182. Accessed 2017
[2] Lydeen, L. F. (1987). Emotions and moods: medical & psychological subject analysis with
bibliography. Washington, D.C.: Abbe Association. Accessed 2017
[3] M. (n.d.). Music theory and classic harmony. Retrieved October 24, 2017, from
[4] Cortes, R. and Roxas, R. A System that Produces a MusicXML File From a Music Sheet in
Image Format
[5] M. (n.d.). Music texture theory – Monophony or Polyphony. Retrieved September 14, 2017,
from http://www.aboutmusictheory.com/music-texture.html
[6] Hutchison, D., Jensen, K., Kanade, T., Kittler, J., Kleinberg, J. M., Kronland-Martinet, R., . . .
Ystad, S. (n.d.). Computer Music Modeling and Retrieval. Genesis of Meaning in Sound and
Music. Berlin, Heidelberg: Springer Berlin Heidelberg.
[7] MENU. (2017, March 30). Retrieved November 23, 2017, from
[8] Paredes, R., Cardoso, J. S., & Pardo, X. M. (2015). Pattern Recognition and Image Analysis
7th Iberian Conference, IbPRIA 2015, Santiago de Compostela, Spain, June 17-19, 2015,
Proceedings. Cham: Springer International Publishing.
[11] Ygnacio, M. and Roxas, R. A System that Converts Music Score Sheets with Round Notes
into One with Shape Notes. (pages 4)
[13] Vrist, Soren Bjerregaard (2009). Optical Music Recognition for Structural Information from
high-quality scanned music, p.39, 45
[14] (n.d.). Retrieved December 5, 2017, from
[15] J. Ganseman, P. Scheunders, and W. D’haes. Using xquery on musicxml databases for
musicological analysis. In Proceedings of the 9th International Conference on Music Information
Retrieval, pages 433–438, Philadelphia, USA, September 14-18 2008. http://ismir2008.ismir.
[16] Rebelo, A., Fujinaga, I., Paszkiewicz, F., Marcal, A., Guedes, C., Cardoso, J.: Optical
music recognition: state-of-the-art and open issues. International Journal of Multimedia
Information Retrieval 1(3), 173–190 (2012)
[17] Novotný, J., & Pokorný, J. (2015). Introduction to Optical Music Recognition: Overview and
Practical Challenges. DATESO.
[18] Momoh, O. (2016, October 30). Deep Learning. Retrieved December 6, 2017, from
[19] Convolutional Neural Networks (CNNs / ConvNets). (n.d.). Retrieved December 9, 2017,
from http://cs231n.github.io/convolutional-networks/#layers
[20] Says, R. L. (2016, June 10). How to Read Sheet Music: Step-by-Step Instructions.
Retrieved December 13, 2017, from
[21] Schmidt-Jones, C., & Jones, R. (2007). Understanding basic music theory. Houston, TX:
[22] Music Note Names. (n.d.). Retrieved December 13, 2017, from
SP Adviser:
December 2017