Irjet Music Information Retrieval and Ge
1. INTRODUCTION
As the number of songs keeps growing, people find it increasingly hard to manage the songs that match their taste. Listening to music online has become very convenient thanks to the rise of streaming services such as Spotify and iTunes, and users now expect the service to recommend music to them. To make that possible, we need to study people's listening choices and identify the genres they listen to. Owing to the rapid growth of the digital entertainment industry, automatic classification of music genres has gained significant prominence in recent years. Genre-based classification is one effective way to organize songs.

Figure-1: Overview of Music Genre Classification System

The remainder of the paper is structured as follows. Section 2 reviews previous work in this field, and Section 3 explains the structure of the dataset. Section 4 covers the classification algorithms and their details. Section 5 presents the results and evaluation, followed by the conclusion and references.
© 2020, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 1025
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 07 | July 2020 www.irjet.net p-ISSN: 2395-0072
Classifying music without human interaction has been a fascinating problem for researchers from different fields such as signal processing, machine learning, and music theory. There is a vast amount of research on audio and music classification.

The task of music classification can be approached from two different aspects, namely symbolic and audio. Symbolic classification mostly relies on symbolic formats like MusicXML and MIDI. Several models have been proposed for symbolic classification of music genres. The input consists of descriptors such as instruments, musical sound, rhythm, dynamics, pitch figures, and melody, which are passed to a wide selection of generic multi-class classifiers. Symbolic classification of audio files is highly impractical, because building an effective audio transcription system is likely to be more difficult than audio genre classification itself.

Tzanetakis and Cook (2002) [3] performed music genre classification using timbral features, texture features, and pitch-related features based on a multi-pitch detection algorithm. Some of the features used in this work include MFCCs, roll-off, and spectral contrast. Their system achieved an overall accuracy of 61%. The work by Lidy and Rauber (2005) [4] discusses the contribution of psycho-acoustic features to music genre detection.

With the recent popularity of deep neural networks, a variety of experiments have extended these methods to speech and other types of audio data (Abdel-Hamid et al., 2014; Gemmeke et al., 2017 [5]). Raw time-domain audio is not well suited as direct input to neural networks because of its high sampling rate. Nevertheless, it has been used for audio generation tasks in Van Den Oord et al. (2016) [6]. The spectrogram of a signal, which captures both frequency and time information, is a common alternative representation.

In this proposed solution, we have used the GTZAN dataset, which is popular in the field of Music Information Retrieval. The dataset comprises audio files gathered in 2000-2001 from a variety of sources such as CDs, microphone recordings, and radio.

The dataset contains 100 music files for each of 10 genres, for a total of 1000 music files. The 10 genres are Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. Each file is a 30-second audio clip sampled at 22050 Hz with 16-bit resolution.

Source:- http://marsyas.info/downloads/datasets.html

4. METHODOLOGY

This section describes the data preprocessing, followed by the feature descriptions and the two proposed approaches for music genre classification: machine learning techniques and a Deep Neural Network.

4.1 Preprocessing

To improve the model results, we processed the data by normalizing it and then converting the labels into categorical values. Since the value ranges differ widely from feature to feature, normalization of the data was necessary. We tried different normalization methods, namely Standard Scaling, Z-score, Decimal Scaling, and Min-Max normalization, of which Min-Max normalization gave the best results. This technique performs a linear transformation on the original data. The minimum and maximum of each feature are computed, and in the standard form of the method each value x is replaced by x' = (x - min) / (max - min), which maps the feature to the range [0, 1].
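The preprocessing step of Section 4.1 (Min-Max normalization followed by label encoding) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: it assumes the standard Min-Max form x' = (x - min) / (max - min) applied per feature column, and the function names, the toy feature values, and the alphabetical ordering of class ids are all hypothetical choices made for the example.

```python
import numpy as np


def min_max_normalize(features, new_min=0.0, new_max=1.0):
    """Linearly rescale each feature column to [new_min, new_max]."""
    features = np.asarray(features, dtype=float)
    col_min = features.min(axis=0)
    col_max = features.max(axis=0)
    span = col_max - col_min
    # A constant column would divide by zero; it is mapped to new_min instead.
    safe_span = np.where(span == 0, 1.0, span)
    scaled = (features - col_min) / safe_span
    return scaled * (new_max - new_min) + new_min


def encode_labels(labels):
    """Map genre names to integer class ids (ids follow alphabetical order)."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return np.array([index[name] for name in labels]), classes


# Hypothetical toy data: 3 tracks described by 2 features each.
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 300.0]])
X_scaled = min_max_normalize(X)
y, classes = encode_labels(["rock", "blues", "rock"])
```

In practice the minimum and maximum would usually be computed on the training split only and then reused for the test split, so that no information from the test data leaks into the scaling.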
To preprocess our dataset, we used the pandas and NumPy libraries. The machine learning classification tasks are done using the scikit-learn library, and the Deep Neural Network is written using TensorFlow Keras.

4.2 Manually Extracted Features

In this section, we describe the musical features used to train the machine learning algorithms and the Deep Neural Network for the classification task. We used Librosa, a Python library, to extract the features.

The RMSE is calculated frame by frame, and we then take the average and standard deviation across all frames.

4.2.3 Spectral centroid

4.2.4 Spectral bandwidth

Spectral bandwidth is the difference between the upper and lower frequencies in a continuous band of frequencies of an audio signal. It is typically measured in hertz. The p-th order spectral bandwidth corresponds to the p-th order moment about the spectral centroid and is calculated as

\[ \left( \sum_k S(k)\,\lvert f(k) - f_c \rvert^{p} \right)^{1/p} \]

where S(k) is the normalized spectral magnitude at frequency bin k, f(k) is the frequency of bin k, and f_c is the spectral centroid.

Each frame is divided into a pre-specified number of frequency bands, and the spectral contrast is measured as the difference between the maximum and minimum magnitudes within each frequency band.

4.2.7 Zero Crossing Rate (ZCR)

A zero-crossing point is a point where the signal changes sign, from positive to negative or vice versa. The entire 10-second signal is divided into smaller frames, and the number of zero-crossings present in each frame is determined. The features are obtained by calculating the average and standard deviation of the ZCR over all the frames.
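The frame-wise features described in Section 4.2 can be illustrated with a short NumPy sketch. The paper extracts its features with Librosa; the code below is a simplified stand-in, not Librosa's implementation, and the frame length of 2048, hop length of 512, Hann window, p = 2, and the 440 Hz test tone are assumptions chosen for the example. It computes RMSE, zero crossing rate, spectral centroid, and p-th order spectral bandwidth per frame, then summarizes a feature by its mean and standard deviation across frames.

```python
import numpy as np


def frame_signal(y, frame_length=2048, hop_length=512):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(y) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)[:, None]
    return y[starts + np.arange(frame_length)[None, :]]


def rmse_per_frame(frames):
    """Root-mean-square energy of each frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))


def zcr_per_frame(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)


def centroid_and_bandwidth(frames, sr, p=2):
    """Spectral centroid and p-th order bandwidth (in Hz) of each frame."""
    window = np.hanning(frames.shape[1])
    mag = np.abs(np.fft.rfft(frames * window, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    # Normalize magnitudes so each frame's spectrum sums to 1.
    weights = mag / np.maximum(mag.sum(axis=1, keepdims=True), 1e-12)
    centroid = (weights * freqs).sum(axis=1)
    deviation = np.abs(freqs[None, :] - centroid[:, None])
    bandwidth = (weights * deviation ** p).sum(axis=1) ** (1.0 / p)
    return centroid, bandwidth


def summarize(values):
    """Mean and standard deviation across frames."""
    return float(np.mean(values)), float(np.std(values))


# Hypothetical input: one second of a 440 Hz sine at the GTZAN sampling rate.
sr = 22050
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)

frames = frame_signal(tone)
rmse_mean, rmse_std = summarize(rmse_per_frame(frames))
zcr_mean, zcr_std = summarize(zcr_per_frame(frames))
centroid, bandwidth = centroid_and_bandwidth(frames, sr)
```

For a pure tone the centroid sits close to the tone's frequency and the bandwidth is narrow, which makes a sine wave a convenient sanity check before running the same functions on real audio.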
4.3 CLASSIFIERS
We used four classifiers: K Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Tree, and Random Forest.

4.3.1 Implementation Details

This section gives details about the implementation of the machine learning algorithms that we used. We implemented all the machine learning classifiers using the scikit-learn library.

1. SVM: Support Vector Machine is a supervised learning method for classification and regression. It tries to find the hyperplane with the maximum margin, that is, the greatest distance to the nearest data points of each class. We used Linear, Polynomial, and Radial Basis Function (RBF) kernels. The multi-class task is handled as one-vs-rest classification, and we obtained the best accuracy with the linear kernel.

2. KNN: K Nearest Neighbors is a simple, easy-to-implement supervised learning algorithm that is widely used for classification. The basic idea behind KNN is that similar things lie near each other. The KNN classifier captures the notion of similarity among objects mathematically, by calculating the distance between them. In KNN, a test sample is assigned the class held by the majority of its nearest neighbors. The algorithm depends on the value of K, which determines the number of training neighbors to which a test sample is compared. The most suitable value of K that we found is 13.

4.4 Deep Neural Network

In this section, we describe the second approach to classification, the Deep Neural Network (DNN). A deep neural network is an architecture inspired by biological systems. A DNN is a feed-forward network in which the raw input flows from the input layer to the output layer without going backward. It uses multiple layers to progressively extract higher-level features from the raw input.

4.4.1 Dense Neural Network

The name dense indicates that all the layers in the network are fully connected. Every neuron in a layer receives input from all neurons in the previous layer, and its output is passed to all neurons in the next layer.

● ReLU: ReLU stands for Rectified Linear Unit. It is the most popular activation function and is chiefly used in the hidden layers of neural networks. It is non-linear in nature, which means we can easily backpropagate the errors and have multiple layers of neurons activated by the ReLU function. The ReLU layer applies the function f(x) = max(0, x) to all of the values in the input. In other words, this layer changes all the negative activations to 0 and keeps the positive values.

● Dropout: The Dropout layer is used to prevent overfitting in neural networks. It randomly sets a fraction 'rate' of the input units to 0 at each update during training. This simplifies the network that is trained at each update. In each iteration, a different combination of neurons is used to predict the final output. Figure 2 provides insight into the structural change in the neural network after adding a dropout layer. In our work, a dropout rate of 0.3 is used, which means that on average three out of every ten neurons are shut off randomly.

● Softmax: Softmax is a generalization of the logistic function that converts the input values into a probability distribution that sums to 1. The class with the highest probability is taken as the predicted class.
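The dense layer, ReLU, Dropout, and Softmax building blocks described in Section 4.4 can be sketched as a small forward pass. The paper's network is written with TensorFlow Keras; the code below is a NumPy stand-in for illustration only. The layer sizes (8 input features, 16 hidden units), the random weights, and the helper names are hypothetical; only the dropout rate of 0.3 and the 10 output classes come from the text. The dropout helper uses the common "inverted dropout" convention, in which surviving units are scaled by 1 / (1 - rate) during training so that no rescaling is needed at inference.

```python
import numpy as np


def dense(x, weights, bias):
    """Fully connected layer: every input unit feeds every output unit."""
    return x @ weights + bias


def relu(x):
    """f(x) = max(0, x): negatives become 0, positives pass through."""
    return np.maximum(0.0, x)


def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units in training."""
    if not training or rate == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)


def softmax(logits):
    """Convert each row of logits into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical sizes and randomly initialized (untrained) parameters.
rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 8, 16, 10
w1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)


def forward(x, training):
    """Dense -> ReLU -> Dropout(0.3) -> Dense -> Softmax."""
    hidden = relu(dense(x, w1, b1))
    hidden = dropout(hidden, rate=0.3, rng=rng, training=training)
    return softmax(dense(hidden, w2, b2))


batch = rng.normal(size=(4, n_features))   # 4 hypothetical feature vectors
probs = forward(batch, training=False)     # dropout is inactive at inference
predicted = probs.argmax(axis=1)           # class with the highest probability
```

Because the weights here are random, the predicted classes carry no meaning; the sketch only shows how the data flows through the layers and why the final output can be read as class probabilities.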
5.2.2 Confusion Matrix

tricky to classify, while genres like classical and pop are easy to classify accurately. One future direction of interest is to discover hidden relationships between music genres across time, which is not only a topic of interest but also has potential commercial applications. This exploration could lead to the use of machine learning to determine artist influences, which is directly applicable to playlist creation and song recommendation.
REFERENCES