
Conditional GAN for Diatonic Harmonic Sequences Generation in a VR Context

2021

http://dx.doi.org/10.14236/ewic/EVA2021.15
Proceedings of EVA London 2021, UK. © Shvets et al. Published by BCS Learning and Development Ltd.

Anna Shvets
FabLab by Inetum
157 Boulevard McDonald, 75019 Paris, France
anna.shvets@inetum.world

Samer Darkazanli
iMSA
Rue Clos Maury, 82000 Montauban, France
darkazanli.samer@imsa.msa.fr

The use of AI models for music generation receives important attention from scientific communities. Different deep learning architectures have been applied to this task, such as Recurrent Neural Networks (RNN), Generative Adversarial Networks (GAN), Autoencoders, Variational Autoencoders (VAE) and Transformers. One important aspect of the generation process is the possibility to control the output by providing input parameters, and conditional generation has been widely used in the computer vision domain to meet this need. In the present research we adopt the principles of conditional generation using a GAN architecture and convolutions, applying them to the temporal domain, which results in a conditional GAN for diatonic harmonic sequence generation. The model is further used as the core feature of the module for computer-aided composition in the "Graphs in harmony learning" VR project, where the sequence generation is conditioned by the user's input and the result is mapped to 3D representations of the generated chords.

Keywords: Conditional GAN. Harmonic sequences generation. VR. Structural harmony method. Computer-aided composition.

1. INTRODUCTION

Recent advances in the field of deep neural networks have influenced the approaches to computer-aided composition, which is nowadays enriched with a plethora of deep learning architectures completing a range of generation tasks. Those tasks correspond to the features of music language, such as the generation of a melody (Yu et al. 2017; Yang, Chou & Yang 2017), accompaniment (Kritsis et al. 2021; Dong et al. 2018; Kaliakatsos-Papakostas, Floros & Vrahatis 2012), temporal dependencies (Vohra, Goel & Sahoo 2015; Eck & Schmidhuber 2002), as well as the generation of entire sections of music in a specific style (Liang 2016; Huang et al. 2018). However, despite the variety of solutions, conditional harmonic sequence generation in the symbolic domain still leaves room for experimentation. In the present paper we propose an adaptation of the conditional GAN architecture, originating from the computer vision domain, preserving convolutional layers for computing sequential data. The idea of using this type of architecture is closely related to the use case of the model and depends on the VR application context, discussed below.

2. VR APPLICATION DESCRIPTION

The application "Graphs in harmony learning" aims to develop the harmonic ear and a theoretical understanding of the underlying relations of chords, attributed to function groups, through an original representation of music knowledge based on a system of graphs. The understanding of music theory is aided by a specific colour scheme representing music functions, applied to the schematic representation of chords inside the system of graphs, as well as to the score representation of chords on the staff in a virtual 3D space. The explicit visual representation, achieved with the possibilities of the VR application context and the proposed representation methodology, helps the learner to focus on chord phonism and the functional relationships between chords, which increases the overall learning efficiency (Shvets 2019).

A previous version of the application (Shvets & Darkazanli 2020) consisted of three modules: the explanation module (Entry room), explaining and illustrating the theory basics and the graph system to the learner; the main module (Study room), exposing the study material; and the testing module (Test room), challenging the user with the auditory recognition of sequences studied in the main module. The creative module (Practice room) is a new feature which allows the user to apply the acquired knowledge in practice, using computer-aided sequence generation. This module has not been shown before and is discussed in detail below.

3. CONDITIONAL GAN

The choice of the AI model architecture is conditioned by the context in which the model is used, that is, the creative module of the VR music learning application. As mentioned before, this module offers the user the possibility to apply in practice the knowledge about diatonic harmonic sequences acquired in the main module. Together with the test module, also based on an AI solution, the user receives a complete circle of learning experience consisting of three phases: learning, testing and application in practice. This is the main reason why existing models of music generation were not suitable, as the main restriction of this project is the obligation to generate diatonic harmonic sequences in the C key only.

Another requirement of the project was the participation of the learner in the generative process: the user chooses a chord from the system of graphs to trigger the harmonic sequence generation, and the chosen chord starts the sequence returned by the model, with a further mapping to its 3D staff notation representation. Therefore, the information about the chord should condition the content of the sequence to be generated and should be inputted to the model as a class. Given these two main requirements, the architectural choice was made in favour of the conditional GAN architecture.

3.1 Dataset and pre-processing

Before considering the architecture of the model, let us discuss the dataset and its preparation pipeline. In view of the diatonic content requirement, classical and romantic (early and middle period) music styles were chosen to form the training set, using the data present in the Kaggle dataset "Classical Music". The distribution of examples per composer, as well as the number of extracted chords, is shown in Table 1. The chord extraction itself was made with the Python library for computational musicology, music21 (Cuthbert & Ariza 2010). Additionally, manually constructed progressions containing passing progressions were added to the training corpus to ensure the presence of passing progressions in the generated sequences.

Table 1: The list of examples used for training.

Composer       Number of MIDI files   Number of extracted chords
Haydn          21                     193
Mozart         21                     929
Beethoven      29                     5355
Mendelssohn    15                     1564
Schubert       29                     10802
Chopin         48                     5409
Liszt          16                     7492
Brahms         10                     2503
Total          189                    34 247

After the information extraction from the MIDI files, the chord sequences were transposed to the C key and transformed to their Roman numeral representation. A further filtering step consisted in the elimination of any chord alterations and of sequential chord repetitions, since the generated sequences, limited to five chords only (due to the mapping limitations of the 3D space), should express harmonic variety, free of repetitions. The array of chords prepared in this way is then divided into seven classes, where the first chord is the representative of the class (one of the seven degrees of the music scale), followed by the remaining four chords. Finally, the chords of each sequence are replaced with their indices from the dictionary of unique chord values, converted to tensors and normalised between -1 and 1. In this form the features are fed to the model, along with their seven labels.
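The pipeline can be sketched as follows (a minimal illustration assuming the music21 API; the helper names, the windowing strategy and the exact filtering rules are simplifying assumptions, not the original code):

```python
# A minimal sketch of the pre-processing pipeline; helper names, windowing
# and filtering rules are simplifying assumptions, not the exact original code.
from music21 import converter, interval, key, pitch, roman

DIATONIC = ['I', 'ii', 'iii', 'IV', 'V', 'vi', 'viio']    # the seven degrees
VOCAB = {figure: i for i, figure in enumerate(DIATONIC)}  # chord dictionary

def midi_to_sequences(midi_path, length=5):
    """Extract five-chord diatonic windows in C, encoded in [-1, 1]."""
    score = converter.parse(midi_path)
    tonic = score.analyze('key').tonic                    # detected tonic
    score = score.transpose(interval.Interval(tonic, pitch.Pitch('C')))
    chords = score.chordify().recurse().getElementsByClass('Chord')
    figures = [roman.romanNumeralFromChord(c, key.Key('C')).figure
               for c in chords]
    filtered = []                                         # drop alterations
    for f in figures:                                     # and repetitions
        if f in VOCAB and (not filtered or filtered[-1] != f):
            filtered.append(f)
    windows = [filtered[i:i + length]
               for i in range(0, len(filtered) - length + 1, length)]
    # Replace chords by dictionary indices, normalised between -1 and 1.
    return [[2 * VOCAB[f] / (len(VOCAB) - 1) - 1 for f in w] for w in windows]
```

The class label of each window is then simply the scale degree of its first chord.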
3.2 GAN architecture

The network was built with the PyTorch deep learning framework. The generator part of the model is composed of a tripled stack of transposed 1D convolution, batch normalisation and rectified linear unit (ReLU) layers, with the addition of a final stack composed of a transposed 1D convolution and a tanh layer. The discriminator part has one stack less, replacing the transposed 1D convolution with a plain 1D convolution and utilising a leaky ReLU layer with a slope of 0.2 instead of the ReLU used in the generator. The final stack of the discriminator contains the convolution layer only (Figure 1).

Figure 1: Diagram of the network architecture.

The Adam optimisation algorithm with a small learning rate (0.0002) was used for both parts of the model. Finally, a binary cross entropy loss function (BCEWithLogitsLoss) was used to calculate the performance of both the generator and the discriminator.

The information about the sequence label is added to the noise vector fed to the generator, while the label information for the discriminator is added to the input channels dimension. This way, both parts of the network have the information about the class to which the sequence belongs: the generator, to be able to transform the input noise tensor into the shape of the sequence; the discriminator, to assess the effort of the generator by comparing it to a ground-truth example of the sequence inside the training loop.
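A sketch of both parts, with the label conditioning just described, could look as follows (channel sizes, kernel widths and the noise dimension are illustrative assumptions; the stack structure, activations, optimiser setting and loss follow the description above):

```python
# A sketch of the conditional GAN; channel sizes, kernel widths and Z_DIM are
# illustrative assumptions, only the stack structure follows the paper's text.
import torch
import torch.nn as nn

N_CLASSES, Z_DIM, SEQ_LEN = 7, 64, 5

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # three stacks of (transposed conv, batch norm, ReLU)...
            nn.ConvTranspose1d(Z_DIM + N_CLASSES, 64, kernel_size=3),  # 1 -> 3
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.ConvTranspose1d(64, 32, kernel_size=2),                 # 3 -> 4
            nn.BatchNorm1d(32), nn.ReLU(),
            nn.ConvTranspose1d(32, 16, kernel_size=2),                 # 4 -> 5
            nn.BatchNorm1d(16), nn.ReLU(),
            # ...plus a final stack of a transposed conv and a tanh layer
            nn.ConvTranspose1d(16, 1, kernel_size=1), nn.Tanh(),       # 5 -> 5
        )

    def forward(self, z, labels):                 # z: (batch, Z_DIM, 1)
        onehot = nn.functional.one_hot(labels, N_CLASSES).float()
        z = torch.cat([z, onehot.unsqueeze(-1)], dim=1)  # label joins the noise
        return self.net(z)                        # (batch, 1, SEQ_LEN)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # one stack less, plain convs, leaky ReLU with a 0.2 slope
            nn.Conv1d(1 + N_CLASSES, 16, kernel_size=3),               # 5 -> 3
            nn.BatchNorm1d(16), nn.LeakyReLU(0.2),
            nn.Conv1d(16, 32, kernel_size=2),                          # 3 -> 2
            nn.BatchNorm1d(32), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 1, kernel_size=2),      # final stack: conv only
        )

    def forward(self, x, labels):                 # x: (batch, 1, SEQ_LEN)
        onehot = nn.functional.one_hot(labels, N_CLASSES).float()
        maps = onehot.unsqueeze(-1).expand(-1, -1, SEQ_LEN)  # label as channels
        return self.net(torch.cat([x, maps], dim=1)).view(-1)  # raw logits

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=0.0002)
opt_d = torch.optim.Adam(D.parameters(), lr=0.0002)
criterion = nn.BCEWithLogitsLoss()   # expects raw logits, as in the paper
```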
3.3 Training

The extracted chords, passed through the filtering process, resulted in a network input containing 16 374 harmonic sequences; this data was used to train the generator and the discriminator for 100 epochs. The learning curves for the generator and discriminator are shown in Figure 2.

Figure 2: The loss decrease during training for both models of the GAN.

The figure shows a good dynamic of the loss decrease for both parts of the network. The noisy shape of the curves is rather expected for a GAN architecture. At the end of the training process the two models were rather balanced, with the generator loss equal to 0.74 and the discriminator loss to 0.67, which are good metrics for a GAN. The generation tests showed, however, that the model is prone to mode collapse, so the network architecture may be further improved in future work to achieve better stability.

4. VR APPLICATION INTEGRATION

After the training process is finished, the best checkpoint of the model is exported to a binary file and integrated into a web application using the Flask framework. The application containing the model is then deployed to a server, and the communication with the VR application is done via JSON: the user of the VR application chooses a chord from the graph structure, which is sent to the web endpoint where the model is deployed; the model takes that input and generates the rest of the sequence, sending it back to the VR application; upon reception of the data, the VR application adjusts the mapping of the sound and the 3D models of the chords, and exposes the mapped version to the user as a sequence on the notation staff (Figure 3).

Figure 3: The sequence generated by the conditional GAN, mapped to chords in a VR space.
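As an illustration of this round trip, a minimal Flask endpoint could look as follows (the route name, the JSON fields and the seven-value chord vocabulary used for de-normalisation are hypothetical assumptions, not the deployed service):

```python
# A hypothetical sketch of the web endpoint; route name, JSON fields and the
# seven-value chord vocabulary are assumptions, not the deployed service.
import torch
from flask import Flask, jsonify, request

app = Flask(__name__)
generator = torch.load('generator.pt')   # best checkpoint, assuming it was
generator.eval()                         # saved as a whole module

@app.route('/generate', methods=['POST'])
def generate():
    degree = int(request.get_json()['chord'])     # 1..7, chosen in the VR graph
    z = torch.randn(1, 64, 1)                     # noise vector
    label = torch.tensor([degree - 1])            # class index 0..6
    with torch.no_grad():
        seq = generator(z, label).squeeze()       # five values in [-1, 1]
    # Invert the normalisation: index = (value + 1) * (vocab_size - 1) / 2.
    ids = [int(round((v.item() + 1) * 3)) for v in seq]
    return jsonify({'sequence': ids})             # mapped to 3D chords in VR
```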
5. RESULTS

In this paper we have shown an adaptation of the conditional GAN architecture to sequential data representing harmonic sequences of diatonic chords. We have also demonstrated how this type of architecture, along with the generated content, serves a specific use case: the integration into the creative module for computer-aided sequence generation in a VR harmony learning application. Future work will focus on augmenting the number of classes in order to include chord inversions into the generation condition.

6. REFERENCES

Cuthbert, M. and Ariza, C. (2010) music21: A Toolkit for Computer-Aided Musicology and Symbolic Music Data. International Conference on Music Information Retrieval, 637–642.

Dong, H.W., Hsiao, W.Y., Yang, L.C. and Yang, Y.H. (2018) MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. AAAI Conference on Artificial Intelligence, 32(1).

Eck, D. and Schmidhuber, J. (2002) Finding Temporal Structure in Music: Blues Improvisation with LSTM Recurrent Networks. 12th IEEE Workshop on Neural Networks for Signal Processing.

Huang, C.Z.A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., Dai, A.M., Hoffman, M.D., Dinculescu, M. and Eck, D. (2018) Music Transformer. arXiv preprint arXiv:1809.04281.

Kritsis, K., Kylafi, T., Kaliakatsos-Papakostas, M., Pikrakis, A. and Katsouros, V. (2021) On the Adaptability of Recurrent Neural Networks for Real-Time Jazz Improvisation Accompaniment. Frontiers in Artificial Intelligence, 3, 113.

Liang, F. (2016) BachBot: Automatic Composition in the Style of Bach Chorales. M.Phil thesis, University of Cambridge.

Shvets, A. (2019) Structural harmony method in the context of deep learning on example of music by Valentyn Sylvestrov and Philipp Glass. In: Weinel, J., Bowen, J.P., Diprose, G., and Lambert, N. (eds), EVA London 2019 (Electronic Visualisation and the Arts), London, UK, 10–14 July 2019, 318–320. BCS, London. doi: 10.14236/ewic/EVA2019.60

Shvets, A. and Darkazanli, S. (2020) Graphs in harmony learning: AI assisted VR application. In: Weinel, J., Bowen, J.P., Diprose, G., and Lambert, N. (eds), EVA London 2020 (Electronic Visualisation and the Arts), London, UK, 6–9 July 2020, 104–105. BCS, London. doi: 10.14236/ewic/EVA2020.18

Vohra, R., Goel, K. and Sahoo, J. (2015) Modeling Temporal Dependencies in Data Using a DBN-LSTM. IEEE International Conference on Data Science and Advanced Analytics.

Yang, L.C., Chou, S.Y. and Yang, Y.H. (2017) MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. arXiv preprint arXiv:1703.10847.

Yu, L., Zhang, W., Wang, J. and Yu, Y. (2017) SeqGAN: Sequence generative adversarial nets with policy gradient. AAAI Conference on Artificial Intelligence, 31(1).