

SINGING VOICE SEPARATION WITH DEEP U-NET CONVOLUTIONAL NETWORKS

Andreas Jansson1,2, Eric Humphrey2, Nicola Montecchio2, Rachel Bittner2, Aparna Kumar2, Tillman Weyde1
1 City, University of London    2 Spotify
{andreas.jansson.1, t.e.weyde}@city.ac.uk
{ejhumphrey, venice, rachelbittner, aparna}@spotify.com

© Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, Tillman Weyde. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, Tillman Weyde. "Singing Voice Separation with Deep U-Net Convolutional Networks", 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.

ABSTRACT

The decomposition of a music audio signal into its vocal and backing track components is analogous to image-to-image translation, where a mixed spectrogram is transformed into its constituent sources. We propose a novel application of the U-Net architecture — initially developed for medical imaging — for the task of source separation, given its proven capacity for recreating the fine, low-level detail required for high-quality audio reproduction. Through both quantitative evaluation and subjective assessment, experiments demonstrate that the proposed algorithm achieves state-of-the-art performance.

1. INTRODUCTION

The field of Music Information Retrieval (MIR) concerns itself, among other things, with the analysis of music in its many facets, such as melody, timbre or rhythm [20]. Among those aspects, popular western commercial music ("pop" music) is arguably characterized by emphasizing mainly the Melody and Accompaniment aspects; while this is certainly an oversimplification in the context of the whole genre, we restrict the focus of this paper to the analysis of music that lends itself well to be described in terms of a main melodic line (foreground) and accompaniment (background) [27]. Normally the melody is sung, whereas the accompaniment is performed by one or more instrumentalists; a singer delivers the lyrics, and the backing musicians provide harmony as well as genre and style cues [29].

The task of automatic singing voice separation consists of estimating what the sung melody and accompaniment would sound like in isolation. A clean vocal signal is helpful for other related MIR tasks, such as singer identification [18] and lyric transcription [17]. As for commercial applications, it is evident that the karaoke industry, estimated to be worth billions of dollars globally [4], would directly benefit from such technology.

2. RELATED WORK

Several techniques have been proposed for blind source separation of musical audio. Successful results have been achieved with non-negative matrix factorization [26, 30, 32], Bayesian methods [21], and the analysis of repeating structures [23].

Deep learning models have recently emerged as powerful alternatives to traditional methods. Notable examples include [25], where a deep feed-forward network learns to estimate an ideal binary spectrogram mask that represents the spectrogram bins in which the vocal is more prominent than the accompaniment. In [9] the authors employ a deep recurrent architecture to predict soft masks that are multiplied with the original signal to obtain the desired isolated source. Convolutional encoder-decoder architectures have been explored in the context of singing voice separation in [6] and [8].
In both of these works, spectrograms are compressed through a bottleneck layer and re-expanded to the size of the target spectrogram. While this "hourglass" architecture is undoubtedly successful in discovering global patterns, it is unclear how much local detail is lost during contraction.

One potential weakness shared by the papers cited above is the lack of large training datasets. Existing models are usually trained on hundreds of tracks of lower-than-commercial quality, and may therefore suffer from poor generalization. In this work we aim to mitigate this problem using weakly labeled, professionally produced music tracks.

Over the last few years, considerable improvements have occurred in the family of machine learning algorithms known as image-to-image translation [11] — pixel-level classification [2], automatic colorization [33], image segmentation [1] — largely driven by advances in the design of novel neural network architectures.

This paper formulates the voice separation task, whose domain is often considered from a time-frequency perspective, as the translation of a mixed spectrogram into vocal and instrumental spectrograms. By using this framework we aim to make use of some of the advances in image-to-image translation — especially in regard to the reproduction of fine-grained details — to advance the state-of-the-art of blind source separation for music.

3. METHODOLOGY

This work adapts the U-Net [24] architecture to the task of vocal separation. The architecture was introduced in biomedical imaging to improve precision and localization of microscopic images of neuronal structures. The architecture builds upon the fully convolutional network [14] and is similar to the deconvolutional network [19]. In a deconvolutional network, a stack of convolutional layers — where each layer halves the size of the image but doubles the number of channels — encodes the image into a small and deep representation. That encoding is then decoded to the original size of the image by a stack of upsampling layers.

In the reproduction of a natural image, displacements by just one pixel are usually not perceived as major distortions. In the frequency domain, however, even a minor linear shift in the spectrogram has disastrous effects on perception: this is particularly relevant in music signals, because of the logarithmic perception of frequency; moreover, a shift in the time dimension can become audible as jitter and other artifacts. It is therefore crucial that the reproduction preserves a high level of detail. The U-Net adds skip connections between layers at the same hierarchical level in the encoder and decoder. This allows low-level information to flow directly from the high-resolution input to the high-resolution output.

3.1 Architecture

The goal of the neural network architecture is to predict the vocal and instrumental components of its input indirectly: the output of the final decoder layer is a soft mask that is multiplied element-wise with the mixed spectrogram to obtain the final estimate. Figure 1 outlines the network architecture.

Figure 1. Network Architecture

In this work, we choose to train two separate models for the extraction of the instrumental and vocal components of a signal, to allow for more divergent training schemes for the two models in the future.

3.1.1 Training

Let X denote the magnitude of the spectrogram of the original, mixed signal, that is, of the audio containing both vocal and instrumental components. Let Y denote the magnitude of the spectrogram of the target audio; the latter refers to either the vocal (Yv) or the instrumental (Yi) component of the input signal. The loss function used to train the model is the L1,1 norm 1 of the difference between the target spectrogram and the masked input spectrogram:

L(X, Y; Θ) = ||f(X, Θ) ⊙ X − Y||1,1    (1)

where f(X, Θ) is the output of the network model applied to the input X with parameters Θ, that is, the mask generated by the model. Two U-Nets, Θv and Θi, are trained to predict vocal and instrumental spectrogram masks, respectively.

1 The L1,1 norm of a matrix is simply the sum of the absolute values of its elements.
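For concreteness, the following is a minimal PyTorch sketch of the loss in Equation (1), assuming the predicted mask, mixture magnitude, and target magnitude are tensors of the same shape. The function name is ours, and the choice to sum rather than average over elements simply follows the L1,1 definition in the footnote.

```python
import torch

def l11_masked_loss(mask, mix_mag, target_mag):
    """L1,1 loss between the masked mixture and the target (Equation 1).

    mask:       f(X, Theta), the soft mask predicted by the U-Net, values in [0, 1]
    mix_mag:    X, magnitude spectrogram of the mixture
    target_mag: Y, magnitude spectrogram of the vocal or instrumental target
    """
    # Element-wise masking of the mixture, then sum of absolute differences
    return torch.sum(torch.abs(mask * mix_mag - target_mag))
```

The same function serves for both models, with the target set to Yv when training Θv and to Yi when training Θi.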
3.1.2 Network Architecture Details

Our implementation of U-Net is similar to that of [11]. Each encoder layer consists of a strided 2D convolution of stride 2 and kernel size 5x5, batch normalization, and leaky rectified linear units (ReLU) with leakiness 0.2. In the decoder we use strided deconvolution (sometimes referred to as transposed convolution) with stride 2 and kernel size 5x5, batch normalization, and plain ReLU, and apply 50% dropout to the first three layers, as in [11]. In the final layer we use a sigmoid activation function. The model is trained using the ADAM [12] optimizer.

Given the heavy computational requirements of training such a model, we first downsample the input audio to 8192 Hz in order to speed up processing. We then compute the Short Time Fourier Transform with a window size of 1024 and a hop length of 768 frames, and extract patches of 128 frames (roughly 11 seconds) that we feed as input and targets to the network. The magnitude spectrograms are normalized to the range [0, 1].
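The layer specification above translates fairly directly into code. The sketch below is one possible PyTorch reading of it: 5x5 kernels with stride 2, batch normalization and leaky ReLU (0.2) in the encoder, transposed convolutions with ReLU and 50% dropout on the first three decoder layers, concatenation skip connections, and a final sigmoid. The number of levels (six) and the channel widths are our assumptions, as the text does not specify them, and the padding values are chosen only so that shapes match for inputs whose dimensions are divisible by 64 (e.g. 512 frequency bins by 128 frames).

```python
import torch
import torch.nn as nn

class UNetSpectrogramMask(nn.Module):
    """Sketch of the encoder-decoder with skip connections described in 3.1.2.
    Depth and channel widths are assumptions, not taken from the paper."""

    def __init__(self, channels=(16, 32, 64, 128, 256, 512)):
        super().__init__()
        self.encoders = nn.ModuleList()
        in_ch = 1
        for out_ch in channels:
            # Strided 5x5 convolution, batch norm, leaky ReLU (0.2)
            self.encoders.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(0.2)))
            in_ch = out_ch

        self.decoders = nn.ModuleList()
        rev = list(reversed(channels))               # e.g. 512, 256, ..., 16
        for i, out_ch in enumerate(rev[1:] + [1]):
            # Concatenated skip connections double the decoder input channels
            in_ch = rev[i] if i == 0 else rev[i] * 2
            layers = [nn.ConvTranspose2d(in_ch, out_ch, kernel_size=5, stride=2,
                                         padding=2, output_padding=1)]
            if i < len(rev) - 1:                     # hidden decoder layers
                layers += [nn.BatchNorm2d(out_ch), nn.ReLU()]
                if i < 3:                            # 50% dropout on the first three
                    layers.append(nn.Dropout(0.5))
            self.decoders.append(nn.Sequential(*layers))

    def forward(self, x):                            # x: (batch, 1, freq, time)
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
        for i, dec in enumerate(self.decoders):
            if i > 0:                                # skip from the matching encoder level
                x = torch.cat([x, skips[-(i + 1)]], dim=1)
            x = dec(x)
        return torch.sigmoid(x)                      # soft mask, same shape as the input
```

A baseline of the kind used in Section 4 corresponds to the same sketch with the skip concatenations removed (and the decoder input widths halved accordingly).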
3.1.3 Audio Signal Reconstruction

The neural network model operates exclusively on the magnitude of audio spectrograms. The audio signal for an individual (vocal/instrumental) component is rendered by constructing a spectrogram: the output magnitude is given by applying the mask predicted by the U-Net to the magnitude of the original spectrum, while the output phase is that of the original spectrum, unaltered. Experimental results presented below indicate that such a simple methodology proves effective.
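As an illustration of this signal path, here is a short sketch using librosa and NumPy, with the sampling rate, window size, and hop length taken from Section 3.1.2. The model call is a placeholder for a trained U-Net, the normalization scheme is an assumption, and the splitting of the spectrogram into 128-frame patches is omitted for brevity.

```python
import numpy as np
import librosa

SR, N_FFT, HOP = 8192, 1024, 768   # values from Section 3.1.2

def separate(mix_path, model):
    """Sketch: STFT -> predicted soft mask -> masked magnitude + original
    phase -> inverse STFT. `model` stands in for a trained U-Net that maps
    a magnitude spectrogram to a mask of the same shape."""
    audio, _ = librosa.load(mix_path, sr=SR, mono=True)
    stft = librosa.stft(audio, n_fft=N_FFT, hop_length=HOP)
    mag, phase = np.abs(stft), np.angle(stft)

    mag_norm = mag / (mag.max() + 1e-8)   # normalize to [0, 1] (assumed scheme)
    mask = model(mag_norm)                # soft mask predicted by the U-Net
    out_mag = mask * mag                  # masked magnitude (Section 3.1.3)

    # Keep the phase of the original mixture, unaltered
    return librosa.istft(out_mag * np.exp(1j * phase), hop_length=HOP)
```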
3.2 Dataset

As stated above, the description of the model architecture assumes that training data is available in the form of a triplet (original signal, vocal component, instrumental component). Unless one is in the extremely fortunate position of having access to vast amounts of unmixed multi-track recordings, an alternative strategy has to be found in order to train a model like the one described.

A solution to the issue was found by exploiting a specific but large set of commercially available recordings in order to "construct" training data: instrumental versions of recordings. It is not uncommon for artists to release instrumental versions of tracks along with the original mix. We leverage this fact by retrieving pairs of (original, instrumental) tracks from a large commercial music database. Candidates are found by examining the metadata for tracks with matching duration and artist information, where the track title (fuzzily) matches except for the string "Instrumental" occurring in exactly one title in the pair. The pool of tracks is pruned by excluding exact content matches. Details about the construction of this dataset can be found in [10].

The above approach provides a large source of X (mixed) and Yi (instrumental) magnitude spectrogram pairs. The vocal magnitude spectrogram Yv is obtained from their half-wave rectified difference. A qualitative analysis of a large handful of examples showed that this technique produced reasonably isolated vocals.

The final dataset contains approximately 20,000 track pairs, resulting in almost two months worth of continuous audio. To the best of our knowledge, this is the largest training dataset ever applied to musical source separation. Table 1 shows the relative distribution of the most frequent genres in the dataset, obtained from the catalog metadata.

Table 1. Training data genre distribution

Genre            Percentage
Pop                26.0%
Rap                21.3%
Dance & House      14.2%
Electronica         7.4%
R&B                 3.9%
Rock                3.6%
Alternative         3.1%
Children's          2.5%
Metal               2.5%
Latin               2.3%
Indie Rock          2.2%
Other              10.9%

4. EVALUATION

We compare the proposed model to the Chimera model [15], which produced the highest evaluation scores in the 2016 MIREX Source Separation campaign 2; we make use of their web interface 3 to process audio clips. It should be noted that the Chimera web server is running an improved version of the algorithm that participated in MIREX, using a hybrid "multiple heads" architecture that combines deep clustering with a conventional neural network [16]. For evaluation purposes we built an additional baseline model; it resembles the U-Net model but without the skip connections, essentially creating a convolutional encoder-decoder, similar to the "Deconvnet" [19].

We evaluate the three models on the standard iKala [5] and MedleyDB [3] datasets. The iKala dataset has been used as a standardized evaluation for the annual MIREX campaign for several years, so there are many existing results that can be used for comparison. MedleyDB, on the other hand, was recently proposed as a higher-quality, commercial-grade set of multi-track stems. We generate isolated instrumental and vocal tracks by weighting sums of instrumental/vocal stems by their respective mixing coefficients as supplied by the MedleyDB Python API 4. We limit our evaluation to clips that are known to contain vocals, using the melody transcriptions provided in both iKala and MedleyDB.

2 www.music-ir.org/mirex/wiki/2016:Singing_Voice_Separation_Results
3 danetapi.com/chimera
4 github.com/marl/medleyDB

The following functions are used to measure performance: Signal-to-Distortion Ratio (SDR), Signal-to-Interference Ratio (SIR), and Signal-to-Artifact Ratio (SAR) [31]. Normalized SDR (NSDR) is defined as

NSDR(Se, Sr, Sm) = SDR(Se, Sr) − SDR(Sm, Sr)    (2)

where Se is the estimated isolated signal, Sr is the reference isolated signal, and Sm is the mixed signal. We compute performance measures using the mir_eval toolkit [22].
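These measures are available in the mir_eval toolkit; the sketch below shows one way to obtain the NSDR of Equation (2) for a single estimated source. The wrapper function is ours and is only meant to illustrate the definition, not to reproduce the authors' exact evaluation script.

```python
import numpy as np
import mir_eval.separation

def nsdr(estimate, reference, mixture):
    """Normalized SDR from Equation (2): the SDR of the estimate minus the
    SDR obtained when the raw mixture is used as the estimate. All three
    arguments are 1D time-domain signals of equal length."""
    sdr_est, _, _, _ = mir_eval.separation.bss_eval_sources(
        reference[np.newaxis, :], estimate[np.newaxis, :])
    sdr_mix, _, _, _ = mir_eval.separation.bss_eval_sources(
        reference[np.newaxis, :], mixture[np.newaxis, :])
    return sdr_est[0] - sdr_mix[0]
```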
Table 2 and Table 3 show that the U-Net significantly outperforms both the baseline model and Chimera on all three performance measures for both datasets. In Figure 2 we show an overview of the distributions for the different evaluation measures.

Table 2. iKala mean scores

                    U-Net    Baseline  Chimera
NSDR Vocal          11.094    8.549     8.749
NSDR Instrumental   14.435   10.906    11.626
SIR Vocal           23.960   20.402    21.301
SIR Instrumental    21.832   14.304    20.481
SAR Vocal           17.715   15.481    15.642
SAR Instrumental    14.120   12.002    11.539

Table 3. MedleyDB mean scores

                    U-Net    Baseline  Chimera
NSDR Vocal           8.681    7.877     6.793
NSDR Instrumental    7.945    6.370     5.477
SIR Vocal           15.308   14.336    12.382
SIR Instrumental    21.975   16.928    20.880
SAR Vocal           11.301   10.632    10.033
SAR Instrumental    15.462   15.332    12.530

Figure 2. iKala vocal and instrumental scores

Assuming that the distribution of tracks in the iKala hold-out set used for MIREX evaluations matches those in the public iKala set, we can compare our results to the participants in the 2016 MIREX Singing Voice Separation task 5. Table 4 and Table 5 show NSDR scores for our models compared to the best performing algorithms of the 2016 MIREX campaign.

5 http://www.music-ir.org/mirex/wiki/2016:Singing_Voice_Separation_Results

Table 4. iKala NSDR Instrumental, MIREX 2016

Model     Mean    SD     Min     Max     Median
U-Net     14.435  3.583   4.165  21.716  14.525
Baseline  10.906  3.247   1.846  19.641  10.869
Chimera   11.626  4.151  -0.368  20.812  12.045
LCP2      11.188  3.626   2.508  19.875  11.000
LCP1      10.926  3.835   0.742  19.960  10.800
MC2        9.668  3.676  -7.875  22.734   9.900

Table 5. iKala NSDR Vocal, MIREX 2016

Model     Mean    SD     Min     Max     Median
U-Net     11.094  3.566   2.392  20.720  10.804
Baseline   8.549  3.428  -0.696  18.530   8.746
Chimera    8.749  4.001  -1.850  18.701   8.868
LCP2       6.341  3.370  -1.958  17.240   5.997
LCP1       6.073  3.462  -1.658  17.170   5.649
MC2        5.289  2.914  -1.302  12.571   4.945

In order to assess the effect of the U-Net's skip connections, we can visualize the masks generated by the U-Net and baseline models. From Figure 3 it is clear that while the baseline model captures the overall structure, it lacks fine-grained detail.

Figure 3. U-Net and baseline masks

4.1 Subjective Evaluation

Emiya et al. introduced a protocol for the subjective evaluation of source separation algorithms [7]. They suggest asking human subjects four questions that broadly correspond to the SDR/SIR/SAR measures, plus an additional question regarding the overall sound quality. As we asked these four questions to subjects without music training, our subjects found them ambiguous, e.g., they had problems discerning between the absence of artifacts and general sound quality. For better clarity, we distilled the survey into the following two questions in the vocal extraction case:

• Quality: "Rate the vocal quality in the examples below."
• Interference: "How well have the instruments in the clip above been removed in the examples below?"

For instrumental extraction we asked similar questions:

• Quality: "Rate the sound quality of the examples below relative to the reference above."
• Extracting instruments: "Rate how well the instruments are isolated in the examples below relative to the full mix above."

Data was collected using CrowdFlower 6, an online platform where humans carry out micro-tasks, such as image classification, simple web searches, etc., in return for small per-task payments. In our survey, CrowdFlower users were asked to listen to three clips of isolated audio, generated by U-Net, the baseline model, and Chimera. The order of the three clips was randomized. Each question asked one of the Quality and Interference questions. In the Interference question we also included a reference clip. The answers were given according to a 7 step Likert scale [13], ranging from "Poor" to "Perfect". Figure 4 is a screen capture of a CrowdFlower question.

6 www.crowdflower.com

Figure 4. CrowdFlower example question

To ensure the quality of the collected responses, we interspersed the survey with "control questions" that the user had to answer correctly according to a predefined set of acceptable answers on the Likert scale. Users of the platform are unaware of which questions are control questions. If they are answered incorrectly, the user is disqualified from the task. A music expert external to our research group was asked to provide acceptable answers to a number of random clips that were designated as control questions.

For the survey we used 25 clips from the iKala dataset and 42 clips from MedleyDB 7. We had 44 respondents and 724 total responses for the instrumental test, and 55 respondents supplied 779 responses for the voice test. Figure 5 shows the mean and standard deviation of the answers provided on CrowdFlower. The U-Net algorithm outperforms the other two models on all questions.

7 Audio examples can be found on http://mirg.city.ac.uk/codeapps/vocal-source-separation-ismir2017

Figure 5. CrowdFlower evaluation results (mean/std), for iKala vocal, iKala instrumental, MedleyDB vocal, and MedleyDB instrumental

5. CONCLUSION AND FUTURE WORK

We have explored the U-Net architecture in the context of singing voice separation, and found that it brings clear improvements over the state-of-the-art. The benefits of low-level skip connections were demonstrated by comparison to plain convolutional encoder-decoders.

A factor that we feel should be investigated further is the impact of large training data: work remains to be done to correlate the effects of the size of the training dataset to the quality of source separation. We have observed some examples of poor separation on tracks where the vocals are mixed at lower-than-average volume, are uncompressed, suffer from extreme application of audio effects, or are otherwise unconventionally mixed. Since the training data consisted exclusively of commercially produced recordings, we hypothesize that our model has learned to distinguish the kind of voice typically found in commercial pop music. We plan to investigate this further by systematically analyzing the dependence of model performance on the mixing conditions.

Finally, subjective evaluation of source separation algorithms is an open research question. Several alternatives to the 7-step Likert scale exist, e.g. the ITU-R scale [28]. Tools like CrowdFlower allow us to quickly roll out surveys, but care is required in the design of question statements.

6. REFERENCES

[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for scene segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[2] Aayush Bansal, Xinlei Chen, Bryan Russell, Abhinav Gupta, and Deva Ramanan. PixelNet: Towards a general pixel-level architecture. arXiv preprint arXiv:1609.06694, 2016.
[3] Rachel M. Bittner, Justin Salamon, Mike Tierney, Matthias Mauch, Chris Cannam, and Juan Pablo Bello. MedleyDB: A multitrack dataset for annotation-intensive MIR research. In Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR 2014, Taipei, Taiwan, October 27-31, 2014, pages 155–160, 2014.

[4] Kevin Brown. Karaoke Idols: Popular Music and the Performance of Identity. Intellect Books, 2015.

[5] Tak-Shing Chan, Tzu-Chun Yeh, Zhe-Cheng Fan, Hung-Wei Chen, Li Su, Yi-Hsuan Yang, and Roger Jang. Vocal activity informed singing voice separation with the iKala dataset. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 718–722. IEEE, 2015.

[6] Pritish Chandna, Marius Miron, Jordi Janer, and Emilia Gómez. Monoaural audio source separation using deep convolutional neural networks. In International Conference on Latent Variable Analysis and Signal Separation, pages 258–266. Springer, 2017.

[7] Valentin Emiya, Emmanuel Vincent, Niklas Harlander, and Volker Hohmann. Subjective and objective quality assessment of audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2046–2057, 2011.

[8] Emad M. Grais and Mark D. Plumbley. Single channel audio source separation using convolutional denoising autoencoders. arXiv preprint arXiv:1703.08019, 2017.

[9] Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis. Singing-voice separation from monaural recordings using deep recurrent neural networks. In Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR 2014, Taipei, Taiwan, October 27-31, 2014, pages 477–482, 2014.

[10] Eric Humphrey, Nicola Montecchio, Rachel Bittner, Andreas Jansson, and Tristan Jehan. Mining labeled data from web-scale collections for vocal activity detection in music. In Proceedings of the 18th ISMIR Conference, 2017.

[11] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.

[12] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[13] Rensis Likert. A technique for the measurement of attitudes. Archives of Psychology, 1932.

[14] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[15] Yi Luo, Zhuo Chen, and Daniel P. W. Ellis. Deep clustering for singing voice separation. 2016.

[16] Yi Luo, Zhuo Chen, John R. Hershey, Jonathan Le Roux, and Nima Mesgarani. Deep clustering and conventional networks for music separation: Stronger together. arXiv preprint arXiv:1611.06265, 2016.

[17] Annamaria Mesaros and Tuomas Virtanen. Automatic recognition of lyrics in singing. EURASIP Journal on Audio, Speech, and Music Processing, 2010(1):546047, 2010.

[18] Annamaria Mesaros, Tuomas Virtanen, and Anssi Klapuri. Singer identification in polyphonic music using vocal separation and pattern recognition methods. In Proceedings of the 8th International Conference on Music Information Retrieval, ISMIR 2007, Vienna, Austria, September 23-27, 2007, pages 375–378, 2007.
[19] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.

[20] Nicola Orio et al. Music retrieval: A tutorial and review. Foundations and Trends in Information Retrieval, 1(1):1–90, 2006.

[21] Alexey Ozerov, Pierrick Philippe, Frédéric Bimbot, and Rémi Gribonval. Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs. IEEE Transactions on Audio, Speech, and Language Processing, 15(5):1564–1578, 2007.

[22] Colin Raffel, Brian McFee, Eric J. Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel P. W. Ellis. mir_eval: A transparent implementation of common MIR metrics. In Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR 2014, Taipei, Taiwan, October 27-31, 2014, pages 367–372, 2014.

[23] Zafar Rafii and Bryan Pardo. REpeating Pattern Extraction Technique (REPET): A simple method for music/voice separation. IEEE Transactions on Audio, Speech, and Language Processing, 21(1):73–84, 2013.

[24] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[25] Andrew J. R. Simpson, Gerard Roma, and Mark D. Plumbley. Deep karaoke: Extracting vocals from musical mixtures using a convolutional deep neural network. In International Conference on Latent Variable Analysis and Signal Separation, pages 429–436. Springer, 2015.

[26] Paris Smaragdis, Cédric Févotte, Gautham J. Mysore, Nasser Mohammadiha, and Matthew Hoffman. Static and dynamic source separation using nonnegative factorizations: A unified view. IEEE Signal Processing Magazine, 31(3):66–75, 2014.

[27] Philip Tagg. Analysing popular music: theory, method and practice. Popular Music, 2:37–67, 1982.

[28] Thilo Thiede, William C. Treurniet, Roland Bitto, Christian Schmidmer, Thomas Sporer, John G. Beerends, and Catherine Colomes. PEAQ: The ITU standard for objective measurement of perceived audio quality. Journal of the Audio Engineering Society, 48(1/2):3–29, 2000.

[29] George Tzanetakis and Perry Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293–302, 2002.

[30] Shankar Vembu and Stephan Baumann. Separation of vocals from polyphonic audio recordings. In ISMIR 2005, 6th International Conference on Music Information Retrieval, London, UK, 11-15 September 2005, Proceedings, pages 337–344, 2005.

[31] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006.

[32] Tuomas Virtanen. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3):1066–1074, 2007.

[33] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.