MusicGen
MusicGen is a Transformer-based model capable of generating high-quality music samples conditioned on text descriptions or audio prompts.
It was proposed in the paper Simple and Controllable Music Generation by Jade Copet et al. from Meta AI. MusicGen generates music in three stages:
1. The text descriptions are passed through a frozen text encoder model to obtain a sequence of hidden-state representations
2. The MusicGen decoder is then trained to predict discrete audio tokens, or audio codes, conditioned on these hidden-states
3. These audio tokens are then decoded using an audio compression model, such as EnCodec, to recover the audio waveform
The pre-trained MusicGen checkpoints use Google's t5-base as the text encoder model, and EnCodec 32kHz as the audio compression model.
The MusicGen decoder is a pure language model architecture, trained from scratch on the task of music generation.
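As a quick aside, here is a minimal sketch (assuming the Hugging Face Transformers library; the attribute names text_encoder, decoder and audio_encoder come from its MusicGen implementation) to confirm that a pre-trained checkpoint bundles these three components:

from transformers import MusicgenForConditionalGeneration

model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
print(type(model.text_encoder).__name__)   # frozen text encoder (T5)
print(type(model.decoder).__name__)        # MusicGen language-model decoder
print(type(model.audio_encoder).__name__)  # EnCodec audio compression model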
The novelty in the MusicGen model is how the audio codes are predicted. Traditionally, each codebook has to be predicted by a separate model
(i.e. hierarchically) or by continuously refining the output of the Transformer model (i.e. upsampling). MusicGen uses an efficient token
interleaving pattern, thus eliminating the need to cascade multiple models to predict a set of codebooks. Instead, it is able to generate the full
set of codebooks in a single forward pass of the decoder, resulting in much faster inference.
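To make the interleaving idea concrete, here is a toy sketch in plain Python (purely illustrative, with made-up token labels, not the actual MusicGen implementation) of a delay-style pattern in which codebook k is offset by k steps, so that each decoding step carries one token per codebook:

# illustrative delay interleaving: shift codebook k right by k steps
K, T, PAD = 4, 8, "PAD"  # 4 codebooks, 8 audio frames, padding symbol

codes = [[f"c{k}t{t}" for t in range(T)] for k in range(K)]  # codes[k][t] = token of codebook k at frame t
delayed = [[PAD] * k + codes[k] + [PAD] * (K - 1 - k) for k in range(K)]

for row in delayed:
    print(row)  # column j = the K tokens predicted in parallel at decoding step j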
!nvidia-smi

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage       |
|=========================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+
Next, we load the pre-trained MusicGen small checkpoint:

from transformers import MusicgenForConditionalGeneration

model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as a secret in your Colab, and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
config.json: 100% 7.87k/7.87k [00:00<00:00, 392kB/s]
We can then place the model on our accelerator device (if available), or leave it on the CPU otherwise:
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model.to(device)
Generation
MusicGen is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly better results than greedy decoding, so we encourage using sampling mode wherever possible. Sampling is enabled by default, and can be explicitly specified by setting do_sample=True in the call to MusicgenForConditionalGeneration.generate (see below).
Unconditional Generation
The inputs for unconditional (or 'null') generation can be obtained through the method MusicgenForConditionalGeneration.get_unconditional_inputs. We can then run auto-regressive generation using the .generate method, specifying do_sample=True to enable sampling mode:
unconditional_inputs = model.get_unconditional_inputs(num_samples=1)

audio_values = model.generate(**unconditional_inputs, do_sample=True, max_new_tokens=256)
`torch.nn.functional.scaled_dot_product_attention` does not support having an empty attention mask. Falling back to the manual attention implementation.
The argument max_new_tokens specifies the number of new tokens to generate. As a rule of thumb, you can work out the length of the
generated audio sample in seconds by using the frame rate of the EnCodec model:
audio_length_in_s = 256 / model.config.audio_encoder.frame_rate  # 32 kHz EnCodec has a 50 Hz frame rate, so 256 tokens ≈ 5.1 s

audio_length_in_s
Text-Conditional Generation
The text prompts are first pre-processed by the MusicGen processor (which tokenises and pads them) before being passed to .generate:

from transformers import AutoProcessor
from IPython.display import Audio

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(**inputs.to(device), do_sample=True, guidance_scale=3, max_new_tokens=256)

sampling_rate = model.config.audio_encoder.sampling_rate
Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)
[audio player output: 5-second generated sample]
The guidance_scale is used in classifier-free guidance (CFG), setting the weighting between the conditional logits (which are predicted from the text prompts) and the unconditional logits (which are predicted from an unconditional or 'null' prompt). A higher guidance scale encourages the model to generate samples that are more closely linked to the input prompt, usually at the expense of poorer audio quality. CFG is enabled by setting guidance_scale > 1. For best results, use guidance_scale=3 (the default) for text- and audio-conditional generation.
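Audio-Prompted Generation
The same processor can also prepare an audio prompt alongside the text prompt. The cells below reference an audio sample (a dict with "array" and "sampling_rate" keys) and a dataset it was drawn from; as a minimal sketch, such a sample could be loaded with the Hugging Face Datasets library (the dataset name here is illustrative, any audio dataset with an "audio" column would do):

from datasets import load_dataset

# illustrative dataset; any Hub dataset with an "audio" column works
dataset = load_dataset("sanchit-gandhi/gtzan", split="train", streaming=True)
sample = next(iter(dataset))["audio"]  # dict with "array" and "sampling_rate"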
inputs = processor(
    audio=sample["array"],
    sampling_rate=sample["sampling_rate"],
    text=["80s blues track with groovy saxophone"],
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(**inputs.to(device), do_sample=True, guidance_scale=3, max_new_tokens=256)
Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)
[audio player output: 20-second generated sample]
To demonstrate batched audio-prompted generation, we'll slice our sample audio at two different proportions to give two audio prompts of different lengths. Since the input audio prompts vary in length, they will be padded to the length of the longest audio sample in the batch before being passed to the model.
To recover the final audio samples, the generated audio_values can be post-processed with the processor class once again to remove the padding:
sample = next(iter(dataset))["audio"]

# slice the prompt at two different proportions (illustrative fractions) to get
# two audio prompts of different lengths
sample_1 = sample["array"][: len(sample["array"]) // 4]
sample_2 = sample["array"][: len(sample["array"]) // 2]

inputs = processor(
    audio=[sample_1, sample_2],
    sampling_rate=sample["sampling_rate"],
    text=["80s blues track with groovy saxophone", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(**inputs.to(device), do_sample=True, guidance_scale=3, max_new_tokens=256)

# post-process with the processor to strip the padding from the batched audio
audio_values = processor.batch_decode(audio_values, padding_mask=inputs["padding_mask"])

Audio(audio_values[0], rate=sampling_rate)
[audio player output: 12-second generated sample]
Generation Config
The default parameters that control generation, such as sampling mode, guidance scale and maximum generation length, are stored in the model's generation config, which we can inspect:

model.generation_config
GenerationConfig {
"bos_token_id": 2048,
"decoder_start_token_id": 2048,
"do_sample": true,
"guidance_scale": 3.0,
"max_length": 1500,
"pad_token_id": 2048
}
Alright! We see that the model defaults to using sampling mode (do_sample=True), a guidance scale of 3, and a maximum generation length of 1500 (which is equivalent to 30s of audio). You can update any of these attributes to change the default generation parameters:
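For instance (illustrative values; any attribute of the GenerationConfig can be overridden in the same way):

# increase the guidance scale for closer adherence to the prompt
model.generation_config.guidance_scale = 4.0

# limit generation to 256 new tokens (roughly 5 seconds of audio)
model.generation_config.max_new_tokens = 256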