MusicGen.ipynb - Colab

MusicGen is a Transformer-based model capable of generating high-quality music samples conditioned on text descriptions or audio prompts.
It was proposed in the paper Simple and Controllable Music Generation by Jade Copet et al. from Meta AI.

The MusicGen model can be decomposed into three distinct stages:

1. The text descriptions are passed through a frozen text encoder model to obtain a sequence of hidden-state representations
2. The MusicGen decoder is then trained to predict discrete audio tokens, or audio codes, conditioned on these hidden-states
3. These audio tokens are then decoded using an audio compression model, such as EnCodec, to recover the audio waveform

The pre-trained MusicGen checkpoints use Google's t5-base as the text encoder model, and EnCodec 32kHz as the audio compression model.
The MusicGen decoder is a pure language model architecture, trained from scratch on the task of music generation.

The novelty in the MusicGen model is how the audio codes are predicted. Traditionally, each codebook has to be predicted by a separate model
(i.e. hierarchically) or by continuously refining the output of the Transformer model (i.e. upsampling). MusicGen uses an efficient token
interleaving pattern, thus eliminating the need to cascade multiple models to predict a set of codebooks. Instead, it is able to generate the full
set of codebooks in a single forward pass of the decoder, resulting in much faster inference.
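
To build intuition for this interleaving idea, here is a minimal sketch of a delay-style pattern over four codebooks. The shapes, the pad value and the helper are my own illustrative assumptions, not the exact implementation used in transformers:

import numpy as np

# Illustrative sketch only: in a "delay" pattern, codebook k is shifted right
# by k timesteps, so at every decoding step the model can emit one token per
# codebook in the same forward pass.
num_codebooks, seq_len, pad_id = 4, 8, -1  # pad_id is a placeholder value

codes = np.arange(num_codebooks * seq_len).reshape(num_codebooks, seq_len)
delayed = np.full((num_codebooks, seq_len + num_codebooks - 1), pad_id)
for k in range(num_codebooks):
    delayed[k, k : k + seq_len] = codes[k]

print(delayed)  # row k starts k steps later; padded positions hold pad_id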

Prepare the Environment


!nvidia-smi

Mon Nov 11 18:08:39 2024


+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 51C P8 9W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |

+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

!pip install --upgrade --quiet pip


!pip install --upgrade --quiet transformers datasets[audio]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 67.2 MB/s eta 0:00:00


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.0/10.0 MB 111.2 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.0/3.0 MB 64.6 MB/s eta 0:00:00
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.

Load the Model


The pre-trained MusicGen small, medium and large checkpoints can be loaded from the Hugging Face Hub. Change
the repo id to the checkpoint size you wish to load.

from transformers import MusicgenForConditionalGeneration

model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")


/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
config.json: 100% 7.87k/7.87k [00:00<00:00, 392kB/s]

model.safetensors: 100% 2.36G/2.36G [00:12<00:00, 256MB/s]


/usr/local/lib/python3.10/dist-packages/transformers/models/encodec/modeling_encodec.py:124: UserWarning: To copy construct from
self.register_buffer("padding_total", torch.tensor(kernel_size - stride, dtype=torch.int64), persistent=False)
Config of the text_encoder: <class 'transformers.models.t5.modeling_t5.T5EncoderModel'> is overwritten by shared text_encoder co
"_name_or_path": "t5-base",
"architectures": [
"T5ForConditionalGeneration"
],
"classifier_dropout": 0.0,
"d_ff": 3072,
"d_kv": 64,
"d_model": 768,
"decoder_start_token_id": 0,
"dense_act_fn": "relu",
"dropout_rate": 0.1,
"eos_token_id": 1,
"feed_forward_proj": "relu",
"initializer_factor": 1.0,
"is_encoder_decoder": true,
"is_gated_act": false,
"layer_norm_epsilon": 1e-06,
"model_type": "t5",
"n_positions": 512,
"num_decoder_layers": 12,
"num_heads": 12,
"num_layers": 12,
"output_past": true,
"pad_token_id": 0,
"relative_attention_max_distance": 128,
"relative_attention_num_buckets": 32,
"task_specific_params": {
"summarization": {
" l t i " t
https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/MusicGen.ipynb#scrollTo=e52ff5b2-c170-4079-93a4-a02acbdaeb39&printMode=true 3/12
11/12/24, 12:11 AM MusicGen.ipynb - Colab
"early_stopping": true,
"length_penalty": 2.0,
"max_length": 200,
"min_length": 30,
"no_repeat_ngram_size": 3,
"num_beams": 4,
"prefix": "summarize: "
},
"translation_en_to_de": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to German: "
},
"translation_en_to_fr": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to French: "
},
"translation_en_to_ro": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to Romanian: "
}
},
"transformers_version": "4.46.2",
"use_cache": true,
"vocab_size": 32128
}

Config of the audio_encoder: <class 'transformers.models.encodec.modeling_encodec.EncodecModel'> is overwritten by shared audio_


"_name_or_path": "facebook/encodec_32khz",
"architectures": [
"EncodecModel"
],
"audio_channels": 1,
"chunk_length_s": null,
"codebook_dim": 128,
"codebook_size": 2048,
"compress": 2,
"dilation_growth_rate": 2,
"hidden_size": 128,
"kernel_size": 7,
"last_kernel_size": 7,
"model_type": "encodec",
"norm_type": "weight_norm",
"normalize": false,
"num_filters": 64,
"num_lstm_layers": 2,
"num_residual_layers": 1,
"overlap": null,
"pad_mode": "reflect",
"residual_kernel_size": 3,
"sampling_rate": 32000,
"target_bandwidths": [
2.2
],
"torch_dtype": "float32",
"transformers_version": "4.46.2",
"trim_right_ratio": 1.0,
"upsampling_ratios": [
8,
5,
4,
4
],
"use_causal_conv": false,
"use_conv_shortcut": false
}

Config of the decoder: <class 'transformers.models.musicgen.modeling_musicgen.MusicgenForCausalLM'> is overwritten by shared dec


"activation_dropout": 0.0,
"activation_function": "gelu",
"attention_dropout": 0.0,
"audio_channels": 1,
"bos_token_id": 2048,
"classifier_dropout": 0.0,
"dropout": 0.1,
"ffn_dim": 4096,
"hidden_size": 1024,
"initializer_factor": 0.02,
"layerdrop": 0.0,
"max_position_embeddings": 2048,
"model_type": "musicgen_decoder",
"num_attention_heads": 16,
"num_codebooks": 4,
"num_hidden_layers": 24,
"pad_token_id": 2048,
"scale_embedding": false,
"tie_word_embeddings": false,
"transformers_version": "4.46.2",
"use_cache": true,
"vocab_size": 2048
}

generation_config.json: 100% 224/224 [00:00<00:00, 16.1kB/s]


We can then place the model on our accelerator device (if available), or leave it on the CPU otherwise:

import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"


model.to(device);
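
If GPU memory is tight, the checkpoint can also be loaded in half precision. This is an optional variation of my own, not part of the original notebook, using the standard torch_dtype argument of from_pretrained:

import torch
from transformers import MusicgenForConditionalGeneration

# Optional (assumption, not from the original notebook): load the weights in
# float16 to roughly halve GPU memory usage; only worthwhile on a GPU.
if torch.cuda.is_available():
    model = MusicgenForConditionalGeneration.from_pretrained(
        "facebook/musicgen-small", torch_dtype=torch.float16
    ).to(device)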

Generation

MusicGen is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly better results than
greedy decoding, so we encourage using sampling mode where possible. Sampling is enabled by default, and can be explicitly specified by setting
do_sample=True in the call to MusicgenForConditionalGeneration.generate (see below).

Unconditional Generation
The inputs for unconditional (or 'null') generation can be obtained through the method
MusicgenForConditionalGeneration.get_unconditional_inputs . We can then run auto-regressive generation using the .generate method,
specifying do_sample=True to enable sampling mode:

unconditional_inputs = model.get_unconditional_inputs(num_samples=1)

audio_values = model.generate(**unconditional_inputs, do_sample=True, max_new_tokens=256)

`torch.nn.functional.scaled_dot_product_attention` does not support having an empty attention mask. Falling back to the manual a

The argument max_new_tokens specifies the number of new tokens to generate. As a rule of thumb, you can work out the length of the
generated audio sample in seconds by using the frame rate of the EnCodec model:


audio_length_in_s = 256 / model.config.audio_encoder.frame_rate

audio_length_in_s  # ≈ 5.12 s here: the 32 kHz EnCodec model has a frame rate of 50 Hz
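
If you want to keep the unconditional sample outside the notebook, one option is to write it to a WAV file. The filename and the use of scipy below are my own additions, not part of the original notebook:

import scipy.io.wavfile

# playback/sample rate comes from the EnCodec audio encoder config (32 kHz here)
sampling_rate = model.config.audio_encoder.sampling_rate

# audio_values has shape (batch_size, num_channels, sequence_length)
scipy.io.wavfile.write("musicgen_unconditional.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())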

Text-Conditional Generation


The model can generate an audio sample conditioned on a text prompt through use of the MusicgenProcessor to pre-process the inputs. The
pre-processed inputs can then be passed to the .generate method to generate text-conditional audio samples. Again, we enable sampling
mode by setting do_sample=True :

from transformers import AutoProcessor
from IPython.display import Audio

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")

# playback sampling rate for the generated audio, taken from the EnCodec audio encoder config
sampling_rate = model.config.audio_encoder.sampling_rate

inputs = processor(
    text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(**inputs.to(device), do_sample=True, guidance_scale=3, max_new_tokens=256)

Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)


preprocessor_config.json: 100% 275/275 [00:00<00:00, 19.6kB/s]

tokenizer_config.json: 100% 2.37k/2.37k [00:00<00:00, 180kB/s]

spiece.model: 100% 792k/792k [00:00<00:00, 7.98MB/s]

tokenizer.json: 100% 2.42M/2.42M [00:00<00:00, 30.8MB/s]

special_tokens_map.json: 100% 2.20k/2.20k [00:00<00:00, 116kB/s]

[audio player output: ~5 s generated sample]

The guidance_scale is used in classifier-free guidance (CFG), setting the weighting between the conditional logits (predicted from
the text prompts) and the unconditional logits (predicted from an unconditional or 'null' prompt). A higher guidance scale encourages
the model to generate samples that are more closely linked to the input prompt, usually at the expense of poorer audio quality. CFG is enabled
by setting guidance_scale > 1. For best results, use guidance_scale=3 (the default) for both text- and audio-conditional generation.
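
As a rough sketch of what CFG does (a simplification for intuition only, not the exact transformers implementation), the two sets of logits are mixed using the guidance scale as the weight:

import torch

# simplified illustration: extrapolate from the unconditional logits towards
# the conditional logits by a factor of guidance_scale
def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor, guidance_scale: float = 3.0) -> torch.Tensor:
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)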

Audio-Prompted Generation


The same MusicgenProcessor can be used to pre-process an audio prompt that is used for audio continuation. In the following example, we
load an audio file using the Datasets library, pre-process it using the processor class, and then forward the inputs to the model for generation:

from datasets import load_dataset

dataset = load_dataset("sanchit-gandhi/gtzan", split="train", streaming=True)

sample = next(iter(dataset))["audio"]

# take the first half of the audio sample
sample["array"] = sample["array"][: len(sample["array"]) // 2]

inputs = processor(
    audio=sample["array"],
    sampling_rate=sample["sampling_rate"],
    text=["80s blues track with groovy saxophone"],
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(**inputs.to(device), do_sample=True, guidance_scale=3, max_new_tokens=256)

Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)

README.md: 100% 703/703 [00:00<00:00, 44.5kB/s]

[audio player output: ~20 s sample (audio prompt plus generated continuation)]

To demonstrate batched audio-prompted generation, we'll slice our sample audio at two different proportions to give two audio samples of
different lengths. Since the input audio prompts vary in length, they will be padded to the length of the longest audio sample in the batch before
being passed to the model.

To recover the final audio samples, the audio_values generated can be post-processed to remove padding by using the processor class once
again:

sample = next(iter(dataset))["audio"]

# take the first quarter of the audio sample
sample_1 = sample["array"][: len(sample["array"]) // 4]

# take the first half of the audio sample
sample_2 = sample["array"][: len(sample["array"]) // 2]

inputs = processor(
    audio=[sample_1, sample_2],
    sampling_rate=sample["sampling_rate"],
    text=["80s blues track with groovy saxophone", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(**inputs.to(device), do_sample=True, guidance_scale=3, max_new_tokens=256)

# post-process to remove padding from the batched audio
audio_values = processor.batch_decode(audio_values, padding_mask=inputs.padding_mask)

Audio(audio_values[0], rate=sampling_rate)

[audio player output: ~12 s sample]

Generation Config


The default parameters that control the generation process, such as sampling, guidance scale and number of generated tokens, can be found
in the model's generation config, and updated as desired. Let's first inspect the default generation config:

model.generation_config

GenerationConfig {
"bos_token_id": 2048,
"decoder_start_token_id": 2048,
"do_sample": true,
"guidance_scale": 3.0,
"max_length": 1500,
"pad_token_id": 2048
}

Alright! We see that the model defaults to using sampling mode ( do_sample=True ), a guidance scale of 3, and a maximum generation length of
1500 (which is equivalent to 30s of audio). You can update any of these attributes to change the default generation parameters:


# increase the guidance scale to 4.0
model.generation_config.guidance_scale = 4.0
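
Alternatively, generation parameters can be overridden for a single call by passing them directly to .generate; arguments supplied at call time take precedence over the generation config:

# one-off override: the generation config itself is left unchanged
audio_values = model.generate(**inputs.to(device), do_sample=True, guidance_scale=4.0, max_new_tokens=256)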

