Fix the issue that the csm model cannot work in pipeline mode. #39349


Open — wants to merge 13 commits into main

Conversation

@yuanwu2017 (Contributor) commented Jul 11, 2025

What does this PR do?

Fix the issue that the csm model (text-to-audio) cannot work in pipeline mode.

Example:

import torch
import soundfile as sf
from transformers import pipeline

device = "hpu"
if torch.cuda.is_available():
    device = "cuda"

print(f"device: {device}")
tts_pipeline = pipeline(
    "text-to-audio",
    model="sesame/csm-1b",
    device=device,
)
generate_kwargs = {
    "output_audio": True,
}

text = "[0]Hello from Sesame."  # `[0]` is the speaker id

audio_output = tts_pipeline(text, generate_kwargs=generate_kwargs)
print(f"audio_output: {audio_output}")

# Keep the waveform as-is if it is already 1-D, otherwise take the first item of the batch.
audio = (
    audio_output["audio"]
    if isinstance(audio_output["audio"], torch.Tensor) and audio_output["audio"].ndim == 1
    else audio_output["audio"][0]
)
sf.write("example_with_pipeline.wav", audio, samplerate=audio_output["sampling_rate"])

Before patch: [screenshot]
After patch: [screenshot]

Batch inference with the original example:

import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "sesame/csm-1b"

device = "hpu"
if torch.cuda.is_available():
    device = "cuda"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
text = ["[0]Hello from Sesame.", "[1]Open sesame."]  # `[0]` for speaker id 0
inputs = processor(text, add_special_tokens=True).to(device)

audio = model.generate(**inputs, output_audio=True)
print(f"audio: {audio}")
processor.save_audio(audio, ["1.wav", "2.wav"])

Result: [screenshot]

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Signed-off-by: yuanwu <yuan.wu@intel.com>
@ydshieh ydshieh requested a review from eustlb July 11, 2025 08:47
@ydshieh (Collaborator) commented Jul 11, 2025

Hi @yuanwu2017, thank you for this PR. Let's add some description in "What does this PR do?" 🙏

@yuanwu2017 (Contributor, Author)

I have added it. Thanks.

@yuanwu2017 (Contributor, Author)

@ydshieh @eustlb Please help review.

@eustlb (Contributor) left a comment

Thanks a lot 🤗
Do we need the codec batching here? It should be in another PR

Comment on lines 188 to 191

# If it's a torch tensor with more than one dimension, convert it to a list of tensors
if is_torch_tensor(audio) and len(audio.shape) > 1:
    return list(audio)

Contributor:

We don't need this; isn't that exactly what is done below?

Contributor (Author):

As described below, the pipeline requires the return value of generate to be a tensor, so it needs to be converted to a list here so that the subsequent processor.save_audio call can work normally.
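For illustration, a minimal stand-alone sketch of that conversion (the shapes and values below are made up, not taken from the model):

import torch

# Stand-in for what generate returns on the pipeline path: one batched tensor
# of shape (batch, num_samples) instead of a list of 1-D waveforms.
batched_audio = torch.randn(2, 24000)

# Splitting it back into per-sample waveforms gives the form processor.save_audio expects.
audio_list = list(batched_audio)
assert all(waveform.ndim == 1 for waveform in audio_list)
print([waveform.shape for waveform in audio_list])  # [torch.Size([24000]), torch.Size([24000])]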

Comment on lines 471 to 499

Removed:

# =======================================
# TODO: @eustlb, this should be batched !!!
# but requires making sure batched inference of the codec model works as intended
for audio_codes_batch in generated_audio_codes:
    eos_idxs = (audio_codes_batch == self.config.codebook_eos_token_id).all(dim=-1).nonzero()
    if eos_idxs.numel() != 0:
        cutoff_idx = eos_idxs.min()
    else:
        cutoff_idx = audio_codes_batch.shape[0]

    audio_codes_batch = audio_codes_batch[:cutoff_idx]
    codec_decode_output = self.codec_model.decode(audio_codes_batch.transpose(0, 1).unsqueeze(0))
    audio.append(codec_decode_output.audio_values[0, 0])
# =======================================

Added:

generated_audio_codes = generate_output.sequences if generate_returned_dict else generate_output

# Find EOS positions for all batches at once
eos_mask = (generated_audio_codes == self.config.codebook_eos_token_id).all(dim=-1)
eos_positions = torch.zeros(
    generated_audio_codes.shape[0], dtype=torch.long, device=generated_audio_codes.device
)

for i in range(generated_audio_codes.shape[0]):
    eos_idxs = eos_mask[i].nonzero()
    if eos_idxs.numel() != 0:
        eos_positions[i] = eos_idxs.min()
    else:
        eos_positions[i] = generated_audio_codes.shape[1]

# Create a mask for valid positions
max_len = eos_positions.max().item()
valid_mask = torch.arange(max_len, device=generated_audio_codes.device).unsqueeze(0) < eos_positions.unsqueeze(1)

# Truncate and pad audio codes
truncated_codes = generated_audio_codes[:, :max_len]
masked_codes = truncated_codes * valid_mask.unsqueeze(-1)

# Decode all batches at once
transposed_codes = masked_codes.transpose(1, 2)
codec_decode_output = self.codec_model.decode(transposed_codes)
audio = codec_decode_output.audio_values.squeeze(1)
Contributor:

Is this necessary for this PR (I mean, for enabling pipeline usage)?
The codec is not run batched for the moment because there is no equivalence yet between batched and sequential inference for Mimi. I would expect this to break the slow tests. This should be in another PR/issue.

Contributor (Author):

Part of it may be needed because the pipeline requires the return value of generate to be a tensor; otherwise the postprocessing raises an error. See run_single and the postprocess step of the text-to-audio pipeline:

def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):

output_dict["audio"] = waveform.to(device="cpu", dtype=torch.float).numpy()

Contributor (Author):

OK, I can split this patch into two.

Signed-off-by: yuanwu <yuan.wu@intel.com>
Signed-off-by: yuanwu <yuan.wu@intel.com>
@yuanwu2017 (Contributor, Author)

@eustlb I have removed the codec batching. Please help review.

@eustlb (Contributor) left a comment

One last iteration and we'll be good to merge 🤗

@@ -68,7 +68,7 @@ class CsmGenerateOutput(GenerateDecoderOnlyOutput):
             The generated audio.
     """

-    audio: Optional[list[torch.Tensor]] = None
+    waveform: Optional[list[torch.Tensor]] = None
Contributor:

Won't merge this as it is breaking; can you instead update postprocess in the text_to_audio.py pipeline with something like:

if self.model.config.model_type == "csm":
    waveform_key = "audio"
else:
    waveform_key = "waveform"

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
@yuanwu2017 (Contributor, Author)

Done.

Signed-off-by: yuanwu <yuan.wu@intel.com>
@yao-matrix (Contributor)

@yuanwu2017, please run `make style` to make the style check pass.

@yuanwu2017 (Contributor, Author)

Done.

Signed-off-by: yuanwu <yuan.wu@intel.com>
@yuanwu2017 (Contributor, Author)

@eustlb @ydshieh Please help review. I didn't find any errors related to this patch.
[screenshot]

@ydshieh (Collaborator) commented Aug 8, 2025

Hi @yuanwu2017

I will let @eustlb finalize the review.

From the screenshot you shared, I saw that in tests/pipelines/test_pipelines_text_to_audio.py the csm model is not used in the tests.
Would you like to add one there?

@yuanwu2017 (Contributor, Author)

Added the slow tag. [screenshots]
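For reference, a sketch of the kind of slow pipeline test that could live in tests/pipelines/test_pipelines_text_to_audio.py; the test name and assertions are assumptions, not necessarily the test added in this PR:

import unittest

from transformers import pipeline
from transformers.testing_utils import require_torch, slow


class CsmTextToAudioPipelineTest(unittest.TestCase):
    @slow
    @require_torch
    def test_csm_text_to_audio(self):
        tts = pipeline("text-to-audio", model="sesame/csm-1b")
        outputs = tts("[0]Hello from Sesame.", generate_kwargs={"output_audio": True})
        self.assertIn("audio", outputs)
        self.assertIn("sampling_rate", outputs)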

@yuanwu2017 (Contributor, Author)

Replying to @ydshieh's suggestion to add a csm test in tests/pipelines/test_pipelines_text_to_audio.py:

Done.

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

Signed-off-by: yuanwu <yuanwu@habana.ai>