Fix the issue that the csm model cannot work in pipeline mode. #39349


Open — wants to merge 13 commits into main

Conversation

@yuanwu2017 (Contributor) commented Jul 11, 2025

What does this PR do?

Fix the issue that the csm model (text-to-audio) cannot work in pipeline mode.

Example:

import torch
import soundfile as sf
from transformers import pipeline

device = "hpu"
if torch.cuda.is_available():
    device = "cuda"

print(f"device: {device}")
tts_pipeline = pipeline(
    "text-to-audio",
    model="sesame/csm-1b",
    device=device,
)
generate_kwargs = {
    "output_audio": True,
}

text = "[0]Hello from Sesame."  # `[0]` is the speaker id

audio_output = tts_pipeline(text, generate_kwargs=generate_kwargs)
print(f"audio_output: {audio_output}")

# Keep the waveform as-is if it is already 1-D, otherwise take the first item of the batch.
audio = (
    audio_output["audio"]
    if isinstance(audio_output["audio"], torch.Tensor) and audio_output["audio"].ndim == 1
    else audio_output["audio"][0]
)
sf.write("example_with_pipeline.wav", audio, samplerate=audio_output["sampling_rate"])

Before patch: [screenshot]
After patch: [screenshot]

Batch inference with the original example:

import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "sesame/csm-1b"

device = "hpu"
if torch.cuda.is_available():
    device = "cuda"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
text = ["[0]Hello from Sesame.", "[1]Open sesame."]  # `[0]` for speaker id 0
inputs = processor(text, add_special_tokens=True).to(device)

audio = model.generate(**inputs, output_audio=True)
print(f"audio: {audio}")
processor.save_audio(audio, ["1.wav", "2.wav"])

Result: [screenshot]

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Signed-off-by: yuanwu <yuan.wu@intel.com>
@ydshieh ydshieh requested a review from eustlb July 11, 2025 08:47
@ydshieh (Collaborator) commented Jul 11, 2025

Hi @yuanwu2017, thank you for this PR. Let's add some description in "What does this PR do?" 🙏

@yuanwu2017 (Contributor, Author)

I have added it. Thanks.

@yuanwu2017 (Contributor, Author)

@ydshieh @eustlb Please help review.

@eustlb (Contributor) left a comment

Thanks a lot 🤗
Do we need the codec batching here? It should be in another PR

Comment on lines 188 to 191

# If it's a torch tensor with more than one dimension, convert it to a list of tensors
if is_torch_tensor(audio) and len(audio.shape) > 1:
    return list(audio)

Contributor:

We don't need this; isn't that exactly what is done below?

Contributor (Author):

As described below, the pipeline requires the return value of generate to be a tensor, so it needs to be converted to a list here so that the subsequent processor.save_audio call can work normally.
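For illustration, a minimal stand-alone sketch of that conversion (the shapes and values below are made up, not taken from the model):

import torch

# Stand-in for what generate returns on the pipeline path: one batched tensor
# of shape (batch, num_samples) instead of a list of 1-D waveforms.
batched_audio = torch.randn(2, 24000)

# Splitting it back into per-sample waveforms gives the form processor.save_audio expects.
audio_list = list(batched_audio)
assert all(waveform.ndim == 1 for waveform in audio_list)
print([waveform.shape for waveform in audio_list])  # [torch.Size([24000]), torch.Size([24000])]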

Comment on lines 471 to 499

Removed:

# =======================================
# TODO: @eustlb, this should be batched !!!
# but requires making sure batched inference of the codec model works as intended
for audio_codes_batch in generated_audio_codes:
    eos_idxs = (audio_codes_batch == self.config.codebook_eos_token_id).all(dim=-1).nonzero()
    if eos_idxs.numel() != 0:
        cutoff_idx = eos_idxs.min()
    else:
        cutoff_idx = audio_codes_batch.shape[0]

    audio_codes_batch = audio_codes_batch[:cutoff_idx]
    codec_decode_output = self.codec_model.decode(audio_codes_batch.transpose(0, 1).unsqueeze(0))
    audio.append(codec_decode_output.audio_values[0, 0])
# =======================================

Added:

generated_audio_codes = generate_output.sequences if generate_returned_dict else generate_output

# Find EOS positions for all batches at once
eos_mask = (generated_audio_codes == self.config.codebook_eos_token_id).all(dim=-1)
eos_positions = torch.zeros(
    generated_audio_codes.shape[0], dtype=torch.long, device=generated_audio_codes.device
)

for i in range(generated_audio_codes.shape[0]):
    eos_idxs = eos_mask[i].nonzero()
    if eos_idxs.numel() != 0:
        eos_positions[i] = eos_idxs.min()
    else:
        eos_positions[i] = generated_audio_codes.shape[1]

# Create a mask for valid positions
max_len = eos_positions.max().item()
valid_mask = torch.arange(max_len, device=generated_audio_codes.device).unsqueeze(0) < eos_positions.unsqueeze(1)

# Truncate and pad audio codes
truncated_codes = generated_audio_codes[:, :max_len]
masked_codes = truncated_codes * valid_mask.unsqueeze(-1)

# Decode all batches at once
transposed_codes = masked_codes.transpose(1, 2)
codec_decode_output = self.codec_model.decode(transposed_codes)
audio = codec_decode_output.audio_values.squeeze(1)
Contributor:

Is this necessary for this PR (I mean, for enabling pipeline usage)?
The codec is not run batched for the moment because there is no equivalence yet between batched and sequential inference for Mimi. I would expect this to break the slow tests. This should be in another PR/issue.

Contributor (Author):

Part of it may be needed because the pipeline requires the return value of generate to be a tensor; otherwise the postprocessing raises an error. See run_single and the postprocess step of the text-to-audio pipeline:

def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):

output_dict["audio"] = waveform.to(device="cpu", dtype=torch.float).numpy()

Contributor (Author):

OK, I can split this patch into two.

Signed-off-by: yuanwu <yuan.wu@intel.com>
Signed-off-by: yuanwu <yuan.wu@intel.com>
@yuanwu2017 (Contributor, Author)

@eustlb I have removed the codec batching. Please help review.

@eustlb (Contributor) left a comment

One last iteration and we'll be good to merge 🤗

@@ -68,7 +68,7 @@ class CsmGenerateOutput(GenerateDecoderOnlyOutput):
             The generated audio.
     """

-    audio: Optional[list[torch.Tensor]] = None
+    waveform: Optional[list[torch.Tensor]] = None
Contributor:

Won't merge this as it is breaking; can you instead update postprocess in the text_to_audio.py pipeline with something like:

if self.model.config.model_type == "csm":
    waveform_key = "audio"
else:
    waveform_key = "waveform"

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
@yuanwu2017 (Contributor, Author)

Done.

Signed-off-by: yuanwu <yuan.wu@intel.com>
@yao-matrix (Contributor)

@yuanwu2017, please run `make style` to make the style check pass.

@yuanwu2017 (Contributor, Author)

Done.

Signed-off-by: yuanwu <yuan.wu@intel.com>
@yuanwu2017 (Contributor, Author)

@eustlb @ydshieh Please help review. I didn't find any errors related to this patch.
[screenshot]

@ydshieh (Collaborator) commented Aug 8, 2025

Hi @yuanwu2017

I will let @eustlb finalize the review.

From the screenshot you shared, I saw that in tests/pipelines/test_pipelines_text_to_audio.py the csm model is not used in the tests.
Would you like to add one there?

@yuanwu2017 (Contributor, Author)

Added the slow tag. [screenshots]
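For reference, a sketch of the kind of slow pipeline test that could live in tests/pipelines/test_pipelines_text_to_audio.py; the test name and assertions are assumptions, not necessarily the test added in this PR:

import unittest

from transformers import pipeline
from transformers.testing_utils import require_torch, slow


class CsmTextToAudioPipelineTest(unittest.TestCase):
    @slow
    @require_torch
    def test_csm_text_to_audio(self):
        tts = pipeline("text-to-audio", model="sesame/csm-1b")
        outputs = tts("[0]Hello from Sesame.", generate_kwargs={"output_audio": True})
        self.assertIn("audio", outputs)
        self.assertIn("sampling_rate", outputs)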

@yuanwu2017 (Contributor, Author)

Replying to @ydshieh's suggestion to add a csm test in tests/pipelines/test_pipelines_text_to_audio.py:

Done.

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

Signed-off-by: yuanwu <yuanwu@habana.ai>