Update wavlm.md to match new model card template #40047

Open

wants to merge 1 commit into main

Conversation

reedrya
Contributor

@reedrya commented Aug 8, 2025

What does this PR do?

This PR updates the WavLM model card to comply with the format introduced in #36979.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).

Who can review?

@stevhliu

Notes

  • I did not include the AttentionMaskVisualizer section since I'm unfamiliar with it. Please advise if it should be added.

Member

@stevhliu left a comment


Thanks for your contribution!

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
[WavLM](https://huggingface.co/papers/2110.13900) is a self-supervised speech representation model from Microsoft designed to work across the “full stack” of speech tasks, from automatic speech recognition (ASR) to speaker diarization and audio event detection. It builds on HuBERT’s masked prediction approach but introduces denoising and data augmentation to make the learned representations more robust in noisy and multi-speaker conditions.
Member


Suggested change
[WavLM](https://huggingface.co/papers/2110.13900) is a self-supervised speech representation model from Microsoft designed to work across the “full stack” of speech tasks, from automatic speech recognition (ASR) to speaker diarization and audio event detection. It builds on HuBERT’s masked prediction approach but introduces denoising and data augmentation to make the learned representations more robust in noisy and multi-speaker conditions.
[WavLM](https://huggingface.co/papers/2110.13900) is a self-supervised speech representation model designed to work across the “full stack” of speech tasks, from automatic speech recognition (ASR) to speaker diarization and audio event detection. It builds on [HuBERT's](./hubert) masked prediction approach. It introduces gated relative position bias for better recognition accuracy and an unsupervised utterance-mixing strategy to improve speaker discrimination.
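
For context, a minimal sketch of pulling these self-supervised representations out of the model; the `microsoft/wavlm-base-plus` checkpoint and the silent dummy waveform are illustrative assumptions, not part of this card.

```python
# Minimal sketch: extract WavLM's self-supervised representations.
# Assumes the "microsoft/wavlm-base-plus" checkpoint and a dummy 16kHz waveform.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, WavLMModel

feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")

waveform = np.zeros(16000, dtype=np.float32)  # one second of silence at 16kHz
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (batch, frames, hidden_size)
print(hidden_states.shape)
```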


## Overview
You can find all the original WavLM checkpoints under the [WavLM](https://huggingface.co/models?other=wavlm) collection.
Member


Suggested change
You can find all the original WavLM checkpoints under the [WavLM](https://huggingface.co/models?other=wavlm) collection.
You can find all the original WavLM checkpoints under the [Microsoft](https://huggingface.co/microsoft/models?search=wavlm) organization.


The abstract from the paper is the following:
The example below demonstrates how to extract audio features with [`Pipeline`] or the [`AutoModel`] class.
Member


Suggested change
The example below demonstrates how to extract audio features with [`Pipeline`] or the [`AutoModel`] class.
The example below demonstrates how to automatically transcribe speech into text with [`Pipeline`] or the [`AutoModel`] class.


Relevant checkpoints can be found under https://huggingface.co/models?other=wavlm.
```python
import torch
from transformers import pipeline

pipeline = pipeline(
    task="automatic-speech-recognition",
    model="patrickvonplaten/wavlm-libri-clean-100h-base-plus",
    torch_dtype=torch.float16,
    device=0
)

pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
```

```python
import torch
from transformers import AutoProcessor, AutoModelForCTC
from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

processor = AutoProcessor.from_pretrained("patrickvonplaten/wavlm-libri-clean-100h-base-plus")
model = AutoModelForCTC.from_pretrained("patrickvonplaten/wavlm-libri-clean-100h-base-plus", torch_dtype=torch.float16)

# audio file is decoded on the fly; cast the inputs to float16 to match the model weights
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
inputs["input_values"] = inputs["input_values"].to(torch.float16)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)

# transcribe speech
transcription = processor.batch_decode(predicted_ids)
transcription[0]
```

</hfoption>
</hfoptions>

Quantization reduces the memory burden of large models by representing the weights in a lower precision.
Member


The model is not that large, so we can remove the Quantization section.


## Notes
- WavLM processes raw 16kHz audio waveforms provided as 1D float arrays. Use `Wav2Vec2Processor` for preprocessing.
Member


Suggested change
- WavLM processes raw 16kHz audio waveforms provided as 1D float arrays. Use `Wav2Vec2Processor` for preprocessing.
- WavLM processes raw 16kHz audio waveforms provided as 1D float arrays. Use [`Wav2Vec2Processor`] for preprocessing.
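
As a sketch of that preprocessing note, the snippet below resamples to 16kHz and feeds the 1D float array through [`Wav2Vec2Processor`]; the CTC checkpoint name is reused from the examples above, and resampling via `datasets.Audio` is just one convenient option.

```python
# Minimal sketch: turn raw audio into 16kHz 1D float arrays for WavLM with Wav2Vec2Processor.
from datasets import Audio, load_dataset
from transformers import Wav2Vec2Processor

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))  # make sure audio is 16kHz

processor = Wav2Vec2Processor.from_pretrained("patrickvonplaten/wavlm-libri-clean-100h-base-plus")

audio = dataset[0]["audio"]["array"]  # 1D float array
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
print(inputs.input_values.shape)  # (1, num_samples)
```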


- For CTC-based fine-tuning, model outputs should be decoded with `Wav2Vec2CTCTokenizer`.
Member


Suggested change
- For CTC-based fine-tuning, model outputs should be decoded with `Wav2Vec2CTCTokenizer`.
- For CTC-based fine-tuning, model outputs should be decoded with [`Wav2Vec2CTCTokenizer`].
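
A minimal sketch of that decoding step with [`Wav2Vec2CTCTokenizer`]; random stand-in logits are used so the snippet runs on its own, whereas in practice they come from a WavLM CTC head as in the example above.

```python
# Minimal sketch: decode CTC logits into text with Wav2Vec2CTCTokenizer.
import torch
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("patrickvonplaten/wavlm-libri-clean-100h-base-plus")

# stand-in logits of shape (batch, frames, vocab_size); in practice they come from WavLMForCTC
logits = torch.randn(1, 200, len(tokenizer))

predicted_ids = torch.argmax(logits, dim=-1)           # greedy per-frame prediction
transcription = tokenizer.batch_decode(predicted_ids)  # collapses repeats and removes CTC blanks
print(transcription[0])
```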

Comment on lines +101 to +105
- The model works particularly well for tasks like speaker verification, identification, and diarization.

## Resources
- [Audio classification task guide](https://huggingface.co/docs/transformers/en/tasks/audio_classification)
- [Automatic speech recognition task guide](https://huggingface.co/docs/transformers/en/tasks/asr)
Member


Suggested change
- The model works particularly well for tasks like speaker verification, identification, and diarization.
## Resources
- [Audio classification task guide](https://huggingface.co/docs/transformers/en/tasks/audio_classification)
- [Automatic speech recognition task guide](https://huggingface.co/docs/transformers/en/tasks/asr)
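
To illustrate the speaker-verification use mentioned in that note, here is a minimal sketch comparing two utterances with x-vector embeddings; the `microsoft/wavlm-base-plus-sv` checkpoint and the 0.86 cosine-similarity threshold are illustrative assumptions.

```python
# Minimal sketch: speaker verification with WavLM x-vector embeddings.
# Assumes the "microsoft/wavlm-base-plus-sv" checkpoint and a 0.86 similarity threshold.
import torch
from datasets import load_dataset
from transformers import AutoFeatureExtractor, WavLMForXVector

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")

feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv")

# embed two utterances and compare them with cosine similarity
audio = [dataset[i]["audio"]["array"] for i in (0, 1)]
inputs = feature_extractor(audio, sampling_rate=16000, padding=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).embeddings
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)

similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=-1)
print("same speaker" if similarity >= 0.86 else "different speakers")
```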
