Update wavlm.md to match new model card template #40047

Open

wants to merge 1 commit into main

Conversation

reedrya
Contributor

@reedrya commented Aug 8, 2025

What does this PR do?

This PR updates the WavLM model card to comply with the format introduced in #36979.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).

Who can review?

@stevhliu

Notes

  • I did not include the AttentionMaskVisualizer section since I'm unfamiliar with it. Please advise if it should be added.

Member

@stevhliu left a comment


Thanks for your contribution!

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
[WavLM](https://huggingface.co/papers/2110.13900) is a self-supervised speech representation model from Microsoft designed to work across the “full stack” of speech tasks, from automatic speech recognition (ASR) to speaker diarization and audio event detection. It builds on HuBERT’s masked prediction approach but introduces denoising and data augmentation to make the learned representations more robust in noisy and multi-speaker conditions.
Member


Suggested change
[WavLM](https://huggingface.co/papers/2110.13900) is a self-supervised speech representation model from Microsoft designed to work across the “full stack” of speech tasks, from automatic speech recognition (ASR) to speaker diarization and audio event detection. It builds on HuBERT’s masked prediction approach but introduces denoising and data augmentation to make the learned representations more robust in noisy and multi-speaker conditions.
[WavLM](https://huggingface.co/papers/2110.13900) is a self-supervised speech representation model designed to work across the “full stack” of speech tasks, from automatic speech recognition (ASR) to speaker diarization and audio event detection. It builds on [HuBERT's](./hubert) masked prediction approach. It introduces gated relative position bias for better recognition accuracy and an unsupervised utterance-mixing strategy to improve speaker discrimination.
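
For context, a minimal sketch of pulling these self-supervised representations out of the model; the `microsoft/wavlm-base-plus` checkpoint and the silent dummy waveform are illustrative assumptions, not part of this card.

```python
# Minimal sketch: extract WavLM's self-supervised representations.
# Assumes the "microsoft/wavlm-base-plus" checkpoint and a dummy 16kHz waveform.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, WavLMModel

feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")

waveform = np.zeros(16000, dtype=np.float32)  # one second of silence at 16kHz
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (batch, frames, hidden_size)
print(hidden_states.shape)
```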


## Overview
You can find all the original WavLM checkpoints under the [WavLM](https://huggingface.co/models?other=wavlm) collection.
Member


Suggested change
You can find all the original WavLM checkpoints under the [WavLM](https://huggingface.co/models?other=wavlm) collection.
You can find all the original WavLM checkpoints under the [Microsoft](https://huggingface.co/microsoft/models?search=wavlm) organization.


The abstract from the paper is the following:
The example below demonstrates how to extract audio features with [`Pipeline`] or the [`AutoModel`] class.
Member


Suggested change
The example below demonstrates how to extract audio features with [`Pipeline`] or the [`AutoModel`] class.
The example below demonstrates how to automatically transcribe speech into text with [`Pipeline`] or the [`AutoModel`] class.


Relevant checkpoints can be found under https://huggingface.co/models?other=wavlm.
```python
import torch
from transformers import pipeline

pipeline = pipeline(
    task="automatic-speech-recognition",
    model="patrickvonplaten/wavlm-libri-clean-100h-base-plus",
    torch_dtype=torch.float16,
    device=0
)

pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
```

```python
import torch
from transformers import AutoProcessor, AutoModelForCTC
from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

processor = AutoProcessor.from_pretrained("patrickvonplaten/wavlm-libri-clean-100h-base-plus")
model = AutoModelForCTC.from_pretrained("patrickvonplaten/wavlm-libri-clean-100h-base-plus", torch_dtype=torch.float16)

# audio file is decoded on the fly; cast the inputs to float16 to match the model weights
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
inputs["input_values"] = inputs["input_values"].to(torch.float16)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)

# transcribe speech
transcription = processor.batch_decode(predicted_ids)
transcription[0]
```

</hfoption>
</hfoptions>

Quantization reduces the memory burden of large models by representing the weights in a lower precision.
Member


The model is not that large, so we can remove the Quantization section.


## Notes
- WavLM processes raw 16kHz audio waveforms provided as 1D float arrays. Use `Wav2Vec2Processor` for preprocessing.
Member


Suggested change
- WavLM processes raw 16kHz audio waveforms provided as 1D float arrays. Use `Wav2Vec2Processor` for preprocessing.
- WavLM processes raw 16kHz audio waveforms provided as 1D float arrays. Use [`Wav2Vec2Processor`] for preprocessing.
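
As a sketch of that preprocessing note, the snippet below resamples to 16kHz and feeds the 1D float array through [`Wav2Vec2Processor`]; the CTC checkpoint name is reused from the examples above, and resampling via `datasets.Audio` is just one convenient option.

```python
# Minimal sketch: turn raw audio into 16kHz 1D float arrays for WavLM with Wav2Vec2Processor.
from datasets import Audio, load_dataset
from transformers import Wav2Vec2Processor

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))  # make sure audio is 16kHz

processor = Wav2Vec2Processor.from_pretrained("patrickvonplaten/wavlm-libri-clean-100h-base-plus")

audio = dataset[0]["audio"]["array"]  # 1D float array
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
print(inputs.input_values.shape)  # (1, num_samples)
```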


- For CTC-based fine-tuning, model outputs should be decoded with `Wav2Vec2CTCTokenizer`.
Member


Suggested change
- For CTC-based fine-tuning, model outputs should be decoded with `Wav2Vec2CTCTokenizer`.
- For CTC-based fine-tuning, model outputs should be decoded with [`Wav2Vec2CTCTokenizer`].
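
A minimal sketch of that decoding step with [`Wav2Vec2CTCTokenizer`]; random stand-in logits are used so the snippet runs on its own, whereas in practice they come from a WavLM CTC head as in the example above.

```python
# Minimal sketch: decode CTC logits into text with Wav2Vec2CTCTokenizer.
import torch
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("patrickvonplaten/wavlm-libri-clean-100h-base-plus")

# stand-in logits of shape (batch, frames, vocab_size); in practice they come from WavLMForCTC
logits = torch.randn(1, 200, len(tokenizer))

predicted_ids = torch.argmax(logits, dim=-1)           # greedy per-frame prediction
transcription = tokenizer.batch_decode(predicted_ids)  # collapses repeats and removes CTC blanks
print(transcription[0])
```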

Comment on lines +101 to +105
- The model works particularly well for tasks like speaker verification, identification, and diarization.

## Resources
- [Audio classification task guide](https://huggingface.co/docs/transformers/en/tasks/audio_classification)
- [Automatic speech recognition task guide](https://huggingface.co/docs/transformers/en/tasks/asr)
Member


Suggested change
- The model works particularly well for tasks like speaker verification, identification, and diarization.
## Resources
- [Audio classification task guide](https://huggingface.co/docs/transformers/en/tasks/audio_classification)
- [Automatic speech recognition task guide](https://huggingface.co/docs/transformers/en/tasks/asr)
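
To illustrate the speaker-verification use mentioned in that note, here is a minimal sketch comparing two utterances with x-vector embeddings; the `microsoft/wavlm-base-plus-sv` checkpoint and the 0.86 cosine-similarity threshold are illustrative assumptions.

```python
# Minimal sketch: speaker verification with WavLM x-vector embeddings.
# Assumes the "microsoft/wavlm-base-plus-sv" checkpoint and a 0.86 similarity threshold.
import torch
from datasets import load_dataset
from transformers import AutoFeatureExtractor, WavLMForXVector

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")

feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv")

# embed two utterances and compare them with cosine similarity
audio = [dataset[i]["audio"]["array"] for i in (0, 1)]
inputs = feature_extractor(audio, sampling_rate=16000, padding=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).embeddings
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)

similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=-1)
print("same speaker" if similarity >= 0.86 else "different speakers")
```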
