Update wavlm.md to match new model card template #40047
base: main
Conversation
Thanks for your contribution!
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

[WavLM](https://huggingface.co/papers/2110.13900) is a self-supervised speech representation model from Microsoft designed to work across the “full stack” of speech tasks, from automatic speech recognition (ASR) to speaker diarization and audio event detection. It builds on HuBERT’s masked prediction approach but introduces denoising and data augmentation to make the learned representations more robust in noisy and multi-speaker conditions.
Suggested change:

[WavLM](https://huggingface.co/papers/2110.13900) is a self-supervised speech representation model designed to work across the “full stack” of speech tasks, from automatic speech recognition (ASR) to speaker diarization and audio event detection. It builds on [HuBERT's](./hubert) masked prediction approach. It introduces gated relative position bias for better recognition accuracy and an unsupervised utterance-mixing strategy to improve speaker discrimination.
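For a concrete sense of what these learned representations are, a minimal sketch is below. It is only an illustration, not part of the suggested text: the `microsoft/wavlm-base-plus` checkpoint and the `WavLMModel`/`AutoFeatureExtractor` usage are assumptions on my part.

```python
# Hedged sketch: extract frame-level speech representations from a base (not fine-tuned) WavLM.
# "microsoft/wavlm-base-plus" is assumed to be a suitable self-supervised checkpoint.
import torch
from transformers import AutoFeatureExtractor, WavLMModel
from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")

inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# one hidden vector per ~20 ms audio frame; these features feed downstream heads (ASR, diarization, ...)
print(outputs.last_hidden_state.shape)  # (batch_size, num_frames, hidden_size)
```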
## Overview

You can find all the original WavLM checkpoints under the [WavLM](https://huggingface.co/models?other=wavlm) collection.
Suggested change:

You can find all the original WavLM checkpoints under the [Microsoft](https://huggingface.co/microsoft/models?search=wavlm) organization.
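If it helps to enumerate those checkpoints programmatically, a small sketch using the `huggingface_hub` client (an assumption here, not something the model card requires) could look like this:

```python
# List WavLM checkpoints hosted under the microsoft organization on the Hub.
# Assumes the huggingface_hub client is installed; the result set can change over time.
from huggingface_hub import list_models

for model in list_models(author="microsoft", search="wavlm"):
    print(model.id)
```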
The abstract from the paper is the following:

The example below demonstrates how to extract audio features with [`Pipeline`] or the [`AutoModel`] class.
Suggested change:

The example below demonstrates how to automatically transcribe speech into text with [`Pipeline`] or the [`AutoModel`] class.
Relevant checkpoints can be found under https://huggingface.co/models?other=wavlm.

```python
import torch
from transformers import pipeline

pipeline = pipeline(
    task="automatic-speech-recognition",
    model="patrickvonplaten/wavlm-libri-clean-100h-base-plus",
    torch_dtype=torch.float16,
    device=0
)
pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
```
## Resources

```python
from transformers import AutoProcessor, AutoModelForCTC
from datasets import load_dataset
import torch

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

processor = AutoProcessor.from_pretrained("patrickvonplaten/wavlm-libri-clean-100h-base-plus")
model = AutoModelForCTC.from_pretrained("patrickvonplaten/wavlm-libri-clean-100h-base-plus", torch_dtype=torch.float16)

# audio file is decoded on the fly
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
# cast the features to the model's float16 weights to avoid a dtype mismatch at inference
inputs["input_values"] = inputs["input_values"].to(model.dtype)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)

# transcribe speech
transcription = processor.batch_decode(predicted_ids)
transcription[0]
```
</hfoption>
</hfoptions>
Quantization reduces the memory burden of large models by representing the weights in a lower precision. |
The model is not that large, so we can remove the Quantization section.
## Notes
- WavLM processes raw 16kHz audio waveforms provided as 1D float arrays. Use `Wav2Vec2Processor` for preprocessing. |
Suggested change:

- WavLM processes raw 16kHz audio waveforms provided as 1D float arrays. Use [`Wav2Vec2Processor`] for preprocessing.
- For CTC-based fine-tuning, model outputs should be decoded with `Wav2Vec2CTCTokenizer`. |
Suggested change:

- For CTC-based fine-tuning, model outputs should be decoded with [`Wav2Vec2CTCTokenizer`].
- The model works particularly well for tasks like speaker verification, identification, and diarization.
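Since the note above calls out speaker verification, here is a minimal sketch of that use case. It is a hedged illustration: `WavLMForXVector` is the x-vector head in Transformers, and `microsoft/wavlm-base-plus-sv` is assumed to be a suitable speaker-verification checkpoint.

```python
# Hedged sketch: speaker verification with an x-vector head on top of WavLM.
# "microsoft/wavlm-base-plus-sv" is assumed to be a speaker-verification checkpoint;
# swap in whichever checkpoint you actually use.
import torch
from transformers import AutoFeatureExtractor, WavLMForXVector
from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv")

# embed two utterances and compare them with cosine similarity
audio = [dataset[0]["audio"]["array"], dataset[1]["audio"]["array"]]
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    embeddings = model(**inputs).embeddings
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)

similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=-1)
print(float(similarity))  # closer to 1.0 means the utterances likely come from the same speaker
```

A threshold on the cosine similarity, tuned on a validation set, then decides whether two utterances come from the same speaker.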
## Resources

- [Audio classification task guide](https://huggingface.co/docs/transformers/en/tasks/audio_classification)
- [Automatic speech recognition task guide](https://huggingface.co/docs/transformers/en/tasks/asr)
What does this PR do?
This PR updates the WavLM model card to comply with the format introduced in #36979.
Before submitting
Who can review?
@stevhliu
Notes