Standardize BARTpho model card: badges, new examples, fixed broken image section, and links (huggingface#36979) #40051
Conversation
Update bartpho.md
Hi @stevhliu, thank you for your dedication to maintaining high-quality documentation across the repository. I've standardized the BARTpho model card with several key improvements.
I'd appreciate your feedback on whether any additional elements would enhance the documentation further - specifically, would an architecture diagram improve clarity for users? I'm also open to any other suggestions you might have to make this model card even more valuable for the community. Looking forward to your review and any insights you can share! Best regards
Good start! Next time, make sure the code examples run and are aligned
<img alt="Hugging Face Model" src="https://img.shields.io/badge/Model%20Hub-BARTpho-blue"> | ||
<img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-green"> | ||
<img alt="Language" src="https://img.shields.io/badge/Language-Vietnamese-orange"> | ||
</div> |
Don't need to add these extras
[BARTpho](https://arxiv.org/abs/2109.09701) is the first large-scale, monolingual sequence-to-sequence model pre-trained exclusively for Vietnamese, developed by [VinAI Research](https://huggingface.co/vinai).
It’s based on the **BART** denoising autoencoder architecture, with adaptations from **mBART**, and comes in two variants — **word** and **syllable** — to handle the unique way Vietnamese uses whitespace.
Think of it like a supercharged summarizer and text generator that really “gets” Vietnamese — both at the word and syllable level.
Suggested change:
[BARTpho](https://huggingface.co/papers/2109.09701) is a large-scale Vietnamese sequence-to-sequence model. It offers a word-based and syllable-based version. This model is built on the [BART](./bart) large architecture with its denoising pretraining.
The abstract from the paper is the following:
You can find all official checkpoints in the [BARTpho collection](https://huggingface.co/collections/vinai/bartpho-66f8a74775316eaa77d59969). |
Suggested change:
You can find all the original checkpoints under the [VinAI](https://huggingface.co/vinai/models?search=bartpho) organization.
> [!TIP]
> This model was contributed by [VinAI Research](https://huggingface.co/vinai).
> Check out the `bartpho-word` and `bartpho-syllable` variants in the right sidebar for examples of summarization, punctuation restoration, and capitalization restoration.
Suggested change:
> [!TIP]
> This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen).
> Check out BARTpho in the right sidebar for examples of how to apply BART to different language tasks.
This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BARTpho).
The example below demonstrates how to run summarization with [`pipeline`] or load the model via [`AutoModel`].
Suggested change:
The example below demonstrates how to summarize text with [`Pipeline`] or the [`AutoModel`] class.
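For reference during review, here is a minimal sketch of what such an example might look like. This is not the final model card code: `vinai/bartpho-syllable` is an assumed checkpoint that is not fine-tuned for summarization, so the generated text is only illustrative, and the generation settings are arbitrary.

```python
# Hedged sketch, not the official model card example.
# Assumes the pretrained checkpoint "vinai/bartpho-syllable"; it is not
# fine-tuned for summarization, so outputs are illustrative only.
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

text = "Quang hợp là quá trình thu nhận và chuyển hóa năng lượng ánh sáng Mặt trời của thực vật."

# Pipeline usage
summarizer = pipeline("summarization", model="vinai/bartpho-syllable")
print(summarizer(text, max_length=40)[0]["summary_text"])

# AutoModel usage
tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
model = AutoModelForSeq2SeqLM.from_pretrained("vinai/bartpho-syllable")
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    summary_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```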
>>> input_ids = tokenizer(line, return_tensors="tf")
>>> features = bartpho(**input_ids)
```bash
transformers-cli download vinai/bartpho-word
```
echo -e "Quang tổng hợp hay gọi tắt là quang hợp là quá trình thu nhận và chuyển hóa năng lượng ánh sáng Mặt trời của thực vật,
tảo và một số vi khuẩn để tạo ra hợp chất hữu cơ phục vụ bản thân cũng như làm nguồn thức ăn cho hầu hết các sinh vật
trên Trái Đất. Quang hợp trong thực vật thường liên quan đến chất tố diệp lục màu xanh lá cây và tạo ra oxy như một sản phẩm phụ" | transformers run --task summarization --model vinai/bartpho-word --device 0
</hfoption>
</hfoptions>
Quantization reduces the memory footprint of large models by storing weights in lower precision. See the [BitsAndBytes Quantization guide](https://huggingface.co/docs/transformers/quantization/bitsandbytes) for details. |
Model's not that big, so we don't need a Quantization section
extracted from the pre-trained SentencePiece model "vocab_file" that is available from the multilingual XLM-RoBERTa.
Other languages, if employing this pre-trained multilingual SentencePiece model "vocab_file" for subword
segmentation, can reuse BartphoTokenizer with their own language-specialized "monolingual_vocab_file".
Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/attention_visualizer.py) to see what tokens the model attends to. |
Not supported for this model yet
## BartphoTokenizer
## Resources |
Remove this section
* **Preprocessing is non-negotiable**:
  * All inputs must undergo Vietnamese tone normalization.
  * For `bartpho-word` variants, text must also be segmented with [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) (a hedged sketch of this step follows the list).
* The model is trained on a 20GB corpus (~145M sentences), so domain-specific performance may vary.
* `bartpho-word` consistently outperforms `bartpho-syllable` on Vietnamese generative tasks.
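A hedged sketch of the VnCoreNLP word segmentation step mentioned above. It assumes the `py_vncorenlp` package, a Java runtime, and a local directory for the downloaded VnCoreNLP models; the directory path is a placeholder, not part of the original card.

```python
# Hedged sketch of VnCoreNLP word segmentation for bartpho-word inputs.
# Assumes `pip install py_vncorenlp` and an available Java runtime.
import py_vncorenlp

save_dir = "/absolute/path/to/vncorenlp"          # hypothetical local directory
py_vncorenlp.download_model(save_dir=save_dir)    # one-time download of the VnCoreNLP jar and models
segmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir=save_dir)

text = "Chúng tôi là những nghiên cứu viên."
# word_segment returns one string per sentence, with multi-syllable words joined by underscores
segmented = " ".join(segmenter.word_segment(text))
# `segmented` is the form you would feed to the bartpho-word tokenizer and model.
```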
```python
# Example: Masked language modeling with bartpho-syllable
from transformers import MBartForConditionalGeneration, AutoTokenizer
import torch

model = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")

TXT = "Chúng tôi là <mask> nghiên cứu viên."
input_ids = tokenizer(TXT, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits
# index the single sequence so nonzero() yields one position and .item() works
masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
predicted_ids = torch.topk(logits[0, masked_index], 5).indices
print(tokenizer.decode(predicted_ids, skip_special_tokens=True))
```
Suggested change:
- BARTpho uses the large architecture of BART with an additional layer-normalization layer on top of the encoder and decoder. The BART-specific classes should be replaced with the mBART-specific classes.
- This implementation only handles tokenization through the `monolingual_vocab_file` file. This is a Vietnamese-specific subset of token types taken from that multilingual vocabulary. If you want to use this tokenizer for another language, replace the `monolingual_vocab_file` with one specialized for your target language.
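A hedged sketch of that reuse: both file names below are hypothetical placeholders, where `vocab_file` would be the multilingual XLM-RoBERTa SentencePiece model and `monolingual_vocab_file` a vocabulary specialized for your target language.

```python
# Hedged sketch: reusing BartphoTokenizer for another language.
# Both file paths are hypothetical placeholders, not files shipped with the model.
from transformers import BartphoTokenizer

tokenizer = BartphoTokenizer(
    vocab_file="sentencepiece.bpe.model",             # multilingual XLM-RoBERTa SentencePiece model
    monolingual_vocab_file="dict.your_language.txt",  # your language-specialized vocabulary file
)
print(tokenizer.tokenize("replace me with text in your target language"))
```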
Updated bartpho.md
What does this PR do?
Fixes # (issue)
Standardize BARTpho model card
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Please review this @stevhliu