
Standardize BARTpho model card: badges, new examples, fixed broken im… #40051

Open · wants to merge 1 commit into main

Conversation

@eshwanthkartitr commented Aug 9, 2025

Updated bartpho.md

What does this PR do?

Fixes # (issue)
Standardize BARTpho model card

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Please review this, @stevhliu.

@eshwanthkartitr (Author) commented:

Hi @stevhliu,

Thank you for your dedication to maintaining high-quality documentation across the repository. I've standardized the BARTpho model card with several key improvements:

Key Changes:

  • Standardized model card badges for consistency with other model documentation
  • Added new practical examples to help users better understand implementation
  • Fixed broken links and images that were impacting user experience

I'd appreciate your feedback on whether any additional elements would enhance the documentation further. Specifically, would an architecture diagram improve clarity for users? I'm also open to any other suggestions you might have to make this model card even more valuable for the community.

Looking forward to your review and any insights you can share!

Best regards

@stevhliu (Member) left a comment:

Good start! Next time, make sure the code examples run and are aligned

Comment on lines +21 to +24
<img alt="Hugging Face Model" src="https://img.shields.io/badge/Model%20Hub-BARTpho-blue">
<img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-green">
<img alt="Language" src="https://img.shields.io/badge/Language-Vietnamese-orange">
</div>

Don't need to add these extras

Comment on lines +29 to +31
[BARTpho](https://arxiv.org/abs/2109.09701) is the first large-scale, monolingual sequence-to-sequence model pre-trained exclusively for Vietnamese, developed by [VinAI Research](https://huggingface.co/vinai).
It’s based on the **BART** denoising autoencoder architecture, with adaptations from **mBART**, and comes in two variants — **word** and **syllable** — to handle the unique way Vietnamese uses whitespace.
Think of it like a supercharged summarizer and text generator that really “gets” Vietnamese — both at the word and syllable level.

Suggested change
[BARTpho](https://arxiv.org/abs/2109.09701) is the first large-scale, monolingual sequence-to-sequence model pre-trained exclusively for Vietnamese, developed by [VinAI Research](https://huggingface.co/vinai).
It’s based on the **BART** denoising autoencoder architecture, with adaptations from **mBART**, and comes in two variants — **word** and **syllable** — to handle the unique way Vietnamese uses whitespace.
Think of it like a supercharged summarizer and text generator that really “gets” Vietnamese — both at the word and syllable level.
[BARTpho](https://huggingface.co/papers/2109.09701) is a large-scale Vietnamese sequence-to-sequence model. It offers a word-based and syllable-based version. This model is built on the [BART](./bart) large architecture with its denoising pretraining.


The abstract from the paper is the following:
You can find all official checkpoints in the [BARTpho collection](https://huggingface.co/collections/vinai/bartpho-66f8a74775316eaa77d59969).

Suggested change
You can find all official checkpoints in the [BARTpho collection](https://huggingface.co/collections/vinai/bartpho-66f8a74775316eaa77d59969).
You can find all the original checkpoints under the [VinAI](https://huggingface.co/vinai/models?search=bartpho) organization.

Comment on lines +35 to +37
> \[!TIP]
> This model was contributed by [VinAI Research](https://huggingface.co/vinai).
> Check out the `bartpho-word` and `bartpho-syllable` variants in the right sidebar for examples of summarization, punctuation restoration, and capitalization restoration.

Suggested change
> \[!TIP]
> This model was contributed by [VinAI Research](https://huggingface.co/vinai).
> Check out the `bartpho-word` and `bartpho-syllable` variants in the right sidebar for examples of summarization, punctuation restoration, and capitalization restoration.
> [!TIP]
> This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen).
> Check out the BARTpho in the right sidebar for examples of how to apply BART to different language tasks.


This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BARTpho).
The example below demonstrates how to run summarization with \[`pipeline`] or load the model via \[`AutoModel`].

Suggested change
The example below demonstrates how to run summarization with \[`pipeline`] or load the model via \[`AutoModel`].
The example below demonstrates how to summarize text with [`Pipeline`] or the [`AutoModel`] class.
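
For reference, a minimal sketch of the `Pipeline` path that sentence refers to. This is not code from the diff: the checkpoint choice and generation settings are my assumptions, and the base `vinai/bartpho-word` checkpoint is only pretrained, not fine-tuned for summarization, so output quality is illustrative at best.

```python
# Minimal sketch (not part of this PR): summarization via the Pipeline API.
# Assumes the base "vinai/bartpho-word" checkpoint, which is pretrained rather
# than fine-tuned for summarization, so results are illustrative only.
from transformers import pipeline

summarizer = pipeline(task="summarization", model="vinai/bartpho-word")
text = (
    "Quang tổng hợp hay gọi tắt là quang hợp là quá trình thu nhận và chuyển hóa "
    "năng lượng ánh sáng Mặt trời của thực vật, tảo và một số vi khuẩn."
)
print(summarizer(text, max_new_tokens=60)[0]["summary_text"])
```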

>>> input_ids = tokenizer(line, return_tensors="tf")
>>> features = bartpho(**input_ids)
```bash
transformers-cli download vinai/bartpho-word
```

```bash
echo -e "Quang tổng hợp hay gọi tắt là quang hợp là quá trình thu nhận và chuyển hóa năng lượng ánh sáng Mặt trời của thực vật,
tảo và một số vi khuẩn để tạo ra hợp chất hữu cơ phục vụ bản thân cũng như làm nguồn thức ăn cho hầu hết các sinh vật
trên Trái Đất. Quang hợp trong thực vật thường liên quan đến chất tố diệp lục màu xanh lá cây và tạo ra oxy như một sản phẩm phụ" | transformers run --task summarization --model vinai/bartpho-word --device 0
```

</hfoption>
</hfoptions>
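
And a corresponding sketch of the `AutoModel` route mentioned in the same sentence. Again, the checkpoint and generation settings are assumptions rather than code taken from the PR.

```python
# Minimal sketch (not part of this PR): the AutoModel route for the same task.
# BARTpho checkpoints use the mBART architecture, so the seq2seq auto class
# resolves to MBartForConditionalGeneration under the hood.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-word")
model = AutoModelForSeq2SeqLM.from_pretrained("vinai/bartpho-word")

text = "Quang tổng hợp hay gọi tắt là quang hợp là quá trình thu nhận và chuyển hóa năng lượng ánh sáng Mặt trời của thực vật."
inputs = tokenizer(text, return_tensors="pt")
summary_ids = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```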

Quantization reduces the memory footprint of large models by storing weights in lower precision. See the [BitsAndBytes Quantization guide](https://huggingface.co/docs/transformers/quantization/bitsandbytes) for details.
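
For readers who haven't used the linked guide, this is roughly the loading pattern it describes, sketched with bitsandbytes 8-bit weights. Only the checkpoint name comes from the surrounding doc; everything else is an assumption.

```python
# Rough sketch of bitsandbytes 8-bit loading; requires the bitsandbytes package
# and a CUDA device. The settings here are illustrative assumptions.
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "vinai/bartpho-word",
    quantization_config=quant_config,
    device_map="auto",
)
```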

Models not that big so we don't need a Quantization section

extracted from the pre-trained SentencePiece model "vocab_file" that is available from the multilingual XLM-RoBERTa.
Other languages, if employing this pre-trained multilingual SentencePiece model "vocab_file" for subword
segmentation, can reuse BartphoTokenizer with their own language-specialized "monolingual_vocab_file".
Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/attention_visualizer.py) to see what tokens the model attends to.

Not supported for this model yet
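
On the tokenizer passage quoted above, a minimal sketch of the reuse pattern it describes: pairing the multilingual XLM-RoBERTa SentencePiece model with a language-specific vocabulary. The two file paths below are hypothetical placeholders, not files shipped with this PR.

```python
# Sketch of reusing BartphoTokenizer for another language. The file paths are
# hypothetical placeholders: vocab_file is the multilingual SentencePiece model,
# monolingual_vocab_file is a language-specialized token list.
from transformers import BartphoTokenizer

tokenizer = BartphoTokenizer(
    vocab_file="sentencepiece.bpe.model",
    monolingual_vocab_file="dict.my_language.txt",
)
print(tokenizer.tokenize("xin chào"))
```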


## BartphoTokenizer
## Resources

Remove this section

Comment on lines +111 to +132
* **Preprocessing is non-negotiable**:

* All inputs must undergo Vietnamese tone normalization.
* For `bartpho-word` variants, text must also be segmented with [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP).
* The model is trained on a 20GB corpus (\~145M sentences), so domain-specific performance may vary.
* `bartpho-word` consistently outperforms `bartpho-syllable` on Vietnamese generative tasks.

```python
# Example: Masked language modeling with bartpho-syllable
from transformers import MBartForConditionalGeneration, AutoTokenizer
import torch

model = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")

TXT = "Chúng tôi là <mask> nghiên cứu viên."
input_ids = tokenizer(TXT, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits
masked_index = (input_ids == tokenizer.mask_token_id).nonzero().item()
predicted_ids = torch.topk(logits[0, masked_index], 5).indices
print(tokenizer.decode(predicted_ids, skip_special_tokens=True))
```

Suggested change
* **Preprocessing is non-negotiable**:
* All inputs must undergo Vietnamese tone normalization.
* For `bartpho-word` variants, text must also be segmented with [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP).
* The model is trained on a 20GB corpus (\~145M sentences), so domain-specific performance may vary.
* `bartpho-word` consistently outperforms `bartpho-syllable` on Vietnamese generative tasks.
```python
# Example: Masked language modeling with bartpho-syllable
from transformers import MBartForConditionalGeneration, AutoTokenizer
import torch
model = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
TXT = "Chúng tôi là <mask> nghiên cứu viên."
input_ids = tokenizer(TXT, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits
masked_index = (input_ids == tokenizer.mask_token_id).nonzero().item()
predicted_ids = torch.topk(logits[0, masked_index], 5).indices
print(tokenizer.decode(predicted_ids, skip_special_tokens=True))
```
- BARTpho uses the large architecture of BART with an additional layer-normalization layer on top of the encoder and decoder. The BART-specific classes should be replaced with the mBART-specific classes.
- This implementation only handles tokenization through the `monolingual_vocab_file` file. This is a Vietnamese-specific subset of token types taken from that multilingual vocabulary. If you want to use this tokenizer for another language, replace the `monolingual_vocab_file` with one specialized for your target language.
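
One practical point from the excerpt above that the suggestion drops: `bartpho-word` expects word-segmented input. A hedged sketch of that preprocessing step with the `py_vncorenlp` wrapper follows; the save directory is a placeholder, and the exact wrapper API should be checked against the VnCoreNLP project.

```python
# Hedged sketch: word segmentation with VnCoreNLP before feeding bartpho-word.
# The save_dir path is a placeholder, and py_vncorenlp also requires Java.
import py_vncorenlp
from transformers import AutoTokenizer

py_vncorenlp.download_model(save_dir="/absolute/path/to/vncorenlp")
segmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir="/absolute/path/to/vncorenlp")

text = "Chúng tôi là những nghiên cứu viên."
segmented = " ".join(segmenter.word_segment(text))  # words joined with underscores

tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-word")
inputs = tokenizer(segmented, return_tensors="pt")
```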
