Standardize BARTpho model card: badges, new examples, fixed broken image section, and links (huggingface#36979) #40051
Conversation
Update bartpho.md
Hi @stevhliu, thank you for your dedication to maintaining high-quality documentation across the repository. I've standardized the BARTpho model card with several key improvements.
I'd appreciate your feedback on whether any additional elements would enhance the documentation further - specifically, would an architecture diagram improve clarity for users? I'm also open to any other suggestions you might have to make this model card even more valuable for the community. Looking forward to your review and any insights you can share! Best regards
Good start! Next time, make sure the code examples run and are aligned
<img alt="Hugging Face Model" src="https://img.shields.io/badge/Model%20Hub-BARTpho-blue"> | ||
<img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-green"> | ||
<img alt="Language" src="https://img.shields.io/badge/Language-Vietnamese-orange"> | ||
</div> |
Don't need to add these extras
[BARTpho](https://arxiv.org/abs/2109.09701) is the first large-scale, monolingual sequence-to-sequence model pre-trained exclusively for Vietnamese, developed by [VinAI Research](https://huggingface.co/vinai).
It’s based on the **BART** denoising autoencoder architecture, with adaptations from **mBART**, and comes in two variants — **word** and **syllable** — to handle the unique way Vietnamese uses whitespace.
Think of it like a supercharged summarizer and text generator that really “gets” Vietnamese — both at the word and syllable level.
Suggested change:
[BARTpho](https://huggingface.co/papers/2109.09701) is a large-scale Vietnamese sequence-to-sequence model. It offers a word-based and syllable-based version. This model is built on the [BART](./bart) large architecture with its denoising pretraining.
The abstract from the paper is the following:
You can find all official checkpoints in the [BARTpho collection](https://huggingface.co/collections/vinai/bartpho-66f8a74775316eaa77d59969). |
Suggested change:
You can find all the original checkpoints under the [VinAI](https://huggingface.co/vinai/models?search=bartpho) organization.
> [!TIP]
> This model was contributed by [VinAI Research](https://huggingface.co/vinai).
> Check out the `bartpho-word` and `bartpho-syllable` variants in the right sidebar for examples of summarization, punctuation restoration, and capitalization restoration.
Suggested change:
> [!TIP]
> This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen).
> Check out BARTpho in the right sidebar for examples of how to apply BART to different language tasks.
This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BARTpho).
The example below demonstrates how to run summarization with [`pipeline`] or load the model via [`AutoModel`].
Suggested change:
The example below demonstrates how to summarize text with [`Pipeline`] or the [`AutoModel`] class.
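For reference during review, here is a minimal sketch of what such an example might look like. This is not the final model card code: `vinai/bartpho-syllable` is an assumed checkpoint that is not fine-tuned for summarization, so the generated text is only illustrative, and the generation settings are arbitrary.

```python
# Hedged sketch, not the official model card example.
# Assumes the pretrained checkpoint "vinai/bartpho-syllable"; it is not
# fine-tuned for summarization, so outputs are illustrative only.
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

text = "Quang hợp là quá trình thu nhận và chuyển hóa năng lượng ánh sáng Mặt trời của thực vật."

# Pipeline usage
summarizer = pipeline("summarization", model="vinai/bartpho-syllable")
print(summarizer(text, max_length=40)[0]["summary_text"])

# AutoModel usage
tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
model = AutoModelForSeq2SeqLM.from_pretrained("vinai/bartpho-syllable")
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    summary_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```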
>>> input_ids = tokenizer(line, return_tensors="tf")
>>> features = bartpho(**input_ids)
```bash
transformers-cli download vinai/bartpho-word
```
echo -e "Quang tổng hợp hay gọi tắt là quang hợp là quá trình thu nhận và chuyển hóa năng lượng ánh sáng Mặt trời của thực vật,
tảo và một số vi khuẩn để tạo ra hợp chất hữu cơ phục vụ bản thân cũng như làm nguồn thức ăn cho hầu hết các sinh vật
trên Trái Đất. Quang hợp trong thực vật thường liên quan đến chất tố diệp lục màu xanh lá cây và tạo ra oxy như một sản phẩm phụ" | transformers run --task summarization --model vinai/bartpho-word --device 0
</hfoption>
</hfoptions>
Quantization reduces the memory footprint of large models by storing weights in lower precision. See the [BitsAndBytes Quantization guide](https://huggingface.co/docs/transformers/quantization/bitsandbytes) for details. |
Model's not that big, so we don't need a Quantization section
extracted from the pre-trained SentencePiece model "vocab_file" that is available from the multilingual XLM-RoBERTa.
Other languages, if employing this pre-trained multilingual SentencePiece model "vocab_file" for subword
segmentation, can reuse BartphoTokenizer with their own language-specialized "monolingual_vocab_file".
Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/attention_visualizer.py) to see what tokens the model attends to. |
Not supported for this model yet
## BartphoTokenizer
## Resources |
Remove this section
* **Preprocessing is non-negotiable**:
  * All inputs must undergo Vietnamese tone normalization.
  * For `bartpho-word` variants, text must also be segmented with [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) (a hedged sketch of this step follows the list).
* The model is trained on a 20GB corpus (~145M sentences), so domain-specific performance may vary.
* `bartpho-word` consistently outperforms `bartpho-syllable` on Vietnamese generative tasks.
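A hedged sketch of the VnCoreNLP word segmentation step mentioned above. It assumes the `py_vncorenlp` package, a Java runtime, and a local directory for the downloaded VnCoreNLP models; the directory path is a placeholder, not part of the original card.

```python
# Hedged sketch of VnCoreNLP word segmentation for bartpho-word inputs.
# Assumes `pip install py_vncorenlp` and an available Java runtime.
import py_vncorenlp

save_dir = "/absolute/path/to/vncorenlp"          # hypothetical local directory
py_vncorenlp.download_model(save_dir=save_dir)    # one-time download of the VnCoreNLP jar and models
segmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir=save_dir)

text = "Chúng tôi là những nghiên cứu viên."
# word_segment returns one string per sentence, with multi-syllable words joined by underscores
segmented = " ".join(segmenter.word_segment(text))
# `segmented` is the form you would feed to the bartpho-word tokenizer and model.
```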
```python
# Example: Masked language modeling with bartpho-syllable
from transformers import MBartForConditionalGeneration, AutoTokenizer
import torch

model = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")

TXT = "Chúng tôi là <mask> nghiên cứu viên."
input_ids = tokenizer(TXT, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits
# index the single sequence so nonzero() yields one position and .item() works
masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
predicted_ids = torch.topk(logits[0, masked_index], 5).indices
print(tokenizer.decode(predicted_ids, skip_special_tokens=True))
```
Suggested change:
- BARTpho uses the large architecture of BART with an additional layer-normalization layer on top of the encoder and decoder. The BART-specific classes should be replaced with the mBART-specific classes.
- This implementation only handles tokenization through the `monolingual_vocab_file` file. This is a Vietnamese-specific subset of token types taken from that multilingual vocabulary. If you want to use this tokenizer for another language, replace the `monolingual_vocab_file` with one specialized for your target language.
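A hedged sketch of that reuse: both file names below are hypothetical placeholders, where `vocab_file` would be the multilingual XLM-RoBERTa SentencePiece model and `monolingual_vocab_file` a vocabulary specialized for your target language.

```python
# Hedged sketch: reusing BartphoTokenizer for another language.
# Both file paths are hypothetical placeholders, not files shipped with the model.
from transformers import BartphoTokenizer

tokenizer = BartphoTokenizer(
    vocab_file="sentencepiece.bpe.model",             # multilingual XLM-RoBERTa SentencePiece model
    monolingual_vocab_file="dict.your_language.txt",  # your language-specialized vocabulary file
)
print(tokenizer.tokenize("replace me with text in your target language"))
```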
Updated bartpho.md
What does this PR do?
Fixes # (issue)
Standardize BARTpho model card
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Please review this @stevhliu