
Add model card for MobileViT #40033


Open: Shivamjan wants to merge 10 commits into main

Conversation

Shivamjan

What does this PR do?

This PR adds a detailed, beginner-friendly model card for MobileViT to the Hugging Face Transformers documentation. The previous model card was minimal and lacked a clear explanation of the model architecture. The new card retains several elements from the earlier version, as they remain applicable and useful for users.

The new version includes:

  • A clear explanation of the MobileViT architecture.
  • Notes on preprocessing and the expected image format.
  • Guidance on using the model for classification and segmentation (a short segmentation sketch follows this list).
  • A note on TensorFlow Lite compatibility for mobile use.
  • References to the original paper and related resources.
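
For context, here is a minimal sketch of the kind of segmentation usage the card describes, assuming the apple/deeplabv3-mobilevit-small checkpoint (the exact snippet in the card may differ):

```python
# Illustrative sketch only; assumes the apple/deeplabv3-mobilevit-small checkpoint.
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, MobileViTForSemanticSegmentation

image_processor = AutoImageProcessor.from_pretrained("apple/deeplabv3-mobilevit-small")
model = MobileViTForSemanticSegmentation.from_pretrained("apple/deeplabv3-mobilevit-small")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (batch, num_labels, height, width)

# Per-pixel class predictions at the logits resolution
segmentation_map = logits.argmax(dim=1)[0]
print(segmentation_map.shape)
```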

Fixes # (issue)

Before submitting

  • [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Shivamjan (Author) commented Aug 8, 2025

@stevhliu Please take a look at your convenience and let me know if any further changes are required.

@stevhliu (Member) left a comment

Good start! Please check the model card format again, as it's missing the Pipeline and AutoModel examples!

Shivamjan and others added 8 commits August 9, 2025 09:19
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
@stevhliu (Member) left a comment

Thanks, just a few more comments!


This model was contributed by [matthijs](https://huggingface.co/Matthijs). The TensorFlow version of the model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code and weights can be found [here](https://github.com/apple/ml-cvnets).
from transformers import pipeline

You can just run it on a single image rather than a dataset.

import torch
from transformers import pipeline

pipeline = pipeline(
    task="image-classification",
    model="apple/mobilevit-small",
    torch_dtype=torch.float16,
    device=0
)
pipeline("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")



import torch
import requests
from PIL import Image
from transformers import AutoModelForImageClassification, AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained(
    "apple/mobilevit-small",
    use_fast=True,
)
# Move the model to the same device as the inputs to avoid a device mismatch
model = AutoModelForImageClassification.from_pretrained(
    "apple/mobilevit-small",
).to("cuda")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt").to("cuda")

with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_id = logits.argmax(dim=-1).item()

class_labels = model.config.id2label
predicted_class_label = class_labels[predicted_class_id]
print(f"The predicted class label is: {predicted_class_label}")

<PipelineTag pipeline="image-classification"/>
- Does **not** operate on sequential data; it is designed purely for image tasks.
- Feature maps are used directly instead of token embeddings.
- Use [`MobileViTImageProcessor`](https://huggingface.co/docs/transformers/main/en/model_doc/mobilevit#transformers.MobileViTImageProcessor) to preprocess images.

Suggested change
- Use [`MobileViTImageProcessor`](https://huggingface.co/docs/transformers/main/en/model_doc/mobilevit#transformers.MobileViTImageProcessor) to preprocess images.
- Use [`MobileViTImageProcessor`] to preprocess images.
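
For context, a minimal sketch of that preprocessing step, assuming the apple/mobilevit-small checkpoint (not part of this diff):

```python
# Illustrative sketch only; assumes the apple/mobilevit-small checkpoint.
import requests
from PIL import Image
from transformers import MobileViTImageProcessor

image_processor = MobileViTImageProcessor.from_pretrained("apple/mobilevit-small")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# The processor handles resizing, center cropping, rescaling, and the
# channel-order handling MobileViT checkpoints expect.
inputs = image_processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # e.g. [1, 3, 256, 256], depending on the checkpoint's crop size
```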

Comment on lines +114 to +116
- The **classification models** are pretrained on [**ImageNet-1k**](https://huggingface.co/datasets/imagenet-1k) (ILSVRC 2012).
- The **segmentation models** use a [**DeepLabV3**](https://huggingface.co/papers/1706.05587) head and are pretrained on [**PASCAL VOC**](http://host.robots.ox.ac.uk/pascal/VOC/).
- TensorFlow versions are compatible with **TensorFlow Lite**, making them ideal for edge/mobile deployment.

Suggested change
- The **classification models** are pretrained on [**ImageNet-1k**](https://huggingface.co/datasets/imagenet-1k) (ILSVRC 2012).
- The **segmentation models** use a [**DeepLabV3**](https://huggingface.co/papers/1706.05587) head and are pretrained on [**PASCAL VOC**](http://host.robots.ox.ac.uk/pascal/VOC/).
- TensorFlow versions are compatible with **TensorFlow Lite**, making them ideal for edge/mobile deployment.
- The classification models are pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k).
- The segmentation models use a [DeepLabV3](https://huggingface.co/papers/1706.05587) head and are pretrained on [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/).
- TensorFlow versions are compatible with TensorFlow Lite, making them ideal for edge/mobile deployment.
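
For context, a minimal sketch of the TensorFlow Lite export mentioned in that note, assuming the apple/mobilevit-xx-small checkpoint and default post-training optimization settings:

```python
# Illustrative sketch only; assumes the apple/mobilevit-xx-small checkpoint.
import tensorflow as tf
from transformers import TFMobileViTForImageClassification

model = TFMobileViTForImageClassification.from_pretrained("apple/mobilevit-xx-small")

# Convert the Keras model to TensorFlow Lite with default optimizations.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()

with open("mobilevit-xx-small.tflite", "wb") as f:
    f.write(tflite_model)
```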
