This is an example of converting CoreNet's CLIP model implementation to MLX's CLIP example, with some customized modifications. MLX is a machine learning framework that provides native support for Apple Silicon hardware.
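As a quick illustration (a minimal sketch, not part of the conversion steps below), MLX arrays live in unified memory and computations are lazy until explicitly evaluated:

```python
import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])
b = mx.exp(a)  # builds a lazy computation graph; nothing runs yet
mx.eval(b)     # evaluates on the default device (the GPU on Apple Silicon)
print(b)
```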
To convert a CoreNet CLIP model to the MLX CLIP example using the files in this directory:
```bash
cd mlx_examples/clip/

# Install required dependencies.
# We assume that the main requirements.txt is already installed.
pip install -r requirements.txt

# Convert the model.
python main_clip_to_mlx.py \
    --common.config-file "../../projects/range_augment/clip/clip_vit_base.yaml" \
    --model.multi-modal-image-text.pretrained https://docs-assets.developer.apple.com/ml-research/models/cvnets-v2/examples/range_augment/clip/clip_vit_base_16.pt \
    --common.results-loc results/mlx_model/

# Try example inference.
python clip.py
```
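For reference, the core computation in CLIP inference is scoring each image against each text prompt. The sketch below shows that pattern in MLX with random embeddings standing in for the encoder outputs of the converted model; `similarity_scores` is a hypothetical helper for illustration, and `clip.py` is the real entry point.

```python
# A sketch of CLIP-style image/text matching in MLX. The embeddings here
# are random stand-ins for the image/text encoder outputs produced by the
# converted model; see clip.py for the real entry point.
import mlx.core as mx

def similarity_scores(image_emb: mx.array, text_emb: mx.array) -> mx.array:
    """L2-normalize both embeddings and take pairwise dot products."""
    image_emb = image_emb / mx.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / mx.linalg.norm(text_emb, axis=-1, keepdims=True)
    return image_emb @ text_emb.T

# Toy stand-ins: 2 images x 2 prompts, 512-dim embeddings.
image_emb = mx.random.normal((2, 512))
text_emb = mx.random.normal((2, 512))

probs = mx.softmax(similarity_scores(image_emb, text_emb), axis=-1)
mx.eval(probs)  # MLX is lazy; this forces the computation
print(probs)    # per-image probabilities over the two prompts
```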
Compared to PyTorch, given the prompts `["a photo of cat", "a photo of dog"]` and the `assets/{cat,dog}.jpeg` images, the results on an M2 Ultra are as follows:
| Model | PyTorch time, 100 iters (s) | MLX time, 100 iters (s) | Speedup (%) |
| --- | --- | --- | --- |
| FP16 Base variant | 2.7322 | 1.0743 | 60.68% |
| FP16 Huge variant | 4.9098 | 4.3189 | 12.04% |
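Timings like these come from a simple wall-clock loop over 100 forward passes. The sketch below shows the general pattern; `run_inference` is a hypothetical stand-in for the model's forward call, and `mx.eval` is required because MLX evaluates lazily.

```python
# A sketch of the 100-iteration timing loop behind tables like the one
# above. `run_inference` is a hypothetical placeholder workload; substitute
# the real CLIP forward pass from clip.py.
import time

import mlx.core as mx

def run_inference() -> mx.array:
    # Placeholder workload; replace with the CLIP forward pass.
    x = mx.random.normal((1, 3, 224, 224))
    return x.sum()

def benchmark(num_iters: int = 100) -> float:
    # Warm up once so one-time allocation/compilation is not timed.
    mx.eval(run_inference())
    start = time.perf_counter()
    for _ in range(num_iters):
        out = run_inference()
        mx.eval(out)  # force the lazy computation to actually run
    return time.perf_counter() - start

print(f"MLX time, 100 iters: {benchmark():.4f} s")
```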