Model card for vit_large_patch14_clip_224.laion2b_ft_in12k

A Vision Transformer (ViT) image classification model, pretrained on LAION-2B image-text pairs using OpenCLIP and fine-tuned on ImageNet-12k in timm. For the training recipes, see Reproducible scaling laws for contrastive language-image learning.

Model Details

Model Type: Image classification / feature backbone
Model Stats:
- Params (M): 315.3
- GMACs: 77.8
- Activations (M): 57.1
- Image size: 224 x 224
Papers:
- OpenCLIP: https://github.com/mlfoundations/open_clip
- Reproducible scaling laws for contrastive language-image learning: https://arxiv.org/abs/2212.07143
- LAION-5B: An open large-scale dataset for training next generation image-text models: https://arxiv.org/abs/2210.08402
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
Dataset: ImageNet-12k
Pretrain Dataset:
- LAION-2B
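
The image size above, together with the 14-pixel patch size implied by `patch14` in the model name, determines how many tokens the transformer processes per image. The sketch below works through that arithmetic. The helper name `vit_sequence_length` is hypothetical and not part of timm, and the single prefix (class) token is an assumption consistent with the 257-token unpooled output shown in the usage section.

```python
def vit_sequence_length(image_size: int, patch_size: int, num_prefix_tokens: int = 1) -> int:
    """Number of tokens a ViT sees for a square image split into square patches.

    num_prefix_tokens counts extra tokens such as the class token (assumed to be 1 here).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + num_prefix_tokens


# 224 / 14 = 16 patches per side, 16 * 16 = 256 patches, plus 1 class token
print(vit_sequence_length(224, 14))  # 257
```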

Model Usage

Image Classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # needed for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_large_patch14_clip_224.laion2b_ft_in12k', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0)) # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
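
The `torch.topk` call returns two tensors of shape (1, 5): percentages and class indices. A minimal sketch of turning them into readable pairs follows. It feeds random logits through the same softmax and top-k steps so that it runs without downloading weights. The `format_top5` helper and the class count of 1000 are illustrative assumptions, not part of timm or a property of this model.

```python
import torch


def format_top5(logits: torch.Tensor, k: int = 5):
    """Return (class_index, percent) pairs for the first item in the batch."""
    probabilities = logits.softmax(dim=1) * 100
    top_probs, top_indices = torch.topk(probabilities, k=k)
    return [(int(i), float(p)) for i, p in zip(top_indices[0], top_probs[0])]


torch.manual_seed(0)
dummy_logits = torch.randn(1, 1000)  # hypothetical stand-in for model(...) output
for class_index, percent in format_top5(dummy_logits):
    print(f"class {class_index}: {percent:.2f}%")
```

With the real model, pass `output` from the snippet above in place of `dummy_logits`.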

Image Embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_large_patch14_clip_224.laion2b_ft_in12k',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0)) # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 257, 1024) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
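
One common use of pooled embeddings is comparing images by cosine similarity. The sketch below illustrates the idea with random tensors standing in for pooled outputs, so it runs without the model. The `cosine_similarity_matrix` helper is hypothetical, and the embedding width of 1024 is an assumption taken from the last dimension of the unpooled shape above.

```python
import torch
import torch.nn.functional as F


def cosine_similarity_matrix(embeddings: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity for a (num_images, num_features) tensor."""
    normalized = F.normalize(embeddings, dim=1)  # unit L2 norm per row
    return normalized @ normalized.T


torch.manual_seed(0)
dummy_embeddings = torch.randn(3, 1024)  # stand-in for three pooled model outputs
similarity = cosine_similarity_matrix(dummy_embeddings)
print(similarity.shape)  # torch.Size([3, 3])
```

To compare real images, stack their pooled outputs with `torch.cat` along dimension 0 and pass the result to the helper.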

Model Comparison

Explore the dataset and runtime metrics of this model in the timm model results.

Citation

@software{ilharco_gabriel_2021_5143773,
 author = {Ilharco, Gabriel and
 Wortsman, Mitchell and
 Wightman, Ross and
 Gordon, Cade and
 Carlini, Nicholas and
 Taori, Rohan and
 Dave, Achal and
 Shankar, Vaishaal and
 Namkoong, Hongseok and
 Miller, John and
 Hajishirzi, Hannaneh and
 Farhadi, Ali and
 Schmidt, Ludwig},
 title = {OpenCLIP},
 month = jul,
 year = 2021,
 note = {If you use this software, please cite it as below.},
 publisher = {Zenodo},
 version = {0.1},
 doi = {10.5281/zenodo.5143773},
 url = {https://doi.org/10.5281/zenodo.5143773}
}

@article{cherti2022reproducible,
 title={Reproducible scaling laws for contrastive language-image learning},
 author={Cherti, Mehdi and Beaumont, Romain and Wightman, Ross and Wortsman, Mitchell and Ilharco, Gabriel and Gordon, Cade and Schuhmann, Christoph and Schmidt, Ludwig and Jitsev, Jenia},
 journal={arXiv preprint arXiv:2212.07143},
 year={2022}
}

@inproceedings{schuhmann2022laionb,
 title={{LAION}-5B: An open large-scale dataset for training next generation image-text models},
 author={Christoph Schuhmann and
 Romain Beaumont and
 Richard Vencu and
 Cade W Gordon and
 Ross Wightman and
 Mehdi Cherti and
 Theo Coombes and
 Aarush Katta and
 Clayton Mullis and
 Mitchell Wortsman and
 Patrick Schramowski and
 Srivatsa R Kundurthy and
 Katherine Crowson and
 Ludwig Schmidt and
 Robert Kaczmarczyk and
 Jenia Jitsev},
 booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
 year={2022},
 url={https://openreview.net/forum?id=M3Y74vmsMcY}
}

@article{dosovitskiy2020vit,
 title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
 author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
 journal={ICLR},
 year={2021}
}

@misc{rw2019timm,
 author = {Ross Wightman},
 title = {PyTorch Image Models},
 year = {2019},
 publisher = {GitHub},
 journal = {GitHub repository},
 doi = {10.5281/zenodo.4414861},
 howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}