Large Vision-Language Foundation Models (VLFMs), such as CLIP, ALIGN and Florence, are trained on large-scale datasets of image-caption pairs and achieve superior transferability and robustness on downstream tasks, but they are difficult to use in many practical applications due to their large size, high latency and fixed architectures. Unfortunately, recent work shows that training a small custom VLFM for resource-limited applications is currently very difficult using public and smaller-scale data. In this paper, we introduce a new distillation mechanism (DIME-FM) that allows us to transfer the knowledge contained in large VLFMs to smaller, customized foundation models using a relatively small amount of inexpensive, unpaired images and sentences. We transfer the knowledge from the pre-trained CLIP-ViT-L/14 model to a ViT-B/32 model, with only 40M public images and 28.4M unpaired public sentences. The resulting model "Distill-ViT-B/32" rivals the CLIP-ViT-B/32 model pre-trained on its private WiT dataset (400M image-text pairs): Distill-ViT-B/32 achieves similar results in terms of zero-shot and linear-probing performance on both ImageNet and the ELEVATER benchmark (20 image classification tasks). It also displays comparable robustness when evaluated on five datasets with natural distribution shifts from ImageNet.
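As context for the zero-shot results mentioned above, the sketch below shows how a distilled student of this kind is typically evaluated with CLIP-style zero-shot classification. It is only a minimal illustration: the `encode_image`/`encode_text` interface, the tokenizer, and the prompt template are assumptions, not the paper's released code.

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(student, tokenizer, images, class_names, device="cuda"):
    # Hypothetical CLIP-like student interface: encode_text / encode_image
    # return embeddings; the prompt template below is illustrative.
    prompts = [f"a photo of a {c}" for c in class_names]
    text_feat = F.normalize(student.encode_text(tokenizer(prompts).to(device)), dim=-1)
    img_feat = F.normalize(student.encode_image(images.to(device)), dim=-1)
    # Each image is assigned the class whose prompt embedding has the
    # highest cosine similarity with the image embedding.
    return (img_feat @ text_feat.T).argmax(dim=-1)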
Figure 1: Conceptual figure of our Vision-Language Knowledge Distillation method DIME-FM. We distill the knowledge from a large VLFM, "CLIP-ViT-L/14", pretrained on a private dataset of 400M image-text pairs. We only use public unpaired image and text corpora as inputs. Our Distill-ViT-B/32 rivals CLIP-ViT-B/32 in both transferability and robustness. ZS: Zero-Shot, LP: Linear Probing.
Figure 2: Illustration of our proposed distillation losses. In each iteration, we compute two losses (L_vl, L_p-vl) and one regularizer (L_udist) with a mini-batch of images and texts to distill knowledge from the teacher to the student. We freeze all parameters in the teacher model and learn the student model from scratch.
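To make the setup in Figure 2 concrete, here is a minimal sketch of one vision-language score-distillation term of the kind the figure depicts (not the paper's exact definitions of L_vl, L_p-vl or L_udist): the student's image-to-text similarity distribution over a mini-batch is matched to the frozen teacher's via a KL divergence. Tensor names and the temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def vl_score_distill(t_img, t_txt, s_img, s_txt, tau=0.01):
    # Teacher embeddings (t_img: [B, Dt], t_txt: [B, Dt]) come from the frozen
    # teacher (e.g. CLIP-ViT-L/14); student embeddings (s_img, s_txt: [B, Ds])
    # come from the student being trained. tau is an assumed softmax temperature.
    t_logits = F.normalize(t_img, dim=-1) @ F.normalize(t_txt, dim=-1).T / tau
    s_logits = F.normalize(s_img, dim=-1) @ F.normalize(s_txt, dim=-1).T / tau
    # KL divergence between the teacher's and the student's image-to-text
    # similarity distributions over the mini-batch.
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean")

In such a sketch the teacher embeddings would be computed under torch.no_grad(), matching the caption above, where all teacher parameters are frozen and only the student is learned from scratch.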
Figure 3: Transferability and robustness for different image/text dataset sizes. (a) The zero-shot transferability of our student model increases with a larger training image/text corpus; (b.i) shows that robustness strongly correlates with the training image dataset size (represented as the dot size); (b.ii) shows that the robustness score strongly correlates with IN-1K performance when changing the training text corpus.
@InProceedings{Sun_2023_ICCV,
author = {Sun, Ximeng and Zhang, Pengchuan and Zhang, Peizhao and Shah, Hardik and Saenko, Kate and Xia, Xide},
title = {DIME-FM : DIstilling Multimodal and Efficient Foundation Models},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {15521-15533}
}