Cola : A Benchmark for Compositional Text-to-image Retrieval

a.k.a. Can your vision-language model Compose Objects Localized with Attributes?

1Boston University, 2Meta AI (FAIR), 3University of Washington
NeurIPS 2023 - Track on Datasets and Benchmarks

We present Cola, where a model has to Compose Objects Localized with Attributes.

To solve Cola, a model must match the correct image to the correct caption, not a distractor image with the same objects and attributes but in the wrong configuration.

We explore the design space of possible mechanisms to adapt existing models to this task; we show that a simple multimodal adaptation method to finetune pre-trained vision-language representations works best.


Compositional reasoning is a hallmark of human visual intelligence; yet despite the size of large vision-language models, they struggle to represent simple compositions by combining objects with their attributes. To measure this lack of compositional capability, we design Cola, a text-to-image retrieval benchmark to Compose Objects Localized with Attributes. To solve Cola, a model must retrieve images with the correct configuration of attributes and objects, and avoid choosing a distractor image with the same objects and attributes but in the wrong configuration. Cola contains about 1.2k composed queries of 168 objects and 197 attributes on around 30K images. Our human evaluation finds that Cola is 83.33% accurate, similar to contemporary compositionality benchmarks. Using Cola as a testbed, we explore empirical modeling designs to adapt pre-trained vision-language models to reason compositionally. We explore 6 adaptation strategies on 2 seminal vision-language models, using compositionality-centric test benchmarks - Cola and CREPE. We find the optimal adaptation strategy is to train a multimodal attention layer that jointly attends over the frozen pre-trained image and language features. Surprisingly, training multimodal layers on CLIP performs better than tuning a larger FLAVA model with already pre-trained multimodal layers. Furthermore, our adaptation strategy improves CLIP and FLAVA to comparable levels, suggesting that training multimodal layers using contrastive attribute-object data is key, as opposed to using them pre-trained. Lastly, we show that Cola is harder than a closely related contemporary benchmark, CREPE, since simpler fine-tuning strategies without multimodal layers suffice on CREPE, but not on Cola. However, we still see a significant gap between our best adaptation and human accuracy, suggesting considerable room for further research.
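The best adaptation strategy described above, a multimodal attention layer that jointly attends over frozen pre-trained image and language features, can be sketched as follows. This is a minimal illustrative sketch, not the released implementation: the dimensions, weight names, pooling choice, and scoring head are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_attention_score(img_feats, txt_feats, W_q, W_k, W_v, w_out):
    """Text tokens attend over image patches; pooled output -> match score.

    img_feats: (P, D) frozen image-patch features from the backbone
    txt_feats: (T, D) frozen text-token features from the backbone
    W_q, W_k, W_v, w_out: the only trainable parameters in this sketch
    """
    Q = txt_feats @ W_q                              # (T, D) queries from text
    K = img_feats @ W_k                              # (P, D) keys from image
    V = img_feats @ W_v                              # (P, D) values from image
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (T, P) cross-attention
    fused = attn @ V                                 # (T, D) image-aware text tokens
    pooled = fused.mean(axis=0)                      # (D,) simple mean pooling
    return float(pooled @ w_out)                     # scalar image-text match score

# Toy usage: score one caption against two candidate images.
rng = np.random.default_rng(0)
D = 16
W_q, W_k, W_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
w_out = rng.standard_normal(D) * 0.1
txt = rng.standard_normal((5, D))
img_a = rng.standard_normal((9, D))
img_b = rng.standard_normal((9, D))
scores = [multimodal_attention_score(im, txt, W_q, W_k, W_v, w_out)
          for im in (img_a, img_b)]
```

In the paper's setup, such a layer would be trained contrastively on attribute-object data while the backbone stays frozen; only the attention and scoring parameters are updated.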



Matching the correct multi-object caption to the correct image

Retrieving the correct multi-attribute object among hard distractors

Related Links

There's a lot of excellent work that also evaluates compositionality.

Winoground studies compositionality using relationships, but we focus on attribute-object bindings, since finding objects with correct attributes is crucial in many applications. Object-attribute bindings should fundamentally be easier than compositions with relationships; yet, we find that existing models still struggle with this simpler binding.

CREPE evaluates models using image-to-text, whereas we evaluate using text-to-image, where text queries are used to retrieve the correct image from a set of difficult distractor images. Text-to-image retrieval is harder because image encoders are weaker at distinguishing fine-grained differences in images for a given text than text encoders are at distinguishing fine-grained text. Moreover, text-to-image is better aligned with practical applications, such as a user giving text instructions to a machine to find certain objects.
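The text-to-image protocol above amounts to scoring one text query against the target image plus its hard distractors and taking the argmax. A minimal sketch with cosine similarity over pre-computed embeddings (names and shapes are assumptions for illustration):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize so that dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve(text_emb, image_embs):
    """Return the index of the best-matching image and all similarity scores.

    text_emb:   (D,)   embedding of the composed text query
    image_embs: (N, D) embeddings of the target image and its distractors
    """
    t = l2_normalize(text_emb)
    imgs = l2_normalize(image_embs)
    sims = imgs @ t                      # (N,) cosine similarity per image
    return int(np.argmax(sims)), sims

# Toy example: one query against three candidate images.
rng = np.random.default_rng(1)
text = rng.standard_normal(8)
images = rng.standard_normal((3, 8))
best, sims = retrieve(text, images)
```

The retrieval is counted correct only when `best` points at the target image rather than a distractor with the same objects and attributes in the wrong configuration.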

There are probably many more by the time you are reading this. This work adds to the ongoing discussion that robust compositionality seems to be lacking in many pre-trained large vision-language models. However, we show that they can be easily adapted to exhibit compositionality if trained with contrastive data containing fine-grained differences.


@inproceedings{ray2023cola,
      title={COLA: How to adapt vision-language models to Compose Objects Localized with Attributes?}, 
      author={Arijit Ray and Filip Radenovic and Abhimanyu Dubey and Bryan A. Plummer and Ranjay Krishna and Kate Saenko},
      booktitle={NeurIPS Track on Datasets and Benchmarks},
      year={2023}
}