Image Identification with Natural Language Specification

Team: Qi Feng and Donghyun Kim and


Image retrieval has been an active research topic in recent years. Collections of digital images have grown rapidly with the expansion of the internet, and people take millions of photos on their cellphones every day. Searching for an image on a cellphone by a natural language description has become an everyday need. For example, Google Photos allows users to search their own photos using a tag or a caption, while Google image search retrieves related images for an input query.

In this work, we propose an image retrieval system that identifies an image from a given set of images by a natural language specification. The natural language specification is the ground truth caption labeled by a human for the target image. Some examples of human-labeled captions are shown in the figure below.

Examples of natural language specifications of corresponding

The input to the model consists of two parts: a set of images and a query, which is the ground truth caption labeled by a human for the target image within the set. The goal of the model is to identify which image the query is associated with. It is worth noting that the caption was not written specifically to distinguish the target image from the other images. The image identification task is illustrated in the figure below.

Identification of the target image by natural language

In our work, we use the Microsoft COCO 2014 (MSCOCO) dataset[@DBLP:journals/corr/LinMBHPRDZ14] with its caption annotations.

We propose a model that uses the caption embedding, conditioned on the visual features, as the input to a language model, a Long Short-Term Memory (LSTM) network[@Hochreiter:1997:LSM:1246443.1246450], which we expect to return a similarity score between an image and the query.

Our Approach

Our approach identifies the target image by learning and measuring the similarity between the given natural language specification and the visual representation of each image.

The Baseline Model

One possible way to compute the similarity score is to take the cosine similarity between the averaged word embeddings[@pennington2014glove] of the input query (the natural language specification) and those of a caption generated for each image by a state-of-the-art image captioning model[@DBLP:journals/corr/VinyalsTBE16]. The Inception v3 image recognition model used in the captioning model is pretrained on the ILSVRC-2012-CLS image classification dataset[@ILSVRC15]. The language model is trained for 20,000 iterations on the MSCOCO dataset.

We use this approach as our baseline model.
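The scoring step of the baseline can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the toy 3-dimensional `EMBEDDINGS` table stands in for the 300-dimensional GloVe vectors, its values are made up, and the helper names (`average_embedding`, `cosine_similarity`, `pick_image`) are hypothetical. Skipping out-of-vocabulary words is also an assumption.

```python
import math

# Toy embedding table standing in for GloVe vectors (values are made up).
EMBEDDINGS = {
    "a": [0.1, 0.0, 0.1],
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.0, 0.9, 0.1],
    "runs": [0.5, 0.2, 0.6],
    "sleeps": [0.1, 0.3, 0.9],
}


def average_embedding(sentence):
    """Mean of the embeddings of the in-vocabulary words of a sentence."""
    vectors = [EMBEDDINGS[w] for w in sentence.lower().split() if w in EMBEDDINGS]
    if not vectors:
        raise ValueError("no in-vocabulary words in sentence")
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_image(query, generated_captions):
    """Return the index of the best-matching caption and all scores."""
    q = average_embedding(query)
    scores = [cosine_similarity(q, average_embedding(c)) for c in generated_captions]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

Here `generated_captions` holds one caption per candidate image, as produced by the captioning model, and the image whose caption scores highest is returned as the prediction.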

The Proposed Model

In this section, we propose a new model that computes the similarity directly from visual features extracted from an image and language features extracted from an input query. We use the output of a Convolutional Neural Network (CNN) to initialize a Recurrent Neural Network (RNN). As in captioning models, the visual representation is used as the first input to the RNN, which conditions the rest of the RNN's processing on the visual representation.

The Vision Model

The visual representations of images are computed by CNNs, which are widely used for image tasks and are currently state-of-the-art for object recognition and detection[@DBLP:journals/corr/VinyalsTBE16; @DBLP:journals/corr/HeZRS15]. We use VGG16[@DBLP:journals/corr/SimonyanZ14a] as the CNN model, as illustrated in the figure above. The role of VGG16 is to extract the visual representation from an image. We use a pretrained VGG16 and freeze its convolutional layers during training, while fine-tuning its fully connected layers. We take the output of the last fully connected layer (FC3) of VGG16 as the output of the CNN. We then add another fully connected layer with 300 hidden units to compress the visual representation into a 300-dimensional vector, which matches the dimension of the word embeddings we use.

Language Similarity Model

The similarity model between the visual representation and the input query is an RNN. RNNs work well in natural language understanding tasks such as machine translation and captioning[@DBLP:journals/corr/VinyalsTBE16; @DBLP:journals/corr/DevlinCFGDHZM15; @45610]. The specific RNN architecture we use for the image-caption similarity task is the Long Short-Term Memory (LSTM) network[@hochreiter1997long]. To process the input query, we use the GloVe[@pennington2014glove] model to embed each word as a 300-dimensional vector, so that the query is represented as a sequence of 300-dimensional vectors. The visual representation is then stacked with the sequence of word embeddings from the query, and the stacked sequence is fed into the LSTM. The output of the LSTM is linearly compressed to a scalar, to which we apply a sigmoid function to obtain the similarity between the image and the query. The activation function in the LSTM is changed from $\tanh$ to ReLU to make it compatible with Excitation Back-propagation (EB).
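The forward pass of the similarity model can be sketched in plain Python. This is a minimal sketch under several assumptions, not the implementation used in the experiments. The image vector is placed first in the sequence. Both $\tanh$ nonlinearities of a standard LSTM cell are replaced by ReLU. The scalar is computed from the final hidden state. The parameters are random and untrained. All function and parameter names (`init_params`, `lstm_step`, `similarity`, `w_out`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def affine(W, b, v):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def init_params(input_dim, hidden_dim, seed=0):
    """Random, untrained parameters for the four gates and the output layer."""
    rng = random.Random(seed)
    params = {}
    for gate in ("i", "f", "o", "g"):
        params["W_" + gate] = [
            [rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
            for _ in range(hidden_dim)
        ]
        params["b_" + gate] = [0.0] * hidden_dim
    params["w_out"] = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
    params["b_out"] = 0.0
    return params


def lstm_step(x, h, c, p):
    """One LSTM step with ReLU in place of tanh (assumed at both positions)."""
    z = x + h  # concatenate the input with the previous hidden state
    i = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], z)]
    f = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], z)]
    o = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], z)]
    g = [relu(a) for a in affine(p["W_g"], p["b_g"], z)]
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * relu(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def similarity(image_vector, word_vectors, p):
    """Score one image-query pair; the result lies in (0, 1)."""
    sequence = [image_vector] + word_vectors  # image vector comes first
    hidden_dim = len(p["w_out"])
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    for x in sequence:
        h, c = lstm_step(x, h, c, p)
    # Linear compression of the final hidden state to a scalar, then sigmoid.
    logit = sum(w * hk for w, hk in zip(p["w_out"], h)) + p["b_out"]
    return sigmoid(logit)
```

In the setting described above, `image_vector` and every entry of `word_vectors` would be 300-dimensional. The sketch accepts any shared dimension, so it can be tried on small toy vectors.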

We used the cross entropy as the loss for the model during training.
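Because the model outputs a sigmoid score in $(0, 1)$, one natural reading of this loss is the binary cross entropy. The form below is a sketch under that assumption. The symbols are introduced here for illustration: $s_i$ is the predicted similarity of the $i$-th image-caption pair, $y_i$ is its target (1 for a matching pair, 0 otherwise), and $N$ is the number of pairs in a batch. Averaging over the batch is also an assumption.

$$
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log s_i + (1 - y_i) \log (1 - s_i) \right]
$$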

The accuracy of the model is computed by providing it with two images and a query, which is the ground truth caption of the first of the two images. The model produces a similarity for each image with the same query. We count the test sample as correct if the similarity of the first image to the query is higher than that of the second image, and as incorrect otherwise.
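The evaluation rule can be written as a short function. This is a sketch with a hypothetical name (`pairwise_accuracy`). It assumes each test sample is reduced to a pair of scores, target first. A tie counts as incorrect, which follows from requiring the target score to be strictly higher.

```python
def pairwise_accuracy(score_pairs):
    """Fraction of samples where the target image outscores the distractor.

    score_pairs: list of (target_score, distractor_score) tuples.
    """
    if not score_pairs:
        raise ValueError("need at least one test sample")
    correct = sum(1 for target, distractor in score_pairs if target > distractor)
    return correct / len(score_pairs)
```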

The Saliency Map

In the final step, we generate a saliency map for the target image and the natural language specification using a single back-propagation pass. This saliency map can provide an interpretable explanation of the model’s predictions.

We seek the visual grounding of the classifier with a saliency map. Given an input image and a natural language specification, our goal is to find the most salient regions in the image (spatial saliency) and the most salient word in the specification (temporal saliency). Zhang et al. [@zhang2016top] proposed finding a saliency map with EB, a method that uses only excitatory neurons, whose weights are greater than 0. Bargal et al. [@bargal2017excitation] extended EB to RNNs.

After training a model, we use the EB for RNNs to find spatial and temporal saliency on inputs.
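To make the propagation rule concrete, the following sketch applies Excitation Back-propagation to a single fully connected layer. A layer's winning probabilities are passed down to its inputs in proportion to activation times weight, and only positive weights take part. This follows the rule of Zhang et al. as we understand it and is not the code used in the experiments. The function name and the list-based layout are hypothetical, and the recurrent extension of Bargal et al. is not shown.

```python
def excitation_backprop_layer(weights, activations, p_out):
    """Propagate winning probabilities from a layer's outputs to its inputs.

    weights[i][j] connects input neuron j to output neuron i.
    activations[j] is the non-negative response of input neuron j.
    p_out[i] is the winning probability of output neuron i.
    """
    p_in = [0.0] * len(activations)
    for row, p_i in zip(weights, p_out):
        # Only excitatory (positive-weight) connections take part.
        excitation = [a * w if w > 0 else 0.0 for a, w in zip(activations, row)]
        total = sum(excitation)
        if total == 0.0:
            continue  # no excitatory input: nothing is passed down
        for j, e in enumerate(excitation):
            p_in[j] += p_i * e / total
    return p_in
```

When every output neuron has at least one active excitatory input, the total probability mass is preserved from one layer to the next, so the result at the input layer can be read as a normalized saliency map.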

An illustration of saliency maps given a word.



The proposed model is trained on the MSCOCO 2014 dataset[@DBLP:journals/corr/LinMBHPRDZ14] with caption annotations.

A few samples from the dataset are shown in the figure below.

Training and Testing

We set the training batch size to 128. Within each batch, 64 images are paired with captions from other images in the dataset, with the target similarity set to 0, while the other 64 images are paired with their own ground truth captions, with the target similarity set to 1. The model is trained on 40,000 random batches drawn from the training set, using the cross entropy loss.
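The batch construction can be sketched as below. This is an illustration with hypothetical names (`make_batch`, `dataset`). It assumes the dataset is a list of `(image_id, caption)` pairs with distinct image ids, whereas MSCOCO provides several captions per image. It uses image ids in place of pixel data.

```python
import random


def make_batch(dataset, batch_size=128, rng=None):
    """Build a batch of (image_id, caption, target) triples.

    Half of the batch pairs an image with its own caption (target 1),
    the other half pairs an image with another image's caption (target 0).
    """
    if rng is None:
        rng = random.Random()
    half = batch_size // 2
    batch = []
    # Positive half: each image with its own ground truth caption.
    for image_id, caption in rng.sample(dataset, half):
        batch.append((image_id, caption, 1))
    # Negative half: each image with a caption drawn from a different image.
    for image_id, _ in rng.sample(dataset, half):
        other_id, other_caption = rng.choice(dataset)
        while other_id == image_id:
            other_id, other_caption = rng.choice(dataset)
        batch.append((image_id, other_caption, 0))
    rng.shuffle(batch)
    return batch
```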

We trained the language model and the fully connected layer between VGG16 and the language model using gradient descent for 40,000 iterations with a batch size of 128 and a learning rate that starts at 1E-3 and decays by a factor of 0.96 every 4096 iterations.
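The learning rate schedule can be expressed as a small function. Whether the decay is applied in discrete steps or continuously is not stated, so the sketch exposes both through a hypothetical `staircase` flag and defaults to the stepwise reading.

```python
def learning_rate(step, initial=1e-3, decay=0.96, decay_steps=4096, staircase=True):
    """Exponentially decayed learning rate at a given training step."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial * decay ** exponent
```

Under the stepwise assumption, the rate is 1E-3 for the first 4096 iterations and 9.6E-4 for the next 4096. After 40,000 iterations it has been decayed nine times, giving roughly 6.9E-4.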


We evaluated the accuracy of the baseline model and our proposed model on the MSCOCO validation set with 6,400 (64 × 100) samples.

The baseline model achieves an accuracy of 91.1%, while our proposed model achieves 93.5%.

The similarity output from our proposed model is shown in the figure below.

The similarity output from our proposed model. The number below the
images are the

Spatial and Temporal Saliency

We use the method of Excitation Back-propagation for RNNs [@bargal2017excitation] to find a saliency map. Given an input image and an input language specification, we generate a saliency map on the image and find the most salient word among the input words. The figure below shows examples of the saliency maps. All images are taken from the validation data. In a saliency map, red represents high saliency scores. In (a), the baseball player is the most highlighted region of the input image, and the word ’player’ receives the highest saliency score among the input words. The other examples (b, c) show the same tendency as (a). The saliency maps thus show which parts of the input the model focuses on.

Examples of spatial and temporal saliency on MS-COCO. The left column
shows original images and the right column shows saliency maps on the
input images. The input language specification is shown under each image. A
red word represents the maximum temporal saliency among input


We proposed a model that identifies an image from a natural language specification. We pre-processed images with a pre-trained CNN to extract visual features, and queries with GloVe embeddings. We then used a Recurrent Neural Network to measure the similarity between images and queries. Our model outperforms the baseline model. With Excitation Back-propagation for RNNs, we are able to find spatial and temporal grounding for our model’s predictions.