Siamese Natural Language Tracker: Tracking by Natural Language Descriptions with Siamese Trackers

Abstract

We propose a novel Siamese Natural Language Tracker (SNLT), which brings the advancements in visual tracking to the tracking by natural language (NL) descriptions task. The proposed SNLT is applicable to a wide range of Siamese trackers, providing a new class of baselines for the tracking by NL task and promising future improvements from the advancements of Siamese trackers. The carefully designed architecture of the Siamese Natural Language Region Proposal Network (SNL-RPN), together with the Dynamic Aggregation of vision and language modalities, is introduced to perform the tracking by NL task. Empirical results over tracking benchmarks with NL annotations show that the proposed SNLT improves Siamese trackers by 3 to 7 percentage points with a slight tradeoff of speed. The proposed SNLT outperforms all NL trackers to-date and is competitive among state-of-the-art real-time trackers on LaSOT benchmarks while running at 50 frames per second on a single GPU.

The proposed Siamese Natural Language Tracker (SNLT) improves Siamese trackers by leveraging predictions from two modalities: vision and language.

Link to ArXiv.

@inproceedings{feng2021siamese,
title={Siamese Natural Language Tracker: Tracking by Natural Language Descriptions with Siamese Trackers},
author={Feng, Qi and Ablavsky, Vitaly and Bai, Qinxun and Sclaroff, Stan},
booktitle={Proc.\ IEEE Conf.\ on Computer Vision and Pattern Recognition (CVPR)},
pages={},
year={2021}
}

CityFlow-NL: Tracking and Retrieval of Vehicles at City Scale by Natural Language Descriptions

Abstract

Natural Language (NL) descriptions can be one of the most convenient or the only way to interact with systems built to understand and detect city scale traffic patterns and vehicle-related events. In this paper, we extend the widely adopted CityFlow Benchmark with NL descriptions for vehicle targets and introduce the CityFlow-NL Benchmark. The CityFlow-NL contains more than 5,000 unique and precise NL descriptions of vehicle targets, making it the first multi-target multi-camera tracking with NL descriptions dataset to our knowledge. Moreover, the dataset facilitates research at the intersection of multi-object tracking, retrieval by NL descriptions, and temporal localization of events. In this paper, we focus on two foundational tasks: the Vehicle Retrieval by NL task and the Vehicle Tracking by NL task, which take advantage of the proposed CityFlow-NL benchmark and provide a strong basis for future research on the multi-target multi-camera tracking by NL description task.

Example frames and NL descriptions from the proposed CityFlow-NL dataset. Crowdsourcing workers annotate the target vehicle using a carefully designed multi-camera annotation platform. The NL descriptions we collect tend to describe vehicle color/type (e.g., blue Jeep), vehicle motion (e.g., turning right, going straight), traffic scene (e.g., winding road), and relations with other vehicles (e.g., red truck, black SUV, etc.).

Link to ArXiv.

Used as Challenge Track 5 in the AI City Challenge Workshop at CVPR 2021.

@article{feng2021cityflow,
title={CityFlow-NL: Tracking and Retrieval of Vehicles at City Scale by Natural Language Descriptions},
author={Feng, Qi and Ablavsky, Vitaly and Sclaroff, Stan},
journal={arXiv preprint arXiv:2101.04741},
year={2021}
}

Learning to Separate: Detecting Heavily-Occluded Objects in Urban Scenes

Abstract

While visual object detection with deep learning has received much attention in the past decade, cases with heavy intra-class occlusions have not been studied thoroughly. In this work, we propose a Non-Maximum-Suppression (NMS) algorithm that dramatically improves detection recall while maintaining high precision in scenes with heavy occlusions. Our NMS algorithm is derived from a novel embedding mechanism, in which the semantic and geometric features of the detected boxes are jointly exploited. The embedding makes it possible to determine whether two heavily-overlapping boxes belong to the same object in the physical world. Our approach is particularly useful for car detection and pedestrian detection in urban scenes, where occlusions often happen. We show the effectiveness of our approach by creating a model called SG-Det (short for Semantics and Geometry Detection) and testing SG-Det on two widely-adopted datasets, KITTI and CityPersons, for which it achieves state-of-the-art performance.

Learned Semantics-Geometry Embedding (SGE) for bounding boxes predicted by our proposed detector on KITTI and CityPersons images. Heavily overlapped boxes are separated in the SGE space according to the objects they are assigned to. Thus, distance between SGEs can guide NMS to keep correct boxes in heavy intra-class occlusion scenes.
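
To make the mechanism concrete, below is a rough NumPy sketch of the general idea (not the authors' implementation): a greedy NMS variant in which embedding distance gates suppression. The box format, Euclidean distance metric, and threshold values are illustrative assumptions.

import numpy as np

def iou(box, boxes):
    # Intersection-over-union between one box and an array of boxes,
    # both in [x1, y1, x2, y2] format.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def embedding_guided_nms(boxes, scores, embeddings, iou_thresh=0.5, emb_thresh=1.0):
    # Greedy NMS that suppresses a high-overlap box only when its embedding
    # is also close to the kept box, i.e. the two boxes likely cover the
    # same physical object; heavily occluded distinct objects survive.
    order = np.argsort(-scores)
    keep, suppressed = [], np.zeros(len(boxes), dtype=bool)
    for i in order:
        if suppressed[i]:
            continue
        keep.append(i)
        overlaps = iou(boxes[i], boxes)
        emb_dist = np.linalg.norm(embeddings - embeddings[i], axis=1)
        suppressed |= (overlaps > iou_thresh) & (emb_dist < emb_thresh)
    return keep

In standard NMS the embedding test is absent, so one of two correct boxes on heavily overlapping objects would be discarded.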

Link to ArXiv.

@inproceedings{yang2019learning,
title={Learning to Separate: Detecting Heavily-Occluded Objects in Urban Scenes},
author={Yang, Chenhongyi and Ablavsky, Vitaly and Wang, Kaihong and Feng, Qi and Betke, Margrit},
booktitle={Proc.\ European Conf.\ on Computer Vision (ECCV)},
pages={530--546},
year={2020}
}

Real-time Visual Object Tracking with Natural Language Description

Abstract

In recent years, deep learning based visual object trackers have been studied thoroughly, but handling occlusions and/or rapid motions of the target remains challenging. In this work, we argue that conditioning on the natural language (NL) description of a target provides information for longer-term invariance, and thus helps cope with typical tracking challenges. However, deriving a formulation to combine the strengths of appearance-based tracking with the language modality is not straightforward. Therefore, we propose a novel deep tracking-by-detection formulation that can take advantage of NL descriptions. Regions that are related to the given NL description are generated by a proposal network during the detection phase of the tracker. Our LSTM-based tracker then predicts the update of the target from regions proposed by the NL-based detection phase. Our method runs at over 30 fps on a single GPU. In benchmarks, our method is competitive with state-of-the-art trackers that employ bounding boxes for initialization, while it outperforms all other trackers on targets given unambiguous and precise language annotations. When conditioned on only NL descriptions, our model doubles the performance of the previous best attempt.

The tracking by natural language description task.

The goal is to perform tracking by natural language specifications given by a human. For example, someone specifies "track the silver sedan running on the highway," and our goal is to predict a sequence of bounding boxes on the input video. We also take advantage of the natural language description to better handle occlusion and rapid motion of the target throughout the tracking process.

Link to arXiv.

@inproceedings{feng2020real,   
title={Real-time visual object tracking with natural language description},
author={Feng, Qi and Ablavsky, Vitaly and Bai, Qinxun and Li, Guorong and Sclaroff, Stan},
booktitle={Proc.\ Winter Conf.\ on Applications of Computer Vision (WACV)},
pages={700--709},
year={2020}
}

Image Identification with Natural Language Specification

Team: Qi Feng and Donghyun Kim

fung@bu.edu and donhk@bu.edu

Introduction

Image retrieval has been an active research topic in recent years. The collection of digital images has grown rapidly with the explosion of the internet, and every day people take millions of photos on their cellphones. Searching for an image on a cellphone with a natural language description has become a daily need. For example, Google Photos allows users to search for their own photos using a tag or a caption, while Google image search can retrieve related images given an input query.

In this work, we propose an image retrieval system that identifies an image from a given set of images by a natural language specification. The natural language specification is the ground-truth caption labeled by a human for the target image. Some examples of human-labeled captions are shown in the figure below.

Examples of natural language specifications of corresponding images.

The input to the model consists of two parts: a set of images and a ground-truth caption labeled by a human (the query) for the target image within the set. The goal of the model is to identify which image the query is associated with. It is worth noting that the caption is not specifically written to distinguish the target image from the other images. The image identification task is illustrated in the figure below.

Identification of the target image by natural language specification.

In our work, we use the Microsoft COCO 2014 (MSCOCO) dataset [@DBLP:journals/corr/LinMBHPRDZ14] with its caption annotations.

We propose a model that feeds the caption embedding, conditioned on the visual features, into a language model (a Long Short-Term Memory network, LSTM [@Hochreiter:1997:LSM:1246443.1246450]), and expect the LSTM to return a similarity score between an image and a query.

Our Approach

Our model identifies the target image by learning and measuring a similarity between the given natural language specification and the visual representation of each image.

The Baseline Model

One possible way to compute the similarity score is to take the cosine similarity between the averaged word embeddings [@pennington2014glove] of the input query (the natural language specification) and those of a caption generated for each image by a state-of-the-art image captioning model [@DBLP:journals/corr/VinyalsTBE16]. The Inception v3 image recognition model used in the captioning model is pretrained on the ILSVRC-2012-CLS image classification dataset [@ILSVRC15]. The language model is trained for 20,000 iterations on the MSCOCO dataset.

We consider this approach as our baseline model.
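
To make the baseline concrete, here is a minimal sketch of its scoring step, assuming GloVe vectors are available as a Python dict and that a caption has already been generated for each candidate image; the whitespace tokenization and helper names are illustrative, not the exact pipeline.

import numpy as np

def average_embedding(sentence, glove, dim=300):
    # Average the GloVe vectors of the words in a sentence; out-of-vocabulary
    # words are skipped, and an empty result falls back to a zero vector.
    vecs = [glove[w] for w in sentence.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def baseline_score(query, generated_caption, glove):
    # Cosine similarity between the averaged embeddings of the query and of
    # a caption generated for a candidate image.
    return cosine_similarity(average_embedding(query, glove),
                             average_embedding(generated_caption, glove))

The image whose generated caption yields the highest score is returned as the retrieved target.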

The Proposed Model

In this section, we propose a new model which computes the similarity directly, using visual features extracted from an image and language features extracted from an input query. We use the output of a Convolutional Neural Network (CNN) to initialize a Recurrent Neural Network (RNN): similar to captioning models, the visual representation is used as the first input to the RNN, which makes the rest of the RNN's processing conditioned on the visual representation.

The Vision Model

The visual representations of images are computed by CNNs, which are widely used for image tasks and are currently state-of-the-art for object recognition and detection [@DBLP:journals/corr/VinyalsTBE16; @DBLP:journals/corr/HeZRS15]. We use VGG16 [@DBLP:journals/corr/SimonyanZ14a] as the CNN model, as illustrated in the figure above. The role of VGG16 is to extract the visual representation from an image. We use a pretrained VGG16 and keep its convolutional layers fixed during training, while fine-tuning its fully connected layers. We take the last fully connected layer (FC3) of VGG16 as its output and add another fully connected layer with 300 hidden units to further compress the visual representation into a 300-dimensional vector, matching the dimension of the word embeddings we use.
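
A minimal PyTorch sketch of this vision branch is given below, assuming the standard torchvision VGG16; the choice of framework and the projection from the 1000-dimensional FC3 output are our own illustrative assumptions.

import torch
import torch.nn as nn
from torchvision import models

class VisionModel(nn.Module):
    # VGG16 with frozen convolutional layers, fine-tuned fully connected
    # layers, and an extra projection to the 300-d word-embedding space.
    def __init__(self, embed_dim=300):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.features = vgg.features          # convolutional layers (frozen)
        self.avgpool = vgg.avgpool
        self.classifier = vgg.classifier      # FC1-FC3 (fine-tuned)
        for p in self.features.parameters():
            p.requires_grad = False
        self.project = nn.Linear(1000, embed_dim)  # compress FC3 output to 300-d

    def forward(self, images):                # images: (B, 3, 224, 224)
        x = self.features(images)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)                # (B, 1000), FC3 output
        return self.project(x)                # (B, 300) visual representation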

The Language Similarity Model

The similarity model between the visual representation and the input query is an RNN. RNNs work well in natural language understanding tasks such as machine translation and captioning [@DBLP:journals/corr/VinyalsTBE16; @DBLP:journals/corr/DevlinCFGDHZM15; @45610]. The specific RNN architecture we use for the image-caption similarity task is the Long Short-Term Memory (LSTM) network [@hochreiter1997long]. To process the input query, we use the GloVe [@pennington2014glove] model to embed each word into a 300-dimensional vector, producing a sequence of 300-dimensional vectors that represents the query. The visual representation is then stacked with the sequence of word embeddings from the query, and this stacked sequence is fed into the LSTM. The output of the LSTM is linearly compressed to a scalar, and finally a sigmoid function is applied to obtain the similarity measurement between the image and the query. The activation function in the LSTM is modified to use ReLU instead of $\tanh$ to make it compatible with Excitation Backpropagation (EB).
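
The following PyTorch sketch shows the overall flow under the assumptions noted in the comments; in particular, the hidden size is arbitrary, and the built-in nn.LSTM keeps its standard $\tanh$ activation, whereas the actual model replaces it with ReLU for EB compatibility.

import torch
import torch.nn as nn

class SimilarityLSTM(nn.Module):
    # The 300-d image vector is fed as the first time step, followed by the
    # 300-d GloVe vectors of the query words; the final hidden state is
    # mapped to a scalar similarity in [0, 1].
    def __init__(self, embed_dim=300, hidden_dim=512):   # hidden size is an assumption
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, image_vec, query_embeds):
        # image_vec: (B, 300); query_embeds: (B, T, 300)
        seq = torch.cat([image_vec.unsqueeze(1), query_embeds], dim=1)  # (B, T+1, 300)
        out, _ = self.lstm(seq)
        return torch.sigmoid(self.score(out[:, -1]))     # (B, 1) similarity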

We use cross-entropy as the training loss.

The accuracy of the model is computed by providing it with two images and a query, which is the ground-truth caption of the first image. The model produces a similarity score for each of the two images with the same query. We count a test sample as correct if the similarity between the first image and the query is higher than the similarity between the second image and the query, and as incorrect otherwise.
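
A sketch of this evaluation loop, reusing the model sketches above; the embed_query helper, which turns a caption string into a GloVe sequence, is hypothetical.

import torch

@torch.no_grad()
def pairwise_accuracy(vision_model, sim_model, pairs, embed_query):
    # pairs: iterable of (image_a, image_b, query) where the query is the
    # ground-truth caption of image_a and each image is a (1, 3, 224, 224)
    # tensor; embed_query maps a caption to a (1, T, 300) GloVe tensor.
    correct, total = 0, 0
    for img_a, img_b, query in pairs:
        q = embed_query(query)
        sim_a = sim_model(vision_model(img_a), q)
        sim_b = sim_model(vision_model(img_b), q)
        correct += int(sim_a.item() > sim_b.item())
        total += 1
    return correct / max(total, 1)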

The Saliency Map

In the final step, we generate a saliency map for the target image and the natural language specification using a single back-propagation pass. This saliency map provides an interpretable explanation of the model’s predictions.

We try to find the visual grounding of the classifier with a saliency map. Given an input image and a natural language specification, our goal is to find the most salient regions in the image (spatial saliency) and the most salient words in the language specification (temporal saliency). Zhang et al. [@zhang2016top] proposed a method to compute a saliency map with EB, which uses only excitatory neurons whose weights are greater than 0. Bargal et al. [@bargal2017excitation] extended EB to RNNs.

After training a model, we use EB for RNNs to find spatial and temporal saliency on the inputs.
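
As a rough stand-in for EB (which requires a modified, excitation-only backward pass), the sketch below computes a plain gradient-based saliency map with a single back-propagation; it only illustrates how spatial and temporal saliency can be read off the input gradients and is not the EB method itself.

import torch

def gradient_saliency(vision_model, sim_model, image, query_embeds):
    # image: (1, 3, H, W); query_embeds: (1, T, 300). Back-propagate the
    # similarity score once and use the input-gradient magnitudes as a
    # simple proxy for spatial and temporal saliency.
    image = image.clone().requires_grad_(True)
    query_embeds = query_embeds.clone().requires_grad_(True)
    score = sim_model(vision_model(image), query_embeds)
    score.sum().backward()
    spatial = image.grad.abs().max(dim=1)[0]       # (1, H, W) per-pixel saliency
    temporal = query_embeds.grad.abs().sum(dim=2)  # (1, T) per-word saliency
    return spatial, temporal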

An illustration of saliency maps given a word [@zhang2016top].

Experiment

Dataset

The proposed model is trained on the MSCOCO 2014 dataset [@DBLP:journals/corr/LinMBHPRDZ14] with its caption annotations.

A few samples from the dataset are shown in the figure below.

Training and Testing

We set the training batch size to 128. Within each batch, 64 images are paired with captions from other images in the dataset, for which we set the target similarity to 0, while the other 64 images are paired with their own ground-truth captions, for which we set the target similarity to 1. The model is trained on 40,000 random batches drawn from the training set, using the cross-entropy loss.

We train the language model and the fully connected layer between VGG16 and the language model using gradient descent for 40,000 iterations with batch size 128, starting from a learning rate of 1e-3 and decaying it by a factor of 0.96 every 4,096 iterations.
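
A sketch of this training setup, reusing the model sketches above; the use of plain SGD and the way mismatched captions are sampled are assumptions made for illustration.

import random
import torch

def build_batch(dataset, batch_size=128):
    # dataset: list of (image_tensor, caption) pairs. Half of the batch is
    # negative pairs (image with a caption from another image, label 0) and
    # half is positive pairs (image with its own caption, label 1).
    half = batch_size // 2
    samples = random.sample(dataset, batch_size)
    images, captions, labels = [], [], []
    for k, (img, cap) in enumerate(samples):
        if k < half:
            _, cap = random.choice(dataset)   # mismatched caption (may rarely collide)
            labels.append(0.0)
        else:
            labels.append(1.0)
        images.append(img)
        captions.append(cap)
    return images, captions, torch.tensor(labels)

vision_model = VisionModel()                  # from the sketches above
sim_model = SimilarityLSTM()
params = [p for p in vision_model.parameters() if p.requires_grad] \
         + list(sim_model.parameters())
optimizer = torch.optim.SGD(params, lr=1e-3)  # "gradient descent" in the report
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4096, gamma=0.96)
criterion = torch.nn.BCELoss()                # cross-entropy on the 0/1 similarity targets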

Results

We evaluate the accuracy of the baseline model and our proposed model on the MSCOCO validation set with 64x100 samples.

The baseline model achieves an accuracy of 91.1%, while our proposed model beats the baseline at 93.5%.

The similarity output of our proposed model is shown in the figure below.

The similarity output of our proposed model. The numbers below the images are the similarities.

Spatial and Temporal Saliency

We use Excitation Backpropagation for RNNs [@bargal2017excitation] to find a saliency map. Given an input image and an input language specification, we generate a saliency map on the image and find the most salient words among the input words. The figure below shows examples of the saliency maps. All images are taken from the validation data. In a saliency map, red represents high saliency scores. In (a), the baseball player is mostly highlighted in the input image, and the word ’player’ receives the highest saliency score among the input words. The other examples (b, c) show the same tendency as (a). The saliency maps thus show which parts of the input the model focuses on.

Examples of spatial and temporal saliency on MS-COCO. The left column shows the original images and the right column shows saliency maps on the input images. The input language specification is shown under each image. A red word marks the maximum temporal saliency among the input words.

Conclusions

We proposed a model that identifies an image given a natural language specification. We pre-process images with a pre-trained CNN to extract visual features, and queries with GloVe embeddings. A Recurrent Neural Network then measures the similarity between images and queries. Our model outperforms the baseline model. With Excitation Backpropagation for RNNs, we find spatial and temporal grounding of our model’s predictions.