Siamese Natural Language Tracker: Tracking by Natural Language Descriptions with Siamese Trackers


We propose a novel Siamese Natural Language Tracker (SNLT), which brings the advancements in visual tracking to the tracking by natural language (NL) descriptions task. The proposed SNLT is applicable to a wide range of Siamese trackers, providing a new class of baselines for the tracking by NL task and promising future improvements from the advancements of Siamese trackers. The carefully designed architecture of the Siamese Natural Language Region Proposal Network (SNL-RPN), together with the Dynamic Aggregation of vision and language modalities, is introduced to perform the tracking by NL task. Empirical results over tracking benchmarks with NL annotations show that the proposed SNLT improves Siamese trackers by 3 to 7 percentage points with a slight trade-off in speed. The proposed SNLT outperforms all NL trackers to date and is competitive among state-of-the-art real-time trackers on LaSOT benchmarks while running at 50 frames per second on a single GPU.

The proposed Siamese Natural Language Tracker (SNLT) improves Siamese trackers by leveraging predictions from two modalities: vision and language.

Link to arXiv.

title={Siamese Natural Language Tracker: Tracking by Natural Language Descriptions with Siamese Trackers},
author={Feng, Qi and Ablavsky, Vitaly and Bai, Qinxun and Sclaroff, Stan},
booktitle={Proc.\ IEEE Conf.\ on Computer Vision and Pattern Recognition (CVPR)},

CityFlow-NL: Tracking and Retrieval of Vehicles at City Scale by Natural Language Descriptions


Natural Language (NL) descriptions can be one of the most convenient or the only way to interact with systems built to understand and detect city scale traffic patterns and vehicle-related events. In this paper, we extend the widely adopted CityFlow Benchmark with NL descriptions for vehicle targets and introduce the CityFlow-NL Benchmark. The CityFlow-NL contains more than 5,000 unique and precise NL descriptions of vehicle targets, making it the first multi-target multi-camera tracking with NL descriptions dataset to our knowledge. Moreover, the dataset facilitates research at the intersection of multi-object tracking, retrieval by NL descriptions, and temporal localization of events. In this paper, we focus on two foundational tasks: the Vehicle Retrieval by NL task and the Vehicle Tracking by NL task, which take advantage of the proposed CityFlow-NL benchmark and provide a strong basis for future research on the multi-target multi-camera tracking by NL description task.

Example frames and NL descriptions from the proposed CityFlow-NL dataset. Crowdsourcing workers annotate the target vehicle using a carefully designed multi-camera annotation platform. The NL descriptions we collect tend to describe vehicle color/type (e.g., blue Jeep), vehicle motion (e.g., turning right and straight), traffic scene (e.g., winding road), and relations with other vehicles (e.g., red truck, black SUV, etc.).

Link to arXiv.

Used as Challenge Track 5 in AI City Challenge Workshop at CVPR 2021.

title={CityFlow-NL: Tracking and Retrieval of Vehicles at City Scale by Natural Language Descriptions},
author={Feng, Qi and Ablavsky, Vitaly and Sclaroff, Stan},
journal={arXiv preprint arXiv:2101.04741},

Learning to Separate: Detecting Heavily-Occluded Objects in Urban Scenes


While visual object detection with deep learning has received much attention in the past decade, cases when heavy intra-class occlusions occur have not been studied thoroughly. In this work, we propose a Non-Maximum-Suppression (NMS) algorithm that dramatically improves the detection recall while maintaining high precision in scenes with heavy occlusions. Our NMS algorithm is derived from a novel embedding mechanism, in which the semantic and geometric features of the detected boxes are jointly exploited. The embedding makes it possible to determine whether two heavily-overlapping boxes belong to the same object in the physical world. Our approach is particularly useful for car detection and pedestrian detection in urban scenes where occlusions often happen. We show the effectiveness of our approach by creating a model called SG-Det (short for Semantics and Geometry Detection) and testing SG-Det on two widely-adopted datasets, KITTI and CityPersons, on which it achieves state-of-the-art performance.

Learned Semantics-Geometry Embedding (SGE) for bounding boxes predicted by our proposed detector on KITTI and CityPersons images. Heavily overlapped boxes are separated in the SGE space according to the objects they are assigned to. Thus, distance between SGEs can guide NMS to keep correct boxes in heavy intra-class occlusion scenes.
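The idea of letting embedding distance guide NMS can be illustrated with a minimal sketch. This is not the SG-Det implementation: the function names, the Euclidean distance, and both thresholds (`iou_thresh`, `dist_thresh`) are assumptions chosen for illustration. A lower-scoring box is suppressed only when it overlaps a kept box heavily and its embedding is close to that box's embedding, i.e. when the two boxes plausibly cover the same object.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def embedding_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embeddings (an assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5


def embedding_aware_nms(
    boxes: Sequence[Box],
    scores: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    iou_thresh: float = 0.5,   # hypothetical value
    dist_thresh: float = 1.0,  # hypothetical value
) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # A box is a duplicate only if it overlaps a kept box heavily
        # AND its embedding is close to that kept box's embedding.
        duplicate = any(
            iou(boxes[i], boxes[k]) > iou_thresh
            and embedding_distance(embeddings[i], embeddings[k]) < dist_thresh
            for k in keep
        )
        if not duplicate:
            keep.append(i)
    return keep
```

With identical embeddings for every box, the sketch reduces to ordinary greedy NMS; boxes that overlap heavily but lie far apart in the embedding space both survive.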

Link to arXiv.

title={Learning to Separate: Detecting Heavily-Occluded Objects in Urban Scenes},
author={Yang, Chenhongyi and Ablavsky, Vitaly and Wang, Kaihong and Feng, Qi and Betke, Margrit},
booktitle={Proc.\ European Conf.\ on Computer Vision (ECCV)},

Real-time Visual Object Tracking with Natural Language Description


In recent years, deep-learning-based visual object trackers have been studied thoroughly, but handling occlusions and/or rapid motions of the target remains challenging. In this work, we argue that conditioning on the natural language (NL) description of a target provides information for longer-term invariance, and thus helps cope with typical tracking challenges. However, deriving a formulation to combine the strengths of appearance-based tracking with the language modality is not straightforward. Therefore, we propose a novel deep tracking-by-detection formulation that can take advantage of NL descriptions. Regions that are related to the given NL description are generated by a proposal network during the detection phase of the tracker. Our LSTM-based tracker then predicts the update of the target from regions proposed by the NL-based detection phase. Our method runs at over 30 fps on a single GPU. In benchmarks, our method is competitive with state-of-the-art trackers that employ bounding boxes for initialization, while it outperforms all other trackers on targets given unambiguous and precise language annotations. When conditioned on only NL descriptions, our model doubles the performance of the previous best attempt.

The tracking by natural language description task.

The goal is to perform tracking by natural language specifications given by a human. For example, someone specifies "track the silver sedan running on the highway", and our goal is to predict a sequence of bounding boxes on the input video. We also take advantage of the natural language description to better handle occlusion and rapid motion of the target throughout the tracking process.

Link to arXiv.

title={Real-time visual object tracking with natural language description},
author={Feng, Qi and Ablavsky, Vitaly and Bai, Qinxun and Li, Guorong and Sclaroff, Stan},
booktitle={Proc.\ Winter Conf.\ on Applications of Computer Vision (WACV)},

IVCGPU Onboarding

To ensure better schedul

For those joining us: you may, in the course of your research,
need to use our GPU servers (your advisor/mentor would let you know
when/if the time is right).

Our servers attain peak performance when their users exercise some
sort of cooperative coordination as to who is taking which GPUs.
Conversely, a Wild-West scenario guarantees that most/all jobs stall.

Our “cooperative coordination” mechanism is implemented via

Here are the steps to get you started.

1. Add these contacts to your contact lists.

Download the contacts CSV here.

Import to your Google Contacts.

2. Reserve a GPU.

To reserve a GPU, create a calendar entry on Google Calendar (using either your personal or your BU Google Account).

1. Add a time and title to your event.

2. Add IVGPU to your guest list.

3. You will be able to see the availability of these GPUs by clicking the Find a Time tab.

4. Remove unavailable GPUs and send the invitation.

If the GPU is available during the time you specified, a confirmation email will be sent and you will see a green mark on the GPU name. Otherwise, a decline email will be sent and you will see a red checkmark on the GPU name.
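The accept/decline behavior described above amounts to an interval-overlap check. The sketch below is only an illustration of that logic, not how Google Calendar is implemented: it assumes a request is accepted exactly when it does not overlap an existing booking for that GPU, and the GPU names and the `bookings` structure are hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Interval = Tuple[datetime, datetime]  # (start, end), treated as half-open


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals [start, end) share any time."""
    return a[0] < b[1] and b[0] < a[1]


def available_gpus(bookings: Dict[str, List[Interval]], request: Interval) -> List[str]:
    """Names of GPUs with no existing booking that overlaps the request."""
    return sorted(
        name
        for name, taken in bookings.items()
        if not any(overlaps(request, slot) for slot in taken)
    )


# Hypothetical GPU names and bookings, for illustration only.
bookings = {
    "gpu0": [(datetime(2021, 3, 1, 9), datetime(2021, 3, 1, 12))],
    "gpu1": [],
}
request = (datetime(2021, 3, 1, 11), datetime(2021, 3, 1, 13))
print(available_gpus(bookings, request))
```

Because intervals are half-open, a reservation that starts exactly when another ends does not count as a conflict.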

CS542 Machine Learning Fall 2018

Teaching Fellow, together with Xingchao Peng, for CS542 Machine Learning (Fall 2018), taught by Professor Kate Saenko at BU CS.

Course Website Available Here.