Crowdsourcing on Natural Language Descriptions for Visual Object Tracking


TL;DR

We plan to carry out a crowdsourcing project to annotate existing visual object tracking benchmark datasets with natural language (NL) descriptions. Following carefully designed experiments and instructions, we will obtain NL descriptions that serve as training data for our research on tracking with NL.

Contents:

- TL;DR
- Background
- Method
- Annotation
- A Few Examples
- Interface for Annotation
- Verification
- Interface for Verification
- Datasets for Annotation
- LaSOT
- Multi Object Tracking Benchmark
- Results from offline and online trial experiments
- Offline Experiment on Sequence #1
- MTurk Experiment on Sequence #2
- Cost


Background

In our current research project on tracking with natural language descriptions, we designed a supervised model that requires training on labeled sequences. The goal of this research is to exploit NL descriptions to improve tracking performance.

We are aware of two publicly available tracking datasets with NL annotations defining the target to track: LaSOT and OTB-99-LANG. These datasets exhibit biases in the linguistic style of the annotations and in how precisely they define the target to be tracked.

The problem of language-description bias in these datasets is prominent, both in how natural the NL description seems to a native English speaker and in whether it uniquely defines the target specified by the ground-truth bounding box. While the former, i.e., the "naturalness" of the description, is hard to quantify, the latter, i.e., ambiguity, can be tested. We designed this crowdsourcing project to verify and quantify the ambiguity of the existing NL annotations. Additionally, we will crowdsource new NL descriptions for these sequences.


Method

This crowdsourcing project is conducted in a two-step fashion: annotation, then verification.


Annotation

The first step of this experiment asks crowd workers to annotate sequences with NL descriptions.

We randomly sample one frame per sequence in the dataset and draw a red bounding box on the frame using the ground-truth box annotation. Each such frame is distributed to three different crowd workers for annotation.
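A minimal sketch of this sampling-and-drawing step is below. The `(x, y, w, h)` box format and the RGB NumPy frame representation are assumptions; in practice a library such as OpenCV would handle frame loading and drawing.

```python
import random
import numpy as np

def sample_frame_index(num_frames: int, seed: int = 0) -> int:
    """Pick one random frame index per sequence, reproducibly via a seed."""
    return random.Random(seed).randrange(num_frames)

def draw_red_box(frame: np.ndarray, box, thickness: int = 3) -> np.ndarray:
    """Draw a red rectangle on an RGB frame.

    `box` is assumed to be (x, y, w, h) in pixels, matching the
    ground-truth annotation format used here.
    """
    x, y, w, h = box
    out = frame.copy()
    red = np.array([255, 0, 0], dtype=out.dtype)
    out[y:y + thickness, x:x + w] = red          # top edge
    out[y + h - thickness:y + h, x:x + w] = red  # bottom edge
    out[y:y + h, x:x + thickness] = red          # left edge
    out[y:y + h, x + w - thickness:x + w] = red  # right edge
    return out

# Example: a 100x100 black frame with a ground-truth box at (10, 20), 40x30.
frame = np.zeros((100, 100, 3), dtype=np.uint8)
annotated = draw_red_box(frame, (10, 20, 40, 30))
```

The annotated copy, not the original frame, would then be shipped to the three crowd workers.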

Crowd workers are instructed to give a short phrase or sentence that uniquely describes the object in the red bounding box. If a crowd worker cannot come up with a precise description of the object, they can check a box indicating that no suitable description is available.

Crowd workers are instructed to only use the following categories of attributives1:

Additionally, we inform crowd workers that the given frames come from videos and that their annotations will be verified by another human against a different frame from the same sequence.

A Few Examples

Crowd workers may check the no applicable NL description box in the following cases:

  1. Cannot Identify the Object. For example, here is frame 1229 in sequence drone-2:


The object is too small and cannot be identified.

  2. Multiple Instances of the Same Object. For example, frame 514 in sequence helmet-2:


As there are multiple helmets in this frame, it would be very hard to describe the target in NL without using positional attributives.

Interface for Annotation

Here is a screenshot of the interface that a crowd worker might see:




Verification

As described before, we designed this step to quantify the ambiguity of the NL descriptions. We use the quantitative results obtained in this step to filter out videos that do not have a precise natural language description.

For each sequence in the dataset, we randomly sample one frame from the sequence.

We distribute the frame, together with an NL description of the sequence from which it was sampled, to three different crowd workers.

We instruct crowd workers to draw one bounding box around the object that is uniquely and precisely described by the NL description, obtained either from the annotation step or from the dataset itself.

After we collect the results, we calculate the intersection-over-union (IoU) between the ground-truth bounding box from the dataset and the boxes drawn by crowd workers. These IoU scores serve as a quantitative measure of the ambiguity of the NL descriptions: the higher the score, the lower the ambiguity. A higher IoU also means that the NL description carries information over time and survives long-term appearance variations of the target.
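The IoU computation described above can be sketched as a small helper; boxes are assumed to be in `(x, y, w, h)` pixel format:

```python
def iou(box_a, box_b) -> float:
    """Intersection-over-union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Corners of the intersection rectangle.
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

iou((0, 0, 10, 10), (0, 0, 10, 10))  # identical boxes -> 1.0
iou((0, 0, 10, 10), (20, 20, 5, 5))  # disjoint boxes  -> 0.0
```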

Finally, we set an arbitrary threshold on the IoU scores and, during training and evaluation, use only the sequences and NL descriptions that score above it.
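A minimal sketch of this filtering step, assuming the three workers' IoU scores per sequence are aggregated by their mean (the aggregation rule is our assumption; taking the minimum would be stricter):

```python
def filter_sequences(scores: dict, threshold: float = 0.7) -> list:
    """Keep sequences whose mean worker IoU reaches the threshold.

    `scores` maps a sequence name to the list of IoU values from its
    three verification workers.
    """
    return [seq for seq, ious in scores.items()
            if sum(ious) / len(ious) >= threshold]

# Hypothetical scores for illustration.
scores = {"cup-4": [0.82, 0.75, 0.90], "drone-2": [0.30, 0.50, 0.40]}
kept = filter_sequences(scores)  # only "cup-4" passes the 0.7 threshold
```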

Interface for Verification

Here is a screenshot of the interface that a crowd worker might see:



Datasets for Annotation

We plan to annotate two visual object tracking datasets with NL descriptions: LaSOT and MOT.


LaSOT

LaSOT contains 20 sequences for each of its 70 categories, totaling 1,400 sequences and more than 3.52 million frames. It provides a naive, coarse NL description of the target to be tracked; however, the language does not uniquely describe the target.

For example, in sequence cup-4


the given NL description is: transparent cup being placed on the desk. The attributives used do not uniquely and precisely describe the target.

Additionally, most NL descriptions provided in the LaSOT dataset follow a certain form:


For example:

white airplane flying in the air above the forest

basketball on a boy's hand

book held by hand of woman

gray hat worn by a boy on the head

robot arm head moving around

volleyball bouncing between beach volleyball players

zebra walking with another zebra

These NL descriptions are grammatically correct but not necessarily natural.

Multi Object Tracking Benchmark

The MOT dataset contains bounding box annotations for multiple distinct targets in each frame.

For example, here is sequence Stadtmitte.

Using natural language to describe a unique target and initializing the tracker with that description is one of the research goals we plan to pursue in the future. Crowdsourcing language descriptions on the MOT dataset would give us a head start on this problem.

Results from offline and online trial experiments

We launched two trial tasks to annotate and verify LaSOT sequences. The first experiment was conducted offline with BU CS640 students; the second was carried out on Amazon MTurk.

Offline Experiment on Sequence #1

A few graduate students from CS660 helped carry out the crowdsourcing method described above on sequence #1 of each category in the LaSOT dataset. Due to limited work hours, we asked only one student to annotate and another to verify each annotation2. We obtained 64 sequences with verified annotations.

Here is a comparison between NL descriptions from students and the NL descriptions from the LaSOT dataset.

| NL Description from Students | NL Description from LaSOT | Video |
| --- | --- | --- |
| book colored in orange, black and white | black book being placed on the desk | book-1 |
| white bottle with blue cap | bottle shaken by the hands | bottle-1 |
| bus colored in white and red with black windows | blue bus running on the street | bus-1 |
| black car | car running on the street | car-1 |
| brown cat | brown cat walking on the grass ground | cat-1 |
| black cattle | black cattle running on the grass ground | cattle-1 |
| brown chameleon | chameleon sitting on the wall | chameleon-1 |
| bronze coin with cup | coin turned by hand | coin-1 |
| yellow crab | crab sitting on the water bottom | crab-1 |

MTurk Experiment on Sequence #2

For all 70 categories in the LaSOT dataset, we chose sequence #2 for annotation.

For each of the 70 videos, we randomly chose one frame and drew the ground-truth bounding box in red on it. Following the method described above, we asked MTurk workers to give a short phrase or one sentence describing the object within the bounding box.

Every frame (70 in total) was annotated by three different workers, resulting in 210 annotations.

A brief look at the results shows a similarity with what we obtained for sequence #1 in the offline experiment with BU students: most of the annotations are short and contain little, if any, referring expression.

For each of the 70 videos, we randomly chose another frame from the video (different from the one used in the annotation step). Together with the NL descriptions obtained from MTurk workers in the previous step, we instructed MTurk workers to draw a bounding box given the NL description, following the method described above.

For each NL description obtained from the annotation step, we asked three different MTurk workers to draw bounding boxes around the target given the description. With 210 annotations from the annotation step, we received 630 bounding boxes as output.

As described above, we set an arbitrary 0.7 threshold on the intersection-over-union between these bounding boxes and the ground-truth bounding box provided by LaSOT.

Of the 630 boxes, 159 boxes/descriptions from 47 videos passed the 0.7 threshold. With an even higher 0.9 IoU threshold, we are left with 41 annotations from 21 videos.


Cost

TODO (fung@): Calculate an estimation of the cost of annotating both LaSOT and MOT.
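Such an estimate could be parameterized as sketched below. All rates here are hypothetical placeholders, not actual MTurk pricing; the per-assignment rewards and the platform fee would need to be filled in from the real HIT configuration.

```python
# Hypothetical per-assignment rates (assumptions, not actual MTurk pricing).
ANNOTATION_RATE = 0.05    # USD per frame described
VERIFICATION_RATE = 0.05  # USD per bounding box drawn
MTURK_FEE = 0.20          # platform commission as a fraction of reward
WORKERS_PER_TASK = 3      # three workers per frame, per the method above

def estimate_cost(num_sequences: int) -> float:
    """Rough total cost: one frame per sequence, three annotators,
    then three verifiers for each collected description."""
    annotations = num_sequences * WORKERS_PER_TASK
    verifications = annotations * WORKERS_PER_TASK
    reward = annotations * ANNOTATION_RATE + verifications * VERIFICATION_RATE
    return reward * (1 + MTURK_FEE)

estimate_cost(1400)  # all LaSOT sequences, under the assumed rates
```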

1 We will re-run this crowdsourcing job when we need new categories of attributives.
2 These students did have access to the entire sequence when annotating the natural language descriptions.