My research interests are in the broad area of Artificial Intelligence with a focus on
Vision and Language Understanding. This includes topics in the interpretability of deep
learning models and techniques that provide insight into a model's decisions.
Toward Driving Scene Understanding:
A Dataset for Learning Driver
Behavior and Causal Reasoning
We present the Honda Research Institute Driving Dataset (HDD) for learning driver behavior
in real-life environments. The dataset includes 104 hours of real human driving in the San
Francisco Bay Area collected using an instrumented vehicle equipped with different sensors
and annotated according to the proposed Goal/Stimulus/Attention scheme.
Joint Event Detection and Description in Continuous Video Streams
Huijuan Xu, Boyang Li,
Leonid Sigal, Kate Saenko
Our model continuously encodes the input video stream with three-dimensional convolutional
layers, proposes variable-length temporal events based on pooled features, and generates
their captions. Unlike existing approaches, our event proposal generation and language
captioning networks are trained jointly, which improves temporal segmentation.
Top-down Visual Saliency Guided by Captions
Abir Das, Jianming Zhang, Kate Saenko
Our approach produces spatial or spatiotemporal heatmaps for both given query sentences and
sentences predicted by the video captioning model. Unlike recent efforts that introduce
explicit layers to selectively attend to certain inputs while generating each word, our
approach recovers saliency without the overhead of explicit attention layers and can be used
to analyze existing model architectures and improve their design.
Multimodal Video Description
Abir Das, Dong Huk Park, Subhashini Venugopalan,
Lisa Anne Hendricks, Marcus Rohrbach, Kate Saenko
ACM Multimedia, 2016
MSR-VTT 2016 Leaderboard
We explore models that produce natural language descriptions for video. The task has
important applications in video indexing, human-robot interaction, and describing movies for
the blind. Our entry was ranked 3rd in the Microsoft Research - Video to Text Challenge in 2016.
Semantic Textual Similarity
We proposed an ensemble system (combining alignment-based scoring, an end-to-end LSTM model,
and a Tree-LSTM) to capture the degree of equivalence in the underlying semantics conveyed by
two snippets of text. Our team was ranked 6th out of 40 participants.