Vasili Ramanishka

I am a Ph.D. candidate at Boston University in the Image and Video Computing group, where I work on visual scene understanding. I am advised by Professor Kate Saenko.

During my internship at Honda Research Institute USA, I worked with Dr. Yi-Ting Chen and Dr. Teruhisa Misu on perception tasks in driving scenarios: driver behavior understanding, event detection, and monocular depth estimation.

Email  /  CV  /  Google Scholar  /  LinkedIn


My research interests are in the broad area of Artificial Intelligence, with a focus on Vision and Language Understanding. This includes the interpretability of deep learning models and techniques that provide insight into a model's decisions.

Toward Driving Scene Understanding:
A Dataset for Learning Driver Behavior and Causal Reasoning

Vasili Ramanishka, Yi-Ting Chen, Teruhisa Misu, Kate Saenko
CVPR, 2018
project page

We present the Honda Research Institute Driving Dataset (HDD) for learning driver behavior in real-life environments. The dataset includes 104 hours of real human driving in the San Francisco Bay Area, collected using an instrumented vehicle equipped with multiple sensors and annotated according to the proposed Goal/Stimulus/Attention scheme.

Joint Event Detection and Description in Continuous Video Streams
Huijuan Xu, Boyang Li, Vasili Ramanishka, Leonid Sigal, Kate Saenko
WACV, 2019

Our model continuously encodes the input video stream with three-dimensional convolutional layers, proposes variable-length temporal events based on pooled features, and generates their captions. Unlike existing approaches, our event proposal generation and language captioning networks are trained jointly, which improves temporal segmentation.

Top-down Visual Saliency Guided by Captions
Vasili Ramanishka, Abir Das, Jianming Zhang, Kate Saenko
CVPR, 2017
project page

Our approach produces spatial or spatiotemporal heatmaps for either given query sentences or sentences predicted by the video captioning model. Unlike recent efforts that introduce explicit "attention" layers to selectively attend to certain inputs while generating each word, our approach recovers saliency without the overhead of explicit attention layers and can be used to analyze a variety of existing model architectures and improve their design.

Multimodal Video Description
Vasili Ramanishka, Abir Das, Dong Huk Park, Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, Kate Saenko
ACM Multimedia, 2016
MSR-VTT 2016 Leaderboard (VideoLAB) /  slides

We explore models which produce natural language descriptions for video. The task has important applications in video indexing, human-robot interaction, and describing movies for the blind. Our entry was ranked 3rd in the Microsoft Research - Video to Text Challenge in 2016.

Semantic Textual Similarity
Peter Potash, William Boag, Alexey Romanov, Vasili Ramanishka, Anna Rumshisky
Leaderboard (SimiHawk) /  pdf

We propose an ensemble system, combining alignment-based scoring, an end-to-end LSTM model, and a Tree-LSTM, to capture the degree of semantic equivalence between two snippets of text. Our team was ranked 6th out of 40 participants.