Caption-Guided Visual Saliency
Our approach produces spatial or spatiotemporal heatmaps both for given input sentences and for sentences predicted by the video captioning model. Unlike recent efforts that introduce explicit "attention" layers to selectively attend to certain inputs while generating each word, it recovers saliency without the overhead of such layers and can be used to analyze a variety of existing model architectures and improve their design.
Video to Text
We explore models that produce natural language descriptions of in-the-wild video. The task has important applications in video indexing, human-robot interaction, and describing movies for the blind. Our team, VideoLAB, ranked 3rd out of 21 teams in the ACM Multimedia 2016 Grand Challenge.
Semantic Textual Similarity
The task, developed over the past several years, aims to capture the degree of equivalence in the underlying semantics conveyed by two snippets of text. This simple formulation has many potential applications, such as language modeling, machine translation, and information extraction. Our team ranked 6th out of 40 participants in the SemEval-2016 competition.
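To make the task formulation concrete, here is a minimal baseline sketch that scores two snippets by the cosine similarity of their bag-of-words vectors. It is not the ensemble system our team submitted; the function names (`tokenize`, `cosine_similarity`, `sts_score`) and the linear mapping onto a 0 to 5 scale are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the bag-of-words vectors of two snippets."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Only words present in both snippets contribute to the dot product.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a snippet with no tokens has no overlap to measure
    return dot / (norm_a * norm_b)


def sts_score(a, b):
    """Rescale the cosine value from [0, 1] to a 0-5 range (illustrative assumption)."""
    return 5.0 * cosine_similarity(a, b)


if __name__ == "__main__":
    print(sts_score("A man plays guitar", "A man plays piano"))
```

A word-overlap baseline like this assigns a low score to paraphrases that share few words, which is one reason systems for this task typically go beyond surface matching.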
- V. Ramanishka, A. Das, J. Zhang, K. Saenko; Top-down Visual Saliency Guided by Captions. arXiv preprint
- V. Ramanishka, A. Das, D. H. Park, S. Venugopalan, L. A. Hendricks, M. Rohrbach, K. Saenko; Multimodal Video Description (3rd out of 21 teams participating in the Challenge). ACM Multimedia 2016
- P. Potash, W. Boag, A. Romanov, V. Ramanishka, A. Rumshisky; SimiHawk at SemEval-2016 Task 1: A Deep Ensemble System for Semantic Textual Similarity (6th out of 40 teams participating in the Challenge). Proceedings of SemEval-2016, NAACL
- H. Xu, S. Venugopalan, V. Ramanishka, M. Rohrbach, K. Saenko; A Multi-scale Multiple Instance Video Description Network. Workshop on Closing the Loop between Vision and Language, ICCV 2015