Andrea Burns

Andrea Burns is a third-year PhD student at Boston University in the Image and Video Computing Group. She is advised by Prof. Kate Saenko and Prof. Bryan A. Plummer. Her primary research topics include representation learning and the intersection of computer vision and natural language processing (vision and language). Andrea is interested in improving high-level reasoning in machine learning models, as robust methods and common sense reasoning are necessary for many applications. Her research is interdisciplinary in nature, and she hopes to study reliable vision-language methods for assistive technology.

Andrea's graduate coursework includes:

  • CS537: Randomness in Computing
  • CS591: Deep Learning
  • CS655: Computer Networks
  • CS591: Introduction to Natural Language Processing
  • CS520: Programming Languages
  • CS591: Advanced Optimization Algorithms
  • CS585: Image and Video Computing

She just concluded a research internship with the Robust Perception team at Google Cambridge in Fall 2020 and will be continuing part-time as a Student Researcher. She is open to research that falls under the umbrella of machine learning, computer vision, natural language processing, speech technologies, and human-computer interaction.

aburns4 [at]  |  CV  |  LinkedIn  |  Google Scholar


I have explored several topics in computer vision and natural language processing, including visually enhanced word embeddings, multilingual language representations, image captioning, visual speech recognition, sentiment analysis, and more. Below I include published works; other research projects can be found in the projects section further down.

Learning to Scale Multilingual Representations for Vision-Language Tasks
Andrea Burns, Donghyun Kim, Derry Wijaya, Kate Saenko, Bryan A. Plummer
European Conference on Computer Vision, ECCV (Spotlight, top 5% of accepted papers) 2020
Project Page

Current multilingual vision-language models either require a large number of additional parameters for each supported language, or suffer performance degradation as languages are added. In this paper, we propose a Scalable Multilingual Aligned Language Representation (SMALR) that represents many languages with few model parameters without sacrificing downstream task performance. SMALR learns a fixed-size language-agnostic representation for most words in a multilingual vocabulary, keeping language-specific features for just a few. We use a novel masked cross-language modeling loss to align features with context from other languages. Additionally, we propose a cross-lingual consistency module that ensures predictions made for a query and its machine translation are comparable. The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date. We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.

Language Features Matter: Effective Language Representations for Vision-Language Tasks
Andrea Burns, Reuben Tan, Kate Saenko, Stan Sclaroff, Bryan A. Plummer
International Conference on Computer Vision, ICCV 2019
Project Page

We rigorously analyze different word embeddings, language models, and embedding augmentation steps on five common VL tasks: image-sentence retrieval, image captioning, visual question answering, phrase grounding, and text-to-clip retrieval. Our experiments provide some striking results; an average embedding language model outperforms an LSTM on retrieval-style tasks; state-of-the-art representations such as BERT perform relatively poorly on vision-language tasks. From this comprehensive set of experiments we propose a set of best practices for incorporating the language component of VL tasks. To further elevate language features, we also show that knowledge in vision-language problems can be transferred across tasks to gain performance with multi-task training. This multi-task training is applied to a new Graph Oriented Vision-Language Embedding (GrOVLE), which we adapt from Word2Vec using WordNet and an original visual-language graph built from Visual Genome, providing a ready-to-use vision-language embedding.
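As a rough illustration of what an "average embedding" language model means, the sketch below represents a sentence as the mean of its word vectors and ranks candidate sentences by cosine similarity. It is a text-only toy, not the paper's model: the vectors in `TOY_EMBEDDINGS` are made up, and the function names are hypothetical.

```python
import math

# Toy 3-dimensional word vectors. The values are invented for illustration
# and do not come from Word2Vec, GrOVLE, or any trained embedding.
TOY_EMBEDDINGS = {
    "a": [0.1, 0.0, 0.1],
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "runs": [0.2, 0.8, 0.1],
    "car": [0.0, 0.1, 0.9],
    "parked": [0.1, 0.2, 0.8],
}


def average_embedding(sentence, embeddings=TOY_EMBEDDINGS):
    """Represent a sentence as the mean of its known word vectors."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        raise ValueError("no known words in sentence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_sentences(query, candidates):
    """Sort candidates from most to least similar to the query."""
    q = average_embedding(query)
    return sorted(
        candidates,
        key=lambda s: cosine_similarity(q, average_embedding(s)),
        reverse=True,
    )


if __name__ == "__main__":
    print(rank_sentences("a dog runs", ["a car parked", "a puppy runs"]))
```

Averaging ignores word order entirely, which is what makes the reported result notable: on retrieval-style tasks this simple pooling outperformed an LSTM.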

Multispectral Imaging for Improved Liquid Classification in Security Sensor Systems
Andrea Burns, Waheed U. Bajwa
SPIE 2018

Multispectral imaging can be used as a multimodal source to increase prediction accuracy of many machine learning algorithms by introducing additional spectral bands in data samples. This paper introduces a newly curated Multispectral Liquid 12-band (MeL12) dataset, consisting of 12 classes: eleven liquids and an "empty container" class. The usefulness of multispectral imaging in classification of liquids is demonstrated through the use of a support vector machine on MeL12 for classification of the 12 classes. The reported results are encouraging and also point to the need for additional work to improve classification of harmless and dangerous liquids in high-risk environments, such as airports, concert halls, and political arenas, using multispectral imaging.

  • Grace Hopper Conference Award, Boston University
  • Invited participant for the Grad Cohort Workshop of the CRA-W
  • Dean's Fellowship Fall 2018, Boston University
  • The Academic Achievement Award Scholarship 2014-18, Tulane University
  • Dean’s List 2014-18, Tulane University
  • The Elsa Freiman Angrist Scholarship 2015-18, Tulane University
  • Friezo Family Foundation Greater New York Area Scholarship 2015-18, Tulane University


Feature Refinement for Common Sense Captioning
Andrea Burns, Kate Saenko, Bryan A. Plummer
CVPR Workshop Video Presentation

Third-place winner of the VizWiz Grand Challenge at CVPR 2020, awarded $10K in Azure credits. Work on this project is ongoing.

Automating Web Tasks Across Environment and Ability
Andrea Burns, Kate Saenko, Bryan A. Plummer

Work in progress. Building a mobile application task dataset to be used with an environment-agnostic reinforcement learning policy for automating web navigation tasks across different environments. A feasibility classifier and an action-oriented captioning model will be built to provide tools for low-vision or blind users.

Supervised Machine Learning with Abstract Templates
Andrea Burns
Project Video

Implemented logistic regression and perceptron algorithms by creating abstract supervised learning templates in ATS.
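For readers unfamiliar with the perceptron, here is a minimal sketch of its update rule. The project itself was written in ATS; this Python version is an assumption-laden illustration, not the project's code, and the function names and toy data are hypothetical.

```python
def train_perceptron(samples, labels, epochs=100, learning_rate=1.0):
    """Train a binary perceptron. Labels must be +1 or -1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            # A sample is misclassified (or on the boundary) when the
            # label and the activation disagree in sign.
            if y * activation <= 0:
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
                mistakes += 1
        if mistakes == 0:
            break  # every sample is classified correctly
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 for a single sample."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else -1


if __name__ == "__main__":
    # Logical AND, which is linearly separable.
    samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
    labels = [-1, -1, -1, 1]
    weights, bias = train_perceptron(samples, labels)
    print([predict(weights, bias, x) for x in samples])
```

Logistic regression fits the same linear form but replaces the hard sign test with a sigmoid and a gradient step on the log loss, which is one reason the two algorithms can share a common training template.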

Visual Speech Recognition Survey
Andrea Burns
Presentation Slides  

Compared feature representations for visual speech recognition (VSR) on the AVLetters dataset: Hu moments, Zernike moments, HOG descriptors, and LBP-TOP features. Investigated frame-level and video-level classification using an SVM classifier in scikit-learn.

Multimodal Sentiment Analysis for Voice Message Systems
Andrea Burns, Chloe Chen, Mackenna Barker

Created a multimodal machine learning model to learn the urgency of a voice message after categorizing it into one of four emotions: anger, fear, joy, and sadness. Used Python’s scikit-learn and SDK libraries to apply emotion classification and unsupervised intensity regression on audio and text data.