My name is Wenda Qin. I am a fourth-year PhD student in computer science at Boston University (BU), advised by Professor Margrit Betke and co-advised by Professor Derry Wijaya. My research focuses on Computer Vision (CV) and Natural Language Processing (NLP); specifically, multimodal understanding of images and text is my biggest interest. I am currently working on multiple research projects, including gun-violence news analysis with visual and textual inputs, and vision-and-language navigation. My past research includes text recognition in natural scenes and document layout analysis.
I worked as a research intern at Honda Research Institute from January to May 2021. During the internship, we explored and analyzed a vision-and-language navigation dataset called R2R, then improved existing navigation models by applying several novel techniques. One of our improved models achieved a new state-of-the-art performance on the R2R challenge in the single-run setting (entry named "SE-Mixed (Single-Run)").
Besides research, I have also worked as a teaching fellow for Computer Graphics (CS 480/680, Fall 2018, 2020, and 2021) and Artificial Intelligence (CS 440/640, Spring and Fall 2019).
Given an instruction written in natural language and the views of an environment (e.g., an apartment or museum), a vision-and-language navigation (VLN) task requires the AI to follow the instruction, relate it to its visual surroundings, and walk from place to place until reaching the correct destination. Specifically, our project focuses on interior environments, using a well-known dataset called R2R.
While the majority of the community focuses on applying better models and different data augmentation techniques to improve performance, we observed an interesting fact: most models achieve very little improvement on validation data in unseen environments after the first 30K training iterations (out of 300K total). The success rate then fluctuates without clear convergence. We further discovered that the snapshots (the model's parameters saved at different points in training) make different mistakes during navigation even though their success rates are nearly the same. Based on these observations, we proposed an ensemble solution that combines multiple snapshots. The ensemble works surprisingly well compared to its individual snapshots, and it requires no additional training, which would otherwise take a few days to complete.
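The snapshot-ensemble idea can be sketched as follows. This is a minimal toy illustration with made-up logits, not the actual model code: at each navigation step, every saved snapshot scores the candidate actions, the scores are averaged, and the agent takes the action with the highest mean score.

```python
import numpy as np

def ensemble_action(snapshot_logits):
    """Pick a navigation action by averaging the action logits
    produced by several training snapshots of the same model.

    snapshot_logits: array of shape (num_snapshots, num_actions)
    returns: index of the action with the highest mean logit
    """
    mean_logits = np.mean(snapshot_logits, axis=0)
    return int(np.argmax(mean_logits))

# Hypothetical logits from three snapshots over four candidate actions.
# The snapshots disagree individually, but the ensemble settles on action 2.
logits = np.array([
    [0.1, 0.5, 0.9, 0.2],   # snapshot A prefers action 2
    [0.3, 0.8, 0.6, 0.1],   # snapshot B prefers action 1
    [0.2, 0.4, 1.0, 0.3],   # snapshot C prefers action 2
])
print(ensemble_action(logits))  # -> 2
```

Because the snapshots already exist as a by-product of training, this kind of ensemble adds only inference cost, no extra training.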
With a slight adjustment to the base VLN model to allow for more diverse snapshot candidates, our ensemble topped the R2R challenge in the single-run setting in Success rate Per Length and Navigation Error, and tied for the best Success Rate ("SE-Mixed (Single-Run)", Oct 21, 2021).
I recently joined the AIEM team for news analysis using CV and NLP models. Specifically, I am collaborating with teammates to collect reader emotions on news headlines and images about gun violence. We aim to release a dataset for training models that predict reader emotions and explain which parts of the image or text arouse them.
Ear recognition is a rising topic in biometrics. Compared to traditional face identification, ear recognition provides better privacy. It is also contactless, unlike fingerprinting, which has become more valuable during the pandemic.
Due to limited available data, we first applied SIFT, a training-free feature extraction method, to ear recognition. The algorithm worked well in the beginning. However, as the infants grew, their ear shapes changed rapidly, making it hard for the classic SIFT approach to keep up. Currently, we are testing different improvements to a VGG-Face-pretrained deep neural network to see whether it can mitigate the age-progression problem and outperform the SIFT algorithm.
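To illustrate the matching idea behind a SIFT-style pipeline (a toy sketch with synthetic descriptors, not our deployed code): two ear images match well when many of their local descriptors have a clear nearest neighbor in the other image, judged by Lowe's ratio test.

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    """Count descriptor matches between two images using Lowe's ratio test.

    desc_a, desc_b: float arrays of shape (n, d), one local descriptor
    (e.g., a 128-dim SIFT vector) per detected keypoint.
    A descriptor in desc_a counts as a match if its nearest neighbor in
    desc_b is clearly closer than the second-nearest (ratio test).
    """
    matches = 0
    for d in desc_a:
        dists = np.linalg.norm(desc_b - d, axis=1)
        nearest, second = np.partition(dists, 1)[:2]
        if nearest < ratio * second:
            matches += 1
    return matches

# Synthetic example: the "same ear" is the gallery image plus slight noise,
# so nearly all descriptors match; an unrelated ear matches rarely.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(20, 128))
same_ear = gallery + rng.normal(scale=0.01, size=gallery.shape)
other_ear = rng.normal(size=(20, 128))
print(ratio_test_matches(same_ear, gallery))   # high count (close to 20)
print(ratio_test_matches(other_ear, gallery))  # low count
```

In practice the descriptors would come from keypoints detected on ear photos (e.g., via OpenCV's SIFT implementation), and identity is decided by thresholding the match count.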
At the same time, we developed a phone application based on our SIFT ear recognition method and deployed it in clinics in Lusaka, Zambia, to collect and evaluate ear data from local infants. The practical study was recorded and published in Gates Open Research.
Last year, I worked with my colleague Yi Zheng on text recognition in natural scenes. We proposed and implemented a transformer-based next-character-prediction module for a vision-based text recognition system. Our work was accepted and published at ACM Multimedia 2020.
At the beginning of my Ph.D. program, I worked with Dr. Randa Elanwar on document layout analysis for Arabic books and documents. We proposed a challenging Arabic document analysis dataset called BE-Arabic-9K and set up a baseline model using Faster R-CNN. Our work was accepted and published in the International Journal on Document Analysis and Recognition (IJDAR).