Ximeng Sun

E-mail: sunxm at bu dot edu

Hi, I have been a Ph.D. student in Computer Science at Boston University since Spring 2019, supervised by Prof. Kate Saenko. I am interested in deep learning and computer vision. In particular, I work on efficient learning, vision-language models, and multi-task learning.

I have been fortunate to collaborate with top research labs as an intern, including Meta AI, Google Cloud and IBM Research. In 2022, I was part of a team at Meta AI, where I had the opportunity to collaborate with Xide Xia, Pengchuan Zhang and Peizhao Zhang. In Summer 2021, I joined Google Cloud, where I worked closely with Clayton Mellina, Xiao Bian and Kihyuk Sohn. In the summers of 2019 and 2020, I worked alongside Rogerio Feris and Rameswar Panda at IBM Research.

Previously, I received my M.S. in ECE from the University of Michigan, Ann Arbor, and my B.Eng. in Communication Engineering from Beijing University of Posts and Telecommunications.

I am currently looking for full-time (applied) research positions in industry, starting in June 2024.

CV / GitHub / Google Scholar

Publications

Preprints

Ping Hu, Ximeng Sun, Stan Sclaroff, Kate Saenko. "DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations", arXiv preprint arXiv:2308.01890, 2023.
Ping Hu, Ximeng Sun, Kate Saenko, Stan Sclaroff. "Weakly-supervised Compositional Feature Aggregation for Few-shot Recognition", arXiv preprint arXiv:1906.04833, 2019.
Huijuan Xu, Bingyi Kang, Ximeng Sun, Jiashi Feng, Kate Saenko, Trevor Darrell. "Similarity R-C3D for Few-shot Temporal Activity Detection", arXiv preprint arXiv:1812.10000, 2018.

Conferences and Journals

Ximeng Sun, Pengchuan Zhang, Peizhao Zhang, Hardik Shah, Kate Saenko, Xide Xia. "DIME-FM: DIstilling Multimodal and Efficient Foundation Models". International Conference on Computer Vision (ICCV), 2023.

pdf / project page / code

Overview: Large Vision-Language Foundation Models (VLFM), such as CLIP, ALIGN and Florence, are trained on large-scale datasets of image-caption pairs and achieve superior transferability and robustness on downstream tasks, but they are difficult to use in many practical applications due to their large size, high latency and fixed architectures. Unfortunately, recent work shows that training a small custom VLFM for resource-limited applications is currently very difficult using public and smaller-scale data. In this paper, we introduce a new distillation mechanism (DIME-FM) that allows us to transfer the knowledge contained in large VLFMs to smaller, customized foundation models using a relatively small amount of inexpensive, unpaired images and sentences. We transfer the knowledge from the pre-trained CLIP-ViT-L/14 model to a ViT-B/32 model, with only 40M public images and 28.4M unpaired public sentences. The resulting model "Distill-ViT-B/32" rivals the CLIP-ViT-B/32 model pre-trained on its private WiT dataset (400M image-text pairs): Distill-ViT-B/32 achieves similar zero-shot and linear-probing performance on both ImageNet and the ELEVATER benchmark (20 image classification tasks). It also displays comparable robustness when evaluated on five datasets with natural distribution shifts from ImageNet.

Ximeng Sun, Ping Hu, Kate Saenko. "DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations". Neural Information Processing Systems (NeurIPS), 2022.

pdf / project page / code

Overview: Solving multi-label recognition (MLR) for images in the low-label regime is a challenging task with many real-world applications. Recent work learns an alignment between textual and visual spaces to compensate for insufficient image labels, but loses accuracy because of the limited amount of available MLR annotations. In this work, we utilize the strong alignment of textual and visual features pretrained with millions of auxiliary image-text pairs and propose Dual Context Optimization (DualCoOp) as a unified framework for partial-label MLR and zero-shot MLR. DualCoOp encodes positive and negative contexts with class names as part of the linguistic input (i.e. prompts). Since DualCoOp only introduces a very light learnable overhead upon the pretrained vision-language framework, it can quickly adapt to multi-label recognition tasks that have limited annotations and even unseen classes. Experiments on standard multi-label recognition benchmarks across two challenging low-label settings demonstrate the advantages of our approach over state-of-the-art methods.

Ximeng Sun, Rameswar Panda, Chun-Fu Chen, Aude Oliva, Rogerio Feris, Kate Saenko. "Dynamic Network Quantization for Efficient Video Inference". International Conference on Computer Vision (ICCV), 2021.

pdf / project page / code

Overview: Motivated by the effectiveness of quantization for boosting efficiency, we propose a dynamic network quantization framework that selects the optimal precision for each frame, conditioned on the input, for efficient video recognition. Specifically, given a video clip, we train a very lightweight network in parallel with the recognition network to produce a dynamic policy indicating which numerical precision to use for each frame. We train both networks effectively using standard backpropagation with a loss designed to achieve both the competitive performance and the resource efficiency required for video recognition. Extensive experiments on four challenging and diverse benchmark datasets demonstrate that our approach provides significant savings in computation and memory usage while outperforming existing state-of-the-art methods.
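
A toy illustration of the per-frame precision idea, not the paper's implementation: the learned policy network is replaced by a hand-written contrast rule, and the function names (`quantize`, `choose_bits`, `quantize_video`), the thresholds, and the bit-width options are all assumptions made for this sketch.

```python
from typing import List, Tuple


def quantize(x: float, bits: int) -> float:
    """Uniformly quantize a value in [0, 1] to 2**bits levels."""
    levels = (1 << bits) - 1
    x = min(max(x, 0.0), 1.0)  # clamp to the representable range
    return round(x * levels) / levels


def choose_bits(frame: List[float]) -> int:
    """Hypothetical stand-in for the learned policy network.

    Frames with little contrast get fewer bits; the thresholds
    below are arbitrary and chosen only for illustration.
    """
    contrast = max(frame) - min(frame)
    if contrast < 0.1:
        return 2
    if contrast < 0.5:
        return 4
    return 8


def quantize_video(frames: List[List[float]]) -> List[Tuple[int, List[float]]]:
    """Pick a precision per frame, then quantize that frame with it."""
    result = []
    for frame in frames:
        bits = choose_bits(frame)
        result.append((bits, [quantize(v, bits) for v in frame]))
    return result


video = [[0.50, 0.52], [0.2, 0.5], [0.1, 0.9]]
chosen_bits = [bits for bits, _ in quantize_video(video)]
```

Here each frame is just a list of values in [0, 1]; in the actual framework the policy is a trained network and the precision applies to the recognition network rather than to raw pixels.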

Ximeng Sun, Rameswar Panda, Chun-Fu Chen, Naigang Wang, Bowen Pan, Aude Oliva, Rogerio Feris, Kate Saenko. "All at Once Network Quantization via Collaborative Knowledge Transfer". IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024.
Ximeng Sun, Rameswar Panda, Rogerio Feris, Kate Saenko. "AdaShare: Learning What To Share For Efficient Deep Multi-Task Learning". Neural Information Processing Systems (NeurIPS), 2020.

pdf / project page / code

Overview: AdaShare is a novel and differentiable approach for efficient multi-task learning that learns the feature sharing pattern to achieve the best recognition accuracy, while restricting the memory footprint as much as possible. Our main idea is to learn the sharing pattern through a task-specific policy that selectively chooses which layers to execute for a given task in the multi-task network. In other words, we aim to obtain a single network for multi-task learning that supports separate execution paths for different tasks.
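
A minimal sketch of the "separate execution paths" idea, under stated assumptions: the layers are plain Python functions, the task names are hypothetical, and the select-or-skip policy is fixed by hand here, whereas AdaShare learns it jointly with the network.

```python
from typing import Callable, Dict, List, Sequence

Layer = Callable[[float], float]


def run_task(x: float, layers: Sequence[Layer], policy: Sequence[bool]) -> float:
    """Run the shared layers, skipping those the task's policy turns off."""
    if len(policy) != len(layers):
        raise ValueError("policy must have one entry per layer")
    for layer, execute in zip(layers, policy):
        if execute:
            x = layer(x)  # layer is used by this task
        # otherwise the input passes through unchanged (layer skipped)
    return x


# One shared stack of layers (toy stand-ins for network blocks).
shared_layers: List[Layer] = [
    lambda v: v + 1.0,
    lambda v: v * 2.0,
    lambda v: v - 3.0,
]

# Hypothetical hand-written policies: one select-or-skip flag per layer.
policies: Dict[str, List[bool]] = {
    "segmentation": [True, True, False],
    "depth": [True, False, True],
}

outputs = {task: run_task(1.0, shared_layers, p) for task, p in policies.items()}
```

Both tasks share the first layer and diverge afterwards, so a single set of weights serves two different execution paths.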

Ximeng Sun, Huijuan Xu, Kate Saenko. "TwoStreamVAN: Improving Motion Modeling in Video Generation". IEEE Winter Conference on Applications of Computer Vision (WACV), 2020.

arXiv / demo / code / dataset

Overview: We propose TwoStreamVAN to output a realistic video given an input action label by progressively generating and fusing motion and content features at multiple scales using adaptive motion kernels. In addition, to better evaluate video generation models, we design a new synthetic human action dataset SynAction to bridge the difficulty gap between overcomplicated human action datasets and simple toy datasets.

Ximeng Sun, Ryan Szeto, Jason Corso. "A Temporally-Aware Interpolation Network for Video Frame Inpainting". Asian Conference on Computer Vision (ACCV), 2018.

paper / demo / code

Ryan Szeto, Ximeng Sun, Kunyi Lu, Jason Corso. "A Temporally-Aware Interpolation Network for Video Frame Inpainting". IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), November 6, 2019.

paper / code

Overview: We propose the first deep learning solution to video frame inpainting. We devise a pipeline composed of two modules: a bidirectional video prediction module and a temporally-aware frame interpolation module. Our experiments demonstrate that our approach produces more accurate and qualitatively satisfying results than a state-of-the-art video prediction method and many strong frame inpainting baselines.

Xingchao Peng, Zijun Huang, Ximeng Sun, Kate Saenko. "Domain Agnostic Learning with Disentangled Representations". International Conference on Machine Learning (ICML), 2019.
Rameswar Panda, Chun-Fu Chen, Quanfu Fan, Ximeng Sun, Kate Saenko, Aude Oliva, Rogerio Feris. "AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition". International Conference on Computer Vision (ICCV), 2021.

Patents

Rameswar Panda, Ximeng Sun, Richard Chen, Rogerio Schmidt Feris and Ekaterina Saenko. "Dynamic network quantization for efficient video inference". US Patent App. 17/566,782