Abir Das



  • 2018:
  • RISE: Randomized Input Sampling for Explanation of Black-box Models
    V. Petsiuk, A. Das, K. Saenko; British Machine Vision Conference, 2018 (Oral)
    We are all seeing the tremendous promise of Deep Learning and AI in different areas of our life - from healthcare to recommender systems to autonomous driving. As these systems are used for a wider array of tasks and become more pervasive in everyday life, people are starting to ask the question "why?". The decision-making process of such systems remains largely opaque and difficult to explain to end users. In this paper, we address the problem of Explainable AI for deep neural networks that take images as input and output a class probability. Our proposed RISE approach generates an importance map indicating how salient each pixel is for the model's prediction. In contrast to white-box approaches that estimate pixel importance using gradients or other internal network state, RISE works on black-box models: it estimates importance empirically by probing the model with randomly masked versions of the input image and observing the corresponding outputs. We compare our approach to state-of-the-art importance extraction methods using both an automatic deletion/insertion metric and a pointing metric based on human-annotated object segments.
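The probing idea above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the image, mask count, and the `black_box` scorer (a toy brightness detector) are all made up for the example; real RISE uses smoothed upsampled masks and a trained classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(image):
    # Toy stand-in for any classifier: "class probability" grows with
    # the brightness of the top-left quadrant.
    return image[:4, :4].mean()

H = W = 8
N = 1000
image = np.zeros((H, W))
image[:4, :4] = 1.0            # the "object" lives in the top-left quadrant

# RISE idea: probe the model with randomly masked inputs and average the
# masks, weighting each by the model's score on that masked image.
masks = (rng.random((N, H, W)) > 0.5).astype(float)
scores = np.array([black_box(image * m) for m in masks])
saliency = (scores[:, None, None] * masks).sum(0) / masks.sum(0)
```

Pixels whose unmasking tends to raise the score accumulate higher saliency, so the map highlights the quadrant the toy model actually relies on.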
  • 2017:
  • R-C3D: Region Convolutional 3D Network for Temporal Activity Detection
    H. Xu, A. Das, K. Saenko; International Conference on Computer Vision, 2017
    Activity detection in continuous, untrimmed videos is explored in this work. Unlike activity recognition, activity detection requires not only classifying actions but also accurately localizing the start and end times of each activity. Inspired by the Faster R-CNN object detection approach, we introduce a new model, Region Convolutional 3D Network (R-C3D), which encodes the video stream with a three-dimensional fully convolutional network, then generates candidate temporal regions containing activities, and finally classifies selected regions into specific activities. The entire model is trained end-to-end with jointly optimized localization and classification losses. Sharing convolutional features between the proposal and classification pipelines saves computation, making activity detection fast. Extensive evaluations on three diverse activity detection datasets demonstrate the general applicability of our model.
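The proposal stage rests on temporal overlap between candidate segments and ground-truth activities. A hedged sketch of that piece (the anchor layout, video length, and IoU threshold here are illustrative, not the paper's settings):

```python
import numpy as np

def temporal_iou(seg_a, seg_b):
    """Intersection-over-union of two [start, end] intervals (seconds)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical anchor segments tiled over a 20 s video, as a temporal
# proposal stage might generate them, plus one ground-truth activity.
anchors = [(s, s + d) for s in range(0, 20, 2) for d in (2, 4, 8)]
ground_truth = (5.0, 11.0)

# Keep anchors overlapping the activity enough to serve as positives
# when training the classification stage.
positives = [a for a in anchors if temporal_iou(a, ground_truth) >= 0.5]
```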
  • Weakly Supervised Summarization of Web Videos
    R. Panda, A. Das, Z. Wu, J. Ernst, A. Roy-Chowdhury; International Conference on Computer Vision, 2017
    Video summarization has traditionally been studied as an unsupervised problem. Recently, supervision in the form of labeled summaries has been used to train video summarization models. Unsupervised models are scalable but blind to the rich semantic information in human annotations, limiting summarization performance. Supervised approaches improve performance at the cost of a large human annotation effort, which makes them non-scalable. In this paper, we propose to identify and model important video segments as the most common activities that maximally drive the video towards a particular category. For this purpose we take a weakly supervised approach: a flexible 3D CNN architecture, the Deep Summarization Network (DeSumNet), is trained with video category information only. The summary is obtained by ranking segments by the CNN derivatives with respect to the video segments. The derivatives come from a single back-propagation pass guided by the category with the highest score in the forward pass. The advantage is twofold. First, collecting videos with video-level annotation is less costly than collecting summaries, yet more informative than unsupervised approaches. Second, the method is fast, as it generates the summary via a single back-propagation pass. In addition, to unleash the full potential of our 3D CNN architecture, we explore a series of good practices to reduce the influence of limited training data. Experiments on two challenging and diverse datasets demonstrate that our approach produces higher quality video summaries than several recently proposed approaches.
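The derivative-guided selection can be illustrated with a toy scorer. Everything here is hypothetical: a tiny nonlinear "category scorer" stands in for the 3D CNN, and finite differences stand in for the single back-propagation pass the paper uses.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-segment features for a 6-segment video, and a toy
# category scorer (3 categories) standing in for the trained 3D CNN.
features = rng.random((6, 4))
class_weights = rng.random((3, 4))

def category_scores(feats):
    # Nonlinear pooling so that the gradient differs across segments.
    return class_weights @ (feats ** 2).mean(0)

top = int(np.argmax(category_scores(features)))   # winner of the forward pass

# Importance of each segment = magnitude of the derivative of the winning
# category's score w.r.t. that segment (finite differences here; one
# backprop pass in the real model).
eps = 1e-5
base = category_scores(features)[top]
importance = np.zeros(len(features))
for i in range(len(features)):
    for j in range(features.shape[1]):
        bumped = features.copy()
        bumped[i, j] += eps
        importance[i] += abs(category_scores(bumped)[top] - base) / eps

summary_segments = np.argsort(importance)[::-1][:2]   # top-2 segments
```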
  • Top-down Visual Saliency Guided by Captions
    V. Ramanishka, A. Das, J. Zhang, K. Saenko; Computer Vision and Pattern Recognition, 2017.
    [Project] [Code] [Poster]
    In this work, we propose a deep neural model to extract spatio-temporally salient regions in videos and spatially salient regions in images using natural language sentences as top-down input. We demonstrate that it can be used to analyze and understand the complex decision processes in image and video captioning networks without modifications such as adding explicit attention layers. Our approach is inspired by the signal drop-out methods used to study properties of convolutional networks; we extend the idea to LSTM-based encoder-decoder models. We estimate the saliency of each temporal frame and/or spatial region by computing the information gain it produces for generating the given word. This is done by replacing the input image or video with a single region and observing the effect on the word's generation probability given that region alone. Our approach maintains good captioning performance while providing more accurate spatial heatmaps than existing methods.
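The single-region probing step can be sketched as follows. The `word_prob_given_region` scorer, the word embedding, and the 3x3 region grid are all toy stand-ins; the real model is an LSTM encoder-decoder and the probability comes from its softmax over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(2)

word_embeddings = {"dog": rng.standard_normal(4)}   # hypothetical embedding
regions = rng.standard_normal((3, 3, 4))            # 3x3 grid of region features

def word_prob_given_region(word, region_feat):
    # Toy stand-in for the decoder: dot-product affinity squashed to (0, 1).
    affinity = word_embeddings[word] @ region_feat
    return 1.0 / (1.0 + np.exp(-affinity))

# Saliency of a region for the word "dog": how likely the word becomes when
# the whole input is replaced by that single region alone.
saliency = np.array([[word_prob_given_region("dog", regions[i, j])
                      for j in range(3)] for i in range(3)])
saliency /= saliency.sum()      # normalise into a spatial heatmap
```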

  • 2016:
  • Continuous Adaptation of Multi-Camera Person Identification Models through Sparse Non-redundant Representative Selection
    A. Das, R. Panda, A. Roy-Chowdhury; Computer Vision and Image Understanding, 2016.
    In this paper, we address the problem of online learning of identification systems where unlabeled data arrives in small mini-batches. Though manual labeling is an indispensable part of a supervised framework, for a large-scale identification system labeling huge amounts of data is a significant overhead. The goal is to involve a human in the loop during the online learning of the system while reducing human annotation effort without compromising performance. For large multi-sensor data as typically encountered in camera networks, labeling many samples does not always mean more information, as redundant images get labeled several times. In this work, we propose a convex optimization based iterative framework that progressively and judiciously chooses a sparse but informative set of samples for labeling, with minimal overlap with previously labeled images. We demonstrate the effectiveness of our approach on multi-camera person re-identification datasets, showing the feasibility of learning online classification models in multi-camera big-data applications, and show that our framework achieves superior performance with significantly less manual labeling.
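The selection objective - informative yet non-redundant samples - can be approximated with a much simpler greedy farthest-point heuristic. This is explicitly a stand-in, not the paper's convex program: all features and sizes below are made up.

```python
import numpy as np

rng = np.random.default_rng(3)

features = rng.random((50, 8))   # hypothetical unlabeled mini-batch
labeled = rng.random((5, 8))     # previously labeled samples

def pick_representatives(feats, already, k):
    """Greedy farthest-point stand-in for the paper's convex selection:
    pick k samples far from what is already labeled (informative) and far
    from each other (non-redundant)."""
    chosen = []
    pool = list(already)
    for _ in range(k):
        # distance of every candidate to its nearest labeled/chosen sample
        d = np.min(np.linalg.norm(feats[:, None] - np.array(pool)[None],
                                  axis=2), axis=1)
        best = int(np.argmax(d))
        chosen.append(best)
        pool.append(feats[best])
    return chosen

reps = pick_representatives(features, labeled, k=4)
```

Once a sample is chosen its distance to the pool drops to zero, so the same index is never picked twice - the minimal-overlap property the paper enforces via sparsity.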
  • Network Consistent Data Association
    A. Chakraborty, A. Das, A. Roy-Chowdhury; IEEE Trans. on Pattern Analysis and Machine Intelligence, 2016.
    We extend our previous work on network-consistent re-identification to more general online data association, exploring consistency among different agents at the same time. The more general Network Consistent Data Association (NCDA) method can now dynamically associate new observations with already observed data points in an iterative fashion while maintaining network consistency. We have also extended the application area to spatiotemporal cell tracking in Arabidopsis SAM images.
  • Temporal Model Adaptation for Person Re-Identification
    N. Martinel, A. Das, C. Micheloni, A. Roy-Chowdhury; European Conference in Computer Vision, 2016.
    In this work we address the problem of temporal adaptation of person re-identification systems. We study how to adapt an installed system using data that is continuously collected, by incorporating a human in the loop. Since manual labor is costly, we devise a graph-based approach to present the human with only the most informative probe-gallery matches that should be used to update the model. Using these probe-gallery image pairs, the system is trained in an incremental fashion. For this we introduce a similarity-dissimilarity learning method solved using a stochastic alternating direction method of multipliers. Results on three datasets show that our approach performs on par with or better than state-of-the-art approaches while reducing the manual pairwise labeling effort by about 80%.
  • Multimodal Video Description
    V. Ramanishka, A. Das, D. H. Park, S. Venugopalan, L. A. Hendricks, M. Rohrbach, K. Saenko; ACM Multimedia, 2016 (MSR Video to Language Challenge).
    Understanding a visual scene and expressing it in natural language descriptions is a challenging task that has drawn a lot of attention among researchers. In this paper, we address the problem of describing long videos in natural language using multiple sources of information. In particular, we propose a sequence-to-sequence model which uses audio and the topic of the video, in addition to the visual information, to generate coherent descriptions of videos "in the wild". We also show that employing a committee of models, where each model is an expert in describing videos from a specific category, is more advantageous than a single model trying to describe videos from multiple categories. Extensive experiments on the challenging MSR-VTT dataset show the superior performance of the proposed approach on natural videos found on the web, which come with several challenges including diverse content and diverse as well as noisy descriptions. We secured third position in the MSR Video to Language Challenge organized with ACM MM, 2016.
  • Embedded Sparse Coding for Summarizing Multi-view Videos
    R. Panda, A. Das, A. Roy-Chowdhury; IEEE International Conference on Image Processing, 2016.
    In this paper, we address the problem of summarizing long videos viewed from multiple cameras into a single summary. Summarizing multi-view videos has its own challenges, which make it a completely separate problem from traditional single-view video summarization. The videos in different views differ in illumination, pose, etc., and the lack of time synchronization between views makes the problem significantly more challenging. On the other hand, judicious exploration of inter-view content correlations is required so that important information is collected without significant overlap. We address the problem by first learning a joint embedding space for all the videos such that content correlations between frames of the same view, as well as between frames from different views, are preserved in the learned embedding space. The two types of content correlations are formulated using two different kernels, keeping in mind that the local structure (inside each view) of the frames, in terms of their proximity, should not be disturbed while embedding videos from different views into a common space. The resulting non-convex optimization is solved using a state-of-the-art Majorization-Minimization algorithm. After representing all videos in a joint embedding space, a standard sparse-coding based representative selection algorithm is applied to obtain the joint summary. We validate the approach on three publicly available benchmark datasets, showing improvements over the state-of-the-art.
  • Video Summarization in a Multi-View Camera Network
    R. Panda, A. Das, A. Roy-Chowdhury; International Conference on Pattern Recognition, 2016.
    In this paper, the problem of summarizing long videos viewed from multiple cameras is addressed by forming a joint embedding space and solving an eigenvalue problem in that space. The embedding is learned by exploring intra-view similarities (i.e., between frames of the same video) as well as inter-view similarities (i.e., between frames of different videos looking at roughly the same scene). To preserve both types of similarity, a sparse subspace clustering approach is used, with objectives and constraints suitably changed to fit the different needs of the two scenarios. We obtain the embedded representation of all frames from all videos by unifying the two types of similarities using block matrices, which leads to a standard eigenvalue problem. After obtaining the embeddings, we apply a sparse representative selection procedure similar to the one above to obtain the joint summary. Experiments on both multi-view and single-view datasets show the applicability of this generalized method to a wide range of applications.
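The block-matrix unification and eigenvalue step can be sketched with a plain spectral embedding. A Gaussian kernel stands in here for the paper's sparse-subspace-clustering affinities, and the two views and their sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

# Frames from two hypothetical views of the same scene, stacked together.
view1 = rng.random((10, 5))
view2 = rng.random((12, 5))
frames = np.vstack([view1, view2])

# Intra-view and inter-view similarities unified into one block affinity
# matrix (a Gaussian kernel stands in for the learned affinities).
d2 = ((frames[:, None] - frames[None]) ** 2).sum(-1)
W = np.exp(-d2)                 # symmetric block affinity over all frames
D = np.diag(W.sum(1))
L = D - W                       # unnormalised graph Laplacian

# The joint embedding is read off the bottom eigenvectors of L -
# a standard eigenvalue problem, as in the paper.
vals, vecs = np.linalg.eigh(L)
embedding = vecs[:, 1:4]        # skip the trivial constant eigenvector
```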

  • 2015:
  • Active Image Pair Selection for Continuous Person Re-identification
    A. Das, R. Panda, A. Roy-Chowdhury; IEEE International Conference on Image Processing, 2015.
    Most traditional multi-camera person re-identification systems rely on learning a static model on tediously labeled training data. Such a framework may not be suitable when new data arrives continuously or when all the data is not available for labeling beforehand. Inspired by the value-of-information active learning framework, a continuous-learning person re-identification system with a human in the loop is explored in this work. The human in the loop not only provides labels for incoming images but also improves the learned model by providing the most appropriate attribute-based explanations. These attribute-based explanations are used to learn attribute predictors along the way. The overall effect of such a strategy is that the machine assists the human to speed up annotation and the human assists the machine to update itself, in a symbiotic relationship. In this paper, we validate our approach on a benchmark dataset.
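A simple margin-based criterion illustrates the flavour of value-of-information selection: ask the human about the images the model is least sure about. The score matrix and the margin heuristic are illustrative stand-ins, not the paper's criterion.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical match scores between 8 incoming probe images and 4 gallery
# identities, as a pairwise re-identification model might produce.
scores = rng.random((8, 4))

# Margin-based stand-in: probes whose top two identity scores are closest
# are the most ambiguous, hence the most valuable to label.
sorted_scores = np.sort(scores, axis=1)
margins = sorted_scores[:, -1] - sorted_scores[:, -2]
ask_human = np.argsort(margins)[:3]     # the 3 most ambiguous probes
```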
  • 2014:
  • Consistent Re-identification in a Camera Network
    A. Das, A. Chakraborty, A. Roy-Chowdhury; European Conference in Computer Vision, 2014.
    [Supplementary] [Dataset] [Code] [Bibtex] [Poster] [video spotlight]
    Most existing person re-identification methods are camera-pairwise. They do not explicitly produce consistent re-identification across a camera network and lead to infeasible associations when results from different camera pairs are combined. In this paper, we propose a network consistent re-identification (NCR) framework, formulated as an optimization problem that not only maintains consistency in re-identification results across the network but also improves the camera-pairwise re-identification performance between all individual camera pairs. The problem is solved as a binary integer program, leading to a globally optimal solution.
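The consistency constraint can be made concrete on a toy network: with three cameras, the associations chosen on pairs (0,1) and (1,2) force the association on (0,2), so inconsistent loops never arise. A brute-force search over this tiny instance stands in for the paper's binary integer program; the similarity scores are random stand-ins.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(6)

# Pairwise similarity scores between 3 persons in each of 3 cameras
# (toy numbers standing in for a re-identification model's outputs).
sim = {(a, b): rng.random((3, 3)) for a, b in [(0, 1), (1, 2), (0, 2)]}

def score(assign):
    return sum(sim[pair][i, assign[pair][i]]
               for pair in sim for i in range(3))

# Search only network-consistent solutions: p02 is the composition of
# p01 and p12, so every scored association closes its loops consistently.
best, best_assign = -np.inf, None
for p01 in permutations(range(3)):
    for p12 in permutations(range(3)):
        p02 = tuple(p12[p01[i]] for i in range(3))   # loop consistency
        assign = {(0, 1): p01, (1, 2): p12, (0, 2): p02}
        if score(assign) > best:
            best, best_assign = score(assign), assign
```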
  • Re-Identification in the Function Space of Feature Warps
    A. Das, N. Martinel, C. Micheloni, A. Roy-Chowdhury; IEEE Trans. on Pattern Analysis and Machine Intelligence, 2015.
    One of the major problems in person re-identification is the transformation of features between cameras. In this work we model this transformation by warping the feature space from one camera to another, motivated by the principle of Dynamic Time Warping (DTW). Warp functions between two instances of the same target are feasible warp functions, while those between instances of different targets are infeasible warp functions. A function space composed of these feasible and infeasible warp functions, called the Warp Function Space (WFS), is learned, and person re-identification is addressed by mapping a test warp function onto the WFS and classifying it as belonging to either the feasible or the infeasible set. Through extensive experiments on 5 benchmark datasets our approach is shown to be robust, especially with respect to severe illumination and pose variations.
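The DTW principle the warp functions build on can be shown directly: a warped copy of the same feature curve aligns at low cost, while a genuinely different curve does not. The curves below are synthetic illustrations, not re-identification features.

```python
import numpy as np

def dtw(seq_a, seq_b):
    """Dynamic Time Warping cost between two 1-D feature sequences -
    the alignment principle motivating the warp functions."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Same curve under a time warp vs. a genuinely different curve: the cost
# gap is what makes feasible and infeasible warps separable.
t = np.linspace(0, 1, 30)
same = dtw(np.sin(2 * np.pi * t), np.sin(2 * np.pi * t ** 1.3))
diff = dtw(np.sin(2 * np.pi * t), np.cos(2 * np.pi * t))
```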