Award Title: Structural Matrix Completion for Data-Mining Applications

(2014 - 2018) Award number: 1421759

PI Mark Crovella (crovella@bu.edu), co-PI Evimaria Terzi (evimaria@bu.edu)

Students: Natali Ruchansky (natalir@bu.edu), Dora Erdos (edori@bu.edu), Charalampos Mavroforakis (cmav@bu.edu), Giovanni Comarela (gcom@bu.edu), Harshal Chaudhari (harshal@bu.edu)

Project Description

A common problem arising in science and engineering is that a dataset may only be partially measured. Often the complete dataset is naturally expressed as a matrix - for example, traffic flows in a city, gene expression across a set of treatments, or ratings of movies for users. Recently, a new solution strategy has emerged for the problem of inferring the missing entries in such datasets, but the power and limits of this new “structural” approach are not fully understood as yet. This project will develop a better understanding of this structural approach and apply that understanding to a number of important problems. In addition, the project will develop new course materials for data science education, and train both graduate and undergraduate students.

The matrix completion problem seeks to infer the missing entries of a matrix, under a low-rank assumption. To date, most matrix completion methods do not actually check whether the known entries contain sufficient information to complete the matrix. Recently, however, a new and very different class of “structural” methods have emerged, which analyze the information content of the visible matrix entries, and so can determine whether accurate completion is possible. From a data mining standpoint, the implications of structural matrix completion methods are largely unexplored. This project will investigate how to leverage structural matrix completion methods to attack a host of data analysis problems, including developing new methods for active matrix completion, new approaches to cross-validating matrix completion results, and new strategies for general matrix completion.

Main Results with datasets and code

N. Ruchansky, M. Crovella, E. Terzi Matrix Completion with Queries. ACM SIGKDD 2015. (code)

In many applications, e.g., recommender systems and traffic monitoring, the data comes in the form of a matrix that is only partially observed and low rank. A fundamental data-analysis task for these datasets is matrix completion, where the goal is to accurately infer the entries missing from the matrix. Even when the data satisfies the low-rank assumption, classical matrix-completion methods may output completions with significant error – in that the reconstructed matrix differs significantly from the true underlying matrix. Often, this is due to the fact that the information contained in the observed entries is insufficient. In this work, we address this problem by proposing an active version of matrix completion, where queries can be made to the true underlying matrix. Subsequently, we design Order&Extend, which is the first algorithm to unify a matrix-completion approach and a querying strategy into a single algorithm. Order&Extend is able identify and alleviate insufficient information by judiciously querying a small number of additional entries. In an extensive experimental evaluation on real-world datasets, we demonstrate that our algorithm is efficient and is able to accurately reconstruct the true matrix while asking only a small number of queries

N. Ruchansky, M. Crovella, E. Terzi Targeted Matrix Completion. Siam International Conference on Data Mining, SDM 2017. (code)

Matrix completion is a problem that arises in many data-analysis settings where the input consists of a partially-observed matrix (e.g., recommender systems, traffic matrix analysis etc.). Classical approaches to matrix completion assume that the input partially-observed matrix is low rank. The success of these methods depends on the number of observed entries and the rank of the matrix; the larger the rank, the more entries need to be observed in order to accurately complete the matrix. In this paper, we deal with matrices that are not necessarily low rank themselves, but rather they contain low-rank submatrices. We propose Targeted, which is a general framework for completing such matrices. In this framework, we first extract the low-rank submatrices and then apply a matrix-completion algorithm to these low-rank submatrices as well as the remainder matrix separately. Although for the completion itself we use state-of-the-art completion methods, our results demonstrate that Targeted achieves significantly smaller reconstruction errors than other classical matrix-completion methods. One of the key technical contributions of the paper lies in the identification of the low-rank submatrices from the input partially-observed matrices.

C. Mavroforakis, D. Erdos, M. Crovella and E. Terzi. Active Positive-Definite Matrix Completion. SIAM International Conference on Data Mining, SDM 2017. (supplementary material) and code and datasets

In many applications, e.g., recommender systems and biological data analysis, the datasets of interest are positive definite (PD) matrices. Such matrices are usually similarity matrices, obtained by the multiplication of a matrix of preferences or observations with its transpose. Oftentimes, such real-world matrices are missing many entries and a fundamental data-analysis task, known by the term PD-matrix completion, is the inference of these missing entries. In this paper, we introduce the active version of PD-matrix completion, in which we assume access to an oracle that, at a given cost, returns the value of an unobserved entry of the PD matrix. In this setting, we consider the following question: “given a fixed budget, which entries should we query so that the completion of the new matrix is much more indicative of the underlying data?”. The main contribution of the paper is the formalization of the above question as the ActivePDCompletion problem and the design of novel and effective algorithms for solving it in practice.

Other results

H. A. Chaudhari, M. Mathioudakis, E. Terzi: Markov Chain Monitoring. SIAM International Conference on Data Mining SDM 2018. code and datasets

G. Comarela, E. Terzi, M. Crovella: Detecting Unusually-Routed ASes: Methods and Applications. Internet Measurement Conference (IMC) 2016.

Acknowledgment

“This material is based upon work supported by the National Science Foundation under Grant No.1421759.”

Disclaimer

Disclaimer: “Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.”