(2014 - 2018) Award number: 1421759

PI Mark Crovella (crovella@bu.edu), co-PI Evimaria Terzi (evimaria@bu.edu)

Students: Natali Ruchansky (natalir@bu.edu), Dora Erdos (edori@bu.edu), Charalampos Mavroforakis (cmav@bu.edu), Giovanni Comarela (gcom@bu.edu), Harshal Chaudhari (harshal@bu.edu)

A common problem arising in science and engineering is that a dataset may only be partially measured. Often the complete dataset is naturally expressed as a matrix - for example, traffic flows in a city, gene expression across a set of treatments, or ratings of movies for users. Recently, a new solution strategy has emerged for the problem of inferring the missing entries in such datasets, but the power and limits of this new “structural” approach are not fully understood as yet. This project will develop a better understanding of this structural approach and apply that understanding to a number of important problems. In addition, the project will develop new course materials for data science education, and train both graduate and undergraduate students.

The matrix completion problem seeks to infer the missing entries of a matrix, under a low-rank assumption. To date, most matrix completion methods do not actually check whether the known entries contain sufficient information to complete the matrix. Recently, however, a new and very different class of “structural” methods have emerged, which analyze the information content of the visible matrix entries, and so can determine whether accurate completion is possible. From a data mining standpoint, the implications of structural matrix completion methods are largely unexplored. This project will investigate how to leverage structural matrix completion methods to attack a host of data analysis problems, including developing new methods for active matrix completion, new approaches to cross-validating matrix completion results, and new strategies for general matrix completion.

N. Ruchansky, M. Crovella, E. Terzi Matrix Completion with Queries. ACM SIGKDD 2015. (code)

In many applications, e.g., recommender systems and traffic
monitoring, the data comes in the form of a matrix that is
only partially observed and low rank. A fundamental data-analysis task for these datasets is
matrix completion, where
the goal is to accurately infer the entries missing from the
matrix. Even when the data satisfies the low-rank assumption, classical
matrix-completion methods may output
completions with significant error – in that the reconstructed
matrix differs significantly from the true underlying matrix.
Often, this is due to the fact that the information
contained in the observed entries is insufficient. In this work,
we address this problem by proposing an active version of
matrix completion, where queries can be made to the true
underlying matrix. Subsequently, we design
**Order&Extend**,
which is the first algorithm to unify a matrix-completion
approach and a querying strategy into a single algorithm.
**Order&Extend** is able identify and alleviate insufficient information by
judiciously querying a small number of additional entries.
In an extensive experimental evaluation on
real-world datasets, we demonstrate that our algorithm is
efficient and is able to accurately reconstruct the true matrix while asking only a small number of queries

N. Ruchansky, M. Crovella, E. Terzi Targeted Matrix Completion. Siam International Conference on Data Mining, SDM 2017. (code)

Matrix completion is a problem that arises in many data-analysis settings where the input consists of a partially-observed matrix
(e.g., recommender systems, traffic matrix analysis etc.).
Classical approaches to matrix completion assume that the input partially-observed matrix
is low rank. The success of these methods depends on the number of observed entries and the rank of the matrix; the larger the rank, the more entries need to be
observed in order to accurately complete the matrix. In this paper, we deal with matrices that are not necessarily
low rank themselves, but rather they contain low-rank submatrices. We propose
**Targeted**, which is a general
framework for completing such matrices. In this framework, we first extract the low-rank submatrices and then
apply a matrix-completion algorithm to these low-rank
submatrices as well as the remainder matrix separately.
Although for the completion itself we use state-of-the-art completion methods, our results demonstrate that
**Targeted** achieves significantly smaller reconstruction
errors than other classical matrix-completion methods.
One of the key technical contributions of the paper lies
in the identification of the low-rank submatrices from
the input partially-observed matrices.

C. Mavroforakis, D. Erdos, M. Crovella and E. Terzi. Active Positive-Definite Matrix Completion. SIAM International Conference on Data Mining, SDM 2017. (supplementary material) and code and datasets

In many applications, e.g., recommender systems and
biological data analysis, the datasets of interest are positive definite (PD) matrices. Such matrices are usually
similarity matrices, obtained by the multiplication of a matrix of preferences or observations with its transpose.
Oftentimes, such real-world matrices are missing many entries and a fundamental data-analysis task, known by
the term PD-matrix completion, is the inference of these missing entries.
In this paper, we introduce the active version of PD-matrix completion, in which we assume
access to an oracle that, at a given cost, returns the value of an
unobserved entry of the PD matrix. In this setting, we consider the following question: “given
a fixed budget, which entries should we query so that the completion of
the new matrix is much more indicative of the underlying data?”. The main contribution of
the paper is the formalization of the above question as the **ActivePDCompletion**
problem and the design of novel and effective algorithms for solving it in practice.

H. A. Chaudhari, M. Mathioudakis, E. Terzi: Markov Chain Monitoring. SIAM International Conference on Data Mining SDM 2018. code and datasets

G. Comarela, E. Terzi, M. Crovella: Detecting Unusually-Routed ASes: Methods and Applications. Internet Measurement Conference (IMC) 2016.

“This material is based upon work supported by the National Science Foundation under Grant No.1421759.”

Disclaimer: “Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.”