Research Overview

Evimaria Terzi, 2018

In this era of "big data", the goal of my research is to find efficient
and principled methods for extracting useful knowledge from large data collections.
I usually start with data-analysis problems that arise in real
applications. Then, I focus on finding clean computational formulations to model these problems and also
on designing efficient algorithms for solving them. In terms of problem
formulations, I am interested in simply stated, yet technically challenging, questions. Technically, I am interested both in the design and analysis of algorithms and in their practical performance on real datasets.

My research agenda focuses on the development of principled and simple algorithms that are efficient, robust to noise, and able to handle partially-observed data.
Although I work on different application domains at different times, I always stay faithful to this vision.

Application-specific research themes

Team formation in expertise management systems.

The proliferation of expertise management systems (e.g., LinkedIn, Upwork, etc.) has led to research on team-formation problems.
Our work on team formation [1] was the first to address the problem of hiring a team of experts who have different skills and a good history of collaboration.
Ever since, there has been a lot of work in the data-mining community
around this topic. Our more recent work focuses on: (a) the development of team-formation methods that have their roots in practical observations made by social scientists [2], (b) the development of algorithms for identifying the team members that a company needs to hire, fire, or outsource in order to minimize its cost for projects that arrive in an online fashion [3], and (c) problems related to minimally altering existing, yet problematic, teams so that their operation becomes smoother [4].

My recent collaboration with BU Spark! and the Hariri Institute for Computing (HIC) has given my research on team formation an additional, more practical, dimension, as we are applying our team-formation algorithms to the teams formed in student hackathons. In this setting, there are students (usually from under-represented groups and minorities) who often fail to find teammates. Our algorithms
provide recommendations for participants before the event so that everyone has an initial set of people to
talk to the day of the event. At the moment, we apply our algorithms to hackathons in the Boston area,
but we plan to license our software to hackathons throughout the country.

In the future, I will consider questions like: "Given a team's performance, how can we evaluate the ability level of each individual team member?" or "Can we predict the performance of a newly-formed team?".
These questions are difficult because the performance of a team is not simply the sum of the abilities of its individual experts. Rather, it is determined by an unknown function, and learning this function is part of the task.



Related papers

This is the first paper in the data-mining literature that introduces the
problem of team formation in a network of experts. The key idea of the paper is to identify a set of experts who collectively have the skills required by a task and, at the same time, incur a small communication cost among them. The communication cost is measured on the social network defined among these experts (e.g., a network that captures how well every pair of experts can work together).
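
To make the flavor of this problem concrete, the following sketch shows one natural greedy heuristic: repeatedly add the expert who covers the most still-missing skills, breaking ties by proximity to the current team in the expert network. This is only an illustration of the problem setup, not the algorithm of [1]; the function name, the per-expert skill sets, and the use of a networkx graph with shortest-path distance as communication cost are assumptions made for the example.

    import networkx as nx

    def greedy_team(graph, expert_skills, task_skills):
        """Greedy sketch: cover the task's skills while keeping the team close
        in the expert network (communication cost = shortest-path distance)."""
        uncovered = set(task_skills)
        team = []
        while uncovered:
            best, best_key = None, None
            for expert, skills in expert_skills.items():
                if expert in team:
                    continue
                gain = len(uncovered & skills)
                if gain == 0:
                    continue
                # total distance to the experts already on the team
                dist = sum(nx.shortest_path_length(graph, expert, t) for t in team)
                key = (-gain, dist)  # prefer more new skills, then smaller distance
                if best is None or key < best_key:
                    best, best_key = expert, key
            if best is None:
                return None  # the pool cannot cover the task
            team.append(best)
            uncovered -= expert_skills[best]
        return team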



In recent years, the proliferation of online resumes and the need to
evaluate large populations of candidates for on-site and virtual teams have
led to a growing interest in the team-formation problem. Given a large pool
of candidates, the general problem requires the selection of a team of
experts to complete a given task. Surprisingly, while ongoing research has
studied numerous variations with different constraints, it has overlooked a
factor with a well-documented impact on team cohesion and performance: team
faultlines.

Despite existing work in the social sciences that measures and quantifies the faultline potential of existing (already-formed) teams, there is no work on forming teams with low faultline potential a priori. This is mostly due to the fact that existing faultline-potential measures are hard to optimize.
In this work, we meet this challenge with a new measure that
can be efficiently used for both faultline measurement and minimization. We
then use the measure to solve the problem of automatically partitioning a
large population into low-faultline teams. By introducing faultlines to the team-formation literature, our work creates exciting opportunities for algorithmic work on faultline optimization, as well as for work that studies the connections between faultlines and other influential team characteristics.
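
As a rough illustration of the partitioning step, the sketch below runs a simple local search that keeps swapping members across teams whenever the swap does not increase the total faultline score. The faultline measure itself is treated as a black-box function here; both its signature and the swap-based search are assumptions made for this example, not the measure or algorithm proposed in the paper.

    import random

    def partition_into_teams(members, attrs, k, faultline, iters=1000, seed=0):
        """Local-search sketch: start from a random balanced partition and keep
        any member swap that does not increase the total faultline score.
        `faultline(list_of_attribute_vectors)` is assumed to be provided."""
        rng = random.Random(seed)
        members = list(members)
        assert len(members) >= k
        rng.shuffle(members)
        teams = [members[i::k] for i in range(k)]  # balanced random split

        def total_score():
            return sum(faultline([attrs[m] for m in team]) for team in teams)

        best = total_score()
        for _ in range(iters):
            a, b = rng.sample(range(k), 2)
            i, j = rng.randrange(len(teams[a])), rng.randrange(len(teams[b]))
            teams[a][i], teams[b][j] = teams[b][j], teams[a][i]      # try a swap
            score = total_score()
            if score <= best:
                best = score                                          # keep it
            else:
                teams[a][i], teams[b][j] = teams[b][j], teams[a][i]  # undo it
        return teams, best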



Although freelancing work has grown substantially in recent years,
in part facilitated by a number of online labor marketplaces, traditional
forms of “in-sourcing” work continue to be the dominant form of employment. This means that, at least for the time being,
freelancing and salaried employment will continue to co-exist. In
this paper, we provide algorithms for outsourcing and hiring workers in a
general setting, where workers form a team and contribute
different skills to perform a task. We call this model team formation with outsourcing. In our model, tasks arrive in an online fashion:
neither the number nor the composition of the tasks are known
a-priori. At any point in time, there is a team of hired workers
who receive a fixed salary independently of the work they perform.
This team is dynamic: new members can be hired and existing
members can be fired, at some cost. Additionally, some parts of the
arriving tasks can be outsourced and thus completed by non-team
members, at a premium.

Our contribution is an efficient online
cost-minimizing algorithm for hiring and firing team members and
outsourcing tasks. We present theoretical bounds, obtained using a primal–dual scheme, proving that our algorithms have a logarithmic competitive ratio. We complement these results
with experiments using semi-synthetic datasets based on actual
task requirements and worker skills from three large online labor
marketplaces. To the best of our knowledge, this is the first paper to address this problem, and the first to provide both theoretical and practical justification for its solutions.
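
The full primal–dual algorithm is beyond the scope of this overview, but the core hire-versus-outsource tension can be conveyed with a ski-rental-style rule for a single skill: keep outsourcing until the accumulated outsourcing payments reach the hiring cost, then hire. The sketch below implements only this simplified rule; the names, the single-skill setting, and the cost model are assumptions for illustration, and the algorithm in the paper additionally handles teams, multiple skills, and firing decisions.

    def rent_or_buy_schedule(outsourcing_costs, salary_per_step, hiring_cost):
        """Single-skill sketch of the hire-vs-outsource trade-off: outsource the
        skill until the money spent outsourcing reaches the one-time hiring
        cost, then hire and pay a salary from that point on."""
        spent_outsourcing, total, hired, decisions = 0.0, 0.0, False, []
        for cost in outsourcing_costs:  # per-step cost of outsourcing the skill
            if hired:
                total += salary_per_step
                decisions.append("in-house")
            elif spent_outsourcing + cost >= hiring_cost:
                hired = True
                total += hiring_cost + salary_per_step
                decisions.append("hire")
            else:
                spent_outsourcing += cost
                total += cost
                decisions.append("outsource")
        return decisions, total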



In large organizations (e.g., companies, universities, etc.) individual
experts with different work habits are asked to work together in order to
complete projects or tasks. Oftentimes, the differences in the inherent work habits of these experts cause tension among them, which can prove detrimental for the organization’s performance and functioning. The
question we consider in this paper is the following: “can this tension be
reduced by providing incentives to individuals to change their work
habits?”

We are the first to formalize this question via the definition of the k-AlterHabit problem and to analyze its properties.
Although we show that k-AlterHabit is
NP-hard, we devise polynomial-time algorithms for solving it in
practice. Our algorithms are based on interesting connections that we draw
between our problem and other combinatorial problems. Our experimental
results demonstrate both the efficiency and the efficacy of our algorithmic
techniques on a collection of real data.

Urban data analysis

My previous work on unconventional types of recommender systems increased my interest in recommending routes in urban environments [3,4] and, subsequently, in developing algorithmic techniques for analyzing data that capture urban activity. Our recent work [1] has developed new algorithmic tools for identifying popular routes that city dwellers take, which helped us find city hotspots for both locals and tourists. Motivated by the widespread use of ride-hailing platforms (e.g., Uber or Lyft), we have developed a framework for recommending to drivers strategies that maximize their expected earnings [2]. Our methods for this problem yield significant improvements in drivers' earnings when compared to drivers that are not strategic or that simply chase price surges.
A key component of our framework is that it enables us to rigorously
reason about and analyze the sensitivity of our results to perturbations in the input data.

In the future, I will continue developing algorithmic techniques for analyzing data from different applications designed to improve various aspects of people's lives.



On-demand ride-hailing platforms like Uber and Lyft are helping reshape
urban transportation, by enabling car owners to become drivers for hire
with minimal overhead. Although there are many studies that consider
ride-hailing platforms holistically, e.g., from the perspective of supply
and demand equilibria, little emphasis has been placed on optimization for
the individual, self-interested drivers that currently comprise these
fleets. While some individuals drive opportunistically either as their
schedule allows or on a fixed schedule, we show that strategic behavior
regarding when and where to drive can substantially increase driver
income.

In this paper, we are the first to formalize the problem of devising a driver
strategy to maximize expected earnings, describe a series of dynamic
programming algorithms to solve these problems under different sets of
modeled actions available to the drivers, and exemplify the models and
methods on a large-scale simulation of driving for Uber in NYC. In our
experiments, we use a newly-collected dataset that combines the NYC taxi
rides dataset along with Uber API data, to build time-varying traffic and
payout matrices for a representative six-month time period in greater
NYC. From this input, we can reason about prospective itineraries and
payoffs. Moreover, the framework enables us to rigorously reason about and
analyze the sensitivity of our results to perturbations in the input
data. Among our main findings is that repositioning throughout the day is
key to maximizing driver earnings, whereas “chasing surge” is typically misguided and sometimes a costly move.
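
A minimal version of the dynamic program reads as follows: the state is a (time, zone) pair, and at every step the driver either serves a ride starting in the current zone or repositions to a reachable zone; the value function is filled in backwards from the end of the horizon. This sketch is a deliberately simplified, deterministic toy model written for this overview; the inputs (expected payoff and destination per zone and time step, and a reachability matrix that allows staying put) are assumptions, and the models in the paper additionally account for stochastic demand, travel times, and surge pricing.

    import numpy as np

    def best_driving_policy(payoff, destination, reachable, horizon, n_zones):
        """Finite-horizon DP sketch over (time, zone) states.
        payoff[t][z]      : expected earnings of serving a ride from zone z at time t
        destination[t][z] : zone where that ride ends
        reachable[z][z2]  : True if the driver can reposition from z to z2 in one
                            step (reachable[z][z] is assumed True, i.e. staying is allowed)"""
        value = np.zeros((horizon + 1, n_zones))
        action = np.empty((horizon, n_zones), dtype=object)
        for t in range(horizon - 1, -1, -1):
            for z in range(n_zones):
                # option 1: serve a ride starting in the current zone
                serve = payoff[t][z] + value[t + 1][destination[t][z]]
                # option 2: reposition (for free, in this toy model) to a reachable zone
                relocate, dest = max((value[t + 1][z2], z2)
                                     for z2 in range(n_zones) if reachable[z][z2])
                if serve >= relocate:
                    value[t][z], action[t][z] = serve, ("serve", destination[t][z])
                else:
                    value[t][z], action[t][z] = relocate, ("relocate", dest)
        return value, action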



The proliferation of online social networks and the spread of smart mobile
devices enable the collection of information related to a multitude of
users’ activities. These networks, where every node is associated with a
type of action and a frequency, are usually referred to as activity
networks. Examples of such networks include road networks, where the nodes
are intersections and the edges are road segments. Each node is associated
with a number of geolocated actions that users of an online platform took
in its vicinity. In these networks, we define a prize-collecting subgraph
to be a connected set of nodes, which exhibits high levels of activity, and
is compact, i.e., the nodes are close to each other.

The k-PCSubgraphs problem we address in this paper is defined as follows: given an activity
network and an integer k, identify k non-overlapping and connected
subgraphs of the network such that the nodes of each subgraph are close to
each other, and the total number of actions they are associated with is
high. We define and study two new variants of the k-PCSubgraphs
problem, where the subgraphs of interest are tours and paths. Since both
these problems are NP-hard, we provide approximate and heuristic algorithms
that run in time nearly linear in the number of edges. In our experiments,
we use real activity networks obtained by combining road networks and
projecting on them user activity from Twitter and Flickr. Our experimental
results demonstrate both the efficiency and the practical utility of our methods.
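
To give a feel for the heuristic side of this work, the sketch below grows k connected, high-activity node sets by repeatedly seeding at the most active unused node and absorbing the most active neighboring nodes up to a size budget. This is a generic seed-and-grow heuristic written for this overview, assuming a networkx graph and a per-node activity count; it returns connected node sets rather than the tour and path variants studied in the paper.

    import heapq
    import networkx as nx

    def k_prize_collecting_subgraphs(graph, activity, k, budget):
        """Seed-and-grow sketch: k times, seed at the highest-activity unused node
        and greedily absorb the most active neighboring node until `budget`
        nodes have been collected.  Each returned set is connected by construction."""
        used, subgraphs = set(), []
        for _ in range(k):
            remaining = [v for v in graph if v not in used]
            if not remaining:
                break
            seed = max(remaining, key=lambda v: activity[v])
            chosen = {seed}
            frontier = [(-activity[v], v) for v in graph.neighbors(seed) if v not in used]
            heapq.heapify(frontier)
            while frontier and len(chosen) < budget:
                _, v = heapq.heappop(frontier)
                if v in chosen or v in used:
                    continue
                chosen.add(v)
                for w in graph.neighbors(v):
                    if w not in chosen and w not in used:
                        heapq.heappush(frontier, (-activity[w], w))
            used |= chosen
            subgraphs.append(chosen)
        return subgraphs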





Social networks

The analysis of social-network data has been a recurring theme in my research. Currently, I am interested in questions related to network sparsification [1] and the identification of functional communities [2]. More recently, motivated by the rise of echo chambers, I have worked on the problem of modifying the network so that such phenomena (and their effects) are minimized [3].
I expect that these problems will become vital in the future and
my work on social networks will focus more and more on themes with such direct impact on society.

Cross-Cutting Research Themes

Active data mining

In many applications the data come in the form of aggregates, because their massive scale makes fine-grained data collection and storage intractable. In other applications, we have access to only a small percentage of the data points. Most of the time, one needs to analyze such incomplete or aggregated data in order to perform prediction or other data-analysis tasks (e.g., matrix completion or clustering).
Over the last few years, my research has focused on the development of techniques that appropriately query the data (e.g., in practice by asking users for specific feedback, or by obtaining specific measurements) in order to better perform a specific data-mining task [chaudhari18markov, golshan14unveiling, malmi17active, mavroforakis17active, ruchansky15matrix]. Although some of these papers are a bit older, this direction has become quite dominant in my research in recent years.

The end goal of this research direction is to develop a general framework for what I call active data mining. This framework will allow us to strategically query unseen data with the goal of optimizing the data-mining task at hand (e.g., clustering, community detection, etc.).
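
As one concrete instance of this idea, consider matrix completion: the sketch below actively chooses which entry to query next, using the crude rule of picking the unobserved entry whose row and column have been observed least, and then fits a truncated-SVD estimate of the matrix. The querying rule, the `oracle` interface, and the SVD-based fit are all assumptions made for this illustration; they are not the strategies developed in the cited papers.

    import numpy as np

    def active_matrix_completion(oracle, shape, rank, n_queries):
        """Active-querying sketch for matrix completion.  `oracle(i, j)` is
        assumed to return the true (i, j) entry when asked."""
        n, m = shape
        observed = np.zeros((n, m), dtype=bool)
        values = np.zeros((n, m))
        for _ in range(n_queries):
            # query the unobserved entry with the least-covered row + column
            coverage = observed.sum(axis=1, keepdims=True) + observed.sum(axis=0, keepdims=True)
            coverage = np.where(observed, np.inf, coverage)
            i, j = np.unravel_index(np.argmin(coverage), coverage.shape)
            values[i, j] = oracle(i, j)
            observed[i, j] = True
        # impute the rest with the observed mean, then truncate to a low rank
        fill = values[observed].mean() if observed.any() else 0.0
        filled = np.where(observed, values, fill)
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        return (u[:, :rank] * s[:rank]) @ vt[:rank, :]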

Robust data mining

In our recent work on ride-hailing platforms [chaudhari18putting], we developed a framework that
enabled us to rigorously reason about and analyze the sensitivity of our
results to perturbations in the input data, without ever
performing these perturbations. Motivated by this, I plan to further
investigate the application of similar ideas to
other data-mining problems. At a high level, this research thrust will address long-standing questions in data mining regarding the impact of noise on results. For this, we will build upon existing work on robust optimization, but we will focus on the robust versions of data-mining problems, which have not been studied rigorously before.

Application of primal-dual algorithms in data-mining applications

A very recent and exciting direction in my research has been the
application of primal-dual algorithms to a large number of data-mining
problems. Although many of the algorithms we use in the data-mining
community have their roots in some primal-dual algorithm, we – as a
data-mining community – rarely go back to the original linear-program
formulation of our problems. The goal of my future research in this domain is to do exactly that, as we have ample preliminary evidence that these algorithms can become very efficient and useful when they take into account the structure imposed by the practical problems to which we apply them.