Evimaria Terzi, 2018

In this era of "big data," the goal of my research is to find efficient and principled methods for extracting useful knowledge from large data collections.

I usually start with data-analysis problems that arise in real applications. Then, I focus on finding clean computational formulations that model these problems and on designing efficient algorithms for solving them. In terms of problem formulations, I am interested in simply stated, yet technically challenging, questions. On the technical side, I am interested both in the design and analysis of algorithms and in their practical performance on real datasets.

My research agenda focuses on the development of principled and simple algorithms that are efficient, robust to noise, and able to handle partially observed data. Although at different times I work on different application domains, I always stay faithful to this vision.

The proliferation of expertise-management systems (e.g., LinkedIn, Upwork) has led to research on team-formation problems. Our work on team formation [1] was the first to address the problem of hiring a team of experts who have different skills and a good history of collaboration. Since then, there has been a lot of work in the data-mining community around this topic. Our more recent work on this topic focuses on: (a) the development of team-formation methods that have their roots in practical observations made by social scientists [2], (b) the development of algorithms for identifying which team members a company should hire or fire, and which tasks it should outsource, to achieve optimal cost for projects that arrive in an online fashion [3], and (c) problems related to the minimum alteration of existing, yet problematic, teams so that their operation becomes smoother [4].

My recent collaboration with BU Spark! and the Hariri Institute for Computing has given my research on team formation an additional, more practical, dimension, as we are applying our team-formation algorithms to the teams formed in student hackathons. In this setting, there are students (usually from under-represented groups and minorities) who often fail to find teammates. Our algorithms provide recommendations to participants before the event so that everyone has an initial set of people to talk to on the day of the event. At the moment, we apply our algorithms to hackathons in the Boston area, but we plan to license our software to hackathons throughout the country.

In the future, I will consider questions like: "Given a team's performance, how can we evaluate the ability level of each individual team member?" or "Can we predict the performance of a newly formed team?". The difficulty of these questions stems from the fact that the performance of a team is not a simple sum of the performances of its individual experts. Rather, it is defined by an *unknown* function, which will be part of the learning task.

[1] Theodoros Lappas, Kun Liu, Evimaria Terzi: Finding a team of experts in social networks. ACM SIGKDD 2009.

This is the first paper in the data-mining literature to introduce the problem of *team formation* in a network of experts. The key idea of the paper is to identify a set of experts who can complete a task (i.e., collectively they have the skills required by the task) and who, at the same time, have a small *communication cost* between them. The communication cost is measured in the social network defined among these experts (e.g., a network that captures how well every pair of experts can work together).
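To illustrate the flavor of this problem (this is a toy greedy heuristic on a hypothetical instance, not the paper's actual algorithms), one can repeatedly pick the expert who covers the most uncovered skills, preferring experts that are close, in the network, to those already on the team:

```python
# Toy instance (hypothetical data): each expert has a skill set, and
# dist[(a, b)] is the distance between two experts in the collaboration network.
experts = {
    "ana":   {"ml", "stats"},
    "bob":   {"ml"},
    "carol": {"systems", "viz"},
    "dan":   {"stats", "viz"},
}
dist = {("ana", "bob"): 1, ("ana", "carol"): 3, ("ana", "dan"): 1,
        ("bob", "carol"): 2, ("bob", "dan"): 2, ("carol", "dan"): 1}

def d(a, b):
    return 0 if a == b else dist.get((a, b), dist.get((b, a)))

def greedy_team(task):
    """Greedily cover the task's skills, penalizing communication cost."""
    team, uncovered = [], set(task)
    while uncovered:
        def score(e):
            gain = len(experts[e] & uncovered)      # new skills covered
            cost = sum(d(e, t) for t in team)       # distance to current team
            return (gain, -cost)
        best = max((e for e in experts if e not in team), key=score)
        if not experts[best] & uncovered:
            break  # remaining experts add no new skills: task cannot be covered
        team.append(best)
        uncovered -= experts[best]
    return team
```

On this instance, `greedy_team({"ml", "stats", "viz"})` picks "ana" and then "dan" rather than the skill-equivalent but more distant "carol", which is exactly the trade-off the communication-cost objective is meant to capture.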

[2] Sanaz Bahargam, Behzad Golshan, Theodoros Lappas, Evimaria Terzi: A Team-Formation Algorithm for Faultline Minimization. Journal of Expert Systems with Applications, 2018.

In recent years, the proliferation of online resumes and the need to evaluate large populations of candidates for on-site and virtual teams have led to a growing interest in the *team-formation* problem. Given a large pool of candidates, the general problem requires the selection of a team of experts to complete a given task. Surprisingly, while ongoing research has studied numerous variations with different constraints, it has overlooked a factor with a well-documented impact on team cohesion and performance: team faultlines.

Despite the existing work in the social sciences on measuring and quantifying the faultline potential of existing (already-formed) teams, there is no work on forming teams with low faultline potential a priori. This is mostly due to the fact that existing faultline-potential measures are hard to optimize. In this work, we meet this challenge with a new measure that can be efficiently used for both faultline measurement and minimization. We then use the measure to solve the problem of automatically partitioning a large population into low-faultline teams. By introducing faultlines to the team-formation literature, our work creates exciting opportunities for algorithmic work on faultline optimization, as well as for work that studies the connection of faultlines with other influential team characteristics.

[3] Aris Anagnostopoulos, Carlos Castillo, Adriano Fazzone, Stefano Leonardi, Evimaria Terzi: Algorithms for Hiring and Outsourcing in the Online Labor Market. ACM SIGKDD 2018.

Although freelancing work has grown substantially in recent years, in part facilitated by a number of online labor marketplaces, traditional forms of "in-sourcing" work continue to be the dominant form of employment. This means that, at least for the time being, freelancing and salaried employment will continue to co-exist. In this paper, we provide algorithms for outsourcing and hiring workers in a general setting, where workers form a team and contribute different skills to perform a task. We call this model *team formation with outsourcing*. In our model, tasks arrive in an online fashion: neither the number nor the composition of the tasks is known a priori. At any point in time, there is a team of hired workers who receive a fixed salary independently of the work they perform. This team is dynamic: new members can be hired and existing members can be fired, at some cost. Additionally, some parts of the arriving tasks can be outsourced and thus completed by non-team members, at a premium.

Our contribution is an efficient online cost-minimizing algorithm for hiring and firing team members and outsourcing tasks. We present theoretical bounds, obtained using a primal-dual scheme, proving that our algorithms have a logarithmic competitive ratio. We complement these results with experiments using semi-synthetic datasets based on actual task requirements and worker skills from three large online labor marketplaces. To the best of our knowledge, this is the first paper to address this problem, and thus the first to provide theoretical and practical justification for the proposed solutions.

[4] Behzad Golshan, Evimaria Terzi: Minimizing Tension in Teams. ACM CIKM 2017.

In large organizations (e.g., companies, universities, etc.), individual experts with different work habits are asked to work together in order to complete projects or tasks. Oftentimes, the differences in the inherent work habits of these experts cause tension among them, which can prove detrimental to the organization's performance and functioning. The question we consider in this paper is the following: "Can this tension be reduced by providing incentives to individuals to change their work habits?"

We are the first to formalize this question in the definition of the *k-AlterHabit* problem and to analyze its properties. Although we show that *k-AlterHabit* is NP-hard, we devise polynomial-time algorithms for solving it in practice. Our algorithms are based on interesting connections that we draw between our problem and other combinatorial problems. Our experimental results demonstrate both the efficiency and the efficacy of our algorithmic techniques on a collection of real data.

My previous work on unconventional types of recommender systems increased my interest in the recommendation of routes in urban environments [3,4] and, subsequently, in the development of algorithmic techniques for analyzing data that capture urban activity. Our recent work [2] has developed new algorithmic tools for identifying popular routes that city dwellers take, which helped us find city hotspots both for locals and for tourists. Motivated by the widespread use of ride-hailing platforms (e.g., Uber or Lyft), we have also developed a framework for recommending to drivers strategies that maximize their expected earnings [1]. Our methods allow for a significant improvement in the earnings of drivers when compared to naive drivers who are not strategic or who simply chase the price surge. A key component of our framework is that it enables us to rigorously reason about and analyze the sensitivity of our results to perturbations in the input data.

In the future, I will continue developing algorithmic techniques for analyzing data coming from different applications designed to improve different aspects of people's lives.

[1] Harshal A. Chaudhari, John W. Byers, Evimaria Terzi: Putting Data in the Driver’s Seat: Optimizing Earnings for On-Demand Ride-Hailing. ACM WSDM 2018.

On-demand ride-hailing platforms like Uber and Lyft are helping reshape urban transportation by enabling car owners to become drivers for hire with minimal overhead. Although there are many studies that consider ride-hailing platforms holistically, e.g., from the perspective of supply and demand equilibria, little emphasis has been placed on optimization for the individual, self-interested drivers that currently comprise these fleets. While some individuals drive opportunistically, either as their schedule allows or on a fixed schedule, we show that strategic behavior regarding when and where to drive can substantially increase driver income.

In this paper, we are the first to formalize the problem of devising a driver strategy to maximize expected earnings, describe a series of dynamic-programming algorithms to solve these problems under different sets of modeled actions available to the drivers, and exemplify the models and methods on a large-scale simulation of driving for Uber in NYC. In our experiments, we use a newly collected dataset that combines the NYC taxi-rides dataset with Uber API data to build time-varying traffic and payout matrices for a representative six-month time period in greater NYC. From this input, we can reason about prospective itineraries and payoffs. Moreover, the framework enables us to rigorously reason about and analyze the sensitivity of our results to perturbations in the input data. Among our main findings is that repositioning throughout the day is key to maximizing driver earnings, whereas "chasing the surge" is typically misguided and sometimes a costly move.
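A minimal sketch of the dynamic-programming idea, on hypothetical data (the actual paper uses richer time-varying traffic and payout matrices): the state is a (time step, zone) pair, and at each step the driver either serves a ride in the current zone or pays a cost to reposition.

```python
# Hypothetical inputs: T time steps, a few zones, the expected payout for
# serving a ride in zone z at step t, and a flat repositioning cost.
T = 4
zones = ["uptown", "midtown", "downtown"]
payout = {t: {"uptown": 5.0, "midtown": 8.0, "downtown": 6.0} for t in range(T)}
move_cost = 2.0

def best_earnings():
    """Backward induction: V[t][z] = max expected earnings from step t on,
    starting in zone z. The driver either serves locally or repositions."""
    V = {T: {z: 0.0 for z in zones}}
    for t in reversed(range(T)):
        V[t] = {}
        for z in zones:
            stay = payout[t][z] + V[t + 1][z]                       # serve a ride here
            move = max(V[t + 1][z2] - move_cost for z2 in zones if z2 != z)
            V[t][z] = max(stay, move)                               # or reposition
    return V[0]
```

Even on this tiny instance, the optimal value for a driver starting in "uptown" comes from repositioning toward the higher-payout zone rather than serving locally, mirroring the paper's finding that repositioning, not surge-chasing, drives earnings.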

[2] Sofia Maria Nikolakaki, Charalampos Mavroforakis, Alina Ene, Evimaria Terzi: Tours and Paths in Activity Networks. WWW 2018.

The proliferation of online social networks and the spread of smart mobile devices enable the collection of information related to a multitude of users' activities. These networks, where every node is associated with a type of action and a frequency, are usually referred to as activity networks. Examples of such networks include road networks, where the nodes are intersections and the edges are road segments. Each node is associated with a number of geolocated actions that users of an online platform took in its vicinity. In these networks, we define a prize-collecting subgraph to be a connected set of nodes which exhibits high levels of activity and is compact, i.e., the nodes are close to each other.

The *k-PCSubgraphs* problem we address in this paper is defined as follows: given an activity network and an integer *k*, identify *k* non-overlapping and connected subgraphs of the network such that the nodes of each subgraph are close to each other and the total number of actions they are associated with is high. We define and study two **new** variants of the *k-PCSubgraphs* problem, in which the subgraphs of interest are tours and paths. Since both of these problems are NP-hard, we provide approximate and heuristic algorithms that run in time nearly linear in the number of edges. In our experiments, we use real activity networks obtained by combining road networks with user activity projected onto them from Twitter and Flickr. Our experimental results demonstrate both the efficiency and the practical utility of our methods.
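The path variant of the problem can be illustrated with a toy heuristic (this is only a sketch on hypothetical data, not one of the paper's approximation algorithms): grow a simple path of bounded length, always extending toward the unvisited neighbor with the highest activity.

```python
# Hypothetical activity network: adjacency lists plus per-node activity counts
# (e.g., the number of geotagged posts near each road intersection).
graph = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"],
         "d": ["b", "e"], "e": ["d"]}
activity = {"a": 1, "b": 5, "c": 2, "d": 4, "e": 1}

def greedy_path(start, budget):
    """Grow a simple path of at most `budget` edges, moving at each step to
    the unvisited neighbor with the highest activity (a toy heuristic)."""
    path, node = [start], start
    for _ in range(budget):
        candidates = [n for n in graph[node] if n not in path]
        if not candidates:
            break  # dead end: every neighbor is already on the path
        node = max(candidates, key=activity.get)
        path.append(node)
    # Return the path together with the total activity ("prize") it collects.
    return path, sum(activity[n] for n in path)
```

The compactness constraint shows up here as the edge budget: the heuristic trades off collected activity against path length, which is the tension at the heart of the prize-collecting formulation.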

[3] Esther Galbrun, Konstantinos Pelechrinis, Evimaria Terzi: Urban navigation beyond shortest route: The case of safe paths. Inf. Syst. 57: 160-171 (2016).

[4] Aristides Gionis, Theodoros Lappas, Konstantinos Pelechrinis, Evimaria Terzi: Customized tour recommendations in urban areas. ACM WSDM 2014.

The analysis of social-network data has been a recurring theme in my research. Currently, I am interested in questions related to network sparsification [1] and the identification of functional communities [2]. More recently, motivated by the rise of echo chambers, I have worked on the problem of modifying a network so that such phenomena (and their effects) are minimized [3]. I expect that these problems will become vital in the future and that my work on social networks will focus more and more on themes with such direct impact on society.

[1] Aristides Gionis, Polina Rozenshtein, Nikolaj Tatti, Evimaria Terzi: Community-aware network sparsification. SIAM Data Mining Conference (SDM) 2017.

[2] Esther Galbrun, Behzad Golshan, Aristides Gionis, Evimaria Terzi: Finding low-tension communities. SIAM Data Mining Conference (SDM) 2017.

[3] Antonis Matakos, Evimaria Terzi, Panayiotis Tsaparas: Measuring and moderating opinion polarization in social networks. Data Min. Knowl. Discov. 31(5): 1480-1505 (2017).

In many applications, the data come in the form of aggregates, as massive scale makes fine-grained data collection and storage intractable. In other applications, we have access to only a small percentage of the data points. Most of the time, one needs to analyze such incomplete or aggregated data, either for prediction or for other data-analysis tasks (e.g., matrix completion or clustering).

Over the last years, my research has focused on the development of techniques that appropriately query the data (in practice, e.g., by asking for specific feedback from users or by obtaining specific measurements) in order to better perform a specific data-mining task [Chaudhari et al., 2018; Golshan et al., 2014; Malmi et al., 2017; Mavroforakis et al., 2017; Ruchansky et al., 2015]. Although some of these papers are a bit older, this direction has become quite dominant in my recent research.

The end goal of this research direction is to develop a generalized framework for what I call *active data mining*. This framework will allow us to strategically query unseen data with the goal of optimizing the data-mining task at hand (e.g., clustering, community detection, etc.).

In our recent work on ride-hailing platforms [Chaudhari et al., 2018], we developed a framework that enabled us to rigorously reason about and analyze the sensitivity of our results to perturbations in the input data, without ever performing these perturbations. Motivated by this, I plan to further investigate the application of similar ideas to other data-mining problems. On a high level, this research thrust will address long-standing questions in data mining with respect to the impact of noise on the results. For this, we will build upon existing work on robust optimization, but we will focus on the robust versions of data-mining problems, which have not been studied rigorously before.

A very recent and exciting direction in my research has been the application of primal-dual algorithms to a large number of data-mining problems. Although many of the algorithms we use in the data-mining community have their roots in some primal-dual algorithm, we, as a community, rarely go back to the original linear-programming formulations of our problems. The goal of my future research in this domain is to do exactly that, as we have ample preliminary evidence that these algorithms can become very efficient and useful when they take into consideration the structure imposed by the practical problems we use them for.