[Download CV]

  • Office: MCS 227F
  • Email: vkalavri@bu.edu
  • Office hours (Fall 2021): TBD

Vasiliki (Vasia) Kalavri

Assistant Professor
Department of Computer Science, Boston University

About me

My research interests include distributed stream processing and large-scale graph analytics.

Before coming to BU, I was a member of the Systems Group at ETH Zurich, where I worked on Strymon, a system for predictive datacenter analytics. In September 2017 I was awarded the ETH Zurich Postdoctoral Fellowship for my research project "Automatic Scaling of Distributed Streaming Computations Using Graph Analytics on Real-Time Monitoring Data". I received my PhD from KTH, Stockholm, and UCL, Belgium, where I was admitted to a double doctoral program as an EMJD-DC fellow. My thesis, "Performance Optimization Techniques and Tools for Distributed Graph Processing", received the IBM Innovation Award 2017. During my PhD studies, I also spent time at DIMA TU Berlin, Telefonica Research Barcelona, and data Artisans.


Apache Flink Book

Fabian Hueske and I have recently published Stream Processing with Apache Flink with O'Reilly Media. If you read it, please leave a review or send us feedback and errata.


Prospective students

I'm recruiting motivated PhD students to join my research lab in Fall 2021. The application deadline is December 15 and GRE scores are not required. The Graduate School of Arts & Sciences offers application fee waivers for first-generation applicants and women. Apply here.

News

19 May 2021

Co-Chairing ACM SIGMOD'22 SRC

I am excited to serve as a co-chair of the SIGMOD Student Research Competition together with Yongjoo Park.

12 March 2021

The CASP Systems Lab has a website!

After an eventful 1st year, my research lab, Complex Analytics & Scalable Processing, finally has a website. You can also follow us on Twitter @CASPSystems.

2 February 2021

Secrecy: Secure Collaborative Analytics on Secret-Shared Data [arxiv preprint]

Secrecy is a new MPC framework for relational collaborative analytics.

1 December 2020

Area Chair for JSys

JSys, the Journal of Systems Research, is a new diamond open-access journal. I am delighted to serve as the first Area Chair for the Streaming Systems track.

11 November 2020

New article on the future of graph processing [arxiv preprint]

After a wonderful Dagstuhl seminar on Big Graph Processing Systems, our article on the future of graph processing research will appear in the Communications of the ACM.

3 August 2020

New Survey Paper Available

Our paper "A Survey on the Evolution of Stream Processing Systems" is now available on arXiv.


14 July 2020

Outstanding New Research Direction Award at HotStorage'20

Our USENIX HotStorage'20 paper "In support of workload-aware streaming state management" won the Outstanding New Research Direction Award and was a finalist for the Best Presentation Award.

Research

My research focuses on various aspects of (distributed) data-centric systems.
More recently, I have been working on self-managed stream processing systems and graph streaming systems.

Self-managed streaming systems

Stream processing research has come a long way and streaming systems have matured considerably since their invention, almost three decades ago. What will the next generation of stream processing systems look like? This is the question I have been working on answering with my most recent and ongoing work. We envision a next generation of stream processing systems that are not only scalable and reliable, but also capable of self-management and automatic reconfiguration without downtime.


Performance analysis and modeling

As stream processing applications are long-running, it is only a matter of time before any initial configuration becomes out of tune. If the system is not capable of adapting to workload changes, this might lead to data loss, idle resources, and SLO violations. To avoid such situations, we need to continuously monitor system operation, identify changing conditions, and react. See our recent work on understanding the performance of streaming dataflows by generalizing online critical path analysis.


Automatic optimization and reconfiguration

One of the biggest challenges streaming systems face is dealing with variance in input workloads. Contrary to batch processing and database management systems, a stream processing system has no control over the stream arrival rate (or order). How can streaming systems continuously satisfy QoS without wasting resources? Instead of using simplified predictive system models and externally observed noisy metrics, we proposed a white-box approach. DS2 is an automatic scaling controller that leverages system instrumentation and operator dependencies to extract accurate metrics and provide automatic elasticity with accuracy, stability, performance, and safety.
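
As a rough illustration of scaling from instrumentation-based rates rather than externally observed metrics, here is a minimal Python sketch. It is not DS2's implementation: the names (OperatorMetrics, required_parallelism), the restriction to a linear chain of operators, and all numbers are hypothetical assumptions. The sketch assumes each operator reports how many records it processed and how much time it spent doing useful work (excluding time spent waiting), and it estimates the parallelism each operator would need to keep up with a target source rate.

```python
import math
from dataclasses import dataclass


@dataclass
class OperatorMetrics:
    """Hypothetical per-operator metrics collected over one monitoring window."""
    name: str
    records_in: int        # records processed, summed over all instances
    records_out: int       # records emitted, summed over all instances
    useful_seconds: float  # time spent processing (not waiting), summed over instances


def required_parallelism(chain, source_rate):
    """Estimate the parallelism per operator for a linear chain of operators.

    chain: operators in dataflow order (assumption: no branches or joins).
    source_rate: target input rate in records per second.
    """
    plan = {}
    input_rate = source_rate
    for op in chain:
        # Records one instance can process per second of useful work.
        true_rate = op.records_in / op.useful_seconds
        plan[op.name] = max(1, math.ceil(input_rate / true_rate))
        # Propagate the expected rate downstream using the observed selectivity.
        input_rate *= op.records_out / op.records_in
    return plan


# Made-up measurements for a three-operator pipeline.
chain = [
    OperatorMetrics("parse", records_in=10000, records_out=10000, useful_seconds=20.0),
    OperatorMetrics("filter", records_in=10000, records_out=2500, useful_seconds=5.0),
    OperatorMetrics("aggregate", records_in=2500, records_out=2500, useful_seconds=25.0),
]
plan = required_parallelism(chain, source_rate=2000)
print(plan)
```

Because the rates are measured against useful time only, an operator that is idle or blocked by a slow neighbor is not mistaken for a slow operator, which is the intuition behind preferring instrumentation over coarse external metrics.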


Workload-aware state management

Distributed instrumentation enables a range of reconfiguration decisions and allows stream processors to manage their own resource allocation. This model relies on an external controller that continuously monitors the system’s performance and sends control commands to the stream processor’s cluster manager module. There are, however, additional optimization opportunities if we look at how stream processing systems perform fundamental internal operations, such as state management. As streaming dataflow operators are instantiated once and are long-running, their access patterns and state size bounds are largely known in advance (or can be learned). We recently proposed workload-aware state management: using configurable state stores with support for different layouts and data types, and leveraging knowledge about operator state characteristics.

Graph streaming systems

Continuous analysis of graph streams is an emerging application area, where events indicate edge and vertex additions, deletions, and modifications. Even though there exist many specialized systems for dynamic graph processing, modern stream processors do not inherently support graph streaming use-cases. Graph streaming computations are challenging to implement with today’s stream processors because graph streaming algorithms do not nicely fit the dataflow model. Dataflows push data through a series of operators that apply transformations until they produce the end result. Such a model is not suitable for graph computations, which instead require multiple passes over the graph state. This challenge can be addressed by either using a cyclic dataflow model or by implementing single-pass graph streaming algorithms. See our recent survey preprint for an overview of the area.
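
To make the single-pass option concrete, here is a minimal Python sketch of one well-known example: maintaining connected components over a stream of edge additions with a union-find structure. It is an illustration only and is not taken from the survey; the class name StreamingComponents and the sample stream are hypothetical. Each edge is seen once and then discarded, so only a compact summary (one parent pointer per vertex) is kept instead of the full graph.

```python
class StreamingComponents:
    """Connected components over an edge-addition stream (single pass)."""

    def __init__(self):
        self.parent = {}  # vertex -> parent pointer in the union-find forest

    def _find(self, v):
        # Register unseen vertices as their own component.
        self.parent.setdefault(v, v)
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every vertex on the path directly at the root.
        while self.parent[v] != root:
            self.parent[v], v = root, self.parent[v]
        return root

    def add_edge(self, u, v):
        """Process one edge event; the edge itself is not stored."""
        ru, rv = self._find(u), self._find(v)
        if ru != rv:
            self.parent[ru] = rv

    def connected(self, u, v):
        return self._find(u) == self._find(v)

    def num_components(self):
        return len({self._find(v) for v in list(self.parent)})


# Made-up edge stream.
cc = StreamingComponents()
for u, v in [(1, 2), (3, 4), (2, 3), (5, 6)]:
    cc.add_edge(u, v)

print(cc.connected(1, 4), cc.connected(1, 5), cc.num_components())
```

This sketch handles edge additions only. Supporting deletions is considerably harder, because the summary no longer records which edges held a component together.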


The field of graph streaming systems is still in its infancy and there exist many fundamental problems to solve. While there is abundant work on graph streaming algorithms, most of this theoretical work is not readily applicable to the streaming model used in practice by streaming systems. See our recent experimental study on streaming graph partitioning algorithms for an example.

For a full list of publications, visit my Google Scholar profile.

Teaching

Fall 2021: CAS CS210 Computer Systems

NOTE: The class is full. Fill out this form to sign up for the waitlist.

Spring 2021: CS 591 K1 Data Stream Processing and Analytics

Fall 2020: CAS CS210 Computer Systems

NOTE: The class is full. Fill out this form to sign up for the waitlist.

Spring 2020: CS 591 K1 Data Stream Processing and Analytics

Spring 2019: Data Stream Processing and Analytics (ETH Zurich)

Summer Schools and Conference Tutorials

SERVICE

Conferences, workshops, competitions

VLDB 2022, HAOC'21 (PC Member), GRADES-NDA 2021 (co-Chair), ACM DEBS 2021 (PC Member), USENIX ATC 2021 (PC Member), ACM/IFIP Middleware 2021 (PC Member), IEEE ICDE 2021 (PC Member), ACM SIGMOD 2020 Student Research Competition (Judge), EuroSys 2021 (AMA Session coordinator), ACM DEBS 2020 (PC Member), ICDE 2020 (Demonstration Track), EDBT 2020 (Demonstration Track), CCGrid 2019 (Applications and Data Science track co-Chair), OPODIS 2018 (PC Member), Middleware Doctoral Symposium (ACM/IFIP Middleware 2020), GRADES-NDA 2020 (co-located with SIGMOD 2020), USENIX HotStorage 2020, DBPL 2019 (co-located with PLDI 2019), GRADES-NDA 2019 (co-located with SIGMOD 2019), DBTest 2018 (co-located with SIGMOD 2018), GRADES-NDA 2018 (co-located with SIGMOD 2018), GABB 2018 (co-located with IPDPS 2018), GABB 2017 (co-located with IPDPS 2017), DEEM 2017 (co-located with SIGMOD 2017).

Industrial Conferences

Flink Forward Berlin 2019, Flink Forward San Francisco 2017, Berlin Buzzwords 2017, Flink Forward Berlin 2016, Berlin Buzzwords 2016

TALKS

From data stream management to distributed dataflows and beyond at North East Database Day 2020. [slides]

Three steps is all you need: fast, accurate, automatic scaling decisions for distributed streaming dataflows at USENIX OSDI 2018. [slides] [audio]

Predictive Datacenter Analytics with Strymon at QCon San Francisco 2017. [slides] [video]

Online performance analysis of distributed dataflow systems at O'Reilly Velocity London 2017. [slides] [video]

Graphs as Streams: Rethinking Graph Processing in the Streaming Era at Berlin Buzzwords 2016. [slides] [video]

Demystifying Distributed Graph Processing at dotScale 2016. [slides] [video]