My research interests include distributed stream processing and large-scale graph analytics.
Before coming to BU, I was a member of the Systems Group at ETH Zurich, where I worked on Strymon, a system for predictive datacenter analytics. In September 2017 I was awarded the ETH Zurich Postdoctoral Fellowship for my research project "Automatic Scaling of Distributed Streaming Computations Using Graph Analytics on Real-Time Monitoring Data". I received my PhD from KTH, Stockholm, and UCL, Belgium, where I was admitted to a double doctoral program as an EMJD-DC fellow. My thesis, "Performance Optimization Techniques and Tools for Distributed Graph Processing", received the IBM Innovation Award 2017. During my PhD studies, I also spent time at DIMA TU Berlin, Telefonica Research Barcelona, and data Artisans.
Fabian Hueske and I have recently published Stream Processing with Apache Flink by O'Reilly Media. If you read it, please leave a review or send us feedback and errata.
I'm recruiting motivated PhD students to join my research lab in Fall 2021. The application deadline is December 15 and GRE scores are not required. The Graduate School of Arts & Sciences offers application fee waivers for first-generation applicants and women. Apply here.
After an eventful first year, my research lab, Complex Analytics & Scalable Processing, finally has a website. You can also follow us on Twitter @CASPSystems.
Secrecy is a new MPC framework for relational collaborative analytics.
JSys, the Journal of Systems Research, is a new diamond open-access journal. I am delighted to serve as the first Area Chair for the Streaming Systems track.
After a wonderful Dagstuhl seminar on Big Graph Processing Systems, our article on the future of graph processing research will appear in the Communications of the ACM.
Our paper "A Survey on the Evolution of Stream Processing Systems" is now available on arXiv.
Our USENIX HotStorage'20 paper "In support of workload-aware streaming state management" won the Outstanding New Research Direction Award and was a finalist for the Best Presentation Award.
I contributed an article to the SIGOPS Blog where I discuss the evolution and future of stream processing systems.
We will be presenting the tutorial "Beyond Analytics: the Evolution of Stream Processing Systems" at ACM SIGMOD'20.
[Tutorial website] [Paper].
Stream processing research has come a long way and streaming systems have matured considerably since their invention almost three decades ago. What will the next generation of stream processing systems look like? This is the question I have been working to answer with my most recent and ongoing work. We envision a next generation of stream processing systems that are not only scalable and reliable, but also capable of self-management and automatic reconfiguration without downtime.
As stream processing applications are long-running, it is only a matter of time before any initial configuration becomes out of tune. If the system cannot adapt to workload changes, this can lead to data loss, idle resources, and SLO violations. To avoid such situations, we need to continuously monitor system operation, identify changing conditions, and react. See our recent work on understanding the performance of streaming dataflows by generalizing online critical path analysis.
One of the biggest challenges streaming systems face is dealing with variance in input workloads. Contrary to batch processing and database management systems, a stream processing system has no control over the stream arrival rate (or order). How can streaming systems continuously satisfy QoS without wasting resources? Instead of using simplified predictive system models and externally observed noisy metrics, we proposed a white-box approach. DS2 is an automatic scaling controller that leverages system instrumentation and operator dependencies to extract accurate metrics and provide automatic elasticity with accuracy, stability, performance, and safety.
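To make the white-box idea concrete, here is a minimal sketch of a scaling decision driven by instrumentation and operator dependencies. It is a simplified illustration, not DS2's actual implementation. It assumes that each operator instance reports how many records it processed and how much time it spent doing useful work (as opposed to waiting), that each operator forwards its full output to every downstream operator, and that per-operator selectivity is known. The names `true_rate`, `plan_parallelism`, and the example dataflow are hypothetical.

```python
import math
from collections import defaultdict, deque


def true_rate(records, useful_seconds):
    """Records per second of useful (non-waiting) time for one operator instance."""
    return records / useful_seconds


def plan_parallelism(edges, sources, proc_rate, selectivity):
    """Estimate how many instances each non-source operator needs.

    edges:       list of (upstream, downstream) operator names forming a DAG
    sources:     {source operator: observed output rate in records/s}
    proc_rate:   {operator: true processing rate of ONE instance in records/s}
    selectivity: {operator: output records produced per input record}
    """
    downstream = defaultdict(list)
    indegree = defaultdict(int)
    for up, down in edges:
        downstream[up].append(down)
        indegree[down] += 1

    target_in = defaultdict(float)  # input rate each operator must sustain
    out_rate = dict(sources)        # output rate each operator would then produce
    plan = {}
    ready = deque(sources)          # visit operators in topological order
    while ready:
        op = ready.popleft()
        if op not in sources:
            plan[op] = max(1, math.ceil(target_in[op] / proc_rate[op]))
            out_rate[op] = target_in[op] * selectivity[op]
        for nxt in downstream[op]:
            target_in[nxt] += out_rate[op]
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return plan


# Hypothetical three-operator dataflow: source -> flatmap -> count
plan = plan_parallelism(
    edges=[("source", "flatmap"), ("flatmap", "count")],
    sources={"source": 1000.0},
    proc_rate={"flatmap": true_rate(500, 2.0), "count": 400.0},
    selectivity={"flatmap": 2.0, "count": 1.0},
)
print(plan)
```

Because the rates are measured over useful time only, an instance that is idle or blocked on a slow downstream operator does not look slower than it really is, which is what lets a single traversal of the dataflow size every operator at once in this sketch.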
Distributed instrumentation enables a set of reconfiguration decisions and lets stream processors manage their own resource allocation. This model relies on an external controller that continuously monitors the system’s performance and sends control commands to the stream processor’s cluster manager module. There are, however, additional optimization opportunities if we look at how stream processing systems perform fundamental internal operations, such as state management. As streaming dataflow operators are instantiated once and are long-running, their access patterns and state size bounds are largely known in advance (or can be learned). We recently proposed workload-aware state management: using configurable state stores with support for different layouts and data types, and leveraging knowledge about operator state characteristics.
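As a purely illustrative sketch of what "workload-aware" could mean in practice, the snippet below picks a state store layout from a declared operator state profile. Everything in it is an assumption made for illustration: the `Access` categories, the `StateProfile` fields, the store names, and the selection rule are hypothetical and do not describe the design in our paper or any existing system.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    # Hypothetical access-pattern categories for operator state
    APPEND_THEN_SCAN = "append_then_scan"    # e.g. buffering events until a window fires
    READ_MODIFY_WRITE = "read_modify_write"  # e.g. a running aggregate per key


@dataclass(frozen=True)
class StateProfile:
    access: Access
    expected_bytes: int  # known or learned bound on the state size


def choose_store(profile, memory_budget_bytes):
    """Pick a (hypothetical) store layout that matches the operator's state profile."""
    fits_in_memory = profile.expected_bytes <= memory_budget_bytes
    if profile.access is Access.APPEND_THEN_SCAN:
        return "in-memory append buffer" if fits_in_memory else "on-disk append log"
    return "in-memory hash map" if fits_in_memory else "on-disk key-value store"


window_buffer = StateProfile(Access.APPEND_THEN_SCAN, expected_bytes=8 * 2**30)
running_count = StateProfile(Access.READ_MODIFY_WRITE, expected_bytes=64 * 2**20)
budget = 2**30  # 1 GiB of memory per operator instance, chosen arbitrarily
print(choose_store(window_buffer, budget))
print(choose_store(running_count, budget))
```

The point of the sketch is only that the decision is made per operator from information available before or during execution, rather than using one general-purpose store for all state.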
Continuous analysis of graph streams is an emerging application area, where events indicate edge and vertex additions, deletions, and modifications. Even though there exist many specialized systems for dynamic graph processing, modern stream processors do not inherently support graph streaming use cases. Graph streaming computations are challenging to implement with today’s stream processors because graph streaming algorithms do not fit the dataflow model well. Dataflows push data through a series of operators that apply transformations until they produce the end result. Such a model is not suitable for graph computations, which instead require multiple passes over the graph state. This challenge can be addressed by either using a cyclic dataflow model or by implementing single-pass graph streaming algorithms. See our recent survey preprint for an overview of the area.
The field of graph streaming systems is still in its infancy and there exist many fundamental problems to solve. While there is abundant work on graph streaming algorithms, most of this theoretical work is not readily applicable to the streaming model used in practice by streaming systems. See our recent experimental study on streaming graph partitioning algorithms for an example.
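As a small example of the single-pass option, the sketch below maintains connected components over a stream of edge additions using a union-find (disjoint-set) summary. Each edge is processed once and then discarded, so the graph itself is never stored or revisited. This is a standard textbook technique shown for illustration, not code from any of the systems mentioned here; it handles edge additions only, and supporting deletions would need a different summary. The class name `StreamingComponents` is hypothetical.

```python
class StreamingComponents:
    """Connected components over a stream of edge additions, in one pass."""

    def __init__(self):
        self.parent = {}  # vertex -> parent in the union-find forest

    def _find(self, v):
        self.parent.setdefault(v, v)  # first time we see v, it is its own root
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every vertex on the path directly at the root
        while self.parent[v] != root:
            self.parent[v], v = root, self.parent[v]
        return root

    def add_edge(self, u, v):
        """Process one edge event; the edge does not need to be retained."""
        root_u, root_v = self._find(u), self._find(v)
        if root_u != root_v:
            self.parent[root_u] = root_v

    def connected(self, u, v):
        return self._find(u) == self._find(v)

    def component_count(self):
        return len({self._find(v) for v in list(self.parent)})


components = StreamingComponents()
for edge in [(1, 2), (3, 4), (2, 3), (5, 6)]:
    components.add_edge(*edge)
print(components.connected(1, 4), components.component_count())
```

The state kept by the operator is proportional to the number of vertices rather than the number of edges, which is what makes this kind of algorithm a natural fit for an acyclic dataflow.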
NOTE: The class is full. Fill out this form to sign up for the waitlist.
VLDB 2022, HAOC'21 (PC Member), GRADES-NDA 2021 (co-Chair), ACM DEBS 2021 (PC Member), USENIX ATC 2021 (PC Member), ACM/IFIP Middleware 2021 (PC Member), IEEE ICDE 2021 (PC Member), ACM SIGMOD 2020 Student Research Competition (Judge), EuroSys 2021 (AMA Session coordinator), ACM DEBS 2020 (PC Member), ICDE 2020 (Demonstration Track), EDBT 2020 (Demonstration Track), CCGrid 2019 (Applications and Data Science track co-Chair), OPODIS 2018 (PC member), Middleware Doctoral Symposium (ACM/IFIP Middleware 2020), GRADES-NDA 2020 (co-located with SIGMOD 2020), USENIX HotStorage 2020, DBPL 2019 (co-located with PLDI 2019), GRADES-NDA 2019 (co-located with SIGMOD 2019), DBTest 2018 (co-located with SIGMOD 2018), GRADES-NDA 2018 (co-located with SIGMOD 2018), GABB 2018 (co-located with IPDPS 2018), GABB 2017 (co-located with IPDPS 2017), DEEM 2017 (co-located with SIGMOD 2017).
IEEE Transactions on Parallel and Distributed Systems, VLDB Journal, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Computers
Flink Forward Berlin 2019, Flink Forward San Francisco 2017, Berlin Buzzwords 2017, Flink Forward Berlin 2016, Berlin Buzzwords 2016
From data stream management to distributed dataflows and beyond at North East Database Day 2020. [slides]
Three steps is all you need: fast, accurate, automatic scaling decisions for distributed streaming dataflows at USENIX OSDI 2018. [slides] [audio]
Predictive Datacenter Analytics with Strymon at QCon San Francisco 2017. [slides] [video]
Online performance analysis of distributed dataflow systems at O'Reilly Velocity London 2017. [slides] [video]
Graphs as Streams: Rethinking Graph Processing in the Streaming Era at Berlin Buzzwords 2016. [slides] [video]
Demystifying Distributed Graph Processing at dotScale 2016. [slides] [video]