Manos Athanassoulis publications

The Need for a New I/O Model

Tarikul Islam Papon, Manos Athanassoulis

Conference Paper Proceedings of the Conference on Innovative Data Systems Research (CIDR), 2021 (Abstract - to appear)

Abstract

Data-intensive applications performance is typically bounded by the time needed to transfer data through the storage and memory hierarchy. As a result, measuring and modeling disk I/O accesses is used as a proxy to performance. The traditional I/O model considers a two-level memory hierarchy with a fast internal memory of bounded size (memory) and a slow unbounded external memory (storage), which are both divided into fixed-size blocks. Any computation requires to have the corresponding data blocks in memory. Accessing storage is typically orders of magnitude slower than accessing memory, thus, the traditional I/O model considers only storage accesses. This modeling approach closes described reality when two key underlying assumptions hold: (i) disk reads and writes have similar cost, and (ii) applications can perform one I/O at a time. However, those two assumptions are not true for solid-state disks (SSD). The mismatch between the I/O model and contemporary devices is attributed to the fact that it was developed for hard disk drives (HDD). HDDs have symmetric read-write performance that is dominated by the seek time and rotational delay. The mechanical components of HDDs, further, do not allow them to serve multiple concurrent requests, since each request has to individually go through the costly mechanical movement.

Lethe: A Tunable Delete-Aware LSM Engine

Subhadeep Sarkar, Tarikul Islam Papon, Dimitris Staratzis, Manos Athanassoulis

Conference Paper Proceedings of the ACM SIGMOD Conference, 2020

Abstract

Data-intensive applications fueled the evolution of log structured merge (LSM) based key-value engines that employ the out-of-place paradigm to support high ingestion rates with low read/write interference. These benefits, however, come at the cost of treating deletes as a second-class citizen. A delete inserts a tombstone that invalidates older instances of the deleted key. State-of-the-art LSM engines do not provide guarantees as to how fast a tombstone will propagate to persist the deletion. Further, LSM engines only support deletion on the sort key. To delete on another attribute (e.g., timestamp), the entire tree is read and re-written. We highlight that fast persistent deletion without affecting read performance is key to support: (i) streaming systems operating on a window of data, (ii) privacy with latency guarantees on the right-to-be-forgotten, and (iii) en masse cloud deployment of data systems that makes storage a precious resource.

To address these challenges, in this paper, we build a new key-value storage engine, Lethe, that uses a very small amount of additional metadata, a set of new delete-aware compaction policies, and a new physical data layout that weaves the sort and the delete key order. We show that Lethe supports any user-defined threshold for the delete persistence latency offering higher read throughput (1.17 − 1.4x) and lower space amplification (2.1 − 9.8x), with a modest increase in write amplification (between 4% and 25%). In addition, Lethe supports efficient range deletes on a secondary delete key by dropping entire data pages without sacrificing read performance nor employing a costly full tree merge.

Adaptive partitioning and indexing for in situ query processing

Matthaios Olma, Manos Karpathiotakis, Ioannis Alagiannis, Manos Athanassoulis, Anastasia Ailamaki

Journal Paper The VLDB Journal, Vol. 29(1), 2020 (Special Issue)

Abstract

The constant flux of data and queries alike has been pushing the boundaries of data analysis systems. The increasing size of raw data files has made data loading an expensive operation that delays the data-to-insight time. To alleviate the loading cost, in situ query processing systems operate directly over raw data and offer instant access to data. At the same time, analytical workloads have increasing number of queries. Typically, each query focuses on a constantly shifting - yet small - range. As a result, minimizing the workload latency requires the benefits of indexing in in situ query processing. In this paper, we present an online partitioning and indexing scheme, along with a partitioning and indexing tuner tailored for in situ querying engines. The proposed system design improves query execution time by taking into account user query patterns, to (i) partition raw data files logically and (ii) build lightweight partition-specific indexes for each partition. We build an in situ query engine called Slalom to showcase the impact of our design. Slalom employs adaptive partitioning and builds non-obtrusive indexes in different partitions on-the-fly based on lightweight query access pattern monitoring. As a result of its lightweight nature, Slalom achieves efficient query processing over raw data with minimal memory consumption. Our experimentation with both microbenchmarks and real-life workloads shows that Slalom outperforms state-of-the-art in situ engines and achieves comparable query response times with fully indexed DBMS, offering lower cumulative query execution times for query workloads with increasing size and unpredictable access patterns.

Optimal Column Layout for Hybrid Workloads

Manos Athanassoulis, Kenneth Bøgh, Stratos Idreos

Journal Paper Proceedings of the VLDB Endowment, Vol. 12(13), 2019

Abstract

Data-intensive analytical applications need to support both efficient reads and writes. However, what is usually a good data layout for an update-heavy workload, is not well-suited for a read-mostly one and vice versa. Modern analytical data systems rely on columnar layouts and employ delta stores to inject new data and updates.

We show that for hybrid workloads we can achieve close to one order of magnitude better performance by tailoring the column layout design to the data and query workload. Our approach navigates the possible design space of the physical layout: it organizes each column’s data by determining the number of partitions, their corresponding sizes and ranges, and the amount of buffer space and how it is allocated. We frame these design decisions as an optimization problem that, given workload knowledge and performance requirements, provides an optimal physical layout for the workload at hand. To evaluate this work, we build an in-memory storage engine, Casper, and we show that it outperforms state-of-the-art data layouts of analytical systems for hybrid workloads. Casper delivers up to 2.32x higher throughput for update-intensive workloads and up to 2.14x higher throughput for hybrid workloads. We further show how to make data layout decisions robust to workload variation by carefully selecting the input of the optimization.

Optimal Bloom Filters and Adaptive Merging for LSM-Trees

Niv Dayan, Manos Athanassoulis, Stratos Idreos

Journal Paper ACM Transactions on Database Systems, Vol. 43(4), 2018

Abstract

In this paper, we show that key-value stores backed by a log-structured merge-tree (LSM-tree) exhibit an intrinsic trade-off between lookup cost, update cost, and main memory footprint, yet all existing designs expose a suboptimal and difficult to tune trade-off among these metrics. We pinpoint the problem to the fact that modern key-value stores suboptimally co-tune the merge policy, the buffer size, and the Bloom filters' false positive rates across the LSM-tree's different levels.

We present Monkey, an LSM-tree based key-value store that strikes the optimal balance between the costs of updates and lookups with any given main memory budget. The core insight is that worst-case lookup cost is proportional to the sum of the false positive rates of the Bloom filters across all levels of the LSM-tree. Contrary to state-of-the-art key-value stores that assign a fixed number of bits-per-element to all Bloom filters, Monkey allocates memory to filters across different levels so as to minimize the sum of their false positive rates. We show analytically that Monkey reduces the asymptotic complexity of the worst-case lookup I/O cost, and we verify empirically using an implementation on top of RocksDB that Monkey reduces lookup latency by an increasing margin as the data volume grows (50% − 80% for the data sizes we experimented with). Furthermore, we map the design space onto a closed-form model that enables adapting the merging frequency and memory allocation to strike the best trade-off among lookup cost, update cost and main memory, depending on the workload (proportion of lookups and updates), the dataset (number and size of entries), and the underlying hardware (main memory available, disk vs. flash). We show how to use this model to answer what-if design questions about how changes in environmental parameters impact performance and how to adapt the design of the key-value store for optimal performance.

The Periodic Table of Data Structures

Stratos Idreos, Kostas Zoumpatianos, Manos Athanassoulis, Niv Dayan, Brian Hentschel, Michael S. Kester, Demi Guo, Lukas Maas, Wilson Qin, Abdul Wasay, Yiyou Sun

Journal Paper IEEE Data Engineering Bulletin, Vol. 41(3), 2018

Abstract

We describe the vision of being able to reason about the design space of data structures. We break this down into two questions: 1) Can we know all data structures that is possible to design? 2) Can we compute the performance of arbitrary designs on a given hardware and workload without having to implement the design or even access the target hardware? If those challenges are possible, then an array of exciting opportunities would become feasible such as interactive what-if design to improve the productivity of data systems researchers and engineers, and informed decision making in industrial settings with regards to critical hardware/workload/data structure design issues. Then, even fully automated discovery of new data structure designs becomes possible. Furthermore, the structure of the design space itself provides numerous insights and opportunities such as the existence of design continuums that can lead to data systems with deep adaptivity, and a new understanding of the possible performance tradeoffs. Given the universal presence of data structures at the very core of any data-driven field across all sciences and industries, reasoning about their design can have significant benefits, making it more feasible (easier, faster and cheaper) to adopt tailored state-of-the-art storage solutions. And this effect is going to become increasingly more critical as data keeps growing, hardware keeps changing and more applications/fields realize the transformative power and potential of data analytics. This paper presents this vision and surveys first steps that demonstrate its feasibility.

Slalom: Coasting Through Raw Data via Adaptive Partitioning and Indexing

Matthaios Olma, Manos Karpathiotakis, Ioannis Alagiannis, Manos Athanassoulis, Anastasia Ailamaki

Journal Paper Proceedings of the VLDB Endowment, Vol. 10(10), 2017

Abstract

The constant flux of data and queries alike has been pushing the boundaries of data analysis systems. The increasing size of raw data files has made data loading an expensive operation that delays the data-to-insight time. Hence, recent in-situ query processing systems operate directly over raw data, alleviating the loading cost. At the same time, analytical workloads have increasing number of queries. Typically, each query focuses on a constantly shifting – yet small – range. Minimizing the workload latency, now, requires the benefits of indexing in in-situ query processing.

In this paper, we present Slalom, an in-situ query engine that accommodates workload shifts by monitoring user access patterns. Slalom makes on-the-fly partitioning and indexing decisions, based on information collected by lightweight monitoring. Slalom has two key components: (i) an online partitioning and indexing scheme, and (ii) a partitioning and indexing tuner tailored for in-situ query engines. When compared to the state of the art, Slalom offers performance benefits by taking into account user query patterns to (a) logically partition raw data files and (b) build for each partition lightweight partition-specific indexes. Due to its lightweight and adaptive nature, Slalom achieves efficient accesses to raw data with minimal memory consumption. Our experimentation with both micro-benchmarks and real-life workloads shows that Slalom outperforms state-of-the-art in-situ engines (3−10x), and achieves comparable query response times with fully indexed DBMS, offering much lower (~3x) cumulative query execution times for query workloads with increasing size and unpredictable access patterns.

Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe?

Michael Kester, Manos Athanassoulis, Stratos Idreos

Conference Paper Proceedings of the ACM SIGMOD Conference, 2017

Abstract

The advent of columnar data analytics engines fueled a series of optimizations on the scan operator. New designs include column-group storage, vectorized execution, shared scans, working directly over compressed data, and operating using SIMD and multi-core execution. Larger main memories and deeper cache hierarchies increase the efficiency of modern scans, prompting a revisit of the question of access path selection.

In this paper, we compare modern sequential scans and secondary index scans. Through detailed analytical modeling and experimentation we show that while scans have become useful in more cases than before, both access paths are still useful, and so, access path selection (APS) is still required to achieve the best performance when considering variable workloads. We show how to perform access path selection. In particular, contrary to the way traditional systems choose between scans and secondary indexes, we find that in addition to the query selectivity, the underlying hardware, and the system design, modern optimizers also need to take into account query concurrency. We further discuss the implications of integrating access path selection in a modern analytical data system. We demonstrate, both theoretically and experimentally, that using the proposed model a system can quickly perform access path selection, outperforming solutions that rely on a single access path or traditional access path models. We outline a light-weight mechanism to integrate APS into main-memory analytical systems that does not interfere with low latency queries. We also use the APS model to explain how the division between sequential scan and secondary index scan has historically changed due to hardware and workload changes, which allows for future projections based on hardware advancements.

Monkey: Optimal Navigable Key-Value Store

Niv Dayan, Manos Athanassoulis, Stratos Idreos

Conference Paper Proceedings of the ACM SIGMOD Conference, 2017

Abstract

In this paper, we show that key-value stores backed by an LSM-tree exhibit an intrinsic trade-off between lookup cost, update cost, and main memory footprint, yet all existing designs expose a suboptimal and difficult to tune trade-off among these metrics. We pinpoint the problem to the fact that all modern key-value stores suboptimally co-tune the merge policy, the buffer size, and the Bloom filters’ false positive rates in each level.

We present Monkey, an LSM-based key-value store that strikes the optimal balance between the costs of updates and lookups with any given main memory budget. The insight is that worst-case lookup cost is proportional to the sum of the false positive rates of the Bloom filters across all levels of the LSM-tree. Contrary to state-of-the-art key-value stores that assign a fixed number of bits-per-element to all Bloom filters, Monkey allocates memory to filters across different levels so as to minimize this sum. We show analytically that Monkey reduces the asymptotic complexity of the worst-case lookup I/O cost, and we verify empirically using an implementation on top of LevelDB that Monkey reduces lookup latency by an increasing margin as the data volume grows (50% − 80% for the data sizes we experimented with). Furthermore, we map the LSM-tree design space onto a closed-form model that enables co-tuning the merge policy, the buffer size and the filters’ false positive rates to trade among lookup cost, update cost and/or main memory, depending on the workload (proportion of lookups and updates), the dataset (number and size of entries), and the underlying hardware (main memory available, disk vs. flash). We show how to use this model to answer what-if design questions about how changes in environmental parameters impact performance and how to adapt the various LSM-tree design elements accordingly.

Secure Data Systems and Performance: Friends or Foe?

Manos Athanassoulis

Conference Paper Proceedings of the Conference on Innovative Data Systems Research (CIDR), 2017 (Abstract)

Abstract

The increasing trend in data collection and data processing has lead to the aggressive development of public clouds which store, process, and query data collections from completely heterogenous sources, with different requirements not only in performance, but also in confidentiality and security. The performance considerations have been leading to bigger public clouds where economy at scale allows for more efficient solutions. Today data management is being commoditized and offered as database-as-a-service over private and public cloud infrastructure.

At the same time the increasing number of data breaches over the past 15 years have raised the awareness of privacy and security. For a company, its data need to be private in order not to lose its business advantage, for a government it can be strategic economical and national security reasons, and for a scientist it is to protect work in progress before it is officially published. For a hospital, it may contain sensitive patient data which are protected by law. Even for an end user, data stored and processed on public clouds are sensitive because it may contain private information that the user wants to keep it so. The repetitive breaches, however, show that we are far from having strong guarantees that our data, and the processing we are doing over it, is safe-guarded in the existing private and public cloud infrastructure. We have recently seen sensitive banking information leaked (e.g., credit cards), private data leaked (e.g., photos), and government correspondence (e.g., emails). These occurrences raise new research challenges on building solutions secure data systems with low cost in terms of performance or functionality.

UpBit: Scalable In-Memory Updatable Bitmap Indexing

Manos Athanassoulis, Zheng Yan, Stratos Idreos

Conference Paper Proceedings of the ACM SIGMOD Conference, 2016

Abstract

Bitmap indexes are widely used in both scientific and commercial databases. They bring fast read performance for specific types of queries, such as equality and selective range queries. A major drawback of bitmap indexes, however, is that supporting updates is particularly costly. Bitmap indexes are kept compressed to minimize storage footprint; as a result, updating a bitmap index requires the expensive step of decoding and then encoding a bitvector. Today, more and more applications need support for both reads and writes, blurring the boundaries between analytical processing and transaction processing. This requires new system designs and access methods that support general updates and, at the same time, offer competitive read performance.

In this paper, we propose scalable in-memory Updatable Bitmap indexing (UpBit), which offers efficient updates, without hurting read performance. UpBit relies on two design points. First, in addition to the main bitvector for each domain value, UpBit maintains an update bitvector, to keep track of updated values. Effectively, every update can now be directed to a highly-compressible, easy-to-update bitvector. While update bitvectors double the amount of uncompressed data, they are sparse, and as a result their compressed size is small. Second, we introduce fence pointers in all update bitvectors which allow for efficient retrieval of a value at an arbitrary position. Using both synthetic and real-life data, we demonstrate that UpBit significantly outperforms state-of-the-art bitmap indexes for workloads that contain both reads and writes. In particular, compared to update-optimized bitmap index designs UpBit is 15-29x faster in terms of update time and 2.7x faster in terms of read performance. In addition, compared to read-optimized bitmap index designs UpBit achieves efficient and scalable updates (51-115x lower update latency), while allowing for comparable read performance, having up to 8% overhead.

Design Tradeoffs of Data Access Methods (Tutorial)

Manos Athanassoulis, Stratos Idreos

Conference Paper Proceedings of the ACM SIGMOD Conference, 2016

Abstract

Database researchers and practitioners have been building methods to store, access, and update data for more than five decades. Designing access methods has been a constant effort to adapt to the ever changing underlying hardware and workload requirements. The recent explosion in data system designs - including, in addition to traditional SQL systems, NoSQL, NewSQL, and other relational and non-relational systems - makes understanding the tradeoffs of designing access methods more important than ever. Access methods are at the core of any new data system. In this tutorial we survey recent developments in access method design and we place them in the design space where each approach focuses primarily on one or a subset of read performance, update performance, and memory utilization. We discuss how to utilize designs and lessons-learned from past research. In addition, we discuss new ideas on how to build access methods that have tunable behavior, as well as, what is the scenery of open research problems.

Designing Access Methods: The RUM Conjecture

Manos Athanassoulis, Michael Kester, Lukas Maas, Radu Stoica, Stratos Idreos, Anastasia Ailamaki, Mark Callaghan

Vision Paper Proceedings of the 19th International Conference on Extending Database Technology (EDBT), 2016

Abstract

The database research community has been building methods to store, access, and update data for more than four decades. Throughout the evolution of the structures and techniques used to access data, access methods adapt to the ever changing hardware and workload requirements. Today, even small changes in the workload or the hardware lead to a redesign of access methods. The need for new designs has been increasing as data generation and workload diversification grow exponentially, and hardware advances introduce increased complexity. New workload requirements are introduced by the emergence of new applications, and data is managed by large systems composed of more and more complex and heterogeneous hardware. As a result, it is increasingly important to develop application-aware and hardware-aware access methods.

The fundamental challenges that every researcher, systems architect, or designer faces when designing a new access method are how to minimize, i) read times (R), ii) update cost (U), and iii) memory (or storage) overhead (M). In this paper, we conjecture that when optimizing the read-update-memory overheads, optimizing in any two areas negatively impacts the third. We present a simple model of the RUM overheads, and we articulate the RUM Conjecture. We show how the RUM Conjecture manifests in state-of-the-art access methods, and we envision a trend toward RUM-aware access methods for future data systems.

Past and Future Steps for Adaptive Storage Data Systems: From Shallow to Deep Adaptivity

Stratos Idreos, Manos Athanassoulis, Niv Dayan, Demi Guo, Mike S. Kester, Lukas Maas, Kostas Zoumpatianos

Workshop Paper Proceedings of the 10th International Workshop on Enabling Real-Time Business Intelligence (BIRTE), 2016

Abstract

Datasystems with adaptive storage can autonomously change their behavior by altering how data is stored and accessed. Such systems have been studied primarily for the case of adaptive indexing to automatically create the right indexes at the right granularity. More recently work on adaptive loading and adaptive data layouts brought even more flexibility. We survey this work and describe the need for even deeper adaptivity that goes beyond adjusting knobs in a single architecture; instead it can adapt the fundamental architecture of a data system to drastically alter its behavior.

Beyond the Wall: Near-Data Processing for Databases

Sam Xi, Oreoluwa Babarinsa, Manos Athanassoulis, Stratos Idreos

Workshop Paper Proceedings of the 11th International Workshop on Data Management on New Hardware (DaMoN), 2015

Abstract

The continuous growth of main memory size allows modern data systems to process entire large scale datasets in memory. The increase in memory capacity, however, is not matched by proportional decrease in memory latency, causing a mismatch for in-memory processing. As a result, data movement through the memory hierarchy is now one of the main performance bottlenecks for main memory data systems. Database systems researchers have proposed several innovative solutions to minimize data movement and to make data access patterns hardware-aware. Nevertheless, all relevant rows and columns for a given query have to be moved through the memory hierarchy; hence, movement of large data sets is on the critical path.

In this paper, we present JAFAR, a Near-Data Processing (NDP) accelerator for pushing selects down to memory in modern column-stores. JAFAR implements the select operator and allows only qualifying data to travel up the memory hierarchy. Through a detailed simulation of JAFAR hardware we show that it has the potential to provide 9× improvement for selects in column-stores. In addition, we discuss both hardware and software challenges for using NDP in database systems as well as opportunities for further NDP accelerators to boost additional relational operators.

Queriosity: Automated Data Exploration

Abdul Wasay, Manos Athanassoulis, Stratos Idreos

Vision Paper Proceedings of the IEEE Congress on Big Data, 2015

Abstract

Curiosity, a fundamental drive amongst higher living organisms, is what enables exploration, learning and creativity. In our increasingly data-driven world, data exploration, i.e., making sense of mounting haystacks of data, is akin to intelligence for science, business and individuals. However, modern data systems – designed for data retrieval rather than exploration – only let us retrieve data and ask if it is interesting. This makes knowledge discovery a game of hit-and-trial which can only be orchestrated by expert data scientists.

We present the vision toward Queriosity, an automated and personalized data exploration system. Designed on the principles of autonomy, learning and usability, Queriosity envisions a paradigm shift in data exploration and aims to become a a personalized “data robot” that provides a direct answer to what is interesting in a user’s data set, instead of just retrieving data. Queriosity autonomously and continuously navigates toward interesting findings based on trends, statistical properties and interactive user feedback.

Online Updates on Data Warehouses via Judicious Use of Solid-State Storage

Manos Athanassoulis, Shimin Chen, Anastasia Ailamaki, Phillip B. Gibbons, Radu Stoica

Journal Paper ACM Transactions on Database Systems, Vol. 40(1), 2015

Abstract

Data warehouses have been traditionally optimized for read-only query performance, allowing only offline updates at night, essentially trading off data freshness for performance. The need for 24x7 operations in global markets and the rise of online and other quickly reacting businesses make concurrent online updates increasingly desirable. Unfortunately, state-of-the-art approaches fall short of supporting fast analysis queries over fresh data. The conventional approach of performing updates in place can dramatically slow down query performance, while prior proposals using differential updates either require large in-memory buffers or may incur significant update migration cost.

This article presents a novel approach for supporting online updates in data warehouses that overcomes the limitations of prior approaches by making judicious use of available SSDs to cache incoming updates. We model the problem of query processing with differential updates as a type of outer join between the data residing on disks and the updates residing on SSDs. We present MaSM algorithms for performing such joins and periodic migrations, with small memory footprints, low query overhead, low SSD writes, efficient in-place migration of updates, and correct ACID support. We present detailed modeling of the proposed approach, and provide proofs regarding the fundamental properties of the MaSM algorithms. Our experimentation shows that MaSM incurs only up to 7% overhead both on synthetic range scans (varying range size from 4KB to 100GB) and in a TPC-H query replay study, while also increasing the update throughput by orders of magnitude.

BF-Tree: Approximate Tree Indexing

Manos Athanassoulis, Anastasia Ailamaki

Journal Paper Proceedings of the VLDB Endowment, Vol. 7(14), 2014

Abstract

The increasing volume of time-based generated data and the shift in storage technologies suggest that we might need to reconsider indexing. Several workloads - like social and service monitoring - often include attributes with implicit clustering because of their time-dependent nature. In addition, solid state disks (SSD) (using flash or other low-level technologies) emerge as viable competitors of hard disk drives (HDD). Capacity and access times of storage devices create a trade-off between SSD and HDD. Slow random accesses in HDD have been replaced by efficient random accesses in SSD, but their available capacity is one or more orders of magnitude more expensive than the one of HDD. Indexing, however, is designed assuming HDD as secondary storage, thus minimizing random accesses at the expense of capacity. Indexing data using SSD as secondary storage requires treating capacity as a scarce resource.

To this end, we introduce approximate tree indexing, which employs probabilistic data structures (Bloom filters) to trade accuracy for size and produce smaller, yet powerful, tree indexes, which we name Bloom filter trees (BF-Trees). BF-Trees exploit pre-existing data ordering or partitioning to offer competitive search performance. We demonstrate, both by an analytical study and by experimental results, that by using workload knowledge and reducing indexing accuracy up to some extent, we can save substantially on capacity when indexing on ordered or partitioned attributes. In particular, in experiments with a synthetic workload, approximate indexing offers 2.22x-48x smaller index footprint with competitive response times, and in experiments with TPCH and a monitoring real-life dataset from an energy company, it offers 1.6x-4x smaller index footprint with competitive search times as well.

Reactive and Proactive Sharing Across Concurrent Analytical Queries

Iraklis Psaroudakis, Manos Athanassoulis, Matthaios Olma, Anastasia Ailamaki

Demo Paper Proceedings of the ACM SIGMOD Conference, 2014

Abstract

Today an ever increasing amount of data is collected and analyzed by researchers, businesses, and scientists in data warehouses (DW). In addition to the data size, the number of users and applications querying data grows exponentially. The increasing concurrency is itself a challenge in query execution, but also introduces an opportunity favoring synergy between concurrent queries. Traditional execution engines of DW follows a query-centric approach, where each query is optimized and executed independently. On the other hand, workloads with increased concurrency have several queries with common parts of data and work, creating the opportunity for sharing among concurrent queries. Sharing can be reactive to the inherently existing sharing opportunities, or proactive by redesigning query operators to maximize the sharing opportunities.

This demonstration showcases the impact of proactive and reactive sharing by comparing and integrating representative state-of-the-art techniques: Simultaneous Pipelining (SP), for reactive sharing, which shares intermediate results of common sub-plans, and Global Query Plans (GQP) for proactive sharing, which build and evaluate a single query plan with shared operators. We visually demonstrate, in an interactive interface, the behavior of both sharing approaches on top of a state-of-the-art storage engine using the original prototypes. We show that pull-based sharing for SP eliminates the serialization point imposed by the original push-based approach. Then, we compare, through a sensitivity analysis, the performance of SP and GQP. Finally, we show that SP can improve the performance of GQP for a query mix with common sub-plans.

Solid-State Storage and Work Sharing for Efficient Scaleup Data Analytics

Manos Athanassoulis

Thesis PhD dissertation, EPFL, Lausanne, Switzerland, February 2014

Abstract

Today, managing, storing and analyzing data continuously in order to gain additional insight is becoming commonplace. Data analytics engines have been traditionally optimized for read-only queries assuming that the main data reside on mechanical disks. The need for 24x7 operations in global markets and the rise of online and other quickly-reacting businesses make data freshness an additional design goal. Moreover, the increased requirements in information quality make semantic databases a key (often represented as graphs using the RDF data representation model). Last but not least, the performance requirements combined with the increasing amount of stored and managed data call for high-performance yet space-efficient access methods in order to support the desired concurrency and throughput.

Innovative data management algorithms and careful use of the underlying hardware platform help us to address the aforementioned requirements. The volume of generated, stored and queried data is increasing exponentially, and new workloads often are comprised of time-generated data. At the same time the hardware is evolving with dramatic changes both in processing units and storage devices, where solid-state storage is becoming ubiquitous. In this thesis, we build workload-aware data access methods for data analytics - tailored for emerging time-generated workloads - which use solid-state storage, either (i) as an additional level in the memory hierarchy to enable real-time updates in a data analytics, or (ii) as standalone storage for applications involving support for knowledge-based data, and support for efficiently indexing archival and time-generated data.

Building workload-aware and hardware-aware data management systems allows to increase their performance and to augment their functionality. The advancements in storage have led to a variety of storage devices with different characteristics (e.g., monetary cost, access times, durability, endurance, read performance vs. write performance), and the suitability of a method to an application depends on how it balances the different characteristics of the storage medium it uses. The data access methods proposed in this thesis - MaSM and BF-Tree - balance the benefits of solid-state storage and of traditional hard disks, and are suitable for time-generated data or datasets with similar organization, which include social, monitoring and archival applications. The study of work sharing in the context of data analytics paves the way to integrating shared database operators starting from shared scans to several data analytics engines, and the workload-aware physical data organization proposed for knowledge-based datasets - RDF-tuple - enables integration of diverse data sources into the same systems.

Sharing Data and Work Across Concurrent Analytical Queries

Iraklis Psaroudakis, Manos Athanassoulis, Anastasia Ailamaki

Journal Paper Proceedings of the VLDB Endowment, Vol. 6(9), 2013

Abstract

Today’s data deluge enables organizations to collect massive data, and analyze it with an ever-increasing number of concurrent queries. Traditional data warehouses (DW) face a challenging problem in executing this task, due to their query-centric model: each query is optimized and executed independently. This model results in high contention for resources. Thus, modern DW depart from the query-centric model to execution models involving sharing of common data and work. Our goal is to show when and how a DW should employ sharing. We evaluate experimentally two sharing methodologies, based on their original prototype systems, that exploit work sharing opportunities among concurrent queries at run-time: Simultaneous Pipelining (SP), which shares intermediate results of common sub-plans, and Global Query Plans (GQP), which build and evaluate a single query plan with shared operators.

First, after a short review of sharing methodologies, we show that SP and GQP are orthogonal techniques. SP can be applied to shared operators of a GQP, reducing response times by 20%-48% in workloads with numerous common sub-plans. Second, we corroborate previous results on the negative impact of SP on performance for cases of low concurrency. We attribute this behavior to a bottleneck caused by the push-based communication model of SP. We show that pull-based communication for SP eliminates the overhead of sharing altogether for low concurrency, and scales better on multi-core machines than push-based SP, further reducing response times by 82%-86% for high concurrency. Third, we perform an experimental analysis of SP, GQP and their combination, and show when each one is beneficial. We identify a trade-off between low and high concurrency. In the former case, traditional query-centric operators with SP perform better, while in the latter case, GQP with shared operators enhanced by SP give the best results.

Scaling Up Analytical Queries with Column-Stores

Ioannis Alagiannis, Manos Athanassoulis, Anastasia Ailamaki

Workshop Paper Proceedings of the 6th International Workshop on Testing Database Systems (DBTest), 2013

Abstract

As data analytics is used by an increasing number of applications, data analytics engines are required to execute workloads with increased concurrency, i.e., an increasing number of clients submitting queries. Data management systems designed for data analytics - a market dominated by column-stores - however, were initially optimized for single query execution, minimizing its response time. Hence, they do not treat concurrency as a first class citizen.

In this paper, we experiment with one open-source and two commercial column-stores using the TPC-H and SSB benchmarks in a setup with an increasing number of concurrent clients submitting queries, focusing on whether the tested systems can scale up in a single node instance. The tested systems for in-memory workloads scale up, to some degree; however, when the server is saturated they fail to fully exploit the available parallelism. Further, we highlight the unpredictable response times for high concurrency.

Querying Persistent Graphs using Solid State Storage

Manos Athanassoulis, Bishwaranjan Bhattacharjee, Mustafa Canim, Kenneth A. Ross

Workshop Paper Proceedings of the 4th Annual Non-Volatile Memories Workshop, 2013

Abstract

Solid State Drives (SSDs) are an important component of secondary storage systems. While Hard Disk Drives (HDDs) are cheaper per gigabyte, SSDs are cheaper per random I/O per second. A variety of of solid-state technologies are being developed with NAND flash being the most mature, and Phase Change Memory (PCM) beginning to enter the marketplace. Compared with flash, PCM has finer-grained addressability and higher write endurance. PCM is also expected to offer lower read and write response times.

In this work we study the use of solid-state storage in latency-bound applications, a type of workload that can benefit from the characteristics of flash and PCM technologies. We identify graph processing and Resource Description Framework (RDF) query processing as candidate applications. Using an early PCM prototype device, we demonstrate the benefits of PCM for this workload and compare with a flash device. Moreover, we describe Pythia, a prototype RDF repository designed for Solid State Storage. Using Pythia we investigate whether the predicted benefits for this type of workload can be achieved in a properly designed RDF repository.

Solid State Storage & DBMS

Manos Athanassoulis

Conference Paper Proceedings of the Conference on Innovative Data Systems Research (CIDR), 2013 (short talk)

Abstract

TBA

Path Processing using Solid State Storage

Manos Athanassoulis, Bishwaranjan Bhattacharjee, Mustafa Canim, Kenneth A. Ross

Workshop Paper Proceedings of the 3rd International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS), 2012

Abstract

Recent advances in solid state technology have led to the introduction of Solid State Drives (SSDs). Todays SSDs store data persistently using NAND flash memory. While SSDs are more expensive than hard disks when measured in dollars per gigabyte, they are significantly cheaper when measured in dollars per random I/O per second. Additional storage technologies are under development, Phase Change Memory (PCM) being the next one to enter the marketplace. PCM is nonvolatile, it can be byte-addressable, and in future Multi Level Cell (MLC) devices, PCM is expected to be denser than DRAM. PCM has lower read and write latency compared to NAND flash memory, and it can endure orders of magnitude more write cycles before wearing out.

Recent research has shown that solid state devices can be particularly beneficial for latency-bound applications involving dependent reads. Latency-bound applications like path processing in the context of graph processing or Resource Description Framework (RDF) data processing are typical examples of these applications. We demonstrate via a custom graph benchmark that even an early prototype Phase Change Memory device can offer significant improvements over mature flash devices (1.5x - 2.5x speedup in response times). We take this observation further by building Pythia, a prototype RDF repository tailor-made for Solid State Storage to investigate the predicted benefits for these type of workloads that can be achieved in a properly designed RDF repository. We compare the performance of our repository against the state of the art RDF-3X repository in a limited set of tests and discuss the results. We finally compare the performance of our repository running on a PCM-based device against a state of the art flash device, showing that there is indeed significant gain to be achieved by using PCM for RDF processing.

Scalability of write-ahead logging on multicore and multisocket hardware

Ryan Johnson, Ippokratis Pandis, Radu Stoica, Manos Athanassoulis, Anastasia Ailamaki

Journal Paper The VLDB Journal, Vol. 21(2), 2012 (Special Issue)

Abstract

The shift to multi-core and multi-socket hardware brings new challenges to database systems, as the software parallelism determines performance. Even though database systems traditionally accommodate simultaneous requests, a multitude of synchronization barriers serialize execution. Write-ahead logging is a fundamental, omnipresent component in ARIES-style concurrency and recovery, and one of the most important yet-to-be addressed potential bottlenecks, especially in OLTP workloads making frequent small changes to data. In this paper, we identify four logging-related impediments to database system scalability. Each issue challenges different level in the software architecture: (a) the high volume of small-sized I/O requests may saturate the disk, (b) transactions hold locks while waiting for the log flush, (c) extensive context switching overwhelms the OS scheduler with threads executing log I/Os, and (d) contention appears as transactions serialize accesses to in-memory log data structures. We demonstrate these problems and address them with techniques that, when combined, comprise a holistic, scalable approach to logging. Our solution achieves a 20–69% speedup over a modern database system when running log-intensive workloads, such as the TPC-B and TATP benchmarks, in a single-socket multiprocessor server. Moreover, it achieves log insert throughput over 2.2 GB/s for small log records on the single-socket server, roughly 20 times higher than the traditional way of accessing the log using a single mutex. Furthermore, we investigate techniques on scaling the performance of logging to multi-socket servers. We present a set of optimizations which partly ameliorate the latency penalty that comes with multi-socket hardware, and then we investigate the feasibility of applying a distributed log buffer design at the socket level.

MaSM: Efficient Online Updates in Data Warehouses

Manos Athanassoulis, Shimin Chen, Anastasia Ailamaki, Phillip B. Gibbons, Radu Stoica

Conference Paper Proceedings of the ACM SIGMOD Conference, 2011

Abstract

Data warehouses have been traditionally optimized for read-only query performance, allowing only offline updates at night, essentially trading off data freshness for performance. The need for 24x7 operations in global markets and the rise of online and other quickly-reacting businesses make concurrent online updates increasingly desirable. Unfortunately, state-of-the-art approaches fall short of supporting fast analysis queries over fresh data. The conventional approach of performing updates in place can dramatically slow down query performance, while prior proposals using differential updates either require large in-memory buffers or may incur significant update migration cost.

This paper presents a novel approach for supporting online updates in data warehouses that overcomes the limitations of prior approaches, by making judicious use of available SSDs to cache incoming updates. We model the problem of query processing with differential updates as a type of outer join between the data residing on disks and the updates residing on SSDs. We present MaSM algorithms for performing such joins and periodic migrations, with small memory footprints, low query overhead, low SSD writes, efficient in-place migration of updates, and correct ACID support. Our experiments show that MaSM incurs only up to 7% overhead both on synthetic range scans (varying range size from 100GB to 4KB) and in a TPC-H query replay study, while also increasing the update throughput by orders of magnitude.

TPC-E vs. TPC-C: Characterizing the New TPC-E Benchmark via an I/O Comparison Study

Shimin Chen, Anastasia Ailamaki, Manos Athanassoulis, Phillip B. Gibbons, Ryan Johnson, Ippokratis Pandis, and Radu Stoica

Journal Paper SIGMOD Record, Vol. 39(4), 2010

Abstract

TPC-E is a new OLTP benchmark recently approved by the Transaction Processing Performance Council (TPC). In this paper, we compare TPC-E with the familiar TPCC benchmark in order to understand the behavior of the new TPC-E benchmark. In particular, we compare the I/O access patterns of the two benchmarks by analyzing two OLTP disk traces. We find that (i) TPC-E is more read intensive with a 9.7:1 I/O read to write ratio, while TPC-C sees a 1.9:1 read-to-write ratio; and (ii) although TPC-E uses pseudo-realistic data, TPC-E’s I/O access pattern is as random as TPC-C. The latter suggests that like TPC-C, TPC-E can benefit from SSDs, which have superior random I/O support. To verify this, we replay both disk traces on an Intel X25-E SSD and see dramatic improvements for both TPC-C and TPC-E.

Flash in a DBMS: Where and How?

Manos Athanassoulis, Anastasia Ailamaki, Shimin Chen, Phillip B. Gibbons, Radu Stoica

Journal Paper IEEE Data Engineering Bulletin, Vol. 33(4), 2010

Abstract

Over the past decade, new solid state storage technologies, with flash being the most mature one, have become increasingly popular. Such technologies store data durably, and can alleviate many handicaps of hard disk drives (HDDs). Nonetheless, they have very different characteristics compared to HDDs, making it challenging to integrate such technologies into data intensive systems, such as database management systems (DBMS), that rely heavily on underlying storage behaviors. In this paper, we ask the question: Where and how will flash be exploited in a DBMS? We describe techniques for making effective use of flash in three contexts: (i) as a log device for transaction processing on memory-resident data, (ii) as the main data store for transaction processing, and (iii) as an update cache for HDD-resident data warehouses.

Aether: A Scalable Approach to Logging

Ryan Johnson, Ippokratis Pandis, Radu Stoica, Manos Athanassoulis, Anastasia Ailamaki

Journal Paper Proceedings of the VLDB Endowment, Vol. 3(1-2), 2010

Abstract

The shift to multi-core hardware brings new challenges to database systems, as the software parallelism determines performance. Even though database systems traditionally accommodate simultaneous requests, a multitude of synchronization barriers serialize execution. Write-ahead logging is a fundamental, omnipresent component in ARIES-style concurrency and recovery, and one of the most important yet-to-be addressed potential bottlenecks, especially in OLTP workloads making frequent small changes to data. In this paper, we identify four logging-related impediments to database system scalability. Each issue challenges different level in the software architecture: (a) the high volume of small-sized I/O requests may saturate the disk, (b) transactions hold locks while waiting for the log flush, (c) extensive context switching overwhelms the OS scheduler with threads executing log I/Os, and (d) contention appears as transactions serialize accesses to in-memory log data structures. We demonstrate these problems and address them with techniques that, when combined, comprise a holistic, scalable approach to logging. Our solution achieves a 20%-69% speedup over a modern database system when running log-intensive workloads, such as the TPC-B and TATP benchmarks. Moreover, it achieves log insert throughput over 1.8GB/s for small log records on a single socket server, an order of magnitude higher than the traditional way of accessing the log using a single mutex.

A New Look at the Roles of Spinning and Blocking

Ryan Johnson, Manos Athanassoulis, Radu Stoica, Anastasia Ailamaki

Workshop Paper Proceedings of the 5th International Workshop on Data Management on New Hardware (DaMoN), 2009

Abstract

Database engines face growing scalability challenges as core counts exponentially increase each processor generation, and the efficiency of synchronization primitives used to protect internal data structures is a crucial factor in overall database performance. The trade-offs between different implementation approaches for these primitives shift significantly with increasing degrees of available hardware parallelism. Blocking synchronization, which has long been the favored approach in database systems, becomes increasingly unattractive as growing core counts expose its bottlenecks. Spinning implementations improve peak system throughput by a factor of 2x or more for 64 hardware contexts, but suffer from poor performance under load. In this paper we analyze the shifting trade-off between spinning and blocking synchronization, and observe that the trade-off can be simplified by isolating the load control aspects of contention management and treating the two problems separately: spinning-based contention management and blocking-based load control. We then present a proof of concept implementation that, for high concurrency, matches or exceeds the performance of both user-level spinlocks and the pthread mutex under a wide range of load factors.

Evaluating and Repairing Write Performance on Flash Devices

Radu Stoica, Manos Athanassoulis, Ryan Johnson, Anastasia Ailamaki

Workshop Paper Proceedings of the 5th International Workshop on Data Management on New Hardware (DaMoN), 2009

Abstract

In the last few years NAND flash storage has become more and more popular as price per GB and capacity both improve at exponential rates. Flash memory offers significant benefits compared to magnetic hard disk drives (HDDs) and DBMSs are highly likely to use flash as a general storage backend, either alone or in heterogeneous storage solutions with HDDs. Flash devices, however, respond quite differently than HDDs for common access patterns, and recent research shows a strong asymmetry between read and write performance. Moreover, flash storage devices behave unpredictably, showing a high dependence on previous IO history and usage patterns. In this paper we investigate how a DBMS can overcome these issues to take full advantage of flash memory as persistent storage. We propose new a flash aware data layout - append and pack — which stabilizes device performance by eliminating random writes. We assess the impact of append and pack on OLTP workload performance using both an analytical model and micro-benchmarks, and our results suggest that significant improvements can be achieved for real workloads.

Thread scheduling technique for avoiding page thrashing (in Greek)

Manos Athanassoulis

Thesis M.Sc. Thesis, University of Athens, Greece, March 2008

Abstract

Page thrashing is the process utilization collapse due to increase of the multiprogramming level. Several papers have discussed this matter up to know, focusing on specific aspects of the phenomenon in order to address it. As a result the effect of such techniques depends on the type of the processor’s load. There two main issues in addressing page-thrashing: time scheduling (process and thread scheduling) and memory management. In this text a distributed thread scheduling technique is designed. The proposed technique is based on decisions made by each process when scheduling the contained threads, utilizing data that a process has already available. The proposed thread scheduling technique is simulated using a discrete time simulator system.

Energy Efficiency in Wireless Sensor Networks: A Utility-Based Architecture

Manos Athanassoulis, Ioannis Alagiannis, Stathes Hadjiefthymiades

Conference Paper European Wireless (EW), 2007

Abstract

Wireless Sensor Networks (WSN) comprise a fast developing research area with a vast spectrum of applications. A WSN design is influenced by many factors such as transmission errors, network topology and power consumption. Consequently, developing a WSN application introduces several implementation challenges. In this paper, we describe a multi-criteria architecture in order to achieve energy-aware and consistent message forwarding over a WSN. Using the proposed architecture a directed acyclic graph (DAG) is formed throughout the WSN. Such DAG is used for multi-source data aggregation to a single sink. Intermediate nodes evaluate their energy reserve and induced error and decide whether message retransmission is needed. A sink is necessary in order to collect, process and probably forward these data to a more sophisticated system for further processing. The discussed architecture is developed using TinyOS, an operating system designed for WSN nodes, and nesC, an extension of C. Finally evaluation results are presented.

A Multi-Criteria Message Forwarding Architecture for Wireless Sensor Networks

Manos Athanassoulis, Ioannis Alagiannis, Stathes Hadjiefthymiades

Conference Paper Proceedings of the 10th Pan-Hellenic Conference on Informatics (PCI), 2005

Abstract

Wireless Sensor Networks (WSN) comprise a fast-developing research area with a vast spectrum of applications. A WSN design is influenced by many factors such as transmission errors, network topology and power consumption. Consequently, developing a WSN application introduces several implementation challenges. In this paper, we describe a multi-criteria architecture in order to achieve energy-aware and consistent message forwarding over a WSN. Using the proposed architecture a directed acyclic graph (DAG) is formed throughout the WSN. Such DAG is used for multi-source data aggregation to a single sink. Intermediate nodes evaluate their energy reserve and induced error and decide whether message retransmission is needed. A sink is necessary in order to collect, process and probably forward these data to a more sophisticated system for further processing. The discussed architecture is developed using TinyOS, an event-driven lightweight operating system designed for sensor network nodes, and nesC, a highly modular and declarative extension of C.

A Multi-Criteria Message Forwarding Architecture for Wireless Sensor Networks (in Greek)

Manos Athanassoulis (joined with I. Alagiannis)

Thesis B.Sc. Thesis, University of Athens, Greece, September 2005

Abstract

Wireless Sensor Networks (WSN) comprise a fast-developing research area with a vast spectrum of applications. A WSN design is influenced by many factors such as transmission errors, network topology and power consumption. Consequently, developing a WSN application introduces several implementation challenges. In this paper, we describe a multi-criteria architecture in order to achieve energy-aware and consistent message forwarding over a WSN. Using the proposed architecture a directed acyclic graph (DAG) is formed throughout the WSN. Such DAG is used for multi-source data aggregation to a single sink. Intermediate nodes evaluate their energy reserve and induced error and decide whether message retransmission is needed. A sink is necessary in order to collect, process and probably forward these data to a more sophisticated system for further processing. The discussed architecture is developed using TinyOS, an operating system designed for WSN nodes, and nesC, an extension of C. Finally evaluation results are presented.