CS 591 A1

Data Systems Architectures

Class at a glance

Class: Tu/Th 12:30-1:45pm, MCS B23
Instructor: Manos Athanassoulis

Teaching Assistant: Subhadeep Sarkar
(Office Hours: MCS 283, M/W 2-3pm)

Office: MCS 279
OH: Tu 2-3pm/Th 2:30-3:30pm

Discussion on Piazza

Announcements

Final grades are now uploaded!
Register for your project presentation.
A list of useful links added in the project page.
Our first visitor is coming on March 5th!
More details about the projects are available in the project page.
Slides for Class 1 are available. We will be posting slides as soon as possible after class.
Register for presentations!
The schedule is populated with the papers to discuss.
Syllabus is now online.
Website is up. Stay tuned for more!
Class starts on 1/22.

Class Milestones - Important Dates

February 4th, last day to add (any) class
February 8th, select a project
February 22nd, come up with a proposal (which you have discussed in OH)
March 8th, submit your mid-way project report
~~April 28th~~, April 29th 11:59pm, submit your final project report/code
April 30th and May 2nd, project presentation (in class)
May 7th 11:59pm, final submission of the amended report (if needed); the project grade can change by at most 10%

Also, before each class submit your paper review!

Class Schedule

Here you can find the tentative schedule of the class (which might change as the semester progresses).

Class: Introduction to Data Systems and CS591

In this class we will discuss the basics of data systems and the goals and structure of the course.

Readings

Slides

Class: Data Systems Architectures Essentials – Part 1

In this class we discuss the fundamental components that comprise a database system. We will look at the commonalities and differences among the main database system architectures and discuss why several different ones exist.

Readings

Slides

Class: Data Systems Architectures Essentials – Part 2

In this class we continue discussing data systems architectures and the basics of modern systems, including relational systems, graph systems, and key-value stores.

Readings

Slides

Class: Class Project Overview

In this class we introduce the semester project.

Readings

Slides

Class: Storage Layouts: Row-Stores vs. Column-Stores

Concepts: column-stores, row-stores, vertical partitioning, index-only plans, materialized views, tuple reconstruction, late/early materialization, block iteration (vectorized execution), compression (run length encoding), hash joins, index joins, sort-merge joins, invisible joins, star schema

Readings

Discussion Slides by Megan Fantes
(P) Column-Stores vs. Row-Stores: How Different Are They Really?, SIGMOD 2008
(B) C-Store: A Column-oriented DBMS, VLDB 2005
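
As a quick illustration of one concept listed for this class, run length encoding, here is a minimal Python sketch. It is not taken from the readings; the function names and the sample column are hypothetical and chosen only for illustration. It shows how a column with repeated consecutive values can be stored as (value, run length) pairs and how a simple count can be answered directly on the compressed form.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, length in runs for _ in range(length)]


def count_equal(runs, target):
    """Count rows equal to target without decompressing the column."""
    return sum(length for value, length in runs if value == target)


# Hypothetical sample column; sorted or clustered columns compress best.
column = ["US", "US", "US", "DE", "DE", "FR", "US"]
runs = rle_encode(column)
```

The encoding pays off when runs are long, which is one reason column-stores often sort or cluster a column before compressing it.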

Class: Storage Layouts: Adaptive & Hybrid Layouts

Concepts: on-line transaction processing (OLTP), on-line analytical processing (OLAP), n-ary storage model (NSM), decomposition storage model (DSM), partition attributes across (PAX), flexible storage model (FSM), projectivity, selectivity, concurrency control, multi-version concurrency control (MVCC), two-phase locking (2PL)

Readings

Discussion Slides by John Merfeld
(P) Bridging the Archipelago between Row-Stores and Column-Stores for Hybrid Workloads, SIGMOD 2016
(B) H2O: A Hands-free Adaptive Store, SIGMOD 2014

Class: New Hardware: Data Systems for Flash

Concepts: hard disk drives (HDD), solid-state drives (SSD), flash-based SSDs, in-place vs. out-of-place updates, external sorting, migration overhead, flash-aware design (avoid random writes, prefer sequential access patterns, minimize physical writes per logical update), read-update-memory tradeoff

Readings

Discussion Slides by Roberto Alcalde Diego
(P) MaSM: Efficient Online Updates in Data Warehouses, SIGMOD 2011
(B) Query Processing Techniques for Solid State Drives, SIGMOD 2009

Class: New Hardware: Data Systems for Multi-Core

Concepts: multi-core, many-core, multi-socket, load balancing, skew resistance, context switching, non-uniform memory access (NUMA), pipeline breaker, elasticity, thread pool, just-in-time (JIT) code compilation, lock-free data structures, hyper-threading, translation lookaside buffer (TLB), open addressing, morsel-driven parallelism, dynamic hashing, outer join, semi-join, anti-join, radix join

Readings

Discussion Slides by Matthew Cote
(P) Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age, SIGMOD 2014
(B) MonetDB/X100: Hyper-Pipelining Query Execution, CIDR 2005

Class: Indexing A: Updateable Bitmaps

Concepts: bitmap indexing, bitvectors, fence pointers, out-of-place updates, query-driven merging, bitmap encoding schemes (RLE, BBC)

Readings

Discussion Slides by Yuan Zhang and Guanting Chen
(P) UpBit: Scalable In-Memory Updatable Bitmap Indexing, SIGMOD 2016
(B) Bitmap Index Design and Evaluation, SIGMOD 1998
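
To make the basic idea of bitmap indexing concrete, here is a minimal Python sketch. It is an illustration only, not the UpBit design from the readings: it keeps one uncompressed bitvector per distinct value, stored as a Python integer, and the function names and sample data are hypothetical.

```python
def build_bitmap_index(column):
    """Build one bitvector per distinct value; bit i is set if row i has that value."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def rows_of(bitvector):
    """Return the row ids whose bits are set, in ascending order."""
    row_ids = []
    row_id = 0
    while bitvector:
        if bitvector & 1:
            row_ids.append(row_id)
        bitvector >>= 1
        row_id += 1
    return row_ids


# Hypothetical sample column.
column = ["red", "blue", "red", "green", "blue", "red"]
index = build_bitmap_index(column)

# A disjunctive predicate becomes a bitwise OR of two bitvectors.
red_or_green = index["red"] | index["green"]
```

Predicates combine with cheap bitwise operations, which is what makes bitmaps attractive for reads. Updating a row in place means touching the bitvectors of both the old and the new value, which is the cost that out-of-place update schemes try to avoid.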

Class: Indexing B: Access Path Selection

Concepts: access path selection, selectivity, concurrency, break-even point, modeling, scans, shared scans, concurrent index accesses, hybrid layouts, column-groups, memory bandwidth vs. latency

Readings

Discussion Slides by Aleksandr Kim
(P) Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe?, SIGMOD 2017
(B) Access Path Selection in a Relational Database Management System, SIGMOD 1979

Class: Modern Storage Engines: Log-Structured Merge Trees

Concepts: log-structured merge trees (LSM-trees), fence pointers, Bloom filter, size ratio, merge policy, leveling, tiering, optimize for workload, sorted runs, levels, Lagrange multipliers

Readings

Discussion Slides by Stathis Karatsiolis
(P) Monkey: Optimal Navigable Key-Value Store, SIGMOD 2017 [conference presentation video]
(B) The Log-Structured Merge-Tree (LSM-Tree), Acta Informatica 1996

Research Talk: Scaling Write-Intensive Key-Value Stores

Visiting lecture from Niv Dayan, Harvard University

Bio: Niv Dayan is a postdoc at the Data Systems Lab at Harvard. He holds a PhD from the IT University of Copenhagen. Niv works at the intersection of systems and theory for designing efficient data storage, and in his current work he identifies and maps the fundamentally best scalability trade-offs for key-value stores.

Readings

Presentation Slides

Class: Modern Storage Engines: HTAP Systems

Concepts: key-value stores, state, point queries, blind updates, read-modify-write, locality, epoch, latch-free design, cache-friendly design, immutable file, mutable file, append-only systems, in-place updates, conflict-free replicated data type (CRDT)

Readings

Discussion Slides by Shirley Hu
(P) FASTER: A Concurrent Key-Value Store with In-Place Updates, SIGMOD 2018 [conference presentation video]
(B) Fast Scans on Key-Value Stores, VLDB 2017

Class: Indexing C: Data Skipping

Concepts: partitioning, horizontal partitioning, vertical partitioning, hybrid partitioning, zonemaps, tuple reconstruction, normalized schema, denormalized schema, clustering, use of clustering and feature extraction for partitioning

Readings

Discussion Slides by Allison Weaver
(P) Skipping-oriented Partitioning for Columnar Layouts, VLDB 2016
(B) Small Materialized Aggregates: A Light Weight Index for Data Warehousing, VLDB 1998
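
As an illustration of how zonemaps enable data skipping, here is a minimal Python sketch. It is not the partitioning scheme from the readings; it assumes fixed-size zones over a single in-memory column, and the function names and sample values are hypothetical.

```python
def build_zonemap(values, zone_size):
    """Record (start offset, min, max) for each fixed-size zone of the column."""
    zones = []
    for start in range(0, len(values), zone_size):
        chunk = values[start:start + zone_size]
        zones.append((start, min(chunk), max(chunk)))
    return zones


def range_scan(values, zones, zone_size, low, high):
    """Return matching row ids and the number of zones actually read."""
    matches = []
    zones_read = 0
    for start, zone_min, zone_max in zones:
        # Skip the zone if its [min, max] range cannot overlap [low, high].
        if zone_max < low or zone_min > high:
            continue
        zones_read += 1
        for offset, value in enumerate(values[start:start + zone_size]):
            if low <= value <= high:
                matches.append(start + offset)
    return matches, zones_read


# Hypothetical column with three zones of four values each.
values = [3, 5, 4, 8, 20, 22, 25, 21, 9, 7, 6, 10]
zones = build_zonemap(values, 4)
matches, zones_read = range_scan(values, zones, 4, 6, 9)
```

In this example the middle zone is skipped because its minimum exceeds the upper bound of the query. Skipping is only effective when values with similar ranges are stored together, which is why partitioning and clustering appear alongside zonemaps in the concept list.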

Class: Indexing D: Adaptive Indexing

Concepts: adaptive indexing, cracking, stochastic cracking, hybrid cracking, scan, sort and binary search, adaptive adaptive indexing, radix partitioning, TLB, software managed buffers, non-temporal streaming stores, partitioning fanout, skew, adaptive indexing convergence rate, simulated annealing, uniform/normal/zipfian distribution

Readings

Discussion Slides by Jiangshan Luo and Ruidong Duan
(P) Adaptive Adaptive Indexing, ICDE 2018
(B) Self-organizing Tuple Reconstruction in Column-stores, SIGMOD 2009

Class canceled - Replaced by OH

Research Talk: Introduction to Time-Series Data Mining

Visiting lecture from John Paparrizos, University of Chicago

Bio: John Paparrizos is a postdoctoral researcher at the University of Chicago. He holds a PhD in Computer Science from Columbia University. His research revolves around developing methods, tools, and systems to help store, organize, retrieve, analyze, and transform into actionable knowledge massive collections of evolving data (i.e., time series, streams of data, and sensor/IoT data).

Class: In-Situ Data Processing: Efficiently Accessing Raw Data Files

Concepts: in-situ query processing, raw data files, adaptive partitioning, fine-grained indexing, query-based vs. homogeneous partitioning, implicit clustering, eviction policy, workload shift, memory consumption

Readings

Discussion Slides by Ketill Gudmundsson
(P) Slalom: Coasting Through Raw Data via Adaptive Partitioning and Indexing, VLDB 2017
(B) NoDB: Efficient Query Execution on Raw Data Files, SIGMOD 2012

Class: Scientific Databases: Multi-dimensional Data Management

Concepts: array management systems, multi-dimensional arrays, storage manager, tiles, thread-safe, process-safe, atomicity, dense vs. sparse arrays, global cell order, fragments, dense vs. sparse fragments, consolidation

Readings

Class Overview Progress Slides
Discussion Slides by Ziyang Chen and Yuansheng Dong
(P) The TileDB Array Data Storage Manager, VLDB 2017
(B) Overview of SciDB: Large Scale Array Storage, Processing and Analysis, SIGMOD 2010

Class: Distributed Databases: Global Distributed Systems

Concepts: global-scale distributed database, concurrency control, Paxos, data sharding, external consistency, TrueTime API

Readings

Discussion Slides by Nikhilesh Murugavel
(P) Spanner: Google's Globally-Distributed Database, OSDI 2012
(B) Megastore: Providing Scalable, Highly Available Storage for Interactive Services, CIDR 2011

Class: Map/Reduce: Data Management at Scale

Concepts: Map/Reduce, distributed file systems, resource management, positional delta trees, SQL-on-Map/Reduce, massively parallel processing database management systems (MPP DBMS), user-defined functions (UDF), encryption, authentication, user role management, elasticity, data warehouse, fact table, merge-join, partition attributes across (PAX) layout, message passing interface (MPI), two-phase commit (2PC), ACID

Systems/Approaches: Hadoop, Spark, YARN, HDFS, Hive, Impala, Vectorwise, Actian Vector

Readings

Discussion Slides by Weixi Li and Yang Yang
(P) VectorH: Taking SQL-on-Hadoop to the Next Level, SIGMOD 2016
(B) SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures, VLDB 2014
(B) HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads, VLDB 2009

Class: Data Systems for ML: Data Processing Primitives for ML

Concepts: machine learning, statistical analysis, data science, data exploration, data exploration systems, data systems for machine learning, Data Canopy, decomposable statistics, work repetition, materialized aggregates, data movement, basic aggregates, chunk, segment trees, offline/online/speculative execution, smart cache, eviction policy

Readings

Overview & Project Administrative Slides
Discussion Slides by Keith Lovett
(P) Data Canopy: Accelerating Exploratory Statistical Analysis, SIGMOD 2017 [short presentation video]
(B) MLbase: A Distributed Machine-learning System, CIDR 2013

Class: ML for Data Systems: Automatic Data System Design

Concepts: physical design, machine learning, tuning knobs, database administrator (DBA), OtterTune, workload characterization, k-means clustering, knob identification, automatic tuner, feature selection, linear regression model, ordinary least squares, workload mapping (dynamic vs. static), configuration recommendation

Readings

Discussion Slides by Reed Callahan and Yuhao Bai
(P) Automatic Database Management System Tuning Through Large-scale Machine Learning, SIGMOD 2017 [conference presentation video]
(B) Self-Driving Database Management Systems, CIDR 2017

Class: Indexing E: Learned Indexes

Concepts: learned indexes, B+-Trees, hash indexes, Bloom filters, cumulative distribution function (CDF), recursive model index, hybrid learned indexes

Readings

Discussion Slides by Sumer Rathinam
(P) The Case for Learned Index Structures, SIGMOD 2018 [conference presentation video]

Class: Indexing F: The Periodic Table of Access Methods

Concepts: data structure design, cost synthesis, first principles, learned cost models, design primitives, what-if design questions, design space, partitioning, search algorithms, benchmarking, model fitting

Readings

Discussion Slides by Jake Bloomfeld
(P) The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models, SIGMOD 2018 [conference presentation video]
(B) The Periodic Table of Data Structures, IEEE Data Engineering Bulletin 2018

Class: Project Presentations A

Presentations

12:30 [Provide class evaluation]
12:42 Yuhao Bai
12:54 Jake Bloomfeld, Keith Lovett, Sumer Rathinam
13:06 Stathis Karatsiolis, John Merfeld
13:18 Reed Callahan, Guanting Chen, Yuan Zhang
13:30 Alexander Kim

Class: Project Presentations B

Presentations

12:30 Megan Fantes, Ketill Guðmundsson, Nikhilesh Murugavel, Ally Weaver
12:42 Ziyang Chen, Yuansheng Dong
12:54 Ruidong Duan, Jiangshan Luo
13:06 Matthew Cote
13:18 Roberto Alcalde, Shirley Hu
13:30 Weixi Li, Yang Yang
