CS 591 A1

Data Systems Architectures


Class at a glance

Class: Tu/Th 12:30-1:45pm, MCS B23
Instructor: Manos Athanassoulis 

Teaching Assistant: Subhadeep Sarkar 
(Office Hours: MCS 283, M/W 2-3pm)

Office: MCS 279
OH: Tu 2-3pm/Th 2:30-3:30pm

Discussion on Piazza

Announcements

  • Final grades are now uploaded!
  • Register for your project presentation.
  • A list of useful links added in the project page.
  • Our first visitor is coming on March 5th!
  • More details about the projects are available in the project page.
  • Slides for Class 1 are available. We will be posting slides as soon as possible after class.
  • Register for presentations!
  • The schedule is populated with the papers to discuss.
  • Syllabus is now online.
  • Website is up. Stay tuned for more!
  • Class starts on 1/22.


Class Milestones - Important Dates

  • February 4th, last day to add (any) class
  • February 8th, select a project
  • February 22nd, come up with a proposal (which you have discussed in OH)
  • March 8th, submit your mid-way project report
  • April 28th, April 29th 11:59pm, submit your final project report/code
  • April 30th and May 2nd, project presentation (in class)
  • May 7th 11:59pm, final submission of amended report (if needed) with maximum project grade change 10%

Also, before each class submit your paper review!



Class Schedule

Here you can find the tentative schedule of the class (which might change as the semester progresses).

Class : Introduction to Data Systems and CS591

In this class we will discuss the basics of data systems and the goals and structure of the course.

Readings

Class : Data Systems Architectures Essentials – Part 1

In this class we discuss the fundamental components that comprise a database system. We will see the commonalities and the differences of the main database system architectures and we will discuss why we have several different ones.

Readings

Class : Data Systems Architectures Essentials – Part 2

In this class we continue discussing data systems architectures and the basics for modern systems including relational, graph systems, and key-value stores.

Readings

Class : Class Project Overview

In this class the students will be introduced to the class semester project.

Readings

Class : Storage Layouts: Row-Stores vs. Column-Stores

Concepts: column-stores, row-stores, vertical partitioning, index-only plans, materialized views, tuple reconstruction, late/early materialization, block iteration, vectorized execution (block iteration), compression (run length encoding), hash joins, index joins, sort-merge joins, invisible joins, star schema

Readings

Class : Storage Layouts: Adaptive & Hybrid Layouts

Concepts: on-line transaction processing (OLTP), on-line analytical processing (OLAP), n-ary storage model (NSM), decomposition storage model (DSM), partition attributes across (PAX), flexible storage model (FSM), projectivity, selectivity, concurrency control, multi-version concurrency control (MVCC), two-phase locking (2PL)

Readings

Class : New Hardware: Data Systems for Flash

Concepts: hard disk drives (HDD), solid-state drives (SSD), flash-based SSDs, in-place vs. out-of-place updates, external sorting, migration overhead, flash-aware design (avoid random writes, prefer sequential access patterns, minimize physical writes per logical update), read-memory-update tradeoff

Readings

Class : New Hardware: Data Systems for Multi-Core

Concepts: multi-core, many-core, multi-socket, load balancing, skew resistance, context switching, non-uniform memory architectures (NUMA), pipeline breaker, elasticity, thread pool, just-in-time (JIT) code compilation, lock-free data structures, hyper-threading, translation lookaside buffer (TLB), open addressing, morsel-driven parallelism, dynamic hashing, outer join, semi-join, anti-join, radix join

Readings

Class : Indexing A: Updateable Bitmaps

Concepts: bitmap indexing, bitvectors, fence pointers, out-of-place updates, query-driven merging, bitmap encoding schemes (RLE, BBC)

Readings

Class : Indexing B: Access Path Selection

Concepts: access path selection, selectivity, concurrency, break-even point, modeling, scans, shared scans, concurrent index accesses, hybrid layouts, column-groups, memory bandwidth vs. latency

Readings

Class : Modern Storage Engines: Log-Structured Merge Trees

Concepts: log-structured merge trees (LSM-trees), fence pointers, Bloom filter, size ratio, merge policy, leveling, tiering, optimize for workload, sorted runs, levels, Lagrange multipliers

Readings

Research Talk : Scaling Write-Intensive Key-Value Stores

Visiting lecture from Niv Dayan, Harvard University

Bio: Niv Dayan is a postdoc at the Data Systems Lab at Harvard. He holds a PhD from the IT University of Copenhagen. Niv works at the intersection of systems and theory for designing efficient data storage, and in his current work he identifies and maps the fundamentally best scalability trade-offs for key-value stores.

Readings

Class : Modern Storage Engines: HTAP Systems

Concepts: key-value stores, state, point queries, blind updates, read-modify-write, locality, epoch, latch-free design, cache-friendly design, immutable file, mutable file, append-only systems, in-place updates, conflict-free replicated data type (CRDT)

Readings

Class : Indexing C: Data Skipping

Concepts: partitioning, horizontal partitioning, vertical partitioning, hybrid partitioning, zonemaps, tuple reconstruction, normalized schema, denormalized schema, clustering, use of clustering and feature extraction for partitioning

Readings

Class : Indexing D: Adaptive Indexing

Concepts: adaptive indexing, cracking, stochastic cracking, hybrid cracking, scan, sort and binary search, adaptive adaptive indexing, radix partitioning, TLB, software managed buffers, non-temporal streaming stores, partitioning fanout, skew, adaptive indexing convergence rate, simulated annealing, uniform/normal/zipfian distribution

Readings

Class canceled - Replaced by OH

Research Talk : Introduction to Time-Series Data Mining

Visiting lecture from John Paparrizos, University of Chicago

Bio: John Paparrizos is a postdoctoral researcher at the University of Chicago. He holds a PhD in Computer Science from Columbia University. His research revolves around developing methods, tools, and systems to help store, organize, retrieve, analyze, and transform into actionable knowledge massive collections of evolving data (i.e., time series, streams of data, and sensor/IoT data).

Class : In-Situ Data Processing: Efficiently Accessing Raw Data Files

Concepts: in-situ query processing, raw data files, adaptive partitioning, fine-grained indexing, query-based vs. homogenous partitioning, implicit clustering, eviction policy, workload shift, memory consumption

Readings

Class : Scientific Databases: Multi-dimensional Data Management

Concepts: array management systems, multi-dimensional arrays, storage manager, tiles, thread-safe, process-safe, atomicity, dense vs. sparse arrays, global cell order, fragments, dense vs. sparse fragments, consolidation

Readings

Class : Distributed Databases: Global Distributed Systems

Concepts: global-scale distributed database, concurrency control, Paxos, data sharding, external consistency, TrueTime API

Readings

Class : Map/Reduce: Data Management at Scale

Concepts: Map/Reduce, distributed file systems, resource management, positional delta trees, SQL-on-Map/Reduce, massively parallel processing database management systems (MPP DBMS), user-defined functions (UDF), encryption, authentication, user role management, elasticity, data warehouse, fact table, merge-join, partitioning attributes across (PAX) layout, message passing interface (MPI), two-phase commit (2PC), ACID

Systems/Approaches: Hadoop, Spark, YARN, HDFS Hive, Impala, Vectorwise, Actian Vector

Readings

Class : Data Systems for ML: Data Processing Primitives for ML

Concepts: machine learning, statistical analysis, data science, data exploration, data exploration systems, data systems for machine learning, Data Canopy, decomposable statistics, work repetition, materialized aggregates, data movement, basic aggregates, chunk, segment trees, offline/online/speculative execution, smart cache, eviction policy

Readings

Class : ML for Data Systems: Automatic Data System Design

Concepts: physical design, machine learning, tuning knobs, database administrator (DBA), OtterTune, workload characterization, k-means clustering, knob identification, automatic tuner, feature selection, linear regression model, ordinary least squares, workload mapping (dynamic vs. static), configuration recommendation

Readings

Class : Indexing E: Learned Indexes

Concepts: learned indexes, B+-Trees, hash indexes, Bloom filters, cumulative density function (CDF), recursive model index, hybrid learned indexes

Readings

Class : Indexing F: The Periodic Table of Access Methods

Concepts: data structure design, cost synthesis, first principles, learned cost models, design primitives, what-if design questions, design space, partitioning, search algorithms, benchmarking, model fitting

Readings

Class : Project Presentations A

Presentations

Class : Project Presentations B

Presentations