Class: Tu/Th 12:30-1:45pm, MCS B23
Instructor: Manos Athanassoulis
Teaching Assistant: Subhadeep Sarkar
(Office Hours: MCS 283, M/W 2-3pm)
Office: MCS 279
OH: Tu 2-3pm/Th 2:30-3:30pm
Discussion on Piazza
Also, before each class submit your paper review!
Here you can find the tentative schedule of the class (which might change as the semester progresses).
In this class we will discuss the basics of data systems and the goals and structure of the course.
In this class we discuss the fundamental components that comprise a database system. We will see the commonalities and the differences of the main database system architectures and we will discuss why we have several different ones.
In this class we continue discussing data systems architectures and the basics for modern systems including relational, graph systems, and key-value stores.
In this class the students will be introduced to the class semester project.
Concepts: column-stores, row-stores, vertical partitioning, index-only plans, materialized views, tuple reconstruction, late/early materialization, block iteration, vectorized execution (block iteration), compression (run length encoding), hash joins, index joins, sort-merge joins, invisible joins, star schema
Concepts: on-line transaction processing (OLTP), on-line analytical processing (OLAP), n-ary storage model (NSM), decomposition storage model (DSM), partition attributes across (PAX), flexible storage model (FSM), projectivity, selectivity, concurrency control, multi-version concurrency control (MVCC), two-phase locking (2PL)
Concepts: hard disk drives (HDD), solid-state drives (SSD), flash-based SSDs, in-place vs. out-of-place updates, external sorting, migration overhead, flash-aware design (avoid random writes, prefer sequential access patterns, minimize physical writes per logical update), read-memory-update tradeoff
Concepts: multi-core, many-core, multi-socket, load balancing, skew resistance, context switching, non-uniform memory architectures (NUMA), pipeline breaker, elasticity, thread pool, just-in-time (JIT) code compilation, lock-free data structures, hyper-threading, translation lookaside buffer (TLB), open addressing, morsel-driven parallelism, dynamic hashing, outer join, semi-join, anti-join, radix join
Concepts: bitmap indexing, bitvectors, fence pointers, out-of-place updates, query-driven merging, bitmap encoding schemes (RLE, BBC)
Concepts: access path selection, selectivity, concurrency, break-even point, modeling, scans, shared scans, concurrent index accesses, hybrid layouts, column-groups, memory bandwidth vs. latency
Concepts: log-structured merge trees (LSM-trees), fence pointers, Bloom filter, size ratio, merge policy, leveling, tiering, optimize for workload, sorted runs, levels, Lagrange multipliers
Visiting lecture from Niv Dayan, Harvard University
Bio: Niv Dayan is a postdoc at the Data Systems Lab at Harvard. He holds a PhD from the IT University of Copenhagen. Niv works at the intersection of systems and theory for designing efficient data storage, and in his current work he identifies and maps the fundamentally best scalability trade-offs for key-value stores.
Concepts: key-value stores, state, point queries, blind updates, read-modify-write, locality, epoch, latch-free design, cache-friendly design, immutable file, mutable file, append-only systems, in-place updates, conflict-free replicated data type (CRDT)
Concepts: partitioning, horizontal partitioning, vertical partitioning, hybrid partitioning, zonemaps, tuple reconstruction, normalized schema, denormalized schema, clustering, use of clustering and feature extraction for partitioning
Concepts: adaptive indexing, cracking, stochastic cracking, hybrid cracking, scan, sort and binary search, adaptive adaptive indexing, radix partitioning, TLB, software managed buffers, non-temporal streaming stores, partitioning fanout, skew, adaptive indexing convergence rate, simulated annealing, uniform/normal/zipfian distribution
Visiting lecture from John Paparrizos, University of Chicago
Bio: John Paparrizos is a postdoctoral researcher at the University of Chicago. He holds a PhD in Computer Science from Columbia University. His research revolves around developing methods, tools, and systems to help store, organize, retrieve, analyze, and transform into actionable knowledge massive collections of evolving data (i.e., time series, streams of data, and sensor/IoT data).
Concepts: in-situ query processing, raw data files, adaptive partitioning, fine-grained indexing, query-based vs. homogenous partitioning, implicit clustering, eviction policy, workload shift, memory consumption
Concepts: array management systems, multi-dimensional arrays, storage manager, tiles, thread-safe, process-safe, atomicity, dense vs. sparse arrays, global cell order, fragments, dense vs. sparse fragments, consolidation
Concepts: global-scale distributed database, concurrency control, Paxos, data sharding, external consistency, TrueTime API
Concepts: Map/Reduce, distributed file systems, resource management, positional delta trees, SQL-on-Map/Reduce, massively parallel processing database management systems (MPP DBMS), user-defined functions (UDF), encryption, authentication, user role management, elasticity, data warehouse, fact table, merge-join, partitioning attributes across (PAX) layout, message passing interface (MPI), two-phase commit (2PC), ACID
Systems/Approaches: Hadoop, Spark, YARN, HDFS Hive, Impala, Vectorwise, Actian Vector
Concepts: machine learning, statistical analysis, data science, data exploration, data exploration systems, data systems for machine learning, Data Canopy, decomposable statistics, work repetition, materialized aggregates, data movement, basic aggregates, chunk, segment trees, offline/online/speculative execution, smart cache, eviction policy
Concepts: physical design, machine learning, tuning knobs, database administrator (DBA), OtterTune, workload characterization, k-means clustering, knob identification, automatic tuner, feature selection, linear regression model, ordinary least squares, workload mapping (dynamic vs. static), configuration recommendation
Concepts: learned indexes, B+-Trees, hash indexes, Bloom filters, cumulative density function (CDF), recursive model index, hybrid learned indexes
Concepts: data structure design, cost synthesis, first principles, learned cost models, design primitives, what-if design questions, design space, partitioning, search algorithms, benchmarking, model fitting