The schedule is tentative and subject to change (e.g., snow days).
The lecture slides and other materials will be posted on Piazza.
| Date | Topic | Notes | Recommended reading |
|---|---|---|---|
| Class 1 (Sep 2) | Course overview, logistics, intro to language models | | (Bengio+ 2003) paper, MLP implementation |
| Class 2 (Sep 4) | Supervised learning, architectures, transformers | | (Vaswani+ 2017) paper, GPT implementation |
| Class 3 (Sep 9) | Intro to GPUs, flash attention | | (Dao+ 2022) paper, GPUs |
| Class 4 (Sep 11) | Variants of attention for time/memory optimization | | SVD (ch3), DeepSeek-V2 paper |
| Class 5 (Sep 16) | Backpropagation algorithm | HW1 posted | Backprop notes (ch7.4), implementation |
| Class 6 (Sep 18) | Backprop module formulas | | |
| Class 7 (Sep 23) | Optimization algorithms: SGD, adaptive optimizers | | Adam |
| Class 8 (Sep 25) | Optimization algorithms: low-memory and other recent algorithms | | low memory, Muon |
| Class 9 (Sep 30) | Distributed training on GPUs: data parallelism, DeepSpeed/ZeRO, FSDP | | GPUs, DeepSpeed/ZeRO, FSDP |
| Class 10 (Oct 2) | Distributed training on GPUs: model and activation parallelism | HW2 posted | (Narayanan+ 2021), (Korthikanti+ 2022) |
| Class 11 (Oct 7) | Nearest-neighbor search, locality-sensitive hashing | | LSH |
| Class 12 (Oct 9) | Kernel density estimation | Send title of paper you intend to present | KDE |
| Oct 14 | No class, Monday schedule | | |
| Class 13 (Oct 16) | Graph-based nearest neighbor search on CPUs and GPUs | | HNSW, CAGRA |
| Class 14 (Oct 21) | Sparse approximation of attention | | Sparse FlashAttention, KDEformer, DeepSeek-V3.2, DeepSeek Native Sparse Attention |
| Class 15 (Oct 23) | Mixture of experts | | Switch Transformer, DeepSeek-V3 |
| Class 16 (Oct 28) | Structured state space models | | S4, S4D |
| Class 17 (Oct 30) | Mamba and hybrid models | Project proposal due | Mamba, Mamba2, Nemotron |
| Class 18 (Nov 4) | Finetuning, fast inference | | LoRA, HydraLoRA, Speculative Decoding |
| Class 19 (Nov 6) | Quantization (clustering, hashing, e.g., RaBitQ; KV-cache and model weight compression) | | Residual Quantization, Qinco2, RaBitQ, CommVQ, AQLM |
| Class 20 (Nov 11) | CG, OO: DeepSeek-V3 | | |
| Class 21 (Nov 13) | YK: LLaMA: Open and Efficient Foundation Language Models | Project progress report due | |
| Class 22 (Nov 18) | DL: Faster Causal Attention Over Large Sequences Through Sparse Flash Attention, LG: Hashing-Based-Estimators for Kernel Density in High Dimensions | | |
| Class 23 (Nov 20) | JH: DeepSeek-OCR, YT: Shampoo: Preconditioned Stochastic Tensor Optimization | | |
| Class 24 (Nov 25) | SB: Reducing Activation Recomputation in Large Transformer Models | | |
| Nov 27 | No class, Thanksgiving | | |
| Class 25 (Dec 2) | WL: Decoupled Weight Decay Regularization, MS: Gluon: Making Muon & Scion Great Again! | | |
| Class 26 (Dec 4) | VK: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, EC: RoarGraph | | |
| Class 27 (Dec 9) | XF, ZC: Matryoshka Quantization | Project final report due | |