CS 599 E1: Algorithms for Machine Learning



The schedule is tentative and subject to change (e.g., for snow days).
The lecture slides and other materials will be posted on Piazza.

| Date | Topic | Notes | Recommended reading |
| Class 1 (Sep 2) | Course overview, logistics, intro to language models | | (Bengio+ 2003) paper, MLP implementation |
| Class 2 (Sep 4) | Supervised learning, architectures, transformers | | (Vaswani+ 2017) paper, GPT implementation |
| Class 3 (Sep 9) | Intro to GPUs, flash attention | | (Dao+ 2022) paper, GPUs |
| Class 4 (Sep 11) | Variants of attention for time/memory optimization | | SVD (ch3), DeepSeek-V2 paper |
| Class 5 (Sep 16) | Backpropagation algorithm | HW1 posted | Backprop notes (ch7.4), implementation |
| Class 6 (Sep 18) | Backprop | | module formulas |
| Class 7 (Sep 23) | Optimization algorithms: SGD, adaptive optimizers | | Adam |
| Class 8 (Sep 25) | Optimization algorithms: low memory and other recent algorithms | | low memory, Muon |
| Class 9 (Sep 30) | Distributed training on GPUs: data parallelism, DeepSpeed/ZeRO, FSDP | | GPUs, DeepSpeed/ZeRO, FSDP |
| Class 10 (Oct 2) | Distributed training on GPUs: model and activation parallelism | HW2 posted | (Narayanan+ 2021), (Korthikanti+ 2022) |
| Class 11 (Oct 7) | Nearest-neighbor search, locality sensitive hashing | | LSH |
| Class 12 (Oct 9) | Kernel density estimation | Send title of paper you intend to present | KDE |
| Oct 14 | No class, Monday schedule | | |
| Class 13 (Oct 16) | Graph-based nearest neighbor search on CPUs and GPUs | | HNSW, CAGRA |
| Class 14 (Oct 21) | Sparse approximation of attention | | Sparse FlashAttention, KDEformer, DeepSeek-v3.2, DeepSeek Native Sparse Attention |
| Class 15 (Oct 23) | Mixture of experts | | Switch Transformer, DeepSeek-V3 |
| Class 16 (Oct 28) | Structured state space models | | S4, S4D |
| Class 17 (Oct 30) | Mamba and hybrid models | Project proposal due | Mamba, Mamba2, Nemotron |
| Class 18 (Nov 4) | Finetuning, fast inference | | LoRA, HydraLORA, Speculative Decoding |
| Class 19 (Nov 6) | Quantization (clustering, hashing e.g. RaBitQ, KV-cache and model weight compression) | | Residual Quantization, Qinco2, RaBitQ, CommVQ, AQLM |
Paper presentations
| Class 20 (Nov 11) | CG, OO: DeepSeek-v3 | | |
| Class 21 (Nov 13) | YK: LLaMA: Open and Efficient Foundation Language Models | Project progress report due | |
| Class 22 (Nov 18) | DL: Faster Causal Attention Over Large Sequences Through Sparse Flash Attention, LG: Hashing-Based-Estimators for Kernel Density in High Dimensions | | |
| Class 23 (Nov 20) | JH: DeepSeek-OCR, YT: Shampoo: Preconditioned Stochastic Tensor Optimization | | |
| Class 24 (Nov 25) | SB: Reducing Activation Recomputation in Large Transformer Models | | |
| Nov 27 | No class, Thanksgiving | | |
| Class 25 (Dec 2) | WL: Decoupled Weight Decay Regularization, MS: Gluon: Making Muon & Scion Great Again! | | |
| Class 26 (Dec 4) | VK: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, EC: RoarGraph | | |
| Class 27 (Dec 9) | XF, ZC: Matryoshka Quantization | Project final report due | |