ECE 901 @ UW Madison — Advanced Topics in Large Language Models
1) Training LLMs
1-1) Pretraining
1‑1‑1) Architecture
Week 1‑1: DeepSeek-V2/3
- DeepSeek‑V2: A Strong, Economical, and Efficient Mixture‑of‑Experts Language Model: Introduces Multi‑head Latent Attention (MLA) and DeepSeekMoE to shrink the KV cache and cut compute while keeping quality (see the MLA sketch after this list).
- DeepSeek‑V3 Technical Report: 671B‑parameter MoE with ~37B active; uses MLA, auxiliary‑loss‑free routing, and multi‑token prediction (MTP).
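A minimal sketch of MLA's core KV-compression idea, with illustrative shapes and module names (the real MLA also keeps a decoupled RoPE key path and a low-rank query compression, which are omitted here):

```python
# Sketch: cache a small latent c_kv per token and rebuild per-head K/V from it on the fly.
import torch
import torch.nn as nn

d_model, n_heads, d_head, d_latent = 1024, 8, 128, 64

W_dkv = nn.Linear(d_model, d_latent, bias=False)           # down-projection; the latent is what gets cached
W_uk  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-project latent -> keys
W_uv  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-project latent -> values
W_q   = nn.Linear(d_model, n_heads * d_head, bias=False)

h = torch.randn(2, 16, d_model)                  # (batch, seq, d_model)
c_kv = W_dkv(h)                                  # (batch, seq, d_latent): the only KV state we cache
q = W_q(h).view(2, 16, n_heads, d_head)
k = W_uk(c_kv).view(2, 16, n_heads, d_head)      # reconstructed from the latent at attention time
v = W_uv(c_kv).view(2, 16, n_heads, d_head)

attn = torch.einsum("bqhd,bkhd->bhqk", q, k) / d_head**0.5   # (causal mask omitted in this sketch)
out = torch.einsum("bhqk,bkhd->bqhd", attn.softmax(-1), v).reshape(2, 16, -1)
# Cache cost per token: d_latent floats instead of 2 * n_heads * d_head.
print(c_kv.shape, out.shape)
```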
Week 1‑2: MoE and Multi‑token Prediction
- Mixtral of Experts (8×7B): Open sparse MoE baseline with top‑2 routing that rivals much larger dense models using far fewer active parameters (routing sketch after this list).
- Better & Faster Large Language Models via Multi‑token Prediction: Training with multiple future‑token heads boosts sample efficiency and enables faster inference via self‑speculative decoding, at no extra pretraining cost.
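A minimal sketch of Mixtral-style top-2 token-choice routing; the gate, expert MLPs, and sizes are illustrative, and the loop below runs every expert densely for clarity (real kernels gather only the routed tokens and add a load-balancing loss):

```python
import torch
import torch.nn as nn

d_model, n_experts, top_k = 512, 8, 2
gate = nn.Linear(d_model, n_experts, bias=False)
experts = nn.ModuleList(
    [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
     for _ in range(n_experts)]
)

x = torch.randn(4, 32, d_model)                   # (batch, seq, d_model)
logits = gate(x)                                  # (batch, seq, n_experts)
weights, idx = logits.topk(top_k, dim=-1)         # route each token to its top-2 experts
weights = weights.softmax(dim=-1)                 # renormalize over the selected experts only

out = torch.zeros_like(x)
for e in range(n_experts):
    mask = (idx == e)                             # which tokens picked expert e (in either slot)
    if mask.any():
        w = (weights * mask).sum(-1, keepdim=True)   # combined gate weight of expert e per token
        out = out + w * experts[e](x)                # dense compute here; sparse in real kernels
print(out.shape)
```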
Week 2‑1: Positional Encodings and Long Context.
- RoFormer: Enhanced Transformer with Rotary Position Embedding: Rotary positions stabilize attention and generalize better than absolute or relative schemes in many settings (see the sketch after this list).
- LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens: Scaling and rescaling tricks push RoPE to million‑token ranges with controlled degradation.
- YaRN: Efficient Context Window Extension of LLMs: Simple RoPE scaling recipe that reliably extends context length with strong retention of quality.
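A minimal RoPE sketch under the common base-10000 convention; shapes are illustrative. Context-extension methods such as YaRN and LongRoPE largely amount to rescaling these frequencies and/or positions:

```python
import torch

def rope(x, base=10000.0):
    # x: (batch, seq, n_heads, d_head) with even d_head
    b, s, h, d = x.shape
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)    # (d/2,) frequencies
    theta = torch.arange(s, dtype=torch.float32)[:, None] * inv_freq[None]  # (seq, d/2) angles
    cos, sin = theta.cos()[None, :, None, :], theta.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    # rotate each 2-D pair (x1, x2) by its position-dependent angle
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(2, 128, 8, 64)
print(rope(q).shape)   # q·k after rotation depends only on the relative offset between positions
```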
Week 2‑2: LayerNorm & RMSNorm. Wrap up with gpt-oss.
- On Layer Normalization in the Transformer Architecture: Pre‑LN eases optimization at scale; Post‑LN can reach slightly better end quality but is harder to train.
- Root Mean Square Layer Normalization (RMSNorm): Drops mean‑centering to reduce overhead while keeping performance competitive (sketch after this list).
- gpt-oss: Let’s review OpenAI’s open‑weight gpt-oss models!
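A minimal RMSNorm sketch for comparison with nn.LayerNorm; the dimension and eps are illustrative:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        # normalize by the root-mean-square alone: no mean subtraction, no bias term
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

x = torch.randn(4, 16, 512)
print(RMSNorm(512)(x).shape, nn.LayerNorm(512)(x).shape)
```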
Week 3‑1: Is “decoder-only Transformer + left-to-right autoregressive decoding” the end of the story? Part 1.
- ENTP: Encoder‑only Next‑Token Prediction: Shows encoder‑only NTP can be competitive and more expressive for some tasks.
- Mamba: Linear‑time Sequence Modeling with Selective State Spaces: Linear‑time SSM with content‑aware (selective) state updates that handles long sequences well (toy scan after this list).
- (optional 1) Transformers are SSMs (Mamba‑2); Hydra
- (optional 2) Mamba+Transformer hybrids can do in-context learning; Hymba; Falcon-H1
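A toy, single-channel sketch of the selective state-space recurrence behind Mamba, with illustrative projections and a plain sequential loop (the real model runs this per channel with learned discretization and a fused, hardware-aware parallel scan):

```python
import torch
import torch.nn as nn

d_model, d_state = 64, 16
A = -torch.rand(d_state)                          # negative diagonal dynamics: a stable decay
to_delta = nn.Sequential(nn.Linear(d_model, 1), nn.Softplus())
to_B, to_C = nn.Linear(d_model, d_state), nn.Linear(d_model, d_state)
to_x = nn.Linear(d_model, 1)

def selective_scan(u):                            # u: (seq, d_model)
    h = torch.zeros(d_state)
    ys = []
    for u_t in u:                                 # O(seq) scan with O(1) state, unlike attention
        delta = to_delta(u_t)                     # input-dependent step size (the "selection")
        B_t, C_t = to_B(u_t), to_C(u_t)           # input-dependent B and C
        A_bar = torch.exp(delta * A)              # discretize the continuous-time dynamics
        h = A_bar * h + delta * B_t * to_x(u_t)   # state update
        ys.append((C_t * h).sum())                # read-out y_t = C_t · h_t
    return torch.stack(ys)

print(selective_scan(torch.randn(32, d_model)).shape)
```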
Week 3‑2: Is “decoder-only Transformer + left-to-right autoregressive decoding” the end of the story? Part 2.
- Diffusion‑LM Improves Controllable Text Generation: Continuous diffusion over word embeddings enables finer‑grained control than standard AR decoding in some settings.
- LLaDA: Large Language Diffusion Models: Trains diffusion language models at scale (up to 8B), challenging the need for AR decoding.
- Seed‑Diffusion: A large‑scale diffusion language model built for high‑speed, stable inference.
1‑1‑2) Training Data
Week 4‑1: How much data (and compute)?
- Scaling Laws for Neural Language Models: Loss follows power laws in model size, data, and compute.
- Training Compute‑Optimal Large Language Models (Chinchilla): For fixed compute, smaller models trained on more data beat oversized, under‑trained ones (sizing arithmetic after this list).
- Inference‑Aware Scaling Laws for LMs: Chooses model and data under deployment constraints like latency and serving cost.
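Back-of-the-envelope compute-optimal sizing, assuming the common approximations C ≈ 6·N·D training FLOPs and roughly 20 tokens per parameter; the paper fits its own coefficients, so this is illustrative arithmetic only:

```python
def compute_optimal(C_flops, tokens_per_param=20.0):
    # Solve C = 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * ratio))
    N = (C_flops / (6.0 * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    return N, D

for C in [1e21, 1e23, 1e25]:
    N, D = compute_optimal(C)
    print(f"C={C:.0e} FLOPs -> ~{N/1e9:.1f}B params, ~{D/1e9:.0f}B tokens")
```

At Chinchilla's own budget (~5.8e23 FLOPs) this recovers roughly 70B parameters and 1.4T tokens, matching the paper's headline configuration.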
Week 4‑2: Which Data?
- DataComp‑LM: Open benchmark for LLM pretraining data curation; careful filtering and mixtures yield large gains.
- A Survey on Data Selection for Language Models: Organizes methods for filtering, mixing, deduplication, and curriculum.
- OLMo‑2: Pretraining + Mid‑training: Adds a mid‑training phase and releases open artifacts for reproducibility.
- DataComp‑Reasoning (Open Thoughts): Reasoning‑centric data benchmark that lifts step‑by‑step problem solving.
1‑1‑3) Training Algorithms
Week 5‑1: Optimizers
- AdamW: Decoupled Weight Decay Regularization: Decouples weight decay from gradients for better stability and generalization.
- Lion: Symbolic Discovery of Optimization Algorithms: Sign‑momentum optimizer that is memory‑light yet competitive with Adam (AdamW/Lion update sketches after this list).
- Shampoo: Preconditioned Stochastic Tensor Optimization: Matrix and tensor preconditioning accelerates convergence.
- Old Optimizer, New Norm: An Anthology: Reinterprets Adam, Shampoo, and related methods as steepest descent under different norms.
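Minimal single-step sketches of AdamW (decoupled weight decay) and Lion (sign of an interpolated momentum); hyperparameters are illustrative defaults, and state handling is stripped down to one tensor per buffer:

```python
import torch

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.1):
    p.mul_(1 - lr * wd)                               # decoupled decay: applied to weights, not mixed into m, v
    m.mul_(b1).add_(g, alpha=1 - b1)                  # first moment
    v.mul_(b2).addcmul_(g, g, value=1 - b2)           # second moment
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    p.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)   # Adam update

def lion_step(p, g, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.1):
    update = (b1 * m + (1 - b1) * g).sign()           # sign update; only one momentum buffer needed
    p.add_(update + wd * p, alpha=-lr)
    m.mul_(b2).add_(g, alpha=1 - b2)                  # momentum refreshed after the step

p, g = torch.randn(10), torch.randn(10)
adamw_step(p.clone(), g, torch.zeros(10), torch.zeros(10), t=1)
lion_step(p.clone(), g, torch.zeros(10))
```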
Week 5‑2: Newer Optimizers
- SOAP: Runs Adam in the eigenbasis of Shampoo’s preconditioner; a practical optimizer targeting stability and speed in LLM pretraining.
- Deriving Muon: Derives the Muon update as steepest descent under a spectral‑norm constraint on the weight update.
- On Newton-Schulz: The iteration Muon uses to approximately orthogonalize its update matrices (sketch after this list).
- Muon: Orthogonalizes each weight matrix’s momentum update (via Newton‑Schulz) for stable, fast large‑batch training.
- Kimi K2 / MuonClip: Scales Muon to a very large MoE by adding QK‑Clip, which rescales query/key projections to keep attention logits bounded.
- (optional) The Potential of Second-Order Optimization
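A sketch of orthogonalizing an update matrix with the classic cubic Newton-Schulz iteration. Muon applies this idea to each weight matrix's momentum and gets away with a handful of steps of a tuned higher-order polynomial; the plain cubic version below is the textbook variant, not Muon's exact iteration:

```python
import torch

def newton_schulz_orthogonalize(G, steps=15, eps=1e-7):
    # X <- 1.5 X - 0.5 X X^T X converges to U V^T (the polar/orthogonal factor of G)
    # once the input is scaled so its singular values lie in (0, sqrt(3)).
    X = G / (G.norm() + eps)              # Frobenius scaling keeps singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X   # each step pushes singular values toward 1
    return X

G = torch.randn(256, 128)                 # e.g. a momentum matrix for a weight of this shape
O = newton_schulz_orthogonalize(G)
print(torch.dist(O.T @ O, torch.eye(128)))   # near zero: columns are close to orthonormal
```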
Week 6‑1: Optimizer Benchmarks
- Fantastic Pretraining Optimizers and Where to Find Them: Head‑to‑head comparisons of pretraining optimizers under controlled setups.
- Benchmarking Optimizers for Large Language Model Pretraining: Independent large‑scale benchmark with ablations and cost breakdowns.
Week 6‑2: Efficient Training
- Megatron‑LM: Training Multi‑Billion Parameter Language Models Using Model Parallelism: Tensor and pipeline parallelism blueprint for very large Transformers.
- ZeRO: Memory Optimizations Toward Training Trillion‑Parameter Models: Shards optimizer states, gradients, and parameters across data‑parallel ranks to fit bigger models (memory arithmetic after this list).
- Dion: Distributed Orthonormalized Updates: A communication‑efficient, distributed take on Muon‑style orthonormalized updates.
- (optional) MegaBlocks: block-sparse GPU kernel for MoE training
- (optional) Google’s MoE: the first modern sparse MoE (and how to efficiently train it)
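Back-of-the-envelope per-GPU memory for model states under the ZeRO stages, assuming the paper's mixed-precision accounting of 2 + 2 + 12 = 16 bytes per parameter (fp16 params and grads, fp32 Adam moments and master weights); activations, buffers, and fragmentation are ignored:

```python
def zero_model_state_gb(n_params, dp_degree, stage):
    P, G, O = 2 * n_params, 2 * n_params, 12 * n_params   # bytes: params, grads, optimizer states
    if stage >= 1: O /= dp_degree    # ZeRO-1 shards optimizer states
    if stage >= 2: G /= dp_degree    # ZeRO-2 also shards gradients
    if stage >= 3: P /= dp_degree    # ZeRO-3 also shards parameters
    return (P + G + O) / 1e9

for stage in [0, 1, 2, 3]:
    print(f"7B model, DP=64, ZeRO-{stage}: {zero_model_state_gb(7e9, 64, stage):.1f} GB/GPU")
```

With these assumptions, a 7B model drops from ~112 GB of model state per GPU (no sharding) to under 2 GB at stage 3 with 64-way data parallelism.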
1‑2) Posttraining
Week 7‑1: Alignment‑Focused Post‑training
- InstructGPT: Training LMs to Follow Instructions with Human Feedback: SFT + RLHF makes much smaller models preferred by human raters on real prompts.
- Direct Preference Optimization: Your LM is Secretly a Reward Model: Optimizes preferences without RL via a simple classification‑style loss derived from the Bradley‑Terry model (loss sketch after this list).
- KTO: Model Alignment as Prospect‑Theoretic Optimization: Preference learning framed with prospect theory; competitive with RLHF/DPO.
- Constitutional AI: Harmlessness from AI Feedback: AI‑guided feedback with a simple constitution reduces harmful outputs.
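A minimal sketch of the DPO objective for one preference pair, taking summed response log-probabilities under the policy and the frozen reference model as inputs; the numbers below are made up, and real implementations batch and mask token-level log-probs properly:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit reward of a response: beta * (policy logp - reference logp)
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Bradley-Terry preference model: maximize sigmoid(chosen_reward - rejected_reward)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

loss = dpo_loss(torch.tensor([-42.0]), torch.tensor([-55.0]),
                torch.tensor([-45.0]), torch.tensor([-50.0]))
print(loss)
```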
Week 7‑2: RL of LLMs
Week 8‑1: RL of LLMs
Week 8‑2: Reasoning‑Focused Post‑training
- DeepSeek‑R1: RL‑first training elicits step‑by‑step reasoning.
- OpenAI o1 System Card: Capability and safety evaluations for the o1 reasoning models.
- Let’s Verify Step by Step (Process Reward Models): Process‑level supervision yields more reliable reward models than outcome‑only labels.
- VersaPRM: Multi‑domain PRMs trained largely from synthetic reasoning traces.
2) Using LLMs
Week 9‑1: LLMs + Tools
- ReAct: Synergizing Reasoning and Acting: Interleaves thoughts with tool calls to cut hallucinations and improve grounding.
- Toolformer: LMs Can Teach Themselves to Use Tools: Self‑supervised API/tool use from a few demonstrations.
Week 9‑2: System‑Level Optimization
- DSPy: Compiling Declarative LM Calls into Self‑Improving Pipelines: Framework that tunes prompts and data to optimize full pipelines.
- Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs: Few-shot bootstrapping, per-module OPRO, and MIPRO.
- GEPA: Reflective Prompt Evolution: Evolves prompts via reflective search and Pareto selection.
Week 10‑1: Attention and Serving
- FlashAttention: Fast and Memory‑Efficient Exact Attention: IO‑aware tiling gives large speed and memory wins without accuracy loss (online‑softmax sketch after this list).
- Efficient Memory Management for LLM Serving with PagedAttention (vLLM): KV cache paging (virtual‑memory style) enables high‑throughput serving.
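A sketch of the online-softmax recurrence that FlashAttention's tiling relies on: exact attention computed one K/V block at a time, never materializing the full score matrix. Shapes and block size are illustrative; the real kernel does this per tile in on-chip SRAM and has a matching backward pass:

```python
import torch

def attention_blockwise(q, k, v, block=64):
    # q: (n_q, d); k, v: (n_k, d). Exact attention, streamed over K/V blocks.
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0], 1), float("-inf"))   # running row-wise max
    l = torch.zeros(q.shape[0], 1)                   # running softmax denominator
    acc = torch.zeros_like(q)                        # running weighted sum of V
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                       # scores for this block only
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        correction = torch.exp(m - m_new)            # rescale previously accumulated terms
        p = torch.exp(s - m_new)
        l = l * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + p @ vb
        m = m_new
    return acc / l

q, k, v = torch.randn(128, 64), torch.randn(512, 64), torch.randn(512, 64)
ref = torch.softmax((q @ k.T) / 64**0.5, dim=-1) @ v
print(torch.allclose(attention_blockwise(q, k, v), ref, atol=1e-4))   # exact up to float error
```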
Week 10‑2: Quantization
- LLM.int8(): 8‑bit Matrix Multiplication for Transformers at Scale: Mixed‑precision with outlier handling keeps accuracy while using int8 matmuls.
- GPTQ: Accurate Post‑Training Quantization for GPTs: One‑shot 3–4‑bit weight quantization using approximate second‑order information (a round‑to‑nearest baseline is sketched after this list for contrast).
- AWQ: Activation‑aware Weight Quantization: Channel‑wise scaling protects salient weights for better 4‑bit accuracy.
- (Optional) GPTQ as Babai’s nearest‑plane algorithm.
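For contrast with GPTQ and AWQ, a sketch of the round-to-nearest (RTN) group-wise weight quantization baseline they improve on (GPTQ via second-order error compensation, AWQ via activation-aware scaling); bit width and group size are illustrative:

```python
import torch

def quantize_rtn(w, bits=4, group_size=128):
    # w: (out_features, in_features); each group of columns gets its own symmetric scale
    qmax = 2 ** (bits - 1) - 1
    w_g = w.reshape(w.shape[0], -1, group_size)              # (out, n_groups, group)
    scale = w_g.abs().amax(dim=-1, keepdim=True) / qmax      # absmax scale per group
    q = (w_g / scale).round().clamp(-qmax - 1, qmax)         # int4 codes in [-8, 7]
    return (q * scale).reshape_as(w), q, scale               # dequantized weights + codes

w = torch.randn(256, 1024)
w_hat, q, scale = quantize_rtn(w)
print((w - w_hat).abs().mean())   # per-element reconstruction error of the plain RTN baseline
```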
Week 11‑1: Exact Acceleration
- Blockwise Parallel Decoding for Deep Autoregressive Models: Proposes a block of future tokens in parallel, then verifies them with the base model; an early predecessor of speculative decoding.
- Speculative Sampling: Draft‑and‑verify decoding gives ~2× speedups while exactly preserving the target model’s distribution (accept/reject sketch after this list).
- (Optional) SpecTr: Speculative decoding framed as optimal transport over multiple drafts.
- Medusa: Extra decoding heads propose and verify multi‑token candidates in one step.
- EAGLE: Drafts in feature (hidden‑state) space with a lightweight autoregressive head, improving acceptance rates over token‑level drafting.
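A minimal sketch of the speculative-sampling accept/reject rule for a single drafted token, using toy next-token distributions; accepted-or-resampled outputs follow the target model's distribution exactly:

```python
import torch

def verify_token(p_target, p_draft, drafted_token):
    # Accept with probability min(1, p_target / p_draft) at the drafted token.
    accept_prob = torch.clamp(p_target[drafted_token] / p_draft[drafted_token], max=1.0)
    if torch.rand(()) < accept_prob:
        return drafted_token                      # keep the draft model's proposal
    residual = torch.clamp(p_target - p_draft, min=0.0)
    residual /= residual.sum()                    # resample from the normalized residual
    return torch.multinomial(residual, 1).item()

p_target = torch.tensor([0.5, 0.3, 0.2])          # toy target next-token distribution
p_draft = torch.tensor([0.2, 0.6, 0.2])           # toy draft next-token distribution
drafted = torch.multinomial(p_draft, 1).item()
print(verify_token(p_target, p_draft, drafted))
```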
Week 11‑2: Approximate Inference and KV Policies
- StreamingLLM: Attention Sinks for Infinite‑Length Input: A few sink tokens stabilize long streaming with sliding windows (eviction‑policy sketch after this list).
- H2O: Heavy‑Hitter Oracle for Efficient KV Cache: Keeps heavy‑hitters plus recency for principled KV eviction.
- SnapKV: Draft‑free selection of the most important prompt tokens for strong KV compression.
- Draft‑based Approximate Inference for LLMs: Uses draft models to rank prompt and KV importance for approximation.
- Lexico: Online Dictionary Learning for KV Cache Compression
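A minimal sketch of a StreamingLLM-style KV eviction policy: always keep a few attention-sink tokens plus a sliding window of recent tokens and drop the middle. Sizes are illustrative, and real systems also handle position re-indexing and may keep heavy-hitter tokens as in H2O:

```python
import torch

def keep_indices(seq_len, n_sink=4, window=1024):
    if seq_len <= n_sink + window:
        return torch.arange(seq_len)                    # nothing to evict yet
    sinks = torch.arange(n_sink)                        # earliest tokens act as attention sinks
    recent = torch.arange(seq_len - window, seq_len)    # sliding window of recent tokens
    return torch.cat([sinks, recent])

k_cache = torch.randn(1, 8, 5000, 64)                   # (batch, heads, seq, d_head)
idx = keep_indices(k_cache.shape[2])
k_cache = k_cache[:, :, idx]                            # evict everything in the middle
print(k_cache.shape)                                    # (1, 8, 4 + 1024, 64)
```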
3) Adapting LLMs
Week 12-1: PEFT
- LoRA: Low‑rank adapters enable parameter‑efficient finetuning with minimal latency cost (wrapper sketch after this list).
- DoRA: Magnitude‑direction decomposition improves LoRA’s capacity without runtime overhead.
- Expressive Power of LoRA: Theory on when low‑rank adapters can approximate target functions in Transformers.
- LoRA Training Provably Converges…: Convergence guarantees and clear failure modes in practical regimes.
- (optional) QLoRA
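A minimal LoRA wrapper around a frozen linear layer; the rank, alpha, and init scale are illustrative, with A random and B zero so training starts exactly from the pretrained behavior:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # frozen base output + scaled low-rank update (only A and B are trained)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(4, 512)).shape)
# At deployment, B @ A can be merged into the base weight, so there is no extra latency.
```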
Week 12-2: In‑Context Learning
- Z‑ICL: Zero‑Shot In‑Context Learning with Pseudo‑demos: Builds pseudo‑demos from raw text to close the zero‑ vs few‑shot gap.
- Dual Operating Modes of ICL: Frames ICL as task retrieval vs task learning and explains early‑ascent behavior.
- PromptIntern: Internalizes recurring prompts to reduce input tokens and inference cost.
Week 13-1: Continual Adaptation via Prompt Evolution
- A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
- PromptBreeder: Evolves prompts with self‑mutation and selection.
- Auto Evol‑Instruct: Automates instruction evolution for data generation with no humans in the loop.
- Automatic Prompt Engineer (APE): Treats prompt search as program synthesis to find high‑performing prompts.
- PromptAgent: Uses planning and reflective error analysis to reach expert‑level prompts.