Paper List
LLM 901 — Weekly Reading Schedule with TL;DRs
1) Training LLMs
1-1) Pretraining
1‑1‑1) Architecture
Week 1‑1 (09/08/2025): DeepSeek-V2/3
- DeepSeek‑V2: A Strong, Economical, and Efficient Mixture‑of‑Experts Language Model: Introduces Multi‑head Latent Attention (MLA) and DeepSeekMoE to shrink KV cache and cut compute while keeping quality.
- DeepSeek‑V3 Technical Report: 671B‑parameter MoE with ~37B active; uses MLA, auxiliary‑loss‑free routing, and multi‑token prediction (MTP).
Week 1‑2 (09/12/2025): MoE and Multi‑token Prediction
- Mixtral of Experts (8×7B): Open sparse MoE baseline (top‑2 routing) that rivals much larger dense models with far fewer active parameters; see the routing sketch after this week's list.
- Better & Faster Large Language Models via Multi‑token Prediction: Training with multiple future‑token heads improves sample efficiency at no extra pretraining cost and enables faster inference via self‑speculative decoding.
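A minimal sketch of the top‑2 routing idea used by Mixtral‑style sparse MoE layers; shapes, the renormalized softmax gating, and the expert loop are illustrative, not the released implementation:

```python
import torch
import torch.nn.functional as F

def top2_moe_layer(x, gate_w, experts):
    """Top-2 MoE routing sketch (Mixtral-style).

    x:       [tokens, d_model] token representations
    gate_w:  [d_model, n_experts] router weights
    experts: list of callables, each mapping [*, d_model] -> [*, d_model]
    """
    logits = x @ gate_w                        # [tokens, n_experts]
    top_vals, top_idx = logits.topk(2, dim=-1)
    weights = F.softmax(top_vals, dim=-1)      # renormalize over the 2 selected experts

    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):       # each token is processed by only 2 experts
        for slot in range(2):
            sel = top_idx[:, slot] == e
            if sel.any():
                out[sel] += weights[sel, slot, None] * expert(x[sel])
    return out

# Toy usage: 4 experts, 2 active per token.
d = 16
experts = [torch.nn.Sequential(torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d))
           for _ in range(4)]
y = top2_moe_layer(torch.randn(8, d), torch.randn(d, 4), experts)
```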
Week 2‑1 (09/15/2025): Positional Encodings and Long Context.
- RoFormer: Enhanced Transformer with Rotary Position Embedding: Encodes relative position by rotating query/key feature pairs, which tends to generalize better than absolute or learned relative schemes; see the RoPE sketch after this week's list.
- LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens: Searches non‑uniform RoPE rescaling factors and extends progressively, pushing context past 2M tokens with controlled degradation.
- YaRN: Efficient Context Window Extension of LLMs: A simple RoPE interpolation recipe (with an attention‑temperature tweak) that reliably extends context length while retaining quality.
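A minimal sketch of applying rotary embeddings to a query or key tensor, using the split‑half pairing convention (one of the two common layouts); tensor names and shapes are illustrative:

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate each (x1, x2) feature pair by an angle proportional to its position,
    so dot products between rotated queries and keys depend only on the relative
    offset between positions.

    x: [seq_len, n_heads, head_dim] with even head_dim (queries or keys)
    """
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)      # per-pair frequency
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq   # [seq_len, half]
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]             # broadcast over heads

    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```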
Week 2‑2 (09/19/2025): LayerNorm & RMSNorm. Wrap up with gpt-oss.
- On Layer Normalization in the Transformer Architecture: Pre‑LN eases optimization at scale; Post‑LN can reach slightly better end quality but is harder to train.
- Root Mean Square Layer Normalization (RMSNorm): Drops mean‑centering to reduce overhead while keeping performance competitive; see the sketch after this week's list.
- gpt-oss: OpenAI's open‑weight model release; we wrap up by reviewing how the architecture choices above show up in gpt-oss.
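RMSNorm in a few lines, to make the comparison with LayerNorm concrete: no mean subtraction and no bias, just scaling by the root mean square (a sketch, not any particular library's implementation):

```python
import torch

class RMSNorm(torch.nn.Module):
    """RMSNorm sketch: scale features by their root mean square,
    with no mean subtraction and no bias term (unlike LayerNorm)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * inv_rms)
```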
Week 3‑1 (09/22/2025): Is “decoder-only Transformer + left-to-right autoregressive decoding” the end of the story? Part 1.
- ENTP: Encoder‑only Next‑Token Prediction: Shows encoder‑only NTP can be competitive and more expressive for some tasks.
- Mamba: Linear‑time Sequence Modeling with Selective State Spaces: Linear‑time SSM with content‑aware selection that handles long sequences well.
Week 3‑2 (09/26/2025): Is “decoder-only Transformer + left-to-right autoregressive decoding” the end of the story? Part 2.
- Diffusion‑LM Improves Controllable Text Generation: Continuous diffusion over word embeddings enables plug‑and‑play controllable generation that is hard to get from standard AR decoding.
- LLaDA: Large Language Diffusion Models: Trains masked diffusion language models at scale (up to 8B), challenging the need for AR decoding; a toy decoding loop follows this week's list.
- Seed Diffusion: Discrete‑state diffusion for language/code aimed at fast parallel decoding while keeping quality competitive.
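A toy sketch of the iterative unmasking loop behind masked‑diffusion decoding (the family LLaDA belongs to): start fully masked, predict all positions in parallel, commit the most confident ones, repeat. `predict_logits`, `MASK_ID`, and the commit schedule are hypothetical stand‑ins; real models use trained schedules and remasking rules.

```python
import torch

MASK_ID = 0  # hypothetical mask-token id

def masked_diffusion_decode(predict_logits, length, steps=8):
    """Toy parallel decoding loop for a masked diffusion LM.

    predict_logits: hypothetical callable, [length] token ids -> [length, vocab] logits
                    (stands in for the trained model)
    """
    x = torch.full((length,), MASK_ID, dtype=torch.long)
    for step in range(steps):
        masked = x == MASK_ID
        if not masked.any():
            break
        probs = predict_logits(x).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)             # per-position confidence and argmax
        conf = conf.masked_fill(~masked, -1.0)     # only compete over still-masked slots
        # commit a growing fraction of the remaining masked positions each step
        n_commit = max(1, int(masked.sum().item() * (step + 1) / steps))
        idx = conf.topk(n_commit).indices
        x[idx] = pred[idx]
    return x
```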
1‑1‑2) Training Data
Week 4‑1 (09/29/2025): How much data (and compute)?
- Scaling Laws for Neural Language Models: Loss follows power laws in model size, data, and compute.
- Training Compute‑Optimal Large Language Models (Chinchilla): For fixed compute, smaller models trained on more data beat oversized, under‑trained ones; parameters and training tokens should scale roughly together (about 20 tokens per parameter). See the compute‑optimal sketch after this week's list.
- Inference‑Aware Scaling Laws for LMs: Extends compute‑optimal analysis to account for expected inference load, shifting the optimum toward smaller models trained on more data under latency and serving‑cost constraints.
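A worked sketch of the compute‑optimal trade‑off, using the Chinchilla parametric loss and the standard C ≈ 6ND approximation for training FLOPs (a derivation sketch under those assumptions, not a new fit):

```latex
% Parametric loss and compute budget
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad C \approx 6\,N\,D \ \text{(training FLOPs)}

% Minimizing L(N, D) subject to 6ND = C gives
N_{\mathrm{opt}}(C) \propto C^{\beta/(\alpha+\beta)},
\qquad
D_{\mathrm{opt}}(C) \propto C^{\alpha/(\alpha+\beta)}

% With the fitted exponents (\alpha and \beta of similar size), both scale
% roughly as C^{0.5}: grow parameters and tokens together as compute grows,
% which lands near the often-quoted ~20 training tokens per parameter.
```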
Week 4‑2 (10/03/2025): Which Data?
- DataComp‑LM: Open benchmark for LLM pretraining data curation; careful filtering and mixtures yield large gains.
- A Survey on Data Selection for Language Models: Organizes methods for filtering, mixing, deduplication, and curriculum.
- OLMo‑2: Pretraining + Mid‑training: Adds a mid‑training phase and releases open artifacts for reproducibility.
- DataComp‑Reasoning (Open Thoughts): Reasoning‑centric data benchmark that lifts step‑by‑step problem solving.
1‑1‑3) Training Algorithms
Week 5‑1 (10/06/2025): Optimizers
- AdamW: Decoupled Weight Decay Regularization: Decouples weight decay from the gradient‑based update for better stability and generalization; see the update‑rule sketch after this week's list.
- Lion: Symbolic Discovery of Optimization Algorithms: Sign‑momentum optimizer that is memory‑light yet competitive with Adam.
- Shampoo: Preconditioned Stochastic Tensor Optimization: Matrix and tensor preconditioning accelerates convergence.
- Old Optimizer, New Norm: An Anthology: Reinterprets Adam, Shampoo, and related optimizers as steepest descent under different norms, connecting their update rules.
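A single‑tensor sketch of the AdamW update, to make "decoupled" concrete: weight decay multiplies the parameters directly and never enters the moment estimates. This is illustrative, not torch.optim's implementation.

```python
import torch

def adamw_step(p, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, wd=0.01):
    """One AdamW step on a single tensor. state holds m, v (same shape as p) and t (int)."""
    state["t"] += 1
    t = state["t"]
    m, v = state["m"], state["v"]
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])             # EMA of gradients
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])   # EMA of squared gradients
    m_hat = m / (1 - betas[0] ** t)                             # bias correction
    v_hat = v / (1 - betas[1] ** t)
    p.mul_(1 - lr * wd)                                         # decoupled weight decay
    p.add_(-lr * m_hat / (v_hat.sqrt() + eps))                  # Adam step
    return p

# Toy usage
p = torch.randn(10)
state = {"m": torch.zeros_like(p), "v": torch.zeros_like(p), "t": 0}
adamw_step(p, torch.randn(10), state)
```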
Week 5‑2 (10/10/2025): Newer Optimizers
- SOAP: Runs Adam in Shampoo's preconditioner eigenbasis, a practical recipe targeting stability and speed in LLM pretraining.
- Muon: Orthogonalizes the momentum update for matrix‑shaped parameters (via Newton-Schulz iterations) for stable, fast large‑batch training; see the sketch after this week's list.
- Kimi K2 / MuonClip: Scales Muon to very large MoE pretraining by adding QK-Clip, which rescales query/key projections to keep attention logits from exploding.
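A sketch of Muon's core move: orthogonalize the momentum of each weight matrix before applying it. The cubic Newton-Schulz iteration below is the textbook version; actual Muon implementations use a tuned higher‑order polynomial, and the step/momentum wiring here is a hypothetical simplification.

```python
import torch

def newton_schulz_orthogonalize(M, steps=5):
    """Approximately orthogonalize a (possibly rectangular) matrix M, i.e. push its
    singular values toward 1, using the classic cubic Newton-Schulz iteration."""
    X = M / (M.norm() + 1e-7)          # scale so the iteration converges
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_like_step(W, momentum, grad, lr=0.02, beta=0.95):
    """Hypothetical single-matrix step: update momentum, orthogonalize it, apply."""
    momentum.mul_(beta).add_(grad)
    W.add_(newton_schulz_orthogonalize(momentum), alpha=-lr)
    return W
```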
Week 6‑1 (10/13/2025): Optimizer Benchmarks
- Fantastic Pretraining Optimizers and Where to Find Them: Head‑to‑head comparisons of pretraining optimizers under controlled setups.
- Benchmarking Optimizers for Large Language Model Pretraining: Independent large‑scale benchmark with ablations and cost breakdowns.
Week 6‑2 (10/17/2025): Efficient Training
- Megatron‑LM: Training Multi‑Billion Parameter Language Models Using Model Parallelism: Tensor and pipeline parallelism blueprint for very large Transformers; see the tensor‑parallel sketch after this week's list.
- ZeRO: Memory Optimizations Toward Training Trillion‑Parameter Models: Shards optimizer states, gradients, and parameters to fit bigger models.
- Dion: Distributed Orthonormalized Updates: A communication‑efficient take on Muon‑style orthonormalized updates, designed for sharded distributed training.
- (optional) MegaBlocks: Block‑sparse GPU kernels that make MoE training efficient without dropping tokens.
- (optional) Google's sparsely‑gated MoE: the first modern sparse MoE and how to train it efficiently.
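A numpy sketch of Megatron‑style tensor parallelism for a two‑layer MLP, simulated on one host: split the first weight by columns and the second by rows, so the forward pass needs only one all‑reduce (here a plain sum). A ReLU stands in for the usual GeLU.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, n_dev = 8, 16, 2
x = rng.normal(size=(4, d_in))            # [batch, d_in]
W1 = rng.normal(size=(d_in, d_hidden))    # first linear layer
W2 = rng.normal(size=(d_hidden, d_in))    # second linear layer

# Column-parallel: each "device" holds a slice of W1's output columns.
W1_shards = np.split(W1, n_dev, axis=1)
# Row-parallel: each "device" holds the matching slice of W2's input rows.
W2_shards = np.split(W2, n_dev, axis=0)

partials = []
for W1_s, W2_s in zip(W1_shards, W2_shards):
    h_s = np.maximum(x @ W1_s, 0.0)       # local activation shard
    partials.append(h_s @ W2_s)           # local partial output
y_parallel = sum(partials)                # the "all-reduce"

y_reference = np.maximum(x @ W1, 0.0) @ W2
assert np.allclose(y_parallel, y_reference)
```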
1‑2) Post‑training
Week 7‑1 (10/20/2025): Alignment‑Focused Post‑training
- InstructGPT: Training LMs to Follow Instructions with Human Feedback: SFT + RLHF makes much smaller models preferred by humans over far larger base models on real prompts.
- Direct Preference Optimization: Your LM is Secretly a Reward Model: Optimizes preferences without RL via a simple classification‑style loss derived from the Bradley-Terry model; see the loss sketch after this week's list.
- KTO: Model Alignment as Prospect‑Theoretic Optimization: Preference learning framed with prospect theory; competitive with RLHF/DPO.
- Constitutional AI: Harmlessness from AI Feedback: AI‑guided feedback with a simple constitution reduces harmful outputs.
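The DPO objective in a few lines, since it is just a logistic loss on log‑likelihood margins; the per‑sequence log‑probabilities are assumed to be precomputed sums over response tokens.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss sketch: a Bradley-Terry classification loss on the difference of
    policy-vs-reference log-likelihood margins for chosen and rejected responses.

    Each argument is a [batch] tensor of summed token log-probabilities of a full
    response under the policy or the frozen reference model.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)         # implicit reward of y_w
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)   # implicit reward of y_l
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```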
Week 7‑2 (10/24/2025): RL of LLMs
Week 8‑1 (10/27/2025): RL of LLMs (continued)
Week 8‑2 (10/31/2025): Reasoning‑Focused Post‑training
- DeepSeek‑R1: RL‑first training elicits step‑by‑step reasoning.
- OpenAI o1 System Card: Documents the capabilities, safety evaluations, and deployment mitigations for the o1 reasoning models.
- Let’s Verify Step by Step (Process Reward Models): Process‑level supervision yields more reliable reward models than outcome‑only labels; see the reranking sketch after this week's list.
- VersaPRM: Multi‑domain PRMs trained largely from synthetic reasoning traces.
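A sketch of putting a process reward model to work at inference time: score each step of several sampled solutions and rerank. `score_steps` is a hypothetical PRM call, and aggregating by the product of per‑step scores is one common choice, not the only one.

```python
def rerank_with_prm(candidates, score_steps):
    """Rerank candidate solutions with a process reward model (sketch).

    candidates:  list of solutions, each a list of reasoning-step strings
    score_steps: hypothetical PRM call, list[str] -> list[float] in [0, 1]
                 (estimated probability that each step is correct)
    """
    def solution_score(steps):
        score = 1.0
        for p in score_steps(steps):   # aggregate per-step scores by their product
            score *= p
        return score

    return sorted(candidates, key=solution_score, reverse=True)
```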
2) Using LLMs
Week 9‑1 (11/03/2025): LLMs + Tools
- ReAct: Synergizing Reasoning and Acting: Interleaves thoughts with tool calls to cut hallucinations and improve grounding; see the loop sketch after this week's list.
- Toolformer: LMs Can Teach Themselves to Use Tools: Self‑supervised API/tool use from a few demonstrations.
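A minimal sketch of the ReAct control loop: the model alternates Thought/Action segments, the harness executes the named tool, and the observation is appended to the transcript. `llm`, the tool dict, and the `Action: name[input]` / `Final Answer:` markers are assumptions for illustration.

```python
def react_loop(question, llm, tools, max_turns=5):
    """ReAct-style loop sketch: reasoning ("Thought"), tool calls ("Action"),
    and tool results ("Observation") interleaved in one transcript.

    llm:   hypothetical callable, prompt str -> next model segment (str)
    tools: dict of tool name -> callable(str) -> str
    """
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        segment = llm(transcript)          # e.g. "Thought: ...\nAction: search[query]"
        transcript += segment + "\n"
        if "Final Answer:" in segment:
            return segment.split("Final Answer:", 1)[1].strip()
        if "Action:" in segment:
            action = segment.split("Action:", 1)[1].strip()      # "search[query]"
            name, arg = action.split("[", 1)
            observation = tools[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return transcript                      # fall through if no final answer was produced
```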
Week 9‑2 (11/07/2025): System‑Level Optimization
- DSPy: Compiling Declarative LM Calls into Self‑Improving Pipelines: Framework that tunes prompts and data to optimize full pipelines.
- GEPA: Reflective Prompt Evolution: Evolves prompts via reflective search and Pareto selection.
Week 10‑1 (11/10/2025): Attention and Serving
- FlashAttention: Fast and Memory‑Efficient Exact Attention: IO‑aware tiling gives large speed and memory wins without accuracy loss; see the online‑softmax sketch after this week's list.
- Efficient Memory Management for LLM Serving with PagedAttention (vLLM): KV cache paging (virtual‑memory style) enables high‑throughput serving.
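The online‑softmax trick at the heart of FlashAttention, for a single query in numpy: process K/V in tiles while carrying only a running max, normalizer, and weighted‑value accumulator, so the full score row is never materialized (a sketch of the math, not a kernel).

```python
import numpy as np

def tiled_attention(q, K, V, tile=128):
    """Single-query attention via tiled online softmax.
    q: [d], K: [seq_len, d], V: [seq_len, d_v]"""
    d = q.shape[-1]
    m = -np.inf                      # running max of scores
    l = 0.0                          # running sum of exp(score - m)
    acc = np.zeros(V.shape[-1])      # running sum of exp(score - m) * V rows
    for start in range(0, K.shape[0], tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        s = Kt @ q / np.sqrt(d)                 # scores for this tile
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)               # rescale the old running statistics
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vt
        m = m_new
    return acc / l

# Sanity check against the naive full-softmax computation.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=16), rng.normal(size=(1000, 16)), rng.normal(size=(1000, 16))
s = K @ q / np.sqrt(16)
w = np.exp(s - s.max())
assert np.allclose(tiled_attention(q, K, V), (w / w.sum()) @ V)
```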
Week 10‑2 (11/14/2025): Quantization
- LLM.int8(): 8‑bit Matrix Multiplication for Transformers at Scale: Mixed‑precision with outlier handling keeps accuracy while using int8 matmuls.
- GPTQ: Accurate Post‑Training Quantization for GPTs: One‑shot 3–4‑bit weight quantization using approximate second‑order information.
- AWQ: Activation‑aware Weight Quantization: Channel‑wise scaling protects salient weights for better 4‑bit accuracy. (A simple round‑to‑nearest baseline sketch follows this week's list.)
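For orientation, the simplest weight‑quantization baseline that GPTQ and AWQ improve on: symmetric absmax round‑to‑nearest with one scale per output channel. GPTQ adds second‑order error compensation and AWQ adds activation‑aware scaling on top of mechanics like these.

```python
import torch

def quantize_weights_int8(W):
    """Symmetric absmax per-output-channel int8 weight quantization (sketch).
    W: [out_features, in_features] float weights."""
    scale = W.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0   # one scale per row
    W_q = torch.clamp((W / scale).round(), -127, 127).to(torch.int8)
    return W_q, scale

def dequantize(W_q, scale):
    return W_q.float() * scale

W = torch.randn(256, 512)
W_q, scale = quantize_weights_int8(W)
err = (dequantize(W_q, scale) - W).abs().mean()
print(f"mean abs quantization error: {err:.4f}")
```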
Week 11‑1 (11/17/2025): Exact Acceleration
- Speculative Sampling: Draft‑and‑verify decoding gives ~2× speedups while exactly preserving the target model's distribution; see the accept/reject sketch after this week's list.
- Medusa: Extra decoding heads propose multi‑token candidates that are verified in a single tree‑attention pass.
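The accept/reject rule that makes speculative decoding exact, for one proposed token; `p_target` and `q_draft` are the two models' next‑token distributions at the same position.

```python
import numpy as np

def speculative_accept(p_target, q_draft, x, rng):
    """One accept/reject step of speculative sampling (sketch).

    p_target, q_draft: [vocab] next-token distributions of target and draft models
    x: token id proposed by the draft model
    Accept x with prob min(1, p/q); otherwise resample from the leftover
    distribution max(0, p - q), which keeps the output exactly distributed as p.
    """
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    residual = np.clip(p_target - q_draft, 0.0, None)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual), False
```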
Week 11‑2 (11/21/2025): Approximate Inference and KV Policies
- StreamingLLM: Attention Sinks for Infinite‑Length Input: A few sink tokens stabilize long streaming with sliding windows.
- H2O: Heavy‑Hitter Oracle for Efficient KV Cache: Keeps heavy‑hitters plus recent tokens for principled KV eviction; see the sketch after this week's list.
- SnapKV: Draft‑free selection of the most important prompt tokens for strong KV compression.
- Draft‑based Approximate Inference for LLMs: Uses draft models to rank prompt and KV importance for approximation.
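A sketch of H2O‑style KV eviction: always keep a recency window, then fill the rest of the budget with the positions that have received the most attention mass so far. The score bookkeeping and budget policy are simplified.

```python
import torch

def select_kv_to_keep(attn_scores, budget, recent=32):
    """H2O-style KV cache eviction sketch.

    attn_scores: [num_queries, seq_len] attention weights observed so far
    budget:      total number of KV positions to keep
    Returns sorted indices of positions to retain.
    """
    seq_len = attn_scores.shape[1]
    keep = set(range(max(0, seq_len - recent), seq_len))   # always keep recent tokens
    cumulative = attn_scores.sum(dim=0)                    # accumulated mass per position
    for idx in cumulative.argsort(descending=True).tolist():
        if len(keep) >= budget:
            break
        keep.add(idx)                                      # add heavy hitters until budget is full
    return sorted(keep)
```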
3) Adapting LLMs
Week 12‑1 (11/24/2025): PEFT
- LoRA: Low‑rank adapters enable parameter‑efficient finetuning with minimal latency cost; see the sketch at the end of this list.
- DoRA: Magnitude‑direction decomposition improves LoRA’s capacity without runtime overhead.
- Expressive Power of LoRA: Theory on when low‑rank adapters can approximate target functions in Transformers.
- LoRA Training Provably Converges…: Convergence guarantees and clear failure modes in practical regimes.
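LoRA's forward pass in a few lines: the pretrained linear layer is frozen and a rank‑r update (alpha/r)·BA is learned on the side, with B initialized to zero so training starts exactly at the pretrained model (a sketch, not the peft library).

```python
import torch

class LoRALinear(torch.nn.Module):
    """LoRA sketch: y = W x + (alpha/r) * B (A x), with W frozen."""
    def __init__(self, base: torch.nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                         # frozen pretrained weights
        self.A = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```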
Week 12‑2 (11/28/2025): Thanksgiving Recess
Week 13‑1 (12/01/2025): In‑Context Learning
- Z‑ICL: Zero‑Shot In‑Context Learning with Pseudo‑demos: Builds pseudo‑demos from raw text to close the zero‑ vs few‑shot gap.
- Dual Operating Modes of ICL: Frames ICL as task retrieval vs task learning and explains early‑ascent behavior.
- PromptIntern: Internalizes recurring prompts to reduce input tokens and inference cost.
Week 13‑2 (12/05/2025): Continual Adaptation via Prompt Evolution
- PromptBreeder: Evolves prompts (and the mutation prompts themselves) with self‑mutation and selection; see the generic evolution‑loop sketch at the end of this list.
- Auto Evol‑Instruct: Automates instruction evolution for data generation with no humans in the loop.
- Automatic Prompt Engineer (APE): Treats prompt search as program synthesis to find high‑performing prompts.
- PromptAgent: Uses planning and reflective error analysis to reach expert‑level prompts.
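A generic evolution loop covering the shared skeleton of these methods: mutate candidate prompts, score them on a small dev set, keep the best. `mutate` and `evaluate` are hypothetical callables (in practice both are LLM‑driven), and real systems add diversity maintenance, reflection, or Pareto selection on top.

```python
import random

def evolve_prompts(seed_prompts, mutate, evaluate, generations=10, population=8):
    """Generic prompt-evolution loop (sketch).

    mutate:   hypothetical callable, prompt -> mutated prompt (usually an LLM call)
    evaluate: hypothetical callable, prompt -> score on a held-out dev set
    """
    pop = list(seed_prompts)
    for _ in range(generations):
        children = [mutate(random.choice(pop)) for _ in range(population)]
        pop = sorted(pop + children, key=evaluate, reverse=True)[:population]
    return pop[0]   # best prompt found
```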