ECE 901 @ UW Madison — Advanced Topics in Large Language Models
1) Training LLMs
1-1) Pretraining
1‑1‑1) Architecture
Week 1‑1: DeepSeek-V2/3
- DeepSeek‑V2: A Strong, Economical, and Efficient Mixture‑of‑Experts Language Model: Introduces Multi‑head Latent Attention (MLA) and DeepSeekMoE to shrink the KV cache and cut compute while keeping quality (see the MLA sketch after this list).
- DeepSeek‑V3 Technical Report: 671B‑parameter MoE with ~37B active; uses MLA, auxiliary‑loss‑free routing, and multi‑token prediction (MTP).
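A minimal sketch of MLA's core KV-compression idea, with illustrative shapes and module names (the real MLA also keeps a decoupled RoPE key path and a low-rank query compression, which are omitted here):

```python
# Sketch: cache a small latent c_kv per token and rebuild per-head K/V from it on the fly.
import torch
import torch.nn as nn

d_model, n_heads, d_head, d_latent = 1024, 8, 128, 64

W_dkv = nn.Linear(d_model, d_latent, bias=False)           # down-projection; the latent is what gets cached
W_uk  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-project latent -> keys
W_uv  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-project latent -> values
W_q   = nn.Linear(d_model, n_heads * d_head, bias=False)

h = torch.randn(2, 16, d_model)                  # (batch, seq, d_model)
c_kv = W_dkv(h)                                  # (batch, seq, d_latent): the only KV state we cache
q = W_q(h).view(2, 16, n_heads, d_head)
k = W_uk(c_kv).view(2, 16, n_heads, d_head)      # reconstructed from the latent at attention time
v = W_uv(c_kv).view(2, 16, n_heads, d_head)

attn = torch.einsum("bqhd,bkhd->bhqk", q, k) / d_head**0.5   # (causal mask omitted in this sketch)
out = torch.einsum("bhqk,bkhd->bqhd", attn.softmax(-1), v).reshape(2, 16, -1)
# Cache cost per token: d_latent floats instead of 2 * n_heads * d_head.
print(c_kv.shape, out.shape)
```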
Week 1‑2: MoE and Multi‑token Prediction
- Mixtral of Experts (8×7B): Open sparse MoE baseline with top‑2 routing that rivals much larger dense models using far fewer active parameters (routing sketch after this list).
- Better & Faster Large Language Models via Multi‑token Prediction: Training with multiple future‑token heads boosts sample efficiency and enables faster inference via self‑speculative decoding, at no extra pretraining cost.
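A minimal sketch of Mixtral-style top-2 token-choice routing; the gate, expert MLPs, and sizes are illustrative, and the loop below runs every expert densely for clarity (real kernels gather only the routed tokens and add a load-balancing loss):

```python
import torch
import torch.nn as nn

d_model, n_experts, top_k = 512, 8, 2
gate = nn.Linear(d_model, n_experts, bias=False)
experts = nn.ModuleList(
    [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
     for _ in range(n_experts)]
)

x = torch.randn(4, 32, d_model)                   # (batch, seq, d_model)
logits = gate(x)                                  # (batch, seq, n_experts)
weights, idx = logits.topk(top_k, dim=-1)         # route each token to its top-2 experts
weights = weights.softmax(dim=-1)                 # renormalize over the selected experts only

out = torch.zeros_like(x)
for e in range(n_experts):
    mask = (idx == e)                             # which tokens picked expert e (in either slot)
    if mask.any():
        w = (weights * mask).sum(-1, keepdim=True)   # combined gate weight of expert e per token
        out = out + w * experts[e](x)                # dense compute here; sparse in real kernels
print(out.shape)
```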
Week 2‑1: Positional Encodings and Long Context.
- RoFormer: Enhanced Transformer with Rotary Position Embedding: Rotary positions stabilize attention and generalize better than absolute or relative schemes in many settings (see the sketch after this list).
- LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens: Scaling and rescaling tricks push RoPE to million‑token ranges with controlled degradation.
- YaRN: Efficient Context Window Extension of LLMs: Simple RoPE scaling recipe that reliably extends context length with strong retention of quality.
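A minimal RoPE sketch under the common base-10000 convention; shapes are illustrative. Context-extension methods such as YaRN and LongRoPE largely amount to rescaling these frequencies and/or positions:

```python
import torch

def rope(x, base=10000.0):
    # x: (batch, seq, n_heads, d_head) with even d_head
    b, s, h, d = x.shape
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)    # (d/2,) frequencies
    theta = torch.arange(s, dtype=torch.float32)[:, None] * inv_freq[None]  # (seq, d/2) angles
    cos, sin = theta.cos()[None, :, None, :], theta.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    # rotate each 2-D pair (x1, x2) by its position-dependent angle
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(2, 128, 8, 64)
print(rope(q).shape)   # q·k after rotation depends only on the relative offset between positions
```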
Week 2‑2: LayerNorm & RMSNorm. Wrap up with gpt-oss.
- On Layer Normalization in the Transformer Architecture: Pre‑LN eases optimization at scale; Post‑LN can reach slightly better end quality but is harder to train.
- Root Mean Square Layer Normalization (RMSNorm): Drops mean‑centering to reduce overhead while keeping performance competitive (sketch after this list).
- gpt-oss: Let’s review OpenAI’s open‑weight gpt-oss models!
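A minimal RMSNorm sketch for comparison with nn.LayerNorm; the dimension and eps are illustrative:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        # normalize by the root-mean-square alone: no mean subtraction, no bias term
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

x = torch.randn(4, 16, 512)
print(RMSNorm(512)(x).shape, nn.LayerNorm(512)(x).shape)
```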
Week 3‑1: Is “decoder-only Transformer + left-to-right autoregressive decoding” the end of the story? Part 1.
- ENTP: Encoder‑only Next‑Token Prediction: Shows encoder‑only NTP can be competitive and more expressive for some tasks.
- Mamba: Linear‑time Sequence Modeling with Selective State Spaces: Linear‑time SSM with content‑aware (selective) state updates that handles long sequences well (toy scan after this list).
- (optional 1) Transformers are SSMs (Mamba‑2); Hydra
- (optional 2) Mamba+Transformer hybrids can do in-context learning; Hymba; Falcon-H1
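A toy, single-channel sketch of the selective state-space recurrence behind Mamba, with illustrative projections and a plain sequential loop (the real model runs this per channel with learned discretization and a fused, hardware-aware parallel scan):

```python
import torch
import torch.nn as nn

d_model, d_state = 64, 16
A = -torch.rand(d_state)                          # negative diagonal dynamics: a stable decay
to_delta = nn.Sequential(nn.Linear(d_model, 1), nn.Softplus())
to_B, to_C = nn.Linear(d_model, d_state), nn.Linear(d_model, d_state)
to_x = nn.Linear(d_model, 1)

def selective_scan(u):                            # u: (seq, d_model)
    h = torch.zeros(d_state)
    ys = []
    for u_t in u:                                 # O(seq) scan with O(1) state, unlike attention
        delta = to_delta(u_t)                     # input-dependent step size (the "selection")
        B_t, C_t = to_B(u_t), to_C(u_t)           # input-dependent B and C
        A_bar = torch.exp(delta * A)              # discretize the continuous-time dynamics
        h = A_bar * h + delta * B_t * to_x(u_t)   # state update
        ys.append((C_t * h).sum())                # read-out y_t = C_t · h_t
    return torch.stack(ys)

print(selective_scan(torch.randn(32, d_model)).shape)
```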
Week 3‑2: Is “decoder-only Transformer + left-to-right autoregressive decoding” the end of the story? Part 2.
- Diffusion‑LM Improves Controllable Text Generation: Continuous diffusion over word embeddings enables finer‑grained control than standard AR decoding in some settings.
- LLaDA: Large Language Diffusion Models: Trains diffusion language models at scale (up to 8B), challenging the need for AR decoding.
- Seed‑Diffusion: A large‑scale diffusion language model built for high‑speed, stable inference.
1‑1‑2) Training Data
Week 4‑1: How much data (and compute)?
- Scaling Laws for Neural Language Models: Loss follows power laws in model size, data, and compute.
- Training Compute‑Optimal Large Language Models (Chinchilla): For fixed compute, smaller models trained on more data beat oversized, under‑trained ones (sizing arithmetic after this list).
- Inference‑Aware Scaling Laws for LMs: Chooses model and data under deployment constraints like latency and serving cost.
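Back-of-the-envelope compute-optimal sizing, assuming the common approximations C ≈ 6·N·D training FLOPs and roughly 20 tokens per parameter; the paper fits its own coefficients, so this is illustrative arithmetic only:

```python
def compute_optimal(C_flops, tokens_per_param=20.0):
    # Solve C = 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * ratio))
    N = (C_flops / (6.0 * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    return N, D

for C in [1e21, 1e23, 1e25]:
    N, D = compute_optimal(C)
    print(f"C={C:.0e} FLOPs -> ~{N/1e9:.1f}B params, ~{D/1e9:.0f}B tokens")
```

At Chinchilla's own budget (~5.8e23 FLOPs) this recovers roughly 70B parameters and 1.4T tokens, matching the paper's headline configuration.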
Week 4‑2: Which Data?
- DataComp‑LM: Open benchmark for LLM pretraining data curation; careful filtering and mixtures yield large gains.
- A Survey on Data Selection for Language Models: Organizes methods for filtering, mixing, deduplication, and curriculum.
- OLMo‑2: Pretraining + Mid‑training: Adds a mid‑training phase and releases open artifacts for reproducibility.
- DataComp‑Reasoning (Open Thoughts): Reasoning‑centric data benchmark that lifts step‑by‑step problem solving.
1‑1‑3) Training Algorithms
Week 5‑1: Optimizers
- AdamW: Decoupled Weight Decay Regularization: Decouples weight decay from gradients for better stability and generalization.
- Lion: Symbolic Discovery of Optimization Algorithms: Sign‑momentum optimizer that is memory‑light yet competitive with Adam (AdamW/Lion update sketches after this list).
- Shampoo: Preconditioned Stochastic Tensor Optimization: Matrix and tensor preconditioning accelerates convergence.
- Old Optimizer, New Norm: An Anthology: Reinterprets Adam, Shampoo, and related methods as steepest descent under different norms.
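Minimal single-step sketches of AdamW (decoupled weight decay) and Lion (sign of an interpolated momentum); hyperparameters are illustrative defaults, and state handling is stripped down to one tensor per buffer:

```python
import torch

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.1):
    p.mul_(1 - lr * wd)                               # decoupled decay: applied to weights, not mixed into m, v
    m.mul_(b1).add_(g, alpha=1 - b1)                  # first moment
    v.mul_(b2).addcmul_(g, g, value=1 - b2)           # second moment
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    p.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)   # Adam update

def lion_step(p, g, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.1):
    update = (b1 * m + (1 - b1) * g).sign()           # sign update; only one momentum buffer needed
    p.add_(update + wd * p, alpha=-lr)
    m.mul_(b2).add_(g, alpha=1 - b2)                  # momentum refreshed after the step

p, g = torch.randn(10), torch.randn(10)
adamw_step(p.clone(), g, torch.zeros(10), torch.zeros(10), t=1)
lion_step(p.clone(), g, torch.zeros(10))
```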
Week 5‑2: Newer Optimizers
- SOAP: Runs Adam in the eigenbasis of Shampoo’s preconditioner; a practical optimizer targeting stability and speed in LLM pretraining.
- Deriving Muon: Derives the Muon update as steepest descent under a spectral‑norm constraint on the weight update.
- On Newton-Schulz: The iteration Muon uses to approximately orthogonalize its update matrices (sketch after this list).
- Muon: Orthogonalizes each weight matrix’s momentum update (via Newton‑Schulz) for stable, fast large‑batch training.
- Kimi K2 / MuonClip: Scales Muon to a very large MoE by adding QK‑Clip, which rescales query/key projections to keep attention logits bounded.
- (optional) The Potential of Second-Order Optimization
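A sketch of orthogonalizing an update matrix with the classic cubic Newton-Schulz iteration. Muon applies this idea to each weight matrix's momentum and gets away with a handful of steps of a tuned higher-order polynomial; the plain cubic version below is the textbook variant, not Muon's exact iteration:

```python
import torch

def newton_schulz_orthogonalize(G, steps=15, eps=1e-7):
    # X <- 1.5 X - 0.5 X X^T X converges to U V^T (the polar/orthogonal factor of G)
    # once the input is scaled so its singular values lie in (0, sqrt(3)).
    X = G / (G.norm() + eps)              # Frobenius scaling keeps singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X   # each step pushes singular values toward 1
    return X

G = torch.randn(256, 128)                 # e.g. a momentum matrix for a weight of this shape
O = newton_schulz_orthogonalize(G)
print(torch.dist(O.T @ O, torch.eye(128)))   # near zero: columns are close to orthonormal
```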
Week 6‑1: Optimizer Benchmarks
- Fantastic Pretraining Optimizers and Where to Find Them: Head‑to‑head comparisons of pretraining optimizers under controlled setups.
- Benchmarking Optimizers for Large Language Model Pretraining: Independent large‑scale benchmark with ablations and cost breakdowns.
Week 6‑2: Efficient Training
- Megatron‑LM: Training Multi‑Billion Parameter Language Models Using Model Parallelism: Tensor and pipeline parallelism blueprint for very large Transformers.
- ZeRO: Memory Optimizations Toward Training Trillion‑Parameter Models: Shards optimizer states, gradients, and parameters across data‑parallel ranks to fit bigger models (memory arithmetic after this list).
- Dion: Distributed Orthonormalized Updates: A communication‑efficient, distributed take on Muon‑style orthonormalized updates.
- (optional) MegaBlocks: block-sparse GPU kernel for MoE training
- (optional) Google’s MoE: the first modern sparse MoE (and how to efficiently train it)
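Back-of-the-envelope per-GPU memory for model states under the ZeRO stages, assuming the paper's mixed-precision accounting of 2 + 2 + 12 = 16 bytes per parameter (fp16 params and grads, fp32 Adam moments and master weights); activations, buffers, and fragmentation are ignored:

```python
def zero_model_state_gb(n_params, dp_degree, stage):
    P, G, O = 2 * n_params, 2 * n_params, 12 * n_params   # bytes: params, grads, optimizer states
    if stage >= 1: O /= dp_degree    # ZeRO-1 shards optimizer states
    if stage >= 2: G /= dp_degree    # ZeRO-2 also shards gradients
    if stage >= 3: P /= dp_degree    # ZeRO-3 also shards parameters
    return (P + G + O) / 1e9

for stage in [0, 1, 2, 3]:
    print(f"7B model, DP=64, ZeRO-{stage}: {zero_model_state_gb(7e9, 64, stage):.1f} GB/GPU")
```

With these assumptions, a 7B model drops from ~112 GB of model state per GPU (no sharding) to under 2 GB at stage 3 with 64-way data parallelism.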
1‑2) Posttraining
Week 7‑1: Alignment‑Focused Post‑training
- InstructGPT: Training LMs to Follow Instructions with Human Feedback: SFT + RLHF makes much smaller models preferred by human raters on real prompts.
- Direct Preference Optimization: Your LM is Secretly a Reward Model: Optimizes preferences without RL via a simple classification‑style loss derived from the Bradley‑Terry model (loss sketch after this list).
- KTO: Model Alignment as Prospect‑Theoretic Optimization: Preference learning framed with prospect theory; competitive with RLHF/DPO.
- Constitutional AI: Harmlessness from AI Feedback: AI‑guided feedback with a simple constitution reduces harmful outputs.
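A minimal sketch of the DPO objective for one preference pair, taking summed response log-probabilities under the policy and the frozen reference model as inputs; the numbers below are made up, and real implementations batch and mask token-level log-probs properly:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit reward of a response: beta * (policy logp - reference logp)
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Bradley-Terry preference model: maximize sigmoid(chosen_reward - rejected_reward)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

loss = dpo_loss(torch.tensor([-42.0]), torch.tensor([-55.0]),
                torch.tensor([-45.0]), torch.tensor([-50.0]))
print(loss)
```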
Week 7‑2: RL of LLMs
Week 8‑1: RL of LLMs
Week 8‑2: Reasoning‑Focused Post‑training
- DeepSeek‑R1: RL‑first training elicits step‑by‑step reasoning.
- OpenAI o1 System Card: Capability and safety evaluations for the o1 reasoning models.
- Let’s Verify Step by Step (Process Reward Models): Process‑level supervision yields more reliable reward models than outcome‑only labels.
- VersaPRM: Multi‑domain PRMs trained largely from synthetic reasoning traces.
2) Using LLMs
Week 9‑1: LLMs + Tools
- ReAct: Synergizing Reasoning and Acting: Interleaves thoughts with tool calls to cut hallucinations and improve grounding.
- Toolformer: LMs Can Teach Themselves to Use Tools: Self‑supervised API/tool use from a few demonstrations.
Week 9‑2: System‑Level Optimization
- DSPy: Compiling Declarative LM Calls into Self‑Improving Pipelines: Framework that tunes prompts and data to optimize full pipelines.
- Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs: Few-shot bootstrapping, per-module OPRO, and MIPRO.
- GEPA: Reflective Prompt Evolution: Evolves prompts via reflective search and Pareto selection.
Week 10‑1: Attention and Serving
- FlashAttention: Fast and Memory‑Efficient Exact Attention: IO‑aware tiling gives large speed and memory wins without accuracy loss (online‑softmax sketch after this list).
- Efficient Memory Management for LLM Serving with PagedAttention (vLLM): KV cache paging (virtual‑memory style) enables high‑throughput serving.
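A sketch of the online-softmax recurrence that FlashAttention's tiling relies on: exact attention computed one K/V block at a time, never materializing the full score matrix. Shapes and block size are illustrative; the real kernel does this per tile in on-chip SRAM and has a matching backward pass:

```python
import torch

def attention_blockwise(q, k, v, block=64):
    # q: (n_q, d); k, v: (n_k, d). Exact attention, streamed over K/V blocks.
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0], 1), float("-inf"))   # running row-wise max
    l = torch.zeros(q.shape[0], 1)                   # running softmax denominator
    acc = torch.zeros_like(q)                        # running weighted sum of V
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                       # scores for this block only
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        correction = torch.exp(m - m_new)            # rescale previously accumulated terms
        p = torch.exp(s - m_new)
        l = l * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + p @ vb
        m = m_new
    return acc / l

q, k, v = torch.randn(128, 64), torch.randn(512, 64), torch.randn(512, 64)
ref = torch.softmax((q @ k.T) / 64**0.5, dim=-1) @ v
print(torch.allclose(attention_blockwise(q, k, v), ref, atol=1e-4))   # exact up to float error
```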
Week 10‑2: Quantization
- LLM.int8(): 8‑bit Matrix Multiplication for Transformers at Scale: Mixed‑precision with outlier handling keeps accuracy while using int8 matmuls.
- GPTQ: Accurate Post‑Training Quantization for GPTs: One‑shot 3–4‑bit weight quantization using approximate second‑order information (a round‑to‑nearest baseline is sketched after this list for contrast).
- AWQ: Activation‑aware Weight Quantization: Channel‑wise scaling protects salient weights for better 4‑bit accuracy.
- (Optional) GPTQ as Babai’s nearest‑plane algorithm.
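For contrast with GPTQ and AWQ, a sketch of the round-to-nearest (RTN) group-wise weight quantization baseline they improve on (GPTQ via second-order error compensation, AWQ via activation-aware scaling); bit width and group size are illustrative:

```python
import torch

def quantize_rtn(w, bits=4, group_size=128):
    # w: (out_features, in_features); each group of columns gets its own symmetric scale
    qmax = 2 ** (bits - 1) - 1
    w_g = w.reshape(w.shape[0], -1, group_size)              # (out, n_groups, group)
    scale = w_g.abs().amax(dim=-1, keepdim=True) / qmax      # absmax scale per group
    q = (w_g / scale).round().clamp(-qmax - 1, qmax)         # int4 codes in [-8, 7]
    return (q * scale).reshape_as(w), q, scale               # dequantized weights + codes

w = torch.randn(256, 1024)
w_hat, q, scale = quantize_rtn(w)
print((w - w_hat).abs().mean())   # per-element reconstruction error of the plain RTN baseline
```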
Week 11‑1: Exact Acceleration
- Blockwise Parallel Decoding for Deep Autoregressive Models: Proposes a block of future tokens in parallel, then verifies them with the base model; an early predecessor of speculative decoding.
- Speculative Sampling: Draft‑and‑verify decoding gives ~2× speedups while exactly preserving the target model’s distribution (accept/reject sketch after this list).
- (Optional) SpecTr: Speculative decoding framed as optimal transport over multiple drafts.
- Medusa: Extra decoding heads propose and verify multi‑token candidates in one step.
- EAGLE: Drafts in feature (hidden‑state) space with a lightweight autoregressive head, improving acceptance rates over token‑level drafting.
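A minimal sketch of the speculative-sampling accept/reject rule for a single drafted token, using toy next-token distributions; accepted-or-resampled outputs follow the target model's distribution exactly:

```python
import torch

def verify_token(p_target, p_draft, drafted_token):
    # Accept with probability min(1, p_target / p_draft) at the drafted token.
    accept_prob = torch.clamp(p_target[drafted_token] / p_draft[drafted_token], max=1.0)
    if torch.rand(()) < accept_prob:
        return drafted_token                      # keep the draft model's proposal
    residual = torch.clamp(p_target - p_draft, min=0.0)
    residual /= residual.sum()                    # resample from the normalized residual
    return torch.multinomial(residual, 1).item()

p_target = torch.tensor([0.5, 0.3, 0.2])          # toy target next-token distribution
p_draft = torch.tensor([0.2, 0.6, 0.2])           # toy draft next-token distribution
drafted = torch.multinomial(p_draft, 1).item()
print(verify_token(p_target, p_draft, drafted))
```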
Week 11‑2: Approximate Inference and KV Policies
- StreamingLLM: Attention Sinks for Infinite‑Length Input: A few sink tokens stabilize long streaming with sliding windows (eviction‑policy sketch after this list).
- H2O: Heavy‑Hitter Oracle for Efficient KV Cache: Keeps heavy‑hitters plus recency for principled KV eviction.
- SnapKV: Draft‑free selection of the most important prompt tokens for strong KV compression.
- Draft‑based Approximate Inference for LLMs: Uses draft models to rank prompt and KV importance for approximation.
- Lexico: Online Dictionary Learning for KV Cache Compression
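A minimal sketch of a StreamingLLM-style KV eviction policy: always keep a few attention-sink tokens plus a sliding window of recent tokens and drop the middle. Sizes are illustrative, and real systems also handle position re-indexing and may keep heavy-hitter tokens as in H2O:

```python
import torch

def keep_indices(seq_len, n_sink=4, window=1024):
    if seq_len <= n_sink + window:
        return torch.arange(seq_len)                    # nothing to evict yet
    sinks = torch.arange(n_sink)                        # earliest tokens act as attention sinks
    recent = torch.arange(seq_len - window, seq_len)    # sliding window of recent tokens
    return torch.cat([sinks, recent])

k_cache = torch.randn(1, 8, 5000, 64)                   # (batch, heads, seq, d_head)
idx = keep_indices(k_cache.shape[2])
k_cache = k_cache[:, :, idx]                            # evict everything in the middle
print(k_cache.shape)                                    # (1, 8, 4 + 1024, 64)
```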
3) Adapting LLMs
Week 12-1: PEFT
- LoRA: Low‑rank adapters enable parameter‑efficient finetuning with minimal latency cost (wrapper sketch after this list).
- DoRA: Magnitude‑direction decomposition improves LoRA’s capacity without runtime overhead.
- Expressive Power of LoRA: Theory on when low‑rank adapters can approximate target functions in Transformers.
- LoRA Training Provably Converges…: Convergence guarantees and clear failure modes in practical regimes.
- (optional) QLoRA
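A minimal LoRA wrapper around a frozen linear layer; the rank, alpha, and init scale are illustrative, with A random and B zero so training starts exactly from the pretrained behavior:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # frozen base output + scaled low-rank update (only A and B are trained)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(4, 512)).shape)
# At deployment, B @ A can be merged into the base weight, so there is no extra latency.
```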
Week 12-2: In‑Context Learning
- Z‑ICL: Zero‑Shot In‑Context Learning with Pseudo‑demos: Builds pseudo‑demos from raw text to close the zero‑ vs few‑shot gap.
- Dual Operating Modes of ICL: Frames ICL as task retrieval vs task learning and explains early‑ascent behavior.
- PromptIntern: Internalizes recurring prompts to reduce input tokens and inference cost.
Week 13-1: Continual Adaptation via Prompt Evolution
- A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
- PromptBreeder: Evolves prompts with self‑mutation and selection.
- Auto Evol‑Instruct: Automates instruction evolution for data generation with no humans in the loop.
- Automatic Prompt Engineer (APE): Treats prompt search as program synthesis to find high‑performing prompts.
- PromptAgent: Uses planning and reflective error analysis to reach expert‑level prompts.