Paper List

LLM 901 — Weekly Reading Schedule with TL;DRs

1) Training LLMs

1-1) Pretraining

1‑1‑1) Architecture

Week 1‑1 (09/08/2025): DeepSeek-V2/3

Week 1‑2 (09/12/2025): MoE and Multi‑token Prediction

Week 2‑1 (09/15/2025): Positional Encodings and Long Context

Week 2‑2 (09/19/2025): LayerNorm & RMSNorm. Wrap up with gpt-oss.

Week 3‑1 (09/22/2025): Is “decoder-only Transformer + left-to-right autoregressive decoding” the end of the story? Part 1.

Week 3‑2 (09/26/2025): Is “decoder-only Transformer + left-to-right autoregressive decoding” the end of the story? Part 2.


1‑1‑2) Training Data

Week 4‑1 (09/29/2025): How much data (and compute)?

Week 4‑2 (10/03/2025): Which Data?


1‑1‑3) Training Algorithms

Week 5‑1 (10/06/2025): Optimizers

Week 5‑2 (10/10/2025): Newer Optimizers

  • SOAP: Runs Adam in the eigenbasis of Shampoo’s preconditioner, improving stability and speed in LLM pretraining.
  • Muon: Orthogonalizes momentum updates for 2‑D weight matrices via a Newton-Schulz iteration, giving stable large‑batch training (see the sketch after this list).
  • Kimi K2 / MuonClip: Scales Muon to very large models by adding QK-Clip, which rescales query/key projections to cap attention logits and prevent training instability.
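
To make the orthogonalized-update idea concrete, here is a minimal Python sketch of a Muon-style step: momentum is accumulated as usual, then the 2‑D update matrix is approximately orthogonalized with a Newton-Schulz iteration before being applied. Function names, learning rate, and momentum coefficient are illustrative; the iteration coefficients follow the publicly released Muon reference code, and this is a sketch rather than a drop-in optimizer.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately map the 2-D matrix g onto a (semi-)orthogonal matrix using
    # the quintic Newton-Schulz iteration from the public Muon reference code.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)             # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:                        # iterate on the "wide" orientation
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

@torch.no_grad()
def muon_style_step(weight, grad, momentum, lr=0.02, beta=0.95):
    # Illustrative single update for one 2-D weight matrix (hypothetical API):
    # accumulate momentum, orthogonalize the update, then apply it.
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    weight.add_(update, alpha=-lr)
```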

Week 6‑1 (10/13/2025): Optimizer Benchmarks

Week 6‑2 (10/17/2025): Efficient Training


1-2) Post-training

Week 7‑1 (10/20/2025): Alignment‑Focused Post‑training

Week 7‑2 (10/24/2025): RL of LLMs

Week 8‑1 (10/27/2025): RL of LLMs (continued)

Week 8‑2 (10/31/2025): Reasoning‑Focused Post‑training


2) Using LLMs

Week 9‑1 (11/03/2025): LLMs + Tools

Week 9‑2 (11/07/2025): System‑Level Optimization

Week 10‑1 (11/10/2025): Attention and Serving

Week 10‑2 (11/14/2025): Quantization

Week 11‑1 (11/17/2025): Exact Acceleration

  • Speculative Sampling: Draft‑and‑verify decoding gives ~2× speedups while keeping the target model’s distribution (see the sketch after this list).
  • Medusa: Extra decoding heads propose and verify multi‑token candidates in one step.
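
As a companion to the Speculative Sampling entry, below is a minimal greedy-acceptance sketch of draft-and-verify decoding: the cheap draft model proposes k tokens, a single target forward pass scores them all, and the longest prefix matching the target’s own argmax is accepted plus one token from the target. The paper’s full algorithm instead accepts each drafted token with probability min(1, p/q) and resamples from the residual distribution, which preserves the target’s sampling distribution exactly; `target_model` and `draft_model` here are placeholder callables mapping token ids to next-token logits.

```python
import torch

@torch.no_grad()
def greedy_speculative_decode(target_model, draft_model, input_ids, k=4, max_new_tokens=64):
    # target_model / draft_model: placeholder callables mapping (1, seq) token ids
    # to logits of shape (1, seq, vocab).
    ids = input_ids
    produced = 0
    while produced < max_new_tokens:
        # 1) Draft: the cheap model proposes k tokens autoregressively.
        draft_ids = ids
        for _ in range(k):
            next_id = draft_model(draft_ids)[:, -1, :].argmax(dim=-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_id], dim=-1)
        proposal = draft_ids[:, ids.shape[1]:]                            # (1, k) drafted tokens

        # 2) Verify: one target forward pass scores all drafted positions at once.
        target_logits = target_model(draft_ids)[:, ids.shape[1] - 1:, :]  # (1, k+1, vocab)
        target_pred = target_logits.argmax(dim=-1)                        # target's greedy choices

        # 3) Accept the longest matching prefix, plus one token from the target itself.
        matches = (proposal == target_pred[:, :k]).int().cumprod(dim=-1)
        n_accept = int(matches.sum())
        ids = torch.cat([ids, proposal[:, :n_accept],
                         target_pred[:, n_accept:n_accept + 1]], dim=-1)
        produced += n_accept + 1
    return ids
```

Because verification checks the target model’s own argmax, the output matches ordinary greedy decoding; the speedup comes from accepting several drafted tokens per expensive target forward pass.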

Week 11‑2 (11/21/2025): Approximate Inference and KV Policies


3) Adapting LLMs

11/24/2025: PEFT

  • LoRA: Low‑rank adapters enable parameter‑efficient finetuning and add no inference latency once merged into the base weights (see the sketch after this list).
  • DoRA: Magnitude‑direction decomposition improves LoRA’s capacity without runtime overhead.
  • Expressive Power of LoRA: Theory on when low‑rank adapters can approximate target functions in Transformers.
  • LoRA Training Provably Converges…: Convergence guarantees and clear failure modes in practical regimes.
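
To ground the LoRA entries, here is a minimal sketch of a LoRA-wrapped linear layer: the pretrained weight is frozen and a trainable low-rank update B·A, scaled by alpha/r, is added to its output. The class name, rank, and scaling defaults are illustrative, not values taken from any particular paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen base layer plus a trainable low-rank update: y = W x + (alpha / r) * B A x.
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                    # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01) # down-projection, small random init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))       # up-projection; zero init => initial update is zero
        self.scaling = alpha / r                                       # standard LoRA scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * ((x @ self.A.T) @ self.B.T)

# Usage: wrap e.g. an attention projection and train only A and B.
layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16)
trainable = [p for p in layer.parameters() if p.requires_grad]         # just A and B
```

Because the adapter is purely additive, B·A can be merged into the base weight after finetuning, which is why LoRA incurs no extra inference latency.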

11/28/2025: Thanksgiving Recess

12/01/2025: In‑Context Learning

12/05/2025: Continual Adaptation via Prompt Evolution

12/08/2025: Final Poster Presentation