Add a while loop: the agent keeps going until the task is complete. This is the core agent architecture.
// LLM Agent (Basic Form)
token_history = task_instruction
while task not completed:
    generated_tokens = LLM(token_history)
    thoughts, action = parse(generated_tokens)
    output = exec(action)
    token_history += [thoughts, action, output]
What changed
while loop: the agent keeps going
output: fed back for the next iteration
token_history: accumulates all past interactions
Observe-Think-Act
Each iteration of the while loop = one Observe-Think-Act cycle. The LLM observes token_history, thinks, and acts.
Note: We call it "output" intentionally. The real-world system is not Markovian, so the output is richer than just the next "state" — it is anything observed during the execution: logs, signals, side effects, or even a reward. For instance, if the action is a compile command, the output could be the entire compiler log plus a success/fail signal.
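A minimal runnable sketch of this loop, with `llm`, `parse`, and `exec_action` as toy stand-ins for illustration (not a real model API):

```python
def llm(history):
    """Stub model: a real call would send the token history to an LLM."""
    if any(kind == "output" for kind, _ in history):
        return "THOUGHT: tests pass ACTION: done"
    return "THOUGHT: need to run tests ACTION: run_tests"

def parse(generated):
    """Split generated text into (thoughts, action)."""
    thoughts, action = generated.split(" ACTION: ")
    return thoughts.removeprefix("THOUGHT: "), action

def exec_action(action):
    """Stub executor: anything observed during execution is the output."""
    return "3 tests passed" if action == "run_tests" else ""

history = [("task", "make the tests pass")]
while True:  # "task not completed" in the pseudocode
    thoughts, action = parse(llm(history))
    if action == "done":
        break
    output = exec_action(action)
    history += [("thought", thoughts), ("action", action), ("output", output)]
```

Each pass through the loop is one Observe-Think-Act cycle: the model observes `history`, produces thoughts and an action, and the action's output is appended for the next iteration.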
The First Design Problem
Look at the same pseudocode again -- but pay attention to what the LLM actually sees.
Do we really want the LLM to see all past history? Can we inject something? Should we take something off?
The problem
token_history grows with every iteration
Eventually fills the context window
Per-token generation compute grows linearly with context size
KV cache memory grows linearly too
Most of the history may be irrelevant
The motivation
We need a way to control what goes into the LLM at each step -- independently from what is stored in history. This is context engineering.
Context Engineering
Separate the history from the LLM context, and prepare the right context for each iteration — including external_info such as knowledge bases, additional prompts, or environment state. This is context engineering.
What can context engineering do? Let's look at three use cases.
Context Engineering (1/3): Swapping Tools
Tools are defined in the system prompt. Context engineering enables swapping the system prompt at any iteration -- changing the available tool set dynamically.
Context Engineering (2/3): Skills
Q: Why not put everything in one giant prompt?
A: Context window is finite. Skills enable selective loading -- only what you need, when you need it.
Warning
Skills are also the easiest way to overfit an agent to a benchmark. Task-specific skills can inflate scores without improving general capability. I will discuss a potential fairness issue on Terminal Bench later.
What Does a Skill Actually Look Like?
A skill is just a text file that gets loaded into context when relevant. Here is a real example from Claude Code:
# /commit -- a skill for creating git commits
When the user asks to commit changes:
1. Run git status and git diff to see all changes
2. Analyze the diff -- summarize the nature of the changes (new feature, bug fix, refactor, docs, etc.)
3. Draft a concise commit message (1-2 sentences) focusing on the "why" rather than the "what"
4. Stage relevant files (avoid secrets, .env, etc.)
5. Create the commit
# Available tools: Bash(git status), Bash(git diff), Bash(git add), Bash(git commit)
A skill = prompt + tools + instructions. It is loaded into context only when the agent needs it. The agent does not see the /commit skill when it is debugging.
Skills enable continual learning. Since skills are just text, the agent can write new skills from experience, update existing ones, and even share them with other agents — a modern form of decentralized learning. Updating a skill = updating a prompt.
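Since a skill is just a text file, selective loading can be sketched in a few lines. The directory layout and the keyword-matching trigger below are my illustration; real harnesses use richer relevance signals:

```python
from pathlib import Path

def load_relevant_skills(skills_dir, user_request):
    """Load only the skill files whose name appears in the request,
    so the context holds just what the current task needs."""
    loaded = []
    for path in sorted(Path(skills_dir).glob("*.md")):
        if path.stem in user_request.lower():  # e.g. "commit" -> commit.md
            loaded.append(path.read_text())
    return "\n\n".join(loaded)
```

The same mechanism supports continual learning: because writing a new skill is just writing a file into `skills_dir`, the agent can add or update skills from its own experience.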
Context Engineering (3/3): Compaction
When the context window gets full, compress the token history.
Coding agent: output = [compiler_log, result]. Log is long; result is short ("compile successful"). Remove logs, keep results.
ML research agent: output = [training_curves, validation_loss]. Curves are long; loss is short. Keep the loss.
LLM summarization: LLM_summarizer(token_history) -- use another LLM to compress.
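The rule-based variant for the coding-agent case can be sketched as follows (the `(kind, text)` history format and the 200-character threshold are illustrative):

```python
def compact(history, keep_chars=200):
    """Toy compaction pass: replace long tool outputs with their tail,
    where the short result line usually lives, leaving thoughts and
    actions intact. Real agents fall back to an LLM summarizer when
    rule-based trimming is not enough."""
    compacted = []
    for kind, text in history:
        if kind == "output" and len(text) > keep_chars:
            text = "[log truncated] " + text[-keep_chars:]
        compacted.append((kind, text))
    return compacted
```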
See my recent analysis of how Codex does compaction: @Kangwook_Lee
Context Engineering: KV Cache Constraint
Context engineering changes the context between iterations. But if the prefix changes, the KV cache is invalidated and you pay full recompute cost. How do we modify the context while keeping the prefix stable?
Approach 1: Masking (Manus, 2025)
Provide all information in the system prompt from the start. "Mask" out irrelevant parts via logit masking instead of adding/removing. The prefix never changes, so the KV cache is always reused.
[Manus blog]
Key insight: By keeping the entire system prompt fixed from the start, the KV cache for that prefix is computed once and reused on every iteration. The cost is a longer initial prompt, but the savings compound over many iterations.
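A sketch of the masking mechanism itself (the tool-to-token-id mapping is hypothetical; real systems apply this at the decoder's sampling step):

```python
import numpy as np

def mask_tool_logits(logits, tool_token_ids, allowed_tools):
    """Forbid tools by setting their trigger-token logits to -inf.
    The system prompt listing ALL tools stays fixed, so its KV cache
    is computed once and reused on every iteration."""
    masked = logits.copy()
    for tool, token_id in tool_token_ids.items():
        if tool not in allowed_tools:
            masked[token_id] = -np.inf
    return masked
```

After softmax, the disallowed tools receive exactly zero probability, so the agent can never invoke them -- without a single character of the prefix changing.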
Another approach: append iteration-specific context AFTER the stable prefix. It appears just once and is not stored in history.
// Ephemeral context example:
while task not completed:
    context = token_history + log   // log appended after prefix (ephemeral!)
    generated_tokens = LLM(context)
    thoughts, action = parse(generated_tokens)
    output = exec(action)
    token_history += [thoughts, action, output]   // only keep the output -- the log is never stored!
This is the technique we used in PUBG Ally -- each step has rich situational info (enemies nearby, health, zone) that is critical right now but not useful in future steps.
Multi-Agents = Context Isolation
Key insight: multi-agents are just programmatic context isolation. Like OOP -- each "object" has its own context.
Problem: One agent does coding and reviewing. The reviewer is biased by the coder's thoughts — confirmation bias! Solution: Give each role its own clean context.
// Each agent has its own context
code = LLM_Agent("code it")
review = LLM_Agent("review it", code)
Instead of bloating the main context, spawn a sub-agent. Its context is discarded after use -- the main agent stays clean.
// Main agent spawns a sub-agent
// instead of reading a huge file directly:
action: LLM_Agent("read input.txt and summarize it")
output: "input.txt is about local restaurants in Berkeley ..."
The sub-agent's context can get bloated with the full file content, but the main agent only sees the concise summary. This is the same idea as spawning a subprocess in programming.
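The isolation pattern in miniature (the stub `llm` and the history format are illustrative):

```python
def llm(history):
    """Stub model: returns a short answer for the last task in its context."""
    task = history[-1][1]
    return "summary of: " + task

def subagent(task_text):
    """Fresh, throwaway context per sub-task. Only the final answer
    escapes -- like a subprocess returning its stdout."""
    sub_history = [("task", task_text)]  # may bloat with file contents
    return llm(sub_history)              # ...but is discarded on return

main_history = [("task", "find good restaurants")]
main_history.append(("output", subagent("read input.txt and summarize it")))
```

However large `sub_history` grows inside `subagent`, `main_history` only ever gains one short entry.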
Three ways the agent can stop: (1) a verifiable task with a checker -- easy. (2) a fixed time/budget limit -- also easy. (3) the LLM itself decides by generating a "done" action (or EOS token). This is the common case -- and the problematic one.
The False Completion Problem
When the LLM decides it is done, it is frequently wrong. We call this false completion.
A Real Example
On Terminal-Bench-2 (SWE/MLE-level tasks requiring deep expertise): a baseline agent (Terminus) with Claude Opus 4.6 submitted results 5 times within the time limit. All 5 times, the agent confidently submitted a wrong answer. Roughly 80% of failures were due to false completion.
The Ralph Loop
One idea: add an outer loop where a fresh agent checks if the work is actually done.
// Outer loop: restart a fresh agent on the same world state
while True:
    token_history = []   // inner loop -- clean context, same world state
    while task not completed:
        context = context_build(token_history, external_info)
        generated_tokens = LLM(context)
        thoughts, action = parse(generated_tokens)
        if action == done: break
        output, answer_not_changed = exec(action)
        token_history += [thoughts, action, output]
    if answer_not_changed: break   // only exit if the next agent found nothing to change
The key: each inner loop starts with clean context but the same world state. It reduces confirmation bias -- a fresh agent is not influenced by prior reasoning.
github.com/snarktank/ralph
Our Approach: Terminus-KIRA
In building Terminus-KIRA, we tested a variant that is both effective and efficient:
while True:
    context = context_build(token_history, external_info)
    generated_tokens = LLM(context)
    thoughts, action = parse(generated_tokens)
    if action == done:
        generated_tokens = LLM(token_history_wo_thoughts)   // check with just the work, no thoughts
        thoughts, action = parse(generated_tokens)
        if action == done: break   // truly done: an unbiased agent agrees
    output = exec(action)
    token_history += [thoughts, action, output]
    token_history_wo_thoughts += [action, output]   // track just the work
The verifier sees only what was done (actions + outputs), not how it was reasoned about (thoughts). This removes confirmation bias without needing a full outer loop restart.
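Assuming history entries tagged by kind, the verifier's view is a simple filter (the tagging scheme is my illustration):

```python
def without_thoughts(history):
    """Verifier's view: what was done (actions, outputs),
    not how it was reasoned about (thoughts)."""
    return [(kind, text) for kind, text in history if kind != "thought"]
```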
AutoResearch: When False Completion Isn't a Problem
AutoResearch (Andrej Karpathy) went viral on Twitter. It applies the basic agent loop to autonomous ML research -- no Ralph loop needed.
Why does it work without Ralph? This is a progress-measurable task. The goal: train a model with lower validation loss. The loss either went down or it didn't -- very hard to falsely claim completion.
Look at the git state: the current branch/commit we're on
Tune train.py with an experimental idea by directly hacking the code.
git commit
Run the experiment: uv run train.py > run.log 2>&1 (redirect everything -- do NOT use tee or let output flood your context)
Read out the results: grep "^val_bpb:\|^peak_vram_mb:" run.log
If the grep output is empty, the run crashed. Run tail -n 50 run.log to read the Python stack trace and attempt a fix. If you can't get things to work after more than a few attempts, give up.
Record the results in the tsv (NOTE: do not commit the results.tsv file, leave it untracked by git). If val_bpb improved (lower), you "advance" the branch, keeping the git commit. If val_bpb is equal or worse, you git reset back to where you started.
The idea is that you are a completely autonomous researcher trying things out. If they work, keep. If they don't, discard. And you're advancing the branch so that you can iterate. If you feel like you're getting stuck in some way, you can rewind but you should probably do this very very sparingly (if ever).
Timeout: Each experiment should take ~5 minutes total (+ a few seconds for startup and eval overhead). If a run exceeds 10 minutes, kill it and treat it as a failure (discard and revert).
Crashes: If a run crashes (OOM, or a bug, or etc.), use your judgment: If it's something dumb and easy to fix (e.g. a typo, a missing import), fix it and re-run. If the idea itself is fundamentally broken, just skip it, log "crash" as the status in the tsv, and move on.
NEVER STOP: Once the experiment loop has begun (after the initial setup), do NOT pause to ask the human if you should continue. Do NOT ask "should I keep going?" or "is this a good stopping point?". The human might be asleep, or gone from a computer and expects you to continue working indefinitely until you are manually stopped. You are autonomous. If you run out of ideas, think harder -- read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes. The loop runs until the human interrupts you, period.
As an example use case, a user might leave you running while they sleep. If each experiment takes you ~5 minutes then you can run approx 12/hour, for a total of about 100 over the duration of the average human sleep. The user then wakes up to experimental results, all completed by you while they slept!
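The read-out and keep/revert steps above can be sketched as plain Python. The log format follows the grep pattern in the instructions; the helper names are mine:

```python
import re

def read_metrics(log_text):
    """Parse val_bpb / peak_vram_mb lines from run.log.
    An empty dict means the run crashed before printing metrics."""
    metrics = {}
    for key in ("val_bpb", "peak_vram_mb"):
        m = re.search(rf"^{key}:\s*([0-9.]+)", log_text, re.MULTILINE)
        if m:
            metrics[key] = float(m.group(1))
    return metrics

def decide(best_val_bpb, metrics):
    """Advance the branch only on a strict improvement; otherwise revert."""
    if "val_bpb" not in metrics:
        return "crash"
    return "advance" if metrics["val_bpb"] < best_val_bpb else "revert"
```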
Similar to AutoResearch, but for RL engineering. More than just hyperparameter tuning — the agent must also design rewards to avoid reward hacking. Example: b-boying spider.
Work by: Wooseong Chung, Taegwan Ha, Kangwook Lee, Jeong-Gwan Lee, Suyoung Lee, Taehwan Kwon, Yunhyeok Kwak (alphabetical)
Test-Time Scaling for Agents
Another orthogonal approach: test-time scaling -- generate multiple candidates and pick the best. For agents, however, it is largely unexplored:
Too expensive: running an entire agent loop multiple times costs much more than sampling multiple CoTs
Hard to aggregate: agent outputs are not from a fixed set of choices, so majority voting doesn't directly apply
Majority voting may not help anyway: false completion rate is often > 50%, and all the false traces tend to look similar -- the majority is wrong
Test-time scaling for agents is an important open problem.
Test-Time Scaling: Naive Approach
Can the LLM agent itself figure out which trial was best?
for i in range(N):   // N = test-time scaling factor
    history[i] = LLM_agent(task_instruction)
best = LLM("find the most promising work" + task_instruction + history)
Result: accuracy improved, but mostly for tasks with P(success) > 50%. Guess why?
We found the LLM was implicitly clustering the candidates and picking the majority! It was told to pick the best, but it was doing majority voting. When P(success) > 50%, the majority is correct -- so it helps. When P(success) < 50%, the majority is wrong -- so it hurts.
Test-Time Scaling: Pairwise Comparisons
To reduce majority bias, use pairwise comparisons -- the LLM never sees all candidates at once.
for i in range(N):
    history[i] = LLM_agent(task_instruction)
for (i, j) in [N] x [N]:
    y[i,j] = LLM("which is more promising?" + history[i] + history[j])
s = BTL_solver(y)   // Bradley-Terry-Luce model
return argmax(s)
No majority bias -- pairwise comparisons never expose all candidates at once. No context sharing between comparisons.
More compute = less variance -- compare each pair multiple times. Test-time compute for comparisons too!
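A minimal BTL_solver can use the classic minorization-maximization update for the Bradley-Terry model (the win-count matrix format is assumed; this is a sketch, not a production solver):

```python
import numpy as np

def btl_solver(wins, iters=100):
    """Fit Bradley-Terry-Luce scores s from wins[i, j] = # times
    candidate i beat candidate j, via the standard MM update:
        s_i <- W_i / sum_{j != i} n_ij / (s_i + s_j)
    where W_i is i's total wins and n_ij the # of i-vs-j comparisons."""
    n = wins.shape[0]
    s = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            denom = sum((wins[i, j] + wins[j, i]) / (s[i] + s[j])
                        for j in range(n) if j != i)
            if denom > 0:
                s[i] = wins[i].sum() / denom
        s = s / s.sum()  # fix the scale: scores are only relative
    return s
```

Repeating each pairwise comparison several times just adds to the counts in `wins`, which is how extra test-time compute reduces the variance of the final ranking.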
OpenClaw = the basic agent loop + Ralph loop + memory update. After each task, the agent updates its own prompts -- enabling continual learning across tasks.
memory_update can be creative: summarize the session, define an IDENTITY that the agent updates over time, develop personality/character across tasks.
The rest of the harness is clever too: managing multiple instructions, background tool execution, compaction + file system. But the key innovation is task-to-task continuity via memory.
Memory in Practice: inZOI and PUBG Ally
inZOI: Tried self-evolving prompts -- personalities polarized too easily. Shipped with user-defined personalities instead. 1M+ copies sold, first game with on-device LLM agents.
PUBG Ally: Memory focused on friendship and past games. "Remember when we won that match?" Makes the ally feel like a real teammate over time.
Lots of Interesting Challenges Left
Proactivity -- when should the agent talk to you? act for you?
Fast reaction -- System 1/System 2 architecture (see Figure Helix)
Distillation -- LLM agents to SLM agents. Not trivial (off-policy, model sharing)
Multimodal -- STT → LLM → TTS loses information. Multimodal should be the core model
Evaluation -- agent outputs are complex, non-deterministic, hard to grade
Planning -- LLMs are bad at exploration/exploitation. External search helps (see TAPE, ReJump)
Summary
The agent loop is simple: observe, think, act, repeat with tool calling via special tokens
Context engineering is the key design space: skills, compaction, ephemeral context, KV cache constraints
Multi-agents / sub-agents / recursive LMs = different forms of context isolation and programmatic control
False completion is the #1 production issue. Ralph loop, progress measurement, and test-time scaling (BTL) address it
Memory enables task-to-task continuity and self-evolution (OpenClaw, inZOI, PUBG Ally)
Many open challenges: proactivity, fast reaction, distillation, multimodal, evaluation, planning
We are early. Maybe the 1950s of communications. Go build (real engineering starts from real problems) and go clarify (the essence resembles control theory, communication theory, statistical inference).