Everyone is building LLM agents. The products are real and shipping — Claude Code, Codex, AutoResearch, OpenClaw. But how do we actually build them? This monograph walks through the key ideas, many of which emerged from solving real problems in production systems: Terminus-KIRA (a terminal coding agent), PUBG Ally (a real-time cooperative game agent), and Smart Zoi (life simulation agents in inZOI).
Along the way, we will encounter context engineering, skills, compaction, multi-agents, recursive LMs, the Ralph loop, test-time scaling, and memory-driven self-evolution. These are not abstract concepts — they were critical in solving real problems.
An LLM agent is an LLM-based system that iteratively 1) observes, 2) thinks, and 3) acts to achieve a goal. The agent reads its environment, reasons about what to do, takes an action (typically a tool call), and then loops — repeating until done.
This is the core loop. Everything else in this monograph is a variation on, or improvement to, this fundamental cycle.
2. Tool Calling (No Loop)
The simplest building block is a single LLM call that produces thoughts and an action. No loop — just one shot.
One call, one action, one result. Useful for simple tasks: "search for X", "calculate Y".
See also: Berkeley Function Calling Leaderboard (Gorilla team); NexusRaven (Jiantao Jiao et al., NeurIPS '23).
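To make the one-shot pattern concrete, here is a minimal sketch in Python. The `THOUGHT:`/`ACTION:` output format, the `parse` and `exec_action` helpers, and the toy tool registry are all hypothetical, not any particular API:

```python
# Minimal one-shot tool-calling sketch. The "THOUGHT: ... ACTION: tool(arg)"
# format is an assumed convention for illustration; real APIs differ.
def parse(generated_tokens):
    """Split the model's output into thoughts and a single action string."""
    thoughts, _, action = generated_tokens.partition("ACTION:")
    return thoughts.replace("THOUGHT:", "").strip(), action.strip()

TOOLS = {
    "calculate": lambda expr: str(eval(expr)),       # toy calculator tool
    "search":    lambda q: f"top result for {q!r}",  # stubbed search tool
}

def exec_action(action):
    """Dispatch 'tool(arg)' to the registered tool."""
    name, _, rest = action.partition("(")
    return TOOLS[name](rest.rstrip(")"))

# One call, one action, one result -- no loop.
generated = "THOUGHT: need arithmetic ACTION: calculate(6*7)"  # stand-in for LLM(...)
thoughts, action = parse(generated)
print(exec_action(action))  # -> 42
```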
3. The Agent Loop (Basic Form)
Add a while loop: the agent keeps going until the task is complete. This is the core agent architecture.
// LLM Agent (Basic Form)
token_history = task_instruction
while task not completed:
    generated_tokens = LLM(token_history)
    thoughts, action = parse(generated_tokens)
    output = exec(action)
    token_history += [thoughts, action, output]
What changed
while loop: the agent keeps going
output: fed back for the next iteration
token_history: accumulates all past interactions
Observe-Think-Act
Each iteration of the while loop = one Observe-Think-Act cycle. The LLM observes token_history, thinks, and acts.
Note: We call it "output" intentionally. The real-world system is not Markov, so the output is richer than just the next "state" — it is anything observed during the execution: logs, signals, side effects, or even a reward. For instance, if the action is a compile command, the output could be the entire compiler log plus a success/fail signal.
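The pseudocode above can be made runnable with a stubbed LLM and a single tool. Everything here (the stub policy, the `ACTION:` format, the toy `count` tool) is illustrative, not a real model interface:

```python
# Runnable skeleton of the basic agent loop, with the LLM stubbed out.
def LLM(history):
    # Stand-in policy: once the file count appears in history, finish.
    return "ACTION: done" if "counted" in str(history) else "ACTION: count"

def parse(tokens):
    """Split generated tokens into (thoughts, action)."""
    return "thinking...", tokens.split("ACTION:")[1].strip()

def exec_action(action):
    """Toy environment: one tool that counts files."""
    return "counted 3 files" if action == "count" else ""

token_history = ["task: count the files"]
while True:                                   # "while task not completed"
    thoughts, action = parse(LLM(token_history))
    if action == "done":
        break
    output = exec_action(action)              # observe the environment
    token_history += [thoughts, action, output]
```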
4. Context Engineering
Look at the pseudocode again and pay attention to what the LLM actually sees. The LLM receives token_history, which grows with every iteration. Eventually it fills the context window. Per-token generation compute and KV cache memory grow linearly. Most of the history may be irrelevant.
The Problem
token_history grows with every iteration
Eventually fills the context window
Per-token generation compute grows linearly with context size
KV cache memory grows linearly too
Most of the history may be irrelevant
We need a way to control what goes into the LLM at each step — independently from what is stored in history. This is context engineering: separate the history from the LLM context, and prepare the right context for each iteration, including external information such as knowledge bases, additional prompts, or environment state.
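The history/context separation can be sketched as a `context_build` function. The selection policy below (task + external info + a recency window) is one assumed policy for illustration, not a prescribed one:

```python
# Sketch of separating stored history from the per-iteration LLM context.
# The windowing policy here is an assumption; real systems use richer rules.
def context_build(token_history, external_info, window=4):
    """Keep the task instruction, any external info, and the recent turns."""
    task, rest = token_history[0], token_history[1:]
    return [task] + external_info + rest[-window:]
```

At each iteration the agent calls `LLM(context_build(token_history, external_info))` instead of `LLM(token_history)`, so the stored history can grow without bound while the context stays controlled.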
What can context engineering do? Let's look at several use cases.
4.1 Swapping Tools
Tools are defined in the system prompt. Context engineering enables swapping the system prompt at any iteration — changing the available tool set dynamically.
4.2 Skills

Q: Why not put everything in one giant prompt?
A: Context window is finite. Skills enable selective loading — only what you need, when you need it.
Warning
Skills are also the easiest way to overfit an agent to a benchmark. Task-specific skills can inflate scores without improving general capability.
A skill is just a text file that gets loaded into context when relevant. Here is a real example from Claude Code:
# /commit -- a skill for creating git commits
When the user asks to commit changes:
1. Run git status and git diff to see all changes
2. Analyze the diff -- summarize the nature of the changes
   (new feature, bug fix, refactor, docs, etc.)
3. Draft a concise commit message (1-2 sentences)
   focusing on the "why" rather than the "what"
4. Stage relevant files (avoid secrets, .env, etc.)
5. Create the commit
# Available tools: Bash(git status), Bash(git diff),
# Bash(git add), Bash(git commit)
Key insight
A skill = prompt + tools + instructions. It is loaded into context only when the agent needs it. The agent does not see the /commit skill when it is debugging.
Continual learning
Since skills are just text, the agent can write new skills from experience, update existing ones, and even share them with other agents — a modern form of decentralized learning. Updating a skill = updating a prompt.
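Selective loading can be sketched as keyword-triggered skill retrieval. The trigger format and the skill texts below are assumptions for illustration:

```python
# Sketch of selective skill loading: skills declare trigger keywords,
# and only matching skills enter the context. Format is hypothetical.
SKILLS = {
    "commit": {"triggers": ["commit", "git"],    "text": "# /commit skill ..."},
    "deploy": {"triggers": ["deploy", "release"], "text": "# /deploy skill ..."},
}

def load_skills(user_message):
    """Return only the skill texts whose triggers match the message."""
    msg = user_message.lower()
    return [s["text"] for s in SKILLS.values()
            if any(t in msg for t in s["triggers"])]
```

A request like "please commit my changes" pulls in only the commit skill; the deploy skill never touches the context.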
4.3 Compaction
When the context window gets full, compress the token history.
Coding agent: output = [compiler_log, result]. Log is long; result is short ("compile successful"). Remove logs, keep results.
ML research agent: output = [training_curves, validation_loss]. Curves are long; loss is short. Keep the loss.
LLM summarization: LLM_summarizer(token_history) — use another LLM to compress.
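The "remove logs, keep results" strategy can be sketched in a few lines. The entry structure (`log`/`result` dicts) and the length threshold are assumptions:

```python
# Compaction sketch: once the history grows past a budget, drop long
# logs and keep only short results. Entry format is an assumption.
def compact(token_history, max_len=50):
    """Replace log-carrying entries with their short results."""
    if len(token_history) <= max_len:
        return token_history              # still under budget: keep as-is
    compacted = []
    for entry in token_history:
        if isinstance(entry, dict) and "log" in entry:
            compacted.append({"result": entry["result"]})  # drop the log
        else:
            compacted.append(entry)
    return compacted
```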
4.4 KV Cache Constraints
Context engineering changes the context between iterations. But if the prefix changes, the KV cache is invalidated and you pay full recompute cost. How do we modify the context while keeping the prefix stable?
Approach: Masking (Manus, 2025)
Provide all information in the system prompt from the start. "Mask" out irrelevant parts via logit masking instead of adding or removing. The prefix never changes, so the KV cache is always reused.
By keeping the entire system prompt fixed from the start, the KV cache for that prefix is computed once and reused on every iteration. The cost is a longer initial prompt, but the savings compound over many iterations.
Source: Manus Blog, "Context Engineering for AI Agents" (2025).
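A toy version of the masking idea: every tool stays in the fixed prefix, and per-step availability is enforced by pushing disallowed tool logits to negative infinity. The tool names and scores below are made up:

```python
# Toy logit-masking sketch: the prompt (and its KV cache) never changes;
# only the mask does. Tool names and logit values are illustrative.
import math

def mask_logits(logits, allowed):
    """logits: {tool_name: score}; tools not in `allowed` become unpickable."""
    return {t: (s if t in allowed else -math.inf) for t, s in logits.items()}

logits = {"git_commit": 1.2, "web_search": 0.7, "rm_rf": 2.0}
masked = mask_logits(logits, allowed={"git_commit", "web_search"})
best = max(masked, key=masked.get)  # -> "git_commit"
```

Note that `rm_rf` had the highest raw logit; the mask, not the prompt, is what removed it from consideration.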
4.5 Ephemeral Context
Another technique: append iteration-specific context after the stable prefix. This "ephemeral" context appears once and is not stored.
// Ephemeral context example:
while task not completed:
    context = token_history + log   // log appended after prefix (ephemeral!)
    generated_tokens = LLM(context)
    thoughts, action = parse(generated_tokens)
    log, result = exec(action)
    token_history += [thoughts, action, result]  // only keep results!
This technique is used in PUBG Ally, where rich situational information (enemies, health, zone) is critical now but not useful in future iterations.
PUBG Ally is an on-device LLM agent that talks, teams up, and fights alongside the player in a battle royale. It combines voice (STT + SLM + TTS) with strategy, combat, and coaching, all running on the player's GPU.
5. Multi-Agents and Sub-Agents
Here is an important insight: multi-agents are programmatic context isolation, much like object-oriented programming.
The Problem
If one agent does both coding and reviewing, the reviewer is biased by the coder's thoughts (confirmation bias). Solution: give each role a clean context.
// Each agent has its own context
code = LLM_Agent("code it")
review = LLM_Agent("review it", code)
Instead of bloating the main agent's context, spawn a sub-agent. Its context is discarded after use; the main agent only sees the summary.
// Main agent spawns a sub-agent
// instead of reading a huge file directly:
action: LLM_Agent("read input.txt and summarize it")
output: "input.txt is about local restaurants in Berkeley ..."
The sub-agent's context can bloat with the full file, but the main agent only sees the concise summary. This is context isolation in action.
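Context isolation can be sketched with a stub: the sub-agent builds its own private history, and only the summary escapes its scope. The stub body below stands in for a full agent loop:

```python
# Context-isolation sketch: the sub-agent's history is local and is
# garbage-collected when it returns. The agent body is a stub.
def LLM_Agent(instruction, context=""):
    token_history = [instruction, context]   # fresh, private context
    # ... a full observe-think-act loop would run here ...
    summary = f"summary of: {instruction}"   # stand-in result
    return summary                           # only the summary escapes

main_history = []
main_history.append(LLM_Agent("read input.txt and summarize it"))
# The main agent's history now holds one short line, not the whole file.
```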
6. Recursive LMs = Programmatic Agent Control
Zhang, Kraska, Khattab, "Recursive Language Models" (2025)
Here is a practical problem: suppose the LLM plans to process file000.txt through file099.txt. In practice, it might lose track and drop file078.txt. The solution: have the LLM write a program that orchestrates sub-agents, guaranteeing all files are processed.
// LLM writes a program that spawns sub-agents:
summary = run_program("""
for file in files:
result += LLM_Agent("summarize " + file)
return result
""")
This combines context engineering via sub-agents with programmatic control. It is especially effective for smaller models, which benefit from structured orchestration.
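The guarantee comes from the program, not the model: a `for` loop cannot skip `file078.txt`. A runnable sketch, with `LLM_Agent` stubbed:

```python
# Programmatic orchestration sketch: the loop, not the LLM's attention,
# guarantees every file is processed. LLM_Agent is a stub.
def LLM_Agent(instruction):
    return instruction.replace("summarize ", "summary of ")

files = [f"file{i:03d}.txt" for i in range(100)]      # file000 ... file099
results = [LLM_Agent("summarize " + f) for f in files]  # nothing dropped
```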
See also: Giannou et al., "Looped Transformers as Programmable Computers." Off-the-shelf Terminus-KIRA can outperform RLM with large models on some tasks.
· · ·
Checkpoint. So far we have covered: Context Engineering (controlling what the LLM sees), Skills (adaptive prompt optimization), Compaction (removing logs, summarizing, ephemeral context), Multi-/Sub-agents (OOP for context isolation), and Recursive LMs (programmatic sub-agent control). Now we turn to the question of when to stop.
7. The False Completion Problem
Recall our agent loop: while task not completed. How does the agent know it is done? Three ways: (1) a verifiable task with a checker — easy. (2) a fixed time or budget limit — also easy. (3) the LLM itself decides by generating a "done" action. This is the common case — and the problematic one.
When the LLM decides it is done, it is frequently wrong. We call this false completion.
A Real Example
On Terminal-Bench-2 (SWE/MLE-level tasks requiring deep expertise), a baseline agent (Terminus) with Claude Opus 4.6 submitted results 5 times within the time limit. In 5 out of 5 runs, the agent confidently submitted a wrong answer. Roughly 80% of failures were due to false completion.
7.1 The Ralph Loop
One idea: add an outer loop where a fresh agent checks if the work is actually done.
while True:  // outer loop -- a fresh agent each round
    token_history = task_instruction  // inner loop starts with clean context, same world state
    while True:  // inner loop
        context = context_build(token_history, external_info)
        generated_tokens = LLM(context)
        thoughts, action = parse(generated_tokens)
        if action == done: break
        output, answer_not_changed = exec(action)
        token_history += [thoughts, action, output]
    if answer_not_changed: break  // only exit if the next agent found nothing to change
The key: each inner loop starts with clean context but the same world state. It reduces confirmation bias — a fresh agent is not influenced by prior reasoning.
while True:
    context = context_build(token_history, external_info)
    generated_tokens = LLM(context)
    thoughts, action = parse(generated_tokens)
    if action == done:
        generated_tokens = LLM(token_history_wo_thoughts)  // check with just the work, no thoughts
        thoughts, action = parse(generated_tokens)
        if action == done: break  // truly done: an unbiased agent agrees
    output = exec(action)
    token_history += [thoughts, action, output]
    token_history_wo_thoughts += [action, output]  // track just the work
The verifier sees only what was done (actions + outputs), not how it was reasoned about (thoughts). This removes confirmation bias without needing a full outer loop restart.
8. AutoResearch: When False Completion Isn't a Problem
AutoResearch (Andrej Karpathy) went viral on Twitter. It applies the basic agent loop to autonomous ML research — no Ralph loop needed.
Why it works without Ralph
This is a progress-measurable task. The goal: train a model with lower validation loss. The loss either went down or it didn't — very hard to falsely claim completion.
The prompt is essentially: LOOP FOREVER. Look at git state. Tune train.py with an experimental idea. Commit. Run the experiment. Read results. If val_bpb improved, keep the commit. If not, git reset. Record results. Never stop.
NEVER STOP: Once the experiment loop has begun, do NOT pause to ask the human if you should continue. The human might be asleep, or gone from the computer and expects you to continue working indefinitely until manually stopped. As an example, a user might leave the agent running while they sleep. At ~5 minutes per experiment, that is ~12/hour, for a total of about 100 overnight. The user wakes up to experimental results, all completed autonomously.
Progress (val_bpb) is directly measurable — false completion is nearly impossible here.
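The keep-or-revert core of such a loop can be sketched in a few lines. Here `run_experiment` is a hypothetical stand-in for "edit train.py, train, read val_bpb", and the random metric merely simulates experiment outcomes:

```python
# Keep/revert sketch of a progress-measurable loop. run_experiment is a
# stand-in for the real train-and-measure step; values are simulated.
import random

random.seed(0)

def run_experiment():
    return random.uniform(0.9, 1.1)  # simulated val_bpb of one experiment

best_bpb = float("inf")
for _ in range(10):
    val_bpb = run_experiment()
    if val_bpb < best_bpb:
        best_bpb = val_bpb   # improvement: keep the commit
    # else: discard the change (git reset in the real harness)
```

Because `best_bpb` only moves down, "done" is never a judgment call; progress is read off a number.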
The same pattern applies to: AlphaEvolve (DeepMind), AdaEvolve (Cemri et al.).
9. Super-human Auto RL Agent
Similar to AutoResearch, but for RL engineering. This goes beyond hyperparameter tuning — the agent must also design rewards to avoid reward hacking.
We demonstrated this with a b-boying spider: the agent autonomously designs reward functions and trains RL policies, iterating until super-human performance is achieved.
Work by: Wooseong Chung, Taegwan Ha, Kangwook Lee, Jeong-Gwan Lee, Suyoung Lee, Taehwan Kwon, Yunhyeok Kwak.
10. Test-Time Scaling for Agents
Another orthogonal approach: test-time scaling — generate multiple candidates and pick the best. But this is largely unexplored for agents:
Challenges
Too expensive: running an entire agent loop multiple times costs much more than sampling multiple chain-of-thought traces
Hard to aggregate: agent outputs are not from a fixed set of choices, so majority voting doesn't directly apply
Majority voting may not help anyway: false completion rate is often > 50%, and all the false traces tend to look similar — the majority is wrong
The Naive Approach
Can the LLM itself figure out which trial was best?
for i in range(N):  // N = test-time scaling factor
    history[i] = LLM_agent(task_instruction)
best = LLM("find the most promising work" + task_instruction + history)
Result: accuracy improved, but mostly for tasks where P(success) > 50%. Why? We found the LLM was implicitly clustering the candidates and picking the majority. When P(success) > 50%, the majority is correct — so it helps. When P(success) < 50%, the majority is wrong — so it hurts.
Pairwise Comparisons (BTL)
To reduce majority bias, use pairwise comparisons — the LLM never sees all candidates at once.
for i in range(N):
    history[i] = LLM_agent(task_instruction)
for (i, j) in [N] x [N]:
    y[i,j] = LLM("which is more promising?" + history[i] + history[j])
s = BTL_solver(y)  // Bradley-Terry-Luce model
return argmax(s)
No majority bias
Pairwise comparisons never expose all candidates at once. No context sharing between comparisons.
More compute = less variance
Compare each pair multiple times. Test-time compute for comparisons too!
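The `BTL_solver` step can be sketched with the standard minorization-maximization update for Bradley-Terry-Luce strengths. The win matrix below is illustrative:

```python
# Sketch of a BTL solver via the standard MM iteration:
# p_i <- W_i / sum_j n_ij / (p_i + p_j), then normalize.
def btl_scores(wins, iters=100):
    """wins[i][j] = number of times candidate i beat candidate j."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])                       # total wins of i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)  # comparison counts
            new.append(w_i / denom if denom else p[i])
        total = sum(new)
        p = [x / total for x in new]                 # normalize strengths
    return p

# Toy example: candidate 0 wins most pairwise comparisons.
wins = [[0, 3, 3],
        [1, 0, 3],
        [1, 1, 0]]
scores = btl_scores(wins)  # argmax picks candidate 0
```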
Preliminary Results
We tested on three agent systems. BTL (Bradley-Terry-Luce) slightly outperforms simple score-based counting:
Agent               Baseline (single run)   BTL Best (pairwise)   Gain
Terminus-KIRA       76.2                    81.3                  +5.1
Terminus-2          62.9                    67.0                  +4.1
OpenSage GPT-5.3    78.4                    81.1                  +2.7
11. OpenClaw: Self-Evolving Agents via Memory
OpenClaw = the basic agent loop + Ralph loop + memory update. After each task, the agent updates its own prompts — enabling continual learning across tasks.
Creative memory
memory_update can be creative: summarize the session, define an IDENTITY that the agent updates over time, develop personality and character across tasks.
The full harness
The rest of the harness is clever too: managing multiple instructions, background tool execution, compaction + file system. But the key innovation is task-to-task continuity via memory.
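The memory-update step can be sketched as a text file that each task appends to and each new prompt is built from. The file name, entry format, and `build_prompt` helper are assumptions, not OpenClaw's actual layout:

```python
# Sketch of task-to-task memory as a plain text file. File name and
# format are hypothetical; the point is that memory = editable text.
from pathlib import Path

MEMORY = Path("memory.md")

def memory_update(session_summary):
    """After a task ends, append its summary to persistent memory."""
    prev = MEMORY.read_text() if MEMORY.exists() else ""
    MEMORY.write_text(prev + "\n- " + session_summary)

def build_prompt(task):
    """Prepend accumulated memory to the next task's prompt."""
    memory = MEMORY.read_text() if MEMORY.exists() else ""
    return memory + "\n\nTASK: " + task
```

Because the memory is just text, the same mechanism supports summaries, an evolving IDENTITY block, or skills written by the agent itself.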
12. Memory in Practice: inZOI and PUBG Ally
inZOI: We tried self-evolving prompts, but personalities polarized too easily. We shipped with user-defined personalities instead. 1M+ copies sold — the first game with on-device LLM agents.
PUBG Ally: Memory focused on friendship and past games. "Remember when we won that match?" Makes the ally feel like a real teammate over time.
13. Open Challenges
Proactivity — When should the agent talk to you? Act for you?
Fast reaction — System 1/System 2 architecture (see Figure Helix)
Distillation — LLM agents to SLM agents. Not trivial (off-policy, model sharing)
Multimodal — STT → LLM → TTS loses information. Multimodal should be the core model
Evaluation — Agent outputs are complex, non-deterministic, hard to grade
Planning — LLMs are bad at exploration/exploitation. External search helps (see TAPE, ReJump)
· · ·
Summary
The agent loop is simple: observe, think, act, repeat with tool calling via special tokens.
Context engineering is the key design space: skills, compaction, ephemeral context, KV cache constraints.
Multi-agents / sub-agents / recursive LMs = different forms of context isolation and programmatic control.
False completion is the #1 production issue. The Ralph loop, progress measurement, and test-time scaling (BTL) address it.
Memory enables task-to-task continuity and self-evolution (OpenClaw, inZOI, PUBG Ally).
Many open challenges: proactivity, fast reaction, distillation, multimodal, evaluation, planning.
We are early. Maybe the 1950s of communications. Go build (real engineering starts from real problems) and go clarify (the essence resembles control theory, communication theory, statistical inference).
Kangwook Lee · KRAFTON / Ludo Robotics · @kangwooklee