Everyone is building LLM agents. The products are real and shipping — Claude Code, Codex, AutoResearch, OpenClaw. But how do we actually build them? This monograph walks through the key ideas, many of which emerged from solving real problems in production systems: Terminus-KIRA (a terminal coding agent), PUBG Ally (a real-time cooperative game agent), and Smart Zoi (life simulation agents in inZOI).
Along the way, we will encounter context engineering, skills, compaction, multi-agents, recursive LMs, the Ralph loop, test-time scaling, and memory-driven self-evolution. These are not abstract concepts — they were critical in solving real problems.
An LLM agent is an LLM-based system that iteratively 1) observes, 2) thinks, and 3) acts to achieve a goal. The agent reads its environment, reasons about what to do, takes an action (typically a tool call), and then loops — repeating until done.
This is the core loop. Everything else in this monograph is a variation on, or improvement to, this fundamental cycle.
2. Tool Calling (No Loop)
The simplest building block is a single LLM call that produces thoughts and an action. No loop — just one shot.
One call, one action, one result. Useful for simple tasks: "search for X", "calculate Y".
See also: Berkeley Function Calling Leaderboard (Gorilla team); NexusRaven (Jiantao Jiao et al., NeurIPS '23).
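To make the one-shot pattern concrete, here is a minimal sketch in Python. The `THOUGHT:`/`ACTION:` output format, the `parse` and `exec_action` helpers, and the toy tool registry are all hypothetical, not any particular API:

```python
# Minimal one-shot tool-calling sketch. The "THOUGHT: ... ACTION: tool(arg)"
# format is an assumed convention for illustration; real APIs differ.
def parse(generated_tokens):
    """Split the model's output into thoughts and a single action string."""
    thoughts, _, action = generated_tokens.partition("ACTION:")
    return thoughts.replace("THOUGHT:", "").strip(), action.strip()

TOOLS = {
    "calculate": lambda expr: str(eval(expr)),       # toy calculator tool
    "search":    lambda q: f"top result for {q!r}",  # stubbed search tool
}

def exec_action(action):
    """Dispatch 'tool(arg)' to the registered tool."""
    name, _, rest = action.partition("(")
    return TOOLS[name](rest.rstrip(")"))

# One call, one action, one result -- no loop.
generated = "THOUGHT: need arithmetic ACTION: calculate(6*7)"  # stand-in for LLM(...)
thoughts, action = parse(generated)
print(exec_action(action))  # -> 42
```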
3. The Agent Loop (Basic Form)
Add a while loop: the agent keeps going until the task is complete. This is the core agent architecture.
// LLM Agent (Basic Form)
token_history = task_instruction
while task not completed:
    generated_tokens = LLM(token_history)
    thoughts, action = parse(generated_tokens)
    output = exec(action)
    token_history += [thoughts, action, output]
What changed
while loop: the agent keeps going
output: fed back for the next iteration
token_history: accumulates all past interactions
Observe-Think-Act
Each iteration of the while loop = one Observe-Think-Act cycle. The LLM observes token_history, thinks, and acts.
Note: We call it "output" intentionally. The real-world system is not Markov, so the output is richer than just the next "state" — it is anything observed during the execution: logs, signals, side effects, or even a reward. For instance, if the action is a compile command, the output could be the entire compiler log plus a success/fail signal.
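The pseudocode above can be made runnable with a stubbed LLM and a single tool. Everything here (the stub policy, the `ACTION:` format, the toy `count` tool) is illustrative, not a real model interface:

```python
# Runnable skeleton of the basic agent loop, with the LLM stubbed out.
def LLM(history):
    # Stand-in policy: once the file count appears in history, finish.
    return "ACTION: done" if "counted" in str(history) else "ACTION: count"

def parse(tokens):
    """Split generated tokens into (thoughts, action)."""
    return "thinking...", tokens.split("ACTION:")[1].strip()

def exec_action(action):
    """Toy environment: one tool that counts files."""
    return "counted 3 files" if action == "count" else ""

token_history = ["task: count the files"]
while True:                                   # "while task not completed"
    thoughts, action = parse(LLM(token_history))
    if action == "done":
        break
    output = exec_action(action)              # observe the environment
    token_history += [thoughts, action, output]
```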
4. Context Engineering
Look at the pseudocode again and pay attention to what the LLM actually sees. The LLM receives token_history, which grows with every iteration. Eventually it fills the context window. Per-token generation compute and KV cache memory grow linearly. Most of the history may be irrelevant.
The Problem
token_history grows with every iteration
Eventually fills the context window
Per-token generation compute grows linearly with context size
KV cache memory grows linearly too
Most of the history may be irrelevant
We need a way to control what goes into the LLM at each step — independently from what is stored in history. This is context engineering: separate the history from the LLM context, and prepare the right context for each iteration, including external information such as knowledge bases, additional prompts, or environment state.
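The history/context separation can be sketched as a `context_build` function. The selection policy below (task + external info + a recency window) is one assumed policy for illustration, not a prescribed one:

```python
# Sketch of separating stored history from the per-iteration LLM context.
# The windowing policy here is an assumption; real systems use richer rules.
def context_build(token_history, external_info, window=4):
    """Keep the task instruction, any external info, and the recent turns."""
    task, rest = token_history[0], token_history[1:]
    return [task] + external_info + rest[-window:]
```

At each iteration the agent calls `LLM(context_build(token_history, external_info))` instead of `LLM(token_history)`, so the stored history can grow without bound while the context stays controlled.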
What can context engineering do? Let's look at several use cases.
4.1 Swapping Tools
Tools are defined in the system prompt. Context engineering enables swapping the system prompt at any iteration — changing the available tool set dynamically.
4.2 Skills

Q: Why not put everything in one giant prompt?
A: Context window is finite. Skills enable selective loading — only what you need, when you need it.
Warning
Skills are also the easiest way to overfit an agent to a benchmark. Task-specific skills can inflate scores without improving general capability.
A skill is just a text file that gets loaded into context when relevant. Here is a real example from Claude Code:
# /commit -- a skill for creating git commits
When the user asks to commit changes:
1. Run git status and git diff to see all changes
2. Analyze the diff -- summarize the nature of the changes
   (new feature, bug fix, refactor, docs, etc.)
3. Draft a concise commit message (1-2 sentences)
   focusing on the "why" rather than the "what"
4. Stage relevant files (avoid secrets, .env, etc.)
5. Create the commit
# Available tools: Bash(git status), Bash(git diff),
# Bash(git add), Bash(git commit)
Key insight
A skill = prompt + tools + instructions. It is loaded into context only when the agent needs it. The agent does not see the /commit skill when it is debugging.
Continual learning
Since skills are just text, the agent can write new skills from experience, update existing ones, and even share them with other agents — a modern form of decentralized learning. Updating a skill = updating a prompt.
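Selective loading can be sketched as keyword-triggered skill retrieval. The trigger format and the skill texts below are assumptions for illustration:

```python
# Sketch of selective skill loading: skills declare trigger keywords,
# and only matching skills enter the context. Format is hypothetical.
SKILLS = {
    "commit": {"triggers": ["commit", "git"],    "text": "# /commit skill ..."},
    "deploy": {"triggers": ["deploy", "release"], "text": "# /deploy skill ..."},
}

def load_skills(user_message):
    """Return only the skill texts whose triggers match the message."""
    msg = user_message.lower()
    return [s["text"] for s in SKILLS.values()
            if any(t in msg for t in s["triggers"])]
```

A request like "please commit my changes" pulls in only the commit skill; the deploy skill never touches the context.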
4.3 Compaction
When the context window gets full, compress the token history.
Coding agent: output = [compiler_log, result]. Log is long; result is short ("compile successful"). Remove logs, keep results.
ML research agent: output = [training_curves, validation_loss]. Curves are long; loss is short. Keep the loss.
LLM summarization: LLM_summarizer(token_history) — use another LLM to compress.
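The "remove logs, keep results" strategy can be sketched in a few lines. The entry structure (`log`/`result` dicts) and the length threshold are assumptions:

```python
# Compaction sketch: once the history grows past a budget, drop long
# logs and keep only short results. Entry format is an assumption.
def compact(token_history, max_len=50):
    """Replace log-carrying entries with their short results."""
    if len(token_history) <= max_len:
        return token_history              # still under budget: keep as-is
    compacted = []
    for entry in token_history:
        if isinstance(entry, dict) and "log" in entry:
            compacted.append({"result": entry["result"]})  # drop the log
        else:
            compacted.append(entry)
    return compacted
```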
4.4 KV Cache Constraints
Context engineering changes the context between iterations. But if the prefix changes, the KV cache is invalidated and you pay full recompute cost. How do we modify the context while keeping the prefix stable?
Approach: Masking (Manus, 2025)
Provide all information in the system prompt from the start. "Mask" out irrelevant parts via logit masking instead of adding or removing. The prefix never changes, so the KV cache is always reused.
By keeping the entire system prompt fixed from the start, the KV cache for that prefix is computed once and reused on every iteration. The cost is a longer initial prompt, but the savings compound over many iterations.
Source: Manus Blog, "Context Engineering for AI Agents" (2025).
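A toy version of the masking idea: every tool stays in the fixed prefix, and per-step availability is enforced by pushing disallowed tool logits to negative infinity. The tool names and scores below are made up:

```python
# Toy logit-masking sketch: the prompt (and its KV cache) never changes;
# only the mask does. Tool names and logit values are illustrative.
import math

def mask_logits(logits, allowed):
    """logits: {tool_name: score}; tools not in `allowed` become unpickable."""
    return {t: (s if t in allowed else -math.inf) for t, s in logits.items()}

logits = {"git_commit": 1.2, "web_search": 0.7, "rm_rf": 2.0}
masked = mask_logits(logits, allowed={"git_commit", "web_search"})
best = max(masked, key=masked.get)  # -> "git_commit"
```

Note that `rm_rf` had the highest raw logit; the mask, not the prompt, is what removed it from consideration.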
4.5 Ephemeral Context
Another technique: append iteration-specific context after the stable prefix. This "ephemeral" context appears once and is not stored.
// Ephemeral context example:
while task not completed:
    context = token_history + log   // log appended after prefix (ephemeral!)
    generated_tokens = LLM(context)
    thoughts, action = parse(generated_tokens)
    log, result = exec(action)
    token_history += [thoughts, action, result]  // only keep results!
This technique is used in PUBG Ally, where rich situational information (enemies, health, zone) is critical now but not useful in future iterations.
PUBG Ally is an on-device LLM agent that talks, teams up, and fights alongside the player in a battle royale. It combines voice (STT + SLM + TTS) with strategy, combat, and coaching, all running on the player's GPU.
5. Multi-Agents and Sub-Agents
Here is an important insight: multi-agents are programmatic context isolation, much like object-oriented programming.
The Problem
If one agent does both coding and reviewing, the reviewer is biased by the coder's thoughts (confirmation bias). Solution: give each role a clean context.
// Each agent has its own context
code = LLM_Agent("code it")
review = LLM_Agent("review it", code)
Instead of bloating the main agent's context, spawn a sub-agent. Its context is discarded after use; the main agent only sees the summary.
// Main agent spawns a sub-agent
// instead of reading a huge file directly:
action: LLM_Agent("read input.txt and summarize it")
output: "input.txt is about local restaurants in Berkeley ..."
The sub-agent's context can bloat with the full file, but the main agent only sees the concise summary. This is context isolation in action.
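Context isolation can be sketched with a stub: the sub-agent builds its own private history, and only the summary escapes its scope. The stub body below stands in for a full agent loop:

```python
# Context-isolation sketch: the sub-agent's history is local and is
# garbage-collected when it returns. The agent body is a stub.
def LLM_Agent(instruction, context=""):
    token_history = [instruction, context]   # fresh, private context
    # ... a full observe-think-act loop would run here ...
    summary = f"summary of: {instruction}"   # stand-in result
    return summary                           # only the summary escapes

main_history = []
main_history.append(LLM_Agent("read input.txt and summarize it"))
# The main agent's history now holds one short line, not the whole file.
```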
6. Recursive LMs = Programmatic Agent Control
Zhang, Kraska, Khattab, "Recursive Language Models" (2025)
Here is a practical problem: suppose the LLM plans to process file000.txt through file099.txt. In practice, it might lose track and drop file078.txt. The solution: have the LLM write a program that orchestrates sub-agents, guaranteeing all files are processed.
// LLM writes a program that spawns sub-agents:
summary = run_program("""
for file in files:
result += LLM_Agent("summarize " + file)
return result
""")
This combines context engineering via sub-agents with programmatic control. It is especially effective for smaller models, which benefit from structured orchestration.
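The guarantee comes from the program, not the model: a `for` loop cannot skip `file078.txt`. A runnable sketch, with `LLM_Agent` stubbed:

```python
# Programmatic orchestration sketch: the loop, not the LLM's attention,
# guarantees every file is processed. LLM_Agent is a stub.
def LLM_Agent(instruction):
    return instruction.replace("summarize ", "summary of ")

files = [f"file{i:03d}.txt" for i in range(100)]      # file000 ... file099
results = [LLM_Agent("summarize " + f) for f in files]  # nothing dropped
```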
See also: Giannou et al., "Looped Transformers as Programmable Computers." Off-the-shelf Terminus-KIRA can outperform RLM with large models on some tasks.
· · ·
Checkpoint. So far we have covered: Context Engineering (controlling what the LLM sees), Skills (adaptive prompt optimization), Compaction (removing logs, summarizing, ephemeral context), Multi-/Sub-agents (OOP for context isolation), and Recursive LMs (programmatic sub-agent control). Now we turn to the question of when to stop.
7. The False Completion Problem
Recall our agent loop: while task not completed. How does the agent know it is done? Three ways: (1) a verifiable task with a checker — easy. (2) a fixed time or budget limit — also easy. (3) the LLM itself decides by generating a "done" action. This is the common case — and the problematic one.
When the LLM decides it is done, it is frequently wrong. We call this false completion.
A Real Example
On Terminal-Bench-2 (SWE/MLE-level tasks requiring deep expertise), a baseline agent (Terminus) with Claude Opus 4.6 submitted results 5 times within the time limit. In 5 out of 5 runs, the agent confidently submitted a wrong answer. Roughly 80% of failures were due to false completion.
7.1 The Ralph Loop
One idea: add an outer loop where a fresh agent checks if the work is actually done.
while True:  // outer loop -- a fresh agent each round
    token_history = task_instruction  // inner loop starts with clean context, same world state
    while True:  // inner loop
        context = context_build(token_history, external_info)
        generated_tokens = LLM(context)
        thoughts, action = parse(generated_tokens)
        if action == done: break
        output, answer_not_changed = exec(action)
        token_history += [thoughts, action, output]
    if answer_not_changed: break  // only exit if the next agent found nothing to change
The key: each inner loop starts with clean context but the same world state. It reduces confirmation bias — a fresh agent is not influenced by prior reasoning.
while True:
    context = context_build(token_history, external_info)
    generated_tokens = LLM(context)
    thoughts, action = parse(generated_tokens)
    if action == done:
        generated_tokens = LLM(token_history_wo_thoughts)  // check with just the work, no thoughts
        thoughts, action = parse(generated_tokens)
        if action == done: break  // truly done: an unbiased agent agrees
    output = exec(action)
    token_history += [thoughts, action, output]
    token_history_wo_thoughts += [action, output]  // track just the work
The verifier sees only what was done (actions + outputs), not how it was reasoned about (thoughts). This removes confirmation bias without needing a full outer loop restart.
8. AutoResearch: When False Completion Isn't a Problem
AutoResearch (Andrej Karpathy) went viral on Twitter. It applies the basic agent loop to autonomous ML research — no Ralph loop needed.
Why it works without Ralph
This is a progress-measurable task. The goal: train a model with lower validation loss. The loss either went down or it didn't — very hard to falsely claim completion.
The prompt is essentially: LOOP FOREVER. Look at git state. Tune train.py with an experimental idea. Commit. Run the experiment. Read results. If val_bpb improved, keep the commit. If not, git reset. Record results. Never stop.
NEVER STOP: Once the experiment loop has begun, do NOT pause to ask the human if you should continue. The human might be asleep, or gone from the computer and expects you to continue working indefinitely until manually stopped. As an example, a user might leave the agent running while they sleep. At ~5 minutes per experiment, that is ~12/hour, for a total of about 100 overnight. The user wakes up to experimental results, all completed autonomously.
Progress (val_bpb) is directly measurable — false completion is nearly impossible here.
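The keep-or-revert core of such a loop can be sketched in a few lines. Here `run_experiment` is a hypothetical stand-in for "edit train.py, train, read val_bpb", and the random metric merely simulates experiment outcomes:

```python
# Keep/revert sketch of a progress-measurable loop. run_experiment is a
# stand-in for the real train-and-measure step; values are simulated.
import random

random.seed(0)

def run_experiment():
    return random.uniform(0.9, 1.1)  # simulated val_bpb of one experiment

best_bpb = float("inf")
for _ in range(10):
    val_bpb = run_experiment()
    if val_bpb < best_bpb:
        best_bpb = val_bpb   # improvement: keep the commit
    # else: discard the change (git reset in the real harness)
```

Because `best_bpb` only moves down, "done" is never a judgment call; progress is read off a number.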
The same pattern applies to: AlphaEvolve (DeepMind), AdaEvolve (Cemri et al.).
9. Super-human Auto RL Agent
Similar to AutoResearch, but for RL engineering. This goes beyond hyperparameter tuning — the agent must also design rewards to avoid reward hacking.
We demonstrated this with a b-boying spider: the agent autonomously designs reward functions and trains RL policies, iterating until super-human performance is achieved.
Work by: Wooseong Chung, Taegwan Ha, Kangwook Lee, Jeong-Gwan Lee, Suyoung Lee, Taehwan Kwon, Yunhyeok Kwak.
10. Test-Time Scaling for Agents
Another orthogonal approach: test-time scaling — generate multiple candidates and pick the best. But this is largely unexplored for agents:
Challenges
Too expensive: running an entire agent loop multiple times costs much more than sampling multiple chain-of-thought traces
Hard to aggregate: agent outputs are not from a fixed set of choices, so majority voting doesn't directly apply
Majority voting may not help anyway: false completion rate is often > 50%, and all the false traces tend to look similar — the majority is wrong
The Naive Approach
Can the LLM itself figure out which trial was best?
for i in range(N):  // N = test-time scaling factor
    history[i] = LLM_agent(task_instruction)
best = LLM("find the most promising work" + task_instruction + history)
Result: accuracy improved, but mostly for tasks where P(success) > 50%. Why? We found the LLM was implicitly clustering the candidates and picking the majority. When P(success) > 50%, the majority is correct — so it helps. When P(success) < 50%, the majority is wrong — so it hurts.
Pairwise Comparisons (BTL)
To reduce majority bias, use pairwise comparisons — the LLM never sees all candidates at once.
for i in range(N):
    history[i] = LLM_agent(task_instruction)
for (i, j) in [N] x [N]:
    y[i,j] = LLM("which is more promising?" + history[i] + history[j])
s = BTL_solver(y)  // Bradley-Terry-Luce model
return argmax(s)
No majority bias
Pairwise comparisons never expose all candidates at once. No context sharing between comparisons.
More compute = less variance
Compare each pair multiple times. Test-time compute for comparisons too!
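The `BTL_solver` step can be sketched with the standard minorization-maximization update for Bradley-Terry-Luce strengths. The win matrix below is illustrative:

```python
# Sketch of a BTL solver via the standard MM iteration:
# p_i <- W_i / sum_j n_ij / (p_i + p_j), then normalize.
def btl_scores(wins, iters=100):
    """wins[i][j] = number of times candidate i beat candidate j."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])                       # total wins of i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)  # comparison counts
            new.append(w_i / denom if denom else p[i])
        total = sum(new)
        p = [x / total for x in new]                 # normalize strengths
    return p

# Toy example: candidate 0 wins most pairwise comparisons.
wins = [[0, 3, 3],
        [1, 0, 3],
        [1, 1, 0]]
scores = btl_scores(wins)  # argmax picks candidate 0
```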
Preliminary Results
We tested on three agent systems. BTL (Bradley-Terry-Luce) slightly outperforms simple score-based counting:
Agent               Baseline (single run)   BTL Best (pairwise)   Gain
Terminus-KIRA       76.2                    81.3                  +5.1
Terminus-2          62.9                    67.0                  +4.1
OpenSage GPT-5.3    78.4                    81.1                  +2.7
11. OpenClaw: Self-Evolving Agents via Memory
OpenClaw = the basic agent loop + Ralph loop + memory update. After each task, the agent updates its own prompts — enabling continual learning across tasks.
Creative memory
memory_update can be creative: summarize the session, define an IDENTITY that the agent updates over time, develop personality and character across tasks.
The full harness
The rest of the harness is clever too: managing multiple instructions, background tool execution, compaction + file system. But the key innovation is task-to-task continuity via memory.
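The memory-update step can be sketched as a text file that each task appends to and each new prompt is built from. The file name, entry format, and `build_prompt` helper are assumptions, not OpenClaw's actual layout:

```python
# Sketch of task-to-task memory as a plain text file. File name and
# format are hypothetical; the point is that memory = editable text.
from pathlib import Path

MEMORY = Path("memory.md")

def memory_update(session_summary):
    """After a task ends, append its summary to persistent memory."""
    prev = MEMORY.read_text() if MEMORY.exists() else ""
    MEMORY.write_text(prev + "\n- " + session_summary)

def build_prompt(task):
    """Prepend accumulated memory to the next task's prompt."""
    memory = MEMORY.read_text() if MEMORY.exists() else ""
    return memory + "\n\nTASK: " + task
```

Because the memory is just text, the same mechanism supports summaries, an evolving IDENTITY block, or skills written by the agent itself.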
12. Memory in Practice: inZOI and PUBG Ally
inZOI: We tried self-evolving prompts, but personalities polarized too easily. We shipped with user-defined personalities instead. 1M+ copies sold — the first game with on-device LLM agents.
PUBG Ally: Memory focused on friendship and past games. "Remember when we won that match?" Makes the ally feel like a real teammate over time.
13. Open Challenges
Proactivity — When should the agent talk to you? Act for you?
Fast reaction — System 1/System 2 architecture (see Figure Helix)
Distillation — LLM agents to SLM agents. Not trivial (off-policy, model sharing)
Multimodal — STT → LLM → TTS loses information. Multimodal should be the core model
Evaluation — Agent outputs are complex, non-deterministic, hard to grade
Planning — LLMs are bad at exploration/exploitation. External search helps (see TAPE, ReJump)
· · ·
Summary
The agent loop is simple: observe, think, act, repeat with tool calling via special tokens.
Context engineering is the key design space: skills, compaction, ephemeral context, KV cache constraints.
Multi-agents / sub-agents / recursive LMs = different forms of context isolation and programmatic control.
False completion is the #1 production issue. The Ralph loop, progress measurement, and test-time scaling (BTL) address it.
Memory enables task-to-task continuity and self-evolution (OpenClaw, inZOI, PUBG Ally).
Many open challenges: proactivity, fast reaction, distillation, multimodal, evaluation, planning.
We are early. Maybe the 1950s of communications. Go build (real engineering starts from real problems) and go clarify (the essence resembles control theory, communication theory, statistical inference).
Kangwook Lee · KRAFTON / Ludo Robotics · @kangwooklee