Toward More Efficient
and Useful LLM Agents

Kangwook Lee

Chief AI Officer, KRAFTON  /  CTO, Ludo Robotics


What is an LLM Agent?

An LLM-based system that iteratively
1) observes, 2) thinks, and 3) acts to achieve a goal.

Observe (read environment) → Think (reason & decide) → Act (call tool / take action) → repeat until done

LLM Agents Are Very Hot ... And They Seem Useful!

Everyone is building and using LLM agents. The products are real, and they are shipping.

Claude Code
Anthropic's coding agent
Codex
OpenAI's coding agent
AutoResearch
Autonomous AI research
OpenClaw
Open-source agent framework
These are not demos. These are production systems used by millions. So the natural question is ...

So How Do We Build Agents?

There are so many new ideas coming every day.

Context Engineering
Skills
Recursive LMs
Context Compaction
Multi-/Sub-agents
Ralph Loop
AutoResearch
OpenClaw

What do they actually mean? Do they actually matter?

I will explain how these ideas -- or something spiritually similar -- were critical in solving real problems in our products.
Terminus-KIRA
Terminal coding agent
PUBG Ally
Real-time cooperative agent
Smart Zoi
Life simulation agents

Tool Calling (No Loop)

The simplest building block: a single LLM call that produces thoughts and an action. No loop -- just one shot.

// Tool Calling: a single function call, no loop

context = task_instruction

generated_tokens = LLM(context)
thoughts, action = parse(generated_tokens)
exec(action)
One call, one action, one result. Useful for simple tasks: "search for X", "calculate Y".

See also: Berkeley Function Calling Leaderboard (Gorilla team); NexusRaven (Jiantao Jiao et al., NeurIPS '23)
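The one-shot pattern above can be sketched in runnable form. Everything here is a toy stand-in: `llm_call` returns a canned JSON response instead of querying a real model, and `TOOLS` is a hypothetical tool registry.

```python
import json

def llm_call(context):
    """Stand-in for a real LLM API call; returns a canned structured response."""
    return json.dumps({
        "thoughts": "The user wants a web search.",
        "action": {"tool": "search", "args": {"query": "X"}},
    })

def parse(generated_tokens):
    """Split model output into thoughts and an action."""
    msg = json.loads(generated_tokens)
    return msg["thoughts"], msg["action"]

# Hypothetical tool registry: tool name -> callable.
TOOLS = {"search": lambda args: "results for " + args["query"]}

def run_once(task_instruction):
    # One call, one action, one result -- no loop.
    thoughts, action = parse(llm_call(task_instruction))
    return TOOLS[action["tool"]](action["args"])
```

The structure mirrors the slide exactly: call, parse, execute, stop.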

The Agent Loop (Basic Form)

Add a while loop: the agent keeps going until the task is complete. This is the core agent architecture.

// LLM Agent (Basic Form)

token_history = task_instruction

while task not completed:
  generated_tokens = LLM(token_history)
  thoughts, action = parse(generated_tokens)
  output = exec(action)
  token_history += [thoughts, action, output]
What changed
  • while loop: the agent keeps going
  • output: fed back for the next iteration
  • token_history: accumulates all past interactions
Observe-Think-Act
Each iteration of the while loop = one Observe-Think-Act cycle. The LLM observes token_history, thinks, and acts.
Note: We call it "output" intentionally. The real-world system is not Markov, so the output is richer than just the next "state" — it is anything observed during the execution: logs, signals, side effects, or even a reward. For instance, if the action is a compile command, the output could be the entire compiler log plus a success/fail signal.
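A minimal runnable sketch of this loop, with a scripted stand-in LLM (`make_scripted_llm` and `run_agent` are illustrative names, not a real API):

```python
def make_scripted_llm(script):
    """Stand-in LLM that replays a fixed list of (thoughts, action) pairs."""
    it = iter(script)
    def llm(token_history):
        return next(it)
    return llm

def run_agent(llm, exec_action, task_instruction, max_iters=10):
    token_history = [task_instruction]
    for _ in range(max_iters):                       # "while task not completed"
        thoughts, action = llm(token_history)        # observe + think
        if action == "done":
            break
        output = exec_action(action)                 # act
        token_history += [thoughts, action, output]  # feed the output back
    return token_history
```

Each pass through the loop is one Observe-Think-Act cycle, and the history keeps growing -- which is exactly the problem the next slide raises.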

The First Design Problem

Look at the same pseudocode again -- but pay attention to what the LLM actually sees.

// LLM Agent (Basic Form) -- same as before

token_history = task_instruction

while task not completed:
  generated_tokens = LLM(token_history)  // <- the LLM sees the ENTIRE history!
  thoughts, action = parse(generated_tokens)
  output = exec(action)
  token_history += [thoughts, action, output]
Do we really want the LLM to see all past history? Can we inject something? Should we take something off?
The problem
  • token_history grows with every iteration
  • Eventually fills the context window
  • Per-token generation compute grows linearly with context size
  • KV cache memory grows linearly too
  • Most of the history may be irrelevant
The motivation
We need a way to control what goes into the LLM at each step -- independently from what is stored in history. This is context engineering.

Context Engineering

Separate the history from the LLM context, and prepare the right context for each iteration — including external_info such as knowledge bases, additional prompts, or environment state. This is context engineering.

token_history = task_instruction
while task not completed:
  context = context_build(token_history, external_info)
  generated_tokens = LLM(context)
  thoughts, action = parse(generated_tokens)
  output = exec(action)
  token_history += [thoughts, action, output]

What can context engineering do? Let's look at three use cases.

Context Engineering (1/3): Swapping Tools

Tools are defined in the system prompt. Context engineering enables swapping the system prompt at any iteration -- changing the available tool set dynamically.

context_build(token_history, external_info):
  tools = select_tools(current_task_phase)
  return system_prompt(tools) + token_history
Example: a coding agent starts with file-browsing tools, then switches to editing tools, then to testing tools.
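A toy sketch of phase-based tool swapping -- `PHASE_TOOLS` and all the tool names are hypothetical:

```python
# Hypothetical phase -> tool-set table; all names are illustrative.
PHASE_TOOLS = {
    "explore": ["ls", "grep", "read_file"],
    "edit":    ["read_file", "write_file"],
    "test":    ["run_tests", "read_file"],
}

def select_tools(phase):
    return PHASE_TOOLS[phase]

def system_prompt(tools):
    return "You may use these tools: " + ", ".join(tools) + "\n"

def context_build(token_history, phase):
    # Rebuild the system prompt each iteration to swap the tool set.
    return system_prompt(select_tools(phase)) + "\n".join(token_history)
```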

Context Engineering (2/3): Skills

Inject task-specific prompts adaptively via skills -- structured (prompt, tool-set, instruction) bundles loaded on demand.

context_build(token_history, external_info):
  relevant_skills = find_relevant_skill(token_history, external_info)
  return token_history + relevant_skills // * pre-generation loading for simplicity
Q: Why not put everything in one giant prompt?
A: Context window is finite. Skills enable selective loading -- only what you need, when you need it.
Warning: Skills are also the easiest way to overfit an agent to a benchmark. Task-specific skills can inflate scores without improving general capability. I will discuss a potential fairness issue on Terminal Bench later.

What Does a Skill Actually Look Like?

A skill is just a text file that gets loaded into context when relevant. Here is a real example from Claude Code:

# /commit -- a skill for creating git commits
When the user asks to commit changes:
1. Run git status and git diff to see all changes
2. Analyze the diff -- summarize the nature of the changes
   (new feature, bug fix, refactor, docs, etc.)
3. Draft a concise commit message (1-2 sentences)
   focusing on the "why" rather than the "what"
4. Stage relevant files (avoid secrets, .env, etc.)
5. Create the commit
# Available tools: Bash(git status), Bash(git diff),
# Bash(git add), Bash(git commit)
A skill = prompt + tools + instructions. It is loaded into context only when the agent needs it. The agent does not see the /commit skill when it is debugging.
Skills enable continual learning. Since skills are just text, the agent can write new skills from experience, update existing ones, and even share them with other agents — a modern form of decentralized learning. Updating a skill = updating a prompt.
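A minimal sketch of on-demand skill loading. The `SKILLS` table and keyword matching are illustrative only; real systems may route with embeddings or an LLM.

```python
# Hypothetical skill store: name -> (trigger keywords, skill text).
SKILLS = {
    "/commit": ({"commit", "git"},
                "# /commit -- steps for creating a git commit ..."),
    "/deploy": ({"deploy", "release"},
                "# /deploy -- steps for deploying ..."),
}

def find_relevant_skills(token_history):
    """Naive keyword match over the history; only triggered skills load."""
    text = " ".join(token_history).lower()
    return [body for keywords, body in SKILLS.values()
            if any(k in text for k in keywords)]

def context_build(token_history):
    # Selective loading: only what you need, when you need it.
    return token_history + find_relevant_skills(token_history)
```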

Context Engineering (3/3): Compaction

When the context window gets full, compress the token history.

context_build(token_history, external_info):
  return compaction(token_history)
Compaction strategies depend on the application:
  • Coding agent: output = [compiler_log, result]. Log is long; result is short ("compile successful"). Remove logs, keep results.
  • ML research agent: output = [training_curves, validation_loss]. Curves are long; loss is short. Keep the loss.
  • LLM summarization: LLM_summarizer(token_history) -- use another LLM to compress.
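A toy compaction heuristic in the first style (drop long logs, keep short results); the length threshold is arbitrary:

```python
def compaction(token_history, max_len=200):
    """Replace long entries (logs) with a short placeholder; keep short
    entries (results) verbatim. A heuristic sketch, not a real policy."""
    compacted = []
    for entry in token_history:
        if len(entry) > max_len:
            compacted.append("[elided " + str(len(entry)) + " chars of output]")
        else:
            compacted.append(entry)
    return compacted
```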

See my recent analysis of how Codex does compaction: @Kangwook_Lee

Context Engineering: KV Cache Constraint

Context engineering changes the context between iterations. But if the prefix changes, the KV cache is invalidated and you pay full recompute cost. How do we modify the context while keeping the prefix stable?

Approach 1: Masking (Manus, 2025)
Provide all information in the system prompt from the start. "Mask" out irrelevant parts via logit masking instead of adding/removing. The prefix never changes, so the KV cache is always reused. [Manus blog]
Key insight: By keeping the entire system prompt fixed from the start, the KV cache for that prefix is computed once and reused on every iteration. The cost is a longer initial prompt, but the savings compound over many iterations.
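A toy sketch of the masking idea at the tool-selection level: the full tool list never leaves the (cached) prompt, and phase restrictions are enforced by masking logits at decode time. Real implementations mask token logits inside the decoder; this dictionary version is only illustrative.

```python
import math

def masked_choose(tool_logits, allowed):
    """Pick the highest-scoring tool among those currently allowed by
    setting disallowed tools' logits to -inf -- mask, don't remove."""
    masked = {t: (l if t in allowed else -math.inf)
              for t, l in tool_logits.items()}
    return max(masked, key=masked.get)
```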

KV Cache: Design Around the Cache


Source: Manus Blog: Context Engineering for AI Agents (2025)

KV Cache: Mask, Don't Remove


Source: Manus Blog: Context Engineering for AI Agents (2025)

PUBG Ally: An LLM Agent That Plays With You

An on-device LLM agent that talks to you, teams up with you, and fights alongside you in a battle royale game. All running on the player's GPU.

Voice: STT + SLM + TTS
Strategy + Combat + Coaching
Fully On-Device

Context Engineering: Ephemeral Context

Another approach: append iteration-specific context AFTER the stable prefix. It appears just once and is not stored in history.

// Ephemeral context example:
while task not completed:
  log = get_situation()  // fresh per-step info: enemies nearby, health, zone, ...
  context = token_history + [log]  // log appended after the stable prefix (ephemeral!)
  generated_tokens = LLM(context)
  thoughts, action = parse(generated_tokens)
  result = exec(action)
  token_history += [thoughts, action, result]  // only keep results; the log is dropped
This is the technique we used in PUBG Ally -- each step has rich situational info (enemies nearby, health, zone) that is critical right now but not useful in future steps.
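A runnable toy of the ephemeral pattern -- `get_situation`, `run_with_ephemeral`, and the scripted LLM are all illustrative stand-ins:

```python
def run_with_ephemeral(llm, exec_action, get_situation, task, max_iters=5):
    token_history = [task]
    for _ in range(max_iters):
        log = get_situation()              # fresh situational info each step
        context = token_history + [log]    # appended AFTER the stable prefix
        thoughts, action = llm(context)
        if action == "done":
            break
        result = exec_action(action)
        token_history += [thoughts, action, result]  # the log is NOT stored
    return token_history
```

The log reaches the model exactly once; the accumulated history stays small and the cached prefix stays stable.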

Multi-Agents = Context Isolation

Key insight: multi-agents are just programmatic context isolation. Like OOP -- each "object" has its own context.

Problem: One agent does coding and reviewing. The reviewer is biased by the coder's thoughts — confirmation bias!
Solution: Give each role its own clean context.
// Each agent has its own context
code = LLM_Agent("code it")
review = LLM_Agent("review it", code)

// Or iteratively:
while True:
  code = LLM_Agent("code it", review)
  review = LLM_Agent("review it", code)

Sub-Agents = Clean Slate Tools

Instead of bloating the main context, spawn a sub-agent. Its context is discarded after use -- the main agent stays clean.

// Main agent spawns a sub-agent
// instead of reading a huge file directly:

action: LLM_Agent("read input.txt and summarize it")
output: "input.txt is about local restaurants in Berkeley ..."
The sub-agent's context can get bloated with the full file content, but the main agent only sees the concise summary. This is the same idea as spawning a subprocess in programming.
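A sketch of the clean-slate idea: the sub-agent's `local_history` can grow arbitrarily large, but only its final answer returns to the caller. `sub_agent` and `summarize_file` are illustrative names.

```python
def sub_agent(llm, subtask):
    """Spawn a fresh agent with an isolated context; only its answer escapes."""
    local_history = [subtask]    # clean slate -- nothing from the parent leaks in
    answer = llm(local_history)  # the sub-agent may bloat this freely
    return answer                # local_history is discarded on return

def summarize_file(llm, path, read_file):
    # The main agent's context receives one line, not the whole file.
    content = read_file(path)
    return sub_agent(llm, "summarize:\n" + content)
```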

Recursive LMs = Programmatic Agent Control

Source: Zhang, Kraska, Khattab, "Recursive Language Models" (2025)

Let's say the LLM planned to process file000.txt through file099.txt. But it might drop file078.txt -- LLMs lose track over long sequences.

Solution: Instead of executing steps one by one, write a program that orchestrates sub-agents. The program guarantees all files are processed.
// The LLM writes a program that spawns sub-agents:

summary = run_program("""
  for file in files:
    result += LLM_Agent("summarize " + file)
  return result
""")
Two ideas combined: (1) context engineering via sub-agents, (2) generic programmatic control. Great for small models!

See also: Giannou et al., "Looped Transformers as Programmable Computers"
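A toy version of the orchestrating program. `llm_agent` is a stand-in for a full sub-agent; the point is that the Python loop, not the model, enumerates the files.

```python
def llm_agent(subtask):
    """Stand-in sub-agent; a real one would run a full agent loop."""
    return "[summary of " + subtask + "]"

def run_program():
    # The PROGRAM, not the LLM, guarantees coverage: every file is
    # visited exactly once, so file078.txt cannot be dropped.
    files = ["file%03d.txt" % i for i in range(100)]
    return [llm_agent("summarize " + f) for f in files]
```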

Oolong Benchmark Performance

                         RLM     Terminus-KIRA
Qwen3 Coder-480B-A35B    48.0    52.0
Qwen3-8B                 24.0    16.0

Off-the-shelf Terminus-KIRA can outperform RLM with large models on some tasks.

Checkpoint

Let's check where we are. We have covered quite a few ideas so far:

Context Engineering
Skills
Recursive LMs
Context Compaction
Multi-/Sub-agents
Context Engineering = controlling what the LLM sees
Skills = adaptive prompt optimization via context engineering
Compaction = removing logs, summarizing history, ephemeral context
Multi-/Sub-agents = OOP for context isolation
Recursive LMs = programmatic sub-agent control
Ralph Loop
AutoResearch
OpenClaw

When is the Task Complete?

Let's recall our agent architecture. Something we handwaved: while task not completed -- how do we know?

token_history = task_instruction
while True:
  context = context_build(token_history, external_info)
  generated_tokens = LLM(context)
  thoughts, action = parse(generated_tokens)
  if action == done: break
  output = exec(action)
  token_history += [thoughts, action, output]
Three ways the agent can stop: (1) a verifiable task with a checker -- easy. (2) a fixed time/budget limit -- also easy. (3) the LLM itself decides by generating a "done" action (or EOS token). This is the common case -- and the problematic one.

The False Completion Problem

When the LLM decides it is done, it is frequently wrong. We call this false completion.

A Real Example
On Terminal-Bench-2 (SWE/MLE-level tasks requiring deep expertise): a baseline agent (Terminus) with Claude Opus 4.6 submitted results 5 times within the time limit. In all 5 cases, the agent confidently submitted a wrong answer. Roughly ~80% of failures were due to false completion.

The Ralph Loop

One idea: add an outer loop where a fresh agent checks if the work is actually done.

Ralph
while True:  // outer loop
  answer_not_changed = True
  token_history = []

  while True:  // inner loop -- clean context, same world state
    context = context_build(token_history, external_info)
    generated_tokens = LLM(context)
    thoughts, action = parse(generated_tokens)
    if action == done: break
    output, answer_not_changed = exec(action)
    token_history += [thoughts, action, output]

  if answer_not_changed: break  // only exit if the next agent found nothing to change

The key: each inner loop starts with clean context but the same world state. It reduces confirmation bias -- a fresh agent is not influenced by prior reasoning. github.com/snarktank/ralph

Our Approach: Terminus-KIRA

In building Terminus-KIRA, we tested a variant that is both effective and efficient:

token_history = task_instruction
token_history_wo_thoughts = []

while True:
  context = context_build(token_history, external_info)
  generated_tokens = LLM(context)
  thoughts, action = parse(generated_tokens)
  if action == done:
    generated_tokens = LLM(token_history_wo_thoughts)  // check with just the work, no thoughts
    thoughts, action = parse(generated_tokens)
    if action == done: break  // truly done: an unbiased agent agrees
  output = exec(action)
  token_history += [thoughts, action, output]
  token_history_wo_thoughts += [action, output]  // track just the work
The verifier sees only what was done (actions + outputs), not how it was reasoned about (thoughts). This removes confirmation bias without needing a full outer loop restart.

AutoResearch: When False Completion Isn't a Problem

AutoResearch (Andrej Karpathy) went viral on Twitter. It applies the basic agent loop to autonomous ML research -- no Ralph loop needed.

Why does it work without Ralph? This is a progress-measurable task. The goal: train a model with lower validation loss. The loss either went down or it didn't -- very hard to falsely claim completion.

The same pattern applies to many research tasks: AlphaEvolve (DeepMind), AdaEvolve (Cemri et al.)

AutoResearch

The AutoResearch Prompt

LOOP FOREVER:

Look at the git state: the current branch/commit we're on
Tune train.py with an experimental idea by directly hacking the code.
git commit
Run the experiment: uv run train.py > run.log 2>&1 (redirect everything -- do NOT use tee or let output flood your context)
Read out the results: grep "^val_bpb:\|^peak_vram_mb:" run.log
If the grep output is empty, the run crashed. Run tail -n 50 run.log to read the Python stack trace and attempt a fix. If you can't get things to work after more than a few attempts, give up.
Record the results in the tsv (NOTE: do not commit the results.tsv file, leave it untracked by git)
If val_bpb improved (lower), you "advance" the branch, keeping the git commit.
If val_bpb is equal or worse, you git reset back to where you started.
The idea is that you are a completely autonomous researcher trying things out. If they work, keep. If they don't, discard. And you're advancing the branch so that you can iterate. If you feel like you're getting stuck in some way, you can rewind but you should probably do this very very sparingly (if ever).

Timeout: Each experiment should take ~5 minutes total (+ a few seconds for startup and eval overhead). If a run exceeds 10 minutes, kill it and treat it as a failure (discard and revert).

Crashes: If a run crashes (OOM, or a bug, or etc.), use your judgment: If it's something dumb and easy to fix (e.g. a typo, a missing import), fix it and re-run. If the idea itself is fundamentally broken, just skip it, log "crash" as the status in the tsv, and move on.

NEVER STOP: Once the experiment loop has begun (after the initial setup), do NOT pause to ask the human if you should continue. Do NOT ask "should I keep going?" or "is this a good stopping point?". The human might be asleep, or gone from a computer and expects you to continue working indefinitely until you are manually stopped. You are autonomous. If you run out of ideas, think harder -- read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes. The loop runs until the human interrupts you, period.

As an example use case, a user might leave you running while they sleep. If each experiment takes you ~5 minutes then you can run approx 12/hour, for a total of about 100 over the duration of the average human sleep. The user then wakes up to experimental results, all completed by you while they slept!

Progress (val_bpb) is directly measurable -- false completion is nearly impossible here. github.com/karpathy/autoresearch
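A small sketch of the measurable-progress check at the heart of the prompt: parse the two tracked metrics out of run.log and advance only on strict improvement. `read_metrics` and `advance_branch` are my illustrative names, not part of the actual repo.

```python
import re

def read_metrics(log_text):
    # Mirrors: grep "^val_bpb:\|^peak_vram_mb:" run.log
    # Returns None when no metrics were printed (i.e., the run crashed).
    metrics = {}
    for line in log_text.splitlines():
        m = re.match(r"^(val_bpb|peak_vram_mb):\s*([0-9.]+)$", line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics or None

def advance_branch(new, best):
    """Keep the commit only on strict improvement (lower val_bpb)."""
    return new is not None and new["val_bpb"] < best["val_bpb"]
```

Because the signal is a number in a log, a crashed or no-op run can never be mistaken for progress.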

Super-human Auto RL Agent

Similar to AutoResearch, but for RL engineering. More than just hyperparameter tuning — the agent must also design rewards to avoid reward hacking. Example: b-boying spider.

Work by: Wooseong Chung, Taegwan Ha, Kangwook Lee, Jeong-Gwan Lee, Suyoung Lee, Taehwan Kwon, Yunhyeok Kwak (alphabetical)

Test-Time Scaling for Agents

Another orthogonal approach: test-time scaling -- generate multiple candidates and pick the best.

Test-time scaling (e.g., generate many candidates, pick the best) is one approach -- but it's largely unexplored for agents:
  • Too expensive: running an entire agent loop multiple times costs much more than sampling multiple CoTs
  • Hard to aggregate: agent outputs are not from a fixed set of choices, so majority voting doesn't directly apply
  • Majority voting may not help anyway: false completion rate is often > 50%, and all the false traces tend to look similar -- the majority is wrong
Test-time scaling for agents is an important open problem.

Test-Time Scaling: Naive Approach

Can the LLM agent itself figure out which trial was best?

for i in range(N):  // N = test-time scaling factor
  history[i] = LLM_agent(task_instruction)

best = LLM("find the most promising work" + task_instruction + history)
Result: accuracy improved, but mostly for tasks with P(success) > 50%. Guess why?
We found the LLM was implicitly clustering the candidates and picking the majority! It was told to pick the best, but it was doing majority voting. When P(success) > 50%, the majority is correct -- so it helps. When P(success) < 50%, the majority is wrong -- so it hurts.

Test-Time Scaling: Pairwise Comparisons

To reduce majority bias, use pairwise comparisons -- the LLM never sees all candidates at once.

for i in range(N):
  history[i] = LLM_agent(task_instruction)

for (i, j) in [N] x [N]:
  y[i,j] = LLM("which is more promising?" + history[i] + history[j])

s = BTL_solver(y)  // Bradley-Terry-Luce model
return argmax(s)
No majority bias -- pairwise comparisons never expose all candidates at once. No context sharing between comparisons.
More compute = less variance -- compare each pair multiple times. Test-time compute for comparisons too!
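A minimal BTL fitter via the standard minorize-maximize iteration, assuming pairwise outcomes have been collected into a win-count matrix (`btl_solver` and `best_candidate` are illustrative names):

```python
def btl_solver(wins, iters=200):
    """Fit Bradley-Terry-Luce scores from a pairwise win-count matrix
    via the MM iteration. wins[i][j] = times candidate i beat j."""
    n = len(wins)
    s = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (s[i] + s[j])
                        for j in range(n) if j != i)
            new.append(w_i / denom if denom > 0 else s[i])
        total = sum(new)
        s = [x / total for x in new]   # normalize for identifiability
    return s

def best_candidate(wins):
    s = btl_solver(wins)
    return max(range(len(s)), key=lambda i: s[i])
```

Comparing each pair multiple times just adds counts to `wins`, which is how extra comparison compute reduces variance.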

Preliminary Results: Test-Time Scaling for Agents

                    Baseline (single run)   BTL Best (pairwise)
Terminus-KIRA              76.2                   81.3
Terminus-2                 62.9                   67.0
OpenSage GPT-5.3           78.4                   81.1

BTL (Bradley-Terry-Luce) slightly outperforms simple score-based counting.

OpenClaw: Self-Evolving Agents via Memory

OpenClaw = the basic agent loop + Ralph loop + memory update. After each task, the agent updates its own prompts -- enabling continual learning across tasks.

token_history = system_prompt + memory + task_instruction
while True:
  context = context_build(token_history, external_info)
  generated_tokens = LLM(context)
  thoughts, action = parse(generated_tokens)
  if action == done:
    memory = memory_update(token_history)  // self-evolve!
    break
  output = exec(action)
  token_history += [thoughts, action, output]
memory_update can be creative: summarize the session, define an IDENTITY that the agent updates over time, develop personality/character across tasks.
The rest of the harness is clever too: managing multiple instructions, background tool execution, compaction + file system. But the key innovation is task-to-task continuity via memory.
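A toy sketch of task-to-task memory: persist a distilled lesson after each task and reload it before the next. All names here are illustrative; `summarize` stands in for an LLM summarization call.

```python
import json
import os
import tempfile

def load_memory(path):
    """Memory that survives across tasks lives outside the context window."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return []

def memory_update(path, token_history, summarize):
    """After a task, distill the session into on-disk memory so the
    NEXT task's system prompt can include it."""
    memory = load_memory(path)
    memory.append(summarize(token_history))
    with open(path, "w") as f:
        json.dump(memory, f)
    return memory
```

Since memory is just text, updating it is updating a prompt -- the same continual-learning story as skills.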

Memory in Practice: inZOI and PUBG Ally

Lots of Interesting Challenges Left

Summary

Thank You

Kangwook Lee  ·  KRAFTON / Ludo Robotics  ·  @kangwooklee  ·  Questions?