Add a while loop: the agent keeps going until the task is complete. This is the core agent architecture.
// LLM Agent (Basic Form)
token_history = task_instruction
while task not completed:
    generated_tokens = LLM(token_history)
    thoughts, action = parse(generated_tokens)
    output = exec(action)
    token_history += [thoughts, action, output]
What changed
while loop: the agent keeps going
output: fed back for the next iteration
token_history: accumulates all past interactions
Observe-Think-Act
Each iteration of the while loop = one Observe-Think-Act cycle. The LLM observes token_history, thinks, and acts.
Note: We call it "output" intentionally. The real-world system is not Markovian, so the output is richer than just the next "state" — it is anything observed during the execution: logs, signals, side effects, or even a reward. For instance, if the action is a compile command, the output could be the entire compiler log plus a success/fail signal.
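A minimal runnable sketch of this loop, with `llm`, `parse`, and `exec_action` as toy stand-ins for illustration (not a real model API):

```python
def llm(history):
    """Stub model: a real call would send the token history to an LLM."""
    if any(kind == "output" for kind, _ in history):
        return "THOUGHT: tests pass ACTION: done"
    return "THOUGHT: need to run tests ACTION: run_tests"

def parse(generated):
    """Split generated text into (thoughts, action)."""
    thoughts, action = generated.split(" ACTION: ")
    return thoughts.removeprefix("THOUGHT: "), action

def exec_action(action):
    """Stub executor: anything observed during execution is the output."""
    return "3 tests passed" if action == "run_tests" else ""

history = [("task", "make the tests pass")]
while True:  # "task not completed" in the pseudocode
    thoughts, action = parse(llm(history))
    if action == "done":
        break
    output = exec_action(action)
    history += [("thought", thoughts), ("action", action), ("output", output)]
```

Each pass through the loop is one Observe-Think-Act cycle: the model observes `history`, produces thoughts and an action, and the action's output is appended for the next iteration.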
The First Design Problem
Look at the same pseudocode again -- but pay attention to what the LLM actually sees.
Do we really want the LLM to see all past history? Can we inject something? Should we take something off?
The problem
token_history grows with every iteration
Eventually fills the context window
Per-token generation compute grows linearly with context size
KV cache memory grows linearly too
Most of the history may be irrelevant
The motivation
We need a way to control what goes into the LLM at each step -- independently from what is stored in history. This is context engineering.
Context Engineering
Separate the history from the LLM context, and prepare the right context for each iteration — including external_info such as knowledge bases, additional prompts, or environment state. This is context engineering.
What can context engineering do? Let's look at three use cases.
Context Engineering (1/3): Swapping Tools
Tools are defined in the system prompt. Context engineering enables swapping the system prompt at any iteration -- changing the available tool set dynamically.
Context Engineering (2/3): Skills
Q: Why not put everything in one giant prompt?
A: Context window is finite. Skills enable selective loading -- only what you need, when you need it.
Warning
Skills are also the easiest way to overfit an agent to a benchmark. Task-specific skills can inflate scores without improving general capability. I will discuss a potential fairness issue on Terminal Bench later.
What Does a Skill Actually Look Like?
A skill is just a text file that gets loaded into context when relevant. Here is a real example from Claude Code:
# /commit -- a skill for creating git commits
When the user asks to commit changes:
1. Run git status and git diff to see all changes
2. Analyze the diff -- summarize the nature of the changes (new feature, bug fix, refactor, docs, etc.)
3. Draft a concise commit message (1-2 sentences) focusing on the "why" rather than the "what"
4. Stage relevant files (avoid secrets, .env, etc.)
5. Create the commit
# Available tools: Bash(git status), Bash(git diff), Bash(git add), Bash(git commit)
A skill = prompt + tools + instructions. It is loaded into context only when the agent needs it. The agent does not see the /commit skill when it is debugging.
Skills enable continual learning. Since skills are just text, the agent can write new skills from experience, update existing ones, and even share them with other agents — a modern form of decentralized learning. Updating a skill = updating a prompt.
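Since a skill is just a text file, selective loading can be sketched in a few lines. The directory layout and the keyword-matching trigger below are my illustration; real harnesses use richer relevance signals:

```python
from pathlib import Path

def load_relevant_skills(skills_dir, user_request):
    """Load only the skill files whose name appears in the request,
    so the context holds just what the current task needs."""
    loaded = []
    for path in sorted(Path(skills_dir).glob("*.md")):
        if path.stem in user_request.lower():  # e.g. "commit" -> commit.md
            loaded.append(path.read_text())
    return "\n\n".join(loaded)
```

The same mechanism supports continual learning: because writing a new skill is just writing a file into `skills_dir`, the agent can add or update skills from its own experience.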
Context Engineering (3/3): Compaction
When the context window gets full, compress the token history.
Coding agent: output = [compiler_log, result]. Log is long; result is short ("compile successful"). Remove logs, keep results.
ML research agent: output = [training_curves, validation_loss]. Curves are long; loss is short. Keep the loss.
LLM summarization: LLM_summarizer(token_history) -- use another LLM to compress.
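The rule-based variant for the coding-agent case can be sketched as follows (the `(kind, text)` history format and the 200-character threshold are illustrative):

```python
def compact(history, keep_chars=200):
    """Toy compaction pass: replace long tool outputs with their tail,
    where the short result line usually lives, leaving thoughts and
    actions intact. Real agents fall back to an LLM summarizer when
    rule-based trimming is not enough."""
    compacted = []
    for kind, text in history:
        if kind == "output" and len(text) > keep_chars:
            text = "[log truncated] " + text[-keep_chars:]
        compacted.append((kind, text))
    return compacted
```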
See my recent analysis of how Codex does compaction: @Kangwook_Lee
Context Engineering: KV Cache Constraint
Context engineering changes the context between iterations. But if the prefix changes, the KV cache is invalidated and you pay full recompute cost. How do we modify the context while keeping the prefix stable?
Approach 1: Masking (Manus, 2025)
Provide all information in the system prompt from the start. "Mask" out irrelevant parts via logit masking instead of adding/removing. The prefix never changes, so the KV cache is always reused.
[Manus blog]
Key insight: By keeping the entire system prompt fixed from the start, the KV cache for that prefix is computed once and reused on every iteration. The cost is a longer initial prompt, but the savings compound over many iterations.
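A sketch of the masking mechanism itself (the tool-to-token-id mapping is hypothetical; real systems apply this at the decoder's sampling step):

```python
import numpy as np

def mask_tool_logits(logits, tool_token_ids, allowed_tools):
    """Forbid tools by setting their trigger-token logits to -inf.
    The system prompt listing ALL tools stays fixed, so its KV cache
    is computed once and reused on every iteration."""
    masked = logits.copy()
    for tool, token_id in tool_token_ids.items():
        if tool not in allowed_tools:
            masked[token_id] = -np.inf
    return masked
```

After softmax, the disallowed tools receive exactly zero probability, so the agent can never invoke them -- without a single character of the prefix changing.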
Another approach: append iteration-specific context AFTER the stable prefix. It appears just once and is not stored in history.
// Ephemeral context example:
while task not completed:
    context = token_history + log   // log appended after prefix (ephemeral!)
    generated_tokens = LLM(context)
    thoughts, action = parse(generated_tokens)
    output = exec(action)
    token_history += [thoughts, action, output]   // only keep the output -- the log is never stored!
This is the technique we used in PUBG Ally -- each step has rich situational info (enemies nearby, health, zone) that is critical right now but not useful in future steps.
Multi-Agents = Context Isolation
Key insight: multi-agents are just programmatic context isolation. Like OOP -- each "object" has its own context.
Problem: One agent does coding and reviewing. The reviewer is biased by the coder's thoughts — confirmation bias! Solution: Give each role its own clean context.
// Each agent has its own context
code = LLM_Agent("code it")
review = LLM_Agent("review it", code)
Instead of bloating the main context, spawn a sub-agent. Its context is discarded after use -- the main agent stays clean.
// Main agent spawns a sub-agent
// instead of reading a huge file directly:
action: LLM_Agent("read input.txt and summarize it")
output: "input.txt is about local restaurants in Berkeley ..."
The sub-agent's context can get bloated with the full file content, but the main agent only sees the concise summary. This is the same idea as spawning a subprocess in programming.
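The isolation pattern in miniature (the stub `llm` and the history format are illustrative):

```python
def llm(history):
    """Stub model: returns a short answer for the last task in its context."""
    task = history[-1][1]
    return "summary of: " + task

def subagent(task_text):
    """Fresh, throwaway context per sub-task. Only the final answer
    escapes -- like a subprocess returning its stdout."""
    sub_history = [("task", task_text)]  # may bloat with file contents
    return llm(sub_history)              # ...but is discarded on return

main_history = [("task", "find good restaurants")]
main_history.append(("output", subagent("read input.txt and summarize it")))
```

However large `sub_history` grows inside `subagent`, `main_history` only ever gains one short entry.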
Three ways the agent can stop: (1) a verifiable task with a checker -- easy. (2) a fixed time/budget limit -- also easy. (3) the LLM itself decides by generating a "done" action (or EOS token). This is the common case -- and the problematic one.
The False Completion Problem
When the LLM decides it is done, it is frequently wrong. We call this false completion.
A Real Example
On Terminal-Bench-2 (SWE/MLE-level tasks requiring deep expertise): a baseline agent (Terminus) with Claude Opus 4.6 submitted results 5 times within the time limit. All 5 times, the agent confidently submitted a wrong answer. Roughly 80% of failures were due to false completion.
The Ralph Loop
One idea: add an outer loop where a fresh agent checks if the work is actually done.
// Outer loop: restart a fresh agent on the same world state
while True:
    token_history = []   // inner loop -- clean context, same world state
    while task not completed:
        context = context_build(token_history, external_info)
        generated_tokens = LLM(context)
        thoughts, action = parse(generated_tokens)
        if action == done: break
        output, answer_not_changed = exec(action)
        token_history += [thoughts, action, output]
    if answer_not_changed: break   // only exit if the next agent found nothing to change
The key: each inner loop starts with clean context but the same world state. It reduces confirmation bias -- a fresh agent is not influenced by prior reasoning.
github.com/snarktank/ralph
Our Approach: Terminus-KIRA
In building Terminus-KIRA, we tested a variant that is both effective and efficient:
while True:
    context = context_build(token_history, external_info)
    generated_tokens = LLM(context)
    thoughts, action = parse(generated_tokens)
    if action == done:
        generated_tokens = LLM(token_history_wo_thoughts)   // check with just the work, no thoughts
        thoughts, action = parse(generated_tokens)
        if action == done: break   // truly done: an unbiased agent agrees
    output = exec(action)
    token_history += [thoughts, action, output]
    token_history_wo_thoughts += [action, output]   // track just the work
The verifier sees only what was done (actions + outputs), not how it was reasoned about (thoughts). This removes confirmation bias without needing a full outer loop restart.
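Assuming history entries tagged by kind, the verifier's view is a simple filter (the tagging scheme is my illustration):

```python
def without_thoughts(history):
    """Verifier's view: what was done (actions, outputs),
    not how it was reasoned about (thoughts)."""
    return [(kind, text) for kind, text in history if kind != "thought"]
```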
AutoResearch: When False Completion Isn't a Problem
AutoResearch (Andrej Karpathy) went viral on Twitter. It applies the basic agent loop to autonomous ML research -- no Ralph loop needed.
Why does it work without Ralph? This is a progress-measurable task. The goal: train a model with lower validation loss. The loss either went down or it didn't -- very hard to falsely claim completion.
Look at the git state: the current branch/commit we're on
Tune train.py with an experimental idea by directly hacking the code.
git commit
Run the experiment: uv run train.py > run.log 2>&1 (redirect everything -- do NOT use tee or let output flood your context)
Read out the results: grep "^val_bpb:\|^peak_vram_mb:" run.log
If the grep output is empty, the run crashed. Run tail -n 50 run.log to read the Python stack trace and attempt a fix. If you can't get things to work after more than a few attempts, give up.
Record the results in the tsv (NOTE: do not commit the results.tsv file, leave it untracked by git). If val_bpb improved (lower), you "advance" the branch, keeping the git commit. If val_bpb is equal or worse, you git reset back to where you started.
The idea is that you are a completely autonomous researcher trying things out. If they work, keep. If they don't, discard. And you're advancing the branch so that you can iterate. If you feel like you're getting stuck in some way, you can rewind but you should probably do this very very sparingly (if ever).
Timeout: Each experiment should take ~5 minutes total (+ a few seconds for startup and eval overhead). If a run exceeds 10 minutes, kill it and treat it as a failure (discard and revert).
Crashes: If a run crashes (OOM, or a bug, or etc.), use your judgment: If it's something dumb and easy to fix (e.g. a typo, a missing import), fix it and re-run. If the idea itself is fundamentally broken, just skip it, log "crash" as the status in the tsv, and move on.
NEVER STOP: Once the experiment loop has begun (after the initial setup), do NOT pause to ask the human if you should continue. Do NOT ask "should I keep going?" or "is this a good stopping point?". The human might be asleep, or gone from a computer and expects you to continue working indefinitely until you are manually stopped. You are autonomous. If you run out of ideas, think harder -- read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes. The loop runs until the human interrupts you, period.
As an example use case, a user might leave you running while they sleep. If each experiment takes you ~5 minutes then you can run approx 12/hour, for a total of about 100 over the duration of the average human sleep. The user then wakes up to experimental results, all completed by you while they slept!
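The read-out and keep/revert steps above can be sketched as plain Python. The log format follows the grep pattern in the instructions; the helper names are mine:

```python
import re

def read_metrics(log_text):
    """Parse val_bpb / peak_vram_mb lines from run.log.
    An empty dict means the run crashed before printing metrics."""
    metrics = {}
    for key in ("val_bpb", "peak_vram_mb"):
        m = re.search(rf"^{key}:\s*([0-9.]+)", log_text, re.MULTILINE)
        if m:
            metrics[key] = float(m.group(1))
    return metrics

def decide(best_val_bpb, metrics):
    """Advance the branch only on a strict improvement; otherwise revert."""
    if "val_bpb" not in metrics:
        return "crash"
    return "advance" if metrics["val_bpb"] < best_val_bpb else "revert"
```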
Similar to AutoResearch, but for RL engineering. More than just hyperparameter tuning — the agent must also design rewards to avoid reward hacking. Example: b-boying spider.
Work by: Wooseong Chung, Taegwan Ha, Kangwook Lee, Jeong-Gwan Lee, Suyoung Lee, Taehwan Kwon, Yunhyeok Kwak (alphabetical)
Test-Time Scaling for Agents
Another orthogonal approach: test-time scaling -- generate multiple candidates and pick the best. For agents, however, it is largely unexplored:
Too expensive: running an entire agent loop multiple times costs much more than sampling multiple CoTs
Hard to aggregate: agent outputs are not from a fixed set of choices, so majority voting doesn't directly apply
Majority voting may not help anyway: false completion rate is often > 50%, and all the false traces tend to look similar -- the majority is wrong
Test-time scaling for agents is an important open problem.
Test-Time Scaling: Naive Approach
Can the LLM agent itself figure out which trial was best?
for i in range(N):   // N = test-time scaling factor
    history[i] = LLM_agent(task_instruction)
best = LLM("find the most promising work" + task_instruction + history)
Result: accuracy improved, but mostly for tasks with P(success) > 50%. Guess why?
We found the LLM was implicitly clustering the candidates and picking the majority! It was told to pick the best, but it was doing majority voting. When P(success) > 50%, the majority is correct -- so it helps. When P(success) < 50%, the majority is wrong -- so it hurts.
Test-Time Scaling: Pairwise Comparisons
To reduce majority bias, use pairwise comparisons -- the LLM never sees all candidates at once.
for i in range(N):
    history[i] = LLM_agent(task_instruction)
for (i, j) in [N] x [N]:
    y[i,j] = LLM("which is more promising?" + history[i] + history[j])
s = BTL_solver(y)   // Bradley-Terry-Luce model
return argmax(s)
No majority bias -- pairwise comparisons never expose all candidates at once. No context sharing between comparisons.
More compute = less variance -- compare each pair multiple times. Test-time compute for comparisons too!
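A minimal BTL_solver can use the classic minorization-maximization update for the Bradley-Terry model (the win-count matrix format is assumed; this is a sketch, not a production solver):

```python
import numpy as np

def btl_solver(wins, iters=100):
    """Fit Bradley-Terry-Luce scores s from wins[i, j] = # times
    candidate i beat candidate j, via the standard MM update:
        s_i <- W_i / sum_{j != i} n_ij / (s_i + s_j)
    where W_i is i's total wins and n_ij the # of i-vs-j comparisons."""
    n = wins.shape[0]
    s = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            denom = sum((wins[i, j] + wins[j, i]) / (s[i] + s[j])
                        for j in range(n) if j != i)
            if denom > 0:
                s[i] = wins[i].sum() / denom
        s = s / s.sum()  # fix the scale: scores are only relative
    return s
```

Repeating each pairwise comparison several times just adds to the counts in `wins`, which is how extra test-time compute reduces the variance of the final ranking.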
OpenClaw = the basic agent loop + Ralph loop + memory update. After each task, the agent updates its own prompts -- enabling continual learning across tasks.
memory_update can be creative: summarize the session, define an IDENTITY that the agent updates over time, develop personality/character across tasks.
The rest of the harness is clever too: managing multiple instructions, background tool execution, compaction + file system. But the key innovation is task-to-task continuity via memory.
Memory in Practice: inZOI and PUBG Ally
inZOI: Tried self-evolving prompts -- personalities polarized too easily. Shipped with user-defined personalities instead. 1M+ copies sold, first game with on-device LLM agents.
PUBG Ally: Memory focused on friendship and past games. "Remember when we won that match?" Makes the ally feel like a real teammate over time.
Lots of Interesting Challenges Left
Proactivity -- when should the agent talk to you? act for you?
Fast reaction -- System 1/System 2 architecture (see Figure Helix)
Distillation -- LLM agents to SLM agents. Not trivial (off-policy, model sharing)
Multimodal -- STT → LLM → TTS loses information. Multimodal should be the core model
Evaluation -- agent outputs are complex, non-deterministic, hard to grade
Planning -- LLMs are bad at exploration/exploitation. External search helps (see TAPE, ReJump)
Summary
The agent loop is simple: observe, think, act, repeat with tool calling via special tokens
Context engineering is the key design space: skills, compaction, ephemeral context, KV cache constraints
Multi-agents / sub-agents / recursive LMs = different forms of context isolation and programmatic control
False completion is the #1 production issue. Ralph loop, progress measurement, and test-time scaling (BTL) address it
Memory enables task-to-task continuity and self-evolution (OpenClaw, inZOI, PUBG Ally)
Many open challenges: proactivity, fast reaction, distillation, multimodal, evaluation, planning
We are early. Maybe the 1950s of communications. Go build (real engineering starts from real problems) and go clarify (the essence resembles control theory, communication theory, statistical inference).