Hackathon Lessons — and What They Mean for Hiring AI Talent in 2026

A behind-the-scenes look at the KRAFTON AI R&D Hackathon · Spring 2026 · ~300 participants · 5 problems · 2 weekends

We just wrapped up the first KRAFTON AI R&D Hackathon — a two-week, two-round contest with about 300 participants from across Korea. The crowd was wonderfully mixed: undergraduates, PhD students, AI researchers, non-engineers, and even tech leads from some of Korea’s most prominent companies, all chasing the same problems.

This post is the long version of “what did we make people do, and why?” The first half is the problem set itself — rules, time limits, and links to the original statements — for anyone who wants to try the problems cold. The second half, after a clearly marked spoiler line, is the behind-the-scenes story: where each problem came from, what the intended solutions were, the surprises in the submissions, and what designing this whole thing taught me about hiring AI talent in 2026.

How to read this post. If you want to attempt the problems clean, stop at the spoiler banner roughly halfway down. If you just want the story, scroll past it.

Quick links to all problems and datasets.

Round 1, P1 — MultiplierBoard: round1_p1.html
Round 1, P2 — SparseTap: round1_p2.html · data: round1_p2_data.txt
Round 2, P1 — Pen-and-paper exam: round2_p1.html
Round 2, P2 — BattlePredict: round2_p2.html · data: round2_p2_data.csv
Round 2, P3 — VideoAgent: round2_p3.html

The 5 problems · ~300 participants · top 30 advance to Round 2

Round 1 · Day 1 (4 hours, online): MultiplierBoard, the smallest transformer that multiplies two 6-bit numbers
Round 1 · Day 2 (4 hours, online): SparseTap, recover hidden XOR taps from a noisy bit-sequence
Round 2 · Day 1 AM (2 hours, pen & paper): AttentionByHand
Round 2 · Day 1 PM (2 hours, in-person): BattlePredict, non-stationary skill rating with hidden day labels
Round 2 · Day 2 (5 hours, in-person): VideoAgent, build a video QA agent for 20 hidden test videos

Round 1 — Online · 2 Days · 4 Hours Each

Round 1 ran on a single weekend, fully online. Two problems, one per day, 4 hours each. Anything goes. Any tools, any AI, any libraries.

Day 1 · Problem 1 — MultiplierBoard

Build the smallest transformer that can multiply two 6-bit binary numbers.
Two sub-problems: (1-1) hand-coded weights with a layer-by-layer correctness proof, (1-2) an architecture trained to ≥99% under a fixed protocol. You must solve both, with the smallest parameter count you can manage.

Inspired by the popular AdderBoard challenge from @DimitrisPapail, which asked the same question for addition. Multiplication is meaningfully harder (carries propagate non-locally and partial products interact), so the design space is much richer.

Figure: Round 1, Day 1. Build the smallest transformer that multiplies two 6-bit binary numbers.

Example: A = 23, B = 37, P = A × B = 851.
Architecture: 12 input tokens → self-attention (≥1) → MLP → 12 output tokens. Smallest parameter count wins.

6-bit × 6-bit = 12-bit, LSB-first. Two tracks: hand-coded weights (with proof) AND a trained architecture under a fixed protocol. Both required.
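To make the input/output format concrete, here is a minimal Python sketch of the LSB-first encoding for the 23 × 37 example. The token order (A's six bits followed by B's six bits) and the helper names are assumptions made for illustration; the linked problem statement is the authority on the exact layout.

```python
def to_bits_lsb(value, width):
    """Return `width` bits of `value`, least-significant bit first."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits_lsb(bits):
    """Inverse of to_bits_lsb: rebuild the integer from LSB-first bits."""
    return sum(bit << i for i, bit in enumerate(bits))

def encode_example(a, b):
    """Build 12 input tokens (assumed order: A then B) and 12 target tokens."""
    assert 0 <= a < 64 and 0 <= b < 64
    return to_bits_lsb(a, 6) + to_bits_lsb(b, 6), to_bits_lsb(a * b, 12)

inputs, targets = encode_example(23, 37)
print(inputs)                  # [1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1]
print(targets)                 # [1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0]
print(from_bits_lsb(targets))  # 851
```

Twelve output bits are always enough, since the largest product is 63 × 63 = 3969 < 4096.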

Full problem statement: kangwooklee.com/blogs/krafton_ai_hackathon_2026/round1_p1.html

Day 2 · Problem 2 — SparseTap

Find the hidden offsets, then predict. A binary signal’s bits are generated by XOR-ing some specific earlier bits, then flipping each bit independently with probability 0.2. Given 2,000 noisy 256-bit example sequences, predict the next 192 bits of a brand-new, noise-free test sequence.

Hidden parameters: a set of offsets d₁ < d₂ < … < dₛ with S ≤ 16 and dₛ ≤ 64. Each new bit equals the XOR of seq[n − dₖ] across those offsets, plus a Bernoulli(0.2) noise bit. Same hidden taps for all 2,000 sequences; different random seeds.

Figure: Round 1, Day 2. Recover the hidden XOR taps from a noisy bit-sequence.

A binary sequence with hidden offsets (taps) and noise:

\[ \text{seq}[n] = \text{seq}[n - d_1] \oplus \text{seq}[n - d_2] \oplus \cdots \oplus \text{seq}[n - d_S] \oplus e[n] \]
\[ e[n] \sim \text{Bernoulli}(0.2), \quad S \leq 16, \quad d_S \leq 64 \]

XOR the tapped positions, then flip the result with probability 0.2.

2,000 noisy training sequences (256 bits each, same hidden taps, different seeds). Predict the next 192 bits of a brand-new, noise-free test sequence.
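The generative process can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the organizers' generator: the taps below are hypothetical, the first 64 bits are assumed to be uniformly random, and each noisy bit is assumed to feed back into later bits, as the recurrence above suggests.

```python
import random

def generate_sequence(taps, length=256, noise=0.2, rng=None):
    """seq[n] = XOR of seq[n - d] over the taps, XOR a Bernoulli(noise) bit."""
    rng = rng or random.Random()
    max_tap = max(taps)
    # Assumption: the initial window is filled with uniformly random bits.
    seq = [rng.randint(0, 1) for _ in range(max_tap)]
    for n in range(max_tap, length):
        bit = 0
        for d in taps:
            bit ^= seq[n - d]
        if rng.random() < noise:
            bit ^= 1  # flip with probability `noise`
        seq.append(bit)
    return seq

hypothetical_taps = [3, 17, 41, 64]  # illustrative only, not the contest taps
noisy = generate_sequence(hypothetical_taps, rng=random.Random(0))
print(len(noisy), noisy[:16])
```

Setting `noise=0.0` gives the noise-free variant that the test sequence follows.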

Full problem statement: kangwooklee.com/blogs/krafton_ai_hackathon_2026/round1_p2.html

Round 2 — In Person · Top 30 Finalists

The top 30 from Round 1 were invited to a single in-person weekend in Seoul. The Round 2 problems were structured very differently — more on why in the second half.

Day 1 Morning · Problem 1 — Pen and paper, no devices

120 minutes, closed book, no calculators, no electronics. Three problems, 40 points total. The only Round 2 problem where AI tools are not allowed.

Self-attention by hand (15 pt). Concrete W_k, W_v, W_q and two 4-dim inputs. Compute keys, values, queries, dot-product attention scores, softmax weights, outputs — with and without causal masking. Plus a True/False on whether next-token prediction requires causal attention.
Sampling strategies (10 pt). Greedy / random / top-k / top-p (nucleus) on a tiny example distribution.
Learning Parity with Noise (15 pt). Derive Pr[b̃ = ⟨ã, s⟩] = (1 + (1 − 2η)^s) / 2 for the XOR of s LPN samples, prove it from scratch, then plug in η = 0.1, s = 3.

Full problem: kangwooklee.com/blogs/krafton_ai_hackathon_2026/round2_p1.html

Day 1 Afternoon · Problem 2 — BattlePredict

Predict each player’s total kills across days 22–50. Ten players compete in battle royale matches over 50 days, 50 matches per day, 5 participants per match in a “gauntlet” of 1v1 duels. Submit 10 numbers. 2 hours.

The catch is in the dataset:

Days 1, 11, and 21 are fully labeled.
Days 2–10 and 12–20 appear as 18 anonymous day-blocks, 50 matches each, in original order — but with the day label stripped out. You see the block boundaries, you don’t know which block is which day.
Days 22–50 are the test set, given only as gauntlet orderings (you know who plays whom; you predict the outcomes).

Score: normalized sum of absolute errors over the 10 per-player kill counts. Lower is better.

Full problem statement: kangwooklee.com/blogs/krafton_ai_hackathon_2026/round2_p2.html

Day 2 · Problem 3 — VideoAgent (5 hours)

Build an automated video analysis agent. At test time it receives 20 videos (up to 20 min each) and 20 multiple-choice questions, with 26 options per question (random baseline ≈ 3.8%). The test folder is released 15 minutes before the deadline. The agent must process all 20 videos and submit within the window. Audio + visual reasoning required.

Full problem statement: kangwooklee.com/blogs/krafton_ai_hackathon_2026/round2_p3.html

VideoAgent: two real sample questions

Sample question · Premier League goals compilation
youtube.com/watch?v=SrCBJYoMoro
In the first 10 minutes of this Premier League goals compilation, how many goals were shot from outside the penalty area? Do not double-count the same goal shown in replay; count each distinct goal only once.

Sample question · Shin Lim AGT performance
youtube.com/watch?v=Pxzjzp8NtE0
When a panel member first says "oh my god" during this performance, how many clearly visible cards are on the magician's dark mat or in his hands? Count only cards you can see; exclude any that may be hidden or concealed by the magician.

26 answer options per question (random baseline ≈ 3.8%). Audio + visual reasoning required. No timestamp-jumping.

Why have an AI hackathon at all these days?

Honestly, my biggest worry going in was about hiring itself. In 2026, almost every strong candidate is going to hand you something that looks great. Agents write the code, agents write the report, agents debug the agents. Agents even write a fancy CV and papers for you. So what do you actually evaluate?

I came in with two beliefs that ended up shaping every problem.

Round 1 problems

First: if a candidate doesn’t use AI to attack hard problems, they’re already out. I don’t want to hire someone who’s allergic to the most powerful tool of our generation. So we kicked off the contest with problems that are extremely hard yet comfortably solvable by current AI agents.

(Note: We reimbursed every participant up to $200 to spend on anything: cloud, GPUs, API credits, model subscriptions!)

Round 2 problems

Second: but for the problems that matter, the human still has to know what they’re doing. Those are the cases where the AI output looks plausible and is subtly wrong, and you need someone who can smell the rot. So the hardest problems were deliberately designed not to fall to current SOTA agents.

That tension (lean on AI hard, but understand it deeply) is the entire spine of the problem set.

Round 1, Problem 1 — MultiplierBoard

This one was a love letter to AdderBoard. AdderBoard asks: what is the smallest transformer that can add? We just turned the crank one notch and asked the same question for multiplication. We split it into the same two flavors:

Hand-coded weights with a correctness proof. This is the part where the human has to understand attention. Which head attends to which bit, what each MLP computes, how partial products get summed. A wall of weights with no proof gets zero credit.
Trained weights under a frozen training protocol. Here all you control is the architecture; we run the optimizer, schedule, and decoding.
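For readers who want to see what “how partial products get summed” means in bits, here is a minimal Python sketch of schoolbook multiplication on LSB-first bit lists. It is the arithmetic a hand-coded construction has to reproduce, not a transformer and not any participant's solution; the function name is made up for this example.

```python
def multiply_bits_lsb(a_bits, b_bits):
    """Schoolbook multiplication on LSB-first bit lists."""
    n_out = len(a_bits) + len(b_bits)
    # Column k collects every partial product a_i AND b_j with i + j = k.
    columns = [0] * n_out
    for i, a in enumerate(a_bits):
        for j, b in enumerate(b_bits):
            columns[i + j] += a & b
    # Carry propagation: a column sum can exceed 1, and its carry can
    # ripple through many higher columns (the non-local part).
    out, carry = [], 0
    for total in columns:
        total += carry
        out.append(total & 1)
        carry = total >> 1
    return out

a_bits = [(23 >> i) & 1 for i in range(6)]
b_bits = [(37 >> i) & 1 for i in range(6)]
product_bits = multiply_bits_lsb(a_bits, b_bits)
print(sum(bit << i for i, bit in enumerate(product_bits)))  # 851
```

The two loops mirror the two things a proof has to account for: which positions interact (the column sums) and how far a carry can travel.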

The boundary between “clever weight construction” and “you basically wired a multiplication circuit and called it a transformer” is blurry. We graded generously. If you could explain the construction layer by layer and the math checked out, we credited it.

The fun surprise: for the training track, a huge fraction of submissions used autoresearch-style agent frameworks to search the architecture space. People let an agent propose architectures, train them under the fixed protocol, and iterate on the loss curve. Exactly the kind of “use AI to attack the search problem” behavior we wanted to reward.

Round 1, Problem 2 — SparseTap (the trap)

The cover story is “find the hidden XOR taps and predict the next bits.” If you squint, it looks like a signal-processing or LFSR-cracking problem. Lots of people went down that road and tried to brute-force the offsets, fit a gradient-descent model on {0,1} outputs, and so on.

The intended path: this is just LPN

The key observation is that this problem is literally Learning Parity with Noise (LPN) in disguise:

The “secret” is the binary indicator vector of which offsets are taps: s ∈ {0,1}^W with at most S ones, where W = 64 is the largest allowed offset.
Each “sample” is (a, b) where a is a window of W previous bits and b is the next bit — i.e., b = ⟨a, s⟩ ⊕ e with e ∼ Bernoulli(0.2).
Bits are flipped independently with η = 0.2. That’s LPN.
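The mapping in the list above can be sketched in Python as follows. The taps are hypothetical and the demo sequence is noise-free, so every sample satisfies b = ⟨a, s⟩ exactly; with the real data the same construction applies, and each b is additionally flipped with probability 0.2.

```python
import random

WINDOW = 64  # largest offset allowed by the problem statement

def lpn_samples(seq, window=WINDOW):
    """Slide over one sequence: a[d-1] is the bit at offset d, b is the next bit."""
    return [([seq[n - d] for d in range(1, window + 1)], seq[n])
            for n in range(window, len(seq))]

def secret_from_taps(taps, window=WINDOW):
    """Indicator vector s with s[d-1] = 1 exactly when d is a tap."""
    return [1 if d in taps else 0 for d in range(1, window + 1)]

def dot_gf2(a, s):
    """Inner product over GF(2)."""
    return sum(x & y for x, y in zip(a, s)) & 1

# Noise-free demo with hypothetical taps.
rng = random.Random(0)
taps = [3, 17, 41, 64]
seq = [rng.randint(0, 1) for _ in range(WINDOW)]
while len(seq) < 256:
    seq.append(sum(seq[len(seq) - d] for d in taps) & 1)

samples = lpn_samples(seq)
secret = secret_from_taps(taps)
print(len(samples))                                      # 192 samples per 256-bit sequence
print(all(dot_gf2(a, secret) == b for a, b in samples))  # True
```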

Once you see this, the textbook attack is the Blum–Kalai–Wasserman (BKW) algorithm: XOR-reduce the dataset by combining samples whose a vectors collide on chosen bit blocks, which zeroes out those blocks at the cost of fewer samples and more noise. When s samples are XOR-ed together, the bias shrinks as (1 − 2η)^s (here s counts samples, not the secret), which is the exact identity we put on the Round 2 written exam, as a bridge.
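A single BKW reduction step can be sketched as below. This is a simplified illustration, not a full attack: it buckets samples by their bits on one block and XORs each sample with a pivot from the same bucket, so that block becomes zero. The demo is noise-free and uses a small hypothetical 12-bit secret so that consistency is easy to check.

```python
import random
from collections import defaultdict

def bkw_reduce(samples, block):
    """One reduction step: zero out the positions in `block` (a slice)."""
    buckets = defaultdict(list)
    for a, b in samples:
        buckets[tuple(a[block])].append((a, b))
    reduced = []
    for key, group in buckets.items():
        if not any(key):
            reduced.extend(group)  # already zero on this block
            continue
        pivot_a, pivot_b = group[0]  # the pivot itself is consumed
        for a, b in group[1:]:
            reduced.append((tuple(x ^ y for x, y in zip(a, pivot_a)), b ^ pivot_b))
    return reduced

rng = random.Random(0)
n = 12
secret = tuple(rng.randint(0, 1) for _ in range(n))
samples = []
for _ in range(500):
    a = tuple(rng.randint(0, 1) for _ in range(n))
    b = sum(x & y for x, y in zip(a, secret)) & 1  # noise-free for the demo
    samples.append((a, b))

reduced = bkw_reduce(samples, slice(8, 12))
print(len(samples), len(reduced))
print(all(not any(a[8:12]) for a, _ in reduced))  # True
```

With noisy labels, every XOR-ed output combines two noise bits, which is where the (1 − 2η)^s bias factor comes from.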

Design intent. The whole problem statement was designed so that the connection to LPN is unmistakable if you’ve seen LPN before, and totally invisible if you haven’t. The Round 2 morning exam then reused the LPN identity, so anyone who got SparseTap by the right path would breeze through Q3.

The unintended-but-also-correct path: RANSAC

Several people used a RANSAC-style approach I hadn’t planned for. The idea: randomly sample a small subset of (a, b) constraints, assume they’re all noise-free, solve the resulting linear system over GF(2), and check how many of the remaining constraints are satisfied. If a lot, you’ve found s.

I was initially surprised this works, but the math actually checks out at the specific scales of this problem:

P(single sample is noise-free)         = 1 − 0.2 = 0.8
P(random S-subset is fully noise-free) = 0.8^S    ≥ 0.8^16 ≈ 0.028
                                                            ≈ 1 in 35 at worst (S = 16)
Total constraints in dataset            = 2000 × 192 = 384,000

⇒ Sampling random S-subsets and verifying is comfortably tractable.
   RANSAC + a GF(2) solver + verification recovers s in seconds.

RANSAC works on this problem precisely because S, η, and the dataset size happen to be in a friendly regime. Squeeze any one of them and the trick collapses.
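Here is a toy-scale Python sketch of the RANSAC idea: draw a random subset of constraints, solve it over GF(2) as if it were noise-free, and accept the candidate only if it agrees with most of the data. The window is shrunk to 16 bits (an assumption of this sketch) so that a uniquely solvable system needs only 16 equations, which matches the 0.8^16 figure above. How participants handled the full 64-bit window is not described here, and the 70% acceptance threshold is also my choice.

```python
import random

def parity(x):
    return bin(x).count("1") & 1

def solve_gf2(rows, rhs, width):
    """Gauss-Jordan over GF(2). Rows are ints (bit k = coefficient k).
    Returns the unique solution as a bit list, or None if rank-deficient."""
    aug = [r | (b << width) for r, b in zip(rows, rhs)]
    for col in range(width):
        pivot = next((i for i in range(col, len(aug)) if (aug[i] >> col) & 1), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(len(aug)):
            if i != col and (aug[i] >> col) & 1:
                aug[i] ^= aug[col]
    return [(aug[k] >> width) & 1 for k in range(width)]

def ransac_lpn(samples, width, max_trials=5000, accept=0.7, rng=None):
    rng = rng or random.Random(0)
    for _ in range(max_trials):
        subset = rng.sample(samples, width)
        s = solve_gf2([a for a, _ in subset], [b for _, b in subset], width)
        if s is None:
            continue
        mask = sum(bit << k for k, bit in enumerate(s))
        agree = sum(parity(a & mask) == b for a, b in samples)
        if agree >= accept * len(samples):  # true secret ~80%, wrong guess ~50%
            return s
    return None

WIDTH = 16
rng = random.Random(1)
secret = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]  # hypothetical
secret_mask = sum(bit << k for k, bit in enumerate(secret))
samples = []
for _ in range(2000):
    a = rng.getrandbits(WIDTH)
    b = parity(a & secret_mask) ^ int(rng.random() < 0.2)  # 20% label noise
    samples.append((a, b))

recovered = ransac_lpn(samples, WIDTH, rng=rng)
print(recovered == secret)
```

The verification step is what makes the approach safe: a wrong candidate agrees with only about half of the constraints, so it is very unlikely to pass the threshold.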

I had not designed for this and I love that it worked. Two very different solutions, one I planned and one the participants taught me.

After Round 1

After grading Round 1, we picked the top 30 to advance. Diverse group: students, published AI researchers, non-engineers, several tech leads from well-known Korean companies.

Almost every Round 1 submission looked great. Not just “passed the bar”… actually impressive. Slick code, decent reports, reasonable accuracy. The agents are very good now. Differentiating people purely from their submissions was hard.

If the deliverables all look comparable, what are you actually selecting on? In Round 2 I wanted to give bonus points to people who actually understand what they are doing: not as a gate, but as a tiebreaker that surfaces the deeper kind of skill we would most miss when AI inevitably gets something wrong.

Every Round 2 problem also went through a red-teaming pass, except the pen-and-paper exam, which is closed-book by design and has no agents at the table anyway. For the device-allowed problems, I burned a lot of compute iterating between problem creation and stress-testing with Claude Code, ChatGPT Pro, and autoresearch-style agents, confirming that none of them was solvable end-to-end by any single SOTA call. Each one required real human design choices on top of whatever the agent could do. The loop: write a problem, throw it at every SOTA agent I had access to, watch where it broke, harden the problem, repeat. That iteration loop was, embarrassingly, the most fun I had in the whole two weeks.

Round 2, Day 1 Morning — the pen-and-paper exam

This is the bonus-points problem. No devices, no AI, two hours. The entire exam is structured as a pair of bridge questions for the two Round 1 problems.

Q1 (self-attention by hand) and Q2 (sampling) are the bridge for MultiplierBoard. If you genuinely hand-constructed a transformer that multiplies 6-bit numbers, hand-computing softmax attention on a 2×2 toy and applying nucleus sampling to a 3-token distribution should be mechanical.
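As a reference for what the Q1 computation involves, here is a minimal pure-Python sketch of single-head attention on two tokens, with and without a causal mask. The inputs and weight matrices are made up for this example (they are not the exam's values), and the scores are unscaled dot products; dividing by sqrt(d_k) is the common scaled variant.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def attention(X, Wq, Wk, Wv, causal):
    """Return (weights, outputs) for single-head dot-product attention."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    weights, outputs = [], []
    for i, q in enumerate(Q):
        scores = [float("-inf") if causal and j > i
                  else sum(a * b for a, b in zip(q, k))
                  for j, k in enumerate(K)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return weights, outputs

# Hypothetical 4-dim inputs and 2x4 projection matrices.
X = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
Wq = [[1, 0, 0, 0], [0, 1, 0, 0]]
Wk = [[0, 0, 1, 0], [0, 0, 0, 1]]
Wv = [[1, 0, 0, 0], [0, 0, 0, 1]]

full_w, full_out = attention(X, Wq, Wk, Wv, causal=False)
causal_w, causal_out = attention(X, Wq, Wk, Wv, causal=True)
print(full_w)
print(causal_w[0])  # [1.0, 0.0]: the first token can only attend to itself
```

The last token's row is identical with and without the mask, because it is allowed to see every position either way.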

Q3 (the LPN noise-amplification identity) is the bridge for SparseTap. It asks how the noise parameter compounds when you XOR samples together — which is the core operation behind BKW. Almost every Round 1 SparseTap submission mentioned and used BKW in their report, so Q3 was checking whether participants actually understood the math behind the XOR-reduction step they were relying on.
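The identity behind Q3 can be checked numerically. The sketch below compares the closed form with a step-by-step recursion that folds in one noise bit at a time; the function names are mine, and the variable `count` plays the role of s (the number of XOR-ed samples).

```python
def agreement_probability(eta, count):
    """Closed form: Pr[XOR of `count` samples is still correct]."""
    return (1 + (1 - 2 * eta) ** count) / 2

def xor_noise_rate(eta, count):
    """Error rate of the XOR of `count` independent Bernoulli(eta) bits."""
    rate = 0.0
    for _ in range(count):
        # The XOR is wrong if exactly one of (running XOR, new bit) is wrong.
        rate = rate * (1 - eta) + (1 - rate) * eta
    return rate

print(round(agreement_probability(0.1, 3), 6))  # 0.756
print(round(1 - xor_noise_rate(0.1, 3), 6))     # 0.756
print(round(agreement_probability(0.2, 3), 6))  # 0.608 at SparseTap's noise level
```

The recursion is the induction step of the proof: each extra XOR multiplies the bias (1 − 2 × error rate) by another factor of (1 − 2η).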

So the morning exam is, in effect, a sanity check: did you really do the Round 1 problems yourself, or did an agent do them and you don’t quite know what it built?

A meaningful number of the top 30 (people who had aced Round 1) could not solve the basic attention/sampling questions at all. These were candidates whose Round 1 submissions had been brilliant.

I don’t quite know what to do with that finding yet. On one hand, the role of “knows attention numerically by hand” has objectively shrunk as tools have improved. Maybe it’s fine. On the other hand, the role of “can tell when the AI’s answer is subtly wrong” has objectively grown. Historically those two skills have been correlated. Whether that correlation still holds in 2026 is, I think, the central open question of AI hiring right now.

Round 2, Day 1 Afternoon — BattlePredict

From here on out, anything goes. Bring whatever AI you want. (Reminder: we paid for it.)

This problem comes straight from work my team is actually doing at KRAFTON. PUBG is a battle royale, which means skill rating at scale, which means dealing with the fact that player skill is non-stationary: people warm up, get bored, take breaks, learn, decay. Any honest MMR system has to handle that. So I built the smallest possible toy version of that real problem and turned it into a contest.

The setup

Ten users, each with a linearly time-varying skill s_i(d) = w_i · d + b_i over 50 days: 50 matches per day, 5 players per match, and duel outcomes drawn from a Bradley–Terry–Luce (BTL) model. Activity rates also vary in time, so heavy users on day 3 may be ghosts by day 18.

The twist: Days 1, 11, and 21 are fully labeled. Days 2–10 and 12–20 are present but with the day labels stripped out, organized into 18 anonymous day-blocks. Predict each player’s total kills on days 22–50.

Figure: BattlePredict, 50 days of matches with hidden labels.

\[ s_i(d) = w_i \cdot d + b_i \qquad P(i \text{ beats } j) = \sigma\!\left(s_i(d) - s_j(d)\right) \]

Legend: labeled days (1, 11, 21) · hidden blocks (day label stripped) · test days (22–50, predict kills).

The 18 hidden blocks are present but with day labels removed. Recover the labels first, then extrapolate.
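The skill model and the duel probability translate directly into code. In this small Python sketch the two (w, b) pairs are hypothetical, chosen only to show how a matchup can flip over time when skills drift.

```python
import math

def skill(w, b, day):
    """Linearly time-varying skill s(d) = w * d + b."""
    return w * day + b

def win_probability(skill_i, skill_j):
    """BTL duel: P(i beats j) = sigmoid(s_i - s_j)."""
    return 1.0 / (1.0 + math.exp(-(skill_i - skill_j)))

# Hypothetical players: A improves over time, B declines.
player_a = (0.05, -0.5)
player_b = (-0.03, 0.5)
for day in (1, 11, 21, 50):
    p = win_probability(skill(*player_a, day), skill(*player_b, day))
    print(day, round(p, 3))
```

A model that assumes stationary skill would average over this drift, which is exactly what hurts when extrapolating to days 22–50.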

The intended pipeline

Figure: BattlePredict, the intended pipeline.

1. Fit BTL on the 3 labeled days.
2. Notice that the skills disagree across days.
3. Fit BTL within each anonymous day-block.
4. Recover the hidden permutation via the Hungarian method.
5. Re-fit BTL on the full data and extrapolate.

Step 2 is the gating insight. Skip it and run BTL on the full data assuming stationarity, and you collapse to the naive baseline.

Each step requires you to actually understand the problem. Step 2 is the gating insight: if you skipped it and ran BTL on the full data assuming stationarity — as most blind agent runs did — you would be confidently wrong from there on out. Step 4 uses the fact that you have linear-trend anchors at days 1, 11, 21, so you can compute, for each anonymous block, a cost matrix of “how well does this block match the interpolated skills for day d?” and solve as a linear assignment problem with the Hungarian method. Once you have the recovered labels, the per-user skill curves come out beautifully linear, and extrapolation to days 22–50 is straightforward.
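The label-recovery step (Step 4) can be illustrated at toy scale. The Python sketch below builds the cost matrix described above and solves the assignment by brute force over permutations, which is only feasible for a handful of blocks and stands in for the Hungarian method here; SciPy's `scipy.optimize.linear_sum_assignment` solves the same assignment problem efficiently. All numbers are made up: three players, four hidden days, and per-block skill estimates that are the true skills plus a small deterministic error.

```python
from itertools import permutations

def interpolate(anchor_lo, anchor_hi, day_lo, day_hi, day):
    """Linearly interpolate per-player skills between two labeled days."""
    t = (day - day_lo) / (day_hi - day_lo)
    return [lo + t * (hi - lo) for lo, hi in zip(anchor_lo, anchor_hi)]

def cost(block_skills, day_skills):
    """How badly a block's skill estimates match the skills expected on a day."""
    return sum((x - y) ** 2 for x, y in zip(block_skills, day_skills))

def assign_blocks(block_estimates, candidate_days, anchor_lo, anchor_hi, day_lo, day_hi):
    targets = [interpolate(anchor_lo, anchor_hi, day_lo, day_hi, d) for d in candidate_days]
    matrix = [[cost(b, t) for t in targets] for b in block_estimates]
    # Brute force: fine for 4 blocks, hopeless for 18 (use Hungarian there).
    best = min(permutations(range(len(candidate_days))),
               key=lambda perm: sum(matrix[i][j] for i, j in enumerate(perm)))
    return [candidate_days[j] for j in best]

slopes = [0.10, -0.08, 0.05]      # hypothetical w_i
intercepts = [0.0, 0.5, -0.2]     # hypothetical b_i

def true_skill(day):
    return [w * day + b for w, b in zip(slopes, intercepts)]

hidden_days = [4, 2, 5, 3]        # true day of each anonymous block
block_estimates = [[s + 0.01 * (((d + p) % 3) - 1) for p, s in enumerate(true_skill(d))]
                   for d in hidden_days]

recovered = assign_blocks(block_estimates, [2, 3, 4, 5],
                          true_skill(1), true_skill(11), 1, 11)
print(recovered)  # [4, 2, 5, 3]
```

In a real pipeline the block estimates would come from a per-block BTL fit, and since BTL skills are only identified up to an additive constant, they would typically be mean-centered before comparison.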

Internal baselines

We computed several reference baselines after the contest. Lower scores are better:

BattlePredict: internal baselines (lower is better)

Baseline	Score
Oracle (true params)	0.0097
BTL with TRUE day labels	0.0462
BTL + recovered labels (intended)	0.0502
BTL on labeled days only (naive)	0.1188

The intended pipeline gets within striking distance of the oracle that knows the true day labels. The naive baseline is more than 2× worse.

What happens if you just hand it to an agent?

Almost universally, the agent grabs the data with day labels (days 1, 11, 21), fits a BTL model on those ~600 duels, and reports the answer. It doesn’t look at the anonymous day-blocks at all. It doesn’t notice that even the three labeled days disagree with each other on per-user skill. It doesn’t realize the skills are non-stationary, let alone try to recover the hidden day labels. The output is confident, plausible-looking, and exactly the naive baseline.

The successful human competitors did the opposite. They started with data exploration — and they used AI heavily for the exploration itself, asking the agent to plot per-day skill estimates from the labeled days, visualize the duel and activity distributions across the anonymous blocks, and generate diagnostics on the fly. Once the picture made the non-stationarity obvious, the rest of the pipeline (per-block BTL → Hungarian → re-fit → extrapolate) became a series of small, well-scoped instructions the agent could execute correctly. They weren’t doing the math by hand… they were using the agent as a microscope, then telling it where to point.

This was my favorite kind of problem: the agent can do every individual step if asked, but it won’t put them in the right order on its own. The bottleneck was never the agent’s capability. It was knowing what to ask for next.

Round 2, Day 2 — VideoAgent

The grand finale. Five hours. Build an automated video QA agent. We hand you a folder of 20 videos and 20 prompts at T − 15 minutes; you have to submit your 20 answers by T.

Design choices that mattered

VideoAgent: design choices that mattered

26 answer options. Random baseline ≈ 3.8%. Lucky guessing is not a strategy.
No timestamp-jumping. Every question requires watching most of the video: counting, scanning, tracking.
Audio + visual. At least one question detects a spoken phrase, locates the frame, then counts something visual in it.
No test-set access. Participants only see a tiny sample. They have to invent their own dev set if they want one.

Test folder released at T − 15 min, submission due at T. 20 videos, 20 questions, 15-minute window.

Why video QA specifically? Because, today, video QA is exactly the kind of task where naively throwing a SOTA multimodal model at the input does NOT work AT ALL for hard instances. There were really two viable paths. You could hand-design the agent harness yourself (e.g., frame sampling, audio extraction, multi-pass reasoning, verification) and pick the model by hand. Or you could go meta-harness (arxiv.org/abs/2603.28052) and let an outer optimizer use AI to tune the harness for you. The catch with the second path: you had to build your own evaluation set first, because participants had no access to the test set or any training data, and without an eval set the meta-optimizer has no hill to climb. Either way, real human work was required.
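To show what “build your own evaluation set” amounts to in practice, here is a minimal Python sketch of a dev-set scoring loop. Everything in it is hypothetical: the `agent` callable, the dev-set item layout, the file names, and the A–Z labels for the 26 options are assumptions made for illustration, not the contest's interface.

```python
import string

OPTIONS = list(string.ascii_uppercase)  # assumed labels for the 26 choices
RANDOM_BASELINE = 1 / len(OPTIONS)      # about 3.8%

def evaluate(agent, dev_set):
    """Fraction of dev-set questions the agent answers correctly."""
    correct = sum(agent(item["video"], item["question"]) == item["answer"]
                  for item in dev_set)
    return correct / len(dev_set)

def constant_agent(video_path, question):
    """Placeholder agent that always answers 'C'; a real harness goes here."""
    return "C"

# Hand-labeled dev items a participant might write from their own videos.
dev_set = [
    {"video": "clip_01.mp4", "question": "How many goals?", "answer": "C"},
    {"video": "clip_02.mp4", "question": "How many cards?", "answer": "F"},
    {"video": "clip_03.mp4", "question": "How many cuts?", "answer": "C"},
    {"video": "clip_04.mp4", "question": "How many speakers?", "answer": "B"},
]

accuracy = evaluate(constant_agent, dev_set)
print(accuracy, round(RANDOM_BASELINE, 3))
```

Any harness change, whether made by hand or by an outer optimizer, can then be scored against the same `evaluate` call.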

Result

Despite the difficulty and openness of the problem, the top entry scored close to 50% on the completely hidden test set from a 5-hour cold start with no access to the test set or any training set. There were many other interesting solutions in the middle of the pack: different decomposition strategies, different fallback heuristics, different ways of cross-checking pass-by-pass. The top entry was a clear standout.

What I’m taking away

The "use an agent" ceiling is very high. A take-home only tells you the candidate uses AI and pays for good services. Not nothing, but a long way from the signal you actually want.
The signal still exists; design for it. Problems where the SOTA agent confidently produces the wrong answer are gold. They surface candidates who notice the mismatch and intervene.
Pen-and-paper is still an open question. "Knows attention by hand" is a (possibly fading) proxy for "can debug AI when it's wrong." I'd love a sharper signal and to retire the proxy.
The hire-worthy skill is AI steering. Not clever algorithms: the top BattlePredict and VideoAgent runs were genuine human brilliance, knowing exactly what to ask the agent next, when to override it, and when to trust it.

The right hire in 2026 reaches for AI by default and overrides it on purpose. Both halves matter.

A few thoughts after grading several hundred submissions over two weekends:

Observation	Implication
The ceiling on “use an agent to do the obvious thing” is very high.	If your hiring funnel is “take-home + read the deliverable,” what you’re really learning is that the candidate uses AI and is willing to pay for good services. That is not nothing. But it is a long way from the signal you actually want.
The signal still exists; you just have to design for it.	Problems where the SOTA agent confidently produces the wrong answer are gold. They surface candidates who notice the mismatch and intervene.
I still think that the pen-and-paper exam was the right call.	That said, I don’t think “knows attention by hand” is intrinsically important in 2026; it’s a (possibly fading) proxy for “can debug AI when it’s wrong.” I’d love to find a sharper signal.
The most impressive thing wasn’t a clever algorithm: it was clever AI steering.	What surprised me was the genuinely human brilliance behind the top submissions: people who knew exactly what to ask their agent next, when to override its first instinct, and when to trust it. That is the actual hire-worthy skill.

The right hire in 2026 is someone who reaches for AI by default and overrides it on purpose. Both halves matter. The contest is, in some sense, an attempt to measure both halves at once.

The top three winners walked away with a healthy cash prize, and I walked away with a much more interesting and unsettled set of opinions about hiring than I came in with. That feels right for a first edition. Beyond the winners, a lot of participants told us this was the most exciting AI hackathon they had ever joined, and that they found it deeply educational — honestly, that was the most rewarding part.

Thanks to every participant who showed up, stayed up, and shipped. We’ll do this again next year (maybe sooner?) and I promise the problems will be even weirder.

Hope you enjoyed :-)

Acknowledgements

Thank you to the ~300 participants who joined Round 1, the 30 finalists who came to Seoul, and the entire KRAFTON AI R&D team that ran logistics, grading, and the on-site weekend.

Special thanks to Prof. Dimitris Papailiopoulos for the original AdderBoard challenge that inspired Round 1, Problem 1.

KRAFTON AI R&D Hackathon · Spring 2026 · Retrospective

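Hackathon Lessons, and What They Mean for Hiring AI Talent in 2026

A behind-the-scenes look at the KRAFTON AI R&D Hackathon · Spring 2026 · ~300 participants · 5 problems · 2 weekends

The first KRAFTON AI R&D Hackathon has just wrapped up: a two-week, two-round contest with about 300 participants from across Korea. The mix of participants was remarkably diverse: undergraduates, PhD students, AI researchers, non-engineers, and even tech leads from well-known Korean tech companies, all taking on the same problems.

This post is the long version of “what did we make people do, and why?” The first half is the problem set itself (rules, time limits, and links to the original statements), for anyone who wants to try the problems themselves. The second half, after a clear spoiler warning, is the behind-the-scenes story: where each problem came from, what the intended solutions were, the surprises in the submissions, and what designing all of this taught me about hiring AI talent in 2026.

How to read this post. If you want to attempt the problems clean, stop at the spoiler banner roughly halfway down. If you just want the story, skip past it.

Quick links to all problems and datasets.

Round 1, P1 (MultiplierBoard): round1_p1.html
Round 1, P2 (SparseTap): round1_p2.html · data: round1_p2_data.txt
Round 2, P1 (pen-and-paper exam): round2_p1.html
Round 2, P2 (BattlePredict): round2_p2.html · data: round2_p2_data.csv
Round 2, P3 (VideoAgent): round2_p3.html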


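Round 1: Online · 2 Days · 4 Hours Each

Round 1 ran over a single weekend, fully online. Two problems, one per day, 4 hours each. Anything goes: any tools, any AI, any libraries.

Day 1 · Problem 1: MultiplierBoard

Build the smallest transformer that can multiply two 6-bit binary numbers.
Two sub-problems: (1-1) hand-coded weights with a layer-by-layer correctness proof, (1-2) an architecture trained to ≥99% accuracy under a fixed protocol. Both must be solved, with as few parameters as possible.

Inspired by the popular AdderBoard challenge from @DimitrisPapail, which asked the same question for addition. Multiplication is meaningfully harder (carries propagate non-locally and partial products interact), so the design space is much richer.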


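Full problem statement: kangwooklee.com/blogs/krafton_ai_hackathon_2026/round1_p1.html

Day 2 · Problem 2: SparseTap

Find the hidden offsets, then predict. A binary signal’s bits are generated by XOR-ing some specific earlier bits, then flipping each bit independently with probability 0.2. Given 2,000 noisy 256-bit example sequences, predict the next 192 bits of a brand-new, noise-free test sequence.

Hidden parameters: a set of offsets d₁ < d₂ < … < dₛ with S ≤ 16 and dₛ ≤ 64. Each new bit equals the XOR of seq[n − dₖ] across those offsets, plus a Bernoulli(0.2) noise bit. All 2,000 sequences share the same hidden taps but use different random seeds.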


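Full problem statement: kangwooklee.com/blogs/krafton_ai_hackathon_2026/round1_p2.html

Round 2: In Person · Top 30 Finalists

The top 30 from Round 1 were invited to a single in-person weekend in Seoul. The Round 2 problems were structured very differently; the second half explains why.

Day 1 Morning · Problem 1: Pen and paper, no devices

120 minutes, closed book, no calculators, no electronics. Three problems, 40 points total. The only Round 2 problem where AI tools are not allowed.

Self-attention by hand (15 pt). Concrete W_k, W_v, W_q and two 4-dim inputs. Compute keys, values, queries, dot-product attention scores, softmax weights, and outputs, both with and without causal masking. Plus a True/False question on whether next-token prediction requires causal attention.
Sampling strategies (10 pt). Greedy / random / top-k / top-p (nucleus) on a tiny example distribution.
Learning Parity with Noise (LPN) (15 pt). Derive and prove from scratch Pr[b̃ = ⟨ã, s⟩] = (1 + (1 − 2η)^s) / 2 for the XOR of s LPN samples, then plug in η = 0.1, s = 3.

Full problem: kangwooklee.com/blogs/krafton_ai_hackathon_2026/round2_p1.html

Day 1 Afternoon · Problem 2: BattlePredict

Predict each player’s total kills across days 22–50. Ten players compete in battle royale matches over 50 days, 50 matches per day, with 5 participants per match in a “gauntlet” of 1v1 duels. Submit 10 numbers. 2 hours.

The catch is in the dataset:

Days 1, 11, and 21 are fully labeled.
Days 2–10 and 12–20 appear as 18 anonymous day-blocks of 50 matches each, in original order but with the day label stripped out. You can see the block boundaries, but you don’t know which block is which day.
Days 22–50 are the test set, given only as gauntlet orderings (you know who plays whom; you predict the outcomes).

Score: normalized sum of absolute errors over the 10 per-player kill counts. Lower is better.

Full problem statement: kangwooklee.com/blogs/krafton_ai_hackathon_2026/round2_p2.html

Day 2 · Problem 3: VideoAgent (5 hours)

Build an automated video analysis agent. At test time it receives 20 videos (up to 20 min each) and 20 multiple-choice questions, with 26 options per question (random baseline ≈ 3.8%). The test folder is released 15 minutes before the deadline. The agent must process all 20 videos and submit within that window. Audio + visual reasoning required.

Full problem statement: kangwooklee.com/blogs/krafton_ai_hackathon_2026/round2_p3.html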


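Why have an AI hackathon at all these days?

Honestly, my biggest worry going in was about hiring itself. In 2026, almost every strong candidate is going to hand you something that looks great. Agents write the code, agents write the report, agents debug the agents. Agents even write a fancy CV and papers for you. So what do you actually evaluate?

I came in with two beliefs that ended up shaping every problem.

Round 1 problems

First: if a candidate doesn’t use AI to attack hard problems, they’re already out. I don’t want to hire someone who’s allergic to the most powerful tool of our generation. So we kicked off the contest with problems that are extremely hard yet comfortably solvable by current AI agents.

(Note: We reimbursed every participant up to $200 to spend freely on anything: cloud, GPUs, API credits, model subscriptions!)

Round 2 problems

Second: for the problems that really matter, the human still has to know what they’re doing. Those are the cases where the AI output looks plausible but is subtly wrong, and you need someone who can smell the rot. So the hardest problems were deliberately designed so that current SOTA agents cannot solve them.

That tension (lean on AI hard, but understand it deeply) is the entire spine of the problem set.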

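Round 1, Problem 1: MultiplierBoard

This one was a love letter to AdderBoard. AdderBoard asks: what is the smallest transformer that can add? We simply went one step further and asked the same question for multiplication. We split it into the same two flavors:

Hand-coded weights with a correctness proof. This is the part where the human has to understand attention: which head attends to which bit, what each MLP computes, how partial products get summed. A pile of weights with no proof gets zero credit.
Trained weights under a fixed training protocol. Here all you control is the architecture; we run the optimizer, schedule, and decoding.

The boundary between “clever weight construction” and “you basically wired a multiplication circuit and called it a transformer” is blurry. We graded generously. If you could explain the construction layer by layer and the math checked out, we credited it.

The fun surprise: for the training track, a large fraction of submissions used autoresearch-style agent frameworks to search the architecture space. People let an agent propose architectures, train them under the fixed protocol, and iterate on the loss curve. Exactly the kind of “use AI to attack the search problem” behavior we wanted to reward.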

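Round 1, Problem 2 — SparseTap (the trap)

The surface story is “find the hidden XOR taps and predict the next bit.” Squint, and it looks like a signal-processing or LFSR-cracking problem. Many people went down that road, brute-forcing offsets or trying to fit a gradient-descent model to the {0,1} outputs.

The intended path: this is just LPN

The key observation is that this problem is Learning Parity with Noise (LPN) in disguise:

The “secret” is a binary indicator vector marking which offsets are taps: s ∈ {0,1}^W, with at most S ones.
Each “sample” is a pair (a, b), where a is the window of the W previous bits and b is the next bit; that is, b = ⟨a, s⟩ ⊕ e, with e ∼ Bernoulli(0.2).
Bits are flipped independently with η = 0.2. That is LPN.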

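Once you see this, the textbook attack is the Blum–Kalai–Wasserman (BKW) algorithm: combine samples whose a vectors collide on a chosen block of bits, XOR-reducing the dataset. You trade samples for a smaller problem, and you pay in noise: the bias shrinks as (1 − 2η)^s (here s counts the samples XOR-ed together, not the secret), which is exactly the identity we put on the Round 2 written exam as a bridge.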
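The bias identity can be checked numerically. The sketch below is only an illustration, not the exam's solution: it enumerates every noise pattern for k XOR-ed samples (k is my own name for the count) and compares the exact flip probability with the closed form (1 − (1 − 2η)^k) / 2, assuming the noise bits are i.i.d. Bernoulli(η).

```python
from itertools import product

def xor_noise_rate(eta: float, k: int) -> float:
    """Exact P(e_1 xor ... xor e_k = 1) for i.i.d. Bernoulli(eta) bits, by enumeration."""
    total = 0.0
    for bits in product((0, 1), repeat=k):
        if sum(bits) % 2 == 1:  # an odd number of flips leaves the XOR flipped
            p = 1.0
            for bit in bits:
                p *= eta if bit else 1.0 - eta
            total += p
    return total

def closed_form(eta: float, k: int) -> float:
    """Noise rate implied by a bias of (1 - 2*eta)**k."""
    return (1.0 - (1.0 - 2.0 * eta) ** k) / 2.0

for k in (1, 2, 4, 8):
    print(k, round(xor_noise_rate(0.2, k), 4), round(closed_form(0.2, k), 4))
```

With η = 0.2, XOR-ing just two samples already raises the noise rate from 0.20 to 0.32, which is why each reduction step is expensive.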

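Design intent. The full problem statement was written so that the connection to LPN is obvious if you have seen LPN before and completely invisible if you have not. The Round 2 morning exam reused that LPN identity, so anyone who solved SparseTap the right way could breeze through Q3.

The unintended but also correct path: RANSAC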

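Several people used a RANSAC-style approach that I had not planned for. The idea: randomly sample a small subset of the (a, b) constraints, assume all of them are noise-free, solve the resulting linear system over GF(2), and check how many of the remaining constraints are satisfied. If many are (the true s should satisfy roughly 80% of them, while a wrong candidate hovers near 50%), you have found s.

I was surprised at first that this works, but at this problem's particular scale the math does work out: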

P(a single sample is noise-free)             = 1 − 0.2 = 0.8
P(a random S-subset is entirely noise-free)  = 0.8^S    ≈ 0.8^16 ≈ 0.028
                                                            ≈ 1 in 35
Total constraints in the dataset             = 2000 × 192 ≈ 384,000

⇒ Sampling random S-subsets and verifying them is perfectly tractable.
   RANSAC + GF(2) solver + verification recovers s in seconds.

RANSAC works on this problem because S, η, and the dataset size happen to sit in a friendly regime. Squeeze any one of them and the trick falls apart.
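A minimal sketch of the RANSAC idea on toy data, not any participant's actual code. Everything concrete here is an assumption for illustration: the window width of 12, the tap positions, the 400 samples, the 0.7 acceptance threshold, and the choice to draw windows i.i.d. instead of sliding over one sequence. The sketch draws exactly as many equations as there are unknowns so the GF(2) system has a unique solution; the contest's real W and S, and the subset size they imply, come from the problem statement.

```python
import random

def parity(x: int) -> int:
    """Parity of the set bits of x (a GF(2) inner product once x is masked)."""
    return bin(x).count("1") & 1

def make_samples(secret, width, n, eta, rng):
    """Toy LPN data: b = <a, secret> xor e, with e ~ Bernoulli(eta)."""
    samples = []
    for _ in range(n):
        a = rng.getrandbits(width)
        noise = 1 if rng.random() < eta else 0
        samples.append((a, parity(a & secret) ^ noise))
    return samples

def solve_gf2(rows, width):
    """Gauss-Jordan elimination over GF(2) on exactly `width` equations.
    Each row is an int: bits 0..width-1 are coefficients, bit `width` is the
    right-hand side. Returns the unique solution or None if singular."""
    rows = list(rows)
    for col in range(width):
        pivot = next((i for i in range(col, len(rows)) if (rows[i] >> col) & 1), None)
        if pivot is None:
            return None
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for i in range(len(rows)):
            if i != col and (rows[i] >> col) & 1:
                rows[i] ^= rows[col]
    solution = 0
    for col in range(width):
        if (rows[col] >> width) & 1:
            solution |= 1 << col
    return solution

def ransac_lpn(samples, width, max_trials=20000, accept=0.7, rng=None):
    """Guess a noise-free subset, solve it, and keep the candidate only if it
    agrees with at least `accept` of all samples."""
    rng = rng or random.Random()
    for _ in range(max_trials):
        subset = rng.sample(samples, width)
        candidate = solve_gf2([a | (b << width) for a, b in subset], width)
        if candidate is None:
            continue  # singular draw, try another subset
        agree = sum(parity(a & candidate) == b for a, b in samples)
        if agree >= accept * len(samples):
            return candidate
    return None

rng = random.Random(0)
WIDTH, SECRET = 12, 0b100000100100  # hypothetical taps at offsets 2, 5, 11
data = make_samples(SECRET, WIDTH, 400, 0.2, rng)
print(ransac_lpn(data, WIDTH, rng=rng) == SECRET)
```

The acceptance threshold sits between the two agreement levels you expect to see: about 80% for the true secret and about 50% for a wrong candidate.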

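I did not design for this, and I love that it worked. Two very different solutions: one I planned, and one the participants taught me.

After Round 1

After grading Round 1, we selected the top 30 to advance. It was a varied group: students, published AI researchers, non-engineers, and several tech leads from well-known Korean tech companies.

Nearly every Round 1 submission looked great. Not just “cleared the bar”… genuinely impressive. Clean code, decent reports, reasonable accuracy. Agents are really good now. Telling people apart from the submissions alone was hard.

If the deliverables all look alike, what are you actually selecting for? In Round 2, I wanted to give bonus points to people who actually understand what they are doing: not as a gate, but as a tiebreaker that surfaces the deep kind of skill you will miss most when the AI inevitably gets something wrong.

Every Round 2 problem was red-teamed, except the pen-and-paper exam, which is closed-book by design and has no agents in the room anyway. For the problems where devices were allowed, I burned a lot of compute cycling between problem generation and stress tests against Claude Code, ChatGPT Pro, and autoresearch-style agents. I confirmed that none of the problems could be solved end to end with a single SOTA call. Each one demanded real human design choices on top of what the agents could do. Write the problem, throw it at every SOTA agent I could get access to, watch where it fails, harden the problem, repeat. I am a little embarrassed to admit it, but that loop was the part of the whole two weeks I enjoyed most.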

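Round 2, Day 1 morning — Pen-and-paper exam

This is the bonus-point problem. No devices, no AI, two hours. The whole exam is in fact built as a pair of bridge problems back to the two Round 1 problems.

Q1 (self-attention computed by hand) and Q2 (sampling) are the bridge to MultiplierBoard. If you really did hand-construct a transformer that performs 6-bit multiplication, then computing softmax attention by hand on a 2×2 toy example and applying nucleus sampling to a 3-token distribution should be mechanical.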
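For readers who want to check their own hand calculation, here is a small sketch of both operations. The numbers are made up and are not the exam's; the 1/√d scaling and the "keep tokens until cumulative mass reaches top_p" convention are assumptions, since the exam's exact conventions are not reproduced in this post.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention on lists of row vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

def nucleus(probs, top_p):
    """Top-p filter: keep the highest-probability tokens until their mass
    reaches top_p, then renormalize over the kept tokens."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

# Hypothetical 2x2 toy example and 3-token distribution.
Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
print(nucleus([0.5, 0.3, 0.2], top_p=0.7))
```

In the nucleus example the first two tokens are kept (0.5 + 0.3 reaches 0.7) and the third is dropped before renormalizing.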

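Q3 (the LPN noise-amplification identity) is the bridge to SparseTap. It asks you to quantify how the noise rate changes when samples are XOR-ed together, and XOR reduction is the core operation of BKW. Nearly every Round 1 SparseTap submission mentioned and used BKW in its report, so Q3 checked whether participants actually understood the math of the XOR-reduction step they had relied on.

So the morning exam is effectively a sanity check on honesty: did you really solve the Round 1 problems yourself, or did an agent solve them while you are not quite sure what it built?

A substantial fraction of the top 30 (people who had sailed through Round 1) could not solve the basic attention/sampling questions at all. These were people whose Round 1 submissions were excellent.

I still do not quite know what to do with this finding. On one hand, the role of “knows attention numerically, by hand” has objectively shrunk as tools improved. Maybe that is fine. On the other hand, the role of “can tell when the AI's answer is subtly wrong” has objectively grown. Historically those two skills were correlated. Whether that correlation still holds in 2026 is, I think, the central open question in AI hiring right now.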

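Round 2, Day 1 afternoon — BattlePredict

From here on, anything goes. Bring any AI you like. (Once again, we paid for it.)

This problem comes straight from work our team actually does at KRAFTON. PUBG is a battle royale, which means skill rating at scale, which means dealing with the fact that player skill is non-stationary: people warm up, get bored, take breaks, learn, and decline. An honest MMR system has to handle that. So I built the smallest possible toy version of that real problem and turned it into a contest.

Setup

10 users, each with a linearly time-varying skill s_i(d) = w_i · d + b_i over 50 days. 50 days, 50 matches per day, 5 players per match, outcomes from 1v1 BTL duels. Activity rates also change over time, so a heavy user on day 3 can be a ghost by day 18.

The catch: days 1, 11, and 21 are fully labeled. Days 2–10 and 12–20 are present as 18 anonymous day-blocks with their day labels stripped. Predict each player's total kill count over days 22–50.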

BattlePredict — 50 days of matches, with hidden labels

\[ s_i(d) = w_i \cdot d + b_i \qquad P(i \text{ beats } j) = \sigma\!\left(s_i(d) - s_j(d)\right) \]

Legend: labeled (days 1, 11, 21) · hidden block (day label stripped) · test (days 22–50, predict kills)

10 users with linearly time-varying skill. 50 matches/day, 5 players per match, gauntlet of 1v1 BTL duels. Activity rates also vary in time: heavy users on day 3 may be ghosts by day 18.

The 18 hidden blocks are present but with day labels removed. Recover the labels first, then extrapolate.
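To make the duel model concrete, here is a rough sketch of fitting BTL skills to a list of (winner, loser) duels by gradient ascent on the log-likelihood. It is an illustration under my own assumptions, not the contest's reference code: the slopes and intercepts are invented, duels are drawn between uniformly random pairs, and the step size, iteration count, and tiny L2 penalty are arbitrary choices. Skills are mean-centered on return because BTL only identifies differences.

```python
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def simulate_duels(skills, n_duels, rng):
    """Draw (winner, loser) pairs with P(i beats j) = sigmoid(s_i - s_j)."""
    players = list(range(len(skills)))
    duels = []
    for _ in range(n_duels):
        i, j = rng.sample(players, 2)
        if rng.random() < sigmoid(skills[i] - skills[j]):
            duels.append((i, j))
        else:
            duels.append((j, i))
    return duels

def fit_btl(duels, n_players, steps=300, lr=2.0, l2=1e-3):
    """Gradient ascent on the average BTL log-likelihood.
    The small L2 term only keeps estimates finite if someone never loses."""
    s = [0.0] * n_players
    for _ in range(steps):
        grad = [-l2 * x for x in s]
        for winner, loser in duels:
            # d/ds_winner of log sigmoid(s_winner - s_loser) is 1 - p_win
            residual = (1.0 - sigmoid(s[winner] - s[loser])) / len(duels)
            grad[winner] += residual
            grad[loser] -= residual
        s = [x + lr * g for x, g in zip(s, grad)]
    mean = sum(s) / n_players
    return [x - mean for x in s]

rng = random.Random(0)
w = [0.05, -0.05, 0.00, 0.02]  # hypothetical slopes
b = [-0.5, 0.5, 0.0, 0.1]      # hypothetical intercepts
for day in (1, 21):
    true_skills = [wi * day + bi for wi, bi in zip(w, b)]
    estimate = fit_btl(simulate_duels(true_skills, 2000, rng), len(true_skills))
    print(day, [round(x, 2) for x in estimate])
```

Fitting each labeled day separately, as in the loop at the end, is the cheapest way to see whether the per-day estimates agree with one another.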

The intended pipeline

BattlePredict — the intended pipeline

1. 📊 BTL on the 3 labeled days
2. 🤔 Notice skills disagree across days
3. 📦 BTL within each anonymous day-block
4. 🔀 Recover hidden permutation via Hungarian
5. 📈 Re-fit BTL on full data and extrapolate

Step 2 is the gating insight. Skip it and run BTL on the full data assuming stationarity → you collapse to the naive baseline.

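Every step requires actually understanding the problem. Step 2 is the gating insight: skip it and run BTL on the full data assuming stationarity, as most blind agent runs did, and you will be confidently wrong from there on. Step 4 uses the fact that days 1, 11, and 21 give you anchors for the linear trend. For each anonymous block, you compute a cost matrix of “how well does this block match the interpolated skills for day d?” and solve it as a linear assignment problem with the Hungarian method. With the labels recovered, the per-user skill curves come out beautifully linear, and extrapolating to days 22–50 is straightforward.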
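Step 4 can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference solution: it takes per-block skill vectors as given (in practice they would come from a per-block BTL fit and be mean-centered), uses a squared-error cost of my own choosing, ignores players who are missing from a block, and uses invented slopes and noise levels. The `hungarian` function is a plain-Python version of the standard O(n³) potentials algorithm; with SciPy available, `scipy.optimize.linear_sum_assignment` does the same job.

```python
import random

def hungarian(cost):
    """Min-cost perfect assignment on a square matrix (potentials method).
    Returns, for each row, the index of its assigned column."""
    n = len(cost)
    INF = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (n + 1)
    p, way = [0] * (n + 1), [0] * (n + 1)  # p[j]: row matched to column j (1-based)
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [INF] * (n + 1), [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the augmenting path back to column 0
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment

def interpolate(anchors, day):
    """Piecewise-linear skill vector at `day` from {anchor_day: [skills]}."""
    days = sorted(anchors)
    for lo, hi in zip(days, days[1:]):
        if lo <= day <= hi:
            t = (day - lo) / (hi - lo)
            return [(1 - t) * a + t * c for a, c in zip(anchors[lo], anchors[hi])]
    raise ValueError("day outside the anchor range")

def recover_labels(block_skills, anchors, hidden_days):
    """Assign each anonymous block to a hidden day by squared-error cost."""
    cost = [[sum((x - y) ** 2 for x, y in zip(block, interpolate(anchors, d)))
             for d in hidden_days] for block in block_skills]
    return [hidden_days[j] for j in hungarian(cost)]

# Toy check with hypothetical slopes, intercepts, and noise.
rng = random.Random(1)
w, b = [0.10, -0.05, 0.08], [0.0, 1.0, -0.5]
skill = lambda d: [wi * d + bi for wi, bi in zip(w, b)]
anchors = {d: skill(d) for d in (1, 11, 21)}
hidden = list(range(2, 11)) + list(range(12, 21))
truth = hidden[:]
rng.shuffle(truth)
blocks = [[x + rng.uniform(-0.01, 0.01) for x in skill(d)] for d in truth]
print(recover_labels(blocks, anchors, hidden) == truth)
```

Real block estimates are far noisier than this toy data, which is presumably where a global assignment earns its keep over matching each block to its nearest day independently.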

Internal baselines

After the contest, I computed several reference baselines. Lower scores are better:

BattlePredict — internal baselines (lower is better)

Method	Score
Oracle (true params)	0.0097
BTL with TRUE day labels	0.0462
BTL + recovered labels (intended)	0.0502
BTL on labeled days only (naive)	0.1188

The intended pipeline (blue) gets within striking distance of the oracle that knows the true day labels. The naive baseline (red) is more than 2× worse.

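What happens if you just hand it to an agent?

Almost universally, the agent takes the day-labeled data (days 1, 11, and 21) as is, fits a BTL model to those ~600 duels, and reports an answer. It never looks at the anonymous day-blocks. It does not even notice that per-user skills disagree across the three labeled days. It does not realize that skill is non-stationary, and it makes no attempt to recover the hidden day labels. The output is confident, plausible-looking, and exactly the naive baseline.

The human competitors who succeeded did the opposite. They started with data exploration, and they used AI heavily in that exploration itself. They asked agents to plot per-day skill estimates for the labeled days, to visualize duel and activity distributions across the anonymous blocks, and to generate diagnostics on the fly. Once the plots made the non-stationarity obvious, the rest of the pipeline (per-block BTL → Hungarian → re-fit → extrapolate) became a short, well-defined sequence of instructions that an agent could execute correctly. They did not do the math by hand… they used the agent as a microscope and simply told it where to point.

This was my favorite kind of problem: the agent can carry out every individual step if asked, but it will not put them in the right order on its own. The bottleneck was never the agent's capability. It was knowing what to ask for next.

Round 2, Day 2 — VideoAgent

The grand finale. Five hours. Build an automated video QA agent. We hand you a folder with 20 videos and 20 prompts at T − 15 minutes, and you must submit 20 answers by T.

Design choices that mattered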

VideoAgent — design choices that mattered

🎲 26 answer options. Random baseline ≈ 3.8%. Lucky guessing is not a strategy.

⏯️ No timestamp-jumping. Every question requires watching most of the video: counting, scanning, tracking.

🔊 Audio + visual. At least one question detects a spoken phrase, locates the frame, then counts something visual in it.

🙈 No test-set access. Participants only see a tiny sample. They have to invent their own dev set if they want one.

Test folder released at T − 15 min, submission due at T. 20 videos, 20 questions, 15-minute window.

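Why video QA? Because today, video QA is exactly the kind of task where simply throwing a SOTA multimodal model at the input does not work at all on hard instances. There were in fact two valid routes. One is to design the agent harness by hand (for example frame sampling, audio extraction, multi-pass reasoning, verification) and to pick the models by hand as well. The other is the meta-harness route (arxiv.org/abs/2603.28052), where you let an outer optimizer use AI to tune the harness itself. The catch with the second route: you have to build your own evaluation set first. Participants were given no test set and no training data, so without an evaluation set the meta-optimizer has no hill to climb. Either way, real human work was required.

Results

Despite how hard and open-ended the problem was, the top entry scored close to 50% on a fully hidden test set, from a five-hour cold start with no access to a test or training set. There were plenty of other interesting solutions in the middle of the pack: different decomposition strategies, different fallback heuristics, different ways of cross-checking between passes. The top entry was a clear standout.

My takeaways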

What I'm taking away

The "use an agent" ceiling is very high

A take-home only tells you the candidate uses AI and pays for good services. Not nothing — but a long way from the signal you actually want.

The signal still exists; design for it

Problems where the SOTA agent confidently produces the wrong answer are gold. They surface candidates who notice the mismatch and intervene.

Pen-and-paper is still an open question

"Knows attention by hand" is a (possibly fading) proxy for "can debug AI when it's wrong." I'd love a sharper signal and to retire the proxy.

The hire-worthy skill is AI steering

Not clever algorithms: the top BattlePredict and VideoAgent runs were genuine human brilliance, knowing exactly what to ask the agent next, when to override it, and when to trust it.

The right hire in 2026 reaches for AI by default and overrides it on purpose. Both halves matter.

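A few thoughts after grading hundreds of submissions over two weekends:

Observation	Implication
The ceiling on “doing the obvious thing with an agent” is now very high.	If your hiring funnel is “take-home + read the deliverable,” all you really learn is that the candidate uses AI and is willing to pay for good services. That is not nothing, but it is a long way from the signal you actually want.
The signal still exists; you have to design for it.	Problems where a SOTA agent confidently produces the wrong answer are gold. They surface candidates who notice the mismatch and intervene.
I still think the pen-and-paper exam was the right call.	That said, I do not think “knows attention by hand” matters intrinsically in 2026. It is a (gradually fading) proxy for “can debug the AI when it is wrong.” I would love to find a sharper signal.
The most impressive thing was not clever algorithms: it was clever AI steering.	What amazed me was the genuinely human brilliance behind the top submissions: people who knew exactly what to ask the agent next, when to override its first instinct, and when to trust it. That is the skill actually worth hiring for.

The right hire in 2026 is someone who reaches for AI by default and overrides it on purpose. Both halves matter. This contest was, in a sense, an attempt to measure both halves at once.

The top three winners left with generous cash prizes, and I left with far more interesting and unsettled opinions about hiring than I came in with. For a first edition, that feels right. Beyond the winners, many participants told me this was the most interesting AI hackathon they had ever joined, and a very educational one… honestly, that was the most rewarding part.

Thank you to every participant who showed up, stayed up all night, and shipped something. We will do it again next year (maybe sooner?), and I promise the problems will get weirder.

Hope you enjoyed it :-)

Acknowledgments

Thanks to the roughly 300 participants who joined Round 1, the 30 finalists who came to Seoul, and the entire KRAFTON AI R&D team, who ran the logistics, the grading, and the on-site weekend.

Special thanks to Prof. Dimitris Papailiopoulos for the original AdderBoard challenge, which inspired Round 1, Problem 1.

KRAFTON AI R&D Hackathon · Spring 2026 · Retrospective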