AttentionByHand

KRAFTON AI R&D Hackathon · Round 2 · Day 1 Morning · 120 Minutes · Spring 2026

The only Round 2 problem where AI tools are not allowed. A pen-and-paper exam: closed book, no calculators, no electronic devices. Three problems, 40 points total.

The whole exam is structured as a pair of bridge questions for the two Round 1 problems. Problems 1 and 2 (self-attention by hand and sampling strategies) are the bridge for MultiplierBoard: anyone who genuinely hand-constructed a transformer that multiplies should find this stuff mechanical. Problem 3 (the LPN noise-amplification identity) is the bridge for SparseTap: it asks how the noise parameter compounds when you XOR samples together — the core operation behind BKW — checking whether you understand the math behind the XOR-reduction step you relied on. So the exam is, in effect, a sanity check — did you really do the Round 1 problems yourself?

Rules.
• Duration: 120 minutes
• Closed book. No calculators, no phones, no laptops, no electronic devices of any kind.
• Show your work to receive credit unless noted otherwise.
• Only the work on the FRONT side of each page will be scanned and graded.

Problem	Topic	Points
1	Self-attention by hand	15
2	Sampling strategies	10
3	Learning Parity with Noise	15
Total		40

Problem 1 — Self-Attention by Hand (15 pts)

Assume there are 2 d-dimensional (d = 4) inputs:

x₁ = [1, 0, 0, 2]^T, x₂ = [0, 1, −2, 1]^T

For each input xₖ, a key kₖ, a value vₖ, and a query qₖ are generated by linear combinations of the corresponding input:

kₖ = Wₖₖₖ xₖ, vₖ = Wₜ xₖ, qₖ = Wₚ xₖ

The dimensions of kₖ, vₖ, and qₖ are all 2. These weight matrices would be learned during training. For now, assume:

          [ 0   0   1   0 ]            [ 1   0   0   4 ]            [ 0   1  -2   0 ]
W_k  =    [ 2   1   1   1 ]    W_v  =  [ 2   4   4   0 ]    W_q  =  [ 1   0   0   1 ]

(a) Compute { kₖ, vₖ, qₖ ; i = 1, 2 }.

(b) Compute the dot-product attention scores qₖ^Tk₃ for all i, j = 1, 2.

(c) (2 pt) For the i-th query, collect the corresponding attention scores qₖ^Tk₃ for j = 1, 2 and apply softmax to compute the attention weights:

α_i,j = softmax_j(qₖ^Tk₁, qₖ^Tk₂) = e^{qₖ^Tk₃} / (e^{qₖ^Tk₁} + e^{qₖ^Tk₂})

Do this for both queries i = 1, 2. Use the numerical approximation 1 / (1 + e¹⁰) ≈ 0 to simplify your answer.

(d) Compute an output yₖ for each input as follows:

yₖ = α_i,1 v₁ + α_i,2 v₂, for i = 1, 2.

(e) Repeat (c) and (d) but with causal masking. That is, for the i-th query, collect the attention scores qₖ^Tk₃ only for j = 1, …, i. (Note the difference in the range of j.) Apply softmax to compute the causal attention weights:

α^causal_1,1 = e^{q₁^Tk₁} / e^{q₁^Tk₁} = 1, α^causal_1,2 = 0
α^causal_2,1 = e^{q₂^Tk₁} / (e^{q₂^Tk₁} + e^{q₂^Tk₂}), α^causal_2,2 = e^{q₂^Tk₂} / (e^{q₂^Tk₁} + e^{q₂^Tk₂})

Then compute the causal output y^causal_i = α^causal_i,1 v₁ + α^causal_i,2 v₂ for i = 1, 2.

(f) True or False? You must provide a detailed explanation to get any partial credit.

Next-token prediction modeling requires the use of a causal attention mechanism (i.e., a decoder-only Transformer model).

Problem 2 — Sampling Strategies (10 pts)

There are several strategies to select the next token given a language model’s predicted probabilities. In this problem, you will apply four sampling strategies: greedy sampling, random sampling, top-k sampling, and top-p (nucleus) sampling.

Consider a language model that predicts the next character after seeing the sequence ABC. It gives the following probability distribution:

Pr(X₄ = A | ABC) = 0.1
Pr(X₄ = B | ABC) = 0.3
Pr(X₄ = C | ABC) = 0.6

Greedy sampling. Always selects the token with the highest probability.
Random sampling. Selects tokens based directly on their predicted probabilities without modification.
Top-k sampling. Selects only the top k most probable tokens, discarding the rest, and then re-scales the remaining probabilities to sum to 1.
Top-p sampling (nucleus sampling). Sorts tokens by descending probability and keeps adding tokens from highest to lowest until their cumulative probability exceeds a threshold p. Only those tokens are kept and their probabilities are re-scaled to sum to 1.

(a) Greedy sampling. Which token will the model select next under greedy sampling?

(b) Random sampling. If random sampling is applied, what is the probability of selecting each of A, B, and C?

(c) Top-2 sampling. Under top-2 sampling, which tokens are considered, and what are their probabilities after re-normalization?

(d) Top-p sampling, p = 0.75. Identify the smallest set of tokens whose cumulative probability exceeds 0.75. Which tokens remain, and what are their probabilities after re-normalization?

Problem 3 — Learning Parity with Noise (15 pts)

Definition. Let s ∈ {0, 1}ⁿ be an unknown secret vector. An LPN oracle 𝒪_s,η with noise rate η ∈ (0, 1/2) outputs independent samples (a, b) where:

a is drawn uniformly at random from {0, 1}ⁿ;
b = ⟨a, s⟩ ⊕ e, where ⟨a, s⟩ = a₁s₁ ⊕ … ⊕ a_ns_n is the inner product over 𝔽₂, and e ∼ Bernoulli(η) is an independent noise bit (i.e., Pr[e = 1] = η).

Suppose we draw s independent samples (a₁, b₁), …, (a_s, b_s) from 𝒪_s,η and combine them into a new sample (ã, &btilde;) where:

ã = a₁ ⊕ … ⊕ a_s, &btilde; = b₁ ⊕ … ⊕ b_s

(a) Derive a closed-form expression for Pr[&btilde; = ⟨ã, s⟩] in terms of η and s, and prove your formula.

(b) Compute the numerical value for η = 0.1 and s = 3.

AttentionByHand · KRAFTON AI R&D Hackathon · Round 2, Day 1 Morning

AttentionByHand

KRAFTON AI R&D Hackathon · Round 2 · Day 1 Morning · 120 Minutes · Spring 2026

AI 도구 사용이 금지된 유일한 라운드 2 문제입니다. 종이와 펜으로 풀어야 하는 시험: 폐쇄형, 계산기 금지, 전자기기 금지. 세 문제, 총 40점.

이 시험 전체는 라운드 1의 두 문제에 대한 한 쌍의 다리(bridge) 문제로 구성되어 있습니다. 문제 1과 2 (손으로 계산하는 셀프 어텐션 및 샘플링 전략)는 MultiplierBoard에 대한 다리입니다: 곱셈을 수행하는 트랜스포머를 진짜로 손으로 구성한 사람이라면 이 정도는 기계적으로 풀 수 있어야 합니다. 문제 3 (LPN 노이즈 증폭 항등식)은 SparseTap에 대한 다리입니다: 샘플들을 XOR로 결합할 때 노이즈 비율이 어떻게 변하는지를 정량화하는 문제이며, XOR 축소는 BKW의 핵심 연산입니다 — 본인이 의존했던 XOR 축소 단계의 수학을 정말로 이해하고 있는지 확인하는 문제입니다. 따라서 이 시험은 사실상 정직성 점검(sanity check)입니다 — 라운드 1 문제를 정말 본인이 풀었는가?

규칙.
• 시간: 120분
• 폐쇄형 시험. 계산기, 휴대폰, 노트북 등 어떤 전자기기도 사용 금지.
• 명시되지 않은 한, 풀이 과정을 모두 기재해야 점수를 받을 수 있습니다.
• 각 페이지의 앞면에 작성된 풀이만 스캔되어 채점됩니다.

문제	주제	배점
1	손으로 계산하는 셀프 어텐션	15
2	샘플링 전략	10
3	노이즈가 있는 패리티 학습 (LPN)	15
합계		40

문제 1 — 손으로 계산하는 셀프 어텐션 (15점)

두 개의 d차원 (d = 4) 입력이 있다고 가정합니다:

x₁ = [1, 0, 0, 2]^T, x₂ = [0, 1, −2, 1]^T

각 입력 xₖ에 대해, 키 kₖ, 밸류 vₖ, 쿼리 qₖ가 해당 입력의 선형 결합으로 생성됩니다:

kₖ = Wₖₖₖ xₖ, vₖ = Wₜ xₖ, qₖ = Wₚ xₖ

kₖ, vₖ, qₖ의 차원은 모두 2입니다. 이 가중치 행렬들은 학습 시에 학습되지만, 지금은 다음과 같다고 가정합니다:

          [ 0   0   1   0 ]            [ 1   0   0   4 ]            [ 0   1  -2   0 ]
W_k  =    [ 2   1   1   1 ]    W_v  =  [ 2   4   4   0 ]    W_q  =  [ 1   0   0   1 ]

(a) { kₖ, vₖ, qₖ ; i = 1, 2 }를 계산하세요.

(b) 모든 i, j = 1, 2에 대해 도트-프로덕트 어텐션 점수 qₖ^Tk₃를 계산하세요.

α_i,j = softmax_j(qₖ^Tk₁, qₖ^Tk₂) = e^{qₖ^Tk₃} / (e^{qₖ^Tk₁} + e^{qₖ^Tk₂})

두 쿼리 i = 1, 2 모두에 대해 수행하세요. 답을 단순화하기 위해 1 / (1 + e¹⁰) ≈ 0이라는 수치 근사를 사용하세요.

(d) 각 입력에 대한 출력 yₖ를 다음과 같이 계산하세요:

yₖ = α_i,1 v₁ + α_i,2 v₂, (i = 1, 2)

(e) 인과 마스킹을 적용하여 (c), (d)를 다시 풀어보세요. 즉, i번째 쿼리에 대해 j = 1, …, i의 어텐션 점수만 모읍니다. (j의 범위가 다른 점에 유의하세요.) 인과 어텐션 가중치는 다음과 같이 계산됩니다:

그런 다음 인과 출력 y^causal_i = α^causal_i,1 v₁ + α^causal_i,2 v₂ (i = 1, 2)을 계산하세요.

(f) 다음 명제는 참입니까, 거짓입니까? 부분 점수를 받으려면 자세한 설명을 제공해야 합니다.

다음 토큰 예측 모델링은 인과 어텐션 메커니즘 (즉, 디코더 전용 트랜스포머 모델)의 사용을 필요로 한다.

문제 2 — 샘플링 전략 (10점)

언어 모델의 예측 확률이 주어졌을 때 다음 토큰을 선택하는 여러 가지 전략이 있습니다. 이 문제에서는 네 가지 샘플링 전략 — 그리디 샘플링, 랜덤 샘플링, top-k 샘플링, top-p (뉴클리어스) 샘플링 — 을 적용해 봅니다.

시퀀스 ABC를 본 후 다음 문자를 예측하는 언어 모델을 가정합니다. 다음과 같은 확률 분포를 출력합니다:

Pr(X₄ = A | ABC) = 0.1
Pr(X₄ = B | ABC) = 0.3
Pr(X₄ = C | ABC) = 0.6

그리디 샘플링. 항상 가장 높은 확률의 토큰을 선택합니다.
랜덤 샘플링. 예측된 확률 그대로에 따라 토큰을 선택합니다.
Top-k 샘플링. 가장 확률이 높은 상위 k개의 토큰만 선택하고, 나머지는 버리고, 남은 확률을 합이 1이 되도록 재정규화합니다.
Top-p 샘플링 (뉴클리어스 샘플링). 토큰을 확률 내림차순으로 정렬한 후, 누적 확률이 임계값 p를 초과할 때까지 토큰을 추가합니다. 그 토큰들만 남기고 합이 1이 되도록 재정규화합니다.

(a) 그리디 샘플링. 그리디 샘플링에서 모델이 선택하는 다음 토큰은 무엇입니까?

(b) 랜덤 샘플링. 랜덤 샘플링을 적용한다면, A, B, C 각각이 선택될 확률은 얼마입니까?

(d) Top-p 샘플링, p = 0.75. 누적 확률이 0.75를 초과하는 가장 작은 토큰 집합을 찾으세요. 어떤 토큰이 남으며, 재정규화 후의 확률은 어떻게 됩니까?

문제 3 — 노이즈가 있는 패리티 학습 (LPN) (15점)

정의. s ∈ {0, 1}ⁿ을 알려지지 않은 비밀 벡터라고 합시다. 노이즈 비율 η ∈ (0, 1/2)를 가진 LPN 오라클 𝒪_s,η는 다음과 같은 독립 샘플 (a, b)를 출력합니다:

a는 {0, 1}ⁿ에서 균등 무작위로 추출;
b = ⟨a, s⟩ ⊕ e, 여기서 ⟨a, s⟩ = a₁s₁ ⊕ … ⊕ a_ns_n은 𝔽₂ 위의 내적이고, e ∼ Bernoulli(η)는 독립적인 노이즈 비트 (즉, Pr[e = 1] = η).

𝒪_s,η로부터 s개의 독립 샘플 (a₁, b₁), …, (a_s, b_s)를 추출하여 새로운 샘플 (ã, &btilde;)으로 결합한다고 가정합시다:

ã = a₁ ⊕ … ⊕ a_s, &btilde; = b₁ ⊕ … ⊕ b_s

(a) η와 s의 함수로 Pr[&btilde; = ⟨ã, s⟩]의 닫힌 형태(closed-form) 식을 유도하고, 그 식을 증명하세요.

(b) η = 0.1, s = 3일 때의 수치 값을 계산하세요.

AttentionByHand · KRAFTON AI R&D Hackathon · Round 2, Day 1 Morning