MultiplierBoard

KRAFTON AI R&D Hackathon · Round 1 · Day 1 · 4 Hours · Spring 2026

AdderBoard challenged the community to build the smallest transformer that can add two numbers. We do the same for binary multiplication. Your mission over the next four hours:

Build the smallest transformer that can multiply two 6-bit binary numbers.
Two problems: (1-1) hand-coded weights with a correctness proof, (1-2) trained weights. You must solve both. Submit three numbers and a short report.

Binary Multiplication

         1 0 1 1          A = 11
       × 1 1 0 1          B = 13
       ─────────
         1 0 1 1          A × b₀
       0 0 0 0 ·          A × b₁, shifted
     1 0 1 1 · ·          A × b₂, shifted
   1 0 1 1 · · ·          A × b₃, shifted
   ─────────────
   1 0 0 0 1 1 1 1      Product = 143

Binary long multiplication: generate partial products (AND + shift), then sum them.
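
The diagram above can be read as a two-step procedure: AND each bit of A with one bit of B, shift the result, then add. A minimal pure-Python sketch of that procedure follows; the function name `long_multiply` is illustrative and not part of the challenge.

```python
def long_multiply(a: int, b: int, n_bits: int = 6) -> int:
    """Multiply two n_bits-wide unsigned integers via partial products."""
    product = 0
    for i in range(n_bits):
        b_i = (b >> i) & 1                      # i-th bit of B
        partial = 0
        for j in range(n_bits):
            a_j = (a >> j) & 1                  # j-th bit of A
            partial |= (a_j & b_i) << (i + j)   # AND, then shift by i
        product += partial                      # sum the partial products
    return product


print(long_multiply(11, 13))  # the example in the diagram
```

The carries hidden inside `product += partial` are the part a transformer has to implement or learn; the AND and shift steps are the easy half.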

Problem Specification

Task. 6-bit binary multiplication. Given a, b ∈ {0, 1, …, 63}, compute a × b using a transformer model.
Problem 1-1. Find transformer weights that perform exact multiplication, using as few parameters as possible. Prove correctness.
Problem 1-2. Find a transformer architecture that can be trained to ≥99% accuracy on multiplication, using as few parameters as possible.
Metric. Estimate your test accuracy on 10,000 randomly sampled (a, b) pairs.
Decoding. Greedy decoding (argmax) at every output position.
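
As a rough illustration of the metric, the sketch below samples 10,000 pairs and scores a predictor. Two things here are assumptions, not part of the specification: an example is scored correct only when all 12 output bits match, and `predict` is a hypothetical callable that maps the 12 input bits (in the LSB-first format of the Fixed Format section) to 12 output bits.

```python
import random
from typing import Callable, List


def to_bits_lsb(value: int, width: int) -> List[int]:
    """Zero-padded binary digits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def estimate_accuracy(predict: Callable[[List[int]], List[int]],
                      n_samples: int = 10_000, seed: int = 0) -> float:
    """Fraction of sampled pairs whose 12 predicted bits all match a * b."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        a, b = rng.randrange(64), rng.randrange(64)
        prompt = to_bits_lsb(a, 6) + to_bits_lsb(b, 6)
        # Assumption: exact match on the whole 12-bit product.
        correct += list(predict(prompt)) == to_bits_lsb(a * b, 12)
    return correct / n_samples
```

In practice `predict` would wrap the model's greedy decoding loop.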

Fixed Format: LSB-First Binary

All submissions must use the following format. No exceptions.

Vocabulary. 2 tokens: 0, 1 (token IDs 0, 1).
Input. A₀ A₁ A₂ A₃ A₄ A₅ B₀ B₁ B₂ B₃ B₄ B₅ (12 tokens). Operands are 6-bit, zero-padded, LSB first. Positions are fixed — no separators needed.
Output. P₀ P₁ … P₁₁ (12 tokens). Product is 12-bit, zero-padded, LSB first.
Full sequence. 24 tokens total. The model generates P₀…P₁₁ autoregressively, conditioned on the 12-token input.

Worked Examples

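23 × 37 = 851
A = 23 = 010111 (written MSB first)
B = 37 = 100101 (written MSB first)
P = 851 = 001101010011 (written MSB first)

Token sequence (LSB first: A₀…A₅, then B₀…B₅, then P₀…P₁₁):
1 1 1 0 1 0   1 0 1 0 0 1   1 1 0 0 1 0 1 0 1 1 0 0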

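63 × 63 = 3969
A = 63 = 111111 (written MSB first)
B = 63 = 111111 (written MSB first)
P = 3969 = 111110000001 (written MSB first)

Token sequence (LSB first: A₀…A₅, then B₀…B₅, then P₀…P₁₁):
1 1 1 1 1 1   1 1 1 1 1 1   1 0 0 0 0 0 0 1 1 1 1 1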
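
The fixed format can be reproduced in a few lines of Python, which is handy for generating data and for checking the two worked examples. The helper names `to_bits_lsb`, `from_bits_lsb`, and `encode_example` are illustrative, not part of the specification.

```python
from typing import List


def to_bits_lsb(value: int, width: int) -> List[int]:
    """Zero-padded binary digits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits_lsb(bits: List[int]) -> int:
    """Inverse of to_bits_lsb."""
    return sum(bit << i for i, bit in enumerate(bits))


def encode_example(a: int, b: int) -> List[int]:
    """Full 24-token sequence: A0..A5, B0..B5, P0..P11."""
    return to_bits_lsb(a, 6) + to_bits_lsb(b, 6) + to_bits_lsb(a * b, 12)


print(" ".join(map(str, encode_example(23, 37))))
print(" ".join(map(str, encode_example(63, 63))))
```

The two printed lines should match the token sequences of the worked examples, without the extra spacing between segments.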

Architecture Rules

Python + PyTorch. All code must be in Python using PyTorch.
Self-attention required. At least one self-attention layer.
Autoregressive. The model receives a token sequence and predicts the next token, one at a time.
Standard forward pass. forward() takes a tensor of token IDs and returns logits. No if/else branches, lookup tables, or control flow that encodes multiplication logic.
Generic inference. Problem-specific knowledge lives only in the weights, not in the code.
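
A generic inference loop consistent with these rules might look like the sketch below. It assumes only that `model(ids)` returns logits of shape (batch, length, 2); the function name `greedy_generate` is hypothetical. Note that the loop contains no knowledge of multiplication: it only appends the argmax token 12 times.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def greedy_generate(model: nn.Module, prompt: torch.Tensor, n_out: int = 12) -> torch.Tensor:
    """Append n_out greedily decoded tokens to prompt and return only the new tokens.

    prompt: LongTensor of shape (batch, 12) holding A0..A5, B0..B5.
    """
    model.eval()
    seq = prompt
    for _ in range(n_out):
        logits = model(seq)                                       # (batch, len, 2)
        next_tok = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice
        seq = torch.cat([seq, next_tok], dim=1)
    return seq[:, prompt.shape[1]:]                               # (batch, n_out)
```

This version re-runs the full forward pass at every step, which is simple and cheap at 24 tokens; a key-value cache is unnecessary at this scale.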

What’s Allowed

Any positional encoding: learned, sinusoidal, RoPE, ALiBi, etc.
Any activation function: ReLU, GELU, SwiGLU, etc.
Weight tying and parameter sharing of all kinds
Low-rank / factorized projections
Custom embedding strategies for the fixed 2-token vocabulary
Any architecture choices for Problem 1-2 (training protocol is fixed)

Parameter Counting

Count unique parameters after weight tying/deduplication.
Fixed positional encodings (sinusoidal, RoPE with fixed θ) do not count, following the original Transformer paper.
Learned positional encodings do count.
Bias terms count. All nn.Parameter values count.
When in doubt: any tensor that would have requires_grad=True during training counts.
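
One way to apply these counting rules in PyTorch is sketched below. It relies on `nn.Module.parameters()` yielding each shared `nn.Parameter` object once, so tied weights are not double-counted, and on buffers (for example a fixed sinusoidal table) not being parameters at all. Whether the organizers count in exactly this way is an assumption; `count_parameters` and `TiedToy` are illustrative names.

```python
import torch
import torch.nn as nn


def count_parameters(model: nn.Module) -> int:
    """Unique trainable scalars; shared Parameter objects are counted once."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


class TiedToy(nn.Module):
    """Toy module: output weight tied to the embedding, fixed positions as a buffer."""

    def __init__(self, d_model: int = 4):
        super().__init__()
        self.emb = nn.Embedding(2, d_model)            # 2 * 4 = 8 parameters
        self.head = nn.Linear(d_model, 2)              # weight is tied below, bias adds 2
        self.head.weight = self.emb.weight             # weight tying: same Parameter object
        self.register_buffer("pos", torch.zeros(24, d_model))  # fixed, not counted

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.emb(ids) + self.pos[: ids.shape[1]])


print(count_parameters(TiedToy()))  # 8 tied weights + 2 biases = 10
```

Sharing that is implemented by slicing or reusing parts of one tensor inside `forward()` is already a single parameter, so it is also counted once.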

Problem 1-1: Hand-Coded Weights

Proof required. You must provide a written correctness argument explaining why your hand-coded weights produce the right answer. Walk through the computation layer by layer: what does each attention head attend to, what does each MLP compute, and how do the outputs compose into the final product bits. A numerical dump of weights without explanation will receive no credit.
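
A written proof is what earns credit, but because there are only 64 × 64 = 4,096 input pairs, an exhaustive check is a cheap sanity test to run alongside it. The sketch below assumes a hypothetical `predict` callable that maps the 12 input bits to 12 output bits (for example, a wrapper around greedy decoding of your hand-coded model).

```python
from typing import Callable, List, Tuple


def to_bits_lsb(value: int, width: int) -> List[int]:
    """Zero-padded binary digits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def exhaustive_check(predict: Callable[[List[int]], List[int]]) -> List[Tuple[int, int]]:
    """Return every (a, b) pair on which predict disagrees with a * b."""
    failures = []
    for a in range(64):
        for b in range(64):
            prompt = to_bits_lsb(a, 6) + to_bits_lsb(b, 6)
            if list(predict(prompt)) != to_bits_lsb(a * b, 12):
                failures.append((a, b))
    return failures
```

An empty result confirms the weights on every input, and a non-empty one points directly at the operand pairs where the layer-by-layer argument breaks down.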

Problem 1-2: Trained Weights

You submit only the architecture: a build_model() function that returns an untrained nn.Module. We train and evaluate it using the fixed procedure below.
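
For orientation, a `build_model()` submission could be shaped like the sketch below. The class name `TinyCausalTransformer` and every size in it (hidden dimension 16, 2 heads, one layer, learned positions, GELU) are placeholder assumptions chosen for readability, not a recommended or known-good configuration.

```python
import torch
import torch.nn as nn


class TinyCausalTransformer(nn.Module):
    """One-layer decoder-only transformer over the 2-token vocabulary."""

    def __init__(self, d_model: int = 16, n_heads: int = 2, max_len: int = 24):
        super().__init__()
        self.tok = nn.Embedding(2, d_model)        # embeddings for tokens 0 and 1
        self.pos = nn.Embedding(max_len, d_model)  # learned positions (these count)
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 2 * d_model), nn.GELU(), nn.Linear(2 * d_model, d_model)
        )
        self.head = nn.Linear(d_model, 2)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        seq_len = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(seq_len, device=ids.device))
        # Boolean causal mask: True marks key positions a query may not attend to.
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=ids.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return self.head(x)  # logits of shape (batch, seq_len, 2)


def build_model() -> nn.Module:
    """Return an untrained model; initialization is left to PyTorch defaults."""
    return TinyCausalTransformer()
```

Shrinking the parameter count from here is the actual challenge: weight tying, fixed positional encodings, and low-rank projections are all permitted levers.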

Fixed training protocol. You may not modify any of the following.

Initialization. PyTorch defaults.
Training data. 100,000 random pairs, each a, b ∈ [0, 63].
Optimizer. AdamW(lr=1e-3, weight_decay=0.01).
Schedule. Cosine annealing over 200 epochs, batch size 256.
Loss. Cross-entropy on the 12 output-token positions only.
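
To make the protocol concrete, the sketch below shows one way the loss masking and optimizer setup could be written. Two details are assumptions because the protocol does not state them: the cosine schedule is stepped once per epoch (`T_max=200`), and `logits[:, t]` is taken to predict the token at position `t + 1`. The function names are illustrative, and this is not the organizers' training script.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_IN, N_OUT = 12, 12  # 12 input tokens, 12 product tokens


def output_only_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over P0..P11 only.

    logits: (batch, L, 2) with L >= 23, from feeding the sequence to the model.
    tokens: (batch, 24) full sequences.
    """
    # Positions 11..22 of the logits predict tokens 12..23 (the product bits).
    pred = logits[:, N_IN - 1 : N_IN + N_OUT - 1, :]
    target = tokens[:, N_IN : N_IN + N_OUT]
    return F.cross_entropy(pred.reshape(-1, 2), target.reshape(-1))


def make_optimizer(model: nn.Module, epochs: int = 200):
    """AdamW and cosine annealing with the fixed hyperparameters."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
    # Assumption: scheduler.step() is called once per epoch.
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched
```

Because the loss ignores the 12 input positions, the model is never penalized for failing to predict the operand bits, which are random and carry no learnable structure.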

Submission Format

Three Numbers

Submit exactly these values (e.g., via a form or spreadsheet row):

Field	Type	Description
`P_1`	int	Parameter count, Problem 1-1 (hand-coded). −1 if no solution found.
`P_2`	int	Parameter count, Problem 1-2 (architecture for training)
`Acc_2`	float	Accuracy, Problem 1-2 (0.0 – 1.0), from the fixed training protocol

Submission File (ZIP)

Submit a single .zip file containing:

Report (PDF, max 2 pages) covering:
- Architecture description. Diagram your transformer’s structure: number of layers, heads, hidden dimension, positional encoding type, activation functions, and any weight-tying schemes.
- Problem 1-1 approach & correctness proof. What algorithm does the transformer implement? Provide a layer-by-layer proof of why it works: which bits each attention head attends to, what each MLP computes, and how the outputs compose into the correct product. This is not optional — weights without a proof receive no credit.
- Problem 1-2 approach. What architecture did you choose and why? Show a training curve (loss and/or accuracy vs. epoch). How does accuracy change with model size?
- Ablations & failed attempts. What design choices mattered most? What did you try that didn’t work?
Code — a single Python file (.py) that defines your model and reproduces your results.

