VideoAgent

KRAFTON AI R&D Hackathon · Round 2 · Problem 3 · 5 Hours · Spring 2026

Design and build an automated video analysis system that can watch videos and answer objective, factual multiple-choice questions about their visual content. Every question has a single unambiguous ground-truth answer — there is nothing subjective or open to interpretation. Your system must process 20 video–prompt pairs and return one answer letter per pair.

Answer 20 multiple-choice questions about 20 videos.
Submit a single string of exactly 20 letters (e.g., ABCAZBBGHADDCBAADBCA).

Test-Time Input

At test time your agent will receive a folder with the following structure:

test_folder/
  video1.mp4
  video2.mp4
  …
  video20.mp4
  prompt1.txt
  prompt2.txt
  …
  prompt20.txt

File	Description
`videoN.mp4`	A video file, up to 1200 seconds in length
`promptN.txt`	A multiple-choice question about `videoN.mp4`

Each promptN.txt contains a question with labeled answer choices. The number of choices may vary across prompts.
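The folder layout above can be loaded with a few lines of Python. This is only a minimal sketch: the function name `find_pairs` is hypothetical, and it assumes the prompt files are UTF-8 text and that `N` is a plain decimal number with no zero padding.

```python
import re
from pathlib import Path


def find_pairs(folder):
    """Return {N: (video_path, prompt_text)} for each videoN.mp4 that has a promptN.txt."""
    folder = Path(folder)
    pairs = {}
    for video in folder.glob("video*.mp4"):
        match = re.fullmatch(r"video(\d+)\.mp4", video.name)
        if not match:
            continue  # ignore files such as video_backup.mp4
        n = int(match.group(1))
        prompt = folder / f"prompt{n}.txt"
        if prompt.exists():
            # Assumption: prompts are UTF-8 encoded.
            pairs[n] = (video, prompt.read_text(encoding="utf-8"))
    # Sort numerically so that video10 comes after video2, not after video1.
    return dict(sorted(pairs.items()))
```

Sorting by the integer `N` rather than by file name matters later, because the position of each letter in the answer string depends on that order.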

Example Prompts

prompt1.txt
How many cars pass through the intersection in video1.mp4?

A) 3
B) 5
C) 7
D) 9

prompt2.txt
In which direction does the ball move at the end of video2.mp4?

A) Left to right
B) Right to left
C) Upward
D) Downward

Question Categories

All questions are objective and verifiable — every answer can be determined by carefully observing the video. Question types include but are not limited to:

Category	Example
Counting	How many people / cars / objects appear?
Color & appearance	What color is the largest object?
Spatial relations	Which object is closest to the camera?
Motion & direction	In which direction does the person walk?
Temporal ordering	Which event happens first?
Presence / absence	Does a bicycle appear at any point?

Important. The examples above are illustrative only. Your agent must not assume any particular question format or number of choices. However, every question will be objective — no opinion, no aesthetic judgment, no ambiguity. If you watch the video carefully enough, the answer is unambiguous.
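One way to avoid assuming a fixed number of choices is to read the labels out of each prompt at run time. The sketch below is an assumption about the prompt layout, not a specification: it treats any line that starts with a single capital letter followed by `)`, `.` or `:` as a choice, which matches the two example prompts but may not match every test prompt. The names `CHOICE_RE` and `parse_prompt` are hypothetical.

```python
import re

# Assumption: a choice line looks like "A) text", "A. text" or "A: text".
CHOICE_RE = re.compile(r"^\s*([A-Z])[).:]\s*(.+?)\s*$")


def parse_prompt(text):
    """Split a prompt into (question, {letter: choice_text})."""
    question_lines = []
    choices = {}
    for line in text.splitlines():
        match = CHOICE_RE.match(line)
        if match:
            choices[match.group(1)] = match.group(2)
        elif line.strip() and not choices:
            # Non-empty lines before the first choice belong to the question.
            question_lines.append(line.strip())
    return " ".join(question_lines), choices
```

The set of parsed letters can then be used to reject any model reply that names a letter the prompt never offered.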

Requirements

Generality. Your system must handle arbitrary video questions without hard-coded assumptions about the task type.
Throughput. The test folder will be released 15 minutes before the deadline. Your system must process all 20 videos and produce answers within that window.
Automation. Once started, the system should run end-to-end without manual intervention.

No architectural constraints. You are free to build a multi-step agentic pipeline, or a single model call, or anything in between. Use whatever approach you believe will maximize accuracy within the time budget.

Latency Budget

Time Constraint
Videos released: T − 15 min
Submission deadline: T

→ 20 videos in ≤ 15 min
→ ~45 seconds per video (sequential)
→ Consider parallelization to stay within budget

Videos can be up to 1200 seconds (20 minutes) long. You should carefully plan your video processing pipeline — frame sampling, chunking, summarization — to balance accuracy against the strict time budget.
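The budget arithmetic above, and a simple fixed-interval sampling plan, can be sketched as follows. Both helper names are hypothetical. The `workers` parameter assumes perfectly parallel scaling, which is optimistic, and `reserve_seconds` is an illustrative margin for submitting the form, not a value taken from the problem statement.

```python
def per_video_budget(total_seconds=15 * 60, n_videos=20, workers=1, reserve_seconds=0):
    """Wall-clock seconds available per video, assuming ideal parallel scaling."""
    return (total_seconds - reserve_seconds) * workers / n_videos


def sample_timestamps(duration_seconds, n_frames):
    """Timestamps (in seconds) at the midpoints of n_frames equal segments."""
    n_frames = max(1, n_frames)
    step = duration_seconds / n_frames
    return [round((i + 0.5) * step, 3) for i in range(n_frames)]


print(per_video_budget())                               # 45.0
print(per_video_budget(workers=4, reserve_seconds=120)) # 156.0
print(sample_timestamps(1200, 4))                       # [150.0, 450.0, 750.0, 1050.0]
```

Sampling at segment midpoints keeps the first and last frames away from fades and title cards. The right value of `n_frames` depends on the question type: counting and temporal-ordering questions usually need denser sampling than color or presence questions.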

Design Considerations

You are free to use any tools, APIs, or models. Some strategies to consider:

Frame sampling: Extract key frames at fixed intervals or using scene-change detection.
Video chunking: Split long videos into segments and analyze independently.
Multi-modal models: Use vision-language models that accept image or video inputs.
Temporal tracking: Track objects or events across frames to answer ordering and counting questions.
Parallel processing: Process multiple videos concurrently to meet the time budget.
Specialized vision tools: Object detection, tracking, or counting models as pre-processing steps.
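The parallel-processing strategy can be combined with a hard deadline so that a slow or failing video never blocks the submission. The following is a rough sketch under stated assumptions: `answer_fn` is a hypothetical per-video solver that takes `(video_path, prompt_text)` and returns one letter, the work is assumed to be I/O-bound (for example remote model calls) so threads are adequate, and the fallback letter `"A"` is an arbitrary choice. It requires Python 3.9 or later for `cancel_futures`.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_all(pairs, answer_fn, deadline_seconds, workers=4, fallback="A"):
    """Answer every pair in parallel; use `fallback` for errors and timeouts.

    pairs: {N: (video_path, prompt_text)}
    answer_fn: hypothetical solver, answer_fn(video_path, prompt_text) -> letter
    """
    answers = {n: fallback for n in pairs}
    pool = ThreadPoolExecutor(max_workers=workers)
    futures = {
        pool.submit(answer_fn, video, prompt): n
        for n, (video, prompt) in pairs.items()
    }
    # Block until everything finishes or the deadline passes, whichever is first.
    done, _not_done = wait(futures, timeout=deadline_seconds)
    for future in done:
        if future.exception() is None:
            answers[futures[future]] = future.result()
    # Drop queued tasks. Threads already running are not killed, only ignored.
    pool.shutdown(wait=False, cancel_futures=True)
    return answers
```

A guessed letter still has a chance of scoring, while a missing one makes the answer string invalid, so a fallback per video is generally preferable to aborting the run.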

Hint. Naively feeding a video to a state-of-the-art multimodal model (e.g., Gemini) and asking for the answer is not sufficient for many of these questions. We encourage you to try this baseline early and observe where it fails — that will inform the design of your system.

Evaluation

Each of the 20 answers is scored as correct (1) or incorrect (0).

$$\mathrm{Score} = \frac{1}{20} \sum_{i=1}^{20} \mathbb{1}\left[\mathrm{answer}_i = \mathrm{ground\_truth}_i\right]$$

A score of 1.0 is perfect; 0.0 means no correct answers. Higher is better.
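For local testing on a self-made validation set, the metric can be reproduced in a few lines. This is a sketch of the formula as stated, not the organizers' grading script. The function name `score` is hypothetical.

```python
def score(answers, ground_truth):
    """Fraction of positions where the submitted letter equals the ground truth."""
    if len(answers) != len(ground_truth):
        raise ValueError("answer string and ground truth must have the same length")
    correct = sum(a == g for a, g in zip(answers, ground_truth))
    return correct / len(ground_truth)


print(score("ABCD", "ABCA"))  # 0.75
```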

Submission Format

Submit via the provided Google Form:

Answer key: A single string of exactly 20 letters, where the N-th letter is your answer for promptN.txt / videoN.mp4.
Example: ABCAZBBGHADDCBAADBCA
Report: A single zip file containing your report, explaining your agent design, pipeline architecture, and any interesting findings.
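Before pasting the answer key into the form, it may help to assemble and validate it programmatically. The sketch below assumes answers are held in a dict keyed by `N` (1 to 20) and that each answer is a single uppercase ASCII letter. The function name `build_answer_key` is hypothetical.

```python
def build_answer_key(answers, n_questions=20):
    """Join per-video letters into one string, ordered by N = 1..n_questions."""
    missing = [n for n in range(1, n_questions + 1) if n not in answers]
    if missing:
        raise ValueError(f"no answer for videos: {missing}")
    for n in range(1, n_questions + 1):
        letter = answers[n]
        is_valid = (
            isinstance(letter, str)
            and len(letter) == 1
            and letter.isascii()
            and letter.isupper()
        )
        if not is_valid:
            raise ValueError(f"answer for video {n} is not a single letter: {letter!r}")
    return "".join(answers[n] for n in range(1, n_questions + 1))
```

Failing loudly on a missing or malformed entry is deliberate here: a key of the wrong length would shift every later letter to the wrong question.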

Tip. Spend the first 4 hours 45 minutes designing, building, and testing your system. Use the final 15 minutes to run it on the released test set and submit.

