Gemini Embedding As A Universal Multimodal Encoder for Open LLMs

Junhyuck Kim · Kangwook Lee · March 2026

Google’s gemini-embedding-2-preview takes any audio or image and returns a single 3072-dimensional vector. It’s designed for retrieval and similarity search. But we had a different question: can these embeddings serve as a universal multimodal encoder for an open-weight LLM?

The idea is borrowed from multimodal LLMs that use specialized encoders (CLIP for vision, Whisper for audio) — but here we replace all of them with a single general-purpose embedding API. Call the Gemini Embedding API, get a vector, project it through a small learned adaptor into “virtual tokens,” and prepend those to a text prompt inside a frozen language model. One encoder for all modalities.

The Pipeline

Every input — whether an image or an audio clip — goes through the same three-stage pipeline. The Gemini Embedding API is frozen; only the adaptor (a small MLP) is trained. The LLM (Qwen3-4B) stays completely frozen.

Stage 1 — Frozen Embedding API

The input, an audio clip 🎤 or an image 🖼, goes through a single call to the frozen Gemini Embedding API (gemini-embedding-2-preview). One API call returns one vector e ∈ ℝ^3072.

Stage 2 — Learned Adaptor (MLP)

The embedding e ∈ ℝ^3072 passes through a learned MLP adaptor (17M parameters): Linear(3072, 4×d) → GELU → Linear(4×d, k×d). This projects the embedding into k virtual tokens in ℝ^(k×d), each of dimension d = the LLM hidden size. The adaptor trains in under a minute on a single GPU.

Stage 3 — Frozen LLM with Prefix

The virtual tokens v₁ … v_k are prepended to the text prompt (e.g. “What object is in this image? Answer with one word.”) and fed to the frozen Qwen3-4B, which generates a free-form answer via greedy decoding.
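The three stages can be sketched in a few lines of numpy. This is an illustrative stand-in, not the actual implementation: `gemini_embed` fakes the API call with random numbers, and the hidden size `D` is shrunk to keep the toy light (Qwen3-4B's real hidden size is much larger).

```python
import numpy as np

EMBED_DIM = 3072    # Gemini embedding size (real)
D = 64              # LLM hidden size d, shrunk for this toy
K = 1               # number of virtual tokens k

rng = np.random.default_rng(0)

def gemini_embed(media: bytes) -> np.ndarray:
    """Stage 1 stand-in: one API call -> one 3072-d vector."""
    return rng.standard_normal(EMBED_DIM)

# Stage 2: the learned adaptor, Linear(3072, 4*d) -> GELU -> Linear(4*d, k*d).
W1 = rng.standard_normal((EMBED_DIM, 4 * D)) * 0.02
W2 = rng.standard_normal((4 * D, K * D)) * 0.02

def gelu(x: np.ndarray) -> np.ndarray:
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def virtual_tokens(media: bytes) -> np.ndarray:
    e = gemini_embed(media)          # e in R^3072
    v = gelu(e @ W1) @ W2            # adaptor MLP
    return v.reshape(K, D)           # k virtual tokens of dimension d

# Stage 3 would prepend these rows to the prompt embeddings of the frozen LLM.
tokens = virtual_tokens(b"<audio or image bytes>")
print(tokens.shape)   # (1, 64)
```

The only learned parameters in the whole pipeline are `W1` and `W2` (plus biases in the real MLP); everything before and after them stays frozen.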

How is the adaptor trained?

Each downstream task provides the supervision signal. For image classification, we have (image, label) pairs — e.g. a CIFAR-10 photo paired with “horse”. The image is embedded via the frozen Gemini API, projected through the adaptor into virtual tokens, prepended to a task prompt like “What object is in this image? Answer with one word.”, and fed to the frozen LLM. The LLM autoregressively generates an answer, and we compute a cross-entropy loss against the ground-truth label tokens. Only the adaptor’s weights are updated; everything else stays frozen. The same recipe applies to audio tasks — swap the image for an audio clip and the prompt for something like “Transcribe the spoken command.”
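The recipe above can be illustrated with a deliberately tiny toy: a frozen linear readout stands in for the LLM, a single trainable matrix stands in for the adaptor, and the loss is cross-entropy on the label token. All names and dimensions here are illustrative, not the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, D, VOCAB = 8, 4, 5            # tiny stand-ins for 3072, d, vocab size

W_adapt = rng.standard_normal((EMB, D)) * 0.1    # trainable: the adaptor
W_llm   = rng.standard_normal((D, VOCAB)) * 0.1  # frozen: stands in for the LLM

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def step(emb, label, lr=0.1):
    """One SGD step: cross-entropy against the ground-truth label token."""
    global W_adapt
    h = emb @ W_adapt                 # virtual token
    p = softmax(h @ W_llm)            # frozen LLM's next-token distribution
    loss = -np.log(p[label])
    # Gradient flows back through the frozen W_llm, but only W_adapt moves.
    dlogits = p.copy()
    dlogits[label] -= 1.0
    W_adapt -= lr * np.outer(emb, dlogits @ W_llm.T)
    return loss

emb = rng.standard_normal(EMB)        # stands in for one Gemini embedding
losses = [step(emb, label=2) for _ in range(200)]
print(losses[0] > losses[-1])         # True: the loss falls as only the adaptor learns
```

The key property the toy preserves: gradients pass *through* the frozen model to reach the adaptor, but no frozen weight is ever updated.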

The adaptor is tiny: two linear layers with a GELU activation, 17M parameters. Training takes under a minute per task on a single GPU. The LLM stays completely frozen. Everything is evaluated with free generation: greedy decoding, no constrained output, exact string match.
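Exact string match is a strict metric, and worth pinning down since all numbers below depend on it. A minimal sketch (the actual case/whitespace normalisation used by the authors is not specified; the OCR examples below suggest matching may be case-insensitive):

```python
def exact_match(predictions, references):
    """Fraction of generated answers that exactly equal the reference string."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

# Toy example mirroring the CIFAR-10 cases shown later in the post:
preds = ["horse", "dog", "bird", "dog"]
refs  = ["horse", "dog", "dog",  "cat"]
print(exact_match(preds, refs))   # 0.5
```

Note that under this metric a semantically correct but reworded answer (e.g. “turn down the heat” for “turn the heat down”) counts as wrong.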

Individual Task Results

We trained separate adaptors for 8 tasks across audio and image. All numbers are exact-match accuracy on a held-out test set (20% split).

Task                    Accuracy
Gender (audio)          98.7%
Emotion (audio)         53.2%
Word STT (audio)        94.0%
Sentence STT (audio)    89.3%
OCR (image)             33.0%
Object (image)          97.0%
Clothing (image)        83.5%
Digit (image)           62.5%

Held-out test accuracy per task. Each task has its own dedicated adaptor.

Object Classification — 97%

Prompt: “What object is in this image? Answer with one word.” — evaluated on CIFAR-10.

predicted: horse · true: horse
predicted: dog · true: dog
predicted: bird · true: dog ×
predicted: dog · true: cat ×

Clothing Classification — 83.5%

Prompt: “What clothing item is in this image? Answer with one word.” — evaluated on Fashion-MNIST.

predicted: sneaker · true: sneaker
predicted: coat · true: coat
predicted: sneaker · true: sandal ×
predicted: pullover · true: shirt ×

Scene Text Recognition — 33%

Prompt: “What word is shown in this image?” — evaluated on IIIT-5K.

predicted: bmw · true: BMW
predicted: mobile · true: Mobile
predicted: advertising · true: AUDIENCES ×
predicted: wwwery · true: MERYL ×

Audio Tasks

Gender classification on RAVDESS hits 98.7%. Word-level STT on Google Speech Commands (10 classes) reaches 94%.

Most impressively, sentence-level STT on Fluent Speech Commands reaches 89.3% exact match across 169 unique commands:

Each example pairs the spoken sentence with the model’s transcription (prompt: “Transcribe the spoken command.”):

🔊 “Turn off the lights” → “turn off the lights”
🔊 “Increase the temperature in the kitchen” → “increase the temperature in the kitchen”
🔊 “Pause the music” → “pause the music”
🔊 “Turn the heat down” → “turn down the heat” ×
🔊 “Switch the lights off” → “switch the lights on” ×
A single API call, a tiny task-specific MLP, and a frozen LLM. One 3072-d vector projected into the LLM’s token space — no vision encoder, no speech encoder, just one embedding API.

How Far Does STT Go?

The sentence-level STT result (89%) is the most intriguing — the model generates multi-word transcriptions from a single embedding vector. But it works on a closed set of 169 commands. Can it do real, open-vocabulary transcription?

We tested on LibriSpeech (1000 unique sentences, 8–20 words). With our standard setup — one embedding, one virtual token — the result was 0% exact match. We tried several ways to recover:

Setup                           Exact match
Single embedding, 1 token       0%
Single embedding, 32 tokens     0%
2-second chunks, 1 token each   0%
1-second chunks, 1 token each   0%

The model generates fluent English that sounds plausible — but has zero word overlap with the actual speech:

🔊 “For once in a way I proved a true prophet” → “This achievement was my greatest token of success”
🔊 “Look at that he held out his hand” → “The king looked at him coldly and said”

So why does the closed-set STT work at 89%? The embedding encodes some content information — enough to distinguish among 169 short commands. But it doesn’t encode enough for 1000 open-vocabulary sentences. What looks like transcription is more likely the adaptor learning to recognise which of 169 known clusters it’s seeing.


Cross-Modal Transfer: Train on Images, Classify Sounds

Gemini embeds audio and images into the same 3072-d space. Can we exploit this? We took the CIFAR-10 image-trained adaptor and fed it environmental sounds from ESC-50. No audio was seen during training.

Class        Accuracy
Dog          95%
Bird         81%
Cat          70%
Airplane     45%
Automobile   31%
Frog         2%

Image-trained CIFAR-10 adaptor tested on ESC-50 audio clips. Random baseline = 17%.

Trained on dog images · test: 🔊 dog barking → “dog” (95%)
Trained on bird images · test: 🔊 bird chirping → “bird” (81%)
Trained on cat images · test: 🔊 cat meowing → “cat” (70%)
Trained on frog images · test: 🔊 frog croaking → “bird” (2%)

59.8% accuracy — 3.6× above random baseline. Dog barking is recognized at 95%, bird chirping at 81%, cat meowing at 70%. Frogs are the odd one out at 2% — the model confidently classifies ribbit as “bird.”

An audio-trained gender classifier also partially transfers to face images: 62% gender classification accuracy on CelebA faces (random baseline = 50%), having only heard voices during training.

The Gemini embedding space has genuine cross-modal alignment. Semantic concepts like “dog” or “bird” form clusters that span both audio and image. An adaptor trained on one modality can classify the other.
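The cluster story can be made concrete with a toy model of a shared embedding space. Everything here is a constructed illustration, not real Gemini data: class “prototypes” play the role of cross-modal clusters, a per-class mean over image embeddings stands in for the image-trained adaptor, and unseen “audio” points are drawn from the same clusters.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, CLASSES = 32, 3          # toy stand-ins for the 3072-d space / classes

# Shared semantic clusters: the assumption under test is that a concept like
# "dog" occupies one region of the space regardless of modality.
proto = rng.standard_normal((CLASSES, DIM))

def embed(cls, noise=0.3):
    """Toy embedding: class prototype plus modality-agnostic noise."""
    return proto[cls] + noise * rng.standard_normal(DIM)

# "Train" on images only: per-class mean embedding (a stand-in for the
# image-trained adaptor + frozen LLM).
img_means = np.stack([np.mean([embed(c) for _ in range(20)], axis=0)
                      for c in range(CLASSES)])

def classify(e):
    """Nearest class by cosine similarity to the image-derived means."""
    sims = img_means @ e / (np.linalg.norm(img_means, axis=1) * np.linalg.norm(e))
    return int(np.argmax(sims))

# Evaluate on "audio" drawn from the same clusters: no audio was seen during
# training, yet classification transfers because the space carries the class.
audio_acc = np.mean([classify(embed(c)) == c
                     for c in range(CLASSES) for _ in range(30)])
print(audio_acc > 0.8)   # True in this toy setup
```

When the clusters are well separated, transfer is near-perfect; the 2% frog result suggests that for some concepts the audio and image clusters do not overlap this cleanly.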

Multi-Task: Can One Adaptor Do Everything?

We trained a single shared adaptor on all 8 tasks simultaneously (4 audio + 4 image). Every input goes through the same MLP. The only thing that differs per task is the text prompt. We varied the number of virtual tokens the adaptor produces.
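Concretely, per-task conditioning lives entirely in the prompt; the adaptor itself is task-agnostic. A minimal sketch (the routing function is illustrative; the prompts are the ones quoted in this post):

```python
# One shared adaptor; the only per-task difference is the text prompt.
TASK_PROMPTS = {
    "object":   "What object is in this image? Answer with one word.",
    "clothing": "What clothing item is in this image? Answer with one word.",
    "ocr":      "What word is shown in this image?",
    "stt":      "Transcribe the spoken command.",
}

def build_inputs(task: str, virtual_tokens: list) -> dict:
    """Pair the shared adaptor's output with the task's prompt."""
    return {"prefix": virtual_tokens, "prompt": TASK_PROMPTS[task]}

inp = build_inputs("object", virtual_tokens=["v1"])
print(inp["prompt"])   # What object is in this image? Answer with one word.
```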

Task        Individual   1 token   8 tokens   32 tokens
Gender         98.7%      48.2%     49.6%      60.4%
Emotion        53.2%      12.9%     12.2%      17.3%
Word STT       94.0%       0%       90.0%      90.5%
Sent. STT      89.3%       0%        3.0%       1.2%
OCR            33.0%       0%        0%         4.0%
Object         97.0%       7.5%     69.0%      85.5%
Clothing       83.5%       0%       43.5%      62.0%
Digit          62.5%       6.5%     14.0%      12.5%

Multi-task accuracy with increasing virtual tokens. 1-token collapses; 8 and 32 tokens progressively recover easier tasks.

With a single virtual token, multi-task training largely collapses. The model defaults to a single class or generates verbose, off-topic text. A single token doesn’t carry enough information for 8 different tasks through one shared projection.

Using more virtual tokens helps. With 8 tokens, word-level STT recovers to 90% and object classification to 69%. At 32 tokens (261M adaptor params, still frozen LLM), object classification reaches 85.5% and clothing 62%. Easier tasks with small label sets recover close to their individually-trained accuracy, while harder tasks (sentence-level STT, OCR) remain largely unrecovered.


What We Learned

Works well: object classification (97%), gender classification (99%), word-level STT (94%), closed-set command STT (89%), clothing classification (83%)
Partially works: digit classification (62%), cross-modal transfer (60%), emotion classification (53%), scene text recognition (33%)
Doesn’t work: open-vocabulary STT (0%), multi-task sharing for harder tasks
  1. Gemini’s multimodal embedding can be used as an LLM encoder for certain tasks. A 17M-parameter MLP bridges the gap for tasks like object classification, gender detection, and closed-set command recognition.
  2. Cross-modal transfer works for some concepts. An adaptor trained only on dog images recognises dog barks at 95%, suggesting the encoder preserves semantic meaning across modalities.
  3. Works for categorical tasks, not for detailed content extraction. The encoder handles “what is this?” well, but struggles with “what exactly does it say?” — open-vocabulary transcription and fine-grained recognition remain out of reach.
  4. Multi-task sharing is hard. A shared adaptor with 32 virtual tokens recovers easier tasks to 62–90%, but harder tasks requiring fine-grained output remain a challenge.

So can Gemini embeddings be a universal multimodal encoder for open LLMs? For individual tasks, the answer is a clear yes — and a remarkably cheap one. One API call for the embedding, a tiny adaptor trained in under a minute, and a frozen LLM is all it takes to reach 97% on object classification or 89% on command transcription. Cross-modal transfer works out of the box, and even a shared multi-task adaptor recovers strong performance on easier tasks when given enough virtual tokens. Open-vocabulary transcription and fine-grained OCR remain out of reach for now, but the simplicity of the setup — no custom encoder, no fine-tuning the LLM — makes this a compelling starting point for adding multimodal capabilities to any open-weight model.


Code, pre-trained adaptor weights, and pre-computed embeddings: github.com/krafton-ai/Can-Gemini-Embeddings-Be-a-Multimodal-Encoder-for-LLMs-

Built with gemini-embedding-2-preview + Qwen3-4B-Instruct. Data: RAVDESS, Google Speech Commands, Fluent Speech Commands, IIIT-5K, CIFAR-10, Fashion-MNIST, SVHN, LibriSpeech, ESC-50, CelebA.


Acknowledgement

Experiments & Writing
Junhyuck Kim
Advising & Writing
Kangwook Lee

Citing Us

If you found this work useful, please cite us as:

@misc{geminiembedding2026,
      title={Can Gemini Embeddings Be a Multimodal Encoder for LLMs?},
      author={Junhyuck Kim and Kangwook Lee},
      year={2026},
      url={https://github.com/krafton-ai/Can-Gemini-Embeddings-Be-a-Multimodal-Encoder-for-LLMs-},
}