Gemini Embedding As A Universal Multimodal Encoder for Open LLMs

Junhyuck Kim · Kangwook Lee · March 2026

Google’s gemini-embedding-2-preview takes any audio or image and returns a single 3072-dimensional vector. It’s designed for retrieval and similarity search. But we had a different question: can these embeddings serve as a universal multimodal encoder for an open-weight LLM?

The idea is borrowed from multimodal LLMs that use specialized encoders (CLIP for vision, Whisper for audio) — but here we replace all of them with a single general-purpose embedding API. Call the Gemini Embedding API, get a vector, project it through a small learned adaptor into “virtual tokens,” and prepend those to a text prompt inside a frozen language model. One encoder for all modalities.

The Pipeline

Every input — whether an image or an audio clip — goes through the same three-stage pipeline. The Gemini Embedding API is frozen; only the adaptor (a small MLP) is trained. The LLM (Qwen3-4B) stays completely frozen.

Stage 1 — Frozen Embedding API

The input, an audio clip 🎤 or an image 🖼, goes through a single call to the frozen Gemini Embedding API (gemini-embedding-2-preview). One API call returns one vector e ∈ ℝ^3072.

Stage 2 — Learned Adaptor (MLP)

The embedding e ∈ ℝ^3072 passes through a learned MLP adaptor (17M parameters): Linear(3072, 4×d) → GELU → Linear(4×d, k×d). This projects the embedding into k virtual tokens in ℝ^(k×d), each of dimension d = the LLM hidden size. The adaptor trains in under a minute on a single GPU.

Stage 3 — Frozen LLM with Prefix

The virtual tokens v₁ … v_k are prepended to the text prompt (e.g. “What object is in this image? Answer with one word.”) and fed to the frozen Qwen3-4B, which generates a free-form answer via greedy decoding.
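The three stages can be sketched in a few lines of numpy. This is an illustrative stand-in, not the actual implementation: `gemini_embed` fakes the API call with random numbers, and the hidden size `D` is shrunk to keep the toy light (Qwen3-4B's real hidden size is much larger).

```python
import numpy as np

EMBED_DIM = 3072    # Gemini embedding size (real)
D = 64              # LLM hidden size d, shrunk for this toy
K = 1               # number of virtual tokens k

rng = np.random.default_rng(0)

def gemini_embed(media: bytes) -> np.ndarray:
    """Stage 1 stand-in: one API call -> one 3072-d vector."""
    return rng.standard_normal(EMBED_DIM)

# Stage 2: the learned adaptor, Linear(3072, 4*d) -> GELU -> Linear(4*d, k*d).
W1 = rng.standard_normal((EMBED_DIM, 4 * D)) * 0.02
W2 = rng.standard_normal((4 * D, K * D)) * 0.02

def gelu(x: np.ndarray) -> np.ndarray:
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def virtual_tokens(media: bytes) -> np.ndarray:
    e = gemini_embed(media)          # e in R^3072
    v = gelu(e @ W1) @ W2            # adaptor MLP
    return v.reshape(K, D)           # k virtual tokens of dimension d

# Stage 3 would prepend these rows to the prompt embeddings of the frozen LLM.
tokens = virtual_tokens(b"<audio or image bytes>")
print(tokens.shape)   # (1, 64)
```

The only learned parameters in the whole pipeline are `W1` and `W2` (plus biases in the real MLP); everything before and after them stays frozen.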

How is the adaptor trained?

Each downstream task provides the supervision signal. For image classification, we have (image, label) pairs — e.g. a CIFAR-10 photo paired with “horse”. The image is embedded via the frozen Gemini API, projected through the adaptor into virtual tokens, prepended to a task prompt like “What object is in this image? Answer with one word.”, and fed to the frozen LLM. The LLM autoregressively generates an answer, and we compute a cross-entropy loss against the ground-truth label tokens. Only the adaptor’s weights are updated; everything else stays frozen. The same recipe applies to audio tasks — swap the image for an audio clip and the prompt for something like “Transcribe the spoken command.”
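The recipe above can be illustrated with a deliberately tiny toy: a frozen linear readout stands in for the LLM, a single trainable matrix stands in for the adaptor, and the loss is cross-entropy on the label token. All names and dimensions here are illustrative, not the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, D, VOCAB = 8, 4, 5            # tiny stand-ins for 3072, d, vocab size

W_adapt = rng.standard_normal((EMB, D)) * 0.1    # trainable: the adaptor
W_llm   = rng.standard_normal((D, VOCAB)) * 0.1  # frozen: stands in for the LLM

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def step(emb, label, lr=0.1):
    """One SGD step: cross-entropy against the ground-truth label token."""
    global W_adapt
    h = emb @ W_adapt                 # virtual token
    p = softmax(h @ W_llm)            # frozen LLM's next-token distribution
    loss = -np.log(p[label])
    # Gradient flows back through the frozen W_llm, but only W_adapt moves.
    dlogits = p.copy()
    dlogits[label] -= 1.0
    W_adapt -= lr * np.outer(emb, dlogits @ W_llm.T)
    return loss

emb = rng.standard_normal(EMB)        # stands in for one Gemini embedding
losses = [step(emb, label=2) for _ in range(200)]
print(losses[0] > losses[-1])         # True: the loss falls as only the adaptor learns
```

The key property the toy preserves: gradients pass *through* the frozen model to reach the adaptor, but no frozen weight is ever updated.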

The adaptor is tiny: two linear layers with a GELU activation, 17M parameters. Training takes under a minute per task on a single GPU. The LLM stays completely frozen. Everything is evaluated with free generation: greedy decoding, no constrained output, exact string match.
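Exact string match is a strict metric, and worth pinning down since all numbers below depend on it. A minimal sketch (the actual case/whitespace normalisation used by the authors is not specified; the OCR examples below suggest matching may be case-insensitive):

```python
def exact_match(predictions, references):
    """Fraction of generated answers that exactly equal the reference string."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

# Toy example mirroring the CIFAR-10 cases shown later in the post:
preds = ["horse", "dog", "bird", "dog"]
refs  = ["horse", "dog", "dog",  "cat"]
print(exact_match(preds, refs))   # 0.5
```

Note that under this metric a semantically correct but reworded answer (e.g. “turn down the heat” for “turn the heat down”) counts as wrong.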

Individual Task Results

We trained separate adaptors for 8 tasks across audio and image. All numbers are exact-match accuracy on a held-out test set (20% split).

Task                    Accuracy
Gender (audio)          98.7%
Emotion (audio)         53.2%
Word STT (audio)        94.0%
Sentence STT (audio)    89.3%
OCR (image)             33.0%
Object (image)          97.0%
Clothing (image)        83.5%
Digit (image)           62.5%

Held-out test accuracy per task. Each task has its own dedicated adaptor.

Object Classification — 97%

Prompt: “What object is in this image? Answer with one word.” — evaluated on CIFAR-10.

predicted: horse · true: horse
predicted: dog · true: dog
predicted: bird · true: dog ×
predicted: dog · true: cat ×

Clothing Classification — 83.5%

Prompt: “What clothing item is in this image? Answer with one word.” — evaluated on Fashion-MNIST.

predicted: sneaker · true: sneaker
predicted: coat · true: coat
predicted: sneaker · true: sandal ×
predicted: pullover · true: shirt ×

Scene Text Recognition — 33%

Prompt: “What word is shown in this image?” — evaluated on IIIT-5K.

predicted: bmw · true: BMW
predicted: mobile · true: Mobile
predicted: advertising · true: AUDIENCES ×
predicted: wwwery · true: MERYL ×

Audio Tasks

Gender classification on RAVDESS hits 98.7%. Word-level STT on Google Speech Commands (10 classes) reaches 94%.

Most impressively, sentence-level STT on Fluent Speech Commands reaches 89.3% exact match across 169 unique commands:

Each example pairs the spoken sentence with the model’s transcription (prompt: “Transcribe the spoken command.”):

🔊 “Turn off the lights” → “turn off the lights”
🔊 “Increase the temperature in the kitchen” → “increase the temperature in the kitchen”
🔊 “Pause the music” → “pause the music”
🔊 “Turn the heat down” → “turn down the heat” ×
🔊 “Switch the lights off” → “switch the lights on” ×
A single API call, a tiny task-specific MLP, and a frozen LLM. One 3072-d vector projected into the LLM’s token space — no vision encoder, no speech encoder, just one embedding API.

How Far Does STT Go?

The sentence-level STT result (89%) is the most intriguing — the model generates multi-word transcriptions from a single embedding vector. But it works on a closed set of 169 commands. Can it do real, open-vocabulary transcription?

We tested on LibriSpeech (1000 unique sentences, 8–20 words). With our standard setup — one embedding, one virtual token — the result was 0% exact match. We tried several ways to recover:

Setup                           Exact match
Single embedding, 1 token       0%
Single embedding, 32 tokens     0%
2-second chunks, 1 token each   0%
1-second chunks, 1 token each   0%

The model generates fluent English that sounds plausible — but has zero word overlap with the actual speech:

🔊 “For once in a way I proved a true prophet” → “This achievement was my greatest token of success”
🔊 “Look at that he held out his hand” → “The king looked at him coldly and said”

So why does the closed-set STT work at 89%? The embedding encodes some content information — enough to distinguish among 169 short commands. But it doesn’t encode enough for 1000 open-vocabulary sentences. What looks like transcription is more likely the adaptor learning to recognise which of 169 known clusters it’s seeing.


Cross-Modal Transfer: Train on Images, Classify Sounds

Gemini embeds audio and images into the same 3072-d space. Can we exploit this? We took the CIFAR-10 image-trained adaptor and fed it environmental sounds from ESC-50. No audio was seen during training.

Class        Accuracy
Dog          95%
Bird         81%
Cat          70%
Airplane     45%
Automobile   31%
Frog         2%

Image-trained CIFAR-10 adaptor tested on ESC-50 audio clips. Random baseline = 17%.

Trained on dog images · test: 🔊 dog barking → “dog” (95%)
Trained on bird images · test: 🔊 bird chirping → “bird” (81%)
Trained on cat images · test: 🔊 cat meowing → “cat” (70%)
Trained on frog images · test: 🔊 frog croaking → “bird” (2%)

59.8% accuracy — 3.6× above random baseline. Dog barking is recognized at 95%, bird chirping at 81%, cat meowing at 70%. Frogs are the odd one out at 2% — the model confidently classifies ribbit as “bird.”

An audio-trained gender classifier also partially transfers to face images: 62% gender classification accuracy on CelebA faces (random baseline = 50%), having only heard voices during training.

The Gemini embedding space has genuine cross-modal alignment. Semantic concepts like “dog” or “bird” form clusters that span both audio and image. An adaptor trained on one modality can classify the other.
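The cluster story can be made concrete with a toy model of a shared embedding space. Everything here is a constructed illustration, not real Gemini data: class “prototypes” play the role of cross-modal clusters, a per-class mean over image embeddings stands in for the image-trained adaptor, and unseen “audio” points are drawn from the same clusters.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, CLASSES = 32, 3          # toy stand-ins for the 3072-d space / classes

# Shared semantic clusters: the assumption under test is that a concept like
# "dog" occupies one region of the space regardless of modality.
proto = rng.standard_normal((CLASSES, DIM))

def embed(cls, noise=0.3):
    """Toy embedding: class prototype plus modality-agnostic noise."""
    return proto[cls] + noise * rng.standard_normal(DIM)

# "Train" on images only: per-class mean embedding (a stand-in for the
# image-trained adaptor + frozen LLM).
img_means = np.stack([np.mean([embed(c) for _ in range(20)], axis=0)
                      for c in range(CLASSES)])

def classify(e):
    """Nearest class by cosine similarity to the image-derived means."""
    sims = img_means @ e / (np.linalg.norm(img_means, axis=1) * np.linalg.norm(e))
    return int(np.argmax(sims))

# Evaluate on "audio" drawn from the same clusters: no audio was seen during
# training, yet classification transfers because the space carries the class.
audio_acc = np.mean([classify(embed(c)) == c
                     for c in range(CLASSES) for _ in range(30)])
print(audio_acc > 0.8)   # True in this toy setup
```

When the clusters are well separated, transfer is near-perfect; the 2% frog result suggests that for some concepts the audio and image clusters do not overlap this cleanly.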

Multi-Task: Can One Adaptor Do Everything?

We trained a single shared adaptor on all 8 tasks simultaneously (4 audio + 4 image). Every input goes through the same MLP. The only thing that differs per task is the text prompt. We varied the number of virtual tokens the adaptor produces.
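Concretely, per-task conditioning lives entirely in the prompt; the adaptor itself is task-agnostic. A minimal sketch (the routing function is illustrative; the prompts are the ones quoted in this post):

```python
# One shared adaptor; the only per-task difference is the text prompt.
TASK_PROMPTS = {
    "object":   "What object is in this image? Answer with one word.",
    "clothing": "What clothing item is in this image? Answer with one word.",
    "ocr":      "What word is shown in this image?",
    "stt":      "Transcribe the spoken command.",
}

def build_inputs(task: str, virtual_tokens: list) -> dict:
    """Pair the shared adaptor's output with the task's prompt."""
    return {"prefix": virtual_tokens, "prompt": TASK_PROMPTS[task]}

inp = build_inputs("object", virtual_tokens=["v1"])
print(inp["prompt"])   # What object is in this image? Answer with one word.
```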

Task        Individual   1 token   8 tokens   32 tokens
Gender         98.7%      48.2%     49.6%      60.4%
Emotion        53.2%      12.9%     12.2%      17.3%
Word STT       94.0%       0%       90.0%      90.5%
Sent. STT      89.3%       0%        3.0%       1.2%
OCR            33.0%       0%        0%         4.0%
Object         97.0%       7.5%     69.0%      85.5%
Clothing       83.5%       0%       43.5%      62.0%
Digit          62.5%       6.5%     14.0%      12.5%

Multi-task accuracy with increasing virtual tokens. 1-token collapses; 8 and 32 tokens progressively recover easier tasks.

With a single virtual token, multi-task training largely collapses. The model defaults to a single class or generates verbose, off-topic text. A single token doesn’t carry enough information for 8 different tasks through one shared projection.

Using more virtual tokens helps. With 8 tokens, word-level STT recovers to 90% and object classification to 69%. At 32 tokens (261M adaptor params, still frozen LLM), object classification reaches 85.5% and clothing 62%. Easier tasks with small label sets recover close to their individually-trained accuracy, while harder tasks (sentence-level STT, OCR) remain largely unrecovered.


What We Learned

Works well: object classification (97%), gender classification (99%), word-level STT (94%), closed-set command STT (89%), clothing classification (83%)
Partially works: digit classification (62%), cross-modal transfer (60%), emotion classification (53%), scene text recognition (33%)
Doesn’t work: open-vocabulary STT (0%), multi-task sharing for harder tasks
  1. Gemini’s multimodal embedding can be used as an LLM encoder for certain tasks. A 17M-parameter MLP bridges the gap for tasks like object classification, gender detection, and closed-set command recognition.
  2. Cross-modal transfer works for some concepts. An adaptor trained only on dog images recognises dog barks at 95%, suggesting the encoder preserves semantic meaning across modalities.
  3. Works for categorical tasks, not for detailed content extraction. The encoder handles “what is this?” well, but struggles with “what exactly does it say?” — open-vocabulary transcription and fine-grained recognition remain out of reach.
  4. Multi-task sharing is hard. A shared adaptor with 32 virtual tokens recovers easier tasks to 62–90%, but harder tasks requiring fine-grained output remain a challenge.

So can Gemini embeddings be a universal multimodal encoder for open LLMs? For individual tasks, the answer is a clear yes — and a remarkably cheap one. One API call for the embedding, a tiny adaptor trained in under a minute, and a frozen LLM is all it takes to reach 97% on object classification or 89% on command transcription. Cross-modal transfer works out of the box, and even a shared multi-task adaptor recovers strong performance on easier tasks when given enough virtual tokens. Open-vocabulary transcription and fine-grained OCR remain out of reach for now, but the simplicity of the setup — no custom encoder, no fine-tuning the LLM — makes this a compelling starting point for adding multimodal capabilities to any open-weight model.


Code, pre-trained adaptor weights, and pre-computed embeddings: github.com/krafton-ai/Can-Gemini-Embeddings-Be-a-Multimodal-Encoder-for-LLMs-

Built with gemini-embedding-2-preview + Qwen3-4B-Instruct. Data: RAVDESS, Google Speech Commands, Fluent Speech Commands, IIIT-5K, CIFAR-10, Fashion-MNIST, SVHN, LibriSpeech, ESC-50, CelebA.


Acknowledgement

Experiments & Writing
Junhyuck Kim
Advising & Writing
Kangwook Lee

Citing Us

If you found this work useful, please cite us as:

@misc{geminiembedding2026,
      title={Can Gemini Embeddings Be a Multimodal Encoder for LLMs?},
      author={Junhyuck Kim and Kangwook Lee},
      year={2026},
      url={https://github.com/krafton-ai/Can-Gemini-Embeddings-Be-a-Multimodal-Encoder-for-LLMs-},
}