Junhyuck Kim · Kangwook Lee · March 2026
Google’s gemini-embedding-2-preview
takes any audio or image and returns a single 3072-dimensional vector.
It’s designed for retrieval and similarity search. But we had a different question:
can these embeddings serve as a universal multimodal encoder for an open-weight LLM?
The idea is borrowed from multimodal LLMs that use specialized encoders (CLIP for vision, Whisper for audio) — but here we replace all of them with a single general-purpose embedding API. Call the Gemini Embedding API, get a vector, project it through a small learned adaptor into “virtual tokens,” and prepend those to a text prompt inside a frozen language model. One encoder for all modalities.
Every input — whether an image or an audio clip — goes through the same three-stage pipeline. The Gemini Embedding API is frozen; only the adaptor (a small MLP) is trained. The LLM (Qwen3-4B) stays completely frozen.
Stage 1 — Frozen Embedding API (gemini-embedding-2-preview)
Stage 2 — Learned Adaptor (MLP)
Stage 3 — Frozen LLM with Prefix
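As a concrete sketch of the adaptor stage, here is the projection in NumPy. The MLP width and the Qwen3-4B hidden size used below are assumptions for illustration; the real adaptor is trained, not random.

```python
import numpy as np

EMB_DIM = 3072   # gemini-embedding-2-preview output size
HIDDEN = 2560    # Qwen3-4B hidden size (assumed)
N_TOKENS = 1     # virtual tokens produced per input

rng = np.random.default_rng(0)
# Two-layer MLP adaptor: the only trainable component in the pipeline.
W1 = rng.normal(0.0, 0.02, (EMB_DIM, 4096))
W2 = rng.normal(0.0, 0.02, (4096, N_TOKENS * HIDDEN))

def adapt(embedding: np.ndarray) -> np.ndarray:
    """Project one frozen Gemini embedding into virtual tokens that are
    prepended to the text prompt's token embeddings inside the frozen LLM."""
    h = np.maximum(embedding @ W1, 0.0)           # ReLU hidden layer
    return (h @ W2).reshape(N_TOKENS, HIDDEN)     # (n_tokens, hidden)

virtual_tokens = adapt(rng.normal(size=EMB_DIM))
print(virtual_tokens.shape)  # (1, 2560)
```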
How is the adaptor trained?
Each downstream task provides the supervision signal. For image classification, we have (image, label) pairs — e.g. a CIFAR-10 photo paired with “horse”. The image is embedded via the frozen Gemini API, projected through the adaptor into virtual tokens, prepended to a task prompt like “What object is in this image? Answer with one word.”, and fed to the frozen LLM. The LLM autoregressively generates an answer, and we compute a cross-entropy loss against the ground-truth label tokens. Only the adaptor’s weights are updated; everything else stays frozen. The same recipe applies to audio tasks — swap the image for an audio clip and the prompt for something like “Transcribe the spoken command.”
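The recipe above can be sketched with a toy stand-in for the frozen LLM. In the real setup the logits come from frozen Qwen3-4B over its full vocabulary; the single random "LM head" and the vocabulary size here are illustrative assumptions, kept only to show which gradients flow where.

```python
import torch
import torch.nn as nn

EMB_DIM, HIDDEN, TOY_VOCAB = 3072, 2560, 1000  # TOY_VOCAB is a stand-in size

# The only trainable component: a small MLP adaptor.
adaptor = nn.Sequential(nn.Linear(EMB_DIM, 1024), nn.GELU(), nn.Linear(1024, HIDDEN))
opt = torch.optim.AdamW(adaptor.parameters(), lr=1e-4)

# Stand-in for the frozen LLM: a single frozen output head.
frozen_lm_head = nn.Linear(HIDDEN, TOY_VOCAB)
for p in frozen_lm_head.parameters():
    p.requires_grad_(False)

def train_step(gemini_emb: torch.Tensor, label_ids: torch.Tensor) -> float:
    virtual = adaptor(gemini_emb)            # embedding -> virtual token(s)
    logits = frozen_lm_head(virtual)         # frozen model scores the labels
    loss = nn.functional.cross_entropy(logits, label_ids)
    opt.zero_grad()
    loss.backward()                          # gradients reach only the adaptor
    opt.step()
    return loss.item()

loss = train_step(torch.randn(4, EMB_DIM), torch.randint(0, TOY_VOCAB, (4,)))
```

The key property is visible in the gradients: after `backward()`, only the adaptor's parameters have been updated, while the frozen head (standing in for the LLM) has none.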
We trained a separate adaptor for each of 8 tasks spanning audio and image. All numbers are exact-match accuracy on a held-out test set (a 20% split).
Held-out test accuracy per task. Each task has its own dedicated adaptor.
Prompt: “What object is in this image? Answer with one word.” — evaluated on CIFAR-10.
Prompt: “What clothing item is in this image? Answer with one word.” — evaluated on Fashion-MNIST.
Prompt: “What word is shown in this image?” — evaluated on IIIT-5K.
Gender classification on RAVDESS hits 98.7%. Word-level STT on Google Speech Commands (10 classes) reaches 94%.
Most impressively, sentence-level STT on Fluent Speech Commands reaches 89.3% exact match across 169 unique commands.
The sentence-level STT result (89%) is the most intriguing — the model generates multi-word transcriptions from a single embedding vector. But it works on a closed set of 169 commands. Can it do real, open-vocabulary transcription?
We tested on LibriSpeech (1000 unique sentences, 8–20 words). With our standard setup — one embedding, one virtual token — the result was 0% exact match. We tried several ways to recover:
| Setup | Exact match |
|---|---|
| Single embedding, 1 token | 0% |
| Single embedding, 32 tokens | 0% |
| 2-second chunks, 1 token each | 0% |
| 1-second chunks, 1 token each | 0% |
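The chunked variants split the clip into fixed-length windows before embedding, mapping each window to its own virtual token. A minimal sketch of the splitting step (the 16 kHz sample rate is an assumption):

```python
import numpy as np

SR = 16_000   # sample rate in Hz (assumed)
CHUNK_S = 2   # seconds per chunk

def chunk_audio(wave: np.ndarray) -> list[np.ndarray]:
    """Split a waveform into fixed-length chunks; each chunk is embedded
    separately and projected to its own virtual token."""
    step = SR * CHUNK_S
    return [wave[i:i + step] for i in range(0, len(wave), step)]

chunks = chunk_audio(np.zeros(SR * 7))  # a 7-second clip
print(len(chunks))  # 4 chunks (the last one is 1 s long)
```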
The model generates fluent English that sounds plausible — but has zero word overlap with the actual speech.
So why does the closed-set STT work at 89%? The embedding encodes some content information — enough to distinguish among 169 short commands — but not enough for 1000 open-vocabulary sentences. What looks like transcription is more likely the adaptor learning to recognize which of 169 known clusters it is seeing.
Gemini embeds audio and images into the same 3072-d space. Can we exploit this? We took the CIFAR-10 image-trained adaptor and fed it environmental sounds from ESC-50. No audio was seen during training.
Image-trained CIFAR-10 adaptor tested on ESC-50 audio clips. Random baseline = 17%.
59.8% accuracy — 3.6× above random baseline. Dog barking is recognized at 95%, bird chirping at 81%, cat meowing at 70%. Frogs are the odd one out at 2% — the model confidently classifies ribbit as “bird.”
An audio-trained gender classifier also partially transfers to face images: 62% gender classification accuracy on CelebA faces (random baseline = 50%), having only heard voices during training.
We trained a single shared adaptor on all 8 tasks simultaneously (4 audio + 4 image). Every input goes through the same MLP. The only thing that differs per task is the text prompt. We varied the number of virtual tokens the adaptor produces.
Multi-task accuracy with increasing virtual tokens. 1-token collapses; 8 and 32 tokens progressively recover easier tasks.
With a single virtual token, multi-task training largely collapses. The model defaults to a single class or generates verbose, off-topic text. A single token doesn’t carry enough information for 8 different tasks through one shared projection.
Using more virtual tokens helps. With 8 tokens, word-level STT recovers to 90% and object classification to 69%. At 32 tokens (261M adaptor params, still frozen LLM), object classification reaches 85.5% and clothing 62%. Easier tasks with small label sets recover close to their individually-trained accuracy, while harder tasks (sentence-level STT, OCR) remain largely unrecovered.
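One adaptor shape consistent with the reported ~261M parameters is a 3072-wide MLP whose output layer fans out to 32 tokens of the LLM's hidden size; the exact hidden width is an assumption chosen to match that count.

```python
import torch.nn as nn

EMB_DIM, HIDDEN, N_TOKENS = 3072, 2560, 32

# A plausible 32-token adaptor; the 3072 hidden width is an assumption
# picked so the parameter count lands near the reported 261M.
adaptor = nn.Sequential(
    nn.Linear(EMB_DIM, 3072),
    nn.GELU(),
    nn.Linear(3072, N_TOKENS * HIDDEN),  # fan-out to 32 x 2560 = 81,920 dims
)
n_params = sum(p.numel() for p in adaptor.parameters())
print(f"{n_params / 1e6:.0f}M")  # 261M
```

Nearly all of those parameters sit in the output layer, which is why token count drives adaptor size so strongly.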
| Works well | Partially works | Doesn’t work |
|---|---|---|
| Object classification (97%) | Emotion classification (53%) | Open-vocabulary STT (0%) |
| Gender classification (99%) | Scene text recognition (33%) | Multi-task sharing for harder tasks |
| Closed-set command STT (89%) | Digit classification (62%) | |
| Clothing classification (83%) | Cross-modal transfer (60%) | |
| Word-level STT (94%) | | |
So can Gemini embeddings be a universal multimodal encoder for open LLMs? For individual tasks, the answer is a clear yes — and a remarkably cheap one. One API call for the embedding, a tiny adaptor trained in under a minute, and a frozen LLM is all it takes to reach 97% on object classification or 89% on command transcription. Cross-modal transfer works out of the box, and even a shared multi-task adaptor recovers strong performance on easier tasks when given enough virtual tokens. Open-vocabulary transcription and fine-grained OCR remain out of reach for now, but the simplicity of the setup — no custom encoder, no fine-tuning the LLM — makes this a compelling starting point for adding multimodal capabilities to any open-weight model.
Code, pre-trained adaptor weights, and pre-computed embeddings: github.com/krafton-ai/Can-Gemini-Embeddings-Be-a-Multimodal-Encoder-for-LLMs-
Built with gemini-embedding-2-preview
+ Qwen3-4B-Instruct.
Data: RAVDESS,
Google Speech Commands,
Fluent Speech Commands,
IIIT-5K,
CIFAR-10,
Fashion-MNIST,
SVHN,
LibriSpeech,
ESC-50,
CelebA.
If you found this work useful, please cite us as:
@misc{geminiembedding2026,
title={Can Gemini Embeddings Be a Multimodal Encoder for LLMs?},
author={Junhyuck Kim and Kangwook Lee},
year={2026},
url={https://github.com/krafton-ai/Can-Gemini-Embeddings-Be-a-Multimodal-Encoder-for-LLMs-},
}