Google releases Gemini Embedding 2 AI model with multimodal support

The Gemini API provides embedding models that generate embeddings for text, images, video, and other content types. You can use these embeddings for tasks such as semantic search, classification, and clustering, which often yield more accurate, context-aware results than keyword search.
The newest model, gemini-embedding-2-preview, is the first Gemini API model to handle multiple content types, mapping text, images, video, audio, and documents into one shared embedding space. This enables search, classification, and clustering across more than 100 languages. For more details, see the Multimodal Embedding section. If you only need text, gemini-embedding-001 remains available.
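For orientation, a minimal text-only call with the google-genai Python SDK looks roughly like the following; the client reads its API key from the environment, and the model name matches the generally available gemini-embedding-001 mentioned above. The input sentence is invented for illustration.

```python
# pip install google-genai
from google import genai

# The client picks up the API key from the GEMINI_API_KEY environment variable.
client = genai.Client()

result = client.models.embed_content(
    model="gemini-embedding-001",
    contents="Embedding models map text into dense vectors.",
)

# result.embeddings holds one ContentEmbedding per input.
vector = result.embeddings[0].values
print(len(vector))  # dimensionality of the embedding vector
```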
If your product relies on retrieval-augmented generation (RAG), embeddings are crucial for making these systems more accurate, coherent, and context-aware. For teams seeking a managed RAG solution, the File Search tool makes RAG management easier and more affordable.
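To make the RAG connection concrete, here is a small sketch of the retrieval half of such a pipeline: documents and a query are embedded with the same model, and the closest document by cosine similarity is the passage a RAG system would hand to the generator. The corpus, query, and embed helper are illustrative, not part of any Google tooling.

```python
import numpy as np
from google import genai

client = genai.Client()

def embed(texts):
    # Batch-embed a list of strings; the API returns one vector per input.
    result = client.models.embed_content(
        model="gemini-embedding-001", contents=texts)
    return np.array([e.values for e in result.embeddings])

docs = [
    "Invoices are due within 30 days of receipt.",
    "The API rate limit is 60 requests per minute.",
    "Support is available on weekdays from 9am to 5pm.",
]
doc_vecs = embed(docs)
query_vec = embed(["When do I have to pay an invoice?"])[0]

# Rank documents by cosine similarity to the query.
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
best = int(np.argmax(scores))
print(docs[best])  # the passage a RAG pipeline would pass to the generator
```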
Google has launched Gemini Embedding 2 in public preview, bringing enhancements over the previous version.
As Google’s first native multimodal embedding model, Gemini Embedding 2 can map text, images, video, and documents into one shared embedding space. It was released alongside new AI features for Workspace apps.
If you are new to the concept, embedding models differ from generative models such as Gemini 3. Embedding models help computers understand context by turning text, images, or video into vectors: numerical representations that computers can compare and analyze. Across tasks such as semantic search, classification, and clustering, these embeddings yield more context-aware results than keyword-based methods.
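The following sketch illustrates that context-awareness: two sentences that share no keywords can still score a high cosine similarity once embedded, while an unrelated sentence scores lower. The example sentences are invented for illustration.

```python
import numpy as np
from google import genai

client = genai.Client()
result = client.models.embed_content(
    model="gemini-embedding-001",
    contents=[
        "How do I reset my password?",         # query
        "Steps to recover account access",     # related meaning, no shared keywords
        "Best hiking trails near the office",  # unrelated
    ],
)
v = [np.array(e.values) for e in result.embeddings]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A keyword search finds no overlap between the first two sentences,
# but their embeddings should be measurably closer than the third.
print(cosine(v[0], v[1]), cosine(v[0], v[2]))
```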
Google's first embedding model worked only with text. Gemini Embedding 2 now supports text, images, video, audio, and documents in a single unified embedding space across more than 100 languages. The content limits are listed below; a sketch of a multimodal request follows the list:
- Text: up to 8192 tokens per request.
- Images: up to 6 images per request, supporting PNG and JPEG formats.
- Video: up to 120 seconds of video in MP4 or MOV format per request.
- Audio: embedded directly, with no transcription step required.
- Documents: PDFs up to 6 pages long.
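A speculative request sketch: the article does not show the preview API's input format, so the code below assumes the multimodal model accepts the same Part-based content objects the google-genai SDK uses for generation. Both the model identifier and that input-format assumption should be verified against the official documentation.

```python
# Speculative sketch: ASSUMES embed_content accepts Part objects the way
# generate_content does; this is not a verified API reference.
from google import genai
from google.genai import types

client = genai.Client()

with open("contract.png", "rb") as f:
    image_bytes = f.read()

# ASSUMPTION: the model name "gemini-embedding-2-preview" mirrors the article,
# and text plus an image Part can be embedded in one request.
result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents=[
        "Summary of the attached scanned contract page",
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
    ],
)
print(len(result.embeddings))
```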
In a blog post, Google said the new model streamlines complex pipelines and enhances a wide variety of multimodal downstream tasks, from retrieval-augmented generation (RAG) and semantic search to sentiment analysis and data clustering. Because a single request can combine multiple media types, such as images and text, the model can analyze detailed relationships among them.
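As one sketch of a downstream clustering task, embedded snippets can be grouped with an off-the-shelf algorithm such as scikit-learn's KMeans; the snippets and cluster count here are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from google import genai

client = genai.Client()
snippets = [
    "The court granted the motion to dismiss.",
    "Opposing counsel filed for summary judgment.",
    "Quarterly revenue grew 12% year over year.",
    "Operating margin improved on lower costs.",
]
result = client.models.embed_content(
    model="gemini-embedding-001", contents=snippets)
X = np.array([e.values for e in result.embeddings])

# Two topical clusters expected: legal filings vs. financial results.
labels = KMeans(n_clusters=2, n_init="auto").fit_predict(X)
print(labels)
```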
For example, Google noted that Gemini embeddings can help legal professionals find important information during the discovery phase of litigation. The multimodal embeddings improve precision and recall across millions of records and enhance image and video search.
Gemini Embedding 2 (gemini-embedding-2-preview) is now available in public preview through the Gemini API and Vertex AI. The gemini-embedding-001 model remains available for text-only needs.
Source: Embeddings