What is the difference between multimodal embedding and reranker models in Sentence Transformers?

Multimodal embedding models map various inputs like text and images into a shared vector space for fast, broad similarity searches. Multimodal reranker models are slower but more accurate; they take specific pairs of inputs (e.g., a text query and an image) to compute a precise relevance score, making them ideal for refining the top results from an initial search.

Multimodal Embedding & Reranker Models with Sentence Transformers

The popular Python library Sentence Transformers has released version 5.4, introducing native support for multimodal models. This update lets developers use a single, consistent API to encode and compare not only text, but also images, audio, and video. The new capabilities address the growing need for systems that can understand and retrieve information across different data formats, particularly in semantic search and retrieval-augmented generation (RAG).

Technically, the update provides two distinct kinds of model: multimodal embedding models and multimodal rerankers. Embedding models map inputs from various modalities into a shared vector space, enabling cross-modal similarity search (for instance, comparing a text query against a collection of images). Reranker models, conversely, compute more precise relevance scores for specific pairs of mixed-modality inputs. According to the release notes by author Tom Aarsen, running the newly supported Vision-Language Models (VLMs) such as Qwen3-VL carries significant hardware requirements: a GPU with at least 8 GB of VRAM is recommended for smaller variants and around 20 GB for larger ones.
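To make the idea of a shared vector space concrete, here is a minimal sketch of cross-modal similarity search. It does not call Sentence Transformers: the three-dimensional vectors and file names are hypothetical, hand-written stand-ins for what an embedding model would produce, and cosine similarity is assumed as the comparison metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: in practice a multimodal embedding model
# would produce these, with text and images landing in the same space.
query_embedding = [0.9, 0.1, 0.0]  # stands in for a text query such as "a photo of a cat"
image_embeddings = {
    "cat.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.1, 0.9, 0.2],
    "tree.jpg": [0.0, 0.2, 0.9],
}

# Rank every image by its similarity to the text query, best match first.
ranked = sorted(
    image_embeddings,
    key=lambda name: cosine_similarity(query_embedding, image_embeddings[name]),
    reverse=True,
)
print(ranked)
```

Because the query and the images live in the same space, one similarity function is enough to compare them. The image vectors can also be computed once and reused for every query, which is what makes this stage fast.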

This integration of multimodal features into a widely used library is poised to streamline the development of advanced AI systems. It simplifies the construction of complex pipelines, such as visual document retrieval or cross-modal RAG, by abstracting away much of the underlying complexity. The update explicitly supports the efficient retrieve-and-rerank pattern, in which a fast embedding model performs a broad initial search and a slower, more accurate reranker model refines the top candidates. This workflow offers a practical balance between speed and accuracy when working with large, diverse datasets.
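The retrieve-and-rerank pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: `toy_embed` and `toy_score_pair` are hypothetical stand-ins for an embedding model and a reranker, and short captions stand in for images so the example runs without downloading any model.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_and_rerank(query, corpus, embed, score_pair, top_k):
    """Two-stage search: cheap embedding retrieval, then pairwise reranking."""
    # Stage 1: broad, fast retrieval by similarity in the embedding space.
    query_vec = embed(query)
    by_similarity = sorted(corpus, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    candidates = by_similarity[:top_k]
    # Stage 2: the slower, more precise scorer sees each (query, candidate) pair.
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical stand-in for an embedding model: word counts over a tiny vocabulary.
VOCAB = ["red", "car", "street", "cat", "sofa"]

def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

# Hypothetical stand-in for a reranker: rewards an exact phrase match,
# which the order-blind embedding above cannot detect.
def toy_score_pair(query, doc):
    q, d = query.lower(), doc.lower()
    if q in d:
        return 1.0
    q_words = q.split()
    present = sum(1 for word in q_words if word in d.split())
    return 0.5 * present / len(q_words)

CORPUS = [
    "a car and a red bike",
    "a red car on a street",
    "a cat on a sofa",
    "a red sofa",
]

results = retrieve_and_rerank("red car", CORPUS, toy_embed, toy_score_pair, top_k=2)
for doc, score in results:
    print(f"{score:.2f}  {doc}")
```

In this toy run the first stage ranks "a car and a red bike" highest because it ignores word order, and the second stage promotes "a red car on a street" because it inspects the query and candidate together. The `top_k` parameter sets the trade-off: a larger value gives the reranker more chances to recover a good result, at the cost of more pairwise scoring.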

By standardizing the API for text, image, audio, and video processing, Sentence Transformers is lowering the barrier to entry for building production-grade multimodal systems, moving cross-modal search and RAG from a specialized task to a more accessible engineering challenge.