Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers
By Jakub Antkiewicz
April 17, 2026
Sentence Transformers Library Unlocks Multimodal Finetuning
The popular Sentence Transformers library has been updated to support the training and finetuning of multimodal embedding and reranker models, a significant development for engineers working on specialized retrieval systems. In a technical walkthrough, developer Tom Aarsen demonstrated the new capabilities by finetuning a Qwen/Qwen3-VL-Embedding-2B model for Visual Document Retrieval (VDR), the task of retrieving document pages based on a text query. The finetuned model showed a marked performance increase, achieving an NDCG@10 score of 0.947, up from the base model's 0.888, and outperforming other models up to four times its size.
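The NDCG@10 scores quoted above follow the standard normalized discounted cumulative gain formula. As a generic illustration (not code from the walkthrough), the metric can be computed like this:

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# With a single relevant document page, NDCG@10 rewards ranking it near the top:
print(ndcg_at_k([1, 0, 0, 0, 0]))            # relevant page ranked first
print(round(ndcg_at_k([0, 1, 0, 0, 0]), 3))  # relevant page ranked second
```

Because the gain is discounted logarithmically by rank, a retriever that surfaces the correct page first scores 1.0, while slipping it to second place already costs roughly a third of the score, which is why the jump from 0.888 to 0.947 is meaningful.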
A Closer Look at the Technical Implementation
The new training pipeline extends the library's existing framework, simplifying the process of working with combined text, image, audio, and video data. The system automatically handles the preprocessing of different modalities by inspecting the model's processor, removing a major hurdle for developers. Aarsen's example highlights the core components required for a successful finetuning operation:
- Model: An existing multimodal model, such as Qwen/Qwen3-VL-Embedding-2B, serves as the base for finetuning.
- Dataset: A domain-specific dataset containing pairs of inputs, such as text queries and corresponding document images, along with hard negatives.
- Loss Function: A function like `CachedMultipleNegativesRankingLoss` is used to optimize the model by increasing the similarity between correct pairs and decreasing it for incorrect ones.
- Trainer: The `SentenceTransformerTrainer` orchestrates the entire process, bringing together the model, data, and loss function.
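To make the loss component concrete, here is a minimal pure-Python sketch of the idea behind multiple negatives ranking loss (the `Cached` variant named above adds gradient caching for larger effective batches, which this sketch omits): each query's matching document is treated as the correct class in a cross-entropy over similarity scores, with the other in-batch documents acting as negatives. The similarity values below are toy numbers, not real model output.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mnrl_loss(sim_matrix, scale=20.0):
    """In-batch multiple negatives ranking loss.

    sim_matrix[i][j] = similarity(query_i, document_j); the diagonal holds
    the positive pairs, every off-diagonal entry is an in-batch negative.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        probs = softmax([scale * s for s in row])
        losses.append(-math.log(probs[i]))  # cross-entropy with target index i
    return sum(losses) / len(losses)

# Toy cosine similarities for a batch of 3 (query, document image) pairs:
separated = [[0.9, 0.1, 0.0],
             [0.2, 0.8, 0.1],
             [0.0, 0.1, 0.9]]
confused  = [[0.5, 0.5, 0.5],
             [0.5, 0.5, 0.5],
             [0.5, 0.5, 0.5]]
print(mnrl_loss(separated) < mnrl_loss(confused))  # True
```

Minimizing this loss is exactly the behavior described above: similarity on the diagonal (correct pairs) is pushed up, and similarity everywhere else is pushed down.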
This streamlined approach lets practitioners focus on data quality and model performance rather than the complexities of data collation and preprocessing. The library is also flexible about starting points: users can finetune existing embedding models, build new ones from base Vision-Language Model (VLM) checkpoints, or combine separate single-modality encoders using a `Router` module.
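Putting the components together, a finetuning run can be wired up roughly as follows. This is a configuration sketch based on the component names above, not the walkthrough's actual script: the dataset name and the training hyperparameters are illustrative placeholders.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

# 1. Model: an existing multimodal embedding model as the base.
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

# 2. Dataset: (query, positive page image, hard negative) examples.
#    "user/vdr-train" is a placeholder name, not a real dataset.
train_dataset = load_dataset("user/vdr-train", split="train")

# 3. Loss: pull correct (query, page) pairs together, push others apart;
#    the "Cached" variant trades compute for a larger effective batch size.
loss = CachedMultipleNegativesRankingLoss(model)

# 4. Trainer: ties model, data, and loss together.
args = SentenceTransformerTrainingArguments(
    output_dir="qwen3-vl-embedding-2b-vdr",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```

Note that the automatic modality preprocessing described above means the dataset can hold raw images alongside text columns; the trainer inspects the model's processor to collate them.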
Impact on Specialized AI Applications
This update lowers the barrier for creating highly specialized, efficient AI models for specific business domains. While large, general-purpose models are versatile, they often fall short in niche applications like VDR, which requires a nuanced understanding of document layouts, charts, and tables. By enabling straightforward finetuning, Sentence Transformers empowers organizations to develop smaller, more accurate, and cost-effective models tailored to their proprietary data. This move reflects a broader industry trend toward optimizing smaller, task-specific models that can deliver superior performance and efficiency compared to their larger, generalist counterparts.
The ability to easily finetune open-source multimodal models on domain-specific data provides a direct, cost-effective path for enterprises to build specialized AI retrieval systems that outperform larger, general-purpose APIs.