Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers
By Jakub Antkiewicz
April 17, 2026
Sentence Transformers Library Unlocks Multimodal Finetuning
The popular Sentence Transformers library has been updated to support the training and finetuning of multimodal embedding and reranker models, a significant development for engineers working on specialized retrieval systems. In a technical walkthrough, developer Tom Aarsen demonstrated the new capabilities by finetuning a Qwen/Qwen3-VL-Embedding-2B model for Visual Document Retrieval (VDR), the task of retrieving document pages based on a text query. The finetuned model showed a marked performance increase, achieving an NDCG@10 score of 0.947, up from the base model's 0.888, and outperforming other models up to four times its size.
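The NDCG@10 scores quoted above follow the standard normalized discounted cumulative gain formula. As a generic illustration (not code from the walkthrough), the metric can be computed like this:

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# With a single relevant document page, NDCG@10 rewards ranking it near the top:
print(ndcg_at_k([1, 0, 0, 0, 0]))            # relevant page ranked first
print(round(ndcg_at_k([0, 1, 0, 0, 0]), 3))  # relevant page ranked second
```

Because the gain is discounted logarithmically by rank, a retriever that surfaces the correct page first scores 1.0, while slipping it to second place already costs roughly a third of the score, which is why the jump from 0.888 to 0.947 is meaningful.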
A Closer Look at the Technical Implementation
The new training pipeline extends the library's existing framework, simplifying the process of working with combined text, image, audio, and video data. The system automatically handles the preprocessing of different modalities by inspecting the model's processor, removing a major hurdle for developers. Aarsen's example highlights the core components required for a successful finetuning operation:
- Model: An existing multimodal model, such as Qwen/Qwen3-VL-Embedding-2B, serves as the base for finetuning.
- Dataset: A domain-specific dataset containing pairs of inputs, such as text queries and corresponding document images, along with hard negatives.
- Loss Function: A function like `CachedMultipleNegativesRankingLoss` is used to optimize the model by increasing the similarity between correct pairs and decreasing it for incorrect ones.
- Trainer: The `SentenceTransformerTrainer` orchestrates the entire process, bringing together the model, data, and loss function.
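To make the loss component concrete, here is a minimal pure-Python sketch of the idea behind multiple negatives ranking loss (the `Cached` variant named above adds gradient caching for larger effective batches, which this sketch omits): each query's matching document is treated as the correct class in a cross-entropy over similarity scores, with the other in-batch documents acting as negatives. The similarity values below are toy numbers, not real model output.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mnrl_loss(sim_matrix, scale=20.0):
    """In-batch multiple negatives ranking loss.

    sim_matrix[i][j] = similarity(query_i, document_j); the diagonal holds
    the positive pairs, every off-diagonal entry is an in-batch negative.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        probs = softmax([scale * s for s in row])
        losses.append(-math.log(probs[i]))  # cross-entropy with target index i
    return sum(losses) / len(losses)

# Toy cosine similarities for a batch of 3 (query, document image) pairs:
separated = [[0.9, 0.1, 0.0],
             [0.2, 0.8, 0.1],
             [0.0, 0.1, 0.9]]
confused  = [[0.5, 0.5, 0.5],
             [0.5, 0.5, 0.5],
             [0.5, 0.5, 0.5]]
print(mnrl_loss(separated) < mnrl_loss(confused))  # True
```

Minimizing this loss is exactly the behavior described above: similarity on the diagonal (correct pairs) is pushed up, and similarity everywhere else is pushed down.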
This streamlined approach lets practitioners focus on data quality and model performance rather than the complexities of data collation and preprocessing. The library is also flexible about starting points: users can finetune existing embedding models, build new ones from base Vision-Language Model (VLM) checkpoints, or combine separate single-modality encoders using a `Router` module.
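Putting the components together, a finetuning run can be wired up roughly as follows. This is a configuration sketch based on the component names above, not the walkthrough's actual script: the dataset name and the training hyperparameters are illustrative placeholders.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

# 1. Model: an existing multimodal embedding model as the base.
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

# 2. Dataset: (query, positive page image, hard negative) examples.
#    "user/vdr-train" is a placeholder name, not a real dataset.
train_dataset = load_dataset("user/vdr-train", split="train")

# 3. Loss: pull correct (query, page) pairs together, push others apart;
#    the "Cached" variant trades compute for a larger effective batch size.
loss = CachedMultipleNegativesRankingLoss(model)

# 4. Trainer: ties model, data, and loss together.
args = SentenceTransformerTrainingArguments(
    output_dir="qwen3-vl-embedding-2b-vdr",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```

Note that the automatic modality preprocessing described above means the dataset can hold raw images alongside text columns; the trainer inspects the model's processor to collate them.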
Impact on Specialized AI Applications
This update lowers the barrier for creating highly specialized, efficient AI models for specific business domains. While large, general-purpose models are versatile, they often fall short in niche applications like VDR, which requires a nuanced understanding of document layouts, charts, and tables. By enabling straightforward finetuning, Sentence Transformers empowers organizations to develop smaller, more accurate, and cost-effective models tailored to their proprietary data. This move reflects a broader industry trend toward optimizing smaller, task-specific models that can deliver superior performance and efficiency compared to their larger, generalist counterparts.
The ability to easily finetune open-source multimodal models on domain-specific data provides a direct, cost-effective path for enterprises to build specialized AI retrieval systems that outperform larger, general-purpose APIs.