
Optimizing Multimodal AI: Introducing VLM2VEC and MMEB for Universal Embedding Solutions

Multimodal embeddings integrate visual and textual information into a shared representational space, allowing systems to interpret and connect images and language effectively. These embeddings support tasks such as visual question answering, retrieval, classification, and grounding, and they underpin AI systems that must process real-world data through both visual and linguistic channels, as in document analysis, digital assistants, or visual search engines.

A significant challenge lies in the current models’ lack of effective generalization across various tasks and modalities. Typically, these models excel in specific tasks but falter when introduced to new datasets. Additionally, the absence of a comprehensive benchmark for evaluating multimodal tasks results in fragmented and inconsistent performance assessments. This limitation hinders the models’ functionality in realistic cross-domain applications, especially when encountering new data distributions.

Tools such as CLIP, BLIP, and SigLIP have been developed to create visual-textual embeddings. These models usually employ separate encoders for images and text, combining their outputs through basic operations like score-level fusion. While useful as baselines, these methods exhibit restricted cross-modal reasoning and generalization, often performing poorly in zero-shot scenarios due to superficial fusion techniques and the absence of task-specific instruction handling in training.
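
To make the dual-encoder pattern and score-level fusion concrete, the sketch below encodes an image and a text query separately with an off-the-shelf CLIP checkpoint and combines their similarities to a candidate with a simple weighted sum. The fusion weight `alpha` and the helper functions are illustrative assumptions, not any specific baseline's implementation.

```python
# A minimal sketch of a dual-encoder baseline with score-level fusion,
# using Hugging Face's CLIP. The weighted-sum fusion is an illustrative
# assumption of how separate image and text scores might be combined.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_text(text: str) -> torch.Tensor:
    # Encode text with CLIP's text tower only.
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    return F.normalize(model.get_text_features(**inputs), dim=-1)

@torch.no_grad()
def embed_image(image: Image.Image) -> torch.Tensor:
    # Encode the image with CLIP's vision tower only.
    inputs = processor(images=image, return_tensors="pt")
    return F.normalize(model.get_image_features(**inputs), dim=-1)

def fused_score(query_image: Image.Image, query_text: str,
                candidate_text: str, alpha: float = 0.5) -> float:
    # Score-level fusion: each query modality is scored against the
    # candidate independently, then the two cosine similarities are mixed.
    cand = embed_text(candidate_text)
    img_sim = (embed_image(query_image) @ cand.T).item()
    txt_sim = (embed_text(query_text) @ cand.T).item()
    return alpha * img_sim + (1 - alpha) * txt_sim
```

Because the two modalities only meet at the score level, the model never jointly reasons over the image and the accompanying text, which is part of why such baselines struggle on composed, instruction-driven tasks.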


A collaborative effort between Salesforce Research and the University of Waterloo introduced a new model called VLM2VEC and a comprehensive benchmark named MMEB. MMEB consists of 36 datasets across four key tasks: classification, visual question answering, retrieval, and visual grounding. It separates datasets into 20 for training and 16 for evaluation, including out-of-distribution tasks. The VLM2VEC framework aims to transform any vision-language model into an embedding model using contrastive training, enabling it to process any combination of text and images while following task instructions.
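
To make the "any combination of text and images, following task instructions" idea concrete, the sketch below shows one way such a training or evaluation instance could be represented. The field names and the sample instruction are illustrative assumptions rather than the benchmark's actual schema.

```python
# A minimal sketch of how a single MMEB-style instance might be represented once
# every task is cast as (instruction + query) -> target pairs. Field names and
# the sample instruction are illustrative assumptions, not the real schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EmbeddingExample:
    instruction: str               # task description prepended to the query
    query_text: Optional[str]      # textual part of the query, if any
    query_image: Optional[str]     # path or URL of the query image, if any
    target_text: Optional[str]     # textual target (answer, caption, class label, ...)
    target_image: Optional[str]    # visual target (e.g., for image retrieval)

# A retrieval-style instance: a composed text+image query with a textual target.
example = EmbeddingExample(
    instruction="Represent the image and question for retrieving a relevant answer.",
    query_text="What breed is the dog in the picture?",
    query_image="images/dog_001.jpg",
    target_text="A golden retriever sitting on a porch.",
    target_image=None,
)
```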

To develop VLM2VEC, the research team utilized backbone models such as Phi-3.5-V and LLaVA-1.6. The approach starts by constructing task-specific, instruction-based queries and targets, which are processed by the vision-language model to generate embeddings. Contrastive training is then applied using the InfoNCE loss with cosine similarity, aligning embeddings by maximizing the similarity between matching query-target pairs and minimizing it for mismatched ones. To accommodate the large batch sizes needed for diverse in-batch negatives, GradCache divides each batch into manageable sub-batches and accumulates gradients, keeping training efficient despite the high memory demands of multimodal inputs. Embedding task-specific instructions in the training pipeline lets the model adapt its encoding to the task at hand, such as grounding or retrieval, thereby improving generalization.
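
The sketch below illustrates the in-batch InfoNCE objective with cosine similarity described above. The temperature value and function names are illustrative assumptions, and GradCache-style sub-batching is only noted in a comment rather than implemented.

```python
# A minimal sketch of in-batch InfoNCE with cosine similarity, assuming query
# and target embeddings have already been produced by the vision-language
# backbone. The temperature value and function names are illustrative.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  target_emb: torch.Tensor,
                  temperature: float = 0.02) -> torch.Tensor:
    """Each query's positive is the target at the same batch index; every
    other target in the batch serves as an in-batch negative."""
    q = F.normalize(query_emb, dim=-1)          # cosine similarity via
    t = F.normalize(target_emb, dim=-1)         # normalized dot products
    logits = q @ t.T / temperature              # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# In practice, GradCache-style training would split a large batch into
# sub-batches and accumulate gradients so that memory stays bounded; that
# machinery is omitted here for brevity.
if __name__ == "__main__":
    queries = torch.randn(8, 768, requires_grad=True)   # stand-ins for VLM embeddings
    targets = torch.randn(8, 768, requires_grad=True)
    loss = info_nce_loss(queries, targets)
    loss.backward()
    print(f"InfoNCE loss: {loss.item():.4f}")
```

Because every other target in a batch acts as a negative, larger batches supply more varied negatives, which is why the GradCache sub-batching described above matters in practice.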


The results highlight the advantages of the proposed method. The best-performing version of VLM2VEC used LLaVA-1.6 as its backbone, applied LoRA tuning, and processed images at a resolution of 1344 × 1344. This setup achieved a Precision@1 score of 62.9% across all 36 MMEB datasets. In zero-shot tests on the 16 out-of-distribution datasets, it maintained a robust 57.1%. Compared to the leading baseline model without fine-tuning, which scored 44.7%, VLM2VEC delivered a substantial improvement, underscoring the benefit of instruction-aware contrastive training.
