Integrating and processing diverse data types remains a fundamental challenge in multimodal artificial intelligence. Current approaches rely mainly on late fusion, combining separately pre-trained unimodal models, such as vision encoders and language models, after independent pre-training. While pragmatic, this raises the question of whether it is the best strategy for genuine multimodal understanding. Biases inherited from unimodal pre-training can limit a model's ability to associate information across modalities. Scaling such systems adds further complexity: each component has its own hyperparameters and pre-training requirements, making it difficult to allocate compute and data across modalities and ultimately affecting both scalability and performance on complex multimodal tasks.
Existing research on multimodal integration largely follows the late-fusion recipe of pairing pre-trained vision encoders with language models, which has become the de facto standard. Early-fusion models, which combine modalities at the input rather than after separate encoding, remain underexplored despite their potential benefits. Models trained from scratch on all modalities offer an alternative, although many still rely on image tokenizers to convert visual data into discrete tokens. Mixture of Experts (MoE) has been used to scale language models efficiently, but its application to multimodal models is less studied. And while scaling laws are well established for unimodal models, few works have examined these dynamics in truly multimodal systems, especially early-fusion models operating on raw inputs.
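To make the contrast between the two design families concrete, here is a minimal PyTorch-style sketch, assuming a generic transformer backbone: a late-fusion model that joins a separately built vision encoder and language model through an adapter, and an early-fusion model that projects image patches and text tokens into a single stream for one backbone. All class names, dimensions, and wiring are illustrative assumptions, not the architecture of any specific published system.

```python
# Hedged sketch: late-fusion vs. early-fusion multimodal models.
# All names, sizes, and wiring are illustrative assumptions.
import torch
import torch.nn as nn


class LateFusionModel(nn.Module):
    """Separately built vision encoder + language model, joined after the fact."""

    def __init__(self, d_model=512, vocab_size=32000, patch_dim=768):
        super().__init__()
        # Unimodal components (in practice, each would be pre-trained independently).
        self.vision_encoder = nn.Sequential(
            nn.Linear(patch_dim, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4
        )
        # Adapter mapping vision features into the language model's space.
        self.adapter = nn.Linear(d_model, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, text_ids):
        vis = self.adapter(self.vision_encoder(patches))   # (B, n_patches, d)
        txt = self.text_embed(text_ids)                     # (B, seq, d)
        fused = torch.cat([vis, txt], dim=1)                # fusion happens late
        return self.lm_head(self.language_model(fused))


class EarlyFusionModel(nn.Module):
    """Single transformer trained from scratch on a mixed token stream."""

    def __init__(self, d_model=512, vocab_size=32000, patch_dim=768):
        super().__init__()
        # Only lightweight per-modality projections; no separate encoder stack.
        self.patch_proj = nn.Linear(patch_dim, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=6
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, text_ids):
        tokens = torch.cat([self.patch_proj(patches), self.text_embed(text_ids)], dim=1)
        return self.lm_head(self.backbone(tokens))          # fusion happens at the input


if __name__ == "__main__":
    patches = torch.randn(2, 196, 768)            # dummy image patches
    text_ids = torch.randint(0, 32000, (2, 32))   # dummy text tokens
    print(LateFusionModel()(patches, text_ids).shape)
    print(EarlyFusionModel()(patches, text_ids).shape)
```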
Researchers from Sorbonne University and Apple study the scaling properties of native multimodal models trained from scratch, challenging common assumptions about design choices. Contrasting early-fusion models, which consume raw multimodal inputs directly, with conventional late-fusion approaches, they show that late fusion holds no intrinsic advantage when both are trained from scratch. In fact, early-fusion models are more compute-efficient and scale much like language models, with only small differences in scaling exponents across modalities and data mixtures. Their analysis indicates that compute-optimal training scales model parameters and training tokens roughly in proportion, a trend that holds across the multimodal configurations studied. The work further investigates MoE architectures, which allow parameters to adapt symmetrically across modalities, yielding significant performance gains and faster learning than dense baselines. For these sparse models, the scaling analysis favors allocating compute to training tokens over active parameters, since the larger total parameter count already supports effective learning, a departure from the behavior of dense models.
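The proportional-scaling claim can be stated with the standard parametric scaling-law form used in this line of work. The block below is a generic Chinchilla-style formulation with placeholder symbols, not the paper's fitted constants, and the compute estimate C ≈ 6ND is the usual dense-transformer approximation, assumed here purely for illustration.

```latex
% Generic Chinchilla-style scaling law (placeholder symbols, not fitted values):
% loss L as a function of model parameters N and training tokens D.
\[
  L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]
% Minimizing L under a fixed compute budget C (approximated as C \approx 6ND
% for dense transformers) yields power-law optima:
\[
  N^{*} \propto C^{a}, \qquad D^{*} \propto C^{b}, \qquad a + b = 1.
\]
% "Scaling parameters and tokens proportionally" corresponds to a \approx b \approx 0.5:
% the optimal N and D grow at roughly the same rate as compute increases.
```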
The study yields several key findings about scaling and configuring multimodal models. Native early-fusion and late-fusion designs reach similar performance when trained from scratch, with early fusion holding a slight edge at lower compute budgets. Scaling analysis confirms that compute-optimal models of both kinds perform comparably as compute grows. Notably, native multimodal models (NMMs) follow scaling laws similar to those of text-only language models, with exponents that vary slightly with the target data type and training mixture. Compute-optimal late-fusion models require a higher parameter-to-data ratio than early-fusion models, pointing to different resource allocation strategies. Sparse MoE architectures substantially improve early-fusion NMMs, outperforming dense models at matched inference cost while learning modality-specific weights automatically. In these sparse models, optimal scaling increasingly favors training tokens over active parameters as the compute budget rises. Remarkably, modality-agnostic routing consistently outperforms modality-aware routing in sparse mixtures, challenging the assumption that explicit modality specialization is needed.
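As a rough illustration of the routing comparison, the hedged sketch below implements a minimal top-1 MoE layer with two interchangeable strategies: a learned, modality-agnostic router that sees only token content, and a hard-coded, modality-aware router that dispatches tokens by a modality tag. Names, sizes, and the routing rule are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: top-1 MoE layer with modality-agnostic vs. modality-aware routing.
# All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class Top1MoE(nn.Module):
    def __init__(self, d_model=512, n_experts=4, modality_aware=False):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # learned, content-based router
        self.modality_aware = modality_aware
        self.n_experts = n_experts

    def forward(self, x, modality_ids):
        # x: (B, T, d); modality_ids: (B, T), e.g. 0 = image token, 1 = text token.
        if self.modality_aware:
            # Hard routing by modality tag: each modality maps to a fixed expert.
            expert_idx = modality_ids % self.n_experts
            gate = torch.ones_like(expert_idx, dtype=x.dtype)
        else:
            # Modality-agnostic routing: the router looks only at token content.
            logits = self.router(x)                        # (B, T, n_experts)
            gate, expert_idx = logits.softmax(-1).max(-1)  # top-1 gate and expert id
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    x = torch.randn(2, 10, 512)
    modality_ids = torch.randint(0, 2, (2, 10))
    print(Top1MoE(modality_aware=False)(x, modality_ids).shape)  # learned routing
    print(Top1MoE(modality_aware=True)(x, modality_ids).shape)   # routing by modality tag
```

A learned router of this kind can still discover modality-specific experts on its own, which is consistent with the reported finding that modality-agnostic routing beats hard modality assignment.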