The Evolution of Multimodal Model Architectures
This work uniquely identifies and characterizes four prevalent architectural patterns in the contemporary multimodal landscape. Systematically categorizing models by architecture type makes developments in the multimodal domain easier to monitor. Unlike recent survey papers that present general information on multimodal architectures, this research examines architectural details in depth and identifies four specific architecture types. The types are distinguished by how they integrate multimodal inputs into the deep neural network model. The first two types (Type-A and Type-B) deeply fuse multimodal inputs within the internal layers of the model, whereas the other two types (Type-C and Type-D) perform early fusion at the input stage. Type-A employs standard cross-attention, whereas Type-B uses custom-designed layers for modality fusion within the internal layers. Type-C, in turn, uses modality-specific encoders, while Type-D uses tokenizers to process the modalities at the model’s input stage. The identified architecture types aid the monitoring of any-to-any multimodal model development.
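The four-way taxonomy described above can be written down as a small lookup keyed by fusion stage and fusion mechanism. The following Python sketch is illustrative only: the names `FusionStage`, `FusionMechanism`, `TAXONOMY`, and `classify` are hypothetical and do not come from this work; only the mapping itself follows the description above.

```python
from enum import Enum


class FusionStage(Enum):
    """Where the modalities are fused (hypothetical naming)."""
    INTERNAL_LAYERS = "deep fusion within the internal layers"
    INPUT_STAGE = "early fusion at the input stage"


class FusionMechanism(Enum):
    """How the modalities are fused (hypothetical naming)."""
    CROSS_ATTENTION = "standard cross-attention"
    CUSTOM_LAYERS = "custom-designed layers"
    MODALITY_ENCODERS = "modality-specific encoders"
    TOKENIZERS = "tokenizers"


# Mapping taken from the description of the four types above.
TAXONOMY = {
    (FusionStage.INTERNAL_LAYERS, FusionMechanism.CROSS_ATTENTION): "Type-A",
    (FusionStage.INTERNAL_LAYERS, FusionMechanism.CUSTOM_LAYERS): "Type-B",
    (FusionStage.INPUT_STAGE, FusionMechanism.MODALITY_ENCODERS): "Type-C",
    (FusionStage.INPUT_STAGE, FusionMechanism.TOKENIZERS): "Type-D",
}


def classify(stage: FusionStage, mechanism: FusionMechanism) -> str:
    """Return the architecture type for a (stage, mechanism) pair."""
    try:
        return TAXONOMY[(stage, mechanism)]
    except KeyError:
        # Combinations outside the four identified patterns are not covered.
        raise ValueError(
            f"No architecture type for {stage.name} + {mechanism.name}"
        ) from None


print(classify(FusionStage.INPUT_STAGE, FusionMechanism.TOKENIZERS))  # Type-D
```

A pair that falls outside the four patterns, such as tokenizers combined with deep fusion, raises a `ValueError` in this sketch rather than being forced into one of the types.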
Introduction. The multimodal domain of machine learning has seen significant advancements in recent years. The number of models capable of processing images, audio, or video in conjunction with text (language) has expanded notably (Alayrac et al. [2022], Lu et al. [2023], Mizrahi et al. [2024], Wu et al. [2023a], Yang et al. [2024], Tang et al. [2023a]). Progress has been particularly evident in the integration of image and text modalities across diverse vision-language tasks, primarily because of the Transformer model (Vaswani et al. [2017]). The Transformer, a pioneering deep neural network (NN) architecture, has provided a unified framework for cross-domain learning. This single model is remarkably effective at comprehending and processing data from diverse domains. The introduction of the Transformer model (Vaswani et al. [2017]) for Natural Language Processing (NLP) in 2017 marked the inception of transformer-based model architectures. Subsequently, the introduction of the Vision Transformer (ViT) Dosovitskiy et al.
Discussion / Conclusion. This section explores multimodal models with multimodal input and multimodal output. A plethora of models exists for mapping any input modality to text output. In contrast, significantly fewer multimodal models can generate output modalities other than text. Multimodal output generation is one of the primary challenges in the multimodal domain. Type-C and Type-D multimodal architectures are at the forefront of development for any-to-any multimodal models. The representative models are highlighted in Figure 7. These dominant multimodal model architectures address some, though not all, of the challenging aspects of multimodal generation. Type-D simplifies the training process by utilizing input tokenization, enabling the use of a standard auto-regressive objective function for model training.
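To make the last point concrete, the sketch below shows why input tokenization permits a standard auto-regressive objective: once every modality is mapped to discrete ids in one shared vocabulary, the loss is the usual next-token negative log-likelihood over a single sequence. This is a minimal pure-Python illustration under stated assumptions, not the training setup of any particular Type-D model. The vocabulary layout, the toy intensity quantizer standing in for a learned image tokenizer, and the uniform stand-in model (`uniform_logits`) are all hypothetical.

```python
import math

# Hypothetical shared vocabulary: text ids first, then discrete image codes.
TEXT_VOCAB = {"<bos>": 0, "a": 1, "red": 2, "square": 3, "<img>": 4}
IMAGE_CODEBOOK_SIZE = 8
IMAGE_OFFSET = len(TEXT_VOCAB)
VOCAB_SIZE = IMAGE_OFFSET + IMAGE_CODEBOOK_SIZE


def tokenize_text(words):
    """Map words to text token ids."""
    return [TEXT_VOCAB[w] for w in words]


def tokenize_image(intensities):
    """Toy stand-in for an image tokenizer: quantize values in [0, 1]
    into codebook bins and shift them past the text ids."""
    ids = []
    for value in intensities:
        code = min(int(value * IMAGE_CODEBOOK_SIZE), IMAGE_CODEBOOK_SIZE - 1)
        ids.append(IMAGE_OFFSET + code)
    return ids


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def autoregressive_loss(tokens, logits_fn):
    """Mean negative log-likelihood of each token given its prefix."""
    nll = 0.0
    for t in range(1, len(tokens)):
        probs = softmax(logits_fn(tokens[:t]))
        nll -= math.log(probs[tokens[t]])
    return nll / (len(tokens) - 1)


def uniform_logits(prefix):
    """Placeholder model that assigns equal logits to every token."""
    return [0.0] * VOCAB_SIZE


# One mixed-modality sequence; the objective treats all ids alike.
sequence = tokenize_text(["<bos>", "a", "red", "square", "<img>"])
sequence += tokenize_image([0.1, 0.5, 0.9, 1.0])
loss = autoregressive_loss(sequence, uniform_logits)
print(sequence, loss)
```

With the uniform placeholder model the loss equals the logarithm of the vocabulary size; a trained model would replace `uniform_logits` with a network that conditions on the prefix. The loss function itself never needs to know which positions hold text and which hold image codes.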