VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Paper · arXiv 2501.01957 · Published January 3, 2025
Multimodal · Speech · Voice

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capabilities but also enables efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly accelerating the multimodal end-to-end response speed.

However, with the growing demand for human-computer interaction, the role of the speech modality has become increasingly prominent, especially in multimodal dialogue systems. In such a system, speech not only serves as a key medium for information transmission but also greatly improves the naturalness and convenience of interaction.

For example, visual data such as images convey spatial information, while speech data convey dynamic changes over time.

Traditional approaches rely on separate Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) modules, which can increase latency and reduce coherence.

In the first stage, we focus on vision-language training by training visual adapters and fine-tuning the model with descriptive caption and visual QA data. This step establishes the model's foundational visual capabilities, enabling robust image and video understanding. The second stage introduces audio input processing by training an audio encoder using speech-transcription paired data, followed by fine-tuning with speech QA data. This stage equips the model with the ability to understand and respond to audio inputs effectively. Finally, in the third stage, we train an audio decoder to enable end-to-end speech output, eliminating the need for external TTS modules. This allows VITA-1.5 to generate fluent speech replies, enhancing the naturalness and interactivity of multimodal dialogue systems.
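As a rough illustration of this three-stage schedule, the sketch below toggles which module groups are trainable in each stage. The module names, data labels, and the set_trainable helper are assumptions for illustration, not the authors' actual training code.

```python
from dataclasses import dataclass
import torch.nn as nn

@dataclass
class StageConfig:
    name: str
    trainable: set   # module groups updated in this stage
    data: list       # data mixtures used in this stage

# Illustrative stage definitions following the three-stage description above.
STAGES = [
    StageConfig("stage1_vision_language",
                trainable={"vision_encoder", "vision_adapter", "llm"},
                data=["descriptive_captions", "visual_qa"]),
    StageConfig("stage2_audio_input",
                trainable={"audio_encoder", "audio_adapter", "llm"},
                data=["speech_transcription_pairs", "speech_qa"]),
    StageConfig("stage3_audio_output",
                trainable={"audio_decoder"},   # only the speech decoder is updated here
                data=["text_speech_pairs"]),
]

def set_trainable(modules: dict[str, nn.Module], stage: StageConfig) -> None:
    """Freeze every module group except those listed for the current stage."""
    for name, module in modules.items():
        for p in module.parameters():
            p.requires_grad = name in stage.trainable
```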

Stage 1.1 Vision Alignment. We use 20% of the descriptive caption data from Table 1 for training, where only the visual adapter is trainable while the other modules are frozen. This allows the LLM to initially align with the visual modality.

Stage 1.2 Vision Understanding. In this stage, our goal is to teach the LLM to describe image content. To this end, we use all the descriptive caption data from Table 1. During this process, the encoder and adapter of the visual module, as well as the LLM, are trainable. The focus is to enable the model to establish a strong connection between vision and language by learning from descriptive texts about images, allowing it to understand image content by generating natural language descriptions.

Stage 1.3 Vision SFT. Following Stage 1.2, the model has acquired a basic understanding of images and videos. However, its instruction-following ability is still limited, and it struggles with visual QA tasks. To address this, we use all the QA data from Table 1 while retaining 20% of the descriptive caption data to increase the diversity of the dataset and the complexity of the tasks. During training, the encoder and adapter of the visual module, as well as the LLM, are trainable. The key objective of this stage is to enable the model not only to understand visual content but also to answer questions following instructions.
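For concreteness, a minimal sketch of the Stage 1.3 data mixture (all visual-QA data plus a retained 20% subset of the caption data) might look like the following; the function name, sample lists, and seeding are illustrative assumptions.

```python
import random

def build_stage1_3_mixture(qa_samples: list, caption_samples: list,
                           caption_keep_ratio: float = 0.2, seed: int = 0) -> list:
    """Combine all QA samples with a random fraction of the caption data."""
    rng = random.Random(seed)
    kept_captions = rng.sample(caption_samples,
                               k=int(len(caption_samples) * caption_keep_ratio))
    mixture = qa_samples + kept_captions
    rng.shuffle(mixture)   # interleave the two sources for training
    return mixture
```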

The training data consists of 11,000 hours of speech-transcription pairs. We follow a two-step approach: (a) Speech Encoder Training: We adopt a training framework used in common speech recognition systems, using a Connectionist Temporal Classification (CTC) loss function [18] to train the speech encoder. The aim is for the encoder to predict the transcription text from the speech input. This step ensures that the audio encoder can extract speech features and map them to the text representation space.
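A minimal sketch of this CTC step using PyTorch's nn.CTCLoss is shown below; the encoder architecture, vocabulary size, and feature shapes are placeholders, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

vocab_size = 4000            # assumed text-token vocabulary (CTC blank at index 0)
encoder = nn.Sequential(     # stand-in for the real speech encoder
    nn.Linear(80, 512), nn.ReLU(), nn.Linear(512, vocab_size)
)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_step(features, feat_lens, targets, target_lens):
    """features: (batch, time, 80) filterbanks; targets: (batch, max_label_len) token ids."""
    logits = encoder(features)                               # (batch, time, vocab)
    log_probs = logits.log_softmax(dim=-1).transpose(0, 1)   # CTCLoss expects (time, batch, vocab)
    return ctc_loss(log_probs, targets, feat_lens, target_lens)

# Example shapes only; real training would iterate over the 11,000-hour corpus.
loss = ctc_step(torch.randn(4, 200, 80),
                torch.full((4,), 200, dtype=torch.long),
                torch.randint(1, vocab_size, (4, 30)),
                torch.full((4,), 30, dtype=torch.long))
loss.backward()
```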

The training objective at this stage is to enable the LLM to output the transcription text of the speech data.

Audio SFT. The focus of this stage is to introduce the QA functionality with speech questions and text answers.
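One plausible layout for such a speech-QA training sample is sketched below; the field names and the audio placeholder token are assumptions for illustration, not the paper's actual data format.

```python
# Hypothetical audio-SFT sample: a spoken question paired with a text answer.
sample = {
    "audio": "question_0001.wav",                 # spoken question fed to the audio encoder
    "conversations": [
        {"role": "user", "content": "<audio>"},   # placeholder replaced by audio tokens
        {"role": "assistant", "content": "The Eiffel Tower is in Paris."},
    ],
}
```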

In addition, we add a classification head to the LLM’s output. This head is used to distinguish whether the input comes from speech or text. As a result, the model can more accurately interpret speech inputs and process different modalities efficiently and flexibly.
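A sketch of how such a modality classification head could sit on top of the LLM's hidden states is given below, assuming mean pooling over the sequence and a hidden size of 4096; both choices are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class ModalityHead(nn.Module):
    """Linear probe over pooled LLM hidden states: 0 = text query, 1 = speech query."""
    def __init__(self, hidden_size: int = 4096, num_classes: int = 2):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden); pool over the sequence, then classify
        pooled = hidden_states.mean(dim=1)
        return self.proj(pooled)

# Trained jointly with the LLM using a standard cross-entropy loss.
head = ModalityHead()
logits = head(torch.randn(2, 16, 4096))
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 1]))
```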

The training of this stage uses text-speech paired data, where the text is fed into the tokenizer and the embedding layer of the LLM to obtain its embedding vectors, and the speech is fed into the encoder of the codec model to obtain its speech tokens. The text embedding vectors are sent to the NAR speech decoder to obtain global semantic features, and then these features are sent to the AR speech decoder, which predicts the corresponding speech tokens.
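The following is a highly simplified sketch of that data flow, with stand-in Transformer blocks for the NAR and AR speech decoders; the module definitions, sizes, and decoding interface are assumptions rather than the actual model.

```python
import torch
import torch.nn as nn

hidden, codec_vocab = 768, 1024

nar_decoder = nn.TransformerEncoder(        # stand-in NAR decoder over text embeddings
    nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True), num_layers=2)
ar_decoder = nn.TransformerDecoder(         # stand-in AR decoder conditioned on NAR features
    nn.TransformerDecoderLayer(d_model=hidden, nhead=8, batch_first=True), num_layers=2)
token_head = nn.Linear(hidden, codec_vocab) # predicts discrete codec speech tokens

def speech_decode(text_embeddings: torch.Tensor, prev_token_embeds: torch.Tensor) -> torch.Tensor:
    """text_embeddings: (batch, text_len, hidden) from the LLM embedding layer.
    prev_token_embeds: (batch, speech_len, hidden) embeddings of already generated speech tokens."""
    global_features = nar_decoder(text_embeddings)                    # global semantic features
    decoded = ar_decoder(tgt=prev_token_embeds, memory=global_features)
    return token_head(decoded)                                        # logits over codec tokens

logits = speech_decode(torch.randn(1, 12, hidden), torch.randn(1, 5, hidden))
```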