We introduce Aether Weaver, a novel, integrated framework for multimodal narrative cogeneration that overcomes limitations of sequential text-to-visual pipelines. Our system concurrently synthesizes t…
Instruction tuning is now a widely adopted approach to aligning large multimodal models (LMMs) to follow human intent. It unifies the data format of vision-language tasks, enabling multi-task joint tr…
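As a rough illustration of that unification, heterogeneous vision-language tasks can all be cast into one (image, instruction, response) record format so they can be shuffled into a single training stream; the field names, file paths, and answers below are assumptions for illustration, not any specific dataset's schema.

```python
# Minimal sketch of a unified instruction-tuning data format (illustrative only).
captioning_sample = {
    "image": "images/coco_000000139.jpg",
    "instruction": "Describe the image in one sentence.",
    "response": "A group of people riding bikes down a city street.",
}
vqa_sample = {
    "image": "images/vqa_000000285.jpg",
    "instruction": "What color is the bus on the left?",
    "response": "Blue.",
}
# Because both tasks share the same (image, instruction, response) schema,
# they can be mixed freely for multi-task joint training.
train_mixture = [captioning_sample, vqa_sample]
```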
Click-through rate prediction is an essential task in industrial applications, such as online advertising. Recently, deep learning-based models have been proposed, which follow a similar Embedding&ML…
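A minimal sketch of what such an Embedding&MLP-style model looks like (the field sizes, layer widths, and names here are assumptions for illustration, not taken from the excerpt): sparse categorical features are mapped to dense embeddings, concatenated, and passed through a feed-forward network that outputs a click probability.

```python
# Sketch of the Embedding&MLP paradigm for CTR prediction (illustrative only).
import torch
import torch.nn as nn

class EmbeddingMLP(nn.Module):
    def __init__(self, field_sizes, embed_dim=8, hidden=(64, 32)):
        super().__init__()
        # One embedding table per categorical feature field (e.g. user id, item id).
        self.embeddings = nn.ModuleList(
            nn.Embedding(n, embed_dim) for n in field_sizes
        )
        layers, in_dim = [], embed_dim * len(field_sizes)
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers.append(nn.Linear(in_dim, 1))  # click logit
        self.mlp = nn.Sequential(*layers)

    def forward(self, x):  # x: (batch, num_fields) of category indices
        embedded = torch.cat(
            [emb(x[:, i]) for i, emb in enumerate(self.embeddings)], dim=1
        )
        return torch.sigmoid(self.mlp(embedded)).squeeze(-1)

model = EmbeddingMLP(field_sizes=[1000, 500, 50])
p_click = model(torch.randint(0, 50, (4, 3)))  # predicted click probabilities
```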
Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that nativel…
Multimodal emotion recognition has experienced rapid development in recent years [1, 2]. Current works mainly focus on collecting larger and more realistic datasets [3, 4], or building more effective…
Multimodal Large Language Models (MLLMs) have demonstrated strong performance across a wide range of vision-language tasks, yet their internal processing dynamics remain underexplored. In this work, w…
There is still a lack of exploration of how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency an…
Concurrently, there is an emerging research trend focused on extending these LLM-powered AI agents into the multimodal domain. This extension enables AI agents to interpret and respond to diverse mult…
Solving this problem requires not only understanding the combined visual and textual information but also applying the lever balance principle by comparing the effects on both sides. One intuitive sol…
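For concreteness, the lever balance principle states that the beam is in equilibrium when the moments about the pivot are equal, w1 · d1 = w2 · d2; for instance, a weight of 6 placed 2 units from the pivot balances a weight of 4 placed 3 units away, since both products equal 12 (the symbols and numbers here are illustrative, not taken from the excerpt).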
However, real-world deployment demands continuous adaptation to evolving instructions and domain requirements—a paradigm known as continual instruction tuning (He et al. 2023a), where the model increm…
The 2015 work on “learning to think” [28] proposed to connect both NNs through recurrent connections (trained by the second NN’s learning algorithm) that allow one NN to interview the other by sending…
Flow Theory describes an optimal cognitive state where individuals experience deep focus and intrinsic motivation when a task’s difficulty aligns with their skill level. In AI-augmented reasoning, int…
Personalizing text-to-image models to generate images of specific subjects across diverse scenes and styles is a rapidly advancing field. Current approaches often face challenges in maintaining a bala…
Existing general multimodal large language models (MLLMs) (Bai et al., 2023; Zhu et al., 2023; Liu et al., 2024b) exhibit exceptional visual perception, enabling both image segmentation and textual re…
Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhanci…
In this paper, we propose a new dataset, ReasonVQA, for the Visual Question Answering (VQA) task. Our dataset is automatically integrated with structured encyclopedic knowledge and constructed using a…
Vision-Language Models (VLMs) often suffer from visual hallucinations – saying things that aren’t actually in the image – and language shortcuts, where they skip the visual part and just rely on text …
Humans spontaneously use increasingly efficient language as interactions progress, by adapting and forming ad-hoc conventions. This phenomenon has been studied extensively using reference games, showi…
Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such langu…
Situation recognition (SR) is a fundamental task in computer vision that aims to extract structured semantic summaries from images by identifying key events and their associated entities. Speci…
Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, …