We introduce Aether Weaver, a novel, integrated framework for multimodal narrative cogeneration that overcomes limitations of sequential text-to-visual pipelines. Our system concurrently synthesizes t…
Instruction tuning is now a widely adopted approach to aligning large multimodal models (LMMs) to follow human intent. It unifies the data format of vision-language tasks, enabling multi-task joint tr…
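As a rough illustration of that unification, heterogeneous vision-language tasks can all be cast into one (image, instruction, response) record format so they can be shuffled into a single training stream; the field names, file paths, and answers below are assumptions for illustration, not any specific dataset's schema.

```python
# Minimal sketch of a unified instruction-tuning data format (illustrative only).
captioning_sample = {
    "image": "images/coco_000000139.jpg",
    "instruction": "Describe the image in one sentence.",
    "response": "A group of people riding bikes down a city street.",
}
vqa_sample = {
    "image": "images/vqa_000000285.jpg",
    "instruction": "What color is the bus on the left?",
    "response": "Blue.",
}
# Because both tasks share the same (image, instruction, response) schema,
# they can be mixed freely for multi-task joint training.
train_mixture = [captioning_sample, vqa_sample]
```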
Click-through rate prediction is an essential task in industrial applications, such as online advertising. Recently, deep learning-based models have been proposed, which follow a similar Embedding&ML…
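A minimal sketch of what such an Embedding&MLP-style model looks like (the field sizes, layer widths, and names here are assumptions for illustration, not taken from the excerpt): sparse categorical features are mapped to dense embeddings, concatenated, and passed through a feed-forward network that outputs a click probability.

```python
# Sketch of the Embedding&MLP paradigm for CTR prediction (illustrative only).
import torch
import torch.nn as nn

class EmbeddingMLP(nn.Module):
    def __init__(self, field_sizes, embed_dim=8, hidden=(64, 32)):
        super().__init__()
        # One embedding table per categorical feature field (e.g. user id, item id).
        self.embeddings = nn.ModuleList(
            nn.Embedding(n, embed_dim) for n in field_sizes
        )
        layers, in_dim = [], embed_dim * len(field_sizes)
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers.append(nn.Linear(in_dim, 1))  # click logit
        self.mlp = nn.Sequential(*layers)

    def forward(self, x):  # x: (batch, num_fields) of category indices
        embedded = torch.cat(
            [emb(x[:, i]) for i, emb in enumerate(self.embeddings)], dim=1
        )
        return torch.sigmoid(self.mlp(embedded)).squeeze(-1)

model = EmbeddingMLP(field_sizes=[1000, 500, 50])
p_click = model(torch.randint(0, 50, (4, 3)))  # predicted click probabilities
```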
Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that nativel…
Multimodal emotion recognition has experienced rapid development in recent years [1, 2]. Current works mainly focus on collecting larger and more realistic datasets [3, 4], or building more effective…
Multimodal Large Language Models (MLLMs) have demonstrated strong performance across a wide range of vision-language tasks, yet their internal processing dynamics remain underexplored. In this work, w…
There is still a lack of exploration of how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency an…
Concurrently, there is an emerging research trend focused on extending these LLM-powered AI agents into the multimodal domain. This extension enables AI agents to interpret and respond to diverse mult…
Solving this problem requires not only understanding the combined visual and textual information but also applying the lever balance principle by comparing the effects on both sides. One intuitive sol…
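For concreteness, the lever balance principle states that the beam is in equilibrium when the moments about the pivot are equal, w1 · d1 = w2 · d2; for instance, a weight of 6 placed 2 units from the pivot balances a weight of 4 placed 3 units away, since both products equal 12 (the symbols and numbers here are illustrative, not taken from the excerpt).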
However, real-world deployment demands continuous adaptation to evolving instructions and domain requirements—a paradigm known as continual instruction tuning (He et al. 2023a), where the model increm…
The 2015 work on “learning to think” [28] proposed to connect both NNs through recurrent connections (trained by the second NN’s learning algorithm) that allow one NN to interview the other by sending…
Flow Theory describes an optimal cognitive state where individuals experience deep focus and intrinsic motivation when a task’s difficulty aligns with their skill level. In AI-augmented reasoning, int…
Personalizing text-to-image models to generate images of specific subjects across diverse scenes and styles is a rapidly advancing field. Current approaches often face challenges in maintaining a bala…
Existing general multimodal large language models (MLLMs) (Bai et al., 2023; Zhu et al., 2023; Liu et al., 2024b) exhibit exceptional visual perception, enabling both image segmentation and textual re…
Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhanci…
In this paper, we propose a new dataset, ReasonVQA, for the Visual Question Answering (VQA) task. Our dataset is automatically integrated with structured encyclopedic knowledge and constructed using a…
Vision-Language Models (VLMs) often suffer from visual hallucinations – saying things that aren’t actually in the image – and language shortcuts, where they skip the visual part and just rely on text …
Humans spontaneously use increasingly efficient language as interactions progress, by adapting and forming ad-hoc conventions. This phenomenon has been studied extensively using reference games, showi…
Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such langu…
Situation recognition (SR) is a fundamental task in computer vision that aims to extract structured semantic summaries from images by identifying key events and their associated entities. Speci…
Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, …