Aether Weaver: Multimodal Affective Narrative Co-Generation with Dynamic Scene Graphs

Paper · arXiv 2507.21893 · Published July 29, 2025

We introduce Aether Weaver, a novel, integrated framework for multimodal narrative co-generation that overcomes limitations of sequential text-to-visual pipelines. Our system concurrently synthesizes textual narratives, dynamic scene graph representations, visual scenes, and affective soundscapes, driven by a tightly integrated co-generation mechanism. At its core, the Narrator, a large language model, generates narrative text and multimodal prompts, while the Director acts as a dynamic scene graph manager: it analyzes the text to build and maintain a structured representation of the story’s world, ensuring spatio-temporal and relational consistency for visual rendering and subsequent narrative generation. Additionally, a Narrative Arc Controller guides the high-level story structure and influences multimodal affective consistency, complemented by an Affective Tone Mapper that ensures congruent emotional expression across all modalities. Through qualitative evaluations on a diverse set of narrative prompts spanning various genres, we demonstrate that Aether Weaver significantly enhances narrative depth, visual fidelity, and emotional resonance compared to cascaded baseline approaches.

Conventional AI storytelling pipelines typically adopt a cascaded approach: text is generated first and then passed to separate, often disconnected visual and auditory synthesis modules, which produce a multimodal story in a single run without intermediate user engagement [1, 3, 11, 20]. This sequential processing frequently leads to critical shortcomings such as spatio-temporal inconsistencies, misaligned emotional tones, and a semantic disconnect between the narrative’s evolving state and its multimodal manifestations. For instance, a character’s emotional shift or the sudden appearance of a critical object might not be accurately or coherently reflected in subsequent visual or auditory outputs, diminishing the overall immersive experience.

Scene Graph Representations

Scene graphs have proven invaluable in computer vision for representing objects, their attributes, and their relationships within an image or video [10]. They provide a structured, symbolic representation that facilitates complex reasoning and generation tasks. While they are primarily used for understanding existing visual content, there is also emerging work on generating images from scene graphs [10, 12]. Crucially, the application of knowledge graphs, including scene graphs, extends to improving consistency and logical reasoning in story generation. Research has shown that leveraging commonsense knowledge graphs and axioms can lead to more sensible story endings and improve the underlying narrative structures [8]. Furthermore, approaches to controllable text generation using external knowledge bases demonstrate the power of such structured data to guide narrative creation [19]. Recent work also explores controllable logical hypothesis generation using knowledge graphs to enhance abductive reasoning, which can be applied to scenario planning in dynamic environments [5]. Our work uses scene graphs not just for static image generation but for dynamically tracking and updating the entire visual and conceptual world of a narrative as it unfolds. This dynamic use of the scene graph, together with its role in driving multimodal co-generation and ensuring spatio-temporal and relational coherence across a story’s progression, distinguishes our approach from prior work: the scene graph serves as a central, continuously updated source of truth for the entire narrative.
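
To make the idea of a continuously updated scene graph concrete, the following is a minimal Python sketch, not the paper's implementation. The class name SceneGraph, its methods, the "exclusive" flag, and the example entities (Mira, lantern, cellar, attic) are all hypothetical and chosen only for illustration. It shows entities with attributes, relations stored as (subject, predicate, object) triples, and a per-step snapshot so that earlier story states remain available for consistency checks.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical minimal dynamic scene graph (illustrative only)."""
    attributes: dict = field(default_factory=dict)  # entity -> {attribute: value}
    relations: set = field(default_factory=set)     # (subject, predicate, object)
    history: list = field(default_factory=list)     # (step, frozen relations)

    def add_entity(self, name, **attrs):
        # Create the entity if needed, then merge in new attribute values.
        self.attributes.setdefault(name, {}).update(attrs)

    def relate(self, subj, pred, obj, exclusive=False):
        # Reject relations that mention entities the story has not introduced.
        for name in (subj, obj):
            if name not in self.attributes:
                raise KeyError(f"unknown entity: {name}")
        if exclusive:
            # An exclusive predicate (e.g. location) replaces older triples
            # with the same subject and predicate.
            self.relations = {
                r for r in self.relations
                if not (r[0] == subj and r[1] == pred)
            }
        self.relations.add((subj, pred, obj))

    def commit(self, step):
        # Snapshot the relations so earlier narrative states can be inspected.
        self.history.append((step, frozenset(self.relations)))

    def describe(self):
        # Flatten the current state into sorted, prompt-friendly strings.
        return sorted(f"{s} {p} {o}" for s, p, o in self.relations)


# Usage: two narrative steps in which a character moves and changes mood.
graph = SceneGraph()
for entity in ("lantern", "cellar", "attic"):
    graph.add_entity(entity)
graph.add_entity("Mira", mood="calm")
graph.relate("Mira", "holds", "lantern")
graph.relate("Mira", "in", "cellar", exclusive=True)
graph.commit(step=1)

graph.add_entity("Mira", mood="afraid")
graph.relate("Mira", "in", "attic", exclusive=True)
graph.commit(step=2)

print(graph.describe())
```

In this sketch the current relations could be serialized into prompts for later text or image generation, while the history list keeps earlier states for checking that a new narrative step does not contradict what came before. How Aether Weaver actually represents and updates its graph is not specified in this excerpt.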

Our examination of the issue is centred on a specific example of AI-generated text, provocatively entitled the Xeno Sutra, which was produced in the course of a lengthy conversation between an LLM-based dialogue agent and one of the present authors (Shanahan). The Xeno Sutra, in keeping with the language of the prompts used to generate it, does not read like a traditional piece of scripture. Its twelve verses blend the terminology of modern physics and computer science with concepts from ancient Hindu and Buddhist philosophy. One line even features an Egyptian hieroglyph.

It is easy – and often appropriate – to dismiss such material as meaningless word salad, or as “AI slop”. However, the Xeno Sutra’s density of symbolism and richness of allusion repay closer reading. It poetically evokes emptiness (śūnyatā) through paradoxical imagery and self-undermining assertion, while playfully reworking traditional sutra forms. With an open mind, we can receive it as a valid, if not quite “authentic”, teaching, mediated by a non-human entity with a unique form of textual access to centuries of human insight.