PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing

Paper · arXiv 2604.05018 · Published April 6, 2026
Co-Writing · Collaboration · Autonomous Agents · Social Theory · Society · Discourses

Synthesizing unstructured research materials into manuscripts is an essential yet under-explored challenge in AI-driven scientific discovery. Existing autonomous writers are rigidly coupled to specific experimental pipelines and produce superficial literature reviews. We introduce PaperOrchestra, a multi-agent framework for automated AI research paper writing. It flexibly transforms unconstrained pre-writing materials into submission-ready LaTeX manuscripts, including comprehensive literature synthesis and generated visuals such as plots and conceptual diagrams. To evaluate performance, we present PaperWritingBench, the first standardized benchmark of reverse-engineered raw materials from 200 top-tier AI conference papers, alongside a comprehensive suite of automated evaluators. In side-by-side human evaluations, PaperOrchestra significantly outperforms autonomous baselines, achieving absolute win-rate margins of 50%–68% in literature review quality and 14%–38% in overall manuscript quality. (Project Page: https://yiwen-song.github.io/paper_orchestra/)

We formulate the end-to-end AI research paper generation task as a function mapping unconstrained pre-writing materials to a complete submission package. Specifically, the framework operates on the following input components:

• Idea Summary (I): A brief overview establishing the proposed methodology, core contributions, and theoretical foundation.

• Experimental Log (E): A compilation of experimental results, covering raw data points, ablation studies, and performance metrics.

• LaTeX Template (T): The template files provided by the target AI conference.

• Conference Guidelines (G): The requirements mandated by the target AI conference.

• Figures (F): An optional set of pre-existing visual assets (e.g., diagrams, plots). If no figures are provided (F = ∅), the pipeline autonomously synthesizes all relevant visuals.
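Under these definitions, the end-to-end task can be written compactly as a mapping from the five input components to a complete submission package. This is only an illustrative formalization: the symbols Φ (the framework) and ℳ (the output package) are our shorthand, not notation from the paper.

```latex
% Illustrative formalization: \Phi denotes the PaperOrchestra pipeline and
% \mathcal{M} the submission package; when F = \emptyset, the pipeline
% synthesizes all visuals itself.
\[
  \mathcal{M} \;=\; \Phi(I,\, E,\, T,\, G,\, F)
\]
```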

Synthesizing Raw Materials. Since the original pre-writing materials (e.g., lab notes) are unavailable, we prompt an LLM to reverse-engineer two core components from the extracted PDF content (App. C.2). To prevent information leakage, both components are fully anonymized (stripping authors and titles) and rendered strictly self-contained by removing all citations, URLs, and explicit figure or table references. We synthesize the following (App. C.3):

• Idea Summary (I): Distills the core methodology while explicitly excluding experimental results. We generate a Sparse variant (summarizing only high-level ideas) and a Dense variant (retaining formal definitions and LaTeX equations) to simulate different degrees of user drafting effort.

• Experimental Log (E): Extracts a record of experimental setup and empirical findings, including baselines, datasets, metrics, and tabular data. The LLM further de-contextualizes this data by converting visual insights into standalone factual observations, allowing us to test how well the writing system can reconstruct the narrative purely from raw data.
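The "strictly self-contained" requirement above amounts to a sanitization pass over the extracted text. A minimal sketch of what such a pass could look like, assuming simple regex patterns for LaTeX citations, bracketed numeric citations, URLs, and explicit figure/table references (these patterns are illustrative, not the pipeline's actual implementation):

```python
import re

def sanitize(text: str) -> str:
    """Strip citations, URLs, and explicit figure/table references so the
    synthesized materials are self-contained. Patterns are illustrative,
    not the authors' actual implementation."""
    text = re.sub(r"\\cite[tp]?\{[^}]*\}", "", text)   # LaTeX \cite{...}, \citet{...}, \citep{...}
    text = re.sub(r"\[\d+(?:,\s*\d+)*\]", "", text)    # numeric citations like [3] or [1, 2]
    text = re.sub(r"https?://\S+", "", text)           # bare URLs
    text = re.sub(r"(Figure|Fig\.|Table)\s*\d+", "",   # explicit figure/table references
                  text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip()        # collapse leftover whitespace
```

In practice such rules would likely be backed by an LLM pass as the paper describes, since regexes alone miss prose-form references ("as shown in the table above").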