StoryScope: Investigating idiosyncrasies in AI fiction

Paper · arXiv 2604.03136
Co-Writing and Collaboration · NLP and Linguistics · LLM Evaluations and Benchmarks · LLM Failure Modes

As AI-generated fiction becomes increasingly prevalent, questions of authorship and originality are becoming central to how written work is evaluated. While most existing work in this space focuses on identifying surface-level signatures of AI writing (e.g., word choice, syntactic structure), we ask instead whether AI-generated stories can be distinguished from human ones without relying on stylistic signals, focusing on discourse-level narrative choices such as character agency and chronological discontinuity. We propose StoryScope, a pipeline that automatically induces a fine-grained, interpretable feature space of discourse-level narrative features across 10 dimensions (e.g., plot, agents, temporal structure). We apply StoryScope to a parallel corpus of 10,272 writing prompts, each answered by a human author and five LLMs (Claude, DeepSeek, Gemini, GPT, and Kimi), yielding 61,608 stories, each ~5,000 words, and 304 extracted features per story. Narrative features alone achieve 93.2% macro-F1 for human vs. AI detection and 68.4% macro-F1 for six-way authorship attribution, retaining over 97% of the performance of models that include stylistic cues. A compact set of 30 core narrative features captures much of this signal: AI stories over-explain themes and favor tidy, single-track plots, while human stories frame protagonists' choices as more morally ambiguous and show greater temporal complexity (e.g., flashbacks, nonlinear structure). Per-model fingerprint features enable six-way attribution: for example, Claude produces notably flat event escalation, GPT over-indexes on dream sequences, and Gemini defaults to external character description. We find that AI-generated stories cluster in a shared region of narrative space, while human-authored stories exhibit greater diversity.

AI fiction is already under our noses. In March 2026, Hachette, a major publishing house, pulled the horror novel Shy Girl after it was flagged as ~78% AI-generated, making it the first commercially published novel canceled over AI allegations. Nearly 20% of a sample of 14,000 self-published Amazon novels were flagged by Pangram as largely AI-generated, a figure that jumped 41% year-over-year. Overall, readers are increasingly being misled into purchasing AI-generated books attributed to human authors. If authors are unwilling to self-disclose AI usage, how can we address this issue? At first glance, this appears to be a detection problem: can we determine whether a given story was written by a human or a machine? Existing AI detectors primarily rely on stylistic signals such as word choice and sentence structure, and for good reason: these cues are highly discriminative. AI-generated text systematically overuses em-dashes, words like "delve" and "tapestry," and other surface-level patterns that even simple classifiers detect reliably. That said, AI style is increasingly fleeting: GPT 5.4 significantly reduced em-dash usage, and fine-tuning to mimic human style drops AI detection rates on creative writing from 97% to 3%. Discourse-level narrative features are far harder to "humanize," as changing them requires significant structural rewrites rather than simple post-hoc edits.

As AI seeps into the writing industry, the question of what constitutes original work shifts from how a story is written to how it is conceived. Settled U.S. legal precedent requires that protected works show a minimal degree of originality; recent guidance from the U.S. Copyright Office clarifies that eligibility depends on sufficient human creative control. To operationalize originality, we use statistical rarity in a feature space of narrative decisions as a proxy, where less common combinations reflect the broader notion of originality invoked by Torrance and copyright law. We hypothesize that humans and AI models make systematically distinct narrative choices, and that these differences persist even when stylistic cues are removed.
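One way to picture "statistical rarity in a feature space" is the minimal sketch below. It is an illustration under stated assumptions, not the scoring used in the paper: the feature names (`timeline`, `ending`), the toy corpus, and the choice of mean smoothed surprisal as the rarity score are all hypothetical.

```python
import math
from collections import Counter

def fit_value_counts(stories):
    """Count how often each value occurs for each feature across the corpus."""
    counts = {}
    for story in stories:
        for feature, value in story.items():
            counts.setdefault(feature, Counter())[value] += 1
    return counts

def rarity(story, counts, n_stories, alpha=1.0):
    """Mean surprisal (in nats) of a story's feature values.

    Add-alpha smoothing keeps unseen values from producing log(0).
    This scoring rule is an assumption made for illustration.
    """
    if not story:
        return 0.0
    total = 0.0
    for feature, value in story.items():
        seen = counts.get(feature, Counter())
        n_values = len(seen) + 1  # reserve one slot for unseen values
        p = (seen[value] + alpha) / (n_stories + alpha * n_values)
        total += -math.log(p)
    return total / len(story)

# Hypothetical toy corpus: three conventional stories and one unusual one.
corpus = [
    {"timeline": "linear", "ending": "resolved"},
    {"timeline": "linear", "ending": "resolved"},
    {"timeline": "linear", "ending": "resolved"},
    {"timeline": "nonlinear", "ending": "ambiguous"},
]
counts = fit_value_counts(corpus)
common_score = rarity(corpus[0], counts, len(corpus))
rare_score = rarity(corpus[3], counts, len(corpus))
```

Under this scoring, the story with the less common combination of choices (`rare_score`) receives a higher value than the conventional one (`common_score`). Treating features as independent is a simplification; a score over joint combinations would need a density model over the full feature vector.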

Our results suggest that the narrative choices underlying AI-generated fiction are distinguishable from those of human authors, even when surface style is removed. Because these features reflect structural decisions rather than writing style, they may prove more durable as models continue to evolve. When we represent each story as a vector of narrative features, the five AI models occupy a tight cluster that is well separated from human stories. This suggests that AI models have converged on a shared region of narrative space, and that the differences from human storytelling remain after stories are edited for style. Human stories are, on average, rarer in narrative feature space. Each model also exhibits a unique narrative "fingerprint": a set of features on which it diverges from the other AI sources, which enables fine-grained attribution.
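The "tight cluster" and "greater diversity" claims can be made concrete with a small sketch. The code below assumes each story is already a numeric feature vector and measures spread as the mean Euclidean distance to the group centroid. The two-dimensional toy vectors and the choice of distance are hypothetical and do not reproduce the paper's analysis.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dispersion(vectors):
    """Mean distance from each vector to the group centroid."""
    center = centroid(vectors)
    return sum(euclidean(v, center) for v in vectors) / len(vectors)

# Hypothetical narrative feature vectors (two features per story).
ai_stories = [[1.0, 0.0], [1.0, 0.2], [0.8, 0.0]]
human_stories = [[0.0, 1.0], [0.5, 0.2], [-1.0, 2.0]]

ai_spread = dispersion(ai_stories)
human_spread = dispersion(human_stories)
separation = euclidean(centroid(ai_stories), centroid(human_stories))
```

In this toy data, `ai_spread` is smaller than `human_spread`, and `separation` is large relative to `ai_spread`, which is the pattern the paragraph above describes. With 304 features per story, a real analysis would likely standardize features first, since raw Euclidean distance is dominated by features with large scales.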