Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study
Background: Large language model systems are commonly evaluated as models, benchmarks, or short conversational episodes. Less is known about what happens when a persistent AI agent is embedded into a real academic research environment with durable memory, local files, external tools, scheduled routines, delegated roles, and explicit safety protocols.
Objective: To describe the implementation, utilization, outputs, resource profile, and governance layer of a persistent agentic research environment used by a single academic physician-scientist over 115 days.
Methods: This was a structured self-observed implementation case study conducted from January 31 to May 25, 2026. The unit of analysis was the human-agent environment: the researcher, agent runtime, memory files, tool access, repositories, scheduled jobs, specialized agent roles, and safety protocols. Data sources included recoverable session telemetry, memory files, repository and file-system inventories, model-use logs, decision logs, and documented protocol updates. Outcomes were descriptive and organized using PARE-M (Persistent Agentic Research Environment Measurement), a measurement framework covering system architecture, utilization, artifact production, resource use, reproducibility, and governance/correction domains.
Conclusions: This case describes a persistent agentic research environment rather than an isolated chatbot interaction. The workflow was cache-dominant: 82.9% of recorded May tokens were cache reads, suggesting that persistent agentic environments may shift the economic unit from cost per token to cost per completed artifact. Future evaluations should use artifact-level denominators, reproducible parsing rules, correction taxonomies, cost-per-artifact estimates, and independent coding of governance events.
Current evaluations of large language models emphasize model performance, benchmark scores, or task-level accuracy. This emphasis has been productive for comparing systems, but it leaves a gap: many real users no longer interact with AI as a single prompt-response episode. Instead, they embed agents into continuing work environments that include memory, files, tools, scripts, calendars, repositories, workflows, and institutional constraints. This distinction matters in academic research. A physician-scientist does not ask only isolated questions. Research work involves manuscript development, study design, teaching, literature synthesis, code review, software deployment, collaboration, governance, and repeated correction. In this setting, the relevant object of evaluation is not only "the model" but the coupled human-agent environment.
By the end of follow-up, the environment included persistent memory, interaction channels, shell/file access, repositories, scheduled jobs, external APIs, specialized agent roles, and governance protocols. The inventory identified 502 memory-related files, 17 configured agent directories, 57 skill files, 4,309 main-session files, 3,194 recoverable main JSONL-like files, 5,760 all-agent session files, and 4,388 recoverable all-agent JSONL-like files. Recoverable main-agent telemetry contained 75,671 de-duplicated records (DRC = 75,671) across 96 active days, corresponding to an active-day fraction of 0.835 (ADF = 96/115). The same memory layer recorded 889 failure, verification, correction, or protocol-proxy events, corresponding to a governance-event rate of 9.26 events per active day (GER = 889/96). These included deployment safeguards, external-action checks, credential-handling rules, citation-verification rules, and lessons from duplicate or unsafe actions. The governance layer therefore became part of the operating environment rather than an after-the-fact policy appendix.
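The two rates reported above follow directly from the stated counts. The following is a minimal sketch of that arithmetic, assuming the follow-up window is counted inclusively of both endpoint dates; the function names are illustrative and are not taken from the study's own tooling.

```python
from datetime import date


def inclusive_days(start: date, end: date) -> int:
    """Number of calendar days in a window, counting both endpoints."""
    return (end - start).days + 1


def active_day_fraction(active_days: int, follow_up_days: int) -> float:
    """ADF: active days divided by total follow-up days."""
    if follow_up_days <= 0:
        raise ValueError("follow_up_days must be positive")
    return active_days / follow_up_days


def governance_event_rate(events: int, active_days: int) -> float:
    """GER: governance-related events per active day."""
    if active_days <= 0:
        raise ValueError("active_days must be positive")
    return events / active_days


# Counts and dates as reported in the text.
follow_up_days = inclusive_days(date(2026, 1, 31), date(2026, 5, 25))
adf = active_day_fraction(96, follow_up_days)
ger = governance_event_rate(889, 96)

print(f"follow-up days = {follow_up_days}")
print(f"ADF = {adf:.3f}")
print(f"GER = {ger:.2f} events per active day")
```

Keeping the denominators explicit (follow-up days for ADF, active days for GER) makes it easier to recompute either rate if the recoverable telemetry or the follow-up window is later revised.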
The principal finding is that persistent agentic infrastructure expanded the capacity and scope of academic work. The evidence for this is not limited to subjective usefulness. It was visible across PARE-M v0.1 utilization and governance metrics: ADF = 0.835, DRC = 75,671, OPR = 5.02 events per active day, GER = 9.26 events per active day, and ASB = 10. The environment became a durable work system spanning research, teaching, software, operations, and safety rules. The likely mechanism was accumulated context combined with reusable procedures. The May token profile strengthens this interpretation: 82.9% of recorded May tokens were cache reads (CDR = 82.9%), suggesting that the workflow increasingly depended on reused context rather than isolated fresh inference. This aligns with memory-augmented and stateful-agent work, including recent production-memory and memory-benchmark studies, but differs from benchmark settings that evaluate agents on bounded tasks rather than lived deployment.
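A cache-read share of the kind reported as CDR can be computed from per-category token totals. The sketch below is illustrative only: the category names (`input`, `output`, `cache_write`, `cache_read`) and the choice of denominator (all recorded tokens) are assumptions, and the example counts are hypothetical values chosen to reproduce a share of 82.9%; they are not the study's token data.

```python
def cache_read_share(token_counts: dict[str, int]) -> float:
    """Fraction of all recorded tokens that were cache reads.

    Assumes `token_counts` maps hypothetical category names to token
    totals and that the denominator is the sum of every category.
    """
    total = sum(token_counts.values())
    if total <= 0:
        raise ValueError("no tokens recorded")
    return token_counts.get("cache_read", 0) / total


# Hypothetical counts for illustration only (not the study's data).
example_counts = {
    "input": 40,
    "output": 31,
    "cache_write": 100,
    "cache_read": 829,
}

share = cache_read_share(example_counts)
print(f"cache-read share = {share:.1%}")
```

Whether cache writes belong in the denominator is a parsing decision; stating it explicitly is one way to make such a ratio reproducible across evaluations.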