Agent A/B: Automated and Scalable A/B Testing on Live Websites with Interactive LLM Agents
A/B testing is central to UI/UX design, yet our formative study with six industry practitioners revealed that it is slowed by scarce user traffic, long runtimes, and high operational costs. To address these challenges, we introduce Agent A/B, an end-to-end system that deploys large language model (LLM) agents with structured personas to interact with live webpages and generate scalable behavioral evidence before launch. In a case study on Amazon.com, Agent A/B simulated a between-subjects A/B test of filter panel designs with 1,000 agents and found that the reduced filter list produced more purchases, reproducing directional outcomes also observed in a parallel large-scale human experiment. Results further suggest that agent-based simulations can detect interface-sensitive behavioral differences and surface subgroup patterns while offering faster, lower-risk insights. We position Agent A/B as a complement to human testing, enabling earlier prototyping, pre-deployment validation, and hypothesis-driven UX evaluation.
Agent A/B orchestrates four coordinated modules: (i) LLM Agent Generation, (ii) Testing Preparation, (iii) Autonomous A/B Simulation, and (iv) Post-Testing Analysis. First, the LLM Agent Generation Module instantiates a population of agents by generating diverse personas and pairing them with user-specified task intentions, ensuring variability in demographics and behavioral tendencies while adhering to the experiment's constraints. The experiment owner specifies a target demographic distribution and provides an example persona as a style reference; the persona pool is initialized with that example. The system then iteratively samples an existing persona from the pool, samples demographic attributes from the target distribution, and prompts the LLM to produce a new persona that matches both the sampled demographics and the reference style.
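The persona generation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the function names, the prompt wording, the dictionary layout of the target distribution, and the choice to add each new persona back into the pool are all hypothetical, and `llm` stands in for any text-completion callable.

```python
import random
from typing import Callable, Dict, List

# Assumed layout: attribute name -> {value: probability}.
TargetDistribution = Dict[str, Dict[str, float]]


def sample_demographics(target: TargetDistribution, rng: random.Random) -> Dict[str, str]:
    """Draw one value per attribute according to the target distribution."""
    return {
        attr: rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        for attr, probs in target.items()
    }


def generate_personas(
    example_persona: str,
    target: TargetDistribution,
    n: int,
    llm: Callable[[str], str],
    seed: int = 0,
) -> List[dict]:
    """Grow a persona pool from one example and return the n generated personas."""
    rng = random.Random(seed)
    # The pool is initialized with the owner-provided example persona.
    pool = [{"demographics": None, "text": example_persona}]
    while len(pool) < n + 1:
        reference = rng.choice(pool)                    # sample an existing persona
        demographics = sample_demographics(target, rng)  # sample target attributes
        prompt = (
            "Write a new user persona in the same style as the reference.\n"
            f"Reference persona:\n{reference['text']}\n"
            f"Required demographics: {demographics}\n"
        )
        # Assumption: each new persona joins the pool and can serve as a later reference.
        pool.append({"demographics": demographics, "text": llm(prompt)})
    return pool[1:]  # exclude the seed example (an assumption of this sketch)
```

Passing the LLM as a plain callable keeps the sketch independent of any particular model provider, and the fixed seed makes the demographic sampling reproducible.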
At the core of Agent A/B is an iterative interaction loop in which each LLM agent operates directly on a live web environment and continuously adapts its actions to the evolving page state. This loop consists of three tightly coupled components: an Environment Parsing Module, an LLM Agent, and an Action Execution Module. The Environment Parsing Module converts the web environment into structured JSON observations that simplify the page structure and retain only the information needed for agent-web interaction. Specifically, we use ChromeDriver to execute a JavaScript processing script within the browser; the script selectively extracts essential web elements directly from the raw HTML.
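To illustrate the kind of reduction the Environment Parsing Module performs, the sketch below turns raw HTML into a compact JSON observation. It is an offline stand-in written with Python's standard `html.parser`, not the in-browser JavaScript script the system runs through ChromeDriver; the set of "essential" tags and the output schema are assumptions made for the example.

```python
import json
from html.parser import HTMLParser

# Assumed set of "essential" interactive elements; the real script may differ.
INTERACTIVE_TAGS = {"a", "button", "input", "select"}


class ObservationParser(HTMLParser):
    """Collect interactive elements and their visible text from raw HTML."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._open = None  # element currently collecting text, if any

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE_TAGS:
            return
        attr_map = dict(attrs)
        element = {
            "tag": tag,
            "id": attr_map.get("id"),
            # Inputs carry their label in attributes rather than in child text.
            "text": attr_map.get("value") or attr_map.get("placeholder") or "",
        }
        self.elements.append(element)
        if tag != "input":  # <input> is a void element with no text content
            self._open = element

    def handle_endtag(self, tag):
        if self._open is not None and self._open["tag"] == tag:
            self._open = None

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] = (self._open["text"] + " " + data.strip()).strip()


def parse_observation(html: str) -> str:
    """Return a JSON string containing only the elements an agent can act on."""
    parser = ObservationParser()
    parser.feed(html)
    parser.close()
    return json.dumps({"elements": parser.elements})
```

Non-interactive markup such as layout containers and static paragraphs is dropped, which keeps the observation short enough to place in an LLM prompt.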
The LLM agent functions as a decision-making module that consumes the current state and outputs the next action to take. In particular, the LLM agent frames next-step decision-making as a language-based reasoning and planning task, mapping structured state observations to reasoning traces and action predictions. Agent A/B is not bound to a specific LLM agent. Instead, the system treats the LLM agent as an interchangeable module and supports various types of LLM web agents (e.g., ReAct, FireCrawl) through convenient "plug-and-play" APIs, analogous to the Model Context Protocol (MCP) introduced by Anthropic.
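One way to picture the plug-and-play design is an agent interface that the interaction loop depends on, so any agent implementing it can be swapped in. The sketch below is illustrative only: the `WebAgent` and `Environment` protocols, the action vocabulary, the stopping rule, and the `ScriptedAgent` and `FakeShop` stand-ins are all hypothetical and do not reflect Agent A/B's actual API.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Action:
    name: str            # hypothetical vocabulary: "click", "purchase", "stop"
    target: str = ""     # id of the element acted on
    reasoning: str = ""  # reasoning trace produced alongside the action


class WebAgent(Protocol):
    def act(self, observation: dict, persona: str, intention: str) -> Action: ...


class Environment(Protocol):
    def observe(self) -> dict: ...
    def execute(self, action: Action) -> None: ...


def run_session(agent: WebAgent, env: Environment, persona: str,
                intention: str, max_steps: int = 20) -> List[Action]:
    """Observe, decide, execute; repeat until the agent stops or purchases."""
    trace: List[Action] = []
    for _ in range(max_steps):
        action = agent.act(env.observe(), persona, intention)
        trace.append(action)
        if action.name == "stop":
            break
        env.execute(action)
        if action.name == "purchase":
            break
    return trace


class ScriptedAgent:
    """Stand-in for an LLM-backed agent: buys if it can, else clicks the first element."""

    def act(self, observation, persona, intention):
        ids = [e["id"] for e in observation["elements"]]
        if "buy-now" in ids:
            return Action("purchase", "buy-now", "Buy button is visible.")
        if ids:
            return Action("click", ids[0], "Open the first result.")
        return Action("stop", reasoning="Nothing to interact with.")


class FakeShop:
    """Two-page toy environment: a results page, then a product page."""

    def __init__(self):
        self.page = "results"
        self.executed = []

    def observe(self):
        if self.page == "results":
            return {"elements": [{"id": "item-1"}]}
        return {"elements": [{"id": "buy-now"}]}

    def execute(self, action):
        self.executed.append(action.name)
        if action.name == "click":
            self.page = "product"
```

Because `run_session` only relies on the `act`, `observe`, and `execute` methods, replacing `ScriptedAgent` with an LLM-backed agent or `FakeShop` with a browser-backed environment would not require changes to the loop itself.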
The Post-Testing Analysis Module outputs summary statistics such as actions per session, session duration (in steps and in time), and purchase completion rate. Researchers can also examine detailed behaviors (e.g., how often agents search or click filters) and compare them across the A/B conditions. The system supports stratified analysis by agent demographics or personas to identify subgroup differences. For instance, when testing redesigned filters, the system can reveal whether agents refined searches more often, completed tasks faster, or purchased more, offering early insight into usability and adoption risks before live deployment.
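The summary and stratified statistics described above amount to grouping session logs by condition, or by condition and subgroup, and aggregating within each group. The sketch below assumes a hypothetical `Session` record and field names; any data passed to it in the usage shown here would be toy data, not results from the study.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Hashable, List


@dataclass
class Session:
    condition: str     # "A" or "B"
    group: str         # hypothetical demographic stratum, e.g. an age band
    num_actions: int   # actions taken in the session
    purchased: bool    # whether the session ended in a purchase


def summarize(sessions: List[Session],
              key: Callable[[Session], Hashable]) -> Dict[Hashable, dict]:
    """Aggregate session metrics within each bucket defined by `key`."""
    buckets = defaultdict(list)
    for s in sessions:
        buckets[key(s)].append(s)
    return {
        k: {
            "n": len(group),
            "mean_actions": mean(s.num_actions for s in group),
            "purchase_rate": sum(s.purchased for s in group) / len(group),
        }
        for k, group in buckets.items()
    }
```

Calling `summarize(sessions, key=lambda s: s.condition)` gives the per-condition comparison, while `key=lambda s: (s.condition, s.group)` gives the stratified view used to look for subgroup differences.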
We emphasize that LLM-agent-based A/B testing is not a replacement for real user testing but a complementary tool that helps experiment owners (e.g., UX researchers and product managers) mitigate traffic scarcity, slow iteration cycles, and collaboration challenges. Our work deploys persona-driven LLM agents directly on live web interfaces to enable scalable user behavior simulation for A/B testing and early-stage design evaluation.