Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents
However, these agents often suffer from high latency and low reliability due to their extensive sequential UI interactions. To address this issue, we propose AXIS, a novel LLM-based agent framework that prioritizes actions through application programming interfaces (APIs) over UI actions. The framework also facilitates the creation and expansion of APIs through automated exploration of applications. Our experiments on Office Word demonstrate that AXIS reduces task completion time by 65%-70% and cognitive workload by 38%-53%, while maintaining an accuracy of 97%-98% compared to humans. Our work contributes a new human-agent-computer interaction (HACI) framework and a fresh UI design principle for application providers in the era of LLMs.
LLM-based UI agents are capable of serving as users’ delegates: they translate user requests expressed in natural language and directly interact with the UI of software applications to fulfill users’ needs. With the help of LLM-based UI agents, users can simply ask the application to complete tasks without a deep understanding of the application’s UIs and functionalities, which significantly reduces the cognitive load of learning new applications.
For instance, inserting a 2×2 table in an Office Word document requires a sequence of UI interactions: “Insert → Table → 2×2 Table”. Although this HCI-based design suits human habits, training LLM-based UI agents to emulate such interactions raises several challenges that are difficult to overcome.
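To make the cost of emulating UI interactions concrete, the contrast between a step-by-step UI trajectory and a single API call can be sketched as follows. This is an illustrative model only; the action and function names are hypothetical, not the actual Word or AXIS interfaces, and the counts stand in for sequential LLM invocations.

```python
# Hypothetical comparison: a multi-step UI trajectory vs. one API call
# for inserting a 2x2 table. Names are illustrative, not real interfaces.

ui_trajectory = [
    ("click", "Insert"),     # open the Insert ribbon tab
    ("click", "Table"),      # open the table drop-down
    ("click", "2x2 Table"),  # select the 2x2 grid cell
]

def steps_via_ui(trajectory):
    """Each UI step requires its own LLM reasoning call on the current UI state."""
    return len(trajectory)

def steps_via_api(rows, cols):
    """A single API call replaces the entire trajectory."""
    return 1

assert steps_via_ui(ui_trajectory) == 3
assert steps_via_api(2, 2) == 1
```

Because every UI step costs a full LLM call that must also describe the current UI state, the gap in latency and token usage grows with the length of the trajectory.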
The LLM call latency is also positively correlated with the number of processed tokens [15, 24, 45]. To ensure that the LLM returns high-quality outputs, the LLM-based UI agent must pass a large volume of UI information to precisely describe the current state, which further increases the latency of each call. The second challenge lies in reliability. Studies have shown that LLMs are prone to hallucinations when generating responses [5, 11, 18, 57]. Over the long sequential calls of an LLM-based UI agent, the chance of selecting a wrong UI control, or hallucinating a non-existent UI element for interaction, increases with each reasoning step.
We believe that a new human-agent-computer interaction (HACI) paradigm is needed to address the challenges faced by LLM-based UI agents. In the HACI paradigm, API-first LLM-based agents replace UI agents, prioritizing API calls over unnecessary multi-step UI interactions in task completion. Regular UI interactions are invoked only when the related APIs are unavailable. Compared to UI agents, API-first agents require fewer tokens and obtain more accurate code-formatted responses from LLMs.
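The API-first policy described above can be sketched as a simple dispatch rule: try a registered API for the task first, and fall back to sequential UI actions only when no API is available. This is a minimal sketch under assumed names (`ApiFirstAgent`, `register_api`, `execute_via_ui` are all hypothetical, not the AXIS implementation).

```python
# Minimal sketch of an API-first dispatch policy. All names are
# hypothetical; the real agent framework is more elaborate.
from typing import Callable, Dict


class ApiFirstAgent:
    def __init__(self) -> None:
        # Maps a task name to a callable API, if one has been registered.
        self.api_registry: Dict[str, Callable[..., str]] = {}

    def register_api(self, task: str, fn: Callable[..., str]) -> None:
        self.api_registry[task] = fn

    def execute(self, task: str, **kwargs) -> str:
        api = self.api_registry.get(task)
        if api is not None:
            return api(**kwargs)          # one call, structured result
        return self.execute_via_ui(task)  # fallback: multi-step UI actions

    def execute_via_ui(self, task: str) -> str:
        # Placeholder for the sequential, LLM-driven UI interactions.
        return f"UI fallback for: {task}"


agent = ApiFirstAgent()
agent.register_api("insert_table", lambda rows, cols: f"table {rows}x{cols} inserted")
print(agent.execute("insert_table", rows=2, cols=2))  # takes the API path
print(agent.execute("rename_style"))                  # no API registered, UI fallback
```

The design choice is that the fallback path preserves full UI generality, while every task that gains an API moves off the slow, hallucination-prone trajectory.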
To this end, we propose AXIS (Agent eXploring API for Skill integration), a self-exploring LLM-based framework capable of automatically exploring existing applications, learning insights from support documents and action trajectories, and constructing new APIs on top of existing ones to empower API-first LLM-based agents with low latency and high reliability.