Tool Use and Computer-Use Agents
Related topics:
- A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows: However, building production-grade agentic AI workflows remains challenging. While prototypes are easy to build with simple scripts or notebooks, scaling them into reliable, governed, and observable s…
- Adaptation of Agentic AI: Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these sys…
- Agent S: An Open Agentic Framework that Uses Computers Like a Human: We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automatin…
- Agent Workflow Memory: Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contr…
- AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs: In this paper, we introduce a novel learning paradigm for adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either …
- Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research: Agentic Reasoning is a framework that enhances large language model (LLM) reasoning by integrating external tool-using agents. Unlike conventional LLM-based reasoning approaches, which rely solely on i…
- Agentic Web: Weaving the Next Web with AI Agents: This challenge gives rise to the notion of the Agent Attention Economy. In a manner analogous to the early Web’s competition for user clicks, external services now compete to be selected and invoked b…
- AutoGLM: Autonomous Foundation Agents for GUIs: While foundation models excel at acquiring human knowledge, they often struggle with decision-making in dynamic real-world environments, limiting their progress toward artificial general intelligence.…
- Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling: Traditional AI systems have treated function calling as a monolithic task, where the model accepts a task and relevant function schemas, then outputs a c…
- Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL: Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search t…
- CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization: Language agents have shown some ability to interact with an external environment, e.g., a virtual world such as ScienceWorld, to perform complex tasks, e.g., growing a plant, without the startup costs…
- Efficient Tool Use with Chain-of-Abstraction Reasoning: To achieve faithful reasoning that aligns with human expectations, large language models (LLMs) need to ground their reasoning to real-world knowledge (e.g., web facts, math and physical rules). Tools…
- Equipping agents for the real world with Agent Skills: As model capabilities improve, we can now build general-purpose agents that interact with full-fledged computing environments. Claude Code, for example, can accomplish complex tasks across domains us…
- Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering: Large language model (LLM) agents are increasingly built less by changing model weights than by reorganizing the runtime around them. Capabilities that earlier systems expected the model to recover in…
- Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks: However, to realize the true potential of LLMs as autonomous agents, they must learn to identify, call, and interact with external tools and application program interfaces (APIs) to complete complex t…
- Improving Generalization in Task-oriented Dialogues with Workflows and Action Plans: Task-oriented dialogue is difficult in part because it involves understanding user intent, collecting information from the user, executing API calls, and generating helpful and fluent responses. Howev…
- Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks: This study introduces a novel framework for training smaller language models in function calling, focusing on specific logical and mathematical reasoning tasks. The approach aims to improve performanc…
- Insert-expansions For Tool-enabled Conversational Agents: This paper delves into an advanced implementation of Chain-of-Thought-Prompting in Large Language Models, focusing on the use of tools (or "plug-ins") within the explicit reasoning paths generated by…
- Large Language Model-Brained GUI Agents: A Survey: The advent of Large Language Models (LLMs), particularly multimodal models, has ushered in a new era of GUI automation. They have demonstrated exceptional capabilities in natural language understandin…
- Large Multimodal Agents: A Survey: Concurrently, there is an emerging research trend focused on extending these LLM-powered AI agents into the multimodal domain. This extension enables AI agents to interpret and respond to diverse mult…
- MCP-Zero: Proactive Toolchain Construction for LLM Agents from Scratch: Function-calling has enabled large language models (LLMs) to act as tool-using agents, but injecting thousands of tool schemas into the prompt is costly and error-prone. We introduce MCP-Zero, a proac…
- Magentic-UI: Towards Human-in-the-loop Agentic Systems: AI agents powered by large language models are increasingly capable of autonomously completing complex, multi-step tasks using external tools. Yet, they still fall short of human-level performance in m…
- Navigating the State of Cognitive Flow: Context-Aware AI Interventions for Effective Reasoning Support: Flow Theory describes an optimal cognitive state where individuals experience deep focus and intrinsic motivation when a task’s difficulty aligns with their skill level. In AI-augmented reasoning, int…
- Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction: The evolution of Large Language Models (LLMs) from passive responders to autonomous agents necessitates a fundamental shift in learning paradigms, from static imitation to incentive-driven decision mak…
- OmniParser for Pure Vision Based GUI Agent: The recent success of large vision language models shows great potential in driving the agent system operating on user interfaces. However, we argue that the power of multimodal models like GPT-4V as a g…
- Opportunities for large language models and discourse in engineering design: In this paper, we argue that foundation models such as LLMs can be used for creative reasoning tasks in the engineering design process, complementing and integrating existing computational methods suc…
- ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models: There is a trending paradigm [1; 2; 3; 4; 5; 6; 7; 8] to couple large language models (LLMs) with external plugins or tools, enabling LLMs to interact with the environment [9; 10] and retrieve up-to-date …
- Real-Time Procedural Learning From Experience for AI Agents: AI agents are artificial intelligence systems capable of observing and taking actions in an environment. As adoption spreads across industries, there is a growing need for AI agents to quickly learn d…
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent: Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source API with text-ric…
- Small LLMs Are Weak Tool Learners: A Multi-LLM Agent: Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs, empowering them to interact with external tools (e.g., APIs, functions) and complete various tasks in a self…
- SpreadsheetLLM: Encoding Spreadsheets for Large Language Models: Recently, the rapid development of Large Language Models (LLMs) has opened new frontiers in table processing (Li et al., 2023b) and reasoning (Cheng et al., 2022). However, spreadsheets pose unique ch…
- Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction: The current paradigm of test-time scaling relies on generating long reasoning traces (“thinking” more) before producing a response. In agent problems that require interaction, this can be do…
- ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis: Supervised fine-tuning (SFT) is a common method to enhance the tool calling capabilities of Large Language Models (LLMs), with the training data often being synthesized. The current data synthesis pro…
- UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity: Inspired by the success of self-supervised learning (SSL) techniques like Joint Embedding Predictive Architectures (JEPA) [10] and its variants [3, 5], we propose UI-JEPA, a lightweight video-to-text …
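A mechanism common to many of the entries above (function calling, MCP-Zero, ToolFlow, ReWOO) is a dispatch loop: the model emits a structured tool call, a runtime looks up and executes the matching function, and the serialized result is fed back to the model. A minimal sketch of that dispatch step, with an illustrative stub tool (all names here are hypothetical, not drawn from any specific paper):

```python
import json

def get_weather(city: str) -> str:
    """Stub standing in for a real external API call."""
    return f"Sunny in {city}"

# Tool registry: the runtime resolves model-emitted names against this map.
TOOLS = {"get_weather": get_weather}

def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted tool call, execute it, return a JSON result."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        # Unknown tool: report the error back to the model instead of crashing.
        return json.dumps({"error": f"unknown tool {call['name']!r}"})
    try:
        result = fn(**call.get("arguments", {}))
        return json.dumps({"result": result})
    except TypeError as exc:
        # Malformed or missing arguments from the model.
        return json.dumps({"error": str(exc)})

# Example: the kind of string a model might emit as its tool call.
print(dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
```

Real agent stacks wrap this loop with schema validation, retries, and observability, but the core contract (structured call in, serialized observation out) is the same.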