TOPIC

Action Models

11 synthesis notes · 19 source papers

Does agent memory work better at one level of abstraction?

Three competing architectures each claim superior memory transfer for agents, each at a different level of abstraction. Do they all work, or does one genuinely outperform the others across domains?

Can agents learn reusable sub-task routines from past experience?

Do web agents fail at long-horizon tasks because they cannot extract and reuse workflows shared across similar problems? This explores whether sub-task abstraction enables skill accumulation rather than task-by-task problem solving.

What blocks scaling from language models to autonomous agents?

If large language models excel at next-token prediction, why do they struggle with long-horizon goal-oriented tasks? This explores whether the bottleneck is model capacity or the environments used to train them.

Does constraining edits help agents improve their own skills?

When agents rewrite their own instructions, does freedom to edit lead to better learning, or do safeguards like edit budgets and memory of failures produce more stable improvement?

Can frozen language models continually improve through memory structure alone?

If agents can't update parameters, what form of textual memory lets them keep learning across trials and transfer to new tasks without retraining?

Can LLMs generate workflows without touching proprietary data?

Explores whether LLMs can orchestrate task automation by composing API calls rather than directly accessing confidential information, and whether this approach preserves security while handling unpredictable tasks.

Can you turn an LLM into an agent by just fine-tuning?

Explores whether upgrading language models to action-producing systems requires only model retraining or demands a broader pipeline transformation including data collection, grounding, integration, and safety evaluation.

Can separating causal models from language models improve reasoning?

Can an explicit formal causal model paired with an LLM translator overcome both spurious correlation reasoning and reward-without-explanation problems in RL? This explores whether dividing reasoning labor between systems addresses fundamental weaknesses in each.

What makes synthetic data work across different domains and models?

Explores whether a single optimal approach to synthetic data generation exists, or whether success depends on context like domain, model architecture, and scale. Understanding this matters for building effective data systems.

Can skill documents be optimized like neural network weights?

Can natural-language skill documents be treated as trainable parameters and improved through iterative optimization with validation gating, similar to how model weights are tuned in deep learning?

Why does random tool sampling produce unrealistic synthetic training data?

Tool-calling datasets generated through random sampling and single-turn framing lack the complexity and coherence of real deployment. This explores what structural choices in data synthesis determine whether models can learn realistic tool composition.

Source papers 19

The arXiv papers behind this sub-topic. Links may take you off-site to arxiv.org.

Agent Workflow Memory
Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contr…
CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization
Language agents have shown some ability to interact with an external environment, e.g., a virtual world such as ScienceWorld, to perform complex tasks, e.g., growing a plant, without the startup costs…
Causal Reflection with Language Models
While Large Language Models (LLMs) exhibit impressive fluency and factual recall, they struggle with robust causal reasoning often relying on spurious correlations and brittle patterns. Similarly, tra…
Code as Agent Harness
Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agent…
Decision Transformer: Reinforcement Learning via Sequence Modeling
We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and asso…
FlowMind: Automatic Workflow Generation with LLMs
The rapidly evolving field of Robotic Process Automation (RPA) has made significant strides in automating repetitive processes, yet its effectiveness diminishes in scenarios requiring spontaneous or u…
Improving Generalization in Task-oriented Dialogues with Workflows and Action Plans
Task-oriented dialogue is difficult in part because it involves understanding user inten…
Large Action Models: From Inception to Implementation
This evolution requires the transition from traditional Large Language Models (LLMs), which excel at generating textual responses, to Large Action Models (LAMs), designed for action generation and exe…
Learning Human-Object Interaction as Groups
Human-Object Interaction Detection (HOI-DET) aims to localize human-object pairs and identify their interactive relationships. To aggregate contextual cues, existing methods typically propagate inform…
Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction
The evolution of Large Language Models (LLMs) from passive responders to autonomous agents necessitates a fundamental shift in learning paradigms—from static imitation to incentive-driven decision mak…
Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1
OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs–making it a new kind of model: a Large Reaso…
ReAct: Synergizing Reasoning and Acting in Language Models
While large language models (LLMs) have demonstrated impressive performance across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-though…
Reasoning-Driven Synthetic Data Generation and Evaluation
Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is p…
SkillOpt: Executive Strategy for Self-Evolving Agent Skills
Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision—none of which behaves like a deep-learning optimizer for the skill, and none of which relia…
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL), reframing LLMs from passive sequ…
Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
The current paradigm of test-time scaling relies on generating long reasoning traces (“thinking” more) before producing a response. In agent problems that require interaction, this can be do…
ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis
Supervised fine-tuning (SFT) is a common method to enhance the tool calling capabilities of Large Language Models (LLMs), with the training data often being synthesized. The current data synthesis pro…
Tree Search for Language Model Agents
Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily…
Working with AI: Measuring the Occupational Implications of Generative AI
In this work, we take a step toward that goal by analyzing the work activities people do with AI, how successfully and broadly those activities are done, and combine that with data on what occupations…