TOPIC

Workplace Applications

9 synthesis notes · 44 source papers

View as

Why does AI default to coaching instead of doing?

In workplace conversations, users often want AI to execute tasks like writing or gathering information, but AI tends to explain and advise instead. What drives this systematic mismatch between what users need and what AI provides?

Does concentrated AI exposure enable workers to adapt and reallocate?

When AI displaces specific tasks rather than spreading across many, workers may shift effort to non-displaced tasks within their occupation. Does this reallocation mechanism actually offset employment losses?

Can governance rules embedded in runtime memory actually protect autonomous agents?

Explores whether safeguards woven into an agent's operating loop—rather than documented separately—remain durable and retrievable when most needed. Tests whether runtime governance is engineering solution or false assurance.

What happens to human wages in an AGI economy?

Does human labor retain economic value when AGI can replicate most work? This explores whether wages would reflect the computational cost of replacement rather than the value workers actually produce.

Do LLM research ideas actually hold up when experts try to execute them?

Explores whether LLM-generated ideas maintain their apparent novelty advantage when expert researchers spend 100+ hours implementing them. Matters because ideation-stage evaluation may not capture real-world feasibility barriers.

Can LLMs efficiently generate taxonomies and label training data?

Explores whether large language models can automate both taxonomy generation and data labeling to reduce the manual effort and domain expertise traditionally required for text mining tasks.

Do persistent agents really cost less per token?

When AI agents reuse cached context across tasks, does the standard cost-per-token metric still reveal true economic efficiency? A case study suggests the answer may be no.

Should we evaluate deployed agents as whole environments instead?

Conventional LLM evaluation focuses on models or individual episodes, but what if the right measurement unit is the entire coupled human-agent system including memory, tools, and protocols observed over time?

What collaboration level do workers actually want with AI?

Explores whether workers prefer full automation, equal partnership, or continuous human control across different tasks. Understanding worker preferences could reshape how organizations deploy AI systems.

Source papers 44

The Arxiv papers behind this sub-topic. Links may take you off-site to arxiv.org.

A Framework for Collaborating a Large Language Model Tool in Brainstorming for Triggering Creative Thoughts
Among various techniques developed to foster creative thinking, brainstorming is widely used. With recent advancements in Large Language Models (LLMs), tools like ChatGPT have significantly impacted v…
Artificial Intelligence and the Labor Market∗
We utilize recent advances in natural language processing to develop novel measures of workers’ task-level exposure to artificial intelligence (AI) and machine learning technologies from 2010 to 2023,…
Bridging the gulf of envisioning: Cognitive design challenges in llm interfaces.
Large language models (LLMs) exhibit dynamic capabilities and appear to comprehend complex and ambiguous natural language prompts. However, calibrating LLM interactions is challenging for interface de…
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions
Existing benchmarks fall short in realism, data fidelity, agent-user interaction, and coverage across business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel bench…
Estimating AI productivity gains from Claude conversations
What do real conversations with Claude tell us about the effects of AI on labor productivity? Using our privacy-preserving analysis method, we sample one hundred thousand real conversations from Claud…
Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure
Ensuring complex systems meet regulations typically requires checking the validity of assurance cases through a claim-argument-evidence framework. Some challenges in this process include the complicat…
Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review
This review explores the frontiers of large language models (LLMs) in psychological applications. Psychology has undergone several theoretical changes, and the current use of artificial intelligence (…
From Articles to Code: On-Demand Generation of Core Algorithms from Scientific Publications
ABSTRACT Maintaining software packages imposes significant costs due to dependency management, bug fixes, and versioning. We show that rich method descriptions in scientific publications can serve as…
Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce
Our framework features an audio-enhanced mini-interview to capture nuanced worker desires and introduces the HumanAgency Scale (HAS) as a shared language to quantify the preferred level of human invol…
GRASP: Municipal Budget AI Chatbots for Enhancing Civic Engagement
Abstract—There are a growing number of AI applications, but none tailored specifically to help residents answer their questions about municipal budget, a topic most are interested in but few have a so…
Gdpval: Evaluating Ai Model Performance On Real-world Economically Valuable Tasks
![[Pasted image 20250930085203.png]] We introduce GDPval, a benchmark evaluating AI model capabilities on realworld economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Sta…
Generating Query-Relevant Document Summaries via Reinforcement Learning
E-commerce search engines often rely solely on product titles as input for ranking models with latency constraints. However, this approach can result in suboptimal relevance predictions, as product ti…
Generative AI in Real-World Workplaces
This report presents the most recent findings of Microsoft’s research initiative on AI and Productivity, which seeks to measure and understand the productivity gains associated with LLM-powered produc…
Generative Interfaces for Language Models
Large language models (LLMs) are increasingly seen as assistants, copilots, and consultants, capable of supporting a wide range of tasks through natural conversation. However, most systems remain cons…
How AI Impacts Skill Formation
AI assistance produces significant productivity gains across professional domains, particularly for novice workers. Yet how this assistance affects the development of skills required to effectively su…
How Exposed Are UK Jobs to Generative AI? Developing and Applying a Novel Task-Based Index
|_Table 1: Task Categories_ **Task Category**|**Descriptors (In your job, how important is…)**| |**Manual**|Carrying, pushing or pulling heavy objects <br><br>Working for long periods on physical acti…
IFEvalCode: Controlled Code Generation
Code large language models (Code LLMs) have achieved significant advancements in various code-related tasks, particularly in code generation, where the code LLMs produce the target code from natural l…
KellyBench: Can Language Models Beat the Market?
Language models are saturating benchmarks for procedural tasks with narrow objectives. But they are increasingly being deployed in long-horizon, non-stationary environments with open-ended goals. In t…
LLMs as Architects and Critics for Multi-Source Opinion Summarization
benchmark dataset for evaluating multi-source opinion summaries across 7 key dimensions: fluency, coherence, relevance, faithfulness, aspect coverage, sentiment consistency, specificity. Our results d…
Large Language Model-based Data Science Agent: A Survey
The rapid advancement of Large Language Models (LLMs) has driven novel applications across diverse domains, with LLM-based agents emerging as a crucial area of exploration. This survey presents a comp…
Large Language Models can accomplish Business Process Management Tasks
Abstract. Business Process Management (BPM) aims to improve organizational activities and their outcomes by managing the underlying processes. To achieve this, it is often necessary to consider inform…
Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications
We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies (e.g., supervised fine-tuning, reinforcement learning) and test-time mechanisms (e.g., prompt engin…
News Sentiment Embeddings for Stock Price Forecasting
A key focus is to use news headlines from the Wall Street Journal (WSJ) to predict the movement of stock prices on a daily timescale with OpenAI-based text embedding models used to create vector encod…
Openagents: An Open Platform For Language Agents In The Wild
![A diagram of a software development with medium confidence](/assets/paper-images/OpenAgents.png) ![A diagram of software development](/assets/paper-images/OpenAgents2.png) Language agents show pot…
OpinionConv: Conversational Product Search with Grounded Opinions
When searching for products, the opinions of others play an important role in making informed decisions. Subjective experiences about a product can be a valuable source of information. This is also tr…
Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study
Background: Large language model systems are commonly evaluated as models, benchmarks, or short conversational episodes. Less is known about what happens when a persistent AI agent is embedded into a …
RAG Does Not Work for Enterprises
Retrieval-Augmented Generation (RAG) improves the accuracy and relevance of large language model outputs by incorporating knowledge retrieval. However, implementing RAG in enterprises poses challenges…
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models
“There is a trending paradigm[1; 2; 3; 4; 5; 6; 7; 8] to couple large language models (LLMs) with external plugins or tools, enabling LLMs to interact with environment [9; 10] and retrieve up-to-date …
Reinforcement Learning for Optimizing RAG for Domain Chatbots
Large Language Models (LLM), conversational assistants have become prevalent for domain use cases. LLMs acquire the ability to contextual question answering through extensive training, and Retrieval A…
SAILER: Structure-aware Pre-trained Language Model for Legal Case Retrieval
To address these issues, in this paper, we propose SAILER, a new Structure-Aware pre-traIned language model for LEgal case Retrieval. It is highlighted in the following three aspects: (1) SAILER fully…
Social Skill Training with Large Language Models
People rely on social skills like conflict resolution to communicate effectively and to thrive in both work and personal life. However, practice environments for social skills are typically out of rea…
The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas
Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies h…
The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers
As Bainbridge [7] noted, a key irony of automation is that by mechanising routine tasks and leaving exception-handling to the human user, you deprive the user of the routine opportunities to practice …
The Labor Market Effects of Generative Artificial Intelligence
In this paper we develop a new survey analyzing Generative AI use in the labor market to assist in measuring the economic effects of Generative AI. We find, consistent with other surveys that Generati…
The impact of generative artificial intelligence on socioeconomic inequalities and policy making
Generative artificial intelligence (AI) has the potential to both exacerbate and ameliorate existing socioeconomic inequalities. In this article, we provide a state-of-the-art interdisciplinary overvi…
The state of enterprise AI
The most widely deployed GPTs either codify institutional knowledge into reusable assistants or automate workflows through integrations with internal systems. Some organizations have built a culture o…
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
To measure the progress of these LLM agents’ performance on performing real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that …
TnT-LLM: Text Mining at Scale with Large Language Models
Transforming unstructured text into structured and meaningful forms, organized by useful category labels, is a fundamental step in text mining for downstream analysis and application. However, most ex…
Using LLMs to Discover Legal Factors
Abstract. Factors are a foundational component of legal analysis and computational models of legal reasoning. These factor-based representations enable lawyers, judges, and AI and Law researchers to r…
Using Large Language Models to Generate, Validate, and Apply User Intent Taxonomies
Log data can reveal valuable information about how users interact with Web search services, what they want, and how satisfied they are. However, analyzing user intents in log data is not easy, especia…
We Wont be Missed: Work and Growth in the Era of AGI
This chapter explores theoretically the long-run implications of Artificial General Intelligence (AGI) for economic growth and labor markets. AGI makes it feasible to perform all economically valuable…
Why Do Multi-agent LLM Systems Fail?
[[Routers]] Despite growing enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains across popular benchmarks often remain minimal compared to single-agent frameworks. This gap highlig…
Working with AI: Measuring the Occupational Implications of Generative AI
In this work, we take a step toward that goal by analyzing the work activities people do with AI, how successfully and broadly those activities are done, and combine that with data on what occupations…
Workplace Everyday-Creativity through a Highly-Conversational UI to Large Language Models
We explore everyday co-creativity for collaborative human-AI teams in workplaces via a conversational user interface to a large language model. Previous short papers explored human-AI team-creativity …