Bridging the gulf of envisioning: Cognitive design challenges in LLM interfaces
Large language models (LLMs) exhibit dynamic capabilities and appear to comprehend complex and ambiguous natural language prompts. However, calibrating LLM interactions is challenging for interface designers and end-users alike. A central issue is our limited grasp of how human cognitive processes begin with a goal and form intentions for executing actions, a blind spot even in established interaction models such as Norman’s gulfs of execution and evaluation. To address this gap, we theorize how end-users ‘envision’ translating their goals into clear intentions and craft prompts to obtain the desired LLM response. We define the process of envisioning by highlighting three misalignments, each rooted in not knowing: (1) what the task should be, (2) how to instruct the LLM to perform the task, and (3) what to expect of the LLM’s output in meeting the goal. Finally, we make recommendations for narrowing the gulf of envisioning in human-LLM interactions.
However, LLMs also require careful guidance to ensure that the generated content is appropriate and aligned with human goals and intentions. For instance, if an end-user wishes to leverage an LLM to craft a toast and prompts the LLM with “Write a toast for Taylor”, the output may fall short because the desired qualities are left unstated. The human must be more specific about their intentions (such as, “Write a heartwarming toast for my best friend Taylor’s retirement party, about 5 minutes long, include a humorous twist, and wish them well on the golf course”). Formal and anecdotal evidence (e.g., [2, 75, 82, 197]) suggests that effectively prompting LLMs to produce outputs comparable to human-generated content remains challenging. If intentions are expressed too vaguely or without specific detail, the LLM may generate responses that are generic, irrelevant, or off-topic [63, 80, 197]. Iterating with an LLM can correct and progressively guide generation, but playing a “20 questions” or “hot or cold” guessing game may be inefficient for longer outputs and lead to a local minimum within the solution space [159]. Further, humans fixate on initial examples in ways that interfere with exploring alternative solutions [71, 93]. In this work, we draw on theories across HCI and cognitive science to characterize the cognitive challenges humans face in dialogic interactions with intelligent generative agents.
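To make the contrast concrete, the following sketch issues both the underspecified and the elaborated toast prompt through a chat-completion API. This is a minimal illustration assuming the OpenAI Python client; the model name, prompts, and client setup are illustrative assumptions rather than part of our study.

```python
# Minimal sketch contrasting a vague prompt with a specific one,
# using the OpenAI Python client (pip install openai). The model
# name and client setup are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

vague_prompt = "Write a toast for Taylor"
specific_prompt = (
    "Write a heartwarming toast for my best friend Taylor's retirement "
    "party, about 5 minutes long, include a humorous twist, and wish "
    "them well on the golf course."
)

for prompt in (vague_prompt, specific_prompt):
    # Send each prompt independently and print the model's reply.
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- Prompt: {prompt!r}")
    print(response.choices[0].message.content)
```

Running both prompts side by side typically shows the pattern described above: the vague prompt yields a generic toast, while the specific prompt yields output closer to the user's actual intention.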
In this work, we examine the transformative impact of generative AI systems on human-machine interaction, focusing on how this shift from conventional interfaces alters the design and usability of interactions along these three dimensions. Hutchins et al. [68] offer a model of interface design challenges in software systems, including a “gulf of execution” between user intentions and system actions and a “gulf of evaluation” between system output and user understanding of its genesis. Their general principle is that as the distance between the human’s intentions and the system’s interface increases, so do the costs of interaction. LLMs substantially reduce this distance, and therefore the costs, by recasting interaction as the human formulation of intentions through natural language dialogue leading to desired output [81, 197]. LLMs narrow the gulf of execution by eliminating conventional needs for action specification and execution [68], leaving only intention formation to the user. However, the gulf of evaluation may widen, given the challenges of perceiving, interpreting, and evaluating output [68] produced by the LLM’s probabilistic process.
What are the consequences of these new LLM features enabling success on complex tasks (flexibility in functional scope, variation in intention specificity, and probabilistic processes and outputs) for the nature and costs of human interaction? We suggest this new LLM interaction process poses new challenges for people, which we call “the gulf of envisioning.” Concretely, the gulf of envisioning characterizes the distance between the human’s initial intentions and their formulation of a prompt that foresees how LLM capabilities and training data can be leveraged to generate high-quality output. Envisioning includes at least three challenges for humans interacting with LLM systems: (1) how to set my goals and intentions such that the LLM can accomplish the task (the capability gap), (2) how to best instruct an LLM about my goals, i.e., prompt engineering (the instruction gap), and (3) what to expect of the LLM’s output (the intentionality gap). In this paper, we formulate a new model of interaction for human-LLM interfaces in which intentions are the actions. Our key contributions include (1) a characterization of how transformative LLM natural language interfaces yield both expansive functionality and new challenges in bridging intentions and outcomes; (2) an updated model of human-machine interaction identifying the process of envisioning execution; and (3) a set of design patterns and guidelines for human-LLM interfaces, along with an analysis of interfaces for three types of generative tasks.
In much HCI work, stage two, forming the intention, is assumed to be given [65]. For instance, when cutting and pasting a paragraph of text in a word processor or clicking the ‘Bold’ font button, how much do we know (or care) about the underlying intentions that led a user to execute those actions? We posit that while this gap from goal to intention has been inadvertently bypassed in traditional design approaches, it emerges as a critical challenge that must be addressed in human-LLM interactions. In this section, we explore this overlooked aspect of intention formation during interaction and postulate its role in LLM-powered interfaces.