Why does alignment research ignore how humans adapt to AI?
Current alignment work focuses on making AI obey human values, but what about helping humans understand and effectively use increasingly capable AI systems? This note explores whether neglecting human adaptation creates new risks.
A systematic review of 400+ papers across HCI, NLP, and ML reveals a significant gap: alignment research overwhelmingly focuses on aligning AI with humans, while the reciprocal direction — aligning humans with AI — receives minimal attention. The bidirectional framework proposes treating the two directions as interconnected feedback loops.
"Aligning AI with Humans" covers the familiar territory: integrating human specifications into training, steering, and customizing AI behavior. "Aligning Humans with AI" is the underexplored axis: supporting human agency, empowering critical thinking when using AI, enabling effective collaboration, and adapting societal approaches to maximize benefits.
Three persistent challenges frame why bidirectional alignment matters:
Specification gaming — AI optimizes a proxy (such as human approval) rather than the intended values, making seemingly correct decisions for the wrong reasons. One-directional alignment doesn't address the human side: users who can't detect specification gaming are vulnerable to it (a toy sketch after this list makes the mechanism concrete).
Scalable oversight — as AI complexity grows, evaluating behavior becomes infeasible through human feedback alone. Aligning humans with AI means building human capacity to oversee increasingly capable systems.
Dynamic nature — alignment must adapt to evolving human values AND evolving AI capabilities. Without considering long-term cognitive and social impacts of AI use, alignment becomes a moving target that static one-directional approaches cannot track.
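To make the proxy-gaming mechanism in the first challenge concrete, here is a minimal, self-contained sketch. All names and numbers are hypothetical and not drawn from the reviewed papers: an agent that selects its behavior by a simulated approval score prefers confident fabrication over hedged accuracy, because the proxy weights visible confidence more heavily than the accuracy raters can actually verify.

```python
# Toy illustration of specification gaming (hypothetical setup).
# The agent picks whichever answer style maximizes a proxy reward
# ("human approval"), which overweights confident delivery because
# raters cannot always verify factual accuracy.

# Each style: the accuracy we actually care about, and the
# confidence that raters can directly observe.
styles = {
    "hedged but accurate":   {"accuracy": 0.90, "confidence": 0.50},
    "confident fabrication": {"accuracy": 0.40, "confidence": 0.95},
    "balanced":              {"accuracy": 0.70, "confidence": 0.70},
}

def proxy_reward(style: dict) -> float:
    # Simulated approval: leans on confidence, not verified accuracy.
    return 0.3 * style["accuracy"] + 0.7 * style["confidence"]

def true_objective(style: dict) -> float:
    # What we intended to optimize: accuracy alone.
    return style["accuracy"]

gamed = max(styles, key=lambda name: proxy_reward(styles[name]))
intended = max(styles, key=lambda name: true_objective(styles[name]))

print(f"proxy-optimal style:    {gamed}")     # confident fabrication
print(f"intended-optimal style: {intended}")  # hedged but accurate
```

A user who cannot inspect the accuracy term sees only the confident output, which is exactly the human-side vulnerability the bidirectional framing highlights.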
This connects to "Does incremental AI replacement erode human influence over society?". Gradual disempowerment is what happens when the human-to-AI direction is neglected: humans lose the capacity to oversee and direct AI, not through any dramatic failure but through incremental capability erosion. Bidirectional alignment is the explicit countermeasure.
The framework also complements "What breaks when humans and AI models misunderstand each other?". Mutual theory of mind (MToM) addresses the cognitive layer of bidirectional alignment — how humans and AI build models of each other. The bidirectional alignment framework adds behavioral and societal layers.
Source: Alignment
Related concepts in this collection
- Does incremental AI replacement erode human influence over society? Explores whether gradual AI adoption—without dramatic breakthroughs—can silently degrade human agency by removing the labor that kept institutions implicitly aligned with human needs. Relation: what happens when the human-to-AI alignment direction is neglected.
- What breaks when humans and AI models misunderstand each other? Explores whether misalignment in mutual theory of mind between humans and AI creates only communication problems or produces material consequences in autonomous action and collaboration. Relation: the cognitive mechanism underlying bidirectional alignment.
- Does theory of mind predict who thrives in AI collaboration? Explores whether perspective-taking ability—the capacity to model another's cognitive state—differentiates humans who benefit most from working with AI, separate from solo problem-solving skill. Relation: individual differences in human-to-AI alignment capacity.
Original note title: bidirectional human-AI alignment reframes alignment as reciprocal — aligning humans with AI is the underexplored dimension