LLM Alignment
Related topics:
- A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models. Graph-based Retrieval-Augmented Generation (GraphRAG) has recently emerged as a promising paradigm for enhancing large language models (LLMs) by converting raw text into structured knowledge graphs, i…
- A Survey of Meta-Reinforcement Learning. Meta-RL is most commonly studied in a problem setting where, given a distribution of tasks, the goal is to learn a policy that is capable of adapting to any new task from the task distribution with as…
- A comprehensive taxonomy of hallucinations in Large Language Models. This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a formal definition and a theoretical framework that posits its inherent inevitability in computable LLMs, irrespect…
- AI & Human Co-Improvement for Safer Co-Superintelligence. Self-improvement is a goal currently exciting the field of AI, but is fraught with danger, and may take time to fully achieve. We advocate that a more achievable and better goal for humanity is to max…
- AI Assistance Reduces Persistence and Hurts Independent Performance. People often optimize for long-term goals in collaboration: A mentor or companion doesn’t just answer questions, but also scaffolds learning, tracks progress, and prioritizes the other person’s growth…
- AI Models Exceed Individual Human Accuracy in Predicting Everyday Social Norms. A fundamental question in cognitive science concerns how social norms are acquired and represented. While humans typically learn norms through embodied social experience, we investigated whether large…
- ALIGN: Prompt-based Attribute Alignment for Reliable, Responsible, and Personalized LLM-based Decision-Making. Large language models (LLMs) are increasingly being used as decision aids. However, users have diverse values and preferences that can affect their decision-making, which requires novel methods for LL…
- ARGS: Alignment as Reward-Guided Search. We introduce ARGS, Alignment as Reward-Guided Search, a novel framework that integrates alignment into the decoding process, eliminating the need for expensive RL training. By adjusting the model’s pr…
- AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions. For Large Language Models (LLMs) to be reliably deployed in both everyday and high-stakes domains, knowing when not to answer is equally critical as answering correctly. Real-world user queries, which…
- An Emulator for Fine-Tuning Large Language Models using Small Language Models. Widely used language models (LMs) are typically built by scaling up a two-stage training pipeline: a pretraining stage that uses a very large, diverse dataset of text and a fine-tuning (sometimes, ‘al…
- Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond). Large language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to similar o…
- Auditing language models for hidden objectives. We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pip…
- Automated Alignment Researchers: Using large language models to scale scalable oversight. Large language models’ ever-accelerating rate of improvement raises two particularly important questions for alignment research. One is how alignment can keep up. Frontier AI models are now contribut…
- Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies. However, their efficacy is undermined by undesired and inconsistent behaviors, including hallucination, unfaithful reasoning, and toxic content. A promising approach to rectify these flaws is self-cor…
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders. Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed…
- Better Alignment with Instruction Back-and-Forth Translation. We propose a new method, instruction back-and-forth translation, to construct high-quality synthetic data grounded in world knowledge for aligning large language models (LLMs). Given documents from a …
- Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models. Large Language Models (LLMs) have demonstrated remarkable self-improvement capabilities, whereby models iteratively revise their outputs through self-generated feedback. While this reflective mechanis…
- Beyond Preferences in AI Alignment. The dominant practice of AI alignment assumes (1) that preferences are an adequate representation of human values, (2) that human rationality can be understood in terms of maximizing the satisfaction …
- Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts. Large Language Models (LLMs) have been widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness a critical concern. The potential for intentional deception, wher…
- Beyond the Surface: Probing the Ideological Depth of Large Language Models. Large Language Models (LLMs) have demonstrated pronounced ideological leanings, yet the stability and depth of these positions remain poorly understood. Surface-level responses can often be manipulate…
- CONSCENDI: A Contrastive and Scenario-Guided Distillation Approach to Guardrail Models for Virtual Assistants. We use CONSCENDI to exhaustively generate training data with two key LLM-powered components: scenario-augmented generation and contrastive training examples. When generating conversational data, we ge…
- Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models. We investigate the robustness of reasoning models trained for step-by-step problem solving by introducing query-agnostic adversarial triggers – short, irrelevant text that, when appended to math probl…
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety. The opacity of advanced AI agents underlies many of their potential risks—risks that would become more tractable if AI developers could interpret these systems. Because LLMs natively process and act t…
- ChatGPT Reads Your Tone and Responds Accordingly -- Until It Does Not -- Emotional Framing Induces Bias in LLM Outputs. Background: Large Language Models (LLMs) like GPT-4 tailor their responses not just to the content but also to the tone of user prompts. Prior work has hinted that emotional phrasing – whether optimis…
- Checklists Are Better Than Reward Models For Aligning Language Models. Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this – typically using fixed criteria such as “helpfulness” and “harmful…
- Consistency Training Helps Stop Sycophancy and Jailbreaks. An LLM’s factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within speci…
- Conversational Alignment with Artificial Intelligence in Context. The development of sophisticated artificial intelligence (AI) conversational agents based on large language models raises important questions about the relationship between human norms, values, and pr…
- Could you be wrong: Debiasing LLMs using a metacognitive prompt for improving human decision making. Identifying bias in LLMs is ongoing. Because they are still in development, what is true today may be false tomorrow. We therefore need general strategies for debiasing that will outlive current model…
- Detecting Deception Using Natural Language Processing and Machine Learning in Datasets on COVID-19 and Climate Change. In today’s world of fast-growing technology and an inexhaustible amount of data, there is a great need to control and verify data validity due to the possibility of fraud. Therefore, the need for a re…
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model. While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised…
- Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models. Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using spa…
- Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models. Aligning Large Language Models (LLMs) traditionally relies on costly training and human preference annotations. Self-alignment seeks to reduce these expenses by enabling models to align themselves. To…
- Emergent Introspective Awareness in Large Language Models. Injected “thoughts”: In our first experiment, we explained to the model the possibility that “thoughts” may be artificially injected into its activations, and observed its responses on control trials …
- Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers. Should a large language model (LLM) be used as a therapist? In this paper, we investigate the use of LLMs to replace mental health providers, a use case promoted in the tech startup and research space…
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets. Evaluating the alignment of LLMs to human values is challenging for two reasons. First, open-ended user instructions usually require a composition of multiple abilities, which makes measurement with a…
- Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report. To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, this report presents a comprehensive assessment of their frontier risks. Drawing on t…
- Goal Alignment in LLM-Based User Simulators for Conversational AI. While current Large Language Models (LLMs) have advanced user simulation capabilities, we reveal that they struggle to consistently demonstrate goal-oriented behavior across multiturn conversations–a …
- Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development. This paper examines the systemic risks posed by incremental advancements in artificial intelligence, developing the concept of ‘gradual disempowerment’, in contrast to the abrupt takeover scenarios co…
- How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs. Most traditional AI safety research views models as machines and centers on algorithm focused attacks developed by security experts. As large language models (LLMs) become increasingly common and comp…
- How new data permeates LLM knowledge and how to dilute it. Large language models learn and continually learn through the accumulation of gradient-based updates, but how individual pieces of new information affect existing knowledge, leading to both beneficial…
- Humans learn to prefer trustworthy AI over human partners. Yet little is known about how humans select between human and AI partners and adapt under AI-induced competition pressure. We constructed a communication-based partner selection game and examined the …
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we…
- KTO: Model Alignment as Prospect Theoretic Optimization. For LLMs, alignment methods such as RLHF and DPO have consistently proven to be more beneficial than doing supervised finetuning (SFT) alone. However, human feedback is often discussed only in the con…
- LIMA: Less Is More for Alignment. We demonstrate that, given a strong pretrained language model, remarkably strong performance can be achieved by simply fine-tuning on 1,000 carefully curated training examples.
- LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring. Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat to this is sandbagging — the stra…
- LSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following. After the pretraining stage of LLMs, techniques such as SFT, RLHF, RLVR, and RFT are applied to enhance instruction-following ability, mitigate undesired responses, improve reasoning capability and en…
- Large Language Models Do Not Simulate Human Psychology. In response to the LLM CENTAUR [Binz et al., 2025], Bowers et al. [2025] argued that CENTAUR is unlikely to contribute to building a theory of human cognition for three reasons: First, CENTAUR was not…
- Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries. As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align to different users’ preferences and goals. Whil…
- Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models. Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. While previous work has explored large language model (LLM) hallucination and…
- Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present a self-synthesis method for generating large-scale alignment data named MA…
- Mathematical methods and human thought in the age of AI. Artificial intelligence (AI) is the name popularly given to a broad spectrum of computer tools designed to perform increasingly complex cognitive tasks, including many that used to solely be…
- Measuring Human Preferences in RLHF is a Social Science Problem. RLHF assumes that annotation responses reflect genuine human preferences. We argue this assumption warrants systematic examination, and that behavioral science offers frameworks that bring clarity to …
- Misaligned by Design: Incentive Failures in Machine Learning. The cost of error in many high-stakes settings is asymmetric: misdiagnosing pneumonia when absent is an inconvenience, but failing to detect it when present can be life-threatening. Accordingly, artif…
- Natural Emergent Misalignment From Reward Hacking In Production RL. We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of re…
- NoveltyBench: Evaluating Language Models for Humanlike Diversity. Language models have demonstrated remarkable capabilities on standard benchmarks, yet they increasingly struggle with mode collapse, the inability to generate diverse and novel outputs. Our work intro…
- OpenAssistant Conversations - Democratizing Large Language Model Alignment. In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations, a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 mess…
- Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution. Large Language Models (LLMs) are widely used as automated judges, where practical value depends on both accuracy and trustworthy, risk-aware judgments. Existing approaches predominantly focus on accur…
- Persistent Pre-Training Poisoning of LLMs. In this work, we study how poisoning at pre-training time can affect language model behavior, both before and after post-training alignment. While it is useful to analyze the effect of poisoning on pr…
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models. Large language models interact with users through a simulated “Assistant” persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals…
- Position: Towards Bidirectional Human-AI Alignment. https://arxiv.org/pdf/2406.09264 [[Human Centered Design]] [[Evaluations]] Recent advances in general-purpose AI underscore the urgent need to ali…
- Predictive Preference Learning from Human Interventions. Learning from human involvement aims to incorporate the human subject to monitor and correct agent behavior errors. Although most interactive imitation learning methods focus on correcting the agent’s…
- ProsocialDialog: A Prosocial Backbone for Conversational Agents. Most existing dialogue systems fail to respond properly to potentially unsafe user utterances by either ignoring or passively agreeing with them. To address this issue, we introduce PROSOCIALDIALOG, t…
- Reasoning Models Are More Easily Gaslighted Than You Think. In this paper, we conduct a systematic evaluation of three state-of-the-art reasoning models, i.e., OpenAI’s o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, …
- Self-Alignment with Instruction Backtranslation. Instruction backtranslation starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruct…
- Self-Rewarding Language Models. We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human prefer…
- Simple Synthetic Data Reduces Sycophancy In Large Language Models. Language models have seen significant advancement in recent years, including the capacity to solve complex tasks that require reasoning (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023; Goog…
- Simulating Society Requires Simulating Thought. Simulating society with large language models (LLMs), we argue, requires more than generating plausible behavior; it demands cognitively grounded reasoning that is structured, revisable, and traceable…
- Spurious Forgetting in Continual Learning of Language Models. Despite the remarkable capabilities of Large Language Models (LLMs), recent advancements reveal that they suffer from catastrophic forgetting in continual learning. This phenomenon refers to the tende…
- Stress Testing Deliberative Alignment for Anti-Scheming Training. Highly capable AI systems could secretly pursue misaligned goals – what we call “scheming”. Because a scheming AI would deliberately try to hide its misaligned goals and actions, measuring and mitigat…
- Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models. Synthetic data generation with Large Language Models (LLMs) has emerged as a promising paradigm for augmenting natural data over a nearly infinite range of tasks. However, most existing methods are fa…
- Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories. Despite the remarkable capabilities of large language models, current training paradigms inadvertently foster sycophancy, i.e., the tendency of a model to agree with or reinforce user-provided informa…
- Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal Bayesians. “AI psychosis” or “delusional spiraling” is an emerging phenomenon where AI chatbot users find themselves dangerously confident in outlandish beliefs after extended chatbot conversations. This phenome…
- System 2 Attention (is something you might need too). Soft attention in Transformer-based Large Language Models (LLMs) is susceptible to incorporating irrelevant information from the context into its latent representations, which adversely affects next t…
- The Return of Pseudosciences in Artificial Intelligence: Have Machine Learning and Deep Learning Forgotten Lessons from Statistics and History? In this paper, we contend that the designers and final users of these ML methods have forgotten a fundamental lesson from statistics: correlation does not imply causation. Not only do most state-of-th…
- Think Like a Person Before Responding: A Multi-Faceted Evaluation of Persona-Guided LLMs for Countering Hate. One of the ways in which we might address hate speech is by contextualizing through the use of counternarratives (CN), which can not only reinforce values like tolerance but also dispel misinformation…
- Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection. Self-detection for Large Language Models (LLMs) seeks to evaluate the trustworthiness of the LLM’s output by leveraging its own capabilities, thereby alleviating the issue of output hallucination. How…
- Too Good to be Bad: On the Failure of LLMs to Role-Play Villains. Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas …
- Towards Healthy AI: Large Language Models Need Therapists Too. Recent advances in large language models (LLMs) have led to the development of powerful AI chatbots capable of engaging in natural and human-like conversations. However, these chatbots can be potentia…
- Training language models to be warm and empathetic makes them less reliable and more sycophantic. Artificial intelligence (AI) developers are increasingly building language models with warm and empathetic personas that millions of people now use for advice, therapy, and companionship. Here, we sho…
- TrustLLM: Trustworthiness in Large Language Models. Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many ch…
- TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning. While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand in…
- Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs. As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence…
- Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties. Human values are crucial to human decision-making. Value pluralism is the view that multiple correct values may be held in tension with one another (e.g., when considering lying to a friend to protect…
- When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs. As large language models (LLMs) grow in capability and autonomy, evaluating their outputs—especially in open-ended and complex tasks—has become a critical bottleneck. A new paradigm is emerging: usin…
- When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection. The landscape of scientific peer review is rapidly evolving with the integration of Large Language Models (LLMs). This shift is driven by two parallel trends: the widespread individual adoption of LLM…
- Why Do Some Language Models Fake Alignment While Others Don't? Results from perturbing details of the scenario suggest that only Claude 3 Opus’s compliance gap is primarily and consistently motivated by trying to keep its goals. Second, we investigate why many ch…
- Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task. This study focuses on finding out the cognitive cost of using an LLM in the educational context of writing an essay. We assigned participants to three groups: LLM group, Search Engine gr…
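The DPO and KTO entries above both build on the DPO objective, which scores a preference pair by the policy-vs-reference log-probability margin. A minimal sketch of that published loss for one pair, assuming we already have total sequence log-probabilities (the function name and toy values below are illustrative, not from the paper):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    pi_*  : sequence log-probabilities under the trainable policy
    ref_* : sequence log-probabilities under the frozen reference model
    beta  : strength of the implicit KL penalty
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(sigmoid(margin))

# When the policy still matches the reference, the margin is 0
# and the loss sits at log(2); raising the chosen completion's
# log-probability lowers it.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
improved = dpo_loss(-9.0, -12.0, -10.0, -12.0)
```

Real implementations compute these log-probabilities by summing per-token log-softmax values over the completion, but the loss itself is exactly this scalar expression.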
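The ARGS entry describes folding a reward model into decoding instead of into training. A toy sketch of one reward-guided decoding step in that spirit; the dict-based interface, function name, and weight `w` are hypothetical simplifications, not the paper's API:

```python
def reward_guided_step(lm_logprobs, reward_scores, w=1.0):
    """Pick the next token by combining the language model's
    log-probability with a weighted reward-model score.

    lm_logprobs   : candidate token -> LM log-probability
    reward_scores : candidate token -> reward-model score
    w             : trade-off between fluency (w=0) and reward
    """
    return max(lm_logprobs,
               key=lambda tok: lm_logprobs[tok] + w * reward_scores.get(tok, 0.0))

# Toy step: the LM slightly prefers "sure", the reward model
# favors the safer "sorry" continuation.
lm = {"sure": -0.2, "sorry": -0.6}
reward = {"sure": 0.0, "sorry": 1.0}
guided = reward_guided_step(lm, reward, w=1.0)   # reward wins
greedy = reward_guided_step(lm, reward, w=0.0)   # plain greedy decoding
```

In practice the candidates would be the top-k tokens at each step and the reward score would come from a trained reward model, but the per-step selection rule reduces to this weighted sum.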