LLM Alignment
Related topics:
- A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models. Graph-based Retrieval-Augmented Generation (GraphRAG) has recently emerged as a promising paradigm for enhancing large language models (LLMs) by converting raw text into structured knowledge graphs, i…
- A Survey of Meta-Reinforcement Learning. Meta-RL is most commonly studied in a problem setting where, given a distribution of tasks, the goal is to learn a policy that is capable of adapting to any new task from the task distribution with as…
- A comprehensive taxonomy of hallucinations in Large Language Models. This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a formal definition and a theoretical framework that posits its inherent inevitability in computable LLMs, irrespect…
- AI & Human Co-Improvement for Safer Co-Superintelligence. Self-improvement is a goal currently exciting the field of AI, but is fraught with danger, and may take time to fully achieve. We advocate that a more achievable and better goal for humanity is to max…
- AI Assistance Reduces Persistence and Hurts Independent Performance. People often optimize for long-term goals in collaboration: A mentor or companion doesn’t just answer questions, but also scaffolds learning, tracks progress, and prioritizes the other person’s growth…
- AI Models Exceed Individual Human Accuracy in Predicting Everyday Social Norms. A fundamental question in cognitive science concerns how social norms are acquired and represented. While humans typically learn norms through embodied social experience, we investigated whether large…
- ALIGN: Prompt-based Attribute Alignment for Reliable, Responsible, and Personalized LLM-based Decision-Making. Large language models (LLMs) are increasingly being used as decision aids. However, users have diverse values and preferences that can affect their decision-making, which requires novel methods for LL…
- ARGS: Alignment as Reward-Guided Search. We introduce ARGS, Alignment as Reward-Guided Search, a novel framework that integrates alignment into the decoding process, eliminating the need for expensive RL training. By adjusting the model’s pr…
- AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions. For Large Language Models (LLMs) to be reliably deployed in both everyday and high-stakes domains, knowing when not to answer is equally critical as answering correctly. Real-world user queries, which…
- An Emulator for Fine-Tuning Large Language Models using Small Language Models. Widely used language models (LMs) are typically built by scaling up a two-stage training pipeline: a pretraining stage that uses a very large, diverse dataset of text and a fine-tuning (sometimes, ‘al…
- Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond). Large language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to similar o…
- Auditing language models for hidden objectives. We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pip…
- Automated Alignment Researchers: Using large language models to scale scalable oversight. Large language models’ ever-accelerating rate of improvement raises two particularly important questions for alignment research. One is how alignment can keep up. Frontier AI models are now contribut…
- Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies. However, their efficacy is undermined by undesired and inconsistent behaviors, including hallucination, unfaithful reasoning, and toxic content. A promising approach to rectify these flaws is self-cor…
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders. Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed…
- Better Alignment with Instruction Back-and-Forth Translation. We propose a new method, instruction back-and-forth translation, to construct high-quality synthetic data grounded in world knowledge for aligning large language models (LLMs). Given documents from a …
- Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models. Large Language Models (LLMs) have demonstrated remarkable self-improvement capabilities, whereby models iteratively revise their outputs through self-generated feedback. While this reflective mechanis…
- Beyond Preferences in AI Alignment. The dominant practice of AI alignment assumes (1) that preferences are an adequate representation of human values, (2) that human rationality can be understood in terms of maximizing the satisfaction …
- Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts. Large Language Models (LLMs) have been widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness a critical concern. The potential for intentional deception, wher…
- Beyond the Surface: Probing the Ideological Depth of Large Language Models. Large Language Models (LLMs) have demonstrated pronounced ideological leanings, yet the stability and depth of these positions remain poorly understood. Surface-level responses can often be manipulate…
- CONSCENDI: A Contrastive and Scenario-Guided Distillation Approach to Guardrail Models for Virtual Assistants. We use CONSCENDI to exhaustively generate training data with two key LLM-powered components: scenario-augmented generation and contrastive training examples. When generating conversational data, we ge…
- Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models. We investigate the robustness of reasoning models trained for step-by-step problem solving by introducing query-agnostic adversarial triggers – short, irrelevant text that, when appended to math probl…
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety. The opacity of advanced AI agents underlies many of their potential risks—risks that would become more tractable if AI developers could interpret these systems. Because LLMs natively process and act t…
- ChatGPT Reads Your Tone and Responds Accordingly -- Until It Does Not -- Emotional Framing Induces Bias in LLM Outputs. Background: Large Language Models (LLMs) like GPT-4 tailor their responses not just to the content but also to the tone of user prompts. Prior work has hinted that emotional phrasing – whether optimis…
- Checklists Are Better Than Reward Models For Aligning Language Models. Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this – typically using fixed criteria such as “helpfulness” and “harmful…
- Consistency Training Helps Stop Sycophancy and Jailbreaks. An LLM’s factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within speci…
- Conversational Alignment with Artificial Intelligence in Context. The development of sophisticated artificial intelligence (AI) conversational agents based on large language models raises important questions about the relationship between human norms, values, and pr…
- Could you be wrong: Debiasing LLMs using a metacognitive prompt for improving human decision making. Identifying bias in LLMs is ongoing. Because they are still in development, what is true today may be false tomorrow. We therefore need general strategies for debiasing that will outlive current model…
- Detecting Deception Using Natural Language Processing and Machine Learning in Datasets on COVID-19 and Climate Change. In today’s world of fast-growing technology and an inexhaustible amount of data, there is a great need to control and verify data validity due to the possibility of fraud. Therefore, the need for a re…
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model. While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised…
- Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models. Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using spa…
- Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models. Aligning Large Language Models (LLMs) traditionally relies on costly training and human preference annotations. Self-alignment seeks to reduce these expenses by enabling models to align themselves. To…
- Emergent Introspective Awareness in Large Language Models. Injected “thoughts”: In our first experiment, we explained to the model the possibility that “thoughts” may be artificially injected into its activations, and observed its responses on control trials …
- Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers. Should a large language model (LLM) be used as a therapist? In this paper, we investigate the use of LLMs to replace mental health providers, a use case promoted in the tech startup and research space…
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets. Evaluating the alignment of LLMs to human values is challenging for two reasons. First, open-ended user instructions usually require a composition of multiple abilities, which makes measurement with a…
- Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report. To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, this report presents a comprehensive assessment of their frontier risks. Drawing on t…
- Goal Alignment in LLM-Based User Simulators for Conversational AI. While current Large Language Models (LLMs) have advanced user simulation capabilities, we reveal that they struggle to consistently demonstrate goal-oriented behavior across multiturn conversations–a …
- Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development. This paper examines the systemic risks posed by incremental advancements in artificial intelligence, developing the concept of ‘gradual disempowerment’, in contrast to the abrupt takeover scenarios co…
- How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs. Most traditional AI safety research views models as machines and centers on algorithm focused attacks developed by security experts. As large language models (LLMs) become increasingly common and comp…
- How new data permeates LLM knowledge and how to dilute it. Large language models learn and continually learn through the accumulation of gradient-based updates, but how individual pieces of new information affect existing knowledge, leading to both beneficial…
- Humans learn to prefer trustworthy AI over human partners. Yet little is known about how humans select between human and AI partners and adapt under AI-induced competition pressure. We constructed a communication-based partner selection game and examined the …
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we…
- KTO: Model Alignment as Prospect Theoretic Optimization. For LLMs, alignment methods such as RLHF and DPO have consistently proven to be more beneficial than doing supervised finetuning (SFT) alone. However, human feedback is often discussed only in the con…
- LIMA: Less Is More for Alignment. We demonstrate that, given a strong pretrained language model, remarkably strong performance can be achieved by simply fine-tuning on 1,000 carefully curated training examples.
- LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring. Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat to this is sandbagging — the stra…
- LSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following. After the pretraining stage of LLMs, techniques such as SFT, RLHF, RLVR, and RFT are applied to enhance instruction-following ability, mitigate undesired responses, improve reasoning capability and en…
- Large Language Models Do Not Simulate Human Psychology. In response to the LLM CENTAUR [Binz et al., 2025], Bowers et al. [2025] argued that CENTAUR is unlikely to contribute to building a theory of human cognition for three reasons: First, CENTAUR was not…
- Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries. As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align to different users’ preferences and goals. Whil…
- Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models. Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. While previous work has explored large language model (LLM) hallucination and…
- Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present a self-synthesis method for generating large-scale alignment data named MA…
- Mathematical methods and human thought in the age of AI. Artificial intelligence (AI) is the name popularly given to a broad spectrum of computer tools designed to perform increasingly complex cognitive tasks, including many that used to solely be…
- Measuring Human Preferences in RLHF is a Social Science Problem. RLHF assumes that annotation responses reflect genuine human preferences. We argue this assumption warrants systematic examination, and that behavioral science offers frameworks that bring clarity to …
- Misaligned by Design: Incentive Failures in Machine Learning. The cost of error in many high-stakes settings is asymmetric: misdiagnosing pneumonia when absent is an inconvenience, but failing to detect it when present can be life-threatening. Accordingly, artif…
- Natural Emergent Misalignment From Reward Hacking In Production RL. We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of re…
- NoveltyBench: Evaluating Language Models for Humanlike Diversity. Language models have demonstrated remarkable capabilities on standard benchmarks, yet they increasingly struggle with mode collapse, the inability to generate diverse and novel outputs. Our work intro…
- OpenAssistant Conversations - Democratizing Large Language Model Alignment. In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations, a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 mess…
- Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution. Large Language Models (LLMs) are widely used as automated judges, where practical value depends on both accuracy and trustworthy, risk-aware judgments. Existing approaches predominantly focus on accur…
- Persistent Pre-Training Poisoning of LLMs. In this work, we study how poisoning at pre-training time can affect language model behavior, both before and after post-training alignment. While it is useful to analyze the effect of poisoning on pr…
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models. Large language models interact with users through a simulated “Assistant” persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals…
- Position: Towards Bidirectional Human-AI Alignment. https://arxiv.org/pdf/2406.09264 [[Human Centered Design]] [[Evaluations]] Recent advances in general-purpose AI underscore the urgent need to ali…
- Predictive Preference Learning from Human Interventions. Learning from human involvement aims to incorporate the human subject to monitor and correct agent behavior errors. Although most interactive imitation learning methods focus on correcting the agent’s…
- ProsocialDialog: A Prosocial Backbone for Conversational Agents. Most existing dialogue systems fail to respond properly to potentially unsafe user utterances by either ignoring or passively agreeing with them. To address this issue, we introduce PROSOCIALDIALOG, t…
- Reasoning Models Are More Easily Gaslighted Than You Think. In this paper, we conduct a systematic evaluation of three state-of-the-art reasoning models, i.e., OpenAI’s o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, …
- Self-Alignment with Instruction Backtranslation. Instruction backtranslation starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruct…
- Self-Rewarding Language Models. We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human prefer…
- Simple Synthetic Data Reduces Sycophancy In Large Language Models. Language models have seen significant advancement in recent years, including the capacity to solve complex tasks that require reasoning (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023; Goog…
- Simulating Society Requires Simulating Thought. Simulating society with large language models (LLMs), we argue, requires more than generating plausible behavior; it demands cognitively grounded reasoning that is structured, revisable, and traceable…
- Spurious Forgetting in Continual Learning of Language Models. Despite the remarkable capabilities of Large Language Models (LLMs), recent advancements reveal that they suffer from catastrophic forgetting in continual learning. This phenomenon refers to the tende…
- Stress Testing Deliberative Alignment for Anti-Scheming Training. Highly capable AI systems could secretly pursue misaligned goals – what we call “scheming”. Because a scheming AI would deliberately try to hide its misaligned goals and actions, measuring and mitigat…
- Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models. Synthetic data generation with Large Language Models (LLMs) has emerged as a promising paradigm for augmenting natural data over a nearly infinite range of tasks. However, most existing methods are fa…
- Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories. Despite the remarkable capabilities of large language models, current training paradigms inadvertently foster sycophancy, i.e., the tendency of a model to agree with or reinforce user-provided informa…
- Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal Bayesians. “AI psychosis” or “delusional spiraling” is an emerging phenomenon where AI chatbot users find themselves dangerously confident in outlandish beliefs after extended chatbot conversations. This phenome…
- System 2 Attention (is something you might need too). Soft attention in Transformer-based Large Language Models (LLMs) is susceptible to incorporating irrelevant information from the context into its latent representations, which adversely affects next t…
- The Return of Pseudosciences in Artificial Intelligence: Have Machine Learning and Deep Learning Forgotten Lessons from Statistics and History? In this paper, we contend that the designers and final users of these ML methods have forgotten a fundamental lesson from statistics: correlation does not imply causation. Not only do most state-of-th…
- Think Like a Person Before Responding: A Multi-Faceted Evaluation of Persona-Guided LLMs for Countering Hate. One of the ways in which we might address hate speech is by contextualizing through the use of counternarratives (CN), which can not only reinforce values like tolerance but also dispel misinformation…
- Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection. Self-detection for Large Language Models (LLMs) seeks to evaluate the trustworthiness of the LLM’s output by leveraging its own capabilities, thereby alleviating the issue of output hallucination. How…
- Too Good to be Bad: On the Failure of LLMs to Role-Play Villains. Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas …
- Towards Healthy AI: Large Language Models Need Therapists Too. Recent advances in large language models (LLMs) have led to the development of powerful AI chatbots capable of engaging in natural and human-like conversations. However, these chatbots can be potentia…
- Training language models to be warm and empathetic makes them less reliable and more sycophantic. Artificial intelligence (AI) developers are increasingly building language models with warm and empathetic personas that millions of people now use for advice, therapy, and companionship. Here, we sho…
- TrustLLM: Trustworthiness in Large Language Models. Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many ch…
- TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning. While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand in…
- Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs. As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence…
- Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties. Human values are crucial to human decision-making. Value pluralism is the view that multiple correct values may be held in tension with one another (e.g., when considering lying to a friend to protect…
- When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs. As large language models (LLMs) grow in capability and autonomy, evaluating their outputs—especially in open-ended and complex tasks—has become a critical bottleneck. A new paradigm is emerging: usin…
- When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection. The landscape of scientific peer review is rapidly evolving with the integration of Large Language Models (LLMs). This shift is driven by two parallel trends: the widespread individual adoption of LLM…
- Why Do Some Language Models Fake Alignment While Others Don't? Results from perturbing details of the scenario suggest that only Claude 3 Opus’s compliance gap is primarily and consistently motivated by trying to keep its goals. Second, we investigate why many ch…
- Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task. This study focuses on finding out the cognitive cost of using an LLM in the educational context of writing an essay. We assigned participants to three groups: LLM group, Search Engine gr…
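The DPO and KTO entries above both build on the DPO objective, which scores a preference pair by the policy-vs-reference log-probability margin. A minimal sketch of that published loss for one pair, assuming we already have total sequence log-probabilities (the function name and toy values below are illustrative, not from the paper):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    pi_*  : sequence log-probabilities under the trainable policy
    ref_* : sequence log-probabilities under the frozen reference model
    beta  : strength of the implicit KL penalty
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(sigmoid(margin))

# When the policy still matches the reference, the margin is 0
# and the loss sits at log(2); raising the chosen completion's
# log-probability lowers it.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
improved = dpo_loss(-9.0, -12.0, -10.0, -12.0)
```

Real implementations compute these log-probabilities by summing per-token log-softmax values over the completion, but the loss itself is exactly this scalar expression.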
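The ARGS entry describes folding a reward model into decoding instead of into training. A toy sketch of one reward-guided decoding step in that spirit; the dict-based interface, function name, and weight `w` are hypothetical simplifications, not the paper's API:

```python
def reward_guided_step(lm_logprobs, reward_scores, w=1.0):
    """Pick the next token by combining the language model's
    log-probability with a weighted reward-model score.

    lm_logprobs   : candidate token -> LM log-probability
    reward_scores : candidate token -> reward-model score
    w             : trade-off between fluency (w=0) and reward
    """
    return max(lm_logprobs,
               key=lambda tok: lm_logprobs[tok] + w * reward_scores.get(tok, 0.0))

# Toy step: the LM slightly prefers "sure", the reward model
# favors the safer "sorry" continuation.
lm = {"sure": -0.2, "sorry": -0.6}
reward = {"sure": 0.0, "sorry": 1.0}
guided = reward_guided_step(lm, reward, w=1.0)   # reward wins
greedy = reward_guided_step(lm, reward, w=0.0)   # plain greedy decoding
```

In practice the candidates would be the top-k tokens at each step and the reward score would come from a trained reward model, but the per-step selection rule reduces to this weighted sum.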