Fine-grained Hallucination Detection and Editing for Language Models
Several recent works study automatic hallucination detection (Min et al., 2023) or output editing (Gao et al., 2022) to address such LM hallucinations. These systems typically reduce hallucinations to a simplistic binary distinction: factual or not factual (Figure 1, top). We argue that hallucinations manifest in diverse forms, each requiring a different degree of careful assessment to verify factuality. Entity-level contradictions are usually evident and can be easily rectified with a single reference. Conversely, errors involving fabricated entities (e.g., The Messi Diaries in Figure 1) demand thorough verification across multiple sources. This underscores the need for a more fine-grained approach to hallucination detection, both for model development and for human verification.
We introduce FAVA, a new retrieval-augmented LM that identifies and marks hallucinations at the span level using a unified markup syntax (Figure 1, bottom left). Due to the cost of manual annotation, we design an LM-based synthetic data generation process and train FAVA on the resulting 35k instances. We compare FAVA with state-of-the-art LMs on fine-grained hallucination detection and editing of LM outputs. FAVA significantly outperforms ChatGPT, both with and without external knowledge, by 23.7% on fine-grained hallucination detection. Furthermore, FAVA effectively corrects hallucinations in diverse LM outputs: its edits improve the FActScore (Min et al., 2023) of Alpaca 7B, Alpaca 13B, and ChatGPT outputs by 4.4%, 9.3%, and 3.3%, respectively.
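As a concrete illustration of span-level marking, the following sketch extracts tagged error spans from a marked-up output. The tag names and the helper extract_error_spans are illustrative assumptions for this example, not FAVA's exact unified syntax.

```python
import re

# Hypothetical tag set for illustration; FAVA's actual unified syntax may differ.
ERROR_TAGS = ["entity", "relation", "contradictory", "invented", "subjective", "unverifiable"]

def extract_error_spans(marked_output: str):
    """Return (error_type, span_text) pairs found in a span-level marked LM output."""
    spans = []
    for tag in ERROR_TAGS:
        # Non-greedy match of text wrapped in <tag>...</tag>.
        for match in re.finditer(rf"<{tag}>(.*?)</{tag}>", marked_output, flags=re.DOTALL):
            spans.append((tag, match.group(1).strip()))
    return spans

example = ("Messi was born in <entity>Brazil</entity> and wrote "
           "<invented>The Messi Diaries</invented>.")
print(extract_error_spans(example))
# [('entity', 'Brazil'), ('invented', 'The Messi Diaries')]
```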
We group hallucinations into two major categories: statements that contradict world knowledge (Type 1) and unverifiable statements (Types 2, 3, and 4).
(1a) Entity : Contradictory entity errors are a sub-category within contradictory statement errors where an entity in a statement is incorrect and changing that single entity can make the entire sentence factually correct.
(1b) Relation : Contradictory relation errors are another sub-category within contradictory statements where a semantic relationship (e.g., verbs, prepositions, or adjectives) in a statement is incorrect.
(1c) Sentence : Contradictory sentence errors refer to cases where a full statement entirely contradicts relevant evidence from the web.
(2) Invented : Invented errors refer to statements where the LM generates an entirely fabricated entity that does not exist according to world knowledge; fictional entities in creative works are not included.
(3) Subjective : Subjective errors refer to an expression that lacks universal validity and is often influenced by personal beliefs, opinions, or biases.
(4) Unverifiable : These are statements where the LM output contains a fact, but none of the retrieved evidence from the web can directly support or contradict it (e.g., personal matters).

While Entity and Relation errors are often phrase-level and can be fixed by minimally editing the erroneous phrases, the other error types may span an entire sentence or part of one and should be removed from a response to make it precisely factual (see the sketch below).
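To make the taxonomy concrete, the sketch below encodes the six error types and records, for each, whether it contradicts world knowledge (Type 1) and whether it is typically repairable by a phrase-level replacement rather than removal. The class and field names are our own illustrative choices.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ErrorType:
    code: str                           # taxonomy label, e.g. "1a"
    name: str
    contradicts_world_knowledge: bool   # Type 1 vs. Types 2-4
    phrase_level_fix: bool              # fixable by replacing a phrase vs. removing the span

HALLUCINATION_TAXONOMY = [
    ErrorType("1a", "Entity",       True,  True),
    ErrorType("1b", "Relation",     True,  True),
    ErrorType("1c", "Sentence",     True,  False),
    ErrorType("2",  "Invented",     False, False),
    ErrorType("3",  "Subjective",   False, False),
    ErrorType("4",  "Unverifiable", False, False),
]
```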
Task 1: Fine-grained hallucination detection. Given an LM output, the system identifies each error span and assigns it one of the types defined above. Fine-grained detection can also be simplified into a binary classification task, in which the system predicts whether a sentence s_i includes any factual errors.
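A minimal sketch of this simplification, assuming fine-grained error spans have already been detected for each sentence s_i:

```python
def binary_labels(error_spans_per_sentence):
    """Collapse fine-grained detection into sentence-level binary labels.

    error_spans_per_sentence: element i holds the error spans detected in
    sentence s_i (an empty list if no error was found).
    """
    return [len(spans) > 0 for spans in error_spans_per_sentence]

# Example: three sentences; only the second contains an error span.
print(binary_labels([[], [("entity", "Brazil")], []]))  # [False, True, False]
```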
Task 2: Hallucination editing. Some hallucinations can be fixed with minimal span-level edits, while others need to be removed or marked as unverifiable. In the editing task, the system is expected to suggest fine-grained edits to improve the factuality of the LM output.
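A minimal sketch of applying such edits, assuming each detected span comes with an error type and, for phrase-level types, a suggested replacement; the function name and edit format are illustrative, not FAVA's actual output syntax.

```python
PHRASE_LEVEL_TYPES = {"entity", "relation"}  # Types 1a and 1b

def apply_edits(text: str, edits):
    """Apply fine-grained edits to an LM output.

    edits: (error_type, erroneous_span, replacement) triples. Phrase-level
    errors are fixed by substitution; other error spans are removed entirely.
    """
    for error_type, span, replacement in edits:
        if error_type in PHRASE_LEVEL_TYPES and replacement is not None:
            text = text.replace(span, replacement)
        else:
            text = text.replace(span, "")
    return " ".join(text.split())  # normalize whitespace left by deletions

edited = apply_edits(
    "Messi was born in Brazil. He wrote The Messi Diaries.",
    [("entity", "Brazil", "Argentina"),
     ("invented", "He wrote The Messi Diaries.", None)],
)
print(edited)  # Messi was born in Argentina.
```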