Detoxify Language Model Step-by-Step

Paper · arXiv 2308.08295 · Published August 16, 2023
Tags: Sentiment · Semantics · Toxic Detection

“To maintain the generation quality of LLMs as well as endow models with the detoxification capability, we attempt to resolve the conflict mentioned above by decomposing the detoxification task into ordered sub-steps, i.e., making models first detoxify the prompt and then continue to generate text based on the non-toxic context. Unfortunately, we observe that recent prevalent LLMs struggle to detect the toxic content in the input texts or detoxify the toxic prompt. Thus, we propose a novel training strategy to simultaneously stimulate these detoxification abilities by fine-tuning models with multiple tasks, including toxic detection, toxic span repair, and continual generation. Further, to keep the generation capability of LLMs as well as ensure the order of task execution, we propose Detox-Chain that connect different tasks together in an orderly manner with chain-of-thought (CoT) technique (Wei et al. 2022b). “