LLM-based Rewriting of Inappropriate Argumentation using Reinforcement Learning from Machine Feedback

Paper · arXiv 2406.03363 · Published June 5, 2024
Argumentation · Sentiment Semantics · Toxic Detections · Natural Language Inference · Social Media

While a few methods improve the content of texts, most solely transfer their style to be more formal (Rao and Tetreault, 2018; Lai et al., 2021), less subjective (Pryzant et al., 2020; Liu et al., 2021a), or less toxic (Laugier et al., 2021; Logacheva et al., 2022), or they target the quality of arguments in general (Skitalinskaya et al., 2023). Such approaches commonly preserve the original content and operate on single sentences. However, if the inappropriate behavior is rooted in the content itself and not only in the style of the text, content modifications at the document level may be necessary. In addition, most existing approaches rely on parallel data, which is unavailable for rewriting inappropriate arguments.

Instead, we propose an LLM-based rewriting approach to mitigating inappropriateness, inspired by reinforcement learning from human feedback (RLHF; Christiano et al., 2017).
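
The excerpt does not include a reference implementation, so the loop below is only a rough sketch of how such RL-based rewriting could be set up with the TRL library's (pre-1.0) PPOTrainer. The base checkpoint, hyperparameters, toy prompts, and the `reward_fn` stub are all illustrative assumptions, not the paper's actual setup; `reward_fn` stands in for the classifier-based machine feedback sketched after the next paragraph.

```python
# Minimal sketch of RLHF-style fine-tuning with machine feedback, using the
# TRL library's (pre-1.0) PPOTrainer API. Checkpoint, hyperparameters, and
# toy data are illustrative assumptions, not the paper's actual setup.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # placeholder base LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # KL anchor

config = PPOConfig(batch_size=2, mini_batch_size=1, learning_rate=1.41e-5)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

def reward_fn(original: str, rewritten: str) -> float:
    """Machine feedback; see the classifier-based sketch further below."""
    return 0.0  # placeholder

batch = [  # toy inappropriate source arguments
    "Rewrite appropriately: Your argument is complete nonsense.",
    "Rewrite appropriately: Only an idiot would believe that claim.",
]
queries = [tokenizer(text, return_tensors="pt").input_ids.squeeze(0) for text in batch]
responses = [
    ppo_trainer.generate(q, return_prompt=False, do_sample=True,
                         max_new_tokens=64,
                         pad_token_id=tokenizer.eos_token_id).squeeze(0)
    for q in queries
]
rewrites = [tokenizer.decode(r, skip_special_tokens=True) for r in responses]
rewards = [torch.tensor(reward_fn(o, r)) for o, r in zip(batch, rewrites)]

# One PPO update: raise the likelihood of rewrites that received high reward,
# while the frozen reference model keeps the policy close to its start point.
stats = ppo_trainer.step(queries, responses, rewards)
```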

We deem the desired properties for rewriting inappropriate arguments to be semantic similarity to the original argument and appropriateness of the generated argument. To learn how to generate texts that fulfill both, we make use of existing classifiers as sources of machine feedback (Zhang et al., 2020; Ziegenbein et al., 2023).
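
A minimal sketch of such machine feedback follows, assuming BERTScore (Zhang et al., 2020) as the semantic similarity measure and a fine-tuned appropriateness classifier in the spirit of Ziegenbein et al. (2023). The classifier checkpoint path and its `appropriate` label are hypothetical, and multiplying the two scores is an illustrative way to combine them, not necessarily the paper's exact reward.

```python
# Sketch of a classifier-based reward: semantic similarity x appropriateness.
# The appropriateness checkpoint and its "appropriate" label are hypothetical;
# multiplying the two scores is one illustrative combination.
from bert_score import score as bert_score
from transformers import pipeline

appropriateness = pipeline(
    "text-classification",
    model="path/to/appropriateness-classifier",  # hypothetical checkpoint
)

def reward_fn(original: str, rewritten: str) -> float:
    # Semantic similarity to the original argument (BERTScore F1, roughly in [0, 1]).
    _, _, f1 = bert_score([rewritten], [original], lang="en", verbose=False)
    similarity = f1.item()

    # Estimated probability that the rewritten argument is appropriate.
    pred = appropriateness(rewritten, truncation=True)[0]
    p_appropriate = pred["score"] if pred["label"] == "appropriate" else 1.0 - pred["score"]

    # Single scalar that rewards both desired properties at once.
    return similarity * p_appropriate
```

Plugged into a PPO loop like the one sketched above, this scalar rewards rewrites that stay semantically close to the original argument while becoming more appropriate.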