Exploiting Explainability to Design Adversarial Attacks and Evaluate Attack Resilience in Hate-Speech Detection Models

Paper · arXiv 2305.18585 · Published May 29, 2023
Keywords: Sentiment · Semantics · Toxicity Detection

“Consider these two sentences:

(1) Voldemort is a bad person

(2) Voldemort is not a good person.

Sentence (1) and Sentence (2) convey a similar meaning, but the latter confuses hate-speech models because fully understanding the sentence relies not just on the adjectives “good” or “bad” but also on the surrounding negation. This example showcases how adversarial AI can seamlessly introduce noise by adding the word “not” before a positive adjective, successfully hiding a negative sentiment within a post.”
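The perturbation described above can be sketched as a simple rule-based rewrite: swap a negative adjective for "not" plus a positive antonym, so the surface sentiment words flip while the sentence's meaning is preserved. This is a minimal illustrative sketch, not the paper's implementation; the `ANTONYMS` table is a hypothetical stand-in for a real lexicon such as WordNet.

```python
# Minimal sketch of a negation-based adversarial perturbation.
# Rewrites "X is (a) <neg-adj> ..." as "X is not (a) <pos-adj> ...".

# Hypothetical antonym table for illustration; a real attack would
# draw antonyms from a lexicon such as WordNet.
ANTONYMS = {"bad": "good", "ugly": "beautiful", "stupid": "smart"}
ARTICLES = {"a", "an", "the"}

def negation_perturb(sentence: str) -> str:
    """Replace the first negative adjective with 'not <antonym>',
    placing 'not' before a preceding article when one is present."""
    words = sentence.split()
    out = list(words)
    for i, w in enumerate(words):
        stripped = w.rstrip(".,!?")
        suffix = w[len(stripped):]
        if stripped.lower() in ANTONYMS:
            out[i] = ANTONYMS[stripped.lower()] + suffix
            # Insert "not" before an article ("a bad" -> "not a good"),
            # otherwise directly before the adjective itself.
            j = i - 1 if i > 0 and words[i - 1].lower() in ARTICLES else i
            out[j] = "not " + out[j]
            break  # perturb only the first match, for simplicity
    return " ".join(out)

# Reproduces the paper's example:
# "Voldemort is a bad person" -> "Voldemort is not a good person"
```

A classifier that keys mostly on sentiment-bearing adjectives sees the positive word “good” in the perturbed sentence and can miss the negation that reverses it.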