Exploiting Explainability to Design Adversarial Attacks and Evaluate Attack Resilience in Hate-Speech Detection Models
“Consider these two sentences:
(1) Voldemort is a bad person
(2) Voldemort is not a good person.
Sentence (1) and Sentence (2) convey a similar meaning, but the latter confuses hate-speech models because fully understanding the sentence requires more than the adjectives "good" or "bad" in isolation: the negation inverts the adjective's polarity. This example showcases how adversarial AI can seamlessly introduce noise by inserting the word "not" before a positive adjective, successfully hiding a negative sentiment within a post.”
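The negation-based perturbation described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual attack pipeline: the `ANTONYMS` map and the `negation_attack` function are hypothetical toy constructs, assuming a simple word-level rewrite of each negative adjective into "not" followed by a positive antonym.

```python
# Toy sketch of a negation-based adversarial rewrite (illustrative only).
# ANTONYMS is a hypothetical, hand-picked map; a real attack would need a
# broader lexicon and grammatical handling.
ANTONYMS = {"bad": "good", "terrible": "great", "awful": "nice"}
ARTICLES = {"a", "an", "the"}

def negation_attack(sentence: str) -> str:
    """Rewrite each negative adjective as 'not <positive antonym>'."""
    out = []
    for word in sentence.split():
        # Strip trailing punctuation so "bad." still matches the map.
        core = word.rstrip(".,!?")
        suffix = word[len(core):]
        if core.lower() in ANTONYMS:
            # Place "not" before a preceding article so the result stays
            # grammatical: "a bad person" -> "not a good person".
            if out and out[-1].lower() in ARTICLES:
                article = out.pop()
                out.extend(["not", article])
            else:
                out.append("not")
            out.append(ANTONYMS[core.lower()] + suffix)
        else:
            out.append(word)
    return " ".join(out)

print(negation_attack("Voldemort is a bad person"))
# → Voldemort is not a good person
```

The perturbed sentence keeps its human-readable meaning while its surface form no longer contains any overtly negative adjective, which is exactly the property that can mislead a detector keyed on word-level sentiment cues.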