Robust Hate Speech Detection via Mitigating Spurious Correlations

Published in AACL-IJCNLP, 2023

Abstract

We develop a novel robust hate speech detection model that can defend against both word- and character-level adversarial attacks. We identify that the essential factor making vanilla detection models vulnerable to adversarial attacks is the spurious correlation between certain target words in the text and the prediction label. To mitigate this spurious correlation, we describe the hate speech detection process with a causal graph. We then employ causal strength to quantify the spurious correlation and formulate a regularized entropy loss function. We show that our method generalizes the backdoor adjustment technique in causal inference.
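The abstract does not spell out the exact form of the regularized entropy loss, but the idea of penalizing predictions that lean on spuriously correlated target words can be illustrated with a minimal sketch. The sketch below assumes a per-example causal-strength score is already available and uses an illustrative penalty that discourages overconfident predictions on examples with high spurious correlation; the function name, weighting scheme, and hyperparameter are assumptions for illustration, not the paper's exact formulation.

```python
# A minimal, hypothetical sketch of a causal-strength-regularized entropy loss.
# The penalty design and the name `regularized_entropy_loss` are illustrative
# assumptions; they are not taken from the paper.
import torch
import torch.nn.functional as F


def regularized_entropy_loss(logits, labels, causal_strength, lam=0.1):
    """Cross-entropy plus a spurious-correlation penalty.

    logits:          (batch, num_classes) model outputs
    labels:          (batch,) gold labels
    causal_strength: (batch,) assumed per-example score of how strongly the
                     target words spuriously correlate with the label
    lam:             regularization weight (assumed hyperparameter)
    """
    ce = F.cross_entropy(logits, labels)

    # Predictive entropy of each example's output distribution.
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

    # Minimizing this term pushes the model toward less confident
    # (higher-entropy) predictions on examples whose target words
    # correlate spuriously with the label.
    penalty = -(causal_strength * entropy).mean()

    return ce + lam * penalty


# Toy usage with random inputs.
logits = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
causal_strength = torch.tensor([0.8, 0.1, 0.5, 0.0])
loss = regularized_entropy_loss(logits, labels, causal_strength)
```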