Publications
Biases in Toxicity Detection Models
Abstract
Online abuse has become increasingly prevalent in recent years, affecting approximately 40% of US adults according to self-reported data [1]. This rise in harmful interactions, commonly labeled as online “toxicity,” particularly on social media platforms, has raised concerns among researchers and the general public. A primary tool for detecting such toxicity is the Perspective API, developed by Google’s Jigsaw unit [2] and specifically designed to mitigate toxicity and promote healthy online dialogue. Widely used in both academic research and content moderation, the Perspective API has over 1,400 mentions on Google Scholar as of January 2024.
Perspective API employs supervised learning, leveraging a dataset of millions of comments sourced from diverse online platforms spanning over 20 languages, including Wikipedia and The New York Times. It defines “toxic” messages as those containing “rude, disrespectful, or unreasonable language likely to disrupt discussions” [3, 4]. In a 2019 SemEval task, the Perspective API outperformed other transformer-based models in hate speech detection [5]. However, recent studies have noted disparities between its scores and human labels [4]. The toxicity score ranges from 0 to 1 and lacks absolute meaning; in practice, a threshold (typically 0.5-0.7) is set, above which content is deemed toxic [6].
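As an illustration of how such thresholding is commonly applied, the sketch below queries the Perspective API for a TOXICITY score and flags a comment when the score exceeds a chosen cutoff. This is a minimal sketch, not material from the paper: the `API_KEY` placeholder, the 0.7 cutoff, and the use of the `requests` library are illustrative assumptions.

```python
# Minimal sketch (not from the paper): score one comment with the Perspective API
# and apply a fixed decision threshold. Assumes a valid key in API_KEY.
import requests

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder, hypothetical
URL = f"https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze?key={API_KEY}"

def toxicity_score(text: str, lang: str = "en") -> float:
    """Return the Perspective API TOXICITY summary score (between 0 and 1)."""
    payload = {
        "comment": {"text": text},
        "languages": [lang],
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

THRESHOLD = 0.7  # one common choice in the 0.5-0.7 range; no single value is standard

score = toxicity_score("You are a wonderful person.")
print(f"score={score:.3f}, toxic={score >= THRESHOLD}")
```

Because the score has no absolute meaning, the choice of threshold directly shapes which content is labeled toxic, which is one source of the disparities with human labels discussed above.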
- Authors
- Gianluca Nogara, Francesco Pierri, Stefano Cresci, Luca Luceri, Petter Törnberg, Silvia Giordano
- Journal
- 32nd Symposium on Advanced Database Systems