This research explores the potential of large language models (LLMs) for detecting hate speech on social media platforms. The study evaluates LLMs such as GPT-3.5 and text-davinci on the HateXplain, implicit hate, and ToxicSpans datasets. A distinctive aspect of this work is that it incorporates context information, specifically data about the targeted (victim) community, into the detection pipeline. The results show that adding target information improves model performance by roughly 20-30%, and supplying rationales/explanations as additional input also yields significant gains. The study further identifies error cases where these models fail to classify correctly or to explain their decisions, suggesting the need for safeguards before such models are deployed in industrial-scale pipelines.
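To make the idea of target-aware prompting concrete, here is a minimal sketch of how a post could be classified with and without victim-community information appended to the prompt. This is an illustrative assumption, not the paper's actual prompts or pipeline: the prompt wording, label set, model name, and the `classify` helper are all hypothetical.

```python
# Minimal sketch (assumed prompt wording, not the paper's exact setup):
# classify a post with and without target-community information.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

LABELS = ["hatespeech", "offensive", "normal"]  # HateXplain-style label set

def classify(post: str, target: str | None = None) -> str:
    """Ask the model for a single label, optionally adding the victim community."""
    prompt = (
        f"Classify the following post as one of {LABELS}.\n"
        f"Post: {post}\n"
    )
    if target is not None:
        # Variant with target/victim-community information in the prompt.
        prompt += f"Target community mentioned in the post: {target}\n"
    prompt += "Answer with only the label."

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Compare the vanilla prompt with the target-augmented one.
post = "example post text"
print(classify(post))                   # no context information
print(classify(post, target="women"))   # with victim-community information
```

Comparing the two calls on a labeled dataset is one simple way to measure the kind of performance difference the study reports when target information is added.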
Publication date: 20 Oct 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2310.12860