This research explores the potential of large language models (LLMs) for detecting hate speech on social media platforms. The study evaluates LLMs such as GPT-3.5 and text-davinci on the HateXplain, implicit hate, and ToxicSpans datasets. Its distinguishing feature is that it examines whether supplying contextual information and details about the targeted (victim) community aids detection. The results show that adding target information improves model performance by 20-30%, and adding rationales/explanations also yields significant gains. The study further identifies error cases in which the models fail to classify posts correctly or to explain their decisions, suggesting the need for industry-scale safeguard techniques.
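As an illustration of how target-community information and a rationale request might be folded into a zero-shot classification prompt, here is a minimal sketch; the prompt template and helper function below are hypothetical and are not the paper's exact prompts or model calls.

```python
from typing import Optional


def build_prompt(post: str,
                 target_community: Optional[str] = None,
                 ask_for_rationale: bool = False) -> str:
    """Assemble a zero-shot hate-speech classification prompt.

    Illustrates the idea of enriching the prompt with the targeted
    (victim) community and optionally requesting an explanation;
    this is an assumed template, not the one used in the paper.
    """
    lines = [
        "Classify the following social media post as HATE, OFFENSIVE, or NORMAL.",
        f"Post: {post}",
    ]
    if target_community:
        # Extra context about the targeted community -- the "target
        # information" the study reports as improving performance.
        lines.append(f"Targeted community: {target_community}")
    if ask_for_rationale:
        # Asking the model to justify its label corresponds to the
        # rationale/explanation variant of the prompt.
        lines.append("Explain which phrases in the post justify your label.")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_prompt(
        "example post text",
        target_community="<community name>",
        ask_for_rationale=True,
    )
    print(prompt)  # this string would then be sent to an LLM such as GPT-3.5
```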

 

Publication date: 20 Oct 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2310.12860