The article presents a study from IBM Research AI on building efficient models for detecting hate, abuse, and profanity (HAP) in the context of Large Language Models (LLMs). The authors highlight the importance of this work for creating civil and unbiased LLMs. Their approach to HAP detection assigns a binary label (HAP / non-HAP) to every sentence of the input text. They also introduce a small and efficient HAP detection model, IBM-HAP-4L, a 4-layer BERT-like transformer. The research aims to support HAP detectors for multiple languages.
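The sentence-level labeling scheme described above can be sketched as follows. This is a minimal illustration of the interface only: the keyword check stands in for the real IBM-HAP-4L transformer, and the sentence splitter, keyword list, and function names are all hypothetical.

```python
import re

# Hypothetical keyword set standing in for the real classifier's decision;
# the actual model is a 4-layer BERT-like transformer, not a keyword match.
HAP_KEYWORDS = {"hate", "stupid", "idiot"}

def split_sentences(text: str) -> list[str]:
    # Naive sentence splitter for illustration purposes.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def label_sentence(sentence: str) -> str:
    # Placeholder for the model's binary HAP / non-HAP decision.
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    return "HAP" if words & HAP_KEYWORDS else "non-HAP"

def detect_hap(text: str) -> list[tuple[str, str]]:
    # Assign a binary label to every sentence of the input text,
    # mirroring the per-sentence scheme described in the article.
    return [(s, label_sentence(s)) for s in split_sentences(text)]

labels = detect_hap("You are an idiot. Have a nice day.")
# e.g. [("You are an idiot.", "HAP"), ("Have a nice day.", "non-HAP")]
```

In practice the placeholder classifier would be replaced by a call to the trained model; the per-sentence loop structure stays the same.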


Publication date: 8 Feb 2024
Project Page: https://arxiv.org/abs/2402.05624
Paper: https://arxiv.org/pdf/2402.05624