The paper presents KoTox, a dataset of 39K unethical instruction-output pairs intended to refine the training of Large Language Models (LLMs) and improve their ethical awareness. The dataset is designed to help LLMs handle toxic user queries and promote safer, more responsible interactions in Natural Language Processing applications. The authors point to the cost of manual dataset construction and the drawbacks of relying on existing models such as ChatGPT for automatic toxic data generation. Instead, they propose an automated approach that combines lists of derogatory terms, biased expressions, and a diverse set of predicates to generate a wide range of toxic instructions. Their experiments indicate that training on KoTox substantially improves how LLMs respond to toxic queries.
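The combinatorial construction step described above can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration rather than the authors' pipeline: the term lists, predicate templates, and the fixed refusal output are hypothetical placeholders standing in for the much larger lexicons and outputs described in the paper.

```python
from itertools import product

# Minimal sketch of template-based toxic-instruction generation.
# The lists below are hypothetical placeholders; the actual KoTox
# lexicons and predicate templates are not reproduced here.
derogatory_terms = ["<derogatory term A>", "<derogatory term B>"]
biased_expressions = ["<biased expression about group X>"]
predicate_templates = [
    "Explain why {target} deserve less respect.",
    "Write a joke mocking {target}.",
]

def generate_instructions(targets, templates):
    """Fill every predicate template with every target term."""
    return [tpl.format(target=t) for t, tpl in product(targets, templates)]

# Pair each generated instruction with a refusal-style output, mirroring the
# instruction-output structure the dataset is described as having.
REFUSAL = "I can't comply with this request."
instructions = generate_instructions(derogatory_terms + biased_expressions,
                                     predicate_templates)
dataset = [{"instruction": ins, "output": REFUSAL} for ins in instructions]

print(len(dataset))   # 3 targets x 2 templates = 6 pairs
print(dataset[0])
```

The appeal of this kind of generation is that coverage scales multiplicatively with the size of the term and predicate lists, without requiring manual annotation or calls to an external model.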

Publication date: 30 Nov 2023
Paper: https://arxiv.org/pdf/2311.18215