The paper addresses the risk of conversational AI systems generating unsafe, toxic, or unethical content. Prior attempts to train these systems on adversarial datasets have fallen short, as the models often fail to recognize subtly unsafe situations in casual conversation. To address this, the researchers propose a two-step fine-tuning process built on a socially aware n-pair contrastive loss, which instills prosocial behavior in the model using datasets such as the Moral Integrity Corpus (MIC) and PROSOCIAL DIALOG. The results show that this method is effective at generating socially appropriate responses.
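The summary does not spell out the exact formulation of the socially aware n-pair contrastive loss, so the following is only a minimal sketch of a standard n-pair contrastive objective, under the assumption that embeddings of safe or prosocial responses act as positives and embeddings of unsafe responses act as negatives for a given dialogue context. The function name, tensor shapes, and temperature parameter are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def n_pair_contrastive_loss(anchor, positive, negatives, temperature=1.0):
    """Sketch of an n-pair contrastive loss (Sohn, 2016 style).

    anchor:    (B, D) embeddings of the dialogue context
    positive:  (B, D) embeddings of a safe / prosocial response
    negatives: (B, N, D) embeddings of unsafe or norm-violating responses
    """
    # Similarity between context and the prosocial response: (B, 1)
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1).unsqueeze(1)
    # Similarity between context and each unsafe response: (B, N)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1)
    # Softmax over [positive, negatives]; the positive is treated as class 0,
    # so minimizing cross-entropy pulls the prosocial response toward the
    # context and pushes the unsafe responses away.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```

In this reading, such a loss would be applied alongside (or after) standard fine-tuning on the MIC and PROSOCIAL DIALOG data, but the precise staging of the two fine-tuning steps is described in the paper itself.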

Publication date: 1 Feb 2024
Project Page: https://github.com/souvikdgp16/contrastive_dialog_safety
Paper: https://arxiv.org/pdf/2402.00446