This paper by Abhilash Mishra investigates the challenges of aligning AI systems with human intentions and values through Reinforcement Learning with Human Feedback (RLHF). It examines the question of whose values AI systems should reflect and the limitations of RLHF as a mechanism for aggregating human preferences. Drawing on impossibility results from social choice theory, the paper argues that no AI system can be universally aligned with everyone's values without violating some individuals' private ethical preferences. It then draws out the implications for AI governance, suggesting the need for transparent voting rules for aggregating feedback and for AI systems narrowly aligned to specific user groups.
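
To make the aggregation obstacle concrete, here is a minimal Python sketch (an illustrative toy, not code from the paper): three hypothetical annotators hold cyclic preferences over three candidate responses A, B, and C, so no collective ranking agrees with all pairwise majorities, and any single reward signal fit to this feedback must override at least one annotator's preference. The annotator rankings and function names are assumptions made purely for illustration.

```python
from itertools import combinations, permutations

# Hypothetical annotator rankings over candidate responses A, B, C (best to worst).
annotators = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y, rankings):
    """Return True if a strict majority of annotators ranks x above y."""
    wins = sum(r.index(x) < r.index(y) for r in rankings)
    return wins > len(rankings) / 2

# Report the pairwise majority outcomes.
print("Pairwise majorities:")
for x, y in combinations("ABC", 2):
    winner, loser = (x, y) if majority_prefers(x, y, annotators) else (y, x)
    print(f"  majority prefers {winner} over {loser}")

# Check every candidate collective ranking against all pairwise majorities;
# with a preference cycle (A beats B, B beats C, C beats A), none is consistent.
consistent = [
    order
    for order in permutations("ABC")
    if all(
        majority_prefers(order[i], order[j], annotators)
        for i, j in combinations(range(3), 2)
    )
]
print("Collective rankings consistent with all majorities:", consistent or "none")
```

Running this prints the cyclic majorities and "none" for consistent rankings, a standard Condorcet-style example of the kind of aggregation failure that underlies the paper's impossibility argument.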

Publication date: 26 Oct 2023
Project Page: https://arxiv.org/abs/2310.16048v1
Paper: https://arxiv.org/pdf/2310.16048