This paper studies how to improve Weak-to-Strong Generalization (W2SG) within the framework of superalignment, which aims to keep highly capable AI systems consistent with human values when they handle complex tasks. The study simulates two phases of superalignment: the development of general superhuman models and the progression towards super-intelligence. It improves the quality of weak supervision through scalable oversight and ensemble learning, and employs an automatic alignment evaluator as the weak supervisor to strengthen the weak teacher models. The approach receives initial validation on the SciQ task.

 

Publication date: 1 Feb 2024
Project Page: https://github.com/ADaM-BJTU/W2SG
Paper: https://arxiv.org/pdf/2402.00667