This research paper presents R3, a novel method for training large language models (LLMs) on complex reasoning tasks. R3 uses reverse curriculum reinforcement learning (RL), which delivers the benefits of process supervision while relying only on outcome supervision. Unlike process-supervision approaches, which require extensive manual annotation of intermediate reasoning steps, R3 learns from correct demonstrations and progressively slides the start state of reasoning from the end of a demonstration toward its beginning, so that sparse outcome supervision can pinpoint errors precisely. The study shows that R3 outperforms the RL baseline on eight reasoning tasks and performs comparably to larger models without any extra data.
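The core idea, starting rollouts from progressively earlier points of a correct demonstration, can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the authors' implementation:

```python
def reverse_curriculum_stages(question, demo_steps):
    """Yield training prompts whose start state slides backward
    through a correct demonstration (illustrative sketch only)."""
    stages = []
    # Stage k gives the model all but the last k reasoning steps,
    # so early stages are nearly solved and the outcome reward is
    # informative; later stages require reasoning from scratch.
    for k in range(1, len(demo_steps) + 1):
        prefix = demo_steps[: len(demo_steps) - k]
        stages.append(question + "".join(prefix))
    return stages

# Toy demonstration split into reasoning steps (made-up example).
demo = ["Step 1: 3 + 4 = 7. ", "Step 2: 7 * 2 = 14. ", "Answer: 14."]
stages = reverse_curriculum_stages("Q: compute (3+4)*2. ", demo)
# The first stage asks the model only for the final answer;
# the last stage starts from the bare question.
```

At each stage, the policy would generate a completion from the given prefix and receive the usual outcome reward, which is what lets a sparse signal localize errors to the newly generated portion.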

Publication date: 9 Feb 2024
Project Page: https://github.com/WooooDyy/LLM-Reverse-Curriculum-RL
Paper: https://arxiv.org/pdf/2402.05808