The study focuses on Large Language Models (LLMs) and their ability to handle complex reasoning tasks, highlighting concerns about the reliability and faithfulness of the rationales these models generate. To address these concerns, the authors propose a learning framework that frames reasoning as a planning process. The framework applies direct preference optimization (DPO) to collected reasoning trajectories, which are ranked according to synthesized process rewards. The results indicate that their 7B model outperforms strong counterparts such as GPT-3.5-Turbo on challenging logical reasoning benchmarks.
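To make the training signal more concrete, the sketch below illustrates (it is not the authors' code) how sampled reasoning trajectories ranked by a synthesized process reward could be turned into preference pairs and scored with the standard DPO objective; the helper names and the toy log-probabilities are hypothetical assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def build_preference_pair(trajectories, rewards):
    """Rank sampled trajectories for one prompt by their synthesized
    process reward and pair the highest-scoring against the lowest."""
    order = sorted(range(len(trajectories)), key=lambda i: rewards[i], reverse=True)
    return trajectories[order[0]], trajectories[order[-1]]

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: increase the policy's preference margin
    for the chosen trajectory relative to a frozen reference model."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up sequence log-probabilities for a batch of 4 pairs.
policy_chosen = torch.tensor([-12.3, -10.1, -15.0, -9.8])
policy_rejected = torch.tensor([-14.0, -13.2, -15.5, -12.0])
ref_chosen = torch.tensor([-13.0, -11.0, -15.2, -10.5])
ref_rejected = torch.tensor([-13.5, -12.8, -15.1, -11.7])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```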
Publication date: 1 Feb 2024
Project Page: SparkJiao/rl-trajectory-reasoning
Paper: https://arxiv.org/pdf/2402.00658