The study examines Large Language Models (LLMs) and their ability to handle complex reasoning tasks, focusing on the reliability and faithfulness of the rationales they generate. To address these concerns, the authors propose a learning framework that models reasoning as a planning-based process. The framework applies direct preference optimization (DPO) to collected reasoning trajectories, which are ranked according to synthesized process rewards. The results indicate that the resulting 7B model outperforms strong counterparts such as GPT-3.5-Turbo on challenging logical reasoning benchmarks.
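
As a rough illustration of the optimization step described above, the sketch below shows how reward-ranked trajectories might be turned into (chosen, rejected) preference pairs and scored with the standard DPO objective. It assumes trajectory-level log-probabilities have already been computed under the policy and a frozen reference model; the function names, the pairing heuristic, and the per-step reward format are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over a batch of preference pairs.

    Each *_logps tensor holds the summed log-probability of a full
    reasoning trajectory under the policy or the frozen reference model.
    """
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()


def rank_and_pair(trajectories, process_rewards):
    """Build (chosen, rejected) pairs from reward-ranked trajectories.

    `process_rewards[i]` is assumed to be a list of synthesized per-step
    rewards for trajectory i; trajectories are ranked by their total reward
    and each higher-ranked trajectory is paired with the lowest-ranked one.
    """
    scored = sorted(zip(trajectories, process_rewards),
                    key=lambda tr: sum(tr[1]), reverse=True)
    ranked = [traj for traj, _ in scored]
    return [(ranked[i], ranked[-1]) for i in range(len(ranked) - 1)]
```

The best-vs-worst pairing shown here is only one plausible way to form preference pairs from a ranking; the paper's actual pair construction may differ.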

Publication date: 1 Feb 2024
Project Page: SparkJiao/rl-trajectory-reasoning
Paper: https://arxiv.org/pdf/2402.00658