The paper discusses the best-of-n policy used for aligning generative models. It disproves the common claim that the KL divergence between the best-of-n policy and the base policy equals log(n) − (n−1)/n, showing that this formula is in fact only an upper bound on the KL divergence. The paper also presents a new estimator for the KL divergence and demonstrates its effectiveness through a series of examples. The best-of-n policy remains a popular alignment method, even as more complex approaches are developed.
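To make the bound concrete, here is a minimal sketch on a toy discrete distribution. For a base distribution with distinct rewards per outcome, the best-of-n policy has the closed form π(i) = F(i)^n − F(i−1)^n, where F is the CDF in reward order, so the exact KL can be computed and compared against log(n) − (n−1)/n. The distribution and n below are illustrative choices, not taken from the paper.

```python
import math

# Toy base distribution over 4 outcomes, listed in order of
# increasing reward (the values are illustrative assumptions).
p = [0.4, 0.3, 0.2, 0.1]
n = 4  # number of i.i.d. samples drawn by best-of-n

# Closed-form best-of-n policy for a discrete base distribution with
# distinct rewards: the max of n draws lands on outcome i iff all n
# draws are <= i (in reward order) but not all <= i-1, i.e.
# pi(i) = F(i)^n - F(i-1)^n.
cdf = []
acc = 0.0
for prob in p:
    acc += prob
    cdf.append(acc)
pi = [cdf[i] ** n - (cdf[i - 1] ** n if i > 0 else 0.0)
      for i in range(len(p))]

# Exact KL(pi || p) versus the analytical upper bound log(n) - (n-1)/n.
kl = sum(q * math.log(q / b) for q, b in zip(pi, p) if q > 0)
bound = math.log(n) - (n - 1) / n
print(f"KL = {kl:.4f}, bound = {bound:.4f}")
```

On this example the exact KL (≈0.59 nats) sits strictly below the bound (≈0.64 nats), consistent with the paper's point that log(n) − (n−1)/n is an upper bound rather than an equality.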
Publication date: 3 Jan 2024
Project Page: https://arxiv.org/abs/2401.01879
Paper: https://arxiv.org/pdf/2401.01879