The paper discusses the best-of-n policy used for aligning generative models: draw n samples from the base policy and return the one with the highest reward. It disproves a common claim that the KL divergence between the best-of-n policy and the base policy is equal to log(n) - (n - 1)/n, showing that this expression is actually an upper bound. The paper also presents a new estimator for the KL divergence and demonstrates its effectiveness through a series of examples. The best-of-n policy continues to be a popular method for alignment, even as more complex methods are developed.
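
The upper-bound claim above can be checked numerically on a toy case. The following is a minimal sketch, not the paper's code or its proposed estimator: it assumes a finite outcome space whose outcomes are listed in increasing order of reward with no ties, and the function names (`best_of_n_policy`, `kl_divergence`, `analytic_formula`) are hypothetical. Under those assumptions, the best-of-n probability of an outcome is F^n minus F_prev^n, where F is the base policy's cumulative probability up to and including that outcome and F_prev is the cumulative probability just before it.

```python
import math


def best_of_n_policy(base_probs, n):
    """Exact best-of-n distribution over a finite outcome space.

    Assumption: base_probs[i] is the base-policy probability of outcome i,
    and outcomes are sorted by strictly increasing reward (no ties).
    """
    policy = []
    cdf = 0.0
    for p in base_probs:
        prev = cdf
        cdf += p
        # P(max of n draws is this outcome) = F^n - F_prev^n
        policy.append(cdf ** n - prev ** n)
    return policy


def kl_divergence(q, p):
    """KL(q || p) in nats for two distributions on the same finite support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def analytic_formula(n):
    """The commonly quoted expression log(n) - (n - 1)/n."""
    return math.log(n) - (n - 1) / n


if __name__ == "__main__":
    n = 4
    for size in (2, 10, 1000):
        base = [1.0 / size] * size  # uniform base policy, chosen for illustration
        exact = kl_divergence(best_of_n_policy(base, n), base)
        print(f"outcomes={size:5d}  exact KL={exact:.4f}  formula={analytic_formula(n):.4f}")
```

In this toy setting the exact KL stays below the formula when the base policy puts large mass on individual outcomes, and it approaches the formula as the outcome space grows and each outcome's probability shrinks.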

 

Publication date: 3 Jan 2024
Project Page: https://arxiv.org/abs/2401.01879
Paper: https://arxiv.org/pdf/2401.01879