Feedback Loops With Language Models Drive In-Context Reward Hacking
The article presents a study on how feedback loops in language models can cause in-context reward hacking (ICRH). This occurs when a language model optimizes an objective but creates negative side effects in the process.