Feedback Loops With Language Models Drive In-Context Reward Hacking
The article presents a study on how feedback loops in language models can cause in-context reward hacking (ICRH). This occurs when a language model optimizes an objective but creates negative side effects in the process.