Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
The article explores whether AI systems can learn deceptive behavior and retain it despite safety training. This is demonstrated by training models to write secure code for…