🔴 **Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation**

“_While we show that integrating CoT monitors into the reinforcement learning reward can indeed produce more capable and more aligned agents in the low optimization regime, we find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the CoT while still exhibiting a significant rate of reward hacking._”

Baker, B. et al. (2025) Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arxiv.org/abs/2503.11926.

@ai @computerscience
