🔴 **Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation**
“_While we show that integrating CoT monitors into the reinforcement learning reward can indeed produce more capable and more aligned agents in the low optimization regime, we find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the CoT while still exhibiting a significant rate of reward hacking._”
Baker, B. et al. (2025) Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. https://arxiv.org/abs/2503.11926.
#AI #ArtificialIntelligence #LLM #LLMS #ComputerScience #Obfuscation #Preprint #Academia #Academics @ai @computerscience