**Bibliolater 📚 📜 🖋** @bibliolater@qoto.org · 2025-03-24T19:54:40Z

🔴 **Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation**

“_While we show that integrating CoT monitors into the reinforcement learning reward can indeed produce more capable and more aligned agents in the low optimization regime, we find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the CoT while still exhibiting a significant rate of reward hacking._”

Baker, B. et al. (2025) Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. https://arxiv.org/abs/2503.11926.

#AI #ArtificialIntelligence #LLM #LLMS #ComputerScience #Obfuscation #Preprint #Academia #Academics @ai @computerscience
