43 - David Lindner on Myopic Optimization with Non-myopic Approval
In this episode, I talk with David Lindner about Myopic Optimization with Non-myopic Approval, or MONA, which attempts to address (multi-step) reward hacking by myopically optimizing actions against a human's sense of whether those actions are generally good. Does this work? Can we get smarter-than-human AI this way? How does this compare to approaches like conservatism? Listen to find out.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/06/15/episode-43-david-lindner-mona.html
Topics we discuss, and timestamps:
0:00:29 What MONA is
0:06:33 How MONA deals with reward hacking
0:23:15 Failure cases for MONA
0:36:25 MONA's capability
0:55:40 MONA vs other approaches
1:05:03 Follow-up work
1:10:17 Other MONA test cases
1:33:47 When increasing time horizon doesn't increase capability
1:39:04 Following David's research
Links for David:
Website: https://www.davidlindner.me
Twitter / X: https://x.com/davlindner
DeepMind Medium: https://deepmindsafetyresearch.medium.com
David on the Alignment Forum: https://www.alignmentforum.org/users/david-lindner
Research we discuss:
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking: https://arxiv.org/abs/2501.13011
Arguments Against Myopic Training: https://www.alignmentforum.org/posts/GqxuDtZvfgL2bEQ5v/arguments-against-myopic-training
Episode art by Hamish Doodles: hamishdoodles.com