Content provided by Daniel Filan. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Daniel Filan or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://staging.podcastplayer.com/legal.

In this episode, I talk with David Lindner about Myopic Optimization with Non-myopic Approval, or MONA, which attempts to address (multi-step) reward hacking by myopically optimizing actions against a human's sense of whether those actions are generally good. Does this work? Can we get smarter-than-human AI this way? How does this compare to approaches like conservatism? Listen to find out.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/06/15/episode-43-david-lindner-mona.html

Topics we discuss, and timestamps:

0:00:29 What MONA is

0:06:33 How MONA deals with reward hacking

0:23:15 Failure cases for MONA

0:36:25 MONA's capability

0:55:40 MONA vs other approaches

1:05:03 Follow-up work

1:10:17 Other MONA test cases

1:33:47 When increasing time horizon doesn't increase capability

1:39:04 Following David's research

Links for David:

Website: https://www.davidlindner.me

Twitter / X: https://x.com/davlindner

DeepMind Medium: https://deepmindsafetyresearch.medium.com

David on the Alignment Forum: https://www.alignmentforum.org/users/david-lindner

Research we discuss:

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking: https://arxiv.org/abs/2501.13011

Arguments Against Myopic Training: https://www.alignmentforum.org/posts/GqxuDtZvfgL2bEQ5v/arguments-against-myopic-training

Episode art by Hamish Doodles: hamishdoodles.com

