41 - Lee Sharkey on Attribution-based Parameter Decomposition
What's the next step forward in interpretability? In this episode, I chat with Lee Sharkey about his proposal for detecting computational mechanisms within neural networks: Attribution-based Parameter Decomposition, or APD for short.
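As a rough illustration of the ideas discussed in the episode (not the authors' code): APD decomposes a network's parameters into a sum of "parameter components" trained against three losses, named in the timestamps below as faithfulness (components sum back to the original parameters), minimality (few components are needed on any given input), and simplicity (each component is itself simple, e.g. low-rank). The sketch below is a hypothetical toy rendering of those three losses; the component tensor `P`, the top-k attribution penalty, and the nuclear-norm proxy are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))            # original parameters (toy example)
P = rng.normal(size=(3, 4, 4)) * 0.1   # 3 hypothetical parameter components

def faithfulness_loss(W, P):
    # Faithfulness: the components should sum back to the original parameters.
    return float(np.sum((W - P.sum(axis=0)) ** 2))

def minimality_loss(attributions, top_k=1):
    # Minimality: only the top-k most attributed components should matter on a
    # given input; here we penalize attribution mass outside the top-k.
    order = np.sort(np.abs(attributions))[::-1]
    return float(order[top_k:].sum())

def simplicity_loss(P):
    # Simplicity: each component should be simple, e.g. low-rank; the nuclear
    # norm (sum of singular values) is one illustrative proxy for that.
    return float(sum(np.linalg.svd(Pc, compute_uv=False).sum() for Pc in P))
```

A decomposition scores well when all three losses are low at once, which is the tension the episode spends much of its time on.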
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/06/03/episode-41-lee-sharkey-attribution-based-parameter-decomposition.html
Topics we discuss, and timestamps:
0:00:41 APD basics
0:07:57 Faithfulness
0:11:10 Minimality
0:28:44 Simplicity
0:34:50 Concrete-ish examples of APD
0:52:00 Which parts of APD are canonical
0:58:10 Hyperparameter selection
1:06:40 APD in toy models of superposition
1:14:40 APD and compressed computation
1:25:43 Mechanisms vs representations
1:34:41 Future applications of APD?
1:44:19 How costly is APD?
1:49:14 More on minimality training
1:51:49 Follow-up work
2:05:24 APD on giant chain-of-thought models?
2:11:27 APD and "features"
2:14:11 Following Lee's work
Lee links (Leenks):
X/Twitter: https://twitter.com/leedsharkey
Alignment Forum: https://www.alignmentforum.org/users/lee_sharkey
Research we discuss:
Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-Based Parameter Decomposition: https://arxiv.org/abs/2501.14926
Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html
Towards a unified and verified understanding of group-operation networks: https://arxiv.org/abs/2410.07476
Feature geometry is outside the superposition hypothesis: https://www.alignmentforum.org/posts/MFBTjb2qf3ziWmzz6/sae-feature-geometry-is-outside-the-superposition-hypothesis
Episode art by Hamish Doodles: hamishdoodles.com