42 - Owain Evans on LLM Psychology
Earlier this year, the paper "Emergent Misalignment" made the rounds on AI x-risk social media for seemingly showing that LLMs fine-tuned on a narrow 'misaligned' task, writing insecure code, generalize to acting comically evil in response to innocuous questions. In this episode, I chat with one of the authors of that paper, Owain Evans, about that research, as well as other work he's done to understand the psychology of large language models.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/06/06/episode-42-owain-evans-llm-psychology.html
Topics we discuss, and timestamps:
0:00:37 Why introspection?
0:06:24 Experiments in "Looking Inward"
0:15:11 Why fine-tune for introspection?
0:22:32 Does "Looking Inward" test introspection, or something else?
0:34:14 Interpreting the results of "Looking Inward"
0:44:56 Limitations to introspection?
0:49:54 "Tell me about yourself", and its relation to other papers
1:05:45 Backdoor results
1:12:01 Emergent Misalignment
1:22:13 Why so hammy, and so infrequently evil?
1:36:31 Why emergent misalignment?
1:46:45 Emergent misalignment and other types of misalignment
1:53:57 Is emergent misalignment good news?
2:00:01 Follow-up work to "Emergent Misalignment"
2:03:10 Reception of "Emergent Misalignment" vs other papers
2:07:43 Evil numbers
2:12:20 Following Owain's research
Links for Owain:
Truthful AI: https://www.truthfulai.org
Owain's website: https://owainevans.github.io/
Owain's twitter/X account: https://twitter.com/OwainEvans_UK
Research we discuss:
Looking Inward: Language Models Can Learn About Themselves by Introspection: https://arxiv.org/abs/2410.13787
Tell me about yourself: LLMs are aware of their learned behaviors: https://arxiv.org/abs/2501.11120
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data: https://arxiv.org/abs/2406.14546
Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs: https://arxiv.org/abs/2502.17424
X/Twitter thread of GPT-4.1 emergent misalignment results: https://x.com/OwainEvans_UK/status/1912701650051190852
Taken out of context: On measuring situational awareness in LLMs: https://arxiv.org/abs/2309.00667
Episode art by Hamish Doodles: hamishdoodles.com