Content provided by Daniel Filan. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Daniel Filan or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://staging.podcastplayer.com/legal.

42 - Owain Evans on LLM Psychology

2:14:26
 

Earlier this year, the paper "Emergent Misalignment" made the rounds on AI x-risk social media for seemingly showing LLMs generalizing from 'misaligned' training data of insecure code to acting comically evil in response to innocuous questions. In this episode, I chat with one of the authors of that paper, Owain Evans, about that research as well as other work he's done to understand the psychology of large language models.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/06/06/episode-42-owain-evans-llm-psychology.html

Topics we discuss, and timestamps:

0:00:37 Why introspection?

0:06:24 Experiments in "Looking Inward"

0:15:11 Why fine-tune for introspection?

0:22:32 Does "Looking Inward" test introspection, or something else?

0:34:14 Interpreting the results of "Looking Inward"

0:44:56 Limitations to introspection?

0:49:54 "Tell me about yourself", and its relation to other papers

1:05:45 Backdoor results

1:12:01 Emergent Misalignment

1:22:13 Why so hammy, and so infrequently evil?

1:36:31 Why emergent misalignment?

1:46:45 Emergent misalignment and other types of misalignment

1:53:57 Is emergent misalignment good news?

2:00:01 Follow-up work to "Emergent Misalignment"

2:03:10 Reception of "Emergent Misalignment" vs other papers

2:07:43 Evil numbers

2:12:20 Following Owain's research

Links for Owain:

Truthful AI: https://www.truthfulai.org

Owain's website: https://owainevans.github.io/

Owain's Twitter/X account: https://twitter.com/OwainEvans_UK

Research we discuss:

Looking Inward: Language Models Can Learn About Themselves by Introspection: https://arxiv.org/abs/2410.13787

Tell me about yourself: LLMs are aware of their learned behaviors: https://arxiv.org/abs/2501.11120

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data: https://arxiv.org/abs/2406.14546

Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs: https://arxiv.org/abs/2502.17424

X/Twitter thread of GPT-4.1 emergent misalignment results: https://x.com/OwainEvans_UK/status/1912701650051190852

Taken out of context: On measuring situational awareness in LLMs: https://arxiv.org/abs/2309.00667

Episode art by Hamish Doodles: hamishdoodles.com


56 episodes

