Manage episode 492185026 series 3662020
AI models pass SWE-Bench from memory Text version: https://pivot-to-ai.com/2025/07/02/how-to-pass-an-ai-coding-benchmark-train-on-the-questions/
Patreon: https://www.patreon.com/davidgerard Ko-Fi: https://ko-fi.com/A1529D5 Buy me nice things: https://www.amazon.co.uk/hz/wishlist/ls/3Q8VZW46J6DM6 Get an extremely cool Pivot to AI shirt or mug: https://pivot-to-ai.redbubble.com
Source:
The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason https://arxiv.org/abs/2506.12286
Previously on Pivot to AI:
OpenAI o3 beats FrontierMath — because OpenAI funded the test and had access to the questions https://pivot-to-ai.com/2025/01/20/openai-o3-beats-frontiermath-because-openai-funded-the-test-and-had-access-to-questions/ AI benchmarks are self-promoting trash — but regulators keep using them https://pivot-to-ai.com/2025/02/25/ai-benchmarks-are-self-promoting-trash-but-regulators-keep-using-them/ Apple: ‘Reasoning’ AIs fail hard if they actually have to think https://pivot-to-ai.com/2025/06/08/apple-reasoning-ais-fail-hard-if-they-actually-have-to-think/ video: https://www.youtube.com/watch?v=gSx9pI5so30&list=UU9rJrMVgcXTfa8xuMnbhAEA
Full Pivot to AI playlist: https://www.youtube.com/playlist?list=UU9rJrMVgcXTfa8xuMnbhAEA
106 episodes