Content provided by Pragmatic AI Labs and Noah Gift.

Key Argument

  • Thesis: Using Elo ratings for AI agent evaluation amounts to measuring noise
  • Problem: Wrong evaluators, wrong metrics, wrong assumptions
  • Solution: Quantitative assessment frameworks

The Comparison (00:00-02:00)

Chess Elo

  • FIDE arbiters: 120hr training
  • Binary outcome: win/loss
  • Test-retest: r=0.95
  • Cohen's κ=0.92
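
For context, the standard Elo update that both chess ratings and preference leaderboards rely on can be sketched as below. This is a minimal illustration, not code from the episode; the K-factor of 32 and the function names are assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.0 for a loss, 0.5 for a draw.
    k=32 is an assumed K-factor; real rating systems vary it.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # The update is zero-sum: whatever A gains, B loses.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    print(elo_update(1500, 1500, 1.0))
```

The update only makes sense if `score_a` reflects a well-defined outcome, which is the point of the comparison in this section.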

AI Agent Elo

  • Random users: Google engineer? CS student? 10-year-old?
  • Undefined dimensions: accuracy? style? speed?
  • Test-retest: r=0.31 (coin flip)
  • Cohen's κ=0.42
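
Cohen's κ, the inter-rater statistic quoted in both lists, corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch for two raters follows; the function name and the sample labels are illustrative assumptions, not data from the episode.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical votes: which of two agent outputs each rater preferred.
    print(cohens_kappa(["A", "A", "B", "B"], ["A", "B", "A", "B"]))
```

A value near 0 means the raters agree no more often than chance would predict, even if their raw agreement looks respectable.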

Cognitive Bias Cascade (02:00-03:30)

  • Anchoring: 34% rating variance in first 3 seconds
  • Confirmation: 78% selective attention to preferred features
  • Dunning-Kruger: d=1.24 effect size
  • Result: Circular preferences (A>B>C>A)
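
A circular preference such as A>B>C>A can be detected mechanically by treating each pairwise result as a directed edge and searching for a cycle. The sketch below assumes the votes have already been reduced to (winner, loser) pairs; the function name and input shape are hypothetical.

```python
def find_preference_cycle(preferences):
    """Return one preference cycle as a list of names, or None if none exists.

    preferences is an iterable of (winner, loser) pairs.
    """
    graph = {}
    for winner, loser in preferences:
        graph.setdefault(winner, []).append(loser)
        graph.setdefault(loser, [])

    unvisited, in_progress, done = 0, 1, 2
    state = {node: unvisited for node in graph}
    path = []

    def visit(node):
        state[node] = in_progress
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == in_progress:
                # Back edge: the cycle runs from nxt to the current node.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == unvisited:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = done
        return None

    for node in graph:
        if state[node] == unvisited:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    print(find_preference_cycle([("A", "B"), ("B", "C"), ("C", "A")]))
```

Any cycle means no consistent ranking of the agents exists for that set of votes, so a single rating number cannot represent them faithfully.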

The Quantitative Alternative (03:30-05:00)

Objective Metrics

  • McCabe complexity ≤20
  • Test coverage ≥80%
  • Big O notation comparison
  • Self-admitted technical debt
  • Reliability: r=0.91 (objective metrics) vs r=0.42 (preference votes)
  • Effect size: d=2.18
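
As an illustration of the first metric, McCabe (cyclomatic) complexity can be approximated for Python source by counting decision points in the syntax tree. This is a rough sketch using only the standard `ast` module, not the TDG or PMAT implementation; the set of counted node types and the helper names are assumptions, and dedicated tools count more cases.

```python
import ast

# Node types treated as one extra decision point each (an assumed, simplified set).
_DECISION_NODES = (ast.If, ast.For, ast.While, ast.AsyncFor, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity of a Python source string."""
    complexity = 1  # a single straight-line path to start with
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # Each extra operand of and/or adds a short-circuit branch.
            complexity += len(node.values) - 1
    return complexity


def within_budget(source: str, limit: int = 20) -> bool:
    """Check the source against the complexity ceiling (20 in these notes)."""
    return cyclomatic_complexity(source) <= limit


if __name__ == "__main__":
    sample = (
        "def f(x):\n"
        "    if x > 0 and x < 10:\n"
        "        return 1\n"
        "    for i in range(x):\n"
        "        pass\n"
        "    return 0\n"
    )
    print(cyclomatic_complexity(sample), within_budget(sample))
```

Unlike a preference vote, the same input always produces the same number, which is what makes this kind of metric repeatable.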

Dream Scenario vs Reality (05:00-06:00)

Dream

  • World's best engineers
  • Annotated metrics
  • Standardized criteria

Reality

  • Random internet users
  • No expertise verification
  • Subjective preferences

Key Statistics

Metric                    Chess     AI Agents
Inter-rater reliability   κ=0.92    κ=0.42
Test-retest               r=0.95    r=0.31
Temporal drift            ±10 pts   ±150 pts
Hurst exponent            0.89      0.31
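
Test-retest reliability is conventionally reported as the Pearson correlation between two rating sessions over the same items. A minimal sketch of that calculation follows; the function name and the sample scores are illustrative assumptions, not data from the episode.

```python
import math


def pearson_r(first_session, second_session):
    """Pearson correlation between two equally long lists of scores."""
    n = len(first_session)
    if n != len(second_session) or n < 2:
        raise ValueError("need two sessions of equal length with at least 2 items")
    mean_x = sum(first_session) / n
    mean_y = sum(second_session) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(first_session, second_session))
    var_x = sum((x - mean_x) ** 2 for x in first_session)
    var_y = sum((y - mean_y) ** 2 for y in second_session)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a session has no variance")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical ratings of the same five outputs, one week apart.
    week_one = [4, 2, 5, 3, 1]
    week_two = [3, 2, 5, 4, 1]
    print(round(pearson_r(week_one, week_two), 3))
```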

Takeaways

  1. Stop: Using preference votes as quality metrics
  2. Start: Automated complexity analysis
  3. ROI: 4.7 months to break even

Citations Mentioned

  • Kapoor et al. (2025): "AI agents that matter" - κ=0.42 finding
  • Santos et al. (2022): Technical Debt Grading validation
  • Regan & Haworth (2011): Chess arbiter reliability κ=0.92
  • Chapman & Johnson (2002): 34% anchoring effect

Quotable Moments

"You can't rate chess with basketball fans"

"0.31 reliability? That's a coin flip with extra steps"

"Every preference vote is a data crime"

"The psychometrics are screaming"


Resources

  • Technical Debt Grading (TDG) Framework
  • PMAT (Pragmatic AI Labs MCP Agent Toolkit)
  • McCabe Complexity Calculator
  • Cohen's Kappa Calculator
