AI’s next great leap isn’t about bigger models—it’s about broader senses. In this season premiere of "All Things LLM," Alex and Ben explore the revolutionary world of multimodal large language models (LLMs)—the new frontier where AI can “see,” “hear,” and “understand” the world far beyond text.

In this episode:

  • Journey to Multimodality: Discover why the future of AI is about breaking beyond the limits of language, integrating text, vision, and audio for richer, more human-like intelligence.
  • Architectures Explained: Get a clear breakdown of the two main approaches (see the short code sketch after this list):
    • Unified Embedding Decoder—where all data types (words, image patches, sound) become a universal “language” for the model
    • Cross-Modality Attention—where separate data streams (like text and images) are fused inside the transformer for fine-grained reasoning
  • Industry Leaders: A look at today's most advanced models, including OpenAI’s GPT-4o (handling text, images, and audio), Google’s Gemini (with very large context windows and combined document and image understanding), and Anthropic’s Claude 3.5 Sonnet (excelling at business and historical visual data).
  • Real-World Impact:
    • In healthcare—AIs that analyze X-rays, patient files, and doctor notes at once for deeper, safer insights
    • In education—Personalized AI tutors that understand handwriting, voice, and learning style for true adaptive teaching
    • In creative fields—Next-gen partners that combine mood boards, music, and text for production-ready film concepts, design, and more
  • The Emerging Video and Robotics Frontier: How AI’s ability to process moving images sets the stage for breakthroughs in surveillance, manufacturing, and future “embodied” agents that interact with the real world.
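
For listeners who want a concrete picture of the two architectures mentioned above, here is a minimal sketch in PyTorch. Every module name, dimension, and tensor below is an illustrative assumption for these show notes, not code from GPT-4o, Gemini, or Claude.

```python
# Minimal conceptual sketch of the two multimodal architectures (assumes PyTorch;
# all sizes and data here are toy values, not a real model).
import torch
import torch.nn as nn

d_model, vocab_size, patch_dim = 64, 1000, 48    # toy hidden size, vocab, patch size

text_ids = torch.randint(0, vocab_size, (1, 8))  # 8 text tokens
patches = torch.randn(1, 16, patch_dim)          # 16 flattened image patches

text_embed = nn.Embedding(vocab_size, d_model)   # text -> shared embedding space
patch_embed = nn.Linear(patch_dim, d_model)      # image patches -> same space

# Approach 1: Unified Embedding Decoder.
# Both modalities are projected into one embedding space, concatenated into a
# single token sequence, and processed by one transformer stack (a single
# self-attention layer stands in for the decoder here).
tokens = torch.cat([text_embed(text_ids), patch_embed(patches)], dim=1)  # (1, 24, 64)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
unified_out = layer(tokens)

# Approach 2: Cross-Modality Attention.
# The streams stay separate; inside the transformer, text queries attend to
# image keys/values through a dedicated cross-attention layer.
cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
image_stream = patch_embed(patches)
fused, _ = cross_attn(query=text_embed(text_ids), key=image_stream, value=image_stream)

print(unified_out.shape, fused.shape)  # torch.Size([1, 24, 64]) torch.Size([1, 8, 64])
```

The trade-off in one line: the unified approach keeps things simple (one token sequence, one stack), while cross-attention keeps modalities separate and lets text reason over image features at a finer grain.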

Perfect for listeners searching for:

  • Multimodal LLMs explained
  • Text and image AI models
  • GPT-4o vs Gemini vs Claude 3.5 Sonnet
  • AI in healthcare, education, and creativity
  • Future of LLMs and AI robotics
  • Cross-modality attention
  • AI video analysis

Unlock an understanding of how AI is evolving to be more like us—blending language, sight, and sound for smarter, more intuitive technology. Subscribe now, and join us next week as Alex and Ben dive into the world of autonomous agents and Large Action Models—the AIs that don’t just understand, but act.

All Things LLM is a production of MTN Holdings, LLC. © 2025. All rights reserved.
For more insights, resources, and show updates, visit allthingsllm.com.
For business inquiries, partnerships, or feedback, contact: [email protected]

The views and opinions expressed in this episode are those of the hosts and guests, and do not necessarily reflect the official policy or position of MTN Holdings, LLC.

Unauthorized reproduction or distribution of this podcast, in whole or in part, without written permission is strictly prohibited.
Thank you for listening and supporting the advancement of transparent, accessible AI education.
