Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

Lenny's Podcast | September 25, 2025 | 1h 46m

Hamel Husain and Shreya Shankar teach the world’s most popular course on AI evals and have trained over 2,000 PMs and engineers (including many teams at OpenAI and Anthropic). In this conversation, they demystify the process of developing effective evals, walk through real examples, and share practical techniques that’ll help you improve your AI product.

What you’ll learn:
1. WTF evals are
2. Why they’ve become the most important new skill for AI product builders
3. A step-by-step walkthrough of how to create an effective eval
4. A deep dive into error analysis, open coding, and axial coding
5. Code-based evals vs. LLM-as-judge
6. The most common pitfalls and how to avoid them
7. Practical tips for implementing evals with minimal time investment (30 minutes per week after initial setup)
8. Insight into the debate between “vibes” and systematic evals

Brought to you by:
• Fin: The #1 AI agent for customer service
• Dscout: The UX platform to capture insights at every stage, from ideation to production
• Mercury: The art of simplified finances

Where to find Shreya Shankar:
• X: https://x.com/sh_reya
• LinkedIn: https://www.linkedin.com/in/shrshnk/
• Website: https://www.sh-reya.com/
• Maven course: https://bit.ly/4myp27m

Where to find Hamel Husain:
• X: https://x.com/HamelHusain
• LinkedIn: https://www.linkedin.com/in/hamelhusain/
• Website: https://hamel.dev/
• Maven course: https://bit.ly/4myp27m

In this episode, we cover:
(00:00) Introduction to Hamel and Shreya
(04:57) What are evals?
(09:56) Demo: Examining real traces from a property management AI assistant
(16:51) Writing notes on errors
(23:54) Why LLMs can’t replace humans in the initial error analysis
(25:16) The concept of a “benevolent dictator” in the eval process
(28:07) Theoretical saturation: when to stop
(31:39) Using axial codes to help categorize and synthesize error notes
(44:39) The results
(46:06) Building an LLM-as-judge to evaluate specific failure modes
(48:31) The difference between code-based evals and LLM-as-judge
(52:10) Example: LL
