In this video, we explore the evolving landscape of large language models (LLMs) in 2025, focusing on their adoption by larger enterprises. We dig into the critical aspects of AI implementation: monitoring, evaluation, and consistent performance. The video presents a four-part framework for evaluating LLMs, inspired by Lance Martin of LangChain. We then walk through a practical case study using a movie review dataset from Hugging Face to evaluate how well GPT-3.5 and GPT-4 identify sentiment. The guide covers everything from dataset preparation to setting evaluation criteria and comparing model performance, emphasizing that regular, systematic evaluations are essential for deploying LLMs successfully.
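As a quick illustration of the dataset-preparation step, here is a minimal Python sketch, not the video's own code (that lives in the git repo below). It assumes the IMDb dataset from Hugging Face stands in for the movie-review data, a small 20-sample eval, and the match-style JSONL sample layout described in the OpenAI Evals docs:

```python
# Minimal sketch: convert Hugging Face movie reviews into JSONL samples
# that an OpenAI eval can consume. Dataset choice (IMDb), sample size,
# and prompt wording are assumptions, not the video's exact setup.
import json
from datasets import load_dataset

SYSTEM_PROMPT = (
    "Classify the sentiment of the movie review. "
    "Answer with exactly one word: positive or negative."
)

# Pull a small labeled sample (label 0 = negative, 1 = positive).
reviews = load_dataset("imdb", split="test").shuffle(seed=42).select(range(20))

with open("movie_review_samples.jsonl", "w") as f:
    for row in reviews:
        sample = {
            "input": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": row["text"][:2000]},  # truncate long reviews
            ],
            "ideal": "positive" if row["label"] == 1 else "negative",
        }
        f.write(json.dumps(sample) + "\n")
```

Once registered as an eval in the openai/evals framework, the same samples can be run against gpt-3.5-turbo and gpt-4 with the `oaieval` CLI to compare accuracy.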
00:00 Intro
01:17 4-Part Framework
02:23 Double Click on LLMs
02:57 Tooling
03:28 Case Study: Movie Review
04:18 OpenAI Evaluation Walkthrough
07:38 Test Criteria
09:16 Test Evaluation
11:00 Run Evaluation
13:36 Closing
-----
git repo: https://github.com/mannybernabe/openai-evals-tutorial
Sources:
LangChain Eval Series:
OpenAI Docs: https://platform.openai.com/docs/guides/evals