
DeepLearning.AI

Efficiently Serving LLMs

  • up to 1 hour
  • Intermediate

Join our new short course, Efficiently Serving Large Language Models, taught by Travis Addair, CTO at Predibase, to build a ground-up understanding of how to serve LLM applications. Whether you’re ready to launch your own application or just getting started building it, this course will deepen your foundational knowledge of how LLMs work and help you better understand the performance trade-offs you must consider.

  • KV caching
  • Continuous batching
  • Model quantization
  • Low Rank Adapters (LoRA)
  • LLM inference stack

Overview

In this course, you will learn how auto-regressive large language models generate text one token at a time. You will implement the foundational elements of a modern LLM inference stack in code, including KV caching, continuous batching, and model quantization, and benchmark their impact on inference throughput and latency. You will explore how LoRA adapters work and learn how batching techniques allow different LoRA adapters to be served to multiple customers simultaneously. You will also get hands-on with Predibase’s LoRAX framework to see these optimization techniques implemented in a real-world LLM inference server. Knowing how LLM servers operate under the hood will greatly enhance your understanding of the options you have for increasing the performance and efficiency of your LLM-powered applications.
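To make the first of these ideas concrete, here is a minimal sketch of greedy auto-regressive decoding with and without a KV cache, written against the Hugging Face transformers API. It is an illustration only, not material from the course; the model name and prompt are arbitrary assumptions, and the timing comparison is the point.

    # Minimal sketch: auto-regressive decoding with and without a KV cache.
    # Assumptions: any small causal LM ("gpt2" here) and an arbitrary prompt.
    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    prompt = "Efficient LLM serving starts with"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    def generate(n_new_tokens: int, use_cache: bool) -> float:
        ids, past = input_ids, None
        start = time.time()
        with torch.no_grad():
            for _ in range(n_new_tokens):
                if use_cache and past is not None:
                    # With a KV cache, only the newest token is fed forward;
                    # keys/values for earlier positions are reused from `past`.
                    out = model(ids[:, -1:], past_key_values=past, use_cache=True)
                else:
                    # Without caching, every step re-processes the full sequence.
                    out = model(ids, use_cache=use_cache)
                past = out.past_key_values if use_cache else None
                next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
                ids = torch.cat([ids, next_id], dim=-1)
        return time.time() - start

    print(f"no cache: {generate(50, use_cache=False):.2f}s")
    print(f"kv cache: {generate(50, use_cache=True):.2f}s")

Even on a small model, the cached loop should run noticeably faster as the sequence grows, which is exactly the effect the course asks you to benchmark.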

  • Online (course location)
  • English (course language)
  • Self-paced (course format)
  • Live classes delivered online

Who is this course for?

Developers

Anyone who wants to understand the components, techniques, and tradeoffs of efficiently serving LLM applications.

Data Scientists

Professionals looking to deepen their foundational knowledge of how LLMs work and the performance trade-offs involved.

AI Enthusiasts

Individuals interested in learning about the optimizations that allow LLM vendors to efficiently serve models to many customers.

This course will help you understand the key components, techniques, and trade-offs of efficiently serving LLM applications. You will learn about the most important optimizations for serving models to many customers and gain hands-on experience with real-world techniques. Ideal for developers, data scientists, and AI enthusiasts looking to enhance their skills and knowledge.

Prerequisites


  • Intermediate Python knowledge

  • Basic understanding of machine learning concepts

  • Familiarity with large language models (LLMs)

What will you learn?

Introduction to LLMs
Learn how auto-regressive large language models generate text one token at a time.
KV Caching
Implement KV caching and understand its impact on inference throughput and latency.
Continuous Batching
Explore continuous batching techniques and their benefits for serving multiple users.
Model Quantization
Learn about model quantization and how it affects performance and efficiency.
Low Rank Adapters (LoRA)
Understand how LoRA adapters work and their role in serving multiple fine-tuned models (a short sketch follows this list).
Benchmarking
Benchmark the impacts of various techniques on inference throughput and latency.
Real-World Implementation
Get hands-on with Predibase’s LoRAX framework inference server to see optimization techniques in action.
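As a taste of the LoRA material, here is a minimal sketch, under our own assumptions rather than taken from the course or from LoRAX, of how a low-rank adapter modifies a frozen linear layer and why many adapters can share one base model, so requests from different customers can be served together.

    # Minimal sketch: a LoRA adapter adds a scaled low-rank update to a frozen
    # weight, y = W x + (alpha / r) * B (A x). Sizes and names are illustrative.
    import torch

    d_in, d_out, r, alpha = 768, 768, 8, 16
    W = torch.randn(d_out, d_in)          # frozen base weight, shared by all adapters

    def make_adapter():
        # B starts at zero so a fresh adapter leaves the base model unchanged.
        return torch.randn(r, d_in) * 0.01, torch.zeros(d_out, r)

    adapters = {"customer_a": make_adapter(), "customer_b": make_adapter()}

    def lora_forward(x: torch.Tensor, adapter_id: str) -> torch.Tensor:
        A, B = adapters[adapter_id]
        return W @ x + (alpha / r) * (B @ (A @ x))

    # Requests for different fine-tuned "models" reuse the same W; only the tiny
    # per-customer (A, B) pair changes, which is what makes multi-adapter
    # batching practical in a server such as LoRAX.
    batch = [("customer_a", torch.randn(d_in)), ("customer_b", torch.randn(d_in))]
    outputs = [lora_forward(x, cid) for cid, x in batch]
    print([tuple(o.shape) for o in outputs])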

Meet your instructor

  • Travis Addair

    Co-Founder & CTO, Predibase

    Travis Addair is the Co-Founder & CTO at Predibase, the world's first declarative machine learning platform. He is a leader in the field, having co-maintained the popular open source projects Ludwig and Horovod.

Upcoming cohorts

  • Dates: start now
  • Free