
DeepLearning.AI

Efficiently Serving LLMs

  • up to 1 hour
  • Intermediate

Join our new short course, Efficiently Serving Large Language Models, taught by Travis Addair, CTO at Predibase, to build a ground-up understanding of how to serve LLM applications. Whether you’re ready to launch your own application or just getting started building it, this course will deepen your foundational knowledge of how LLMs work and help you better understand the performance trade-offs you must consider.

  • KV caching
  • Continuous batching
  • Model quantization
  • Low Rank Adapters (LoRA)
  • LLM inference stack

Overview

In this course, you will learn how auto-regressive large language models generate text one token at a time. You will implement the foundational elements of a modern LLM inference stack in code, including KV caching, continuous batching, and model quantization, and benchmark their impact on inference throughput and latency. You will explore how LoRA adapters work and learn how batching techniques allow different LoRA adapters to be served to multiple customers simultaneously. You will also get hands-on with Predibase’s LoRAX framework to see these optimization techniques implemented in a real-world LLM inference server. Knowing how LLM servers operate under the hood will greatly enhance your understanding of the options you have for increasing the performance and efficiency of your LLM-powered applications.
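To make the first of these ideas concrete, here is a minimal sketch of greedy auto-regressive decoding with and without a KV cache, written against the Hugging Face transformers API. It is an illustration only, not material from the course; the model name and prompt are arbitrary assumptions, and the timing comparison is the point.

    # Minimal sketch: auto-regressive decoding with and without a KV cache.
    # Assumptions: any small causal LM ("gpt2" here) and an arbitrary prompt.
    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    prompt = "Efficient LLM serving starts with"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    def generate(n_new_tokens: int, use_cache: bool) -> float:
        ids, past = input_ids, None
        start = time.time()
        with torch.no_grad():
            for _ in range(n_new_tokens):
                if use_cache and past is not None:
                    # With a KV cache, only the newest token is fed forward;
                    # keys/values for earlier positions are reused from `past`.
                    out = model(ids[:, -1:], past_key_values=past, use_cache=True)
                else:
                    # Without caching, every step re-processes the full sequence.
                    out = model(ids, use_cache=use_cache)
                past = out.past_key_values if use_cache else None
                next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
                ids = torch.cat([ids, next_id], dim=-1)
        return time.time() - start

    print(f"no cache: {generate(50, use_cache=False):.2f}s")
    print(f"kv cache: {generate(50, use_cache=True):.2f}s")

Even on a small model, the cached loop should run noticeably faster as the sequence grows, which is exactly the effect the course asks you to benchmark.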

  • Online (course location)
  • English (course language)
  • Self-paced (course format)
  • Live classes delivered online

Who is this course for?

Developers

Anyone who wants to understand the components, techniques, and tradeoffs of efficiently serving LLM applications.

Data Scientists

Professionals looking to deepen their foundational knowledge of how LLMs work and the performance trade-offs involved.

AI Enthusiasts

Individuals interested in learning about the optimizations that allow LLM vendors to efficiently serve models to many customers.

This course will help you understand the key components, techniques, and trade-offs of efficiently serving LLM applications. You will learn about the most important optimizations for serving models to many customers and gain hands-on experience with real-world techniques. Ideal for developers, data scientists, and AI enthusiasts looking to enhance their skills and knowledge.

Prerequisites


  • Intermediate Python knowledge

  • Basic understanding of machine learning concepts

  • Familiarity with large language models (LLMs)

What will you learn?

Introduction to LLMs
Learn how auto-regressive large language models generate text one token at a time.
KV Caching
Implement KV caching and understand its impact on inference throughput and latency.
Continuous Batching
Explore continuous batching techniques and their benefits for serving multiple users.
Model Quantization
Learn about model quantization and how it affects performance and efficiency.
Low Rank Adapters (LoRA)
Understand how LoRA adapters work and their role in serving multiple fine-tuned models (a short sketch follows this list).
Benchmarking
Benchmark the impacts of various techniques on inference throughput and latency.
Real-World Implementation
Get hands-on with Predibase’s LoRAX framework inference server to see optimization techniques in action.
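As a taste of the LoRA material, here is a minimal sketch, under our own assumptions rather than taken from the course or from LoRAX, of how a low-rank adapter modifies a frozen linear layer and why many adapters can share one base model, so requests from different customers can be served together.

    # Minimal sketch: a LoRA adapter adds a scaled low-rank update to a frozen
    # weight, y = W x + (alpha / r) * B (A x). Sizes and names are illustrative.
    import torch

    d_in, d_out, r, alpha = 768, 768, 8, 16
    W = torch.randn(d_out, d_in)          # frozen base weight, shared by all adapters

    def make_adapter():
        # B starts at zero so a fresh adapter leaves the base model unchanged.
        return torch.randn(r, d_in) * 0.01, torch.zeros(d_out, r)

    adapters = {"customer_a": make_adapter(), "customer_b": make_adapter()}

    def lora_forward(x: torch.Tensor, adapter_id: str) -> torch.Tensor:
        A, B = adapters[adapter_id]
        return W @ x + (alpha / r) * (B @ (A @ x))

    # Requests for different fine-tuned "models" reuse the same W; only the tiny
    # per-customer (A, B) pair changes, which is what makes multi-adapter
    # batching practical in a server such as LoRAX.
    batch = [("customer_a", torch.randn(d_in)), ("customer_b", torch.randn(d_in))]
    outputs = [lora_forward(x, cid) for cid, x in batch]
    print([tuple(o.shape) for o in outputs])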

Meet your instructor

  • Travis Addair

    Co-Founder & CTO, Predibase

    Travis Addair is the Co-Founder & CTO at Predibase, the world's first declarative machine learning platform. He is a leader in the field, having co-maintained the popular open source projects Ludwig and Horovod.

Upcoming cohorts

  • Dates: start now
  • Free