Deploying Deep Learning: Quantization, Serving, and Edge AI

This course is part of Advanced Deep Learning Architectures Specialization

Instructor: Board Infinity

4 modules

Gain insight into a topic and learn the fundamentals.

Advanced level

Recommended experience

2 weeks to complete

at 10 hours a week

Flexible schedule

Learn at your own pace

What you'll learn

Apply INT4/INT8 quantization (AWQ, GPTQ, GGUF) to compress LLMs and vision models for production
Deploy high-throughput inference servers using vLLM's PagedAttention and NVIDIA Triton
Run optimized LLMs on CPU and edge devices using ONNX Runtime and Llama.cpp
Build, benchmark, and containerize a production-ready inference API with Docker

Details to know

Shareable certificate

Add to your LinkedIn profile

When you enroll in this course, you'll also be enrolled in this Specialization.

There are 4 modules in this course

Production Deep Learning: Inference, Quantization & Edge Deployment is designed for ML engineers and developers who want to master the full deployment lifecycle, from compressing and quantizing models to serving them at scale using vLLM, Triton, ONNX, and Llama.cpp.

Module 1 covers model compression fundamentals, including pruning, distillation, and INT8/INT4 quantization using AWQ and GPTQ, with a focus on the accuracy–latency tradeoff.

Module 2 dives into high-throughput serving architectures, exploring vLLM's PagedAttention, NVIDIA Triton, TensorRT, and scaling inference across GPU clusters with autoscaling patterns.

Module 3 focuses on CPU and edge deployment using ONNX Runtime, GGUF, and Llama.cpp, plus multimodal inference with CLIP and LLaVA on resource-constrained devices.

Module 4 is a capstone project where you'll quantize a fine-tuned LLM, build a production API with vLLM, benchmark performance, and containerize your model with Docker for cloud and edge deployment.

By the end of this course, you will:

- Apply INT4/INT8 quantization techniques (AWQ, GPTQ, GGUF) to compress LLMs for production
- Deploy high-throughput inference servers using vLLM, Triton, and ONNX Runtime
- Run optimized models on GPU, CPU, and edge devices using Llama.cpp and TensorRT
- Build, benchmark, and containerize an end-to-end production-ready inference API

Disclaimer: This is an independent educational resource created by Board Infinity for informational and educational purposes only. This course is not affiliated with, endorsed by, sponsored by, or officially associated with any company, organization, or certification body unless explicitly stated. The content provided is based on industry knowledge and best practices but does not constitute official training material for any specific employer or certification program. All company names, trademarks, service marks, and logos referenced are the property of their respective owners and are used solely for educational identification and comparison purposes.

Learn model compression fundamentals, memory profiling, and modern INT8/INT4 quantization techniques including AWQ and GPTQ to optimize models for production inference.
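As a rough illustration of what INT8 quantization does to a weight tensor, the sketch below applies plain symmetric round-to-nearest quantization with a single per-tensor scale. This is not AWQ or GPTQ, which choose scales and rounding more carefully; the function names and sample weights are assumptions made for this example and do not come from the course.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage only (ignores activations, KV cache, and scale metadata)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    weights = [0.02, -0.51, 0.33, 1.27, -0.90]  # hypothetical sample values
    quantized, scale = quantize_int8(weights)
    restored = dequantize(quantized, scale)
    max_error = max(abs(a - b) for a, b in zip(weights, restored))
    print("quantized:", quantized)
    print("max round-trip error:", max_error)
    for bits in (16, 8, 4):
        print(f"7B parameters at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

By this arithmetic, a hypothetical 7-billion-parameter model needs about 14 GB for 16-bit weights, 7 GB at 8-bit, and 3.5 GB at 4-bit, which is the kind of saving the accuracy–latency tradeoff is weighed against.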

What's included

9 videos • 3 readings • 4 assignments

9 videos•Total 80 minutes

Where Trained Models Actually Run•9 minutes
Why Inference Optimization Is a Top Skill•7 minutes
Skill Roadmap: Training → Inference → Edge•9 minutes
Why Models Are Too Big•10 minutes
Three Ways to Make Models Smaller•10 minutes
Accuracy vs Latency: Making Tradeoffs•9 minutes
What Quantization Really Does•9 minutes
Quantizing LLMs with AWQ & GPTQ•8 minutes
Benchmarking: Speed, Accuracy Drop & Perplexity Shift•9 minutes

3 readings•Total 90 minutes

The 2026 Deployment Engineer Role: What Companies Want•30 minutes
Model Compression Strategies at Scale•30 minutes
Choosing the Right Quantization Method for Real Deployment•30 minutes

4 assignments•Total 150 minutes

Career Scope in Production AI & Edge Deployment•30 minutes
Fundamentals of Model Compression•30 minutes
INT8/INT4 Quantization (AWQ, GPTQ)•30 minutes
Model Compression, Quantization & Latency Optimization•60 minutes

Master production-grade serving engines including vLLM with PagedAttention and NVIDIA Triton for scaling inference across GPUs and nodes.
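The core idea behind PagedAttention is to store each request's KV cache in fixed-size blocks that are allocated on demand, rather than reserving one contiguous region per request. The toy allocator below sketches only that bookkeeping; the class name, method names, and block sizes are hypothetical, and it is not vLLM's implementation.

```python
from typing import Dict, List


class PagedKVCache:
    """Toy block allocator: each sequence keeps a block table of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.seq_lens: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token; take a new block only when the last one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                # A real server would queue or preempt requests instead of failing.
                raise MemoryError("KV cache is full")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4, block_size=16)
    for _ in range(20):
        cache.append_token("request-a")  # 20 tokens span two 16-token blocks
    for _ in range(5):
        cache.append_token("request-b")  # 5 tokens fit in one block
    print(cache.block_tables, "free:", cache.free_blocks)
    cache.free("request-a")
    print("free after request-a finishes:", cache.free_blocks)
```

Because blocks are handed out per token batch rather than per maximum sequence length, short requests do not strand memory that longer or newer requests could use.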

What's included

9 videos • 3 readings • 4 assignments

9 videos•Total 68 minutes

What Breaks When Users Increase•8 minutes
How Inference Servers Actually Work•11 minutes
API Patterns for Inference•7 minutes
Why KV Cache Limits Throughput•7 minutes
Running a vLLM Server•6 minutes
Handling Concurrent Requests Under Load•6 minutes
When Triton Makes Sense•7 minutes
Serving Vision Models with Triton•8 minutes
Scaling Across GPUs•9 minutes

3 readings•Total 90 minutes

From Training to Serving: What Changes in Architecture?•30 minutes
PagedAttention Deep Dive & Performance Tuning•30 minutes
Deployment Blueprints: GPU Clusters & Autoscaling Patterns•30 minutes

4 assignments•Total 150 minutes

Serving Architectures Beyond Flask & Python Loops•30 minutes
vLLM Internals (PagedAttention)•30 minutes
NVIDIA Triton & Production Deployment Patterns•30 minutes
Serving Architectures Beyond Flask & Python Loops•60 minutes

Export models to ONNX for interoperability, deploy LLMs on CPU and edge devices with Llama.cpp and GGUF, and build multimodal pipelines with CLIP and LLaVA.
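For the multimodal part of this module, the way CLIP-style models match text to images can be sketched as cosine similarity between embeddings followed by a softmax over candidate captions. The example below uses tiny hand-written vectors in place of real encoder outputs, and `rank_captions` and the fixed `logit_scale` are assumptions for illustration (in CLIP the scale is a learned parameter).

```python
import math
from typing import Dict, List


def normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def rank_captions(
    image_emb: List[float],
    text_embs: Dict[str, List[float]],
    logit_scale: float = 10.0,  # hypothetical constant for this sketch
) -> Dict[str, float]:
    """Return a probability per caption from scaled cosine similarities."""
    image = normalize(image_emb)
    logits = {
        caption: logit_scale * sum(a * b for a, b in zip(image, normalize(emb)))
        for caption, emb in text_embs.items()
    }
    # Subtract the maximum logit before exponentiating for numerical stability.
    top = max(logits.values())
    exps = {caption: math.exp(value - top) for caption, value in logits.items()}
    total = sum(exps.values())
    return {caption: value / total for caption, value in exps.items()}


if __name__ == "__main__":
    # Toy 3-dimensional embeddings; real encoders produce hundreds of dimensions.
    image_embedding = [0.9, 0.1, 0.0]
    captions = {
        "a photo of a cat": [1.0, 0.0, 0.0],
        "a photo of a dog": [0.0, 1.0, 0.0],
    }
    for caption, prob in rank_captions(image_embedding, captions).items():
        print(f"{caption}: {prob:.3f}")
```

In a real pipeline the two embeddings would come from the model's image and text encoders, and only this scoring step would stay the same.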

What's included

14 videos • 3 readings • 4 assignments

14 videos•Total 111 minutes

Why ONNX Matters•9 minutes
Exporting LLMs & Vision Models to ONNX - Part 1•7 minutes
Exporting LLMs & Vision Models to ONNX - Part 2•11 minutes
Speeding Up Inference with ONNX Runtime - Part 1•8 minutes
Speeding Up Inference with ONNX Runtime - Part 2•9 minutes
What GGUF Is & Why It Matters•9 minutes
Running LLMs with Llama.cpp - Part 1•8 minutes
Running LLMs with Llama.cpp - Part 2•8 minutes
Benchmarking: Latency, Token Throughput & Memory•9 minutes
How CLIP Connects Text & Images - Part 1•7 minutes
How CLIP Connects Text & Images - Part 2•9 minutes
Vision-Enhanced LLMs (LLaVA) - Part 1•3 minutes
Vision-Enhanced LLMs (LLaVA) - Part 2•6 minutes
Building a Simple Multimodal Pipeline•9 minutes

3 readings•Total 90 minutes

ONNX Runtime Optimization Guide•30 minutes
Edge LLM Deployment: Real-World Limitations & Solutions•30 minutes
Multimodal Models: Practical Deployment Workflows•30 minutes

4 assignments•Total 120 minutes

Exporting Models to ONNX•30 minutes
Llama.cpp & GGUF for CPU/Edge Deployment•30 minutes
Multimodal Inference (CLIP & LLaVA)•30 minutes
ONNX, Llama.cpp & Edge / CPU Deployment•30 minutes

Apply all course concepts in a final project to quantize a fine-tuned model, serve it via vLLM, benchmark it, and package it for cloud and edge deployment.
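The benchmarking step of a project like this usually comes down to timing requests and counting generated tokens. The harness below is a minimal sketch under stated assumptions: `benchmark` and `fake_generate` are hypothetical names, the generator is a stand-in rather than a real vLLM call, and requests run sequentially, so it does not measure behavior under concurrent load.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def benchmark(generate: Callable[[str], List[str]], prompts: List[str]) -> Dict[str, float]:
    """Time each request and report mean latency, p95 latency, and token throughput."""
    latencies: List[float] = []
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        tokens = generate(prompt)
        latencies.append(time.perf_counter() - t0)
        total_tokens += len(tokens)
    elapsed = time.perf_counter() - start
    latencies.sort()
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "requests": len(prompts),
        "mean_latency_s": statistics.mean(latencies),
        "p95_latency_s": latencies[p95_index],
        "tokens_per_s": total_tokens / elapsed,
    }


def fake_generate(prompt: str) -> List[str]:
    """Stand-in for a real model call (hypothetical): sleeps briefly and echoes the words."""
    time.sleep(0.001)
    return prompt.split()


if __name__ == "__main__":
    prompts = ["summarize the deployment checklist for the edge device"] * 20
    for name, value in benchmark(fake_generate, prompts).items():
        print(f"{name}: {value:.4f}")
```

Swapping `fake_generate` for a function that calls the served model would let the same harness compare the original and quantized versions on identical prompts.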

What's included

10 videos • 3 readings • 4 assignments

10 videos•Total 75 minutes

Loading Your QLoRA/LoRA Fine-Tuned Model - Part 1•7 minutes
Loading Your QLoRA/LoRA Fine-Tuned Model - Part 2•6 minutes
Configure PEFT with LoRA•6 minutes
Validating Quality vs Speed•4 minutes
Load and Preprocess the Dataset•9 minutes
Generate and Store Model Outputs Before Fine-Tuning•9 minutes
Configure Training Arguments and Fine-Tune the Model•8 minutes
Compare Model Outputs After Fine-Tuning•9 minutes
Dockerizing the Service•7 minutes
Running on Cloud, CPU & Edge•9 minutes

3 readings•Total 90 minutes

Quantization Validation Checklist for Production•30 minutes
API Design Patterns for Generative Models•30 minutes
Deployment Benchmark Templates (LLM + Vision)•30 minutes

4 assignments•Total 150 minutes

Preparing the Fine-Tuned Model for Deployment•30 minutes
Building the Production API (vLLM)•30 minutes
Benchmarking & Deployment Packaging•30 minutes
Final Project - The Edge-Ready API (Quantize to Serve to Benchmark)•60 minutes

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructor

Board Infinity

261 Courses•428,186 learners

Offered by

Board Infinity

Frequently asked questions

Yes, a working knowledge of deep learning, PyTorch, and LLM fundamentals is recommended. This course focuses on production deployment rather than training from scratch.

No. This is an advanced course. Learners should already understand model training, transformers, and basic MLOps concepts before starting.

It prepares you for roles like Inference Engineer, ML Deployment Engineer, Edge AI Developer, and MLOps Engineer focused on generative AI systems.

To access the course materials and assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.

URL: https://www.coursera.org/learn/deploying-deep-learning-quantization-serving-and-edge-ai