Deploying Deep Learning: Quantization, Serving, and Edge AI
This course is part of Advanced Deep Learning Architectures Specialization
Instructor: Board Infinity
What you'll learn
Apply INT4/INT8 quantization (AWQ, GPTQ, GGUF) to compress LLMs and vision models for production
Deploy high-throughput inference servers using vLLM's PagedAttention and NVIDIA Triton
Run optimized LLMs on CPU and edge devices using ONNX Runtime and Llama.cpp
Build, benchmark, and containerize a production-ready inference API with Docker
Details to know
May 2026
16 assignments
There are 4 modules in this course
Production Deep Learning: Inference, Quantization & Edge Deployment is designed for ML engineers and developers who want to master the full deployment lifecycle, from compressing and quantizing models to serving them at scale using vLLM, Triton, ONNX, and Llama.cpp.
Module 1 covers model compression fundamentals, including pruning, distillation, and INT8/INT4 quantization using AWQ and GPTQ, with a focus on the accuracy-latency tradeoff.
Module 2 dives into high-throughput serving architectures, exploring vLLM's PagedAttention, NVIDIA Triton, TensorRT, and scaling inference across GPU clusters with autoscaling patterns.
Module 3 focuses on CPU and edge deployment using ONNX Runtime, GGUF, and Llama.cpp, plus multimodal inference with CLIP and LLaVA on resource-constrained devices.
Module 4 is a capstone project where you'll quantize a fine-tuned LLM, build a production API with vLLM, benchmark performance, and containerize your model with Docker for cloud and edge deployment.
By the end of this course, you will:
- Apply INT4/INT8 quantization techniques (AWQ, GPTQ, GGUF) to compress LLMs for production
- Deploy high-throughput inference servers using vLLM, Triton, and ONNX Runtime
- Run optimized models on GPU, CPU, and edge devices using Llama.cpp and TensorRT
- Build, benchmark, and containerize an end-to-end production-ready inference API
Disclaimer: This is an independent educational resource created by Board Infinity for informational and educational purposes only. This course is not affiliated with, endorsed by, sponsored by, or officially associated with any company, organization, or certification body unless explicitly stated. The content provided is based on industry knowledge and best practices but does not constitute official training material for any specific employer or certification program. All company names, trademarks, service marks, and logos referenced are the property of their respective owners and are used solely for educational identification and comparison purposes.
Learn model compression fundamentals, memory profiling, and modern INT8/INT4 quantization techniques including AWQ and GPTQ to optimize models for production inference.
What's included
9 videos, 3 readings, 4 assignments
9 videos • Total 80 minutes
- Where Trained Models Actually Run • 9 minutes
- Why Inference Optimization Is a Top Skill • 7 minutes
- Skill Roadmap: Training to Inference to Edge • 9 minutes
- Why Models Are Too Big • 10 minutes
- Three Ways to Make Models Smaller • 10 minutes
- Accuracy vs Latency: Making Tradeoffs • 9 minutes
- What Quantization Really Does • 9 minutes
- Quantizing LLMs with AWQ & GPTQ • 8 minutes
- Benchmarking: Speed, Accuracy Drop & Perplexity Shift • 9 minutes
3 readings • Total 90 minutes
- The 2026 Deployment Engineer Role: What Companies Want • 30 minutes
- Model Compression Strategies at Scale • 30 minutes
- Choosing the Right Quantization Method for Real Deployment • 30 minutes
4 assignments • Total 150 minutes
- Career Scope in Production AI & Edge Deployment • 30 minutes
- Fundamentals of Model Compression • 30 minutes
- INT8/INT4 Quantization (AWQ, GPTQ) • 30 minutes
- Model Compression, Quantization & Latency Optimization • 60 minutes
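To make the quantization topics in this module concrete, here is a minimal Python sketch of symmetric per-tensor INT8 quantization with plain round-to-nearest. It is not AWQ or GPTQ, which layer calibration-driven steps on top of this basic idea, and the function names and sample weights are illustrative assumptions rather than course material.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # Assumption: a single scale for the whole tensor (real methods often use per-channel or per-group scales).
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate floating-point weights."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only for illustration.
weights = [0.50, -1.27, 0.03, 0.90]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Round-to-nearest bounds the per-weight error by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("quantized:", q)
print("scale:", scale)
print("max abs error:", max_err)
```

The error printed at the end is the kind of quantity the benchmarking video's "accuracy drop" refers to at the weight level; end-to-end metrics such as perplexity have to be measured on the full model.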
Master production-grade serving engines including vLLM with PagedAttention and NVIDIA Triton for scaling inference across GPUs and nodes.
What's included
9 videos, 3 readings, 4 assignments
9 videos • Total 68 minutes
- What Breaks When Users Increase • 8 minutes
- How Inference Servers Actually Work • 11 minutes
- API Patterns for Inference • 7 minutes
- Why KV Cache Limits Throughput • 7 minutes
- Running a vLLM Server • 6 minutes
- Handling Concurrent Requests Under Load • 6 minutes
- When Triton Makes Sense • 7 minutes
- Serving Vision Models with Triton • 8 minutes
- Scaling Across GPUs • 9 minutes
3 readings • Total 90 minutes
- From Training to Serving: What Changes in Architecture? • 30 minutes
- PagedAttention Deep Dive & Performance Tuning • 30 minutes
- Deployment Blueprints: GPU Clusters & Autoscaling Patterns • 30 minutes
4 assignments • Total 150 minutes
- Serving Architectures Beyond Flask & Python Loops • 30 minutes
- vLLM Internals (PagedAttention) • 30 minutes
- NVIDIA Triton & Production Deployment Patterns • 30 minutes
- Serving Architectures Beyond Flask & Python Loops • 60 minutes
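The KV-cache and PagedAttention topics listed above largely come down to memory arithmetic. The Python sketch below estimates the KV-cache size per token for a transformer and counts the fixed-size blocks a paged allocator would need for a sequence. The model dimensions are hypothetical, and the block size of 16 tokens is an assumed value for illustration, not a statement about any particular server's configuration.

```python
import math


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies: keys and values for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def blocks_needed(num_tokens, block_size=16):
    """Fixed-size cache blocks needed for a sequence (block_size is an assumed value)."""
    return math.ceil(num_tokens / block_size)


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128, 16-bit cache entries.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
seq_bytes = per_token * 4096  # one 4096-token sequence

print("KV cache per token (bytes):", per_token)
print("KV cache for 4096 tokens (GiB):", seq_bytes / 2**30)
print("blocks for a 1000-token sequence:", blocks_needed(1000))
```

Under these assumptions each concurrent 4096-token request holds about 2 GiB of cache, which is why the number of simultaneous sequences, rather than raw compute, often caps throughput. Allocating in small blocks limits the unused slack per sequence to less than one block.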
Export models to ONNX for interoperability, deploy LLMs on CPU and edge devices with Llama.cpp and GGUF, and build multimodal pipelines with CLIP and LLaVA.
What's included
14 videos, 3 readings, 4 assignments
14 videos • Total 111 minutes
- Why ONNX Matters • 9 minutes
- Exporting LLMs & Vision Models to ONNX - Part 1 • 7 minutes
- Exporting LLMs & Vision Models to ONNX - Part 2 • 11 minutes
- Speeding Up Inference with ONNX Runtime - Part 1 • 8 minutes
- Speeding Up Inference with ONNX Runtime - Part 2 • 9 minutes
- What GGUF Is & Why It Matters • 9 minutes
- Running LLMs with Llama.cpp - Part 1 • 8 minutes
- Running LLMs with Llama.cpp - Part 2 • 8 minutes
- Benchmarking: Latency, Token Throughput & Memory • 9 minutes
- How CLIP Connects Text & Images - Part 1 • 7 minutes
- How CLIP Connects Text & Images - Part 2 • 9 minutes
- Vision-Enhanced LLMs (LLaVA) • 3 minutes
- Vision-Enhanced LLMs (LLaVA) - Part 2 • 6 minutes
- Building a Simple Multimodal Pipeline • 9 minutes
3 readings • Total 90 minutes
- ONNX Runtime Optimization Guide • 30 minutes
- Edge LLM Deployment: Real-World Limitations & Solutions • 30 minutes
- Multimodal Models: Practical Deployment Workflows • 30 minutes
4 assignments • Total 120 minutes
- Exporting Models to ONNX • 30 minutes
- Llama.cpp & GGUF for CPU/Edge Deployment • 30 minutes
- Multimodal Inference (CLIP & LLaVA) • 30 minutes
- ONNX, Llama.cpp & Edge / CPU Deployment • 30 minutes
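The benchmarking topic in this module (latency and token throughput) can be sketched with nothing beyond the Python standard library. The harness below times each request and reports mean latency, a simple 95th-percentile latency, and tokens per second. `fake_generate` is a hypothetical stand-in for a real model call, such as an ONNX Runtime session or a Llama.cpp binding, and memory measurement is left out of this sketch.

```python
import statistics
import time


def fake_generate(prompt):
    """Hypothetical stand-in for a real model call; returns a list of 'tokens'."""
    time.sleep(0.001)  # simulate inference work
    return prompt.split() + ["<eos>"]


def benchmark(generate, prompts):
    """Run prompts one at a time and collect latency and throughput figures."""
    latencies = []
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        tokens = generate(prompt)
        latencies.append(time.perf_counter() - t0)
        total_tokens += len(tokens)
    elapsed = time.perf_counter() - start

    latencies.sort()
    # Simple nearest-rank style index for the 95th percentile.
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "requests": len(prompts),
        "total_tokens": total_tokens,
        "mean_latency_s": statistics.mean(latencies),
        "p95_latency_s": latencies[p95_index],
        "tokens_per_s": total_tokens / elapsed,
    }


report = benchmark(fake_generate, ["hello edge world"] * 20)
print(report)
```

Swapping `fake_generate` for a real inference call keeps the rest of the harness unchanged, which makes it easier to compare the same prompts across runtimes or quantization levels.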
Apply all course concepts in a final project to quantize a fine-tuned model, serve it via vLLM, benchmark it, and package it for cloud and edge deployment.
What's included
10 videos, 3 readings, 4 assignments
10 videos • Total 75 minutes
- Loading Your QLoRA/LoRA Fine-Tuned Model - Part 1 • 7 minutes
- Loading Your QLoRA/LoRA Fine-Tuned Model - Part 2 • 6 minutes
- Configure PEFT with LoRA • 6 minutes
- Validating Quality vs Speed • 4 minutes
- Load and Preprocess the Dataset • 9 minutes
- Generate and Store Model Outputs Before Fine-Tuning • 9 minutes
- Configure Training Arguments and Fine-Tune the Model • 8 minutes
- Compare Model Outputs After Fine-Tuning • 9 minutes
- Dockerizing the Service • 7 minutes
- Running on Cloud, CPU & Edge • 9 minutes
3 readings • Total 90 minutes
- Quantization Validation Checklist for Production • 30 minutes
- API Design Patterns for Generative Models • 30 minutes
- Deployment Benchmark Templates (LLM + Vision) • 30 minutes
4 assignments • Total 150 minutes
- Preparing the Fine-Tuned Model for Deployment • 30 minutes
- Building the Production API (vLLM) • 30 minutes
- Benchmarking & Deployment Packaging • 30 minutes
- Final Project - The Edge-Ready API (Quantize to Serve to Benchmark) • 60 minutes
Frequently asked questions
Yes, a working knowledge of deep learning, PyTorch, and LLM fundamentals is recommended. This course focuses on production deployment rather than training from scratch.
No. This is an advanced course. Learners should already understand model training, transformers, and basic MLOps concepts before starting.
It prepares you for roles like Inference Engineer, ML Deployment Engineer, Edge AI Developer, and MLOps Engineer focused on generative AI systems.
