URL: https://www.coursera.org/learn/deploying-deep-learning-quantization-serving-and-edge-ai


Deploying Deep Learning: Quantization, Serving, and Edge AI

Gain insight into a topic and learn the fundamentals.
Advanced level

Recommended experience

2 weeks to complete
at 10 hours a week
Flexible schedule
Learn at your own pace

What you'll learn

  • Apply INT4/INT8 quantization (AWQ, GPTQ, GGUF) to compress LLMs and vision models for production

  • Deploy high-throughput inference servers using vLLM's PagedAttention and NVIDIA Triton

  • Run optimized LLMs on CPU and edge devices using ONNX Runtime and Llama.cpp

  • Build, benchmark, and containerize a production-ready inference API with Docker

Details to know

Shareable certificate

Add to your LinkedIn profile

Recently updated!

May 2026

Assessments

16 assignments

Taught in English

Build your subject-matter expertise

This course is part of the Advanced Deep Learning Architectures Specialization
When you enroll in this course, you'll also be enrolled in this Specialization.
  • Learn new concepts from industry experts
  • Gain a foundational understanding of a subject or tool
  • Develop job-relevant skills with hands-on projects
  • Earn a shareable career certificate

There are 4 modules in this course

Production Deep Learning: Inference, Quantization & Edge Deployment is designed for ML engineers and developers who want to master the full deployment lifecycle, from compressing and quantizing models to serving them at scale using vLLM, Triton, ONNX, and Llama.cpp.

Module 1 covers model compression fundamentals, including pruning, distillation, and INT8/INT4 quantization using AWQ and GPTQ, with a focus on the accuracy–latency tradeoff. Module 2 dives into high-throughput serving architectures, exploring vLLM's PagedAttention, NVIDIA Triton, TensorRT, and scaling inference across GPU clusters with autoscaling patterns. Module 3 focuses on CPU and edge deployment using ONNX Runtime, GGUF, and Llama.cpp, plus multimodal inference with CLIP and LLaVA on resource-constrained devices. Module 4 is a capstone project where you'll quantize a fine-tuned LLM, build a production API with vLLM, benchmark performance, and containerize your model with Docker for cloud and edge deployment.

By the end of this course, you will:
  • Apply INT4/INT8 quantization techniques (AWQ, GPTQ, GGUF) to compress LLMs for production
  • Deploy high-throughput inference servers using vLLM, Triton, and ONNX Runtime
  • Run optimized models on GPU, CPU, and edge devices using Llama.cpp and TensorRT
  • Build, benchmark, and containerize an end-to-end production-ready inference API

Disclaimer: This is an independent educational resource created by Board Infinity for informational and educational purposes only. This course is not affiliated with, endorsed by, sponsored by, or officially associated with any company, organization, or certification body unless explicitly stated. The content provided is based on industry knowledge and best practices but does not constitute official training material for any specific employer or certification program. All company names, trademarks, service marks, and logos referenced are the property of their respective owners and are used solely for educational identification and comparison purposes.

Learn model compression fundamentals, memory profiling, and modern INT8/INT4 quantization techniques including AWQ and GPTQ to optimize models for production inference.

What's included

9 videos, 3 readings, 4 assignments

9 videos • Total 80 minutes
  • Where Trained Models Actually Run • 9 minutes
  • Why Inference Optimization Is a Top Skill • 7 minutes
  • Skill Roadmap: Training → Inference → Edge • 9 minutes
  • Why Models Are Too Big • 10 minutes
  • Three Ways to Make Models Smaller • 10 minutes
  • Accuracy vs Latency: Making Tradeoffs • 9 minutes
  • What Quantization Really Does • 9 minutes
  • Quantizing LLMs with AWQ & GPTQ • 8 minutes
  • Benchmarking: Speed, Accuracy Drop & Perplexity Shift • 9 minutes
3 readings • Total 90 minutes
  • The 2026 Deployment Engineer Role: What Companies Want • 30 minutes
  • Model Compression Strategies at Scale • 30 minutes
  • Choosing the Right Quantization Method for Real Deployment • 30 minutes
4 assignments • Total 150 minutes
  • Career Scope in Production AI & Edge Deployment • 30 minutes
  • Fundamentals of Model Compression • 30 minutes
  • INT8/INT4 Quantization (AWQ, GPTQ) • 30 minutes
  • Model Compression, Quantization & Latency Optimization • 60 minutes
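
As a rough illustration of the quantization idea this module covers, the sketch below applies plain symmetric per-tensor INT8 rounding to a handful of made-up weights and estimates weight memory at different bit widths. It is not AWQ or GPTQ, both of which use more elaborate, calibration-driven schemes. The function names, the sample weights, and the 7-billion-parameter figure are assumptions chosen for the example.

```python
# Toy symmetric per-tensor INT8 quantization (illustrative only; this is
# NOT the AWQ or GPTQ algorithm, just the basic round-to-grid idea).

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


def weight_memory_gib(num_params, bits_per_weight):
    """Weights-only estimate; ignores scales, zero points, and activations."""
    return num_params * bits_per_weight / 8 / 2**30


weights = [0.02, -1.27, 0.64, 0.0, 0.9]  # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Without clipping, rounding error is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f} quantized={q} max_error={max_error:.6f}")

# Hypothetical 7-billion-parameter model: about 13.0 GiB at 16 bits,
# about 3.3 GiB at 4 bits (weights only).
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gib(7_000_000_000, bits):.2f} GiB")
```

The error bound printed here is what the accuracy side of the accuracy–latency tradeoff comes down to at the level of a single tensor; real methods choose scales per group or per channel to keep that error small where it matters most.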

Master production-grade serving engines including vLLM with PagedAttention and NVIDIA Triton for scaling inference across GPUs and nodes.

What's included

9 videos, 3 readings, 4 assignments

9 videos • Total 68 minutes
  • What Breaks When Users Increase • 8 minutes
  • How Inference Servers Actually Work • 11 minutes
  • API Patterns for Inference • 7 minutes
  • Why KV Cache Limits Throughput • 7 minutes
  • Running a vLLM Server • 6 minutes
  • Handling Concurrent Requests Under Load • 6 minutes
  • When Triton Makes Sense • 7 minutes
  • Serving Vision Models with Triton • 8 minutes
  • Scaling Across GPUs • 9 minutes
3 readings • Total 90 minutes
  • From Training to Serving: What Changes in Architecture? • 30 minutes
  • PagedAttention Deep Dive & Performance Tuning • 30 minutes
  • Deployment Blueprints: GPU Clusters & Autoscaling Patterns • 30 minutes
4 assignments • Total 150 minutes
  • Serving Architectures Beyond Flask & Python Loops • 30 minutes
  • vLLM Internals (PagedAttention) • 30 minutes
  • NVIDIA Triton & Production Deployment Patterns • 30 minutes
  • Serving Architectures Beyond Flask & Python Loops • 60 minutes
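
To make the KV cache and PagedAttention topics above more concrete, here is a minimal sketch of two ideas: the per-token size of the KV cache, and on-demand allocation of fixed-size cache blocks through a per-sequence block table. This is a toy model of the concept, not vLLM's implementation. The class name, the block size, and the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are assumptions for illustration.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes for one token: a key and a value per layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


class PagedKVAllocator:
    """Toy allocator: fixed-size blocks handed out on demand per sequence."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        # A new block is needed only when the last one is full (or none exists).
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise RuntimeError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


# Hypothetical model shape, used only for the arithmetic.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")

alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("a")  # 20 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("b")  # 5 tokens need 1 block

used = sum(len(t) for t in alloc.block_tables.values())
print(f"blocks in use: {used} of 8")
alloc.free("a")
print(f"free blocks after releasing 'a': {len(alloc.free_blocks)}")
```

In this toy setup, reserving a contiguous 64-token region for each of the two sequences would tie up all 8 blocks, while on-demand paging uses 3. That gap is the kind of memory headroom that lets a server batch more concurrent requests.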

Export models to ONNX for interoperability, deploy LLMs on CPU and edge devices with Llama.cpp and GGUF, and build multimodal pipelines with CLIP and LLaVA.

What's included

14 videos, 3 readings, 4 assignments

14 videos • Total 111 minutes
  • Why ONNX Matters • 9 minutes
  • Exporting LLMs & Vision Models to ONNX - Part 1 • 7 minutes
  • Exporting LLMs & Vision Models to ONNX - Part 2 • 11 minutes
  • Speeding Up Inference with ONNX Runtime - Part 1 • 8 minutes
  • Speeding Up Inference with ONNX Runtime - Part 2 • 9 minutes
  • What GGUF Is & Why It Matters • 9 minutes
  • Running LLMs with Llama.cpp - Part 1 • 8 minutes
  • Running LLMs with Llama.cpp - Part 2 • 8 minutes
  • Benchmarking: Latency, Token Throughput & Memory • 9 minutes
  • How CLIP Connects Text & Images - Part 1 • 7 minutes
  • How CLIP Connects Text & Images - Part 2 • 9 minutes
  • Vision-Enhanced LLMs (LLaVA) • 3 minutes
  • Vision-Enhanced LLMs (LLaVA) - Part 2 • 6 minutes
  • Building a Simple Multimodal Pipeline • 9 minutes
3 readings • Total 90 minutes
  • ONNX Runtime Optimization Guide • 30 minutes
  • Edge LLM Deployment: Real-World Limitations & Solutions • 30 minutes
  • Multimodal Models: Practical Deployment Workflows • 30 minutes
4 assignments • Total 120 minutes
  • Exporting Models to ONNX • 30 minutes
  • Llama.cpp & GGUF for CPU/Edge Deployment • 30 minutes
  • Multimodal Inference (CLIP & LLaVA) • 30 minutes
  • ONNX, Llama.cpp & Edge / CPU Deployment • 30 minutes
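
For the CLIP portion of this module, the following sketch shows the scoring step of CLIP-style zero-shot matching: normalize the image and text embeddings, take cosine similarities, scale them, and apply a softmax over the candidate captions. The embeddings here are tiny hand-written vectors rather than real model outputs, and the `logit_scale` value and function names are assumptions for the example.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Cosine similarity between one image and each caption, then softmax."""
    img = normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, normalize(t)))
        for t in text_embs
    ]
    return softmax(logits)


# Hypothetical 3-dimensional embeddings; real encoders emit far larger vectors.
captions = ["a photo of a cat", "a photo of a dog"]
image_emb = [0.9, 0.1, 0.0]
text_embs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0]]

probs = zero_shot_probs(image_emb, text_embs)
best = captions[probs.index(max(probs))]
print(f"best caption: {best}")
```

In a real pipeline the two embedding lists would come from the image and text encoders of a CLIP model; only the similarity and softmax step is shown here.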

Apply all course concepts in a final project to quantize a fine-tuned model, serve it via vLLM, benchmark it, and package it for cloud and edge deployment.

What's included

10 videos, 3 readings, 4 assignments

10 videos • Total 75 minutes
  • Loading Your QLoRA/LoRA Fine-Tuned Model - Part 1 • 7 minutes
  • Loading Your QLoRA/LoRA Fine-Tuned Model - Part 2 • 6 minutes
  • Configure PEFT with LoRA • 6 minutes
  • Validating Quality vs Speed • 4 minutes
  • Load and Preprocess the Dataset • 9 minutes
  • Generate and Store Model Outputs Before Fine-Tuning • 9 minutes
  • Configure Training Arguments and Fine-Tune the Model • 8 minutes
  • Compare Model Outputs After Fine-Tuning • 9 minutes
  • Dockerizing the Service • 7 minutes
  • Running on Cloud, CPU & Edge • 9 minutes
3 readings • Total 90 minutes
  • Quantization Validation Checklist for Production • 30 minutes
  • API Design Patterns for Generative Models • 30 minutes
  • Deployment Benchmark Templates (LLM + Vision) • 30 minutes
4 assignments • Total 150 minutes
  • Preparing the Fine-Tuned Model for Deployment • 30 minutes
  • Building the Production API (vLLM) • 30 minutes
  • Benchmarking & Deployment Packaging • 30 minutes
  • Final Project - The Edge-Ready API (Quantize to Serve to Benchmark) • 60 minutes
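
The benchmarking step of the capstone can be pictured with a small harness like the one below, which times sequential requests and reports latency percentiles and token throughput. It is a sketch only: `fake_generate` is a hypothetical stand-in for a call to a served model, the nearest-rank percentile is one of several reasonable definitions, and the harness does not exercise concurrent load.

```python
import math
import time


def fake_generate(prompt, max_new_tokens=32):
    """Hypothetical stand-in for a model call; returns a list of tokens."""
    time.sleep(0.001)  # simulate a small amount of work
    return ["tok"] * max_new_tokens


def percentile(sorted_values, pct):
    """Nearest-rank percentile on an already sorted list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def benchmark(generate, prompts, warmup=2):
    """Time each request sequentially and summarize latency and throughput."""
    for p in prompts[:warmup]:  # warm-up calls are excluded from the stats
        generate(p)
    latencies, tokens = [], 0
    start = time.perf_counter()
    for p in prompts:
        t0 = time.perf_counter()
        out = generate(p)
        latencies.append(time.perf_counter() - t0)
        tokens += len(out)
    elapsed = time.perf_counter() - start
    latencies.sort()
    return {
        "requests": len(prompts),
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "tokens_per_s": tokens / elapsed,
    }


report = benchmark(fake_generate, ["hello"] * 10)
for key, value in report.items():
    print(f"{key}: {value}")
```

Swapping `fake_generate` for a function that calls the deployed API would let the same harness compare a quantized model against its full-precision baseline, provided both runs use the same prompts and token limits.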

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructor

Board Infinity
261 Courses • 428,186 learners

Why people choose Coursera for their career

Felipe M.

Learner since 2018
"To be able to take courses at my own pace and rhythm has been an amazing experience. I can learn whenever it fits my schedule and mood."

Jennifer J.

Learner since 2020
"I directly applied the concepts and skills I learned from my courses to an exciting new project at work."

Larry W.

Learner since 2021
"When I need courses on topics that my university doesn't offer, Coursera is one of the best places to go."

Chaitanya A.

"Learning isn't just about being better at your job: it's so much more than that. Coursera allows me to learn without limits."

Frequently asked questions

Yes, a working knowledge of deep learning, PyTorch, and LLM fundamentals is recommended. This course focuses on production deployment rather than training from scratch.

No. This is an advanced course. Learners should already understand model training, transformers, and basic MLOps concepts before starting.

It prepares you for roles like Inference Engineer, ML Deployment Engineer, Edge AI Developer, and MLOps Engineer focused on generative AI systems.

To access the course materials, assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can't afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you'll find a link to apply on the description page.