Key Learning Points from MLOps Specialization – Course 3

Main insights (with lecture notes) from the Machine Learning Engineering for Production (MLOps) Course by DeepLearning.AI & Andrew Ng

Kenneth Leung

Feb 3, 2022

MLOPS SPECIALIZATION SERIES

Photo by Ricardo Gomez Angel on Unsplash

Although machine learning (ML) and deep learning concepts are essential, possessing production engineering skills is equally (if not more) vital in solving real-world problems with data science.

DeepLearning.AI developed the MLOps Specialization course to share practical lessons on building and maintaining ML systems in production.

In this article, I summarize the lessons so that you can skip the hours of online videos while still gleaning the key insights.

(1) Overview (2) Key Lessons (3) Lecture Notes

This article covers Course 3 of the 4-course MLOps specialization. Follow this page to stay updated with content from subsequent courses.

(1) Overview

In the third course of the MLOps specialization, we focus on building models for different serving environments while managing modeling resources for optimal model inference.

We also explore the techniques and metrics to address model analysis, fairness, and interpretability.

Photo by Victor on Unsplash

(2) Key Lessons

PART 1 – Neural Architecture Search

What is Neural Architecture Search?

Neural Architecture Search (NAS) automates the design of neural network architectures (e.g., number of layers, type of activations, and connections).
The concept is similar to hyperparameter tuning: the goal is to find the architecture that performs best on the data.
NAS is a subfield of automated machine learning (AutoML).
Models designed with NAS can perform on par with, or even outperform, those designed by hand.
We can run NAS with libraries like Keras Tuner.

Overview of Neural Architecture Search (NAS) | Image by author

NAS has three parts: a search space, a search strategy, and a performance estimation strategy.

Search Space

The search space defines the possible components for building different architectures.
There are two types of architecture search spaces: macro and micro.

Two main types of search spaces | Image by author

The macro search space comprises individual layers (e.g., convolutional, pooling) and connection types, and the best model is found by stacking layers sequentially to form a chain-structured space.
In contrast, in a micro search space, NAS builds a neural network from cells where each cell is a smaller neural network.
The micro approach has been shown to have significant performance advantages over the macro approach.

Search Strategy

NAS explores the search space according to a specific strategy, which decides which architectures to test in order to find the most performant one.
Five common strategies are grid search, random search, Bayesian optimization, evolutionary algorithms, and reinforcement learning.

Performance Estimation

NAS depends on measuring the performance of the different architectures that it tries.
The most straightforward approach to estimating performance is to evaluate the validation accuracy of each architecture.
However, calculating validation accuracy can be computationally heavy given the large search spaces and complex networks.
Strategies to reduce computation costs include using lower fidelity estimates, learning curve extrapolation, and network morphism.
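
To make the three parts concrete, here is a minimal sketch of random search over a tiny chain-structured (macro) search space. The layer options and the `estimate_performance` function are hypothetical stand-ins chosen for illustration: a real NAS run would build, train, and validate each candidate network, or use one of the cheaper estimation strategies listed above.

```python
import random

# Hypothetical macro search space: each architecture is a chain of layers.
SEARCH_SPACE = {
    "num_layers": [1, 2, 3, 4],
    "units": [16, 32, 64, 128],
    "activation": ["relu", "tanh"],
}

def sample_architecture(rng):
    """Search strategy (random search): draw one option per dimension."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}

def estimate_performance(arch):
    """Placeholder score standing in for validation accuracy.

    Assumption for illustration only: a real implementation would build,
    train, and evaluate the network described by `arch`.
    """
    score = 0.5 + 0.05 * arch["num_layers"] + 0.001 * arch["units"]
    if arch["activation"] == "relu":
        score += 0.02
    return min(score, 1.0)

def random_search(num_trials, seed=0):
    """Sample architectures and keep the one with the best estimated score."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture(rng)
        score = estimate_performance(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

best_arch, best_score = random_search(num_trials=20)
print(best_arch, round(best_score, 3))
```

Swapping `sample_architecture` for a smarter proposal step (e.g., Bayesian optimization) changes the search strategy without touching the search space or the performance estimator.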

Strategies to reduce the cost of estimating architecture performance | Image by author

PART 2 – Model Resource Management

Issues with High Dimensionality

Although neural networks can learn to ignore features that carry no predictive information, this does not mean we should train our models with every feature thrown in.
Unwanted features consume compute resources, elevate storage costs, increase complexity in interpretation, introduce noise into data, and increase the risk of overfitting.
As we add more features, we need more processing power and more training data (the number of training examples required grows exponentially with the number of features).
A simple model fed with high-quality features will often outperform a sophisticated model fed with low-quality features.

Dimensionality Reduction Techniques

We want to keep as much predictive information as possible with as few features as possible.
Manual dimensionality reduction involves understanding the data and business context and leveraging domain knowledge to perform feature engineering and selection.
Beyond manual techniques, there are algorithmic approaches for dimensionality reduction, such as Linear Discriminant Analysis (LDA), Partial Least Squares (PLS), and Principal Component Analysis (PCA).
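
As an illustration of the algorithmic route, the sketch below runs a bare-bones PCA on toy data using only the Python standard library. It finds the first principal component with power iteration and projects two correlated features onto a single one. The data and helper names are made up for this example; in practice we would reach for a library implementation instead.

```python
import math

def mean_center(rows):
    """Subtract the per-feature mean from every row."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, dims = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]

def first_principal_component(cov, iterations=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    dims = len(cov)
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec

# Toy data: two strongly correlated features (the second is roughly 2x the first).
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
centered = mean_center(data)
component = first_principal_component(covariance_matrix(centered))

# Project each 2-D row onto the component to get a single feature per row.
reduced = [sum(x * c for x, c in zip(row, component)) for row in centered]
print(component, reduced)
```

Because the two features move together, one projected feature retains almost all of the variance of the original pair.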

Need for Model Optimization

Photo by BENCE BOROS on Unsplash

As mobile, IoT, and edge devices become ubiquitous, there is a growing need to move ML capabilities from the cloud onto devices. This means we need to optimize both our models’ performance and their resource requirements.
On-device model inference involves loading the trained model into the device application, and this offers improved speed and independence from network connectivity.
Frameworks to deploy models to mobile applications include ML Kit, CoreML, and TensorFlow Lite.

Quantization

Quantization is a model optimization technique that transforms a model into an equivalent representation using lower-precision parameters and computations.
An example is to use fewer bits to represent the pixels of an image.
Although quantization may reduce model accuracy, it improves execution performance and efficiency by shrinking the neural network’s size, reducing the computational resources needed, and decreasing latency.
We can quantize the weight parameters and activation computations in neural networks by converting 32-bit floating-point values to 8-bit integers.

Reduction in precision after quantization | Image by author

Quantization can be done during model training (quantization-aware training) or after it (post-training quantization).
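
The float-to-8-bit conversion can be sketched in a few lines. The example below uses a common affine scheme (a scale plus a zero-point) on a handful of made-up weights. It illustrates the idea only and is not how any particular framework implements quantization.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers with an affine (scale + zero-point) scheme."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)        # make sure 0.0 is representable
    scale = (hi - lo) / (qmax - qmin) or 1.0   # fall back to 1.0 if all values are 0
    zero_point = round(qmin - lo / scale)
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point

def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the integer representation."""
    return [(q - zero_point) * scale for q in quantized]

weights = [-1.20, -0.35, 0.0, 0.48, 2.05]   # made-up 32-bit float weights
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, max_error)
```

Each weight now fits in one byte instead of four, and the rounding error per weight stays within half of `scale`, which is the precision loss the accuracy trade-off refers to.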

Best Model Selection

If high accuracy is not required, it is better to use smaller, less complex models since embedded devices have limited computational resources.
One example is MobileNets, a family of models optimized for computer vision applications on mobile devices.

Pruning

Pruning is an optimization technique that improves model efficiency by removing parts that do not contribute substantially to producing accurate results.

Example of network pruning | Image by author

TensorFlow has a weight pruning API designed to iteratively remove connections based on magnitude during training.
Weight pruning is compatible with quantization, leading to compounded benefits in model optimization.
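
To show what "removing connections based on magnitude" means, here is a one-shot sketch in plain Python. It is not the TensorFlow API: the function name and the weights are hypothetical, and a real pruning schedule would increase sparsity gradually during training rather than in a single pass.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    num_to_prune = int(len(weights) * sparsity)
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:num_to_prune])
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]

weights = [0.80, -0.02, 0.15, -0.60, 0.01, 0.33, -0.05, 0.90]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

The surviving weights are the four with the largest absolute values, and the zeros make the layer sparse, which is what allows it to be stored and compressed more efficiently.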

PART 3 – High-Performance Modeling

Distributed Training

As we deal with larger datasets and bigger models, we need distributed approaches in model training.
The two types of distributed training are data parallelism and model parallelism.
Data parallelism replicates the model onto different accelerators (GPU or TPU) and splits the data between them.
Model parallelism divides a large model (too big to fit on a single device) into partitions and assigns them to different accelerators.
Data parallelism is easier to implement than model parallelism and is also model-agnostic and applicable to any neural architecture.

Illustration of data parallelism | Image by author

Data parallelism can be synchronous (all workers train on their own shard and apply updates in lockstep) or asynchronous (each worker trains and updates the shared weights independently, without waiting for the others).
To perform distributed training in TensorFlow, we can use the tf.distribute.Strategy API.
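
The sketch below simulates synchronous data parallelism in plain Python, without any distribution library. Two "workers" hold identical copies of a one-parameter model, compute gradients on their own data shards, and apply the averaged gradient. The model, data, and learning rate are assumptions made for the example.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def synchronous_step(w, shards, lr):
    """One synchronous update across all simulated workers."""
    # Each worker computes a gradient on its own shard with the same weights.
    worker_grads = [gradient(w, shard) for shard in shards]
    # All-reduce step: average the gradients, then apply one shared update.
    avg_grad = sum(worker_grads) / len(worker_grads)
    return w - lr * avg_grad

data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # generated from y = 3x
shards = [data[:2], data[2:]]  # two simulated workers, equal shard sizes

w = 0.0
for _ in range(100):
    w = synchronous_step(w, shards, lr=0.05)
print(round(w, 4))
```

With equally sized shards, the averaged gradient matches the gradient over the full batch, so the replicas stay identical after every step.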

High-Performance Ingestion

Accelerators (GPU/TPU) are vital for high-performance modeling, but they are expensive and must be used efficiently.
This efficiency comes from supplying accelerators with data fast enough that they do not sit idle, which shortens training time.
We can optimize input pipeline (aka ETL process) performance with approaches such as prefetching, caching, memory reduction, and parallelization of data extraction and transformation.

Pipelining helps use the available hardware efficiently and reduces the time required to load and pre-process data | Image by author

With pipelining, we can overcome CPU bottlenecks by overlapping CPU pre-processing with model execution on the accelerators.
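
Here is a minimal sketch of that overlap using a background thread and a bounded queue as the prefetch buffer. The `preprocess` and `train_step` functions are hypothetical placeholders for CPU-side work and accelerator-side work. In TensorFlow, the tf.data API provides this kind of prefetching for us.

```python
import queue
import threading

def preprocess(raw):
    """Stand-in for CPU-side work such as decoding or augmentation."""
    return raw * 2

def producer(raw_items, buffer):
    """Prepare batches in the background and push them into the buffer."""
    for item in raw_items:
        buffer.put(preprocess(item))   # blocks when the buffer is full
    buffer.put(None)                   # sentinel: no more data

def train_step(batch):
    """Stand-in for the accelerator-side model execution."""
    return batch + 1

def run_pipeline(raw_items, prefetch_size=2):
    buffer = queue.Queue(maxsize=prefetch_size)
    worker = threading.Thread(target=producer, args=(raw_items, buffer))
    worker.start()
    results = []
    while True:
        batch = buffer.get()           # the next batch is often already prepared
        if batch is None:
            break
        results.append(train_step(batch))
    worker.join()
    return results

print(run_pipeline([1, 2, 3, 4]))
```

While `train_step` is busy with one batch, the producer thread is already preparing the next ones, up to `prefetch_size` batches ahead.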

Pipeline Parallelism

We have seen model sizes (e.g., BigGAN, BERT, GPT-3) grow larger to improve performance in recent years.
The gap between model growth and hardware improvement has increased the importance of parallelism.
These larger models have brought about new problems in data parallelism (limited memory of accelerators) and model parallelism (underutilization of accelerator compute capacity).
These problems have led to the development of pipeline parallelism.
Pipeline parallelism enables efficient training of giant models by partitioning a model across multiple accelerators and automatically splitting a mini-batch of training data into even smaller micro-batches.
Two pipeline parallelism frameworks (integrating data and model parallelism) are Google’s GPipe and Microsoft’s PipeDream.
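
To see why micro-batches help, the sketch below builds a simplified forward-pass schedule for a model split into stages. It is a toy illustration of the idea rather than the scheduling logic of GPipe or PipeDream. The function names are hypothetical, and the mini-batch size is assumed to be divisible by the number of micro-batches.

```python
def split_into_micro_batches(mini_batch, num_micro_batches):
    """Split a mini-batch into equally sized micro-batches."""
    size = len(mini_batch) // num_micro_batches
    return [mini_batch[i * size:(i + 1) * size] for i in range(num_micro_batches)]

def forward_schedule(num_stages, num_micro_batches):
    """Return, per time step, the (stage, micro_batch) pairs that run together.

    Micro-batch m reaches stage s at time step s + m.
    """
    total_steps = num_stages + num_micro_batches - 1
    schedule = []
    for t in range(total_steps):
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_micro_batches]
        schedule.append(active)
    return schedule

micro_batches = split_into_micro_batches(list(range(8)), num_micro_batches=4)
schedule = forward_schedule(num_stages=3, num_micro_batches=4)
for t, active in enumerate(schedule):
    print(t, active)
```

With 3 stages and 4 micro-batches, the 12 units of work finish in 6 time steps because several stages are busy at once. If each stage had to wait for the previous micro-batch to clear the whole model, the same work would take 12 steps with only one accelerator active at a time.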

Pipeline parallelism allows for more efficient training | Image by author

Knowledge Distillation

The idea behind knowledge distillation is to create a simple ‘student’ model that learns from a more complex ‘teacher’ model.
The goal is to transfer the performance of a complex model to a simpler, more efficient one.
For example, DistilBERT is a distilled version of BERT that has 40% fewer parameters and runs 60% faster while preserving 97% of BERT’s performance on the GLUE language understanding benchmark.
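
One common way to make the student "learn from" the teacher is to train it to match the teacher's softened output probabilities. The sketch below computes such a distillation loss as a KL divergence with a temperature. This formulation is an assumption on my part, since the notes above do not specify the loss, and the logits are made up. In practice it is usually combined with the regular loss on the true labels.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student))

teacher_logits = [4.0, 1.0, 0.2]
close_student = [3.5, 1.2, 0.1]   # predictions similar to the teacher's
far_student = [0.1, 3.0, 2.0]     # predictions that disagree with the teacher

print(distillation_loss(close_student, teacher_logits))
print(distillation_loss(far_student, teacher_logits))
```

The loss is near zero when the student's distribution matches the teacher's and grows as they diverge, which gives the student a richer training signal than hard labels alone.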

PART 4 – Model Analysis

Aggregate vs. Sliced Metrics

After training and deploying a model, the next phase is to evaluate its performance.
We usually monitor top-level, aggregate metrics that assess performance across the entire dataset (e.g., overall accuracy), but this often hides specific problems around performance and fairness.
We need to slice the data to understand how the model performs at a granular level on individual data subsets.
Choosing important slices to analyze is usually based on domain knowledge.

A model with good performance on average may exhibit failure modes not apparent from the aggregate metric | Image by author

For example, customers across different age groups may experience the output of the model very differently.
TensorFlow Model Analysis (TFMA) is an open-source framework for deep analysis of model performance, including analyzing performance on data slices.
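
The gap between aggregate and sliced metrics is easy to demonstrate without any framework. The sketch below uses made-up evaluation records and plain Python rather than the TFMA API, and the `age_group` slices are hypothetical.

```python
from collections import defaultdict

def accuracy(records):
    """Fraction of records where the prediction matches the label."""
    return sum(r["label"] == r["prediction"] for r in records) / len(records)

def sliced_accuracy(records, slice_key):
    """Accuracy computed separately for each value of `slice_key`."""
    slices = defaultdict(list)
    for r in records:
        slices[r[slice_key]].append(r)
    return {value: accuracy(rows) for value, rows in slices.items()}

# Hypothetical evaluation records, sliced by age group.
records = [
    {"age_group": "18-30", "label": 1, "prediction": 1},
    {"age_group": "18-30", "label": 0, "prediction": 0},
    {"age_group": "18-30", "label": 1, "prediction": 1},
    {"age_group": "31-50", "label": 0, "prediction": 0},
    {"age_group": "31-50", "label": 1, "prediction": 1},
    {"age_group": "31-50", "label": 0, "prediction": 0},
    {"age_group": "51+", "label": 1, "prediction": 0},
    {"age_group": "51+", "label": 0, "prediction": 1},
]
print(accuracy(records))                       # aggregate metric
print(sliced_accuracy(records, "age_group"))   # per-slice metrics
```

The aggregate accuracy of 0.75 looks acceptable, yet the per-slice view shows that every prediction for the "51+" group is wrong.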

Model Robustness

Beyond model performance, we should also evaluate model robustness.
A model is considered robust if its results are consistently accurate, even if one or more features change relatively drastically. It should not produce wildly different and unpredictable results with data changes.
The metrics for assessing robustness are the same ones we use for training, e.g., RMSE for regression models and AUC for classification.

Model Debugging

Model debugging is an emerging discipline focused on finding and fixing problems in models and improving model robustness.
Its goals include improving model transparency, preventing harmful social discrimination, reducing vulnerabilities to adversarial attacks or privacy harms, and avoiding model decay.
The three most popular debugging techniques are benchmarking models, sensitivity analysis, and residual analysis.
Two open-source libraries for assessing adversarial attack vulnerability are Cleverhans and Foolbox.
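
As a small illustration of sensitivity analysis, the sketch below perturbs one input feature at a time and records how much the prediction moves. The linear `model` and the feature names are hypothetical. With a real model, we would pass its prediction function instead.

```python
def model(features):
    """Hypothetical scoring model used only for illustration."""
    return 0.7 * features["income"] + 0.2 * features["age"] - 0.1 * features["debt"]

def sensitivity(predict, example, feature, relative_change=0.10):
    """Change in prediction when one feature is scaled by (1 + relative_change)."""
    perturbed = dict(example)
    perturbed[feature] = example[feature] * (1 + relative_change)
    return predict(perturbed) - predict(example)

example = {"income": 50.0, "age": 40.0, "debt": 10.0}
report = {name: sensitivity(model, example, name) for name in example}
print(report)
```

A feature whose small perturbation causes a disproportionately large swing in the output is a good candidate for closer inspection when assessing robustness.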

Continuous Evaluation and Monitoring of Data Drift and Shift

Training data only represents a snapshot of the world when the data is collected and labeled, so model performance is affected as the world changes over time.
It is essential to continuously monitor data and model performance to get early warnings when data drift and shifts occur. These drifts and shifts include concept drift, concept emergence, covariate shift, and prior probability shift.
Supervised techniques for monitoring include statistical process control, sequential analysis (e.g., Linear Four Rates), and error distribution monitoring (e.g., Adaptive Windowing, or ADWIN).
Unsupervised techniques include clustering/novelty detection (e.g., OLINDDA, MINAS), feature distribution monitoring, and model-dependent monitoring (e.g., Margin Density Drift Detection).
Leading cloud providers offer services for continuous evaluation, such as Microsoft Azure Machine Learning DataSense, Amazon SageMaker Model Monitor, and Google Cloud AI Continuous Evaluation.
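
Feature distribution monitoring can be as simple as comparing the training-time distribution of a feature with what the model sees in production. The sketch below uses the two-sample Kolmogorov-Smirnov statistic for that comparison. This is one possible choice that I picked for illustration and is not named in the course notes. The data and the alerting threshold are made up.

```python
def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical cumulative distribution functions of the two samples."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    points = sorted(set(reference) | set(current))
    return max(abs(ecdf(reference, x) - ecdf(current, x)) for x in points)

training_feature = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.1]
serving_same = [1.1, 0.9, 1.0, 1.2, 1.0, 0.9, 1.2, 1.1]
serving_shifted = [1.9, 2.1, 2.0, 2.2, 1.8, 2.0, 2.3, 2.1]

DRIFT_THRESHOLD = 0.5  # hypothetical alerting threshold, tune per feature
for name, sample in [("same", serving_same), ("shifted", serving_shifted)]:
    stat = ks_statistic(training_feature, sample)
    print(name, round(stat, 3), "drift" if stat > DRIFT_THRESHOLD else "ok")
```

A check like this needs no labels, which is why it fits the unsupervised family of techniques: it flags a shift in the inputs before any drop in accuracy can be measured.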

PART 5 – Interpretability and Explainability

Importance of Explainable AI

Model interpretability and explainability are crucial in production ML for reasons like fairness, regulatory and legal requirements, and understanding our model better to improve it.
Interpretability and explainability are parts of a larger field known as responsible AI.

Responsible AI encompasses several components | Image by author

Model Interpretability

A model is interpretable if we can query the model to answer the following:

Why did the model behave in a certain way?
How can we trust the predictions made by the model?
What information can the model provide to avoid prediction errors?

While complex models (e.g., neural networks) often achieve high accuracy, this usually comes at the price of interpretability (aka the interpretability vs. accuracy trade-off).
Inherently interpretable models are classic ones like tree-based (e.g., decision trees) and linear models (e.g., linear regression).
Although we cannot always work with intrinsically interpretable models, there are model-agnostic methods for interpreting the results of any model.

Examples of model-agnostic methods for interpretability | Image by author

Popular methods include partial dependence plots (PDP), permutation feature importance, SHapley Additive exPlanations (SHAP), and Local Interpretable Model-agnostic Explanations (LIME).
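
Of these methods, permutation feature importance is the easiest to sketch from scratch: shuffle one feature column, re-score the model, and see how much the metric drops. The `model` below is a hypothetical stand-in for any trained model (it only needs a prediction function), and the data is made up.

```python
import random

def model(row):
    """Hypothetical trained model: depends on feature 0, ignores feature 1."""
    return 1 if row[0] > 0.5 else 0

def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, feature, repeats=20, seed=0):
    """Average drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with the shuffled value in place of the original one.
        shuffled = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(predict, shuffled, labels))
    return sum(drops) / repeats

rows = [[0.1, 5.0], [0.9, 3.0], [0.2, 8.0], [0.8, 1.0], [0.3, 7.0], [0.7, 2.0]]
labels = [0, 1, 0, 1, 0, 1]
for feature in (0, 1):
    print(feature, permutation_importance(model, rows, labels, feature))
```

Shuffling feature 1 leaves accuracy unchanged because the model never uses it, while shuffling feature 0 hurts accuracy, so the method ranks feature 0 as the important one without looking inside the model.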

(3) Lecture Notes

As a token of appreciation, here’s the GitHub repo with the PDF lecture notes compiled from the slides and transcripts. Give the repo a star to stay updated with notes from subsequent courses.

Ready for more? Check out the next course summary here:

Key Learning Points from MLOps Specialization – Course 4

You can find the summaries from the previous two courses here:

Key Learning Points from MLOps Specialization – Course 1

Key Learning Points from MLOps Specialization – Course 2

Before You Go

I welcome you to join me on a data science learning journey! Follow this Medium page and check out my GitHub to stay in the loop of more exciting data science content. Meanwhile, have fun building production ML systems!

URL: https://towardsdatascience.com/key-learning-points-from-mlops-specialization-course-3-9e67558212ee/