Key Learning Points from MLOps Specialization – Course 2

Insights summary (with lecture notes) of the Machine Learning Engineering for Production (MLOps) Course by DeepLearning.AI & Andrew…

Kenneth Leung

Nov 15, 2021

MLOps Specialization Series

Photo by Quinten de Graaf on Unsplash

While machine learning (ML) concepts are essential, production engineering capabilities are the key to deploying and delivering value from ML models in the real world.

DeepLearning.AI and Coursera recently developed the MLOps Specialization course to share how to conceptualize, build, and maintain integrated ML systems.

In this article, I summarize the lessons so that you can skip the long lectures while still gaining key insights.

(1) Course 2 Overview
(2) Key Lessons
(3) PDF Lecture Notes

Photo by Green Chameleon on Unsplash

This article covers Course 2 of the 4-course MLOps specialization. Follow this Medium page to stay updated with content from subsequent courses.

Course 2 Overview

This second course (Machine Learning Data Lifecycle in Production) focuses on production ML, defined as the deployment of ML systems in production environments.

Production ML combines ML and modern software development because real-world ML solutions require more than training accurate algorithms.

Production ML is more than just training machine learning models | Image by author

The goal of data practitioners should be to build integrated ML systems that constantly operate in production, automatically ingest and retrain on continuously changing data, and optimize for computation costs.

This course covers three key components of a production ML lifecycle:

Collecting, Labeling, and Validating Data
Feature Engineering, Transformation, and Selection
Data Journey and Data Storage

Photo by SELİM ARDA ERYILMAZ on Unsplash

Key Lessons

In the spirit of the course’s emphasis on practical application, the takeaways focus on pragmatic advice. The insights are organized by the three lifecycle components mentioned earlier.

(1) Collecting, Labeling, and Validating Data

End-to-end ML platforms are vital for deploying production ML pipelines. The team at Google uses the open-source TensorFlow Extended (TFX) for production ML.

TensorFlow Extended framework | Image used under Apache License

A TFX pipeline is a sequence of components designed for scalable, high-performance ML tasks.

The sequence of components in an end-to-end production ML platform | Image by author

The first two components (Data Ingestion and Data Validation) relate to data collection, labeling, and validation tasks.

(i) Data Collection

The goal of data teams is to translate user needs into data problems, so the first thing to evaluate is the data itself. Here is a list of critical questions to ask:

What kind of/how much data is available?
What are the details of the data?
- Is the data annotated? If not, how hard/expensive is it to get it labeled?
- How often do new data come in/get refreshed?
- Are data sources monitored for system issues and outages?
What are the predictive features? Does the dataset contain features with predictive value?
What are the labels to be tracked?
What are the metrics for measuring model performance?
How is the quality of the data? Are there inconsistent data formats (e.g., mixed types) and outliers that affect model performance?
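
To make the last question more concrete, here is a minimal sketch of two data quality checks using only the Python standard library: one flags columns that hold more than one type, and the other flags outliers with Tukey's IQR rule. The function names, the sample records, and the choice of the IQR rule are my own assumptions for illustration; they are not taken from the course.

```python
from statistics import quantiles


def mixed_type_columns(rows):
    """Return {column: set of type names} for columns holding more than one type."""
    seen = {}
    for row in rows:
        for column, value in row.items():
            if value is not None:
                seen.setdefault(column, set()).add(type(value).__name__)
    return {column: types for column, types in seen.items() if len(types) > 1}


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical records: "age" mixes int and str, "income" has one extreme value.
records = [
    {"age": 34, "income": 52_000},
    {"age": "41", "income": 48_000},
    {"age": 29, "income": 51_000},
    {"age": 37, "income": 50_000},
    {"age": 45, "income": 49_000},
    {"age": 52, "income": 950_000},
]

mixed = mixed_type_columns(records)
outliers = iqr_outliers([r["income"] for r in records])
print(mixed, outliers)
```

In practice, checks like these would run on every new batch of data rather than once.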

(ii) Data Labeling

Labels are essential because most business use cases rely on supervised learning, which requires labeled data.
There are various labeling methods; the two most common are process feedback and human labeling.

Two methods of data labeling | Image by author

Process feedback continuously creates new training data from signals found in system log files, e.g., click-through logs that record whether a customer clicked or did not click. Its drawback is that few scenarios make this possible.
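
As a rough sketch of process feedback, the snippet below derives click-through labels by matching impression logs against click logs. The log fields (`user_id`, `item_id`) and the function name are hypothetical; real logging schemas will differ.

```python
def labels_from_click_logs(impressions, clicks):
    """Label each impression 1 if a matching click was logged, else 0."""
    clicked = {(c["user_id"], c["item_id"]) for c in clicks}
    return [
        {**imp, "label": int((imp["user_id"], imp["item_id"]) in clicked)}
        for imp in impressions
    ]


# Hypothetical log extracts: three items shown, one of them clicked.
impressions = [
    {"user_id": "u1", "item_id": "a"},
    {"user_id": "u1", "item_id": "b"},
    {"user_id": "u2", "item_id": "a"},
]
clicks = [{"user_id": "u1", "item_id": "b"}]

labeled = labels_from_click_logs(impressions, clicks)
print(labeled)
```
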
Human labeling is a more standard approach, where we pass unlabeled data to human labelers (aka raters) to examine and assign labels manually.
While human labeling is straightforward, several problems can arise:

Recruitment can be expensive, especially if the project requires specialist labeling, e.g., radiologists to label X-ray images
Labeling can be slow if the number of raters is small
Quality issues can arise if there are differences in labeling standards across the raters

We can improve labeling consistency by creating clear instructions to guide raters and promoting the active resolution of labeling conflicts.
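
One simple way to surface labeling conflicts for resolution is to measure how strongly raters agree on each item. The sketch below uses a majority vote and flags items without full agreement; the threshold, function names, and sample ratings are assumptions for illustration, not a method prescribed by the course.

```python
from collections import Counter


def majority_label(votes):
    """Return (label, agreement), where agreement is the share of raters who chose it."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def flag_conflicts(ratings, min_agreement=1.0):
    """Return ids of items whose rater agreement falls below min_agreement."""
    return [
        item_id
        for item_id, votes in ratings.items()
        if majority_label(votes)[1] < min_agreement
    ]


# Hypothetical ratings from three raters per image.
ratings = {
    "xray_001": ["normal", "normal", "normal"],
    "xray_002": ["normal", "pneumonia", "pneumonia"],
}

conflicts = flag_conflicts(ratings)
print(conflicts)
```

Flagged items can then be sent back to the raters or to a specialist for a final decision.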

(iii) Data Validation

The performance of production ML systems can degrade over time due to continuous changes in real-world data.
There needs to be a data validation workflow in place to detect these changes before they degrade model performance.
TensorFlow Data Validation (TFDV) is a highly scalable library that helps developers maintain the health of ML pipelines by understanding, validating, and monitoring data at scale.

Showing where TensorFlow Data Validation fits in the ML pipeline | Image used under Apache License

Here are the critical data drift and skew issues to validate:

Concept Drift: Changes in the relationship (aka mapping) between the input and output variables over time
Schema Skew: Training and serving data do not conform to the same schema (e.g., due to different data types present)
Distribution Skew: The distributions of serving and training data are significantly different (e.g., due to seasonality changes over time). This skew comprises the more granular issues of dataset shift and covariate shift.
Feature Skew: Training feature values are different from the serving feature values (e.g., due to transformation applied only on the training set)

Example of a skew detection workflow | Image by author
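
To illustrate two of these checks, here is a minimal standard-library sketch that detects schema skew (by comparing observed types per column) and distribution skew on a categorical feature (by the L-infinity distance between the training and serving frequencies). This is not the TFDV API; the function names, the sample rows, and the 0.1 threshold are hypothetical.

```python
from collections import Counter


def infer_schema(rows):
    """Map each column to the set of Python type names observed in it."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            schema.setdefault(column, set()).add(type(value).__name__)
    return schema


def schema_skew(train_rows, serving_rows):
    """Return columns whose presence or observed types differ between datasets."""
    train, serving = infer_schema(train_rows), infer_schema(serving_rows)
    return sorted(c for c in train.keys() | serving.keys() if train.get(c) != serving.get(c))


def linf_distance(train_values, serving_values):
    """Largest absolute difference in relative frequency across all categories."""
    p, q = Counter(train_values), Counter(serving_values)
    n_p, n_q = len(train_values), len(serving_values)
    return max(abs(p[v] / n_p - q[v] / n_q) for v in p.keys() | q.keys())


# Hypothetical data: serving sends "age" as strings and has a different country mix.
train = [
    {"country": "SG", "age": 30},
    {"country": "SG", "age": 41},
    {"country": "MY", "age": 25},
    {"country": "SG", "age": 38},
]
serving = [
    {"country": "MY", "age": "33"},
    {"country": "MY", "age": "27"},
    {"country": "SG", "age": "45"},
    {"country": "MY", "age": "52"},
]

skewed_columns = schema_skew(train, serving)
distance = linf_distance(
    [r["country"] for r in train], [r["country"] for r in serving]
)
SKEW_THRESHOLD = 0.1  # hypothetical alerting threshold
has_distribution_skew = distance > SKEW_THRESHOLD
print(skewed_columns, distance, has_distribution_skew)
```
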

(2) Feature Engineering, Transformation, and Selection

(i) Feature Engineering

An essential purpose of feature engineering in production ML is to reduce the computing resources required, which is done by concentrating predictive information in fewer features.
Inconsistencies in feature engineering can introduce training-serving skew, leading to poor model performance at serving time. These inconsistencies arise from:

Different training and serving code paths (e.g., training in Python but serving in Java), which result in different transformations between the two
Diverse deployment scenarios (e.g., model deployed in different environments like mobile, web, and server)

(ii) Feature Transformation

Feature transformations occur with two types of granularity:

Instance-level: Involves just the instance (aka one row of data). Examples include multiplication (e.g., squaring a feature) and clipping (e.g., set a non-negative boundary by changing negative values to 0)
Full-pass: Involves the entire dataset. Examples include standard scaling, min-max scaling, and binning.
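
The difference between the two granularities can be sketched in a few lines of standard-library Python. Clipping needs only the value in front of it, whereas standard scaling first needs a pass over the whole training column to compute its mean and standard deviation, which are then reused unchanged at serving time. The function names and sample values are assumptions for illustration.

```python
from statistics import mean, pstdev


def clip_non_negative(x):
    """Instance-level: needs only the value itself."""
    return max(x, 0.0)


def fit_standard_scaler(values):
    """Full-pass: one pass over the whole training column to get mean and std."""
    return {"mean": mean(values), "std": pstdev(values)}


def standard_scale(x, stats):
    """Apply previously fitted statistics to a single value."""
    return (x - stats["mean"]) / stats["std"]


# Hypothetical training column.
train_column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = fit_standard_scaler(train_column)

clipped = clip_non_negative(-3.2)
# At serving time, reuse the training statistics instead of recomputing them.
scaled_serving_value = standard_scale(9.0, stats)
print(clipped, stats, scaled_serving_value)
```

Reusing the fitted statistics for serving data is one way to avoid the feature skew described earlier.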

Transformations can be performed at different timepoints:

Transform training data before feeding into the model
Transform within the model

Each timepoint has pros and cons, which are essential considerations for production costs and efficiency.

Pros and cons of different transformation timepoints | Image by author

TensorFlow Transform (TFT) is a helpful library for preprocessing and transforming data, and such frameworks are essential for processing large datasets in an efficient and distributed manner.

Showing where TensorFlow Transform fits in the ML pipeline | Image used under Apache License

(iii) Feature Selection

There are three main categories of methods for supervised feature selection: Filter, Wrapper, and Embedded.

Feature selection methods | Image by author

Besides using performance metrics (e.g., F1 score, AUC) for method evaluation, one should evaluate the number of features (aka feature count) after applying these methods.
The ideal scenario is when the performance metric is maximized and the feature count is minimized.
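
As an example of the Filter category, the sketch below ranks features by the absolute Pearson correlation with the target and keeps the top k. Pearson correlation is only one possible filter criterion, and the feature names and values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def select_top_k(features, target, k):
    """Filter method: keep the k features most correlated with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical dataset with one informative, one noisy, and one constant feature.
features = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [9, 7, 8, 7, 9],
    "constant": [1, 1, 1, 1, 1],
}
target = [52, 58, 65, 70, 79]

selected = select_top_k(features, target, k=1)
print(selected)
```

Comparing a model trained on `selected` against one trained on all features shows whether the smaller feature count costs any performance.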

(3) Data Journey and Data Storage

(i) Data provenance

Data provenance (aka lineage) is the tracking of the series of transformations in the evolution of data and models from raw input to output artifacts.
Understanding the data journey (and thus data provenance) is vital for debugging and reproducibility. If not tracked, it becomes infeasible to recreate, compare, or explain ML models.
Every pipeline run generates useful metadata containing information about the pipeline executions, training runs, and resulting artifacts.
ML Metadata (MLMD) is a library for recording and retrieving metadata associated with production pipeline runs.

A high-level overview of ML Metadata components | Image used under Apache License
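
To show the idea behind lineage tracking, here is a toy in-memory store that records which run produced each artifact and walks that record backwards from a model to its raw input. It is a conceptual sketch only and does not use the MLMD API; the class, method names, and artifact names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LineageStore:
    """Toy store: each output artifact points to the run that produced it."""
    runs: dict = field(default_factory=dict)         # run_id -> {"step", "inputs"}
    produced_by: dict = field(default_factory=dict)  # artifact -> run_id

    def record_run(self, run_id, step, inputs, outputs):
        self.runs[run_id] = {"step": step, "inputs": list(inputs)}
        for artifact in outputs:
            self.produced_by[artifact] = run_id

    def trace(self, artifact):
        """Return the steps that led to `artifact`, most recent first."""
        steps = []
        pending = [artifact]
        while pending:
            current = pending.pop()
            run_id = self.produced_by.get(current)
            if run_id is None:
                continue  # raw input with no recorded producer
            run = self.runs[run_id]
            steps.append(run["step"])
            pending.extend(run["inputs"])
        return steps


# Hypothetical three-step pipeline run.
store = LineageStore()
store.record_run("run-1", "ingest", ["raw.csv"], ["examples"])
store.record_run("run-2", "transform", ["examples"], ["features"])
store.record_run("run-3", "train", ["features"], ["model"])

lineage = store.trace("model")
print(lineage)
```
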

(ii) Data Versioning

Managing data pipelines is a challenge as data evolves over the different training runs of a project lifecycle.
While we are familiar with code versioning (e.g., Git) and environment versioning (e.g., Docker), data versioning is equally important and plays a crucial role in data provenance.
Data versioning tools are becoming available; examples include DVC and Git LFS.
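
A core idea behind data versioning is to identify a dataset by a hash of its content, so that any change produces a new version id. The sketch below illustrates that idea with SHA-256 from the standard library; it does not show how DVC or Git LFS are invoked, and the function name and the 12-character truncation are my own choices.

```python
import hashlib


def dataset_version(rows):
    """Derive a short version id from the dataset's content (order-sensitive)."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:12]


# Hypothetical CSV rows: the second dataset adds one record.
v1 = dataset_version(["id,label", "1,cat", "2,dog"])
v1_again = dataset_version(["id,label", "1,cat", "2,dog"])
v2 = dataset_version(["id,label", "1,cat", "2,dog", "3,bird"])
print(v1, v2)
```

Storing the version id next to each training run links a model back to the exact data it was trained on.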

Diagram flow of how DVC works (along with Git) | Image used under Apache License

(iii) Feature Stores

A feature store is a central repository for documented, curated, and access-controlled data features that teams can share, discover and use for model training and serving.
Feature stores reduce redundant work since many modeling problems use identical or similar features.
The goal is to provide a unified, consistent, and persistent means of managing data features that is both performant and scalable.

Feature stores help store engineered features for subsequent model development | Image by author
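
The following toy class sketches the essentials of that definition: features are registered with documentation, written once, and read through the same call for both training and serving. It is an in-memory illustration under my own assumptions, not the interface of any real feature store product; all names are hypothetical.

```python
class FeatureStore:
    """Toy in-memory feature store keyed by (feature name, entity id)."""

    def __init__(self):
        self._values = {}
        self._docs = {}

    def register(self, name, description):
        """Document a feature so other teams can discover it."""
        self._docs[name] = description

    def write(self, name, entity_id, value):
        if name not in self._docs:
            raise KeyError(f"feature {name!r} is not registered")
        self._values[(name, entity_id)] = value

    def read(self, names, entity_id):
        """Return one feature vector; the same call serves training and serving."""
        return {name: self._values[(name, entity_id)] for name in names}

    def catalog(self):
        return dict(self._docs)


# Hypothetical feature shared by several models.
store = FeatureStore()
store.register("avg_basket_7d", "Mean basket value over the past 7 days")
store.write("avg_basket_7d", "customer_42", 31.5)

row = store.read(["avg_basket_7d"], "customer_42")
print(row, store.catalog())
```
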

(iv) Data Warehouses vs. Databases vs. Data Lakes

There are several leading data storage solutions, namely databases, data warehouses, and data lakes.
A database is an organized collection of data that allows easy access and retrieval.
A data warehouse is a central repository of information designed for analysis to drive informed decisions.
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
Here are the comparisons to explain the utility of these solutions:

Data Warehouse vs. Database

[Image: Data Warehouse vs. Database comparison table]

Data Warehouse vs. Data Lake

[Image: Data Warehouse vs. Data Lake comparison table]

PDF Lecture Notes

As a token of appreciation, here’s the link to the GitHub repo with the PDF lecture notes I compiled based on the slides and transcripts.

To stay updated with the latest notes from subsequent courses, feel free to give the repo a star as well.

Ready for more?

Here’s the article for the subsequent Course 3:

Key Learning Points from MLOps Specialization – Course 3

If you haven’t, check out Course 1 of the series:

Key Learning Points from MLOps Specialization – Course 1

Before You Go

I welcome you to join me on a data science learning journey! Follow this Medium page and check out my GitHub to stay in the loop on more exciting data science content. Meanwhile, have fun building production ML systems!

How to Easily Draw Neural Network Architecture Diagrams

Most Starred & Forked GitHub Repos for Data Science and Python

URL: https://towardsdatascience.com/key-learning-points-from-mlops-specialization-course-2-13af51e22d90/