Key Learning Points from MLOps Specialization: Course 2
Insights summary (with lecture notes) of the Machine Learning Engineering for Production (MLOps) Course by DeepLearning.AI & Andrew…
MLOps Specialization Series
While machine learning (ML) concepts are essential, production engineering capabilities are the key to deploying and delivering value from ML models in the real world.
DeepLearning.AI and Coursera recently developed the MLOps Specialization course to share how to conceptualize, build, and maintain integrated ML systems.
In this article, I summarize the lessons so that you can skip the long lectures while still gaining key insights.
Contents
(1) Course 2 Overview (2) Key Lessons (3) PDF Lecture Notes
This article covers Course 2 of the 4-course MLOps specialization. Follow this Medium page to stay updated with content from subsequent courses.
Course 2 Overview
This second course (Machine Learning Data Lifecycle in Production) focuses on production ML, defined as the deployment of ML systems in production environments.
Production ML combines ML and modern software development because real-world ML solutions require more than training accurate algorithms.
The goal of data practitioners should be to build integrated ML systems that constantly operate in production, automatically ingest and retrain on continuously changing data, and optimize for computation costs.
This course covers three key components of a production ML lifecycle:
- Collecting, Labeling, and Validating Data
- Feature Engineering, Transformation, and Selection
- Data Journey and Data Storage
Key Lessons
In the spirit of the course's emphasis on practical application, the takeaways focus on pragmatic advice. The insights are organized by the three lifecycle components mentioned earlier.
(1) Collecting, Labeling, and Validating Data
- End-to-end ML platforms are vital for deploying production ML pipelines. The team at Google uses the open-source TensorFlow Extended (TFX) for production ML.
- A TFX pipeline is a sequence of components designed for scalable, high-performance ML tasks.
- The first two components (Data Ingestion and Data Validation) relate to data collection, labeling, and validation tasks.
(i) Data Collection
- The goal of data teams is to translate user needs into data problems, so the first thing to evaluate is the data itself. Here is a list of critical questions to ask:
- What kind of/how much data is available?
- What are the details of the data?
  - Is the data annotated? If not, how hard/expensive is it to get it labeled?
  - How often do new data come in/get refreshed?
  - Are data sources monitored for system issues and outages?
- What are the predictive features? Does the dataset contain features with predictive value?
- What are the labels to be tracked?
- What are the metrics for measuring model performance?
- How is the quality of the data? Are there inconsistent data formats (e.g., mixed types) and outliers that affect model performance?
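To make the data quality question concrete, here is a minimal sketch of two checks mentioned above: columns with mixed types and numeric outliers. The function names, the sample rows, and the 1.5 x IQR rule are my own assumptions for illustration and are not taken from the course.

```python
from statistics import quantiles


def mixed_type_columns(rows):
    """Return {column: set of type names} for columns that hold more than one type."""
    seen = {}
    for row in rows:
        for col, value in row.items():
            seen.setdefault(col, set()).add(type(value).__name__)
    return {col: types for col, types in seen.items() if len(types) > 1}


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed convention)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample data: "age" mixes integers and strings.
rows = [{"age": 31, "city": "Oslo"}, {"age": "42", "city": "Lima"}]
print(mixed_type_columns(rows))
print(iqr_outliers([10, 12, 11, 13, 12, 11, 95]))
```

Checks like these are cheap to run on every new batch of data, so they can flag problems before any model is trained.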
(ii) Data Labeling
- Labels are essential because most business use cases rely on supervised learning, which requires labeled data.
- There are various labeling methods, and the two most common ones are process feedback and human labeling.
- Process feedback is a way of continuously creating new training data from signals in system log files, e.g., click-through (i.e., whether the customer clicked or did not click). The drawback is that few scenarios lend themselves to this approach.
- Human labeling is a more standard approach, where we pass unlabeled data to human labelers (aka raters) to examine and assign labels manually.
- While human labeling is straightforward, several problems can arise:
- Recruitment can be expensive, especially if the project requires specialist labeling, e.g., radiologists to label X-ray images
- Labeling can be slow if the number of raters is small
- Quality issues can arise if there are differences in labeling standards across the raters
- We can improve labeling consistency by creating clear instructions to guide raters and promoting the active resolution of labeling conflicts.
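One way to quantify labeling consistency between two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The course notes do not prescribe a specific metric, so treat this as an illustrative sketch; the function names and sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Probability that both raters pick the same class if they labeled independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical class throughout
    return (observed - expected) / (1 - expected)


def conflicting_items(labels_a, labels_b):
    """Indices where the raters disagree, i.e., items to send for conflict resolution."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


rater_1 = ["cat", "cat", "dog", "dog"]
rater_2 = ["cat", "dog", "dog", "dog"]
print(cohens_kappa(rater_1, rater_2))
print(conflicting_items(rater_1, rater_2))
```

A low kappa suggests the labeling instructions are ambiguous, and the list of conflicting items gives the team a concrete queue to review.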
(iii) Data Validation
- The performance of production ML systems can degrade over time due to continuous changes in real-world data.
- There needs to be a data validation workflow in place to detect these significant data issues.
- TensorFlow Data Validation (TFDV) is a highly scalable library that helps developers maintain the health of ML pipelines through the understanding, validating, and monitoring of data at scale.
- Here are the critical data drift and skew issues to validate:
- Concept Drift: Changes in the relationship (aka mapping) between the input and output variables over time
- Schema Skew: Training and serving data do not conform to the same schema (e.g., due to different data types present)
- Distribution Skew: The distributions of serving and training data are significantly different (e.g., due to seasonality changes over time). This category includes the more specific issues of dataset shift and covariate shift.
- Feature Skew: Training feature values are different from the serving feature values (e.g., due to transformation applied only on the training set)
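The schema and distribution checks above can be sketched in plain Python. This is not the TFDV API; it is a simplified illustration in which the schema is a dict of expected types and distribution skew for a categorical feature is measured as the largest absolute difference in category frequencies (an L-infinity distance). The function names and the sample data are assumptions.

```python
from collections import Counter


def schema_mismatches(schema, rows):
    """Compare rows against {feature: expected type}; return (row index, feature, problem)."""
    problems = []
    for i, row in enumerate(rows):
        for name, expected in schema.items():
            if name not in row:
                problems.append((i, name, "missing"))
            elif not isinstance(row[name], expected):
                got = type(row[name]).__name__
                problems.append((i, name, f"expected {expected.__name__}, got {got}"))
    return problems


def linf_distance(train_values, serving_values):
    """Largest absolute gap in category frequency between training and serving data."""
    train_freq, serve_freq = Counter(train_values), Counter(serving_values)
    n_train, n_serve = len(train_values), len(serving_values)
    categories = set(train_freq) | set(serve_freq)
    return max(abs(train_freq[c] / n_train - serve_freq[c] / n_serve) for c in categories)


# Hypothetical example: payment types shift between training and serving.
train = ["card"] * 8 + ["cash"] * 2
serving = ["card"] * 5 + ["cash"] * 5
print(linf_distance(train, serving))
print(schema_mismatches({"age": int}, [{"age": 3}, {"age": "3"}, {}]))
```

In practice, the distance would be compared against a threshold chosen per feature, and an alert raised when it is exceeded.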
(2) Feature Engineering, Transformation, and Selection
(i) Feature Engineering
- An essential purpose of feature engineering in production ML is to reduce computing resources, and this is done by concentrating predictive information in fewer features to promote computing efficiency.
- Inconsistencies in feature engineering can introduce training-serving skew, leading to poor model performance at serving time. These inconsistencies arise due to:
- Different training and serving code paths (e.g., training in Python but serving in Java), which result in different transformations between the two
- Diverse deployment scenarios (e.g., model deployed in different environments like mobile, web, and server)
(ii) Feature Transformation
- Feature transformations occur with two types of granularity:
- Instance-level: Involves just the instance (aka one row of data). Examples include multiplication (e.g., squaring a feature) and clipping (e.g., set a non-negative boundary by changing negative values to 0)
- Full-pass: Involves the entire dataset. Examples include standard scaling, min-max scaling, and binning.
- There are different points at which to perform the transformation:
- Transform training data before feeding into the model
- Transform within the model
- Each option has pros and cons, and these are essential considerations for production costs and efficiency.
- TensorFlow Transform (TFT) is a helpful library for preprocessing and transforming data, and such frameworks are essential for processing large datasets in an efficient and distributed manner.
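The difference between instance-level and full-pass transformations can be shown with a small sketch. It does not use TensorFlow Transform; the function names are hypothetical. The key point is that a full-pass transformation needs statistics computed over the whole training set, and those same statistics should be reused at serving time to avoid feature skew.

```python
from statistics import mean, pstdev


def clip_non_negative(x):
    """Instance-level: needs only the value itself."""
    return max(x, 0.0)


def fit_standard_scaler(train_values):
    """Full-pass: scans the entire training column to compute mean and std."""
    sigma = pstdev(train_values)
    # Guard against a constant column (assumed convention: leave it unscaled).
    return {"mean": mean(train_values), "std": sigma if sigma > 0 else 1.0}


def apply_standard_scaler(x, stats):
    """Apply the stored training statistics to one value (training or serving)."""
    return (x - stats["mean"]) / stats["std"]


train_column = [2.0, 4.0, 6.0]
stats = fit_standard_scaler(train_column)       # computed once, on training data
scaled_train = [apply_standard_scaler(x, stats) for x in train_column]
scaled_request = apply_standard_scaler(5.0, stats)  # serving reuses the same stats
print(scaled_train, scaled_request, clip_non_negative(-3.0))
```

Storing `stats` alongside the model is what keeps the training and serving transformations identical.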
(iii) Feature Selection
- There are three main categories of methods for supervised feature selection: Filter, Wrapper, and Embedded.
- Besides using performance metrics (e.g., F1 score, AUC) for method evaluation, one should evaluate the number of features (aka feature count) after applying these methods.
- The ideal scenario is when the performance metric is maximized and the feature count is minimized.
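As an illustration of the filter category, the sketch below scores each feature by its absolute Pearson correlation with the target and keeps the top k. The scoring choice, the function names, and the toy data are my own assumptions; wrapper and embedded methods would instead involve training a model.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_top_k(features, target, k):
    """Filter method: rank features by |correlation| with the target, keep the best k."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical feature columns.
features = {
    "signal": [1, 2, 3, 4],
    "noise": [2, 1, 2, 1],
    "constant": [5, 5, 5, 5],
}
selected, scores = select_top_k(features, target=[1, 2, 3, 4], k=2)
print(selected, scores)
```

Re-training the model on only the selected features and comparing the performance metric against the full feature set shows whether the reduction in feature count was worth it.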
(3) Data Journey and Data Storage
(i) Data provenance
- Data provenance (aka lineage) is the tracking of the series of transformations in the evolution of data and models from raw input to output artifacts.
- Understanding the data journey (and thus data provenance) is vital for debugging and reproducibility. If not tracked, it becomes infeasible to recreate, compare, or explain ML models.
- Every pipeline run generates useful metadata containing information about the pipeline executions, training runs, and resulting artifacts.
- ML Metadata (MLMD) is a library for recording and retrieving metadata associated with production pipeline runs.
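To show what tracking lineage means in practice, here is a toy in-memory store that records which step produced each artifact and from which inputs. It is not the MLMD API; the class, method, and artifact names are hypothetical, and a real metadata store would also persist parameters, timestamps, and metrics.

```python
from dataclasses import dataclass, field


@dataclass
class LineageStore:
    # Maps output artifact name -> (step name, list of input artifact names).
    produced_by: dict = field(default_factory=dict)

    def record(self, step, inputs, outputs):
        """Register one pipeline execution."""
        for out in outputs:
            self.produced_by[out] = (step, list(inputs))

    def trace(self, artifact):
        """Return the steps that led to `artifact`, upstream steps first."""
        if artifact not in self.produced_by:
            return []  # a raw input with no recorded producer
        step, inputs = self.produced_by[artifact]
        steps = []
        for parent in inputs:
            for upstream in self.trace(parent):
                if upstream not in steps:
                    steps.append(upstream)
        steps.append(step)
        return steps


store = LineageStore()
store.record("ingest", inputs=["raw.csv"], outputs=["examples"])
store.record("transform", inputs=["examples"], outputs=["features"])
store.record("train", inputs=["features"], outputs=["model"])
print(store.trace("model"))
```

Being able to walk backward from a model to its raw inputs is what makes debugging and reproducing a training run feasible.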
(ii) Data Versioning
- Managing data pipelines is a challenge as data evolves over the different training runs of a project lifecycle.
- While we are familiar with code versioning (e.g., Git) and environment versioning (e.g., Docker), data versioning is equally important and plays a crucial role in data provenance.
- Data versioning tools are becoming available, with DVC and Git LFS as two existing examples.
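The general idea behind versioning data by content can be sketched with a hash of the dataset. This is only an illustration of content addressing and does not reproduce how DVC or Git LFS work internally; the function name, the JSON serialization, and the 12-character truncation are assumptions.

```python
import hashlib
import json


def dataset_version(rows):
    """Derive a short version ID from the dataset contents (not from a file name)."""
    # Canonical serialization so that key order does not change the hash.
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


v1 = dataset_version([{"x": 1, "y": 0}])
v1_reordered = dataset_version([{"y": 0, "x": 1}])
v2 = dataset_version([{"x": 2, "y": 0}])
print(v1 == v1_reordered, v1 == v2)
```

Recording such an ID with every training run ties a model to the exact data it was trained on, which supports the provenance tracking described earlier.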
(iii) Feature Stores
- A feature store is a central repository for documented, curated, and access-controlled data features that teams can share, discover and use for model training and serving.
- Feature stores reduce redundant work since many modeling problems use identical or similar features.
- The goal is to provide a unified, consistent, and persistent means of managing data features that are performant and scalable.
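A feature store can be pictured as a shared registry plus a lookup table. The toy class below is an in-memory sketch with hypothetical names, not any real feature store product; it only shows that training and serving read the same documented features from one place.

```python
class FeatureStore:
    """Toy in-memory feature store (illustrative only)."""

    def __init__(self):
        self._definitions = {}  # feature name -> human-readable description
        self._values = {}       # (entity ID, feature name) -> value

    def register(self, name, description):
        """Document a feature so other teams can discover it."""
        self._definitions[name] = description

    def list_features(self):
        return dict(self._definitions)

    def put(self, entity_id, name, value):
        if name not in self._definitions:
            raise KeyError(f"feature {name!r} is not registered")
        self._values[(entity_id, name)] = value

    def get_vector(self, entity_id, names):
        """Same read path for building training sets and for serving requests."""
        return [self._values[(entity_id, name)] for name in names]


store = FeatureStore()
store.register("avg_basket", "Average basket value over the last 30 days")
store.register("visits_7d", "Number of visits in the last 7 days")
store.put("user_1", "avg_basket", 42.5)
store.put("user_1", "visits_7d", 3)
print(store.get_vector("user_1", ["avg_basket", "visits_7d"]))
```

Because every consumer goes through `get_vector`, a feature is computed once and reused, which is the redundancy reduction described above.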
(iv) Data Warehouses vs. Databases vs. Data Lakes
- There are several leading data storage solutions, namely databases, data warehouses, and data lakes.
- A database is an organized collection of data that allows easy access and retrieval.
- A data warehouse is a central repository of information designed for analysis to drive informed decisions.
- A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
- Here are the comparisons to explain the utility of these solutions:
Data Warehouse vs. Database
Data Warehouse vs. Data Lake
PDF Lecture Notes
As a token of appreciation, here's the link to the GitHub repo with the PDF lecture notes I compiled based on the slides and transcripts.
To stay updated with the latest notes from subsequent courses, feel free to give the repo a star as well.
Ready for more?
Here's the article for the subsequent Course 3:
If you haven't, check out Course 1 of the series:
Before You Go
I welcome you to join me on a data science learning journey! Follow this Medium page and check out my GitHub to stay in the loop of more exciting data science content. Meanwhile, have fun building production ML systems!