
URL: https://towardsdatascience.com/imputation-of-missing-data-in-tables-with-datawig-2d7ab327ece2/



Imputation of Missing Data in Tables with DataWig

Implementing Amazon's DataWig in Python to impute missing values in tabular data

Photo by Hunter Harritt on Unsplash

Missing values are common in real-world datasets and pose a key challenge for all data practitioners. The issue is even more challenging when the dataset contains heterogeneous data types.

In this article, we look at how DataWig can help us perform the imputation of missing values in tabular data effectively and efficiently.

Contents

(1) Types of Missing Data and Imputation Techniques (Optional)
(2) About DataWig
(3) How DataWig Works
(4) Imputation Performance of DataWig
(5) Python Implementation
(6) Advanced Features


(1) Types of Missing Data and Imputation Techniques

(Optional Primer)

Before we begin, it is good to understand the types of missing data and the various imputation techniques available. I have placed the primer in a separate article to keep this article brief. If you are already familiar with these concepts, feel free to skip this part.


(2) About DataWig

Developed at Amazon Science, DataWig is a software package that applies missing value imputation to tables containing heterogeneous data types, i.e., numerical, categorical, and unstructured text.

The goal is to build a robust and scalable framework that allows users to impute missing values without extensive engineering efforts or a machine learning background.


(3) How DataWig Works

DataWig runs three components to perform imputation for heterogeneous data: Encoder, Featurizer, and Imputer.

We can see how DataWig works with an example involving non-numerical data. Let's say we have a 3-row product catalog dataset where the 'Color' column has a missing value in the third row.

Thus, the 'Color' column is the to-be-imputed column (aka output column), while the other columns are the input columns.

The aim is to use the first two rows (containing complete data) to train an imputation model and predict the missing 'Color' value in the third row.

Adapted from DataWig JMLR paper | Image used under CC-BY 4.0
  1. The data types of the columns are first determined automatically using heuristics. For example, a column is treated as categorical instead of plain text if it has at least ten times as many rows as unique values.
  2. Features are converted to numerical representation using column encoders, e.g., one-hot encoding.
  3. Numerical-formatted columns are transformed into feature vectors.
  4. Feature vectors are concatenated into a latent representation that is passed into the imputation model for training and prediction.
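
To make the heuristic in step 1 concrete, here is a minimal sketch of the "ten times as many rows as unique values" rule. The function name `infer_string_column_type` and the `ratio` parameter are hypothetical, and this illustrates the rule as described above rather than DataWig's actual implementation.

```python
def infer_string_column_type(values, ratio=10):
    """Classify a non-numerical column as 'categorical' or 'text'.

    Sketch of the heuristic described above: a column counts as categorical
    if it has at least `ratio` times as many rows as unique values.
    Missing entries (None) are ignored. This is not DataWig's own code.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    n_rows = len(observed)
    n_unique = len(set(observed))
    return "categorical" if n_rows >= ratio * n_unique else "text"


# 40 rows with 3 unique colors: 40 >= 10 * 3, so this looks categorical.
colors = ["black", "white", "red", "black"] * 10

# 4 rows with 4 unique descriptions: 4 < 10 * 4, so this looks like free text.
descriptions = [
    "ideal for running",
    "best in class SDHC card",
    "this yellow dress",
    "slim fit jeans",
]

print(infer_string_column_type(colors))
print(infer_string_column_type(descriptions))
```

A rule like this can misfire on small tables, since a genuinely categorical column with few rows will look like text.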

Let’s explore each of the three components:

(i) Encoder

The ColumnEncoder class transforms raw data into numerical representations. There are different types of encoders for the different data types, such as:

  • SequentialEncoder – Sequences of string symbols (e.g., characters)
  • BowEncoder – Bag-of-words representation of strings as sparse vectors
  • CategoricalEncoder – For categorical variables (one-hot encoding)
  • NumericalEncoder – For numerical values (normalization of values)
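
To illustrate what the last two encoders do conceptually, here is a small dependency-free sketch of one-hot encoding and numerical normalization. The helper names `one_hot_encode` and `standardize` are hypothetical, and standardization (zero mean, unit variance) is assumed as the form of normalization. DataWig's own encoder classes are not used here.

```python
from statistics import mean, pstdev


def one_hot_encode(values):
    """Map each category to a 0/1 vector with a single 1 at its index."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        vectors.append(row)
    return categories, vectors


def standardize(values):
    """Rescale numbers to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


categories, encoded = one_hot_encode(["red", "black", "red"])
print(categories)  # ['black', 'red']
print(encoded)     # [[0, 1], [1, 0], [0, 1]]
print(standardize([1.0, 2.0, 3.0]))
```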

(ii) Featurizer

After encoding into numerical representations, the next step is transforming the data into feature vectors using featurizers.

The purpose is to feed the data as a vector representation into the imputation model’s computational graph for training and prediction.

There are also different types of featurizers to cater to the different data types:

  • LSTMFeaturizer – Map input sequences into latent vectors using LSTM
  • BowFeaturizer – Convert string data into sparse vectors
  • EmbeddingFeaturizer – Map encoded categorical data into vector representations (i.e., embeddings)
  • NumericalFeaturizer – Extract feature vectors using fully connected layers

(iii) Imputer

The final part is to create the imputation model, execute training, and generate predictions to fill in the missing values.

DataWig adopts the MICE (Multivariate Imputation by Chained Equations) technique for imputation, and the model used within is a neural network trained with MXNet as the backend.

In a nutshell, columns containing helpful information are used by the deep learning model to impute missing values in the to-be-imputed column.

Given that there will be different data types, the appropriate loss functions (e.g., squared loss or cross-entropy loss) are also selected automatically.
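
As a rough sketch of what choosing a loss by data type means, the snippet below pairs numerical outputs with squared loss and categorical outputs with cross-entropy loss. The function `pick_loss` and the column-type strings are hypothetical names used for illustration only, and this is not how DataWig wires its losses internally.

```python
import math


def squared_loss(y_true, y_pred):
    """Squared error for a single numerical prediction."""
    return (y_true - y_pred) ** 2


def cross_entropy_loss(true_index, predicted_probs):
    """Negative log-probability assigned to the true class."""
    return -math.log(predicted_probs[true_index])


def pick_loss(column_type):
    """Return a loss function suited to the output column's data type."""
    if column_type == "numerical":
        return squared_loss
    if column_type == "categorical":
        return cross_entropy_loss
    raise ValueError(f"unsupported column type: {column_type!r}")


# Numerical output, e.g. a heart rate of 150 predicted as 140.
print(pick_loss("numerical")(150.0, 140.0))

# Categorical output with two classes predicted as equally likely.
print(pick_loss("categorical")(0, [0.5, 0.5]))
```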


(4) Imputation Performance of DataWig

The Amazon Science team evaluated DataWig by comparing it against five popular techniques for imputing missing numerical values.

These other imputation techniques include mean imputation, kNN, matrix factorization (MF), and iterative imputation (linear regression and random forest). The comparison was conducted across synthetic and real-world data with varying amounts of missing data and types of missingness.

Adapted from DataWig JMLR paper | Image used under CC-BY 4.0

Based on the normalized mean-squared error, DataWig compared favorably to other approaches, even in the difficult MNAR missingness type. The results are displayed in the plot above.

Further details of the evaluation (including on unstructured text) can be found in the research paper.

Author’s thought: Given DataWig’s purported strengths in handling categorical and text features, I was surprised that the research paper’s evaluation focus was on missing numerical values.


(5) Python Implementation

To show how DataWig works, we will use the Heart Disease Dataset since it contains both numerical and categorical data types.

Note: You can find the GitHub repo for this project, including the complete Jupyter notebook demo, [here](https://github.com/kennethleungty/DataWig-Missing-Data-Imputation/blob/main/notebooks/DataWig-Example.ipynb).

In particular, we will perform two imputations as part of the demo:

  1. Numerical imputation: Fill in missing values in numerical MaxHR column (maximum heart rate achieved by person)
  2. Categorical imputation: Fill in missing values in categorical ChestPain column (type of chest pain encountered)

Step 1 – Initial Setup

  • Create and activate a new conda environment with Python version 3.7, since DataWig currently works with version 3.7 and below.

        conda create -n myenv python=3.7
        conda activate myenv

  • Install DataWig via pip

        pip install datawig

  • If you would like the environment to appear in your Jupyter notebook, you can run the following:

        python -m ipykernel install --user --name myenv --display-name "myenv"

Note: Ensure that the pandas, NumPy, and scikit-learn libraries are updated to the latest versions compatible with this environment.


Step 2 – Data Pre-processing

There are two preprocessing steps to do before imputation:

  • Perform random shuffle train-test split (80/20)
  • Randomly hide an arbitrary proportion (e.g., 25%) of values in the test dataset to simulate missing data. The train set will remain completely non-missing for the imputation model to train on.

Here is a sample of the test set with the missing data displayed as NaN:

Sample of test set | Image by author
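
The two preprocessing steps above can be sketched without any dependencies as follows. Rows are plain dictionaries and hidden values are set to None; with a pandas DataFrame the same idea applies with NaN. The helper names, the seed, and the toy values are all made up for illustration and are not taken from the demo notebook.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and split them into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def hide_values(rows, column, fraction=0.25, seed=42):
    """Return (masked_rows, hidden_indices) without modifying the originals."""
    rng = random.Random(seed)
    n_hidden = round(len(rows) * fraction)
    hidden = sorted(rng.sample(range(len(rows)), n_hidden))
    masked = [dict(row) for row in rows]
    for i in hidden:
        masked[i][column] = None
    return masked, hidden


# Toy rows; the values are invented and only mimic the demo's column names.
rows = [
    {"Age": 40 + i, "MaxHR": 120 + i, "ChestPain": "typical" if i % 2 else "asymptomatic"}
    for i in range(20)
]

train, test = train_test_split_rows(rows)           # 80/20 split
test_missing, hidden_idx = hide_values(test, "MaxHR")  # hide 25% of MaxHR

print(len(train), len(test))
print(test_missing)
```

Keeping the unmasked test rows and the hidden indices around makes it possible to score the imputed values against the true ones later.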

Step 3 – Setup Imputation Model

The easiest way to build and deploy an imputation model is to use the SimpleImputer class. It automatically detects the column data types and uses a set of default encoders and featurizers that yield good results on various datasets.

We first define a list of input columns deemed useful for predicting missing values in the to-be-imputed column. This list is based on the user’s domain knowledge and critical judgment.

We then create two instances of SimpleImputer, one for each of the two columns to be imputed (i.e., MaxHR and ChestPain).
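
A minimal sketch of this setup might look like the following. It assumes DataWig is installed, and it imports DataWig only inside the function so the snippet can be loaded without it. The input column lists are assumptions for illustration (only MaxHR, ChestPain, and Thal are named in this article), and `build_imputers` and the output paths are hypothetical names.

```python
# Input columns are illustrative assumptions; choose them using domain knowledge.
INPUT_COLS_NUM = ["Age", "Sex", "ChestPain", "RestBP", "Chol", "Thal"]  # to impute MaxHR
INPUT_COLS_CAT = ["Age", "Sex", "MaxHR", "RestBP", "Chol", "Thal"]      # to impute ChestPain


def build_imputers(input_cols_num=INPUT_COLS_NUM,
                   input_cols_cat=INPUT_COLS_CAT,
                   output_dir="imputer_models"):
    """Create one SimpleImputer per to-be-imputed column."""
    from datawig import SimpleImputer  # requires the datawig package

    imputer_num = SimpleImputer(
        input_columns=input_cols_num,
        output_column="MaxHR",                    # numerical target
        output_path=f"{output_dir}/maxhr",        # where model files are stored
    )
    imputer_cat = SimpleImputer(
        input_columns=input_cols_cat,
        output_column="ChestPain",                # categorical target
        output_path=f"{output_dir}/chestpain",
    )
    return imputer_num, imputer_cat
```

Note that each output column is left out of its own input list, since the model should predict it from the other columns.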


Step 4 – Fit Imputation Model

With our model instances ready, we can fit them on our train dataset. Beyond a simple model fit, we can leverage the hyperparameter optimization (HPO) fit_hpo function of SimpleImputer to find the best imputation model.

The HPO function uses a random search on the custom grid of hyperparameters (e.g., learning rate, batch size, number of hidden layers).

If HPO is not required, we can omit the hyperparameter search arguments (as shown in the categorical imputation example).
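
As a hedged sketch of this step, the function below runs an HPO fit for the numerical imputer and a plain fit for the categorical one. The argument names follow my reading of the DataWig documentation for `fit_hpo` and `fit` and should be checked against your installed version. The candidate values, the epoch and patience settings, and the name `fit_imputers` are assumptions, not the settings used in the demo.

```python
# Illustrative search space; the candidate values are assumptions.
HPO_SEARCH_SPACE = {
    "learning_rate_candidates": [1e-3, 3e-4, 1e-4],
    "numeric_latent_dim_candidates": [10, 50, 100],
    "numeric_hidden_layers_candidates": [0, 1, 2],
    "final_fc_hidden_units": [[], [100]],
}


def fit_imputers(imputer_num, imputer_cat, df_train):
    """Fit the numerical imputer with HPO and the categorical one without."""
    # Random search over the candidate hyperparameters listed above.
    imputer_num.fit_hpo(
        train_df=df_train,
        num_epochs=100,
        patience=5,
        **HPO_SEARCH_SPACE,
    )
    # Plain fit: no hyperparameter search arguments are passed.
    imputer_cat.fit(train_df=df_train, num_epochs=100)
    return imputer_num, imputer_cat
```

Here `imputer_num` and `imputer_cat` are expected to be SimpleImputer instances and `df_train` a pandas DataFrame without missing values.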


Step 5 – Execute Imputation and Generate Predictions

The next step is to generate predictions by running the trained imputation models on the test set with missing values.

The output is the original dataframe plus a new column of the imputed data.

Output predictions dataframe (original and imputed columns boxed in red) | Image by author
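
The prediction step can be sketched as below. It assumes `imputer` is a fitted SimpleImputer and `df_test` is a pandas DataFrame, and it relies on DataWig's default behavior of naming the new column with an `_imputed` suffix. The helper names `imputed_column_name` and `impute_and_fill` are hypothetical.

```python
def imputed_column_name(output_column, suffix="_imputed"):
    """Name of the column that holds the imputed values."""
    return f"{output_column}{suffix}"


def impute_and_fill(imputer, df_test, output_column):
    """Run the imputer and fill only the missing entries of the output column."""
    # predict() returns the original columns plus the imputed column.
    predictions = imputer.predict(df_test)
    imputed_col = imputed_column_name(output_column)
    # Keep observed values and use imputed values where data was missing.
    filled = predictions[output_column].fillna(predictions[imputed_col])
    return predictions, filled


print(imputed_column_name("MaxHR"))      # MaxHR_imputed
print(imputed_column_name("ChestPain"))  # ChestPain_imputed
```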

Step 6 – Evaluation

Finally, let’s see how our imputation models fared with these evaluation metrics:

  • Mean-squared error (MSE) for numerical imputation
  • Matthews correlation coefficient (MCC) for categorical imputation

For this demonstration, the MSE is 342.4, and the MCC is 0.22. These values form the benchmark for comparison with other imputation techniques.
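
For reference, both metrics can be computed in a few lines of plain Python, as in the sketch below. The toy inputs are made up and do not reproduce the numbers above. In practice, scikit-learn's `mean_squared_error` and `matthews_corrcoef` do the same job, and the metrics should be computed only on the rows whose values were hidden.

```python
import math
from collections import Counter


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and imputed values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def matthews_corrcoef(y_true, y_pred):
    """Multiclass Matthews correlation coefficient."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    s = len(y_true)                                    # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))    # correct predictions
    t_counts = Counter(y_true)                         # true count per class
    p_counts = Counter(y_pred)                         # predicted count per class
    labels = set(t_counts) | set(p_counts)
    sum_pt = sum(p_counts[k] * t_counts[k] for k in labels)
    denom = math.sqrt(
        (s ** 2 - sum(v ** 2 for v in p_counts.values()))
        * (s ** 2 - sum(v ** 2 for v in t_counts.values()))
    )
    # By convention, return 0 when the denominator is 0.
    return (c * s - sum_pt) / denom if denom else 0.0


print(mean_squared_error([150, 160], [140, 165]))
print(matthews_corrcoef(["a", "a", "a", "b"], ["a", "b", "a", "a"]))
```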


(6) Advanced Features

Beyond the basic implementation described earlier, we can leverage advanced DataWig features for our specific project needs.

(i) Imputer

If we want more control over the types of model and preprocessing steps in the imputation models, we can use the Imputer class.

It provides greater flexibility for the custom specification of model parameters (such as particular encoders and featurizers) as compared to the default settings in SimpleImputer.

Here is an example of how the encoder and featurizer for each column are explicitly defined in Imputer:
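
The snippet below is a minimal sketch of such a definition, assuming DataWig is installed and that the class names and import paths match the DataWig documentation. It uses MaxHR and Thal as inputs to impute ChestPain, which is an illustrative choice, and `build_custom_imputer` and the output path are hypothetical names.

```python
NUMERIC_INPUTS = ["MaxHR"]
CATEGORICAL_INPUTS = ["Thal"]   # declared categorical explicitly
OUTPUT_COLUMN = "ChestPain"


def build_custom_imputer(output_path="imputer_model_custom"):
    """Build an Imputer with explicitly chosen encoders and featurizers."""
    # Imports are local so this snippet loads even without datawig installed.
    from datawig import Imputer
    from datawig.column_encoders import CategoricalEncoder, NumericalEncoder
    from datawig.mxnet_input_symbols import EmbeddingFeaturizer, NumericalFeaturizer

    # Encoders turn raw input columns into numerical representations.
    data_encoders = (
        [NumericalEncoder(col) for col in NUMERIC_INPUTS]
        + [CategoricalEncoder(col) for col in CATEGORICAL_INPUTS]
    )
    # Featurizers turn encoded columns into feature vectors;
    # each field name matches the corresponding encoder's column.
    data_featurizers = (
        [NumericalFeaturizer(col) for col in NUMERIC_INPUTS]
        + [EmbeddingFeaturizer(col) for col in CATEGORICAL_INPUTS]
    )
    # The to-be-imputed column is encoded as a categorical label.
    label_encoders = [CategoricalEncoder(OUTPUT_COLUMN)]

    return Imputer(
        data_featurizers=data_featurizers,
        label_encoders=label_encoders,
        data_encoders=data_encoders,
        output_path=output_path,
    )
```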

The Imputer instance can then be used to do a .fit() and .predict().

Author’s thought: The specific definition of column types can be helpful because automatic encoding and featurizing may not always work perfectly. For example, in this dataset, the SimpleImputer misidentified the categorical Thal column as a text column.


(ii) Label-shift Detection and Correction

The SimpleImputer class has a handy function check_for_label_shift that helps us detect issues of data drift (label shift in particular).

Label shift occurs when the marginal distribution of the labels (here, the values of the to-be-imputed column) differs between the training and real-world data. By understanding how the label distribution has changed, we can then account for the shift in our imputation.

The check_for_label_shift function logs the severity of the shift and returns the weight factors for the labels. Here is a sample output of the weights:

The output of label shift check | Image by author

We then retrain the model with a weighted likelihood by passing the weights when we re-fit the imputation model to correct the shift.
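
The detect-then-correct flow can be sketched as follows. It assumes `imputer_cat` is a fitted SimpleImputer for a categorical column and that the weights are passed to `fit` through a `class_weights` argument, which is my understanding of the DataWig API and should be verified against your installed version. The function name `correct_label_shift` is hypothetical.

```python
def correct_label_shift(imputer_cat, df_train, df_target):
    """Check for label shift on target data, then re-fit with class weights."""
    # Logs the severity of the shift and returns a weight factor per label.
    weights = imputer_cat.check_for_label_shift(df_target)
    # Re-fit with a weighted likelihood to account for the shifted labels.
    imputer_cat.fit(train_df=df_train, class_weights=weights)
    return weights
```

Here `df_target` stands for the real-world (or test) data whose label distribution may differ from that of `df_train`.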


Wrapping It Up

We have covered how DataWig can be used to impute missing values in data tables effectively and efficiently.

One important caveat is that imputation tools such as DataWig are not magic bullets for handling missing data.

Dealing with missing data is a challenging process that requires proper investigation and a strong understanding of the data and context. A clear example is shown in this demo, where users need to decide which input features to feed into the model to impute the output column accurately.

The GitHub repo for this project can be found here.

Before You Go

I welcome you to join me on a data science learning journey! Follow this Medium page and check out my GitHub to stay in the loop of more exciting data science content. Meanwhile, have fun imputing missing values with DataWig!



Written By

Kenneth Leung
