Handling Missing Values in Non-Negative Matrix Factorization (NMF)

Last Updated : 17 Sep, 2025

Non-Negative Matrix Factorization (NMF) is a group of algorithms used in multivariate analysis and linear algebra to factorize a matrix into two matrices and such that all three matrices contain non-negative elements. This property makes NMF particularly useful for applications where data cannot be negative such as text mining and image processing.

However scikit-learn implementation does not support missing values (NaN) in the data matrix. Due to this it requires preprocessing steps to handle any missing data before applying NMF. Before we start imputing missing values let's first visualize them to understand the problem better.

Here we will import numpy, pandas, matplotlib and seaborn.
Here we will create a synthetic dataset with missing values.

Output:

👁 Dataset

Dataset

Strategies for Handling Missing Values

1. Imputation with Zeroes

One most common approach is to replace missing values with zeroes. This method is simple but can lead to biased results because the algorithm treats these zeroes as actual data points rather than missing entries. This approach is often used when the dataset is sparse and zeroes can be considered as valid observations in some contexts such as recommendation systems.

Output:

👁 zero_imputed

Zero Imputed

As we can see in above image all missing values are replaced with 0.

2. Mean Imputation

In mean imputation missing values are replaced with the mean of the non-missing values in the same column. This approach maintain the overall distribution of the data but can underestimate variability and led to biased estimates. It works well when the data is not highly skewed.

Output:

👁 mean_imputed

Mean Imputed

Missing values are replaced with the average value of each column.

3. Iterative Imputation

Iterative imputation is provided by scikit-learn's method called IterativeImputer. It models each feature with missing values as a function of other features and iteratively predicts the missing values. This method can capture the underlying data structure better than simple imputation techniques and is suitable for datasets where correlations exist between features.

Output:

👁 Iterative_imputation

Iterative Imputation

Missing values are filled by predicting them based on other features.

4. Matrix Factorization Imputation

One advanced approach to handling missing values is to use matrix factorization techniques like NMF itself for imputation. In this case NMF is applied to the incomplete matrix with missing entries and the matrix is reconstructed by factoring it into two lower-rank matrices. The missing values are predicted as part of the factorization process.

This method works well because NMF is designed to find latent patterns in the data is important.

Output:

👁 Matrix-factorization-Imputation

Matrix Factorization Imputation

Reconstructed data is close to the imputed input. It uses patterns in the data to estimate missing values.

5. Nearest Neighbor Imputation

Another advanced technique is nearest neighbor imputation which can be used when we have additional context about the data such as similarities between rows or columns. Nearest neighbors imputation fills missing values based on the values of the nearest neighbors in the dataset. This method is useful when similar items or users exhibit similar behaviors.

Output:

👁 KNN-Imputation

KNN Imputation

Each missing value is filled using the average of its 3 nearest rows (neighbors)

Implementing NMF After Imputation

Once the missing values are imputed we can proceed with applying NMF using scikit-learn. The below example show how to perform NMF on a dataset with imputed values:

Output:

👁 NMF

NMF

In the above output image contains hidden features for each row contains how features are combined to rebuild the original data and the final matrix is a close version of the original with missing values filled.

Evaluating the Impact of Imputation

We can check how good our imputation and factorization are using Root Mean Squared Error (RMSE). It tells how close the reconstructed matrix is to the original. Lower RMSE values indicate better reconstruction and more effective imputation.

Output:

RMSE: 0.23080412343029189

In the above output the RMSE is 0.23 which shows a small difference between the original and reconstructed data. This means the imputed data and NMF reconstruction are close to the original values.

Comment

Article Tags:

Explore

Machine Learning Basics

Python for Machine Learning

Feature Engineering

Supervised Learning

Unsupervised Learning

Model Evaluation and Tuning

Advanced Techniques

Machine Learning Practice

Courses

URL: https://www.geeksforgeeks.org/machine-learning/handling-missing-values-in-non-negative-matrix-factorization-nmf/