URL: https://www.geeksforgeeks.org/machine-learning/kernel-density-estimation/

Kernel Density Estimation

Last Updated : 21 Jun, 2025

Kernel Density Estimation (KDE) is a non-parametric method used to estimate the probability density function (PDF) of a random variable. Unlike histograms, which use discrete bins, KDE provides a smooth and continuous estimate of the underlying distribution, making it particularly useful when dealing with continuous data.

Given a set of independent and identically distributed (i.i.d.) samples $x_1, x_2, \dots, x_n$ from an unknown distribution with density function $f$, the goal is to estimate $f$ using only the samples.

The kernel density estimator at a point $x$ is defined as:

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

Where:

  • $n$ is the number of data points.
  • $h > 0$ is the bandwidth (smoothing parameter).
  • $K$ is the kernel function, which integrates to 1.

Each data point $x_i$ contributes a small "bump" to the estimate, centered at $x_i$ and scaled by the bandwidth $h$. The final estimate is the normalized sum of these bumps.
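As a rough illustration of the definition above, the following sketch evaluates the estimator directly with NumPy, assuming a Gaussian kernel. The function names `gaussian_kernel` and `kde`, the four sample values, and the bandwidth of 0.5 are illustrative choices, not part of the original article.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, samples, h, kernel=gaussian_kernel):
    """Evaluate f_hat(x) = 1/(n*h) * sum_i K((x - x_i) / h) at each x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    samples = np.asarray(samples, dtype=float)
    # u[j, i] is the scaled distance between query point j and sample i.
    u = (x[:, None] - samples[None, :]) / h
    return kernel(u).sum(axis=1) / (samples.size * h)

# Illustrative data (assumed, not from the article).
samples = np.array([-1.0, 0.0, 0.5, 2.0])
grid = np.linspace(-6.0, 7.0, 1301)
density = kde(grid, samples, h=0.5)

# A Riemann sum over the grid should be close to 1 for a valid density.
area = np.sum(density) * (grid[1] - grid[0])
print(f"approximate area under the estimate: {area:.4f}")
```

Each row of `u` holds the contributions of all samples to one query point, so the estimate at that point is simply the scaled row sum.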

Kernel Functions

The kernel is typically a symmetric, non-negative function that integrates to 1. Common kernels include:

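Kernel type | Function
Gaussian kernel | $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$
Epanechnikov kernel | $K(u) = \frac{3}{4}(1 - u^2)$ for $\lvert u \rvert \le 1$, and 0 otherwise
Uniform kernel | $K(u) = \frac{1}{2}$ for $\lvert u \rvert \le 1$, and 0 otherwise
Triangular kernel | $K(u) = 1 - \lvert u \rvert$ for $\lvert u \rvert \le 1$, and 0 otherwise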

The choice of kernel has a relatively minor impact on the final estimate compared to the choice of bandwidth $h$.
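The four kernels named above can be written in a few lines each. This is a minimal sketch using their standard textbook forms; the function names and the `KERNELS` dictionary are illustrative. The numerical check at the end confirms that each one integrates to approximately 1.

```python
import numpy as np

def gaussian(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def epanechnikov(u):
    # Parabolic kernel supported on [-1, 1].
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def uniform(u):
    # Constant height 1/2 on [-1, 1].
    return np.where(np.abs(u) <= 1.0, 0.5, 0.0)

def triangular(u):
    # Linear decay from 1 at u = 0 to 0 at |u| = 1.
    return np.where(np.abs(u) <= 1.0, 1.0 - np.abs(u), 0.0)

KERNELS = {
    "gaussian": gaussian,
    "epanechnikov": epanechnikov,
    "uniform": uniform,
    "triangular": triangular,
}

# Riemann-sum check that every kernel integrates to roughly 1.
u = np.linspace(-8.0, 8.0, 160001)
du = u[1] - u[0]
areas = {name: float(np.sum(k(u)) * du) for name, k in KERNELS.items()}
for name, area in areas.items():
    print(f"{name:>13}: area = {area:.4f}")
```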

Bandwidth Selection

The bandwidth parameter h determines the smoothness of the density estimate. It controls how widely the contribution of each data point is spread around that point.

  • A small bandwidth produces a spiky estimate that may overfit the data.
  • A large bandwidth smooths the estimate too much, potentially hiding important features.
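One way to see this trade-off is to count the local maxima of the estimate for several bandwidths. The sketch below assumes synthetic bimodal data and a Gaussian kernel; the helper names `gaussian_kde_1d` and `count_modes` and the three bandwidth values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic bimodal sample (an assumption made for this illustration).
samples = np.concatenate([rng.normal(-2.0, 0.7, 100), rng.normal(2.0, 0.7, 100)])

def gaussian_kde_1d(x, samples, h):
    """Gaussian-kernel density estimate evaluated at the 1-D array x."""
    u = (x[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (samples.size * h * np.sqrt(2.0 * np.pi))

def count_modes(density):
    """Count grid points that are strictly higher than both neighbours."""
    interior = density[1:-1]
    return int(np.sum((interior > density[:-2]) & (interior > density[2:])))

grid = np.linspace(-8.0, 8.0, 2001)
modes = {h: count_modes(gaussian_kde_1d(grid, samples, h)) for h in (0.05, 0.5, 3.0)}
for h, k in modes.items():
    print(f"h = {h:<4}: {k} local maxima")
```

A very small bandwidth should report many spurious peaks, while a very large one tends to merge the two true modes into a single broad hump.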

Optimal Bandwidth Formula

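A commonly used formula for the bandwidth is Silverman's rule of thumb:

$$h = \left(\frac{4\sigma^5}{3n}\right)^{1/5} \approx 1.06\,\sigma\,n^{-1/5}$$

where:

  • σ is the standard deviation of the data.
  • n is the number of observations.

This rule is derived under the assumption that the data are roughly Gaussian, so it can oversmooth multimodal or heavily skewed distributions.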
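The rule can be computed in a couple of lines. This sketch assumes the normal-reference form h = (4σ⁵ / (3n))^(1/5), which is approximately 1.06 σ n^(-1/5), and uses the sample standard deviation as σ. The function name `silverman_bandwidth` and the synthetic data are illustrative.

```python
import numpy as np

def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth h = (4 * sigma^5 / (3 * n))^(1/5)."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    sigma = samples.std(ddof=1)  # sample standard deviation
    return (4.0 * sigma**5 / (3.0 * n)) ** 0.2

rng = np.random.default_rng(42)
samples = rng.normal(0.0, 2.0, 500)  # synthetic data, assumed for illustration

h = silverman_bandwidth(samples)
# The familiar shortcut with the rounded constant 1.06.
h_approx = 1.06 * samples.std(ddof=1) * samples.size ** (-0.2)
print(f"exact form: {h:.4f}, 1.06 shortcut: {h_approx:.4f}")
```

The two values differ only because 1.06 is a rounded version of (4/3)^(1/5).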

Multivariate KDE

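For $d$-dimensional data $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^d$, KDE generalizes to:

$$\hat{f}_{\mathbf{H}}(\mathbf{x}) = \frac{1}{n\,\lvert \mathbf{H} \rvert^{1/2}} \sum_{i=1}^{n} K\left(\mathbf{H}^{-1/2}(\mathbf{x} - \mathbf{x}_i)\right)$$

Where:

  • $\mathbf{H}$ is a symmetric positive-definite bandwidth matrix.
  • $K$ is a multivariate kernel (often a multivariate Gaussian).

The bandwidth matrix $\mathbf{H}$ controls the amount of smoothing in each direction, and its off-diagonal entries let the kernel follow correlations among dimensions.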
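A minimal sketch of the multivariate estimator with a Gaussian kernel and a full bandwidth matrix follows. The function name `multivariate_gaussian_kde`, the synthetic correlated data, and the Scott-style choice H = n^(-2/(d+4)) · Σ (with Σ the sample covariance) are assumptions made for illustration.

```python
import numpy as np

def multivariate_gaussian_kde(points, samples, H):
    """Gaussian KDE with bandwidth matrix H.

    points:  (m, d) query locations, or a single point of shape (d,)
    samples: (n, d) observed data
    H:       (d, d) symmetric positive-definite bandwidth matrix
    """
    points = np.atleast_2d(np.asarray(points, dtype=float))
    samples = np.atleast_2d(np.asarray(samples, dtype=float))
    n, d = samples.shape
    H_inv = np.linalg.inv(H)
    norm = 1.0 / (n * np.sqrt((2.0 * np.pi) ** d * np.linalg.det(H)))
    diffs = points[:, None, :] - samples[None, :, :]            # (m, n, d)
    # Squared Mahalanobis distance (x - x_i)^T H^{-1} (x - x_i).
    mahal = np.einsum("mnd,de,mne->mn", diffs, H_inv, diffs)
    return norm * np.exp(-0.5 * mahal).sum(axis=1)

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=400)

n, d = samples.shape
# Scott-style bandwidth matrix (an assumed choice, not the only option).
H = n ** (-2.0 / (d + 4)) * np.cov(samples, rowvar=False)

density_at_mean = multivariate_gaussian_kde(np.zeros(2), samples, H)[0]
density_far_away = multivariate_gaussian_kde(np.array([10.0, 10.0]), samples, H)[0]
print(density_at_mean, density_far_away)
```

Because H is built from the sample covariance, the kernels are elongated along the direction in which the data are correlated.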

Implementation in Python

Here’s how KDE is implemented using scipy:
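A minimal sketch of such an implementation, assuming synthetic bimodal data and `scipy.stats.gaussian_kde` with its default bandwidth (Scott's rule). The sample data, variable names, and the `plot_density` helper are illustrative and not taken from the original listing.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Synthetic bimodal sample (assumed for illustration).
samples = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(1.0, 1.0, 700)])

# Fit a Gaussian KDE; the default bandwidth follows Scott's rule.
kde = gaussian_kde(samples)

# Evaluate the estimated density on a grid that extends past the data range.
grid = np.linspace(samples.min() - 2.0, samples.max() + 2.0, 500)
density = kde(grid)

# Probability mass the estimate assigns to the grid interval (close to 1).
mass = kde.integrate_box_1d(grid[0], grid[-1])
print(f"bandwidth factor: {kde.factor:.3f}, mass on grid: {mass:.4f}")

def plot_density(grid, density, samples):
    """Overlay the KDE curve on a normalized histogram (needs matplotlib)."""
    import matplotlib.pyplot as plt
    plt.hist(samples, bins=40, density=True, alpha=0.4, label="Histogram")
    plt.plot(grid, density, label="KDE")
    plt.xlabel("x")
    plt.ylabel("Density")
    plt.legend()
    plt.show()
```

Calling `plot_density(grid, density, samples)` draws the smooth curve over the histogram of the same data.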

Output:

[Figure: KDE plot using SciPy]

Variants and Improvements

  1. Adaptive KDE: Instead of using a single global bandwidth, adaptive KDE varies the bandwidth locally depending on the density of data points. A smaller bandwidth is used in dense regions, and a larger bandwidth in sparse regions.
  2. Fast KDE: Uses data structures like KD-trees or FFT-based convolutions to speed up computation. Libraries like statsmodels and sklearn offer optimized implementations.
  3. Boundary Correction: Near the edge of the support (e.g. at zero for non-negative variables), a standard KDE places some probability mass outside the support and therefore underestimates the density. Solutions include reflection and transformation techniques.
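The reflection technique mentioned under boundary correction can be sketched as follows for data supported on [0, ∞): the mass that a plain KDE places below zero is folded back onto the positive side. The exponential test data, the bandwidth of 0.2, and the function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_1d(x, samples, h):
    """Plain Gaussian-kernel density estimate at the 1-D array x."""
    u = (x[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (samples.size * h * np.sqrt(2.0 * np.pi))

def reflected_kde(x, samples, h):
    """KDE for non-negative data, corrected by reflection about 0."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # By kernel symmetry, evaluating at -x equals a KDE of the mirrored data.
    estimate = gaussian_kde_1d(x, samples, h) + gaussian_kde_1d(-x, samples, h)
    return np.where(x >= 0.0, estimate, 0.0)

rng = np.random.default_rng(1)
samples = rng.exponential(1.0, 1000)  # true density at x = 0 is 1
h = 0.2

x0 = np.array([0.0])
plain_at_zero = gaussian_kde_1d(x0, samples, h)[0]
reflected_at_zero = reflected_kde(x0, samples, h)[0]
print(f"plain: {plain_at_zero:.3f}, reflected: {reflected_at_zero:.3f}")

# The corrected estimate should still integrate to roughly 1 over [0, 20].
grid = np.linspace(0.0, 20.0, 2001)
mass = np.sum(reflected_kde(grid, samples, h)) * (grid[1] - grid[0])
```

At the boundary itself the reflected estimate is exactly twice the plain one, which offsets the mass that would otherwise leak below zero.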

Applications

  1. Data Visualization: KDE provides clearer plots for understanding the shape of data distributions, particularly in large datasets.
  2. Anomaly Detection: Points in low-density regions can be flagged as anomalies. KDE forms the basis for several unsupervised anomaly detection algorithms.
  3. Mode Estimation: KDE allows for identifying peaks in the distribution, which correspond to modes.
  4. Bayesian Inference: KDE is often used to approximate posterior distributions obtained via sampling (e.g. MCMC methods).
  5. Image Processing: In image segmentation and denoising, KDE helps in estimating the intensity distribution of pixels.
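The anomaly detection use case can be sketched by scoring each point with its estimated density and flagging the lowest-scoring ones. The synthetic data, the fixed bandwidth of 0.3, and the 1% cut-off are hypothetical choices for illustration, not a recommended configuration.

```python
import numpy as np

def gaussian_kde_1d(x, samples, h):
    """Gaussian-kernel density estimate at the 1-D array x."""
    u = (x[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (samples.size * h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, 500)
outliers = np.array([8.0, -9.0])        # planted far from the bulk of the data
data = np.concatenate([inliers, outliers])

# Score every observation by the estimated density at its own location.
scores = gaussian_kde_1d(data, data, h=0.3)

# Hypothetical rule: flag the 1% of points with the lowest density.
threshold = np.quantile(scores, 0.01)
anomalies = data[scores <= threshold]
print(np.sort(anomalies))
```

A quantile cut-off always flags a fixed share of the data, so in practice the threshold would be tuned to the application.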

Limitations and Challenges

  1. Curse of Dimensionality: KDE performs poorly in high-dimensional spaces. As dimensions increase, data sparsity grows, and KDE requires exponentially more samples for a reliable estimate.
  2. Computational Complexity: A naive evaluation of the density at m query points using n samples takes O(nm) time. This can be prohibitive for large datasets.
  3. Bandwidth Selection: Choosing an optimal bandwidth is difficult and often problem-specific. Poor choices lead to under- or over-smoothing.
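The O(nm) cost comes from computing one kernel value for every pair of query point and sample. The sketch below, with an illustrative function name `kde_chunked` and an assumed Gaussian kernel, processes the query points in chunks. This does not reduce the O(nm) running time, but it keeps peak memory at roughly chunk_size × n values instead of m × n.

```python
import numpy as np

def kde_chunked(points, samples, h, chunk_size=1024):
    """Gaussian KDE evaluated chunk by chunk to bound memory use."""
    points = np.asarray(points, dtype=float)
    samples = np.asarray(samples, dtype=float)
    out = np.empty(points.size)
    norm = samples.size * h * np.sqrt(2.0 * np.pi)
    for start in range(0, points.size, chunk_size):
        block = points[start:start + chunk_size]
        # Only a (len(block), n) matrix is materialized at a time.
        u = (block[:, None] - samples[None, :]) / h
        out[start:start + chunk_size] = np.exp(-0.5 * u**2).sum(axis=1) / norm
    return out

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 2000)     # n = 2000 (synthetic)
points = np.linspace(-4.0, 4.0, 5000)    # m = 5000 query points

density = kde_chunked(points, samples, h=0.3, chunk_size=512)
print(f"kernel evaluations performed: {points.size * samples.size:,}")
```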