Mathematical Implementation of KL Divergence for discrete and continuous distributions:
1. Discrete Distributions:
For two discrete probability distributions P = {p1, p2, ..., pn} and {q1, q2, ...., qn} over the same set:
Step by step:
For each outcome i, compute pi log(pi / qi).
Sum all these terms to get the total KL divergence.
2. Continuous Distributions:
For continuous probability density functions p(x) and q(x):
Integration replaces summation for continuous variables.
Gives the expected extra information in nats or bits required when assuming q(x) instead of p(x).
Properties
Properties of KL Divergence are:
1. Non Negativity: KL divergence is always non negative and equals zero if and only if P=Q almost everywhere.
2. Asymmetry: KL divergence is not symmetric so it is not a true distance metric.
3. Additivity for Independent Distributions: If X and Y are independent:
4. Invariance under Parameter Transformations: KL divergence remains the same under bijective transformations of the random variable.
5. Expectation Form: It can be interpreted as the expected logarithmic difference between probabilities under P and Q.
Implementation
Suppose there are two boxes that contain 4 types of balls (green, blue, red, yellow). A ball is drawn from the box randomly having the given probabilities. Our task is to calculate the difference of distributions of two boxes i.e KL divergence.
Step 1: Probability Distributions
Defining the probability distributions:
box_1 and box_2 are two discrete probability distributions.
Each value represents the probability of picking a colored ball from a box: green, blue, red, yellow.
For example, in box_1, the probability of picking a green ball is 0.25.
Step 2: Import Libraries
Importing libraries like Numpy and rel_entr from Scipy.
Step 3: Custom KL Divergence Function
Defining a custom KL divergence function:
1. Formula used:
2. Step by step:
Loop through each probability in distributions a and b.
Compute the term: a[i]⋅log(a[i]/b[i]).
Sum all these terms.
Step 4: Calculate KL Divergence Manually
Calculating KL divergence manually:
kl_divergence(box_1, box_2) calculates
kl_divergence(box_2, box_1) calculates
KL divergence is asymmetric so the two results are usually different.
Step 5: KL Divergence of a Distribution with Itself
Here, for any distribution P.
This is because the distribution is identical so there is no divergence.
rel_entr(a, b) computes the element wise KL divergence term:
sum(rel_entr(box_1, box_2)) sums all the terms to get the total KL divergence.
Using rel_entr is more efficient and avoids manually looping over the array.
Results from rel_entr should match the manual calculation.
Output:
Using Scipy rel_entr function KL-divergence(box_1 || box_2): 0.057 KL-divergence(box_2 || box_1): 0.056 KL-divergence(box_1 || box_1): 0.000
Applications
Some of the applications of KL Divergence are:
Information Theory: Quantifies how much information is lost when one probability distribution is used to approximate another.
Machine Learning: Forms the basis of loss functions like cross entropy is used in variational autoencoders (VAEs) and improves classification accuracy.
Natural Language Processing: Supports language modeling, word embedding comparisons and topic modeling approaches such as Latent Dirichlet Allocation (LDA).
Computer Vision: Used in VAEs, GANs and recognition systems to align generated data with real world distributions.
Anomaly Detection: Identifies unusual or suspicious patterns by measuring distribution shifts, helpful in fraud detection and cybersecurity.
Use of KL Divergence in AI
Some specific use cases of KL Divergence in AI are:
Probabilistic Models: Aligns learned distributions with target distributions like in Variational Autoencoders (VAEs).
Reinforcement Learning: Stabilizes policy updates in algorithms like PPO by limiting divergence from previous policies.
Language Models: Guides token probability distributions during fine-tuning and model distillation.
Generative Models: Measures how closely generated data matches real data distributions.
KL Divergence vs Other Distance Measures
Comparison table of KL Divergence with Other Distance Measures:
Measure
Symmetry
Range
Interpretation
Use Cases
KL Divergence
No
[0, ∞)
Measures how much information is lost when one distribution approximates another
VAEs, PPO, NLP, anomaly detection
Jensen Shannon Divergence
Yes
[0, 1] (normalized)
Smoothed, symmetric version of KL
GANs, text similarity
Hellinger Distance
Yes
[0, 1]
Measures geometric similarity between distributions
Probability comparisons, clustering
Total Variation Distance
Yes
[0, 1]
Maximum difference between probabilities over all events
Robust statistics, hypothesis testing
Wasserstein Distance
Yes
[0, ∞)
Minimum “cost” of transforming one distribution into another
GANs (WGAN), image generation
Limitations
Some of the limitations of KL Divergence are:
Asymmetry: KL divergence is not symmetric (KL(P∣∣Q) (KL(Q∣∣P)) so the “distance” from P to Q is not the same as from Q to P. This makes interpretation harder compared to true metrics.
Infinite Values: If distribution Q(x)=0 in places where P(x)>0, the divergence becomes infinite. This can cause issues in practice especially with sparse or imperfectly estimated distributions.
Support Mismatch Sensitivity: KL requires Q to have nonzero probability wherever P does. If the supports don’t overlap well, the measure breaks down or becomes unstable.
Not a True Distance Metric: KL doesn’t satisfy properties like symmetry and triangle inequality so it cannot be used directly as a “distance” in geometric sense.
Mode Seeking Behavior: Minimizing KL(P∣∣Q) tends to make Q focus only on the most likely regions of P and ignore less probable regions which can cause problems in generative modeling.