The Kolmogorov–Smirnov (KS) test is a non-parametric method for comparing probability distributions: it checks whether a sample differs significantly from a reference distribution, or whether two samples differ significantly from each other. It is widely used in statistics, data analysis and quality control because it does not assume any specific distribution form.
Compares one-dimensional distributions without assuming a specific form.
Often used to test random number uniformity or model fit.
Provides a statistic and p-value to determine if samples differ significantly.
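The cumulative distribution function (CDF) of the Kolmogorov distribution, the large-sample distribution of the normalized KS statistic under the null hypothesis, is defined by:
$$K(x) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 x^2}$$
where
n is the sample size.
x is the normalized Kolmogorov-Smirnov statistic, $x = \sqrt{n}\, D$, with D the KS statistic.
k is the index of summation in the series.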
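The Kolmogorov CDF can be evaluated numerically by truncating its series. The following is a minimal sketch using only the Python standard library; the function name `kolmogorov_cdf`, the number of terms and the small-x cutoff are choices made for this illustration, not part of any library.

```python
import math

def kolmogorov_cdf(x, terms=100):
    """Approximate K(x) = 1 - 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 x^2)."""
    if x < 0.2:
        # K(x) is negligible here (below about 1e-12), and the alternating
        # series converges slowly for tiny x, so return 0 directly.
        return 0.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
    # Clamp to [0, 1] to absorb floating-point rounding.
    return min(1.0, max(0.0, 1.0 - 2.0 * total))

for x in (0.5, 1.0, 1.36, 2.0):
    print(f"K({x}) = {kolmogorov_cdf(x):.4f}")
```

With this sketch, K(1.36) comes out close to 0.95, which is consistent with the commonly quoted 5% critical value of about 1.36 for the normalized statistic.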
How Does the Kolmogorov-Smirnov Test Work?
The Kolmogorov-Smirnov test proceeds in the following steps:
1. Hypotheses Formulation
Null Hypothesis: The sample follows a specified distribution.
Alternative Hypothesis: The sample does not follow the specified distribution.
2. Selection of a Theoretical Distribution
A theoretical distribution (e.g., normal, exponential) is chosen as the reference against which the sample distribution is tested. This choice is usually based on theoretical expectations or prior knowledge.
3. Calculation of the Test Statistic (D)
In a one-sample KS test, the statistic D is the maximum vertical difference between the sample's empirical distribution function (EDF) and the reference CDF.
In a two-sample Kolmogorov-Smirnov test, the statistic D is the maximum vertical difference between the EDFs of two independent samples.
4. Determination of Critical Value or P-value
The test statistic (D) is compared to a critical value from the Kolmogorov-Smirnov distribution table or, more commonly, a p-value is calculated.
If the p-value is less than the significance level (commonly 0.05), the null hypothesis is rejected, suggesting that the sample distribution does not match the specified distribution.
5. Interpretation of Results
If the null hypothesis is rejected, there is evidence that the sample does not follow the specified distribution. If it is not rejected, the data are consistent with the specified distribution, although this does not prove that the sample was drawn from it.
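Steps 4 and 5 can be sketched in a few lines. This sketch assumes the common large-sample approximation for the critical value, obtained by keeping only the first term of the Kolmogorov series; the function names and the example values of D are hypothetical and serve only as illustration.

```python
import math

def ks_critical_value(n, alpha=0.05):
    """Large-sample approximation: sqrt(-ln(alpha / 2) / 2) / sqrt(n)."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) / math.sqrt(n)

def reject_null(d_stat, n, alpha=0.05):
    """Reject H0 when the observed statistic exceeds the critical value."""
    return d_stat > ks_critical_value(n, alpha)

n = 100
print(f"critical value for n={n}: {ks_critical_value(n):.4f}")  # about 0.1358
print(reject_null(0.08, n))  # False: D is below the critical value
print(reject_null(0.20, n))  # True: D is above the critical value
```

For small samples, exact tables or a library routine are generally preferable to this approximation.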
One Sample Kolmogorov-Smirnov Test
The one-sample Kolmogorov-Smirnov (KS) test is used to determine whether a sample comes from a specific distribution. It is particularly useful when the assumption of normality is in question or when dealing with small sample sizes. The test statistic, denoted as $D_n$, measures the maximum difference between the sample's empirical distribution function and the cumulative distribution function of the reference distribution.
Empirical Distribution Function
The empirical distribution function at the value x represents the proportion of data points in the sample that are less than or equal to x. The function can be defined as:
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x)$$
where
n is the number of observations in the sample
$X_i$ represents the individual observations
$I(X_i \le x)$ is an indicator function that is 1 if $X_i \le x$ and 0 otherwise, i.e. each observation contributes 1 if it satisfies the condition and 0 if it does not.
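As a small illustration, the empirical distribution function can be computed directly from its definition by counting the observations that do not exceed x. The function name `edf` and the data values are made up for this example.

```python
def edf(sample, x):
    """Proportion of observations in `sample` that are <= x."""
    n = len(sample)
    # The generator yields 1 for every observation satisfying X_i <= x.
    return sum(1 for xi in sample if xi <= x) / n

data = [2.1, 3.4, 1.9, 5.0, 4.2]
for x in (1.0, 2.1, 4.0, 5.0):
    print(x, edf(data, x))
# 1.0 -> 0.0, 2.1 -> 0.4, 4.0 -> 0.6, 5.0 -> 1.0
```

The result is a step function that rises by 1/n at each observation, starting at 0 below the smallest value and reaching 1 at the largest.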
Kolmogorov–Smirnov Statistic
The Kolmogorov–Smirnov statistic for a given cumulative distribution function $F(x)$ is defined as:
$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$
where
sup stands for supremum, which means the least upper bound (in effect, the largest value) of the difference over all possible values of x.
$F(x)$ is the theoretical cumulative distribution function.
$F_n(x)$ is the empirical cumulative distribution function of the sample (calculated as described above).
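Because the empirical distribution function is a step function that jumps by 1/n at each observation, the supremum does not have to be searched over all x: for a continuous reference CDF F it is enough to check the gap just before and just after each jump. Writing $x_{(1)} \le \dots \le x_{(n)}$ for the sorted sample (notation introduced here for illustration), a standard computational form is:

$$D_n = \max_{1 \le i \le n} \max\left\{ \frac{i}{n} - F\left(x_{(i)}\right),\; F\left(x_{(i)}\right) - \frac{i-1}{n} \right\}$$

For example, for the made-up sample 0.1, 0.4, 0.7 tested against the uniform distribution on [0, 1], where $F(x) = x$, the six candidate gaps are 0.233, 0.1, 0.267, 0.067, 0.3 and 0.033 (rounded), so $D_3 = 0.3$.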
Implementation
Here we generate 100 random normal samples, perform the KS test to check normality and compare the result with the critical value, while plotting the sample histogram against the reference PDF.
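The original listing is not reproduced here, so what follows is a minimal standard-library sketch of the same procedure rather than the article's own code. It omits the plot, uses an arbitrary seed, and relies on the large-sample Kolmogorov series for the p-value; the helper names are hypothetical. In practice a library routine such as `scipy.stats.kstest` is the more usual choice.

```python
import math
import random
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """One-sample D: largest gap between the EDF and `cdf`, checked at every jump."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Gap just after the jump (i/n - f) and just before it (f - (i-1)/n).
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def asymptotic_p_value(d, n, terms=100):
    """P(D_n > d) from the Kolmogorov series; a large-sample approximation."""
    x = math.sqrt(n) * d
    if x < 0.2:
        return 1.0  # the series is slow here and the p-value is essentially 1
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))

random.seed(42)  # arbitrary seed, only for reproducibility
sample = [random.gauss(0.0, 1.0) for _ in range(100)]

d_stat = ks_statistic(sample, NormalDist(0.0, 1.0).cdf)
p_value = asymptotic_p_value(d_stat, len(sample))
critical = 1.358 / math.sqrt(len(sample))  # approximate 5% critical value

print(f"D = {d_stat:.4f}, p = {p_value:.4f}, critical value = {critical:.4f}")
print("reject H0" if d_stat > critical else "fail to reject H0")
```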
The small KS statistic and p-value greater than 0.05 indicate the sample's EDF closely matches the normal CDF, so we fail to reject the null hypothesis.
Two-Sample Kolmogorov–Smirnov Test
The two-sample KS test compares two independent samples by measuring the maximum difference between their empirical distribution functions to assess if they come from the same distribution.
Empirical Distribution Function (EDF)
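The empirical distribution function at the value x in each sample represents the proportion of observations less than or equal to x. Mathematically, the EDFs for the two samples are given by:
For Group 1:
$$F_{1,n_1}(x) = \frac{1}{n_1} \sum_{i=1}^{n_1} I(X_{1,i} \le x)$$
For Group 2:
$$F_{2,n_2}(x) = \frac{1}{n_2} \sum_{j=1}^{n_2} I(X_{2,j} \le x)$$
where
$n_1$ and $n_2$ are the sample sizes for the two groups,
$X_{1,i}$ and $X_{2,j}$ represent individual observations in the respective samples,
$I(X_{1,i} \le x)$ and $I(X_{2,j} \le x)$ are the indicator functions.
Kolmogorov–Smirnov Statistic
The two-sample statistic is defined as:
$$D_{n_1,n_2} = \sup_x \left| F_{1,n_1}(x) - F_{2,n_2}(x) \right|$$
where,
sup denotes the supremum, the largest value of the difference over all x values.
$F_{1,n_1}(x)$ and $F_{2,n_2}(x)$ are the empirical cumulative distribution functions (ECDFs) of the two samples.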
Implementation
In Python, the two-sample KS test checks whether two independent samples come from the same distribution by calculating a statistic and a p-value; the null hypothesis is rejected if p < 0.05.
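The original listing is not reproduced here, so the following is a minimal standard-library sketch rather than the article's own code. It assumes two normal samples of 200 points each whose means differ by 1 (values chosen only for illustration), and it approximates the p-value with the Kolmogorov series using the effective sample size n1·n2/(n1+n2). The helper names are hypothetical; `scipy.stats.ks_2samp` is the usual library routine for this test.

```python
import math
import random

def ks_2samp_statistic(a, b):
    """Two-sample D: largest gap between the two EDFs over all pooled values."""
    xs, ys = sorted(a), sorted(b)
    n1, n2 = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n1 and j < n2:
        v = min(xs[i], ys[j])
        # Step past every observation equal to v in both samples (handles ties).
        while i < n1 and xs[i] == v:
            i += 1
        while j < n2 and ys[j] == v:
            j += 1
        d = max(d, abs(i / n1 - j / n2))
    return d

def asymptotic_p_value(d, n1, n2, terms=100):
    """Large-sample p-value using the effective sample size n1*n2/(n1+n2)."""
    x = math.sqrt(n1 * n2 / (n1 + n2)) * d
    if x < 0.2:
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))

random.seed(0)  # arbitrary seed, only for reproducibility
sample1 = [random.gauss(0.0, 1.0) for _ in range(200)]
sample2 = [random.gauss(1.0, 1.0) for _ in range(200)]  # shifted mean

d_stat = ks_2samp_statistic(sample1, sample2)
p_value = asymptotic_p_value(d_stat, len(sample1), len(sample2))
print(f"D = {d_stat:.4f}, p = {p_value:.3g}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```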
A high KS statistic and very small p-value indicate a large difference between the samples' distributions, leading to rejection of the null hypothesis.
One-Sample KS Test vs Two-Sample KS Test
The table below compares the one-sample and two-sample KS tests:

| Features | One-Sample KS Test | Two-Sample KS Test |
|---|---|---|
| Goal | Checks if a single sample fits a theoretical distribution | Checks if two samples come from the same distribution |
| Comparison Metric | Compares the sample's EDF with the theoretical CDF | Compares the EDF of one sample with the EDF of the other sample |
| Null Hypothesis | Sample follows the specified distribution | Both samples come from the same distribution |
| Test Statistic | Maximum vertical deviation between EDF and CDF | Maximum difference between the two EDFs |
Multidimensional Kolmogorov-Smirnov Testing
The KS test can be extended to multidimensional data to compare whether two samples follow the same distribution across all dimensions.
Adapts the one-dimensional KS test to evaluate differences in multiple dimensions simultaneously.
Measures the maximum difference in cumulative distribution functions along each dimension.
Useful in multivariate statistics, machine learning and pattern recognition for comparing multidimensional datasets.
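One very simple reading of "along each dimension" is to compute a two-sample KS statistic separately for each coordinate. The sketch below does only that, and it is a deliberate simplification: it compares marginal distributions, so it cannot detect differences in how the coordinates depend on each other, and proper multivariate KS variants are more involved. The function names and data are hypothetical.

```python
from bisect import bisect_right

def ks_2samp_statistic(a, b):
    """Largest gap between the two EDFs, evaluated at every pooled value."""
    xs, ys = sorted(a), sorted(b)
    n1, n2 = len(xs), len(ys)
    # bisect_right(xs, v) counts the observations in xs that are <= v.
    return max(abs(bisect_right(xs, v) / n1 - bisect_right(ys, v) / n2)
               for v in xs + ys)

def per_dimension_ks(sample_a, sample_b):
    """KS statistic for each coordinate of two samples of equal-length tuples."""
    dims = len(sample_a[0])
    return [ks_2samp_statistic([p[d] for p in sample_a],
                               [p[d] for p in sample_b])
            for d in range(dims)]

# Made-up 2-D data: similar first coordinate, clearly different second one.
a = [(0.1, 1.0), (0.4, 2.0), (0.7, 3.0)]
b = [(0.2, 11.0), (0.5, 12.0), (0.8, 13.0)]
print(per_dimension_ks(a, b))  # second entry is 1.0: no overlap in that coordinate
```

Testing several coordinates this way also means running several tests, so the significance level would need adjusting for multiple comparisons.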
When to Use the Kolmogorov-Smirnov Test
Use the KS test to check if two samples follow the same distribution or have similar distribution shapes.
Apply it to compare cumulative probability distributions and quantify differences between datasets.
Higher maximum differences indicate greater dissimilarity in distribution shapes.
Useful for evaluating probability distributions and the overall shape of data distributions.
Can be used to check parametric models (for example, whether data fit an assumed normal distribution) as well as in purely non-parametric comparisons between samples.
Applications
Checks if a dataset follows a specific distribution for model fitting and prediction.
Compares two datasets to determine if they come from the same distribution.
Validates assumptions about dataset distributions for correct statistical analyses.
Serves as a non-parametric alternative when parametric tests are not applicable.
Limitations
May have limited power with small samples, and with very large samples can flag practically trivial differences as statistically significant.
Assumes observations are independent, making it unsuitable for dependent data.
Applicable only to continuous data, not discrete or categorical data without adjustments.
Focuses on the single largest gap between distributions, so it is most sensitive near the center and may miss differences confined to the tails.
Multiple tests increase the risk of Type I errors in hypothesis testing.
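To give a sense of scale for the last point, assume m independent KS tests are each run at significance level $\alpha$ and every null hypothesis is true (an idealized setting introduced here for illustration). The probability of at least one false rejection is then:

$$P(\text{at least one Type I error}) = 1 - (1 - \alpha)^m$$

With $\alpha = 0.05$ and $m = 10$ this gives $1 - 0.95^{10} \approx 0.40$, far above the nominal 5% of a single test.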