The Empirical Cumulative Distribution Function (ECDF) is a non-parametric method for estimating the Cumulative Distribution Function (CDF) of a random variable. Unlike parametric methods, the ECDF makes no assumptions about the underlying probability distribution of the data.
It is defined as a step function that increases by $1/n$ at each observed data point, where $n$ is the total number of observations in the dataset.
The ECDF is a useful tool for visualizing the distribution of a dataset and can provide insights into the underlying distribution that would be difficult to obtain through traditional summary statistics.
Before getting into the details of the Empirical Cumulative Distribution Function (ECDF), it's important to understand a few foundational concepts related to probability distributions:
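The ECDF is defined as follows:
$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le x\}$$
where $x_1, \dots, x_n$ are the observations and $\mathbf{1}\{\cdot\}$ is the indicator function, equal to 1 when its condition holds and 0 otherwise.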
The ECDF has several useful properties:
Now let's move on to some examples of how to compute and plot the ECDF. Before starting this tutorial, you should have a basic understanding of the R language and its data structures, and a recent version of R installed on your computer.
To compute and plot the Empirical Cumulative Distribution Function (ECDF) in R, we generate sample data, compute the ECDF using the ecdf() function, and plot the result.
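A minimal sketch of that workflow is shown here. The sample size, the seed, and the names `sample_data` and `ecdf_fn` are illustrative assumptions, not values taken from the original example.

```r
# Illustrative sample: 50 draws from a standard normal (seed chosen arbitrarily)
set.seed(42)
sample_data <- rnorm(50)

# ecdf() returns a function that can be evaluated at any x
ecdf_fn <- ecdf(sample_data)

# Proportion of observations less than or equal to 0
print(ecdf_fn(0))

# Draw the ECDF as a step function
plot(ecdf_fn, main = "ECDF of sample data", xlab = "x", ylab = "Fn(x)")
```

Because `ecdf()` returns a function rather than a vector, the same object can be reused for point evaluation and for plotting.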
Suppose we have a set of 10 data points: 1, 2, 3, 4, 4, 5, 6, 7, 8 and 9. We want to compute the ECDF of this data set.
Manually, we would first sort the data in ascending order: 1, 2, 3, 4, 4, 5, 6, 7, 8, 9. Then, for each value of x, we would count the number of observations that are less than or equal to x and divide by the total number of observations.
To compute the ECDF at x=5, we would count the number of observations that are less than or equal to 5, which is 6. Dividing by the total number of observations, we get $6/10 = 0.6$. We would repeat this process for all values of x.
The first step is to sort the data in ascending order and calculate the number of data points:
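One way this step might look is sketched here. The unsorted ordering of `data` is an assumption, inferred from the ecdf() output shown later in this section. The names `data`, `sorted_data` and `n` are likewise illustrative.

```r
# Data in its original (unsorted) order; this ordering is an assumption
data <- c(1, 2, 5, 4, 4, 3, 6, 7, 8, 9)

# Sort ascending and count the observations
sorted_data <- sort(data)
n <- length(sorted_data)

print(paste("Length :", n))
print(sorted_data)
```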
Output:
'Length : 10'
[1] 1 2 3 4 4 5 6 7 8 9
To compute the ECDF, we need to loop over each data point in the sorted dataset and calculate the proportion of data points that are less than or equal to that point:
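The loop described above can be sketched as follows. The names `sorted_data` and `ecdf_manual` are assumptions made for illustration.

```r
sorted_data <- sort(c(1, 2, 5, 4, 4, 3, 6, 7, 8, 9))
n <- length(sorted_data)

# Pre-allocate the result, then fill it one point at a time
ecdf_manual <- numeric(n)
for (i in seq_len(n)) {
  # Proportion of observations less than or equal to the i-th sorted value
  ecdf_manual[i] <- sum(sorted_data <= sorted_data[i]) / n
}

print(ecdf_manual)
```

The comparison `sorted_data <= sorted_data[i]` yields a logical vector, and `sum()` counts its TRUE entries, so tied values (the two 4s here) receive the same ECDF value.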
Output:
[1] 0.1 0.2 0.3 0.5 0.5 0.6 0.7 0.8 0.9 1.0
In R, we can compute the ECDF using the built-in ecdf() function:
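A short sketch of the built-in approach follows. It assumes the data are held in their original, unsorted order, and the names `ecdf_fn` and `ecdf_builtin` are illustrative.

```r
# Data in its original (unsorted) order; this ordering is an assumption
data <- c(1, 2, 5, 4, 4, 3, 6, 7, 8, 9)

# ecdf() builds the step function from the data
ecdf_fn <- ecdf(data)

# Evaluating it on the data returns one value per observation,
# in the same order as the input vector
ecdf_builtin <- ecdf_fn(data)
print(ecdf_builtin)
```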
Output:
[1] 0.1 0.2 0.6 0.5 0.5 0.3 0.7 0.8 0.9 1.0
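To check that the manual and built-in results agree, both can be evaluated at the same sorted points. This is a sketch under that assumption, with illustrative variable names. It uses `all.equal()` rather than `==` so that small floating-point differences do not cause a spurious mismatch.

```r
data <- c(1, 2, 5, 4, 4, 3, 6, 7, 8, 9)
sorted_data <- sort(data)

# Manual ECDF at the sorted values: mean of a logical vector is a proportion
ecdf_manual <- sapply(sorted_data, function(x) mean(sorted_data <= x))

# Built-in ECDF evaluated at the same sorted values
ecdf_builtin <- ecdf(data)(sorted_data)

# Compare with a numerical tolerance
same <- isTRUE(all.equal(ecdf_manual, ecdf_builtin))
print(same)
```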
Output:
TRUE
The two methods produce the same result once the values are compared in the same order: the manual loop was evaluated on the sorted data, while the ecdf() output above follows the original data order. The ECDF equals 0.1 at the smallest value in the data and 0.2 at the second smallest. Because the value 4 occurs twice, the ECDF jumps by 0.2 there, from 0.3 to 0.5. At the largest value in the data the ECDF equals 1.0.
We can also plot the ECDF using the plot() function:
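A possible plotting call is sketched here. The title, axis labels and the `verticals = TRUE` setting are illustrative choices rather than the original example's settings.

```r
data <- c(1, 2, 5, 4, 4, 3, 6, 7, 8, 9)
ecdf_fn <- ecdf(data)

# plot() on an "ecdf" object draws the step function;
# verticals = TRUE also draws the vertical jumps
plot(ecdf_fn, main = "ECDF of the example data",
     xlab = "x", ylab = "Fn(x)", verticals = TRUE)
```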
Suppose we have a dataset of 100 observations that follows a normal distribution with a mean 0 and a standard deviation of 1. We want to compute the ECDF of this dataset and plot it.
The first step is to generate the data. In R, the rnorm() function draws random values from a normal distribution:
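A minimal sketch of this step is given here. The seed value 123 is an arbitrary assumption, since any fixed seed gives reproducible results.

```r
# Fix the seed so the same sample is produced on every run (value is arbitrary)
set.seed(123)

# 100 draws from a normal distribution with mean 0 and standard deviation 1
data <- rnorm(100, mean = 0, sd = 1)
print(length(data))
```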
Here, we set the random seed to ensure reproducibility and generate 100 random numbers from a normal distribution with a mean of 0 and a standard deviation of 1. The resulting data object is a vector of length 100.
For each value of x, we want to compute the estimated probability that a data point in the dataset is less than or equal to x. We can compute the ECDF values manually using a for loop or with the ecdf() function.
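Both approaches can be sketched together and then compared. The seed and the variable names are assumptions. The comparison uses `all.equal()` so that it tolerates floating-point rounding.

```r
set.seed(123)  # assumed seed
data <- rnorm(100, mean = 0, sd = 1)
sorted_data <- sort(data)
n <- length(sorted_data)

# Manual ECDF: proportion of observations <= each sorted value
ecdf_manual <- numeric(n)
for (i in seq_len(n)) {
  ecdf_manual[i] <- sum(sorted_data <= sorted_data[i]) / n
}

# Built-in ECDF evaluated at the same sorted values
ecdf_builtin <- ecdf(data)(sorted_data)

# Check that the two agree
print(isTRUE(all.equal(ecdf_manual, ecdf_builtin)))
```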
Output:
TRUE
Next, we define a sequence of values for x. For each value of x, we want the probability that a normal random variable is less than or equal to x, which is given by the normal CDF. In R, the pnorm() function computes these CDF values for each value of x. Here the sample mean and standard deviation are passed to pnorm(), so the reference curve is the normal CDF fitted to the sample rather than the exact standard normal CDF.
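A sketch of this computation follows. The grid of 200 points, the seed, and the name `cdf_ref` are illustrative assumptions.

```r
set.seed(123)  # assumed seed
data <- rnorm(100, mean = 0, sd = 1)

# Evenly spaced grid of x values from -4 to 4
x <- seq(-4, 4, length.out = 200)

# Normal CDF at each grid point, using the sample mean and standard deviation
cdf_ref <- pnorm(x, mean = mean(data), sd = sd(data))
print(head(cdf_ref))
```

Replacing `mean(data)` and `sd(data)` with 0 and 1 would give the exact standard normal CDF instead of the fitted one.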
We can plot the true CDF values and ECDF values on the same plot to visualize how closely they match. Here, we use the plot() function to create a line plot with x-values from -4 to 4 and y-values corresponding to the true CDF values in blue and the ECDF values in red. We also add a legend to the plot to distinguish between the two lines.
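The plot described above might be produced along these lines. The seed, grid size, line widths, labels and legend position are all illustrative assumptions.

```r
set.seed(123)  # assumed seed
data <- rnorm(100, mean = 0, sd = 1)
x <- seq(-4, 4, length.out = 200)

# Reference normal CDF and ECDF evaluated on the same grid
cdf_ref <- pnorm(x, mean = mean(data), sd = sd(data))
ecdf_vals <- ecdf(data)(x)

# Smooth reference curve in blue
plot(x, cdf_ref, type = "l", col = "blue", lwd = 2, ylim = c(0, 1),
     xlab = "x", ylab = "Cumulative probability",
     main = "Normal CDF vs. ECDF")

# ECDF as a red step line on the same axes
lines(x, ecdf_vals, type = "s", col = "red", lwd = 2)

legend("topleft", legend = c("Normal CDF", "ECDF"),
       col = c("blue", "red"), lwd = 2)
```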
In summary, we first generate the normal data using the rnorm() function and compute the sample mean and standard deviation using the mean() and sd() functions. We then define a sequence of values for x and use the pnorm() function to compute the CDF values at each x. We also compute the ECDF manually using a for loop and the sum() function. Finally, we plot both the true CDF and the ECDF on the same axes.