
URL: https://towardsdatascience.com/probability-for-data-scientists-the-capable-chi-squared-distribution-abced58fa157/



Probability for Data Scientists: The Capable Chi-Squared Distribution

Interactive Visualization of the Distribution Functions


Series on Theories


The purpose of this article is to introduce the chi-squared probability distribution. In generating a series of articles on probability, I aim to describe each distribution in an intuitive, concise, useful way. There will not be a focus on derivation or proofs. Instead, I hope to focus on the intuition around the distribution.

Distributions not covered in this article:

Key

RV = random variable.

If unfamiliar with random variables, prerequisite explanations can be found here.

Why is the chi square distribution important?

This distribution serves as a powerful theoretical model.

Its power comes from 3 key statistical properties:

  1. The central limit theorem essentially states that, for samples from many different populations*, the distribution of the sample mean approaches a normal distribution as the sample size increases.
  2. With simple arithmetic transformations (subtract the mean, then divide by the standard deviation), any random variable that follows a normal distribution can be "standardized" to mean 0 and variance 1, the standard normal distribution.
  3. Squaring any variable from the standard normal distribution produces a chi square random variable with 1 degree of freedom.

Therefore, any normally behaved quantity can be transformed to a chi square quantity! This is very important.
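To make the standardize-then-square idea concrete, here is a minimal simulation sketch using only the Python standard library. The mean of 10, standard deviation of 2, and the helper names are assumptions chosen for illustration; they do not come from the original notebook.

```python
import math
import random


def squared_standardized_draws(mu, sigma, n, seed=0):
    """Draw n values from N(mu, sigma^2), standardize them, and square them."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        x = rng.gauss(mu, sigma)  # a normally behaved quantity
        z = (x - mu) / sigma      # standardized: mean 0, variance 1
        draws.append(z * z)       # squared standard normal
    return draws


def chi2_cdf_df1(x):
    """Chi square CDF for 1 degree of freedom: P(Z^2 <= x) = erf(sqrt(x / 2))."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0


samples = squared_standardized_draws(mu=10.0, sigma=2.0, n=100_000)
empirical = sum(s <= 1.0 for s in samples) / len(samples)
print(f"empirical P(Z^2 <= 1):      {empirical:.3f}")
print(f"chi square (df=1) CDF at 1: {chi2_cdf_df1(1.0):.3f}")
```

The two printed values should agree closely, which is what we expect if the squared, standardized draws behave like a chi square RV with 1 degree of freedom.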

What is a chi square random variable?

There are multiple different ways to define a chi square RV. Here, I will show 3 of them.

Way 1

Though initially scary, the easiest definition for me follows. A chi square RV is any RV described by the following probability density function:

[Image: written by author]

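The first equation f(x|p) is the "probability density of X=x given p degrees of freedom". Essentially, interpret this statement as: the density of X at the realized value x is a function of p! Because X is continuous, the density is not itself a probability; probabilities come from integrating the density over an interval.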

The second equation specifies p=1 degrees of freedom.
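For reference, the standard textbook form of the chi square density (as given in reference [1]) is written out below in LaTeX. This is presumably what the two equations describe, but it is a reconstruction from the surrounding text rather than a copy of the author's image.

```latex
% General case: p degrees of freedom
f(x \mid p) = \frac{1}{\Gamma(p/2)\, 2^{p/2}}\; x^{p/2 - 1}\, e^{-x/2},
\qquad 0 \le x < \infty, \quad p = 1, 2, \ldots

% Special case p = 1, using \Gamma(1/2) = \sqrt{\pi}
f(x \mid 1) = \frac{1}{\sqrt{2\pi}}\; x^{-1/2}\, e^{-x/2},
\qquad 0 < x < \infty
```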

What is a degree of freedom, you ask? A degree of freedom is just a parameter. The parameter alters the shape of the function. In the PDF and CDF plots below, you can see that the shape of the PDF drastically changes with the value of p. Our goal in many statistics problems is to estimate p from our data.

Repeat. Our goal in many statistics problems is to estimate p (a property of the population) from our data (given our sample).

This is a common theme in data science and machine learning. What does a neural network do? It has a bunch of parameters that are estimated from data using backpropagation. This is a much larger-scale version of the parameter estimation we do in many problems.

In the formula above, we can also employ the indicator function to make this more mathematically correct. The indicator function turns "on" (times value 1) the density when x is between 0 and positive infinity and "off" (times value 0) for all other values of x.
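As a rough sketch of how the on/off switch works in practice, the snippet below evaluates the standard chi square density in Python. The function names are hypothetical, and the density is computed on the log scale (an implementation choice of this sketch) to avoid overflow for large p.

```python
import math


def indicator_positive(x):
    """Return 1.0 when 0 < x < infinity, and 0.0 otherwise."""
    return 1.0 if 0.0 < x < math.inf else 0.0


def chi2_pdf(x, p):
    """Chi square density with p degrees of freedom, switched on only for x > 0."""
    switch = indicator_positive(x)
    if switch == 0.0:
        return 0.0  # the indicator turns the density "off"
    log_density = (
        (p / 2.0 - 1.0) * math.log(x)
        - x / 2.0
        - math.lgamma(p / 2.0)
        - (p / 2.0) * math.log(2.0)
    )
    return switch * math.exp(log_density)


print(f"{chi2_pdf(5.0, 8):.3f}")   # density at x=5 with p=8
print(f"{chi2_pdf(-1.0, 8):.3f}")  # outside the support, so the density is 0
```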

Way 2

The integral of a PDF is a CDF by definition**. We could alternatively say that any chi square RV is specified by a CDF of this form:

[Image: written by author]

Pause.

Sometimes abbreviations are useful, and sometimes they are a nuisance. Here is a case where an abbreviation can be more of a nuisance than a help. If you refer to the Wikipedia page for the chi square distribution, you might see a form that looks like this:

[Image: CDF expression on Wikipedia page]

It specifies that the little doo-dad γ (gamma) denotes the lower incomplete gamma function. If you happen to cross-reference Wolfram Alpha, you'll find a form of the lower incomplete gamma function that looks like this:

[Image: lower incomplete gamma function from Wolfram Alpha]

As you can see, this is just an (overly) complicated way of writing the equation I wrote the first time.
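To see that the CDF really is just the integral of the PDF, the sketch below integrates the standard chi square density numerically and compares the result with a closed form that holds only for even degrees of freedom. The midpoint rule, the step count, and the function names are assumptions made for this illustration.

```python
import math


def chi2_pdf(x, p):
    """Chi square density with p degrees of freedom (0 outside x > 0)."""
    if x <= 0.0:
        return 0.0
    return math.exp(
        (p / 2.0 - 1.0) * math.log(x)
        - x / 2.0
        - math.lgamma(p / 2.0)
        - (p / 2.0) * math.log(2.0)
    )


def chi2_cdf_numeric(x, p, steps=20_000):
    """Midpoint-rule approximation of the integral of the PDF from 0 to x."""
    if x <= 0.0:
        return 0.0
    width = x / steps
    return width * sum(chi2_pdf((i + 0.5) * width, p) for i in range(steps))


def chi2_cdf_even(x, p):
    """Closed form for even p only: 1 - exp(-x/2) * sum_{j < p/2} (x/2)^j / j!."""
    if p % 2 != 0:
        raise ValueError("this closed form requires an even p")
    if x <= 0.0:
        return 0.0
    half = x / 2.0
    partial = sum(half ** j / math.factorial(j) for j in range(p // 2))
    return 1.0 - math.exp(-half) * partial


print(f"numeric integral: {chi2_cdf_numeric(10.0, 8):.3f}")  # 0.735
print(f"closed form:      {chi2_cdf_even(10.0, 8):.3f}")     # 0.735
```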

Way 3

Without going into a full derivation, I will present an alternative way to specify chi square.

The square of any standard normal RV is a chi square RV with 1 degree of freedom. The sum of k independent, squared standard normal RVs is a chi square RV with k degrees of freedom. Here, we assume Za, Zb, and Zc are independent. Then, their sum of squares is chi square with df=3.

[Image: written by author]

Note the implications of these theorems. Because chi square comes from a squared value, it will take on only x values between 0 and infinity. This property makes it useful for modeling quantities related to error.
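A quick simulation can illustrate the sum-of-squares definition. This is a sketch under assumed settings (100,000 draws, a fixed seed, and a hypothetical helper name); it checks that the simulated values are never negative and that their mean and variance land near 3 and 6, the known mean and variance for df=3.

```python
import random
import statistics


def simulate_chi2(k, n, seed=0):
    """Simulate n chi square draws, each the sum of k squared standard normals."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k)) for _ in range(n)]


draws = simulate_chi2(k=3, n=100_000)
print("smallest draw is non-negative:", min(draws) >= 0.0)
print(f"sample mean:     {statistics.fmean(draws):.2f}  (theory: 3)")
print(f"sample variance: {statistics.variance(draws):.2f}  (theory: 6)")
```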

Many mathematical proofs on convergence, or limiting behavior of random variables, hinge on the inclusion of an error/tolerance term. This tolerance term is always considered to be a positive value, greater than 0. Practically it is also a small value. For example, briefly consider the following definition of convergence in probability:

[Image: example of convergence definition]

In this statement, epsilon is greater than 0. We place no other stipulations on epsilon, but without it, we cannot define convergence!
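For readers who want the statement written out, the standard definition of convergence in probability (see references [1] and [2]) is given below in LaTeX. The notation X_n and X is the conventional one and is assumed here, since the original image did not survive.

```latex
% A sequence X_1, X_2, \ldots converges in probability to X if,
% for every tolerance \varepsilon > 0,
\lim_{n \to \infty} P\left( \lvert X_n - X \rvert \ge \varepsilon \right) = 0
```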

This is important!! Stop zoning out. Convergence is important because it allows us to look at how the behavior of random variable samples changes as we increase the observed sample size. This means we can logically quantify our ability to correctly estimate parameters when we collect lots of data!
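The effect of sample size can be sketched with a small experiment. Below, each trial averages n squared standard normal draws (chi square with 1 degree of freedom, whose expected value is 1) and records whether the average misses 1 by at least a tolerance. The tolerance of 0.2, the sample sizes, and the trial count are arbitrary choices made for this illustration.

```python
import random


def tail_probability(n, epsilon, trials=1000, seed=0):
    """Estimate P(|sample mean - 1| >= epsilon) for n chi square (df=1) draws."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        mean = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n
        if abs(mean - 1.0) >= epsilon:
            misses += 1
    return misses / trials


estimates = {n: tail_probability(n, epsilon=0.2) for n in (10, 100, 1000)}
for n, prob in estimates.items():
    print(f"n={n:>4}: estimated P(|mean - 1| >= 0.2) = {prob:.3f}")
```

The estimated probability of a large miss should shrink toward 0 as n grows, which is the behavior that convergence in probability describes.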

In closing, the normal distribution is used to model idealized scenarios because it has convenient symmetry which gives it nice mathematical properties. Its close cousin, the chi square distribution, can similarly be used to model many idealized scenarios. As a result, it is used heavily in statistics.

Note * Population Requirements of CLT

  1. Independent sample draws.
  2. Population mean exists.
  3. Population variance exists and is finite.

Note ** PDF/CDF Relationship

[Image: written by author, reference [1]]
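The usual statement of this relationship for a continuous RV, following reference [1], can be written in LaTeX as below. The symbols F and f for the CDF and PDF are the conventional ones and are assumed here.

```latex
% The CDF accumulates the density up to x ...
F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t)\, dt

% ... so the PDF is the derivative of the CDF wherever it is differentiable
\frac{d}{dx} F_X(x) = f_X(x)
```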

Probability Density Function

Now, we can observe how the value of the parameter shifts the PDF. The PDF describes the probability density at each continuous value of y. Click play and drag the bar to change parameter p. When the chi square distribution is defined through the summation mentioned before, the parameter can only be a positive integer. For p=8, the density at y=5 is approximately 0.11.

[Image: PDF traces for p = {1, ..., 20} by author]

For the interactive plot, I reduced the parameters displayed (just the even values shown) due to hosting image size constraints; however, all values are shown on the matrix plot.

Cumulative Distribution Function

Finally, we can observe how the value of the parameter shifts the cumulative distribution function (CDF). The CDF describes the probability that Y is less than or equal to each value of y. Click play and drag the bar to change parameter p. For p=8, the probability that Y is less than or equal to 10 is approximately 0.73.

[Image: CDF traces for p = {1, ..., 20} by author]

More Key Features

  1. Specified by 1 parameter p
  2. Expectation = p = the average value for the variable
  3. Variance = 2p = the spread of the values

[Image: written by author]
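The mean and variance follow from the sum-of-squares definition. A short derivation sketch in LaTeX, assuming Z_1, ..., Z_p are independent standard normal RVs and using the known fourth moment E[Z^4] = 3:

```latex
% One squared standard normal
E[Z^2] = \operatorname{Var}(Z) + (E[Z])^2 = 1 + 0 = 1
\operatorname{Var}(Z^2) = E[Z^4] - (E[Z^2])^2 = 3 - 1 = 2

% Sum of p independent squared standard normals, X = Z_1^2 + \cdots + Z_p^2
E[X] = \sum_{i=1}^{p} E[Z_i^2] = p
\operatorname{Var}(X) = \sum_{i=1}^{p} \operatorname{Var}(Z_i^2) = 2p
```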

Nice Properties of Chi Square Distribution (Advanced)

  1. Centrality/Noncentrality!
  2. Relationship to t distribution
  3. Relationship to F distribution
  4. Special case of the gamma distribution
  5. Used extensively in hypothesis testing
[Image: generated by author]
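For orientation, three of the relationships named above can be stated compactly in LaTeX. These are the standard textbook results (see reference [1]); the symbols Z, U, V, p, and q are notation assumed for this sketch.

```latex
% Setup: Z \sim N(0,1), \; U \sim \chi^2_p, \; V \sim \chi^2_q, \; all independent

% t distribution with p degrees of freedom
T = \frac{Z}{\sqrt{U/p}} \sim t_p

% F distribution with p and q degrees of freedom
F = \frac{U/p}{V/q} \sim F_{p,\,q}

% Gamma special case (shape \alpha, scale \beta)
\chi^2_p = \mathrm{Gamma}\left(\alpha = p/2,\; \beta = 2\right)
```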

That's a wrap. To access code for the images I generated, the full GitHub notebook is linked. Please comment below with stats questions or topics you would like to see in future posts!

References

  1. Casella, G. & Berger, R. L. Statistical Inference. (Cengage Learning, 2021).
  2. https://en.wikipedia.org/wiki/Convergence_of_random_variables#Convergence_in_probability
  3. https://en.wikipedia.org/wiki/Chi-squared_distribution
  4. https://www.wolframalpha.com/


Written By

Kate Wall
