An introduction to the generalized linear model (GLM)
What it is and how the model is fitted & Application to housing prices prediction
In the classical linear model, normality is usually required. This is shown in Figure 0.1: with the explanatory variable X fixed, the distribution of Y is normal (illustrated by each small bell curve), and the regression line passes through the mean of each normal distribution.
However, in the generalized linear model, this requirement is no longer necessary because we can choose a distribution model for those observations, according to our knowledge of the data. This is realized by the link function, which transforms the mean of those observations E[Yᵢ] into a linear form. In addition, homoscedasticity is also no longer required. The variance of errors in Y doesn’t have to be constant.[5]
Components of the generalized linear model
There are three main components of a GLM, and the link function is one of them. Those components are:
- A random component Yᵢ, which is the response variable of each observation. It is worth noting that its distribution is a conditional distribution, which means Yᵢ is conditioned on Xᵢ.
The distribution of Yᵢ belongs to the exponential family, which means Yᵢ has the form, defined as [2]
The difference is that the θᵢ in the canonical form is not transformed, which makes the canonical form easier to work with. Also, note that it is always possible to convert an exponential family to the canonical form. Alternatively, we can write the exponential family in the following form [1]

$$f(y_i; \theta_i, \phi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\} \tag{1.2}$$

which is used in GLM. In Eq 1.2, θᵢ and ϕ are the location parameter (related to the mean) and the scale parameter (related to the variance). In addition, we use μᵢ to denote the mean of Yᵢ. A note on the notation: in Equation 1.2, yᵢ can simply be written as y, just like in Equation 1.1. We just need to keep in mind that yᵢ or y stands for the result of a single observation.
- A linear predictor, which has the familiar form of an ordinary linear model

$$\eta_i = x_i^\top \beta = \sum_{j=1}^{m} x_{ij} \beta_j$$

We will use this to predict the mean of Yᵢ. Note that in Eq 1.1, ηᵢ is not a linear predictor, but a transform function of θᵢ. In this article, we will only use the form given in Eq 1.2.
- A link function g(∙), which transforms the mean of Yᵢ, E(Yᵢ) = μᵢ, into the linear form of the linear predictor, which means

$$g(\mu_i) = \eta_i = x_i^\top \beta$$

The link function is required to be smooth and invertible (a continuous, invertible function on an interval is strictly monotonic).
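To make the three components concrete, here is a minimal Python sketch for a single observation. It assumes a log link purely for illustration; the covariate values, the coefficients, and the helper names (`linear_predictor`, `log_link`, `inverse_log_link`) are hypothetical and not taken from the housing analysis.

```python
import math

def linear_predictor(x, beta):
    """eta_i = sum_j x_ij * beta_j (the linear predictor)."""
    return sum(xj * bj for xj, bj in zip(x, beta))

def log_link(mu):
    """Link g(mu) = log(mu): maps a positive mean onto the real line."""
    return math.log(mu)

def inverse_log_link(eta):
    """Inverse link g^{-1}(eta) = exp(eta): maps eta back to a positive mean."""
    return math.exp(eta)

# Hypothetical values: the leading 1.0 plays the role of an intercept term.
x_i = [1.0, 2.0]
beta = [0.5, -0.25]

eta_i = linear_predictor(x_i, beta)   # 0.5 * 1.0 + (-0.25) * 2.0
mu_i = inverse_log_link(eta_i)        # predicted mean of Y_i
```

Because the link is invertible, applying `log_link` to `mu_i` recovers `eta_i`, which is exactly the relation g(μᵢ) = ηᵢ.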
Why is the exponential family good?
The exponential family possesses quite a few nice properties.
- In multiple sources (Why are exponential families so awesome?, Advantages of the exponential family, Wiki:Exponential family), it’s mentioned that the exponential family is very convenient in Bayesian statistics because those distributions always have conjugate priors.
- Another very important property is that in Equation 1.1, Tᵢ(x) is a sufficient statistic. In simple language, a sufficient statistic is a function that contains all the information of the variable x with regard to the unknown parameter, in this case, θ. The "sufficient" here has the same meaning as the "sufficient" in "sufficient condition" in logic.
More formally, a statistic T(X₁, …, Xₙ) is said to be sufficient for θ if the conditional distribution of X₁, …, Xₙ, given T=t, does not depend on θ for any value of t.
- Apart from the aforementioned two properties, the exponential family also brings many different distributions together under one pattern. Here we will show that it is possible to obtain a general expression for the mean and variance of exponential family distributions, using a, b and ϕ.
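We will use the likelihood to achieve this. Firstly we calculate the log-likelihood of the general form of an exponential family distribution (Equation 1.2). Because the logarithm is monotonic, maximizing the log-likelihood also maximizes the likelihood.

$$l(\theta; y) = \log f(y; \theta, \phi) = \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \tag{2.1}$$

Then we take the partial derivative of it with regard to θ. Note that the part c(y, ϕ) doesn’t contain θ, so it disappears, and we get

$$\frac{\partial l}{\partial \theta} = \frac{y - b'(\theta)}{a(\phi)} \tag{2.3}$$

The trick is that we can treat ∂l/∂θ as a random variable by replacing the observed y with the random variable Y, and use the fact that the expected value of ∂l/∂θ is 0 (a standard property of the log-likelihood):

$$E\left[\frac{\partial l}{\partial \theta}\right] = \frac{E[Y] - b'(\theta)}{a(\phi)} = 0$$

which gives us a very simple formula for E(Y):

$$E[Y] = b'(\theta) \tag{2.5}$$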
There is one very important fact worth mentioning. Finding the maximum of the log-likelihood by setting its first derivative to zero works because of a property of the log-likelihood function of the exponential family: it is concave with regard to θ [3], so a point with zero derivative is the global maximum. Without concavity, a zero derivative would not guarantee a maximum. (This is yet one more nice thing about the exponential family.)
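Now we calculate the variance of Y, Var(Y). Taking the derivative of Equation 2.3, we get the second derivative of the log-likelihood function

$$\frac{\partial^2 l}{\partial \theta^2} = -\frac{b''(\theta)}{a(\phi)} \tag{2.6}$$

We can use the general result

$$E\left[\frac{\partial^2 l}{\partial \theta^2}\right] = -E\left[\left(\frac{\partial l}{\partial \theta}\right)^2\right] \tag{2.7}$$

which is a property of the log-likelihood function. The proof is technical, neither difficult nor interesting, so we omit it in this article. Plugging Equation 2.6 (together with Equation 2.3) into Equation 2.7, we get

$$\frac{b''(\theta)}{a(\phi)} = \frac{E\left[(Y - b'(\theta))^2\right]}{a(\phi)^2} \tag{2.8}$$

Using the mean of Y, which we already have (Equation 2.5), the expectation on the right-hand side of Equation 2.8 is E[(Y − E[Y])²] = Var(Y). Multiplying both sides by a(ϕ)², we immediately get the variance of Y

$$\mathrm{Var}(Y) = b''(\theta)\, a(\phi) \tag{2.9}$$

a(ϕ) can be any function of ϕ, but to make it easier to work with GLM, we usually let

$$a(\phi) = \frac{\phi}{w}$$

where w is a known constant. Then we can write Equation 2.9 as

$$\mathrm{Var}(Y) = \frac{b''(\theta)\, \phi}{w} \tag{2.10}$$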
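As a sanity check of the two formulas, E[Y] = b′(θ) and Var(Y) = b″(θ)a(ϕ), the sketch below uses the Poisson distribution. It relies on the standard fact (not derived in this article) that the Poisson distribution fits Eq 1.2 with θ = log λ, b(θ) = exp(θ) and a(ϕ) = 1. The rate λ = 3, the truncation of the support at 100 terms, and the finite-difference helpers are illustrative assumptions.

```python
import math

def b(theta):
    """Cumulant function of the Poisson distribution: b(theta) = exp(theta)."""
    return math.exp(theta)

def derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

lam = 3.0               # hypothetical Poisson rate
theta = math.log(lam)   # canonical parameter
a_phi = 1.0             # a(phi) = 1 for the Poisson distribution

# Poisson probabilities P(Y = k) for k = 0..99, built iteratively.
pmf = []
p = math.exp(-lam)
for k in range(100):
    pmf.append(p)
    p *= lam / (k + 1)

# Mean and variance computed directly from the distribution.
mean_direct = sum(k * pk for k, pk in enumerate(pmf))
var_direct = sum((k - mean_direct) ** 2 * pk for k, pk in enumerate(pmf))

# Mean and variance from the exponential-family formulas.
mean_formula = derivative(b, theta)
var_formula = second_derivative(b, theta) * a_phi
```

Both routes should agree up to numerical error, and both should be close to λ, since the Poisson mean and variance are equal.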
An example of distributions belonging to the exponential family
The simplest example of GLM is a GLM with an identity link function. This reduces the GLM to an ordinary linear model. Though it’s simple, this case gives us an idea of what the GLM does.
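We know that an ordinary linear model assumes that each observation has a normal distribution. Since it is a special case of GLM, the normal distribution must belong to the exponential family. Here we show how to transform the normal distribution into the form of Eq 1.2:

$$f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y - \mu)^2}{2\sigma^2} \right\} = \exp\left\{ \frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) \right\} \tag{3.1}$$

We can see that it’s very easy: it’s all about moving the constant into the exponential part and expanding the square. Equation 3.1 tells us

$$\theta = \mu, \quad b(\theta) = \frac{\theta^2}{2}, \quad a(\phi) = \phi = \sigma^2, \quad c(y, \phi) = -\frac{y^2}{2\phi} - \frac{1}{2}\log(2\pi\phi)$$

Using the results from the previous section (Equation 2.5 and Equation 2.10, with w = 1), we can now check the mean and variance of the normal distribution

$$E[Y] = b'(\theta) = \theta = \mu, \qquad \mathrm{Var}(Y) = b''(\theta)\, \phi = \sigma^2$$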
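The rewriting of the normal density can be checked numerically. The sketch below assumes the decomposition θ = μ, b(θ) = θ²/2, a(ϕ) = ϕ = σ² and c(y, ϕ) = −y²/(2ϕ) − ½ log(2πϕ), and compares it with the density from Python's standard library. The function name `normal_density_exp_family` and the test values are hypothetical.

```python
import math
from statistics import NormalDist

def normal_density_exp_family(y, mu, sigma2):
    """Normal density written as exp{(y*theta - b(theta)) / a(phi) + c(y, phi)}."""
    theta = mu                   # canonical parameter
    phi = sigma2                 # dispersion, with a(phi) = phi
    b = theta ** 2 / 2           # cumulant function b(theta)
    c = -y ** 2 / (2 * phi) - 0.5 * math.log(2 * math.pi * phi)
    return math.exp((y * theta - b) / phi + c)

mu, sigma = 1.5, 2.0             # hypothetical mean and standard deviation
ys = [-3.0, 0.0, 1.5, 4.2]       # hypothetical evaluation points

reference = NormalDist(mu, sigma)
max_abs_diff = max(
    abs(normal_density_exp_family(y, mu, sigma ** 2) - reference.pdf(y))
    for y in ys
)
```

`max_abs_diff` should be at the level of floating-point rounding, which confirms that the two expressions describe the same density.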
Fitting the model
To fit the model, we use maximum likelihood estimation. As mentioned before, the log-likelihood function of an exponential family distribution is concave, so we can find its maximum by looking for the point where the first derivative is zero. What are we solving right now? A quick recap of the problem: we have an n-dimensional vector of independent response variables Yᵢ, where μᵢ = E[Yᵢ] is linked to a linear predictor via

$$g(\mu_i) = \eta_i = x_i^\top \beta$$

and θᵢ is a canonical parameter. We want to find the β that maximizes the log-likelihood function. Once again, the Yᵢ are independent, so the joint log-likelihood is simply the sum of the individual log-likelihoods. Similar to Eq 2.1, the log-likelihood of β is

$$l(\beta) = \sum_{i=1}^{n} \left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\} \tag{4.3}$$
θᵢ and β are related in the following way: θᵢ is related to the mean of Yᵢ (this depends on the concrete distribution function, in the example of the previous section, θᵢ=μᵢ), β is related to the mean of Yᵢ as well, via the link function. So θᵢ and β are connected through μᵢ, which we will see later in the partial differentiation.
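Now we want to differentiate Eq 4.3 with regard to every element of β (subscripted by index j). This gives

$$\frac{\partial l}{\partial \beta_j} = \frac{1}{a(\phi)} \sum_{i=1}^{n} \left( y_i \frac{\partial \theta_i}{\partial \beta_j} - b'(\theta_i) \frac{\partial \theta_i}{\partial \beta_j} \right)$$

Because of the chain rule, we have

$$\frac{\partial \theta_i}{\partial \beta_j} = \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \beta_j} \tag{4.5}$$

Then after differentiating Eq 2.5, we have

$$\frac{\partial \mu_i}{\partial \theta_i} = b''(\theta_i)$$

$$\frac{\partial \theta_i}{\partial \mu_i} = \frac{1}{b''(\theta_i)} \tag{4.7}$$

We do this to get ∂θᵢ/∂μᵢ, using E[Yᵢ] = μᵢ = b′(θᵢ). The next step is easy: we substitute Eq 4.7 into Eq 4.5, and the result into the derivative of the log-likelihood, to get

$$\frac{\partial l}{\partial \beta_j} = \frac{1}{a(\phi)} \sum_{i=1}^{n} \frac{y_i - b'(\theta_i)}{b''(\theta_i)} \frac{\partial \mu_i}{\partial \beta_j} \tag{4.8}$$

where the factor 1/b″(θᵢ) comes from Eq 4.7. Eq 4.8 can be further simplified. As we discussed, in GLM, Var[Yᵢ] is not constant. Therefore, we can consider Var[Yᵢ] as a function of E[Yᵢ] = μᵢ and define

$$V(\mu_i) = \frac{b''(\theta_i)}{w} \tag{4.9}$$

such that (refer to Eq 2.10)

$$\mathrm{Var}[Y_i] = V(\mu_i)\, \phi \tag{4.10}$$

Putting Eq 4.10 into Eq 4.8 (with a(ϕ) = ϕ/w and b′(θᵢ) = μᵢ) and setting it to zero (we are interested in the point where the first derivative of the log-likelihood function is zero), we get

$$\sum_{i=1}^{n} \frac{y_i - \mu_i}{V(\mu_i)} \frac{\partial \mu_i}{\partial \beta_j} = 0 \quad \text{for every } j \tag{4.11}$$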
Eq 4.11 gives us a system of non-linear equations in β: if j goes from 1 to m, then there are m such equations in m unknowns. How do we solve a system like this, when the equations can be highly complicated and generally have no closed-form solution? Numerical methods come into play in this case. Here we will apply Iteratively Re-weighted Least Squares (IRLS). In this article, we will not go into the details. Generally speaking, this method approximates the solution iteratively. If V(μᵢ) were known and independent of β, Eq 4.11 would be exactly the equations solved by weighted least squares, with the least squares objective

$$S = \sum_{i=1}^{n} \frac{(y_i - \mu_i)^2}{V(\mu_i)} \tag{4.12}$$

In other words, the problem from now on is to find the β that minimizes Eq 4.12, with the weights re-computed at each iteration. This way we obtain a solution of Eq 4.11.
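The iteration can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the algorithm as implemented in any particular library: it assumes a Poisson response with a log link (so V(μ) = μ and g′(μ) = 1/μ), a single covariate plus an intercept, and a fixed number of iterations. The function name `irls_poisson` and the toy data are hypothetical.

```python
import math

def irls_poisson(x, y, n_iter=25):
    """Fit log(mu_i) = beta0 + beta1 * x_i by iteratively re-weighted least squares."""
    # Start from a constant model whose mean equals the sample mean.
    beta0, beta1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(n_iter):
        eta = [beta0 + beta1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Working response z_i = eta_i + (y_i - mu_i) * g'(mu_i), with g'(mu) = 1/mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        # Working weights 1 / (V(mu_i) * g'(mu_i)^2) = mu_i for the Poisson log link.
        w = mu
        # Weighted least squares of z on (1, x): solve the 2x2 normal equations.
        s_w = sum(w)
        s_wx = sum(wi * xi for wi, xi in zip(w, x))
        s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        s_wz = sum(wi * zi for wi, zi in zip(w, z))
        s_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = s_w * s_wxx - s_wx ** 2
        beta0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        beta1 = (s_w * s_wxz - s_wx * s_wz) / det
    return beta0, beta1

# Hypothetical toy data: counts that grow with x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1, 2, 4, 6, 12]

beta0, beta1 = irls_poisson(x, y)
mu_hat = [math.exp(beta0 + beta1 * xi) for xi in x]

# For this model the score equations reduce to sum_i (y_i - mu_i) * x_ij = 0.
score_intercept = sum(yi - m for yi, m in zip(y, mu_hat))
score_slope = sum((yi - m) * xi for yi, m, xi in zip(y, mu_hat, x))
```

At convergence both score values should be numerically zero, which is the statement of Eq 4.11 for this model. A production implementation would add a convergence test and handle general design matrices, which is what libraries such as statsmodels do.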
Use GLM in housing price prediction
The dataset, housing price, is from one of the GettingStarted Prediction Competitions on Kaggle. The code for the whole analysis is available at
The results are evaluated using the root-mean-square deviation (RMSD). The GLM method gave a top 33% position. Of course, it’s not the best method to use, and there are results with 0 errors on the leaderboard. But it works well as a demonstration of GLM in practice.
I’d say that there’s not too much to talk about concerning the implementation, since the reason for each step is explained in the notebook. Most of the code is data exploration, preprocessing, model comparison, and model diagnostics. The modeling part boils down to a single line (of course, don’t forget about import statsmodels.api as sm):
model_full = sm.formula.glm(formula=formula, family=sm.families.Gamma(link=sm.families.links.Log()), data=train).fit()
which fits a GLM with a gamma-distributed response and the log link function. (The capitalized Log link class is used here because recent statsmodels releases deprecate the lowercase log alias.)
Summary
This article is mainly about the definition of the generalized linear model (GLM), when to use it, and how the model is fitted. Much of the text is about the exponential family, since it is the foundation of GLM, and knowing the properties of the exponential family helps us understand why the model fitting becomes minimizing Eq 4.12. (Details of the solution to this problem are omitted because they deserve a whole article of their own.)
In fact, none of this is necessary to make the program run. As mentioned in the last section, the modeling is just one line of code, and we would rarely need to implement GLM from scratch. However, knowing the theory always helps when deciding which model to select and when diagnosing and interpreting the model.
References:
[1] Germán Rodríguez. Generalized Linear Model Theory. Accessed on 17 Feb 2022.
[2] Stephen Bates, Andy Tsao. Exponential families. Accessed on 18 Feb 2022.
[3] Dr. Kempthorne, Methods of estimation II. Accessed on 23 Feb 2022.
[4] Hastie, T. J., & Tibshirani, R. J. (2017). Generalized additive models. Routledge.
[5] Great Learning Team (2021), Generalized Linear Models | What does it mean? . Accessed on 7 Apr. 2022.