VOOZH about

URL: https://towardsdatascience.com/linear-regression-in-time-series-sources-of-spurious-regression/

โ‡ฑ Linear Regression in Time Series: Sources of Spurious Regression | Towards Data Science


Linear Regression in Time Series: Sources of Spurious Regression

Why does the autocorrelation of the errors term matter?

14 min read
๐Ÿ‘ Drawing of a person thinking with a linear regression chart in a thought bubble
Image by author and GPT-4o.

1. Introduction

Itโ€™s pretty clear that most of our work will be automated by AI in the future. This will be possible because many researchers and professionals are working hard to make their work available online. These contributions not only help us understand fundamental concepts but also refine AI models, ultimately freeing up time to focus on other activities.

However, there is one concept that remains misunderstood, even among experts. It is spurious regression in time series analysis. This issue arises when regression models suggest strong relationships between variables, even when none exist. It is typically observed in time series regression equations that seem to have a high degree of fit โ€” as indicated by a high Rยฒ (coefficient of multiple correlation) โ€” but with an extremely low Durbin-Watson statistic (d), signaling strong autocorrelation in the error terms.

What is particularly surprising is that almost all econometric textbooks warn about the danger of autocorrelated errors, yet this issue persists in many published papers. Granger and Newbold (1974) identified several examples. For instance, they found published equations with Rยฒ = 0.997 and the Durbin-Watson statistic (d) equal to 0.53. The most extreme found is an equation with Rยฒ = 0.999 and d = 0.093.

It is especially problematic in economics and finance, where many key variables exhibit autocorrelation or serial correlation between adjacent values, particularly if the sampling interval is small, such as a week or a month, leading to misleading conclusions if not handled correctly. For example, todayโ€™s GDP is strongly correlated with the GDP of the previous quarter. Our post provides a detailed explanation of the results from Granger and Newbold (1974) and Python simulation (see section 7) replicating the key results presented in their article.

Whether youโ€™re an economist, data scientist, or analyst working with time series data, understanding this issue is crucial to ensuring your models produce meaningful results.

To walk you through this paper, the next section will introduce the random walk and the ARIMA(0,1,1) process. In section 3, we will explain how Granger and Newbold (1974) describe the emergence of nonsense regressions, with examples illustrated in section 4. Finally, weโ€™ll show how to avoid spurious regressions when working with time series data.

2. Simple presentation of a Random Walk and ARIMA(0,1,1) Process

2.1 Random Walk

Let ๐—โ‚œ be a time series. We say that ๐—โ‚œ follows a random walk if its representation is given by:

๐—โ‚œ = ๐—โ‚œโ‚‹โ‚ + ๐œ–โ‚œ. (1)

Where ๐œ–โ‚œ is a white noise. It can be written as a sum of white noise, a useful form for simulation. It is a non-stationary time series because its variance depends on the time t.

2.2 ARIMA(0,1,1) Process

The ARIMA(0,1,1) process is given by:

๐—โ‚œ = ๐—โ‚œโ‚‹โ‚ + ๐œ–โ‚œ โˆ’ ๐œƒ ๐œ–โ‚œโ‚‹โ‚. (2)

where ๐œ–โ‚œ is a white noise. The ARIMA(0,1,1) process is non-stationary. It can be written as a sum of an independent random walk and white noise:

๐—โ‚œ = ๐—โ‚€ + random walk + white noise. (3) This form is useful for simulation.

Those non-stationary series are often employed as benchmarks against which the forecasting performance of other models is judged.

3. Random walk can lead to Nonsense Regression

First, letโ€™s recall the linear regression model. The linear regression model is given by:

๐˜ = ๐—๐›ฝ + ๐œ–. (4)

Where ๐˜ is a T ร— 1 vector of the dependent variable, ๐›ฝ is a K ร— 1 vector of the coefficients, ๐— is a T ร— K matrix of the independent variables containing a column of ones and (Kโˆ’1) columns with T observations on each of the (Kโˆ’1) independent variables, which are stochastic but distributed independently of the T ร— 1 vector of the errors ๐œ–. It is generally assumed that:

๐„(๐œ–) = 0, (5)

and

๐„(๐œ–๐œ–โ€ฒ) = ๐œŽยฒ๐ˆ. (6)

where ๐ˆ is the identity matrix.

A test of the contribution of independent variables to the explanation of the dependent variable is the F-test. The null hypothesis of the test is given by:

๐‡โ‚€: ๐›ฝโ‚ = ๐›ฝโ‚‚ = โ‹ฏ = ๐›ฝโ‚–โ‚‹โ‚ = 0, (7)

And the statistic of the test is given by:

๐… = (๐‘ยฒ / (๐Šโˆ’1)) / ((1โˆ’๐‘ยฒ) / (๐“โˆ’๐Š)). (8)

where ๐‘ยฒ is the coefficient of determination.

If we want to construct the statistic of the test, letโ€™s assume that the null hypothesis is true, and one tries to fit a regression of the form (Equation 4) to the levels of an economic time series. Suppose next that these series are not stationary or are highly autocorrelated. In such a situation, the test procedure is invalid since ๐… in (Equation 8) is not distributed as an F-distribution under the null hypothesis (Equation 7). In fact, under the null hypothesis, the errors or residuals from (Equation 4) are given by:

๐œ–โ‚œ = ๐˜โ‚œ โˆ’ ๐—๐›ฝโ‚€ ; t = 1, 2, โ€ฆ, T. (9)

And will have the same autocorrelation structure as the original series ๐˜.

Some idea of the distribution problem can arise in the situation when:

๐˜โ‚œ = ๐›ฝโ‚€ + ๐—โ‚œ๐›ฝโ‚ + ๐œ–โ‚œ. (10)

Where ๐˜โ‚œ and ๐—โ‚œ follow independent first-order autoregressive processes:

๐˜โ‚œ = ๐œŒ ๐˜โ‚œโ‚‹โ‚ + ๐œ‚โ‚œ, and ๐—โ‚œ = ๐œŒ* ๐—โ‚œโ‚‹โ‚ + ๐œˆโ‚œ. (11)

Where ๐œ‚โ‚œ and ๐œˆโ‚œ are white noise.

We know that in this case, ๐‘ยฒ is the square of the correlation between ๐˜โ‚œ and ๐—โ‚œ. They use Kendallโ€™s result from the article Knowles (1954), which expresses the variance of ๐‘:

๐•๐š๐ซ(๐‘) = (1/T)* (1 + ๐œŒ๐œŒ*) / (1 โˆ’ ๐œŒ๐œŒ*). (12)

Since ๐‘ is constrained to lie between -1 and 1, if its variance is greater than 1/3, the distribution of ๐‘ cannot have a mode at 0. This implies that ๐œŒ๐œŒ* > (Tโˆ’1) / (T+1).

Thus, for example, if T = 20 and ๐œŒ = ๐œŒ*, a distribution that is not unimodal at 0 will be obtained if ๐œŒ > 0.86, and if ๐œŒ = 0.9, ๐•๐š๐ซ(๐‘) = 0.47. So the ๐„(๐‘ยฒ) will be close to 0.47.

It has been shown that when ๐œŒ is close to 1, ๐‘ยฒ can be very high, suggesting a strong relationship between ๐˜โ‚œ and ๐—โ‚œ. However, in reality, the two series are completely independent. When ๐œŒ is near 1, both series behave like random walks or near-random walks. On top of that, both series are highly autocorrelated, which causes the residuals from the regression to also be strongly autocorrelated. As a result, the Durbin-Watson statistic ๐ will be very low.

This is why a high ๐‘ยฒ in this context should never be taken as evidence of a true relationship between the two series.

To explore the possibility of obtaining a spurious regression when regressing two independent random walks, a series of simulations proposed by Granger and Newbold (1974) will be conducted in the next section.

4. Simulation results using Python.

In this section, we will show using simulations that using the regression model with independent random walks bias the estimation of the coefficients and the hypothesis tests of the coefficients are invalid. The python code that will produce the results of the simulation will be presented in section 6.

A regression equation proposed by Granger and Newbold (1974) is given by:

๐˜โ‚œ = ๐›ฝโ‚€ + ๐—โ‚œ๐›ฝโ‚ + ๐œ–โ‚œ

Where ๐˜โ‚œ and ๐—โ‚œ were generated as independent random walks, each of length 50. The values ๐’ = |๐›ฝฬ‚โ‚| / โˆš(๐’๐„ฬ‚(๐›ฝฬ‚โ‚)), representing the statistic for testing the significance of ๐›ฝโ‚, for 100 simulations will be reported in the table below.

๐Ÿ‘ Image
๐Ÿ‘ Image
Table 1: Regressing two independent random walks

The null hypothesis of no relationship between ๐˜โ‚œ and ๐—โ‚œ is rejected at the 5% level if ๐’ > 2. This table shows that the null hypothesis (๐›ฝ = 0) is wrongly rejected in about a quarter (71 times) of all cases. This is awkward because the two variables are independent random walks, meaning thereโ€™s no actual relationship. Letโ€™s break down why this happens.

If ๐›ฝฬ‚โ‚ / ๐’๐„ฬ‚ follows a ๐(0,1), the expected value of ๐’, its absolute value, should be โˆš2 / ฯ€ โ‰ˆ 0.8 (โˆš2/ฯ€ is the mean of the absolute value of a standard normal distribution). However, the simulation results show an average of 4.59, meaning the estimated ๐’ is underestimated by a factor of:

4.59 / 0.8 = 5.7

In classical statistics, we usually use a t-test threshold of around 2 to check the significance of a coefficient. However, these results show that, in this case, you would need to use a threshold of 11.4 to properly test for significance:

2 ร— (4.59 / 0.8) = 11.4

Interpretation: Weโ€™ve just shown that including variables that donโ€™t belong in the model โ€” especially random walks โ€” can lead to completely invalid significance tests for the coefficients.

To make their simulations even clearer, Granger and Newbold (1974) ran a series of regressions using variables that follow either a random walk or an ARIMA(0,1,1) process.

Here is how they set up their simulations:

They regressed a dependent series ๐˜โ‚œ on m series ๐—โฑผ,โ‚œ (with j = 1, 2, โ€ฆ, m), varying m from 1 to 5. The dependent series ๐˜โ‚œ and the independent series ๐—โฑผ,โ‚œ follow the same types of processes, and they tested four cases:

  • Case 1 (Levels): ๐˜โ‚œ and ๐—โฑผ,โ‚œ follow random walks.
  • Case 2 (Differences): They use the first differences of the random walks, which are stationary.
  • Case 3 (Levels): ๐˜โ‚œ and ๐—โฑผ,โ‚œ follow ARIMA(0,1,1).
  • Case 4 (Differences): They use the first differences of the previous ARIMA(0,1,1) processes, which are stationary.

Each series has a length of 50 observations, and they ran 100 simulations for each case.

All error terms are distributed as ๐(0,1), and the ARIMA(0,1,1) series are derived as the sum of the random walk and independent white noise. The simulation results, based on 100 replications with series of length 50, are summarized in the next table.

๐Ÿ‘ Image
๐Ÿ‘ Image
Table 2: Regressions of a series on m independent โ€˜explanatoryโ€™ series.

Interpretation of the results :

  • It is seen that the probability of not rejecting the null hypothesis of no relationship between ๐˜โ‚œ and ๐—โฑผ,โ‚œ becomes very small when m โ‰ฅ 3 when regressions are made with random walk series (rw-levels). The ๐‘ยฒ and the mean Durbin-Watson increase. Similar results are obtained when the regressions are made with ARIMA(0,1,1) series (arima-levels).
  • When white noise series (rw-diffs) are used, classical regression analysis is valid since the error series will be white noise and least squares will be efficient.
  • However, when the regressions are made with the differences of ARIMA(0,1,1) series (arima-diffs) or first-order moving average series MA(1) process, the null hypothesis is rejected, on average:

(10 + 16 + 5 + 6 + 6) / 5 = 8.6

which is greater than 5% of the time.

If your variables are random walks or close to them, and you include unnecessary variables in your regression, you will often get fallacious results. High ๐‘ยฒ and low Durbin-Watson values do not confirm a true relationship but instead indicate a likely spurious one.

5. How to avoid spurious regression in time series

Itโ€™s really hard to come up with a complete list of ways to avoid spurious regressions. However, there are a few good practices you can follow to minimize the risk as much as possible.

If one performs a regression analysis with time series data and finds that the residuals are strongly autocorrelated, there is a serious problem when it comes to interpreting the coefficients of the equation. To check for autocorrelation in the residuals, one can use the Durbin-Watson test or the Portmanteau test.

Based on the study above, we can conclude that if a regression analysis performed with economical variables produces strongly autocorrelated residuals, meaning a low Durbin-Watson statistic, then the results of the analysis are likely to be spurious, whatever the value of the coefficient of determination Rยฒ observed.

In such cases, it is important to understand where the mis-specification comes from. According to the literature, misspecification usually falls into three categories : (i) the omission of a relevant variable, (ii) the inclusion of an irrelevant variable, or (iii) autocorrelation of the errors. Most of the time, mis-specification comes from a mix of these three sources.

To avoid spurious regression in a time series, several recommendations can be made:

  • The first recommendation is to select the right macroeconomic variables that are likely to explain the dependent variable. This can be done by reviewing the literature or consulting experts in the field.
  • The second recommendation is to stationarize the series by taking first differences. In most cases, the first differences of macroeconomic variables are stationary and still easy to interpret. For macroeconomic data, itโ€™s strongly recommended to differentiate the series once to reduce the autocorrelation of the residuals, especially when the sample size is small. There is indeed sometimes strong serial correlation observed in these variables. A simple calculation shows that the first differences will almost always have much smaller serial correlations than the original series.
  • The third recommendation is to use the Box-Jenkins methodology to model each macroeconomic variable individually and then search for relationships between the series by relating the residuals from each individual model. The idea here is that the Box-Jenkins process extracts the explained part of the series, leaving the residuals, which contain only what canโ€™t be explained by the seriesโ€™ own past behavior. This makes it easier to check whether these unexplained parts (residuals) are related across variables.

6. Conclusion

Many econometrics textbooks warn about specification errors in regression models, but the problem still shows up in many published papers. Granger and Newbold (1974) highlighted the risk of spurious regressions, where you get a high paired with very low Durbin-Watson statistics.

Using Python simulations, we showed some of the main causes of these spurious regressions, especially including variables that donโ€™t belong in the model and are highly autocorrelated. We also demonstrated how these issues can completely distort hypothesis tests on the coefficients.

Hopefully, this post will help reduce the risk of spurious regressions in future econometric analyses.

7. Appendice: Python code for simulation.

#####################################################Simulation Code for table 1 #####################################################

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

np.random.seed(123)
M = 100 
n = 50
S = np.zeros(M)
for i in range(M):
#---------------------------------------------------------------
# Generate the data
#---------------------------------------------------------------
 espilon_y = np.random.normal(0, 1, n)
 espilon_x = np.random.normal(0, 1, n)

 Y = np.cumsum(espilon_y)
 X = np.cumsum(espilon_x)
#---------------------------------------------------------------
# Fit the model
#---------------------------------------------------------------
 X = sm.add_constant(X)
 model = sm.OLS(Y, X).fit()
#---------------------------------------------------------------
# Compute the statistic
#------------------------------------------------------
 S[i] = np.abs(model.params[1])/model.bse[1]


#------------------------------------------------------ 
# Maximum value of S
#------------------------------------------------------
S_max = int(np.ceil(max(S)))

#------------------------------------------------------ 
# Create bins
#------------------------------------------------------
bins = np.arange(0, S_max + 2, 1) 

#------------------------------------------------------
# Compute the histogram
#------------------------------------------------------
frequency, bin_edges = np.histogram(S, bins=bins)

#------------------------------------------------------
# Create a dataframe
#------------------------------------------------------

df = pd.DataFrame({
 "S Interval": [f"{int(bin_edges[i])}-{int(bin_edges[i+1])}" for i in range(len(bin_edges)-1)],
 "Frequency": frequency
})
print(df)
print(np.mean(S))
๐Ÿ‘ Image

#####################################################Simulation Code for table 2 #####################################################

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from tabulate import tabulate

np.random.seed(1) # Pour rendre les rรฉsultats reproductibles

#------------------------------------------------------
# Definition of functions
#------------------------------------------------------

def generate_random_walk(T):
 """
 Gรฉnรจre une sรฉrie de longueur T suivant un random walk :
 Y_t = Y_{t-1} + e_t,
 oรน e_t ~ N(0,1).
 """
 e = np.random.normal(0, 1, size=T)
 return np.cumsum(e)

def generate_arima_0_1_1(T):
 """
 Gรฉnรจre un ARIMA(0,1,1) selon la mรฉthode de Granger & Newbold :
 la sรฉrie est obtenue en additionnant une marche alรฉatoire et un bruit blanc indรฉpendant.
 """
 rw = generate_random_walk(T)
 wn = np.random.normal(0, 1, size=T)
 return rw + wn

def difference(series):
 """
 Calcule la diffรฉrence premiรจre d'une sรฉrie unidimensionnelle.
 Retourne une sรฉrie de longueur T-1.
 """
 return np.diff(series)

#------------------------------------------------------
# Paramรจtres
#------------------------------------------------------

T = 50 # longueur de chaque sรฉrie
n_sims = 100 # nombre de simulations Monte Carlo
alpha = 0.05 # seuil de significativitรฉ

#------------------------------------------------------
# Definition of function for simulation
#------------------------------------------------------

def run_simulation_case(case_name, m_values=[1,2,3,4,5]):
 """
 case_name : un identifiant pour le type de gรฉnรฉration :
 - 'rw-levels' : random walk (levels)
 - 'rw-diffs' : differences of RW (white noise)
 - 'arima-levels' : ARIMA(0,1,1) en niveaux
 - 'arima-diffs' : diffรฉrences d'un ARIMA(0,1,1) => MA(1)
 
 m_values : liste du nombre de rรฉgresseurs.
 
 Retourne un DataFrame avec pour chaque m :
 - % de rejets de H0
 - Durbin-Watson moyen
 - R^2_adj moyen
 - % de R^2 > 0.1
 """
 results = []
 
 for m in m_values:
 count_reject = 0
 dw_list = []
 r2_adjusted_list = []
 
 for _ in range(n_sims):
#--------------------------------------
# 1) Generation of independents de Y_t and X_{j,t}.
#----------------------------------------
 if case_name == 'rw-levels':
 Y = generate_random_walk(T)
 Xs = [generate_random_walk(T) for __ in range(m)]
 
 elif case_name == 'rw-diffs':
 # Y et X sont les diffรฉrences d'un RW, i.e. ~ white noise
 Y_rw = generate_random_walk(T)
 Y = difference(Y_rw)
 Xs = []
 for __ in range(m):
 X_rw = generate_random_walk(T)
 Xs.append(difference(X_rw))
 # NB : maintenant Y et Xs ont longueur T-1
 # => ajuster T_effectif = T-1
 # => on prendra T_effectif points pour la rรฉgression
 
 elif case_name == 'arima-levels':
 Y = generate_arima_0_1_1(T)
 Xs = [generate_arima_0_1_1(T) for __ in range(m)]
 
 elif case_name == 'arima-diffs':
 # Diffรฉrences d'un ARIMA(0,1,1) => MA(1)
 Y_arima = generate_arima_0_1_1(T)
 Y = difference(Y_arima)
 Xs = []
 for __ in range(m):
 X_arima = generate_arima_0_1_1(T)
 Xs.append(difference(X_arima))
 
 # 2) Prรฉpare les donnรฉes pour la rรฉgression
 # Selon le cas, la longueur est T ou T-1
 if case_name in ['rw-levels','arima-levels']:
 Y_reg = Y
 X_reg = np.column_stack(Xs) if m>0 else np.array([])
 else:
 # dans les cas de diffรฉrences, la longueur est T-1
 Y_reg = Y
 X_reg = np.column_stack(Xs) if m>0 else np.array([])
 
 # 3) Rรฉgression OLS
 X_with_const = sm.add_constant(X_reg) # Ajout de l'ordonnรฉe ร  l'origine
 model = sm.OLS(Y_reg, X_with_const).fit()
 
 # 4) Test global F : H0 : tous les beta_j = 0
 # On regarde si p-value < alpha
 if model.f_pvalue is not None and model.f_pvalue < alpha:
 count_reject += 1
 
 # 5) R^2, Durbin-Watson
 r2_adjusted_list.append(model.rsquared_adj)
 
 
 dw_list.append(durbin_watson(model.resid))
 
 # Statistiques sur n_sims rรฉpรฉtitions
 reject_percent = 100 * count_reject / n_sims
 dw_mean = np.mean(dw_list)
 r2_mean = np.mean(r2_adjusted_list)
 r2_above_0_7_percent = 100 * np.mean(np.array(r2_adjusted_list) > 0.7)
 
 results.append({
 'm': m,
 'Reject %': reject_percent,
 'Mean DW': dw_mean,
 'Mean R^2': r2_mean,
 '% R^2_adj>0.7': r2_above_0_7_percent
 })
 
 return pd.DataFrame(results)
 
#------------------------------------------------------
# Application of the simulation
#------------------------------------------------------ 

cases = ['rw-levels', 'rw-diffs', 'arima-levels', 'arima-diffs']
all_results = {}

for c in cases:
 df_res = run_simulation_case(c, m_values=[1,2,3,4,5])
 all_results[c] = df_res

#------------------------------------------------------
# Store data in table
#------------------------------------------------------

for case, df_res in all_results.items():
 print(f"\n\n{case}")
 print(tabulate(df_res, headers='keys', tablefmt='fancy_grid'))
๐Ÿ‘ Image
๐Ÿ‘ Image

References

  • Granger, Clive WJ, and Paul Newbold. 1974. โ€œSpurious Regressions in Econometrics.โ€ Journal of Econometrics 2 (2): 111โ€“20.
  • Knowles, EAG. 1954. โ€œExercises in Theoretical Statistics.โ€ Oxford University Press.

Written By

JUNIOR JUMBONG

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

Related Articles