Autocorrelated Error
I. What it is
A. Ordinary least squares regression assumes that the errors are independent and normally distributed with mean 0 and constant variance.
When the independence assumption is violated, the OLS estimates of the regression coefficients remain unbiased but are no longer efficient. More importantly, because the e_t are correlated, the usual standard errors of the regression coefficients will be smaller than they should be (under positive autocorrelation), and hence the statistical tests of these parameters will be misleading: they will suggest that the estimates of the parameters are more precise than they really are.
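A small simulation can make this concrete. The sketch below is illustrative only -- the sample size, ρ = 0.8, the trend predictor, and all variable names are assumptions, not part of these notes. It generates data with first-order autocorrelated errors, fits OLS repeatedly, and compares the average reported standard error of b with the actual spread of the b estimates across replications.

import numpy as np

rng = np.random.default_rng(0)
n, reps, rho, a, b = 100, 1000, 0.8, 1.0, 0.5
X = np.arange(n, dtype=float)
slopes, reported_se = [], []

for _ in range(reps):
    # build AR(1) errors: e_t = rho*e_{t-1} + v_t, v_t ~ N(0, 1)  (illustrative)
    v = rng.normal(size=n)
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = rho * e[t - 1] + v[t]
    y = a + b * X + e

    # ordinary least squares by hand, with the usual (uncorrected) standard error
    Xmat = np.column_stack([np.ones(n), X])
    beta = np.linalg.lstsq(Xmat, y, rcond=None)[0]
    resid = y - Xmat @ beta
    s2 = resid @ resid / (n - 2)
    cov = s2 * np.linalg.inv(Xmat.T @ Xmat)
    slopes.append(beta[1])
    reported_se.append(np.sqrt(cov[1, 1]))

print("average reported OLS standard error of b:", round(np.mean(reported_se), 4))
print("actual standard deviation of b estimates:", round(np.std(slopes), 4))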
II. How to know that you have it
A. Plot the data - especially plot the residuals against the estimated y values and against variables that may be related to time.
B. Compute autocorrelations of the residuals and see if they are large.
1) 1st order autocorrelation: r(e_t, e_{t-1})
2) 2nd order autocorrelation: r(e_t, e_{t-2})
3) ith order autocorrelation: r(e_t, e_{t-i})
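A minimal sketch of computing these residual autocorrelations (the simulated residuals and the helper name lag_autocorr are illustrative assumptions, not from these notes):

import numpy as np

def lag_autocorr(e, k):
    # correlation between e_t and e_{t-k}
    e = np.asarray(e, dtype=float)
    return np.corrcoef(e[k:], e[:-k])[0, 1]

# illustrative residuals with built-in positive autocorrelation
rng = np.random.default_rng(1)
e = np.zeros(200)
for t in range(1, 200):
    e[t] = 0.7 * e[t - 1] + rng.normal()

for k in (1, 2, 3):
    print(f"lag-{k} autocorrelation:", round(lag_autocorr(e, k), 3))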
C. Compute the Durbin-Watson statistic:
D = Σ_{t=2..n} (e_t - e_{t-1})^2 / Σ_{t=1..n} e_t^2
Note that when r = 0, D = 2; when r = 1, D = 0; and when r = -1, D = 4. So the test is whether D is close to 2. Because the value of r(e_t, e_{t-1}) depends on the configuration of the X's, the Durbin-Watson table can give only bounds for the critical values of D.
To evaluate D:
1) Locate the values of D_L and D_U in a Durbin-Watson statistic table.
2) For positive autocorrelation:
a) If D < D_L, then there is positive autocorrelation.
b) If D > D_U, then there is no positive autocorrelation.
c) If D_L < D < D_U, then the test is inconclusive.
3) For negative autocorrelation:
a) If D < 4 - D_U, then there is no negative autocorrelation.
b) If D > 4 - D_L, then there is negative autocorrelation.
c) If 4 - D_U < D < 4 - D_L, then the test is inconclusive.
Note that the Durbin-Watson test is only a test for 1st order autocorrelation.
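A minimal sketch of computing D from a vector of residuals (the simulated residuals below are illustrative assumptions, not data from these notes):

import numpy as np

def durbin_watson(e):
    # D = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(2)
white_noise = rng.normal(size=200)
print("D for independent errors (should be near 2):", round(durbin_watson(white_noise), 2))

# positively autocorrelated residuals should push D toward 0
e = np.zeros(200)
for t in range(1, 200):
    e[t] = 0.8 * e[t - 1] + rng.normal()
print("D for positively autocorrelated errors:", round(durbin_watson(e), 2))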
III. What to do about it
A. Depends on the type of problem -- three prototypical types of autocorrelation:
1. With a steady trend, one can simply regress the dependent variable on time or some simple function of time.
2. With a seasonal fluctuation, one can regress the dependent variable on dummy variables indicating the season.
a. Or one can use a moving average smooth and then use the smoothed data.
3. Often one will want to use both trend and seasonal fluctuation variables, as in the sketch after this list.
4. With random tracking, one can regress on "differenced" data or apply one of the specialized techniques described below.
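A minimal sketch of the trend-plus-seasonal-dummies regression (the quarterly period of 4, the coefficients, and all variable names are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(3)
n = 80
t = np.arange(n, dtype=float)
season = t.astype(int) % 4                      # quarter indicator 0..3
y = 10 + 0.3 * t + np.array([0, 2, -1, 1])[season] + rng.normal(scale=0.5, size=n)

# design matrix: intercept, time trend, and dummies for quarters 1-3 (quarter 0 is the baseline)
dummies = np.column_stack([(season == q).astype(float) for q in (1, 2, 3)])
X = np.column_stack([np.ones(n), t, dummies])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept, trend, and seasonal effects:", np.round(beta, 2))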
B. Regression on differenced data
1. The model of the error terms in this situation is:
e_t = ρ e_{t-1} + v_t, where v_t is N(0, σ^2)
2. Often the model can be simplified to e_t = e_{t-1} + v_t (that is, ρ = 1). In this case, one can manipulate the regression equation to arrive at a simple solution for the autocorrelation problem.
y_t = a + b X_t + e_t
y_{t-1} = a + b X_{t-1} + e_{t-1}
(y_t - y_{t-1}) = b(X_t - X_{t-1}) + (e_t - e_{t-1})    (subtracting the second equation from the first)
which can be rewritten as:
Y* = b X* + v_t
where Y* = (y_t - y_{t-1}), X* = (X_t - X_{t-1}), and v_t = (e_t - e_{t-1}).
By assumption, v_t is normally distributed, and so the regression on "first differences" will yield unbiased estimates of b. Whether the autocorrelation problem has been corrected can be ascertained by examining the autocorrelations of the differenced data.
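A minimal sketch of regression on first differences (the simulated random-walk errors and all variable names are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(4)
n = 200
x = np.cumsum(rng.normal(size=n))               # a slowly wandering predictor
e = np.cumsum(rng.normal(scale=0.5, size=n))    # random-walk errors: e_t = e_{t-1} + v_t
y = 2.0 + 1.5 * x + e

# difference both sides: Y* = y_t - y_{t-1}, X* = X_t - X_{t-1}
y_star = np.diff(y)
x_star = np.diff(x)

X = np.column_stack([np.ones(n - 1), x_star])   # intercept of the differenced equation should be near 0
beta, *_ = np.linalg.lstsq(X, y_star, rcond=None)
resid = y_star - X @ beta
print("estimated b from differenced data:", round(beta[1], 3))
print("lag-1 autocorrelation of the new residuals:", round(np.corrcoef(resid[1:], resid[:-1])[0, 1], 3))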
3. In general, one would like to estimate ρ in the generalized difference equation above. A similar technique can be applied:
y_t = a + b X_t + e_t
ρ y_{t-1} = ρ a + ρ b X_{t-1} + ρ e_{t-1}    (multiplying the equation for t-1 by ρ)
(y_t - ρ y_{t-1}) = a(1 - ρ) + b(X_t - ρ X_{t-1}) + (e_t - ρ e_{t-1})
which can be rewritten as:
Y* = a(1 - ρ) + b X* + v_t
However, in this equation Y* = (y_t - ρ y_{t-1}) and X* = (X_t - ρ X_{t-1}), so we still need to estimate ρ to create the new variables Y* and X*. One could regress e_t on e_{t-1} (that is, fit e_t = ρ e_{t-1} + error), but this regression will underestimate ρ because the fitted residuals e_t fluctuate around 0 more than the true errors do (because of the autocorrelation). One can obtain a better estimate by fitting:
y_t = a(1 - ρ) + ρ y_{t-1} + b X_t - b ρ X_{t-1} + e_t - ρ e_{t-1}
    = a* + ρ y_{t-1} + b X_t - b* X_{t-1} + v_t
and discarding all of the results except the coefficient of y_{t-1}, which is ρ. This estimate can then be used to form the new variables for the generalized difference regression above.
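A minimal sketch of this two-step procedure (the data, ρ = 0.7, and all variable names are illustrative assumptions): first estimate ρ as the coefficient of y_{t-1} in the expanded regression, then run the generalized difference regression.

import numpy as np

rng = np.random.default_rng(5)
n, a, b, rho = 300, 2.0, 1.5, 0.7
x = rng.normal(size=n).cumsum()
e = np.zeros(n)
for t in range(1, n):
    e[t] = rho * e[t - 1] + rng.normal(scale=0.5)
y = a + b * x + e

# step 1: regress y_t on y_{t-1}, X_t, and X_{t-1}; keep only the coefficient of y_{t-1}
Z = np.column_stack([np.ones(n - 1), y[:-1], x[1:], x[:-1]])
coef, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
rho_hat = coef[1]
print("estimated rho:", round(rho_hat, 3))

# step 2: form Y* = y_t - rho*y_{t-1} and X* = X_t - rho*X_{t-1}, then regress
y_star = y[1:] - rho_hat * y[:-1]
x_star = x[1:] - rho_hat * x[:-1]
W = np.column_stack([np.ones(n - 1), x_star])
beta, *_ = np.linalg.lstsq(W, y_star, rcond=None)
print("estimated b from generalized differences:", round(beta[1], 3))
print("recovered a = intercept / (1 - rho):", round(beta[0] / (1 - rho_hat), 3))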
IV. Special Techniques
A. Generalized Least Squares (GLS)
1. If we know the covariance matrix of the e_t, one can simply minimize a modified least squares criterion in which the differing variances and covariances are taken into account:
(Y - XB)' U^{-1} (Y - XB)
where U^{-1} is the inverse of the covariance matrix of the errors. For first-order autocorrelation, U = σ_e^2 Ω, where the (i, j) element of Ω is ρ^|i-j|.
The MLE estimate of B is B = (X' U^{-1} X)^{-1} X' U^{-1} Y.
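A minimal sketch of GLS with a known first-order autocorrelation structure (ρ, σ, and the data are illustrative assumptions): build U with (i, j) element σ^2 ρ^|i-j| and apply B = (X' U^{-1} X)^{-1} X' U^{-1} Y.

import numpy as np

rng = np.random.default_rng(6)
n, rho, sigma = 150, 0.6, 1.0
t = np.arange(n)
U = sigma ** 2 * rho ** np.abs(np.subtract.outer(t, t))   # AR(1) covariance matrix

# simulate y = a + b*x + e with errors drawn from N(0, U)  (illustrative)
x = rng.normal(size=n)
e = rng.multivariate_normal(np.zeros(n), U)
y = 1.0 + 2.0 * x + e

X = np.column_stack([np.ones(n), x])
U_inv = np.linalg.inv(U)
B = np.linalg.solve(X.T @ U_inv @ X, X.T @ U_inv @ y)
print("GLS estimates of a and b:", np.round(B, 3))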
B. Autoregressive integrated moving average (ARIMA) models
1. An autoregressive model (of order p) has the form:
y_t = φ_1 y_{t-1} + φ_2 y_{t-2} + ... + φ_p y_{t-p} + v_t
In this model the score on the dependent variable at time t is considered to be a function of the previous scores of the dependent variable at times t-1 through t-p. Once enough of a time lag is considered, the errors v_t should be distributed N(0, σ^2) with r(v_t, v_{t+h}) = 0 for h ≠ 0.
2. A moving average model (of order q) has the form:
y_t = v_t - θ_1 v_{t-1} - θ_2 v_{t-2} - ... - θ_q v_{t-q}
In this model the score on the dependent variable at time t is considered to be a function of random processes occurring at times t through t-q which are themselves properly distributed.
3. An ARIMA model combines the two:
y_t = φ_1 y_{t-1} + φ_2 y_{t-2} + ... + φ_p y_{t-p} + v_t - θ_1 v_{t-1} - θ_2 v_{t-2} - ... - θ_q v_{t-q}
In general, the estimation of ARIMA models is difficult and requires specialized statistical programs.
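A minimal sketch that simulates data from an ARMA(1, 1) model of the form above (φ_1, θ_1, and the series length are illustrative assumptions); fitting such a model in practice would be done with a dedicated time series package rather than by hand.

import numpy as np

rng = np.random.default_rng(7)
n, phi1, theta1 = 300, 0.6, 0.4
v = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    # y_t = phi_1*y_{t-1} + v_t - theta_1*v_{t-1}
    y[t] = phi1 * y[t - 1] + v[t] - theta1 * v[t - 1]

print("lag-1 autocorrelation of the simulated series:",
      round(np.corrcoef(y[1:], y[:-1])[0, 1], 3))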
C. Fourier and Spectral Analysis
1. These techniques attempt to decompose the complex pattern of data into a set of sine waves with different amplitudes and periods. For example, in the graph below, series 4 is composed of the sum of the previous 3 series. A Fourier analysis could decompose series 4 into its components.
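A minimal sketch of the idea using the fast Fourier transform (the three component frequencies and amplitudes are illustrative assumptions standing in for the series in the graph): a series built as the sum of three sine waves is decomposed and the dominant frequencies recovered.

import numpy as np

n = 512
t = np.arange(n)
series = (3.0 * np.sin(2 * np.pi * 5 * t / n)      # amplitude 3, 5 cycles
          + 2.0 * np.sin(2 * np.pi * 12 * t / n)   # amplitude 2, 12 cycles
          + 1.0 * np.sin(2 * np.pi * 30 * t / n))  # amplitude 1, 30 cycles

spectrum = np.abs(np.fft.rfft(series)) * 2 / n     # amplitude at each whole-number frequency
top = np.argsort(spectrum)[-3:][::-1]
for k in top:
    print(f"{k} cycles per series, amplitude ~ {spectrum[k]:.2f}")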