Heteroscedasticity

I. What it is and where to find it

A. Variance in Y changes with levels of one or more independent variables.

B. It is often a problem in time series data and when a measure is aggregated over individuals.

1) Example: average college expenses measured by sampling 1% of the students at each of several institutions that differ in size. Because the sample size n grows with institution size, and because the average has variance s²/n, the variance of the average shrinks as institution size grows.
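The s²/n argument can be checked with a few lines of arithmetic. This is only a sketch: the enrollments and the student-level variance below are made-up values chosen for illustration, not figures from any real institution.

```python
import numpy as np

def variance_of_mean(s2, n):
    """Variance of the average of n independent observations, each with variance s2."""
    return s2 / n

# Hypothetical enrollments; 1% of students are sampled at each institution.
enrollments = np.array([1_000, 5_000, 20_000, 50_000])
sample_sizes = enrollments // 100

s2 = 4.0e6  # assumed student-level variance of expenses (illustrative only)

var_means = variance_of_mean(s2, sample_sizes)
for n, v in zip(sample_sizes, var_means):
    print(f"n = {n:4d}   variance of the average = {v:,.0f}")
```

The variance of the institution-level average falls steadily as n rises, which is exactly the pattern that makes the aggregated measure heteroscedastic.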

II. How to know you have it

A. Plot the data

B. Plot the residuals

C. With a categorical independent variable, one can perform a test for the homogeneity of variance (e.g., Box’s test; cf. Winer, 1971).
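As a numerical companion to plotting the residuals, one rough check is to compare the residual variance at low and high values of X. The sketch below is a crude diagnostic on simulated data, not Box's test or any other formal test; the function name and the simulated values are assumptions made for illustration.

```python
import numpy as np

def residual_variance_ratio(x, y):
    """Fit an OLS line, then return the ratio of residual variance in the
    upper half of x to residual variance in the lower half of x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b, a = np.polyfit(x, y, 1)          # OLS slope and intercept
    resid = y - (a + b * x)
    order = np.argsort(x)
    half = len(x) // 2
    low = resid[order[:half]]           # residuals at the smallest x values
    high = resid[order[-half:]]         # residuals at the largest x values
    return np.var(high, ddof=1) / np.var(low, ddof=1)

rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 200)
y_het = 2.0 + 3.0 * x + rng.normal(0.0, 0.5 * x)            # error SD grows with x
y_hom = 2.0 + 3.0 * x + rng.normal(0.0, 2.0, size=x.size)   # constant error SD

ratio_het = residual_variance_ratio(x, y_het)
ratio_hom = residual_variance_ratio(x, y_hom)
print(ratio_het, ratio_hom)
```

A ratio near 1 is what one expects under homoscedasticity; a ratio far from 1 suggests that the variance changes with X and that the residual plot deserves a closer look.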

III. What to do about it

A. Conceptually, one might want to treat observations with greater variance with less weight because they give a less precise indication of the path of the regression line.

B. Instead of minimizing Σ(y_i - a - bx_i)², minimize

Σ(1/s_i²)(y_i - a - bx_i)². [1]

This is called weighted least squares because each term of the ordinary least squares (OLS) expression is weighted by the inverse of its variance. Note that when s_i² = s² for all i, that is, when the variances are all equal (homoscedastic), this expression gives the OLS solution for a and b. In the heteroscedastic case, with normally distributed errors and known s_i², minimizing it gives the maximum likelihood estimates (MLE) of a and b.
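A minimal sketch of the weighted criterion for a single predictor, assuming the variances s_i² are treated as known. The function name wls_line and the small data set are hypothetical; the point is only to show that equal variances reproduce the OLS line.

```python
import numpy as np

def wls_line(x, y, s2):
    """Return (a, b) minimizing sum((1/s2_i) * (y_i - a - b*x_i)**2)."""
    x, y, s2 = (np.asarray(v, dtype=float) for v in (x, y, s2))
    w = 1.0 / s2                                  # weights = inverse variances
    xbar = np.sum(w * x) / np.sum(w)              # weighted means
    ybar = np.sum(w * y) / np.sum(w)
    b = np.sum(w * (x - xbar) * (y - ybar)) / np.sum(w * (x - xbar) ** 2)
    a = ybar - b * xbar
    return a, b

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.4])

a_eq, b_eq = wls_line(x, y, np.full(5, 2.5))   # all variances equal
b_ols, a_ols = np.polyfit(x, y, 1)             # ordinary least squares
a_w, b_w = wls_line(x, y, x ** 2)              # variance assumed to grow as x**2

print(a_eq, b_eq, a_ols, b_ols)
print(a_w, b_w)
```

With equal variances the weights cancel and the two fits coincide; with unequal variances the low-variance observations pull the line toward themselves.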

C. In general, when the s_i² are unknown and must be estimated along with a and b, [1] has no closed-form solution, and one must rely on computer programs that find the minimum by iterative fitting algorithms.

D. However, there is a simple solution whenever s_i is proportional to the values of a variable (e.g., X_i), i.e., whenever s_i = kX_i. In this case, one can obtain the weighted least squares solution by minimizing

Σ(1/(k²X_i²))(y_i - a - bx_i)²

= (1/k²) Σ((y_i/X_i) - (a/X_i) - (bx_i/X_i))².

Because the constant (1/k²) multiplier does not affect the location of the minimum, one can find the appropriate estimates of a and b by minimizing:

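Σ((y_i/X_i) - (a/X_i) - (bx_i/X_i))²

= Σ((y_i/X_i) - (a/X_i) - b)²

(The last step takes the weighting variable X_i to be the predictor x_i itself, so that x_i/X_i = 1.)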

Therefore, weighted least squares estimates of the regression parameters can be obtained by performing an ordinary least squares regression on the transformed variables obtained by dividing the original variables by X_i:

y_i/X_i = a(1/X_i) + b + e_i/X_i

Note that the constant in this equation (b) corresponds to the regression coefficient for the X_i in the original model and that the regression coefficient for the new independent variable (1/X_i) corresponds to the constant term in the original equation. Also, note that since the residuals are conceptually also divided by X_i, they will have constant variance (be homoscedastic) if the standard deviations of the original e_i are proportional to the X_i as assumed.
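The equivalence between the transformed OLS regression and weighted least squares can be verified numerically. The sketch below uses simulated data with error standard deviation proportional to X (k = 0.3 is an arbitrary choice); all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.linspace(1.0, 10.0, 50)
y = 4.0 + 2.0 * X + rng.normal(0.0, 0.3 * X)   # error SD = k * X with k = 0.3

# Transformed variables: divide everything by X.
y_t = y / X
inv_x = 1.0 / X

# OLS of y/X on 1/X. The slope estimates the ORIGINAL constant a,
# and the intercept estimates the ORIGINAL slope b.
slope_t, intercept_t = np.polyfit(inv_x, y_t, 1)
a_hat, b_hat = slope_t, intercept_t

# Direct weighted least squares on the original variables, weights 1/X**2.
W = np.diag(1.0 / X ** 2)
A = np.column_stack([np.ones_like(X), X])
a_wls, b_wls = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)

print(a_hat, b_hat)
print(a_wls, b_wls)
```

Both routes give the same pair of estimates, with the roles of constant and slope exchanged in the transformed regression as described above.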

IV. Example: Airline transport accidents predicted by proportion of all flights flown by airline.

A. Initial regression

B. Plot of residuals indicates heteroscedasticity

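C. New variables are therefore created by dividing the old variables by the proportion of total flights: newinj = injuries / proportion of total flights, and newa = 1 / proportion of total flights. The original predictor divided by itself (proportion of total flights / proportion of total flights) equals 1, so it becomes the constant term of the new regression.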

This gives the WLS solution:

Number of incidents=-.883+73.122*p(total flights)

Recall (or see above) that the coefficients for the constant and the predictor are switched. The R² for this model can be obtained by squaring the correlation between the estimated and actual number of incidents: (.698)² = .487. The variable statistics can be obtained from the above results (remembering that the coefficient labeled constant is the coefficient for the independent variable). Notice that the t value for the independent variable has increased slightly, reflecting the added precision in this model.
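The R² calculation described here amounts to a squared Pearson correlation between predicted and actual values on the original scale. The sketch below first reproduces the quoted arithmetic and then applies the same step to made-up numbers; the data and the coefficients a and b are hypothetical and do not reproduce the airline results.

```python
import numpy as np

def r2_from_predictions(predicted, actual):
    """Squared Pearson correlation between predicted and actual values."""
    r = np.corrcoef(predicted, actual)[0, 1]
    return r ** 2

# The arithmetic quoted in the notes: a correlation of .698, squared.
quoted_r2 = round(0.698 ** 2, 3)
print(quoted_r2)

# Made-up illustration (not the airline data).
x = np.array([0.02, 0.05, 0.08, 0.12, 0.20, 0.25])   # proportion of flights
y = np.array([1.0, 2.0, 6.0, 7.0, 15.0, 16.0])       # number of incidents
a, b = -0.9, 73.0                                    # hypothetical coefficients
predicted = a + b * x                                # back on the original scale
r2 = r2_from_predictions(predicted, y)
print(r2)
```

With a single predictor, the predicted values are a linear function of x, so this squared correlation equals the squared correlation between x and y itself.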

D. The plot of the residuals indicates that the heteroscedasticity problem has disappeared.

V. Multivariate Weighted Least Squares

A. Recall that the ordinary least squares solution is:

B = (X'X)^-1 X'Y

The WLS solution is B = (X'U^-1 X)^-1 X'U^-1 Y, where

U = diag(s_1², s_2², ..., s_n²) and U^-1 = diag(1/s_1², 1/s_2², ..., 1/s_n²).

That is, the ordinary least squares solution is weighted by the inverse of the variances. The corresponding normal equations are X'U^-1 X B = X'U^-1 Y.

Note that one would obtain the same result if one multiplied the original regression equation by D, where D = diag(1/s_1, 1/s_2, ..., 1/s_n).

This would yield the solution B = [(DX)'(DX)]^-1 (DX)'(DY)

= (X'D'DX)^-1 X'D'DY

Because D'D = U^-1, this solution is identical to the one above.
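The matrix identity can be confirmed numerically. This is a sketch on simulated data: the design matrix, the coefficient vector, and the error standard deviations s are arbitrary assumptions, and the s_i are treated as known.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40

# Design matrix: a column of ones plus two made-up predictors.
X = np.column_stack([np.ones(n), rng.uniform(1, 10, n), rng.uniform(0, 5, n)])
s = rng.uniform(0.5, 3.0, n)                  # assumed known error SDs
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0.0, s)

U_inv = np.diag(1.0 / s ** 2)                 # inverse of the variance matrix U
D = np.diag(1.0 / s)                          # so that D'D = U^-1

# WLS solution: B = (X'U^-1 X)^-1 X'U^-1 Y
B_wls = np.linalg.solve(X.T @ U_inv @ X, X.T @ U_inv @ Y)

# OLS on the transformed equation DY = DXB + De
B_transformed, *_ = np.linalg.lstsq(D @ X, D @ Y, rcond=None)

print(B_wls)
print(B_transformed)
```

The two coefficient vectors agree to numerical precision, which is the multivariate counterpart of dividing each variable by X_i in the one-predictor case.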