Section 8. Goodness of Fit and Overdispersion Although goodness of fit applies to the models introduced previously, it is particularly helpful to understand what it implies in the context of generalized linear models. Goodness of fit chi-square statistics in PROC GENMOD are printed in two varieties. The first is the likelihood ratio test computed with: lrt_chisq = -2 * SUM [Total*(log( _obsvd_pi) - log( _pred_pi ))] where Total is the sample size, the number of observations. This is known as the deviance statistic on the "Criteria for Assessing Goodness of Fit" table. The other more familiar type is the Pearson chi-square: prs_chisq= SUM[ ((freq_actual - freq_pred)**2 ) / freq_pred] Although both are distributed according to the chi-square distribution, the difference in results with Pearson or the deviance chi-squares can produce conflicting _numerical_ results showing how much these two statistics can differ. That is, the two sample statistics computed from the same data may have a ratio of 25 or more. Many small expected frequencies can inflate the Pearson measure. Overdispersion in Binomial and Poisson Regression Models Overdispersion results when the data appear more dispersed than is expected under some reference model. It may occur with count data analyzed with binomial or Poisson regression models, since the variance of both distributions is a function of the mean. That is, VAR[Y] = f(E[Y]) * phi With both distributions the scale parameter phi is assigned a value of 1. To understand what overdispersion implies, first review the linear regression model (computed with ordinary least squares). Under the normal distribution data are never overdispersed because the mean and variance are not related: y_i = x_i'*beta + err_i where err_i ~ N(0,sigma^2) In this model the variance of the residuals (sigma^2) is assumed constant for all linear combinations of the covariates. This variance is estimated from the data and can assume any value greater than zero no matter what the mean value is. Thus, the response values are assumed to have constant variance: Var(y_i)=sigma^2 * 1 For a generalized linear model: g(%mu_i) = X'beta where %mu_i=E(y_i) and g is the link function. The variance of y_i is: Var(y_i)= %phi * V(%mu_i) That is, the variance of an observation equals some constant %phi (the scale parameter) times a function of the mean of y_i. For the generalized linear model with normal errors and the identity link (i.e., the linear regression model previosly discussed) the variance function V(%mu_i) is 1 and thus Var(y_i) is constant for all values of y_i. In this case, the scaled deviance gives the maximum liklihood estimate of sigma^2 (see Section 5). Overdispersion is a possibilitye with two commonly applied distributions, the binomial and Poisson, which have their respective variances fixed by a single parameter, the mean. For a binary random variable Y, the variance of Y is a multiplicative function of its mean, mu: VAR(Y) = mu*(1-mu) Under the Poisson distribution the variance of Y equals the mean, mu: Var(Y)=E(Y)=%mu In either case, the observed counts have variances that are functions which depend on the value of the mean. That is, the variance of Y depends on the expectation of Y, which is estimated from the data. When either model is fitted under the assumption that the data were generated from a binomial distribution or by a Poisson process the scale parameter, %phi, is automatically set equal 1.0. For this reason, PROC GENMOD prints the scale parameter as 1.000 and adds the following note: NOTE: The scale parameter was held fixed. For binomial and Poisson regression models, the covariance matrix (and hence the standard errors of the parameter estimates) is estimated under the assumption that the chosen model is appropriate. More variation in the data may be present than is expected by the distributional assumption. This is called overdispersion (also known as heterogeneity) which typically occurs when the observations are correlated or are collected from "clusters". To identify possible overdispersion in the data for a given model, divide the deviance by its degrees of freedom: this is called the dispersion parameter. If the deviance is reasonably "close" to the degrees of freedom (i.e., the scale parameter=1) then evidence of overdispersion is lacking. A scale parameter that is greater than 1 does not necessarily imply overdispersion is present. This can also indicate other problems, such as an incorrectly specified model (omitted variables, interactions, or non-linear terms), an incorrectly specified functional form (an additive rather than a multiplicative model may be appropriate), as well as influential or outlying observations. If you believe you have correctly specified the model and the scale estimate is greater than 1, then conclude your data are overdispersed. You should be able to identify possible reasons why your data are overdispersed. If you do not correct for overdispersion, the estimates of the standard errors are too small which leads to biased inferences, i.e. you will observer smaller p-values than you should and thus make more Type I errors. As a result, confidence intervals will also be incorrect. When you have the "correct" model, outliers are not a problem, and the scaled deviance is large, the various choices to work with overdispersion include: 1) Allow the variance functions of the binomial and Poisson distributions to have a multiplicative overdispersion factor "phi" computed with: Pearson scale: overdispersion = Pearson Chi SQ / DF Deviance scale: overdispersion = Deviance Chi SQ / DF 2) Zero-inflated models (e.g., zero-inflated Binomial (ZIB) and zero-inflated Poisson (ZIP)) may also be candidates when the overdispersion is caused by an excessive number of 0's (see example for PROC NLMIXED below). 3) For count data you have modeled as Poisson, try to fit a negative binomial distribution instead (gamma mixture of Poisson responses). See Chapter 11 for illustrations of this technique with PROC GENMOD and PROC NLMIXED. 4) For count data of all types, invoke the GEE method utilizing PROC GENMOD with the REPEATED statement. See Chapter 13. 5) PROC GLIMMIX which is currently experimental in Version 9.1.3 will be part of Version 9.2. Using syntax much like PROC MIXED and PROC GENMOD, it allows you to easily model random effects in all types of generalized linear models. The first two possible solutions to work with overdispersion will be outlined below. The others will be illustrated in later sections. Apply a Multiplicative Overdispersion Factor The first method is to rescale the covariance matrix by the estimated dispersion parameter. That is, instead of testing parameters under the assumption that Var(y_i)=%mu_i, assume Var(y_i)=%phi*%mu_i, where %phi is greater than 1 for an overdispersed model. That is, elements of the covariance matrix are multiplied by a dispersion parameter and thus made larger. You can make this choice as an option on the MODEL statement of PROC GENMOD. Supply the value of the dispersion parameter directly or estimate it based on either the Pearson chi-square statistic or the deviance for the fitted model. The dispersion parameter (%phi) can be estimated by the square root of the deviance divided by the degrees of freedom which in GENMOD is applied by specifying SCALE=deviance (or just DSCALE) as an option following the slash on the MODEL statement MODEL y = fct1 / DIST=poisson LINK=log type3 SCALE=deviance; SCALE= or SCALE=pearson are also options to correct for overdispersion with the latter taking the reported chi-square value divided by its degress of freedom. The estimate of dispersion is measured by both the deviance and Pearson's chi-square divided by their degrees of freedom. If this statistic is not close to 1, then the data may be overdispersed if the dispersion estimate is greater than 1 or underdispersed if the dispersion estimate is less than 1. The following example comes from the online documentation where they model overdispersion with williams method and PROC LOGISTIC. In this example, the choices with PROC GENMOD are explored: Example: Working with Overdispersion in Binomial Data In a seed germination test, seeds of two cultivars were planted in pots of two soil conditions. The following SAS statements read the dataset seeds, which contains the observed proportion of seeds that germinated for various combinations of cultivar and soil conditions. The variable n represents the number of seeds planted in a pot, and variable r represents the number germinated. The indicator variables cult and soil represent the cultivar and soil condition, respectively. Run the data through the various GENMOD steps and then compare the various Type 3 fixed effect tests and parameter estimates. The use of the aggregage function with individual response data is also part of the output. DATA seeds(KEEP=pot n r cult soil) rsp(KEEP=pot cult soil i rsp); INPUT pot n r cult soil @@; OUTPUT seeds; rsp=1; DO i= 1 to r; OUTPUT rsp; END; rsp=0; DO i=r+1 to n; OUTPUT rsp; END; datalines; 1 16 8 0 0 2 51 26 0 0 3 45 23 0 0 4 39 10 0 0 5 36 9 0 0 6 81 23 1 0 7 30 10 1 0 8 39 17 1 0 9 28 8 1 0 10 62 23 1 0 11 51 32 0 1 12 72 55 0 1 13 41 22 0 1 14 12 3 0 1 15 13 10 0 1 16 79 46 1 1 17 30 15 1 1 18 51 32 1 1 19 74 53 1 1 20 56 12 1 1 ; PROC PRINT DATA=seeds NOobs n; run; * can also store individual response values; PROC PRINT DATA=rsp(obs=99) NOobs; run; PROC TABULATE DATA=seeds noseps; CLASS cult soil ; TABLE cult, soil*n*f=5.0 / rts=9; RUN; ODS OUTPUT TYPE3=typ3_none; ODS OUTPUT PARAMETERESTIMATES=prms_none; ODS exclude PARAMETERESTIMATES; proc genmod data=seeds; model r/n = cult soil cult*soil / dist=binomial link=logit type3 ; title1 '1. Full Model: no scale with aggregated data'; run; ODS OUTPUT TYPE3=typ3_sm; ODS OUTPUT PARAMETERESTIMATES=prms_sm; ODS exclude PARAMETERESTIMATES; PROC GENMOD DATA=seeds; MODEL r/n = cult soil cult*soil / dist=binomial link=logit scale=d type3; title1 '2. Full Model: scale=deviance with aggregated data'; run; ODS OUTPUT TYPE3=typ3_rsp; ODS OUTPUT PARAMETERESTIMATES=prms_rsp; ODS exclude PARAMETERESTIMATES; proc genmod data=rsp descending; model rsp = cult soil cult*soil / dist=binomial link=logit scale=d aggregate=(pot) type3; title1 '3. Full Model: scale=deviance with aggregated individual response data'; run; ODS OUTPUT TYPE3=typ3_gee; ODS OUTPUT GEEEmpPEst=prms_gee; ODS exclude PARAMETERESTIMATES GEEEmpPEst; PROC GENMOD DATA=rsp descending; CLASS pot; MODEL rsp = cult soil cult*soil / dist=binomial link=logit aggregate=(pot) type3; REPEATED subject=pot / type=ind; TITLE1 '4. Full Model: GEE with individual data'; run; PROC PRINT DATA=typ3_none NOobs; title1 '1. Full Model: no scale with aggregated data'; run; PROC PRINT DATA=prms_none NOobs; run; PROC PRINT DATA=typ3_sm NOobs; title1 '2. Full Model: scale=deviance with aggregated data'; run; PROC PRINT DATA=prms_sm NOobs; run; PROC PRINT DATA=typ3_rsp NOobs; 2. Compute a Zero-Inflated Model with Excess Zeros One potential cause for overdispersion of count data for both the binomial and Poisson regression models is the presence of an abnormally large number of zeros so that the standard binomial or Poisson regression models described in Chapters 6 and 7 for non-negative count data are not suitable. Datasets with excess zeros will exhibit overdispersion since they have more 0's than a Binomial or Poisson model allows for. To address this problem Zero-Inflated Poisson (ZIP) regression model of Lambert (1992) is applied. The ZIB and ZIP models are of relatively recent concept in count data analysis. A zero inflated model specifically allows for this overdispersion and when applicable predicts a more realistic percent of zeroes. The traditional way in which a ZIP model allows for overdispersion is to assume that the counts follow a mixture distribution: the actual counts generated from a Poisson(mu) distribution with probability 1-p_0 or for fixed zeros with probability p_0. That is, 0's can be generated under the Poisson assumption; they can also be generated because of special characteristics of the subject for which any other value than a 0 is highly unlikely. The Poisson means and the zero probabilities are both modeled as a function of the available data. Fit the Zero Inflated Poisson Regression model with NLMIXED Under a Zero inflated Poisson regression model, a zero count can occur in two ways: Let p_0 = probability a 0 occurs, in the situation where the combination of explanatory data will nearly always produce a observed 0 This naturally occuring 0 is modelled as a binary random variable, much like in logistic regression: p_0 = exp(logit) / (1+ exp(logit)) where the logit = LOG(p_0/(1-p_0)) = a + b*x can be modeled as a linear function of explanatory data, just like in logistic regression. The counts 0, 1, 2, 3, .. are also observed because they are generated by the Poisson distribution. These counts occur with a probability of 1-p_0: 1-p_0 = a non-negative count (including 0) generated by the Poisson probability The probabilities of observing Y= 0,1,2,3,... must sum to 1 under the zero-inflated model, that is: p_0 + (1-p_0)*[P(Y=0) + P(Y=1) + P(Y=2) + ...)] = 1 <---------- sums to 1 --------> p_0 + [(1-p_0) * 1] = p_0 + 1 - p_0 = 1 The component parts where P(Y=0) and P(Y>0) can be separated as follows: p_0 + (1-p_0)*[P(Y=0)] + (1-p_0)*[P(Y=1) + P(Y=2) + ...)] = 1 <------ P(Y=0) ------> + <----------- P(Y > 0) --------> = 1 A probability of an observed 0 in the dataset is due to two sources: Pr(Y_i = 0) = p_0 + (1-p_0)* exp(-mu_i) where under the Poisson distribution P(Y=0)=exp(-mu)*mu^0/fact(0) = exp(-mu) Non-zero counts with positive probabilities [P(Y>0)] are generated from a truncated Poisson distribution: Pr(Y_i = y) = (1-p_0) * [exp(-mu)*mu^y / fact(y)] y = 1, 2, 3, .. Since the likelihood has now been completely determined, these equations can be entered into PROC NLMIXED to compute coefficients for both the fixed 0's and the Poisson generated data. PROC NLMIXED DATA=test; PARMS bp_0=-.2 bp_1=.1 bll_0=.2 bll_1=.1; * Estimate the probability of a fixed 0 ; eta_p0 = bp_0 + bp_1*gender; p_0 = EXP(eta_p0) / (1 + EXP(eta_p0)); * Compute the linear predictor of the Poisson mean; eta_mu = bll_0 + bll_1*gender; mu = EXP(eta_mu); /* Compute log likelihood for Zero Inflated Poisson Distribution. Note: the log-likelihood for the zero response values cannot be generated directly. Must first generate the likelihood and then take the logarithm due to the additive nature of the zero likelihood. */ IF y = 0 THEN DO; p_zero = p_0 + ((1-p_0)*EXP(-mu)); loglike = LOG(p_zero); END; IF y > 0 THEN DO; p_gt_0 = (1-p_0)*( EXP(-mu)*((mu)**y)); * note: the denominator of the PDF, FACT(y+1), is optional; loglike = LOG(p_gt_0); END; MODEL y ~ general(loglike); /* fit the model */ run; Link functions relating mu = (mu_1.., mu_n) and p = (p_1.., p_n) to data can be written log(mu) = X1*Beta logit(p) = log(p/(1 - p)) = X2* NU where X_1 and X_2 are covariate matrices with columns corresponding to the data. The covariate matrices can contain covariates in common, and usually X2 contains a subset of the covariates in X1. For our application, X1 consists of the 44 variables described, while we try an intercept term, Price and Low price for X2. These latter choices were based on an initial logistic regression analysis for zero versus non-zero. After Glass (which was clearly the most important discriminator), Price was the next most useful discriminator, followed by Low price. Incorporating further covariates in X2 made a negligible improvement in model ¯t. References Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34, 1-14. Long, J. Scott. (1997) Regression Models for Categorical and Limited Dependent Variables. Sage: Thousand Oaks, CA.