Section 7. Poisson Regression Whenever non-negative integers are collected that count the number of events of interest, Poisson Regression techniques may be an option to consider or to serve as an introduction to more complex count data models. The distribution was first applied to the number of soldier deaths by horse kicks in the Prussian army. Data collected as counts which follow the Poisson distribution have three problems that make computations with least squares linear regression problematic. 1. The Poisson distribution is skewed; linear regression based on normal distribution of residuals assumes a symmetric distribution. 2. The Poisson distribution models non-negative integers; linear regression may produce predicted values that are negative. 3. The variance of a Poisson distribution increases as the mean increases; linear regression assumes a constant variance. The Poisson regression model is not affected by these problematic conditions. The link function for Poisson regression is usually a log transformation of the mean for a given set of covariates which adjusts for the skewness and prevents the model from producing negative predicted values. Poisson regression also allows the variance to change as a function of the mean. The Poisson Model The good news is Poisson regression models work with count data similar to the way in which the normal distribution works in linear regression. Non-negative integers that follow the Poisson distribution have this probability density function (pdf): EXP(-mu)*[mu^(y)] Prob(Y=y} = ------------------, y = 0, 1, 2, 3, .. and mu > 0 y! Statistical theory states that the variance of an observation from this distribution equals the mean, mu. More details concerning estimation of this parameter will be presented later in this section. However, based on these limited details three characteristics are observed that will likely make normal theory regression models problematic with count data. [In this section the type of Poisson model considered is one in which the counts are modeled without denominators, that is, all counts come from subjects with the same measure of size. The analysis of counts from situations where a number of observed events is divided by a total exposure (such as population, area, length, etc.) is called a rate model which is discussed in Section 10.] Under the theory of Generalized Linear Models outlined in section 3, a Poisson regression model implicitly applies a log transformation as the canonical link which adjusts for the skewness and prevents the model from producing negative predicted values. The canonical link between the linear predictor and the mean is the natural log: LOG(mu) = b0 + b1*X1 + ... + bk*Xk which implies the mean in the orginal units will be: mu = EXP(b0 + b1*X1 + ... + bk*Xk) Thus, the value of the mean (mu) must always be greater than 0 (as is required by the definition of the Poisson distribution). Since the possible values of counts range from 0 to positive infinity (i.e., non-negative integers) it is appropriate that the estimate of the mean be constrained to be nonnegative which is accomplished with the log link. Modeling the Poisson Mean Fitting a Poisson regression model involves modeling the mean so that the estimated density provides the best possible match to the observed response probability structure. An obvious limitation with Poisson regression is the variance of a response is a function of the mean. Problems that arise due to this constraint will be further explored in the Section 8 which introduces overdispersion in Poisson regression models. Section 11 (negative binomial distribution), Section 13 (GEE), and Section 14 (the portion that runs GLIMMIX) offer other ways to work with this limitation. To fit count data (entered as the variable named y in code below) distributed as Poisson random variables, PROC GENMOD utilizes the following basic commands: PROC GENMOD DATA=mydata; CLASS < categ_predict > ; MODEL y = / DIST=poisson LINK=log; RUN; * Simulate 95 Poisson random variables with mean 5; DATA pois; DROP seed mean i; mean = 5; seed=19953; DO i = 1 to 95; y = RANPOI(seed,mean); OUTPUT; END; RUN; DATA cnts; DO y=0 to 13; OUTPUT cnts; END; run; PROC TABULATE DATA=pois NOseps classdata=cnts; CLASS y; TABLE y*n=' '*f=3.0 / misstext='0'; RUN; --------------------------------------------------------- | y | |-------------------------------------------------------| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |12 |13 | |---+---+---+---+---+---+---+---+---+---+---+---+---+---| | 1| 5| 7| 14| 13| 22| 11| 10| 8| 2| 0| 1| 0| 1| --------------------------------------------------------- * PROC GENMOD estimates the value of the mean ; PROC GENMOD DATA=pois; MODEL y = / dist=poisson lrci; ESTIMATE 'Mean' intercept 1 / exp; RUN; Criteria For Assessing Goodness Of Fit Criterion DF Value Value/DF Deviance 94 106.6039 1.1341 Scaled Deviance 94 106.6039 1.1341 Pearson Chi-Square 94 100.3011 1.0670 Scaled Pearson X2 94 100.3011 1.0670 Log Likelihood 273.4946 Note: the final column labeled "Value/DF" should be "close" to 1 if the Poisson distribution is reasonable for these data. What this column actually implies for data analysis is covered in Section 8 (an estimate of overdispersion). Analysis Of Parameter Estimates Likelihood Ratio Standard 95% Confidence Chi- Pr > Parameter DF Estimate Error Limits Square ChiSq Intercept 1 1.5882 0.0464 1.4959 1.6777 1172.85 <.0001 Scale 0 1.0000 0.0000 1.0000 1.0000 NOTE: The scale parameter was held fixed. Contrast Estimate Results Standard Chi- Pr > Label Estimate Error Alpha Confidence Limits Square ChiSq Mean 1.5882 0.0464 0.05 1.4973 1.6791 1172.8 <.0001 Exp(Mean) 4.8947 0.2270 0.05 4.4695 5.3605 ^^^^^^ <- conf intval -> From the second row of this table, the estimated mean is found to be mu=EXP(1.5882)=4.89. The known mean of 5 is well within the 95% confidence interval. The Poisson mean can also be estimated with PROC NLMIXED. The following commands show how to enter the probability density function (Set 1) and the LOG likelihood (Set 2) (i.e., enter an equation for the loglikelihood of the density function). The result for the estimation of the mean of the Poisson distribution is the same as computed with PROC GENMOD. PROC NLMIXED DATA=pois maxiter=50; PARMS mu=1; p=EXP(-mu)*(mu**y)/(FACT(y)); * ENTER probability density function; loglike = LOG(p); MODEL y ~ general(loglike); run; PROC NLMIXED DATA=pois maxiter=50; PARMS mu=1; loglike = -mu + y*LOG(mu) - Lgamma(y+1); * ENTER loglikelihood function; MODEL y ~ general(loglike); run; NOTE: Since FACT(y) placed in the first set of NLMIXED statements is a constant, it is not absolutely needed; likewise, the -Lgamma(y+1) term is not needed in the second example. You will notice that for these data PROC GENMOD computes the Log Likelihood as 273.4946; remove the Lgamma(y+1) term and NLMIXED computes -2*(Log likelihood) as -547.0 indicating GENMOD doesn't include FACT(y) as part of the maximization computations. It makes no difference for the results or for comparing LL across models, yet for that reason the LL values will not necessarily match what other programs report (see discussion below). Either approach of coding PROC NLMIXED estimates the mean of the Poisson distribution with these data. Parameter Estimates Standard Parameter Estimate Error DF t Value Pr > |t| Lower Upper Gradient mu 4.8948 0.2270 95 21.56 <.0001 4.4442 5.3454 0.001342 See the section below on Maximum Likelihood Estimation for further explanation of how the procedure works directly with the likelihood function. Reasons for explaining how to duplicate results with the NLMIXED procedure will become clear with more advanced applications with mixture distributions such as zero inflated or truncated Poisson models. Analysis of Rates and Incidence Rate Ratios with Poisson Data Counts of events tend to be distributed approximately by a Poisson distribution, where the mean depends on levels of a treatment, condition, or combinations of two or more explanatory variables. The following data are the record of undesired poppy plants per 3 3/4 ft^2 in oat fields under 5 treatment conditions (data are from table 15.11.1 in Snedecor and Cochran, p. 289-90). DATA poppy; DROP y1-y5 i; LABEL trt='Treatment'; ARRAY yy{5} y1-y5; INPUT blk y1 y2 y3 y4 y5 @@; i=0; DO trt= 'A', 'B', 'C', 'D', 'E' ; i=i+1; count = yy{i}; sqrt_cnt=SQRT(count); xplot = i -.15 + .3*ranuni(929); t1 =(trt='A'); t2 = (trt='B'); t3 =(trt='C'); t4 =(trt='D'); t5 =(trt='E'); output; END; CARDS; 1 438 538 77 17 18 2 442 422 61 31 26 3 319 377 157 87 77 4 380 315 52 16 20 ; * first, plot the counts for each treatment level; proc plot data=poppy; plot count * xplot ='*'; options ps=50 ls=75; run; quit; count | 600 + | | | | * | 500 + | | | * | * | * 400 + | * * | | | | * * 300 + | | | | | 200 + | | | * | | 100 + | * * * | * | * | * * | ** ** 0 + | --------+----------+----------+----------+----------+--- A B C D E treatment PROC TABULATE DATA=poppy NOseps; CLASS trt blk; var count; TABLE trt, count=' '*(n*f=3.0 mean*f=6.1 range*f=5.0) / rts=11 Box='Summary'; run; ---------------------------- |Summary | N | Mean |Range| |---------+---+------+-----| |Treatment| | | | |A | 4| 394.8| 123| |B | 4| 413.0| 223| |C | 4| 86.8| 105| |D | 4| 37.8| 71| |E | 4| 35.3| 59| ---------------------------- The plot confirms what the table shows, in that the range of counts increases with the mean, one important signal that an ANOVA would not be appropriate statistical technique to apply with these data. In the situation the square root transformation has historically been applied to count data. Snedecor and Cochran apply the square roots in an ANOVA showing treatments D and E are superior to C, while the C, D, and E treatments are much suprior to A and B in reducing the number of poppies. They even indicate how to reconvert the means from the square roots back to the original scale by adding the quantity (n-1)s^2/n to each reconverted mean. A Poisson analysis of these data can easily be achieved with PROC GENMOD: ODS OUTPUT ESTIMATES=est(DROP=alpha) LSMEANS=lsm(drop=df chisq probchisq) LSmeanDIFFs=dfs; ODS EXCLUDE estimates lsmeans LSMeandiffs; PROC GENMOD DATA=poppy; CLASS trt; MODEL count = trt / dist=poisson link=log type3; ESTIMATE 'Observed Rate: trtmnt A' int 1 trt 1 0 0 0 0 / exp; ESTIMATE 'Observed Rate: trtmnt B' int 1 trt 0 1 0 0 0 / exp; ESTIMATE 'Observed Rate: trtmnt C' int 1 trt 0 0 1 0 0 / exp; ESTIMATE 'Observed Rate: trtmnt D' int 1 trt 0 0 0 1 0 / exp; ESTIMATE 'Observed Rate: trtmnt E' int 1 trt 0 0 0 0 1 / exp; ESTIMATE 'IRR: trt A vs B' trt 1 -1 0 0 0 / exp; ESTIMATE 'IRR: trt C vs D' trt 0 0 1 -1 0 / exp; ESTIMATE 'IRR: trt D vs E' trt 0 0 0 1 -1 / exp; ESTIMATE 'IRR: trt C vs E' trt 0 0 1 0 -1 / exp; LSMEANS trt / diff; TITLE1 'Poisson Regression'; run; PROC PRINT DATA=est NOobs n; WHERE (lowcase(substr(label,1,3)) = 'exp'); TITLE2 'Estimates'; run; DATA lsm; SET lsm; Obs_rate=EXP(estimate); PROC PRINT DATA=lsm NOobs; TITLE2 'Observed Rates'; run; DATA dfs; SET dfs; irr=EXP(estimate); PROC PRINT DATA=dfs NOobs; TITLE2 'Incident Rate Ratios'; run; run; Important questions this analysis should be able to answer include 1. What is the observed rate of counts for each treatment? 2. Do treatments A and B differ from each other and is treatment C different that D or E, and if so by how much? Two equivalent methods of computing the observed rates and their associated rate ratios are demonstrated in the code above. First the a series of ESTIMATE statements with "Observed Rates" for each treatment are computed with the added option exp (to automatically exponentiate the estimate) that is, compute the average counts for treatments A, B, C, D, and E. The rows labeled IRR (the incident rate ratio) are ratios of these observed rates comparing the specified treatments. The two output datasets from LSMEANS with the diff option (given the names lsm and dfs) also presents these results though the exponentiation must occur in DATA step rather than the GENMOD procedure. Poisson regression is applicable here because counts were collected on the same sized sampling unit for each treatment. More details on computing rates and incidence rate ratios with Poisson regression when the counts depend upon the size of the sampled unit are found in section 10. For zero-inflated count models, working with random effects, and for other reasons, it can be helpful to know how to run a Poisson analysis with PROC NLMIXED: How to replicatethe results in NLMIXED; * treatments are dummy coded; * treatment E is the reference category level ; PROC NLMIXED DATA=poppy; PARMS intrc=.1 tA = -1 tB =-.1 tC=-.1 tD=-.1; eta = intrc + tA*t1 + tB*t2 + tC*t3 + tD*t4; mean = EXP(eta); loglike= count*LOG(mean) - mean - Lgamma(count+1); MODEL count ~ GENERAL(loglike); ESTIMATE 'Estimated rate: trt A' EXP(intrc + tA); ESTIMATE 'Estimated rate: trt B' EXP(intrc + tB); ESTIMATE 'IRR: trt A vs trt B' EXP(tA-tB); ESTIMATE 'Estimated rate: trt C' EXP(intrc + tC); ESTIMATE 'Estimated rate: trt D' EXP(intrc + tD); ESTIMATE 'IRR: trt C vs trt D' EXP(tC-tD); ESTIMATE 'Estimated rate: trt E' EXP(intrc); ESTIMATE 'IRR: trt C vs trt E' EXP(tC); CONTRAST 'Treatment Effect' tA, tB, tC, tD ; OPTIONS ls=110; RUN; Maximum Likelihood Estimation of the Poisson Mean Maximum Likelihood Estimation for the Poisson mean involves multiplying all the individual values of the probability density function (pdf) for a given mean and then applying calculus to find the specific value of the mean, mu^, that maximizes this equation. Finding this optimal value is usually much easier with the log-likelihood function; that is, take the logs of the density function, add them together, and solve for the value of mu^ by taking the first derivative set equal to 0. The end result is, given the data, the value of mu that maximizes the likelihood of observing this particular sample. For example, assuming the mean of a distribution of Poisson counts is mu, the pdf for the value of an individual observation y_i drawn at random is: EXP(-mu_i) * mu_i^(y_i) Prob(Y=y_i} = ----------------------- i = 0,1,2,3, ... y_i! Taking the natural log of this pdf produces the log-likelihood for a single Poisson response, that is, the contribution of the i_th observation from a random sample of size n to the total Log-likelihood: LOG(prob( Y=y_i|mu )) = l{i} = y_i*log(mu_i) - mu_i -log(y_i!) For the Poisson distribution (and also binomial and multinomial) any terms which involve binomial coefficients or factorials of the observed counts can be dropped from the computation of the log-likelihood function since they do not affect parameter estimates, estimated covariances, or likelihood ratio tests. The only part of the log-likelihood actually needed to estimate the mean of a Poisson distribution is: l{i} = y_i*log(mu_i) - mu_i Since the factorial term, y_i!, does not contain the parameter mu_i from the model, excluding log(y_i!) does not affect the parameter estimates or their estimated covariances (since this final term acts like a constant whose first derivative with respect to mu is 0). Maximization of the log-likelihood with and without the component log(y_i!) produces the same parameter estimates. Therefore, the log of the factorial term log(y_i!) may be dropped from the likelihood function reducing computational requirements. Since SAS performs the computation in a manner that involves only the parameters, this means it prints a log-likelihood that will not match what other programs produce (such as LIMDEP or STATA). However, observe that for two different Poisson models: l1{i} = y_i * log(mu1_i) - mu1_i - log(y_i!) and l2{i} = y_i * log(mu2_i) - mu2_i - log(y_i!) When the difference in log-likelihood for these two models is computed, the term log(y_i!) eliminates itself from the difference. So, the deviance will be the same whether we include or do not include log(y_i!) in the log-likelihood. SAS avoids some extra computational burden although it is not obvious how much additional computation the full likelihood equation actually imposes. It would be preferred if GENMOD would compute and report the full log-likelihood, much like NLMIXED (and other statistical software) does.