9. Loglinear Models Two or more categorical factors can be cross-classified in a multi-dimensional contingency table of counts and summarized for relationships among them with a loglinear model. The typical application is not that one specific variable exists as a response and variation in it explained by one or more independent variables (as in linear regression), but rather you want to explore the relationships that simultaneously exist among two or more variables. For example, consider the following crosstabulation of three categorical variables, each having two levels: Table 6.2.3, from Agresti, Introduction to Categorical Data Analysis PROC FORMAT; VALUE ysn 1='No' 2='Yes'; RUN; DATA drugs; INPUT a c m count @@; LABEL a='Alcohol Use' c='Cigarette Use' m='Marijuana Use'; CARDS; 1 1 1 911 1 1 2 538 1 2 1 44 1 2 2 456 2 1 1 3 2 1 2 43 2 2 1 2 2 2 2 279 ; PROC Tabulate data=drugs NOseps; CLASS a c m; VAR count; TABLE a*c, m*count=' '*sum=' '*f=8.0 / rts = 30 BOX='Counts'; TITLE3 'Table 6.3, pg. 152'; FORMAT a c m ysn. ; RUN; ------------------------------------------------ |Counts | Marijuana Use | | |-----------------| | | No | Yes | |----------------------------+--------+--------| |Alcohol Use Cigarette Use | | | |No No | 911| 538| | Yes | 44| 456| |Yes No | 3| 43| | Yes | 2| 279| ------------------------------------------------ The table contains 8 cells, each one defined by unique combinations of the levels of the three factors. A model which tests the independence of these factors, one of several loglinear models for this particular contingency table, would be: log(m_ijk) = %mu + a_i + c_j + m_k i=1,2; j=1,2; k=1,2 (1) where m_ijk is the cell count for the cell indexed by i, j, and k in the 2x2x2 table, %mu is a parameter for fitting the sample size (total N of the crosstab), a_i and c_i fit the row margins, m_i is a fits the column totals. Since there are only two levels for each of the three variables, the assumption of effects coding imply the identifiability constraints are a_1 = -a_2, c_1 = -c_2, and m_1 = -m_2 This is called the loglinear model of independence since there are no interaction terms for associations among the variables. PROC CATMOD has been a long-established method to compute loglinear models: PROC Catmod data=drugs; WEIGHT count; MODEL a*c*m = _response_ ; LOGLIN a c m ; TITLE3 'CATMOD: Loglinear MODEL of independence'; RUN; quit; Maximum Likelihood Analysis of Variance Source DF Chi-Square Pr > ChiSq -------------------------------------------------- a 1 892.65 <.000001 c 1 216.28 <.000001 m 1 55.22 <.000001 Likelihood Ratio 4 1286.02 <.000001 Analysis of Maximum Likelihood Estimates Standard Chi- Parameter Estimate Error Square Pr > ChiSq ------------------------------------------------------------ a 1 0.8926 0.0299 892.65 <.000001 c 1 0.3247 0.0221 216.28 <.000001 m 1 -0.1577 0.0212 55.22 <.000001 An equivalent SAS procedure to run this loglinear model of independence in PROC GENMOD is: PROC GENMOD data=drugs; CLASS a c m / param=effect; MODEL count = a m c / dist=poisson link=log type3; CONTRAST 'a1 vs a2' a 1 -1 / wald; CONTRAST 'c1 vs c2' c 1 -1 / wald; CONTRAST 'm1 vs m2' m 1 -1 / wald; run; Selected portions of the output follow: Criteria for Assessing Goodness Of Fit Criterion DF Value Value/DF Deviance 4 1286.0200 321.5050 Scaled Deviance 4 1286.0200 321.5050 Pearson Chi-Square 4 1411.3860 352.8465 Scaled Pearson X2 4 1411.3860 352.8465 Log Likelihood 11367.7894 Note the Deviance of 1286.02 equals the chi-square from the Likelihood ratio test from the CATMOD output. This value is much larger than its degrees of freedom leading to the conclusion the three factors in this study are not independent. Analysis Of Parameter Estimates Standard Wald 95% Chi- Pr > Parameter DF Estimate Error Confidence Limits Square ChiSq Intercept 1 5.2320 0.0309 5.1716 5.2925 28761.2 <.0001 a 1 1 0.8926 0.0299 0.8340 0.9511 892.31 <.0001 m 1 1 -0.1577 0.0212 -0.1993 -0.1161 55.22 <.0001 c 1 1 0.3247 0.0221 0.2814 0.3679 216.28 <.0001 Scale 0 1.0000 0.0000 1.0000 1.0000 NOTE: The scale parameter was held fixed. The estimates for a, m, and c are based on effect coding (see CLASS statement) and match those from PROC CATMOD. Contrast Results Chi- Contrast DF Square Pr > ChiSq Type a1 vs a2 1 892.31 <.0001 Wald c1 vs c2 1 216.28 <.0001 Wald m1 vs m2 1 55.22 <.0001 Wald The results from the CONTRAST statements are Wald chi-square and match the chi-square values from the CATMOD output. Loglinear model with CATMOD ODS OUTPUT PredictedFreqs=prd_f(drop = sample model control); ODS exclude PredictedFreqs PredictedValues; PROC CATMOD DATA=drugs; WEIGHT count; MODEL a*c*m = _response_ / predict=freq ; LOGLIN a|m c|m; TITLE3 'CATMOD: Loglinear MODEL (AM, CM)'; RUN; quit; proc print data=prd_f; run; The CATMOD output for this model is: Maximum Likelihood Analysis of Variance Source DF Chi-Square Pr > ChiSq -------------------------------------------------- a 1 198.52 <.000001 m 1 155.20 <.000001 a*m 1 83.00 <.000001 c 1 292.68 <.000001 c*m 1 401.17 <.000001 Likelihood Ratio 2 187.75 <.000001 Analysis of Maximum Likelihood Estimates Standard Chi- Parameter Estimate Error Square Pr > ChiSq -------------------------------------------------------------- a 1 1.5949 0.1132 198.52 <.000001 m 1 -1.4731 0.1182 155.20 <.000001 a*m 1 1 1.0313 0.1132 83.00 <.000001 c 1 0.6885 0.0402 292.68 <.000001 c*m 1 1 0.8061 0.0402 401.17 <.000001 Function Obs ObsStd Pred PredStd Obs a c m Num Function Err Function Err Residual 1 1 1 1 F1 911 23.37434 909.2396 23.36195 1.760417 2 1 1 2 F2 538 20.26889 438.8404 17.15391 99.15957 3 1 2 1 F3 44 6.568819 45.76042 6.679323 -1.76042 4 1 2 2 F4 456 19.09554 555.1596 18.96775 -99.1596 5 2 1 1 F5 3 1.730909 4.760418 2.126048 -1.76042 6 2 1 2 F6 43 6.495199 142.1596 8.562114 -99.1596 7 2 2 1 F7 2 1.413592 0.239583 0.112401 1.760417 8 2 2 2 F8 279 15.64606 179.8404 10.27909 99.15957 The same results can be obtained with PROC GENMOD: ODS OUTPUT obstats=obst(keep=count a c m Pred Resraw Reschi Resdev); ODS OUTPUT parameterestimates=prms(drop= df lower: upper: prob: ); ODS EXCLUDE obstats parameterestimates; PROC GENMOD data=drugs; CLASS a c m / param=effect; MODEL count = a m a*m c c*m / dist=poisson link=log type3 obstats; CONTRAST 'a1 vs a2' a 1 -1 / wald; CONTRAST 'c1 vs c2' c 1 -1 / wald; CONTRAST 'm1 vs m2' m 1 -1 / wald; CONTRAST 'AM' a*m 1 -1 -1 1 / wald; CONTRAST 'CM' c*m 1 -1 -1 1 / wald; TITLE3 'GENMOD: Loglinear MODEL (AM, CM)'; run; PROC PRINT DATA=prms; run; Criteria For Assessing Goodness Of Fit Criterion DF Value Value/DF Deviance 2 187.7543 93.8772 Scaled Deviance 2 187.7543 93.8772 Pearson Chi-Square 2 177.6144 88.8072 Scaled Pearson X2 2 177.6144 88.8072 Log Likelihood 11916.9222 LR Statistics For Type 3 Analysis Chi- Pr > Source DF Square ChiSq a 1 1594.13 <.0001 m 1 677.18 <.0001 a*m 1 346.46 <.0001 c 1 503.89 <.0001 c*m 1 751.81 <.0001 Contrast Results Chi- Contrast DF Square Pr > ChiSq Type a1 vs a2 1 198.38 <.000001 Wald c1 vs c2 1 292.68 <.000001 Wald m1 vs m2 1 155.10 <.000001 Wald AM 1 82.94 <.000001 Wald CM 1 401.17 <.000001 Wald GENMOD: Loglinear MODEL (AM, CM) Parameter Level1 Level2 Estimate StdErr ChiSq Intercept 4.1650 0.1183 1239.86 a 1 1.5949 0.1132 198.38 m 1 -1.4731 0.1183 155.10 a*m 1 1 1.0313 0.1132 82.94 c 1 0.6885 0.0402 292.68 c*m 1 1 0.8061 0.0402 401.17 Scale 1.0000 0.0000 Now examine a table with the original and predicted counts: PROC TABULATE DATA=obst NOseps; CLASS a c m ; VAR count pred; TABLE a*c, m*(count*sum=' '*f=5.0 pred*sum=' '*f=9.3) / rts=24; run; -------------------------------------------------------- | | Marijuana Use | | |-------------------------------| | | 1 | 2 | | |---------------+---------------| | |count| Pred |count| Pred | |----------------------+-----+---------+-----+---------| |Alcohol Cigarette | | | | | |Use Use | | | | | |1 1 | 911| 909.240| 538| 438.840| | 2 | 44| 45.760| 456| 555.160| |2 1 | 3| 4.761| 43| 142.160| | 2 | 2| 0.240| 279| 179.840| -------------------------------------------------------- The columns of predicted values for this table are obtained from exponentiating a sum of the respective parameter estimates for indices of a, m, and c. To compute these values, first place the estimates in a row and a reminder of what variable and levels they represent. If the actual sign of the estimate is negative, enter it in (), such as (-1.47). Underneath each estimate place an indicator of the respective level(s) of the coefficient for that combination of A, C, and M. Since effect coding is in place, assign low levels (=1) +1 and high levels (=2) -1. Then multiply the signs together to determine if coefficient should be added or subtracted. mean A M A*M C C*M 909.24 = EXP[ 4.1650 + 1.5949 + (-1.4731) + 1.0313 + 0.6885 + 0.8061 ] mean a=1 m=1 a=1 m=1 c=1 c=1 m=1 438.84 = EXP[ 4.1650 + 1.5949 - (-1.4731) - 1.0313 + 0.6885 - 0.8061 ] mean a=1 m=2 a=1 m=2 c=1 c=2 m=1 45.76 = EXP[ 4.1650 + 1.5949 + (-1.4731) + 1.0313 - 0.6885 - 0.8061 ] mean a=1 m=1 a=1 m=1 c=2 c=2 m=1 555.16 = EXP[ 4.1650 + 1.5949 - (-1.4731) - 1.0313 - 0.6885 + 0.8061 ] mean a=1 m=2 a=1 m=2 c=2 c=2 m=2 Although this loglinear model does not provide a good fit the counts from the original table, it is helpful to explore what the AC interaction means under this model. Summing over levels of M for each combination of A and C results in: a c sum_ac ( sum across levels of m) 1 1 1348.08 = 909.24 + 438.84 1 2 600.92 = 45.76 + 555.16 2 1 146.92 = 4.76 + 142.16 2 2 180.08 = 0.24 + 179.84 These summed predicted counts data are placed in a collapsed table and the odds of C are computed for each level of factor A. --------------------------------------- | | cig1 | cig2 | odds | |---------+--------+--------+---------| |Alcohol | | | | |Use | | | | |1 | 1348.08| 600.92| 2.243| |2 | 146.92| 180.08| 0.816| --------------------------------------- The odds ratio is 2.243/.816 = 2.7. Since it is greater than 1, this implies marijuana use is more likely to be observed in connection with Alcohol use. Note this log-linear analysis is somewhat more comprehensive than a similar analysis with logistic regression of the ability of a and c to predict m as a response variable because it also allows for testing the association between a and c. In general, in log-linear models one looks at associations among predictors as well as their effects on the response, whereas logistic regression looks only at the effects of the explanatory variables on the response (while allowing for some correlation among the predictors).