13. Repeated Measures Analysis of Discrete Data Many studies produce correlated or clustered data. For example, outcomes may be observed on subjects over several time intervals (longitudinal studies), they may be observed under different experimental conditions (e.g., crossover), or they may be observed in clusters where multiple measurements are collected from the same experimental unit (e.g. clinics, families, litters). Like continuous data treated as repeated measures, categorical data can also be collected under these designs as repeated measures or clustered data (i.e., discrete outcomes with two or more ordinal or nominal levels) or counts (Poisson or negative binomial). Analysis techniques for these data have only been available relatively recently with the development of Generalized Estimating Equations (as well as other techniques not discussed here). Generalized Estimating Equations (GEE) were introduced by Liang and Zeger (1986) as a method for handling correlated discrete data that would typically be analyzed with a Generalized Linear Model (GLM). This approach accommodates dichotomous outcomes and counts for which the correlation among observations that generated the data would otherwise not be considered if it were processed with PROC GENMOD as described in previous sections. With two or more measurements from a relatively "large" number of subjects (and not the total number of measurements), GEE models work well with correlated categorical data as response variables. The number of clusters is a key issue for the procedure to work (100 clusters are safe, 50 clusters with several explanatory variables is OK, below 20 clusters is not good). With GEE correlated data can be modeled with output that looks similar to generalized linear models (GLMs) with independent observations. The primary difference is their ability to account for the within-subject covariance structure for the various types of response data. Like the analysis of repeated measurements of continuous response data with PROC MIXED, they work in situations that have: * a mix of categorical and continuous explanatory data * a few response values missing completely at random (MCAR) * time-dependent covariates Let y_ij (j=1,...,n_i, i=1,...,k) represent the jth measurement on the ith subject. n_j measurements collected on subject i N= SUM n_j (i=1,..,k) total measurements. Modeling the Covariance Matrix The available covariance structures specify how observations within a subject or cluster are correlated with each other. The concepts are the same as working with the various correlation structures available for continuous data in PROC MIXED. Correlated data are modeled with the same link functions and linear predictor equation (systematic component) as found with independent data. The random component of GEEs is also described by the same variance functions, but now the covariance structure of the correlated measurements must also be modeled. Keyword Correlation Matrix Type IND independent EXCH | CS exchangeable (equality of all correlations) AR(1) autoregressive(1) MDEP(number) m-dependent with m=number UNSTR | UN unstructured USER | FIXED (matrix) fixed user-specified correlation matrix These choices are briefly described in Section 13a, GEE correlation structures. Properties of GEEs For whatever reason, you may not be able to specify the exact working correlation matrix; however, the correlation structure that produces the smallest deviance is preferred. The important point with GEE modeling is the robust covariance matrix concept mentioned by Liang and Zeger (Biometrika 86). As long as the mean model is correct, the parameter estimates are consistent as the number of clusters (subjects) becomes large (not the total number of measurements). That is, you can misspecify the structure and still obtain consistent parameter estimates. However, they will be more efficient if your specification is closer to the true structure. And of course, efficiency is optimum with the most appropriate correlation structure, especially if the number of clusters is relatively small. GEE provides consistent estimators of Cov(%beta^) as the number of clusters become large, even if the working correlation matrix is not specified correctly. In other words, with sufficient number of clusters, when you use a robust covariance matrix for your hypothesis tests on the regression parameters, you get correct statistical inference even if you picked the wrong correlation structure. Population averaged (PA) and subject-specific (SS) models have been introduced for terminology of two approaches to analyze longitudinal data by Zeger, Liang, and Albert (1988). Population averaged models: PA marginal expectation of the response variable is the primary focus of the analysis. Does a treatment have a therapeutic benefit that goes beyond the control or conventional therapy? One may fit a PA model and compare mean response profiles between the different therapies. Subject-specific models: the primary focus of Subject Specific models is to evaluate the change of an individual's response. This is accomplished by inserting subject-specific random effects into the model (e.g., GLIMMIX). The GEE approach produces a "marginal model" The GEE approach models a known function of the marginal expectation of the dependent variable as a linear function of explanatory variables. In a marginal model the effect of between-subject factor is modeled separately from the within-subject correlation. The interpretation of the parameters from a marginal model is analogous to the standard logistic regression model. The transformed regression coefficient exp(beta) is the odds for success for a subject from the treatment group divided by the odds for success from a subject assigned to the control group. However, in the GEE model we adjusted for the correlation between measurements from the same subject and we assumed that this correlation is identical for every possible pair of measurements from the same subject. Of course, measurements from different subjects are considered to be independent in order to consistently estimate the variance. The interpretation of the parameter does not depend on the respective subject but rather is valid for the whole population of potential subjects in the study and actually averages the treatment effect across the subjects. This is why the parameters from marginal models are also called population-averaged parameters (which is a big difference from random effects models) Two points differentiate PA and SS models. * Regression coefficients of a PA model describe the average population response curve. * Regression coefficients of a SS model describe what the average individual's response curve looks like. The specification of the underlying variance-covariance structure. * In PA, one explicitly models the marginal expectations while choosing a var/cov structure that adequately describes the correlation pattern among the repeated measurements. * In SS models, model individual heterogeneity using subject-specific random effects which partially determine the var/cov structure. When you fit a GEE model, you cannot enter random effects. GEE models estimate population average (PA) response values. If you need to examine random effects, such as those due to sites or subjects, then you need to employ a general linearized mixed model (GLIMMIX). Example A study is designed to evaluate how teenagers who suffer from depression respond to 4 types of treatments. Treatment 1: placebo 2: FLX 3: CBT 4: COMB, a combination of both FLX and CBT Subjects (n=333) were randomly assigned one of four treatment groups. At the end of two 6 week intervals being on one of these treatments they were evaluated for improvement (response=1 for Yes, =0 for No). In addition, at the beginning of the study each subject was measured for their "action" level, essentially, a continuous variable that measures self-motivation. The responses for the subjects from the four treatments, two time points, and quartiles for the action score are summarized in the table below: The quartiles were obtained with PROC UNIVARIATE. These values serve as the dividing points in a FORMAT made to divide the continuous data into 4 levels. PROC FORMAT; VALUE rng 0 -< 7.170 ='1st Q' 7.170 -< 8.117='2nd Q' 8.117 -< 8.667='3rd Q' 8.667 -< 12 ='4th Q'; RUN; These levels can be entered into the TABULATE procedure and the action score can be treated as categorical, even though it still remains continuous. PROC TABULATE DATA=ctr NOseps ; CLASS site treat tm fa3act; VAR response; table tm='Time'*fa3act, treat='Treatment'*response=' '*(sum*f=5.0 n*f=5.0) / rts=18 box='Improvement'; FORMAT fa3act rng. ; run; ------------------------------------------------------------------ |Improvement? | Treatment | | |-----------------------------------------------| | | 1-PBO | 2-FLX | 3-CBT | 4-COMB | | |-----------+-----------+-----------+-----------| | | Sum | N | Sum | N | Sum | N | Sum | N | |----------------+-----+-----+-----+-----+-----+-----+-----+-----| |Time FA3Act | | | | | | | | | |2 1st Q | 5| 18| 10| 15| 7| 24| 11| 22| | 2nd Q | 8| 21| 13| 22| 4| 17| 12| 15| | 3rd Q | 9| 25| 16| 24| 8| 23| 16| 24| | 4th Q | 3| 11| 10| 20| 10| 14| 15| 18| |3 1st Q | 8| 20| 11| 15| 9| 23| 19| 24| | 2nd Q | 7| 22| 15| 22| 8| 18| 13| 16| | 3rd Q | 13| 27| 18| 28| 12| 23| 19| 26| | 4th Q | 4| 13| 12| 21| 8| 14| 14| 20| ------------------------------------------------------------------ In order to check for the possibility of "separation", this events/total notation shows a substantial number of positive counts exists in all cells (i.e., no 0's), that is improvement was observed in subjects across all 4 treatments over a broad range of the action variable. The individual responses were entered into a repeated measures GEE analysis with PROC GENMOD. Since there are only two time points, the various choices of non-identity within subjects correlation matrices act in the same manner. If a third evaluation at 18 weeks was done, the AR(1) structure may be the most reasonable choice, so it was entered for this 2 period study. Also, since GENMOD does not work with random effects, site was treated as a fixed factor: PROC GENMOD DATA=ctr descending ; CLASS subjno site treat tm ; MODEL response = treat site fa3act treat*fa3act tm / dist=binomial link=logit type3 ; REPEATED subject=subjno / type=ar(1) CORRW; LSMEANS treat / diff cl ; RUN; Selected portions of the output are displayed. Response Profile Ordered Total Value cgii Frequency 1 1 347 2 0 298 PROC GENMOD is modeling the probability that response='1'. GEE Model Information Correlation Structure AR(1) Subject Effect SUBJNO (333 levels) Number of Clusters 333 Clusters With Missing Values 21 Correlation Matrix Dimension 2 Maximum Cluster Size 2 Minimum Cluster Size 1 Algorithm converged. Working Correlation Matrix Col1 Col2 Row1 1.0000 0.3317 Row2 0.3317 1.0000 Score Statistics For Type 3 GEE Analysis Chi- Source DF Square Pr > ChiSq treat 3 6.91 0.074777 site 12 15.18 0.231642 FA3Act 1 2.47 0.115855 FA3Act*treat 3 6.59 0.086205 tm 1 4.94 0.026230 Least Squares Means Standard Chi- Effect treat Estimate Error DF Square Pr > ChiSq Confidence Limits treat 1-PBO -0.4729 0.2247 1 4.43 0.035347 -0.9134 -0.0325 treat 2-FLX 0.6901 0.2288 1 9.10 0.002561 0.2416 1.1386 treat 3-CBT -0.2277 0.2131 1 1.14 0.285396 -0.6455 0.1901 treat 4-COMB 1.1394 0.2233 1 26.03 <.000001 0.7017 1.5771 Differences of Least Squares Means Standard Chi- Pr > Confidence Effect treat _treat Estimate Error DF Square ChiSq Limits treat 1-PBO 2-FLX -1.1631 0.2780 1 17.50 0.0001 -1.708 -0.618 treat 1-PBO 3-CBT -0.2452 0.2753 1 0.79 0.3730 -0.785 0.294 treat 1-PBO 4-COMB -1.6123 0.2726 1 34.98 <.0001 -2.147 -1.078 treat 2-FLX 3-CBT 0.9178 0.2804 1 10.71 0.0011 0.368 1.467 treat 2-FLX 4-COMB -0.4493 0.2806 1 2.56 0.1094 -0.999 0.101 treat 3-CBT 4-COMB -1.3671 0.2759 1 24.54 <.0001 -1.908 -0.826 PROC GLIMMIX DATA=ctr ; CLASS site treat tm subjno ; MODEL response(descending) = treat site fa3act treat*fa3act tm / dist=binomial link=logit solution; RANDOM _residual_ / subject=subjno type=ar(1) ; LSMEANS treat / diff odds oddsratio cl at (fa3act)=(4); LSMEANS treat / diff odds oddsratio cl ; LSMEANS treat / diff odds oddsratio cl at (fa3act)=(10); RUN; Type III Tests of Fixed Effects Num Den Effect DF DF F Value Pr > F treat 3 313 2.44 0.064257 site 12 313 1.19 0.285599 FA3Act 1 313 2.37 0.125025 FA3Act*treat 3 313 2.22 0.085277 tm 1 311 5.08 0.024934 treat Least Squares Means (at mean of FA3Act) Standard treat Estimate Error DF t Value Pr > |t| 1-PBO -0.4732 0.2160 313 -2.19 0.029196 2-FLX 0.6901 0.2195 313 3.14 0.001824 3-CBT -0.2279 0.2162 313 -1.05 0.292566 4-COMB 1.1391 0.2325 313 4.90 0.000002 Differences of Least Squares Means Standard treat _treat Estimate Error DF t Value Pr > |t| 1-PBO 2-FLX -1.1634 0.2830 313 -4.11 0.000051 1-PBO 3-CBT -0.2453 0.2860 313 -0.86 0.391744 1-PBO 4-COMB -1.6124 0.2946 313 -5.47 <.000001 2-FLX 3-CBT 0.9181 0.2821 313 3.25 0.001263 2-FLX 4-COMB -0.4490 0.2911 313 -1.54 0.123956 3-CBT 4-COMB -1.3670 0.2941 313 -4.65 0.000005 Differences of Least Squares Means - Odds Ratios Odds Lower Upper treat _treat Lower Upper Ratio Odds Ratio Odds Ratio 1-PBO 2-FLX -1.7203 -0.6065 0.312 0.179 0.545 1-PBO 3-CBT -0.8081 0.3175 0.782 0.446 1.374 1-PBO 4-COMB -2.1920 -1.0327 0.199 0.112 0.356 2-FLX 3-CBT 0.3629 1.4732 2.504 1.438 4.363 2-FLX 4-COMB -1.0217 0.1237 0.638 0.360 1.132 3-CBT 4-COMB -1.9457 -0.7884 0.255 0.143 0.455