Section 7a: Computing Means and Variances from Multiple Measurements on Each Subject A. Two values from each subject B. Three values from each subject C. Extended Example D. Computing Means and Variances from Autocorrelated Data (data collected over equal time intervals) Computing means and variances (which lead to computation of standard deviations and standard errors) is one of the most basic tasks of data analysis. The unbiased estimate of the variance is: VAR(y) = SUM[(y_i - y_mean)**2] / (n-1) where n is the sample size and y_mean is the 'average' of all n observations. This formula tacitly assumes all observations are independent; that is, one and only one observation has been collected from each subject that is each observation is not influenced by data collected from other subjects. What if you collect multiple measurements from each subject (repeated measurements) or if the subjects co-exist in clusters? In these situations you have replication, with no treatment or time effect assumed to be present among the observations. Although there are valid reasons to collect data this way, you should never obtain additional observations from each subject as a substitute for the need to have data from more subjects. Greater power to detect differences in treatment or group means is achieved by having more subjects, not by collecting more data from each subject. When analyzing multiple measurements, the inclination is to compute the means and analyze them as if you had one observation from each subject. This approach essentially eliminates the within-subject variation, which is an important component of the variation you should report in the final results. Also, the subjects will vary for unobserved reasons which imply multiple measurements from each subject will be correlated (the severity can be summarized by the intraclass correlation). The mixed model approach to analyzing multiple measurements eliminates the need to compute means - one can analyze ALL the data, taking into account various correlation structures. The following examples show how the LSMEANS and the REPEATED statements of PROC MIXED work together to compute means and variances. What you should NEVER do with repeated measurements is analyze all the data as if they were independent. The purpose of these examples is to demonstrate with data having a simple structure how the magnitude of the standard errors generally increases when you have clustered data (multiple observations); they also show how PROC MIXED can easily work with data to which you may perhaps mistakenly apply PROCs MEANS, TABULATE, or UNIVARIATE or even worse, compute means and variances in Excel which has no capability to work with correlated data. Computing Means and Variances from Multiple Measurements within each subject One important application of the combination of the LSMEANS and REPEATED statements for multiple measurements (that is, collecting multiple values from each subject) is computing more appropriate estimates of variances and standard errors. One serious short-coming of computing summary statistics from clustered data with PROC MEANS, PROC UNIVARIATE, or PROC TABULATE (and all too easily with Excel) is that standard deviations or standard errors will be "smaller" than they should be. Variances for clustered data also have covariance terms due to the presence of correlated data within clusters (PROCs MEANS, UNIVARIATE, and TABULATE and Excel have no way of knowing whether the data are truly independent). This correlation is generally positive which almost always increases the variance to a larger value than one computed assuming independence. Depending on magnitude of the correlation within subjects, the higher the correlation, the greater will be the observed increase in variance. For example, consider the following dataset (a very small sample is given for illustration and ease of mathematical computation only) from 3 subjects with 3 observations collected over time for each subject for a total of 9 observations: Group subject y1 y2 y3 1 1 12.50 11.98 10.70 = y11 y12 y13 1 2 14.11 13.50 11.48 = y21 y22 y23 1 3 12.20 11.56 10.74 = y31 y32 y33 The data are currently in multivariate format, that is, all the data from each subject are in one row. The columns labeled y1, y2, and y3 need to be placed into a dataset with a narrow or long format (that is one column contains the values of y indexed by a new variable called time with values 1, 2, and 3). DATA indat; DROP y1 y2 y3; INPUT group subject y1 y2 y3; time=1; y=y1; OUTPUT; time=2; y=y2; OUTPUT; time=3; y=y3; OUTPUT; CARDS; 1 1 12.50 11.98 10.70 1 2 14.11 13.50 11.48 1 3 12.20 11.56 10.74 ; A. Variance computation with two values from each subject First, focus on computing summary statistics for variables y1 and y2 (i.e., measurements collected at time=1 and time=2). One can easily compute summary statistics which assume independent data with PROC TABULATE (also, PROC MEANS, UNIVARIATE, or Excel): PROC TABULATE DATA=indat NoSeps; WHERE time IN (1,2); VAR y; TABLE y*(lclm*f=7.2 mean*f=6.2 uclm*f=7.2 var*f=8.4 STDdev*f=8.4 n='Nobs'*f=6.0 STDERR*f=8.4); run; You'll find the following statistics in the output table: ---------------------------------------------------------- | y | |--------------------------------------------------------| |95_LCLM| Mean |95_UCLM| Var | StdDev | Nobs | StdErr | |-------+------+-------+--------+--------+------+--------| | 11.62| 12.64| 13.66| 0.9431| 0.9712| 6| 0.3965| ---------------------------------------------------------- where StdErr = SQRT(Var/Nobs) 0.3965 = SQRT(0.9431/6) Compute the same values of the mean, standard deviation, and standard error of the mean with PROCs GLM and MIXED assuming independence: ODS SELECT means LSMEANS ; PROC GLM data=indat; WHERE time IN (1,2); CLASS group; MODEL y = group / NOint; MEANS group ; LSMEANS group / stderr ; run; QUIT; Output from the MEANS statement: Level of --------------y-------------- group N Mean Std Dev 1 6 12.6416667 0.97115224 Output from the LSMEANS statement: Least Squares Means Standard group y LSMEAN Error Pr > |t| 1 12.6416667 0.3964712 <.0001 And in PROC MIXED: ODS SELECT lsmeans covparms; PROC MIXED data=indat; WHERE time IN (1,2); CLASS group; MODEL y = group / NOint; LSMEANS group; run; Since group only has one level (in this illustrative example), the NOint option allows group to be present on the MODEL statement (otherwise group would be perfectly collinear with the intercept). Its presence on the MODEL statement is necessary for the factor group to be placed on the LSMEANS statement and thus to produce the output below. The output produces: Cov Parm Estimate Residual 0.9431 = variance LSMEAN table group Estimate StdErr 1 12.6417 0.3965 = SQRT(0.9431/6) The mean, variance, and standard error computed from these 6 observations with PROC MIXED are the same values as the mean, variance, and standard error estimates found with PR0C TABULATE. However, since the data are actually "clustered" the variance and resulting standard error estimates are NOT correct. Unlike the summary statistics computed with other procedures, PROC MIXED can remedy this problem with the REPEATED statement. It allows one to account for the necessary modifications to the variance computation with clustered data. PROC MIXED DATA=indat; WHERE time IN (1,2); CLASS subject group; MODEL y = group / NOint; REPEATED / SUBJECT=subject TYPE=cs rcorr r; LSMEANS group ; run; Estimated R Matrix (i.e., the variance/covariance matrix) Row Col1 Col2 1 1.1351 0.9597 -> top left/bottom right entries are VAR(y1) and VAR(y2) 2 0.9597 1.1351 -> the off-diagonal entries are Covariance(y1,y2) Estimated R Correlation Matrix Row Col1 Col2 correlation = cov(y1,y2) / SQRT[VAR(y1) * VAR(y2)] 1 1.0000 0.8455 0.8455 = 0.9597 / (SQRT(1.1351)*SQRT(1.1351)) -> that is, the data are highly correlated 2 0.8455 1.0000 Covariance Parameter Estimates Cov Parm Subject Estimate CS subject 0.9597 = VAR(subject) Residual 0.1754 = Error Model is y_ij = mu + subject_i + error_ij, i = 1,2,3 j = 1,2 Since the subject deviations from the grand mean and the associated errors are independent, the variance of an observation is the sum of these two components: Variance(y_ij) = 0.9597 + 0.1754 = 1.1351 (as shown in the R matrix) group Estimate StdErr Variance(LSMEAN) 1 12.6417 0.5909 0.3491 ^^^^^^ StdErr = SQRT(0.3491) = 0.5909 To derive the variance of the LSMEAN, recall the Estimate column (i.e., in this situation it is the arithmetic mean) is the sum of the observations for y1 and y2 divided by 6: Mean = (y11 + y12 + y21 + y22 + y31 + y32)/6 VAR(mean) = [1/36] * [VAR(y11)+VAR(y22)+VAR(y33)+VAR(y44)+VAR(y55)+VAR(y66) + 2*COV(y11,y21) + 2*COV(y21,y22) + 2*COV(y31,y32)] Because subjects are independent, the covariance terms across subjects (i.e., values of y_ij and y_i'j where i NE i') are assumed equal to 0, e.g., COV(y11,y21)=0. Since VAR(y11) is assumed to be the same as the other five variances and COV(y11,y21) is also the same as the other non-zero covariance terms within subjects, we can simplify the above equation: VAR(mean) = [1/36] * [6 * VAR(y1) + 6*COV(y1,y2)] = [1/6] * [VAR(y1) + COV(y1,y2)] = [1/6] * [1.1351 + 0.9597] = 0.3491 So, the standard error of the lsmean printed above is: StdErr(Estimate) = SQRT(0.3491) = 0.5909 (shown in LSMEANS table above) This example indicates how much greater the standard error of the mean becomes when taking the clustering of the data within subject into account (that is, paired data are positively correlated) when compared with the standard error of the mean computed assuming all 6 observations are independent (i.e., 0.5909 > 0.3965). B. Variance computation with three values collected from each subject Now consider the three values of y listed above which have been collected from each of the three subjects. The variance calculations just demonstrated also have the same implications, although the algebra is messier to work with. Simple summary statistics are "incorrectly" computed with PROC TABULATE: ---------------------------------------------------------- | y | |--------------------------------------------------------| |95_LCLM| Mean |95_UCLM| Var | StdDev | Nobs | StdErr | |-------+------+-------+--------+--------+------+--------| | 11.20| 12.09| 12.97| 1.3335| 1.1548| 9| 0.3849| ---------------------------------------------------------- Following the logic of the algebra to compute the variance of the mean as shown above, each subject now contributes three non-zero covariance terms to the variance of the mean: for subject j the non-zero covariance terms are COV(y_j1,y_j2), COV(y_j1,y_j3), and COV(y_j2,y_j3): VAR(LsMean) = [1/81] * [VAR(y11)+VAR(y12)+VAR(y13) + VAR(y21)+VAR(y22)+VAR(y23) + VAR(y31)+VAR(y32)+VAR(y33) + 2*COV(y_11,y_12) + 2*COV(y_11,y_13) + 2*COV(y_12,y_13) + 2*COV(y_21,y_22) + 2*COV(y_21,y_23) + 2*COV(y_22,y_23) + 2*COV(y_31,y_32) + 2*COV(y_31,y_33) + 2*COV(y_32,y_33) ] = [1/81] * [ 9 * VAR(y) + 18*COV(y_i,y_j) ] = [1/9] * [ VAR(y) + 2 * COV(y_i,y_j) ] PROC MIXED DATA=indat; CLASS subject group; MODEL y = group / NOint; REPEATED / SUBJECT=subject TYPE=cs rcorr r; LSMEANS group ; run; The above procedure procedure produces this output: Estimated R Matrix (Variance/Covariance Matrix) Row Col1 Col2 Col3 1 1.4126 0.3164 0.3164 Var(y) = 1.4126 2 0.3164 1.4126 0.3164 Cov(y_i,y_j) = 0.3164 3 0.3164 0.3164 1.4126 Estimated R Correlation Matrix Row Col1 Col2 Col3 1 1.0000 0.2240 0.2240 2 0.2240 1.0000 0.2240 3 0.2240 0.2240 1.0000 Covariance Parameter Estimates Cov Parm Subject Estimate CS subject 0.3164 Residual 1.0962 Var(y_ij) = subj_var + residual = 0.3164 + 1.0962 = 1.4126 (as shown in the R matrix printed above) Least Squares Means Standard Effect group Estimate Error group 1 12.0856 0.4767 From the equation for Var(LsMean) Var(LsMean) = (1.4126 + 2*0.3164)/9 Standard Error = SQRT(2.0454/9) = 0.4767 The results show that even with a small correlation among the observations (0.224 in this example), the standard error of the mean (0.4767) is considerably larger than the value computed assuming independence (0.3849). Note: demonstrating these calculations is tedious with more factors; it is important to realize when you compute means and covariances across several levels of repeated measures from the same subject. The correlations of the values collected within subjects have a direct bearing on the observed standard errors of the means computed with the LSMEAN statement. C. Extended Example In this example 6 subjects produced 3 values each. DATA indat; label xplot = 'All Subjects'; INPUT ID tr y @@; cnst='1'; xplot = 1 - .25 + .5*ranuni(86293229); CARDS; 1 1 11 1 2 12 1 3 14 4 1 14 4 2 17 4 3 13 3 1 13 3 2 11 3 3 15 6 1 9 6 2 10 6 3 13 2 1 11 2 2 8 2 3 9 5 1 5 5 2 6 5 3 7 ; proc sort; by id tr; proc print NOobs; var id tr y; run; A dot plot of these data show nothing unusual: Plot of y*xplot. Symbol used is '*'. y | 20 + | | | | * | 16 + | * | | * * | * * * | 12 + * | * * * | | * | * * | 8 + * | * | | * | * | 4 + ------------------+-------------- 1.0 And it is easy to compute summary statistics with PROC TABULATE (or PROC MEANS): PROC TABULATE DATA=indat NOseps ; CLASS id ; VAR y; TABLE y*(sum*f=5.0 n*f=3.0 mean*f=6.2 min*f=5.0 max*f=5.0 VAR*f=8.3 Stderr*f=7.4) ; RUN; ----------------------------------------------- | y | |---------------------------------------------| | Sum | N | Mean | Min | Max | Var |StdErr | |-----+---+------+-----+-----+--------+-------| | 198| 18| 11.00| 5| 17| 10.471| 0.7627| ----------------------------------------------- Mean = Sum / N = 198/18 = 11.00 Var = [Sum (y - mean_y)**2] / (n-1) = 10.471 StdErr = SQRT(Var/n) = SQRT(10.471/18) = .7627 Since the data were generated by 6 subjects, the following plot shows how the 3 observations from each individual actually appear: Plot of y*ID. Symbol used is '*'. y | 20 + | | | | * | 16 + | * | | * * | * * * | 12 + * | * * * | | * | * * | 8 + * | * | | * | * | 4 + -------+------------+------------+-----------+-----------+-----------+------ 1 2 3 4 5 6 ID TABULATE can compute the individual subject's sample statistics. In this case, it works since the data within each subject are assumed independent: PROC TABULATE DATA=indat NOseps ; CLASS id ; VAR y; TABLE id, y*(n*f=3.0 mean*f=6.2 VAR*f=6.3 stderr*f=7.4) / rts=9 BOX='Summary Stats'; RUN; ----------------------------------- |Summary| y | |Stats |-------------------------| | | N | Mean | Var |StdErr | |-------+---+------+------+-------| |ID | | | | | |1 | 3| 12.33| 2.333| 0.8819| |2 | 3| 9.33| 2.333| 0.8819| |3 | 3| 13.00| 4.000| 1.1547| |4 | 3| 14.67| 4.333| 1.2019| |5 | 3| 6.00| 1.000| 0.5774| |6 | 3| 10.67| 4.333| 1.2019| ----------------------------------- PROC MIXED can also compute these means and variances: ODS SELECT lsmeans covparms; PROC MIXED DATA=indat; CLASS id tr; MODEL y = id / noint; REPEATED tr / group=id; * group= will compute separate variances for each subject; LSMEANS id; RUN; Covariance Parameter Estimates Cov Parm Group Estimate tr ID 1 2.3333 tr ID 2 2.3333 tr ID 3 4.0000 tr ID 4 4.3333 tr ID 5 1.0000 tr ID 6 4.3333 Least Squares Means Standard Effect ID Estimate Error ID 1 12.3333 0.8819 ID 2 9.3333 0.8819 ID 3 13.0000 1.1547 ID 4 14.6667 1.2019 ID 5 6.0000 0.5774 ID 6 10.6667 1.2019 However, when computing summary stats across subjects, PROC MIXED computes a very different variance than TABULATE: ODS SELECT lsmeans covparms r; PROC MIXED DATA =indat; CLASS id cnst tr; MODEL y = cnst / noint; REPEATED cnst / subject=id r type=cs; *RANDOM id; LSMEANS cnst; RUN; Estimated R Matrix (subject variance/covariance matrix) Row Col1 Col2 Col3 1 11.4593 8.4037 8.4037 2 8.4037 11.4593 8.4037 3 8.4037 8.4037 11.4593 Covariance Parameter Estimates Cov Parm Subject Estimate CS ID 8.4037 Residual 3.0556 Variance of an observation: VAR(y) = 8.4037 + 3.0556 = 11.4593 Notice that 11.4593 is the value found along the diagonal of the R matrix above. The algebra described in section I shows the variance of a mean computed from correlated data includes a covariance term: Var(mean) = Var(y) + 2*COV(y_i,y_j) = (11.4593 + (2*8.4037))/18 = 28.2667 / 18 = 1.57037 The standard error is the square root of the variance of the mean: Standard Error = SQRT(1.57037) = 1.2531 The LSMEANS statement produces the same mean as PROC TABULATE and shows the correct standard error: Least Squares Means Standard Estimate Error 11.0000 1.2531 Note that the standard error computed here with correlated data is 1.2531 which is considerably larger than the value 0.7627 PROC TABULATE computed under the assumption of independence. D. Compute the mean and variance from autocorrelated data (collected across equal time intervals) An auto-correlation structure implies that adjacent data values have a non-zero correlation (usually positive). Data two or more time points apart are also correlated, though this value decreases towards 0 as the lags increase. To demonstrate how PROC MIXED computes a mean and variance of autocorrelated data, twenty observations from a time series from equally spaced observations are given below: DATA one; task='3'; INPUT y @@; CARDS; 4.46 4.48 4.72 4.68 4.76 4.78 4.90 4.60 4.84 4.48 4.22 4.16 4.22 4.28 4.32 4.12 3.98 4.16 4.22 4.68 ; Entering the data into PROC MEANS produces results that assume 20 independent observations: PROC MEANS DATA=one n mean stddev stderr; VAR y; run; Analysis Variable : y N Mean Std Dev Std Error -------------------------------------------------- 20 4.4530000 0.2776802 0.0620912 Next, place the count or number of observations into a macro variable: DATA _null_; CALL SYMPUT("count",nobs); SET one NOBS=nobs; STOP; run; ODS OUTPUT LSMEANS=lsm(drop=effect tvalue df probt) rcorr=rcr; ODS EXCLUDE fitstatistics lrt dimensions rcorr; PROC MIXED DATA=one NOitprint NOclprint ; CLASS task; MODEL y = task / NOtest; LSMEANS task; REPEATED / subject=task type=ar(1) rcorr; run; * print first 5 rows and columns of a 20x20 correlation matrix; PROC PRINT DATA=rcr(obs=5) Noobs; VAR row col1-col5; FORMAT col: 5.3 row 3.0; run; Add the sample size and compute the standard deviation: DATA lsm; SET lsm; count = &count. ; std_dev = stderr * sqrt(count); PROC PRINT DATA=lsm NOobs; var task count Estimate std_dev StdErr; run; The results of these statements are given below: Covariance Parameter Estimates Cov Parm Subject Estimate AR(1) task 0.8016 --> .8016 implies highly correlated data Residual 0.09747 Correlation matrix: the correlation at lag j = corr**j, that is a correlation at lag 2 is .643, as row j, column j + 2 shows: Row Col1 Col2 Col3 Col4 Col5 .. col20 1 1.000 0.802 0.643 0.515 0.413 .. 2 0.802 1.000 0.802 0.643 0.515 .. 3 0.643 0.802 1.000 0.802 0.643 .. 4 0.515 0.643 0.802 1.000 0.802 .. 5 0.413 0.515 0.643 0.802 1.000 .. 6 .. .. The final example shows how these computations quickly become much more complex with correlation structures that are not compound symmetric; however the combination of the REPEATED and LSMEANS statements with PROC MIXED makes it extremely simple. Estimated LSmean (Estimate) and its standard error: task count Estimate std_dev StdErr 3 20 4.4867 0.79397 0.1775 Notice that the mean is somewhat "larger" and the standard deviation and standard errors are considerably larged with autocorrelated data. Summary Computation of a simple mean and variance is takes on added complexity with correlated data. The mean is the same value (with equal number of observations per subject) whether the data as when computed under independence. However, when data clustering is ignored, the variances are underestimated (the severity depends on how correlated individual measurements actually are). And it shows why you should never compute variances of correlated data with Excel, which has no capability to give you the correct results. This section introduces issues behind the computation of the variance of a mean for correlated data. A later section will demonstrate that when you compare two means on within subject factors, a covariance term is also relevant, which generally decreases the variance of a difference in two means. The main point demonstrated here is that when computing means, correlated data produce larger variances than those typically reported and often shown on graphs. More about issues related to plotting means to compare within subject factors will appear in the next section on analyzing repeated measures data.