Section 12.__: Hilo Plot, Standard Error Bars, and Boxplots

1. Hilo Plot
2. Standard Deviation Plot
3. Boxplots

An important step in a t-test or an independent-groups ANOVA is to produce an informative plot of the data. If the group sample sizes are small, a hilo plot works well to display the distribution of the data at each level of a group variable. For sample sizes of 15 to 20 or more within each group, side-by-side boxplots may be a better choice.

1. Hilo Plot

In a hilo plot, the vertical axis holds the response variable and the horizontal axis holds the grouping variable. Coding the group variable numerically gives you more flexibility in laying out the horizontal axis (character levels also work, although they are not as easy to format). Within each level of the group variable, a vertical line connects the minimum and maximum data values. The SYMBOL statement options interpol=hiloj and value=dot produce this plot. The j at the end of hiloj is optional; when present, it adds a line that connects the means across the levels of the horizontal axis variable (separate lines can be produced for the values of an id variable added on the PLOT statement).

In the example below, the group variable gender (categorical, coded numerically as 1=male, 2=female) is placed on the horizontal axis and the continuous response variable score is plotted on the vertical axis. The FORMAT statement assumes that a format named gnd. has already been defined with PROC FORMAT.

    GOPTIONS reset=all cback=white;
    SYMBOL interpol=hiloj value=dot height=1 color=blue;

    PROC GPLOT DATA=scores;
       PLOT score * gender / haxis=0 to 3 by 1 hminor=0
                             vaxis=70 to 90 by 4 vminor=3
                             NOframe;
       FORMAT gender gnd.;
       TITLE H=2 "Test Scores";
    RUN;
    QUIT;

Plot the percentiles and connect the medians

For large datasets you can produce a hilo plot of specific percentiles (e.g., the endpoints of the IQR and the median).
To make this graph, first compute the endpoints of the IQR (P25 and P75) and the median (P50) for each level of the group variable with PROC UNIVARIATE. Because of the BY statement, the input data must be sorted by gender first:

    PROC SORT DATA=scores;
       BY gender;
    RUN;

    PROC UNIVARIATE DATA=scores NOprint;
       BY gender;
       VAR score;
       OUTPUT OUT=pctls N=count pctlpts=25 50 75 pctlpre=sc_ pctlname=P25 p50 P75;
    RUN;

    * percentiles are in multivariate format (one row per group);
    PROC PRINT DATA=pctls NOobs;
    RUN;

    gender   count   sc_P25   sc_p50   sc_P75
       1       19      40       49       81
       2       21      37       52       76

    * place data in univariate format (one row per percentile);
    DATA plt;
       SET pctls;
       KEEP gender iqr;
       iqr = sc_p25; OUTPUT;
       iqr = sc_p50; OUTPUT;
       iqr = sc_p75; OUTPUT;
    RUN;

To make a hilo plot that places the middle tick mark at the median only (and not also at the mean), add these two features:

- enter the INTERPOL=hiloc option
- have exactly three values of the vertical axis variable (here iqr) at each level of the horizontal axis variable (here gender).

To connect the medians (as demonstrated above with connecting the means), enter INTERPOL=hilocj.

    SYMBOL1 interpol=hilocj cv=black value=dot h=2 color=blue;

    * vertical axis spans the percentiles printed above;
    PROC GPLOT DATA=plt;
       PLOT iqr * gender / NOframe
                           Haxis = 0 to 3 by 1    Hminor=0
                           Vaxis = 30 to 90 by 10 Vminor=1;
    RUN;
    QUIT;

2. Standard Deviation Plot

To plot bars of plus or minus 1, 2, or 3 standard deviations around a mean, enter the std option for interpol= on the SYMBOL statement. The number after std sets how many standard deviations to plot. The letter j is optional, depending on whether you want to join the means with a straight line:

    * gnd = 1 for males and 0 for females;
    DATA plt;
       SET sashelp.class;
       gnd = (lowcase(sex)='m');
    RUN;

    GOPTIONS reset=all cback=white;

    * +- 1 standard deviation bars (the number on std#t can be 1, 2, or 3);
    * the t adds a short bar to the top and bottom of each vertical line;
    SYMBOL1 interpol=std1t color=blue value=dot h=2 w=2;
    SYMBOL2 interpol=std1t color=red  value=dot h=2 w=2;

    PROC GPLOT DATA=plt;
       PLOT height * gnd=sex / NOframe
                               Haxis = -1 to 2 by 1  Hminor=0
                               Vaxis = 50 to 75 by 5 Vminor=4;
    RUN;
    QUIT;

To join the vertical lines at their means, enter one SYMBOL statement as follows and omit the =sex on the PLOT statement:

    SYMBOL1 interpol=std1tj color=blue value=dot h=2 w=2;

If your objective is to compare means visually, standard deviation bars placed around each mean assume independent data. They also carry the implied assumption of a normal distribution within each level of the classification factor. The plot can be deceptive when the means come from repeated measurements collected on the same subjects, because the error bars do not take into account the covariance term that is part of the variance of a difference.

With these standard deviation plots, be sure to supply the original data, with multiple observations of the y-axis variable at each value of the x-axis (group) variable. That is, do not supply a dataset of mean values obtained from a prior data reduction step with PROC MEANS or PROC UNIVARIATE, output from an LSMEANS statement in PROC MIXED, or summary statistics computed by any type of ANOVA procedure.

3. Boxplots

PROC GPLOT can also produce side-by-side boxplots through options on the SYMBOL statement. You need to define these symbol characteristics:

    INTERPOL=boxt20   draws a box plot with tops and bottoms on its whiskers (the T);
                      the 20 sets the high and low bounds at the 80th and 20th percentiles
    COLOR=            colors the lines of the boxes and whiskers
    BWIDTH=           sets the width of the boxes
    VALUE=            specifies the plot symbol that marks the data points outside the range of the box plot
    CV=               sets the color of the plot symbols
    HEIGHT=           sets the symbol size

The basic commands to produce a boxplot are:

    GOPTIONS .. ;
    SYMBOL interpol=boxt20 bwidth=3 width=1 color=blue value=dot cv=red height=2;

    PROC GPLOT DATA=grades;
       PLOT grade * section / NOframe
                              haxis=0 to 5 by 1
                              vaxis=0 to 100 by 10 vminor=0;
       TITLE1 H=3 'Grades by Section';
    RUN;
    QUIT;

PROC BOXPLOT is also an important resource for making boxplots and will be the subject of a future article.
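The examples in this section refer to a gnd. format and to datasets named scores and grades, none of which are listed here. To try the code, a minimal setup sketch such as the following may help. The value labels follow the coding stated in the text (1=male, 2=female). Everything else is an assumption made for illustration: the group sizes, the number of sections, the seeds, and all data values, which are simulated.

```sas
/* Hypothetical setup for the examples in this section.            */
/* All data values are simulated and are for illustration only.    */

/* Labels for the numerically coded gender variable.               */
/* OTHER=' ' blanks the padding tick marks at 0 and 3 on the axis. */
PROC FORMAT;
   VALUE gnd 1     = 'Male'
             2     = 'Female'
             OTHER = ' ';
RUN;

/* Small groups (10 per gender), the case suited to a hilo plot.   */
/* Scores are integers from 70 to 90 to match vaxis=70 to 90.      */
DATA scores;
   CALL STREAMINIT(12345);          /* fixed seed, reproducible run */
   DO gender = 1 TO 2;
      DO i = 1 TO 10;
         score = 70 + ROUND(20 * RAND('UNIFORM'));
         OUTPUT;
      END;
   END;
   DROP i;
RUN;

/* Larger groups (25 per section), the case suited to boxplots.    */
/* Grades are clamped to the 0 to 100 range of the vertical axis.  */
DATA grades;
   CALL STREAMINIT(54321);
   DO section = 1 TO 4;
      DO i = 1 TO 25;
         grade = MIN(100, MAX(0, ROUND(RAND('NORMAL', 70, 12))));
         OUTPUT;
      END;
   END;
   DROP i;
RUN;
```

Run this once before the plotting code. Because the values are simulated, the percentiles printed by PROC PRINT will differ from the listing shown in the hilo section.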