Working with Problem Data

Missing Data

Although complete data for all subjects are desired, you should not ignore the possibility that some data items will be unavailable, lost, or unusable for other reasons. Data items that are not available for a subject are assigned a "missing" value in the database. Every program has its own way of representing missing data, so be sure to review those procedures. In surveys, for example, missing data are often given an impossible number, such as -9 where the expected response is on a 1-5 Likert scale. In SAS, the default missing value symbol is a period (.), but options exist to assign other symbols as missing data, if necessary.

Procedures to follow in the presence of missing data include:

1. Use only the variables which are completely recorded for each subject.
2. Fill in missing values with imputation-based procedures such as mean substitution or regression estimates.
3. Define a model-based procedure for the partially missing data and base inferences on the likelihood under that model.

The first approach limits the analysis to variables that have no missing values for any subject. However, when all subjects and all variables relevant to the analysis are considered, this method may exclude a considerable amount of data. Therefore, procedures which can be applied when a relatively small percentage of the data are missing are often desirable.

It may not be necessary to present the details of any specific missing data estimation procedure, other than to mention that relatively sophisticated statistical techniques exist for this aspect of data analysis. However, some data will inevitably be lost or unavailable, so you need to establish a procedure or decision rules to follow when that happens. Of course, the magnitude of the problem will not be known until subjects are added to your database and the amount of missing data can be observed.
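The first two approaches listed above can be sketched in SAS as follows. This is only an illustration under stated assumptions: the data set `survey` and the items `item1`-`item5` are hypothetical names, -9 is assumed to be the survey's missing code, and PROC STDIZE requires SAS/STAT.

```sas
* Hypothetical data set and variable names, for illustration only;
DATA survey_clean;
   SET survey;
   ARRAY items{*} item1-item5;
   DO i = 1 TO DIM(items);
      * recode the assumed survey missing code (-9) to the SAS missing value;
      IF items{i} = -9 THEN items{i} = .;
   END;
   DROP i;
RUN;

* count non-missing and missing values per item to gauge the size of the problem;
PROC MEANS DATA=survey_clean N NMISS;
   VAR item1-item5;
RUN;

* mean substitution: REPONLY replaces only the missing values with each variable mean;
PROC STDIZE DATA=survey_clean OUT=survey_imputed REPONLY METHOD=MEAN;
   VAR item1-item5;
RUN;
```

Recoding first matters: if the -9 codes were left in place, they would be treated as real responses and would pull every mean downward. Mean substitution is shown only because it is the simplest of the imputation procedures named above, not because it is the best choice for a given study.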
Working with 'Outliers'

It's an unfortunate fact of research that data are not always well-behaved. "Outliers" (unusual data values) occur in almost all research projects involving data collection. This is especially true in observational studies, where data may naturally take on very unusual values even if they come from reputable sources. Data entry errors and rare events (such as readings from a thermometer left in the sun, a change in accounting practice, or a subject who has a sudden muscle spasm) are among the many reasons for outliers to exist in a collection of data.

Sources of outliers include:

Data errors. When looking for the source of outlying observations, check for data recording or entry errors.

"Rare" events. The "rare" event syndrome occurs when extreme observations, for some legitimate reason, do not fit within the typical range of other data values yet should be considered part of the overall picture.

In the presence of outliers, any statistical test based on sample means and variances can be distorted. For example, estimated regression coefficients that minimize the Sum of Squares for Error (SSE) are very sensitive to outliers. There are several problematic effects of outliers, including:

* bias or distortion of parameter estimates
* inflated sums of squares (which make it unlikely you will partition sources of variation in the data into meaningful components)
* distortion of p-values (statistical significance, or lack thereof, can be due to the presence of a few unusual data values, or even just one)
* faulty conclusions (it's quite possible to draw false conclusions if you haven't looked for indications that there was anything unusual in the data)

Identification of Outliers

The "normal" distribution myth. For many statistical modeling purposes, input data do not need to have a "normal" or symmetric, bell-shaped distribution. (The normality requirement applies primarily to residuals from a statistical model, a subject for future articles.)
Discrete data or counts, by definition, will not usually look very "normal". With data to be used to compute a linear regression model, it is preferable that the independent or explanatory variables do not have a normal distribution. It can be demonstrated mathematically that normality is neither required nor even desirable for explanatory variables in a regression. What is important is to check for data values that lie well outside the range of the other data (called leverage points), since these can have an undue influence on the results. Your objective should be to collect data with a distribution that allows you to make the best inferences possible about the population under study.

Visual aids. Check the distributions of data values by levels of a categorical variable, if available. This procedure should always be one of the first steps in data analysis, as it will quickly reveal the most obvious outliers. For continuous or interval data, a dot plot of a single variable or multi-dimensional scatterplots are good ways to look for outlying observations. A box plot is another very helpful tool, since it makes no assumption about the distribution and does not require an estimate of a mean or standard deviation. Values that are extreme in relation to the rest of the data are easily identified.

Univariate tests check for the presence of outliers; however, many of them are designed to detect only one outlier, and they also make assumptions about the distribution of the data which are often not relevant (e.g. they assume a normal distribution when you have very skewed non-negative data). They often require that a location (mean) or scale (standard deviation) parameter be estimated from the data. As shown earlier, outliers greatly influence these values. This is one reason why "eliminating data that exceed two or three standard deviations" may not be a good, or even a reasonable, decision rule.

IQR computation.
A simple approach, requiring only basic computing skills, is to compute the inter-quartile range (IQR) and then use a multiple of it to define which values could be considered outliers. A box plot uses this technique to identify outliers. It is an extremely effective approach, especially for large data sets with continuous data.

One way to apply this approach is to use PROC UNIVARIATE with SAS and save the order statistics available with its OUTPUT statement. The first quartile (q1), third quartile (q3), and inter-quartile range (iqr) can be saved to an output data set or written to macro variables (see example below). In a subsequent DATA step you can flag observations that lie outside the interval from q1-(1.5*iqr) to q3+(1.5*iqr) as potential outliers, and anything outside the interval from q1-(3*iqr) to q3+(3*iqr) as problematic outliers.

With large data sets, you can write computer programs to identify data entry errors or extreme observations. SAS is a particularly good tool for this purpose, and some examples of how to do it are included in the SAS notes. Developing techniques to look for outliers and understanding how they affect data analysis are extremely important parts of a thorough analysis, especially when statistical techniques are to be applied to the data.

Here is an example of a SAS program that will detect outliers using order statistics.
```sas
OPTIONS ls=78 ps=55 NOcenter formdlim=' ';
LIBNAME dat '.';

* compute the quartiles and inter-quartile range of y;
PROC UNIVARIATE DATA=dat.mydata NOprint;
   VAR y;
   OUTPUT OUT=qdata Q1=q1 Q3=q3 QRANGE=iqr;
RUN;

* store the order statistics in macro variables;
DATA _null_;
   SET qdata;
   CALL SYMPUTX("q1",q1);
   CALL SYMPUTX("q3",q3);
   CALL SYMPUTX("iqr",iqr);
RUN;

* save the outliers;
DATA outliers;
   SET dat.mydata;
   KEEP y severity;
   LENGTH severity $ 2;   * wide enough to hold both flag values;
   IF (y <= (&q1-1.5*&iqr)) OR (y >= (&q3+1.5*&iqr)) THEN severity='*';
   IF (y <= (&q1-  3*&iqr)) OR (y >= (&q3+  3*&iqr)) THEN severity='**';
   * a missing y compares as smaller than any number, so exclude it here;
   IF NOT MISSING(y) AND (severity='*' OR severity='**') THEN OUTPUT outliers;
RUN;

PROC PRINT DATA=outliers;
   VAR y severity;
   TITLE 'Data outliers for review';
RUN;
```

What Should You Do With Outliers?

Working with outliers in continuous data can pose difficult decisions. Neither ignoring them nor deleting them at will is a good solution. If you do nothing, you may end up with a model that describes essentially none of the data: neither the bulk of the data nor the outliers. Even though your numbers may be perfectly legitimate, if they lie outside the range of most of the data, they can cause computational anomalies and resulting inference problems.

Accommodation. Accommodation of outliers uses techniques that mitigate their harmful effects. One of its strengths is that accommodation does not need to be preceded by identification; these techniques can often be used without a prior determination that outliers exist. However, keep in mind that identification and accommodation do not compete; rather, they reinforce each other. A few possible approaches to accommodating outliers are listed below.

Nonparametric Methods. One very effective way to work with data is to use methods which are robust in the presence of outliers. Nonparametric statistical methods belong to this class of analysis and should be more widely applied to continuous or interval data. When outliers are not a problem, studies have shown that their ability to detect significant differences is only slightly lower than that of the corresponding parametric methods.
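As a minimal sketch of the nonparametric approach, the following compares two groups on y with the Wilcoxon rank-sum test and, for reference, with its parametric counterpart. The classification variable `group` is a hypothetical two-level variable assumed to exist in `dat.mydata`; both procedures require SAS/STAT.

```sas
* Hypothetical names: group is an assumed two-level classification variable;
* Assumes the libref dat has already been assigned;
PROC NPAR1WAY DATA=dat.mydata WILCOXON;
   CLASS group;
   VAR y;
RUN;

* parametric counterpart, for comparison;
PROC TTEST DATA=dat.mydata;
   CLASS group;
   VAR y;
RUN;
```

Because the Wilcoxon test works on ranks, an extreme value of y can move no further than the top or bottom rank, whereas it enters the t-test through both the group mean and the variance. If the two procedures lead to different conclusions, that is a signal to look more closely at the extreme observations.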
Various forms of robust regression models and computer-intensive approaches also deserve more attention.

Transformations. Another source of "outliers" is the failure to transform data to improve symmetry and (where required) linearity or additivity. These are not "real" outliers in the manner described, but they are good indications that an analyst may have ignored basic data exploration. Transformations may soften the impact of outliers, since the most commonly used transformations (square roots and logarithms) shrink 'larger' values to a much greater extent than they shrink 'smaller' values. However, transformations may not fit into the theory of the model you are using, or they may affect its interpretation. Taking the log of a variable does more than make a distribution less skewed; it changes the relationship between the original variable and the other variables in your model. In addition, many transformations require non-negative or strictly positive data, so they are not always the answer.

Deletion. Only as a last resort should outliers be deleted, and then only if they are found to be errors that can't be corrected, or if they lie so far outside the range of the remainder of the data that they distort statistical inferences. When in doubt, you can report model results both with and without the outliers to see how much they change.

Data transformations and deletion are important tools but shouldn't be viewed as a cure-all for computational problems associated with outliers. Transformation and outlier elimination should each be an informed choice, not a routine task. Exploring why outliers exist may provide clues about how to develop better statistical models. In fact, many great discoveries in human history can be traced to a researcher exploring some outlying or unusual observed value. Outliers may indicate that an important range of the data, one worth knowing about, has been ignored.
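Reporting results with and without the outliers can be sketched as follows, using a simple linear regression as the model. This is an illustration under assumptions rather than a prescribed procedure: the predictor `x` is a hypothetical variable assumed to exist in `dat.mydata` alongside y, the libref `dat` is assumed to be assigned already, and "outlier" here means a value beyond the 3*iqr fences described earlier.

```sas
* Hypothetical names: dat.mydata is assumed to hold a response y and a predictor x;
PROC UNIVARIATE DATA=dat.mydata NOPRINT;
   VAR y;
   OUTPUT OUT=qdata Q1=q1 Q3=q3 QRANGE=iqr;
RUN;

DATA flagged;
   IF _N_ = 1 THEN SET qdata;   * q1, q3 and iqr are retained on every row;
   SET dat.mydata;
   * 1 if y lies on or beyond the outer fences, 0 otherwise;
   extreme = NOT MISSING(y) AND
             ((y <= q1 - 3*iqr) OR (y >= q3 + 3*iqr));
RUN;

PROC REG DATA=flagged;
   TITLE 'All observations';
   MODEL y = x;
RUN; QUIT;

PROC REG DATA=flagged;
   WHERE extreme = 0;
   TITLE 'Extreme outliers excluded';
   MODEL y = x;
RUN; QUIT;
```

Comparing the two sets of coefficients, standard errors, and p-values shows how much the conclusions depend on a handful of observations. Flagging rather than deleting keeps the full data set intact, so the excluded values remain available for review.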
Apply exploratory data analysis techniques to look for both univariate and multivariate outliers, and then evaluate how they affect the results with and without transformations, accommodation, and deletion. This will help you reach conclusions that are in line with your research objectives. A "common sense" approach is often the best solution.

A Few Additional Thoughts about Data Analysis

1. The path to good data analysis begins with your research hypotheses or questions.
2. The phrase 'normally distributed data' is usually most applicable to residuals from a linear model; what is usually most important is that you have a range of data that is not highly skewed or contaminated with outliers.
3. Regression, analysis of variance, t-tests, correlations, etc. are very sensitive to outliers or skewed data and do not fit every data analysis situation equally well; in fact, in many cases they are not appropriate and much more powerful techniques are available.
4. Advances in computing technology and statistics have given us many alternative procedures to analyze data. Find a model to fit your data, rather than forcing a given model onto your data.
5. Always consider the data type hierarchy structure before you begin a statistical analysis to determine which technique is most appropriate for your research.