DATA CODING Before you begin to collect data, clearly define and know what data you will need in order to evaluate your research objectives. One necessary aspect of utilizing a statistical program effectively is to have confidence the types and values of the data you anticipate will be available to evaluate your research hypotheses. Tips in Coding Data for Statistical Analyses Many data items are logically coded as numbers: interval or continuous measurements (e.g., test scores, temperature, age, income, height, weight, etc.). Numbers can also represent discrete categories of interest, whether they are nominal (such as names where the order of the values does not matter) or ordinal (the values have a natural ordering). Categorical data coded as numbers or letters can be associated with "formats" (SAS) or "value labels" (SPSS) which are assigned to them to represent the actual values. As you procede with designing data collection and analysis, a few helpful hints to consider when coding data include: 1. Type of measurement: data types you will most likely collect and then enter into a statistical program consist of numbers (classified as ordinal, interval, or continuous), character (nominal or ordinal), and dates. 2. Assign a unique ID to each subject. A subject ID is a relatively short name, unique to each individual or experimental unit. It consists of either all numeric or all character data (or a combination of both) of the same length (e.g., an id that begins with a letter and consists of 4 characters for all subjects). When it is of a character type it should be consistently in all upper or all lower case and should not contain more specialized digits such as hyphens, spaces, or other things found on the upper case entries of the top row of numbers on the keyboard. Character id values should always begin with a letter which can reflect group assignment (e.g., 'c' for control or 't' for treatment). For the remaining digits enter only letters, numbers, or the underscore. The ID value should not be derived from a name if a person is the experimental unit so that he/she will remain anonymous. In some cases the unique ID may consist of multiple ID variables (e.g., students within classes within schools within district). The ID will allow you to trace the values back to the source of your data, if necessary. For data analyses that work with multiple or repeated measurements collected from each respondent, it is crucial to enter the ID variable since it identifies the "cluster". 3. Spreadsheet layout: Visualize data as stored in a spreadsheet, such as EXCEL, where the rows contain all the data from each subject (or experimental unit). In most cases you will store data in multivariate format (e.g., name variables from the contents of a survey as q1, q2, q3, q4, etc). This is also true if you collect data over time on the same variable (repeated measures) or across different conditions on each subject. In this case, the variable name should indicate both the type and the time of measurement (e.g., name variables for weight collected at baseline, time1, time2, time3 as wt0, wt1, wt2, wt3, etc.). 4. Columns: The columns of the spreadsheet contain the category levels or measurements. Each column must be of one data type (i.e., values are all numeric, all character, all date, etc.) collected across subjects. 5. If you store data in two or more files, a unique combination of ID variables (as defined in step 2) must be assigned to each subject so that you can match records across files in order to merge data. 6. Nominal data (categorical data with distinct, non-ordered levels such as race or political affiliation) can be coded with numbers. Formats (defined with PROC FORMAT in SAS or VALUE LABELS in SPSS) are then associated to the actual numerical value of each variable. 7. Code responses from ordinal data (e.g., Likert scales) with consecutive integers, for example: 1='Strongly Disagree' 2='Disagree' 3='Neutral' 4='Agree' 5='Strongly Agree' Never code ordinal data with the actual response value (much too long) or with abbreviated characters such as SD, D, N, A, or SA. Why do you think this is recommended? Statistical programs naturally order categorical data in alphabetical or numerical order. 8. Code dichotomous or binary data (two levels) with numbers (0/1 is usually preferred such as 0=No/1=Yes; 0=Female/1=Male, 0=Failure/1=Success; etc.). You can then apply formats with PROC FORMAT in SAS or VALUE LABELS in SPSS to assign the specific level information to these codes. 9. Do not combine levels of two or more variables into one. For example, do not enter one variable with multiple attributes such as: 'Young Male', 'Young Female', 'Elderly Male', 'Elderly Female'. Rather, define two variables, one for the subject's age (Young/Elderly) and one for their gender (Male/Female). This also pertains to continuous type data such as weights or heights. For example, always store height as 75 inches rather than 6'3"; you can also enter it as 6.25 feet if the data conversion is easy. (Note that a situation in data analysis exists when coding of multiple values of discrete variables into one variable is necessary, but is the exception to the rule here. Also, when data ared coded into its simplest structure, it is quite simple to recode two or more variables into one, if necessary, through a numerical transformation or concatenation.) 10. A "Check all that apply" question on a survey must be entered as multiple variables for each item, usually with a numerical entry (i.e., 0=Not checked/1=Item checked). Note that a 0 for "Not checked" is different than if there is strong evidence the subject did not respond to the question, in which case it should be left blank. 11. If you have many variables (e.g. responses collected from a survey), if you need to make transformations or perform extensive recoding, or if you need to process data from multiple files, learning how to do all this with commands entered into a syntax window is crucial. Before you venture too far, it may prove very helpful to seek advice from a specialist in file management and data analysis. Nominal and Ordinal Data Many statistical procedures are designed for discrete data, often called categorical variables, i.e. they use data defined by levels classified as either nominal or ordinal. Nominal data provide names for the different levels only; there is no inherent ordering or implied ranking of the values. On the other hand, ordinal data have discrete categories (such as a Likert Scale) that imply a natural ordering of the levels. How you code these types of data is very important as it may affect how statistical procedures interpret the levels. Many statistical programs have a specific way to classify nominal or ordinal data with discrete levels, even if they are coded with numbers. For example, SAS has a CLASS statement with analysis of variance procedures (e.g., PROCs ANOVA, GLM, MIXED, LOGISTIC, and GENMOD). The following options to order the data levels appearing on the CLASS statement are important to know: ORDER=data - order data by the values as they naturally appear in the dataset ORDER=formatted - order data according to their external formatted value, except for numeric variables with no explicit format, which are sorted by their unformatted (internal) values ORDER=freq - descending frequency count; levels with the most observations come first in the order ORDER=internal - ordered by the unformatted, internal values (e.g., if formats are applied with a FORMAT statement within the procedure) These four options specify how the levels of classification variables (variables listed on the CLASS statement) are sorted and interpreted within the procedure. The ordering specific to each choice determines how factor levels are labeled in a table or how estimated parameters in a statistical model appear in the output. The ORDER= option most often appears as an option on the PROC statement. With PROC LOGISTIC, it is placed on the CLASS statement following a slash ( / ) as an option. Knowing how ORDER= works with your data is essential to interpret estimated model coefficients correctly from the Parameter Estimates table or output from CONTRAST and ESTIMATE statements. With linear models, it defines which level is designated as the reference category. If you don't know or aren't certain of the order in which the data appear in the file, sort the data by the design or explanatory variables before applying any statistical procedures. The default option for most procedures in SAS is ORDER=formatted, and its behavior has been modified for Version 8. Numeric variables for which you have not supplied an explicit format (that is, for which there is no corresponding FORMAT statement in the current procedure or in the DATA step that produced the dataset), the levels are ordered by their internal (numeric) values. In releases of SAS prior to Version 8, numerical class levels with no explicit format were ordered by their BEST12. formatted values. In order to revert to the previous method you can specify this format explicitly for the CLASS variables. The change was implemented because the former default behavior for ORDER=formatted often resulted in levels not being ordered numerically and required you to use an explicit format or ORDER=internal to get the more natural ordering. Be aware of how these options work in the various statistical procedures, especially if you have missing levels or combinations of levels. To enter data into SAS or SPSS, you can learn much more about transferring data from Excel at: http://www.uoregon.edu/~robinh/data_trnsf1.html And yes, you can export data from SAS to Excel which you can read at: http://www.uoregon.edu/~robinh/data_trnsf2.txt Read External Text Files Some situations may require you to read external text files which are typically of two types Fixed format: the data are placed in specified columns with a consistent format id name group x1 x2 y 001 john a 3.0 4.62 1 002 mary b 8.1 2.90 4 003 sue a 2.0 7.01 5 ^^^ ^^^^ ^ ^^^^ ^^^^^ ^ 1 1 2 2 <-- column numbers 1 5 0 5 0 5 id: character data in columns 1-3 (note the leading 0's are retained) name: character data in columns 5-8 group: character data in column 11 x1: numeric with one decimal value in columns 15-18 x2: numeric with two decimal values in columns 20-24 y: numeric integer values in column 26 free format: variables are delimited by commas, tabs, spaces, or some other character that never occurs in your data. Here is an example of the same file listed above in both comma and space delimited formats (note that the variables names placed in the first row are also separated by commas or spaces). free formatted file comma delimited file with spaces ------------------- --------------------- id,name,group,x1,x2,y id name group x1 x2 y 001,john,a,3,4.62,1 001 john a 3 4.62 1 002,mary,b,8.1,12.9,4 002 mary b 8.1 12.9 4 003,Sue,a,2,7.01,5 003 Sue a 2.0 7.01 5 You can find more details about how SAS reads external text files at: http://www.uoregon.edu/~robinh/042input.txt Quality Control Editing 1. If your data set is relatively small, print it to check each value for errors either on the screen or on paper. 2. For large datasets, print the first few rows, i.e., with SAS use PROC PRINT DATA=prjdata(obs=10); RUN; to ensure the data have been read correctly by the program. An approach to look at the contents of large datasets is to print a purely random number of selected records across the entire range of observations. In PROC PRINT you may enter a WHERE condition that will print approximately 10% of all rows PROC PRINT DATA=sashelp.class( WHERE=(RanUni(982237) LT .10) ) ; RUN; 3. Run a frequency procedure (with SAS use PROC FREQ) to obtain counts for nominal or ordinal data and to look for missing, incorrect, or miscoded values. 4. IF statements identify variables outside the logical range; they can then be set them to missing or identify them as needing editing, e.g., IF (q1 LT 1 OR q1 GT 5) THEN q1 = .; 5. For continuous or interval level variables examine their distributions for outliers with boxplots and compare these distributions across any categorical variable(s) that represent distinct groups. PROC UNIVARIATE with a BY statement works very well for this step. 6. For interval or continuous data, produce a two-dimensional scatterplot (with SAS use PROC PLOT). To assess the quality and quantity of information, sources of your data should be well-defined. For example, demographic data represent what is known about each subject from existing records, observations, or personal interviews. Data may be collect from a survey or designed experiment or laboratory result for each subject. Some of the data will come from relatively accurate and calibrated measurement systems whereas other data may have relatively large measurement errors indicating the potential use of measurement error models. The major purpose in summarizing contents and form and reporting on its quality is not to mislead yourself or someone who might inherit your data what data are assumed to exist versus what data are actually available. If data are collected with multiple measurements at each test condition, compute summary statistics, which include the mean, median, maximum, minimum, number of tests, and the standard deviation. However, the actual measurements should be stored in the database rather than summary statistics - a program can easily compute them as needed. The median value should also be included as a summary statistic to offer some clue to the effect of a skewed distribution or outliers in the computation of the mean and standard deviation. These two statistics when presented alone may give the false impression that multiple measurements from one test condition are normally distributed when in fact this may not be true. Data from skewed distributions and any "outliers" may adversely affect these two commonly used summary measures. Computing the median will provide a more robust summary statistic for data analyses purposes or a suggestion of a possible transformation if parametric analysis is to be run.