4.1 The DATA and SET Statements: Make a new or modify an existing dataset The DATA step reads or modifies data. PROC IMPORT is one of several ways to read data into a SAS dataset. If the IMPORT procedure works for you, you do not need to read data with a DATA step unless you need to make further calculations of variables. You can immediately begin to apply whatever procedures you want with the data. With PC SAS Windows, you may browse the data using SAS explorer. Just double click on the table icon within the appropriate library. Or you should run PROC CONTENTS and PROC PRINT to determine the contents of the dataset. DATA: Keyword that begins the DATA step The DATA statement tells SAS to begin a new DATA step to read data from an external source into a new dataset or to modify one that already exists. The keyword DATA is followed by one or more names of SAS datasets, either temporary or permanent, (the DATA step may also be instructed to write no dataset by entering _null_ which will be shown to have several applications in subsequent chapters). Syntax: DATA ; are one or more names you assign to the new SAS data set and in the examples will always appear in lower-case letters (though they would also work the same in capital letters). The dataset name can be a two-level name (see LIBNAME examples in Chapter 3) or it can also be a one-level name (for example, a temporary data set called data that will disappear when the session ends): DATA mydat; The following DATA statement creates two temporary SAS data sets called mydat1 and mydat2: DATA mydat1 mydat2; * provide the names for one or more SAS data sets * multiple data sets can be created in on DATA step by assigning separate names (need to use OUTPUT ;) When writing two or more data sets an OUTPUT statement (Chapter 4.7) must be used to write the appropriate data to each data set. You'll also want to use DROP or KEEP statements (Chapter 4.4) in this process, since you'll usually send subsets of the variables to each data set. [Chapter 3 describes how to make the SAS datasets permanent (i.e., saved to the disk).] LIBNAME libref 'c:\project\data'; DATA libref.mydat1 libref.mydat2; This DATA statement is associated with the LIBNAME statement that must appear previous to it in the program. The keyword that links the two together is the user-supplied "libref". This DATA step will create two permanent SAS data sets that will be named mydat1.sas7bdat and mydat2.sas7bdat and stored in the directory c:\project\data. The null dataset is invoked with DATA _null_ ; which allows you to process existing data without saving anything to a new dataset. A few examples include summary calculations written to the log file, write an existing SAS dataset to a text file, or add values to macro variables (Chapter 9). Many keywords (SAS statements) and programming tasks can be included in the DATA step. The DATA step ends with the next step boundary which is usually the appearance of a RUN; or a QUIT; statement, the next DATA statement, or the next PROC statement. The DATA step creates SAS data sets from text files or modifies existing SAS data sets. It is able to perform many data management tasks. It can also output a SAS data set to an external data file (such as a comma delimited TEXT file, *.csv) or to convert SAS variables to macro variables (see Chapter 9). SET: Read values from an existing SAS dataset The SET statement points to an existing SAS data set and reads it into a new one with the name declared on the DATA statement. It usually is placed directly after the DATA statement, but may occasionally be used in other locations in the DATA step. Other statements then follow it to modify or create new variables or perhaps build an entirely new data structure. Syntax: SET ; DATA two; SET one; < other processing statements >; RUN; Among the options, several most commonly needed are: NOBS=count -> compute a variable called count that gives the number of observations in the dataset; for example, to make a macro variable (see Chapter 9) with the number of observations in the data set one approach is: DATA _null_; CALL SYMPUT("nobs",LEFT(PUT(count,8.))); * in case dataset empty, place CALL .. before SET statement; SET one NOBS=count; STOP; * no need to process entire dataset; RUN; OBS=xxx -> stop processing the data set after the first xxx observations have been read END=eof -> allows you to use an IF statement or some other logical statement to test when you have reached the last record in the input data file; e.g. IF eof THEM DO; OUTPUT total; END; POINT=i -> read record number i from the specified dataset (a variable named i is not added to the dataset); for example, to interleave records from one data set with another: DATA c; SET two; BY var_a; OUTPUT; IF last.var_a THEN DO i=1 to &nobs; SET one POINT=i; OUTPUT; END; RUN; I'll digress for a moment to show how dataset options can be inserted almost anywhere a SAS dataset name appears. They specify actions that apply only to the dataset to which they are attached. RENAME, KEEP, and DROP allow you more control over variables entered into an application: * RENAME=() renaming variables * KEEP= (or DROP= ) variables when processing the output data set The following give you more control over observations: FIRSTOBS=i Causes processing to begin with a specified record or case number IN=i Makes a temporary variable that indicates whether the dataset contributed data to the current observation OBS=n Causes processing to end with the nth observation WHERE=? Only selects observations that meets the specified condition ? These examples show a few ways options can be directly applied within SAS statements to the dataset. * Keep the indicated variables in scores; * Rename the variable date to game_date; * Note the rename option does not change the name of date in old_scores; DATA scores(KEEP=game_date year team game1 game2 game3 game4 game5); SET old_scores(RENAME=(date=game_date)); ^^^^ ^^^^^^^^^ old = new_name * print scores from the year 2002, only; PROC PRINT DATA=scores(WHERE=(year=2002)); RUN; * read the first 10 records from old_scores; DATA scores; SET old_scores(OBS=10); run; Direct or Random Access If you want to read specified record numbers, there is no need to cycle through the entire dataset. For example, if only the first and last observations are required use OBS= and POINT= in that only the two required observations are read, e.g., DATA a; DO row_number=1 to 222; OUTPUT; END; DATA b; SET a (OBS=1); OUTPUT; SET a NOBS=nb POINT=nb; OUTPUT; STOP; * be sure to add STOP to avoid an infinite loop ; The nobs=nb option places the number of rows in ‘nb’; the point=nb option tells SAS to read the final row in the dataset. PROC PRINT DATA=b; RUN; row_ Obs number 1 1 2 222 A DATA step that reads observations from a SAS dataset with a SET statement which includes the POINT= option may have no way to detect when you want it to stop reading the input SAS data set. A DATA step used with this option usually requires a STOP; statement or a conditional statement to invoke the STOP; statement. It should be placed within the data processing statements to avoid an infinite or stuck loop. MERGE, UPDATE, and MODIFY are SAS keywords for the DATA step with similar functions as the SET statement. They will be described in Chapter 6. The SET Statement: Computed vs. Read Data In Chapter 2 it was mentioned all computed variables not included in a RETAIN statement are set to missing at the top of the iterative DATA step loop. Note that variables read with the SET statement are automatically retained and NOT set to missing. At the beginning of the next iteration of the DATA step you can assign the value of the previous value of a to b before the next observation is read with the SET statement, thus acting much like the LAG1 function. The next example shows how the computed variable c is treated as missing compared with the variable a which comes from the existing dataset and its value is retained: DATA test; DO a = 1 TO 4; OUTPUT; END; DATA test2; b = a ; PUT '1 ' a= b= c=; SET test ; c = a+b; PUT '2 ' a= b= c=; run; 1a=. b=. c=. 2a=1 b=. c=. 1a=1 b=1 c=. 2a=2 b=1 c=3 1a=2 b=2 c=. 2a=3 b=2 c=5 1a=3 b=3 c=. 2a=4 b=3 c=7 1a=4 b=4 c=.