4.4: Declarative Statements in the DATA Step Statements in the DATA step are generally defined to be of two types: * Declarative (the order they appear in the DATA step usually does not affect data processing) * Executable (the order they appear in the program does matter) Declarative statements such as ARRAY, INFORMAT, FORMAT, KEEP, DROP, LABEL, LENGTH, and RETAIN define the working environment of the DATA step. Although they can be placed anywhere in the block of statements, readability is improved when they are placed near the top (following the DATA and SET statements and before the INPUT statement). However, note that in some situations the logical order of these statements does matter as will be indicated with a few examples. When in doubt, always place statements with some idea of a logical order of how they function in the DATA step. The statements to be examined here include: ARRAY ATTRIB DROP or KEEP INFORMAT and FORMAT LABEL LENGTH RETAIN CARDS or DATALINES ARRAY See Chapter 4, Section 5: SAS arrays ATTRIB Variable attributes such as: label, length, format, informat, etc. may all be defined in one statement rather than several: DATA _new; ATTRIB var1 length=8 format=foo. informat=foo. label="foo" var2 length=$8 format=bar. informat=bar. label="bar" ; run; When writing a new dataset attributes of SAS variables are determined by the first statement or dataset it encounters, unless those attributes are already defined to the new DATA set before a SET or MERGE statement is encountered. For example, when merging files that have character variables of the same name with different lengths define the length of the variable BEFORE the MERGE statement: DATA file_ab; ATTRIB name length=$15; *Must come BEFORE merge statement; MERGE a b; BY name; RUN; This DATA step writes an entry in the dictionary of dataset file_ab BEFORE reading the merge statement, (thus getting its definition from the first dataset). When length of the variable is the only attribute of interest, the LENGTH statement has the same function. DROP or KEEP: Specify which variables are omitted or saved in the DATA step By default, SAS keeps all variables specified in the INPUT statement, variables created in calculations, DO loop indices, etc. (in other words, any variable read from a file or any variable name to the left of an = sign). It also creates variables that are specified in statements such as FORMAT, RETAIN, LENGTH, even if no use of them is made in the program. If all the variables in a data set will not be needed in subsequent DATA or PROC steps, you have the choice to pass along only the data needed later. The ways to do this is with either a DROP or KEEP statement (only one of them is needed). KEEP and DROP statements in DATA steps are not conditionally executed. Their presence alone will have an effect. The same applies for RETAIN and LENGTH statements. General Guidelines: 1. If you have many more variables to DROP than KEEP, use the KEEP statement. 2. If you have many more variables to KEEP than DROP, use the DROP statement. For example: if the data set contains 11 variables id a1 a2 a3 a5 a9 a11 a13 i j k and you will not use i j k in a later PROC or DATA step, then use DROP i j k; * drop the variables i, j, and k and keep the remainder; IF you want to include only 4 of the 11 variables (e.g., id a1 a3 k) in later steps, then use KEEP id a1 a3 k; * do not save the other variables; Consider the following situation. You have a number of variables that all begin with the same prefix and you wish to drop or keep all of them. One example where this occurs is a data file created by PROC TRANSPOSE. You may use the PREFIX option to create variables with common prefixes for convenience (see Chapter 6), then carry out some additional processing in a subsequent DATA step. For the new data set, the variables created from TRANSPOSE may be of no further interest. To do this, use the drop or keep statement as follows:: DROP vname: ; KEEP vname: ; In this situation the colon : is a wildcard that causes any variable name starting with the prefix (in this case "vname") to be dropped or kept. The following examples demonstrate this feature. DATA one; vname_cars=1; vname_trucks =1; vname_vans=1; vname_suvs=1; a = 1; run; DATA two; SET one; KEEP vname: ; RUN; The KEEP statement will allow all variables that begin with vname to be sent to the next step; the variable a will be dropped. If you drop or keep variables that are not present in the DATA step and do not want to see a warning in your log file, issue the following choice: OPTIONS DKROCOND=NoWarning; This choice comes in handy when you create temporary variables that begin with an underscore (e.g., _i) and do not want to carry them on to the next DATA step. The statement DROP _: ; will delete them from the data set. Four advantages of using a KEEP or DROP statement include: 1. Reduces disk space storage requirements 2. Reduces work space 3. Self documenting (i.e. what does this data step produce?) 4. Removes extra variables that no longer have meaning If the data set has only a few variables, a KEEP statement (rather than DROP) is often a good way to document what variables are important at each step and are to be passed along to the next DATA or PROC step. [If you add variables in the DATA step at some later, time, be sure to add them to the KEEP statement. The following examples show how this works. First, create a sample data set with one record: DATA file0; id='a'; var1=11; var2=21; var3=31; OUTPUT; RUN; Example 1: KEEP performed on the outgoing data set when written KEEP or DROP statements affect the data set created on the DATA statement after processing when the record is output. The following code KEEPs var1 and var2 in the output data set called file1. DATA file1 SET file0; * all variables available here to work with ; KEEP id var1 var2; * as a DATA step statement the KEEP= is not needed ; RUN; Example 2: KEEP performed on the outgoing data set(s) when written Use (KEEP= ) following any data set name whether it is created in the DATA step or a data set created in a procedure. It is necessary to have KEEP= enclosed within () with the specific variables listed. Use it whenever you want to save different variables to separate data sets. The syntax works just like example 1 - only the placement of the KEEP statement is now connected with the data set itself. The following DATA step saves the three variables and id to file1. It also saves all the variables to ‘var’ in file2, adds an ‘order’ variable, and then drops var1-var3. DATA file1(KEEP=id var1 var2 var3) file2(KEEP=id order var); SET file0; * all variables available for processing; ARRAY vr{3} var1 var2 var3; OUTPUT file1; DO order=1 to 3; Var = vr{order}; OUTPUT file2; END; RUN; In any SAS procedure KEEP= can likewise be used to save onlyl the variables of interest as shown for PROC MEANS. PROC MEANS DATA=file1 NOprint; VAR var1; OUTPUT OUT=stat(KEEP=mean) MEAN=mean; Note: (DROP= ) works in the same manner as the DROP statement. Example 3: KEEP performed on the incoming data set when read This next example keeps var1 and var2 in the input file and drops var3 at the very beginning, before any processing is done, so var3 is not available in the DATA step. DATA file1; SET file0(KEEP=id var1 var2); * var3 is not available for processing ; RUN; Example 4: KEEP using a combination of the above methods. If you have a large incoming data set you can increase performance by using a KEEP statement on the incoming data set (to reduce the amount of data coming in) and then also use a KEEP statement or KEEP= in the DATA step on the outgoing data set to control how many variables you want to save. DATA file1(KEEP=id var1 var2) file2(KEEP=id var3); SET file0(KEEP=id var1 var2); OUTPUT file1; var3=var2**2; OUTPUT file2; RUN; INFORMAT: read a variable on the INPUT statement with the indicated format Informats specify how to read data from a text file. Since it generally appears along with the INPUT statement, it was described in the previous section covering data input, 042input.txt. FORMAT - assign a SAS format or one of your own to numerical data, dates, or levels of categorical data An important feature of SAS datasets is that in addition to storing data, they can hold other information, namely formats and labels. Formats display data in a various ways; whether they are numbers, dates, or character data, the output will always appear as text printed to a file. Data formats (numerical, date, user defined) can either assigned directly to the variables in the DATA step, or designated within a block of PROC statements with a FORMAT statement. The manner it works is to list one or more variables with the same format followed by the format designation. It can be a SAS format, or one you have defined with PROC FORMAT appearing prior to its first use (see Chapter 5). Formats are easy to distinguish from variable names since they must have a period within for numerical formats or at the end for date or user defined. If all the variable names have the same root (e.g. v1-v99) then you can just put the range into the format statement: FORMAT v1-v99 7.3 ; Sometimes you do not know the last variable index, or you may have many variables that begin with the letter v and have the same format. In that case it is advantageous to use the colon modifier: FORMAT v: 7.3 ; The colon modifier refers to all the variables that start with the letter v. This FORMAT is not necessarily the same as using v1-v99, but it is very often a workable shortcut or a convenient way to name numerical variables of the same format with the same letter, but not forced to use the v1-v99 notation. LABEL - Supply Variable Labels Rather than relying on the usually small number of characters of a variable name, a LABEL statement allows you to document a variable with up to 40 characters of descriptive text. Often, this label will be printed on output rather than the variable name. In some applications, both the variable name and its label will appear together. LABEL wgt = "Weight on 9/28/02" lgth = "Length (inches)"; When formats and labels are assigned in a DATA step, they are applied to all procedures appearing on the printed output. The FORMAT statement can also be applied in specific PROC steps if it was not assigned in a DATA step, or if for some reason you wish to modify the existing format. To eliminate all existing labels from a subsequent DATA or PROC step, use the following OPTIONS statement: OPTIONS NOlabel; LENGTH - Reduce the amount of storage space the dataset requires and also accurately define character data By default, SAS assigns all numerical data, from single digit integers to ratio data with many decimal values, to be stored with 8 bytes. To avoid inaccuracy, this size is necessary for real numbers (i.e., those with decimals produced from calculations of numbers by division or multiplication, trigonometric or logarithmic transformations, numerical constants from functions such as EXP() or CONSTANT ('pi'), etc.). However, for data that take on integer values only, especially one- or two-digit values frequently collected with survey data or as numerical codes for discrete data, memory storage requirements for SAS datasets can be greatly reduced. The integers 1 through 8192 can be represented with 3 bytes rather than 8 thus possibly offering a substantial reduction in the size of the SAS dataset. The LENGTH statement tells SAS how many bytes to allocate to the specified variables. All variables to the left of the number are assigned that length. Multiple sets of variables of different lengths can be specified as shown below: LENGTH a1 a2 a3 3 age year 4 length 6; For example, a dataset containing 10,000 observations and four variables taking on integer values from 1 through 10 was generated. A permanent SAS data set written on a UNIX system with all four variables set to the default value of 8 bytes required 335872 bytes to store the file. A permanent SAS dataset from the same program with all four variables set to 3 bytes took only 131072 bytes. Thus, the size of the file required to store the data is only about 1/3 the size of the default size data set. With numerical data the LENGTH statement is applied only to integers (unless the default option appears). Significant round-off error may occur if this statement is applied to data that contain legitimate decimal places. When character data are included on a LENGTH statement the number should specify the maximum number of characters across all observations for that particular variable. A default option is available to automatically change the length of all numerical data to that number. Other numerical data can be explicitly assigned larger or smaller values as needed. LENGTH i j k 3 area 8 default = 4; * changes the default numerical length which is 8; The default length option applies only to numbers and not to character data. If you assign character values in the DATA step to a new variable, SAS gives it a length of the number of characters found with the first value encountered. This means that in a DATA step if you assign values to a new variable with the following data 'LUNG', 'PLEURA, 'REGIONAL', etc., SAS will assign values of 'LUNG', 'PLEU', 'REGI', etc. by default. With new character data, for example those computed with CAT functions (see chapter 7, section 2), the default length will be 200 - it is highly likely you should change it. If you read character data from a text file with an INPUT statement that does not specify a variable length and does not use INFORMAT or FORMAT statements, SAS automatically assigns the length to be 8 characters. INPUT word $; This potential problem can be avoided with the LENGTH statement placed before the INPUT statement or character functions (e.g., CAT) to specify the maximum number of digits expected in character data. DATA one; LENGTH word $12; INPUT word $; To ensure that you get the proper length for character variables when appending two or more datasets (see Chapter 6, Section 1), specify the LENGTH as the first statement after the DATA statement, e.g., DATA new; LENGTH var1 $45; SET old1 old2 old3; RUN; Otherwise the assigned length for variable var1 in the dataset 'new' will be as defined from the first dataset listed on the SET statement. RETAIN - Retain values of specified variables from previous observation Reasons for entering this statement in a DATA step include: 1. To keep SAS from reinitializing specified variables to missing at the top of each iteration of a DATA step. 2. To assign certain variables specific values. 3. To reorder variables in the PDV. At the beginning of each loop of the DATA step, values of all computed variables which are automatically assigned as missing. With SAS, retain means the system will not change values of declared variables to missing as it does automatically at the beginning of each iteration of the DATA step loop. The RETAIN statement 'retains' the values assigned to specific variables at the bottom of the loop from the previous observation. The following RETAIN statement specifies the values of x and y are to be kept from one iteration of the DATA step loop to the next. RETAIN x y ; The RETAIN statement should only be applied to variables that appear in assignment statements, such as numerical calculations. Entering variables on a RETAIN statement which are read from a SAS dataset is redundant since input dataset variables are one of the classes of variables which are _automatically_ RETAINed. What this implies is that such variables are not "initialized to missing" at the beginning of a new iteration of the DATA step but, rather, RETAIN their previous values from the last iteration until they are over-written by new values (including missing data codes) read in from the input dataset. Note that variables read with INPUT need to be entered on a RETAIN statement if there is any reason to check their value at the top of the loop before reading the next record: DATA test ; b_pr=b; * without RETAIN b_pr contains all missing values; RETAIN b; INPUT a b; CARDS; 1 1 2 4 3 9 ; proc print NOobs; VAR a b b_pr; run ; a b b_pr 1 1 . 2 4 1 3 9 4 An example of how it works can be seen in assignment statements which precede the SET or MERGE statements to produce a lagged variable or calculation with a previous value of some variable. For example, reading data from an existing SAS dataset with SET does not need a RETAIN statement: DATA new; b_prev = b ; SET test ; run ; proc print data =new; VAR a b b_prev ; run ; a b b_prev 1 1 . 2 4 1 3 9 4 RETAIN also allows you to set initial values for constants, variables that count the number of records, or summation variables that increment by the value of some other variable for each observation. If not otherwise indicated, initial values are automatically set equal to 0 for the first observation of the DATA step. However, if you want to set other initial values such as xplt=-5, y1=99, and y2=99, enter the statement: RETAIN xplt 5 y1 y2 99; The two values listed here are assigned to all SAS variable names to the left of the given number. In the DATA step a variable name listed on an INPUT statement or variables read from a SAS dataset with the SET statement should not be entered on a RETAIN statement. Either situation does not make sense since external values are to be read for each observation. The next section will elaborate on using the RETAIN statement with incrementing values or summation across observations. If you list specific variable names, the position of the RETAIN statement in the DATA step is arbitrary and it serves as a declarative statement. However, if you specify _all_, _char_, or _numeric_, only the variables that have been defined before the specific placement of RETAIN statement are affected. This gives the important piece of information that words like _all_ can mean different lists of variables in different contexts. The placement of the _all_ option (i.e. keep the values of all variables across observations) affects the position of RETAIN as demonstrated with this program: DATA test; obs=1; OUTPUT; obs=2; OUTPUT; obs=3; OUTPUT; Program 1 Program 2 Program 3 DATA try1 ; DATA try2 ; DATA try3 ; SET test ; SET test; SET test; IF _n_=1 then x = 10; RETAIN _all_ ; RETAIN x; RETAIN _all_ ; IF _n_=1 then x = 10; IF _n_=1 then x = 10; RUN; RUN ; RUN ; obs x obs x obs x 1 10 1 10 1 10 2 10 2 . 2 10 3 10 3 . 3 10 The relative positioning of the RETAIN and IF statements in these three examples results in the different behavior of the DATA steps, even when no other statements appear before the SET statement. The different behavior only exists because of entering _all_ on the RETAIN statement; if one lists variables to be RETAINed explicitly, then it doesn't matter where the statement is placed. Another situation where the order of declarative statements does matter is when you want to rearrange the order of how variables appear in a dataset and thus the order they would be printed with PROC PRINT or written to an excel file with PROC EXPORT. In this situation, place the RETAIN statement before the SET statement: DATA neworder; RETAIN var3 var1 var2; SET original; RUN; If you have a sequence of variable names that end with a number (e.g., ab1-ab5), you can enter the name prefix followed by a colon (i.e., ab:) to specify that particular range of variables. Note that these variables will be placed following the variable names that don't have the : in the order listed on the RETAIN statement, and they will retain their original ordering observed in the first dataset. DATA a; ab2= 3; ab3= 5; ab1= 6; b= 6; c1= 7; c2= 7; output; ab2=13; ab3=25; ab1=46; b=16; c1=47; c2=47; output; ab2=23; ab3=45; ab1=86; b=26; c1=67; c2=67; output; PROC PRINT NOobs; RUN; ab2 ab3 ab1 b c1 c2 3 5 6 6 7 7 13 25 46 16 47 47 23 45 86 26 67 67 DATA b; RETAIN c: ab: b ; SET a; proc print NOobs; run; b ab2 ab3 ab1 c1 c2 6 3 5 6 7 7 16 13 25 46 47 47 26 23 45 86 67 67 Accumulating Sums Assume you have a variable called cost and you want to accumulate total cost across all observations of a data set. Since all observations are automatically set to missing at the top of every loop of a DATA step the following equation will not work: 1) total=total+cost; However, an alternative statement also produces the desired result: 2) total+cost; The reason the first statement does not work and the second one does work is that SAS interprets the following two statements 1) and 2) in very different ways: In 1) the "+" is simply the addition operator of two variables. There is no implicit initializing (of total to the right of the "=" and to the left of the "+") or retaining (and therefore no accumulation). In other words, a statement such as x=y+z in a DATA step yields only row-specific values. If either y or z is missing, then x will be missing also. Using statement 1) when the DATA step first begins, the first record creates the variable total, but since it is automatically assigned the missing data value, there isn't a number in it and each row of the table (observation) has it's own value. The DATA step is like reading a spreadsheet, one line at a time, across the row from left to right. You can use 1); however you also need to use a RETAIN statement along with it: RETAIN total 0; total=total+cost; The equation approach will give you the desired total, since it tells SAS to retain the value of total after outputting the record to the next record and give it an initial value of 0. To calculate the total sum of some variable across all observations, SAS provides a special type of statement that can stand alone. With statement 2) SAS interprets the "+" as part of the definition of a SUM statement. The SUM statement in 2) does several things implicitly, including: * initializes the variable on the left of the "+" (total) with a value of 0. * RETAINs (i.e., does not set to missing) the value of this variable from one data step iteration to the next. * accumulates the row-by-row values of cost into the accumulator variable called total. CARDS or DATALINES - Insert data directly into the program file These two statements are interchangeable. They always appear at the end of a DATA step to indicate where data begin when reading data from the command file, and thus are considered declarative. With small data sets it is often less confusing and aids in reducing the number of files if data are entered directly into the SAS program file. For this purpose insert the SAS statements to read and process data in a DATA step and then as the final statement use the CARDS; or the DATALINES; statement. At this point your data can be entered directly into the SAS program file. If you enter your data into a SAS program and your data file is 'delimited' by commas or tabs, you need to use an INFILE statement: DATA new; INFILE cards dlm=',' dsd; * data stored in comma delimited format; INPUT x y z; CARDS; 4,3,5 1,2,3 5,7,6 ; Quotes around the word cards in the INFILE statement are not needed. For future reference, note that the CARDS; statement cannot be used inside a SAS macro (see Chapter 9). WHERE The WHERE statement is actually available for both SAS procedures and the DATA step. It is often applied much like the declarative IF statement in the DATA step (see section 4.8) to limit data processing to specified values of the indicated variables or values to be excluded from a list. For example, to find out for whom outliers (e.g., exceptionally large heights for men) occur, PROC PRINT DATA=sashelp.class; WHERE (sex EQ 'M' AND height GE 70 ); Run; It can also exclude outliers or include records with desired characteristics from specific subjects in procedures, such as a two-sample t-test with PROC MIXED: PROC MIXED DATA=sashelp.class; WHERE (12 LE age LE 15); CLASS sex; MODEL height = sex / solution; Run; The LIKE operator in the WHERE statement has many possibilties. "%" stands for all characters; "_" for a single one proc print DATA =sashelp.class; where name like "A%"; * all names that begin with A ; run; Obs Name Sex Age Height Weight 1 Alfred M 14 69.0 112.5 2 Alice F 13 56.5 84.0 Also, "J%a%" means: the first characater is "a" and the rest contains a "a" anywhere. proc print DATA =sashelp.class; where name like "J%n%"; * all names beginning with J and have an n anywhere; run; Obs Name Sex Age Height Weight 7 Jane F 12 59.8 84.5 8 Janet F 15 62.5 112.5 10 John M 12 59.0 99.5