Section 6.7 PROCESSING A SORTED SAS DATASET The FIRST and LAST of DATA Step Processing In SAS, DATA and PROC steps consist of series of statements which often begin with a keyword and always end with a semi-colon. The particular statement which begins with the keyword BY followed by a list of categorical or integer variable names has several important functions including: * sorting a dataset with variables it contains through PROC SORT * processing data by subgroups in statistical procedures * within the DATA step it allows records to be merged from two or more datasets linked together by common ID variables. The BY statement is fundamental to many data processing tasks. You can learn about how it works at: http://support.sas.com/onlinedoc/913/getDoc/en/lrcon.hlp/a001283274.htm The BY statement entered into a DATA step has another very important application which is not as well known or understood. It gives SAS the ability to process data included in one dataset where the records are organized by combinations of levels of classification variables. Suppose an existing SAS dataset contains one or more variables coded as character or integers. The dataset can be organized into blocks of records or subgroups defined by the unique combinations of these categorical or numerical values. Within these subgroups the BY statement provides the capability to process all records of the dataset with only the DATA step. For an illustrative example of this application of the BY statement, consider a dataset called survey which contains the following variables to identify groups of records: Gender (identified in the dataset as gndr and coded with m=male and f=female) and age (coded as integers 1, 2, and 3 which represent three distinct age groups). Score was computed from the summation of several Likert scale responses (no missing data) given by each person. DATA survey; LABEL gndr='Gender'; INPUT id $ gndr $ age score; CARDS; b f 1 30 g m 1 27 l m 2 39 c f 2 23 h f 3 37 e m 1 49 n m 2 25 o m 3 38 a f 1 28 k f 3 32 d m 1 41 ; The next step shown below sorts the dataset (with PROC SORT) by gender and age category (in ascending order). The observation’s identification variable (id) is also included on the BY statement to order the dataset by id within gender and age, even though id will not enter into any subsequent data processing illustrated here. PROC SORT DATA=survey; BY gndr age id; RUN; A new DATA step now follows PROC SORT which writes a new dataset called sum_var. Now enter a BY statement following the SET statement (which reads the sorted dataset called survey) to indicate the records in the dataset have been sorted by gndr and age. Since the objective is to process the data across all records identified by levels of gender or across all records for both gender and age, the ID variable does not need to be included on this BY statement. DATA sum_var; SET survey; BY gndr age; RUN; These three statements give SAS the capability to identify the first and last record within each subgroup of records defined by the variables on the BY statement. This feature of the DATA step has many helpful data processing applications; however, only one will be described here to illustrate how it works. In order to process a sorted dataset within the DATA step, SAS automatically assigns two internal variables for each variable listed on the BY statement which are dummy coded as 0/1 to represent a false/true condition. For gender, the internal variable is called FIRST.gndr such that it is equal to 1 to identify the first record of each subgroup defined by the sorted values of gender; it is assigned 0 otherwise. A second internal variable called LAST.gndr is equal to 1 for the last record of this value of gender in each subgroup; it is assigned 0 otherwise. Analogous FIRST And LAST variables are computed internally for the three levels of age. These variables are not included in the new dataset; they only exist to help SAS identify the first and last records of the subgroups into which the records have been sorted. Recall the dataset called survey was sorted by gndr, age, and id. Listed below you will find the sorted dataset of values for id, gndr, and age variables followed by the respective “internal” values of FIRST. and LAST. and a score variable with which SAS will calculate some summary statistics. id gndr FIRST.gndr LAST.gndr age FIRST.age LAST.age score a f 1 0 1 1 0 28 b f 0 0 1 0 1 30 c f 0 0 2 1 1 23 h f 0 0 3 1 0 37 k f 0 1 3 0 1 32 d m 1 0 1 1 0 41 e m 0 0 1 0 0 49 g m 0 0 1 0 1 27 l m 0 0 2 1 0 39 n m 0 0 2 0 1 25 o m 0 1 3 1 1 38 Because only two values of gender exist and that it was listed first on the BY statement, the columns for FIRST.gndr and LAST.gndr each contain only two 1’s. However, because age is sorted within each level of gender, its associated FIRST and LAST variables change much more often as indicated by the larger number of 1's in the respective columns. If there is only 1 record for a group defined by age, both FIRST and LAST values equal 1 (as is the case for gender=m and age=3). The dataset survey can now be considered to contain both the actual data values and the internal values SAS assigns. You can now process the records within groups of variables listed on the BY statement by entering IF statements in the DATA step. One simple example which demonstrates how data processing works is to accumulate counts and sums within the DATA step for all persons across values of age for each level of gender: DATA sum_var; SET survey; BY gndr; DROP id score age; RETAIN gndr_sum gndr_count; IF (FIRST.gndr EQ 1) THEN DO; gndr_count=0; gndr_sum =0; END; gndr_count = gndr_count + 1; gndr_sum = gndr_sum + score; IF (LAST.gndr EQ 1) THEN DO; gndr_mean = gndr_sum/gndr_count; OUTPUT; END; RUN; The purpose of these two IF statements should not be confused with a statement such as “IF gndr EQ 1 then ..” (which is incorrect since gender itself is character data that can only equal ‘f’ or ‘m’). The actual values of gndr are not referred to; the value of 1 entered above refers only to the answer to the truth of the question asked for each record: “Is this record the first (or last) of a group of sorted records sorted by gender?” SAS accumulates summary statistics by groups. When the first record in any subgroup is encountered (i.e., FIRST.gndr = 1), the summary statistics must be set to 0. Also, any new variables computed within the DATA step should be included in a RETAIN statement since all computed variables are automatically set to missing whenever the SET statement reads a new record. Records should only be output to the new dataset called sum_var when LAST.gndr equals 1, so the inclusion of an OUTPUT statement is necessary as the statement invoked by the “IF LAST.gndr”. Sincvalues of score, id, and age are no longer needed, these variables are entered on the DROP statement. PROC PRINT DATA=sum_var NOobs; RUN; gender gndr_sum gndr_count gndr_mean f 400 5 30.0 m 519 6 36.5 PROC MEANS would be a more natural and efficient procedure to make the computations of this particular example and to collect other summary statistics. However, these statements demonstrate the versatility of processing data tasks you can perform within a DATA step itself which includes a BY statement sorted by the variables listed on it. It essentially allows you more control and ability to delete/modify records of your choosing. A few situations where this approach has proved most helpful are: * Delete records with duplicate variable ids from a dataset * Transpose a dataset and make computations simultaneously * Assign values from lagged variables within subgroups The short example presented here is one of many illustrations of why the DATA step is a very powerful component of the SAS System. Along with SAS arrays, FIRST and LAST data processing can be among the most difficult concepts to grasp. However, when you start to think of data processing by subgroups, the BY statement entered into the DATA step can become a feature for which you will find many applications.