Section 7.5: Applying the LAG and DIF Functions with Sequential Data The LAG function is often applied when working with longitudinal data or in any situation where it is appropriate to place data in a sequential pattern. Auto-correlation usually is very important with this type of data, and so lagged variables play an important role in data analysis. Working with lags may already be familiar from how easy it is to invoke them in Excel worksheets. A simple example of data lagged at an interval of 1 is when the value assigned to the cell of any particular row is taken from the value in a cell from the row directly above it (row i-1) and from the adjacent column: Row A B C 1 1 14 2 2 16 14 3 3 25 16 4 4 19 25 In this example, Column A is the time index, and Column C contains the values from Column B one time period earlier, or at lag 1. An easy way to determine values of data at lag 1 is to enter in cell C2 the formula =B1 and then pull this value down to enter the contents of lagged values in column C. SAS works with datasets that can be visualized like Excel worksheets, that is, it processes data from top to bottom for each column, headed by a different variable name. The DATA step contains a LAG function that performs the same task as the simple Excel example. However, the first item to note is that whenever it is placed in any statement, always remember that it returns the value of its argument from the last time the function was executed, not necessarily the value from the previous record in the dataset (as with Excel). Examples will soon show what this feature implies. If you process a dataset from top to bottom with no special requests, the LAG function works as illustrated above. However, if there are any conditional IF statements or if your dataset is sorted by time within a grouping variable (use of a BY statement), the LAG function will give you incorrect and perhaps unexpected results. To illustrate how the LAG function works, generate observations from a three-year time series of quarterly data: DATA tmp; FORMAT y0 7.1; DO year=2001 to 2003; DO qrtr=1 to 4; y0 = 100 + 10*RANNOR(92623); OUTPUT; END; END; PROC PRINT NOobs; VAR year qrtr y0; RUN; year qrtr y0 2001 1 93.2 2001 2 99.3 2001 3 92.8 2001 4 110.6 2002 1 80.9 2002 2 92.6 2002 3 101.8 2002 4 113.2 2003 1 87.7 2003 2 101.6 2003 3 89.9 2003 4 99.1 Suppose within each year you want to examine the value of y0 for the value from some prior quarter (for purposes of illustration you do not want to compare quarterly values across years). To perform this task SAS has a series of LAG functions: LAG1(y0): the value of y0 from the previous observation LAG2(y0): the value of y0 two observations prior to the current one .. LAG99(y0): the value of y0 99 observations prior to the current one The variable of interest here is called 'y0' that is contained in the dataset where its order has meaning, usually in time or position (location). These functions in SAS are denoted as LAGn and assign variables equal to previous values of a specified variable. The LAG statement always appears as a function to the right of the equal sign in a DATA step as: yn=LAGn(y0); The variable yn is assigned the value of y0 for n lags in the past (1 <= n <= 99). Beginning with the first iteration of the DATA step, the first n values of a new variable to hold the values at lag n, yn, are assigned to be missing (in a first in first out or FIFO queue). If the data set shown above is processed sequentially with LAG statements, here is the resulting data set: DATA tmp0; SET tmp; FORMAT y1 y2 y3 7.1; y1=lag1(y0); y2=lag2(y0); y3=lag3(y0); PROC PRINT DATA=tmp0 NOobs; VAR year qrtr y0 y1 y2 y3; RUN; year qrtr y0 y1 y2 y3 2001 1 93.2 . . . 2001 2 99.3 93.2 . . 2001 3 92.8 99.3 93.2 . 2001 4 110.6 92.8 99.3 93.2 2002 1 80.9 110.6 92.8 99.3 2002 2 92.6 80.9 110.6 92.8 2002 3 101.8 92.6 80.9 110.6 2002 4 113.2 101.8 92.6 80.9 2003 1 87.7 113.2 101.8 92.6 2003 2 101.6 87.7 113.2 101.8 2003 3 89.9 101.6 87.7 113.2 2003 4 99.1 89.9 101.6 87.7 For each sequential value the LAG function defines a "stack" of values inside the computer's memory. Every time the LAG function is invoked, it pulls a value from the top of the stack and assigns the current variable's value to the bottom of the stack for later "recall". The depth of the stack is determined by the number assigned to the LAG function: LAG1(y0) is 1 value deep LAG2(y0) is 2 values deep LAG3(y0) is 3 values deep etc. [ Note: LAG(y0) is equivalent to LAG1(y0) which is 1 value deep. ] In the above example, y1=LAG1(y0) is assigned a missing value for the first observation only; subsequent observations are equal to the previous value of y0. For y2, the first two times the program executes the statement y2=LAG2(y0); it assigns y2 a missing value. The third time it executes y2=LAG2(y0), it assigns to y2 the value given to it back in the first call. From then on, y2=LAG2(y0); assigns to y2 the value it read two calls ago. The danger in entering LAG statements lies when conditional tests are made with an IF/THEN statement or processing data in a DO loop. If LAG is invoked based on a conditional statement, then values are "pulled" from the stack but the current/new values are not "pushed" into the stack for later reference. (You only saved/pushed on the memory stack the value of the data variable whenever you call the LAG function, which will not necessarily be the previous value.) Each time the SAS program executes a LAGn statement it pulls the oldest value it has (that is what FIFO queue means) from the top of the stack and assigns it to the variable yn. It then adds the most recent value to the bottom of queue. Like an excel worksheet, it is easy to perceive that the LAGn functions always accesses values from nth previous observation. This is true if you process the data file sequentially as shown in the first example of this section. However, as you will now observe, the LAG actually returns values from previous calls to the LAG functions, which will not be previous observations if conditions are imposed. This is important to know if you use IF/THEN statements in a program (among others). Here is an example of how LAG works with observations where the lags are computed for odd numbered quarters only: DATA tmp1; SET tmp; FORMAT y1 y2 y3 7.1; rm= MOD(qrtr,2); * MOD is a function that computes the remainder when qrtr is divided by 2; IF rm=1 THEN DO; y1=lag1(y0); y2=lag2(y0); y3=lag3(y0); END; PROC PRINT NOobs; VAR year qrtr rm y0 y1 y2 y3; RUN; year qrtr rm y0 y1 y2 y3 2001 1 1 93.2 . . . 2001 2 0 99.3 . . . 2001 3 1 92.8 93.2 . . 2001 4 0 110.6 . . . 2002 1 1 80.9 92.8 93.2 . 2002 2 0 92.6 . . . 2002 3 1 101.8 80.9 92.8 93.2 2002 4 0 113.2 . . . 2003 1 1 87.7 101.8 80.9 92.8 2003 2 0 101.6 . . . 2003 3 1 89.9 87.7 101.8 80.9 2003 4 0 99.1 . . . In this example y1 [=LAG1(y0)] is assigned two missing values until observation 3 (the next quarter that is an odd number following qrtr=1), not the one missing observation you might expect. Also, the value of y1 for observation 3 (=75.5) equals the value of y0 two periods ago at qrtr=1. For y2, the first two times the program executes the statement y2=LAG2(y0), it assigns y2 a missing value; yet, the first four values of y2 are set to missing and the value of y2 for observation 5 (=75.5) equals the value of y0 for obs=1, 4 periods in the past rather than the expected 2. There may be situations where it is correct to conditionally execute a LAG, but if you are not aware of how the LAG function works, command files which have lagged values inside IF statements or DO loops most likely generate wrong or unexpected results. The answer is correct in the sense that the hardware and software performed according to its specification, but the desired result is not achieved -- and, it is quite possible that an error of this nature would not even be detected! BY statements allow you to calculate the new lagged variable for all records in the dataset and then modify the records which should be set to missing: DATA four; SET tmp; BY year; z = lag1(y0); IF first.year then z = .; RUN; PROC PRINT DATA=four NOobs; VAR year qrtr y0 z; FORMAT z 5.1; RUN; year qrtr y0 z 2001 1 93.2 . 2001 2 99.3 93.2 2001 3 92.8 99.3 2001 4 110.6 92.8 2002 1 80.9 . 2002 2 92.6 80.9 2002 3 101.8 92.6 2002 4 113.2 101.8 2003 1 87.7 . 2003 2 101.6 87.7 2003 3 89.9 101.6 2003 4 99.1 89.9 Unless you know that you definitely want a conditional lag (and lags assigned based on that condition), you should assign values of the lagged variable outside the conditional part of the code and then assign contents of the lagged variable within the conditional statements. If you need to compute variables with more than one lag, a DO loop can help save some tedious coding. For example, to compute a summary statistic with three data values, with two of them lagged values embedded within a grouping variable (year) without a DO loop you would need to write a series of IF statements placed in the DATA step: DATA tmp2; SET tmp; BY year; RETAIN nlags; y1=lag1(y0); y2=lag2(y0); IF first.year then nlags=2; IF nlags=2 THEN DO; y1=.; y2=.; nlags=nlags-1; END; IF nlags=1 THEN DO; y1=.; nlags=nlags-1; END; * when nlags equals 0, compute stats; IF nlags = 0 then y_mn= mean(of y0-y2); OUTPUT; RUN; With many lags, the series of IF statements could be tedious to write. The DATA step below includes lines beginning with "IF first.year.." which perform the same task as the statements above for the specified number of lags (assigned to the macro variable &nn). %LET nn=3; DATA tmp2; SET tmp; BY year ; DROP i j cd; ARRAY yy{&nn.} y0-y%eval(&nn.-1); RETAIN nlags; y1=lag1(y0); y2=lag2(y0); y3=lag3(y0); * these three lines assign missing values as needed for any specified number of lags ; IF first.year then nlags=&nn.-1; DO i = 1 to nlags; j = &nn.-i+1; yy{j} = . ; END; IF nlags > 0 then GOTO fn; * ************************* ; * when nlags <=0 can compute stats; y&nn._mn= mean(of y0-y%eval(&nn.-1)); fn: OUTPUT; nlags=nlags-1; RUN; PROC Print data=tmp2 noobs; VAR year qrtr y0-y%eval(&nn.-1) y&nn._mn; FORMAT y0-y%eval(&nn.-1) 5.1 y&nn._mn 6.2; TITLE "The mean of &nn. adjacent values within years"; RUN; The mean of 3 adjacent values within years year qrtr y0 y1 y2 y3_mn 2001 1 93.2 . . . 2001 2 99.3 93.2 . . 2001 3 92.8 99.3 93.2 95.07 2001 4 110.6 92.8 99.3 100.86 2002 1 80.9 . . . 2002 2 92.6 80.9 . . 2002 3 101.8 92.6 80.9 91.77 2002 4 113.2 101.8 92.6 102.51 2003 1 87.7 . . . 2003 2 101.6 87.7 . . 2003 3 89.9 101.6 87.7 93.10 2003 4 99.1 89.9 101.6 96.88 Computing Statistics with Lagged variables with PROC EXPAND Another procedure available with ETS proceducres (Econometrics/Time Seriers) one can utilize to compute with lagged variables is PROC EXPAND. The computation of a mean with lagged data is specified by a series of CONVERT statements. The variable for ID is assumed to be a date, though integer values should work as that is how SAS counts the number of days since Jan 1 1960 for any date variable. PROC EXPAND DATA=tmp OUT=tmp5 method=none; BY year; ID qrtr; CONVERT y0; CONVERT y0 = y_lag1 / transformout=(lag 1); CONVERT y0 = y_lag2 / transformout=(lag 2); CONVERT y0 = y3_mn / transformout=(movave 3); run; PROC PRINT DATA=tmp5 NOobs; VAR year qrtr y0 y_lag1 y_lag2 y3_mn; format y_l: y3_mn 7.2 ; run; year qrtr y0 y_lag1 y_lag2 y3_mn 2001 1 93.2 . . 93.20 2001 2 99.3 93.20 . 96.23 2001 3 92.8 99.25 93.20 95.07 2001 4 110.6 92.76 99.25 100.86 2002 1 80.9 . . 80.93 2002 2 92.6 80.93 . 86.74 2002 3 101.8 92.56 80.93 91.77 2002 4 113.2 101.81 92.56 102.51 2003 1 87.7 . . 87.71 2003 2 101.6 87.71 . 94.68 2003 3 89.9 101.65 87.71 93.10 2003 4 99.1 89.94 101.65 96.88 One major difference with the DATA step illustration is PROC EXPAND computes moving averages for whatever data are available within each year, whether it is 1, 2, or 3 values. The missing values indicate it does not compute means with data from previous years. Another useful illustration is the computation of percent difference between values at lag 1 with an id variable is automatically computed: y_perc = 100*((y - lag(y))/lag(y)); DATA a; grp=1; do time=1 to 4; y=ranuni(979); output; end; grp=2; do time=1 to 3; y=ranuni(979); output; end; grp=3; do time=1 to 3; y=ranuni(979); output; end; PROC EXPAND DATA=a OUT=b method=none; BY grp; ID time; CONVERT y = y_perc / transformout = ( pctdif 1 ); run; PROC PRINT DATA=b NOobs; VAR grp i TIME y y_perc ; RUN; grp TIME y y_perc 1 1 0.07836 . 1 2 0.71685 814.868 1 3 0.42800 -40.295 1 4 0.95127 122.262 2 1 0.24892 . 2 2 0.92253 270.606 2 3 0.57661 -37.497 3 1 0.52620 . 3 2 0.74835 42.219 3 3 0.16865 -77.463