Section 2.1: Introductory Concepts

Instructions written within a command file range from a few statements in a DATA step and a PROC step to a rather sophisticated set of commands that resembles a programming language. If the idea of SAS as a programming language makes you uneasy, approach it instead as a series of commands, expressed through one or more DATA and PROC steps, that are arranged to analyze data in a logical sequence. The commands themselves then take on the role of documenting your data analysis plan. SAS can also assume the role of a very sophisticated programming language in which the desired data manipulations or analyses are evaluated in a sequence of steps.

The SAS command file is constructed from individual components consisting primarily of a sequence of DATA (manipulate) and PROC (procedures) steps. Each step has its own purpose for data manipulation, visual displays, or calculations. As a result, programs written in SAS lend themselves to top-down design with modularization. Leaving a blank line before the first statement in each step (DATA or PROC) highlights the modularization and visually enhances the file structure. In fact, the Enhanced Editor with PC SAS draws lines between steps as you write the DATA and PROC steps to provide a visual aid that marks their boundaries. To help you write effective SAS commands, this section introduces the basic concepts that show how DATA and PROC steps work together.

DATA and PROC steps

The primary purposes of SAS DATA steps are to manipulate data and then write data to SAS datasets. DATA steps give instructions for reading data into a SAS dataset, or for merging, subsetting, or updating existing SAS datasets into a new dataset, as well as for adding new variables or modifying existing ones. DATA steps are separate components of a command file. Each one takes effect only at the point where it is placed in the sequence of steps.
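
The following is a minimal sketch of what a DATA step followed by a PROC step can look like. The dataset name heights, the variable names, and the data values are hypothetical and chosen only for illustration; the statements themselves are described in later sections.

```sas
/* Hypothetical example: names and values are illustrative only */

DATA heights;                        /* begin a DATA step that writes dataset heights */
   INPUT id height_cm;               /* read two numeric variables from each data line */
   height_in = height_cm / 2.54;     /* add a new variable computed from an existing one */
   DATALINES;
1 170
2 182
3 165
;
RUN;

PROC PRINT DATA=heights;             /* a separate step: list the dataset just written */
RUN;
```

Note the blank line that separates the two steps; it is not required by SAS, but it makes the boundary between the DATA step and the PROC step easy to see.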
Procedures (PROCs) apply specific functions to the data placed in SAS datasets, such as printing the data (PROC PRINT), or producing charts or plots (PROC GPLOT), tables of categorical data (PROCs FREQ, TABULATE), descriptive statistics for continuous data (PROCs MEANS, SUMMARY, UNIVARIATE, TABULATE), and many other specialized routines. SAS procedures either produce results in the OUTPUT window and/or a browser window, or write new datasets to be used as input to subsequent procedures and DATA steps.

SAS is not officially a Data Base Management System (with the exception of a few little-used techniques) in that it does not operate on datasets *in place*. Rather, it writes new datasets with inputs from existing ones. This feature is not always apparent. For example, when you sort a dataset and do not apply the OUT= option (see Chapter 5, Section 4 for details), SAS writes the output dataset with a temporary name, deletes the input dataset, and then renames the new output dataset with the original name. Nearly the same approach occurs with other PROC and DATA steps when the input and output datasets have the same name. Without a backup, you cannot return to the original dataset after it has been replaced. In the case of a sort, you can sort the dataset back to its original order if it contains specific ID variables that record that order; however, you cannot restore it to some previous arbitrary order. If you happen to code some faulty logic and write a statement that filters out 100% of your observations, you will end up with an empty dataset.

Valid Characters

SAS statements are built from alphabetic letters (a, b, ..., z, upper or lower case), numbers (0, 1, ..., 9), and special characters including $ _ * - + / = % & ( ) { }, among others. As you will observe in these chapters, a few symbols, particularly the slash /, the asterisk *, and the hyphen -, have multiple functions. In these situations, what each symbol does should be obvious from the context in which it appears.
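
As a small illustration of how context determines what the slash, the asterisk, and the hyphen do, consider the sketch below. The dataset scores and its variables are hypothetical, and the statements used are covered in later chapters; only the role of each symbol matters here.

```sas
/* Hypothetical dataset and variable names, for illustration only */

DATA scores;
   INPUT q1 q2 q3 group $;
   ratio = q1 / q3;              /* slash: division */
   twice = q2 * 2;               /* asterisk: multiplication */
   diff  = q3 - q1;              /* hyphen: subtraction */
   DATALINES;
4 5 2 a
3 1 6 b
5 2 4 a
;
RUN;

* An asterisk at the start of a statement turns the statement into a comment;

PROC MEANS DATA=scores;
   VAR q1-q3;                    /* hyphen: numbered range of variables q1, q2, q3 */
RUN;

PROC FREQ DATA=scores;
   TABLES group / NOCUM;         /* slash: separates the table request from its options */
RUN;
```

In each case the same character appears in two roles, and the surrounding statement makes clear which one is intended.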
Three important components of all SAS command files are keywords, statements, and steps.

DATA and PROC Keywords

Keywords begin most SAS statements in DATA or PROC steps; each one is recognized as a specific instruction for SAS to perform. They usually appear at the beginning of the statement and act like a verb in that they describe the action the statement will carry out. For readability, keywords (that is, all names that SAS identifies to perform particular tasks) will be placed in capital letters in these chapters. For example, in the DATA step you will find SET, ARRAY, RETAIN, INFILE, INPUT, etc. In the PROC step, common keywords include VAR, CLASS, MODEL, TABLE, LSMEANS, RANDOM, etc. A few statements can be found in both, such as FORMAT or RUN. In a statement that begins with a capitalized keyword, any option, dataset name, or variable name chosen by the user will be placed in lower case. SAS statements within a command file (as opposed to Unix commands) are not case sensitive; a mixture of upper and lower case usually aids program readability, especially after several weeks or months of not looking at a particular SAS file.

SAS Variable Names

SAS statements often include variable names to access the various data items contained in a SAS dataset. SAS variable names are defined primarily on the INPUT statement, in the first row of an Excel file, or on the left-hand side of the equals sign in calculations made within the DATA step. Some general rules for naming SAS variables and datasets include:

1. The first character of a variable or dataset name must be a letter (a, b, c, ..., z) or the underscore, _ .
2. Use any combination of letters, numbers (0, 1, ..., 9), or the underscore to spell the remaining characters in a variable or dataset name. Avoid entering any other characters such as /, #, $, %, &, *, -, +, etc.
3. Versions of SAS up to and including 6.12 allowed a maximum of eight (8) characters to name variables.
This limitation was changed with Version 7 and subsequent releases. Variable names with up to 32 characters can now be entered (Rules 1 and 2 listed above still apply). Although longer variable names have certain advantages, when writing command files it is often quicker and less error-prone to enter short variable names and then add variable labels to them (see the description of the LABEL statement in Chapter 4).
4. Certain names are reserved by SAS. All of them are either functions or names that begin and end with an underscore. For example, _n_ is used within a DATA step as a counter to indicate the number of the observation currently being processed. Values of _n_ are not saved to the SAS dataset. However, if you want to make a variable that indicates the observation number, _n_ can be placed on the right-hand side of the = sign as part of an assignment statement:

   obsvnt = _n_;

This statement in a DATA step writes a variable called obsvnt which is equal to 1, 2, 3, 4, 5, ..., sequentially assigned to each observation as the dataset is processed.

Avoid choosing variable names that resemble SAS keywords, formats, functions, or other key items. For example, when naming a variable, it is not recommended to enter a name that looks like "ddmmyy", since this choice is very close to one type of SAS date format. The practice of duplicating reserved words as variable names creates unnecessary reference problems for the SAS language compiler, which may give strange results.

The recommended practice is to NOT enter spaces in variable and dataset names.
However, if you really must include a space in a variable name, it can be done by selecting an appropriate option and naming the variables in the manner shown below (which will become clear as you read subsequent chapters):

   OPTIONS validvarname=any;
   DATA test;
      INPUT 'var a'n 'var b'n;
      CARDS;
   1 2
   10000 200000
   ;
   PROC PRINT;
   RUN;

   Obs    var a     var b
     1        1         2
     2    10000    200000

NOTE: the option validvarname=any is experimental. With Version 8.2 it is proven for datasets but not beyond that. With validvarname=any in the options you can enter 'var a'n and 'var b'n in the INPUT statement of a DATA step. Unfortunately, when you want to export the dataset with PROC EXPORT, the two parts of each variable name will be treated as separate variables. This produces error messages, which for this example say that four variables were uninitialized; as a result, the data contained in these two variables will NOT be exported. Version 9 is reported to have resolved this problem:

   OPTIONS validvarname=any;
   LIBNAME exout excel 'c:\data\test.xls' ver=2002;
   DATA exout.test;
      'first name'n='Paul';
      'last name'n='Beglund';
   RUN;
   LIBNAME exout clear;

SAS Statements

Statements are the combination of words, numbers, and symbols that tell SAS what to do. Many statements begin with a SAS keyword (I prefer to capitalize it, though that is optional) and are followed by various options chosen by the user. One mandatory rule to remember for all statements: they always, always, always end with a semicolon (to make my point, should I add yet another 'ALWAYS'?). Fortunately, if you enter one statement on each line, the color-coded Enhanced Editor helps you spot any missing semicolons, since the keyword on the following line will not be the color you expect. Throughout these chapters, the place to enter options of your choice will be indicated in angle brackets < options >. Section 2.3 presents detailed information about commonly used SAS statements.
Mathematical equations written by the user are another type of statement and in many cases do not contain a SAS keyword; however, they may call a particular SAS function, which will also be given in capital letters (see Chapter 7).

Finally, note that SAS provides a wide variety of tools for a wide variety of tasks, many of which can be done in several ways. Not every data management or analysis task is best served by only one procedure. Some statistical procedures address a given data analysis problem less well than another procedure would, some data management tasks may be better done within a DATA step or a specialized procedure, and some will be best accomplished outside SAS entirely. These choices will become more evident the longer you spend learning and working with the system. So learn to be flexible, and recognize the strengths of, and the available alternatives to, any approach you may take.
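
To make the idea of keyword-free statements concrete, here is a minimal sketch of a DATA step containing assignment statements, two of which call SAS functions. The dataset name roots and the variable names are hypothetical; SQRT and LOG are standard SAS functions, and _n_ is the automatic observation counter described earlier in this section.

```sas
/* Hypothetical example: dataset and variable names are illustrative only */

DATA roots;
   INPUT x;
   obsvnt = _n_;          /* save the automatic counter _n_ as an ordinary variable */
   root_x = SQRT(x);      /* assignment statement calling the square root function */
   log_x  = LOG(x);       /* assignment statement calling the natural log function */
   DATALINES;
4
9
16
;
RUN;

PROC PRINT DATA=roots;
RUN;
```

None of the three assignment statements begins with a keyword, yet each one still ends with a semicolon like every other SAS statement.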