4.9: Managing SAS Datasets with PROC DATASETS 1. Obtain Dataset Information 2. List Datasets Contained In a Library 3. Copying Datasets/Catalogs 4. Renaming Datasets/Catalogs 5. Deleting Datasets/Catalogs 6. Modifying Datasets/Variables 7. Copy the Label from var1 to var2 in the same DATA Step 8. Completely Remove Assigned Formats from Variables in a Dataset 9. Append two or more SAS Datasets PROC DATASETS manages SAS datasets. This procedure does what other data management techniques can do (such as the DATA step); however, it will likely perform them much faster and more efficiently. After you have written SAS datasets, this section demonstrates a few applications how to modify their contents with PROC DATASETS. Like so many other features of SAS, when it comes to data management or statistical analysis, there are often several methods to accomplish the same result. For changes to a SAS dataset that do not involve manipulating the actual data, the DATASETS procedure substantially improves efficiency in both CPU and real time over other file management methods. Since PROC DATASETS does not need to process each record in the specified dataset, its greatest efficiency is found when working with large files. With a few exceptions, PROC DATASETS does not generate printed output to the OUTPUT window. The results of the modifications which took place with the submitted statements are displayed in the LOG window. 1. Obtain Dataset Information The default library for storing SAS datasets on a PC during is called WORK. If you omit specifying the input library, the procedure processes the current default SAS data library (usually the WORK data library). This location is where SAS assumes you have stored datasets until you tell it otherwise (e.g., with a LIBNAME statement and the LIBRARY= option). Typically, it is not good programming practice to let the default library be the one to reference; you should specify the library in the DATATSETS procedure as shown below: LIBNAME dat 'c:\sas\data'; PROC DATASETS LIBRARY=dat; RUN; QUIT; Given this LIBNAME statement at this point of the program, the result will be a message listing all SAS datasets in this directory printed in the log file: File # Name Memtype Size Last Modified -------------------------------------------- 1 ONE DATA 5120 18SEP2003:13:54:38 If no SAS datasets are found in the specified directory the message "WARNING: No matching members in directory" is printed. In this case PROC DATASETS indicates it did not find any datasets in the specified library. If PROC DATASETS is used without a LIB statement within the program, the library used is the last (_LAST_) one that was referenced. NOTE: PROC DATASETS is one of the few SAS procedures that requires a QUIT; statement to officially end the step - if you don't include it after the RUN; statement you will see the message "DATASETS running" at the top of the editor window which may affect the functionality of subsequent ODS statements or other commands. 2. List Datasets Contained In a Library To determine dataset names in a specified library, invoke PROC DATASETS and identify the library name and then write these names to a SAS dataset. LIBNAME dtb 'c:\sas\missing'; ODS OUTPUT members = mmbrs ; PROC DATASETS library = dtb; RUN; quit ; proc print DATA=mmbrs NOobs n; run; Mem File Num Name Type Size LastModified 1 GSSEXP DATA 345088 12Mar03:15:58:58 2 HIP DATA 58368 29Dec04:14:50:10 3 USNEWS DATA 82944 12Mar03:15:58:52 N = 3 PROC CONTENTS displays a list of datasets found in a specified library with the NOds option (if it is omitted, the entire contents listing is produced for each dataset). PROC CONTENTS DATA=dtb._all_ NOds; RUN; Member File # Name Type Size Last Modified 1 GSSEXP DATA 345088 12Mar03:15:58:58 2 HIP DATA 58368 29Dec04:14:50:10 3 USNEWS DATA 82944 12Mar03:15:58:52 These two procedures generate essentially the same results, although without the ODS statements, the output from PROC DATASETS is placed in the LOG window and the PROC CONTENTS output is read from the output window. Alternatively, Windows Explorer is another option to navigate to the desired directory specified in the LIBNAME statement; the window will display the same list (and more) of SAS files contained in that directory. If more information is needed about each dataset within a library (e.g. individual variable attributes), the DATASETS procedure needs to be instructed to do this, while with PROC CONTENTS an option was added to remove the detail it typically provides. Below is an example of how to get the usual detail from PROC CONTENTS using the DATASETS procedure: PROC DATASETS LIBRARY=dat; CONTENTS DATA=_all_; RUN; QUIT; The CONTENTS procedure with the NOds option removed is shown below: PROC CONTENTS DATA=dat._all_ ; RUN; The results of the above two procedures are essentially the same. Both procedures send results to the OUTPUT window. This is one of the few examples where the results from applying PROC DATASETS is so treated. 3. Copying Datasets/Catalogs When SAS datasets need to be copied or moved to another library, the simplest method is to use the 'My Computer' window to move or copy files from one location to another. If for some reason the SAS program needs to do this, the task may be placed within the program commands. The DATASETS procedure below copies all datasets from the dat library to the new_dat library as noted in the process below. LIBNAME new_dat "c:\new_data"; PROC DATASETS NOlist; COPY IN=dat OUT=new_dat; RUN; QUIT; The log explains the efficiency that may be gained in using DATASETS instead of other procedures. Looking at the REALTIME and CPU TIME information provided from each of these methods, the PROC DATASETS is more efficient. The DATASETS procedure has a SELECT and EXCLUDE statements that limit the members of the IN library that are copied to the OUT library destination: PROC COPY IN=dat OUT=new_dat; SELECT adverse formats; RUN; PROC DATASETS nolist; COPY IN=dat OUT=new_dat; EXCLUDE adverse formats; RUN; QUIT; These two statements perform the same task as either of the previous commands in this last COPY procedure was instructed to SELECT only two members (ADVERSE and FORMATS) from the IN library, while the EXCLUDE statement used in the DATASETS procedure to copy all but these two files thus copying all members from the IN library to the OUT library, like the last example. There are situations when moving the files from one location to the other is more appropriate. Using the MOVE option in either procedure or using the host-system mode executing a MOVE statement achieves this. PROC DATASETS NOlist; COPY IN=dat OUT=new_dat MOVE; QUIT; RUN; These statements move all SAS datasets from the IN library to the OUT library. 4. Renaming Datasets/Catalogs To rename files, enter the CHANGE statement within PROC DATASETS: PROC DATASETS LIB=dat NOlist; CHANGE adverse = newae; QUIT; RUN; The CHANGE statement requires the LIB= option to exist on the PROC DATASETS statement and is not included as part of the CHANGE statement. The DATA step can also form a new dataset with the new name: DATA dat.newae; SET dat.adverse; RUN; However, when running a DATA step the old dataset still exists and an additional PROC DATASETS step with a DELETE statement would need to be added to a program to remove the old file. 5. Deleting Datasets/Catalogs PROC DATASETS is efficient at deleting datasets. It gives the processing control to the command file instead of assuming that an external step has been performed, e.g. deleting the files outside of SAS. PROC DATASETS NOlist; COPY in=db out=work; RUN; QUIT; PROC DATASETS LIB=dat NOlist; DELETE adverse summary; RUN; QUIT; The second step given above deletes the specified datasets listed on the DELETE statement only. You can also choose datasets with common letters or numbers in their names and place them in a macro variable: DATA ra1254; x=1; DATA ra1284; x=1; DATA ra1274; x=1; DATA ra12A4; x=1; PROC SQL NOprint; SELECT memname INTO :del_list separated BY " " FROM dictionary.members WHERE compress(libname) like 'WORK' AND upcase(substr(memname,1,2)) like 'RA' AND verify(substr(memname,3,4),"0123456789") = 0; QUIT; PROC DATASETS LIB=work NOlist; DELETE &del_list ; run; QUIT; In this example, del_list = RA1254 RA1274 RA1284 which are the deleted datasets. In order to delete all datasets found in a specified library, enter the kill option on the PROC DATASETS statement. In doing so be careful, since this option presented by itself deletes everything (including all temporary SAS datasets in the work directory and existing formats) without prompting you that it is actually the task you want to perform. You can restrict what is deleted with the MEMTYPE= options: PROC DATASETS LIB=dat NOlist MEMTYPE=data kill; RUN; QUIT; Although a few other SAS procedures allow the _all_ option, the DELETE statement within PROC DATASETS does not. Since _all_ is a legal dataset name this would make a very bad keyword for describing all datasets: PROC DATASETS library=work; DELETE _all_; * NOTE: -> _all_ is NOT functional DELETE statement unless _all_ is an actual dataset name; RUN; QUIT; Note also the MEMTYPE= option such as MEMTYPE=data. If you want to "kill" everything currently present in work (including template stores, catalogs, formats, etc.) omit this option. Delete existing graphs with: PROC DATASETS mt=cat nolist; delete gseg; RUN; QUIT; You also need to close the window manually since if it remains open the rest of the code, which actually does the job of purging 'leftover' graphs, will not be able to run, and SAS may give you fits until you do or even "freeze" (my experience if the graph window gets to full). Enter the DM "GRAPH; CANCEL;"; to do everything in one short statement. Deleting all SAS datasets is helpful when "working" files consume a large amount of memory. Once the logic of the program has been checked, KILLing the working files results in a more efficient program. Another important application of the kill option is when you are finished working with a program and are ready to move onto another project without exiting SAS. It is generally a good idea to remove all existing datasets and formats from the 'work' directory when they are no longer needed. When several programs run from a windows editor all share this same working directory, delete "old" files so they are not left there to be erroneously accessed. If some files should remain within a SAS library, the KILL option will not suffice. Instead, the DATASETS procedure provides the SAVE statement which will delete all members in a library except for the ones noted on the SAVE statement: PROC DATASETS LIB=dat NOlist; SAVE formats; RUN; quit; This procedure deletes all files in the dat library except the FORMATS catalog. 6. Modifying Datasets/Variables You can make changes to the attributes of a member within a library with the DATA step. The types of change to be discussed below become much more efficient with the DATASETS procedure. The difference between these two methods is that with the DATA step every record contained in the dataset has to be processed. The DATASETS procedure manipulates the header information of the dataset and therefore doesn't process each observation making it far more efficient. As the size of the datasets being modified increases, PROC DATASETS increases the efficiency dramatically. The following two examples show how to add a label to a SAS dataset. First, the DATA step can be invoked: DATA db.adverse(label='Adverse Events'); SET db.adverse; RUN; The same task can be performed with the DATASETS procedure: PROC DATASETS lib=db nolist; MODIFY adverse (label='Adverse Events'); RUN; QUIT; Both sets of statements the label "Adverse Event" to the dataset named adverse. With large datasets the REAL TIME it takes can be substantially higher with the DATA step method. As a project moves forward, you may want to add labels to indivdual variables or add labels from imported data from an Excel file read with PROC IMPORT. These annotations are very helpful when printing output to describe what is being captured. The DATA step approach is: DATA db.adverse; SET db.adverse; LABEL pat = 'Subject Number' inv = 'Site Number'; RUN; However, PROC DATASETS is much more efficient and not any more difficult to write: PROC DATASETS LIB=db nolist; MODIFY adverse; LABEL pat = 'Subject Number' inv = 'Site Number'; RUN; QUIT; To remove labels from a subset or all variables: PROC DATASETS lib = work NOlist NOdetails; MODIFY dataset_nm ; ATTRIB _all_ label = "" ; RUN ; quit ; (or enter OPTIONS Nolabel;) 7. Copy the Label from var1 to var2 in the same DATA Step There is no particular function that will let you set the pdv label at runtime. In general, the program statements of a data step that is running cannot affect the header of the output tables being created by the DATA statement. The header contains information about each variable; the name, type, length, format, informat, label, constraints, etc... Store the label in a macro variable and enter its contents in a PROC DATASETS following the DATA step. DATA foo; attrib bar length=$8 label='The label of "bar"'; bar='abc'; run; data foo2; set foo; xyz = 100; IF (_n_=1) then call symput("barLabel", vlabel(bar)); run; proc datasets nolist lib=work; modify foo2; label xyz=%sysfunc(quote(&barLabel)); run; quit; Other common tasks are to rename variables and apply formats/informats to variables. With the DATA step the following statements might appear: DATA db.adverse; SET db.adverse(rename=(inv=site pat=subj)); FORMAT todate visdate fromdate mmddyy10.; INFORMAT todate visdate fromdate mmddyy10.; RUN; With PROC DATASETS the statements to perform this task include: PROC DATASETS LIB=db NOlist; MODIFY adverse; RENAME inv=site pat=subj; FORMAT todate visdate fromdate mmddyy10.; INFORMAT todate visdate fromdate mmddyy10.; RUN; QUIT; 8. Completely Remove Assigned Formats from Variables in a Dataset DATA class; SET sashelp.class; FORMAT name $15. weight 6.2; run; Proc Contents Data = Class ; Run ; Alphabetic List of Variables and Attributes # Variable Type Len Format 3 Age Num 8 4 Height Num 8 1 Name Char 8 $15. 2 Sex Char 1 5 Weight Num 8 6.2 * can specify the variable names; Proc DataSets LibName = Work ; Modify Class ; Format Name Weight ; Run ; Quit ; Proc Contents Data = Class ; Run ; Alphabetic List of Variables and Attributes # Variable Type Len 3 Age Num 8 4 Height Num 8 1 Name Char 8 2 Sex Char 1 5 Weight Num 8 DATA t1; STOP; ATTRIB x length=7 format=mmddyy10. y length=8 format=datetime18.; RUN; PROC CONTENTS; RUN; # Variable Type Len Pos Format -------------------------------------------------- 1 x Num 7 8 MMDDYY10. 2 y Num 8 0 DATETIME18. * can also remove all formates with _all_ option; PROC DATASETS lib=work; MODIFY t1; FORMAT _all_; RUN; QUIT; proc contents; run; # Variable Type Len Pos ----------------------------------- 1 x Num 7 8 2 y Num 8 0 When the CPU TIME for the above two tasks to do exactly the same thing are compared, the DATASETS procedure completes them more quickly. The savings in REAL TIME is even more substantial. Assume you have a dataset with indexed variables v1, v2, ... v12. Every month you need to add one additional variable so that current variables v1 thru v12 become v2 thru v13 and the new added variable is v1. In the ensuing month v1 through v13 then become v2 througjh v14 and v1 is the added variable. A macro can help (see chapter 9 for introduction to macro processing). %MACRO update(month); PROC DATASETS LIB=mylib; MODIFY a; RENAME %DO i=1 TO %EVAL(month-1); v&i = v&%EVAL(i+1); %end; run; quit; %MEND update; %update(13) DATA ab; MERGE a b(keep=id v1); BY id; RUN; %update(14) 9. Append two or more SAS Datasets Through PROC DATASETS you can append two or more SAS datasets. DATA recent_hires; SET mylib.new_employees; IF (hired GT "04/28/2004"d) THEN OUTPUT; RUN; PROC DATASETS lib=work ; APPEND base=mylib.employees data=recent_hires; DELETE new; RUN; QUIT; The DELETE statement has been included here since once the dataset set 'recent_hires' has been successfully appended to the base file, you may want to immediately delete it to avoid the risk of appending its contents more than once. If 'mylib.employees', the base dataset, does not already exist SAS will make a new one displaying this note in the log file: NOTE: BASE data set does not exist. DATA file is being copied to BASE file. The PROC DATASETS statement also allows a 'force' option that will append the variables present in the input dataset to the base file even if the two files don't have the same variable names and formats. PROC DATASETS force ; APPEND base=mylib.employees data=recent_hires; Run; quit; The forece option should always be applied with caution since compatibility of datasets is a very important assumption when appending them. If you enter it, be sure to check the contents of the base file (esp. variable names and formats) to be sure the files are appended in the manner you expect. If one or more variable names are in 'recent_hires' dataset and not in the base file, mylib.employees, the two datasets will not be appended without the force option. Even with the force option present, only the variables included in the base dataset will exist. Assume you want to append a SAS dataset to a worksheet in Excel using PROC EXPORT. The only option with version 8.2 of SAS is to have the entire worksheet contents in a SAS dataset and then specify 'replace' in PROC EXPORT. This procedure first deletes the table and then writes a new one. Without the 'replace' option, if the worksheet does not already exist a new one will be written. If the worksheet already exists, you will receive an error message and the table is not written. Note that the DATA step and PROC APPEND both have the capability to append files. See chapter 6.1 for several applications.