Section 4.2: The INFILE and INPUT Statements in the DATA Step

Before you can run a data analysis procedure in SAS, your data must first be read into a SAS dataset. Reading raw data from an external source is usually one of the first tasks you will face. The most common ways to read data into a SAS dataset are:

* Data entered within the program itself
* Data read from an external text file
* Data imported from an Excel spreadsheet

If your data have already been entered in an Excel spreadsheet, or if you want to know more about the issues involved in entering data into a SAS dataset, first read the description of data coding and transfer from Excel to SAS at:

http://www.uoregon.edu/~robinh/data_transfer.html

If your data meet these guidelines, you most likely will not need to read external text files and may skip this section. Its contents, however, show the many ways in which SAS can read external data through the INFILE and INPUT statements. This is one of the strong features of SAS and is therefore given an extensive introduction. It is important to understand the INPUT statement (along with INFORMAT, LENGTH, and INFILE) clearly, because its contents influence the statements that follow it in the DATA step.

The SAS keyword DATA begins the first statement of the DATA step. A DATA step does one of three things: it accesses an existing SAS dataset with the SET, MERGE, or UPDATE statements (explained in chapter 6), computes new data within the DATA step (from existing variables, with DO loops, or by random number generation), or reads an external data file. The DATA step lets you process data in a sophisticated programming environment.

When reading external text files, the INFILE statement must be paired with an INPUT statement:

INFILE: identifies the file name and location and defines the file structure.
INPUT: describes how the variables are to be read from the data file.
INFILE: Specify an External Text Data File and its Properties

The INFILE statement specifies how to access data placed in an external text file. In this situation it is always used together with its associated INPUT statement and must precede it.

Syntax: INFILE 'file-specification' options ;

The following statements read data from an external data file into a new SAS dataset.

DATA one;
INFILE 'file-specification' options ;  * specify a text data file;
INPUT var1 var2 .. ;
RUN;

file-specification = drive, path, and name of the text file from which data are read

dlm= -> dlm= is an abbreviation for delimiter= and defines the delimiter(s) that separate field values. For comma delimited files, enter dlm=',' (be sure to have single or double quotes on both sides of the comma). For tab delimited files, enter dlm='09'X as your field delimiter; '09'X is the hexadecimal representation of the tab character. If you have two or more delimiters, such as commas and tabs, list them all in one hexadecimal string, for example dlm='2C09'x ('2C'x is the comma and '09'x is the tab). A quoted string such as ', "09"x' does not work for this purpose, because every character inside the outer quotes, including the quotation marks, the digits 0 and 9, and the letter x, would itself be treated as a delimiter.

dsd -> if two delimiters defined by the dlm= option occur consecutively, the value of the respective variable between them is set to missing. For more details see the section devoted to DSD below.

LRECL=xxx -> if a record exceeds the default of 256 characters, add this option to specify the maximum length of a record. The number xxx only needs to be equal to or greater than the maximum record length in the file, so you don't need to know the exact value (see the example below to determine the maximum length).

END=eof -> creates a variable (here called eof) that you can test in an IF statement or some other logical statement to determine when you have reached the last record in the input data file (see example below).

FIRSTOBS=n -> begin to read data on line n of the data file (n=2 in the example statement below). This option allows you to store the variable names in the first row or to add other documenting information at the top of the data file.
flowover -> causes an INPUT statement to continue to read the next input data record if it does not find values in the current input line for all the variables in the statement. This is the default behavior of the INPUT statement.

missover -> prevents SAS from going to a new line if it does not find values for all variables on the current line; variables without values are set to missing, and the next record is read starting with the first variable on the INPUT statement. This option becomes necessary when some records in your data file contain fewer values than there are variable names on the INPUT statement.

RECFM=v -> record format (variable-length records are the default).

PAD -> pads each record with blanks up to the length given by LRECL=xxx, making the file work like one with fixed-length records. Because short records are filled out with blanks, the 'missover' option is no longer needed to set unread variables to missing.

truncover -> overrides the default behavior of the INPUT statement when an input record is shorter than the INPUT statement expects. By default, the INPUT statement automatically reads the next input data record. TRUNCOVER reads variable-length records when some records are shorter than the INPUT statement expects; variables without any values assigned are set to missing. Use TRUNCOVER to assign the contents of the input buffer to a variable when the field may be shorter than expected.

As a basic example of a DATA step that reads a comma delimited ASCII file called mydata.csv, the following INFILE statement tells SAS that the file resides in the directory c:\data and that the data begin in row 2; the INPUT statement reads four numeric values from each row:

DATA one;
INFILE 'c:\data\mydata.csv' dlm=',' dsd firstobs=2;
INPUT a b c d;
RUN;

Further detail about how the INFILE and INPUT statements work together is given in the examples that follow. In particular, special features of the INFILE statement are explained in more detail.
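The difference between FLOWOVER and MISSOVER described above is easiest to see on a tiny in-stream file. The following is only a sketch: the dataset names flow_demo and miss_demo and the three data lines are made up for illustration, and the second data line is deliberately one value short.

```sas
/* Hypothetical data: the second record has only two of the three values. */

/* FLOWOVER (the default): the missing third value is taken from the next line. */
DATA flow_demo;
  INFILE DATALINES FLOWOVER;
  INPUT a b c;
  DATALINES;
1 2 3
4 5
6 7 8
;
RUN;

/* MISSOVER: SAS stays on the short line and sets c to missing. */
DATA miss_demo;
  INFILE DATALINES MISSOVER;
  INPUT a b c;
  DATALINES;
1 2 3
4 5
6 7 8
;
RUN;

PROC PRINT DATA=flow_demo; RUN;
PROC PRINT DATA=miss_demo; RUN;
```

With FLOWOVER the second observation borrows the 6 from the third line, the rest of that line is discarded, and only two observations result. With MISSOVER all three lines become observations and c is missing in the second one. For list input like this, TRUNCOVER should give the same result as MISSOVER; the two differ only when formatted or column input meets a record that ends in the middle of a field.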
The INPUT Statement: Read data

INPUT is the statement that reads data from an external file in text format or from data lines inserted into the program itself. A few other statements not yet introduced (such as INFORMAT and CARDS) need to be present in these examples to show how they relate to the INPUT statement. These ancillary statements are covered in more detail in subsequent sections.

Syntax: INPUT specification(s) ;

Depending on the format of the data to be read, the INPUT statement may have a very simple structure; it can also become the most complex statement in the DATA step. How it is written depends entirely on the structure of the data. To read data properly you need explicit knowledge of them: the type (numeric, character, date, etc.), the number of records per observation, the order in which variables occur in the file, whether they are listed in fixed or free format, and whether they are delimited by spaces, commas, tabs, or some other special character.

In the following examples, data are assumed to be stored in a free format, i.e., all values must have one of the following delimiters between them:

* Spaces
* Commas
* Tabs
* Some other single character that never appears in your data

Spaces and commas are the delimiters most commonly placed between values for each observation in text files. Semicolons should be avoided since they already have a very specialized purpose in SAS programs. [If for some reason you cannot avoid semicolons in your data, then wherever you see CARDS; in the following examples, use the CARDS4; or DATALINES4; statement instead, and enter four consecutive semicolons ;;;; after the last row to indicate the end of your input data. Later sections demonstrate how to read data files with other types of delimiters.]
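The note above about semicolons in the data can be sketched as follows. The dataset name notes, the variable names, and the two data lines are hypothetical; the point is only that DATALINES4 lets a semicolon appear inside a value and that the data end with four semicolons starting in the first column.

```sas
DATA notes;
  LENGTH code $ 12;        /* room for character values longer than 8 characters */
  INPUT id code $;         /* simple list input, blank-delimited */
  DATALINES4;
1 a;b
2 x;y;z
;;;;
RUN;

PROC PRINT DATA=notes NOOBS; RUN;
```

With a plain CARDS; or DATALINES; statement, the first semicolon inside the data would be taken as the end of the data, so the four-semicolon form is what makes these rows readable.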
The simplest data layout to read into SAS leaves one or more spaces between all values, contains only numbers or short character data (8 characters or fewer), and lists the variables in the same order on each row. Small data files can be inserted directly into the SAS program following a CARDS; statement as shown below. [Historical note: the word CARDS recalls the time when punched cards, rather than the text files and interactive programming we use today, were used to submit programs and data.]

Data may be read either from within the SAS program or from an external file. For direct input, here is an example of space-delimited data read from within the SAS program (in this situation an INFILE statement is not necessary, and a free-formatted INPUT statement is the most basic way to read data):

DATA direct;
INPUT id a b c d;
CARDS;
1 2 3 4 5
2 4 6 8 9
;

Here is another example that reads the same data from a space-delimited file in free format called direct.dat located in the directory c:\data:

DATA direct;
INFILE "c:\data\direct.dat" missover;
INPUT id a b c d;
RUN;

In these two examples, "direct" is the name you give to the temporary dataset (shown here in lowercase letters, though a mix of lower and upper case is fine). You enter this name in subsequent DATA or PROC steps of your SAS program to refer to this particular dataset. Input specifications are discussed in more detail below.

The previous two examples also show that the INPUT statement can only be used together with an INFILE statement (to read external data) or a CARDS statement (with the data placed within the program following the DATA step statements).
Example: Read comma delimited data included in the program following the CARDS; statement:

DATA total;
INFILE cards dlm=',' dsd END=eof missover;
INPUT a b c;
tot_a+a;
tot_b+b;
tot_c+c;
IF eof THEN OUTPUT;
CARDS;
1,2,3
4,2,2
5,4,4
;
PROC PRINT DATA=total;
VAR tot_a tot_b tot_c;
RUN;

Produces this output:

OBS    TOT_A    TOT_B    TOT_C
  1       10        8        9

NOTE: quotes around the word cards in the INFILE statement (where one would typically place a drive, path, and filename) are not needed.

Record Lengths of External Text Files

When reading external text files, one important feature to know about your data is the length of each record. By default, SAS assumes a maximum record length of 256, so for most data files with only a few variables the default is more than sufficient. However, when you have many variables and do not know the maximum length over all records, one way to determine both the minimum and maximum values is shown below. The option lrecl= should be entered with a number you know will be larger than the longest record. In a DATA step expression the operator >< returns the smaller of two values and <> returns the larger.

DATA lengths;
RETAIN minlen 9999 maxlen 0;
INFILE "c:\data\s06.txt" firstobs=2 end=done lrecl=600;
INPUT ;
minlen = minlen >< lengthc(_infile_);
maxlen = maxlen <> lengthc(_infile_);
IF done THEN OUTPUT;
RUN;
proc print NOobs; run;

minlen    maxlen
   250       411

In SAS V8 another approach is necessary, something like this:

minlen = minlen >< (length(_infile_||'.')-1);

Now, to read this external text file, enter LRECL=411 (or a somewhat larger number) to be sure SAS reads variables from the entire record. When you read an external text file, the minimum and maximum record lengths are also printed to the log file:

The minimum record length was 250.
The maximum record length was 411.

Always check these lengths so you can be certain the number you enter for lrecl= is large enough.
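As an alternative sketch of the same record-length check, the INFILE statement has a LENGTH= option that names a variable SAS sets to the length of the current input line each time an INPUT statement executes. The dataset name lengths2 and the variable name reclen are assumptions made for this illustration; the file path is the one used in the example above.

```sas
DATA lengths2;
  RETAIN minlen 99999 maxlen 0;
  INFILE "c:\data\s06.txt" FIRSTOBS=2 END=done LRECL=600 LENGTH=reclen;
  INPUT;                          /* null INPUT: loads the record and sets reclen */
  minlen = MIN(minlen, reclen);   /* smallest record length seen so far */
  maxlen = MAX(maxlen, reclen);   /* largest record length seen so far */
  IF done THEN OUTPUT;            /* write a single summary observation */
  KEEP minlen maxlen;
RUN;

PROC PRINT DATA=lengths2 NOOBS; RUN;
```

The MIN and MAX functions are used here instead of the >< and <> operators purely for readability; the result should match the figures SAS reports in the log for the same file.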
Example

What if you want to read a tab-delimited external text file called fish.txt in which the actual data begin on line 3 (assume the first row contains the variable names and the second row a continuous line of dashes) and which has extra long records (longer than the system default of 256)? Here is a DATA step with the necessary statements:

DATA a;
INFORMAT id $char15.;
INFILE "c:\data\fish.txt" DLM="09"x dsd LRECL=350 FIRSTOBS=3 pad;
INPUT id a b c d;
RUN;

The INFILE option DLM="09"x specifies that a tab-separated text file is to be read. The statement also indicates that the file has records longer than the default length of 256 (lrecl=350) and identifies row 3 as the row where reading begins (firstobs=3).

Without the LRECL option the maximum record length of an input data file is 256, so if the longest record in your file has 256 characters or fewer, the option LRECL=### is not needed. If the longest record exceeds 256 characters, you need to specify a new maximum record length (LRECL=###), where ### must be greater than or equal to the number of characters in the longest record.

If your data file contains header information (such as variable names or other documentation), enter the firstobs=xx option to specify the row where the data actually begin: in this example the INFILE statement tells SAS to begin reading data in row 3 rather than row 1.

You should also enter the option 'pad' together with lrecl. When a short line ends with CR/LF before all variables have been read, the pointer may otherwise move on to the next line; pad fills the record with blanks out to the length given by lrecl, so the INPUT statement stays on the current line. Another possible option is missover, which keeps SAS from reading the next record when a CR/LF is encountered before all variables have been assigned a value; the next record is then read starting with the first variable on the INPUT statement, and all variables not yet assigned a value are set to missing.
Without 'pad' or 'missover' the following message will appear in the log file:

NOTE: SAS went to a new line when INPUT statement reached past the end of a line.

This note indicates that the external data file was probably not read correctly and that you will have fewer observations than intended.

Reading Undelimited Text Data Files

You may need to read data from columns not separated by spaces or other delimiters. How does one extract data in such cases? There are several methods, illustrated here with separate INPUT statements (only one of them can be active at a time, so the alternatives are shown as comments):

DATA bbb;
* 1. specify data type and the specific columns after the variable names;
INPUT a 1 b $ 2-3 c 4 d 5-6 e 7 f 8 ;
* 2. place a data type and informat after the variable name;
* INPUT a 1. b $2. c 1. d 2. e 1. f 1. ;
* 3. like #2, but variables and informats can also be grouped;
* INPUT a 1. b $2. (c d e f) (1. 2. 1. 1.) ;
cards;
1CD45274
2AB56372
3ST56712
;
proc print data=bbb; run;

Obs    a    b     c     d    e    f
  1    1    CD    4    52    7    4
  2    2    AB    5    63    7    2
  3    3    ST    5    67    1    2

The INFILE Statement:

If you have many numeric variables of a similar data type in free format, such as responses to items from a survey, you can read them in an abbreviated manner with a special shortcut notation:

DATA b;
INFILE 'quest.dat';
INPUT id q1-q30;
RUN;

The INPUT statement in this example automatically creates 30 variables called q1 q2 q3 ... q28 q29 q30. (The options of the INFILE statement will be explained later.)

If you want to read character data of 8 characters or fewer, include a $ after the name of the character variable. For example:

DATA c;
INFILE 'demo.prn' missover;
INPUT frst_nme $ lst_nme $ x1 x2 y;
RUN;

This INPUT statement assumes that no other data are listed in the file and that the values are separated by one or more blanks. The unnamed character informat $ by itself indicates that the preceding variable is to be read with a maximum of 8 characters. You can specify the exact number of characters if known (such as $10.)
but beware that this informat strips leading blanks (and also converts a single leading . to blank). That is why one should conscientiously apply the $CHAR informat instead of the unnamed character informat.

INFILE: dsd option

One important option on the INFILE statement is DSD. It has three functions when reading delimited files.

1. It removes quotes that may surround values in the text file (see how to override this action in an example below).

2. When SAS encounters consecutive delimiters in a file, the default action is to treat the delimiters as one unit. If a data file contains consecutive delimiters, it is likely to have missing values between them. DSD tells SAS to treat consecutive delimiters (as specified in the dlm= option) separately; therefore, a value that is missing between consecutive delimiters will be read as a missing value when DSD is entered.

3. When it appears by itself, DSD assumes the delimiter is a comma, so the DLM= option is not necessary. If another delimiter separates the values in your data file (such as a tab), the DLM= option must also be entered.

This option became available in SAS 6.07 and is documented in "SAS Technical Report P-222". In Version 7 and beyond, DSD is documented in "SAS Language Reference: Dictionary". DSD also allows a comma to appear in a character string that is enclosed in quotes.

Example: Read Character Data

If the number of characters in a character string is greater than 8, enter a formatted INPUT statement with an informat that indicates the number of characters in the data:

DATA c;
INFILE 'demo.dat';
INPUT frst_nme $CHAR15. lst_nme $CHAR15.;
RUN;

$CHARxx. is the informat specification for CHARacter data containing a total of "xx" possible characters. [Note: you should always enter the $CHAR__. informat since it will read leading spaces, whereas the $__. informat will strip away leading spaces.]
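The difference between the two character informats noted above can be checked with a small sketch. The dataset and variable names (lead, trimmed, kept) are hypothetical, and the single data line deliberately begins with two blanks; the same five columns are read twice, once with each informat.

```sas
DATA lead;
  /* read columns 1-5 twice: once with $5., once with $CHAR5. */
  INPUT @1 trimmed $5. @1 kept $CHAR5.;
  /* write both values between brackets so leading blanks are visible in the log */
  PUT '[' trimmed $CHAR5. '] [' kept $CHAR5. ']';
  DATALINES;
  abc
;
RUN;
```

The variable trimmed should hold abc starting in its first position, because $5. drops the leading blanks, while kept should hold the two blanks followed by abc, because $CHAR5. preserves them.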
For data stored in a fixed format (i.e., each item is located in specific columns), you can specify exactly which columns each variable is to be read from. Also, if the data to be read are not numeric, you must specify whether they are character, date, or some other type. For example, assume you have data stored in a file called 'mydata.prn' with the following 'fixed' column format:

frh 1999/09/16  4.12  344 3.2
sef 1999/01/09   .78   24 2.0
wac 1999/02/22 12.61  357 3.0
fds 1999/03/13  5.61  401 4.1

The following INPUT statement specifies the exact column where each data item begins, a variable name, and its associated informat:

DATA a;
INFILE 'mydata.prn';
INPUT @1 intl $char3. @5 date yymmdd10. @16 x1 5.2 @22 x2 4. @27 y 3.1;
RUN;

The following DATA step uses an INPUT statement with a very different structure, but is equivalent to the previous one. Column ranges are given for every variable except the date, which still needs a pointer control and an informat because a column range alone cannot describe a date value:

DATA a;
INFILE 'mydata.prn';
INPUT intl $ 1-3 @5 date yymmdd10. x1 16-20 x2 22-25 y 27-29;
RUN;

This alternate form of the INPUT statement uses a range of numbers listed after the variable name to define the column limits within which each data item appears. It is especially useful when columns contain a mixture of integer and decimal values, or when the data may lie anywhere within the column boundaries, as shown below. (With an informat such as 3.1, a value entered without a decimal point would be divided by 10, whereas column input reads it as written.)

frh 1999/09/16  4.12  344 3.2
sef 1999/01/09   .78   24 2
wac 1999/02/22 12.61  357 3
fds 1999/03/13  5.61  401 4.1

Example: Read Data with Numerical Values Enclosed in Double Quotes

If your data are in a comma delimited text file and the values are surrounded by double quotes, SAS will not be able to read the numeric values without the DSD option and will write messages about invalid data to the LOG window.
To read values surrounded by double quotes, use the INFILE statement with the respective delimiter and the dsd option:

DATA a;
infile cards dlm=',' dsd;
input first $ last $ x y;
cards;
"john","smith",12,15
"tom","jones",1,14
;
PROC PRINT NOobs; RUN;

first    last      x     y
john     smith    12    15
tom      jones     1    14

If you want to keep the double quotes as part of the actual value, enter the INPUT statement modifier, the tilde (~), following the variable name.

DATA a;
INFILE cards dlm=',' dsd;
INPUT first ~ $ last ~ $ x y;
CARDS;
"john","smith",12,15
"tom","jones",1,14
;
PROC PRINT NOobs; RUN;

first     last        x     y
"john"    "smith"    12    15
"tom"     "jones"     1    14

Suppose you have a comma delimited text file containing numerical data in which values of 1000 or greater include commas and are enclosed within double quotes (such as "22,312.65"). One way to handle them is to read them in the DATA step as character data and then convert them to the desired numerical values (e.g., 22312.65). The LENGTH statement is needed when the maximum number of characters for any value to be read (including the double quotes) is more than 8. After the INPUT statement, the COMPRESS function removes the commas. The character value is then converted into a number with the INPUT function. The best12. informat helps to locate decimal values correctly when their placement differs across records.
DATA a;
LENGTH first last $15 ;
infile cards dlm=',' dsd;
input first last zip;
frst = INPUT(compress(first,','),best12.);
lst = INPUT(compress(last,','),best12.);
cards;
"73,201","1,091,052.916",97402
"011,234.023",52347.18,03857
003124,"22,312.65",68942
;
PROC contents ; RUN;
PROC PRINT NOobs;
VAR first frst last lst zip;
FORMAT frst lst 12.4 zip z5.;
* ZIP code is still numeric, yet printed with leading zeros below ;
RUN;

first                  frst    last                      lst      zip
73,201           73201.0000    1,091,052.916    1091052.9160    97402
011,234.023      11234.0230    52347.18           52347.1800    03857
003124            3124.0000    22,312.65          22312.6500    68942

* EXAMPLE: reading unknown line contents;
DATA one;
LENGTH line $30 a1 a2 a3 a4 $1 ;
INPUT;
line=_infile_ ;
a1 = substr(_infile_,1,1);
a2 = substr(_infile_,2,1);
a3 = substr(_infile_,3,1);
a4 = substr(_infile_,4,1);
cards;
abcdef123242
xyztvl34233333333333333333333
eiwuer333
;

The INFORMAT Statement in Association with the INFILE and INPUT Statements

If you have data in character or date format in a delimited file (comma or tab), it's often helpful to insert an INFORMAT statement *prior* to the INFILE and INPUT statements. For example, assume you have data like the earlier fixed-format example stored in a comma-separated value (CSV) file called 'mydata.csv':

Frh_abcdef,1999/09/16,4.12,344,3.2
Sef_bcdefg,1999/01/09,.78,24,2.0
Wac_qwerzx,1999/02/22,12.61,357,3.0
Fds_werwqw,1999/03/13,5.61,401,4.1

The statements in the DATA step to read this text file are listed below:

DATA a;
INFORMAT intl $char10. time yymmdd10.;
INFILE 'mydata.csv' dlm=',' dsd missover;
INPUT intl time x1 x2 y;
RUN;

Notice several changes in the statements entered:

1. The INFORMAT statement tells SAS the input format of specified variables that appear on the INPUT statement. In particular, the variable intl contains character data longer than 8 characters, hence the definition of $char10. as its informat.
When input formats are entered on the INFORMAT statement, it is not necessary to place an informat after that particular variable on the INPUT statement. (However, you still need to add the FORMAT statement as described in the next section.)

2. The INFILE statement must also be entered. However, the INPUT statement now looks like the one used for reading a free-formatted file. By default, variables are always read as numbers, so x1, x2, and y are assumed to be numbers in this example. The informats for the intl (character) and time (date) values were already defined on the INFORMAT statement, so they do not need to be defined again on the INPUT statement.

Reading Character Data from Delimited Text Files

If your data file contains delimiters, don't enter informat lengths on the INPUT statement (unless you nullify that length with a colon : as described below). It is better to precede the INPUT statement with an INFORMAT or LENGTH statement. Informats are only needed where the defaults do not apply (e.g., to read character data longer than 8 characters). The INPUT statement then needs no informats following the variable names.

DATA au;
INFORMAT Dept_Name $char30. Approver_Name $char12.;
INFILE cards delimiter='|' dsd missover;
INPUT au nada Dept_Name Approver_Name ;
CARDS;
1|0|SAN FRANCISCO MAIN |BRIAN E
3|1|CONXXX LOOP |RACHELLE L
8|0|COLUMBUS AVENUE OFFICE |GAIL Y
12|0|FILLMORE-CALIFORNIA OFFICE |ALBERT E
14|1|CREDIT CARD SERVICE CENTER BR |AMY C
18|0|GEARY-NINETEENTH AVENUE OFFICE|KERRY
;
proc print NOobs; run;

Dept_Name                         Approver_Name    au    nada
SAN FRANCISCO MAIN                BRIAN E           1       0
CONXXX LOOP                       RACHELLE L        3       1
COLUMBUS AVENUE OFFICE            GAIL Y            8       0
FILLMORE-CALIFORNIA OFFICE        ALBERT E         12       0
CREDIT CARD SERVICE CENTER BR     AMY C            14       1
GEARY-NINETEENTH AVENUE OFFICE    KERRY            18       0

Use of Trailing @ and @@

One option available with the INPUT statement is the trailing @ or @@. It allows you to read files with special data structures.
Its functionality becomes especially apparent when small data files are stored directly in the SAS program. As SAS reads a data file, the internal pointer is automatically moved to the beginning of the next line at the end of every INPUT statement. However, there are applications where you may want to enter data from multiple observations on one line. To do this you must hold the pointer at the location just after the last variable is read, process the data values, and then read the group of variables for the next observation. Depending on the structure of your data file, the trailing @ or @@ gives you this capability. For example, a single trailing @ allows you to read three observations from the same row with three INPUT statements:

DATA one;
INPUT id $ a b c @; OUTPUT;
INPUT id $ a b c @; OUTPUT;
INPUT id $ a b c ; OUTPUT;
CARDS;
a 1 2 3 b 4 5 6 c 7 8 9
d 11 12 13 e 14 15 16 f 17 18 19
;

The first INPUT statement reads the first group of values on the line, a 1 2 3, and OUTPUT writes it to the output dataset; the second INPUT statement then reads the second group, b 4 5 6, which is written in turn. The third INPUT statement reads c 7 8 9 and it is written as well. A final @ to hold the pointer is not needed after the third INPUT statement, since the pointer is automatically returned to the beginning of the next line of data at the end of the DATA step. The use of the trailing @ in this context is not very practical since there are better ways (see examples below).

Perhaps the most practical use of the single trailing @ is when all the data for one observation are on one row and you want to read a specific column of data to determine how to read the rest of that record.
For example, multiple types of observations can be included in one file:

DATA nmbrs(KEEP=x1 x2 x3) names(KEEP=name1 name2 name3);
INPUT rec_typ $ @;
IF rec_typ='a' THEN DO;
INPUT x1 x2 x3;
OUTPUT nmbrs;
END;
IF rec_typ='b' THEN DO;
INPUT name1 $ name2 $ name3 $;
OUTPUT names;
END;
CARDS;
a 11 22 33
b bill jane ted
a 211 322 343
b ed jill bob
;

As shown here, after the program reads rec_typ, the trailing @ holds the pointer at that location while the program determines whether the record is of type 'a' or type 'b'. If it is type 'a', the program reads numeric data with the first of the two remaining INPUT statements and writes them to a dataset for numbers; if it is type 'b', it reads character data with the second one and writes the names to a different dataset. The KEEP= option on the DATA statement will be explained in a subsequent section.

Here is another useful application of the trailing @:

DATA cheese;
INPUT cheese $ @;
DO response = 1 TO 9;
INPUT count @;
ch1=(cheese='A');
ch2=(cheese='B');
ch3=(cheese='C');
ch4=(cheese='D');
OUTPUT;
END;
INPUT;
CARDS;
A 0 0 1 7 8 8 19 8 1
B 6 9 12 11 7 6 1 0 1
C 1 1 6 8 23 7 5 1 0
D 0 0 0 1 3 7 14 16 0
;

In this example there are four cheeses labeled A, B, C, and D. For each cheese 52 people were asked to rate the taste on an ordinal scale from 1 (strong dislike) to 9 (excellent). The 9 numbers that follow the cheese ID on each row indicate how many of the 52 persons gave a response = 1, = 2, .., = 9.

If you have data from multiple observations on the same line but the DATA step loop returns to the top before the pointer reaches the end of the line, use @@ at the end of the INPUT statement:

DATA one;
INPUT id $ a b c @@;
OUTPUT;
CARDS;
a 1 2 3 b 4 5 6 c 7 8 9
d 11 12 13 e 14 15 16 f 17 18 19
;

This DATA step produces exactly the same dataset as the program that read the same data in the example above using the trailing @; notice that only one INPUT statement is now required.
The double trailing @@ is especially practical when you want to include large amounts of data in the program itself and don't want to scroll through many lines of data. It allows data from multiple observations to appear on each line following the CARDS; statement. However, it is a good idea not to split data from the same observation across rows.

SUMMARY: Efficient use of @ or @@ reduces to two simple rules:

* The trailing @ holds the pointer at the location reached by one or more INPUT statements until the bottom of the DATA step is reached; then the pointer is automatically sent to the beginning of the next data line. Use @ when all data for one observation are on the same line and you want to read multiple levels of a factor, or want to hold the pointer while you test the value of a variable to decide on a specific action.

* The double trailing @@ holds the pointer at its location even after the bottom of the DATA step loop is reached and execution returns to the top. Use @@ when data from different observations are placed in the same row.

INPUT MODIFIERS

List input is more versatile when you include format modifiers. The modifiers and their purposes are as follows:

| & | reads character values that contain embedded blanks. A single blank is not treated as a separator; the value ends at two or more consecutive blanks. |
| : | reads data values that need the additional instructions that informats can provide but that are not aligned in columns. Note: use formatted input and pointer controls to quickly read data values aligned in columns. |
| ~ | reads delimiters within quoted character values as characters and retains the quotation marks. |

Several of these input modifiers are used to deal with character input.

The Colon : Modifier

The : modifier with an informat reads character values longer than 8 bytes or numeric values that contain nonstandard values. Data values aligned in columns can be easily read with formatted input and pointer controls.
In some situations, the lengths of values will not be constant, although you still need to know the maximum length for any given variable. The colon : modifier allows you to apply an informat to data that are *not* aligned in straight columns or to values of varying length within delimited files.

This example shows how the colon makes list input work correctly when informats are used. Leaving the colons out means that each informat is satisfied before the delimiters are considered, i.e., the $8. informat takes 8 characters without regard to the space delimiters.

* The INPUT statement does *NOT* include colons ;
DATA two;
INPUT a $8. b $4. ;
* the informat $8. treats spaces as part of the value read ;
CARDS;
dog cat
;

Note that variable b is blank, as printing the dataset shows:

PROC PRINT Noobs ; Run ;

A          B
dog cat

* The next DATA step contains the colon modifier ;
DATA two;
INPUT a : $8. b : $4. ;
* With the colon, spaces are observed as delimiters;
CARDS;
dog cat
;
proc print Noobs; Run;

A      B
dog    cat

If character data are indented slightly at the beginning of a row, SAS counts the opening spaces as part of the five columns read by the informat $5. (if there is no column specification on the INPUT statement).

DATA tst;
INPUT a $5. b $ c $ ;
CARDS;
  WU411 two three
;
PROC PRINT NOobs; RUN ;

A      B     C
WU4    11    two

However, placing the colon between the variable name and its informat ($5.) results in the desired output:

DATA tst;
INPUT a : $5. b $ c $ ;
CARDS ;
  WU411 two three
;

Output:

A        B      C
WU411    two    three

Assuming you have a comma delimited file called demo.dat in which the lengths of values are not constant, use a colon modifier on a formatted INPUT statement with a number that indicates the maximum number of characters expected for that variable, e.g.:

DATA c;
INFILE 'demo.dat' dlm=',' dsd;
INPUT frst_nme :$20. lst_nme :$16.
;

This approach does not require you to know the actual length of each value or even the exact maximum length; just be sure to pick a number of characters large enough to read all your data for each variable.

The ? Modifier

The ? modifier suppresses what appears in the log in the case of invalid data, nothing more. A single ? suppresses just the MESSAGE that invalid data have been encountered, but the log still prints all of the 'lines of input data' that contained invalid data. With two question marks, ??, the printing of those 'lines of input data' in the log is suppressed as well.

DATA tst;
INPUT a ?? b ?? c ??;
CARDS;
4 3 5
* 2 1
4 5 m
;

The following text appears in the log file:

32   data tst;
33   input a ?? b ?? c ??;
34   cards;
NOTE: The data set WORK.TST has 3 observations and 3 variables.
NOTE: DATA statement used:
      real time           0.01 seconds

PROC PRINT gives the following output:

a    b    c
4    3    5
.    2    1
4    5    .

There may be a few applications for the ? modifier, but it is rarely found in a program. If you want to suppress anything (and it is usually dangerous to do so), the printed data lines with errors are what you would least want to see, so the double question mark ?? has some utility. The reverse combination might make more sense: suppress the printing of the lines of input data but retain the message that invalid data have been encountered (e.g., in situations in which that shouldn't happen). For now, however, that is not an option.
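If you want a quiet log but still want to know where invalid values occurred, one possible workaround is to read the fields as character data and convert them yourself with the INPUT function, which also accepts the ?? modifier. The sketch below reuses the three data lines from the example above; the dataset name tst2 and the variables a_chr, b_chr, c_chr, and n_bad are made up for this illustration.

```sas
DATA tst2;
  INPUT a_chr $ b_chr $ c_chr $;   /* read every field as character first */
  a = INPUT(a_chr, ?? 8.);         /* ?? suppresses the invalid-data notes */
  b = INPUT(b_chr, ?? 8.);
  c = INPUT(c_chr, ?? 8.);
  n_bad = NMISS(a, b, c);          /* count of values that did not convert */
  IF n_bad > 0 THEN DO;
    PUT 'Value could not be converted in record ' _N_=;
    PUT _INFILE_;                  /* echo the offending data line */
  END;
  DROP a_chr b_chr c_chr;
  DATALINES;
4 3 5
* 2 1
4 5 m
;
RUN;
```

In this simple form a legitimately missing value (a single period in the data) would be counted in n_bad as well, so the check would need refining for data that contain true missing values.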
The & Modifier

Suppose you have data in Excel that look like this:

Code    Recode    Desc
1       2         Value1
2       2         Value1
3       6         Value Other
4       6         Value Other
5       4         Value2

You read this file with the following code:

filename q2 dde "excel|q2!r2c2:r42c4";
data rc_q2;
length desc $255.;
infile q2;
input code recode desc;
run;

It works, except that the Desc field is truncated wherever there is a space, so Code 3 ends up being:

3 6 Value

instead of

3 6 Value Other

To get the whole field, enter the & modifier after the variable name:

input code recode desc &;

The following example reads values separated by a mix of commas, blanks, and tabs. The delimiters are given as the hexadecimal string '2C2009'x (comma, blank, tab), and the LIST statement writes each input line to the log:

data t1;
infile cards delimiter='2C2009'x;
input x1-x3;
list;
cards;
1 2,3
1,2,3
1 2 3
1,2 3
1 2 3
1,2,3
1,2 3
3 4 5
;
proc print;run;
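The effect of the & modifier can also be tried without Excel. The following sketch uses in-stream data lines modeled on the table above; the dataset name rc_demo is hypothetical. Because desc is the last variable on each line, its value runs to the end of the line; a variable followed by further fields would need at least two blanks after its value.

```sas
DATA rc_demo;
  LENGTH desc $ 20;            /* allow values longer than 8 characters */
  INPUT code recode desc &;    /* & lets desc contain single embedded blanks */
  DATALINES;
1 2 Value1
3 6 Value Other
5 4 Value2
;
RUN;

PROC PRINT DATA=rc_demo NOOBS; RUN;
```

Without the &, the second observation would stop at the blank and keep only Value; with it, desc should hold Value Other in full.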