Section 7.2 Working with Character Data in the DATA Step The DATA step allows one to read, write, and manipulate many types of data. SAS functions provide much flexibility and convenience when working with character data. Although there are many more than indicated here, the following functions find frequent applications when working with character data. CAT_: Concatenate two or more character variables into one COMPRESS: remove every occurrence of one or more designated characters from a string COMPBL: remove blanks INPUT and PUT: Convert character data to numbers and numbers to character PROPCASE: Convert String to Title Format REPEAT: Add leading Zeros SCAN: selects characters between specified delimiters SUBSTR: extract a contiguous subset of characters from a string TRANSLATE: UPCASE, L0WERCASE, and PROPCASE VLABEL: Capture a Variable Label into a New Variable The CAT_ functions: Concatenate two or more character variables into one CAT is short for concatenate, which means to join character data together into one value. Previous versions required the double vertical slashes || be placed between the two character variable names to be joined; you would also need to enter another function such as TRIM() to remove unwanted spaces. The new CAT_ functions first available in version 9 allow for more concatenate options. The 3-letter CAT() function works much like the concatenation operator. Between the () included the character variable names, separated by commas, to be joined. You may need to first trim spaces from the left and/or right of the value, thus the TRIM() function may need to be applied before the concantenate takes place. Also, set the length of the new variable as the number of characters to be included in the longest value. LENGTH new_ch_var $15; * enter 15 if the number of digits of the longest value is 15; new_ch_var = CAT(TRIM(a),TRIM(b),TRIM(c)); The 4-letter CAT_ functions eliminate the need for TRIM and include: CATT removes trailing blanks from each argument before concatenating CATS strips leading and trailing blanks before concatenating CATX is like CATS but places a delimiter between the values which are concatenated To insert a delimeter such as an underscore, a punctuation mark, or some other character between values, these characters can be entered with the concatenation operator - two vertical lines - as shown in the DATA step below for the variable location1 (as computed with Versions 8 and earlier). The task of entering a delimeter can now be done with the new CAT_ functions available in Version 9. In the DATA step the variable location2 is produced by one of the CAT function, which works in a similar way to the vertical lines. First, enter the new character variable name in a LENGTH statement; otherwise, SAS assigns a length of 200 rather than a much shorter value that you would likely need. The function CATX automatically trims spaces and inserts punctuation or other constants of your choice as shown for location3. DATA places; LENGTH location1 location2 location3 $25 city $10 state $12 zip $5; INPUT city state zip; location1= TRIM(city)||', '||TRIM(state)||' '||zip; * the LENGTH statement adds spaces to the contents; location2=CAT(TRIM(city),', ',TRIM(state),' ',TRIM(zip)); location3=CATX(", ",city,state,zip); * CATX automatically trims out spaces and enters your choice of character delimeters; DATALINES; Ashland Oregon 97520 Bertrand Nebraska 68927 ; PROC PRINT DATA=places NOobs; RUN; city state zip location1 location2 Ashland Oregon 97520 Ashland, Oregon 97520 Ashland, Oregon 97520 Bertrand Nebraska 68927 Bertrand, Nebraska 68927 Bertrand, Nebraska 68927 location3 Ashland, Oregon, 97520 Bertrand, Nebraska, 68927 If you want to convert two or more numerical codes into part of a character ID the PUT function (see below) and the z formats work like a charm: DATA total ; LENGTH plotchar $3 plotid1 plotid2 $5 tdrctn $1 ; INPUT plot transect tdrctn; plotchar = PUT(plot,z3.); * convert the number plot to a 3 digit character; plotid1 = PUT(transect,$1.)||tdrctn||plotchar; * v. 8 CAT approach; plotid2 = CAT(PUT(transect,$1.),tdrctn,plotchar); * v. 9 CAT function; CARDS; 60 2 N 120 2 N 240 2 N 0 2 S 90 2 S ; PROC PRINT DATA=total NOobs; VAR transect tdrctn plot plotchar plotid1 plotid2; run; transect tdrctn plot plotchar plotid1 plotid2 2 N 60 060 2N060 2N060 2 N 120 120 2N120 2N120 2 N 240 240 2N240 2N240 2 S 0 000 2S000 2S000 2 S 90 090 2S090 2S090 The CATT function concantenates all character data in the following manner: DATA tst; a='abc'; b='_cde'; f='ekwn'; charvar = CATT(of _character_); output; a='rsc'; b='_feg'; f='zeen'; charvar = CATT(a,b,f); output; proc print; run; a b f charvar abc _cde ekwn abc_cdeekwn rsc _feg zeen rsc_fegzeen COMPRESS: removes every occurrence of one or more designated characters from a string COMPRESS(source , < 'list-of-characters-to-remove' >) where source specifies the name of the variable that contains the chosen characters to remove. For example: DATA one; x="A.B' (C=D);"; y=COMPRESS(x,".;()'"); len_x=length(x); len_y=length(y); *put y; run; PROC PRINT; VAR x len_x y len_y; run; results in Obs x len_x y len_y 1 A.B' (C=D); 11 AB C=D 6 Note that the COMPRESS function has removed the specified characters and made the variable length shorter. It also removes single quotes only when this character is surrounded by double quotes "" in the specification of characters to remove. COMPBL, TRIM, LEFT, RIGHT, STRIP These functions remove excessive blanks from string variables. COMPBL(): compress multiple, consecutive blanks in a string into a single blank. Suppose you have a character variable with several words having more than one space. For example, this function takes a string variable such as: Phrase='The quick brown fox' phrase=COMPBL(phrase) removes two or more spaces between consecutive words and replaces them with 1: phrase='The quick brown fox' If you specify a blank as one of the characters to remove from the source string, the COMPRESS function is similar to the COMPBL function. However, note that, COMPRESS: removes all spaces from the source string COMPBL: compresses multiple spaces into a single space and has no affect on a single space INPUT and PUT Which direction do you want to proceed? Char To Num ----> Apply INPUT Num to Char ----> Apply PUT For data type conversions in the DATA step, the INPUT function (which has a quite separate purpose from the INPUT statement) converts character values into numbers. Conversely, the PUT function converts numbers into character values. These functions help maintain data integrity and make various types of computations or data manipulations possible. INPUT: Character to Number numvar = INPUT(charvar,); For example, if charvar=' 1080' then: cx=LEFT(charvar); * remove the leading blanks so cx = ’1080’ ; numvar = INPUT(cx,4.0); produces a variable that can treated as a number: numvar = 1080. If your character representation of numerical data already contains a decimal, be careful how you apply a format with decimal implications to read it: DATA test; LENGTH a $6 ; a="11345"; x=INPUT(a,10.2); xy=INPUT(a,10.); output; a="11.345"; x=INPUT(a,10.2); xy=INPUT(a,10.); output; run; PROC PRINT DATA=test NOobs; FORMAT x xy 10.3; run; produces: a x xy 11345 113.450 11345.000 11.345 11.345 11.345 ^^^^^^^^ Notice that in row2 x=11.345 will be truncated to 11.34 without the LENGTH statement that variable "a" will be treated as 6 characters, not 5. In row 2 variable "a" is read correctly, with either the 10.2 or 10.0 input format. However, in row 1 x=11345 (a number without a decimal) is misread with the 10.2 format. One handy application of the INPUT function reads character dates with slashes and convert them into SAS date values. For example, DATA _null_; LENGTH ch_date $10; FORMAT date_of_contact mmddyy10. ; ch_date="01/19/2001"; * assume for this example ch_date is a char format; date_of_contact = INPUT(ch_date, mmddyy10.) ; run; The variable date_of_contact will now be treated as a SAS date value rather than a character string. Another occasionally helpful application of the INPUT function converts data stored as character type into a numeric value, such as the actual text representation of Likert responses (q1) into ordered numbers (s1) in the example below. First, make a format with PROC FORMAT to define the conversions: PROC FORMAT; * this INVALUE statement converts the character representation of a Likert scale into an ordered number; INVALUE $infmta 'Strongly disagree'=1 'Disagree'=2 'Agree'=3 'Strongly agree'=4 ; * this VALUE statement converts the numeric value to its associated text; VALUE fmta 1 ='Strongly Disagree' 2 ='Disagree' 3 ='Agree' 4 ='Strongly Agree' .='Missing' ; RUN; DATA surv; SET xcel_sh ; LENGTH s1 3; s1 = INPUT(q1,$infmta.); run; prints: Obs q1 s1 1 Strongly disagree 1 2 Disagree 2 3 Agree 3 4 Strongly agree 4 PUT: Convert a Number to Character The PUT function performs the reverse application; it converts numerical values into character: charvar = PUT(numvar,); NOTE: You may assign SAS dates directly with the letter d after the date in double quotes: ch_date="01JUL2001"d ; This option will not work with the mmddyy format entered above. The -L option is handy device to avoid leading spaces in the resulting character value. Since character data are defined a default length of 8 (unless otherwise specified), it essentially avoids the need to trim data at a later application when the actual length will vary from the default, or the length specified on a LENGTH statement. In fact, in the example below if the LENGTH of v1c were changed to $6, since the numeric format for v1 is 8, for v1=2, v1c is missing and for v1=222, v1c =2. data one; LENGTH v1c $8; v1=2; V1c = PUT( V1 , 8. ) ; output; v1=222; V1c = PUT( V1 , 8. -L ) ; output; v1=292827; V1c = PUT( V1 , 8. -L ) ; output; proc print; run; Obs v1 V1c 1 2 2 2 222 222 3 292827 292827 Be careful when applying the PUT function in the presence of missing values. When you convert variables to character with: newcharvar=put(oldnumericvar, 1.); PROC CONTENTS states the new variables are character; however when they are printed, the missing values are still listed as a . instead of a blank. This result is within the scope of the function, as it "Returns a value using a specified format". The format "1." is a numeric format and represents a missing as a ".". To complete the conversion, add the TRANSLATE function: newcharvar=TRANSLATE(put(oldnumericvar, 1.), ' ', '.') ; Note: This approach will get you into trouble if you actually have decimal formatted numbers. If so, you can apply the ROUND or ROUNDE function to round to the nearest integer. PROPCASE() The PROPCASE function copies a character argument and converts all uppercase letters to lowercase letters. It then converts to uppercase the first character of a word that is preceded by a blank, forward slash, hyphen, open parenthesis, period, or tab. PROPCASE returns the value that is altered. DATA _null_; input place $ 1-40; name=propcase(place); PUT name; DATALINES; INTRODUCTION TO THE SCIENCE OF ASTRONOMY VIRGIN ISLANDS (U.S.) SAINT KITTS/NEVIS WINSTON-SALEM, N.C. ; in the log window you read: Introduction To The Science Of Astronomy Virgin Islands (U.S.) Saint Kitts/Nevis Winston-Salem, N.C. REPEAT * Add leading zeros to character variable; DATA bb; input in $10.; length out $ 18; out = repeat('0',17-length(in)) || in; * 17, not 18 - see doc ; put in $10. out=; datalines; 12345678 5478 678 run; proc print; run; Obs in out 1 12345678 000000000012345678 2 5478 000000000000005478 3 678 000000000000000678 SCAN: Selects a word from a delimited character expression SCAN(name,n,) name: specifies any character expression or variable name. n: specifies a numeric expression that produces the number of the word in the character string you want SCAN to select. delimiters: specifies a character expression that produces characters that you want SCAN to use as word separators in the character string. If you omit delimiters in an ASCII environment, SAS uses the following characters: blank . < ( + & ! $ * ) ; ^ - / , % | Enter a specified delimiter within quotes, e.g., ‘/’ SCAN has two features to be aware of. For one, it cannot handle internal commas, Joan, "Johnson, Betty", Richard, Frank, ...etc. It also cannot work with consecutive commas to indicate a missing item: Joan,, Richard, Frank, ...etc. where SUB2 is blank. To eliminate leading spaces you may also want sub(i) = left(scan(list,i,',')); SUBSTR: Extract a contiguous subset of characters from a string SUBSTR is a character handling function that extracts a string value (a subset of consecutive characters) from a ‘longer’ string variable. SUBSTR(,,); If address='1025 Parker' then str_num = SUBSTR(address,1,4); means to start with the first character and take the next 4 consecutive digits and place them into the character valued str_num = ’1025’. If a variable subject='fne01_cy', then the statement: id = SUBSTR(subject,4,2); will assign the variable id a 2-digit character value of 01 since the fourth digit of subject is 0 and the length desired beginning with the fourth digit is 2. In case the value of ‘subject’ is not at least 5 digits in length you will get the following message: NOTE: Invalid third argument to function SUBSTR at line XXX column YY. The SUBSTR function can also be utilized with numerical data as follows (note the conversion to character data which takes place first in order to get the right answer): DATA one; defcredt = 1283357743. ; output; defcredt = 1283357743.2 ; output; defcredt = 1283357743.0 ; output; defcredt = 1283357743.88 ; output; DATA str; SET one; we1 = SUBSTR(PUT(defcredt,12.1),5,2); we2 = SUBSTR(LEFT(defcredt),5,2); we3 = SUBSTR(defcredt,5,2); run; proc print NOobs; run; defcredt we1 we2 we3 1283357743.0 35 35 83 1283357743.2 35 35 35 1283357743.0 35 35 83 1283357743.9 35 35 35 The reason 83 appears for for two records under we3 (and not 35) is that SAS applies the Best12. format to the number 1283357743. Since it has a zero value after the decimal, it is two digits shorter under the Best12. format. With this format the character value of the number read changes to " 1283357743" so that two blanks appear before the number 1. When the SUBSTR function is applied directly on the number, the fifth and sixth characters in " 1283357743" are 83. The main point of the last statement for we3 is that it applies a string function directly to numerical data. The PUT() or at least the LEFT() functions allow you control over how the variables are translated into a string, so that you don't risk getting incorrect results. TRANSLATE: replace one character with a different character; e.g., replace a period, '.', with the underscore, '_'. DATA a; INFORMAT var $CHAR6. ; FORMAT var newvar $CHAR6. ; INPUT var ; newvar=TRANSLATE(var,'_','.'); CARDS; 21.3 123.45 32.33 ; proc print NOobs; var newvar 21.3 21_3 123.45 123_45 32.33 32_33 UPCASE(): change text to all capital letters LOWERCASE(): change text to all lowercase letters PROPCASE(): change text to that of a proper name Assume you have several character variables in your dataset and the values (observations) for those variables are all in mixed case. The object is to change all those variables to caps. The data are: obs PT_TM 1 headache 2 systolic blood pressure 3 DIASTOLIC BLOOD PRESSURE The UPCASE() function pt_tm = upcase(pt_tm); changes it to: obs PT_TM 1 HEADACHE 2 SYSTOLIC BLOOD PRESSURE 3 DIASTOLIC BLOOD PRESSURE The function is especially necessary when testing character data with IF statements when the response values may not be consistently entered in one case or if you do not know the values are all upper or lower case, e.g., DATA one; rsp1='No'; rsp2='Yes'; a1=0; a2=0; IF (UPCASE(rsp1) EQ 'No') then a1=1; IF (UPCASE(rsp2) EQ 'YES') then a2=1; PUT 'a1=' a1 'a2=' a2; RUN; The LOG file yields a1=0 a2=1