Section 7.2 Working with Character Data in the DATA Step
The DATA step allows one to read, write, and manipulate many types of
data. SAS functions provide much flexibility and convenience when working
with character data. Although there are many more than indicated here, the
following functions find frequent applications when working with character
data.
CAT_: Concatenate two or more character variables into one
COMPRESS: remove every occurrence of one or more designated
characters from a string
COMPBL: remove blanks
INPUT and
PUT: Convert character data to numbers and numbers to character
PROPCASE: Convert String to Title Format
REPEAT: Add leading Zeros
SCAN: selects characters between specified delimiters
SUBSTR: extract a contiguous subset of characters from a string
TRANSLATE:
UPCASE, L0WERCASE, and PROPCASE
VLABEL: Capture a Variable Label into a New Variable
The CAT_ functions: Concatenate two or more character variables into one
CAT is short for concatenate, which means to join character data together into one value.
Previous versions required the double vertical slashes || be placed between the two
character variable names to be joined; you would also need to enter another function such
as TRIM() to remove unwanted spaces. The new CAT_ functions first available in version 9
allow for more concatenate options.
The 3-letter CAT() function works much like the concatenation operator. Between the ()
included the character variable names, separated by commas, to be joined. You may need to
first trim spaces from the left and/or right of the value, thus the TRIM() function may
need to be applied before the concantenate takes place. Also, set the length of the new
variable as the number of characters to be included in the longest value.
LENGTH new_ch_var $15; * enter 15 if the number of digits
of the longest value is 15;
new_ch_var = CAT(TRIM(a),TRIM(b),TRIM(c));
The 4-letter CAT_ functions eliminate the need for TRIM and include:
CATT removes trailing blanks from each argument before concatenating
CATS strips leading and trailing blanks before concatenating
CATX is like CATS but places a delimiter between the values which
are concatenated
To insert a delimeter such as an underscore, a punctuation mark, or some other character
between values, these characters can be entered with the concatenation operator - two
vertical lines - as shown in the DATA step below for the variable location1 (as computed
with Versions 8 and earlier).
The task of entering a delimeter can now be done with the new CAT_ functions available in
Version 9. In the DATA step the variable location2 is produced by one of the CAT function,
which works in a similar way to the vertical lines. First, enter the new character variable
name in a LENGTH statement; otherwise, SAS assigns a length of 200 rather than a much
shorter value that you would likely need. The function CATX automatically trims spaces and
inserts punctuation or other constants of your choice as shown for location3.
DATA places;
LENGTH location1 location2 location3 $25 city $10 state $12 zip $5;
INPUT city state zip;
location1= TRIM(city)||', '||TRIM(state)||' '||zip; * the LENGTH statement adds
spaces to the contents;
location2=CAT(TRIM(city),', ',TRIM(state),' ',TRIM(zip));
location3=CATX(", ",city,state,zip); * CATX automatically trims out spaces and
enters your choice of character delimeters;
DATALINES;
Ashland Oregon 97520
Bertrand Nebraska 68927
;
PROC PRINT DATA=places NOobs; RUN;
city state zip location1 location2
Ashland Oregon 97520 Ashland, Oregon 97520 Ashland, Oregon 97520
Bertrand Nebraska 68927 Bertrand, Nebraska 68927 Bertrand, Nebraska 68927
location3
Ashland, Oregon, 97520
Bertrand, Nebraska, 68927
If you want to convert two or more numerical codes into part of a character ID the PUT
function (see below) and the z formats work like a charm:
DATA total ;
LENGTH plotchar $3 plotid1 plotid2 $5 tdrctn $1 ;
INPUT plot transect tdrctn;
plotchar = PUT(plot,z3.); * convert the number plot to a 3 digit character;
plotid1 = PUT(transect,$1.)||tdrctn||plotchar; * v. 8 CAT approach;
plotid2 = CAT(PUT(transect,$1.),tdrctn,plotchar); * v. 9 CAT function;
CARDS;
60 2 N
120 2 N
240 2 N
0 2 S
90 2 S
;
PROC PRINT DATA=total NOobs;
VAR transect tdrctn plot plotchar plotid1 plotid2;
run;
transect tdrctn plot plotchar plotid1 plotid2
2 N 60 060 2N060 2N060
2 N 120 120 2N120 2N120
2 N 240 240 2N240 2N240
2 S 0 000 2S000 2S000
2 S 90 090 2S090 2S090
The CATT function concantenates all character data in the following manner:
DATA tst;
a='abc'; b='_cde'; f='ekwn'; charvar = CATT(of _character_); output;
a='rsc'; b='_feg'; f='zeen'; charvar = CATT(a,b,f); output;
proc print; run;
a b f charvar
abc _cde ekwn abc_cdeekwn
rsc _feg zeen rsc_fegzeen
COMPRESS: removes every occurrence of one or more designated
characters from a string
COMPRESS(source , < 'list-of-characters-to-remove' >)
where source specifies the name of the variable that contains the chosen characters to
remove. For example:
DATA one;
x="A.B' (C=D);";
y=COMPRESS(x,".;()'");
len_x=length(x);
len_y=length(y);
*put y;
run;
PROC PRINT; VAR x len_x y len_y; run;
results in
Obs x len_x y len_y
1 A.B' (C=D); 11 AB C=D 6
Note that the COMPRESS function has removed the specified characters and made the variable
length shorter. It also removes single quotes only when this character is surrounded by
double quotes "" in the specification of characters to remove.
COMPBL, TRIM, LEFT, RIGHT, STRIP
These functions remove excessive blanks from string variables.
COMPBL(): compress multiple, consecutive blanks in a string into a single blank.
Suppose you have a character variable with several words having more than one space.
For example, this function takes a string variable such as:
Phrase='The quick brown fox'
phrase=COMPBL(phrase) removes two or more spaces between consecutive words
and replaces them with 1:
phrase='The quick brown fox'
If you specify a blank as one of the characters to remove from the source string, the
COMPRESS function is similar to the COMPBL function. However, note that,
COMPRESS: removes all spaces from the source string
COMPBL: compresses multiple spaces into a single space and
has no affect on a single space
INPUT and PUT
Which direction do you want to proceed?
Char To Num ----> Apply INPUT
Num to Char ----> Apply PUT
For data type conversions in the DATA step, the INPUT function (which has
a quite separate purpose from the INPUT statement) converts character
values into numbers. Conversely, the PUT function converts numbers into
character values. These functions help maintain data integrity and make
various types of computations or data manipulations possible.
INPUT: Character to Number
numvar = INPUT(charvar,);
For example, if charvar=' 1080' then:
cx=LEFT(charvar); * remove the leading blanks so cx = ’1080’ ;
numvar = INPUT(cx,4.0);
produces a variable that can treated as a number: numvar = 1080.
If your character representation of numerical data already contains a
decimal, be careful how you apply a format with decimal implications to
read it:
DATA test;
LENGTH a $6 ;
a="11345"; x=INPUT(a,10.2); xy=INPUT(a,10.); output;
a="11.345"; x=INPUT(a,10.2); xy=INPUT(a,10.); output;
run;
PROC PRINT DATA=test NOobs;
FORMAT x xy 10.3; run;
produces:
a x xy
11345 113.450 11345.000
11.345 11.345 11.345
^^^^^^^^
Notice that in row2 x=11.345 will be truncated to 11.34 without the LENGTH
statement that variable "a" will be treated as 6 characters, not 5. In row 2
variable "a" is read correctly, with either the 10.2 or 10.0 input format.
However, in row 1 x=11345 (a number without a decimal) is misread with the 10.2
format.
One handy application of the INPUT function reads character dates with
slashes and convert them into SAS date values. For example,
DATA _null_;
LENGTH ch_date $10;
FORMAT date_of_contact mmddyy10. ;
ch_date="01/19/2001"; * assume for this example ch_date is a char format;
date_of_contact = INPUT(ch_date, mmddyy10.) ;
run;
The variable date_of_contact will now be treated as a SAS date value
rather than a character string.
Another occasionally helpful application of the INPUT function converts
data stored as character type into a numeric value, such as the actual
text representation of Likert responses (q1) into ordered numbers (s1) in
the example below. First, make a format with PROC FORMAT to define the
conversions:
PROC FORMAT;
* this INVALUE statement converts the character representation of a Likert
scale into an ordered number;
INVALUE $infmta 'Strongly disagree'=1 'Disagree'=2
'Agree'=3 'Strongly agree'=4 ;
* this VALUE statement converts the numeric value to its associated text;
VALUE fmta 1 ='Strongly Disagree' 2 ='Disagree'
3 ='Agree' 4 ='Strongly Agree' .='Missing' ;
RUN;
DATA surv; SET xcel_sh ;
LENGTH s1 3;
s1 = INPUT(q1,$infmta.);
run;
prints:
Obs q1 s1
1 Strongly disagree 1
2 Disagree 2
3 Agree 3
4 Strongly agree 4
PUT: Convert a Number to Character
The PUT function performs the reverse application; it converts numerical
values into character:
charvar = PUT(numvar,