LEARN SAS Within 7 Weeks: Part2 (Introduction To SAS - The Data Step)
Unit 4
SAS for Data Management
Welcome.
As mentioned in the introduction to this unit (click on the Unit 4 tab), the two principal
building blocks of a SAS program are the DATA step and the PROC step. This
reading is a detailed introduction to the DATA step. The emphasis is on using the
DATA step for purposes of reading, displaying, and writing data. Not described, but
possible, is use of the DATA step to accomplish other tasks, such as simulations.
1. To understand the nature of, and purposes of, the DATA step;
2. To be able to read data into SAS from a variety of platforms (instream, external
file, other SAS data set);
3. To appreciate, and be competent in, the formatting of data for ease of readability;
6. To be familiar with the SAS viewtable feature and to appreciate that this is not
recommended for use in data editing; and
week 08 8.1
Week 8 Introduction to SAS – The DATA Step
3. How to Input Data Stored in Text Format (the INFILE and INPUT statements)
6. How to Read and Write Data from One or More SAS Data Sets to Another (the SET statement)
7. Writing Data to ASCII Files from SAS (the FILE and PUT statements)
SAS represents data in tabular or rectangular form, where each column represents a
field or variable, which must be named, and each row represents a record or
observation. SAS numbers the observations sequentially; if the data are sorted by
some field, such as age, the observations will be renumbered sequentially after
sorting. The observation number is not stored with the data, but is printed or
displayed as a convenience.
[Figure: typical listing of data in SAS from the PRINT procedure, shown as a standard listing and in HTML table format.]
The DATA step is the most common method of data input or output from the SAS
system. The DATA step consists of several SAS statements, where the particular
statements required depend upon the source of data input. All DATA steps begin with
a DATA statement.
When you have a small amount of data that can be entered directly by typing it in
within a program, you may choose instream data entry using the CARDS statement.
This is most common when trying a small example or testing out a new program.
The following example creates a temporary SAS dataset called A1 with 3 variables
and 4 observations.
• The INPUT statement names the variables or fields that are to be read.
• The CARDS statement indicates that data lines follow, and the semicolon (;) on
the line after the data indicates the end of the data lines.
• A RUN statement is used at the end of each DATA or PROC step in SAS so
that the group of statements will be executed. This is optional if the data step is
followed by another data step or proc step – but you must have it at the end of a
program or the last step will not be executed.
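A minimal sketch of an instream DATA step of this kind is shown here. It borrows the variable names SID, AGE and HEIGHT and the four data lines from the list input example that appears later in this reading, so it may differ in detail from the example originally shown at this point.

```sas
DATA A1;                    /* creates the temporary data set A1 */
  INPUT SID AGE HEIGHT;     /* names the three variables to be read */
  CARDS;                    /* data lines follow */
1 7 40
2 26 64
3 41 60
14 29 66
;
RUN;                        /* executes the step */
```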
3. How to Input Data Stored in Text Format (the INFILE and INPUT statements)
More commonly data is read in from other sources, such as ASCII data files, or from
other SAS data files rather than appearing instream in the program. The basic
syntax of a DATA step when reading the data from an ASCII file is as follows:
DATA NEW1 ; /* NEW1 is the name of the new SAS data set */
INFILE 'C:\TEMP\RAW.DTA'; /* specifies the file RAW.DTA on C:\TEMP */
INPUT VAR1 VAR2 ; /* specifies names for variables */
RUN;
The INFILE statement identifies an ASCII data file stored on any disk drive or
directory by specifying the appropriate path. The path and
filename must be enclosed in quotes (single quotes are used in the example above). Many options are available to tailor the
INFILE statement to a particular data set. For example, the number of columns to be
read can be controlled with a linesize or logical record length specification on the
INFILE statement. For more details see the SAS Language Guide or SAS HELP.
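As a hedged illustration of such an option, the sketch below adds LRECL= (logical record length) to the INFILE statement from the example above. The value 200 is an arbitrary assumption chosen only for illustration; the appropriate value depends on the file being read.

```sas
DATA NEW1;
  INFILE 'C:\TEMP\RAW.DTA' LRECL=200;   /* allow records up to 200 columns (assumed value) */
  INPUT VAR1 VAR2;
RUN;
```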
Following the INFILE statement in SAS will be an INPUT statement that specifies the
correspondence between variable names assigned in SAS and columns in the ASCII
data file. This is where variable names are assigned. This statement will be
When the data file to be input is itself a SAS data file, the DATA step takes on a
slightly different form. A SAS data file already has the columns identified with
variable names, and so the INPUT statement is not needed. The following example
reads a previously stored SAS data file called example3, and creates a temporary
• The LIBNAME statement assigns a “nickname” (SAS calls this the libref)
together with its companion pointer to the path (the drive and directory) where
the SAS data set is to be saved. Consider the libname statement
• The SET statement names the SAS data set that is to be read in.
• When a single level name (single word, no dot ‘.’ followed by an extension) is
used, SAS creates a "working" (meaning temporary) data set while you are running the SAS system. Thus,
as soon as you close SAS the "working" data sets are erased. Working data
sets are stored in the SAS WORK library. You can view active SAS libraries in
• To save a SAS data set as a permanent data set – one that will be there after
you exit from the SAS software – a two level (libref.dsn) name must be given
dataset.
o The first part of the name (the library reference or libref) matches
exactly the nickname (which points to the path comprised of drive and
In order to create a permanent (saved) SAS data set, you need to run the following
LIBNAME IN 'A:\HW3';
DATA IN.A2;   /* IN.A2 is the name you would like to call your permanent SAS data set; */
              /* the libref (IN) before the dot (.) must match the name you wrote on a */
              /* LIBNAME statement */
SET A2;       /* A2 is the name of the temporary SAS data set that you want to save */
RUN;
extension added. You will see this extension when you look at the file
The location, or path (disk drive and directory) of SAS data files, is specified in a
LIBNAME statement. If you double-click on this icon, the SAS Windows will open,
DO NOT change the name of a SAS data file in Windows Explorer or My Computer.
Information on the external file name is saved within the file. If you rename
A2.sas7bdat to be A3.sas7bdat you will get an error message when you try to open
You can think of the directories on hard disk or floppy disks as libraries for storing
data. The LIBNAME statement is simply a pointer, an instruction that says “I’m
pointing to” a location. The location that is pointed to is a directory and subdirectory
(SDATA and IN, in the above examples) that refers to a specific location (library) for
being provided.
• Libraries can also be defined from the toolbar. New library button:
Using the new library button lets you define the LIBREF (or code word for that
library), the ENGINE (or data format) and the PATH (drive and directory).
the definition of the library becomes part of your program, and will be re-
defined each time the program is run. If you use the toolbar to set your library,
you must remember to set up your libraries each time you re-open SAS.
You must have a separate library defined for each version (engine) of SAS
Older versions of SAS stored data in different formats. SAS refers to these as
“engines”. For example, version 6.12 of SAS used a default extension of .SD2.
Earlier DOS versions (6.04) of SAS used the extension .SSD . If you know you are
reading SAS data files that were saved with an earlier version of SAS, you must have
these data sets stored in a different directory or subdirectory from V8 SAS data files.
For example, the following lines could be used to read an old SAS data set (version
6.12), and save a copy of it in the new SAS (version 8.2) format:
LIBNAME OLD V612 'C:\OLDSAS'; /* Old uses v612 engine, .sd2 format */
LIBNAME NEW V8 'C:\TEMP'; /* New v8 engine, .sas7bdat */
DATA NEW.D1;
SET OLD.D1;
RUN;
Two libname statements are used to name 2 directories, the first called OLD, which
contains the file D1.SD2, version 6.12 format. The new data set, D1.sas7bdat, will
be saved in the C:\TEMP directory. The “engine” or version of SAS that created the
data set (in this example, they are v612 and v8) can be named before the path
specification on the libname statement. If you are unsure of the engine, it is not
required, as long as only one type of SAS file can be found in that directory.
Take care that data stored by older versions of SAS or other formats that will
be used in SAS, are stored in separate directories, otherwise you will get an
Note that the SAS engine names begin with V for version. Therefore, avoid using a
library name such as Vnnn, where nnn is a number. A list of engine names can be
6. How to Read and Write Data from One or more SAS Data Sets to Another
When data is already in SAS format, use a SET statement after the DATA statement
The next example reads two SAS data files, and concatenates them, storing the
result as a single new SAS data set in the same directory. If you want to store the
required.
The SET statement in the DATA step can list a single SAS data file, or many files.
Various options are available using the SET statement to help tailor how the two files
will be combined. The SET statement may also be replaced by a MERGE statement
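As a minimal sketch of concatenation with SET, the step below stacks two stored data sets into one. The libref IN, the path, and the data set names A1, A2 and BOTH are hypothetical and chosen only for illustration; the two input data sets are assumed to contain the same variables.

```sas
LIBNAME IN 'C:\TEMP';      /* hypothetical directory holding the stored data sets */

DATA IN.BOTH;              /* two-level name: BOTH is saved in the same directory */
  SET IN.A1 IN.A2;         /* all observations of A1, followed by those of A2 */
RUN;
```

Listing more data set names on the SET statement stacks them in the order given.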
7. Writing Data to ASCII Files from SAS (the FILE and PUT statements)
It is also possible to create ASCII files from SAS datasets. This can be useful for
transferring data into other programs for specific applications. Creation of ASCII
output data files from SAS data sets makes use of a combination of the LIBNAME
and SET statements and a FILE statement. Data are specified for output using a
• The FILE statement is the counterpart of the INFILE statement. Use FILE to
write data to an ASCII file, and use INFILE to read data from an ASCII or text
file.
• The PUT statement corresponds to the INPUT statement. PUT names the
SAS variables to be ‘put’ or written into the ASCII file; INPUT names the
• Since the purpose of the DATA step is to create an ASCII file, there is no need
to create another SAS data file – hence the dummy name _NULL_ is used.
This name is a special SAS name, used when you want to process data, but
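The points above can be sketched as follows. The libref IN, the path, the output file name A2.TXT, and the assumption that the stored data set IN.A2 contains the variables SID, AGE and HEIGHT are all hypothetical, used only to show how LIBNAME, SET, FILE and PUT fit together.

```sas
LIBNAME IN 'C:\TEMP';          /* hypothetical library location */

DATA _NULL_;                   /* process data without creating a SAS data set */
  SET IN.A2;                   /* read the stored SAS data set */
  FILE 'C:\TEMP\A2.TXT';       /* hypothetical ASCII file to be written */
  PUT SID AGE HEIGHT;          /* write one line per observation */
RUN;
```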
SAS can also be used for processing data, even when you don’t plan to create or
save a SAS data set. An ASCII data set can be read in, computations made (new
variables created), or variables reformatted, and a new ASCII file written that can be
For example, you may prefer to use the graphics or analysis features of another
software package, but find it easier to manipulate data (e.g., create or modify
variables, change the data file structure) in SAS, and then use the data in another
program.
DATA _NULL_;
INFILE 'C:\TEMP\EX1.DTA'; /* Names ASCII file to read in */
FILE 'C:\TEMP\EX2.DTA'; /* Names ASCII file to be created */
INPUT GRP X Y Z; /* Names variables to read in */
TOTAL = SUM(X,Y,Z); /* New var TOTAL sums X, Y and Z */
PUT GRP X Y Z TOTAL; /* Writes the variables to the new ASCII file */
RUN;
Each line of the new file holds GRP, X, Y, Z and TOTAL:
11 25 32 21 78
146 29 71 13 113
24 5 9 22 36
Variable names are assigned to values in data sets using an INPUT statement.
There are four ways in which values can be associated with variables. These are
Refer to the SAS Language Manual for more details, and SAS Language and
a. List Input
Warning!! List input should not be used as the routine method of data input
unless missing values are appropriately handled on the input (ASCII) data file.
One of the simplest forms of data input is list input or free-format. This method of
input is appropriate for reading small data sets, or creating test data. One or more
input. A delimiter is a defined marker that separates the value for one variable from
delimiters are commas or tabs. By default, when list input is used, SAS assumes a
blank space as the delimiter. To read data with a different delimiter, such as a
comma, use the DELIMITER option on the INFILE statement. The following
example uses list input to read three variables from each line.
Note that columns do not necessarily line up for each variable, when the number of
DATA A1;
INPUT SID AGE HEIGHT;
CARDS;
1 7 40
2 26 64
3 41 60
14 29 66
;
RUN;
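For comparison, a minimal sketch of the DELIMITER option mentioned above: the same values, separated by commas instead of blanks. The INFILE CARDS form is used here only so that the option can be applied to instream data.

```sas
DATA A1;
  INFILE CARDS DELIMITER=',';   /* treat commas, rather than blanks, as the delimiter */
  INPUT SID AGE HEIGHT;
  CARDS;
1,7,40
2,26,64
3,41,60
14,29,66
;
RUN;
```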
TIP: Each line (or set of lines) must have a complete set of the values in order
to maintain the correct sequence of variables and values. When all the variables
are not found on a given record (some missing values), the next record is read with
values assigned consecutively. If the height 64 were missing on the second data
line, the value ‘3’ would be read in from the next line as the second height, and then
the next line, starting with SID 14 would be read as the 3rd subject.
TIP: A single blank space as a missing value results in a mismatch, which reads
in values from the wrong place, and results in both incorrect values as well as missed
observations.
INFILE statement. When MISSOVER is specified the pointer will not move to a new
line to continue reading data but will assign a SAS missing value. However if the age
value were missing on a line, the value for height would be read in as AGE, unless
For SAS, a period or dot, '.' is used to indicate a missing numeric value.
This is why list input should not be used as the routine method of data input
unless missing values are appropriately handled on the input (ASCII) data file.
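A minimal sketch of the MISSOVER behavior described above, assuming the height is absent at the end of the second data line. The data set name A2 is hypothetical. With MISSOVER, HEIGHT is set to missing for that observation and the following lines are read normally, so four observations result.

```sas
DATA A2;
  INFILE CARDS MISSOVER;    /* do not continue onto the next line when values run out */
  INPUT SID AGE HEIGHT;
  CARDS;
1 7 40
2 26
3 41 60
14 29 66
;
RUN;
```

MISSOVER only helps when the missing values fall at the end of a line; a value missing in the middle of a line still needs a period as a placeholder.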
Following are some examples to illustrate some problems and solutions with missing
*******************************************************************************************;
*** ***;
*** Project: BE 691F SAS example ***;
*** Date: 15 OCT 2000 ***;
*** Prog: Penny Pekow ***;
*** File: listinput.sas ***;
*** RE: LIST input/ missover ***;
*******************************************************************************************;
*** Input: instream data ***;
*********************************************************************************************;
There are two special codes to be used on the INPUT statement, associated with list
input. The dollar sign special code ($) is used after the variable name to indicate that
character data is to be read – SAS assumes numeric data by default – and the
ampersand special code (&) is used when character variables have single imbedded
blanks. If a single imbedded blank occurs in a character variable, two blanks must be
used to separate this variable from the next variable (that is, the delimiter must be 2
blanks). The example below illustrates the use of these special codes in list or free-format input:
DATA NEW1;
INPUT SID FNAME $ LNAME $ STREET & $15.;
CARDS;
001 Mary Bako  162 Pond St.
202 Sally Jones  447 Lake Drive
370 Peter McArthur  16 Newberry Rd.
;
RUN;
• The example reads an ID variable, first name, last name, and street address
• The dollar sign ($) is used to indicate character data for names and addresses.
• Since imbedded blanks occur within street addresses, this variable name is
• In addition, for character variables, by default, only the first 8 characters will be
• In this example, fifteen characters are to be read for the STREET variable, as
indicated by ‘$15.’ .
• Also note, in the data, a double blank space precedes the street address as the
delimiter.
List input must be used when the values to be read are separated by blanks or other
delimiters, but the columns vary from line to line, as in the following data:
1 12 3
2 100 14
3 31 16
In this case it is not possible to specify a particular column for reading the third
variable.
The most common form of input is column or formatted input. Column input
associates the variables with values by specifying the column where the data is
stored. Columns are indicated immediately after the variable name. As in list input, a
dollar sign ($) after the variable name is used to define a character variable. Column
input should be used when possible in all routine data input applications, since errors
due to misalignment of variables are minimized. Column input must be used when
no spaces or other delimiters are used between values, or when numeric data are
recorded without explicit inclusion of a decimal point, and values after the decimal
point occur. When this occurs, the number of digits that should be placed after the
DATA NEW1;
INPUT HID 1-5 HT 7-9 .1 WT 10-12 ADDRESS $ 14-25;
CARDS;
30192 665125 53 South Maple
42389 740180 114 Pondview
;
RUN;
• HT is read from columns 7 to 9, and written in SAS with 1 column after the
decimal point.
• WT is read from columns 10 to 12. Values of height and weight read for the
first subject are HT=66.5, WT=125, while values read for the second subject
• Note that the ampersand (&) isn’t necessary for an embedded blank in the
address field when column input is used because the columns, including the
A useful alternative form for column input is available that is easier to read. An @
symbol is used to indicate the beginning column for reading a variable, followed by
the variable name, with the number of columns and format for the variable indicated
immediately after the name. Reading the same data as above using these input
DATA NEW1;
INPUT @1 HID 5.
@7 HT 3.1
@10 WT 3.
@14 ADDRESS $12. ;
CARDS;
30192 665125 53 South Maple
42389 740180 114 Pondview
;
RUN;
• The above input statement says to start at column 1 and read 5 columns for
HID.
• Then start at column 7 and read 3 columns for HT, writing the data with 1
• WT is read starting in column 10, for 3 columns (nothing after the decimal
point), and ADDRESS is read as character data, for 12 columns starting with
column 14.
• It is not necessary to put each variable on a new line, though this improves
Although this form of input requires more lines in a SAS program, the
• This type of input statement is also used when reading data with a particular or
unusual format. The most common instance is with reading date values. SAS
offers a wide array of choices for formatting dates (see DATE FORMATS in
the SAS Language Guide), and for reading them in (see DATE INFORMATS).
The next example reads in dates that are stored in MM/DD/YY format.
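A minimal sketch of reading such dates, assuming a hypothetical data set DATES with a subject identifier SID in columns 1 to 3 and a date variable VISITDT starting in column 10. The variable names and data values are illustrative only.

```sas
DATA DATES;
  INPUT @1  SID     3.
        @10 VISITDT MMDDYY8.;    /* 8 columns: 6 digits plus 2 slashes */
  FORMAT VISITDT MMDDYY8.;       /* display the stored date as MM/DD/YY */
  CARDS;
101      03/18/92
102      11/02/91
;
RUN;
```

The FORMAT statement is included because SAS stores dates as numbers; without it the values would print as day counts rather than as dates.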
This statement would read dates from a file, starting at column 10, taking 8
columns (6 numbers plus 2 slashes) in MMDDYY type format, such as 03/18/92 for
Many special features can be used with column input to make input statements
which data are read in a single input statement. Some of these features are
illustrated in a few more examples. Refer to the SAS Language Manual
for others. The examples that follow illustrate (a) reading data for one observation
from multiple lines, (b) reading multiple records from one given line of data, and (c)
In theory, the data for each record could span as many columns as you like, so
that the length of a line of data could be unlimited. In practice, however, this
is not possible. While SAS allows data to be input from very long data lines (up to
32767 columns), many other application programs restrict the number of columns
that can be used. For example, EpiInfo 6.04 writes data out to 80 characters per line,
and uses multiple lines per record, as needed. Printers are also restricted
(depending on the font) to less than 160 columns per line (for 8.5 inch paper).
Historically, when data were input via physical cards, line length was restricted to 80
good idea to keep line length less than 140 columns, though this is not strictly
necessary. When many, many variables are recorded per subject and the number of
columns needed exceeds some limit, then additional variables are entered on
subsequent lines. Many lines can be used for recording variables for a particular
record.
To input data from such records into SAS, the line number is simply noted with a #
symbol prior to reading the variables on the line. A simple example illustrating the
syntax follows:
DATA NEW1;
INPUT #1 @1 HID 5.
@7 HT 3.1
@10 WT 3.
#2 @1 LNAME & $10. FNAME & $
@40 STNO 4.
@45 STNAME $10.;
CARDS;
23901 684145
Jovanovic  Mary                        69   North St.
45392 735199
Mc Alligator  John Paul                1239 Smith Ave.
38389 770201
Xzavior-McCullagh  Nancy               37   Northwestern Ave.
;
RUN;
PROC PRINT DATA=NEW1;
VAR HID HT WT LNAME FNAME STNO STNAME;
TITLE1 'Ex: entering multiple lines w/ character truncation';
RUN;
• Variables for HID, height and weight are read from the first line.
• Variables for last name, first name, street number, and street name are read
from the second line.
• Although there are six lines of data, only three records are created, since there
are two lines of data per record.
• This example combines fixed (column) and free (list) format, since the columns
used for the first name differ depending on the last name length.
• Single imbedded blanks are permitted in the last name and first name by
inclusion of the symbol "&". The first name is separated from the last name
the variable for last name is specified as 10, while the number of columns
retained for the first name is not specified (and therefore has the default value
Another option for reading from multiple lines per record is to use a slash (/) in the
input statement to indicate that variables following the slash are to be read from the
next line. It isn’t as easy to proofread, since the current line number as well as the
total number of lines per record is not specified explicitly. The above data could also
be read as:
DATA NEW1;
INPUT @1 ID 5.
@7 HT 3.1
@10 WT 3.
/ @1 LNAME & $10.
FNAME & $
@40 STNO 4.
@45 STNAME $10.;
CARDS;
23901 684145
Jovanovic  Mary                        69   North St.
45392 735199
Mc Alligator  John-Paul                1239 Smith Ave.
38389 770201
Xzavior-McCullagh  Nancy               37   Northwestern Ave.
;
RUN;
When testing programs, or entering small data sets for analysis, data for multiple
records may be recorded on the same line. To read such data, the current line read
For example, suppose the variables for subject's identification (SID), subject's age
(AGE), pulse (PULSE) and years of education (EDUC) are recorded for 9 subjects on
three lines of data. The following example illustrates how the trailing @@ can be
DATA NEW1;
INPUT SID AGE 2. PULSE 2. EDUC 2. @@;
CARDS;
01 221604 02 242216 03 332112 04 594007
05 153308 06 402311 07 232614
08 333016 09 302717
;
PROC PRINT DATA=NEW1;
VAR SID AGE PULSE EDUC;
TITLE1 'Example of reading multiple records per line';
RUN;
A total of nine records are read from the three lines of data. Since SID is read in free
format, the INPUT statement will automatically go to the next value (or next line)
One other time saving feature can be illustrated in this example. When several
variables have the same fixed format, the format can be specified for the set of
variables by enclosing the set of variables in parentheses, and the common format in
parentheses. For example, the same data input would have resulted for the previous
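A minimal sketch of the grouped form, reusing the variables and data lines from the trailing @@ example. The variable list (AGE PULSE EDUC) is followed by the shared format (2.), each in its own parentheses.

```sas
DATA NEW1;
  INPUT SID (AGE PULSE EDUC) (2.) @@;   /* one 2. format applied to all three variables */
  CARDS;
01 221604 02 242216 03 332112 04 594007
05 153308 06 402311 07 232614
08 333016 09 302717
;
RUN;
```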
In some applications, different numbers of lines of data will be recorded for different
subjects. This situation will commonly arise when the number of variables recorded
in a questionnaire is so large that there are multiple lines per record. For some
subjects data may be reported only for variables in the first line, with no data for
subsequent lines (i.e., when large sections are blank due to skip patterns). In these
settings, rather than artificially padding the number of lines with missing values, fewer
These data contain information on five subjects, with the subject's name on the first
line, address on the second line, and phone number (if available) on the third line.
For ID=101 and ID=104, all data are reported. For ID=103 and ID=109, only name
and address are reported, and for ID=105 only name is reported.
The first variable on each line of data identifies the subject, while the last variable in
each line identifies the line number for the subject. The data can be input by using a
trailing @ in SAS, where the trailing @ holds the current line of data until a
DATA NEW1;
INFILE 'C:\TEMP\EX2.DTA';
INPUT @28 RECNO 1. @; * @ holds the line for next input statement;
IF RECNO=1 THEN
INPUT @1 ID 3.
@7 FNAME $
LNAME & $10.;
ELSE IF RECNO=2 THEN
INPUT @1 ID 3.
@6 STNO 4.
@11 STNAME & $10.;
ELSE IF RECNO=3 THEN
INPUT @1 ID 3.
@7 PHONE $8.;
RUN;
PROC PRINT DATA=NEW1;
TITLE1 'LISTING OF DATA: Varying lines per record';
RUN;
There are several features of the program that will be discussed in more detail later,
• In order to decide which line (and which format) is appropriate for a particular
line of data, the variable RECNO is read and the line held for subsequent
operation.
• An ELSE IF statement follows, since the next input statement could only be
used if the first if condition was not met. The output from this program follows:
One of the real strengths of SAS is its flexibility in the handling of missing values.
Almost all collections of data have some missing values or values that are so
obviously invalid or out of range that they must be replaced with missing values. In
some cases data are not actually missing but are merely not applicable for all cases.
differentiate among them as, at times, this difference will have an impact on the total
When reading data into a SAS data set from an ASCII file or another format (e.g.,
Excel or Access), missing data can be represented for both numeric and character
data as either a blank or a single period (.) in the ASCII, Excel or Access file. When
reading an ASCII file using LIST input a period must be used, or else the next value
after the blank will be read in, and all subsequent values, at least for that line, will be
misread. An example was given earlier, in the section on LIST input. When using
COLUMN input, the columns may simply be left blank (or a period can be used).
Blank columns will be read into SAS as missing values in column input.
In SAS data sets, missing character values are represented by blanks ( ), and
Example illustrating the “9”, “99” “999” practice – The values of “9” or “99” or “999”
are often used to designate missing values in data entry. As such, they cannot be
used in SAS; they must be recoded to a SAS missing value code so that they will not
recodings are accomplished using programming statements when the data is read
The above lines would replace all values of 9 for VAR1 and 99 for AGE with the SAS
missing value code (.).
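Recoding statements of this kind can be sketched as follows. The data set name RECODED and the data values are hypothetical; VAR1 and AGE are the variables named in the text.

```sas
DATA RECODED;
  INPUT VAR1 AGE;
  IF VAR1 = 9 THEN VAR1 = .;    /* 9 was entered to mean "missing" for VAR1 */
  IF AGE = 99 THEN AGE = .;     /* 99 was entered to mean "missing" for AGE */
  CARDS;
1 34
9 27
2 99
;
RUN;
```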
Believe it or not, you may want to keep track of the different reasons for missingness
(for example - “unknown”, “refused”, “skipped” are three different data entry
scenarios that yield a missing value). To illustrate, suppose you wish to distinguish
between refusals (coded as 7), not applicable (coded as 8) and missing (coded as 9),
• The SAS special missing value ‘R’ is assigned to refusals, originally entered
as “7”
• The SAS special missing value ‘N’ is assigned to the not applicable , originally
entered as “8”
• The SAS special missing value ‘M’ is assigned to the missing values, originally
entered as “9”
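The three assignments above can be sketched as follows, assuming a hypothetical questionnaire item Q1 in a hypothetical data set SURVEY. A special missing value is written in a program as a period followed by the letter.

```sas
DATA SURVEY;
  INPUT SID Q1;
  IF Q1 = 7 THEN Q1 = .R;          /* refused */
  ELSE IF Q1 = 8 THEN Q1 = .N;     /* not applicable */
  ELSE IF Q1 = 9 THEN Q1 = .M;     /* missing */
  CARDS;
1 3
2 7
3 8
4 9
;
RUN;
```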
This might be handy later if you want to identify refusers, or in getting a count of
refusals, or, if you want to treat these as missing values for computational purposes.
TIP: The special missing values are stored in the data set and print as a letter
SAS orders missing value types. Possible alternatives for the coding of missing
._  .  .A  .B  .C  and so forth  .Z
Note: SAS treats the missing value “._” as the smallest and “.Z” as the largest.
We will see later that SAS offers you choices in the handling of missing values, such
as whether or not they appear in frequency tables, and whether or not they are
Sometimes missing numeric data will be provided to you as letters, rather than as
periods or blanks. The result is a mixture of numeric and character entries in the
same field. This will cause an error (“invalid data” ) unless it is properly handled.
Use the MISSING Statement to manage missing numeric data that has been entered
using a letter.
that SAS will read these as missing values rather than as invalid numeric data. In the
following example, R and N will be treated in the SAS dataset as missing values.
DATA TEMP;
MISSING R N;
INPUT AGE;
CARDS;
12
R
19
N
;
RUN;
c. INVALIDDATA option
The INVALIDDATA option is a great device! It allows you to detect invalid data and
• It functions by creating a code (one that you’ve specified) when invalid data
appears in an input line; this can be displayed on the output of your SAS run.
value codes nor to valid numeric data format. Use of the INVALIDDATA option
results in the replacement of the ‘3N’ with an ‘X’. Actually the ‘X’ replaces any invalid
OBS AGE
1 12
2 R
3 19
4 N
5 X ← Thus, you know to review the data for observation #5.
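A minimal sketch of a program that would produce a listing like the one above, assuming the MISSING statement example is extended with a fifth, invalid value '3N'. INVALIDDATA= is a SAS system option, so it is set on an OPTIONS statement before the DATA step.

```sas
OPTIONS INVALIDDATA='X';    /* invalid numeric data is stored as the special missing value .X */

DATA TEMP;
  MISSING R N;              /* R and N are read as special missing values */
  INPUT AGE;
  CARDS;
12
R
19
N
3N
;
RUN;

PROC PRINT DATA=TEMP;
RUN;
```

SAS also writes a note about the invalid value to the log, which helps locate the offending data line.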
As mentioned previously, SAS treats missing values as ordered and has a defined
• In SAS missing values are considered to have values less than all possible
take care!! For example in creating age groups from a variable AGE,
representing age in years, the following statements would include those with
that those with missing AGE will also have missing AGEGR, the variable for
age group.
• For example, to delete all observations with missing values for VAR1 use:
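Both points can be sketched in one step. The data set name GROUPS, the age cutpoints (18 and 65), and the data values are hypothetical. The comparison with .Z is used so that ordinary and special missing values are all caught, since .Z is the largest missing value; a simpler test such as VAR1 = . would catch only the ordinary missing value.

```sas
DATA GROUPS;
  INPUT SID VAR1 AGE;
  IF VAR1 <= .Z THEN DELETE;         /* drop observations with any kind of missing VAR1 */
  IF AGE <= .Z THEN AGEGR = .;       /* keep missing ages out of the youngest group */
  ELSE IF AGE < 18 THEN AGEGR = 1;
  ELSE IF AGE < 65 THEN AGEGR = 2;
  ELSE AGEGR = 3;
  CARDS;
1 5 12
2 . 40
3 2 .
4 1 70
;
RUN;
```

Without the first AGE test, the observation with missing AGE would satisfy AGE < 18 and be placed in group 1.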
In the same way, missing character values (blanks) precede all other characters in
blank for a missing value, all those with missing data for SEX could be deleted
• For further information see the chapter on Missing Values in the SAS Language
SAS data sets are stored in a special (SAS-specific) format; thus, they cannot be
• Once a data set has been stored in SAS format, some associations are
with labels and variable values can be automatically associated with the
• The advantage to its storage in SAS format is that when you refer to a SAS
data set in the SAS system it is not necessary to keep track of the variable
format and columns. The SAS system does that for you, when you refer to the
variable by name.
• A SAS data set cannot be viewed or printed from a text editor such as Notepad,
or from a word processor. (To get a data listing the print procedure, PROC
Thus, SAS has documentation features that permit the attachment of labels and
formats to variables, which can be stored with the data. These formats and labels
are then used in printed output created by SAS procedures, rather than the variable
names or values.
A variable label is a descriptive phrase that characterizes the variable, thus permitting
• It can be as simple as a less abbreviated name for the variable, or can contain
• The keyword LABEL is followed by a variable name, an equal sign, and the
variable label enclosed in single quotes (or double quotes when the label
labeled. A semi-colon to end the label statement follows the last label.
Multiple labels can be listed on a single line – the label statement ends only
Missing a quote
• Tip: Line up the equal signs, listing a single variable and label per line. This
• Labels will automatically appear on output for many SAS procedures, and can
This is done in the DATA statement itself, when naming the data set, as in the
********************************;
* Store data with labels *;
********************************;
DATA OLD.CLASS1(LABEL='CLASS DEMOGRAPHIC INFO');
SET NEW1;
LABEL SID='Student*ID*Number'
AGE='Age on*Sept 1'
SEX='Sex'
HT='Ht in*Inches';
RUN;
************************************;
** Print the data set with labels *;
************************************;
PROC PRINT DATA=OLD.CLASS1 SPLIT='*';
TITLE2 'Using Labels in place of variable names';
RUN;
********************************;
* Get data structure *;
********************************;
PROC CONTENTS DATA=OLD.CLASS1;
TITLE1 'STRUCTURE OF CLASS DEMOGRAPHIC DATA SET';
RUN;
• Note - Although the LABELs were not created in the original SAS data set, they
were created and saved in the SAS data set CLASS1. The printed output
• Note - The second time the data is printed, the option SPLIT=‘*’ was used on
the PROC PRINT statement. This is a print option that indicates that labels
should be used in place of variable names to head columns, and that the
labels should be split into lines at the character indicated within quotes: ‘*’.
This is the reason that asterisks were used in creating the labels. The width of
the column for printing will depend on either the space required for printing the
data, or the space required for printing the variable name or label. For this
reason, long variable names should be avoided, and split characters should be
       Student
       ID        Age on              Ht in
OBS    Number    Sept 1    Sex       Inches
  1      1         23      Male       70.2
  2      3         29      Female     69.5
  3      4         35      Male       74.5
It is often useful to have a summary of the variables that are contained in the data
set. The SAS system has a special procedure called PROC CONTENTS that
summarizes information about SAS data sets. An example was given above, to
display the variables in the SAS data set CLASS1.SD2. The output from the PROC
CONTENTS follows:
• PROC CONTENTS lists the number of observations in the data
set, the data set label, and the variable names with their type, length,
position, and labels.
• Under the Alphabetic List of Variables and Attributes, the # indicates the
ordered position of each variable in the SAS data set. SID is first; AGE is
• The variable positions (Pos) correspond to the starting position for each
variable in the SAS data set, in bytes. The POSITION option on the PROC
cumulating the length of variables in bytes, following the order of input in the
SAS DATA step. Note that a length of 8 bytes is the default length
Formats enable descriptive labels to be substituted for numeric codes that represent
• Formats can be created in many procedures and used for the resulting output of
that one procedure, or they can be created with PROC FORMAT and stored in
• For example, for a survey where SEX of the respondent was coded as
later).
PROC FORMAT;
VALUE SEXFMT 1='1.Male'
2='2.Female';
VALUE YNFMT 0='No'
1='Yes';
RUN;
• This code represents a dictionary. When the dictionary is requested (using a
• The keyword VALUE is followed by the format name, followed by the code=,
• A semi-colon follows the final format label (and ONLY the final format label).
• If the variable containing the codes is a character variable, you must create a
character format by beginning the format name with a dollar sign ($) and
enclosing both the code and the label in quotes. (See example).
set where the format names and values will be saved. In the example below,
contains the format information. Once formats are saved they can be used on
• CNTLIN reads in a SAS format data file that has been saved previously.
• If neither is used, the formats created are only available for use during the SAS
session.
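As a hedged sketch of the CNTLIN side, the statements below rebuild the formats in a new SAS session from a previously saved format data file. They assume that the control data set SDATA.FMT1 was written earlier with PROC FORMAT CNTLOUT=SDATA.FMT1, as in the example in this section, and that the library path is still C:\TEMP.

```sas
* Assumes SDATA.FMT1 was written earlier by PROC FORMAT CNTLOUT=SDATA.FMT1;
LIBNAME SDATA 'C:\TEMP';

* Rebuild the formats for the current session from the saved format data file;
PROC FORMAT CNTLIN=SDATA.FMT1;
RUN;
```

After this step, formats such as SEXFMT. and YNFMT. can be used in FORMAT statements without rerunning the VALUE statements that defined them.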
**********************************************;
** example to create formats and apply them **;
**********************************************;
LIBNAME SDATA 'C:\TEMP';
** create and save formats in c:\temp\fmt1.sas7bdat **;
PROC FORMAT CNTLOUT=SDATA.FMT1;
VALUE SEXFMT 1='1.Male'
2='2.Female';
VALUE YNFMT 0='No'
1='Yes';
VALUE $CODEFMT 'A'='Always'
'B'='Sometimes'
'C'='Rarely'
'D'='Never' ;
RUN;
** create test data with sex, a yes/no and letter coded variables **;
DATA TEST1;
INPUT SEX YN CVAR $;
CARDS;
1 0 A
1 1 B
2 0 C
2 1 D
;
RUN;
** PRINT DATA WITHOUT FORMATS **;
PROC PRINT DATA=TEST1;
TITLE1 'UNFORMATTED LISTING OF TEST1';
RUN;
** ASSIGN FORMATS AND STORE DATA **;
DATA SDATA.TEST2;
SET TEST1;
FORMAT SEX SEXFMT. YN YNFMT. CVAR $CODEFMT.;
RUN;
** PRINT FORMATTED DATA, AND GET STRUCTURE **;
PROC PRINT DATA=SDATA.TEST2;
TITLE1 'FORMATTED VERSION OF TEST DATA';
RUN;
PROC CONTENTS DATA=SDATA.TEST2;
RUN;
name(s) followed by the format name ending with a period (.) to indicate a
format. The above example assigns the formats in a second data step, but they can
UNFORMATTED LISTING OF TEST1
OBS    SEX    YN    CVAR
  1     1      0     A
  2     1      1     B
  3     2      0     C
  4     2      1     D

FORMATTED VERSION OF TEST DATA
OBS    SEX         YN     CVAR
  1    1.Male      No     Always
  2    1.Male      Yes    Sometimes
  3    2.Female    No     Rarely
  4    2.Female    Yes    Never
Tip: Create and save a separate format program for your study. As more
formats are required during the course of a project, add to this format program, and
rerun it to update the format data file. In this way you have a single, complete file of
ALERT: Once formats have been assigned to variables in a saved data set, you
cannot access the data without the format file. As the program tries to read the
data file, it will look for the assigned formats, and if they cannot be found an error
message will be given in the log. So your format file must be available, along with
the data file if you have assigned formats in a data step that creates a stored file.
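If the format file really is unavailable, one commonly used workaround is the NOFMTERR system option, which tells SAS to continue with a default format instead of stopping with an error. The sketch below is not part of the original example; it assumes the stored data set SDATA.TEST2 created above and a session in which SEXFMT., YNFMT. and $CODEFMT. have not been defined.

```sas
* Assumption: the user-defined formats are not available in this session;
LIBNAME SDATA 'C:\TEMP';
OPTIONS NOFMTERR;

* The codes are printed as stored, without the format labels;
PROC PRINT DATA=SDATA.TEST2;
  TITLE1 'TEST2 PRINTED WITHOUT ITS USER-DEFINED FORMATS';
RUN;

* Restore the default behaviour afterwards;
OPTIONS FMTERR;
```

This only bypasses the formats for viewing; keeping the format program with the data remains the safer practice.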
For more information on creating and using formats see the chapter on PROC
The SAS Viewer is a wonderful feature of the SAS program that allows you to view
• TIP: Use this to view your data but not to manipulate it.
• You can open, view and edit a SAS data file from the Windows Explorer:
• Your data will appear in spreadsheet format, with rows representing records, and
Forms View
There are a number of options on the toolbar for viewing the data.
• The Table Attributes and Column Attributes views show the same
you to modify these. Variables can be hidden from view (such as confidential
information), and the displays can be printed. This adds to the ease of
documentation purposes. You can add new records and edit and delete
• The forms view displays the data one observation at a time, similar to forms in
Access.
Any editing you can do in this view can also be accomplished through
changes you have made in your data. This documentation does not occur
SAS data sets will always be larger than their corresponding ASCII data files.
This is because SAS data sets save information concerning the variables and
formats, and generally store variables using more bytes per variable than an ASCII
file. In the example in the previous section, the ASCII data set for the three subjects
was contained in 49 bytes whereas the SAS data set CLASS1.sas7bdat required
4096 bytes. It is possible to create SAS data sets so as to use less space. However,
SAS data sets will always be larger than their corresponding ASCII data files.
• Unlike ASCII, the SAS program includes in its storage of a data set information
• The manner in which variables are stored in the SAS data set is itself more
space consuming.
Large data sets lead to slow read-write operations and slower processing. Although
the variable names and labels carry with them a fixed overhead, the additional size of
the data set due to storage of the data should be kept to a minimum.
Using the minimal length for each variable will minimize the space required for storing
the data set. The type of variable considered, as indicated in the Table below,
• For character variables, the length in bytes is equal to the number of columns,
so the minimum length is 1 byte, for a 1 column variable, and the maximum
allowed is 32,767 bytes (200 bytes in SAS Version 6). Character data is stored
with 1 byte per column – 1 byte
is needed for the ASCII character code for each letter, number or symbol.
• Numeric data is stored, not as the ASCII representation of the number, but
using binary code. The largest whole number that can be stored in 1 byte is
255 (recall 1 byte = 2^3 = 8 bits, and 2^8 - 1 = 255). To store numeric data in SAS, the minimum
• By default, SAS will assign a variable length of 8 bytes to all numeric variables
Table - Criteria for choosing variable type and length to minimize storage:
Measurement Scale        Type         Length      Maximum Value    Power of 2
Nominal                  Character    1-32,767
Interval/Ratio           Numeric      8
(decimals, negatives)
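The relationship between numeric length and the largest whole number that can be stored exactly can be worked out with a short DATA step. This sketch assumes IEEE floating-point storage, as used by SAS on Windows and UNIX, where 1 sign bit and 11 exponent bits leave 8*n - 11 bits of precision for a length of n bytes. Other platforms, such as mainframes, use different limits. The variable names NBYTES and MAXINT are chosen only for this illustration.

```sas
* Assumption: IEEE floating point, so precision is 8*NBYTES - 11 bits;
DATA _NULL_;
  DO NBYTES = 3 TO 8;
    * Largest whole number stored exactly in NBYTES bytes;
    MAXINT = 2**(8*NBYTES - 11);
    PUT NBYTES= MAXINT= COMMA25.;
  END;
RUN;
```

The log line for a length of 3 corresponds to the bound of 8,192 quoted in the text, and the line for the default length of 8 corresponds to 2 to the power 53.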
• For numeric variables there are some choices as represented above. For
whole numbers, the length can be defined by the maximum value that the
variable can take. If the maximum value is less than 8,192 then a length of 3
is adequate, and so on. For example, where values of a variable are really
Example - This example illustrates the impact of setting character and numeric
variable lengths. The following program reads a subset of variables into two data
sets. In the first, SAS is allowed to set the variable length by default, and declare all
variables as numeric. In the second data set, we read the same data using character
variables for nominal data, and minimal length appropriate for numeric variables.
Note that the minimal length for numeric variables is length=3. The statements to
Here, the original ASCII data was stored in 281 bytes, while the resulting SAS data
In creating the second data set, a second input format is now specified where
nominal variables (codes) were defined as character variables (so that a length of 1
byte can be used), and numeric variables were given a length of 3 bytes unless they
required more.
• Since HID, the ID variable, consisted of 7 columns, which gives values greater
giving the variable name followed by the length, or a list of variables separated
• To reset the default length for variables not specifically named, use DEFAULT=
LENGTH LNAME $ 15
ID HOSPID 5
DEFAULT=3;
• Input statements for the same data used above with the addition of a length
statement are:
* Second data set. SAS uses the length statement provided *;
data old.new2;
length hid 5
default=3;
input
@1 HID 7. @8 SID $1. @9 SEGMENT $2.
@11 CINTID $2. @13 CNEWPHON $7. @20 CLVN $1.
@21 CLV_D 6. @27 CL_HR 2. @29 CL_MI 2.
@31 CL_AP $1. @32 CL_PV $1. @33 COUTCODE $1.
@34 CFS $2. @36 CE_HR 2. @38 CE_MI 2.
@40 C01 1. @41 C02A 2. @43 C02B 2.
@45 C02C 2. @47 C02D 2. @49 C03A $1.
@50 C03B $1. @51 C03C $1. @52 C03D $1.
@53 C03E $1. @54 C03F $1. @55 C03G $1.;
cards;
123438741239874238147231908472319084713209487312098432
123712934730246716566743285621989823406213498213482137
123466458750457864879437852374276161823989238198293883
583849238735783566743783287923409234098213672134902134
838393023623460769857786952396851234786243763248796224
;
proc contents data=old.new2;
title1 'Data with length formats';
run;
The resulting SAS data set requires 8192 bytes – about half the size of the file which
used the defaults. While the saving in space may not seem great, imagine the effect
when you read in data for several hundred or several thousand subjects.
• The length statement may be specified prior to the input statement, or after the
input statement.
• When the length statement is specified prior to the input statement, the length
statement.
• For example, the numeric variable HID will have length=5, even though the
• To reset the default, the length statement must appear before the input
statement to override the standard default. All variables that are measured on
• If you are trying to fit data onto a disk, use of the length statement can