TUTORIAL I: SAS Basics and Data Management
I. SAS Basics
SAS (originally short for Statistical Analysis System) is one of the most popular statistical
software packages. It is used to read in, process, and output statistical information from data
sets. SAS is very powerful and can handle a wide range of statistical problems.
The general layout of a DATA step is:
DATA mydata;
INPUT variables;
DATALINES;
the lines of data
;
RUN;
SAS is organized into steps, which are like paragraphs. There are two types of steps:
• DATA steps, which put data in a form that SAS can use
• PROC steps, which use procedures to print, sort and analyze the data
SAS steps consist of a series of statements, which are like sentences in a paragraph.
In all programming, it is important to use a clear and consistent layout to make programs easier to
read and debug. In SAS, the spacing between words and keywords does not matter. In addition,
blank lines can be added in a program.
It is always good practice to place comments in your code. We will return to the topic later in the
tutorial.
When a program is run, SAS writes a log file and an output file. The log file records notes,
warnings and error messages for each step, while the output file contains the output of the
various procedures. If the log file indicates that your program did not work, the output file may
be empty or incomplete. Even if you get an output file, it is always necessary to check the log
file to make sure that SAS did what you really wanted it to.
DATA mydata;
INPUT variables;
DATALINES;
the lines of data
;
RUN;
The DATA statement is used to name the data set. The DATALINES and INFILE statements are
used to identify the location of the raw data. Finally, the INPUT statement describes the layout of
the raw data. We will look at these four statements in more detail below.
The DATA statement gives the data set a name which allows us to refer to it at a later point in the
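code. For example, the command
DATA mydata;
creates a data set named mydata. A data set name must begin with a letter or an underscore and
should only contain the symbols {0-9, A-Z, a-z} or "_". In older versions of SAS (version 6 and
earlier) the name must be 8 characters or less in length; later versions allow up to 32 characters.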
The INPUT statement gives names to the variables in your data set and tells SAS how to move
through the raw data in order to read the variables. It is also possible (and, in some cases,
necessary) to specify variable types (strings, dates, etc.) in this statement. The statement is
written in the following general form:
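As a minimal sketch of that form, the DATA step below lists three variables after the INPUT keyword. The data set name example and the variable names numvar1, numvar2 and charvar are placeholders chosen for illustration, not names from the tutorial.

```sas
DATA example;
   /* numvar1 and numvar2 are read as numeric; the $ marks charvar as character */
   INPUT numvar1 numvar2 charvar $;
   DATALINES;
1 2 abc
3 4 def
;
RUN;
```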
Variable names are subject to the same restrictions as data set names. In addition, a name should
not be repeated within an INPUT statement. Variables can be either numeric or character. SAS
assumes that variables are numeric unless they are specifically designated as character. This is
done in the INPUT statement, after the variable name.
The INPUT statement is also used to tell SAS how the data is stored. Two common styles, list
input and column input, are described below.
In list input, the data are contained in a list. The values are entered without regard to column
location, and they are separated by spaces.
There must be as many values in each line of data as there are variables in the INPUT statement.
A variable is designated as a character by placing a dollar sign ($) after the variable name.
Ex. Suppose we have the following data on the heights of a group of people, stored in the format:
feet, inches and names.
6 0 Michael
5 11 Fred
4 8.5 Isabel
1 11.5 Roxanne
An appropriate INPUT statement would then read:
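One statement that fits this description is sketched below inside a complete DATA step. The data set name heights is an assumption made for illustration; the variable names FEET, INCHES and NAME are the ones used in the text.

```sas
DATA heights;
   /* List input: two numeric variables followed by one character variable */
   INPUT FEET INCHES NAME $;
   DATALINES;
6 0 Michael
5 11 Fred
4 8.5 Isabel
1 11.5 Roxanne
;
RUN;
```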
Using this command we tell SAS to look through a list of data for two numerical variables (FEET
and INCHES) and one character variable (NAME).
In order to use column input, each variable must occupy the same columns on every line of data.
In the INPUT statement you must specify the column positions after the variable name.
If a variable is a character variable, the dollar sign comes before the column designation.
Ex: Assume the same data is stored in the following form:
6 0    Michael
5 11   Fred
4 8.5  Isabel
1 11.5 Roxanne
Here the variable FEET is found in the first column, the variable INCHES in the third through
sixth columns and the variable NAME in the eighth through twentieth columns.
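These column positions can be written as a column-input statement. The following is a minimal sketch; the data set name heights is an assumption, and the data lines are padded so that each value starts in the column stated above.

```sas
DATA heights;
   /* FEET in column 1, INCHES in columns 3-6, NAME (character) in columns 8-20 */
   INPUT FEET 1 INCHES 3-6 NAME $ 8-20;
   DATALINES;
6 0    Michael
5 11   Fred
4 8.5  Isabel
1 11.5 Roxanne
;
RUN;
```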
The data you use in your SAS program can be stored in a number of locations. Typically it is
stored either in the program itself or in an external file. We will discuss both scenarios below.
When we are working with small data sets it is often easier to place the data directly into the
program. The DATALINES statement signals the beginning of the lines of data. One data record
per line is standard. But, if the INPUT statement is constructed differently, more than one data
record per line can be parsed. A semicolon MUST be included following the last line of data. No
semicolon is used at the end of each individual data line!
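One way to read more than one record per line is to end the INPUT statement with a trailing @@, which tells SAS to keep reading from the same line until it runs out of values. The sketch below uses the same variables as the example that follows; the way the records are spread over the lines is chosen only for illustration.

```sas
DATA CASH;
   /* The trailing @@ holds the current line so several records can be read from it */
   INPUT BANK $ ACCTNUM MONEY @@;
   DATALINES;
CHASE 1536253 50.32 PNC 189273462 1563.82
FLEET 287363 20000.00
;
RUN;
```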
Ex: The following code reads in three observations with three variables in each (BANK,
ACCTNUM and MONEY).
DATA CASH;
INPUT BANK $ ACCTNUM MONEY;
DATALINES;
CHASE 1536253 50.32
PNC 189273462 1563.82
FLEET 287363 20000.00
;
RUN;
Note: RUN is an optional last statement in the DATA steps and the PROC steps.
You may often need to access data sets that are saved in a computer file. If you collect a set of
data that has many observations, you may want to put it in a separate file that you can easily
access. You can access these data by using the FILENAME statement before the DATA step:
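A statement consistent with the description that follows might look like this. It is a sketch only: the exact folder on drive F is an assumption.

```sas
/* datain is the file reference; the path assumes the file sits in the root of drive F */
FILENAME datain 'F:\mydata.txt';
```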
where ‘datain’ is a file reference name of your choice (at most 8 characters, otherwise following
the same conventions as variable names) and 'mydata.txt' is the name of the file from which you
are reading the data. In the above statement, the file is saved on drive F. If the file is saved on
a different drive you must change the code accordingly.
Once this is done, add an INFILE statement in the DATA step, before the INPUT line.
INFILE datain;
Ex. Consider that we have saved a file called 'bank.txt' on drive C, and that it reads:
To read this data into SAS we can use the following DATA step.
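A DATA step of this kind can be sketched as follows. The sketch assumes that 'bank.txt' sits in the root of drive C and holds the same three columns as the CASH example above (bank name, account number, amount); the file reference bankin is a name chosen for illustration.

```sas
/* bankin is a hypothetical file reference; adjust the path to where bank.txt is saved */
FILENAME bankin 'C:\bank.txt';

DATA CASH;
   INFILE bankin;                  /* read from the external file instead of DATALINES */
   INPUT BANK $ ACCTNUM MONEY;     /* assumed layout: character, numeric, numeric */
RUN;
```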
Sometimes you may want to skip a few lines in the data file or only read a certain amount of data.
This is especially true if your data set has a header. OBS and FIRSTOBS are two options you can
use in the INFILE statement to do this. If you use the option "obs=10", SAS will stop reading
after the 10th line of the file. If you use "firstobs=5", SAS will start reading at the 5th
line, ignoring those before it.
Ex. To read a total of 5 observations (lines) from the external data file, write:
To skip the first 4 observations (lines) in the external data file, write:
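Both options can be sketched as follows, assuming the file reference datain defined earlier and a file with two numeric columns. The data set names first5 and skip4 and the variable names x and y are placeholders, not names from the tutorial.

```sas
/* Read only the first 5 lines of the file */
DATA first5;
   INFILE datain OBS = 5;
   INPUT x y;
RUN;

/* Skip the first 4 lines and start reading at line 5 */
DATA skip4;
   INFILE datain FIRSTOBS = 5;
   INPUT x y;
RUN;
```

The two options can be combined. Because OBS gives the number of the last line to read, FIRSTOBS = 5 together with OBS = 10 reads lines 5 through 10.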
Each PROC step works a little differently from the others. Throughout the course of the semester
we will look at individual procedures to see how each works. Here we discuss two of the simplest
procedures used in SAS: PROC PRINT and PROC SORT.
A. PROC PRINT
Perhaps the simplest PROC step is PROC PRINT. This procedure tells SAS to print out certain
variables in the data set. Its general form is:
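A minimal sketch of that form is shown below. It assumes a data set named mydata already exists; variable1 and variable2 are placeholders for variables in that data set.

```sas
PROC PRINT DATA = mydata;
   VAR variable1 variable2;   /* placeholders: the variables to print, in this order */
RUN;
```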
DATA = mydata tells PROC PRINT to use the SAS data set named "mydata".
List the variables you want to print after VAR in the order you want them printed.
Ex. Pulling together what we have learned so far, the following is an example of a simple SAS
program:
DATA MYDATA;
INPUT NAME $ MIDTERM FINAL;
DATALINES;
Joe 84 79
Sue 94 97
Betty 93 89
;
RUN;
PROC PRINT DATA = MYDATA;
VAR NAME FINAL;
RUN;
This program will read the name, midterm score and final score for three different people. It will
then print out the names and their scores on the final. Note it does not print the midterm scores.
B. PROC SORT
Often we need to rearrange our data according to the values of a certain variable. To rearrange the
order of a SAS data set, you can use PROC SORT. The general form of this procedure is:
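In outline, the procedure looks like the sketch below. It assumes a data set named mydata already exists; variable is a placeholder for the variable to sort by.

```sas
PROC SORT DATA = mydata;
   BY variable;   /* placeholder: the variable whose values determine the new order */
RUN;
```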
Here the data set is sorted according to the values of the variable specified in the BY statement.
Ex. Suppose we have a data set named ‘cars’ which contains two variables car_type and mpg
(miles per gallon). To sort the cars from smallest to largest value of mpg write:
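A step matching this description can be sketched as follows, assuming the data set cars has already been created. PROC SORT sorts in ascending order by default.

```sas
PROC SORT DATA = cars;
   BY mpg;   /* ascending order: smallest mpg first */
RUN;
```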
If we were instead interested in sorting from largest to smallest value, we could alter the BY
statement as follows:
BY DESCENDING mpg;
V. Data Management
A. Comment Lines
Including comments in a SAS program is helpful to remind ourselves what a particular part of
code does. This is particularly important when one expects to reuse the code at a later time.
There are two ways to include comments. One can use either of the following:
• a statement which begins with a star (*) and ends with a semi-colon, or
• a statement which begins with /* and ends with */
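Both comment styles are illustrated in the short sketch below, which reuses the CASH data set from earlier in the tutorial with a single line of data.

```sas
* This comment starts with a star and ends with a semicolon;
DATA CASH;
   INPUT BANK $ ACCTNUM MONEY;   /* this comment sits between slash-star markers */
   DATALINES;
CHASE 1536253 50.32
;
RUN;
```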
Your output will become more readable if you add a LABEL statement. This allows for a short
description of each variable (up to 40 characters in older versions of SAS, up to 256 characters
in current versions).
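As a sketch, labels could be attached to the exam scores from the earlier example as shown below. The label texts are made up for illustration.

```sas
DATA MYDATA;
   INPUT NAME $ MIDTERM FINAL;
   /* The label texts below are illustrative only */
   LABEL MIDTERM = "Score on midterm exam"
         FINAL   = "Score on final exam";
   DATALINES;
Joe 84 79
Sue 94 97
Betty 93 89
;
RUN;

/* PROC PRINT shows the labels only when the LABEL option is given */
PROC PRINT DATA = MYDATA LABEL;
RUN;
```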
It is possible to use a DATA step to create new variables from old ones. After the INPUT
statement, set the new variable equal to some mathematical function of the old ones. All the usual
mathematical operators work: {+, -, *, / and ** for exponentiation}.
Ex. Revisit the example where we read in the midterm and final scores for three students. To
create a new variable SCORE which is the average of the MIDTERM and FINAL we write:
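The assignment goes after the INPUT statement and before DATALINES. A sketch using the data from the earlier example:

```sas
DATA MYDATA;
   INPUT NAME $ MIDTERM FINAL;
   SCORE = (MIDTERM + FINAL)/2;   /* average of the two exam scores */
   DATALINES;
Joe 84 79
Sue 94 97
Betty 93 89
;
RUN;
```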
Sometimes you may only want to perform operations on a subset of the data. We can do this by
using IF or IF … THEN … ELSE statements.
Ex. To create a new variable GRADE that is equal to “A” if SCORE > 90 and “B” otherwise we
write:
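A sketch of such a step is given below, again based on the exam data from the earlier example. The data set name GRADES is an assumption made for illustration.

```sas
DATA GRADES;
   INPUT NAME $ MIDTERM FINAL;
   SCORE = (MIDTERM + FINAL)/2;
   /* GRADE is a character variable, so its values are written in quotes */
   IF SCORE > 90 THEN GRADE = "A";
   ELSE GRADE = "B";
   DATALINES;
Joe 84 79
Sue 94 97
Betty 93 89
;
RUN;
```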
D. DROP/KEEP Statements
Sometimes a data set contains variables that are not used in any data analysis. The DROP and
KEEP statements can be used to exclude certain variables in a data set when you do not need
them anymore. They have the form
The DROP statement excludes the listed variables from the data (keeping the rest), while the
KEEP statement keeps only the listed variables (dropping the rest). You should not use both in
the same DATA step.
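As an illustration, the sketch below reads three variables and then drops HEIGHT. The data set name people is an assumption, and the data values are borrowed from the mydata1 example later in the tutorial.

```sas
DATA people;
   INPUT NAME $ HEIGHT WEIGHT;
   DROP HEIGHT;   /* here this has the same effect as KEEP NAME WEIGHT */
   DATALINES;
Anne 71 130
Edith 69 160
Charlie 50 110
Bert 70 180
;
RUN;

PROC PRINT DATA = people;
RUN;
```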
The Output file will then look as follows (note HEIGHT is no longer part of the data set):
A SAS data set can be used to create a new data set. This can be useful if you want to add some
information to the data set, while keeping a copy of the original. This can be done using the SET
statement.
DATA CASH2;
SET CASH;
RUN;
The data set "CASH2" is a copy of the data set "CASH". By adding some extra lines to the
second DATA step, "CASH" can be changed into something more useful for your particular
analysis.
The SET statement can also be used to combine two or more data sets. If SET is used together
with a BY statement, the data sets are interleaved. If SET is used without a BY statement they
are concatenated. If a BY statement is used, you need to sort the data sets first. This can be done
using PROC SORT.
Ex: Suppose we are working with the following two data sets mydata1 and mydata2:
DATA mydata1;
INPUT NAME $ HEIGHT WEIGHT;
DATALINES;
Anne 71 130
Edith 69 160
Charlie 50 110
Bert 70 180
;
RUN;
DATA mydata2;
INPUT NAME $ EYES $ HAIR $;
DATALINES;
David BROWN BROWN
Bert BLUE BLOND
;
RUN;
To concatenate the two data sets use the following set of commands:
DATA both;
SET mydata1 mydata2;
RUN;
OBS NAME    HEIGHT WEIGHT EYES  HAIR
1   Anne    71     130
2   Edith   69     160
3   Charlie 50     110
4   Bert    70     180
5   David   .      .      BROWN BROWN
6   Bert    .      .      BLUE  BLOND
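To interleave the two data sets, first sort each of them by NAME and then use SET together with a
BY statement:
PROC SORT DATA = mydata1; BY NAME; RUN;
PROC SORT DATA = mydata2; BY NAME; RUN;
DATA both;
SET mydata1 mydata2;
BY NAME;
RUN;
OBS NAME    HEIGHT WEIGHT EYES  HAIR
1   Anne    71     130
2   Bert    70     180
3   Bert    .      .      BLUE  BLOND
4   Charlie 50     110
5   David   .      .      BROWN BROWN
6   Edith   69     160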
In this case the two data sets are combined and the resulting data is placed in alphabetical order
according to the variable Name.
The MERGE command can be used to merge two or more data sets. It has the following form
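In outline, a merge can be sketched as below. All names are placeholders: data1 and data2 stand for existing data sets, newdata for the merged result, and idvar for a variable that both data sets share.

```sas
DATA newdata;
   MERGE data1 data2;   /* placeholders for two existing data sets */
   BY idvar;            /* placeholder for a variable common to both data sets */
RUN;
```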
If you use MERGE together with a BY statement, match merging is performed. When
performing match merging, the data sets need to be sorted before using the MERGE statement.
Ex: Suppose we are working with two data sets ONE and TWO:
DATA ONE;
INPUT DATE $ WEIGHT;
DATALINES;
JUNE_10 230
JUNE_20 225
JUNE_30 223
;
RUN;
DATA TWO;
INPUT DATE $ HEIGHT;
DATALINES;
JUNE_10 171
;
RUN;
To perform match-merging of the two data sets use the following commands:
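A DATA step along these lines performs the merge. This is a sketch: the name BOTH for the merged data set is an assumption. ONE and TWO are already in order by DATE as entered above, so no PROC SORT is needed first.

```sas
DATA BOTH;
   MERGE ONE TWO;
   BY DATE;   /* observations with the same DATE are combined into one */
RUN;
```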
PROC PRINT;
RUN;
The dots (‘.’) in the output above indicate missing data (i.e. we have no measurement of height
for the dates June 20 and 30).
It is useful to be able to write data onto an external file. Together with the FILENAME and FILE
statements, we can use the PUT statement to do this. The FILE statement is the complement of
the INFILE statement. The PUT statement is the complement of the INPUT statement.
Ex. The following program reads in the scores on the midterm and final for three students,
calculates their final grade and prints the grade (together with their name) to an external file.
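/* dataout and the path F:\grades.txt are placeholder names */
FILENAME dataout 'F:\grades.txt';
DATA ONE;
INPUT NAME $ MIDTERM FINAL;
SCORE = (MIDTERM + FINAL)/2;
IF SCORE > 90 THEN GRADE = "A";
ELSE GRADE = "B";
FILE dataout;
PUT NAME GRADE;
DATALINES;
Joe 84 79
Sue 94 97
Betty 93 89
;
RUN;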
The external file then contains:
Joe B
Sue A
Betty A
Had the variables been listed in the opposite order in the PUT statement, the order of the
columns would have been exchanged in the external file.