TUTORIAL I: SAS Basics and Data management

I. SAS Basics
SAS (Statistical Analysis System) is one of the most popular statistical software packages. SAS
is used to read in, process, and output statistical information from data sets. It is very powerful
and can handle almost all types of statistical analysis.

A simple SAS program typically appears in the following general form:

DATA mydata;                  /* DATA step */
INPUT variables;
DATALINES;
the lines of data
;
RUN;

PROC procedure options;       /* PROC step */
options;
RUN;

SAS is organized into steps, which are like paragraphs. There are two types of steps:

• DATA steps, which put data in a form that SAS can use
• PROC steps, which use procedures to print, sort and analyze the data

SAS steps consist of a series of statements, which are like sentences in a paragraph.

• A semicolon (;) is REQUIRED to denote the end of a statement
• SAS statements consist of keywords that have special meaning and variable names added
by the programmer

In all programming, it is important to use a clear and consistent layout to make programs easier to
read and debug. In SAS, the spacing between words and keywords does not matter. In addition,
blank lines can be added in a program.
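
As a small illustration of the point about layout, the two PROC steps below are read by SAS in exactly the same way; only the second is easy for a person to read. The sketch assumes that a data set called mydata with variables name and final already exists (these names are used only for illustration). SAS keywords may be typed in upper or lower case.

```sas
/* Legal, but hard to read: three statements on one line */
proc print data=mydata; var name final; run;

/* The same step with one statement per line and indentation */
PROC PRINT DATA=mydata;
    VAR name final;
RUN;
```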

It is always good practice to place comments in your code. We will return to the topic later in the
tutorial.

II. SAS Program, Log and Output files


SAS is available in all CUIT Labs on the Windows workstations. You can start SAS by clicking
on the SAS icon. This will result in a number of windows appearing on the computer screen (see
figure below). These include an Editor, a Log file and an Output file. The editor is where you
write your SAS program. After you have written your SAS program file, you will need to submit
it for batch processing by pressing the icon depicting a runner (circled in the figure). When a SAS
program is run, it generates information in both the Log and Output files.

The log file contains information about the run, such as warnings and errors. You should
ALWAYS check the log file! An ERROR line means that some part of the processing failed; you
need to fix the problem and run the program again. It is also generally worthwhile to look at
WARNING lines.

The output file contains the output of the various procedures. If the log file reports errors, the
output may be missing or incomplete. Even if you get an output file, it is always necessary to
check the log file to make sure that SAS did what you really wanted it to.

[Figure: the SAS windows, showing the Editor, the Log file, the Output file and the run icon ("Press to run").]

III. The DATA Step


The DATA step consists of statements (lines of code) that create a data set, which SAS can
analyze in subsequent PROC steps. In a simple SAS program the DATA step can appear in the
following format:

DATA mydata;
INPUT variables;
DATALINES;
the lines of data
;
RUN;

The DATA statement is used to name the data set. The DATALINES and INFILE statements are
used to identify the location of the raw data. Finally, the INPUT statement describes the layout of
the raw data. We will look at these four statements in more detail below.

A. The DATA statement

The DATA statement gives the data set a name which allows us to refer to it at a later point in the
code. Using the command:

DATA mydata;

we create a new data set in SAS which is called mydata.

A data set name must begin with a letter or an underscore; it cannot begin with a number. It can be
up to 32 characters long (8 in SAS versions before Version 7) and should only contain the symbols
{0-9, A-Z, a-z} or "_".

B. The INPUT statement

The INPUT statement gives names to the variables in your data set and tells SAS how to move
through the raw data in order to read the variables. It is also possible (and, in some cases,
necessary) to specify variable types (strings, dates, etc.) in this line. The statement is written in
the following general form:

INPUT names types (if needed) column_designation (if needed);

Variable names are subject to the same restrictions as data set names. In addition, a name should
not be repeated in an INPUT statement. Variables can be either numeric or character. SAS assumes
that variables are numeric unless they are specifically designated as character. This is done by
marking the variable as character after its name in the INPUT statement.

The INPUT statement is also used to tell SAS how the data is stored. Two common styles are
described below:

(i) List Input

In list input, the data are contained in a list. The values are entered without regard to column
location, and they are separated by spaces.

There must be as many values in each line of data as there are variables in the INPUT statement.
A variable is designated as a character by placing a dollar sign ($) after the variable name.

Ex. Suppose we have the following data on the heights of a group of people, stored in the format:
feet, inches and names.

6 0 Michael
5 11 Fred
4 8.5 Isabel
1 11.5 Roxanne

An appropriate INPUT statement would then read:

INPUT FEET INCH NAME $;

Using this command we tell SAS to look through a list of data for two numeric variables (FEET
and INCH) and one character variable (NAME).
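
The list-input example above can be turned into a complete program along the following lines. This is a sketch: the data set name heights and the LENGTH statement are additions that are not part of the original example. With list input, SAS stores character values with a default length of 8, so longer names would be cut off; the LENGTH statement reserves more room.

```sas
DATA heights;
    LENGTH NAME $ 20;            /* assumption: allow names of up to 20 characters */
    INPUT FEET INCH NAME $;      /* list input: values separated by spaces */
    DATALINES;
6 0 Michael
5 11 Fred
4 8.5 Isabel
1 11.5 Roxanne
;
RUN;

PROC PRINT DATA=heights;
RUN;
```

Because NAME is mentioned first (in the LENGTH statement), it becomes the first variable in the data set. If that matters, the LENGTH statement can be left out as long as no value is longer than 8 characters.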

(ii) Column Input

In order to use column input, each variable must occupy the same columns on every line of data.
In the INPUT statement you must specify the column positions after the variable name.
If a variable is a character variable, a dollar sign comes before the column designation.

Ex: Assume the same data is stored in the following form:

6 0    Michael
5 11   Fred
4 8.5  Isabel
1 11.5 Roxanne

Our INPUT statement could then read:

INPUT FEET 1-1 INCH 3-6 NAME $ 8-20;

Here the variable FEET is found in the first column, the variable INCH in the third through
sixth columns and the variable NAME in the eighth through twentieth columns.
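
A complete DATA step using column input might look like the sketch below. The data set name heights_col and the last data line ("Mary Ann") are made up for illustration. That line shows one practical advantage of column input: a character value may contain a blank, because SAS reads fixed columns rather than splitting on spaces.

```sas
DATA heights_col;
    INPUT FEET 1-1 INCH 3-6 NAME $ 8-20;   /* fixed column positions */
    DATALINES;
6 0    Michael
5 11   Fred
4 8.5  Isabel
1 11.5 Roxanne
5 4    Mary Ann
;
RUN;

PROC PRINT DATA=heights_col;
RUN;
```

The data lines must start in column 1 so that each value falls in the columns named in the INPUT statement.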

C. Reading the data

The data you use in your SAS program can be stored in a number of locations. Typically it is
stored either in the program itself or in an external file. We will discuss both scenarios below.

(i) Placing data directly in the program

When we are working with small data sets it is often easier to place the data directly into the
program. The DATALINES statement signals the beginning of the lines of data. One data record
per line is standard. But, if the INPUT statement is constructed differently, more than one data
record per line can be parsed. A semicolon MUST be included following the last line of data. No
semicolon is used at the end of each individual data line!

Ex: The following code reads in three observations with three variables in each (BANK,
ACCTNUM and MONEY).

DATA CASH;
INPUT BANK $ ACCTNUM MONEY;
DATALINES;
CHASE 1536253 50.32
PNC 189273462 1563.82
FLEET 287363 20000.00
;
RUN;

Note: RUN is an optional last statement in DATA steps and PROC steps, but including it makes the
end of each step explicit.
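
As noted above, more than one data record can be placed on a line if the INPUT statement is written differently. One way to do this is to end the INPUT statement with a double trailing at-sign (@@), which tells SAS to keep reading from the current line until it runs out of values. The sketch below reuses the CASH data with two records on the first line.

```sas
DATA CASH;
    INPUT BANK $ ACCTNUM MONEY @@;   /* @@ holds the line for the next observation */
    DATALINES;
CHASE 1536253 50.32 PNC 189273462 1563.82
FLEET 287363 20000.00
;
RUN;
```

This should produce the same three observations as the version with one record per line.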

(ii) Reading data from an external file

You may often need to access data sets that are saved in a computer file. If you collect a set of
data that has many observations, you may want to put it in a separate file that you can easily
access. You can access these data by using the FILENAME statement before the DATA step:

FILENAME datain 'F:\mydata.txt';

where 'datain' is a fileref, a short nickname for the file (named using the same conventions as
variable names, but at most 8 characters long), and 'mydata.txt' is the name of the file from which
you are reading the data. In the above statement, the file is saved on drive F. If your file is stored
in a different location you must change the code accordingly.

Once this is done, add an INFILE statement in the DATA step, before the INPUT line.

INFILE datain;

Remember: do NOT add a DATALINES statement or raw data when reading from an external file!

Ex. Suppose that we have saved a file called 'bank.txt' on drive C, and that it reads:

Chase 1536253 50.32
PNC 189273462 1563.82
Fleet 287363 20000.00

To read this data into SAS we can use the following DATA step.

FILENAME INDATA 'C:\bank.txt';

DATA CASH;
INFILE INDATA;
INPUT BANK $ ACCTNUM MONEY;
RUN;

Sometimes you may want to skip a few lines in the data file or only read a certain amount of data.
This is especially true if your data file has a header. OBS and FIRSTOBS are two options you can
use in the INFILE statement to do this. If you use the option "obs=10", SAS will stop reading
after the 10th line. If you use "firstobs=5", SAS will start reading at the 5th line, ignoring
those before it.

Ex. To read only the first 5 observations (lines) from the external data file, write:

INFILE DATAFILE OBS=5;

To skip the first 4 observations (lines) in the external data file, write:

INFILE DATAFILE FIRSTOBS=5;

To skip the first 5 observations (lines) and read the next 7 observations from the external data file,
write:

INFILE DATAFILE FIRSTOBS=6 OBS=12;

Note that OBS gives the number of the last line to read, not the number of lines to read.
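
Putting the pieces together, a DATA step that skips a header line could be sketched as follows. It assumes a variant of the 'bank.txt' file from the earlier example whose first line holds column headings; the original file shown above has no such line, so this is for illustration only.

```sas
FILENAME INDATA 'C:\bank.txt';    /* path taken from the earlier example */

DATA CASH;
    INFILE INDATA FIRSTOBS=2;     /* assumption: line 1 is a header, so start at line 2 */
    INPUT BANK $ ACCTNUM MONEY;
RUN;

PROC PRINT DATA=CASH;
RUN;
```

The file path can also be given directly in the INFILE statement, as in INFILE 'C:\bank.txt' FIRSTOBS=2; in which case no FILENAME statement is needed.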

IV. The PROC Step


The PROC step analyzes the data that was created in the DATA step. In a simple SAS program,
the PROC step can appear in the following form:

PROC procedure options;
options;
RUN;

Each PROC step works a little differently from the others. Throughout the course of the semester
we will look at individual procedures to see how each works. Here we discuss two of the simplest
procedures used in SAS: PROC PRINT and PROC SORT.

A. PROC PRINT

Perhaps the simplest PROC step is PROC PRINT. This procedure tells SAS to print out certain
variables in the data set. Its general form is:

PROC PRINT DATA = mydata;
VAR variable_name_1 variable_name_2 etc;
RUN;

DATA = mydata tells PROC PRINT to use the SAS data set named "mydata".

List the variables you want to print after VAR in the order you want them printed.

Ex. Pulling together what we have learned so far, the following is an example of a simple SAS
program:

DATA MYDATA;
INPUT NAME $ MIDTERM FINAL;
DATALINES;
Joe 84 79
Sue 94 97
Betty 93 89
;
RUN;

PROC PRINT DATA=MYDATA;
VAR NAME FINAL;
RUN;

This program will read the name, midterm score and final score for three different people. It will
then print out the names and their scores on the final. Note that it does not print the midterm scores.

B. PROC SORT

Often we need to rearrange our data according to the values of a certain variable. To rearrange the
order of a SAS data set, you can use PROC SORT. The general form of this procedure is:

PROC SORT DATA = mydata;
BY var_name;
RUN;

Here the data set is sorted according to the values of the variable specified in the BY statement.

Ex. Suppose we have a data set named ‘cars’ which contains two variables car_type and mpg
(miles per gallon). To sort the cars from smallest to largest value of mpg write:

PROC SORT DATA=cars;
BY mpg;
RUN;

If we were instead interested in sorting from largest to smallest value, we could alter the BY
statement as follows:

BY DESCENDING mpg;
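
By default PROC SORT replaces the data set with its sorted version. If you would rather keep the original order as well, the OUT= option writes the sorted data to a new data set. The sketch below assumes the cars data set described above; the name cars_sorted is made up for illustration. It also shows that more than one variable can be listed in the BY statement.

```sas
/* Sort by car_type, and within each car_type from largest to smallest mpg */
PROC SORT DATA=cars OUT=cars_sorted;
    BY car_type DESCENDING mpg;
RUN;

PROC PRINT DATA=cars_sorted;
RUN;
```

Note that DESCENDING applies only to the variable that immediately follows it.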

V. Data Management

A. Comment Lines

Including comments in a SAS program is helpful to remind ourselves what a particular part of
code does. This is particularly important when one expects to reuse the code at a later time.

There are two ways to include comments. One can use either of the following:

• a statement which begins with a star (*) and ends with a semi-colon, or
• a statement which begins with /* and ends with */

Ex:

DATA CASH;
* This is an example of a comment;
INPUT BANK $ ACCTNUM MONEY;
/* So is this */
DATALINES;
CHASE 1536253 50.32
PNC 189273462 1563.82
FLEET 287363 20000.00
;
RUN;

B. Adding labels to your variables

Your output will become more readable if you add a LABEL statement. This allows for a short
description of each variable containing up to 256 characters in current versions of SAS (older
versions allowed only 40).

Ex. LABEL MONEY = "Amount of money in Account (in Dollars)";
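
In context, the LABEL statement goes inside the DATA step. The sketch below reuses the CASH example. One point that is easy to miss is that PROC PRINT shows variable names by default; the LABEL option on the PROC PRINT statement asks it to use the labels instead.

```sas
DATA CASH;
    INPUT BANK $ ACCTNUM MONEY;
    LABEL MONEY = "Amount of money in Account (in Dollars)";
    DATALINES;
CHASE 1536253 50.32
PNC 189273462 1563.82
FLEET 287363 20000.00
;
RUN;

PROC PRINT DATA=CASH LABEL;   /* LABEL option: print labels as column headings */
RUN;
```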

C. Creating New Variables and sub-setting the Data

It is possible to use a DATA step to create new variables from old ones. After the INPUT
statement, set the new variable to be some mathematical function of the old ones. All the usual
mathematical operators work: {+, -, *, / and ** for exponentiation}.

Ex. Revisit the example where we read in the midterm and final scores for three students. To
create a new variable SCORE which is the average of the MIDTERM and FINAL we write:

INPUT NAME $ MIDTERM FINAL;
SCORE = (MIDTERM + FINAL)/2;

Sometimes you may only want to perform operations on a subset of the data. We can do this by
using IF or IF … THEN … ELSE statements.

Ex. To create a new variable GRADE that is equal to "A" if SCORE > 90 and "B" otherwise we
write:

IF SCORE > 90 THEN GRADE = "A";
ELSE GRADE = "B";
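
An IF statement without THEN can also be used to keep only part of the data (a "subsetting IF"): observations for which the condition is false are left out of the data set. The sketch below applies this to the midterm and final scores; the data set name HIGH is made up for illustration. Note that quotation marks in SAS code must be plain straight quotes.

```sas
DATA HIGH;
    INPUT NAME $ MIDTERM FINAL;
    SCORE = (MIDTERM + FINAL)/2;
    IF SCORE > 90;               /* subsetting IF: keep only scores above 90 */
    DATALINES;
Joe 84 79
Sue 94 97
Betty 93 89
;
RUN;

PROC PRINT DATA=HIGH;
RUN;
```

With these data, Sue (95.5) and Betty (91) are kept, while Joe (81.5) is left out.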

D. DROP/KEEP Statements

Sometimes a data set contains variables that are not used in any data analysis. The DROP and
KEEP statements can be used to exclude certain variables in a data set when you do not need
them anymore. They have the form

DROP variable list;

and

KEEP variable list;

The DROP statement excludes the listed variables from the data (keeping the rest), while the
KEEP statement keeps only the listed variables (dropping the rest). Do not use both in the
same DATA step.

Ex:

DATA JUNE;
INPUT DATE $ HEIGHT WEIGHT;
KEEP DATE WEIGHT; /* or DROP HEIGHT; */
DATALINES;
JUNE_10 171 230
JUNE_20 171 225
JUNE_30 171 223
;
RUN;

PROC PRINT;
RUN;

The Output file will then look as follows (note HEIGHT is no longer part of the data set):

OBS  DATE     WEIGHT

1    JUNE_10  230
2    JUNE_20  225
3    JUNE_30  223

E. Use SET Statement to Copy Data Sets

A SAS data set can be used to create a new data set. This can be useful if you want to add some
information to the data set, while keeping a copy of the original. This can be done using the SET
statement.

Ex:

DATA CASH;
INPUT BANK $ ACCTNUM MONEY;
DATALINES;
CHASE 1536253 50.32
PNC 189273462 1563.82
FLEET 287363 20000.00
;
RUN;

DATA CASH2;
SET CASH;
RUN;

The data set "CASH2" is a copy of the data set "CASH". By adding some extra statements to the
second DATA step, "CASH2" can be changed into something more useful for your particular
analysis, while "CASH" is left unchanged.
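
As an example of such extra statements, the sketch below adds a new variable to the copy. The variable name INTEREST and the 2% rate are made up for illustration; the CASH data set is assumed to exist as created above.

```sas
DATA CASH2;
    SET CASH;                    /* start from a copy of CASH */
    INTEREST = 0.02 * MONEY;     /* hypothetical 2% interest on each account */
RUN;

PROC PRINT DATA=CASH2;
RUN;
```

CASH2 then holds the three original variables plus INTEREST, and CASH itself is not modified.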

F. Use SET Statement to Combine Data Sets

The SET statement can also be used to combine two or more data sets. If SET is used together
with a BY statement, the data sets are interleaved. If SET is used without a BY statement they
are concatenated. If a BY statement is used, you need to sort the data sets first. This can be done
using PROC SORT.

Ex: Suppose we are working with the following two data sets mydata1 and mydata2:

DATA mydata1;
INPUT NAME $ HEIGHT WEIGHT;
DATALINES;
Anne 71 130
Edith 69 160
Charlie 50 110
Bert 70 180
;
RUN;

DATA mydata2;
INPUT NAME $ EYES $ HAIR $;
DATALINES;
David BROWN BROWN
Bert BLUE BLOND
;
RUN;

To concatenate the two data sets use the following commands (note that no BY statement is used):

DATA both;
SET mydata1 mydata2;
RUN;

PROC PRINT DATA=BOTH;
RUN;

The Output file takes the following format:

Obs  NAME     HEIGHT  WEIGHT  EYES   HAIR

1    Anne     71      130
2    Edith    69      160
3    Charlie  50      110
4    Bert     70      180
5    David    .       .       BROWN  BROWN
6    Bert     .       .       BLUE   BLOND

To interleave the two data sets use the following set of commands:

PROC SORT DATA=mydata1;
BY Name;
RUN;

PROC SORT DATA=mydata2;
BY Name;
RUN;

DATA both;
SET mydata1 mydata2;
BY Name;
RUN;

PROC PRINT DATA=Both;
RUN;

The Output file will take the following format:

Obs  NAME     HEIGHT  WEIGHT  EYES   HAIR

1    Anne     71      130
2    Bert     70      180
3    Bert     .       .       BLUE   BLOND
4    Charlie  50      110
5    David    .       .       BROWN  BROWN
6    Edith    69      160

In this case the two data sets are combined and the resulting data is placed in alphabetical order
according to the variable NAME.

G. Use MERGE Statement to Merge Data Sets

The MERGE statement can be used to merge two or more data sets. It has the following form:

MERGE data1 data2;
BY a common variable;

If you use MERGE together with a BY statement, match merging is performed. When
performing match merging, the data sets need to be sorted before using the MERGE statement.

Ex: Suppose we are working with two data sets ONE and TWO:

DATA ONE;
INPUT DATE $ WEIGHT;
DATALINES;
JUNE_10 230
JUNE_20 225
JUNE_30 223
;
RUN;

DATA TWO;
INPUT DATE $ HEIGHT;
DATALINES;
JUNE_10 171
;
RUN;

To perform match-merging of the two data sets use the following commands:

PROC SORT DATA=one;
BY Date;
RUN;

PROC SORT DATA=two;
BY Date;
RUN;

DATA three;
MERGE one two;
BY Date;
RUN;

PROC PRINT;
RUN;

The Output file takes the following format:

Obs  DATE     WEIGHT  HEIGHT

1    JUNE_10  230     171
2    JUNE_20  225     .
3    JUNE_30  223     .

The dots (‘.’) in the output above indicate missing data (i.e. we have no measurement of height
for the dates June 20 and 30).
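
If you only want the observations that appear in both data sets, the IN= data set option can be combined with a subsetting IF. This goes slightly beyond the example above and is offered as a sketch: the names matched, in_one and in_two are made up, and the data sets one and two are assumed to be sorted by Date already.

```sas
DATA matched;
    MERGE one(IN=in_one) two(IN=in_two);   /* in_one/in_two flag the source of each row */
    BY Date;
    IF in_one AND in_two;                  /* keep only dates found in both data sets */
RUN;

PROC PRINT DATA=matched;
RUN;
```

With the example data, only the JUNE_10 observation should remain, since it is the only date with both a weight and a height.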

H. Use PUT statement to Create New External Data Files

It is useful to be able to write data onto an external file. Together with the FILENAME and FILE
statements, we can use the PUT statement to do this. The FILE statement is the complement of
the INFILE statement. The PUT statement is the complement of the INPUT statement.

Ex. The following program reads in the scores on the midterm and final for three students,
calculates their final grade and prints the grade (together with their name) to an external file.

DATA ONE;
INPUT NAME $ MIDTERM FINAL;
SCORE = (MIDTERM + FINAL)/2;
IF SCORE > 90 THEN GRADE = "A";
ELSE GRADE = "B";
DATALINES;
Joe 84 79
Sue 94 97
Betty 93 89
;
RUN;

FILENAME TEST 'c:\tmp.dat';

DATA TWO;
SET ONE;
FILE TEST;
PUT NAME GRADE;
RUN;

An external file 'tmp.dat' is then created on the C drive, which takes the following format:

Joe B
Sue A
Betty A

Note that if you instead had written:

PUT GRADE NAME;

in the program above, the order of the columns would have been exchanged in the external file.
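
The program above also creates a SAS data set TWO, which is simply a copy of ONE. When the only goal is to write the external file, a common alternative is to use the special data set name _NULL_, which runs the DATA step without creating a data set. The sketch below assumes that the data set ONE has been created as shown above.

```sas
FILENAME TEST 'c:\tmp.dat';

DATA _NULL_;          /* _NULL_: run the step without creating a data set */
    SET ONE;
    FILE TEST;        /* send PUT output to the external file */
    PUT NAME GRADE;
RUN;
```

The external file should have the same contents as before.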
