Sas Programming 1 Essentials Summary

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 41

SAS Programming 1: Essentials

e-Course

https://support.sas.com/edu/OLTRN/ECPRG193/index.htm

Summary of Lesson 1: Getting Started with


SAS Programming
This summary contains topic summaries.

Topic Summaries
To go to the movie where you learned a task or concept, select a link.

Exploring SAS
SAS is a suite of business solutions and technologies to help organizations solve business
problems. Base SAS is the centerpiece of all SAS software.
It can be useful to look at SAS capabilities in a simple framework:

Access data: Using SAS, you can read any kind of data.

Manage data: SAS gives you excellent data management capabilities

Analyze data: For statistical analysis, SAS is the gold standard.

Present data: You can use SAS to present your data meaningfully.

Understanding the SAS Programming Process


Here is the overall process of programming in SAS.
1. Define the business need.
2. Write a SAS program based on the desired output, the necessary input, and the
required processing.
3. Run the program.
4. Review your results.
5. If you find inaccuracies or errors, you debug or modify the program.
Depending on your results, you might need to repeat some of the steps.
The power of SAS is that you can use it to read any type of data, including the following three
major file types:

Raw data files contain data that has not been processed by any other computer
program. They are text files that contain one record per line, and the record typically
contains multiple fields. Raw data files arent reports; they are unformatted text.

SAS data sets are specific to SAS. A SAS data set is data in a form that SAS can
understand. Like raw data files, SAS data sets contain data. But in SAS data sets, the
data is created only by SAS and can be read only by SAS.

SAS program files contain SAS programming code. These instructions tell SAS how
to process your data and what output to create. You can save and reuse SAS program
files.

Summary of Lesson 2: Working with SAS Programs

This summary contains topic summaries, syntax, and sample programs.


Topic Summaries
To go to the movie where you learned a task or concept, select a link.
Exploring SAS Programs

A SAS program consists of DATA steps and PROC steps. A SAS programming step is
comprised of a sequence of statements. Every step has a beginning and ending step boundary.
SAS compiles and executes each step independently, based on the step boundaries.
A SAS program can also contain global statements, which are outside DATA and PROC steps,
and typically affect the SAS session. A TITLE statement is a global statement. After it is
defined, a title is displayed on every report, unless the title is cleared or canceled.
SAS statements usually begin with an identifying keyword, and always end with a semicolon.
SAS statements are free format and can begin and end in any column. A single statement can
span multiple lines, and there can be more than one statement per line. Unquoted values can
be lowercase, uppercase, or mixed case. This flexibility can result in programs that are
difficult to read.
Conventional formatting, also called structured formatting, uses consistent spacing to make a
SAS program easy to read. To follow best practices, begin each statement on a new line,
indent statements within each step, and indent subsequent lines in a multi-line statement.
Comments are used to document a program and to mark SAS code as non-executing text.
There are two types of comments: block comments and comment statements.
/* comment */
* comment statement;
Diagnosing and Correcting Syntax Errors

Syntax errors occur when program statements do not conform to the rules of the SAS
language. Common syntax errors include misspelled keywords, missing semicolons, and
invalid options. SAS finds syntax errors during the compilation phase, before it executes the
program. When SAS encounters a syntax error, it writes the following to the log: the word
ERROR or WARNING, the location of the error, and an explanation of the error. You should
always check the log, even if the program produces output.

Mismatched or unbalanced quotation marks are considered a syntax error. In some


programming environments, this results in a simple error message. In other environments, it is
more difficult to identify this type of error.

Sample Programs

Submitting a SAS Program


data work.newsalesemps;
set orion.sales;
where Country='AU';
run;
title 'New Sales Employees';
proc print data=work.newsalesemps;
run;
proc means data=work.newsalesemps;
class Job_Title;
var Salary;
run;
title;

Adding Comments to Your SAS Programs


*This program uses the data set orion.sales to create work.newsalesemps.;
data work.newsalesemps;
set orion.sales;
where Country='US';
run;
/*
proc print data=work.newsalesemps;
run;*/
proc means data=work.newsalesemps;
class Gender;
var Salary/*numeric variable*/;
run;

Viewing and Correcting Syntax Errors


daat work.newsalesemps;
length First_Name $ 12
Last_Name $ 18 Job_Title $ 25;
infile "&path/newemps.csv" dlm=',';
input First_Name $ Last_Name $
Job_Title $ Salary;
run;
proc print data=work.newsalesemps
run;
proc means data=work.newsalesemps average max;
class Job_Title;
var Salary;

run;

Summary of Lesson 3: Accessing Data

This summary contains topic summaries, syntax, and sample programs.


Topic Summaries
To go to the movie where you learned a task or concept, select a link.
Accessing SAS Libraries

SAS data sets are stored in SAS libraries. A SAS library is a collection of one or more SAS
files that are recognized by SAS. SAS automatically provides one temporary and at least one
permanent SAS library in every SAS session.
Work is a temporary library that is used to store and access SAS data sets for the duration of
the session. Sasuser and sashelp are permanent libraries that are available in every SAS
session.
You refer to a SAS library by a library reference name, or libref. A libref is a shortcut to the
physical location of the SAS files.
All SAS data sets have a two-level name that consists of the libref and the data set name,
separated by a period. Data sets in the work library can be referenced with a one-level name,
consisting of only the data set name, because work is the default library. Data sets in
permanent libraries must be referenced with a two-level name.
You can create and access your own SAS libraries. User-defined libraries are permanent but
are not automatically available in a SAS session. You must assign a libref to a user-created
library to make it available. You use a LIBNAME statement to associate the libref with the
physical location of the library, that is, the physical location of your data. You can submit the
LIBNAME statement alone at the start of a SAS session, or you can store it in a SAS program
so that the SAS library is defined each time the program runs. If your program needs to
reference data sets in multiple locations, you can use multiple LIBNAME statements.
LIBNAME libref 'SAS-library' <options>;

Use PROC CONTENTS with libref._ALL_ to display the contents of a SAS library. The
report will list all the SAS files contained in the library, as well as the descriptor portion of
each data set in the library. Use the NODS option in the PROC CONTENTS statement to
suppress the descriptor information for each data set.
PROC CONTENTS DATA=libref._ALL_ NODS;

RUN;
After associating a libref with a permanent library, you can write a PROC PRINT step to
display a SAS data set within the library.
PROC PRINT DATA=libref.SAS-data-set;
RUN;

In an interactive SAS session, a libref remains in effect until you cancel it, change it, or end
your SAS session. To cancel a libref, you submit a LIBNAME statement with the CLEAR
option. This clears or disassociates a libref that was previously assigned. To specify a different
physical location, you submit a LIBNAME statement with the same libref name but with a
different filepath.
LIBNAME libref CLEAR;

When a SAS session ends, everything in the work library is deleted. The librefs are also
deleted. Remember that the contents of permanent libraries still exist in in the operating
environment, but each time you start a new SAS session, you must resubmit the LIBNAME
statement to redefine a libref for each user-created library that you want to access.
Examining SAS Data Sets

SAS data sets are specially structured data files that SAS creates and that only SAS can read.
A SAS data set is displayed as a table composed of variables and observations. A SAS data set
contains a descriptor portion and a data portion.
The descriptor portion contains general information about the data set (such as the data set
name and the number of observations) and information about the variable attributes (such as
name, type, and length). There are two types of variables: character and numeric. A character
variable can store any value and can be up to 32,767 characters long. Numeric variables store
numeric values in floating point or binary representation in 8 bytes of storage by default.
Other attributes include formats, informats, and labels. You can use PROC CONTENTS to
browse the descriptor portion of a data set.
PROC CONTENTS DATA=libref.SAS-data-set;
RUN;
The data portion contains the data values. Data values are either character or numeric. A valid
value must exist for every variable in every observation in a SAS data set. A missing value is
a valid value in SAS. A missing character value is displayed as a blank, and a missing numeric

value is displayed as a period. You can specify an alternate character to print for missing
numeric values using the MISSING= SAS system option. You can use PROC PRINT to
display the data portion of a SAS data set.
SAS variable and data set names must be 1 to 32 characters in length and start with a letter or
underscore, followed by letters, underscores, and numbers. Variable names are not case
sensitive.

Sample Programs

Accessing a SAS Library


/*Replace filepath with the physical location of your practice files.*/
%let path=filepath;
libname orion "&path";

Browsing a Library
proc contents data=orion._all_;
run;
proc contents data=orion._all_ nods;
run;

Viewing a Data Set with PROC PRINT


proc print data=orion.country;
run;

Viewing the Descriptor Portion of a Data Set


proc contents data=orion.sales;
run;

Viewing the Data Portion of a SAS Data Set


proc print data=orion.sales;
run;

Summary of Lesson 4: Producing Detail Reports

This summary contains topic summaries, syntax, and sample programs.


Topic Summaries
To go to the movie where you learned a task or concept, select a link.
Subsetting Report Data

You can use the VAR statement in a PROC PRINT step to subset the variables in a report. You
specify the variables to include and list them in the order in which they are to be displayed.
You can use the SUM statement in a PROC PRINT step to calculate and display report totals
for the requested numeric variables.
PROC PRINT DATA=SAS-data-set;
VAR variable(s);
SUM variable(s);
RUN;
The WHERE statement in a PROC PRINT step subsets the observations in a report. When
you use a WHERE statement, the output contains only the observations that meet the
conditions specified in the WHERE expression. This expression is a sequence of operands and
operators that form a set of instructions that define the condition. The operands can be
constants or variables. Remember that variable operands must be defined in the input data set.
Operators include comparison, arithmetic, logical, and special WHERE operators.
WHERE where-expression;

You can use the ID statement in a PROC PRINT step to specify a variable to print at the
beginning of the row instead of an observation number. The variable that you specify replaces
the Obs column.
ID variable(s);
Sorting and Grouping Report Data

The SORT procedure sorts the observations in a data set. You can sort on one variable or
multiple variables, sort on character or numeric variables, and sort in ascending or descending
order. By default, SAS replaces the original SAS data set unless you use the OUT= option to
specify an output data set. PROC SORT does not generate printed output.

Every PROC SORT step must include a BY statement to specify one or more BY variables.
These are variables in the input data set whose values are used to sort the data. By default,
SAS sorts in ascending order, but you can use the keyword DESCENDING to specify that the
values of a variable are to be sorted in descending order. When your SORT step has multiple
BY variables, some variables can be in ascending and others in descending order.
You can also use a BY statement in PROC PRINT to display observations grouped by a
particular variable or variables. The groups are referred to as BY groups. Remember that the
input data set must be sorted on the variables specified in the BY statement.
PROC SORT DATA=input-SAS-data-set
<OUT=ouput-SAS-data-set>;
BY <DESCENDING> by-variable(s);
RUN;
Enhancing Reports

You can enhance a report by adding titles, footnotes, and column labels. Use the global
TITLE statement to define up to 10 lines of titles to be displayed at the top of the output from
each procedure. Use the global FOOTNOTE statement to define up to 10 lines of footnotes to
be displayed at the bottom of the output from each procedure.
TITLEn 'text';
FOOTNOTEn 'text';

Titles and footnotes remain in effect until you change or cancel them, or until you end your
SAS session. Use a null TITLE statement to cancel all titles, and a null FOOTNOTE
statement to cancel all footnotes.
Use the LABEL statement in a PROC PRINT step to define temporary labels to display in the
report instead of variable names. Labels can be up to 256 characters in length. Most
procedures use labels automatically, but PROC PRINT does not. Use the LABEL option in
the PROC PRINT statement to tell SAS to display the labels. Alternatively, the SPLIT=
option tells PROC PRINT to use the labels and also specifies a split character to control line
breaks in column headings.
PROC PRINT DATA=SAS-data-set LABEL;
LABEL variable='label'
variable='label'
... ;
RUN;

SPLIT='split-character';

Sample Programs

Subsetting Your Report


proc print data=orion.sales;
var Last_Name First_Name Salary;
sum Salary;
run;

Selecting Observations
proc print data=orion.sales noobs;
var Last_Name First_Name Salary Country;
where Country='AU' and Salary<25500;
run;

Using the CONTAINS Operator


proc print data=orion.sales noobs;
var Last_Name First_Name Country Job_Title;
where Country='AU' and Job_Title contains 'Rep';
run;

Subsetting Observations and Replacing the Obs Column


proc print data=orion.customer_dim;
where Customer_Age=21;
id Customer_ID;
var Customer_Name
Customer_Gender Customer_Country
Customer_Group Customer_Age_Group
Customer_Type;
run;

Sorting a Data Set


proc sort data=orion.sales
out=work.sales_sort;
by Salary;
run;
proc print data=work.sales_sort;
run;

Sorting a Data Set by Multiple Variables


proc sort data=orion.sales
out=work.sales2;
by Country descending Salary;
run;
proc print data=work.sales2;

run;

Grouping Observations in Reports


proc sort data=orion.sales
out=work.sales2;
by Country descending Salary;
run;
proc print data=work.sales2;
by Country;
run;

Displaying Titles and Footnotes in a Report


title1 'Orion Star Sales Staff';
title2 'Salary Report';
footnote1 'Confidential';
proc print data=orion.sales;
var Employee_ID Last_Name Salary;
run;
proc print data=orion.sales;
var Employee_ID First_Name Last_Name Job_Title Hire_Date;
run;

Changing and Canceling Titles and Footnotes


title1 'Orion Star Sales Staff';
title2 'Salary Report';
footnote1 'Confidential';
proc print data=orion.sales;
var Employee_ID Last_Name Salary;
run;
title1 'Employee Information';
proc print data=orion.sales;
var Employee_ID First_Name Last_Name Job_Title Hire_Date;
run;

Displaying Labels in a Report


title1 'Orion Star Sales Staff';
title2 'Salary Report';
footnote1 'Confidential';
proc print data=orion.sales label;
var Employee_ID Last_Name Salary;
label Employee_ID = 'Sales ID'
Last_Name = 'Last Name'
Salary = 'Annual Salary';
run;
title;
footnote;

Summary of Lesson 5: Formatting Data Values

This summary contains topic summaries, syntax, and sample programs.


Topic Summaries
To go to the movie where you learned a task or concept, select a link.
Using SAS Formats

A format is an instruction that tells SAS how to display data values in output reports. You can
add a FORMAT statement to a PROC PRINT step to specify temporary SAS formats that
control how values appear in the report. There are many existing SAS formats that you can
use. Character formats begin with a dollar sign, but numeric formats do not.
FORMAT variable(s) format;

SAS stores date values as the number of days between January 1, 1960, and a specific date. To
make the dates in your report recognizable and meaningful, you must apply a SAS date
format to the SAS date values.
Creating and Applying User-Defined Formats

You can create your own user-defined formats. When you create a user-defined format, you
don't associate it with a particular variable or data set. Instead, you create it based on values
that you want to display differently. The formats will be available for the remainder of your
SAS session. You can apply user-defined formats to a specific variable in a PROC PRINT
step.
You use the FORMAT procedure to create a format. You assign a format name that can have
up to 32 characters. The name of a character format must begin with a dollar sign, followed
by a letter or underscore, followed by letters, numbers, and underscores. Names for numeric
formats must begin with a letter or underscore, followed by letters, numbers, and underscores.
A format name cannot end in a number and cannot be the name of a SAS format.
You use a VALUE statement in a PROC FORMAT step to specify the way that you want the
data values to appear in your output. You define value-range sets to specify the values to be
formatted and the formatted values to display instead of the stored value or values. The value
portion of a value-range set can include an individual value, a range of values, a list of values,
or a keyword. The keyword OTHER is used to define a value to display if the stored data
value does not match any of the defined value-ranges.
PROC FORMAT;

VALUE format-name value-or-range1='formatted-value1'


value-or-range2='formatted-value2'
...;
RUN;

When you define a numeric format, it is often convenient to use numeric ranges in the valuerange sets. Ranges are inclusive by default. To exclude the endpoints, use a less-than symbol
after the low end of the range or before the high end.
The LOW and HIGH keywords are used to define a continuous range when the lowest and
highest values are not known. Remember that for character values, the LOW keyword treats
missing values as the lowest possible values. However, for numeric values, LOW does not
include missing values.

Sample Programs

Applying Temporary Formats


proc print data=orion.sales label noobs;
where Country='AU' and
Job_Title contains 'Rep';
label Job_Title='Sales Title'
Hire_Date='Date Hired';
var Last_Name First_Name Country Job_Title
Salary Hire_Date;
run;
proc print data=orion.sales label noobs;
where Country='AU' and
Job_Title contains 'Rep';
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Hire_Date mmddyy10. Salary dollar8.;
var Last_Name First_Name Country Job_Title
Salary Hire_Date;
run;

Specifying a User-Defined Format for a Character Variable


proc format;
value $ctryfmt 'AU'='Australia'
'US'='United States'
other='Miscoded';
run;
proc print data=orion.sales label;
var Employee_ID Job_Title Salary
Country Birth_Date Hire_Date;
label Employee_ID='Sales ID'
Job_Title='Job Title'
Salary='Annual Salary'

Birth_Date='Date of Birth'
Hire_Date='Date of Hire';
format Salary dollar10.
Birth_Date Hire_Date monyy7.
Country $ctryfmt.;
run;

Specifying a User-Defined Format for a Numeric Variable


proc format;
value tiers low-<50000='Tier1'
50000-100000='Tier2'
100000<-high='Tier3';
run;
proc print data=orion.sales;
var Employee_ID Job_Title Salary
Country Birth_Date Hire_Date;
format Birth_Date Hire_Date monyy7.
Salary tiers.;
run;

Summary of Lesson 6: Reading SAS Data Sets

This summary contains topic summaries, syntax, and sample programs.


Topic Summaries
To go to the movie where you learned a task or concept, select a link.
Reading a SAS Data Set

You use a DATA step to create a new SAS data set from an existing SAS data set. The DATA
step begins with a DATA statement, which provides the name of the SAS data set to create.
Include a SET statement to name the existing SAS data set to be read in as input.
You use the WHERE statement to subset the input data set by selecting only the observations
that meet a particular condition. To subset based on a SAS date value, you can use a SAS date
constant in the WHERE expression. SAS automatically converts a date constant to a SAS date
value.
DATA output-SAS-data-set;
SET input-SAS-data-set;
WHERE where-expression;
RUN;

You use an assignment statement to create a new variable. The assignment statement
evaluates an expression and assigns the resulting value to a new or existing variable. The
expression is a sequence of operands and operators. If the expression includes arithmetic
operators, SAS performs the numeric operations based on priority, as in math equations. You
can use parentheses to clarify or alter the order of operations.
variable=expression;
Customizing a SAS Data Set

By default, the SET statement reads all of the observations and variables from the input data
set and writes them to the output data set. You can customize the new data set by selecting
only the observations and variables that you want to include. You can use a WHERE
statement to select the observations, as long as the variables included in the condition come
from the input data set. You can use a DROP statement to list the variables to exclude from
the new data set, or use a KEEP statement to list the variables to include. If you use a KEEP
statement, you must include every variable to be written, including any new variables.

DROP variable-list;
KEEP variable-list;

SAS processes the DATA step in two phases: the compilation phase and the execution phase.
You can subset the original data set with a WHERE statement for variables that are defined in
the input data set, and a subsetting IF statement for new variables that are created in the
DATA step. Remember that, although IF expressions are similar to WHERE expressions, you
cannot use special WHERE operators in IF expressions.
IF expression;

To subset observations in a PROC step, you must use a WHERE statement. You cannot use a
subsetting IF statement in a PROC step. To subset observations in a DATA step, you can
always use a subsetting IF statement. However, a WHERE statement can make your DATA
step more efficient because it subsets on input.
Adding Permanent Attributes

When you use the LABEL statement in a DATA step, SAS permanently associates the labels
to the variables by storing the labels in the descriptor portion of the data set. Using a
FORMAT statement in a DATA step permanently associates formats with variables. The
format information is also stored in the descriptor portion of the data set. You can use PROC
CONTENTS to view the label and format information. PROC PRINT does not display
permanent labels unless you use the LABEL or SPLIT= option.
LABEL variable='label'
variable='label'
... ;

FORMAT variable(s) format ...;

Sample Programs

Subsetting Observations in the DATA Step


proc print data=orion.sales;
run;

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
run;
proc print data=work.subset1;
run;

Subsetting Observations and Creating a New Variable


data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep' and
Hire_Date<'01jan2000'd;
Bonus=Salary*.10;
run;
proc print data=work.subset1 noobs;
var First_name Last_Name Salary
Job_Title Bonus Hire_Date;
format Hire_Date date9.;
run;

Subsetting Variables in a DATA Step: DROP and KEEP


data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
drop Employee_ID Gender Country Birth_Date;
run;
proc print data=work.subset1;
run;
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
keep First_Name Last_Name Salary Job_Title Hire_Date Bonus;
run;
proc print data=work.subset1;
run;

Selecting Observations by Using the Subsetting IF Statement


data work.auemps;
set orion.sales;
where Country='AU';
Bonus=Salary*.10;
if Bonus>=3000;
run;
proc print data=work.auemps;
run;

Adding Permanent Labels to a SAS Data Set

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
drop Employee_ID Gender Country Birth_Date;
run;
proc contents data=work.subset1;
run;
proc print data=work.subset1 label;
run;

Adding Permanent Formats to a SAS Data Set


data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Salary Bonus dollar12.
Hire_Date ddmmyy10.;
drop Employee_ID Gender Country Birth_Date;
run;
proc contents data=work.subset1;
run;
proc print data=work.subset1 label;
run;

Summary of Lesson 7: Reading Spreadsheet and Database Data

This summary contains topic summaries, syntax, and sample programs.


Topic Summaries
To go to the movie where you learned a task or concept, select a link.
Reading Spreadsheet Data

You can use SAS/ACCESS Interface to PC Files to read the worksheets within a Microsoft
Excel workbook. After you submit a SAS/ACCESS LIBNAME statement, SAS treats the
Excel workbook as if it were a SAS library and treats the worksheets as if they were SAS data
sets within that library. You submit a LIBNAME statement to specify a libref, an engine
name, and the location and name of the workbook. The engine tells SAS the type of input file
and which engine to use to read the input data.
LIBNAME libref <engine> "workbook-name" <options>;
LIBNAME libref <engine> <PATH=> "workbook-name" <options>;

When you browse the library, you might see worksheets and named ranges. Worksheet names
end with a dollar sign, and named ranges do not. Because the dollar sign is a special character,
you must use a SAS name literal when you refer to a worksheet in a program.
libref.'worksheetname$'n

When you assign a libref to an Excel workbook in SAS, the workbook cannot be opened in
Excel. To disassociate a libref, you submit a LIBNAME statement specifying the libref and
the CLEAR option. SAS disconnects from the data source and closes any resources that are
associated with the connection.
Reading Database Data

You can also read database tables as if they were SAS data sets by using the LIBNAME
statement supported by SAS/ACCESS Interface to Oracle. This SAS/ACCESS LIBNAME
statement includes a libref, an engine name, and additional connection options that are siteand installation-specific. After you submit the LIBNAME statement, SAS treats the Oracle
database as if it were a SAS library, and any table in the database can be referenced using a
SAS two-level name, as if it were a SAS data set.
LIBNAME libref engine <SAS/ACCESS Oracle options>;

Sample Programs

Accessing Excel Worksheets in SAS


libname orionx pcfiles path="&path/sales.xls";
proc contents data=orionx._all_;
run;

Printing an Excel Worksheet


proc print data=orionx.'Australia$'n;
run;
proc print data=orionx.'Australia$'n noobs;
where Job_Title ? 'IV';
var Employee_ID Last_Name Job_Title Salary;
run;

Creating a SAS Data Set from an Excel Worksheet


libname orionx pcfiles path="&path/sales.xls";
data work.subset;
set orionx.'Australia$'n;
where Job_Title contains 'Rep';
Bonus=Salary*.10;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Salary comma10. Hire_Date mmddyy10.
Bonus comma8.2;
run;
proc contents data=work.subset;
run;
proc print data=work.subset label;
run;
libname orionx clear;

Summary of Lesson 8: Reading Raw Data Files

This summary contains topic summaries, syntax, and sample programs.


Topic Summaries
To go to the movie where you learned a task or concept, select a link.
Introduction to Reading Raw Data Files

A raw data file is an external text file that contains one record per line, and a record typically
contains multiple fields. The fields can be delimited or arranged in fixed columns. Typically,
there are no column headings. The file is usually described in an external document called a
record layout.
In order for SAS to read a raw data file, you must specify the location of each data value in
the record, along with the names and types of the SAS variables in which to store the values.
Three styles of input are available: list input, column input, and formatted input. List input
reads delimited files, and column and formatted input read fixed column files. List input and
formatted input can read both standard and nonstandard data, and column input can read only
standard data. In this course, we are reading delimited raw data files, so list input is used.
Reading Standard Delimited Data

You use a DATA step with INFILE and INPUT statements to read data from a raw data file.
The INFILE statement identifies the name and location of the input file. You use the DLM=
option if the file has a delimiter other than a blank space. The INPUT statement tells SAS how
to read the values, and specifies the name and type for each variable to be created. In the
INPUT statement, you list the variables in the order that the corresponding values appear in
the raw data file, from left to right. You specify character variables by adding a dollar sign
after the variable name. With list input, the default length for all variables is 8 bytes,
regardless of type.
DATA output-SAS-data-set-name;
INFILE 'raw-data-file-name' DLM='delimiter';
INPUT variable1 <$> variable2 <$> ... variableN <$>;
RUN;
SAS processes the DATA step in two phases: compilation and execution. During compilation,
SAS creates an input buffer to hold a record from the raw data file. The input buffer is an area
of memory that SAS creates only when reading raw data, not when reading a SAS data set.
SAS also creates the PDV, an area of memory where an observation is built. In addition to the
variables named in the INPUT statement, SAS creates the iteration counter, _N_, and the error

indicator, _ERROR_, in the PDV. These temporary variables are not written to the output
data set. At the end of the compilation, SAS creates the descriptor portion of the output data
set.
At the start of the execution phase, SAS initializes the PDV and then reads the first record
from the raw data file into the input buffer. It scans the input buffer from non-delimiter to
delimiter and assigns each value to the corresponding variable in the PDV. SAS ignores
delimiters. At the bottom of the DATA step, SAS writes the values from the PDV to the new
SAS data set and then returns to the top of the DATA step.
Truncation often occurs with list input, because character variables are created with a length
of 8 bytes, by default. You can use a LENGTH statement before the INPUT statement in a
DATA step to explicitly define the length of character variables. Numeric variables can be
included in the LENGTH statement to preserve the order of variables, but you need to specify
a length of 8 for each numeric variable.
LENGTH variable(s) <$> length;
Reading Nonstandard Delimited Data

You can use modified list input to read standard and nonstandard data from a delimited raw
data file. Modified list input uses an informat and a colon format modifier for each field to be
read. An informat tells SAS how to read data values, including the number of characters.
When SAS reads character data, a standard character informat, such as $12., is often used
instead of a LENGTH statement. With list input, the data fields vary in length. The colon
format modifier tells SAS to ignore the specified length when it reads data values, and instead
to read only until it reaches a delimiter. Omitting the colon format modifier is likely to result
in data errors.
INPUT variable <$> variable <:informat>;

An informat is required to read nonstandard numeric data, such as calendar dates, and
numbers with dollar signs and commas. Many SAS informats are available for nonstandard
numeric values. Every informat has a width, whether stated explicitly or set by default.
When reading a raw data file, you can use a DROP or KEEP statement to write a subset of
variables to the new data set. You must use a subsetting IF statement to select observations,
because the variables are not coming from an input SAS data set. You can use LABEL and
FORMAT statements to permanently store label and format information in the new data set.
A DATA step can also read instream data, which is data that is within a SAS program. To
specify instream data, you use a DATALINES statement in a DATA step, followed by the lines
of data, followed by a null statement.

DATALINES;
<data line 1>
<data line 2>
...
;
Validating Data

When data values in the input file aren't appropriate for the INPUT statement in a program, a
data error occurs during program execution. SAS records the error in the log by writing a note
about the error, along with a ruler and the contents of the input buffer and the PDV. The
variable _ERROR_ is set to 1, a missing value is assigned to the corresponding variable, and
execution continues.
You can use the DSD option in the INFILE statement if data values are missing in
the middle of a record. When you use the DSD option, SAS assumes that the file
is comma delimited, treats consecutive delimiters as missing data, and allows
embedded delimiters in a field that is enclosed in quotation marks. If you have
missing data values at the end of a record, you can use the MISSOVER option in
the INFILE statement. SAS sets the variable values to missing.
INFILE 'raw-data-file-name' <DLM=> <DSD> <MISSOVER>;

Sample Programs

Creating a SAS Data Set from a Delimited Raw Data File


data work.sales1;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $
Last_Name $ Gender $ Salary
Job_Title $ Country $;
run;
proc print data=work.sales1;
run;

Specifying the Lengths of Variables Explicitly


data work.sales2;
length First_Name $ 12 Last_Name $ 18
Gender $ 1 Job_Title $ 25
Country $ 2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $;
run;

proc contents data=work.sales2;


run;
proc print data=work.sales2;
run;
data work.sales2;
length Employee_ID 8 First_Name $ 12
Last_Name $ 18 Gender $ 1
Salary 8 Job_Title $ 25
Country $ 2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $;
run;
proc contents data=work.sales2 varnum;
run;
proc print data=work.sales2;
run;

Specifying Informats in the INPUT Statement


data work.sales2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name :$12. Last_Name :$18.
Gender :$1. Salary Job_Title :$25. Country :$2.
Birth_Date :date. Hire_Date :mmddyy.;
run;
proc print data=work.sales2;
run;

Subsetting and Adding Permanent Attributes


data work.subset;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name :$12.
Last_Name :$18. Gender :$1. Salary
Job_Title :$25. Country :$2.
Birth_Date :date. Hire_Date :mmddyy.;
if Country='AU';
keep First_Name Last_Name Salary
Job_Title Hire_Date;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Salary dollar12. Hire_Date monyy7.;
run;
proc print data=work.subset label;
run;

Reading Instream Data


data work.newemps;
input First_Name $ Last_Name $
Job_Title $ Salary :dollar8.;
datalines;
Steven Worton Auditor $40,450
Merle Hieds Trainee $24,025

Marta Bamberger Manager $32,000


;
proc print data=work.newemps;
run;
data work.newemps2;
infile datalines dlm=',';
input First_Name $ Last_Name $
Job_Title $ Salary :dollar8.;
datalines;
Steven,Worton,Auditor,$40450
Merle,Hieds,Trainee,$24025
Marta,Bamberger,Manager,$32000
;
proc print data=work.newemps2;
run;

Reading a Raw Data File That Contains Data Errors


data work.sales4;
infile "&path/sales3inv.csv" dlm=',';
input Employee_ID First $ Last $
Job_Title $ Salary Country $;
run;
proc print data=work.sales4;
run;

Reading a Raw Data File That Contains Missing Data


data work.contacts;
length Name $ 20 Phone Mobile $ 14;
infile "&path/phone2.csv" dsd;
input Name $ Phone $ Mobile $;
run;
proc print data=work.contacts noobs;
run;

Reading a Raw Data File Using the MISSOVER Option


data work.contacts2;
infile "&path/phone.csv" dlm=',' missover;
input Name $ Phone $ Mobile $;
run;
proc print data=contacts2 noobs;
run;
data work.contacts2;
length Name $ 20 Phone Mobile $ 14;
infile "&path/phone.csv" dlm=',' missover;
input Name $ Phone $ Mobile $;
run;
proc print data=contacts2 noobs;
run;

Summary of Lesson 9: Manipulating Data

This summary contains topic summaries, syntax, and sample programs.


Topic Summaries
To go to the movie where you learned a task or concept, select a link.
Using SAS Functions

You use an assignment statement in a DATA step to evaluate an expression and assign the
result to a new or existing variable. The expression on the right side of an assignment
statement can include calls to SAS functions. A SAS function is a routine that accepts
arguments and returns a value.
variable=expression;

The SUM function, a descriptive statistics function, returns the sum of its arguments and
ignores missing values. The arguments can be numeric constants, numeric variables, or
arithmetic expressions, but the arguments must be numeric values and must be enclosed in
parentheses and separated with commas. The parentheses are required, even if no arguments
are passed to the function.
SUM(argument1,argument2, ...)

In addition to descriptive statistics functions, many SAS date functions are available. Some of
these functions create SAS dates, and others extract information from SAS dates. The
MONTH function extracts and returns the numeric month from a SAS date.
MONTH(SAS-date)
Conditional Processing

The IF-THEN statement is a conditional statement. It executes a SAS statement for


observations that meet specific conditions. The statement includes an expression and a SAS
program statement. The expression defines a condition that must be true for the statement to
be executed. The expression is evaluated during each iteration of the DATA step. If the
condition is true, the statement following the THEN statement is executed; otherwise, SAS
skips the statement.
IF expression THEN statement;

A program often includes a sequence of IF statements with mutually exclusive conditions.


When SAS encounters a true condition in this series, evaluating the other conditions isn't
necessary. You can use the ELSE statement to specify an alternative action to be performed
when the condition in an IF-THEN statement is false. This increases the efficiency of the
program.
You can use the logical operators AND and OR to combine conditions in an IF expression.
You use the AND operator when both conditions must be true, and you use the OR operator
when only one of the conditions must be true. An optional final ELSE statement can be used
at the end of a series of IF-THEN/ELSE statements. The statement following the final ELSE
executes if none of the IF expressions is true.
Use a DO group with an IF-THEN or an ELSE statement when multiple statements must be
executed based on one condition. The DO group consists of a DO statement, the SAS
statements to be executed, and an END statement. Each DO statement must have a
corresponding END statement.
IF expression THEN
DO;
executable statements
END;
ELSE IF expression THEN
DO;
executable statements
END;

Truncation can occur when new variables are assigned values within conditional program
statements. During compilation, SAS creates a variable in the PDV the first time it encounters
the variable in the program. If this is in conditional code, be sure that it is created with a
length long enough to store all possible values. It is a best practice to use a LENGTH
statement to explicitly define the length.

Sample Programs

Creating Variables by Using Functions


data work.comp;
set orion.sales;
Bonus=500;
Compensation=sum(Salary,Bonus);
BonusMonth=month(Hire_Date);

run;
proc print data=work.comp;
var Employee_ID First_Name Last_Name
Salary Bonus Compensation BonusMonth;
run;

Assigning Values Conditionally


data work.comp;
set orion.sales;
if Job_Title='Sales Rep. IV' then
Bonus=1000;
if Job_Title='Sales Manager' then
Bonus=1500;
if Job_Title='Senior Sales Manager' then
Bonus=2000;
if Job_Title='Chief Sales Officer' then
Bonus=2500;
run;
proc print data=work.comp;
var Last_Name Job_Title Bonus;
run;

Using Compound Conditions


data work.comp;
set orion.sales;
if Job_Title='Sales Rep. III' or
Job_Title='Sales Rep. IV' then
Bonus=1000;
else if Job_Title='Sales Manager' then
Bonus=1500;
else if Job_Title='Senior Sales Manager' then
Bonus=2000;
else if Job_Title='Chief Sales Officer' then
Bonus=2500;
else Bonus=500;
run;
proc print data=work.comp;
var Last_Name Job_Title Bonus;
run;

Using IF-THEN/ELSE Statements


data work.bonus;
set orion.sales;
if Country='US' then Bonus=500;
else Bonus=300;
run;
proc print data=work.bonus;
var First_Name Last_Name Country Bonus;
run;

Creating Two Variables Conditionally


data work.bonus;
set orion.sales;
if Country='US' then
do;

Bonus=500;
Freq='Once a Year';
end;
else if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
run;
proc print data=work.bonus;
var First_Name Last_Name Country Bonus Freq;
run;

Adjusting the Program


data work.bonus;
set orion.sales;
length Freq $ 12;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
run;
proc print data=work.bonus;
var First_Name Last_Name Country Bonus Freq;
run;
data work.bonus;
set orion.sales;
length Freq $ 12;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else do;
Bonus=300;
Freq='Twice a Year';
end;
run;
proc print data=work.bonus;
var First_Name Last_Name Country
Bonus Freq;
run;

Summary of Lesson 10: Combining SAS Data Sets

This summary contains topic summaries, syntax, and sample programs.


Topic Summaries
To go to the movie where you learned a task or concept, select a link.
Concatenating Data Sets

You can concatenate two or more data sets by combining them vertically to create a new data
set. It is important to know the structure and contents of the input data sets.
You use a DATA step to concatenate multiple data sets into a single, new data set. In the SET
statement, you can specify any number of input data sets to concatenate. During compilation,
SAS uses the descriptor portion of the first data set to create variables in the PDV, and then
continues with each subsequent data set, creating additional variables in the PDV as needed.
During execution, SAS processes the data sets in the order in which they are listed in the SET
statement.
DATA SAS-data-set;
SET SAS-data-set1 SAS-data-set2 ...;
RUN;

If the data sets have differently named variables, every variable is created in the new data set,
and some observations have missing values for the differently named variables. You can use
the RENAME= data set option to change variable names in one or more data sets. After they
are renamed, they are treated as the same variable during compilation and execution, and in
the new data set.
SAS-data-set (RENAME=(old-name-1=new-name-1;
old-name-2=new-name-2
...
old-name-n=new-name-n))
Merging SAS Data Sets One-to-One

Merging combines observations from two or more SAS data sets into a single observation in a
new data set. A simple merge combines observations based on their positions in the original
data sets. A match-merge combines them based on the values of one or more common
variables. The result of a match-merge is dependant on the relationship between observations
in the input data sets.

You use a DATA step with a MERGE statement to merge multiple data sets into a single data
set. The BY statement indicates a match-merge and specifies the common variable or
variables to match. The common variables are referred to as BY variables. The BY variables
must exist in every data set, and each data set must be sorted by the value of the BY variables.
DATA SAS-data-set;
MERGE SAS-data-set1 SAS-data-set2 ...;
BY <DESCENDING> BY-variable(s);
<additional SAS statements>
RUN;
Merging SAS Data Sets One-to-Many
In a one-to-many merge, a single observation in one data set matches more than
one observation in another data set. The DATA step is the same, regardless of the
relationship between the data sets being merged. SAS processes each BY group
before reinitializing the PDV.
Merging SAS Data Sets That Have Non-Matches

When you merge data sets, observations in one data set might not have a matching
observation in another data set. These are called non-matches. By default, both matches and
non-matches are included in a merged data set. The observations that are matches contain data
from every input data set. The non-matching observations do not contain data from every
input data set.
You can use the IN= data set option in a MERGE statement to create a temporary variable that
indicates whether a data set contributed information to the observation in the PDV. The IN=
variables have two possible values: 0 and 1. You can test the value of this variable using
subsetting IF statements to output only the matches or only the non-matches to the merged
data set.
MERGE SAS-data-set1 <(IN=variable)>...

Sample Programs

Concatenating Data Sets with Different Variables


********** Create Data **********;
data empscn;
input First $ Gender $ Country $;
datalines;
Chang
M
China
Li
M
China

Ming
;
run;

China

data empsjp;
input First $ Gender $ Region $;
datalines;
Cho
F
Japan
Tomi
M
Japan
;
run;
********** Unlike-Structured Data Sets **********;
data empsall2;
set empscn empsjp;
run;
proc print data=empsall2;
run;

Merging Data Sets One-to-One


********** Create Data **********;
data empsau;
input First $ Gender $ EmpID;
datalines;
Togar
M
121150
Kylie
F
121151
Birin
M
121152
;
run;
data phoneh;
input EmpID Phone $15.;
datalines;
121150 +61(2)5555-1793
121151 +61(2)5555-1849
121152 +61(2)5555-1665
;
run;
********** Match-Merge One-to-One**********;
data empsauh;
merge empsau phoneh;
by EmpID;
run;
proc print data=empsauh;
run;

Match-Merging Data Sets with Non-Matches


********** Create Data **********;
data empsau;
input First $ Gender $ EmpID;
datalines;
Togar
M
121150
Kylie
F
121151
Birin
M
121152
;
run;

data phonec;
input EmpID Phone $15.;
datalines;
121150 +61(2)5555-1795
121152 +61(2)5555-1667
121153 +61(2)5555-1348
;
run;
********** Match-Merge with Non-Matches**********;
data empsauc;
merge empsau phonec;
by EmpID;
run;
proc print data=empsauc;
run;

Selecting Non-Matches
********** Create Data **********;
data empsau;
input First $ Gender $ EmpID;
datalines;
Togar
M
121150
Kylie
F
121151
Birin
M
121152
;
run;
data phonec;
input EmpID Phone $15.;
datalines;
121150 +61(2)5555-1795
121152 +61(2)5555-1667
121153 +61(2)5555-1348
;
run;
********** Non-Matches from empsau Only **********;
data empsauc2;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
if Emps=1 and Cell=0;
run;
proc print data=empsauc2;
run;

Summary of Lesson 11: Creating Summary Reports

This summary contains topic summaries, syntax, and sample programs.


Topic Summaries
To go to the movie where you learned a task or concept, select a link.
Using PROC FREQ to Create Summary Reports

You can use PROC FREQ to produce frequency tables that report the distribution of any or all
variable values in a SAS data set. You use the TABLES statement to specify the frequency
tables to produce. You can include a WHERE statement in the PROC FREQ step to subset the
observations. SAS has TABLES statement options that you can use to suppress the default
statistics.
PROC FREQ DATA=SAS-data-set <option(s)>;
TABLES variable(s) </option(s)>;
<additional SAS statements>
RUN;

Frequency distributions work best with variables whose values are categorical and best
summarized by counts instead of averages. Variables that have continuous numeric values,
such as dollar amounts and dates, or many discrete values, can result in a lengthy and
meaningless frequency table. To create a useful frequency report for these variables, you can
apply a SAS or user-defined format to group the values into categories.
You can list multiple variables in a TABLES statement, separated by spaces. This creates a
one-way frequency table for each variable. You can request a separate analysis for each group
by including a BY statement. You can request a two-way frequency table by separating the
variables with an asterisk instead of a space. The resulting crosstabulation table displays
statistics for each distinct combination of values of the selected variables. You can use
TABLES statement options to suppress statistics, change the table format, and format the
displayed values.
Using PROC FREQ for Data Validation

The FREQ procedure can also be used to validate a data set. A one-way frequency table,
which displays all discrete values for a variable and reports on missing values, easily
identifies the existence of invalid or missing values. You can use the ORDER=FREQ and
NLEVELS options to identify duplicate values. After you've identified invalid values, you can
use PROC PRINT to display the corresponding observations.
Using the MEANS and UNIVARIATE Procedures

You can use PROC MEANS to produce summary reports with descriptive statistics. By
default, it reports the number of nonmissing values, the mean, the standard deviation, the
minimum, and the maximum value of every numeric variable in a data set. You can use the
VAR statement to specify the numeric variables to analyze, and add a CLASS statement to
request statistics for groups of observations. The variables listed in the CLASS statement are
called classification variables, or class variables, and each combination of class variable
values is called a class level.
PROC MEANS DATA=SAS-data-set <statistic(s)>;
VAR analysis-variable(s);
CLASS classification-variable(s);
RUN;

When you use the CLASS statement, the output includes N Obs, which reports the number of
observations with each unique combination of class variables. You can request specific
statistics by listing them as options in the PROC MEANS statement. Other options are
available to control the output.
You can also use PROC MEANS to validate a data set. The MIN, MAX, and NMISS statistics
can be used to validate numeric data when you know the range of valid values. PROC
UNIVARIATE can be more useful because it displays the extreme observations, or outliers.
By default, it displays the five highest and five lowest values of the analysis variable, and the
number of the observation with each extreme value. You can use the NEXTROBS= option to
display a different number of extreme observations.
PROC UNIVARIATE DATA=SAS-data-set;
VAR variable(s);
RUN;
Using the Output Delivery System

You can use the SAS Output Delivery System to create different output formats by directing
output to various ODS destinations. For each type of formatted output that you want to create,
you use an ODS statement to open that destination, submit one or more procedures that
generate output, and then close the destination. A file is created for each open destination.
ODS destination FILE="filename" <options>;
<SAS code to generate the report>
ODS destination CLOSE;

Sample Programs

Creating a One-Way Frequency Report


proc freq data=orion.sales;
tables Gender;
where Country='AU';
run;

Using Formats in PROC FREQ


proc format;
value Tiers low-25000='Tier1'
25000<-50000='Tier2'
50000<-100000='Tier3'
100000<-high='Tier4';
run;
proc freq data=orion.sales;
tables Salary;
format Salary Tiers.;
run;

Listing Multiple Variables on a TABLES Statement


proc freq data=orion.sales;
tables Gender Country;
run;
proc sort data=orion.sales out=sorted;
by Country;
run;
proc freq data=sorted;
tables Gender;
by Country;
run;

Creating a Crosstabulation Table


proc freq data=orion.sales;
tables Gender*Country;
run;

Examining Your Data


proc print data=orion.nonsales2 (obs=20);
run;

Using PROC FREQ Options to Validate Your Data


proc freq data=orion.nonsales2 order=freq;
tables Employee_ID/nocum nopercent;
run;
proc freq data=orion.nonsales2 nlevels;
tables Gender Country Employee_ID/nocum nopercent;
run;
proc freq data=orion.nonsales2 nlevels;
tables Gender Country Employee_ID/nocum nopercent noprint;
run;

Using PROC PRINT to Validate Your Data

proc print data=orion.nonsales2;


where Gender not in ('F','M') or
Country not in ('AU','US') or
Job_Title is null or
Salary not between 24000 and 500000 or
Employee_ID is missing or
Employee_ID=120108;
run;

Creating a Summary Report with PROC MEANS


proc means data=orion.sales;
var Salary;
run;

Creating a PROC MEANS Report with Grouped Data


proc means data=orion.sales;
var Salary;
class Gender Country;
run;

Requesting Specific Statistics in PROC MEANS


proc means data=orion.sales n mean;
var Salary;
run;
proc means data=orion.sales min max sum;
var Salary;
class Gender Country;
run;

Validating Data Using PROC MEANS


proc means data=orion.nonsales2 n nmiss min max;
var Salary;
run;

Validating Data Using PROC UNIVARIATE


proc univariate data=orion.nonsales2;
var Salary;
run;
proc univariate data=orion.nonsales2 nextrobs=3;
var Salary;
run;
proc univariate data=orion.nonsales2 nextrobs=3;
var Salary;
id Employee_ID;
run;

Using the SAS Output Delivery System


/*Use a filepath to a location where you have Write access.*/
ods pdf file="c:/output/salaries.pdf";
proc means data=orion.sales min max sum;
var Salary;
class Gender Country;
run;

ods pdf close;


ods csv file="c:/output/salarysummary.csv";
proc means data=orion.sales min max sum;
var Salary;
class Gender Country;
run;
ods csv close;

You might also like