0% found this document useful (0 votes)
13 views4 pages

A Data Set Joining Primer

The document discusses different techniques for joining data from multiple sources in SAS. It describes different types of joins like inner, outer, and cross joins. It also compares using the MERGE statement versus PROC SQL for performing joins, noting advantages and disadvantages of each. Various scenarios requiring different join types are presented.

Uploaded by

kmeena73
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views4 pages

A Data Set Joining Primer

The document discusses different techniques for joining data from multiple sources in SAS. It describes different types of joins like inner, outer, and cross joins. It also compares using the MERGE statement versus PROC SQL for performing joins, noting advantages and disadvantages of each. Various scenarios requiring different join types are presented.

Uploaded by

kmeena73
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 4

Putting It Together – The Poster

A Data Set Joining Primer


Janet Stuelpner
Caroline Bahler

Abstract beyond, the new LIBNAME engines allow us to join


The ability to combine different types of data from all types of data sources to a SAS data set.)
multiple hardware and software platforms is a major
strength of the SAS system. SAS has blessed Figure 1 illustrates a common joining strategy with
information analysts with a wealth of different some of the ambiguities involved. The optimum
options for joining data values from many different place to join a SAS data set and a flat file is within
data structures. There are several factors that will SAS. However, whether the database tables are
determine the type of join that is needed. The source joined within the database environment depends
of the data and the outcome are of utmost upon the type of join required. For instance, outer
importance. This paper will discuss the joining joins can be very database resource intensive and the
techniques offered within the SAS system. “better” choice might be to join the two database
tables within the SAS environment.
Introduction
Data warehouses can contain data collected and Types of Joins
stored in many different physical forms. These data The join operation works on two or more tables at a
structures can include flat files, database tables, time. You can find any relationship that exists
spreadsheets, and SAS data sets. Utilization of this among data elements. There are several types of
“raw” data by an information analyst can require joins. We will focus on only three types. These are
combining two or more of these data structures the cross product, inner join and outer join.
through the use of a join (merging and joining are
synonymous terms referring to the combination of The inner join is also called an equi-join. The result
data structures through the use of common of an inner join is the intersection of the rows in both
variables/fields). One of the strengths of the SAS tables. In other words, it is the rows that are
system is that it provides many different options for common to both tables.
joining data values from many different data
structures. An outer join is used to return all rows that exist in
one table even though the corresponding rows do not
The Software Environment exist in the joined table. There are three types of
The “software environment” greatly influences how outer joins: left, right, full. A left outer join returns
a group of two or more data structures are joined. all rows from the left table. A right outer join
Selection of a joining strategy increases in returns all rows from the right table. A full outer
complexity as the number of “software join returns all rows from both tables. The last type
environments” containing data increases. A of join is a Cross-Join or a Cartesian Product. Some
standard rule of thumb is that data structures from may say that this is not a join at all. In reality, it is
the same environment should be joined within that the set of all possible combinations of the rows from
environment. However, this does not always hold two tables. Something to note is that if your tables
true since system resources and other factors may are large, the output from a cross join could result in
indicate that it is more “efficient” to join data a table that has millions of rows of data.
structures from the same software within a different
“environment”. (Note: in previous versions of SAS SAS Merge vs. PROC SQL
prior to version 7, you could only join a SAS data One of the toughest decisions an analyst must make
set to another SAS data set. Any type of raw data, is whether to use a SAS merge or a PROC SQL.
whether an excel spreadsheet, a flat file or a table There are many pros and cons that must be
from a relational database, had to be converted into a
SAS data set before it could be merged together with
data that was in a SAS data set. In version 7 and
Figure 1. Example of joining strategy.

is being merged. This will set up a boolean operator


considered when making this decision. Let’s take a that will allow easy coding of an outer join.
look at each type of join and then compare.
PROC SQL
Merge PROC SQL is a powerful procedure combining
A merge can be performed with a BY statement some of the functionality of the DATA and PROC
(match-merge) or without (one-to-one merge). Most steps into a single procedure. PROC SQL in many
often, a MERGE is used to combine two data sets cases can be a more efficient alternative to
based upon one or more common variables, which traditional SAS code. PROC SQL is often used as
means BY group processing will take place. You the interface to other database systems. With PROC
must ensure that the observations in both data sets SQL, you can retrieve, update, and report on
are in the correct order, or that they can be retrieved information from SAS data sets or other database
in the correct order. Whether you perform the SORT products. You can match variables that are not the
procedure first, the data is pre-sorted or indexed same name. The data does not need to be pre-sorted.
using the BY variable, it does not matter. The key Only with SQL can you directly perform a cross
variable (or BY variable) that is common to both join.
data sets must have the same name and it is
suggested that they be of the same length. There can Table 1 illustrated the types of join data sets and the
only be one variable name used in the output data desirable SAS tool that is needed to merge the data.
set. If you have variables of the same name in both There are 5 scenarios that are defined. For each one
data sets, the variables from the second data set will several different tools can be used.
overwrite the data from the first. This can give you
unexpected results in the output data set. Scenario A - Groups of two or more parent data sets
are used to build or create a child or output data set.
One great advantage of merging the data is the In scenario A all data values from all of the parent
ability to use the IN= option with each data set that
Informational Type of Data Set Join SAS Tool
Requirements
A. All data values from all data Match-Merge or Full Outer Join • MERGE statement with BY
sets. statement4
• PROC SQL2
B. All data values from a single Non-base data set(s) • PROC FORMAT1
data set (base) and all data are used as • SET statement with KEYS=
values from the other data • “Look-up” table(s) option3
set(s) that match the data • Right or Left outer join. • PROC SQL2
values of the joining variables
within the base.
C. Only those data values from Inner Join • PROC SQL2
all data sets that contain the • MERGE statement with IN=
same data values within the option and BY statement4
joining variables.
D. Placement of data sets side by One-to-one Merge MERGE statement4
side
E. Expansion of data set to Many-to-many Join PROC SQL2
include all levels of a non-
common variable.
Table 1. The type of parent data set joins required to construct a specific child data set.

data sets are needed to create the child data set Scenario E - Expansion of the child data set occurs
(Table 1). The type of joins used to accomplish this when one data set has multiple levels of a non-
are a match-merge or a full join. Missing values are common variable. An example would be where one
placed within observations that do not occur within data set contains a listing of names by city and the
all parent data sets. other contains city information. The child data set
required contains all of the names for each city plus
Scenario B - All data values of a base parent data the city information. In this case the data sets are
set are kept and only those observations containing joined by the common variable city and all names
matching data values of the common variable(s) are within a city are transferred to the child data set.
selected from all other parent data sets (Table 1).
The terminology right or left join is an indication of Conclusions
which parent data set to use as the base The pros and cons of using the different joins are
listed in Table 2.
Scenario C - An inner join is used when the child
data set needs to contain only those data records The selection of a joining tool is dependent on the
from the parent data sets where the common environment of the data structures, the required
variable(s) are identical. contents of the data sets and what tool is the most
system resource efficient. During the application
Scenario D - One-to-one merging combines all of development process, careful bench-marking of
the parent data sets using a common variable and joining tools is required to ensure selection of the
creates a child data set that is as large as the largest correct environment and tool for the job.
data set within the merge list. The parent data sets
are not joined by common variables. Instead parent ACKNOWLEDGMENTS
data sets are placed side by side and each common I would like to take this opportunity to acknowledge
variable data value is super-imposed by the data all of the help and support given to me by my
values within the last parent data set within the husband, Robert Stuelpner. He diligently read and
merge.
SAS Tool Pros Cons
1
PROC FORMAT Creates a “look-up” table using The key values must be unique, no
either single variables or multiple duplicate values can occur within the data
variables for the key and label set used to create the format. Format can
components of the format. be applied to only one data set at a time.

MERGE statement4 • Used only for one-to-one The data sets are not joined by any common
merges no sorting required. variables. If there is a common variable
• Two (2) or more data sets can between the data sets, then the common
be joined. variable will contain the values of the
common variable in the last data set joined.

MERGE statement with BY • Two or more data sets can be All data sets must be sorted by common
statement4 joined by common variables. variable.
• All values of all variables
within all data sets will be
retained.
MERGE statement with IN= • Two (2) or more data sets can All data sets must be sorted by common
option and BY statement4 be joined by common variables. variable.
• All data from each data set will
be read before subsetting
criteria applied.
PROC SQL2 • Data sets do not need to be pre- • All outer joins occur between two data
sorted. sets only.
• Inner joins can occur between • Inner join contains only those values of
two (2) or more data sets. the common variable that match
between the data sets joined.
SET statement with KEYS= Two (2) or more data sets can be Data sets need to be indexed by common
option3 joined by common variables. variables.

Table 2. Pros and cons for using each type of SAS joining tool.

re-read this paper in an effort to correct the obvious 4. Bahler, C. and Clos, S.. To Format or Merge ...
errors and keep me on track. His criticism were That is the Question. Proceedings of the
constructive and his support never ending. Southeast SAS Users Group. 3:363-367.
5. Bowman, JS, Emerson, SL and Darnovsky, M
This has been a tremendous learning experience.
The Practical SQL Handbook, Addison-Wesley,
Thanks to Caroline Bahler who raised the bar to
even greater heights. 1996
6. Plew, RR and Stephens, RK Teach Yourself
References SQL in 24 Hours, SAMS, 2000
1. SAS Institute Inc.. SAS Procedures Guide,
Version 6, Third Edition. Cary, NC: SAS SAS is a registered trademark or trademark of SAS
Institute Inc., 1990. 275-312 pp. Institute Inc. in the USA and other countries. ®
2. SAS Institute. SAS Technical Report P-222, indicates USA registration.
Changes and Enhancements to Base SAS
Software, Release 6.07. Cary, NC: SAS Institute To Contact the authors:
Inc., 1991. 91, 207-217 pp. Caroline Bahler - cbahler@hotmail.com
3. SAS Institute. SAS Language: Reference, Janet Stuelpner - jstuelpner@usa.net
Version 6, First Edition. Cary, NC: SAS
Institute Inc., 1990. 147-155 pp.

You might also like