A Data Set Joining Primer
A Data Set Joining Primer
data sets are needed to create the child data set Scenario E - Expansion of the child data set occurs
(Table 1). The type of joins used to accomplish this when one data set has multiple levels of a non-
are a match-merge or a full join. Missing values are common variable. An example would be where one
placed within observations that do not occur within data set contains a listing of names by city and the
all parent data sets. other contains city information. The child data set
required contains all of the names for each city plus
Scenario B - All data values of a base parent data the city information. In this case the data sets are
set are kept and only those observations containing joined by the common variable city and all names
matching data values of the common variable(s) are within a city are transferred to the child data set.
selected from all other parent data sets (Table 1).
The terminology right or left join is an indication of Conclusions
which parent data set to use as the base The pros and cons of using the different joins are
listed in Table 2.
Scenario C - An inner join is used when the child
data set needs to contain only those data records The selection of a joining tool is dependent on the
from the parent data sets where the common environment of the data structures, the required
variable(s) are identical. contents of the data sets and what tool is the most
system resource efficient. During the application
Scenario D - One-to-one merging combines all of development process, careful bench-marking of
the parent data sets using a common variable and joining tools is required to ensure selection of the
creates a child data set that is as large as the largest correct environment and tool for the job.
data set within the merge list. The parent data sets
are not joined by common variables. Instead parent ACKNOWLEDGMENTS
data sets are placed side by side and each common I would like to take this opportunity to acknowledge
variable data value is super-imposed by the data all of the help and support given to me by my
values within the last parent data set within the husband, Robert Stuelpner. He diligently read and
merge.
SAS Tool Pros Cons
1
PROC FORMAT Creates a “look-up” table using The key values must be unique, no
either single variables or multiple duplicate values can occur within the data
variables for the key and label set used to create the format. Format can
components of the format. be applied to only one data set at a time.
MERGE statement4 • Used only for one-to-one The data sets are not joined by any common
merges no sorting required. variables. If there is a common variable
• Two (2) or more data sets can between the data sets, then the common
be joined. variable will contain the values of the
common variable in the last data set joined.
MERGE statement with BY • Two or more data sets can be All data sets must be sorted by common
statement4 joined by common variables. variable.
• All values of all variables
within all data sets will be
retained.
MERGE statement with IN= • Two (2) or more data sets can All data sets must be sorted by common
option and BY statement4 be joined by common variables. variable.
• All data from each data set will
be read before subsetting
criteria applied.
PROC SQL2 • Data sets do not need to be pre- • All outer joins occur between two data
sorted. sets only.
• Inner joins can occur between • Inner join contains only those values of
two (2) or more data sets. the common variable that match
between the data sets joined.
SET statement with KEYS= Two (2) or more data sets can be Data sets need to be indexed by common
option3 joined by common variables. variables.
Table 2. Pros and cons for using each type of SAS joining tool.
re-read this paper in an effort to correct the obvious 4. Bahler, C. and Clos, S.. To Format or Merge ...
errors and keep me on track. His criticism were That is the Question. Proceedings of the
constructive and his support never ending. Southeast SAS Users Group. 3:363-367.
5. Bowman, JS, Emerson, SL and Darnovsky, M
This has been a tremendous learning experience.
The Practical SQL Handbook, Addison-Wesley,
Thanks to Caroline Bahler who raised the bar to
even greater heights. 1996
6. Plew, RR and Stephens, RK Teach Yourself
References SQL in 24 Hours, SAMS, 2000
1. SAS Institute Inc.. SAS Procedures Guide,
Version 6, Third Edition. Cary, NC: SAS SAS is a registered trademark or trademark of SAS
Institute Inc., 1990. 275-312 pp. Institute Inc. in the USA and other countries. ®
2. SAS Institute. SAS Technical Report P-222, indicates USA registration.
Changes and Enhancements to Base SAS
Software, Release 6.07. Cary, NC: SAS Institute To Contact the authors:
Inc., 1991. 91, 207-217 pp. Caroline Bahler - cbahler@hotmail.com
3. SAS Institute. SAS Language: Reference, Janet Stuelpner - jstuelpner@usa.net
Version 6, First Edition. Cary, NC: SAS
Institute Inc., 1990. 147-155 pp.