
SAS® Data Management

Tools and Applications

Course Notes
SAS® Data Management Tools and Applications Course Notes was developed by Mark Craver,
David Ghan, Robert Ligtenberg, Kari Richardson, and Erin Winters. Instructional design, editing, and
production support was provided by the Learning Design and Development team.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or
trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.

SAS® Data Management Tools and Applications Course Notes

Copyright © 2020 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States
of America. No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise,
without the prior written permission of the publisher, SAS Institute Inc.

Book code E71514, course code LWDISDS2/DISDS2, prepared date 27Feb2020. LWDISDS2_001

ISBN 978-1-64295-452-4

Table of Contents

Lesson 1 SAS/ACCESS® Technology Overview ..................................................1-1

1.1 SAS/ACCESS Technology Overview ..................................................................1-3

Demonstration: Using SAS/ACCESS Methods for a Variety of Data Sources ............. 1-12

Lesson 2 SAS® Data Integration Studio: Essentials ............................................2-1

2.1 Exploring the SAS Platform and SAS Data Integration Studio ...............................2-3

Demonstration: Importing Metadata ............................................................ 2-11

2.2 Exploring SAS Data Integration Studio Basics................................................... 2-18

Demonstration: Exploring SAS Data Integration Studio Basics ...................... 2-24

2.3 Examining SAS Data Integration Studio Jobs and Options ................................. 2-34

Demonstration: Examining SAS Data Integration Studio Jobs and Options ..... 2-38

Practice ................................................................................................... 2-49

2.4 Solutions ....................................................................................................... 2-50

Solutions to Practices................................................................................ 2-50

Solutions to Student Activities .................................................................... 2-54

Lesson 3 SAS® Data Integration Studio: Defining Source Data Metadata ............3-1

3.1 Setting Up the Environment ...............................................................................3-3

Demonstration: Defining Custom Folders.......................................................3-5

3.2 Defining Metadata for a Library ..........................................................................3-9

Demonstration: Defining Metadata for a SAS Library .................................... 3-11

3.3 Registering Metadata for Data Sources ............................................................ 3-17

3.4 Registering SAS Table Metadata...................................................................... 3-22

Demonstration: Registering Metadata for SAS Source Tables........................ 3-25

Practice ................................................................................................... 3-31

3.5 Registering DBMS Table Metadata................................................................... 3-33



Demonstration: Registering Metadata for Oracle Source Tables .................... 3-36

Practice ................................................................................................... 3-46

3.6 Registering ODBC Data Source Table Metadata ............................................... 3-47

Demonstration: Registering Metadata for ODBC Data Sources ..................... 3-49

3.7 Registering Metadata for External Files ............................................................ 3-59

Demonstration: Registering Metadata for a Delimited External File ............... 3-61

Practice ................................................................................................... 3-69

3.8 Solutions ....................................................................................................... 3-70

Solutions to Practices................................................................................ 3-70

Solutions to Student Activities .................................................................... 3-78

Lesson 4 SAS® Data Integration Studio: Defining Target Data Metadata .............4-1

4.1 Registering Metadata for Target Tables ...............................................................4-3

Demonstration: Refresh the Metadata ...........................................................4-8

Demonstration: Defining the Product Dimension Table Metadata ................... 4-12

Practice ................................................................................................... 4-20

4.2 Importing Metadata......................................................................................... 4-22

Demonstration: Importing Relational Metadata ............................................ 4-24

4.3 Solutions ....................................................................................................... 4-30

Solutions to Practices................................................................................ 4-30

Lesson 5 SAS® Data Integration Studio: Working with Jobs ...............................5-1

5.1 Creating Metadata for Jobs ...............................................................................5-3

Demonstration: Refresh the Metadata ........................................................ 5-13

Demonstration: Populating the Current and Terminated Staff Tables .............. 5-16
Practices.................................................................................................. 5-28

5.2 Working with the Join Transformation ............................................................... 5-31

Demonstration: Populating the Product Dimension Table .............................. 5-41

Practices.................................................................................................. 5-62

5.3 Solutions ....................................................................................................... 5-64

Solutions to Practices................................................................................ 5-64

Lesson 6 SAS® Data Integration Studio: Working with Transformations .............6-1

6.1 Working with the Extract and Summary Statistics Transformations ........................6-3

Demonstration: Refresh the Metadata ..........................................................6-5

Demonstration: Reporting for United States Customers ..................................6-9

Practices.................................................................................................. 6-28

6.2 Exploring the SQL Transformations .................................................................. 6-32

Demonstration: Concatenating Tables with Set Operators Transformation ...... 6-38

Practice ................................................................................................... 6-47

6.3 Creating Custom Transformations .................................................................... 6-49

Demonstration: Creating a New Transformation ........................................... 6-59

Practice ................................................................................................... 6-74

Demonstration: Using a New Transformation ............................................... 6-77

Practice ................................................................................................... 6-88

6.4 Solutions ....................................................................................................... 6-90

Solutions to Practices................................................................................ 6-90

Lesson 7 Introduction to Data Quality and the SAS® Quality Knowledge Base ....................7-1

7.1 Introduction to Data Quality ...............................................................................7-3

7.2 SAS Quality Knowledge Base Overview .............................................................7-8

Lesson 8 DataFlux® Data Management Studio: Essentials ..................................8-1

8.1 Overview of Data Management Studio ...............................................................8-3

Demonstration: Navigating the DataFlux Data Management Studio Interface ..............8-9

8.2 DataFlux Repositories..................................................................................... 8-21

Demonstration: Creating a DataFlux Repository ........................................... 8-24

Practice ................................................................................................... 8-32



8.3 Quality Knowledge Bases and Reference Data Sources .................................... 8-33

Demonstration: Verifying the Course QKB and Reference Sources ................ 8-37

8.4 Data Connections ........................................................................................... 8-39

Demonstration: Working with Data Connections ........................................... 8-42

Demonstration: Viewing the Data inside a Data Connection .......................... 8-50

Practice ................................................................................................... 8-55

8.5 Solutions ....................................................................................................... 8-56

Solutions to Practices................................................................................ 8-56

Lesson 9 DataFlux® Data Management Studio: Understanding Data ...................9-1

9.1 Methodology Review ........................................................................................9-3

9.2 Creating Data Collections..................................................................................9-6

Demonstration: Creating a Collection of Descriptive Fields..............................9-8

9.3 Designing Data Explorations ........................................................................... 9-10

Demonstration: Creating and Reviewing Results from a Data Exploration (Optional) .......... 9-19

Demonstration: Creating a Collection from a Data Exploration (Optional) ....... 9-31

9.4 Creating Data Profiles ..................................................................................... 9-36

Demonstration: Creating and Exploring a Data Profile .................................. 9-48

Practice ................................................................................................... 9-64

9.5 Profiling Other Input Types .............................................................................. 9-65

9.6 Designing Data Standardization Schemes ........................................................ 9-75

Demonstration: Creating a Phrase Standardization Scheme.......................... 9-79

Demonstration: Comparing a New Analysis Report to an Existing Scheme ..... 9-86

Demonstration: Creating an Element Standardization Scheme ...................... 9-93

Practice ................................................................................................... 9-97

9.7 Solutions ....................................................................................................... 9-98

Solutions to Practices................................................................................ 9-98



Lesson 10 DataFlux® Data Management Studio: Building Data Jobs to Improve Data ......... 10-1

10.1 Introduction to Data Jobs ................................................................................ 10-3

Demonstration: Setting DataFlux Data Management Studio Options ............ 10-15

10.2 Standardization, Parsing, and Casing ............................................................. 10-20

Demonstration: Investigating Standardization ............................................ 10-24


Demonstration: Working with a Field Layout Node...................................... 10-31

Demonstration: Investigating Parsing and Casing....................................... 10-36

Practice ................................................................................................. 10-44

10.3 Identification Analysis and Right Fielding ........................................................ 10-46

Demonstration: Investigating Right Fielding and Identification Analysis......... 10-50

10.4 Branching and Gender Analysis ..................................................................... 10-55

Demonstration: Working with the Branch and Data Validation Nodes ........... 10-57

Demonstration: Investigating Gender Analysis ........................................... 10-62

10.5 Data Enrichment .......................................................................................... 10-68

Demonstration: Working with Address Verification and Geocoding Nodes ..... 10-74

Practice ................................................................................................. 10-84

10.6 Solutions ..................................................................................................... 10-86

Solutions to Practices.............................................................................. 10-86

Solutions to Student Activities ................................................................ 10-100

Lesson 11 DataFlux® Data Management Studio: Building Data Jobs for Entity Resolution ... 11-1

11.1 Introduction.................................................................................................... 11-3

11.2 Creating Match Codes .................................................................................... 11-7

Demonstration: Creating a Data Job to Generate Match Codes ................... 11-10

11.3 Clustering Records ....................................................................................... 11-17

Demonstration: Using Match Codes (and Other Fields) to Cluster Data Records .......... 11-21

Demonstration: Creating a Match Report from Clustered Data ..................... 11-30



Practice ................................................................................................. 11-36

11.4 Survivorship ................................................................................................. 11-38

Demonstration: Adding Survivorship to the Entity Resolution Job................. 11-48

Demonstration: Adding Field-Level Rules for the Surviving Record .............. 11-61

Practice ................................................................................................. 11-69

11.5 Solutions ..................................................................................................... 11-71

Solutions to Practices.............................................................................. 11-71

Solutions to Student Activities .................................................................. 11-84

Lesson 12 Understanding the SAS® Quality Knowledge Base (QKB) .................. 12-1

12.1 Working with QKB Component Files................................................................. 12-3

Demonstration: Accessing the QKB Component Files................................... 12-5

Demonstration: Using the Scheme Builder................................................. 12-16

Demonstration: Using the Chop Table Editor .............................................. 12-24

Demonstration: Using the Phonetics Editor ................................................ 12-34

Demonstration: Using the Regex Library Editor .......................................... 12-43

Demonstration: Using the Vocabulary Editor .............................................. 12-49

Demonstration: Using the Grammar Editor................................................. 12-62

12.2 Working with QKB Definitions ........................................................................ 12-70

12.3 Solutions ..................................................................................................... 12-81

Solutions to Activities and Questions ........................................................ 12-81

Lesson 13 Using SAS® Code to Access QKB Components ................................. 13-1

13.1 SAS Configuration Options for Accessing the QKB ............................................ 13-3

13.2 SAS Data Quality Server Overview ................................................................ 13-13

Demonstration: Using the Standardization Functions .................................. 13-18

Demonstration: Using the DQSCHEME Procedure to Create a Scheme ....... 13-26

Demonstration: Using the DQSCHEME Procedure to Apply a Scheme......... 13-30

Demonstration: Using the Match Code Generation Functions ...................... 13-34

Demonstration: Using the DQMATCH Procedure ....................................... 13-39



Demonstration: Using the Parsing Functions.............................................. 13-45

Demonstration: Using the Extraction Functions .......................................... 13-49

Demonstration: Using the Gender Analysis and Identification Analysis Functions ........ 13-52

13.3 Solutions ..................................................................................................... 13-54

Solutions to Activities and Questions ........................................................ 13-54



To learn more…
For information about other courses in the curriculum, contact the
SAS Education Division at 1-800-333-7660, or send e-mail to
training@sas.com. You can also find this information on the web at
http://support.sas.com/training/ as well as in the Training Course
Catalog.

For a list of SAS books (including e-books) that relate to the topics
covered in these course notes, visit https://www.sas.com/sas/books.html or
call 1-800-727-0025. US customers receive free shipping to US
addresses.
Lesson 1 SAS/ACCESS® Technology Overview
1.1 SAS/ACCESS Technology Overview ............................................................................ 1-3
Demonstration: Using SAS/ACCESS Methods for a Variety of Data Sources ................. 1-12

1.1 SAS/ACCESS Technology Overview

SAS/ACCESS Software
With SAS/ACCESS, you can directly read and write to and from other database
management systems (DBMS).
(Diagram: DATA steps and PROC steps reading from and writing to a DBMS table)

This is a key capability leveraged across all SAS data management applications.


SAS provides software that enables you to access data from a large variety of database
management systems. These software components, referred to as SAS/ACCESS engines, enable
you to read and write directly to and from specific data formats. This is a key capability that allows
SAS users to integrate data from a large variety of data sources as part of the data curation process.
To connect to any specific type of database, you license a SAS/ACCESS product specific to that
database. You might license SAS/ACCESS to Oracle, SAS/ACCESS to Hadoop, or SAS/ACCESS to
Microsoft SQL Server; there are dozens of specific SAS/ACCESS engines.
When you have these engines, then with your SAS code, you can use any SAS DATA step or PROC
step to read or write to any table from the database management system in the same way that you
would read or write to a SAS data set.
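For example, once a SAS/ACCESS library is assigned, ordinary DATA and PROC steps work against
the database table directly. The following is a minimal sketch; the library, table, and connection
values are hypothetical and vary by site:

/* Hypothetical Oracle library; connection options come from your DBA */
libname mydb oracle path=myserver user=student password=XXXXXXXX schema=orion;

/* A DATA step reads the Oracle table through the libref */
data work.highpay;
   set mydb.employee_payroll;
   where salary > 50000;   /* filter handled by the engine */
run;

/* A PROC step names the Oracle table like a SAS data set */
proc print data=mydb.employee_payroll (obs=5);
run;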


Database Management Systems


• Database management systems (DBMS) typically use Structured Query
Language (SQL) as the interface to access and manage database tables.
• The SAS/ACCESS interface engines use the SQL language to communicate
with the database tables.


Database systems typically use SQL as their language interface.


What most SAS/ACCESS engines do is send the native SQL language to the database.
This is built into the SAS/ACCESS engines so that, as a SAS programmer, you don't necessarily
write the native database SQL language yourself. You can write any SAS code, and the
SAS/ACCESS engine, on your behalf, will then generate the native database SQL for you in order to
interact with the database to perform read or write operations in the database.
If you choose to do so, however, you can also write the native database SQL code yourself within
your SAS program, and SAS will submit that native SQL to the database for you.


Two Types of SAS/ACCESS Methods

SQL pass-through   A SAS programmer uses the SQL procedure in a SAS session to write and
                   submit native database SQL to the database.

SAS/ACCESS         A SAS programmer uses a LIBNAME statement to connect to the DBMS.
LIBNAME            DBMS tables can be named in a SAS program wherever SAS data sets can
                   be named. SAS implicitly converts the SAS code into native database SQL
                   statements.


There are, then, two main methods available with SAS/ACCESS that are common to almost all
access engines.
1. With the SQL pass-through method, you use the SQL procedure within your SAS program to
write and submit native database SQL to the database.
2. With the SAS/ACCESS LIBNAME method, you write a specific type of LIBNAME statement to
connect to the database system, and then you can write any SAS code and name a database
table to read or create in the same way you would name SAS data sets in your SAS programs.
SAS will generate the native database SQL on your behalf in order to perform read or write
operations in the database.

Both of these methods support either

• querying the database tables and storing the result as database tables or returning the results
to SAS
• moving data from SAS to the DBMS (see the sketch below).
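As a brief sketch of the second point, writing data to the DBMS with the LIBNAME method requires
nothing more than naming a two-level DBMS table as the output data set. The libref and table name
here are hypothetical:

libname mydbms oracle <dbms connection parameters>;

/* The DATA step creates a new table in the DBMS (hypothetical name) */
data mydbms.class_copy;
   set sashelp.class;
run;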


General Examples of SAS/ACCESS Methods


SQL pass-through:
proc sql;
   connect to dbms (<dbms connection parameters>);
   select * from connection to dbms
      (select state, avg(salary) as avsal
          from hivetable
          group by state);
   disconnect from dbms;
quit;

SAS/ACCESS LIBNAME:
libname mydbms dbms <dbms connection parameters>;
proc means data = mydbms.dbms_table;
   class state;
   var salary;
run;

Here you can see a general syntax example for each of these methods.
In the example above for SQL pass-through:
1. You begin with PROC SQL (the SAS SQL procedure).
2. A CONNECT statement is then used to define a connection to your DBMS.
3. A SELECT clause is used to select data from the connection to the DBMS.
4. Native DBMS SQL is specified in parentheses.
   a. This query is sent by SAS directly to the DBMS.
   b. The DBMS executes the query and returns the results to SAS.
5. A DISCONNECT statement is used to close the connection to the DBMS.

When you use the SAS/ACCESS LIBNAME method, you begin by defining a database library with
the LIBNAME statement. This establishes a library connection to the database. When you have
defined this, you can then name database tables with the same type of two-level names used for
SAS data sets, in the form libref.datasetname. In this case, we name a database table directly in
the DATA= option of PROC MEANS. Database column names are used in the code in the same way
we name variables from SAS data sets. Here, we are using PROC MEANS to summarize the
SALARY column in the database table, categorized by STATE.


General Examples of SAS/ACCESS Methods


SQL pass-through:
proc sql;
   connect to dbms (<dbms connection parameters>);
   select * from connection to dbms
      (select state, avg(salary) as avsal
          from hivetable          /* native database SQL executed by the */
          group by state);        /* DBMS and results returned to SAS    */
   disconnect from dbms;
quit;

SAS/ACCESS LIBNAME:
libname mydbms dbms <dbms connection parameters>;
proc means data = mydbms.dbms_table;   /* SAS converts to a native database  */
   class state;                        /* SQL summary query executed by the  */
   var salary;                         /* DBMS and results returned to SAS   */
run;

This is the same example with a couple of key things highlighted.

In the SQL pass-through, the highlighted native SQL query that is sent to the database calculates
mean salaries by state. Only the summary result is returned to SAS: the database does the
processing, and only summarized data crosses the network. When database tables are large, it is
important to do as much of the processing as possible in the database to avoid the bottleneck of
returning large volumes of data across the network from the database server to SAS. In addition,
the database might be running in a server environment that allows it to process the data much more
quickly than the SAS server.
With the LIBNAME method, PROC MEANS in this example also calculates mean salary by state, so
it is requesting the same type of summary processing. In this case, the SAS/ACCESS engine
converts the PROC MEANS code into a native summary query that executes in the database, and
once again only summary information is returned to SAS.
However, this is not always the case. A handful of SAS procedures and other SAS statements (such
as WHERE statements) do convert into native database requests, but in other cases, all the data is
returned to SAS for processing. This is important to consider when using the LIBNAME method. It is
elaborated on in the demonstration and is a topic of more detailed discussion in a later lesson.
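To make the distinction concrete, here is a sketch using the hypothetical libref from the slide above.
A WHERE statement is typically translated into a native SQL WHERE clause, so the DBMS filters the
rows; a subsetting IF cannot be translated, so every row is returned to SAS before filtering. (Actual
pushdown behavior varies by engine and can be confirmed with the SASTRACE option.)

/* WHERE is usually pushed down: the DBMS returns only matching rows */
data work.subset1;
   set mydbms.dbms_table;
   where state = 'NC';
run;

/* A subsetting IF is not pushed down: all rows cross the network */
data work.subset2;
   set mydbms.dbms_table;
   if state = 'NC';
run;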


Components of the Pass-Through Facility


Here are two types of SQL pass-through components that you can submit
to the DBMS:
• SELECT statements
• EXECUTE statements


There are two types of SQL pass-through components that you can submit to the DBMS:
• SELECT statements, which produce output to SAS software (SQL procedure pass-through
queries)
• EXECUTE statements, which perform all other non-query SQL statements that do not produce
output (for example, GRANT, CREATE, or DROP)


Executing Pass-Through Statements


Example of an SQL pass-through SELECT statement:
proc sql;
connect to hadoop (server="server2" port=10000 schema=DIACHD
user='student' passwd='Metadata0');
select * from connection to hadoop
(select employee_name,salary
from salesstaff
where salary > 50000);
disconnect from hadoop;
quit;


Here is a specific example of connecting to a Hive database in Hadoop and submitting a HiveQL
pass-through query.
Pass-through methods start with a PROC SQL statement to invoke the SQL procedure.
Next, a CONNECT statement is specified, where a keyword indicates the type of database that you
are connecting to. This example connects to Hadoop, so the HADOOP keyword is used.
The CONNECT statement also includes a set of key-value pairs in parentheses that contain the
information that SAS needs to connect to the database system.
Next is a SAS PROC SQL SELECT statement. Because this is SQL pass-through, a table is not
named on the FROM clause of the SAS SELECT statement. Instead, we use FROM CONNECTION
TO HADOOP, which refers to the connection that we established in the CONNECT statement. In
parentheses, we write an SQL query in the native SQL language of the database that we are
connecting to. This SELECT statement in parentheses is then passed by SAS, using the connection
information, directly to the database to execute. The database executes the query and returns the
result to SAS, where the SAS SELECT statement then queries this result set.
A DISCONNECT statement closes the SAS connection to the database, and QUIT ends the SQL
procedure.


Executing Pass-Through Statements


Example of an SQL pass-through EXECUTE statement:
proc sql;
connect to hadoop (server="server2" port=10000 schema=DIACHD
user='student' passwd='Metadata0');

execute (drop table salesstaff) by hadoop;


disconnect from hadoop;
quit;


This is an example of an SQL pass-through EXECUTE statement. PROC SQL is also used, and the
CONNECT and DISCONNECT statements work in the same way. However, in this example, an
EXECUTE statement is used because we are not sending a query to the database system, so we do
not need to query a result set with a SAS SELECT. Instead, we simply use the EXECUTE keyword
and, in parentheses, write the native database SQL statement that will be sent by SAS to be
executed by the database. In this example, we send a DROP TABLE statement to execute. With the
EXECUTE statement, you follow the native statement in parentheses with BY HADOOP, or BY
"the keyword for the system that you connected to".
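An EXECUTE statement can likewise submit other non-query statements, such as a CREATE TABLE.
The following sketch reuses the Hadoop connection shown above; the new table name is hypothetical:

proc sql;
   connect to hadoop (server="server2" port=10000 schema=DIACHD
      user='student' passwd='Metadata0');
   /* HiveQL CREATE TABLE AS SELECT, executed in the database */
   execute (create table highpaid as
               select employee_name, salary
                  from salesstaff
                  where salary > 50000) by hadoop;
   disconnect from hadoop;
quit;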


Universal Methodology
SAS/ACCESS interfaces enable SAS programmers to apply consistent
techniques to access a large number of data sources in different formats.


The beauty of the SAS/ACCESS interfaces is that they allow SAS programmers to apply consistent
techniques across a large number of data sources in different formats. Methods that you learn to
apply to one data source are easily leveraged when you need to access other data sources for which
a SAS/ACCESS interface is available.
With these technologies in place, various SAS applications are also able to leverage them, as you
will discover in upcoming lessons.


Using SAS/ACCESS Methods for a Variety of Data Sources

This demonstration shows how to use SQL pass-through and the SAS/ACCESS LIBNAME method
for several types of data sources.
1. Start SAS Studio.
a. Click the SAS Studio icon on the Start Bar.

2. Open the demonstration program.


a. Click Files and Folders in the left panel.
b. Expand S:\workshop\diacs.
c. Scroll down to find demo.sas in the S:\workshop\diacs folder.
d. Double-click on demo.sas to open.
3. Use the SAS/ACCESS LIBNAME method to access data in several database management
systems.
a. Select the first eighteen lines in the program to include the %LET statement and the seven
LIBNAME statements.
%let path=s:\workshop\diacs;
libname diaccs "&path";

/* Native client access engines */


libname oralib oracle path=localhost
user=student pw=Metadata0 schema=orion;
libname db2lib db2
user=student pw=Metadata0 database=sample;
libname msacc pcfiles path="&path\employee.accdb";
libname myexcel pcfiles path="&path\employee.xlsx";

/* ODBC Access */
libname mssql odbc dsn=sqlsrvdsn user=student pw=Metadata0;
/*MS SQL Server*/
libname odbcora odbc dsn=orasrc user=student pw=Metadata0;
/*Oracle*/

b. Click (Run all) to submit the highlighted statements.


Note: The log is displayed and shows messages indicating that the libraries were
successfully assigned.


Note: Each library establishes a connection to a specific database management system.

The second word in each LIBNAME statement establishes the libref that will be used
to reference the library in subsequent code. The third word in each LIBNAME
statement is a keyword specific to the type of database that you are connecting to and
corresponds to the SAS/ACCESS engine being used. Thus, the first library connects to
SAS data sets using the default Base SAS engine, the second connects to Oracle, the
third connects to DB2, and the subsequent libraries connect to Microsoft Access via
the PCFILES engine, to Excel via the PCFILES engine, to Microsoft SQL Server via
the ODBC engine, and to Oracle via the ODBC engine.
There is also a LIBNAME statement showing a connection to Hadoop that is
contained in a comment so that it does not execute. Each of these libraries uses a
set of connection options needed for the specific connection. Typically, three to five
such options are required, and in many cases, the database administrator will
provide the users with the required connection parameters.
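For reference, the commented-out Hadoop connection in the program follows the same pattern as
the other LIBNAME statements. A sketch is shown below; the libref is hypothetical, and the server,
port, and schema values are taken from the Hadoop pass-through example earlier in this lesson:

/* Hypothetical libref; connection values as in the lesson's Hadoop example */
libname hdp hadoop server="server2" port=10000 schema=DIACHD
        user=student password=Metadata0;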
c. View the libraries and use them interactively in the Libraries section of the navigation pane.
1) In the left panel, click Libraries.
2) Expand My Libraries.
3) Note that all the libraries defined with the LIBNAME statements are displayed.


4) Click to the left of DB2LIB to expand it.

5) Right-click CUSTOMER_DIM and select Properties to display the SAS Table Properties
window.
6) Click the Columns tab in the SAS Table Properties window.

Note: In this interactive Libraries section of the navigation pane, this DB2 library
connection, like all SAS/ACCESS LIBNAME connections, enables users to work
with the DB2 tables in the same way that you would work interactively with SAS
data sets. The column attributes in the table properties show that SAS interprets
the metadata in the database tables as if they were SAS data sets. All variables
are treated as either SAS numeric or SAS character variables. Note here that the
CUSTOMER_BIRTH_DATE variable in the DB2 table is a DATE data type in
DB2, which SAS treats as a SAS date value. Date values are SAS numerics with
assigned SAS date formats. This interpretation by the SAS LIBNAME engines is
what allows users to use database tables from these library connections as if
they were SAS data sets, wherever a SAS data set can be named in SAS code.
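Because the engine surfaces the DB2 DATE column as a SAS date value, SAS date
constants, formats, and functions can be used against the table directly. A small sketch
using this demonstration's table (the step is illustrative and not part of the course program):

proc print data=db2lib.customer_dim (obs=5);
   where customer_birth_date >= '01JAN1980'd;   /* SAS date constant */
   format customer_birth_date date9.;
run;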


7) Click Close to close the SAS Table Properties window.


8) Double-click CUSTOMER_DIM to view the data in a new tab labeled
DB2LIB.CUSTOMER_DIM.

Note: The table can be viewed interactively in the same way as a SAS data set in a
SAS base library.
9) Click x on the DB2LIB.CUSTOMER_DIM tab to close.
10) Interactively copy a SAS data set into the DB2 library.
a) Expand the SASHELP library.
b) Drag the BASEBALL data set from the SASHELP library and drop it onto the
DB2LIB library.
Note: BASEBALL is copied and exists as a table in the DB2 database.


c) Double-click BASEBALL to open and view the data for the new DB2LIB table on a
new tab labeled DB2LIB.BASEBALL.

d) Click x on the DB2LIB.BASEBALL tab to close.


11) Right-click BASEBALL in the DB2LIB library and select Delete.
12) Click Delete to confirm.
d. Delete any database tables that might already exist so that they can be created in
subsequent program steps later in this demonstration.
1) For the demo.sas program, click the CODE tab.
2) Select the five PROC DATASETS steps in the demo.sas program.
/* Delete tables, if they exist, so they can be created in
subsequent code*/
proc datasets lib=db2lib nowarn nolist;
delete gendersalarydb2;
run;
proc datasets lib=odbcora nowarn nolist;
delete gendersalaryODBC;
run;
proc datasets lib=msacc nowarn nolist;
delete gendersalaryMSACCESS;
run;
proc datasets lib=myexcel nowarn nolist ;
delete gendersalaryEXCEL2;
run;
proc datasets lib=mssql nowarn nolist;
delete gendersalarySQLSERVER2;
run;
Note: Using the LIBNAME method, you can delete tables in databases using PROC
DATASETS in the same way you use this procedure to delete SAS data sets
from a SAS library. The library references for the SAS/ACCESS libraries are
specified with the LIB= option in the PROC DATASETS statement, and a
DELETE statement is used to specify the table in the library that is to be deleted.
3) Click to execute the highlighted statements.


4) Verify that the selected code executed without errors.


e. Click the CODE tab.
f. Use PROC MEANS syntax to summarize data stored in each of the databases.
1) Highlight the OPTIONS statement, the six PROC MEANS steps, and a second
OPTIONS statement that follows the final PROC MEANS.
options sastrace=',,,d' sastraceloc=saslog nostsuffix ls=64;

ods noproctitle;
title 'Proc Means Oracle';
proc means data=oralib.employee_payroll mean;
class employee_gender;
var salary;
output out=work.gendersalaryORA mean=meansalary;
run;
title 'Proc Means DB2';
proc means data=db2lib.employee_payroll mean;
class employee_gender;
var salary;
output out=db2lib.gendersalaryDB2
(rename=(_type_=type _freq_=freq )) mean=meansalary;
run;
title 'Proc Means MS Access';
proc means data=msacc.employee_payroll mean;
class employee_gender;
var salary;
output out=msacc.gendersalaryMSACCESS mean=meansalary;
run;
title 'Proc Means MS Excel';
proc means data=myexcel.employee_payroll mean;
class employee_gender;
var salary;
output out=myexcel.gendersalaryEXCEL2 mean=meansalary;
run;
title 'Proc Means MS SQL Server';
proc means data=mssql.employee_payroll mean;
class employee_gender;
var salary;
output out=mssql.gendersalarySQLSERVER2 mean=meansalary;
run;
title 'Proc Means ODBC Oracle';
proc means data=odbcora.employee_payroll mean;
class employee_gender;
var salary;
output out=odbcora.gendersalaryODBC
(rename=(_type_=type _freq_=freq )) mean=meansalary;
run;
options sastrace=off;


Note: A table called EMPLOYEE_PAYROLL storing identical data values exists in each
database system that we are connecting to via the LIBNAME statements. The
PROC MEANS steps are performing the identical summary calculations using the
same standard SAS syntax. In each procedure, average salary values will be
calculated for each unique value of the EMPLOYEE_GENDER variable. An
OUTPUT statement is used in each procedure to store the results as a table. For
those database libraries where Write access is granted by the database, the output
is stored as a table in the database. This is done by using the libref for the
SAS/ACCESS library when naming the table to save on the OUT= option of the
OUTPUT statement. If Write access to the database is not available, the output data
set is stored in the SAS Work library. Each procedure will also generate a report of
the summary statistics.
The first OPTIONS statement (at the beginning of this code segment) specifies the
SASTRACE option. When this option is used, the SAS log will display the details
about the native database SQL that is generated by the SAS/ACCESS engines. It is
being used here to demonstrate that when PROC MEANS is used, the SAS engine
for many databases will generate a native summary query. The data summarization
process can be performed by the database and only the summary results are
returned to SAS. Pushing more of the processing into the database will result in
better performance for large database tables. DBMSs are scaled to handle the large
volumes of data stored. Also, processing the data in place in the database reduces
the volume of data that needs to be transported across the network to the machine
where SAS is executing.
2) Click (Run all) to execute the highlighted statements.
3) If necessary, click the Results tab and note that each procedure generated the same
summary report for the table in each database library.

4) Click the LOG tab.


5) Scroll to the top of the log.


6) Examine the SASTRACE messages that appear in the log.
a) Find the SASTRACE messages that are marked by the name of the database
connection followed by an underscore character and a number that increases
sequentially during the SAS session for each generated message per engine (for
example, ORACLE_1, ORACLE_2, and so on, and DB2_1, DB2_2, and so on).
b) Examine the two SASTRACE messages displayed below that demonstrate cases
where the SAS/ACCESS engine generated native SQL summary queries to perform
the summarization of the data in-database.
SASTRACE message showing Oracle summarized data for PROC MEANS:
ORACLE_3: Prepared: on connection 0
select COUNT(*) as ZSQL1, MIN(TXT_1."EMPLOYEE_GENDER") as
ZSQL2, COUNT(*) as ZSQL3, COUNT(TXT_1."SALARY") as ZSQL4,
SUM(TXT_1."SALARY") as ZSQL5 from orion.EMPLOYEE_PAYROLL TXT_1
group by TXT_1."EMPLOYEE_GENDER"

SASTRACE message showing DB2 summarized data for PROC MEANS:


DB2_4: Prepared: on connection 0
select COUNT_BIG(*) as ZSQL1, MIN(TXT_1."EMPLOYEE_GENDER") as
ZSQL2, COUNT_BIG(*) as ZSQL3, COUNT_BIG(TXT_1."SALARY") as
ZSQL4, SUM(TXT_1."SALARY") as ZSQL5 from EMPLOYEE_PAYROLL TXT_1
group by TXT_1."EMPLOYEE_GENDER" FOR READ ONLY

Note: The SASTRACE messages shown above contain summary queries in which a
GROUP BY clause identifies the class variable specified in the PROC
MEANS step. In the native SQL query, summary statistics are calculated for
the analysis variable identified in the MEANS procedure. Only the
summarized results are returned to SAS, which are then used by PROC
MEANS to generate the report.
In many cases, the LIBNAME engines do not convert the SAS language
into the equivalent native database SQL process. In such cases, a native
database SELECT statement is generated by the SAS LIBNAME engine to
return all rows of data from the database to SAS, and further processing of
that data is then done by SAS. With large volumes of data, an understanding
of where processing occurs becomes an important performance
consideration in developing your SAS code.
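When the log shows that a step is not being converted, one option is to fall back to
explicit SQL pass-through so that the summarization is guaranteed to execute in the
database. A sketch using the DB2 connection options from this demonstration:

proc sql;
   connect to db2 (user=student pw=Metadata0 database=sample);
   /* The query in parentheses runs entirely in DB2 */
   select * from connection to db2
      (select employee_gender, avg(salary) as meansalary
          from employee_payroll
          group by employee_gender);
   disconnect from db2;
quit;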
7) View one of the summary tables created.
a) Click the Libraries section in the navigation pane.
b) If necessary, expand My Libraries  DB2LIB.
c) Double-click the GENDERSALARYDB2 table to view the data for the DB2 table on a
separate tab labeled DB2LIB.GENDERSALARYDB2.


d) Click x to close the DB2LIB.GENDERSALARYDB2 tab.


4. Use the SAS/ACCESS SQL pass-through method to write and submit native database SQL
directly to the same set of database management systems.
a. If necessary, click the CODE tab.
b. Locate and highlight the PROC SQL step that contains six separate CONNECT blocks that
each connect to one of the database systems.
proc sql;

connect to oracle (path=localhost user=student pw=Metadata0);


title 'pass-through join from ORACLE';
select * from connection to oracle
(select a.Employee_ID, employee_gender, salary,
department, job_title
from orion.employee_payroll a,
orion.employee_organization b
where a.Employee_ID=b.Employee_ID);
title 'pass-through summary query from ORACLE';
select employee_gender, department,
aveSalary format=dollar10.
from connection to oracle
(select employee_gender, department,
avg(salary) as aveSalary
from orion.employee_payroll a,
orion.employee_organization b
where a.Employee_ID=b.Employee_ID
group by department, employee_gender
order by department, employee_gender);
disconnect from oracle;

connect to db2 (user=student pw=Metadata0 database=sample);


title 'pass-through join from DB2';
select * from connection to db2
(select a.Employee_ID, employee_gender, salary,
department, job_title
from employee_payroll a,


employee_organization b
where a.Employee_ID=b.Employee_ID);

title 'pass-through summary query from DB2';


select employee_gender, department,
aveSalary format=dollar10.
from connection to db2
(select employee_gender, department,
avg(salary) as aveSalary
from employee_payroll a, employee_organization b
where a.Employee_ID=b.Employee_ID
group by department, employee_gender
order by department, employee_gender);
disconnect from db2;

connect to pcfiles (path="&path\employee.accdb");


title 'pass-through join from MS ACCESS';
select * from connection to pcfiles
(select a.Employee_ID, employee_gender, salary,
department, job_title
from employee_payroll as a,
employee_organization as b
where a.Employee_ID=b.Employee_ID);

title 'pass-through summary query from MS ACCESS';


select employee_gender, department,
aveSalary format=dollar10.
from connection to pcfiles
(select employee_gender, department,
avg(salary) as aveSalary
from employee_payroll as a,
employee_organization as b
where a.Employee_ID=b.Employee_ID
group by department, employee_gender
order by department, employee_gender);
disconnect from pcfiles;

connect to pcfiles (path="&path\employee.xlsx");


title 'pass-through join from Excel';
select * from connection to pcfiles
(select a.Employee_ID, employee_gender, salary,
department, "job title"
from employee_payroll as a,
employee_organization as b
where a.Employee_ID=b."Employee ID");

title 'pass-through summary query from Excel';


select employee_gender, department,
aveSalary format=dollar10.


from connection to pcfiles


(select employee_gender, department,
avg(salary) as aveSalary
from employee_payroll as a,
employee_organization as b
where a.Employee_ID=b."Employee ID"
group by department, employee_gender
order by department, employee_gender);
disconnect from pcfiles;

connect to odbc (dsn=sqlsrvdsn user=student pw=Metadata0);


title 'pass-through join from ODBC SQL Server';
select * from connection to odbc
(select a.Employee_ID, employee_gender, salary,
department, job_title
from employee_payroll a,
employee_organization b
where a.Employee_ID=b.Employee_ID);

title 'pass-through summary query from ODBC SQL Server';


select employee_gender, department,
aveSalary format=dollar10.
from connection to odbc
(select employee_gender, department,
avg(salary) as aveSalary
from employee_payroll a, employee_organization b
where a.Employee_ID=b.Employee_ID
group by department, employee_gender
order by department, employee_gender);
disconnect from odbc;

connect to odbc (dsn=orasrc user=student pw=Metadata0);


title 'pass-through join from ODBC ORACLE';
select * from connection to odbc
(select a.Employee_ID, employee_gender, salary,
department, job_title
from employee_payroll a,
employee_organization b
where a.Employee_ID=b.Employee_ID);

title 'pass-through summary query from ODBC ORACLE';


select employee_gender, department,
aveSalary format=dollar10.
from connection to odbc
(select employee_gender, department,
avg(salary) as aveSalary
from employee_payroll a,
employee_organization b
where a.Employee_ID=b.Employee_ID


group by department, employee_gender


order by department, employee_gender);
disconnect from odbc;
title;
quit;
Note: Each CONNECT statement connects to one of the six database systems that were
accessed earlier via the LIBNAME statements. The options in parentheses in the
CONNECT statement are the same ones used for the LIBNAME statement.
For each connection, two SQL pass-through queries are submitted to the database. The
queries are the same for each database, although each is written in the native
database-specific syntax. The first query performs a join of an EMPLOYEE_PAYROLL
table and an EMPLOYEE_ORGANIZATION table in the database. The results are
returned to SAS. The second query performs the same join but also summarizes the
join result, grouping by DEPARTMENT and GENDER to derive average salaries for
each grouping combination.
Notice how consistent SQL pass-through syntax is used to connect and submit queries
to each database system, and to select results returned to SAS for report generation.
With SQL pass-through, you can submit native queries to process the data in-database
and then return the query results to SAS. For large database tables, it is recommended,
for performance reasons, to maximize the amount of processing done by the native
pass-through queries before bringing results to SAS for further processing.
c. Click to execute the highlighted statements.
d. When the program execution completes, view the results on the Results tab.
1) At the beginning of the Results, notice the title and the first few rows of results for the first
pass-through query to Oracle.


2) Scroll downward to find the title and the first few rows of results for the second
pass-through query to Oracle.

3) Continue to scroll downward to note that the same two query results have been created
for the pass-through queries for each of the other database systems as well.
5. Exit SAS Studio.
a. Click Sign Out in the top right corner of the SAS Studio interface.
b. Click Yes to confirm.
c. Click X in the top right to close the browser window.


SQL Pass-Through Method versus LIBNAME Method

                                                               SQL Pass-Through   LIBNAME
Must construct native DBMS SQL.                                      Yes             No
Can use any SAS programming methods and name DBMS tables
as input or output data sets.                                        No              Yes
Explicit control of what executes in the DBMS versus SAS.            Yes             No
Implicitly generated SQL might cause all data from the DBMS to
be returned to SAS. Turn on SASTRACE and examine the SAS log
during development to produce efficient code.                        No              Yes

Lesson 2 SAS® Data Integration Studio: Essentials
2.1 Exploring the SAS Platform and SAS Data Integration Studio ...................................... 2-3
Demonstration: Importing Metadata......................................................................... 2-11

2.2 Exploring SAS Data Integration Studio Basics ........................................................... 2-18


Demonstration: Exploring SAS Data Integration Studio Basics..................................... 2-24

2.3 Examining SAS Data Integration Studio Jobs and Options ........................................ 2-34
Demonstration: Examining SAS Data Integration Studio Jobs and Options .................... 2-38
Practice............................................................................................................... 2-49

2.4 Solutions ................................................................................................................... 2-50


Solutions to Practices ............................................................................................ 2-50
Solutions to Student Activities ................................................................................. 2-54

2.1 Exploring the SAS Platform and SAS Data Integration Studio

SAS Platform Applications


SAS Enterprise Miner     SAS Add-In for Microsoft Office    SAS Data Integration Studio
SAS Model Manager        SAS Visual Analytics               DataFlux Data Management Studio
SAS Forecast Server      SAS OLAP Cube Studio               SAS Studio
JMP                      SAS Information Map Studio         SAS Enterprise Guide


The SAS platform applications provide application interfaces to surface the power of data
management, analytics, and reporting.
The diagram above shows an organizational view of some of the SAS platform applications.
The highlighted group shows some of the analytics applications on the SAS platform. These
applications and other SAS tools help analysts make and manage models, forecast trends, and
generate statistics and visualizations on data.


SAS Platform Applications


SAS Enterprise Miner     SAS Add-In for Microsoft Office    SAS Data Integration Studio
SAS Model Manager        SAS Visual Analytics               DataFlux Data Management Studio
SAS Forecast Server      SAS OLAP Cube Studio               SAS Studio
JMP                      SAS Information Map Studio         SAS Enterprise Guide


The highlighted group contains some of the reporting applications available in the SAS platform.
These applications and other SAS applications allow users to generate complex dashboards and
reports on their data, as well as access data and generate reports with Microsoft Office tools like
Excel or Word.


SAS Platform Applications


SAS Enterprise Miner     SAS Add-In for Microsoft Office    SAS Data Integration Studio
SAS Model Manager        SAS Visual Analytics               DataFlux Data Management Studio
SAS Forecast Server      SAS OLAP Cube Studio               SAS Studio
JMP                      SAS Information Map Studio         SAS Enterprise Guide


The highlighted group contains some of the programming interfaces available in the SAS platform.
These applications allow users to write and edit SAS code, which can be used to manage, analyze,
and report on data. SAS code will be used in this course to generate custom transformations and
can be used in the tool to customize jobs and existing transformations.


SAS Platform Applications

[The same slide diagram of SAS platform applications, repeated with a different group highlighted]

The highlighted group contains some of the data management applications available in the SAS platform. Of these data management applications, this course concentrates on SAS Data Integration Studio. A brief explanation of the other data management applications follows.

DataFlux Data Management Studio
Combines data profiling, data cleansing, entity resolution, and monitoring tools for incorporating data quality into an information management process.

SAS OLAP Cube Studio
Includes a Cube Designer wizard that enables you to do the following:
• enter information about the data source used to load a cube
• define dimensions, hierarchies, and levels
• provide measure details and configure aggregations

SAS Information Map Studio
Enables you to build information maps from different types of data sources. SAS Information Maps do the following:
• act as a bridge between your data warehouse and business users
• incorporate business rules and eliminate the need to understand data relationships

SAS OLAP Cube Studio and SAS Information Map Studio are not discussed further in this course.


SAS Metadata Objects

SAS platform applications create various items that are stored in a SAS Metadata Repository. The items in the repository are organized in the SAS Folders structure. Examples include the following:
• Tables
• Cubes
• Dashboards
• Folders
• Information maps
• Jobs
• Libraries
• OLAP schemas
• Stored processes
• Reports


SAS Folders Tree

SAS Folders
• is a root folder
• cannot be renamed, moved, or deleted
• can contain other folders but cannot contain individual objects
• provides personal folders for individual users
• provides an area for shared data

Note: The SAS Folders tree can have customized nonstandard folders.

The initial folder structure includes the following main components:

My Folder
Is a shortcut to the personal folder of the user who is currently logged on.

Products
Contains folders for individual SAS software products. These folders contain content that is installed along with the product. For example, some products have a set of initial jobs, transformations, stored processes, or reports that users can modify for their own purposes. Other products include sample content (for example, sample stored processes) to demonstrate their capabilities. Where applicable, the content is stored under the product's folder in subfolders that indicate the release number for the product.
Note: During installation, the SAS Deployment Wizard enables the installer to assign a different name to this folder. Therefore, your Products folder might have a different name.

Shared Data
Is provided for you to store user-created content that is shared among multiple users. Under this folder, you can create any number of subfolders, each with the appropriate permissions, to further organize this content.
Note: You can also create additional folders under SAS Folders in which to store shared content.

Follow these best practices when you interact with SAS folders:
• Use personal folders for personal content and use shared folders for content that multiple users need to view.
• Do not delete or rename the Users folder.
• Do not delete or rename the home folder or personal folder of an active user.
• Do not delete or rename the Products or System folders or their subfolders.
• Use caution when you rename the Shared Data folder.
• When you create new folders, the security administrator should set permissions. The environment can be pre-configured so that new folders inherit permissions from existing parent folders.


Metadata Users and Groups

Users: A user is an individual person or service identity.
Groups: A group is a set of users.

[Slide diagram: the users Ahmed, Marcel, Ole, Robert, and Bruno shown as members of the Data Integrators group]

In order to control access to SAS platform content, SAS must know who makes each request and what type of access is requested.

Some applications in the SAS platform support role-based security. Roles determine which user interface elements a user sees when interacting with an application. The various features in applications that provide role-based management are called capabilities. SAS Data Integration Studio does not support this type of security.

Note: Groups enable you to easily specify permissions for similar users.


SAS Packages
SAS Data Integration Studio enables you to export and import metadata.
One format that supports SAS metadata is the SAS package format.


The SAS package format
• is a SAS internal format
• supports most SAS platform metadata objects, including objects relevant to SAS Data Integration Studio, such as jobs, libraries, tables, and external files.
The SAS package format can be used to
• move metadata between SAS metadata repositories
• maintain backups of metadata.


Importing Metadata
This demonstration performs the initial import of various metadata objects. These objects are explored in a subsequent demonstration.

1. Select Start → All Programs → SAS → SAS Data Integration Studio.


Use Bruno’s credentials to log on.
a. Verify that the connection profile is My Server.

Note: Do not click Set this connection profile as the default.


b. Click OK to close the Connection Profile window and to access the Log On window.
c. Type Bruno in the User ID field.
d. Type Student1 in the Password field.

Note: Do not click Save user ID and password in this profile.


e. Click OK to close the Log On window.
SAS Data Integration Studio appears.


2. If necessary, click the Folders tab in the tree view area.

Some folders in the Folders tree are provided by default, such as My Folder, Products, Shared Data, System, and User Folders.
Other folders and subfolders were added by an administrator, such as Data Mart Development.
3. Right-click the Data Mart Development folder and select Import → SAS Package.

The Import from SAS Package Wizard appears.


4. Click Browse to enter the location of the input SAS package file.
5. In the Browse window, navigate to
D:\Workshop\dift\solutions.


6. Click DIFT Demo.spk.


7. Click OK to close the Browse window.
The path and selected SAS package file appear:

8. Click Next.
9. Verify that all objects are selected.

10. Click Next.


For this collection of metadata objects, two metadata properties need to be verified: SAS Application Server and Directory Paths.

11. Click Next.


12. Verify that SASApp is specified under Target.

Two library metadata objects in this SAS Package file were exported from a metadata repository
where they were associated with a SAS Application Server object called SASApp. The defined
SAS Application Server in the target environment that we will associate the objects with is also
called SASApp.
This mapping of original SASApp to the target SASApp occurred by default (because the
application servers have the same name).
13. Click Next.
14. Verify that the two target directory paths match the two original paths.
The first path listed is associated with the library metadata object named DIFT Test Source
Library. If the second path is selected, you will discover it is associated with the library metadata
object named DIFT Test Target Library.

15. Click Next.


A Warning dialog box appears:

We need to make sure that the referenced directory actually exists. We will do this in a Windows Explorer window.
16. Click Yes to close the Warning dialog box and proceed.
A Summary panel appears for the Import from SAS Package Wizard.


17. Click Next.


The selected metadata are imported. A message is displayed saying that the import was
successful.

18. Click Finish.


19. If necessary, click the plus sign in front of Data Mart Development to expand this folder.
20. Verify that the DIFT Demo folder exists as a subfolder of Data Mart Development.
21. Click the plus sign in front of the DIFT Demo folder to expand it.

We investigate these metadata objects in the next demonstration.


2.01 Activity
1. Access SAS Data Integration Studio using the My Server connection profile.
2. Specify Bruno as the user and Student1 as the password.
   (Note the capital S in the Student1 password.)
3. On the Folders tab, right-click the Data Mart Development folder and select Import → SAS Package.
4. Select the package D:\Workshop\dift\solutions\DIFT Demo.spk.
5. Accept the default selections in each step in the wizard.



2.2 Exploring SAS Data Integration Studio Basics

Course Data

[Slide image labeled "sports and outdoors"]

The data used in the course is from a fictitious global sports and outdoors retailer named Orion Star
Sports & Outdoors.


Course Data

[Slide image: the retailer's channels, labeled "brick and mortar stores", "online store", and "catalog business"]

Orion Star has traditional brick and mortar stores, an online store, and a large catalog business.

Course Data

[Slide image: a map showing the United States headquarters and the global offices and stores]

The corporate headquarters is located in the United States, with offices and stores in many countries
throughout the world.


Orion Star Metadata Folders

You will work with a predefined folder named Data Mart Development and create subfolders under this grouping to organize the metadata created in this course.
A security model is defined to control access to the SAS Folders location and the SAS objects that are contained there.

SAS Data Integration Studio Interface

[Slide image: the main SAS Data Integration Studio window]

One of the first things to do when learning a new software tool is to understand the interface. SAS Data Integration Studio has a variety of components available; we will explain each of the primary items.


Title Bar, Menu Bar, Toolbar, Status Bar

[Slide image: the interface with the title bar, menu bar, toolbar, and status bar labeled]

The SAS Data Integration Studio interface is designed with features that are common to most Windows applications.
• The title bar shows the current version of SAS Data Integration Studio, as well as the name of the current connection profile.
• The menu bar provides access to drop-down menus. The list of active menu items varies according to the current work area and the type of object that is selected. Inactive menu items are disabled or hidden.
• The toolbar provides access to shortcuts for items on the menu bar. The list of active tools varies according to the current work area and the type of object that is selected. Inactive tools are disabled or hidden.
• The status bar displays the name of the currently selected object, the name of the default SAS Application Server if one is selected, the login ID and metadata identity of the current user, and the name of the current SAS Metadata Server. To select a different SAS Application Server, double-click the name of that server to open a dialog box. If the name of the SAS Metadata Server is red, the connection is broken. In that case, you can double-click the name of the metadata server to open a dialog box that enables you to reconnect.


Tree View, Basic Properties Pane

[Slide image: the interface with the tree view and the Basic Properties pane labeled]

The tree view can display the following components:


• The Folders tree and the Inventory tree both display metadata for objects that are registered on
the current metadata server, such as tables and libraries.
– The Folders tree displays the metadata folder hierarchy that is shared across the SAS
platform. This folder hierarchy can be customized and is used to organize metadata in a
structure that meets the needs of the user.
– The Inventory tree displays metadata in predefined categories that organize the metadata by
type, such as Table, Library, and so on.
– Not all metadata objects in the Inventory tree can be added or updated in SAS Data
Integration Studio. Some objects appear in the tree view for other reasons. For example, you
cannot add or update actions, conditions, or deployed flows in SAS Data Integration Studio,
but they appear in the tree view so that they can be included in the import and export of jobs.
Likewise, you cannot add or update information maps in SAS Data Integration Studio, but
they appear in the tree view so that they can be displayed in impact analysis.
• The Transformations tree provides access to transformations that can be added to SAS Data
Integration Studio jobs.
• The Checkouts tree is available to users working under change management. The Checkouts
tree displays metadata that is checked out for update, as well as any new metadata that is not
checked in. The Checkouts tree is not displayed in the view of SAS Data Integration Studio
above.
• The Basic Properties pane displays the basic properties of an object selected in a tree view. To
surface this pane, select View → Basic Properties from the menu bar.


Job Editor

[Slide image: the Job Editor window]

The Job Editor window enables you to create, run, and troubleshoot SAS Data Integration Studio
jobs.
• The Diagram tab is used to build and update the process flow for a job.
• The Code tab is used to review or update code for a job.
• The Log tab is used to review the log for a submitted job.
• The Output tab is used to review the output of a submitted job.
• The Details pane is used to monitor and debug a job in the Job Editor.


Exploring SAS Data Integration Studio Basics


This demonstration illustrates accessing SAS Data Integration Studio and exploring the basics of the
interface and the available options for customizing Data Integration Studio. This demonstration uses
predefined metadata objects.

1. Select Start → All Programs → SAS → SAS Data Integration Studio.


2. Use Bruno’s credentials to log on.
a. Verify that the connection profile is My Server.

Note: Do not click Set this connection profile as the default.


b. Click OK to close the Connection Profile window and to access the Log On window.
c. Type Bruno in the User ID field and Student1 in the Password field.

Note: Do not click Save user ID and password in this profile.


d. Click OK to close the Log On window.
SAS Data Integration Studio appears.


3. If necessary, click the Folders tab in the tree view area.

Some folders in the Folders tree are provided by default, such as My Folder, Products, Shared Data, System, and User Folders.
Other folders and subfolders were added by an administrator, such as Data Mart Development.
4. Click the plus sign in front of Data Mart Development to expand this folder.

5. Click the plus sign in front of the DIFT Demo folder to expand it.

The DIFT Demo folder contains seven metadata objects: two library objects, four table objects,
and one job object.
Each metadata object has its own set of properties.
6. If necessary, select View → Basic Properties.
The Basic Properties pane can be toggled off. Selecting this menu item surfaces the Basic Properties pane.


7. Click the DIFT Test Job - OrderFact Table Plus object.

The Basic Properties pane displays basic information for this job object.
One interesting property is that the table that is being loaded is identified.


8. Click the DIFT Test Source Library object.

The Basic Properties pane displays basic information for this library object.
Interesting properties to note for a library object include the following:
• library reference (libref) used
• the physical path specified
• the complete LIBNAME statement (this includes the engine)


9. Click the DIFT Test Table - ORDER_ITEM table object.

The Basic Properties pane displays basic information for this table object.
Interesting properties to note for a table object include the following:
• physical table name
• library object with libref in parentheses
• the type of DBMS
• the number of columns


10. Examine the properties of a table object in more detail.


a. Right-click DIFT Test Table - ORDER_ITEM and select Properties.

The General tab displays the metadata name of the table, as well as the metadata folder
location.
b. Click the Columns tab.

The Columns tab displays the column attributes of the table object. Notice that all columns in
this table are numeric.


c. Click the Physical Storage tab.

The Physical Storage tab displays the name of the physical table, the name of the library,
and the type of the table.
d. Click Cancel to close the Properties window.
11. Right-click DIFT Test Table - ORDER_ITEM and select Open.
Note: If prompted, use Bruno’s credentials for the SASApp application server.
The View Data window appears and displays the data for this table.

The functions of the View Data window are controlled by the View Data toolbar:


The View Data toolbar contains the following tools (each appears as an icon on the toolbar):
• Specifies the number of the first row to be displayed.
• Positions the data with the go-to row as the first displayed data line.
• Navigates to the first record of data.
• Navigates to the last page of data.
• Switches to Browse mode.
• Switches to Edit mode.
• Enables printing.
• Refreshes the view of the data.
• Copies a selected item.
• Displays the Search pane.
• Displays the Sort By Columns tab in the View Data Query Options window.
• Displays the Filter tab in the View Data Query Options window.
• Displays the Columns tab in the View Data Query Options window.
• Displays physical column names in the column headings.
  Note: You can display any combination of column metadata, physical column names, and descriptions in the column headings.
• Displays optional descriptions in the column headings.
• Displays optional column metadata names in the column headings. This metadata can be entered in some SAS platform applications, such as SAS Information Map Studio.
• Toggles between displaying the data with metadata formats and either physical formats or no formats (depending on the Formats setting on the View Data tab in the Options window).

12. To close the View Data window, select File → Close (or click the Close button).


13. Examine the properties of a library object in more detail.


a. Right-click DIFT Test Source Library and select Properties.

The metadata name of the library object is shown on the General tab. The metadata folder
location is also shown.
b. Click the Options tab.

The Options tab displays the library reference and the location of the physical path of this
library.
c. Click Cancel to close the Properties window.


14. Right-click DIFT Test Source Library in the Folders tree and select View LIBNAME.
The Display LIBNAME window appears.

15. Click Cancel to close the Display LIBNAME window.


2.3 Examining SAS Data Integration Studio Jobs and Options

Executing a Job

The Job Editor window enables a data integration developer to design, debug, and execute a job.
A job can be executed by
• clicking the Run tool on the Job Editor toolbar
• selecting Actions → Run from the menu bar
• right-clicking the background of the job and selecting Run
• pressing F3.


Job Editor Run Tools

[Slide image: the Job Editor toolbar, with the following Run tools labeled]
• Run entire job
• Continue running a stopped job
• Stop job execution
• Run next node and then stop
• Run from selected transformation
• Run to selected transformation
• Run selected transformation

Additional Run tools exist on the Job Editor toolbar and on the pop-up menu.
Note: The available Run tools depend on the current run state, the selected transformation, or both.

SAS Data Integration Studio Options

The Options window is used to specify global settings for SAS Data Integration Studio options.
The Options window is accessed from the menu selection Tools → Options.


Options: General Tab (Enable Row Count)

With the option toggled off, the Number of Rows property displays "Row count is disabled" for a selected table. With the option toggled on, the Number of Rows property displays the row count for a selected table.

The option Enable row count on basic properties and data viewer for tables can be toggled on or off. The default is off.

Options: General Tab (Show Advanced Property Tabs)

With the option toggled on, tabs such as Extended Attributes and Authorization appear in Properties windows. With the option toggled off, these additional tabs do not appear in Properties windows.

The option Show advanced property tabs can be toggled on or off. The default is on.


Options: Job Editor Tab (Nodes)

[Slide images: a job node displayed with Collapse selected and with Expand selected]

The Nodes grouping of options controls how ports and temporary output tables are displayed in a job flow diagram.

Options: Job Editor Tab (Layout)

[Slide images: a job flow displayed with Left To Right, Top To Bottom, and Bottom To Top selected]

The Layout grouping of options controls the default orientation of a job flow diagram.


Examining SAS Data Integration Studio Jobs and Options


This demonstration illustrates executing and debugging a job, as well as setting various options to
affect your SAS Data Integration Studio session.

1. If necessary, access SAS Data Integration Studio using Bruno’s credentials.


a. Select Start → All Programs → SAS → SAS Data Integration Studio.
b. Use Bruno’s credentials to log on.
1) Verify that the connection profile is My Server.
2) Click OK to close the Connection Profile window and to access the Log On window.
3) Type Bruno in the User ID field and Student1 in the Password field.
4) Click OK to close the Log On window.
2. If necessary, click the Folders tab.
3. If necessary, expand the Data Mart Development → DIFT Demo folders.

Working with Job Editor


4. Access the Job Editor window for DIFT Test Job - OrderFact Table Plus.
a. Right-click DIFT Test Job - OrderFact Table Plus and select Open.

This job joins two source tables and then loads the result into a target table. The target table
is then used as the source for the Rank transformation. The result of the ranking is loaded
into a target table and sorted, and then a report is generated based on the rankings.
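The Code tab of the Job Editor holds the SAS code that implements this flow. The following is a rough, hypothetical sketch of the kind of processing such a flow performs; the librefs, table names, and column names are illustrative and do not match the code that SAS Data Integration Studio actually generates:

   /* Hypothetical sketch only; the generated code differs. */

   /* Join the two source tables */
   proc sql;
      create table work.order_fact as
      select o.*, oi.quantity, oi.total_retail_price
      from src.orders as o, src.order_item as oi
      where o.order_id = oi.order_id;
   quit;

   /* Rank the joined rows on a measure */
   proc rank data=work.order_fact out=work.ranked_order_fact descending;
      var total_retail_price;
      ranks price_rank;
   run;

   /* Sort by the computed rank */
   proc sort data=work.ranked_order_fact;
      by price_rank;
   run;

   /* Produce the report */
   proc print data=work.ranked_order_fact;
   run;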


b. Click the DIFT Test Table - ORDERS table object.


Notice that the Details pane now has a Columns tab.
c. Click the Columns tab.

The Columns tab in the Details pane displays column attributes for the selected table object.
These attributes are fully editable in this tab.
Similarly, selecting any of the table objects in the process flow diagram (DIFT Test Table -
ORDERS, DIFT Test Table - ORDER_ITEM, DIFT Test Target - Order Fact Table (in
diagram twice), DIFT Test Target - Ranked Order Fact) displays a Columns tab for that
table object.


d. Click the Join transformation.


Notice that the Details pane now has a Mappings tab.

The full mapping functionality of the Join’s Designer window is found on this Mappings tab.
Similarly, selecting any of the transformations in the process flow diagram (Join, Table
Loader, Rank, Sort, List Data) displays a Mappings tab for that transformation.


e. Click Run to execute the job. (If prompted to log on to the application server, use Bruno / Student1.)

The transformations execute in sequence. The currently executing transformation is highlighted.
Transformations are decorated with a symbol to indicate success or failure. Transformations that complete with errors are outlined in red.


The Status tab in the Details pane shows the completion status for each step in the job. The overall (Job) completion status is set to the lowest step completion status.
f. Double-click the first Error (for the Table Loader) in the Status column.
The Details pane shifts its focus to the Warnings and Errors tab. The error indicates that
the physical location for the target library does not exist.

Now you must discover the physical location that is specified for the library object.
g. Click the Folders tab.
h. If necessary, expand Data Mart Development → DIFT Demo.


i. Click DIFT Test Target Library.


The Basic Properties pane displays the physical path location.

j. Create the system folder that is needed.


1) Access Windows Explorer by selecting Start → All Programs → Accessories → Windows Explorer.
2) Navigate to D:\Workshop\dift.
3) Click New folder in the toolbar area.
4) Type testdm as the name of the new folder and then press Enter.
k. Run the job again by clicking Run.


The Status tab in the Details pane now shows that all transformations, except the List Data transformation, completed successfully.

l. Double-click Error for the List Data transformation.


The Details pane shifts its focus to the Warnings and Errors tab. The error indicates that the
physical file does not exist. However, because the file is to be created by the transformation,
it is more likely that the location for the file does not exist.

m. Create the system folder that is needed.


1) If necessary, access Windows Explorer by selecting Start → All Programs → Accessories → Windows Explorer.
2) Navigate to D:\Workshop\dift.
3) Click New folder in the toolbar area.
4) Type reports as the name of the new folder and then press Enter.
n. Run only the List Data transformation.
1) Click the List Data transformation.

2) Click the Run Selected Transformations tool on the job toolbar.


The Status tab of the Details pane shows that the transformation completed successfully.

o. Select File → Close (or click the Close button) to close the Job Editor window.
If you made any changes when you viewed the job, the following window appears:

p. If necessary, click No so that changes to the job are not saved.

SAS Data Integration Studio Options


5. Select Tools → Options.


The General tab of the Options window appears.

6. Clear the Show advanced property tabs option.

7. Click the Enable row count on basic properties and data viewer for tables option.

Note: Retrieving the number of rows requires system resources for most database tables.
For SAS tables, the number of rows is retrieved from the table metadata and requires
very little overhead.
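As an aside, for Base SAS tables the stored row count can also be seen by querying the SQL dictionary tables. This is a small illustrative sketch; the libref and table name are placeholders:

   /* NLOBS is read from the SAS table header, so retrieving it */
   /* is inexpensive; librefs are stored in uppercase.          */
   proc sql;
      select memname, nlobs
         from dictionary.tables
         where libname = 'DIFTTEST' and memname = 'ORDER_ITEM';
   quit;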


8. Click the Job Editor tab.

The options on this tab affect the Job Editor.


a. Verify that the default selection in the Layout area is Left To Right.

b. Verify that the default selection in the Nodes area is Collapse.

Note: The displays in this course use the Collapse setting.


9. Click the SAS Server tab in the Options window.


a. Verify that the SASApp server is selected as the value for the Server field.
b. Click Test Connection to establish or test the application server connection for SAS Data Integration Studio. An Information window appears and verifies a successful connection.

c. Click OK to close the Information window.


Note: The application server can also be set and tested via the SAS Data Integration
Studio status bar. For example, if the application server is not defined, the status bar
shows the following:

Double-clicking this area in the status bar accesses the Default Application Server
window where a selection can be made and tested.

10. Click OK to close the Options window.


11. Select File → Exit to close SAS Data Integration Studio.


Practice
1. Establishing Global Options
Access SAS Data Integration Studio using the My Server connection profile and Bruno’s
credentials (User ID=Bruno; Password=Student1).
Access the Options window and set/verify the following:
• General option Show Output tab is selected
• General option Show advanced property tabs is cleared
• General option Enable row count on basic properties and data viewer for tables is
selected
• Job Editor option for Layout is set to Left to Right
• Job Editor option for Nodes is set to Collapse
• SAS Server option for Server is set to SASApp
Question: What effect does selecting the option Enable row count on basic properties and
data viewer for tables have on a View Data window?
Hint: Locate a metadata table object, right-click, and select Open.

Answer: ______________________________________________________________

2. Executing and Debugging a Job


Access SAS Data Integration Studio using My Server connection profile and Bruno’s credentials
(User ID=Bruno; Password=Student1).
Open the job DIFT Test Job - OrderFact Table Plus (found under the folder Data Mart Development → DIFT Demo) and perform the following steps:
• Run the job.
• Verify that the first error occurs for the Table Loader transformation.
• Create the testdm folder under D:\Workshop\dift.
• Right-click the Table Loader transformation and select Run From Selected Transformation.
• Verify that the Rank and Sort transformations both execute successfully.
• Verify that the List Data transformation produces an error.
• Create the reports folder under D:\Workshop\dift.
• Right-click the List Data transformation and select Run Selected Transformations.
• Verify that the transformation runs successfully.


2.4 Solutions
Solutions to Practices
1. Establishing Global Options
a. If necessary, access SAS Data Integration Studio as Bruno.
1) Select Start → SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK.
4) Enter Bruno as the user ID and Student1 as the password.
5) Click OK.
b. Select Tools → Options.
c. Verify that the General tab is selected.
1) Verify that Show Output tab is selected.

2) Clear the Show advanced property tabs option.

3) Click the Enable row count on basic properties and data viewer for tables option.

d. Click the Job Editor tab.


1) Verify that the default selection in the Layout area is Left To Right.

2) Verify that the default selection in the Nodes area is Collapse.

e. Click the SAS Server tab in the Options window.


1) Verify that the SASApp server is selected as the value for the Server field.
2) Click Test Connection to establish or test the application server connection for SAS Data
Integration Studio. An Information window appears and verifies a successful connection.


3) Click OK to close the Information window.


f. Click OK to close the Options window.

Question: What effect does selecting the option Enable row count on basic properties
and data viewer for tables have on a View Data window?

Answer: From the Folders tab, under Data Mart Development → DIFT Demo, right-click the DIFT Test Table - ORDER_ITEM table object and select Open. The title bar of the View Data window now shows the row count for the opened table.

g. Select File → Exit to close SAS Data Integration Studio.

2. Executing and Debugging a Job


a. If necessary, access SAS Data Integration Studio as Bruno.
1) Select Start → SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK.
4) Enter Bruno as the user ID and Student1 as the password.
5) Click OK.
b. Right-click DIFT Test Job - OrderFact Table Plus and select Open.
This job joins two source tables and then loads the result into a target table. The target table
is then used as the source for the Rank transformation. The result of the ranking is loaded
into a target table and sorted, and then a report is generated based on the rankings.


c. Click Run to execute the job.

The transformations execute in sequence. The currently executing transformation is highlighted.
The Status tab in the Details pane shows the completion status for each step in the job. The overall (Job) completion status is set to the lowest step completion status.

d. Double-click the first Error (for the Table Loader) in the Status column.
The Details pane shifts its focus to the Warnings and Errors tab. The error indicates that
the physical location for the target library does not exist.

Now you must discover the physical location that is specified for the library object.


e. Click the Folders tab.


f. If necessary, expand Data Mart Development → DIFT Demo.
g. Click DIFT Test Target Library.
h. Verify that the path is set to D:\Workshop\dift\testdm.
i. Create the system folder that is needed.
1) Access Windows Explorer by selecting Start → All Programs → Accessories → Windows Explorer.
2) Navigate to D:\Workshop\dift.
3) Click New folder in the toolbar area.
4) Type testdm as the name of the new folder and then press Enter.
j. Run the job again by clicking Run.
k. Verify that the Status tab in the Details pane now shows that all transformations, except the
List Data transformation, completed successfully.
l. Double-click Error for the List Data transformation.
The Details pane shifts its focus to the Warnings and Errors tab. The error indicates that the
physical file does not exist. However, because the file is to be created by the transformation,
it is more likely that the location for the file does not exist.
m. Create the system folder that is needed.
1) If necessary, access Windows Explorer by selecting Start → All Programs → Accessories → Windows Explorer.
2) Navigate to D:\Workshop\dift.
3) Click New folder in the toolbar area.
4) Type reports as the name of the new folder and then press Enter.
n. Run only the List Data transformation.
1) Click the List Data transformation.

2) Click the Run Selected Transformations tool in the job toolbar.


o. Verify that the Status tab of the Details pane shows that the transformation completed
successfully.
p. Select File → Close (or click the Close button) to close the Job Editor window.
If you made any changes when you viewed the job, the following window appears:

q. If necessary, click No so that changes to the job are not saved.


Solutions to Student Activities

2.01 Activity – Correct Answer


Refer to the previous demonstration for step-by-step instructions. The DIFT
Demo folder should be visible under the Data Mart Development folder,
with four tables, two libraries, and one job defined in the folder.

Lesson 3 SAS® Data Integration Studio: Defining Source Data Metadata
3.1 Setting Up the Environment ......................................................................................... 3-3
Demonstration: Defining Custom Folders ................................................................... 3-5

3.2 Defining Metadata for a Library .................................................................................... 3-9


Demonstration: Defining Metadata for a SAS Library.................................................. 3-11

3.3 Registering Metadata for Data Sources ...................................................................... 3-17

3.4 Registering SAS Table Metadata ................................................................................ 3-22


Demonstration: Registering Metadata for SAS Source Tables...................................... 3-25
Practice............................................................................................................... 3-31

3.5 Registering DBMS Table Metadata ............................................................................. 3-33


Demonstration: Registering Metadata for Oracle Source Tables................................... 3-36
Practice............................................................................................................... 3-46

3.6 Registering ODBC Data Source Table Metadata ......................................................... 3-47


Demonstration: Registering Metadata for ODBC Data Sources ................................... 3-49

3.7 Registering Metadata for External Files ..................................................................... 3-59


Demonstration: Registering Metadata for a Delimited External File .............................. 3-61
Practice............................................................................................................... 3-69

3.8 Solutions ................................................................................................................... 3-70


Solutions to Practices ............................................................................................ 3-70
Solutions to Student Activities ................................................................................. 3-78

3.1 Setting Up the Environment

SAS Data Integration Studio Setup


SAS Data Integration Studio relies on the following:
• user accounts
• access controls
• metadata folder structure
• server definitions
• other resources
SAS Data Integration Studio resources are
• configured by an administrator
• available for use by Data Integration Studio developers.


Copyright © 2020, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.
3-4 Lesson 3 SAS® Data Integration Studio: Defining Source Data Metadata

Metadata Folder Structure

[Slide image: the Folders tree and the Inventory tree in SAS Data Integration Studio]

Metadata is displayed in SAS Data Integration Studio.


The Folders tree organizes the metadata in a custom folders arrangement.
The Inventory tree organizes the metadata by type and is not customizable.
Generally, an administrator sets up the custom folder structure in the Folders tree and defines
access permissions on those folders.


Defining Custom Folders


This demonstration illustrates creating a set of custom folders within the Data Mart Development
folder. The custom folders are used to organize the metadata for a data integration project. An
administrator’s credentials are generally used to complete this step.

Accessing SAS Data Integration Studio


1. Select Start → All Programs → SAS → SAS Data Integration Studio.
2. Log on with Bruno’s credentials.
a. Select My Server as the connection profile.

Note: Do not select Set this connection profile as the default.


b. Click OK to close the Connection Profile window and open the Log On window.
c. Enter Bruno in the User ID field and Student1 in the Password field.

Note: Do not select Save user ID and password in this profile.


d. Click OK to close the Log On window.


SAS Data Integration Studio appears.

3. If necessary, click the Folders tab.


4. Expand the Data Mart Development folder.

5. Right-click the Data Mart Development folder and select New → Folder.
A new folder is created and Untitled is the initial name.

6. Enter Orion Source Data as the name of the folder and press Enter.


7. Right-click the Data Mart Development folder and select New → Folder.
8. Enter Orion Target Data as the name of the folder and then press Enter.
9. Right-click the Data Mart Development folder and select New → Folder.
10. Enter Orion Jobs as the name of the folder and then press Enter.
11. Right-click the Data Mart Development folder and select New → Folder.
12. Enter Orion Reports as the name of the folder and then press Enter.
The final set of folders should resemble the following:


3.01 Activity
1. Access SAS Data Integration Studio using the My Server connection profile with
Bruno’s credentials (Bruno / Student1).
2. Define four metadata folders under Data Mart Development.
• On the Folders tab, right-click the Data Mart Development folder and select New → Folder.
• Enter Orion Target Data as the name for the folder.
• Create three additional folders under Data Mart Development:
✓ Orion Source Data
✓ Orion Jobs
✓ Orion Reports



3.2 Defining Metadata for a Library

Libraries

A library is a collection of one or more files that is referenced as a unit.

Metadata library definitions
• provide access to source tables and target tables
• can be created in SAS Management Console by an administrator
• can be created in SAS Data Integration Studio by a data integration developer
• are created with the New Library Wizard.
Libraries approximately correspond to the level of organization that the operating system uses to organize files. For example, in directory-based operating environments, a SAS library is a group of SAS files in the same directory. The directory might contain other files, but only the SAS files are part of the SAS library.
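In Base SAS code, the same concept is expressed with a LIBNAME statement. As a point of reference, the following sketch assigns a libref to one of the course data directories and lists the library members; the libref mysrc is a placeholder:

   /* Assign a libref to a directory; only the SAS files in the */
   /* directory become members of the library.                  */
   libname mysrc "D:\Workshop\OrionStar\ordetail";

   /* List the members of the library */
   proc contents data=mysrc._all_ nods;
   run;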


New Library Wizard

[Slide image: the New Library Wizard]

The New Library Wizard is used to register libraries by defining
• the type of library (library engine)
• a metadata name and location
• the SAS Application Server where this library is to be assigned
• a library reference
• the physical location being referenced.


Defining Metadata for a SAS Library


This demonstration illustrates defining metadata for a SAS library. The tables in this library are used
as data sources for a data mart.

Accessing SAS Data Integration Studio


1. If necessary, select Start → All Programs → SAS → SAS Data Integration Studio.
2. Log on with Bruno’s credentials.
a. Select My Server as the connection profile.
b. Click OK to close the Connection Profile window and open the Log On window.
c. Enter Bruno in the User ID field and Student1 in the Password field.
d. Click OK to close the Log On window. SAS Data Integration Studio appears.

Defining Metadata for a SAS Library


1. Click the Folders tab.
2. Expand Data Mart Development → Orion Source Data.
3. Verify that the Orion Source Data folder is selected.

4. Select File → New → Library. The New Library Wizard appears.


5. Select SAS BASE Library as the type of library to create.

6. Click Next.
7. Specify the name and location of the new library.
a. Enter DIFT Orion Source Tables Library in the Name field.
b. Verify that the location is set to /Data Mart Development/Orion Source Data.
Note: If the location is incorrect, click Browse, and navigate to SAS Folders → Data Mart Development → Orion Source Data.

8. Click Next.


9. Select the SAS server for the new library.


a. Select SASApp in the Available servers list.

b. Click the right arrow button to move SASApp to the Selected servers list.

10. Click Next.


11. Provide the library properties.
a. Enter diftodet in the Libref field.
b. If the desired path does not appear in the Available items list, then do the following:
1) Click New.
The New Path Specification window appears.
2) In the New Path Specification window, click Browse next to Paths.
3) In the Browse window, navigate to D:\Workshop\OrionStar.
4) Single-click the ordetail folder.
5) Click OK to close the Browse window.
The desired path should now appear in the Paths area:

6) Click OK to close the New Path Specification window.


c. If the desired path does appear in the Available items list, then double-click it to move it to the Selected items list.


The final settings on this page of the New Library Wizard should resemble the following: the libref diftodet is entered, and the desired path appears in the Selected items list.

12. Click Next.


The review window appears.

13. Verify that the information is correct.


14. Click Finish.
15. Verify that the library object is found on the Folders tab.
a. Click the Folders tab.


b. If necessary, expand Data Mart Development → Orion Source Data.


c. Verify that DIFT Orion Source Tables Library appears under Orion Source Data.


3.02 Activity
1. Access SAS Data Integration Studio using the My Server connection
profile with Bruno’s credentials (Bruno / Student1).
2. Create a new library object with the following specifications:
Library Type: SAS
Metadata Name: DIFT Orion Source Tables Library
Metadata Location: /Data Mart Development/Orion Source Data
SAS Application Server: SASApp
Library Reference: diftodet
Library Location: D:\Workshop\OrionStar\ordetail



3.3 Registering Metadata for Data Sources

Defining SAS Metadata Objects for Source Data

To access source data for SAS Data Integration Studio jobs, metadata objects for the source data need to be defined.

[Slide diagram: source data types, including tables stored in a DBMS or ERP system, SAS tables, and external files]

To access source data for SAS Data Integration Studio jobs, metadata objects for the source data need to be defined. The tables can be in a DBMS or ERP system, and can also be in the form of SAS tables.
DBMS (DataBase Management System) refers to system software for creating and managing databases. The DBMS provides users with a systematic way to create, retrieve, and manage data. ERP (Enterprise Resource Planning) refers to business management software that typically includes a database component.

In this class, four types of sources are used:
• SAS tables
• Oracle tables (a DBMS example)
• Microsoft Access tables (an ODBC data source example)
• external files


Defining SAS Metadata Objects for Source Data

• The Register Tables Wizard can be used to register metadata for existing tables.
• The External File wizards can be used to register metadata for an external file.
In order for an existing table to be used in SAS Data Integration Studio, you must create a metadata
object referencing that table.

Register Tables Wizard

With the Register Tables Wizard, you define
• type of table (DBMS type)
• library
• metadata location.
The wizard will retrieve table metadata from the source system, including
• physical table name
• column properties
• properties for keys and indexes.
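The kind of table metadata that the wizard reads can also be inspected directly in Base SAS. A small illustrative sketch, assuming a libref has already been assigned to the source library (mysrc is a placeholder):

   /* Displays column attributes, indexes, and other table metadata */
   proc contents data=mysrc.order_item;
   run;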


External File Wizards

Two common structures for external files are the following:

Delimited files
Files in which data values are separated with a delimiter character.

Fixed Width files
Files in which data values appear in columns that are in fixed positions.

There are two wizards available to define the attributes of these two types of file:
• New Delimited External File Wizard
• New Fixed Width External File Wizard

There is an additional wizard, the New User Written External File Wizard, which sets up properties
for files whose structure is more complex than can be managed in the New Delimited External File or
New Fixed Width External File Wizards.
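
As an illustration of such a structure, consider a file in which the first character of each record determines that record's layout; reading it requires user-written logic. A minimal sketch, with a hypothetical file and hypothetical fields:

   data work.mixed;
      infile "D:\Workshop\dift\data\mixed.dat" truncover;  /* hypothetical file */
      input rectype $1. @;                  /* read the record type and hold the line */
      if rectype = "H" then
         input @2 Order_ID 8.;              /* header-record layout */
      else if rectype = "D" then
         input @2 Product $10. Qty 4.;      /* detail-record layout */
   run;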

External File Wizards


With the External File wizards, you define
• file to read
• parameters
• column definitions.


Table Metadata
Metadata for tables is needed for source and target data. There are a few dependencies that come into play.

Table               Every table object needs a library object…
Library             Some library objects need a server definition…
Server              Some server definitions refer to ODBC data sources…
ODBC Data Source

SAS Table Metadata

Table
Library

To register metadata for SAS tables, a library object is needed.


We say that the library object is used by the table object, and that the table object is dependent on
the library object.
Note: Multiple tables (SAS or otherwise) can be registered from a single library.


DBMS Table Metadata

Table
Library
Server

To register metadata for DBMS tables, a library object is needed. And the DBMS library object requires a server object.


We say that the server object is used by the library object, and the library object is used by the
table object. In addition, the table object is dependent on the library object, and the library object is
dependent on the server object.

ODBC Data Source Table Metadata


To register metadata for tables
Table defined using an ODBC data source,
a library object is needed. The
library object requires a server
Library object. And the server object refers
to the defined ODBC data source.

Server

ODBC
Data Source


We say that the ODBC data source is used by the server object, that the server object is used by
the library object, and that the library object is used by the table object. In addition, the table object
is dependent on the library object, the library object is dependent on the server object, and the
server object is dependent on the ODBC data source.


3.4 Registering SAS Table Metadata

SAS Table Metadata (Review)

Table
Library

To register metadata for SAS tables, a library object is needed.
The Register Tables Wizard is used to register metadata for existing tables.


The Register Tables Wizard is used to register metadata for existing tables – any existing tables.
Therefore, the tables could be SAS tables or DBMS tables or tables accessed through an ODBC
connection.
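
Registration does not have to be interactive. PROC METALIB compares a library's physical tables with the metadata and adds or updates table definitions. A hedged sketch, assuming the SAS session is already connected to the metadata server; the LIBURI query below is one way to identify the library by its metadata name:

   proc metalib;
      omr (liburi="SASLibrary?@Name='DIFT Orion Source Tables Library'");
      folder "/Data Mart Development/Orion Source Data";
      select (PRODUCT_LIST);   /* register only this table */
   run;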

Register Tables Wizard (1)


Typically, the first step in using the Register Tables Wizard is the selection of
the type of table to be registered.

If SAS is selected…


Register Tables Wizard (2)

…the second page of the wizard provides a filtered list of library objects – only those libraries of type SAS will appear in the list.


Register Tables Wizard (3)

When a library is selected in the second page, a brief set of properties for that library appears.


Register Tables Wizard (4)

When a library is selected in the second page, the third page displays a list of all physical tables. You can select one, some, or all to register in metadata.


Registering Metadata for SAS Source Tables


This demonstration illustrates using the Register Tables Wizard to define metadata for a SAS table.
The table is used as one of the data sources for a data mart.

1. If necessary, access SAS Data Integration Studio with Bruno’s credentials.


a. Select Start  All Programs  SAS  SAS Data Integration Studio.
b. Select My Server as the connection profile.
c. Click OK to close the Connection Profile window and open the Log On window.
d. Enter Bruno in the User ID field and Student1 in the Password field.
e. Click OK to close the Log On window.
2. Click the Folders tab.
3. Expand Data Mart Development  Orion Source Data.
4. Verify that the Orion Source Data folder is selected.

5. Select File  Register Tables.


The Register Tables Wizard appears.

Note: Only the table types (SAS/ACCESS engines) that are licensed for your site are available for use.

Note: The following two alternatives are available to initiate the Register Tables Wizard:
• Right-click a folder in the Folders tree where metadata for the table should be saved, and then select Register Tables from the pop-up menu.
• Right-click a library and select Register Tables.


6. Select SAS as the type of table.

Note: This step is omitted from the Register Tables Wizard when you register a table through a library because the type of table is a library property (library engine).
7. Click Next.
The Select a SAS Library window appears.

8. Click the down arrow next to the SAS Library field and then select DIFT Orion Source Tables Library.

9. Click Next.
The Define Tables and Select Folder Location window appears.


10. Select the PRODUCT_LIST table.


11. Verify that /Data Mart Development/Orion Source Data is selected in the Location field.

Note: If the location is incorrect, click Browse and navigate to SAS Folders  Data Mart
Development  Orion Source Data.
12. Click Next. The review window appears.
13. Verify that the information is correct.

14. Click Finish.


The metadata object for the table is found in the Folders tree under the Orion Source Data
folder.

15. Right-click the PRODUCT_LIST metadata table object and select Properties.
16. Enter DIFT as a prefix to the default name.
17. Remove the description.

Add DIFT prefix

Remove the description

18. Click the Columns tab to view the column properties.

Two columns have special symbols:
• A key symbol indicates that the Product_ID column is a primary key.
• An index symbol indicates that the Product_Level column is indexed.


19. Click the Indexes tab.


20. Expand the indexes to see the column names in each index.

21. Click the Keys tab.


22. Click PRODUCT_LIST.Primary in the Keys pane to display the details of the primary key
in the Details area.


23. Click the Physical Storage tab.

Notice that the physical table name is PRODUCT_LIST.

24. Click OK to close the Properties window.


25. Right-click the DIFT PRODUCT_LIST metadata table object and select Open.
The View Data window appears.


Practice
For these practices, access SAS Data Integration Studio using My Server as the connection profile
and log on using Bruno’s credentials (Bruno/Student1).
1. Registering SAS Tables from DIFT Orion Source Tables Library

• Place the table objects in the Data Mart Development  Orion Source Data folder.
• Use the Register Tables Wizard to register two SAS tables.
• Register the PRODUCT_LIST and STAFF tables found in the DIFT Orion Source Tables
Library.
• Add the prefix DIFT to the default metadata name of each table.
• Remove the table description if it exists for either table.

2. Defining a Library for Additional SAS Tables

Additional SAS tables are needed for the course workshops. To access these tables, a new
library object must be registered. The specifics for the library are listed below.

Name: DIFT SAS Library

Folder Location: \Data Mart Development\Orion Source Data

SAS Server: SASApp

Libref: diftsas

Path Specification: D:\Workshop\dift\data

3. Registering SAS Tables from DIFT SAS Library

• Place the table objects in the Data Mart Development  Orion Source Data folder.
• Register the tables below, found in the DIFT SAS Library.

Physical Table         Metadata Table Name
NEWORDERTRANS          DIFT NEWORDERTRANS
PROFIT                 DIFT PROFIT
STAFF_PARTIAL          DIFT STAFF_PARTIAL
VALIDPRODUSAOUTDOOR    DIFT VALIDPRODUSAOUTDOOR

• Add the prefix DIFT to the default metadata name of each table.
• Remove any table descriptions.


4. Using the Inventory Tab

Recall that the Inventory tab organizes known metadata objects into groups by object type. In particular, for library metadata objects, there is a hierarchical view of the library and its registered tables. This view is often useful.

• Click the Inventory tab.


• Locate and expand the Library object type group.

• Expand the DIFT Orion Source Tables Library.


• Do both tables appear under this library?
• Expand the DIFT SAS Library.
• Do all four tables appear under this library?
Note: On SAS Data Integration Studio’s Inventory tab, only registered tables for a library
will be shown, unlike the library views in SAS Enterprise Guide, SAS Studio, and the
SAS Windowing Environment. Any other SAS files that exist in the SAS library
location will not appear until they are registered.
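
By contrast, Base SAS lists every physical member of the library location, registered or not. A minimal sketch:

   libname diftsas base "D:\Workshop\dift\data";

   proc datasets lib=diftsas;   /* the directory listing shows all members */
   quit;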


3.5 Registering DBMS Table Metadata

DBMS Table Metadata (Review)

Table
Library
Server

To register metadata for DBMS tables, a library object is needed. And the DBMS library object requires a server object.


Register Tables - Oracle (1)


In this class, an Oracle database is used as the DBMS example.

The Register Tables Wizard has a choice for Oracle under the Database Sources group.


Register Tables - Oracle (2)

If Oracle is selected, the second page of the wizard provides a filtered list of library objects – only libraries of type Oracle will appear in the list.


Register Tables - Oracle (3)

When a library is selected in the second page, the third page displays a list of available physical tables. You can select one, some, or all to register in metadata.


DBMS Table Metadata Process

1. The DBMS server is typically registered by an administrator using SAS Management Console.
2. The DBMS library can be registered in SAS Data Integration Studio, and will reference the DBMS server.
3. DBMS tables can be registered using the Register Tables Wizard, selecting the DBMS library.


It is best to first define server metadata, and then library metadata (that uses the server metadata),
and then table metadata (that uses the library metadata).
• Server object is used by the library object.
• Library object is used by the table object.
• Table object is dependent on the library object.
• Library object is dependent on the server object.


Registering Metadata for Oracle Source Tables


This demonstration illustrates defining metadata for two tables in an Oracle database. Metadata for
an Oracle data source requires a database server definition in metadata that points to the Oracle
instance.
The following steps are needed:
• Verify that a database server definition exists in metadata for an Oracle database.
• Define a metadata library object that uses the Oracle engine and references the Oracle
database server object.
• Define metadata table objects that use the metadata library object.

Verifying That a Needed Oracle Server Definition Exists


1. Access SAS Management Console with Ahmed's credentials.
a. Select Start  SAS Management Console.
b. Select My Server as the connection profile.
c. Click OK to close the Connection Profile window and open the Log On window.
d. Enter Ahmed in the User ID field and Student1 in the Password field.
e. Click OK to close the Log On window. SAS Management Console appears.
2. If necessary, click the Plug-ins tab.
3. Expand the Server Manager plug-in.

Server Manager plug-in


4. Click the server definition named DIFT Oracle Server.


5. On the Connections tab, right-click the connection named Connection: DIFT Oracle Server and
select Properties.

Right-click and select Properties

The Connection Properties window for the DIFT Oracle Server appears.
a. Click the Options tab.
b. Verify that the Oracle Path Information specifies a path of xe.
Note: xe is a TNS name that was defined when the Oracle client was configured. The name
xe was chosen by the image builder because the express version of Oracle was installed.
c. Verify that the Authentication type field is set to User/Password.
d. Verify that the Authentication domain field is set to OracleAuth.

The Oracle user ID and password are stored in the OracleAuth authentication domain.
e. Click Cancel to close the Connection: DIFT Oracle Server Properties window.


6. Click the User Manager plug-in on the Plug-ins tab in SAS Management Console.

Right-click and select Properties

7. Right-click the Data Integrators group in the right pane and select Properties. The Data
Integrators Properties appears.
8. Click the Members tab.

The Data Integrators group has eight members, including Bruno.


9. Click the Accounts tab.

The Data Integrators group has an Oracle account. The Oracle credentials are stored in the
OracleAuth authentication domain. An authentication domain is a metadata object that stores the
credentials to access a server or a DBMS.
10. Click Cancel to close the Data Integrators Properties window.
11. Select File  Exit to close SAS Management Console.


Defining Metadata for an Oracle Library


After the Oracle server is defined, Bruno can define the metadata object that represents a library for
the Oracle database.
1. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
a. Select Start  All Programs  SAS  SAS Data Integration Studio.
b. Select My Server as the connection profile.
c. Click OK to close the Connection Profile window and open the Log On window.
d. Enter Bruno in the User ID field and Student1 in the Password field.
e. Click OK to close the Log On window.
2. Click the Folders tab.
3. Expand Data Mart Development  Orion Source Data.
4. Verify that the Orion Source Data folder is selected.
5. Select File  New  Library. The New Library Wizard appears.
6. Click Oracle Library as the type of library.

7. Click Next.


8. Enter DIFT Oracle Library in the Name field.


9. Verify that the location is set to /Data Mart Development/Orion Source Data.

10. Click Next.


11. Double-click SASApp in the Available servers list to move it to the Selected servers list.
12. Click Next.
13. Enter diftora in the Libref field.

14. Click Next.


15. Verify that DIFT Oracle Server is the value for the Database Server field.

Note: There is only one defined server in metadata of type Oracle. This is the Oracle database
server we just explored in SAS Management Console.
16. Click Next.
The review window shows the following:

17. Click Finish. This completes the metadata definition for the library object.
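
At run time this library object resolves to a SAS/ACCESS LIBNAME statement. A minimal sketch of the equivalent code; the user ID and password below are placeholders, because in this environment the real credentials are retrieved from the OracleAuth authentication domain:

   /* with placeholder credentials */
   libname diftora oracle path=xe user=orauser password=orapass;

   /* or, in a metadata-aware session, by authentication domain */
   libname diftora oracle path=xe authdomain="OracleAuth";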


Defining Metadata for Oracle Tables


With the Oracle library defined, Bruno can define the metadata objects for tables in the Oracle
database.
1. If necessary, click the Folders tab.
2. Expand Data Mart Development  Orion Source Data.
3. Verify that the Orion Source Data folder is selected.
4. Select File  Register Tables. The Register Tables Wizard appears.
5. Expand the Database Sources folder.
6. Click Oracle as the type of table.

7. Click Next. The Oracle window appears.


8. Verify that DIFT Oracle Library is selected in the SAS Library field.

Note: The xe path defined for the DIFT Oracle Server connection surfaces through the
DIFT Oracle Library.
9. Click Next.


10. Select both the ORDERS table as well as the ORDER_ITEM table.
a. Click the ORDERS table.
b. Hold down the Ctrl key and click ORDER_ITEM.
11. Verify that the metadata location for the metadata table object is /Data Mart Development/
Orion Source Data.

12. Click Next.


The review window shows the following:


13. Click Finish.


The metadata table objects for the two new Oracle data sources, as well as the newly defined
library object, are found in the Folders tree.

14. Investigate and update properties for the ORDERS table.


a. Right-click the ORDERS metadata table object and select Properties.
b. Enter DIFT ORDERS in the Name field.

c. Click the Columns tab to view the registered column properties.

Column properties were retrieved from the database.


The SAS datetime20. informat and format were assigned to the Oracle date-type columns by SAS in order to display the dates in the SAS application interface. Oracle stores dates as date-time values. (A conversion sketch follows this demonstration.)


d. Click the Physical Storage tab.

Notice that the physical table name is ORDERS.


e. Click OK to close the Properties window.
15. Review the data values for the DIFT ORDERS table.
a. Right-click the DIFT ORDERS metadata table object and select Open.
The View Data window appears.

Note: The data values were read directly from the Oracle table for display here. The
metadata table object provides direct access to the data in the database and is interacting
directly with the database to retrieve that data here and when used in jobs.
b. Select File  Close to close the View Data window.
16. Investigate and update properties for the ORDER_ITEM table.
a. Right-click the ORDER_ITEM metadata table object and select Properties.
b. Enter DIFT ORDER_ITEM in the Name field.


c. Click the Columns tab to view the registered column properties.

d. Click the Physical Storage tab.

e. Verify that the physical table name is ORDER_ITEM.


f. Click OK to close the Properties window.
17. Review the data values for the DIFT ORDER_ITEM table.
a. Right-click the DIFT ORDER_ITEM metadata table object and select Open.

The View Data window appears.

b. Select File  Close to close the View Data window.
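
As noted above, the Oracle date-type columns surface in SAS as datetime values. If downstream processing needs SAS date values instead, the DATEPART function converts them. A minimal sketch, assuming the ORDERS table has a datetime column named Order_Date:

   data work.orders_dates;
      set diftora.orders;                  /* read through the Oracle libref */
      Order_Day = datepart(Order_Date);    /* datetime value -> SAS date value */
      format Order_Day date9.;
   run;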


Practice
For these practices, access SAS Data Integration Studio using the My Server connection profile and
log on using Bruno’s credentials (Bruno/Student1).

5. Defining a Library for Oracle Tables

Additional tables (from an Oracle DBMS) are needed for the course workshops. To access these
tables, a new library object (for an Oracle DBMS) must be registered. The needed Oracle Server
metadata object already exists.

The specifics for the library are listed below.


Folder Location: \Data Mart Development\Orion Source Data

Type of Library: Oracle Library

Name: DIFT Oracle Library

SAS Server: SASApp

Libref: diftora

Database Server: DIFT Oracle Server

6. Registering Oracle Tables from DIFT Oracle Library

Two Oracle tables need to be registered. The specifics are listed below.
• Place the table objects in the Data Mart Development  Orion Source Data folder.
• Register the tables below, found in the DIFT Oracle Library.
• Add the prefix DIFT to the default metadata name of each table.

Physical Table   Metadata Table Name
ORDERS           DIFT ORDERS
ORDER_ITEM       DIFT ORDER_ITEM

7. Using the Inventory Tab

• Click the Inventory tab.


• Locate and expand the Library object type group.
• Expand the DIFT Oracle Library.

Do both tables appear under this library?


3.6 Registering ODBC Data Source Table Metadata

ODBC Data Source Table Metadata (Review)


To register metadata for tables
Table defined using an ODBC data source,
a library object is needed. The
library object requires a server
Library object. And the server object refers
to the defined ODBC data source.

Server

ODBC
Data Source


ODBC Data Source Table Metadata

1. ODBC data sources are defined to the operating system.
2. ODBC servers are typically registered by an administrator using SAS Management Console.
3. ODBC libraries can be registered in SAS Data Integration Studio, and reference an ODBC server.
4. ODBC data source tables can be registered using the Register Tables Wizard.


ODBC Registration in Windows


On a Windows operating system, the ODBC Data Source Administrator can
be used to add, remove, and configure ODBC data sources and drivers.


Registering Metadata for ODBC Data Sources


This demonstration illustrates defining metadata for a Microsoft Access database table. You use ODBC in an environment where SAS and Microsoft Access share the same bitness. The classroom environment has 64-bit SAS and 64-bit ODBC drivers for Microsoft Access.

Verifying the ODBC System Data Sources


The ODBC Data Source Administrator is used to verify the existence of two Microsoft Access databases as ODBC system data sources.
1. Access the ODBC Data Source Administrator.
a. Select Start  Control Panel.
b. In Search Control Panel, enter ODBC.

c. Click Set up ODBC data sources (64-bit).


The ODBC Data Source Administrator window appears.

Note: For the course image, this is the ODBC Data Source Administrator for 64-bit drivers.
It can also be accessed by double-clicking C:\Windows\System32\odbcad32.exe.


2. Verify the predefined System DSNs.


a. Click the System DSN tab.
b. On the System DSN tab, select the Orion Star Contacts 64-bit data source.
c. Click Configure to open the ODBC Microsoft Access Setup window.

Configure

d. Verify that the database source is D:\Workshop\dift\data\OrionStarContacts.mdb.


1) Click Select in the Database area.
2) Verify that OrionStarContacts.mdb is selected.
3) Verify that the path is D:\Workshop\dift\data.

4) Click Cancel to close the Select Database window.


5) Click Cancel to close the ODBC Microsoft Access Setup window.
e. On the System DSN tab, click the Orion Star Orders data source.
f. Click Configure to open the ODBC Microsoft Access Setup window.


g. Verify that the database source is D:\Workshop\dift\data\OrionStarOrders.mdb.


1) Click Select in the Database area.
2) Verify that OrionStarOrders.mdb is selected.
3) Verify that the path is D:\Workshop\dift\data.
4) Click Cancel to close the Select Database window.
5) Click Cancel to close the ODBC Microsoft Access Setup window.
h. Click Cancel to close the ODBC Data Source Administrator window.
3. Select File  Close to close the Control Panel window.

Verifying the ODBC Database Server Definition


1. Access SAS Management Console with Ahmed's credentials.
a. Select Start  All Programs  SAS  SAS Management Console 9.4.
b. Select My Server as the connection profile.
c. Click OK to close the Connection Profile window and open the Log On window.
d. Enter Ahmed in the User ID field and Student1 in the Password field.
e. Click OK to close the Log On window. SAS Management Console appears.
2. If necessary, click the Plug-ins tab.
3. Expand the Server Manager plug-in.
4. Click the server definition named DIFT ODBC Microsoft Access Server.
5. On the Connections tab, right-click the connection named Orion Star Contacts and select
Properties.

a. Click the Options tab.


b. Verify that Datasrc is selected.


c. Verify that "Orion Star Contacts" is specified in the Datasrc field.

Note: Quotation marks are needed if the Datasrc value contains blanks.
d. Click Cancel to close the Connection: Orion Star Contacts Properties window.
6. On the Connections tab, right-click the connection named Orion Star Orders and select
Properties.
a. Click the Options tab.
b. Verify that Datasrc is selected.
c. Verify that "Orion Star Orders" is specified in the Datasrc field.

d. Click Cancel to close the Connection: Orion Star Orders Properties window.
7. Select File  Exit to close SAS Management Console.

Defining Metadata for an ODBC Data Source Library


Because the needed ODBC database server is defined, Bruno can define the metadata object that
represents a library for the desired ODBC data source.
1. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
a. Select Start  All Programs  SAS  SAS Data Integration Studio.
b. Select My Server as the connection profile.
c. Click OK to close the Connection Profile window and open the Log On window.
d. Enter Bruno in the User ID field and Student1 in the Password field.
e. Click OK to close the Log On window.


2. Click the Folders tab.


3. Expand Data Mart Development  Orion Source Data.
4. Verify that the Orion Source Data folder is selected.
5. Select File  New  Library. The New Library Wizard appears.
6. If necessary, expand the Database Data folder.
7. Click ODBC Library.
8. Click Next.
9. Enter DIFT ODBC Contacts Library in the Name field.
10. Verify that the location is set to /Data Mart Development/Orion Source Data.

11. Click Next.


12. Double-click SASApp in the Available servers list to move it to the Selected servers list.
13. Click Next.
14. Enter odbccont in the Libref field.

15. Click Next.


16. Select DIFT ODBC Microsoft Access Server in the Database Server field.
17. Select Connection: Orion Star Contacts in the Connection field.

No login is needed for this connection.


18. Click Next.
The review window shows the following:


19. Click Finish.


20. If necessary, click the Folders tab.

The DIFT ODBC Contacts Library is on the Folders tab.
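
The equivalent LIBNAME statement for this library is a one-liner; the quotation marks around the data source name are required because it contains blanks. A minimal sketch:

   libname odbccont odbc datasrc="Orion Star Contacts";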

Defining Metadata for Microsoft Access Tables


With the ODBC library defined, Bruno can easily define the metadata object that references a table
in the Microsoft Access database.
1. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
2. Click the Folders tab.
3. Click DIFT ODBC Contacts Library.
4. Select File  Register Tables. The Register Tables Wizard appears.

5. Verify that the SAS library is DIFT ODBC Contacts Library.


6. Verify that the data source is "Orion Star Contacts".
7. Click Next.


8. Click Select All Tables.

9. Verify that the location is set to /Data Mart Development/Orion Source Data.
10. Click Next.
The review window shows the following:

11. Click Finish.

The three tables appear under the Orion Source Data folder on the Folders tab.


12. Rename the three table objects.


a. Right-click the Contacts table and select Rename.
b. Type DIFT Contacts and press Enter.
c. Right-click the CustType table and select Rename.
d. Type DIFT Customer Types and press Enter.
e. Right-click the NewProducts table and select Rename.
f. Type DIFT NewProducts and press Enter.
The three renamed table objects are alphabetized.


3.03 Activity
1. Access SAS Data Integration Studio using the My Server connection
profile with Bruno’s credentials (Bruno / Student1).
2. On the Folders tab, right-click the Data Mart Development folder and
select Import  SAS Package.
3. Select the package
D:\Workshop\dift\solutions\DIFT_ODBCObjects.spk.
4. Accept the default selections in each page in the wizard.


3.7 Registering Metadata for External Files

About External Files

External files are sometimes referred to as flat files or raw data files.


External files
• are character-based (for example, ASCII or EBCDIC)
• can contain binary fields
• often use one line per record.


About External Files


The fields in an external file
• can have fixed widths

• can be separated by a delimiter (for example, comma, tab, or blank).


About External Files

External files are accessed with different SAS code than SAS or DBMS tables, so they have their own registration wizards.


Similar to SAS or DBMS tables, external files can be used as sources and targets.
Unlike SAS or DBMS tables, which are accessed with SAS library engines, external files are accessed with SAS INFILE and INPUT statements for source files and FILE and PUT statements for target files.
Accordingly, external files have their own registration wizards. The three wizards are accessed by selecting File  New under the External File group.
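
A minimal sketch of both directions, using a hypothetical two-field layout:

   /* external file as a source: INFILE and INPUT */
   data work.contacts;
      infile "D:\Workshop\dift\data\contacts.csv"   /* hypothetical file */
             dlm="," dsd firstobs=2;
      length Name $40 Phone $12;
      input Name $ Phone $;
   run;

   /* external file as a target: FILE and PUT */
   data _null_;
      set work.contacts;
      file "D:\Workshop\dift\data\contacts_fixed.txt";
      put @1 Name $40. @42 Phone $12.;
   run;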


Registering Metadata for a Delimited External File


This demonstration illustrates registering metadata for a comma-delimited external file that contains
supplier information.
1. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
a. Select Start  All Programs  SAS  SAS Data Integration Studio.
b. Select My Server as the connection profile.
c. Click OK to close the Connection Profile window and open the Log On window.
d. Enter Bruno in the User ID field and Student1 in the Password field.
e. Click OK to close the Log On window.
2. Click the Folders tab.
3. Expand Data Mart Development  Orion Source Data.
4. Verify that the Orion Source Data folder is selected.
5. Select File  New  External File  Delimited. The New Delimited External File Wizard
appears.
6. Enter DIFT Supplier Information in the Name field.
7. Verify that the location is set to /Data Mart Development/Orion Source Data.

8. Click Next. The External File Location window appears.


9. Click Browse to open the Select a file window.
10. Navigate to D:\Workshop\dift\data.


11. Change the Type field to Delimited Files (*.csv).


12. Select supplier.csv.

13. Click OK.


14. To view the contents of the external file, click Preview.

The preview shows that


• the first record contains column names
• the fields are comma delimited and not space delimited.


15. Click OK to close the Preview File window.


The final settings for the External File Location window are shown here:

16. Click Next. The Delimiters and Parameters window appears.


17. Clear the Blank check box.
18. Select the Comma check box.

19. Click Next. The Column Definitions window appears.


There is an upper pane and a lower pane. The upper pane is used to define fields to read from
the external file. The lower pane has a series of tabs:
File: used to view raw data in the external file
Data: used to view data in the external file after metadata from the external file wizard has
been applied
Source: used to view the SAS DATA step code generated
Log: used to view a SAS log for SAS DATA step code (to help identify errors)


20. Click Refresh in the lower pane (with the File tab active) to see the first 10 records in the
external file.

Note: You can manually define properties for each field by clicking the New button. If there are
many fields, the wizard provides some functions to automate the building of initial
metadata.
21. Click Auto Fill in the upper pane of the Column Definitions window. The Auto Fill Columns
window appears.


22. Enter 2 in the Start record field in the Guessing records area.

Note: Default informats and formats can be selected for character and numeric fields. We do
not use this feature in this demo.
23. Click OK to close the Auto Fill Columns window.
The upper pane of the Column Definitions window is populated with six column definitions: three
numeric and three character values.

24. Click Import. The Import Column Definitions window appears.


25. Click Get the column names from column headings in this file.
26. Verify that 1 is entered in the The column headings are in file record field.

Note: Column properties can be imported from various other sources. We use the fourth option
to import column names from the source file.
27. Click OK.
The Name fields are populated with the column names.


28. Provide the following descriptions:

Column Name         Description
Supplier_ID         Supplier ID
Supplier_Name       Supplier Name
Street_ID           Supplier Street ID
Supplier_Address    Supplier Address
Sup_Street_Number   Supplier Street Number
Country             Supplier Country

29. Change the length of Supplier_ID to 4.

30. Click the Data tab in the lower pane of the Column Definitions window.
31. Click Refresh.

Note: This action executes the generated code (Source tab) and displays the results. This lets
you verify that the field properties are appropriate and look for any mistakes in the wizard
configuration. The log is useful if errors occur.
32. Click Next.


The review window shows the following:

33. Verify that the object is found on the Folders tab.


a. Click the Folders tab.
b. If necessary, expand Data Mart Development  Orion Source Data.
c. Verify that DIFT Supplier Information appears under Orion Source Data.
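
The Source tab of the wizard shows the DATA step that this metadata generates. A sketch of what that code looks like for this file; the column lengths shown are assumptions, not the wizard's exact output:

   data supplier;
      infile "D:\Workshop\dift\data\supplier.csv" dlm="," dsd
             firstobs=2 truncover;
      attrib Supplier_ID        length=4   label="Supplier ID";
      attrib Supplier_Name      length=$40 label="Supplier Name";
      attrib Street_ID          length=8   label="Supplier Street ID";
      attrib Supplier_Address   length=$45 label="Supplier Address";
      attrib Sup_Street_Number  length=8   label="Supplier Street Number";
      attrib Country            length=$2  label="Supplier Country";
      input Supplier_ID Supplier_Name $ Street_ID Supplier_Address $
            Sup_Street_Number Country $;
   run;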


Practice
8. Importing Metadata for a Delimited External File
Access SAS Data Integration Studio using the My Server connection profile with Bruno's
credentials (Bruno/Student1).
• Launch the import from the Data Mart Development folder.
• Import metadata from the SAS package file DI1_DelimExtFile.spk found in
D:\Workshop\dift\solutions.

Question: How many metadata objects are in the SAS package?


Answer: ____________________________________

Question: How many columns are defined for DIFT Supplier Information?
Answer: ____________________________________

9. Defining Metadata for an External File


Access SAS Data Integration Studio using the My Server connection profile with Bruno's
credentials (Bruno/Student1).
A file, D:\Workshop\dift\data\profit.txt, is needed for processing. The fields in this file have
fixed column widths, so the appropriate wizard to use is the New Fixed Width External File
Wizard.
• Name the metadata object representing the external file DIFT Profit Information.
• Place the metadata object in the /Data Mart Development/Orion Source Data folder.
• Assign the following column properties:

Column Name   Length   Type   Informat   Format      Begin Position   End Position
Company       22       Char              $22.        1                22
YYMM          8        Num               yymm5.      24               28
Sales         8        Num               dollar13.   30               43
Cost          8        Num               dollar13.   45               58
Salaries      8        Num               dollar13.   61               84
Profit        8        Num               dollar13.   87               100

Hint: Columns can be created by clicking New.


• Ignore the warning about missing informats.
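
For reference, reading this layout in a DATA step would look roughly like the following. This is a sketch, not the wizard's exact output; whether the raw currency fields read cleanly with standard numeric informats depends on the file's contents (the warning above stems from the missing informats):

   data work.profit;
      infile "D:\Workshop\dift\data\profit.txt" truncover;
      input @1  Company  $22.
            @24 YYMM     5.
            @30 Sales    14.
            @45 Cost     14.
            @61 Salaries 24.
            @87 Profit   14.;
      format YYMM yymm5. Sales Cost Salaries Profit dollar13.;
   run;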


3.8 Solutions
Solutions to Practices
1. Registering SAS Tables from DIFT Orion Source Tables Library
a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
1) Select Start  All Programs  SAS  SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK to close the Connection Profile window and open the Log On window.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.
b. Click the Folders tab.
1) Expand Data Mart Development  Orion Source Data.
2) Verify that the Orion Source Data folder is selected.
c. Select File  Register Tables. The Register Tables Wizard appears.
1) Select SAS as the type of table.
2) Click Next. The Select a SAS Library window appears.
3) Click the down arrow next to the SAS Library field and then select DIFT Orion Source Tables Library.
4) Click Next. The Define Tables and Select Folder Location window appears.
5) Click the PRODUCT_LIST table and then hold down the Ctrl key and click the STAFF
table to select both.
6) Verify that /Data Mart Development/Orion Source Data is selected in the Location
field.
7) Click Next. The review window appears.
8) Verify that the information is correct.
9) Click Finish.
The metadata objects for the tables are found in the Orion Source Data folder.
d. Edit the properties of table objects.
1) Right-click the PRODUCT_LIST metadata table object and select Properties.
a) Enter DIFT as a prefix to the default name.
b) Remove the description.
c) Click OK to close the Properties window.
2) Right-click the STAFF metadata table object and select Properties.
a) Enter DIFT as a prefix to the default name.
b) Click OK to close the Properties window.


2. Defining a Library for Additional SAS Tables


a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
b. Click the Folders tab.
1) Expand Data Mart Development  Orion Source Data.
2) Verify that the Orion Source Data folder is selected.
c. Select File  New  Library. The New Library Wizard appears.
1) Click SAS BASE Library as the type of library.
2) Click Next.
3) Enter DIFT SAS Library in the Name field.
4) Verify that the location is set to \Data Mart Development\Orion Source Data.
5) Click Next.
6) Click SASApp in the Available servers pane.
7) Click the right arrow to move SASApp to the Selected servers list.
8) Click Next.
9) Enter diftsas in the Libref field.
10) The desired path does not exist in the Available items pane. Click New.
a) In the New Path Specification window, click Browse next to Paths.
b) In the Browse window, navigate to D:\Workshop\dift.
c) Click the data folder to select it.
d) Click OK to close the Browse window.
e) Verify that the value in the Paths field is D:\Workshop\dift\data.
f) Click OK to close the New Path Specification window.
11) Verify that the new path appears in the Selected items list.

12) Click Next.


13) Verify that the information is correct in the review window.
14) Click Finish.
The metadata object for the library is found in the Orion Source Data folder.

3. Registering SAS Tables from the DIFT SAS Library


a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
b. The DIFT SAS Library should be registered in metadata. (If not, perform the previous
practice.)
c. Click the Folders tab.
d. Expand Data Mart Development  Orion Source Data.


e. Verify that the Orion Source Data folder is selected.
f. Select File  Register Tables. The Register Tables Wizard appears.
g. Click SAS as the type of table.
h. Click Next. The Select a SAS Library window appears.
i. Click the down arrow next to the SAS Library field and then click DIFT SAS Library.
j. Click Next. The Define Tables and Select Folder Location window appears.
1) Hold down the Ctrl key and click NEWORDERTRANS, PROFIT, STAFF_PARTIAL,
and VALIDPRODUSAOUTDOOR.
2) Verify that the location is set to /Data Mart Development/Orion Source Data.
3) Click Next. The review window appears.
4) Verify that the information is correct and click Finish.
The metadata objects for the tables are found in the Orion Source Data folder.
k. Update the properties of the new table objects.
1) If necessary, click the Folders tab.
2) Right-click the NEWORDERTRANS metadata table object and select Properties.
a) Enter DIFT at the beginning of the default name.
b) Remove the default description.
c) Click OK to close the Properties window.
3) Right-click the PROFIT metadata table object and select Properties.
a) Enter DIFT at the beginning of the default name.
b) Remove the default description.
c) Click OK to close the Properties window.
4) Right-click the STAFF_PARTIAL metadata table object and select Properties.
a) Enter DIFT at the beginning of the default name.
b) Click OK to close the Properties window.
5) Right-click the VALIDPRODUSAOUTDOOR metadata table object and select
Properties.
a) Enter DIFT at the beginning of the default name.
b) Remove the default description.
c) Click OK to close the Properties window.


4. Using the Inventory Tab


a. If necessary, access SAS Data Integration Studio
with Bruno’s credentials.
b. Click the Inventory tab.
c. Locate and expand the object type of Library.
d. Locate and expand the library object named
DIFT Orion Source Tables Library.
e. Verify that two tables have been registered from
the DIFT Orion Source Tables Library.
f. Locate and expand the library object named DIFT
SAS Library.
g. Verify that four tables have been registered from the DIFT SAS Library.

5. Defining a Library for Oracle Tables


a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
b. Click the Folders tab.
1) Expand Data Mart Development  Orion Source Data.
2) Verify that the Orion Source Data folder is selected.
c. Select File  New  Library. The New Library Wizard appears.
d. Click Oracle Library as the type of library.
e. Click Next.
1) Enter DIFT Oracle Library in the Name field.
2) Verify that the location is set to \Data Mart Development\Orion Source Data.
f. Click Next.
1) Click SASApp in the Available servers pane.
2) Click the right arrow to move SASApp to the Selected servers list.
g. Click Next.
h. Enter diftora in the Libref field.
i. Click Next.
j. Select DIFT Oracle Server as the Database Server.
k. Click Next.
l. Click Finish.

6. Registering Oracle Tables from DIFT Oracle Library


a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
b. DIFT Oracle Library should already be registered in metadata. (If not, perform the previous
practice.)
c. Click the Folders tab.


d. Expand Data Mart Development  Orion Source Data.


e. Verify that the Orion Source Data folder is selected.
f. Select File  Register Tables. The Register Tables Wizard appears.
g. Expand Database Sources and select Oracle as the type of table.
h. Click Next.
i. Click the down arrow next to the SAS Library field and then click DIFT Oracle Library.
j. Click Next.
k. Hold down the Ctrl key and click ORDERS and ORDER_ITEM.
l. Verify that the location is set to /Data Mart Development/Orion Source Data.
m. Click Next. The review window appears.
n. Verify that the information is correct and click Finish.
The metadata objects for the tables are found in the Orion Source Data folder.
o. Update the properties of the new table objects.
1) If necessary, click the Folders tab.
2) Right-click the ORDERS metadata table object and select Properties.
a) Enter DIFT at the beginning of the default name.
b) Click OK to close the Properties window.
3) Right-click the ORDER_ITEM metadata table object and select Properties.
a) Enter DIFT at the beginning of the default name.
b) Click OK to close the Properties window.

7. Using the Inventory Tab


a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
b. Click the Inventory tab.
c. Locate and expand the object type of Library.
d. Locate and expand the library object named DIFT Oracle Library.
e. Verify that two (2) tables have been registered from the DIFT Oracle Library.


8. Importing Metadata for a Delimited External File


a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
1) Select Start  All Programs  SAS  SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK to close the Connection Profile window and open the Log On window.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.
b. Click the Folders tab.
c. Right-click Data Mart Development and select Import  SAS Package.
d. Click Browse next to Enter the location of the input SAS package file.
1) In the Browse window, if necessary, navigate to D:\Workshop\dift\solutions.
2) Click DI1_DelimExtFile.spk.
3) Click OK to close the Browse window.
4) Accept the default choice for All objects.
5) Click Next. Verify that one file object is selected.
6) Click Next. The wizard notes that the SAS Application Server and File Path must be verified.
7) Click Next. Verify that the Original and Target SAS Application Server information match.
8) Click Next. Verify that the Original and Target File Paths information match.
9) Click Next. Review the summary information.
10) Click Next. The import occurs.
11) Click Finish.

Question: How many metadata objects are in the SAS package?


Answer: One (1)

Question: How many columns are defined for DIFT Supplier Information?
Answer: Six (6)

9. Defining Metadata for an External File


a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
b. Click the Folders tab.
c. Expand Data Mart Development  Orion Source Data.
d. Verify that the Orion Source Data folder is selected.
e. Select File  New  External File  Fixed Width.
1) Enter DIFT Profit Information in the Name field.
2) Verify that the location is set to /Data Mart Development/Orion Source Data.
f. Click Next. The External File Location window appears.


g. Click Browse to open the Select a file window.


1) Navigate to D:\Workshop\dift\data.
2) Select profit.txt.
3) Click OK.
h. To view the contents of the external file, click Preview.

Click OK to close the Preview File window.


i. Click Next. The Parameters window appears.
j. Accept the default settings for the parameters and click Next.
k. Define the column information. For each column, click New to add a new column and enter
the following column properties:

Column Name   Length   Type   Informat   Format      Begin Position   End Position
Company       22       Char              $22.        1                22
YYMM          8        Num               yymm5.      24               28
Sales         8        Num               dollar13.   30               43
Cost          8        Num               dollar13.   45               58
Salaries      8        Num               dollar13.   61               84
Profit        8        Num               dollar13.   87               100

Note: It is easiest to define the above fields in order, from top to bottom, to avoid overlapping fields.


l. Click the Data tab and then click Refresh.


A warning window appears.

m. Click Yes. Review the data values on the Data tab. (The values for Sales and Cost are
missing in the first 120 rows.)
n. Click Next.
o. If the warning window appears again, click Yes.
The review window displays general information for the external file.
p. Click Finish. The metadata object for the external file is found in the Orion Source Data
folder.


Solutions to Student Activities

3.01 Activity – Correct Answer


Refer to the previous demonstration for step-by-step instructions.

The final set of folders should resemble the following:


3.02 Activity – Correct Answer


Refer to the previous demonstration for step-by-step instructions.
The new library object should now be visible on the Folders tab, under the
Data Mart Development  Orion Source Data folder.


3.03 Activity – Correct Answer


Six table objects and two libraries appear in the Orion Source Data folder.


Lesson 4 SAS® Data Integration Studio: Defining Target Data Metadata
4.1 Registering Metadata for Target Tables ........................................................................ 4-3
Demonstration: Refresh the Metadata........................................................................ 4-8
Demonstration: Defining the Product Dimension Table Metadata ................................. 4-12
Practice............................................................................................................... 4-20

4.2 Importing Metadata .................................................................................................... 4-22


Demonstration: Importing Relational Metadata ......................................................... 4-24

4.3 Solutions ................................................................................................................... 4-30


Solutions to Practices ............................................................................................ 4-30

4.1 Registering Metadata for Target Tables

Metadata for Data (Review)


Data tables can be used as sources (and targets) for job flows. The tables
can be in a DBMS or ERP system, and can also be in the form of SAS tables.
In addition, data can come from (or be written to) external files. Thus,
external files can also be used as sources (and targets) for job flows.

In this class, the target for the initial series of tables will be SAS tables.


Wizards for Metadata for Target Data


• The New Table Wizard can be used to define metadata for a new table.
• The External File wizards can be used to define metadata for an external file.


Note: The External File wizards were discussed in previous materials to define metadata for source files. The same wizards are used to define target files.

New Table Wizard


When you define metadata for a new table, the New Table Wizard can perform these tasks:
• import metadata from tables/columns that are already registered in the metadata repository
• override metadata that was imported (for example, change a column name)
• define new attributes for the table that is defined (for example, indexes)


New Table Wizard (1)


The first page of the New Table Wizard enables you to specify a metadata name, description, and location for the new table object.

New Table Wizard (2)


The second page of the New Table Wizard specifies Table Storage Information to include
• DBMS type
• library location for the selected DBMS type
• valid name of the new table.


New Table Wizard (3)


The third page of the New Table Wizard enables the selection of metadata for columns that are defined for existing table objects.


New Table Wizard (4)


The fourth page of the New Table Wizard enables
• editing of column metadata (perhaps to change a format)
• addition of new columns (perhaps calculated columns)
• reordering of the columns for the new table
• specification of indexes for the new table.


Refresh the Metadata


This demonstration clears existing metadata and then refreshes with metadata imported from a SAS
package.
1. If necessary, access SAS Data Integration Studio with Bruno's credentials.
a. Select Start  All Programs  SAS  SAS Data Integration Studio.
b. Select My Server as the connection profile.
c. Click OK to close the Connection Profile window and open the Log On window.
d. Enter Bruno in the User ID field and Student1 in the Password field.
e. Click OK to close the Log On window.
2. Click the Folders tab.
3. Expand Data Mart Development.

Remove Existing Metadata Objects


1. Select all the Orion folders under Data Mart Development.
a. Click the Orion Jobs folder.
b. Press the Shift key and click the Orion Target Data folder.

2. Select Edit  Delete.


A Confirm Delete window appears.

3. Click Yes.


4. A series of Delete Library windows might appear. If so, click Yes in each to confirm the deletion.
Possible Delete Library windows:

Import Needed Metadata Objects


1. Select the Data Mart Development folder.
2. Select File  Import  SAS Package.
3. Click Browse next to Enter the location of the input SAS package file.
a. If necessary, navigate to D:\Workshop\dift\solutions.
b. Click DIFT_EndCh2.spk.
4. Verify that All Objects is selected.


5. Click Next.
6. Verify that all four Orion folders are selected.
7. Click Next.
8. Verify that four different types of connections need to be established.
9. Click Next.
10. Verify that SASApp is listed for both the Original and Target fields.
11. Click Next.
12. Verify that both servers (ODBC Server and Oracle Server) have matching values for the Original
and Target fields.
13. Click Next.
14. Verify that both file paths have matching values for the Original and Target fields.
15. Click Next.
16. Verify that both directory paths have matching values for the Original and Target fields.
17. Click Next. The Summary pane surfaces.
18. Click Next.
19. Verify the import process completed successfully.
20. Click Finish.
21. Expand the Data Mart Development  Orion Source Data folder.
The folder and other metadata objects should resemble the following:


4.01 Activity
1. Access SAS Data Integration Studio using the My Server connection
profile with Bruno’s credentials (Bruno / Student1).
2. Delete the four Orion folders under the Data Mart Development folder.

3. Click the Data Mart Development folder, and then select File 
Import  SAS Package.

4. Select the SAS package D:\Workshop\dift\solutions\DIFT_EndCh2.spk.

5. Proceed through the Import Wizard, verifying the four types of connections, and then finish the wizard.



Defining the Product Dimension Table Metadata


This demonstration illustrates using the New Table Wizard to define metadata for a target table. The
target table is a SAS data set named DIFT Product Dimension and is stored in a location described
by the DIFT Orion Target Tables Library, a new library object. (The library object is created as well.)
1. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
a. Select Start  All Programs  SAS  SAS Data Integration Studio.
b. Select My Server as the connection profile.
c. Click OK to close the Connection Profile window and open the Log On window.
d. Enter Bruno in the User ID field and Student1 in the Password field.
e. Click OK to close the Log On window.
2. Click the Folders tab.
3. Expand Data Mart Development  Orion Target Data.
4. Verify that the Orion Target Data folder is selected.

5. Select File  New  Table. The New Table Wizard appears.


6. Enter DIFT Product Dimension in the Name field.
7. Verify that the location is set to /Data Mart Development/Orion Target Data.

8. Click Next.


9. Verify that the DBMS field is set to SAS.


10. Click New next to the Library field.

The New Library Wizard appears.


11. Specify metadata for the new library object.
a. Enter DIFT Orion Target Tables Library in the Name field.
b. Click Browse next to the Location field. The Select a Location window appears.
1) If necessary, click in the Look in field and select the Data Mart Development folder.

2) Double-click the Orion Target Data folder.


3) Click OK.
4) Verify that the location is set to /Data Mart Development/Orion Target Data.

c. Click Next.
d. Double-click SASApp in the Available servers list to move it to the Selected servers
list.
e. Click Next.
f. Specify the needed library properties.
1) Enter difttgt in the Libref field.
2) Click New in the Path Specification area.
a) In the New Path Specification window, click Browse next to Paths.


b) In the Browse window, navigate to D:\Workshop\dift.

c) Click the New folder tool to add a new folder.


d) Enter datamart as the name of the new folder and press Enter.
e) Click the datamart folder to select it.
f ) Click OK to close the Browse window.
The New Path Specification window displays the newly defined path.

g) Click OK to close the New Path Specification window.


3) Verify that the newly specified path is found in the Selected items list.

g. Click Next. Summary information appears.

h. Verify that the summary information is correct.


i. Click Finish to close the New Library Wizard and return to the New Table Wizard.
12. Verify that the new library DIFT Orion Target Tables Library is selected in the Library field.
13. Enter ProdDim in the Name field.

14. Click Next. The Select Columns window appears.


15. Verify that the Folders tab is selected.
16. Expand Data Mart Development  Orion Source Data.
17. From the Orion Source Data folder, expand the DIFT PRODUCT_LIST table object.
18. Double-click the following columns to move to the Selected list:
Product_ID
Product_Name
Supplier_ID


19. Expand the DIFT Supplier Information external file object.


20. Double-click the following columns to move to the Selected list:
Country
Supplier_Name

21. Click Next.


The Change Columns/Indexes window appears.

22. Update the column metadata.


a. Update the name for the Country column to Supplier_Country.
b. Verify that the length of Supplier_ID is 4.
23. Select the last column (Supplier_Name).


24. Add metadata for three new columns.


a. For each column, click New to define column properties.
b. Enter the following information for the new columns:

Column Name Description Length Type

Product_Category Product Category 25 Character

Product_Group Product Group 25 Character

Product_Line Product Line 25 Character

The final set of eight columns is shown below.

25. Define two simple indexes: one for Product_ID and one for Product_Group.
a. Click Define Indexes. The Define Indexes window appears.
b. Click New to add the first index.
c. Enter an index name of Product_ID and press Enter.
Note: Be sure to press Enter. If you do not, the name of the index is not saved.

d. Select the Product_ID column and move it to the Indexes pane by clicking .
e. Click New to add the second index.
f. Enter an index name of Product_Group and press Enter.


g. Select the Product_Group column and move it to the Indexes pane by clicking .
The two requested indexes are defined in the Define Indexes window.

Note: A simple index in a SAS table must have the same name as its column. A
warning dialog box is presented if an index name does not match its column
name. Clicking Yes in the dialog box enables SAS Data Integration Studio
to match the index name to its column name.
h. Click OK to close the Define Indexes window and return to the New Table Wizard.
26. Click Next.
27. Review the metadata listed in the summary window.

28. Click Finish.


29. Click the Folders tab.


30. Verify that the new table and new library objects appear in the Orion Target Data folder.
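Note: Registering the table and library records only metadata; the physical SAS table is created later, when a job that loads it runs. For orientation, here is a minimal SAS sketch of what the registered objects correspond to physically. The libref and path come from the steps above; the numeric length of Product_ID and the character lengths not set in this demonstration are assumptions, because those attributes are inherited from the source metadata.

   libname difttgt base "D:\Workshop\dift\datamart";

   data difttgt.ProdDim(index=(Product_ID Product_Group)); /* simple indexes named after their columns */
      length Product_ID 8          /* assumed numeric length */
             Supplier_ID 4         /* length verified in step 22 */
             Product_Category $25
             Product_Group    $25
             Product_Line     $25; /* lengths set in step 24 */
      /* Product_Name, Supplier_Country, and Supplier_Name would carry
         the lengths inherited from the source metadata (not shown) */
      stop;                        /* define the structure without writing any rows */
   run;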


Practice
1. Importing Metadata for DIFT Product Dimension Table
Access SAS Data Integration Studio using the My Server connection profile with Bruno's
credentials (Bruno/Student1).
• Right-click the Data Mart Development folder and import from the SAS package
D:\Workshop\dift\solutions\DIFT_ProdDimPlus.spk.
– Select all defaults
– When Warning appears:
▪ Click No.

▪ Click in the Target area to open the Browse window.


▪ Verify that the Look in field is set to D:\Workshop\dift.
▪ Click the New folder tool ( ) to create the datamart folder in D:\Workshop\dift.
▪ Select the new path D:\Workshop\dift\datamart.
– Pass through the final pages of the Import from SAS Package Wizard, accepting the
defaults.

Question: How many objects were imported in the Orion Target Data folder?
Answer: _____________________________________________________

Question: What engine is defined for the library object?


Answer: _____________________________________________________

• Edit the properties of the table object.


– Three columns (Product_Category, Product_Group, Product_Line) were defined with
the wrong length – update each to have a length of 25.
– Supplier_ID column length needs to be updated to 4.
– A second index needs to be defined – a simple index for the column Product_Group.

2. Defining the DIFT Order Fact Target Table


A new target table that contains order information must be defined in the metadata. Access SAS
Data Integration Studio using the My Server connection profile with Bruno’s credentials
(Bruno/Student1).
• Name the table object DIFT Order Fact.
• Store the table object in the Orion Target Data folder under Data Mart Development.
• Specify that the table should be created as a SAS table with the physical name of
OrderFact.
• Physically store the table in DIFT Orion Target Tables Library.
• Use the set of distinct columns from DIFT ORDER_ITEM and DIFT ORDERS.


• Update the column attributes as follows:

Name Description Informat Format

ORDER_ID Order ID (None) 12.

ORDER_ITEM_NUM Order Item Number (None) (None)

PRODUCT_ID Product ID (None) 12.

QUANTITY Quantity Ordered (None) (None)

TOTAL_RETAIL_PRICE Total Retail Price (None) dollar12.

COSTPRICE_PER_UNIT Cost Price Per Unit (None) dollar12.

DISCOUNT Discount % (None) percent.

ORDER_TYPE Order Type (None) order_type.

EMPLOYEE_ID Employee ID (None) 12.

CUSTOMER_ID Customer ID (None) 12.

OrderDate Date Order was Placed (None) date9.
Note: New name for ORDER_DATE

DeliveryDate Date Order was Delivered (None) date9.
Note: New name for DELIVERY_DATE

Note: Notice the new names for ORDER_DATE and DELIVERY_DATE as well as the updated formats.

3. Using the Inventory Tab

• Click the Inventory tab.


• Locate and expand the Library object type group.
• Expand the DIFT Orion Target Tables Library.

Do tables appear under this library?


4.2 Importing Metadata

Supported Metadata File Formats


SAS Data Integration Studio enables you to export and import metadata.
Two metadata formats are supported:
• SAS metadata in the SAS package format
• relational metadata from other vendors, either
  – in the Common Warehouse Metamodel (CWM) format, or
  – in a vendor-specific format


The SAS package format is a SAS internal format and supports most SAS platform metadata objects including objects relevant to SAS Data Integration Studio, such as jobs, libraries, tables, and external files.
The Common Warehouse Metamodel (CWM) is an industry standard format that is supported by
many software vendors. The CWM format supports relational metadata such as tables, columns,
indexes, and keys.
The SAS package format can be used to move metadata between SAS metadata repositories, move
metadata between environments such as from development-to-test and test-to-production, maintain
backups of metadata, and keep archived versions of metadata objects.
Relational metadata, including the CWM format, can be used to exchange relational metadata
between software applications and import models from third-party data modeling tools into SAS Data
Integration Studio.


Metadata Bridges

Vendor implementations of the CWM format differ. Each supported vendor-specific CWM format can be read with a SAS Metadata Bridge.


When licensing SAS Data Integration Studio, you get the choice of several SAS Metadata Bridges.


Importing Relational Metadata


This demonstration illustrates importing metadata that was exported from an Object Management
Group (OMG) modeling application in CWM format. Write-metadata permission for the target folder
is necessary. Bruno has the appropriate permissions to perform this task.
1. If necessary, access SAS Data Integration Studio with Bruno's credentials.
a. Select Start  All Programs  SAS  SAS Data Integration Studio.
b. Select My Server as the connection profile.
c. Click OK to close the Connection Profile window and open the Log On window.
d. Enter Bruno in the User ID field and Student1 in the Password field.
e. Click OK to close the Log On window.
2. Click the Folders tab.
3. Expand Data Mart Development  Orion Target Data.
4. Verify that the Orion Target Data folder is selected.
5. Select File  Import  Metadata.
The Metadata Import Wizard appears. The Select an import format window lists the available
metadata bridges.
6. Click OMG CWM 1.x XMI 1.x as the import format.

7. Click Next.
8. Specify the file to import.
a. Click Browse next to the Filename field to open the Select a file window.
b. Navigate to D:\Workshop\dift\data.
c. Select OMG_CWM_XMI.xml.
d. Click OK to close the Select a file window.


9. Verify that the folder location is set to /Data Mart Development/Orion Target Data.

10. Click Next.


11. Review (and accept) the default Meta Integration Options.

12. Click Next.


13. Verify that Import as new metadata is selected.

14. Click Next. The Metadata Location window appears.


15. Specify a library.


a. Click in the Library field to assign a new library location.
b. On the Folders tab, navigate to Data Mart Development  Orion Target Data.
c. Click DIFT Orion Target Tables Library.
d. Click OK to close the Select a library window.

16. Click Next.


17. Review and accept the final settings.


18. Click Finish. The metadata is imported to the SAS metadata environment.
An information window appears.

19. Click No.


The Folders tab displays two new metadata table objects in the Orion Target Data folder.

20. Update the properties for the CURRENT_STAFF table object.


a. Right-click the CURRENT_STAFF metadata table object and select Properties.
b. Enter DIFT Current Staff in the Name field.

c. Click the Columns tab to view the imported column properties.


d. Update the formats for each of the columns.

Name Format

EMPLOYEE_ID 12.

START_DATE date9.

END_DATE date9.

JOB_TITLE (None)

SALARY dollar12.

GENDER $gender.

BIRTH_DATE date9.

EMP_HIRE_DATE date9.

EMP_TERM_DATE date9.

MANAGER_ID 12.

The Columns tab should resemble the following:

e. Select the column of formats.


1) Click the first format cell.
2) Hold down the Shift key and click the last format cell. The column of formats is now
selected.
3) Hold down the Ctrl key and press C to copy the selection to the clipboard.
f. Click OK to close the Properties window.


21. Update properties for the TERM_STAFF table object.


a. Right-click the TERM_STAFF metadata table object and select Properties.
b. Enter DIFT Terminated Staff in the Name field.

c. Click the Columns tab to view the imported column properties.


d. Update the formats.
1) Click the first format cell.
2) Hold down the Ctrl key and press V to paste the column of formats that was copied
in a previous step.

e. Click OK to close the Properties window.


The metadata objects in the Orion Target Data folder should now resemble the following:
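Note: Changing formats on the Columns tab updates only the table metadata. As a point of reference, the same format changes could be applied to a physical SAS table with PROC DATASETS. This is a minimal sketch, assuming that the difttgt libref is assigned and that the user-defined $GENDER. format is available to the session:

   proc datasets lib=difttgt nolist;
      modify current_staff;
         format employee_id manager_id 12.
                start_date end_date birth_date
                emp_hire_date emp_term_date date9.
                salary dollar12.
                gender $gender.;
   quit;

The TERM_STAFF table could be modified the same way, which parallels the copy-and-paste of the format column performed in the steps above.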


4.3 Solutions
Solutions to Practices
1. Importing Metadata for DIFT Product Dimension Table
a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
1) Select Start  All Programs  SAS  SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK to close the Connection Profile window and open the Log On window.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.
b. Click the Folders tab.
c. Click the Data Mart Development folder.
d. Select File  Import  SAS Package.
1) Click Browse.
2) If necessary, navigate to D:\Workshop\dift\solutions.
3) Click the file DIFT_ProdDimPlus.spk.
4) Click OK.
5) Verify that All Objects is selected.
6) Click Next.
7) Verify that two objects under the Orion Target Data folder are to be imported.

8) Click Next.
9) Verify that two connection points must be established.
10) Click Next.
11) Verify that SASApp is listed for both the Original and Target fields.
12) Click Next.
13) Verify that D:\Workshop\dift\datamart is listed for both the Original and Target fields.
14) Click Next.


A Warning window appears, stating that the physical location specified does not exist.

15) Click No.

a) Click in the Target area to open the Browse window.


b) Verify the Look in field is set to D:\Workshop\dift.
c) Click the New folder tool ( ) to create the datamart folder in D:\Workshop\dift.
d) Select the new path D:\Workshop\dift\datamart.
16) Click Next.
17) Verify that the summary information is correct.
18) Click Next.
19) Click Finish.

Question: How many objects were imported in the Orion Target Data folder?
Answer: Two, a library and a table object.

Question: What engine is defined for the library object?


Answer: BASE (a SAS engine)

e. Click the Data Mart Development folder.


f. Select View  Refresh.
g. Expand the Orion Target Data folder.


h. Right-click the table object (DIFT Product Information) and select Properties.
1) Click the Columns tab.
2) Locate the Length value for the Product_Category column and update it to 25.
3) Locate the Length value for the Product_Group column and update it to 25.
4) Locate the Length value for the Product_Line column and update it to 25.
5) Locate the Length value for the Supplier_ID column and update it to 4.
6) Click the Indexes tab.
7) Under Indexes, click New.
8) Type Product_Group as the new index name and press Enter.
9) Select Product_Group from the Columns list and move to the new index.
10) Click OK to close the Properties window.

2. Defining the DIFT Order Fact Target Table


a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
1) Select Start  All Programs  SAS  SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK to close the Connection Profile window and open the Log On window.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.
b. Click the Folders tab.
c. Expand Data Mart Development  Orion Target Data.
d. Verify that the Orion Target Data folder is selected.
e. Select File  New  Table. The New Table Wizard appears.
1) Enter DIFT Order Fact in the Name field.
2) Verify that the location is set to /Data Mart Development/Orion Target Data.
3) Click Next.
4) Verify that the DBMS field is set to SAS.
5) Select DIFT Orion Target Tables Library for the Library field.
6) Enter OrderFact in the Name field.
7) Click Next.
f. Expand the Data Mart Development  Orion Source Data folder on the Folders tab.
1) Select the DIFT ORDER_ITEM table object.

2) Click . (All columns from the selected table are moved to the Selected pane.)
3) Select the DIFT ORDERS table object.

4) Click to move all columns from the selected table to the Selected pane.
An Error window appears and indicates that Order_ID cannot be added twice.


5) Click OK.
6) Click Next.
g. Update the column attributes as follows:

Column Name Description Informat Format

ORDER_ID Order ID (None) 12.

ORDER_ITEM_NUM Order Item Number (None) (None)

PRODUCT_ID Product ID (None) 12.

QUANTITY Quantity Ordered (None) (None)

TOTAL_RETAIL_PRICE Total Retail Price (None) dollar12.

COSTPRICE_PER_UNIT Cost Price Per Unit (None) dollar12.

DISCOUNT Discount % (None) percent.

ORDER_TYPE Order Type (None) order_type.

EMPLOYEE_ID Employee ID (None) 12.

CUSTOMER_ID Customer ID (None) 12.

OrderDate Date Order was Placed (None) date9.


Note: New name for ORDER_DATE

DeliveryDate Date Order was Delivered (None) date9.


Note: New name for DELIVERY_DATE

Note: Notice the new names for ORDER_DATE and DELIVERY_DATE as well as updated
formats.
h. Click Next.
i. Review the metadata listed in the summary window.

j. Click Finish.
The new table object appears in the Orion Target Data folder.


3. Using the Inventory Tab


a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
b. Click the Inventory tab.
c. Locate and expand the object type of Library.
d. Locate and expand the library object named DIFT Oracle Library.
e. Verify that two (2) tables have been registered from the DIFT Oracle Library.

Lesson 5 SAS® Data Integration Studio: Working with Jobs
5.1 Creating Metadata for Jobs .......................................................................................... 5-3
Demonstration: Refresh the Metadata ..................................................................... 5-13
Demonstration: Populating the Current and Terminated Staff Tables ............................. 5-16
Practices ............................................................................................................. 5-28

5.2 Working with the Join Transformation ....................................................................... 5-31


Demonstration: Populating the Product Dimension Table ............................................ 5-41
Practices ............................................................................................................. 5-62

5.3 Solutions ................................................................................................................... 5-64


Solutions to Practices ............................................................................................ 5-64


5.1 Creating Metadata for Jobs

What We Have Done

At this point, metadata is defined for the following: various types of source data and desired target tables.


We currently have metadata defined for several types of source data - SAS tables, Oracle tables (our DBMS example), Microsoft Access tables (our ODBC example), and external files. In addition, we have metadata defined for several target tables.


What We Need to Do
The next step is to define processes (jobs) that
• read from sources
• perform necessary data transformations
• load targets.


We now need to define metadata for a new type of metadata object – a job. Jobs will allow us to read from our source data, process that data with transformations, and then load our target. Initially our "targets" will be tables, but it is possible to create output or target information in the form of a report.


What Is a Job?
Job: a metadata object that organizes sources, targets, and transformations into processes that create output.


As specified above, a job is simply another metadata object. A job object organizes source data
metadata objects with various transformations to generate a result. The result is often a target table
but could be a report.
In the top left screen, we see a process flow diagram (on the Diagram tab) that contains two source
objects (both are table metadata objects that are pointing to SAS tables), two transformations (Join
and Table Loader), and a single target object (a SAS table metadata object). SAS Data Integration
Studio uses the job metadata to generate SAS code (shown in the lower screen shot) that executes
the job process.
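To make the generated code less abstract, the sketch below shows the general shape of the PROC SQL and load code that a join-and-load job like the one pictured produces. The table and column names are hypothetical, and real generated code also includes macro variables, status handling, and comments:

   proc sql;
      /* a Join transformation combines sources into a work table */
      create table work.joined as
      select o.Order_ID, o.Customer_ID, c.Customer_Name
      from srclib.orders as o inner join srclib.customers as c
         on o.Customer_ID = c.Customer_ID;
   quit;

   /* a simple load step standing in for the Table Loader transformation */
   data tgtlib.order_info;
      set work.joined;
   run;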


Example Process Flow


The direction of the arrows in the process flow indicates the order of the transformations. The above process flow diagram shows a job that reads from a source table called STAFF and processes the Extract transformation to produce a work (temporary) table, which in turn is processed by the Sort transformation to create a registered (permanent) table called US Staff Sorted. In any process flow diagram, table and external file objects have tan-colored nodes, and transformations are the blue nodes. Also, table objects are decorated with a symbol in the upper right corner that visually identifies the type of table being read or being created.


Example Process Flow

Read this in top-to-bottom order, corresponding to the direction of the arrows.


Example Process Flow

The annotated flow, from top to bottom: a table object (its symbol indicates a SAS table), a transformation, a temporary (work) table, a second transformation, and a final table object (its symbol again indicates a SAS table).


New Jobs


New jobs are initialized by the New Job window. In this window, you specify the following job properties:
• name
• description
• metadata location
A new job can be initiated from the Folders tree by clicking the desired location folder and then
selecting File  New  Job. The New Job window allows specification of the metadata name for
the job and a description. If the New Job window was launched from an undesired location, you have
the opportunity to change the folder destination in the Location field. Selecting OK launches the job
editor with an "empty" job, that is, initially there are no sources, transformations, or targets on the
Diagram tab.
If you have an existing job object that you want to edit, you can simply right-click the job and select
Open (you can also double-click the job object – the double-click action is to Open).


Job Editor


The main area is a tabbed interface with up to four tabs:

Tab Purpose

Diagram Used to build and update the process flow for a job

Code Used to review or update code for a job

Log Used to review the log for a submitted job

Output Used to review the output of a submitted job

The Details pane has tabs to monitor job execution and to aid with debugging job errors.
The Details pane can have the f ollowing tabs:

Tab Purpose

Status Displays the completion status of each step.

Warnings and Errors Displays warnings and errors.

Statistics Displays run-time and table statistics.

Control Flow Is used to review and update the execution sequence of the steps in a
job.

Columns Is used to review and update column properties in a table and is available only when a table is selected.

Mappings Is used to review and update column mappings across a transformation and is available only when a transformation is selected.


Introduction to Transformations

A transformation allows you to specify how to extract data, transform data, or load data into data stores.


Each process in a process flow diagram is specified by a metadata object called a transformation. A transformation allows you to specify how to extract data, transform data, or load data into data stores.

Introduction to Transformations
Each transformation that you
specify in a process flow
diagram generates or
retrieves SAS code. A
transformation's generated
code can be augmented or
replaced.


Each transformation that you specify in a process flow diagram generates or retrieves SAS code. A transformation's generated code can be augmented or even replaced. You can also specify user-written code for any transformation in a process flow diagram.


Transformations Tree
The Transformations tree organizes available transformations into
categories.

A transformation can be added to a job via the drag-and-drop method.

When added to the job flow diagram, a transformation is then connected to sources and targets, and its default metadata is updated.


The availability of transformations for many common processing tasks enables rapid development of process flows for common scenarios. The above display shows the standard Transformations tree.

Demonstration: Creating a Job


This simple job
• reads data from a source table with staff information
• uses the Splitter transformation to divide the staff data into current staff
and terminated staff
• loads the resulting data into two target tables.



Splitter Transformation
The Splitter transformation
• is found in the Data category of transformations
• can be used to create one or more subsets of
a source.
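Conceptually, the split performed in the upcoming demonstration is equivalent to the following minimal DATA step sketch. The target table names match the log notes shown later in the demonstration; the diftsrc libref for the source table is an assumption:

   data difttgt.current_staff difttgt.term_staff;
      set diftsrc.staff;
      if Emp_Term_Date = . then output difttgt.current_staff; /* no termination date: current staff */
      else output difttgt.term_staff;                         /* termination date present: terminated staff */
   run;

The code that the Splitter transformation generates differs in detail, but it likewise reads the source once and writes each row to every output whose row selection condition it satisfies.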



Refresh the Metadata


This demonstration clears existing metadata and then refreshes with metadata imported from a SAS
package.
1. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
a. Select Start  All Programs  SAS  SAS Data Integration Studio.
b. Select My Server as the connection profile.
c. Click OK to close the Connection Profile window and open the Log On window.
d. Enter Bruno in the User ID field and Student1 in the Password field.
e. Click OK to close the Log On window.
2. Click the Folders tab.
3. Expand Data Mart Development.

Remove Existing Metadata Objects


1. Select all the Orion folders under Data Mart Development.
a. Click the Orion Jobs folder.

b. Press the Shift key and click the Orion Target Data folder.
2. Select Edit  Delete.
A Confirm Delete window appears.

3. Click Yes.
4. A series of Delete Library windows might appear. If so, click Yes in each window to confirm
deletion.


Import Needed Metadata Objects


1. Select the Data Mart Development folder.
2. Select File  Import  SAS Package.
3. Click Browse next to Enter the location of the input SAS package file.
a. If necessary, navigate to D:\Workshop\dift\solutions.
b. Click DIFT_EndCh3.spk.
4. Verify that All Objects is selected.
5. Click Next.
6. Verify that all four Orion folders are selected.
7. Click Next.
8. Verify that four different types of connections need to be established.
9. Click Next.
10. Verify that SASApp is listed for both the Original and Target fields.
11. Click Next.
12. Verify that both servers (ODBC Server and Oracle Server) have matching values for the Original and Target fields.
13. Click Next.
14. Verify that both file paths have matching values for the Original and Target fields.
15. Click Next.
16. Verify that all directory paths have matching values for the Original and Target fields.
17. Click Next.
The Summary pane surfaces.
18. Click Next.
19. Verify the import process completed successfully.
20. Click Finish.
21. Expand the Data Mart Development  Orion Source Data folder.
22. Expand the Data Mart Development  Orion Target Data folder.


The metadata under the Data Mart Development folder should resemble the following:

Note: In the Orion Target Data folder, there is a table object named DIFT US Suppliers. This
was not discussed or defined previously. It is a table object defining a SAS table to be
created in the DIFT Orion Target Tables Library. The columns for the table were
created from the columns in the DIFT Supplier Information external file object. Two new
tables named DIFT Old Orders and DIFT Recent Orders will be used in an upcoming
practice.


Populating the Current and Terminated Staff Tables


This demonstration shows the building of a job that uses the Splitter transformation.
The final process flow diagram should resemble the following:

1. If necessary, access SAS Data Integration Studio with Bruno’s credentials.


a. Select Start  All Programs  SAS  SAS Data Integration Studio.
b. Select My Server as the connection profile.
c. Click OK to close the Connection Profile window and access the Log On window.
d. Enter Bruno in the User ID field and Student1 in the Password field.
e. Click OK to close the Log On window.
2. Click the Folders tab.
3. Expand Data Mart Development  Orion Jobs.
4. Verify that the Orion Jobs folder is selected.
5. Select File  New  Job. The New Job window appears.
6. Enter DIFT Populate Current and Terminated Staff Tables in the Name field.
7. Verify that the Location is set to /Data Mart Development/Orion Jobs.

8. Click OK.


The Job Editor window appears.

9. Add the source data object to the process flow.


a. If necessary, click the Folders tab.
b. Expand Data Mart Development  Orion Source Data.
c. Drag the DIFT STAFF table object to the Diagram tab of the Job Editor.

Note: When a job window is active, objects can also be added to the diagram by right-
clicking the object and selecting Add to Diagram.
Note: The above screen capture does not show the Details pane.


10. Select File  Save to save the diagram and job metadata to this point.
11. Add the Splitter transformation to the diagram.
a. In the tree view, click the Transformations tab.
b. Expand the Data grouping.
c. Click the Splitter transformation.

d. Drag the Splitter transformation to the Diagram tab of the Job Editor.
e. Position the Splitter transformation next to the source table object.

Note: The Splitter transformation, by default, produces two work tables. (More can be
produced by specifying the properties of the Splitter transformation.) Notice that the
two work table objects are represented by the green icons located to the right of the
Splitter transformation.
12. Select File  Save to save the diagram and job metadata to this point.


13. Connect the source table object to the Splitter transformation.


a. Place the cursor over the DIFT STAFF table object. A connection selector appears.

b. Place the cursor over the connection selector. The cursor changes to a pencil.

c. With the cursor over the connection selector (and the pencil cursor visible), click the
connection selector and drag it to the Splitter transformation. Release the cursor when it is
over the Splitter transformation.

14. Select File  Save to save the diagram and job metadata to this point.
15. Add the target table objects to the diagram.
a. Click the Folders tab.
b. If necessary, expand the Data Mart Development  Orion Target Data folder.
c. Hold down the Ctrl key and click the two target table objects (DIFT Current Staff
and DIFT Terminated Staff).


d. Drag the two objects to the Diagram tab of the Job Editor.

e. Arrange the table objects so that they are separated.

16. Select File  Save to save the diagram and job metadata to this point.
17. Connect the Splitter transformation to the target table objects.
The two target tables are loaded with direct one-to-one column mappings of subset data and
no additional load specifications. Therefore, no Table Loader transformation is needed for either
of the target tables. Hence, the two work table objects must be deleted in order to connect the
transformation directly to the target table objects.
a. Right-click one of the work table objects of the Splitter transformation and select Delete.


b. Right-click the second work table object of the Splitter transformation and select Delete.
All work tables are now removed from the Splitter transformation.

c. Place the cursor over the Splitter transformation to reveal the connection selector
until the cursor changes to a pencil.

d. When the pencil cursor appears, click and drag to the first output table, DIFT Current Staff.

e. Place the cursor over the Splitter transformation to reveal the connection selector.
f. Click the connection selector and drag to the second output table, DIFT Terminated Staff.


The resulting process flow diagram should resemble the following:

18. Select File  Save to save the diagram and job metadata to this point.
19. Specify the properties of the Splitter transformation.
a. Right-click the Splitter transformation and select Properties.

b. Click the Row Selection tab.


c. Specify the subsetting criteria for the DIFT Current Staff table object.
1) Verify that the DIFT Current Staff table object is selected in the Target Tables pane.
2) Select Row Selection Conditions in the Row Selection Type field.
3) Click Subset Data below the Selection Conditions area.

The Expression Builder window appears.


4) Click the Data Sources tab.


5) Expand the STAFF table.
6) Select the Emp_Term_Date column.
7) Click Add to Expression.

8) Click the equal operator (=) in the operators area.

9) Enter . (a period for a missing numeric value).


The Expression Text area should now resemble the following:

10) Click OK to close the Expression Builder window.


The Row Selection tab is updated to the following:

d. Specify the subsetting criteria for the DIFT Terminated Staff table object.
1) Verify that the DIFT Terminated Staff table object is selected in the Target Tables pane.
2) Select Row Selection Conditions in the Row Selection Type field.
3) Click Subset Data below the Selection Conditions area. The Expression Builder
window appears.
4) Click the Data Sources tab.
5) Expand the STAFF table.
6) Select the Emp_Term_Date column.
7) Click Add to Expression.

8) Click the not-equal operator (^=) in the operators area.

9) Enter . (a period for a missing numeric value).


10) Click OK to close the Expression Builder window.
The Row Selection tab is updated to the following:


e. Click the Mappings tab.


f. Verify that all target table columns have a mapping (incoming arrow).

A column mapping indicates that data passes from the source column to the target column.
g. Click OK to close the Splitter Properties window.
20. Select File  Save to save the diagram and job metadata to this point.
21. Run the job.
a. Click Run on the job toolbar.

Note: A job can also be processed by selecting Actions  Run or by right-clicking in the
job background and selecting Run from the pop-up menu.


b. Click the Status tab on the Details pane.


c. Verify that the status for the pre-processing code, the transformation, the post-processing
code, and the overall job is Completed successfully.

22. Close the Details pane.


23. Click the Log tab to view the log for the executed job.
24. Scroll to view the note about the creation of the DIFTTGT.CURRENT_STAFF table.

25. Scroll to view the note about the creation of the DIFTTGT.TERM_STAFF table.

26. View the data for the DIFT Current Staff table object.
a. Click the Diagram tab in the Job Editor.
b. Right-click the DIFT Current Staff table object and select Open.
c. Scroll right to the EMP_TERM_DATE column. All EMP_TERM_DATE values are missing.

d. After you view the data, select File  Close to close the View Data window.


27. View the data for the DIFT Terminated Staff table object.
a. Right-click the DIFT Terminated Staff table object and select Open.
b. Scroll right to the EMP_TERM_DATE column. All EMP_TERM_DATE values are
nonmissing.

c. After you view the data, select File  Close to close the View Data window.
28. Select File  Close to close the Job Editor. If necessary, save changes to the job. The new job
object appears on the Folders tab.


Practices
1. Refreshing Course Metadata to Current Point

• Access SAS Data Integration Studio using the My Server connection profile using Bruno's
credentials (Bruno/Student1).
• Delete all four Orion folders under the Data Mart Development folder.
– Select all four Orion folders and then select Edit  Delete.

– Accept defaults for deletion (click Yes for each Delete Library window that appears – this
will delete the associated table objects for each library).
• From the Data Mart Development folder, import fresh metadata from DIFT_Ch4Ex1.spk
(SAS package located in D:\Workshop\dift\solutions).
– Choose All Objects.
– Verify that only Orion folders are selected (if necessary, clear selection for DIFT Demo).
– Verify that the Original and Target fields match for four different connection point
panels.
Question: How many library objects were imported?
Answer: _________________________________________________________

Question: What is the name of the job object in the Orion Jobs folder?
Answer: _________________________________________________________

Question: How many table objects were imported in the Orion Target Data folder?
Answer: _________________________________________________________

2. Checking the Starter Job and Finalizing


• Access SAS Data Integration Studio using the My Server connection profile with Bruno’s
credentials (Bruno/Student1).
• Open the DIFT Populate Current and Terminated Staff Tables job for editing.
– Locate / select the DIFT Populate Current and Terminated Staff Tables job in the
Orion Jobs folder.
– Double-click the job DIFT Populate Current and Terminated Staff Tables to open the
job editor.


• Define the connection between the source table and the transformation using the
Connections window.
– Right-click the DIFT STAFF table object in the job flow diagram and select Connections.

– Click under Output Ports. The Output Port – Data window appears.
– Click the Splitter transformation under Output Node.
– Click OK.
– Select File  Save.
• Add target tables using the Replace functionality.
– Right-click the top output table on the Splitter transformation and select Replace.

o In the Table Selector window (on the Folders tab), expand Data Mart Development
 Orion Target Data.
o Click the DIFT Current Staff table.
o Click OK.
– Right-click the remaining output table on the Splitter transformation and select Replace.
o In the Table Selector window (on the Folders tab), expand Data Mart Development
 Orion Target Data.
o Click the DIFT Terminated Staff table.
o Click OK.


• Click on lower right of Diagram tab.


• Select File  Save.
• Update the properties of the Splitter transformation.
– Right-click the Splitter transformation and select Properties.
– Click the Mappings tab.

– Click the Map all columns tool .


– Click the Row Selection tab.
– Specify the subsetting criteria for the DIFT Current Staff table object:
Emp_Term_Date = .
– Specify the subsetting criteria for the DIFT Terminated Staff table object:
Emp_Term_Date ^= .
– Click OK to close the Splitter Properties.
• Select File  Save to save the job.
• Run the job by selecting Actions  Run.

Question: Did the job execute successfully?


Answer: ____________________________________________________________

Question: How many records were populated in the CURRENT_STAFF table?


Answer: ____________________________________________________________

Question: How many records were populated in the TERM_STAFF table?


Answer: ____________________________________________________________

• Close the edited job.

Question: On the Inventory tab, under what grouping do you find the job?
Answer: __________________________________________________________


5.2 Working with the Join Transformation

Join Transformation
The Join transformation
• is found in the SQL category of transformations
• generates PROC SQL code
• features a unique graphical interface.


Using the Join Transformation


The Join transformation
• is used to create an SQL query that runs in the context of a SAS Data Integration Studio job
• supports all the join types such as inner join, left join, right join, and full join
• supports subqueries and pass-through SQL
• features a graphical interface for building and configuring the components of the SQL query.
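Each join type that the transformation supports corresponds directly to a PROC SQL join. The following brief sketch uses hypothetical WORK tables; the Join transformation generates equivalent, though more elaborate, code:

   proc sql;
      /* inner join: keep only rows with a match in both tables */
      create table work.matched as
      select p.Product_ID, p.Product_Name, s.Supplier_Name
      from work.products as p inner join work.suppliers as s
         on p.Supplier_ID = s.Supplier_ID;

      /* left join: keep every product, with missing supplier
         columns where no match exists */
      create table work.all_products as
      select p.Product_ID, p.Product_Name, s.Supplier_Name
      from work.products as p left join work.suppliers as s
         on p.Supplier_ID = s.Supplier_ID;
   quit;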



Open the Join Transformation


The process of building the SQL query is performed in the Designer window.

To access this window, right-click the Join transformation in a job and select
Open.


Note: Double-clicking the Join transformation in a job also opens the Designer window.

Join’s Designer Window


Designer window bar

Navigate pane

SQL Clauses pane

Properties pane


The Join Designer's initial view includes the following:
• Main display area with Diagram tab, Code tab, and Log tab
• Navigate pane - navigate the current query
• SQL Clauses pane - add SQL clauses or edit the join type
• Properties pane - view and edit properties of a selected item


Navigate Pane: Join


The Diagram tab appears in the main area of the Designer window when Join is selected in the Navigate pane. The Diagram tab enables you to design the needed clauses for your SQL query with a drag-and-drop interface similar to that of a job flow.


Navigate Pane: Table Selection


The Tables pane appears when a table object is selected in the Navigate pane or when Select is selected in the Navigate pane. The Tables pane might also open when other aspects of particular joins are requested (for example, the surfacing of Having, Group by, and Order by information).


Navigate Pane: Select


The Select tab appears in the main area of the Designer window when Select is selected in the Navigate pane. The Select tab enables you to maintain the mappings from the sources to the target. The Select tab can also be used to specify calculated columns for the target table.

Navigate Pane: Where


The Where tab appears in the main area of the Designer window when Where is selected in the
Navigate pane (if a WHERE clause is specified as part of the SQL query). The Where tab enables
you to specify the needed subsetting or join criteria for the SQL query.


Job for the Product Dimension Table


The DIFT Product Dimension target table is to be loaded from two
registered sources: DIFT PRODUCT_LIST and DIFT Supplier Information.

• The File Reader transformation is needed to read the external file.


• The Join transformation is used to combine the data sources and calculate
values for three derived columns.


Calculated Columns
Three columns for the Product Dimension target table must be calculated.
The three columns (Product_Line, Product_Category, Product_Group) are
encoded in the Product_ID.
Product_ID has 12 digits.
• First two digits of Product_ID define the Product_Line.
• First four digits of Product_ID define the Product_Category.
• First seven digits of Product_ID define the Product_Group.

   Column             Value
   Product_ID         210100100001
   Product_Line       210000000000
   Product_Category   210100000000
   Product_Group      210100100000


Product_Group, Product_Category, and Product_Line are encoded in Product_ID.


Expressions for the Calculated Columns


The calculations for the three columns can be done as follows:

Product_Line = int(Product_ID/10000000000)*10000000000
Product_Category = int(Product_ID/100000000)*100000000
Product_Group = int(Product_ID/100000)*100000

Or more simply:
Product_Line = int(Product_ID/1e10)*1e10
Product_Category = int(Product_ID/1e8)*1e8
Product_Group = int(Product_ID/1e5)*1e5


Replacing the last five digits in Product_ID with zeros returns the value for Product_Group.
Replacing the last eight digits in Product_ID with zeros returns the value for Product_Category.
Replacing the last 10 digits in Product_ID with zeros returns the value for Product_Line.
Division and the INT function truncate the last five, eight, or 10 digits from Product_ID. Then
multiplication adds five, eight, or 10 zeros back to the truncated value.
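As a quick sanity check of the arithmetic, the following DATA step sketch applies the three expressions to the example Product_ID from the table above and writes the results to the log:

   data _null_;
      Product_ID = 210100100001;
      Product_Group    = int(Product_ID/1e5)*1e5;    /* 210100100000 */
      Product_Category = int(Product_ID/1e8)*1e8;    /* 210100000000 */
      Product_Line     = int(Product_ID/1e10)*1e10;  /* 210000000000 */
      put Product_Group= Product_Category= Product_Line=;
   run;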


Using the PRODUCT. Format


A user-defined format, PRODUCT., returns a nice description for the three
calculated columns.
A snippet from PRODUCT. format:

   Column             Value          Applying PRODUCT. Format
   Product_ID         210100100001
   Product_Line       210000000000   Children
   Product_Category   210100000000   Children Outdoors
   Product_Group      210100100000   Outdoor things, Kids
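For reference, a user-defined format such as PRODUCT. could be created with PROC FORMAT along the following lines. This sketch includes only the three entries shown above; the actual course format contains an entry for every product line, category, and group:

   proc format;
      value product
         210000000000 = 'Children'
         210100000000 = 'Children Outdoors'
         210100100000 = 'Outdoor things, Kids'
      ;
   run;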


Final Expressions for the Calculated Columns


The calculations for the three columns can be updated as follows:

Product_Line = put(int(Product_ID/10000000000)*10000000000, product.)


Product_Category = put(int(Product_ID/100000000)*100000000, product.)
Product_Group = put(int(Product_ID/100000)*100000, product.)

Or more simply:
Product_Line = put(int(Product_ID/1e10)*1e10, product.)
Product_Category = put(int(Product_ID/1e8)*1e8, product.)
Product_Group = put(int(Product_ID/1e5)*1e5, product.)


Replacing the last five digits in Product_ID with zeros returns Product_Group from the format.
Replacing the last eight digits in Product_ID with zeros returns Product_Category from the format.
Replacing the last 10 digits in Product_ID with zeros returns Product_Line from the format.
Division and the INT function truncate the last five, eight, or 10 digits from Product_ID. Then
multiplication adds five, eight, or 10 zeros back to the truncated value. Finally, the PUT function
applies the PRODUCT. format (user-defined) to return the description of the Product_Group,
Product_Category, or Product_Line.
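Putting the pieces together, the generated query includes these expressions in its SELECT clause, roughly as follows. This is a sketch only: it assumes a table work.product_list and the PRODUCT. format already exist, and it omits the other mapped columns and the join itself.

   proc sql;
      create table work.proddim as
      select pl.Product_ID,
             put(int(pl.Product_ID/1e10)*1e10, product.) as Product_Line,
             put(int(pl.Product_ID/1e5)*1e5, product.)   as Product_Group,
             put(int(pl.Product_ID/1e8)*1e8, product.)   as Product_Category
         from work.product_list as pl;
   quit;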


Purpose of the Calculated Columns


The calculated columns (Product_Line, Product_Category, and
Product_Group) provide the business analysts a navigation path among
wider and narrower views of the data along the product hierarchy.

Product_Line (widest view)

Product_Category (finer view)

Product_Group (even finer view)

Product_ID (finest view)



Populating the Product Dimension Table


This demonstration illustrates creating the job that loads the DIFT Product Dimension target table.
The job uses the Join transformation to join the DIFT Product_List and DIFT Supplier Information
sources. In addition, three calculated columns are defined in the Join.
The final process flow diagram resembles the following:

1. If necessary, access SAS Data Integration Studio with Bruno’s credentials.


a. Select Start  All Programs  SAS  SAS Data Integration Studio.
b. Select My Server as the connection profile.
c. Click OK to close the Connection Profile window and access the Log On window.
d. Enter Bruno in the User ID field and Student1 in the Password field.
e. Click OK to close the Log On window.
2. Click the Folders tab.
3. Expand Data Mart Development  Orion Jobs.
4. Verify that the Orion Jobs folder is selected.
5. Select File  New  Job. The New Job window appears.
a. Enter DIFT Populate Product Dimension Table in the Name field.
b. Verify that the location is set to /Data Mart Development/Orion Jobs.
c. Click OK. The Job Editor window appears.


6. Add the source data objects to the process flow.


a. If necessary, from the Folders tab, expand Data Mart Development  Orion Source Data.
b. Click the DIFT PRODUCT_LIST table object. Hold down the Ctrl key and also select
the DIFT Supplier Information external file object.
c. Drag the objects to the Diagram tab of the Job Editor.

d. Arrange the source data objects so that they are separated.

7. Select File  Save to save the diagram and job metadata to this point.


8. Add the File Reader transformation to the diagram.


a. In the tree view, click the Transformations tab.
b. Expand the Access grouping.
c. Click the File Reader transformation.


d. Drag the File Reader transformation to the Diagram tab of the Job Editor.
e. Position the File Reader transformation so that it is next to (to the right of) the external file
object, DIFT Supplier Information.

f. Connect DIFT Supplier Information to the File Reader transformation.


9. Rename the work table object associated with the File Reader transformation.
a. Right-click the (green) work table object and select Properties.

b. Click the Physical Storage tab.


c. Enter FileReader in the Physical name field.

Replacing the name with FileReader makes this table easier to recognize when you
configure the next transformation in the process flow.
Note: New Physical Name replaces the current physical name with a new random name.


d. Click OK to close the File Reader Properties window.


10. Select File  Save to save the diagram and job metadata to this point.
11. Add the Join transformation to the diagram.
a. In the tree view, click the Transformations tab.
b. Expand the SQL grouping.
c. Click the Join transformation.


d. Drag the Join transformation to the Diagram tab of the Job Editor.
e. Position the Join transformation so that it is to the right of (and in between) the
DIFT PRODUCT_LIST table object and the File Reader transformation.

12. Select File  Save to save the diagram and job metadata to this point.


13. Add inputs to the Join transformation.


a. Place the cursor over the Join transformation in the diagram to reveal the two input ports.

b. Connect the DIFT PRODUCT_LIST table object to one of the input ports of the Join
transformation.

c. Connect the File Reader transformation's output to the second input port of the Join
transformation. (Click the work table icon associated with the File Reader
transformation and drag it to the second input port of the Join transformation.)

The diagram is updated to the following:

14. Select File  Save to save the diagram and job metadata to this point.


15. Add the DIFT Product Dimension table object as the output of the Join transformation.
a. Right-click the work table of the Join transformation and select Replace.

The Table Selector window appears.


b. Verify that the Folders tab is selected.
c. Expand the Data Mart Development  Orion Target Data folders.
d. Click the DIFT Product Dimension table object.

e. Click OK.


The process flow diagram is updated to the following:

16. Select File  Save to save the diagram and job metadata to this point.
17. Review the properties of the File Reader transformation.
a. Right-click the File Reader transformation and select Properties.
b. Click the Mappings tab.
c. Verify that all target columns have a column mapping.

d. Click the Code tab.


e. Verify that the primary generated code is a simple DATA step with INFILE, ATTRIB, and
INPUT statements.
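The generated step is roughly of the following form. This is a sketch only: the actual file path, delimiter options, and column attributes come from the external file's registered metadata, and the names shown here are assumptions.

   data work.filereader;
      infile 'D:\Workshop\dift\data\supplier_information.csv' dlm=',' dsd firstobs=2;
      attrib Supplier_ID   length=8    label='Supplier ID';
      attrib Supplier_Name length=$30  label='Supplier Name';
      attrib Country       length=$2   label='Supplier Country';
      input Supplier_ID Supplier_Name $ Country $;
   run;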
f. Click OK to close the File Reader Properties window.
18. Select File  Save to save the diagram and job metadata to this point.
19. Specify the properties of the Join transformation.
a. Right-click the Join transformation and select Open. The Designer window appears.

Recall that the Designer window’s initial view includes the following:
• The Diagram tab in the main pane displays the current query in the form of a process flow
diagram. The name of this tab changes to match the object selected in the Navigate pane.
• The Navigate pane is used to access the components of the current query.
• The SQL Clauses pane is used to add SQL clauses or additional joins to the query.
• The Properties pane is used to display and update the properties of a selected item.


b. Click the Join item (single-click) on the Diagram tab.


c. In the Join Properties pane, verify that the join type is an Inner join.

Note: The type of join can also be verified and changed by right-clicking the Join item
in the Navigate pane or the Join item on the Diagram tab (when the Join keyword in the
Navigate pane is selected). A pop-up menu displays a list of available join types with a
check mark next to the currently selected type.


d. Click the Where item in the Navigate pane to surface the Where tab in the main pane.
e. Verify that the inner join is executed based on the values of the Supplier_ID columns from
the sources being equal.

Note: Outer joins (left, right, full) do not use the Where item for the join condition. To set
conditions for an outer join, click the Join item in the Navigate pane.
f. Add an additional WHERE clause to subset the data.
1) Click New in the top portion of the Where tab.
A row is added with the logical AND as the Boolean operator.


2) Select Choose column(s) from the drop-down list under the first Operand field.

The Choose Columns window appears.


3) Expand DIFT PRODUCT_LIST.
4) Click Product_Level.

5) Click OK to close the Choose Columns window.


6) Enter the numeral 1 for the second Operand field and press Enter.

7) Verify that the SQL WHERE clause is updated to the following:


g. Click the Select item in the Navigate pane to surface the Select tab.

h. Verify that four of the target columns are not mapped.


i. Map the Country source column to the Supplier_Country target column.
1) Click the Country column in the Source table area.
2) Click the Supplier_Country in the Target table area.


3) Click the Map selected columns tool.

Note: A one-to-one mapping such as the above can also be performed by dragging the
source column to the target column.
Note: The three columns that are to be calculated remain unmapped.
j. Click the expand control to expand the Target table area. This provides more room to work
with the expressions.

An expression must be defined for three columns. However, the columns are not in order of
their scope. Product_Line describes the largest category, Product_Category describes the
next largest category of items, and Product_Group describes the smallest category of
items. They should be reordered for consistency.


k. Reorder the columns.


Note: Product_Group column needs to move before the Product_Category column.
1) Click the column number (7 in this example).

2) Drag and drop the column just above the Product_Category column.
3) Verify that Product_Group is now 6.

l. Click OK to close the Properties window for DIFT Product Dimension.


20. Specify expressions for the calculated columns.


a. Specify an expression for Product_Group.
1) Locate the Product_Group column.
2) In the Expression column, select Advanced from the drop-down list.

The Expression window appears.

3) Locate and open the HelperFile.txt file in D:\Workshop\dift.


4) Copy the expression for Product_Group.


put(int(product_list.product_id/1e5)*1e5,product.)
5) Paste the copied expression in the Expression Text area.

Note: To build this expression using the Expression Builder, do the following:
• Click the Functions tab in the bottom part of the Expression window.
• Expand (by double-clicking) the Special grouping of functions.
• Single-click the PUT function.
• Click Add to Expression.
• With the first argument of PUT highlighted (the default), expand the Truncation
grouping of functions.
• Single-click the INT function.
• Click Add to Expression.
• Click the Data Sources tab in the bottom part of the Expression window.
• Expand DIFT PRODUCT_LIST.
• Single-click Product_ID.
• Verify that the INT argument is still highlighted (the default).
• Click Add to Expression.
• Click the division (/) operator tool.
• Type 1e5.
• Move the cursor after the INT close parenthesis and before the comma.
• Click the multiplication (*) operator tool.
• Type 1e5.
• Move the cursor after the comma and highlight <value>.
• Type product. (including the trailing period).
6) Click Validate Expression.

7) Click No to not display the SAS log.
8) Click OK to close the Expression window.


b. Specify an expression for Product_Category.


Note: Text can be entered directly into the Expression field.
1) If necessary, access the HelperFile.txt file in D:\Workshop\dift.
2) Copy the expression for Product_Category.
put(int(product_list.product_id/1e8)*1e8,product.)
3) In SAS Data Integration Studio, click the Expression field for the Product_Category
column.
4) Hold down the Ctrl key and type v to paste the copied expression in the Expression
field.
c. Specify an expression for Product_Line.
1) If necessary, access the HelperFile.txt file in D:\Workshop\dift.
2) Copy the expression for Product_Line.
put(int(product_list.product_id/1e10)*1e10,product.)
3) In SAS Data Integration Studio, click the Expression field for the Product_Line column.
4) Hold down the Ctrl key and type v to paste the copied expression in the Expression
field.
The final settings for the calculated columns are as follows:

d. Click the collapse control to collapse the target table area back to the right.


e. Verify that each of the calculated columns does not have an established source column
mapping.

f. Fix the warnings.


1) Verify that each of the calculated fields now has an associated warning symbol.

2) Click the column Product_Group.

3) From the toolbar for the Select tab, click the fix warning drop-down and select
Update Mappings to Match Columns Used in Expression.
4) Click the column Product_Category.

5) From the toolbar for the Select tab, click the fix warning drop-down and select
Update Mappings to Match Columns Used in Expression.
6) Click the column Product_Line.

7) From the toolbar for the Select tab, click the fix warning drop-down and select
Update Mappings to Match Columns Used in Expression.


8) Verify that each of the calculated columns now has a mapping from the source column
Product_ID.

g. Select File  Save to save changes to the Join transformation.


h. Click Up to return to the Job Editor.
21. Verify that the final process flow resembles the following:

22. Click Run in the job toolbar.


23. Verify that the job completes successfully.

24. View the log for the executed job.


a. Click the Log tab.
b. Scroll to view the note about the creation of the DIFTTGT.PRODDIM table.


25. View the data for the target table.


a. Click the Diagram tab.
b. Right-click the DIFT Product Dimension table and select Open.
c. Scroll to the new columns and verify that the data were calculated properly.

d. Select File  Close to close the View Data window.


26. Select File  Close to close the Job Editor window. The new job object appears on the
Folders tab.


Practices
3. Populating the OrderFact Table

The final job flow diagram should resemble the following:

• Access SAS Data Integration Studio using the My Server connection profile with Bruno’s
credentials (Bruno/Student1).
• Create a new job in the Orion Jobs folder with the name DIFT Populate Order Fact Table.
• Two tables should be joined together, DIFT ORDER_ITEM and DIFT ORDERS. These tables
are located in the Orion Source Data folder.
• Use the Join transformation to specify an inner join based on equality of the ORDER_ID
columns from the source tables.
• Replace the Join work table with the DIFT Order Fact table (located in the Orion Target
Data folder).
• The target column OrderDate should be calculated by taking the date portion of the source
column ORDER_DATE, which is a datetime column.
Hint: Use the DATEPART() function in an expression on the target side of the Select item in
the Join transformation. (A short sketch of DATEPART follows this practice.)
• The target column DeliveryDate should be calculated by taking the date portion of the
source column DELIVERY_DATE, which is a datetime column.
Hint: Use the DATEPART() function in an expression on the target side of the Select item in
the Join transformation.
• After you verify that the table is created successfully (with no warnings), close the job.
Note: The DIFT Order Fact table should have 951,669 observations and 12 variables.
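If DATEPART is unfamiliar, the following sketch shows its effect; the datetime constant is just an example value:

   data _null_;
      order_dt  = '15Mar2011:10:30:00'dt;  /* a datetime value (seconds) */
      order_day = datepart(order_dt);      /* the date portion (days)    */
      put order_day= date9.;               /* order_day=15MAR2011        */
   run;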


4. Loading the Old_Orders and Recent_Orders Tables

The final job flow diagram should resemble the following:

• Access SAS Data Integration Studio using the My Server connection profile with Bruno’s
credentials (Bruno/Student1).
• Create a new job in the Orion Jobs folder with the name DIFT Populate Old and Recent
Orders Tables.
• Use the Splitter transformation to split the observations from the DIFT Order Fact table
(a sketch of the Splitter's effect follows this practice).
• Write the records from the Splitter transformation to the DIFT Old Orders and the DIFT
Recent Orders tables. These tables are located in the Orion Target Data folder.
• Old orders are defined as orders placed before January 1, 2009. You can use the following
expression to find the observations for this data:
OrderDate < '01Jan2009'd
• Recent orders are defined as orders placed on or after January 1, 2009. This expression can
be used to find the observations for this data:
OrderDate >= '01Jan2009'd
• After you verify that the tables are created successfully, close the job.
Note: The DIFT Recent Orders table should have 615,396 observations and 12 variables.
The DIFT Old Orders table should have 336,273 observations and 12 variables.
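The Splitter generates code to the effect of the following DATA step sketch (the librefs and physical table names here are assumptions based on this course's conventions):

   data difttgt.old_orders difttgt.recent_orders;
      set difttgt.orderfact;
      if OrderDate < '01Jan2009'd then output difttgt.old_orders;
      if OrderDate >= '01Jan2009'd then output difttgt.recent_orders;
   run;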


5.3 Solutions
Solutions to Practices
1. Refreshing Course Metadata to Current Point
a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
1) Select Start  All Programs  SAS  SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK to close the Connection Profile window and open the Log On window.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.
b. Click the Folders tab.
c. Expand Data Mart Development.
d. Delete the four Orion folders.
1) Click the Orion Jobs folder.
2) Press the Shift key and click the Orion Target Data folder.

3) Select Edit  Delete. A Confirm Delete window appears.


4) Click Yes. A Delete Library window appears for the DIFT ODBC Contacts Library.
5) Click Yes. A Delete Library window appears for the DIFT ODBC Orders Library.
6) Click Yes. A Delete Library window appears for the DIFT Oracle Library.
7) Click Yes. A Delete Library window appears for the DIFT Orion Source Tables Library.
8) Click Yes. A Delete Library window appears for the DIFT SAS Library.
9) Click Yes. A Delete Library window appears for the DIFT Orion Target Tables Library.
10) Click Yes.
e. Import fresh metadata.
1) Select the Data Mart Development folder.
2) Select File  Import  SAS Package.
3) Click Browse next to Enter the location of the input SAS package file.
a) If necessary, navigate to D:\Workshop\dift\solutions.
b) Click DIFT_Ch4Ex1.spk.
c) Click OK.
4) Verify that All Objects is selected.


5) Click Next.
6) Verify that all four Orion folders are selected.
7) Click Next.
8) Verify that four different types of connections need to be established.
9) Click Next.
10) Verify that SASApp is listed for both the Original and Target fields.
11) Click Next.
12) Verify that both servers (ODBC Server and Oracle Server) have matching values for the
Original and Target fields.
13) Click Next.
14) Verify that both file paths have matching values for the Original and Target fields.
15) Click Next.
16) Verify that all directory paths have matching values for the Original and Target fields.
17) Click Next. The Summary pane surfaces.
18) Click Next.
19) Verify the import process completed successfully.
20) Click Finish.

Question: How many library objects were imported?


Answer: Six total – five in Orion Source Data and one in Orion Target Data

Question: What is the name of the job object in the Orion Jobs folder?
Answer: DIFT Populate Current and Terminated Staff Tables

Question: How many table objects were imported in the Orion Target Data folder?
Answer: Seven

2. Checking the Starter Job and Finalizing


a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
1) Select Start  All Programs  SAS  SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK to close the Connection Profile window and open the Log On window.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.
b. Click the Folders tab.
c. Expand Data Mart Development  Orion Jobs.
d. Right-click DIFT Populate Current and Terminated Staff Tables job and select Open.


e. Define the connection for the STAFF table using the Connections window.
1) Right-click the STAFF table in the job flow diagram and select Connections.

2) Click under Output Ports. The Output Port – Data window appears.
a) Click the Splitter transformation under Output Node.

b) Click OK to close the Output Port window.


3) Click OK to close the Connections window.

f. If necessary, click the horizontal layout tool.

g. Select File  Save to save the job.


The job flow diagram should resemble the following:


h. Add target tables using the Replace functionality.


1) Right-click the top output table on the Splitter transformation and select Replace.

2) In the Table Selector window (on the Folders tab), expand Data Mart Development 
Orion Target Data.
3) Click the DIFT Current Staff table.


4) Click OK.
The job flow diagram updates to the following:

5) Right-click the remaining output table on the Splitter transformation and select Replace.
6) In the Table Selector window (on the Folders tab), expand Data Mart Development 
Orion Target Data.
7) Click the DIFT Terminated Staff table.
8) Click OK.

i. Click the horizontal layout tool.

j. Select File  Save to save the job.


The job flow diagram should now resemble the following:

k. Update the properties of the Splitter transformation.


1) Right-click the Splitter transformation and select Properties.
2) Click the Mappings tab.
3) Verify that no mappings between the source columns and target columns exist.



4) Click the Map all columns tool.


5) Verify that all target columns now have a one-to-one mapped source column.


6) Click the Row Selection tab.


7) Specify the subsetting criteria for the DIFT Current Staff table object.
a) Verify that the DIFT Current Staff table object is selected in the Target Tables pane.
b) Select Row Selection Conditions in the Row Selection Type field.
c) Click Subset Data below the Selection Conditions area.
d) Click the Data Sources tab.
e) Expand the STAFF table.
f ) Select the Emp_Term_Date column.
g) Click Add to Expression.

h) Click the equal (=) operator in the operators area.


i) Enter . (a period for a missing numeric value).
The Expression Text area should now resemble the following:

j) Click OK to close the Expression Builder window.


8) Specify the subsetting criteria for the DIFT Terminated Staff table object.
a) Verify that the DIFT Terminated Staff table object is selected in the Target Tables
pane.
b) Select Row Selection Conditions in the Row Selection Type field.
c) Click Subset Data below the Selection Conditions area.
d) Click the Data Sources tab.
e) Expand the STAFF table.
f ) Select the Emp_Term_Date column.


g) Click Add to Expression.

h) Click the not-equal (^=) operator in the operators area.


i) Enter . (a period for a missing numeric value).
j) Click OK to close the Expression Builder window.
9) Click OK to close the Splitter Properties window.
l. Select File  Save to save the job.
The job flow diagram should now resemble the following:

m. Select Actions  Run.

Question: Did the job execute successfully?


Answer: Yes

This row item identifies that the job completed successfully.

Question: How many records were populated in the CURRENT_STAFF table?


Answer: 772 records and 10 variables

This can be discovered by selecting the Log tab, scrolling, and locating the
note about the creation of DIFTTGT.CURRENT_STAFF table.


Question: How many records were populated in the TERM_STAFF table?


Answer: 276 records and 10 variables

This can be discovered by selecting the Log tab, scrolling, and locating the
note about the creation of the DIFTTGT.TERM_STAFF table.

n. Select File  Close to close the data job.

Question: On the Inventory tab, under what grouping do you find the job?
Answer: DIFT Populate Current and Terminated Staff Tables is found in the Job
grouping on the Inventory tab

This job is in the Job grouping on the Inventory tab.


3. Populating the OrderFact Table

a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.


1) Select Start  All Programs  SAS  SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK to close the Connection Profile window and open the Log On window.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.
b. Click the Folders tab.
c. Expand Data Mart Development  Orion Jobs.
d. Verify that the Orion Jobs folder is selected.
e. Select File  New  Job. The New Job window appears.
1) Enter DIFT Populate Order Fact Table in the Name field.
2) Verify that the location is set to /Data Mart Development/Orion Jobs.
3) Click OK. The Job Editor window appears.
f. Add the source data objects to the process flow.
1) Click the Folders tab.
2) If necessary, expand Data Mart Development  Orion Source Data.
3) Right-click the DIFT ORDER_ITEM table object and select Add to Diagram.
4) Right-click the DIFT ORDERS table object and select Add to Diagram.
5) Click the DIFT ORDERS object on the Diagram tab of the Job Editor and drag it away
from the DIFT ORDER_ITEM object. (Initially, these objects are one above the other.)

Note: The icon in the upper right corners of the metadata table objects
indicates that these are Oracle tables.
g. Select File  Save to save the diagram and job metadata to this point.
h. Add the Join transformation to the diagram.
1) In the tree view, click the Transformations tab.
2) Expand the SQL grouping.
3) Select the Join transformation.
4) Drag the Join transformation to the diagram.


5) Place the Join transformation to the right of the source table objects.

i. Add inputs to the Join transformation.


1) Click the DIFT ORDER_ITEM connection selector and draw a line to one of the
input ports of the Join transformation.
2) Click the DIFT ORDERS connection selector and draw a line to the other input
port of the Join transformation.
j. Select File  Save to save the diagram and job metadata to this point.
k. Add a target table to the diagram.
1) Right-click the work table for the Join transformation and select Replace.
2) Verify that the Folders tab is selected.
3) Expand the Data Mart Development  Orion Target Data folder.
4) Select DIFT Order Fact.
5) Click OK.
The job flow diagram is updated to the following:

l. Select File  Save to save the diagram and job metadata to this point.


m. Review the properties of the Join transformation.


1) Right-click the Join transformation and select Open. The Designer window appears.
2) Click the Join item in the Navigate pane.

3) In the Join Properties pane, verify that the join is an inner join.

4) Click the Where item in the Navigate pane.


5) Verify that the inner join is executed based on the values of ORDER_ID columns from
the sources being equal.
6) Click the Select item in the Navigate pane to surface the Select tab.
7) Verify that 10 target columns have established one-to-one mappings.
8) Verify that two target columns do not have a mapping.


9) Add expressions for the two unmapped columns.

a) Click the expand control next to the Target table name to expand the target side.

b) In the Expression field for the OrderDate column, enter datepart(order_date).


c) Press Enter. The expression is updated to datepart(ORDERS.ORDER_DATE).
d) In the Expression field for the DeliveryDate column, enter datepart(delivery_date).
e) Press Enter. The expression is updated to datepart(ORDERS.DELIVERY_DATE).

f ) Click the collapse control to collapse the target table attributes back to the right side.
g) Right-click the OrderDate column.
h) Select Fix Warning  Update Mappings to Match Columns Used in Expression.
i) Right-click the DeliveryDate column.
j) Select Fix Warning  Update Mappings to Match Columns Used in Expression.
10) Verify that all 12 columns are now mapped.
11) Click Up to return to the Job Editor.
n. Select File  Save to save the diagram and job metadata to this point.
o. Run the job.
1) Click Run.
2) Click the Status tab in the Details pane. Verify that the job completed successfully.
3) Click the Log tab and verify that DIFTTGT.ORDERFACT is created with 951,669
observations and 12 variables.
4) Click the Diagram tab.
5) Right-click DIFT Order Fact and select Open.
6) Review the data and then select File  Close to close the View Data window.
7) Select File  Close to close the Job Editor.

4. Loading the Old_Orders and Recent_Orders Tables


a. Click the Folders tab.
b. Expand Data Mart Development  Orion Jobs.
c. Verify that the Orion Jobs folder is selected.
d. Select File  New  Job. The New Job window appears.
1) Enter DIFT Populate Old and Recent Orders Tables in the Name field.
2) Verify that the location is set to /Data Mart Development/Orion Jobs.


3) Click OK. The Job Editor window appears.


e. Add the source data object to the process flow.
1) If necessary, click the Folders tab.
2) Expand Data Mart Development  Orion Target Data.
3) Drag the DIFT Order Fact table object to the Diagram tab of the Job Editor.
f. Select File  Save to save the diagram and job metadata to this point.
g. Add the Splitter transformation to the diagram.
1) In the tree view, click the Transformations tab.
2) Expand the Data grouping.
3) Select the Splitter transformation.
4) Drag the Splitter transformation to the diagram.
5) Center the Splitter transformation so that it is aligned with the source table object.
h. Select File  Save to save the diagram and job metadata to this point.
i. Connect the source table object to the Splitter transformation.
1) Click the DIFT Order Fact connection selector.
2) Drag it to the Splitter transformation to form the connection.
j. Select File  Save to save the diagram and job metadata to this point.
k. Add the target table objects to the diagram.
1) Right-click one of the work tables for the Splitter transformation and select Replace.
2) Verify that the Folders tab is selected.
3) Expand the Data Mart Development  Orion Target Data folder.
4) Select DIFT Old Orders.
5) Click OK.
6) Right-click the other work table for the Splitter transformation and select Replace.
7) Verify that the Folders tab is selected.
8) Expand the Data Mart Development  Orion Target Data folder.
9) Select DIFT Recent Orders.
10) Click OK.
11) If necessary, separate the two target table objects.
The process flow diagram should resemble the following:

l. Select File  Save to save the diagram and job metadata to this point.


m. Specify the properties of the Splitter transformation.


1) Right-click the Splitter transformation and select Properties.
2) Click the Row Selection tab.
3) Specify the subsetting criteria for the DIFT Old Orders table object.
a) Verify that the DIFT Old Orders table object is selected in the Target Tables pane.
b) Select Row Selection Conditions in the Row Selection Type field.
c) Click Subset Data below the Selection Conditions area. The Expression window
appears.
d) Click the Data Sources tab.
e) Expand the OrderFact table.
f ) Select the OrderDate column.
g) Click Add to Expression.
h) Click the less-than (<) operator in the operators area.

i) Enter '01jan2009'd.
j) Click Validate Expression.
k) Click No to not display the SAS log.
l) Click OK to close the Expression Builder window.
4) Specify the subsetting criteria for the DIFT Recent Orders table object.
a) Verify that the DIFT Recent Orders table object is selected in the Target Tables
pane.
b) Select Row Selection Conditions in the Row Selection Type field.
c) Click Subset Data below the Selection Conditions area. The Expression window
appears.
d) Click the Data Sources tab.
e) Expand the OrderFact table.
f ) Select the OrderDate column.
g) Click Add to Expression.
h) Click the greater-than-or-equal (>=) operator in the operators area.

i) Enter '01jan2009'd.
j) Click Validate Expression.
k) Click No to not display the SAS log.
l) Click OK to close the Expression Builder window.
5) Click the Mappings tab.


6) Verify that all target table columns are mapped. (That is, all target columns receive data
from a source column.)
7) Click OK to close the Splitter Properties window.
n. Select File  Save to save the diagram and job metadata to this point.
o. Run the job.
1) Click Run to run the job.
2) Click the Status tab in the Details pane. Notice that all processes complete successfully.
3) Click the Log tab to view the log for the executed job.
4) Scroll to view the notes about the creation of the DIFTTGT.RECENT_ORDERS table
and the creation of the DIFTTGT.OLD_ORDERS table.
5) Click the Diagram tab to view the data results.
6) View the DIFT Recent Orders table.
a) Right-click the DIFT Recent Orders table and select Open.
b) The DIFT Recent Orders table should have 615,396 rows.
c) When you are finished viewing the data, select File  Close to close the
View Data window.
7) View the DIFT Old Orders table.
a) Right-click the DIFT Old Orders table and select Open.
b) The DIFT Old Orders table should have 336,273 rows.
c) When you are finished viewing the data, select File  Close to close the View Data
window.
p. Select File  Close to close the Job Editor.
The Orion Jobs folder should resemble the following:

Lesson 6 SAS® Data Integration
Studio: Working with Transformations
6.1 Working with the Extract and Summary Statistics Transformations ............................. 6-3
Demonstration: Refresh the Metadata ....................................................................... 6-5
Demonstration: Reporting for United States Customers ................................................ 6-9
Practices ............................................................................................................. 6-28

6.2 Exploring the SQL Transformations ........................................................................... 6-32


Demonstration: Concatenating Tables with Set Operators Transformation ..................... 6-38
Practice............................................................................................................... 6-47

6.3 Creating Custom Transformations ............................................................................. 6-49


Demonstration: Creating a New Transformation ........................................................ 6-59
Practice............................................................................................................... 6-74
Demonstration: Using a New Transformation ............................................................ 6-77
Practice............................................................................................................... 6-88

6.4 Solutions ................................................................................................................... 6-90


Solutions to Practices ............................................................................................ 6-90

6.1 Working with the Extract and Summary Statistics Transformations

Extract Transformation

The Extract transformation generates PROC SQL code.

The Extract transformation supports the following items:


• one source table
• SELECT clause
• derived columns
• WHERE clause
• GROUP BY clause
• ORDER BY clause
Note: The Join transformation supports multiple source tables.
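A minimal sketch of the shape of the PROC SQL code that the Extract transformation generates, touching each of the supported items above. The table and column names are illustrative assumptions, not the exact generated code:

   proc sql;
      create table work.us_orders as
      select Customer_ID,
             Customer_Country,
             sum(Total_Retail_Price) as Revenue  /* a derived column */
         from difttgt.customerorderinfo
         where Customer_Country = "US"
         group by Customer_ID, Customer_Country
         order by Customer_ID;
   quit;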


Summary Statistics Transformation

The Summary Statistics transformation generates PROC MEANS code.

The Summary Statistics transformation provides the following items:


• descriptive statistics like SUM, MEAN, and STD
• multiple analysis variables
(for example, revenue and profit)
• multiple classification (group by) variables
(for example, gender and age group)
• output to a report by default
• output to a table
The upcoming demonstration will use the Summary Statistics transformation, while the upcoming
practice will use the Summary Tables transformation. Both of these transformations are found in the
Analysis grouping of transformations.
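A minimal sketch of the kind of PROC MEANS code the transformation generates. The analysis and classification variable names here are illustrative assumptions:

   proc means data=work.us_orders sum mean std;
      class Customer_Gender Customer_Age_Group;
      var Total_Retail_Price;
   run;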


Refresh the Metadata


This demonstration clears existing metadata and then refreshes with metadata imported from a SAS
package.
1. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
a. Select Start  All Programs  SAS  SAS Data Integration Studio.
b. Select My Server as the connection profile.
c. Click OK to close the Connection Profile window and open the Log On window.
d. Enter Bruno in the User ID field and Student1 in the Password field.
e. Click OK to close the Log On window.
2. Click the Folders tab.
3. Expand the Data Mart Development folder.

Remove Existing Metadata Objects


1. Select all of the folders under Data Mart Development.
a. Click the DIFT Demo folder.
b. Press and hold the Shift key and click the Orion Target Data folder.

2. Select Edit  Delete.


A Confirm Delete window appears.

3. Click Yes.
4. If a Delete Library window appears for the DIFT Test Source Library, click Yes.
5. If a Delete Library window appears for the DIFT Test Target Library, click Yes.
6. If a Delete Library window appears for the DIFT ODBC Contacts Library, click Yes.
7. If a Delete Library window appears for the DIFT ODBC Orders Library, click Yes.
8. If a Delete Library window appears for the DIFT Oracle Library, click Yes.
9. If a Delete Library window appears for the DIFT Orion Source Tables Library, click Yes.


10. If a Delete Library window appears for the DIFT SAS Library, click Yes.
11. If a Delete Library window appears for the DIFT Orion Target Tables Library, click Yes.

Import Needed Metadata Objects


1. Select the Data Mart Development folder.
2. Select File  Import  SAS Package.
3. Click Browse next to Enter the location of the input SAS package file.
a. If necessary, navigate to D:\Workshop\dift\solutions.
b. Click DIFT_EndCh6.spk.
4. Verify that All Objects is selected.
5. Click Next.
6. Verify that all four Orion folders are selected.
7. Click Next.
8. Verify that four different types of connections need to be established.
9. Click Next.
10. Verify that SASApp is listed for both the Original and Target fields.
11. Click Next.
12. Verify that both servers (ODBC Server and Oracle Server) have matching values for the Original
and Target fields.
13. Click Next.
14. Verify that all directory paths have matching values for the Original and Target fields.
15. Click Next.
16. Verify that both file paths have matching values for the Original and Target fields.
17. Click Next.
The Summary pane surfaces.
18. Click Next.
19. Verify that the import process completed successfully.
20. Click Finish.
21. Expand the Data Mart Development  Orion Jobs folder.
22. Expand the Data Mart Development  Orion Reports  SQL Transforms folder.


The metadata under the Data Mart Development folder should resemble the following:

Note: Some things to verify:


• The Orion Jobs folder contains a partially completed job (DIFT Populate Customer
Order Information Table) – this is used in the upcoming demonstration.
• The Orion Reports  SQL Transforms folder contains four new table objects (DIFT
CHILDREN, DIFT CLOTHESSHOES, DIFT OUTDOORS, DIFT SPORTS) – these
will be used in a demonstration in an upcoming section.


In addition, all source and target table objects (as well as the corresponding library objects)
discussed up to this point are found in the Orion Source Data and Orion Target Data
folders.


Reporting for United States Customers


This demonstration illustrates how to design a job to join two tables and generate a report on the
resulting table. The report will summarize customer orders in 2011 for United States customers. Two
steps are used. In one job, the Customer Dimension and Order Fact tables are joined, and 2011
data is selected. Then, in a second job, the U.S. rows are extracted and then summarized.
If necessary, access SAS Data Integration Studio with Bruno’s credentials.
1. Select Start  All Programs  SAS  SAS Data Integration Studio.
2. Verify that the connection profile is My Server.
3. Click OK to close the Connection Profile window and open the Log On window.
4. Enter Bruno in the User ID field and Student1 in the Password field.
5. Click OK to close the Log On window.
6. If necessary, refresh the metadata.
a. Click the Folders tab.
b. Click the Data Mart Development folder.
c. Select View  Refresh.

Investigate and Finalize the Job to Load the Customer Order Information Table
1. Run two jobs to prepare source tables for the DIFT Populate Customer Order Information Table
job.
a. Open and run the DIFT Populate Order Fact Table job.
1) Click the Folders tab.
2) Right-click the DIFT Populate Order Fact Table job and select Open.
3) Click Run.
4) Verify that the job completes successfully.

b. Open and run the DIFT Populate Customer Dimension Table job.
1) Click the Folders tab.
2) Right-click the DIFT Populate Customer Dimension Table job and select Open.
3) Click Run.
4) Verify that the job completes successfully.

Note: The DIFT Customer Dimension table and the DIFT Order Fact table will be
source tables for the DIFT Populate Customer Order Information Table job. If
these tables are not generated, the DIFT Populate Customer Order
Information Table job will fail.


2. Open the starter DIFT Populate Customer Order Information Table job for editing.
a. Click the Folders tab.
b. Right-click the job DIFT Populate Customer Order Information Table and select Open.
A partial job flow diagram appears:

3. Examine the properties of the Join transformation to verify that it will produce an inner join of the
two tables on matching Customer_ID.
a. Right-click the Join transformation and select Open.
b. Verify that the type of join is an inner join.
1) Click the Join item in the Navigate pane.
2) View the type in the Join Properties pane.

c. Click the Where item in the Navigate pane.


d. Verify that the inner join is based on the matching (equality) Customer_ID columns from
each of the sources.

e. Add an additional WHERE clause to subset the data to only orders placed in 2011.
1) On the Where tab, click New.
2) In the first Operand column, click  Advanced.
The Expression Builder window appears.
3) Click the Functions tab.
4) Double-click Date and Time to expand it.
5) Scroll down to the Date and Time functions and click the YEAR function.

The function is added with a highlighted argument. Keep the highlighting so that we can
easily add the OrderDate column as the function’s argument.
Note: The right pane provides help for the selected function.
6) Click Add to Expression.
7) Click the Data Sources tab.
8) Expand the DIFT Order Fact table.
9) Click OrderDate.


10) Click Add to Expression.

11) Click OK to close the Expression Builder window.


12) In the Operator field, verify that the value is =.
13) In the second Operand field, enter 2011 and then press Enter.
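The Where tab should now describe a condition along these lines (the table qualifiers are assumptions; the Designer derives them from the source table names):

   where CUSTOMER_DIM.Customer_ID = ORDERFACT.Customer_ID
     and year(ORDERFACT.OrderDate) = 2011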

f. Click the Select item in the Navigate pane.


g. Verify that all 22 target columns are mapped one-to-one from a source column.

Note: Customer_ID is the column used in the join. Therefore, it appears in both sources,
and only one copy needs to be mapped and present in the target table.
h. Click Up to return to the main diagram.
4. Select File  Save to save the diagram and job metadata to this point.
5. Add the target table to the job flow.
a. Right-click the green work table object that is associated with the Join transformation.
Select Register Table.
b. Enter DIFT Customer Order Information in the Name field.
c. Set the Location field.
1) Click Browse.
a) Double-click the Orion Target Data folder.
b) Click OK.


2) Verify that the location is set to /Data Mart Development/Orion Target Data.

d. Click the Physical Storage tab.


e. Enter CustomerOrderInfo in the Physical name field.
f. Click Browse next to the Library field. The Select a library window appears.
1) Verify that the Folders tab is selected.
2) Expand Data Mart Development  Orion Target Data.
3) Select DIFT Orion Target Tables Library.


4) Click OK to close the Select a library window.


The Physical Storage tab should resemble the following:

g. Click OK to close the Register Table window.


The job flow diagram should resemble the following:

6. Select File  Save to save the diagram and job metadata to this point.
7. Run the job.
a. Right-click in the background of the job and select Run.
b. Click the Status tab in the Details area.
c. Verify that all steps completed successfully.
d. View the log for the executed job. Scroll to view the note about the creation of
DIFTTGT.CUSTOMERORDERINFO.

Note: If this job fails, then the two source tables might not be populated. It might be
necessary to run the two jobs DIFT Populate Customer Dimension Table and DIFT
Populate Order Fact Table (both of these jobs are located in the Data Mart
Development  Orion Jobs folder).


8. View the data in the target table generated by the Join.


a. Click the Diagram tab of the Job Editor window.
b. Right-click the DIFT Customer Order Information table and select Open.

Note: The Customer_Country values include Germany, United Kingdom, and United
States. They are written out values, not country codes.
c. Select File  Close to close the View Data window.
9. Select File  Close to close the Job Editor window.

Creating the Job to Summarize the US Customer Orders from 2011


1. Create the initial job metadata.
a. Click the Folders tab.
b. Expand Data Mart Development  Orion Reports  Extract and Summary.
c. Verify that the Extract and Summary folder is selected.
d. Select File  New  Job. The New Job window appears.
e. Enter DIFT Create Report for US Customer Order Information in the Name field.
f. Verify that the location is set to /Data Mart Development/Orion Reports/Extract and
Summary.

g. Click OK. The Job Editor window appears.


2. Add source table metadata to the diagram for the job flow.
a. Click the Folders tab.
b. Expand the Orion Target Data folder.


c. Drag the DIFT Customer Order Information table object to the Diagram tab of the Job
Editor.
3. Add the Extract transformation to the job flow.
a. Click the Transformations tab.
b. Expand the SQL grouping and locate the Extract transformation template.
c. Drag the Extract transformation to the Diagram tab of the Job Editor.
d. Connect the DIFT Customer Order Information table object to the Extract transformation.
4. Add the Summary Statistics transformation to the job flow.
a. If necessary, click the Transformations tab.
b. Expand the Analysis grouping and locate the Summary Statistics transformation template.
c. Drag the Summary Statistics transformation to the Diagram tab of the Job Editor.
d. Connect the Extract transformation to the Summary Statistics transformation.

5. Select File  Save to save the diagram and job metadata to this point.
6. Add a WHERE expression to the Extract transformation so that only rows for US customers are
written to the target table.
a. Right-click the Extract transformation and select Properties.
b. Click the Where tab.
c. On the bottom portion of the Where tab, click the Data Sources tab.
d. Expand the CustomerOrderInfo table.
e. Select Customer_Country.
f. Click Add to Expression.
g. In the Expression Text area, type = "US".

Note: The value “US” that we are using for subsetting does not match the written-out
“United States” value that we noted in the View Data window in step 8b. This
apparent discrepancy occurs because global options are applying formats to the
data shown in the View Data window. We will examine these options shortly;
a sketch of the code that this transformation generates appears after this step.
h. Click OK to close the Extract Properties window.
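Note: The Extract transformation generates a PROC SQL step. The following is a minimal
sketch of the kind of code produced for this WHERE expression; the target table
name is an assumption here, because the transformation manages its own work
table names:

proc sql;
   create table work.extract_target as   /* assumed name; the real work table is transformation-managed */
   select *
      from difttgt.CustomerOrderInfo
      where Customer_Country = "US";
quit;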
7. Select File  Save to save the diagram and job metadata to this point.
8. Close the job.


9. Investigate the DIFT Customer Order Information table’s properties.


a. Right-click the DIFT Customer Order Information table and select Properties.
b. Click the Columns tab.
c. Verify that the Customer_Country column has the format $COUNTRY20.

Note: This is a user-defined format that displays country codes as written out country
names.
d. Click OK to close the Properties window.
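Note: The $COUNTRY format is a user-defined format that is created elsewhere in the
course environment. As a minimal sketch, a format like it could be defined with
PROC FORMAT; the code-to-name pairs shown here are illustrative assumptions:

proc format;
   value $country          /* illustrative code values; the course data may differ */
      'DE' = 'Germany'
      'GB' = 'United Kingdom'
      'US' = 'United States';
run;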
10. Investigate the permanent formats associated with the CustomerOrderInfo physical table.
a. Right-click the DIFT Customer Order Information table and select Analyze.
b. Click the Contents tab.
Note: A PROC CONTENTS step is run on the physical table when the Contents tab is
selected. This SAS procedure returns attributes of the physical table that it analyzes,
including column names, lengths, and formats.
c. Scroll down to the Alphabetic List of Variables and Attributes section of the report.

Note: The Customer_Country column has the $COUNTRY format.
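The step that the Contents tab runs is equivalent to the following, using the DIFTTGT
libref noted in the job log:

proc contents data=difttgt.CustomerOrderInfo;
run;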


11. Change the format options being applied to the View Data window to view the actual data
values.
a. Click the Tools item in the menu bar and select Options from the drop-down list.
b. Click the View Data tab.


c. Verify that both Apply metadata formats and Apply formats are selected.

If the Apply metadata formats option is selected, any format specified in the properties of a
table will be applied when viewing data in the View Data window.
d. Uncheck the Apply metadata formats option.

If only the Apply formats option is selected, only permanent formats that were specified for
the data when it was created are applied when viewing data in the View Data window.
Because the Customer_Country column has a format applied in the properties of the DIFT
Customer Order Information table and also has a permanent format, we must turn off both
formatting options to see the actual data values.
e. Uncheck the Apply formats option.

f. Click OK to close the Options window.
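Note: Outside of SAS Data Integration Studio, raw values can be inspected in code as well.
A FORMAT statement that names a column but no format removes the format for that
step. A minimal sketch, assuming the DIFTTGT libref:

proc print data=difttgt.CustomerOrderInfo (obs=5);
   format Customer_Country;   /* no format name, so the raw values are displayed */
run;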


12. Verify that the View Data window now shows country codes for the Customer_Country column.
a. Click the Folders tab.
b. Expand the Data Mart Development  Orion Target Data folders.


c. Right-click the DIFT Customer Order Information table and select Open.

Note: The Customer_Country column is now displaying the actual, unformatted data
values. We can now see the country codes.
d. Click File and select Close to close the View Data window.
13. Expand the Data Mart Development  Orion Reports  Extract and Summary folders.
14. Right-click the DIFT Create Report for US Customer Order Information job and select Open.
15. Specify properties for the Summary Statistics transformation to define grouping variables,
analysis variables, and statistics and specify report format and layout options .
a. Right-click the Summary Statistics transformation and select Properties.
b. On the General tab, remove the default description to prevent it from appearing in the job
flow node.


c. Click the Mappings tab.


1) Verify that there is no target table and therefore there will be no mappings.

The output will be a report.


d. Click the Options tab.
1) Verify that the Assign columns options group is selected.
2) Locate the Select analysis columns (VAR statement) option.

Note: The left pane lists option groups. The right pane lists the options in an option
group.


3) Click . The Select Data Source Items window appears.


a) Click Quantity Ordered.
b) Hold down the Ctrl key and click Total Retail Price.
c) Click .

d) Click OK to close the Select Data Source Items window.


4) Locate the Select columns to subgroup data (CLASS statement) option.
5) Click .
a) Click Customer Gender.
b) Hold down the Ctrl key and click Customer Age Group.
c) Click .

6) Click OK to close the Select Data Source Items window.


e. Click the Basic group under Statistics.


A few statistics are selected by default. Remove two and rearrange the remaining statistics.
1) In the Selected list box, click Standard deviation (STD) and then click to remove it.

2) In the Selected list box, click Number of observations (N) and click to remove it.
3) In the Selected list box, click Minimum (MIN) and then click to move it to the top.

f. Click the Percentiles option group under Statistics.


1) In the Available list box, click MEDIAN.
2) Click .

g. Click the Other options option group.


1) Enter nodate nonumber ls=80 in the Specify other options for OPTIONS statement
area.


2) Enter MAXDEC=2 NOLABELS in the Other PROC MEANS options area (after the
default text).

h. Click the Titles and footnotes option group.


1) Enter Customer Order Statistics as the value for the Heading 1 option.
2) Enter (United States Customers) as the value for the Heading 2 option.

i. Click the ODS options option group.


1) Select Use HTML as the value for the ODS result option.
2) Click Browse in the Location option. The Select a File window appears.
3) Navigate to D:\Workshop\dift\reports.
4) Enter UnitedStatesCustomerInfo.html in the Name field.


5) Click OK to close the Select a File window.


j. Click OK to close the Summary Statistics Properties window.
16. Select File  Save to save the diagram and job metadata to this point.
The final job flow diagram should resemble the following:

The Summary Statistics transformation has no output table because it is configured to produce
only a report. To add an output table, right-click on the transformation and select Add Work
Table.
17. Run the job.
a. Right-click in the background of the job and select Run.
b. Click the Status tab in the Details area. Verify that all processes completed successfully.
c. View the log for the executed job.
18. View the listing output.
a. Click the Output tab.

The nodate nonumber ls=80 options in the Summary Statistics transformation apply only to
listing output, not the HTML output.
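The transformation wraps its report code in generated macros, but the effective code for
the options selected above resembles the following sketch. The physical column names
(Quantity, Total_Retail_Price) and the input table name are assumptions:

options nodate nonumber ls=80;
ods html path="D:\Workshop\dift\reports" file="UnitedStatesCustomerInfo.html";
title1 "Customer Order Statistics";
title2 "(United States Customers)";
proc means data=work.extract_target min mean max median maxdec=2 nolabels;
   class Customer_Gender Customer_Age_Group;
   var Quantity Total_Retail_Price;   /* assumed physical column names */
run;
ods html close;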
b. Click the Diagram tab to return to the job flow diagram.


19. View the HTML document.


a. Open Windows Explorer by right-clicking Start and selecting Open Windows Explorer.
b. Navigate to D:\Workshop\dift\reports.

c. Double-click UnitedStatesCustomerInfo.html to open the generated report.


The generated report is displayed.

d. When you are finished viewing the report, click to close Firefox.
20. If necessary, click the Diagram tab in the Job Editor.
21. Select File  Save to save the diagram and job metadata.
22. Select File  Close to close the Job Editor window.


23. Reset the formatting options for the View Data window.
Note: Developers often keep these options off to see the actual stored data values. We
will turn them back on so that the View Data window again displays the more
readable formatted values, such as the written-out country names.
a. Click the Tools item in the menu bar and select Options from the drop-down list.
b. Click the View Data tab.
c. Select both Apply metadata formats and Apply formats.
d. Click OK to close the Options window.


Practices
A request was made to generate a report. The report must show the total quantities ordered across
the quarters of 2011 for the Product Group values in the product line Clothes & Shoes.
The report should resemble the following:

1. Refreshing Course Metadata to Current Point


• Access SAS Data Integration Studio with the My Server connection profile and Bruno's
credentials (Bruno/Student1).
• Delete all subfolders under the Data Mart Development folder.
• From the Data Mart Development folder, import fresh metadata from DIFT_Ch7Ex1.spk
(SAS package located in D:\Workshop\dift\solutions). Accept all defaults in the connection
point panels.
2. Examining and Executing Imported Job: DIFT Populate Product Order Information Table
• If necessary, access SAS Data Integration Studio with the My Server connection profile
and Bruno's credentials (Bruno/Student1).
• Open the job DIFT Populate Product Order Information Table (in the Orion Jobs folder).

• Examine the properties of the target table DIFT Product Order Information.

Question: How many columns are defined for DIFT Product Order Information?
Answer: _________________________________________________________


Question: What transformation is populating data in the columns Year, Quarter, Month,
and DOM?
Answer: _________________________________________________________
Question: What column from which table is populating data in the columns Year, Quarter,
Month, and DOM?
Answer: _________________________________________________________

• Run the job.

Question: How many rows were created for DIFT Product Order Information?
Answer: _________________________________________________________

Question: Do the values calculated for Year, Quarter, Month, and DOM look
appropriate?
Answer: _________________________________________________________

• Close the job.

3. Creating the Product Orders Summary Report


Define a job that will create the desired report. Use the Extract and Sort transformations to
prepare the data and the Summary Tables transformation to create the report. The job flow
diagram should resemble the following:

• Name the job DIFT Report for 2011 Clothes-Shoes Products.


• Store the job metadata object in the following folder:
/Data Mart Development/Orion Reports/Extract and Summary
• Add the DIFT Product Order Information table as the source.
• Add an Extract transformation (found in the SQL grouping of transformations) with the
following subsetting criteria:
Product_Line = "Clothes & Shoes" and Year=2011
• Add a Sort transformation (found in the Data grouping of transformations) and sort the data
by Product_Group and Quarter.


• Add a Summary Tables transformation (found in the Analysis grouping of transformations)


and use the following specifications on the Options tab:

Option Group              Option                                     Value
Assign columns            Select analysis columns (VAR statement)    Quantity Ordered
Categorize Data           Select columns to subgroup data            Product_Group, Quarter
                          (CLASS statement)
Describe TABLE to print   Specify row expression                     Product_Group
                          Specify column expression                  Quarter*Quantity=" "*sum
                          Specify TABLE statement options            rts=20
Label a keyword           Specify KEYLABEL statement                 sum=" "

• Additional option specifications for the Summary Tables transformation:

Option Group            Option                             Value
Other options           Specify other options for          ls=85 nodate nonumber
                        OPTIONS statement
                        Summary tables procedure options   format=comma8.
Titles and footnotes    Heading 1                          Total Quantity Ordered for
                                                           Quarters of 2011
                        Heading 2                          Product_Line: Clothes & Shoes
ODS options             ODS Result                         Use HTML
                        Location                           D:\Workshop\dift\reports\
                                                           Quantities2011ClothesAndShoes.html

Note: The following code can be added as preprocessing code (precode) for this last
transformation to specify the characters used to draw the horizontal and vertical
borders in the table:
options formchar="|---|-|---|";

• Run the job and verify that the desired HTML report is created.

Question: From the log, locate the code that generated the HTML file. What PROC step
generated the report?
Answer: _________________________________________________________

Question: What statement (or statements) identified the categorical and analytical fields?
Answer: _________________________________________________________


Question: In the HTML report, for the Product Group value of T-Shirts – is there an
increase in quantities ordered across all four quarters?
Answer: _________________________________________________________


6.2 Exploring the SQL Transformations

SQL Transformations

The transformations in the SQL group generate SQL code.


The transformations in the SQL group generate SQL code and enable you to
• create tables (SAS and DBMS)
• delete rows from a table
• execute SQL statements in a DBMS
• extract rows from a source table
• insert rows into a target table
• join tables
• SQL merge (update or insert)
• perform set operations
• update rows in a target table.


Set Operators Transformation


This transformation performs SQL set operations.


Set operations allow multiple query result sets to be combined into a single result set. The sets are
combined vertically, with each operation combining rows from the source sets in a different way.
This differs from a join, which combines sets horizontally, creating a result set with the columns
of multiple source sets.

UNION Set Operation


Query A result              Query B result
ID   Name       Line        ID   Name     Line  Supplier
101  Jacket     210         104  Gloves   210   50
102  Rain Suit  210         105  Boots    210   4742
103  Mittens    210         101  Jacket   210   50
                            103  Mittens  210   772

UNION result
ID   Name       Line
101  Jacket     210
102  Rain Suit  210
103  Mittens    210
104  Gloves     210
105  Boots      210

The UNION set operation returns all unique rows from the two query results. It drops noncommon
columns from the result set. By default, duplicate rows are dropped from the result set, but there is
an option (ALL) to keep duplicates.
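In PROC SQL terms, the operation illustrated above corresponds to the following sketch; the work
table names are assumptions:

proc sql;
   select ID, Name, Line from work.products_a
   union                 /* add ALL to keep duplicate rows */
   select ID, Name, Line from work.products_b;
quit;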


EXCEPT Set Operation


Query A result              Query B result
ID   Name       Line        ID   Name     Line  Supplier
101  Jacket     210         104  Gloves   210   50
102  Rain Suit  210         105  Boots    210   4742
103  Mittens    210         101  Jacket   210   50
                            103  Mittens  210   772

EXCEPT result
ID   Name       Line
102  Rain Suit  210

The EXCEPT set operation returns unique rows from the first query result that are not in the second.
It drops noncommon columns from the result set. By default, duplicate rows are dropped from the
result set, but there is an option (ALL) to keep duplicates.

INTERSECT Set Operation


Query A result              Query B result
ID   Name       Line        ID   Name     Line  Supplier
101  Jacket     210         104  Gloves   210   50
102  Rain Suit  210         105  Boots    210   4742
103  Mittens    210         101  Jacket   210   50
                            103  Mittens  210   772

INTERSECT result
ID   Name       Line
101  Jacket     210
103  Mittens    210

The INTERSECT set operation returns rows that are common to both query results. It
drops noncommon columns from the result set. By default, duplicate rows are dropped from the
result set, but there is an option (ALL) to keep duplicates.
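Both operations can be sketched in PROC SQL with the same example tables (table names are
assumptions):

proc sql;
   /* EXCEPT: rows in A that do not appear in B */
   select ID, Name, Line from work.products_a
   except
   select ID, Name, Line from work.products_b;

   /* INTERSECT: rows that appear in both A and B */
   select ID, Name, Line from work.products_a
   intersect
   select ID, Name, Line from work.products_b;
quit;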


OUTER UNION Set Operation


Query A result              Query B result
ID   Name       Line        ID   Name     Line  Supplier
101  Jacket     210         104  Gloves   210   50
102  Rain Suit  210         105  Boots    210   4742
103  Mittens    210         101  Jacket   210   50
                            103  Mittens  210   772

OUTER UNION result
ID   Name       Line   Supplier
101  Jacket     210
102  Rain Suit  210
103  Mittens    210
104  Gloves     210    50
105  Boots      210    4742
101  Jacket     210    50
103  Mittens    210    772

The OUTER UNION set operation concatenates the rows from the two query results. It keeps
noncommon columns in the result set, and it includes duplicate rows. This set operation is not an
ANSI set operation; it is a SAS extension.
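In PROC SQL, the aligned result shown above corresponds to OUTER UNION with the
CORRESPONDING (CORR) keyword; without CORR, the columns of the two queries are placed
side by side rather than matched by name. A sketch, with assumed table names:

proc sql;
   select * from work.products_a
   outer union corr
   select * from work.products_b;
quit;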


Set Operators Transformation


Query A result              Query B result
ID   Name       Line        ProdCode  Product  ProdLine
101  Jacket     210         104       Gloves   210
102  Rain Suit  210         105       Boots    210
103  Mittens    210


Without the CORRESPONDING option, columns in each result set are matched by position, not
by name.
In this case, the data in the ID column should be in the same column as the data in the ProdCode
column, the data in the Name column should be in the same column as the data in the Product
column, and the data in the Line column should be in the same column as the data in the ProdLine
column. We would want to leave the CORRESPONDING option off and match the columns by
position.


Set Operators Transformation


Query A result              Query B result
ID   Name       Line        Name    ID   Line
101  Jacket     210         Gloves  104  210
102  Rain Suit  210         Boots   105  210
103  Mittens    210


With the CORRESPONDING option, columns from each query result will be matched by name and
not by their position.
In this case, the data from the ID column should not be matched with the data in the Name
column, and the data from the Name column should not be matched with the data in the ID
column. To correctly match these columns, we need to match by name. To change the default
behavior of the transformation and match columns by name, we can turn on the CORRESPONDING
option.
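A PROC SQL sketch of matching by name (table names are assumptions):

proc sql;
   /* CORR matches columns by name, so ID, Name, and Line align correctly
      even though the two tables list them in a different order */
   select ID, Name, Line from work.products_a
   union corr
   select Name, ID, Line from work.products_b;
quit;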


Concatenating Tables with Set Operators Transformation


This demonstration illustrates using the Set Operators transformation to concatenate a number of
similarly structured tables.

1. If necessary, access SAS Data Integration Studio with Bruno’s credentials.


a. Select Start  All Programs  SAS  SAS Data Integration Studio.
b. Verify that the connection profile is My Server.
c. Click OK to close the Connection Profile window and open the Log On window.
d. Enter Bruno in the User ID field and Student1 in the Password field.
e. Click OK to close the Log On window.
2. Review the tables to be used in the job.
a. If necessary, click the Folders tab.
b. Navigate to Data Mart Development  Orion Reports  SQL Transforms.
c. Right-click the DIFT CHILDREN metadata table object and select Open.
d. Right-click the DIFT CLOTHESSHOES metadata table object and select Open.
e. Right-click the DIFT OUTDOORS metadata table object and select Open.
f. Right-click the DIFT SPORTS metadata table object and select Open.
g. Overlay the four tables and verify that each table has the same eight columns.

Note: Each table contains data for a different product line for Orion Star.


h. Note the number of rows in each table: 772 rows in the DIFT CHILDREN table, 2021 rows in
the DIFT CLOTHESSHOES table, 357 rows in the DIFT OUTDOORS table, and 2354 rows
in the DIFT SPORTS table.

If a concatenation of these four tables occurs, then the final result set should contain:
772 + 2021 + 357 + 2354 = 5504 rows

i. Select File  Close four times to close all four View Data windows.
3. Create a new job to concatenate the four tables, and to generate a report on the result.
a. If necessary, click the Folders tab.
b. Expand Data Mart Development  Orion Reports  SQL Transforms.
c. Select File  New  Job.
d. Enter DIFT Report on Product Information in the Name field.
e. Verify that the location is set to /Data Mart Development/Orion Reports/SQL Transforms.
f. Click OK. The Job Editor window appears.
4. Add the Set Operators transformation to the Diagram tab.
a. Click the Transformations tab.
b. Expand the SQL group of transformations.
c. Locate the Set Operators transformation.

d. Right-click the Set Operators transformation and select Add to Diagram.


5. Change the orientation of the diagram to a vertical layout.
a. Locate the toolset at the bottom of the Diagram tab.

b. Click (part of the tool) and select Top To Down.


c. Select File  Save to save the job metadata.


6. Verify that the temporary work table for the Set Operators transformation now appears under the
transformation and not to the right.

7. Specify the properties of the Set Operators transformation to concatenate the four tables
containing product line information with the Outer Union operator.
a. Right-click the Set Operators transformation and select Properties.
b. Click the Set Operators tab.

c. Add the input tables.


1) Click from the Queries pane.
a) Verify that the Table Query Selector window appears.
b) Click the Folders tab (in the Table Query Selector window).
c) Navigate to Data Mart Development  Orion Reports  SQL Transforms.
d) Click the DIFT CHILDREN table object.
e) Click OK to close the Table Query Selector window.
2) Click again from the Queries pane.
a) Click the Folders tab (in the Table Query Selector window).
b) Navigate to Data Mart Development  Orion Reports  SQL Transforms.
c) Click the DIFT CLOTHESSHOES table object.
d) Click OK to close the Table Query Selector window.
3) Click again from the Queries pane.
a) Click the Folders tab (in the Table Query Selector window).
b) Navigate to Data Mart Development  Orion Reports  SQL Transforms.
c) Click the DIFT OUTDOORS table object.
d) Click OK to close the Table Query Selector window.


4) Click again from the Queries pane.


a) Click the Folders tab (in the Table Query Selector window).
b) Navigate to Data Mart Development  Orion Reports  SQL Transforms.
c) Click the DIFT SPORTS table object.
d) Click OK to close the Table Query Selector window.
The final set of tables listed in the Queries pane should resemble the following:

The default set operator type is the Union. Each of these must be changed.
d. Change the set operator types to Outer Union.
1) Click the Union operator between DIFT CHILDREN and DIFT CLOTHESSHOES.
2) Select Outer Union in the Set operator type field.
3) Select the Match columns by name (CORRESPONDING) option.

Union drops noncommon columns. Outer union keeps all columns.


4) Click the Union operator between DIFT CLOTHESSHOES and DIFT OUTDOORS.
5) Select Outer Union in the Set operator type field.
6) Click the Match columns by name (CORRESPONDING) option.

7) Click the Union operator between DIFT OUTDOORS and DIFT SPORTS.
8) Select Outer Union in the Set operator type field.
9) Click the Match columns by name (CORRESPONDING) option.


e. Propagate the columns to the temporary result set and perform mappings.
1) Click the DIFT CHILDREN table in the Queries pane.
2) In the Table Expression area, on the Select tab, click (Propagate from sources to
targets).
All columns from the Source table are propagated to the Target table and mapped.

The Set Operators transformation does not automatically propagate columns.


3) Click the DIFT CLOTHESSHOES table in the Queries pane.
The target table has the needed columns. However, no mappings are defined.

The Set Operators transformation does not automatically map columns.


4) Click the Map all columns tool ( ) to map the columns from DIFT CLOTHESSHOES
to the corresponding target table columns.

5) Click the DIFT OUTDOORS table in the Queries pane.

6) Click the Map all columns tool ( ) to map the columns from DIFT OUTDOORS
to the corresponding target table columns.
7) Click the DIFT SPORTS table in the Queries pane.

8) Click the Map all columns tool ( ) to map the columns from DIFT SPORTS
to the corresponding target table columns.
f. Click OK to close the Set Operators Properties window.
8. Select File  Save to save the job metadata to this point.
9. Click the portion of the tool to autoalign the diagram nodes.

10. Select File  Save to save the job metadata to this point.
11. Register the work table.
a. Right-click the Set Operators work table and select Register Table.
b. On the General tab, enter DIFT Product Information as the name.
c. Verify that the location is /Data Mart Development/Orion Reports/SQL Transforms.
d. Click the Physical Storage tab.
e. Enter Product_Information as the physical name.


f. If necessary, select DIFT Orion Target Tables Library (located in Data Mart
Development  Orion Target Data).
g. Click OK to close the Register Table window.
12. Click File  Save to save the current metadata.

13. Run the job.


a. Click Run.
b. If necessary, click Yes in the Control Flow Warnings window.
14. Verify that the job completed successfully.
a. Click the Log tab.
b. Verify that the target table has 5504 rows and 8 columns.

15. View the data.


a. Click the Diagram tab.
b. Right-click the DIFT Product Information table object and select Open.


c. Select File  Close to close the View Data window.


16. Create a frequency report for the Product Information.
a. On the Transformations tab, expand the Analysis group.
b. Drag the One-Way Frequency transformation to the Diagram tab of the Job Editor.
c. Connect the One-Way Frequency transformation to the DIFT Product Information table.
d. Configure the One-Way Frequency transformation to generate a report on the frequency of
products within each Product_Line.
1) Right-click the One-Way Frequency transformation and select Properties.
2) If you choose, remove the generic description on the General tab.
3) Click the Options tab.
4) Verify that the Assign Columns options group is selected.
5) In the Select columns to perform a one-way frequency distribution on option, select
Product_Line, Product_Category, and Product_Group.
6) Click OK to close the One-Way Frequency Properties.
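Note: The One-Way Frequency transformation generates a PROC FREQ step. The effective
code for these selections resembles the following sketch; the libref is an assumption:

proc freq data=difttgt.Product_Information;
   tables Product_Line Product_Category Product_Group;
run;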
17. Select File  Save to save the job.
18. Rerun the job.
a. Click Run.
b. Verify that the job completes successfully.
c. Click the Output tab. The beginning of the report should resemble the following:

d. Click the Diagram tab.


19. Select File  Save to save the job.
20. Select File  Close to close the job.


Practice
In this practice, you create a job that will explore three SQL transformations: Create Table, Delete,
and Insert Rows.
4. Using the Create Table, Delete, and Insert Rows Transformations from the SQL Group
Create a job to load a new table of high-value suppliers. The table is used to create a report
showing counts by country of the high-value suppliers. The job flow should resemble the
following:

• Define a job to create the High Value Suppliers table.


- Name the job DIFT Report on High Value Suppliers.
- Store the job metadata object in /Data Mart Development/Orion Jobs.
- Use the DIFT Supplier Information external file object in the Orion Source Data folder
as the source.
- Add a File Reader transformation to read the DIFT Supplier Information external file.
- Add the Create Table transformation from the SQL transformations group.
- Register the work table of the Create Table transformation as the target table with the
following attributes:

Metadata Name: High Value Suppliers

Metadata Folder: /Data Mart Development/Orion Reports/SQL Transforms

Physical Name: Highvalue_Suppliers

Library: DIFT Orion Target Tables Library

- Configure the Create Table transformation to load only the rows from the source table
with Supplier_ID values greater than 12000. These are the high-value suppliers.

Question: How many rows and columns are in the resultant table?
Answer: ______________________________________________


• The report should include only the international high-value suppliers. Delete the domestic
suppliers from the High Value Suppliers table.
- Add a Delete transformation from the SQL transformations group to the job.
Note: The Delete transformation points to the table that it modifies, in this case the
High Value Suppliers table.
- Configure the Delete transformation to delete rows from the High Value Suppliers table.
Delete the rows with the Country value equal to United States.
Hint: Use quotation marks around the value United States.
Question: How many rows and columns are now in the resultant table?
Answer: ______________________________________________

• It was found that additional suppliers with Supplier_ID values greater than 10000 are high-
value suppliers. Insert the additional rows from the DIFT Supplier Information table into the
High Value Suppliers table.
- Add an Insert Rows transformation from the SQL transformations group to the job.
- Configure the Insert Rows transformation to select the rows from the DIFT Supplier
Information table with Supplier_ID between 10000 and 12000 and Country not equal
to United States and insert them into the High Value Suppliers table.
Question: How many rows and columns are now in the resultant table?
Answer: ______________________________________________

• Create a one-way frequency report for the High Value Suppliers values. Show the number
of suppliers for each country.

Note: The following code can be added as preprocessing code (precode) for this last
transformation to specify the characters to draw the horizontal and vertical borders of
the table:
options formchar="|---|-|---|";


6.3 Creating Custom Transformations

Extending SAS Data Integration Studio


Custom transformations give data integration developers the flexibility
to extend the functional capabilities of SAS Data Integration Studio.


All transformations, including custom ones, generate SAS code to accomplish the following:
• extract data
• transform data
• load data into data stores
• create reports


Types of Transformation Templates


The Transformation tree contains two types of transformation templates.

Java Plug-in Transformation Templates: created with the Java programming language

Generated Transformation Templates: created with the New Transformation Wizard and SAS code


Java plug-in transformation templates were created by SAS Data Integration Studio developers
using Java programming. SAS Data Integration Studio users can use a wizard to create generated
transformation templates with only SAS programming skills.
Examples of Java plug-in transformation templates include most of the templates in the Data folder,
such as Sort, Splitter, and User Written. Examples of generated transformation templates include
Summary Tables and Summary Statistics. These transformations are default transformations
included with SAS Data Integration Studio.


Generated Transformations

Generated transformation templates in the Transformations tree are identified by the icon
associated with the transformation.


There are many generated transformation templates that come by default with SAS Data Integration
Studio. Two of these include Summary Statistics and Summary Tables. All of the generated
transformation templates are identifiable by their icons in the Transformations tree.


Generated Transformations

(Screen captures: the pop-up menu for a generated transformation and the pop-up menu for
a Java plug-in transformation)


Right-clicking a generated transformation yields a pop-up menu with several options that are not
available for Java transformations:
• Properties
• Analyze
• History
• Copy
• Import
• Export
• Archive
• Compare
• Find In
These options are the same options available for other metadata objects users create in the SAS
Metadata Repository. Generated transformations can be managed in the same way other metadata
objects such as libraries or tables can be managed.


New Transformation Wizard


The New Transformation Wizard guides you through the steps for creating
a generated transformation template. It includes the following steps:
• define a name and location
• enter the SAS source code for the custom transformation
• define the options
• manage the numbers of sources and targets

After the transformation template is saved, it is available in the Transformations tree for use
in any job.


Transformation Code and Macro Variables


SAS source code for a transformation template typically includes macro
variables.
options &options;
title "&title";
proc gchart data=&syslast;
   vbar &classvar1 /
      group=&classvar2
      sumvar=&analysisvar;
run;
quit;

In the New Transformation Wizard, these macro variables are associated with transformation
options.



Values for Transformation Options

The values entered on the Options tab are substituted for the corresponding macro variables
(&classvar1, &classvar2, and &analysisvar) in the transformation code:

proc gchart data=&syslast;
   vbar &classvar1 /
      group=&classvar2
      sumvar=&analysisvar;
run;
quit;

When the transformation has been created, users can select values on the Options tab in the
transformation properties to configure the transformation. These values will be assigned to the
corresponding macro variables in the code.

Generated Code
The generated code for the transformation includes a %LET statement for
each transformation option.
%let syslast = yy.xx;
%let options = ;
%let classvar1 = Customer_Age_Group;
%let classvar2 = Customer_Gender;
%let analysisvar = Quantity Ordered;
%let title = Sum of Quantity across Gender and AgeGroup;

Each %LET statement creates a macro variable and assigns the value.



New Transformation: General Settings

General settings such as name and description are specified on the initial page of the
New Transformation Wizard.


New Transformation: SAS Code

The SAS Code page enables direct entry of SAS code. Typically, the SAS code is developed
and tested elsewhere, copied, and then pasted into this window.



New Transformation: Options

The Options window enables the specification of options and option groups.


Options and option groups are defined in this window. The value for the option Name must exactly
match the macro variable it is associated with in the SAS source code so that when users configure
the transformation, the value chosen for each option is assigned to the corresponding macro
variable.

New Transformation: Options

Options are also referred to as prompts or parameters.



New Transformation: Options


Options are surfaced to users on the Options tab of the transformation when they use the
transformation in a job. Groups become the organizational categories in the left-hand column of the
Options tab, and prompts become options.

New Transformation: Inputs and Outputs

The Inputs and Outputs areas of this page determine whether the new transformation is to
accept source and target table connections when used in a job, and how many.



New Transformation: Finish

The review page of the wizard is used for checking and verifying various settings for the new
transformation. Finishing this page creates the new transformation.



Creating a New Transformation


In this demonstration, a generated transformation that produces a summary report on customer
order information is created. The output includes a tabular component as well as a bar chart.
1. If necessary, access SAS Data Integration Studio using Bruno’s credentials.
a. Select Start  All Programs  SAS  SAS Data Integration Studio.
b. Select My Server as the connection profile.
c. Click OK to close the Connection Profile window. The Log On window appears.
d. Enter Bruno in the User ID field and Student1 in the Password field.
e. Click OK to close the Log On window.
2. Create a new folder.
a. Click the Folders tab in the tree view area.
b. Expand Data Mart Development  Orion Reports.
c. Right-click the Orion Reports folder and select New  Folder.
d. Enter Custom Transformations as the name.

3. Create a new transformation.


a. Click the Custom Transformations folder.
b. Select File  New  Transformation.
c. Enter Summary Table and Vertical Bar Chart as the name.
d. Verify that the location is set to /Data Mart Development/Orion Reports/Custom
Transformations.


e. Enter User Defined as the transformation category.


The general information for the new transformation should resemble the following:

Note: The generated transformation will appear in both the Transformations tree and the
Folders tree. The Location field will determine where the transformation will be
stored in the metadata folder structure. The Transformation Category field will
determine in which grouping the transformation will be stored in the Transformations
tree.
f. Click Next.
g. Add the SAS source code.
1) Access Windows Explorer.
2) Navigate to D:\Workshop\dift\SASCode.
3) Right-click TabulateGraph.sas and select Edit with Notepad++.
4) Right-click in the background of the Notepad++ window and select Select All.
5) Right-click in the background of the Notepad++ window and select Copy.
6) Select File  Exit to close the Notepad++ window.
7) Access the New Transformation Wizard.
8) Right-click in the SAS Code pane and select Paste.
The SAS Code panel should resemble the following:


The following represents the SAS code found in the TabulateGraph.sas program:
%macro TabulateGChart;
   options mprint;
   ods listing close;
   %if (%quote(&options) ne) %then %do;
      options &options;
   %end;
   %if (%sysfunc(fileexist(&path))) %then %do;
      %if (%quote(&path) ne) %then %do;
         ods html path="&path" gpath="&path"
         %if (%quote(&filename) ne) %then %do;
            file="&filename..html" ;
         %end;
      %end;
      %if (%quote(&tabulatetitle) ne) %then %do;
         title1 "&tabulatetitle";
      %end;
      proc tabulate data=&syslast;
         class &classvar1 &classvar2;
         var &analysisvar;
         table &classvar1*&classvar2,
               &analysisvar*(min="Minimum"*f=comma7.2
               mean="Average"*f=comma8.2 sum="Total"*f=comma14.2
               max="Maximum"*f=comma10.2);
      run;
      %if (%quote(&gcharttitle) ne) %then %do;
         title height=15pt "&gcharttitle";
      %end;
      goptions dev=png;
      proc gchart data=&syslast;
         vbar &classvar1 / sumvar=&analysisvar group=&classvar2
              clipref frame type=SUM outside=SUM coutline=BLACK;
      run;
      quit;
      ods html close;
      ods listing;
   %end;
   %else %do;
      %if &sysscp = WIN %then %do;
         %put ERROR: <text omitted; refer to file for complete text>.;
      %end;
      %else %if %index(*HP*AI*SU*LI*,*%substr(&sysscp,1,2)*) %then %do;
         %put ERROR: <text omitted; refer to file>
      %end;
      %else %if %index(*OS*VM*,*%substr(&sysscp,1,2)*) %then %do;
         %put ERROR: <text omitted; refer to file >.;
      %end;
   %end;
%mend TabulateGChart;

%TabulateGChart;

Note: Some comments and text were removed from the above display of code.


This SAS code generates a PROC GCHART chart and a PROC TABULATE table. When the
transformation is complete, users will specify the values for the macro variables in this code by
configuring the transformation options. Therefore, in the next steps, we add these macro variables
as options and define what types of values are acceptable inputs. The macro variables in this code
that will become options are options, path, filename, tabulatetitle, classvar1, classvar2,
analysisvar, and gcharttitle.

options This macro variable will allow users to enter a space-separated list of global
options.

path This macro variable will specify the path where the report will be created.

filename This macro variable will specify the filename of the report. Notice that we
have “hardcoded” the file to be an HTML file. Users should not add an
HTML extension to their filename because one is already present in the
code.

tabulatetitle This macro variable will specify the title for the table.

classvar1 This macro variable will be used in both the PROC GCHART and the
PROC TABULATE steps as a classification variable; it will be the top-level
grouping in the table and chart.

classvar2 This macro variable will be used in both the PROC GCHART and the
PROC TABULATE steps as a classification variable; it will be the secondary
grouping for the table and chart. For example, if classvar1 is set to Country
and classvar2 is set to Gender, the chart and table will be grouped first by
Country, and then will be grouped by Gender within each Country.

analysisvar This macro variable will specify the numeric variable for which the chart and
table will generate statistics.

gcharttitle This macro variable will specify the title for the chart.


h. Click Next.
The Options window is used to define the options for this transformation.

i. Define metadata for three groups.


We will use three groups to organize three different sets of options: data items, titles, and
other options. These groups will surface as organizational categories in the left -hand side of
the Options tab in the Properties of the finished transformation.
1) Click New Group.
2) Enter Data Items in the Displayed text field.

3) Click OK.
4) Click New Group.
5) Enter Titles in the Displayed text field.
6) Click OK.
7) Click New Group.
8) Enter Other Options in the Displayed text field.
9) Click OK.


j. Define metadata for the options in the Data Items group.


1) Define metadata for the first classification variable.
a) Click the Data Items group.
b) Click New Prompt.
c) Enter classvar1 in the Name field.
Note: The Name field for each option must exactly match the name of the macro
variable that it references. SAS assigns the value the user selects for the
option to the name typed in the Name field. Therefore, if the name of the
macro variable does not match the name typed in the Name field, the macro
variable will not receive a value.

d) Enter Grouping Column for Table and Chart in the Displayed text field.
Note: The displayed text field is required.
e) Enter The column selected for this option will be used as a grouping column
in the TABULATE table and in the GCHART graph. in the Description field.
Note: The description field is not required, but it is highly recommended, giving
users a more detailed idea of the option’s purpose in the transformation.
f) Click Requires a non-blank value in the Options area.

Note: This forces the user to choose a value for this option before the
transformation can generate its code.


g) Click the Prompt Type and Values tab.

h) Select Data source column for the Prompt type field.


Note: The Data source column Prompt type allows users to pick from a list of
columns. Users will see a list of available columns based on the way we
configure this option. For example, if we choose the Select from source
option, users will be able to select columns only from the source. If we allow
a user to select only numeric columns, character columns will not be shown.
We chose this prompt type because the macro variable classvar1 expects the
name of a variable from the input data set. This is just one of many prompt
types; others that we will use include Text and Directory, but there are many
more available, including Date range, Boolean, and Timestamp.
i) Verify that Select from source is selected in the Columns to select from area.
j) Verify that all data types are selected.
k) Click Limit number of selectable columns.
l) Enter 1 in the Minimum field.


m) Enter 1 in the Maximum field.

n) Click OK to close the New Prompt window.


2) Define metadata for the second classification variable.
a) Click the Data Items group.
b) Click New Prompt.
c) Enter classvar2 in the Name field.
d) Enter Subgrouping Column for Table and Chart in the Displayed text field.
e) Enter The column selected for this option will be used as a subgrouping
column in the TABULATE table and in the GCHART graph. in the Description
field.
f) Click Requires a non-blank value in the Options area.
g) Click the Prompt Type and Values tab.
h) Select Data source column for the Prompt type field.
i) Verify that Select from source is selected in the Columns to select from area.
j) Verify that all data types are selected.
k) Click Limit number of selectable columns.
l) Enter 1 in the Minimum field.
m) Enter 1 in the Maximum field.
n) Click OK to close the New Prompt window.
3) Define metadata for the analysis variable, called analysisvar in the code.
a) Click the Data Items group.
b) Click New Prompt.
c) Enter analysisvar in the Name field.
d) Enter Analysis Column for Table and Chart in the Displayed text field.


e) Enter The column selected for this option will be used as an analysis column
in the TABULATE table and it will determine the heights of the bars in the
GCHART chart. in the Description field.
f) Click Requires a non-blank value in the Options area.
g) Click the Prompt Type and Values tab.
h) Select Data source column for the Prompt type field.
i) Verify that Select from source is selected in the Columns to select from area.
j) Clear Character in the Data types area.
k) Click Limit number of selectable columns.
l) Enter 1 in the Minimum field.
m) Enter 1 in the Maximum field.
n) Click OK to close the New Prompt window.
The three options in the Data Items group should resemble the following:

k. Define metadata for the options in the Titles group.


1) Define metadata for the title to be used with TABULATE output.
a) Click the Titles group.
b) Click New Prompt.
c) Enter tabulatetitle in the Name field.
d) Enter Title for Table in the Displayed text field.
e) Enter Specify the text to be used as the title for the TABULATE output. in the
Description field.
f) Click the Prompt Type and Values tab.
g) Verify that Text is specified for the Prompt type field.
Note: The Text Prompt type allows users to enter text, select values from a static
list, or select values from a dynamic list. Here, we will simply use the default
configuration and allow users to enter text, but the list options allow for very
complex, customizable transformations.
h) Enter Analyzing &analysisvar across &classvar1 and &classvar2 in the Default
value field.
Note: This value will be populated automatically for this option. The user can
change the value if they wish.
i) Accept the default values for the remaining fields.


j) Click OK to close the New Prompt window.


2) Define metadata for the title to be used with the GCHART output.
a) Click the Titles group.
b) Click New Prompt.
c) Enter gcharttitle in the Name field.
d) Enter Title for Graph in the Displayed text field.
e) Enter Specify the text to be used as the title for the GCHART output. in the
Description field.
f) Click the Prompt Type and Values tab.
g) Verify that Text is specified for the Prompt type field.
h) Enter Sum of &analysisvar for &classvar1 grouped by &classvar2 in the Default
value field.
i) Accept the default values for the remaining fields.
j) Click OK to close the New Prompt window.
The two options in the Titles group should resemble the following:

l. Define metadata for the options in the Other Options group.


1) Define metadata for SAS system options.
a) Click the Other Options group.
b) Click New Prompt.
c) Enter options in the Name field.
d) Enter SAS system options in the Displayed text field.
e) Enter Specify a space separated list of global SAS system options. in the
Description field.
f) Click the Prompt Type and Values tab.
g) Verify that Text is specified for the Prompt type field.
h) Enter nodate nonumber ls=80 in the Default value field.
i) Accept the default values for the remaining fields.
j) Click OK to close the New Prompt window.
2) Define metadata for the path of the HTML file to be created.
a) Click the Other Options group.
b) Click New Prompt.
c) Enter path in the Name field.
d) Enter Output file path in the Displayed text field.
e) Enter Select the path for the HTML file to be created. in the Description field.


f) Click Requires a non-blank value.


g) Click the Prompt Type and Values tab.
h) Select File or directory for the Prompt type field.
Note: The File or directory Prompt type allows users to browse to a file or folder
location.
i) Verify that the File or directory type is set to Output.
j) Verify that the Selection type is set to Directories.
k) Click the Browse button next to Default value.

(1) Navigate to the D:/workshop/dift folder.


(2) Click the reports folder.

Note: If the reports folder does not exist, create it using the tool.
(3) Click OK to close the Select a Directory window.
l) Click OK to close the New Prompt window.


3) Define metadata for the name of the HTML file to be created.


a) Click the Other Options group.
b) Click New Prompt.
c) Enter filename in the Name field.
d) Enter Output file name in the Displayed text field.
e) Enter Enter a name for the HTML file that will be created. Do NOT enter the
HTML file extension! in the Description field.
f ) Click Requires a non-blank value.
g) Click the Prompt Type and Values tab.
h) Verify that Text is specified for the Prompt type field.
i) Accept the default values for the remaining fields.
j) Click OK to close the New Prompt window.
The three options in the Other Options group should resemble the following:

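Note: At run time, each prompt surfaces in the transformation's source code as a macro variable with the same name as the prompt. The following is a minimal sketch (not the exact code that SAS Data Integration Studio generates) of how the prompts defined above might resolve; the values shown are hypothetical user selections.

   /* Hypothetical %LET assignments, similar in spirit to the code  */
   /* that DI Studio generates from the supplied prompt values      */
   %let classvar1   = Customer_Age_Group;
   %let classvar2   = Customer_Gender;
   %let analysisvar = Quantity;
   %let gcharttitle = Sum of &analysisvar for &classvar1 grouped by &classvar2;

   /* The transformation source code can then reference the prompts directly */
   title "&gcharttitle";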
n. Test the prompts.


1) Click Test Prompts.
The Test the Prompts window appears with the Data Items group selected.

a) Verify that the three items in the Data Items group are all required, as indicated by
the asterisk (*).
b) Verify that the descriptions entered for each of the parameters are displayed.


c) Verify that clicking Browse opens a dialog box to navigate the SAS Folders to a data
source from which a column can be selected.
2) Click Titles in the selection pane.
The two options in the Titles group are displayed with their default values.

3) Click Other Options in the selection pane.


The three options in the Other Options group are displayed. Notice the default
specification for SAS system options and output file path.

4) Click Cancel to close the Test the Prompts window.


o. Click Next.


p. Clear Transformation supports outputs in the Outputs area.

Note: The Inputs and Outputs options set limits on the number of input ports and output
ports.
q. Click Next.

Note: If a transformation is used in a job and then updated, those changes will affect every
job where the transformation has been used. Check the impact of transformation
updates before you make any changes.
r. Click Finish.


4. Verify that the transformation and new grouping appear in the Transformations tree and that the
transformation also appears in the Folders tree.
a. Click the Transformations tab.
b. Verify that a new grouping User Defined appears.
c. Expand the new grouping and verify that the new transformation exists.

d. Click the Folders tab.


e. Expand the Data Mart Development  Orion Reports  Custom Transformations folder.
f. Verify that the new transformation also appears in the Folders tree in the location specified in
the New Transformation Wizard.


Practice
In this practice, you create a new transformation that generates a pie chart showing the sum of an
analysis variable across a classification variable.
5. Creating a Graphical Transformation
• Create a new folder called Custom Transformations in the Data Mart Development 
Orion Reports folder.
• Create a new transformation called Generate Pie Chart.
– Name the transformation Generate Pie Chart.
– Place the transformation in the Data Mart Development  Orion Reports 
Custom Transformations folder and place it in the User Defined grouping on the
transformations tree.
Note: You can create the new User Defined grouping by typing User Defined in the
Transformation Category box.
– Use the code from GeneratePieChart.sas (located in D:\Workshop\dift\SASCode) as
the transformation’s source code.
– Define an option group called Data Items.
– Define the following options for the Data Items group:

Name: classvar

Displayed text: Grouping Variable for Pie Chart

Description: Select the column used to generate slice groupings.

Required: Yes

Prompt type: Data Source Column

Other information: Allow all data types; allow only one column selection.

Name: analysisvar

Displayed text: Analysis Variable for Pie Chart

Description: Select the column whose sum across the chosen category will
be used to determine the size of a slice.

Required: Yes

Prompt type: Data Source Column

Other information: Allow only numeric data types; allow only one column
selection.


– Define a group called Titles and Options.


– Define the following options for the Titles and Options group:

Name: charttitle

Displayed text: Title for Pie Chart

Description: Specify the text to be used as the title for the pie chart.

Required: No

Prompt type: Text

Default Value: &analysisvar by &classvar

Name: path

Displayed text: Output file path

Description: Select the path for the PDF file to be created.

Required: Yes

Prompt type: File or Directory; select Directory as the Selection type


Default Value: D:/Workshop/dift/reports

Name: filename

Displayed text: Output file name

Description: Enter a name for the PDF file that will be created. Do NOT
enter the PDF file extension!

Required: Yes

Prompt type: Text


– Specify that the transformation supports one input and no output tables.
– Verify that the transformation appears in two places – in the correct folder on the Folders
tab and under the correct grouping on the Transformations tab.
– Verify the transformation surfaces on the Folders tab.

– Verify the transformation surfaces on the Transformations tab.


Using a New Transformation


This demonstration illustrates using the previously created Summary Table and Vertical Bar Chart
transformation to generate an HTML report.
1. If necessary, access SAS Data Integration Studio using Bruno’s credentials.
a. Select Start  All Programs  SAS  SAS Data Integration Studio.
b. Select My Server as the connection profile.
c. Click OK to close the Connection Profile window. The Log On window appears.
d. Enter Bruno in the User ID field and Student1 in the Password field.
e. Click OK to close the Log On window.
2. If necessary, prepare the source table for the desired job.
a. Load the data for three target tables.
1) Click the Folders tab.
2) Expand Data Mart Development  Orion Jobs.
3) Run the DIFT Populate Customer Dimension Table job.
a) Right-click the DIFT Populate Customer Dimension Table job and select Open.
b) Select Actions  Run.
c) Verify that the job ran successfully.
d) Select File  Close to close the job.
4) Run the DIFT Populate Order Fact Table job.
a) Right-click the DIFT Populate Order Fact Table job and select Open.
b) Select Actions  Run.
c) Verify that the job ran successfully.
d) Select File  Close to close the job.
5) Run the DIFT Populate Customer Order Information Table job.
a) Right-click the DIFT Populate Customer Order Information Table job and select
Open.
b) Select Actions  Run.
c) Verify that the job ran successfully.
d) Select File  Close to close the job.
3. Create a job to use the new transformation.
a. If necessary, click the Folders tab.
b. Expand Data Mart Development  Orion Reports.
c. Click the Custom Transformations folder.
d. Select File  New  Job. The New Job window appears.
e. Enter DIFT Report and Graphic for Customer Orders as the name.


f. Verify that the location is set to /Data Mart Development/Orion Reports/


Custom Transformations.
g. Click OK. The job editor window appears.
4. Add source table metadata to the diagram for the process flow.
a. Click the Folders tab.
b. Navigate to the Data Mart Development  Orion Target Data folder.
c. Drag the DIFT Customer Order Information table object to the job editor.
5. Add the Summary Table and Vertical Bar Chart transformation to the process flow.
a. Click the Transformations tab.
b. Expand the User Defined group.
c. Drag the Summary Table and Vertical Bar Chart transformation to the Diagram tab of the
job editor.
6. Connect DIFT Customer Order Information to the Summary Table and Vertical Bar Chart
transformation.

7. Click the Errors symbol. The Errors dialog box shows that required options have not been
specified.

8. Click X to close the Errors dialog box.


9. Select File  Save to save the job metadata to this point.
10. Specify properties for the Summary Table and Vertical Bar Chart transformation.
a. Right-click the Summary Table and Vertical Bar Chart transformation and select
Properties.
b. Click the Options tab.


c. Verify that the Data Items group is selected in the selection pane.

1) Click Browse for the Grouping Column for Table and Chart option.
a) Select Customer Age Group in the Select a Data Source Item window.
b) Click OK to close the Select a Data Source Item window.
2) Click Browse for the Subgrouping Column for Table and Chart option.
a) Select Customer Gender in the Select a Data Source Item window.
b) Click OK to close the Select a Data Source Item window.
3) Click Browse for the Analysis Column for Table and Chart option.
a) Select Quantity Ordered in the Select a Data Source Item window.
b) Click OK to close the Select a Data Source Item window.

d. Click the Titles group in the selection pane.


e. Verify that the default values are specified. These default values will be used for this instance
of the transformation.

f. Click the Other Options group in the selection pane.

1) Keep the default value under SAS system options.


2) Keep the default value under Output file path.
3) Enter CustomerOrderReport under Output file name.

g. Click OK to close the Summary Table and Vertical Bar Chart Properties window.
11. Select File  Save to save the job metadata to this point.
12. Run the job.
a. Right-click in the background of the job and select Run.


b. If necessary, click the Status tab in the Details area.

We get a warning.
c. Click the Warnings and Errors tab.

The warning message tells us the labels are too wide for the bars.
13. View the generated HTML file.
a. Open Windows Explorer.
b. Navigate to the D:\Workshop\dift\reports folder.
c. Double-click CustomerOrderReport.html.
d. If necessary, click X to close the security message in the browser.
e. If necessary, right-click on the chart and select View Image to view the chart.


f. The tabular report is at the top of the HTML output. Scroll down for the graphic report.

Note: The values of statistics might vary due to changing customer ages over time.
Notice that the Customer_Age_Group values greater than 75 have their own bars. The
custom format applied to Customer_Age_Group does not cover a large enough range. If we
fix the format, the bars should be wide enough that the labels will fit.


14. Select File  Close to close the browser window.


We will use code from the SASCode folder to fix the format.
15. Copy the necessary code.
a. Open Windows Explorer.
b. Navigate to the D:\Workshop\dift\SASCode folder.
c. Right-click UpdateAgegroupFormat.sas and select Edit with Notepad++.

This code adds a row to the data set that builds the Agegroup format, with a new range
covering ages 76 to 100 (a sketch of comparable logic appears after these steps).
d. Right-click in the background of the file and select Select All.
e. Right-click in the background of the file and select Copy.
f. Close UpdateAgegroupFormat.sas.
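Note: The exact contents of UpdateAgegroupFormat.sas are not reproduced here, but a program
that extends a format range in this way might look roughly like the following sketch. The
WORK format library and the control data set name are assumptions.

   /* Sketch only: extend the Agegroup format so ages 76-100 share one range */
   proc format library=work cntlout=work.agefmt;   /* dump the current ranges */
      select agegroup;
   run;

   data work.agefmt;                               /* append one new range row */
      set work.agefmt end=last;
      output;
      if last then do;
         start = '76';
         end   = '100';
         label = '76-100';
         output;
      end;
   run;

   proc format library=work cntlin=work.agefmt;    /* rebuild the format */
   run;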
16. Return to SAS Data Integration Studio.
17. Run the code.
a. Select Tools  Code Editor.
b. Right-click in the Code Editor window and select Paste.
c. Click Run.


18. Click the Output tab and verify the Agegroup format was successfully updated.

There is now a category for people with an age between 76 and 100.
19. Close the Code Editor window. Do not save your changes.
20. Re-generate affected source tables.
The format for Customer_Age_Group is applied in the DIFT Populate Customer Dimension Table
job; therefore, that job must be re-run to apply the new format. DIFT Customer Dimension is a
source table for the DIFT Customer Order Information table, so the DIFT Populate Customer
Order Information Table job also needs to be re-run. The DIFT Order Fact table is unaffected
because it does not contain customer data.
a. If necessary, click the Folders tab.
b. Run the DIFT Populate Customer Dimension Table job.
1) Right-click the DIFT Populate Customer Dimension Table job and select Open.
2) Select Actions  Run.
3) Verify that the job ran successfully.
4) Select File  Close to close the job.
c. Run the DIFT Populate Customer Order Information Table job.
1) Right-click the DIFT Populate Customer Order Information Table job and select Open.
2) Select Actions  Run.
3) Verify that the job ran successfully.


4) Right-click on the DIFT Customer Order Information table and select Open.

The new format now applies even to customers over 75.


5) Close the View Data window.
6) Select File  Close to close the job.
21. Re-run the DIFT Report and Graphic for Customer Orders job and verify the results.
a. Click Run.

This time, there are no warnings.


22. View the generated HTML file.
a. Open Windows Explorer.
b. Navigate to the D:\Workshop\dift\reports folder.
c. Double-click CustomerOrderReport.html.
d. If necessary, click X to close the security message in the browser.
e. If necessary, right-click on the chart and select View Image to view the chart.


f. The tabular report is at the top of the HTML output. Scroll down for the graphic report.

Because there are fewer bars, each bar is now wide enough for its label. The results are also
more useful now that customers aged 76 to 100 are captured in a single group.


23. Select File  Close to close the browser window.


24. Return to SAS Data Integration Studio.
25. Select File  Close to close the DIFT Report and Graphic for Customer Orders job.


Practice
The new transformation from the previous practice is used to generate a PDF file with a pie chart
analyzing sums of quantity ordered by country. The final job flow will resemble the following:

Use Bruno’s credentials for the practice.


6. Using the Generate Pie Chart Transformation in a Job
• If necessary, prepare the source data for the job by running the DIFT Populate Order Fact
Table and DIFT Populate Customer Dimension Table jobs, and then running the DIFT
Populate Customer Order Information Table job.
Note: The error “File DIFTTGT.CUSTOMERORDERINFO.DATA does not exist” indicates
that the source data has not been created and that the above step must be completed
to create the table.
• Create a new job and name it DIFT Generate Country Orders Pie Chart.
• Add the DIFT Customer Order Information table to the job as the source table (from the
/Data Mart Development/Orion Target Data folder).
• Add the Generate Pie Chart transformation to the job and connect it to the DIFT Customer
Order Information table.
• Configure the Generate Pie Chart transformation as follows:

Option Group          Option                            Value

Data Items            Grouping Variable for Pie Chart   Customer Country (Customer_Country)
                      Analysis Variable for Pie Chart   Quantity Ordered (QUANTITY)

Titles and Options    Title for Pie Chart               Keep the default value,
                                                        &analysisvar by &classvar
                      Output file path                  Keep the default value,
                                                        D:\Workshop\dift\reports
                      Output file name                  CountryQuantityPieChart

• Run the job.


• View the PDF file. The output should resemble the following:


6.4 Solutions
Solutions to Practices
1. Refreshing Course Metadata to Current Point
a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
1) Select Start  All Programs  SAS  SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK to close the Connection Profile window and open the Log On window.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.
b. Click the Folders tab.
c. Expand Data Mart Development.
d. Delete all subfolders under Data Mart Development.
1) Click the first folder under Data Mart Development.
2) Press and hold the Shift key and click the last folder under Data Mart Development.
3) Select Edit  Delete.
4) If any Confirm Delete windows appear, click Yes.
e. Import fresh metadata.
1) Select the Data Mart Development folder.
2) Select File  Import  SAS Package.
3) Click Browse next to Enter the location of the input SAS package file.
a) If necessary, navigate to D:\Workshop\dift\solutions.
b) Click DIFT_Ch7Ex1.spk.
c) Click OK.
4) Verify that All Objects is selected.
5) Click Next.
6) Verify that all four Orion folders are selected.
7) Click Next.
8) Verify four different types of connections are shown to need to be established.
9) Click Next.
10) Verify that SASApp is listed for both the Original and Target fields.
11) Click Next.
12) Verify both servers (ODBC Server and Oracle Server) have matching values for the
Original and Target fields.
13) Click Next.


14) Verify that both file paths have matching values for the Original and Target fields.
15) Click Next.
16) Verify that all directory paths have matching values for the Original and Target fields.
17) Click Next. The Summary pane surfaces.
18) Click Next.
19) Verify the import process completed successfully.
20) Click Finish.

2. Examining and Executing Imported Job: DIFT Populate Product Order Information Tabl e
a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
1) Select Start  All Programs  SAS  SAS Data Integration Studio.
2) Verify that the connection profile is My Server.
3) Click OK to close the Connection Profile window and access the Log On window.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.
b. Examine the imported job DIFT Populate Product Order Information Table.
1) Click the Folders tab.
2) Expand Data Mart Development  Orion Jobs.
3) Right-click DIFT Populate Product Order Information Table and select Open.
4) Verify that the Details pane is showing.
5) Single-click the DIFT Product Order Information table object.
6) Click the Columns tab in the Details pane.

Question: How many columns are defined for DIFT Product Order Information?
Answer: 23

7) Click the Join transformation (the one numbered 4).


8) If necessary, click the Mappings tab in the Details pane.

Question: What transformation is populating data in the columns Year, Quarter,


Month, and DOM?
Answer: Join

Question: What column from which table is populating data in the columns Year,
Quarter, Month, and DOM?
Answer: The OrderDate column from the OrderFact table

c. Select Actions  Run.


d. Click the Log tab.

Question: How many rows were created for DIFT Product Order Information?
Answer: 951,669

e. In the job flow diagram, right-click the DIFT Product Order Information table and select
Open.

Question: Do the values calculated for Year, Quarter, Month, and DOM look appropriate?
Answer: Yes

f. Select File  Close to close the View Data window.


g. Select File  Close to close the opened job.


3. Creating the Product Orders Summary Report


a. Create the initial job metadata.
1) Click the Folders tab.
2) Expand Data Mart Development  Orion Reports  Extract and Summary.
3) Verify that the Extract and Summary folder is selected.
4) Select File  New  Job. The New Job window appears.
5) Enter DIFT Report for 2011 Clothes-Shoes Products in the Name field.
6) Verify that the location is set to /Data Mart Development/Orion Reports/ Extract and
Summary.
7) Click OK. The Job Editor window appears.
b. Add source table metadata to the diagram for the job flow.
1) If necessary, click the Folders tab.
2) Expand Data Mart Development  Orion Target Data.
3) Drag the DIFT Product Order Information table object to the Diagram tab of the Job
Editor.
c. Add the Extract transformation to the job flow.
1) Click the Transformations tab.
2) Expand the SQL grouping and locate the Extract transformation template.
3) Drag the Extract transformation to the Diagram tab of the Job Editor. Place the
transformation next to the table object.
4) Connect the DIFT Product Order Information table object to the Extract transformation.
d. Add the Sort transformation to the job flow.
1) Verify that the Transformations tab is selected.
2) Expand the Data grouping and locate the Sort transformation template.
3) Drag the Sort transformation to the Diagram tab of the Job Editor. Place the
transformation next to the Extract transformation.
4) Connect the work table from the Extract transformation to the Sort transformation.
e. Add the Summary Tables transformation to the job flow.
1) Verify that the Transformations tab is selected.
2) Expand the Analysis grouping and locate the Summary Tables transformation template.
3) Drag the Summary Tables transformation to the Diagram tab of the Job Editor. Place
the transformation next to the table object.
4) Connect the Sort transformation to the Summary Tables transformation.
f. Select File  Save to save the diagram and job metadata to this point.
g. Specify properties for the Extract transformation.
1) Right-click the Extract transformation and select Properties.
2) Click the Where tab.


3) In the bottom portion of the Where tab, click the Data Sources tab.
4) Expand ProdOrders table.
5) Select Product_Line.
6) Click Add to Expression.
7) In the Expression Text area, type = “Clothes & Shoes” &.
8) In the bottom portion of the Where tab, click the Data Sources tab and if necessary,
expand the ProdOrders table.
9) Select Year.
10) Click Add to Expression.
11) In the Expression Text area, type = 2011.
The final expression should resemble the following:

12) Click OK to close the Extract Properties window.


h. Select File  Save to save the diagram and job metadata to this point.
i. Specify properties for the Sort transformation.
1) Right-click the Sort transformation and select Properties.
2) Click the Sort By Columns tab.
3) Double-click Product_Line in the Available columns list box to move it to the Sort
by columns list box.
4) Double-click Quarter in the Available columns list box to move it to the Sort by columns list
box.

5) Click OK to close the Sort Properties window.


j. Select File  Save to save the diagram and job metadata to this point.
k. Specify properties for the Summary Tables transformation.
1) Right-click the Summary Tables transformation and select Properties.
2) On the General tab, remove the default description.
3) Click the Options tab.
4) Verify that the Assign columns options group is selected in the left pane.
a) Locate the Select analysis columns (VAR statement) option in the right pane.

b) Click for this option to open the Select Data Source Items window.


c) Click Quantity Ordered and then click .


d) Click OK to close the Select Data Source Items window.
5) Click the Categorize data options group in the left pane.
a) Locate the Select columns to subgroup data (CLASS statement) option in the right
pane.

b) Click for this option to open the Select Data Source Items window.
c) Click Product Group. Hold down the Ctrl key and select Quarter.

d) Click .
e) Click OK to close the Select Data Source Items window.
6) Click the Describe TABLE to print options group in the left pane.
a) Locate the field for the Specify row expression option and enter Product_Group.
b) Locate the field for the Specify column expression option and enter
Quarter*Quantity=“ ”*Sum.
c) Locate the field for the Specify TABLE statement options option and enter rts=20.
7) Click the Label a keyword options group in the left pane.
a) Locate the field for the Specify KEYLABEL statement option.
b) Enter sum=“ ”.
8) Click the Other options options group in the left pane.
a) Locate the field for the Specify other options for OPTIONS statement option.
b) Enter ls=85 nodate nonumber.
c) Locate the field for the Summary tables procedure options.
d) Enter format=comma8..
9) Click the Titles and footnotes options group in the left pane.
a) In the Heading 1 field, type Total Quantity Ordered for Quarters of 2011.
b) In the Heading 2 field, type Product_Line: Clothes & Shoes.
10) Click the ODS options options group in the left pane.
a) In the ODS result field, select Use HTML.
b) In the Location field, enter
D:\Workshop\dift\reports\Quantities2011ClothesAndShoes.html.
11) Click the Precode and Postcode tab.
a) Check the Precode option.
b) In the Precode field, enter options formchar=“|---|-|---|”;.
12) Click OK to close the Summary Tables Properties window.
l. Select File  Save to save the diagram and job metadata to this point.
m. Run the job.
1) Right-click in the background of the job and select Run.


2) Click the Status tab in the Details area. Verify that all processes completed successfully.
3) Click to close the Details view.
4) View the log for the executed job.

Question: From the Log, locate the code that generated the HTML file. What PROC step
generated the report?
Answer: PROC TABULATE

Question: What statement (or statements) identified the categorical and analytical fields?
Answer: The CLASS statement identifies the categorical fields, the VAR statement
identifies the analytical field.
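Note: Given the options specified above, the generated step is approximately the following
sketch; the input data set name is a placeholder for the Sort transformation's work table.

   proc tabulate data=work.sorted format=comma8.;   /* input table name assumed */
      class Product_Group Quarter;
      var Quantity;
      table Product_Group,
            Quarter*Quantity=' '*Sum / rts=20;
      keylabel sum=' ';
      title1 'Total Quantity Ordered for Quarters of 2011';
      title2 'Product_Line: Clothes & Shoes';
   run;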

n. View the HTML document.


1) Open a Windows Explorer by right-clicking Start and selecting Explore.
2) Collapse the C: drive listing.
3) Expand to D:\Workshop\dift\reports.
4) Double-click Quantities2011ClothesAndShoes.html to open the generated report.
5) Click OK to close the information window.
6) Click to close the information bar in the Internet Explorer window.
7) When you are finished viewing the report, click to close Internet Explorer.


Question: In the HTML report, for the Product Group value of T-Shirts – is there an increase
in quantities ordered across all four quarters?
Answer: There is an increase from Q1 to Q2, and from Q2 to Q3. But there is a
decline from Q3 to Q4.

4. Using the Create Table, Delete, and Insert Rows Transformations from the SQL Group
a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
1) Select Start  All Programs  SAS  SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK to close the Connection Profile window and open the Log On window.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.
b. Create the initial job metadata.
1) Click the Folders tab.
2) Expand Data Mart Development  Orion Reports  SQL Transforms.


3) Verify that the SQL Transforms folder is selected.


4) Select File  New  Job. The New Job window appears.
5) Enter DIFT Report on High Value Suppliers in the Name field.
6) Verify that the location is set to /Data Mart Development/Orion Reports/ SQL
Transforms.
7) Click OK. The Job Editor window appears.
c. Add source table metadata to the diagram for the job flow.
1) If necessary, click the Folders tab.
2) Expand the Data Mart Development  Orion Source Data folders.
3) Drag the DIFT Supplier Information external file object to the Diagram tab of the Job
Editor.
d. Add the File Reader transformation to the job flow.
1) Click the Transformations tab.
2) Expand the Access group and locate the File Reader transformation template.
3) Drag the File Reader transformation to the Diagram tab of the Job Editor.
4) Connect the DIFT Supplier Information table object to the input port of the File Reader
transformation.
e. Add the Create Table transformation to the job flow.
1) On the Transformations tab, expand the SQL group and locate the Create Table
transformation.
2) Drag the Create Table transformation to the Diagram tab of the Job Editor.
3) Connect the File Reader work table to the Create Table transformation.
f. Register the target table.
1) Right-click the work table of the Create Table transformation and select Register Table.
2) Enter High Value Suppliers in the Name field.
3) Verify that the value of location is /Data Mart Development/Orion Reports/SQL
Transforms.
4) Click the Physical Storage tab.
5) Enter Highvalue_Suppliers in the Physical Name field.
6) If necessary, select DIFT Orion Target Tables Library as the library.
7) Click OK to close the Register Table window.
g. Configure the Create Table transformation.
1) Right-click the Create Table transformation and select Properties.
2) Click the Source tab and verify that the source type is Single Table.
Note: Other source types are Multi-Table Join and Subquery.
3) Verify that the table is the work table of the File Reader.
4) Click the Result tab.


5) Verify that all source columns are propagated and mapped to the target table.
6) Click the Filter and Sort tab.
7) In the Filter (WHERE) pane (upper pane), click New row.
8) In the newly added row, click in the first Operand field and select Choose Column.
9) In the Choose Columns window, expand File Reader.
10) Click Supplier_ID.
11) Click OK to close the Choose Columns window.
12) Click in the Operator field and select >=.
13) Enter 12000 in the second Operand field.
14) Click the Code tab and review the generated PROC SQL code (a sketch of comparable
code appears after these numbered steps).
15) Click OK to close the Create Table Properties.
16) Select File  Save to save the job metadata.

h. Run the job and verify that the High Value Suppliers table was created.
1) Right-click in the background of the job and select Run.
2) Verify that all steps completed successfully.
Question: How many rows and columns are in the resultant table?
Answer: Scroll on the Log tab to verify that 25 rows and 6 columns are in the result
set.
3) On the Diagram tab, right-click the High Value Suppliers table and select Open.
4) Verify the data in this table. All Supplier_Id values are greater than 12000.
5) Select File  Close to close the View Data window.
i. Add the Delete transformation to the job.
1) On the Transformations tab, expand the SQL group and locate the Delete
transformation.
2) Drag the Delete transformation to the Diagram tab of the Job Editor.
3) Click the Folders tab.
4) If necessary, expand the Data Mart Development  Orion Reports  SQL Transforms
folder.
5) Drag the High Value Suppliers table to the Diagram tab of the Job Editor.


6) Connect the Delete transformation to the newly added High Value Suppliers table.

j. Configure the Delete transformation.


1) Right-click the Delete transformation and select Properties.
2) Click the Delete tab.
3) Click the Delete specified rows (WHERE) option.
4) Click New row.
5) In the newly added row, click in the first Operand field and select Choose Column.
6) In the Choose Columns window, expand High Value Suppliers.
7) Click Country.
8) Click OK to close the Choose Columns window.
9) In the Operator field, retain =.
10) Enter “United States” in the second Operand field.
11) Click the Code tab and review the generated PROC SQL code (see the sketch after
these numbered steps).
12) Click OK to close the Delete Properties.
13) If necessary, click the Control Flow tab in the Details pane.
14) Order the Delete transformation as number three to run.
15) Select File  Save to save the job metadata.
k. Run the job and verify that the US records were deleted from the High Value Suppliers
table.
1) Right-click in the background of the job and select Run.
2) Verify that all steps completed successfully.
Question: How many rows and columns are now in the resultant table?
Answer: Scroll on the Log tab to verify that 17 rows and 6 columns are in the
result set.
3) On the Log tab, scroll to the note about the deletion of eight rows from
DIFTTGT.HIGHVALUE_SUPPLIERS.
4) On the Diagram tab, right-click the High Value Suppliers table and select Open.
5) Verify the data in this table. There are no US records.
6) Select File  Close to close the View Data window.
l. Add the Insert Rows transformation to the job.
1) On the Transformations tab, expand the SQL group and locate the Insert Rows
transformation.
2) Drag the Insert Rows transformation to the Diagram tab of the Job Editor.


3) Connect the work table of the File Reader transformation to the input port of the Insert
Rows transformation.
4) Right-click the work table of the Insert Rows transformation and select Replace.
5) In the Table Selector window, click the Folders tab.
6) If necessary, expand the Data Mart Development  Orion Reports  SQL Transforms
folder.
7) Select the High Value Suppliers table.
8) Click OK to close the Table Selector window.

m. Select File  Save to save the job metadata.


n. Configure the Insert Rows transformation.
1) Right-click the Insert Rows transformation and select Properties.
2) Click the Insert tab.
3) Verify that the default query selects all the columns and all the rows from the source
table.
4) In the lower pane, verify that each Query Result Column has a matching Target Column.
(In this job, the source and target tables have the same columns.)
5) Click Edit Query. The Query Builder window appears.
6) On the Query Builder Source tab, verify that the type is Single Table and the table
is the File Reader work table.
7) Click the Result tab.
8) Verify that each target column is mapped to the corresponding source column.
9) Click the Filter and Sort tab.
10) In the Filter (WHERE) pane (upper pane), click New row.
11) In the newly added row, click in the first Operand field and select Choose Column.
12) In the Choose Columns window, expand the File Reader.
13) Click Supplier_ID.
14) Click OK to close the Choose Columns window.
15) In the Operator field, select BETWEEN.
16) Enter 10000 and 12000 in the second Operand field.
17) In the Filter (WHERE) pane (upper pane), click New row to add a second row.
18) In the second row, select AND NOT in the Boolean field.


19) Click in the first Operand field and select Choose Column.
20) In the Choose Columns window, expand File Reader.
21) Click Country.
22) Click OK to close the Choose Columns window.
23) In the Operator field, retain =.
24) Enter “United States” in the second Operand field.

25) Click the Code tab and review the generated PROC SQL code (see the sketch after
these numbered steps).
26) Click OK to close the Query Builder window.
27) Click OK to close the Insert Rows properties window.
28) Select File  Save to save the job metadata.
o. Run the job and verify that two rows were inserted into the High Value Suppliers table.
1) Verify the control flow. If necessary, use the Control Flow tab to place the transformations
in the correct execution sequence.

2) Right-click in the background of the job and select Run.


3) Verify that all steps completed successfully.
Question: How many rows and columns are now in the resultant table?
Answer: Scroll on the Log tab to verify that 19 rows and 6 columns are in the
result set.
4) On the Log tab, scroll to the note about the insertion of two rows into
DIFTTGT.HIGHVALUE_SUPPLIERS.
5) On the Diagram tab, right-click the High Value Suppliers table and select Open.
6) Verify the data in this table. The last two records have Supplier_ID values between
10000 and 12000 and Country values not equal to United States.
7) Select File  Close to close the View Data window.


p. Add and configure a One-Way Frequency transformation.


1) On the Transformations tab, expand the Analysis group and locate the One-Way
Frequency transformation.
2) Drag the One-Way Frequency transformation to the Diagram tab of the Job Editor.
3) Connect the third instance of the High Value Suppliers table object to the One-Way
Frequency transformation.
4) Right-click the One-Way Frequency transformation and select Properties.
5) On the General tab, remove the generic description.
6) Click the Options tab.
7) Verify that the Assign Columns options group is selected.
8) For the Select columns to perform a one-way frequency distribution on option, select
Country.
9) Click OK to close the One-Way Frequency Properties.
10) Right-click in the background of the job and select Run.
11) Verify that all steps completed successfully.
12) Click the Output tab.
13) Verify that the report resembles the following:

14) Select File  Close to close the Job Editor window.


5. Creating a Graphical Transformation
a. If necessary, access SAS Data Integration Studio using Bruno’s credentials.
1) Select Start  All Programs  SAS  SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK to close the Connection Profile window. The Log On window appears.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.


b. Create a new folder.


1) Click on the Folders tab.
2) Expand the Data Mart Development  Orion Reports folder.
3) Right-click the Orion Reports folder and select New  Folder.
4) Name the new folder Custom Transformations.
5) Press Enter to save the table name.
c. Create a new transformation.
1) If necessary, click the Folders tab.
2) If necessary, expand Data Mart Development  Orion Reports  Custom
Transformations.
3) Right-click the Custom Transformations folder and select New  Transformation.
4) Specify general information.
a) Enter Generate Pie Chart as the Name.
b) Verify that the Location is set to /Data Mart Development/Orion Reports/
Custom Transformations.
c) Enter User Defined for Transformation Category.

5) Click Next.
6) Specify the SAS code.
a) Access Windows Explorer.
b) Navigate to D:\Workshop\dift\SASCode.
c) Right-click GeneratePieChart.sas and select Edit With Notepad++.
d) Right-click in the background of the Notepad++ window and select Select All.


e) Right-click in the background of the Notepad++ window and select Copy.


f ) Select File  Exit to close the Notepad++ window.
g) Access the New Transformation Wizard.
h) Right-click in the background of the SAS Code pane and select Paste.

The following SAS code is used:


%macro PieChart;

   options mprint;
   ods listing close;

   %if (%sysfunc(fileexist(&path))) %then %do;

      %if (%quote(&path) ne) %then %do;
         %if (%quote(&filename) ne) %then %do;
            ods pdf file="&path.\&filename..pdf";
         %end;
      %end;

      %if (%quote(&charttitle) ne) %then %do;
         title height=15pt "&charttitle";
      %end;

      proc gchart data=&syslast;
         pie &classvar / sumvar=&analysisvar;
      run; quit;

      ods pdf close;
      ods listing;

   %end;
   %else %do;
      %if &sysscp = WIN %then %do;
         %put ERROR: <text omitted; refer to file for complete text>.;
      %end;
      %else %if %index(*HP*AI*SU*LI*,*%substr(&sysscp,1,2)*) %then %do;
         %put ERROR: <text omitted; refer to file for complete text>.;
      %end;
      %else %if %index(*OS*VM*,*%substr(&sysscp,1,2)*) %then %do;
         %put ERROR: <text omitted; refer to file for complete text>.;
      %end;
   %end;

%mend PieChart;

%PieChart;

Note: Some comments and text were removed from the above display of code.
This code uses PROC GCHART to generate a pie chart, with the sum of analysisvar
determining the size of a slice and classvar determining the slice groupings. The automatic
macro variable &SYSLAST resolves to the most recently created data set, which in a job is the
transformation's input table. The charttitle macro variable allows users to set a title for the
chart. The path and filename macro variables are used to generate a PDF file at the path
location with the specified filename.
7) Click Next.
8) Create a new group.
a) Click New Group.
b) Enter the name Data Items.

c) Click OK.
9) Create a new group.
a) Click New Group.


b) Enter the name Titles and Options.


c) Click OK.
The groups should surface as the following:

10) Define metadata for the classification variable.


a) Click the Data Items group.
b) Click New Prompt.
c) Enter classvar as the name.
d) Enter Grouping variable for pie chart as the displayed text.
e) Enter Select the column used to generate slice groupings. as the description.
f ) Click Requires a non-blank value in the Options area.

g) Click the Prompt Type and Values tab.


h) Select Data source column as the prompt type.
i) Verify that Select from source is selected in the Columns to select from area.
j) Verify that all Data types are selected.


k) Click Limit number of selectable columns.


l) Enter 1 as the minimum.
m) Enter 1 as the maximum.

n) Click OK to exit the Edit Prompt window.


11) Define metadata for the analysis variable.
a) Click the Data Items group.
b) Click New Prompt.
c) Enter analysisvar as the name.
d) Enter Analysis Variable for Pie Chart as the displayed text.
e) Enter Select the column whose sum across the chosen category will be used to
determine the size of a slice. as the Description.
f ) Click Requires a non-blank value in the Options area.
g) Click the Prompt Type and Values tab.
h) Select Data source column as the prompt type.
i) Verify that Select from source is selected in the Columns to select from area.
j) Clear Character in the Data types area.
k) Click Limit number of selectable columns.
l) Enter 1 as the minimum.
m) Enter 1 as the maximum.


n) Click OK to exit the Edit Prompt window.

12) Define metadata for the pie chart title.


a) Click the Titles and Options group.
b) Click New Prompt.
c) Enter charttitle in the Name field.
d) Enter Title for Pie Chart in the Displayed text field.
e) Enter Specify the text to be used as the title for the pie chart. in the Description
field.
f ) Click the Prompt Type and Values tab.
g) Verify that Text is specified for the Prompt type field.
h) Enter &analysisvar by &classvar in the Default value field.
i) Accept the default values for the remaining fields.
j) Click OK to close the New Prompt window.
13) Define metadata for the path of the PDF file to be created.
a) Click the Titles and Options group.
b) Click New Prompt.
c) Enter path in the Name field.
d) Enter Output file path in the Displayed text field.
e) Enter Select the path for the PDF file to be created. in the Description field.
f ) Click Requires a non-blank value.
g) Click the Prompt Type and Values tab.
h) Select File or directory for the Prompt type field.
i) Verify that the File or directory type is set to Output.
j) Verify that the Selection type is set to Directories.


k) Click the Browse button next to Default value.

(1) Navigate to the D:/workshop/dift folder.


(2) Click the reports folder.
Note: If the reports folder does not exist, create it using the tool.
(3) Click OK to close the Select a Directory window.
l) Click OK to close the New Prompt window.
14) Define metadata for the name of the PDF file to be created.
a) Click the Titles and Options group.
b) Click New Prompt.
c) Enter filename in the Name field.
d) Enter Output file name in the Displayed text field.
e) Enter Enter a name for the PDF file that will be created. Do NOT enter the PDF
file extension! in the Description field.
f ) Click Requires a non-blank value.
g) Click the Prompt Type and Values tab.
h) Verify that Text is specified for the Prompt type field.
i) Accept the default values for the remaining fields.
j) Click OK to close the New Prompt window.


The final set of prompts should resemble the following:

15) Test the prompts.


a) Click Test Prompts. The Test the Prompts window appears as follows:

b) Click Cancel to close the Test the Prompts window.


16) Click Next.
17) Verify that Transform supports inputs is checked in the Inputs area.
18) Verify that the minimum and maximum number of inputs are both set to 1.
19) Uncheck Transform supports outputs in the Outputs area.
20) Click Next.
21) Review the summary information.
22) Click Finish.


d. Verify the transformation surfaces on the Folders tab.

e. Verify the transformation surfaces on the Transformations tab.

6. Using the Generate Pie Chart Transformation in a Job


a. If necessary, prepare the source table for the desired job.
1) Load the data for three target tables.
a) Click the Folders tab.
b) Expand Data Mart Development  Orion Jobs.
c) Right-click the job DIFT Populate Customer Dimension Table job and select Open.
(1) Select Actions  Run.
(2) Verify that the job ran successfully.


(3) Select File  Close to close the job.


d) Right-click the job DIFT Populate Order Fact Table job and select Open.
(1) Select Actions  Run.
(2) Verify that the job ran successfully.
(3) Select File  Close to close the job.
e) Right-click the job DIFT Populate Customer Order Information Table job and select
Open.
(1) Select Actions  Run.
(2) Verify that the job ran successfully.
(3) Select File  Close to close the job.
b. Create a new job.
1) Click the Folders tab.
2) Expand Data Mart Development  Orion Reports  Custom Transformations.
3) Right-click the Custom Transformations folder and select New  Job.
4) Name the job DIFT Generate Country Orders Pie Chart.
c. Add the source data to the process flow.
1) Click the Folders tab.
2) Expand Data Mart Development  Orion Target Data.
3) Drag the DIFT Customer Order Information table object to the Diagram tab of the job
editor.
d. Add the Generate Pie Chart transformation to the process flow.
1) Click the Transformations tab.
2) Expand the User Defined folder and locate the Generate Pie Chart transformation
template.
3) Drag the Generate Pie Chart transformation to the Diagram tab of the job editor.
e. Connect the DIFT Customer Order Information table object to the Generate Pie Chart
transformation.

f. Select File  Save to save diagram and job metadata to this point.
g. Specify properties for the Generate Pie Chart transformation.
1) Right-click the Generate Pie Chart transformation and select Properties.
2) Click the Options tab.
3) Select Customer Country as the Grouping Variable for Pie Chart.


4) Select Quantity Ordered as the Analysis Variable for Pie Chart.

5) Click the Titles and Options group.


6) Keep the default value for Title for Pie Chart.
7) Keep the default value for Output file path.
8) Enter the name CountryQuantityPieChart for Output file name.
9) Click OK to close the Properties window.
h. Run the job.
1) Right-click in the background of the job and select Run.
2) If necessary, click the Status tab in the Details area.
3) Verify that all steps completed successfully.

i. View the generated PDF file.


1) Open Windows Explorer.
2) Navigate to the D:\Workshop\dift\reports folder.
3) Double-click CountryQuantityPieChart.pdf.


The PDF report shows the pie chart.

4) Select File  Exit to close the PDF.


5) Select File  Save to save the job metadata to this point.
6) Select File  Close to close the job editor.


Lesson 7 Introduction to Data Quality and the SAS® Quality Knowledge Base
7.1 Introduction to Data Quality ......................................................................................... 7-3

7.2 SAS Quality Knowledge Base Overview ....................................................................... 7-8



7.1 Introduction to Data Quality

Data Quality Defined


ISO 9000
Data quality can be defined as the degree to which a set of characteristics of data
fulfills requirements.

TechTarget
Data quality is a perception or an assessment of data’s fitness to serve its purpose in a given
context. The quality of the data is determined by factors such as accuracy, completeness,
reliability, relevance, and how up to date it is. As data has become more intricately linked with
the operations of organizations, the emphasis on data quality gains greater attention.

Wikipedia
Data that is “fit for [its] intended uses in operations, decision making, and planning.”
The state of completeness, validity, consistency, timeliness, and accuracy that makes
data appropriate for a specific use.
The processes and technologies involved in ensuring the conformance of data values
to business requirements and acceptance criteria.

As a data scientist, you should appreciate how important data quality is to your projects and to the
enterprise as a whole. You rely on accurate, complete, reliable, up-to-date data. Here are some of
the accepted definitions of data quality throughout the industry.
ISO 9000 defines data quality as “the degree to which a set of characteristics of data fulfills
requirements.”
TechTarget adds that “data quality is a perception or an assessment of data’s fitness to serve its
purpose in a given context. The quality of the data is determined by factors such as accuracy,
completeness, reliability, relevance, and how up-to-date it is. As data has become more intricately
linked with the operations of organizations, the emphasis on data quality gains greater attention.”
Wikipedia defines data quality in a variety of ways, saying:
• Data that is “fit for [its] intended uses in operations, decision making, and planning.”
• “The state of completeness, validity, consistency, timeliness, and accuracy that makes data
appropriate for a specific use.”
• “The processes and technologies involved in ensuring the conformance of data values to
business requirements and acceptance criteria.”


Why Does Bad Data Happen?
Bad data happens for a variety of reasons.

• Data entry errors
• Data migration issues
• Data decay over time
• Processing issues


Data quality issues exist for a variety of reasons, including the following:
• Data entry errors – these could be due to typographical errors, character transpositions,
misspellings, inconsistent use of abbreviations, and more.
• Data migration issues – these could be a result of migrating data from old legacy systems,
migration required from mergers and acquisitions, and more.
• Data decay over time – sometimes data that was appropriate for a specific use at a given point in
time is no longer appropriate for the current intended use.
• Processing issues – these could be due to system upgrades, system redesigns, inconsistent use
of data entry fields, data misalignment, gathering data from disparate sources, and more.
Perhaps your data is stored in multiple locations, with different file types, and on a variety of
machines. The data is likely inconsistent – like mixed case, inconsistent use of abbreviations, and
data values stored in different formats (for example, phone numbers).
You want to bring this data together, all in one place, combining the files to form a single view of the
data. This will almost certainly present the need to cleanse, standardize, deduplicate, and enrich the
data in order to ensure that the data is fit for the intended purpose.


Consequences of Bad Data
70% of organizations feel poor quality or inconsistent data impacts their
ability to make sound business decisions.
Forrester

The yearly cost of bad data is over $3 trillion annually in the US.
Harvard Business Review

72% of global organizations say data quality issues impact customer trust
and perceptions of their brands.
Experian


According to recent research, the consequences of bad data are expensive to the organization. Bad
data not only impacts profitability for the organization, it also affects your customers’ perception of
your brand. In addition, bad data affects decisioning inside and across the organization, which
directly impacts your company’s bottom line and your relationship with your customers. The results
of some of this research are seen in the excerpts above.
As the studies point out, the cost of bad data affects every aspect of your organization because it
• undermines the decision-making process by providing conflicting results, which leads to lack of
trust in the data, which compromises the entire decisioning process at your company
• negatively impacts your marketing and customer relationship goals and reduces customer trust
and perceptions of your brand due to lack of a single view of the subject
• negatively impacts the bottom line. (According to Experian, bad data has a direct impact on the
bottom line of 88% of all American companies, and the average loss from bad data was
approximately 12% of overall revenue.)


SAS’ Data Quality Offering

(Slide: three pillars of the offering – industry-vetted validation, a proven methodology, and the Quality Knowledge Base.)

SAS’ data quality offering is an industry-validated solution, receiving recognition as a leader in both the Gartner Magic Quadrant and the Forrester Wave for Data Quality.
The SAS Data Quality methodology has a proven track record, with more than 20 years of success! Made up of three phases, the methodology is designed to step you through the process of creating reliable and consistent data during the data curation life cycle. You will learn more about the methodology in a later lesson.
The SAS Quality Knowledge Base (QKB) is a powerful piece of technology that makes a wide range of data cleansing and data management techniques available to you from within a variety of SAS applications. The rules in the QKB are geography and language specific, to ensure applicability for data from all around the world.


Components of a Good Data Quality Strategy

A good data quality strategy requires processes, people, and technology.

To have a complete data quality strategy in place, you must have the proper processes, people, and technology. SAS provides you with the technology components for a well-rounded data quality strategy. The DataFlux Data Management Studio methodology is a complete and proven methodology. Now all we need is the people who are trained and ready to put their newly attained knowledge to work!


7.2 SAS Quality Knowledge Base Overview

What Is a SAS Quality Knowledge Base (QKB)?

(Slide graphic: QKB locale support, indicating supported locales and locales with no coverage.)

The SAS Quality Knowledge Base (QKB) is a collection of files and algorithms that store data and logic for defining data management operations such as data cleansing and standardization. The definitions in the QKB that perform the cleansing tasks are geography and language specific. This combination of geography and language is known as a locale.


Modifying the QKB

QKB components are modified using Data Management Studio.

The components of the SAS Quality Knowledge Base (QKB) can be modified using the supplied editors in Data Management Studio. Modifying or customizing the QKB enables you to create data quality logic for operations such as custom parsing or custom matching rules.
The customization component in Data Management Studio consists of a suite of editors, functions, and interfaces that can be used to construct or modify components and definitions to fit your own data. These editors enable you to perform the following tasks:
• explore and test the QKB components and definitions
• modify pre-built data types and data management algorithms (definitions) to meet business needs
• create new data types and definitions based on customer needs


Accessing the QKB in SAS Applications

Many SAS applications interact with the QKB: SAS Data Quality Server code, SAS Data Integration Studio, SAS Federation Server, the SAS Data Quality Accelerators, and SAS Event Stream Processing.

Many SAS technology components also access the QKB and perform data-cleansing operations on
data. SAS software products like SAS Data Quality Server, SAS Data Integration Studio, SAS
Federation Server, SAS Data Quality Accelerators, and SAS Event Stream Processing can also be
used to reference the QKB when performing data management operations.
SAS Data Quality Server provides a collection of SAS procedures, functions, and call routines that
surface the functionality in the QKB from within SAS code.
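
For example, the following is a minimal sketch (not taken from the course materials) of calling a QKB definition from a DATA step with SAS Data Quality Server. The %DQLOAD autocall macro loads a locale before any data quality function runs; the QKB path shown is hypothetical, and ENUSA is used as an example locale.

   /* Load an example locale from a hypothetical QKB location. */
   %dqload(dqlocale=(ENUSA), dqsetuploc='C:\QKB\CI27');

   /* Apply the QKB Name standardization definition to each row. */
   data work.standardized;
      length std_name $ 40;
      input name $40.;
      std_name = dqStandardize(name, 'Name', 'ENUSA');
      datalines;
   mr. JOHN q smith
   MRS jane DOE
   ;

Each function call looks up the named definition in the active QKB locale, so the same QKB rules drive both coded and point-and-click cleansing.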
SAS Data Integration Studio can be configured to access QKB components from the Data Quality
tab in the application’s Options window. The QKB definitions are then available from the provided
data quality transformations, or through customized SAS code in the user-written code
transformation.
SAS Federation Server can access the QKB functionality in SAS DS2 method calls that are
executed in FedSQL queries.
The SAS Data Quality Accelerators provide functions that execute QKB functionality in-database.
The SAS Data Quality Accelerator for Hadoop provides data quality functionality in the SAS Data
Loader for Hadoop directives for data quality.
SAS Event Stream Processing enables programmers to build applications that can quickly process
and analyze a large number of continuously flowing events. It can access the QKB to perform data
quality operations on the streaming data.


QKB Definition Types

QKB definitions enable you to do a variety of data curation tasks.

A definition is a set of steps for processing data values. The QKB has definitions that enable you to do a variety of data management, data quality, and entity resolution tasks. The types of definitions that are available in the QKB include the following:

Case – Transforms a text string by changing the case of its characters to uppercase, lowercase, or proper case.

Extraction – Extracts parts of the text string and assigns them to corresponding tokens for the specified data type.

Gender Analysis – Guesses the gender of the individual in the text string.

Identification Analysis – Identifies the text string as referring to a particular predefined category.

Language Guess – Guesses the language of a text string.

Locale Guess – Guesses the locale of a text string.

Match – Generates match codes for text strings, where the match codes denote a fuzzy representation of the character content of the tokens in the text string.

Parse – Parses a text string into meaningful tokens.

Pattern Analysis – Transforms a text string into a particular pattern.

Standardization – Transforms a text string into a standard format.

You learn more about each of these definition types as you use them in the next few lessons.
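
To make the Match definition type concrete, the following hedged sketch uses the DQMATCH function of SAS Data Quality Server to generate match codes. It assumes a locale has already been loaded with %DQLOAD, as in the earlier sketch; the sensitivity value 85 and the match-code length are illustrative choices.

   /* Similar names yield identical match codes at a given sensitivity. */
   data work.match_codes;
      length mc $ 20;
      input name $30.;
      mc = dqMatch(name, 'Name', 85, 'ENUSA');
      datalines;
   Robert Brauer
   Bob Brauer
   ;

Rows that share a match code become candidates for representing the same entity, which is the idea behind the entity resolution tasks mentioned above.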

Lesson 8 DataFlux® Data Management Studio: Essentials

8.1 Overview of Data Management Studio ......................................................................... 8-3
Demonstration: Navigating the DataFlux Data Management Studio Interface ....................... 8-9

8.2 DataFlux Repositories ................................................................................................ 8-21
Demonstration: Creating a DataFlux Repository ................................................................ 8-24
Practice ........................................................................................................................... 8-32

8.3 Quality Knowledge Bases and Reference Data Sources ............................................... 8-33
Demonstration: Verifying the Course QKB and Reference Sources ..................................... 8-37

8.4 Data Connections ....................................................................................................... 8-39
Demonstration: Working with Data Connections ................................................................ 8-42
Demonstration: Viewing the Data inside a Data Connection ............................................... 8-50
Practice ........................................................................................................................... 8-55

8.5 Solutions .................................................................................................................... 8-56
Solutions to Practices ....................................................................................................... 8-56


8.1 Overview of Data Management Studio

SAS Data Quality Offerings

(Slide: DataFlux Data Management Studio, DataFlux Data Management Server, the SAS Quality Knowledge Base (QKB), reference data packs, and the DataFlux repository.)

A data curator's job is to understand and prepare data for use in analytics and reports. SAS has several offerings that can aid in this effort.

The technology components that comprise SAS data quality offerings include the following:
• DataFlux Data Management Studio – a Windows-based desktop client application that enables you to create and manage processes for ensuring the accuracy and consistency of data during the data management life cycle. Typically, Data Management Studio is thought of as the development environment.
• DataFlux Data Management Server – provides a scalable server environment for executing the processes created in Data Management Studio. Typically, Data Management Server is thought of as the production environment.
• SAS Quality Knowledge Base (QKB) – a collection of files and algorithms that provide the data cleansing and data management functionality that is surfaced through the nodes in Data Management Studio processes.
• Reference data source(s) – third-party address verification, enrichment, and geocoding databases available to validate and enhance your data.
• DataFlux repository – storage location for objects created and executed in Data Management Studio. Data Management Server also needs a repository of objects that will be available (and run) on the server.

Note: Data Management Studio and Data Management Server do not share a repository. Each component needs its own repository.

As a data curator, you will likely be working in Data Management Studio.


Data Management Studio Architecture

(Diagram: a client tier in which Data Management Studio connects to the DataFlux repository, the Quality Knowledge Base, and reference data packs, and a data tier with source or target data reached through ODBC, SAS data sets, text, federated data, and so on.)

Data Management Studio is used to
• define data connections
• establish links to the QKB and address verification data packs
• create repositories
• create
  o data explorations
  o profiles
  o data jobs
  o process jobs

… and much more!

The jobs and services created in Data Management Studio are stored in a DataFlux repository.

Data cleansing algorithms are stored in the Quality Knowledge Base (QKB). The rules and algorithms in the QKB are specific to countries, as well as languages from around the world.

Third-party databases, known as “data packs” or reference data sources, can be purchased and used within Data Management Studio. Data packs enable you to verify and augment the data coming from source systems based on reference data sources from around the world.

The diagram above shows a basic design-time architecture using Data Management Studio to access the various technology components needed to build data cleansing processes.


Proven Methodology

(Diagram: a methodology cycle with three phases – PLAN (DEFINE, DISCOVER), ACT (DESIGN, EXECUTE), and MONITOR (EVALUATE, CONTROL).)

The three-phase methodology is a step-by-step process for performing data management tasks such as data quality, data integration, data migrations, and master data management (MDM). When organizations plan, act on, and monitor data management projects, they build the foundation to optimize revenue, control costs, and mitigate risks. No matter what type of data you manage, DataFlux technology can help you gain a more complete view of corporate information.

PLAN

DEFINE – The planning stage of any data management project starts with this essential first step. This is where the people, processes, technologies, and data sources are defined. Roadmaps that include articulating the acceptable outcomes are built. Finally, the cross-functional teams across business units and between business and IT communities are created to define the data management business rules.

DISCOVER – A quick inspection of your corporate data would probably find that it resides in many databases, managed by different systems, with different formats and representations of the same data. This step of the methodology enables you to explore metadata to verify that the correct data sources are included in the data management program. You can also create detailed data profiles of identified data sources so that you can understand their strengths and weaknesses.


ACT

DESIGN – After you complete the first two steps, this phase enables you to take the different structures, formats, data sources, and data feeds and create an environment that accommodates the needs of your business. At this step, business and IT users build workflows to enforce business rules for data quality and data integration. They also create data models to house data in consolidated or master data sources.

EXECUTE – After business users establish how the data and rules should be defined, the IT staff can install them within the IT infrastructure and determine the integration method (real time, batch, or virtual). These business rules can be reused and redeployed across applications, which helps increase data consistency in the enterprise.

MONITOR

EVALUATE – This step of the methodology enables users to define and enforce business rules to measure the consistency, accuracy, and reliability of new data as it enters the enterprise. Reports and dashboards about critical data metrics are created for business and IT staff members. The information that is gained from data monitoring reports is used to refine and adjust the business rules.

CONTROL – The final stage in a data management project involves examining any trends to validate the extended use and retention of the data. Data that is no longer useful is retired. The project's success can then be shared throughout the organization. The next steps are communicated to the data management team to lay the groundwork for future data management efforts.


Data Management Studio Terminology

(Screen capture with callouts: Home tab, main menu, toolbar, navigation pane, navigation riser bars, and information pane.)

The example shown displays the Data Management Studio main interface.

In the upper left corner of the window is the Home tab. This tab is always open and can be used to navigate back to this view.

At the top of the window, the main menu items are below the Home tab. Selecting one of these items reveals a selection of actions that are available for that particular menu item.

Also at the top of the window, to the right of the main menu, is the main toolbar. It contains various buttons to assist you as you navigate in Data Management Studio.

The riser bars are in the lower left corner of the window. Select one of the riser bars to display the objects that are available for that particular item in the navigation pane.

On the left side of the window, the navigation pane is above the riser bars. This pane can be refreshed to display the items for the selected riser bar.

Selecting an item in the navigation pane controls the content of the main area or information pane.


Data Management Studio Terminology

(Screen capture with callouts: primary tabs, secondary tabs, secondary toolbar, the Detach Selected Tab icon, resource pane, data flow editor, and details pane.)

When you open new objects from the repository in Data Management Studio, a new tab appears for each object that you opened. The primary tabs for these objects are displayed across the top of the main interface, one for each object that you opened. Click a tab to bring it to the front and display its contents.

In the example shown, the selected tab is displaying a data job. If you want to see the contents of more than one tab, you have the option to detach the selected tab. The icon that enables you to do this is in the upper right corner of the tab itself. After you detach a tab, the icon changes to enable you to re-attach it to the other tabs.

Each item type has its own interface. For example, the way that you specify and view properties and results for a profile is different from a data exploration, data job, or any other item.

For any data job, there is a set of secondary tabs on the left side of the window, near the top. These tabs enable you to navigate further within the data job (for example, to view or edit the data flow, the settings, or the variables, or to review the log).

The Data Flow tab (the leftmost secondary tab) includes a resource pane with riser bars. The display shows the Nodes riser bar, which provides access to transformation nodes that can be added to data jobs.

The Data Flow tab also includes the data flow editor, which is used to visualize and build data flows with the available transformation nodes.

The details pane at the bottom displays information for the selected node in the data flow as well as for the data flow itself.

Note: Other objects (for example, process jobs) can use or take advantage of the details pane. To view (or toggle off) the details pane, click the icon on the toolbar or use the main toolbar.


Navigating the DataFlux Data Management Studio Interface

In this demonstration, you explore and navigate the interface for DataFlux Data Management Studio.

1. Select Start → All Programs → DataFlux → Data Management Studio 2.7 (studio1).

   A splash screen appears when DataFlux Data Management Studio is initialized.

2. Click Cancel to close the Log On window.

   A methodology window appears.

3. Click X to close the DataFlux Data Management Methodology window.


The main interface for DataFlux Data Management Studio is now ready for use. The left side of the interface displays a navigation pane and navigation riser bars. The main area of the interface displays the information pane for the selected element in the navigation pane for the selected navigation riser bar.

4. View the main menu and toolbar on the Home tab.

   (Screen capture: the Home tab, main menu, and toolbar are labeled.)


5. View the navigation information on the left side of the interface.

   (Screen capture: the navigation pane, navigation area, and navigation riser bars are labeled.)


6. View the information pane on the right side of the interface.

   (Screen capture: the information pane is labeled.)

Notice the following:
• The Information riser bar is selected in the navigation riser bars area.
• The navigation pane displays the items for the Information riser bar.
• The Overview item is selected (in the navigation pane).
• The information pane displays the Overview information.
• The Overview information area consists of five portlets.


7. View the five portlets in the Overview information area: recent files, methodology, Data Roundtable discussions, documentation, and settings.

8. Click in the Data Management Methodology portlet.


The DataFlux Data Management Methodology window appears.

a. Clear the selection of Display Data Management Methodology on Startup on the bottom bar.

b. Click the DEFINE component of the methodology.

   Information appears about the steps to perform in the Define portion of the methodology. Four main items are in the Define portion of the methodology:
   • Connect to Data
   • Explore Data
   • Define Business Rules
   • Build Schemes

   Select any of these items to access a brief overview and a link to more information.


c. Click Explore Data.

d. Click Click here to learn more.

The Data Explorations topic appears in the DataFlux Data Management Studio online Help.

e. Click X to close the Help window.


f. Click X to close the DataFlux Data Management Methodology window.

9. In the navigation area, click the Data riser bar.

   The Data riser bar enables users to work with
   • collections
   • data connections
   • master data foundations.

   Collections – A collection is a set of fields that are selected from tables that are accessed from different data connections. A collection provides a convenient way for users to build a data set using those fields. A collection can be used as an input source for a profile in Data Management Studio.

   Data Connections – Data connections are used to access data in jobs, profiles, data explorations, and data collections.

   Master Data Foundations – The DataFlux Master Data Foundations feature in Data Management Studio uses master data projects and entity definitions to develop the best possible record for a specific resource, such as a customer or a product, from all of the source systems that might contain a reference to that resource.

10. Click Data Connections in the navigation pane.

    The information area provides overview information about all defined data connections, such as names, descriptions, and types.

    The different types of data connections that can be registered in DataFlux Data Management Studio are listed below.

    ODBC Connection – Displays the Microsoft Windows ODBC Data Source Administrator dialog box, which can be used to create ODBC connections.

    Domain-Enabled ODBC Connection – Enables you to link an ODBC connection to an authentication server domain so that credentials for each user are automatically applied when the domain is accessed.

    Custom Connection – Enables you to access data sources that are not otherwise supported in the Data Management Studio interface.

    SAS Data Set Connection – Enables you to connect to a folder that contains one or more SAS data sets.

    It is a best practice to save user credentials for your data connections.


11. Expand Data Connections in the navigation pane.

12. Click the DataFlux Sample data connection in the navigation pane.

    The tables that are accessible through this data connection appear in the information area.

    Note: You work with data connections in a later section.


13. In the navigation area, click the Data Management Servers riser bar.

    a. Click Data Management Servers.

    b. In the information pane, click Connect.

    c. When you are prompted for credentials for the SAS Metadata Server, enter Bruno for the user ID and Student1 for the password.

    d. Click Log On.


    e. Click the first defined server, DataFlux Data Management Server - sasbap.

       The information area displays specifics about this defined server.

14. In the navigation area, click the Administration riser bar.

    The Administration riser bar enables you to manage various items, such as
    • Macro Files
    • Quality Knowledge Bases (QKBs)
    • Reference Sources (data packs)
    • Repository Definitions.


8.2 DataFlux Repositories

What Is a DataFlux Repository?

(Slide graphic: a DataFlux repository is used to organize work and show lineage, and consists of data storage and file storage.)

A DataFlux repository is the storage location for objects that the user creates in Data Management Studio.

The repository is used to organize work and can be used to surface lineage between data sources, objects, and files created.

The repository consists of two components: the data storage part of the repository and the file storage part of the repository.

The data portion of the repository is stored in a database.

The file portion of the repository is stored as operating system files.


Data Storage versus File Storage

(Slide graphic: the object types kept in the data storage and file storage portions of a repository, as enumerated below.)

The data storage portion of a repository can contain the following:
• explorations and reports
• profiles and reports
• business rules
• monitoring results
• custom metrics
• business data information
• master data information

The file storage portion of a repository can contain the following:
• data jobs
• process jobs
• match reports
• entity resolution files
• queries
• entity definitions
• other files


Defining a DataFlux Repository

(Screen capture: the Administration riser bar with the Repository Definitions item selected, showing all currently defined repository definitions.)

Repositories are defined using the Administration riser bar in Data Management Studio. After you select the Administration riser bar, you can select Repository Definitions in the navigation pane to see information about the repositories that are registered.


Creating a DataFlux Repository

This demonstration illustrates the steps that are necessary to create a DataFlux repository in Data Management Studio.

1. If necessary, access the Administration riser bar in Data Management Studio.

   a. Select Start → All Programs → DataFlux → Data Management Studio 2.7.

   b. Click Cancel to close the Log On window.

   c. Verify that the Home tab is selected.

   d. Click the Administration riser bar.

      The navigation pane shows administration items when the Administration riser bar is selected.


2. Click Repository Definitions in the list of Administration items in the navigation pane.

   The information pane displays details for all defined repositories.

3. Click New to define a new repository.

   a. Enter Basics Demos in the Name field.

   b. In the Data storage area, verify that Database file is selected.

   c. Click Browse next to the Location field.

      1) In the Open window, navigate to D:\Workshop\dqdmp1.

      2) Click New folder.

      3) Enter Demos as the new folder name.

      4) Press Enter.

      5) Double-click the new folder (Demos).


      6) In the File name field, enter Demos.rps.

      7) Click Open.

         The Location field is updated.


   d. In the File storage area, click Browse next to the Folder field.

      1) Navigate to D:\Workshop\dqdmp1\Demos.

      2) Click Make New Folder.

      3) Enter files as the name of the new folder.

      4) Press Enter.


      5) Click OK.

         The Folder field in the File storage (optional) area is updated.

   e. Clear Private.

      The final settings for the new repository definition should resemble the following:


   f. Click OK.

      A message window appears and states that the repository does not exist.

   g. Click Yes.

      Information about the repository initialization appears in a window.

   h. Click Close.

      The repository is created and connected.


4. Update the folders for this new repository.

   a. Click the Folders riser bar.

   b. Expand the Basics Demos repository.

      Basics Demos is the new repository just created. The folders shown were created as default folders when the repository was initialized. This set of folders can be modified or updated.

   c. Right-click the Basics Demos repository and select New → New Folder.

      1) Enter output_files.

      2) Press Enter.


   d. Right-click the Basics Demos repository and select New → New Folder.

      1) Enter profiles_and_explorations.

      2) Press Enter.

      Note: It is a best practice to use all lowercase and no spaces for folder names that might be used on the DataFlux Data Management Server, as some server operating systems are case sensitive.

      The final set of folders for the Basics Demos repository should resemble the following:

      Note: The listed repository folders exist as locations in both data storage (database) and file storage (operating system folders).

      Note: When you create a repository folder, an operating system folder of the same name is created in the file storage location. Data storage is internally managed by the database that is used.


Practice

1. Creating a Repository for Upcoming Practices

   Create a new repository to be used for the items created in the upcoming practices. Some specifics for the new repository are as follows:
   • Name the repository Basics Exercises.
   • Create a database file repository named D:\Workshop\dqdmp1\Exercises\Exercises.rps.
   • Specify a file storage location of D:\Workshop\dqdmp1\Exercises\files.
   • Specify that the repository is a shared repository.

2. Updating the Set of Default Folders for the Basics Exercises Repository

   Create two additional folders for the new Basics Exercises repository:
   • output_files
   • profiles_and_explorations

   The final set of folders should resemble the following:

3. Attaching to an Existing Repository

   Create a new repository in Data Management Studio that attaches to an existing repository that contains solution files. Some specifics for attaching to the existing repository are as follows:
   • Name the repository Basics Solutions.
   • Point to a database file repository named D:\Workshop\dqdmp1\solutions\solutions.rps.
   • Specify a file storage location of D:\Workshop\dqdmp1\solutions\files.
   • Specify that the repository is a shared repository.
   • Verify that the repository contains the output_files and profiles_and_explorations folders.


8.3 Quality Knowledge Bases and Reference Data Sources

What Is SAS Quality Knowledge Base (Review)?

(Slide graphic: QKB locale support, indicating supported locales and locales with no coverage.)

The SAS Quality Knowledge Base (QKB) is a collection of files and algorithms that store data and logic for defining data management operations such as data cleansing and standardization. The definitions in the QKB that perform the cleansing tasks are geography and language specific. This combination of geography and language is known as a locale.

Data Management Studio enables you to design and execute additional algorithms in any QKB. This enables you to perform data cleansing tasks for other types of data as well. Depending on the activities that you want to perform, you might need to modify existing definitions or create your own definitions in the Quality Knowledge Base.

Two SAS training courses discuss how to modify and create definitions in the QKB:
• DataFlux Data Management Studio: Understanding the Quality Knowledge Base
• DataFlux Data Management Studio: Creating a New Data Type in the Quality Knowledge Base


Registration of QKB(s)

(Screen capture: the Administration riser bar with a specific QKB selected and summary information for the selected QKB.)

The locations of the Quality Knowledge Base files are registered on the Administration riser bar in Data Management Studio.

Locales are organized in a hierarchy according to their language and country. You can expand the QKB to display the available languages, and then expand one of the languages to view the locales that are associated with that language.

Note: There can be only one active QKB at a time in a Data Management Studio session.

QKB definitions will enable you to do a variety of data curation tasks. You will explore many of the available definition types in future lessons.


What Is a Reference Source?

(Slide: examples of reference sources – USPS Data, Geo+Phone Data, and Canada Post Data.)

Reference data sources are used to verify and enrich data. The reference sources (also known as data packs) are typically databases that Data Management Studio uses to compare user data to the reference source. Given enough information to match an address, location, or phone number, the reference data source can add a variety of additional fields to further clarify and enrich your data.

Data Management Studio allows direct use of data packs provided by the United States Postal Service, Canada Post, and Geo+Phone data. The data packs and updates are found on http://support.sas.com/downloads in the SAS DataFlux Software section.

Note: You cannot directly access or modify reference data sources.


Registration of Reference Sources

(Screen capture: the Administration riser bar with the Reference Sources item selected and summary information for the defined reference sources.)

Reference source locations are registered on the Administration riser bar in Data Management Studio. In the display above, you can see that there are registrations for three different reference data sources: Canada Post Data, Geo+Phone Data, and USPS Data.

Note: Only one reference source location of each type can be designated as the default.


Verifying the Course QKB and Reference Sources

This demonstration illustrates how to verify the QKB and the reference sources that are defined for the course.

1. If necessary, access Data Management Studio.

   a. Select Start → All Programs → DataFlux → Data Management Studio 2.7.

   b. Click Cancel to close the Log On window.

2. Verify that the Home tab is selected.

3. Click the Administration riser bar.

4. Expand Quality Knowledge Bases.

5. Click QKB CI 27.

   The information area displays summary information about the selected QKB.

6. On the navigation pane, expand Reference Sources.

7. Click Reference Sources.

8. Verify that three reference sources are defined for this instance.


9. Click USPS Data.

   The information area displays summary information about the selected reference source.

   Each defined reference source can be investigated in the same way by clicking the registered item.


8.4 Data Connections

What Is a Data Connection?

Data connections enable you to access your data in Data Management Studio from many types of data sources. Four types of data connections are supported:
• ODBC Connections
• Domain-Enabled ODBC Connections
• Custom Connections
• SAS Data Set Connections

Data connections enable you to access your data in Data Management Studio from many types of data sources.

ODBC Connection – Displays the Microsoft Windows ODBC Data Source Administrator dialog box.

Domain-Enabled ODBC Connection – Enables you to link an ODBC connection to an authentication server domain so that credentials for each user are automatically applied when you access the domain.

Custom Connection – Enables you to access data sources that are not otherwise supported in the Data Management Studio interface.

SAS Data Set Connection – Enables you to connect to a folder that contains one or more SAS data sets.

In Data Management Studio, you can create ODBC connections to any ODBC data source that is defined to the machine with the ODBC driver manager.
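
As an aside, such a DSN is not limited to Data Management Studio. The hedged sketch below shows how the same data source could also be reached from Base SAS with a SAS/ACCESS to ODBC libref, assuming that product is licensed; the DSN name and credentials are placeholder values.

   /* Assign a SAS library through an existing Windows DSN.        */
   /* "DataFlux Training" matches the DSN created in the upcoming  */
   /* demonstration; substitute your own DSN and credentials.      */
   libname dftrain odbc datasrc="DataFlux Training" user=sa password=XXXXXXXX;

   /* List the tables that the connection exposes. */
   proc datasets lib=dftrain details;
   run;
   quit;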


Accessing Summary Information for Data Connections

(Screen capture: the Data riser bar with the Data Connections item selected and summary information for the data connections.)

Data connections are created from the Data riser bar in Data Management Studio. After you select the Data riser bar and the Data Connections item, you can see the summary information for the existing data connections. New data connections can be added using menus or toolbar buttons.

After the data connections are defined, you can access these data sources when you build data explorations, data profiles, data jobs, and other items in Data Management Studio.


Previewing Data in a Data Connection's Table

(Screen capture: a selected table displayed on the Data tab.)

In Data Management Studio, you can preview the data from within the interface. In addition to scrolling through the rows of data, you can also sort and filter the data within the Data tab.

Visualizing the Data in a Data Connection

(Screen capture: a selected table displayed on the Graph tab, with graph properties.)

Another nice feature of Data Management Studio is the ability to visualize the data from a selected table. You have several options for the types of visualizations that you can create on the data from the Graph tab.


Working with Data Connections

This demonstration illustrates how to define and work with a data connection, including using the data viewer and generating queries.

1. If necessary, access Data Management Studio.

   a. Select Start → All Programs → DataFlux → Data Management Studio 2.7.

   b. Click Cancel to close the Log On window.

2. Verify that the Home tab is selected.

3. Click the Data riser bar.

4. Define a new data connection.

   a. Select Data Connections.

   b. In the information area, select New Data Connection → ODBC Connection.

      The ODBC Data Source Administrator window appears.


c. Click the System DSN tab.

d. Click Add.

e. Select the SQL Server driver.

f. Click Finish to close the Create New Data Source window.

The Create a New Data Source to SQL Server window appears.


   g. Enter DataFlux Training in the Name field.

   h. Select SASBAP\SQLEXPRESS in the Server field.

   i. Click Next.

      1) Click With SQL Server authentication using a login ID and password entered by the user.

      2) Enter sa in the Login ID field.

      3) Enter Student1 in the Password field.


   j. Click Next.

   k. No changes are needed. Click Next.

   l. No changes are needed. Click Finish.

      A summary of the configuration for the SQL Server appears.

   m. Click Test Data Source.

   n. Click OK to close the test window.

   o. Click OK to close the setup window.


      The new data source appears on the System DSN tab of the ODBC Data Source Administrator window.

   p. Click OK to close the ODBC Data Source Administrator window.

5. Verify that the new data connection appears in Data Management Studio.

   a. Click Data Connections in the navigation pane. Then select View → Refresh.

   b. Expand Data Connections in the navigation area to view the new data connection.

   Note: In addition to the new ODBC data source that you created, all ODBC data sources already defined on the machine where Data Management Studio is installed are available in Data Management Studio.

6. Investigate the new data connection, DataFlux Training.

   a. Expand the DataFlux Training data connection.

   b. Enter sa in the Login ID field.

   c. Enter Student1 in the Password field.


   d. Click OK.

      Note: To avoid entering credentials for this data connection in subsequent Data Management Studio sessions, you can right-click the data connection name and select Save User Credentials. This action creates a .cfg file with the same name as the data connection (thus, for this data connection, the file would be DataFlux Training.cfg). This configuration file is saved in <user-directory>\DataFlux\dac\savedconn.

      Note: The user directory is specified in the Default Settings portlet. (Select the Information riser bar and the Overview item in the navigation pane, and the Default Settings portlet appears in the information pane or main area.)

      The DataFlux Training data connection shows a listing of five different schemas. The dbo, INFORMATION_SCHEMA, and sys schemas are structures that are created and maintained by SQL Server. It is recommended that you show only data sources intended for use in Data Management Studio.


   e. Right-click DataFlux Training and select Filter.

      1) Click the Local filter radio button.

      2) Click Schema name.

      3) Verify that Include is selected.

      4) Enter df_ in the Name field.

      5) Click OK.

         Only the data sources intended for use in Data Management Studio are now displayed.

      6) Expand df_gifts.

      7) Verify that five tables exist in this schema.

      8) Expand df_grocery.

      9) Verify that two tables exist in this schema.


This display shows the five df_gifts tables as well as the two df_grocery tables.

It is important to know the basic steps for defining data connections, as shown in this demonstration. However, the two data connections that we work with were predefined as part of the course image. During class, we work with dfConglomerate Gifts (used primarily in the demonstrations) and dfConglomerate Grocery (used primarily in the practices).

In the next demonstration, we explore a table in the dfConglomerate Gifts data connection through the data viewer component of Data Management Studio, and also examine a generated graphic. In the subsequent practice, you explore a table (BREAKFAST_ITEMS) in the dfConglomerate Grocery data connection.


Viewing the Data inside a Data Connection

1. If necessary, access Data Management Studio.

   a. Select Start → All Programs → DataFlux → Data Management Studio 2.7.

   b. Click Cancel to close the Log On window.

2. Verify that the Home tab is selected.

3. Click the Data riser bar.

4. Expand Data Connections.

5. Expand dfConglomerate Gifts.

6. Click the Customers table.

7. If necessary, click the Summary tab in the information pane.

   The information area displays summary information for the selected table.


8. Click the Data tab.

   By default, the Data tab displays up to 500 records for the selected table. This value (500) can be changed in the Data Management Studio Options window.

   The Data tab has a number of tools that can aid you in viewing and understanding your data.

9. Click (Sort By) on the Data tab.

   a. Double-click JOB TITLE in the Available Fields list to move it to the Selected Fields list.

   b. Select Descending in the Direction field.

   c. Click OK.

      The data are now displayed in sorted order (by descending JOB TITLE). Note the decoration for the JOB TITLE field.

10. Click (Sort By) on the Data tab.

    a. Double-click LAST NAME in the Available Fields list to move it to the Selected Fields list.

       We have now established a secondary sort field. The data will be sorted by JOB TITLE, and within each distinct JOB TITLE value, the data will be sorted (ascending) by LAST NAME.

    b. Click OK.


    The data are now displayed in the requested sorted order. Note the decoration for the LAST NAME and JOB TITLE fields.

11. Click the Fields tab.

    This tab can be used to examine properties for each field of the selected table.

12. Click the Products table in the navigation pane (under dfConglomerate Gifts).

13. Click the Graph tab.

    a. Select Pie as the chart type.

    b. Select CATEGORY for the X axis.

    c. Select STANDARD COST for the Y axis.

       The pie chart of the product data is displayed.

The Data riser bar has the basic capabilities for examining descriptive information (Fields tab), data information (Data tab), and graphical views of data (Graph tab).
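
For readers who like to cross-check with code, these three tabs have rough Base SAS parallels. The sketch below is an illustrative aside rather than part of the demonstration; it assumes the Products table has been assigned to a hypothetical libref (for example, through SAS/ACCESS to ODBC as sketched earlier) and uses a name literal because the column names contain spaces.

   /* Allow variable names with spaces, such as 'STANDARD COST'n. */
   options validvarname=any;

   proc contents data=dftrain.products;       /* roughly the Fields tab */
   run;

   proc print data=dftrain.products(obs=20);  /* roughly the Data tab   */
   run;

   proc sgplot data=dftrain.products;         /* roughly the Graph tab  */
      vbar category / response='STANDARD COST'n stat=sum;
   run;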


Practice

4. Viewing and Graphing a Database Table

   Access the BREAKFAST_ITEMS table in the dfConglomerate Grocery data connection.

   • View the data on the Data tab and sort by the columns BRAND and NAME.

   • Review the field attributes on the Fields tab.

     Question: How many fields are in the BREAKFAST_ITEMS table?
     Answer:

     Question: How many fields are character?
     Answer:

     Question: How many fields are numeric?
     Answer:

   • Create a graph using the Graph tab with the following specifications:

     Chart type: Area
     X axis: NAME
     Y axis: SIZE

     Question: What NAME value has the highest value (or sum of values) of SIZE for the sample defined by the default “row count range” of 30?
     Answer:


8.5 Solutions

Solutions to Practices

1. Creating a Repository for Upcoming Practices

   a. If necessary, select Start → All Programs → DataFlux → Data Management Studio 2.7.

   b. Verify that the Home tab is selected.

   c. Click the Administration riser bar.

   d. Select Repository Definitions in the list of Administration items in the navigation pane.

   e. Click New to define a new repository.

      1) Enter Basics Exercises in the Name field.

      2) In the Data storage area, verify that Database file is selected.

      3) Click Browse next to the Location field.

         a) In the Open window, navigate to D:\Workshop\dqdmp1.

         b) Click New folder.

         c) Enter Exercises as the new folder name.

         d) Press Enter.

         e) Double-click the new folder (Exercises).

         f) Enter Exercises.rps in the File name field.

         g) Click Open.

      4) In the File storage area, click Browse next to the Folder field.

         a) Navigate to D:\Workshop\dqdmp1\Exercises.

         b) Click Make New Folder.

         c) Enter files as the name of the new folder.

         d) Press Enter.

         e) Click OK.


      5) Clear Private.

      6) Click OK. A message window appears and states that the repository does not exist.

      7) Click Yes to create the new repository. Information about the repository initialization appears in a window.

      8) Click Close.

         The repository is created and connected:


2. Updating the Set of Default Folders for the Basics Exercises Repository

   a. If necessary, select Start → All Programs → DataFlux → Data Management Studio 2.7.

   b. Verify that the Home tab is selected.

   c. Click the Folders riser bar.

   d. Expand the Basics Exercises repository.

   e. Right-click the Basics Exercises repository and select New → New Folder.

      1) Enter output_files.

      2) Press Enter.

   f. Right-click the Basics Exercises repository and select New → New Folder.

      1) Enter profiles_and_explorations.

      2) Press Enter.

   The final set of folders for the Basics Exercises repository should resemble the following:

3. Attaching to an Existing Repository

   a. If necessary, select Start → All Programs → DataFlux → Data Management Studio 2.7.

   b. Verify that the Home tab is selected.

   c. Click the Administration riser bar.

   d. Click Repository Definitions in the list of Administration items on the navigation pane.

   e. Click New to define a new repository.

      1) Enter Basics Solutions in the Name field.

      2) In the Data storage area, verify that Database file is selected.

      3) Click Browse next to the Location field.


         a) In the Open window, navigate to D:\Workshop\dqdmp1\solutions.

         b) Click the solutions.rps file so that it appears in the File name field.

         c) Click Open.

      4) In the File storage area, click Browse next to the Folder field.

         a) Navigate to D:\Workshop\dqdmp1\solutions\files.

         b) Click OK.

      5) Clear Private.

      6) Click OK. The connection to the repository is established.

   f. Click the Folders riser bar.

   g. Expand the Basics Solutions repository.


   h. Verify that the following folders exist:

4. Viewing and Graphing a Database Table

   a. If necessary, select Start → All Programs → DataFlux → Data Management Studio 2.7.

   b. Verify that the Home tab is selected.

   c. Click the Data riser bar.

   d. Expand Data Connections.

   e. Expand the dfConglomerate Grocery data connection.

   f. Click the BREAKFAST_ITEMS table.

   g. If necessary, click the Summary tab.


      The information area is populated with information about the selected table.

   h. Click the Data tab.

   i. Sort the data by the BRAND and NAME columns.

      1) Click (Sort By) on the Data tab. The Sort window appears.

      2) Double-click BRAND in the Available Fields list to move it to the Selected Fields list.

      3) Double-click NAME in the Available Fields list to move it to the Selected Fields list.


      4) Click OK.

         The data is displayed in sorted order (by BRAND and then by NAME).

         Note: The 1 designates a primary ascending sort. The 2 designates a secondary ascending sort.

   j. Click the Fields tab.

      Question: How many fields are in the BREAKFAST_ITEMS table?
      Answer: 15

      Question: How many fields are character?
      Answer: 7

      Question: How many fields are numeric?
      Answer: 7

      Note: Is a DATETIME field a numeric field? For some databases, the answer is yes; for some, the answer is no. Thus, the answer to the last question might be either 7 or 8, depending on the types of tables used.


k. Click the Graph tab.

1) Click in the Chart type field and select Area.

2) Click in the X axis field and select NAME.

3) Click in the Y axis field and select SIZE.

The graph of Size by Name is displayed.



l. Scroll to the right to see the highest value of SIZE for the sample defined by the default “row count range” of 30.


Question: What NAME value has the highest value (or sum of values) of SIZE for the sample defined by the default “row count range” of 30?

Answer: Momco Frosted Fruitos (39.4)

Lesson 9 DataFlux® Data Management Studio: Understanding Data
9.1 Methodology Review ................................................................................................... 9-3

9.2 Creating Data Collections ............................................................................................ 9-6


Demonstration: Creating a Collection of Descriptive Fields............................................ 9-8

9.3 Designing Data Explorations ..................................................................................... 9-10


Demonstration: Creating and Reviewing Results from a Data Exploration (Optional) ....... 9-19
Demonstration: Creating a Collection from a Data Exploration (Optional) ...................... 9-31

9.4 Creating Data Profiles ................................................................................................ 9-36


Demonstration: Creating and Exploring a Data Profile ................................................ 9-48
Practice............................................................................................................... 9-64

9.5 Profiling Other Input Types ........................................................................................ 9-65

9.6 Designing Data Standardization Schemes ................................................................. 9-75


Demonstration: Creating a Phrase Standardization Scheme........................................ 9-79
Demonstration: Comparing a New Analysis Report to an Existing Scheme .................... 9-86
Demonstration: Creating an Element Standardization Scheme .................................... 9-93
Practice............................................................................................................... 9-97

9.7 Solutions ................................................................................................................... 9-98


Solutions to Practices ............................................................................................ 9-98

9.1 Methodology Review

DataFlux Data Management Methodology (Review)

(Slide: the three-phase methodology wheel, with the labels PLAN, ACT, MONITOR, DEFINE, DISCOVER, DESIGN, EXECUTE, EVALUATE, and CONTROL.)

Recall the three-phase methodology introduced earlier. Starting in this lesson we will examine
different items and components of Data Management Studio as they relate to one of the three
phases. In this lesson, the PLAN phase is explored.


PLAN Phase

Data Management Studio items used in the PLAN phase are
• data explorations
• data profiles.

The Plan phase involves two primary areas:

Define The planning stage of any data management project starts with this essential first step. This is where the people, processes, technologies, and data sources are defined. Roadmaps are built that articulate the acceptable outcomes. Finally, the cross-functional teams across business units and between business and IT communities are created to define the data management business rules.

Discover A quick inspection of your corporate data would probably find that it resides in many
different databases, managed by many different systems, with many different
formats and representations of the same data. This step of the methodology lets
you explore metadata to verify that the right data sources are included in the data
management program – and create detailed data profiles of identified data sources
to understand their strengths and weaknesses.

Within Data Management Studio, two items can be used to help with the understanding of the data
sources – a data exploration and a data profile.
A data exploration will help you understand the structure of your data sources. Data explorations
provide comparisons of structural information across data sources, showing for example, variations
in the spelling of field names.
A data profile helps in understanding the data values in the various data sources. Data profiles
show distributions of field values and include useful metrics such as mean and standard deviation
for numeric fields and pattern counts for text fields. Data profiling will also surface occurrences of
null values, outliers, and other anomalies.
From reports generated for a data exploration, a data collection can easily be constructed (a
collection is a simple list of fields from possibly various tables from possibly various data
connections). Therefore, before studying data explorations, we will define and build a simple data
collection.


From reports generated for a data profile, the inconsistencies in actual data values can be viewed
through frequency distributions. These frequency distributions can be used as the basis for an item
called a standardization scheme (schemes are simple lists of data values with corresponding
standard values). Therefore, after finding issues in data profiles, we will define and build a
standardization scheme.
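
To make the idea concrete, here is a minimal sketch in Python (not DataFlux scheme syntax) of what a standardization scheme represents: data values paired with corresponding standard values. The entries below are hypothetical.

# A standardization scheme modeled as a simple value-to-standard lookup.
# The entries are hypothetical examples, not from an actual scheme.
SCHEME = {
    "Purchasing Mgr": "Purchasing Manager",
    "Purch Mgr": "Purchasing Manager",
    "Marketing Mgr": "Marketing Manager",
    "Mktg Manager": "Marketing Manager",
    "Mktg Mgr": "Marketing Manager",
}

def standardize(value: str) -> str:
    # Values with no scheme entry pass through unchanged.
    return SCHEME.get(value, value)

print(standardize("Purch Mgr"))   # Purchasing Manager
print(standardize("Accountant"))  # Accountant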


9.2 Creating Data Collections

What Is a Data Collection?

A data collection is simply a set of data fields from (possibly) different tables in (possibly) different data connections.

(Slide: fields from several tables across three data connections, along with additional information about each field, are gathered into a single data collection.)

A data collection is simply a set of data fields from one or more different tables from one or more different data connections. This is very useful to a data scientist for project documentation and for identifying which data fields come from which data sources.

A data collection
• provides a convenient way to group related data fields across data sources
• proves to be beneficial in data governance efforts (for example, as a way to record all the fields that
contain phone numbers)
• can be used as an input source for profiles.

In the example:
• There are several defined data connections, where each data connection could contain one to many
tables.
• You want to construct a data collection from fields from several tables from several data connections.
• You can choose
– the fields from a single table in a specific data connection.
– the fields from a single table in a different data connection.
– the fields from a particular table in a specific data connection.
– the fields from a different table in a specific data connection.
• The final data collection shown is a list of nine fields from four tables where the tables are found in three different data connections.


Working with Collections in Data Management Studio

(Screenshot: the Data riser bar with two collections in the Basics Demos repository; the selected collection's contents appear in the main information area.)

Data collections are created in the Data riser bar in Data Management Studio and are stored in a
repository.

The example shown has


• the Data riser bar selected
• the Collections item expanded
• the Basics Demos repository expanded to show two defined collections
• the Address Info collection selected in the navigation pane
• the contents of the Address Info collection displayed in the main information area.


Creating a Collection of Descriptive Fields


This demonstration illustrates the steps that are necessary to create a collection of fields containing
descriptive information or notes.

1. If necessary, access Data Management Studio.


a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.
b. Click Cancel to close the Log On window.
2. Verify that the Home tab is selected.
3. Click the Data riser bar.
4. Expand Collections.
5. Create a new collection of note information.
a. Click the Basics Demos repository.

b. Click (New Collection).


Note: You can also create a new collection by right-clicking the Basics Demos repository
and selecting New Collection.
c. Enter Descriptions in the Name field.
d. Enter Collection of Descriptive Fields in the Description field.
e. Click OK.
The new collection appears on a tab.
6. Insert descriptive fields from the dfConglomerate Gifts and dfConglomerate Grocery data
connections.

a. Click (Insert Fields).


The Insert Fields window appears.
b. Add notes fields from the dfConglomerate Gifts data connection.
1) Click (Select Database) next to the Connection field.
a) Select the dfConglomerate Gifts data connection.
b) Click OK.
2) Add fields from the Customers table.
a) Expand the Customers table.
b) Click the NOTES field. (Select the check box.)
c) Click Add.
3) Add fields from the Employees table.
a) Expand the Employees table.
b) Click the NOTES field. (Select the check box.)
c) Click Add.

Only the NOTES field from the Employees table is added to the collection.


4) Add fields from the Orders table.
a) Expand the Orders table.
b) Click the NOTES field. (Select the check box.)
c) Click Add.

Only the NOTES field from the Orders table is added to the collection.
c. Add NOTES fields from the dfConglomerate Grocery data connection.
1) Click (Select Database) next to the Connection field.
2) Select the dfConglomerate Grocery data connection.
3) Click OK.
4) Add fields from the MANUFACTURERS table.
a) Expand the MANUFACTURERS table.
b) Click the NOTES field. (Select the check box.)
c) Click Add.
5) Click Close to close the Insert Fields window.
The Descriptions collection should now resemble the following:

7. Select File  Close Collection.


The Descriptions collection is displayed for the Basics Demos repository.


9.3 Designing Data Explorations


What Is a Data Exploration?

A data exploration reads metadata and/or data from databases and uses one or more of three types of analyses to help understand the data structures and data being used in a project.

(Slide: an Identification Analysis report, where items on the left are categories, such as ADDRESS, and items on the right are fields that have been categorized based on name or data values; and a Field Match report, such as for PHONE, where fields on the left have exact or similar spelling of names to fields on the right.)

A data exploration reads metadata and/or data from databases and uses one or more of the three
types of analyses available to help understand the data structures and data being used in a project.
The three types of analyses available for exploring data are field name matching, field name
analysis, and sample data analysis.
Each type of analysis uses a different algorithm (definition) from the QKB to perform analysis on the
data.
• Field name matching analysis uses a definition known as a match definition. This algorithm
applies fuzzy logic to determine fields that might represent the same type of data, based on the
field name. For example, if you had a field named Phone_Number in one table and a field
named Phone_No in another table, the definition would identify these fields as potentially being
the same, based on a match code generated from the field name.
• Field name analysis uses a definition known as an identification analysis definition. This
algorithm uses the actual words contained in the field name, and looks up the words in a
vocabulary of words to categorize the words and identify their potential meanings. For example,
a field named Fax_Number might be categorized as a PHONE type field.
• Sample data analysis also uses an identification analysis definition but provides the ability to sample the data in the table to determine whether the data is of a specific type. For example, a sampling of 50 data records could reveal that a particular field has 10 digits in each value. The identification analysis definition might then categorize the field (based on its data contents) as a PHONE type field. (A simplified sketch of this idea follows this list.)
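
To make the sample data analysis idea more concrete, the following is a minimal Python sketch (it is not an actual QKB identification analysis definition; the categories and rules are simplified assumptions): sample values from a field and categorize the field by its dominant category.

import re
from collections import Counter

def identify_value(value: str) -> str:
    # Hypothetical, simplified rules: 10 digits looks like a phone number,
    # an @ sign looks like an email address, empty strings are blank.
    digits = re.sub(r"\D", "", value)
    if len(digits) == 10:
        return "PHONE"
    if "@" in value:
        return "E-MAIL"
    if value.strip() == "":
        return "BLANK"
    return "UNKNOWN"

sample = ["(919) 555-0134", "919-555-0188", "9195550021"]
counts = Counter(identify_value(v) for v in sample)
print(counts.most_common(1)[0][0])  # PHONE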


Data explorations can be used for the following purposes:


• to identify data redundancies
• to extract and organize metadata from multiple sources
• to identify relationships between metadata
• to identify specified business data types (city, address, phone, and so on)
For data scientists, data explorations can aid in quickly assessing the type of data in the fields for all
the tables to be used in various analytics.

Field Relationship Map

The Field Relationship map provides a visual presentation of the relationships between all fields from the database tables included in the data exploration.

The Field Relationship map provides a visual presentation of the field relationships between all the
databases, tables, and fields that are included in the data exploration.
In the above map:
• The outer ring represents the data connections specified for this analysis. The above map has
two “slices” – one for dfConglomerate Gifts and one for dfConglomerate Grocery.
• The inner ring has “slices” – each slice represents a table within its data connection. There are
five slices for the five tables in dfConglomerate Gifts. There are two slices for the two tables in
dfConglomerate Grocery.
• The dots represent fields for the tables specified for analysis. In the above diagram a dot is
selected that represents the STREET_ADDR field from the MANUFACTURERS table in
dfConglomerate Grocery.
• The green lines for a selected dot represent field relationships – field names that are either
exactly spelled the same or determined to be a match from a particular analysis applied.
This gives the data scientist an idea of how the data can possibly be joined together.


Field Name Matching

Field name matching analyzes the names of each field from the selected data sources to determine which fields have either an identical name or names that are similar enough to be considered a match. When Field name matching is selected, a match definition from the QKB is specified.

Note: Match definitions are discussed in more detail later.

If Field name matching is selected, an appropriate match definition (from the specified locale) along
with a preferred sensitivity must be specified.
Field name matching will analyze the names of each field from the selected data sources to
determine which fields have either an identical name or names that are similar enough to be
considered a match.
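
As a rough illustration of the idea (this is not the QKB Field Name match definition), the sketch below reduces each field name to a crude match code so that similarly spelled names compare equal. The normalization rules and abbreviation list are hypothetical.

import re

# Hypothetical abbreviation table used to collapse common word variants.
ABBREV = {"number": "no", "num": "no"}

def match_code(field_name: str) -> str:
    # Lowercase the name, split on spaces/underscores, collapse abbreviations.
    words = re.split(r"[\s_]+", field_name.strip().lower())
    return "".join(ABBREV.get(w, w) for w in words)

print(match_code("Phone_Number") == match_code("Phone No"))  # True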


Report: Field Name Matching

(Screenshot: the Report tab with the Field Match riser bar selected; an ADDRESS field is selected in the navigation pane, and the matching address fields are identified in the main area.)

The Report tab has a variety of reports and information to be investigated. The report results of
running a Field Name Matching analysis are surfaced on the Field Match riser bar.
The Field Match report displays a list of the fields analyzed along with their corresponding matches.
A field of interest is selected in the left navigation pane, and the main area (middle area) displays the
corresponding matches.
In the example shown, based on the Field Match Analysis, the six fields in the main area were
identified as potentially relating to the selected field in the navigation pane (in this case, the Address
field from the Customers table in dfConglomerate Gifts).


Field Name Analysis

Field name analysis analyzes the names of each field from the selected data sources to determine which identity to assign to the field. An identity assigns a semantic meaning to the type of information that the field contains. When Field name analysis is selected, an identification analysis definition from the QKB is specified.

Note: Identification analysis definitions are discussed in more detail later.

If Field name analysis is selected, an appropriate identification analysis definition (from the
specified locale) must be specified.
Field name analysis analyzes the names of each field from the selected data sources to determine
which identity or category to assign to the field. An identity assigns a semantic meaning to the type
of information that the field contains (for example, address, phone number, and so on). For example,
a field named BUSINESS_PHONE could get categorized as a PHONE field, or a field named
BUSINESS_EMAIL could get categorized as an EMAIL field.
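
A minimal Python sketch of this vocabulary lookup idea follows; the vocabulary and categories are hypothetical, not taken from the QKB.

# Hypothetical vocabulary mapping words found in field names to categories.
VOCAB = {"phone": "PHONE", "fax": "PHONE", "email": "EMAIL",
         "address": "ADDRESS", "city": "CITY", "country": "COUNTRY"}

def identify_by_name(field_name: str) -> str:
    # Split the field name into words and look each one up in the vocabulary.
    for word in field_name.lower().replace("_", " ").split():
        if word in VOCAB:
            return VOCAB[word]
    return "UNKNOWN"

print(identify_by_name("BUSINESS_PHONE"))  # PHONE
print(identify_by_name("Fax_Number"))      # PHONE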


Report: Field Name Analysis

(Screenshot: the Report tab with the Identification Analysis riser bar selected; the PHONE identity is selected, the PHONE type fields are identified, and a sample of data values from the Employees table is shown.)

The report results of running a Field Name Analysis analysis are surfaced on the Identification
Analysis riser bar.
The Identification Analysis report displays a list of fields in metadata that match certain categories
based on the field name as well as the Identification Analysis definition that was selected in the
properties for the Field Name Analysis in the data exploration.
In the example shown, based on the names of the fields, there are 10 fields from a variety of data
sources and tables whose field names meet the criterion in the Field Name identification analysis
definition for PHONE categorization or identity.
Note: The categories for the Field Name identification analysis definition (for example, ADDRESS,
CITY, COUNTRY, COUNTY, etc.) come from the definition in the QKB.


Sample Data Analysis

Sample data analysis will analyze a sample of data in each field (here, a selected sample size of 150 records) to determine which identity to assign to the field. When Sample data analysis is selected, an identification analysis definition from the QKB is specified.

Note: Identification analysis definitions are discussed in more detail later.

If Sample data analysis is selected, an appropriate identification analysis definition (from the
specified locale) must be specified.
Sample data analysis analyzes a sample of data in each field to determine which identity or category
to assign to the field.


Report: Sample Data Analysis

(Screenshot: the Report tab with the Identification Analysis riser bar selected; the PHONE identity is selected, the PHONE type fields are identified, and a sample of data values from the Customers table is shown.)

The report results of running a Sample Data Analysis analysis are surfaced on the Identification
Analysis riser bar.
The Identification Analysis report displays a list of fields placed in certain categories based on a
sample of data records.
In the example shown, based on a sample of 150 data records, there are 15 fields shown in the main
(middle) area that come from a variety of data sources and tables. Those fields contain values that
meet the criterion in the Contact Info definition for phone numbers.


Additional Report: Table Match

(Screenshot: the Report tab with the Table Match riser bar selected; the Customers table is selected, tables whose field names exactly match field names in the Customers table are listed, and the MANUFACTURERS table, which has three field names spelled exactly the same, is shown in a relational graphic.)

The Table Match report displays a list of database tables that contain matching fields for a selected
table or field.
Once a table is selected from the list of database tables, the main area refreshes to show all related
tables, as well as what percentage of columns in the table are in common.
In the example shown, the Customers table from dfConglomerate Gifts is selected. In the main
area, you see that the Customers table is related to six additional tables from the two data
connections.
The list of related tables shows the number of fields matched to the fields of the selected table.
Specifically, for the Manufacturers table from dfConglomerate Grocery, there are three fields in
common.
Selecting the Manufacturers table from dfConglomerate Grocery surfaces a relational diagram at
the bottom of the main area. This shows the fields from the Customers table that are matched to
same-named fields in the Manufacturers table. Specifically, the two tables have the ID, CITY, and
NOTES fields in common.
Note: This is not an interactive diagram; you cannot draw additional lines or delete the existing
relationship lines.


Creating and Reviewing Results from a Data Exploration


(Optional)
This demonstration illustrates the steps that are necessary to create a data exploration and explore
the results.

1. If necessary, access Data Management Studio.


a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.
b. Click Cancel to close the Log On window.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. If necessary, expand the Basics Demos repository.
5. Right-click the profiles_and_explorations folder and select New  Data Exploration.
a. Enter Ch3D2_dfConglomerate_DataExploration in the Name field.

b. Click OK.
The new data exploration appears on a primary tab.

6. Verify that the Properties tab is selected.


7. Define the data to be explored.


a. In the Data Sources area, click Add Table.

The Add Tables window appears.


1) Expand (All Data Connections).
2) Click the dfConglomerate Gifts check box.
3) Click the dfConglomerate Grocery check box.

4) Click Add.
5) Click Close to close the Add Tables window.
b. If necessary, expand each of the data sources.


c. Verify that seven tables appear in the Data Sources area.

8. Specify the settings for field name matching analysis.


a. In the Analysis Methods area, click Field name matching.
b. Click once in the Locale field. This reveals a selection tool.
c. Click in the Locale field and select English (United States).
d. Click once in the Match Definition field. This reveals a selection tool.
e. Click in the Match Definition field and select Field Name.
f. Click once in the Sensitivity field. This reveals a selection tool.
g. Click in the Sensitivity field and select 85.


9. Specify the settings for field name analysis.


a. In the Analysis Methods area, click Field name analysis.
b. Click once in the Locale field. This reveals a selection tool.
c. Click in the Locale field and select English (United States).
d. Click once in the Identification Analysis Definition field. This reveals a selection tool.
e. Click in the Identification Analysis Definition field and select Field Name.

10. Specify the settings for sample data analysis.

a. In the Analysis Methods area, click Sample data analysis.

b. Enter 500 for Sample size (records).

c. Click once in the Locale field. This reveals a selection tool.

d. Click in the Locale field and select English (United States).

e. Click once in the Identification Analysis Definition field. This reveals a selection tool.

f. Click in the Identification Analysis Definition field and select Contact Info.

11. Select Actions  Run to execute the data exploration.

Note: You can also click (Run Exploration) on the toolbar to execute the data exploration.

12. Explore the field relationship map results.

a. Click the Map tab.

The field relationship map appears:

The outer ring segments represent the selected data connections: dfConglomerate Gifts and
dfConglomerate Grocery.
The inner ring segments represent the selected tables from each data connection. Moving
the cursor over a segment of the inner ring displays the table name.
The dots represent the fields in each table.
b. Locate the MANUFACTURERS table from the dfConglomerate Grocery data connection.


c. Locate the dot that represents the STREET_ADDR field.

d. Click the dot that represents the STREET_ADDR field.

A sample of data values appears in the pane on the right.


The green lines indicate that four other fields are related to the STREET_ADDR field:
• CONTACT_ADDRESS in the MANUFACTURERS table
• SHIP ADDRESS in the Orders table
• ADDRESS in the Employees table
• ADDRESS in the Customers table


13. Explore the data exploration Field Match report results.


a. Click the Report tab.
b. Verify that the Field Match riser bar is selected.
c. Expand All Databases.
d. Expand dfConglomerate Grocery to display a list of tables for this data connection.
e. Click dfConglomerate Grocery.

For the two tables in the dfConglomerate Grocery connection, there are 32 fields.
There are 59 fields that match the fields in the BREAKFAST_ITEMS table.
There are 85 fields that match the fields in the MANUFACTURERS table.
f. Click the MANUFACTURERS table.

The 17 fields from the MANUFACTURERS table are displayed with the number of matching
fields found for each.


g. Expand the MANUFACTURERS table to display a list of fields for this table.
h. Click the STREET_ADDR field.

Information about the fields related to the STREET_ADDR field appears.


For each of the fields related to STREET_ADDR, you can discover whether the relationship
is listed as EXACT MATCH (the field names are spelled exactly the same) or as Field Name
(the field names matched according to the rules in the Field Name match definition selected
for field name matching analysis).
The view below is arranged to show the field names and the method of matching:


14. Explore the data exploration Identification Analysis report results.


a. If necessary, click the Report tab.
b. Click the Identification Analysis riser bar.
c. Expand All Identification Analysis Definitions.
d. Expand Contact Info.

The Contact Info identification analysis


definition was chosen on the Properties tab
for Sample data analysis.

The Contact Info identification analysis definition has eight defined categorizations for data
(ADDRESS, BLANK, E-MAIL, MIXED, NAME, ORGANIZATION, PHONE, UNKNOWN). This
definition inspects the data values (we chose a sample size of 500) to see whether the data
seems to be representative of ADDRESS data, or BLANK data, or E-MAIL data, and so on.


e. Click the ADDRESS category.


Each of the listed fields (26) is identified or categorized as ADDRESS, one of the categories
from the Contact Info Identification Analysis definition, based on a sample of data values.

Additional categorizations can be examined similarly.


f. Collapse Contact Info.


g. Expand Field Name.

The Field Name identification analysis


definition was chosen on the Properties tab
for Field name analysis.

The Field Name identification analysis definition has 19 defined categorizations for data
(ADDRESS, CITY, COUNTRY, COUNTY, DATE, EMAIL, GENDER, GENERIC_ID,
MARITAL_STATUS, MATCHCODE, NAME, ORGANIZATION, ORGANIZATION_ID,
PERSONAL_ID, PHONE, POSTALCODE, STATE/PROVINCE, UNKNOWN, URL). This
definition inspects the field names to see whether a field name seems to be an ADDRESS
field name, or a CITY field name, or a COUNTRY field name, and so on.


h. Click the ADDRESS category.


Each of the listed fields is identified as ADDRESS, one of the categories from the Field
Name identification analysis definition.

Additional categorizations of fields can be examined similarly.


Creating a Collection from a Data Exploration (Optional)

This demonstration illustrates the steps that are necessary to create a data collection when
reviewing the results of a data exploration.

1. If necessary, access Data Management Studio.


a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.
b. Click Cancel to close the Log On window.
2. If necessary, access Ch3D2_dfConglomerate_DataExploration.
a. Verify that the Home tab is selected.
b. Click the Folders riser bar.
c. Expand Basics Demos.
d. Click profiles_and_explorations.
e. In the main information area, double-click Ch3D2_dfConglomerate_DataExploration.
The data exploration opens on a primary tab.
3. If necessary, click the Report tab.
4. Click the Identification Analysis riser bar.
5. Expand the All Identification Analysis Definitions folder.
6. Expand Field Name.
7. Click ADDRESS.

8. In the main area, select all the fields that are identified as ADDRESS fields.
a. Click the first field.
b. Hold down the Shift key.
c. Click the last field.


9. Right-click one of the selected fields and select Add To Collection  New Collection.

10. Enter information for the new collection.


a. Enter Address Info in the Name field.
b. Enter Collection of Address Information Fields in the Description field.

c. Click OK.
The new collection appears on a separate tab.


11. Click the data exploration tab (Ch3D2_dfConglomerate_DataExploration).


12. Add city fields to the collection.
a. Click CITY (under Field Name identification definition).
The fields identified as CITY fields appear in the main area.

b. Select all the fields that are identified as CITY fields.


1) Click the first field.
2) Hold down the Shift key.
3) Click the last field.
c. Right-click one of the selected fields and select Add To Collection  Address Info.


13. Click the collection tab (Address Info).


14. Verify that the city fields were added to the collection.

15. Click the data exploration tab (Ch3D2_dfConglomerate_DataExploration).


16. Add country fields to the collection.
a. Click COUNTRY (under Field Name identification definition).
The fields identified as COUNTRY fields appear in the main area.
b. Select all the fields that are identified as COUNTRY fields.
1) Click the first field.
2) Hold down the Shift key.
3) Click the last field.
c. Right-click one of the selected fields and select Add To Collection  Address Info.
17. Click the collection tab (Address Info).
18. Verify that the fields that are identified as COUNTRY are now part of the Address Info collection.
19. Click the data exploration tab (Ch3D2_dfConglomerate_DataExploration).
20. Add postal code fields to the collection.
a. Click POSTALCODE (under Field Name identification definition).
The fields identified as POSTALCODE fields appear in the main area.
b. Select all the fields that are identified as POSTALCODE fields.
1) Click the first field.
2) Hold down the Shift key.
3) Click the last field.
c. Right-click one of the selected fields and select Add To Collection  Address Info.


21. Click the collection tab (Address Info).


22. Verify that the fields that are identified as POSTALCODE are now part of the Address Info collection.
23. Click the data exploration tab (Ch3D2_dfConglomerate_DataExploration).
24. Add State/Province fields to the collection.
a. Click STATE/PROVINCE (under Field Name identification definition).
The fields identified as STATE/PROVINCE fields appear in the main area.
b. Select all the fields that are identified as STATE/PROVINCE fields.
1) Click the first field.
2) Hold down the Shift key.
3) Click the last field.
c. Right-click one of the selected fields and select Add To Collection  Address Info.

25. Click the collection tab (Address Info).

26. Verify that the fields that are identified as ADDRESS, CITY, COUNTRY, POSTALCODE, and STATE/PROVINCE are now part of the Address Info collection.

27. Select File  Close Collection.

The data exploration tab should be active.

28. Select File  Close Exploration.


9.4 Creating Data Profiles


What Is a Data Profile?

A data profile enables you to inspect data for errors, inconsistencies, redundancies, and incomplete information.

PAYMENT_TYPE Metrics
Metric         Value
Count          48
Null Count     10
Percent Null   20.8
Blank Count    0
…

JOBTITLE Frequency Distribution
Value                 Count
(null value)          13
Purchasing Manager    12
Purchasing Mgr        6
Purch Mgr             5
Marketing Manager     1
Mktg Manager          1
Marketing Mgr         4
Mktg Mgr              2
…

STATE Pattern Frequency Distribution
Pattern    Count
A.A.       3
Aa.        12
Aaa.       1
AA         55

A data profile enables you to inspect data for errors, inconsistencies, redundancies, and incomplete
information.

Data profiles provide the following benefits:


• improve understanding of existing data
• aid in identifying issues early in the data management process,
when they are easier and less expensive to manage
• help determine the steps necessary to address issues that were identified
• enable you to assess the quality of your data across time.

Examples:
• Suppose you need to examine payments for a week's orders. Metrics created for a profile can
show that just over 20% of the orders placed have not been paid.
• Suppose you work in a human resources group and need to generate some metrics (for
example, minimum salary, average salary, maximum salary) to compare across various job
categories. However, an examination of the names of job categories shows an alarming
inconsistency.
• Suppose you work in a marketing group and want to do some promotions for customers in
various states. However, examining the values of the STATE field shows inconsistent styles
(patterns) of information.


Data Profile Components

(Slide: a profile consists of two tabs, the Properties tab and the Report tab.)

Profiles contain two tabs:


• The Properties tab is used to identify the data sources, tables, and columns to be used in the
profile. In addition, you can use the profile to control the default profile metrics that are
calculated, set options for calculating the profile metrics, override the metrics for certain columns,
and more. Also, you can specify custom metrics, assign business rules, add alerts, and create
visualizations as part of the profile.
• The Report tab is used to view the output from running the profile. Specifically, you can view the
profile metrics for the tables included in the profile, as well as any columns that are included in
the profile. In addition, you can view the results of any custom metrics specified, any business
rules that were triggered, and any alerts that were triggered. Also, you can interactively visualize
the profile metrics.


Profile Properties

(Screenshot: the Properties tab, where you select the tables and columns to profile, override metrics, and set up PK/FK analysis, redundant data analysis, and alerts.)

To run a data profile on your data, you must first identify the properties for the data profile.
Specifically, you must do the following actions:
• select the data sources to be profiled
• select the data tables to be profiled
• select the columns to be profiled
• specify the metrics to be calculated against the data
• specify additional, more advanced options (Primary Key/Foreign Key Analysis, Redundant Data
Analysis, and Alerts)


Data Profiling Metrics

Data profile metrics are statistics, calculated from actual data values, that are used to assess the quality of your data.

You must specify which data profiling metrics must be calculated before the profile is executed. You
have full control over the metrics that are calculated for a data profile. You can select these metrics
by navigating to Tools  Default Profile Metrics on the Properties tab of the data profile.
Standard metrics include the following:
• record count
• null count
• percent null
• blank count
• minimum value
• maximum value
• pattern count
• unique count
• uniqueness
• primary key candidate
• data type
• data length
• actual type
• minimum length
• maximum length
• non-null count
• nullable
• decimal places
• statistical calculations (mode, mean, median, standard deviation, standard error)

Note: By default, all profile metrics are selected.

Note: You can override metrics for a specific column on the Properties tab.
Not all profile metrics are necessary for all the columns that are profiled. As part of the planning
phase, you should determine which metrics are most important for the types of data fields that you
want to profile.

Calculating frequency distribution, pattern frequency distribution, percentiles, and outliers takes additional processing time, so for performance reasons, be judicious in their selection.
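
For intuition about what a few of these metrics measure, here is a minimal Python sketch using pandas. The column and values are hypothetical; Data Management Studio computes the metrics internally when the profile runs.

import pandas as pd

# A hypothetical STATE column with a mix of clean, padded, null, and blank values.
col = pd.Series(["NC", "nc", " NC", "N.C.", None, ""], name="STATE")

metrics = {
    "record count": len(col),
    "null count": int(col.isna().sum()),
    "percent null": round(100 * col.isna().mean(), 1),
    "blank count": int((col.dropna().str.strip() == "").sum()),
    "unique count": col.nunique(dropna=True),  # null values are not counted
    "minimum length": int(col.dropna().str.len().min()),
    "maximum length": int(col.dropna().str.len().max()),
}
print(metrics)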


Profile Results: Standard Table Metrics

(Screenshot: the Report tab with the Customers table selected and the Standard Metrics tab displayed.)

You can view the results of the profile by clicking the Report tab.
On the Report tab, selecting a table displays the standard metrics for all columns in the table.
Expanding the table and selecting a specific column displays the standard metrics that are specifically for that column.


Profile Results: Table Visualizations

(Screenshot: the Report tab with the Customers table selected and the Visualizations tab displayed.)

Visualizations are customized charts that you create based on the data and the calculated metrics.
Visualizations can reveal patterns in the metrics that might not be apparent when you view the table
of standard metrics.
As a data scientist, this is a good way to document the health and well-being of the data used in your
project.


Profile Results: Table Notes

(Screenshot: the Report tab with the Customers table selected and the Notes tab displayed.)

Notes can be added to the report (at both the field level and the table level) to aid in the planning
process. In the example, you see the table-level notes that were entered for the Customers table.

Profile Results: Column Profiling

(Screenshot: the Report tab with the STATE/PROVINCE column selected and the Column Profiling tab displayed.)

Individual fields from a table can be selected. The Column Profiling tab displays the standard metrics
for the selected field. In the example, you see the standard metrics that were calculated for the
STATE/PROVINCE field.


Profile Results: Frequency Distribution

(Screenshot: the Report tab with the STATE/PROVINCE column selected and the Frequency Distribution tab displayed.)

A distribution of the frequency counts can be calculated for the values of a field. The list of
distribution values can also be filtered and visualized.
If you double-click a specific value in the frequency distribution report, a window opens, showing all the records in the original data source that contain those values.
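
Conceptually, a frequency distribution is a count per distinct value, with nulls reported as their own row. A minimal Python sketch with hypothetical values:

from collections import Counter

country = ["USA", "USA", "usa", "U.S.A.", None, None]
freq = Counter("(null value)" if v is None else v for v in country)
for value, count in freq.most_common():
    print(value, count)
# USA 2, (null value) 2, usa 1, U.S.A. 1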


Profile Results: Pattern Frequency Distribution

(Screenshot: the Report tab with the STATE/PROVINCE column selected and the Pattern Frequency Distribution tab displayed.)

A distribution of the pattern frequency counts (the pattern of the value in the field) can be calculated.
To view the record (or records) that contain a specific pattern, double-click the pattern distribution
value. The list of pattern distribution values can also be filtered and visualized.
For a pattern, the following rules apply (illustrated in the sketch after this list):
• A represents an uppercase letter.
• a represents a lowercase letter.
• 9 represents a digit.
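
A minimal Python sketch of these pattern rules follows (this illustrates the idea, not the compiled DataFlux algorithm; characters other than letters and digits are assumed to pass through unchanged):

from collections import Counter

def pattern(value: str) -> str:
    out = []
    for ch in value:
        if ch.isupper():
            out.append("A")      # uppercase letter
        elif ch.islower():
            out.append("a")      # lowercase letter
        elif ch.isdigit():
            out.append("9")      # digit
        else:
            out.append(ch)       # punctuation and spaces pass through
    return "".join(out)

values = ["NC", "N.C.", "Nc.", "27513"]
print(Counter(pattern(v) for v in values))
# Counter({'AA': 1, 'A.A.': 1, 'Aa.': 1, '99999': 1})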


Viewing Profile Results: Percentiles

(Screenshot: the Report tab with the LIST PRICE column selected and the Percentiles tab displayed.)

Percentiles are calculated for intervals that you specify. In the example, you see the percentiles that
were calculated for the LIST_PRICE field in the Products table.
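
As a reminder of what the metric means, here is a minimal Python sketch of percentile calculation over a numeric column (the values and intervals are hypothetical):

import numpy as np

list_price = np.array([9.99, 14.50, 19.99, 24.00, 49.95, 120.00])
for p in (25, 50, 75):
    print(f"{p}th percentile: {np.percentile(list_price, p):.2f}")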

Viewing Profile Results: Outliers

(Screenshot: the Report tab with the LIST PRICE column selected and the Outliers tab displayed.)

The Outliers tab lists the X minimum and maximum value outliers. The number of listed minimum and maximum values is specified when the data profiling metrics are set. In the example, you see the outliers that were identified for the LIST_PRICE field in the Products table.


Frequency Distribution: Table versus Graph

(Slide: a Frequency Distribution table alongside a Frequency Distribution visualization of the same field.)

If a frequency distribution is calculated for a particular field, then these calculated values are
available for a special chart type of Frequency Distribution. Recall that visualizations can be created
for a selected table.
The example shown displays
• a Frequency Distribution report for the selected field PRODUCT CODE (from the Products table
from the dfConglomerate Gifts data connection)
• a Frequency Distribution chart for the PRODUCT CODE field from the Products table (from the
dfConglomerate Gifts data connection).
Depending on who is viewing the profile reports (Frequency Distribution report versus Frequency
Distribution chart), one report might be more informative than the other, or possibly having both
provides clearer insight into the data.
As a data scientist, presenting your project results and proposals is a critical part of your job. These
visualizations will assist you greatly in that endeavor.


Pattern Frequency Distribution: Table versus Graph

(Slide: a Pattern Frequency Distribution table alongside a Pattern Frequency Distribution visualization of the same field.)

If a pattern frequency distribution is calculated for a particular field, then these calculated values are
then available for a special chart type of Pattern Frequency Distribution. Recall that visualizations
can be created for a selected table.
The example shown displays:
• a Pattern Frequency Distribution report for the selected field PRODUCT CODE (from the
Products table from the dfConglomerate Gifts data connection).
• a Pattern Frequency Distribution chart for the PRODUCT CODE field from the Products table
(from the dfConglomerate Gifts data connection).
Depending on who is viewing the profile reports (Pattern Frequency Distribution report versus Pattern Frequency Distribution chart), one report might be more informative than the other, or possibly having both provides clearer insight into the data.


Creating and Exploring a Data Profile


This demonstration illustrates the steps that are necessary to create a data profile (set properties and default profile metrics), execute the profile, and explore the results.
1. If necessary, access Data Management Studio.
a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.
b. Click Cancel to close the Log On window.
2. Click the Folders riser bar.
3. If necessary, expand the Basics Demos repository.
4. Right-click the profiles_and_explorations folder and select New  Profile.
a. Enter Ch3D3_dfConglomerate_Profile in the Name field.

b. Click OK.
The new profile appears on a primary tab.

5. Define the profile properties.


a. Verify that the Properties tab is selected.
b. Click dfConglomerate Gifts.
An X appears in the check box.
c. Click dfConglomerate Grocery.
An X appears in the check box.


6. Review the profile options.


a. Select Tools  Profile Options.
b. Review the options on the General tab.

A few options are shown above.


• If the fields being profiled have many distinct values (high cardinality), then you might want to
deselect “Count all rows for frequency distribution.” It might be best to accumulate counts for the
top 32 values, or whatever is your preference. Those values not “counted” will be placed in an
“Other” grouping.
• Similarly, if the fields being profiled have many distinct patterns in the data, then you m ight want
to deselect “Count all rows for pattern distribution.” It might be best to accumulate counts for the
top 32 patterns, or whatever is your preference. Those values not “counted” will be placed in an
“Other” grouping.
• It might be beneficial to increase the memory cache size used to improve performance of profile
execution.
• It might be beneficial to trim leading/trailing spaces from data values for frequency distribution values, or for pattern frequency distribution values. For example, you might wish to disregard the leading blank in a value ‘ NC’ so that it is grouped with the value ‘NC’.


c. Click the Charts tab.


d. Review the options on the Charts tab.

It might be beneficial to change the width and height of the generated charts.


e. Click the Quality Knowledge Base tab.
f. Review the options on the Quality Knowledge Base tab.

The patterns are determined by an algorithm in the compiled code.


In addition, the QKB has a type of algorithm called a pattern analysis definition.
If you want pattern analysis other than the default algorithm, then this tab allows the
selection of the appropriate locale (from the default QKB) along with the desired definition.
g. Click Cancel to close the Options window.


7. Define default profile metrics.


a. Select Tools  Default Profile Metrics.
b. Verify that all metrics are selected.

Note: The first four metric selections take additional processing time to calculate, so be judicious in their selection.
c. Click Cancel to close the Metrics window.


8. Select File  Save Profile to save the profile to this point.


9. Override metrics for selected fields.
a. Expand the dfConglomerate Grocery data connection.
b. Click the MANUFACTURERS table.
c. Verify that the Fields tab is active on the right.

d. Click the ID field.


e. Hold down the Ctrl key and click the NOTES field.
f. Right-click one of the selected columns and select Override Metrics.


1) Clear Frequency distribution.


2) Clear Pattern frequency distribution.
3) Clear Percentiles.
4) Clear Outliers.
5) Click OK to close the Metrics window.
These calculations are not selected as they can take a long time and are not particularly
useful for the analysis of the data stored in ID and NOTES.
g. Select the CITY field.
h. Hold down the Ctrl key. Select the COUNTRY field and the STATE_PROV field.
i. Right-click one of the selected columns and select Override Metrics.
1) Clear Percentiles.
2) Clear Outliers.
3) Click OK to close the Metrics window.
These calculations are not selected as they can take a long time and are not particularly
useful for the analysis of the data stored in CITY, COUNTRY, and STATE_PROV.
The check marks under the M column denote that metrics were overridden for these fields.

There is no designation for different sets of metric overrides. The check marks simply
indicate that the metrics for the selected fields are different from those that are set for the
overall profile.
10. Select File  Save Profile to save the profile.


11. Select Actions  Run Profile Report.


a. Enter First profile run in the Description field.

b. Click OK to execute the profile.


The profile is executed. The status of the execution is displayed.

After the profile runs and the report is generated, the Report tab becomes active.

12. Expand the dfConglomerate Gifts data connection.


13. Click the Customers table.


A table view of calculated metrics for each field selected from the Customers table appears:

Note: Some metrics display (not applicable), which indicates that the metric calculation is not applicable to that field type.


14. Investigate metrics for a single column.


a. Expand the Customers table.
b. Click the COUNTRY/REGION field.
c. Verify that the Column Profiling tab is selected in the main area.


A few observations:
• There are 63 records in the Customers table.
• There are four different patterns or styles for the COUNTRY/REGION field.
• There are four unique or distinct values for the COUNTRY/REGION field.


d. Click the Frequency Distribution tab.

There are five rows listed here: the four unique values and (null value). Therefore, it can be
noted that the Unique Count metric does not include null values.
e. Double-click the item (null value).
This action performs a drill-through to the source table and displays a window with the 11
records that have a null value for the COUNTRY/REGION field.
All fields are shown from the data source.

The Drill Through window is simply a display of values. Nothing can be changed via this
window’s interface. However, the data shown can be exported to a .txt, .csv or .xls file.


f. Scroll to the right and locate the COUNTRY/REGION field.

Scrolling to the right and locating the COUNTRY/REGION field reveals all null values.
Investigating corresponding fields (such as CITY, STATE/PROVINCE, and ZIP/POSTAL
CODE) could indicate the correct value for the COUNTRY/REGION field.
g. Click Close to close the Frequency Distribution Drill Through window.
h. Click the Pattern Frequency Distribution tab.

Almost 83% of the data values have the pattern of three capital letters. It would be
worthwhile to investigate fixing the patterns of the nine observations that do not have the
AAA pattern.
Note: The drill-through action is also available for the pattern frequency distribution values.
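To make the pattern metric concrete, here is a small sketch in plain Python (illustrative only – this is not DataFlux code, and the lowercase/digit mappings are assumptions based on common pattern-analysis conventions). Consistent with the report above, each uppercase letter contributes an A to the pattern, so the value USA yields the pattern AAA.

    def pattern(value):
        # Map each character to a pattern symbol; other characters pass through.
        out = []
        for ch in value:
            if ch.isupper():
                out.append("A")
            elif ch.islower():
                out.append("a")   # assumed mapping for lowercase letters
            elif ch.isdigit():
                out.append("9")   # assumed mapping for digits
            else:
                out.append(ch)
        return "".join(out)

    print(pattern("USA"))   # AAA
    print(pattern("Usa"))   # Aaa

Under such a mapping, every three-capital-letter value shares the single pattern AAA, which is why the pattern frequency distribution is a quick way to spot inconsistently entered values.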
i. Select Insert  New Note.


1) Enter Check the patterns of this field. in the Add Note window.

2) Click OK to close the Add Note window.


j. Verify that the Notes tab is now active and the new note appears.

This type of note is referred to as a field-level note.


k. Click the Percentiles tab.

Percentiles are not calculated for character fields.


l. Click the Outliers tab.

This tab displays the top X minimum and maximum values (where X is the value set
in the Metrics window).
m. Add a table-level note.
1) Click the Customers table (on the left).
2) Select Insert  New Note.


3) Enter the following in the Add Note window:


Check the COUNTRY/REGION field for number of unique values, for differing
patterns and for null values.

4) Click OK to close the Add Note window.


n. Verify that the Notes tab is now active and that the new note appears.

This is a table-level note.


15. Investigate visualizations.
a. Verify that the Customers table is selected.
b. Click the Visualizations tab.

c. Click (Add) to the right of the Chart field.


d. Enter Visual Comparison of Metrics Across Fields in the Description field.
e. Verify that Chart type field is set to Bar.
f. Verify that Data type field is set to Field metrics.
g. Select the following fields in the Fields list:
BUSINESS PHONE
COUNTRY/REGION
FAX NUMBER
HOME PHONE
MOBILE PHONE
STATE/PROVINCE
ZIP/POSTAL CODE


h. Select the following metrics in the Metrics list:


Unique Count
Pattern Count
Minimum Length
Maximum Length
The final settings should resemble the following:

i. Click OK to close the Chart Properties window.


The chart appears on the Visualizations tab.

The configuration of the chart (field selection and metric selection) can be changed by clicking the (Edit) toolbar button.
The chart can be saved as a .jpg, .png, .gif, or .bmp file.
The chart can be printed if a valid printer connection is available.


j. Right-click the background of the chart and select Chart Type.

If the bar chart does not convey the comparison clearly, the visualization can easily be changed by selecting one of the other available chart types.
16. Investigate a different table that was also profiled in this job.
a. Expand the dfConglomerate Grocery data connection.
b. Expand the MANUFACTURERS table.
c. Click the NOTES field.
d. Click the Frequency Distribution tab (in the main / information area).
Recall that this metric was cleared in a metric override for this field.

Similarly, the Pattern Frequency Distribution, Percentiles, and Outliers tabs display
(Not calculated) because these options were also cleared in the metric override for
this field.
17. Select File  Close Profile.


Practice

1. Profiling Tables in the dfConglomerate Grocery Data Connection


Use Data Management Studio to create a data profile on the Manufacturers and
Breakfast_Items tables in the dfConglomerate Grocery data source.
• Use the profiles_and_explorations folder in the Basics Exercises repository to store the
profile.
• Name the profile Ch3E1_dfConglomerateGrocery_Profile.
• Profile the two tables (MANUFACTURERS and BREAKFAST_ITEMS) in the
dfConglomerate Grocery data connection.
• Calculate all metrics on all fields (initial setup).
• For the ID fields in both the BREAKFAST_ITEMS and MANUFACTURERS tables, do not
calculate frequency distribution, pattern frequency distribution, percentiles, or outliers.
• Run the profile and provide a description of the initial profile.
• Review the Profile report results. Answer the following questions:

Question: How many distinct values exist for the UOM field in the BREAKFAST_ITEMS
table?
Answer:

Question: What are the distinct values for the UOM field in the BREAKFAST_ITEMS
table?
Answer:

Question: What are the ID values for the records with a value of PK for the UOM field
in the BREAKFAST_ITEMS table?
Answer:


9.5 Profiling Other Input Types

Data Profiling: Other Input Types


Data profiling can be performed on
established data connections, as well as on
the following types of inputs:
• text files
• SQL queries
• filtered tables
• collections


Thus far, you worked with profiling data in the data sources that were defined in the data connections. However, it is important to note that you are not limited to only the data in the data connections for profiling. You can also profile data using the following as input sources: text files, SQL queries, filters on tables, and collections.
Note: You can improve performance when working with large data sets by working with a sample of the data. A sample interval can be specified for any input type.
Note: By default, the sample interval is 1. That is, every row is read when you create the data profile. However, another interval can be specified by selecting Actions  Change Sample Interval.
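To picture what the sample interval does, consider this small sketch in plain Python (illustrative only – not DataFlux code): an interval of n keeps every nth row, so larger intervals profile fewer rows.

    rows = list(range(1, 11))   # stand-in for the rows of a source table
    interval = 3                # read every third row; the default interval of 1 reads every row
    sampled = rows[::interval]
    print(sampled)              # [1, 4, 7, 10]

The metrics are then calculated on the sampled rows only, trading some accuracy for speed on large inputs.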


Data Profiling: Text Files


When you define properties for a profile of a text file, the menu choice,
Insert  New Text File, is enabled only if the Text Files item is selected.


Data often arrives in the form of a text file (for example, a file with a .txt or a .csv extension), and this data needs to be studied or profiled.
To profile data that is in a text file, you need to select Text Files in the navigation pane of the Properties tab. After you select Text Files, you can then select Insert  New Text File for the profile. Then you can specify other options as you would for any other input source.

Steps for Creating a Profile Using a New Text File:


• From the Properties tab of a profile, select Text Files and then Insert  New Text File.


An Insert Text File window appears.

Copyright © 2020, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.
9.5 Profiling Other Input Types 9-67

• Specify a name for the text file (item to appear as child under Text Files), select the appropriate
file type, and click OK.

• Specify the file and appropriate attributes for the file, including field names. Then click OK.


At this point, all fields defined for the text file are available for selection. Selecting fields, selecting metrics, and specifying options all work exactly as they do when a single table is selected.


Data Profiling: Filtered Table versus SQL Query


When a table is selected, use the Insert menu to create a new SQL query
or a new filtered table.


When creating a profile, it is sometimes necessary to execute the profile on a selected subset of data from a table. There are two options available for subsetting the table:
• New Filtered Table
• New SQL Query
Both options are available from the Insert menu.
The New Filtered Table option is used to create a new filtered table using the selected source table.
This option enables you to use an interface to build an expression to be used for filtering data
records from the input table. The filtered result set is then profiled.
The New SQL Query option is used to create an SQL query with a WHERE clause (using the
selected table) to filter the data records. This option opens an SQL Query window where you can
enter the SQL query to be used to filter the data. The filtered result set is then profiled.

Important Notes:
• The results generated for both the SQL query and the filtered table are the same.
• When you use a filtered table, all records from the database are returned to the machine where
the profile runs, and the filtering is performed on that machine.
• The database does the filtering for the SQL query and returns the filtered result set only to the
machine where the profile runs.
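The difference can be pictured with a small sketch in plain Python (illustrative only – the grocery.db file name, the table, and the assumption that UOM is the last column are hypothetical stand-ins, not objects from the course environment):

    import sqlite3

    conn = sqlite3.connect("grocery.db")

    # Filtered-table behavior: every record is returned to the machine where
    # the profile runs, and the filter is applied locally.
    all_rows = conn.execute("SELECT * FROM BREAKFAST_ITEMS").fetchall()
    filtered_locally = [row for row in all_rows if row[-1] == "PK"]

    # SQL-query behavior: the database applies the WHERE clause and returns
    # only the filtered result set to the profiling machine.
    filtered_by_db = conn.execute(
        "SELECT * FROM BREAKFAST_ITEMS WHERE UOM = 'PK'"
    ).fetchall()

Both approaches yield the same result set, but for a large table the SQL query moves far less data to the machine where the profile runs.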


Steps for Creating a Profile Using New Filtered Table


• From the Properties tab of a profile, select a table and then Insert  New Filtered Table.


An Insert Filter window appears.


• Specify a name for the filtered result set (the item appears as a “sibling” of the selected table) and then click OK.

The Filter on Table window appears.

• Define the filter.


o Specify Rule Condition.
▪ Select appropriate field.
▪ Select operation.
▪ Specify value.
o Click Add Condition.


The Rule expression area is populated after you click Add Condition.

o If needed, click AND or OR, and then build the rest of the compound expression. Then click
Add Condition.
• Click OK to close the Filter on Table window.
At this point, all fields from the originally selected table are available for selection for this filtered data. Selecting fields, selecting metrics, and specifying options all work exactly as they do when a single table is selected.


Steps for Creating a Profile Using New SQL Query


Use the following steps to implement data profiling using New SQL Query:
• From the Properties tab of a profile, select a table and then Insert  New SQL Query.

An Insert SQL Query window appears.


• Specify a name for the SQL query result set (item to appear as a “sibling” to table selected) and
then click OK.

The SQL Query window appears.


• Define the SQL query by typing directly in this window.

• Click OK to close the SQL Query window.


At this point, all fields from the specified table (hence the * in the SQL) are available for selection for this query. Selecting fields, selecting metrics, and specifying options all work exactly as they do when a single table is selected.


Data Profiling: Collection


Recall: A collection is simply a list of fields from possibly different tables in possibly different data connections. A collection is defined and accessed from the Data riser bar.

You can create a profile from a collection by right-clicking the collection. The initial list of fields to be profiled consists of the fields that are defined as the collection.

In the example, a data collection was selected from the Data riser bar as input into the data profile. The new data profile can be given a metadata name of Address Info Collection Profile. This creates a new data profile for you. Only the columns that exist in the collection are selected as input. To save significant time and effort, create collections. Then you do not need to search in all the source data tables in all the data connections for the fields that you want to include in the profile.


9.6 Designing Data Standardization Schemes

What Is a Standardization Scheme?

A standardization scheme takes various spellings or representations of a data value and defines a standard way to consistently write these values.

Many standardization schemes are provided with the supplied QKBs. Users can also build standardization schemes from a profile's frequency distribution table (on the Report tab for the profile).
When a scheme is applied, if the input data is equal to the value in the Data column, then the data is
changed to the value in the Standard column.
In the example shown, the standard value SAS INSTITUTE is used when three different instances or
spellings are encountered (SAS, SAS INSTITUTE, SAS INSTITUTE INC).
The following special values might be seen or can be used in the Standard column:

//Remove The matched word or phrase is removed from the input string.

% The matched word or phrase is not updated.


Note: This is used to show that a word or phrase is explicitly marked for no change.
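As a sketch of this lookup behavior, the following plain Python (illustrative only – an assumed model, not the DataFlux implementation) applies a phrase scheme, including the two special Standard values:

    scheme = {
        "SAS": "SAS INSTITUTE",
        "SAS INSTITUTE INC": "SAS INSTITUTE",
        "SAS INSTITUTE": "%",    # explicitly marked for no change
        "?": "//Remove",         # matched value is removed
    }

    def apply_phrase_scheme(value, scheme):
        standard = scheme.get(value)
        if standard is None or standard == "%":
            return value         # values not found (or marked %) remain as is
        if standard == "//Remove":
            return ""            # matched word or phrase is removed
        return standard

    for v in ["SAS", "SAS INSTITUTE INC", "ACME", "?"]:
        print(v, "->", apply_phrase_scheme(v, scheme))

A value such as ACME, which does not appear in the Data column, passes through unchanged.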


Two Types of Standardization Schemes


There are two types of standardization schemes in the QKB:
• Phrase schemes
• Element schemes
Original data values: DataFlux Inc., SAS Inc.

Phrase scheme (Data → Standard):
DataFlux Inc. → DataFlux Incorporated
SAS Inc. → SAS Incorporated

Element scheme (Data → Standard):
Dataflux → DataFlux
Inc. → Incorporated
SAS → SAS

Standardization schemes are stored in the QKB. There are two types of schemes in the QKB.
• Phrase schemes are used to standardize entire phrases or strings that consist of more than one
word. Some examples of the data types that are typically stored as a phrase include cities,
organizations, addresses, and names.
• Element schemes are applied to each individual word in a phrase. This can be especially useful
if you have the type of data where certain words are repeated frequently (for example, the
qualifying extension on business names). Some data types that are typically standardized using
element schemes include address, organization, city, and name.
In the example, the two data values representing company names can be standardized in either of two ways. You could create a phrase standardization scheme that has both data values in it, with their standard representations. Otherwise, you could create an element standardization scheme that simply standardizes the word Inc., and it could be used to standardize both values.
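A small sketch in plain Python (illustrative only – an assumed model, not the QKB implementation) shows why the element scheme is the more economical choice here: one Data/Standard pair covers every phrase that contains the word Inc.

    element_scheme = {"Inc.": "Incorporated", "Dataflux": "DataFlux"}

    def apply_element_scheme(phrase, scheme):
        # Standardize each individual word; unmatched words pass through.
        return " ".join(scheme.get(word, word) for word in phrase.split())

    print(apply_element_scheme("Dataflux Inc.", element_scheme))  # DataFlux Incorporated
    print(apply_element_scheme("SAS Inc.", element_scheme))       # SAS Incorporated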


Using a Profile Report to Build a Scheme


Standardization schemes can easily be built from the results in a profile report. Select a column of data that is used in the profile, right-click the column, and select Build a Scheme from the menu.
Note: The frequency distribution must exist for the field so that you can use the Build a Scheme
option.
When you choose to build a scheme in this manner, the application runs an analysis on the data and provides you with a report of all the permutations of data from the selected field. When generating this report, you need to decide whether you want to analyze the data as entire phrases or as individual words.
If you choose to run a phrase analysis, similar values are grouped in the report. The grouping of data values is driven by a definition in the QKB called a match definition. To select the proper match definition, you need to select the type of data being analyzed.
In the example, a phrase analysis is run using the COMPANY field as input. Because this data
represents company (or organization) data, the Organization match definition is used to group
similar data values.


Using Scheme Builder



The Report Generation window settings generate the report, which is surfaced on the left side of the
Scheme Builder window. The Scheme Builder has a Report side and a Scheme side.

Using Scheme Builder

The Scheme side can then be constructed using the Report side. This can be done manually or in an automated fashion.

After the analysis report is displayed inside the Scheme Builder window, you can use the values in the report to build the Data/Standard pairs for the standardization scheme. This process can be done manually with the Add to Scheme button at the bottom of the left pane, or automatically with the Build Scheme toolbar button. Most schemes are built by using a combination of these methods. This enables users to combine automated methods, which leverage information and algorithms in the QKB, with specific requirements that involve some human intervention and decisions.


Creating a Phrase Standardization Scheme


This demonstration illustrates the steps that are necessary to create a phrase standardization
scheme from a profile report.
1. If necessary, access Data Management Studio.
a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.
b. Click Cancel to close the Log On window.
2. If necessary, open the Ch3D3_dfConglomerate_Profile profile.
a. Click the Folders riser bar.
b. Expand the Basics Demos repository to view the available folders.
c. Click the profiles_and_explorations folder to view the available items.
d. Right-click the Ch3D3_dfConglomerate_Profile profile and click Open.
The profile appears on a new primary tab.
3. Click the Report tab (if necessary).
4. Expand the dfConglomerate Gifts data connection to view the available tables.
5. Click the Customers table.
6. If necessary, click the Standard Metrics tab.
7. Right-click the COMPANY field and select Build a Scheme.

The Report Generation window appears.


8. Specify report settings.


a. Verify that Phrase analysis is selected as the type of analysis to run.

b. Under Phrase analysis, click the down arrow in the Match definition field and select
Organization.

c. Click the down arrow in the Sensitivity field and select 75.

d. Click OK to close the Report Generation window.


The Report side of the Scheme Builder window is populated with an alphabetical list of the values that are found in the Frequency Distribution report. Similar values are grouped together; the groups are determined by the selected match definition and sensitivity.


9. Use automated initial build of the scheme.


a. Select Edit  Build Scheme.
b. Verify that English (United States) is selected.

c. Click OK to close the Select Locale(s) window.


Note: By default, all values are copied to the Scheme side of the Scheme Builder window. The
multi-item groups are given a standard value, which is the most frequently occurring
value from the grouping on the report side.


10. Perform manual edits or updates to the scheme.


a. Change the standard value of Farmers Insurance Grp Inc.
1) On the Scheme side, scroll and locate the grouping of records with the standard value
Farmers Insurance Grp Inc.
2) Right-click one of the Farmers Insurance Grp Inc standard values and select Edit.
a) Enter Farmers Insurance Group in the Standard field.

b) Click OK.
Notice that the change applies to all items in the group.

Note: If a single value in a group of items needs to be changed, then select Edit 
Modify Standards Manually  Single Instance. A single value can then
be modified manually. To toggle back to the ability to change all instances
in a group, select Edit  Modify Standards Manually  All Instances.
b. Change the standard value of dfConglomerate incorporated.
1) On the Scheme side, scroll and locate the grouping of records with the standard value
dfConglomerate incorporated.
2) Right-click one of the dfConglomerate incorporated standard values and select Edit.
a) Enter dfConglomerate Inc. in the Standard field.
b) Click OK.
Notice that the change applies to all items in the group.


c. Change the standard value of Eta Technologies.


1) On the Scheme side, scroll and locate the grouping of records with the standard value
Eta Technologies.
2) Right-click one of the Eta Technologies standard values and select Edit.
a) Enter ETA Computers in the Standard field.
b) Click OK.
d. Use the same approach (detailed in steps e through g) to update the following standard values:
• Safeguard Business Sys  Safeguard Business Systems
• 5th 3rd Bank  Fifth Third Bank
• First Data Corp  First Data Corporation
e. Change the standard value of Safeguard Business Sys.
1) On the Scheme side, scroll and locate the grouping of records with the standard value
Safeguard Business Sys.
2) Right-click one of the Safeguard Business Sys standard values and select Edit.
a) Enter Safeguard Business Systems in the Standard field.
b) Click OK.
f. Change the standard value of 5th 3rd Bank.
1) On the Scheme side, scroll and locate the record with the standard value 5th 3rd Bank.
2) Right-click 5th 3rd Bank standard values and select Edit.
a) Enter Fifth Third Bank in the Standard field.
b) Click OK.
g. Change the standard value of First Data Corp.
1) On the Scheme side, scroll and locate the grouping of records with the standard value
First Data Corp.
2) Right-click one of the First Data Corp standard values and select Edit.
a) Enter First Data Corporation in the Standard field.
b) Click OK.
h. Change the standard value of ?.
1) On the Scheme side, scroll and locate the record with the standard value ?.
2) Right-click the ? standard value and select Edit.
a) Enter //Remove in the Standard field.
Note: This removes the value from the field when this scheme is applied.
b) Click OK.


The Scheme side of the Scheme Builder window should now resemble the following:


11. Save the scheme to the default QKB.


a. Select File  Save to save the scheme to the default QKB.
b. Enter Ch3D6 Company Phrase Scheme in the Name field.

c. Click Save.
12. Select File  Exit to close the Scheme Builder window.


Comparing a New Analysis Report to an Existing Scheme


It is now possible to run additional analysis reports on the organization fields in other tables and
compare the results with the scheme that you already built.
1. If necessary, access Data Management Studio.
a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.
b. Click Cancel to close the Log On window.
2. If necessary, open the Ch3D3_dfConglomerate_Profile profile.
a. Click the Folders riser bar.
b. Expand the Basics Demos repository to view the available folders.
c. Click the profiles_and_explorations folder to view the available items.
d. Right-click the Ch3D3_dfConglomerate_Profile profile and click Open.
The profile appears on a new primary tab.
3. If necessary, click the Report tab.
4. Expand the dfConglomerate Grocery data connection to view the available tables.
5. Click the MANUFACTURERS table.
6. If necessary, click the Standard Metrics tab.
7. On the Standard Metrics tab, right-click the MANUFACTURER field and select Build a Scheme.
The Report Generation window appears.
a. Under Phrase analysis, click the down arrow in the Match definition field and select
Organization.
b. Click the down arrow in the Sensitivity field and select 75.
c. Click OK to close the Report Generation window.


The new analysis report is displayed.

In examining the report, you might note that there are data values for company names that were not in the scheme that was created earlier. You need to compare this analysis report to the scheme created previously.

8. Open the scheme Ch3D6 Company Phrase Scheme.


a. Select File  Open.

b. Click Ch3D6 Company Phrase Scheme.


c. Click Open.
The scheme from the earlier demonstration appears on the Scheme side of the Scheme
Builder window.


9. Select Report  Compare Report to Scheme  Highlight Unaccounted Permutations.

Note: Values that do not exist in the scheme are highlighted in red.
10. Update the existing scheme.
a. At the bottom of the Report side, locate the Standard field.
b. Enter Arrowhead Mills, Inc. in the Standard field.

c. Locate the grouping of records that begins with Arrowhead Mills, Inc.


d. Click the first of these records.


e. Press and hold the Shift key and click the last of these records.

f. Click Add To Scheme (at the bottom of the Report side).


These five values are added to the scheme. Notice that the values are no longer
highlighted in red.
11. Update the existing scheme.
a. At the bottom of the Report side, locate the Standard field.
b. Enter Breadshop Natural Foods in the Standard field.
c. Locate the grouping of records that begins with Breadshop Natural Foods.


d. Click the first of these records.


e. Press and hold the Shift key and click the last of these records.

f. Click Add To Scheme.


12. Update the existing scheme.
a. Locate the three groups of records that begin with General Mills.
b. Double-click the value General Mills. This action populates the Standard field with the General Mills value.
c. Select the records in these three groups.

d. Click Add To Scheme.


13. Update the existing scheme.


a. Locate the grouping of records that begins with Hannaford.
b. Double-click Hannaford Bros. to populate the Standard field.
c. Select the records that begin with Hannaford.

d. Click Add To Scheme.


Notice that the values you manually added to the existing scheme are now black.


14. Select File  Save to save the scheme to the default QKB.
A warning window appears and indicates that there is duplicate data.

15. Click OK to close the warning window.


16. Select File  Exit to close the Scheme Builder window.
17. Select File  Close Profile.


Creating an Element Standardization Scheme


This demonstration illustrates the steps that are necessary to create an element standardization
scheme from a profile report.
1. If necessary, access Data Management Studio.
a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.
b. Click Cancel to close the Log On window.
2. If necessary, open the Ch3D3_dfConglomerate_Profile profile.
a. Click the Folders riser bar.
b. Expand the Basics Demos repository to view the available folders.
c. Click the profiles_and_explorations folder to view the available items.
d. Right-click the Ch3D3_dfConglomerate_Profile profile and click Open.
The profile appears on a new primary tab.
3. If necessary, click the Report tab.
4. Expand the dfConglomerate Gifts data connection to view the available tables.
5. Click the Customers table.
6. If necessary, click the Standard Metrics tab.
7. On the Standard Metrics tab, right-click the JOB TITLE field and select Build a Scheme.
The Report Generation window appears.
8. Specify report settings.
a. Select the Element analysis radio button.
b. Verify that (None) is specified in the Chop table field.

Recall that element analysis breaks the information in a field into individual words. The (None) choice finds individual words where words are separated by a space.


c. Click OK to close the Report Generation window.


The Report side of the Scheme Builder window is populated with a list
of the individual elements that are found in all the JOB TITLE field values.

9. Investigate and set a standard for the Manager values.


a. Select Mgr on the Report side.
b. With the Mgr value highlighted, select Report  Permutation Drill Through.


c. Click Close to close the Permutation Drill Through window.


d. Add Manager and Mgr to the scheme with a standard of Manager.
1) Enter Manager in the Standard field.
2) On the Report side, click Manager. Then hold down the Ctrl key and click Mgr.
3) Click Add To Scheme.

10. Set a standard for the Assistant values.


a. Double-click Assistant on the Report side. The Standard field is updated.
b. Hold down the Ctrl key and select Asst.
c. Click Add To Scheme.
11. Set a standard for the Marketing values.
a. Double-click Marketing on the Report side. The Standard field is updated.
b. Hold down the Ctrl key and select Mktg.
c. Click Add To Scheme.
12. Set a standard for the Representative values.
a. Double-click Representative on the Report side. The Standard field is updated.
b. Hold down the Ctrl key and select Rep.
c. Click Add To Scheme.


The values with the specified standard are added to the Scheme side of the Scheme Builder
window. The scheme should now resemble the following:

13. Save the scheme.


a. Select File  Save to save the scheme to the default QKB.
b. Enter Ch3D7 Job Title Element Scheme in the Name field.

c. Click Save.
14. Select File  Exit to close the Scheme Builder window.
15. Select File  Close Profile to close the profile.


Practice

2. Creating a Scheme from the CONTACT_CNTRY Field


Use Data Management Studio to create a standardization scheme using the values for the CONTACT_CNTRY field in the MANUFACTURERS table.

• Open Ch3E1_dfConglomerateGrocery_Profile from the profiles_and_explorations folder in the Basics Exercises repository.

• Locate the CONTACT_CNTRY field in the MANUFACTURERS table.

• Verify that all values can be represented by USA.

Note: Be sure to investigate the X value to confirm this.

• Create a scheme that translates all permutations of United States to USA.

• Save the scheme as Ch3E2 CONTACT_CNTRY Scheme.


9.7 Solutions
Solutions to Practices
1. Profiling Tables in the dfConglomerate Grocery Data Connection
a. If necessary, invoke Data Management Studio.
1) Select Start  All Programs  DataFlux  Data Management Studio 2.7.
2) Click Cancel in the Log On window.
b. Click the Folders riser bar.
c. Click the Basics Exercises repository.

1) Click (New)  Profile.


2) Double-click the profiles_and_explorations folder.
3) Enter Ch3E1_dfConglomerateGrocery_Profile in the Name field.
4) Click OK. The new profile appears on a tab.
d. Specify properties for the new profile.
1) Verify that the Properties tab is selected.
2) Click dfConglomerate Grocery. An X appears in the check box.
3) Define the default profile metrics.
a) Select Tools  Default Profile Metrics.
b) Verify that all metrics are selected.
c) Click OK to close the Metrics window.
e. Select File  Save Profile to save the profile to this point.
f. Override the metrics for the two ID fields.
1) Expand dfConglomerate Grocery.
2) Click the BREAKFAST_ITEMS table.
3) Right-click the ID field on the Fields tab and select Override Metrics.
a) Clear Frequency distribution.
b) Clear Pattern frequency distribution.
c) Clear Percentiles.
d) Clear Outliers.
e) Click OK to close the Metrics window.
4) Click the MANUFACTURERS table.
5) Right-click the ID field on the Fields tab and select Override Metrics.
a) Clear Frequency distribution.
b) Clear Pattern frequency distribution.
c) Clear Percentiles.


d) Clear Outliers.
e) Click OK to close the Metrics window.
g. Select File  Save Profile to save the profile.
h. Select Actions  Run Profile Report.
1) Enter Initial profile in the Description field.
2) Click OK to execute the profile.
The Report tab becomes active.
i. Review the Profile report.

Question: How many distinct values exist for the UOM field in the BREAKFAST_ITEMS
table?
Answer: Expand the BREAKFAST_ITEMS table, and then click the UOM field. The
Column Profiling tab shows that the Unique Count metric has a value of 6.

Question: What are the distinct values for the UOM field in the BREAKFAST_ITEMS table?
Answer: Click the UOM field. Click the Frequency Distribution tab to display the six unique values: OZ, CT, LB, PK, 0Z, and a blank.

Question: What are the ID values for the records with a value of PK for the UOM field
in the BREAKFAST_ITEMS table?
Answer: Double-click the value PK. The ID values for these two records are 556 and
859.

2. Creating a Scheme from the CONTACT_CNTRY Field


a. If necessary, invoke Data Management Studio.
1) Select Start  All Programs  DataFlux  Data Management Studio 2.7.
2) Click Cancel in the Log On window.
b. If necessary, access the appropriate profile.
1) Click the Folders riser bar.
2) Expand Basics Exercises  profiles_and_explorations.
3) Double-click Ch3E1_dfConglomerateGrocery_Profile. The profile opens on a new tab.
4) If necessary, click the Report tab.
c. Expand the MANUFACTURERS table to view the available fields.
1) Click the CONTACT_CNTRY field.
2) Click the Frequency Distribution tab.


d. Verify that the X value for CONTACT_CNTRY should be USA.


1) Double-click the X value.
2) Scroll to the right and verify that the rest of the CONTACT_ address fields indicate that CONTACT_CNTRY should be USA.

3) Click Close.
e. Begin designing standardization scheme.
1) Click the MANUFACTURERS table.
2) If necessary, click the Standard Metrics tab.
f. Right-click the CONTACT_CNTRY field and select Build a Scheme.
1) Accept the defaults in the Report Generation window and click OK.
The Scheme Builder window appears with the Report side showing the three distinct
values.

2) Double-click the USA value.


This populates the Standard field at the bottom of the Report side.

3) Select all three values on the Report side.


4) Click Add To Scheme.
The Scheme side now displays the three values
each with a Standard of USA.


g. Select File  Save As to save the scheme to the default QKB.


1) Enter Ch3E2 CONTACT_CNTRY Scheme in the Name field.

2) Click Save.
3) Select File  Exit to close the Scheme Builder window.
4) Select File  Close Profile to close the profile.

Lesson 10 DataFlux® Data Management Studio: Building Data Jobs to Improve Data

10.1 Introduction to Data Jobs .......................................................................................... 10-3

Demonstration: Setting DataFlux Data Management Studio Options........................... 10-15

10.2 Standardization, Parsing, and Casing ...................................................................... 10-20

Demonstration: Investigating Standardization ......................................................... 10-24


Demonstration: Working with a Field Layout Node................................................... 10-31

Demonstration: Investigating Parsing and Casing .................................................... 10-36

Practice............................................................................................................. 10-44

10.3 Identification Analysis and Right Fielding ................................................................ 10-46

Demonstration: Investigating Right Fielding and Identification Analysis ....................... 10-50

10.4 Branching and Gender Analysis .............................................................................. 10-55

Demonstration: Working with the Branch and Data Validation Nodes .......................... 10-57

Demonstration: Investigating Gender Analysis ........................................................ 10-62

10.5 Data Enrichment ...................................................................................................... 10-68

Demonstration: Working with Address Verification and Geocoding Nodes.................... 10-74

Practice............................................................................................................. 10-84

10.6 Solutions ................................................................................................................. 10-86

Solutions to Practices .......................................................................................... 10-86


Solutions to Student Activities ..............................................................................10-100

10.1 Introduction to Data Jobs


DataFlux Data Management Methodology (Review)

Data Management Studio uses data jobs for the ACT phase. Data jobs are the main way to process data in Data Management Studio.

(Diagram: the methodology cycle – DEFINE, PLAN, DISCOVER, DESIGN, EXECUTE, EVALUATE, MONITOR, and CONTROL – with the ACT phase spanning DESIGN and EXECUTE.)

Recall the three-phase methodology introduced earlier. In this lesson, the ACT phase is explored.

Data Management Studio uses data jobs for the ACT phase.

The ACT phase involves two primary areas:

DESIGN   After you complete the first two steps, this phase enables you to take the different structures, formats, data sources, and data feeds and create an environment that accommodates the needs of your business. At this step, business and IT users build workflows to enforce business rules for data quality and data integration. They also create data models to house data in consolidated or master data sources.

EXECUTE  After business users establish how the data and rules should be defined, the IT staff can install them within the IT infrastructure and determine the integration method (real time, batch, or virtual). These business rules can be reused and redeployed across applications, which helps increase data consistency in the enterprise.


Data Job Composition

(Diagram: Source Node → Node A → Node B → Node C → Target Node)

Data jobs are comprised of nodes. Each node is designed to accomplish a particular objective. Most data jobs start with a source node and end with a target node.

Data jobs consist of nodes. Each node is designed to accomplish an objective (for example, generate a subset). Most data jobs start with a source node and end with a target node. In the example job flow, there are five nodes. The beginning node is a source node and the ending node is a target node. The nodes labeled A, B, and C each process the data (to transform and/or cleanse) and deliver the results to the target node.
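The flow of rows through a data job can be sketched in plain Python (illustrative only – a conceptual model, not DataFlux code): a source node produces rows, intermediate nodes transform them, and a target node consumes them.

    def source_node():               # for example, a Data Source or Text File Input node
        yield {"COMPANY": "sas"}
        yield {"COMPANY": "dataflux inc"}

    def node_a(rows):                # an intermediate node; here it simply uppercases a field
        for row in rows:
            row["COMPANY"] = row["COMPANY"].upper()
            yield row

    def target_node(rows):           # for example, a Text File Output node
        for row in rows:
            print(row)

    target_node(node_a(source_node()))

Chaining the functions mirrors the left-to-right flow of the diagram: each node receives the rows of the node before it.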


Job Options
Data Management Studio provides many configurable options, including options that affect data jobs.

Data Management Studio provides you with a good selection of options to control the operations that you perform inside of the application. You can navigate to these options by selecting Tools  Data Management Studio Options from the main menu. You can specify the following options:
• general interface options
• filter and search options for viewing data
• job options, such as layout defaults
• SQL options, such as join defaults and editor type defaults

In the Data Management Studio Options window, you can set options that affect how you create and execute jobs. These options are available by selecting Job in the left navigation pane. In the example shown, you see the specific options that are available when working with jobs.


Job Options: Output Fields


An important option for data jobs is one that deals with field propagation throughout the nodes in a data job.

An example of the type of options you can set for jobs is the Output Fields option. This option specifies how fields are initially passed between nodes in a data job. There are three options available for controlling this behavior:
• Target
• Source and Target
• All

Before discussing the options, we need a bit more information about the types of nodes found in a data job.

Sample Data Job Node Types

(Diagram: Source Node → Node A → Node B → Node C → Target Node)

Source Nodes: Data Source, Text File Input, SQL Query, and so on
Intermediate Nodes: Standardization, Parsing, Casing, and so on
Target Nodes: Data Target (Insert), Data Target (Update), Text File Output, and so on

A data job typically starts with a source node to bring data into the data job flow. Most data jobs end with one or more "target" nodes that create output from the processes applied in the data job.

In between the source and the target node, any number of processing steps can take place on the data as it is passed from one node to the next. In the diagram above, data enters the job flow through a source node, is passed through three additional processing nodes (Node A, Node B, and Node C), and then finally to a target node.

The source node specifies properties for a type of data source. Thus, a source node could specify the properties to access data from
• a table from a defined data connection
• a text file
• the result set of an SQL query.
Note: There are several other source node types available – the above list is not all-encompassing.

The intermediate nodes contain properties for their objective or purpose (for example, an intermediate node can specify properties for subsetting, standardizing, or parsing). Each type of intermediate node has different properties than other nodes.

The target node specifies properties for loading or using the result set generated in the data job. A target node could specify the properties to load data to
• a table for a defined data connection
• a text file.
Note: There are several other target node types available – the above list is not all-encompassing.

Also, the result of a data job could be a report. Therefore, a target node could contain the properties for a type of report (for example, an HTML report).


Choosing Source Node Fields


Data jobs typically begin with a source node of some type. Here is the properties window of a Data Source node. The dialog provides a way to choose the fields to use from this source.

Many data jobs typically start with a source node of some type. Text File Input and Data Source are two commonly used nodes.

In the example shown, all fields from the Customers table are being passed from the Data Source node into the next node.

New Fields (Intermediate Nodes)


Some data job nodes produce new fields. Here is the properties window of a Standardization node – five fields are being standardized, and five new fields will hold the standardized values.

Many data job nodes create additional output fields as they process the data. For example, a Standardization node can generate standard values stored in new fields.

In the example shown, five fields were moved to the Selected list (each of the five fields has a standardization definition applied). The results for each of the five fields will be written to new fields whose default names are the original field name with the text _Stnd added. For example, the COMPANY field's standardized values will be stored in a new field called COMPANY_Stnd.

But what about the original fields that are in the source data read by the node?


Incoming Fields as Outputs (Intermediate Nodes)


In many nodes, source fields (fields passed in to the node) are passed through the node using an Additional Outputs window.

Many nodes that produce new fields, like the Standardization node, have an Additional Outputs button. Clicking this button opens a window that provides the ability to select which fields from the source data get passed through the current node.

In the example shown, all fields from the Customers table were passed from a Data Source node into the Standardization node – this can be discovered by examining the Available list in the Standardization Properties window, or by opening the Additional Outputs window.


Choosing Target Node Fields


Data jobs typically end with output of some type. Here is the properties window of a Text File Output node. The window provides a way to choose the fields to write to this target.

Many data jobs end with a data output of some type. Text File Output and new tables produced with Data Target (Insert) are two commonly used nodes. In the example shown, only two fields from the list of incoming fields are selected to be written to the new text file.


Field Propagation Setting for Data Jobs


There are three choices for
field propagation:
• Target
• Source and Target
• All

Target All fields available to target nodes are passed to the target.

All fields available from source nodes are passed to next node,
Source and Target
and all fields available to target nodes are passed to the target.

All available fields are passed through source nodes,


All
target nodes, and all intermediate nodes.
16
C o p y r i g h t © S AS In s t i tu t e In c. Al l r i g h t s re s e r ve d .

One of the options that you can set for the data job controls how output fields are handled for each node. The purpose of this option is to control how fields are initially passed to the adjacent nodes in the job flow. The setting of the Output Fields option controls the initial selection of fields. It is important to note that field selection modifications can be made on a node-by-node basis after the initial propagation of fields.


Field Propagation Option: Target


There are three choices for
field propagation:
• Target
• Source and Target
• All

Source Target
Node A Node B Node C
Node Node

Fields

17
C o p y r i g h t © S AS In s t i tu t e In c. Al l r i g h t s re s e r ve d .

If Target is selected f or the Output Fields option, then only the node connected to a target no de will
initially propagate f ields to the target node.

Field Propagation Option: Source and Target


There are three choices for
field propagation:
• Target
• Source and Target
• All

Source Target
Node A Node B Node C
Node Node

Fields Fields

18
C o p y r i g h t © S AS In s t i tu t e In c. Al l r i g h t s re s e r ve d .

If Source and Target is selected f or the Output Fields option, then


1. the source node will automatically propagate fields to the next node in the job f low diagram
2. the node connected to a target node will automatically propagate f ields to the target node.


Field Propagation Option: All


There are three choices for
field propagation:
• Target
• Source and Target
• All

Source Target
Node A Node B Node C
Node Node

Fields Fields Fields Fields

19
C o p y r i g h t © S AS In s t i tu t e In c. Al l r i g h t s re s e r ve d .

If All is selected f or the Output Fields option, then all nodes will automatically propagate all f ields to
the next node in the job f low diagram.


Setting DataFlux Data Management Studio Options

This demonstration illustrates the steps that are necessary to investigate and set various DataFlux
Data Management Studio options.

1. If necessary, access Data Management Studio.

a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.

b. Click Cancel to close the Log On window.

2. Verify that the Home tab is selected.


3. Select Tools  Data Management Studio Options.

The Data Management Studio Options window appears.


4. Verify that the General item is selected on the left selection pane.

5. Clear Automatically preview a selected node when the preview tab is active.

6. Click Job (on the left selection pane).


7. Click Include node-specific notes when printing.

8. Click All in the Output Fields area.


9. Click New job defaults (in the Job grouping) on the left selection pane.

Options are available to establish default functionality when you create new data jobs.

These options enable you to do the following actions:
• Show a grid in the background of the job editor.
• Control the direction of the layout of the nodes.
• Select a color for sticky notes.
• Set options specific to error handling for process jobs.


10. Click QKB (in the Job grouping) on the left selection pane.

11. If necessary, click in the Default locale field and select English (United States).

This option enables you to specify a locale to be used by default in data jobs. Many nodes give you the option to override the default locale for that step in the processing. In addition, the QKB selection also displays a list of available locales for the default QKB.

12. Click OK to save the changes and close the Data Management Studio Options window.


10.01 Activity

✓ If necessary, access Data Management Studio.
  • Access DataFlux Data Management Studio by selecting Start  All Programs  DataFlux  Data Management Studio 2.7.
  • Click Cancel in the Log On window.
✓ Select Tools  Data Management Studio Options.
✓ Under General, clear Automatically preview a selected node when the preview tab is active.
✓ Under Job, select Include node-specific notes when printing.
✓ Under Job, select All in the Output Fields area.
✓ Under Job, verify that English (United States) is the default locale.

10.2 Standardization, Parsing, and Casing

Data Job Node Groups

Nodes for a data job are found in groupings of like nodes. For example:
• Data Inputs: Data Source, SQL Query, Text File Input, …
• Quality: Standardization, Parsing, Change Case, Identification Analysis, …
• Enrichment: Geocoding, Address Verification (US/Canada), …
• Data Outputs: Data Target (Insert), HTML Report, Text File Output, …

These are the groups:
• Data Job
• Data Inputs
• Data Outputs
• Data Integration
• Quality
• Enrichment
• Entity Resolution
• Monitor
• Profile
• Utilities

Our first job will use a Data Input node, several Quality nodes, and a Data Output node. This data job will accomplish three things:
• Standardization – allows spelling and abbreviation issues to be addressed
• Parsing – allows a field of data to be broken into smaller, perhaps more usable, components
• Casing – applies some consistency to the data

Two Ways to Standardize

The Standardization node allows for two different ways to make data more consistent or standardized.

Standardization Scheme
• is a simple lookup table
• values not found remain as is

Standardization Definition
• is more complex than a standardization scheme
• can involve one or more standardization schemes
• can also parse data and apply regular expression libraries and casing

A node that we will use in a data job is the Standardization node. The properties of the Standardization node allow for the selection of fields to be standardized. Then, for each of the selected fields, you can choose either a standardization scheme or a standardization definition (or both!). When executed, the scheme, the definition, or both transform the data values in a consistent manner.

In a previous section, we explored the creation of a standardization scheme file, which is simply a list of data values along with how we would like to have the values written out. A standardization scheme is used to ensure the standard representation of data values. Standardization schemes can be applied to single words (element scheme) or to the entire phrase (phrase scheme).

A standardization definition is like an algorithm and is more complex than a scheme. A standardization definition can involve one or more standardization schemes, and it can also parse data, apply regular expressions, and possibly apply casing.

Example: Standardization Scheme

Partial scheme table:

Data      Standard
Street    St
St.       St
ST.       St
Rd.       Rd
Road      Rd
RD.       Rd

Original data (ADDRESS)             Standardized data (ADDRESS_STND)
123 North Main Street, Suite 100    123 North Main St, Suite 100
4591 S. Covington Road              4591 S. Covington Rd
29 WASSAU ST.                       29 Wassau St

In the example shown, an element scheme was created to standardize words that are contained in address data values. After the scheme is applied to the data, the words that match the "data" value in the scheme are transformed to the "standard" value in the scheme.

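To make the word-by-word lookup behavior concrete for readers who also write SAS code, the DATA step below is a rough emulation of an element scheme. It is only an illustration, not how Data Management Studio stores or applies schemes: the addresses table and its ADDRESS field are hypothetical, and punctuation handling is simplified.

data addresses_stnd;
   set addresses;                             /* hypothetical input table */
   length address_stnd $ 200 word stdword $ 40;
   address_stnd = '';
   /* apply the lookup word by word, as an element scheme would */
   do i = 1 to countw(address, ' ');
      word = scan(address, i, ' ');
      select (upcase(word));
         when ('STREET', 'ST.', 'ST') stdword = 'St';
         when ('ROAD', 'RD.', 'RD')   stdword = 'Rd';
         otherwise                    stdword = word;  /* values not found remain as is */
      end;
      address_stnd = catx(' ', address_stnd, stdword);
   end;
   drop i word stdword;
run;
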

Example: Standardization Definition

Each unique kind of data can have a corresponding standardization definition that can be applied to modify data.

Data before Standardization         Apply This Definition    Data after Standardization
Mister John Q. Smith, Junior        Name                     Mr John Q Smith, Jr
dataflux corporation                Organization             DataFlux Corp
123 North Main Street, Suite 10     Address                  123 N Main St, Ste 10

The example shown displays a record for three different types of data: name data, company or organization data, and address data.

• If the Name standardization definition [from the English (United States) locale] is applied to the first string of data, you see that several changes have occurred to produce the resultant value. The name value provided (Mister John Q. Smith, Junior) shows that Mister was transformed to Mr, the period following the middle initial was removed, and Junior was transformed to Jr.

• If the Organization standardization definition [from the English (United States) locale] is applied to the second string of data, you see that several changes have occurred to produce the resultant value. The company value provided (dataflux corporation) shows the unique casing for DataFlux, and the word corporation was transformed to Corp.

• If the Address standardization definition [from the English (United States) locale] is applied to the third string of data, you see that several changes have occurred to produce the resultant value. The address value provided (123 North Main Street, Suite 10) shows that North was transformed to N, Street was transformed to St, and Suite was transformed to Ste.

Note: If you standardize a data value using both a definition and a scheme, the definition is applied first and then the scheme is applied.

Note: Data standardization does not perform a validation of the data (for example, address verification). Address verification is a separate component of the Data Management Studio application and is discussed in another section.

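If SAS Data Quality Server is licensed, the same QKB standardization definitions can also be applied in SAS code with the DQSTANDARDIZE function. This is a minimal sketch, assuming the ENUSA locale has been set up and loaded, and assuming a hypothetical contacts table with name, company, and address fields:

options dqlocale=(enusa);   /* assumes the QKB is configured, e.g. via DQSETUPLOC= */

data contacts_stnd;
   set contacts;                                   /* hypothetical input table */
   name_stnd    = dqStandardize(name,    'Name');
   company_stnd = dqStandardize(company, 'Organization');
   address_stnd = dqStandardize(address, 'Address');
run;
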

Investigating Standardization

This demonstration illustrates the steps that are necessary to create a data job that standardizes fields in a source data table. This data job is continued in subsequent demonstrations.

1. If necessary, access Data Management Studio.

a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.

b. Click Cancel to close the Log On window.

2. Verify that the Home tab is selected.

3. Click the Folders riser bar.

4. Expand Basics Demos.

5. Create the data job.

a. Right-click the batch_jobs folder and select New  Data Job.

b. Enter Ch5D1_Customers_DataQuality in the Name field.

c. Click OK.

The new data job appears on the primary tab.

6. Add a Data Source node to the job flow.

a. Verify that the Nodes riser bar is selected in the resource pane.

b. Expand the Data Inputs grouping of nodes.

c. Double-click the Data Source node.

The node is added to the job flow and the properties window for the node appears.

7. Specify properties for the Data Source node.

a. Enter Customers Table in the Name field.

b. Click the ellipsis next to the Input table field.

1) Expand the dfConglomerate Gifts data connection.

2) Select the Customers table.

3) Click OK to close the Select Table window.

The Data Source Properties window should resemble the following:

c. Click OK to close the Data Source Properties window.

8. Right-click the Data Source node and select Preview.

A sample of records appears on the Preview tab of the Details pane.

Notice that there are a number of inconsistencies in the data that need to be addressed:

• The COMPANY field needs to be corrected or cleansed. You can use the standardization scheme created earlier to accomplish this.

• The phone fields are not consistently formatted. A standardization definition can be used to accomplish this.

• The ADDRESS field has portions of the field that are not consistent – specifically, note the inconsistent use of pre-directions (for example, "East", "E.", and "E") and street types (for example, "Ave", "Ave.", and "Avenue").

9. Add a Standardization node to the job flow.

a. From the Nodes riser bar in the resource pane, collapse the Data Inputs grouping of nodes.

b. Expand the Quality grouping of nodes.

c. Locate the Standardization node.

d. Double-click the Standardization node.

The node is added to the job flow and the properties window for the node appears.

10. Specify properties for the Standardization node.

a. Enter Standardize Fields in the Name field.

b. Verify that English (United States) is selected for the Locale field.

c. In the Standardization fields area, double-click each of the following fields to move them from the Available list to the Selected list:

COMPANY
JOB TITLE
BUSINESS PHONE
HOME PHONE
MOBILE PHONE
FAX NUMBER
ADDRESS
STATE/PROVINCE
ZIP/POSTAL CODE
COUNTRY/REGION

The Selected list should resemble the following:

d. For the COMPANY field, click under Scheme and select Ch3D6 Company Phrase Scheme.

e. For the JOB TITLE field, click under Scheme and select Ch3D7 Job Title Element Scheme.

f. For the BUSINESS PHONE field, click under Definition and select Phone.

g. For the HOME PHONE field, click under Definition and select Phone.

h. For the MOBILE PHONE field, click under Definition and select Phone.

i. For the FAX NUMBER field, click under Definition and select Phone.

j. For the ADDRESS field, click under Definition and select Address.

k. For the STATE/PROVINCE field, click under Definition and select State/Province (Abbreviation).

l. For the ZIP/POSTAL CODE field, click under Definition and select Postal Code.

m. For the COUNTRY/REGION field, click under Definition and select Country.

n. Click Preserve null values.

o. Click Add standardization flag field.

The Standardization Properties under the Selected list should resemble the following:

Note: A field can be standardized using a standardization definition or a standardization scheme, or both. If both are specified, the definition is applied first, and then the scheme is applied to the results from the definition.

Note: Selecting Preserve null values ensures that if a field is null when it enters the node, then the field is null after it is written to output from the node. It is recommended that this option be selected if the output is written to a database table.

Note: Selecting Add standardization flag field adds a new field to the output of the node. This field indicates whether the value was standardized by the definition or scheme. If the value of the _Stnd field is different from the value of the input field, then the flag has a value of True. Otherwise, it has a value of False.

p. Click OK to close the Standardization Properties window.

11. Preview the Standardization node.

a. Right-click the Standardization node and select Preview.

A sample of records appears on the Preview tab of the Details pane.

b. Scroll to the right to view the _Stnd fields.

Note: Only a few of the _Stnd fields are shown in the above display.

c. Scroll farther to the right to view the _Stnd_flag fields.

Note: Only a few of the _Stnd_flag fields are shown in the display.

When this node is previewed, the fields always appear in a particular order (all of the original fields, then all of the _Stnd fields, and then all of the _Stnd_flag fields). With this order, it is difficult to verify whether a field's original value was changed by applying the standardization definition or the standardization scheme. Most nodes do not allow the intermingling of the original fields and the fields that are produced by the node.

However, a Field Layout node can be used temporarily to reorder all fields as desired.

12. Select File  Save to save the data job.

Field Layout Node

The Field Layout node can be used to perform the following tasks:
• subset the fields that pass through the node
• control the order of the fields
• change output names for the fields

[Slide example: only 6 of 11 fields are selected; the CATEGORY field is moved to position 4 in the selected list; CATEGORY is renamed to PRODUCT CATEGORY in the selected list.]

The Field Layout node has many uses. When added to a job flow diagram, it can filter out unneeded fields for subsequent nodes. In addition, the selected fields can be placed in any desired order, and the output names can be updated.

In the example above, we see the following:

• Only 6 of 11 fields were moved to the selected list. Thus, the next node in this job flow will have only the selected six fields available.
• The CATEGORY field that originally appears at the end of the field list was moved to the fourth position. Thus, the next node in this job flow will show the CATEGORY field available in the fourth position.
• The CATEGORY field was renamed to PRODUCT CATEGORY.

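For comparison, SAS programmers get the same subset, reorder, and rename behavior from DATA step and dataset options. A minimal sketch, assuming a hypothetical products table whose last column is CATEGORY:

data products_layout;
   /* RETAIN before SET fixes the column order; KEEP= subsets the */
   /* fields; RENAME= changes the output name of CATEGORY         */
   retain id name price product_category;
   set products(keep=id name price category
                rename=(category=product_category));
run;
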

Working with a Field Layout Node

In this demonstration, you add a Field Layout node to the data job from the previous demonstration to control the order of the columns. Specifically, you want to preview the data where the original values of the fields are placed adjacent to their standardized values.

1. If necessary, access Data Management Studio.

a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.

b. Click Cancel to close the Log On window.

2. If necessary, open the Ch5D1_Customers_DataQuality data job.

a. Click the Folders riser bar.

b. Expand Basics Demos.

c. Click batch_jobs.

d. Double-click the Ch5D1_Customers_DataQuality data job.

3. Add a Field Layout node to the job flow (temporarily).

a. From the Nodes riser bar in the resource pane, if necessary, collapse the Quality grouping of nodes.

b. Expand the Utilities grouping of nodes.

c. Double-click the Field Layout node.

The node is added to the job flow, attached to the Standardization node, and the properties window for the node appears.

4. Specify properties for the Field Layout node.

a. Click Sort Fields.

All the fields are now sorted alphabetically by name. This sorting places each original field next to its standardized field, which is in turn next to the standardization flag field. For example, ADDRESS is next to ADDRESS_Stnd, which is next to ADDRESS_Stnd_flag.

Note: The up and down arrows to the right of the Selected fields list can also be used to reorder the fields.

Now a preview of this Field Layout node enables us to see how the selected standardization techniques have affected the resultant fields.

b. Click OK to close the Field Layout Properties window.

5. Preview the data for the (temporary) Field Layout node.

a. Verify that the Details pane is displayed.

b. Right-click the Field Layout node and select Preview.

It is now easy to investigate the values that are produced by the Standardization node.

For the first record, notice that the original ADDRESS field value was changed (Lane changed to Ln). Therefore, the ADDRESS_Stnd_flag field has a value of True.

For the second record, notice that the original ADDRESS field value was not changed. Therefore, the ADDRESS_Stnd_flag field has a value of False.

Similar observations can be made for the rest of the records across the three related fields (the original field, the standardized field, and the standardization flag field).

c. Right-click the Field Layout node and select Delete.

d. Select File  Save to save the data job.

Parsing Example

Parse definitions define the rules to split the words from a text string into the appropriate tokens.

Name Information: Dr. Alan W. Richards, Jr., M.D.

Parsed Name:

Token                   Value
Prefix                  Dr.
Given Name              Alan
Middle Name             W.
Family Name             Richards
Suffix                  Jr.
Title/Additional Info   M.D.

A parse definition is a type of definition in the QKB that specifies the rules for breaking a text string into tokens. Tokens are predefined, specific to a data type, and have a semantic meaning. For example, the Name data type includes six different tokens (Prefix, Given Name, Middle Name, Family Name, Suffix, Title/Additional Info), and there is a distinct difference in the semantic meaning of each of these tokens.

The example shown is parsed using the Name parse definition. This definition splits a name into individual tokens. In this example, the name contains information for each token in the Name data type. It is important to note that when we parse any specific value, some tokens might not necessarily be assigned values. (For example, a name value of John Smith would populate just two of the tokens, Given Name and Family Name.)

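In SAS Data Quality Server, the analogous calls are DQPARSE, which marks the token boundaries, and DQPARSETOKENGET, which extracts one token. A minimal sketch, assuming the ENUSA locale is loaded; the expected values in the comments come from the slide example above:

data parsed_name;
   length parsed given family $ 60;
   name   = 'Dr. Alan W. Richards, Jr., M.D.';
   parsed = dqParse(name, 'Name');                          /* delimited token string */
   given  = dqParseTokenGet(parsed, 'Given Name',  'Name'); /* Alan     */
   family = dqParseTokenGet(parsed, 'Family Name', 'Name'); /* Richards */
run;
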

Casing Example

Case definitions are algorithms that can be used to convert a text string to uppercase, lowercase, or proper case.

Input Data: Dataflux corporation

Data after Casing:

Definition               Result
Upper                    DATAFLUX CORPORATION
Lower                    dataflux corporation
Proper                   Dataflux Corporation
Proper (Organization)    DataFlux Corporation

The purpose of a case definition is to convert text to a specific case. For converting to uppercase or lowercase, each value is converted to uppercase or lowercase, respectively. For proper casing, the definition converts the first character of each word to uppercase, and then augments that with the known casing of certain words (for example, DataFlux) and patterns within words. (For example, if any word begins with Mc, then convert the next letter to uppercase.)

Note: For the best results, select an applicable definition that is associated with a specific data type when you apply proper casing.

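The corresponding SAS function is DQCASE. A minimal sketch, assuming the ENUSA locale is loaded (definition names can vary by QKB release); the expected results in the comments mirror the table above:

data cased;
   company = 'Dataflux corporation';
   upper   = dqCase(company, 'Upper');                  /* DATAFLUX CORPORATION */
   lower   = dqCase(company, 'Lower');                  /* dataflux corporation */
   proper  = dqCase(company, 'Proper (Organization)');  /* DataFlux Corporation */
run;
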

Investigating Parsing and Casing

This demonstration adds to the data job from the previous demonstration by adding a Parsing node to parse an email field into tokens. A Change Case node is then added to convert the tokenized email fields to uppercase. Finally, a Text File Output node is added to the job flow.

1. If necessary, access Data Management Studio.

a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.

b. Click Cancel to close the Log On window.

2. If necessary, open the Ch5D1_Customers_DataQuality data job.

a. Click the Folders riser bar.

b. Expand Basics Demos.

c. Click batch_jobs.

d. Double-click Ch5D1_Customers_DataQuality.

3. Add a Parsing node to the job flow.

a. From the Nodes riser bar in the resource pane, if necessary, collapse the Utilities grouping of nodes.

b. Expand the Quality grouping of nodes.

c. Double-click the Parsing node.

The node is added to the data job flow and connected to the Standardization node. The properties window for the Parsing node appears.

4. Specif y properties for the Parsing node.

a. Enter Parse Email in the Name f ield.

b. Verif y that English (United States) is selected f or the Locale f ield.

c. Click the down arrow under Field to parse and select EMAIL.

d. Click the down arrow under Definition and select E-mail.

e. Click the double right-pointing arrows to select all tokens: Mailbox, Sub-Domain,
Top-Level Domain, and Additional Info.

Note: It is not necessary to always select all tokens.

f. Click Preserve null values.

The Parsing Properties window should resemble the f ollowing:

Note: When selected, Result code field specif ies that the results of the parse operation
are passed to the output of the node. If no f ield name is entered, the results are put
into a f ield named __parse_result__.

Copyright © 2020, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.
10-38 Lesson 10 DataFlux® Data Management Studio: Building Data Jobs to Improve Data

Possible values for the result code field are as follows:

Result Code Value   Description
OK                  The parse operation was successful.
NO SOLUTION         The parse operation was unsuccessful, and no solution was found.
NULL                The parse operation was not attempted. This result occurs only when a null value was in the field to be parsed and the Preserve null values option was enabled.
ABANDONED           A resource limit was reached. Increase your resource limit and try again.

g. Click OK to close the Parsing Properties window.

5. Preview the Parsing node.

a. Right-click the Parsing node and select Preview.

A sample of records appears on the Preview tab of the Details pane.

b. Scroll to the right to view the Mailbox, Sub-Domain, Top-Level Domain, and Additional Info fields.

6. Select File  Save to save the data job.

7. Add a Change Case node to the job flow.

a. From the Nodes riser bar in the resource pane, verify that the Quality grouping of nodes is expanded.

b. Double-click the Change Case node.

The node is added to the data job flow and connected to the Parsing node. The properties window for the Change Case node appears.

8. Specify properties for the Change Case node.

a. Enter Upper Case EMAIL Fields in the Name field.

b. Verify that English (United States) is selected for the Locale field.

c. Double-click each of the following fields to move them from the Available list to the Selected list:

EMAIL
Mailbox
Sub-Domain
Top-Level Domain
Additional Info

d. For each of the five selected fields, click the down arrow under Type and select Upper.

Note: The selection of the Type field value filters the list of available case definitions for the Definition field.

e. For each of the five selected fields, click the down arrow under Definition and select Upper.

f. Accept the default output names.

g. Click Preserve null values.

The Change Case Properties window should resemble the following:

h. Click OK to close the Change Case Properties window.

9. Preview the data.

a. Right-click the Change Case node and select Preview.

A sample of records appears on the Preview tab of the Details pane.

b. Scroll to the right to view the cased email and individual email token fields.

10. Add a Text File Output node to the job flow.

a. From the Nodes riser bar in the resource pane, collapse the Quality grouping of nodes.

b. Expand the Data Outputs grouping of nodes.

c. Double-click the Text File Output node.

The node is added to the job flow and connected to the Change Case node. The properties window for the Text File Output node appears.

11. Specify properties for the Text File Output node.

a. Enter Customer Info in the Name field.

b. Specify the output file information.

1) Click the ellipsis next to the Output file field.

2) Navigate to D:\Workshop\dqdmp1\Demos\files\output_files.

3) Enter Ch5D1_CustomerInfo.txt in the File name field.

4) Click Save.

c. Specify attributes for the file.

1) Select the double quotation mark " for the Text qualifier field.

2) Verify that the Field delimiter field is set to Comma.

3) Click Include header row.

4) Click Display file after job runs.

d. Accept all default output fields.

The Text File Output Properties window should resemble the following:

e. Click OK to close the Text File Output Properties window.

12. Select File  Save to save the data job.

13. Run the job.

a. Verify that the Data Flow tab is selected.

b. Select Actions  Run Data Job.

The job runs, and the text file output is displayed in a Notepad window.

c. Select File  Exit to close the Notepad window.

14. View the detailed log.

a. Click the Log tab.

b. Review the information for each of the nodes.

15. Close the data job.

a. Click the Data Flow tab.

b. Select File  Close.

Practice

1. Creating a Data Job Containing Standardization and Parsing

Create a data job that uses the Standardization, Parsing, and Data Target (Insert) nodes. The final job flow should resemble the following:

• Create a new data job named Ch5E1_Manufacturers_DataQuality in the batch_jobs folder of the Basics Exercises repository.

• Add the MANUFACTURERS table in the dfConglomerate Grocery data connection as the data source and select the following fields:

ID
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_CNTRY
CONTACT_PHONE
POSTDATE

• Use the specified standardization definition or scheme to standardize the following fields:

Field Name            Definition                       Scheme
MANUFACTURER          Organization
CONTACT               Name
CONTACT_ADDRESS       Address
CONTACT_CITY          City
CONTACT_STATE_PROV    State/Province (Abbreviation)
CONTACT_CNTRY                                          Ch3E2 CONTACT_CNTRY Scheme
CONTACT_PHONE         Phone

Note: Accept the default names for the standardized fields and be sure to preserve null values.

• Using the Name parse definition, parse the standardized CONTACT field. Be sure to preserve null values. Select three tokens and rename the output fields as follows:

Token Name     Output Name
Given Name     FIRST_NAME
Middle Name    MIDDLE_NAME
Family Name    LAST_NAME

• Write the standardized, parsed data to a new table named Manufacturers_Stnd (in the dfConglomerate Grocery data connection). If the data job runs multiple times, ensure that the records for each run are the only records in the table.

• Create a selection of the fields and rename the output fields as follows:

Field Name                  Output Name
ID                          ID
MANUFACTURER_Stnd           MANUFACTURER
FIRST_NAME                  FIRST_NAME
MIDDLE_NAME                 MIDDLE_NAME
LAST_NAME                   LAST_NAME
CONTACT_ADDRESS_Stnd        CONTACT_ADDRESS
CONTACT_CITY_Stnd           CONTACT_CITY
CONTACT_STATE_PROV_Stnd     CONTACT_STATE_PROV
CONTACT_POSTAL_CD           CONTACT_POSTAL_CD
CONTACT_CNTRY_Stnd          CONTACT_CNTRY
CONTACT_PHONE_Stnd          CONTACT_PHONE
POSTDATE                    POSTDATE

• Save and run the data job.

• Review the detailed log and close the data job.

Question: Where (in Data Management Studio) can you view the new table's data?
Answer:

Question: Where (in Data Management Studio) can you view the new table's field names?
Answer:

10.3 Identification Analysis and Right Fielding

Identification Analysis Definitions

Identification analysis definitions specify data and logic that can be used to identify the semantic type of a data string.

Customer
DataFlux
John Q Smith
DataFlux Corp
Nancy Jones

For example, an identification analysis definition might be used to determine whether the data shown represents the name of an individual or an organization.

Data Job Nodes Applying Identification Analysis

There are two nodes in a data job that can apply identification analysis definitions:
• Identification Analysis
• Right Fielding

Both of these nodes are found in the Quality grouping of nodes.

Identification analysis and right fielding can use the same definitions from the QKB, but each produces output in a different format. Identification analysis identifies the type of data in a field, and right fielding moves the data into separate fields based on its identification.

Using the Identification Analysis Node

The purpose of an Identification Analysis node is to analyze the values in the field being analyzed and attempt to determine the identity of each data value. For example, is the value a person's name, an organization, an address, and so on? The valid values returned by the identification analysis definition are determined by the identities defined for that identification analysis definition.

Using the Identification Analysis Node

Consider a field that has mixed corporate and individual customers. Applying the Contact Info identification analysis definition to the Customer field using the Identification Analysis node produces a result set that flags every record with the type of data that is discovered or identified.

Customer        Customer_Identity
DataFlux        ORGANIZATION
John Q Smith    NAME
DataFlux Corp   ORGANIZATION
Nancy Jones     NAME

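In SAS code, the DQIDENTIFY function from SAS Data Quality Server produces the same kind of identity flag. A minimal sketch, assuming the ENUSA locale is loaded and assuming a hypothetical customers table with a Customer field:

data customers_identified;
   set customers;                           /* hypothetical input table */
   length customer_identity $ 20;
   /* returns a value such as NAME or ORGANIZATION */
   customer_identity = dqIdentify(customer, 'Contact Info', 'ENUSA');
run;
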

Using the Right Fielding Node

The purpose of a Right Fielding node is to analyze the field to determine the identity of each data value. For example, is the value a person's name, an organization, an address, and so on? Then, based on that identity, the value is moved into a new field that is created for that particular identity (for example, Company, Person, or Unknown).

Using the Right Fielding Node

Consider a field that has mixed corporate and individual customers. Applying the Contact Info identification analysis definition to the Customer field using the Right Fielding node can produce a result set that moves identified values to the correct or right field.

Customer        Name            Company
DataFlux                        DataFlux
John Q Smith    John Q Smith
DataFlux Corp                   DataFlux Corp
Nancy Jones     Nancy Jones

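Right fielding can be sketched in SAS as identity-based routing of a value into new fields. This illustration builds on the DQIDENTIFY call above (the table and field names remain hypothetical):

data customers_rightfielded;
   set customers;                           /* hypothetical input table */
   length name company $ 60;
   select (dqIdentify(customer, 'Contact Info', 'ENUSA'));
      when ('NAME')         name    = customer;
      when ('ORGANIZATION') company = customer;
      otherwise;                            /* unrecognized identities stay blank */
   end;
run;
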

Investigating Right Fielding and Identification Analysis

This demonstration opens a data job from the Basics Solutions repository. The data job is used to illustrate the difference between the Right Fielding and Identification Analysis nodes.

1. If necessary, access Data Management Studio.

a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.

b. Click Cancel to close the Log On window.

2. Open a data job from the Basics Solutions repository.

a. Click the Folders riser bar.

b. Expand Basics Solutions.

c. Click batch_jobs.

d. Double-click Ch5D2_RightFielding_IDAnalysis.

The data job appears on a new primary tab. The job flow should resemble the following:

3. Examine the properties for the Text File Input node.

a. Double-click the Prospect List Text File Input node.

b. Verify that a double quotation mark " is selected as the text qualifier.

c. Verify that Comma is selected as the field delimiter.

d. Verify that Number of rows to skip is selected.

e. Verify that 1 is the default value in the Number of rows to skip field.

f. Verify that seven fields are defined, including a field named Contact.

The Text File Input Properties window should resemble the following:

g. Click Cancel to close the Text File Input Properties window.

4. Preview the input data.

a. Right-click the Text File Input node and select Preview.

A sample of records appears on the Preview tab of the Details pane.

b. Verify that the Contact field has a mixture of individual and corporate records.

5. Examine the properties for the Right Fielding node.

a. Double-click the Right Field Contact Info Right Fielding node.

The Right Fielding Properties window should resemble the following:

b. Verify that Contact Info is the selected definition.

c. Verify that the Contact field is the selected field.

d. Verify that two output fields are defined: one to contain Company values and one to contain individual or Person values.

e. Click Additional Outputs.

1) Verify that the Contact field is at the bottom of the list.

Note: The Contact field is moved to the end of the list so that it can be more easily compared to the results of the Right Fielding node.

2) Click OK to close the Additional Outputs window.

f. Click Cancel to close the Right Fielding Properties window.

6. Preview the data.

a. Right-click the Right Fielding node and select Preview.

A sample of records appears on the Preview tab of the Details pane.

b. If necessary, scroll to the right to view the new right fielding information.

Note: The Right Fielding node correctly identified data values from the Contact field as either Organization (the Company field) or Name of Person(s) (the Person field).

7. Examine the properties of the Identification Analysis node.

a. Double-click the Identify Contact Info Identification Analysis node.

The Identification Analysis Properties window should resemble the following:

Note: The Contact Info identification analysis definition is used to determine the identity of the data values in the Contact field.

b. Verify that the Contact field is in the Selected list.

c. Verify that the Contact Info identification definition is the specified definition.

d. Verify that results appear in a new field named Contact_Identity.

e. Click Cancel to close the Identification Analysis Properties window.

8. Preview the data.

a. Right-click the Identification Analysis node and select Preview.

A sample of records appears on the Preview tab of the Details pane.

b. Scroll to the right to view the new identification information.

Note: The Identification Analysis node also correctly identified data values from the Contact field as either ORGANIZATION or NAME. Right fielding moves the data to the associated field, but identification analysis creates a new field with values that indicate the associated category.

10.4 Branching and Gender Analysis

Node Connections Tab

[Slide: the Node Connections tab on the Details pane. For one selected node, only one input slot is allowed; for another selected node, multiple output slots are available.]

When a node is selected in a Data Flow diagram, the Details pane can be surfaced by selecting View  Show Details Pane.

The Details pane has a Node Connections tab. This tab shows, for the type of the selected node, how many connections can come "in" to a node (Connect from), as well as how many connections can come "out" of a node (Connect to).

Working with the Branch Node

The Branch node is used to split one input into multiple outputs.

A Branch node is used in a data job to send the output down two or more output paths. Most nodes in a data job allow only one input connection and one output connection. The purpose of the Branch node is to take one input connection and split the data flow into multiple paths. The Branch node allows for up to 32 output connections.

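As a conceptual analogy only (the Branch node itself needs no code), a SAS DATA step can likewise split one input stream into several outputs, here using the identity value from the previous section:

data persons companies;
   set customers_identified;                /* hypothetical input table */
   if contact_identity = 'NAME' then output persons;
   else if contact_identity = 'ORGANIZATION' then output companies;
run;
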

Working with the Branch and Data Validation Nodes

This demonstration continues with an existing data job from the Basics Solutions repository to illustrate the use of the Branch node. Different paths of data are created based on the identity determined by the Identification Analysis node.

1. If necessary, access Data Management Studio.

a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.

b. Click Cancel to close the Log On window.

2. If necessary, open the data job Ch5D2_RightFielding_IDAnalysis.

a. Click the Folders riser bar.

b. Expand Basics Solutions.

c. Click batch_jobs.

d. Double-click Ch5D2_RightFielding_IDAnalysis.

3. Investigate node connections for the Branch node.

a. Click the Branch node.

b. Click the Node Connections tab in the Details pane.

Question: How many output slots are available for the Branch node?
Answer: 32 (slots 0 through 31)

Question: How can you discover how many output slots are needed for a particular set of data?
Answer: Create a frequency distribution to find the distinct number of values for a particular field.

Good News! There is a Frequency Distribution node in the Profile grouping of nodes.

4. To discover how many "branches" to add to the job flow, add a Frequency Distribution node.

Note: This is typically performed while the data job is being created.

a. Right-click the line connector between the Identification Analysis node and the Branch node. Select Delete.

b. Click the Identification Analysis node.

Note: This makes the node the "active" node. The next node added to the data job is connected to the active node.

c. Expand the Profile grouping of nodes.

d. Double-click the Frequency Distribution node.

The Frequency Distribution node is appended to the Identification Analysis node. The Frequency Distribution Properties window appears.

e. From the Available list, double-click the Contact_Identity field to move it to the Selected list.

f. Click OK to close the Frequency Distribution Properties window.

g. Right-click the Frequency Distribution node and select Preview.

Notice that there are two distinct values for the Contact_Identity field. If the intention of the Branch node is to pass different subsets of data "down" the branches, this preview tells you that only two branches are needed.

h. Right-click the Frequency Distribution node and select Delete.

i. Reconnect the Identification Analysis node to the Branch node.

1) Place your mouse pointer over the Identification Analysis node in a spot where the drawing tool or pen appears.

2) Click and drag the pen. Release the mouse button when the pen is positioned over the Branch node.

The Identification Analysis node is now reconnected to the Branch node.

5. Investigate the Data Validation node named Filter for Companies.

a. Double-click the Data Validation node named Filter for Companies.

b. Verify that the Expression area is searching for records where Contact_Identity is equal to ORGANIZATION.

c. Click Cancel to close the Data Validation Properties window.

6. Preview the data.

a. Right-click the Data Validation node named Filter for Companies and select Preview.

A sample of records appears on the Preview tab of the Details pane.

b. Verify that all the Contact_Identity field values are ORGANIZATION.

7. Investigate the Data Validation node named Filter for Person(s).

a. Double-click the Data Validation node named Filter for Person(s).

b. Verify that the Expression area is searching for records where Contact_Identity is equal to NAME.

c. Click Cancel to close the Data Validation Properties window.

8. Preview the data.

a. Right-click the Data Validation node named Filter for Person(s) and select Preview.

b. Verify that all the Contact_Identity field values are NAME.

Gender Analysis Definitions

Gender analysis definitions attempt to identify the gender of a person based on the name string.

Customer_Name    Customer_Gender
Michelle Wan     F
E.Fusco          U
Earl Rigdon      M
Pat Duffey       U

The Gender Analysis node of a data job (found in the Quality grouping of nodes) can apply a gender analysis definition to a selected name field. Based on the values of the names, the node produces a new field that reflects the gender of the name. The values returned by the gender analysis definition are M(ale), F(emale), and U(nknown).

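The DQGENDER function exposes the same analysis to SAS programs. A minimal sketch, assuming the ENUSA locale is loaded:

data genders;
   length name $ 20 gender $ 1;
   infile datalines truncover;
   input name $20.;
   gender = dqGender(name, 'Name', 'ENUSA');   /* returns M, F, or U */
   datalines;
Michelle Wan
Earl Rigdon
Pat Duffey
;
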

Investigating Gender Analysis

This demonstration uses an existing data job from the Basics Solutions repository to illustrate the use of the Gender Analysis node. You use this node to determine the gender of individuals (based on the value in a name field), reorder the fields, and write the data to the output tables.

1. If necessary, access Data Management Studio.

a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.

b. Click Cancel to close the Log On window.

2. If necessary, open the data job Ch5D2_RightFielding_IDAnalysis.

a. Click the Folders riser bar.

b. Expand Basics Solutions.

c. Click batch_jobs.

d. Double-click Ch5D2_RightFielding_IDAnalysis.

3. Examine the properties of the Gender Analysis node.

a. Double-click the Gender Analysis node named Determine Gender.

The properties window for the node appears.

b. Verify that the Person field is moved to the Selected list.

c. Verify that the Name gender analysis definition is selected.

d. Verify that the output field is named Person_Gender.

e. Verify that Preserve null values is selected.

f. Click Additional Outputs.

The Additional Outputs window appears.

1) Verify that not all fields are passed as output fields.

2) Verify that the Person field is renamed as Name.

3) Click Cancel to close the Additional Outputs window.

g. Click Cancel to close the Gender Analysis Properties window.

4. Preview the Gender Analysis node.

a. Right-click the Gender Analysis node and select Preview.

A sample of records appears on the Preview tab of the Details pane.

b. Verify that the Person_Gender field contains values of F, U, or M.

5. Investigate the properties for the Field Layout node named Re-order Fields.

a. Double-click the Field Layout node named Re-order Fields.

The Field Layout Properties window should resemble the following:

b. Verify that the Person_Gender field was moved to the third position and renamed to Gender.

c. Click Cancel to close the properties of the Field Layout node.

6. Investigate the properties for the Text File Output node named Prospects - Person(s).

a. Double-click the Text File Output node named Prospects - Person(s).

The Text File Output Properties window should resemble the following:

b. Verify that a file named Ch5D2_Prospects_Persons.txt is created.

c. Verify that the Text qualifier field is set to a double quotation mark ".

d. Verify that the Field delimiter field is set to Comma.

e. Verify that Include header row is selected.

f. Verify that Display file after job runs is selected.

g. Click Cancel to close the Text File Output Properties window.

7. Investigate the properties for the Field Layout node named Re-Order Fields.

a. Double-click the Field Layout node named Re-Order Fields.

The Field Layout Properties window appears.

Question: What is the intended purpose for this instance of the Field Layout node?
Answer: The Data Validation node previous to this node in the data job flow does not have a way to manage output fields. Adding this Field Layout node enables us to manage fields before adding any additional nodes.

Question: Aside from the list of selected fields, what is the difference between this Field Layout node and the Field Layout node in the "NAME" branch?
Answer: The Field Layout node on the other branch of this data job was named "Re-order Fields". This Field Layout node is named "Re-Order Fields". It is important to note that node names are unique, where uniqueness includes the casing.

b. Click Cancel to close the Field Layout Properties window.

8. Investigate the properties for the Text File Output node named Prospects - Companies.

a. Double-click the Text File Output node named Prospects - Companies.

The Text File Output Properties window should resemble the following:

b. Verify that a file named Ch5D2_Prospects_Companies.txt is created.

c. Verify that the Text qualifier field is set to a double quotation mark ".

d. Verify that the Field delimiter field is set to Comma.

e. Verify that Include header row is selected.

f. Verify that Display file after job runs is selected.

g. Click Cancel to close the Text File Output Properties window.

9. Run the job.

a. Select Actions  Run Data Job.

b. Verify that both text files appear.

PERSONS:

COMPANIES:

c. Select File  Exit to close each Notepad window.

10. View the detailed log and resolve any issues that exist.

11. Select File  Close to close the data job.

10.5 Data Enrichment

Reference Sources (Review)

• USPS Data
• Geo+Phone Data
• Canada Post Data

Reference data sources are used to verify and enrich data. The reference sources (also known as data packs) are typically a database used by Data Management Studio to compare user data to the reference source. Given enough information to match an address, location, or phone number, the reference data source can add a variety of additional fields to further clarify and enrich your data.

Data Management Studio allows direct use of data packs provided from the United States Postal Service, Canada Post, and Geo+Phone data. The data packs and updates are found on http://support.sas.com/downloads in the SAS DataFlux Software section.

Note: You cannot directly access or modify reference data sources.

Registration of Reference Sources (Review)

[Slide: the Administration riser bar with the Reference Sources item selected, showing summary information for the registered reference sources.]

Reference source locations are registered on the Administration riser bar in Data Management Studio. In the display above, you can see that there are registrations for three reference data sources – Canada Post Data, Geo+Phone Data, and USPS Data.

Note: Only one reference source location of each type can be designated as the default.

Address Verification

Address verification identifies, corrects, and enhances address information.

Original Address:
940 Cary Parkway
27513

Verified Address Information:

Street Address           940 NW CARY PKWY
City                     CARY
State                    NC
ZIP                      27513
ZIP+4                    2792
County Name              WAKE
Congressional District   2

The address verification lookup process requires a valid street address and postal code, or a valid street address with the corresponding city and state value. If these values match an address in the lookup database, then the data can be enriched with additional data fields. In the example shown, the address 940 Cary Parkway with postal code 27513 is passed into the Address Verification node. Since this is a valid address, additional information can be added to the data row (for example, City, State, ZIP+4, County Name, and Congressional District).

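Conceptually, the verification step is a keyed lookup against the licensed reference database, with enrichment fields copied onto matching rows. The SQL below is purely illustrative: usps_ref is a hypothetical table, and the real data packs are accessed only through the Enrichment nodes, not as ordinary tables.

proc sql;
   create table customers_verified as
   select c.*,
          r.city  as city_v,      /* renamed to avoid a clash with c.city */
          r.state as state_v,
          r.zip4,
          r.county_name
   from customers c
        left join usps_ref r
          on upcase(c.address) = r.street_address
         and c.zip = r.zip;
quit;
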

Geocoding

Geocoding enhances address information with latitude and longitude values.

Original Address:
940 NW CARY PKWY
CARY, NC 27513-2792

Verified Geocode Information:

Latitude     35.818339
Longitude    -78.797333

Geocoded latitude and longitude information can be used to map locations and plan efficient delivery routes. Geocoding can be licensed to return this information for the centroid of the postal code or at the rooftop level.

Using Address Verification Nodes in a Data Job

The nodes for performing address verification are found in the Enrichment grouping of nodes.

Note: These nodes require the third-party reference data packs.

The address verification nodes in Data Management Studio require that one (or more) third-party reference data sources be licensed. These include the following:
• US Address Verification – United States Postal Service database, which supports CASS certification.
• Canada Address Verification – Canada Post database, which supports SERP certification.
• North America Postal Level Geocode (includes PhonePlus)
• US Street Level Geocode – adds an additional level of detail to the North America Postal Level Geocode database.
• Loqate – worldwide databases, which also support worldwide geocoding.

Note: The US Street Level Geocode data requires that the North America Postal Level Geocode data pack be installed first.

Note: More information about the reference data sources can be found by navigating to support.sas.com/downloads and selecting DataFlux Data Updates.

Using Address Verification Nodes in a Data Job


Available input fields Fields to be used for address lookup

Available fields for enrichment

85
C o p y r i g h t © S AS In s t i tu t e In c. Al l r i g h t s re s e r ve d .

The address verification process is performed by one of several available nodes in the Enrichment
group of nodes. After a node is added to the data job flow, you can look up data values in the
reference data source to validate the address. If the address is located in the reference data source,
then there is a large number of available fields that can be used to enrich the existing record.

For example, the job flow diagram shown contains an Address Verification node. When examining
the properties of this node, we can see the list of fields being passed into this node, or the input
fields. The input fields can be matched to fields from the reference source using the Field Type area.
Depending on the specified reference source, there will be a number of output fields that can be
generated from the reference source.

It is important to note that these output fields are added to the result set in addition to the
original fields that might be passing through this node. (Recall that the Additional Outputs button
controls which of the input fields pass out of this node.) In the example shown, the original data
has a CITY field (as shown in the Input listing). One of the available output fields is also named City.
To avoid a naming conflict, you need to change the name of one of the City fields. In the example,
the output field City has been renamed to City_V to avoid the potential replication of field names.


Working with Address Verification and Geocoding Nodes

This demonstration uses an existing data job in the Basics Solutions repository. It illustrates the use
of an Address Verification node to verify addresses and a Geocoding node to enhance addresses
with latitude and longitude information.

1. If necessary, access Data Management Studio.

a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.

b. Click Cancel to close the Log On window.

2. Open a data job from the Basics Solutions repository.

a. Click the Folders riser bar.

b. Expand Basics Solutions.

c. Click batch_jobs.
d. Double-click Ch6D1_AddressVerification.

The data job appears on a new primary tab. The job flow should resemble the following:

3. Examine the properties of the Data Source node in the job flow.

a. Double-click the Customers Table Data Source node.

b. Verify that the following fields were added to the Selected list:

ID
COMPANY
LAST NAME
FIRST NAME
ADDRESS
CITY
STATE/PROVINCE
ZIP/POSTAL CODE
COUNTRY/REGION


The Data Source Properties window should resemble the following:

c. Click Cancel to close the Data Source Properties window.

4. Right-click the Customers Table Data Source node and select Preview.

A sample of records appears on the Preview tab of the Details pane.


5. Examine the properties for the Address Verification (US/Canada) node.

a. Double-click the Address Verification node to view the Properties window.

b. Verify that United States is selected in the Address area.

c. Verify the following input specifications:

1) The Address Line 1 field type is assigned to ADDRESS.

2) The City field type is assigned to CITY.

3) The State field type is assigned to STATE/PROVINCE.

4) The Zip field type is assigned to ZIP/POSTAL CODE.

The Input information should resemble the following:

Identifies the reference source to be USPS.

Field types from USPS are paired to the appropriate fields coming into this node.

Note: You can click Suggest to attempt to match the fields from the data source
with the fields in the reference source.

d. Investigate options for address verification.

1) Click Options.

2) Verify that Proper case results is selected.

3) Verify that Street abbreviation is selected.

4) Verify that City abbreviation is selected.

5) Verify that CASS compliance is not selected.

6) Verify that Insert dash between ZIP and ZIP4 is selected.


The Options window should resemble the following:

7) Click Cancel to close the Options window.

e. Verify the output information.

1) In the Output fields area, verify that the following fields are in the Selected list:

Address Line 1
City
State
ZIP/Postal Code
US ZIP
US ZIP4
US Street Number
US Result Code
US Numeric Result Code

2) In the Output Name field, verify that _V appears after the following fields:

Address_Line_1
City
State
ZIP/Postal_Code

The Output field area should resemble the following:


f. Click Cancel to close the Address Verification (US/Canada) Properties window.

6. Preview the data.

a. Right-click the Address Verification node and select Preview.

b. Scroll to the right to view the verified address information.

Notice the following:

• The street types from the original Address field are abbreviated in the verified Address field.
• Some verified City values are updated to match the ZIP code information.
• Some cities from the original City field are abbreviated in the verified City field.
• Some values from the State field are standardized to the pattern AA in the verified field.
• Most records in the preview have a US Result Code = OK and US Numeric Result Code = 0.

Note: The US_Result_Code field indicates whether the address was successfully verified.
If the address was not successfully verified, the code indicates the cause of failure.

Note: The US_Numeric_Result_Code field provides a numeric value for the result. Possible
values for both fields are as follows:

Text Result Code   Numeric Result Code   Description

OK                 0                     Address was verified successfully.

PARSE              11                    Error parsing address. Components of the address might be missing.

CITY               12                    Could not locate city, state, or ZIP code in the USPS database. At least city and state or ZIP code must be present in the input.

MULTI              13                    Ambiguous address. There are two or more possible matches for this address with different data.

NOMATCH            14                    No matching address is found in the USPS data.

OVER               15                    One or more input strings is too long (maximum 100 characters).
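Downstream logic often branches on these codes (for example, routing failed verifications to a review file). The Python sketch below is a hypothetical illustration of that pattern; the code values come from the table above, but the helper itself is not part of the product.

# Interpreting US result codes from the Address Verification node's output.
US_RESULT_CODES = {
    0: "OK", 11: "PARSE", 12: "CITY", 13: "MULTI", 14: "NOMATCH", 15: "OVER",
}

def is_verified(numeric_code):
    """Only a numeric result code of 0 means the address was verified."""
    return numeric_code == 0

for code in (0, 13, 14):
    action = "keep" if is_verified(code) else "send to review"
    print(f"{US_RESULT_CODES[code]:8} ({code:2}) -> {action}")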


7. Examine the properties for the Street-Level Geocoding node in the job flow.
a. Double-click the Street-Level Geocoding node (named Geocoding using Verified Zip and
Street Number) to view the Properties window.

b. Verify that United States is selected as the address type.

c. In the Input fields area, verify that the ZIP/Postal_Code_V field type is set
to Postal/ZIP Code.

d. Verify that the Street number field type is set to US_Street_Number.

e. In the Output fields area, verify that the following fields are in the Selected list:

Geocode Result Code
Geocode Latitude
Geocode Longitude
Geocode FIPS

The Street-Level Geocoding Properties window should resemble the following:

f. Click Cancel to close the Street-Level Geocoding Properties window.


8. Preview the data.

a. Right-click the Street-Level Geocoding node and select Preview.

A sample of records appears on the Preview tab of the Details pane.

b. Scroll to the right to view the geocode information.

Notice the following:

• Most records in the preview have a Geocode_Result_Code of DP (Delivery Point).
• Records where the verified ZIP code has only five digits have a Geocode_Result_Code
of ZIP. These records have less precise values for the Geocode_Latitude and
Geocode_Longitude columns.

Note: When you use the Street-Level Geocoding node, a license check is performed to
confirm the license type. If you have a ZIP+4 database, you see only results
reflecting that license type. If you have a database that includes street-level data,
you see street-level results.


The field definitions are shown below:

Available Field          Description

Geocode Result Code      Indicates whether the record was successfully geocoded. Possible codes are as follows:
                         • DP – The match is based on the delivery point.
                         • PLUS4 – The match failed on the delivery point, so the match is based on ZIP+4.
                         • ZIP – The ZIP+4 match failed, so the match is based on the ZIP code.
                         • NOMATCH – The first three checks failed. There is no match in the geocoding database.

Geocode Latitude         The numerical horizontal map reference for address data.

Geocode Longitude        The numerical vertical map reference for address data.

Geocode Country Code     The country code for the address.

Geocode FIPS             The U.S. Federal Information Processing Standards (FIPS) number used by the U.S. Census Bureau to refer to geographical areas.

Geocode Census Tract     A US Census Bureau reference number assigned using the centroid latitude and longitude. This number contains references to the State and County codes.

Geocode Census Block     The last four digits of the State/County/Tract/Block value.

Geocode FIPS MCD Code    The FIPS Minor Civil Division (MCD) code refers to a subsection of the US Census county subdivision statistics. This number includes census data for county divisions, census subareas, minor civil divisions, unorganized territories, and incorporated areas.

Geocode MSA or CMSA      A US Census Bureau code referencing Metropolitan Statistical Areas (MSA) or County Metropolitan Statistical Areas (CMSA).

Geocode PMSA (for CMSA)  A US Census Bureau code referencing Principal Metropolitan Statistical Areas (PMSA) or Consolidated Metropolitan Statistical Areas (CMSA).

Geocode CBSA             The code for a Core-Based Statistical Area.
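Because the result codes form a precision hierarchy (DP is most precise, then PLUS4, then ZIP), downstream processing sometimes filters records by a minimum precision. The Python sketch below illustrates one way to do that; the ranking helper and threshold are assumptions for illustration, not product behavior.

# Rank geocode result codes by precision and keep only precise-enough rows.
PRECISION_RANK = {"DP": 3, "PLUS4": 2, "ZIP": 1, "NOMATCH": 0}

def precise_enough(result_code, minimum="PLUS4"):
    return PRECISION_RANK.get(result_code, 0) >= PRECISION_RANK[minimum]

rows = [("DP", 35.818339, -78.797333), ("ZIP", 35.79, -78.78)]
print([r for r in rows if precise_enough(r[0])])  # only the DP match survives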


9. Investigate the properties for the Text File Output node.

a. Double-click the Customer Info - Verify/Geocode Text File Output node.

b. Verify that the Text File Output Properties window resembles the following:

c. Click Cancel to close the Text File Output Properties window.


10. Run the job.

a. Verify that the text file appears.

b. Select File  Exit to close the Notepad window.

11. View the detailed log.


a. Click the Log tab.

b. Review the information for each of the nodes.

12. Close the data job.

a. Click the Data Flow tab.

b. Select File  Close.


Practice

2. Performing Address Verification on the MANUFACTURERS Table

Create a data job that uses the Address Verification node. The final job flow should resemble
the following display:

• Create a new data job named Ch6E1_Manufacturers_Verify in the batch_jobs folder
of the Basics Exercises repository.

• Add the MANUFACTURERS table from the dfConglomerate Grocery data connection
as the data source.

• Verify the address fields. Use the following specifications:

− Map the following input fields to the specified field type:

Field Name Field Type

MANUFACTURER Firm

STREET_ADDR Address Line 1

CITY City

STATE_PROV State

POSTAL_CD Zip

− Specify the following options:

✓ Proper case results

✓ Output blanks as nulls

✓ Street abbreviation

✓ City abbreviation

✓ Insert dash between ZIP and ZIP4

Note: CASS compliance should not be selected.


− Specify the following output fields:

Output Type       Output Name
Firm              Firm_V
Address Line 1    Address_V
City              City_V
State             State_V
ZIP/Postal Code   ZIP_V
US County Name    US_County_Name
US Result Code    US_Result_Code

• Retain only the following list of original fields as additional outputs:

ID STATE/PROV
MANUFACTURER POSTAL_CD
STREET_ADDR COUNTRY
CITY PHONE

• Add a Text File Output node to the job flow. Use the following specifications:

Output file: Ch6E1_Manufacturers_Verify.txt in the directory
D:\Workshop\dqdmp1\Exercises\files\output_files

Text qualifier: " (double quotation mark)

Field delimiter: Comma

✓ Include header row

✓ Display file after job runs

Specify the following fields as output for the text file with the specified output names:

Field Name Output Name

ID ID

Firm_V Manufacturer

Address_V Address

City_V City

State_V State

ZIP_V ZIP

US_County_Name US_County_Name

US_Result_Code US_Result_Code

• Save and run the job.

• Verify that the text file contains the verified address information.


10.6 Solutions
Solutions to Practices
1. Creating a Data Job Containing Standardization and Parsing

a. If necessary, invoke Data Management Studio.

1) Select Start  All Programs  DataFlux  Data Management Studio 2.7.

2) Click Cancel in the Log On window.

b. Verify that the Home tab is selected.


c. Click the Folders riser bar.

d. Expand the Basics Exercises repository.

e. Right-click the batch_jobs folder and select New  Data Job.

1) Enter Ch5E1_Manufacturers_DataQuality in the Name f ield.

2) Click OK. The new data job appears on a tab.

f. Add a Data Source node to the job flow.

1) Verify that the Nodes riser bar is selected in the resource pane.

2) Expand the Data Inputs grouping of nodes.

3) Double-click the Data Source node. The node is added to the job flow, and the properties
window for the node appears.

4) Enter Manufacturers Table in the Name f ield.

5) Click the ellipsis next to the Input table field.

a) Expand the dfConglomerate Grocery data connection.

b) Select the MANUFACTURERS table.

c) Click OK to close the Select Table window.

6) In the Output fields area, click the left-pointing double arrow to remove all fields from
the Selected list.

7) Double-click the following fields to move them from the Available list to the Selected list:
ID CONTACT_STATE_PROV
MANUFACTURER CONTACT_POSTAL_CD
CONTACT CONTACT_CNTRY
CONTACT_ADDRESS CONTACT_PHONE
CONTACT_CITY POSTDATE

8) Click OK to save the changes and close the Data Source Properties window.


9) Right-click the Data Source node and select Preview.

A sample of records appears on the Preview tab of the Details pane.

g. Add a Standardization node to the job flow.

1) From the Nodes riser bar in the resource pane, collapse the Data Inputs grouping of
nodes.

2) Expand the Quality grouping of nodes.

3) Double-click the Standardization node. The node is added to the job flow.
The properties window for the node appears.

4) Enter Standardize Fields in the Name field.

5) Verify that English (United States) is selected for the Locale field.

6) In the Standardization fields area, double-click each of the following fields to move them
from the Available list to the Selected list:
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_CNTRY
CONTACT_PHONE

a) For the MANUFACTURER field, click the down arrow under Definition and
select Organization.

b) For the CONTACT field, click the down arrow under Definition and select Name.

c) For the CONTACT_ADDRESS field, click the down arrow under Definition and
select Address.


d) For the CONTACT_CITY field, click the down arrow under Definition and
select City.

e) For the CONTACT_STATE_PROV field, click the down arrow under Definition
and select State/Province (Abbreviation).

f) For the CONTACT_CNTRY field, click the down arrow under Scheme and select
Ch3E2 CONTACT_CNTRY Scheme.

g) For the CONTACT_PHONE field, click the down arrow under Definition and
select Phone.

7) Click Preserve null values.

8) Click OK to close the Standardization Properties window.

h. Preview the Standardization node.

1) Right-click the Standardization node and select Preview.

A sample of records appears on the Preview tab of the Details pane.


2) Scroll to the right to view the standardized fields.

i. Add a Parsing node to the job flow.

1) From the Nodes riser bar in the resource pane, verify that the Quality grouping of nodes
is expanded.

2) Double-click the Parsing node. The node is added to the job flow. The properties window
for the node appears.

3) Enter Parse Contact_Stnd in the Name field.

4) Verify that English (United States) is selected for the Locale field.

5) Click the down arrow under Field to parse and select CONTACT_Stnd.

6) Click the down arrow under Definition and select Name.


7) In the Tokens area, double-click each of the following tokens to move them from the
Available list to the Selected list:
Given Name
Middle Name
Family Name

a) Click in the Output Name cell for the Given Name token, enter FIRST_NAME,
and press Enter.

b) Click in the Output Name cell for the Middle Name token, enter MIDDLE_NAME,
and press Enter.

c) Click in the Output Name cell for the Family Name token, enter LAST_NAME,
and press Enter.

8) Click Preserve null values.

9) Click OK to close the Parsing Properties window.

j. Preview the Parsing node.

1) Right-click the Parsing node and select Preview.

A sample of the records appears on the Preview tab of the Details pane.

2) Scroll to the right to view the new, parsed name fields.

k. Add a Data Target (Insert) node to the job flow.

1) From the Nodes riser bar in the resource pane, collapse the Quality grouping of nodes.

2) Expand the Data Outputs grouping of nodes.

3) Double-click the Data Target (Insert) node. The node is added to the job flow.
The properties window for the node appears.

4) Enter Insert Records on Manufacturers_Stnd Table in the Name field.

5) Click the ellipsis next to the Output table field.


a) Expand the dfConglomerate Grocery data connection.

b) Select dfConglomerate Grocery.

c) Click (New) to add a table to the selected data connection.

(1) Enter Manufacturers_Stnd in the Enter a name for the new table field.

(2) Click OK to close the New Table window.

d) Click OK to close the Select Table window.

6) Click Delete existing rows.

7) In the Output fields area, click the left-pointing double arrow to remove all fields
from the Selected list.

8) Double-click the following fields to move them to the Selected list:
ID CONTACT_CITY_Stnd
MANUFACTURER_Stnd CONTACT_STATE_PROV_Stnd
FIRST_NAME CONTACT_POSTAL_CD
MIDDLE_NAME CONTACT_CNTRY_Stnd
LAST_NAME CONTACT_PHONE_Stnd
CONTACT_ADDRESS_Stnd POSTDATE

9) Rename the following fields. (Click the Output Name cell, enter the new name, and
press Enter.)

Field Name Output Name

MANUFACTURER_Stnd MANUFACTURER

CONTACT_ADDRESS_Stnd CONTACT_ADDRESS

CONTACT_CITY_Stnd CONTACT_CITY

CONTACT_STATE_PROV_Stnd CONTACT_STATE_PROV

CONTACT_CNTRY_Stnd CONTACT_CNTRY

CONTACT_PHONE_Stnd CONTACT_PHONE

10) Click OK to close the Data Target (Insert) Properties window.

The final job flow should resemble the following:

l. Save and run the data job.


1) Select File  Save.

2) Select Actions  Run Data Job.

m. View the detailed log and resolve any issues.

1) Click the Log tab.


2) For each node, verify that the Status field has a green check mark to indicate that the
node ran successfully.

Note: If any node has an error or warning, double-click the row for the node, and review
the messages. Return to the Data Flow tab and re-specify the properties to fix the
issue. Then rerun the data job.

3) When you are finished reviewing the log, select File  Close to close the data job.

Question: Where (in Data Management Studio) can you view the new table’s data?
Answer: The Data riser bar allows you to view data for tables.

Question: Where (in Data Management Studio) can you view the new table’s field names?
Answer: The Data riser bar allows you to view field information for tables.

n. Verify that the records were written to the Manufacturers_Stnd table in the dfConglomerate
Grocery data connection.

1) If necessary, click the Home tab.

2) Click the Data riser bar.

3) Expand Data Connections.

4) Expand dfConglomerate Grocery.

5) Click Manufacturers_Stnd.

6) In the information area, click the Data tab.


7) Scroll through the data and verify that the table contains the standardized and parsed
fields.

2. Performing Address Verification on the MANUFACTURERS Table

a. If necessary, invoke Data Management Studio.

1) Select Start  All Programs  DataFlux  Data Management Studio 2.7.

2) Click Cancel in the Log On window.

b. Create a new data job.

1) Verify that the Home tab is selected.

2) Click the Folders riser bar.

3) Expand the Basics Exercises repository.

4) Right-click the batch_jobs folder and select New  Data Job.

a) Enter Ch6E1_Manufacturers_Verify in the Name field.

b) Click OK. The new data job appears on a tab.


c. Add a Data Source node to the job flow.

1) Verify that the Nodes riser bar is selected in the Resource pane.

2) Expand the Data Inputs grouping of nodes.

3) Double-click the Data Source node. The node is added to the job flow,
and the properties window for the node appears.

d. Specify properties for the Data Source node.

1) Enter Manufacturers Table in the Name field.

2) Click the ellipsis next to the Input table field.

a) Expand the dfConglomerate Grocery data connection.

b) Click the MANUFACTURERS table.

c) Click OK to close the Select Table window.

3) Click OK to close the Data Source Properties window.

e. Edit the basic settings for the Data Source node and preview the data.

1) Verify that the Details pane is displayed.

2) If necessary, select the Data Source node in the job diagram.

3) If necessary, click the Basic Settings tab.

4) Enter Data Source in the Description field.

5) Right-click the Data Source node and select Preview.

A sample of the records appears on the Preview tab of the Details pane.

f. Add an Address Verification (US/Canada) node to the job flow.

1) Verify that the Nodes riser bar is selected in the Resource pane.

2) Collapse the Data Inputs grouping of nodes.

3) Expand the Enrichment grouping of nodes.


4) Double-click the Address Verification (US/Canada) node.

The node is added to the job f low, and the properties window f or the node appears.

g. Specify properties for the Address Verification (US/Canada) node.

1) Enter Verify US Addresses in the Name field.

2) Verify that United States is selected in the Address area.

3) Specify input information.

a) Click under Field Type for the MANUFACTURER field and select Firm.

b) Click under Field Type for the STREET_ADDR field and select Address Line 1.

c) Click under Field Type for the CITY field and select City.

d) Click under Field Type for the STATE_PROV field and select State.

e) Click under Field Type for the POSTAL_CD field and select Zip.

The Input area should resemble the following:

4) Specify options for address verification.

a) Click Options.

b) Click Proper case results.

c) Click Output blanks as nulls.

d) Click Street abbreviation.

e) Click City abbreviation.

f) Clear CASS compliance.


g) Verify that Insert dash between ZIP and ZIP4 is selected.

h) Click OK to close the Options window.

5) Specify the output information.

a) In the Output fields area, double-click the following fields to move them to the
Selected list:
Firm ZIP/Postal Code
Address Line 1 US County Name
City US Result Code
State

b) Rename the fields below. Click the Output Name cell, enter the new name,
and press Enter.

Output Type Output Name

Firm Firm_V

Address Line 1 Address_V

City City_V

State State_V

ZIP/Postal Code ZIP_V


The Selected list should resemble the following:

6) Retain only some of the original fields as additional outputs.

a) Click Additional Outputs.

b) Click the left-pointing double arrow to remove all the fields from the Output fields list.

c) Double-click the following fields to move them from the Available list
to the Output list:
ID STATE/PROV
MANUFACTURER POSTAL_CD
STREET_ADDR COUNTRY
CITY PHONE

d) Click OK to close the Additional Outputs window.

7) Click OK to close the Address Verification (US/Canada) Properties window.

h. Preview the data.

1) Right-click the Address Verification (US/Canada) node and select Preview.

A sample of the records appears on the Preview tab of the Details pane.

2) Scroll to the right to view the verified address information.


i. Add a Text File Output node to the job flow.

1) Verify that the Nodes riser bar is selected in the Resource pane.

2) Collapse the Enrichment grouping of nodes.

3) Expand the Data Outputs grouping of nodes.

4) Double-click the Text File Output node.

The node is added to the job f low, and the properties window f or the node appears.

j. Specify properties for the Text File Output node.

1) Enter Manufacturers Info - Verify in the Name field.

2) Specify the output file information.

a) Click the ellipsis next to the Output file field.

b) Navigate to D:\Workshop\dqdmp1\Exercises\files\output_files.

c) Enter Ch6E1_Manufacturers_Verify.txt in the File name f ield.

d) Click Save.

3) Specify attributes for the file.

a) Specify " (double quotation mark) as the text qualifier.
b) Verify that the Field delimiter field is set to Comma.
c) Click Include header row.
d) Click Display file after job runs.

4) Specify the desired fields as output for the text file.


a) In the Output fields area, click the left-pointing double arrow to remove all fields from
the Selected list.

b) Double-click the following fields to move them from the Available list to the Selected
list:
ID State_V
Firm_V ZIP_V
Address_V US_County_Name
City_V US_Result_Code

c) Rename the following fields.

Field Name Output Name

Firm_V Manufacturer

Address_V Address

City_V City

State_V State

ZIP_V ZIP


The Text File Output Properties should resemble the following display:

5) Click OK to close the Text File Output Properties window.

The final job flow should resemble the following:

k. Save and run the data job.

1) Select File  Save to save the data job.

2) Verify that the Data Flow tab is selected.

3) Select Actions  Run Data Job.


l. Verify that the text file contains the verified address information.

m. Select File  Exit to close the Notepad window.

n. Select File  Close to close the data job.


Solutions to Student Activities

10.01 Activity – Correct Answer


Under General:
✓ Clear Automatically preview a selected node when the preview tab is
active.


10.01 Activity – Correct Answer


Under Job:
✓ Select Include node-specific notes when printing.
✓ Select All in the Output Fields area.



10.01 Activity – Correct Answer


Under Job:
✓ Verify that English (United States) is the default locale.



Lesson 11 DataFlux® Data Management Studio: Building Data Jobs for Entity Resolution

11.1 Introduction ............................................................................................................... 11-3

11.2 Creating Match Codes ............................................................................................... 11-7

Demonstration: Creating a Data Job to Generate Match Codes ..................................11-10

11.3 Clustering Records ...................................................................................................11-17

Demonstration: Using Match Codes (and Other Fields) to Cluster Data Records ...........11-21

Demonstration: Creating a Match Report from Clustered Data....................................11-30

Practice..............................................................................................................11-36

11.4 Survivorship .............................................................................................................11-38

Demonstration: Adding Survivorship to the Entity Resolution Job ................................11-48

Demonstration: Adding Field-Level Rules for the Surviving Record .............................11-61

Practice..............................................................................................................11-69

11.5 Solutions ..................................................................................................................11-71

Solutions to Practices ...........................................................................................11-71

Solutions to Student Activities ................................................................................11-84




11.1 Introduction

Are These The Same?

Consider this set of data.


Do the three records represent the same individual?
-- OR --
Do the three records represent three totally different individuals?

Name
John Q Smith
Mr. Johnny Smith
Smith, John

Given the sample of records shown, do you think the three records represent the same individual?
Or could the three records represent different individuals? Is investigating this single field enough
information to decide whether the data shows a single individual, two individuals, or perhaps
three different individuals?


Match Definitions Generate Match Codes


When comparing the values on a character-by-character basis, we
can see that the records are different data values.
However, the QKB has a definition type called a Match definition
that can transform data into encoded forms (called match codes).
The match codes can then be compared across records to help
identify potential duplicate records.

Name
John Q Smith
Mr. Johnny Smith
Smith, John

In the example shown, you can see that the three strings representing John Smith could be
identifying the same person. However, to the computer that compares on a character-by-character
basis, these entries appear to be three totally different text strings.

The QKB has a definition type called a Match definition. Match definitions produce new fields called
match codes. Match codes are generated, encoded text strings that represent a data value. Match
codes can be compared across records to identify potentially duplicate data that might be obvious to
the human eye, but not necessarily obvious to a computer program.


Introduction to Match Codes

Using match codes, these records can be identified as a group (or cluster). However, it is not
because the data values are the same. It is because they generate the same match code.

Name              Match Code @ 85 Sensitivity
John Q Smith      4B&~2$$$$$$$$$$C@P$$$$$$$$$
Mr. Johnny Smith  4B&~2$$$$$$$$$$C@P$$$$$$$$$
Smith, John       4B&~2$$$$$$$$$$C@P$$$$$$$$$

Match codes can be used to group similar data values. In the example shown, the records are sorted
by the match code values. You can see that these records might potentially match. This is not
because the value of Name is the same, but because the name values generated the same match
code at the chosen level of sensitivity (in this example, the 85 level of sensitivity).

In addition to grouping similar records in a data table, match codes can be used across tables to join
similar records that would not previously be joined based on the value of Name. This is especially
helpful when you join two data source tables that do not have a common key or have possibly
different standards for how the data values were entered.
The match code generation process, in a simple form, consists of the following steps (a simplified
sketch follows the list):
• Data is parsed into tokens (for example, Given Name and Family Name).
• Significant tokens are selected for use in the match code.
• Ambiguities and noise words are removed (for example, “the”).
• Transformations are made (for example, Johnathon > Jon).
• Phonetics are applied (for example, HN sounds like N).
• Based on the sensitivity selection, the following results occur:
o Relevant components are determined.
o A certain number of characters of the transformed, relevant components are used.
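The following minimal Python sketch mimics that pipeline for name data. Every rule in it (the noise words, the nickname map, the character cutoff) is an invented stand-in for illustration; the real QKB match definitions are far richer and also apply phonetics.

# Toy match-code generator: parse, drop noise, transform, truncate by sensitivity.
NOISE = {"mr", "mrs", "ms", "the"}
NICKNAMES = {"johnny": "john", "jonathon": "john", "jon": "john"}

def toy_match_code(name, sensitivity=85):
    tokens = [t.strip(".,").lower() for t in name.replace(",", " ").split()]
    tokens = [NICKNAMES.get(t, t) for t in tokens if t not in NOISE]
    tokens = [t for t in tokens if len(t) > 1]       # drop middle initials
    tokens.sort()                                    # make token order irrelevant
    keep = 2 if sensitivity < 70 else 4              # higher sensitivity keeps more characters
    return "".join(t[:keep] for t in tokens).upper()

for n in ("John Q Smith", "Mr. Johnny Smith", "Smith, John"):
    print(f"{n:18} -> {toy_match_code(n)}")
# All three names yield the same toy code, so they would group together.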


Match Codes and Sensitivity


The match code sensitivity is used to determine how much of the
data string you want to use in the generation of the match code, as
well as the number of transformations that get applied to the original
data string.

Name: Mr. Johnny Quintin Smith


Match Code Sensitivity   Match Code Value
95                       4B7~2$$$$$$$$$$C@B7$$$$$$$3
85                       4B&~2$$$$$$$$$$C@P$$$$$$$$$
65                       4B~2$$$$$$$$$$$C@P$$$$$$$$$
55                       4B~2$$$$$$$$$$$C$$$$$$$$$$$

Sensitivity is used in the match code generation process to determine how much of the initial data
value you want to use in the match code. In other words, it enables you to specify how exact
you want to be when you generate the match codes.

At higher levels of sensitivity, more characters from the input text string are used to generate the
match code. Conversely, at lower levels of sensitivity, fewer characters are used for the generation
of the match code.

In the example shown, you can see that as the sensitivity level drops, fewer significant characters
are used in the match code string.

Note: It is important to experiment with different levels of sensitivity, because choosing a
sensitivity that is too low can cause over-matching. Choosing a sensitivity level that
is too high can cause under-matching.
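To make the over-matching/under-matching trade-off concrete, the toy Python snippet below simply keeps a sensitivity-dependent share of a code's characters and pads the rest. The truncation rule is an invented stand-in for how the slide's codes shorten as sensitivity drops.

# Illustrative only: lower sensitivity keeps fewer significant characters,
# so more (possibly distinct) values collapse to the same match code.
def truncate_code(code, sensitivity):
    keep = max(1, int(len(code) * sensitivity / 100))
    return code[:keep].ljust(len(code), "$")

full = "4B7~2C@B73"
for s in (95, 85, 65, 55):
    print(s, truncate_code(full, s))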


11.2 Creating Match Codes

Match Code Nodes in a Data Job

The nodes for generating match codes are found in the Entity Resolution grouping of nodes.
Two nodes can be used:
• Match Codes node
• Match Codes (Parsed) node


Match Code Nodes in a Data Job


Each of these nodes requires the specification of the following:
• field(s) on which to generate match code(s)
• a match definition for the selected field(s)
• a sensitivity for the selected match definition(s)
• a new output field to contain the generated match code string



Using the Match Codes Node


The Match Codes node is used to generate match codes on one (or more) data fields in the input
data source.


The Match Codes node is added to a data job to generate match codes for the desired fields in your
input data source.

In the example shown:

• Match codes are being generated for the COMPANY, ADDRESS, CITY, STATE/PROVINCE,
and ZIP/POSTAL CODE fields.
• For each of the selected fields, the appropriate match definition is specified. For example, the
COMPANY field contains company or organization data. Thus, the Organization match definition
was selected for this field. The CITY field contains city data. Thus, the City match definition was
selected for this field.
• Also, for each field selected, a sensitivity level of 85 was chosen.
• The default Output Name fields are created automatically (the original field name appended with
the string _MatchCode), but you have the option of changing those if you like.


Using the Match Codes (Parsed) Node in a Data Job

The Match Codes (Parsed) node is used if your data is already parsed into tokens (individual fields).


The Match Codes (Parsed) node is used if the data on which you want to generate match codes is
already stored in tokens. In the example shown, because the data for a person’s name is stored in a
FIRST NAME field and a LAST NAME field, there is no need to go through the process of parsing
the data into tokens again. In this situation, you can use the Match Codes (Parsed) node to generate
a single match code field based on the data fields that are passed directly into the tokens that make
up the Name data type.

For this node, you must
• select the appropriate match definition (this selection populates the Tokens list)
• specify the desired sensitivity
• pair the input fields appropriately to the tokens
  Two fields were paired with Name data type tokens:
  – FIRST NAME was paired with Given Name
  – LAST NAME was paired with Family Name
• specify the output field (the field that will contain the match code string).


Creating a Data Job to Generate Match Codes

This demonstration illustrates the steps that are necessary to generate match codes on a variety of
data fields.

1. If necessary, access Data Management Studio.

a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.

b. Click Cancel to close the Log On window.

2. Open a data job from the Basics Solutions repository.


a. Click the Folders riser bar.

b. Expand Basics Solutions.

c. Expand batch_jobs.

d. Click Startup_Jobs.

e. Double-click Ch7D1_EntityResolution_01_Startup.

The data job appears on a new primary tab. The data flow should contain a single node.
3. Investigate the single node in the opened data job.

a. Double-click the Data Source node (named Customers Table).

b. Verify that the Customers table from dfConglomerate Gifts is selected.

c. Verify that all Available fields were moved to Selected.

d. Verify that the name information is contained in two fields: FIRST NAME and LAST NAME.

e. Click Cancel to close the Data Source Properties window.

Note: Because the name information is already parsed, you need to use a Match Codes
(Parsed) node to generate the match code for the name fields.

4. Add a Match Codes (Parsed) node to the job flow.

a. Verify that the Nodes riser bar is selected in the resource pane.

b. Collapse the Data Inputs grouping of nodes.

c. Expand the Entity Resolution grouping of nodes.

d. Double-click the Match Codes (Parsed) node.

The node is added to the job flow. The properties window for the node appears.
Note: The Match Codes (Parsed) node is used when the input data is parsed. The Name
match definition is designed for a full name. If you generate a match code only on
FIRST_NAME, in most cases, the definition assumes that this is a last name. Thus,
the matching of nicknames does not occur. For example, Jon does not match
Jonathan.


5. Specify properties for the Match Codes (Parsed) node.

a. Enter Match Codes for Parsed Name in the Name field.

b. Enter Name_MatchCode in the Output field field.

c. Verify that English (United States) is selected for the Locale field.

d. Click the down arrow under Sensitivity and select 85.

e. Click the down arrow under Definition and select Name.

f. In the Tokens area, click the down arrow under Field Name for the Given Name token
and select FIRST NAME.

g. Click the down arrow under Field Name for the Family Name token and select LAST
NAME.

h. Click Generate null match codes for blank field value.

i. Click Preserve null values.

The Match Codes (Parsed) Properties window should resemble the following:

Note: Allow generation of multiple matchcodes per definition requires the creation of a
special match definition in the QKB.


Note: Generate null match codes for blank field values generates a NULL match code if
the field is blank. If this is not selected, then a match code of all $ symbols is
generated for the field. When you match records, a field with NULL does not equal
another field with NULL. However, a field with all $ symbols equals another field with
all $ symbols.

j. Click Additional Outputs.

1) In the Output fields area, click the LAST NAME field. Hold down the Ctrl key and click
the FIRST NAME field.

2) Click the down-pointing arrow 12 times to move the two selected fields to the
bottom of the list of output fields.

3) Click OK to close the Additional Outputs window.

k. Click OK to close the Match Codes (Parsed) Properties window.


6. Preview the data.

a. Right-click the Match Code (Parsed) node and select Preview.

A sample of records appears on the Preview tab of the Details pane.

b. Scroll to the right to view the Name_MatchCode field.

At this point, we have generated one of the six needed match code fields for our cluster analysis.
The additional match code fields are generated from fields that are not parsed information.

7. Add a Match Codes node to the job flow.

a. Verify that the Entity Resolution grouping of nodes is expanded.

b. Double-click the Match Codes node.

The node is added to the job flow. The properties window for the node appears.

8. Specify properties for the Match Codes node.

a. Enter Match Codes for Various Fields in the Name field.

b. Verify that English (United States) is selected for the Locale field.

c. In the Match code fields area, double-click the COMPANY field to move it from Available to
Selected.

d. Click the down arrow under Definition and select Organization.

e. Accept the default sensitivity of 85.

f. Double-click the ADDRESS field to move it from Available to Selected.

g. Click the down arrow under Definition and select Address.

h. Accept the default sensitivity of 85.

i. Double-click the CITY field to move it from Available to Selected.

j. Click the down arrow under Definition and select City.


k. Accept the default sensitivity of 85.

l. Double-click the STATE/PROVINCE field to move it from Available to Selected.

m. Click the down arrow under Definition and select State/Province.

n. Accept the default sensitivity of 85.

o. Double-click the ZIP/POSTAL CODE field to move it from Available to Selected.

p. Click the down arrow under Definition and select Postal Code.

q. Accept the default sensitivity of 85.

r. Click Generate null match codes for blank field values.

s. Click Preserve null values.

The Match Codes Properties window should resemble the following:

t. Click OK to close the Match Codes Properties window.


9. Preview the data.

a. Right-click the Match Codes node and select Preview.

A sample of records appears on the Preview tab of the Details pane.

b. Scroll to the right to view the generated match codes.

Note: Generating a match code performs an out-of-the-box standardization behind the scenes.
Thus, unless the intention is to write the standardized values to output or perform custom
standardizations by using a scheme or a modified definition, it is not necessary to
standardize before generating match codes.

10. Save the data job.

a. Select File  Save As.

b. Specify a name of Ch7D1_EntityResolution.

c. Change the Save in location to be Basics Demos  batch_jobs.

d. Click Save.


11.01 Activity
• Navigate to the batch_jobs/Startup_Jobs folder in Basics Solutions.
• Open the data job named Ch7E1_Manufacturers_MatchReport_01_Startup.
• Add a Match Codes node.
✓ Choose the following fields with the specified match definitions and sensitivities:

Field Name           Definition       Sensitivity
MANUFACTURER         Organization     75
CONTACT              Name             75
CONTACT_ADDRESS      Address          85
CONTACT_STATE_PROV   State/Province   85
CONTACT_POSTAL_CD    Postal Code      85

✓ Accept the default names for the match code fields.
✓ Choose the options to preserve null values and to generate null match codes for
blank field values.
• Preview the node to view the generated match code fields.

IMPORTANT: In addition, for the activity shown, save the modified data job to the
Basics Exercises repository in the batch_jobs folder with the name
Ch7E1_Manufacturers_MatchReport.


11.3 Clustering Records

Clustering Data Records


The concept of clustering involves grouping similar data records.

Some key points:
• Clusters can be determined based on a variety of conditions.
• Records can be clustered based on match codes or actual data values.
• A new field that contains the numeric identifier for the cluster is created.
• Additional fields can be added to indicate which of the clustering conditions were met.



Clustering (Grouping Records)

Record  Name             Name_MC_85 *      Address           Address_MC_85 *      Phone
1       John Q. Smith    4B&~2$$$$C@P$$$$  940 NW Cary Pkwy  -S03£YR$$$$$$$$$$$$  <null>
2       Johnny Smith     4B&~2$$$$C@P$$$$  940 Cary Parkway  -S03£YR$$$$$$$$$$$$  (919) 447 3000
3       Jonathon Smythe  4B&~2$$$$C@P$$$$  <null>            <null>               (919) 447 3000

* The match code strings shown are not what would actually be generated.

If the grouping or clustering condition is
• Name_MC_85 and Address_MC_85, matches are found for records 1 and 2.
• Name_MC_85 and Phone, matches are found for records 2 and 3.

If the grouping or clustering is requested on both conditions,
(Name_MC_85 and Address_MC_85) or (Name_MC_85 and Phone),
matches are found for all three records.

Consider the data shown. Three fields are provided (Name, Address, and Phone). Two fields
(Name and Address) have had a match code string generated.

If (Name_MC_85 and Address_MC_85) is the grouping or clustering condition, then records whose
Name_MC_85 and Address_MC_85 values are the same across records are grouped. In the data
shown, records 1 and 2 would be clustered or grouped based on this condition.

If (Name_MC_85 and Phone) is the grouping or clustering condition, then records whose
Name_MC_85 and Phone values are the same across records are grouped. In the data shown,
records 2 and 3 would be clustered or grouped based on this condition.

If (Name_MC_85 and Address_MC_85) or (Name_MC_85 and Phone) is the grouping or
clustering condition, then records that satisfy either condition are grouped. In the data shown,
records 1, 2, and 3 would be clustered or grouped based on this compound condition. A small
sketch of this compound-condition logic follows the note below.

Note: Identifying how you expect to identify clusters of records is something that should be
discussed during the PLAN phase of the methodology.
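A minimal Python sketch of that compound-condition logic appears below. It groups records with union-find so that any pair satisfying at least one condition lands in the same cluster; the data, field names, and NULL rule are illustrative assumptions, not the Clustering node's actual implementation.

# Compound-condition clustering with union-find.
records = [
    {"id": 1, "name_mc": "A", "addr_mc": "X", "phone": None},
    {"id": 2, "name_mc": "A", "addr_mc": "X", "phone": "919-447-3000"},
    {"id": 3, "name_mc": "A", "addr_mc": None, "phone": "919-447-3000"},
]
conditions = [("name_mc", "addr_mc"), ("name_mc", "phone")]

parent = {r["id"]: r["id"] for r in records}

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]    # path compression
        i = parent[i]
    return i

for a in records:
    for b in records:
        if a["id"] < b["id"]:
            for cond in conditions:
                # NULL values never match, so missing fields cannot link records
                if all(a[f] is not None and a[f] == b[f] for f in cond):
                    parent[find(a["id"])] = find(b["id"])
                    break

print({rid: find(rid) for rid in parent})   # all three records share one cluster ID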


Clustering Node

The Clustering node enables you to match records based on multiple conditions.
You have full flexibility to create conditions that support your business needs.



Clustering Node Properties

• New field to identify unique clusters of data
• Option to identify which clusters to output (all, single-row, multi-row)
• Option to organize output
• Three clustering conditions


The Clustering node enables you to specify one or more conditions for identifying clusters of
records.

In the example shown, there are three conditions that identify rules for grouping records into a cluster:

Condition 1: if Name_MatchCode and COMPANY_MatchCode and ADDRESS_MatchCode and
CITY_MatchCode and STATE/PROVINCE_MatchCode are the same
or
Condition 2: if Name_MatchCode and COMPANY_MatchCode and ADDRESS_MatchCode and
ZIP/POSTAL_CODE_MatchCode are the same
or
Condition 3: if Name_MatchCode and COMPANY_MatchCode and BUSINESS_PHONE_Stnd
are the same

If one or more of these conditions is true, then the records are considered a cluster.

Each cluster of records is assigned a common cluster ID value, and in the example shown, the
cluster ID values are stored in a new field named Cluster_ID.

You can choose to have the results contain only those "clusters" or groups of records where there
were no record matches (single-row clusters). Or you can choose to have the results contain those
"clusters" or groups of records where there was at least one record match (multi-row clusters).
Or you can choose to show all clusters (both single-row clusters and multi-row clusters). A sketch
of this filtering idea appears below.

In addition, the result set can be organized or sorted by the cluster ID value.

Using Match Codes (and Other Fields) to Cluster Data Records

This demonstration continues with the data job from the previous demonstration. A Standardization
node is added to standardize a phone field (a field that will be used in the clustering analysis). The
Clustering node is then added and configured. Clustering is performed using the match code fields
as well as the standardized field.

1. If necessary, access Data Management Studio.

a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.

b. Click Cancel to close the Log On window.

2. If necessary, open a data job from the Basics Demos repository.

a. Click the Folders riser bar.

b. Expand Basics Demos.

c. Click batch_jobs.

d. Double-click Ch7D1_EntityResolution.

3. Add a Standardization node to the job flow from the previous demonstration.

a. Verify that the Nodes riser bar is selected in the resource pane.
b. If necessary, collapse the Entity Resolution grouping of nodes.

c. Expand the Quality grouping of nodes.

d. Double-click the Standardization node.

The node is added to the job flow. The properties window for the node appears.

4. Specify the properties for the Standardization node.

a. Enter Standardize Phone in the Name field.

b. Verify that English (United States) is selected for the Locale field.

c. In the Standardization fields area, double-click BUSINESS PHONE to move it from Available
to Selected.

d. Click the down arrow under Definition and select Phone.

e. Click Preserve null values.


The Standardization Properties window should resemble the following:

f. Click OK to close the Standardization Properties window.

5. Preview the data.

a. Right-click the Standardization node and select Preview.

A sample of records appears on the Preview tab of the Details pane.

b. Scroll to the right to view the standardized phone field.


6. Add a Clustering node to the job flow.

a. Verify that the Nodes riser bar is selected in the resource pane.

b. If necessary, collapse the Quality grouping of nodes.

c. Expand the Entity Resolution grouping of nodes.

d. Double-click the Clustering node.

The node is added to the job flow. The properties window for the node appears.

7. Specify properties for the Clustering node.

a. Enter Cluster on Three Conditions in the Name field.

b. Enter Cluster_ID as the Output cluster ID field value.

c. Verify that All clusters is selected in the Output clusters field.

d. Click Sort output by cluster number.

e. Click More Options.

1) Click Condition matched field prefix.

2) Enter COND_ in the Condition matched field prefix field.

3) Verify that Return true or false as to whether the condition was matched across the
cluster is selected.

Note: Enabling this option can significantly degrade job performance.

4) Click OK.


f. Specify the first condition.

1) In the Conditions area, click (Add a cluster condition).

2) Double-click the following fields to move them from the Available fields list to the Cluster
if all fields match list:

Name_MatchCode
COMPANY_MatchCode
ADDRESS_MatchCode
CITY_MatchCode
STATE/PROVINCE_MatchCode

Selecting these fields…

… constructs this condition.

3) Click OK to create the first condition.

g. Specify the second condition.

1) In the Conditions area, click (Add a cluster condition).

2) Double-click the following fields to move them from the Available fields list to the Cluster
if all fields match list:

Name_MatchCode
COMPANY_MatchCode
ADDRESS_MatchCode
ZIP/POSTAL CODE_MatchCode

3) Click OK to create the second condition.

h. Specify the third condition.

1) In the Conditions area, click (Add a cluster condition).


2) Double-click the following fields to move them from the Available fields list to the Cluster
if all fields match list:

Name_MatchCode
COMPANY_MatchCode
BUSINESS PHONE_Stnd

3) Click OK to create the third condition.

The final set of conditions should resemble the following:

i. Click Additional Outputs.

1) Locate the LAST NAME and FIRST NAME fields.

2) Move these two fields to follow the ID field.

3) Click OK to close the Additional Outputs window.

j. Click OK to close the Clustering Properties window.


8. Preview the data for the Clustering node.

a. Right-click the Clustering node and select Preview.

b. Scroll to the right to view the Cluster_ID and COND_ fields.

c. Scroll down through the observations.

Notice that some records have duplicate values for Cluster_ID.

The COND_ fields display True if that condition (1, 2, or 3) caused the record
to be clustered. Otherwise, the COND_ fields display False.

Question: What is the first value of Cluster_ID that has duplicate values?

Answer: Cluster_ID=22


d. Scroll in the Preview window to where Cluster_ID=22 records are at the top of the window.

e. Scroll to the left to see the original field values for LAST NAME, FIRST NAME, COMPANY,
BUSINESS PHONE, ADDRESS, CITY, STATE/PROVINCE, and ZIP/POSTAL CODE.

The original values in the fields that were used for match code generation or for
standardization are "close" to the same, if not exactly the same.

The first three records (Cluster_ID=22) represent Mike or Michael Entin.

The values for the COMPANY field for these three records are "close" to the same.
Similar observations can be made for the remaining fields (BUSINESS PHONE, ADDRESS,
CITY, STATE/PROVINCE, and ZIP/POSTAL CODE).

9. Select File  Save to save the data job.

It might be necessary to report on the cluster results. We now examine the Match Report node.


Match Report Node


The Match Report node enables you to organize
groups of records (or clusters) that were
identified from your match criteria.
The Match Report Viewer surfaces the generated
match report.


One of the special types of output files, created from a set of clustered data, is known as a match
report.

A match report
• is a file created in the file-based portion of the repository
• is a way to explore each cluster created by the clustering process in the data job
• is ordered by cluster number.

Recall that the Clustering node has an option to identify which clusters to output (all, single-row, or
multi-row). If all clusters are output, the match report contains both single-row and multi-row
clusters. If only single-row or only multi-row clusters are output, the match report contains only that
type of cluster.
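The .mre report format itself is proprietary, but as a rough mental model, a match report is just the clustered rows grouped and ordered by cluster ID. A minimal plain-text stand-in in Python (illustrative only; it does not produce an actual .mre file, and the field names are assumptions):

from itertools import groupby

def write_match_report(rows, cluster_field, report_fields, path, title):
    """Write one block per cluster, ordered by cluster number, showing only
    the requested report fields (typically the original fields)."""
    ordered = sorted(rows, key=lambda r: r[cluster_field])
    with open(path, "w") as out:
        out.write(title + "\n")
        for cid, members in groupby(ordered, key=lambda r: r[cluster_field]):
            members = list(members)
            out.write(f"\nCluster {cid} ({len(members)} record(s))\n")
            for r in members:
                out.write("  " + " | ".join(str(r.get(f, "")) for f in report_fields) + "\n")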


Match Report Properties

Name and location of the match report


A title for the match report
Option to surface match
report when data job runs
Identification of the cluster field

Fields for match report – typically, "original" fields are selected and not match code fields.


The Match Report node allows for the specification of a name and location for the match report.

In addition, you can specify a title for the match report. The title surfaces on the title bar of the Match
Report Viewer.

You can select an option that launches the Match Report Viewer window when the data job finishes
running.

The field created in the Clustering node (the Cluster field) needs to be identified.

Lastly, the fields for the match report need to be selected. The purpose of the match report is to
understand how the clustering conditions specified in the Clustering node have grouped or clustered
the data. Therefore, the fields selected for the match report are most often the "original" fields (the
fields that the match codes were generated for, not the match code fields).


Creating a Match Report from Clustered Data

This demonstration illustrates the steps that are necessary to create a match report from clustered
records and review the report.

1. If necessary, access Data Management Studio.

a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.

b. Click Cancel to close the Log On window.

2. If necessary, open a data job from the Basics Demos repository.

a. Click the Folders riser bar.

b. Expand Basics Demos.

c. Click batch_jobs.

d. Double-click Ch7D1_EntityResolution.

3. Add a Match Report node to the job flow.

a. Verify that the Nodes riser bar is selected in the resource pane.

b. If necessary, collapse the Entity Resolution grouping of nodes.

c. Expand the Data Outputs grouping of nodes.

d. Double-click the Match Report node.

The node is added to the job flow. The properties window for the node appears.

4. Specify properties for the Match Report node.

a. Enter Customers Match Report in the Name field.

b. Click the ellipsis next to the Report file field.

1) Navigate to D:\Workshop\dqdmp1\Demos\files\output_files.

2) Enter Ch7D1_Customers_MatchReport in the File name field.

3) Verify that Save as type is set to .mre (a proprietary file type).

4) Click Save.

c. Enter Customers Match Report - Three Conditions in the Report title field.

d. Click Launch Report Viewer after job is completed.


e. In the Cluster fields area, click the down arrow next to Cluster field and select
Cluster_ID.

The Match Report Properties window should resemble the following:

f. Specify the report fields.

1) In the Report fields area, click the left-pointing double arrow to remove all fields from
the Selected list.

2) Double-click the following fields to move from Available to Selected:


Cluster_ID
ID
COMPANY
LAST NAME
FIRST NAME
ADDRESS
CITY
STATE/PROVINCE
ZIP/POSTAL CODE
BUSINESS PHONE_Stnd


g. Click OK to close the Match Report Properties window.

5. Select File  Save to save the data job.

6. Run the job.

a. Verify that the Data Flow tab is selected.

b. Select Actions  Run Data Job.

c. Verify that the dfReport Viewer appears. (The specified title appears in the viewer's title bar.)

Because you requested all the clusters, you get many clusters that contain only one record.

Use these tools to scroll through the pages of clusters:

First cluster

Previous cluster

Next cluster

Last cluster


Question: How many clusters were found?

Answer: 57

The bottom panel of the dfReport Viewer lists the total number of clusters.

Question: How many multi-row clusters exist?

Answer: Four

The easiest way to determine this is to regenerate the report for multi-row clusters only:
• Select File  Exit to close the match report.
• Right-click the Clustering node and select Properties.
• Select Multi-row clusters only for Output clusters.
• Click OK to close the Cluster Properties window.
• Select Actions  Run Data Job. The dfReport Viewer window appears.

Question: What is the largest number of records in a cluster?

Answer: Three

Three-record clusters are the largest clusters.


Question: Examine the first multi-row cluster.


What is the value of Cluster_ID?
What is the maximum value of ID for the records in this cluster?

Answer: Cluster_ID has a value of 22.


The maximum value of ID within this cluster is 63.

Question: Examine the second multi-row cluster of three records.


What is the value of Cluster_ID?
What is the maximum value of ID for the records in this cluster?

Answer: Cluster_ID has a value of 35.


The maximum value of ID within this cluster is 61.

d. Select File  Exit to close the dfReport Viewer window.


7. View the detailed log.

a. Click the Log tab.

b. Review the information for each of the nodes.

8. Close the data job.

a. Click the Data Flow tab.

b. Select File  Close.


Practice

1. Creating a Data Job to Cluster Records and Create a Match Report

Continue working with the data job Ch7E1_Manufacturers_MatchReport in the batch_jobs
folder of the Basics Exercises repository. You will add several nodes to the data job
(Standardization, Clustering, and Match Report nodes). The final job flow should resemble the
following:

• Open the Ch7E1_Manufacturers_MatchReport data job from the batch_jobs folder in the
Basics Exercises repository (this data job was saved from the 7.03 Activity).

Note: If the data job does not exist, open the following data job as a starting point:
dfr://Basics Solutions/batch_jobs/Startup_Jobs/
Ch7E1_Manufacturers_MatchReport_02_Startup
Immediately save this starter job to the batch_jobs folder in the Basics Exercises
repository with the name Ch7E1_Manufacturers_MatchReport.

The starting data flow diagram should resemble the following:

• Add a Standardization node to the data flow.

o Use the Phone definition to standardize the CONTACT_PHONE field.

o Accept the default name for the standardized field.

o Select Preserve null values.

• Add a Clustering node to the data flow.

o Create an output cluster field named Cluster_ID.

o Display all clusters.

o Sort output by cluster number.


o Use the following two conditions to cluster records:

Condition 1:
MANUFACTURER_MatchCode
CONTACT_MatchCode
CONTACT_ADDRESS_MatchCode
CONTACT_POSTAL_CD_MatchCode

Condition 2:
MANUFACTURER_MatchCode
CONTACT_MatchCode
CONTACT_STATE_PROV_MatchCode
CONTACT_PHONE_Stnd

• Add a Match Report node to the data flow.

o Create a report file named Ch7E1_Manufacturers_MatchReport in the location
D:\Workshop\dqdmp1\Exercises\files\output_files

o Launch the report viewer when the job runs.

o Specify the correct cluster field.

o Display all the following fields in the match report:

Cluster_ID
ID
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_PHONE_Stnd

• Save and run the data job.

Question: How many clusters are produced using the specified clustering conditions?

Answer:

Question: How many clusters are multi-row clusters?

Answer:

Question: What is the largest number of records in a single cluster?

Answer:


11.4 Survivorship

Entity Resolution
Up to this point, you have seen the following:
• using the Match Codes node and the Match
Codes (Parsed) node to generate match codes
• using the Clustering node to group records
that might represent the same entity.

The multi-row clusters now need to be reduced to a single, surviving record.


Surviving Record Identification Node

The Surviving Record Identification node provides the ability to specify two types of rules to select
and build out the best surviving record for multi-row clusters:
• record rules
• field rules


The Surviving Record Identification node (also referred to as the SRI node) is in the Entity
Resolution grouping of nodes.

The properties of the SRI node provide an area for defining one or more record rules that can be
applied to multi-row clusters. In addition, field rules can be specified to enhance the final surviving
record.


Surviving Record Identification: Record Rules


Examples of record rules:
1. First Name = Highest Occurrence
2. ID = Maximum Value
Cluster  ID  Last Name  First Name  … more columns …
Cluster 1: 3 records
   1     52  Entin      Mike
   1     63  Entin      Michael
   1     23  Entin      Michael
Cluster 2: 2 records
   2     29  Black      Robert
   2     51  Black      Rob

Record rules are used to select which record from a cluster should survive. Examples of record rules
include, but are not limited to
• the maximum value in a field
• the most frequently occurring value in a field
• the longest value in a field.

In the example shown, there are record rules to identify the record with the highest occurrence of
first name, and then the maximum value of ID. The resulting surviving records, based on these rules,
are the records that do not have a line drawn through them.

Note: If there is ambiguity about which record is the survivor, the first remaining record in the
cluster is selected.
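To make the selection logic concrete, here is a minimal Python sketch (illustrative only, not DataFlux code) that applies record rules in order and falls back to the first remaining record on a tie; it reproduces the Cluster 1 result from the slide:

from collections import Counter

def apply_record_rules(cluster, rules):
    """Narrow a cluster to one survivor by applying record rules in order.
    Each rule is (field, op); the two ops used here mirror the slide."""
    candidates = list(cluster)
    for field, op in rules:
        if op == "highest_occurrence":
            counts = Counter(r[field] for r in candidates)
            best = max(counts.values())
            candidates = [r for r in candidates if counts[r[field]] == best]
        elif op == "max":
            best = max(r[field] for r in candidates)
            candidates = [r for r in candidates if r[field] == best]
    return candidates[0]   # on a tie, the first remaining record survives

cluster_1 = [
    {"ID": 52, "Last Name": "Entin", "First Name": "Mike"},
    {"ID": 63, "Last Name": "Entin", "First Name": "Michael"},
    {"ID": 23, "Last Name": "Entin", "First Name": "Michael"},
]
survivor = apply_record_rules(cluster_1,
                              [("First Name", "highest_occurrence"), ("ID", "max")])
print(survivor["ID"])   # 63: "Michael" occurs most often, then the maximum ID wins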


Surviving Record Identification: Field Rules


Examples of field rules:
1. Where EMAIL is Not Null, Select EMAIL
2. Where STATE is Shortest Value, Select CITY and STATE
Surviving record:
ID  Last Name  First Name  EMAIL               CITY         STATE
63  Entin      Michael     mentin@farmers.com  Wallingford  CT

Cluster records:
ID  Last Name  First Name  EMAIL               CITY         STATE
52  Entin      Mike        mentin@farmers.com  Wallingford  Connecticut
63  Entin      Michael                         Middlefield  Conn
23  Entin      Michael                         Wallingford  CT

Field rules are used to "borrow" information from other records in the cluster when the surviving
record does not contain a data value for a particular field, or when a "better" value exists in one of
the other records in the cluster.

In the example shown, the surviving record (the second row under the cluster records) does not
have a value for the EMAIL field. However, the first record in this cluster does have a value for
EMAIL. A field rule can be set up to determine whether EMAIL is not null. If that condition is met,
then the value in the EMAIL field from the first record in the cluster is written to the surviving
record's EMAIL field.

Also in this example, a secondary field rule selects CITY and STATE from the record where the
STATE field has the shortest value. The third record in this cluster has the shortest value for the
STATE field. Therefore, the values in the CITY and STATE fields from this third record are written
to the surviving record's CITY and STATE fields.

Note: The row of data at the top is the "newly constructed" surviving record, which uses all the
values that met the record rules, as well as the field rules. The field values that changed
based on a field rule are EMAIL, CITY, and STATE.
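The same borrowing logic can be sketched in a few lines of Python (again illustrative only, not DataFlux code; the donor-selection functions stand in for the node's rule expressions):

cluster_1 = [
    {"ID": 52, "EMAIL": "mentin@farmers.com", "CITY": "Wallingford", "STATE": "Connecticut"},
    {"ID": 63, "EMAIL": "",                   "CITY": "Middlefield", "STATE": "Conn"},
    {"ID": 23, "EMAIL": "",                   "CITY": "Wallingford", "STATE": "CT"},
]
survivor = cluster_1[1]    # record 63 won the record rules

def apply_field_rules(survivor, cluster, rules):
    """Enrich the survivor by borrowing values from other cluster records.
    Each rule pairs a donor-selection function with the fields it affects."""
    result = dict(survivor)
    for pick_donor, affected_fields in rules:
        donor = pick_donor(cluster)
        if donor is not None:
            for field in affected_fields:
                result[field] = donor[field]
    return result

rules = [
    # Where EMAIL is not null, select EMAIL.
    (lambda rows: next((r for r in rows if r["EMAIL"]), None), ["EMAIL"]),
    # Where STATE is the shortest value, select CITY and STATE.
    (lambda rows: min(rows, key=lambda r: len(r["STATE"])), ["CITY", "STATE"]),
]
print(apply_field_rules(survivor, cluster_1, rules))
# {'ID': 63, 'EMAIL': 'mentin@farmers.com', 'CITY': 'Wallingford', 'STATE': 'CT'}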


Options for Surviving Record Identification Node

The Surviving Record Identification node (by default) passes only surviving records to the next
node in the data flow. The Options window for the node can be used to override the default behavior.


The Surviving Record Identification (SRI) node examines clustered data and determines a surviving
record for each cluster.

By default, the SRI node passes only surviving records to the next node in the data flow. The
Options window for the node can be used to override the default behavior.

There are numerous combinations of options to explore. For example, for some sets of data, you
might want to do one of the following:
• Keep duplicate records. A new field can be defined to flag the surviving record with True and the
non-surviving records with False.
• Keep the original duplicate records and generate a new, distinct record as the surviving record.
This is particularly useful if you are using field rules to update values in the surviving record. As
before, a new field can flag the surviving record with True and the non-surviving records with
False.
• Keep the original duplicate records, and define a new field that assigns, to each non-surviving
record in a cluster, the primary key value of that cluster's surviving record.


SRI Node: Options (Default)


Cluster_ID  ID  Last Name   First Name  … more columns …
22          63  Entin       Michael
23          24  Hasselberg  Jonas
24          25  Rodman      John
26          27  Toh         Karen
28          51  Black       Rob

Only surviving records are passed from the SRI node.

NO OPTIONS SPECIFIED


If no options are specified for the SRI node, the default behavior is to pass only the surviving
records from the SRI node to the next node in a data job.

In the example shown, recall that a three-record cluster existed where Cluster_ID=22. Only the
surviving record (in this case, the record where ID has the maximum value) from this multi-row
cluster is passed to the next node. Similarly, recall that a two-record cluster existed where
Cluster_ID=28. Only the surviving record from this multi-row cluster is passed to the next node.


SRI Node: Logical Delete Option 1


Cluster_ID  SR_Flag  ID  Last Name   First Name  … more columns …
22          False    52  Entin       Mike
22          True     63  Entin       Michael
22          False    23  Entin       Michael
23          True     24  Hasselberg  Jonas
24          True     25  Rodman      John
26          True     27  Toh         Karen
28          False    28  Black       Robert
28          True     51  Black       Rob

All records are passed from the SRI node. Survivors are marked with True.

Options: Keep duplicates; Surviving record ID field: SR_Flag

One option for dealing with surviving and non-surviving records involves creating a surviving record
ID field that serves as a True/False flag. The flag's value indicates whether the record is a surviving
record. When selecting this option, you need to specify a new field to contain the True/False
indicator.

If this option is selected, every record that is a surviving record is flagged as True. All the remaining
records from the clusters are flagged as False. When you use this option, a simple filter for selecting
records flagged as True yields an accurate output table of only the surviving records.


SRI Node: Logical Delete Option 2

All clusters, including single-row clusters, have a generated surviving record. "New" surviving
records are marked with True.

Options: Keep duplicates; Surviving record ID field: SR_Flag; Generate distinct surviving record


A second option for dealing with surviving and non-surviving records involves creating the surviving
record ID field as before and selecting Generate distinct surviving record to create a new
surviving record for each cluster. This new record has a value of True for the surviving indicator field
flag, and its value of ID is the same as the surviving record's.

If this option is selected, every cluster (even the single-row clusters) has a surviving record that is
created and flagged as True. All the original input records are flagged as False. When you use this
option, a simple filter for selecting records flagged as True yields an accurate output table of only the
surviving records, and a simple filter for selecting records flagged as False returns all of the original
data records.

Note: This option results in an output data table with two records for each ID value for the
surviving records, which violates the assumptions of a primary key field. Further subsetting
of the data would be necessary before you use the ID field as a primary key.


SRI Node: Logical Delete Option 3

All clusters, including single-row clusters, have a generated surviving record. "New" surviving
records are marked with True, and the primary key field is set to (null).

Options: Keep duplicates; Surviving record ID field: SR_Flag; Generate distinct surviving record;
Primary key field: ID

A third option for dealing with surviving records involves creating the surviving record ID field and
selecting Generate distinct surviving record as before. You can also identify the primary key field
for the input data. This results in a new record (the survivor) that has a value of True for the surviving
indicator field flag, and a null value for ID.

If this option is selected, every cluster (even the single-row clusters) has a surviving record that is
created and flagged as True with a blank value for ID. All the original input records are flagged as
False and retain their original value of the primary key field (ID). When you use this option, a simple
filter for selecting records flagged as True yields an accurate output table of only the surviving
records. However, you need to add an additional processing step to generate new primary key
values for the surviving records.


SRI Node: Logical Delete Option 4


Cluster_ID  SR_Flag  ID  Last Name   First Name  … more columns …
22          63       52  Entin       Mike
22          (null)   63  Entin       Michael
22          63       23  Entin       Michael
23          (null)   24  Hasselberg  Jonas
24          (null)   25  Rodman      John
26          (null)   27  Toh         Karen
28          51       28  Black       Robert
28          (null)   51  Black       Rob

All records are passed from the SRI node. SR_Flag is set to (null) for the surviving record in a
cluster. For non-surviving records, SR_Flag is assigned the value of the primary key field from the
surviving record.

Options: Keep duplicates; Surviving record ID field: SR_Flag; Use primary key as surviving record
ID; Primary key field: ID

A fourth option for dealing with surviving records involves creating the surviving record ID field,
selecting Use primary key as surviving record ID, and identifying the ID field as the primary key
field for the survivor. This results in the surviving record having a value of (null) for each of the
clusters. The non-surviving records in each cluster have an SR_Flag value equal to the survivor's
primary key field (ID, in this example).

If this option is selected, surviving records can be identified by selecting the records with a (null)
value in the SR_Flag field. Selecting records with a value in the SR_Flag field gives you a list of the
duplicates of the surviving records.

As you can see, different combinations of options affect the layout of the result set of the Surviving
Record Identification node. Which set of options to use must be decided for each data source.
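To make the differing layouts concrete, the following Python sketch (illustrative only, not DataFlux code) mimics two of the combinations shown above: option 1's True/False flag and option 4's primary-key reference:

def flag_survivors(clusters, record_rule, mode="flag", pk_field="ID"):
    """Keep duplicates and mark them, mimicking two SRI option combinations:
    mode="flag" -> SR_Flag is True for the survivor, False otherwise (option 1)
    mode="pk"   -> SR_Flag is None for the survivor and the survivor's
                   primary key value for non-survivors (option 4)"""
    out = []
    for cid, rows in clusters.items():
        survivor = record_rule(rows)
        for r in rows:
            if mode == "flag":
                sr = (r is survivor)
            else:
                sr = None if r is survivor else survivor[pk_field]
            out.append({"Cluster_ID": cid, "SR_Flag": sr, **r})
    return out

clusters = {22: [{"ID": 52}, {"ID": 63}, {"ID": 23}],
            28: [{"ID": 28}, {"ID": 51}]}
max_id = lambda rows: max(rows, key=lambda r: r["ID"])
for row in flag_survivors(clusters, max_id, mode="pk"):
    print(row)   # e.g. {'Cluster_ID': 22, 'SR_Flag': 63, 'ID': 52}

With mode="flag", a filter on SR_Flag == True returns exactly the surviving records, which is the simple filter described for option 1.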


Adding Survivorship to the Entity Resolution Job

This demonstration illustrates the steps that are necessary to add and configure a Surviving Record
Identification node and investigate the various option settings for this node.

1. If necessary, access Data Management Studio.

a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.

b. Click Cancel to close the Log On window.

2. If necessary, open a data job from the Basics Demos repository.

a. Click the Folders riser bar.

b. Expand Basics Demos.

c. Click batch_jobs.

d. Double-click Ch7D1_EntityResolution.

3. Save the data job to a new name.

a. Select File  Save As.

b. Enter Ch7D2_SRI in the Name field.

c. Verify that the Save in location is set to the Basics Demos  batch_jobs folder.

d. Click Save.

4. Right-click the Match Report node (labeled Customers Match Report) and select Delete.

The Match Report node is deleted.

Note: The above data flow diagram was wrapped for display purposes.


5. Add a Surviving Record Identification node to the job flow.

a. Verify that the Nodes riser bar is selected in the resource pane.

b. Expand the Entity Resolution grouping of nodes.

c. Double-click the Surviving Record Identification node.

The node is added to the job flow. The properties window for the node appears.

6. Specify properties for the Surviving Record Identification node.

a. Enter Select Best Record in the Name field.

b. Click the down arrow next to Cluster ID field and select Cluster_ID.

c. Add one record rule.

1) In the Record rules area, click Add.

The Add Record Rule Expression window appears.

2) Click the down arrow next to Field and select ID.

3) Click the down arrow next to Operation and select Maximum Value.

4) Click Add Condition.


The Expression area is updated to resemble the following:

5) Click OK to close the Add Record Rule Expression window.

The Record rules area should resemble the following:

d. Click OK to close the Surviving Record Identification Properties window.

7. Add a (temporary) Text File Output node to the job flow.

a. Verify that the Nodes riser bar is selected in the resource pane.

b. Collapse the Entity Resolution grouping of nodes.

c. Expand the Data Outputs grouping of nodes.

d. Double-click the Text File Output node.

The node is added to the job flow. The properties window for the node appears.

8. Specify properties for the Text File Output node.

a. Enter Test in the Name field.

b. Specify the output file information.

1) Click the down arrow next to the Output file field.

2) Navigate to D:\Workshop\dqdmp1\Demos\files\output_files.

3) Enter Test.csv in the File name field. (Notice the .csv extension!)

4) Click Save.


c. Specify attributes for the file.

1) Specify a double quotation mark " as the text qualifier.

2) Verify that the Field delimiter field is set to Comma.

3) Click Include header row.

4) Click Display file after job runs.

d. Move the Cluster_ID field so that it appears above the ID field.

1) Scroll in the Selected list to locate the Cluster_ID field.

2) Click the up arrow until Cluster_ID appears above the ID field.

e. Move the name fields so that they follow the ID field.

1) Scroll in the Selected list to locate the Last_Name and First_Name fields.

2) Click the up arrow until these two fields follow the ID field.

The Text File Output Properties window should resemble the following:

f. Click OK to close the Text File Output Properties window.


9. Select File  Save to save the data job.

10. Run the job.

a. Verify that the Data Flow tab is selected.

b. Select Actions  Run Data Job.

11. Verify that the text file opens in Excel.

12. Investigate the records where Cluster_ID=22.

a. Scroll to the row where Cluster_ID=22.

b. Verify that the remaining record for this cluster is the one with the maximum value of ID (63).

13. Investigate the records where Cluster_ID=35.

a. Scroll to the row where Cluster_ID=35.

b. Verify that the remaining record for this cluster is the one with the maximum value of ID (61).

14. Close the Test.csv file.

a. Select File  Close to close the .csv file.

b. Click Don’t Save.


15. Verify the processing information that appears for each of the nodes.

Note: The Surviving Record Identification node processed 63 rows. For each multi-row cluster,
only one record was selected (the record with the maximum ID value). Therefore, the
number of records written to the text file is 57 rows. Selecting only one record from each
cluster is the default action.

16. Edit the properties of the Surviving Record Identification node (Logical Delete Option 1).

a. Right-click the Surviving Record Identification node and select Properties.

b. Click Options to the right of Cluster ID field.

1) Click Keep duplicate records.

2) Enter SR_Flag as the Surviving record ID field.

Note: If Keep duplicate records is selected, Surviving record ID field must be specified.

3) Click OK to close the Options window.

c. Click OK to close the Surviving Record Identification Properties window.


17. Edit the properties of the Text File Output node.

a. Right-click the Text File Output node and select Properties.

b. Move the SR_Flag field so that it appears after the Cluster_ID field.

1) Scroll in the Selected list to locate the SR_Flag field.

2) Click the up arrow until SR_Flag appears after the Cluster_ID field.

c. Click OK to close the Text File Output Properties window.

d. Select File  Save to save the data job.

18. Run the job.

a. Verify that the Data Flow tab is selected.

b. Select Actions  Run Data Job.


19. Verify that the text file opens in Excel.

20. Investigate the records where Cluster_ID=22.

a. Scroll to the rows where Cluster_ID=22.

b. Verify that all three records for this cluster appear.

c. Verify that SR_Flag=TRUE for the surviving record (where the maximum value of ID is 63).

d. Verify that SR_Flag=FALSE for the non-surviving records in this cluster.

21. Investigate the records where Cluster_ID=35.

a. Scroll to the rows where Cluster_ID=35.

b. Verify that all three records for this cluster appear.

c. Verify that SR_Flag=TRUE for the surviving record (where the maximum value of ID is 61).

d. Verify that SR_Flag=FALSE for the non-surviving records in this cluster.

22. Close the Test.csv file.

a. Select File  Close to close the .csv file.


b. Click Don’t Save.


23. Verify the processing information that appears for each of the nodes.

Note: Notice that the Text File Output node wrote all 63 rows to the text file.

24. Edit the properties of the Surviving Record Identification node (Logical Delete Option 2).

a. Right-click the Surviving Record Identification node and select Properties.

b. Click Options to the right of Cluster ID field.

1) Click Generate distinct surviving record.

2) Click the down arrow next to Primary key field and select ID.

3) Click OK to close the Options window.

c. Click OK to close the Surviving Record Identification Properties window.

25. Select File  Save to save the data job.

26. Run the job.

a. Verify that the Data Flow tab is selected.

b. Select Actions  Run Data Job.


27. Verify that the text file opens in Excel.

28. Investigate the records where Cluster_ID=22.

a. Scroll to the rows where Cluster_ID=22.

b. Verify that four records for this cluster appear.

Three records are the original records, and the fourth is a copy of the record where the
maximum value of ID is 63.

In addition, the fourth record does not contain a value for the identified primary key field (ID).

29. Investigate the records where Cluster_ID=35.

a. Scroll to the rows where Cluster_ID=35.

b. Verify that four records for this cluster appear.

Three records are the original records, and the fourth is a copy of the record where the
maximum value of ID is 61.

In addition, the fourth record does not contain a value for the identified primary key field (ID).

30. Close the Test.csv file.


a. Select File  Close to close the .csv file.

b. Click Don’t Save.

31. Verify the processing information that appears for each of the nodes.

Note: Notice that the Surviving Record Identification node processed 63 rows. For each cluster,
a distinct surviving record was generated. Therefore, the number of records written to the
text file is 120 rows, which is the sum of the original rows plus the new distinct surviving
records.

32. Edit the properties of the Surviving Record Identification node (Logical Delete Option 3).

a. Right-click the Surviving Record Identification node and select Properties.

b. Click Options to the right of Cluster ID field.

1) Clear Generate distinct surviving record.

2) Click Use primary key as surviving record ID.

3) Click the down arrow next to Primary key field and select ID.

4) Click OK to close the Options window.

c. Click OK to close the Surviving Record Identification Properties window.


33. Select File  Save to save the data job.

34. Run the job.

a. Verify that the Data Flow tab is selected.

b. Select Actions  Run Data Job.

35. Verify that the text file opens in Excel.

36. Investigate the records where Cluster_ID=22.

a. Scroll to the rows where Cluster_ID=22.

b. Verify that the surviving record has a null value for the SR_Flag field.

c. Verify that the duplicate records have the value of the primary key of the surviving record.

37. Investigate the records where Cluster_ID=35.

a. Scroll to the rows where Cluster_ID=35.

b. Verify that the surviving record has a null value for the SR_Flag field.

c. Verify that the duplicate records have the value of the primary key of the surviving record.


38. Close the Test.csv file.

a. Select File  Close to close the .csv file.

b. Click Don’t Save.

39. Verify the processing information that appears for each of the nodes.

The processing information appears on each of the nodes.

Note: The Text File Output node wrote all 63 rows to the text file.

40. Close the data job.

a. Click the Data Flow tab.

b. Select File  Close.


Adding Field-Level Rules for the Surviving Record

This demonstration illustrates the steps that are necessary to establish field-level rules for
populating fields in the surviving record.

1. If necessary, access Data Management Studio.

a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.

b. Click Cancel to close the Log On window.

2. If necessary, open a data job from the Basics Demos repository.

a. Click the Folders riser bar.

b. Expand Basics Demos.

c. Click batch_jobs.

d. Double-click Ch7D2_SRI.

3. Edit the properties for the Text File Output node.

a. Right-click the Text File Output node and select Properties.

b. In the Output fields area, rearrange the fields so that the following fields appear first:
Cluster_ID
SR_Flag
ID
EMAIL
JOB TITLE
MOBILE PHONE

c. Click OK to close the Text File Output Properties window.

4. Select File  Save to save the data job.

5. Run the job.

a. Verify that the Data Flow tab is selected.

b. Select Actions  Run Data Job.


6. Verify that the text file opens in Excel.

7. Investigate the records where Cluster_ID=22.

a. Scroll to the rows where Cluster_ID=22.

b. Verify that the surviving record has a null value for the EMAIL field, but another record in the
cluster has a non-null value for the EMAIL field.

c. Verify that the surviving record has a null value for the JOB TITLE field, but another record in
the cluster has a non-null value for the JOB TITLE field.

8. Investigate the records where Cluster_ID=35.

a. Scroll to the rows where Cluster_ID=35.

b. Verify that the surviving record has a null value for the EMAIL field, but another record in the
cluster has a non-null value for the EMAIL field.

c. Verify that the surviving record has a shorter value for the JOB TITLE field, but another
record in the cluster has a longer value for the JOB TITLE field.

9. Close the Test.csv file.

a. Select File  Close to close the .csv file.

b. Click Don’t Save.


10. Edit the properties for the Surviving Record Identification node.

a. Right-click the Surviving Record Identification node and select Properties.

b. Under the Output fields area, click Field Rules.


The Field Rules window appears.

c. Add the first of two field rules.

1) Click Add in the Field Rules window.

The Add Field Rule window appears.

2) In the Rule expressions area (of the Add Field Rule window), click Add.

a) Click the down arrow next to Field and select EMAIL.

b) Click the down arrow next to Operation and select Is Not Null.

c) Click Add Condition.


The Expression area is updated to the following:

d) Click OK to close the Add Field Rule Expression window.

3) In the Affected fields area, verify that only EMAIL appears in the Selected list.

4) Click OK to close the Add Field Rule window.

d. Add the second of two field rules.

1) Click Add in the Field Rules window.

The Add Field Rule window appears.

2) In the Rule expressions area (of the Add Field Rule window), click Add.

a) Click the down arrow next to Field and select JOB TITLE.

b) Click the down arrow next to Operation and select Is Not Null.

c) Click Add Condition.

d) Click AND (under the Expression area).

e) Click the down arrow next to Field and select JOB TITLE.

f) Click the down arrow next to Operation and select Longest Value.

g) Click Add Condition.

The Expression area is updated to the following:

h) Click OK to close the Add Field Rule Expression window.

3) In the Affected fields area, verify that JOB TITLE appears in the Selected list.

4) In the Affected fields area, double-click the MOBILE PHONE field to move it from
Available to Selected.

5) Click OK to close the Add Field Rule window.

The two defined field rules should resemble the following:

6) Click OK to close the Field Rules window.


e. Click OK to close the Surviving Record Identification Properties window.

11. Select File  Save to save the data job.

12. Run the job.

a. Verify that the Data Flow tab is selected.

b. Select Actions  Run Data Job.

13. Verify that the text file opens in Excel.

14. Investigate the records where Cluster_ID=22.

a. Scroll to the rows where Cluster_ID=22.

b. Verify that the surviving record now has a non-null value for the EMAIL field, which was
retrieved from another record in the cluster.

c. Verify that the surviving record has a non-null AND longest value for the JOB TITLE field.

d. Verify that the surviving record has a “new value” for the MOBILE PHONE field. This value
was copied to the surviving record from the record where JOB TITLE was not null.

No Field Rules Specified:

Surviving record with no field rules

Field Rules Specified:

Surviving record with field rules applied

15. Investigate the records where Cluster_ID=35.

a. Scroll to the rows where Cluster_ID=35.

b. Verify that the surviving record now has a non-null value for the EMAIL field, which was
retrieved from another record in the cluster.

c. Verify that the surviving record has a non-null and longest value for the JOB TITLE field.

d. Verify that the surviving record has a “new value” for the MOBILE PHONE field. This value
was copied to the surviving record from the record where JOB TITLE was not null and the
longest value.


No Field Rules Specified:

Surviving record with no field rules

Field Rules Specified:

Surviving record with field rules applied

Summary: If you use field rules, important information that is potentially spread across multiple
records of a cluster can be retrieved for the surviving record. In this case, the
surviving records examined have values for both the EMAIL and JOB TITLE fields.
In addition, in the second example, the MOBILE PHONE field for the surviving
record is populated from the record where JOB TITLE has the longest value.

16. Close the Test.csv file.

a. Select File  Close to close the .csv file.

b. Click Don’t Save.

17. Close the data job.

a. Click the Data Flow tab.

b. Select File  Close.


Practice

2. Creating an Entity Resolution Data Job with Survivorship

This practice continues with the data job started in Practice 1. For this exercise, you will add a
Surviving Record Identification node and write the clustered records out to a text file. The final
job flow should resemble the following:

• Save a copy of the data job Ch7E1_Manufacturers_MatchReport (located in the
batch_jobs folder of the Basics Exercises repository) to the same location. Rename it
Ch7E2_Manufacturers_SelectBestRecord.

• Remove the Match Report node and add a Surviving Record Identification node with the
following properties:

Cluster ID field: Cluster_ID

Options ✓ Keep duplicate records

✓ Specify SR_ID as the Surviving record ID.

✓ Use primary key as surviving record ID

✓ Specify ID as the Primary key field.

Record rule: Maximum value of POSTDATE

Field rule: Highest occurrence of CONTACT


• Add a Text File Output node to the job flow using the following specifications:
Output file: Ch7E2_Manufacturer_BestRecord.csv in the directory
D:\Workshop\dqdmp1\Exercises\files\output_files

Text qualifier: " (double-quotation mark)

Field delimiter: comma

✓ Include header row

✓ Display file after job runs

Output fields: Cluster_ID


ID
SR_ID
POSTDATE
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_CNTRY
CONTACT_PHONE

• Save and run the data job.

• Review the output text file and verify the results.

Question: For Cluster_ID=4, the surviving record is the record where ID=48. Why?

Answer:

Question: For Cluster_ID=6, the selected record is the record with ID=11. Why?

Answer:


11.5 Solutions
Solutions to Practices
1. Creating a Data Job to Cluster Records and Create a Match Report

a. If necessary, invoke Data Management Studio.

1) Select Start  All Programs  DataFlux  Data Management Studio 2.7.

2) Click Cancel in the Log On window.

b. Open the “starter” data job.


1) Verify that the Home tab is selected.

2) Click the Folders riser bar.

3) Expand the Basics Exercises repository.

4) Click the batch_jobs folder.

5) Double-click Ch7E1_Manufacturers_MatchReport.

The data job opens on a new tab.

c. Add a Standardization node to the job flow.

1) Verify that the Nodes riser bar is selected in the Resource pane.

2) Expand the Quality grouping of nodes.

3) Double-click the Standardization node.

The node is added to the job flow. The properties window for the node appears.

d. Specify properties for the Standardization node.

1) Enter Standardize Phone in the Name field.

2) Verify that English (United States) is selected for the Locale field.

3) In the Standardization fields area, double-click CONTACT_PHONE to move it
from Available to Selected.

4) Click the down arrow under Definition and select Phone.

5) Click Preserve null values.


The Standardization Properties window should resemble the following:

6) Click OK to close the Standardization Properties window.

e. Add a Clustering node to the job flow.

1) Verify that the Nodes riser bar is selected in the Resource pane.

2) Collapse the Quality grouping of nodes.

3) Expand the Entity Resolution grouping of nodes.

4) Double-click the Clustering node.

The node is added to the job flow. The properties window for the node appears.

f. Specify properties for the Clustering node.

1) Enter Cluster on Two Conditions in the Name field.

2) Enter Cluster_ID as the Output cluster ID field value.

3) Verify that All clusters is specified in the Output clusters field.

4) Click Sort output by cluster number.

5) Specify the first condition.

a) In the Conditions area, click (Add a cluster condition).

b) Double-click the following fields to move them from Available fields to Cluster if all
fields match:
MANUFACTURER_MatchCode
CONTACT_MatchCode
CONTACT_ADDRESS_MatchCode
CONTACT_POSTAL_CD_MatchCode


The Clustering Condition window should resemble the following:

c) Click OK to create the first condition.

6) Specify the second condition.

a) In the Conditions area, click (Add a cluster condition).

b) Double-click the following fields to move them from Available fields to Cluster if all
fields match:
MANUFACTURER_MatchCode
CONTACT_MatchCode
CONTACT_STATE_PROV_MatchCode
CONTACT_PHONE_Stnd

c) Click OK to create the second condition.


The properties should resemble the following:

7) Click OK to close the Clustering Properties window.

g. Add a Match Report node to the job flow.

1) Verify that the Nodes riser bar is selected in the Resource pane.

2) Collapse the Entity Resolution grouping of nodes.

3) Expand the Data Outputs grouping of nodes.

4) Double-click the Match Report node.

The node is added to the job flow. The properties window for the node appears.

h. Specify properties for the Match Report node.

1) Enter Manufacturers Match Report in the Name field.

2) Click the ellipsis next to the Report file field.

a) Navigate to D:\Workshop\dqdmp1\Exercises\files\output_files.

b) Enter Ch7E1_Manufacturers_MatchReport in the File name field.

c) Click Save.

3) Enter Manufacturers Match Report - Two Conditions in the Report title field.

4) Click Launch Report Viewer after job is completed.

5) In the Cluster fields area, click the down arrow next to Cluster field and select
Cluster_ID.


6) Specify the report fields.

a) In the Report fields area, click the left-pointing double arrow to remove all fields
from the Selected list.

b) Double-click the following fields to move from Available to Selected:

Cluster_ID
ID
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_PHONE_Stnd

7) Click OK to close the Match Report Properties window.

The final job flow should resemble the following:

i. Save and run the data job.

1) Select File  Save to save the data job.

2) Verify that the Data Flow tab is selected.

3) Select Actions  Run Data Job.


4) Verify that the dfReport Viewer appears.

j. Answer the following questions:

Question: How many clusters are produced when you use the specified clustering
conditions?

Answer: 132 clusters

The bottom panel of the dfReport Viewer window shows how many
clusters were found:

Question: How many clusters are multi-row clusters?

Answer: 30 clusters

(1) Select File  Exit to close the dfReport Viewer window.

(2) In the data job, right-click the Clustering node and select Properties.

(3) Select Multi-row clusters only for the Output clusters field.


(4) Click OK to close the Clustering Properties window.

(5) Select File  Save to save the data job.

(6) Select Actions  Run Data Job to re-execute the data job.

Question: What is the largest number of records in a single cluster?

Answer: 10 records
Scanning the two pages of the dfReport Viewer, you can see that the
first cluster with 10 records is the largest cluster.

k. Select File  Exit to close the dfReport Viewer window.


l. View the detailed log.

1) Click the Log tab.

2) Review the information for each of the nodes.

m. Close the data job.

1) Click the Data Flow tab.

2) Select File  Close.

2. Creating an Entity Resolution Data Job with Survivorship

a. If necessary, invoke Data Management Studio.

1) Select Start  All Programs  DataFlux  Data Management Studio 2.7.

2) Click Cancel in the Log On window.

b. Open the “starter” data job.

1) Verify that the Home tab is selected.

2) Click the Folders riser bar.

3) Expand the Basics Exercises repository.

4) Click the batch_jobs folder.

5) Double-click Ch7E1_Manufacturers_MatchReport.


The data job opens on a new tab.

Note: If necessary, you can open the start-up job for the exercise:
• Verify that the Home tab is selected.
• Click the Folders riser bar.
• Expand the Basics Solutions repository.
• Expand the batch_jobs folder.
• Expand the Startup_Jobs folder.
• Right-click the Ch7E2_Manufacturers_SelectBestRecord_Startup job
and select Open.

c. Save the data job as a new data job.

1) Select File  Save as to save the job.

2) If necessary, navigate to the Basics Exercises  batch_jobs folder.

3) Name the new job Ch7E2_Manufacturers_SelectBestRecord.

d. Right-click the last node in the data job (Match Report node) and select Delete.

e. Add a Surviving Record Identification node to the job flow.

1) Verify that the Nodes riser bar is selected in the resource pane.

2) Expand the Entity Resolution grouping of nodes.

3) Double-click the Surviving Record Identification node.

The node is added to the job flow. The properties window for the node appears.

f. Specify properties for the Surviving Record Identification node.

1) Enter Select Best Record in the Name field.

2) Click the down arrow next to Cluster ID field and select Cluster_ID.

3) Click Options to the right of Cluster ID field.

a) Click Keep duplicate records.

b) Enter SR_ID as the Surviving record ID field.

c) Click Use primary key as surviving record ID.

d) Click the down arrow next to Primary key field and select ID.

e) Click OK to close the Options window.

4) Add one record rule.


a) In the Record rules area, click Add.

(1) Click the down arrow next to Field and select POSTDATE.

(2) Click the down arrow next to Operation and select Maximum Value.


(3) Click Add Condition. The Expression area is updated.

b) Click OK to close the Add Record Rule Expression window.

5) In the Output f ields area, click Field Rules on the lower right.

6) Click Add in the Field Rules window.

a) Click Add in the Rule expressions area of the Add Field Rule window.

(1) Click next to Field and select CONTACT.

(2) Click next to Operation and select Highest Occurrence.

(3) Click Add Condition. The Expression area is updated.

(4) Click OK to close the Add Field Rule Expression window.

b) In the Affected fields area, verify that only CONTACT appears in Selected.

c) Click OK to close the Add Field Rule window.

d) Click OK to close the Field Rules window.

7) Click OK to close the Surviving Record Identification Properties window.

g. Add a Text File Output node to the job flow.

1) Verify that the Nodes riser bar is selected in the resource pane.

2) Collapse the Entity Resolution grouping of nodes.

3) Expand the Data Outputs grouping of nodes.

4) Double-click the Text File Output node.

The node is added to the job flow. The properties window for the node appears.

h. Specify properties for the Text File Output node.

1) Enter Manufacturer Best Record in the Name field.

2) Specify the output file information.

a) Click the ellipsis next to the Output file field.

b) Navigate to D:\Workshop\dqdmp1\Exercises\files\output_files.

c) Enter Ch7E2_Manufacturer_BestRecord.csv in the File name field.

d) Click Save.

3) Specify attributes for the file.

a) Specify a double quotation mark (") as the text qualifier.

b) Verify that Field delimiter is set to Comma.

c) Click Include header row.


d) Click Display file after job runs.

4) Select fields to write to the text file.

a) In the Output fields area, click the left-pointing double arrow to remove all fields
from the Selected list.

b) Double-click the following fields to move them from Available to Selected:


Cluster_ID
SR_ID
ID
POSTDATE
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_CNTRY
CONTACT_PHONE

The properties of the node should resemble the following:

i. Click OK to close the Text File Output Properties window.


The final job flow should resemble the following:

j. Save and run the data job.

1) Select File  Save to save the data job.

2) Verify that the Data Flow tab is selected.

3) Select Actions  Run Data Job.

4) Verify that the text file appears.

In the display of the text file that is shown above, two groups are highlighted.


Question: For Cluster ID=4, the surviving record is the record where ID=48. Why?
Answer: The record rule selects the record where POSTDATE has a maximum
value. For this cluster, the record where ID=48 has the maximum value
of POSTDATE.

Question: For Cluster_ID=6, the selected record is the record with ID=11. Why?

Answer: The record rule selects the record where POSTDATE has a maximum
value. For this cluster, the record where ID=11 has the maximum value
of POSTDATE.

k. Select File  Close to close the text file. Do not save any changes.

l. Select File  Close to close the data job.


Solutions to Student Activities

11.01 Activity – Correct Answer

Modified job name on Save.

Verify the Save in location!

Lesson 12 Understanding the SAS® Quality Knowledge Base (QKB)
12.1 Working with QKB Component Files .......................................................................... 12-3
Demonstration: Accessing the QKB Component Files ................................................ 12-5
Demonstration: Using the Scheme Builder ............................................................. 12-16
Demonstration: Using the Chop Table Editor........................................................... 12-24
Demonstration: Using the Phonetics Editor............................................................. 12-34
Demonstration: Using the Regex Library Editor ....................................................... 12-43
Demonstration: Using the Vocabulary Editor ........................................................... 12-49
Demonstration: Using the Grammar Editor ............................................................. 12-62

12.2 Working with QKB Definitions.................................................................................. 12-70

12.3 Solutions ................................................................................................................. 12-81


Solutions to Activities and Questions...................................................................... 12-81

12.1 Working with QKB Component Files


Quality Knowledge Base Component Files

Chop Tables
Regular Expression Libraries
Phonetics Libraries
Schemes
Vocabularies
Grammars


The QKB consists of six types of component files that serve as the building blocks for the definitions.
These files each perform a vital task in the overall functionality provided by the definition. The
following types of component files are available:
• chop tables
• regular expression libraries
• phonetics libraries
• schemes
• vocabularies
• grammars
Note: Each of the above file types has a corresponding special editor. The various editors can be
accessed by selecting Tools  Other QKB Editors.


The following component files are available:

Component File: Schemes
Purpose: Standardize phrases, words, and tokens
Explanation: A lookup table used to transform data values to a standard representation
Editor: Scheme Builder

Component File: Chop Tables
Purpose: Extract individual words from a text string
Explanation: A collection of character-level rules used to create an ordered word list from a string. For each character represented in the table, you can specify the classification and the operation performed by the algorithm.
Editor: Chop Table Editor

Component File: Phonetics Libraries
Purpose: Phonetic (sound-alike) reduction of words
Explanation: A collection of patterns that produce the same output string for input strings that have similar pronunciations or spellings.
Editor: Phonetics Editor

Component File: Regex Libraries
Purpose: Standardization, categorization, casing, and pattern identification
Explanation: A collection of patterns that are matched against a text string (from left to right) for character-level cleansing and operations.
Editor: Regex Library Editor

Component File: Vocabulary Libraries
Purpose: Categorize words
Explanation: A collection of words, each associated with one or more categories and likelihoods.
Editor: Vocabulary Editor

Component File: Grammars
Purpose: Identify patterns in word categories
Explanation: A collection of rules that represent expected patterns of words in a given context.
Editor: Grammar Editor


Accessing the QKB Component Files

This demonstration illustrates how to navigate to the QKB component files in DataFlux Data
Management Studio.
1. If necessary, open Data Management Studio.
a. Select Start All Programs DataFlux Data Management Studio 2.7.
b. Click Cancel in the Log On window.
c. If necessary, close the DataFlux Data Management Methodology window.
2. If necessary, open the QKB CI 27 - ENUSA Only QKB.
a. In Data Management Studio, select the Administration riser bar.
b. Expand Quality Knowledge Bases.
c. Expand QKB CI 27 - ENUSA Only.
d. Expand Global.
e. Expand English.
f. Select English (United States).

g. Click (Open QKB).


The contents of the English (United States) locale are displayed.

Note: The tabs in the Information pane can be used to display the various types of
component files in the QKB.
Note: The Quality Knowledge Base tab is selected by default. This tab displays all the
definitions that are available in the QKB.


Note: As you navigate through the component files of the QKB, you see various symbols
that correspond to the objects that are related to the QKB.

Components

QKB

Data Type

Definition

Scheme

Chop Table

Phonetics Library

Regex Library

Vocabulary Library

Grammar


3. Explore the schemes in the QKB.


a. Click the Schemes tab to see the list of available schemes.

The Schemes tab lists the standardization schemes that are contained in the selected QKB
locale. A scheme is a simple lookup table that maps a word to some alternate, preferred
representation for that word. Schemes are used to perform standardization of words and
phrases, and for identifying “known words” in casing definitions and extraction definitions.
Note: You can create a new scheme (in the Scheme Builder) by clicking the New Scheme
button.
b. Scroll through the list of schemes and locate the GB Country scheme.
c. Select the GB Country scheme.


d. Click in the upper left corner to see the actions that are available for interacting with the
schemes.

e. Select Open.
The GB Country scheme opens in the Scheme Builder.

Note: You can also open the Scheme Builder for a scheme by double-clicking it on the
Schemes tab.
f. Preview the data and standard values that exist in the scheme for standardizing country
values.


g. Scroll down to the data values that have United States as their standard.

Note: These are all of the different data values that would be standardized to United States
if this scheme were applied to the data.
h. Select File  Close to close the Scheme Builder.

i. Click (the Show Find Pane icon).

The Show Find Pane icon activates the Find toolbar. The Find toolbar enables you to search
for keywords that are associated with one or more schemes.
j. Enter Country in the Find field.
k. Press Enter.


The first scheme whose name contains the word Country is highlighted.

l. Click the Next and Previous buttons to scroll through the schemes.
4. Explore the chop tables in the QKB.
a. Click the Chop Tables tab to see the list of available chop tables.

The Chop Tables tab lists the chop tables that are contained in the selected QKB locale.
A chop table is a collection of character-level rules that are used to create an ordered word
list from an input string (for example, to break a person’s name into the individual words that
make up the name).
Note: You can open a chop table by either double-clicking the chop table, or by selecting a
specific chop table and then clicking (the Open QKB icon). You can create a new
chop table (in the Chop Table Editor) by clicking the New Chop Table button. This
opens a wizard that navigates through the creation of the new chop table.
Note: You might find it beneficial to create a new chop table by opening an existing chop
table and selecting File  Save As to save the new chop table with a different name.


5. Explore the phonetics libraries in the QKB.


a. Click the Phonetics Libraries tab to see the list of available phonetics libraries.

The Phonetics Libraries tab lists the phonetic libraries that are contained in the selected QKB
locale. A phonetics library contains phonetic reduction rules that are used to match words
and phrases with similar-sounding words and phrases (for example, John and Jon).
Note: You can open a phonetics library by double-clicking the phonetics library, or by
selecting a specific phonetics library and then clicking (the Open QKB icon). You
can create a new phonetics library (in the Phonetics Editor) by clicking the New
Phonetics button. This opens a wizard that navigates through the creation of the
new phonetics library.
Note: You might find it beneficial to create a new phonetics library by opening an existing
phonetics library and selecting File  Save As to save the new phonetics library
with a different name.


6. Explore the regular expression (regex) libraries in the QKB.


a. Click the Regex Libraries tab to see the list of available regex libraries.

The Regex Libraries tab lists the regular expression (regex) libraries that are contained in the
selected QKB locale. A regular expression library contains regular expressions that are used
for character-level pattern matching and transformations (for example, removing parentheses
from around a string).
Note: You can open a regex library by double-clicking the regex library, or by selecting
the regex library and then clicking (the Open QKB icon). You can create a new
regex library (in the Regex Library Editor) by clicking the New Regex button. This
opens a wizard that navigates through the creation of the new regex library.
Note: You might find it beneficial to create a new regex library by finding an existing regex
library that has similar expressions to the one that you need. Edit it and select
File  Save As to save the new regex library with a different name.


7. Explore the vocabularies in the QKB.


a. Click the Vocabularies tab to see the list of available vocabularies.

The Vocabularies tab lists the vocabulary files that are contained in the selected QKB locale.
A vocabulary file contains a list of words. Categories are assigned to words in a vocabulary
to help identify the semantic type of those words (for example, the word John could be a
given name word, a middle name word, or a family name word).
Note: You can open a vocabulary by either double-clicking the vocabulary, or by selecting
the vocabulary and clicking (the Open QKB icon). You can create a new
vocabulary in the Vocabulary Editor by clicking the New Vocabulary button. This
opens a wizard that navigates through the creation of the new vocabulary.


8. Explore the grammars in the QKB.


a. Click the Grammars tab to see the list of available grammars.

The Grammars tab lists the grammar files that are contained in the selected QKB locale.
A grammar is a set of rules that represent expected patterns of words in a given context (for
example, a person’s name could be represented by the pattern <given name word> <family
name word>).
Note: You can open a grammar by double-clicking the name or selecting the grammar
and clicking the Open icon. You can create a new grammar in the Grammar Editor by
clicking the New Grammar button. This opens a wizard that navigates through the
creation of the new grammar.


Scheme Overview

Element Standardization Scheme:

Data → Standard
& → AND
and → AND
ACC → ACCOUNT
ACCT → ACCOUNT
ACT → ACCOUNT
ACCOUNTENT → ACCOUNTANT
ACCTNG → ACCOUNTING

Phrase Standardization Scheme:

Data → Standard
DATAFLUX LLC → DATAFLUX
DATAFLUX CORP → DATAFLUX
DATAFLUX INC → DATAFLUX
SAS INSTITUTE → SAS
SAS INSTITUTE INC → SAS
THE SAS INSTITUTE → SAS
THE SAS INSITITUTE INC → SAS

A scheme is a lookup table that is used to transform data values to a standard representation.
Schemes can be applied to individual words in a string (element analysis) or to the entire string
(phrase analysis). Schemes are used in many types of definitions and are often applied at the token
level. For example, in a standardization definition, the input text string is first parsed into tokens, and
then each token is standardized with one or more schemes. In a match definition, standardization
schemes are used to standardize data values that get used in the creation of the match code. In
identification analysis, known word schemes are used to associate words with their possible identity
(for example, familiar phrases that might represent an address or a country value).
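
Schemes are edited interactively in this lesson, but scheme-backed definitions can also be invoked from Base SAS with SAS Data Quality Server. The following is a minimal sketch only, not part of the demonstrations: it assumes that SAS Data Quality Server is licensed, that the ENUSA locale and an Organization standardization definition are available at your site, and that the setup location below is replaced with your site-specific value.

%dqload(dqlocale=(ENUSA), dqsetuploc='your-site-setup-location');  /* placeholder path */

data work.standardized;
   length org std_org $ 60;
   input org $char60.;
   /* dqStandardize(value, 'standardization-definition', 'locale') */
   std_org = dqStandardize(org, 'Organization', 'ENUSA');
   datalines;
The SAS Institute Inc
DATAFLUX CORP
;
run;

If the definition's schemes contain mappings like those shown above, the two rows standardize to SAS and DATAFLUX, respectively.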


Using the Scheme Builder

This demonstration illustrates how to use the Scheme Builder to explore and modify a scheme.
1. If necessary, open Data Management Studio.
a. Select Start All Programs DataFlux Data Management Studio 2.7.
b. Click Cancel in the Log On window.
2. If necessary, open the English (United States) locale from the CI 2.7 - ENUSA Only QKB.
a. Select the Administration riser bar.
b. Expand Quality Knowledge Bases.
c. Expand Global.
d. Expand QKB CI 27 - ENUSA Only.
e. Expand English.
f. Select English (United States).

g. Select (Open QKB).


3. Open the GB Email Service Provider Standards standardization scheme.
a. Click the Schemes tab.
b. Scroll down to the GB Email Service Provider Standards scheme.


c. Right-click the scheme and select Open.


The Scheme Builder window appears. The GB Email Service Provider Standards scheme is
loaded in the right pane on the Scheme side.

d. Scroll and identify the values that are standardized as FACEBOOK.

Note: Each line in the scheme represents a potential piece of expected data
and the replacement text (standard).
Note: You can add, delete, or edit a piece of expected data on the Edit menu.


4. Scroll and identify the values that are standardized as GMAIL.

5. Click Options in the lower right corner.


6. Verify that this scheme is an Element scheme.

Note: The analysis of an individual field can be performed on the value as a whole (phrase)
or on each word (element).


7. Click Cancel to close the Scheme Options window.


8. Select File  Exit to close the Scheme Builder window.
9. Explore the usage of the GB Email Service Provider Standards scheme.
a. In Data Management Studio, right-click GB Email Service Provider Standards
and select Usage.

Note: The GB Email Service Provider Standards standardization scheme is used in the
E-mail standardization definition specifically for processing the Sub-Domain token.
b. Click Close to close the Usage window.


12.01 Activity
1. Open DataFlux Data Management Studio.
2. Open the QKB CI 2.7 - ENUSA Only QKB.
3. Click the Schemes tab.
4. Open the GB Email Top-Level Domain Standards scheme.
5. Answer the following questions:
• Which values are standardized as ORG?
• Is this an Element scheme or a Phrase scheme?
• Which definitions use this scheme?



Chop Table Overview


Character Classification


A chop table is a collection of character-level rules that are used to create an ordered word list from
an input string. A chop table contains a line for every single character in the selected character set.
Chopping is the first step in
• performing element analysis when building an element standardization scheme
• chopping a string into a list of words for a parse definition.
Each character in a chop table receives a classification based on the intended use. A character can
be classified as one of these:
• LETTER/SYMBOL – a letter or a non-separating symbol
• NUMBER – a numeric digit
• FULL SEPARATOR – a delimiting character that separates the string before it from the string
after it
• LEAD SEPARATOR – a separator that attaches to the beginning of a string (for example, an
opening parenthesis)
• TRAIL SEPARATOR – a separator at the end of a string (for example, a period after a name
salutation such as Mr.).


Chop Table Overview


Character Operation


Each character in a chop table also receives an operation, indicating whether the character should
be included in the output string, and if so, how that character should be treated in the output string.
For example, the open parenthesis might not be a relevant character in a person’s name, but is often
used to delineate a portion of a phone number. We can choose to remove the character from a name
string, but not a phone number string. These are the valid arguments for a character’s operation:
• USE – keeps the character in the string.
• TRIM – temporarily removes the character from the string, but keeps it in output tokens.
• SUPPRESS – removes the character from the string and the output tokens.
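
The QKB applies these rules internally, but a small simulation makes the classifications and operations concrete. The DATA step below is an illustrative sketch only (it is not how the QKB engine is implemented): it chops a sample string, treating the comma as a FULL SEPARATOR with USE and the space as a FULL SEPARATOR with TRIM.

data work.chopped;
   length word $ 40;
   str = 'SAS Institute, Inc.';
   word = '';
   do i = 1 to length(str);
      c = substr(str, i, 1);
      if c = ',' then do;           /* FULL SEPARATOR + USE: the comma becomes its own word */
         if word ne '' then output;
         word = c;
         output;
         word = '';
      end;
      else if c = ' ' then do;      /* FULL SEPARATOR + TRIM: split here, drop the space */
         if word ne '' then output;
         word = '';
      end;
      else word = cats(word, c);    /* LETTER/SYMBOL (or TRAIL SEPARATOR) + USE: keep it */
   end;
   if word ne '' then output;       /* emit the final word: Inc. */
   keep word;
run;

The result is a four-word list: SAS, Institute, a comma on its own line, and Inc., matching the chopped output described for the GB Organization chop table below.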


Working with the Chop Table Editor

Test data value

Result after
chopping


The Chop Table Editor also provides you with a test panel where you can test data values against
the chop table. This ensures you are getting the desired results from the chopping step.
For example, the input string SAS Institute, Inc. is chopped using the GB Organization chop table.
The result of the chop is a list of four words from the string. The comma is on a line by itself because
it is treated as a word on its own in the generated word list. The full stop (period) is part of the Inc.
word because it is classified as a TRAIL SEPARATOR with an operation of USE.


Using the Chop Table Editor

This demonstration illustrates how to use the Chop Table Editor to explore, test, and modify a chop
table.
1. If necessary, invoke Data Management Studio.
a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.
b. Click Cancel in the Log On window.
2. If necessary, open the QKB CI 27 - ENUSA Only QKB in Data Management Studio.
a. Select the Administration riser bar.
b. Navigate to the English (United States) locale in QKB CI 27 - ENUSA Only.
c. Select the English (United States) locale and open it.
3. Explore the GB Email chop table.
a. In the right panel, click the Chop Tables tab.


4. Right-click the GB Email chop table and select Open.


The Chop Table Editor window appears, and the GB Email chop table is loaded.

Note: A chop table is stored in a proprietary format. You should not attempt to create
or edit it outside of the provided editor.
5. Scroll to locate the FULL STOP character (value 46).

Note: Most of the characters are classified as either LETTER/SYMBOL or NUMBER
and are assigned an operation of USE.
Note: The FULL STOP character (value 46) is classified as a FULL SEPARATOR
with an operation of USE.


6. Scroll to locate the COMMERCIAL AT character (value 64).

Note: The COMMERCIAL AT character (value 64) is classified as a FULL SEPARATOR with an
operation of USE.
Note: These characters separate email address values into the individual words that comprise
the email address.
7. Scroll up to locate the SPACE character (value 32).

Note: Because spaces are not valid as part of email address es, the operation for the SPACE
character (value 32) is TRIM, although it is still classified as a FULL SEPARATOR.


8. Test the chop table with sample email addresses.


a. At the bottom of the Chop Table Editor window, enter John.Doe@sas.com
in the Input string field.
b. Click Go.
The Result area is populated with the chopped string.

c. Enter John. Doe@sas.com in the Input string field. (Notice the space after the first period.)
d. Click Go.
The Result area is populated with the same chopped string because the SPACE character
is associated with an operation of TRIM in the chop table. Therefore, it is not included
in the output.


9. Save a new version of the GB Email Chop Table.


a. Select File  Save As in the Chop Table Editor.
b. Enter My GB Email in the Name field.

Note: When you modify any of the QKB component files, it is a best practice to create a
copy of the existing file instead of overwriting it. Here are some reasons:
• Many definitions might reference the original file.
• You might want to revert to the original file at some point in the future.
• Modifications to existing QKB components are more difficult to track when you
upgrade to a new release of the QKB. The QKB merge utility (for merging the old
QKB with the new one) could miss the fact that you have made modifications,
resulting in the loss of your work.
c. Click Save.
The title bar of the Chop Table Editor window displays the new chop table name.


10. Change the uses of some of the characters in the new chop table.
a. Locate the SPACE character (value 32).
b. Click (the down arrow) in the Operation column for the SPACE character and select
USE.
c. Locate the FULL STOP character (value 46).
d. Click (the down arrow) in the Classification column for the FULL STOP character and
select TRAIL SEPARATOR.

e. Select File  Save.


11. Use a test data value with the new chop table.
a. Enter John. Doe@sas.com in the Input string field. (Notice the space after the first
period.)
b. Click Go. The Result area is populated with the updated result string.

Note: The FULL STOP character, as a TRAILING SEPARATOR, now becomes a part
of the word that it follows. The SPACE character, as a FULL SEPARATOR,
is now a distinct word in the word list and is included in the output because
it has an operation of USE.
Note: These are not practical alterations but are shown here for illustrative purposes.
12. Close the Chop Table Editor and reload the QKB.
a. Select File  Exit to close the Chop Table Editor window.
b. If you are prompted, click Yes in the Reload QKB window to reload the QKB.


c. Scroll down on the Chop Tables tab to locate the new chop table, My GB Email.

13. Explore the usage of the GB Email chop table.


a. In Data Management Studio, right-click GB Email and select Usage.

The GB Email chop table is used in three different parse definitions.


b. Click Close to close the Usage window.


12.02 Activity
1. In Data Management Studio, click the Chop Tables tab.
2. Open the GB Website chop table.
3. Answer the following questions:
• What is the classification for the FULL STOP character (value 46)?
What is the operation?
• What is the classification for the Solidus character (value 47)? What is
the operation?
• What is the chopped string for the input string?
support.sas.com/documentation


Phonetics Library Overview


Replacement text
Rule text


A phonetics library is a collection of rules that perform “sound-alike” analysis on a data value. The
image above shows the EN General Phonetics library. The left column shows the rule text, or the
pattern to be matched. The right column is the replacement text for the matched rule.
In the QKB, the phonetics library is used exclusively to generate match codes. During match code
generation, phonetic rules are applied to reduce an input string. The goal is to create phonetic rules
that produce the same output string for input strings with similar pronunciations or spellings (for
example, Night and Knight are reduced to NIT by applying phonetics rules).
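
Because phonetics libraries feed match-code generation, their effect is easiest to observe through match codes. The following is a minimal sketch, assuming ENUSA is loaded via %DQLOAD (as in the earlier sketch) and that a Name match definition is available; definition names can vary by QKB release.

/* dqMatch(value, 'match-definition', sensitivity, 'locale') */
data work.matchcodes;
   length name $ 20 mc $ 40;
   input name $char20.;
   mc = dqMatch(name, 'Name', 85, 'ENUSA');
   datalines;
John MacKnight
Jon Macknight
;
run;

Both rows should receive the same match code, because the phonetic rules reduce the two spellings to the same string.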


Working with the Phonetics Editor

“GHT” sounds like “T”

“CK” sounds like “K”

Test Value and Result


The Phonetics Editor allows you to create and manage phonetics rules to be used in reducing data.
In addition to managing the phonetic rules, you can test the rules and view the results inside the
editor.
In the example above, the EN General Phonetics library is used to test the name string “JOHN
MACKNIGHT”. Using the rules that “GHT” sounds like “T”, “CK” sounds like “K”, and an “H” is silent,
the phonetically reduced string is “JON MAKNIT”. These phonetically reduced strings are used in a
match definition to identify records for the same people when there are slight differences in how their
names have been entered.


Using the Phonetics Editor

This demonstration illustrates how to use the Phonetics Editor to explore and test a phonetics library.
1. If necessary, invoke Data Management Studio.
a. Select Start  All Programs  DataFlux  Data Management Studio 2.7.
b. Click Cancel in the Log On window.
2. If necessary, open the QKB CI 27 - ENUSA Only QKB in Data Management Studio.
a. Select the Administration riser bar.
b. Navigate to the English (United States) locale in QKB CI 27 - ENUSA Only.
c. Select the English (United States) locale and open it.
3. Explore the EN General Phonetics library.
a. Click the Phonetics Libraries tab.


b. Right-click the EN General Phonetics library and select Open.


The Phonetics Editor window appears and EN General Phonetics is loaded.

Each line in the editor represents a sequence of text and the phonetic equivalent. Each
phonetic rule is applied in order. You can control the order in which the rules apply to the text
string by adjusting the priority for each of the rules. Rule order can also be controlled by
dragging a rule upward or downward in the list.
Note: You can use the Edit menu to add, delete, or edit a rule.
Rule text can consist of literal characters and a small set of meta-characters. The meta-
characters used in phonetics libraries are outlined in the table below.

Meta-Character    Usage

. (dot)           Matches any single character in the input string.
                  Example: CA. would match the following: CAb, CAT, Caw.

[ and ]           Matches any one of the specified characters.
                  Example: CA[TB] would match the following: CAT, CAB.

^                 Indicates that the pattern must be found at the beginning of a word.
                  Example: ^SAND would match SANDWICH but not QUICKSAND.

$                 Indicates that the pattern must be found at the end of a word.
                  Example: SAND$ would match STREISAND but not SANDWICH.

/                 Searches an entire pattern but replaces only the characters before the
                  slash. Example: SCH/OOL with replacement of SK matches the word
                  SCHOOL and produces an output string of SKOOL.

\<meta-character> Escape character that indicates to use a meta-character as a literal.
                  Example: \$ looks for $, and \\ looks for \.


Note: For more details about syntax for phonetic rules, explore the following choices:
• Select Help  Help Topics in the Phonetics Library Editor.
• In the Related Topics section, click the link for Phonetics Editor - Components
of a Rule.
4. Test the phonetics library with test data values.
a. Under Test Area, enter KNIGHT in the Input string field.
b. Click Go.
The Result field displays NIT.

Replacement #1

Replacement #2

Note: The first substitution is that the GHT string at the end of the string is replaced
with a T, because GH is silent.
Note: The second replacement is the KN string at the beginning of the word is replaced
with N, because K is silent.
Note: Phonetics library processing is not case sensitive. Input strings are converted
to uppercase before phonetic rules are applied and the results are displayed
in uppercase.


c. Under Test Area, enter MCWHEELAN in the Input string field.


d. Click Go.

Replacement #2

Replacement #1

Note: The Result field displays MAKWEELAN. The phonetics library matched two patterns:
• The first part of the string (MAK) is the result of the pattern MC at the beginning
of a word. It is replaced with MAK.
• The second part of the string (WEELAN) is the result of the pattern WH at the
beginning of the word. It is replaced with W, although this value is not at the
beginning of the word.
Note: The Reset the beginning of word option on the first replacement string (^MC) tells the
phonetics algorithm to reset the flag for the beginning of the word, which enables
(^WH) to be matched and replaced, although it occurs before the first matched pattern
(^MC) in the library.


e. Under Test Area, enter MCKNIGHT in the Input string field.


f. Click Go.

Replacement #1

Replacement #2

Note: The Result field displays MKNIT. Two patterns from the phonetics library
are matched.
• The first pattern matched is (GHT), which is replaced with T at the end
of the string. This results in a value of MCKNIT.
• The second pattern matched is (CK), which is replaced with K immediately
following the M. This results in a value of MKNIT.


g. Scroll down to the rule text for ^MC.

This replacement is not made.

The substitution of MC to MAK is ignored, because CK was already changed to K.


This results in the string MK at the beginning of the word.
Note: You can change this behavior by changing the order in which the rules are applied.
You could accomplish this by changing the priority for the rule, or by changing the
rule’s position in the table.
5. Select File  Exit to close the Phonetics Editor window.


6. Explore the usage of the EN General Phonetics library.


a. In Data Management Studio, right-click EN General Phonetics and select Usage.

Note: The EN General Phonetics library is used in several match definitions.


b. Click Close to close the Usage window.


12.03 Activity
1. In Data Management Studio, click the Phonetics Libraries tab.
2. Open the EN Name Phonetics library.
3. Answer the following question:
• What does your full name phonetically reduce to?



Working with the Regex Library Editor

Pattern in the data

Substitution for matched pattern

A regular expression (typically shortened to regex) attempts to match a pattern in a subject string
from left to right. Most characters represent themselves in a pattern and match the corresponding
characters in the subject. The Regex Library Editor is used to build and test regular expression
libraries.
Regular expressions have the following characteristics:
• are organized into libraries that can be used for parsing, standardization, and matching
• are primarily intended for character-level cleansing and transformations (Standardization
definitions should be used for word- and phrase-level cleansing.)
• must conform to Perl regular expression syntax
Note: When regular expressions are used against the data, every regular expression in the library
is executed against every data value.
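
Because QKB regular expressions must conform to Perl syntax, a pattern can be prototyped with the Perl-compatible PRX functions in Base SAS before it is added to a regex library. The sketch below is an analogy only (it runs outside the QKB) and reproduces the punctuation-removal pattern used in the next demonstration.

data _null_;
   s = 'Rudolph "Rudy" Smith, Ph.D.';
   /* s/[.,;"]// with a times argument of -1 replaces every match */
   cleaned = prxchange('s/[.,;"]//', -1, s);
   put cleaned=;   /* cleaned=Rudolph Rudy Smith PhD */
run;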


Using the Regex Library Editor

This demonstration illustrates how to use the Regex Library Editor to explore and test a regex library.
1. If necessary, invoke Data Management Studio.
a. Select Start All Programs DataFlux Data Management Studio 2.7.
b. Click Cancel in the Log On window.
2. If necessary, open the QKB CI 27 - ENUSA Only QKB in Data Management Studio.
a. Select the Administration riser bar.
b. Navigate to the English (United States) locale in QKB CI 27 - ENUSA Only.
c. Select the English (United States) locale and open it.
3. Explore the GB Lightweight Punctuation Removal regex library.
a. Click the Regex Libraries tab.


b. Right-click the GB Lightweight Punctuation Removal regex library and select Open.
The Regex Library Editor window appears. GB Lightweight Punctuation Removal is loaded.

Note: This regex library is designed to remove periods, commas, semicolons, and double
quotation marks from a text string. It also removes the # (number sign) character,
unless it is followed by a number.
Note: Each regular expression in the list is executed in order, from top to bottom.
c. Double-click the first regular expression [.,;"] (that is, open bracket, period, comma,
semicolon, double quotation mark, close bracket).

Note: This window can be used to edit the expression, the substitution, or to add a note.
Note: It is important to remember that regular expressions are applied to the data
sequentially; a single data value could be changed by more than one expression.
One expression could change a data value, causing the value to no longer match a
pattern in a subsequent expression.
d. Click Cancel to close the Edit Expression window.

Copyright © 2020, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.
12.1 Working with Q KB Component Files 12-45

4. Test the regex library with test data values.


a. Under Test Area, enter A.A.A. in the Input string field.
b. Click Go. The Result field displays AAA.

Note: The periods are removed by the first expression in the regex library.
c. Under Test Area, enter Rudolph "Rudy" Smith, Ph.D. in the Input string field.
d. Click Go.
The Result field displays Rudolph Rudy Smith PhD.

Note: The double quotation marks, the comma, and the period are removed by the first
expression in the regex library.
Note: For more details about regular expression syntax, see the DataFlux Data
Management Studio 2.7: User Guide.
e. Select File  Exit to close the Regex Library Editor.


5. Explore the usage of the GB Lightweight Punctuation Removal Regex library.


a. In Data Management Studio, right-click GB Lightweight Punctuation Removal and select
Usage.

The GB Lightweight Punctuation Removal regex library is used by several standardization


definitions.
b. Click Close to close the Usage window.


12.04 Activity
1. In Data Management Studio, click the Regex Libraries tab.
2. Open the GB Period Removal regex library.
3. Answer the following questions:
• What is the result for the input string U.S.A.?
• Which types of definitions in the QKB use this regex library?



Vocabulary Overview

Likelihoods associated
with each category

List of words

Categories assigned to
the selected word

A vocabulary is a collection of words, their associated categories, and a likelihood for each category.
The Vocabulary Editor is used to build and maintain vocabularies. Words can be manually entered
into the vocabulary or imported from a file.
Note: Each word in a vocabulary is required to have at least one category (and likelihood)
associated with it.
Vocabularies are used for the following:
• in parsing, to categorize individual words in the text string
• in the matching process, to identify noise words that are omitted from match code generation
• in gender analysis, to determine the gender of an individual
• in identification analysis, to determine the possible identity of words
The example above shows the GB Email vocabulary. The word “COM” is associated with two
categories:
• COM (com) - a Medium likelihood is assigned to the COM category.
• DMAIN (Domain) - a High likelihood is assigned to the DMAIN category.
Note: You can see the text description for a category by hovering over the value in the Categories
pane.
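
Conceptually, a vocabulary is a many-to-many mapping from words to category/likelihood pairs. The DATA step below is an illustrative sketch only, mirroring the GB Email entries discussed in this section; the actual vocabulary file is stored in a proprietary format and must be maintained in the Vocabulary Editor.

data work.vocabulary_sketch;
   length word $ 10 category $ 10 likelihood $ 10;
   input word $ category $ likelihood $;
   datalines;
COM COM Medium
COM DMAIN High
AT @ Low
AT CCDMAIN Medium
;
run;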


Using the Vocabulary Editor

This demonstration illustrates how to use the Vocabulary Editor to explore and modify a vocabulary.
1. If necessary, invoke Data Management Studio.
a. Select Start All Programs DataFlux Data Management Studio 2.7.
b. Click Cancel in the Log On window.
2. If necessary, open the QKB CI 27 - ENUSA Only QKB in Data Management Studio.
a. Select the Administration riser bar.
b. Navigate to the English (United States) locale in QKB CI 27 - ENUSA Only.
c. Select the English (United States) locale and open it.
3. Explore the GB Email vocabulary.
a. Click the Vocabularies tab.


b. Right-click the GB Email vocabulary and select Open.


The Vocabulary Editor window appears, and GB Email is loaded.

Note: A vocabulary is stored in a proprietary format that you should not attempt to create
or edit directly. Vocabulary files should be accessed through the Vocabulary Editor
to avoid the danger of corrupting the file.
4. Create and modify a new version of the GB Email Vocabulary.
a. Select File  Save As in the Vocabulary Editor.
b. Enter My GB Email in the Name field.
c. Click Save.
The title bar of the Vocabulary Editor window displays the new vocabulary name.


5. Investigate the categories and likelihoods for the word COM.


a. Select Search  Find.
b. In the Find what field, enter COM.

c. Click OK.
The word COM is highlighted. The Word properties area on the right displays the properties
of the selected word.

The word COM is associated with the following categories:


• COM with a likelihood of Medium
• DMAIN (Domain) with a likelihood of High
d. In the Word properties area, click the DMAIN category.
e. Click Edit.


f. Click (the down arrow) in the Likelihood field and select Very High.

g. Click OK.
There is now a very high likelihood that the word COM belongs to the DMAIN category.

6. Investigate the categories and likelihoods for the word AT.


a. Select Search  Find.
b. In the Find what field, enter AT.
c. Click OK.


The word AT is highlighted. The Word properties area on the right displays the properties
of the selected word.

Note: The word AT is categorized as the following:


• @ with a likelihood of Low
• CCDMAIN (Country Code Domain) with a likelihood of Medium
d. In the Word properties area, click Add to add a new category.

e. Click (the down arrow) in the Category field and select WORD (Word).
f. Verify that Medium is the value in the Likelihood field.

g. Click OK.


This adds a new category and likelihood pairing for an existing word.

7. Add a new word to the vocabulary.


a. Select Edit  Add Word.
b. Enter SAS in the Word field.

c. Click OK.
d. Verify that SAS is selected in the Word list.


e. In the Word properties area, click Add to add a new category.

1) Click (the down arrow) in the Category field and select DMAIN (Domain).

2) Click (the down arrow) in the Likelihood field and select High.
3) Click OK. This adds a new word to the vocabulary with a category and likelihood pairing.

Note: Every word in a vocabulary must have at least one assigned category.
If not, the vocabulary cannot be saved.
8. Select File  Save to save the My GB Email vocabulary.
9. Select File  Exit to close the Vocabulary Editor window.
10. When you are prompted to reload the QKB, click Yes.
Note: Because the QKB was loaded into memory when Data Management Studio was
instantiated, it is necessary to reload the QKB in order to use any changes that
you made to the QKB files and definitions.


11. Scroll down on the Vocabularies tab to locate the new vocabulary, My GB Email.


12.05 Activity
1. In Data Management Studio, click the Vocabularies tab.
2. Open the GB Website vocabulary.
3. Answer the following questions:
• What are the defined categories and likelihoods for the word HTTP?
• What are the defined categories and likelihoods for the word WWW?



Grammar Overview

Basic Categories

Derived Categories

Derived rules for valid e-mail addresses

After the morph analysis in the Parse Definition is used to identify one or more basic categories for
each word, the patterns of assigned categories can be identified. A grammar is a set of rules that
represent expected patterns of words in a given context.
The Grammar Editor is used to build and maintain basic and derived categories and build derived
rules from the categories. To improve the readability of grammar rules, all categories in a grammar
are represented using abbreviations.
Note: A grammar consists of two category types - basic and derived.
Note: Basic categories in a grammar correspond to categories associated with words during the
morph analysis.
Note: Basic categories defined in the Grammar are the categories that get imported into the
Vocabulary to assign to words.
The example above shows the basic category abbreviations used by the GB Email Validation
grammar. The derived category VALID is expanded to show the two patterns defined in the grammar
for a valid email address.
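
Grammars and vocabularies do their work inside parse definitions, which can also be exercised from Base SAS. The following is a minimal sketch, assuming ENUSA is loaded via %DQLOAD; the token name Mailbox is an assumption based on the E-mail parse definition discussed in this lesson, and token names can vary by QKB release.

data _null_;
   length parsed mbox $ 80;
   /* dqParse(value, 'parse-definition', 'locale') returns a delimited string */
   parsed = dqParse('John.Doe@sas.com', 'E-mail', 'ENUSA');
   /* retrieve a single token from the parsed value */
   mbox = dqParseTokenGet(parsed, 'Mailbox', 'E-mail', 'ENUSA');
   put mbox=;
run;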


Working with Derived Category Rules

Parent category

Priority

Derived category rule

Rules for the derived categories represent an ordered list of categories (basic and derived) that
identify patterns of words in a string. Each rule is associated with a parent (derived) category, and
has a priority associated with it.
In this example, a derived category rule for a valid email address is highlighted. You can see the
derived category rule highlighted in blue. The parent category for the derived category is VALID, and
the rule’s priority is set to Medium.
Note: Derived category rules can use both basic and derived categories.
Note: Each rule is associated with a priority, indicating the strength of the pattern matched.
Note: In the ideal situation, every possible pattern of categories will be identified in a grammar rule,
with no duplicates. However, this is not realistic, so you should strive to fully identify all of the
common patterns, and some of the less common patterns.
Note: Ambiguities in matched patterns are resolved through a scoring algorithm. This scoring takes
into account the likelihoods assigned to the words (in the Vocabulary) and the priorities of the
rules matched (in the grammar).


Example: EN Name Grammar

In this example, the text string can be represented by a rule that consists
of two basic categories.

Text String: Bob Brauer

Name Rule: [Name] > [Given Name Word] [Family Name Word]

Here [Name] is the derived category, and [Given Name Word] and [Family Name Word] are basic categories.

In the example above, the name “Bob Brauer” is being compared against the grammar rules for a
person’s name. When reading this rule, you say that the category on the left side of the rule is
derived from the categories on the right side of the rule. The derived rule identifies the possibility that
a valid name string can consist of a given name word followed by a family name word.


Using Recursive Rules in Derived Categories

Text String: Bob Brauer, Attorney at Law
             Bob Brauer, Honorable Justice of the Peace

Name Derived Category Rule: N > GNW FNW COMMA NA

Name Appendage Derived Category Rules: NA > NAW NAW NAW
                                       NA > NAW NAW NAW NAW NAW

A recursive rule can be defined to handle any number of consecutive NAWs:
NA > NAW
NA > NAW NA

Note: NAW is an abbreviated form of name appendage word.

For some derived categories, you might want to allow a variable number of words of the same basic
category within a text string. In some cases, you will not know the exact number of words that might
appear in the string. The most efficient way of allowing for this situation is to use a recursive rule.
Recursive rules enable you to define the recursive word once in a derived category and account for
any number of occurrences of the word. A recursive rule consists of a basic category followed by a
derived category, which is the root category for the recursive rule being built.
Recursive rules achieve the following:
• enable matching derived categories of variable length
• avoid having multiple rules of variable length
• eliminate the need to guess at maximum word counts
As shown in the example above, there can be any number of name appendage words associated
with a person’s name. In the first example text string, there are three name appendage words
(categorized as NAW). In the second example text string, there are five name appendage words
(categorized as NAW).
The recursive rule for this situation is NAW (Name Appendage Word) followed by NA (Name
Appendage derived category), which could just be another NAW (Name Appendage Word). The rule
keeps looping through the words until it reaches a word that does not meet the rule for a Name
Appendage.
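
In effect, the recursive pair NA > NAW and NA > NAW NA matches one or more consecutive NAW tokens, much as NAW+ would in regular-expression notation. The DATA step below is an illustrative sketch only (the QKB does not evaluate grammars this way): it counts how many trailing NAW category tokens the recursive NA rule would consume for the first example string.

data _null_;
   cats = 'GNW FNW COMMA NAW NAW NAW';   /* categories assigned to each word */
   n = 0;
   i = countw(cats, ' ');
   do while (i > 0 and scan(cats, i, ' ') = 'NAW');
      n + 1;         /* consume one NAW and recurse on the remainder */
      i = i - 1;
   end;
   put 'NAW tokens matched by the recursive NA rule: ' n;
run;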


Using the Grammar Editor

This demonstration illustrates how to use the Grammar Editor to explore and modify a grammar.
1. If necessary, invoke Data Management Studio.
a. Select Start All Programs DataFlux Data Management Studio 2.7.
b. Click Cancel in the Log On window.
2. If necessary, open the QKB CI 27 - ENUSA Only QKB in Data Management Studio.
a. Select the Administration riser bar.
b. Navigate to the English (United States) locale in QKB CI 27 - ENUSA Only.
c. Select the English (United States) locale and open it.
3. Explore the GB Email grammar.
a. Click the Grammars tab.

b. Explore the Basic Categories for the grammar.


1) Right-click the GB Email grammar and select Open.
The Grammar Editor window appears, and GB Email is loaded.

Note: A grammar is stored in a proprietary format that you should not attempt to create
or edit directly. Grammar files should be accessed through the Grammar Editor
only.


2) Expand Basic Categories.

Note: These basic categories from the grammar are also assigned to words
in the vocabulary.
3) Click the FNW (Family Name Word) basic category.
The Category properties are displayed on the right.


c. Explore the derived categories for the EMAIL_P derived category.


1) Expand Derived Categories.
Other basic or derived categories are combined into an ordered sequence
that can be used to derive an email address.

2) Click (the plus sign) to expand the EMAIL_P derived category.

Note: An email address such as John.Doe@sas.com is matched by the following rule:


MBOX @S MSERVER DOT DOMAIN, where the following is the analysis:
MBOX → John.Doe (where MBOX → Name → FNP SEP GNP)
@S → @
MSERVER → sas
DOT → .
DOMAIN → com


4. Create and modify a new version of the GB Email grammar.


a. Select File  Save As in the Grammar Editor.
b. Enter My GB Email in the Name field.
c. Click Save.
The title bar of the Grammar Editor window displays the new grammar name.

d. Select Edit  Add Category.


e. Enter DOMAIN_NEW in the Abbreviation field.
f. Enter New Domain in the Name field.
g. Select the Derived radio button for type.

h. Click OK. When a derived category is added, the patterns of basic and derived categories
that identify the category need to be defined.


5. Add a new rule to the new DOMAIN_NEW category.


a. If necessary, click the DOMAIN_NEW derived category.
b. Select Edit  Add Rule. The Rule properties area on the right side of the Grammar Editor
window is displayed.

c. Click Add.

d. Click (the down arrow) in the Category field and select DMAIN.

e. Click OK.
f. Click Add.

g. Click (the down arrow) in the Category field and select MSERVER.

h. Click OK.


The derived category DOMAIN_NEW contains the grammar rule [DMAIN MSERVER] and
has a priority of Medium.

Note: Multiple grammar solutions might be returned by a grammar analysis. The priority for each grammar rule is used as part of a scoring process that determines the best solution. (This is discussed in more detail later.)
Note: Creating a comprehensive and robust grammar typically involves multiple iterations
and thorough testing.
6. Save the new grammar to the QKB.
a. Select File  Save to save the My GB Email grammar.
b. Select File  Exit to close the Grammar Editor window.
c. If necessary, click Yes in the Reload QKB window to reload the QKB.


d. Scroll down on the Grammars tab to locate the new grammar, My GB Email.

7. Explore the usage of the GB Email grammar.


a. In Data Management Studio, right-click GB Email and select Usage.

The GB Email grammar is used in three different parse definitions.


b. Click Close to close the Usage window.


12.06 Activity
1. In Data Management Studio, click the Grammars tab.
2. Open the GB Website grammar.
3. Answer the following questions:
• How many basic categories exist in the grammar?
• How many rules are defined for the URL derived category?



12.2 Working with QKB Definitions

QKB Definition Types (Review)


You know that QKB definitions allow you to do a variety of data curation tasks.


In the context of the QKB, a definition is a collection of metadata that defines an algorithm that can perform a data-cleansing operation. A definition type corresponds to a type of data-cleansing operation. For example, a match definition contains metadata used for creating a match code, and a parse definition contains metadata used for parsing a data string into its individual tokens. Each definition is associated with a data type (that is, the "Name" parse definition belongs to the "Name" data type).
The QKB has definitions that allow you to do a variety of data management, data quality, and entity resolution tasks.
The types of definitions that are available in the QKB include:
• Case
• Extraction
• Gender Analysis
• Identification Analysis
• Language Guess
• Locale Guess
• Match
• Parse
• Pattern Analysis
• Standardization


Data Types and Definition Types


It is important to note that a data type does not necessarily have a definition of each definition type. In the example above, you can see that the Name data type has five definition types associated with it – Gender Analysis, Match, Parse, Standardization, and Case definitions. If you look at the Postal Code data type, however, it has only three definition types associated with it – Match, Parse, and Standardization definitions. Likewise, the Address data type has only three definition types. The E-mail data type has four types of definitions associated with it.


Case Definition Example


Input Data: SAS INSTITUTE

Casing                  Data after Casing
Upper                   SAS INSTITUTE
Lower                   sas institute
Proper                  Sas Institute
Proper (Organization)   SAS Institute


The purpose of a case definition is to ensure the appropriate casing of data as it is processed by the definition. Case definitions are algorithms that convert a text string to uppercase, lowercase, or proper case.
The definition uses a “base” casing algorithm, and then augments that with the known casing of certain words (for example, SAS or DataFlux) and patterns within words (for example, in any word that begins with Mc, uppercase the next letter).
Note: For best results when applying proper casing, select a definition that is associated with the specific data type.
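As a preview of the coding interface covered in a later lesson, the following DATA step is a minimal sketch of applying a case definition with the DQCASE function from SAS Data Quality Server. The table work.orgs and the column company are hypothetical examples, and the sketch assumes that the ENUSA locale has already been loaded into memory with %Dqload.

/* Apply the Proper (Organization) case definition to a company name. */
/* The input table and column are hypothetical examples.              */
data work.orgs_cased;
   set work.orgs;
   length company_cased $50;
   company_cased = dqcase(company, 'Proper (Organization)', 'ENUSA');
run;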


Extraction Definition Example


Input Data String                               Organization   Address          Phone
John Doe, SAS Cary, NC 27513, 919-677-8000      SAS            Cary, NC 27513   919-677-8000
2149773916 SAS Dallas, TX (near DFW)            SAS            Dallas, TX       2149773916
The Forum Center, Irvine, CA (949)517-9300 SAS  SAS            Irvine, CA       (949)517-9300


Extraction definitions extract portions of a string into relevant tokens. For example, the Contact Info extraction definition is used in the table above to extract portions of the Input Data String into tokens (Organization, Address, and Phone).
Notice that the order of the tokens in the data does not matter. This is because the definition uses vocabularies, regex libraries, and grammars to analyze patterns in the Input Data String and map the data values to the appropriate tokens.
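A later lesson demonstrates the SAS Data Quality Server extraction functions. As an early sketch, the step below assumes those functions (DQEXTRACT and DQEXTTOKENGET), a loaded ENUSA locale, and a hypothetical table work.raw_contacts with a contact_info column; the Contact Info definition and the token names come from the example above.

/* Extract tokens from a free-form contact string.                    */
/* Table, column, and variable names are hypothetical examples.       */
data work.contact_tokens;
   set work.raw_contacts;
   length extracted $200 org $50 phone $20;
   extracted = dqextract(contact_info, 'Contact Info', 'ENUSA');
   org   = dqexttokenget(extracted, 'Organization', 'Contact Info', 'ENUSA');
   phone = dqexttokenget(extracted, 'Phone', 'Contact Info', 'ENUSA');
run;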


Gender Analysis Definition Example


Name Gender
Chris J. Smith U
Christopher J. Smith M
C.J. Smith, Jr. M
Chris Jane Smith F
Mrs. Chris J. Smith F


Gender Analysis definitions determine the gender of a person, typically based on the person’s name. The analysis returns the value male, female, or unknown. This type of data can be very useful for marketing campaigns, checking patient data, ensuring proper salutations for mailings, or analyzing the gender make-up of a major field of study in an academic setting.
Note: Typically, gender analysis is performed on individual name data, but it could be used on ID
codes where a portion of the code represents the gender.
The result of gender analysis is typically a code that indicates whether the input value is of one
gender or the other (M for male and F for female). If gender values cannot be determined (due to
incomplete or conflicting information), U (for unknown) is returned.
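The following DATA step is a minimal sketch of gender analysis in code using the DQGENDER function from SAS Data Quality Server (covered in a later lesson). The table work.customer_names, the column name, and the definition name 'Gender' are assumptions for illustration; the sketch also assumes that the ENUSA locale is loaded into memory.

/* Return M, F, or U for each name value.                             */
/* Table, column, and definition name are assumed for illustration.   */
data work.names_gender;
   set work.customer_names;
   length gender $1;
   gender = dqgender(name, 'Gender', 'ENUSA');
run;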


Identification Analysis Definition Example


Input Data Value Results after Applying Identification Analysis
Customer Customer Customer_Identity_Type
SAS SAS ORGANIZATION
John Q Smith John Q Smith NAME
SAS Institute, Inc. SAS Institute, Inc. ORGANIZATION
Nancy Jones Nancy Jones NAME


An Identification Analysis Definition specifies data and logic that can be used to identify the semantic type of a data string. For example, an identification analysis definition might be used to determine whether a certain string represents the name of an individual or an organization.
Consider a field that has mixed corporate and individual customers. Applying the Field Content identification analysis definition to the Customer data produces a result set that flags every record with the type of data that is discovered or identified.


Match Definition Example


Name Match Code @ 85 Sensitivity
John Q Smith 4B&~2$$$$$$$$$$C@P$$$$$$$$$
Mr. Johnny Smith 4B&~2$$$$$$$$$$C@P$$$$$$$$$
Smith, John 4B&~2$$$$$$$$$$C@P$$$$$$$$$


Match codes are generated, encoded text strings that represent a data value. Match codes can be
compared across records to identify potentially duplicate data that might be obvious to the human
eye, but not necessarily obvious to a computer program. In the example above, you can see that the
three strings representing John Smith likely represent the same person. To the computer, however,
these look like three totally different individuals.
Match codes can be used to group similar data values together. In the example above, sorted by the
match code values, you can see that these records might potentially match, not because the value of
Name is the same, but because the values generated the same match code at the selected
sensitivity level (in this example, 85).
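To make the grouping idea concrete, here is a minimal sketch that generates a match code with the DQMATCH function (covered in a later lesson) and then sorts on it so that potential duplicates fall together. The table work.customers and the column name are hypothetical examples; the ENUSA locale is assumed to be loaded, and the NAME match definition and the sensitivity of 85 come from the example above.

/* Generate a match code for each name at sensitivity 85.             */
data work.name_codes;
   set work.customers;         /* hypothetical input table */
   length name_mc $32;
   name_mc = dqmatch(name, 'NAME', 85, 'ENUSA');
run;

/* Records with the same match code sort together as candidate        */
/* duplicates.                                                        */
proc sort data=work.name_codes;
   by name_mc;
run;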


Match Codes and Sensitivity


Name: Mr. Johnny Quintin Smith
Match Code Sensitivity Match Code Value
95 4B7~2$$$$$$$$$$C@B7$$$$$$$3
85 4B&~2$$$$$$$$$$C@P$$$$$$$$$
65 4B~2$$$$$$$$$$$C@P$$$$$$$$$
55 4B~2$$$$$$$$$$$C$$$$$$$$$$$


Sensitivity is used in the match code generation process to determine how much of the initial data value you want to use in the match code. In other words, it allows you to specify how exact you want to be in generating the match codes. The chosen level of sensitivity controls how many transformations are made to the data string before the generation of the match code.
Sensitivity also controls the number of positions each token contributes to the match code. At higher levels of sensitivity, more characters from the input text string are used to generate the match code. Conversely, at lower levels of sensitivity, fewer characters are used in the generation of the match code.
Note: It is important to experiment with different levels of sensitivity, because choosing too low of a sensitivity can lead to over-matching, and choosing a sensitivity level that is too high could lead to under-matching.
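One way to experiment is to generate match codes at several sensitivities side by side and compare how many distinct codes each produces. The following minimal sketch assumes a loaded ENUSA locale and a hypothetical table work.customers with a name column; the sensitivities mirror the table above.

/* Compare match codes at three sensitivity levels.                   */
data work.sensitivity_test;
   set work.customers;         /* hypothetical input table */
   length mc95 mc85 mc55 $32;
   mc95 = dqmatch(name, 'NAME', 95, 'ENUSA');
   mc85 = dqmatch(name, 'NAME', 85, 'ENUSA');
   mc55 = dqmatch(name, 'NAME', 55, 'ENUSA');
run;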


Parse Definition Examples


Name Information
Dr. Alan W. Richards, Jr., M.D.

Parsed Name
Prefix Dr.
Given Name Alan
Middle Name W.
Family Name Richards
Suffix Jr.
Title/Additional Info M.D.


Parse definitions define rules to place the words from a text string into the appropriate tokens. In the example above, a name value is parsed using the Name parse definition. The purpose of the Name Parse Definition is to parse a name string into the tokens that make up the name (for example, Prefix, Given Name, Middle Name, and so on).
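In code, parsing is typically a two-step operation with the SAS Data Quality Server functions covered in a later lesson: DQPARSE returns a delimited parse string, and DQPARSETOKENGET retrieves an individual token from it. The following minimal sketch assumes a loaded ENUSA locale and a hypothetical table work.customers with a name column; the token names come from the example above.

/* Parse a name and pull out two of its tokens.                       */
data work.parsed_names;
   set work.customers;         /* hypothetical input table */
   length parsed $200 given family $40;
   parsed = dqparse(name, 'Name', 'ENUSA');
   given  = dqparsetokenget(parsed, 'Given Name', 'Name', 'ENUSA');
   family = dqparsetokenget(parsed, 'Family Name', 'Name', 'ENUSA');
run;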


Standardization Definition Example


Data before Standardization Definition Data after Standardization
Mister Junior Q. Smith, Junior Name Mr Junior Q Smith, Jr
SAS INSTITUTE, INC. Organization SAS Institute Inc
123 North Main Street, Suite 100 Address 123 N Main St, Ste 100
U.S. Country UNITED STATES
9194473000 Phone (919) 447 3000


Standardization definitions are used to improve the consistency of data by applying standardization schemes to the individual tokens. The process of standardization with a standardization definition involves parsing the data string into tokens, and then standardizing each token using one or more standardization schemes.
The examples above illustrate the effect of applying standardization definitions to various input strings. You can see that not only do the standardization definitions help with standardizing data values, but they also control casing of values, as well as the order of the tokens in the resulting data string.
Note: Standardizing an address value, as in the example above, does not verify that the address is correct. It simply ensures a standard representation across address data values.


12.07 Activity
1. Open DataFlux Data Management Studio.
2. Open the QKB CI 27 - ENUSA Only QKB.
3. Click the Quality Knowledge Base tab.
4. View a list of all standardization definitions.
5. Answer the following question:
• How many data types have associated Standardization definitions?
6. View all the definitions for the E-mail data type.
7. Answer the following questions:
• How many types of definitions exist for the E-mail data type?
• What types of definitions exist for the E-mail data type?


12.3 Solutions
Solutions to Activities and Questions

12.01 Activity – Correct Answer


Question: Which values are standardized as ORG?
Answer: 0RG, OGR, OORG, ORD, ORG, ORGG, ORRG


12.01 Activity – Correct Answer


Question: Is this an Element scheme or a Phrase scheme?
Answer: Element


12.01 Activity – Correct Answer


Question: Which definitions use this scheme?
Answer: E-mail Standardization definition


12.02 Activity – Correct Answer


Question: What is the classification for the FULL STOP character (value 46)?
Answer: FULL SEPARATOR
Question: What is the operation?
Answer: USE


12.02 Activity – Correct Answer


Question: What is the classification for the Solidus character (value 47)?
Answer: FULL SEPARATOR
Question: What is the operation?
Answer: USE


12.02 Activity – Correct Answer


Question: What is the chopped string for the input string
support.sas.com/documentation?
Answer: Seven words – support, ., sas, ., com, /, documentation


12.03 Activity – Sample Correct Answer


Question: What does your full name phonetically reduce to?
Answer: JOHN PHILLIP DOE  JOM FILLIB DOE


12.04 Activity – Correct Answer


Question: What is the result for the input string U.S.A.?
Answer: USA


12.04 Activity – Correct Answer


Question: Which types of definitions in the QKB use this regex library?
Answer: Gender Analysis, Identification Analysis, Match, Parse,
Standardization


12.05 Activity – Correct Answer


Question: What are the defined categories and likelihoods for the word HTTP?
Answer: SCHEMEW (Medium), WORD (Very Low)


12.05 Activity – Correct Answer


Question: What are the defined categories and likelihoods for the word WWW?
Answer: DMAIN (Medium)


12.06 Activity – Correct Answers


Question: How many basic categories exist in the Grammar?
Answer: Nine categories


12.06 Activity – Correct Answers


Question: How many rules are defined for the URL derived category?
Answer: Four rules


12.07 Activity – Correct Answer


Question: How many data types have associated Standardization definitions?
Answer: 17


12.07 Activity – Correct Answer


Question: How many types of definitions exist for the E-mail data type?
Answer: Four


12.07 Activity – Correct Answer


Question: What types of definitions exist for the E-mail data type?
Answer: Identification Analysis, Match, Parse, Standardization

Lesson 13 Using SAS® Code to
Access QKB Components
13.1 SAS Configuration Options for Accessing the QKB ................................................... 13-3

13.2 SAS Data Quality Server Overview ........................................................................... 13-13


Demonstration: Using the Standardization Functions ............................................... 13-18
Demonstration: Using the DQSCHEME Procedure to Create a Scheme...................... 13-26
Demonstration: Using the DQSCHEME Procedure to Apply a Scheme ....................... 13-30
Demonstration: Using the Match Code Generation Functions .................................... 13-34
Demonstration: Using the DQMATCH Procedure..................................................... 13-39
Demonstration: Using the Parsing Functions........................................................... 13-45
Demonstration: Using the Extraction Functions ....................................................... 13-49
Demonstration: Using the Gender Analysis and Identification Analysis Functions.......... 13-52

13.3 Solutions ................................................................................................................. 13-54


Solutions to Activities and Questions...................................................................... 13-54

13.1 SAS Configuration Options for Accessing the QKB

Configuring SAS Applications to Access the QKB


(Diagram: three configuration paths to the Quality Knowledge Base – 1: interactive SAS sessions, 2: batch SAS programs, 3: the SAS Platform with SAS and third-party applications.)

There are a number of SAS applications that can interact with the QKB. In order for these applications to access the QKB components, they need to be configured to the root location of the QKB. This section discusses the options available for configuring SAS to access the QKB:
1. Configuring an “interactive” SAS session to connect to the QKB.
2. Accessing the QKB programmatically in SAS code using the %Dqload macro to load the specified QKB into memory.
3. Configuring the SAS Platform to the QKB by specifying configuration options in the .cfg files for the SAS Application Server.
Interaction with the QKB from within an interactive SAS session is facilitated by setting system options that control access to the QKB when a SAS session is instantiated. The programmer can also set these options programmatically from within the interactive SAS session.
Note: Setting options in the configuration file for an interactive SAS session is not the preferred method, because the end user might not be aware of the list of locales set in the DQLOCALE system option, which can lead to mistakenly using algorithms from the incorrect locale.
In batch SAS programs, there are programmatic ways to set SAS system options, including SAS macros that control access to the QKB and load the QKB into memory.
Access to the QKB from within SAS applications that interact with the SAS Platform can be facilitated by setting options in the configuration file(s) that execute when a SAS Workspace Server is instantiated.
Note: Additional options for configuring applications to the QKB are found within the applications themselves.


DQSETUPLOC SAS System Option


The DQSETUPLOC system option points to the root location of the QKB.

Syntax Example:

-DQSETUPLOC "D:\ProgramData\SAS\QKB\CI27_MultipleLocales"


DQLOCALE SAS System Option


The DQLOCALE system option selects the default locale(s) to be used in SAS code.

Syntax Example:

-DQLOCALE (ENUSA ENGBR FRCAN)


The DQLOCALE system option sets an ordered list of locales for SAS to use for data cleansing processes. In the example above, the ENUSA, ENGBR, and FRCAN locales have been specified.
Note: Multiple locales can be specified for the DQLOCALE option. If multiple locales are specified, the application searches the locales, in the order specified in the option, until it finds the definition being used.
Note: All locales in the DQLOCALE list must exist in the QKB referenced in the DQSETUPLOC option.
Because the locales that are specified with this option need to be loaded into memory for access, you should always set the value of this system option by invoking the %Dqload macro (discussed in a later section).


SAS Data Quality Server Autocall Macros


SAS provides three autocall macros for interacting with the QKB from within code:

%Dqload
%Dqunload
%Dqputloc


SAS Data Quality Server provides three autocall macros that facilitate interaction with the QKB from within SAS. Specifically, these macros facilitate the loading and unloading of QKB locales in memory, as well as setting the ordered list of locales to be used in data cleansing processes.
These three macros are available from SAS Data Quality Server:
• %Dqload – used to set system option values and load the QKB into memory
• %Dqunload – used to unload the QKB from memory
• %Dqputloc – displays information about the contents of the current QKB locale from memory in the SAS log.


Using the %Dqload Macro for Setting System Options


The %Dqload macro sets the values for DQSETUPLOC (the root location of the QKB) and DQLOCALE (the ordered list of locales).

%Dqload Syntax Example:

%DQLOAD
(DQSETUPLOC="D:\ProgramData\SAS\QKB\CI27_MultipleLocales",
DQLOCALE=(ENUSA ENGBR FRCAN));


The %Dqload macro is used to specify the list and order of locales that are loaded into memory in a SAS session. In addition to loading the QKB into memory, the macro sets the values of the DQSETUPLOC and DQLOCALE SAS system options.
Options for the %Dqload macro:
• DQSETUPLOC
• DQLOCALE
• DQINFO – this option controls the amount of information that is written to the SAS log while the QKB is being loaded into memory. Specifying DQINFO=0 results in no information being written to the log.
Note: Options set using the %Dqload macro override any system options that were set previously.
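For example, the following variation on the syntax example above loads only the ENUSA locale and uses DQINFO=0 to suppress the log messages:

%DQLOAD(DQSETUPLOC='D:\ProgramData\SAS\QKB\CI27_MultipleLocales',
        DQLOCALE=(ENUSA),
        DQINFO=0);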


Output from the %Dqload Macro Call


When the %Dqload macro executes, information is written to the SAS log, including
• values for the DQSETUPLOC system option
• values for the DQLOCALE system option
• confirmation of the locales that were loaded into memory.


Checking DQ System Options Programmatically


PROC OPTIONS group=DATAQUALITY;
RUN;


If you need to check the SAS Data Quality system options that are in effect for your SAS session, you can submit a PROC OPTIONS step with the option GROUP=DATAQUALITY. The resulting output will confirm the data quality settings that are in place for your current SAS session.


%Dqunload Macro Usage


The %Dqunload macro removes the QKB contents from memory.

General form for the %Dqunload macro:

%DQUNLOAD;


The %Dqunload autocall macro unloads all locales that are currently loaded into memory.
Note: It is not necessary to load and unload the QKB locales from every program that you run, but it is good practice to unload the QKB from memory when you are no longer using the data cleansing functions in your SAS programs. The QKB locales are also unloaded from memory when your SAS session ends.


Using the %Dqputloc Macro to See QKB Contents


The %Dqputloc macro writes certain QKB information to the SAS log.

%Dqputloc syntax example:

%DQPUTLOC(ENUSA, PARSEDEFN=1, SHORT=0);

(ENUSA is the locale to be used, PARSEDEFN=1 lists related parse definitions, and SHORT=0 writes usage descriptions to the SAS log.)


The %Dqputloc macro is used to write information from the specified QKB locale to the SAS log. The information written includes all definitions, data type tokens, related functions, and the names of the related parse definitions (for gender definitions and match definitions).
Options for the %Dqputloc macro:
• PARSEDEFN=0|1 – this option lists the related parse definition for each gender definition and match definition. The default value is PARSEDEFN=1.
• SHORT=0|1 – this option is used to limit the amount of information written to the log. Specifying SHORT=1 removes the descriptions of how the definitions are used. The default value is SHORT=0.
• locale – specifies the locale whose contents you want to view.
Note: If you specify the locale option, the specified locale must be a locale that was loaded into memory.
Note: If you do not specify the locale option, the first locale in the %Dqload locale list is used by default.
The example above illustrates the use of the %Dqputloc macro to write the contents of the ENUSA locale to the SAS log. In this example, the PARSEDEFN option is used to list the related parse definitions for other definitions that use parse definitions as a preliminary step in their processing. The SHORT=0 option writes the usage descriptions to the SAS log.


Output from the %Dqputloc Macro

(The log output identifies the specified locale, ENUSA, and lists its contents, such as the available Case definitions.)

13.01 Activity
1. Open a SAS windowing environment session by selecting Start  SAS 
SAS 9.4 (English).
2. Open the D:\Workshop\dqpqkb\demos\Ch2D1_DQServer_Macros.sas
program in the Program Editor window.
Hint: Click File and select Open Program to navigate to the program.
3. Submit the code.
4. Review the SAS Log window.
5. Answer the following questions:
How many types of definitions are available in the ENUSA locale?
What is the name of the gender definition in the ENUSA locale?
What tokens are populated by the Organization (Global) parse definition?


13.2 SAS Data Quality Server Overview

SAS Data Quality Server Overview

(Diagram: SAS Data Quality Server – with its procedures, functions, and CALL routines – sits between the DataFlux Data Management Server and the Quality Knowledge Base.)


SAS Data Quality Server provides procedures, functions, and CALL routines to support interaction with the QKB, as well as with the DataFlux Data Management Server.
The procedures, functions, and CALL routines enable you to access the QKB components from within SAS code. The functions and CALL routines are accessible from within DATA step and SQL code, and are often used to create new columns of data within your code.
Interaction with the DataFlux Data Management Server enables you to access, from within SAS code, any job or service that has been made available on the server. These coding options give you the flexibility to run jobs, call real-time services, check the status of running jobs, copy logs, stop running jobs, and so on.


SAS Data Quality Server Data-Cleansing Functionality


The procedures, functions, and CALL routines in the SAS Data Quality Server allow you access to a variety of data transformation processes.
• Matching – creates match codes, which are encoded representations of data values, that can be used to cluster similar data records, or as surrogate keys in “fuzzy” joins.
• Standardization – ensures the standard and consistent representation of data values.
• Parsing – used to break a string of data into meaningful tokens.
• Identification Analysis – identifies the semantic type of data in a field.
• Gender Analysis – identifies the gender of an individual based on the components of their name.
• Casing – ensures the proper casing of data values, especially values that do not conform to the “typical” casing algorithms (for example, SAS and DataFlux).
Note: In order to have access to these data cleansing definitions, the SAS Data Quality Server code needs to execute in a SAS session that is configured to the QKB and has the necessary locales loaded into memory.


SAS Data Quality Server: Standardization Functions


The functions that are available for standardizing data elements include the
following:
• DQSTANDARDIZE
• DQSCHEMEAPPLY


There are two functions available in SAS Data Quality Server for performing data standardization.
• DQSTANDARDIZE – returns a character value after standardizing, casing, spacing, and formatting, and then applies a common representation to certain words and abbreviations.
Note: The DQSTANDARDIZE function uses a standardization definition from the QKB.
• DQSCHEMEAPPLY – applies a scheme to the data and returns a standardized value.
Note: The DQSCHEMEAPPLY function uses a standardization scheme from the QKB.


DQSTANDARDIZE Function Syntax and Example


General form of the DQSTANDARDIZE function:
DQSTANDARDIZE(source-string, 'standardization-definition' <, locale>)

DQSTANDARDIZE function example:


outCompany=dqstandardize(Company,'ORGANIZATION','ENUSA');

Company Value (Input) outCompany Value (Output)


MCDOWELL’S PLUMBING AND HEATING, INC. McDowell’s Plumbing & Heating Inc


Required arguments for the DQSTANDARDIZE function:
• source-string – specifies a character constant, variable, or expression that contains the value to be standardized by the standardization definition.
• standardization-definition – identifies the standardization definition to be used for the standardization.
The optional argument for the DQSTANDARDIZE function is the QKB locale that contains the standardization definition. If a locale is not specified, the locale is determined by the ordered list of locales that are set in the DQLOCALE system option.
In the example above, the Company data value will be standardized using the ORGANIZATION standardization definition from the ENUSA locale. The standardized data value will be written to a column named outCompany.


DQSCHEMEAPPLY Function Syntax and Example


General form of the DQSCHEMEAPPLY function:
DQSCHEMEAPPLY('char', 'scheme', 'scheme-format' <, 'mode'>
<,'scheme-lookup-method'> <,'match-definition'> <,sensitivity> <,'locale'>)

DQSCHEMEAPPLY function example:


OutCompany=dqSchemeApply(company,'company_scheme','bfd',
'phrase','EXACT');

Company Value (Input) OutCompany Value (Output)


MCDOWELL’S PLUMBING AND HEATING, INC. McDowell’s Plumbing & Heating Inc


Required arguments for the DQSCHEMEAPPLY function:


• char – specifies a character constant, variable, or expression that contains the value to be
standardized.
• scheme – identifies the scheme to be used for the standardization.
• scheme-format – identifies if the scheme is a BFD or NOBFD scheme.

Optional arguments for the DQSCHEMEAPPLY function:


• mode – specifies how the scheme is to be applied to the values of the input character variable.
• scheme-lookup-method – specifies the method of applying the scheme. Valid values are EXACT, IGNORE_CASE, and USE_MATCHDEF.
• match-definition – the name of the match definition to be used to look up data values in the standardization scheme.
• sensitivity – specifies the amount of information in the match codes that are created during the
application of the scheme.
• locale – specifies the locale that contains the match definition to use in generating match codes.
Note: The BFD | NOBFD options for the scheme-format argument are updated to QKB | NOQKB in
the SAS 9.4M5 release.


Using the Standardization Functions

This demonstration illustrates the use of the functions that are available for standardizing data.
1. If necessary, start a SAS session by selecting Start  All Programs  SAS 
SAS 9.4 (English).
2. Verify that the (Enhanced) Editor window is the active window.
3. Open an existing SAS program.
a. Select File  Open Program.
b. Navigate to D:\Workshop\dqpqkb\Demos.
c. Click Ch3D1_DQLOCALE_Functions.sas.
d. Click Open.
4. Review the code.
a. Verify that there is a call to %Dqload that loads the ENUSA locale into memory.
%DQLOAD(DQSETUPLOC = 'D:\ProgramData\SAS\QKB\CI27_MultipleLocales',
DQLOCALE = (ENUSA));
b. Verify that there are two LIBNAME statements.
libname input 'D:\Workshop\dqpqkb\Data';
libname output 'D:\Workshop\dqpqkb\Solutions\files\output_files';
c. Verify that there is a FILENAME statement.
filename scheme
'D:\ProgramData\SAS\QKB\CI27_MultipleLocales\scheme\en052.sch.qkb';
d. For the DATA step:
1) Verify that a new SAS table prospects_std is being created in the output library.
2) Verify that a SAS table prospects is being read from the input library.
3) Verify that a new character column is being created named city_std of length 32, and
that a label is being assigned to this new column.
4) Verify that the new city_std column is being assigned values returned by the
DQSCHEMEAPPLY function.
data output.prospects_std;
set input.prospects;
length city_std $32;
label city_std='Standardized City';
city_std = dqschemeapply(city,'scheme','BFD','ELEMENT',
'IGNORE_CASE');
run;

5. Select Run  Submit to submit the code.


6. Select View  Log.


a. Verify that the %Dqload macro call executed successfully.
b. Verify that the two LIBNAME statements executed successfully.
c. Verify that the FILENAME statement executed successfully.
d. Verify that the DATA step executed successfully.
7. View the new table.
a. Select View  Explorer to open the SAS Explorer.
b. If necessary, expand Libraries and then click the output library.
c. Double-click the prospects_std table.
d. Scroll to the right to examine the city_std column (labeled Standardized City).

1) Select File  Close to close the VIEWTABLE window.


2) Select File  Close to close the Explorer window.
8. Select File  Exit to close the SAS session.
9. Click OK in the Exit window.


DQSCHEMEAPPLY CALL Routine Syntax and Example


General form of the DQSCHEMEAPPLY CALL routine:
CALL DQSCHEMEAPPLY('char', 'output-variable', 'scheme', 'scheme-format'
<,mode> <,'transform-count-variable'> <,'scheme-lookup-method'>
<,match-definition> <,sensitivity><,'locale’>)

CALL DQSCHEMEAPPLY example:


call dqschemeapply(company,outcompany,'company_scheme',
'bfd','phrase',NumTrans,'EXACT');

Input Data Value Output Data Values


MCDOWELL’S DSGN CTR outcompany = MCDOWELL’S DESIGN CENTER
NumTrans = 2

Required arguments for the DQSCHEMEAPPLY CALL routine:
• char – specifies a character constant, variable, or expression that contains the value to be standardized.
• output-variable – the character variable that receives the standardized value of the input value.
• scheme – identifies the scheme to be used for the standardization.
• scheme-format – identifies if the scheme is a BFD (QKB scheme) or NOBFD (SAS data set) scheme.
Optional arguments for the DQSCHEMEAPPLY CALL routine:
• mode – specifies how the scheme is to be applied to the values of the input variable.
• transform-count-variable – identifies the numeric variable that receives the returned number of transformations that were performed on the input value.
• scheme-lookup-method – specifies the method of applying the scheme. Valid values are EXACT, IGNORE_CASE, and USE_MATCHDEF.
• match-definition – the name of the match definition to be used to look up data values in the standardization scheme.
• sensitivity – specifies the amount of information in the match codes that are created during the application of the scheme.
• locale – specifies the locale that contains the match definition to use in generating match codes.
In the example above, the DQSCHEMEAPPLY CALL routine is used to standardize the input company name MCDOWELL’S DSGN CTR in element mode. After applying the scheme to the data, the resulting value is MCDOWELL’S DESIGN CENTER, and the NumTrans value of 2 reflects that two standardizations were applied to the elements of the string.
Note: The BFD | NOBFD options for the scheme-format argument are updated to QKB | NOQKB in the SAS 9.4M5 release.
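Putting the CALL routine in context, the following DATA step is a minimal sketch. The input table work.companies and its company column are hypothetical; the fileref company_scheme (assumed to point to a BFD scheme file, as in the slide) and a loaded ENUSA locale are also assumed.

/* Standardize company and capture the number of transformations.     */
data work.companies_std;
   set work.companies;         /* hypothetical input table */
   length outcompany $60;
   call dqschemeapply(company, outcompany, 'company_scheme', 'bfd',
                      'phrase', numtrans, 'EXACT');  /* numtrans receives
                                                        the count        */
run;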


13.02 Activity
1. Open a SAS windowing environment session by selecting
Start  SAS  SAS 9.4 (English).
2. Open the following program in the Program Editor window:
D:\Workshop\dqpqkb\demos\Ch3D4_CALL_DQSCHEMEAPPLY.sas
Hint: Select File  Open Program and then navigate to the file specified.
3. Submit the code.
4. View the log to make sure that the program executed successfully.
5. Open the output.prospects_call_std table.
6. Use the table to answer the following questions:
Were any transformations made on the Address variable?
What is the highest number of transformations that were applied to an Address value?


PROC DQSCHEME Overview

PROC DQSCHEME <option(s)>;


APPLY <option(s)>;
CONVERT <option(s)>;
CREATE <option(s)>;


The DQSCHEME procedure is used to improve the consistency of your data from within SAS code. Specifically, with the DQSCHEME procedure, you can do the following:
• Create standardization schemes – both SAS data set standardization schemes, as well as BFD schemes in the QKB.
• Create analysis data sets – used to group together similar data values to assist you with the creation of standardization schemes.
• Apply standardization schemes to data – updates data values based on the “standard” value in the scheme.
The syntax for the DQSCHEME procedure consists of three statements:
• APPLY
• CONVERT
• CREATE


PROC DQSCHEME Statement Syntax Overview

PROC DQSCHEME DATA=input-data-set


<QKB | NOQKB>
OUT=output-data-set;


The required argument for the PROC DQSCHEME statement:


• DATA= – specifies the input data set to be standardized or to use as input in creating
standardization schemes.
Note: If the APPLY statement is used, then the DATA= data set is the data that is
standardized.
Note: If the CREATE statement is used, then the DATA= data set is the data that is used to
run the analysis and build the scheme.
Optional arguments for the PROC DQSCHEME statement:
• QKB | NOQKB – specifies whether schemes will be written to the QKB or to a SAS data set.
• OUT= – specifies the output data set for the procedure.
Note: If the data set does not exist, the procedure will create it.


APPLY Statement Syntax Overview

PROC DQSCHEME <options>;


APPLY
<LOCALE=locale-name>
<MATCH-DEFINITION=match-definition>
<MODE=ELEMENT | PHRASE>
<SCHEME=scheme-name>
<SCHEME_LOOKUP=EXACT | IGNORE_CASE |
USE_MATCHDEF>
<SENSITIVITY=sensitivity-level>
<VAR=variable-name>;


The APPLY statement is used to apply a standardization scheme, from the specified locale, to a variable. These are the valid options for the APPLY statement:
• LOCALE= – specifies the name of the locale that contains the standardization and match
definitions to be used in the statement.
• MATCH-DEFINITION= – specifies the name of the match definition to be used in looking up the
input data value in the standardization scheme.
• MODE=ELEMENT | PHRASE – specifies the mode to be used in applying the scheme to the
data.
• SCHEME= – specifies the name of the standardization scheme to be applied to the data.
• SCHEME_LOOKUP =EXACT | IGNORE_CASE | USE_MATCHDEF – specifies the method to
be used in looking up the data value in the scheme.
• SENSITIVITY=sensitivity-level – used in conjunction with the USE_MATCHDEF option, specifies
how exact you want to be when using match definitions to look up data values in the scheme.
Note: The default sensitivity level is 85.
• VAR=variable-name – specifies the variable in the input data set to be standardized.


PROC DQSCHEME Code Example: Creating Schemes

proc dqscheme data=vendors qkb;


create matchdef='City' var=city
scheme=city locale='ENUSA';
create matchdef='State/Province' var=state
scheme=state locale='ENUSA';
create matchdef='Organization' var=company
scheme=org locale='ENUSA';
run;


In the example above, three CREATE statements are used to create three standardization schemes in the QKB. The schemes are created using the data from the vendors data set. The three schemes will be named city, state, and org, and will be stored in the ENUSA locale in the QKB.


Using the DQSCHEME Procedure to Create a Scheme

This demonstration illustrates the use of the DQSCHEME procedure to create a standardization
scheme.
1. If necessary, start a SAS session by selecting Start  All Programs  SAS 
SAS 9.4 (English).
2. Verify that the (Enhanced) Editor window is the active window.
3. Open an existing SAS program.
a. Select File  Open Program.
b. Navigate to D:\Workshop\dqpqkb\Demos.
c. Click Ch3D5_PROC_DQSCHEME_Create.sas.
d. Click Open.
4. Review the code.
%DQLOAD(DQSETUPLOC='D:\ProgramData\SAS\QKB\CI27_MultipleLocales',
DQLOCALE=(ENUSA));

libname input 'D:\Workshop\dqpqkb\Data';


libname schemes 'D:\Workshop\dqpqkb\Solutions\Files\SAS_Schemes';

PROC DQSCHEME data=input.prospects noqkb;


create matchdef='City' var=city scheme=schemes.SAS_city_scheme;
run;
5. Select Submit to submit the code.
6. Select View  Log and resolve any errors.
7. View the new SAS table that is the scheme.
a. Select View  Explorer to open the SAS Explorer.
b. If necessary, expand Libraries and then click the schemes library.
c. Double-click the SAS_city_scheme table.
8. Open the Scheme file that was created and preview the scheme.
a. Navigate to D:\Workshop\dqpqkb\Solutions\Files\SAS_Schemes.


b. Double-click the sas_city_scheme SAS data set to preview the scheme in the VIEWTABLE
window.

9. Make any changes that are necessary to the scheme by switching to Edit mode.
a. Select Edit  Edit Mode.


b. Make any necessary changes to the scheme.

c. Select File  Save to save any changes that you made.


10. Select File  Close to close the VIEWTABLE window.


PROC DQSCHEME Code Example: Applying Schemes


Consider the following example code for applying a scheme using the
DQSCHEME procedure:
libname schemes 'D:\Workshop\DQPQKB\data\SAS_schemes';

proc dqscheme data=prospects out=std_prospects noqkb;


apply scheme=schemes.sas_city_scheme var=city;
run;


In the example above, the APPLY statement is used to apply the sas_city_scheme standardization
scheme to the City variable in the prospects data set.


Using the DQSCHEME Procedure to Apply a Scheme

This demonstration illustrates the use of the DQSCHEME procedure to apply a standardization
scheme to a variable.
1. If necessary, start a SAS session by selecting Start  All Programs  SAS 
SAS 9.4 (English).
2. Verify that the (Enhanced) Editor window is the active window.
3. Open an existing SAS program.
a. Select File  Open Program.
b. Navigate to D:\Workshop\dqpqkb\Solutions\SAS_Programs.
c. Click Ch3D5_PROC_DQSCHEME_Apply.sas.
d. Click Open.
4. Review the code.
5. Apply the scheme to the City variable in the Prospects data set.
a. In the Enhanced Editor, enter the following code:
%DQLOAD(DQSETUPLOC = 'D:\ProgramData\SAS\QKB\CI27_MultipleLocales',
DQLOCALE = (ENUSA));

libname input 'D:\Workshop\dqpqkb\Data';


libname schemes
'D:\Workshop\dqpqkb\Solutions\Files\SAS_Schemes';
libname output
'D:\Workshop\dqpqkb\Solutions\files\output_files';

/* Use an APPLY statement to apply a scheme to the CITY variable


*/
PROC DQSCHEME data=input.prospects out=output.std_prospects
nobfd;
apply scheme=schemes.Sas_city_scheme var=city;
run;
Hint: This code can be found in the following program:
D:\Workshop\dqpqkb\Demos\Ch3D5_PROC_DQSCHEME_Apply.sas
b. Select Submit to submit the code.
c. Select View  Log and resolve any errors.
6. Open the Output.Std_Prospects data set that was created and preview the data.
a. In the Explorer pane of the SAS windowing environment session, navigate to the Output
library.
b. Open the Output library.


c. Double-click the Std_Prospects SAS data set to see the standardized city values in the
VIEWTABLE window.

7. When finished, select File  Close to close the VIEWTABLE window.


13.03 Activity
1. Open a SAS windowing environment session by selecting Start  SAS 
SAS 9.4 (English).
2. Open the following program in the Program Editor window:
D:\Workshop\dqpqkb\demos\Ch3D5_PROC_DQSCHEME_Apply.sas
Hint: Select File  Open Program and then navigate to the file specified.
3. Submit the code.
4. View the log to make sure that the program executed successfully.
5. Verify that the new table output.std_prospects has correct values for the
city column.


SAS Data Quality Server: Match Code Functions


The functions that are available for creating match codes include the
following:
• DQMATCH
• DQMATCHINFOGET
• DQMATCHPARSED


These SAS Data Quality Server functions are available for performing data matching:
• DQMATCH – returns a match code from a character value.
• DQMATCHINFOGET – returns the name of the parse definition that is associated with a match definition.
• DQMATCHPARSED – returns a match code from a parsed character value.
We discuss only the DQMATCH function in this section.


DQMATCH Function Syntax and Example


General form of the DQMATCH function:
DQMATCH(source-string, 'match-definition' <,'sensitivity'> <,'locale'>)

Consider the following statement from a DATA step:


Matchcode=dqmatch(NameofPerson, 'NAME');

NameofPerson (Input) Matchcode (Output)


Mike Abbott &M&~$$$$$$$$$$$B73_$$$$$$$$
Jane Abbott &M&~$$$$$$$$$$$C&P_$$$$$$$$
Stacey Abbott &M&~$$$$$$$$$$$4~&J$$$$$$$$


The DQMATCH function returns a match code based on a data value. The match code is an encoded representation of the characters in the data string after going through several processing steps and based on the specified level of sensitivity.
Required arguments for the DQMATCH function:
• source-string – specifies a character constant, variable, or expression that contains the value for which a match code is created, according to the specified match definition.
• match-definition – the match definition from the QKB.
Optional arguments for the DQMATCH function:
• sensitivity – specifies an integer value that determines the amount of information in the returned match code. Valid values range from 50 to 95.
Note: The default value is 85.
• locale – locale that contains the match definition.
In the example above, the DQMATCH function is used to return match codes for the various input name values. You see that even though the input values all have the same family name, the given names are slightly different, which results in slightly different match code values.


Using the Match Code Generation Functions

This demonstration illustrates the use of the match code generation functions to create match codes.
1. Use SAS Data Quality Server functions in a SAS DATA step to create new data fields.
a. In the Enhanced Editor, enter the following code:
/* Set DQSETUPLOC and DQLOCALE options and load QKB into memory */
%DQLOAD(DQSETUPLOC='D:\ProgramData\SAS\QKB\CI27_MultipleLocales',
DQLOCALE=(ENUSA));

libname input 'D:\Workshop\dqpqkb\Data';


libname output 'D:\Workshop\dqpqkb\Solutions\files\output_files';

/* Use the DQMATCH function to generate match codes */


data output.prospects_matchcodes;
set input.prospects;
length contact_mc $32 address_mc $50;
contact_mc = dqmatch(contact,'NAME',85);
address_mc = dqmatch(address,'ADDRESS',85);
run;
Hint: This code can be found in the following program:
D:\Workshop\dqpqkb\Demos\Ch3D6_Matching_Functions.sas
b. Select Submit to submit the code.
2. Select View → Log and resolve any errors.
3. Preview the Prospects_matchcodes data set.
a. In the Explorer pane of the SAS windowing environment session, navigate to the Output
library.
b. Open the Output library.


c. Double-click the Prospects_matchcodes data set to see the data with match code values in
the VIEWTABLE window.

d. Scroll to the right to see the new match code variables created by the function calls.

4. When finished, select File → Close to close the VIEWTABLE window.


PROC DQMATCH Syntax Overview

PROC DQMATCH DATA=input-data-set
   CLUSTER=output-numeric-variable-name
   CLUSTER_BLANKS | NO_CLUSTER_BLANKS
   CLUSTERS_ONLY
   DELIMITER | NODELIMITER
   LOCALE=locale-name
   MATCHCODE=output-character-variable-name
   OUT=output-data-set;

CRITERIA <options>;


The DQMATCH procedure has the following features:
• creates match codes on selected variables using a match definition from the QKB
• enables the programmer to select the level of sensitivity when generating the match codes
• clusters records based on a given set of criteria
Match codes are created based on a specified match definition in a specified locale. The match
codes are written to an output SAS data set.
Valid statements for the DQMATCH procedure:
• PROC DQMATCH option(s)
• CRITERIA option(s)
Required argument for the PROC DQMATCH statement:
• DATA= – specifies the input data set to be used in the generation of match codes.
Optional arguments for the PROC DQMATCH statement:
• CLUSTER= – specifies the numeric variable in the output data set that contains the cluster
number.
• CLUSTER_BLANKS | NO_CLUSTER_BLANKS – controls whether blank values are written to
the output data set.
• CLUSTERS_ONLY – causes only the values that are part of a cluster (potential duplicates) to be
written to the output data set.
• DELIMITER | NODELIMITER – specifies whether exclamation points (!) are used as delimiters.
• LOCALE=<locale-name> – specifies the name of the locale that is used to create match codes.
• MATCHCODE=<output-character-variable-name> – specifies the name of the output character
variable that stores the match codes.
• OUT=<output-data-set> – specifies the name of the output data set for match codes created with
the DQMATCH procedure.


CRITERIA Statement Syntax Overview

PROC DQMATCH <options>;

CRITERIA <CONDITION=integer>
<DELIMSTR=variable-name | VAR=variable-name>
<EXACT | MATCHDEF>
<MATCHCODE=output-character-variable>
<SENSITIVITY=sensitivity-level>;


The purpose of the CRITERIA statement is to specify conditions for generating match code values.
These are the optional arguments for the CRITERIA statement:
• CONDITION= – an integer value that is used to group multiple CRITERIA statements together.
• DELIMSTR= | VAR= – specifies the value that is used to generate the match code. DELIMSTR=
specifies a token from a parse step, and VAR= specifies a variable name.
• EXACT | MATCHDEF – determines how clusters are created. EXACT identifies exact character
matches between values. MATCHDEF specifies a match definition from the QKB that is used to
generate match codes on the values.
• MATCHCODE= – specifies the name of the character variable to which the match definition
writes the match code.
• SENSITIVITY= – determines the amount of information that is contained in the resulting match
code value.
Note: The default level of sensitivity is 85.
A minimal sketch of how CONDITION= groups criteria follows.
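The following step is a minimal sketch (not from the course files) of CONDITION= grouping. CRITERIA statements that share a CONDITION= value must all be satisfied for records to fall into the same cluster, and different CONDITION= values act as alternative matching conditions. The data set and variable names here are hypothetical.

/* Minimal sketch: cluster rows when both the name and the address
   match codes agree (condition 1), or when the phone values match
   exactly (condition 2). Data set and variable names are hypothetical. */
proc dqmatch data=work.customers out=work.customers_mc
      cluster=cluster_num;
   criteria condition=1 var=contact matchdef='Name'
            sensitivity=85 matchcode=name_mc;
   criteria condition=1 var=address matchdef='Address'
            sensitivity=85 matchcode=address_mc;
   criteria condition=2 var=phone exact;
run;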


PROC DQMATCH Code Example
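The slide image is not reproduced here. The following code is a representative example that mirrors the demonstration later in this section: it generates match codes on the Contact and Address fields of the Prospects data set and assigns cluster numbers.

proc dqmatch data=input.prospects out=output.prospects_mc
      cluster=cluster_num;
   criteria var=contact matchdef='Name' sensitivity=85
            matchcode=name_mc;
   criteria var=address matchdef='Address' sensitivity=85
            matchcode=address_mc;
run;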

PROC DQMATCH Code Example: Output

The slide shows the resulting output data set, with cluster numbers, match codes on Contact,
and match codes on Address.

The example above shows the output of the PROC DQMATCH code. You can see that match codes
have been generated on the Contact field and also on the Address field from the input data table.
You can also see that several records are identified as belonging to clusters because they
generated the same match codes.
Note: The rows with CLUSTER_NUM values of “.” are single-row clusters that do not match any
other rows based on the match code values.
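To keep only the potential duplicates in the output, the CLUSTERS_ONLY option that was described earlier can be added. The following is a minimal sketch (not from the course files):

/* Minimal sketch: write only rows that belong to multi-row clusters */
proc dqmatch data=input.prospects out=output.prospects_dups
      cluster=cluster_num clusters_only;
   criteria var=contact matchdef='Name' sensitivity=85
            matchcode=name_mc;
run;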


Using the DQMATCH Procedure

This demonstration illustrates the use of the DQMATCH procedure to create match codes and
cluster data records.
1. If necessary, open a SAS session by selecting Start → All Programs → SAS →
SAS 9.4 (English).
2. Using the DQMATCH procedure, create match codes and cluster data on the Contact and
Address fields in the Prospects SAS data set.
a. In the Enhanced Editor, enter the following code:
/* Set DQSETUPLOC option to QKB root and load into memory */
%DQLOAD(DQSETUPLOC='D:\ProgramData\SAS\QKB\CI27_MultipleLocales',
DQLOCALE=(ENUSA));

libname input 'D:\Workshop\dqpqkb\Data';


libname output 'D:\Workshop\dqpqkb\Solutions\files\output_files';

/* Create match codes on Contact and Address and cluster data */


PROC DQMATCH data=input.prospects out=output.prospects_mc
cluster=cluster_num;
criteria var=contact matchdef='Name' sensitivity=85
matchcode=name_mc;
criteria var=address matchdef='Address' sensitivity=85
matchcode=address_mc;
run;
Hint: This code can be found in the following program:
D:\Workshop\dqpqkb\Demos\Ch3D7_PROC_DQMATCH.sas
b. Select Submit to submit the code.
3. View the log for any errors.
a. Select View → Log and resolve any errors.


4. Open the table created by the DQMATCH procedure to preview the match codes and the
clusters.
a. In the Explorer pane, navigate to the Output folder.
b. Double-click the Prospects_mc table.
c. Preview the data in the table.

d. When finished, close the VIEWTABLE window.


e. Close the Enhanced Editor window.
f. Click No if you are prompted to save the program.


13.04 Activity
1. Open a SAS windowing environment session by selecting Start → SAS →
SAS 9.4 (English).
2. Open the following program in the Program Editor window:
D:\Workshop\dqpqkb\Demos\Ch3D7_PROC_DQMATCH.sas
Hint: Select File → Open Program to navigate to the program.
3. Submit the code.
4. View the log to make sure that the program executed successfully.
5. Open the output.prospects_mc table.
6. Use the log and the table to answer the following questions:
How many records were written to the output table?
How many clusters were created in the output table?


SAS Data Quality Server: Parsing Functions


The functions that are available for performing data parsing include the
following:
• DQPARSE
• DQPARSEINFOGET
• DQPARSEINPUTLEN
• DQPARSERESLIMIT
• DQPARSESCOREDEPTH
• DQPARSETOKENGET
• DQPARSETOKENPUT


These functions are available for performing data parsing:
• DQPARSE – returns a parsed character value.
• DQPARSEINFOGET – returns the token names for the specified parse definition.
• DQPARSEINPUTLEN – sets the default length of parsed input and returns a string that indicates
its previous value.
• DQPARSERESLIMIT – sets a limit on resources consumed during parsing.
• DQPARSESCOREDEPTH – specifies how deeply to search for the best parsing score.
• DQPARSETOKENGET – returns a token from a parsed character value.
• DQPARSETOKENPUT – inserts a token into a parsed character value and returns the updated
parsed character value.
We discuss the DQPARSE and DQPARSETOKENGET functions in this section.


DQPARSE Function Syntax and Example


General form of the DQPARSE function:
DQPARSE('parse-string', 'parse-definition' <,'locale'>)

Consider the following statement from a DATA step:

ParseString=dqparse(NameofPerson,'NAME');

NameofPerson (Input)    ParseString (Output)
Michael D Abbott        /=/Michael/=/D/=/Abbott/=//=/
Jane G Abbott           /=/Jane/=/G/=/Abbott/=//=/
Stacey Abbott           /=/Stacey/=//=/Abbott/=//=/


The DQPARSE function returns a tokenized character string from a data value. This parsed
character string contains delimiters that separate the individual “tokens” in the tokenized string.
Required arguments for the DQPARSE function:
• parse-string – specifies a character constant, variable, or expression that contains the value to
be parsed, according to the specified parse definition.
• parse-definition – specifies the parse definition from the QKB.
Optional argument for the DQPARSE function:
• locale – the locale that contains the parse definition.

DQPARSETOKENGET Function Syntax and Example


General form of the DQPARSETOKENGET function:
DQPARSETOKENGET('parsed-char', 'token', 'parse-definition' <,'locale'>)

Consider the following statement from a DATA step:

First_Name=dqparsetokenget(ParsedNameString,'Given Name',
                           'NAME','ENUSA');

ParsedNameString (Input)         First_Name (Output token)
/=/Michael/=/D/=/Abbott/=//=/    Michael
/=/Jane/=/G/=/Abbott/=//=/       Jane
/=/Stacey/=//=/Abbott/=//=/      Stacey

The DQPARSETOKENGET function returns the value of the specified token from a previously
parsed data value. These are the required arguments for the DQPARSETOKENGET function:
• parsed-char – a character constant, variable, or expression that contains the parsed character
value from which the value of the specified token is returned.
• token – the name of the token that is returned from the parsed value.
Note: To see a valid list of tokens for a parse definition, use the DQPARSEINFOGET function
or, alternatively, the %DQPUTLOC autocall macro.
• parse-definition – the name of the parse definition from the QKB.
Note: The parse definition used in the DQPARSETOKENGET function must be the same parse
definition that was used to create the parsed input string.
Optional argument for the DQPARSETOKENGET function:
• locale – the locale that contains the parse definition.
In the example above, the DQPARSETOKENGET function returns the data value stored in
the Given Name token of the parsed name string. A quick sketch of DQPARSEINFOGET follows.
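As a quick sketch (not part of the course files), the following DATA _NULL_ step uses DQPARSEINFOGET to write the token names of the NAME parse definition to the log. It assumes that the QKB has been loaded with %DQLOAD.

/* Minimal sketch: list the tokens defined by the NAME parse definition */
data _null_;
   length tokens $ 200;
   tokens = dqparseinfoget('NAME', 'ENUSA');  /* comma-delimited token names */
   put tokens=;
run;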


Using the Parsing Functions

This demonstration illustrates the use of the functions that are available for parsing data values.
1. Use SAS Data Quality Server functions in a SAS DATA step to create new data fields.
a. In the Enhanced Editor, enter the following code:
/* Set DQSETUPLOC option to QKB root and load into memory */
%DQLOAD(DQSETUPLOC='D:\ProgramData\SAS\QKB\CI27_MultipleLocales',
DQLOCALE=(ENUSA));

libname input 'D:\Workshop\dqpqkb\Data';


libname output 'D:\Workshop\dqpqkb\Solutions\files\output_files';

data output.prospects_parsed;
set input.prospects;
length ParsedPhone $20 ParsedName $60 areacode $3;
label areacode='Area Code';
parsedphone = dqparse(Phone_Number, 'Phone');
areacode = dqparsetokenget(parsedphone, 'Area Code', 'Phone');
parsedname = dqparse (contact, 'NAME');
run;
Hint: This code can be found in the following program:
D:\Workshop\dqpqkb\Demos\Ch3D8_Parse_Functions.sas
b. Select Submit to submit the code.
2. View the log and resolve any issues.
Select View → Log and resolve any errors.
3. Preview the Prospects_parsed data set.
a. In the Explorer pane, navigate to the Output library.
b. Open the Output library.


c. Double-click the Prospects_parsed SAS data set to see the ParsedPhone and
ParsedName values in the VIEWTABLE window.

d. Scroll to the right to see the new Area Code variable that was created by the function.

4. Select File → Close to close the VIEWTABLE window.


SAS Data Quality Server: Data Extraction Functions


The functions that are available for performing data extraction include the
following:
• DQEXTRACT
• DQEXTINFOGET
• DQEXTTOKENGET
• DQEXTTOKENPUT


These functions are available for performing data extraction:
• DQEXTRACT – returns an extracted character value.
• DQEXTINFOGET – returns the token names in an extraction definition.
• DQEXTTOKENGET – returns a token from an extraction character value.
• DQEXTTOKENPUT – inserts a token into an extraction character value and returns the updated
extraction character value.
We discuss the DQEXTRACT and DQEXTTOKENGET functions in this section.


DQEXTRACT Function Syntax and Example


General form of the DQEXTRACT function:
DQEXTRACT('extraction-string', 'extraction-definition' <,'locale'>)

DQEXTRACT function example:

ExtractString=DQEXTRACT(InputData,'CONTACT INFO','ENUSA');

InputData Value (Input)              ExtractString Value (Output)
DBA Mike D Abbott                    Mike D Abbott/=//=//=//=//=/DBA
DBA Mike D Abbott Enterprises        /=/Mike D Abbott Enterprises/=//=//=//=/DBA
C/O Mike D Abbott, 123-456-7890      Mike D Abbott/=//=//=//=/123-456-7890/=/C/O



The DQEXTRACT function returns token values from a free-form data value. These are the required
arguments for the DQEXTRACT function:
• extraction-string – the value that is extracted according to the specified extraction definition. The
value must be the name of a character variable, a character value in quotation marks, or an
expression that evaluates to a variable name or quoted value.
• extraction-definition – the extraction definition from the QKB that is used to extract data values
into tokens.
The optional argument for the DQEXTRACT function is the locale that contains the extraction
definition.
In the example above, the DQEXTRACT function returns a delimited text string for the various
input data values that contain information for Mike Abbott. The delimiters separate the various
tokens in the extracted data string. A sketch of DQEXTTOKENGET follows.
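Because DQEXTTOKENGET is discussed in this section but does not appear in the demonstration, the following DATA _NULL_ step is a minimal sketch (not part of the course files) that retrieves one token from an extracted string. It assumes that the QKB has been loaded and uses the ORGANIZATION token, which the demonstration shows is part of the CONTACT INFO extraction definition.

/* Minimal sketch: extract a value, then pull one token back out */
data _null_;
   length extractstring $ 200 org $ 100;
   extractstring = dqextract('DBA Mike D Abbott Enterprises',
                             'CONTACT INFO');
   org = dqexttokenget(extractstring, 'ORGANIZATION', 'CONTACT INFO');
   put org=;
run;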


Using the Extraction Functions

This demonstration illustrates the use of the functions that are available for extracting data values.
1. If necessary, open a SAS session by selecting Start → All Programs → SAS →
SAS 9.4 (English).
2. Use SAS Data Quality Server functions in a SAS DATA step to create new data fields.
a. In the Enhanced Editor, enter the following code:
data _null_;
   length extractstring $200 extractinfo $200;
   extractinfo=dqextinfoget('CONTACT INFO');
   extractstring=dqextract('100 SAS Campus Drive, Cary, NC, 27513, Mike Abbott','CONTACT INFO');
   extractstring2=dqexttokenput(extractstring,'SAS','ORGANIZATION',
                                'CONTACT INFO');
   put extractinfo= //
       extractstring= //
       extractstring2=;
run;
Hint: This code can be found in the following program:
D:\Workshop\dqpqkb\Demos\Ch3D9_Extract_Functions.sas
b. Select Submit to submit the code.
3. Select View → Log and resolve any errors.
4. Preview the SAS log to see the results.
a. In the Log window of the SAS windowing environment session, navigate to the bottom of the
code that you submitted in the previous step.

Note: Values for a specific token can be obtained using the DQEXTTOKENGET function.


SAS Data Quality Server: Identification Analysis Functions


The function available for identification analysis is the DQIDENTIFY function.
General form of the DQIDENTIFY function:
DQIDENTIFY('char', 'identification-analysis-definition' <,'locale'>)

DQIDENTIFY function example:

OutID=dqidentify(Name,'CONTACT INFO');

Name (Input)    OutID (Output)
CBG             ORGANIZATION
Jane Abbott     NAME
Holly Lane      ADDRESS

The DQIDENTIFY function returns the type of data represented by a data value. These are the
required arguments for the DQIDENTIFY function:
• char – specifies a character constant, variable, or expression that contains the value that is
analyzed to determine the category of the content.
• identification-analysis-definition – the identification analysis definition from the QKB.
The optional argument for the DQIDENTIFY function is the locale that contains the identification
analysis definition.
In the example above, the DQIDENTIFY function returns the category of data represented by the
various input values. The values returned by the function are the categories of data guessed
from the data values.


DQGENDER Function Syntax and Example


General form of the DQGENDER function:
DQGENDER('char', 'gender-analysis-definition' <, 'locale'>)

DQGENDER function example:

OutGender=dqgender(NameofPerson,'name');

NameofPerson (Input)    OutGender (Output)
Mike Abbott             M
Jane Abbott             F
Stacey Abbott           U

The DQGENDER function returns a gender value from the name of an individual. These are the
required arguments for the DQGENDER function:
• char – specifies the character variable, or string, that is to be processed by the function.
• gender-analysis-definition – the definition from the QKB that is used to determine the gender.
The optional argument for the DQGENDER function is the locale that contains the gender
analysis definition.
In the example above, the DQGENDER function guesses the gender of the provided data values.
For the name Mike Abbott, the function returns M, indicating that the person is male. For the
name Jane Abbott, the function returns F, indicating that the person is female. For the name
Stacey Abbott, the function returns U, indicating that the gender is unknown based on the
provided data value.


Using the Gender Analysis and Identification Analysis Functions

This demonstration illustrates the use of the functions that are available for performing gender
analysis and identification analysis.
1. If necessary, open a SAS session by selecting Start → All Programs → SAS →
SAS 9.4 (English).
2. Use SAS Data Quality Server functions in a SAS DATA step to create new data fields.
a. In the Enhanced Editor, enter the following code:
%DQLOAD(DQSETUPLOC='D:\ProgramData\SAS\QKB\CI27_MultipleLocales',
DQLOCALE=(ENUSA));

libname input 'D:\Workshop\dqpqkb\Data';


libname output 'D:\Workshop\dqpqkb\Solutions\files\output_files';

data output.individuals output.organizations;


set input.prospects;
length Identity $20;
identity = dqidentify (Contact,'CONTACT INFO');
if identity='NAME' then output output.individuals;
if identity='ORGANIZATION' then output output.organizations;
run;

data output.individuals;
set output.individuals;
length gender $1;
label gender='Gender';
gender = dqgender (Contact, 'Name');
run;
Hint: This code can be found in the following program:
D:\Workshop\dqpqkb\Demos\Ch3D10_ID_Gender_Functions.sas
b. Select Submit to submit the code.
3. Select View → Log and resolve any errors.
4. Preview the Organizations data set.
a. In the Explorer pane, navigate to the Output library.
b. Open the Output library.


c. Double-click the Organizations SAS data set to see the organization records in the
VIEWTABLE window.

d. When finished, select File → Close to close the VIEWTABLE window.


5. Preview the Individuals data set.
a. In the Explorer pane of the SAS windowing environment session, navigate to the Output
library.
b. Open the Output library.
c. Double-click the Individuals SAS data set to see the records for Individual in the
VIEWTABLE window, and associated gender values.

d. When finished, select File → Close to close the VIEWTABLE window.


13.3 Solutions
Solutions to Activities and Questions

13.01 Activity – Correct Answer


Question: How many types of definitions are available in the ENUSA locale?
Answer: Nine (9) – every type of definition has a large comment block. Scroll through the log
and find that there are nine types: CASE, EXTRACTION, GENDER, GUESS,
IDENTIFICATION, MATCH, PARSE, PATTERN, and STANDARDIZATION.


13.01 Activity – Correct Answer


Question: What is the name of the gender definition in the ENUSA locale?
Answer: Name



13.01 Activity – Correct Answer


Question: What tokens are populated by the Organization (Global) parse
definition?
Answer: Name, Legal Form, Site, Additional Info


13.02 Activity – Correct Answer


Question: Were any transformations made on the Address variable?
Answer: Yes. Visually comparing the Address column values to the corresponding
Address_Std values shows that there are differences. In addition, viewing the
NumTrans column values also confirms that changes or transformations were
made on the Address column.


Copyright © 2020, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.
13-56 Lesson 13 Using SAS® Code to Access QKB Components

13.02 Activity – Correct Answer


Question: What is the highest number of transformations that were applied
to an Address value?
Answer: Four (4)
• Open the table in the VIEWTABLE window.
• Right-click the column heading NumTrans and select Sort → Descending.
• Click Yes to change to Edit mode.
• Enter Temp as the table name, and then click OK.
• Verify that 4 is the largest value.


13.03 Activity – Correct Answer


Preview the output file.



13.04 Activity – Correct Answer


Question: How many records were written to the output table?
Answer: 928


13.04 Activity – Correct Answer


Question: How many clusters were created in the output table?
Answer: 35


