SAS Data Management
Course Notes
SAS® Data Management Tools and Applications Course Notes was developed by Mark Craver,
David Ghan, Robert Ligtenberg, Kari Richardson, and Erin Winters. Instructional design, editing, and
production support was provided by the Learning Design and Development team.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or
trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
Copyright © 2020 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States
of America. No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise,
without the prior written permission of the publisher, SAS Institute Inc.
Book code E71514, course code LWDISDS2/DISDS2, prepared date 27Feb2020. LWDISDS2_001
ISBN 978-1-64295-452-4
Table of Contents
2.1 Exploring the SAS Platform and SAS Data Integration Studio ...............................2-3
2.3 Examining SAS Data Integration Studio Jobs and Options ................................. 2-34
Demonstration: Examining SAS Data Integration Studio Jobs and Options ..... 2-38
Lesson 3 SAS® Data Integration Studio: Defining Source Data Metadata ............3-1
Lesson 4 SAS® Data Integration Studio: Defining Target Data Metadata .............4-1
Demonstration: Populating the Current and Terminated Staff Tables .............. 5-16
Practices.................................................................................................. 5-28
Practices.................................................................................................. 5-62
6.1 Working with the Extract and Summary Statistics Transformations ........................6-3
Practices.................................................................................................. 6-28
8.3 Quality Knowledge Bases and Reference Data Sources .................................... 8-33
Demonstration: Verifying the Course QKB and Reference Sources ................ 8-37
Demonstration: Working with the Branch and Data Validation Nodes ........... 10-57
Demonstration: Working with Address Verification and Geocoding Nodes ..... 10-74
Demonstration: Adding Field-Level Rules for the Surviving Record .............. 11-61
Lesson 12 Understanding the SAS® Quality Knowledge Base (QKB) .................. 12-1
13.1 SAS Configuration Options for Accessing the QKB ............................................ 13-3
To learn more…
For information about other courses in the curriculum, contact the
SAS Education Division at 1-800-333-7660, or send e-mail to
training@sas.com. You can also find this information on the web at
http://support.sas.com/training/ as well as in the Training Course
Catalog.
For a list of SAS books (including e-books) that relate to the topics
covered in these course notes, visit https://www.sas.com/sas/books.html or
call 1-800-727-0025. US customers receive free shipping to US
addresses.
Lesson 1 SAS/ACCESS® Technology Overview
1.1 SAS/ACCESS Technology Overview ............................................................................ 1-3
Demonstration: Using SAS/ACCESS Methods for a Variety of Data Sources ................. 1-12
SAS/ACCESS Software
With SAS/ACCESS, you can read and write data directly to and from other database
management systems (DBMSs).
The slide shows DATA steps and PROC steps reading from and writing to a DBMS table.
This is a key capability leveraged across all SAS data management applications.
SAS provides software that enables you to access data from a large variety of database
management systems. These software components, referred to as SAS/ACCESS engines, enable
you to read and write directly to and from specific data formats. This is a key capability that allows
SAS users to integrate data from a large variety of data sources as part of the data curation process.
To connect to any specific type of database, you license a SAS/ACCESS product specific to that
database. You might license SAS/ACCESS to Oracle, SAS/ACCESS to Hadoop, or
SAS/ACCESS to Microsoft SQL Server. We have dozens of specific SAS/ACCESS engines.
When you have these engines, you can use any SAS DATA step or PROC step in your SAS code
to read or write any table from the database management system in the same way that you
would read or write a SAS data set.
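For example, once a SAS/ACCESS library is assigned, a DBMS table can be read with an ordinary DATA step or PROC step. The following is a minimal sketch; the libref, connection options, and table name are hypothetical.

/* Assign a SAS/ACCESS library (hypothetical Oracle connection) */
libname myora oracle user=student password=Metadata0 path=orapath schema=sales;

/* Read the DBMS table as if it were a SAS data set */
data work.big_orders;
   set myora.order_fact;        /* two-level name: libref.table */
   where order_total > 1000;    /* the engine can pass this WHERE clause to the DBMS */
run;

proc print data=work.big_orders(obs=10);
run;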
There are two main methods available with SAS/ACCESS that are common to almost all access
engines.
1. With the SQL pass-through method, you use the SQL procedure within your SAS program to
write and submit native database SQL to the database.
2. With the SAS/ACCESS LIBNAME method, you write a specific type of LIBNAME statement to
connect to the database system, and then you can write any SAS code and name a database
table to read or create in the same way that you would name SAS data sets in your SAS programs.
SAS generates the native database SQL on your behalf in order to perform read or write
operations in the database.
A general syntax example for each of these methods follows.
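The slide code is not reproduced in these notes; the sketch below shows the general shape of both methods. The DBMS keyword (oracle), connection options, librefs, and table and column names are placeholders chosen for illustration.

/* Method 1: SQL pass-through */
proc sql;
   connect to oracle (user=student password=Metadata0 path=orapath);
   select *
      from connection to oracle
         (select employee_id, salary, state from employee_payroll);
   disconnect from oracle;
quit;

/* Method 2: SAS/ACCESS LIBNAME */
libname mydblib oracle user=student password=Metadata0 path=orapath;

proc means data=mydblib.employee_payroll mean;
   class state;
   var salary;
run;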
In the example above for SQL pass-through:
1. You begin with PROC SQL (the SAS SQL procedure).
2. A CONNECT statement is then used to define a connection to your DBMS.
3. A SELECT clause is used to select data from the connection to the DBMS.
4. Native DBMS SQL is specified in parentheses.
a. This query is sent by SAS directly to the DBMS.
b. The DBMS executes the query and returns the results to SAS.
5. A DISCONNECT statement is used to close the connection to the DBMS.
When you use the SAS/ACCESS LIBNAME method, you begin by defining a database library with
the LIBNAME statement. This establishes a library connection to the database. When you have
defined this, you can then name database tables with the same type of two-level names used for
SAS data sets, in the form libref.datasetname. In this case, we name a database table directly in
the DATA= option of PROC MEANS. Database column names are used in the code in the same way
that we name variables from SAS data sets. Here, PROC MEANS summarizes the SALARY column
in the database table, categorized by STATE.
There are two types of SQL pass-through components that you can submit to the DBMS:
• SELECT statements, which produce output to SAS software (SQL procedure pass-through queries)
• EXECUTE statements, which perform all other non-query SQL statements that do not produce
output (for example, GRANT, CREATE, or DROP)
Here is a specific example of connecting to a Hive database in Hadoop and submitting a HiveQL
pass-through query.
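The program from the slide is not reproduced here; the sketch below follows the same pattern, with a hypothetical Hive server, schema, table, and columns.

proc sql;
   connect to hadoop (server="hiveserver.example.com" port=10000 schema=orion);
   title 'Revenue by Product Line (computed in Hive)';
   select *
      from connection to hadoop
         (  /* native HiveQL, executed by Hive */
            select product_line, sum(total_retail_price) as revenue
            from order_fact
            group by product_line
         );
   disconnect from hadoop;
quit;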
Pass-through methods start with a PROC SQL statement to invoke the SQL procedure.
Next, a CONNECT statement is specified, where you use a keyword to indicate the type of database
that you are connecting to. This example connects to Hadoop, so the Hadoop keyword is used.
The CONNECT statement also includes a set of key-value pairs in parentheses that contain the
information that SAS needs in order to connect to the database system.
Next is a SAS PROC SQL SELECT statement. Because this is SQL pass-through, a table is not
named on the FROM clause of the SAS SELECT statement. Instead, we use FROM CONNECTION
TO HADOOP, which refers to the connection that we established in the CONNECT statement. In
parentheses, we write an SQL query in the native SQL language of the database that we are
connecting to. This SELECT statement in parentheses is passed by SAS, using the connection
information, directly to the database to execute. The database executes the query and returns
the result to SAS, where the SAS SELECT then queries this result set.
A DISCONNECT statement closes the SAS connection to the database, and QUIT ends the SQL
procedure.
This is an example of an SQL pass-through EXECUTE statement. PROC SQL is also used, and the
CONNECT and DISCONNECT statements work in the same way. However, in this example, an
EXECUTE statement is used because we are not sending a query to the database system, so we do
not need to query a result set with SAS SELECT. Instead, we simply use the EXECUTE keyword
and, in parentheses, write the native database SQL statement that SAS sends to the database to be
executed. In this example, we send a DROP TABLE statement to execute. With the EXECUTE
statement, you follow the native statement in parentheses with BY HADOOP, or BY
"the keyword for the system that you connected to".
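A sketch of that pattern follows; the connection options and table name are hypothetical.

proc sql;
   connect to hadoop (server="hiveserver.example.com" port=10000 schema=orion);
   execute (drop table old_order_fact) by hadoop;
   disconnect from hadoop;
quit;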
Universal Methodology
SAS/ACCESS interfaces enable SAS programmers to apply consistent
techniques to access a large number of data sources in different formats.
The beauty of the SAS/ACCESS interfaces is that they allow SAS programmers to apply consistent
techniques to access a large number of data sources in different formats. Methods that you learn
to apply to one data source are easily leveraged if you need to access other data sources for which
a SAS/ACCESS interface is available.
With these technologies in place, various SAS applications are also able to leverage them, as you
will discover in upcoming lessons.
This demonstration shows how to use SQL pass-through and the SAS/ACCESS LIBNAME method for
several types of data sources.
1. Start SAS Studio.
a. Click the SAS Studio icon on the Start Bar.
/* ODBC Access */
libname mssql odbc dsn=sqlsrvdsn user=student pw=Metadata0;   /* MS SQL Server */
libname odbcora odbc dsn=orasrc user=student pw=Metadata0;    /* Oracle */
5) Right-click CUSTOMER_DIM and select Properties to display the SAS Table Properties
window.
6) Click the Columns tab in the SAS Table Properties window.
Note: In the Libraries section of the navigation pane, this DB2 library connection, like all
SAS/ACCESS LIBNAME connections, enables users to work with the DB2 tables
interactively in the same way that they would work with SAS data sets. The column
attributes in the table properties show that SAS interprets the metadata in the database
tables as if they were SAS data sets. All variables are treated as either SAS numeric or
SAS character variables. Note here that the CUSTOMER_BIRTH_DATE variable is a
DATE data type in DB2, which SAS treats as a SAS date value. Date values are SAS
numerics with assigned SAS date formats. This interpretation by the SAS LIBNAME
engines is what allows users to use database tables from these library connections
wherever a SAS data set can be named in SAS code.
Note: The table can be viewed interactively in the same way as a SAS data set in a
SAS base library.
9) Click x on the DB2LIB.CUSTOMER_DIM tab to close.
10) Interactively copy a SAS data set into the DB2 library.
a) Expand the SASHELP library.
b) Drag the BASEBALL data set from the SASHELP library and drop it onto the
DB2LIB.
Note: BASEBALL is copied and exists as a table in the DB2 database.
c) Double-click BASEBALL to open and view the data for the new DB2LIB table on a
new tab labeled DB2LIB.BASEBALL.
ods noproctitle;
title 'Proc Means Oracle';
proc means data=oralib.employee_payroll mean;
class employee_gender;
var salary;
output out=work.gendersalaryORA mean=meansalary;
run;
title 'Proc Means DB2';
proc means data=db2lib.employee_payroll mean;
class employee_gender;
var salary;
output out=db2lib.gendersalaryDB2
(rename=(_type_=type _freq_=freq )) mean=meansalary;
run;
title 'Proc Means MS Access';
proc means data=msacc.employee_payroll mean;
class employee_gender;
var salary;
output out=msacc.gendersalaryMSACCESS mean=meansalary;
run;
title 'Proc Means MS Excel';
proc means data=myexcel.employee_payroll mean;
class employee_gender;
var salary;
output out=myexcel.gendersalaryEXCEL2 mean=meansalary;
run;
title 'Proc Means MS SQL Server';
proc means data=mssql.employee_payroll mean;
class employee_gender;
var salary;
output out=mssql.gendersalarySQLSERVER2 mean=meansalary;
run;
title 'Proc Means ODBC Oracle';
proc means data=odbcora.employee_payroll mean;
class employee_gender;
var salary;
output out=odbcora.gendersalaryODBC
(rename=(_type_=type _freq_=freq )) mean=meansalary;
run;
options sastrace=off;
Note: A table called EMPLOYEE_PAYROLL storing identical data values exists in each
database system that we are connecting to via the LIBNAME statements. The
PROC MEANS steps are performing the identical summary calculations using the
same standard SAS syntax. In each procedure, average salary values will be
calculated for each unique value of the EMPLOYEE_GENDER variable. An
OUTPUT statement is used in each procedure to store the results as a table. For
those database libraries where Write access is granted by the database, the output
is stored as a table in the database. This is done by using the libref for the
SAS/ACCESS library when naming the table to save on the OUT= option of the
OUTPUT statement. If Write access to the database is not available, the output data
set is stored in the SAS Work library. Each procedure will also generate a report of
the summary statistics.
The first OPTIONS statement (at the beginning of this code segment) specifies the
SASTRACE option. When this option is used, the SAS log will display the details
about the native database SQL that is generated by the SAS/ACCESS engines. It is
being used here to demonstrate that when PROC MEANS is used, the SAS engine
for many databases will generate a native summary query. The data summarization
process can be performed by the database and only the summary results are
returned to SAS. Pushing more of the processing into the database will result in
better performance for large database tables. DBMSs are scaled to handle the large
volumes of data stored. Also, processing the data in place in the database reduces
the volume of data that needs to be transported across the network to the machine
where SAS is executing.
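The tracing is typically enabled with an OPTIONS statement such as the one below before the PROC steps run; the exact trace settings can be adjusted as needed.

/* Write the DBMS SQL generated by the SAS/ACCESS engines to the SAS log */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;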
2) Click (Run all) to execute the highlighted statements.
3) If necessary, click the Results tab and note that each procedure generated the same
summary report for the table in each database library.
Note: The SASTRACE messages shown above contain summary queries in which a
GROUP BY clause identifies the class variable specified in the PROC MEANS step.
In the native SQL query, summary statistics are calculated for the analysis variable
identified in the MEANS procedure. Only the summarized results are returned to SAS,
and PROC MEANS then uses them to generate the report.
In many cases, the LIBNAME engines do not convert the SAS language into an
equivalent native database SQL process. In such cases, a native database SELECT
statement is generated by the SAS LIBNAME engine to return all rows of data from the
database to SAS, and further processing of that data is then done by SAS. With large
volumes of data, an understanding of where processing occurs becomes an important
performance consideration in developing your SAS code.
7) View one of the summary tables created.
a) Click the Libraries section in the navigation pane.
b) If necessary, expand My Libraries → DB2LIB.
c) Double-click the GENDERSALARYDB2 table to view the data for the DB2 table on a
separate tab labeled DB2LIB.GENDERSALARYDB2.
employee_organization b
where a.Employee_ID=b.Employee_ID);
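Only the tail end of this pass-through query appears above. A sketch of the full pattern it belongs to is shown below; the connection options and column list are hypothetical.

proc sql;
   connect to oracle (user=student password=Metadata0 path=orapath);
   select *
      from connection to oracle
         (select a.Employee_ID, a.Salary, b.Department
          from employee_payroll a,
               employee_organization b
          where a.Employee_ID=b.Employee_ID);
   disconnect from oracle;
quit;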
2) Scroll downward to find the title and the first few rows of results for the second pass-through
query to Oracle.
3) Continue to scroll downward to note that the same two query results have been created
for the pass-through queries for each of the other database systems as well.
5. Exit SAS Studio.
a. Click Sign Out in the top right corner of the SAS Studio interface.
b. Click Yes to confirm.
c. Click X in the top right to close the browser window.
Lesson 2 SAS® Data Integration Studio: Essentials
2.1 Exploring the SAS Platform and SAS Data Integration Studio ...................................... 2-3
Demonstration: Importing Metadata......................................................................... 2-11
2.3 Examining SAS Data Integration Studio Jobs and Options ........................................ 2-34
Demonstration: Examining SAS Data Integration Studio Jobs and Options .................... 2-38
Practice............................................................................................................... 2-49
The SAS platform applications provide application interfaces to surface the power of data
management, analytics, and reporting.
The diagram above shows an organizational view of some of the SAS platform applications.
The highlighted group shows some of the analytics applications on the SAS platform. These
applications and other SAS tools help analysts make and manage models, forecast trends, and
generate statistics and visualizations on data.
The highlighted group contains some of the reporting applications available in the SAS platform.
These applications and other SAS applications allow users to generate complex dashboards and
reports on their data as well as access data and generate reports with Microsoft Office tools like
Excel or Word.
The highlighted group contains some of the programming interfaces available in the SAS platform.
These applications allow users to write and edit SAS code, which can be used to manage, analyze,
and report on data. SAS code will be used in this course to generate custom transformations and
can be used in the tool to customize jobs and existing transformations.
The highlighted group contains some of the data management applications available in the SAS
platform. Of these data management applications, this course concentrates on the SAS Data
Integration Studio application; a brief explanation of the other data management applications
follows.
SAS OLAP Cube Studio and SAS Information Map Studio are not discussed further in this
course.
The slide shows examples of metadata object types, such as tables, cubes, dashboards, and folders.
Note: The SAS Folders tree can have customized nonstandard folders.
Shared Data is provided for you to store user-created content that is shared among multiple
users. Under this folder, you can create any number of subfolders, each with the appropriate
permissions, to further organize this content.
Note: You can also create additional folders under SAS Folders in which to store
shared content.
Follow these best practices when you interact with SAS folders:
• Use personal folders for personal content and use shared folders for content that multiple users
need to view.
• Do not delete or rename the Users folder.
• Do not delete or rename the home folder or personal folder of an active user.
• Do not delete or rename the Products or System folders or their subfolders.
• Use caution when you rename the Shared Data folder.
• When you create new folders, the security administrator should set permissions. The
environment can be pre-configured so that new folders inherit permissions from existing parent
folders.
The slide shows example users (Ahmed, Marcel, Ole, Robert, and Bruno) and a Data Integrators group.
In order to control access to SAS platform content, SAS must know who makes each request and
what type of access is requested.
Some applications in the SAS platform support role-based security. Roles determine which user
interface elements a user sees when interacting with an application. The various features in
applications that provide role-based management are called capabilities. SAS Data Integration
Studio does not support this type of security.
Note: Groups enable you to easily specify permissions for similar users.
SAS Packages
SAS Data Integration Studio enables you to export and import metadata.
One format that supports SAS metadata is the SAS package format.
SAS Data Integration Studio enables you to export and import metadata. One format that supports
SAS metadata is the SAS package format.
The SAS package format
• is a SAS internal format
• supports most SAS platform metadata objects, including objects relevant to SAS Data Integration
Studio, such as jobs, libraries, tables, and external files.
The SAS package format can be used to
• move metadata between SAS metadata repositories
• maintain backups of metadata.
Importing Metadata
This demonstration performs the steps for an initial import of various metadata objects. These
objects are explored in a subsequent demonstration.
Some folders in the Folders tree are provided by default, such as My Folder, Products,
Shared Data, System, and User Folders.
Other folders and subfolders were added by an administrator, such as Data Mart
Development.
3. Right-click the Data Mart Development folder and select Import SAS Package.
8. Click Next.
9. Verify that all objects are selected.
Two library metadata objects in this SAS Package file were exported from a metadata repository
where they were associated with a SAS Application Server object called SASApp. The defined
SAS Application Server in the target environment that we will associate the objects with is also
called SASApp.
This mapping of original SASApp to the target SASApp occurred by default (because the
application servers have the same name).
13. Click Next.
14. Verify that the two target directory paths match the two original paths.
The first path listed is associated with the library metadata object named DIFT Test Source
Library. If the second path is selected, you will discover it is associated with the library metadata
object named DIFT Test Target Library.
We need to make sure the directory referenced actually exists – we will do this in a Windows
Explorer window.
16. Click Yes to close the Warning dialog box and proceed.
A Summary panel appears for the Import from SAS Package Wizard.
2.01 Activity
1. Access SAS Data Integration Studio using the My Server connection
profile.
2. Specify Bruno as the user and Student1 as the password.
(note the capital S for Student1 password)
3. On the Folders tab, right-click the Data Mart Development folder
and select Import SAS Package.
4. Select the package D:\Workshop\dift\solutions\DIFT Demo.spk.
5. Accept the default selections in each step in the wizard.
Course Data
The data used in the course is from a fictitious global sports and outdoors retailer named Orion Star
Sports & Outdoors.
Orion Star has traditional brick and mortar stores, an online store, and a large catalog business.
The corporate headquarters is located in the United States, with offices and stores in many countries
throughout the world.
One of the first things to do when learning a new software tool is to understand the interface.
SAS Data Integration Studio has a variety of components available – we will explain each of the
primary items.
The slide identifies the main regions of the interface: the title bar, the menu bar, the toolbar, and the status bar.
The SAS Data Integration Studio interface is designed with features that are common to most
Windows applications.
• The title bar shows the current version of SAS Data Integration Studio, as well as the name of
the current connection profile.
• The menu bar provides access to drop-down menus. The list of active menu items varies
according to the current work area and the type of object that is selected. Inactive menu items
are disabled or hidden.
• The toolbar provides access to shortcuts for items on the menu bar. The list of active tools varies
according to the current work area and the type of object that is selected. Inactive tools are
disabled or hidden.
• The status bar displays the name of the currently selected object, the name of the default SAS
Application Server if one is selected, the login ID and metadata identity of the current user, and
the name of the current SAS Metadata Server. To select a different SAS Application Server,
double-click the name of that server to open a dialog box. If the name of the SAS Metadata
Server is red, the connection is broken. In that case, you can double-click the name of the
metadata server to open a dialog box that enables you to reconnect.
The slide highlights the tree view and the Basic Properties pane.
Job Editor
The Job Editor window enables you to create, run, and troubleshoot SAS Data Integration Studio
jobs.
• The Diagram tab is used to build and update the process flow for a job.
• The Code tab is used to review or update code for a job.
• The Log tab is used to review the log for a submitted job.
• The Output tab is used to review the output of a submitted job.
• The Details pane is used to monitor and debug a job in the Job Editor.
Some folders in the Folders tree are provided by default, such as My Folder, Products,
Shared Data, System, and User Folders.
Other folders and subfolders were added by an administrator, such as Data Mart
Development.
4. Click in front of Data Mart Development to expand this folder.
The DIFT Demo folder contains seven metadata objects: two library objects, four table objects,
and one job object.
Each metadata object has its own set of properties.
6. If necessary, select View → Basic Properties.
The Basic Properties pane can be toggled off. Selecting this menu item displays the
Basic Properties pane.
The Basic Properties pane displays basic information for this job object.
One interesting property is that the table that is being loaded is identified.
The Basic Properties pane displays basic information for this library object.
Interesting properties to note for a library object include the following:
• library reference (libref) used
• the physical path specified
• the complete LIBNAME statement (this includes the engine)
The Basic Properties pane displays basic information for this table object.
Interesting properties to note for a table object include the following:
• physical table name
• library object with libref in parentheses
• the type of DBMS
• the number of columns
The General tab displays the metadata name of the table, as well as the metadata folder
location.
b. Click the Columns tab.
The Columns tab displays the column attributes of the table object. Notice that all columns in
this table are numeric.
The Physical Storage tab displays the name of the physical table, the name of the library,
and the type of the table.
d. Click Cancel to close the Properties window.
11. Right-click DIFT Test Table - ORDER_ITEM and select Open.
Note: If prompted, use Bruno’s credentials for the SASApp application server.
The View Data window appears and displays the data for this table.
The functions of the View Data window are controlled by the View Data toolbar:
• Positions the data with the go-to row as the first displayed data line.
• Enables printing.
• Displays the Filter tab in the View Data Query Options window.
• Displays the Columns tab in the View Data Query Options window.
• Displays physical column names in the column headings.
Note: You can display any combination of column metadata, physical column names,
and descriptions in the column headings.
12. To close the View Data window, select File → Close (or click ).
The metadata name of the library object is shown on the General tab. The metadata folder
location is also shown.
b. Click the Options tab.
The Options tab displays the library reference and the location of the physical path of this
library.
c. Click Cancel to close the Properties window.
14. Right-click DIFT Test Source Library in the Folders tree and select View LIBNAME.
The Display LIBNAME window appears.
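The window shows the complete LIBNAME statement that is generated from the library metadata. For a Base SAS library it has the general form shown below; the libref and path here are placeholders rather than the actual values for this library.

LIBNAME difttest BASE "D:\Workshop\dift\testsource";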
Executing a Job
The job editor window enables a data integration developer to design,
debug, and execute a job.
A job can be executed by
• clicking the Run tool ( ) on the Job Editor toolbar
• selecting Actions → Run from the menu bar
• right-clicking in the background of the job and selecting Run
• pressing F3.
Additional Run tools exist on the Job Editor’s toolbar and on the pop-up menu.
Note: The available Run tools depend on the current run state, the selected transformation,
or both.
The option Enable row count on basic properties and data viewer for tables can be toggled on
or off. The default is off.
The option Show Advanced Property tabs can be toggled on or off. The default is on.
The slide compares a job flow diagram when Collapse is selected with the same diagram when Expand is selected.
The Nodes grouping of options controls how ports and temporary output tables are displayed in a job
flow diagram.
The Layout grouping of options controls the default orientation of a job flow diagram.
This job joins two source tables and then loads the result into a target table. The target table
is then used as the source for the Rank transformation. The result of the ranking is loaded
into a target table and sorted, and then a report is generated based on the rankings.
The Columns tab in the Details pane displays column attributes for the selected table object.
These attributes are fully editable in this tab.
Similarly, selecting any of the table objects in the process flow diagram (DIFT Test Table -
ORDERS, DIFT Test Table - ORDER_ITEM, DIFT Test Target - Order Fact Table (in
diagram twice), DIFT Test Target - Ranked Order Fact) displays a Columns tab for that
table object.
The full mapping functionality of the Join’s Designer window is found on this Mappings tab.
Similarly, selecting any of the transformations in the process flow diagram (Join, Table
Loader, Rank, Sort, List Data) displays a Mappings tab for that transformation.
e. Click Run to execute the job (if prompted to log on to the application server, use
Bruno/Student1).
The Status tab in the Details pane shows the completion status for each step in the job. The
overall (Job) completion status is set to the lowest step completion status.
f. Double-click the first Error (for the Table Loader) in the Status column.
The Details pane shifts its focus to the Warnings and Errors tab. The error indicates that
the physical location for the target library does not exist.
Now you must discover the physical location that is specified for the library object.
g. Click the Folders tab.
h. If necessary, expand Data Mart Development → DIFT Demo.
The Status tab of the Details pane shows that the transformation completed successfully.
o. Select File → Close (or click ) to close the Job Editor window.
If you made any changes when you viewed the job, the following window appears:
7. Click the Enable row count on basic properties and data viewer for tables option.
Note: Retrieving the number of rows requires system resources for most database tables.
For SAS tables, the number of rows is retrieved from the table metadata and requires
very little overhead.
a. Verify that the SASApp server is selected as the value for the Server field.
b. Click Test Connection to establish or test the application server connection for SAS Data
Integration Studio. An Information window appears and verifies a successful connection.
Double-clicking this area in the status bar accesses the Default Application Server
window where a selection can be made and tested.
Practice
1. Establishing Global Options
Access SAS Data Integration Studio using the My Server connection profile and Bruno’s
credentials (User ID=Bruno; Password=Student1).
Access the Options window and set/verify the following:
• General option Show Output tab is selected
• General option Show advanced property tabs is cleared
• General option Enable row count on basic properties and data viewer for tables is
selected
• Job Editor option for Layout is set to Left to Right
• Job Editor option for Nodes is set to Collapse
• SAS Server option for Server is set to SASApp
Question: What effect does selecting the option Enable row count on basic properties and
data viewer for tables have on a View Data window?
Hint: Locate a metadata table object, right-click, and select Open.
Answer: ______________________________________________________________
2.4 Solutions
Solutions to Practices
1. Establishing Global Options
a. If necessary, access SAS Data Integration Studio as Bruno.
1) Select Start → SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK.
4) Enter Bruno as the user ID and Student1 as the password.
5) Click OK.
b. Select Tools → Options.
c. Verify that the General tab is selected.
1) Verify that Show Output tab is selected.
3) Click the Enable row count on basic properties and data viewer for tables option.
Question: What effect does selecting the option Enable row count on basic properties
and data viewer for tables have on a View Data window?
Answer: From the Folders tab, under Data Mart Development → DIFT Demo, right-click
the DIFT Test Table - ORDER_ITEM table object and select Open. The title
bar of the View Data window now shows the row count for the opened table.
d. Double-click the first Error (for the Table Loader) in the Status column.
The Details pane shifts its focus to the Warnings and Errors tab. The error indicates that
the physical location for the target library does not exist.
Now you must discover the physical location that is specified for the library object.
Lesson 3 SAS® Data Integration Studio: Defining Source Data Metadata
3.1 Setting Up the Environment ......................................................................................... 3-3
Demonstration: Defining Custom Folders ................................................................... 3-5
5. Right-click the Data Mart Development folder and select New Folder.
A new folder is created and Untitled is the initial name.
6. Enter Orion Source Data as the name of the folder and press Enter.
7. Right-click the Data Mart Development folder and select New Folder.
8. Enter Orion Target Data as the name of the folder and then press Enter.
9. Right-click the Data Mart Development folder and select New Folder.
10. Enter Orion Jobs as the name of the folder and then press Enter.
11. Right-click the Data Mart Development folder and select New Folder.
12. Enter Orion Reports as the name of the folder and then press Enter.
The final set of folders should resemble the following:
3.01 Activity
1. Access SAS Data Integration Studio using the My Server connection profile with
Bruno’s credentials (Bruno / Student1).
2. Define four metadata folders under Data Mart Development.
• On the Folders tab, right-click the Data Mart Development folder and select
New Folder.
• Enter Orion Target Data as name for the folder.
• Create three additional folders under Data Mart Development:
✓ Orion Source Data
✓ Orion Jobs
✓ Orion Reports
Libraries
A library is a collection of
one or more files that is
referenced as a unit.
6. Click Next.
7. Specify the name and location of the new library.
a. Enter DIFT Orion Source Tables Library in the Name field.
b. Verify that the location is set to /Data Mart Development/Orion Source Data.
Note: If the location is incorrect, click Browse, and navigate to SAS Folders → Data Mart
Development → Orion Source Data.
8. Click Next.
The final settings on this page of the New Library Wizard should resemble the following:
3.02 Activity
1. Access SAS Data Integration Studio using the My Server connection
profile with Bruno’s credentials (Bruno / Student1).
2. Create a new library object with the following specifications:
Library Type: SAS
Metadata Name: DIFT Orion Source Tables Library
Metadata Location: /Data Mart Development/Orion Source Data
SAS Application Server: SASApp
Library Reference: diftodet
Library Location: D:\Workshop\OrionStar\ordetail
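The library object defined in this activity is the metadata equivalent of a LIBNAME statement like the following (the generated statement names the Base engine explicitly):

LIBNAME diftodet BASE "D:\Workshop\OrionStar\ordetail";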
The slide shows possible sources of data, including external files.
To access source data for SAS Data Integration Studio jobs, metadata objects for the source data
need to be defined. The tables can be in a DBMS or ERP system, and can also be in the form of
SAS tables.
DBMS (database management system) refers to system software for creating and managing
databases. The DBMS provides users with a systematic way to create, retrieve, and manage data.
ERP (enterprise resource planning) refers to business management software that typically includes
a database component.
In order for an existing table to be used in SAS Data Integration Studio, you must create a metadata
object referencing that table.
Fixed-width files: files in which data values appear in columns that are in fixed positions.
There are two wizards available to define the attributes of these two types of file:
• New Delimited External File Wizard
• New Fixed Width External File Wizard
There is an additional wizard, the New User Written External File Wizard, which sets up properties
for files whose structure is more complex than can be managed in the New Delimited External File or
New Fixed Width External File Wizards.
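The file metadata that these wizards capture corresponds to the kind of DATA step code sketched below. A delimited file is typically read with list input and a delimiter option, whereas a fixed-width file is read with column input; the file names and layouts here are made up for illustration.

/* Delimited file: values separated by a delimiter such as a comma */
data work.supplier_delim;
   infile 'D:\Workshop\example\suppliers.csv' dlm=',' dsd firstobs=2;
   input Supplier_ID Supplier_Name :$40. Country :$2.;
run;

/* Fixed-width file: values occupy fixed column positions */
data work.supplier_fixed;
   infile 'D:\Workshop\example\suppliers.dat';
   input Supplier_ID     1-6
         Supplier_Name $ 8-47
         Country       $ 49-50;
run;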
Table Metadata
Metadata for tables is needed for source and target data. There are a few
dependencies that come into play.
We say that the library object is used by the table object, and that the table object is dependent on
the library object.
Note: Multiple tables (SAS or otherwise) can be registered from a single library.
We say that the server object is used by the library object, and the library object is used by the
table object. In addition, the table object is dependent on the library object, and the library object is
dependent on the server object.
We say that the ODBC data source is used by the server object, that the server object is used by
the library object, and that the library object is used by the table object. In addition, the table object
is dependent on the library object, the library object is dependent on the server object, and the
server object is dependent on the ODBC data source.
The Register Tables Wizard is used to register metadata for existing tables – any existing tables.
Therefore, the tables could be SAS tables or DBMS tables or tables accessed through an ODBC
connection.
If SAS is selected as the source type, the wizard continues to a SAS library selection step.
Note: Only the table types (SAS/ACCESS engines) that are licensed for your site are available
for use.
Note: The following two alternatives are available to initiate the Register Tables Wizard:
• Right-click a folder in the Folders tree where metadata for the table should be saved,
and then select Register Tables from the pop-up menu.
• Right-click a library and select Register Tables.
Note: This step is omitted from the Register Tables Wizard when you register a table through
a library because the type of table is a library property (library engine).
7. Click Next.
The Select a SAS Library window appears.
8. Click next to the SAS Library field and then select DIFT Orion Source Tables Library.
9. Click Next.
The Define Tables and Select Folder Location window appears.
Note: If the location is incorrect, click Browse and navigate to SAS Folders → Data Mart Development → Orion Source Data.
12. Click Next. The review window appears.
13. Verify that the information is correct.
The metadata object for the table is found in the Folders tree under the Orion Source Data
folder.
15. Right-click the PRODUCT_LIST metadata table object and select Properties.
16. Enter DIFT as a prefix to the default name.
17. Remove the description.
Practice
For these practices, access SAS Data Integration Studio using My Server as the connection profile
and log on using Bruno’s credentials (Bruno/Student1).
1. Registering SAS Tables from DIFT Orion Source Tables Library
• Place the table objects in the Data Mart Development → Orion Source Data folder.
• Use the Register Tables Wizard to register two SAS tables.
• Register the PRODUCT_LIST and STAFF tables found in the DIFT Orion Source Tables
Library.
• Add the prefix DIFT to the default metadata name of each table.
• Remove the table description if it exists for either table.
Additional SAS tables are needed for the course workshops. To access these tables, a new
library object must be registered. The specifics for the library are listed below.
Libref: diftsas
• Place the table objects in the Data Mart Development → Orion Source Data folder.
• Register the tables listed below, found in the DIFT SAS Library.
• Add the prefix DIFT to the default metadata name of each table.
• Remove any table descriptions.
Recall that the Inventory tab organizes known metadata objects into groups by object type. In particular, for library metadata objects, it provides a hierarchical view of the library and the tables registered for that library. This view is often useful.
3.5 Registering DBMS Table Metadata
The Register Tables Wizard has a choice for Oracle under the Database Sources group.
It is best to first define server metadata, and then library metadata (that uses the server metadata),
and then table metadata (that uses the library metadata).
• The server object is used by the library object.
• The library object is used by the table object.
• The table object is dependent on the library object.
• The library object is dependent on the server object.
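This dependency order also mirrors the code that the metadata ultimately generates: the DBMS library metadata becomes a SAS/ACCESS LIBNAME statement whose connection options come from the server metadata. A sketch follows, assuming SAS/ACCESS Interface to Oracle; the libref, schema, and credentials are placeholders.

   /* The Oracle library metadata generates a SAS/ACCESS LIBNAME statement.   */
   /* PATH=, USER=, and PASSWORD= come from the server and authentication     */
   /* metadata; the values shown here are placeholders.                       */
   libname oralib oracle path=xe user=orion_user password="XXXXXXXX" schema=orion;

   /* Once assigned, registered Oracle tables are read like any SAS table.    */
   proc print data=oralib.orders(obs=5);
   run;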
The Connection Properties window for the DIFT Oracle Server appears.
a. Click the Options tab.
b. Verify that the Oracle Path Information specifies a path of xe.
Note: xe is a TNS name that was defined when the Oracle client was configured. The name
xe was chosen by the image builder because the express version of Oracle was installed.
c. Verify that the Authentication type field is set to User/Password.
d. Verify that the Authentication domain field is set to OracleAuth.
The Oracle user ID and password are stored in the OracleAuth authentication domain.
e. Click Cancel to close the Connection: DIFT Oracle Server Properties window.
6. Click the User Manager plug-in on the Plug-ins tab in SAS Management Console.
7. Right-click the Data Integrators group in the right pane and select Properties. The Data Integrators Properties window appears.
8. Click the Members tab.
The Data Integrators group has an Oracle account. The Oracle credentials are stored in the
OracleAuth authentication domain. An authentication domain is a metadata object that stores the
credentials to access a server or a DBMS.
10. Click Cancel to close the Data Integrators Properties window.
11. Select File → Exit to close SAS Management Console.
7. Click Next.
Note: There is only one defined server in metadata of type Oracle. This is the Oracle database
server we just explored in SAS Management Console.
16. Click Next.
The review window shows the following:
17. Click Finish. This completes the metadata definition for the library object.
Note: The xe path defined for the DIFT Oracle Server connection surfaces through the
DIFT Oracle Library.
9. Click Next.
10. Select both the ORDERS table and the ORDER_ITEM table.
a. Click the ORDERS table.
b. Hold down the Ctrl key and click ORDER_ITEM.
11. Verify that the metadata location for the metadata table object is /Data Mart Development/
Orion Source Data.
Note: The data values displayed here were read directly from the Oracle table. The metadata table object provides direct access to the data in the database and interacts directly with the database to retrieve that data, both here and when the table is used in jobs.
b. Select File → Close to close the View Data window.
16. Investigate and update properties for the ORDER_ITEM table.
a. Right-click the ORDER_ITEM metadata table object and select Properties.
b. Enter DIFT ORDER_ITEM in the Name field.
Practice
For these practices, access SAS Data Integration Studio using the My Server connection profile and
log on using Bruno’s credentials (Bruno/Student1).
Additional tables (from an Oracle DBMS) are needed for the course workshops. To access these
tables, a new library object (for an Oracle DBMS) must be registered. The needed Oracle Server
metadata object already exists.
Libref: diftora
Two Oracle tables need to be registered. The specifics are listed below.
• Place the table objects in the Data Mart Development → Orion Source Data folder.
• Register the tables listed below, found in the DIFT Oracle Library.
• Add the prefix DIFT to the default metadata name of each table.
3.6 Registering ODBC Data Source Table Metadata
ODBC Data Source tables can be registered using the Register Tables Wizard.
Note: For the course image, this is the ODBC Data Source Administrator for 64-bit drivers.
It can also be accessed by double-clicking C:\Windows\System32\odbcad32.exe.
Note: Quotation marks are needed if the Datasrc value contains blanks. (A LIBNAME sketch follows these steps.)
d. Click Cancel to close the Connection: Orion Star Contacts Properties window.
6. On the Connections tab, right-click the connection named Orion Star Orders and select
Properties.
a. Click the Options tab.
b. Verify that Datasrc is selected.
c. Verify that "Orion Star Orders" is specified in the Datasrc field.
d. Click Cancel to close the Connection: Orion Star Orders Properties window.
7. Select File → Exit to close SAS Management Console.
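As noted in the steps above, the ODBC library metadata ultimately generates a LIBNAME statement for the ODBC engine, and a data source name that contains blanks must be quoted. A sketch, with the libref as an assumption:

   /* The ODBC library metadata generates a LIBNAME statement for the ODBC    */
   /* engine. The data source name contains blanks, so it is quoted.          */
   libname odblib odbc datasrc="Orion Star Orders";

   /* List the tables that are available through the data source.             */
   proc datasets library=odblib;
   quit;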
16. Select DIFT ODBC Microsoft Access Server in the Database Server field.
17. Select Connection: Orion Star Contacts in the Connection field.
9. Verify that the location is set to /Data Mart Development/Orion Source Data.
10. Click Next.
The review window shows the following:
The three tables appear under the Orion Source Data folder on the Folders tab.
3.03 Activity
1. Access SAS Data Integration Studio using the My Server connection
profile with Bruno’s credentials (Bruno / Student1).
2. On the Folders tab, right-click the Data Mart Development folder and
select Import SAS Package.
3. Select the package
D:\Workshop\dift\solutions\DIFT_ODBCObjects.spk.
4. Accept the default selections in each page in the wizard.
3.7 Registering Metadata for External Files
Similar to SAS or DBMS tables, external files can be used as sources and targets.
Unlike SAS or DBMS tables, which are accessed with SAS library engines, external files are accessed with SAS INFILE and INPUT statements for source files and FILE and PUT statements for target files.
Accordingly, external files have their own registration wizards. The three wizards are accessed by selecting File → New under the External File group.
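In generated job code, an external file object therefore turns into DATA step statements rather than a library reference. The sketch below illustrates both directions; the file paths, delimiter, and variable attributes are assumptions for illustration only.

   /* Source external file: read with INFILE and INPUT.                       */
   data work.supplier_info;
      infile "C:\data\supplier_information.csv" dlm=',' dsd firstobs=2;
      input Supplier_ID Supplier_Name :$40. Country :$2.;
   run;

   /* Target external file: write with FILE and PUT.                          */
   data _null_;
      set work.supplier_info;
      file "C:\data\supplier_out.csv" dlm=',';
      put Supplier_ID Supplier_Name Country;
   run;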
20. Click Refresh in the lower pane (with the File tab active) to see the first 10 records in the
external file.
Note: You can manually define properties for each field by clicking the New button. If there are
many fields, the wizard provides some functions to automate the building of initial
metadata.
21. Click Auto Fill in the upper pane of the Column Definitions window. The Auto Fill Columns
window appears.
22. Enter 2 in the Start record field in the Guessing records area.
Note: Default informats and formats can be selected for character and numeric fields. We do
not use this feature in this demo.
23. Click OK to close the Auto Fill Columns window.
The upper pane of the Column Definitions window is populated with six column definitions: three
numeric and three character values.
25. Click Get the column names from column headings in this file.
26. Verify that 1 is entered in the The column headings are in file record field.
Note: Column properties can be imported from various other sources. We use the fourth option
to import column names from the source file.
27. Click OK.
The Name fields are populated with the column names.
Supplier_ID Supplier ID
30. Click the Data tab in the lower pane of the Column Definitions window.
31. Click Refresh.
Note: This action executes the generated code (Source tab) and displays the results. This lets
you verify that the field properties are appropriate and look for any mistakes in the wizard
configuration. The log is useful if errors occur.
32. Click Next.
Practice
8. Importing Metadata for a Delimited External File
Access SAS Data Integration Studio using the My Server connection profile with Bruno's
credentials (Bruno/Student1).
• Launch the import from the Data Mart Development folder.
• Import metadata from the SAS package file DI1_DelimExtFile.spk found in
D:\Workshop\dift\solutions.
Question: How many columns are defined for DIFT Supplier Information?
Answer: ____________________________________
3.8 Solutions
Solutions to Practices
1. Registering SAS Tables from DIFT Orion Source Tables Library
a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
1) Select Start → All Programs → SAS → SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK to close the Connection Profile window and open the Log On window.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.
b. Click the Folders tab.
1) Expand Data Mart Development → Orion Source Data.
2) Verify that the Orion Source Data folder is selected.
c. Select File → Register Tables. The Register Tables Wizard appears.
1) Select SAS as the type of table.
2) Click Next. The Select a SAS Library window appears.
3) Click next to the SAS Library field and then select DIFT Orion Source Tables
Library.
4) Click Next. The Define Tables and Select Folder Location window appears.
5) Click the PRODUCT_LIST table and then hold down the Ctrl key and click the STAFF
table to select both.
6) Verify that /Data Mart Development/Orion Source Data is selected in the Location
field.
7) Click Next. The review window appears.
8) Verify that the information is correct.
9) Click Finish.
The metadata objects for the tables are found in the Orion Source Data folder.
d. Edit the properties of table objects.
1) Right-click the PRODUCT_LIST metadata table object and select Properties.
a) Enter DIFT as a prefix to the default name.
b) Remove the description.
c) Click OK to close the Properties window.
2) Right-click the STAFF metadata table object and select Properties.
a) Enter DIFT as a prefix to the default name.
b) Click OK to close the Properties window.
Question: How many columns are defined for DIFT Supplier Information?
Answer: Six (6)
Note: It is easiest to define the above fields in order, from top to bottom, to avoid overlapping fields.
m. Click Yes. Review the data values on the Data tab. (The values for Sales and Cost are
missing in the first 120 rows.)
n. Click Next.
o. If the warning window appears again, click Yes.
The review window displays general information for the external file.
p. Click Finish. The metadata object for the external file is found in the Orion Source Data
folder.
Lesson 4 SAS® Data Integration Studio: Defining Target Data Metadata
4.1 Registering Metadata for Target Tables ........................................................................ 4-3
Demonstration: Refresh the Metadata........................................................................ 4-8
Demonstration: Defining the Product Dimension Table Metadata ................................. 4-12
Practice............................................................................................................... 4-20
4.1 Registering Metadata for Target Tables
Note: The External File wizards were discussed in previous materials to define metadata for source files. The same wizards are used to define target files.
The first page of the New Table Wizard enables you to specify a metadata name, description, and location for the new table object.
The second page of the New Table Wizard specifies Table Storage Information to include
• DBMS type
• library location for the selected DBMS type
• valid name of the new table.
The third page of the New Table Wizard enables the selection of metadata for columns that are defined for existing table objects.
The next page of the New Table Wizard enables you to reorder columns and to define indexes.
3. Click Yes.
4. A series of Delete Library windows might appear. If so, click Yes in each to confirm the deletion.
Possible Delete Library windows:
5. Click Next.
6. Verify that all four Orion folders are selected.
7. Click Next.
8. Verify that four different types of connections are shown to need to be established.
9. Click Next.
10. Verify that SASApp is listed for both the Original and Target fields.
11. Click Next.
12. Verify that both servers (ODBC Server and Oracle Server) have matching values for the Original
and Target fields.
13. Click Next.
14. Verify that both file paths have matching values for the Original and Target fields.
15. Click Next.
16. Verify that both directory paths have matching values for the Original and Target fields.
17. Click Next. The Summary pane surfaces.
18. Click Next.
19. Verify the import process completed successfully.
20. Click Finish.
21. Expand the Data Mart Development → Orion Source Data folder.
The folder and other metadata objects should resemble the following:
4.01 Activity
1. Access SAS Data Integration Studio using the My Server connection
profile with Bruno’s credentials (Bruno / Student1).
2. Delete the four Orion folders under the Data Mart Development folder.
3. Click the Data Mart Development folder, and then select File → Import SAS Package.
5. Step through the Import Wizard, verifying the four types of connections, and then click Finish to exit the Import Wizard.
8. Click Next.
c. Click Next.
d. Double-click SASApp in the Available servers list to move it to the Selected servers
list.
e. Click Next.
f. Specify the needed library properties.
1) Enter difttgt in the Libref field.
2) Click New in the Path Specification area.
a) In the New Path Specification window, click Browse next to Paths.
i. Click Finish to close the New Library Wizard and return to the New Table Wizard.
12. Verify that the new library DIFT Orion Target Tables Library is selected in the Library field.
13. Enter ProdDim in the Name field.
25. Define two simple indexes: one for Product_ID and one for Product_Group.
a. Click Define Indexes. The Define Indexes window appears.
b. Click New to add the first index.
c. Enter an index name of Product_ID and press Enter.
Note: Be sure to press Enter. If you do not, the name of the index is not saved.
d. Select the Product_ID column and move it to the Indexes pane by clicking .
e. Click New to add the second index.
f. Enter an index name of Product_Group and press Enter.
Copyright © 2020, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.
4-18 Lesson 4 SAS® Data Integration Studio: Defining Target Data Metadata
g. Select the Product_Group column and move it to the Indexes pane by clicking .
The two requested indexes are defined in the Define Indexes window.
Note: A simple index in a SAS table must have the same name as its column. A warning dialog box is presented if an index name does not match its column name. Clicking Yes in the dialog box enables SAS Data Integration Studio to match the index name to its column name. (A Base SAS sketch of the equivalent index definitions follows this demonstration.)
h. Click OK to close the Define Indexes window and return to the New Table Wizard.
26. Click Next.
27. Review the metadata listed in the summary window.
30. Verify that the new table and new library objects appear in the Orion Target Data folder.
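The index metadata defined in the steps above becomes index definitions on the physical table when a job creates it. A Base SAS sketch of the equivalent definitions, assuming the table has already been created in the difttgt library:

   /* Simple indexes on a SAS table must have the same names as their columns. */
   proc datasets library=difttgt nolist;
      modify ProdDim;
      index create Product_ID Product_Group;
   quit;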
Practice
1. Importing Metadata for DIFT Product Dimension Table
Access SAS Data Integration Studio using the My Server connection profile with Bruno's
credentials (Bruno/Student1).
• Right-click the Data Mart Development folder and import from the SAS package
D:\Workshop\dift\solutions\DIFT_ProdDimPlus.spk.
– Select all defaults
– When Warning appears:
▪ Click No.
Question: How many objects were imported in the Orion Target Data folder?
Answer: _____________________________________________________
Note: Notice the new names for ORDER_DATE and DELIVERY_DATE as well as the updated formats.
4.2 Importing Metadata
The SAS package format is a SAS internal format and supports most SAS platform metadata objects, including objects relevant to SAS Data Integration Studio, such as jobs, libraries, tables, and external files.
The Common Warehouse Metamodel (CWM) is an industry standard format that is supported by
many software vendors. The CWM format supports relational metadata such as tables, columns,
indexes, and keys.
The SAS package format can be used to move metadata between SAS metadata repositories, move
metadata between environments such as from development-to-test and test-to-production, maintain
backups of metadata, and keep archived versions of metadata objects.
Relational metadata, including the CWM format, can be used to exchange relational metadata
between software applications and import models from third-party data modeling tools into SAS Data
Integration Studio.
Metadata Bridges
Vendor implementations of the CWM format differ.
When licensing SAS Data Integration Studio, you get the choice of several SAS Metadata Bridges.
7. Click Next.
8. Specify the file to import.
a. Click Browse next to the Filename field to open the Select a file window.
b. Navigate to D:\Workshop\dift\data.
c. Select OMG_CWM_XMI.xml.
d. Click OK to close the Select a file window.
9. Verify that the folder location is set to /Data Mart Development/Orion Target Data.
18. Click Finish. The metadata is imported to the SAS metadata environment.
An information window appears.
Name Format
EMPLOYEE_ID 12.
START_DATE date9.
END_DATE date9.
JOB_TITLE (None)
SALARY dollar12.
GENDER $gender.
BIRTH_DATE date9.
EMP_HIRE_DATE date9.
EMP_TERM_DATE date9.
MANAGER_ID 12.
4.3 Solutions
Solutions to Practices
1. Importing Metadata for DIFT Product Dimension Table
a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
1) Select Start → All Programs → SAS → SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK to close the Connection Profile window and open the Log On window.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.
b. Click the Folders tab.
c. Click the Data Mart Development folder.
d. Select File → Import SAS Package.
1) Click Browse.
2) If necessary, navigate to D:\Workshop\dift\solutions.
3) Click the file DIFT_ProdDimPlus.spk.
4) Click OK.
5) Verify that All Objects is selected.
6) Click Next.
7) Verify that two objects under the Orion Target Data folder are to be imported.
8) Click Next.
9) Verify that two connection points must be established.
10) Click Next.
11) Verify that SASApp is listed for both the Original and Target fields.
12) Click Next.
13) Verify that D:\Workshop\dift\datamart is listed for both the Original and Target fields.
14) Click Next.
A Warning window appears, stating that the physical location specified does not exist.
Question: How many objects were imported in the Orion Target Data folder?
Answer: Two, a library and a table object.
h. Right-click the table object (DIFT Product Information) and select Properties.
1) Click the Columns tab.
2) Locate the Length value for the Product_Category column and update it to 25.
3) Locate the Length value for the Product_Group column and update it to 25.
4) Locate the Length value for the Product_Line column and update it to 25.
5) Locate the Length value for the Supplier_ID column and update it to 4.
6) Click the Indexes tab.
7) Under Indexes, click New.
8) Type Product_Group as the new index name and press Enter.
9) Select Product_Group from the Columns list and move to the new index.
10) Click OK to close the Properties window.
2) Click . (All columns from the selected table are moved to the Selected pane.)
3) Select the DIFT ORDERS table object.
4) Click to move all columns from the selected table to the Selected pane.
An Error window appears and indicates that Order_ID cannot be added twice.
5) Click OK.
6) Click Next.
g. Update the column attributes as follows:
Note: Notice the new names for ORDER_DATE and DELIVERY_DATE as well as updated
formats.
h. Click Next.
i. Review the metadata listed in the summary window.
j. Click Finish.
The new table object appears in the Orion Target Data folder.
Lesson 5 SAS® Data Integration Studio: Working with Jobs
5.1 Creating Metadata for Jobs .......................................................................................... 5-3
Demonstration: Refresh the Metadata ..................................................................... 5-13
Demonstration: Populating the Current and Terminated Staff Tables ............................. 5-16
Practices ............................................................................................................. 5-28
5.1 Creating Metadata for Jobs
We currently have metadata defined for several types of source data: SAS tables, Oracle tables (our DBMS example), Microsoft Access tables (our ODBC example), and external files.
In addition, we have metadata defined for several target tables.
What We Need to Do
The next step is to define processes (jobs) that
• read from sources
• perform necessary data transformations
• load targets.
We now need to define metadata for a new type of metadata object – a job. Jobs will allow us to read from our source data, process that data with transformations, and then load our target. Initially our "targets" will be tables, but it is possible to create output or target information in the form of a report.
What Is a Job?
Job: Metadata object that organizes sources, targets, and transformations into processes that create output
As specified above, a job is simply another metadata object. A job object organizes source data
metadata objects with various transformations to generate a result. The result is often a target table
but could be a report.
In the top left screen, we see a process flow diagram (on the Diagram tab) that contains two source
objects (both are table metadata objects that are pointing to SAS tables), two transformations (Join
and Table Loader), and a single target object (a SAS table metadata object). SAS Data Integration
Studio uses the job metadata to generate SAS code (shown in the lower screen shot) that executes
the job process.
The direction of the arrows in the process flow indicates the order of the transformations. The above process flow diagram shows a job that reads from a source table called STAFF, processes the Extract transformation to produce a work (temporary) table, which in turn is processed by the Sort transformation to create a registered (permanent) table called US Staff Sorted. In any process flow diagram, table and external file objects have tan colored nodes, and transformations are the blue nodes. Also, table objects are decorated with a symbol in the upper right corner that visually identifies the type of table being read or being created.
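A job like this one generates one code step per transformation, and the steps execute in arrow order. A rough sketch of what the Extract and Sort steps might produce; the librefs and the subsetting condition are assumptions for illustration:

   /* Extract transformation: subset the source into a temporary work table.  */
   proc sql;
      create table work.staff_extract as
      select *
      from srclib.staff
      where Country = 'US';   /* assumed subsetting condition */
   quit;

   /* Sort transformation: order the rows and create the registered target.   */
   proc sort data=work.staff_extract out=tgtlib.us_staff_sorted;
      by Employee_ID;
   run;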
Read this in top-to-bottom order, corresponding to the direction of the arrows.
New Jobs
New jobs are initialized by the New Job window. In this window, you specify the following job properties:
• name
• description
• metadata location
A new job can be initiated from the Folders tree by clicking the desired location folder and then selecting File → New → Job. The New Job window allows specification of the metadata name for the job and a description.
the job and a description. If the New Job window was launched from an undesired location, you have
the opportunity to change the folder destination in the Location field. Selecting OK launches the job
editor with an "empty" job, that is, initially there are no sources, transformations, or targets on the
Diagram tab.
If you have an existing job object that you want to edit, you can simply right-click the job and select
Open (you can also double-click the job object – the double-click action is to Open).
Job Editor
The "main area" is a tabbed interf ace where we have possibly four tabs:
Tab Purpose
Diagram Used to build and update the process flow for a job
The Details pane has tabs to monitor job execution and to aid with debugging job errors.
The Details pane can have the f ollowing tabs:
Tab Purpose
Control Flow Is used to review and update the execution sequence of the steps in a
job.
Introduction to Transformations
Each process in a process flow diagram is specified by a metadata object called a transformation. A transformation allows you to specify how to extract data, transform data, or load data into data stores.
Each transformation that you specify in a process flow diagram generates or retrieves SAS code. A transformation's generated code can be augmented or even replaced. You can also specify user-written code for any transformation in a process flow diagram.
Transformations Tree
The Transformations tree organizes available transformations into
categories.
The availability of transformations for many common processing tasks enables rapid development of process flows for common scenarios. The above display shows the standard Transformations tree.
Splitter Transformation
The Splitter transformation
• is found in the Data category of transformations
• can be used to create one or more subsets of
a source.
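Under the covers, a Splitter step makes a single pass over the source and writes each subset to its own output table. A minimal sketch, assuming the current/terminated split on Emp_Term_Date that is used in the demonstration that follows; the librefs are placeholders:

   /* Splitter transformation: one pass over the source, two subset outputs.  */
   data tgtlib.current_staff tgtlib.term_staff;
      set srclib.staff;
      if missing(Emp_Term_Date) then output tgtlib.current_staff;
      else output tgtlib.term_staff;
   run;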
b. Press the Shift key and click the Orion Target Data folder.
2. Select Edit → Delete.
A Confirm Delete window appears.
3. Click Yes.
4. A series of Delete Library windows might appear. If so, click Yes in each window to confirm
deletion.
The metadata under the Data Mart Development folder should resemble the following:
Note: In the Orion Target Data folder, there is a table object named DIFT US Suppliers. This
was not discussed or defined previously. It is a table object defining a SAS table to be
created in the DIFT Orion Target Tables Library. The columns for the table were
created from the columns in the DIFT Supplier Information external file object. Two new
tables named DIFT Old Orders and DIFT Recent Orders will be used in an upcoming
practice.
8. Click OK.
Note: When a job window is active, objects can also be added to the diagram by right-
clicking the object and selecting Add to Diagram.
Note: The above screen capture does not show the Details pane.
10. Select File → Save to save the diagram and job metadata to this point.
11. Add the Splitter transformation to the diagram.
a. In the tree view, click the Transformations tab.
b. Expand the Data grouping.
c. Click the Splitter transformation.
d. Drag the Splitter transformation to the Diagram tab of the Job Editor.
e. Position the Splitter transformation next to the source table object.
Note: The Splitter transformation, by default, produces two work tables. (More can be
produced by specifying the properties of the Splitter transformation.) Notice that the
two work table objects are represented by the green icons located to the right of the
Splitter transformation.
12. Select File → Save to save the diagram and job metadata to this point.
b. Place the cursor over the connection selector. The cursor changes to a pencil.
c. With the cursor over the connection selector (and the pencil cursor visible), click the
connection selector and drag it to the Splitter transformation. Release the cursor when it is
over the Splitter transformation.
14. Select File → Save to save the diagram and job metadata to this point.
15. Add the target table objects to the diagram.
a. Click the Folders tab.
b. If necessary, expand the Data Mart Development → Orion Target Data folder.
c. Hold down the Ctrl key and click the two target table objects (DIFT Current Staff
and DIFT Terminated Staff).
d. Drag the two objects to the Diagram tab of the Job Editor.
16. Select File → Save to save the diagram and job metadata to this point.
17. Connect the Splitter transformation to the target table objects.
The two target tables are loaded with direct one-to-one column mappings of subset data and
no additional load specifications. Therefore, no Table Loader transformation is needed for either
of the target tables. Hence, the two work table objects must be deleted in order to connect the
transformation directly to the target table objects.
a. Right-click one of the work table objects of the Splitter transformation and select Delete.
b. Right-click the second work table object of the Splitter transformation and select Delete.
All work tables are now removed from the Splitter transformation.
c. Place the cursor over the Splitter transformation to reveal the connection selector
until the cursor changes to a pencil.
d. When the pencil cursor appears, click and drag to the first output table, DIFT Current Staff.
e. Place the cursor over the Splitter transformation to reveal the connection selector.
f. Click the connection selector and drag to the second output table, DIFT Terminated Staff.
18. Select File → Save to save the diagram and job metadata to this point.
19. Specify the properties of the Splitter transformation.
a. Right-click the Splitter transformation and select Properties.
d. Specify the subsetting criteria for the DIFT Terminated Staff table object.
1) Verify that the DIFT Terminated Staff table object is selected in the Target Tables pane.
2) Select Row Selection Conditions in the Row Selection Type field.
3) Click Subset Data below the Selection Conditions area. The Expression Builder
window appears.
4) Click the Data Sources tab.
5) Expand the STAFF table.
6) Select the Emp_Term_Date column.
7) Click Add to Expression.
A column mapping indicates that data passes from the source column to the target column.
g. Click OK to close the Splitter Properties window.
20. Select File → Save to save the diagram and job metadata to this point.
21. Run the job.
a. Click Run on the job toolbar.
Note: A job can also be processed by selecting Actions → Run or by right-clicking in the job background and selecting Run from the pop-up menu.
25. Scroll to view the note about the creation of the DIFTTGT.TERM_STAFF table.
26. View the data for the DIFT Current Staff table object.
a. Click the Diagram tab in the Job Editor.
b. Right-click the DIFT Current Staff table object and select Open.
c. Scroll right to the EMP_TERM_DATE column. All EMP_TERM_DATE values are missing.
d. After you view the data, select File → Close to close the View Data window.
27. View the data for the DIFT Terminated Staff table object.
a. Right-click the DIFT Terminated Staff table object and select Open.
b. Scroll right to the EMP_TERM_DATE column. All EMP_TERM_DATE values are
nonmissing.
c. After you view the data, select File → Close to close the View Data window.
28. Select File → Close to close the Job Editor. If necessary, save changes to the job. The new job object appears on the Folders tab.
Practices
1. Refreshing Course Metadata to Current Point
• Access SAS Data Integration Studio using the My Server connection profile with Bruno's credentials (Bruno/Student1).
• Delete all four Orion folders under the Data Mart Development folder.
– Select all four Orion folders and then select Edit → Delete.
– Accept defaults for deletion (click Yes for each Delete Library window that appears – this
will delete the associated table objects for each library).
• From the Data Mart Development folder, import fresh metadata from DIFT_Ch4Ex1.spk
(SAS package located in D:\Workshop\dift\solutions).
– Choose All Objects.
– Verify that only Orion folders are selected (if necessary, clear selection for DIFT Demo).
– Verify that the Original and Target fields match for four different connection point
panels.
Question: How many library objects were imported?
Answer: _________________________________________________________
Question: What is the name of the job object in the Orion Jobs folder?
Answer: _________________________________________________________
Question: How many table objects were imported in the Orion Target Data folder?
Answer: _________________________________________________________
• Define the connection between the source table and the transformation using the
Connections window.
– Right-click the DIFT STAFF table object in the job flow diagram and select Connections.
– Click under Output Ports. The Output Port – Data window appears.
– Click the Splitter transformation under Output Node.
– Click OK.
– Select File → Save.
• Add target tables using the Replace functionality.
– Right-click the top output table on the Splitter transformation and select Replace.
o In the Table Selector window (on the Folders tab), expand Data Mart Development → Orion Target Data.
o Click the DIFT Current Staff table.
o Click OK.
– Right-click the remaining output table on the Splitter transformation and select Replace.
o In the Table Selector window (on the Folders tab), expand Data Mart Development → Orion Target Data.
o Click the DIFT Terminated Staff table.
o Click OK.
Question: On the Inventory tab, under what grouping do you find the job?
Answer: __________________________________________________________
5.2 Working with the Join Transformation
Join Transformation
The Join transformation
• is found in the SQL category of transformations
• generates PROC SQL code
• features a unique graphical interface.
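Behind the graphical interface, the Join transformation writes a PROC SQL step. A simplified sketch of the kind of query it builds; the librefs and selected columns are assumptions:

   /* The Join transformation generates a PROC SQL query along these lines.   */
   proc sql;
      create table work.order_detail as
      select o.Order_ID,
             o.Order_Date,
             oi.Product_ID,
             oi.Quantity
      from oralib.orders as o inner join
           oralib.order_item as oi
           on o.Order_ID = oi.Order_ID;
   quit;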
To access this window, right-click the Join transformation in a job and select
Open.
Note: Double-clicking the Join transformation in a job also opens the Designer window.
The Diagram tab appears in the main area of the Designer window when Join is selected in the Navigate pane. The Diagram tab enables you to design the needed clauses for your SQL query with a drag-and-drop interface similar to a job flow.
The Tables pane appears when a table object is selected in the Navigate pane or when Select is selected in the Navigate pane. The Tables pane might also open when other aspects of particular joins are requested (for example, the surfacing of Having, Group by, and Order by information).
5.2 Working with the Join Transformation 5-35
45
C o p y r i g h t © S AS In s t i tu t e In c. Al l r i g h t s re s e r ve d .
The Select tab appears in the main area of the Designer window when Select is selected in the Navigate pane. The Select tab enables you to maintain the mappings from the sources to the target. The Select tab can also be used to specify calculated columns for the target table.
The Where tab appears in the main area of the Designer window when Where is selected in the Navigate pane (if a WHERE clause is specified as part of the SQL query). The Where tab enables you to specify the needed subsetting or join criteria for the SQL query.
Calculated Columns
Three columns for the Product Dimension target table must be calculated.
The three columns (Product_Line, Product_Category, Product_Group) are
encoded in the Product_ID.
Product_ID has 12 digits.
• The first two digits of Product_ID define the Product_Line.
• The first four digits of Product_ID define the Product_Category.
• The first seven digits of Product_ID define the Product_Group.
Column            Value
Product_ID        210100100001
Product_Line      210000000000
Product_Category  210100000000
Product_Group     210100100000
Product_Line = int(Product_ID/10000000000)*10000000000
Product_Category = int(Product_ID/100000000)*100000000
Product_Group = int(Product_ID/100000)*100000
Or more simply:
Product_Line = int(Product_ID/1e10)*1e10
Product_Category = int(Product_ID/1e8)*1e8
Product_Group = int(Product_ID/1e5)*1e5
Replacing the last five digits in Product_ID with zeros returns Product_Group from the format.
Replacing the last eight digits in Product_ID with zeros returns Product_Category from the format.
Replacing the last 10 digits in Product_ID with zeros returns Product_Line from the format.
Division and the INT function truncate the last five, eight, or 10 digits from Product_ID. Then
multiplication adds five, eight, or 10 zeros back to the truncated value. Finally, the PUT function
applies the PRODUCT. format (user-defined) to return the description of the Product_Group,
Product_Category, or Product_Line.
Or more simply:
Product_Line = put(int(Product_ID/1e10)*1e10, product.)
Product_Category = put(int(Product_ID/1e8)*1e8, product.)
Product_Group = put(int(Product_ID/1e5)*1e5, product.)
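These expressions can be checked outside the job. The following DATA _NULL_ step is a minimal
sketch (it is not part of the course job) that applies the truncation arithmetic to the sample
Product_ID value from the slide. The PUT calls are omitted here because the PRODUCT. format is
user-defined in the course data.

data _null_;
   Product_ID = 210100100001;
   /* keep the first two, four, and seven digits, respectively */
   Line_code  = int(Product_ID/1e10)*1e10;   /* expect 210000000000 */
   Categ_code = int(Product_ID/1e8)*1e8;     /* expect 210100000000 */
   Group_code = int(Product_ID/1e5)*1e5;     /* expect 210100100000 */
   put Line_code= best16.;
   put Categ_code= best16.;
   put Group_code= best16.;
run;
/* In the job, each value is then passed to PUT with the user-defined */
/* PRODUCT. format to return the corresponding description.           */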
7. Select File Save to save the diagram and job metadata to this point.
d. Drag the File Reader transformation to the Diagram tab of the Job Editor.
e. Position the File Reader transformation so that it is next to (to the right of) the external file
object, DIFT Supplier Information.
9. Rename the work table object associated with the File Reader transformation.
a. Right-click the (green) work table object and select Properties.
Replacing the name with FileReader makes this table easier to recognize when you
configure the next transformation in the process flow.
Note: New Physical Name replaces the current physical name with a new random name.
d. Drag the Join transformation to the Diagram tab of the Job Editor.
e. Position the Join transformation so that it is to the right of (and in between) the
DIFT PRODUCT_LIST table object and the File Reader transformation.
12. Select File Save to save the diagram and job metadata to this point.
b. Connect the DIFT PRODUCT_LIST table object to one of the input ports of the Join
transformation.
c. Connect the File Reader transformation's output to the second input port of the Join
transformation. (Click the work table icon associated with the File Reader transformation
and drag to the second input port of the Join transformation.)
14. Select File Save to save the diagram and job metadata to this point.
15. Add the DIFT Product Dimension table object as the output of the Join transformation.
a. Right-click the work table of the Join transformation and select Replace.
e. Click OK.
16. Select File Save to save the diagram and job metadata to this point.
17. Review the properties of the File Reader transformation.
a. Right-click the File Reader transformation and select Properties.
b. Click the Mappings tab.
c. Verify that all target columns have a column mapping.
e. Verify that the primary generated code is a simple DATA step with INFILE, ATTRIB, and
INPUT statements.
f. Click OK to close the File Reader Properties window.
18. Select File Save to save the diagram and job metadata to this point.
19. Specify the properties of the Join transformation.
a. Right-click the Join transformation and select Open. The Designer window appears.
Recall that the Designer window’s initial view includes the following:
• The Diagram tab in the main pane displays the current query in the form of a process flow
diagram. The name of this tab changes to match the object selected in the Navigate pane.
• The Navigate pane is used to access the components of the current query.
• The SQL Clauses pane is used to add SQL clauses or additional joins to the query.
• The Properties pane is used to display and update the properties of a selected item.
Note: The type of join can also be verified and changed by right-clicking the Join item in the
Navigate pane or the Join item on the Diagram tab (when the Join keyword is selected in
the Navigate pane). A pop-up menu displays a list of available join types with a check mark
next to the currently selected type.
d. Click the Where item in the Navigate pane to surface the Where tab in the main pane.
e. Verify that the inner join is executed based on the values of the Supplier_ID columns from
the sources being equal.
Note: Outer joins (left, right, full) do not use the Where item for the join condition. To set
conditions for an outer join, click the Join item in the Navigate pane.
f. Add an additional WHERE clause to subset the data.
1) Click New in the top portion of the Where tab.
A row is added with the logical AND as the Boolean operator.
2) Select Choose column(s) from the drop-down list under the first Operand field.
g. Click the Select item in the Navigate pane to surface the Select tab.
Note: A one-to-one mapping such as the above can also be performed by dragging the
source column to the target column.
Note: The three columns that are to be calculated remain unmapped.
j. Click to expand the Target table area. This provides more room to work with the
expressions.
An expression must be defined for three columns. However, the columns are not in order of
their scope. Product_Line describes the largest category, Product_Category describes the
next largest category of items, and Product_Group describes the smallest category of
items. They should be reordered for consistency.
2) Drag and drop the column just above the Product_Category column.
3) Verify that Product_Group is now 6.
Note: To build this expression using the Expression Builder, do the following:
• Click the Functions tab in the bottom part of the Expression window.
• Expand (by double-clicking) the Special grouping of functions.
• Single-click the PUT function.
• Click Add to Expression.
• With first argument of PUT highlighted (default), expand the Truncation
grouping of functions.
• Single-click the INT function.
• Click Add to Expression.
• Click the Data Sources tab in the bottom part of the Expression window.
• Expand DIFT PRODUCT_LIST.
• Single-click Product_ID.
• Verify that the INT argument is still highlighted (default).
• Click Add to Expression.
• Click the division (/) tool.
• Type 1e5.
• Move the cursor after the INT close parenthesis and before the comma.
• Click the multiplication (*) tool.
• Type 1e5.
• Move cursor to after the comma and highlight <value>.
• Type product..
6) Click Validate Expression.
7) Click No.
8) Click OK to close the Expression window.
e. Verify that each of the calculated columns does not have an established source column
mapping.
3) From the toolbar for the Select tab, click and select Update Mappings to Match
Columns Used in Expression.
4) Click the column Product_Category.
5) From the toolbar for the Select tab, click and select Update Mappings to Match
Columns Used in Expression.
6) Click the column Product_Line.
7) From the toolbar for the Select tab, click and select Update Mappings to Match
Columns Used in Expression.
8) Verify that each of the calculated columns now has a mapping from the source column
Product_ID.
Practices
3. Populating the OrderFact Table
• Access SAS Data Integration Studio using the My Server connection profile with Bruno’s
credentials (Bruno/Student1).
• Create a new job in the Orion Jobs folder with the name DIFT Populate Order Fact Table.
• Two tables should be joined together, DIFT ORDER_ITEM and DIFT ORDERS. These tables
are located in the Orion Source Data folder.
• Use the Join transformation to specify an inner join based on equality of the ORDER_ID
columns from the source tables.
• Replace the Join work table with the DIFT Order Fact table (located in the Orion Target
Data folder).
• The target column OrderDate should be calculated by taking the date portion of the source
column ORDER_DATE, which is a datetime column.
Hint: Use the DATEPART() function in an expression on the target side of the Select item in
the Join transformation.
• The target column DeliveryDate should be calculated by taking the date portion of the
source column DELIVERY_DATE, which is a datetime column.
Hint: Use the DATEPART() function in an expression on the target side of the Select item in
the Join transformation. (A rough PROC SQL sketch of the join and these expressions
appears after this practice.)
• After you verify that the table is created successfully (with no warnings), close the job.
Note: The DIFT Order Fact table should have 951,669 observations and 12 variables.
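The following PROC SQL step is a rough sketch of what this practice builds. The DIFTTGT library
name comes from the solution log later in this lesson; the source library and physical table names
are assumptions for illustration only, and the job generates its own code.

proc sql;
   create table difttgt.orderfact as
   select oi.*,
          datepart(o.ORDER_DATE)    as OrderDate    format=date9.,
          datepart(o.DELIVERY_DATE) as DeliveryDate format=date9.
   from work.order_item as oi            /* DIFT ORDER_ITEM (assumed name) */
        inner join work.orders as o      /* DIFT ORDERS (assumed name)     */
        on oi.ORDER_ID = o.ORDER_ID;
quit;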
• Access SAS Data Integration Studio using the My Server connection profile with Bruno’s
credentials (Bruno/Student1).
• Create a new job in the Orion Jobs folder with the name DIFT Populate Old and Recent
Orders Tables.
• Use the Splitter transformation to split the observations from the DIFT Order Fact table.
• Write the records from the Splitter transformation to the DIFT Old Orders and the DIFT
Recent Orders tables. These tables are located in the Orion Source Data folder.
• Old orders are defined as orders placed before January 1, 2009. You can use the following
expression to find the observations for this data:
OrderDate < '01Jan2009'd
• Recent orders are defined as orders placed on or after January 1, 2009. This expression can
be used to find the observations for this data (a conceptual DATA step equivalent of the split
appears after this practice):
OrderDate >= '01Jan2009'd
• After you verify that the tables are created successfully, close the job.
Note: The DIFT Recent Orders table should have 615,396 observations and 12 variables.
The DIFT Old Orders table should have 336,273 observations and 12 variables.
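Conceptually, the Splitter in this practice performs the equivalent of the following DATA step. This
is only a sketch; the transformation generates its own code, and the DIFTTGT library and physical
table names are taken from the solution log notes later in this lesson.

data difttgt.old_orders difttgt.recent_orders;
   set difttgt.orderfact;
   if OrderDate < '01Jan2009'd then output difttgt.old_orders;
   else if OrderDate >= '01Jan2009'd then output difttgt.recent_orders;
run;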
5.3 Solutions
Solutions to Practices
1. Refreshing Course Metadata to Current Point
a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
1) Select Start → All Programs → SAS → SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK to close the Connection Profile window and open the Log On window.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.
b. Click the Folders tab.
c. Expand Data Mart Development.
d. Delete the four Orion folders.
1) Click the Orion Jobs folder.
2) Press the Shift key and click the Orion Target Data folder.
5) Click Next.
6) Verify that all four Orion folders are selected.
7) Click Next.
8) Verify that four different types of connections need to be established.
9) Click Next.
10) Verify that SASApp is listed for both the Original and Target fields.
11) Click Next.
12) Verify that both servers (ODBC Server and Oracle Server) have matching values for the
Original and Target fields.
13) Click Next.
14) Verify that both file paths have matching values for the Original and Target fields.
15) Click Next.
16) Verify that all directory paths have matching values for the Original and Target fields.
17) Click Next. The Summary pane surfaces.
18) Click Next.
19) Verify the import process completed successfully.
20) Click Finish.
Question: What is the name of the job object in the Orion Jobs folder?
Answer: DIFT Populate Current and Terminated Staff Tables
Question: How many table objects were imported in the Orion Target Data folder?
Answer: Seven
e. Define the connection for the STAFF table using the Connections window.
1) Right-click the STAFF table in the job flow diagram and select Connections.
2) Click under Output Ports. The Output Port – Data window appears.
a) Click the Splitter transformation under Output Node.
2) In the Table Selector window (on the Folders tab), expand Data Mart Development →
Orion Target Data.
3) Click the DIFT Current Staff table.
4) Click OK.
The job flow diagram updates to the following:
5) Right-click the remaining output table on the Splitter transformation and select Replace.
6) In the Table Selector window (on the Folders tab), expand Data Mart Development →
Orion Target Data.
7) Click the DIFT Terminated Staff table.
8) Click OK.
This can be discovered by selecting the Log tab, scrolling, and locating the
note about the creation of DIFTTGT.CURRENT_STAFF table.
This can be discovered by selecting the Log tab, scrolling, and locating the
note about the creation of DIFTTGT.CURRENT_STAFF table.
Question: On the Inventory tab, under what grouping do you find the job?
Answer: DIFT Populate Current and Terminated Staff Tables is found in the Job
grouping on the Inventory tab
Note: The icon in the upper right corners of the metadata table objects
indicates that these are Oracle tables.
g. Select File Save to save the diagram and job metadata to this point.
h. Add the Join transformation to the diagram.
1) In the tree view, click the Transformations tab.
2) Expand the SQL grouping.
3) Select the Join transformation.
4) Drag the Join transformation to the diagram.
5) Place the Join transformation to the right of the source table objects.
l. Select File Save to save the diagram and job metadata to this point.
3) In the Join Properties pane, verify that the join is an inner join.
a) Click next to the Target table name to expand the target side.
f) Click to collapse the target table attributes back to the right side.
g) Right-click the OrderDate column.
h) Select Fix Warning Update Mappings to Match Columns Used in Expression.
i) Right-click the DeliveryDate column.
j) Select Fix Warning Update Mappings to Match Columns Used in Expression.
10) Verify that all 12 columns are now mapped.
11) Click Up to return to the Job Editor.
n. Select File Save to save the diagram and job metadata to this point.
o. Run the job.
1) Click Run.
2) Click the Status tab in the Details pane. Verify that the job completed successfully.
3) Click the Log tab and verify that DIFTTGT.ORDERFACT is created with 951,669
observations and 12 variables.
4) Click the Diagram tab.
5) Right-click DIFT Order Fact and select Open.
6) Review the data and then select File Close to close the View Data window.
7) Select File Close to close the Job Editor.
l. Select File Save to save the diagram and job metadata to this point.
i) Enter '01jan2009'd.
j) Click Validate Expression.
k) Click No to not display the SAS log.
l) Click OK to close the Expression Builder window.
4) Specify the subsetting criteria for the DIFT Recent Orders table object.
a) Verify that the DIFT Recent Orders table object is selected in the Target Tables
pane.
b) Select Row Selection Conditions in the Row Selection Type field.
c) Click Subset Data below the Selection Conditions area. The Expression window
appears.
d) Click the Data Sources tab.
e) Expand the OrderFact table.
f ) Select the OrderDate column.
g) Click Add to Expression.
h) Click the >= operator in the operators area.
i) Enter '01jan2009'd.
j) Click Validate Expression.
k) Click No to not display the SAS log.
l) Click OK to close the Expression Builder window.
5) Click the Mappings tab.
6) Verify that all target table columns are mapped. (That is, all target columns receive data
from a source column.)
7) Click OK to close the Splitter Properties window.
n. Select File Save to save the diagram and job metadata to this point.
o. Run the job.
1) Click Run to run the job.
2) Click the Status tab in the Details pane. Notice that all processes complete successfully.
3) Click the Log tab to view the log for the executed job.
4) Scroll to view the notes about the creation of the DIFTTGT.RECENT_ORDERS table
and the creation of the DIFTTGT.OLD_ORDERS table.
5) Click the Diagram tab to view the data results.
6) View the DIFT Recent Orders table.
a) Right-click the DIFT Recent Orders table and select Open.
b) The DIFT Recent Orders table should have 615,396 rows.
c) When you are finished viewing the data, select File Close to close the
View Data window.
7) View the DIFT Old Orders table.
a) Right-click the DIFT Old Orders table and select Open.
b) The DIFT Old Orders table should have 336,273 rows.
c) When you are finished viewing the data, select File Close to close the View Data
window.
p. Select File Close to close the Job Editor.
The Orion Jobs folder should resemble the following:
Lesson 6 SAS® Data Integration
Studio: Working with Transformations
6.1 Working with the Extract and Summary Statistics Transformations ............................. 6-3
Demonstration: Refresh the Metadata ....................................................................... 6-5
Demonstration: Reporting for United States Customers ................................................ 6-9
Practices ............................................................................................................. 6-28
6.1 Working with the Extract and Summary Statistics Transformations
Extract Transformation
3. Click Yes.
4. If a Delete Library window appears for the DIFT Test Source Library, click Yes.
5. If a Delete Library window appears for the DIFT Test Target Library, click Yes.
6. If a Delete Library window appears for the DIFT ODBC Contacts Library, click Yes.
7. If a Delete Library window appears for the DIFT ODBC Orders Library, click Yes.
8. If a Delete Library window appears for the DIFT Oracle Library, click Yes.
9. If a Delete Library window appears for the DIFT Orion Source Tables Library, click Yes.
10. If a Delete Library window appears for the DIFT SAS Library, click Yes.
11. If a Delete Library window appears for the DIFT Orion Target Tables Library, click Yes.
The metadata under the Data Mart Development folder should resemble the following:
In addition, all source and target table objects (as well as the corresponding library objects)
discussed up to this point are found in the Orion Source Data and Orion Target Data
folders.
Investigate and Finalize the Job to Load the Customer Order Information Table
1. Run two jobs to prepare source tables for the DIFT Populate Customer Order Information Table
job.
a. Open and run the DIFT Populate Order Fact Table job.
1) Click the Folders tab.
2) Right-click the DIFT Populate Order Fact Table job and select Open.
3) Click Run.
4) Verify that the job completes successfully.
b. Open and run the DIFT Populate Customer Dimension Table job.
1) Click the Folders tab.
2) Right-click the DIFT Populate Customer Dimension Table job and select Open.
3) Click Run.
4) Verify that the job completes successfully.
Note: The DIFT Customer Dimension table and the DIFT Order Fact table will be
source tables for the DIFT Populate Customer Order Information Table job. If
these tables are not generated, the DIFT Populate Customer Order
Information Table job will fail.
2. Open the starter DIFT Populate Customer Order Information Table job for editing.
a. Click the Folders tab.
b. Right-click the job DIFT Populate Customer Order Information Table and select Open.
A partial job flow diagram appears:
3. Examine the properties of the Join transformation to verify that it will produce an inner join of the
two tables on matching Customer_ID.
a. Right-click the Join transformation and select Open.
b. Verify that the type of join is an inner join.
1) Click the Join item in the Navigate pane.
2) View the type in the Join Properties pane.
d. Verify that the inner join is based on the matching (equality) Customer_ID columns from
each of the sources.
e. Add an additional WHERE clause to subset the data to only orders placed in 2011.
1) On the Where tab, click New.
2) In the first Operand column, click Advanced.
The Expression Builder window appears.
3) Click the Functions tab.
4) Double-click Date and Time to expand it.
5) Scroll down to the Date and Time functions and click the YEAR function.
The function is added with a highlighted argument. Keep the highlighting so that we can
easily add the OrderDate column as the function’s argument.
Note: The right pane provides help for the selected function.
6) Click Add to Expression.
7) Click the Data Sources tab.
8) Expand the DIFT Order Fact table.
9) Click OrderDate.
g. Verify that all 22 target columns are mapped one-to-one from a source column.
Note: Customer_ID is the column used in the join. Therefore, there is a duplicate and only
one copy needs to be mapped and present in the target table.
h. Click to return to the main diagram.
4. Select File Save to save the diagram and job metadata to this point.
5. Add the target table to the job flow.
a. Right-click the green work table object that is associated with the Join transformation.
Select Register Table.
b. Enter DIFT Customer Order Information in the Name field.
c. Set the Location field.
1) Click Browse.
a) Double-click the Orion Target Data folder.
b) Click OK.
2) Verify that the location is set to /Data Mart Development/Orion Target Data.
6. Select File Save to save the diagram and job metadata to this point.
7. Run the job.
a. Right-click in the background of the job and select Run.
b. Click the Status tab in the Details area.
c. Verify that all steps completed successfully.
d. View the log for the executed job. Scroll to view the note about the creation of
DIFTTGT.CUSTOMERORDERINFO.
Note: If this job fails, then the two source tables might not be populated. It might be
necessary to run the two jobs DIFT Populate Customer Dimension Table and DIFT
Populate Order Fact Table (both of these jobs are located in the Data Mart
Development → Orion Jobs folder).
Note: The Customer_Country values include Germany, United Kingdom, and United
States. They are written out values, not country codes.
c. Select File Close to close the View Data window.
9. Select File Close to close the Job Editor window.
c. Drag the DIFT Customer Order Information table object to the Diagram tab of the Job
Editor.
3. Add the Extract transformation to the job flow.
a. Click the Transformations tab.
b. Expand the SQL grouping and locate the Extract transformation template.
c. Drag the Extract transformation to the Diagram tab of the Job Editor.
d. Connect the DIFT Customer Order Information table object to the Extract transformation.
4. Add the Summary Statistics transformation to the job flow.
a. If necessary, click the Transformations tab.
b. Expand the Analysis grouping and locate the Summary Statistics transformation template.
c. Drag the Summary Statistics transformation to the Diagram tab of the Job Editor.
d. Connect the Extract transformation to the Summary Statistics transformation.
5. Select File Save to save the diagram and job metadata to this point.
6. Add a WHERE expression to the Extract transformation so that only rows for US customers are
written to the target table.
a. Right-click the Extract transformation and select Properties.
b. Click the Where tab.
c. On the bottom portion of the Where tab, click the Data Sources tab.
d. Expand the CustomerOrderInfo table.
e. Select Customer_Country.
f. Click Add to Expression.
g. In the Expression Text area, type = "US".
Note: The value “US” that we are using for subsetting does not match the written out
“United States” value that we noted in the View Data window in step 8b. This
seeming discrepancy is because we have selected global options that apply formats
to the data shown in the View Data window. We will examine these options.
h. Click OK to close the Extract Properties window.
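Note: The Extract transformation belongs to the SQL transformation group, so the code that it
generates here is essentially an SQL subsetting query. The following is a hedged sketch
only; the work table name is an assumption, and the actual generated code uses the
transformation's own temporary table.

proc sql;
   create table work.extract_target as    /* assumed work table name */
   select *
   from difttgt.customerorderinfo
   where Customer_Country = "US";
quit;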
7. Select File Save to save the diagram and job metadata to this point.
8. Close the job.
Note: This is a user-defined format that displays country codes as written out country
names.
d. Click OK to close the Properties window.
10. Investigate the permanent formats associated with the CustomerOrderInfo physical table.
a. Right-click the DIFT Customer Order Information table and select Analyze.
b. Click the Contents tab.
Note: A PROC CONTENTS step is run on the physical table when the Contents tab is
selected. This SAS procedure returns attributes of the physical table that it analyzes,
including column names, lengths, and formats.
c. Scroll down to the Alphabetic List of Variables and Attributes section of the report.
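As the Note above indicates, the Contents tab runs a PROC CONTENTS step against the physical
table, along the lines of this sketch (library and member name as created earlier in this
demonstration):

proc contents data=difttgt.customerorderinfo;
run;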
c. Verify that both Apply metadata formats and Apply formats are selected.
If the Apply metadata formats option is selected, any format specified in the properties of a
table will be applied when viewing data in the View Data window.
d. Uncheck the Apply metadata formats option.
If only the Apply formats option is selected, only permanent formats that were specified for
the data when it was created will be applied when viewing data in the View Data window.
Because the Customer_Country column has a format applied in the properties of the DIFT
Customer Order Information table and has a permanent format, we have to turn off both
formatting options to see the actual data values.
e. Uncheck the Apply formats option.
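For reference, this kind of user-defined format is created with PROC FORMAT. The sketch below
is hypothetical; the real format name and its complete list of country codes are defined in the
course setup, not here.

proc format;
   value $country                      /* hypothetical format name */
      'US' = 'United States'
      /* ...additional codes map to Germany, United Kingdom, and so on */
      ;
run;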
c. Right-click the DIFT Customer Order Information table and select Open.
Note: The Customer_Country column is now displaying the actual, unformatted data
values. We can now see the country codes.
d. Click File and select Close to close the View Data window.
13. Expand the Data Mart Development → Orion Reports → Extract and Summary folders.
14. Right-click the DIFT Create Report for US Customer Order Information job and select Open.
15. Specify properties for the Summary Statistics transformation to define grouping variables,
analysis variables, and statistics, and to specify report format and layout options.
a. Right-click the Summary Statistics transformation and select Properties.
b. On the General tab, remove the default description to prevent it from appearing in the job
flow node.
Note: The left pane lists option groups. The right pane lists the options in an option
group.
2) In the Selected list box, click Number of observations (N) and click to remove it.
3) In the Selected list box, click Minimum (MIN) and then click to move it to the top.
2) Enter MAXDEC=2 NOLABELS in the Other PROC MEANS options area (after the
default text).
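The Summary Statistics transformation is a generated transformation built around PROC MEANS,
so the text entered above is appended to the PROC MEANS statement. A hedged sketch of the
resulting step follows; the input table, class variable, and analysis variable shown are placeholders,
not the course values.

proc means data=work.extract_target min maxdec=2 nolabels;
   /* MIN is listed first and N was removed; the other selected  */
   /* statistics would also appear on the PROC MEANS statement.  */
   class Customer_Country;    /* placeholder grouping variable   */
   var   Quantity;            /* placeholder analysis variable   */
run;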
The Summary Statistics transformation has no output table because it is configured to produce
only a report. To add an output table, right-click on the transformation and select Add Work
Table.
10. Run the job.
a. Right-click in the background of the job and select Run.
b. Click the Status tab in the Details area. Verify that all processes completed successfully.
c. View the log for the executed job.
11. View the listing output.
a. Click the Output tab.
The nodate nonumber ls=80 options in the Summary Statistics transformation apply only to
listing output, not the HTML output.
b. Click the Diagram tab to return to the job flow diagram.
d. When you are finished viewing the report, close Firefox.
12. If necessary, click the Diagram tab in the Job Editor.
13. Select File Save to save the diagram and job metadata.
14. Select File Close to close the Job Editor window.
15. Reset the formatting options for the View Data window.
Note: In many cases, developers should keep these options off to make it easier to see the
actual data values. We turn them back on here so that the country codes display as
readable country names in the View Data window.
a. Click the Tools item in the menu bar and select Options from the drop-down list.
b. Click the View Data tab.
c. Select both Apply metadata formats and Apply formats.
d. Click OK to close the Options window.
Practices
A request was made to generate a report. The report must show the total quantities ordered across
the quarters for the year of 2011 for the Product Group values in the product line Clothes & Shoes.
The report should resemble the following:
• Examine the properties of the target table DIFT Product Order Information.
Question: How many columns are defined for DIFT Product Order Information?
Answer: _________________________________________________________
Question: What transformation is populating data in the columns Year, Quarter, Month,
and DOM?
Answer: _________________________________________________________
Question: What column from which table is populating data in the columns Year, Quarter,
Month, and DOM?
Answer: _________________________________________________________
Question: How many rows were created for DIFT Product Order Information?
Answer: _________________________________________________________
Question: Do the values calculated for Year, Quarter, Month, and DOM look
appropriate?
Answer: _________________________________________________________
Note: The following code can be added as preprocessing code (precode) for this last
transformation to specify the characters used to draw the horizontal and vertical
borders in the table:
options formchar="|---|-|---|";
• Run the job and verify that the desired HTML report is created.
Question: From the log, locate the code that generated the HTML file. What PROC step
generated the report?
Answer: _________________________________________________________
Question: What statement (or statements) identified the categorical and analytical fields?
Answer: _________________________________________________________
Question: In the HTML report, for the Product Group value of T-Shirts – is there an
increase in quantities ordered across all four quarters?
Answer: _________________________________________________________
SQL Transformations
The transformations in the SQL group generate SQL code and enable you to
• create tables (SAS and DBMS)
• delete rows from a table
• execute SQL statements in a DBMS
• extract rows from a source table
• insert rows into a target table
• join tables
• perform an SQL merge (update or insert)
• perform set operations
• update rows in a target table.
6.2 Exploring the SQL Transformations
Set operations allow multiple query result sets to be combined into a single result set. The sets are
combined vertically, with operations combining rows from the source sets in different ways depending
on the operation. This differs from a Join, which combines sets horizontally, creating a result set
with the columns of multiple source sets.
The UNION set operation returns all unique rows from the two query results. It drops noncommon
columns from the result set. By default, duplicate rows are dropped from the result set, but there is
an option (ALL) to keep duplicates.
The EXCEPT set operation returns unique rows from the first query result that are not in the second.
It drops noncommon columns from the result set. By default, duplicate rows are dropped from the
result set, but there is an option (ALL) to keep duplicates.
The INTERSECT set operation returns rows that are common to both query results. It
drops noncommon columns from the result set. By default, duplicate rows are dropped from the
result set, but there is an option (ALL) to keep duplicates.
The OUTER UNION set operation concatenates the rows from the two query results. It keeps
noncommon columns in the result set and it includes duplicate rows. This set operation is not an
ANSI set operation but an addition by SAS.
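In PROC SQL terms, the four set operators behave as follows. This is a sketch only; tables A and B
stand for the two query results.

proc sql;
   select * from A union       select * from B;  /* unique rows from both results        */
   select * from A except      select * from B;  /* unique rows in A that are not in B   */
   select * from A intersect   select * from B;  /* rows common to both results          */
   select * from A outer union select * from B;  /* concatenates all rows (SAS addition) */
quit;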
Without the CORRESPONDING option, columns in each result set will be matched by position, and
not by name.
In this case, the data in the ID column should be in the same column as the data in the ProdCode
column, the data in the Name column should be in the same column as the data in the Product
column, and the data in the Line column should be in the same column as the data in the ProdLine
column. We would want to leave the CORRESPONDING option off and match the columns by
position.
Name ID Line
Gloves 104 210
Boots 105 210
With the CORRESPONDING option, columns from each query result will be matched by name and
not by their position.
In this case, the data from the ID column should not be matched with the data with the Name
column, and the data from the Name column should not be matched with the data from the ID
column. To correctly match these columns, we need to match by name. To change the default
behavior of the transformation and match columns by name, we can turn on the CORRESPONDING
option.
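A sketch of the difference in PROC SQL, using the column names from these two examples (the
table names are placeholders):

proc sql;
   /* Different column names that already line up by position: */
   /* leave CORRESPONDING off and match columns by position.   */
   select ID, Name, Line from line_b
   outer union
   select ProdCode, Product, ProdLine from line_a;

   /* The same column names in a different order:              */
   /* add CORR so that columns are matched by name.            */
   select Name, ID, Line from line_c
   outer union corr
   select ID, Name, Line from line_d;
quit;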
Note: Each table contains data for a different product line for Orion Star.
h. Note the number of rows in each table: 772 rows in the DIFT CHILDREN table, 2021 rows in
the DIFT CLOTHESSHOES table, 357 rows in the DIFT OUTDOORS table, and 2354 rows
in the DIFT SPORTS table.
If a concatenation of these four tables occurs, then the final result set should contain:
772 + 2021 + 357 + 2354 = 5504 rows
i. Select File Close four times to close all four View Data windows.
3. Create a new job to concatenate the four tables, and to generate a report on the result.
a. If necessary, click the Folders tab.
b. Expand Data Mart Development → Orion Reports → SQL Transforms.
c. Select File New Job.
d. Enter DIFT Report on Product Information in the Name field.
e. Verify that the location is set to /Data Mart Development/Orion Reports/SQL Transforms.
f. Click OK. The Job Editor window appears.
4. Add the Set Operators transformation to the Diagram tab.
a. Click the Transformations tab.
b. Expand the SQL group of transformations.
c. Locate the Set Operators transformation.
6. Verify that the temporary work table for the Set Operators transformation now appears under the
transformation and not to the right.
7. Specify the properties of the Set Operators transformation to concatenate the four tables
containing product line information with the Outer Union operator.
a. Right-click the Set Operators transformation and select Properties.
b. Click the Set Operators tab.
The default set operator type is the Union. Each of these must be changed.
d. Change the set operator types to Outer Union.
1) Click the Union operator between DIFT CHILDREN and DIFT CLOTHESSHOES.
2) Select Outer Union in the Set operator type field.
3) Select the Match columns by name (CORRESPONDING) option.
4) Click the Union operator between DIFT CLOTHESSHOES and DIFT OUTDOORS.
5) Select Outer Union in the Set operator type field.
6) Click the Match columns by name (CORRESPONDING) option.
7) Click the Union operator between DIFT OUTDOORS and DIFT SPORTS.
8) Select Outer Union in the Set operator type field.
9) Click the Match columns by name (CORRESPONDING) option.
e. Propagate the columns to the temporary result set and perform mappings.
1) Click the DIFT CHILDREN table in the Queries pane.
2) In the Table Expression area, on the Select tab, click (Propagate from sources to
targets).
All columns from the Source table are propagated to the Target table and mapped.
4) Click the Map all columns tool to map the columns from DIFT CLOTHESSHOES
to the corresponding target table columns.
6) Click the Map all columns tool to map the columns from DIFT OUTDOORS
to the corresponding target table columns.
7) Click the DIFT SPORTS table in the Queries pane.
8) Click the Map all columns tool to map the columns from DIFT SPORTS
to the corresponding target table columns.
f. Click OK to close the Set Operators Properties window.
8. Select File Save to save the job metadata to this point.
9. Click the portion of the tool to autoalign the diagram nodes.
10. Select File Save to save the job metadata to this point.
11. Register the work table.
a. Right-click the Set Operators work table and select Register Table.
b. On the General tab, enter DIFT Product Information as the name.
c. Verify that the location is /Data Mart Development/Orion Reports/SQL Transforms.
d. Click the Physical Storage tab.
e. Enter Product_Information as the physical name.
f. If necessary, select DIFT Orion Target Tables Library (located in Data Mart
Development → Orion Target Data).
g. Click OK to close the Register Tables window.
12. Click File Save to save the current metadata.
Practice
In this practice, you create a job that will explore three SQL transformations: Create Table, Delete,
and Insert Rows. A rough PROC SQL equivalent of the three steps appears after the practice.
4. Using the Create Table, Delete, and Insert Rows Transformations from the SQL Group
Create a job to load a new table of high-value suppliers. The table is used to create a report
showing counts by country of the high-value suppliers. The job flow should resemble the
following:
- Configure the Create Table transformation to load only the rows from the source table
with Supplier_ID values greater than 12000. These are the high-value suppliers.
Question: How many rows and columns are in the resultant table?
Answer: ______________________________________________
• The report should include only the international high-value suppliers. Delete the domestic
suppliers from the High Value Suppliers table.
- Add a Delete transformation from the SQL transformations group to the job.
Note: The Delete transformation points to the table that it modifies, in this case the
High Value Suppliers table.
- Configure the Delete transformation to delete rows from the High Value Suppliers table.
Delete the rows with the Country value equal to United States.
Hint: Use quotation marks around the value United States.
Question: How many rows and columns are now in the resultant table?
Answer: ______________________________________________
• It was found that additional suppliers with Supplier_ID values greater than 10000 are high-
value suppliers. Insert the additional rows from the DIFT Supplier Information table into the
High Value Suppliers table.
- Add an Insert Rows transformation from the SQL transformations group to the job.
- Configure the Insert Rows transformation to select the rows from the DIFT Supplier
Information table with Supplier_ID between 10000 and 12000 and Country not equal
to United States and insert them into the High Value Suppliers table.
Question: How many rows and columns are now in the resultant table?
Answer: ______________________________________________
• Create a one-way frequency report for the High Value Suppliers values. Show the number
of suppliers for each country.
Note: The following code can be added as preprocessing code (precode) for this last
transformation to specify the characters to draw the horizontal and vertical borders of
the table:
options formchar="|---|-|---|";
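A rough PROC SQL equivalent of the three transformations in this practice follows. The library and
physical table names are assumptions; each transformation generates its own code in the job.

proc sql;
   /* Create Table: load the high-value suppliers */
   create table work.high_value_suppliers as
      select *
      from work.supplier_information
      where Supplier_ID > 12000;

   /* Delete: remove the domestic suppliers */
   delete from work.high_value_suppliers
      where Country = "United States";

   /* Insert Rows: add the additional international high-value suppliers */
   insert into work.high_value_suppliers
      select *
      from work.supplier_information
      where Supplier_ID between 10000 and 12000
        and Country ne "United States";
quit;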
6.3 Creating Custom Transformations
All transformations, including custom ones, generate SAS code to accomplish the following:
• extract data
• transform data
• load data into data stores
• create reports
Java plug-in transformation templates were created by SAS Data Integration Studio developers
using Java programming. SAS Data Integration Studio users can use a wizard to create generated
transformation templates with only SAS programming skills.
Examples of Java plug-in transformation templates include most of the templates in the Data folder,
such as Sort, Splitter, and User Written. Examples of generated transformation templates include
Summary Tables and Summary Statistics. These transformations are default transformations
included with SAS Data Integration Studio.
Generated Transformations
Generated transformation templates in the Transformations tree are identified by the icon
associated with the transformation.
There are many generated transformation templates that come by default with SAS Data Integration
Studio. Two of these include Summary Statistics and Summary Tables. All of the generated
transformation templates are identifiable by their icons in the Transformations tree.
Generated Transformations
Right-clicking a generated transformation yields a pop-up menu with several options that are not
available for Java transformations:
• Properties
• Analyze
• History
• Copy
• Import
• Export
• Archive
• Compare
• Find In
These options are the same options available for other metadata objects users create in the SAS
Metadata Repository. Generated transformations can be managed in the same way other metadata
objects such as libraries or tables can be managed.
&classvar1
&classvar2
&analysisvar
When the transformation has been created, users can select values on the Options tab in the
transformation properties to configure the transformation. These values will be assigned to the
corresponding macro variables in the code.
Generated Code
The generated code for the transformation includes a %LET statement for
each transformation option.
%let syslast = yy.xx;
%let options = ;
%let classvar1 = Customer_Age_Group;
%let classvar2 = Customer_Gender;
%let analysisvar = Quantity Ordered;
%let title = Sum of Quantity across Gender and AgeGroup;
Each %LET statement creates a macro variable and assigns the value.
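After these %LET statements run, the body of the generated transformation simply references the
macro variables in its SAS code, so the values chosen on the Options tab drive the generated step.
A minimal illustrative sketch follows; the full course example, TabulateGraph.sas, appears later in
this section.

title "&title";
proc tabulate data=&syslast;
   class &classvar1 &classvar2;
   var &analysisvar;
   table &classvar1*&classvar2, &analysisvar*sum;
run;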
Options and option groups are defined in this window. The value for the option Name must exactly
match the macro variable it is associated with in the SAS source code so that when users configure
the transformation, the value chosen for each option is assigned to the corresponding macro
variable.
Options are surfaced to users on the Options tab of the transformation when they use the
transformation in a job. Groups become the organizational categories in the left-hand column of the
Options tab, and prompts become options.
Note: The generated transformation will appear in both the Transformations tree and the
Folders tree. The Location field will determine where the transformation will be
stored in the metadata folder structure. The Transformation Category field will
determine in which grouping the transformation will be stored in the Transformations
tree.
f. Click Next.
g. Add the SAS source code.
1) Access Windows Explorer.
2) Navigate to D:\Workshop\dift\SASCode.
3) Right-click TabulateGraph.sas and select Edit with Notepad++.
4) Right-click in the background of the Notepad++ window and select Select All.
5) Right-click in the background of the Notepad++ window and select Copy.
6) Select File Exit to close the Notepad++ window.
7) Access the New Transformation Wizard.
8) Right-click in the SAS Code pane and select Paste.
The SAS Code panel should resemble the following:
The following represents the SAS code found in the TabulateGraph.sas program:
%macro TabulateGChart;
   options mprint;
   ods listing close;
   %if (%quote(&options) ne) %then %do;
      options &options;
   %end;
   %if (%sysfunc(fileexist(&path))) %then %do;
      %if (%quote(&path) ne) %then %do;
         ods html path="&path" gpath="&path"
         %if (%quote(&filename) ne) %then %do;
            file="&filename..html" ;
         %end;
      %end;
      %if (%quote(&tabulatetitle) ne) %then %do;
         title1 "&tabulatetitle";
      %end;
      proc tabulate data=&syslast;
         class &classvar1 &classvar2;
         var &analysisvar;
         table &classvar1*&classvar2,
               &analysisvar*(min="Minimum"*f=comma7.2
                             mean="Average"*f=comma8.2 sum="Total"*f=comma14.2
                             max="Maximum"*f=comma10.2);
      run;
      %if (%quote(&gcharttitle) ne) %then %do;
         title height=15pt "&gcharttitle";
      %end;
      goptions dev=png;
      proc gchart data=&syslast;
         vbar &classvar1 / sumvar=&analysisvar group=&classvar2
                           clipref frame type=SUM outside=SUM coutline=BLACK;
      run; quit;
      ods html close;
      ods listing;
   %end;
   %else %do;
      %if &sysscp = WIN %then %do;
         %put ERROR: <text omitted; refer to file for complete text>.;
      %end;
      %else %if %index(*HP*AI*SU*LI*,*%substr(&sysscp,1,2)*) %then %do;
         %put ERROR: <text omitted; refer to file>
      %end;
      %else %if %index(*OS*VM*,*%substr(&sysscp,1,2)*) %then %do;
         %put ERROR: <text omitted; refer to file >.;
      %end;
   %end;
%mend TabulateGChart;
%TabulateGChart;
Note: Some comments and text were removed from the above display of code.
This SAS code generates a PROC GCHART chart and a PROC TABULATE table. When the
transformation is complete, users will specify the values for the macro variables in this code by
configuring the transformation options. Therefore, in the next steps, we add these macro variables
as options and define what types of values are acceptable inputs. The macro variables in this code
that will become options are options, path, filename, tabulatetitle, classvar1, classvar2,
analysisvar, and gcharttitle.
options This macro variable will allow users to enter a space-separated list of global
options.
path This macro variable will specify the path where the report will be created.
filename This macro variable will specify the filename of the report. Notice that we
have “hardcoded” the file to be an HTML file. Users should not add an
HTML extension to their filename because one is already present in the
code.
tabulatetitle This macro variable will specify the title for the table.
classvar1 This macro variable will be used in both the PROC GCHART and the
PROC TABULATE steps as a classification variable; it will be the top-level
grouping in the table and chart.
classvar2 This macro variable will be used in both the PROC GCHART and the
PROC TABULATE steps as a classification variable; it will be the secondary
grouping for the table and chart. For example, if classvar1 is set to Country
and classvar2 is set to Gender, the chart and table will be grouped first by
Country and then by Gender within each Country.
analysisvar This macro variable will specify the numeric variable for which the chart and
table will generate statistics.
gcharttitle This macro variable will specify the title for the chart.
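When a user configures an instance of this transformation, the generated code assigns the chosen values with %LET statements before the macro executes. The values below are illustrative only (they roughly mirror the demonstration later in this lesson, and the input table name is assumed because the job assigns the real value of SYSLAST). Note that an option left blank produces an empty macro variable, which the %IF (%QUOTE(&variable) ne) tests in the source code detect so that the corresponding statement is not generated.

%let syslast       = work.customer_orders;  /* input table; the job assigns the actual value        */
%let options       = ;                       /* left blank: no extra OPTIONS statement is generated  */
%let path          = D:\Workshop\dift\reports;
%let filename      = CustomerOrderReport;
%let tabulatetitle = Sum of Quantity across Gender and Age Group;
%let classvar1     = Customer_Age_Group;
%let classvar2     = Customer_Gender;
%let analysisvar   = Quantity_Ordered;
%let gcharttitle   = Quantity Ordered by Age Group and Gender;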
h. Click Next.
The Options window is used to define the options for this transformation.
3) Click OK.
4) Click New Group.
5) Enter Titles in the Displayed text field.
6) Click OK.
7) Click New Group.
8) Enter Other Options in the Displayed text field.
9) Click OK.
d) Enter Grouping Column for Table and Chart in the Displayed text field.
Note: The Displayed text field is required.
e) Enter The column selected for this option will be used as a grouping column
in the TABULATE table and in the GCHART graph. in the Description field.
Note: The Description field is not required, but it is highly recommended because it gives
users a more detailed idea of the option’s purpose in the transformation.
f ) Click Requires a non-blank value in the Options area.
Note: This forces the user to choose a value for this option before the
transformation can generate its code.
e) Enter The column selected for this option will be used as an analysis column
in the TABULATE table and it will determine the heights of the bars in the
GCHART chart. in the Description field.
f ) Click Requires a non-blank value in the Options area.
g) Click the Prompt Type and Values tab.
h) Select Data source column for the Prompt type field.
i) Verify that Select from source is selected in the Columns to select from area.
j) Clear Character in the Data types area.
k) Click Limit number of selectable columns.
l) Enter 1 in the Minimum field.
m) Enter 1 in the Maximum field.
n) Click OK to close the New Prompt window.
The three options in the Data Items group should resemble the following:
Note: If the reports folder does not exist, create it using the tool.
(3) Click OK to close the Select a Directory window.
l) Click OK to close the New Prompt window.
a) Verify that the three items in the Data Items group are all required, as indicated by
the asterisk (*).
b) Verify that the descriptions entered for each of the parameters are displayed.
c) Verify that clicking Browse opens a dialog box to navigate the SAS Folders to a data
source from which a column can be selected.
2) Click Titles in the selection pane.
The two options in the Titles group are displayed with their default values.
Note: The Inputs and Outputs options set limits on the number of input ports and output
ports.
q. Click Next.
Note: If a transformation is used in a job and then updated, those changes will affect every
job where the transformation has been used. Check the impact of transformation
updates before you make any changes.
r. Click Finish.
4. Verify that the transformation and new grouping appear in the Transformations tree and that the
transformation also appears in the Folders tree.
a. Click the Transformations tab.
b. Verify that a new grouping User Defined appears.
c. Expand the new grouping and verify that the new transformation exists.
Practice
In this practice, you create a new transformation that generates a pie chart showing the sum of an
analysis variable across a classification variable.
5. Creating a Graphical Transformation
• Create a new folder called Custom Transformations in the Data Mart Development →
Orion Reports folder.
• Create a new transformation called Generate Pie Chart.
– Name the transformation Generate Pie Chart.
– Place the transformation in the Data Mart Development → Orion Reports →
Custom Transformations folder and place it in the User Defined grouping on the
Transformations tree.
Note: You can create the new User Defined grouping by typing User Defined in the
Transformation Category box.
– Use the code from GeneratePieChart.sas (located in D:\Workshop\dift\SASCode) as
the transformation’s source code.
– Define an option group called Data Items.
– Define the following options for the Data Items group:
Name: classvar
Required: Yes
Other information: Allow all data types; allow only one column selection.
Name: analysisvar
Description: Select the column whose sum across the chosen category will
be used to determine the size of a slice.
Required: Yes
Other information: Allow only numeric data types; allow only one column
selection.
Name: charttitle
Description: Specify the text to be used as the title for the pie chart.
Required: No
Name: path
Required: Yes
Name: filename
Description: Enter a name for the PDF file that will be created. Do NOT
enter the PDF file extension!
Required: Yes
– Specify that the transformation supports one input and no output tables.
– Verify that the transformation appears in two places – in the correct folder on the Folders
tab and under the correct grouping on the Transformations tab.
– Verify the transformation surfaces on the Folders tab.
7. Click the Errors symbol. The Errors dialog box shows that required options have not been
specified.
c. Verify that the Data Items group is selected in the selection pane.
1) Click Browse for the Grouping Column for Table and Chart option.
a) Select Customer Age Group in the Select a Data Source Item window.
b) Click OK to close the Select a Data Source Item window.
2) Click Browse for the Subgrouping Column for Table and Chart option.
a) Select Customer Gender in the Select a Data Source Item window.
b) Click OK to close the Select a Data Source Item window.
3) Click Browse for the Analysis Column for Table and Chart option.
a) Select Quantity Ordered in the Select a Data Source Item window.
b) Click OK to close the Select a Data Source Item window.
e. Verify that the default values are specified. These default values will be used for this instance
of the transformation.
g. Click OK to close the Summary Table and Vertical Bar Chart Properties window.
11. Select File Save to save the job metadata to this point.
12. Run the job.
a. Right-click in the background of the job and select Run.
We get a warning.
c. Click the Warnings and Errors tab.
The warning message tells us the labels are too wide for the bars.
13. View the generated HTML file.
a. Open Windows Explorer.
b. Navigate to the D:\Workshop\dift\reports folder.
c. Double-click CustomerOrderReport.html.
d. If necessary, click X to close the security message in the browser.
e. If necessary, right-click on the chart and select View Image to view the chart.
f. The tabular report is at the top of the HTML output. Scroll down for the graphic report.
Note: The values of statistics might vary due to changing customer ages over time.
Notice that the Customer_Age_Group values greater than 75 have their own bars. The
custom format applied to Customer_Age_Group does not cover a large enough range. If we
fix the format, the bars should be wide enough that the labels will fit.
This code adds a row to the data set with new age group values that cover people from age
76 to age 100.
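The exact contents of UpdateAgegroupFormat.sas are not reproduced here. A minimal sketch of one common way to extend an existing format (the library, label text, and data set names below are assumptions, and the sketch assumes that the Agegroup format already exists in that library) follows:

/* Dump the existing format definition to a data set, append a row for ages
   76 to 100, and rebuild the format from the updated control data set.    */
proc format library=work cntlout=work.agegroup_rows;
   select agegroup;
run;

data work.agegroup_rows;
   set work.agegroup_rows end=lastrow;
   output;
   if lastrow then do;
      start = '76';
      end   = '100';
      label = '76 to 100 years';
      output;
   end;
run;

proc format library=work cntlin=work.agegroup_rows;
run;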
d. Right-click in the background of the file and select Select All.
e. Right-click in the background of the file and select Copy.
f. Close UpdateAgegroupFormat.sas.
16. Return to SAS Data Integration Studio.
17. Run the code.
a. Select Tools Code Editor.
b. Right-click in the Code Editor window and select Paste.
c. Click Run.
18. Click the Output tab and verify the Agegroup format was successfully updated.
There is now a category for people with an age between 76 and 100.
19. Close the Code Editor window. Do not save your changes.
20. Re-generate affected source tables.
The format for Customer_Age_Group is applied in the DIFT Populate Customer Dimension Table
job, so that job needs to be re-run to apply the new format. DIFT Customer Dimension is a
source table for the DIFT Customer Order Information table, so the DIFT Populate Customer
Order Information Table job also needs to be re-run. The DIFT Order Fact table is unaffected
because it does not contain customer data.
a. If necessary, click the Folders tab.
b. Run the DIFT Populate Customer Dimension Table job.
1) Right-click the DIFT Populate Customer Dimension Table job and select Open.
2) Select Actions Run.
3) Verify that the job ran successfully.
4) Select File Close to close the job.
c. Run the DIFT Populate Customer Order Information Table job.
1) Right-click the DIFT Populate Customer Order Information Table job and select Open.
2) Select Actions Run.
3) Verify that the job ran successfully.
4) Right-click on the DIFT Customer Order Information table and select Open.
f. The tabular report is at the top of the HTML output. Scroll down for the graphic report.
Because there are fewer bars, the bars are now wide enough that the labels fit. The results
are also much more useful now that we can capture people who are 76 to 100 years old in
one group.
Practice
The new transformation from the previous practice is used to generate a PDF file with a pie chart
analyzing sums of quantity ordered by country. The final job flow will resemble the following:
In the Titles and Options group, for the Title for Pie Chart option, keep the default value:
&analysisvar by &classvar
• View the PDF file. The output should resemble the following:
6.4 Solutions
Solutions to Practices
1. Refreshing Course Metadata to Current Point
a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
1) Select Start → All Programs → SAS → SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK to close the Connection Profile window and open the Log On window.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.
b. Click the Folders tab.
c. Expand Data Mart Development.
d. Delete all subfolders under Data Mart Development.
1) Click the first folder under Data Mart Development.
2) Press and hold the Shift key and click the last folder under Data Mart Development.
3) Select Edit → Delete.
4) If any Confirm Delete windows appear, click Yes.
e. Import fresh metadata.
1) Select the Data Mart Development folder.
2) Select File → Import SAS Package.
3) Click Browse next to Enter the location of the input SAS package file.
a) If necessary, navigate to D:\Workshop\dift\solutions.
b) Click DIFT_Ch7Ex1.spk.
c) Click OK.
4) Verify that All Objects is selected.
5) Click Next.
6) Verify that all four Orion folders are selected.
7) Click Next.
8) Verify that four different types of connections need to be established.
9) Click Next.
10) Verify that SASApp is listed for both the Original and Target fields.
11) Click Next.
12) Verify both servers (ODBC Server and Oracle Server) have matching values for the
Original and Target fields.
13) Click Next.
14) Verify that both file paths have matching values for the Original and Target fields.
15) Click Next.
16) Verify that all directory paths have matching values for the Original and Target fields.
17) Click Next. The Summary pane surfaces.
18) Click Next.
19) Verify the import process completed successfully.
20) Click Finish.
2. Examining and Executing Imported Job: DIFT Populate Product Order Information Table
a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
1) Select Start → All Programs → SAS → SAS Data Integration Studio.
2) Verify that the connection profile is My Server.
3) Click OK to close the Connection Profile window and access the Log On window.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.
b. Examine the imported job DIFT Populate Product Order Information Table.
1) Click the Folders tab.
2) Expand Data Mart Development → Orion Jobs.
3) Right-click DIFT Populate Product Order Information Table and select Open.
4) Verify that the Details pane is showing.
5) Single-click the DIFT Product Order Information table object.
6) Click the Columns tab in the Details pane.
Question: How many columns are defined for DIFT Product Order Information?
Answer: 23
Question: What column from which table is populating data in the columns Year,
Quarter, Month, and DOM?
Answer: The OrderDate column from the OrderFact table
Question: How many rows were created for DIFT Product Order Information?
Answer: 951,669
e. In the job flow diagram, right-click the DIFT Product Order Information table and select
Open.
Question: Do the values calculated for Year, Quarter, Month, and DOM look appropriate?
Answer: Yes
3) In the bottom portion of the Where tab, click the Data Sources tab.
4) Expand ProdOrders table.
5) Select Product_Line.
6) Click Add to Expression.
7) In the Expression Text area, type = “Clothes & Shoes” &.
8) In the bottom portion of the Where tab, click the Data Sources tab and if necessary,
expand the ProdOrders table.
9) Select Year.
10) Click Add to Expression.
11) In the Expression Text area, type = 2011.
The final expression should resemble the following:
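The completed expression text, with or without the ProdOrders table prefix depending on how the columns were inserted, reads approximately:

ProdOrders.Product_Line = "Clothes & Shoes" & ProdOrders.Year = 2011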
b) Click for this option to open the Select Data Source Items window.
b) Click for this option to open the Select Data Source Items window.
c) Click Product Group. Hold down the Ctrl key and select Quarter.
d) Click .
e) Click OK to close the Select Data Source Items window.
6) Click the Describe TABLE to print options group in the left pane.
a) Locate the field for the Specify row expression option and enter Product_Group.
b) Locate the field for the Specify column expression option and enter
Quarter*Quantity=“ ”*Sum.
c) Locate the field for the Specify TABLE statement options option and enter rts=20.
7) Click the Label a keyword options group in the left pane.
a) Locate the field for the Specify KEYLABEL statement option.
b) Enter sum=“ ”.
8) Click the Other options options group in the left pane.
a) Locate the field for the Specify other options for OPTIONS statement option.
b) Enter ls=85 nodate nonumber.
c) Locate the field for the Summary tables procedure options.
d) Enter format=comma8..
9) Click the Titles and footnotes options group in the left pane.
a) In the Heading 1 field, type Total Quantity Ordered for Quarters of 2011.
b) In the Heading 2 field, type Product_Line: Clothes & Shoes.
10) Click the ODS options options group in the left pane.
a) In the ODS result field, select Use HTML.
b) In the Location field, enter
D:\Workshop\dift\reports\Quantities2011ClothesAndShoes.html.
11) Click the Precode and Postcode tab.
a) Check the Precode option.
b) In the Precode field, enter options formchar=“|---|-|---|”;.
12) Click OK to close the Summary Tables Properties window.
l. Select File Save to save the diagram and job metadata to this point.
m. Run the job.
1) Right-click in the background of the job and select Run.
2) Click the Status tab in the Details area. Verify that all processes completed successfully.
3) Click to close the Details view.
4) View the log for the executed job.
Question: From the Log, locate the code that generated the HTML file. What PROC step
generated the report?
Answer: PROC TABULATE
Question: What statement (or statements) identified the categorical and analytical fields?
Answer: The CLASS statement identifies the categorical fields, and the VAR statement
identifies the analytical field.
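The PROC TABULATE step located in the log might resemble the following minimal sketch. The input work table name is an assumption (it is whatever work table the upstream transformation produced), but the statements reflect the options that were set above:

options ls=85 nodate nonumber formchar="|---|-|---|";
ods html body="D:\Workshop\dift\reports\Quantities2011ClothesAndShoes.html";
title1 "Total Quantity Ordered for Quarters of 2011";
title2 "Product_Line: Clothes & Shoes";
proc tabulate data=work.extract_out format=comma8.;
   class Product_Group Quarter;
   var Quantity;
   table Product_Group,
         Quarter*Quantity=" "*Sum / rts=20;
   keylabel Sum=" ";
run;
ods html close;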
Question: In the HTML report, for the Product Group value of T-Shirts – is there an increase
in quantities ordered across all four quarters?
Answer: There is an increase from Q1 to Q2, and from Q2 to Q3. But there is a
decline from Q3 to Q4.
4. Using the Create Table, Delete, and Insert Rows Transformations from the SQL Group
a. If necessary, access SAS Data Integration Studio with Bruno’s credentials.
1) Select Start → All Programs → SAS → SAS Data Integration Studio.
2) Select My Server as the connection profile.
3) Click OK to close the Connection Profile window and open the Log On window.
4) Enter Bruno in the User ID field and Student1 in the Password field.
5) Click OK to close the Log On window.
b. Create the initial job metadata.
1) Click the Folders tab.
2) Expand Data Mart Development → Orion Reports → SQL Transforms.
5) Verify that all source columns are propagated and mapped to the target table.
6) Click the Filter and Sort tab.
7) In the Filter (WHERE) pane (upper pane), click New row.
8) In the newly added row, click in the first Operand field and select Choose Column.
9) In the Choose Columns window, expand File Reader.
10) Click Supplier_ID.
11) Click OK to close the Choose Columns window.
12) Click in the Operator field and select >=.
13) Enter 12000 in the second Operand field.
14) Click the Code tab and review the generated PROC SQL code.
15) Click OK to close the Create Table Properties.
16) Select File Save to save the job metadata.
h. Run the job and verify that the High Value Suppliers table was created.
1) Right-click in the background of the job and select Run.
2) Verify that all steps completed successfully.
Question: How many rows and columns are in the resultant table?
Answer: Scroll on the Log tab to verify that 25 rows and 6 columns are in the result
set.
3) On the Diagram tab, right-click the High Value Suppliers table and select Open.
4) Verify the data in this table. All Supplier_ID values are at least 12000.
5) Select File Close to close the View Data window.
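For reference, the PROC SQL code reviewed on the Code tab of the Create Table transformation might resemble this minimal sketch (the dift libref and the File Reader work table name are assumptions; the job assigns the actual names):

proc sql;
   create table dift.high_value_suppliers as
      select *
      from work.file_reader_out
      where Supplier_ID >= 12000;
quit;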
i. Add the Delete transformation to the job.
1) On the Transformations tab, expand the SQL group and locate the Delete
transformation.
2) Drag the Delete transformation to the Diagram tab of the Job Editor.
3) Click the Folders tab.
4) If necessary, expand the Data Mart Development → Orion Reports → SQL Transforms
folder.
5) Drag the High Value Suppliers table to the Diagram tab of the Job Editor.
6) Connect the Delete transformation to the newly added High Value Suppliers table.
3) Connect the work table of the File Reader transformation to the input port of the Insert
Rows transformation.
4) Right-click the work table of the Insert Rows transformation and select Replace.
5) In the Table Selector window, click the Folders tab.
6) If necessary, expand the Data Mart Development → Orion Reports → SQL Transforms
folder.
7) Select the High Value Suppliers table.
8) Click OK to close the Table Selector window.
19) Click in the first Operand field and select Choose Column.
20) In the Choose Columns window, expand File Reader.
21) Click Country.
22) Click OK to close the Choose Columns window.
23) In the Operator field, retain =.
24) Enter “United States” in the second Operand field.
25) Click the Code tab and review the generated PROC SQL code.
26) Click OK to close the Query Builder window.
27) Click OK to close the Insert Rows properties window.
28) Select File Save to save the job metadata.
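Similarly, the code that the Insert Rows transformation generates (reviewed on its Code tab above) might resemble this sketch, again with assumed libref and work table names:

proc sql;
   insert into dift.high_value_suppliers
      select *
      from work.file_reader_out
      where Country = "United States";
quit;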
o. Run the job and verify that two rows were inserted into the High Value Suppliers table.
1) Verify the control flow. If necessary, use the Control Flow tab to place the transformations
in the correct execution sequence.
5) Click Next.
6) Specify the SAS code.
a) Access Windows Explorer.
b) Navigate to D:\Workshop\dift\SASCode.
c) Right-click GeneratePieChart.sas and select Edit With Notepad++.
d) Right-click in the background of the Notepad++ window and select Select All.
options mprint;
ods listing close;
%end;
%else %do;
%if &sysscp = WIN %then
%do;
%put ERROR: <text omitted; refer to file for complete text>.;
%end;
%mend PieChart;
%PieChart;
Note: Some comments and text were removed from the above display of code.
This code uses PROC GCHART to generate a pie chart, with the sum of analysisvar
determining the size of a slice and classvar determining what data is used to generate the
slice. The charttitle macro variable will allow users to set a title for the chart. The path and
filename macro variables will be used to generate a PDF file at the path location with the
specified filename.
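A minimal standalone sketch of the kind of PROC GCHART pie step that this macro wraps appears below. The data set, variables, titles, and statement options shown are illustrative assumptions, not the exact contents of GeneratePieChart.sas; in the real transformation the values come from the transformation options and the input table is &syslast:

%let classvar    = Region;           /* grouping column (illustrative) */
%let analysisvar = Sales;            /* analysis column (illustrative) */
%let charttitle  = Sales by Region;
%let path        = D:\Workshop\dift\reports;
%let filename    = SalesByRegion;

ods listing close;
ods pdf file="&path\&filename..pdf";
title "&charttitle";
proc gchart data=sashelp.shoes;
   pie &classvar / sumvar=&analysisvar type=sum;
run; quit;
ods pdf close;
ods listing;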
7) Click Next.
8) Create a new group.
a) Click New Group.
b) Enter the name Data Items.
c) Click OK.
9) Create a new group.
a) Click New Group.
f. Select File Save to save diagram and job metadata to this point.
g. Specify properties for the Generate Pie Chart transformation.
1) Right-click the Generate Pie Chart transformation and select Properties.
2) Click the Options tab.
3) Select Customer Country as the Grouping Variable for Pie Chart.
Lesson 7 Introduction to Data Quality and the SAS® Quality Knowledge Base
7.1 Introduction to Data Quality ......................................................................................... 7-3
7.1 Introduction to Data Quality
As a data scientist, you should appreciate how important data quality is to your projects and to the
enterprise as a whole. You rely on accurate, complete, reliable, up-to-date data. Here are some of
the accepted definitions of data quality throughout the industry.
ISO 9000 defines data quality as “the degree to which a set of characteristics of data fulfills
requirements.”
TechTarget adds that “data quality is a perception or an assessment of data’s fitness to serve its
purpose in a given context. The quality of the data is determined by factors such as accuracy,
completeness, reliability, relevance, and how up-to-date it is. As data has become more intricately
linked with the operations of organizations, the emphasis on data quality gains greater attention.”
Wikipedia defines data quality in a variety of ways, saying:
• Data that is “fit for [its] intended uses in operations, decision making, and planning.”
• “The state of completeness, validity, consistency, timeliness, and accuracy that makes data
appropriate for a specific use.”
• “The processes and technologies involved in ensuring the conformance of data values to
business requirements and acceptance criteria.”
Why Does Bad Data Happen?
Bad data happens for a variety of reasons.
Consequences of Bad Data
70% of organizations feel poor quality or inconsistent data impacts their
ability to make sound business decisions.
Forrester
The cost of bad data exceeds $3 trillion annually in the US.
Harvard Business Review
72% of global organizations say data quality issues impact customer trust
and perceptions of their brands.
Experian
According to recent research, the consequences of bad data are expensive to the organization. Bad
data not only impacts profitability for the organization, it also affects your customers’ perception of
your brand. In addition, bad data affects decisioning inside and across the organization, which
directly impacts your company’s bottom line and your relationship with your customers. The results
of some of this research are seen in the excerpts above.
As the studies point out, the cost of bad data affects every aspect of your organization because it
• undermines the decision-making process by providing conflicting results, which leads to a lack of
trust in the data and compromises the entire decisioning process at your company
• negatively impacts your marketing and customer relationship goals and reduces customer trust
and perceptions of your brand due to the lack of a single view of the subject
• negatively impacts the bottom line. (According to Experian, bad data has a direct impact on the
bottom line of 88% of all American companies, and the average loss from bad data was
approximately 12% of overall revenue.)
SAS’ data quality offering is an industry-validated solution, receiving recognition as a leader in both
the Gartner Magic Quadrant and the Forrester Waves for Data Quality.
The SAS Data Quality methodology has a proven track record, with more than 20 years of success!
Made up of three phases, the methodology is designed to step you through the process of creating
reliable and consistent data during the data curation life cycle. You will learn more about the
methodology in a later lesson.
The SAS Quality Knowledge Base (QKB) is a powerful piece of technology that makes a wide range
of data cleansing and data management techniques available to you from within a variety of SAS
applications. The rules in the QKB are geography and language specific, to ensure applicability for
data from all around the world.
To have a complete data quality strategy in place, you must have the proper processes, people, and
technology. SAS provides you with the technology components for a well-rounded data quality
strategy. The DataFlux Data Management Studio methodology is complete and proven. Now all we
need are the people who are trained and ready to put their newly attained knowledge to work!
7.2 SAS Quality Knowledge Base Overview
The SAS Quality Knowledge Base (QKB) is a collection of files and algorithms that store data and
logic for defining data management operations such as data cleansing and standardization. The
definitions in the QKB that perform the cleansing tasks are geography and language specific. This
combination of geography and language is known as a locale.
The components of the SAS Quality Knowledge Base (QKB) can be modified using the supplied
editors in Data Management Studio. Modifying or customizing the QKB enables you to create data
quality logic for operations such as custom parsing or custom matching rules.
The customization component in Data Management Studio consists of a suite of editors, functions,
and interfaces that can be used to construct or modify components and definitions to fit your own
data. These editors enable you to perform the following tasks:
• explore and test the QKB components and definitions
• modify pre-built data types and data management algorithms (definitions) to meet business
needs
• create new data types and definitions based on customer needs
Many SAS technology components also access the QKB and perform data-cleansing operations on
data. SAS software products like SAS Data Quality Server, SAS Data Integration Studio, SAS
Federation Server, SAS Data Quality Accelerators, and SAS Event Stream Processing can also be
used to reference the QKB when performing data management operations.
SAS Data Quality Server provides a collection of SAS procedures, functions, and call routines that
surface the functionality in the QKB from within SAS code.
SAS Data Integration Studio can be configured to access QKB components from the Data Quality
tab in the application’s Options window. The QKB definitions are then available from the provided
data quality transformations, or through customized SAS code in the user-written code
transformation.
SAS Federation Server can access the QKB functionality in SAS DS2 method calls that are
executed in FedSQL queries.
The SAS Data Quality Accelerators provide functions that execute QKB functionality in-database.
The SAS Data Quality Accelerator for Hadoop provides data quality functionality in the SAS Data
Loader for Hadoop directives for data quality.
SAS Event Stream Processing enables programmers to build applications that can quickly process
and analyze a large number of continuously flowing events. It can access the QKB to perform data
quality operations on the streaming data.
A definition is a set of steps for processing data values. The QKB has definitions that enable you to
do a variety of data management, data quality, and entity resolution tasks. The types of definitions
that are available in the QKB include the following:
Case                      Transforms a text string by changing the case of its characters
                          to uppercase, lowercase, or proper case.
Gender Analysis           Guesses the gender of the individual in the text string.
Identification Analysis   Identifies the text string as referring to a particular predefined
                          category.
Match                     Generates match codes for text strings, where the match codes
                          denote a fuzzy representation of the character content of the
                          tokens in the text string.
You learn more about each of these definition types as you use them in the next few lessons.
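As noted earlier, SAS Data Quality Server surfaces these definition types as functions in SAS code. The following is a rough, hedged sketch only: the QKB setup location is site specific, and the definition and locale names shown are typical of a Contact Information QKB but must be verified against your installation. The locale must be loaded (for example, with the %DQLOAD autocall macro) before the functions are called.

/* Load the ENUSA locale from the QKB (path is site specific) */
%dqload(dqlocale=(ENUSA), dqsetuploc='C:\SAS\QKB\CI\32');

data work.dq_example;
   length name $40 cased $40 gender $4 category $20 matchcode $40;
   name      = 'mr. JOHN a. SMITH';
   cased     = dqCase(name, 'Proper (Name)', 'ENUSA');               /* Case definition         */
   gender    = dqGender(name, 'Name', 'ENUSA');                      /* Gender Analysis         */
   category  = dqIdentify(name, 'Individual/Organization', 'ENUSA'); /* Identification Analysis */
   matchcode = dqMatch(name, 'Name', 85, 'ENUSA');                   /* Match definition        */
run;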
Lesson 8 DataFlux® Data Management Studio: Essentials
Demonstration: Navigating the DataFlux Data Management Studio Interface ................... 8-9
Practice ............................................................................................................... 8-32
8.3 Quality Knowledge Bases and Reference Data Sources ............................................. 8-33
Demonstration: Verifying the Course QKB and Reference Sources .............................. 8-37
8.1 Overview of Data Management Studio
A data curator's job is to understand and prepare data for use in analytics and reports. SAS has
several offerings that can aid in this effort.
The technology components of the SAS data quality offerings include the following:
• DataFlux Data Management Studio – a Windows-based desktop client application that enables
you to create and manage processes for ensuring the accuracy and consistency of data during
the data management life cycle. Typically, Data Management Studio is thought of as the
development environment.
• DataFlux Data Management Server – provides a scalable server environment for executing the
processes created in Data Management Studio. Typically, Data Management Server is thought
of as the production environment.
• SAS Quality Knowledge Base (QKB) – a collection of files and algorithms that provide the data
cleansing and data management functionality that is surfaced through the nodes in Data
Management Studio processes.
• Reference data source(s) – third-party address verification, enrichment, and geocoding
databases available to validate and enhance your data.
• DataFlux repository – the storage location for objects created and executed in Data Management
Studio. Data Management Server also needs a repository of objects that will be available (and
run) on the server.
Note: Data Management Studio and Data Management Server do not share a repository. Each
component needs its own repository.
(Diagram: basic design-time architecture – reference data packs, the client tier with Data Management Studio, and the DataFlux repository.)
The jobs and services created in Data Management Studio are stored in a DataFlux repository.
Data cleansing algorithms are stored in the Quality Knowledge Base (QKB). The rules and
algorithms in the QKB are specific to countries, as well as languages from around the world.
Third-party databases, known as “data packs” or reference data sources, can be purchased and
used within Data Management Studio. Data packs enable you to verify and augment the data
coming from source systems based on reference data sources from around the world.
The diagram above shows a basic design-time architecture using Data Management Studio to
access the various technology components needed to build data cleansing processes.
(Diagram: the proven three-phase methodology – PLAN (DEFINE, DISCOVER), ACT (DESIGN, EXECUTE), and MONITOR (EVALUATE, CONTROL).)
The three-phase methodology is a step-by-step process for performing data management tasks
such as data quality, data integration, data migrations, and master data management (MDM). When
organizations plan, act on, and monitor data management projects, they build the foundation to
optimize revenue, control costs, and mitigate risks. No matter what type of data you manage,
DataFlux technology can help you gain a more complete view of corporate information.
PLAN
DEFINE The planning stage of any data management project starts with this essential first
step. This is where the people, processes, technologies, and data sources are
defined. Roadmaps that include articulating the acceptable outcomes are built.
Finally, the cross-functional teams across business units and between business
and IT communities are created to define the data management business rules.
DISCOVER A quick inspection of your corporate data would probably find that it resides in many
databases, managed by different systems, with different formats and representations
of the same data. This step of the methodology enables you to explore metadata to
verify that the correct data sources are included in the data management program.
You can also create detailed data profiles of identified data sources so that you can
understand their strengths and weaknesses.
ACT
DESIGN After you complete the first two steps, this phase enables you to take the different
structures, formats, data sources, and data feeds and create an environment that
accommodates the needs of your business. At this step, business and IT users build
workflows to enforce business rules for data quality and data integration. They also
create data models to house data in consolidated or master data sources.
EXECUTE After business users establish how the data and rules should be defined, the IT staff
can install them within the IT infrastructure and determine the integration method
(real time, batch, or virtual). These business rules can be reused and redeployed
across applications, which helps increase data consistency in the enterprise.
MONITOR
EVALUATE This step of the methodology enables users to define and enforce business rules to
measure the consistency, accuracy, and reliability of new data as it enters the
enterprise. Reports and dashboards about critical data metrics are created for
business and IT staff members. The information that is gained from data monitoring
reports is used to refine and adjust the business rules.
CONTROL The final stage in a data management project involves examining any trends to
validate the extended use and retention of the data. Data that is no longer useful
is retired. The project’s success can then be shared throughout the organization.
The next steps are communicated to the data management team to lay
the groundwork for future data management efforts.
(Screenshot: the Data Management Studio main interface, with the navigation pane, navigation riser bars, and information pane labeled.)
The example shown displays the Data Management Studio main interface.
In the upper left corner of the window, there is the Home tab. This tab is always open and can be
used to navigate back to this view.
At the top of the window, the main menu items are below the Home tab. Selecting one of these items
reveals a selection of actions that are available for that particular menu item.
Also at the top of the window, to the right of the main menu, there is the main toolbar. It contains
various buttons to assist you as you navigate in Data Management Studio.
The riser bars are in the lower left corner of the window. Select one of the riser bars to display the
objects that are available for that particular item in the navigation pane.
On the left side of the window, the navigation pane is above the riser bars. This pane can be
refreshed to display the items for the selected riser bar.
Selecting an item in the navigation pane controls the content of the main area, or information pane.
(Screenshot: a data job tab, with the resource pane, data flow editor, and details pane labeled.)
When you open new objects from the repository in Data Management Studio, a new tab appears for
each object that you opened. The primary tabs for these objects are displayed across the top of the
main interface, one for each object that you opened. Click a tab to bring it to the front and display its
contents.
In the example shown, the selected tab is displaying a data job. If you want to see the contents of
more than one tab, you have the option to detach the selected tab. The icon that enables you to do
this is in the upper right corner of the tab itself. After you detach a tab, the icon changes to enable
you to re-attach it to the other tabs.
Each item type has its own interface. For example, the way that you specify and view properties and
results for a profile is different from a data exploration, data job, or any other item.
For any data job, there is a set of secondary tabs on the left side of the window, near the top. These
tabs enable you to navigate further within the data job (for example, to view or edit the data flow, the
settings, or the variables, or to review the log).
The Data Flow tab (the leftmost secondary tab) includes a resource pane with riser bars. The display
shows the Nodes riser bar, which provides access to transformation nodes that can be added to
data jobs.
The Data Flow tab also includes the data flow editor, which is used to visualize and build data flows
with the available transformation nodes.
The details pane at the bottom displays information for the selected node in the data flow as well as
for the data flow itself.
Note: Other objects (for example, process jobs) can use or take advantage of the details pane. To
view (or toggle off) the details pane, click the icon on the toolbar or use the main toolbar.
In this demonstration, you explore and navigate the interface for DataFlux Data Management Studio.
1. Select Start → All Programs → DataFlux Data Management Studio 2.7 (studio1).
The main interface for DataFlux Data Management Studio is now ready for use.
(Screenshot: the Home tab.)
5. View the navigation information on the left side of the interface.
(Screenshot: the navigation pane, navigation area, and navigation riser bars are labeled.)
6. View the information pane on the right side of the interface.
(Screenshot: the information pane is labeled.)
(Screenshot: the Home tab information pane, with areas for recent files, the methodology, Data Roundtable discussions, documentation, and settings.)
a. Clear the selection of Display Data Management Methodology on Startup on the bottom
bar.
Information appears about the steps to perform in the Define portion of the methodology.
The Data Explorations topic appears in the DataFlux Data Management Studio online Help.
Collections                 A collection is a set of fields that are selected from tables that
                            are accessed from different data connections. A collection
                            provides a convenient way for users to build a data set using
                            those fields. A collection can be used as an input source for a
                            profile in Data Management Studio.
Data Connections            Data connections are used to access data in jobs, profiles, data
                            explorations, and data collections.
Master Data Foundations     The DataFlux Master Data Foundations feature in Data
                            Management Studio uses master data projects and entity
                            definitions to develop the best possible record for a specific
                            resource, such as a customer or a product, from all of the
                            source systems that might contain a reference to that resource.
The information area provides overview information about all defined data connections, such as
names, descriptions, and types.
The different types of data connections that can be registered in DataFlux Data Management
Studio are listed in the table below.
ODBC Connection                  Displays the Microsoft Windows ODBC Data Source Administrator
                                 dialog box, which can be used to create ODBC connections.
Domain Enabled ODBC Connection   Enables you to link an ODBC connection to an authentication server
                                 domain so that credentials for each user are automatically applied
                                 when the domain is accessed.
Custom Connection                Enables you to access data sources that are not otherwise supported
                                 in the Data Management Studio interface.
SAS Data Set Connection          Enables you to connect to a folder that contains one or more SAS
                                 data sets.
12. Click the DataFlux Sample data connection in the navigation pane.
The tables that are accessible through this data connection appear in the information area.
13. In the navigation area, click the Data Management Servers riser bar.
c. When you are prompted for credentials for the SAS Metadata Server, enter Bruno for the
user ID and Student1 for the password.
e. Click the first defined server, DataFlux Data Management Server - sasbap.
The information area displays specifics about this defined server.
8.2 DataFlux Repositories
DataFlux Repository
A DataFlux repository is the storage location for objects that the user creates in Data Management
Studio.
The repository is used to organize work, and it can be used to surface lineage between data sources,
objects, and files created.
The repository consists of two components: the data storage part of the repository and the file
storage part of the repository.
The data storage part of the repository holds items such as custom metrics, data jobs, process jobs, queries, and match reports.
Repositories are defined using the Administration riser bar in Data Management Studio. After you select the Administration riser bar, you can select Repository Definitions in the navigation pane to see information about the repositories that are registered.
This demonstration illustrates the steps that are necessary to create a DataFlux repository in Data Management Studio.
2. Click Repository Definitions in the list of Administration items in the navigation pane.
The information pane displays details for all defined repositories.
4) Press Enter.
7) Click Open.
d. In the File storage area, click Browse next to the Folder field.
1) Navigate to D:\Workshop\dqdmp1\Demos.
4) Press Enter.
5) Click OK.
e. Clear Private.
The final settings for the new repository definition should resemble the following:
f. Click OK.
A message window appears and states that the repository does not exist.
g. Click Yes.
h. Click Close.
c. Right-click the Basics Demos repository and select New  New Folder.
1) Enter output_files.
2) Press Enter.
d. Right-click the Basics Demos repository and select New  New Folder.
1) Enter profiles_and_explorations.
2) Press Enter.
Note: It is a best practice to use all lowercase and no spaces for folder names that might be used on the DataFlux Data Management Server, because some server operating systems are case sensitive.
The final set of folders for the Basics Demos repository should resemble the following:
Note: The listed repository folders exist as locations in both Data storage (database) and File storage (operating system folders).
Note: When you create a repository folder, an operating system folder with the same name is created in the file storage location. Data storage is managed internally by the database that is used.
Practice
1. Creating a Repository for Upcoming Practices
Create a new repository to be used for the items created in the upcoming practices. Some specifics for the new repository are as follows:
• Name the repository Basics Exercises.
• Create a database file repository named D:\Workshop\dqdmp1\Exercises\Exercises.rps.
• Specify a file storage location of D:\Workshop\dqdmp1\Exercises\files.
• Specify that the repository is a shared repository.
2. Updating the Set of Default Folders for the Basics Exercises Repository
Create a new repository in Data Management Studio that attaches to an existing repository that contains solution files. Some specifics for attaching to the existing repository are as follows:
• Name the repository Basics Solutions.
• Point to a database file repository named D:\Workshop\dqdmp1\solutions\solutions.rps.
• Specify a file storage location of D:\Workshop\dqdmp1\solutions\files.
• Specify that the repository is a shared repository.
• Verify that the repository contains the output_files and profiles_and_explorations folders.
8.3 Quality Knowledge Bases and Reference Data Sources
The SAS Quality Knowledge Base (QKB) is a collection of files and algorithms that store data and logic for defining data management operations such as data cleansing and standardization. The definitions in the QKB that perform the cleansing tasks are geography and language specific. This combination of geography and language is known as a locale.
Data Management Studio enables you to design and execute additional algorithms in any QKB. This enables you to perform data cleansing tasks for other types of data as well. Depending on the activities that you want to perform, you might need to modify existing definitions or create your own definitions in the Quality Knowledge Base.
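Although this course accesses the QKB through the Data Management Studio interface, the same locale-based definitions can also be invoked programmatically. The following is a minimal SAS Data Quality Server sketch, assuming that product is licensed, that an ENUSA locale is installed, and that the QKB path, the input table, and the definition name are placeholders rather than values from the course image:

   /* Load the ENUSA locale from the QKB (path is a placeholder) */
   %dqload(dqlocale=(ENUSA), dqsetuploc='C:\QKB\CI');

   data work.standardized;
      set work.customers;                 /* hypothetical input table           */
      length std_name $ 60;
      /* Apply a QKB standardization definition for the ENUSA locale.           */
      /* Definition names vary with the QKB release that is installed.          */
      std_name = dqStandardize(name, 'Name', 'ENUSA');
   run;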
Two SAS training courses discuss how to modify and create definitions in the QKB. The courses are as follows:
• DataFlux Data Management Studio: Understanding the Quality Knowledge Base
• DataFlux Data Management Studio: Creating a New Data Type in the Quality Knowledge Base
The locations of the Quality Knowledge Base files are registered on the Administration riser bar in Data Management Studio.
Locales are organized in a hierarchy according to their language and country. You can expand the QKB to display the available languages, and then expand one of the languages to view the locales that are associated with that language.
Note: There can be only one active QKB at a time in a Data Management Studio session.
QKB definitions enable you to perform a variety of data curation tasks. You will explore many of the available definition types in future lessons.
Reference data sources are used to verify and enrich data. The reference sources (also known as data packs) are typically databases that Data Management Studio uses to compare user data against. Given enough information to match an address, location, or phone number, the reference data source can add a variety of additional fields to further clarify and enrich your data.
Data Management Studio allows direct use of data packs provided by the United States Postal Service, Canada Post, and Geo+Phone data. The data packs and updates are found on http://support.sas.com/downloads in the SAS DataFlux Software section.
Note: You cannot directly access or modify reference data sources.
Reference source locations are registered on the Administration riser bar in Data Management Studio. In the display above, you can see that there are registrations for three different reference data sources: Canada Post Data, Geo+Phone Data, and USPS Data.
Note: Only one reference source location of each type can be designated as the default.
This demonstration illustrates how to verify the QKB and the reference sources that are defined for the course.
The information area displays summary information about the selected QKB.
8. Verify that three reference sources are defined for this instance.
The information area displays summary information about the selected reference source.
Each defined reference source can be similarly investigated by clicking the registered item.
8.4 Data Connections
Data connections enable you to access your data in Data Management Studio from many types of data sources.
Custom Connection: Enables you to access data sources that are not otherwise supported in the Data Management Studio interface.
SAS Data Set Connection: Enables you to connect to a folder that contains one or more SAS data sets.
In Data Management Studio, you can create ODBC connections to any ODBC data source that is defined to the machine with the ODBC driver manager.
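For comparison, the same ODBC data sources can also be reached from a SAS session through SAS/ACCESS Interface to ODBC. The following is a minimal sketch, assuming the DSN name created in this lesson's demonstration, a df_gifts schema, and placeholder credentials:

   /* The "DataFlux Training" DSN is assumed to exist in the Windows  */
   /* ODBC Data Source Administrator; credentials are placeholders.   */
   libname dftrain odbc dsn="DataFlux Training" schema=df_gifts
           user=student password=XXXXXXXX;

   proc datasets lib=dftrain;   /* lists the tables that the DSN exposes */
   run;
   quit;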
Data connections are created from the Data riser bar in Data Management Studio. After you select the Data riser bar and the Data Connections item, you can see the summary information for the existing data connections. New data connections can be added using menus or toolbar buttons.
After the data connections are defined, you can access these data sources when you build data explorations, data profiles, data jobs, and other items in Data Management Studio.
In Data Management Studio, you can preview the data from within the interface. In addition to scrolling through the rows of data, you can also sort and filter the data within the Data tab.
Another useful feature of Data Management Studio is the ability to visualize the data from a selected table. You have several options for the types of visualizations that you can create on the data from the Graph tab.
This demonstration illustrates how to define and work with a data connection, including using the data viewer and generating queries.
d. Click Add.
i. Click Next.
1) Click With SQL Server authentication using a login ID and password entered by the
user.
j. Click Next.
The new data source appears on the System DSN tab of the ODBC Data Source
Administrator window.
5. Verify that the new data connection appears in Data Management Studio.
a. Click Data Connections in the navigation pane. Then select View  Refresh.
b. Expand Data Connections in the navigation area to view the new data connection.
Note: In addition to the new ODBC data source that you created, all ODBC data sources already defined on the machine where Data Management Studio is installed are available in Data Management Studio.
d. Click OK.
Note: To avoid entering credentials for this data connection in subsequent Data Management Studio sessions, you can right-click the data connection name and select Save User Credentials. This action creates a .cfg file with the same name as the data connection (thus, for this data connection, the file would be DataFlux Training.cfg).
This configuration file is saved in <user-directory>\DataFlux\dac\savedconn.
Note: The user directory is specified in the Default Settings portlet. (Select the Information riser bar and the Overview item in the navigation pane, and the Default Settings portlet appears in the information pane or main area.)
The DataFlux Training data connection shows a listing of five different schemas.
5) Click OK.
Only the data sources intended for use in Data Management Studio are now displayed.
6) Expand df_gifts.
7) Verify that five tables exist in this schema.
8) Expand df_grocery.
This display shows the five df_gifts tables as well as the two df_grocery tables.
It is important to know the basic steps for defining data connections as shown in this demonstration. However, the two data connections that we work with were predefined as part of the course image. During class, we work with dfConglomerate Gifts (used primarily in the demonstrations) and dfConglomerate Grocery (used primarily in the practices).
In the next demonstration, we explore a table in the dfConglomerate Gifts data connection through the data viewer component of Data Management Studio, as well as examine a generated graphic. In the subsequent practice, you explore a table (BREAKFAST_ITEMS) in the dfConglomerate Grocery data connection.
The information area displays summary information for the selected table.
By default, the Data tab displays up to 500 records for the selected table. This value (500) can be changed in the Data Management Studio Options window.
The Data tab has a number of tools that can aid you in viewing and understanding your data.
9. Click (Sort By) on the Data tab.
a. Double-click JOB TITLE in the Available Fields list to move it to the Selected Fields list.
c. Click OK.
The data are now displayed in sorted order (by descending JOB TITLE).
a. Double-click LAST NAME in the Available Fields list to move it to the Selected Fields list.
We have now established a secondary sort field. The data will be sorted by JOB TITLE, and within each distinct JOB TITLE value, the data will be sorted (ascending) by LAST NAME.
b. Click OK.
This tab can be used to examine properties for each field of the selected table.
The Data riser bar has the basic capabilities for examining descriptive information (Fields tab), data information (Data tab), and graphical views of data (Graph tab).
Practice
• View the data on the Data tab and sort by the columns BRAND and NAME.
Answer:
Answer:
Answer:
• Create a graph using the Graph tab with the following specifications:
X axis: NAME
Y axis: SIZE
Question: What NAME value has the highest value (or sum of values) of SIZE for the sample defined by the default "row count range" of 30?
Answer:
8.5 Solutions
Solutions to Practices
1. Creating a Repository for Upcoming Practices
a. If necessary, select Start  All Programs  DataFlux Data Management Studio 2.7.
d. Select Repository Definitions in the list of Administration items in the navigation pane.
e. Click New to define a new repository.
d) Press Enter.
g) Click Open.
4) In the File storage area, click Browse next to the Folder f ield.
a) Navigate to D:\Workshop\dqdmp1\Exercises.
d) Press Enter.
e) Click OK.
5) Clear Private.
6) Click OK. A message window appears and states that the repository does not exist.
7) Click Yes to create the new repository. Information about the repository initialization appears in a window.
8) Click Close.
2. Updating the Set of Default Folders for the Basics Exercises Repository
a. If necessary, select Start  All Programs  DataFlux Data Management Studio 2.7.
e. Right-click the Basics Exercises repository and select New  New Folder.
1) Enter output_files.
2) Press Enter.
f. Right-click the Basics Exercises repository and select New  New Folder.
1) Enter profiles_and_explorations.
2) Press Enter.
The final set of folders for the Basics Exercises repository should resemble the following:
a. If necessary, select Start  All Programs  DataFlux Data Management Studio 2.7.
d. Click Repository Definitions in the list of Administration items in the navigation pane.
b) Click the solutions.rps file so that it appears in the File name field.
c) Click Open.
4) In the File storage area, click Browse next to the Folder field.
a) Navigate to D:\Workshop\dqdmp1\solutions\files.
b) Click OK.
5) Clear Private.
a. If necessary, select Start  All Programs  DataFlux Data Management Studio 2.7.
The information area is populated with information about the selected table.
4) Click OK.
The data is displayed in sorted order (by BRAND and then by NAME).
l. Scroll to the right to see the highest value of SIZE for the sample defined by the default "row count range" of 30.
Question: What NAME value has the highest value (or sum of values) of SIZE for the sample defined by the default "row count range" of 30?
Lesson 9 DataFlux® Data
Management Studio: Understanding
Data
9.1 Methodology Review ................................................................................................... 9-3
9.1 Methodology Review
[Diagram: the methodology wheel showing the PLAN, ACT, and MONITOR phases and the steps DEFINE, DISCOVER, DESIGN, EXECUTE, EVALUATE, and CONTROL.]
Recall the three-phase methodology introduced earlier. Starting in this lesson, we examine different items and components of Data Management Studio as they relate to one of the three phases. In this lesson, the PLAN phase is explored.
PLAN Phase
Define: The planning stage of any data management project starts with this essential first step. This is where the people, processes, technologies, and data sources are
defined. Roadmaps that include articulating the acceptable outcomes are built.
Finally, the cross-functional teams across business units and between business
and IT communities are created to define the data management business rules.
Discover: A quick inspection of your corporate data would probably find that it resides in many
different databases, managed by many different systems, with many different
formats and representations of the same data. This step of the methodology lets
you explore metadata to verify that the right data sources are included in the data
management program – and create detailed data profiles of identified data sources
to understand their strengths and weaknesses.
Within Data Management Studio, two items can be used to help with the understanding of the data
sources – a data exploration and a data profile.
A data exploration will help you understand the structure of your data sources. Data explorations
provide comparisons of structural information across data sources, showing for example, variations
in the spelling of field names.
A data profile helps in understanding the data values in the various data sources. Data profiles
show distributions of field values and include useful metrics such as mean and standard deviation
for numeric fields and pattern counts for text fields. Data profiling will also surface occurrences of
null values, outliers, and other anomalies.
From reports generated for a data exploration, a data collection can easily be constructed (a
collection is a simple list of fields from possibly various tables from possibly various data
connections). Therefore, before studying data explorations, we will define and build a simple data
collection.
From reports generated for a data profile, the inconsistencies in actual data values can be viewed
through frequency distributions. These frequency distributions can be used as the basis for an item
called a standardization scheme (schemes are simple lists of data values with corresponding
standard values). Therefore, after finding issues in data profiles, we will define and build a
standardization scheme.
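Conceptually, a scheme behaves like a lookup from each observed variant to its standard value. As a rough SAS analogue (not the Data Management Studio scheme mechanism itself), a format can encode the same idea; the table, variable, and values below are illustrative only:

   proc format;
      value $statestd (default=20)       /* variants found during profiling   */
         'N.C.', 'N. Carolina', 'North Carolina' = 'NC'
         'CALIF', 'California'                   = 'CA';
   run;

   data work.standardized;
      set work.customers;                /* hypothetical input table           */
      length state_std $ 20;
      state_std = put(state_province, $statestd.);  /* unmatched values pass through */
   run;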
9.2 Creating Data Collections
Data Collection
A data collection is simply a set of data fields from one or more different tables from one or more different data connections. This is very useful to a data scientist for project documentation and for identifying what data fields come from what data sources.
A data collection
• provides a convenient way to group related data fields across data sources
• proves to be beneficial in data governance efforts (for example, as a way to record all the fields that contain phone numbers)
• can be used as an input source for profiles.
In the example:
• There are several defined data connections, where each data connection could contain one to many tables.
• You want to construct a data collection from fields from several tables from several data connections.
• You can choose
  – the fields from a single table in a specific data connection.
  – the fields from a single table in a different data connection.
  – the fields from a particular table in a specific data connection.
  – the fields from a different table in a specific data connection.
• The final data collection shown is a list of nine fields from four tables, where the tables are found in three different data connections.
Data collections are created in the Data riser bar in Data Management Studio and are stored in a
repository.
9.3 Designing Data Explorations
A data exploration reads metadata and/or data from databases and uses one or more of the three
types of analyses available to help understand the data structures and data being used in a project.
The three types of analyses available for exploring data are field name matching, field name
analysis, and sample data analysis.
Each type of analysis uses a different algorithm (definition) from the QKB to perform analysis on the
data.
• Field name matching analysis uses a definition known as a match definition. This algorithm
applies fuzzy logic to determine fields that might represent the same type of data, based on the
field name. For example, if you had a field named Phone_Number in one table and a field
named Phone_No in another table, the definition would identify these fields as potentially being
the same, based on a match code generated from the field name.
• Field name analysis uses a definition known as an identification analysis definition. This
algorithm uses the actual words contained in the field name, and looks up the words in a
vocabulary of words to categorize the words and identify their potential meanings. For example,
a field named Fax_Number might be categorized as a PHONE type field.
• Sample data analysis also uses an identification analysis definition but provides the ability to sample the data in the table to determine whether the data is of a specific type. For example, a sampling of 50 data records could reveal that the values in a particular field each contain 10 digits. The identification analysis definition might then categorize the field (based on its data contents) as a PHONE type field.
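Match definitions and identification analysis definitions like the ones described in the list above are also exposed as functions in SAS Data Quality Server. The following is a minimal sketch, assuming that product is licensed, that the QKB path is a placeholder, and that a 'Text' match definition and a 'Field Name' identification definition exist in the installed ENUSA locale:

   %dqload(dqlocale=(ENUSA), dqsetuploc='C:\QKB\CI');   /* placeholder QKB path */

   data work.field_names;
      length fieldname $ 32;
      input fieldname :$32.;
      datalines;
   Phone_Number
   Phone_No
   Fax_Number
   ;
   run;

   data work.analyzed;
      set work.field_names;
      length mcode $ 40 identity $ 20;
      /* similar names such as Phone_Number and Phone_No get the same match code */
      mcode    = dqMatch(fieldname, 'Text', 85, 'ENUSA');
      /* identification analysis assigns a category such as PHONE                */
      identity = dqIdentify(fieldname, 'Field Name', 'ENUSA');
   run;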
The Field Relationship map provides a visual presentation of the field relationships between all the
databases, tables, and fields that are included in the data exploration.
In the above map:
• The outer ring represents the data connections specified for this analysis. The above map has
two “slices” – one for dfConglomerate Gifts and one for dfConglomerate Grocery.
• The inner ring has "slices" – each slice represents a table within its data connection. There are five slices for the five tables in dfConglomerate Gifts. There are two slices for the two tables in dfConglomerate Grocery.
• The dots represent fields for the tables specified for analysis. In the above diagram a dot is
selected that represents the STREET_ADDR field from the MANUFACTURERS table in
dfConglomerate Grocery.
• The green lines for a selected dot represent field relationships – field names that are either
exactly spelled the same or determined to be a match from a particular analysis applied.
This gives the data scientist an idea of how the data can possibly be joined together.
If Field name matching is selected, an appropriate match definition (from the specified locale) along
with a preferred sensitivity must be specified.
Field name matching will analyze the names of each field from the selected data sources to
determine which fields have either an identical name or names that are similar enough to be
considered a match.
The Report tab has a variety of reports and information to be investigated. The report results of
running a Field Name Matching analysis are surfaced on the Field Match riser bar.
The Field Match report displays a list of the fields analyzed along with their corresponding matches.
A field of interest is selected in the left navigation pane, and the main area (middle area) displays the
corresponding matches.
In the example shown, based on the Field Match Analysis, the six fields in the main area were
identified as potentially relating to the selected field in the navigation pane (in this case, the Address
field from the Customers table in dfConglomerate Gifts).
If Field name analysis is selected, an appropriate identification analysis definition (from the
specified locale) must be specified.
Field name analysis analyzes the names of each field from the selected data sources to determine
which identity or category to assign to the field. An identity assigns a semantic meaning to the type
of information that the field contains (for example, address, phone number, and so on). For example,
a field named BUSINESS_PHONE could get categorized as a PHONE field, or a field named
BUSINESS_EMAIL could get categorized as an EMAIL field.
The report results of running a Field Name Analysis are surfaced on the Identification Analysis riser bar.
The Identification Analysis report displays a list of fields in metadata that match certain categories
based on the field name as well as the Identification Analysis definition that was selected in the
properties for the Field Name Analysis in the data exploration.
In the example shown, based on the names of the fields, there are 10 fields from a variety of data
sources and tables whose field names meet the criterion in the Field Name identification analysis
definition for PHONE categorization or identity.
Note: The categories for the Field Name identification analysis definition (for example, ADDRESS,
CITY, COUNTRY, COUNTY, etc.) come from the definition in the QKB.
If Sample data analysis is selected, an appropriate identification analysis definition (from the
specified locale) must be specified.
Sample data analysis analyzes a sample of data in each field to determine which identity or category
to assign to the field.
The report results of running a Sample Data Analysis are surfaced on the Identification Analysis riser bar.
The Identification Analysis report displays a list of fields placed in certain categories based on a
sample of data records.
In the example shown, based on a sample of 150 data records, there are 15 fields shown in the main
(middle) area that come from a variety of data sources and tables. Those fields contain values that
meet the criterion in the Contact Info definition for phone numbers.
The Table Match report displays a list of database tables that contain matching fields for a selected
table or field.
Once a table is selected from the list of database tables, the main area refreshes to show all related
tables, as well as what percentage of columns in the table are in common.
In the example shown, the Customers table from dfConglomerate Gifts is selected. In the main
area, you see that the Customers table is related to six additional tables from the two data
connections.
The list of related tables shows the number of fields matched to the fields of the selected table.
Specifically, for the Manufacturers table from dfConglomerate Grocery, there are three fields in
common.
Selecting the Manufacturers table from dfConglomerate Grocery surfaces a relational diagram at
the bottom of the main area. This shows the fields from the Customers table that are matched to
same-named fields in the Manufacturers table. Specifically, the two tables have the ID, CITY, and
NOTES fields in common.
Note: This is not an interactive diagram; you cannot draw additional lines or delete the existing
relationship lines.
b. Click OK.
The new data exploration appears on a primary tab.
4) Click Add.
5) Click Close to close the Add Tables window.
b. If necessary, expand each of the data sources.
10. Specify the settings for sample data analysis.
a. In the Analysis Methods area, click Sample data analysis.
b. Enter 500 for Sample size (records).
c. Click once in the Locale field. This reveals a selection tool.
d. Click in the Locale field and select English (United States).
e. Click once in the Identification Analysis Definition field. This reveals a selection tool.
f. Click in the Identification Analysis Definition field and select Contact Info.
11. Select Actions  Run to execute the data exploration.
Note: You can also click (Run Exploration) on the toolbar to execute the data exploration.
12. Explore the field relationship map results.
a. Click the Map tab.
The outer ring segments represent the selected data connections: dfConglomerate Gifts and
dfConglomerate Grocery.
The inner ring segments represent the selected tables from each data connection. Moving
the cursor over a segment of the inner ring displays the table name.
The dots represent the fields in each table.
b. Locate the MANUFACTURERS table from the dfConglomerate Grocery data connection.
For the two tables in the dfConglomerate Grocery connection, there are 32 fields.
There are 59 fields that match the fields in the BREAKFAST_ITEMS table.
There are 85 fields that match the fields in the MANUFACTURERS table.
f. Click the MANUFACTURERS table.
The 17 fields from the MANUFACTURERS table are displayed with the number of matching
fields found for each.
g. Expand the MANUFACTURERS table to display a list of fields for this table.
h. Click the STREET_ADDR field.
The Contact Info identification analysis definition has eight defined categorizations for data
(ADDRESS, BLANK, E-MAIL, MIXED, NAME, ORGANIZATION, PHONE, UNKNOWN). This
definition inspects the data values (we chose a sample size of 500) to see whether the data
seems to be representative of ADDRESS data, or BLANK data, or E-MAIL data, and so on.
The Field Name identification analysis definition has 19 defined categorizations for data
(ADDRESS, CITY, COUNTRY, COUNTY, DATE, EMAIL, GENDER, GENERIC_ID,
MARITAL_STATUS, MATCHCODE, NAME, ORGANIZATION, ORGANIZATION_ID,
PERSONAL_ID, PHONE, POSTALCODE, STATE/PROVINCE, UNKNOWN, URL). This
definition inspects the field names to see whether a field name seems to be an ADDRESS
field name, or a CITY field name, or a COUNTRY field name, and so on.
This demonstration illustrates the steps that are necessary to create a data collection when
reviewing the results of a data exploration.
11. In the main area, select all the fields that are identified as ADDRESS fields.
a. Click the first field.
b. Hold down the Shift key.
c. Click the last field.
12. Right-click one of the selected fields and select Add To Collection  New Collection.
c. Click OK.
The new collection appears on a separate tab.
9.4 Creating Data Profiles
A data profile enables you to inspect data for errors, inconsistencies, redundancies, and incomplete
information.
Examples:
• Suppose you need to examine payments for a week's orders. Metrics created for a profile can
show that just over 20% of the orders placed have not been paid.
• Suppose you work in a human resources group and need to generate some metrics (for
example, minimum salary, average salary, maximum salary) to compare across various job
categories. However, an examination of the names of job categories shows an alarming
inconsistency.
• Suppose you work in a marketing group and want to do some promotions for customers in
various states. However, examining the values of the STATE field shows inconsistent styles
(patterns) of information.
To run a data profile on your data, you must first identify the properties for the data profile.
Specifically, you must do the following actions:
• select the data sources to be profiled
• select the data tables to be profiled
• select the columns to be profiled
• specify the metrics to be calculated against the data
• specify additional, more advanced options (Primary Key/Foreign Key Analysis, Redundant Data
Analysis, and Alerts)
You must specify which data profiling metrics are to be calculated before the profile is executed. You have full control over the metrics that are calculated for a data profile. You can select these metrics by navigating to Tools  Default Profile Metrics on the Properties tab of the data profile.
Standard metrics include the following:
• record count
• null count
• percent null
• blank count
• minimum value
• maximum value
• pattern count
• unique count
• uniqueness
• primary key candidate
• data type
• data length
• actual type
• minimum length
• maximum length
• non-null count
• nullable
• decimal places
• statistical calculations (mode, mean, median, standard deviation, standard error)
Note: You can override metrics for a specific column on the Properties tab.
Not all profile metrics are necessary for all the columns that are profiled. As part of the planning
phase, you should determine which metrics are most important for the types of data fields that you
want to profile.
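As a point of reference, a few of these column-level metrics can be approximated with ordinary SAS summary procedures. The sketch below is illustrative only; WORK.CUSTOMERS and the COUNTRY_REGION column are hypothetical stand-ins for a profiled table and field:

   proc sql;
      select count(*)                       as record_count,
             sum(missing(country_region))   as null_count,
             count(distinct country_region) as unique_count   /* excludes nulls */
      from work.customers;
   quit;

   proc freq data=work.customers order=freq;
      tables country_region / missing;      /* frequency distribution, nulls included */
   run;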
You can view the results of the profile by clicking the Report tab.
On the Report tab, selecting a table displays the standard metrics for all columns in the table.
Expanding the table and selecting a specific column displays the standard metrics that are specifically for that column.
Visualizations are customized charts that you create based on the data and the calculated metrics.
Visualizations can reveal patterns in the metrics that might not be apparent when you view the table
of standard metrics.
As a data scientist, you can use these visualizations to document the health and well-being of the data used in your project.
Notes can be added to the report (at both the field level and the table level) to aid in the planning
process. In the example, you see the table-level notes that were entered for the Customers table.
Individual fields from a table can be selected. The Column Profiling tab displays the standard metrics
for the selected field. In the example, you see the standard metrics that were calculated for the
STATE/PROVINCE field.
A distribution of the frequency counts can be calculated for the values of a field. The list of
distribution values can also be filtered and visualized.
If you double-click a specific value in the frequency distribution report, a window opens, showing all the records in the original data source that contain that value.
A distribution of the pattern frequency counts (the pattern of the value in the field) can be calculated.
To view the record (or records) that contain a specific pattern, double-click the pattern distribution
value. The list of pattern distribution values can also be filtered and visualized.
For a pattern, the following rules apply:
• A represents an uppercase letter.
• a represents a lowercase letter.
• 9 represents a digit.
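For illustration, the same pattern signature can be computed in a SAS DATA step with regular-expression substitutions; the input table and column below are hypothetical:

   data work.patterns;
      set work.customers;                               /* hypothetical input */
      length pattern $ 40;
      /* A = uppercase letter, a = lowercase letter, 9 = digit                 */
      pattern = prxchange('s/[A-Z]/A/', -1, strip(country_region));
      pattern = prxchange('s/[a-z]/a/', -1, pattern);
      pattern = prxchange('s/[0-9]/9/', -1, pattern);
   run;

   proc freq data=work.patterns order=freq;
      tables pattern;                    /* pattern frequency distribution     */
   run;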
Percentiles are calculated for intervals that you specify. In the example, you see the percentiles that
were calculated for the LIST_PRICE field in the Products table.
The Outliers tab lists the X minimum and maximum value outliers. The number of listed minimum and maximum values is specified when the data profiling metrics are set. In the example, you see the outliers that were calculated for the LIST_PRICE field in the Products table.
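In Base SAS, percentiles and extreme values for a numeric column can be examined in a similar spirit with PROC UNIVARIATE; the table name below is a hypothetical stand-in for the profiled Products table:

   proc univariate data=work.products nextrobs=5;   /* 5 lowest and 5 highest values */
      var list_price;
      output out=work.pctls pctlpts=5 25 50 75 95 pctlpre=p_;
   run;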
If a frequency distribution is calculated for a particular field, then these calculated values are
available for a special chart type of Frequency Distribution. Recall that visualizations can be created
for a selected table.
The example shown displays
• a Frequency Distribution report for the selected field PRODUCT CODE (from the Products table
from the dfConglomerate Gifts data connection)
• a Frequency Distribution chart for the PRODUCT CODE field from the Products table (from the
dfConglomerate Gifts data connection).
Depending on who is viewing the profile reports (Frequency Distribution report versus Frequency Distribution chart), one report might be more informative than the other, or possibly having both provides clearer insight into the data.
As a data scientist, presenting your project results and proposals is a critical part of your job. These
visualizations will assist you greatly in that endeavor.
If a pattern frequency distribution is calculated for a particular field, then these calculated values are
then available for a special chart type of Pattern Frequency Distribution. Recall that visualizations
can be created for a selected table.
The example shown displays:
• a Pattern Frequency Distribution report for the selected field PRODUCT CODE (from the
Products table from the dfConglomerate Gifts data connection).
• a Pattern Frequency Distribution chart for the PRODUCT CODE field from the Products table
(from the dfConglomerate Gifts data connection).
Depending on who is viewing the profile reports (Pattern Frequency Distribution report versus Pattern Frequency Distribution chart), one report might be more informative than the other, or possibly having both provides clearer insight into the data.
b. Click OK.
The new profile appears on a primary tab.
Note: The first four metric selections take additional processing time to calculate, so be judicious in their selection.
c. Click Cancel to close the Metrics window.
There is no designation for different sets of metric overrides. The check marks simply
indicate that the metrics for the selected fields are different from those that are set for the
overall profile.
10. Select File  Save Profile to save the profile.
After the profile runs and the report is generated, the Report tab becomes active.
A table view of calculated metrics for each field selected from the Customers table appears:
Note: Some metrics display (not applicable), which indicates that the metric calculation is not applicable to that field type.
A few observations:
• There are 63 records in the Customers table.
• There are four different patterns or styles for the COUNTRY/REGION field.
• There are four unique or distinct values for the COUNTRY/REGION field.
There are five rows listed here: the four unique values and (null value). Therefore, it can be
noted that the Unique Count metric does not include null values.
e. Double-click the item (null value).
This action performs a drill-through to the source table and displays a window with the 11
records that have a null value for the COUNTRY/REGION field.
All fields are shown from the data source.
The Drill Through window is simply a display of values. Nothing can be changed via this window's interface. However, the data shown can be exported to a .txt, .csv, or .xls file.
Scrolling to the right and locating the COUNTRY/REGION field reveals all null values.
Investigating corresponding fields (such as CITY, STATE/PROVINCE, and ZIP/POSTAL
CODE) could indicate the correct value for the COUNTRY/REGION field.
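The same drill-through question can be asked of the source data directly. The following query is illustrative only; the table and column names are hypothetical stand-ins for the Customers table and its fields:

   proc sql;
      /* records with a missing COUNTRY_REGION, plus fields that hint at the fix */
      select id, city, state_province, zip_postal_code, country_region
      from work.customers
      where missing(country_region);
   quit;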
g. Click Close to close the Frequency Distribution Drill Through window.
h. Click the Pattern Frequency Distribution tab.
Almost 83% of the data values have the pattern of three capital letters. It would be
worthwhile to investigate fixing the patterns of the nine observations that do not have the
AAA pattern.
Note: The drill-through action is also available for the pattern frequency distribution values.
i. Select Insert  New Note.
1) Enter Check the patterns of this field. in the Add Note window.
This tab displays the top X minimum and maximum values (where X is the value set
in the Metrics window).
m. Add a table-level note.
1) Click the Customers table (on the left).
2) Select Insert  New Note.
The configuration of the chart (field selection; metric selection) can be changed by clicking
the (Edit) toolbar button.
The chart can be saved as .jpg, .png, .gif, or .bmp.
The chart can be printed if a valid printer connection is available.
If the desired visualization is not understandable as a bar chart, it can be changed easily by
selecting one of the other available types.
16. Investigate a different table that was also profiled in this job.
a. Expand the dfConglomerate Grocery data connection.
b. Expand the MANUFACTURERS table.
c. Click the NOTES field.
d. Click the Frequency Distribution tab (in the main / information area).
Recall that this metric was cleared in a metric override for this field.
Similarly, the Pattern Frequency Distribution, Percentiles, and Outliers tabs display
(Not calculated) because these options were also cleared in the metric override for
this field.
17. Select File  Close Profile.
Practice
Question: How many distinct values exist for the UOM field in the BREAKFAST_ITEMS
table?
Answer:
Question: What are the distinct values for the UOM field in the BREAKFAST_ITEMS
table?
Answer:
Question: What are the ID values for the records with a value of PK for the UOM field
in the BREAKFAST_ITEMS table?
Answer:
9.5 Profiling Other Input Types
Thus far, you worked with profiling data in the data sources that were defined in the data connections. However, it is important to note that you are not limited to only the data in the data connections for profiling. You can also profile data using the following as input sources: text files, SQL queries, filters on tables, and collections.
Note: You can improve performance when working with large data sets by working with a sample of the data. A sample interval can be specified for any input type.
Note: By default, the sample interval is 1. That is, every row is read when you create the data profile. However, another interval can be specified by selecting Actions  Change Sample Interval.
Data often arrives in the form of a text file (for example, a file with a .txt or a .csv extension), and this data needs to be studied or profiled.
To profile data that is in a text file, you need to select Text Files in the navigation pane of the Properties tab. After you select Text Files, you can then select Insert  New Text File for the profile. Then you can specify other options as you would for any other input source.
• Specify a name for the text file (item to appear as child under Text Files), select the appropriate
file type, and click OK.
• Specify file and appropriate attributes for the file, including field names. Then click OK.
At this point, all fields defined from the text file are available for selection. Selecting fields, selecting metrics, and specifying options work exactly the same as when you select a single table.
When creating a profile, it is sometimes necessary to execute the profile on a select subset of data
from a table. There are two options available for subsetting the table:
• New Filtered Table
• New SQL Query
Both options are available from the Insert menu.
The New Filtered Table option is used to create a new filtered table using the selected source table.
This option enables you to use an interface to build an expression to be used for filtering data
records from the input table. The filtered result set is then profiled.
The New SQL Query option is used to create an SQL query with a WHERE clause (using the
selected table) to filter the data records. This option opens an SQL Query window where you can
enter the SQL query to be used to filter the data. The filtered result set is then profiled.
Important Notes:
• The results generated for both the SQL query and the filtered table are the same.
• When you use a filtered table, all records from the database are returned to the machine where
the profile runs, and the filtering is performed on that machine.
• The database does the filtering for the SQL query and returns the filtered result set only to the
machine where the profile runs.
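The performance difference between the two subsetting options can be sketched outside of the application. In the hypothetical Python example below (the sqlite3 database file, the MANUFACTURERS table, and the CONTACT_CNTRY filter are illustrative assumptions), the first approach pulls every record back before filtering, while the second pushes the WHERE clause down to the database.

import sqlite3
import pandas as pd

conn = sqlite3.connect("dfconglomerate_grocery.db")  # hypothetical database file

# "New Filtered Table" style: all records travel to the profiling machine,
# and the filter expression is evaluated locally.
all_rows = pd.read_sql("SELECT * FROM MANUFACTURERS", conn)
filtered_locally = all_rows[all_rows["CONTACT_CNTRY"] == "USA"]

# "New SQL Query" style: the database evaluates the WHERE clause and returns
# only the filtered result set to the profiling machine.
filtered_in_db = pd.read_sql(
    "SELECT * FROM MANUFACTURERS WHERE CONTACT_CNTRY = 'USA'", conn
)

# The profiled rows are the same either way; only where the work happens differs.
assert len(filtered_locally) == len(filtered_in_db)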
o If needed, click AND or OR, and then build the rest of the compound expression. Then click
Add Condition.
• Click OK to close the Filter on Table window.
At this point, all fields from the original selected table are available for selection for this filtered data. Selecting fields, selecting metrics, and specifying options work exactly the same as when you select a single table.
In the example, a data collection was selected from the Data riser bar as input into the data profile. The new data profile can be given a metadata name of Address Info Collection Profile. This creates a new data profile for you. Only the columns that exist in the collection are selected as input. To save significant time and effort, create collections. Then you do not need to search in all the source data tables in all the data connections for the fields that you want to include in the profile.
9.6 Designing Data Standardization Schemes
Many standardization schemes are provided with the supplied QKBs. Users can also build standardization schemes from a profile's frequency distribution table (on the Report tab for the profile).
When a scheme is applied, if the input data is equal to the value in the Data column, then the data is changed to the value in the Standard column.
In the example shown, the standard value SAS INSTITUTE is used when three different instances or spellings are encountered (SAS, SAS INSTITUTE, SAS INSTITUTE INC).
The following special value might be seen or can be used in the Standard column:
//Remove   The matched word or phrase is removed from the input string.
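A minimal Python sketch of how a scheme's Data/Standard lookup behaves follows. The scheme contents echo the SAS INSTITUTE example above; the N/A entry is a hypothetical addition only to show the //Remove special value.

# Data -> Standard pairs, as they would appear in the scheme.
scheme = {
    "SAS": "SAS INSTITUTE",
    "SAS INSTITUTE": "SAS INSTITUTE",
    "SAS INSTITUTE INC": "SAS INSTITUTE",
    "N/A": "//Remove",  # hypothetical entry illustrating the special value
}

def apply_phrase_scheme(value, lookup):
    """Replace the whole phrase with its standard; unmatched values pass through."""
    standard = lookup.get(value, value)
    return "" if standard == "//Remove" else standard

for raw in ["SAS", "SAS INSTITUTE INC", "N/A", "IBM"]:
    print(raw, "->", apply_phrase_scheme(raw, scheme))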
Standardization schemes are stored in the QKB. There are two types of schemes in the QKB.
• Phrase schemes are used to standardize entire phrases or strings that consist of more than one
word. Some examples of the data types that are typically stored as a phrase include cities,
organizations, addresses, and names.
• Element schemes are applied to each individual word in a phrase. This can be especially useful
if you have the type of data where certain words are repeated frequently (for example, the
qualifying extension on business names). Some data types that are typically standardized using
element schemes include address, organization, city, and name.
In the example, the two data values representing company names can be standardized in either of two ways. You could create a phrase standardization scheme that has both data values in it, with their standard representations. Alternatively, you could create an element standardization scheme that simply standardizes the word Inc., and it could be used to standardize both values.
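The difference between the two approaches can be sketched in a few lines of Python. The company values and scheme entries below are hypothetical; the point is that a phrase scheme matches the whole value, while an element scheme is applied word by word.

phrase_scheme = {
    "DATAFLUX INC.": "DataFlux Inc",
    "ABC COMPANY INC.": "ABC Company Inc",
}
element_scheme = {"INC.": "Inc", "INCORPORATED": "Inc"}

def apply_phrase(value):
    # The entire string must match a Data entry in the scheme.
    return phrase_scheme.get(value.upper(), value)

def apply_element(value):
    # Each word is looked up independently; unmatched words pass through.
    return " ".join(element_scheme.get(word.upper(), word) for word in value.split())

print(apply_phrase("DATAFLUX INC."))           # DataFlux Inc
print(apply_element("DataFlux Incorporated"))  # DataFlux Inc
print(apply_element("ABC Company Inc."))       # ABC Company Inc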
Standardization schemes can easily be built from the results in a profile report. Right-click a column of data that is used in the profile and select Build a Scheme from the menu.
Note: The frequency distribution must exist for the field so that you can use the Build a Scheme option.
When you choose to build a scheme in this manner, the application runs an analysis on the data and provides you with a report of all the permutations of data from the selected field. When generating this report, you need to decide whether you want to analyze the data as entire phrases or as individual words.
If you choose to run a phrase analysis, similar values are grouped in the report. To know how to
group the data values, a definition in the QKB called a match definition is used. To select the proper
match definition, you need to select the type of data being analyzed.
In the example, a phrase analysis is run using the COMPANY field as input. Because this data
represents company (or organization) data, the Organization match definition is used to group
similar data values.
The Report Generation window settings generate the report, which is surfaced on the left side of the
Scheme Builder window. The Scheme Builder has a Report side and a Scheme side.
After the analysis report is displayed inside the Scheme Builder window, you can use the values in the report to build the Data/Standard pairs for the standardization scheme. This process can be done manually with the Add to Scheme button at the bottom of the left pane, or automatically with the Build Scheme toolbar button. Most schemes are built by using a combination of these methods. This enables users to combine automated methods that leverage information and algorithms in the QKB with specific requirements that call for human intervention and decisions.
b. Under Phrase analysis, click the down arrow in the Match definition field and select
Organization.
c. Click the down arrow in the Sensitivity field and select 75.
b) Click OK.
Notice that the change applies to all items in the group.
Note: If a single value in a group of items needs to be changed, then select Edit → Modify Standards Manually → Single Instance. A single value can then be modified manually. To toggle back to the ability to change all instances in a group, select Edit → Modify Standards Manually → All Instances.
b. Change the standard value of dfConglomerate incorporated.
1) On the Scheme side, scroll and locate the grouping of records with the standard value
dfConglomerate incorporated.
2) Right-click one of the dfConglomerate incorporated standard values and select Edit.
a) Enter dfConglomerate Inc. in the Standard field.
b) Click OK.
Notice that the change applies to all items in the group.
The Scheme side of the Scheme Builder window should now resemble the following:
c. Click Save.
12. Select File → Exit to close the Scheme Builder window.
Note: Values that do not exist in the scheme are highlighted in red.
10. Update the existing scheme.
a. At the bottom of the Report side, locate the Standard field.
b. Enter Arrowhead Mills, Inc. in the Standard field.
c. Locate the grouping of records that begins with Arrowhead Mills, Inc.
14. Select File → Save to save the scheme to the default QKB.
A warning window appears and indicates that there is duplicate data.
The values with the specified standard are added to the Scheme side of the Scheme Builder
window. The scheme should now resemble the following:
c. Click Save.
14. Select File → Exit to close the Scheme Builder window.
15. Select File → Close Profile to close the profile.
Practice
9.7 Solutions
Solutions to Practices
1. Profiling Tables in the dfConglomerate Grocery Data Connection
a. If necessary, invoke Data Management Studio.
1) Select Start → All Programs → DataFlux Data Management Studio 2.7.
2) Click Cancel in the Log On window.
b. Click the Folders riser bar.
c. Click the Basics Exercises repository.
d) Clear Outliers.
e) Click OK to close the Metrics window.
g. Select File → Save Profile to save the profile.
h. Select Actions → Run Profile Report.
1) Enter Initial profile in the Description field.
2) Click OK to execute the profile.
The Report tab becomes active.
i. Review the Profile report.
Question: How many distinct values exist for the UOM field in the BREAKFAST_ITEMS
table?
Answer: Expand the BREAKFAST_ITEMS table, and then click the UOM field. The
Column Profiling tab shows that the Unique Count metric has a value of 6.
Question: What are the distinct values for the UOM field in the BREAKFAST_ITEMS table?
Answer: Click the UOM field. Click the Frequency Distribution tab to display the six
unique values: OZ, CT, LB, PK, 0Z and a blank.
Question: What are the ID values for the records with a value of PK for the UOM field
in the BREAKFAST_ITEMS table?
Answer: Double-click the value PK. The ID values for these two records are 556 and
859.
3) Click Close.
e. Begin designing standardization scheme.
1) Click the MANUFACTURERS table.
2) If necessary, click the Standard Metrics tab.
f. Right-click the CONTACT_CNTRY field and select Build a Scheme.
1) Accept the defaults in the Report Generation window and click OK.
The Scheme Builder window appears with the Report side showing the three distinct
values.
2) Click Save.
3) Select File → Exit to close the Scheme Builder window.
4) Select File → Close Profile to close the profile.
Lesson 10 DataFlux® Data Management Studio: Building Data Jobs to Improve Data
Practice............................................................................................................. 10-44
Demonstration: Investigating Right Fielding and Identif ication Analysis ....................... 10-50
Demonstration: Working with the Branch and Data Validation Nodes .......................... 10-57
Practice............................................................................................................. 10-84
10.1 Introduction to Data Jobs
[Figure: the three-phase methodology – DESIGN, EXECUTE, ACT]
Recall the three-phase methodology introduced earlier. In this lesson, the ACT phase is explored. Data Management Studio uses data jobs for the ACT phase.
DESIGN: After you complete the first two steps, this phase enables you to take the different structures, formats, data sources, and data feeds and create an environment that accommodates the needs of your business. At this step, business and IT users build workflows to enforce business rules for data quality and data integration. They also create data models to house data in consolidated or master data sources.
EXECUTE: After business users establish how the data and rules should be defined, the IT staff can install them within the IT infrastructure and determine the integration method (real time, batch, or virtual). These business rules can be reused and redeployed across applications, which helps increase data consistency in the enterprise.
[Figure: a data job flow – Source Node → Node A → Node B → Node C → Target Node]
Data jobs consist of nodes. Each node is designed to accomplish an objective (for example, generate a subset). Most data jobs start with a source node and end with a target node.
In the example job flow, there are five nodes. The beginning node is a source node and the ending node is a target node. The nodes labeled A, B, and C each process the data (to transform and/or cleanse) and deliver the results to the target node.
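Conceptually, a data job behaves like a chain of functions over a stream of records. The following Python sketch mimics the five-node flow above with hypothetical node functions and inline sample data; it is an analogy, not how Data Management Studio is implemented.

from typing import Iterable

Record = dict

def source_node() -> Iterable[Record]:
    """Source node: bring rows into the job (inline sample data)."""
    yield {"COMPANY": " sas institute inc", "STATE": "nc"}
    yield {"COMPANY": "dataflux corp", "STATE": "NC"}

def node_a(rows: Iterable[Record]) -> Iterable[Record]:
    """Processing node A: uppercase the STATE field."""
    for row in rows:
        yield {**row, "STATE": row["STATE"].upper()}

def node_b(rows: Iterable[Record]) -> Iterable[Record]:
    """Processing node B: trim stray whitespace from the COMPANY field."""
    for row in rows:
        yield {**row, "COMPANY": row["COMPANY"].strip()}

def node_c(rows: Iterable[Record]) -> Iterable[Record]:
    """Processing node C: title-case the COMPANY field."""
    for row in rows:
        yield {**row, "COMPANY": row["COMPANY"].title()}

def target_node(rows: Iterable[Record]) -> None:
    """Target node: deliver the result set (printed here instead of loaded to a table)."""
    for row in rows:
        print(row)

# Wire the nodes together the way the flow diagram reads: source -> A -> B -> C -> target.
target_node(node_c(node_b(node_a(source_node()))))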
Job Options
Data Management Studio provides many configurable options, including options that affect data jobs.
Data Management Studio provides you with a good selection of options to control the operations that you perform inside of the application. You can navigate to these options by selecting Tools → Data Management Studio Options from the main menu. You can specify the following options:
• general interface options
• filter and search options for viewing data
• job options, such as layout defaults
• SQL options, such as join defaults and editor type defaults
In the Data Management Studio Options window, you can set options that affect how you create and execute jobs. These options are available by selecting Job in the left navigation pane. In the example shown, you see the specific options that are available when working with jobs.
An example of the type of options you can set for jobs is the Output Fields option. This option specifies how fields are initially passed between nodes in a data job. There are three options available for controlling this behavior:
• Target
• Source and Target
• All
Before discussing the options, we need a bit more information about the types of nodes found in a data job.
[Figure: a data job flow – Source Node → Node A → Node B → Node C → Target Node]
A data job typically starts with a source node to bring data into the data job flow. Most data jobs end with one or more “target” nodes that create output from the processes applied in the data job.
In between the source and the target node, any number of processing steps can take place on the data as it is passed from one node to the next. In the diagram above, data enters the job flow through a source node, and then is passed through three additional processing nodes (Node A, Node B, and Node C), and then finally to a target node.
The source node specifies properties for a type of data source. Thus, a source node could specify the properties to access data from
• a table from a defined data connection
• a text file
• the result set of an SQL query.
Note: There are several other source node types available – the above list is not all-encompassing.
The intermediate nodes contain properties for their objective or purpose (for example, an intermediate node can specify properties for subsetting, standardizing, or parsing). Each type of intermediate node has different properties than other node types.
The target node specifies properties for loading or using the result set generated in the data job. A target node could specify the properties to load data to
• a table for a defined data connection
• a text file.
Note: There are several other target node types available – the above list is not all-encompassing. Also, the result of a data job could be a report. Therefore, a target node could contain the properties for a type of report (for example, an HTML report).
Many data jobs typically start with a source node of some type. Text File Input and Data Source are
two commonly used nodes.
In the example shown, all fields from the Customers table are being passed from the Data Source node into the next node.
Many data job nodes create additional output fields as they process the data. For example, a standardization node can generate standard values stored in new fields.
In the example shown, five fields were moved to the Selected list (each of the five fields has a standardization definition applied). The results for each of the five fields will be written to new fields whose default names are the original field name with the text _Stnd added. For example, the COMPANY field's standardized values will be stored in a new field called COMPANY_Stnd.
But what about the original fields that are in the source data read by the node?
Many nodes that produce new fields, like the Standardization node, have an Additional Outputs button. Clicking this button opens a window that provides the ability to select which fields from the source data get passed through the current node.
In the example shown, all fields from the Customers table were passed from a Data Source node into the Standardization node – this can be discovered by examining the Available list in the Standardization Properties window, or by opening the Additional Outputs window.
Many data jobs end with a data output of some type. Text File Output and new tables produced with Data Target (Insert) are two commonly used nodes. In the example shown, only two fields from the list of incoming fields are selected to be written to the new text file.
Target: All fields available to target nodes are passed to the target.
Source and Target: All fields available from source nodes are passed to the next node, and all fields available to target nodes are passed to the target.
One of the options that you can set for the data job controls how output fields are handled for each node. The purpose of this option is to control how fields are initially passed to the adjacent nodes in the job flow. The setting of the Output Fields option controls the initial selection of fields. It is important to note that field selection modifications can be made on a node-by-node basis after the initial propagation of fields.
[Figure: Output Fields option set to Target]
If Target is selected for the Output Fields option, then only the node connected to a target node will initially propagate fields to the target node.
[Figure: Output Fields option set to Source and Target]
[Figure: Output Fields option set to All]
If All is selected for the Output Fields option, then all nodes will automatically propagate all fields to the next node in the job flow diagram.
This demonstration illustrates the steps that are necessary to investigate and set various DataFlux
Data Management Studio options.
4. Verify that the General item is selected on the left selection pane.
5. Clear Automatically preview a selected node when the preview tab is active.
9. Click New job defaults (in the Job grouping) on the left selection pane.
Options are available to establish default functionality when you create new data jobs.
10. Click QKB (in the Job grouping) on the left selection pane.
11. If necessary, click in the Default locale field and select English (United States).
This option enables you to specify a locale to be used by default in data jobs. Many nodes give you the option to override the default locale for that step in the processing. In addition, the QKB selection also displays a list of available locales for the default QKB.
12. Click OK to save the changes and close the Data Management Studio Options window.
10.01 Activity
✓ If necessary, access Data Management Studio.
• Access DataFlux Data Management Studio by selecting
Start → All Programs → DataFlux Data Management Studio 2.7.
• Click Cancel in the Log On window.
✓ Select Tools → Data Management Studio Options.
✓ Under General, clear Automatically preview a selected node when the
preview tab is active.
✓ Under Job, select Include node-specific notes when printing.
✓ Under Job, select All in the Output Fields area.
✓ Under Job, verify that English (United States) is the default locale.
Nodes for a data job are found in groupings of like nodes. These are the groups:
• Data Job
• Data Inputs
• Data Outputs
• Data Integration
• Quality
• Enrichment
• Entity Resolution
• Monitor
• Profile
• Utilities
Our first job will use a Data Input node, several Quality nodes, and a Data Output node. This data job will accomplish three things:
• Standardization – addresses spelling and abbreviation issues
• Parsing – breaks a field of data into smaller, perhaps more usable, components
• Casing – applies consistency to the data.
10.2 Standardization, Parsing, and Casing
Standardization Definition
• is more complex than a standardization scheme
• can involve one or more standardization schemes
• can also parse data and apply regular expression libraries and casing
A node that we will use in a data job is the Standardization node. The properties of the Standardization node allow for the selection of fields to be standardized. Then, for each of the selected fields, you can choose either a standardization scheme or a standardization definition (or both!). When executed, either the scheme or the definition, or both, will transform the data values in a consistent manner.
In a previous section, we explored the creation of a standardization scheme file, which is simply a list of data values along with how we would like to have the values written out. A standardization scheme is used to ensure the standard representation of data values. Standardization schemes can be applied to single words (element scheme) or the entire phrase (phrase scheme).
A standardization definition is like an algorithm and is more complex than a scheme. A standardization definition can involve one or more standardization schemes, and can also parse data, apply regular expressions, and possibly apply casing.
In the example shown, an element scheme was created to standardize words that are contained in address data values. After the scheme is applied to the data, the words that match the “data” value in the scheme are transformed to the “standard” value in the scheme.
Data before Standardization       Apply This Definition    Data after Standardization
Mister John Q. Smith, Junior      Name                     Mr John Q Smith, Jr
dataflux corporation              Organization             DataFlux Corp
123 North Main Street, Suite 10   Address                  123 N Main St, Ste 10
Each unique type of data can have a corresponding standardization definition that can be applied to affect the outcome of the data.
The example shown displays a record for three different types of data: name data, company or organization data, and address data.
• If the Name standardization definition [from the English (United States) locale] is applied to the first string of data, you see that several changes have occurred to produce the resultant value. The name value provided (Mister John Q. Smith, Junior) shows that Mister was transformed to Mr, the period following the middle initial was removed, and Junior was transformed to Jr.
• If the Organization standardization definition [from the English (United States) locale] is applied to the second string of data, you see that several changes have occurred to produce the resultant value. The company value provided (dataflux corporation) shows the unique casing for DataFlux, and the word corporation was transformed to Corp.
• If the Address standardization definition [from the English (United States) locale] is applied to the third string of data, you see that several changes have occurred to produce the resultant value. The address value provided (123 North Main Street, Suite 10) shows that North was transformed to N, Street was transformed to St, and Suite was transformed to Ste.
Note: If you standardize a data value using both a definition and a scheme, the definition is applied first and then the scheme is applied.
Note: Data standardization does not perform a validation of the data (for example, address verification). Address verification is a separate component of the Data Management Studio application and is discussed in another section.
Investigating Standardization
This demonstration illustrates the steps that are necessary to create a data job that standardizes fields in a source data table. This data job is continued in subsequent demonstrations.
c. Click OK.
a. Verify that the Nodes riser bar is selected in the resource pane.
The node is added to the job flow and the properties window for the node appears.
Notice that there are a number of inconsistencies in the data that need to be addressed:
• The COMPANY field needs to be corrected or cleansed. You can use the standardization scheme created earlier to accomplish this.
• The phone fields are not consistently formatted. A standardization definition can be used to accomplish this.
• The ADDRESS field has portions of the field that are not consistent – specifically, note the inconsistent use of pre-directions (for example, “East”, “E.”, and “E”) and street types (for example, “Ave”, “Ave.”, and “Avenue”).
a. From the Nodes riser bar in the resource pane, collapse the Data Inputs grouping of nodes.
The node is added to the job flow and the properties window for the node appears.
c. In the Standardization fields area, double-click each of the following fields to move them from the Available list to the Selected list:
COMPANY
JOB TITLE
BUSINESS PHONE
HOME PHONE
MOBILE PHONE
FAX NUMBER
ADDRESS
STATE/PROVINCE
ZIP/POSTAL CODE
COUNTRY/REGION
d. For the COMPANY field, click under Scheme and select Ch3D6 Company Phrase Scheme.
e. For the JOB TITLE field, click under Scheme and select Ch3D7 Job Title Element Scheme.
f. For the BUSINESS PHONE field, click under Definition and select Phone.
g. For the HOME PHONE field, click under Definition and select Phone.
h. For the MOBILE PHONE field, click under Definition and select Phone.
i. For the FAX NUMBER field, click under Definition and select Phone.
j. For the ADDRESS field, click under Definition and select Address.
k. For the STATE/PROVINCE field, click under Definition and select State/Province (Abbreviation).
l. For the ZIP/POSTAL CODE field, click under Definition and select Postal Code.
m. For the COUNTRY/REGION field, click under Definition and select Country.
The Standardization Properties under the Selected list should resemble the following:
Note: Only a few of the _Stnd fields are shown in the above display.
When this node is previewed, the fields always appear in a particular order (all of the original fields, all the _Stnd fields, and then all the _Stnd_flag fields). With this order, it is difficult to verify whether a field's original value was changed by applying the standardization definition or the standardization scheme. Most nodes do not allow the intermingling of the original fields and fields that are produced by the node.
However, a Field Layout node can be used temporarily to reorder all fields as desired.
The Field Layout node has many uses. When added to a job flow diagram, it can filter out unneeded fields for subsequent nodes. In addition, the fields selected can be in any order desired, and the output names can be updated.
In this demonstration, you add a Field Layout node to the data job from the previous demonstration to control the order of the columns. Specifically, you want to preview the data where the original values of the fields are placed adjacent to their standardized values.
c. Click batch_jobs.
d. Double-click the Ch5D1_Customers_DataQuality data job.
a. From the Nodes riser bar in the resource pane, if necessary, collapse the Quality grouping of nodes.
All the fields are now sorted alphabetically by name. This sorting places the original fields next to their standardized field, which is in turn next to the standardization flag field. For example, ADDRESS is next to ADDRESS_Stnd, which is next to ADDRESS_Stnd_flag.
Note: The up and down arrows to the right of the Selected fields list can also be used to reorder the fields.
Now a preview of this Field Layout node enables us to see how the selected standardization techniques have affected the resultant fields.
It is now easy to investigate the values that are produced by the Standardization node.
For the first record, notice that the original ADDRESS field value was changed (Lane changed to Ln). Therefore, the ADDRESS_Stnd_flag field has a value of True.
For the second record, notice that the original ADDRESS field value was not changed. Therefore, the ADDRESS_Stnd_flag field has a value of False.
Similar observations can be made for the rest of the records across the three related fields (the original field, the standardized field, and the standardization flag field).
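The three related fields can be reproduced with a small Python sketch. The standardize_address function below is a stand-in with a single Lane-to-Ln rule, and the sample rows are hypothetical; the naming convention (_Stnd, _Stnd_flag) follows the description above.

import pandas as pd

def standardize_address(value):
    """Stand-in for an Address standardization definition (one rule only)."""
    return value.replace("Lane", "Ln")

df = pd.DataFrame({"ADDRESS": ["12 Apple Lane", "300 Main St"]})  # hypothetical rows

# New field named after the original with _Stnd added, plus a flag field that
# records whether standardization changed the value.
df["ADDRESS_Stnd"] = df["ADDRESS"].map(standardize_address)
df["ADDRESS_Stnd_flag"] = df["ADDRESS"] != df["ADDRESS_Stnd"]

print(df)  # first row: Lane -> Ln, flag True; second row unchanged, flag False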
Parsing Example
Parse definitions define the rules to split the words from a text string into the appropriate tokens.
Name Information: Dr. Alan W. Richards, Jr., M.D.
Parsed Name:
Prefix: Dr.
Given Name: Alan
Middle Name: W.
Family Name: Richards
Suffix: Jr.
Title/Additional Info: M.D.
A parse definition is a type of definition in the QKB that specifies the rules for breaking a text string into tokens. Tokens are predefined, specific to a data type, and have a semantic meaning. For example, the Name data type includes six different tokens (Prefix, Given Name, Middle Name, Family Name, Suffix, Title/Additional Info), and there is a distinct difference in the semantic meaning of each of these tokens.
The example shown is parsed using the Name parse definition. This definition splits a name into individual tokens. In this example, the name contains information for each token in the Name data type. It is important to note that when we parse any specific value, some tokens might not necessarily be assigned values. (For example, a name value of John Smith would populate just two of the tokens, Given Name and Family Name.)
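A real parse definition draws on QKB vocabularies and grammar rules, but the idea of assigning words to tokens can be sketched with simple keyword lists. Everything in the following Python example (the word lists and the assignment order) is an illustrative assumption.

PREFIXES = {"dr.", "mr.", "mrs.", "ms."}
SUFFIXES = {"jr.", "sr.", "ii", "iii"}
TITLES = {"m.d.", "ph.d.", "esq."}

def parse_name(text):
    """Very rough token assignment for a Name string (illustration only)."""
    tokens = {"Prefix": "", "Given Name": "", "Middle Name": "",
              "Family Name": "", "Suffix": "", "Title/Additional Info": ""}
    words = [w.strip(",") for w in text.split()]
    rest = []
    for w in words:
        lw = w.lower()
        if lw in PREFIXES and not tokens["Prefix"]:
            tokens["Prefix"] = w
        elif lw in SUFFIXES:
            tokens["Suffix"] = w
        elif lw in TITLES:
            tokens["Title/Additional Info"] = w
        else:
            rest.append(w)
    if rest:
        tokens["Given Name"] = rest[0]
        tokens["Family Name"] = rest[-1] if len(rest) > 1 else ""
        tokens["Middle Name"] = " ".join(rest[1:-1])
    return tokens

print(parse_name("Dr. Alan W. Richards, Jr., M.D."))
print(parse_name("John Smith"))  # only Given Name and Family Name are populated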
Casing Example
The purpose of a case definition is to convert the text to a specific case. For converting to upper or lower case, each value is converted to uppercase or lowercase, respectively. For propercasing, the definition converts the first character of each word to uppercase, and then augments that with the known casing of certain words (for example, DataFlux) and patterns within words. (For example, if any word begins with Mc, then convert the next letter to uppercase.)
Note: For the best results, select an applicable definition that is associated with a specific data type when you apply proper casing.
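A rough Python sketch of proper casing with a known-word list and an Mc pattern rule follows; the KNOWN_CASING entries are hypothetical and far smaller than what a case definition in the QKB actually carries.

KNOWN_CASING = {"dataflux": "DataFlux", "sas": "SAS"}  # hypothetical known words

def proper_case(text):
    words = []
    for word in text.lower().split():
        if word in KNOWN_CASING:
            words.append(KNOWN_CASING[word])
        elif word.startswith("mc") and len(word) > 2:
            # Pattern rule: Mc is followed by another uppercase letter.
            words.append("Mc" + word[2].upper() + word[3:])
        else:
            words.append(word.capitalize())
    return " ".join(words)

print(proper_case("DATAFLUX CORPORATION"))  # DataFlux Corporation
print(proper_case("patrick mcdonald"))      # Patrick McDonald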
This demonstration adds to the data job from the previous demonstration by adding a Parsing node to parse an email field into tokens. A Change Case node is then added to convert the tokenized email fields to uppercase. Finally, a Text File Output node is added to the job flow.
c. Click batch_jobs.
d. Double-click Ch5D1_Customers_DataQuality.
a. From the Nodes riser bar in the resource pane, if necessary, collapse the Utilities grouping of nodes.
The node is added to the data job flow and connected to the Standardization node.
The properties window for the Parsing node appears.
c. Click the down arrow under Field to parse and select EMAIL.
e. Click the double right-pointing arrows to select all tokens: Mailbox, Sub-Domain, Top-Level Domain, and Additional Info.
Note: When selected, Result code field specifies that the results of the parse operation are passed to the output of the node. If no field name is entered, the results are put into a field named __parse_result__.
The node is added to the data job flow and connected to the Parsing node.
The properties window for the Change Case node appears.
c. Double-click each of the following fields to move them from the Available list to the Selected list:
EMAIL
Mailbox
Sub-Domain
Top-Level Domain
Additional Info
d. For each of the five selected fields, click the down arrow under Type and select Upper.
Note: The selection of the Type field value filters the list of available case definitions for the Definition field.
e. For each of the five selected fields, click the down arrow under Definition and select Upper.
b. Scroll to the right to view the cased email and individual email token fields.
a. From the Nodes riser bar in the resource pane, collapse the Quality grouping of nodes.
The node is added to the job flow and connected to the Change Case node.
The properties window for the Text File Output node appears.
2) Navigate to D:\Workshop\dqdmp1\Demos\files\output_files.
4) Click Save.
The Text File Output Properties window should resemble the following:
The job runs, and the text file output is displayed in a Notepad window.
Practice
Create a data job that uses the Standardization, Parsing, and Data Target (Insert) nodes.
The final job flow should resemble the following:
ID, MANUFACTURER, CONTACT, CONTACT_ADDRESS, CONTACT_CITY, CONTACT_STATE_PROV, CONTACT_POSTAL_CD, CONTACT_CNTRY, CONTACT_PHONE, POSTDATE
• Use the specified standardization definition or scheme to standardize the following fields:
MANUFACTURER: Organization
CONTACT: Name
CONTACT_ADDRESS: Address
CONTACT_CITY: City
CONTACT_STATE_PROV: State/Province (Abbreviation)
CONTACT_PHONE: Phone
Note: Accept the default names for the standardized fields and be sure to preserve null values.
• Using the Name parse definition, parse the standardized CONTACT field. Be sure to preserve null values. Select three tokens and rename the output fields as follows:
• Write the standardized, parsed data to a new table named Manufacturers_Stnd (in the dfConglomerate Grocery data connection). If the data job runs multiple times, ensure that the records for each run are the only records in the table.
• Create a selection of the fields and rename the output fields as follows:
ID → ID
MANUFACTURER_Stnd → MANUFACTURER
FIRST_NAME → FIRST_NAME
MIDDLE_NAME → MIDDLE_NAME
LAST_NAME → LAST_NAME
CONTACT_ADDRESS_Stnd → CONTACT_ADDRESS
CONTACT_CITY_Stnd → CONTACT_CITY
CONTACT_STATE_PROV_Stnd → CONTACT_STATE_PROV
CONTACT_POSTAL_CD → CONTACT_POSTAL_CD
CONTACT_CNTRY_Stnd → CONTACT_CNTRY
CONTACT_PHONE_Stnd → CONTACT_PHONE
POSTDATE → POSTDATE
Question: Where (in Data Management Studio) can you view the new table's data?
Answer:
Question: Where (in Data Management Studio) can you view the new table's field names?
Answer:
10.3 Identification Analysis and Right Fielding
Identification analysis and right fielding can use the same definitions from the QKB, but each produces output in different formats. Identification analysis identifies the type of data in a field, and right fielding moves the data into separate fields based on its identification.
The purpose of an Identification Analysis node is to analyze the values in a selected field and attempt to determine the identity of each data value. For example, is the value a person's name, an organization, an address, and so on? The valid values returned by the identification analysis definition are determined by the identities defined for the identification analysis definition.
Customer         Customer_Identity
DataFlux         ORGANIZATION
John Q Smith     NAME
DataFlux Corp    ORGANIZATION
Nancy Jones      NAME
Consider a field that has mixed corporate and individual customers. Applying the Contact Info identification analysis definition to the Customer field using the Identification Analysis node produces a result set that flags every record with the type of data that is discovered or identified.
The purpose of a Right Fielding node is to analyze the field to determine the identity of each data value. For example, is the value a person's name, an organization, an address, and so on? Then, based on that identity, the value is moved into a new field that is created for that particular identity (for example, Company, Person, or Unknown).
Consider a field that has mixed corporate and individual customers. Applying the Contact Info identification analysis definition to the Customer field using the Right Fielding node can produce a result set that moves identified values to the correct or right field.
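The contrast between the two nodes can be sketched in Python. The identify function below is a stand-in for the Contact Info identification analysis definition, using a hypothetical keyword list; identification analysis adds an identity field, while right fielding moves each value into the field that matches its identity.

ORG_HINTS = ("inc", "corp", "llc", "dataflux")  # hypothetical keyword list

def identify(value):
    """Stand-in for the Contact Info identification analysis definition."""
    lowered = value.lower()
    return "ORGANIZATION" if any(h in lowered for h in ORG_HINTS) else "NAME"

records = ["DataFlux", "John Q Smith", "DataFlux Corp", "Nancy Jones"]

# Identification analysis: add a new field holding the identity.
analyzed = [{"Customer": r, "Customer_Identity": identify(r)} for r in records]

# Right fielding: move the value into the field that matches its identity.
right_fielded = [
    {
        "Company": r if identify(r) == "ORGANIZATION" else "",
        "Person": r if identify(r) == "NAME" else "",
    }
    for r in records
]

for row in analyzed:
    print(row)
for row in right_fielded:
    print(row)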
c. Click batch_jobs.
d. Double-click Ch5D2_RightFielding_IDAnalysis.
The data job appears on a new primary tab. The job flow should resemble the following:
b. Verify that a double quotation mark " is selected as the text qualifier.
e. Verify that 1 is the default value in the Number of rows to skip field.
f. Verify that seven fields are defined, including a field named Contact.
The Text File Input Properties window should resemble the following:
b. Verify that the Contact field has a mixture of individual and corporate records.
d. Verify that two output fields are defined: one to contain Company values and one to contain individual or Person values.
Note: The Contact field is moved to the end of the list so that it can be more easily compared to the results of the Right Fielding node.
b. If necessary, scroll to the right to view the new right fielding information.
Note: The Right Fielding node correctly identified data values from the Contact field as either Organization (the Company field) or Name of Person(s) (the Person field).
The Identification Analysis Properties window should resemble the following:
Note: The Identity Contact Info identification analysis definition is used to determine the identity of the data values in the Contact field.
b. Scroll to the right to view the new identification information.
Note: The Identification Analysis node also correctly identified data values from the Contact field as either ORGANIZATION or NAME. Right fielding moves the data to the associated field, but identification analysis creates a new field with values that indicate the associated category.
10.4 Branching and Gender Analysis
When a node is selected in a Data Flow diagram, the Details pane can be surfaced by selecting View → Show Details Pane.
This tab shows, for the type of selected node, how many connections can come “in” to a node (Connect from), as well as how many connections can come “out” of a node (Connect to).
A Branch node is used in a data job to send the output down two or more output paths. Most nodes in a data job allow only one input connection and one output connection. The purpose of the Branch node is to take one input connection and split the data flow into multiple paths. The Branch node allows for up to 32 output connections.
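Routing one input stream to several output paths can be sketched in Python as follows. The sample records and the choice of Contact_Identity as the branching value are assumptions carried over from the surrounding demonstration.

from collections import defaultdict

rows = [
    {"Contact": "DataFlux Corp", "Contact_Identity": "ORGANIZATION"},
    {"Contact": "Nancy Jones", "Contact_Identity": "NAME"},
    {"Contact": "John Q Smith", "Contact_Identity": "NAME"},
]

# One input connection, multiple output paths: route each record to the branch
# that matches its Contact_Identity value (the node itself supports up to 32
# output connections).
branches = defaultdict(list)
for row in rows:
    branches[row["Contact_Identity"]].append(row)

for identity, subset in branches.items():
    print(identity, "->", len(subset), "record(s)")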
This demonstration continues with an existing data job from the Basics Solutions repository to illustrate the use of the Branch node. Different paths of data are created based on the identity determined by the Identification Analysis node.
c. Click batch_jobs.
d. Double-click Ch5D2_RightFielding_IDAnalysis.
Question: How many output slots are available for the Branch node?
Answer: 32 (slots 0 through 31)
Question: How can you discover how many output slots are needed for a particular set of data?
Answer: Create a frequency distribution to find the distinct number of values for a particular field.
4. To discover how many “branches” to add to the job flow, add a Frequency Distribution node.
Note: This is typically performed while the data job is being created.
a. Right-click the line connector between the Identification Analysis node and the Branch node. Select Delete.
Note: This makes the node the “active” node. The next node added to the data job is added to the active node.
The Frequency Distribution node is appended to the Identification Analysis node.
The Frequency Distribution Properties window appears.
e. From the Available list, double-click the Contact_Identity field to move it to the Selected list.
1) Place your mouse pointer over the Identification Analysis node in a spot where the drawing tool or pen appears.
2) Click and drag the pen. Release the mouse button when the pen is positioned over the Branch node.
The Identification Analysis node is now reconnected to the Branch node.
5. Investigate the Data Validation node with the name Filter for Companies.
b. Verify that the Expression area is searching for records where Contact_Identity is equal to ORGANIZATION.
a. Right-click the Data Validation node named Filter for Companies and select Preview.
A sample of records appears on the Preview tab of the Details pane.
7. Investigate the Data Validation node with the name Filter for Person(s).
b. Verify that the Expression area is searching for records where Contact_Identity is equal to NAME.
a. Right-click the Data Validation node named Filter for Person(s) and select Preview.
Customer_Name    Customer_Gender
Michelle Wan     F
E.Fusco          U
Earl Rigdon      M
Pat Duffey       U
The Gender Analysis node of a data job (found in the Quality grouping of nodes) can apply a gender analysis definition to a selected name field. Based on the values of the names, the node produces a new field that reflects the gender of the name. The values returned by the gender analysis definition are M(ale), F(emale), and U(nknown).
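The behavior can be sketched with a small lookup. The GIVEN_NAME_GENDER vocabulary below is a tiny hypothetical stand-in for the gender analysis definition in the QKB; initials and ambiguous names fall back to U.

GIVEN_NAME_GENDER = {"michelle": "F", "earl": "M"}  # hypothetical lookup vocabulary

def gender_of(name):
    """Return M, F, or U based on the first (given) name token."""
    first = name.split()[0].lower().rstrip(".")
    # Initials such as "E." and ambiguous names such as "Pat" stay Unknown.
    return GIVEN_NAME_GENDER.get(first, "U")

for customer in ["Michelle Wan", "E.Fusco", "Earl Rigdon", "Pat Duffey"]:
    print(customer, "->", gender_of(customer))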
This demonstration uses an existing data job from the Basics Solutions repository to illustrate the use of the Gender Analysis node. You use this node to determine the gender of individuals (based on the value in a name field), reorder the fields, and write the data to the output tables.
c. Click batch_jobs.
d. Double-click Ch5D2_RightFielding_IDAnalysis.
3. Examine the properties of the Gender Analysis node.
5. Investigate the properties for the Field Layout node named Re-order Fields.
b. Verify that the Person_Gender field was moved to the third position and renamed to Gender.
6. Investigate the properties for the Text File Output node named Prospects - Person(s).
The Text File Output Properties window should resemble the following:
c. Verify that the Text qualifier field is set to a double quotation mark (").
7. Investigate the properties for the Field Layout node named Re-Order Fields.
Question: What is the intended purpose for this instance of the Field Layout node?
Answer: The Data Validation node previous to this node in the data job flow does not have a way to select and manage output fields. Adding this Field Layout node enables us to manage fields before adding any additional nodes.
Question: Aside from the list of selected fields, what is the difference between this Field Layout node and the Field Layout node in the “NAME” branch?
Answer: The Field Layout node on the other branch of this data job was named “Re-order Fields”. This Field Layout node is named “Re-Order Fields”. It is important to note that node names are unique, where uniqueness includes the casing.
8. Investigate the properties for the Text File Output node named Prospects - Companies.
The Text File Output Properties window should resemble the following:
The text file output for the PERSONS branch and the COMPANIES branch appears.
10. View the detailed log and resolve any issues that exist.
10.5 Data Enrichment
Reference data sources are used to verify and enrich data. The reference sources (also known as data packs) are typically a database that Data Management Studio uses to compare user data to the reference source. Given enough information to match an address, location, or phone number, the reference data source can supply a variety of additional fields to further clarify and enrich your data.
Data Management Studio allows direct use of data packs provided by the United States Postal Service, Canada Post, and Geo+Phone data. The data packs and updates are found on http://support.sas.com/downloads in the SAS DataFlux Software section.
Note: You cannot directly access or modify reference data sources.
Reference source locations are registered on the Administration riser bar in Data Management Studio. In the display, you can see that there are registrations for three reference data sources: Canada Post Data, Geo+Phone Data, and USPS Data.
Note: Only one reference source location of each type can be designated as the default.
Address Verification
Address verification identifies, corrects, and enhances address information. Original address: 940 Cary Parkway, 27513.
The address verification lookup process requires a valid street address and postal code, or a valid street address with the corresponding city and state value. If these values match an address in the lookup database, then the data can be enriched with additional data fields. In the example shown, the address 940 Cary Parkway with postal code 27513 is passed into the Address Verification node. Because this is a valid address, additional information can be added to the data row (for example, City, State, Zip+4, County Name, and Congressional District).
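As a rough mental model (not the actual node or the USPS data pack), the lookup can be pictured as a keyed reference table that returns enrichment fields when the input matches. The reference entries and field names in this Python sketch are hypothetical.

    # Toy model of address verification: look up (street, postal code) in a reference source
    # and return enrichment fields when a match is found.
    REFERENCE = {   # hypothetical reference data
        ("940 CARY PARKWAY", "27513"): {
            "Address_V": "940 NW CARY PKWY",
            "City_V": "CARY",
            "State_V": "NC",
            "ZIP_V": "27513-2792",
            "US_County_Name": "WAKE",
        }
    }

    def verify_address(street: str, postal_code: str) -> dict:
        """Return enrichment fields for a matching address, or an empty dict if unverified."""
        return REFERENCE.get((street.upper(), postal_code), {})

    print(verify_address("940 Cary Parkway", "27513"))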
Geocoding
Geocoding enhances address information with latitude and longitude values. Original address: 940 NW CARY PKWY, CARY, NC 27513-2792.
Geocoding latitude and longitude information can be used to map locations and plan efficient delivery routes. Geocoding can be licensed to return this information for the centroid of the postal code or at the rooftop level.
The address verification nodes in Data Management Studio require that one (or more) third-party reference data sources be licensed. These include:
• US Address Verification – United States Postal Service database, which supports CASS certification.
• Canada Address Verification – Canada Post database, which supports SERP certification.
• North America Postal Level Geocode (includes PhonePlus)
• US Street Level Geocode – adds an additional level of detail to the North America Postal Level Geocode database.
• Loqate – worldwide databases, which also support worldwide geocoding.
Note: The US Street Level Geocode data requires that the North America Postal Level Geocode data pack be installed first.
Note: More information about the reference data sources can be found by navigating to support.sas.com/downloads and selecting DataFlux Data Updates.
The address verification process is performed by one of many available nodes in the Enrichment group of nodes. Once a node is added to the data job flow, you can look up data values in the reference data source to validate the address. If the address is located in the reference data source, then there are a large number of available fields that can be used to enrich the existing record.
For example, the job flow diagram shown contains an Address Verification node. When examining the properties of this node, we can see the list of fields being passed in to this node, or the input fields. The input fields can be matched to fields from the reference source using the Field Type area. Depending on the specified reference source, there will be a number of output fields that can be generated from the reference source.
It is important to note that these output fields are added to the result set in addition to the original fields that might be passing through this node. (Recall that the Additional Outputs button can control which of the input fields pass out of this node.) In the example shown, the original data has a CITY field (as shown in the Input listing). One of the available output fields is also named City. To avoid a naming conflict, you need to change the name of one of the City fields. In the example, the output field City has been renamed to City_V to avoid the potential replication of field names.
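The renaming step simply prevents the enrichment output from clobbering an existing column when it is merged back onto the record. Here is a minimal Python sketch of that idea, using the course's _V suffix convention; the sample record and enrichment values are made up.

    def merge_with_suffix(record: dict, enrichment: dict, suffix: str = "_V") -> dict:
        """Add enrichment fields to a record, renaming any field that would collide."""
        merged = dict(record)
        for name, value in enrichment.items():
            out_name = name + suffix if name in record else name
            merged[out_name] = value
        return merged

    original = {"ID": 1, "ADDRESS": "940 Cary Parkway", "CITY": "Cary"}
    verified = {"CITY": "CARY", "State": "NC", "Zip+4": "27513-2792"}
    print(merge_with_suffix(original, verified))   # CITY from verification becomes CITY_V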
This demonstration uses an existing data job in the Basics Solutions repository. It illustrates the use of an Address Verification node to verify addresses and a Geocoding node to enhance addresses with latitude and longitude information.
c. Click batch_jobs.
d. Double-click Ch6D1_AddressVerification.
The data job appears on a new primary tab. The job flow should resemble the following:
3. Examine the properties of the Data Source node in the job flow.
b. Verify that the following fields were added to the Selected list:
ID
COMPANY
LAST NAME
FIRST NAME
ADDRESS
CITY
STATE/PROVINCE
ZIP/POSTAL CODE
COUNTRY/REGION
4. Right-click the Customers Table Data Source node and select Preview.
Note: You can click Suggest to attempt to match the fields from the data source with the fields in the reference source.
1) Click Options.
2) In the Output Name field, verify that the suffix _V appears after the following field names:
Address_Line_1
City
State
ZIP/Postal_Code
f. Click Cancel to close the Address Verification (US/Canada) Properties window.
Note: The US_Result_Code field indicates whether the address was successfully verified. If the address was not successfully verified, the code indicates the cause of failure.
Note: The US_Numeric_Result_Code field provides a numeric value for the result. Possible values for both fields include the following:
CITY (12) – Could not locate city, state, or ZIP code in the USPS database. At a minimum, city and state or ZIP code must be present in the input.
OVER (15) – One or more input strings is too long (maximum 100 characters).
7. Examine the properties for the Street-Level Geocoding node in the job flow.
a. Double-click the Street-Level Geocoding node (named Geocoding using Verified Zip and Street Number) to view the Properties window.
c. In the Input fields area, verify that the ZIP/Postal_Code_V field type is set to Postal/ZIP Code.
Note: When you use the Street-Level Geocoding node, a license check is performed to confirm the license type. If you have a ZIP+4 database, you see only results reflecting that license type. If you have a database that includes street-level data, you see street-level results.
Geocode Result Code – Indicates whether the record was successfully geocoded. Possible codes are as follows:
• DP – The match is based on the delivery point.
• PLUS4 – The match failed on the delivery point, so the match is based on ZIP+4.
• ZIP – The ZIP+4 match failed, so the match is based on the ZIP code.
• NOMATCH – The first three checks failed. There is no match in the geocoding database.
Geocode Latitude – The numerical horizontal map reference for address data.
Geocode Longitude – The numerical vertical map reference for address data.
Geocode FIPS – The U.S. Federal Information Processing Standards (FIPS) number used by the U.S. Census Bureau to refer to geographical areas.
Geocode Census Tract – A US Census Bureau reference number that is assigned using the centroid latitude and longitude. This number contains references to the State and County codes.
Geocode FIPS MCD Code – The FIPS Minor Civil Division (MCD) code refers to a subsection of the US Census county subdivision statistics. This number includes census data for county divisions, census subareas, minor civil divisions, unorganized territories, and incorporated areas.
b. Verify that the Text File Output Properties window resembles the following:
Practice
Create a data job that uses the Address Verification node. The final job flow should resemble the following display:
• Add the MANUFACTURERS table from the dfConglomerate Grocery data connection as the data source.
• Add an Address Verification (US/Canada) node and set the following field types:
  MANUFACTURER   – Firm
  STREET_ADDR    – Address Line 1
  CITY           – City
  STATE_PROV     – State
  POSTAL_CD      – Zip
• In the node options, select Street abbreviation and City abbreviation.
• Rename the verification output fields so that they end in _V (for example, State becomes State_V).
• Add the following fields to the output:
  ID, MANUFACTURER, STREET_ADDR, CITY, STATE/PROV, POSTAL_CD, COUNTRY, PHONE
• Add a Text File Output node to the job flow. Use the following specifications:
  Specify the following fields as output for the text file with the specified output names:
  ID              – ID
  Address_V       – Address
  City_V          – City
  State_V         – State
  ZIP_V           – ZIP
  US_County_Name  – US_County_Name
  US_Result_Code  – US_Result_Code
• Verify that the text file contains the verified address information.
10.6 Solutions
Solutions to Practices
1. Creating a Data Job Containing Standardization and Parsing
3) Double-click the Data Source node. The node is added to the job flow, and the properties window for the node appears.
6) In the Output fields area, click the left-pointing double arrow to remove all fields from the Selected list.
7) Double-click the following fields to move them from the Available list to the Selected list:
ID                 CONTACT_STATE_PROV
MANUFACTURER       CONTACT_POSTAL_CD
CONTACT            CONTACT_CNTRY
CONTACT_ADDRESS    CONTACT_PHONE
CONTACT_CITY       POSTDATE
8) Click OK to save the changes and close the Data Source Properties window.
1) From the Nodes riser bar in the resource pane, collapse the Data Inputs grouping of nodes.
3) Double-click the Standardization node. The node is added to the job flow. The properties window for the node appears.
6) In the Standardization fields area, double-click each of the following fields to move them from the Available list to the Selected list:
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_CNTRY
CONTACT_PHONE
a) For the MANUFACTURER field, click the down arrow under Definition and select Organization.
b) For the CONTACT field, click the down arrow under Definition and select Name.
c) For the CONTACT_ADDRESS field, click the down arrow under Definition and select Address.
d) For the CONTACT_CITY field, click the down arrow under Definition and select City.
e) For the CONTACT_STATE_PROV field, click the down arrow under Definition and select State/Province (Abbreviation).
f) For the CONTACT_CNTRY field, click the down arrow under Scheme and select Ch3E2 CONTACT_CNTRY Scheme.
g) For the CONTACT_PHONE field, click the down arrow under Definition and select Phone.
1) From the Nodes riser bar in the resource pane, verify that the Quality grouping of nodes is expanded.
2) Double-click the Parsing node. The node is added to the job flow. The properties window for the node appears.
5) Click the down arrow under Field to parse and select CONTACT_Stnd.
7) In the Tokens area, double-click each of the following tokens to move them from the Available list to the Selected list:
Given Name
Middle Name
Family Name
a) Click in the Output Name cell for the Given Name token, enter FIRST_NAME, and press Enter.
b) Click in the Output Name cell for the Middle Name token, enter MIDDLE_NAME, and press Enter.
c) Click in the Output Name cell for the Family Name token, enter LAST_NAME, and press Enter.
A sample of the records appears on the Preview tab of the Details pane.
1) From the Nodes riser bar in the resource pane, collapse the Quality grouping of nodes.
3) Double-click the Data Target (Insert) node. The node is added to the job flow. The properties window for the node appears.
(1) Enter Manufacturers_Stnd in the Enter a name for the new table field.
7) In the Output fields area, click the left-pointing double arrow to remove all fields from the Selected list.
9) Rename the following fields. (Click the Output Name cell, enter the new name, and press Enter.)
MANUFACTURER_Stnd       MANUFACTURER
CONTACT_ADDRESS_Stnd    CONTACT_ADDRESS
CONTACT_CITY_Stnd       CONTACT_CITY
CONTACT_CNTRY_Stnd      CONTACT_CNTRY
CONTACT_PHONE_Stnd      CONTACT_PHONE
Note: If any node has an error or warning, double-click the row for the node, and review the messages. Return to the Data Flow tab and re-specify the properties to fix the issue. Then rerun the data job.
3) When you are finished reviewing the log, select File → Close to close the data job.
Question: Where (in Data Management Studio) can you view the new table’s data?
Answer: The Data riser bar allows you to view data for tables.
Question: Where (in Data Management Studio) can you view the new table’s field names?
Answer: The Data riser bar allows you to view field information for tables.
n. Verify that the records were written to the Manufacturers_Stnd table in the dfConglomerate Grocery data connection.
5) Click Manufacturers_Stnd.
7) Scroll through the data and verify that the table contains the standardized and parsed fields.
1) Verify that the Nodes riser bar is selected in the Resource pane.
3) Double-click the Data Source node. The node is added to the job flow, and the properties window for the node appears.
e. Edit the basic settings for the Data Source node and preview the data.
A sample of the records appears on the Preview tab of the Details pane.
1) Verify that the Nodes riser bar is selected in the Resource pane.
The node is added to the job flow, and the properties window for the node appears.
a) Click under Field Type for the MANUFACTURER field and select Firm.
b) Click under Field Type for the STREET_ADDR field and select Address Line 1.
c) Click under Field Type for the CITY field and select City.
d) Click under Field Type for the STATE_PROV field and select State.
e) Click under Field Type for the POSTAL_CD field and select Zip.
a) Click Options.
b) Rename the fields below. (Click the Output Name cell, enter the new name, and press Enter.)
Firm     Firm_V
City     City_V
State    State_V
c) Double-click the following fields to move them from the Available list to the Output list:
ID              STATE/PROV
MANUFACTURER    POSTAL_CD
STREET_ADDR     COUNTRY
CITY            PHONE
A sample of the records appears on the Preview tab of the Details pane.
2) Scroll to the right to view the verified address information.
1) Verify that the Nodes riser bar is selected in the Resource pane.
The node is added to the job flow, and the properties window for the node appears.
b) Navigate to D:\Workshop\dqdmp1\Exercises\files\output_files.
d) Click Save.
b) Double-click the following fields to move them from the Available list to the Selected list:
ID          State_V
Firm_V      ZIP_V
Address_V   US_County_Name
City_V      US_Result_Code
Rename the following fields by entering the new name in the Output Name cell:
Address_V   Address
City_V      City
State_V     State
ZIP_V       ZIP
The Text File Output Properties window should resemble the following display:
l. Verify that the text file contains the verified address information.
Lesson 11 DataFlux® Data
Management Studio: Building Data
Jobs for Entity Resolution
Demonstration: Using Match Codes (and Other Fields) to Cluster Data Records ...........11-21
11.1 Introduction
Name
John Q Smith
Mr. Johnny Smith
Smith, John
Given the sample of records shown, do you think the three records represent the same individual? Or could the three records represent different individuals? Is investigating this single field enough information to decide whether the data shows a single individual, two individuals, or perhaps three different individuals?
In the example shown, you can see that the three strings representing John Smith could be identifying the same person. However, to a computer that compares on a character-by-character basis, these entries appear to be three totally different text strings.
The QKB has a definition type called a Match definition. Match definitions produce new fields called match codes. Match codes are generated, encoded text strings that represent a data value. Match codes can be compared across records to identify potentially duplicate data that might be obvious to the human eye, but not necessarily obvious to a computer program.
Match codes can be used to group similar data values. In the example shown, the records are sorted by the match code values. You can see that these records might potentially match. This is not because the value of Name is the same, but because the name values generated the same match code at the chosen level of sensitivity (in this example, a sensitivity of 85).
In addition to grouping similar records in a data table, match codes can be used across tables to join similar records that could not previously be joined based on the value of Name. This is especially helpful when you join two data source tables that do not have a common key or that use different standards for how the data values were entered.
The match code generation process, in a simple form, consists of the following steps:
• Data is parsed into tokens (for example, Given Name and Family Name).
• Significant tokens are selected for use in the match code.
• Ambiguities and noise words are removed (for example, “the”).
• Transformations are made (for example, Johnathon > Jon).
• Phonetics are applied (for example, HN sounds like N).
• Based on the sensitivity selection, the following results occur:
  o Relevant components are determined.
  o A certain number of characters of the transformed, relevant components are used.
Sensitivity is used in the match code generation process to determine how much of the initial data value you want to use in the match code. In other words, it enables you to specify how exact a match you want when you generate the match codes.
At higher levels of sensitivity, more characters from the input text string are used to generate the match code. Conversely, at lower levels of sensitivity, fewer characters are used for the generation of the match code.
In the example shown, you can see that as the sensitivity level drops, fewer significant characters are used in the match code string.
Note: It is important to experiment with different levels of sensitivity, because choosing a sensitivity that is too low can cause over-matching, and choosing a sensitivity level that is too high can cause under-matching.
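To make the steps concrete, here is a deliberately simplified Python sketch of the idea: normalize, drop noise words and initials, apply a crude transformation and phonetic step, and then keep more or fewer characters depending on sensitivity. It is only an illustration of the concept, not the actual QKB match definitions; the noise-word list, nickname table, phonetic rule, and truncation lengths are invented for the example.

    NOISE_WORDS = {"the", "mr", "mrs", "ms"}                                  # invented noise words
    NICKNAMES = {"johnny": "john", "jonathan": "jon", "johnathon": "jon"}     # invented transformations

    def match_code(value: str, sensitivity: int = 85) -> str:
        """Generate a simplified match code: higher sensitivity keeps more characters."""
        tokens = [t.strip(".,").lower() for t in value.replace(",", " ").split()]
        tokens = [t for t in tokens if len(t) > 1 and t not in NOISE_WORDS]   # drop noise words and initials
        tokens = [NICKNAMES.get(t, t) for t in tokens]                        # apply transformations
        tokens = [t.replace("hn", "n") for t in tokens]                       # crude phonetic rule
        tokens.sort()                                                         # ignore token order
        keep = 4 if sensitivity >= 85 else 2                                  # sensitivity -> characters kept
        return "".join(t[:keep] for t in tokens).upper()

    for name in ["John Q Smith", "Mr. Johnny Smith", "Smith, John"]:
        print(name, "->", match_code(name, 85))    # all three names produce the same code

At sensitivity 85 the three name variants above collapse onto one code; at a lower setting (for example, 50) even fewer characters per token are kept, so more distinct names collapse onto the same code, which is the over-matching risk mentioned in the note above.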
11.2 Creating Match Codes
The Match Codes node is added to a data job to generate match codes for the desired fields in your input data source.
The Match Codes (Parsed) node is used if the data on which you want to generate match codes is already stored in tokens. In the example shown, because the data for a person’s name is stored in a FIRST NAME field and a LAST NAME field, there is no need to go through the process of parsing the data into tokens again. In this situation, you can use the Match Codes (Parsed) node to generate a single match code field based on the data fields that are passed directly into the tokens that make up the Name data type.
This demonstration illustrates the steps that are necessary to generate match codes on a variety of data fields.
c. Expand batch_jobs.
d. Click Startup_Jobs.
e. Double-click Ch7D1_EntityResolution_01_Startup.
The data job appears on a new primary tab. The data flow should contain a single node.
3. Investigate the single node in the opened data job.
d. Verify that the name information is contained in two fields: FIRST NAME and LAST NAME.
a. Verify that the Nodes riser bar is selected in the resource pane.
The node is added to the job flow. The properties window for the node appears.
Note: The Match Codes (Parsed) node is used when the input data is parsed. The Name match definition is designed for a full name. If you generate a match code only on FIRST_NAME, in most cases, the definition assumes that this is a last name. Thus, the matching of nicknames does not occur. For example, Jon does not match Jonathan.
f. In the Tokens area, click the down arrow under Field Name for the Given Name token and select FIRST NAME.
g. Click the down arrow under Field Name for the Family Name token and select LAST NAME.
The Match Codes (Parsed) Properties window should resemble the following:
Note: Allow generation of multiple matchcodes per definition requires the creation of a special match definition in the QKB.
Note: Generate null match codes for blank field values generates a NULL match code if the field is blank. If this option is not selected, then a match code of all $ symbols is generated for the field. When you match records, a field with NULL does not equal another field with NULL. However, a field with all $ symbols equals another field with all $ symbols.
1) In the Output fields area, click the LAST NAME field. Hold down the Ctrl key and click the FIRST NAME field.
2) Click the down-pointing arrow 12 times to move the two selected fields to the bottom of the list of output fields.
At this point, we have generated one of the six needed match code fields for our cluster analysis. The additional match code fields are generated from fields that are not parsed information.
The node is added to the job flow. The properties window for the node appears.
c. In the Match code fields area, double-click the COMPANY field to move it from Available to Selected.
p. Click the down arrow under Definition and select Postal Code.
Note: Generating a match code performs an out-of-the-box standardization behind the scenes. Thus, unless the intention is to write the standardized values to output or perform custom standardizations by using a scheme or a modified definition, it is not necessary to standardize before generating match codes.
d. Click Save.
11.01 Activity
• Navigate to the batch_jobs/Startup_Jobs folder in Basics Solutions.
• Open the data job named Ch7E1_Manufacturers_MatchReport_01_Startup.
• Add a Match Codes node. Choose the following fields with the specified match definitions and sensitivities:
  Field Name            Definition       Sensitivity
  MANUFACTURER          Organization     75
  CONTACT               Name             75
  CONTACT_ADDRESS       Address          85
  CONTACT_STATE_PROV    State/Province   85
  CONTACT_POSTAL_CD     Postal Code      85
IMPORTANT: In addition, for the activity shown, save the modified data job to the Basics Exercises repository in the batch_jobs folder with the name Ch7E1_Manufacturers_MatchReport.
11.3 Clustering Records
The slide presents three sample records with Name, Address, and Phone fields, plus match codes generated for Name and Address. (The match code strings shown are not what would actually be generated.)
• If the grouping or clustering condition is Name_MC_85 and Address_MC_85, matches are found for records 1 and 2.
• If the grouping or clustering condition is Name_MC_85 and Phone, matches are found for records 2 and 3.
• If the grouping or clustering is requested on both conditions (Name_MC_85 and Address_MC_85, or Name_MC_85 and Phone), matches are found for all three records.
Consider the data shown. Three fields are provided (Name, Address, and Phone). Two fields (Name and Address) have had a match code string generated.
If (Name_MC_85 and Address_MC_85) is the grouping or clustering condition, then records whose Name_MC_85 and Address_MC_85 values are the same across records are grouped. In the data shown, records 1 and 2 would be clustered or grouped based on this condition.
If (Name_MC_85 and Phone) is the grouping or clustering condition, then records whose Name_MC_85 and Phone values are the same across records are grouped. In the data shown, records 2 and 3 would be clustered or grouped based on this condition.
Note: Identifying how you expect to identify clusters of records is something that should be discussed during the PLAN phase of the methodology.
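The combined behavior (grouping records when any one of the conditions matches, with matches chaining across records) can be sketched with a simple union-find over the records. The following Python sketch is only a conceptual model of the Clustering node's grouping logic, with made-up records and match codes.

    # Conceptual sketch of clustering: records join the same cluster when ANY condition
    # (a tuple of fields that must all match) has identical values, and matches chain.
    records = [  # made-up data; *_MC stands for a generated match code
        {"ID": 1, "Name_MC": "JONSMIT", "Addr_MC": "940CARY", "Phone": "919 555 0111"},
        {"ID": 2, "Name_MC": "JONSMIT", "Addr_MC": "940CARY", "Phone": "919 447 3000"},
        {"ID": 3, "Name_MC": "JONSMIT", "Addr_MC": "27PINE",  "Phone": "919 447 3000"},
    ]
    conditions = [("Name_MC", "Addr_MC"), ("Name_MC", "Phone")]   # condition 1 OR condition 2

    parent = {r["ID"]: r["ID"] for r in records}        # union-find over record IDs
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)

    for fields in conditions:
        seen = {}                                        # key value -> first record ID with that key
        for r in records:
            key = tuple(r[f] for f in fields)
            if key in seen:
                union(r["ID"], seen[key])                # same key under this condition -> same cluster
            else:
                seen[key] = r["ID"]

    for r in records:
        print(r["ID"], "-> cluster", find(r["ID"]))      # all three records end up in one cluster

Note that records 1 and 3 do not satisfy either condition directly; they land in the same cluster only because both match record 2. That chaining is exactly why all three records cluster together when both conditions are used.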
The Clustering node enables you to specify one or more conditions for identifying clusters of records. In the example shown, three conditions identify rules for grouping records into a cluster.
Each cluster of records is assigned a common cluster ID value, and in the example shown, the cluster ID values are stored in a new field named Cluster_ID.
You can choose to have the results contain only those “clusters” or groups of records where there were no record matches (single-row clusters). Or you can choose to have the results contain those clusters where there was at least one record match (multi-row clusters). Or you can choose to show all clusters (both single-row clusters and multi-row clusters).
In addition, the result set can be organized or sorted by the cluster ID value.
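As a small illustration of those output choices, the following Python sketch separates a set of cluster assignments into multi-row clusters and single-row clusters; the cluster ID values are made up.

    from collections import Counter

    cluster_ids = [1, 2, 2, 3, 4, 4, 4, 5]          # made-up Cluster_ID values, one per record
    sizes = Counter(cluster_ids)

    multi_row  = [cid for cid in cluster_ids if sizes[cid] > 1]   # clusters with a record match
    single_row = [cid for cid in cluster_ids if sizes[cid] == 1]  # clusters with no record match

    print(sorted(multi_row))    # [2, 2, 4, 4, 4]
    print(sorted(single_row))   # [1, 3, 5]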
This demonstration continues with the data job from the previous demonstration. A Standardization node is added to standardize a phone field (a field that will be used in the clustering analysis). The Clustering node is then added and configured. Clustering is performed using the match code fields as well as the standardized field.
c. Click batch_jobs.
d. Double-click Ch7D1_EntityResolution.
3. Add a Standardization node to the job flow from the previous demonstration.
a. Verify that the Nodes riser bar is selected in the resource pane.
b. If necessary, collapse the Entity Resolution grouping of nodes.
The node is added to the job flow. The properties window for the node appears.
c. In the Standardization fields area, double-click BUSINESS PHONE to move it from Available to Selected.
a. Verify that the Nodes riser bar is selected in the resource pane.
The node is added to the job flow. The properties window for the node appears.
Note: Enabling this option can significantly degrade job performance.
4) Click OK.
The fields for the first clustering condition:
Name_MatchCode
COMPANY_MatchCode
ADDRESS_MatchCode
CITY_MatchCode
STATE/PROVINCE_MatchCode
The fields for the second clustering condition:
Name_MatchCode
COMPANY_MatchCode
ADDRESS_MatchCode
ZIP/POSTAL CODE_MatchCode
2) Double-click the following fields to move them from the Available fields list to the Cluster if all fields match list:
Name_MatchCode
COMPANY_MatchCode
BUSINESS PHONE_Stnd
The COND_ fields display True if that condition (1, 2, or 3) caused the record to be clustered. Otherwise, the COND_ fields display False.
Question: What is the first value of Cluster_ID that has duplicate values?
Answer: Cluster_ID=22
d. Scroll in the Preview window so that the Cluster_ID=22 records are at the top of the window.
e. Scroll to the left to see the original field values for LAST NAME, FIRST NAME, COMPANY, BUSINESS PHONE, ADDRESS, CITY, STATE/PROVINCE, and ZIP/POSTAL CODE.
The original values in the fields that were used for match code generation or for standardization are “close” to the same, if not exactly the same.
It might be necessary to report on the cluster results. We now examine the Match Report node.
One of the special types of output files, created from a set of clustered data, is known as a match report.
A match report
• is a file created in the file-based portion of the repository
• is a way to explore each cluster created by the clustering process in the data job
• is ordered by cluster number.
Recall that the Clustering node has an option to identify which clusters to output (all, single-row, or multi-row). If all clusters are output, then the match report has both single-row clusters and multi-row clusters. If only single-row or multi-row clusters are output, then the match report has only single-row or multi-row clusters.
The Match Report node allows for the specification of a name and location for the match report. In addition, you can specify a title for the match report. The title surfaces on the title bar of the Match Report Viewer.
You can select an option that launches the Match Report Viewer window when the data job finishes running.
The field created in the Clustering node (the Cluster field) needs to be identified.
Lastly, the fields for the match report need to be selected. The purpose of the match report is to understand how the clustering conditions specified in the Clustering node have grouped or clustered the data. Therefore, the fields selected for the match report are most often the “original” fields (that is, the fields that the match codes were generated for, not the match code fields).
This demonstration illustrates the steps that are necessary to create a match report from clustered records and review the report.
c. Click batch_jobs.
d. Double-click Ch7D1_EntityResolution.
a. Verify that the Nodes riser bar is selected in the resource pane.
b. If necessary, collapse the Entity Resolution grouping of nodes.
The node is added to the job flow. The properties window for the node appears.
1) Navigate to D:\Workshop\dqdmp1\Demos\files\output_files.
4) Click Save.
c. Enter Customers Match Report - Three Conditions in the Report title field.
e. In the Cluster fields area, click the down arrow next to Cluster field and select Cluster_ID.
1) In the Report fields area, click the left-pointing double arrow to remove all fields from the Selected list.
c. Verify that the dfReport Viewer appears with the specified title.
Because you requested all the clusters, you get many clusters that contain only one record.
The viewer toolbar provides First cluster, Previous cluster, Next cluster, and Last cluster navigation buttons.
Answer: 57
The bottom panel of the dfReport Viewer lists the total number of clusters.
Answer: Four
Answer: Three
Practice
• Open the Ch7E1_Manufacturers_MatchReport data job from the batch_jobs folder in the Basics Exercises repository (this data job was saved in the 11.01 Activity).
Note: If the data job does not exist, open the following data job as a starting point:
dfr://Basics Solutions/batch_jobs/Startup_Jobs/Ch7E1_Manufacturers_MatchReport_02_Startup
Immediately save this starter job to the batch_jobs folder in the Basics Exercises repository with the name Ch7E1_Manufacturers_MatchReport.
Condition 1:
MANUFACTURER_MatchCode
CONTACT_MatchCode
CONTACT_ADDRESS_MatchCode
CONTACT_POSTAL_CD_MatchCode
Condition 2:
MANUFACTURER_MatchCode
CONTACT_MatchCode
CONTACT_STATE_PROV_MatchCode
CONTACT_PHONE_Stnd
Cluster_ID
ID
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_PHONE_Stnd
Question: How many clusters are produced using the specified clustering conditions?
Answer:
Answer:
Answer:
11.4 Survivorship
Up to this point, you have seen the following:
• using the Match Codes node and the Match Codes (Parsed) node to generate match codes
• using the Clustering node to group records that might represent the same entity.
The multi-row clusters now need to be reduced to a single, surviving record.
The Surviving Record Identification node (also referred to as the SRI node) is in the Entity Resolution grouping of nodes.
The properties of the SRI node provide an area for defining one or more record rules that can be applied to multi-row clusters. In addition, field rules can be specified to enhance the final surviving record.
Record rules are used to select which record from a cluster should survive. Examples of record rules include, but are not limited to
• the maximum value in a field
• the most frequently occurring value in a field
• the longest value in a field.
In the example shown, there are record rules to identify the record with the highest occurrence of first name and the maximum value of ID. The resulting surviving records, based on these rules, are the records that do not have a line drawn through them.
Note: If there is ambiguity about which record is the survivor, the first remaining record in the cluster is selected.
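A record rule is essentially a way of ranking the records in a cluster and keeping the winner. The Python sketch below picks a survivor per cluster by the maximum value of ID, with the tie-breaking behavior from the note (the first such record wins); the non-surviving rows are made up, while the cluster IDs and surviving ID values echo the demonstration that follows.

    from collections import defaultdict

    records = [   # simplified stand-ins for the clustered demo data
        {"Cluster_ID": 22, "ID": 12, "FIRST_NAME": "Jon"},
        {"Cluster_ID": 22, "ID": 45, "FIRST_NAME": "John"},
        {"Cluster_ID": 22, "ID": 63, "FIRST_NAME": "John"},
        {"Cluster_ID": 28, "ID": 33, "FIRST_NAME": "Pat"},
        {"Cluster_ID": 28, "ID": 61, "FIRST_NAME": "Patricia"},
    ]

    # Group the records by cluster, then apply the record rule within each cluster.
    clusters = defaultdict(list)
    for r in records:
        clusters[r["Cluster_ID"]].append(r)

    def survivor(cluster):
        """Record rule: keep the record with the maximum ID (ties keep the first such record)."""
        return max(cluster, key=lambda r: r["ID"])

    survivors = [survivor(group) for group in clusters.values()]
    print(survivors)    # one surviving record per cluster (IDs 63 and 61)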
Field rules are used to “borrow” information from other records in the cluster when the surviving record does not contain a data value for a particular field, or when a “better” value exists in one of the other records in the cluster.
In the example shown, you can see that the surviving record (the second row under the cluster records) does not have a value for the EMAIL field. However, the first record in this cluster does have a value for EMAIL. A field rule can be set up to determine whether EMAIL is not null. If that condition is met, then the value in the EMAIL field from this first record in the cluster can be written to the surviving record’s EMAIL field.
Also in this example, a secondary field rule selects CITY and STATE from the record where the STATE field has the shortest value. The third record in this cluster has the shortest value for the STATE field. Therefore, the values in the CITY and STATE fields from this third record are written to the surviving record’s CITY and STATE fields.
Note: The row of data at the top is the “newly constructed” surviving record, which uses all the values that met the record rules, as well as the field rules. The field values that changed based on a field rule are bolded.
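Field rules can be sketched the same way: after the record rule picks a survivor, each field rule scans the cluster for a record that satisfies its condition and copies the listed fields onto the survivor. The cluster below and the two rules (non-null EMAIL; shortest STATE, taking CITY and STATE) mirror the example just described, with made-up values.

    cluster = [   # made-up cluster; the second record is the survivor chosen by the record rules
        {"ID": 7, "EMAIL": "jsmith@example.com", "CITY": "Cary", "STATE": "North Carolina"},
        {"ID": 9, "EMAIL": None,                 "CITY": "Cary", "STATE": "North Carolina"},
        {"ID": 4, "EMAIL": None,                 "CITY": "CARY", "STATE": "NC"},
    ]
    survivor = dict(cluster[1])

    # Field rule 1: if some record has a non-null EMAIL, borrow EMAIL from it.
    donor = next((r for r in cluster if r["EMAIL"]), None)
    if donor:
        survivor["EMAIL"] = donor["EMAIL"]

    # Field rule 2: take CITY and STATE from the record with the shortest STATE value.
    donor = min(cluster, key=lambda r: len(r["STATE"]))
    survivor["CITY"], survivor["STATE"] = donor["CITY"], donor["STATE"]

    print(survivor)   # the newly constructed surviving record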
The Surviving Record Identification (SRI) node examines clustered data and determines a surviving record for each cluster.
By default, the SRI node passes only surviving records to the next node in the data flow. The Options window for the node can be used to override the default behavior.
There are numerous combinations of options to explore. For example, for some sets of data, you might want to do the following:
• Keep duplicate records. A new field can be defined to flag the surviving record with True versus False for non-surviving records.
• Keep the original duplicate records and have a new distinct record as the surviving record. This is particularly useful if you are using field rules to update values in the surviving record. A new field can be defined to flag the surviving record with True versus False for non-surviving records.
• Keep the original duplicate records. A new field can be defined to use the primary key field of the surviving record as the surviving record ID value for the non-surviving records in a cluster.
Options: none specified
If no options are specified for the SRI node, then the default behavior is to pass only the surviving records from the SRI node to the next node in a data job.
In the example shown, recall that a three-record cluster existed where Cluster_ID=22. Only the surviving record (in this case, the record where ID has the maximum value) from this multi-row cluster is passed to the next node. Similarly, recall that a two-record cluster existed where Cluster_ID=28. Only the surviving record from this multi-row cluster is passed to the next node.
One option for dealing with surviving and non-surviving records involves creating a surviving record ID field that serves as a True/False flag. The flag’s value indicates whether the record is a surviving record. When selecting this option, you need to specify a new field to contain the True/False indicator.
If this option is selected, every record that is a surviving record is flagged as True. All the remaining records from the clusters are flagged as False. When you use this option, a simple filter for selecting records flagged as True yields an accurate output table of only the surviving records.
Options: Keep duplicates; Surviving record ID field: SR_Flag
A second option for dealing with surviving and non-surviving records involves creating the surviving record ID field as before and selecting Generate distinct surviving record to create a new surviving record for each cluster. This new record has a value of True for the surviving indicator flag field, and its value of ID is the same as the surviving record.
If this option is selected, every cluster (even the single-row clusters) has a surviving record that is created and flagged as True. All the original input records are flagged as False. When you use this option, a simple filter for selecting records flagged as True yields an accurate output table of only the surviving records. A simple filter for selecting records flagged as False yields all of the original data records.
Note: This option results in an output data table with two records for each ID value for the surviving records, which violates the assumptions of a primary key field. Further subsetting of the data would be necessary before you use the ID field as a primary key.
Options: Keep duplicates; Surviving record ID field: SR_Flag
A third option for dealing with surviving records involves creating the surviving record ID field and selecting Generate distinct surviving record as before. You can also identify the primary key field for the input data. This results in a new record (the survivor) that has a value of True for the surviving indicator flag field, and a null value for ID.
If this option is selected, every cluster (even the single-row clusters) has a surviving record that is created and flagged as True with a blank value for ID. All the original input records are flagged as False and retain their original value of the primary key field (ID). When you use this option, a simple filter for selecting records flagged as True yields an accurate output table of only the surviving records. However, you need to add an additional processing step to generate new primary key values for the surviving records.
A fourth option for dealing with surviving records involves creating the surviving record ID field, selecting Use primary key as surviving record ID, and identifying the ID field as the primary key field for the survivor. This results in the surviving record having a value of (null) for SR_Flag in each of the clusters. The non-surviving records in each cluster have an SR_Flag value equal to the survivor’s primary key field (ID, in this example).
If this option is selected, surviving records can be identified by selecting the records with a (null) value in the SR_Flag field. Selecting records with a value in the SR_Flag field gives you a list of the duplicates of the surviving records.
As you can see, different combinations of options affect the layout of the result set of the Surviving Record Identification node. You need to decide which set of options to use for each data source that you process.
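The practical difference between these option combinations is simply which filter recovers the survivors afterward. The Python sketch below shows the two filters mentioned above: a True/False SR_Flag (the first three options) versus an SR_Flag that holds the survivor's primary key for duplicates and null for survivors (the fourth option). The sample rows are made up, reusing the surviving ID values 63 and 61 from the demonstration.

    # True/False style: SR_Flag is True for survivors, False for duplicates.
    flagged = [
        {"ID": 63, "SR_Flag": True},
        {"ID": 45, "SR_Flag": False},
        {"ID": 61, "SR_Flag": True},
    ]
    survivors = [r for r in flagged if r["SR_Flag"]]
    print([r["ID"] for r in survivors])                      # [63, 61]

    # Primary-key style: SR_Flag is null for survivors and holds the survivor's ID for duplicates.
    keyed = [
        {"ID": 63, "SR_Flag": None},
        {"ID": 45, "SR_Flag": 63},
        {"ID": 61, "SR_Flag": None},
    ]
    survivors = [r for r in keyed if r["SR_Flag"] is None]
    duplicates = [r for r in keyed if r["SR_Flag"] is not None]
    print([r["ID"] for r in survivors])                      # [63, 61]
    print([(r["ID"], r["SR_Flag"]) for r in duplicates])     # [(45, 63)]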
This demonstration illustrates the steps that are necessary to add and configure a Surviving Record Identification node and investigate the various option settings for this node.
c. Click batch_jobs.
d. Double-click Ch7D1_EntityResolution.
d. Click Save.
4. Right-click the Match Report node (labeled Customers Match Report) and select Delete.
a. Verify that the Nodes riser bar is selected in the resource pane.
The node is added to the job flow. The properties window for the node appears.
b. Click the down arrow next to Cluster ID field and select Cluster_ID.
3) Click the down arrow next to Operation and select Maximum Value.
a. Verify that the Nodes riser bar is selected in the resource pane.
The node is added to the job flow. The properties window for the node appears.
2) Navigate to D:\Workshop\dqdmp1\Demos\files\output_files.
3) Enter Test.csv in the File name field. (Notice the .csv extension!)
4) Click Save.
1) Scroll in the Selected list to locate the Last_Name and First_Name fields.
2) Click the up arrow until these two fields follow the ID field.
The Text File Output Properties window should resemble the following:
b. Verify that the remaining record for this cluster is the one with the maximum value of ID (63).
b. Verify that the remaining record for this cluster is the one with the maximum value of ID (61).
15. Verify the processing information that appears for each of the nodes.
Note: The Surviving Record Identification node processed 63 rows. For each multi-row cluster, only one record was selected (the record with the maximum ID value). Therefore, the number of records written to the text file is 57 rows. Selecting only one record from each cluster is the default action.
16. Edit the properties of the Surviving Record Identification node (Logical Delete Option 1).
b. Move the SR_Flag field so that it appears after the Cluster_ID field.
2) Click the up arrow until SR_Flag appears after the Cluster_ID field.
c. Verify that SR_Flag=TRUE for the surviving record (where the maximum value of ID is 63).
c. Verify that SR_Flag=TRUE for the surviving record (where the maximum value of ID is 61).
23. Verify the processing information that appears for each of the nodes.
Note: Notice that the Text File Output node wrote all 63 rows to the text file.
24. Edit the properties of the Surviving Record Identification node (Logical Delete Option 2).
a. Right-click the Surviving Record Identification node and select Properties.
2) Click the down arrow next to Primary key field and select ID.
In addition, the fourth record does not contain a value for the identified primary key field (ID).
In addition, the fourth record does not contain a value for the identified primary key field (ID).
31. Verify the processing information that appears for each of the nodes.
Note: Notice that the Surviving Record Identification node processed 63 rows. For each cluster,
a distinct surviving record was generated. Therefore, the number of records written to the
text file is 120 rows, which is the sum of the 63 original rows plus the 57 new distinct
surviving records.
32. Edit the properties of the Surviving Record Identification node (Logical Delete Option 3).
3) Click the down arrow next to Primary key field and select ID.
b. Verify that the surviving record has a null value for the SR_Flag field.
c. Verify that the duplicate records have the value of the primary key of the surviving record.
b. Verify that the surviving record has a null value for the SR_Flag field.
c. Verify that the duplicate records have the value of the primary key of the surviving record.
39. Verify the processing information that appears for each of the nodes.
Note: The Text File Output node wrote all 63 rows to the text file.
This demonstration illustrates the steps that are necessary to establish field-level rules for populating
fields in the surviving record.
c. Click batch_jobs.
d. Double-click Ch7D2_SRI.
b. Verify that the surviving record has a null value for the EMAIL field, but another record in the
cluster has a non-null value for the EMAIL field.
c. Verify that the surviving record has a null value for the JOB TITLE field, but another record in
the cluster has a non-null value for the JOB TITLE field.
b. Verify that the surviving record has a null value for the EMAIL field, but another record in the
cluster has a non-null value for the EMAIL field.
c. Verify that the surviving record has a shorter value for the JOB TITLE field, but another
record in the cluster has a longer value for the JOB TITLE field.
10. Edit the properties for the Surviving Record Identification node.
2) In the Rule expressions area (of the Add Field Rule window), click Add.
b) Click the down arrow next to Operation and select Is Not Null.
3) In the Affected fields area, verify that only EMAIL appears in the Selected list.
2) In the Rule expressions area (of the Add Field Rule window), click Add.
a) Click the down arrow next to Field and select JOB TITLE.
b) Click the down arrow next to Operation and select Is Not Null.
e) Click the down arrow next to Field and select JOB TITLE.
f) Click the down arrow next to Operation and select Longest Value.
3) In the Affected fields area, verify that JOB TITLE appears in the Selected list.
4) In the Affected fields area, double-click the MOBILE PHONE field to move it from
Available to Selected.
The two defined field rules should resemble the following:
b. Verify that the surviving record now has a non-null value for the EMAIL field, which was
retrieved from another record in the cluster.
c. Verify that the surviving record has a non-null AND longest value for the JOB TITLE field.
d. Verify that the surviving record has a “new value” for the MOBILE PHONE field. This value
was copied to the surviving record from the record where JOB TITLE was not null.
c. Verify that the surviving record has a non-null and longest value for the JOB TITLE field.
d. Verify that the surviving record has a “new value” for the MOBILE PHONE field. This value
was copied to the surviving record from the record where JOB TITLE was not null and the
longest value.
Summary: If you use field rules, important information that is potentially spread across multiple
records of a cluster can be retrieved for the surviving record. In this case, the
surviving records examined have values for both the EMAIL and JOB TITLE fields.
In addition, in the second example, the MOBILE PHONE field for the surviving
record is populated from the record where JOB TITLE has the longest value.
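The combined effect of the record rule and the field rules can also be pictured with a short Python sketch (an illustration only; Python is not used in the course, the sample values are hypothetical, and this is not the node's actual implementation). Each field rule scans the cluster for a record that satisfies its expression and copies the affected fields from that record into the surviving record.

# Illustrative sketch only: field rules are defined in the Surviving Record
# Identification node's properties, not with code.
cluster = [
    {"ID": 63, "EMAIL": None, "JOB TITLE": "Mgr", "MOBILE PHONE": "555-0101"},
    {"ID": 12, "EMAIL": "pat@example.com", "JOB TITLE": "Sales Manager", "MOBILE PHONE": "555-0102"},
]
survivor = dict(max(cluster, key=lambda r: r["ID"]))   # record rule: maximum ID

# Field rule 1: EMAIL Is Not Null -> affected field EMAIL
donor = next((r for r in cluster if r["EMAIL"] is not None), None)
if donor:
    survivor["EMAIL"] = donor["EMAIL"]

# Field rule 2: JOB TITLE Is Not Null and Longest Value
#               -> affected fields JOB TITLE and MOBILE PHONE
candidates = [r for r in cluster if r["JOB TITLE"] is not None]
if candidates:
    donor = max(candidates, key=lambda r: len(r["JOB TITLE"]))
    survivor["JOB TITLE"] = donor["JOB TITLE"]
    survivor["MOBILE PHONE"] = donor["MOBILE PHONE"]

print(survivor)   # EMAIL, JOB TITLE, and MOBILE PHONE now come from other records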
Practice
This practice continues with the data job started in Practice 1. For this exercise, you will add a
Surviving Record Identification node and write the clustered records out to a text file. The final
job flow should resemble the following:
• Remove the Match Report node and add a Surviving Record Identification node with the
following properties:
• Add a Text File Output node to the job flow using the following specifications:
Output file: Ch7E2_Manufacturer_BestRecord.csv in the directory
D:\Workshop\dqdmp1\Exercises\files\output_files
Question: For Cluster_ID=4, the surviving record is the record where ID=48. Why?
Answer:
Question: For Cluster_ID=6, the selected record is the record with ID=11. Why?
Answer:
11.5 Solutions
Solutions to Practices
1. Creating a Data Job to Cluster Records and Create a Match Report
5) Double-click Ch7E1_Manufacturers_MatchReport.
1) Verify that the Nodes riser bar is selected in the Resource pane.
The node is added to the job flow. The properties window for the node appears.
1) Verify that the Nodes riser bar is selected in the Resource pane.
The node is added to the job flow. The properties window for the node appears.
3) Verify that All clusters is specified in the Output clusters field.
4) Click Sort output by cluster number.
1) Verify that the Nodes riser bar is selected in the Resource pane.
The node is added to the job flow. The properties window for the node appears.
a) Navigate to D:\Workshop\dqdmp1\Exercises\files\output_files.
c) Click Save.
3) Enter Manufacturers Match Report - Two Conditions in the Report title field.
4) Click Launch Report Viewer after job is completed.
5) In the Cluster fields area, click the down arrow next to Cluster field and select
Cluster_ID.
a) In the Report fields area, click the left-pointing double arrow to remove all fields
from the Selected list.
Question: How many clusters are produced when you use the specified clustering
conditions?
The bottom panel of the dfReport Viewer window shows how many
clusters were found:
Answer: 30 clusters
(2) In the data job, right-click the Clustering node and select Properties.
(6) Select Actions → Run Data Job to re-execute the data job.
Answer: 10 records
Scanning the two pages of the dfReport Viewer, you can see that the
first cluster with 10 records is the largest cluster.
5) Double-click Ch7E1_Manufacturers_MatchReport.
Note: If necessary, you can open the start-up job for the exercise:
• Verify that the Home tab is selected.
• Click the Folders riser bar.
• Expand the Basics Solutions repository.
• Expand the batch_jobs folder.
• Expand the Startup_Jobs folder.
• Right-click the Ch7E2_Manufacturers_SelectBestRecord_Startup job
and select Open.
d. Right-click the last node in the data job (Match Report node) and select Delete.
1) Verify that the Nodes riser bar is selected in the resource pane.
The node is added to the job flow. The properties window for the node appears.
2) Click the down arrow next to Cluster ID field and select Cluster_ID.
d) Click the down arrow next to Primary key field and select ID.
(1) Click the down arrow next to Field and select POSTDATE.
(2) Click the down arrow next to Operation and select Maximum Value.
5) In the Output fields area, click Field Rules on the lower right.
a) Click Add in the Rule expressions area of the Add Field Rule window.
b) In the Affected fields area, verify that only CONTACT appears in Selected.
1) Verify that the Nodes riser bar is selected in the resource pane.
The node is added to the job flow. The properties window for the node appears.
b) Navigate to D:\Workshop\dqdmp1\Exercises\files\output_files.
d) Click Save.
a) In the Output fields area, click the left-pointing double arrow to remove all fields
from the Selected list.
In the display of the text file that is shown above, two groups are highlighted.
Question: For Cluster ID=4, the surviving record is the record where ID=48. Why?
Answer: The record rule selects the record where POSTDATE has a maximum
value. For this cluster, the record where ID=48 has the maximum value
of POSTDATE.
Question: For Cluster_ID=6, the selected record is the record with ID=11. Why?
Answer: The record rule selects the record where POSTDATE has a maximum
value. For this cluster, the record where ID=11 has the maximum value
of POSTDATE.
k. Select File → Close to close the text file. Do not save any changes.
Lesson 12 Understanding the SAS®
Quality Knowledge Base (QKB)
12.1 Working with QKB Component Files .......................................................................... 12-3
Demonstration: Accessing the QKB Component Files ................................................ 12-5
Demonstration: Using the Scheme Builder ............................................................. 12-16
Demonstration: Using the Chop Table Editor........................................................... 12-24
Demonstration: Using the Phonetics Editor............................................................. 12-34
Demonstration: Using the Regex Library Editor ....................................................... 12-43
Demonstration: Using the Vocabulary Editor ........................................................... 12-49
Demonstration: Using the Grammar Editor ............................................................. 12-62
Chop Tables
Regular Expression Libraries
Phonetics Libraries
Schemes
Vocabularies
Grammars
The QKB consists of six types of component files that serve as the building blocks for the definitions.
These files each perform a vital task in the overall functionality provided by the definition. The
following types of component files are available:
• chop tables
• regular expression libraries
• phonetics libraries
• schemes
• vocabularies
• grammars
Note: Each of the above file types has a corresponding special editor. The various editors can be
accessed by selecting Tools → Other QKB Editors.
This demonstration illustrates how to navigate to the QKB component files in DataFlux Data
Management Studio.
1. If necessary, open Data Management Studio.
a. Select Start → All Programs → DataFlux Data Management Studio 2.7.
b. Click Cancel in the Log On window.
c. If necessary, close the DataFlux Data Management Methodology window.
2. If necessary, open the QKB CI 27 - ENUSA Only QKB.
a. In Data Management Studio, select the Administration riser bar.
b. Expand Quality Knowledge Bases.
c. Expand QKB CI 27 - ENUSA Only.
d. Expand Global.
e. Expand English.
f. Select English (United States).
Note: The tabs in the Information pane can be used to display the various types of
component files in the QKB.
Note: The Quality Knowledge Base tab is selected by default. This tab displays all the
definitions that are available in the QKB.
Note: As you navigate through the component files of the QKB, you see various symbols
that correspond to the objects that are related to the QKB.
Components
QKB
Data Type
Definition
Scheme
Chop Table
Phonetics Library
Regex Library
Vocabulary Library
Grammar
The Schemes tab lists the standardization schemes that are contained in the selected QKB
locale. A scheme is a simple lookup table that maps a word to some alternate, preferred
representation for that word. Schemes are used to perform standardization of words and
phrases, and for identifying “known words” in casing definitions and extraction definitions.
Note: You can create a new scheme (in the Scheme Builder) by clicking the New Scheme
button.
b. Scroll through the list of schemes and locate the GB Country scheme.
c. Select the GB Country scheme.
d. Click in the upper left corner to see the actions that are available for interacting with the
schemes.
e. Select Open.
The GB Country scheme opens in the Scheme Builder.
Note: You can also open the Scheme Builder for a scheme by double-clicking it on the
Schemes tab.
f. Preview the data and standard values that exist in the scheme for standardizing country
values.
g. Scroll down to the data values that have United States as their standard.
Note: These are all of the different data values that would be standardized to United States if this
scheme were applied to the data.
h. Select File → Close to close the Scheme Builder.
The Show Find Pane icon activates the Find toolbar. The Find toolbar enables you to search
for keywords that are associated with one or more schemes.
j. Enter Country in the Find field.
k. Press Enter.
The first scheme with the word Country in its name is highlighted.
l. Click the Next and Previous buttons to scroll through the schemes.
4. Explore the chop tables in the QKB.
a. Click the Chop Tables tab to see the list of available chop tables.
The Chop Tables tab lists the chop tables that are contained in the selected QKB locale.
A chop table is a collection of character-level rules that are used to create an ordered word
list from an input string (for example, to break a person’s name into the individual words that
make up the name).
Note: You can open a chop table by either double-clicking the chop table, or by selecting a
specific chop table and then clicking (the Open QKB icon). You can create a new
chop table (in the Chop Table Editor) by clicking the New Chop Table button. This
opens a wizard that navigates through the creation of the new chop table.
Note: You might find it beneficial to create a new chop table by opening an existing chop
table and selecting File → Save As to save the new chop table with a different name.
The Phonetics Libraries tab lists the phonetic libraries that are contained in the selected QKB
locale. A phonetics library contains phonetic reduction rules that are used to match words
and phrases with similar sounding words and phrases (for example, John and Jon).
Note: You can open a phonetics library by double-clicking the phonetics library, or by
selecting a specific phonetics library and then clicking (the Open QKB icon). You
can create a new phonetics library (in the Phonetics Editor) by clicking the New
Phonetics button. This opens a wizard that navigates through the creation of the
new phonetics library.
Note: You might find it beneficial to create a new phonetics library by opening an existing
phonetics library and selecting File → Save As to save the new phonetics library
with a different name.
The Regex Libraries tab lists the regular expression (regex) libraries that are contained in the
selected QKB locale. A regular expression library contains regular expressions that are used
for character-level pattern matching and transformations (for example, removing parentheses
from around a string).
Note: You can open a regex library by double-clicking the regex library, or by selecting
the regex library and then clicking (the Open QKB icon). You can create a new
regex library (in the Regex Library Editor) by clicking the New Regex button. This
opens a wizard that navigates through the creation of the new regex library.
Note: You might find it beneficial to create a new regex library by finding an existing regex
library that has similar expressions to the one that you need. Edit it and select
File → Save As to save the new regex library with a different name.
The Vocabularies tab lists the vocabulary files that are contained in the selected QKB locale.
A vocabulary file contains a list of words. Categories are assigned to words in a vocabulary
to help identify the semantic type of those words (for example, the word John could be a
given name word, a middle name word, or a family name word).
Note: You can open a vocabulary by either double clicking the vocabulary, or by selecting
the vocabulary and clicking (the Open QKB icon). You can create a new
vocabulary in the Vocabulary Editor by clicking the New Vocabulary button. This
opens a wizard that navigates through the creation of the new vocabulary.
The Grammars tab lists the grammar files that are contained in the selected QKB locale.
A grammar is a set of rules that represent expected patterns of words in a given context (for
example, a person’s name could be represented by the pattern <given name word> <family
name word>).
Note: You can open a grammar by double-clicking the name or selecting the grammar
and clicking the Open icon. You can create a new grammar in the Grammar Editor by
clicking the New Grammar button. This opens a wizard that navigates through the
creation of the new grammar.
Scheme Overview
A scheme is a lookup table that is used to transform data values to a standard representation.
Schemes can be applied to individual words in a string (element analysis) or to the entire string
(phrase analysis). Schemes are used in many types of definitions and are often applied at the token
level. For example, in a standardization definition, the input text string is first parsed into tokens, and
then each token is standardized with one or more schemes. In a match definition, standardization
schemes are used to standardize data values that get used in the creation of the match code. In
identification analysis, known word schemes are used to associate words with their possible identity
(for example, familiar phrases that might represent an address or a country value).
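As a rough illustration of phrase versus element application, the Python sketch below applies a tiny hypothetical country scheme in both modes. Python is not used in the course, and this only mimics the lookup behavior described above; it is not how the QKB stores or applies schemes.

# Illustrative sketch of scheme lookup (not the QKB implementation).
scheme = {"USA": "United States", "U.S.A.": "United States", "GB": "United Kingdom"}

def standardize_phrase(value):
    # Phrase analysis: the whole string is looked up as one unit.
    return scheme.get(value, value)

def standardize_element(value):
    # Element analysis: each word is looked up individually.
    return " ".join(scheme.get(word, word) for word in value.split())

print(standardize_phrase("USA"))                 # United States
print(standardize_element("GB branch office"))   # United Kingdom branch office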
This demonstration illustrates how to use the Scheme Builder to explore and modify a scheme.
1. If necessary, open Data Management Studio.
a. Select Start → All Programs → DataFlux Data Management Studio 2.7.
b. Click Cancel in the Log On window.
2. If necessary, open the English (United States) locale from the CI 2.7 - ENUSA Only QKB.
a. Select the Administration riser bar.
b. Expand Quality Knowledge Bases.
c. Expand QKB CI 27 - ENUSA Only.
d. Expand Global.
e. Expand English.
f. Select English (United States).
Note: Each line in the scheme represents a potential piece of expected data
and the replacement text (standard).
Note: You can add, delete, or edit a piece of expected data on the Edit menu.
Note: The analysis of an individual field can be conducted as a whole (phrase) or based
on each word (element).
Note: The GB Email Service Provider Standards standardization scheme is used in the
E-mail standardization definition specifically for processing the Sub-Domain token.
b. Click Close to close the Usage window.
12.01 Activity
1. Open DataFlux Data Management Studio.
2. Open the QKB CI 2.7 - ENUSA Only QKB.
3. Click the Schemes tab.
4. Open the GB Email Top-Level Domain Standards scheme.
5. Answer the following questions:
• Which values are standardized as ORG?
• Is this an Element scheme or a Phrase scheme?
• Which definitions use this scheme?
A chop table is a collection of character-level rules that are used to create an ordered word list from
an input string. A chop table contains a line for every single character in the selected character set.
Chopping is the first step in
• performing element analysis when building an element standardization scheme
• chopping a string into a list of words for a parse definition.
Each character in a chop table receives a classification based on the intended use. A character can
be classified as one of these:
• LETTER/SYMBOL – a letter or a non-separating symbol
• NUMBER – a numeric digit
• FULL SEPARATOR – a delimiting character that separates the string before it from the string
after it
• LEAD SEPARATOR – a separator that attaches to the beginning of a string (for example, an
opening parenthesis)
• TRAIL SEPARATOR – a separator at the end of a string (for example, a period after a name
salutation such as Mr.).
Each character in a chop table also receives an operation, indicating whether the character should
be included in the output string, and if so, how that character should be treated in the output string.
For example, the open parenthesis might not be a relevant character in a person’s name, but is often
used to delineate a portion of a phone number. We can choose to remove the character from a name
string, but not a phone number string. These are the valid arguments for a character’s operation:
• USE – keeps the character in the string.
• TRIM – temporarily removes the character from the string, but keeps it in output tokens .
• SUPPRESS – removes the character from the string and the output tokens.
The Chop Table Editor also provides you with a test panel where you can test data values against
the chop table. This ensures you are getting the desired results from the chopping step.
For example, the input string SAS Institute, Inc. is chopped using the GB Organization chop table.
The result of the chop is a list of four words from the string. The comma is on a line by itself because
it is treated as a word on its own in the generated word list. The full stop (period) is part of the Inc.
word because it is classified as a TRAIL SEPARATOR with an operation of USE.
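The following Python sketch mimics this chopping behavior with a hypothetical chop table that covers only the space, comma, and full stop characters. It is meant only to show how the FULL SEPARATOR and TRAIL SEPARATOR classifications interact with the USE and TRIM operations; it is not the QKB's chopping engine, and the classifications chosen are assumptions modeled on the example above.

# Illustrative sketch of chop-table behavior (not the QKB implementation).
# Each character maps to a (classification, operation) pair; anything not
# listed is treated as LETTER with operation USE.
chop = {
    " ": ("FULL_SEP", "TRIM"),
    ",": ("FULL_SEP", "USE"),
    ".": ("TRAIL_SEP", "USE"),
}

def chop_string(text):
    words, current = [], ""
    for ch in text:
        cls, op = chop.get(ch, ("LETTER", "USE"))
        if cls == "LETTER":
            current += ch
        elif cls == "TRAIL_SEP":
            if op == "USE":
                current += ch            # attaches to the word it follows
            if current:
                words.append(current)
            current = ""
        else:                            # FULL_SEP
            if current:
                words.append(current)
            if op == "USE":
                words.append(ch)         # separator becomes a word of its own
            current = ""                 # TRIM: separator is dropped
    if current:
        words.append(current)
    return words

print(chop_string("SAS Institute, Inc."))   # ['SAS', 'Institute', ',', 'Inc.']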
This demonstration illustrates how to use the Chop Table Editor to explore, test, and modify a chop
table.
1. If necessary, invoke Data Management Studio.
a. Select Start → All Programs → DataFlux Data Management Studio 2.7.
b. Click Cancel in the Log On window.
2. If necessary, open the QKB CI 27 - ENUSA Only QKB in Data Management Studio.
a. Select the Administration riser bar.
b. Navigate to the English (United States) locale in QKB CI 27 - ENUSA Only.
c. Select the English (United States) locale and open it.
3. Explore the GB Email chop table.
a. In the right panel, click the Chop Tables tab.
Note: A chop table is stored in a proprietary format. You should not attempt to create
or edit it outside of the provided editor.
5. Scroll to locate the FULL STOP character (value 46).
Note: The COMMERCIAL AT character (value 64) is classified as a FULL SEPARATOR with an
operation of USE.
Note: These characters separate email address values into the individual words that comprise
the email address.
7. Scroll up to locate the SPACE character (value 32).
Note: Because spaces are not valid as part of email addresses, the operation for the SPACE
character (value 32) is TRIM, although it is still classified as a FULL SEPARATOR.
c. Enter John. Doe@sas.com in the Input string field. (Notice the space after the first period.)
d. Click Go.
The Result area is populated with the same chopped string because the SPACE character
is associated with an operation of TRIM in the chop table. Therefore, it is not included
in the output.
Note: When you modify any of the QKB component files, it is a best practice to create a
copy of the existing file instead of overwriting it. Here are some reasons:
• Many definitions might reference the original file.
• You might want to revert to the original file at some point in the future.
• Modifications to existing QKB components are more difficult to track when you
upgrade to a new release of the QKB. The QKB merge utility (for merging the old
QKB with the new one) could miss the fact that you have made modifications,
resulting in the loss of your work.
c. Click Save.
The title bar of the Chop Table Editor window displays the new chop table name.
10. Change the uses of some of the characters in the new chop table.
a. Locate the SPACE character (value 32).
b. Click (the down arrow) in the Operation column for the SPACE character and select
USE.
c. Locate the FULL STOP character (value 46).
d. Click (the down arrow) in the Classification column for the FULL STOP character and
select TRAIL SEPARATOR.
11. Use a test data value with the new chop table.
a. Enter John. Doe@sas.com in the Input string field. (Notice the space after the first
period.)
b. Click Go. The Result area is populated with the updated result string.
Note: The FULL STOP character, as a TRAIL SEPARATOR, now becomes a part
of the word that it follows. The SPACE character, as a FULL SEPARATOR,
is now a distinct word in the word list and is included in the output because
it has an operation of USE.
Note: These are not practical alterations but are shown here for illustrative purposes.
12. Save the new chop table.
a. Select File → Exit to close the Chop Table Editor window.
b. If you are prompted, click Yes in the Reload QKB window to reload the QKB.
c. Scroll down on the Chop Tables tab to locate the new chop table, My GB Email.
12.02 Activity
1. In Data Management Studio, click the Chop Tables tab.
2. Open the GB Website chop table.
3. Answer the following questions:
• What is the classification for the FULL STOP character (value 46)?
What is the operation?
• What is the classification for the Solidus character (value 47)? What is
the operation?
• What is the chopped string for the input string?
support.sas.com/documentation
A phonetics library is a collection of rules that perform “sound-alike” analysis on a data value. The
image above shows the EN General Phonetics library. The left column shows the rule text, or the
pattern to be matched. The right column is the replacement text for the matched rule.
In the QKB, the phonetics library is used exclusively to generate match codes. During match code
generation, phonetic rules are applied to reduce an input string. The goal is to create phonetic rules
that produce the same output string for input strings with similar pronunciations or spellings (for
example, Night and Knight are reduced to NIT by applying phonetics rules).
The Phonetics Editor allows you to create and manage phonetics rules to be used in reducing data.
In addition to managing the phonetic rules, you can test the rules and view the results inside the
editor.
In the example above, the EN General Phonetics library is used to test the name string “JOHN
MACKNIGHT”. Using the rules that “GHT” sounds like “T”, “CK” sounds like “K”, and an “H” is silent,
the phonetically reduced string is “JON MAKNIT”. These phonetically reduced strings are used in a
match definition to identify records for the same people when there are slight differences in how their
names have been entered.
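The ordered application of phonetic rules can be mimicked with a short Python sketch. The three rules below are a hypothetical subset chosen to reproduce the Night/Knight example; they are not the actual EN General Phonetics library, and real rules also use meta-characters and word-boundary flags that this sketch ignores.

import re

# Illustrative sketch of phonetic reduction (not the QKB implementation).
# Each (pattern, replacement) pair is applied in order.
rules = [
    (r"GHT$", "T"),   # GH is silent at the end of a word
    (r"^KN", "N"),    # K is silent at the start of a word
    (r"CK", "K"),     # CK sounds like K
]

def reduce_phonetically(word):
    word = word.upper()               # phonetics processing is not case sensitive
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    return word

print(reduce_phonetically("Knight"))   # NIT
print(reduce_phonetically("Night"))    # NIT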
This demonstration illustrates how to use the Phonetics Editor to explore and test a phonetics library.
1. If necessary, invoke Data Management Studio.
a. Select Start → All Programs → DataFlux Data Management Studio 2.7.
b. Click Cancel in the Log On window.
2. If necessary, open the QKB CI 27 - ENUSA Only QKB in Data Management Studio.
a. Select the Administration riser bar.
b. Navigate to the English (United States) locale in QKB CI 27 - ENUSA Only.
c. Select the English (United States) locale and open it.
3. Explore the EN General Phonetics library.
a. Click the Phonetics Libraries tab.
Each line in the editor represents a sequence of text and the phonetic equivalent. Each
phonetic rule is applied in order. You can control the order in which the rules apply to the text
string by adjusting the priority for each of the rules. Rule order can also be controlled by
dragging a rule upward or downward in the list.
Note: You can use the Edit menu to add, delete, or edit a rule.
Rule text can consist of literal characters and a small set of meta-characters. The meta-
characters used in phonetics libraries are outlined in the table below.
Meta-Character Usage
/ Searches an entire pattern but replaces only the characters before the
slash. Example: SCH/OOL with replacement of SK matches the word
SCHOOL and produces an output string of SKOOL.
Note: For more details about syntax for phonetic rules, explore the following choices:
• Select Help → Help Topics in the Phonetics Library Editor.
• In the Related Topics section, click the link for Phonetics Editor - Components
of a Rule.
4. Test the phonetics library with test data values.
a. Under Test Area, enter KNIGHT in the Input string field.
b. Click Go.
The Result field displays NIT.
Note: The first substitution is that the GHT string at the end of the string is replaced
with a T, because GH is silent.
Note: The second replacement is the KN string at the beginning of the word is replaced
with N, because K is silent.
Note: Phonetics library processing is not case sensitive. Input strings are converted
to uppercase before phonetic rules are applied and the results are displayed
in uppercase.
Note: The Result field displays MAKWEELAN. The phonetics library matched two patterns:
• The first part of the string (MAK) is the result of the pattern MC at the beginning
of a word. It is replaced with MAK.
• The second part of the string (WEELAN) is the result of the pattern WH at the
beginning of the word. It is replaced with W, although this value is not at the
beginning of the word.
Note: The Reset the beginning of word option on the first replacement string (^MC) tells the
phonetics algorithm to reset the flag for the beginning of the word, which enables
(^WH) to be matched and replaced, although it occurs before the first matched pattern
(^MC) in the library.
Note: The Result field displays MKNIT. Two patterns from the phonetics library
are matched.
• The first pattern matched is (GHT), which is replaced with T at the end
of the string. This results in a value of MCKNIT.
• The second pattern matched is (CK), which is replaced with K immediately
following the M. This results in a value of MKNIT.
12.03 Activity
1. In Data Management Studio, click the Phonetics Libraries tab.
2. Open the EN Name Phonetics library.
3. Answer the following question:
• What does your full name phonetically reduce to?
A regular expression (typically shortened to regex) attempts to match a pattern in a subject string
from left to right. Most characters represent themselves in a pattern and match the corresponding
characters in the subject. The Regex Library Editor is used to build and test regular expression
libraries.
Regular expressions have the following characteristics:
• are organized into libraries that can be used for parsing, standardization, and matching
• are primarily intended for character-level cleansing and transformations (Standardization
definitions should be used for word- and phrase-level cleansing.)
• must conform to Perl regular expression syntax
Note: When regular expressions are used against the data, every regular expression in the library
is executed against every data value.
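A minimal Python sketch of this sequential behavior follows. The two expressions are simplified stand-ins for the GB Lightweight Punctuation Removal library explored in the following demonstration (the second pattern is an assumption about how "remove # unless it is followed by a number" might be written); only the apply-every-expression-in-order behavior is being illustrated.

import re

# Illustrative sketch: every expression is applied to every value, in order.
library = [
    (r'[.,;"]', ""),     # remove periods, commas, semicolons, double quotation marks
    (r"#(?!\d)", ""),    # assumed pattern: remove # unless it is followed by a digit
]

def apply_regex_library(value):
    for pattern, substitution in library:
        value = re.sub(pattern, substitution, value)
    return value

print(apply_regex_library('Rudolph "Rudy" Smith, Ph.D.'))   # Rudolph Rudy Smith PhD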
This demonstration illustrates how to use the Regex Library Editor to explore and test a regex library.
1. If necessary, invoke Data Management Studio.
a. Select Start → All Programs → DataFlux Data Management Studio 2.7.
b. Click Cancel in the Log On window.
2. If necessary, open the QKB CI 27 - ENUSA Only QKB in Data Management Studio.
a. Select the Administration riser bar.
b. Navigate to the English (United States) locale in QKB CI 27 - ENUSA Only.
c. Select the English (United States) locale and open it.
3. Explore the GB Lightweight Punctuation Removal regex library.
a. Click the Regex Libraries tab.
b. Right-click the GB Lightweight Punctuation Removal regex library and select Open.
The Regex Library Editor window appears. GB Lightweight Punctuation Removal is loaded.
Note: This regex library is designed to remove periods, commas, semicolons, and double
quotation marks from a text string. It also removes the # (number sign) character,
unless it is followed by a number.
Note: Each regular expression in the list is executed in order, from top to bottom.
c. Double-click the first regular expression [.,;"] (that is, open bracket, period, comma,
semicolon, double quotation mark, close bracket).
Note: This window can be used to edit the expression, the substitution, or to add a note.
Note: It is important to remember that regular expressions are applied to the data
sequentially, so a single data value could be changed by more than one expression.
One expression could change a data value, causing the value to not match a pattern
in a subsequent expression.
d. Click Cancel to close the Edit Expression window.
Note: The periods are removed by the first expression in the regex library.
c. Under Test Area, enter Rudolph "Rudy" Smith, Ph.D. in the Input string field.
d. Click Go.
The Result field displays Rudolph Rudy Smith PhD.
Note: The double quotation marks, the comma, and the period are removed by the first
expression in the regex library.
Note: For more details about regular expression syntax, see the DataFlux Data
Management Studio 2.7: User Guide.
e. Select File → Exit to close the Regex Library Editor.
12.04 Activity
1. In Data Management Studio, click the Regex Libraries tab.
2. Open the GB Period Removal regex library.
3. Answer the following questions:
• What is the result for the input string U.S.A.?
• Which types of definitions in the QKB use this regex library?
Vocabulary Overview
A vocabulary is a collection of words, their associated categories, and a likelihood for each category.
The Vocabulary Editor is used to build and maintain vocabularies. Words can be manually entered
into the vocabulary or imported from a file.
Note: Each word in a vocabulary is required to have at least one category (and likelihood)
associated with it.
Vocabularies are used for the following:
• in parsing to categorize individual words in the text string
• in the matching process to identify noise words that are omitted from match code generation
• in gender analysis to determine the gender of an individual
• in identification analysis to determine the possible identity of words
The example above shows the GB Email vocabulary. The word “COM” is associated with two
categories:
• COM (com) - a Medium likelihood is assigned to the COM category.
• DMAIN (Domain) - a High likelihood is assigned to the DMAIN category.
Note: You can see the text description for a category by hovering over the value in the Categories
pane.
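The structure described above can be pictured as a simple word-to-categories mapping, as in the Python sketch below. The entries are modeled on the GB Email example; the QKB stores vocabularies in a proprietary format, so this is only a mental model, not the actual file layout.

# Illustrative sketch of a vocabulary: each word maps to one or more
# (category, likelihood) pairs. Not the QKB's storage format.
vocabulary = {
    "COM": [("COM", "Medium"), ("DMAIN", "High")],
    "AT":  [("WORD", "Medium")],
}

def categories_for(word):
    # Every word must have at least one category/likelihood pairing.
    return vocabulary.get(word.upper(), [])

print(categories_for("com"))   # [('COM', 'Medium'), ('DMAIN', 'High')]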
This demonstration illustrates how to use the Vocabulary Editor to explore and modify a vocabulary.
1. If necessary, invoke Data Management Studio.
a. Select Start → All Programs → DataFlux Data Management Studio 2.7.
b. Click Cancel in the Log On window.
2. If necessary, open the QKB CI 27 - ENUSA Only QKB in Data Management Studio.
a. Select the Administration riser bar.
b. Navigate to the English (United States) locale in QKB CI 27 - ENUSA Only.
c. Select the English (United States) locale and open it.
3. Explore the GB Email vocabulary.
a. Click the Vocabularies tab.
Note: A vocabulary is stored in a proprietary format that you should not attempt to create
or edit directly. Vocabulary files should be accessed through the Vocabulary Editor
to avoid the danger of corrupting the file.
4. Create and modify a new version of the GB Email vocabulary.
a. Select File → Save As in the Vocabulary Editor.
b. Enter My GB Email in the Name field.
c. Click Save.
The title bar of the Vocabulary Editor window displays the new vocabulary name.
c. Click OK.
The word COM is highlighted. The Word properties area on the right displays the properties
of the selected word.
f. Click (the down arrow) in the Likelihood field and select Very High.
j. Click OK.
There is now a very high likelihood that the word COM belongs to the DMAIN category.
The word AT is highlighted. The Word properties area on the right displays the properties
of the selected word.
e. Click (the down arrow) in the Category field and select WORD (Word).
f. Verify that Medium is the value in the Likelihood field.
g. Click OK.
This adds a new category and likelihood pairing for an existing word.
c. Click OK.
d. Verify that SAS is selected in the Word list.
1) Click (the down arrow) in the Category field and select DMAIN (Domain).
2) Click (the down arrow) in the Likelihood field and select High.
3) Click OK. This adds a new word to the vocabulary with a category and likelihood pairing.
Note: Every word in a vocabulary must have at least one assigned category.
If not, the vocabulary cannot be saved.
7. Select File → Save to save the My GB Email vocabulary.
8. Select File → Exit to close the Vocabulary Editor window.
9. When you are prompted to reload the QKB, click Yes.
Note: Because the QKB was loaded into memory when Data Management Studio was
instantiated, it is necessary to reload the QKB in order to use any changes that
you made to the QKB files and definitions.
10. Scroll down on the Vocabularies tab to locate the new vocabulary, My GB Email.
12.05 Activity
1. In Data Management Studio, click the Vocabularies tab.
2. Open the GB Website vocabulary.
3. Answer the following questions:
• What are the defined categories and likelihoods for the word HTTP?
• What are the defined categories and likelihoods for the word WWW?
Grammar Overview
After the morph analysis in the Parse Definition is used to identify one or more basic categories for
each word, the patterns of assigned categories can be identified. A grammar is a set of rules that
represent expected patterns of words in a given context.
The Grammar Editor is used to build and maintain basic and derived categories and build derived
rules from the categories. To improve the readability of grammar rules, all categories in a grammar
are represented using abbreviations.
Note: A grammar consists of two category types - basic and derived.
Note: Basic categories in a grammar correspond to categories associated with words during the
morph analysis.
Note: Basic categories defined in the Grammar are the categories that get imported into the
Vocabulary to assign to words.
The example above shows the basic category abbreviations used by the GB Email Validation
grammar. The derived category VALID is expanded to show the two patterns defined in the grammar
for a valid email address.
Rules for the derived categories represent an ordered list of categories (basic and derived) that
identify patterns of words in a string. Each rule is associated with a parent (derived) category and
has a priority associated with it.
In this example, a derived category rule for a valid email address is highlighted in blue. The parent
category for the derived rule is VALID, and the rule’s priority is set to Medium.
Note: Derived category rules can use both basic and derived categories.
Note: Each rule is associated with a priority, indicating the strength of the pattern matched.
Note: In the ideal situation, every possible pattern of categories will be identified in a grammar rule,
with no duplicates. However, this is not realistic, so you should strive to fully identify all of the
common patterns, and some of the less common patterns.
Note: Ambiguities in matched patterns are resolved through a scoring algorithm. This scoring takes
into account the likelihoods assigned to the words (in the Vocabulary) and the priorities of the
rules matched (in the grammar).
Name Rule: [Name] > [Given Name Word] [Family Name Word]
In the example above, the name “Bob Brauer” is being compared against the grammar rules for a
person’s name. When reading this rule, you say that the category on the left side of the rule is
derived from the categories on the right side of the rule. The derived rule identifies the possibility that
a valid name string can consist of a given name word followed by a family name word.
For some derived categories, you might want to allow a variable number of words of the same basic
category within a text string. In some cases, you will not know the exact number of words that might
appear in the string. The most efficient way of allowing for this situation is to use a recursive rule.
Recursive rules enable you to define the recursive word once in a derived category and account for
any number of occurrences of the word. A recursive rule consists of a basic category followed by a
derived category, which is the root category for the recursive rule being built.
Recursive rules achieve the following:
• enable matching derived categories of variable length
• avoid having multiple rules of variable length
• eliminate the need to guess at maximum word counts
As shown in the example above, there can be any number of name appendage words associated
with a person’s name. In the first example text string, there are three name appendage words
(categorized as NAW). In the second example text string, there are five name appendage words
(categorized as NAW).
The recursive rule for this situation is NAW (Name Appendage Word) followed by NA (Name
Appendage derived category), which could just be another NAW (Name Appendage Word). The rule
keeps looping through the words until it reaches a word that does not meet the rule for a Name
Appendage.
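A small Python sketch of recursive rule matching follows. It checks whether a sequence of basic categories can be derived as a Name Appendage (NA) using the two rules NA -> NAW and NA -> NAW NA described above. The matcher is deliberately simplified compared to the real grammar engine, which also handles priorities and scoring.

# Illustrative sketch of a recursive grammar rule (not the QKB implementation).
# Rules: NA -> NAW         (a single Name Appendage Word)
#        NA -> NAW NA      (a Name Appendage Word followed by another NA)
def matches_na(categories):
    if not categories or categories[0] != "NAW":
        return False
    rest = categories[1:]
    return not rest or matches_na(rest)   # NA -> NAW, or NA -> NAW NA

print(matches_na(["NAW"] * 3))            # True (three name appendage words)
print(matches_na(["NAW"] * 5))            # True (five name appendage words)
print(matches_na(["NAW", "GNW"]))         # False (a non-NAW word ends the pattern)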
This demonstration illustrates how to use the Grammar Editor to explore and modify a grammar.
1. If necessary, invoke Data Management Studio.
a. Select Start → All Programs → DataFlux Data Management Studio 2.7.
b. Click Cancel in the Log On window.
2. If necessary, open the QKB CI 27 - ENUSA Only QKB in Data Management Studio.
a. Select the Administration riser bar.
b. Navigate to the English (United States) locale in QKB CI 27 - ENUSA Only.
c. Select the English (United States) locale and open it.
3. Explore the GB Email grammar.
a. Click the Grammars tab.
Note: A grammar is stored in a proprietary format that you should not attempt to create
or edit directly. Grammar files should be accessed through the Grammar Editor
only.
Note: These basic categories from the grammar are also assigned to words
in the vocabulary.
3) Click the FNW (Family Name Word) basic category.
The Category properties are displayed on the right.
h. Click OK. When a derived category is added, the patterns of basic and derived categories
that identify the category need to be defined.
c. Click Add.
d. Click (the down arrow) in the Category field and select DMAIN.
e. Click OK.
f. Click Add.
g. Click (the down arrow) in the Category field and select MSERVER.
h. Click OK.
The derived category DOMAIN_NEW contains the grammar rule [DMAIN MSERVER] and
has a priority of Medium.
Note: Multiple grammar solutions might be returned by a grammar analysis. The priority for
each derived category is used as part of a scoring process that determines the best
solution. (This is discussed in more detail later.)
Note: Creating a comprehensive and robust grammar typically involves multiple iterations
and thorough testing.
5. Save the new grammar to the QKB.
a. Select File → Save to save the My GB Email grammar.
b. Select File → Exit to close the Grammar Editor window.
c. If necessary, click Yes in the Reload QKB window to reload the QKB.
d. Scroll down on the Grammars tab to locate the new grammar, My GB Email.
12.06 Activity
1. In Data Management Studio, click the Grammars tab.
2. Open the GB Website grammar.
3. Answer the following questions:
• How many basic categories exist in the grammar?
• How many rules are defined for the URL derived category?
12.2 Working with QKB Definitions
Entity Resolution
In the context of the QKB, a definition is a collection of metadata that defines an algorithm that can
perform a data-cleansing operation. A definition type corresponds to a type of data-cleansing
operation. For example, a match definition contains metadata used for creating a match code, and a
parse definition contains metadata used for parsing a data string into its individual tokens. Each
definition is associated with a data type (that is, the "Name" parse definition belongs to the "Name"
data type).
The QKB has definitions that allow you to do a variety of data management, data quality, and entity
resolution tasks.
The types of definitions that are available in the QKB include:
• Case
• Extraction
• Gender Analysis
• Identification Analysis
• Language Guess
• Locale Guess
• Match
• Parse
• Pattern Analysis
• Standardization
It is important to note that a data type does not necessarily have a definition of each definition type.
In the example above, you can see that the Name data type has five definition types associated with
it – Gender Analysis, Match, Parse, Standardization, and Case definitions. If you look at the Postal
Code data type, however, it has only three definition types associated with it – Match, Parse, and
Standardization definitions. Likewise, the Address data type has only three definition types. The E-mail
data type has four types of definitions associated with it.
The purpose of a case definition is to ensure the appropriate casing of data as it is processed by the
definition. Case definitions are algorithms that convert a text string to uppercase, lowercase, or
proper case.
The definition uses a “base” casing algorithm and then augments it with the known casing of
certain words (for example, SAS or DataFlux) and of patterns within words (for example, uppercasing
the letter that follows Mc at the start of a word).
Note: For the best results when applying proper casing, select an applicable definition that is
associated with a specific data type.
Extraction definitions extract portions of a string into relevant tokens. For example, the Contact Info
extraction definition is used in the table above to extract portions of the Input Data String into tokens
(Organization, Address, and Phone).
Notice that the order of the tokens in the data does not matter. This is because the definition uses
vocabularies, regex libraries, and grammars to analyze patterns in the Input Data String and map the
data values to the appropriate tokens.
Gender Analysis definitions determine the gender of a person, typically based on the person’s name.
The analysis returns the value male, female, or unknown. This type of data can be very useful for
marketing campaigns, checking patient data, ensuring proper salutations for mailings, or analyzing
the gender makeup of major fields of study in an academic setting.
Note: Typically, gender analysis is performed on individual name data, but it could be used on ID
codes where a portion of the code represents the gender.
The result of gender analysis is typically a code that indicates whether the input value is of one
gender or the other (M for male and F for female). If gender values cannot be determined (due to
incomplete or conflicting information), U (for unknown) is returned.
An Identification Analysis definition specifies data and logic that can be used to identify the semantic
type of a data string. For example, an identification analysis definition might be used to determine
whether a certain string represents the name of an individual or an organization.
Consider a field that has mixed corporate and individual customers. Applying the Field Content
identification analysis definition to the Customer data produces a result set that flags every record
with the type of data that is discovered or identified.
Note: The Field Content identification analysis definition can be used to recognize addresses,
cities, email addresses, organization names, phone numbers, and more.
Match codes are generated, encoded text strings that represent a data value. Match codes can be
compared across records to identify potentially duplicate data that might be obvious to the human
eye, but not necessarily obvious to a computer program. In the example above, you can see that the
three strings representing John Smith likely represent the same person. To the computer, however,
these look like three totally different individuals.
Match codes can be used to group similar data values together. In the example above, sorted by the
match code values, you can see that these records might potentially match, not because the value of
Name is the same, but because the values generated the same match code at the selected
sensitivity level (in this example, 85).
Sensitivity is used in the match code generation process to determine how much of the initial data
value you want to use in the match code. In other words, it allows you to specify how exact you want
to be in generating the match codes. The chosen level of sensitivity controls how many
transformations are made to the data string before the generation of the match code.
Sensitivity also controls the number of positions each token contributes to the match code. At higher
levels of sensitivity, more characters from the input text string are used to generate the match code.
Conversely, at lower levels of sensitivity, fewer characters are used in the generation of the match
code.
Note: It is important to experiment with different levels of sensitivity, because choosing a sensitivity
that is too low can lead to over-matching, and choosing a sensitivity level that is too high can
lead to under-matching.
Parsed Name (token and value):
  Prefix                 Dr.
  Given Name             Alan
  Middle Name            W.
  Family Name            Richards
  Suffix                 Jr.
  Title/Additional Info  M.D.
Parse definitions define rules to place the words from a text string into the appropriate tokens. In the
example above, a name value is parsed using the Name parse definition. The purpose of the Name
parse definition is to parse a name string into the tokens that make up the name (for example,
Prefix, Given Name, Middle Name, and so on).
Standardization definitions are used to improve the consistency of data by applying standardization
schemes to the individual tokens. The process of standardization with a standardization definition
involves parsing the data string into tokens, and then standardizing each token using one or more
standardization schemes.
The examples above illustrate the effect of applying standardization definitions to various input
strings. You can see that not only do the standardization definitions help with standardizing data
values, but they also control casing of values, as well as the order of the tokens in the resulting data
string.
Note: Standardizing an address value, as in the example above, does not verify that the address is
correct. It simply ensures a standard representation across address data values.
12.07 Activity
1. Open DataFlux Data Management Studio.
2. Open the QKB CI 27 - ENUSA Only QKB.
3. Click the Quality Knowledge Base tab.
4. View a list of all standardization definitions.
5. Answer the following question:
• How many data types have associated Standardization definitions?
6. View all the definitions for the E-mail data type.
7. Answer the following questions:
• How many types of definitions exist for the E-mail data type?
• What types of definitions exist for the E-mail data type?
12.3 Solutions
Solutions to Activities and Questions
Lesson 13 Using SAS® Code to Access QKB Components
13.1 SAS Configuration Options for Accessing the QKB ................................................... 13-3
There are a number of SAS applications that can interact with the QKB. In order for these
applications to access the QKB components, they need to be configured to point to the root location
of the QKB. This section discusses the options available for configuring SAS to access the QKB:
1. Configuring an “interactive” SAS session to connect to the QKB.
2. Accessing the QKB programmatically in SAS code, using the %Dqload macro to load the
specified QKB into memory.
3. Configuring the SAS Platform to the QKB by specifying configuration options in the .cfg files for
the SAS Application Server.
Interaction with the QKB from within an interactive SAS session is facilitated by setting system
options that control access to the QKB when a SAS session is instantiated. The programmer can
also set these options programmatically from within the interactive SAS session.
Note: Setting options in the configuration file for an interactive SAS session is not the preferred
method, because the end user might not be aware of the list of locales set in the DQLOCALE
system option, which can lead to mistakenly using algorithms from the incorrect locale.
In batch SAS programs, there are programmatic ways to set SAS system options, including SAS
macros that control access to the QKB and load the QKB into memory.
Access to the QKB from within SAS applications that interact with the SAS Platform can be
facilitated by setting options in the configuration file(s) that are processed when a SAS Workspace
Server is instantiated.
Note: Additional options for configuring applications to the QKB are found within the applications
themselves.
For example, the DQSETUPLOC system option can be set in a SAS configuration file to point to the
root location of the QKB:
-DQSETUPLOC "D:\ProgramData\SAS\QKB\CI27_MultipleLocales"
The DQLOCALE system option sets an ordered list of locales for SAS to use for data cleansing
processes. In the example above, the ENUSA, ENGBR, and FRCAN locales have been specified.
Note: Multiple locales can be specified for the DQLOCALE option. If multiple locales are specified,
the application searches the locales, in the order specified in the option, until it finds the
definition being used.
Note: All locales in the DQLOCALE list must exist in the QKB referenced in the DQSETUPLOC
option.
Because the locales that are specified with this option need to be loaded into memory for access,
you should always set the value of this system option by invoking the %Dqload macro (discussed in
a later section).
%Dqload
%Dqunload
%Dqputloc
SAS Data Quality Server provides three autocall macros that facilitate interaction with the QKB from
within SAS. Specifically, these macros facilitate the loading and unloading of QKB locales in
memory, as well as setting the ordered list of locales to be used in data cleansing processes.
These three macros are available from SAS Data Quality Server:
• %Dqload – used to set system option values and load the QKB into memory
• %Dqunload – used to unload the QKB from memory
• %Dqputloc – displays information about the contents of the current QKB locale from memory in
the SAS log.
%DQLOAD
(DQSETUPLOC="D:\ProgramData\SAS\QKB\CI27_MultipleLocales",
DQLOCALE=(ENUSA ENGBR FRCAN));
The %Dqload macro is used to specify the list and order of locales that are loaded into memory in a
SAS session. In addition to loading the QKB into memory, the macro sets the values of the
DQSETUPLOC and DQLOCALE SAS system options.
Options for the %Dqload macro:
• DQSETUPLOC
• DQLOCALE
• DQINFO – this option controls the amount of information that is written to the SAS log while the
QKB is being loaded into memory. Specifying DQINFO=0 results in no information being written
to the log.
Note: Options set using the %Dqload macro override any system options that were set previously.
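For example, a call that loads only the ENUSA locale and suppresses the load messages might look
like this (a sketch based on the options described above):

%DQLOAD(DQSETUPLOC='D:\ProgramData\SAS\QKB\CI27_MultipleLocales',
        DQLOCALE=(ENUSA), DQINFO=0);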
When the %Dqload macro executes, information is written to the SAS log, including
• values for the DQSETUPLOC system option
• values for the DQLOCALE system option
• confirmation of the locales that were loaded into memory.
If you need to check the SAS Data Quality system options that are in effect for your SAS session,
you can submit a PROC OPTIONS step with the option GROUP=DATAQUALITY. The resulting
output confirms the data quality settings that are in place for your current SAS session.
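For example:

proc options group=dataquality;
run;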
The %Dqunload autocall macro unloads all locales that are currently loaded into memory.
Note: It is not necessary to load and unload the QKB locales in every program that you run, but it
is good practice to unload the QKB from memory when you are no longer using the data
cleansing functions in your SAS programs. The QKB locales are also unloaded from memory
when your SAS session ends.
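For example:

%DQUNLOAD;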
The %Dqputloc macro is used to write information from the specified QKB locale to the SAS log.
The information written includes all definitions, data type tokens, related functions, and the names of
the related parse definitions (for gender definitions and match definitions).
Options for the %Dqputloc macro:
• PARSEDEFN=0|1 – this option lists the related parse definition for each gender definition and
match definition. The default value is PARSEDEFN=1.
• SHORT=0|1 – this option is used to limit the amount of information written to the log. Specifying
SHORT=1 removes the descriptions of how the definitions are used. The default value is
SHORT=0.
• locale – specifies the locale whose contents you want to view.
Note: If you specify the locale option, the specified locale must be a locale that was loaded into
memory.
Note: If you do not specify the locale option, the first locale in the DQLOCALE list (as set by
%Dqload) is used by default.
The example above illustrates the use of the %Dqputloc macro to write the contents of the ENUSA
locale to the SAS log. In this example, the PARSEDEFN option is used to list the related parse
definitions for other definitions that use parse definitions as a preliminary step in their processing.
The SHORT=0 option writes the usage descriptions to the SAS log.
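The exact call shown on the slide is not reproduced in these notes, but a call consistent with that
description might look like this:

%DQPUTLOC(ENUSA, PARSEDEFN=1, SHORT=0);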
13.01 Activity
1. Open a SAS windowing environment session by selecting Start → SAS →
SAS 9.4 (English).
2. Open the D:\Workshop\dqpqkb\demos\Ch2D1_DQServer_Macros.sas
program in the Program Editor window.
Hint: Click File and select Open Program to navigate to the program.
3. Submit the code.
4. Review the SAS Log window.
5. Answer the following questions:
How many types of definitions are available in the ENUSA locale?
What is the name of the gender definition in the ENUSA locale?
What tokens are populated by the Organization (Global) parse definition?
13.2 SAS Data Quality Server Overview
SAS Data Quality Server provides procedures, functions, and CALL routines to support interaction
with the QKB, as well as with the DataFlux Data Management Server.
The procedures, functions, and CALL routines enable you to access the QKB components from
within SAS code. The functions and CALL routines are accessible from within DATA step and SQL
code, and are often used to create new columns of data within your code.
Interaction with the DataFlux Data Management Server enables you to access any job or service
that has been made available on the server from within SAS code. These coding options
give you the flexibility to run jobs, call real-time services, check the status of running jobs, copy
logs, stop running jobs, and so on.
• Procedures
• Functions
• Call Routines
The procedures, functions, and CALL routines in SAS Data Quality Server allow you access to a
variety of data transformation processes.
• Matching – creates match codes, which are encoded representations of data values that can be
used to cluster similar data records, or as surrogate keys in “fuzzy” joins.
• Standardization – ensures the standard and consistent representation of data values.
• Parsing – used to break a string of data into meaningful tokens.
• Identification Analysis – identifies the semantic type of data in a field.
• Gender Analysis – identifies the gender of an individual based on the components of their name.
• Casing – ensures the proper casing of data values, especially values that do not conform to
“typical” casing algorithms (for example, SAS and DataFlux).
Note: In order to have access to these data cleansing definitions, the SAS Data Quality Server
code needs to execute in a SAS session that is configured to the QKB and has the
necessary locales loaded into memory.
There are two functions available in SAS Data Quality Server for performing data standardization.
• DQSTANDARDIZE – returns a character value after standardizing its casing, spacing, and
formatting, and then applies a common representation to certain words and abbreviations.
Note: The DQSTANDARDIZE function uses a standardization definition from the QKB.
• DQSCHEMEAPPLY – applies a scheme to the data and returns a standardized value.
Note: The DQSCHEMEAPPLY function uses a standardization scheme from the QKB.
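As a brief illustration, a DATA step that standardizes the Contact value with the Name
standardization definition might look like the following (the input table and column follow the course
data; the definition name is an assumption about the QKB in use):

data work.prospects_name_std;
   set input.prospects;
   length name_std $60;
   /* Apply the Name standardization definition from the loaded QKB locale */
   name_std = dqstandardize(contact, 'Name');
run;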
This demonstration illustrates the use of the functions that are available for standardizing data.
1. If necessary, start a SAS session by selecting Start → All Programs → SAS →
SAS 9.4 (English).
2. Verify that the (Enhanced) Editor window is the active window.
3. Open an existing SAS program.
a. Select File → Open Program.
b. Navigate to D:\Workshop\dqpqkb\Demos.
c. Click Ch3D1_DQLOCALE_Functions.sas.
d. Click Open.
4. Review the code.
a. Verify that there is a call to %Dqload that loads the ENUSA locale into memory.
%DQLOAD(DQSETUPLOC = 'D:\ProgramData\SAS\QKB\CI27_MultipleLocales',
DQLOCALE = (ENUSA));
b. Verify that there are two LIBNAME statements.
libname input 'D:\Workshop\dqpqkb\Data';
libname output 'D:\Workshop\dqpqkb\Solutions\files\output_files';
c. Verify that there is a FILENAME statement.
filename scheme
'D:\ProgramData\SAS\QKB\CI27_MultipleLocales\scheme\en052.sch.qkb';
d. For the DATA step:
1) Verify that a new SAS table prospects_std is being created in the output library.
2) Verify that a SAS table prospects is being read from the input library.
3) Verify that a new character column is being created named city_std of length 32, and
that a label is being assigned to this new column.
4) Verify that the new city_std column is being assigned values returned by the
DQSCHEMEAPPLY function.
data output.prospects_std;
set input.prospects;
length city_std $32;
label city_std='Standardized City';
city_std = dqschemeapply(city,'scheme','BFD','ELEMENT',
   'IGNORE_CASE');
run;
13.02 Activity
1. Open a SAS windowing environment session by selecting
Start → SAS → SAS 9.4 (English).
2. Open the following program in the Program Editor window:
D:\Workshop\dqpqkb\demos\Ch3D4_CALL_DQSCHEMEAPPLY.sas
Hint: Select File Open Program and then navigate to the file specified.
3. Submit the code.
4. View the log to make sure that the program executed successfully.
5. Open the output.prospects_call_std table.
6. Use the table to answer the following questions:
Were any transformations made on the Address variable?
What is the highest number of transformations that were applied
to an Address value?
The DQSCHEME procedure is used to improve the consistency of your data from within SAS code.
Specifically, with the DQSCHEME procedure, you can do the following:
• Create standardization schemes – both SAS data set standardization schemes and BFD
schemes in the QKB.
• Create analysis data sets – used to group together similar data values to assist you with the
creation of standardization schemes.
• Apply standardization schemes to data – updates data values based on the “standard” value in
the scheme.
The syntax for the DQSCHEME procedure consists of three statements:
• APPLY
• CONVERT
• CREATE
The APPLY statement is used to apply a standardization scheme, from the specified locale, to a
variable. These are the valid options for the APPLY statement:
• LOCALE= – specifies the name of the locale that contains the standardization and match
definitions to be used in the statement.
• MATCH-DEFINITION= – specifies the name of the match definition to be used in looking up the
input data value in the standardization scheme.
• MODE=ELEMENT | PHRASE – specifies the mode to be used in applying the scheme to the
data.
• SCHEME= – specifies the name of the standardization scheme to be applied to the data.
• SCHEME_LOOKUP=EXACT | IGNORE_CASE | USE_MATCHDEF – specifies the method to
be used in looking up the data value in the scheme.
• SENSITIVITY=sensitivity-level – used in conjunction with the USE_MATCHDEF option, specifies
how exact you want to be when using match definitions to look up data values in the scheme.
Note: The default sensitivity level is 85.
• VAR=variable-name – specifies the variable in the input data set to be standardized.
In the example above, three CREATE statements are used to create three standardization schemes
in the QKB. The schemes are created using the data in the vendors data set. The
three schemes will be named City, State, and Org, and will be stored in the ENUSA locale in the
QKB.
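The code on the slide is not reproduced in these notes. A sketch of three such CREATE statements
is shown below; the vendors variable names, the scheme file names, and the match definition names
are assumptions and would need to match your data and QKB:

filename citysch  'D:\ProgramData\SAS\QKB\CI27_MultipleLocales\scheme\city.sch.qkb';
filename statesch 'D:\ProgramData\SAS\QKB\CI27_MultipleLocales\scheme\state.sch.qkb';
filename orgsch   'D:\ProgramData\SAS\QKB\CI27_MultipleLocales\scheme\org.sch.qkb';

proc dqscheme data=input.vendors bfd;
   /* Each CREATE statement builds one scheme from one input variable */
   create var=city    scheme=citysch  matchdef='City'         locale='ENUSA';
   create var=state   scheme=statesch matchdef='State'        locale='ENUSA';
   create var=company scheme=orgsch   matchdef='Organization' locale='ENUSA';
run;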
This demonstration illustrates the use of the DQSCHEME procedure to create a standardization
scheme.
1. If necessary, start a SAS session by selecting Start → All Programs → SAS →
SAS 9.4 (English).
2. Verify that the (Enhanced) Editor window is the active window.
3. Open an existing SAS program.
a. Select File → Open Program.
b. Navigate to D:\Workshop\dqpqkb\Demos.
c. Click Ch3D5_PROC_DQSCHEME_Create.sas.
d. Click Open.
4. Review the code.
%DQLOAD(DQSETUPLOC='D:\ProgramData\SAS\QKB\CI27_MultipleLocales',
DQLOCALE=(ENUSA));
b. Double-click the sas_city_scheme SAS data set to preview the scheme in the VIEWTABLE
window.
9. Make any changes that are necessary to the scheme by switching to Edit mode.
a. Select Edit → Edit Mode.
In the example above, the APPLY statement is used to apply the sas_city_scheme standardization
scheme to the City variable in the prospects data set.
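The program on the slide is not reproduced here. A sketch of such an APPLY step, assuming the
scheme was saved earlier as the SAS data set output.sas_city_scheme, might look like this:

proc dqscheme data=input.prospects out=output.std_prospects nobfd;
   /* Look up each City value in the scheme, ignoring case, and write the standard value */
   apply var=city scheme=output.sas_city_scheme
         mode=phrase scheme_lookup=ignore_case;
run;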
This demonstration illustrates the use of the DQSCHEME procedure to apply a standardization
scheme to a variable.
1. If necessary, start a SAS session by selecting Start → All Programs → SAS →
SAS 9.4 (English).
2. Verify that the (Enhanced) Editor window is the active window.
3. Open an existing SAS program.
a. Select File → Open Program.
b. Navigate to D:\Workshop\dqpqkb\Solutions\SAS_Programs.
c. Click Ch3D5_PROC_DQSCHEME_Apply.sas.
d. Click Open.
4. Review the code.
5. Apply the scheme to the City variable in the Prospects data set.
a. In the Enhanced Editor, enter the following code:
%DQLOAD(DQSETUPLOC = 'D:\ProgramData\SAS\QKB\CI27_MultipleLocales',
DQLOCALE = (ENUSA));
c. Double-click the Std_Prospects SAS data set to see the standardized city values in the
VIEWTABLE window.
13.03 Activity
1. Open a SAS windowing environment session by selecting Start → SAS →
SAS 9.4 (English).
2. Open the following program in the Program Editor window:
D:\Workshop\dqpqkb\demos\Ch3D5_PROC_DQSCHEME_Apply.sas
Hint: Select File → Open Program and then navigate to the file specified.
3. Submit the code.
4. View the log to make sure that the program executed successfully.
5. Verify that the new table output.std_prospects has correct values for the
city column.
These SAS Data Quality Server functions are available for performing data matching:
• DQMATCH – returns a match code from a character value.
• DQMATCHINFOGET – returns the name of the parse definition that is associated with a match
definition.
• DQMATCHPARSED – returns a match code from a parsed character value.
We discuss only the DQMATCH function in this section.
The DQMATCH function returns a match code based on a data value. The match code is an
encoded representation of the characters in the data string after going through several processing
steps and based on the specified level of sensitivity.
Required arguments for the DQMATCH function:
• source-string – specifies a character constant, variable, or expression that contains the value for
which a match code is created, according to the specified match definition.
• match-definition – the match definition from the QKB.
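As a brief illustration, a DATA step that creates a match code for the Contact value with the Name
match definition at the default sensitivity of 85 might look like the following (the input table and
column follow the course data; the definition name is an assumption about the QKB in use):

data work.prospects_matchcodes;
   set input.prospects;
   length contact_mc $30;
   /* Create a match code from the Contact value */
   contact_mc = dqmatch(contact, 'Name', 85);
run;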
This demonstration illustrates the use of the match code generation functions to create match codes.
1. Use SAS Data Quality Server functions in a SAS DATA step to create new data fields.
a. In the Enhanced Editor, enter the following code:
/* Set DQSETUPLOC and DQLOCALE options and load QKB into memory */
%DQLOAD(DQSETUPLOC='D:\ProgramData\SAS\QKB\CI27_MultipleLocales',
DQLOCALE=(ENUSA));
c. Double-click the Prospects_matchcodes data set to see the data with match code values in
the VIEWTABLE window.
d. Scroll to the right to see the new match code variables created by the function calls.
CRITERIA <options>;
CRITERIA <CONDITION=integer>
<DELIMSTR=variable-name | VAR=variable-name>
<EXACT | MATCHDEF>
<MATCHCODE=output-character-variable>
<SENSITIVITY=sensitivity-level>;
The purpose of the CRITERIA statement is to specify conditions for generating match code values.
These are the optional arguments for the CRITERIA statement:
• CONDITION= – integer value that is used to group multiple CRITERIA statements together.
• DELIMSTR= | VAR= – specifies the value to be used to generate the match code. DELIMSTR=
is used to specify a token from a parse step, and VAR= is used to specify a variable name.
• EXACT | MATCHDEF – used to determine how clusters are created. EXACT is used to identify
exact character matches between values. MATCHDEF is used to specify a match definition from
the QKB to be used to generate match codes on the values.
• MATCHCODE= – specifies the name of the character variable that the match definition writes
the match code to.
• SENSITIVITY= – determines the amount of information that is contained in the resulting match
code value.
Note: The default level of sensitivity is 85.
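A sketch of a PROC DQMATCH step with two CRITERIA statements, consistent with the output
described below (match codes on Contact and City plus a cluster number), follows; the data set
names and match definition names follow the course data and are assumptions:

proc dqmatch data=input.prospects out=output.prospects_mc
             cluster=cluster_num;
   /* Both criteria use CONDITION=1, so rows cluster only when both match codes agree */
   criteria condition=1 var=contact matchdef='Name' matchcode=contact_mc sensitivity=85;
   criteria condition=1 var=city    matchdef='City' matchcode=city_mc    sensitivity=85;
run;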
The example above shows the output of the PROC DQMATCH code. You can see that match codes
have been generated on the Contact field and also the City field from the input data table. You can
also see that several records are identified as belonging to clusters, since they generated the same
match codes.
Note: The rows with CLUSTER_NUM values of “.” are single-row clusters that do not match any
other rows based on the match code values.
This demonstration illustrates the use of the DQMATCH procedure to create match codes and
cluster data records.
1. If necessary, open a SAS session by selecting Start → All Programs → SAS →
SAS 9.4 (English).
2. Using the DQMATCH procedure, create match codes and cluster data on the Name and
Address fields in the Prospects SAS data set.
a. In the Enhanced Editor, enter the following code:
/* Set DQSETUPLOC option to QKB root and load into memory */
%DQLOAD(DQSETUPLOC='D:\ProgramData\SAS\QKB\CI27_MultipleLocales',
DQLOCALE=(ENUSA));
4. Open the table created by the DQMATCH procedure to preview the match codes and the
clusters.
a. In the Explorer pane, navigate to the Output folder.
b. Double-click the Prospects_mc table.
c. Preview the data in the table.
13.04 Activity
1. Open a SAS windowing environment session by selecting Start → SAS →
SAS 9.4 (English).
2. Open the following program in the Program Editor window:
D:\Workshop\dqpqkb\Demos\Ch3D7_PROC_DQMATCH.sas
Hint: Select File → Open Program to navigate to the program.
3. Submit the code.
4. View the log to make sure that the program executed successfully.
5. Open the output.prospects_mc table.
6. Use the log and the table to answer the following questions:
How many records were written to the output table?
How many clusters were created in the output table?
The DQPARSE function returns a tokenized character string from a data value. This parsed
character string contains delimiters that separate the individual “tokens” in the tokenized string.
Required arguments for the DQPARSE function:
• parse-string – specifies a character constant, variable, or expression that contains the value to
be parsed, according to the specified parse definition.
• parse-definition – specifies the parse definition from the QKB.
The DQPARSETOKENGET function returns the value of the specified token from a previously
parsed data value. These are the required arguments for the DQPARSETOKENGET function:
• parsed-char – a character constant, variable, or expression that contains the parsed character
value from which the value of the specified token is returned.
• token – the name of the token that is returned from the parsed value.
Note: To see a valid list of tokens for a parse definition, use the DQPARSEINFOGET function,
or alternatively, the %Dqputloc autocall macro.
• parse-definition – the name of the parse definition from the QKB.
Note: The parse definition used in the DQPARSETOKENGET function must be the same parse
definition that was used to create the parsed input string.
Optional argument for the DQPARSETOKENGET function:
• locale – locale that contains the parse definition.
In the example above, the DQPARSETOKENGET function is used to return the data value stored in
the Given Name token of the parsed name string.
This demonstration illustrates the use of the functions that are available for parsing data values.
1. Use SAS Data Quality Server functions in a SAS DATA step to create new data fields.
a. In the Enhanced Editor, enter the following code:
/* Set DQSETUPLOC option to QKB root and load into memory */
%DQLOAD(DQSETUPLOC='D:\ProgramData\SAS\QKB\CI27_MultipleLocales',
DQLOCALE=(ENUSA));
data output.prospects_parsed;
set input.prospects;
length ParsedPhone $20 ParsedName $60 areacode $3;
label areacode='Area Code';
parsedphone = dqparse(Phone_Number, 'Phone');
areacode = dqparsetokenget(parsedphone, 'Area Code', 'Phone');
parsedname = dqparse (contact, 'NAME');
run;
Hint: This code can be found in the following program:
D:\Workshop\dqpqkb\Demos\Ch3D8_Parse_Functions.sas
b. Select Submit to submit the code.
2. View the log and resolve any issues.
Select View → Log and resolve any errors.
3. Preview the Prospects_parsed data set.
a. In the Explorer pane, navigate to the Output library.
b. Open the Output library.
c. Double-click the Prospects_parsed SAS data set to see the ParsedPhone and
ParsedName values in the VIEWTABLE window.
d. Scroll to the right to see the new Area Code variable that was created by the function.
The DQEXTRACT function returns token values from a free-form data value. These are the required
arguments for the DQEXTRACT function:
• extraction-string – the value that is extracted according to the specified extraction definition. The
value must be the name of a character variable, a character value in quotation marks, or an
expression that evaluates to a variable name or quoted value.
• extraction-definition – the extraction definition from the QKB used to extract data values into
tokens.
The optional argument for the DQEXTRACT function is the locale that contains the extraction
definition.
In the example above, the DQEXTRACT function is used to return a delimited text string for the
various input data values that contain information for Mike Abbott. The delimiters are used to
separate the various tokens in the extracted data string.
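As a brief illustration, a DATA step using DQEXTRACT with the Contact Info extraction definition
might look like the following; the input table and its contact_info column are assumptions for the
sketch:

data work.prospects_extracted;
   set input.prospects;
   length contact_ext $200;
   /* Return a delimited string of tokens extracted from the free-form value */
   contact_ext = dqextract(contact_info, 'Contact Info');
run;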
This demonstration illustrates the use of the functions that are available for extracting data values.
1. If necessary, open a SAS session by selecting Start → All Programs → SAS →
SAS 9.4 (English).
2. Use SAS Data Quality Server functions in a SAS DATA step to create new data fields.
a. In the Enhanced Editor, enter the following code:
data _null_;
   length extractstring $200 extractinfo $200;
   /* Token names for the CONTACT INFO extraction definition */
   extractinfo=dqextinfoget('CONTACT INFO');
   /* Place the value SAS into the ORGANIZATION token of the extraction string */
   extractstring2=dqexttokenput(extractstring,'SAS','ORGANIZATION',
                                'CONTACT INFO');
   put extractinfo= //
       extractstring= //
       extractstring2=;
run;
Hint: This code can be found in the following program:
D:\Workshop\dqpqkb\Demos\Ch3D9_Extract_Functions.sas
b. Select Submit to submit the code.
3. Select View → Log and resolve any errors.
4. Preview the SAS log to see the results.
a. In the Log window of the SAS windowing environment session, navigate to the end of the log,
below the code that you submitted in the previous step.
Note: Values for a specific token can be obtained using the DQEXTTOKENGET function.
The DQIDENTIFY function returns the type of data represented by a data value. These are the
required arguments for the DQIDENTIFY function:
• char – specifies a character constant, variable, or expression that contains the value that is
analyzed to determine the category of the content.
• identification-analysis-definition – the identification analysis definition from the QKB.
The optional argument for the DQIDENTIFY function is the locale that contains the
identification analysis definition.
In the example above, the DQIDENTIFY function is used to return the category of data represented
by the various input values. The values returned by the function are the categories of data guessed
from the data values.
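A brief sketch of DQIDENTIFY with the Field Content identification analysis definition follows; the
input table and its name column are assumptions for the illustration:

data work.customers_typed;
   set input.customers;
   length content_type $30;
   /* Guess whether the value is an individual name, organization, address, and so on */
   content_type = dqidentify(name, 'Field Content');
run;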
The DQGENDER function returns a gender value from the name of an individual. These are the
required arguments for the DQGENDER function:
• char – specifies the character variable, or string, that is to be processed by the function.
• gender-analysis-definition – the definition from the QKB that is to be used to determine the
gender.
The optional argument for the DQGENDER function includes the locale that contains the gender
analysis definition.
In the example above, the DQGENDER function is used to guess the gender of the provided data
values. For the name Mike Abbott, the function returns M, indicating that the person is a male. For
the name Jane Abbott, the function returns F, indicating that the person is a female. For the name
Stacey Abbott, the function returns U, indicating that the gender is unknown, based on the provided
data value.
data output.individuals;
set output.individuals;
length gender $1;
label gender='Gender';
gender = dqgender (Contact, 'Name');
run;
Hint: This code can be found in the following program:
D:\Workshop\dqpqkb\Demos\Ch3D10_ID_Gender_Functions.sas
b. Select Submit to submit the code.
4. Select View → Log and resolve any errors.
5. Preview the Organizations data set.
b. In the Explorer pane, navigate to the Output library.
c. Open the Output library.
d. Double-click the Organizations SAS data set to see the organization records in the
VIEWTABLE window.
13.3 Solutions
Solutions to Activities and Questions