Technet Etl Design Questionnaire

Download as pdf or txt
Download as pdf or txt
You are on page 1of 15

ETL Design Questionnaire

Applies to:
Informatica PowerCenter

Summary
This article provides a questionnaire that can be useful when being involved in the design process of an
interface.

Author Bio
Author(s): Matthias Urech
Company: interface-development.com
Created on: January 8, 2009

Matthias Urech, founder and project leader of interface - development.com, has a proven track
record of applying data integration solutions for several companies across the industry. He
played a key role to implement projects in the area of data integration, data migration, data
consolidation and data warehousing.

Matthias contributed several articles about data integration best practices for Informatica
Technology Network, resulting in nomination as member of developer wall of fame and
receiving an award as contributor of the year in 2008.

Informatica Technology Network http://technet.informatica.com


© 2009 Informatica Corporation. All Rights Reserved. 1
ETL Design Questionnaire

Table of Contents
Introduction .........................................................................................................................................................3
ETL Design Questionnaire..................................................................................................................................4
Question 1: Who is involved? .........................................................................................................................4
Question 2: What is the scope of the interface? .............................................................................................5
Question 3: How is the data food chain? ........................................................................................................5
Question 4: How is the relationship between the systems? ...........................................................................6
Question 5: How is the data life cycle? ...........................................................................................................7
Question 6: How are changes captured? .......................................................................................................7
Question 7: What are the load frequencies? ..................................................................................................8
Question 8: What are the load types?.............................................................................................................8
Question 9: Which data volume is extracted? ................................................................................................9
Question 10: What is the strategy to load data?.............................................................................................9
Question 11: What is the load order? ...........................................................................................................10
Question 12: How is the interface data flow? ...............................................................................................10
Question 13: How is the target layout? .........................................................................................................11
Question 14: How is the logical mapping?....................................................................................................11
Question 15: Where are data quality issues addressed? .............................................................................12
Question 16: What are the operational actions?...........................................................................................12
Question 17: How is the scheduling configuration?......................................................................................13
Question 18: What level of documentation is provided?...............................................................................14
Disclaimer and Liability notice ..........................................................................................................................15

Informatica Technology Network http://technet.informatica.com


© 2009 Informatica Corporation. All Rights Reserved. 2
ETL Design Questionnaire

Introduction

Do not design a bridge by counting the number of people who swim across the river today. That’s also true
for ETL projects. Depending on your role in the ETL project, your work starts sometimes when the data flow
needs to be built. Fine, someone will tell you what you have to do. But sometimes you will be involved earlier
in the project to design the interface. As a matter of fact, you are faced then with the challenge of gathering
requirements. That’s where the ETL Design Questionnaire comes in place. Asking the right questions is not
only essential, but will put you also in a position of controlling the design process. As a side effect, you will be
recognized as a professional ETL developer that has a plan.

Sounds that the questionnaire is an exciting and useful tool to work with. But what makes the difference
compared to other methodologies or frameworks (i.e. Informatica Velocity)? Each of the provided questions
is supported by figures, tables or graphical elements (let’s call them diagrams). It is neither about creating
comprehensive documentation nor strictly answering all questions. Consider this set of questions rather as a
presentation of multiple views in order to get a common understanding about the interface for all involved
parties. The main goal is to gather as much information with a simple approach. The key word here is
“tailoring”. Decide yourself what you want to use!

Like everybody has a schema for getting rich that will not work, using the proposed questions in this article
will neither prevent you for going through the design process nor is the list of questions complete. At the end
of the day, your goal should be to provide the requirements for developing a working interface, respond to
changes, and have a good costumer communication and collaboration. I hope nonetheless that the provided
ETL Design Questionnaire will be useful to you and to your challenges.

Informatica Technology Network http://technet.informatica.com


© 2009 Informatica Corporation. All Rights Reserved. 3
ETL Design Questionnaire

ETL Design Questionnaire


Question 1: Who is involved?
Basically, the ETL team is responsible for extracting, transforming and loading data. More specifically, there
are more tasks to think about within and outside the ETL team. In detail:
• Performing data analysis
• Defining data quality strategy
• Gather business rules
• Develop interface
• Establish test plans
• Perform reconciliation
• Execute tests
• Prepare deployment and support rollout
• Documentation of interface

These are just some of the tasks and the list is by far not complete. However, all those tasks have
to be done by someone. The objective of the role/task diagram is to define the involved people and
their responsibilities. In short: you can simple ask "who does what?". For example: subject matter
expert (who) provides business rules (what).

Figure 1: Role/Task Diagram

Informatica Technology Network http://technet.informatica.com


© 2009 Informatica Corporation. All Rights Reserved. 4
ETL Design Questionnaire

Question 2: What is the scope of the interface?


Consider the interface as a black box for the moment. First, we want to understand the scope before building
the interface. In detail, you want to know what causes an event that provides input for the interface and what
is the output. For example: time and expense data has been posted after the month end (event) will read
daily charged hours for each employee (input) with the interface and deliver aggregated hours for the
financial period (output).

Figure 2: Interface Scope Diagram

Question 3: How is the data food chain?


Understanding the data food chain is important in order to get the big picture about the involved systems.
This gives you as early as possible an overview about constraints and dependencies in the project. The data
food chain diagram should state the involved systems (box), interfaces (arrow) and description of the data
types (arrow description).

Figure 3: Data Food Chain Diagram

Informatica Technology Network http://technet.informatica.com


© 2009 Informatica Corporation. All Rights Reserved. 5
ETL Design Questionnaire

Question 4: How is the relationship between the systems?


After drawing the data food chain you are able to easily focus on the systems that are involved in the ETL
project. The goal is to identify the system relationship in order to find out the system that owns the data. This
will help you also to see if you need to build additional interfaces or checks to ensure referential integrity.

Figure 4: System Relationships

Figure 4 illustrates the three types of system relationships:

• Master / Slave
This is the most common relationship. Data will be maintained in system 1 and
provided to system 2.

• Master / Master (one direction)


In this relationship, data will be maintained in both systems. Only system 1 will be
able to update data in system 2. Therefore, additional efforts (either manual or
automatic) have to be done to prevent data inconsistency and loss of data
quality.

• Master / Master (both directions)


As already mentioned in the previous relationship, data will be maintained in both
systems. This relationship shows that both systems are able to update data.

Of course, the road still doesn’t end here since some systems are connected to more than one
(see data food chain diagram). In such case you should prioritize the data flow order and check if
the data food chain makes sense at all.

Informatica Technology Network http://technet.informatica.com


© 2009 Informatica Corporation. All Rights Reserved. 6
ETL Design Questionnaire

Question 5: How is the data life cycle?


By knowing the types of relationships, you are now able to draw the data flow in the data life cycle diagram.
For example: the data flow arrow will point from system 1 to system 2 in case of a Master/Slave relationship.
What’s left is to move the data flow arrow horizontal to define at which point of time an action (i.e. create) in
system 1 will cause a certain action in system 2.

Figure 5: Data Life Cycle Diagram

Please note that the given actions in system 1 are just examples. Some systems only allow to flag data
inactive instead of delete them.

Question 6: How are changes captured?


The maintenance of data content is a key element in the development phase of an interface. There are
several ways to capture changes:
• Log Parsing
Parsing logs is about scanning logs and capturing changes according to log entries. While this is an
effective way of doing it also contains some dangers since it’s the nature of logs to get full and do not
log anything anymore. Moreover, you will potentially loose all changes if the log is truncated before you
run the interface.
• Audit Columns
Most source systems have audit columns to store the date and time a record was added or modified.
The most common approach to detect changes is to compare the last modified date with the date when
the interface ran the last time. This approach works fine if the date in the audit columns is indicating
only effective changes and the modified date does not overwrite the create date.
• Current Data Selection
In order to get only the current record is to select the end date field that contains NULL or a date in the
future (e.g. 12/31/9999). This approach requires that you have a snapshot of data of the last interface
run in order to compare and determine the changes. There’s no problem as long as the target system
detects the changes. Otherwise, you have to think about providing additional data loads to deliver the
full dataset in case the snapshot runs out of sync.
• Timed Data Selection
Timed data selection is when you check if the create or modified date fields are equal to a certain date
(e.g. SYSDATE -1). This works fine if the interface is running only once per day and do never fail.
Otherwise, you need to have a plan B to deliver the changes or a complete dataset.
• Changed Data Capture
Changed data capture preserves that either the source system or the interface captures the changes.
Interface changes are captures by retrieving a full copy of the target system in order to compare it with
a full copy of the source system. Although this approach is not the most efficient technique, it is the one

Informatica Technology Network http://technet.informatica.com


© 2009 Informatica Corporation. All Rights Reserved. 7
ETL Design Questionnaire

that is most reliable. It’s essential that you only take an exact copy of the target system. Any other
attempt (e.g. copy of previous extraction) will have the risk to miss some data.

Choosing the approach that best fit to the environment is key. It’s important that you know all constraints
and consider them during development. Spending enough research time will prevent you from troubles
running the interface in production.

Question 7: What are the load frequencies?


Interfaces are scheduled to be executed on a certain frequency. While scheduling itself is an operational
task, the frequency has an impact how the interface has to deal with the data at the point of execution. The
frequency to load data can by on a yearly, quarterly, monthly, weekly, daily, intra-daily or ad-hoc basis.
Please consider that this list is by far not complete. Instead, you should think through all possible cases and
write all load frequencies down.

Question 8: What are the load types?


The load type diagram outlines the load types and date ranges that are used in the interface.

Figure 6: Load Type Diagram

Informatica Technology Network http://technet.informatica.com


© 2009 Informatica Corporation. All Rights Reserved. 8
ETL Design Questionnaire

Question 9: Which data volume is extracted?


Knowing the extracted data volume is not only important for performance tuning but also essential for
delivering exactly the required amount of data. The data volume contains one of the following datasets:
• Complete Dataset
All data is provided. This includes current and history data without applying any
filter.
• Current Dataset
For each row only the most current data (known by source) is provided.
• History Dataset
Only history data is provided.
• Reduced Dataset
Filter is applied with one or more criteria (e.g. client number, report date, flag).

Determining the data volume may take some detective work. You might have to perform source
system analysis, study the history concept of the source system or simply talk to subject matter
experts in order to find out the criteria that are applied to the dataset.

Question 10: What is the strategy to load data?


Defining a load strategy is nothing else than building a relation between load frequency, load type and data
volume. In other words: it’s like a cube that has three dimensions: frequency (load frequency), date (load
type) and amount (data volume). Once you have defined the load strategy it’s recommended to review each
combination of load frequency, load type and data volume in order to identify gaps, reduce duplicates and
complexity.

Having a proper load strategy will help you in getting a clear understanding about what data the
interface will Extract, Transform and Load (ETL). In addition, the load strategy table also contains
the parameters that can be used in order to make the interface more flexible. This can reduce the
overhead of creating multiple interfaces when only certain attributes of an interface need to be
changed.

Table 1: Load Strategy Table

Informatica Technology Network http://technet.informatica.com


© 2009 Informatica Corporation. All Rights Reserved. 9
ETL Design Questionnaire

Question 11: What is the load order?


Defining dependencies between jobs is very important. Especially when you are loading data to a data
warehouse or a target with enabled constraints. Without, you can’t enforce referential integrity. Beside of
outlining dependencies, a load order table also shows jobs that are running in parallel.

Table 2: Load Order Table

Question 12: How is the interface data flow?


The interface data flow diagram is mostly used to outline the extract, transform and load (ETL) process. The
goal is to have a common understanding about the data flow and the involved applications and actions to
deliver data between the systems.

Figure 7: Interface Data Flow Diagram

Informatica Technology Network http://technet.informatica.com


© 2009 Informatica Corporation. All Rights Reserved. 10
ETL Design Questionnaire

Question 13: How is the target layout?


All that matters is the result of the solution or in other words: what is loaded into the target. Therefore, the
earlier you know what you have to provide the earlier you can begin with the development.

Table 3: Target Definition Table

Question 14: How is the logical mapping?


We presume that source and target is known. The logical mapping table helps you defining the linking of
source and target fields and to document business rules. The logical mapping is like water. It’s easier to build
something on it if linking and business rules are frozen.

Table 4: Logical Mapping Table

Informatica Technology Network http://technet.informatica.com


© 2009 Informatica Corporation. All Rights Reserved. 11
ETL Design Questionnaire

Question 15: Where are data quality issues addressed?


Here, it is about defining if you should care about data quality. You should address as many data quality
issues to the source as possible since future interface development initiatives would otherwise have to deal
with it again. However, some issues like incomplete data might be best addressed in the interface.

Table 5: Data Quality Assignment Table

Question 16: What are the operational actions?


Some operational actions are overseen during development. As a result, you have to put your hands again
on the interface. Thinking about operational steps from the beginning will help you identify hidden
requirements and perform accurate effort estimates.

Table 6: Operational Action Table

Informatica Technology Network http://technet.informatica.com


© 2009 Informatica Corporation. All Rights Reserved. 12
ETL Design Questionnaire

Question 17: How is the scheduling configuration?


In most organizations operational tasks are not done by the developer. Therefore, the interface will be
handed over and an operational handbook is provided. The operational handbook contains information about
how to schedule the interface. Figure 9 displays the scheduling workflow with all its dependencies:

Figure 8: Scheduling Workflow Diagram

Moreover, a scheduling configuration table is also provided explaining in detail how the scheduling job is
setup and executed.

Table 7: Scheduling Configuration Table

Informatica Technology Network http://technet.informatica.com


© 2009 Informatica Corporation. All Rights Reserved. 13
ETL Design Questionnaire

Question 18: What level of documentation is provided?


Table 8 supports you in defining the documentation scope:

Table 8: Documentation Decision Table

Informatica Technology Network http://technet.informatica.com


© 2009 Informatica Corporation. All Rights Reserved. 14
ETL Design Questionnaire

Disclaimer and Liability notice


Informatica offers no guarantees and assumes no responsibility or liability of any type with respect to the
content of this software asset, including any liability resulting from incompatibility between the content within
this asset and the materials and services offered by Informatica. You agree that you will not hold, or seek to
hold, Informatica responsible or liable with respect to the content of this software asset.

Informatica Technology Network http://technet.informatica.com


© 2009 Informatica Corporation. All Rights Reserved. 15

You might also like