Top 10 ETL Design Tips
Will Martin
ABSTRACT
Any organization with even a small IT team probably expends some effort
moving data from one system to another, usually doing something to the
data while moving it. This is often referred to as ETL, for Extract,
Transform, and Load. When a business manager asks someone on his
team to get business data from somewhere and build a report, that
manager is asking for ETL: for example, he may want someone to
“download” (Extract) the latest sales figures from operations, clean them up
and organize them by sales team (Transform them), and put them into an Excel file
(Load them) so he can review them. ETL is ubiquitous, and as many IT
organizations will know, business users always want the data “yesterday”.
While many commercial ETL tools exist, it is our experience that the
majority of organizations ultimately custom-build most or all of their major
ETL systems, usually for good reasons related to the cost or limitations of
such commercial systems, or the desire to avoid establishing a
dependency on an external toolset. Understanding this, herein we discuss
a repeatable and generalized methodology for designing ETL processes in
a way that enhances reliability and flexibility, thereby decreasing overall
costs while increasing productivity and value to business owners.
Data, Data, Everywhere: What is ETL, and Who Uses It?
ETL stands for “Extract, Transform, and Load”: Extracting data from one or more source data
systems, Transforming that data into a form that is suitable for whatever it is you are going to do
with it, and Loading it into its final destination system, where it can be accessed and used by
your business customers.
ETL could be something as simple as loading the contents of a CD-ROM onto a computer once
a quarter, sorting the data by date or geography, and saving the result in a spreadsheet. Or it
could be as complex as downloading terabytes of daily sales transaction data from Linux
dataservers distributed worldwide, downloading customer data from several Windows servers at
headquarters, downloading monthly inventory data from tape drives, applying complex business
rules and combining the data into a single data warehouse on yet another dataserver, and
finally running some SQL queries from the data warehouse to produce several customer-
specific datamarts in a user-access database, and doing all of this automatically every night.
The reason ETL is so common is that no one data system can do everything. Order-taking
systems are designed to record customer orders, so the data they gather is formatted to suit
that system. Inventory systems are designed to manage inventory. And even if an organization
uses a vendor's integrated system which automates the communications between these OLTP
systems, there is probably a separate reporting system. Or maybe there are legacy systems
that are still online for parts of the business. For whatever particular reasons, most businesses
need to get multiple systems to “talk to” each other, and these systems are often quite different
in data structure and format. Someone who tried to produce a management report with
metrics spanning multiple systems by manually downloading information from each system
and combining it into a usable output would face a daunting task, particularly if it
had to be repeated regularly.
Instead, what most organizations do is create an ETL process that will extract the relevant data
from all the relevant systems, transform the various datasets into a common coherent set of
information which includes applied business rules, and then load that data into a report,
reporting database, or other output system for consumption. This process will usually be
automated, or at least repeated.
Fig 1. The need for ETL
Building a repeatable ETL process will probably prove much more cost effective than manually
looking up data and building an output database over and over. However, if an ETL process is
poorly designed or implemented, simply running and maintaining it will also prove costly. There
is also always a productivity cost when information does not move freely within an organization
– when one system cannot get data from another system, or when management cannot see the
latest reports, the business suffers. Finally, businesses are constantly changing, resulting in
new business rules and new data structures, and any data-centric process needs to be flexible
enough to adapt to these changes.
In this paper we discuss a methodology for designing and building ETL systems that are robust,
reliable, and flexible, all features that usually result in cost-effective systems. We do not discuss
specific software tools or hardware and environment choices – the "technology stack" – as
these choices are typically driven by many factors which have little to do with ETL. But the
methodology described here does not depend on the use of any specific software or hardware;
rather it is a design strategy that can be applied in any system environment, precisely because it
embodies a repeatable design strategy rather than a specific tool.
Top 10 ETL Design Tips
Let's get right to it: here's our list of things to do that will result in a great ETL system:
1. Design your system in three logical "layers": an extract layer, a warehouse layer, and a
publish layer, all persistent.
2. Make each layer operate independently, so that you can refresh one without affecting
the others.
3. Design your extract to touch and depend on source systems as little as possible and as
infrequently as possible. Get in and get out, and don't alter data while you extract it:
extract it, then alter it.
4. If a source system is unreliable, treat it that way and design for fault tolerance.
5. Make your load steps within a layer modular so that each step can be run by itself.
6. Make every step of your ETL process non-destructively repeatable, so that if something
happens, you can just run that step again as many times as you want without creating
data problems like data duplication or repeated summing.
7. Break up long-running processes. Long-running processes are a barrier to testing,
discourage change, and create their own failure modes.
8. Code your business rules once and only once, in the warehouse layer. Don't duplicate
code just because you produce multiple reports involving the same data.
9. Be wary of custom-coded incremental loads. Use existing tools or avoid incremental
loads altogether if practical.
10. Design with both performance AND maintainability in mind. Faster is usually better, but
not always.
The discussion that follows will demonstrate the validity of these recommendations.
What Makes “Good” ETL?
• Reliable. Business users need to know the data will be there when they expect it.
• Fast ENOUGH (as opposed to as fast as possible). There’s always a time requirement.
• Flexible. Business rules constantly change, and systems are upgraded and replaced.
The ETL system will have to change as a result.
Reliable: A reliable system does what it’s supposed to when it’s supposed to do it, and the end
users get their data when they expect it. But what features make an ETL process reliable?
It is fast and only connects to external systems for the minimum time needed, in
order to minimize vulnerability to external system outages, primarily source
system outages. Source data systems have their own lives, and they may have
outages, scheduled and otherwise, and may have variable reliability. We want to “get in
and get out” quickly, so that we don’t have to worry any more than absolutely necessary
about source systems going offline and interrupting our ETL process.
It is modular, with process steps being separated and independent. This pays off
in many ways: ease of testing, ease of modification, fault recovery. When each process
is non-destructively repeatable, the payoff is multiplied.
It includes custom logging. If you design your own automated logging, which can be
very simple to do, you will be able to log whatever you want: step-by-step completion,
step timing, custom error messages and diagnostic information.
Fast ENOUGH:
The overall process completes with time to spare. If your customers need the data
by 8AM and your process finishes at 7:55AM every day, you have no room for
contingencies.
Beyond this, do not be obsessed with performance. In IT, there is often an over-
emphasis on performance, with every process being designed to be as fast as possible.
This often comes at a cost in terms of up-front design cost, lost flexibility, lost reliability,
and increased maintenance costs. Yes, customers always want the data yesterday. But
when they REALLY get upset is when the data doesn’t come at all, or when it’s not right.
Flexible: Business rules constantly change, and source systems are upgraded and
replaced; the ETL system will have to change as a result. A modular, layered design,
with each business rule coded in only one place, is what lets those changes be made
quickly and safely.
Architecture: How Do We Meet These Goals?
We have found that a three-layer system provides an ideal environment to achieve the
goals we’ve identified. Each layer contains data tables with different purposes. These
layers are logical layers, not necessarily physical ones; that decision, like many
others, is based on hardware and performance concerns.
Fig 2. A Three-Layer ETL System
The three-layer ETL system is composed of an extract layer, a warehouse layer, and the
final publish layer.
The extract layer. Goal: someplace to put and hold the extracted data where we control it.
This layer holds the data we’ve extracted from our source systems, largely un-altered
(filtered down to just what we need, but otherwise un-altered). The data here is
persistent – now that we have the data, WE control it; not in a business sense, but
tactically, for the purposes of our processing.
Fig 3. Sample Extract Layer Data Tables
But why store the data ourselves? Why not just pull it from the source systems when we
need it?
• You may want to run your post-extract process multiple times for a variety of
reasons – testing, diagnosing problems, business rule iterations – and you don’t
want to worry about what goes on in source systems between runs. This is
particularly useful for testing and debugging: for repeated test runs, you need to
know that the input data hasn’t changed.
• Source systems go up and down, and update, on their own schedule, sometimes
unpredictably. You may load data once per week, but you may make a business
rule change on Wednesday and want to see the prior Monday’s data with the
new business rules.
• You don’t want to traverse the network every time you want to use some source
system data.
• Once in the extract layer, the data is static. You never have to wonder
whether an apparent change is due to a change in the source data or due to
something in your process. This is extremely useful in production, but also
during development (including enhancements) and testing.
• Finally, your source system administrators will appreciate you minimizing the
impact on their systems by only accessing them when needed. This is a win-win
situation, and it will ultimately be to your benefit to have happy source system
personnel.
The warehouse layer. Goal: a place that combines all the source data in a form that is
standardized (as much as is practical) and easy for you to work with when building the
publish layer.
This is where you start applying business rules. In particular, you apply as many of your
customers’ business rules as you can, but only those rules that apply to ALL of your
customers (you may have multiple customer sets, each with some different business
rules).
The warehouse layer is also a great place to check and address data quality issues if
needed.
Fig 4. Sample Warehouse Layer Data Tables
The end result is a “warehouse” of data, possibly from different source systems, which is
self-consistent, fairly normalized, and in a format that is convenient for YOU to work with.
This effectively becomes the source system for all of your final data, and serves as the
“one truth” represented by the unified data coming from disparate systems.
The publish layer. Goal: the data is in its final form for user consumption. The data could
be organized in any number of ways: one (or several) star schemas, a set of relational
tables, a set of flat file feeds; whatever is most useful to the end users. Business rules
that are specific to particular user groups are applied only to the data that will be
accessed by those groups. Any given user group may have visibility to some or all of the
data, as appropriate. But the publish layer is the only place any users will typically
have access to any system data.
Fig 5. Sample Publish Layer Data Tables
Fig 6. A Three-Layer ETL System
More Detail
The Extract Layer
The extract layer typically looks like a database with a bunch of data tables, with one or more
tables from each source system. The data in each table generally looks much as it did in the
source system. Except for filtering down to only the rows and columns we actually need, and
maybe joining several source tables into one extracted table (also to reduce data volume), it’s
usually best not to alter the source data much during extract. There are several benefits to
minimizing data alteration during extract:
The primary reason to have a dedicated extract layer is to de-couple your system from the
source systems, so that they can go online and offline, refresh, update, or otherwise change
without affecting your system. By extracting the information you need from the source systems
and storing it persistently in your environment, you essentially establish your own source data
system: your extract layer.
Another important reason to have a dedicated extract layer is that it makes your extracted data
static and persistent – after you extract your data, source systems can go offline, but the data
you extracted from that system is still available to you, and never changes. By making the
extracted data persistent, we do not mean permanently persistent, in which case you’d be
duplicating the function of the source systems. We mean persistent between extracts: we’re
creating a persistent snapshot of source system data, not a mirror of the source data.
Having the extracted data remain static has several significant benefits. Because the data only
changes when you run a new extract (as opposed to when a source system changes), there will
never be any question of whether changes further downstream (in the warehouse or publish
layers) are the result of source changes or ETL system processing. If no extract has been run,
then any changes are the result of ETL system processing. This knowledge is crucial during
testing and development, including after production go-live. Also, an un-changing extract layer
provides a stable dataset to use for performance testing and for iterative testing of downstream
processes.
How often do we refresh the “snapshot”? Exactly as often as suits our needs. If our customers
need monthly data refreshes, we refresh our extract layer monthly. (But as we will discuss
below, because our warehouse and publish layers will be independent of our extract layer, if we
want to refresh the extract layer more frequently, or repeatedly for testing, we can do so without
affecting the other layers).
The processes that refresh the extract layer stand alone from our other processes. This
principle holds regardless of the specific technologies and tools used to extract data: no matter
what tools or techniques you use, make the extract process modular and repeatable. We want
to be able to run and re-run the extract whenever we want, without necessarily affecting
anything else in our system: make it non-destructively repeatable.
Loading information into the extract layer consists of accessing the data in the source
system(s), filtering it down to what you need, and loading it into your environment. The actual
processes and tools you use will depend on the source systems, your own system, and your
preferences and skill sets. But from a design perspective, the key is to make your processes
modular and non-destructively repeatable.
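For illustration, here is a minimal sketch of such an extract pull, assuming a PostgreSQL-style
environment where the source tables are reachable through a linked schema; the schema and
column names are hypothetical:

-- Minimal extract pull: filter at the source to just the rows and columns needed.
INSERT INTO XT_Sales_Lineitem (OrderID, PartNumber, Quantity, Sale_Date)
SELECT OrderID, PartNumber, Quantity, Sale_Date
FROM sales_src.Order_Lineitem                            -- assumed linked source table
WHERE Sale_Date >= CURRENT_DATE - INTERVAL '90 days';    -- only the window we need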
Making your processes non-destructively repeatable means designing each module, including
parent modules, so that it can be run by itself and repeated as many times as desired
without any negative effect on the resulting extract layer data.
A very simple but illuminating example of non-destructively repeatable design involves table
drops. Consider a system with 15 extracted data tables, where each table is dropped and re-
created during every system refresh. A very common design would be to create a process that
drops all 15 tables, and then one or more processes to re-create all the tables. The problem
with this design becomes clear in the case when a subset of the 15 tables needs to be
refreshed, for whatever reason: testing, a partial-load failure, a change to just a few tables. In
that case, in order to refresh one table, it first needs to be dropped. But the normal process
drops all 15 tables: the table drop process is unnecessarily destructive. While the affected
tables can always be dropped manually, that means someone needs to write a script to do that
– perhaps in a production environment, when no system expert is available.
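Here is a sketch of the non-destructive alternative, with each module owning a pre-emptive drop
of its own table (assuming a database that supports DROP TABLE IF EXISTS; names are from the
examples above):

-- XT_Sales_Lineitem_Load module: drops and re-creates only its own table,
-- so it can be run by itself, as many times as needed.
DROP TABLE IF EXISTS XT_Sales_Lineitem;
CREATE TABLE XT_Sales_Lineitem AS
SELECT OrderID, PartNumber, Quantity, Sale_Date          -- illustrative columns
FROM sales_src.Order_Lineitem
WHERE Sale_Date >= CURRENT_DATE - INTERVAL '90 days';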
For processes that load tables, non-destructive repeatability often means that rather than
inserting extracted data into an existing table, you either drop and re-create the entire table, or
you truncate the table before inserting data. This statement implies a preference for full-extract
systems, which extract the entire dataset from the source systems every time, rather than
incremental or “delta-load” systems, which only extract changed or new data. It may seem odd
to prefer a design that extracts the same data over and over again, but the benefit is reliability
and simplicity. Unless your source system provides well-implemented tools to identify
changed data records, extracting only new or changed data means explicitly designing
processes to detect that data, and that means
added code. Similarly, you will need to add processes to ensure that extracted data is not
duplicated via some problem in identifying new data correctly, or a problem of inserting new
data more than once. These processes add additional code and introduce failure points that
have potentially severe consequences. Therefore we recommend full extracts whenever
possible, and incremental loads only when truly necessary or when well-established tools
are already available to facilitate reliable incremental loads.
This is not to say that custom incremental loading schemes are never desirable. In cases
where data volumes are too large for full extracts and no incremental load tools are at hand, you
will need to design your own. In this case, you still must design your extract processes to
be non-destructively repeatable. In modern database systems, this generally can be
achieved by using explicit transaction control (commit/rollback). Other solutions involve
developing ways to precisely identify the data records you are extracting in each run, and
adding a step to delete any such data from your extract-layer tables (it may be there from a
previous, incomplete extract) and then insert it again. Identification and management of
incremental data is never trivial, so again, we recommend avoiding incremental loads when
practical.
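As a hedged sketch of such a custom incremental load, assuming each extracted row carries a
business date that identifies its extraction window:

-- Repeatable incremental extract: delete anything from this run's window first
-- (it may be there from a previous, incomplete extract), then insert it again.
BEGIN;
DELETE FROM XT_Sales_Lineitem
WHERE Sale_Date >= DATE '2024-06-01';                    -- this run's window (assumed)
INSERT INTO XT_Sales_Lineitem (OrderID, PartNumber, Quantity, Sale_Date)
SELECT OrderID, PartNumber, Quantity, Sale_Date
FROM sales_src.Order_Lineitem
WHERE Sale_Date >= DATE '2024-06-01';
COMMIT;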
In any case, the key is to explicitly design your load processes to be repeatable. Any design
that requires a load to be run once and only once (per day, or month, or whatever) will be
problematic in the long term. Whether you expect it or not, it is almost certain that your load
processes will be repeated within an extract cycle for one reason or another, and you don’t want
that to be a problem in and of itself.
Finally, it is important that the extract load processes be distinct from warehouse and publish
load processes. Without a doubt there will be many times you will want to run the extract
process without affecting the warehouse or publish layers, for testing or in the case of
anomalies.
The Warehouse Layer
The term “data warehouse” is used frequently, and like the similar term “datamart” it can have
several meanings. A data warehouse can be a “star-schema” de-normalized dimensional
database; it can be a set of “datamarts”, each of which can be a star-schema system or a
normalized system; or it can be a single normalized database. This last case is what we’re
talking about. What makes our database a “warehouse” is that it combines data from disparate
systems into a single unified database.
By “unified”, we simply mean that no piece of information in the warehouse layer contradicts
another piece of information, and also that no piece of information in the warehouse layer is
duplicated elsewhere in the warehouse layer.
It might seem obvious that no self-contradictory information would exist in the warehouse layer,
but such may not be the case in the extract layer. This is because the extract layer potentially
contains data from several source systems that may not “talk to each other”. For example, the
inventory system and the sales system both record information about products; however, being
two different systems, they may use different identifiers for the same product. In Figure 3, the
order detail table from the Oracle sales system, XT_Sales_Lineitem, identifies a product by its
PartNumber.
On the other hand, the inventory system from vendor XYZ, in table XT_Inventory_Product,
identifies a product by its ProductID. Because both systems serve the same organization,
they’re both talking about the same product. But because they come from different vendors,
they may use different keys with different formats. While our end users don’t care about
identifier formats, which after all will probably not show up in any UI, we do care, and we will
want to use only one or the other in our system. We make that unifying choice in our
warehouse layer, where we simply create the new identifier ItemID, which represents a product
whether the context is sales, inventory, or anything else.
If, as in this hypothetical case, the same product has two identifiers in the source systems, a
mapping of these two identifiers will already exist somewhere, or there will be a means of
readily determining it. The organization would have encountered this potential problem long
ago and will have dealt with it as a necessity, probably by establishing a PartNumber-to-
ProductID mapping, or defining a de-facto mapping by enforcing a rule like uniqueness of
product name, which readily leads to a mapping of products in one system to the same products
in another system. We simply exploit that mapping in our warehouse layer, using it to determine
a unified ItemID for each product.
The mapping of PartNumber or ProductID to ItemID is represented in the ProductMaster table in
the warehouse layer. We construct this table using the known relationship between inventory
ProductID’s and sales PartNumbers, and use it to determine a unique ItemID for every product.
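One possible shape for this table, with the table and key names taken from the figures above and
the key-generation details assumed:

-- ProductMaster: one row per real-world product, unifying the two source keys.
CREATE TABLE ProductMaster (
  ItemID      INTEGER PRIMARY KEY,   -- our unified identifier
  PartNumber  VARCHAR(30),           -- key used by the sales system
  ProductID   VARCHAR(30)            -- key used by the inventory system
);
-- Populate from the organization's existing mapping (assumed here to be a
-- table called Part_Product_Map).
INSERT INTO ProductMaster (ItemID, PartNumber, ProductID)
SELECT ROW_NUMBER() OVER (ORDER BY PartNumber), PartNumber, ProductID
FROM Part_Product_Map;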
Occasionally, source systems will contain duplicate information. For example, both the sales
system and the CRM system have the concept of an “account”, so they may both have master
account lists. If so, these master account lists probably list many, or most, of the same
accounts. We may need to extract both lists from the source systems, but in our warehouse
layer we want only one. If, as is usually the case, there are discrepancies between the two
account lists in two source systems – perhaps the same account’s address differs from one
system to the next - we will need to reconcile these differences. This is not to say that we are in
the business of data quality; however, if we are asked to produce data and reports containing
accounts, we have to have a single definition of what each account is. The problem of HOW we
determine which of two pieces of duplicated information is “correct” is to be solved by the source
system owners. Our task is simply to implement the “correct” choice. The end result is that, in
our warehouse layer, we have only one definition for each data element, and nothing is
duplicated. In data modeling terms, this (among other things) means our warehouse layer is
somewhere close to Third Normal Form.
The construction of this unified self-consistent data layer certainly seems to be a worthy goal. It
represents our “one version of the truth” for the information in the organization. Nothing is
duplicated, nothing is contradictory or ambiguous. But surely this will be the case in our publish
layer as well; it will contain only one version of the truth. So why do we need a separate
warehouse layer, or put differently, why can’t our warehouse layer serve as the publish layer?
The answer is that a distinct warehouse layer gives us an ideal environment in which to
apply the bulk of our business rules. As we will see, for a variety of reasons we may
duplicate substantial amounts of data and data structures in our publish layer; the publish layer
may be somewhat “messy”. That’s fine, and is driven by our end users’ needs. But we don’t
want to duplicate the implementation of business rules. We want, wherever possible, to code
each business rule only once. And while the publish layer may be intentionally “messy”, with
overlapping datasets and strange formatting, we will want to have a clean, clear unified dataset
somewhere, and we will: the warehouse layer.
Because the data in the warehouse layer is self-consistent and organized compared to the
extract layer, and compact and normalized compared to the publish layer, the warehouse layer
is a great place to identify and address data quality issues. For example, if customer names
contain leading spaces, stripping them out here is easier than in the extract or publish layers
because in both of those cases the same customer name field may exist in multiple tables. In
the extract layer it may even exist in several different datatypes. But the biggest reason to apply
data quality fixes in the warehouse layer is so that you don't have to repeatedly apply the same
fix multiple times in the publish layer, every time the field shows up on a published table, report,
or UI.
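For instance, a leading-space fix applied once, in the warehouse load (the table and column
names are assumed):

-- One data-quality fix, in one place: every downstream publish object inherits it.
UPDATE Account_Master
SET Customer_Name = TRIM(Customer_Name)
WHERE Customer_Name <> TRIM(Customer_Name);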
Business Rules in the Warehouse Layer
The warehouse layer is where we apply most of our business rules. This is primarily for two
reasons. First, the data in the warehouse layer is easy to work with: it is fairly normalized, it
has consistent identifying keys, there is no duplicated data, and there is no data inconsistency.
Second, when a business rule is applied in the warehouse layer, its effects will cascade
downstream into the publish layer, where the affected data may be used in several places, by
simply coding the rule one time in the warehouse layer.
It is also true that business rule cascading would occur if we implemented business rules in the
extract layer. But for the reasons given in the extract layer notes above, that is usually not the
best approach.
In the warehouse layer, we apply all of the business rules that apply to ALL of our various end
user groups. For example, all of our user groups may agree that Central America is treated as
a single geography, and sales made in Honduras are simply added to sales made in Nicaragua
and the rest of Central America. Given this fact, we would probably want to go ahead and apply
this rule to our warehouse data, adding all the sales up and creating a “master” geography of
Central America in our sales tables, grouping all our Honduras and Nicaragua accounts into a
set of Central America accounts in our CRM tables, etc., even though that data is broken out
more granularly in the source sales and CRM systems.
But wait, you say – if you do that, you lose visibility to the more granular data available in
the source systems: why throw that away? The answer is: so that we don’t have to code that
Central-America-Grouping rule repeatedly in our publish layer, once for every reporting object
that refers to geography. We may have three sales reporting tables and four CRM reporting
tables in the publish layer, each of which refers to geography. In that case, if we didn’t
implement this rule in the warehouse layer, we’d need to implement it seven times in the publish
layer.
But wait, you say – what if they change the rule? Not a big problem: because we minimize the
alteration of data in the extract layer, we still have the more granular data if we need it. And
because we coded the Central-America-Grouping rule once, in the warehouse layer, if it
changes we only have to change it once (as opposed to changing it seven times in the publish
layer).
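Here is a sketch of coding such a rule exactly once, via a grouping table in the warehouse layer
(all names here are illustrative):

-- The Central-America-Grouping rule lives in one mapping table...
CREATE TABLE Geo_Master (
  Source_Geo  VARCHAR(40) PRIMARY KEY,
  Master_Geo  VARCHAR(40)
);
INSERT INTO Geo_Master VALUES
  ('Honduras',  'Central America'),
  ('Nicaragua', 'Central America');
-- ...and is applied once, while loading the warehouse sales table.
INSERT INTO Sales_By_Geo (Master_Geo, Sales_Amt)
SELECT g.Master_Geo, SUM(x.Sales_Amt)
FROM XT_Sales_Lineitem x
JOIN Geo_Master g ON g.Source_Geo = x.Geo
GROUP BY g.Master_Geo;

If the rule changes, only Geo_Master changes; the publish tables are untouched.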
The warehouse layer embodies your “cleanest”, most tightly-organized database, with all of your
major business rules applied. You will want to be very careful with the format of the data here
and use your best database-design and modeling skills. Every step you take to create clean,
consistent data in the warehouse layer will pay off later by making everything you do with this
data easier: implementing and altering business rules, adding new complementary datasets,
adding new functionality. But more importantly, when you have built a clean, logical, and clear
warehouse layer, it becomes an ideal data source for the publish layer, and the publish layer is
what your customers are paying you to build, and where they’ll derive their business value.
The load process for the warehouse layer is similar to that for the extract layer except that the
data source is now the extract layer, and also you will be implementing business rules in the
warehouse layer. That will surely mean that the underlying code will be more complex, and you
may have extra processes beyond the one-per-target table design in the extract layer, for
example processes that update warehouse data based on calculations that can’t be done until
the entire warehouse base dataset is loaded.
Like the extract layer, it is critical that the warehouse load processes be modular, repeatable,
and distinct from the publish layer load processes.
The Publish Layer
The publish layer is the final product, the goal of the entire endeavor. This layer contains all the
data your customers want to see, arranged the way they want to see it, and it’s really the only
thing your customers care about (and what they’re paying for). We call it the publish layer
because typically it is the only layer that is visible to your customers.
The publish layer will often consist of a number of data tables, each formatted specifically to suit
one or several customers or reporting processes. It could also include or consist of a set of
exported flat files, or any other kind of data-driven objects. The format doesn’t really matter,
and it is precisely because we’ve built the warehouse layer the way we did that we can readily
supply data in whatever format our customers need.
Like the extract layer, the data in the publish layer may be “un-tidy” in some ways; it may be
highly de-normalized or embody redundancies if several groups want similar but not identical
datasets. Again, this would be because particular customers need the data to be a particular
way. The publish layer could even contain data inconsistencies, as long as they are the
intentional result of customer requirements. For instance, customer group A might have one
name for a certain type of CRM lead, while customer group B might call it something else, or
may even ignore that type of lead. Both situations are completely legitimate, and need to be
supported equally. (The organization may want to consider reaching a compromise on an issue
like this, but beyond perhaps pointing it out, that is not our concern).
The publish layer load process is very much like that of the warehouse layer, with the
warehouse layer now being the data source. The formatting of the data will be different,
because while the warehouse layer data is highly normalized, the publish layer data will likely
not be.
Regardless, key design factors remain the same: modularity, repeatability, and independence
from the other layers.
Key Qualities of the Completed System
• The system consists of three logical layers: extract layer, warehouse layer, and
publish layer, with the publish layer being the layer that is exposed to the users,
the consumers of the data.
• Data flowing into the extract layer is largely unchanged compared to the source
systems, aside from filtering and grouping/joining.
• Extracted data is persistent between system refreshes.
• The extract load, warehouse load, and publish layer loads each can run
independently of the others.
• Individual load processes don't take hours and hours to complete. If they do,
break them into multiple processes.
• Load processes are non-destructive and repeatable – a given process can be
repeated as often as desired without causing data quality problems (duplication
etc.). This way, if an error occurs, or if you’re not sure for whatever reason that a
process completed (or that it started), just run that single process again.
• Drop and re-create target tables whenever possible. Use “delta-load”
designs only if truly necessary or if tools already exist to facilitate it
without adding custom code.
• Delete datasets before insert processes. If a module inserts North
America sales data, delete all the existing North America sales data
automatically before the insert, as part of the same module.
• Any time you create an object during processing, code a pre-emptive
drop of that object into the same module that creates the object (rather
than having a separate script that drops all objects, for instance).
• Unreliable source systems are separated from reliable source systems and
treated as non-essential. (This is discussed further in the roadmap section below.)
• The warehouse layer is the “cleanest” most self-consistent data you can create.
Use your best data modeling and normalization here.
• Business rules that apply globally are applied in the warehouse layer. This
avoids duplication of business-rule code and maximizes data consistency.
• Business rules that apply to only some users are applied in the publish layer.
• Each business rule is coded in only one place. Implement rules once and only
once – in a single module that is called by other modules, or in a view or table
that is accessed by other tables.
• The publish layer is formatted for end-customer usability, not data-modeling
ideals. Data may be de-normalized and repeated, but this is not a big problem
because the complex relationships – the business rules – are implemented in the
warehouse layer, so repeating data in the publish layer, where needed for user
convenience, is not difficult.
Roadmap to Building the System
In the preceding sections, we have described our ETL system. Here we describe how to
actually build such a system, step by step.
• Make sure you have a good specification of the end result your customer is
looking for, whether it is an online database for ad-hoc queries, a set of online
reports, a set of flat files, or a dump to disk or tape.
• Determine what data elements you need. This information comes from your
customer, who may be able to essentially list all the data elements and tell you
where they come from (which source system(s)), or may instead show you a set
of reports or other end-user interfaces.
• Determine where the data comes from, i.e. which source systems do you need to
access.
• Determine a contact for each source system, a person who represents that
system’s operational team. Gain online access to each source system.
• Learn some key things about each system: what is its uptime/downtime
schedule, what is its refresh schedule. In particular, find out how robust and
reliable is it by asking some key users (some of your customers will probably be
key users). For instance, systems based on flat-file uploads are notoriously
unreliable both in terms of downtime and particularly data quality. The reliability
of each source system will help you make important decisions about how you
design your extract processes.
• Learn how the data is structured in your source systems. Run some SQL
queries or otherwise access the source data and find the inherent data
relationships and business rules. You can do this using tests that are based on
the actual data you will be extracting, so that you will essentially be prototyping
some extract code while you investigate.
• Learn about data volume in the source systems. If you’ll be extracting 12 months
worth of sales data, is that 100MB of data or 100TB?
• Understand how the data volume is expected to change. It may be unwise to
design an extract process that works well for 1GB but crumbles at 100GB if
100GB is expected a year from now.
At this point, you are actively working on the project. It is worth remembering that while
you are working on extract and warehouse layers, the only thing your customers think
they care about is the publish layer. They really care about the extract and warehouse
layers too, because without those the publish layer won’t be very flexible or reliable, but
that won’t be clear to them unless they have personal ETL experience. Describe the
publish layer, in business terms, to your customers. Describe the extract process and
the application of business rules, but do not emphasize the extract or warehouse layers
themselves because they will not be visible to the customer and the customer should not
have to worry about them (and probably won’t see the need to pay for them).
Determine hardware: servers, network, etc. (if it has not been determined for you) and
where you will put things. Remember that this ETL system design is a logical design,
not a physical design, and layers can co-exist on servers or in database instances if
desired. You can put all three layers on a single server without any inherent problem.
However, you may likely choose not to do so for performance or security reasons.
• Use a modular design, with one load process per target table; this
makes for easy development and testing, scales well as the system grows, and is
self-documenting.
• Build in custom logging. The tools and O/S you are using will doubtless have
some sort of logging built in, but we have found that writing a simple procedure
that inserts a record into a central log table, and then calling that procedure as
needed from each module is easy and invaluable. This procedure can be called
at the beginning and end of every module, and should include automatically
adding a timestamp to each log entry. With this simple step, we will now have a
central, persistent log in which we can see every step that is initiated and when it
started and finished (if it finished – just as importantly, we can see that a step
never finished). There is no reason that this log cannot be made available to
processes system-wide – it does not have to be particular to the extract layer.
Over time the value of this log only increases – for instance, by building a SQL
query against the log table, you can see how an individual process step’s
completion time has changed over the past year as source data volume changes.
At a minimum, your log table should include the following:
• Event (text): a free-text column.
• Event_ts: date/time column, automatically populated with current
date/time.
• Several generic text columns, like “note1”, “note2”, “note3”. These can
contain different information depending on the context under which an
event was logged.
• Several generic numeric columns.
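A minimal sketch of such a table and its helper procedure, using PL/pgSQL as one concrete
dialect (all names are assumed):

-- Central log table: the columns listed above, nothing fancy.
CREATE TABLE ETL_Log (
  Event     VARCHAR(200),                        -- free-text event description
  Event_ts  TIMESTAMP DEFAULT CURRENT_TIMESTAMP, -- auto-populated date/time
  Note1     VARCHAR(200),
  Note2     VARCHAR(200),
  Note3     VARCHAR(200),
  Num1      NUMERIC,
  Num2      NUMERIC
);
-- Helper called at the start and end of every module.
CREATE OR REPLACE PROCEDURE Log_Event(p_event VARCHAR, p_note1 VARCHAR DEFAULT NULL)
LANGUAGE plpgsql
AS $$
BEGIN
  INSERT INTO ETL_Log (Event, Note1) VALUES (p_event, p_note1);
END;
$$;
-- Usage at module boundaries:
-- CALL Log_Event('XT_Sales_Lineitem_Load: start');
-- CALL Log_Event('XT_Sales_Lineitem_Load: end');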
• As you increase the amount of data you are extracting, start some performance
testing to identify where your performance bottlenecks are likely to be. Gradually
open up your filtering until you have full-data-volume test runs. It is very
important to avoid very-long-running single processes, as they are difficult to
manage and are prime sources of system unreliability. Whenever you have a
single process that runs for more than an hour or two, strongly consider breaking
it into several smaller processes. For example, if, as a simple result of data
volume, it takes five hours to run XT_Sales_LineItem_Load, break this process
down into several sub-processes, with each one loading a subset of the data.
You can divide data by time, geography, product category, groups of orders,
whatever. Even if in the end it still takes five hours to load all the Sales LineItem
data, you will be much better off, and your system much more flexible and
reliable, if this really happens in five one-hour steps.
• Segregate unreliable source systems. As you start to build your extract process,
design it so that unreliable source systems will not cause your entire extract to
fail if they have an outage. You are already building a modular extract process –
make it even more robust by “fencing off” any process that depends on an
unreliable source system. There are several ways to do this:
i. Make each extract step that depends on an unreliable system a “non-
essential” step by enclosing it in error-handling routines that are set to
allow the overall process to continue if the step fails.
ii. Run all such steps as a completely separate process.
The approach to take with unreliable source systems is to ASSUME that they
WILL fail from time to time, and design accordingly.
It may seem that there is no such thing as a non-essential step in a data extract.
In the larger sense this is probably true, or else you wouldn’t be attempting to
extract that data in the first place. But it often is the case that some data is more
essential than other data, particularly when refresh frequency is considered. For
instance, while your customer may need that returned-product list, he may be
able to get by with that list being a week old, while that probably won’t be true
with the orders-in-process list. In that case, you won’t want to stop your daily
refresh simply because the returned-product flat-file download failed. In addition,
you probably will find that the more critical a set of data is, the more reliable its
supporting source system will be, and vice versa. Don’t let obsolete systems that
contain secondary data bring down your entire system. Remember too that
because everything in the extract layer is persistent between refreshes, when an
extract fails, the affected data will still be available from the prior extract. So for
example in the case of the failed or skipped returned-product list, the week-old
returned-product list data is still in the extract layer, so you can continue to use
that until a more recent list is extracted, while all your other data is up-to-date in
the meantime.
Build the warehouse layer. To some extent, this can be done in parallel with building the
extract layer: as soon as you have enough extract data to build any of your warehouse
tables, you can do so.
Remember that even the completed system will doubtless continue to evolve and
expand, so the concept of a “finished” layer doesn’t really apply anyway.
a. Apply business rules. As discussed above, the data in the warehouse layer will
reflect the majority of the business rules you will apply. You will generally apply
these rules via some kind of SQL or other DML code that runs as you load your
warehouse data objects.
i. When a business rule results in changes to source data, consider also
keeping the original data in another field. For example, if a business rule
states that orders with negative product quantities (representing returns)
have the Quantity field set to 0, consider keeping the original quantity
also, in a field called Orig_Quantity, even if the end users say they will
never want to see this value. This can be very useful when apparent data
anomalies occur, when debugging, or for ad-hoc analysis in production.
You can choose not to expose this information in the publish layer if
desired. (A sketch of this pattern follows this list.)
b. As with the extract layer, use a modular design.
c. Use your highest level of care and sound data modeling practices in designing
the warehouse layer. The warehouse layer is ultimately the heart of what you
are producing, even though it will not be directly visible to your customers.
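As a sketch of the keep-the-original-value pattern from item i above (the target table and the
join are assumed):

-- Returns rule: negative quantities floor to zero, but the original survives.
INSERT INTO Sales_Order_Detail (OrderID, ItemID, Quantity, Orig_Quantity)
SELECT x.OrderID, pm.ItemID,
       CASE WHEN x.Quantity < 0 THEN 0 ELSE x.Quantity END,  -- business rule applied
       x.Quantity                                            -- untouched original
FROM XT_Sales_Lineitem x
JOIN ProductMaster pm ON pm.PartNumber = x.PartNumber;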
Build the publish layer. There is no need to have a detailed strategic plan devoted to
building this layer. Instead, build it directly according to customer needs. There can
be little doubt that it will
change greatly over time, even if the underlying warehouse and source data doesn’t
change a lot. This is because customers are constantly finding new ways to use
information. In fact, any system that exposes new information invariably ends up
generating a new set of customer requests as a direct result.
a. Saying that there is no need for a grand strategic plan for the publish layer is not
to say that it should be designed and built without care. But rather than spending
a lot of up-front analysis time and effort in, for instance, minimizing the storage
space required by the publish layer, we recommend applying your effort towards
making the publish layer flexible and easy to manage:
i. Avoid code duplication. For example, it will often be the case that several
customers will need similar, but not identical, data tables. In that
situation, rather than building two similar tables from scratch, you can
often build one table and then build the second table as a select-and-
insert from the first, or as a view (if performance is sufficient).
ii. Use user-friendly object and field names. Make your dataset self-
documenting.
iii. Anticipate performance issues and design accordingly. For instance,
build indexes on key data fields, even though users may not know to ask
for this.
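To illustrate item i above, here is a second, similar publish object built from the first rather
than from scratch (names are illustrative):

-- Group B wants the same sales data as group A, minus some rows and columns.
-- Built as a view over the first table, so no load code is duplicated.
CREATE VIEW REP_Sales_GroupB AS
SELECT OrderID, ItemID, Quantity, Sales_Amt
FROM REP_Sales_GroupA
WHERE Region = 'North America';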
Best Practices
• Use transparent and consistent naming conventions. For example, if you have a
procedure that loads a data table, call it <table_name>_load.
• Write your own simple logging – this is as simple as a procedure that inserts a row into a
log table (with timestamp), which is called by each of your other procedures.
• Don’t sacrifice code readability, maintainability, or simplicity for performance unless you
really need to. If a process runs in 2 minutes and runs only once or twice, the fact that it
could run twice as fast is often not relevant.
• Don’t duplicate code. If you find yourself cutting and pasting code from one module to
another, you probably should do a small redesign and add a grouping table, a work
table, or write a procedure to encapsulate the duplicated code. For example, if you
always group 10 products into a super-group, implement that as a grouping table you
can join to, rather than doing the grouping explicitly in SQL whenever those products are
invoked.
• Don’t duplicate code. I just wanted to say that again. The real reason not to duplicate
code is this: when your user changes his mind about a rule, you’ll need to change your
code. You don’t want to make the same change in 10 places. With rare exceptions, if
you have to duplicate code, you’ve designed something wrong.
Putting it all Together
Let’s discuss some scenarios that demonstrate how the features of this ETL design can be
exploited for greatly enhanced productivity.
Consider a system used for reporting monthly sales progress. Data will be extracted from
source systems, processed, and presented to users once per month.
Despite this schedule, there will undoubtedly be many reasons that you will want to run
processing during the month, between monthly refreshes. For instance, your users may want to
apply new or changed business rules for next month. You will need to implement and test these
rules in the warehouse layer.
Because the warehouse layer load is independent from the publish layer load, you can code the
new rules in the warehouse load code and re-load the warehouse, now reflecting the new rules,
without affecting the publish layer or your users. Because the warehouse load is repeatable
and non-destructive, you can repeat the warehouse load process as often as you like for testing
and debugging. And because the extract layer data is static and persistent, you don’t need to
worry about source data changing between repeated warehouse loads – you are not running
the source extract. This is a great benefit during testing. You can even roll back the changed
warehouse load code and re-load the warehouse with production, un-changed data whenever
you want.
Similarly, because you can run the source extract without affecting the warehouse or publish
layers, you can change your extract code and run the extract repeatedly for performance
testing. In this case, unless you create a backup of the original extracted data, you will not be
able to restore the extract layer to its pre-changed format, but this doesn’t matter to your users
because the extract layer can be changed (or dropped) without affecting the publish layer.
Suppose one, but not all, of your source systems refreshes mid-month, providing some data
your users want to see before next month’s refresh. Because your extract is modular and each
step is non-destructively repeatable, you can extract new data from only that particular source
system without running the entire extract process. Also, because your extracted data is static
and persistent, you can then run the entire warehouse load process (or part of it – it also is
modular) and then the entire publish load process, to refresh the publish layer with the new
data. In this case, even though you ran the complete warehouse and publish layer load
processes, the only change will result from the single source system you re-extracted. Finally,
because all of your load processes are non-destructively repeatable, this mid-month refresh
won’t cause any data problems in any of the un-changed data.
Building one xt_ table from another – the rest of the process is untouched
Suppose your system reports on sales data by week, by product. You have one
XT_Sales_Product table in the extract layer with product sales aggregated by week, one
corresponding SALES_ table in the warehouse layer, with a set of complex business rules, and
several REP_ tables in the publish layer for the same data, organized in different ways. A
customer now needs data for several products to be expanded from weekly to daily, and also to
product category, which you were not previously extracting. Finally, the customer wants to
apply some business rules that apply to these products only. The customer wants several new
REP_ tables representing this new daily-product-category information with the new rules. All
your existing weekly REP_ tables need to be maintained unchanged.
Given that this change requires making changes all the way “upstream” to the extract layer, and
because of the existing complex business rules involved in organizing the sales data (in the
warehouse layer), this might seem a very disruptive change. However, design features of the
system can be exploited.
Because you will be extracting new source data (daily data and product category), you need to
do something in the extract layer. You could simply create a new extract process for this data.
But noting that it would be exactly the same as the existing sales by product extract, but with
lower granularity and a new attribute, you realize you’d be duplicating some code and extracting
the same data twice.
Running some performance tests in the extract layer (which won’t affect other layers), you verify
that you can afford to extract ALL product data by category and at a daily level within your
extract time window. (It may seem to be overkill to extract this data for all products. But a) fast-
enough is good enough, even if it’s not as fast as possible, and b) the customer may well later
want to add to his product-list request – something that will not surprise anyone - so you might
need the data after all).
So you change the extract to pull daily, by-category sales data into a new table,
XT_Sales_Product_Daily, and build the existing weekly XT_Sales_Product table from it with
a simple summing step. The existing warehouse load processes therefore do not
have to change, and subsequent publish layer load processes also are
unchanged. The only difference is that there is a new table in the extract layer,
XT_Sales_Product_Daily. (You can even stop here and postpone actually using
the new daily data until a later date.) Note that despite the existence of a new XT_
table that is very similar to another XT_ table, you have no duplicated extract
code: the second table is created with a simple summing of the first, rather
than by its own extract, as sketched below.
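A sketch of that derivation step (assuming a PostgreSQL-style DATE_TRUNC; column names are
illustrative):

-- XT_Sales_Product is no longer extracted; it is summed from the daily table.
TRUNCATE TABLE XT_Sales_Product;
INSERT INTO XT_Sales_Product (PartNumber, Week_Start, Sales_Amt)
SELECT PartNumber, DATE_TRUNC('week', Sale_Date), SUM(Sales_Amt)
FROM XT_Sales_Product_Daily
GROUP BY PartNumber, DATE_TRUNC('week', Sale_Date);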
Now you can continue building new processes to supply the requested new
REP_ tables. The first step would probably be to build a warehouse table
containing daily by-category sales data, only for the specific products required.
The source for that warehouse table would be XT_Sales_Product_Daily, filtered
by product, and with the new business rules, which apply only to those products,
implemented. The REP_ tables will follow.
The key point here is that you significantly altered the source extract of sales data, but
you could make this change entirely in the extract layer and leave everything else
untouched. Because of this, it would be entirely possible to make this change in the
production system, while it is live: upon completion of the last step of building the new
publish layer objects, those objects would simply appear in the publish layer, with everything
else un-touched. You probably would not choose to do this in a live production system, but
the fact that it could be done points out how flexible this three-layer modular design is.
This particular solution, almost identical to one we actually implemented recently in a live
system, relied on being able to extract daily by-category data for all products. If that were not
the case, the solution would be a little different, but equally practical:
Imagine that you have to code a process that a) determines the list of a specific subset of the
multiple ProductID's that attach to an OrderID and b) populates a single field representing a
concatenated list of those ProductID's. This ProductList field will then have a one-to-one
relationship with OrderID and will be a single field on each Order record. Given the one-to-
many relationship between an OrderID and its ProductID's, determining the list of products is
not difficult, nor is concatenating the fields. However, considering that the number of products
on an order is unknown and that there may be business rules that determine which products are
to be included in the list, the resulting code to determine the list could be more than a few lines,
and may include nested IF's, CASE statements, or other control logic. Finally, if the list
determination code occurs as part of a SELECT list in a SQL statement, the resulting SQL
statement may be somewhat complicated. Of course, this SQL may already have other
complicated elements unrelated to product lists. If, as is often the case, the business rules for
the list determination change over time, the result could be that changing this code becomes a
fairly involved process which also requires regression testing to identify unintended side effects.
One way to simplify the SQL in this case would be to use an inline function in the SELECT list:
SELECT OrderID
, FN_Product_List_Get(OrderID)
,…
FROM Orders …
Now the SQL statement is compact and clean, and changes to the product list rules can be
implemented in the function, without anyone touching the calling code.
Going further, since the number of products in each list is unknown, the concatenation code
must be done in a flexible way. A clean, easy to code and maintain solution might be to use a
cursor loop to successively select each ProductID and concatenate it to the previous ones.
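A hedged sketch of such a function in PL/pgSQL, one possible dialect; the detail table name and
the embedded inclusion rule are assumptions:

-- FN_Product_List_Get: cursor-style loop concatenating an order's ProductID's.
-- Slow by design, but trivially easy to change when the list rules change.
CREATE OR REPLACE FUNCTION FN_Product_List_Get(p_order_id INTEGER)
RETURNS TEXT
LANGUAGE plpgsql
AS $$
DECLARE
  v_list TEXT := '';
  v_row  RECORD;
BEGIN
  FOR v_row IN
    SELECT ProductID
    FROM Order_Detail                 -- assumed detail table
    WHERE OrderID = p_order_id
      AND Quantity > 0                -- stand-in for the real inclusion rules
    ORDER BY ProductID
  LOOP
    IF v_list <> '' THEN
      v_list := v_list || ',';
    END IF;
    v_list := v_list || v_row.ProductID;
  END LOOP;
  RETURN v_list;
END;
$$;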
So now we have an inline function that employs a cursor. The problem? Both inline functions
and cursors are notoriously slow in many environments, and the combination of both is likely to
exhibit very poor performance. Many developers would recognize this and would not employ
this solution.
But would this actually be a poor design? Not necessarily. While it is true that a cursor called
by an inline function will almost surely return data much slower than other solutions, the added
calculation time may not matter much. What if the data volume for this particular subset of the
data is small and will remain small? If the cursor/function solution completes in 2 minutes, what
does it matter that another solution would complete in 15 seconds if it requires a day of testing
(not to mention the presence of a developer who is familiar with the code, who may have left the
project 3 months ago) every time a small change in list rules is required? What if this process
only runs once per quarter, or it runs every night but you have a 4-hour processing window and
your entire ETL process only takes 30 minutes to run? In these cases, that extra 2 minutes of
processing time just doesn’t matter, and is well worth the benefit of having code that is easy to
maintain and change.
Another example of balancing performance against maintainability is the use of physical work
tables rather than temporary work tables ("temp tables") that exist only in memory. (By "work
tables", we mean data tables that are populated during processing to make subsequent
processing easier, but that are not part of the final dataset that is being created). Temp tables
are often much faster to access than physical tables. However, they only exist during execution
of the process that creates them. Physical tables persist until they are dropped upon the next
ETL run. This can be extremely useful during development and also in production when
anomalies occur: depending on how they are used, work tables can contain a wealth of
information about how data looked during different phases of processing. This information can
be extraordinarily helpful when one is trying to trace data changes through the processing path,
and it all is lost if it exists in temp tables that are deleted once processing is complete.
One strategy is to use physical work tables during development, adding a block of code which
drops all the work tables before the code is transitioned to production, if the work tables require
noticeable storage space. This block of code can easily be commented in and out depending
on the situation, or the option to keep or drop work tables can be a parameter in production.
Alternatively, when development and unit testing is nearing completion, physical work tables
can easily be converted to temp tables in the code if a significant performance gain will be
realized.
As above, this sort of strategy only makes sense if a resulting performance loss (if any) is small
in the context of your overall needs. But if it is, the benefit of having physical work tables
available, even in production, can be great, so it is not always wise to simply use temp tables all
the time for performance and to save disk space.
This sort of compromise is a tough call; performance is extremely important, and we do not
advocate anything like a "build now and tune later" approach. However, it does pay to
remember that the fastest code is not always the best code. Keep this in mind and design a
system that enhances both performance and maintainability.
Conclusion
In this discussion, we examined what the most important features of an ETL system are and
how to design one that possesses those features. We have seen that by methodically building
a three-layer system and using modular, non-destructively repeatable load processes, we can
implement a system that is reliable, flexible, and manageable, and therefore cost effective.
Finally, we noted some real-world examples of ETL challenges and demonstrated how the
features of our preferred design can be exploited to overcome these challenges with a minimum
of disruption.
We know that this methodology is effective, because it has been developed over time in the
course of many implementations in diverse customer environments. It can be effective in your
environment as well.