ETL Staging Area
ETL Staging Area
ETL Staging Area
Staging area is an intermediate storage area between the sources of information and the data warehouse (DW)
or Data mart. It is a back room area of data warehouse.
The staging area exists to be a separate “back room “or “engine room” of the warehouse from where the data can
be transformed, corrected and prepared for the warehouse. It should only be accessible to the data stewards
ETL developers working on the data, or administrators monitoring or managing the ETL processes.
Several source systems which provide data.This can include databases or files or spreadsheets
A single “staging area” which may use one or more database schemas or file system (depending upon
warehouse load volumes).
One or more “visible” data marts or a single “warehouse presentation area” where data is made visible to end
user queries. This is what is called as a Data warehouse .If a data warehouse has a single subject area
(schema) then data warehouse & data marts are synonyms. But if data warehouse has multiple subject area
(schema) then data mart for each schema can be created depending on the business requirements.
Staging Area
The “staging area” is the middle bit.
Staging area is a database with the sources have originated from different databases for example db2, oracle,
Teradata etc all those can be standardize in staging area. Also joining of the tables can be done there. If there is
only one source system then depending on the quality of data & transformations required staging area can be
used or eliminated. Reporting always occurs on the Warehouse
2. Key Features
One or more database schema(s) or file system used to “stage” data extracted from the source OLTP
systems prior to being loaded to the “warehouse” where it is visible to end users.
Data in the staging area is not visible to end users for queries, reports or analysis of any kind. It does not hold
aggregated data ready for querying.
The loadtimestamp column in staging enables to know the status of record i.e. updated, new or deleted. It
may even include a flag reporting this status.
In most cases its size is larger than the “presentation area” itself or else equal to DW.
Although the “stage” data – e.g. Last sequence loaded may be backed up, much of the staging area data is
automatically replaced in case of full refresh i.e. truncate & reload from source during the ETL
load processes, and hence backup effort can be avoided. The presentation area however, may need backup
in many cases.
It may include some metadata, which may be used by analysts or operators monitoring the state of the
previous loads (e.g. audit information, summary totals of rows loaded etc).
The Error log file holds details of “rejected” entries – data which has failed quality tests, and may need
correction and re-submission to the ETL process.
It’s likely to have few indexes (compared to the “presentation area”), and hold data in a quite normalized
form. The presentation area (DW) is by comparison likely to be more highly indexed (mainly bitmap indexes),
with highly denormalized tables (the Dimension tables anyway).