04-1 DWH Data Warehouse - ETL
R. Marti
Data Warehousing
[Figure: overall DWH architecture with Source Databases, a Landing Area, a Staging Area, the Data Warehouse, Data Marts, Master Data, Interactive Analysis, and Dashboards]
Data Warehousing
[Figure: ETL pipeline from the Source Databases via the Landing Area and Staging Area into the Data Warehouse, consisting of the steps Extract, Clean, Conform, and Deliver; from the Landing Area onwards the data is under the control of the ETL / DWh team; the Data Warehouse holds integrated and historized data of one or more sources]
Note: Extract → Clean → Conform → Deliver corresponds to Extract → Transform → Load, where Transform = Clean + Conform and Load = Deliver.
Extracting Data
Access to the hardware / OS / DBMS platform of the source
- mainframe legacy systems (EBCDIC character set, decimal data types, ...)
- ODBC / JDBC vs. native drivers for relational (or nearly relational) DBMSs
- use of an ETL tool (e.g., Informatica PowerCenter) offering the required connectors
Capturing changes (usually for dimensional data only): how to get only the data that changed since the last extract?
- use explicit audit columns in source databases (if there are any ...) and select only rows where timestamp > last extract time (sketched below)
  Does every change procedure in the source set this correctly?
- database log sniffing
  What if log entries get cleared before the next extract?
- keep previous source database extracts in the landing / staging area and do an explicit diff before the warehouse upload
  Unfortunately, this is not very efficient ...
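As an illustration, a minimal sketch of the audit-column approach in SQL; the table and column names (CUSTOMER, LAST_MODIFIED_AT, ETL_EXTRACT_LOG) are assumptions, not features of any particular source system:

  -- Sketch: extract only rows changed since the previous run, using an
  -- assumed audit column and a high-water mark kept by the ETL process.
  select c.*
  from   CUSTOMER c
  where  c.LAST_MODIFIED_AT >
         ( select max(l.EXTRACTED_UP_TO)
           from   ETL_EXTRACT_LOG l
           where  l.SOURCE_TABLE = 'CUSTOMER' );

  -- After a successful extract, advance the high-water mark.
  insert into ETL_EXTRACT_LOG ( SOURCE_TABLE, EXTRACTED_UP_TO )
  values ( 'CUSTOMER', current_timestamp );

Note that this only finds rows whose change procedures actually maintain the audit column, which is exactly the caveat raised above.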
Cleaning Data
The quality of the data in many organizations has ample room for improvement ... and Data Warehousing / ETL brings this to light (Data Warehousing = GIGO: Garbage In, Garbage Out).
Ultimately, dirty data should be avoided at the source, namely in the system / database where the data is entered for the first time.
However, fixing this sustainably often involves much more than debugging, refactoring, and extending a single IT application:
Business Process Reengineering
- which tasks does a process entail, and in which order?
- which tasks should be performed where (e.g., minimizing handovers)?
Data Quality Initiatives
Structure enforcement
- Primary keys (identifiers)
- Alternate keys (natural keys, uniqueness)
- Foreign keys
[Figure: ETL_CHECKS schema recording check results with references to TABLES and ROWS]
Note: Usually, rows are a degenerate dimension. Instead of a user-defined primary key, physical row IDs are used.
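For illustration, a minimal sketch of two such structure checks in SQL; the staging tables STG_CUSTOMER and STG_ORDER and their columns are assumed names:

  -- Sketch: alternate (natural) key uniqueness -- duplicate customer numbers.
  select CUSTOMER_NO, count(*) as CNT
  from   STG_CUSTOMER
  group by CUSTOMER_NO
  having count(*) > 1;

  -- Sketch: referential integrity -- orders referring to unknown customers.
  select o.ORDER_ID
  from   STG_ORDER o
  where  not exists ( select 1
                      from   STG_CUSTOMER c
                      where  c.CUSTOMER_NO = o.CUSTOMER_NO );

Rows returned by such checks can be recorded in the ETL_CHECKS tables, with a reference to the offending table and row.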
Conforming Data
Conforming is the step where not only the data structures but also the data values of a source database are mapped to the common data representation used in the Data Warehouse.
Conforming involves the following subtasks:
- Standardizing representations of data values
- Matching (the same or similar) data values of different rows (usually from different sources) describing the same real-world entity [Duplicate Detection, see separate section]
- Determining the surviving data values for the rows matched in the previous step
Conforming Dimensions
Remember: Dimensions supply the context in which the measures in the fact tables are viewed and compared using roll-up, drill-down, and slice operations.
Conforming dimensions primarily involves (see the sketch below):
Standardization of formats
- dates, phone numbers, ...
- capitalization of strings, trimming leading & trailing blanks, etc.
- standardization of street and company names (e.g., suffixes like Street vs. St, Incorporated vs. Inc, ...)
Mapping of external codes (currency, country, industry, ... codes) and internal codes (profit centers, accounting categories) to standard codes
Mapping of source system identifiers of entities to common warehouse identifiers (based on exact matches), plus fuzzy matching where needed [Duplicate Detection]
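A minimal sketch of such standardizations in SQL; the staging table STG_CUSTOMER and the lookup table COUNTRY_CODE_MAP are assumed names:

  -- Sketch: trim and capitalize a name, expand a street suffix, and map a
  -- source-specific country code to the standard code via a lookup table.
  select trim(upper(c.LAST_NAME))           as LAST_NAME,
         case when trim(c.STREET) like '% St'
              then trim(c.STREET) || 'reet'   -- 'Main St' -> 'Main Street'
              else trim(c.STREET)
         end                                as STREET,
         m.ISO_COUNTRY_CODE                 as COUNTRY_CODE
  from   STG_CUSTOMER c
         left join COUNTRY_CODE_MAP m
                on  m.SOURCE_SYSTEM = c.SOURCE_SYSTEM
                and m.SOURCE_CODE   = c.COUNTRY_CODE;

The left join keeps rows whose country code has no standard mapping yet, so they can be flagged rather than silently lost.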
Conforming Facts
Conforming facts primarily involves:
- Conforming the foreign keys referring to dimensional entities, i.e., mapping source system identifiers of entities to common warehouse identifiers using the mapping established when conforming dimensions
- Standardizing units and scales of observed measures, e.g., imperial to metric units (or vice versa), foreign exchange, etc.
- Possibly aggregating measures to a common higher dimension level
- Standardizing the computation of derived measures if needed (*), e.g., ratios etc.
*) Recommendation (see the sketch below):
- ask for the formulas used to compute derived measures and document this computation
- procure the required observed measures (inputs) and derived measures (outputs)
- check whether you can reproduce the procured outputs from their inputs ...
- ... and, if so, keep the checks but don't recompute the outputs in your code
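A minimal sketch of such a check in SQL, assuming a staging table STG_SALES that delivers both the observed inputs (REVENUE, COST) and a procured derived measure (MARGIN_PCT):

  -- Sketch: report rows where the procured derived measure cannot be
  -- reproduced from its inputs within a small tolerance.
  select s.SALE_ID,
         s.MARGIN_PCT                              as PROCURED_MARGIN_PCT,
         100 * (s.REVENUE - s.COST) / s.REVENUE    as RECOMPUTED_MARGIN_PCT
  from   STG_SALES s
  where  s.REVENUE <> 0
    and  abs( s.MARGIN_PCT
              - 100 * (s.REVENUE - s.COST) / s.REVENUE ) > 0.01;

Rows returned indicate deliveries where the documented formula and the procured value disagree; in line with the recommendation, the procured value is still loaded, but the discrepancy should be raised with the source.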
Delivering Data
The starting point is a staging area containing a cleaned (or at least quality-scored) and conformed package of data ready for an upload to the data warehouse. This data package should be logically consistent, meaning it represents a snapshot as of a specific valid time in which structural integrity constraints such as entity integrity and referential integrity hold.
Logically, the upload establishes the most recent snapshot of the history (unless a bi-temporal model allowing for corrections of existing data is used ...).
Regardless of whether the dimensional design is rigidly followed, we can distinguish between an upload of
- dimensional data, which may lead to both inserts and updates
- fact data, which should only lead to inserts
Delivering Dimensions
The following steps are required when uploading dimensional data:
Generate new surrogates for dimensional objects (rows) whose (stable!) source system identifier is not yet known in the mapping table or in the Data Warehouse.
Recommendation: For efficiency, use special DBMS or ETL Tool features.
If no history is kept for the properties of dimensional objects (rows), i.e., SCD* Type 1: update existing rows and insert new ones.
Recommendation: For efficiency, use a DBMS bulk load facility rather than an SQL MERGE statement such as:
  merge into TARGET_TABLE t
  using SOURCE_TABLE s
  on ( s.PK = t.PK )
  when matched then
    update set ... where ...
  when not matched then
    insert ( ... ) values ( ... )
If history is kept, insert (and possibly update) rows according to the strategy used (a Type 2 sketch follows below):
- SCD Type 2: insert a new row with a new surrogate and update the table mapping the source system ID to the Data Warehouse surrogate
- general temporal database concepts: insert a new row with the same surrogate and the new VALID_FROM time, and set the VALID_TO time of the existing row
Remember: Do not delete dimensional objects (rows) that were deleted in the source.
*) SCD = Slowly Changing Dimension
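For illustration, a minimal sketch of one common SCD Type 2 variant that also maintains validity dates, uploading one changed customer; the tables DIM_CUSTOMER and MAP_CUSTOMER, the Oracle-style sequence DIM_CUSTOMER_SEQ, and all literal values are assumptions:

  -- Sketch: close the currently valid version of the changed customer ...
  update DIM_CUSTOMER
  set    VALID_TO = date '2011-06-30'
  where  CUSTOMER_SK = 4711                      -- surrogate of the current version
    and  VALID_TO   = date '9999-12-31';         -- 'open' end of validity

  -- ... insert the new version under a newly generated surrogate ...
  insert into DIM_CUSTOMER ( CUSTOMER_SK, CUSTOMER_NO, NAME, SEGMENT, VALID_FROM, VALID_TO )
  values ( DIM_CUSTOMER_SEQ.nextval, 'C-0815', 'Example AG', 'Corporate',
           date '2011-07-01', date '9999-12-31' );

  -- ... and repoint the source-ID-to-surrogate mapping to the new version.
  update MAP_CUSTOMER
  set    CUSTOMER_SK = DIM_CUSTOMER_SEQ.currval
  where  SOURCE_SYSTEM = 'CRM'
    and  SOURCE_ID     = 'C-0815';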
Delivering Facts
Uploading facts is generally easier and involves the following steps (see the sketch below):
- Map all foreign keys to the appropriate surrogates.
- Insert the fact rows with the appropriate valid time (which may be a foreign key to a date/time dimension table).
- If performance matters, consider
  - switching foreign key checks off temporarily (or permanently, but with periodic background checks by running queries)
  - dropping indexes and re-building them afterwards (especially for bitmap indexes)
- Recompute dependent materialized tables (typically aggregates) or MOLAP (Multi-dimensional OLAP) cubes.
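A minimal sketch of such a fact upload in SQL; all table and column names (STG_SALES, MAP_CUSTOMER, MAP_PRODUCT, DIM_DATE, FACT_SALES) are assumptions:

  -- Sketch: insert new facts, resolving source IDs to warehouse surrogates
  -- via the mapping tables and the valid time to a date dimension key.
  insert into FACT_SALES ( DATE_SK, CUSTOMER_SK, PRODUCT_SK, QUANTITY, REVENUE )
  select d.DATE_SK,
         mc.CUSTOMER_SK,
         mp.PRODUCT_SK,
         s.QUANTITY,
         s.REVENUE
  from   STG_SALES s
         join MAP_CUSTOMER mc on mc.SOURCE_ID    = s.CUSTOMER_NO
         join MAP_PRODUCT  mp on mp.SOURCE_ID    = s.PRODUCT_NO
         join DIM_DATE     d  on d.CALENDAR_DATE = s.SALE_DATE;

The inner joins assume that conforming has already ensured every source ID is present in the mapping tables; otherwise, missing rows would be dropped silently and should be caught by a check beforehand.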
Literature
General Data Warehousing
[Lehner 2003] Wolfgang Lehner: Datenbanktechnologie für Data-Warehouse-Systeme: Konzepte und Methoden. dpunkt.verlag, 2003. (in German)
General Data Quality / Data Profiling
[Olson 2003] Jack E. Olson: Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003.
ETL Processing
[Kimball & Caserta 2004] Ralph Kimball, Joe Caserta: The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. John Wiley, 2004.