Data Warehousing

Download as pdf or txt
Download as pdf or txt
You are on page 1of 32

Data Warehousing

1
Learning Objective
• Understand the basic definitions and concepts of data warehouses
• Understand data warehousing architectures
• Describe the processes used in developing and managing data
warehouses
• Explain data warehousing operations
• Explain data integration and the extraction, transformation, and load
(ETL) processes
• Describe real-time (active) data warehousing
• Understand data warehouse administration and security issues
2
Data Warehousing Definitions and Concepts
• “A data warehouse is a physical repository where relational data are
specially organized to provide enterprise-wide, cleansed data in a
standardized format.” - Text book

• “A data warehouse is a subject-oriented, integrated, nonvolatile, and


time-variant collection of data in support of management’s decision-
making process.” - W. H. Inmon

• “Data warehousing is a discipline that results in applications that


provide decision support capability, allows ready access to business
information, and creates business insight.” - Text book
3
Data Warehousing Definitions and Concepts
• Characteristics of Data Warehousing
▪ Subject oriented
▪ Integrated
▪ Time variant (time series)
▪ Nonvolatile
▪ Summarized
▪ Web based
▪ Relational/multidimensional
▪ Client/server
▪ Real time
▪ Include metadata
4
Data Warehousing Definitions and Concepts
• Data Mart
▪ A departmental DW that stores only relevant data
▪ Dependent data mart
◦ A subset that is created directly from a data warehouse
▪ Independent data mart
◦ A small DW designed for a strategic business unit or a department
• Oper Mart
▪ An operational data mart

5
Data Warehousing Definitions and Concepts
• Operational Data Stores (ODS)
▪ A type of database often used as an interim staging area for a data warehouse
▪ The database from which a business operates on an ongoing basis
• Enterprise Data Warehouse (EDW)
▪ A data warehouse for the enterprise (cover all subject areas of interest to the
entire organization)
• Metadata
▪ Data about data
▪ In a DW, metadata describe the contents of a DW and the manner of its
acquisition and use

6
Data Warehousing Process Overview
• Data warehousing process
1. Imported from various external and internal resources
2. Cleansed and organized consistently with the organization’s needs
3. Loaded into data warehouse or data marts
Alternatively, data marts can be created first and integrated into the EDW
4. Analyses are performed as needed

7
Data Warehousing Process Overview

8
Data Warehousing Process Overview
• Major components of a data warehousing process
▪ Data source
◦ From operational system and possibly external data sources
▪ Data extraction and transformation
◦ Extracted and transformed using custom-written or commercial software (ETL)
▪ Data loading
◦ Loaded into a staging area, where they are transformed and cleaned.
◦ Ready to load into the data warehouse and/or data mart
▪ Comprehensive database
◦ The EDW that supports decision analysis by providing relevant summarized and detailed
information

9
Data Warehousing Process Overview
• Major components of a data warehousing process (continued)
▪ Metadata
◦ Maintained for access by IT personnel and users.
◦ Include rules for organizing data summaries that are easy to index and search.
▪ Middleware tools
◦ Enable access to the data warehouse from a variety of front-end applications

10
Data Warehousing Architecture

3-tier architecture

2-tier architecture 1-tier architecture?

11
Data Warehousing Architecture
• Web-based Data Warehousing Architecture

12
Data Warehousing Architecture
• Issues to be considered when deciding which architecture to use:
▪ Which DBMS should be used?
▪ Will parallel processing and/or partitioning be used?
▪ Will data migration tools be used to load the DW?
▪ What tools will be used to support data retrieval and analysis?

13
Data Warehousing Architecture
• Alternative Data Warehousing Architectures

14
Data Warehousing Architecture
• Alternative Data Warehousing Architectures (continued)

15
Data Warehousing Architecture
• Alternative Data Warehousing Architectures (continued)

16
Data Warehousing Architecture
• 10 factors that potentially affect the architecture selection decision
1. Information interdependence between organizational units
2. Upper management’s information needs
3. Urgency of need for a data warehouse
4. Nature of end-user tasks
5. Constraints on resources
6. Strategic view of the data warehouse prior to implementation
7. Compatibility with existing systems
8. Perceived ability of the in-house IT staff
9. Technical issues
10. Social/political factors
17
Data Integration and ETL Process
• Data Integration
▪ Data integration comprises three major processes: data access, data
federation and change capture
◦ Data access: the ability to access and extract data from any data source
◦ Data federation: the integration of business views across multiple data stores
◦ Change capture: based on the identification, capture, and delivery of the changes made
to enterprise data sources
▪ When these three processes are correctly implemented, data can be accessed
and made accessible to an array of ETL and analysis tools and data
warehousing environments

18
Data Integration and ETL Process
• Extraction, Transformation, and Load
▪ Extraction – selecting data from one or more sources and reading the
selected data
▪ Transformation – converting data from its previous form into the form in
which it needs to be so that it can be placed into a data warehouse or simply
another database
▪ Load – putting the data
into the data warehouse

19
Data Integration and ETL Process
• Issues affect the purchase of ETL tools
▪ Data transformation tools are expensive
▪ Data transformation tools may have a long learning curve
▪ It is difficult to measure how the IT organization is doing until it has learned to
use the data transformation tools
• Important criteria in selecting an ETL tool
▪ Ability to read from and write to an unlimited number of data source
architectures
▪ Automatic capturing and delivery of metadata
▪ A history of conforming to open standards
▪ An easy-to-use interface for the developer and the functional user
20
DW Development
• Direct benefits of data warehouses:
▪ End users can perform extensive analysis in numerous ways
▪ A consolidate view of corporate data (i.e., the single version of truth) is
possible
▪ Better and more timely information is possible
◦ A DW permits information processing to be relieved from costly operational system onto
low-cost servers
◦ Many more end-user information requests can be processed more quickly
▪ Enhance system performance can result.
◦ A DW frees production process because some operation system reporting requirements
are moved to DSS
▪ Data access is simplified Indirect benefits result from end
users using these direct benefits
21
DW Development
• Data warehouse development approaches
▪ Inmon Model: EDW approach (top-down)
◦ Employ established database development methodologies and tools (e.g., ERD)
▪ Kimball Model: Data mart approach (bottom-up)
◦ “plan big, build small”
◦ Applies dimensional data modeling, which starts with tables
▪ Which model is best?
◦ There is no one-size-fits-all strategy to DW
▪ One alternative is to use hosted data warehouses

22
DW Development
• Representation of Data in DW
▪ Dimensional modeling is a retrieval-based system that supports high-volume
query access
▪ Star schema – the most commonly used and the simplest style of dimensional
modeling
◦ Contain a fact table surrounded by and connected to several dimension tables
◦ Fact table contains the descriptive attributes (numerical values) needed to perform
decision analysis and query reporting
◦ Dimension tables contain classification and aggregation information about the values in
the fact table
▪ Snowflakes schema – an extension of star schema where the diagram
resembles a snowflake in shape

23
DW Development
▪ Star Schema vs Snowflake Schema

24
DW Development
• Analysis of Data in DW
▪ Online analytical processing (OLAP)
◦ An approach to quickly answer ad hoc questions by executing multidimensional
analytical queries against organizational data repositories
▪ OLAP vs OLTP (Online transaction processing)
◦ OLTP system is primarily responsible for capturing and storing data related to day-to-day
business functions such as ERP, CRM, SCM, POS, and so forth
◦ OLAP system is designed to address the need of providing ad hoc analysis of
organizational data much more effectively and efficiently

25
DW Development
▪ Comparison between OLTP and OLAP
Criteria OLTP OLAP
Purpose To carry out day-to-day functions To support decision making and provide
answers to business and management queries
Data source Transaction data Data warehouse or data mart
• Normalized data repository • Nonnormalized data repository
• Focused on efficiency and consistency • Focused on accuracy and completeness
Reporting Routine, periodic, narrowly focused reports Ad hoc, multidimensional, broadly focused
reports and queries
Resource Ordinary relational databases Multiprocessor, large-capacity, specialized
requirements databases
Execution Fast (recording of business transactions and Slow (resource intensive, complex, large-scale
speed routine reports) queries)

26
DW Development
• OLAP Operations
▪ Slice – a subset of a multidimensional array
▪ Dice – a slice on more than two dimensions
of a data cube
▪ Drill Down/Up – navigating among levels of
data ranging from the most summarized (up)
to the most detailed (down)
▪ Roll Up – computing all of the data
relationships for one or more dimensions
▪ Pivot – changing the dimensional orientation of
a report or ad hoc query-page display
More details in later part of the course
27
DW Development
• Variations of OLAP
▪ Multidimensional OLAP (MOLAP)
◦ Stores data in an optimized multidimensional array storage
◦ Require pre-computation and storage of derived data
▪ Relational OLAP (ROLAP)
◦ Works directly with relational databases
◦ Does not require pre-computation
▪ Hybrid OLAP (HOLAP)
◦ Combination of ROLAP and MOLAP
◦ Allows model designer to decide which portions of data are stored in ROLAP / MOLAP

28
Real-Time Data Warehousing
• Also know as Active Data Warehousing (ADW)
• Decision making data are updated on an ongoing basis as business
transaction occur
Traditional DW Environment Active DW Environment
Strategic decision only Strategic and tactical decisions
Results sometimes hard to measure Results measured with operations
Daily, weekly, monthly data currency acceptable; Only comprehensive detailed data available within
summaries often appropriate minutes is acceptable
Moderate user concurrency High number (1,000+) of users accessing and querying
the system simultaneously
Highly restrictive reporting used to confirm or check Flexible ad hoc reporting
patterns; often uses predeveloped summary tables
Power users, knowledge workers, internal users Operational staff, call centers, external users
29
Enterprise
Decision
Evolution

30
DW Administration and Security
• Skill of Data Warehouse Administrator (DWA)
▪ Familiar with high-performance software, hardware and networking
technologies
▪ Possess solid business insight
▪ Familiar with decision-making processes so as to suitably design and maintain
the DW structure
▪ Possess excellent communications skills
• Security and privacy is a pressing issue in DW
▪ Safeguarding the most valuable assets
▪ Government regulations (e.g., HIPAA, etc.)
▪ Must be explicitly planned and executed
31
THE END

32

You might also like