COMP321F Database
COMP321F Database
COMP321F Database
Lecture 10
DATA WAREHOUSING
Wyman Wang
School of Science and Technology
Content
The main concepts and benefits associated with data warehousing.
How online transaction processing (OLTP) systems differ from data
warehousing.
The architecture and main components of a data warehouse.
The concept of a data mart and the main reasons for implementing
a data mart.
How Oracle supports the requirements of data warehousing.
2
Data Warehousing (DW)
Concepts
A formal definition of the data warehouse is offered by W.H.
Inmon:
3
Subject-oriented Data
The warehouse is organized around the major subjects of the
enterprise (e.g. customers, products, and sales) rather than the
major application areas (e.g. customer invoicing, stock control, and
product sales).
4
Integrated Data
The data warehouse integrates
corporate application-oriented data DW
Operational Data encoding
from different source systems, which server A – M,F M,F
often includes data that is server B – 1,0
inconsistent. server C – male,female
5
Time-variant Data
Data in the warehouse is only accurate and valid at some point in
time or over some time interval.
The changes to the data in the DW are tracked and recorded so that
reports show changes over time.
Different environments have different time horizons associated
◦ While for operational systems a time horizon of 60-to-90 days is normal,
DWs have a time horizon of 5-to-10 years.
6
Non-volatile Data
Data in the warehouse is not updated in real-time but is refreshed
from operational systems on a regular basis.
Data is loaded, but not updated.
When subsequent changes occur, a new version or snapshot record
is written.
Operational DW
insert change
7
Typical Architecture of a Data
Warehouse
More generally, a DW is:
◦ A large repository of some organization’s electronically stored data
◦ Specifically designed to facilitate reporting and analysis
8
Metadata
This area of the warehouse stores all the metadata (data about
data) definitions used by all the processes in the warehouse.
Used for a variety of purposes
◦ Extraction and loading processes - metadata is used to map data sources to
a common view of information within the warehouse.
◦ Warehouse management process - metadata is used to automate the
production of summary tables.
◦ Query management process - metadata is used to direct a query to the most
appropriate data source.
9
Data Mart
A subset of a data warehouse that
Data Sources Load Manager Data Marts
supports the requirements of a Users
particular department or business
function.
Operational Purchasing Analysis
Characteristics include System
10
Comparison of OLTP (Online Transaction
Processing) Systems and Data
Warehousing
11
Benefits of Data Warehousing
Potential high returns on investment
Competitive advantage
Increased productivity of corporate decision-makers
12
Data Warehouse Queries
The types of queries that a data warehouse is expected to answer
ranges from the relatively simple to the highly complex and is
dependent on the type of end-user access tools used.
13
Examples of Typical Data
Warehouse Queries
◦ What was the total revenue for Scotland in the third
Answering these questions with SQL-
quarter of 2020?
◦ What was the total revenue for property sales for each queries is difficult:
type of property in Great Britain in 2020? ◦ Complex query formulation necessary
◦ What would be the effect on property sales in different
◦ Processing will be slow due to complex
regions of Britain if legal costs went up by 3.5% and
Government taxes went down by 1.5% for properties
joins and multiple scans
over £100,000?
For normalization issues, the detail of each Branch is stored in another table Branch with the
following attributes:
◦ Branchcode
◦ Address
◦ Manager
15
Table Structures
CREATE table CompanySales (
Branchcode varchar(3),
Sales integer,
Year integer);
16
Analyzing and Reporting Data by
SQL
Oracle Database provides a large family of analytic SQL functions.
These analytic functions enable you to calculate:
◦ Aggregate functions for multidimensional analysis
◦ Ranking and comparing
17
Cube
To support multiple dimension analysing, Oracle provides the new CUBE extensions
to the GROUP BY clause.
CUBE operation enables a SELECT statement to calculate subtotals for all possible
combinations of a group of dimensions.
18
Example 1
select year, branchcode, to_char(sum(sales), '999,999') total_sales
from companysales
where year between 2020 and 2021
and branchcode in ('HKU', 'TKS', 'UST')
group by cube(year, branchcode)
order by year, branchcode;
19
Example 2: Rewrite Example 1 using
Group by and Union SQL statements
20
Question
Compare the SQL syntax between Example 1 and 2, discuss the
benefits using CUBE operation over Group By operation.
21
Discussion
The Cube operation is designed to minimize the number of Select
operation (Select SQL statements) from a single table.
The cube operation has selected the companysales table once only,
while the group by operation has selected the companysales table
for four times.
The query performance will be much more efficient using Cube
operation when the table size becomes larger.
22
Ranking Functions
Top-N queries ask for the n largest or smallest values of a column. An
example is "What are the top ten best students in COMPS321F?" Of
course, we may also want to ask "What are the 5 worst students?"
Both largest-values and smallest-values sets are considered as Top-N
queries.
23
Top-N: Example 1
The following example shows how the [ASC | DESC] option changes the ranking order
on the selection of the top 5 sales years from CompanySales table:
SELECT * from (
SELECT year,
TO_CHAR(SUM(sales), '999,999') SALES$,
RANK() OVER (ORDER BY SUM(sales) DESC NULLS LAST) As TOP_SALES_RANK
FROM companysales
GROUP BY year)
WHERE TOP_SALES_RANK <=5;
24
Top-N: Example 1 (Output)
25
Question
Shows the ranking order on the selection of the WORST 5 sales years
from CompanySales table
26
Discussion
27
Ranking Within Groups
The RANK function can be made to operate within groups, that is,
the rank gets reset whenever the group changes.
This is accomplished with the PARTITION BY clause. The group
expressions in the PARTITION BY subclause divide the data set into
groups within which RANK operates.
28
Per Group Ranking Example
The following example shows the Per Group Ranking function for branches in ('CLK', 'HKU', 'KLT')
over the years in (2017, 2019, 2021):
29
Per Group Ranking Example
30
Reference
Chapter 32 of Connolly, T and Begg, C, Database Systems: A practical
Approach to Design, Implementation, and Management (6th ed.),
Boston: Pearson Education.
31