COMP321F Database

COMPS321F
Advanced Database and Data Warehousing
Lecture 10
DATA WAREHOUSING
Wyman Wang
School of Science and Technology
Content
The main concepts and benefits associated with data warehousing.
How online transaction processing (OLTP) systems differ from data
warehousing.
The architecture and main components of a data warehouse.
The concept of a data mart and the main reasons for implementing
a data mart.
How Oracle supports the requirements of data warehousing.
2
Data Warehousing (DW)
Concepts
A formal definition of the data warehouse is offered by W.H.
Inmon:
“A subject-oriented, integrated, time-variant, and non-volatile

collection of data in support of management’s decision-making
process.”
3
Subject-oriented Data
The warehouse is organized around the major subjects of the
enterprise (e.g. customers, products, and sales) rather than the
major application areas (e.g. customer invoicing, stock control, and
product sales).
This is reflected in the need to store decision-support data rather

than application-oriented data.
4
Integrated Data
The data warehouse integrates
corporate application-oriented data DW
Operational Data encoding
from different source systems, which server A – M,F M,F
often includes data that is server B – 1,0
inconsistent. server C – male,female
The integrated data source must be server A – cm cm
made consistent to present a unified server B – inches
view of the data to the users.

i.e., gender, measurement, etc.,
5
Time-variant Data
Data in the warehouse is only accurate and valid at some point in
time or over some time interval.
The changes to the data in the DW are tracked and recorded so that
reports show changes over time.
Different environments have different time horizons associated
◦ While for operational systems a time horizon of 60-to-90 days is normal,
DWs have a time horizon of 5-to-10 years.
6
Non-volatile Data
Data in the warehouse is not updated in real-time but is refreshed
from operational systems on a regular basis.
Data is loaded, but not updated.
When subsequent changes occur, a new version or snapshot record
is written.
Operational DW
insert change
delete access access
Record-by-record load Mass load/

manipulation access of data
7
Typical Architecture of a Data
Warehouse
More generally, a DW is:
◦ A large repository of some organization’s electronically stored data
◦ Specifically designed to facilitate reporting and analysis
8
Metadata
This area of the warehouse stores all the metadata (data about
data) definitions used by all the processes in the warehouse.
Used for a variety of purposes
◦ Extraction and loading processes - metadata is used to map data sources to
a common view of information within the warehouse.
◦ Warehouse management process - metadata is used to automate the
production of summary tables.
◦ Query management process - metadata is used to direct a query to the most
appropriate data source.
9
Data Mart
A subset of a data warehouse that
Data Sources Load Manager Data Marts
supports the requirements of a Users
particular department or business
function.
Operational Purchasing Analysis
Characteristics include System
◦ Focuses on only the requirements of one

department or business function. Reporting
Operational Sales
◦ Do not normally contain detailed System
operational data unlike data warehouses.
◦ More easily understood and navigated.
Flat files Inventory
◦ Faster and cheaper to build than DWs
10
Comparison of OLTP (Online Transaction
Processing) Systems and Data
Warehousing
11
Benefits of Data Warehousing
Potential high returns on investment
Competitive advantage
Increased productivity of corporate decision-makers
12
Data Warehouse Queries
The types of queries that a data warehouse is expected to answer
ranges from the relatively simple to the highly complex and is
dependent on the type of end-user access tools used.
13
Examples of Typical Data
Warehouse Queries
◦ What was the total revenue for Scotland in the third
Answering these questions with SQL-
quarter of 2020?
◦ What was the total revenue for property sales for each queries is difficult:
type of property in Great Britain in 2020? ◦ Complex query formulation necessary
◦ What would be the effect on property sales in different
◦ Processing will be slow due to complex
regions of Britain if legal costs went up by 3.5% and
Government taxes went down by 1.5% for properties
joins and multiple scans
over £100,000?
◦ Which type of property sells for prices above the average
selling price for properties in the main cities of Great

Britain and how does this correlate to demographic data?
14
A Sample Scheme for a Data
Warehouse
ABC company has built a data warehouse using Oracle Database, which stores the sales record
of all branches from year 2015- 2021. Two tables are created to store the historical data.
In particular, a relation CompanySales has the following sets of attributes:
◦ Year
◦ Branchcode
◦ Sales
For normalization issues, the detail of each Branch is stored in another table Branch with the
following attributes:
◦ Branchcode
◦ Address
◦ Manager
15
Table Structures
CREATE table CompanySales (
Branchcode varchar(3),
Sales integer,
Year integer);
CREATE table Branch (

Branchcode varchar(3),
Address varchar(30),
Manager varchar(20));
16
Analyzing and Reporting Data by
SQL
Oracle Database provides a large family of analytic SQL functions.
These analytic functions enable you to calculate:
◦ Aggregate functions for multidimensional analysis
◦ Ranking and comparing
17
Cube
To support multiple dimension analysing, Oracle provides the new CUBE extensions
to the GROUP BY clause.
CUBE operation enables a SELECT statement to calculate subtotals for all possible
combinations of a group of dimensions.
CUBE appears in the GROUP BY clause in a SELECT statement as follows:
SELECT ... GROUP BY

CUBE (grouping_column_reference_list)
18
Example 1
select year, branchcode, to_char(sum(sales), '999,999') total_sales
from companysales
where year between 2020 and 2021
and branchcode in ('HKU', 'TKS', 'UST')
group by cube(year, branchcode)
order by year, branchcode;
19
Example 2: Rewrite Example 1 using
Group by and Union SQL statements
20
Question
Compare the SQL syntax between Example 1 and 2, discuss the
benefits using CUBE operation over Group By operation.
21
Discussion
The Cube operation is designed to minimize the number of Select
operation (Select SQL statements) from a single table.
The cube operation has selected the companysales table once only,
while the group by operation has selected the companysales table
for four times.
The query performance will be much more efficient using Cube
operation when the table size becomes larger.
22
Ranking Functions
Top-N queries ask for the n largest or smallest values of a column. An
example is "What are the top ten best students in COMPS321F?" Of
course, we may also want to ask "What are the 5 worst students?"
Both largest-values and smallest-values sets are considered as Top-N
queries.
23
Top-N: Example 1
The following example shows how the [ASC | DESC] option changes the ranking order
on the selection of the top 5 sales years from CompanySales table:
SELECT * from (
SELECT year,
TO_CHAR(SUM(sales), '999,999') SALES$,
RANK() OVER (ORDER BY SUM(sales) DESC NULLS LAST) As TOP_SALES_RANK
FROM companysales
GROUP BY year)
WHERE TOP_SALES_RANK <=5;
24
Top-N: Example 1 (Output)
25
Question
Shows the ranking order on the selection of the WORST 5 sales years
from CompanySales table
26
Discussion
27
Ranking Within Groups
The RANK function can be made to operate within groups, that is,
the rank gets reset whenever the group changes.
This is accomplished with the PARTITION BY clause. The group
expressions in the PARTITION BY subclause divide the data set into
groups within which RANK operates.
28
Per Group Ranking Example
The following example shows the Per Group Ranking function for branches in ('CLK', 'HKU', 'KLT')
over the years in (2017, 2019, 2021):
SELECT branchcode, year,

TO_CHAR(SUM(sales), '9,999,999,999') total_sales,
RANK() OVER (PARTITION BY branchcode ORDER BY SUM(sales) DESC) AS rank_within_branch
FROM companysales
WHERE year IN (2017, 2019, 2021)
AND branchcode IN ('CLK', 'HKU', 'KLT')
GROUP BY branchcode, year;
29
Per Group Ranking Example
30
Reference
Chapter 32 of Connolly, T and Begg, C, Database Systems: A practical
Approach to Design, Implementation, and Management (6th ed.),
Boston: Pearson Education.
31

COMP321F Database

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

COMP321F Database

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

COMP321F Database

Uploaded by

Copyright:

Available Formats

COMPS321F

Advanced Database and Data Warehousing

“A subject-oriented, integrated, time-variant, and non-volatile

This is reflected in the need to store decision-support data rather

The integrated data source must be server A – cm cm

made consistent to present a unified server B – inches

view of the data to the users.

delete access access

Record-by-record load Mass load/

◦ Focuses on only the requirements of one

◦ Which type of property sells for prices above the average

selling price for properties in the main cities of Great

CREATE table Branch (

CUBE appears in the GROUP BY clause in a SELECT statement as follows:

SELECT ... GROUP BY

SELECT branchcode, year,

You might also like