0% found this document useful (0 votes)
5 views31 pages

COMP321F Database

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1/ 31

COMPS321F

Advanced Database and Data Warehousing

Lecture 10
DATA WAREHOUSING

Wyman Wang
School of Science and Technology
Content
The main concepts and benefits associated with data warehousing.
How online transaction processing (OLTP) systems differ from data
warehousing.
The architecture and main components of a data warehouse.
The concept of a data mart and the main reasons for implementing
a data mart.
How Oracle supports the requirements of data warehousing.

2
Data Warehousing (DW)
Concepts
A formal definition of the data warehouse is offered by W.H.
Inmon:

“A subject-oriented, integrated, time-variant, and non-volatile


collection of data in support of management’s decision-making
process.”

3
Subject-oriented Data
The warehouse is organized around the major subjects of the
enterprise (e.g. customers, products, and sales) rather than the
major application areas (e.g. customer invoicing, stock control, and
product sales).

This is reflected in the need to store decision-support data rather


than application-oriented data.

4
Integrated Data
The data warehouse integrates
corporate application-oriented data DW
Operational Data encoding
from different source systems, which server A – M,F M,F
often includes data that is server B – 1,0
inconsistent. server C – male,female

The integrated data source must be server A – cm cm

made consistent to present a unified server B – inches

view of the data to the users.


i.e., gender, measurement, etc.,

5
Time-variant Data
Data in the warehouse is only accurate and valid at some point in
time or over some time interval.
The changes to the data in the DW are tracked and recorded so that
reports show changes over time.
Different environments have different time horizons associated
◦ While for operational systems a time horizon of 60-to-90 days is normal,
DWs have a time horizon of 5-to-10 years.

6
Non-volatile Data
Data in the warehouse is not updated in real-time but is refreshed
from operational systems on a regular basis.
Data is loaded, but not updated.
When subsequent changes occur, a new version or snapshot record
is written.
Operational DW

insert change

delete access access

Record-by-record load Mass load/


manipulation access of data

7
Typical Architecture of a Data
Warehouse
More generally, a DW is:
◦ A large repository of some organization’s electronically stored data
◦ Specifically designed to facilitate reporting and analysis

8
Metadata
This area of the warehouse stores all the metadata (data about
data) definitions used by all the processes in the warehouse.
Used for a variety of purposes
◦ Extraction and loading processes - metadata is used to map data sources to
a common view of information within the warehouse.
◦ Warehouse management process - metadata is used to automate the
production of summary tables.
◦ Query management process - metadata is used to direct a query to the most
appropriate data source.

9
Data Mart
A subset of a data warehouse that
Data Sources Load Manager Data Marts
supports the requirements of a Users
particular department or business
function.
Operational Purchasing Analysis
Characteristics include System

◦ Focuses on only the requirements of one


department or business function. Reporting
Operational Sales
◦ Do not normally contain detailed System
operational data unlike data warehouses.
◦ More easily understood and navigated.
Flat files Inventory
◦ Faster and cheaper to build than DWs

10
Comparison of OLTP (Online Transaction
Processing) Systems and Data
Warehousing

11
Benefits of Data Warehousing
Potential high returns on investment
Competitive advantage
Increased productivity of corporate decision-makers

12
Data Warehouse Queries
The types of queries that a data warehouse is expected to answer
ranges from the relatively simple to the highly complex and is
dependent on the type of end-user access tools used.

13
Examples of Typical Data
Warehouse Queries
◦ What was the total revenue for Scotland in the third
Answering these questions with SQL-
quarter of 2020?

◦ What was the total revenue for property sales for each queries is difficult:
type of property in Great Britain in 2020? ◦ Complex query formulation necessary
◦ What would be the effect on property sales in different
◦ Processing will be slow due to complex
regions of Britain if legal costs went up by 3.5% and
Government taxes went down by 1.5% for properties
joins and multiple scans

over £100,000?

◦ Which type of property sells for prices above the average

selling price for properties in the main cities of Great


Britain and how does this correlate to demographic data?
14
A Sample Scheme for a Data
Warehouse
ABC company has built a data warehouse using Oracle Database, which stores the sales record
of all branches from year 2015- 2021. Two tables are created to store the historical data.
In particular, a relation CompanySales has the following sets of attributes:
◦ Year
◦ Branchcode
◦ Sales

For normalization issues, the detail of each Branch is stored in another table Branch with the
following attributes:
◦ Branchcode
◦ Address
◦ Manager

15
Table Structures
CREATE table CompanySales (
Branchcode varchar(3),
Sales integer,
Year integer);

CREATE table Branch (


Branchcode varchar(3),
Address varchar(30),
Manager varchar(20));

16
Analyzing and Reporting Data by
SQL
Oracle Database provides a large family of analytic SQL functions.
These analytic functions enable you to calculate:
◦ Aggregate functions for multidimensional analysis
◦ Ranking and comparing

17
Cube
To support multiple dimension analysing, Oracle provides the new CUBE extensions
to the GROUP BY clause.
CUBE operation enables a SELECT statement to calculate subtotals for all possible
combinations of a group of dimensions.

CUBE appears in the GROUP BY clause in a SELECT statement as follows:

SELECT ... GROUP BY


CUBE (grouping_column_reference_list)

18
Example 1
select year, branchcode, to_char(sum(sales), '999,999') total_sales
from companysales
where year between 2020 and 2021
and branchcode in ('HKU', 'TKS', 'UST')
group by cube(year, branchcode)
order by year, branchcode;

19
Example 2: Rewrite Example 1 using
Group by and Union SQL statements

20
Question
Compare the SQL syntax between Example 1 and 2, discuss the
benefits using CUBE operation over Group By operation.

21
Discussion
The Cube operation is designed to minimize the number of Select
operation (Select SQL statements) from a single table.
The cube operation has selected the companysales table once only,
while the group by operation has selected the companysales table
for four times.
The query performance will be much more efficient using Cube
operation when the table size becomes larger.

22
Ranking Functions
Top-N queries ask for the n largest or smallest values of a column. An
example is "What are the top ten best students in COMPS321F?" Of
course, we may also want to ask "What are the 5 worst students?"
Both largest-values and smallest-values sets are considered as Top-N
queries.

23
Top-N: Example 1
The following example shows how the [ASC | DESC] option changes the ranking order
on the selection of the top 5 sales years from CompanySales table:
SELECT * from (
SELECT year,
TO_CHAR(SUM(sales), '999,999') SALES$,
RANK() OVER (ORDER BY SUM(sales) DESC NULLS LAST) As TOP_SALES_RANK
FROM companysales
GROUP BY year)
WHERE TOP_SALES_RANK <=5;

24
Top-N: Example 1 (Output)

25
Question
Shows the ranking order on the selection of the WORST 5 sales years
from CompanySales table

26
Discussion

27
Ranking Within Groups
The RANK function can be made to operate within groups, that is,
the rank gets reset whenever the group changes.
This is accomplished with the PARTITION BY clause. The group
expressions in the PARTITION BY subclause divide the data set into
groups within which RANK operates.

28
Per Group Ranking Example
The following example shows the Per Group Ranking function for branches in ('CLK', 'HKU', 'KLT')
over the years in (2017, 2019, 2021):

SELECT branchcode, year,


TO_CHAR(SUM(sales), '9,999,999,999') total_sales,
RANK() OVER (PARTITION BY branchcode ORDER BY SUM(sales) DESC) AS rank_within_branch
FROM companysales
WHERE year IN (2017, 2019, 2021)
AND branchcode IN ('CLK', 'HKU', 'KLT')
GROUP BY branchcode, year;

29
Per Group Ranking Example

30
Reference
Chapter 32 of Connolly, T and Begg, C, Database Systems: A practical
Approach to Design, Implementation, and Management (6th ed.),
Boston: Pearson Education.

31

You might also like