4 Data Warehousing & OLAP
4 Data Warehousing & OLAP
4 Data Warehousing & OLAP
Agenda
• Data Warehouse: Basic Concepts
• Cloud Datawarehouses
1
05-05-2022
• Almost all big corporations use a relational database to store their data.
• Companies found that relational databases were great at storing data, but made it
difficult to generate management reports, from transactional data.
• The whole process took far too long.
e.g. For reporting sales for a specific month, the database server would potentially
have to loop through millions of transactions in order to generate a total, this
would often slow down the entire system to the point it became unusable for
critical business operations.
• Hence, Dr. E. F. Codd (an IBM scientist, inventor of RDBMS) came up with a solution
called OLAP.
OLAP Introduction
• The idea behind OLAP is to pre-compute all of the totals and subtotals
needed for reporting, either at night or at weekend when the database
server was normally idle.
• Excel 2013 (or later versions) provides a complete OLAP solution with
feature OLAP pivot table.
2
05-05-2022
Source: https://www.researchgate.net/figure/Business-Intelligence-tools-20_fig1_277905782
Data Warehouse
• There are several ways that databases or other distributed information sources can be
made to work together.
• The data stored at the data warehouse is first processed: e.g. data may be filtered, and
relations may be joined or aggregated.
• As the data is copied from the sources, it may need to be transformed in certain ways
to make all data conform to the schema at the warehouse.
3
05-05-2022
Data Warehousing
• Data Warehousing (DW) is process for collecting and managing data from varied
sources to provide meaningful business insights.
• A Data warehouse is typically used to connect and analyze business data from
heterogeneous sources.
• The data warehouse is the core of the BI system which is built for data analysis and
reporting.
• It is a blend of technologies and components which aids the strategic use of data.
Datawarehouse
• Data warehousing was introduced in 1988 by IBM researchers Barry Devlin and Paul
Murphy. Since then the concept has evolved and taken on a life of its own.
• Data warehousing is about taking data that is stored in two or more databases
(information sources) and build from them one large database, possibly virtual,
containing information from all the sources, so that data can be queried as a unit.
• A specialized database Data cube is used to organize the integrated data which
facilitates OLAP (on-line analytic processing) and data mining queries.
4
05-05-2022
Information Integration
• Information integration is about taking data that is stored in two or more databases
(information sources) and build from them one large database, possibly virtual,
containing information from all the sources, so that data can be queried as a unit.
• A specialized database Data cube is used to organize the integrated data which
facilitates OLAP (on-line analytic processing) and data mining queries.
• Data warehousing:
• The process of constructing and using data warehouses
5
05-05-2022
Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer, product, sales
• Focusing on the modeling and analysis of data for decision makers, not on daily
operations or transaction processing
• Provide a simple and concise view around particular subject issues by excluding data
that are not useful in the decision support process
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data sources
• relational databases, flat files, on-line transaction records
6
05-05-2022
Data Warehouse—Nonvolatile
• A physically separate store of data transformed from the operational environment
sometimes called de-normalized data
• Operational update of data does not occur in the data warehouse environment
• Does not require transaction processing, recovery, and concurrency control
mechanisms
• Requires only two operations in data accessing:
initial loading of data and access of data
7
05-05-2022
Data Warehouse
• In data warehouse integration architecture, data from
several sources is extracted and combined in global
schema.
8
05-05-2022
9
05-05-2022
increasing customer focus, which includes the analysis of customer buying patterns
(such as buying preference, buying time, budget cycles, and appetites for spending)
repositioning products and managing product portfolios by comparing the
performance of sales by quarter, by year, and by geographic regions in order to fine-
tune production strategies
analyzing operations and looking for sources of profit
managing customer relationships, making environmental corrections, and managing
the cost of corporate assets.
10
05-05-2022
Comparison
between OLTP
and OLAP
Systems
11
05-05-2022
• Note: OLAP query touches much of the data of database, as it classifies every recent
Sales fact by state of dealer that sold it. In contrast OLTP query touches only a single
tuple of the data.
12
05-05-2022
• OLAP queries typically examine very large amounts of data, even if the query results are
small.
• Source: C-store | Proceedings of the 31st international conference on Very large data bases (acm.org)
https://dl.acm.org/doi/10.5555/1083592.1083658
Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., ... & Zdonik, S. (2018). C-store: a column-oriented
DBMS. In Making Databases Work: the Pragmatic Wisdom of Michael Stonebraker (pp. 491-518).
13
05-05-2022
Data Warehousing: A
Multitiered Architecture
A 3-tier data warehousing
architecture
14
05-05-2022
Bottom tier
• The bottom tier is a warehouse database server that is almost always a relational
database system.
• Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources.
• The data are extracted using application program interfaces known as gateways.
A gateway is supported by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connection) and OLEDB
(Object Linking and Embedding Database) by Microsoft and JDBC (Java Database
Connection).
• This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.
Middle tier
• The middle tier is an OLAP server that is typically implemented using either
15
05-05-2022
Top tier
• The top tier is a front-end client layer,
• which contains query and reporting tools, analysis tools, and/or
• data mining tools (e.g., trend analysis, prediction, and so on).
Monitor
& OLAP Server
Other Metadata
sources Integrator
Analysis
Operational Extract Query
DBs Transform Data Serve Reports
Load
Refresh
Warehouse Data mining
Data Marts
16
05-05-2022
Enterprise warehouse
• It provides corporate-wide data integration, usually from one or more operational
systems or external information providers, and is cross-functional in scope.
• It typically contains detailed data as well as summarized data, and can range in size
from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
• An enterprise data warehouse may be implemented on traditional mainframes,
computer super servers, or parallel architecture platforms.
• It requires extensive business modeling and may take years to design and build.
17
05-05-2022
Data Mart
• Data marts are usually implemented on low-cost departmental servers that are
Unix/Linux or Windows based.
• The implementation cycle of a data mart is more likely to be measured in weeks
rather than months or years.
• However, it may involve complex integration in the long run if its design and
planning were not enterprise-wide.
• Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated
locally within a particular department or geographic area.
• Dependent data marts are sourced directly from enterprise data warehouses.
18
05-05-2022
Metadata Repository
• Meta data is the data defining warehouse objects. It stores:
• Description of the structure of the data warehouse
• schema, view, dimensions, hierarchies, derived data definition, data mart locations and
contents
• Operational meta-data
• data lineage (history of migrated data and transformation path), currency of data (active,
archived, or purged), monitoring information (warehouse usage statistics, error reports,
audit trails)
• The algorithms used for summarization
• The mapping from operational environment to the data warehouse
• Data related to system performance
• warehouse schema, view and derived data definitions
• Business data
• business terms and definitions, ownership of data, charging policies
19
05-05-2022
Star Schemas
• A star schema consists of schema for
fact table, which links to several other
relations called dimension tables.
• Fact table is at center of “star” whose
points are dimension tables.
• A fact table normally has several
attributes that represent dimensions,
and one or more dependent attribute
that represent properties of interest for
the point as a whole.
• E.g. dimensions for Sales data might
include date of sale, place of store,
dealer, etc.
• The dependent attributes may be
sales price, cost of item, tax, etc.
20
05-05-2022
21
05-05-2022
OLAP query that may be posed: List total no. of sales for each dealer in month of
May, 2021”
22
05-05-2022
• If the fact table Sales is joined with dimension table Autos, then attributes model and
color may be used for grouping sales in interesting ways.
e.g. breakdown of sales by color
or breakdown of sales by “SUV” model by month and dealer
• If Sales and Dealers are joined, we get options of grouping our data e.g.
Breakdown of sales by state or by city as well as by dealer
23
05-05-2022
OLAP Cube
• An OLAP cube is a multidimensional database that is optimized for data warehouse and
online analytical processing (OLAP) applications.
• OLAP cubes are often pre-summarized across dimensions to significantly improve OLAP
query time over relational databases.
• Query language used to interact and perform tasks with OLAP cubes is
multidimensional expressions (MDX).
24
05-05-2022
OLAP operations
• Since OLAP servers are based on multidimensional view of data, some popular OLAP operations
on multidimensional data are:
Roll up
Drill down
Slice and Dice
Pivot (rotate)
• Slice operation selects one particular dimension from a given cube and provides a new sub
cube.
• Dice selects 2 or more dimensions from a given cube and provides a new sub cube.
• Pivot operation is also known as rotation as it rotates data axes in view in order to provide an
alternative presentation of data.
25
05-05-2022
Roll up
• Roll up is performed by climbing up a
concept hierarchy for a particular
dimension.
• E.g. location dimension
• On rolling up, data is aggregated by
ascending location hierarchy from level
of city to level of country.
• Data is grouped into cities rather than
countries.
Multidimensional Data
• Sales volume as a function of product, month, and region
Office Day
Month
26
05-05-2022
Drill down
• On drilling down, the time
dimension is descended from level
of quarter to level of month.
27
05-05-2022
• A choice of partition for each dimension “dices” the cube, as shown in figure.
• The result is that: cube is divided into smaller cubes that represent groups of points
whose statistics are aggregated by a query that performs partitioning in its “group by”
clause.
Dicing
E.g. in the time dimension we may partition or dice “group by” according to days,
weeks, months, years, or not partition at all.
E.g. for cars dimension, partition by model, by color, by both model and color, or not
partition.
E.g. for dealers, partition by dealer, city, state, or not partition.
28
05-05-2022
Slicing
• Through the “where” clause, a query also has option of focusing on particular partitions
along one or more dimensions, i.e. on a particular slice of cube.
• Figure shows a query in which we ask for a slice in dimension “date”, and dice in 2 other
dimensions: car and dealer.
• Shading in figure suggests that we are interested only in one of these years.
• Here a specialized structure, “Data Cube” is used to hold the data, including its
precomputed aggregates.
29
05-05-2022
30
05-05-2022
Pivot
Summary:
Data cube operations
31
05-05-2022
Data Cubes
• MOLAP (Multidimensional OLAP) uses a specialized structure, “Data Cube” is used to
hold the data, including its precomputed aggregates.
sum
Canada
Mexico
sum
32
05-05-2022
33
05-05-2022
e.g. Data
Cube
34
05-05-2022
Lattice of Cuboids
Lattice of Cuboids
• Given a set of dimensions, we can generate a cuboid for each of the possible subsets of
the given dimensions. The result would form a lattice of cuboids, each showing the data
at a different level of summarization, or group-by.
• The lattice of cuboids is then referred to as a data cube.
• Figure 4.5 shows a lattice of cuboids forming a data cube for the dimensions time, item,
location, and supplier.
• The cuboid that holds the lowest level of summarization is called the base cuboid.
• For example, the 4-D cuboid in Figure 4.4 is the base cuboid for the given time, item,
location, and supplier dimensions.
• The 0-D cuboid, which holds the highest level of summarization, is called the apex
cuboid.
• In our example, this is the total sales, or dollars sold, summarized over all four
dimensions. The apex cuboid is typically denoted by all.
35
05-05-2022
36
05-05-2022
A Concept Hierarchy:
Dimension (location)
all all
74
37
05-05-2022
ORDER
TRUCK
PRODUCT LINE
Time Product
ANNUALY QTRLY DAILY PRODUCT ITEM PRODUCT GROUP
CITY
SALES PERSON
COUNTRY
DISTRICT
REGION
DIVISION
Location Each circle is
called a footprint Promotion Organization
38
05-05-2022
39
05-05-2022
40
05-05-2022
Snowflake Schema
• Snowflake Schema is the extension to star schema such that the tables are arranged in
a complex snowflake shape.
• The concept is similar to star schema with a center fact table and multiple dimension
tables radiating from the center except that the tables described as dimensions are
normalized and consist of more hierarchies.
• The dimension tables (as in star schema diagram) are further branch to other multiple
dimensions.
branch_key
location
branch location_key
location_key
branch_key
units_sold street
branch_name
city_key
branch_type
dollars_sold city
city_key
avg_sales city
state_or_province
Measures country
41
05-05-2022
Snowflake Schema
• Sometimes, the query requirements are such that business people find snowflake diagrams
easy and important to get the results of their query.
• For instance, the time dimension can be connected to two further hierarchies:
Year—> Month —> Day
Week —> Day
• Though snowflake schema requires more storage space, they are widely known for
improved query performance in the data warehouse.
• Normalization through snowflaking results in saving storage space and the tradeoff
promotes automatic underlying joins.
• Trade-off: Such a normalized table is easy to maintain and saves storage space. However,
this space savings is negligible in comparison to the typical magnitude of the fact table.
• Furthermore, the snowflake structure can reduce the effectiveness of browsing, since more
joins will be needed to execute a query.
• Snowflake schema reduces redundancy, it is not as popular as the star schema in data
warehouse design.
Fact Constellation
• Sophisticated applications may require multiple fact tables to share dimension tables.
• This kind of schema can be viewed as a collection of stars, and hence is called a galaxy
schema or a fact constellation.
• A fact constellation schema is shown in Figure 4.8.
• This schema specifies two fact tables, sales and shipping.
• The sales table definition is identical to that of the star schema (Figure 4.6).
• The shipping table has five dimensions, or keys—item key, time key, shipper key, from
location, and to location—and two measures—dollars cost and units shipped.
• A fact constellation schema allows dimension tables to be shared between fact tables.
• For example, the dimensions tables for time, item, and location are shared between
the sales and shipping fact tables.
42
05-05-2022
• A data warehouse collects information about subjects that span the entire
organization, such as customers, items, sales, assets, and personnel, and thus its scope
is enterprise-wide.
For data warehouses, the fact constellation schema is commonly used, since it can
model multiple, interrelated subjects.
• A data mart, on the other hand, is a department subset of the data warehouse that
focuses on selected subjects, and thus its scope is departmentwide.
For data marts, the star or snowflake schema is commonly used, since both are
geared toward modeling single subjects, although the star schema is more popular
and efficient.
43
05-05-2022
ROLAP Servers
• These are the intermediate servers that stand in between a relational back-end server
and client front end tools.
• They use a relational or extended-relational DBMS to store and manage warehouse
data, and OLAP middleware to support missing pieces.
• ROLAP servers include optimization for each DBMS back end, implementation of
aggregation navigation logic, and additional tools and services.
• ROLAP technology tends to have greater scalability than MOLAP technology.
• The DSS server of Microstrategy, for example, adopts the ROLAP approach.
44
05-05-2022
• One of these relations is the fact table, which contains the raw or unaggregated, data.
• The query language and other capabilities depend on this kind of arrangement.
• Can handle large amounts of information: The data size limitation of ROLAP technology
depends on the data size of the underlying RDBMS. So, ROLAP itself does not restrict the
data amount.
• RDBMS already comes with a lot of features. So ROLAP technologies, (works on top of the
RDBMS) can control these functionalities.
Disadvantages
• Performance can be slow: Each ROLAP report is a SQL query (or multiple SQL queries) in
the relational database, the query time can be prolonged if the underlying data size is
large.
45
05-05-2022
MOLAP
• MOLAP structure has limited capabilities to dynamically create aggregations or to
evaluate results which have not been pre-calculated and stored.
• Applications requiring iterative and comprehensive time-series analysis of trends are
well suited for MOLAP technology (e.g., financial analysis and budgeting).
• Examples include Arbor Software's Essbase. Oracle's Express Server, Pilot Software's
Lightship Server, Sniper's TM/1. Planning Science's Gentium and Kenan Technology's
Multiway.
• Some of the problems faced by clients are related to maintaining support to multiple
subject areas in an RDBMS.
• Some vendors can solve these problems by continuing access from MOLAP tools to
detailed data in and RDBMS.
46
05-05-2022
MOLAP Servers
• These servers support multidimensional data views through array-based
multidimensional storage engines.
• The advantage of using a data cube is that it allows fast indexing to precomputed
summarized data.
• Notice that with multidimensional data stores, the storage utilization may be low if the
data set is sparse.
• Can perform complex calculations: All evaluations have been pre-generated when the
cube is created. Hence, complex calculations are not only possible, but they are returned
quickly.
Disadvantages
• Limited in the amount of information it can handle: Because all calculations are
performed when the cube is built, it is not possible to contain a large amount of data in
the cube itself.
• Requires additional investment: Cube technology is generally proprietary and does not
already exist in the organization. Therefore, to adopt MOLAP technology, chances are:
other investments in human and capital resources may be needed.
47
05-05-2022
• For example, a HOLAP server may allow large volumes of detailed data to be stored in a
relational database, while aggregations are kept in a separate MOLAP store.
• HOLAP systems save more substantial quantities of detailed data in the relational tables
while the aggregations are stored in the pre-calculated cubes.
• HOLAP also can drill through from the cube down to the relational tables for delineated
data.
48
05-05-2022
Disadvantages
• HOLAP architecture is complicated because it supports both MOLAP and ROLAP
servers.
49
05-05-2022
Datawarehouse implementation
• Data warehouses contain huge volumes of data.
• OLAP servers demand that decision support queries be answered in the order of
seconds.
• It is crucial for data warehouse systems to support highly efficient cube computation
techniques, access methods, and query processing techniques.
• At the core of multidimensional data analysis is the efficient computation of aggregations
• across many sets of dimensions. In SQL terms, these aggregations are referred to as
group-by’s.
• Each group-by can be represented by a cuboid, where the set of group-by’s forms a
lattice of cuboids defining a data cube.
50
05-05-2022
51
05-05-2022
Datawarehouse implementation
• The base cuboid contains all three dimensions, city, item, and year. It can return the total
sales for any combination of the three dimensions.
• The base cuboid is the least generalized (most specific) of the cuboids.
• The apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty. It
contains the total sum of all sales.
• The apex cuboid is the most generalized (least specific) of the cuboids, and is often
denoted as all.
• If we start at the apex cuboid and explore downward in the lattice, this is equivalent to
drilling down within the data cube.
• If we start at the base cuboid and explore upward, this is akin to rolling up.
Datawarehouse implementation
• An SQL query containing no group-by (e.g., “compute the sum of total sales”) is a zero dimensional
operation. An SQL query containing one group-by (e.g., “compute the sum of sales, group-by city”) is a one-
dimensional operation. A cube operator on n dimensions is equivalent to a collection of group-by
statements, one for each subset of the n dimensions.
• Therefore, the cube operator is the n-dimensional generalization of the group-by operator.
• Precomputation leads to fast response time and avoids some redundant computation. Most, if not all,
OLAP products resort to some degree of precomputation of multidimensional aggregates.
• A major challenge related to this precomputation is that the required storage space may explode if all the
cuboids in a data cube are precomputed, especially when the cube has many dimensions.
• The storage requirements are even more excessive when many of the dimensions have associated concept
hierarchies, each with multiple levels.
• This problem is referred to as the curse of dimensionality.
• “How many cuboids are there in an n-dimensional data cube?” If there were no hierarchies associated
with each dimension, then the total number of cuboids for an n-dimensional data cube, is 2n. However,
in practice, many dimensions do have hierarchies.
52
05-05-2022
53
05-05-2022
• The other flavor, serverless, is more modern and counts Google BigQuery (ver 2)
and Snowflake as examples. Essentially, serverless cloud data warehouses make the
database cluster “invisible” or shared across many clients.
54
05-05-2022
55
05-05-2022
Base table
Index on Region Index on Type
Cust Region Type RecID Asia Europe Am erica RecID Retail Dealer
C1 Asia Retail 1 1 0 0 1 1 0
C2 Europe Dealer 2 0 1 0 2 0 1
C3 Asia Dealer 3 1 0 0 3 0 1
C4 America Retail 4 0 0 1 4 1 0
C5 Europe Dealer 5 0 1 0 5 0 1
• In data warehouses, join index relates the values of the dimensions of a start schema to
rows in the fact table.
E.g. fact table: Sales and two dimensions city and product
A join index on city maintains for each distinct city a list of R-IDs of the tuples
recording the Sales in the city
• Join indices can span multiple dimensions
56
05-05-2022
• Explore indexing structures and compressed vs. dense array structs in MOLAP
57
05-05-2022
Attribute-Oriented Induction
• Proposed in 1989 (KDD ‘89 workshop)
• Not confined to categorical data nor particular measures
• How it is done?
• Collect the task-relevant data (initial relation) using a relational database query
• Perform generalization by attribute removal or attribute generalization
• Apply aggregation by merging identical, generalized tuples and accumulating
their respective counts
• Interaction with users for knowledge presentation
58
05-05-2022
Birth_Region
Canada Foreign Total
Gender
M 16 14 30
F 10 22 32
Total 26 36 62
59
05-05-2022
60
05-05-2022
61
05-05-2022
Summary
• Data warehousing: A multi-dimensional model of a data warehouse
• A data cube consists of dimensions & measures
• Star schema, snowflake schema, fact constellations
• OLAP operations: drilling, rolling, slicing, dicing and pivoting
• Data Warehouse Architecture, Design, and Usage
• Multi-tiered architecture
• Implementation: Efficient computation of data cubes
• OLAP query processing
• OLAP servers: ROLAP, MOLAP, HOLAP
• Bitmap index
• Cloud based Data warehouses: Introduction
62