4 Data Warehousing & OLAP

Download as pdf or txt
Download as pdf or txt
You are on page 1of 62

05-05-2022

Datawarehousing and OLAP


Source:
Chapter 4, Han and Kamber Book, 3rd Edition
Chapter 20, Information Integration, from book: Database Systems, The
Complete Book, Garcia-Molina, Ullman, Widom

Agenda
• Data Warehouse: Basic Concepts

• Data Warehouse Modeling: Data Cube and OLAP

• Data Warehouse Design and Usage

• Cloud Datawarehouses

1
05-05-2022

Problems with RDBMS


• In 1979, Oracle released their first relational database product and by the mid to late
1990s, relational databases had come of age.

• Almost all big corporations use a relational database to store their data.

• Companies found that relational databases were great at storing data, but made it
difficult to generate management reports, from transactional data.
• The whole process took far too long.
e.g. For reporting sales for a specific month, the database server would potentially
have to loop through millions of transactions in order to generate a total, this
would often slow down the entire system to the point it became unusable for
critical business operations.
• Hence, Dr. E. F. Codd (an IBM scientist, inventor of RDBMS) came up with a solution
called OLAP.

OLAP Introduction
• The idea behind OLAP is to pre-compute all of the totals and subtotals
needed for reporting, either at night or at weekend when the database
server was normally idle.

• These totals are stored in a special database, called an OLAP cube.

• An OLAP cube doesn't have to loop through any transactions, because


totals are all pre-calculated, providing instant access.

• An OLAP cube is a snapshot of data at a specific point in time, perhaps at


the end of a specific day, week, month, or year.

• Excel 2013 (or later versions) provides a complete OLAP solution with
feature OLAP pivot table.

2
05-05-2022

Business Intelligence (BI)


• Business Intelligence means the
analysis and presentation of data
stored in OLAP cube.

• Best business intelligence platforms


Microsoft Power BI
Tableau Desktop
Dundas BI
Sisense
Zoho Analytics

Source: https://www.researchgate.net/figure/Business-Intelligence-tools-20_fig1_277905782

Data Warehouse
• There are several ways that databases or other distributed information sources can be
made to work together.

• The data stored at the data warehouse is first processed: e.g. data may be filtered, and
relations may be joined or aggregated.

• The warehouse is updated periodically.

• As the data is copied from the sources, it may need to be transformed in certain ways
to make all data conform to the schema at the warehouse.

3
05-05-2022

Data Warehousing
• Data Warehousing (DW) is process for collecting and managing data from varied
sources to provide meaningful business insights.

• A Data warehouse is typically used to connect and analyze business data from
heterogeneous sources.

• The data warehouse is the core of the BI system which is built for data analysis and
reporting.

• It is a blend of technologies and components which aids the strategic use of data.

• It is electronic storage of a large amount of information by a business which is designed


for query and analysis instead of transaction processing.

• It is a process of transforming data into information and making it available to users in a


timely manner to make a difference.

Datawarehouse
• Data warehousing was introduced in 1988 by IBM researchers Barry Devlin and Paul
Murphy. Since then the concept has evolved and taken on a life of its own.

• Data warehousing is about taking data that is stored in two or more databases
(information sources) and build from them one large database, possibly virtual,
containing information from all the sources, so that data can be queried as a unit.

• The sources may be conventional databases or other types of information, such as


collection of web pages.

• A specialized database Data cube is used to organize the integrated data which
facilitates OLAP (on-line analytic processing) and data mining queries.

4
05-05-2022

Information Integration
• Information integration is about taking data that is stored in two or more databases
(information sources) and build from them one large database, possibly virtual,
containing information from all the sources, so that data can be queried as a unit.

• The sources may be conventional databases or other types of information, such as


collection of web pages.

• A specialized database Data cube is used to organize the integrated data which
facilitates OLAP (on-line analytic processing) and data mining queries.

What is a Data Warehouse?


• Defined in many different ways, but not rigorously.
• A decision support database that is maintained separately from the organization’s
operational database
• Support information processing by providing a solid platform of consolidated,
historical data for analysis.

• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile


collection of data in support of management’s decision-making process.”—W. H. Inmon

• Data warehousing:
• The process of constructing and using data warehouses

5
05-05-2022

Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer, product, sales
• Focusing on the modeling and analysis of data for decision makers, not on daily
operations or transaction processing
• Provide a simple and concise view around particular subject issues by excluding data
that are not useful in the decision support process

Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data sources
• relational databases, flat files, on-line transaction records

• Data cleaning and data integration techniques are applied.

• Ensure consistency in naming conventions, encoding structures, attribute


measures, etc. among different data sources

• E.g., Hotel price: currency, tax, breakfast covered, etc.

• When data is moved to the warehouse, it is converted.

6
05-05-2022

Data Warehouse—Time Variant


• The time horizon for the data warehouse is significantly longer than that of
operational systems
• Operational database: current value data
• Data warehouse data: provide information from a historical perspective (e.g., past
5-10 years)
• Every key structure in the data warehouse
• Contains an element of time, explicitly or implicitly
• But the key of operational data may or may not contain “time element”

Data Warehouse—Nonvolatile
• A physically separate store of data transformed from the operational environment
sometimes called de-normalized data
• Operational update of data does not occur in the data warehouse environment
• Does not require transaction processing, recovery, and concurrency control
mechanisms
• Requires only two operations in data accessing:
initial loading of data and access of data

7
05-05-2022

Data Warehouse
• In data warehouse integration architecture, data from
several sources is extracted and combined in global
schema.

• Once data is in warehouse, queries may be issued by the


user exactly as they would be issued to any database.

• Data in the warehouse can become seriously out of date if


not updated periodically.

• Data warehouses are used as centralized data repositories


for analytical and reporting purposes.

• A traditional data warehouse is located on-site at your


offices. You purchase the hardware, the server rooms and
hire the staff to run it.
 They are also called on-premise data warehouses.

3 approaches to constructing data in a warehouse


• Simply construct the warehouse from scratch
 Warehouse is periodically re-constructed from current data in the sources.
 The system are shut down so queries are not issued while warehouse is being constructed which is
the main disadvantage of this approach.
 Sometimes reconstructing warehouse can take longer than a typical night.
• Incremental update
 Warehouse is updated periodically based on changes that have been made to sources since last time
warehouse was modified.
 This approach can involve smaller amounts of data and short period of time
 Disadvantage is that calculating changes to warehouse i.e. incremental update is complex.
• Warehouse is changed immediately, in response to each change or small set of changes
at one or more of the sources
 This approach requires too much communication and processing
 Suitable for small warehouses whose underlying sources change slowly
 A successful warehouse implementation of this type has a number of important applications.

8
05-05-2022

e.g. Data warehouse


• Suppose 2 different dealers A and B of an automobile company use 2 different schema:
-Cars(serialNo, model, color, autoTrans, cdPlayer,…)
-Autos(serial, model, color) and Options(serial, option)
• We wish to create a warehouse with schema:
• AutoWhse(serialNo, model, color, autoTrans, dealer)
• Here we record option of having automatic transmission only
• And an attribute that tells which dealer has the car
• The s/w that extracts data from 2 dealers’ databases and populates global schema (i.e.
extractor) can be written as SQL query:

e.g. Data warehouse


• The extractor for 2nd
dealer is more
complex, as we have
to decide whether or
not a given car has
automatic
transmission.
• In this example
combiner is not
needed.

Dealer 2: Autos(serial, model, color) and Options(serial, option)

9
05-05-2022

How organizations use the information from data warehouses


• Many organizations use this information to support business decision-making activities,
including:

increasing customer focus, which includes the analysis of customer buying patterns
(such as buying preference, buying time, budget cycles, and appetites for spending)
repositioning products and managing product portfolios by comparing the
performance of sales by quarter, by year, and by geographic regions in order to fine-
tune production strategies
analyzing operations and looking for sources of profit
managing customer relationships, making environmental corrections, and managing
the cost of corporate assets.

Approaches for database integration


• Query driven
to build wrappers and integrators (or mediators) on top of multiple, heterogeneous
databases.
When a query is posed to a client site, a metadata dictionary is used to translate the
query into queries appropriate for the individual heterogeneous sites involved.
These queries are then mapped and sent to local query processors.
query-driven approach requires complex information filtering and integration
processes, and competes with local sites for processing resources.
It is inefficient and potentially expensive for frequent queries, especially queries
requiring aggregations.
• Update driven
Rather than using a query-driven approach, data warehousing employs an update
driven approach in which information from multiple, heterogeneous sources is
integrated in advance and stored in a warehouse for direct querying and analysis.

10
05-05-2022

Why a Separate Data Warehouse?


• High performance for both systems
• DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery
• Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view,
consolidation
• Different functions and different data:
• missing data: Decision support requires historical data which operational DBs do not
typically maintain
• data consolidation: DS requires consolidation (aggregation, summarization) of data
from heterogeneous sources
• data quality: different sources typically use inconsistent data representations, codes
and formats which have to be reconciled
• Note: There are more and more systems which perform OLAP analysis directly on
relational databases

Comparison
between OLTP
and OLAP
Systems

11
05-05-2022

OLTP vs. OLAP


O LTP O LAP
u sers clerk , IT professio nal kno w led ge w o rker
f u n ctio n d ay to d ay op erations d ecisio n suppo rt
D B d esig n ap p licatio n-oriented sub ject-o riented
d a ta current, up -to -d ate historical,
d etailed , flat relatio nal summarized, multidimensional
iso lated integrated , co nso lid ated
u sa g e rep etitive ad-hoc
a ccess read /w rite lo ts o f scans
ind ex/hash o n p rim. k ey
u n it o f w o rk sho rt, simp le transactio n comp lex query
# reco rd s a ccessed tens millio ns
# u sers tho usand s hund reds
D B size 1 0 0 M B -G B 10 0G B -TB
m etric transactio n throughp ut q uery throughp ut, respo nse

e.g. OLTP query v/s OLAP query


• e.g. Imagine an automobile company builds a data warehouse to analyze sales of its
cars. Consider schema for warehouse:
• A typical OLTP query:
-Find price at which auto with serialNo=123 was sold.
select price from Sales where serialNo=123;
• A typical OLAP query:
Company wants to examine sales on
or after April 1, 2020 to see
how recent average price per vehicle
varies by state.

• Note: OLAP query touches much of the data of database, as it classifies every recent
Sales fact by state of dealer that sold it. In contrast OLTP query touches only a single
tuple of the data.

12
05-05-2022

OLAP v/s OLTP


• Companies and organizations create a warehouse with a copy of large amounts of their
available data and assign analysis to query this warehouse for important patterns or
trends.

• This activity is called OLAP (online analytic processing) or decision-support-queries,


generally involves highly complex queries that use one or more aggregations.

• E.g. to search for products with increasing or decreasing overall sales.

• OLAP queries typically examine very large amounts of data, even if the query results are
small.

• In contrast, common database operations often referred as OLTP (online transaction


processing), such as bank deposits/withdrawals, airline reservations, etc touch a tiny
portion of database.

e.g. Row Store v/s


Column store

• Source: C-store | Proceedings of the 31st international conference on Very large data bases (acm.org)
https://dl.acm.org/doi/10.5555/1083592.1083658
Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., ... & Zdonik, S. (2018). C-store: a column-oriented
DBMS. In Making Databases Work: the Pragmatic Wisdom of Michael Stonebraker (pp. 491-518).

13
05-05-2022

Extraction, Transformation, and Loading (ETL)


• Data extraction
• get data from multiple, heterogeneous, and external sources
• Data cleaning
• detect errors in the data and rectify them when possible
• Data transformation
• convert data from legacy or host format to warehouse format
• Load
• sort, summarize, consolidate, compute views, check integrity, and build indicies
and partitions
• Refresh
• propagate the updates from the data sources to the warehouse

Data Warehousing: A
Multitiered Architecture
A 3-tier data warehousing
architecture

14
05-05-2022

Bottom tier
• The bottom tier is a warehouse database server that is almost always a relational
database system.
• Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources.
• The data are extracted using application program interfaces known as gateways.
A gateway is supported by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connection) and OLEDB
(Object Linking and Embedding Database) by Microsoft and JDBC (Java Database
Connection).
• This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.

Middle tier

• The middle tier is an OLAP server that is typically implemented using either

• a relational OLAP(ROLAP) model (i.e., an extended relational DBMS that


maps operations on multidimensional data to standard relational
operations); or

• a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server


that directly implements multidimensional data and operations).

15
05-05-2022

Top tier
• The top tier is a front-end client layer,
• which contains query and reporting tools, analysis tools, and/or
• data mining tools (e.g., trend analysis, prediction, and so on).

Data Warehouse: A Multi-Tiered Architecture

Monitor
& OLAP Server
Other Metadata
sources Integrator

Analysis
Operational Extract Query
DBs Transform Data Serve Reports
Load
Refresh
Warehouse Data mining

Data Marts

Data Sources Data Storage OLAP Engine Front-End Tools

16
05-05-2022

Three Data Warehouse Models


• Enterprise warehouse
• collects all of the information about subjects spanning the entire
organization
• Data Mart
• a subset of corporate-wide data that is of value to a specific groups of
users. Its scope is confined to specific, selected groups, such as
marketing data mart
• Independent vs. dependent (directly from warehouse) data mart
• Virtual warehouse
• A set of views over operational databases
• Only some of the possible summary views may be materialized

Enterprise warehouse
• It provides corporate-wide data integration, usually from one or more operational
systems or external information providers, and is cross-functional in scope.
• It typically contains detailed data as well as summarized data, and can range in size
from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
• An enterprise data warehouse may be implemented on traditional mainframes,
computer super servers, or parallel architecture platforms.

• It requires extensive business modeling and may take years to design and build.

17
05-05-2022

Data Mart
• Data marts are usually implemented on low-cost departmental servers that are
Unix/Linux or Windows based.
• The implementation cycle of a data mart is more likely to be measured in weeks
rather than months or years.
• However, it may involve complex integration in the long run if its design and
planning were not enterprise-wide.
• Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated
locally within a particular department or geographic area.
• Dependent data marts are sourced directly from enterprise data warehouses.

The method for the development of data warehouse systems


• A recommended method for the development of data warehouse systems is to
implement the warehouse in an incremental and evolutionary manner, as shown in
Figure 4.2.
1. a high-level corporate data model is defined within a reasonably short period (such as one or
two months) that provides a corporate-wide, consistent, integrated view of data among
different subjects and potential usages.
2. This high-level model, although it will need to be refined in the further development of
enterprise data warehouses and departmental data marts, will greatly reduce future
integration problems.
3. independent data marts can be implemented in parallel with the enterprise warehouse based
on the same corporate data model set noted before.
4. distributed data marts can be constructed to integrate different data marts via hub servers.
5. a multitier data warehouse is constructed where the enterprise warehouse is the sole
custodian of all warehouse data, which is then distributed to the various dependent data
marts.

18
05-05-2022

The method for the development of data warehouse systems

Metadata Repository
• Meta data is the data defining warehouse objects. It stores:
• Description of the structure of the data warehouse
• schema, view, dimensions, hierarchies, derived data definition, data mart locations and
contents
• Operational meta-data
• data lineage (history of migrated data and transformation path), currency of data (active,
archived, or purged), monitoring information (warehouse usage statistics, error reports,
audit trails)
• The algorithms used for summarization
• The mapping from operational environment to the data warehouse
• Data related to system performance
• warehouse schema, view and derived data definitions
• Business data
• business terms and definitions, ownership of data, charging policies

19
05-05-2022

Multidimensional View of OLAP data


• In typical OLAP applications there is a central relation or collection of data, called fact
table.
• A fact table represents events or objects of interest, such as Sales from previous
example, e.g. Sales(serialNo, date, dealer, price)
• Figure shows Sales table as arranged in a multidimensional space or cube, with
dimensions car, dealer, and date.
• Each point as a sale of a single automobile while dimensions represent properties of
that sale.

Star Schemas
• A star schema consists of schema for
fact table, which links to several other
relations called dimension tables.
• Fact table is at center of “star” whose
points are dimension tables.
• A fact table normally has several
attributes that represent dimensions,
and one or more dependent attribute
that represent properties of interest for
the point as a whole.
• E.g. dimensions for Sales data might
include date of sale, place of store,
dealer, etc.
• The dependent attributes may be
sales price, cost of item, tax, etc.

20
05-05-2022

Facts in a Fact Table


• Facts are numeric measures.
• Think of them as the quantities by which we want to analyze
relationships between dimensions.
• Examples of facts for a sales data warehouse include dollars sold
(sales amount in dollars), units sold (number of units sold), and
amount budgeted.
• The fact table contains the names of the facts, or measures, as well as
keys to each of the related dimension tables.

Star Schema Template

21
05-05-2022

Example of Star Schema


time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold state_or_province
country
avg_sales
Measures
43

e.g. Fact Table: Sales relation

OLAP query that may be posed: List total no. of sales for each dealer in month of
May, 2021”

22
05-05-2022

e.g. Fact table


• Consider these 2 dimension tables:
Autos(serialNo, model, color)
Dealers(name, city, state, phone)
• We can merge these 2 dimensions to form a Fact table:
Sales(serialNo, date, dealer, price)
• Here “serialNo” in fact table is a foreign key referencing serialNo of dimension table Autos.
Attributes Autos.model and Autos.color give properties of a given auto.
• Attribute “dealer” of Sales is a foreign key, referencing name of dimension table Dealers.
• The “date” attribute of Sales is a dimension table representing time as a physical
property.
• Since grouping by various time units: weeks, months, quarters, and years, is frequently
desired by analysts.
• A typical “date” dimension table may have schema: Days(day, week, month, year) such
that
• an example date July 5, 2020 may be represented as
• tuple: (5, 27, 7, 2020).

e.g. Fact table


• Consider these 2 dimension tables:
Autos(serialNo, model, color)
Dealers(name, city, state, phone)
• We can merge these 2 dimensions to form a Fact table:
Sales(serialNo, date, dealer, price)

• If the fact table Sales is joined with dimension table Autos, then attributes model and
color may be used for grouping sales in interesting ways.
e.g. breakdown of sales by color
or breakdown of sales by “SUV” model by month and dealer

• If Sales and Dealers are joined, we get options of grouping our data e.g.
Breakdown of sales by state or by city as well as by dealer

23
05-05-2022

OLAP Cube
• An OLAP cube is a multidimensional database that is optimized for data warehouse and
online analytical processing (OLAP) applications.

• An OLAP cube is a method of storing data in multidimensional form, generally for


reporting purposes.

• In OLAP cubes, data are categorized by dimensions.

• OLAP cubes are often pre-summarized across dimensions to significantly improve OLAP
query time over relational databases.

• Query language used to interact and perform tasks with OLAP cubes is
multidimensional expressions (MDX).

How an OLAP cube works

24
05-05-2022

How an OLAP cube works

OLAP operations
• Since OLAP servers are based on multidimensional view of data, some popular OLAP operations
on multidimensional data are:
 Roll up
 Drill down
 Slice and Dice
 Pivot (rotate)

• Roll up performs aggregation on a data cube.

• Drill down is the reverse operation of Roll up.

• Slice operation selects one particular dimension from a given cube and provides a new sub
cube.

• Dice selects 2 or more dimensions from a given cube and provides a new sub cube.

• Pivot operation is also known as rotation as it rotates data axes in view in order to provide an
alternative presentation of data.

25
05-05-2022

Roll up
• Roll up is performed by climbing up a
concept hierarchy for a particular
dimension.
• E.g. location dimension
• On rolling up, data is aggregated by
ascending location hierarchy from level
of city to level of country.
• Data is grouped into cities rather than
countries.

Multidimensional Data
• Sales volume as a function of product, month, and region

Dimensions: Product, Location, Time


Hierarchical summarization paths

Industry Region Year

Category Country Quarter


Product

Product City Month Week

Office Day

Month

26
05-05-2022

Drill down
• On drilling down, the time
dimension is descended from level
of quarter to level of month.

• It navigates data from less detailed


data to highly detailed data.

Slicing and Dicing

27
05-05-2022

e.g. Slicing and Dicing


• Imagine points of raw data cube as partitioned along each dimension at some level of
granularity.
E.g. in the time dimension we may partition “group by” according to days, weeks, months, years, or not
partition at all.
E.g. for cars dimension, partition by model, by color, by both model and color, or not partition.
E.g. for dealers, partition by dealer, city, state, or not partition.

• A choice of partition for each dimension “dices” the cube, as shown in figure.
• The result is that: cube is divided into smaller cubes that represent groups of points
whose statistics are aggregated by a query that performs partitioning in its “group by”
clause.

Dicing

E.g. in the time dimension we may partition or dice “group by” according to days,
weeks, months, years, or not partition at all.
E.g. for cars dimension, partition by model, by color, by both model and color, or not
partition.
E.g. for dealers, partition by dealer, city, state, or not partition.

28
05-05-2022

Slicing

• Through the “where” clause, a query also has option of focusing on particular partitions
along one or more dimensions, i.e. on a particular slice of cube.
• Figure shows a query in which we ask for a slice in dimension “date”, and dice in 2 other
dimensions: car and dealer.
• Shading in figure suggests that we are interested only in one of these years.

Slicing and Dicing query

• Here a specialized structure, “Data Cube” is used to hold the data, including its
precomputed aggregates.

29
05-05-2022

e.g. Slicing and Dicing


• Consider these 2 dimension tables:
Autos(serialNo, model, color)
Dealers(name, city, state, phone)
• We can merge these 2 dimensions to form a Fact table:
Sales(serialNo, date, dealer, price)
• OLAP query: Find out which colors are not doing well.
• This query uses only Autos dimension table and can be written in SQL as:
• This query dices by color and then slices by model, focusing on a particular model “Gobi”.

e.g. Slicing and Dicing


• Consider these 2 dimension tables:
Autos(serialNo, model, color)
Dealers(name, city, state, phone)
• We can merge these 2 dimensions to form a Fact table:
Sales(serialNo, date, dealer, price)
• OLAP query: Find out how is the sale of “red Gobis” by all dealers or whether only some
dealers have had low sales of “red Gobis” through years

30
05-05-2022

Pivot

Summary:
Data cube operations

31
05-05-2022

Data Cubes
• MOLAP (Multidimensional OLAP) uses a specialized structure, “Data Cube” is used to
hold the data, including its precomputed aggregates.

• A data cube precomputes all possible aggregates in a systematic way.


• The amount of extra storage needed is often tolerable, and as long as warehouse data
does not change, there is no penalty incurred trying to keep all aggregates up-to-date.

• The Cube operator:


• Given a fact table F, an augmented table CUBE(F) may be defined that adds an
additional value, denoted “*” to each dimension which represents aggregation along
the dimension in which it appears.

A Sample Data Cube


Total annual sales
Date of TVs in U.S.A.
1Qtr 2Qtr 3Qtr 4Qtr sum
TV
PC U.S.A
VCR
Country

sum
Canada

Mexico

sum

32
05-05-2022

e.g. Data Cube

e.g. Data Cube

33
05-05-2022

e.g. Data
Cube

e.g. 4-D Data Cube

34
05-05-2022

Lattice of Cuboids

Lattice of Cuboids
• Given a set of dimensions, we can generate a cuboid for each of the possible subsets of
the given dimensions. The result would form a lattice of cuboids, each showing the data
at a different level of summarization, or group-by.
• The lattice of cuboids is then referred to as a data cube.
• Figure 4.5 shows a lattice of cuboids forming a data cube for the dimensions time, item,
location, and supplier.
• The cuboid that holds the lowest level of summarization is called the base cuboid.
• For example, the 4-D cuboid in Figure 4.4 is the base cuboid for the given time, item,
location, and supplier dimensions.
• The 0-D cuboid, which holds the highest level of summarization, is called the apex
cuboid.
• In our example, this is the total sales, or dollars sold, summarized over all four
dimensions. The apex cuboid is typically denoted by all.

35
05-05-2022

Cuboids Corresponding to the Cube


all
0-D (apex) cuboid
product date country
1-D cuboids

product,date product,country date, country


2-D cuboids

3-D (base) cuboid


product, date, country

Data Cube Measures: Three Categories


• Distributive: if the result derived by applying the function to n aggregate values is the
same as that derived by applying the function on all the data without partitioning
• E.g., count(), sum(), min(), max()

• Algebraic: if it can be computed by an algebraic function with M arguments (where M is


a bounded integer), each of which is obtained by applying a distributive aggregate
function
• E.g., avg() [sum()/count()], min_N() [finds N minimum values], standard_deviation()

• Holistic: if there is no constant bound on the storage size needed to describe a


subaggregate.
• E.g., median(), mode(), rank()
It is difficult to compute holistic measures efficiently, which can be overcome by computing
approximate median value for a large data set.

36
05-05-2022

The Lattice of views

A Concept Hierarchy:
Dimension (location)
all all

region Europe ... North_America

country Germany ... Spain Canada ... Mexico

city Frankfurt ... Vancouver ... Toronto

office L. Chan ... M. Wind

74

37
05-05-2022

Starnet Query Model


• This star-net consists of four radial lines,
representing concept hierarchies for the
dimensions
 location, customer, item, and time, respectively.
• Each line consists of footprints representing
abstraction levels of the dimension.
• For example, the time line has four
footprints: “day,” “month,” “quarter,” and
“year.”
 A concept hierarchy may involve a single
attribute (e.g., date for the time hierarchy) or
 several attributes (e.g., the concept hierarchy
for location involves the attributes street, city,
province or state, and country).

e.g. A Star-Net Query Model


Customer Orders
Shipping Method
Customer
CONTRACTS
AIR-EXPRESS

ORDER
TRUCK
PRODUCT LINE
Time Product
ANNUALY QTRLY DAILY PRODUCT ITEM PRODUCT GROUP
CITY
SALES PERSON
COUNTRY
DISTRICT

REGION
DIVISION
Location Each circle is
called a footprint Promotion Organization

38
05-05-2022

Typical OLAP Operations: Summary


• Roll up (drill-up): summarize data
• by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
• from higher level summary to lower level summary or detailed data, or
introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate):
• reorient the cube, visualization, 3D to series of 2D planes
• Other operations
• drill across: involving (across) more than one fact table
• drill through: through the bottom level of the cube to its back-end relational
tables (using SQL)

Summary: From Tables and Spreadsheets to Data Cubes


• A data warehouse is based on a multidimensional data model which views data in the
form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
• Dimension tables, such as item (item_name, brand, type), or time(day, week, month,
quarter, year)
• Fact table contains measures (such as dollars_sold) and keys to each of the related
dimension tables
• In data warehousing literature, an n-D base cube is called a base cuboid.
• The top most 0-D cuboid, which holds the highest-level of summarization, is called the
apex cuboid. The lattice of cuboids forms a data cube.

39
05-05-2022

Conceptual Modeling of Data Warehouses


• The most popular data model for a data warehouse is a multidimensional model, which
can exist in the form of a star schema, a snowflake schema, or a fact constellation
schema.
• Modeling data warehouses: dimensions & measures
• Star schema: A fact table in the middle connected to a set of dimension tables
• Snowflake schema: A refinement of star schema where some dimensional hierarchy
is normalized into a set of smaller dimension tables, forming a shape similar to
snowflake
• Fact constellations: Multiple fact tables share dimension tables, viewed as a
collection of stars, therefore called galaxy schema or fact constellation

Example of Star Schema


time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold state_or_province
country
avg_sales
Measures

40
05-05-2022

Snowflake Schema
• Snowflake Schema is the extension to star schema such that the tables are arranged in
a complex snowflake shape.

• The concept is similar to star schema with a center fact table and multiple dimension
tables radiating from the center except that the tables described as dimensions are
normalized and consist of more hierarchies.

• Snowflaking is implemented to perform advanced-level queries on the normalized


dimensions.

• The dimension tables (as in star schema diagram) are further branch to other multiple
dimensions.

Example of Snowflake Schema


time
time_key item
day item_key supplier
day_of_the_week Sales Fact Table item_name supplier_key
month brand supplier_type
quarter time_key type
year item_key supplier_key

branch_key
location
branch location_key
location_key
branch_key
units_sold street
branch_name
city_key
branch_type
dollars_sold city
city_key
avg_sales city
state_or_province
Measures country

41
05-05-2022

Snowflake Schema
• Sometimes, the query requirements are such that business people find snowflake diagrams
easy and important to get the results of their query.
• For instance, the time dimension can be connected to two further hierarchies:
 Year—> Month —> Day
 Week —> Day
• Though snowflake schema requires more storage space, they are widely known for
improved query performance in the data warehouse.
• Normalization through snowflaking results in saving storage space and the tradeoff
promotes automatic underlying joins.
• Trade-off: Such a normalized table is easy to maintain and saves storage space. However,
this space savings is negligible in comparison to the typical magnitude of the fact table.
• Furthermore, the snowflake structure can reduce the effectiveness of browsing, since more
joins will be needed to execute a query.
• Snowflake schema reduces redundancy, it is not as popular as the star schema in data
warehouse design.

Fact Constellation
• Sophisticated applications may require multiple fact tables to share dimension tables.
• This kind of schema can be viewed as a collection of stars, and hence is called a galaxy
schema or a fact constellation.
• A fact constellation schema is shown in Figure 4.8.
• This schema specifies two fact tables, sales and shipping.
• The sales table definition is identical to that of the star schema (Figure 4.6).
• The shipping table has five dimensions, or keys—item key, time key, shipper key, from
location, and to location—and two measures—dollars cost and units shipped.
• A fact constellation schema allows dimension tables to be shared between fact tables.
• For example, the dimensions tables for time, item, and location are shared between
the sales and shipping fact tables.

42
05-05-2022

Example of Fact Constellation


time
time_key item Shipping Fact Table
day item_key
day_of_the_week Sales Fact Table item_name time_key
month brand
quarter time_key type item_key
year supplier_type shipper_key
item_key
branch_key from_location

branch location_key location to_location


branch_key location_key dollars_cost
branch_name units_sold
street
branch_type dollars_sold city units_shipped
province_or_state
avg_sales country shipper
Measures shipper_key
shipper_name
location_key
shipper_type

Data warehouse (constellation) v/s Data mart (star or


snowflake)
• In data warehousing, there is a distinction between a data warehouse and a data mart.

• A data warehouse collects information about subjects that span the entire
organization, such as customers, items, sales, assets, and personnel, and thus its scope
is enterprise-wide.
For data warehouses, the fact constellation schema is commonly used, since it can
model multiple, interrelated subjects.

• A data mart, on the other hand, is a department subset of the data warehouse that
focuses on selected subjects, and thus its scope is departmentwide.
For data marts, the star or snowflake schema is commonly used, since both are
geared toward modeling single subjects, although the star schema is more popular
and efficient.

43
05-05-2022

3-tier data warehouse


architecture
• The top tier is a front-end client layer, which
contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend
analysis, prediction, and so on).
• The bottom tier is a warehouse database
server that is almost always a relational
database system.
• The middle tier is an OLAP server that is
typically implemented using either
• relational OLAP(ROLAP) model (i.e., an
extended relational DBMS that maps
operations on multidimensional data to
standard relational operations); or
• multidimensional OLAP (MOLAP) model
(i.e., a special-purpose server that directly
implements multidimensional data and
operations).
• HOLAP is a hybrid model.

ROLAP Servers
• These are the intermediate servers that stand in between a relational back-end server
and client front end tools.
• They use a relational or extended-relational DBMS to store and manage warehouse
data, and OLAP middleware to support missing pieces.
• ROLAP servers include optimization for each DBMS back end, implementation of
aggregation navigation logic, and additional tools and services.
• ROLAP technology tends to have greater scalability than MOLAP technology.
• The DSS server of Microstrategy, for example, adopts the ROLAP approach.

44
05-05-2022

ROLAP (Relational OLAP)


• In this approach data may be stored in relations with a specialized structure called “star
schema”.

• One of these relations is the fact table, which contains the raw or unaggregated, data.

• Other relations give information about values along each dimension.

• The query language and other capabilities depend on this kind of arrangement.

Pro and Cons of ROLAP


Advantages

• Can handle large amounts of information: The data size limitation of ROLAP technology
depends on the data size of the underlying RDBMS. So, ROLAP itself does not restrict the
data amount.

• RDBMS already comes with a lot of features. So ROLAP technologies, (works on top of the
RDBMS) can control these functionalities.

Disadvantages
• Performance can be slow: Each ROLAP report is a SQL query (or multiple SQL queries) in
the relational database, the query time can be prolonged if the underlying data size is
large.

• Limited by SQL functionalities: ROLAP technology relies on upon developing SQL


statements to query the relational database, and SQL statements do not suit all needs.

45
05-05-2022

MOLAP (Multidimensional OLAP)


• Here a specialized structure, “Data Cube” is used to hold the data, including its
precomputed aggregates.
• A MOLAP system is based on a native logical model that directly supports
multidimensional data and operations.
• Data are stored physically into multidimensional arrays, and positional techniques are
used to access them.
• One of the significant distinctions of MOLAP against a ROLAP is that data are
summarized and are stored in an optimized format in a multidimensional cube, instead
of in a relational database.
• In MOLAP model, data are structured into proprietary formats by client's reporting
requirements with the calculations pre-generated on the cubes.

MOLAP
• MOLAP structure has limited capabilities to dynamically create aggregations or to
evaluate results which have not been pre-calculated and stored.
• Applications requiring iterative and comprehensive time-series analysis of trends are
well suited for MOLAP technology (e.g., financial analysis and budgeting).
• Examples include Arbor Software's Essbase. Oracle's Express Server, Pilot Software's
Lightship Server, Sniper's TM/1. Planning Science's Gentium and Kenan Technology's
Multiway.
• Some of the problems faced by clients are related to maintaining support to multiple
subject areas in an RDBMS.
• Some vendors can solve these problems by continuing access from MOLAP tools to
detailed data in and RDBMS.

46
05-05-2022

MOLAP Servers
• These servers support multidimensional data views through array-based
multidimensional storage engines.

• They map multidimensional views directly to data cube array structures.

• The advantage of using a data cube is that it allows fast indexing to precomputed
summarized data.

• Notice that with multidimensional data stores, the storage utilization may be low if the
data set is sparse.

• In such cases, sparse matrix compression techniques should be explored.


• Many MOLAP servers adopt a two-level storage representation to handle dense and
sparse data sets:
 Denser sub cubes are identified and stored as array structures, whereas
 Sparse sub cubes employ compression technology for efficient storage utilization.

Pro and Cons of MOLAP


Advantages
• Excellent Performance: A MOLAP cube is built for fast information retrieval, and is
optimal for slicing and dicing operations.

• Can perform complex calculations: All evaluations have been pre-generated when the
cube is created. Hence, complex calculations are not only possible, but they are returned
quickly.
Disadvantages
• Limited in the amount of information it can handle: Because all calculations are
performed when the cube is built, it is not possible to contain a large amount of data in
the cube itself.

• Requires additional investment: Cube technology is generally proprietary and does not
already exist in the organization. Therefore, to adopt MOLAP technology, chances are:
other investments in human and capital resources may be needed.

47
05-05-2022

Hybrid OLAP (HOLAP) servers


• The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the
greater scalability of ROLAP and the faster computation of MOLAP.

• For example, a HOLAP server may allow large volumes of detailed data to be stored in a
relational database, while aggregations are kept in a separate MOLAP store.

• The Microsoft SQL Server 2000 supports a hybrid OLAP server.

Hybrid OLAP (HOLAP) servers


• HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture.

• HOLAP systems save more substantial quantities of detailed data in the relational tables
while the aggregations are stored in the pre-calculated cubes.

• HOLAP also can drill through from the cube down to the relational tables for delineated
data.

• The Microsoft SQL Server 2000 provides a hybrid OLAP server.

48
05-05-2022

Pro and Cons of HOLAP


Advantages
• HOLAP provide benefits of both MOLAP and ROLAP.
• It provides fast access at all levels of aggregation.
• HOLAP balances the disk space requirement, as it only stores the aggregate information
on the OLAP server and the detail record remains in the relational database. So no
duplicate copy of the detail record is maintained.

Disadvantages
• HOLAP architecture is complicated because it supports both MOLAP and ROLAP
servers.

OLAP Servers: Summary


• OLAP servers allow the analysts to get an insight of the information through fast,
consistent, and interactive access to information.

49
05-05-2022

OLAP Server Architectures: Summary


• Relational OLAP (ROLAP)
• Use relational or extended-relational DBMS to store and manage warehouse data and OLAP
middle ware
• Include optimization of DBMS backend, implementation of aggregation navigation logic, and
additional tools and services
• Greater scalability
• Multidimensional OLAP (MOLAP)
• Sparse array-based multidimensional storage engine
• Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
• Flexibility, e.g., low level: relational, high-level: array
• Specialized SQL servers (e.g., Redbricks)
• Specialized support for SQL queries over star/snowflake schemas

Datawarehouse implementation
• Data warehouses contain huge volumes of data.
• OLAP servers demand that decision support queries be answered in the order of
seconds.
• It is crucial for data warehouse systems to support highly efficient cube computation
techniques, access methods, and query processing techniques.
• At the core of multidimensional data analysis is the efficient computation of aggregations
• across many sets of dimensions. In SQL terms, these aggregations are referred to as
group-by’s.
• Each group-by can be represented by a cuboid, where the set of group-by’s forms a
lattice of cuboids defining a data cube.

50
05-05-2022

The compute cube Operator


• One approach to cube computation extends SQL so as to include a compute cube
operator.
• The compute cube operator computes aggregates over all subsets of the dimensions
specified in the operation. This can require excessive storage space, especially for large
numbers of dimensions.
• Suppose that you want to create a data cube for AllElectronics sales that contains the
following: city, item, year, and sales in dollars. You want to be able to analyze the data,
with queries such as the following:
“Compute the sum of sales, grouping by city and item.”
“Compute the sum of sales, grouping by city.”
“Compute the sum of sales, grouping by item.”

A data cube is a lattice of cuboids


Total number of cuboids, or
group by’s, that can be
computed for this data cube is 23
= 8.
The possible group-by’s are the
following:
{(city, item, year), (city, item),
(city, year), (item, year), (city),
(item), (year), ()}, where () means
that the group-by is empty (i.e.,
the dimensions are not grouped).

51
05-05-2022

Datawarehouse implementation
• The base cuboid contains all three dimensions, city, item, and year. It can return the total
sales for any combination of the three dimensions.
• The base cuboid is the least generalized (most specific) of the cuboids.
• The apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty. It
contains the total sum of all sales.
• The apex cuboid is the most generalized (least specific) of the cuboids, and is often
denoted as all.
• If we start at the apex cuboid and explore downward in the lattice, this is equivalent to
drilling down within the data cube.
• If we start at the base cuboid and explore upward, this is akin to rolling up.

Datawarehouse implementation
• An SQL query containing no group-by (e.g., “compute the sum of total sales”) is a zero dimensional
operation. An SQL query containing one group-by (e.g., “compute the sum of sales, group-by city”) is a one-
dimensional operation. A cube operator on n dimensions is equivalent to a collection of group-by
statements, one for each subset of the n dimensions.
• Therefore, the cube operator is the n-dimensional generalization of the group-by operator.
• Precomputation leads to fast response time and avoids some redundant computation. Most, if not all,
OLAP products resort to some degree of precomputation of multidimensional aggregates.
• A major challenge related to this precomputation is that the required storage space may explode if all the
cuboids in a data cube are precomputed, especially when the cube has many dimensions.
• The storage requirements are even more excessive when many of the dimensions have associated concept
hierarchies, each with multiple levels.
• This problem is referred to as the curse of dimensionality.
• “How many cuboids are there in an n-dimensional data cube?” If there were no hierarchies associated
with each dimension, then the total number of cuboids for an n-dimensional data cube, is 2n. However,
in practice, many dimensions do have hierarchies.

52
05-05-2022

Partial Materialization: Selected Computation


of Cuboids
There are three choices for data cube materialization given a base cuboid:
• No materialization: Do not precompute any of the “nonbase” cuboids. This leads to computing expensive
multidimensional aggregates on-the-fly, which can be extremely slow.
• Full materialization: Precompute all of the cuboids. The resulting lattice of computed cuboids is referred to
as the full cube. This choice typically requires huge amounts of memory space in order to store all of the
precomputed cuboids.
• Partial materialization: Selectively compute a proper subset of the whole set of possible cuboids.
Alternatively, we may compute a subset of the cube, which contains only those cells that satisfy some
user-specified criterion, such as where the tuple count of each cell is above some threshold.

• Partial materialization represents an interesting trade-off between storage space and


response time.
 during load and refresh, the materialized cuboids should be updated efficiently.
 Parallelism and incremental update techniques for this operation should be explored.

Cloud based Datawarehouse


• As on-premises data warehouses are prone to inflexible storage capacity, technical
difficulties, and high operational overhead due to hardware maintenance needs, many
businesses are moving their data warehousing to the cloud.
• Cloud-based data warehouses take advantage of the core benefits associated with on-
demand computing including far-reaching user access, seemingly limitless storage, and
increased computational capacity, as well as the ability to scale while paying only for
what is used.
 Popular cloud-based data warehouses are Amazon Redshift, Microsoft Azure, and SnowflakeDB.
• When your data warehouse is in the cloud, data integration tools play a critical role in
turning your data into useful, actionable information.
• Traditional extract, transform, and load tools cannot scale to move shifting amounts of
Big Data to a cloud data warehouse.
• Snaplogic’s web-based iPaaS and AI-powered tools enable customers to integrate their
data seamlessly and without additional coding.

53
05-05-2022

Data Warehouse Architecture: Traditional vs. Cloud


• A data warehouse is an electronic system that gathers data from a wide range of sources
within a company and uses the data to support management decision-making.

• Companies are increasingly moving towards cloud-based data warehouses instead of


traditional on-premise systems. Cloud-based data warehouses differ from traditional
warehouses in the following ways:
 There is no need to purchase physical hardware.
 It’s quicker and cheaper to set up and scale cloud data warehouses.

• Cloud-based data warehouse architectures can typically perform complex analytical


queries much faster because they use massively parallel processing (MPP).

cloud data warehouse architectures


There are two main camps:
• The first, older deployment architecture is cluster-based: Amazon Redshift and
Azure SQL Data Warehouse fall into this category. Typically, clustered cloud data
warehouses are really just clustered Postgres derivatives, ported to run as a service
in the cloud.

• The other flavor, serverless, is more modern and counts Google BigQuery (ver 2)
and Snowflake as examples. Essentially, serverless cloud data warehouses make the
database cluster “invisible” or shared across many clients.

54
05-05-2022

Data Engineer skill set


Data Engineer/Scientist: The Candidates must be MCA, or BE / B.Tech (Computer
Science) with more than 2+ years’ experience.
Design & develop Data acquisition.
Develop data set processes
Prepare data for predictive and prescriptive modelling.
Setting up ETL/ELT routine.
Building and automating data pipelines and data-infrastructure Kafka,
Ezmeral Event Data Streams,
MapR Data Fabric, MapR-DB,
Apache Spark, Drill, LLAP, OLTP, OLAP, Delta Lake

index OLAP data by bitmap indexing and join indexing


• To facilitate efficient data accessing, most data warehouse systems support index
structures and materialized views (using cuboids).
• The bitmap indexing method is popular in OLAP products because it allows quick
searching in data cubes.
• The bitmap index is an alternative representation of the record ID (RID) list.
• In the bitmap index for a given attribute, there is a distinct bit vector, Bv, for each
value v in the attribute’s domain.
 If a given attribute’s domain con- sists of n values, then n bits are needed for each entry in the
bitmap index (i.e., there are n bit vectors).
 If the attribute has the value v for a given row in the data table, then the bit representing that
value is set to 1 in the corresponding row of the bitmap index.
 All other bits for that row are set to 0.

55
05-05-2022

Indexing OLAP Data: Bitmap Index


• Index on a particular column
• Each value in the column has a bit vector: bit-op is fast
• The length of the bit vector: # of records in the base table
• The i-th bit is set if the i-th row of the base table has the value for the indexed column
• not suitable for high cardinality domains (suitable method B+ Trees)
 A recent bit compression technique, Word-Aligned Hybrid (WAH), makes it work for high cardinality
domain as well [Wu, et al. TODS’06]
Take examples of Bitwise operators to return tuples that satisfy a query.

Base table
Index on Region Index on Type
Cust Region Type RecID Asia Europe Am erica RecID Retail Dealer
C1 Asia Retail 1 1 0 0 1 1 0
C2 Europe Dealer 2 0 1 0 2 0 1
C3 Asia Dealer 3 1 0 0 3 0 1
C4 America Retail 4 0 0 1 4 1 0
C5 Europe Dealer 5 0 1 0 5 0 1

Indexing OLAP Data: Join Indices


• Join index: JI(R-id, S-id) where R (R-id, …)  S (S-id, …)

• Traditional indices map the values to a list of record ids


It materializes relational join in JI file and speeds up relational join

• In data warehouses, join index relates the values of the dimensions of a start schema to
rows in the fact table.

E.g. fact table: Sales and two dimensions city and product
A join index on city maintains for each distinct city a list of R-IDs of the tuples
recording the Sales in the city
• Join indices can span multiple dimensions

56
05-05-2022

Efficient Processing OLAP Queries


• Determine which operations should be performed on the available cuboids
• Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection

• Determine which materialized cuboid(s) should be selected for OLAP op.


• Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there
are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?

• Explore indexing structures and compressed vs. dense array structs in MOLAP

57
05-05-2022

Attribute-Oriented Induction
• Proposed in 1989 (KDD ‘89 workshop)
• Not confined to categorical data nor particular measures
• How it is done?
• Collect the task-relevant data (initial relation) using a relational database query
• Perform generalization by attribute removal or attribute generalization
• Apply aggregation by merging identical, generalized tuples and accumulating
their respective counts
• Interaction with users for knowledge presentation

Attribute-Oriented Induction: An Example


Example: Describe general characteristics of graduate students in the University
database
• Step 1. Fetch relevant set of data using an SQL statement, e.g.,
Select * (i.e., name, gender, major, birth_place, birth_date, residence, phone#, gpa)
from student
where student_status in {“Msc”, “MBA”, “PhD” }

• Step 2. Perform attribute-oriented induction


• Step 3. Present results in generalized relation, cross-tab, or rule forms

58
05-05-2022

Class Characterization: An Example

Name Gender Major Birth-Place Birth_date Residence Phone # GPA

Initial Jim M CS Vancouver,BC, 8-12-76 3511 Main St., 687-4598 3.67


Woodman Canada Richmond
Relation Scott M CS Montreal, Que, 28-7-75 345 1st Ave., 253-9106 3.70
Lachance Canada Richmond
Laura Lee F Physics Seattle, WA, USA 25-8-70 125 Austin Ave., 420-5232 3.83
… … … … … Burnaby … …

Removed Retained Sci,Eng, Country Age range City Removed Excl,
Bus VG,..
Gender Major Birth_region Age_range Residence GPA Count
Prime M Science Canada 20-25 Richmond Very-good 16
Generalized F Science Foreign 25-30 Burnaby Excellent 22
Relation … … … … … … …

Birth_Region
Canada Foreign Total
Gender
M 16 14 30
F 10 22 32
Total 26 36 62

Basic Principles of Attribute-Oriented Induction


• Data focusing: task-relevant data, including dimensions, and the result is the initial
relation
• Attribute-removal: remove attribute A if there is a large set of distinct values for A but
(1) there is no generalization operator on A, or (2) A’s higher level concepts are
expressed in terms of other attributes
• Attribute-generalization: If there is a large set of distinct values for A, and there exists
a set of generalization operators on A, then select an operator and generalize A
• Attribute-threshold control: typical 2-8, specified/default
• Generalized relation threshold control: control the final relation/rule size

59
05-05-2022

Attribute-Oriented Induction: Basic Algorithm


• InitialRel: Query processing of task-relevant data, deriving the initial
relation.
• PreGen: Based on the analysis of the number of distinct values in each
attribute, determine generalization plan for each attribute: removal? or
how high to generalize?
• PrimeGen: Based on the PreGen plan, perform generalization to the right
level to derive a “prime generalized relation”, accumulating the counts.
• Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3)
mapping into rules, cross tabs, visualization presentations.

Presentation of Generalized Results


• Generalized relation:
• Relations where some or all attributes are generalized, with counts or other
aggregation values accumulated.
• Cross tabulation:
• Mapping results into cross tabulation form (similar to contingency tables).
• Visualization techniques:
• Pie charts, bar charts, curves, cubes, and other visual forms.
• Quantitative characteristic rules:
• Mapping generalized result into characteristic rules with quantitative information
associated with it, e.g.,
grad ( x)  male ( x) 
birth _ region ( x) "Canada"[t :53%]  birth _ region ( x) " foreign"[t : 47%] .

60
05-05-2022

Mining Class Comparisons


• Comparison: Comparing two or more classes
• Method:
• Partition the set of relevant data into the target class and the contrasting class(es)
• Generalize both classes to the same high level concepts
• Compare tuples with the same high level descriptions
• Present for every tuple its description and two measures
• support - distribution within single class
• comparison - distribution between classes
• Highlight the tuples with strong discriminant features
• Relevance Analysis:
• Find attributes (features) which best distinguish different classes

Concept Description vs. Cube-Based OLAP


• Similarity:
• Data generalization
• Presentation of data summarization at multiple levels of abstraction
• Interactive drilling, pivoting, slicing and dicing
• Differences:
• OLAP has systematic preprocessing, query independent, and can drill down to
rather low level
• AOI has automated desired level allocation, and may perform dimension relevance
analysis/ranking when there are many relevant dimensions
• AOI works on the data which are not in relational forms

61
05-05-2022

Summary
• Data warehousing: A multi-dimensional model of a data warehouse
• A data cube consists of dimensions & measures
• Star schema, snowflake schema, fact constellations
• OLAP operations: drilling, rolling, slicing, dicing and pivoting
• Data Warehouse Architecture, Design, and Usage
• Multi-tiered architecture
• Implementation: Efficient computation of data cubes
• OLAP query processing
• OLAP servers: ROLAP, MOLAP, HOLAP
• Bitmap index
• Cloud based Data warehouses: Introduction

62

You might also like