MultidimensionalDataModeling UnitIV
MultidimensionalDataModeling UnitIV
Data Model
A data model is a diagrammatic representation of the data and the relationship
between its different entities. It assists in identifying how the entities are
related through a visual representation of their relationships and thus helps
reduce possible errors in the database design. It helps in building a robust
database/data warehouse.
Types of Data Model
Conceptual Data Model
Logical Data Model
Physical Data Model
Conceptual Data Model
The conceptual data model is designed by identifying the various entities and the highest-
level relationships between them as per the given requirements.
Let us look at some features of a conceptual data mpdel-
• It identifies the most important entities.
• It identifies relationships between different entities.
• It does not support the specification of attributes.
• It does not support the specification of the primary key.
Going back to the requirement specification of TenToTen Stores, let us design the conceptual
data model (Next Slide).
In this case, the entities can be identified as
• Category (to store the category details of products).
• SubCategory (to store the details of sub-categories that belong to different categories)
• Product (to store product details).
• PromotionOffer (to store various promotion offers introduced by the company to sell products)
• ProductOffer (to map the promotion offer to a product).
• Date (to keep track of the sale date and also to analyze sales in different time periods)
• Territory (to store various territories where the stores are located).
• MarketType (to store details of various market setups, viz. “Hypermarkets &“Traditional
Supermarket”, “Dollar Store”, and “Super Warehouse”).
• OperatorType (to store the details of types of operator, viz. company-operated or franchise)
• Outlet (to store the details of various stores distributed over various locations).
• Sales (to store all the daily transactions made at various stores)
Logical Data Model
The logical data model is used to describe data in as much detail as possible. While describing
the data, no consideration is given to the physical implementation aspect.
Normalization:
1NF
2NF
3NF and soon
Outcome of Logical Data Model
Outcome of Logical Data Model
Outcome of Logical Data Model
Outcome of Logical Data Model
Outcome of Logical Data Model
To Conclude about Conceptual Data Model
• We have identified the various entities from the requirements specification.
• We have identified the various attributes for each entity.
• We have also identified the relationship that the entities share with each
other (Primary key-Foreign Key).
Types of Fact:
Additive facts: These are the facts that can be summed up/aggregated across all dimensions in a fact table. For
example, discrete numerical measures of activity — quantity sold, dollars sold, etc.
Consider a scenario where a retail store “Northwind Traders” wants to analyze the revenue generated. The
revenue generated can be by the employee who is selling the products; or it can be in terms of any combination
of multiple dimensions. Products, time, region, and employee are the dimensions in this case.
The revenue,which is a fact, can be aggregated along any of the above dimensions to give the total revenue
along that dimension. Such scenarios where the fact can be aggregated along all the dimensions make the fact a
fully additive or just an additive fact. Here revenue is the additive fact.
Consider a scenario where the “Northwind Traders” warehouse manager needs to find the total number of
products in the inventory. One inherent characteristic of any inventory is that there will be incoming products to
the inventory from the manufacturing plants and outgoing products from the inventory to the distribution
centres or retail outlets.
So if the total products in the inventory need to be found out, say, at the end of a month, it cannot be a simple
sum of the products in the inventory of individual days of that month. Actually, it is a combination of addition of
incoming products and subtraction of outgoing ones. This means the inventory level cannot be aggregated
along the “time” dimension.
But if a company has warehouses in multiple regions and would like to find the total products in inventory
across those warehouses, a meaningful number can be arrived at by aggregating inventory levels across those
warehouses. This simply means inventory levels can be aggregated along the “region” dimension. Such
scenarios where a fact can be aggregated along some dimensions but not along all dimensions give rise to
semi-additive facts. In this case, the number of products in inventory or the inventory level is the semi-
additive fact.
Let us discuss another example of semi-additive facts.
Figure depicts the “AccountsFact” fact table along with its
corresponding dimension tables. The “AccountsFact” fact table has
two measures :“CurrentBalance” and “ProfitMargin”. It has two
dimension keys: “DatelD” and “AccountID”. “CurrentBalance” is a semi-
additive fact. It makes sense to add up current balances for all
accounts to get the information on “what's the total current balance
for all accounts in the bank?” However, it does not make sense to add
up current balances through time. It does not make sense to add up all
current balances through time. It does not make sense to add up all
current balance for a given account for a given account for each day of
the month. Similarly, “ProfitMargin” is another non-additive fact, as it
does not make sense to add profit margins at the account level or at
the day level.
Data Modeling Techniques – Dimensional Modeling- Non-Additive Facts:
Non Additive facts: These are the facts that cannot be summed up for some dimensions present in the fact
table. For example, measurement of room temperature, percentages, ratios, factless, facts, etc. Non additive
facts cannot be added meaningfully across any dimensions. In other words, non-additive facts are facts where
SUM operator cannot be used to produce any meaningful results. The following illustration will help you
understand why room temperature is a non-additive fact.
Date Temperature
5th May (7AM) 27
5th May (12 AM) 33
5th May (5 PM) 10
Sum 70 (Non-Meaningful result)
Average 23.3 (Meaningful result)
Examples of non-additive facts are:
Textual facts: Adding textual facts does not result in any number. However, counting textual facts may result in a sensible
number.
Per-unit prices: Adding unit prices does not produce any meaningful number. For example: the unit sales price or unit cost is
strictly non-addictive. But these prices can be multiplied with the number products sold and can be depicted as total sales
amount or total product cost in the fact table.
Percentages and ratios: A ratio, such as gross margin, is non-additive. Non-additive facts are usually the result of ratio or
other calculations, such as percentages.
Measures of intensity: Measures of intensity such as the room temperature are non-additive across all dimensions.
Summing the room temperature across different times of the day produces a totally non-meaningful number.
Averages: Facts based on averages are non-additive. For example, average sales price is non-additive. Adding all the average
unit prices produces a meaningless number.
Factless facts (event-based fact tables): Event fact tables are tables that record events. For example, event fact tables are
used to record events such as Webpage clicks and employee or student attendance. In an attendance recording scenario,
attendance can be recorded in terms of “yes” or “no” OR with pusedo facts like “1” or “0”. In such scenarios, we can count
the values but adding them will give invalid values. Factless facts are generally used to model the many-to-many relationships
or to track events that did or did not happen.
Data Modeling Techniques – Dimensional Modeling- Non-Additive Facts -
Example:
The following figure is an example of a “factless fact table” -“EventFact”. This
factless fact table has four dimension keys: “EventID”, “SpeakerID”,
“ParticipantID”, and “DateID”. It does nor have any measures or facts. This table
can be queried to get details on the events that are the most popular. It can
further be used to track events that did not happen. We can also use this table to
elicit information about events that were the least popular or that were not
attended.
An Example of Multidimensional Modeling
Alex is excited. He will be travelling to the USA for business-related work. He has carefully planned his itinerary.
Before embarking on the journey, he wants to check the weather in various US cities. He has searched the
Internet to get the required information for the coming week. He has a table of data before him which looks like
as shown below:
City Name DateDetails MinTemp MaxTemp
Los Angels 22-05-2011 86 105
San Frasisco 22-05-2011 78 107
Phoenix 22-05-2011 88 98
Los Angels 23-05-2011 82 106
San Francisco 23-05-2011 76 104
Phoenix 23-05-2011 86 96
In the above table, we have two dimensions, say, the “Geography” dimension and the “Time” dimension. “NameofCity” and
“DateDetails” are attributes of the geography and time dimension respectively. There are also two facts, “MinTemp” and
“MaxTemp”. Using this table, it is possible to find out information about the maximum daily temperatures and the minimum
daily temperatures for any group of cities or group of days. Now let us assume that we wish to view the maximum and
minimum temperatures for states. A city belongs to a state. Let us add an attribute “State” to the “Geography” dimension.
The relationship between the state and the city is as depicted in the following figure:
A state can have multiple cities. The relationship is one-to-many from the state to State State
cities. Now assume that we wish to have a look at the minimum and maximum
temperatures by counties. This can be achieved by adding yet another attribute
“County” to the geography dimension. The relationship between the state and
county is as depicted in figure. The relationship is many from the state to counties.
You already know that temperature is a non-additive fact. However, one can look
City County
at the average temp for cities or states or for different time periods or for a
combination of geography and time.
City
What Are Dimensions/Dimension Tables?
• Dimension tables consist of dimension attributes which describe the
dimension elements to enhance comprehension.
• Dimension attributes (descriptive) are typically static values containing
discrete numbers which behave as text values.
• Main functionalities :
Query filtering\constraining
Query result set labeling
• The dimension attribute must be
Complete: Dimension attributes must not contain missing values.
Verbose: Labels must consist of full words.
Descriptive: The dimension attribute names must be able to convey the
purpose of the dimension element in as few and simple words as possible.
Discrete values: Dimension attributes must contain only one value per row
in dimension table.
Quality assured: Dimension attributes must not contain misspelt values or
impossible values.
Dimension Hierarchies
• A dimension hierarchy is a cascaded series of many-to-one relationships and
consists of different levels. Each level in a hierarchy corresponds to a
dimension attribute. Hierarchies document the relationship between different
levels in a dimension.
• A dimension hierarchy may also be described as a set of parent-child
relationships attributes present within a dimension. These hierarchy attributes,
also known as levels, roll up a child to parent. For example, Customer totals
can roll up to Sub-region totals which can further roll up to Region totals. A
better example would be — daily sales could roll up to weekly sales, which
further roll up to month to quarter to yearly sales. Let us understand the
concept of hierarchy through the example. In this example, the Product
hierarchy is like this
Degenerate
Dimension
Rapidly Junk
Changing (garbage)
Dimension Dimension
Dimension
Slowly Type
Changing Role-playing
Dimension Dimension
Dimension Tables – Degenerate Dimension
A degenerate dimension is a data that is dimension in temperament but is present in a fact table. It is a
dimension without any attributes. Usually, a degenerate dimension is a transaction-based number. There can
be more than one degenerate dimension in a fact table.
Degenerate dimensions often cause confusion as they don’t feel or look like normal dimensions. They act as
dimension keys in fact tables; however, they are not joined to corresponding dimensions in other dimension
tables as all their attributes are already present in other dimension tables.
Degenerate dimensions can also be called textual facts, but they are not facts as the primary key for the fact
table is often a combination of dimensional foreign keys and degenerate dimensions. As already stated, a fact
table can have more than one degenerate dimension. For example, an insurance claim line fact table typically
includes both claim and policy numbers as degenerate dimensions. A manufacturer can include degenerate
dimensions for the quote, order, and bill of lading numbers in the shipments fact table.
This figure depicts a PointOfSalesFact table along with
other dimension tables. The “PointOfSalesFact” has two
measures: AmountTransacted and QuantitySold. It has
the following dimension keys: DateKey that links the
“PointOfSaleFact” to “DimDate”, ProductID that links the
“PointOfSaleFact” to “DimProduct” and “StoreID” that
links the “PointOfSaleFact” to “DimStore”. Here,
TransactionNo is a degenrate dimension as it is a
dimension key without a corresponding dimension table.
All information/details pertaining to the transaction are
extracted and stored in the “PointOfSaleFact” table
itself;therefore, there is no need to have a separate
dimension table to store the attributes of the
transaction.
Dimension Tables – Slowly Changing Dimension (SCD)
In a dimension model, dimension attributes are not fixed as their values can change slowly over a period of
time. Here comes the role of a slowly changing dimension. A slowly changing dimension is a dimension whose
attribute/attributes for a record (row) change slowly over time, rather than change on a regularly timely basis.
Let us assume a company sells car-related accessories. The company decides to assign a new sales territory,
Los Angeles, to its sales representative, Bret Watson, who earlier operated from Chicago. How can you record
the change without making it appear that Watson earlier held Chicago?
Let us take a look at the original record of Bret Watson: Now the original record has to be changed as Bret
Watson has been assigned “Los Angeles” as his sales territory, effective May 1, 2011. This would be done
through a slowly changing dimension. Given below are the approaches for handling a slowly changing
dimension:
Type-I (Overwriting the History)
In this approach, the existing dimension attribute is overwritten with new data, and hence no history is preserved.
This approach is used when correcting data errors present in a field, such as a word spelled incorrectly.
SalesRepID SalesRepName SalesTerritory
1001 Bret Watson LosAngels
Region
RegionID
Country Code
Country
State Code
Country Code
City Code Country Name
State Code
City Code State
Decreases performance because more tables will need to be joined to satisfy queries
State code
State Name
City
City Code
City Code
City Name
ZIP
Why not to Snowflake?
Normally, you should avoid snowflaking or normalization of a dimension table,
unless required and appropriate. Snowflaking reduces space consumed by
dimension tables, but compared with entire data warehouse the saving is usually
insignificant.
Do not snowflake hierarchies of one dimension table into separate tables.
Hierarchies should belong to the dimension table only and should never be
snowflaked. Multiple hierarchies can belong to the same dimension if the
dimension has been designed at the lowest possible detail.
Data Model for Fact Constellation Schemaof TenToTen Stores
The constellation schema is shaped like a constellation of stars (i.e. Star schemas). This is more
complex than Star or Snowflake schema variations, as it contains multiple fact tables. This allows
the dimension tables to be shared among the various fact tables. It is also called “Galaxy schema”.
The main disadvantage of the fact constellation is more complicated design because multiple
aggregations must be taken into consideration (Figure below).
Dimensional Modeling Life Cycle
Phases of Dimensional Modeling Life Cycle:
1. Requirements gathering
2. Identifying the grain
3. Identifying the dimensions
4. Identifying the facts
5. Designing the dimensional model
Understanding Dimension – Cube
Dimension Hierarchy
Grain
Fact
Testing
N .America
Consulting
Europe Production Support
Asia Pacific
Q1 Q2 Q3 Q4
Recap (contd.)
• Difference between OLTP and OLAP
Comparison of OLTP and DSS
• Number of employees added in the role of the company during the last
quarter/6 months/1 year
OLTP or OLAP
Introduction to Dimensional Modeling (DM)
• Grain
• Fact
• Dimension
• Cube
• Star
• Snowflake
Of hierarchies and levels…
What Is a Grain?
• Identifying the grain also means deciding the level of detail that will be
made available in the dimensional model
Additive
Facts
Factless Fact
Semi Non
Additive Additive
And what about descriptive data?
Answer a Quick Question
Region
RegionID
Country Code
Country
State Code
Country Code
City Code Country Name
State Code
City Code State
Decreases performance because more tables will need to be joined to satisfy queries
State code
State Name
City
City Code
City Code
City Name
ZIP
Armed with these weapons that we call ‘Concepts’, let’s
step into the battlefield!
Case Study
Conversion of a ER Model to a Dimensional Model
PRODUCT
CUSTOMER
ER Diagram SKU (PK)
customer_ID (PK)
Description
customer_name
category
credit_profile
brand
purchase_profile
address
ORDER ORDER-LINE
STORE order_num (PK) order_num (PK) (FK)
store_ID (PK) customer_ID (FK) SKU (PK) (FK)
store_name store_ID (FK) promotion_key (FK)
floor_type clerk_ID (FK) dollars_cost
address date dollars_sold
district units_sold
CLERK PROMOTION
clerk_id (PK) promotion_NUM (PK)
clerk_name promotion_name
clerk_grade ad_type
price_type
DIMENSONAL PRODUCT
TIME
MODEL product_key (PK)
time_key (PK)
SKU
SQL_date
category
day_of_week
FACT description
month
time_key (FK) brand
customer_key (FK)
STORE store_key (FK) CUSTOMER
store_key (PK) clerk_key (FK) customer_key (PK)
store_ID product_key (FK) customer_name
store_name promotion_key credit_profile
floor_type (FK) purchase_profile
address dollars_cost address
district dollars_sold
units_sold
CLERK PROMOTION
clerk_key (PK) promotion_key (PK)
clerk_id promotion_name
clerk_name ad_type
clerk_grade price_type
Summary
• Basics of Database
• OLTP
• MDDM
• Cube
• Star Schema
• Snowflake schema
Food for Thought!