Unit 4 Data Modeling: Structure

Download as pdf or txt
Download as pdf or txt
You are on page 1of 27

Business Intelligence and Tools Unit 4

Sikkim Manipal University Page No. 80


Unit 4 Data Modeling
Structure:
4.1 Introduction
Objectives
4.2 The Concept of Data Modeling
4.3 Data Modeling Techniques
4.4 Entity-Relationship (E-R) Modeling
4.4.1 Basic Concepts
4.4.2 Other Concepts
Self Assessment Question(s) (SAQs)
4.5 Dimensional Modeling
4.5.1 Basic Concept
4.5.2 Star and Snowflake Models
Self Assessment Question(s) (SAQs)
4.6 Database Design Methodology for Data Warehouses
Self Assessment Question(s) (SAQs)
4.7 Common Mistakes in Data Modeling
4.7.1 Failure to address Long term growth
4.7.2 No Mechanism to clean-out old Data
4.7.3 Overuse of Synthetic keys
4.7.4 Inadequate Planning for Indexes
Self Assessment Question(s) (SAQs)
4.8 Organizing the Data in a Data Warehouse
4.9 Summary
4.10 Terminal Questions (TQs)
4.11 Multiple Choice Questions (MCQs)
4.12 Answers to SAQs, TQs, and MCQs
4.12.1 Answers to Self Assessment Questions (SAQs)
4.12.2 Answers to Terminal Questions (TQs)
4.12.3 Answers to Multiple Choice Questions (MCQs)
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 81
4.1 Introduction
Data warehousing has become the most accepted approach for providing
an integrated, consistent source of data for use in business decision-
making. But the question that comes into the picture is how to model the
data in a data warehouse so that the data can best support the decision-
making processes of an organization.
In this Unit, we consider two basic data modeling techniques; ER modeling
and Dimensional modeling. The ER Modeling has been the choice for the
operational environment, but there is an increased interest in the use of
dimensional modeling in these days for a business intelligence environment.
Objectives
The objectives of the Unit are:
The purpose of data modeling in building a data warehouse
The techniques of data modeling
Common mistakes in data modeling
4.2 The Concept of Data Modeling
In simple terms, data; is a record of all business activities, resources, and
results of an organization and data model is a well-organized abstraction of
that data. So, it is quite natural that the data model has become the best
method to understand and manage the business of the organization.
Without a data model, it would be very difficult to organize the structure and
contents of the data in the data warehouse.
These data models are critical in understanding the data in an analytical
environment of a data warehouse and so act as a blueprint for the
information requirements of an organization. The purpose of modeling is to
provide an accurate record of some aspect of the real world in a particular
context. The objective of the modeling is to provide:
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 82
A record of accurate and meaningful business data definitions.
Identification of valid, consistent business data structures that contain
sufficient information to run and manage the business.
An indication of the similarities and differences between data from
different sources and the relationship between them.
4.3 Data Modeling Techniques
There are two data modeling techniques that are relevant in a data
warehousing environment. They are Entity Relationship modeling (ER
modeling) and dimensional modeling.
ER modeling produces a data model of the specific area of interest,
using two basic concepts: Entities and the Relationships between
them. A detailed ER model may also contain attributes, which can be
properties of either the entities or the relationships. The ER model is an
abstraction tool as it can be used to simplify, understand and analyze
the ambiguous data relationships in the real business world.
Dimensional modeling uses three basic concepts: Facts, Dimensions
and Measures. Dimensional modeling is powerful in representing the
requirements of the business user in the context of database tables and
also in the area of data warehousing.
Both ER and dimensional modeling can be used to create an abstract model
of a specific subject. However, each of them has its own limited set of
modeling concepts and associated notation conventions. Consequently, the
techniques seem different, and they are indeed different in terms of
semantic representation. There is much debate as to which method is better
and the conditions under which a specific technique is to be selected. There
can be no definite answer, understanding of the circumstances and the
business requirements finally lead to selection of an appropriate technique.
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 83
4.4 Entity- Relationship (E-R) Modeling
4.4.1 Basic Concepts
An ER model is represented by an ER diagram, which uses three basic
graphic symbols to conceptualize the data: entity, relationship, and attribute.
Entity
An entity is defined to be a person, place, thing, or event of interest to the
business or the organization. It represents a class of objects, which are
things in the real business world that can be observed and classified by their
properties and characteristics. In general, an entity has its own business
definition and a clear boundary definition that is required to describe what is
included and what is not.
In a practical modeling project, the team members share a definition
template for integration and a consistent entity definition in the model. In
case of a high-level business modeling, an entity can be very generic, but it
must be quite specific in the detailed logical modeling. There are four
entities; PRODUCT, PRODUCT MODEL, PRODUCT COMPONENT, and
COMPONENT in the ER diagram (Refer Figure 4.1) and are represented as
rectangles.
Fig. 4.1: A Simpl e ER Model
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 84
The four diagonal lines on the corners of the PRODUCT COMPONENT
entity represent that the entity is an associative entity and the entity is to
resolve the many-to-many relationship between two entities. PRODUCT
MODEL and COMPONENT are independent of each other but have a
business relationship between them. A PRODUCT MODEL consists of
many components and a component is related to many product models.
With this business rule, you cannot tell which components make up a
product model. To do this, you can define a resolving entity. For example,
the PRODUCT COMPONENT entity can provide the information about
which components are related to which product model.
In ER modeling, naming the entities is important for easy understanding and
clear communication. It is expressed grammatically in the form of a noun
rather than a verb and the criteria for selecting an entity name depend on
how well the name represents the characteristics and scope of the entity.
Also, defining a unique identifier of an entity is the most critical task. These
unique identifiers are called candidate keys. Among them, you can select
the key that is most commonly used to identify the entity, called primary
key.
Relationship
Relationships represent the structural interaction and association among the
entities in a model and they are represented with lines drawn between the
two specific entities. Generally, a relationship is named grammatically by a
verb (such as owns, belongs, and has) and the relationship between the
entities can be defined in terms of the cardinality. Cardinality represents the
maximum number of instances of one entity that are related to a single
instance in another table and vice versa. Thus the possible cardinalities
include one-to-one (1:1), one-to-many (1:M), and many-to-many (M:M). In a
detailed normalized ER model, any M:M relationship is not shown because it
is resolved to an associative entity.
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 85
Attributes
Attributes describe the characteristics of properties of the entities. The
Product ID, Description, and Picture are attributes of the PRODUCT entity in
Figure 4.1. The name of an attribute has to be unique in an entity and
should be self-explanatory to ensure clarity. For example, rather naming
date1 and date2, you may use the names; order date and delivery date.
When an instance has no value for an attribute, the minimum cardinality of
the attribute is zero, which means either nullable or optional.
In Figure 4.1, you can see the characters P, m, o, and F that stand for
primary key, mandatory, optional, and foreign key. The Picture attribute of
the PRODUCT entity is optional, which means it is nullable. A foreign key of
an entity is defined to be the primary key of another entity. In figure 4.1, the
Product ID attribute of the PRODUCT MODEL entity is a foreign key as it is
the primary key of the PRODUCT entity. These foreign keys are useful in
determining the relationships such as the referential integrity between the
entities.
4.4.2 Other Concepts
Supertype and Subtype
An entity can have subtypes and supertypes and the relationship between a
supertype entity and its subtype entity is an IS A relationship. An IS A
relationship is used where one entity is a generalization of several more
specialized entities. The supertype and subtype relationship is represented
by a triangle on the relationship. Figure 4.2 shows an example of supertype
and subtype entities wherein SALES OUTLET is the supertype of RETAIL
STORE and CORPORATE SALES OFFICE and RETAIL STORE,
CORPORATE SALES OFFICE are subtypes of SALES OUTLET. Here,
each subtype entity inherits attributes from its supertype entity.
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 86
Also, each subtype entity can have its own distinctive attributes. In the
example provided above, Region ID and Outlet ID are inherited attributes
and the sub entities have their own attributes (such as number of cash
registers and floor space of the RETAIL STORE subentity). The practical
benefit of supertyping and subtyping is that they make a data model more
directly expressive. J ust by looking at the ER diagram, you can see that
sales outlets are composed of retail stores and corporate sales offices.
Fig. 4.2: Supertype and Subtype
Other important concepts in the area of ER modeling are domain and
normalization.
A domain consists of all the possible acceptable values and categories
that are allowed for an attribute. It is the set of all real possible
occurrences. The format or data type, such as integer, date, and
character, provides a clear definition of domain. The practical benefit of
domain is that it is imperative for building the data dictionary or
repository, and for implementing the database consequently.
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 87
Normalization is a process of assigning the attributes to entities which in
a way reduces data redundancy, avoids data anomalies, provides a
solid architecture for updating data, and reinforces the long-term
integrity of the data model (the third normal form is usually adequate).
Self Assessment Question(s) (SAQs)
For Section 4.4
1. Discuss the basic concepts involved in ER modeling?
4.5 Dimensional Modeling
Dimensional modeling is a relatively new concept compared to ER
modeling. This method is simpler, more expressive, and easier to
understand. This technique is mainly aimed at conceptualizing and
visualizing data models as a set of measures that are described by common
aspects of the business. It is useful for summarizing and rearranging the
data and presenting views of the data to support data analysis. Also, the
technique focuses on numeric data, such as values, counts, and weights.
The basic concepts of this technique are discussed below:
4.5.1 Basic Concepts
Fact
A fact is a collection of related data items, consisting of measures and
context data. A fact represents a business item, a business transaction, or
an event that can be used to analyze the business or a business process. In
a data warehouse, facts are implemented in the core tables in which the
entire numeric data is stored.
Dimension
A dimension is a collection of members or units of the same type of views.
Usually, it is represented by an axis. In a dimensional model, every data
point in the fact table is associated with one and only one member from
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 88
each of the multiple dimensions. Thus dimensions determine the contextual
background for the facts. Many analytical processes are used to quantify the
impact of dimensions on the facts. Dimensions are the parameters over
which you can perform Online Analytical Processing (OLAP). For example,
in a database for analyzing all sales of products, the common dimensions
could be; time, location/region, customers, salesperson, scenarios such as
actual, budgeted, or estimated numbers. Generally, dimensions are mapped
to non-numeric, informative entities such as branch or employee.
A dimension contains several dimension members. A dimension member is
a distinct name or identifier used to determine a data item's position. For
instance, all weeks, months, quarters, and years make up a time dimension,
and all cities, regions, and countries make up a geography dimension. The
members of a dimension can be arranged into one or more hierarchies.
Each hierarchy can also have multiple hierarchy levels. Every member of a
dimension does not locate on one hierarchy structure.
Measure
A measure is a numeric attribute of a fact, representing the performance or
behavior of the business relative to the dimensions. For example, measures
are the sales in revenue, the sales volume, the quantity supplied, the supply
cost, etc. A measure is determined by combinations of the dimension
members and is located on facts. Here, the actual numbers are called as
variables.
The most popular way of visualizing a dimensional model is to use a three-
dimensional model using a cube. Though a dimensional model consists of
more than three dimensions (to be referred to as a hypercube), a three-
dimensional cube is more commonly used for practical convenience.
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 89
4.5.2 Star and Snowflake Models
There are two basic models that can be used in dimensional modeling:
Star model
Snowflake model
Star Model
After gathering the business requirements, the dimensional modeling begins
with identifying facts and dimensions and so the initial dimensional model is
usually star-like in appearance, with one fact in the center and one level of
several dimensions around it. This star model is the basic structure for a
dimensional model and has one large central table (a fact table) and a set of
smaller tables (the dimension tables) arranged in a radial pattern around the
central table (Refer Fig. 4.3).
Fig. 4.3: Star model
In the example provided in Figure 4.3, the sales is a fact table in the center
and the dimension tables (time, customer, seller, manufacturing location,
and product) are arranged around the fact table. Whereas, the traditional ER
model has an even and balanced style of entities and complex relationships
among entities, the dimensional model is very asymmetric. Even though the
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 90
fact table in the dimensional model is joined with all the other dimension
tables, there is only a single joining line connecting the fact table to the
dimension tables.
Snowflake Model
The snowflake model is derived from the star model and is the result of
decomposing one or more of the dimensions, which sometimes have
hierarchies themselves. You can define the many-to-one relationships
among members within a dimension table as a separate dimension table,
forming a hierarchy. For example, the seller dimension in Figure 4.3 can be
decomposed into sub-dimensions; outlet, region, and outlet type in
Figure 4.4.
Fig. 4.4: Snowflake Model
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 91
The decomposed snowflake structure visualizes the hierarchical structure of
dimensions and so it is easy for data modelers to understand the data
model and for database designers to analyze the dimensions. Developers
also prefer the snowflake model as it saves the data storage. However, this
structure seems more complex and so might make the business users feel
uncomfortable working with it compared to the simple star model.
Self Assessment Question(s) (SAQs)
For Section 4.5
1. Explain the significance of the dimensional data modeling technique?
4.6 Database Design Methodology for Data Warehouses
The steps involved in the database design methodology for data
warehouses are as follows:
Step 1 Choosing the Process
The process (function) refers to the subject matter of a particular data
mart.
Build the data mart and it should be the one that is most likely to be
delivered on time, within budget, and to answer the most commercially
important business questions.
Step 2 Choosing the Grain
Decide on what a record of the fact table is to represent.
Identify the dimensions of the fact table. The grain decision for the fact
table also determines the grain of each dimension table.
Also include time as a core dimension, which is always present in star
schemas.
Step 3 Identifying and conforming the Dimensions
Dimensions set the context for asking questions about the facts in the
fact table. If any dimension occurs in two data marts, they must be
exactly the same, or one must be a mathematical subset of the other.
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 92
Step 4 Choosing the Fact
The grain of the fact table determines which facts can be used in the
data mart.
Facts should be numeric and additive.
Unusable facts may include:
non-numeric facts
non-additive facts
Fact at different granularity from other facts in table.
Step 5 Storing the Pre-Calculations in the Fact Table
Once the facts have been selected, each of the facts is to be re-examined to
determine whether there are opportunities to use pre-calculations.
Step 6 Rounding out the Dimension Tables
Text descriptions are added to the dimension tables and they should be
as intuitive and understandable to the users as possible.
Usefulness of a data mart is determined by the scope and nature of the
attributes of the dimension tables.
Step 7: Choosing the duration of the database
Take appropriate duration measures (how far back in time the fact table
goes)
Step 8: Tracking slowly changing dimensions
Slowly changing dimension problem means that the proper description
of the old dimension data must be used with old fact data. Often, a
generalized key must be assigned to important dimensions in order to
distinguish multiple snapshots of dimensions over a period of time.
Three basic types of slowly changing dimensions:
Type 1, where a changed dimension attribute is overwritten.
Type 2, where changed dimensions attribute causes creation of a
new dimension record to be created.
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 93
Type 3, where a changed dimension attribute causes an alternate
attribute to be created so that both the old and new values of the
attribute are simultaneously accessible in the same dimension
record.
Step 9: Deciding the Query priorities and the Query Modes
Most critical physical design issues affecting the end-users perception
includes:
physical sort order of the fact table on disk
the presence of pre-stored summaries or aggregations
Additional physical design issues include administration, backup,
indexing performance, and security.
Self Assessment Question(s) (SAQs)
For Section 4.6
1. List out the steps involved in the database design methodology for data
warehouses and also provide various activities involved in each of these
steps.
4.7 Common Mistakes in Data Modeling
Following are some of the mistakes that development teams may make in
an application development project during development and implementation
of a logical and physical data model:
4.7.1 Failure to Address Long-term Growth
To construct a data model, you may use the commands (CREATE TABLE,
CREATE INDEX, etc.) to make your logical entities become physical tables.
But operations that work on small tables may not perform when the table
grows to production size because the tablespaces are not big enough. To
avoid this, you have to focus on the physical storage of large tables during
the design of the data model.
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 94
In this scenario, you need to
Estimate the size of the table and its long-term growth pattern
appropriately
Estimate the physical size of the tables adequately
Estimate the physical size of the indexes appropriately
Use this information to estimate the need for disk space both now and
several years in the future.
During development, validate the predictions by comparing predicted
bytes per row with the actual row sizes of the tables and indexes in the
development or test environments, and adjust the prediction model
accordingly.
4.7.2 No mechanism to clean-out old Data
When there is no plan for cleaning out old data, you may respond to
emergency shortages of space either by scrambling to delete larger
numbers of rows or by adding space on an ad hoc basis. But none of these
is a good solution. In Oracle, the solution to deal with this problem is range
partitioning (introduced in version 8.0). A range-partitioned table is
physically separated into several different partitions, and Oracle adds rows
to a particular partition automatically based on the value of a specific column
in the row (called the partition key).
The benefits of partitioning large tables include:
All rows in a given range can be eliminated with a single small, efficient
DDL transaction by simply dropping or truncating a partition.
For queries that have a condition limiting the value of the partition key,
such as partition key BETWEEN :lower_bound AND :upper_bound, the
query optimizer can skip partitions that will have no matching rows,
reducing I/O required to process the query, an operation known as
partition pruning.
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 95
Maintenance operations can proceed on individual partitions while the
rest of the table is still available to support users.
However, you need to make the following choices in advance to reap the
benefits of partitioning:
Which column to use as partition key.
Which intervals to use as partition ranges
Whether to partition the indexes, if so, how
When to add new partitions and drop old ones
4.7.3 Overuse of Synthetic Keys
Database designers frequently create synthetic key (or surrogate key)
because it serves the purpose of a primary key but has no real meaning to
the business. While there are valid reasons to use a synthetic key, it is going
overboard to extend the practice to every table in a data model, for these
reasons:
There are tables for which a synthetic key simply serves no purpose.
The primary key index is a waste of space, and the server must waste
time updating the index during inserts and deletes. One example would
be a table of log entries; it probably does not need a key at all. Another
example would be the detail table in a master/detail relationship. Detail
records are typically fetched by the master key; there is little benefit in
giving the detail rows their own synthetic key.
Simultaneous transactions that insert new rows must serialize on the
sequence generator as they each take turns getting new key values. If
there are dozens or hundreds of such transactions running in parallel, as
in an OLTP system, then such serializing will limit performance.
(Sequence caching can mitigate this serializing effect in Oracle Parallel
Server or Oracle Real Application Cluster, but it does not help at all on a
single-instance database.)
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 96
So use of synthetic key is useful only in the following situations:
The natural key is something that might change over time, such as a
car's license number. A synthetic key can remain fixed while the other
attributes change.
The natural key has many columns, and there are many places in your
application where a program must fetch a single row by key value. A
synthetic key would allow the application to identify a row by only one
value.
The natural key has many columns, and the table is the target of foreign
references. It is generally impractical to repeat all of the natural key
columns in the child tables; a synthetic key allows you to create foreign
references with just one column.
4.7.4 Inadequate Planning for Indexes
The most difficult task in creating a logical data model is to decide which
columns to include in indexes. But there are some rules of thumb that can
serve as a starting point. These generally require a decent understanding of
the way the tables will be accessed by the application programs, report
programs, and (if possible) ad hoc analysis.
Foreign Key Indexes
A foreign key index is useful only if at least one of the following is true:
When a query plan in your application starts by fetching rows in the
parent table, and then fetches rows in the child table by way of the
foreign key, then use of an index on the foreign key is useful.
When your application deletes rows in the parent table, or updates key
columns in the parent table, Oracle can use the foreign key index on the
child table to enforce the foreign key constraint.
If neither of these situations applies, then indexing the foreign key is not
worth.
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 97
Low-cardinality Columns
If you need an index for low-cardinality columns, you might use a bitmap
index, since bitmap indexes have many benefits when dealing with low-
cardinality columns. However, because a single block in a bitmap index can
contain thousands of rows, concurrency can be a problem during the inserts
or deletes. The only way to know for sure is to try a realistic benchmark with
and without the index.
Order of Indexed Columns
The order of the columns in the index can also affect its usefulness, though
this is less of an issue with the new skip-scanning feature of Oracle 9i.
Generally speaking, columns that are high-cardinality (i.e., have many
distinct values in the table) and are used in the WHERE clauses should
generally be first in the index. The idea is to allow the query processor to
find the rows you want by searching as few blocks as possible.
To understand this concept, consider a telephone book as an analogy. The
entries are sorted by last name, first name, street name, and house number.
When you are searching for a person by last name, the alphabetical
organization allows you to proceed to the correct line very quickly. But it
does not help at all if you are searching by street name, the third field in the
sort order. Instead, you would need a book where street name is the primary
sort field.
Reviewing Index Definitions
During development, you need to figure out which indexes you require.
Here, you may have to
determine the indexes you need by checking the query plans for
important queries that your application uses (Make sure the indexes are
adequate to support those queries)
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 98
determine the indexes not required (use index monitoring feature in
Oracle 9i)
The only purposes for indexes are to enforce constraints (primary key,
unique, and foreign key) and to improve the query performance. Any index
that does not do one of these two things is of no use.
Self Assessment Question(s) (SAQs)
For Section 4.7
1. Mention the frequent mistakes that the development teams may make
while developing and implementing a logical and physical data model in
an application development project and also suggest how one can avoid
making these mistakes?
4.8 Organizing the Data in a Data Warehouse
There are two leading approaches to organize the data in a data
warehouse; the Dimensional approach (advocated by Ralph Kimball) and
the Normalized approach (advocated by Bill Inmon).
In the "dimensional" approach, the transaction data is partitioned into
either measured "facts" (which are generally numeric data that captures
specific values) or "dimensions" (which contain the reference information
that gives each transaction its context). For example, a sales transaction
can be broken up into facts (like number of products ordered, the price
paid, etc.) and dimensions (like date, customer ID, product, geographical
location, salesperson, etc.). The advantage of this approach is that it is
easy for the business staff to understand and use the data warehouse.
Also, the data warehouse tends to operate very quickly because the
data is pre-joined into the dimensional form. However, the disadvantage
of this approach is that it is quite difficult to incorporate the changes if
the company changes the way in which it does business.
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 99
In the "normalized" approach, the data in the data warehouse is stored
in third normal form. The tables in the database are then grouped
together by subject areas so as to reflect the general definition of the
data (Customer, Product, Finance, etc.). The advantage of this approach
is that it is easy to modify the information in the database. But the
disadvantage of this approach is that because of the number of tables
involved, it can be rather slow to produce information and reports.
Furthermore, since the segregation of facts and dimensions is not
explicit in this type of data model, it is difficult for users to join the
required data elements into meaningful information without a precise
understanding of the data structure.
Subject areas are just a method of organizing information and can be
defined along any lines. The traditional approach has subjects defined as
the subjects or nouns within a problem space. E.g. in a Financial Services
business area, you might have Customers, products, Contracts, etc. An
alternative approach is to organize around the business transactions. e.g.:
Customer enrollment, Sales, Trades.
4.9 Summary
A data model is a well-organized abstraction of the data in a data
warehouse. The purpose of modeling is to provide an accurate record of
some aspect of the real world in a particular context. Data modeling
techniques that are relevant in a data warehousing environment are Entity
Relationship modeling (ER modeling) and dimensional modeling. ER
modeling uses two basic concepts; entities and relationships between them.
Also, there are attributes, which can be properties of either the entities or
the relationships. An entity is defined to be a person, place, thing, or event
of interest to the business or the organization. It represents a class of
objects, which are things in the real business world that can be observed
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 100
and classified by their properties and characteristics. Relationships
represent the structural interaction and association among the entities in a
model and it is represented with lines drawn between the two specific
entities.
The dimensional modeling uses three basic concepts; measures, facts, and
dimensions. This is powerful in representing the requirements of the
business user in the context of database tables and also in the area of data
warehousing. A fact is a collection of related data items, consisting of
measures and context data. A dimension is a collection of members or units
of the same type of views. A measure is a numeric attribute of a fact,
representing the performance or behavior of the business relative to the
dimensions.
There are two basic models that can be used in dimensional modeling; Star
model and Snowflake model. This star model is the basic structure for a
dimensional model and has one large central table (a fact table) and a set of
smaller tables (the dimension tables) arranged in a radial pattern around the
central table. The snowflake model is derived from the star model and is the
result of decomposing one or more of the dimensions, which sometimes
have hierarchies themselves. The steps involved in the database design
methodology for data warehouses are as follows; choosing the process,
choosing the grain, identifying and conforming the dimensions, choosing the
facts, storing pre-calculations in the fact table, rounding out the dimension
tables, choosing the duration of the database, tracking slowly changing
dimensions, and deciding the query priorities and the query modes.
4.10 Terminal Questions (TQs)
1. Explain the significance of data modeling in the context of data
warehousing?
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 101
2. Discuss the similarities and differences between the two data modeling
techniques; ER modeling and Dimensional modeling?
3. What are the approaches involved in organizing the data in a
warehouse? Discuss the merits and demerits of each of these
approaches?
4.11 Multiple Choice Questions (MCQs)
1. Which of the following data model consists of Attributes?
a. ER Modeling
b. Dimensional modeling
c. Both (a) and (b)
d. None of the above
2. Which of the following statements is False?
a. ER model is a type of data model that uses the concepts; entities
and relationships
b. ER model is more preferred data model in the context of data
warehousing compared to any of the other data models
c. Dimensional modeling uses the concepts; measures, facts, and
dimensions
d. Attributes in an ER model provide the properties of the entities or the
relationships
3. In ER modeling, a Domain is _____________.
a. the existing set of entities in a data warehouse
b. the existing set of relationships between the entities
c. the set of all possible values and categories allowed for an attribute
d. the set of all possible values and categories allowed for a derived
attribute
4. The relationships between the entities in an ER model are defined in
terms of ___________.
a. Dimensions
b. Degree of relationship
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 102
c. Cardinality
d. Measures
5. The IS A relationship in an ER model exists between ______________.
a. Supertype and Subtype entities
b. Any of the two constraints
c. Attribute and derived attributes
d. Entities and their attributes
6. Which of the following is a numeric attribute of a Fact?
a. Data point
b. Derived fact
c. Dimension
d. Measure
7. The central table in a snowflake model is a ________
a. Dimension table
b. Fact
c. Measure
d. Can be any
8. Which of the following approaches are used to organize the data in a
data warehouse?
a. Entity-Relationship approach and Dimensional approach
b. Entity-Relationship approach and Normalized approach
c. Facts approach and Dimension approach
d. Dimensional approach and Normalized approach
9. In which of the following modeling techniques, the supertype and
subtype entities are commonly used?
a. ER Modeling
b. Dimensional modeling
c. Both (A) and (B)
d. None of the above
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 103
10. The normalized approach of organizing the data has been proposed by
__________.
a. Ralph Kimball
b. E.R. Muller
c. Andy Groove
d. Bill Inmon
4.12 Answers to SAQs, TQs, and MCQs
4.12.1 Answers to Self Assessment Questions (SAQs)
Section 4.4
1. ER modeling uses two basic concepts; entities and relationships
between them. Also, there are attributes, which can be properties of
either the entities or the relationships. An entity is defined to be a
person, place, thing, or event of interest to the business or the
organization. It represents a class of objects, which are things in the real
business world that can be observed and classified by their properties
and characteristics. Relationships represent the structural interaction
and association among the entities in a model and is represented with
lines drawn between the two specific entities.
Section 4.5
1. Dimensional modeling is simpler and a more expressive technique
compared to ER modeling. This technique is mainly aimed at
conceptualizing and visualizing data models as a set of measures that
are described by common aspects of the business. Also, the technique
is useful for summarizing and rearranging the data and presenting views
of the data to support data analysis. Apart from these, the technique
focuses on numeric data, such as values, counts, and weights.
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 104
Section 4.6
1. The steps involved in the database design methodology for data
warehouses are as follows: Choosing the Process, Choosing the Grain,
Identifying and conforming the Dimensions, Choosing the Facts, Storing
Pre-Calculations in the Fact Table, Rounding out the Dimension Tables,
Choosing the duration of the database, Tracking slowly changing
dimensions, Deciding the Query priorities and the Query Modes. These
steps are detailed in Section 4.6.
Section 4.7
1. Some of the mistakes that development teams may make during the
development and implementation of a logical and physical data model
include failure to address the long-term growth, use of no specific
mechanism to clean-out the old data, overuse of the synthetic keys, and
inadequate planning for indexes. The team has to initiate appropriate
actions to overcome these mistakes.
4.12.2 Answers to Terminal Questions (TQs)
1. A data model is a well-organized abstraction of the data and so the
data model has become an appropriate method to understand and
manage the business of the organization. Otherwise, it is very difficult to
organize the structure and contents of the data in the data warehouse.
The data models are critical in understanding the data in an analytical
environment of a data warehouse and so act as a blueprint for the
information requirements of an organization. The purpose of modeling is
to provide an accurate record of some aspect of the real world in a
particular context.
The objective of the modeling is to provide a record of accurate and
meaningful business data definitions, to be useful in the identification of
valid, consistent business data structures that contain sufficient
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 105
information to run and manage the business, to act as an indication of
the similarities and differences between data from different sources and
the relationships between them.
2. The two techniques for data modeling in a data warehouse environment
sometimes look very different from each other, but they have many
similarities. Dimensional modeling can use the same notation, such as
entity, relationship, attribute, and primary key. And, in general, you can
say that a fact is just an entity in which the primary key is a combination
of foreign keys, and the foreign keys refer the dimensions.
Therefore, we could say that dimensional modeling is a special form of
ER modeling. An ER model provides the structure and content definition
of the informational needs of the corporation, which is the base for
designing the data warehouse. A conclusion that can be drawn from the
above discussion is that the two data modeling techniques have their
own strengths and weaknesses, and either of them can be used in the
appropriate situation.
3. The leading approaches to organize the data in a data warehouse are
Ralph Kimballs Dimensional approach and the Bill Inmons Normalized
approach. In the "dimensional" approach, the transaction data is
partitioned into either measured "facts" (which are generally numeric
data that captures specific values) or "dimensions" (which contain the
reference information that gives each transaction its context).
This enables the business staff to understand and use the data
warehouse. Also, the data warehouse tends to operate very quickly
because the data is pre-joined into the dimensional form. But there will
be some difficulty to incorporate the changes if the company changes
the way in which it does business. In the "normalized" approach, the
data in the data warehouse is stored in third normal form. The tables in
Business Intelligence and Tools Unit 4
Sikkim Manipal University Page No. 106
the database are then grouped together by subject areas so as to reflect
the general definition of the data. Through use of this approach, it is
easy to modify the information in the database.
4.12.3 Answers to Multiple Choice Questions (MCQs)
1. Ans: a
2. Ans: b
3. Ans: c
4. Ans: c
5. Ans: a
6. Ans: d
7. Ans: b
8. Ans: d
9. Ans: a
10. Ans. d

You might also like