Chapter V

Download as pdf or txt
Download as pdf or txt
You are on page 1of 38

Data Warehouse

Chapter V
Chapter Objectives:
• De-normalization
• Dimensional Modeling
• Start schema
• Snowflake schema
• Fact tables
• Dimension table
Designing and Implementing a Data Warehouse

Two general methods:


• De-Normalization.
• Dimensional Modeling (DM)
De-normalization
De-normalization
• De-normalization is a strategy that database managers use to
increase the performance of a database infrastructure, more like a
“controlled crash” with the aim of performance enhancement
without loss of information.
• It involves adding redundant data to a normalized database to
reduce certain types of problems with database queries that
combine data from various tables into a single table.
De-normalization
• Normalization is a rule of thumb in DBMS, but in DSS ease of use is
achieved by way of de-normalization.
• De-normalization comes in many flavors, such as combining tables,
splitting tables, adding data etc., but all done very carefully.
How De-normalization improves performance?
De-normalization specifically improves performance by either:
1. Reducing the number of tables and hence the reliance on joins,
which consequently speeds up performance.
2. Reducing the number of joins required during query execution, or
3. Reducing the number of rows to be retrieved from the Primary
Data Table.
De-normalization Techniques

1. Collapsing Tables.
• Two entities with a One-to-One relationship.
• Two entities with a Many-to-Many relationship.
2. Splitting Tables (Horizontal/Vertical Splitting).
3. Pre-Joining.
4. Adding Redundant Columns (Reference Data).
5. Derived Attributes (Summary, Total, Balance etc).
1. Collapsing Tables
1. Collapsing Tables

One of the most common and safe denormalization techniques is


combining of One-to-One relationships. For example, SID and gender in
one table, and SID and degree in the other table.
There are several advantages of this technique, some of the obvious
ones being reduced storage space, reduced amount of time for data
update, some of the other not so apparent advantages are reduced
number of foreign keys on tables. Furthermore, combining the
columns does not change the business view, but does decrease access
time by having fewer physical objects and reducing overhead.
2. Splitting Tables
2. Splitting Tables

In contrast, denormalization can be used to create more tables by


splitting a relation into multiple tables. Both horizontal and vertical
splitting and their combination are possible.
Goals of Horizontal splitting:
(1) spread rows in a large table across many HW components (disks,
controllers, CPUs, etc.) in the environment to facilitate parallel
processing, and (2) segregate data into separate partitions so that
queries do not need to examine all data in a table when WHERE clause
filters specify only a subset of the partitions
2. Splitting Tables

Advantage:
• Enhance security of data.
• Organizing tables differently for different queries.
• Reduced I/O overhead.
• Graceful degradation of database in case of table damage.
• Fewer rows result in flatter B-trees and fast data retrieval.
2. Splitting Tables

Vertical Splitting
• Splitting and distributing into separate files with repeating primary key.
• Infrequently accessed columns become extra “baggage” thus degrading performance.
• Very useful for rarely accessed large text columns with large headers.
• Header size is reduced, allowing more rows per block, thus reducing I/O.
3. Pre-Joining:
3. Pre-Joining:

This technique is generally used when there is a one-to-many relationship


• Typical of Market basket query

• Join ALWAYS required


• Tables could be millions of rows
• Squeeze Master into Detail
• Detail 3-4 times of master
The sale_ID column is the primary key for the master table and uniquely identifies a market
basket. There will be one “detail” record for each item listed on the “receipt” for the
market basket. The tx_ID column is the primary key for the detail table.
4. Adding Redundant Columns:
4. Adding Redundant Columns:
• If closely observed, this technique is no different from a pre-joining. In pre-joining all
columns are moved from the master table into the detail table, but in the current case, a
sub-set of columns from the master table is made redundant or moved into the detail
table. The performance, and storage trade-offs are also very similar to pre-joining.
• Redundant Columns Surprise Creating redundant columns does not necessarily reduce
the storage space requirements, as neither the reference table is removed, nor the
columns duplicated from the reference table. However, it is obvious that column
redundancy does eliminate the join and increase the performance.
5. Derived Attributes:
• It is usually feasible to add derived
attribute(s) in the data warehouse data
model, if the derived data is frequently
accessed and calculated. The
justification of adding derived data is
Age is also a derived attribute, simple; it reduces the amount of query
calculated as Current_Date – DoB processing time at run-time while
(calculated periodically). accessing the data in the warehouse.
GP (Grade Point) column in the data
warehouse data model is included as a
derived value. The formula for calculating
this field is Grade*Credits.
Dimensional Modeling (DM)
• A simpler logical model optimized for decision support.
• Inherently dimensional in nature, with a single central fact table and
a set of smaller dimensional tables.
• Multi-part key for the fact table.
• Dimensional tables with a single-part PK.
• Keys are usually system generated.
• Results in a star like structure, called star schema or star join.
• All relationships mandatory M-1.
• Single path between any two levels.
• Supports OLAP operations.
Dimensional Modeling (DM)
• DM is a logical design technique that seeks to present the data in a
standard, instinctive structure that supports high-performance and ease
of understanding. It is inherently dimensional in nature, and it does adhere
to the relational model, but with some important restrictions. Such as,
every dimensional model is composed of one “central” table with a
multipart key, called the fact table, and a set of smaller tables called
dimension tables. Each dimension table has a single-part primary key that
corresponds exactly to one of the components of the multipart key in the
fact table. This results in a characteristic "starlike" structure or star schema.
The two Schemas
Star Schema:
• Star schema is generally considered to be the most efficient design as
it is designed with de-normalized tables encounters fewer join
operations.
Star Schema:
Advantages
– Improves query performance for often-used data
– Less tables and simple structure
– Efficient query processing with regard to dimensions
Disadvantages
– In some cases, high overhead of redundant data
Star Schema:
• Dimensional hierarchies are collapsed into a single table for each dimension.
• A single fact table created with a single header from the detail records,
resulting in:
• A vastly simplified physical data model!
• Fewer tables (thousands of tables in some ERP systems).
• Fewer joins resulting in high performance.
• Some requirement of additional space.
Snow-Flake Schema:
• Sometimes a pure star schema might suffer performance problems.
• This can occur when a de-normalized dimension table becomes very large and penalizes
the star join operation. Conversely, sometimes a small outer-level dimension table does
not incur a significant join cost because it can be permanently stored in a memory buffer.
• Furthermore, because a star structure exists at the center of a snowflake, an efficient
star join can be used to satisfy part of a query. Finally, some queries will not access data
from outer-level dimension tables. These queries effectively execute against a star
schema that contains smaller dimension tables. Therefore, under some circumstances, a
snowflake schema is more efficient than a star schema.
Snow-Flake Schema:
• With a snowflake schema the size of the dimension
• Tables will be reduced, and queries will run faster
• If a dimension is very sparse (most measures corresponding to the
dimension have no data)
• And/or a dimension has long list of attributes which may be queried
Star vs Snow-Flake schemas:

You might also like