Data Warehousing 2
Enterprise warehouse :
An enterprise warehouse collects all of the information about subjects spanning the
entire organization. It provides corporate-wide data integration, usually from one or
more operational systems or external information providers, and is cross-functional
in scope. It typically contains detailed data as well as summarized data, and can
range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes,
computer super servers, or parallel architecture platforms. It requires extensive
business modelling and may take years to design and build.
Data Marts :
A data mart contains a subset of corporate-wide data that is of value to a specific
group of users. The scope is confined to specific selected subjects. For example, a
marketing data mart may confine its subjects to customer, item, and sales. The data
contained in data marts tend to be summarized. Depending on the source of data,
data marts can be categorized as independent or dependent. Independent data
marts are sourced from data captured from one or more operational systems or
external information providers, or from data generated locally within a particular
department or geographic area. Dependent data marts are sourced directly from
enterprise data warehouses.
Virtual Warehouse :
A virtual warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized. A virtual
warehouse is easy to build but requires excess capacity on operational database
servers.
What back-end tools and utilities are used to populate and refresh data warehouse data?
ANS :
Data extraction :
which typically gathers data from multiple, heterogeneous, and external sources.
Data cleaning :
which detects errors in the data and rectifies them when possible.
Data transformation:
which converts data from legacy or host format to warehouse format.
Load :
which sorts, summarizes, consolidates, computes views, checks integrity, and builds
indices and partitions.
Refresh :
which propagates the updates from the data sources to the warehouse.
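The five back-end steps above can be sketched as a small pipeline. This is a minimal illustration, not a real warehousing tool: the source layouts, field names, and the per-item summarization are all invented for the example.

```python
# A minimal sketch of the back-end pipeline described above: extract,
# clean, transform, load, and refresh. All field names are hypothetical.

def extract(sources):
    """Gather rows from multiple, heterogeneous sources."""
    for source in sources:
        yield from source

def clean(rows):
    """Detect and rectify simple errors: here, drop rows missing a key field."""
    for row in rows:
        if row.get("item") and row.get("units") is not None:
            yield row

def transform(rows):
    """Convert from a legacy/host format to the warehouse format."""
    for row in rows:
        yield {"item": row["item"].strip().lower(),
               "units_sold": int(row["units"])}

def load(rows):
    """Summarize and consolidate into the warehouse (totals per item)."""
    warehouse = {}
    for row in rows:
        warehouse[row["item"]] = warehouse.get(row["item"], 0) + row["units_sold"]
    return warehouse

def refresh(warehouse, new_source):
    """Propagate updates from a data source into the warehouse."""
    for item, units in load(transform(clean(extract([new_source])))).items():
        warehouse[item] = warehouse.get(item, 0) + units
    return warehouse

source_a = [{"item": " TV ", "units": "3"}, {"item": None, "units": "1"}]
source_b = [{"item": "tv", "units": "2"}, {"item": "Phone", "units": "5"}]
warehouse = load(transform(clean(extract([source_a, source_b]))))
# warehouse == {"tv": 5, "phone": 5}
```

The bad row (missing item) is dropped by cleaning, and " TV " and "tv" are consolidated by the transformation step before loading.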
DATA CUBE : A Multidimensional Model
A data cube allows data to be modelled and viewed in multiple dimensions. It is defined by
dimensions and facts. In general terms, dimensions are the perspectives or entities with
respect to which an organization wants to keep records.
For example, AllElectronics may create a sales data warehouse in order to keep records of
the store’s sales with respect to the dimensions time, item, branch, and location. These
dimensions allow the store to keep track of things like monthly sales of items and the
branches and locations at which the items were sold. Each dimension may have a table
associated with it, called a dimension table, which further describes the dimension.
A multidimensional data model is typically organized around a central theme, such as sales.
This theme is represented by a fact table. Facts are numeric measures. Think of them as the
quantities by which we want to analyse relationships between dimensions. The fact table
contains the names of the facts, or measures, as well as keys to each of the related
dimension tables.
Examples of facts for a sales data warehouse include dollars sold (sales amount in dollars),
units sold (number of units sold), and amount budgeted.
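The relationship between the fact table and its dimension tables can be sketched in a few lines of Python. The dimension values, keys, and measure figures below are invented for illustration; only the dimension names (time, item, location) and measures (dollars sold, units sold) come from the text.

```python
# Sketch: a fact table whose rows carry keys into dimension tables,
# plus the numeric measures. All values are made up for illustration.

dimension_time = {1: "Q1", 2: "Q2"}
dimension_item = {10: "home entertainment", 20: "computer"}
dimension_location = {100: "Vancouver", 200: "Toronto"}

# Fact rows: (time_key, item_key, location_key, dollars_sold, units_sold)
fact_sales = [
    (1, 10, 100, 605, 825),
    (1, 20, 100, 825, 400),
    (2, 10, 200, 680, 952),
    (2, 20, 200, 512, 300),
]

def total(measure_index, **filters):
    """Sum a measure over fact rows matching the given dimension filters."""
    keys = {"time": 0, "item": 1, "location": 2}
    lookups = {"time": dimension_time, "item": dimension_item,
               "location": dimension_location}
    result = 0
    for row in fact_sales:
        if all(lookups[d][row[keys[d]]] == v for d, v in filters.items()):
            result += row[measure_index]
    return result

print(total(3, time="Q1"))        # dollars sold in Q1: 1430
print(total(4, item="computer"))  # units of computers sold: 700
```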
Table 1 : 2-D view of sales data for AllElectronics according to time and item.
Table 2 : 3-D view of sales data for AllElectronics according to time, item, and location.
We may display any n-dimensional data as a series of (n−1)-dimensional "cubes." The data
cube is a metaphor for multidimensional data storage. The actual physical storage of such
data may differ from its logical representation. The important thing to remember is that data
cubes are n-dimensional and do not confine data to 3-D. The cuboid that holds the lowest
level of summarization is called the base cuboid.
Fig : A 3-D data cube representation of the data mentioned above, according to time,
item, and location.
Fig : A 4 D data cube representation of sales data, according to time, item, location , and
supplier.
The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.
The apex cuboid is typically denoted by all.
Fig : Lattice of cuboids, making up a 4-D data cube for time, item, location, and supplier.
Each cuboid represents a different degree of summarization.
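The lattice itself is easy to enumerate: every subset of the dimensions defines one cuboid, so an n-D cube has 2^n cuboids, from the apex (group by nothing) down to the base cuboid (group by everything). A small sketch, with invented sales rows:

```python
# Sketch: enumerating the lattice of cuboids for a 4-D cube and computing
# each cuboid by group-by aggregation. The rows are illustrative.
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")
rows = [
    {"time": "Q1", "item": "TV", "location": "Toronto", "supplier": "SUP1", "units": 5},
    {"time": "Q1", "item": "TV", "location": "Vancouver", "supplier": "SUP1", "units": 3},
    {"time": "Q2", "item": "PC", "location": "Toronto", "supplier": "SUP2", "units": 7},
]

def cuboid(group_by):
    """Aggregate units over the chosen dimensions (one cuboid)."""
    out = {}
    for r in rows:
        key = tuple(r[d] for d in group_by)
        out[key] = out.get(key, 0) + r["units"]
    return out

# Every subset of the dimensions is one cuboid in the lattice:
lattice = {subset: cuboid(subset)
           for n in range(len(dimensions) + 1)
           for subset in combinations(dimensions, n)}

print(len(lattice))    # 2^4 = 16 cuboids for a 4-D cube
print(lattice[()])     # apex (0-D) cuboid: {(): 15}
```

The empty subset gives the apex cuboid (total summarization); the full tuple of dimensions gives the base cuboid, the lowest level of summarization.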
o Star Schema :
The most common modelling paradigm is the star schema, in which the data
warehouse contains (1) a large central table (fact table) containing the bulk of
the data, with no redundancy, and (2) a set of smaller attendant tables
(dimension tables), one for each dimension. The schema graph resembles a
starburst, with the dimension tables displayed in a radial pattern around the
central fact table.
Eg. : A star schema for AllElectronics sales is shown in the figure. Sales are
considered along four dimensions: time, item, branch, and location. The schema
contains a central fact table for sales that contains keys to each of the four
dimensions, along with two measures: rupees sold and units sold. To minimize
the size of the fact table, dimension identifiers (e.g., time key and item key) are
system-generated identifiers.
Notice that in the star schema, each dimension is represented by only one table,
and each table contains a set of attributes. For example, the location dimension
table contains the attribute set {location key, street, city, province or state,
country}. This constraint may introduce some redundancy. For example,
"Urbana" and "Chicago" are both cities in the state of Illinois, USA. Entries for
such cities in the location dimension table will create redundancy among the
attributes province or state and country; that is, (..., Urbana, IL, USA) and
(..., Chicago, IL, USA). Moreover, the attributes within a dimension table may form
either a hierarchy (total order) or a lattice (partial order).
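The star layout can be sketched in plain Python: one fact table of keys and measures, one dictionary per dimension, and a "star join" that denormalizes a fact row. The table contents below are invented; the table and attribute names follow the example above.

```python
# Sketch of a star schema: a fact table holding system-generated dimension
# keys, and one (denormalized) table per dimension. Row contents invented.

dim_time = {1: {"quarter": "Q1", "year": 2024}}
dim_item = {10: {"item_name": "TV", "brand": "Acme"}}
dim_branch = {5: {"branch_name": "B1"}}
dim_location = {100: {"city": "Urbana", "province_or_state": "IL",
                      "country": "USA"}}

# Fact table: keys to the four dimensions, plus the two measures.
fact_sales = [
    {"time_key": 1, "item_key": 10, "branch_key": 5, "location_key": 100,
     "rupees_sold": 1200, "units_sold": 3},
]

def denormalize(fact):
    """Join a fact row to its four dimension rows (a 'star join')."""
    row = dict(fact)
    row.update(dim_time[fact["time_key"]])
    row.update(dim_item[fact["item_key"]])
    row.update(dim_branch[fact["branch_key"]])
    row.update(dim_location[fact["location_key"]])
    return row

print(denormalize(fact_sales[0])["city"])   # "Urbana"
```

Because each dimension is a single table, a query touches at most one join per dimension; the redundancy noted above (IL/USA repeated for every Illinois city) is the price paid for that simplicity.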
o Snowflake Schema :
The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into
additional tables. The resulting schema graph forms a shape similar to a
snowflake. The major difference between the snowflake and star schema
models is that the dimension tables of the snowflake model may be kept in
normalized form to reduce redundancies. Such a table is easy to maintain and
saves storage space. However, this space savings is negligible in comparison to
the typical magnitude of the fact table. Furthermore, the snowflake structure
can reduce the effectiveness of browsing, since more joins will be needed to
execute a query. Consequently, the system performance may be adversely
impacted. Hence, although the snowflake schema reduces redundancy, it is not
as popular as the star schema in data warehouse design.
Fig : Snowflake schema of a sales data warehouse
Eg : A snowflake schema for AllElectronics sales is given in Figure. Here, the sales
fact table is identical to that of the star schema, which is mentioned above. The
main difference between the two schemas is in the definition of dimension
tables. The single dimension table for item in the star schema is normalized in
the snowflake schema, resulting in new item and supplier tables. For example,
the item dimension table now contains the attributes item key, item name,
brand, type, and supplier key, where supplier key is linked to the supplier
dimension table, containing supplier key and supplier type information. Similarly,
the single dimension table for location in the star schema can be normalized into
two new tables: location and city. The city key in the new location table links to
the city dimension. Notice that, when desirable, further normalization can be
performed on province or state and country in the snowflake schema.
o Fact Constellation Schema :
A fact constellation schema is shown in the figure below. This schema specifies
two fact tables, sales and shipping. The sales table definition is identical to that
of the star schema. The shipping table has five dimensions, or keys—item key,
time key, shipper key, from location, and to location—and two measures—dollars
cost and units shipped. A fact constellation schema allows dimension tables to be
shared between fact tables. For example, the dimension tables for time, item,
and location are shared between the sales and shipping fact tables.
Fig : Fact constellation schema of a sales and shipping data warehouse.
Consider a concept hierarchy for the dimension location. City values for location
include Vancouver, Toronto, New York, and Chicago. Each city, however, can be
mapped to the province or state to which it belongs. For example, Vancouver can be
mapped to British Columbia, and Chicago to Illinois. The provinces and states can in
turn be mapped to the country (e.g., Canada or the United States) to which they
belong. These mappings form a concept hierarchy for the dimension location,
mapping a set of low-level concepts (i.e., cities) to higher-level, more general
concepts (i.e., countries). This concept hierarchy is illustrated in the figure below.
Fig : A Concept hierarchy for location. Due to space limitations, not all of the
hierarchy nodes are shown.
Roll Up :
The above figure shows the result of a roll-up operation performed on the central
cube by climbing up the concept hierarchy for location given in Figure. This
hierarchy was defined as the total order “street < city < province or state <
country.” The roll-up operation shown aggregates the data by ascending the
location hierarchy from the level of city to the level of country. In other words,
rather than grouping the data by city, the resulting cube groups the data by
country.
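Roll-up along the location hierarchy can be sketched as re-grouping a measure through a city → province/state → country mapping. The mappings and sales figures below are invented, except for the four cities named in the hierarchy example above.

```python
# Sketch of roll-up along "city < province_or_state < country".
# Hierarchy mappings and sales figures are illustrative.

city_to_state = {"Vancouver": "British Columbia", "Toronto": "Ontario",
                 "New York": "New York", "Chicago": "Illinois"}
state_to_country = {"British Columbia": "Canada", "Ontario": "Canada",
                    "New York": "USA", "Illinois": "USA"}

sales_by_city = {"Vancouver": 300, "Toronto": 500,
                 "New York": 400, "Chicago": 200}

def roll_up(measures, mapping):
    """Aggregate a measure one level up a concept hierarchy."""
    out = {}
    for value, amount in measures.items():
        parent = mapping[value]
        out[parent] = out.get(parent, 0) + amount
    return out

sales_by_state = roll_up(sales_by_city, city_to_state)
sales_by_country = roll_up(sales_by_state, state_to_country)
# sales_by_country == {"Canada": 800, "USA": 600}
```

Drill-down is the inverse navigation: moving from `sales_by_country` back toward the more detailed `sales_by_city`, which is only possible because the detailed data is retained.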
Drill Down :
Drill-down is the reverse of roll-up. It navigates from less detailed data to more
detailed data. Drill-down can be realized by either stepping down a concept
hierarchy for a dimension or introducing additional dimensions.
Above figure shows the result of a drill-down operation performed on the central
cube by stepping down a concept hierarchy for time defined as “day < month <
quarter < year.” Drill-down occurs by descending the time hierarchy from the
level of quarter to the more detailed level of month. The resulting data cube
details the total sales per month rather than summarizing them by quarter.
Slice :
The slice operation performs a selection on one dimension of the given cube,
resulting in a sub cube.
Above figure shows a slice operation where the sales data are selected from the
central cube for the dimension time using the criterion time = "Q1."
Dice :
The dice operation defines a sub cube by performing a selection on two or more
dimensions.
Above figure shows a dice operation on the central cube based on the following
selection criteria, which involve three dimensions: (location = "Toronto" or
"Vancouver") and (time = "Q1" or "Q2") and (item = "home entertainment" or
"computer").
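Both operations amount to selections on cube cells: slice fixes one dimension, dice restricts two or more. A sketch with invented cells, reusing the criteria from the dice example above:

```python
# Sketch of slice and dice as selections on cube cells. Cell keys are
# (location, time, item) tuples; values are units sold (invented data).

cube = {
    ("Toronto", "Q1", "computer"): 605,
    ("Toronto", "Q2", "computer"): 680,
    ("Vancouver", "Q1", "home entertainment"): 825,
    ("Chicago", "Q1", "computer"): 400,
}

def slice_cube(cube, dim_index, value):
    """Select on ONE dimension, yielding a subcube."""
    return {k: v for k, v in cube.items() if k[dim_index] == value}

def dice_cube(cube, criteria):
    """Select on TWO OR MORE dimensions; criteria maps index -> allowed values."""
    return {k: v for k, v in cube.items()
            if all(k[i] in allowed for i, allowed in criteria.items())}

q1_slice = slice_cube(cube, 1, "Q1")                   # time = "Q1"
diced = dice_cube(cube, {0: {"Toronto", "Vancouver"},  # location
                         1: {"Q1", "Q2"},              # time
                         2: {"home entertainment", "computer"}})
```

The slice keeps every Q1 cell regardless of location and item; the dice drops the Chicago cell because "Chicago" is outside the location criterion.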
Pivot :
Pivot (also called rotate) is a visualization operation that rotates the data axes in
view to provide an alternative data presentation.
Above figure shows a pivot operation where the item and location axes in a 2-D
slice are rotated. Other examples include rotating the axes in a 3-D cube, or
transforming a 3-D cube into a series of 2-D planes.
o A Starnet Query Model for Querying Multidimensional Databases :
The querying of multidimensional databases can be based on a starnet model, which
consists of radial lines emanating from a central point, where each line represents a
concept hierarchy for a dimension. Each abstraction level in the hierarchy is called a
footprint. These represent the granularities available for use by OLAP operations
such as drill-down and roll-up.
A starnet query model for the AllElectronics data warehouse is shown in below Figure
. This starnet consists of four radial lines, representing concept hierarchies for the
dimensions location, customer, item, and time, respectively. Each line consists of
footprints representing abstraction levels of the dimension. For example, the time
line has four footprints: “day,” “month,” “quarter,” and “year.” A concept hierarchy
may involve a single attribute (e.g., date for the time hierarchy) or several attributes
(e.g., the concept hierarchy for location involves the attributes street, city, province
or state, and country). In order to examine the item sales at AllElectronics, users can
roll up along the time dimension from month to quarter, or, say, drill down along the
location dimension from country to city.
A data mining query for characterization. Suppose that a user wants to describe the
general characteristics of graduate students in the Big University database, given the
attributes name, gender, major, birth place, birth date, residence, phone# (telephone
number), and gpa (grade point average). A data mining query for this
characterization can be expressed in the data mining query language, DMQL, as
follows:
Input:
1. W ← get_task_relevant_data(DMQuery, DB); // Let W, the working relation, hold the
task-relevant data.
2. prepare_for_generalization(W):
a. Scan W and collect the distinct values for each attribute ai. (Note: If W is very large,
this could be done by examining a sample of W.)
b. For each attribute ai, determine whether ai should be removed. If not, compute its
minimum desired level Li based on its given or default attribute threshold, and
determine the mapping pairs (v, v′), where v is a distinct value of ai in W, and v′ is
its corresponding generalized value at level Li.
3. P ← generalization(W): The prime generalized relation P is derived by replacing each
value v in W with its corresponding v′ in the mapping while accumulating count and
computing any other aggregate values.
This step can be implemented efficiently using either of the following two variations:
a. For each generalized tuple, insert the tuple into a sorted prime relation P by a
binary search: if the tuple is already in P, simply increase its count and other
aggregate values accordingly; otherwise, insert it into P.
b. Since in most cases the number of distinct values at the prime relation level is small,
the prime relation can be coded as an m-dimensional array, where m is the number
of attributes in P, and each dimension contains the corresponding generalized
attribute values. Each array element holds the corresponding count and other
aggregation values, if any. The insertion of a generalized tuple is performed by
measure aggregation in the corresponding array element.
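The core of the generalization step can be sketched in a few lines: replace each attribute value by its generalized value at the chosen level, then accumulate counts in the prime relation P. The concept hierarchies and tuples below are invented for illustration; only the overall procedure follows the steps above.

```python
# Sketch of attribute-oriented induction: generalize each value up its
# concept hierarchy, then accumulate counts in the prime relation P.
# The hierarchies (mapping pairs (v, v')) and data are invented.

generalize = {
    "major":      {"CS": "Science", "Math": "Science", "Law": "Humanities"},
    "birth_city": {"Vancouver": "Canada", "Toronto": "Canada",
                   "Chicago": "USA"},
}

W = [  # working relation of task-relevant tuples
    {"major": "CS", "birth_city": "Vancouver"},
    {"major": "Math", "birth_city": "Toronto"},
    {"major": "Law", "birth_city": "Chicago"},
]

P = {}  # prime generalized relation: generalized tuple -> count
for t in W:
    g = tuple(generalize[a][t[a]] for a in sorted(generalize))
    P[g] = P.get(g, 0) + 1   # insert the tuple, or bump its aggregate count

# P == {("Canada", "Science"): 2, ("USA", "Humanities"): 1}
```

This corresponds to variation (a) above, with the dict playing the role of the sorted prime relation; variation (b) would instead index an m-dimensional array by the generalized attribute values.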