What Is a Data Warehouse?
A centralized database used to capture information from different parts of the business process. Integration is closely related to subject orientation (e.g., a customer feedback form asking "How did you get to know about us?" feeds the customer subject area).
A data warehouse is an electronic store of an organization's historical data for the purposes of reporting, analysis, and data mining or knowledge discovery.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
Subject-Oriented: To learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?"
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have
different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse.
Non-volatile: Non-volatile means that, once entered into the warehouse, data does not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred in the past.
A data warehouse helps to integrate data (see data integration) and store it historically so that we can analyze different aspects of the business, including performance analysis, trends, and prediction, over a given time frame, and use the results of our analysis to improve the efficiency of business processes.
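As a small illustration of the "integrated" property described above, the Python sketch below (all source names, codes, and the mapping are invented for illustration) reconciles two sources' product identifiers into a single conformed key:

```python
# Hypothetical example: source A and source B identify the same product
# differently; the warehouse load maps both onto one conformed key.
source_a = [{"prod_code": "RICE-01", "qty": 20}]
source_b = [{"item_id": 9001, "qty": 35}]

# Conformed mapping maintained in the warehouse (assumed values).
conformed = {("A", "RICE-01"): "PROD_RICE", ("B", 9001): "PROD_RICE"}

rows = []
for rec in source_a:
    rows.append({"product_key": conformed[("A", rec["prod_code"])], "qty": rec["qty"]})
for rec in source_b:
    rows.append({"product_key": conformed[("B", rec["item_id"])], "qty": rec["qty"]})

print(rows)  # both rows now share the single identifier PROD_RICE
```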
Need: to help management and the business analyze data, which in turn calls for the subject-oriented organization described above.
Data mining
Data mining is the process of finding patterns in a given data set. These patterns can often provide meaningful and insightful information to whoever is interested in that data. Data mining is a method for examining large amounts of data for the purpose of finding such patterns.
Data warehousing is the process of compiling and organizing data into one common database, and data mining is the process of extracting meaningful data from that database.
Data mining is typically done by business users with the assistance of engineers, whereas data warehousing is typically a process done exclusively by engineers.
Data warehousing is a process that must occur before any data mining can take place.
Data warehousing can be said to be the process of centralizing or aggregating data from multiple sources into one common repository.
E.g., credit card companies have a history of your purchases from the past and know geographically where those purchases
have been made. If all of a sudden some purchases are made in a city far from where you live, the credit card companies are
put on alert to a possible fraud since their data mining shows that you don’t normally make purchases in that city. Then, the
credit card company can disable your card for that transaction or just put a flag on your card for suspicious activity.
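A minimal sketch of the pattern behind this fraud example, with invented card and city data:

```python
# Flag a purchase made in a city that does not appear in the
# cardholder's purchase history (hypothetical history data).
purchase_history = {"card_123": {"Mumbai", "Pune"}}

def is_suspicious(card_id, city, history):
    """Return True when the card has no prior purchases in this city."""
    return city not in history.get(card_id, set())

print(is_suspicious("card_123", "Pune", purchase_history))  # False - usual city
print(is_suspicious("card_123", "Oslo", purchase_history))  # True - flag for review
```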
Data Mart
Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing, etc. stored in the data warehouse, and each department may have a separate data mart. These data marts can be built on top of the data warehouse. A data mart often holds only one subject area (for example, Finance or Sales) and is built around a dimensional model using a star schema.
Data Warehouse Architecture
Data Source Layer - sales data, HR data, inventory data, market research data, third-party data such as census data.
Data Extraction Layer - where data gets pulled from the source systems.
Staging Area - temporary storage area where the data sits prior to being transformed and loaded into the data warehouse.
ETL Layer - where the logic to transform the data is applied.
Data Storage Layer - where the transformed and cleansed data sits.
Data Logic Layer - where business rules are stored.
Data Presentation Layer - This refers to the information that reaches the users.
Metadata Layer - This is where information about the data stored in the data warehouse system is kept. A logical data model would be an example of something that lives in the metadata layer.
System Operations Layer - This layer includes information on how the data warehouse system operates, such as ETL job status,
system performance, and user access history.
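To make the flow through these layers concrete, here is a toy extract-stage-transform-load run using Python's built-in sqlite3; all table and column names are assumptions, not a standard:

```python
import sqlite3

# Toy walk through the layers: source -> staging -> ETL -> storage.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Staging area: raw data lands here unchanged (quantities still text).
cur.execute("CREATE TABLE stg_sales (prod TEXT, qty TEXT)")
cur.executemany("INSERT INTO stg_sales VALUES (?, ?)",
                [("rice", "20"), ("wheat", "15")])

# ETL layer: apply transformation logic (cast types, standardize case).
# Data storage layer: the cleansed table that users query.
cur.execute("CREATE TABLE dw_sales (product TEXT, quantity INTEGER)")
cur.execute("""INSERT INTO dw_sales
               SELECT UPPER(prod), CAST(qty AS INTEGER) FROM stg_sales""")

print(cur.execute("SELECT * FROM dw_sales").fetchall())
# [('RICE', 20), ('WHEAT', 15)]
```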
A data model in software engineering is an abstract model that describes how data is represented and accessed.
A data model can be thought of as a diagram or flowchart that illustrates the relationships between data.
A data model is a graphical view of data created for analysis and design purposes.
We can see that the complexity increases from conceptual to logical to physical. This is why we always start with the conceptual data model (so we understand at a high level what the different entities in our data are and how they relate to one another), then move on to the logical data model, and finally to the physical data model.
Conceptual data model: identifies the scope of the project in terms of the subject areas involved.
Logical data model: lets us understand the details of our data without worrying about how they will actually be implemented. The logical model identifies the details of the subject areas identified in the conceptual model and the relationships between the subject areas. It identifies the entities within each subject area and documents in detail the definition of the entity, its attributes, candidate keys, data types, and its relationships with other entities.
Physical data model: tells us exactly how to implement the data model in the database of choice. The physical model identifies the table structure, the columns within the tables, primary keys, data types and sizes of the columns, constraints on the tables, and the privileges on the tables.
In a data warehousing project, the conceptual data model and the logical data model are sometimes considered a single deliverable.
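A minimal sketch of what the physical model pins down, expressed as DDL run through Python's built-in sqlite3 (table and column names are invented for illustration):

```python
import sqlite3

# The physical model fixes concrete tables, columns, data types,
# keys and constraints for the database of choice.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,      -- surrogate key
    customer_id  TEXT NOT NULL UNIQUE,     -- natural key from the source
    name         TEXT NOT NULL
);
CREATE TABLE fact_sales (
    customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
    sale_date    TEXT NOT NULL,
    quantity     INTEGER NOT NULL CHECK (quantity > 0)
);
""")
print("physical model created")
```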
Data Modeling
This activity is performed once we have understood the requirements for the data warehouse/mart. Data modeling refers to creating the structure and relationships of the data for the business rules defined in the requirements.
Dimensional modeling is one of the logical design techniques used in data warehousing. DM uses the facts and dimensions of a warehouse for its design. The goal of the dimensional model is not to achieve a high degree of normalization but to facilitate easy and fast data retrieval.
This is the form of data modeling used in data warehousing, where we store business-related attributes in tables called Dimension Tables and the business-related metrics/measures in tables called Fact Tables.
Dimension: A dimension is something that qualifies a quantity (measure). Dimension tables are used to describe dimensions; they contain dimension keys, values and attributes.
A dimension is a category of information that mainly contains descriptive text, for example, the time dimension.
For example, consider this: if I just say "20 kg", it does not mean anything. But if I say "20 kg of Rice (product) is sold to Ramesh (customer) on 5th April (date)", then that makes meaningful sense. The product, customer and date are dimensions that qualify the measure, 20 kg.
Fact: A fact table is a table that contains the measures of interest. A fact is something that is quantifiable (or measurable).
Facts are typically (but not always) numerical values that can be aggregated. A fact table typically has two types of columns:
those that contain facts and those that are foreign keys to dimension tables.
Suppose our company sells products to customers. Every sale is a fact that happens within our company and the fact table is
used to record these facts.
The fact table lists events that happen in our company (or at least the events that we want to analyze). The dimension tables list
the factors (Customer, Time, Product) by which we want to analyze the data.
In other words, the difference between dimension tables and fact tables is that fact tables hold the data we want to analyze and
the dimension tables hold the information necessary to allow us to query it.
Dimension table and fact table are mainly used in data warehousing. The fact table mainly consists of business facts and
foreign keys that refer to primary keys in the dimension tables.
These measurable facts are used to gauge business value and to forecast future business.
A dimension table consists mainly of descriptive attributes that are textual fields.
Dimension tables provide descriptive or contextual information for the measurement of a fact table. On the other hand, fact
tables provide the measurements of an enterprise.
A dimension table contains a surrogate key, a natural key, and a set of attributes. On the contrary, a fact table contains foreign keys, measurements, and degenerate dimensions.
A fact table contains normalized data and takes more storage because it has fewer columns but a much larger number of records (it maintains historical data).
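To make facts and dimensions concrete, the 20 kg-of-rice example above can be modeled as a tiny star schema; the sketch below uses Python's built-in sqlite3, and every table, column, and value is hypothetical:

```python
import sqlite3

# Dimensions qualify the measure; the fact table records the event.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE fact_sales   (product_key INT, customer_key INT, date_key INT,
                           quantity_kg REAL);  -- the measure
INSERT INTO dim_product  VALUES (1, 'Rice');
INSERT INTO dim_customer VALUES (1, 'Ramesh');
INSERT INTO dim_date     VALUES (1, '2024-04-05');
INSERT INTO fact_sales   VALUES (1, 1, 1, 20.0);
""")

# Analysis = join the fact to its dimensions and aggregate the measure.
row = con.execute("""
    SELECT c.name, p.name, d.full_date, SUM(f.quantity_kg)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_key  = f.product_key
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_date     d ON d.date_key     = f.date_key
    GROUP BY c.name, p.name, d.full_date
""").fetchone()
print(row)  # ('Ramesh', 'Rice', '2024-04-05', 20.0)
```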
Types of fact tables:
Cumulative: This type of fact table describes what has happened over a period of time. For example, it may describe the total sales by product by store by day. The facts in this type of fact table are mostly additive facts.
Snapshot: This type of fact table describes the state of things at a particular instant in time, and usually includes more semi-additive and non-additive facts.
Additive Measures
Measures that can be added across all dimensions. Additive measures can be used with any aggregation function like SUM(), AVG(), etc. An example is sales quantity.
Semi-additive Measures
Measures that can be added across some dimensions but not all; a typical example is an account balance, which can be summed across accounts but not across time.
Non-additive Measures
Measures that cannot be added across any dimension. Non-additive measures cannot be used inside any numeric aggregation function (e.g. SUM(), AVG() etc.). One example of a non-additive fact is any kind of ratio or percentage, e.g. a 5% profit margin or a revenue-to-asset ratio. Non-numerical data can also be a non-additive measure when stored in a fact table, e.g. some kind of varchar flag in the fact table.
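A short worked example of why a ratio is non-additive (all figures invented): summing per-row profit margins gives a meaningless number, while recomputing the ratio from the additive underlying facts gives the true overall margin.

```python
# (revenue, profit) per transaction - assumed figures.
rows = [
    (100.0, 10.0),   # 10% margin
    (300.0, 9.0),    # 3% margin
]

# Wrong: margins are non-additive, so summing them is meaningless.
wrong = sum(p / r for r, p in rows)                        # 0.13 -> "13%"
# Right: recompute the ratio from the additive facts.
right = sum(p for _, p in rows) / sum(r for r, _ in rows)  # 19/400 = 4.75%
print(f"sum of ratios: {wrong:.2%}, ratio of sums: {right:.2%}")
```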
Factless Fact Table
A fact table that does not contain any facts. It contains only dimension keys, and it captures events that happen only at an information level but are not included in any calculations; it is just information about an event that happens over a period.
A factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts. They are often used to record events or coverage information. Common examples of factless fact tables include:
Identifying product promotion events (to determine promoted products that didn't sell)
Tracking student attendance or registration events
Tracking insurance-related accident events
For example, a FACT_LEAVE table can be used to capture the leave taken by an employee. Whenever an employee takes leave, a record is created with the dimension keys. Using FACT_LEAVE we can answer many questions, such as:
The number of leaves taken by an employee
The type of leave an employee takes
Details of the employee who took leave
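A minimal sketch of such a FACT_LEAVE table (layout and values assumed): because the table holds only dimension keys, counting rows answers the questions above.

```python
import sqlite3

# Factless fact table: dimension keys only, no measures.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_leave (employee_key INT, date_key INT, leave_type_key INT);
INSERT INTO fact_leave VALUES (7, 101, 1), (7, 102, 2), (9, 101, 1);
""")
n = con.execute(
    "SELECT COUNT(*) FROM fact_leave WHERE employee_key = 7").fetchone()[0]
print(n)  # 2 leaves taken by employee 7
```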
Snowflake schema
A snowflake schema is a more normalized form of a star schema. In a star schema, each dimension is represented by a single dimension table, whereas in a snowflake schema the dimension table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.
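A small sketch of snowflaking a product dimension (hypothetical names and data): the category attribute moves out into its own lookup table, at the cost of an extra join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lkp_category (category_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, name TEXT,
                           category_key INT REFERENCES lkp_category(category_key));
INSERT INTO lkp_category VALUES (1, 'Grains');
INSERT INTO dim_product  VALUES (1, 'Rice', 1);
""")
# Queries now need an extra join compared with the flat star dimension.
print(con.execute("""SELECT p.name, c.name FROM dim_product p
                     JOIN lkp_category c USING (category_key)""").fetchall())
```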
Fact Constellation Schema
This schema is used mainly for aggregate fact tables, or where we want to split a fact table for better comprehension. The split of a fact table is done only when we want to focus on aggregation over a few facts and dimensions.
In a star schema:
1. The fact table is at the center and is connected to the dimension tables.
2. The tables are in a completely denormalized structure.
3. SQL query performance is good, as there are fewer joins involved.
4. Data redundancy is high and occupies more disk space.
Slowly Changing Dimension (SCD)
SCD stands for slowly changing dimension, i.e. a dimension whose data changes slowly over time.
Type 2 Slowly Changing Dimension
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information, so both the original and the new record are present.
Advantages:
This allows us to accurately keep all historical information.
Disadvantages:
This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with,
storage and performance can become a concern.
This necessarily complicates the ETL process.
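A minimal Type 2 sketch using Python's built-in sqlite3 (table layout and values are assumptions): the current row is expired and a new row is inserted with its own surrogate key and effective dates, so all history is preserved.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY,
                           customer_id TEXT, city TEXT,
                           valid_from TEXT, valid_to TEXT);
INSERT INTO dim_customer VALUES (1, 'C42', 'Pune', '2020-01-01', NULL);
""")
# Customer C42 moves to Mumbai on 2024-04-05: expire, then insert.
con.execute("UPDATE dim_customer SET valid_to = '2024-04-05' "
            "WHERE customer_id = 'C42' AND valid_to IS NULL")
con.execute("INSERT INTO dim_customer VALUES "
            "(2, 'C42', 'Mumbai', '2024-04-05', NULL)")
print(con.execute("SELECT * FROM dim_customer").fetchall())  # both rows kept
```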
Type 3 Slowly Changing Dimension
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating
the original value, and one indicating the current value. There will also be a column that indicates when the current value
becomes active.
Advantages:
This does not increase the size of the table, since the new information is updated in place.
This allows us to keep some part of history.
Used when changes occur only a finite number of times.
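A corresponding Type 3 sketch (again with assumed names): the change is an in-place UPDATE of the current-value column, so the table does not grow and only one prior value is kept.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_id TEXT PRIMARY KEY,
                           original_city TEXT, current_city TEXT,
                           effective_date TEXT);
INSERT INTO dim_customer VALUES ('C42', 'Pune', 'Pune', '2020-01-01');
""")
# The same move as above, handled Type 3 style: update in place.
con.execute("""UPDATE dim_customer
               SET current_city = 'Mumbai', effective_date = '2024-04-05'
               WHERE customer_id = 'C42'""")
print(con.execute("SELECT * FROM dim_customer").fetchall())
# [('C42', 'Pune', 'Mumbai', '2024-04-05')] - table size unchanged
```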
1. Conformed Dimension
Conformed dimensions are nothing but reusable dimensions. A conformed dimension is a dimension that is shared across multiple subject areas. Consider the 'Customer' dimension: both the marketing and sales departments may use the same customer dimension table in their reports. A conformed dimension can also be used to join two fact tables indirectly.
2. Junk Dimension
A junk dimension is a grouping of typically low-cardinality attributes (flags, indicators etc.) so that those can be
removed from other tables and can be junked into an abstract dimension table. The junk dimension is simply a structure
that provides a convenient place to store the junk attributes.
E.g.: Assume that we have a gender dimension and a marital status dimension. In the fact table we would need to maintain two keys referring to these dimensions. Instead, create a junk dimension that has all the combinations of gender and marital status (cross-join the gender and marital status tables to create the junk table). Now we need to maintain only one key in the fact table, as the sketch below shows.
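A sketch of building that junk dimension by cross-joining the two flag domains (hypothetical values):

```python
import sqlite3

# One junk_key in the fact table replaces one key per low-cardinality flag.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_junk (junk_key INTEGER PRIMARY KEY AUTOINCREMENT,
                       gender TEXT, marital_status TEXT);
INSERT INTO dim_junk (gender, marital_status)
SELECT g.v, m.v
FROM (SELECT 'M' AS v UNION SELECT 'F') g
CROSS JOIN (SELECT 'Single' AS v UNION SELECT 'Married') m;
""")
print(con.execute("SELECT * FROM dim_junk").fetchall())
# 4 rows: every gender x marital-status combination, one key each
```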
3. Degenerate Dimension
A degenerate dimension is a dimension that is derived from the fact table and does not have its own dimension table.
A dimension key such as a transaction number, receipt number, or invoice number has no other associated attributes and hence cannot be designed as a dimension table.
A degenerate dimension is stored in the fact table rather than in a dimension table. It eliminates the need to join to a dimension table.
These are essentially dimension keys for which there are no other attributes.
Degenerate dimensions are used when fact tables represent transactional data.
They can be used as (part of) the primary key for the fact table, but they cannot act as foreign keys.
Data integrity refers to the validity of data, meaning data is consistent and correct.
Comparison of OLTP and OLAP?
OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE
and DELETE). The main emphasis for OLTP systems is put on very fast query processing, maintaining data integrity in multi-
access environments and an effectiveness measured by number of transactions per second.
An OLTP database is used for day-to-day business operations and is therefore well-normalized.
OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is an effectiveness measure. OLAP applications are widely used by data mining techniques. An OLAP database contains aggregated, historical data stored in multi-dimensional schemas (usually a star schema).
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats. MOLAP cubes are built for fast data retrieval.
Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.
Can perform complex calculations: all calculations have been pre-generated when the cube is created.
Limitation: can handle only a limited amount of data.
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement, as the sketch below illustrates.
Limitation: performance can be slow.
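A small sketch of the ROLAP idea (hypothetical table and data): the "slice" is nothing more than a WHERE clause added to the SQL against the relational schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_sales (region TEXT, year INT, amount REAL);
INSERT INTO fact_sales VALUES ('East', 2023, 100), ('West', 2023, 80),
                              ('East', 2024, 120);
""")
# Slicing on region = 'East' narrows the cube to one dimension member.
print(con.execute("""SELECT year, SUM(amount) FROM fact_sales
                     WHERE region = 'East' GROUP BY year""").fetchall())
# [(2023, 100.0), (2024, 120.0)]
```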
OLTP vs. OLAP
Source of data: OLTP - operational data; OLTP systems are the original source of the data. OLAP - consolidated data; OLAP data comes from the various OLTP databases.
Purpose of data: OLTP - to control and run fundamental business tasks. OLAP - to help with planning, problem solving, and decision support.
What the data reveals: OLTP - a snapshot of ongoing business processes. OLAP - multi-dimensional views of various kinds of business activities.
Inserts and updates: OLTP - short and fast inserts and updates initiated by end users. OLAP - periodic long-running batch jobs refresh the data.
Queries: OLTP - relatively standardized and simple queries returning relatively few records. OLAP - often complex queries involving aggregations.
Processing speed: OLTP - typically very fast. OLAP - depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.