Chapter 2

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 44

Wollega University

Chapter Two
Data Warehousing and Data Mining
November 13, 2021
Data Warehouse Concepts

• The basic concept of a Data Warehouse is to facilitate a single version


of truth for a company for decision making and forecasting.
• A Data warehouse is an information system that contains historical
and commutative data from single or multiple sources.
• Data Warehouse Concepts simplify the reporting and analysis
process of organizations.
• Types of Data Warehouse
• Information processing, analytical processing, and data mining are
the three types of data warehouse applications that are discussed
below −
• Information Processing − A data warehouse allows to process the
data stored in it. The data can be processed by means of querying,
basic statistical analysis, reporting using crosstabs, tables, charts, or
graphs.
• Analytical Processing − A data warehouse supports analytical
processing of the information stored in it. The data can be analyzed
by means of basic OLAP operations, including slice-and-dice, drill
down, drill up, and pivoting.
• Data Mining − Data mining supports knowledge discovery by finding
hidden patterns and associations, constructing analytical models,
performing classification and prediction. These mining results can be
presented using the visualization tools.
• Characteristics of Data Warehouse

• A data warehouse can be viewed as an information system with


the following attributes:
• – It is a database designed for analytical tasks
• – It‘s content is periodically updated
• – It contains current and historical data to provide a historical
perspective of information
• Data Warehouse Features
• The key features of a data warehouse are discussed below −
• Subject Oriented − A data warehouse is subject oriented because it provides
information around a subject rather than the organization's ongoing
operations.
• These subjects can be product, customers, suppliers, sales, revenue, etc.
• A data warehouse does not focus on the ongoing operations, rather it
focuses on modeling and analysis of data for decision making.
• Integrated − A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc.
• This integration enhances the effective analysis of data.
• Time Variant − The data collected in a data warehouse is identified with a
particular time period.
• The data in a data warehouse provides information from the historical point
of view.
• Non-volatile − Non-volatile means the previous data is not erased when new
data is added to it.
• A data warehouse is kept separate from the operational database and
therefore frequent changes in operational database is not reflected in the
data warehouse.
• Note − A data warehouse does not require transaction processing, recovery,
and concurrency controls, because it is physically stored and separate from
the operational database.
• Data Warehouse Applications
• a data warehouse helps business executives to organize, analyze, and use
their data for decision making.
• A data warehouse serves as a sole part of a plan-execute-assess "closed-loop"
feedback system for the enterprise management.
• Data warehouses are widely used in the following fields −
• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing
• Functions of Data Warehouse Tools and Utilities

• The following are the functions of data warehouse tools and


utilities −
• Data Extraction − Involves gathering data from multiple
heterogeneous sources.
• Data Cleaning − Involves finding and correcting the errors in data.
• Data Transformation − Involves converting the data from legacy
format to warehouse format.
• Data Loading − Involves sorting, summarizing, consolidating,
checking integrity, and building indices and partitions.
• Refreshing − Involves updating from data sources to warehouse.
• Note − Data cleaning and data transformation are important steps
in improving the quality of data and data mining results.
• Data Warehousing - Terminologies
• In this chapter, we will discuss some of the most commonly used
terms in data warehousing.
• Metadata
• Metadata is simply defined as data about data. The data that are
used to represent other data is known as metadata.
• For example, the index of a book serves as a metadata for the
contents in the book. In other words, we can say that metadata is
the summarized data that leads us to the detailed data.
• In terms of data warehouse, we can define metadata as
following
• Metadata is a road-map to data warehouse.
• Metadata in data warehouse defines the warehouse objects.
• Metadata acts as a directory. This directory helps the decision
support system to locate the contents of a data warehouse.
• Metadata Repository
• Metadata repository is an integral part of a data warehouse
system. It contains the following metadata −
• Business metadata − It contains the data ownership information,
business definition, and changing policies.
• Operational metadata − It includes currency of data and data
lineage. Currency of data refers to the data being active, archived,
or purged. Lineage of data means history of data migrated and
transformation applied on it.
• Data for mapping from operational environment to data
warehouse − It metadata includes source databases and their
contents, data extraction, data partition, cleaning, transformation
rules, data refresh and purging rules.
• The algorithms for summarization − It includes dimension
algorithms, data on granularity, aggregation, summarizing, etc.
• Data Cube
• A data cube helps us represent data in multiple dimensions. It is
defined by dimensions and facts.
• The dimensions are the entities with respect to which an
enterprise preserves the records.

• Data Mart
• Data marts contain a subset of organization-wide data that is
valuable to specific groups of people in an organization.
• In other words, a data mart contains only those data that is
specific to a particular group.
• For example, the marketing data mart may contain only data
related to items, customers, and sales. Data marts are confined to
subjects.
• Points to Remember About Data Marts

• Windows-based or Unix/Linux-based servers are used to


implement data marts. They are implemented on low-cost
servers.
• The implementation cycle of a data mart is measured in short
periods of time, i.e., in weeks rather than months or years.
• The life cycle of data marts may be complex in the long run, if
their planning and design are not organization-wide.
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is departmentally structured data
warehouse.
• Data marts are flexible.
• The following figure shows a graphical representation of data
marts.
• Other important terminology in data warehouse
• Enterprise Data warehouse: It collects all information about
subjects (customers, products, sales, assets, personnel) that
span the entire organization .
• Decision Support System (DSS): Information technology to
help the knowledge worker (executive, manager, and analyst)
makes faster & better decisions Drill-down.
• Designing the Data Warehouse
• Data warehouse is difficult to build due to the following reason:
• Heterogeneity of data sources
• Use of historical data
• Growing nature of data base.
• Data warehouse design approach must be business driven, continuous
and iterative engineering approach.
• In addition to the general considerations there are following specific
points relevant to the data warehouse design:
• Data content :
• The content and structure of the data warehouse are reflected in its
data model.
• The data model is the template that describes how information will be
organized within the integrated warehouse framework.
• The data warehouse data must be a detailed data. It must be formatted,
cleaned up and transformed to fit the warehouse data model.
• Meta data
• It defines the location and contents of data in the warehouse.
• Meta data is searchable by users to find definitions or subject areas.
• In other words, it must provide decision support oriented pointers
to warehouse data and provides a logical link between warehouse
data and decision support applications.
• Data distribution
• One of the biggest challenges when designing a data warehouse is
the data placement and distribution strategy.
• it becomes necessary to know how the data should be divided
across multiple servers and which users should get access to which
types of data.
• The data can be distributed based on the subject area, location
(geographical region), or time (current, month, year).
• Tools
• A number of tools are available that are specifically designed to help
in the implementation of the data warehouse.
• All selected tools must be compatible with the given data warehouse
environment and with each other.
• All tools must be able to use a common Meta data repository.
• Technical considerations
• A number of technical issues are to be considered when designing a
data warehouse environment. These issues include:
• The hardware platform that would house the data warehouse
• The dbms that supports the warehouse data
• The communication infrastructure that connects data marts,
operational systems and end users
• The hardware and software to support meta data repository
• The systems management framework that enables admin of the
entire environment
• In general, the warehouse design process consists of the
following steps:

• 1. Choose a business process to model If the business process is


organizational and involves multiple complex object collections, a data
warehouse model should be followed.
• if the process is departmental and focuses on the analysis of one kind
of business process, a data mart model should be chosen.
• 2. Choose the business process grain, which is the fundamental, atomic
level of data to be represented in the fact table for this process .
• 3. Choose the dimensions that will apply to each fact table record.
Typical dimensions are time, item, customer, supplier, warehouse,
transaction type, and status.
• 4. Choose the measures that will populate each fact table record.
Typical measures are numeric quantities like dollars sold and
units sold.
• Process of Data Warehouse Design
• A data warehouse can be built using three approaches:
• 1. A top-down approach
• 2. A bottom-up approach
• 3. A combination of both approaches
• Data warehouses are designed to facilitate reporting and analysis.
• The top-down approach starts with the overall design and planning. It is
useful in cases where the technology is mature and well-known, and where
the business problems that must be solved are clear and well-understood.
• The bottom-up approach starts with experiments and prototypes. This is
useful in the early stage of business modeling and technology
development. It allows an organisation to move forward at considerably
less expense and to evaluate the benefits of the technology before making
significant commitments.
• In the combined approach, an organisation can exploit the planned and
strategic nature of the top-down approach while retaining the rapid
implementation and opportunistic application of the bottom-up approach.
• Data Warehouse Design Architecture
• Data Warehouse Architecture is complex as it’s an information
system that contains historical and commutative data from multiple
sources.
• There are 3 approaches for constructing Data Warehouse layers:
• Single Tier, Two tier and Three tier. This 3 tier architecture of Data
Warehouse is explained as below.
• Single-tier architecture
• The objective of a single layer is to minimize the amount of data
stored. This goal is to remove data redundancy. This architecture is
not frequently used in practice.
• Two-tier architecture
• Two-layer architecture is one of the Data Warehouse layers which
separates physically available sources and data warehouse. This
architecture is not expandable and also not supporting a large
number of end-users. It also has connectivity problems because of
network limitations.
• Three-Tier Data Warehouse Architecture
• This is the most widely used Architecture of Data Warehouse.
• It consists of the Top, Middle and Bottom Tier.
• Bottom Tier: The database of the Data warehouse servers as the
bottom tier. It is usually a relational database system. Data is
cleansed, transformed, and loaded into this layer using back-end
tools.
• Middle Tier: The middle tier in Data warehouse is an OLAP server
which is implemented using either ROLAP or MOLAP model. For a
user, this application tier presents an abstracted view of the
database. This layer also acts as a mediator between the end-user
and the database.
• Top-Tier: The top tier is a front-end client layer. Top tier is the tools
and API that you connect and get data out from the data warehouse.
It could be Query tools, reporting tools, managed query tools,
Analysis tools and Data mining tools.
• The Data Warehouse is based on an RDBMS server which is a central
information repository that is surrounded by some key Data
Warehousing components to make the entire environment
functional, manageable and accessible.
• There are mainly five Data Warehouse Components:

• Data Warehouse Database


• The central database is the foundation of the data warehousing
environment. This database is implemented on the RDBMS
technology.
• This kind of implementation is constrained by the fact that
traditional RDBMS system is optimized for transactional database
processing and not for data warehousing.
• For instance, ad-hoc query, multi-table joins, aggregates are
resource intensive and slow down performance.
• Hence, alternative approaches to Database are used as listed
below-
• In a data warehouse, relational databases are deployed in parallel
to allow for scalability.
• Parallel relational databases also allow shared memory or shared
nothing model on various multiprocessor configurations or
massively parallel processors.
• New index structures are used to bypass relational table scan and
improve speed.
• Use of multidimensional database (MDDBs) to overcome any
limitations which are placed because of the relational Data
Warehouse Models. Example: Essbase from Oracle.
• Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)
• The data sourcing, transformation, and migration tools are used for
performing all the conversions, summarizations, and all the changes
needed to transform data into a unified format in the data
warehouse.
• They are also called Extract, Transform and Load (ETL) Tools.
• Their functionality includes:
• Eliminating unwanted data in operational databases from loading
into Data warehouse.
• Search and replace common names and definitions for data arriving
from different sources.
• Calculating summaries and derived data
• In case of missing data, populate them with defaults.
• De-duplicated repeated data arriving from multiple data sources.
• Note:
• These Extract, Transform, and Load tools may generate background
jobs, Cobol programs, shell scripts, etc. that regularly update data in
data warehouse. These tools are also helpful to maintain the Metadata.
• These ETL Tools have to deal with challenges of Database & Data
heterogeneity.
• Metadata
• The name Meta Data suggests some high-level technological Data
Warehousing Concepts.
• Metadata is data about data which defines the data warehouse. It is
used for building, maintaining and managing the data warehouse.
• In the Data Warehouse Architecture, meta-data plays an important role
as it specifies the source, usage, values, and features of data
warehouse data.
• It also defines how data can be changed and processed. It is closely
connected to the data warehouse.
• Metadata helps to answer the following questions
• What tables, attributes, and keys does the Data Warehouse
contain?
• Where did the data come from?
• How many times do data get reloaded?
• What transformations were applied with cleansing?

• Metadata can be classified into following categories:


• Technical Meta Data: This kind of Metadata contains information
about warehouse which is used by Data warehouse designers and
administrators.
• Business Meta Data: This kind of Metadata contains detail that
gives end-users a way easy to understand information stored in the
data warehouse.
• Data Warehousing Query Tools

• One of the primary objects of data warehousing is to provide


information to businesses to make strategic decisions.
• Query tools allow users to interact with the data warehouse
system. These tools fall into four different categories:
• Query and reporting tools
• Application Development tools
• Data mining tools
• OLAP tools
• 1. Query and reporting tools:
• Query and reporting tools can be further divided into
• Reporting tools
• Managed query tools
• Reporting tools:
• Reporting tools can be further divided into production reporting tools and
desktop report writer.
• Report writers: This kind of reporting tool are tools designed for end-users
for their analysis.
• Production reporting: This kind of tools allows organizations to generate
regular operational reports. It also supports high volume batch jobs like
printing and calculating.
• Some popular reporting tools are Brio, Business Objects, Oracle, Power Soft,
SAS Institute.
• Managed query tools:
• This kind of access tools helps end users to resolve snags in database and SQL
and database structure by inserting meta-layer between users and database.
• Examples of types of reporting tools include:
• 1. Business intelligence tools: These are software applications that
simplify the process of development and production of business reports
based on data warehouse data.
• 2. Executive information systems (known more widely as Dashboard
(business): These are software applications that are used to display
complex business metrics and information in a graphical way to allow
rapid understanding.
• 3. OLAP Tools: OLAP tools form data into logical multi-dimensional
structures and allow users to select which dimensions to view data by.
• 4. Data Mining: Data mining tools are software that allow users to
perform detailed mathematical and statistical calculations on detailed
data warehouse data to detect trends, identify patterns and analyze data.
• 5. Application development tools:
• Sometimes built-in graphical and analytical tools do not satisfy the
analytical needs of an organization. In such cases, custom reports are
developed using Application development tools.
• Data warehouse Bus Architecture
• Data warehouse Bus determines the flow of data in your warehouse.
• The data flow in a data warehouse can be categorized as Inflow, Up
flow, Down flow, Outflow and Meta flow.
• While designing a Data Bus, one needs to consider the shared
dimensions, facts across data marts.
• Data Marts
• A data mart is an access layer which is used to get data out to the
users. It is presented as an option for large size data warehouse as it
takes less time and money to build.
• there is no standard definition of a data mart is differing from
person to person.
• In a simple word Data mart is a subsidiary of a data warehouse. The
data mart is used for partition of data which is created for the
specific group of users. Data marts could be created in the same
database as the Data warehouse or a physically separate Database
• Operations
• A data warehouse operation is comprised of the processes of
loading, manipulating and extracting data from the data
warehouse.
• Operations also cover user management, security, capacity
management and related functions.
• Roll-up: The roll-up operation (also called the drill-up operation )
• performs aggregation on a data cube, either by climbing up a
concept hierarchy for a dimension or by dimension reduction.
• When roll-up is performed by dimension reduction, one or more
dimensions are removed from the given cube.
• Drill-down: Drill-down is the reverse of roll-up. It navigates from
less detailed data to more detailed data.
• Drill-down can be realized by either stepping down a concept
hierarchy for a dimension or introducing additional dimensions.
• Slice and dice: The slice operation performs a selection on one
dimension of the given cube, resulting in a sub cube.
• The dice operation defines a sub cube by performing a selection on
two or more dimensions.
• Pivot (rotate): Pivot (also called rotate) is a visualization operation
that rotates the data axes in view to provide an alternative data
presentation.
• a pivot operation where the item and location axes in a 2-D slice
are rotated.
• Other examples include rotating the axes in a 3-D cube, or
transforming a 3-D cube into a series of 2-D planes
• Major Issues in Data Warehousing 
•  Building a data Warehouse is very difficult and a pain. It is
challenging, when data warehouses work properly, they are
magnificently useful, huge fun and unbelievably rewarding.
• Some of the major issues involved in building data warehouse are
discussed below:
• General Issues: It includes but is not limited to following issues:
• What kind of analysis do the business users want to perform?
• Do you currently collect the data required to support that analysis?
• How clean is data?
• Are there multiple sources for similar data?
• What structure is best for the core data warehouse (i.e., dimensional
or relational)?
• Technical Issues: It includes but is not limited to following issues
• How much data are you going to ship around your network, and will
it be able to cope?
• How much disk space will be needed?
• How fast does the disk storage need to be?
• Are you going to use SSDs to store “hot” data (i.e., frequently
accessed information)?
• What database and data management technology expertise already
exists within the company?
• Cultural Issues: It includes but is not limited to following issues
• How do data definitions differ between your operational systems?
Different departments and business units often use their own
definitions of terms like “customer,” “sale” and “order” within
systems.
• So you’ll need to standardize the definitions and add prefixes such
as “all sales,” “recent sales,” “commercial sales” and so on.
• What’s the process for gathering business requirements? Some
people will not want to spend time for you. Instead, they will
expect you to use your telepathic powers to divine their
warehousing and data analysis needs.
• Applications of Data Warehousing 
• Information processing, analytical processing, and data mining are
the three types of data warehouse applications that are discussed
below:
• Information Processing - A data warehouse allows to process the
data stored in it. The data can be processed by means of querying,
basic statistical analysis, reporting using crosstabs, tables, charts, or
graphs.
• Analytical Processing - A data warehouse supports analytical
processing of the information stored in it. The data can be analyzed
by means of basic OLAP operations, including slice-and-dice, drill
down, drill up, and pivoting.
• Data Mining - Data mining supports knowledge discovery by finding
hidden patterns and associations, constructing analytical models,
performing classification and prediction. These mining results can
be presented using the visualization tools.

You might also like