DBMS: Data Warehousing and Business Intelligence
Taught by: Mrs. Veena Kiragi
School of Computer Science Engg. and Applications
DY Patil International University
What is KDD?
• Knowledge Discovery from Data/Databases (KDD)
or Pattern Analysis, Knowledge Extraction
• Sources of data –
– Business such as ecommerce,
– Science such as biomedical field,
– Society such as news, digital cameras,
smartphones, apps.
Contd.
• Many people treat data mining as a synonym for
another popularly used term, knowledge
discovery from data, or KDD.
Figure: the KDD process — data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.
Data Mining Functionalities
Data mining functionalities are either descriptive or predictive:
• Characterization and discrimination
• Mining of frequent patterns, associations, and correlations
• Classification and regression
• Cluster analysis
• Outlier analysis
Data Warehouse
• Data warehouse systems are valuable tools in today's competitive, fast-evolving world. A data warehouse is a central repository of information that can be analyzed to make more informed decisions.
A data warehouse is the place where the valuable data assets of an organization are stored.
Contd.
• A data warehouse is a type of data management system that is designed
to enable and support business intelligence (BI) activities.
• The data within a data warehouse is usually derived from a wide range
of sources such as application log files and transaction applications.
Contd.
• Over time, it builds a historical record that can be invaluable to data
scientists and business analysts.
Subject Oriented
• A data warehouse is organized around major subjects such as customer,
supplier, product, and sales.
Integrated
• A data warehouse is usually constructed by integrating multiple
heterogeneous sources, such as relational databases, flat files, and
online transaction records.
Time-variant
• Data are stored to provide information from an historic perspective (e.g.,
the past 5–10 years).
Non-volatile
• A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment.
• Once data is in a data warehouse, it is stable: data are loaded and accessed, but typically not updated.
What are the benefits of using a data warehouse?
• Benefits of a data warehouse include a single, consistent view of data drawn from many sources, a reliable historical record, fast analytical query performance, and better-informed decision making.
Contd…
• A data warehouse architecture is typically a three-tier system: a bottom tier (the warehouse database server), a middle tier (an OLAP server), and a top tier (front-end client tools).
• The top tier is the front-end client layer that presents the results of queries through reporting, analysis, and data mining tools.
OLAP and OLTP Systems
• OLAP: Online Analytical Processing
• OLTP: Online Transaction Processing
Comparison of OLTP and OLAP
Systems
Source: https://www.youtube.com/
watch?v=q9oAZwhuUy4
OLTP Systems
• These systems are called online transaction processing (OLTP) systems.
• The major task of these systems is to perform online transaction and query processing.
• They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.
• An OLTP system is customer-oriented.
• An OLTP system manages current data that, typically, are too detailed to be easily used for decision making.

OLAP Systems
• These systems are known as online analytical processing (OLAP) systems.
• Data warehouse systems serve users or knowledge workers in the role of data analysis and decision making.
• Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users.
• An OLAP system is market-oriented.
• An OLAP system manages large amounts of historic data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity.
OLTP Systems
• An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design.
• An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historic data or data in different organizations.
• The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms.

OLAP Systems
• An OLAP system typically adopts either a star or a snowflake model and a subject-oriented database design.
• An OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization.
• OLAP systems also deal with information that originates from different organizations, integrating information from many data stores.
• Because of their huge volume, OLAP data are stored on multiple storage media.
Comparison of OLTP and OLAP Systems

Feature        | OLTP                              | OLAP
Characteristic | operational processing            | informational processing
Orientation    | transaction                       | analysis
User           | clerk, DBA, database professional | knowledge worker (e.g., manager, executive, analyst)
Function       | day-to-day operations             | long-term informational requirements, decision support
DB design      | ER-based, application-oriented    | star/snowflake, subject-oriented
Data           | current, guaranteed up-to-date    | historic, accuracy maintained over time
Summarization  | primitive, highly detailed        | summarized, consolidated
View           | detailed, flat relational         | summarized, multidimensional
Contd.

Feature                    | OLTP                            | OLAP
Access                     | read/write                      | mostly read
Focus                      | data in                         | information out
Number of records accessed | tens                            | millions
Number of users            | thousands                       | hundreds
DB size                    | GB to high-order GB             | ≥ TB
Priority                   | high performance, high availability | high flexibility, end-user autonomy
Metric                     | transaction throughput          | query throughput, response time
Dimensional Data Models
• The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities and the relationships between them.
1. Star Schema
• A star schema for AllElectronics sales is shown in the figure here. Sales are considered along four dimensions: time, item, branch, and location.
• Each dimension is represented by only one table, and each table contains a set of attributes.
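A star schema of this kind can be sketched in SQLite: one fact table whose rows reference dimension tables, queried by joining facts to dimensions. The table and column names (`fact_sales`, `dim_time`, `dim_item`) and the figures are illustrative, not taken from the AllElectronics schema.

```python
import sqlite3

# Minimal star schema: a fact table referencing two dimension tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
CREATE TABLE fact_sales (
    time_key INTEGER REFERENCES dim_time(time_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    dollars_sold REAL,
    units_sold INTEGER
);
""")
cur.execute("INSERT INTO dim_time VALUES (1, 2010, 'Q1'), (2, 2010, 'Q2')")
cur.execute("INSERT INTO dim_item VALUES (1, 'home laptop', 'laptop')")
cur.execute("INSERT INTO fact_sales VALUES (1, 1, 1605.0, 10), (2, 1, 1200.0, 8)")

# A typical warehouse query joins the fact table with its dimensions
# and aggregates the measure.
cur.execute("""
SELECT t.year, i.type, SUM(f.dollars_sold)
FROM fact_sales f
JOIN dim_time t ON f.time_key = t.time_key
JOIN dim_item i ON f.item_key = i.item_key
GROUP BY t.year, i.type
""")
rows = cur.fetchall()
print(rows)  # [(2010, 'laptop', 2805.0)]
conn.close()
```

The query illustrates why the star layout suits OLAP workloads: each dimension is one join away from the fact table.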
2. Snowflake Schema
• The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.
• The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form to reduce redundancies.
• Such a table is easy to maintain and saves storage space. However, this space saving is negligible in comparison to the typical magnitude of the fact table.
3. Fact Constellation Schema
• Multiple fact tables may share dimension tables. For example, the dimension tables for time, item, and location are shared between the sales and shipping fact tables.
• For data warehouses, the fact constellation schema is commonly used, since it can model multiple, interrelated subjects. A data mart, on the other hand, is a department subset of the data warehouse that focuses on selected subjects, and thus its scope is department-wide.
• For data marts, the star or snowflake schema is commonly used, since both are geared toward modeling single subjects, although the star schema is more popular and efficient.
Data Warehouse Models/Schemas
• From the architecture point of view, there are three data warehouse models: the enterprise warehouse, the data mart, and the virtual warehouse.
1. Enterprise Warehouse
• An enterprise warehouse collects all of the information about subjects
spanning the entire organization.
2. Data Mart
• A data mart contains a subset of corporate-wide data that is of value to a specific group of users, with its scope confined to selected subjects.
• However, it may involve complex integration in the long run if its design and planning were not enterprise-wide.
• Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area.
3. Virtual Warehouse
• A virtual warehouse is a set of views over operational databases.
• For efficient query processing, only some of the possible summary views may be materialized.
Data Cube: A Multidimensional
Data Model
• Data warehouses and OLAP tools are based on a multidimensional data model.
This model views data in the form of a data cube.
• Each dimension may have a table associated with it, called a dimension table,
which further describes the dimension.
• For example, a dimension table for item may contain the attributes item name,
brand, and type.
Contd.
• A multidimensional data model is typically organized around a central theme, such as sales.
• Examples of facts for a sales data warehouse include dollars sold (sales amount in dollars), units sold (number of units sold), and amount budgeted.
Example
• Imagine that you have collected the data for your analysis. These data
consist of the AllElectronics sales per quarter, for the years 2008 to
2010.
Contd.
• You are, however, interested in the annual sales (total per year),
rather than the total per quarter.
• Thus, the data can be aggregated so that the resulting data summarize
the total sales per year instead of per quarter.
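The quarter-to-year aggregation described above can be sketched in plain Python. The quarterly figures below are made up for illustration, not taken from the AllElectronics data:

```python
# Quarterly dollars_sold (in thousands), keyed by (year, quarter).
quarterly_sales = {
    (2008, "Q1"): 1605, (2008, "Q2"): 1200, (2008, "Q3"): 1300, (2008, "Q4"): 1400,
    (2009, "Q1"): 1100, (2009, "Q2"): 1150, (2009, "Q3"): 1250, (2009, "Q4"): 1350,
    (2010, "Q1"): 1700, (2010, "Q2"): 1600, (2010, "Q3"): 1500, (2010, "Q4"): 1800,
}

# Aggregate (roll up) from quarter to year by summing each year's quarters.
annual_sales = {}
for (year, _quarter), amount in quarterly_sales.items():
    annual_sales[year] = annual_sales.get(year, 0) + amount

print(annual_sales)  # {2008: 5505, 2009: 4850, 2010: 6600}
```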
2-D DATA CUBE
• In particular, we will look at the AllElectronics sales data for items sold per quarter in the city of Vancouver.
• In this 2-D representation, the sales for Vancouver are shown with respect to the time dimension (organized in quarters) and the item dimension (organized according to the types of items sold). The fact, or measure, displayed is dollars sold (in thousands).
3-D DATA CUBE
• Suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver.
• The 3-D data in the table are represented as a series of 2-D tables.
Operations on Cubes
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
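Two of these operations can be sketched on a tiny dictionary-based "cube" in plain Python. The dimension values and figures are illustrative:

```python
# A tiny 3-D cube: (quarter, item, location) -> dollars_sold (in thousands).
cube = {
    ("Q1", "laptop", "Vancouver"): 1605,
    ("Q1", "phone",  "Vancouver"): 825,
    ("Q1", "laptop", "Toronto"):   1087,
    ("Q2", "laptop", "Vancouver"): 680,
}

def roll_up(cube, dims):
    """Aggregate away every dimension except those at positions `dims`."""
    out = {}
    for key, value in cube.items():
        reduced = tuple(key[i] for i in dims)
        out[reduced] = out.get(reduced, 0) + value
    return out

def slice_(cube, dim, value):
    """Select the sub-cube where dimension `dim` equals `value`."""
    return {k: v for k, v in cube.items() if k[dim] == value}

# Roll up to the item dimension: total sales per item type.
by_item = roll_up(cube, dims=(1,))
print(by_item)  # {('laptop',): 3372, ('phone',): 825}

# Slice on location = Vancouver: a 2-D view over time and item.
vancouver = slice_(cube, dim=2, value="Vancouver")
print(vancouver)
```

Drill-down is the reverse of roll-up (moving to finer granularity), and dice is a slice over several dimensions at once.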
Extraction, Transformation, and Loading (ETL) Operations
• Data warehouse systems use back-end tools and utilities to populate and refresh their data.
• These tools and utilities include the following functions:
– Data extraction, which gathers data from multiple, heterogeneous, and external sources.
– Data cleaning, which detects errors in the data and rectifies them when possible.
– Data transformation, which converts data from legacy or host format to warehouse format.
– Load, which sorts, summarizes, consolidates, checks integrity, and builds indexes and partitions.
– Refresh, which propagates the updates from the data sources to the warehouse.
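A toy sketch of the cleaning/transformation step: normalize a date field to one common format and drop rows with a missing measure before loading. The field names and rows are hypothetical:

```python
from datetime import datetime

# Extracted rows from two sources with inconsistent date formats
# and one missing measure (illustrative data).
raw_rows = [
    {"date": "2010-03-01", "dollars_sold": "1605"},
    {"date": "01/03/2010", "dollars_sold": "825"},   # day/month/year source
    {"date": "2010-03-02", "dollars_sold": None},    # missing measure
]

def clean(row):
    """Transform the date field to a single common format (ISO 8601)."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            row["date"] = datetime.strptime(row["date"], fmt).strftime("%Y-%m-%d")
            break
        except ValueError:
            continue
    return row

# Drop incomplete rows, then normalize the survivors before loading.
loaded = [clean(r) for r in raw_rows if r["dollars_sold"] is not None]
print(loaded)
```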
Data Preprocessing
• How can the data be preprocessed in order to help improve the quality
of the data?
• How can the data be preprocessed in order to help improve the mining
results?
Why Do We Need Data Preprocessing? (Overview)
• In large real-world databases, the three most common problems concern whether the data are accurate, complete, and consistent; preprocessing addresses all three.
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking
certain attributes of interest, or containing only
aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or
names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of
quality data
• A multi-dimensional measure of data quality:
– A well-accepted multi-dimensional view:
• accuracy, completeness, consistency, timeliness,
believability, value added, interpretability, accessibility
Why Are There Data Problems?
• There are many possible reasons for inaccurate data (e.g., having incorrect attribute values).
• The instrument used for data collection may be faulty, or there may be human or computer errors at data entry.
• Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering.
• These techniques are not mutually exclusive; they may work together. For example, data cleaning can involve transformations to correct wrong data, such as transforming all entries for a date field to a common format.
Forms of Data Preprocessing
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy
data
– Correct inconsistent data
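One standard way to smooth noisy data (a technique from common practice, not detailed on this slide) is binning: sort the values, partition them into equal-frequency bins, and replace each value by its bin's mean. A minimal sketch with illustrative prices:

```python
# Sorted, noisy attribute values (illustrative).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

# Smoothing by bin means: each equal-frequency bin of 3 values is
# replaced by that bin's mean, damping small fluctuations.
bin_size = 3
smoothed = []
for i in range(0, len(prices), bin_size):
    bin_ = prices[i:i + bin_size]
    mean = sum(bin_) / len(bin_)
    smoothed.extend([mean] * len(bin_))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```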
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time
of entry
– not register history or changes of the data
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective in certain cases.
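Beyond ignoring the tuple, a common strategy (standard practice, not shown on this slide) is to fill a missing numeric attribute with the attribute mean. A minimal sketch with made-up income values:

```python
# Customer incomes with missing values (illustrative data).
incomes = [42000, None, 39000, None, 51000]

# Fill each missing value with the mean of the known values.
known = [x for x in incomes if x is not None]
mean_income = sum(known) / len(known)        # (42000+39000+51000)/3 = 44000.0
filled = [mean_income if x is None else x for x in incomes]

print(filled)  # [42000, 44000.0, 39000, 44000.0, 51000]
```

A refinement is to use the mean per class (e.g., per credit-risk category) rather than the global mean.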
• Problem: a data warehouse may store terabytes of data, so complex data analysis/mining may take a very long time to run on the complete data set.
• Solution? Data reduction…
Data Reduction
• Obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
• Data reduction strategies
– Data cube aggregation
– Dimensionality reduction
– Data compression
– Numerosity reduction
– Discretization and concept hierarchy
generation
Data Cube Aggregation
• Multiple levels of aggregation in data
cubes
– Further reduce the size of data to deal with
• Reference appropriate levels
– Use the smallest representation capable to
solve the task
• Queries regarding aggregated
information should be answered using
data cube, when possible
Data Compression
• String compression
– There are extensive theories and well-tuned
algorithms
– Typically lossless
– But only limited manipulation is possible without
expansion
• Audio/video, image compression
– Typically lossy compression, with progressive
refinement
– Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
• Time sequence is not audio
– Typically short and vary slowly with time
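The lossless case can be demonstrated with Python's zlib: after a compress/decompress round trip, the output is byte-for-byte identical to the input, while the compressed form is much smaller for repetitive data.

```python
import zlib

# Highly repetitive byte string: compresses well under a lossless scheme.
text = b"attribute value " * 100

compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

print(len(text), len(compressed))   # compressed form is far smaller
assert restored == text             # lossless: nothing is lost
```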
Contd.
Figure: lossless compression recovers the original data exactly, while lossy compression yields only an approximation of the original data.
END OF MODULE 1
THANK YOU!