UNIT II
DATA WAREHOUSE
Introduction
A data warehouse is a centralized repository that stores structured, semi-structured, and
unstructured data from various sources. It's designed for query and analysis rather than transaction
processing. Here's a brief introduction to its key aspects:
Centralization: Data warehouses consolidate data from multiple sources into a single location.
This includes data from operational systems, external sources, and other data marts.
Integration: The data in a warehouse is integrated to ensure consistency and accuracy. This
integration often involves cleaning, transforming, and organizing the data according to a common
schema.
Subject-Oriented: Data warehouses are organized around specific subject areas, such as sales,
finance, or marketing. This subject-oriented approach allows for focused analysis and reporting.
Time-Variant: Data warehouses maintain historical data, enabling analysis of trends and patterns
over time. This temporal aspect is crucial for making informed decisions based on past
performance.
Non-Volatile: Once data is stored in a warehouse, it is rarely modified or deleted. This ensures
data integrity and provides a reliable foundation for analysis.
Query and Analysis: Data warehouses are optimized for complex queries and analytical tasks.
They typically support tools and techniques for data exploration, reporting, and data mining.
Scalability and Performance: Modern data warehouses are designed to handle large volumes of
data and support concurrent user access. They employ techniques such as parallel processing and
distributed computing to ensure high performance.
Business Intelligence (BI) Integration: Data warehouses often serve as the foundation for
business intelligence initiatives. BI tools connect to the warehouse to generate reports, dashboards,
and visualizations that support decision-making processes.
Definition
A data warehouse is a centralized repository that aggregates and organizes data from multiple
sources into a unified and consistent format. It is specifically designed for querying, analysis, and
reporting purposes rather than transactional processing. Data warehouses typically store historical
and current data, allowing organizations to perform complex analyses and gain insights into their
operations, customers, and business performance over time. This structured approach to data
management enables decision-makers to make informed strategic decisions based on a
comprehensive view of the organization's data.
Multidimensional Data Model
A multidimensional data model represents data as numeric measures (facts such as sales amount) described by dimensions. Its key components are:
Dimensions: Dimensions are attributes that describe the measures, such as time, location, or
product. They are typically stored in dimension tables in a multidimensional data model.
Cubes: Cubes are structures that represent the multidimensional relationships between measures
and dimensions in a data model. They provide a fast and efficient way to retrieve and analyze
data.
Aggregation: Aggregation is the process of summarizing data across dimensions and levels of
detail. This is a key feature of multidimensional data models, as it enables users to quickly
analyze data at different levels of granularity.
Drill-down and roll-up: Drill-down is the process of moving from a higher-level summary of
data to a lower level of detail, while roll-up is the opposite process of moving from a lower-level
detail to a higher-level summary. These features enable users to explore data in greater detail and
gain insights into the underlying patterns.
Hierarchies: Hierarchies are a way of organizing dimensions into levels of detail. For example,
a time dimension might be organized into years, quarters, months, and days. Hierarchies provide
a way to navigate the data and perform drill-down and roll-up operations.
OLAP (Online Analytical Processing): OLAP is an approach to querying multidimensional data
that supports fast and efficient analysis of large datasets. OLAP systems are designed to handle
complex analytical queries and provide fast response times.
Let us take the example of a factory that sells products per quarter in Bangalore. The sales data
can be represented in a two-dimensional table, with item and time (quarter) as its dimensions.
Figure: 2D factory data
Now, if we also want to view the sales according to location (such as Kolkata, Delhi, and
Mumbai), the data has three dimensions: item, time, and location. It can still be laid out as a set
of two-dimensional tables, one per location, as shown below.
Figure: 3D data represented as 2D tables
Conceptually, the same data can be represented as a three-dimensional cube, with one axis for
each dimension:
Figure: 3D data representation
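To make the idea concrete, here is a minimal sketch in Python using pandas with made-up sales figures (not the values from the figures above). A pivot table over the item and quarter dimensions imitates a slice of the cube; summing away a dimension corresponds to a roll-up, and grouping by an additional dimension corresponds to a drill-down.

import pandas as pd

# Hypothetical sales facts: three dimensions (item, quarter, location)
# and one measure (sales).
sales = pd.DataFrame({
    "item":     ["Keyboard", "Keyboard", "Mouse", "Mouse", "Keyboard", "Mouse"],
    "quarter":  ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "location": ["Kolkata", "Kolkata", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "sales":    [605, 825, 140, 400, 300, 250],
})

# Cube-style view: item x quarter, aggregated over location.
cube = sales.pivot_table(index="item", columns="quarter",
                         values="sales", aggfunc="sum")
print(cube)

# Roll-up: aggregate away quarter and location to get total sales per item.
print(sales.groupby("item")["sales"].sum())

# Drill-down: bring the location dimension back for a finer level of detail.
print(sales.groupby(["item", "location"])["sales"].sum())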
Data cleaning
Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and
correcting errors, inconsistencies, and inaccuracies in a dataset to improve its quality, reliability, and
usefulness for analysis or other purposes. It is an essential step in data preparation, ensuring that
the data used for analysis or decision-making is accurate, complete, and consistent.
Identifying Data Quality Issues: This involves examining the dataset to identify various types of
data quality issues, including missing values, duplicates, outliers, incorrect formatting, and
inconsistencies.
Handling Missing Values: Missing values are common in datasets and can adversely affect the
results of analysis. Depending on the context, missing values can be imputed (replaced with
estimated values such as the mean, median, or mode), or the affected rows can be deleted.
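As a small sketch (assuming pandas and a hypothetical table with age and city columns), missing values can be imputed or dropped as follows.

import pandas as pd
import numpy as np

df = pd.DataFrame({"age":  [25, np.nan, 40, 31],
                   "city": ["Mysuru", "Bengaluru", None, "Mysuru"]})

# Numeric column: replace missing values with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: replace missing values with the most frequent value.
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Alternatively, rows with any remaining missing values could be dropped:
# df = df.dropna()
print(df)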
Removing Duplicates: Duplicate records or observations can skew analysis results and should be
identified and removed from the dataset. This involves comparing records based on key attributes
and keeping only one instance of each unique record.
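For example (a minimal pandas sketch with a made-up customer table), duplicates can be removed by comparing records on a key attribute and keeping one instance of each.

import pandas as pd

df = pd.DataFrame({"customer_id": [1, 1, 2, 3, 3],
                   "name": ["Asha", "Asha", "Ravi", "Kiran", "Kiran"]})

# Compare records on the key attribute and keep only the first occurrence.
deduped = df.drop_duplicates(subset=["customer_id"], keep="first")
print(deduped)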
Standardizing Data: Data cleaning often involves standardizing data formats, units, and
representations to ensure consistency.
Correcting Errors: Errors in data entry or collection may lead to inconsistencies or inaccuracies in
the dataset. Data cleaning may involve identifying and correcting these errors through manual
inspection or automated techniques.
Handling Outliers: Outliers are data points that significantly deviate from the rest of the dataset
and may distort analysis results. Data cleaning may involve identifying and either removing outliers
or transforming them to mitigate their impact.
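One common (though not the only) way to identify outliers is the interquartile-range rule, sketched below with pandas on a made-up salary column.

import pandas as pd

df = pd.DataFrame({"salary": [30000, 32000, 31000, 29000, 250000]})

# Flag values outside 1.5 * IQR of the quartiles as outliers.
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["salary"] < lower) | (df["salary"] > upper)]
cleaned = df[df["salary"].between(lower, upper)]
print(outliers)
print(cleaned)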
Validating Data: Data validation involves checking the integrity and accuracy of data by comparing
it against predefined rules or constraints. This helps ensure that the data meets quality standards and
is fit for its intended purpose.
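As an illustration (the rules and column names here are hypothetical), validation can be expressed as simple predicates that each record must satisfy.

import pandas as pd

df = pd.DataFrame({"age":   [25, -3, 40],
                   "email": ["a@example.com", "b@example.com", "not-an-email"]})

# Predefined rules: age must lie between 0 and 120, email must look like an address.
valid_age = df["age"].between(0, 120)
valid_email = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", regex=True)

# Records failing at least one rule are flagged for review or correction.
print(df[~(valid_age & valid_email)])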
Data Reduction
Data reduction refers to the process of simplifying, summarizing, or condensing large sets of data
into more manageable and meaningful forms while retaining key information. This process is
commonly used in various fields such as data analysis, statistics, signal processing, and machine
learning to make data more understandable, interpretable, and easier to work with. There are several
techniques for data reduction:
1. Data Sampling: This technique involves selecting a subset of the data to work with, rather
than using the entire dataset. This can be useful for reducing the size of a dataset while still
preserving the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features
into a single feature.
3. Data Compression: This technique uses lossy or lossless compression methods to reduce
the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete data
by partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the dataset
that are most relevant to the task at hand.
It is important to note that data reduction involves a trade-off between accuracy and data size:
the more the data is reduced, the less accurate and the less generalizable the resulting model
may be.
In conclusion, data reduction is an important step in data mining, as it can help to improve the
efficiency and performance of machine learning algorithms by reducing the size of the dataset.
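As a brief, hedged sketch of three of these techniques (data sampling, dimensionality reduction, and feature selection), using Python with NumPy and scikit-learn on the small built-in Iris dataset:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Data sampling: keep a random 50% subset of the rows.
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=len(X) // 2, replace=False)
X_sample, y_sample = X[idx], y[idx]

# Dimensionality reduction: project the 4 features onto 2 principal components.
X_pca = PCA(n_components=2).fit_transform(X)

# Feature selection: keep the 2 features most relevant to the class label.
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

print(X_sample.shape, X_pca.shape, X_selected.shape)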
Discretization
Data discretization refers to the process of converting a large number of data values into a smaller
number so that evaluation and management of the data become easier. In other words, data
discretization converts the values of a continuous attribute into a finite set of intervals with
minimal loss of information. There are two forms of data discretization: supervised and
unsupervised. Supervised discretization uses class information to guide how the intervals are
formed. Unsupervised discretization does not use class information and is characterized by the
way the operation proceeds, typically a top-down splitting strategy or a bottom-up merging
strategy.
Discretization can be performed using several methods:
Equal Width discretization: In this method, the range of the continuous variable is divided into a
specified number of equal-width intervals.
Equal Frequency discretization: Here, the range of the continuous variable is partitioned into
intervals containing approximately the same number of data points.
Entropy-based discretization: This method chooses interval boundaries that minimize the
entropy (impurity) of the class or target variable within each interval.
Decision Tree Discretization: Decision trees can be used to determine optimal split points for
discretizing continuous variables based on their predictive power with respect to the target
variable.
Clustering-based Discretization: Clustering algorithms such as k-means can be applied to group
similar values of a continuous variable into clusters; the cluster boundaries then define the
discretization intervals.
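The equal-width, equal-frequency, and clustering-based methods can be sketched briefly in Python (pandas and scikit-learn; the values below are made up):

import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

values = pd.Series([2, 5, 7, 9, 11, 15, 22, 30, 31, 40])

# Equal-width discretization: 4 intervals of the same width.
equal_width = pd.cut(values, bins=4)

# Equal-frequency discretization: 4 intervals with roughly the same number of points.
equal_freq = pd.qcut(values, q=4)

# Clustering-based discretization: bin edges chosen by k-means.
kmeans_bins = KBinsDiscretizer(n_bins=4, encode="ordinal",
                               strategy="kmeans").fit_transform(values.to_frame())

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
print(kmeans_bins.ravel())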
*****