Unit II DW

Download as pdf or txt
Download as pdf or txt
You are on page 1of 6

UNIT_II DATAWARHOUSE

UNIT II
DATAWAREHOUSE
Introduction
A data warehouse is a centralized repository that stores structured, semi-structured, and
unstructured data from various sources. It's designed for query and analysis rather than transaction
processing. Here's a brief introduction to its key aspects:
Centralization: Data warehouses consolidate data from multiple sources into a single location.
This includes data from operational systems, external sources, and other data marts.
Integration: The data in a warehouse is integrated to ensure consistency and accuracy. This
integration often involves cleaning, transforming, and organizing the data according to a common
schema.
Subject-Oriented: Data warehouses are organized around specific subject areas, such as sales,
finance, or marketing. This subject-oriented approach allows for focused analysis and reporting.
Time-Variant: Data warehouses maintain historical data, enabling analysis of trends and patterns
over time. This temporal aspect is crucial for making informed decisions based on past
performance.
Non-Volatile: Once data is stored in a warehouse, it is rarely modified or deleted. This ensures
data integrity and provides a reliable foundation for analysis.
Query and Analysis: Data warehouses are optimized for complex queries and analytical tasks.
They typically support tools and techniques for data exploration, reporting, and data mining.
Scalability and Performance: Modern data warehouses are designed to handle large volumes of
data and support concurrent user access. They employ techniques such as parallel processing and
distributed computing to ensure high performance.
Business Intelligence (BI) Integration: Data warehouses often serve as the foundation for
business intelligence initiatives. BI tools connect to the warehouse to generate reports, dashboards,
and visualizations that support decision-making processes.

Definition
A data warehouse is a centralized repository that aggregates and organizes data from multiple
sources into a unified and consistent format. It is specifically designed for querying, analysis, and
reporting purposes rather than transactional processing. Data warehouses typically store historical
and current data, allowing organizations to perform complex analyses and gain insights into their
operations, customers, and business performance over time. This structured approach to data
management enables decision-makers to make informed strategic decisions based on a
comprehensive view of the organization's data.

Multidimensional Data Model


The multi-Dimensional Data Model is a method which is used for ordering data in the database
along with good arrangement and assembling of the contents in the database. A multidimensional
data model is a way of structuring and representing data in a manner that facilitates analysis along
multiple dimensions. It is commonly used in data warehousing and business intelligence
applications to organize data in a way that reflects the real-world relationships and hierarchies
among different attributes.

Sudheendra Mouli H C, Asst Prof, GSSS_SSFGC


UNIT_II DATAWARHOUSE
The Multi Dimensional Data Model allows customers to interrogate analytical questions
associated with market or business trends, unlike relational databases which allow customers to
access data in the form of queries. They allow users to rapidly receive answers to the requests
which they made by creating and examining the data comparatively fast.
OLAP (online analytical processing) and data
warehousing uses multi dimensional
databases. It is used to show multiple
dimensions of the data to users.
It represents data in the form of data cubes.
Data cubes allow to model and view the data
from many dimensions and perspectives. It is
defined by dimensions and facts and is
represented by a fact table. Facts are numerical
measures and fact tables contain measures of
the related dimensional tables or names of the
facts.
Features or Key Components of multidimensional data models:
Measures: Measures are numerical data that can be analyzed and compared, such as sales or
revenue. They are typically stored in fact tables in a multidimensional data model.

Dimensions: Dimensions are attributes that describe the measures, such as time, location, or
product. They are typically stored in dimension tables in a multidimensional data model.

Cubes: Cubes are structures that represent the multidimensional relationships between measures
and dimensions in a data model. They provide a fast and efficient way to retrieve and analyze
data.

Aggregation: Aggregation is the process of summarizing data across dimensions and levels of
detail. This is a key feature of multidimensional data models, as it enables users to quickly
analyze data at different levels of granularity.
Drill-down and roll-up: Drill-down is the process of moving from a higher-level summary of
data to a lower level of detail, while roll-up is the opposite process of moving from a lower-level
detail to a higher-level summary. These features enable users to explore data in greater detail and
gain insights into the underlying patterns.
Hierarchies: Hierarchies are a way of organizing dimensions into levels of detail. For example,
a time dimension might be organized into years, quarters, months, and days. Hierarchies provide
a way to navigate the data and perform drill-down and roll-up operations.
OLAP (Online Analytical Processing): OLAP is a type of multidimensional data model that
supports fast and efficient querying of large datasets. OLAP systems are designed to handle
complex queries and provide fast response times.
Let us take the example of the data of a factory which sells products per quarter in Bangalore.
The data is represented in the table.

Sudheendra Mouli H C, Asst Prof, GSSS_SSFGC


UNIT_II DATAWARHOUSE
In the given presentation, the factory’s sales for
Bangalore are, for the time dimension, which is
organized into quarters and the dimension of items,
which is sorted according to the kind of item which is
sold. The facts here are represented in rupees (in
thousands).

2D factory data

Now, if we desire to view the data of the sales in a three-dimensional table, then it is represented
in the diagram given below. Here the data of the sales is represented as a two dimensional table.
Let us consider the data according to item, time and location (like Kolkata, Delhi, Mumbai). Here
is the table:

3D data representation as 2D
This data can be represented in the form of three dimensions conceptually, which is shown in the
image below :

3D data representation

Advantages of Multi-Dimensional Data Model


The following are the advantages of a multi-dimensional data model:
 A multi-dimensional data model is easy to handle.
 It is easy to maintain.
 Its performance is better than that of normal databases (e.g. relational databases).
 The representation of data is better than traditional databases. That is because the multi-
dimensional databases are multi-viewed and carry different types of factors.

Sudheendra Mouli H C, Asst Prof, GSSS_SSFGC


UNIT_II DATAWARHOUSE
 It is workable on complex systems and applications, contrary to the simple one-dimensional
database systems.
 The compatibility in this type of database is an upliftment for projects having lower bandwidth
for maintenance staff.

Disadvantages of Multi Dimensional Data Model


The following are the disadvantages of a Multi-Dimensional Data Model
 The multi-dimensional Data Model is slightly complicated in nature and it requires
professionals to recognize and examine the data in the database.
 During the work of a Multi-Dimensional Data Model, when the system caches, there is a great
effect on the working of the system.
 It is complicated in nature due to which the databases are generally dynamic in design.
 The path to achieving the end product is complicated most of the time.
 As the Multi Dimensional Data Model has complicated systems, databases have a large number
of databases due to which the system is very insecure when there is a security break.

Data cleaning
Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and
correcting errors, inconsistencies, and inaccuracies in a dataset to improve its quality, reliability, and
usefulness for analysis or other purposes. It is an essential step in the data preparation ensuring that
the data used for analysis or decision-making is accurate, complete, and consistent.
Identifying Data Quality Issues: This involves examining the dataset to identify various types of
data quality issues, including missing values, duplicates, outliers, incorrect formatting, and
inconsistencies.
Handling Missing Values: Missing values are common in datasets and can adversely affect the
results of analysis. Depending on the context, missing values can be imputed (replaced with
estimated values), deleted, or handled using other techniques such as mean.
Removing Duplicates: Duplicate records or observations can skew analysis results and should be
identified and removed from the dataset. This involves comparing records based on key attributes
and keeping only one instance of each unique record.
Standardizing Data: Data cleaning often involves standardizing data formats, units, and
representations to ensure consistency.
Correcting Errors: Errors in data entry or collection may lead to inconsistencies or inaccuracies in
he dataset. Data cleaning may involve identifying and correcting these errors through manual
inspection or automated techniques.
Handling Outliers: Outliers are data points that significantly deviate from the rest of the dataset
and may distort analysis results. Data cleaning may involve identifying and either removing outliers
or transforming them to mitigate their impact.
Validating Data: Data validation involves checking the integrity and accuracy of data by comparing
it against predefined rules or constraints. This helps ensure that the data meets quality standards and
is fit for its intended purpose.

Sudheendra Mouli H C, Asst Prof, GSSS_SSFGC


UNIT_II DATAWARHOUSE
Documenting Changes: It's essential to document all changes made during the data cleaning
process, including the rationale behind each decision. This documentation helps maintain
transparency and reproducibility and ensures that others can understand and trust the cleaned dataset.
Data Integration and Transformation
Data integration in data mining refers to the process of combining data from multiple sources into
a single, unified view. This can involve cleaning and transforming the data, as well as resolving
any inconsistencies or conflicts that may exist between the different sources. The goal of data
integration is to make the data more useful and meaningful for the purposes of analysis and
decision making. Techniques used in data integration include
ETL (Extract, Transform, Load): ETL is a common method used for data integration. It involves
extracting data from various sources, transforming it to fit operational needs or analytics
requirements, and loading it into a target destination such as a data warehouse.
Real-time Data Integration: In addition to traditional batch processing, real-time data integration
enables organizations to capture and process data in near real-time, allowing for more timely
decision-making and faster insights.
Data transformation in data mining refers to the process of converting raw data into a format that
is suitable for analysis and modeling. This may include cleaning, aggregating, filtering, and
enriching data to make it suitable for analysis. Data transformation typically involves several steps,
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing values in the
data. Cleansing ensures that the data is reliable and accurate for analysis.
2. Data integration: Combining data from multiple sources, such as databases and
spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such as between 0 and
1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset of relevant
features or attributes.
5. Data discretization: Converting continuous data into discrete categories or bins.
6. Data aggregation: Combining data at different levels of granularity, such as by summing or
averaging, to create new features or attributes.
7 Data transformation is an important step in the data mining process as it helps to ensure that
the data is in a format that is suitable for analysis and modeling,

Data Reduction
Data reduction refers to the process of simplifying, summarizing, or condensing large sets of data
into more manageable and meaningful forms while retaining key information. This process is
commonly used in various fields such as data analysis, statistics, signal processing, and machine
learning to make data more understandable, interpretable, and easier to work. There are several
techniques for data reduction:
1. Data Sampling: This technique involves selecting a subset of the data to work with, rather
than using the entire dataset. This can be useful for reducing the size of a dataset while still
preserving the overall trends and patterns in the data.
Sudheendra Mouli H C, Asst Prof, GSSS_SSFGC
UNIT_II DATAWARHOUSE

2. Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features
into a single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete data
by partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the dataset
that are most relevant to the task at hand.
6. It’s important to note that data reduction can have a trade-off between the accuracy and the
size of the data. The more data is reduced, the less accurate the model will be and the less
generalizable it will be.
In conclusion, data reduction is an important step in data mining, as it can help to improve the
efficiency and performance of machine learning algorithms by reducing the size of the dataset.
Discretization
Data discretization refers to a method of converting a huge number of data values into smaller ones
so that the evaluation and management of data become easy. In other words, data discretization is a
method of converting attributes values of continuous data into a finite set of intervals with
minimum data loss. There are two forms of data discretization first is supervised discretization,
and the second is unsupervised discretization. Supervised discretization refers to a method in
which the class data is used. Unsupervised discretization refers to a method depending upon the
way which operation proceeds. It means it works on the top-down splitting strategy and bottom-up
merging strategy.
Discretization can be performed using different methods,
Equal Width discretization: In this method, the range of the continuous variable is divided into a
specified number of equal-width intervals.
Equal Frequency discretization: Here, the range of the continuous variable is partitioned into
intervals containing approximately the same number of data points..
Entropy-based discretization: This method divides the continuous variable into intervals based
on the entropy of the target variable.
Decision Tree Discretization: Decision trees can be used to determine optimal split points for
discretizing continuous variables based on their predictive power with respect to the target
variable.
Clustering-based Discretization: Clustering algorithms such as k-means can be applied to group.

*****

Sudheendra Mouli H C, Asst Prof, GSSS_SSFGC

You might also like