A Comprehensive Approach Towards Data Preprocessing Techniques & Association Rules
Jasdeep Singh Malik, Prachi Goyal, Mr. Akhilesh K Sharma
Assistant Professor, IES-IPS Academy, Rajendra Nagar, Indore – 452012, India
jasdeepsinghmalik@gmail.com, engineer.prachi@gmail.com, akhileshshm@yahoo.com
3) DATA REDUCTION
Why data reduction?
- A database or data warehouse may store terabytes of data.
- Complex data analysis or mining may take a very long time to run on the complete data set.
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Data reduction strategies:
- Aggregation
- Sampling
- Dimensionality reduction
- Feature subset selection
- Feature creation
- Discretization and binarization
- Attribute transformation

a) Data cube aggregation
Imagine that you have collected the data for your analysis. These data consist of the AllElectronics sales per quarter for the years 1997 to 1999. You are, however, interested in the annual sales (total per year) rather than the total per quarter. Thus the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter. This aggregation is illustrated in the figure. The resulting data set is smaller in volume, without loss of the information necessary for the analysis task.

Figure: Sales data for a given branch of AllElectronics for the years 1997 to 1999. In the data on the left, the sales are shown per quarter. In the data on the right, the data are aggregated to provide the annual sales.

Data cubes store multidimensional aggregated information [3]. For example, the figure shows a data cube for multidimensional analysis of sales data with respect to annual sales per item type for each AllElectronics branch. Each cell holds an aggregate data value, corresponding to a data point in multidimensional space. Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple levels of abstraction. For example, a hierarchy for branch could allow branches to be grouped into regions, based on their address. Data cubes provide fast access to precomputed, summarized data, thereby benefiting on-line analytical processing as well as data mining.
The cube created at the lowest level of abstraction is referred to as the base cuboid. A cube for the highest level of abstraction is the apex cuboid. For the sales data of the figure, the apex cuboid would give one total: the total sales for all three years, for all item types, and for all branches. Data cubes created for varying levels of abstraction are sometimes referred to as cuboids, so that a "data cube" may instead refer to a lattice of cuboids. Each higher level of abstraction further reduces the resulting data size. When replying to data mining requests, the smallest available cuboid relevant to the given task should be used.

b) Attribute subset selection
Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions) from it. Typically, heuristic methods of attribute subset selection are applied. The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Mining on a reduced set of attributes has an additional benefit: it reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
Basic heuristic methods of attribute subset selection include the following techniques, some of which are illustrated in the figure [2]:
a. Step-wise forward selection: The procedure starts with an empty set of attributes. The best of the original attributes is determined and added to the set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
b. Step-wise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
c. Combination of forward selection and backward elimination: The step-wise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
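The quarterly-to-annual roll-up described under data cube aggregation can be sketched in plain Python; the sales figures below are invented for illustration and are not taken from the paper's data:

```python
# Hypothetical quarterly sales for one AllElectronics branch,
# keyed by (year, quarter); the amounts are made up for illustration.
quarterly_sales = {
    (1997, "Q1"): 224.0, (1997, "Q2"): 408.0,
    (1997, "Q3"): 350.0, (1997, "Q4"): 586.0,
    (1998, "Q1"): 310.0, (1998, "Q2"): 402.0,
    (1998, "Q3"): 380.0, (1998, "Q4"): 510.0,
}

# Aggregate to annual totals: the reduced representation collapses four
# quarterly cells per year into one, keeping all information a per-year
# analysis needs.
annual_sales = {}
for (year, _quarter), amount in quarterly_sales.items():
    annual_sales[year] = annual_sales.get(year, 0.0) + amount
```

Each year's four cells collapse into one, mirroring the quarterly-to-annual aggregation shown in the figure.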
d. Decision tree induction: Decision tree algorithms were originally intended for classification. Decision tree induction
constructs a flow-chart-like structure where each internal (non-
leaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf)
node denotes a class prediction. At each node, the algorithm
chooses the “best" attribute to partition the data into individual
classes.
When decision tree induction is used for attribute subset
selection, a tree is constructed from the given data. All
attributes that do not appear in the tree are assumed to be
irrelevant. The set of attributes appearing in the tree form the reduced subset of attributes.
Fig: A data cube for sales at AllElectronics.
Fig: Variation of Precipitation in Australia.
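A minimal sketch of the step-wise forward selection heuristic described above. The scoring function here is a hypothetical stand-in for whatever measure of class separation a real implementation would use; the attribute names and weights are invented:

```python
def forward_select(attributes, score, max_size):
    """Greedy step-wise forward selection: start from the empty set and,
    at each step, add the remaining attribute that most improves the score.
    Stops early when no attribute improves the current subset."""
    selected, remaining = [], list(attributes)
    while remaining and len(selected) < max_size:
        best = max(remaining, key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= score(selected):
            break  # no remaining attribute helps; stop growing the subset
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy relevance weights standing in for a real evaluation measure;
# the small per-attribute penalty rewards minimal subsets.
weights = {"income": 3.0, "age": 2.0, "customer_id": 0.0}
subset = forward_select(
    weights, lambda s: sum(weights[a] for a in s) - 0.5 * len(s), max_size=2
)
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.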
Sampling
Choose a representative subset of the data:
- Simple random sampling may have poor performance in the presence of skew.
- Develop adaptive sampling methods.
- Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data.
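A sketch of stratified sampling, assuming each record carries a class label. Drawing the same fraction from every stratum preserves a skewed class distribution that simple random sampling might miss:

```python
import random

def stratified_sample(records, label_of, fraction, seed=0):
    """Draw `fraction` of each class (stratum) so the sample keeps
    approximately the class proportions of the full data set."""
    rng = random.Random(seed)
    strata = {}
    for record in records:
        strata.setdefault(label_of(record), []).append(record)
    sample = []
    for rows in strata.values():
        size = max(1, round(fraction * len(rows)))  # keep rare classes represented
        sample.extend(rng.sample(rows, size))
    return sample

# Skewed data: 90 records of class "A", only 10 of class "B".
data = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]
sample = stratified_sample(data, label_of=lambda r: r[0], fraction=0.1)
```

The 9:1 class ratio of the full data set survives in the ten-record sample.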
c) Dimensionality reduction
In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the reduction is called lossy. There are special well-tuned algorithms for string compression; although they are typically lossless, they allow only limited manipulation of the data. Two effective methods of lossy dimensionality reduction are:
1) Wavelet transforms
2) Principal components analysis
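A minimal principal-components sketch for the two-dimensional case, illustrating why the reduction is lossy: each point is encoded as a single coordinate along the principal axis, and decoding only recovers the original exactly when the points lie on that axis. The closed-form angle used below works for 2-D only; the sample points are invented:

```python
import math

def pca_reduce_2d(points):
    """Encode 2-D points as one coordinate along the principal axis
    (lossy compression), then decode back to approximate 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Direction of maximum variance (closed form for the 2-D case).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    codes = [(x - mx) * ux + (y - my) * uy for x, y in points]  # 1 value per point
    decoded = [(mx + t * ux, my + t * uy) for t in codes]
    return codes, decoded

# Points exactly on a line are reconstructed perfectly; off-line points
# would come back only approximately.
original = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
codes, decoded = pca_reduce_2d(original)
```

Storage drops from two numbers per point to one, at the cost of exactness for data that is not perfectly collinear.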
Diagram: Data Warehouse
Diagram: KDD Process

Data discretization and automatic generation of concept hierarchies
For numeric data, techniques such as binning (e.g., equal-interval width), histogram analysis, and clustering analysis can be used.
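Equal-width binning, the simplest of the discretization techniques mentioned above, can be sketched as:

```python
def equal_width_bins(values, num_bins):
    """Discretize numeric values into num_bins intervals of equal width.
    Returns the 0-based bin index for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins

    def bin_index(v):
        i = int((v - lo) / width)
        return min(i, num_bins - 1)  # the maximum value falls in the last bin

    return [bin_index(v) for v in values]

# Ten values split into two equal-width intervals: [0, 4.5) and [4.5, 9].
bins = equal_width_bins([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], num_bins=2)
```

Each resulting bin index can then serve as one level of an automatically generated concept hierarchy (e.g., "low"/"high").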
4. ANALYSIS