DM 02 04 Data Transformation
Fall 2009
Data Transformation
Outline
Introduction
Normalization
Attribute Construction
Aggregation
Attribute Subset Selection
Discretization
Generalization
References
Introduction
Data transformation
– the data are transformed into forms appropriate for mining.
Data transformation tasks:
– Normalization
– Attribute construction
– Aggregation
– Attribute Subset Selection
– Discretization
– Generalization
Data Transformation Tasks
Normalization
– the attribute data are scaled so as to fall within a small
specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
Attribute construction (or feature construction)
– new attributes are constructed from the given set of
attributes and added to help the mining process.
Aggregation
– summary or aggregation operations are applied to the data.
– For example, the daily sales data may be aggregated so as
to compute monthly and annual total amounts.
Discretization
– Dividing the range of a continuous attribute into intervals
– For example, values for numerical attributes, like age, may
be mapped to higher-level concepts, like youth, middle-
aged, and senior.
Generalization
– low-level or “primitive” (raw) data are replaced by
higher-level concepts through the use of concept
hierarchies.
– For example, categorical attributes, like street, can be
generalized to higher-level concepts, like city or country.
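The two examples above (age mapped to labels, street generalized to city or country) can be sketched as simple lookups. This is only an illustration: the age cut-offs (30 and 60), the street and city names, and the helper names `age_to_concept` and `generalize` are all assumptions, not values from the slides.

```python
def age_to_concept(age):
    """Discretize a numeric age into a label (cut-offs are illustrative assumptions)."""
    if age < 30:
        return "youth"
    if age < 60:
        return "middle-aged"
    return "senior"

# Hypothetical concept hierarchy: street -> city -> country
STREET_TO_CITY = {"Street A": "City X", "Street B": "City Y"}
CITY_TO_COUNTRY = {"City X": "Country P", "City Y": "Country P"}

def generalize(street, level):
    """Replace a street by its city or country, climbing the concept hierarchy."""
    city = STREET_TO_CITY[street]
    if level == "city":
        return city
    if level == "country":
        return CITY_TO_COUNTRY[city]
    raise ValueError("level must be 'city' or 'country'")

print([age_to_concept(a) for a in (22, 45, 71)])
print(generalize("Street A", "city"), generalize("Street B", "country"))
```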
Normalization
Min-max Normalization
Min-max normalization
– performs a linear transformation on the original data.
Suppose that:
– minA and maxA are the minimum and maximum values of
an attribute, A.
Min-max normalization maps a value, v, of A to v′ in
the range [new_minA, new_maxA] by computing:
$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$
Example: Min-max Normalization
$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$
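The computation above can be checked with a few lines of code. This is a minimal sketch; the function name `min_max_normalize` and its parameter names are assumptions chosen to mirror the symbols in the formula.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] to [new_min, new_max]."""
    if max_a == min_a:
        raise ValueError("min_a and max_a must differ")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# The example above: 73,600 with min 12,000 and max 98,000, mapped to [0.0, 1.0]
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716
```

Note that a later value falling outside [minA, maxA] would be mapped outside the target range, so the minimum and maximum need to cover all the data that will be normalized.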
z-score normalization
Example: z-score Normalization
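As a minimal sketch, assuming the usual definition of z-score normalization, v' = (v − mean_A) / σ_A, the transformation can be written as below. The input values are hypothetical, and using the population standard deviation (`statistics.pstdev`) rather than the sample one is also an assumption.

```python
from statistics import mean, pstdev

def z_score_normalize(values):
    """Return (v - mean) / std for each v, using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mu) / sigma for v in values]

incomes = [12_000, 54_000, 73_600, 98_000]  # hypothetical values
print([round(z, 3) for z in z_score_normalize(incomes)])
```

After the transformation the values have mean 0 and standard deviation 1, which is useful when the true minimum and maximum of the attribute are unknown or when outliers would dominate min-max normalization.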
Decimal Scaling
Example: Decimal Scaling
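As a minimal sketch, assuming the usual definition of decimal scaling, v' = v / 10^j with j the smallest non-negative integer such that max(|v'|) < 1, the computation looks like this. The input values and the function name `decimal_scale` are illustrative assumptions.

```python
def decimal_scale(values):
    """Divide every value by 10**j, with j the smallest integer >= 0 such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

scaled, j = decimal_scale([-986, 917])  # hypothetical values
print(j, scaled)  # 3 [-0.986, 0.917]
```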
Attribute Construction
Data Aggregation
Figure: Sales data for a given branch of AllElectronics for the years
2002 to 2004. On the left, the sales are shown per quarter. On the
right, the data are aggregated to provide the annual sales.
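The quarterly-to-annual roll-up described in the caption can be sketched as a group-and-sum. The amounts below are hypothetical placeholders (the figure's actual numbers are not reproduced here), and `annual_totals` is an assumed helper name.

```python
from collections import defaultdict

# Hypothetical quarterly sales records: (year, quarter, amount)
quarterly_sales = [
    (2002, "Q1", 100), (2002, "Q2", 150), (2002, "Q3", 120), (2002, "Q4", 180),
    (2003, "Q1", 110), (2003, "Q2", 160), (2003, "Q3", 130), (2003, "Q4", 200),
    (2004, "Q1", 120), (2004, "Q2", 170), (2004, "Q3", 140), (2004, "Q4", 210),
]

def annual_totals(records):
    """Aggregate quarterly amounts into one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in records:
        totals[year] += amount
    return dict(totals)

print(annual_totals(quarterly_sales))  # {2002: 550, 2003: 600, 2004: 640}
```

The aggregated table is smaller than the original but still carries the information needed for an analysis at the annual level.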
Data cubes store multidimensional aggregated information.
Data cubes provide fast access to precomputed, summarized data,
thereby benefiting on-line analytical processing as well as data
mining.
Figure: A data cube for sales at AllElectronics.
Attribute Subset Selection
Heuristic methods:
– Step-wise forward selection
– Step-wise backward elimination
– Combining forward selection and backward elimination
– Decision-tree induction
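The first heuristic in the list, step-wise forward selection, can be sketched as a greedy loop: start with an empty attribute set and keep adding the attribute that most improves some quality score. The scoring function is left to the caller; `toy_score`, the attribute names A1 to A6, and their usefulness weights are hypothetical and exist only to make the sketch runnable.

```python
def forward_selection(attributes, score, max_features=None):
    """Greedy step-wise forward selection.

    Starts from the empty set and repeatedly adds the attribute whose
    addition gives the highest score(subset); stops when nothing improves.
    """
    selected = []
    best_score = score(selected)
    remaining = list(attributes)
    limit = len(remaining) if max_features is None else max_features
    while remaining and len(selected) < limit:
        candidates = [(score(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining attribute improves the score
        selected.append(top_attr)
        remaining.remove(top_attr)
        best_score = top_score
    return selected

# Hypothetical usefulness weights, with a small penalty per selected attribute
usefulness = {"A1": 0.5, "A4": 0.3, "A6": 0.2, "A2": 0.01, "A3": 0.0, "A5": 0.02}

def toy_score(subset):
    return sum(usefulness[a] for a in subset) - 0.05 * len(subset)

print(forward_selection(list(usefulness), toy_score))  # ['A1', 'A4', 'A6']
```

Backward elimination follows the same pattern in reverse: start from the full set and repeatedly drop the attribute whose removal hurts the score least. In practice the score would typically come from a statistical significance test or a model's validation accuracy.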
Discretization
Data Discretization:
– Dividing the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce the number of values for a given continuous
attribute
– Some classification algorithms only accept categorical
attributes.
– This leads to a concise, easy-to-use, knowledge-level
representation of mining results.
Typical methods:
– Binning
Top-down split, unsupervised
– Clustering analysis (covered above)
Either top-down split or bottom-up merge, unsupervised
– Interval merging by χ2 Analysis:
supervised, bottom-up merge
Binning
– The sorted values are distributed into a number of buckets,
or bins, and each value in a bin is then replaced by the bin mean
or median
– Binning is a top-down splitting technique based on a
specified number of bins.
– Binning is an unsupervised discretization technique,
because it does not use class information
Binning methods:
– Equal-width (distance) partitioning
– Equal-depth (frequency) partitioning
Equal-width (distance) partitioning
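As a minimal sketch of equal-width partitioning, the range from the minimum to the maximum value is divided into N intervals of width (max − min) / N, and each value is assigned the index of the interval it falls into. The sample values and the function name `equal_width_bins` are assumptions for illustration.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of width (max - min) / n_bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    # The maximum value would compute to index n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sorted values
print(equal_width_bins(prices, 3))  # [0, 0, 1, 1, 1, 2, 2, 2, 2]
```

Here the intervals are [4, 14), [14, 24) and [24, 34]. Equal-width partitioning is simple, but outliers can stretch the range and leave most values crowded into a few bins.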
Equal-depth (frequency) partitioning
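As a minimal sketch of equal-depth partitioning, the sorted values are split into N bins that each hold roughly the same number of values; each value can then be replaced by its bin mean, as described under Binning. The sample values and the helper names are assumptions for illustration.

```python
def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins bins of (approximately) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    if not 0 < n_bins <= n:
        raise ValueError("n_bins must be between 1 and the number of values")
    # Bin i covers sorted positions [i * n // n_bins, (i + 1) * n // n_bins)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]

def smooth_by_bin_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical values
bins = equal_depth_bins(prices, 3)
print(bins)                       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_bin_means(bins))  # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```

Unlike equal-width partitioning, the bin boundaries adapt to the data distribution, so skewed data and outliers are handled more evenly.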
Cluster Analysis
Interval Merge by χ2 Analysis
ChiMerge:
– It is a bottom-up method
– Find the best neighboring intervals and merge them to form
larger intervals recursively
– The method is supervised in that it uses class information.
– The basic notion is that for accurate discretization, the
relative class frequencies should be fairly consistent within
an interval.
– Therefore, if two adjacent intervals have a very similar
distribution of classes, then the intervals can be merged.
Otherwise, they should remain separate.
– ChiMerge treats intervals as discrete categories
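The merge criterion above can be sketched with the Pearson χ2 statistic computed on the class counts of two adjacent intervals: a low value means the class distributions are similar, so the pair is a candidate for merging. This is a simplified illustration, not a full ChiMerge implementation: it omits the stopping threshold, skipping cells with zero expected count is an assumption, and the function names and counts are hypothetical.

```python
def chi_square(interval_a, interval_b):
    """Pearson chi-square statistic for two adjacent intervals.

    Each argument is a list of per-class counts, e.g. [n_class_0, n_class_1].
    """
    rows = [interval_a, interval_b]
    total = sum(sum(row) for row in rows)
    if total == 0:
        return 0.0
    col_totals = [sum(col) for col in zip(*rows)]
    chi2 = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            if expected > 0:  # assumption: skip cells with zero expected count
                chi2 += (observed - expected) ** 2 / expected
    return chi2

def merge_most_similar(intervals):
    """One bottom-up step: merge the adjacent pair with the lowest chi-square."""
    scores = [chi_square(a, b) for a, b in zip(intervals, intervals[1:])]
    i = scores.index(min(scores))
    merged = [x + y for x, y in zip(intervals[i], intervals[i + 1])]
    return intervals[:i] + [merged] + intervals[i + 2:]

# Hypothetical class counts per interval: [class 0, class 1]
intervals = [[5, 5], [10, 10], [0, 10]]
print(chi_square([5, 5], [10, 10]))   # 0.0  (identical class distribution)
print(chi_square([10, 0], [0, 10]))   # 20.0 (completely different)
print(merge_most_similar(intervals))  # [[15, 15], [0, 10]]
```

Repeating `merge_most_similar` until the lowest χ2 exceeds a chosen significance threshold, or until a target number of intervals is reached, gives the recursive bottom-up behaviour described above; `merge_most_similar` expects at least two intervals.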
Generalization
Example: Generalization
References