A Comprehensive Approach Towards Data Preprocessing Techniques & Association Rules
Jasdeep Singh Malik, Prachi Goyal, Mr. Akhilesh K Sharma
Assistant Professor, IES-IPS Academy, Rajendra Nagar, Indore – 452012, India
jasdeepsinghmalik@gmail.com, engineer.prachi@gmail.com, akhileshshm@yahoo.com
3) DATA REDUCTION
Why data reduction?
- A database or data warehouse may store terabytes of data.
- Complex data analysis or mining may take a very long time to run on the complete data set.
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Data reduction strategies:
- Aggregation
- Sampling
- Dimensionality reduction
- Feature subset selection
- Feature creation
- Discretization and binarization
- Attribute transformation

a) Data cube aggregation
Imagine that you have collected the data for your analysis. These data consist of the AllElectronics sales per quarter for the years 1997 to 1999. You are, however, interested in the annual sales (total per year) rather than the total per quarter. Thus the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter. This aggregation is illustrated in the figure. The resulting data set is smaller in volume, without loss of the information necessary for the analysis task.

Figure: Sales data for a given branch of AllElectronics for the years 1997 to 1999. In the data on the left, the sales are shown per quarter. In the data on the right, the data are aggregated to provide the annual sales.

Data cubes store multidimensional aggregated information [3]. For example, the figure shows a data cube for multidimensional analysis of sales data with respect to annual sales per item type for each AllElectronics branch. Each cell holds an aggregate data value, corresponding to a data point in multidimensional space. Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple levels of abstraction. For example, a hierarchy for branch could allow branches to be grouped into regions, based on their address. Data cubes provide fast access to precomputed, summarized data, thereby benefiting on-line analytical processing as well as data mining.
The cube created at the lowest level of abstraction is referred to as the base cuboid. A cube for the highest level of abstraction is the apex cuboid. For the sales data of the figure, the apex cuboid would give one total: the total sales for all three years, for all item types, and for all branches. Data cubes created for varying levels of abstraction are sometimes referred to as cuboids, so that a "data cube" may instead refer to a lattice of cuboids. Each higher level of abstraction further reduces the resulting data size. When replying to data mining requests, the smallest available cuboid relevant to the given task should be used.

b) Attribute subset selection
Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions) from it. Typically, heuristic methods of attribute subset selection are applied. The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Mining on a reduced set of attributes has an additional benefit: it reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
Basic heuristic methods of attribute subset selection include the following techniques, some of which are illustrated in the figure [2]:
a. Step-wise forward selection: The procedure starts with an empty set of attributes. The best of the original attributes is determined and added to the set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
b. Step-wise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
c. Combination of forward selection and backward elimination: The step-wise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
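The quarterly-to-annual roll-up described under data cube aggregation can be sketched in plain Python; the sales figures below are invented for illustration and are not taken from the paper's data:

```python
# Hypothetical quarterly sales for one AllElectronics branch,
# keyed by (year, quarter); the amounts are made up for illustration.
quarterly_sales = {
    (1997, "Q1"): 224.0, (1997, "Q2"): 408.0,
    (1997, "Q3"): 350.0, (1997, "Q4"): 586.0,
    (1998, "Q1"): 310.0, (1998, "Q2"): 402.0,
    (1998, "Q3"): 380.0, (1998, "Q4"): 510.0,
}

# Aggregate to annual totals: the reduced representation collapses four
# quarterly cells per year into one, keeping all information a per-year
# analysis needs.
annual_sales = {}
for (year, _quarter), amount in quarterly_sales.items():
    annual_sales[year] = annual_sales.get(year, 0.0) + amount
```

Each year's four cells collapse into one, mirroring the quarterly-to-annual aggregation shown in the figure.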
d. Decision tree induction: Decision tree algorithms were originally intended for classification. Decision tree induction
constructs a flow-chart-like structure where each internal (non-
leaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf)
node denotes a class prediction. At each node, the algorithm
chooses the “best" attribute to partition the data into individual
classes.
When decision tree induction is used for attribute subset
selection, a tree is constructed from the given data. All
attributes that do not appear in the tree are assumed to be
irrelevant. The set of attributes appearing in the tree form the reduced subset of attributes.
Fig: A data cube for sales at AllElectronics.
Fig: Variation of Precipitation in Australia.
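A minimal sketch of the step-wise forward selection heuristic described above. The scoring function here is a hypothetical stand-in for whatever measure of class separation a real implementation would use; the attribute names and weights are invented:

```python
def forward_select(attributes, score, max_size):
    """Greedy step-wise forward selection: start from the empty set and,
    at each step, add the remaining attribute that most improves the score.
    Stops early when no attribute improves the current subset."""
    selected, remaining = [], list(attributes)
    while remaining and len(selected) < max_size:
        best = max(remaining, key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= score(selected):
            break  # no remaining attribute helps; stop growing the subset
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy relevance weights standing in for a real evaluation measure;
# the small per-attribute penalty rewards minimal subsets.
weights = {"income": 3.0, "age": 2.0, "customer_id": 0.0}
subset = forward_select(
    weights, lambda s: sum(weights[a] for a in s) - 0.5 * len(s), max_size=2
)
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.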
Sampling
Choose a representative subset of the data:
- Simple random sampling may have poor performance in the presence of skew.
- Develop adaptive sampling methods.
- Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data.
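A sketch of stratified sampling, assuming each record carries a class label. Drawing the same fraction from every stratum preserves a skewed class distribution that simple random sampling might miss:

```python
import random

def stratified_sample(records, label_of, fraction, seed=0):
    """Draw `fraction` of each class (stratum) so the sample keeps
    approximately the class proportions of the full data set."""
    rng = random.Random(seed)
    strata = {}
    for record in records:
        strata.setdefault(label_of(record), []).append(record)
    sample = []
    for rows in strata.values():
        size = max(1, round(fraction * len(rows)))  # keep rare classes represented
        sample.extend(rng.sample(rows, size))
    return sample

# Skewed data: 90 records of class "A", only 10 of class "B".
data = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]
sample = stratified_sample(data, label_of=lambda r: r[0], fraction=0.1)
```

The 9:1 class ratio of the full data set survives in the ten-record sample.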
c) Dimensionality reduction
In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the reduction is called lossy. There are special well-tuned algorithms for string compression; although they are typically lossless, they allow only limited manipulation of the data. Two effective methods of lossy dimensionality reduction are:
1) Wavelet transforms
2) Principal components analysis
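A minimal principal-components sketch for the two-dimensional case, illustrating why the reduction is lossy: each point is encoded as a single coordinate along the principal axis, and decoding only recovers the original exactly when the points lie on that axis. The closed-form angle used below works for 2-D only; the sample points are invented:

```python
import math

def pca_reduce_2d(points):
    """Encode 2-D points as one coordinate along the principal axis
    (lossy compression), then decode back to approximate 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Direction of maximum variance (closed form for the 2-D case).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    codes = [(x - mx) * ux + (y - my) * uy for x, y in points]  # 1 value per point
    decoded = [(mx + t * ux, my + t * uy) for t in codes]
    return codes, decoded

# Points exactly on a line are reconstructed perfectly; off-line points
# would come back only approximately.
original = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
codes, decoded = pca_reduce_2d(original)
```

Storage drops from two numbers per point to one, at the cost of exactness for data that is not perfectly collinear.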
Diagram: Data Warehouse
Diagram: KDD Process

Data discretization and automatic generation of concept hierarchies
For numeric data, techniques such as binning (e.g., equal-interval width), histogram analysis, and clustering analysis can be used.
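Equal-width binning, the simplest of the discretization techniques mentioned above, can be sketched as:

```python
def equal_width_bins(values, num_bins):
    """Discretize numeric values into num_bins intervals of equal width.
    Returns the 0-based bin index for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins

    def bin_index(v):
        i = int((v - lo) / width)
        return min(i, num_bins - 1)  # the maximum value falls in the last bin

    return [bin_index(v) for v in values]

# Ten values split into two equal-width intervals: [0, 4.5) and [4.5, 9].
bins = equal_width_bins([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], num_bins=2)
```

Each resulting bin index can then serve as one level of an automatically generated concept hierarchy (e.g., "low"/"high").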
4. ANALYSIS