Data Pre-Processing: - Data Cleaning - Data Integration - Data Transformation - Data Reduction - Data Discretization

Data preprocessing involves cleaning, transforming, and reducing raw data to prepare it for modeling. It includes techniques like handling missing data, smoothing noisy data, normalizing numeric data, reducing dimensionality through feature selection or discretization, and reducing data volume through compression methods like clustering and sampling. The goal is to improve data quality and prepare the data for analysis or mining models.

Uploaded by

Chanda Test


Data Pre-processing

• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for
numerical data
Why Data Preprocessing?

• Data in the real world is dirty

– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality
data
Forms of data preprocessing
[Figure omitted]
Data Cleaning

• Data cleaning tasks

– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– history or changes of the data not being registered
• Missing data may need to be inferred.
How to Handle Missing Data?
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same
class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision tree
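The two mean-based strategies above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the `income` and `segment` values are hypothetical, missing values are assumed to be encoded as `None`, and every class is assumed to have at least one observed value.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace None entries with the mean of observed values of the same class."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [class_mean[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical data: customer income with two missing entries.
income = [30, None, 50, 90, None, 110]
segment = ["a", "a", "a", "b", "b", "b"]

filled_global = fill_with_mean(income)                  # one mean for all samples
filled_by_class = fill_with_class_mean(income, segment) # one mean per class
print(filled_global)
print(filled_by_class)
```

The per-class variant keeps the two segments apart, which is why the slide calls it the smarter choice when class labels are available.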
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems that require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human
• Regression
– smooth by fitting the data into regression functions
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
– It divides the range into N intervals of equal size: uniform
grid
– if A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B-A)/N.
– The most straightforward
– But outliers may dominate presentation
– Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
– It divides the range into N intervals, each containing
approximately same number of samples
– Good data scaling
– Managing categorical attributes can be tricky.
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
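The price example above can be reproduced with a short sketch. It assumes the number of values is divisible by the number of bins, rounds bin means to the nearest integer as the slide does, and assigns a value that is equally far from both boundaries to the lower one; the function names are illustrative.

```python
def equi_depth_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_boundaries)
```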
Cluster Analysis
[Figure omitted]
Regression
[Figure: regression line y = x + 1 on axes x and y, with X1 and Y1' marked]
Data Integration
• Data integration:
– combines data from multiple sources into a coherent store
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
– for the same real world entity, attribute values from different
sources are different
– possible reasons: different representations, different scales,
e.g., metric vs. British units
Handling Redundant Data
• Redundant data often occur when integrating
multiple databases
– The same attribute may have different names in different
databases
– One attribute may be a “derived” attribute in another table,
e.g., annual revenue
• Redundant data can often be detected by
correlation analysis
• Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
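As a rough illustration of correlation analysis for spotting a derived attribute, the sketch below computes the Pearson correlation coefficient in plain Python. The attribute names and values are hypothetical; a coefficient close to +1 or -1 only suggests redundancy and would still need to be checked against the schema.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical attributes from two integrated sources.
monthly_revenue = [10, 12, 9, 15, 11]
annual_revenue = [12 * m for m in monthly_revenue]  # derived attribute
units_returned = [3, 1, 4, 1, 5]                    # unrelated attribute

r_derived = pearson(monthly_revenue, annual_revenue)
r_other = pearson(monthly_revenue, units_returned)
print(r_derived, r_other)
```

Here `r_derived` comes out at (numerically) 1, flagging `annual_revenue` as a candidate for removal, while `r_other` is clearly weaker.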
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small,
specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction
– New attributes constructed from the given ones
Data Transformation: Normalization
• min-max normalization
$$v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$
• z-score normalization
$$v' = \frac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A}$$
• normalization by decimal scaling
$$v' = \frac{v}{10^{j}}$$
where j is the smallest integer such that $\max(|v'|) < 1$
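The three normalization methods can be written directly from their definitions. The sketch below is illustrative only: the function names and the sample numbers (an income range of 12,000 to 98,000, a mean of 54,000, and a standard deviation of 16,000) are assumptions, not values from the slides.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

mm = min_max(73600, 12000, 98000)       # hypothetical income value and range
zs = z_score(73600, 54000, 16000)       # hypothetical mean and std deviation
scaled, j = decimal_scaling([-986, 917])
print(mm, zs, scaled, j)
```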
Data Reduction Strategies
• Warehouse may store terabytes of data: Complex data
analysis/mining may take a very long time to run on the
complete data set
• Data reduction
– Obtains a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the
same) analytical results
• Data reduction strategies
– Data cube aggregation
– Dimensionality reduction
– Numerosity reduction
– Discretization and concept hierarchy generation
Dimensionality Reduction
• Feature selection (i.e., attribute subset selection):
– Select a minimum set of features such that the probability
distribution of different classes given the values for those
features is as close as possible to the original distribution
given the values of all features
– reduces the number of attributes appearing in the discovered patterns, making them easier to understand
• Heuristic methods (due to exponential # of choices):
– step-wise forward selection
– step-wise backward elimination
– combining forward selection and backward elimination
– decision-tree induction
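Step-wise forward selection, the first heuristic above, can be sketched as a greedy loop. The scoring function is deliberately left abstract: `toy_score` and its per-attribute usefulness values are hypothetical stand-ins for whatever evaluation measure (for example, validation accuracy) would be used in practice.

```python
def forward_selection(features, score):
    """Greedily add the feature that most improves score(subset);
    stop when no remaining feature improves it."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        cand_score, cand = max((score(selected + [f]), f) for f in remaining)
        if cand_score <= best:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected

# Hypothetical usefulness of each attribute; the others contribute nothing.
usefulness = {"A1": 0.3, "A4": 0.5, "A6": 0.2}

def toy_score(subset):
    # Reward useful attributes, penalise subset size (illustrative only).
    return sum(usefulness.get(f, 0.0) for f in subset) - 0.05 * len(subset)

selected = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score)
print(selected)
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal improves the score most.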
Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Figure: decision tree with root test A4?, second-level tests A1? and A6?, and leaves Class 1 / Class 2 under each]

> Reduced attribute set: {A1, A4, A6}

Data Compression
• String compression
– There are extensive theories and well-tuned algorithms
– Typically lossless
– But only limited manipulation is possible without expansion
• Audio/video compression
– Typically lossy compression, with progressive refinement
– Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
• Time sequence is not audio
– Typically short and varies slowly with time
Data Compression
[Figure: original data and compressed data; lossless compression restores the original data, otherwise only an approximation of the original data is recovered]
Numerosity Reduction
• Parametric methods
– Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the data
(except possible outliers)
– Log-linear models: obtain value at a point in m-D space as
the product on appropriate marginal subspaces
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling
Regression and Log-Linear Models
• Linear regression: Data are modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple regression: allows a response variable Y to be modeled as a linear function of multidimensional feature vector
• Log-linear model: approximates discrete multidimensional probability distributions
Regression Analysis and Log-Linear Models
• Linear regression: $Y = \alpha + \beta X$
– Two parameters, $\alpha$ and $\beta$, specify the line and are to be
estimated by using the data at hand.
– They are estimated by applying the least squares criterion to the known values of Y1,
Y2, …, X1, X2, ….
• Multiple regression: $Y = b_0 + b_1 X_1 + b_2 X_2$.
– Many nonlinear functions can be transformed into the above.
• Log-linear models:
– The multi-way table of joint probabilities is approximated by
a product of lower-order tables.
– Probability: $p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$
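For the single-variable case, the least squares estimates of the two line parameters have a closed form. The sketch below uses it on a few hypothetical points lying exactly on y = x + 1; the names `fit_line`, `alpha`, and `beta` are illustrative. Once fitted, only the two parameters need to be stored in place of the data.

```python
def fit_line(xs, ys):
    """Least squares fit of y = alpha + beta * x; returns (alpha, beta)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return alpha, beta

# Hypothetical points on the line y = x + 1.
xs = [0, 1, 2, 3]
ys = [1, 2, 3, 4]
alpha, beta = fit_line(xs, ys)
print(alpha, beta)
```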
Histograms
• A popular data reduction technique
• Divide data into buckets and store average (sum) for each bucket
• Can be constructed optimally in one dimension using dynamic programming
• Related to quantization problems.
[Figure: histogram; horizontal axis from 10000 to 90000, vertical axis from 0 to 40]
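A minimal sketch of the bucket idea, using equal-width buckets rather than the optimal dynamic-programming construction mentioned above. Each bucket keeps only a count and a sum, from which the average follows. The input values are hypothetical and must not all be equal.

```python
def equal_width_histogram(values, n_buckets):
    """Summarise values as n_buckets equal-width buckets (count and sum)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    buckets = [{"low": lo + i * width, "high": lo + (i + 1) * width,
                "count": 0, "sum": 0} for i in range(n_buckets)]
    for v in values:
        # The maximum value falls on the upper edge; keep it in the last bucket.
        i = min(int((v - lo) / width), n_buckets - 1)
        buckets[i]["count"] += 1
        buckets[i]["sum"] += v
    return buckets

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # hypothetical values
hist = equal_width_histogram(prices, 3)
for b in hist:
    print(b["low"], b["high"], b["count"], b["sum"] / b["count"])
```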
Clustering
• Partition data set into clusters, and one can
store cluster representation only
• Can be very effective if data is clustered but
not if data is “smeared”
• Can have hierarchical clustering and be stored
in multi-dimensional index tree structures
Sampling
• Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
• Choose a representative subset of the data
– Simple random sampling may have very poor performance
in the presence of skew
• Develop adaptive sampling methods
– Stratified sampling:
• Approximate the percentage of each class (or
subpopulation of interest) in the overall database
• Used in conjunction with skewed data
• Sampling may not reduce database I/Os (page at a
time).
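A rough sketch of stratified sampling on skewed data. The records, the 95/5 class split, and the sampling fraction are hypothetical; each stratum contributes at least one record so that a rare class is not lost, which a simple random sample of the same size could easily do.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every stratum defined by key(record)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical skewed data set: 95 common records, 5 rare ones.
records = [("common", i) for i in range(95)] + [("rare", i) for i in range(5)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.2)
n_rare = sum(1 for r in sample if r[0] == "rare")
print(len(sample), n_rare)
```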
Sampling
[Figure: raw data and a sample drawn from it]
[Figure: raw data and a cluster/stratified sample]
Hierarchical Reduction
• Use multi-resolution structure with different degrees of
reduction
• Hierarchical clustering is often performed but tends to
define partitions of data sets rather than “clusters”
• Parametric methods are usually not amenable to
hierarchical representation
• Hierarchical aggregation
– An index tree hierarchically divides a data set into partitions
by value range of some attributes
– Each partition can be considered as a bucket
– Thus an index tree with aggregates stored at each node is a
hierarchical histogram
Discretization
• Three types of attributes:
– Nominal — values from an unordered set
– Ordinal — values from an ordered set
– Continuous — real numbers
• Discretization:
divide the range of a continuous attribute into intervals
– Some classification algorithms only accept categorical
attributes.
– Reduce data size by discretization
– Prepare for further analysis
Discretization and Concept hierarchy
• Discretization
– reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace actual
data values.
• Concept hierarchies
– reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or
senior).
Discretization and concept hierarchy generation for numeric data
• Binning (see sections before)

• Histogram analysis (see sections before)

• Clustering analysis (see sections before)

• Entropy-based discretization

• Segmentation by natural partitioning

Entropy-Based Discretization
• Given a set of samples S, if S is partitioned into two
intervals S1 and S2 using boundary T, the entropy after
partitioning is
$$E(S, T) = \frac{|S_1|}{|S|}\,Ent(S_1) + \frac{|S_2|}{|S|}\,Ent(S_2)$$
• The boundary that minimizes the entropy function over
all possible boundaries is selected as a binary
discretization.
• The process is recursively applied to partitions
obtained until some stopping criterion is met, e.g., the
information gain falls below a threshold $\delta$:
$$Ent(S) - E(S, T) < \delta$$
• Experiments show that it may reduce data size and
improve classification accuracy
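One step of this procedure, choosing the boundary T that minimizes E(S, T), can be sketched as follows. Taking candidate boundaries at midpoints between adjacent distinct values is an assumption of this sketch, and the sample data are hypothetical. A full discretizer would call `best_boundary` recursively on each resulting interval until the stopping criterion is met.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Ent(S): entropy of the class labels in S."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_boundary(samples):
    """samples: list of (value, label). Returns (T, E(S, T)) with minimal E."""
    samples = sorted(samples)
    values = [v for v, _ in samples]
    labels = [l for _, l in samples]
    n = len(samples)
    best = None
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue  # no boundary between equal values
        t = (values[i - 1] + values[i]) / 2
        e = i / n * entropy(labels[:i]) + (n - i) / n * entropy(labels[i:])
        if best is None or e < best[1]:
            best = (t, e)
    return best

# Hypothetical samples: low values belong to class "a", high ones to "b".
samples = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
t, e = best_boundary(samples)
gain = entropy([l for _, l in samples]) - e
print(t, e, gain)
```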
Segmentation by natural partitioning
3-4-5 rule can be used to segment numeric data into
relatively uniform, “natural” intervals.
* If an interval covers 3, 6, 7 or 9 distinct values at
the most significant digit, partition the range into 3
intervals (equi-width for 3, 6, or 9; widths in the ratio 2-3-2 for 7)
* If it covers 2, 4, or 8 distinct values at the most
significant digit, partition the range into 4 intervals
* If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into 5 intervals
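A single application of the rule can be sketched as below. The function name and signature are illustrative; `low` and `high` are assumed to be already rounded to multiples of `msd` (the unit of the most significant digit), and the 2-3-2 grouping for 7 distinct values follows the usual statement of the rule. The test values are the profit ranges used in the example on the following slide.

```python
def natural_partition(low, high, msd):
    """Split [low, high] into 3, 4, or 5 'natural' intervals (3-4-5 rule)."""
    distinct = round((high - low) / msd)
    if distinct == 7:
        # Special case: three intervals of 2, 3, and 2 msd units.
        edges = [low, low + 2 * msd, low + 5 * msd, high]
    else:
        if distinct in (3, 6, 9):
            n = 3
        elif distinct in (2, 4, 8):
            n = 4
        elif distinct in (1, 5, 10):
            n = 5
        else:
            raise ValueError("range does not cover 1-10 distinct msd values")
        edges = [low + (high - low) * i / n for i in range(n + 1)]
    return list(zip(edges[:-1], edges[1:]))

top = natural_partition(-1000, 2000, 1000)   # 3 distinct values -> 3 intervals
neg = natural_partition(-400, 0, 100)        # 4 distinct values -> 4 intervals
mid = natural_partition(0, 1000, 1000)       # 1 distinct value  -> 5 intervals
print(top)
print(neg)
print(mid)
```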
Example of 3-4-5 rule
Step 1 (distribution of profit): Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700
Step 2: msd = 1,000, Low = -$1,000, High = $2,000
Step 3: (-$1,000 - $2,000) is partitioned into 3 intervals:
  (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)
Step 4: the range is adjusted to (-$400 - $5,000) to cover Min and Max, giving
  (-$400 - 0), (0 - $1,000), ($1,000 - $2,000), ($2,000 - $5,000)
Each interval is then partitioned further:
  (-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
  (0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
  ($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
  ($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
Concept hierarchy generation for categorical data
• Specification of a partial ordering of attributes
explicitly at the schema level by users or experts
• Specification of a portion of a hierarchy by
explicit data grouping
• Specification of a set of attributes, but not of their
partial ordering
• Specification of only a partial set of attributes
Specification of a set of attributes
Concept hierarchy can be automatically generated
based on the number of distinct values per attribute
in the given attribute set. The attribute with the
most distinct values is placed at the lowest level of
the hierarchy.

country 15 distinct values

province_or_ state 65 distinct values

city 3567 distinct values

street 674,339 distinct values
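The heuristic above amounts to sorting the attributes by their number of distinct values. A minimal sketch using the counts from the slide (the variable names are illustrative):

```python
# Distinct-value counts per attribute, as listed on the slide.
distinct_counts = {
    "street": 674339,
    "country": 15,
    "city": 3567,
    "province_or_state": 65,
}

# Fewest distinct values first: the top (most general) level of the hierarchy.
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" > ".join(hierarchy))
```

The heuristic is not foolproof: an attribute set such as year, month, and weekday would be ordered wrongly by distinct counts, so the generated hierarchy should be reviewed.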

Data Mining Operations and Techniques:
• Predictive Modelling :
– Based on the features present in the class_labeled training
data, develop a description or model for each class. It is used
for
• better understanding of each class, and
• prediction of certain properties of unseen data
– If the field being predicted is a numeric (continuous) variable,
then the prediction problem is a regression problem
– If the field being predicted is categorical, then the prediction
problem is a classification problem
– Predictive Modelling is based on inductive learning
(supervised learning)
Predictive Modelling (Classification):
[Figure: scatter plot of debt against income, with examples of two classes marked * and o]
Linear Classifier: a*income + b*debt < t => No loan!
Non Linear Classifier:
[Figure: the same debt vs. income scatter plot, separated once by a linear classifier and once by a non-linear classifier]
• Clustering (Segmentation)
– Clustering does not specify fields to be predicted but
targets separating the data items into subsets that are
similar to each other.
– Clustering algorithms employ a two-stage search:
• An outer loop over possible cluster numbers and an inner loop
to fit the best possible clustering for a given number of clusters
– Combined use of Clustering and classification provides
real discovery power.
Supervised vs Unsupervised Learning:
[Figure: debt vs. income scatter plots; under supervised learning the examples carry class marks (* and o), under unsupervised learning all examples carry the same mark (+)]
• Associations
– relationship between attributes (recurring patterns)
• Dependency Modelling
– Deriving causal structure within the data
• Change and Deviation Detection
– These methods account for sequence information (time-series
in financial applications or protein sequencing in genome
mapping)
– Finding frequent sequences in database is feasible given
sparseness in real-world transactional database
Basic Components of Data Mining Algorithms
• Model Representation (Knowledge Representation) :
– the language for describing discoverable patterns / knowledge
• (e.g. decision tree, rules, neural network)
• Model Evaluation:
– estimating the predictive accuracy of the derived patterns
• Search Methods:
– Parameter Search : when the structure of a model is fixed, search for
the parameters which optimise the model evaluation criteria (e.g.
backpropagation in NN)
– Model Search: when the structure of the model(s) is unknown, find
the model(s) from a model class
• Learning Bias
– Feature selection
– Pruning algorithm
Predictive Modelling (Classification)
• Task: determine which of a fixed set of classes an example belongs to
• Input: training set of examples annotated with class values.
• Output: induced hypotheses (model/concept description/classifiers)
Learning: Induce classifiers from training data
Training Data -> Inductive Learning System -> Classifiers (Derived Hypotheses)
Prediction: Using Hypothesis for Prediction: classifying any example described in the same manner
Data to be classified -> Classifier -> Decision on class assignment
Classification Algorithms
Basic Principle (Inductive Learning Hypothesis): Any
hypothesis found to approximate the target function well over a
sufficiently large set of training examples will also approximate
the target function well over other unobserved examples.
Typical Algorithms:
• Decision trees
• Rule-based induction
• Neural networks
• Memory(Case) based reasoning
• Genetic algorithms
• Bayesian networks
Decision Tree Learning
General idea: Recursively partition data into sub-groups
• Select an attribute and formulate a logical test on attribute
• Branch on each outcome of test, move subset of examples (training data) satisfying that outcome to the corresponding child node.
• Run recursively on each child node.
Termination rule specifies when to declare a leaf node.
Decision tree learning is a heuristic, one-step lookahead (hill climbing), non-backtracking search through the space of all possible decision trees.
Decision Tree: Example
Day Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No

Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
Decision Tree : Training
DecisionTree(examples) =
    Prune (Tree_Generation(examples))

Tree_Generation (examples) =
    IF termination_condition (examples)
    THEN leaf ( majority_class (examples) )
    ELSE
        LET
            Best_test = selection_function (examples)
        IN
            FOR EACH value v OF Best_test
                LET subtree_v = Tree_Generation ({ e ∈ examples | e.Best_test = v })
            IN Node (Best_test, subtree_v )
Definition:
selection: used to partition training data
termination condition: determines when to stop partitioning
pruning algorithm: attempts to prevent overfitting
Selection Measure : the Critical Step
The basic approach to selecting an attribute is to examine each attribute and evaluate its
likelihood of improving the overall decision performance of the tree.

The most widely used node-splitting evaluation functions work by reducing the degree
of randomness or "impurity" in the current node:
Entropy function (C4.5):
$$E(n) = -\sum_{i=1}^{c} p(c = c_i \mid n)\,\log_2 p(c = c_i \mid n)$$
Information gain:
$$G(n, A) = E(n) - \sum_{v \in Value(A)} \frac{|n_v|}{|n|}\,E(n_v)$$

• ID3 and C4.5 branch on every value and use an entropy minimisation heuristic to
select best attribute.

• CART branches on all values or one value only, uses entropy minimisation or gini
function.

• GIDDY formulates a test by branching on a subset of attribute values (selection by
entropy minimisation)
Tree Induction:
The algorithm searches through the space of possible decision trees from
simplest to increasingly complex, guided by the information gain
heuristic.

Outlook
  Sunny: {1, 2, 8, 9, 11} -> ?
  Overcast: Yes
  Rain: {4, 5, 6, 10, 14} -> ?

D (Sunny, Humidity) = 0.97 - 3/5*0 - 2/5*0 = 0.97
D (Sunny, Temperature) = 0.97 - 2/5*0 - 2/5*1 - 1/5*0.0 = 0.57
D (Sunny, Wind) = 0.97 - 2/5*1.0 - 3/5*0.918 = 0.019
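The three gains for the Sunny branch can be checked with a short sketch of the entropy and information-gain formulas. The rows are days 1, 2, 8, 9, and 11 from the Play Tennis table; the function and key names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target="play"):
    """Entropy of the node minus the weighted entropy of its children."""
    n = len(rows)
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

# Examples with Outlook = Sunny (days 1, 2, 8, 9, 11).
sunny = [
    {"temperature": "Hot",  "humidity": "High",   "wind": "Weak",   "play": "No"},
    {"temperature": "Hot",  "humidity": "High",   "wind": "Strong", "play": "No"},
    {"temperature": "Mild", "humidity": "High",   "wind": "Weak",   "play": "No"},
    {"temperature": "Cool", "humidity": "Normal", "wind": "Weak",   "play": "Yes"},
    {"temperature": "Mild", "humidity": "Normal", "wind": "Strong", "play": "Yes"},
]

gains = {a: information_gain(sunny, a)
         for a in ("humidity", "temperature", "wind")}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

Humidity has the largest gain, so it becomes the test under the Sunny branch, matching the final tree shown earlier.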
Overfitting

• Consider error of hypothesis h over

– training data : error_training (h)
– entire distribution D of data : error_D (h)
Hypothesis h overfits training data if there is an
alternative hypothesis h’ such that
error_training (h) < error_training (h’) and
error_D (h) > error_D (h’)
Preventing Overfitting
• Problem: We don’t want these algorithms to fit to “noise”
• Reduced-error pruning :
– breaks the samples into a training set and a test set. The tree is
induced completely on the training set.
– Working backwards from the bottom of the tree, the subtree
starting at each nonterminal node is examined.
• If the error rate on the test cases improves by pruning it, the subtree is
removed. The process continues until no improvement can be made by
pruning a subtree.
• The error rate of the final tree on the test cases is used as an estimate of
the true error rate.
Decision Tree Pruning:
physician fee freeze = n:
| adoption of the budget resolution = y: democrat (151.0)
| adoption of the budget resolution = u: democrat (1.0)
| adoption of the budget resolution = n:
| | education spending = n: democrat (6.0)
| | education spending = y: democrat (9.0)
| | education spending = u: republican (1.0)
physician fee freeze = y:
| synfuels corporation cutback = n: republican (97.0/3.0)
| synfuels corporation cutback = u: republican (4.0)
| synfuels corporation cutback = y:
| | duty free exports = y: democrat (2.0)
| | duty free exports = u: republican (1.0)
| | duty free exports = n:
| | | education spending = n: democrat (5.0/2.0)
| | | education spending = y: republican (13.0/2.0)
| | | education spending = u: democrat (1.0)
physician fee freeze = u:
| water project cost sharing = n: democrat (0.0)
| water project cost sharing = y: democrat (4.0)
| water project cost sharing = u:
| | mx missile = n: republican (0.0)
| | mx missile = y: democrat (3.0/1.0)
| | mx missile = u: republican (2.0)

Simplified Decision Tree:
physician fee freeze = n: democrat (168.0/2.6)
physician fee freeze = y: republican (123.0/13.9)
physician fee freeze = u:
| mx missile = n: democrat (3.0/1.1)
| mx missile = y: democrat (4.0/2.2)
| mx missile = u: republican (2.0/1.0)

Evaluation on training data (300 items):
Before Pruning        After Pruning
----------------      ---------------------------
Size   Errors         Size   Errors      Estimate
25     8( 2.7%)       7      13( 4.3%)   ( 6.9%) <
Evaluation of Classification Systems
Training Set: examples with class values for learning.
Test Set: examples with class values for evaluating.
Evaluation: Hypotheses are used to infer classification of examples in the test set; inferred classification is compared to known classification.
Accuracy: percentage of examples in the test set that are classified correctly.
[Figure: overlapping Predicted and Actual sets, showing False Positives, True Positives, and False Negatives]
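The quantities named above can be computed by comparing inferred and known classes on the test set. This is a minimal sketch; the label lists are hypothetical and "Yes" is arbitrarily treated as the positive class.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn

def accuracy(actual, predicted):
    """Fraction of test examples whose inferred class equals the known class."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)

# Hypothetical test set: known classes and the classifier's decisions.
actual    = ["Yes", "Yes", "No", "No",  "Yes"]
predicted = ["Yes", "No",  "No", "Yes", "Yes"]

tp, fp, fn, tn = confusion_counts(actual, predicted, positive="Yes")
acc = accuracy(actual, predicted)
print(tp, fp, fn, tn, acc)
```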
