Data Preprocessing

What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
• Attribute is also known as a field, or feature
  – Examples: eye color or age of a person
• A collection of attributes describes an object, entity, or instance
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
  – Examples: ID numbers, eye color, zip codes
• Distinction between attributes and attribute values
  – Same attribute can be mapped to different attribute values
    • Example: height can be measured in feet or meters
  – Different attributes can be mapped to the same set of values
    • Example: Attribute values for ID and age are integers
    • But properties of attribute values can be different: ID has no limit, but age has a maximum and minimum value

Types of Attributes
• There are different types of attributes (a small representation sketch follows this list)
  – Nominal: categories, states
  – Binary: a nominal attribute with only 2 states (0 or 1)
    • Example: gender
  – Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known
    • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
  – Interval: measured on a scale of equal-sized units; values have an order
    • Examples: calendar dates, temperatures in Celsius or Fahrenheit
  – Ratio: we can speak of values as being an order of magnitude larger than the unit of measurement
    • Examples: temperature in Kelvin, length, time, counts
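A minimal sketch, assuming pandas is available, of how these attribute types can be represented in a DataFrame: nominal and binary attributes as unordered categories, ordinal attributes as ordered categories, and interval/ratio attributes as plain numbers. The column names and values are invented for illustration.

import pandas as pd

df = pd.DataFrame({
    "eye_color": ["brown", "blue", "brown"],        # nominal
    "gender": [0, 1, 0],                            # binary (0/1)
    "height_class": ["short", "tall", "medium"],    # ordinal
    "temp_celsius": [21.5, 19.0, 23.2],             # interval
    "age": [34, 29, 41],                            # ratio
})

# Ordinal: declare an explicit order so comparisons and sorting respect the ranking.
df["height_class"] = pd.Categorical(
    df["height_class"], categories=["short", "medium", "tall"], ordered=True
)
# Nominal: categories with no meaningful order.
df["eye_color"] = pd.Categorical(df["eye_color"])

print(df.dtypes)
print(df["height_class"].min())   # "short" -- order is meaningful for ordinal data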
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a finite
number of digits.
– Continuous attributes are typically represented as floating-point variables.
Basic Statistical Descriptions of Data
• Motivation
  – To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
  – median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
  – Data dispersion: analyzed with multiple granularities of precision
• Dispersion analysis on computed measures
  – Folding measures into numerical dimensions

Summary Statistics: Measuring the Central Tendency
• Summary statistics are numbers that summarize properties of the data
• Summarized properties include frequency, location and spread
  – Examples: location - mean; spread - standard deviation
• Most summary statistics can be calculated in a single pass through the data (see the sketch below)
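A hedged illustration of the "single pass" claim: Welford's online update computes the mean and sample variance while streaming through the data once. This is a sketch, not the slide's own method, and the data values are made up.

def running_mean_variance(values):
    """Return (mean, sample variance) computed in one pass (Welford's method)."""
    n = 0
    mean = 0.0
    m2 = 0.0   # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return mean, variance

print(running_mean_variance([4.0, 7.0, 13.0, 16.0]))  # (10.0, 30.0)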
Frequency and Mode
• The frequency of an attribute value is the percentage of time the value occurs in the data set
  – For example, given the attribute 'gender' and a representative population of people, the gender 'female' occurs about 50% of the time.
• The mode of an attribute is the most frequent attribute value
• The notions of frequency and mode are typically used with categorical data (see the sketch below)

Percentiles
• For an ordinal or continuous attribute x and a number p between 0 and 100, the p-th percentile is a value of x such that p% of the observed values of x fall below it.
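A small sketch of frequency, mode, and a percentile, assuming pandas and numpy; the 'gender' and age values below are invented sample data.

import numpy as np
import pandas as pd

gender = pd.Series(["female", "male", "female", "female", "male"])
frequency = gender.value_counts(normalize=True) * 100   # percentage of occurrences
mode = gender.mode()[0]                                  # most frequent value
print(frequency)         # female 60.0, male 40.0
print("mode:", mode)

ages = np.array([22, 25, 25, 27, 30, 31, 35, 40, 45, 60])
print("50th percentile:", np.percentile(ages, 50))       # half the ages fall below this value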
Measures of Location: Mean and Median
• The mean is the most common measure of the location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used (see the sketch below).

Arithmetic Mean
• Given a set of N values x1, x2, …, xN, the mean of this set of values is
  x̄ = (x1 + x2 + … + xN) / N
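A sketch of the point above: the mean moves a lot when one outlier is present, while the median and a trimmed mean stay close to the bulk of the data. It assumes numpy and scipy are available, and the numbers are illustrative only.

import numpy as np
from scipy.stats import trim_mean

values = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 1000])  # 1000 is an outlier

print("mean        :", values.mean())            # pulled up by the outlier
print("median      :", np.median(values))        # robust to the outlier
print("trimmed mean:", trim_mean(values, 0.1))   # drop 10% from each tail, then average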
Median
• If N is odd, then the median is the middle value of the ordered set.
• If N is even, then the median is not unique; it is the two middlemost values and any value in between.
  – If X is a numeric attribute, in this case the median is, by convention, taken as the average of the two middlemost values (see the sketch below).

Measuring the Central Tendency
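A minimal sketch of the median convention described above for a numeric attribute: the middle value when N is odd, the average of the two middlemost values when N is even. The values are made up for illustration and numpy is assumed.

import numpy as np

odd_n  = [1, 3, 3, 6, 7, 8, 9]       # N = 7 (odd)  -> middle value is 6
even_n = [1, 2, 3, 4, 5, 6, 8, 9]    # N = 8 (even) -> average of 4 and 5 is 4.5

print(np.median(odd_n))    # 6.0
print(np.median(even_n))   # 4.5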
Measures of Spread: Range and Variance
• The range of a numeric attribute X is the difference between its largest and smallest values: range(X) = max(X) - min(X).

Variance and Standard Deviation
• The variance of N observations x1, …, xN with mean x̄ is
  σ² = (1/N) Σ (xi - x̄)²
• The standard deviation σ is the square root of the variance (see the sketch below).
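A quick sketch computing these spread measures (range, variance, standard deviation) with numpy; the sample values are invented.

import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70])

print("range   :", x.max() - x.min())
print("variance:", x.var(ddof=0))   # matches the 1/N definition above; ddof=1 gives the sample version
print("std dev :", x.std(ddof=0))   # square root of the variance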
Types of data sets
• Record
  – Data Matrix
  – Document Data
  – Transaction Data
• Graph
  – World Wide Web
  – Molecular Structures
• Ordered
  – Spatial Data
  – Temporal Data
  – Sequential Data
  – Genetic Sequence Data

Examples of data quality problems
• Noise: refers to modification of original values
• Outliers: data objects that are considerably different from most of the other data objects in the data set
• Missing values
  – Reasons for missing values
    • Information is not collected (e.g., people decline to give their age and weight)
    • Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
  – Handling missing values (see the sketch below)
    • Eliminate data objects
    • Estimate missing values
    • Ignore the missing value during analysis
    • Replace with all possible values (weighted by their probabilities)
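A hedged sketch of the first three missing-value strategies listed above, using pandas. The tiny DataFrame and its column names are invented for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 33],
    "income": [30000, 52000, np.nan, 45000],
})

# 1) Eliminate data objects (rows) that contain missing values.
dropped = df.dropna()

# 2) Estimate missing values, e.g. fill each gap with the column mean.
imputed = df.fillna(df.mean(numeric_only=True))

# 3) Ignore missing values during analysis: pandas skips NaN by default.
mean_age = df["age"].mean()

print(dropped, imputed, mean_age, sep="\n")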
Why Data Preprocessing?
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., occupation=" "
  – noisy: containing errors or outliers
    • e.g., Salary="-10"
  – inconsistent: containing discrepancies in codes or names
    • e.g., Age="42", Birthday="03/07/1997"
    • e.g., was rating "1,2,3", now rating "A, B, C"
    • e.g., discrepancy between duplicate records
  – redundant: including everything, some of which is irrelevant to our task

Why Preprocessing? Data Can Be Incomplete!
• Attributes of interest are not available (e.g., customer information for sales transaction data)
• Data were not considered important at the time of the transactions, so they were not recorded.
• Data were not recorded because of misunderstanding or malfunctions.
• Data may have been recorded and later deleted.
• Missing/unknown values for some data
Fingerprint Recognition Case
• Fingerprint identification at the gym
• HOW?

Feature Extraction in Fingerprint Recognition
• "It is not the points, but what is in between the points that matters..." (Edward German)
• Identifying/extracting a good feature set is the most challenging part of data mining.
• Feature vector: 10.2, 0.23, 0.34, 0.34, 20, …
Forms of Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation

Why Data Preprocessing?
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  – Quality decisions must be based on quality data
  – DM needs consistent integration of quality data
Forms of Data Preprocessing
• Data cleaning
• Data integration
• Data transformation
• Data reduction

What is Data Exploration?
• A preliminary exploration of the data to better understand its characteristics.
• Key motivations include
  – Helping to select the right tool for preprocessing or analysis
  – Making use of humans' abilities to recognize patterns
    • People can recognize patterns not captured by data analysis tools
• Related to the area of Exploratory Data Analysis (EDA)
  – Created by statistician John Tukey
  – Seminal book is Exploratory Data Analysis by Tukey
  – A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook
Data Exploratory Analysis

Exploratory Data Analysis Techniques
• Summary Statistics
• Visualization
• Feature Selection (big topic)
• Dimension Reduction (big topic)

Aggregation
• Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
  – Data reduction: reduce the number of attributes or objects
  – Change of scale: cities aggregated into regions, states, countries, etc. (see the sketch below)
  – More "stable" data: aggregated data tends to have less variability
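A minimal aggregation sketch: combining city-level records into a country-level summary with a pandas groupby. The table, country codes, and revenue values are made up for illustration.

import pandas as pd

sales = pd.DataFrame({
    "country": ["ET", "ET", "KE", "KE"],
    "city":    ["Addis Ababa", "Adama", "Nairobi", "Mombasa"],
    "revenue": [120.0, 80.0, 150.0, 90.0],
})

# Change of scale: cities aggregated into countries; fewer, more stable rows.
by_country = sales.groupby("country", as_index=False)["revenue"].sum()
print(by_country)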
Sampling
• Sampling is the main technique employed for data selection.
• It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
• The key principle for effective sampling is the following: using a sample will work almost as well as using the entire data set, if the sample is representative.
  – A sample is representative if it has approximately the same property (of interest) as the original set of data.

Types of Sampling
• Simple random sampling
  – There is an equal probability of selecting any particular item
• Sampling without replacement
  – As each item is selected, it is removed from the population
• Sampling with replacement
  – Objects are not removed from the population as they are selected for the sample.
  – In sampling with replacement, the same object can be picked more than once
• Stratified sampling
  – Split the data into several partitions; then draw random samples from each partition (a short sketch of these schemes follows)
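A sketch of the sampling schemes above using pandas; the DataFrame, the 'group' column, and the sample sizes are invented for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "value": np.arange(100),
    "group": ["A"] * 70 + ["B"] * 30,   # two strata of different sizes
})

# Simple random sampling without replacement (each item picked at most once).
without_repl = df.sample(n=10, replace=False, random_state=0)

# Sampling with replacement (the same object can be picked more than once).
with_repl = df.sample(n=10, replace=True, random_state=0)

# Stratified sampling: draw a 10% random sample from each partition ('group').
stratified = df.groupby("group").sample(frac=0.1, random_state=0)

print(len(without_repl), len(with_repl), len(stratified))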
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

Dimensionality Reduction
• Purpose of dimensionality reduction:
  – Avoid the curse of dimensionality
  – Reduce the amount of time and memory required by data mining algorithms
  – Allow data to be more easily visualized
  – May help to eliminate irrelevant features or reduce noise
• Techniques of dimensionality reduction (see the sketch below):
  – Principal Component Analysis
  – Singular Value Decomposition
  – Others: supervised and non-linear techniques
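A hedged sketch of one of the techniques named above, Principal Component Analysis, using scikit-learn on random data; the data and the choice of two components are illustrative only.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 objects, 10 attributes

pca = PCA(n_components=2)               # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)    # fraction of variance kept per component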
Feature Subset Selection
• Another way to reduce the dimensionality of data
• Redundant features
  – Duplicate much or all of the information contained in one or more other attributes
  – Example: purchase price of a product and the amount of sales tax paid (see the correlation sketch below)
• Irrelevant features
  – Contain no information that is useful for the data mining task at hand
  – Example: students' ID is often irrelevant to the task of predicting students' GPA

Feature Selection and Correlation Matrix
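A hedged illustration of using a correlation matrix to spot redundant features: highly correlated columns (e.g., price and the sales tax derived from it) carry overlapping information. All column names, data, and the 0.9 threshold below are invented.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
price = rng.uniform(10, 100, size=200)
df = pd.DataFrame({
    "price": price,
    "sales_tax": 0.15 * price,                 # redundant: a function of price
    "student_id": np.arange(200),              # irrelevant to most prediction tasks
    "gpa": rng.normal(3.0, 0.4, size=200),
})

corr = df.corr().abs()
print(corr.round(2))

# Candidate redundant pairs: off-diagonal correlations above 0.9.
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and corr.loc[a, b] > 0.9]
print(high)   # [('price', 'sales_tax')]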
Feature Subset Selection
• Techniques:
  – Brute-force approach: try all possible feature subsets as input to the data mining algorithm
  – Embedded approaches: feature selection occurs naturally as part of the data mining algorithm
  – Filter approaches: features are selected before the data mining algorithm is run (see the sketch below)
  – Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes

Feature Creation
• Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
• Three general methodologies:
  1. Feature Extraction: domain-specific
  2. Mapping Data to a New Space
  3. Feature Construction: combining features
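Returning to the feature subset selection techniques above, here is a sketch of a filter approach: score each feature against the target before any mining algorithm runs and keep the top k. It uses scikit-learn's SelectKBest on synthetic data; the data set and the choice of k are assumptions for illustration.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

selector = SelectKBest(score_func=f_classif, k=3)   # keep the 3 best-scoring features
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                     # (200, 3)
print(selector.get_support(indices=True))   # indices of the selected features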
DM Assignment-I
• Compare and contrast DM and RDBMS
  – Describe the basic differences and similarities;
  – Describe the pros and cons (merits & demerits).
• A summarized report of about two pages (Font: Times New Roman 12, 1.5 spacing) should be submitted by May 28, 2020. Use aastukk@gmail.com to submit your assignments before the due date.