Data Preprocessing

What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
• Attribute is also known as a field, or feature
  – Examples: eye color or age of a person
• A collection of attributes describes an object, entity, or instance
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
  – Examples: ID numbers, eye color, zip codes
• Distinction between attributes and attribute values
  – Same attribute can be mapped to different attribute values
    • Example: height can be measured in feet or meters
  – Different attributes can be mapped to the same set of values
    • Example: Attribute values for ID and age are integers
    • But properties of attribute values can be different: ID has no limit, but age has a maximum and minimum value

Types of Attributes
• There are different types of attributes (a small representation sketch follows this list)
  – Nominal: categories, states
  – Binary: a nominal attribute with only 2 states (0 or 1)
    • Example: gender
  – Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known
    • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
  – Interval: measured on a scale of equal-sized units; values have an order
    • Examples: calendar dates, temperatures in Celsius or Fahrenheit
  – Ratio: we can speak of values as being an order of magnitude larger than the unit of measurement
    • Examples: temperature in Kelvin, length, time, counts
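A minimal sketch, assuming pandas is available, of how these attribute types can be represented in a DataFrame: nominal and binary attributes as unordered categories, ordinal attributes as ordered categories, and interval/ratio attributes as plain numbers. The column names and values are invented for illustration.

import pandas as pd

df = pd.DataFrame({
    "eye_color": ["brown", "blue", "brown"],        # nominal
    "gender": [0, 1, 0],                            # binary (0/1)
    "height_class": ["short", "tall", "medium"],    # ordinal
    "temp_celsius": [21.5, 19.0, 23.2],             # interval
    "age": [34, 29, 41],                            # ratio
})

# Ordinal: declare an explicit order so comparisons and sorting respect the ranking.
df["height_class"] = pd.Categorical(
    df["height_class"], categories=["short", "medium", "tall"], ordered=True
)
# Nominal: categories with no meaningful order.
df["eye_color"] = pd.Categorical(df["eye_color"])

print(df.dtypes)
print(df["height_class"].min())   # "short" -- order is meaningful for ordinal data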
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a finite
number of digits.
– Continuous attributes are typically represented as floating-point variables.
Basic Statistical Descriptions of Data
• Motivation
  – To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
  – median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
  – Data dispersion: analyzed with multiple granularities of precision
• Dispersion analysis on computed measures
  – Folding measures into numerical dimensions

Summary Statistics: Measuring the Central Tendency
• Summary statistics are numbers that summarize properties of the data
• Summarized properties include frequency, location and spread
  – Examples: location - mean; spread - standard deviation
• Most summary statistics can be calculated in a single pass through the data (see the sketch below)
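A hedged illustration of the "single pass" claim: Welford's online update computes the mean and sample variance while streaming through the data once. This is a sketch, not the slide's own method, and the data values are made up.

def running_mean_variance(values):
    """Return (mean, sample variance) computed in one pass (Welford's method)."""
    n = 0
    mean = 0.0
    m2 = 0.0   # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return mean, variance

print(running_mean_variance([4.0, 7.0, 13.0, 16.0]))  # (10.0, 30.0)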
Frequency and Mode
• The frequency of an attribute value is the percentage of time the value occurs in the data set
  – For example, given the attribute 'gender' and a representative population of people, the gender 'female' occurs about 50% of the time.
• The mode of an attribute is the most frequent attribute value
• The notions of frequency and mode are typically used with categorical data (see the sketch below)

Percentiles
• For an ordinal or continuous attribute x and a number p between 0 and 100, the p-th percentile is a value of x such that p% of the observed values of x fall below it.
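A small sketch of frequency, mode, and a percentile, assuming pandas and numpy; the 'gender' and age values below are invented sample data.

import numpy as np
import pandas as pd

gender = pd.Series(["female", "male", "female", "female", "male"])
frequency = gender.value_counts(normalize=True) * 100   # percentage of occurrences
mode = gender.mode()[0]                                  # most frequent value
print(frequency)         # female 60.0, male 40.0
print("mode:", mode)

ages = np.array([22, 25, 25, 27, 30, 31, 35, 40, 45, 60])
print("50th percentile:", np.percentile(ages, 50))       # half the ages fall below this value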
Measures of Location: Mean and Median
• The mean is the most common measure of the location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used (see the sketch below).

Arithmetic Mean
• Given a set of N values x1, x2, …, xN, the mean of this set of values is
  x̄ = (x1 + x2 + … + xN) / N
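A sketch of the point above: the mean moves a lot when one outlier is present, while the median and a trimmed mean stay close to the bulk of the data. It assumes numpy and scipy are available, and the numbers are illustrative only.

import numpy as np
from scipy.stats import trim_mean

values = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 1000])  # 1000 is an outlier

print("mean        :", values.mean())            # pulled up by the outlier
print("median      :", np.median(values))        # robust to the outlier
print("trimmed mean:", trim_mean(values, 0.1))   # drop 10% from each tail, then average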
Median
• If N is odd, then the median is the middle value of the ordered set.
• If N is even, then the median is not unique; it is the two middlemost values and any value in between.
  – If X is a numeric attribute, in this case the median is, by convention, taken as the average of the two middlemost values (see the sketch below).

Measuring the Central Tendency
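A minimal sketch of the median convention described above for a numeric attribute: the middle value when N is odd, the average of the two middlemost values when N is even. The values are made up for illustration and numpy is assumed.

import numpy as np

odd_n  = [1, 3, 3, 6, 7, 8, 9]       # N = 7 (odd)  -> middle value is 6
even_n = [1, 2, 3, 4, 5, 6, 8, 9]    # N = 8 (even) -> average of 4 and 5 is 4.5

print(np.median(odd_n))    # 6.0
print(np.median(even_n))   # 4.5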
Measures of Spread: Range and Variance
• The range of a numeric attribute X is the difference between its largest and smallest values: range(X) = max(X) - min(X).

Variance and Standard Deviation
• The variance of N observations x1, …, xN with mean x̄ is
  σ² = (1/N) Σ (xi - x̄)²
• The standard deviation σ is the square root of the variance (see the sketch below).
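A quick sketch computing these spread measures (range, variance, standard deviation) with numpy; the sample values are invented.

import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70])

print("range   :", x.max() - x.min())
print("variance:", x.var(ddof=0))   # matches the 1/N definition above; ddof=1 gives the sample version
print("std dev :", x.std(ddof=0))   # square root of the variance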
Types of data sets
• Record
  – Data Matrix
  – Document Data
  – Transaction Data
• Graph
  – World Wide Web
  – Molecular Structures
• Ordered
  – Spatial Data
  – Temporal Data
  – Sequential Data
  – Genetic Sequence Data

Examples of data quality problems
• Noise: refers to modification of original values
• Outliers: data objects that are considerably different from most of the other data objects in the data set
• Missing values
  – Reasons for missing values
    • Information is not collected (e.g., people decline to give their age and weight)
    • Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
  – Handling missing values (see the sketch below)
    • Eliminate data objects
    • Estimate missing values
    • Ignore the missing value during analysis
    • Replace with all possible values (weighted by their probabilities)
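A hedged sketch of the first three missing-value strategies listed above, using pandas. The tiny DataFrame and its column names are invented for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 33],
    "income": [30000, 52000, np.nan, 45000],
})

# 1) Eliminate data objects (rows) that contain missing values.
dropped = df.dropna()

# 2) Estimate missing values, e.g. fill each gap with the column mean.
imputed = df.fillna(df.mean(numeric_only=True))

# 3) Ignore missing values during analysis: pandas skips NaN by default.
mean_age = df["age"].mean()

print(dropped, imputed, mean_age, sep="\n")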
Why Data Preprocessing?
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., occupation=" "
  – noisy: containing errors or outliers
    • e.g., Salary="-10"
  – inconsistent: containing discrepancies in codes or names
    • e.g., Age="42", Birthday="03/07/1997"
    • e.g., was rating "1,2,3", now rating "A, B, C"
    • e.g., discrepancy between duplicate records
  – redundant: including everything, some of which is irrelevant to our task

Why Preprocessing? Data Can Be Incomplete!
• Attributes of interest are not available (e.g., customer information for sales transaction data)
• Data were not considered important at the time of the transactions, so they were not recorded.
• Data were not recorded because of misunderstanding or malfunctions.
• Data may have been recorded and later deleted.
• Missing/unknown values for some data
Fingerprint Recognition Case
• Fingerprint identification at the gym
• HOW?

Feature Extraction in Fingerprint Recognition
• "It is not the points, but what is in between the points that matters..." (Edward German)
• Identifying/extracting a good feature set is the most challenging part of data mining.
• Feature vector: 10.2, 0.23, 0.34, 0.34, 20, …
Forms of Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation

Why Data Preprocessing?
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  – Quality decisions must be based on quality data
  – DM needs consistent integration of quality data
Forms of Data Preprocessing
• Data cleaning
• Data integration
• Data transformation
• Data reduction

What is Data Exploration?
• A preliminary exploration of the data to better understand its characteristics.
• Key motivations include
  – Helping to select the right tool for preprocessing or analysis
  – Making use of humans' abilities to recognize patterns
    • People can recognize patterns not captured by data analysis tools
• Related to the area of Exploratory Data Analysis (EDA)
  – Created by statistician John Tukey
  – Seminal book is Exploratory Data Analysis by Tukey
  – A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook
Data Exploratory Analysis

Exploratory Data Analysis Techniques
• Summary Statistics
• Visualization
• Feature Selection (big topic)
• Dimension Reduction (big topic)

Aggregation
• Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
  – Data reduction: reduce the number of attributes or objects
  – Change of scale: cities aggregated into regions, states, countries, etc. (see the sketch below)
  – More "stable" data: aggregated data tends to have less variability
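A minimal aggregation sketch: combining city-level records into a country-level summary with a pandas groupby. The table, country codes, and revenue values are made up for illustration.

import pandas as pd

sales = pd.DataFrame({
    "country": ["ET", "ET", "KE", "KE"],
    "city":    ["Addis Ababa", "Adama", "Nairobi", "Mombasa"],
    "revenue": [120.0, 80.0, 150.0, 90.0],
})

# Change of scale: cities aggregated into countries; fewer, more stable rows.
by_country = sales.groupby("country", as_index=False)["revenue"].sum()
print(by_country)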
Sampling
• Sampling is the main technique employed for data selection.
• It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
• The key principle for effective sampling is the following: using a sample will work almost as well as using the entire data set, if the sample is representative.
  – A sample is representative if it has approximately the same property (of interest) as the original set of data.

Types of Sampling
• Simple random sampling
  – There is an equal probability of selecting any particular item
• Sampling without replacement
  – As each item is selected, it is removed from the population
• Sampling with replacement
  – Objects are not removed from the population as they are selected for the sample.
  – In sampling with replacement, the same object can be picked more than once
• Stratified sampling
  – Split the data into several partitions; then draw random samples from each partition (a short sketch of these schemes follows)
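A sketch of the sampling schemes above using pandas; the DataFrame, the 'group' column, and the sample sizes are invented for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "value": np.arange(100),
    "group": ["A"] * 70 + ["B"] * 30,   # two strata of different sizes
})

# Simple random sampling without replacement (each item picked at most once).
without_repl = df.sample(n=10, replace=False, random_state=0)

# Sampling with replacement (the same object can be picked more than once).
with_repl = df.sample(n=10, replace=True, random_state=0)

# Stratified sampling: draw a 10% random sample from each partition ('group').
stratified = df.groupby("group").sample(frac=0.1, random_state=0)

print(len(without_repl), len(with_repl), len(stratified))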
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

Dimensionality Reduction
• Purpose of dimensionality reduction:
  – Avoid the curse of dimensionality
  – Reduce the amount of time and memory required by data mining algorithms
  – Allow data to be more easily visualized
  – May help to eliminate irrelevant features or reduce noise
• Techniques of dimensionality reduction (see the sketch below):
  – Principal Component Analysis
  – Singular Value Decomposition
  – Others: supervised and non-linear techniques
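A hedged sketch of one of the techniques named above, Principal Component Analysis, using scikit-learn on random data; the data and the choice of two components are illustrative only.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 objects, 10 attributes

pca = PCA(n_components=2)               # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)    # fraction of variance kept per component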
Feature Subset Selection
• Another way to reduce the dimensionality of data
• Redundant features
  – Duplicate much or all of the information contained in one or more other attributes
  – Example: purchase price of a product and the amount of sales tax paid (see the correlation sketch below)
• Irrelevant features
  – Contain no information that is useful for the data mining task at hand
  – Example: students' ID is often irrelevant to the task of predicting students' GPA

Feature Selection and Correlation Matrix
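A hedged illustration of using a correlation matrix to spot redundant features: highly correlated columns (e.g., price and the sales tax derived from it) carry overlapping information. All column names, data, and the 0.9 threshold below are invented.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
price = rng.uniform(10, 100, size=200)
df = pd.DataFrame({
    "price": price,
    "sales_tax": 0.15 * price,                 # redundant: a function of price
    "student_id": np.arange(200),              # irrelevant to most prediction tasks
    "gpa": rng.normal(3.0, 0.4, size=200),
})

corr = df.corr().abs()
print(corr.round(2))

# Candidate redundant pairs: off-diagonal correlations above 0.9.
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and corr.loc[a, b] > 0.9]
print(high)   # [('price', 'sales_tax')]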
Feature Subset Selection
• Techniques:
  – Brute-force approach: try all possible feature subsets as input to the data mining algorithm
  – Embedded approaches: feature selection occurs naturally as part of the data mining algorithm
  – Filter approaches: features are selected before the data mining algorithm is run (see the sketch below)
  – Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes

Feature Creation
• Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
• Three general methodologies:
  1. Feature Extraction: domain-specific
  2. Mapping Data to a New Space
  3. Feature Construction: combining features
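Returning to the feature subset selection techniques above, here is a sketch of a filter approach: score each feature against the target before any mining algorithm runs and keep the top k. It uses scikit-learn's SelectKBest on synthetic data; the data set and the choice of k are assumptions for illustration.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

selector = SelectKBest(score_func=f_classif, k=3)   # keep the 3 best-scoring features
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                     # (200, 3)
print(selector.get_support(indices=True))   # indices of the selected features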
DM Assignment-I
• Compare and contrast DM and RDBMS
  – Describe the basic differences and similarities;
  – Describe the pros and cons (merits & demerits).
• A summarized report of about two pages (Font: Times New Roman 12, 1.5 spacing) should be submitted by May 28, 2020. Use aastukk@gmail.com to submit your assignments before the due date.