Chapter 02 Data and Data Preprocessing
ES234422
PEMODELAN & ANALITIKA PREDIKTIF
(PREDICTIVE MODELING & ANALYTICS)
Chapter 2
Data & Data Preprocessing
Prof. Ir. Arif Djunaidy, M.Sc., Ph.D.
arif.djunaidy@its.ac.id
adjunaidy@gmail.com
Learning Objectives & Book Reading
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.
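The point about finite digits can be seen directly in code. The following is a minimal sketch; the temperature value is a made-up example, not data from the course.

```python
import math
import sys

# Hypothetical continuous attribute value (a temperature in degrees Celsius).
temperature_c = 36.6

# Continuous attributes are usually stored as floating-point numbers.
is_float = isinstance(temperature_c, float)

# A float carries only a finite number of reliable decimal digits.
reliable_digits = sys.float_info.dig

# Because of finite precision, exact comparison of real values can fail,
# so a tolerance-based comparison is the safer choice.
exact_match = (0.1 + 0.2 == 0.3)
close_match = math.isclose(0.1 + 0.2, 0.3)

print(is_float, reliable_digits, exact_match, close_match)
```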
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
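A document-term matrix like the one above can be built by counting how often each vocabulary term occurs in each document. The sketch below is only illustrative: the two document strings are hypothetical, and the helper name `term_vector` is not from the slides.

```python
from collections import Counter

# Vocabulary terms taken from the matrix columns.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_vector(text, vocabulary):
    """Count occurrences of each vocabulary term in a whitespace-split text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical documents, invented for illustration only.
docs = ["team play ball play", "coach lost game coach"]

# One row per document, one column per vocabulary term.
matrix = [term_vector(doc, vocabulary) for doc in docs]
for row in matrix:
    print(row)
```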
[Figure callout: an element of the sequence]
Data & Data Preprocessing Chapter 02 / 18
Ordered Data (2)
• Genomic (DNA) sequence data
DNA sequencing is the process of determining the nucleic acid
sequence – the order of nucleotides in DNA. It includes any method
or technology that is used to determine the order of the four bases:
adenine, guanine, cytosine, and thymine.
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data (3)
• Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
• Stratified sampling
– Split the data into several partitions; then draw random samples
from each partition
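The two steps (partition, then sample within each partition) can be sketched as follows. The function name `stratified_sample`, the proportional allocation per partition, and the toy records are assumptions made for illustration; the slide does not prescribe how many samples to draw from each partition.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Split records into partitions by key(record), then draw a random
    sample of roughly `fraction` of each partition (at least one record)."""
    rng = random.Random(seed)
    partitions = defaultdict(list)
    for rec in records:
        partitions[key(rec)].append(rec)
    sample = []
    for members in partitions.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical imbalanced data set: 90 records of class "a", 10 of class "b".
records = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
print(len(sample), sum(1 for label, _ in sample if label == "b"))
```

Sampling within each partition guarantees that the small class "b" is represented, which simple random sampling over the whole data set does not.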
• Techniques
– Principal Component Analysis
– Singular Value Decomposition
– Others: supervised and non-linear techniques
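As a rough illustration of Principal Component Analysis, the sketch below finds the first principal component of a small data set. It uses power iteration on the covariance matrix, which is a simple stand-in chosen to keep the example dependency-free; it is not a full SVD routine. The data points and function names are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(d)]
           for j in range(d)]
    return cov, centered

def first_principal_component(cov, iterations=200):
    """Approximate the dominant eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[j][k] * v[k] for k in range(d)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical two-attribute data with strongly correlated attributes.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

cov, centered = covariance_matrix(data)
pc1 = first_principal_component(cov)

# Reduce each two-dimensional object to one coordinate along pc1.
projected = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centered]
print(pc1)
```

Projecting onto the first component keeps the direction of largest variance, so the single new attribute retains at least as much variance as either original attribute.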
$$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$
where n is the number of dimensions (attributes) and p_k and q_k
are, respectively, the k-th attributes (components) of data objects
p and q.
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
Distance Matrix
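The distance matrix can be recomputed with a few lines of code. The point coordinates below are an assumption: they do not appear in this excerpt and were chosen because they reproduce the matrix values shown.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two objects with the same attributes."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

# Assumed coordinates (not given in this excerpt).
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
names = list(points)

# Pairwise distances, rounded to three decimals as in the slide.
matrix = [[round(euclidean(points[a], points[b]), 3) for b in names]
          for a in names]
for name, row in zip(names, matrix):
    print(name, row)
```

The matrix is symmetric with a zero diagonal, so in practice only the upper or lower triangle needs to be computed.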
• r = 2. Euclidean distance
L∞ p1 p2 p3 p4
p1 0 2 3 5
p2 2 0 1 3
p3 3 1 0 2
p4 5 3 2 0
Distance Matrix
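A single function covering several values of r, including the limiting case r = ∞, might look like the sketch below. The function name `minkowski` and the point coordinates are assumptions; the coordinates are not given in this excerpt and were picked because they reproduce the L∞ matrix above.

```python
import math

def minkowski(p, q, r):
    """Minkowski distance of order r; r = math.inf gives the L-infinity
    (supremum) distance, i.e. the largest attribute-wise difference."""
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)

# Assumed coordinates (not given in this excerpt).
pts = [(0, 2), (2, 0), (3, 1), (5, 1)]

# L-infinity distance matrix.
linf = [[minkowski(a, b, math.inf) for b in pts] for a in pts]
for row in linf:
    print(row)

# r = 2 gives the Euclidean distance for the same pair of objects.
print(round(minkowski(pts[0], pts[1], 2), 3))
```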
Mahalanobis Distance (1)
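$$\mathrm{mahalanobis}(p, q) = (p - q)\,\Sigma^{-1}\,(p - q)^{T}$$
$$\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$$
$$\Sigma = \begin{bmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{bmatrix}$$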
Points: A: (0.5, 0.5), B: (0, 1), C: (1.5, 1.5)
Mahal(A, B) = 5
Mahal(A, C) = 4
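The two values can be checked numerically. The sketch below follows the formula as written on the slide, without a square root, and inverts the 2x2 covariance matrix by hand; the helper names are illustrative only.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis(p, q, sigma):
    """(p - q) * inverse(sigma) * (p - q)^T for two-dimensional points."""
    inv = inverse_2x2(sigma)
    dx, dy = p[0] - q[0], p[1] - q[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

sigma = [[0.3, 0.2], [0.2, 0.3]]
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)

mahal_ab = mahalanobis(A, B, sigma)
mahal_ac = mahalanobis(A, C, sigma)
print(round(mahal_ab, 6), round(mahal_ac, 6))
```

Although C is farther from A than B in Euclidean terms, it lies along the direction in which the data vary most, so its Mahalanobis distance from A is smaller.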
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.315
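The same calculation can be written as a short function that combines the dot product and the two vector lengths into the cosine similarity. This is a minimal sketch using the vectors d1 and d2 from the example; the function name is illustrative.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

similarity = cosine_similarity(d1, d2)
print(round(similarity, 4))
```

Because term counts are never negative, the cosine similarity of two document vectors always falls between 0 and 1.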
[Figure: scatter plots showing the similarity from –1 to 1]