Module II
DATA PREPROCESSING
- Real-world databases are highly susceptible to noisy, missing and inconsistent data because of their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources.
- Low-quality data will lead to low-quality mining results.
- Data needs to be preprocessed to help improve the quality of the data and, consequently, of the mining results.
- There are several preprocessing techniques:
1. Data Cleaning: can be applied to remove noise and correct inconsistencies in data.
2. Data Integration: merges data from multiple sources into a coherent data store such as a data warehouse.
3. Data Reduction: can reduce the data size by aggregating, eliminating redundant features, or clustering.
4. Data Transformations (e.g. normalization) may be applied, where data are scaled to fall within a smaller range like 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measurements.
These techniques are not mutually exclusive; they may work together. E.g. data cleaning may involve transformations to correct wrong data, such as transforming all entries for a date field to a common format.
Data has quality if it satisfies the requirements of the intended use. The three elements of data quality are
(i) Accuracy
(ii) Completeness
(iii) Consistency
Inaccurate, incomplete and inconsistent data are commonplace properties of large real-world databases and data warehouses.
There are many possible reasons for inaccurate data (having incorrect attribute values):
- The data collection instruments used may be faulty.
- There may have been human or computer errors occurring at data entry.
- Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information, e.g. by choosing the default value "January 1" displayed for birthday. This is known as disguised missing data.
- Errors in data transmission can also occur. There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption.
- Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g. date).
- Duplicate tuples also require data cleaning.
Major Tasks in Data Preprocessing:
- The major steps involved in data preprocessing are
1) Data Cleaning
2) Data Integration
3) Data Reduction
4) Data Transformation
1) Data Cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
2) Data Integration: integration of multiple databases, data cubes or files.
3) Data Reduction: a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
   (i) Dimensionality Reduction
   (ii) Numerosity Reduction
   (iii) Data Compression
4) Data Transformation:
   (i) Normalization
   (ii) Concept hierarchy generation
[Figure: Steps in Data Preprocessing. The data transformation panel shows -2, 32, 100, 59, 48 normalized to -0.02, 0.32, 1.00, 0.59, 0.48.]
I. Data Cleaning
(i) Missing Values
(ii) Noisy Data
(iii) Inconsistent Data
(i) Missing Values
- Many tuples have no recorded value for some attributes.
1. Ignore the tuple: This is usually done when the class label is missing (if the task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
2. Fill in the missing values manually: This is time consuming and may not be feasible for a large data set with many missing values.
3. Use a global constant to fill in the missing values: Replace all missing attribute values by the same constant, such as a label like "Unknown" or -∞. If missing values are replaced by a constant, the mining program may mistakenly think that they form an interesting concept. Although simple, this method is not recommended.
4. Use the attribute mean to fill in the missing value: Replace missing values with the mean of the values of that attribute.
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
Methods 3 to 6 bias the data, since the filled-in value may not be correct. Method 6 is a popular strategy; it uses the most information from the present data to predict missing values.
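Methods 4 and 5 above can be sketched in a few lines of Python. This is only an illustrative sketch: the record layout (a list of dictionaries with `None` marking a missing value), the attribute names `income` and `risk`, and the toy figures are assumptions, not part of the notes.

```python
from statistics import mean

def fill_with_attribute_mean(rows, attr):
    """Method 4: replace None with the mean of the observed values of attr."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    fill = mean(observed)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

def fill_with_class_mean(rows, attr, class_attr):
    """Method 5: replace None with the mean of attr within the tuple's own class.

    Assumes every class has at least one observed value of attr.
    """
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[class_attr], []).append(r[attr])
    class_means = {c: mean(vals) for c, vals in by_class.items()}
    return [
        {**r, attr: class_means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]

# Hypothetical toy data: income is missing for two customers.
customers = [
    {"income": 30000, "risk": "low"},
    {"income": 50000, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 90000, "risk": "high"},
    {"income": None, "risk": "high"},
]

by_attribute = fill_with_attribute_mean(customers, "income")
by_class = fill_with_class_mean(customers, "income", "risk")
```

With the global mean, both gaps receive the same value; with the class mean, the "low" tuple is filled with 40000 and the "high" tuple with 90000, which shows why method 5 usually distorts the data less.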
(ii) Noisy Data
- Noise is a random error or variance in a measured variable.
- Data smoothing techniques smooth out the data to remove noise.
a) Binning:
- Binning methods smooth a sorted data value by consulting its "neighborhood", i.e. the values around it.
- The sorted values are distributed into a number of "buckets" or bins.
- Since binning methods consult the neighborhood of values, they perform local smoothing.
- Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
- Smoothing by bin medians: each value in a bin is replaced by the bin median.
- Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is then replaced by the closest boundary value.
- In general, the larger the width, the greater the effect of smoothing.
E.g. Sorted data for price (in dollars):
4, 8, 15, 21, 21, 24, 25, 28, 34
Bin size = 3
Partition into equal-frequency (equal-depth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
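The price example can be reproduced with a short Python sketch. The function names are illustrative, and sending a value that is equally close to both boundaries to the lower one is an assumption of this sketch; the notes do not say how ties are broken.

```python
def equal_depth_bins(sorted_values, depth):
    """Split already-sorted values into consecutive bins of `depth` values each."""
    return [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the min and max
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, 3)
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```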
E.g. Suppose that the data for analysis include the attribute age. The age values for the data are, in increasing order:
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70
Use smoothing by bin means to smooth the above data, using a bin depth of 3.
Step 1: The data is already sorted.
Step 2: Partition the data into equal-depth bins of depth 3.
Bin 1: 13, 15, 16    Bin 6: 33, 33, 35
Bin 2: 16, 19, 20    Bin 7: 35, 35, 35
Bin 3: 20, 21, 22    Bin 8: 36, 40, 45
Bin 4: 22, 25, 25    Bin 9: 46, 52, 70
Bin 5: 25, 25, 30
Step 3: Calculate the arithmetic mean of each bin.
Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin (means truncated to whole numbers; exact means in brackets).
Bin 1: 14, 14, 14 (14.67)    Bin 6: 33, 33, 33 (33.67)
Bin 2: 18, 18, 18 (18.33)    Bin 7: 35, 35, 35 (35)
Bin 3: 21, 21, 21 (21)       Bin 8: 40, 40, 40 (40.33)
Bin 4: 24, 24, 24 (24)       Bin 9: 56, 56, 56 (56)
Bin 5: 26, 26, 26 (26.67)
b) Clustering:
- Outliers may be detected by clustering, where similar values are organized into groups or "clusters".
- Intuitively, values that fall outside of the set of clusters may be considered outliers.
[Figure: clusters of data points, with outliers falling outside the clusters]
II. Data Integration
a) Schema integration and object matching
- Schema integration and object matching can be tricky.
- Entity identification problem: how can equivalent real-world entities from multiple data sources be matched up?
- E.g. how can the data analyst be sure that customer_id in one DB and cust_number in another refer to the same attribute?
- Metadata can be used to help avoid errors in schema integration.
- Metadata for each attribute include the name, meaning, data type, and range of values permitted for the attribute, and null rules for handling blank, zero or null values.
b) Redundancy: another important issue in data integration
- An attribute may be redundant if it can be derived from another attribute.
- Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
- Such redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.
- For numeric attributes, we can evaluate the correlation between two attributes A and B by computing the correlation coefficient (Pearson's product moment coefficient):
$$ r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \, \sigma_A \sigma_B} $$
  n : number of tuples
  $a_i$, $b_i$ : respective values of A and B in tuple i
  $\bar{A}$, $\bar{B}$ : respective mean values of A and B
  $\sigma_A$, $\sigma_B$ : respective standard deviations of A and B
- If $r_{A,B}$ is greater than 0, A and B are positively correlated: the values of A increase as the values of B increase. The higher the value, the stronger the correlation, and a high value indicates that A (or B) is a redundancy and may be removed.
- If $r_{A,B} = 0$, there is no linear correlation between A and B.
- If $r_{A,B} < 0$, then A and B are negatively correlated, where the values of one attribute increase as the values of the other attribute decrease. This means that each attribute discourages the other.
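As a rough illustration of the correlation test for redundancy, the sketch below computes Pearson's coefficient with population standard deviations (dividing by n), as in the formula. The two attributes (`height_cm`, `height_in`) and their values are hypothetical; the sketch assumes neither attribute is constant, since a zero standard deviation would make the coefficient undefined.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson's product moment coefficient of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Population standard deviations (divide by n), matching the formula.
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / (n * sd_a * sd_b)

# Hypothetical attributes from two sources: the same height in different units.
height_cm = [150, 160, 170, 180, 190]
height_in = [h / 2.54 for h in height_cm]

r = pearson_r(height_cm, height_in)
print(round(r, 6))  # 1.0, so one of the two attributes is redundant
```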
c) Detection and Resolution of data value conflicts
- For the same real-world entity, attribute values from different sources may differ.
- This may be due to differences in representation, scaling or encoding.
- E.g. a weight attribute may be stored in metric units in one system and British imperial units in another.
- Prices in different cities may involve not only different currencies but also different services and taxes.
III. Data Transformation
Data are transformed or consolidated into forms appropriate for mining. It involves the following:
a) Smoothing: works to remove noise from data.
- binning, regression, clustering
b) Aggregation: summary or aggregation operations are applied to data.
- E.g. daily sales data may be aggregated so as to compute monthly and annual total amounts.
c) Generalization: low-level or primitive (raw) data are replaced by higher-level concepts through the use of concept hierarchies.
- E.g. categorical attributes like street can be generalized to higher-level concepts like city or country.
d) Attribute Construction (feature construction): new attributes are constructed and added from the given set of attributes to help the mining process.
- E.g. we may wish to add the attribute "area" based on the attributes height and width.
e) Normalization: the attribute data are scaled so as to fall within a smaller range like -1.0 to 1.0 or 0.0 to 1.0.
- Normalization attempts to give all attributes an equal weight.
- Useful for applications like classification algorithms involving neural networks, or distance measurements such as nearest neighbor classification and clustering.
Methods for Normalization
- Min-Max Normalization
- Z-Score Normalization
- Normalization by decimal scaling
a) Min-Max Normalization
- Performs a linear transformation on the original data.
- Suppose that $min_A$ and $max_A$ are the minimum and maximum values of an attribute A.
- Min-max normalization maps a value v of A to v' in the range $[new\_min_A, new\_max_A]$ by computing
$$ v' = \frac{v - min_A}{max_A - min_A} (new\_max_A - new\_min_A) + new\_min_A $$
- This normalization preserves the relationships among the original data values.
- It will encounter an "out-of-bounds" error if a future input case for normalization falls outside the original data range for A.
E.g. Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000 respectively. Map a value of $73,600 to a value in the range [0.0, 1.0].
By min-max normalization,
v' = (73600 - 12000) / (98000 - 12000) × (1.0 - 0) + 0
   = 0.716
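The income example can be checked with a minimal Python sketch of the min-max formula. Raising `ValueError` for an out-of-range input is an illustrative choice for signalling the "out-of-bounds" case, not something the notes prescribe.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    if not min_a <= v <= max_a:
        # The "out-of-bounds" case: v lies outside the original data range.
        raise ValueError("value falls outside the original data range")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

income_scaled = min_max_normalize(73600, 12000, 98000)
print(round(income_scaled, 3))  # 0.716
```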
b) Z-score normalization (zero-mean normalization)
- The values for an attribute A are normalized based on the mean and standard deviation of A:
$$ v' = \frac{v - \bar{A}}{\sigma_A} $$
  where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation of attribute A.
- This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
E.g. Suppose that the mean and standard deviation of the values for income are $54,000 and $16,000 respectively. Using z-score normalization, the value $73,600 is transformed to
(73600 - 54000) / 16000 = 1.225
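A small Python sketch of z-score normalization follows. The helper `z_scores` estimates the mean and the population standard deviation (dividing by n) from the data itself; that divisor is an assumption of this sketch, since the notes do not say whether n or n - 1 is intended.

```python
from math import sqrt

def z_score(v, mean_a, std_a):
    """Normalize one value given the attribute's mean and standard deviation."""
    return (v - mean_a) / std_a

def z_scores(values):
    """Normalize a whole attribute, estimating mean and population std from it."""
    n = len(values)
    mean_a = sum(values) / n
    std_a = sqrt(sum((v - mean_a) ** 2 for v in values) / n)
    return [z_score(v, mean_a, std_a) for v in values]

print(z_score(73600, 54000, 16000))  # 1.225
```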
c) Normalization by decimal scaling
- Normalizes by moving the decimal point of the values of attribute A.
- The number of decimal points moved depends on the maximum absolute value of A.
- A value v of attribute A is normalized to v' by computing
$$ v' = \frac{v}{10^j} $$
  where j is the smallest integer such that $\max(|v'|) < 1$.
E.g. Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1000 (i.e. j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
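The search for the smallest suitable j can be written as a short loop. This is a sketch with an illustrative function name; it simply raises j until the largest absolute value drops below 1.

```python
def decimal_scaling(values):
    """Return (j, scaled values) where j is the smallest integer with max|v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]

j, scaled = decimal_scaling([-986, 917])
print(j, scaled)  # 3 [-0.986, 0.917]
```

Note that a maximum absolute value of exactly 1000 needs j = 4, because 1000 / 10^3 = 1 is not strictly less than 1.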
Q. Use the following methods to normalize the given set of data:
200, 300, 400, 600, 1000
1) Min-max normalization by setting min = 0 and max = 1
2) Z-score normalization
3) Z-score normalization using the mean absolute deviation instead of the standard deviation
4) Normalization by decimal scaling

1) Min-max normalization
200' = (200 - 200) / (1000 - 200) × (1 - 0) + 0 = 0
300' = (300 - 200) / 800 × (1 - 0) + 0 = 0.125
400' = (400 - 200) / 800 × (1 - 0) + 0 = 0.25
600' = (600 - 200) / 800 × (1 - 0) + 0 = 0.5
1000' = (1000 - 200) / 800 × (1 - 0) + 0 = 1.0
The values after min-max normalization are (0, 0.125, 0.25, 0.5, 1.0).

2) Z-score normalization
Mean = Σx / n = (200 + 300 + 400 + 600 + 1000) / 5 = 2500 / 5 = 500
Standard deviation = sqrt( [ (200 - 500)^2 + (300 - 500)^2 + (400 - 500)^2 + (600 - 500)^2 + (1000 - 500)^2 ] / 5 )
                   = sqrt(400000 / 5) = sqrt(80000) = 282.8
200' = (200 - 500) / 282.8 = -1.06
300' = (300 - 500) / 282.8 = -0.707
400' = (400 - 500) / 282.8 = -0.35
600' = (600 - 500) / 282.8 = 0.35
1000' = (1000 - 500) / 282.8 = 1.77
The values after z-score normalization are (-1.06, -0.707, -0.35, 0.35, 1.77).

3) Z-score normalization using the mean absolute deviation
Mean absolute deviation = ( |200 - 500| + |300 - 500| + |400 - 500| + |600 - 500| + |1000 - 500| ) / 5
                        = 1200 / 5 = 240
200' = (200 - 500) / 240 = -1.25
300' = (300 - 500) / 240 = -0.833
400' = (400 - 500) / 240 = -0.417
600' = (600 - 500) / 240 = 0.417
1000' = (1000 - 500) / 240 = 2.08
The values after normalization are (-1.25, -0.833, -0.417, 0.417, 2.08).

4) Normalization by decimal scaling
The maximum absolute value is 1000. The smallest integer j such that max(|v'|) < 1 is j = 4, since 1000 / 10^3 = 1 is not less than 1. Dividing each value by 10^4 gives (0.02, 0.03, 0.04, 0.06, 0.1).
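Part 3 of the exercise is the least standard of the four methods, so a quick check in Python may help. The function name is illustrative; the computation follows the worked solution (mean absolute deviation in place of the standard deviation).

```python
def z_scores_mad(values):
    """Z-score variant that divides by the mean absolute deviation."""
    n = len(values)
    mean_a = sum(values) / n
    mad = sum(abs(v - mean_a) for v in values) / n
    return [(v - mean_a) / mad for v in values]

result = [round(z, 3) for z in z_scores_mad([200, 300, 400, 600, 1000])]
print(result)  # [-1.25, -0.833, -0.417, 0.417, 2.083]
```

Because absolute deviations are not squared, the outlying value 1000 inflates the divisor less than it inflates the standard deviation (240 versus about 282.8), so this variant is somewhat more robust to outliers.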
Data Cube Aggregation
- E.g. suppose the data consist of sales per quarter for the years 2008 to 2010.
- If we are interested in the annual sales (total sales per year) rather than the total per quarter, the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
- The resulting data set is smaller in volume, without loss of the information necessary for the analysis task.
[Figure: quarterly sales for the years 2008 to 2010 aggregated into annual sales]
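The quarterly-to-annual roll-up can be sketched as a simple group-and-sum. The sales figures below are hypothetical placeholders chosen for easy arithmetic; they are not taken from the notes.

```python
from collections import defaultdict

# Hypothetical quarterly sales, keyed by (year, quarter).
quarterly_sales = {
    (2008, "Q1"): 10, (2008, "Q2"): 20, (2008, "Q3"): 30, (2008, "Q4"): 40,
    (2009, "Q1"): 15, (2009, "Q2"): 25, (2009, "Q3"): 35, (2009, "Q4"): 45,
    (2010, "Q1"): 20, (2010, "Q2"): 30, (2010, "Q3"): 40, (2010, "Q4"): 50,
}

def aggregate_by_year(quarterly):
    """Sum the quarterly amounts of each year into one annual total."""
    annual = defaultdict(int)
    for (year, _quarter), amount in quarterly.items():
        annual[year] += amount
    return dict(annual)

annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)  # {2008: 100, 2009: 120, 2010: 140}
```

Twelve stored values shrink to three, yet an analysis that only needs annual totals loses nothing.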
- Data cubes store multidimensional, aggregated information.
- The following figure shows a data cube for multidimensional analysis of sales data with respect to annual sales per item type for each AllElectronics branch.
[Figure: data cube with dimensions branch (A, B, C, D), item type (home entertainment 568, computer 750, phone 150, security 50) and year (2002, 2003, 2004)]
- Each cell holds an aggregate data value, corresponding to the data point in multidimensional space.
- Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple abstraction levels.
- Each higher abstraction level further reduces the resulting data size.
- When replying to data mining requests, the smallest available cuboid relevant to the given task should be used.
Dimensionality Reduction
- Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant, which can slow down the mining process.
- Dimensionality reduction reduces the data set size by removing such attributes or dimensions from it.
- Method of attribute subset selection: to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
- Mining on a reduced set of attributes has additional benefits: it reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
Attribute subset selection includes the following techniques:
1) Stepwise Forward Selection
- The procedure starts with an empty set of attributes as the reduced set.
- The best of the original attributes is determined and added to the reduced set.
- At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
2) Stepwise Backward Elimination
- The procedure starts with the full set of attributes.
- At each step, it removes the worst attribute remaining in the set.
3) Combination of Forward Selection and Backward Elimination
- The stepwise forward selection and backward elimination methods are combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.

Forward selection:
  Initial attribute set: {A1, A2, A3, A4, A5, A6}
  Initial reduced set: {}
  => {A1}
  => {A1, A4}
  => Reduced attribute set: {A1, A4, A6}
Backward elimination:
  Initial attribute set: {A1, A2, A3, A4, A5, A6}
  => {A1, A3, A4, A5, A6}
  => {A1, A4, A5, A6}
  => Reduced attribute set: {A1, A4, A6}
Decision tree induction:
  Initial attribute set: {A1, A2, A3, A4, A5, A6}
  => Reduced attribute set: {A1, A4, A6}
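Stepwise forward selection is a greedy loop, sketched below in Python. The notes do not fix how "best" is measured, so the `score` callback, the per-attribute usefulness numbers and the 0.05 cost per attribute are all hypothetical, chosen only so that the run mirrors the {A1, A4, A6} outcome in the figure.

```python
def forward_selection(attributes, score):
    """Greedy stepwise forward selection.

    `score` maps a list of attributes to a number (higher is better); it is
    supplied by the caller and is an assumption of this sketch.
    """
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining:
        # Pick the remaining attribute whose addition scores highest.
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # stopping criterion: no attribute improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected

# Hypothetical usefulness of each attribute, minus a small cost per attribute.
usefulness = {"A1": 0.5, "A2": 0.01, "A3": 0.02, "A4": 0.3, "A5": 0.03, "A6": 0.2}

def toy_score(subset):
    return sum(usefulness[a] for a in subset) - 0.05 * len(subset)

reduced = forward_selection(list(usefulness), toy_score)
print(reduced)  # ['A1', 'A4', 'A6']
```

Backward elimination would use the same loop in reverse: start from the full list and repeatedly drop the attribute whose removal scores highest.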
4) Decision Tree Induction
- Decision tree algorithms like ID3, C4.5 and CART were intended for classification.
- Decision tree induction constructs a flowchart-like structure where each internal (non-leaf) node represents a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction.
- At each node, the algorithm chooses the "best" attribute to partition the data into individual classes.
- When decision tree induction is used for attribute subset selection, a tree is constructed from the given data.
- All attributes that do not appear in the tree are assumed to be irrelevant.
- The set of attributes appearing in the tree forms the reduced subset of attributes.
The stopping criteria for the methods may vary. The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process.
Data Compression
- Transformations are applied so as to obtain a reduced or compressed representation of the original data.
- Two types:
  Lossless: the original data can be reconstructed from the compressed data without any information loss.
  Lossy: we can reconstruct only an approximation of the original data.
- Lossy data compression methods: wavelet transform, Principal Component Analysis.
Wavelet Transform
- The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector X' of wavelet coefficients.
- The two vectors are of the same length.
- When applying this technique to data reduction, consider each tuple as an n-dimensional data vector, i.e. X = (x1, x2, x3, ..., xn), depicting n measurements made on the tuple from n database attributes.
- Wavelet-transformed data can be truncated. A compressed approximation of the data can be retained by storing only a small fraction of the strongest of the wavelet coefficients.
- E.g. all wavelet coefficients larger than some user-specified threshold can be retained, and all other coefficients are set to 0.
- The resulting data representation is therefore sparse, so operations that can take advantage of data sparsity are computationally very fast if performed in wavelet space.
- This technique works to remove noise without smoothing out the main features of the data, making it effective for data cleaning.
- Given a set of coefficients, an approximation of the original data can be constructed by applying the inverse of the DWT used.
- The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving sines and cosines.
- The DWT achieves better lossy compression and provides a more accurate approximation of the original data for the same number of coefficients.
- The DWT requires less space than the DFT.
[Figure: examples of wavelet families: (a) Haar-2, (b) Daubechies-4]
- Popular wavelet transforms include Haar-2, Daubechies-4, Daubechies-6, etc.
- The general procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed.
Hierarchical Pyramid Algorithm
1. The length L of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary.
2. Each transform involves two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference.
3. The two functions are applied to pairs of data points in X, which results in two data sets of length L/2.
4. The two functions are recursively applied to the data sets obtained in the previous loop, until the resulting data sets obtained are of length 2.
5. Selected values from the data sets obtained in the previous iterations are designated the wavelet coefficients of the transformed data.
- Equivalently, a matrix multiplication can be applied to the input data to obtain the wavelet coefficients, where the matrix used depends on the given DWT.
- Wavelet transformations can be applied to multidimensional data such as a data cube. This is done by first applying the transform to the first dimension, then to the second, and so on.
Applications of wavelet transforms:
- Compression of fingerprint images
- Computer vision
- Analysis of time series data
- Data cleaning
Example (each step replaces pairs of values by their average and half their difference):
Forward transform:
56  40   8  24  48  48  40  16
48  16  48  28 |  8  -8   0  12
32  38 | 16  10 |  8  -8   0  12
35 | -3 | 16  10 |  8  -8   0  12
- Inverse transform: we start from the bottom row. We add and subtract the difference to the mean, and repeat the process up to the first row.
35  -3  16  10   8  -8   0  12
32  38  16  10   8  -8   0  12
48  16  48  28   8  -8   0  12
56  40   8  24  48  48  40  16
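The forward and inverse tables can be reproduced with a short Python sketch of the pyramid procedure. It uses the unnormalized averaging form seen in the example (average and half-difference of each pair), which is an assumption about the variant intended; library implementations of the Haar transform typically scale by 1/sqrt(2) instead. The input length is assumed to be a power of 2.

```python
def haar_forward(data):
    """Pyramid transform: pairwise averages, with half-differences as details."""
    coeffs = []
    current = list(data)
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        averages = [(a + b) / 2 for a, b in pairs]
        details = [(a - b) / 2 for a, b in pairs]
        coeffs = details + coeffs  # coarser details go to the left
        current = averages         # the data is halved at each iteration
    return current + coeffs

def haar_inverse(coeffs):
    """Rebuild the data: average + detail and average - detail, level by level."""
    current = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(current)]
        pos += len(current)
        current = [x for avg, d in zip(current, details) for x in (avg + d, avg - d)]
    return current

data = [56, 40, 8, 24, 48, 48, 40, 16]
transformed = haar_forward(data)
print(transformed)                # [35.0, -3.0, 16.0, 10.0, 8.0, -8.0, 0.0, 12.0]
print(haar_inverse(transformed))  # [56.0, 40.0, 8.0, 24.0, 48.0, 48.0, 40.0, 16.0]
```

Setting the small coefficients (here -3 and 0) to zero before inverting would give the lossy, compressed approximation described above.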
Principal Component Analysis: refer to your Machine Learning notes.
Numerosity Reduction
- Reduce the data volume by choosing alternative, smaller forms of data representation.
- E.g. histograms, clustering, sampling.
Regression and Log-Linear Models
1) Linear Regression: $Y = wX + b$
- The two regression coefficients, w and b, specify the line and are to be estimated using the data at hand.
- They are estimated by applying the least squares criterion to the known values of Y1, Y2, ... and X1, X2, ...
2) Multiple Regression: $Y = b_0 + b_1 X_1 + b_2 X_2$
- Many nonlinear functions can be transformed into the above.
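For the single-predictor case, the least squares estimates have a closed form, sketched below in Python. The data points are hypothetical and lie exactly on Y = 2X + 1, so the fit recovers w = 2 and b = 1; only these two coefficients need to be stored in place of the data.

```python
def fit_line(xs, ys):
    """Least squares estimates of w and b in Y = wX + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - w * mean_x
    return w, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # hypothetical data on the line Y = 2X + 1
w, b = fit_line(xs, ys)
print(w, b)  # 2.0 1.0
```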
Log-Linear Models:
- Approximate discrete multidimensional probability distributions.
- Estimate the probability of each point (tuple) in a multidimensional space.
1. Histograms
* Use binning to approximate data distributions
* Popular form of data reduction
* A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets
* If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets
Histogram Analysis - Example
* Price data:
  1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 12, 14, 14, 15, 15, 15, 15, 15, 15, 18,
  18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
  25, 25, 25, 25, 25, 28, 28, 30, 30, 30
[Figure: histogram for price using singleton buckets]
[Figure: equal-width histogram for price with bucket size $10]
Histogram Analysis
Partitioning rules:
- Equal-width: the width of each bucket range is uniform
- Equal-frequency (equal-depth): the frequency of each bucket is constant
- MaxDiff
  Consider the difference between each pair of adjacent values
- V-optimal
  The histogram with the least variance
  Histogram variance is a weighted sum of the original values that each bucket represents, where bucket weights are equal to the number of values in the bucket
2. Clustering
* Diameter
  - the maximum distance between any two objects in the cluster
* Centroid distance
  - alternative measure of cluster quality
  - average distance of each cluster object from the cluster centroid
* Can have hierarchical clustering and be stored in multi-dimensional index tree structures
3. Sampling
* Data reduction technique. Allows a large data set to be represented by a much smaller random sample of the data.
* Sampling: obtaining a small sample s to represent the whole data set N
* Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
* Key principle: choose a representative subset of the data
  - Simple random sampling may have very poor performance in the presence of skew
  - Develop adaptive sampling methods, e.g., stratified sampling
Types of Sampling
* Simple random sampling
  - There is an equal probability of selecting any particular item
* Sampling without replacement
  - Once an object is selected, it is removed from the population
* Sampling with replacement
  - A selected object is not removed from the population
* Stratified sampling
  - Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
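The three sampling schemes can be sketched with Python's standard `random` module. The function names, the skewed toy population (90 "young" and 10 "senior" customers) and the rule of drawing at least one item per stratum are illustrative assumptions.

```python
import random

def sample_without_replacement(data, s, rng):
    """Each selected object is removed from the population, so no repeats."""
    return rng.sample(data, s)

def sample_with_replacement(data, s, rng):
    """A selected object stays in the population and may be drawn again."""
    return [rng.choice(data) for _ in range(s)]

def stratified_sample(data, key, fraction, rng):
    """Draw about the same fraction from every partition (stratum) of the data."""
    strata = {}
    for item in data:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical skewed population of (customer id, age group) pairs.
population = [(i, "young") for i in range(90)] + [(i, "senior") for i in range(90, 100)]

wor = sample_without_replacement(population, 10, rng)
wr = sample_with_replacement(population, 10, rng)
strat = stratified_sample(population, key=lambda c: c[1], fraction=0.1, rng=rng)
```

The stratified sample always contains nine "young" and one "senior" customer, whereas a simple random sample of ten may miss the small "senior" group entirely, which is the skew problem noted above.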
Q. Given the following data: 0, 4, 12, 16, 16, 18, 24, 26, 28
a) Construct a 3-bucket equal-width histogram.
b) Construct a 3-bucket equal-frequency histogram.
* Data: 0, 4, 12, 16, 16, 18, 24, 26, 28
* Equal width
  - Bin1: 0, 4 [-, 10)
  - Bin2: 12, 16, 16, 18 [10, 20)
  - Bin3: 24, 26, 28 [20, +)
* Equal frequency
  - Bin1: 0, 4, 12 [-, 14)
  - Bin2: 16, 16, 18 [14, 21)
  - Bin3: 24, 26, 28 [21, +)
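Both partitionings can be checked with a small Python sketch. It derives the equal-width bucket width from the data as (max - min) / N, as an assumption; the resulting edges (about 9.33 and 18.67) differ from the rounded edges 10 and 20 used in the solution, but the values fall into the same buckets.

```python
def equal_width_bins(values, n_bins):
    """Assign sorted values to n_bins buckets of equal width (max - min) / n_bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum value would index one past the end, so clamp it.
        idx = 0 if width == 0 else min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return bins

def equal_frequency_bins(values, n_bins):
    """Split sorted values into n_bins buckets holding (nearly) equal counts."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)  # spread any remainder
        bins.append(ordered[start:end])
        start = end
    return bins

data = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equal_width_bins(data, 3))      # [[0, 4], [12, 16, 16, 18], [24, 26, 28]]
print(equal_frequency_bins(data, 3))  # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]
```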
[Figure: the equal-width and equal-frequency histograms for the data above]
Discretization
Three types of attributes
- Nominal: values from an unordered set, e.g., color, profession
- Ordinal: values from an ordered set, e.g., military or academic rank
- Numeric: real numbers, e.g., integer or real numbers
Discretization: divide the range of a continuous attribute into intervals
- Interval labels can then be used to replace actual data values
- Reduce data size by discretization
- Supervised vs. unsupervised
- Split (top-down) vs. merge (bottom-up)
- Discretization can be performed recursively on an attribute
- Prepare for further analysis, e.g., classification
Data Discretization Methods
Typical methods: all the methods can be applied recursively
- Binning
  * Top-down split, unsupervised
- Histogram analysis
  * Top-down split, unsupervised
- Clustering analysis (unsupervised, top-down split or bottom-up merge)
- Decision-tree analysis (supervised, top-down split)
- Correlation (e.g., χ²) analysis (supervised, since it uses class labels; bottom-up merge)
Simple Discretization: Binning
* Equal-width (distance) partitioning
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
  - The most straightforward, but outliers may dominate presentation
  - Skewed data is not handled well
* Equal-depth (frequency) partitioning
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to whole numbers):
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
Discretization by Classification & Correlation Analysis
* Classification (e.g., decision tree analysis)
  - Supervised: given class labels, e.g., cancerous vs. benign
  - Using entropy to determine the split point (discretization point)
  - Top-down, recursive split
* Correlation analysis (e.g., Chi-merge: χ²-based discretization)
  - Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ² values) to merge
  - Merge performed recursively, until a predefined stopping condition
Concept Hierarchy Generation
* A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse
* Concept hierarchies facilitate drilling and rolling in data warehouses to view data in multiple granularity
* Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as youth, adult, or senior)
* Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers
* Concept hierarchies can be automatically formed for both numeric and nominal data
Concept Hierarchy Generation for Nominal Data
* Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  - street < city < state < country
* Specification of a hierarchy for a set of values by explicit data grouping
  - {Urbana, Champaign, Chicago} < Illinois
* Specification of only a partial set of attributes
  - E.g., only street < city, not others
* Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
  - E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
* Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  - The attribute with the most distinct values is placed at the lowest level of the hierarchy
  - Exceptions, e.g., weekday, month, quarter, year
[Figure: generated hierarchy, from top to bottom: country; province_or_state (365 distinct values); city (3567 distinct values); street (674,339 distinct values)]
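The distinct-value heuristic amounts to sorting attributes by how many different values they take. The sketch below reuses the three counts that survive in the figure. The function names and the tiny `rows` table are hypothetical additions, included only to show how the counts themselves could be obtained.

```python
def count_distinct(rows):
    """Count the distinct values of every attribute in a list of dict records."""
    seen = {}
    for row in rows:
        for attr, value in row.items():
            seen.setdefault(attr, set()).add(value)
    return {attr: len(values) for attr, values in seen.items()}

def generate_hierarchy(distinct_counts):
    """Order attributes from the top level (fewest values) to the lowest (most)."""
    return sorted(distinct_counts, key=distinct_counts.get)

counts = {"street": 674_339, "city": 3567, "province_or_state": 365}
levels = generate_hierarchy(counts)
print(" < ".join(reversed(levels)))  # street < city < province_or_state

# Hypothetical records, to show the counts being derived from data.
rows = [
    {"street": "Main St", "city": "Urbana", "state": "Illinois"},
    {"street": "Green St", "city": "Urbana", "state": "Illinois"},
    {"street": "Lake St", "city": "Chicago", "state": "Illinois"},
]
print(generate_hierarchy(count_distinct(rows)))  # ['state', 'city', 'street']
```

The heuristic fails for the exceptions listed above: a time dimension has 7 weekdays but 12 months, so sorting by count would wrongly place weekday above month.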