Module 3
Data Wrangling
- Data wrangling is the process of converting and mapping raw data into a desired format so that it is ready for analysis.
- It involves transforming, cleaning, and organizing data, and moving or combining complex data sets to make them accessible and easier to analyse.
- Also known as data munging.
Uses
1. Makes raw data usable.
2. Puts data into the required format.
3. Combines data from various sources so the business can understand the context of the data set.
4. Automated integration tools prepare the data elements for analysis.
5. Cleans data of noise and missing elements.
6. Helps business users take timely decisions.
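Basics of Spam Filtering
- Drawing inferences from an email to decide whether it is spam:
  . References (including misspellings) containing "Viagra".
  . Unusual punctuation in the mail.
  . Any exclamation marks.
  . Content of the subject line.
- Learn from examples to make the prediction.
Suggestions for a spam model
1. Try a probabilistic model.
2. k-NN, linear regression.
3. Check whether linear regression and k-NN are suitable for learning spam.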
Why Linear Regression Will Not Work for Spam Filtering
(write about linear regression)
1. Treat the dataset as a matrix, where each row corresponds to an email.
2. Each column corresponds to a different word.
3. Consider the column for the word "Viagra": if the email contains the word "Viagra", that column is filled with the value 1, else assign 0.
4. Alternatively, one can put the number of times the word appears in the email.
5. For linear regression we need a training set: the emails have to be labeled with the outcome variable, i.e. spam or not.
6. A human does the spam labeling task.
7. Build a linear regression model on the labels to predict the label of a given email.
8. An email is labeled 1 (for spam) or 0 (for not spam), so the task is binary.
9. In linear regression the outcome is a number and is continuous.
10. Choose a critical value: if the predicted value is above it, the output is 1; if it is below, the output is 0.
11. It does not work because there are too many variables.
12. With 1,00,000 words and 10,000 emails, the matrix is not invertible (it cannot be inverted).
13. We could keep only the top words, but that is still too many variables.
14. Linear regression is not appropriate for a binary outcome.
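A minimal sketch in Python of the email-by-word matrix and the critical-value step described above. The emails, vocabulary, cutoff of 0.5, and the function names build_matrix and threshold are made up for illustration; they are not from the notes.

```python
def build_matrix(emails, vocabulary):
    """One row per email; entry is 1 if the word occurs in the email, else 0."""
    rows = []
    for text in emails:
        words = set(text.lower().split())
        rows.append([1 if w in words else 0 for w in vocabulary])
    return rows


def threshold(score, cutoff=0.5):
    """Turn a continuous regression output into a binary label (assumed cutoff)."""
    return 1 if score > cutoff else 0


# Toy data (assumed): 3 emails, 5 words.
emails = ["cheap viagra now", "meeting at noon", "viagra offer"]
labels = [1, 0, 1]  # 1 = spam, 0 = not spam, labeled by a human
vocabulary = ["viagra", "meeting", "cheap", "offer", "noon"]

X = build_matrix(emails, vocabulary)
n_rows, n_cols = len(X), len(X[0])
print(X)
print("emails:", n_rows, "words:", n_cols)
```

Even in this toy case there are more columns (words) than rows (emails), which is the same situation as 1,00,000 words against 10,000 emails: the matrix that linear regression needs to invert is singular.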
Why k-NN Will Not Work for Spam Filtering
(write about k-NN)
1. Emails are represented as a matrix, with one row per email.
2. Words are the columns.
3. Matrix entries are either 0 or 1, depending on the presence of the word in the email.
4. For k-NN, two emails are near each other based on the words they both contain.
5. Here there are 1,00,000 words, which means a 1,00,000-dimensional space: too many dimensions.
6. Computing distances in such a space takes a lot of computational work, and k-NN suffers from the curse of dimensionality. This makes k-NN a poor algorithm for this problem.
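A small k-NN sketch on binary word vectors, to make the cost visible: every prediction compares the query against every training email, one dimension at a time. The Hamming distance, the toy vectors, and the names hamming and knn_predict are assumptions for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions (words) where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k=3):
    """Majority label among the k training rows closest to the query."""
    ranked = sorted(range(len(train_X)), key=lambda i: hamming(train_X[i], query))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (assumed): 5 emails over a 4-word vocabulary, 1 = spam.
train_X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 1, 0]]
train_y = [1, 1, 0, 0, 1]

prediction = knn_predict(train_X, train_y, [1, 1, 0, 1], k=3)
print(prediction)
```

With 4 words this is instant. With 1,00,000 words each distance needs 1,00,000 comparisons, repeated for every stored email.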
Digit Recognition
- Each digit is drawn in a 16x16 pixel grid.
- Unwrap the 16x16 grid into a 256-dimensional space (vectorize).
- Apply k-NN, tune k, and check accuracy with a confusion matrix.
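The unwrapping step can be sketched as follows. The blank grid with one lit pixel is made-up data, and flatten is an illustrative name.

```python
def flatten(grid):
    """Unwrap a 2-D pixel grid row by row into a single feature vector."""
    return [pixel for row in grid for pixel in row]


# Assumed toy image: a blank 16x16 grid with one pixel switched on.
grid = [[0] * 16 for _ in range(16)]
grid[3][5] = 1

vector = flatten(grid)
print(len(vector))
```

The pixel at row 3, column 5 lands at index 3 * 16 + 5 in the vector, so each of the 256 positions is one dimension for k-NN.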
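Naive Bayes
- A classification method based on Bayes' law.
Example
- A rare disease infects 1% of the population.
- Test: 99% of sick patients test positive; 99% of healthy patients test negative.
- Given that a patient tests positive, what is the probability that the patient is actually sick?
Population of 10,000 people:
- 100 people are sick, 9,900 people are healthy.
- Sick: 99 test positive, 1 tests negative.
- Healthy: 9,801 test negative, 99 test positive.
- Positive tests: 99 sick + 99 healthy. Here the chance that a person who tests positive is sick is 99/198 = 50%.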
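Bayes' Law
Let x, y be events with probabilities $P(x)$, $P(y)$.
$P(x, y)$ is the joint probability, i.e. the probability that both happen.
$P(y|x)$ is the conditional probability that y happens given that x has happened.
$P(x, y) = P(y|x)\,P(x) = P(x|y)\,P(y)$
Solve for $P(y|x)$, assuming $P(x) \neq 0$:
$P(y|x) = \frac{P(x|y)\,P(y)}{P(x)}$
Let y be the event "I am sick" and x the event "the test is positive" (+):
$P(\text{sick}|+) = \frac{P(+|\text{sick})\,P(\text{sick})}{P(+)}$
$= \frac{0.99 \times 0.01}{(0.99 \times 0.01) + (0.01 \times 0.99)}$
$= 0.50 = 50\%$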
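The same calculation as a short Python check, using the numbers from the example (1% sick, 99% of sick test positive, 1% of healthy test positive). The function name posterior is illustrative.

```python
def posterior(prior, p_pos_given_sick, p_pos_given_healthy):
    """Bayes' law: P(sick | +) from the prior and the two test rates."""
    evidence = p_pos_given_sick * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_sick * prior / evidence


p_sick_given_pos = posterior(prior=0.01, p_pos_given_sick=0.99, p_pos_given_healthy=0.01)
print(round(p_sick_given_pos, 2))
```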
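Naive Bayes: A Spam Filter for Individual Words
- Consider only one word at a time and decide whether it indicates spam or non-spam (ham).
- $P(\text{spam})$: probability that an email is spam.
- $P(\text{ham}) = 1 - P(\text{spam})$: probability that an email is ham.
- $P(\text{word}|\text{spam})$: probability that the word appears in a spam email.
- $P(\text{word}|\text{ham})$: probability that the word appears in a ham email.
Apply Bayes' law:
$P(\text{spam}|\text{word}) = \frac{P(\text{word}|\text{spam})\,P(\text{spam})}{P(\text{word})}$
$P(\text{word}) = P(\text{word}|\text{spam})\,P(\text{spam}) + P(\text{word}|\text{ham})\,P(\text{ham})$
$P(\text{spam}) = \frac{\text{No. of spam emails}}{\text{Total no. of emails}}$
$P(\text{ham}) = \frac{\text{No. of non-spam emails}}{\text{Total no. of emails}}$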
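Example: employee emails with 1500 spam and 3672 ham.
The word "meeting" appears 16 times in spam and 153 times in ham.
$P(\text{spam}) = \frac{1500}{1500 + 3672} = 0.29$
$P(\text{ham}) = 1 - P(\text{spam}) = 1 - 0.29 = 0.71$
$P(\text{meeting}|\text{spam}) = \frac{16}{1500} = 0.0106$
$P(\text{meeting}|\text{ham}) = \frac{153}{3672} = 0.0416$
$P(\text{spam}|\text{meeting}) = \frac{P(\text{meeting}|\text{spam})\,P(\text{spam})}{P(\text{meeting})}$
$= \frac{0.0106 \times 0.29}{(0.0106 \times 0.29) + (0.0416 \times 0.71)} = 0.09$
where $P(\text{meeting}) = P(\text{meeting}|\text{spam})\,P(\text{spam}) + P(\text{meeting}|\text{ham})\,P(\text{ham})$.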
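A short Python check of the single-word filter, using the counts from the example (1500 spam, 3672 ham, "meeting" in 16 spam and 153 ham emails). The variable and function names are illustrative.

```python
def p_spam_given_word(n_spam, n_ham, word_in_spam, word_in_ham):
    """Single-word spam filter via Bayes' law, computed from raw counts."""
    p_spam = n_spam / (n_spam + n_ham)
    p_ham = 1 - p_spam
    p_word_spam = word_in_spam / n_spam
    p_word_ham = word_in_ham / n_ham
    p_word = p_word_spam * p_spam + p_word_ham * p_ham
    return p_word_spam * p_spam / p_word


p_meeting = p_spam_given_word(n_spam=1500, n_ham=3672, word_in_spam=16, word_in_ham=153)
print(round(p_meeting, 2))
```

An email containing "meeting" has only about a 9% chance of being spam, so the word is a strong indicator of ham.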
A Spam Filter for Combining Words
1. Each email is represented by a binary vector x.
2. $x_j$ is 1 or 0, depending on the appearance of the j-th word in the email.
3. Let x be the email vector and let c denote "is spam". $P(x|c)$ is the probability of the email vector given that the email is spam. Words are treated as independent:
   $P(x|c) = \prod_j \theta_{jc}^{x_j} (1 - \theta_{jc})^{1 - x_j}$
   where $\theta_{jc}$ is the probability that the individual j-th word is in a spam email.
4. Take the log on both sides to convert the product into a sum:
   $\log P(x|c) = \sum_j x_j \log \theta_{jc} + \sum_j (1 - x_j) \log(1 - \theta_{jc})$
   $= \sum_j x_j \log\frac{\theta_{jc}}{1 - \theta_{jc}} + \sum_j \log(1 - \theta_{jc})$
   $= \sum_j x_j w_j + w_0$
   where $w_j = \log\frac{\theta_{jc}}{1 - \theta_{jc}}$ and $w_0 = \sum_j \log(1 - \theta_{jc})$.
- The weights $w_j$ vary for each word and must be computed.
- Compute $P(x|c)$, then estimate $P(c|x)$ using Bayes' law.
- Train with a pre-labeled data set.
- It works and it is cheap.
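A minimal sketch checking that the weighted-sum form of the log-likelihood matches the direct product. The three theta values and the email vector are made-up numbers, and the function names are illustrative.

```python
import math


def likelihood_product(x, theta):
    """P(x|c) as a product over words: theta^x * (1 - theta)^(1 - x)."""
    result = 1.0
    for xj, t in zip(x, theta):
        result *= t ** xj * (1 - t) ** (1 - xj)
    return result


def log_likelihood_linear(x, theta):
    """log P(x|c) as sum_j x_j * w_j + w_0."""
    w = [math.log(t / (1 - t)) for t in theta]
    w0 = sum(math.log(1 - t) for t in theta)
    return sum(xj * wj for xj, wj in zip(x, w)) + w0


# Assumed toy values: per-word spam probabilities and one email vector.
theta = [0.2, 0.5, 0.9]
x = [1, 0, 1]

direct = likelihood_product(x, theta)
via_log = math.exp(log_likelihood_linear(x, theta))
print(direct, via_log)
```

Working in log space avoids multiplying many small probabilities together, which would underflow for a large vocabulary.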
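Laplace Smoothing
- $\theta_j$ is the probability of a given word in a spam email.
- Define $\theta_j$ as the ratio $n_{jc} / n_c$, where $n_{jc}$ is the number of times the word appears in a spam email and $n_c$ is the number of times the word appears in any email.
- Laplace smoothing refers to the idea of replacing $\theta_j$ with
  $\theta_j = \frac{n_{jc} + \alpha}{n_c + \beta}$, with $\alpha = 1$, $\beta = 10$, to prevent getting a probability of 0 or 1.
- Maximum likelihood (ML) estimate, where D is the data set:
  $\theta_{ML} = \arg\max_\theta P(D|\theta)$
- Naive Bayes way of choosing $\theta_j$ for each j: maximize
  $\log\left(\theta_j^{n_{jc}} (1 - \theta_j)^{n_c - n_{jc}}\right)$,
  take the derivative, set it to 0, and get $\theta_j = n_{jc} / n_c$.
- Maximum a posteriori (MAP) estimate:
  $\theta_{MAP} = \arg\max_\theta P(\theta|D)$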
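A small Python comparison of the plain ratio and the smoothed estimate, using alpha = 1 and beta = 10 as in the notes. The counts passed in are made up, and the function names are illustrative.

```python
def ml_estimate(n_jc, n_c):
    """Plain ratio of counts (maximum likelihood estimate)."""
    return n_jc / n_c


def smoothed_estimate(n_jc, n_c, alpha=1, beta=10):
    """Laplace-smoothed estimate: (n_jc + alpha) / (n_c + beta)."""
    return (n_jc + alpha) / (n_c + beta)


# Assumed counts: a word never seen in spam, and a word seen only in spam.
never_seen_ml = ml_estimate(0, 200)
never_seen_smooth = smoothed_estimate(0, 200)
always_seen_ml = ml_estimate(200, 200)
always_seen_smooth = smoothed_estimate(200, 200)
print(never_seen_ml, never_seen_smooth, always_seen_ml, always_seen_smooth)
```

The plain ratio gives exactly 0 or 1 in these cases, which would make log(theta) or log(1 - theta) undefined; the smoothed value stays strictly between 0 and 1.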
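Comparing Naive Bayes to k-NN
1. Hyperparameters: k-NN has only one hyperparameter (k). Naive Bayes has two hyperparameters, i.e. α and β.
2. Naive Bayes is a linear classifier. k-NN is a nonlinear classifier.
3. Dimensionality: the curse of dimensionality is a problem for k-NN, but not for Naive Bayes.
4. Naive Bayes requires training. k-NN requires no training.
5. Both are supervised learning (the data is labeled).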
Scraping the Web: APIs and Other Tools
- Data scientists need data to ask a question, solve a problem, or do research.
- Scraping the web deals with extracting data from websites.
- Different ways to get data: existing tools, or APIs with an API key.
Tools
1. Unix commands lynx and lynx --dump.
2. Beautiful Soup (robust but slow).
3. Mechanize (does not parse JavaScript).
4. PostScript (image classification).
APIs
- An API lets you download data in a specified format.
- An API key is provided to the developer (like a password).
- The developer registers to get access and a key; there may be limits on download size.
- APIs may or may not use a standard format. Data can be in JSON or another format.
- Yahoo's YQL can be used to extract data, for example:
  select * from flickr.photos.search where text="cat" and api_key="<your key>" limit 10
- When APIs are not available, use a browser extension such as Firebug, an extension of Firefox.
- Use "Inspect element" on any webpage; the HTML fields can be accessed and edited.
- After locating the stuff we need inside the HTML, use curl, wget, grep, awk, perl, etc. to write a script that fetches the data.
- The same can be done using Python or R.
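A minimal sketch of pulling data out of HTML with Python's standard library html.parser module, as one way to do the same job as the command-line tools. The page is an inline string standing in for a downloaded page, and the class name LinkCollector is illustrative.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Assumed sample page; in practice this text would come from a fetched URL.
page = (
    '<html><body><a href="/news/1">One</a>'
    '<p>text</p><a href="/news/2">Two</a></body></html>'
)

collector = LinkCollector()
collector.feed(page)
links = collector.links
print(links)
```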
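Image Recognition
- Task: check whether an image is a landscape or a headshot.
- Collect data: ask someone to label images, or use Flickr.
- Represent each image as RGB: for each colour, a number between 0 and 255.
- Draw 3 histograms (one per colour) and model landscape and headshot.
- Decide based on how much blue there is: landscape or headshot.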
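A small sketch of the three-histogram representation. The pixel list, the choice of 4 bins, and the name rgb_histograms are assumptions for illustration; a real image would supply many more pixels.

```python
def rgb_histograms(pixels, bins=4):
    """Count pixel intensities per channel into equal-width bins over 0..255."""
    width = 256 // bins
    hist = {"R": [0] * bins, "G": [0] * bins, "B": [0] * bins}
    for r, g, b in pixels:
        for channel, value in zip("RGB", (r, g, b)):
            hist[channel][min(value // width, bins - 1)] += 1
    return hist


# Assumed toy image: two sky-blue pixels and one warm pixel.
pixels = [(10, 20, 250), (30, 40, 240), (200, 180, 90)]
hist = rgb_histograms(pixels)
print(hist)
```

The three bin-count lists can be concatenated into one feature vector per image. A large count in the top blue bin is the kind of signal that might separate landscapes (sky) from headshots.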
Naive Bayes for Article Classification
- Use the New York Times developer API to fetch articles and classify their text.
- Multiclass: Arts, Business, Politics, Sports.
- Apply the Bernoulli model for word presence.
Steps
1. Register for an API key.
2. Download 2000 recent articles from each section into a separate file.
3. Save the article URL, article title, and body returned by the API for each article, in tab-delimited format.
4. Classification: let c be the set of categories, $c \in \{0, 1, 2, \dots\}$. X is a sparse binary matrix, with $X_{ji} = 1$ indicating that article i has word j.
5. Train by counting words and documents in each class to estimate $\theta_{jc}$, where $n_{jc}$ is the no. of documents of class c having the j-th word, $n_c$ is the no. of documents of class c, and α, β are the smoothing hyperparameters of the estimation.
6. Calculate the log posterior for each class:
   $\log\frac{P(y = c|x)}{P(y = 0|x)} = \sum_j w_{jc} x_j + w_{0c}$
   where $w_{jc} = \log\frac{\theta_{jc}(1 - \theta_{j0})}{\theta_{j0}(1 - \theta_{jc})}$ and $w_{0c} = \sum_j \log\frac{1 - \theta_{jc}}{1 - \theta_{j0}} + \log\frac{\theta_c}{\theta_0}$.
7. Read the body of an article:
   - Remove punctuation characters.
   - Tokenize into words.
   - Remove unwanted words (stop words).
   - Pass the inputs to the model, estimate the posterior probabilities, and output the class.
8. Divide the data into a train/test split.
9. Produce a confusion matrix and note the difficulty of classifying articles.
10. Report the top 10 words for each class.
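The pipeline above can be sketched on a toy scale in Python. Everything here is an assumption for illustration: the four training sentences stand in for downloaded articles, the stop-word list is tiny, the smoothing reuses (n_jc + alpha) / (n_c + beta) with alpha = 1 and beta = 10 from the Laplace smoothing notes, and the function names are made up. It scores each class directly rather than as a ratio against class 0, which picks the same winner.

```python
import math
import string

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "at", "as"}


def tokenize(text):
    """Lowercase, strip punctuation, split into words, drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return {w for w in words if w not in STOP_WORDS}


def train(docs, labels, alpha=1, beta=10):
    """Estimate smoothed theta[(word, class)] and class priors by counting."""
    vocab = sorted(set().union(*docs))
    classes = sorted(set(labels))
    n_c = {c: labels.count(c) for c in classes}
    theta = {}
    for c in classes:
        for w in vocab:
            n_jc = sum(1 for d, y in zip(docs, labels) if y == c and w in d)
            theta[(w, c)] = (n_jc + alpha) / (n_c[c] + beta)
    prior = {c: n_c[c] / len(labels) for c in classes}
    return vocab, classes, theta, prior


def predict(doc, model):
    """Return the class with the highest log posterior (Bernoulli model)."""
    vocab, classes, theta, prior = model
    scores = {}
    for c in classes:
        score = math.log(prior[c])
        for w in vocab:
            t = theta[(w, c)]
            score += math.log(t) if w in doc else math.log(1 - t)
        scores[c] = score
    return max(scores, key=scores.get)


def confusion_matrix(true_labels, predicted, classes):
    """Nested dict: matrix[true][predicted] = count."""
    matrix = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(true_labels, predicted):
        matrix[t][p] += 1
    return matrix


# Toy stand-ins for article bodies (assumed data).
train_texts = [
    "The gallery opened a new painting exhibit.",
    "A painting sold at the gallery auction.",
    "The team won the final match.",
    "Fans cheered as the team scored in the match!",
]
train_labels = ["arts", "arts", "sports", "sports"]
test_texts = ["New painting at the gallery", "The team scored"]
test_labels = ["arts", "sports"]

model = train([tokenize(t) for t in train_texts], train_labels)
predictions = [predict(tokenize(t), model) for t in test_texts]
cm = confusion_matrix(test_labels, predictions, model[1])
print(predictions)
print(cm)
```

With real articles, the training step would run on the train split only, and the confusion matrix on the held-out test split would show which sections get mixed up.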