
Data Science Module -3

Module 3 discusses data processing techniques, including data cleaning, organization, and analysis for effective decision-making. It covers methods for spam detection using algorithms like k-NN and Naive Bayes, emphasizing the importance of labeled data for training models. The module also highlights the use of APIs for data extraction and integration in machine learning workflows.

Uploaded by

kanti chandrakar
Copyright
© All Rights Reserved

Module -3

Data Wrangling

1. Data wrangling is the process of converting and mapping raw data and getting it ready for analysis.
2. It involves transforming or combining complex data sets to make them accessible and easy to analyse.
3. It is also known as data munging.
4. It means transforming, cleaning, and organizing raw data into the desired format so that it can be used for analysis and decision making.

Uses
1. Makes raw data usable.
2. Combines data from various sources in a central location.
3. Puts data into the required format and helps in understanding the business context of the data set.
4. Automated data integration tools.
5. Cleans data of noise and missing elements before analysis.
6. Helps business users make timely decisions.
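As a small illustration of the cleaning step listed above, the following sketch drops records with missing elements and normalizes the rest. The field names `name` and `age` and the helper `clean_records` are hypothetical, chosen only for this example.

```python
def clean_records(rows):
    """Keep only complete records and convert them to a uniform format."""
    cleaned = []
    for row in rows:
        name = (row.get("name") or "").strip().title()
        raw_age = (row.get("age") or "").strip()
        # Skip rows with a missing name or a non-numeric age (noise).
        if not name or not raw_age.isdigit():
            continue
        cleaned.append({"name": name, "age": int(raw_age)})
    return cleaned


rows = [
    {"name": " alice ", "age": "30"},
    {"name": "BOB", "age": ""},       # missing age
    {"name": "", "age": "41"},        # missing name
    {"name": "carol", "age": "27 "},
]
result = clean_records(rows)
print(result)
```

Only the two complete records survive, with names and ages in one consistent format.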

Signs of spam (clues)
1. Spelling errors.
2. References containing "Viagra".
3. Abnormal punctuation.
4. Any exclamation marks.
5. The subject line.

Suggestions for a spam model
1. Try a probabilistic model.
2. k-NN, linear regression.
3. Why linear regression and k-NN do not work for filtering spam.
Why linear regression does not work for spam filtering

1. Write about linear regression.
2. Consider the data set as a matrix, where each row corresponds to a different email.
3. Create columns for each word (feature); here "Viagra" is a column.
4. If an email contains the word "Viagra", that column is filled with the value 1, else assign 0. Alternatively, one can put the number of times the word appears in the email.
5. For linear regression we need a training set where the emails have been labeled with an outcome variable, i.e. spam or not.
6. A human evaluator can do the spam labeling task.
7. We build a model to predict the labels, because new emails arrive without a label.
8. The label given is 1 for spam and 0 for not spam.
9. The task is binary, but the outcome is a number.
10. In linear regression the predicted values are continuous. Choose a critical value: if the predicted value is above it the output is 1, if below then the output is 0.
11. There are too many variables.
12. It does not work with 1,00,000 words and 10,000 emails, because the matrix is not invertible.
13. We could limit the model to the top words, but that is still too many words.
14. Linear regression is not appropriate for a binary outcome.
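The email-by-word matrix described above can be sketched as follows. The sample emails, the three-word vocabulary, and the helper `binary_matrix` are assumptions for illustration; a real vocabulary would have on the order of 1,00,000 columns.

```python
def binary_matrix(emails, vocabulary):
    """One row per email, one column per word: 1 if the word is present, else 0."""
    rows = []
    for text in emails:
        words = set(text.lower().split())
        rows.append([1 if word in words else 0 for word in vocabulary])
    return rows


emails = ["Buy Viagra now", "Meeting at noon"]
vocabulary = ["viagra", "meeting", "buy"]
matrix = binary_matrix(emails, vocabulary)
print(matrix)
```

With more columns (words) than rows (emails), the normal equations of linear regression have no unique solution, which is the invertibility problem the notes point out.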
Why k-NN does not work for spam filtering

1. Write about k-NN.
2. Emails are represented as a matrix, with the words as columns.
3. Matrix entries are either 0 or 1, depending on the presence of the word.
4. For k-NN, emails would be near each other based on the words they both contain.
5. Here 1,00,000 words means too many dimensions: the emails live in a 1,00,000-dimensional space.
6. Computing distances in such a space takes a lot of computational work.
7. k-NN suffers from the curse of dimensionality, which makes it a poor algorithm here.
Digit recognition
- Each digit is in a 16x16 pixel grid.
- Unwrap the 16x16 grid into a 256-dimensional space (vectorize).
- Apply k-NN and tune k.
- Evaluate with accuracy and a confusion matrix.
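A minimal sketch of the unwrap-then-k-NN idea, using toy 16x16 grids instead of real digit images. The `bar` shapes, the labels, and the choice `k=3` are assumptions made only for illustration.

```python
from collections import Counter


def unwrap(grid):
    """Flatten a 16x16 grid into a 256-dimensional vector."""
    return [pixel for row in grid for pixel in row]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    nearest = sorted(train, key=lambda item: squared_distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def bar(vertical, noise):
    """Toy image: a bar through the middle plus one noisy pixel."""
    grid = [[0] * 16 for _ in range(16)]
    for i in range(16):
        if vertical:
            grid[i][7] = 1
        else:
            grid[7][i] = 1
    row, col = noise
    grid[row][col] = 1
    return grid


noise_pixels = [(0, 0), (1, 1), (2, 2)]
train = [(unwrap(bar(True, n)), "vertical") for n in noise_pixels]
train += [(unwrap(bar(False, n)), "horizontal") for n in noise_pixels]

query = unwrap(bar(True, (15, 15)))
prediction = knn_predict(train, query, k=3)
print(len(query), prediction)
```

In practice `k` would be tuned on held-out data and the result summarized with accuracy and a confusion matrix.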

Naive Bayes

Bayes' law
- Naive Bayes is a classification method based on Bayes' law.

Example: a rare disease, where 1% of the population is infected.
- 99% of sick patients test positive.
- 99% of healthy patients test negative.
Given that a patient tests positive, what is the probability that the patient is actually sick?

Population: 10,000 people
- 100 people are sick: 99 test positive, 1 tests negative.
- 9,900 people are healthy: 99 test positive, 9,801 test negative.
Of the 198 positive tests only 99 belong to sick people, so here the answer is 50%.

Let x, y be events with probabilities p(x), p(y).
p(x, y) is the joint probability, when both happen.
p(y | x) is the conditional probability that y happens given that x has happened.

$$p(x, y) = p(y \mid x)\,p(x) = p(x \mid y)\,p(y)$$

Solve for p(y | x):

$$p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)}$$

Let y be the event "I am sick" and x the event "the test is positive" (+):

$$p(\text{sick} \mid +) = \frac{p(+ \mid \text{sick})\,p(\text{sick})}{p(+)} = \frac{0.99 \times 0.01}{(0.99 \times 0.01) + (0.01 \times 0.99)} = 0.50 = 50\%$$
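The arithmetic of the disease example can be checked with a few lines of Python. The function name `posterior` and its parameter names are illustrative choices, not part of the notes.

```python
def posterior(prior, sensitivity, specificity):
    """P(sick | positive test) by Bayes' law."""
    true_positive = sensitivity * prior                   # p(+ | sick) p(sick)
    false_positive = (1.0 - specificity) * (1.0 - prior)  # p(+ | healthy) p(healthy)
    return true_positive / (true_positive + false_positive)


p_sick_given_positive = posterior(prior=0.01, sensitivity=0.99, specificity=0.99)
print(round(p_sick_given_positive, 2))
```

The two terms in the denominator are equal, which is why a 99% accurate test still gives only a 50% chance of being sick.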

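Naive Bayes for individual words (spam or not)

- Is the email spam? Each word adds to the probability.
- Consider only one word at a time.
- A word may indicate spam more than non-spam (ham).

p(spam) = probability of a spam email
p(ham) = 1 - p(spam) = probability of a ham email
p(word | spam) = probability of the word in a spam email
p(word | ham) = probability of the word in a ham email

Apply Bayes' law:

$$p(\text{spam} \mid \text{word}) = \frac{p(\text{word} \mid \text{spam})\,p(\text{spam})}{p(\text{word})}$$

$$p(\text{word}) = p(\text{word} \mid \text{spam})\,p(\text{spam}) + p(\text{word} \mid \text{ham})\,p(\text{ham})$$

$$p(\text{spam}) = \frac{\text{No. of spam emails}}{\text{Total no. of emails}}, \qquad p(\text{ham}) = \frac{\text{No. of non-spam emails}}{\text{Total no. of emails}}$$

Example: employee emails with 1500 spam, 3672 ham.
The word "meeting" appears 16 times in spam and 153 times in ham.

$$p(\text{spam}) = \frac{1500}{1500 + 3672} = 0.29$$

$$p(\text{ham}) = 1 - p(\text{spam}) = 1 - 0.29 = 0.71$$

$$p(\text{meeting} \mid \text{spam}) = \frac{16}{1500} = 0.0106$$

$$p(\text{meeting} \mid \text{ham}) = \frac{153}{3672} = 0.0416$$

$$p(\text{meeting}) = p(\text{meeting} \mid \text{spam})\,p(\text{spam}) + p(\text{meeting} \mid \text{ham})\,p(\text{ham})$$

$$p(\text{spam} \mid \text{meeting}) = \frac{p(\text{meeting} \mid \text{spam})\,p(\text{spam})}{p(\text{meeting})} = \frac{0.0106 \times 0.29}{(0.0106 \times 0.29) + (0.0416 \times 0.71)} = 0.09$$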
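The "meeting" example can be recomputed directly from the counts. The count of 16 spam occurrences is inferred from the value 0.0106 in the notes (0.0106 x 1500 is about 16); the variable names are illustrative.

```python
n_spam, n_ham = 1500, 3672
meeting_in_spam, meeting_in_ham = 16, 153  # 16 is inferred from 0.0106 * 1500

p_spam = n_spam / (n_spam + n_ham)
p_ham = 1 - p_spam
p_meeting_given_spam = meeting_in_spam / n_spam
p_meeting_given_ham = meeting_in_ham / n_ham

# Total probability of seeing the word, then Bayes' law.
p_meeting = p_meeting_given_spam * p_spam + p_meeting_given_ham * p_ham
p_spam_given_meeting = p_meeting_given_spam * p_spam / p_meeting

print(round(p_spam, 2), round(p_meeting_given_spam, 4), round(p_meeting_given_ham, 4))
print(round(p_spam_given_meeting, 2))
```

An email containing "meeting" is spam only about 9% of the time, so this word is evidence of ham.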
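A spam filter for combining words

1. Each email is represented by a binary vector x.
2. The entry x_j is 1 or 0, depending on the appearance of the j-th word.
3. Let x be the email vector, with index j for the j-th word, and let c denote spam. p(x | c) is the probability of the email vector given that the email is spam:

$$p(x \mid c) = \prod_j \theta_{jc}^{x_j}\,(1 - \theta_{jc})^{1 - x_j}$$

where θ_jc is the probability that the individual j-th word is present in a spam email.

4. Take the log on both sides to convert the product to a sum:

$$\log p(x \mid c) = \sum_j x_j \log \theta_{jc} + \sum_j (1 - x_j) \log(1 - \theta_{jc})$$

$$\log p(x \mid c) = \sum_j x_j \log \frac{\theta_{jc}}{1 - \theta_{jc}} + \sum_j \log(1 - \theta_{jc})$$

$$\log p(x \mid c) = \sum_j x_j w_j + w_0$$

where

$$w_j = \log \frac{\theta_{jc}}{1 - \theta_{jc}}, \qquad w_0 = \sum_j \log(1 - \theta_{jc})$$

- The weights w_j vary from word to word and must be computed for each word.
- Compute p(x | c), then estimate p(c | x).
- Train with a pre-labeled data set.
- It works well and is cheap.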
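A short numerical check that the weighted-sum form agrees with the original product form. The per-word probabilities in `theta` and the email vector `x` are made-up values for illustration only.

```python
import math


def log_likelihood_linear(x, theta):
    """log p(x | c) as sum_j x_j * w_j + w_0."""
    w = [math.log(t / (1 - t)) for t in theta]
    w0 = sum(math.log(1 - t) for t in theta)
    return sum(xj * wj for xj, wj in zip(x, w)) + w0


def log_likelihood_product(x, theta):
    """log of the product form: prod_j theta_j^x_j * (1 - theta_j)^(1 - x_j)."""
    product = math.prod(t ** xj * (1 - t) ** (1 - xj) for xj, t in zip(x, theta))
    return math.log(product)


theta = [0.8, 0.1, 0.5]  # hypothetical word probabilities in spam
x = [1, 0, 1]            # words 1 and 3 appear in the email

linear = log_likelihood_linear(x, theta)
product = log_likelihood_product(x, theta)
print(linear, product)
```

Because `w` and `w0` depend only on the words, they can be computed once; only `x` changes from email to email.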

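Laplace Smoothing:

- θ_j is the probability of a given word in a spam email.
- Define θ_j as a ratio of counts:

$$\theta_j = \frac{n_{jc}}{n_c}$$

where n_jc is the number of times the word appears in a spam email and n_c is the number of times the word appears in any email.

- Laplace smoothing refers to the idea of replacing θ_j with

$$\theta_j = \frac{n_{jc} + \alpha}{n_c + \beta}, \qquad \alpha = 1,\ \beta = 10$$

to prevent getting a probability of 0 or 1.

Maximum likelihood estimate (ML), with D the data set:

$$\theta_{ML} = \arg\max_{\theta}\ p(D \mid \theta)$$

The Naive Bayes way of choosing θ_j for each j is to maximize

$$\log\left(\theta_j^{\,n_{jc}}\,(1 - \theta_j)^{\,n_c - n_{jc}}\right)$$

Take the derivative and set it to 0; then

$$\theta_j = \frac{n_{jc}}{n_c}$$

Maximum a posteriori estimate (MAP):

$$\theta_{MAP} = \arg\max_{\theta}\ p(\theta \mid D)$$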
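Comparing Naive Bayes to k-NN

1. Hyperparameters: k-NN has only one hyperparameter (k); Naive Bayes has two hyperparameters, i.e. α and β.
2. k-NN is a non-linear classifier; Naive Bayes is a linear classifier.
3. The curse of dimensionality is a problem for k-NN; it is not a problem for Naive Bayes.
4. k-NN requires no training; Naive Bayes requires training.
5. Both are supervised learning (labeled data).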
Scraping the web: APIs and other tools

1. As a data scientist you ask a question or solve a problem, and you need data to do it.
2. Data scientists have to do research to find where the data can be brought from.
3. Scraping the web deals with extracting data from websites.
4. Different ways to get the data: APIs (with an API key) and extensions.

Tools
1. Unix command: lynx --dump
2. Beautiful Soup (robust but slow)
3. Mechanize (does not parse JavaScript)
4. PostScript (image classification)

APIs
- An API lets you download data in a specified format.
- An API key is provided to the developer (like a password).
- Developers register to get access and a key.
- There are limits on the download size.
- Data is provided in a standard format; it can be in JSON or another format.
- Yahoo's YQL can be used to extract data:

    select * from flickr.photos.search where text = "cat" and api_key = "<API key>"

Extensions
- When APIs are not available, use an extension such as Firebug, an extension of Firefox.
- Use it to inspect the element on any webpage; HTML fields can be accessed and edited.
- After locating the stuff we need inside the HTML, use curl, wget, grep, awk, perl, etc. to write a script that fetches the data.
- The same can be done using Python or R.
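As a sketch of doing the extraction step in Python, the standard-library `html.parser` module can pull fields out of HTML once the relevant tags are known. The sample page and the `LinkExtractor` class are made up for illustration; a real script would first download the page.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical page content standing in for a downloaded web page.
page = (
    '<html><body><a href="/a.html">A</a>'
    '<p>text</p><a href="/b.html">B</a></body></html>'
)

parser = LinkExtractor()
parser.feed(page)
print(parser.links)
```

Libraries such as Beautiful Soup offer a more forgiving interface for the same task, at the cost of speed as the notes mention.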

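Image recognition

- Check whether an image is a landscape or a headshot.
- Collect data: ask someone to label the images, or use Flickr.
- Represent each image as RGB: for each color, a number between 0 and 255.
- Draw 3 histograms, one for each color.
- Model landscape and headshot: decide how much blue indicates a landscape rather than a headshot.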
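A minimal sketch of the three-histogram idea, with an image represented as a list of (R, G, B) tuples. The bin count, the sample pixels, and the helper names are assumptions; a real pipeline would read pixels from image files.

```python
def channel_histograms(pixels, bins=4):
    """Return three histograms (R, G, B); bins must divide 256 evenly."""
    width = 256 // bins
    histograms = [[0] * bins for _ in range(3)]
    for rgb in pixels:
        for channel, value in enumerate(rgb):
            histograms[channel][value // width] += 1
    return histograms


def blue_fraction(pixels):
    """Share of total intensity that comes from the blue channel."""
    total = sum(r + g + b for r, g, b in pixels)
    return sum(b for _, _, b in pixels) / total if total else 0.0


sky_pixels = [(0, 0, 255), (10, 20, 200)]  # hypothetical mostly-blue image
histograms = channel_histograms(sky_pixels)
print(histograms)
print(blue_fraction(sky_pixels))
```

A simple classifier could then threshold the blue fraction, with the threshold chosen from labeled examples.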
Naive Bayes for article classification

- Multiclass text classification, e.g. Arts, Business, Politics, Sports.
- Use the New York Times developer API to fetch articles to classify.
- Apply the Bernoulli model for word presence.

1. Register and request an API key.
2. Download the 2000 most recent articles from each section.
3. Save the articles of each section in a separate file in tab-delimited format: article URL, article title, body returned by the API.
4. Classification: let C be the set of categories. The label of the i-th article is y_i ∈ {0, 1, 2, ...}. X is a sparse binary matrix, with X_ij = 1 indicating that article i has word j.
5. Train by counting words and documents in each class to estimate θ_jc and θ_c, where n_jc is the no. of documents of class c having the j-th word, n_c is the no. of documents of class c, and α, β are hyperparameters that smooth the estimation.
6. Calculate the log-odds for each class:

$$\log \frac{p(y = c \mid x)}{p(y = 0 \mid x)} = \sum_j w_{jc}\,x_j + w_{0c}$$

where

$$w_{jc} = \log \frac{\theta_{jc}\,(1 - \theta_{j0})}{\theta_{j0}\,(1 - \theta_{jc})}, \qquad w_{0c} = \sum_j \log \frac{1 - \theta_{jc}}{1 - \theta_{j0}} + \log \frac{\theta_c}{\theta_0}$$

7. Read the title and body of each article.
   - Remove unwanted characters (punctuation).
   - Tokenize into words.
   - Filter out stop words.
8. Estimate the weights with α and β as inputs; the output is the posterior probability for each class.
9. Divide the data into a train/test split.
10. Present a confusion matrix, and report the top 10 words for each section.
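The training and prediction steps can be sketched on a toy corpus. Everything here is an assumption for illustration: the four sample documents, the tiny stop-word list, and the smoothing rule (n_jc + alpha) / (n_c + beta) with alpha = 1 and beta = 2, which may differ from the estimator intended in the exercise. The sketch scores each class directly and normalizes, rather than computing log-odds against a base class.

```python
import math
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "of", "in", "and", "to"}  # hypothetical tiny list


def tokenize(text):
    """Lowercase, strip punctuation, and drop stop words; returns a set of words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def train(docs, labels, alpha=1.0, beta=2.0):
    """Estimate theta_jc (word presence per class) and theta_c (class prior)."""
    n_c = Counter(labels)
    n_jc = defaultdict(Counter)
    vocab = set()
    for text, c in zip(docs, labels):
        words = tokenize(text)
        vocab |= words
        n_jc[c].update(words)
    theta = {
        c: {w: (n_jc[c][w] + alpha) / (n_c[c] + beta) for w in vocab}
        for c in n_c
    }
    prior = {c: n_c[c] / len(labels) for c in n_c}
    return theta, prior, vocab


def predict(text, theta, prior, vocab):
    """Posterior probability of each class under the Bernoulli model."""
    words = tokenize(text)
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w in vocab:
            t = theta[c][w]
            score += math.log(t) if w in words else math.log(1 - t)
        scores[c] = score
    top = max(scores.values())  # subtract the max for numerical stability
    norm = sum(math.exp(s - top) for s in scores.values())
    return {c: math.exp(s - top) / norm for c, s in scores.items()}


docs = [
    "The painter opened a gallery show",
    "Gallery exhibits new sculpture",
    "Stocks fell as markets closed",
    "Markets rally on bank earnings",
]
labels = ["Arts", "Arts", "Business", "Business"]

theta, prior, vocab = train(docs, labels)
posteriors = predict("A new gallery of sculpture", theta, prior, vocab)
print(max(posteriors, key=posteriors.get), posteriors)
```

On real data the same functions would be applied to the training half of the split and evaluated on the test half with a confusion matrix.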
