2-Inductive Learning

Introduction to Machine Learning: CS 436/580L
Inductive (Supervised) Learning: Hypothesis Spaces
Instructor: Arti Ramesh
Binghamton University
Administrivia  
• Homework 0 is available on myCourses
– Worth 4/40 points
– Due Sep 6th, Wednesday, 11:59 pm
– Late penalty is 10% after the deadline

Probability Review Recap

• Mutual Exclusion
• Independence
• Conditional Independence
• Expectation
• Variance

Probability Review

The weather on a particular day can be sunny, cloudy, or rainy: sunny with probability 0.3, cloudy with probability 0.4, and rainy with probability 0.3. A concert is planned to be held in the city. If the weather is sunny, the concert will be held with probability 1.0; if the weather is cloudy or rainy, it will be held with probability 0.8 and 0.5, respectively.
What is the probability that the concert will be held?
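
One way to work it out, via the law of total probability (conditioning on the weather):

$$P(\text{held}) = 0.3 \cdot 1.0 + 0.4 \cdot 0.8 + 0.3 \cdot 0.5 = 0.30 + 0.32 + 0.15 = 0.77$$
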
Probability Review

Let X denote the sum of two fair dice. What is the expectation of X?
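
By linearity of expectation, with D1 and D2 the two dice, each die has mean (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5, so:

$$E[X] = E[D_1] + E[D_2] = 3.5 + 3.5 = 7$$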
   

Recap

• Different definitions of machine learning, and all are correct!!!
• Slight variations according to the type of learning
• Types of learning:
– Supervised Learning
– Unsupervised Learning
– Reinforcement Learning
– Semi-Supervised Learning

Types of Learning

• Supervised Learning
– Problem: the learner is required to learn a function that maps a vector into one of several classes by looking at several input-output examples of the function.
– Standard formulation of the supervised learning task: classification

Types of Learning

• Unsupervised Learning
– Models a set of inputs; labeled examples are not available
– Standard formulation of the unsupervised learning task: clustering
• Semi-Supervised Learning
– Combines both labeled and unlabeled examples to generate an appropriate function or classifier

Types of Learning

• Reinforcement Learning
– The algorithm learns a policy of how to act given an observation of the world
– Every action has some impact on the environment, and the environment provides feedback that guides the learning algorithm

Which type of learning is best?

• Determining the best move to make in a game
• Distinguishing between pictures of dogs, cats, and horses
• Elevator scheduling
• An agent in the field trying to defuse a bomb
• Speech analysis of telephone conversations (400 hours of annotation time for each hour of speech)

ML in a Nutshell

• Tens of thousands of machine learning algorithms
• Hundreds of new ones every year
• Every machine learning algorithm has three components:
– Representation
– Evaluation
– Optimization

Representation

• Decision trees
• Sets of rules / logic programs
• Instances
• Graphical models (Bayes/Markov nets)
• Neural networks
• Support vector machines
• Model ensembles
• Etc.

Evaluation

• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / utility
• Margin
• Entropy
• K-L divergence
• Etc.

Optimization

• Combinatorial optimization
– E.g., greedy search
• Convex optimization
– E.g., gradient descent
• Constrained optimization
– E.g., linear programming

Supervised Learning

• Given: training examples (x, f(x)) for some unknown function f
• Find: an approximation to f

Example Applications

• Credit risk assessment
– x: properties of the customer and proposed purchase
– f(x): approve/reject the purchase
• Disease diagnosis
– x: properties of the patient (symptoms, lab tests)
– f(x): disease diagnosis, recommended therapy
• Face recognition
– x: bitmap picture of a person's face
– f(x): the person's name

Appropriate Applications for Supervised Learning

• Situations where there is no human expert
– x: bond graph for a new molecule
– f(x): predicted binding strength to the AIDS protease molecule
• Situations where humans can perform the task but can't describe how they do it
– x: bitmap picture of a handwritten character
– f(x): ASCII code of the character
• Situations where the desired f(x) is changing rapidly
– x: description of stock prices and trades for the last 10 days
– f(x): recommended stock transactions
• Situations where each user needs a customized f
– x: incoming email message
– f(x): importance score for presenting the message to the user

Example: A dataset for supervised learning

Sepal Length   Sepal Width   Petal Length   Petal Width   Class
5.1            3.5           1.4            0.2           Iris-Setosa
6.1            3.0           4.6            1.4           Iris-Versicolor
7.2            3.6           6.1            2.5           Iris-Virginica

• The columns are called input variables, features, or attributes
• The type of flower {Iris-Setosa, Iris-Versicolor, Iris-Virginica} is called the target variable, output variable, or label
• A row in the table is called a training example
• The whole table is called the (training, validation, test, or evaluation) data set
• The problem of predicting the label is called classification
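
As a concrete illustration, this is the classic Iris dataset, which ships with scikit-learn; a minimal sketch for inspecting it, assuming scikit-learn is installed:

```python
from sklearn.datasets import load_iris

# X: one row per training example, one column per feature
# y: integer-coded class labels (0 = setosa, 1 = versicolor, 2 = virginica)
X, y = load_iris(return_X_y=True)

print(X.shape)                    # (150, 4): 150 examples, 4 features
print(load_iris().target_names)   # ['setosa' 'versicolor' 'virginica']
```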

Supervised Learning: Formal Definition

[Figure: training samples drawn from P(x,y) feed a learning algorithm, which outputs f; a test point <x,y> drawn from the same P(x,y) is split so that f receives x and predicts ŷ, and a loss function compares ŷ with y to produce L(ŷ,y).]

• Training examples are drawn independently at random according to an unknown probability distribution P(x,y)
• The learning algorithm analyzes the examples and produces a classifier f or hypothesis h
• Given a new data point <x,y> drawn from P (independently and at random), the classifier is given x and predicts ŷ = f(x)
• The loss L(ŷ,y) is then measured
• Goal of the learning algorithm: find the f that minimizes the expected loss
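
To make the loss-minimization framing concrete, here is a minimal sketch that estimates the expected 0/1 loss of a fixed classifier on points sampled from a known toy distribution; the names (sample_point, f) and the distribution are illustrative, not from the slides:

```python
import random

def sample_point():
    """Toy stand-in for the unknown P(x,y): y is the majority bit of x."""
    x = [random.randint(0, 1) for _ in range(9)]
    y = 1 if sum(x) >= 5 else 0
    return x, y

def f(x):
    """Some hypothesis produced by a learning algorithm (here: guess from x[0])."""
    return x[0]

def zero_one_loss(y_hat, y):
    return 0 if y_hat == y else 1

# The expected loss E[L(f(x), y)] is approximated by an average over fresh samples.
n = 10_000
avg_loss = sum(zero_one_loss(f(x), y) for x, y in (sample_point() for _ in range(n))) / n
print(f"estimated expected 0/1 loss: {avg_loss:.3f}")
```
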
A learning problem!

X 0 X     X 0 X     X X X
0 X 0     X X 0     0 X X     f(x) = 1
0 X X     X 0 0     0 0 0

0 X 0     0 0 X     0 X X
X 0 X     X X 0     X 0 0     f(x) = 0
0 X X     0 X X     0 X X

0 X X
0 X 0     f(x) = ?
X X 0

A Learning Problem!

X1   X2   X3   X4   X5   X6   X7   X8   X9   f(x)
X    0    X    0    X    0    0    X    X    1
X    0    X    X    X    0    X    0    0    1
X    X    X    0    X    X    0    0    0    1
0    X    0    X    0    X    0    X    X    0
0    0    X    X    X    0    0    X    X    0
0    X    X    X    0    0    0    X    X    0
0    X    X    0    X    0    X    X    0    ?

• x: a 9-dimensional vector
• f(x): a function or a program that takes the vector as input and outputs either a 0 or a 1
• Task: given the training examples, find a good approximation to f, so that in the future, when you see an unseen vector x, you will be able to figure out the value of f(x)

Example of a learning problem

A simpler example for analysis: a classification problem.

Given data or examples, find the function f.

How to find a good approximation to f?

• A possible/plausible technique:

[Figure: the unknown function f: X → Y generates training examples/data (x, f(x)), which feed a learning algorithm; the learning algorithm also takes a hypothesis space H, the set of candidate functions (your assumptions about f), and outputs a good approximation h ≈ f.]

Hypothesis Spaces

You are assuming that the unknown function f could be any one of the 2^16 possible functions! It turns out that, of those 2^16 functions, 2^9 classify all points in the training data correctly!
(Why these numbers: the example presumably has 4 binary features, giving 2^4 = 16 distinct inputs and hence 2^16 boolean functions; the training data fixes f on 16 - 9 = 7 of those inputs, one per training example, leaving 2^9 functions consistent with it.)

Hypothesis Spaces

• 10,000 features
• Features are binary
• Output is binary

Number of boolean functions?
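
For reference, the counting argument: with n binary features there are 2^n possible input vectors, and a boolean function independently assigns an output of 0 or 1 to each of them, so the hypothesis space contains

$$2^{2^n} = 2^{2^{10000}}$$

functions; an astronomically large space.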

Hypothesis Spaces

You are assuming that the unknown function f could be any one of the 16 conjunctive rules! (Over 4 binary features, each feature is either included in the conjunction or not, giving 2^4 = 16 rules.)

Unfortunately, none of them classify the training data correctly.
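
A minimal sketch of what "none of them work" means operationally, assuming 4 binary features and made-up training data for illustration: enumerate every conjunctive rule and keep those consistent with all training examples.

```python
from itertools import combinations

# Toy training set: (input bits, label). Chosen so that, as on the slide,
# no conjunctive rule is consistent with the data.
train = [((1, 0, 0, 0), 1), ((0, 1, 0, 0), 1), ((1, 1, 0, 0), 0), ((0, 0, 0, 0), 0)]

def conjunction(indices):
    """Rule: f(x) = 1 iff every feature listed in `indices` is 1 (the empty rule always says 1)."""
    return lambda x: int(all(x[i] == 1 for i in indices))

n = 4
# All 2^n = 16 conjunctive rules: one per subset of the n features.
rules = [idx for k in range(n + 1) for idx in combinations(range(n), k)]

consistent = [idx for idx in rules
              if all(conjunction(idx)(x) == y for x, y in train)]
print(f"{len(consistent)} of {len(rules)} conjunctive rules fit the data: {consistent}")
```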

Hypothesis Spaces

m-of-n rules: at least m of the n chosen variables must be true.

[Figure: a table of the 32 m-of-n rules, with the single rule consistent with the training data marked "***".]

You are assuming that the unknown function f could be any one of the 32 m-of-n rules! Only one of them, the one marked by "***", works!
(Counting the rules: choose a nonempty subset of the 4 variables, say of size k, and a threshold m between 1 and k; that gives 1*4 + 2*6 + 3*4 + 4*1 = 32 rules.)

Two Views of Learning

• Learning is the removal of uncertainty
• Learning requires guessing a good, small hypothesis class
• We could be wrong!
– Our prior knowledge might be wrong!
– Our guess for the hypothesis class could be wrong!
• The smaller the hypothesis class, the more likely we are to be wrong!

Strategies for Machine Learning

• Strategy 1: Develop languages for expressing prior knowledge: rule grammars and stochastic models
• Strategy 2: Develop flexible hypothesis spaces: nested collections of hypotheses (decision trees, rules, neural networks, ...)
• In either case: develop algorithms for finding a hypothesis that fits the data!

Terminology

• Training example: an example of the form (x, f(x))
• Target function (target concept): the true function f
• Hypothesis: a proposed function h believed to be similar to f
• Concept: a boolean function. Examples for which f(x) = 1 are called positive examples or positive instances of the concept; examples for which f(x) = 0 are called negative examples or negative instances of the concept.
• Classifier: a discrete-valued function. The possible values f ∈ {1, 2, ..., K} are called class labels.
• Hypothesis space: the space of hypotheses that can be output by a learning algorithm

Key Issues in Machine Learning

• What are good hypothesis spaces?
– Which spaces are useful in practical applications, and why?
• What algorithms can work in these spaces?
– Are there general design principles for machine learning algorithms?
• How can we optimize accuracy on future data points?
– This is sometimes called the problem of overfitting
• How can we have confidence in the results?
– How much training data is required to find an accurate hypothesis?
• Are some learning problems computationally intractable? (the computational question!)
• How can we formulate application problems as machine learning problems? (the engineering question!)

Steps in Supervised Learning

1. Determine the representation for (x, f(x)) and determine which "x" to use (feature engineering)
2. Gather a training set, since not all data is kosher (data cleaning)
3. Select a suitable evaluation method
4. Find a suitable learning algorithm among a plethora of available choices
– Issues discussed on the previous slide

Feature Engineering is the Key

• Most of the effort in ML projects goes into constructing features
• Black art: intuition and creativity required
– Understand the properties of the task at hand
– Understand how the features interact with or limit the algorithm you are using
• ML is an iterative process
– Try different types of features, experiment with each, and then decide which feature set/algorithm combination to use

A sample machine learning algorithm

• 2-way classification problem
– +ve and -ve classes
• Representation: lines (Ax + By + C = 0)
– Specifically:
• If Ax + By + C > 0, then classify as "+ve"
• Else classify as "-ve"
• Evaluation: number of misclassified examples
• Optimization: an algorithm that searches for the three parameters A, B, and C

Toy Example

[Figure: scatter plot of age (x-axis: 20, 40, 60, 80) vs. income (y-axis: 15K, 30K, 45K, 60K) with a separating line; points on one side satisfy Ax + By + C > 0, points on the other satisfy Ax + By + C < 0.]

• Blue circles: good credit (low risk)
• Red circles: bad credit (high risk)
• Problem: fit a line that separates the two classes such that the error is minimized.

How do machine learners solve this problem?

• Try different lines until you find one that separates the data into the two classes
• A more plausible alternative (a sketch of this update loop follows below):
– Begin with a random line
– Repeat until there are no errors:
– For each point:
• If the current line says +ve and the point is -ve, then decrease A, B, and C
• If the current line says -ve and the point is +ve, then increase A, B, and C
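
One standard way to realize this increase/decrease rule is the perceptron update, where the adjustment to A, B, and C is scaled by the misclassified point's coordinates; a minimal sketch with made-up data (the dataset and names are illustrative, not from the slides):

```python
import random

# Toy data: (age, income, label), label +1 = good credit, -1 = bad credit.
points = [(25, 20_000, -1), (30, 18_000, -1), (45, 50_000, +1), (55, 48_000, +1)]

def predict(A, B, C, x, y):
    """Classify via the line: +1 if Ax + By + C > 0, else -1."""
    return 1 if A * x + B * y + C > 0 else -1

# Begin with a random line.
A, B, C = random.uniform(-1, 1), random.uniform(-1, 1), random.uniform(-1, 1)
lr = 1e-4  # step size; keeps updates modest for these feature magnitudes

# Repeat until no point is misclassified (the data must be linearly separable).
while any(predict(A, B, C, x, y) != label for x, y, label in points):
    for x, y, label in points:
        if predict(A, B, C, x, y) != label:
            # Misclassified: move A, B, C up for a +ve point, down for a -ve one,
            # scaled by the point's coordinates (the classic perceptron rule).
            A += lr * label * x
            B += lr * label * y
            C += lr * label

print(f"separating line: {A:.4f}*x + {B:.4f}*y + {C:.4f} = 0")
```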

Toy Example: More data

[Figure: the same age (x1) vs. income (x2) scatter plot, now with more data points.]

• Blue circles: good credit (low risk)
• Red circles: bad credit (high risk)
• Problem: fit a line that separates the two classes such that the error is minimized.

Learning = Representation + Evaluation + Optimization

• Combinations of just three elements:

Representation     Evaluation         Optimization
Instances          Accuracy           Greedy search
Hyperplanes        Precision/Recall   Branch & bound
Decision trees     Squared error      Gradient descent
Sets of rules      Likelihood         Quasi-Newton
Neural networks    Posterior prob.    Linear progr.
Graphical models   Margin             Quadratic progr.
Etc.               Etc.               Etc.
