02 Semantic Segmentation 2024
Semantic Segmentation
Computer Vision Tasks
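[Figure: one example image per task. Classification: CAT. Semantic Segmentation: GRASS, CAT, TREE, SKY. Object Detection: DOG, DOG, CAT. Instance Segmentation: DOG, DOG, CAT.]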
So far: Image Classification
• Vector: 4096
• Fully-Connected: 4096 to 1000
• Class Scores: Cat: 0.9, Dog: 0.05, Car: 0.01, ...
Convolutional Neural Networks
• Input Image
• Convolution (Learned)
• Non-linearity
• Spatial pooling (e.g., max)
• Normalization
• Feature maps
• The convolution, non-linearity, pooling, and normalization stages are stacked repeatedly
• Convolutional filters are trained in a supervised manner by back-propagating classification error
Simplified architecture
Softmax layer:
$$P(c \mid x) = \frac{\exp(w_c \cdot x)}{\sum_{k=1}^{C} \exp(w_k \cdot x)}$$
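The softmax layer can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's implementation: the function name `softmax_probs`, the toy weights, and the max-subtraction trick for numerical stability are assumptions added here.

```python
import math

def softmax_probs(weights, x):
    """Return P(c | x) for each class c, given one weight vector w_c per class."""
    # Class scores: the dot product w_c . x for every class c.
    scores = [sum(w_i * x_i for w_i, x_i in zip(w_c, x)) for w_c in weights]
    # Subtracting the max score leaves the result unchanged but avoids overflow in exp.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-class example with a 2-dimensional feature vector.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
probs = softmax_probs(weights, [2.0, 1.0])
```

The class with the largest score w_c ⋅ x receives the largest probability, and the probabilities sum to 1.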
Tasks: Semantic Segmentation
Semantic Segmentation
[Figure: two example images with per-pixel labels. First image: Sky, Trees, Cat, Grass. Second image: Sky, Trees, Cow, Grass.]
Evaluation metric
• Semantic segmentation is pixel classification!
• Overall pixel accuracy? Misleading, because classes are heavily unbalanced
• Intersection over Union (IoU)
  • Average across classes and images
• Per-class accuracy
  • Average across classes and images
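A small sketch of these metrics on a toy label map, assuming predictions and ground truth are H x W grids of class indices. The helper names are hypothetical, and only a single image is scored; averaging across images is left out.

```python
def confusion_matrix(pred, target, num_classes):
    """Count pixels: rows are ground-truth classes, columns are predicted classes."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            counts[t][p] += 1
    return counts

def per_class_metrics(counts):
    """Return (IoU per class, accuracy per class); None marks an absent class."""
    n = len(counts)
    ious, accs = [], []
    for c in range(n):
        tp = counts[c][c]
        gt = sum(counts[c])                              # pixels whose true label is c
        predicted = sum(counts[r][c] for r in range(n))  # pixels predicted as c
        union = gt + predicted - tp
        ious.append(tp / union if union else None)
        accs.append(tp / gt if gt else None)
    return ious, accs

def mean_ignoring_none(values):
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept)

# Toy 2x4 image: class 0 is a dominant background, class 1 is a small object.
target = [[0, 0, 0, 0],
          [0, 0, 1, 1]]
pred = [[0, 0, 0, 0],
        [0, 0, 0, 0]]   # predicts background everywhere

counts = confusion_matrix(pred, target, num_classes=2)
ious, accs = per_class_metrics(counts)
pixel_accuracy = sum(counts[c][c] for c in range(2)) / 8
mean_iou = mean_ignoring_none(ious)
mean_class_accuracy = mean_ignoring_none(accs)
```

Here the all-background prediction reaches a pixel accuracy of 0.75 while missing the object entirely; mean IoU (0.375) and mean per-class accuracy (0.5) expose the failure.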
Challenges in data collection
• Precise localization is hard to annotate
Example: SuperParsing
J. Tighe and S. Lazebnik, SuperParsing: Scalable Nonparametric Image Parsing with Superpixels, ECCV 2010
• CRF energy function is defined on superpixels: a local data term plus a smoothing term
• Unaries are based on nearest neighbor retrieval
• Pairwise potentials capture class co-occurrence statistics
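An energy with a local data term and a smoothing term generally takes the following form. This is a generic sketch, not copied from the paper: the symbols (superpixels $s_i$, labels $c_i$, the set of adjacent superpixel pairs $\mathcal{A}$, and the weight $\lambda$) are notation assumed here.

```latex
E(\mathbf{c}) \;=\;
\underbrace{\sum_{i} E_{\text{data}}(s_i, c_i)}_{\text{local data term}}
\;+\;
\lambda \underbrace{\sum_{(i,j) \in \mathcal{A}} E_{\text{smooth}}(c_i, c_j)}_{\text{smoothing term}}
```

The data term scores each superpixel's label independently (the unaries), while the smoothing term penalizes unlikely label pairs on neighboring superpixels (the pairwise potentials).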
[Figure: SuperParsing example with four panels: original image, maximum likelihood labeling, edge penalties, final labeling. Labels shown: sky, road, tree, sea, sand.]
J. Tighe and S. Lazebnik, SuperParsing: Scalable Nonparametric Image Parsing with Superpixels, ECCV 2010
Semantic segmentation using convolutional networks
[Figure: example segmentation with the labels person and bicycle]
Segmentation: Sliding Window
• Extract a patch around each pixel of the full image
• Classify the center pixel of each patch with a CNN (e.g., Cow, Cow, Grass)
• Problem: Very inefficient! Not reusing shared features between overlapping patches
Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
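The sliding-window procedure and its cost can be sketched as follows. `classify_patch` is a hypothetical stand-in for the CNN (it only thresholds the mean brightness), and clamping coordinates at the border is an assumption about how edge patches are formed.

```python
def classify_patch(patch):
    """Hypothetical stand-in for a CNN: label 1 if the patch is bright on average."""
    values = [v for row in patch for v in row]
    return 1 if sum(values) / len(values) > 0.5 else 0

def sliding_window_segment(image, patch_size=3):
    """Label every pixel by classifying the patch centered on it."""
    height, width = len(image), len(image[0])
    half = patch_size // 2
    labels = [[0] * width for _ in range(height)]
    pixel_reads = 0
    for y in range(height):
        for x in range(width):
            # Extract the patch, clamping coordinates at the image border.
            patch = [[image[min(max(y + dy, 0), height - 1)]
                           [min(max(x + dx, 0), width - 1)]
                      for dx in range(-half, half + 1)]
                     for dy in range(-half, half + 1)]
            pixel_reads += patch_size * patch_size
            labels[y][x] = classify_patch(patch)
    return labels, pixel_reads

image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]
labels, pixel_reads = sliding_window_segment(image)
```

The 12-pixel image triggers 12 separate classifier calls that together read 108 pixel values, because neighboring patches overlap almost entirely and nothing is shared between them.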
Fully Convolutional Network
• Input: 3 x H x W
• Convolutions: D x H x W
• Scores: C x H x W
• Predictions: H x W
• Loss function: Per-Pixel cross-entropy
Long et al, “Fully convolutional networks for semantic segmentation”, CVPR 2015
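A minimal sketch of the last two steps, assuming the scores arrive as a C x H x W nested list: predictions take the highest-scoring class at each pixel, and the loss is the cross-entropy computed at every pixel. Averaging the loss over pixels and the name `predict_and_loss` are assumptions made here.

```python
import math

def predict_and_loss(scores, target):
    """scores: C x H x W nested lists; target: H x W grid of class indices."""
    num_classes = len(scores)
    height, width = len(target), len(target[0])
    predictions = [[0] * width for _ in range(height)]
    total_loss = 0.0
    for y in range(height):
        for x in range(width):
            pixel_scores = [scores[c][y][x] for c in range(num_classes)]
            # Prediction: the highest-scoring class at this pixel (C x H x W -> H x W).
            predictions[y][x] = max(range(num_classes), key=lambda c: pixel_scores[c])
            # Cross-entropy: -log softmax(scores)[target], via a stable log-sum-exp.
            top = max(pixel_scores)
            log_sum = top + math.log(sum(math.exp(s - top) for s in pixel_scores))
            total_loss += log_sum - pixel_scores[target[y][x]]
    return predictions, total_loss / (height * width)

# Hypothetical scores for C=2 classes on a 1x2 image.
scores = [[[2.0, 0.0]],   # class 0 scores
          [[0.0, 0.0]]]   # class 1 scores
target = [[0, 1]]
predictions, loss = predict_and_loss(scores, target)
```

Because every pixel contributes its own loss term, a single forward pass over the whole image trains all pixel predictions at once.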
Fully Convolutional Network
Input: 3 x H x W
• Problem #1: Effective receptive field size is linear in number of conv layers: With L 3x3 conv layers, receptive field is 1+2L
• Problem #2: Convolution on high res images is expensive!
Long et al, “Fully convolutional networks for semantic segmentation”, CVPR 2015
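The 1+2L rule can be checked with a short receptive-field calculation. This is a sketch under stated assumptions: layers are described as hypothetical (kernel size, stride) pairs, and the second example inserts 2x2 stride-2 downsampling steps purely for comparison.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, applied in order."""
    field, jump = 1, 1   # jump = distance in input pixels between adjacent outputs
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field

# L stride-1 3x3 convolutions: the field grows linearly, 1 + 2L.
plain = [receptive_field([(3, 1)] * L) for L in range(1, 5)]

# The same four 3x3 convs, with a stride-2 downsampling step after each of the first two.
downsampled = receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (3, 1)])
```

Four plain 3x3 layers see only a 9x9 window, while the downsampled variant reaches 26x26 with the same number of convolutions, which is one motivation for downsampling inside the network.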
Fully Convolutional Network
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
• Input: 3 x H x W
• High-res: D1 x H/2 x W/2
• Med-res: D2 x H/4 x W/4
• Low-res: D3 x H/4 x W/4
• Med-res: D2 x H/4 x W/4
• High-res: D1 x H/2 x W/2
• Predictions: H x W
• Downsampling: Pooling, strided convolution
• Upsampling: ???
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
In-Network Upsampling: “Unpooling”
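The figure for this slide is not recoverable, so the following is only a sketch of two fixed (non-learned) schemes commonly grouped under unpooling: nearest-neighbor repetition and "bed of nails" zero filling. Both function names and the choice of the top-left position for the nail are assumptions.

```python
def nearest_neighbor_unpool(grid, factor=2):
    """Repeat every value into a factor x factor block."""
    return [[value for value in row for _ in range(factor)]
            for row in grid for _ in range(factor)]

def bed_of_nails_unpool(grid, factor=2):
    """Put every value in the top-left of a factor x factor block, zeros elsewhere."""
    height, width = len(grid), len(grid[0])
    out = [[0] * (width * factor) for _ in range(height * factor)]
    for y in range(height):
        for x in range(width):
            out[y * factor][x * factor] = grid[y][x]
    return out

grid = [[1, 2],
        [3, 4]]
nearest = nearest_neighbor_unpool(grid)
nails = bed_of_nails_unpool(grid)
```

Both turn a 2 x 2 input into a 4 x 4 output without any learned parameters.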
Upsampling: Bilinear Interpolation
Input: C x 2 x 2 Output: C x 4 x 4
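A sketch of bilinear upsampling for one channel, taking a 2 x 2 grid to 4 x 4. Several coordinate conventions exist; this version assumes corner pixels of input and output are aligned, which may differ from the convention on the original slide. For a C x 2 x 2 input, the same function would be applied to each channel.

```python
def bilinear_upsample(grid, out_height, out_width):
    """Bilinear interpolation with corner pixels aligned (one of several conventions)."""
    in_height, in_width = len(grid), len(grid[0])
    out = []
    for i in range(out_height):
        # Map the output row to a fractional source row.
        sy = i * (in_height - 1) / (out_height - 1)
        y0 = min(int(sy), in_height - 2)
        wy = sy - y0
        row = []
        for j in range(out_width):
            sx = j * (in_width - 1) / (out_width - 1)
            x0 = min(int(sx), in_width - 2)
            wx = sx - x0
            # Blend horizontally on the two neighboring rows, then vertically.
            top = (1 - wx) * grid[y0][x0] + wx * grid[y0][x0 + 1]
            bottom = (1 - wx) * grid[y0 + 1][x0] + wx * grid[y0 + 1][x0 + 1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out

channel = [[0.0, 3.0],
           [6.0, 9.0]]
upsampled = bilinear_upsample(channel, 4, 4)
```

Each output value is a distance-weighted blend of the four nearest input values, so the result varies smoothly between the original corner values.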
Transposed Convolution
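A minimal single-channel sketch of transposed convolution, assuming stride 2, no padding, and a fixed all-ones 3 x 3 kernel (in a real network the kernel is learned). Each input value scales a copy of the kernel that is placed in the output at stride-spaced positions, and overlapping copies are summed.

```python
def transposed_conv2d(grid, kernel, stride=2):
    """Each input value scales the kernel; the copies are summed where they overlap."""
    in_h, in_w = len(grid), len(grid[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (in_h - 1) * stride + k_h
    out_w = (in_w - 1) * stride + k_w
    out = [[0.0] * out_w for _ in range(out_h)]
    for y in range(in_h):
        for x in range(in_w):
            for ky in range(k_h):
                for kx in range(k_w):
                    out[y * stride + ky][x * stride + kx] += grid[y][x] * kernel[ky][kx]
    return out

grid = [[1.0, 2.0],
        [3.0, 4.0]]
kernel = [[1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0]]
out = transposed_conv2d(grid, kernel)
```

The 2 x 2 input becomes a 5 x 5 output, and the center value receives contributions from all four inputs because their kernel copies overlap there.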
Skip Connection
Tasks: Object Detection
Object Detection Progress
• “Slow” R-CNN
• Fast R-CNN
• Faster R-CNN
Tasks: Instance Segmentation
Instance Segmentation
• Instance Segmentation: Detect all objects in the image, and identify the pixels that belong to each object (e.g., two separate instances: Cow, Cow)
• Approach: Perform object detection, then predict a segmentation mask for each object!
Object Detection: Faster R-CNN
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NeurIPS 2015
Instance Segmentation: Mask R-CNN
• Mask Prediction
Mask R-CNN
• Classification Scores: C
• Box coordinates (per class): 4*C
Mask R-CNN: Very Good Results!
Summary: Computer Vision Tasks
• Classification: CAT
• Semantic Segmentation: GRASS, CAT, TREE, SKY
• Object Detection: DOG, DOG, CAT
• Instance Segmentation: DOG, DOG, CAT