02 Semantic Segmentation 2024

The document discusses semantic segmentation in computer vision, focusing on the classification of each pixel in an image without differentiating between instances. It highlights the challenges in data collection, evaluation metrics, and various methods such as Fully Convolutional Networks and Mask R-CNN for effective segmentation. Additionally, it contrasts semantic segmentation with other tasks like object detection and instance segmentation.

Lecture 11: Semantic Segmentation
Computer Vision Tasks

• Classification: CAT (no spatial extent)
• Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
• Object Detection: DOG, DOG, CAT (multiple objects)
• Instance Segmentation: DOG, DOG, CAT (multiple objects)

This image is CC0 public domain
So far: Image Classification

Vector: 4096
Fully-Connected: 4096 to 1000
Class Scores: Cat: 0.9, Dog: 0.05, Car: 0.01, ...

This image is CC0 public domain
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
Convolutional Neural Networks

Processing stages, from input to output:
• Input Image (or Input Feature Map)
• Convolution (Learned)
• Non-linearity
• Spatial pooling (Max)
• Normalization (feature maps before and after contrast normalization)
• Feature maps

Convolutional filters are trained in a supervised manner by back-propagating classification error.
Simplified architecture

Softmax layer:
$$P(c \mid x) = \frac{\exp(w_c \cdot x)}{\sum_{k=1}^{C} \exp(w_k \cdot x)}$$
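The softmax layer can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's code: the weight matrix `W` (one row `w_c` per class), the feature vector `x`, and their values are made-up toy assumptions.

```python
import numpy as np

def softmax_scores(W, x):
    """Return P(c | x) for every class c, given weights W (C x D) and features x (D,)."""
    logits = W @ x                   # w_c . x for each class c
    logits = logits - logits.max()   # shift for numerical stability; the ratio is unchanged
    exp = np.exp(logits)
    return exp / exp.sum()           # normalize over the C classes

# Hypothetical toy values: 3 classes, 4-dimensional feature vector.
W = np.array([[ 0.5, -0.2, 0.1, 0.0],
              [ 0.1,  0.3, 0.0, 0.2],
              [-0.3,  0.1, 0.4, 0.1]])
x = np.array([1.0, 2.0, 0.5, -1.0])
probs = softmax_scores(W, x)
```

Subtracting the maximum logit is a common stability trick; it cancels in the ratio, so the probabilities match the formula above.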
Tasks: Semantic Segmentation

Semantic Segmentation
This image is CC0 public domain

Label each pixel in the image with a category label

Don’t differentiate instances, only care about pixels

[Figure: two images with per-pixel labels. Left: Sky, Trees, Cat, Grass. Right: Sky, Trees, Cow, Grass.]
Evaluation metric

• Pixel classification!
• Accuracy?
   • Heavily unbalanced
• Intersection over Union
   • Average across classes and images
• Per-class accuracy
   • Average across classes and images
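The three metrics above can all be read off a confusion matrix. The sketch below is one possible implementation under stated assumptions: it handles a single image pair (averaging across images is left out), and it averages only over classes that occur in the ground truth. The function names and the toy label maps are hypothetical.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """Rows index ground-truth classes, columns index predicted classes."""
    idx = gt.ravel() * num_classes + pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def segmentation_metrics(gt, pred, num_classes):
    cm = confusion_matrix(gt, pred, num_classes)
    tp = np.diag(cm).astype(float)        # correctly labeled pixels per class
    gt_count = cm.sum(axis=1)             # pixels of each class in the ground truth
    pred_count = cm.sum(axis=0)           # pixels predicted as each class
    union = gt_count + pred_count - tp
    present = gt_count > 0                # assumption: skip classes absent from the ground truth
    pixel_acc = tp.sum() / cm.sum()
    per_class_acc = (tp[present] / gt_count[present]).mean()
    mean_iou = (tp[present] / union[present]).mean()
    return pixel_acc, per_class_acc, mean_iou

# Toy example: a small object (class 1) on a large background (class 0),
# and a prediction that labels everything as background.
gt = np.array([[0, 0, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 1, 1],
               [0, 0, 1, 1]])
pred = np.zeros_like(gt)
pixel_acc, per_class_acc, mean_iou = segmentation_metrics(gt, pred, num_classes=2)
# pixel_acc = 0.75, per_class_acc = 0.5, mean_iou = 0.375
```

The all-background prediction still scores 75% pixel accuracy, which is the imbalance problem the slide points at; per-class accuracy and mean IoU penalize the missed object.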
Challenges in data collection

• Precise localization is hard to annotate
• Annotating every pixel leads to heavy tails (many rare categories with few examples)
• Common solution: annotate few classes (often things), mark rest as “Other”
• Common datasets: PASCAL VOC 2012 (~1500 images, 20 categories), COCO (~100k images, 80 categories)
Example: TextonBoost

[Equation figure: a model of the label field given the image and model parameters, made up of a local data term and a smoothing term]

J. Shotton, J. Winn, C. Rother, and A. Criminisi, TextonBoost: Joint Appearance, Shape And Context Modeling For Multi-class Object Recognition And Segmentation, ECCV 2006.
Example: SuperParsing

• CRF energy function is defined on superpixels
• Unaries are based on nearest neighbor retrieval
• Pairwise potentials capture class co-occurrence statistics

J. Tighe and S. Lazebnik, SuperParsing: Scalable Nonparametric Image Parsing with Superpixels, ECCV 2010
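As a hedged illustration of the bullets above, a CRF energy over superpixels is commonly written as a sum of unary (data) terms and pairwise (smoothness) terms. The symbols are assumed notation rather than something taken from the slide: $s_i$ is a superpixel, $c_i$ its label, $\mathcal{A}$ the set of adjacent superpixel pairs, and $\lambda$ a smoothing weight.

```latex
E(\mathbf{c}) = \sum_{i} E_{\text{data}}(s_i, c_i)
              + \lambda \sum_{(i,j) \in \mathcal{A}} E_{\text{smooth}}(c_i, c_j)
```

In this reading, the unary term would come from nearest neighbor retrieval and the pairwise term from class co-occurrence statistics; the final labeling is the assignment $\mathbf{c}$ that minimizes the energy.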
Example: SuperParsing

[Figure: four panels: Original image, Maximum likelihood labeling, Edge penalties, Final labeling. Region labels include sky, road, tree, sea, and sand.]
Semantic segmentation using convolutional networks

[Figure: example segmentation with regions labeled person and bicycle]
Segmentation: Sliding Window

Full image -> Extract patch -> Classify center pixel with CNN -> Cow, Cow, Grass

Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
Segmentation: Sliding Window

Problem: Very inefficient! Not reusing shared features between overlapping patches
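The cost of the sliding-window approach is easy to see in a small sketch. Everything here is illustrative: `classify_patch` is a hypothetical stand-in for a CNN classifier, and `toy_classifier` is a made-up rule used only so the example runs.

```python
import numpy as np

def sliding_window_segment(image, classify_patch, patch_size=15):
    """Label every pixel by classifying the patch centered on it.

    `classify_patch` is a hypothetical stand-in for a CNN: it takes a
    (patch_size, patch_size, 3) array and returns an integer class label.
    """
    h, w, _ = image.shape
    r = patch_size // 2
    # Pad so that border pixels also get a full patch around them.
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    labels = np.zeros((h, w), dtype=np.int64)
    calls = 0
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + patch_size, x:x + patch_size]
            labels[y, x] = classify_patch(patch)   # one full classifier pass per pixel
            calls += 1
    return labels, calls

def toy_classifier(patch):
    # Made-up rule standing in for a real network.
    return int(patch.mean() > 0.5)

rng = np.random.default_rng(0)
image = rng.random((8, 10, 3))
labels, calls = sliding_window_segment(image, toy_classifier, patch_size=5)
```

The classifier runs once per pixel (H x W calls), and neighboring patches overlap almost entirely, so most of the computation is repeated.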

Fully Convolutional Network

Design a network as a bunch of convolutional layers to make predictions for pixels all at once!

Input (3 x H x W) -> Conv -> Conv -> Conv -> Conv (Convolutions: D x H x W) -> Scores (C x H x W) -> argmax -> Predictions (H x W)

Loss function: Per-Pixel cross-entropy

Long et al, “Fully convolutional networks for semantic segmentation”, CVPR 2015
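A minimal NumPy sketch of the last two stages, the per-pixel cross-entropy loss and the argmax over class scores, might look as follows. The scores are random stand-ins for network output, and averaging the loss over all pixels is an assumption; the slide names only the loss.

```python
import numpy as np

def per_pixel_cross_entropy(scores, target):
    """scores: (C, H, W) raw class scores; target: (H, W) integer labels."""
    # Log-softmax over the class axis, computed stably.
    shifted = scores - scores.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    h_idx, w_idx = np.indices(target.shape)
    picked = log_probs[target, h_idx, w_idx]   # log-probability of the true class at each pixel
    return -picked.mean()                      # assumption: average over all H x W pixels

rng = np.random.default_rng(0)
C, H, W = 4, 6, 8
scores = rng.normal(size=(C, H, W))            # stand-in for the network's C x H x W scores
target = rng.integers(0, C, size=(H, W))       # stand-in for ground-truth labels
loss = per_pixel_cross_entropy(scores, target)
predictions = scores.argmax(axis=0)            # H x W map of predicted labels
```

With all-zero scores every class is equally likely, so the loss equals log(C), which is a handy sanity check.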

Fully Convolutional Network

Design a network as a bunch of convolutional layers to make predictions for pixels all at once!

Input (3 x H x W) -> Conv -> Conv -> Conv -> Conv -> argmax

Problem #1: Effective receptive field size is linear in number of conv layers: With L 3x3 conv layers, receptive field is 1+2L

Problem #2: Convolution on high res images is expensive!

Long et al, “Fully convolutional networks for semantic segmentation”, CVPR 2015
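The 1+2L rule from Problem #1 can be checked with a short calculation. This sketch assumes stride-1 convolutions with no dilation; the function names are hypothetical.

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field (input pixels along one axis) of stacked stride-1 convolutions."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1   # each stride-1 layer widens the field by (kernel_size - 1)
    return rf

def layers_needed(target_rf, kernel_size=3):
    """Smallest number of stride-1 layers whose receptive field reaches target_rf."""
    layers = 0
    while receptive_field(layers, kernel_size) < target_rf:
        layers += 1
    return layers

rf_after_10 = receptive_field(10)      # 1 + 2*10 = 21
layers_for_224 = layers_needed(224)    # 112 layers to cover 224 pixels
```

Covering a 224-pixel-wide context with 3x3 layers alone would take 112 layers, all running at full resolution, which is why downsampling inside the network is attractive.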

Fully Convolutional Network

Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

Input: 3 x H x W
-> High-res: D1 x H/2 x W/2
-> Med-res: D2 x H/4 x W/4
-> Low-res: D3 x H/4 x W/4
-> Med-res: D2 x H/4 x W/4
-> High-res: D1 x H/2 x W/2
-> Predictions: H x W

Downsampling: Pooling, strided convolution
Upsampling: ???
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

In-Network Upsampling: “Unpooling”

Bed of Nails (Input: C x 2 x 2, Output: C x 4 x 4)

Input:    Output:
1 2       1 0 2 0
3 4       0 0 0 0
          3 0 4 0
          0 0 0 0

Nearest Neighbor (Input: C x 2 x 2, Output: C x 4 x 4)

Input:    Output:
1 2       1 1 2 2
3 4       1 1 2 2
          3 3 4 4
          3 3 4 4
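Both unpooling schemes are short to write in NumPy. The sketch below is an illustration with hypothetical function names; it reproduces the 2 x 2 to 4 x 4 examples above for a single channel.

```python
import numpy as np

def bed_of_nails(x, factor=2):
    """Place each input value at the top-left of its output block; fill the rest with zeros."""
    c, h, w = x.shape
    out = np.zeros((c, h * factor, w * factor), dtype=x.dtype)
    out[:, ::factor, ::factor] = x
    return out

def nearest_neighbor(x, factor=2):
    """Copy each input value into every position of its output block."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

x = np.array([[[1, 2],
               [3, 4]]])           # shape (C=1, 2, 2)
nails = bed_of_nails(x)            # shape (1, 4, 4)
nearest = nearest_neighbor(x)      # shape (1, 4, 4)
```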

Upsampling: Bilinear Interpolation

Input: C x 2 x 2

1 2
3 4

Output: C x 4 x 4

1.00 1.25 1.75 2.00
1.50 1.75 2.25 2.50
2.50 2.75 3.25 3.50
3.00 3.25 3.75 4.00

Use two closest neighbors in x and y to construct linear approximations
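The output values above can be reproduced with a small NumPy sketch. It assumes one particular coordinate convention, which the slide does not state: output pixel centers are mapped to source coordinates with a half-pixel offset and clamped to the input grid. The function name is hypothetical.

```python
import numpy as np

def bilinear_upsample(x, factor=2):
    """Upsample a (C, H, W) array by linear interpolation along y and x."""
    c, h, w = x.shape

    def coords(n_in):
        # Assumed convention: half-pixel centers, clamped to the valid range.
        src = (np.arange(n_in * factor) + 0.5) / factor - 0.5
        src = np.clip(src, 0, n_in - 1)
        lo = np.floor(src).astype(int)         # nearest neighbor below
        hi = np.minimum(lo + 1, n_in - 1)      # nearest neighbor above
        return lo, hi, src - lo                # weight of the upper neighbor

    y0, y1, wy = coords(h)
    x0, x1, wx = coords(w)
    # Interpolate along x on the two neighboring rows, then along y.
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy)[:, None] + bot * wy[:, None]

x = np.array([[[1.0, 2.0],
               [3.0, 4.0]]])       # shape (C=1, 2, 2)
up = bilinear_upsample(x)          # shape (1, 4, 4)
```

Other conventions (for example, aligning the corner pixels of input and output) give different interior values, so results should be compared only against a library configured the same way.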

Transposed Convolution
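The slide content for this topic did not survive extraction, so the following is only a generic 1D sketch of the operation, not the lecture's example. Each input value scales a copy of the kernel, the copies are written `stride` positions apart, and overlapping positions are summed. No padding or output cropping is applied, and the function name and toy values are assumptions.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride=2):
    """1D transposed convolution without padding or cropping."""
    k = len(kernel)
    out = np.zeros((len(x) - 1) * stride + k)
    for i, value in enumerate(x):
        # Stamp a scaled copy of the kernel; overlaps accumulate.
        out[i * stride : i * stride + k] += value * kernel
    return out

x = np.array([1.0, 2.0])
kernel = np.array([1.0, 10.0, 100.0])
y = transposed_conv1d(x, kernel, stride=2)
# y = [1, 10, 102, 20, 200]: the middle entry sums 100*1 and 1*2
```

Unlike bed-of-nails or bilinear upsampling, the kernel weights here would be learned, so the network can choose its own upsampling filter.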

Skip Connection

Tasks: Object Detection

Object Detection Progress

[Figure: progress from “Slow” R-CNN to Fast R-CNN to Faster R-CNN]

Figure copyright Ross Girshick, 2015. Reproduced with permission.
Tasks: Instance Segmentation

Instance Segmentation

Instance Segmentation: Detect all objects in the image, and identify the pixels that belong to each object

Approach: Perform object detection, then predict a segmentation mask for each object!

[Figure: image with two instances, each labeled Cow]

This image is CC0 public domain
Object Detection: Faster R-CNN

Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NeurIPS 2015

Instance Segmentation: Mask R-CNN

[Figure: detection pipeline with an added Mask Prediction branch]

He et al, “Mask R-CNN”, ICCV 2017
Mask R-CNN

CNN + RPN -> RoI Align -> 256 x 14 x 14 -> Conv -> 256 x 14 x 14 -> Conv -> Predict a mask for each of C classes: C x 28 x 28

Other outputs per region:
Classification Scores: C
Box coordinates (per class): 4*C
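Since the mask head outputs one 28 x 28 mask per class, a final instance mask has to be picked from the C candidates. One way to do this, sketched below, is to take the mask of the highest-scoring class, apply a per-pixel sigmoid, and threshold at 0.5. The function name, the random inputs, and the tensor layout are assumptions for illustration.

```python
import numpy as np

def select_instance_mask(class_scores, mask_logits, threshold=0.5):
    """class_scores: (C,); mask_logits: (C, 28, 28). Returns (class index, binary mask)."""
    cls = int(np.argmax(class_scores))                     # class chosen by the classification branch
    mask_prob = 1.0 / (1.0 + np.exp(-mask_logits[cls]))    # per-pixel sigmoid on that class's mask
    return cls, mask_prob > threshold

rng = np.random.default_rng(0)
class_scores = np.array([0.1, 0.7, 0.2])        # stand-in scores for C = 3 classes
mask_logits = rng.normal(size=(3, 28, 28))      # stand-in for the C x 28 x 28 mask output
cls, mask = select_instance_mask(class_scores, mask_logits)
```

In a full pipeline the 28 x 28 mask would then be resized to the predicted box; that step is omitted here.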

He et al, “Mask R-CNN”, ICCV 2017

Mask R-CNN: Very Good Results!

He et al, “Mask R-CNN”, ICCV 2017

Summary: Computer Vision Tasks

• Classification: CAT (no spatial extent)
• Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
• Object Detection: DOG, DOG, CAT (multiple objects)
• Instance Segmentation: DOG, DOG, CAT (multiple objects)

This image is CC0 public domain