Introduction to
Face Processing with Computer Vision
Gabriel Bianconi
Founder, Scalar Research
AI & Data Science Consulting Firm
Previously at the Stanford AI Lab
Agenda
• Theory
• Detection
• Recognition
• Other Tasks
• Practice
• Rapid Prototyping
• Scaling
Theory
Face Detection
Haar-Like Features
• Summarize image based on simple color patterns
• Manually determined feature extractors (kernels)
• Leveraged for first real-time face detector (2001)
Ref: Viola & Jones (2001). Image: Wikimedia
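The rectangle sums behind Haar-like features can be computed in constant time with an integral image. A minimal numpy sketch of a two-rectangle feature (all names here are illustrative, not from the original paper's code):

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over both axes, so any rectangle sum is O(1)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1], read off the integral image ii."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def two_rect_feature(img, r0, c0, h, w):
    """Left-minus-right two-rectangle Haar-like feature."""
    ii = integral_image(img.astype(np.int64))
    left = rect_sum(ii, r0, c0, r0 + h, c0 + w // 2)
    right = rect_sum(ii, r0, c0 + w // 2, r0 + h, c0 + w)
    return left - right

# A patch that is dark on the left, bright on the right
patch = np.zeros((4, 4), dtype=np.uint8)
patch[:, 2:] = 255
print(two_rect_feature(patch, 0, 0, 4, 4))  # -> -2040
```

A detector like Viola-Jones evaluates thousands of such features per window; the integral image is what makes that real-time.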
Histogram of Oriented Gradients (HOG)
• Summarize image by distribution of color gradients
• Gradient intensities and orientations represent edges, etc.
• Captures more information than simple Haar-like features
Ref: Shu et al. (2011)
Ref: Rojas et al. (2011)
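The core of HOG, a magnitude-weighted histogram of gradient orientations for one cell, can be sketched in a few lines of numpy (a simplification: full HOG also normalizes over blocks of neighboring cells):

```python
import numpy as np

def hog_cell_histogram(cell, n_bins=9):
    """Gradient-orientation histogram for one cell, weighted by magnitude."""
    gy, gx = np.gradient(cell.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180) degrees, as in Dalal & Triggs
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(orientation, bins=n_bins,
                           range=(0, 180), weights=magnitude)
    return hist

# A vertical edge: gradients point horizontally, so mass lands in the 0-degree bin
cell = np.tile([0, 0, 0, 0, 255, 255, 255, 255], (8, 1))
hist = hog_cell_histogram(cell)
print(hist.argmax())  # -> 0
```

Concatenating such histograms over a grid of cells yields the HOG descriptor used by classic sliding-window face detectors.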
R-CNN
• Introduces CNNs for object detection
• CNNs learn how to extract features from data
• Breakthrough in performance
• Beats previous SOTA methods by huge margin
• However, detection is extremely slow
Ref: Girshick et al. (2014).
CNN Features
Ref: Lee et al. (2009).
R-CNN
Ref: Girshick et al. (2014).
Fast R-CNN
• Improvement to R-CNN that leverages CNN for
classification and regression
• Aside from region proposals, the system is now end-to-end, vs. three components trained greedily
• Predictions are 200x+ faster, with better performance
• Region proposals are still a bottleneck; total inference time is ~2s
Ref: Girshick (2015).
Fast R-CNN
Ref: Girshick (2015).
Faster R-CNN
• Leverages CNN for region proposals as well
• “Region Proposal Network”
• Finally an end-to-end system with deep learning
• About 10x faster than Fast R-CNN, with better performance
• Total inference time is ~0.2s
Ref: Ren et al. (2016).
Faster R-CNN
Ref: Ren et al. (2016).
MTCNN
• Many models for face detection draw heavily from generalized object detection methods
• MTCNN, for example, trains a multi-task system for joint detection and alignment
Ref: Zhang et al. (2016).
MTCNN
Ref: Zhang et al. (2016).
RetinaFace
• The current SOTA method combines many
techniques such as multi-task learning
• R-CNN family uses a two-stage approach
(proposals → refinement)
• RetinaFace uses a single-stage approach (faster,
higher recall, more false positives)
Ref: Deng et al. (2019).
RetinaFace
Ref: Deng et al. (2019).
Are we there yet?
• WIDER Face (Easy): ~97% AP
• WIDER Face (Medium): ~96% AP
• WIDER Face (Hard): ~92% AP
Ref: Yang et al. (2016).
Facial Recognition
Facial Recognition
• Facial recognition actually corresponds to a group of different tasks
• Verification vs. Identification vs. Grouping vs. …
• Closed-Set vs. Open-Set
Closed-Set Recognition
• Every identity appears in training set
• Example: recognizing celebrities
• Effectively a classification problem
• Model aims to learn separable features
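A minimal numpy sketch of the closed-set view: a hypothetical linear classifier over face features produces one softmax confidence per known identity (the weights and features here are random stand-ins for a trained model):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
features = rng.normal(size=4)       # stand-in for CNN face features
W = rng.normal(size=(3, 4))         # one row per known identity (closed set)
confidences = softmax(W @ features) # one confidence per identity

print(round(confidences.sum(), 6))  # -> 1.0
print(confidences.argmax())         # index of the predicted identity
```

Because every identity is in the training set, the model only needs separable features, not a general-purpose similarity metric.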
Closed-Set Identification
[Diagram: Test Sample → Model → Label Confidences (Label 0, Label 1, …)]
Images: Wikimedia
Closed-Set Verification
[Diagram: Test Sample A, Test Sample B → Model → Label Confidences for each]
Images: Wikimedia
Open-Set Recognition
• Not every identity appears in training set
• Example: Facebook Photos
• Effectively a metric learning problem
• Model aims to learn large-margin features (embeddings)
Embeddings
• Map each sample to a vector (coordinate system)
• Used for words, graphs, faces, etc.
• Embeddings preserve similarity
• Similar samples close to each other
• Dissimilar samples far from each other
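A toy numpy sketch of that idea: identify a query by its nearest embedding. The embeddings and names are made up for illustration; real face embeddings are typically 128-D or larger.

```python
import numpy as np

# Toy 4-D embeddings (real face embeddings are typically 128-512-D)
known = {
    "alice": np.array([0.9, 0.1, 0.0, 0.2]),
    "bob":   np.array([0.0, 0.8, 0.7, 0.1]),
}

query = np.array([0.85, 0.15, 0.05, 0.25])  # another photo of "alice"

# Similar samples are close: identify by smallest L2 distance
distances = {name: np.linalg.norm(query - emb) for name, emb in known.items()}
best = min(distances, key=distances.get)
print(best)  # -> alice
```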
Images: Wikimedia
Embeddings
• “Similar” depends on the training data
• Same person, physical characteristic, etc.
• Embeddings represent latent information
• High-dimensional embeddings trained on large datasets
learn to represent latent information about the person (e.g.
physical characteristics)
Open-Set Identification
[Diagram: Test Sample → Model → Embedding, compared by distance to known embeddings (Emb. 0, Emb. 1, Emb. 2, …)]
Images: Wikimedia
Open-Set Verification
[Diagram: Test Sample A → Model → Embedding A; Test Sample B → Model → Embedding B; distance compared vs. threshold]
Images: Wikimedia
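A minimal sketch of open-set verification, assuming L2 distance and a fixed threshold (0.6 happens to be the default tolerance in the face_recognition library; the embeddings here are toy values):

```python
import numpy as np

def verify(emb_a, emb_b, threshold=0.6):
    """Same identity iff the embedding distance is below the threshold."""
    dist = np.linalg.norm(np.asarray(emb_a) - np.asarray(emb_b))
    return bool(dist < threshold)

same = verify([0.1, 0.9, 0.2], [0.15, 0.85, 0.25])  # close pair
diff = verify([0.1, 0.9, 0.2], [0.9, 0.1, 0.8])     # far pair
print(same, diff)  # -> True False
```

In practice the threshold is tuned on a validation set to trade off false accepts against false rejects.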
Metric Learning
Ref: Liu et al. (2018)
Are we there yet?
LFW (Labeled Faces in the Wild)
Verification
99.85%+ accuracy
Ref: Yan et al. (2019); Learned-Miller et al. (2016)
Cross-Factor
Facial Recognition
42
Cross-Age
Ref: Zheng et al. (2017)
Cross-Pose
Ref: Li et al. (2011)
Cross-Makeup
Ref: Chen et al. (2013)
Further Research
Security
• How do we deal with adversarial users?
• Real face goes undetected or misclassified
• Fake face gets recognized
• Private data is extracted from model
•…
Security
Ref: Grigory Bakunov (2017)
Biometrics & Multi-Modal Data
• How do we deal with…
• Identical twins?
• Plastic surgery?
• ...
Ref: Singh et al. (2010)
Biometrics & Multi-Modal Data
• Combine with other biometric data
• Biometric traits (e.g. hand)
• Multiple sensors (e.g. 2D + 3D)
• Multiple pictures (e.g. viewpoints, sequences)
•…
Ref: Singh et al. (2010); Ross & Jain (2004); Ross & Govindarajan (2005)
Ref: Apple
Privacy
• How do we deal with…
• Models that can predict gender, race, …?
• Models that leak the data?
• Predictions without sharing the raw data?
•…
Ref: Singh et al. (2010)
Other Tasks
Alignment & Pose Estimation
Ref: Ruiz et al. (2018)
Face Landmarks
Classification
[Example images labeled: Neutral, Happy, Happy]
3D Reconstruction
Ref: Sela et al. (2017)
Practice
Rapid Prototyping
Dozens of Tools
[Chart: face-processing tools plotted by simplicity vs. accuracy, e.g. RetinaFace, InsightFace, MTCNN, FaceNet, OpenCV, face_recognition, …]
APIs
• There are dozens of APIs providing low-cost face
processing at scale
• Most services charge less than $1 per 1000 images
• Depending on the use case, might be cheaper than provisioning GPUs
and deploying your own models (esp. if considering developer time)
• Often these APIs can achieve performance that’s
close to state-of-the-art
APIs – Example: Azure
• Detection
• Classification
• Gender, age, emotion, hair, smile, eyes, glasses, makeup, …
• Landmarks
• Pose Estimation
• Recognition
• Verification, identification, grouping, similarity search, …
Embeddings
• Face embeddings are typically used for open-set
recognition systems
• They can be leveraged to quickly train models for
downstream tasks (e.g. classification)
• Tools
• face_recognition (GitHub): extremely fast, reliable for frontal faces
• FaceNet: based on deep learning, strong across the board
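As a sketch of such a downstream task, a nearest-centroid classifier over toy embeddings (the embeddings and labels are illustrative; in practice the inputs would be 128-D+ face encodings):

```python
import numpy as np

def nearest_centroid(train_embs, train_labels, query):
    """Classify a query embedding by its nearest per-label centroid."""
    labels = sorted(set(train_labels))
    centroids = {
        l: np.mean([e for e, y in zip(train_embs, train_labels) if y == l], axis=0)
        for l in labels
    }
    return min(labels, key=lambda l: np.linalg.norm(query - centroids[l]))

# Toy 2-D embeddings standing in for face encodings
embs = [np.array([0.1, 0.9]), np.array([0.2, 0.8]),
        np.array([0.9, 0.1]), np.array([0.8, 0.2])]
labels = ["smiling", "smiling", "neutral", "neutral"]
print(nearest_centroid(embs, labels, np.array([0.15, 0.85])))  # -> smiling
```

Because the embedding already encodes most of the useful structure, even very simple downstream models like this can work with little labeled data.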
Example – Facebook Photos
• Task: open-set face identification
• Strategy:
1. Detect faces and compute embeddings for known photos
of users; store for future use.
2. Whenever a photo is uploaded, do the same and compare
against known set.
Example – Detection
import face_recognition as fr
image = fr.load_image_file("file.jpg")
face_locations = fr.face_locations(image)
Ref: github.com/ageitgey/face_recognition
Example – Embedding
image = fr.load_image_file("file.jpg")
face_embedding = fr.face_encodings(image)[0]
Ref: github.com/ageitgey/face_recognition
Example – L2 Distance
Pairwise L2 distances between the four face images on the slide (labeled A-D here):

      A     B     C     D
 A    -   0.31  0.59  0.69
 B  0.31    -   0.52  0.63
 C  0.59  0.52    -   0.50
 D  0.69  0.63  0.50    -

Images: Wikimedia
Face Landmarks
• Face landmarks can also be quickly extracted with
pretrained models and used for a number of
downstream tasks.
Example – Face Landmarks
face_landmarks = fr.face_landmarks(image)[0]
print(face_landmarks.keys())
# left_eyebrow, right_eyebrow, bottom_lip, top_lip, …
Ref: github.com/ageitgey/face_recognition
Example – Snapchat Filters
• Task: face manipulation
• Strategy:
1. Detect face and localize landmarks in image
2. Add objects, reshape image, etc. based on landmarks
Example – Snapchat Filters
from PIL import Image, ImageDraw
…
pil_image = Image.fromarray(image)
d = ImageDraw.Draw(pil_image, 'RGBA')

lip_fill = (150, 0, 0, 128)  # shade of red, 50% alpha
d.polygon(face_landmarks['top_lip'], fill=lip_fill)
d.polygon(face_landmarks['bottom_lip'], fill=lip_fill)
Scaling
Bias
• People & Demographics
• Is your training set… Coworkers? Single location?
• Environment
• Does it cover… Day and night? Seasons? Lighting
conditions? Backgrounds?
• Sensors
• Did you consider… Diverse hardware? Calibration?
Viewpoint (angle)? Resolution? Occlusion?
Optimizations
• It is often easier to simplify the real-world task than to drastically improve ML models
Optimizations
Multiple model optimizations ($$$ in developer time, etc.)
[Chart: performance vs. time in weeks]
Optimizations
Install a new light ($)
[Chart: performance vs. time in weeks]
Risks
• What happens when your model makes a mistake?
• How can you deal with adversarial users?
• What is your threat model?
Other Considerations
• How do you handle…
• Model getting stale over time?
• Growing search space?
• Large amounts of real-time data?
• Detecting or tracking people vs. faces?
• Speed vs. cost vs. performance trade-offs?
Thank you.
gabriel@scalarresearch.com