Alberto Molinaro
Mean Squared Error (MSE)
For Regression problems
Punishes large errors heavily.
Sensitive to outliers. Squaring exaggerates large errors.
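A minimal NumPy sketch of MSE (the names y_true and y_pred are illustrative, not from the original post):

import numpy as np

def mse(y_true, y_pred):
    # Mean of squared differences; squaring amplifies large errors
    return np.mean((y_true - y_pred) ** 2)

# The single outlier (10 vs. 2) dominates the loss
print(mse(np.array([1.0, 2.0, 10.0]), np.array([1.1, 2.0, 2.0])))  # ~21.3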
Mean Absolute Error (MAE)
For Regression problems
Same as MSE, but with the absolute difference. No squaring.
More robust to noise. Linear penalty.
Harder to optimize due to non-smooth gradients.
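The same sketch with an absolute difference instead of a square (names are illustrative):

import numpy as np

def mae(y_true, y_pred):
    # Mean of absolute differences; the penalty grows linearly
    return np.mean(np.abs(y_true - y_pred))

# Same data as the MSE example: the outlier now contributes linearly
print(mae(np.array([1.0, 2.0, 10.0]), np.array([1.1, 2.0, 2.0])))  # 2.7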
Huber Loss
For Regression problems
Transition point δ: below it, the loss behaves like MSE (quadratic); above it, like MAE (linear).
(Plot shown with δ = 0.6.)
Combines MSE and MAE benefits.
Treat δ as a hyperparameter to tune.
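A minimal NumPy sketch of the piecewise definition (delta is the tunable transition point):

import numpy as np

def huber(y_true, y_pred, delta=1.0):
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2              # MSE-like below delta
    linear = delta * (err - 0.5 * delta)    # MAE-like above delta
    return np.mean(np.where(err <= delta, quadratic, linear))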
Binary Cross-Entropy (Log Loss)
For Binary Classification
(Plot: BCE as a function of the predicted probability, one curve for y=1 and one for y=0.)
Outputs interpretable probabilities.
High loss when the model is highly confident in a wrong prediction.
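A minimal NumPy sketch, assuming y_true in {0, 1} and p_pred a predicted probability:

import numpy as np

def bce(y_true, p_pred, eps=1e-12):
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# The confidently wrong prediction (y=1, p=0.01) dominates the loss
print(bce(np.array([1, 0, 1]), np.array([0.9, 0.1, 0.01])))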
Focal Loss
For Imbalanced Classification
When γ = 0, the curve matches BCE exactly.
Higher γ → more focus on hard examples.
(Curves plotted for γ = 0, 1, 5 against the model's confidence in the true class.)
Down-weights easy examples, helping focus learning on challenging cases.
Treat γ as a hyperparameter to tune.
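A minimal NumPy sketch of the binary form (omitting the optional class-weighting factor α often used alongside γ):

import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, eps=1e-12):
    p = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)  # confidence in the true class
    # gamma = 0 recovers plain BCE; larger gamma down-weights easy examples
    return -np.mean((1 - p_t) ** gamma * np.log(p_t))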
Categorical Cross-Entropy
For Multi-Class Classification
Only the predicted probability of the true class affects the loss.
For softmax + one-hot labels.
CCE generalizes BCE to multi-class problems.
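A minimal NumPy sketch, assuming softmax outputs and one-hot labels:

import numpy as np

def cce(y_onehot, p_pred, eps=1e-12):
    # Only the predicted probability of the true class enters the sum
    p = np.clip(p_pred, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

# One sample, three classes, true class = index 2
print(cce(np.array([[0, 0, 1]]), np.array([[0.1, 0.2, 0.7]])))  # -log(0.7) ≈ 0.357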
Sparse Categorical Cross-Entropy
For Multi-Class, Integer Labels
Only the predicted probability of the true class affects the loss.
Saves memory and computation.
Same as CCE but with integer labels instead of one-hot encoded labels.
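The same loss computed from integer labels, so no one-hot matrix is ever built (a minimal sketch):

import numpy as np

def sparse_cce(y_int, p_pred, eps=1e-12):
    # Integer labels index directly into the probability rows
    p = np.clip(p_pred, eps, 1.0)
    return -np.mean(np.log(p[np.arange(len(y_int)), y_int]))

# Same example as CCE, but the label is just the class index 2
print(sparse_cce(np.array([2]), np.array([[0.1, 0.2, 0.7]])))  # -log(0.7) ≈ 0.357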
Hinge Loss
For Classification, SVMs / Margin-based Learning
Wrong → strong penalty.
Correct but not confident → linear penalty.
Correct & confident → loss = 0.
(Plot shown with margin = 1.)
Encourages confident classification.
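A minimal NumPy sketch, assuming labels in {-1, +1} and raw decision scores:

import numpy as np

def hinge(y_true, scores):
    # Zero loss once the score is on the correct side of the margin (here 1)
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

# Correct & confident -> 0, correct but not confident -> linear, wrong -> strong penalty
print(hinge(np.array([1, 1, -1]), np.array([2.0, 0.3, 1.5])))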
Dice Loss
For Image Segmentation
The Dice score ranges from 0 (no overlap) to 1 (perfect match); its numerator is the overlap between the predicted and ground-truth masks.
Measures overlap between predicted and true masks.
Well suited to imbalanced pixel classes (e.g., medical imaging).
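A minimal NumPy sketch for binary masks (soft probability masks work the same way):

import numpy as np

def dice_loss(y_true, y_pred, eps=1e-7):
    intersection = np.sum(y_true * y_pred)  # the overlap (numerator)
    dice = (2.0 * intersection + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)
    return 1.0 - dice  # 0 = perfect match, 1 = no overlap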
IoU Loss (Jaccard Loss)
For Object Detection / Segmentation
IoU = overlap between predicted and ground truth (the numerator) divided by the union of the predicted and ground-truth regions.
Focuses on region-level accuracy.
Great when overlap quality is more important than pixel-level error.
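A minimal NumPy sketch, again assuming binary masks of the same shape:

import numpy as np

def iou_loss(y_true, y_pred, eps=1e-7):
    intersection = np.sum(y_true * y_pred)
    union = np.sum(y_true) + np.sum(y_pred) - intersection
    return 1.0 - (intersection + eps) / (union + eps)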
Kullback-Leibler (KL) Divergence
For Distribution Learning
P(i): target distribution; Q(i): predicted distribution, over output classes i.
The larger the mismatch between Q and P, the higher the KL divergence (poor fit); the closer they match, the lower it is (better fit).
Compares the predicted distribution against the target distribution.
High penalty when Q underestimates P.
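A minimal NumPy sketch, assuming P and Q are valid probability vectors over the same classes:

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)); large where Q underestimates P
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])                        # target distribution
print(kl_divergence(p, np.array([0.6, 0.3, 0.1])))   # better fit -> lower KL
print(kl_divergence(p, np.array([0.1, 0.3, 0.6])))   # poor fit -> higher KL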
Cosine Similarity Loss
For Text and Embedding Models
A smaller angle between embeddings (e.g., doc1, doc2, doc3) corresponds to higher cosine similarity.
Minimize the loss to maximize similarity (push the angle toward 0).
Cosine similarity is scale-invariant: only direction matters.
It ranges from −1 (opposite) to +1 (identical); in most NLP uses it stays within [0, 1].
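A minimal NumPy sketch using the common 1 − cos(θ) formulation (other variants exist):

import numpy as np

def cosine_similarity_loss(a, b, eps=1e-12):
    # Scale-invariant: only the direction of the vectors matters
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return 1.0 - cos_sim

print(cosine_similarity_loss(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # ~0 (same direction)
print(cosine_similarity_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 1 (orthogonal)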
Triplet Loss
For Image Recognition
Before vs. after training (embedding space): the positive moves closer to the anchor, the negative farther away.
Learn embeddings that cluster similar inputs.
Pull anchor and positive embeddings closer, and push negatives farther away than a margin α.
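A minimal NumPy sketch for a single (anchor, positive, negative) triplet using squared distances:

import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    # The positive must end up closer to the anchor than the negative by at least alpha
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + alpha)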
Contrastive Loss
For Image Recognition
In the embedding space, the positive is pulled toward the anchor while the negative is pushed beyond the margin m.
Minimizes same-class distance.
Maximizes different-class separation up to the margin.
m: threshold distance, the minimum distance enforced between dissimilar pairs. Treat it as a hyperparameter.
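A minimal NumPy sketch for a single pair, assuming same_class is 1 for similar pairs and 0 otherwise:

import numpy as np

def contrastive_loss(x1, x2, same_class, m=1.0):
    d = np.linalg.norm(x1 - x2)
    # Similar pairs: penalize any distance; dissimilar pairs: penalize being closer than m
    return same_class * d ** 2 + (1 - same_class) * max(0.0, m - d) ** 2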
What we saw in this post
Regression: MSE, MAE, Huber.
Classification: BCE, CCE, Focal, SCCE, Hinge.
Image Segmentation: Dice, IoU.
Distribution learning: KL divergence.
Representation learning: Triplet, Contrastive,
Cosine.
Other loss functions
Negative Log-Likelihood (NLL) Loss: the same objective as CCE, but applied directly to log-probabilities (e.g., log-softmax outputs).
Tversky Loss: Generalization of Dice/IoU.
Perceptual Loss: Compares feature maps
instead of pixels to better capture
perceptual similarity.
Wasserstein Loss: Measures distance between
distributions in a more stable way than KL,
core to Wasserstein GANs.
Poisson Loss: Suitable for count data,
assumes the target follows a Poisson
distribution (common in event modeling).
CTC (Connectionist Temporal Classification)
Loss: Enables sequence prediction without
aligned labels, key for speech, handwriting,
and OCR tasks.