Loss Functions
Alberto Molinaro

This post outlines various loss functions used in machine learning for different tasks, including regression, classification, image segmentation, and distribution learning. Key loss functions discussed include Mean Squared Error (MSE), Mean Absolute Error (MAE), Huber Loss, Binary Cross-Entropy, and others, each with specific characteristics and applications. It also mentions other loss functions, such as Negative Log-Likelihood and Wasserstein Loss, that serve specialized purposes in model training.

Mean Squared Error (MSE)


For Regression problems

Punishes large errors heavily.


Sensitive to outliers: squaring exaggerates large errors.
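
A minimal NumPy sketch of MSE (not from the original slides; function and array names are illustrative):

    import numpy as np

    def mse(y_true, y_pred):
        # Mean of squared residuals; squaring makes large errors dominate.
        return np.mean((y_true - y_pred) ** 2)

    print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 4.0])))  # ≈ 0.34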

Mean Absolute Error (MAE)


For Regression problems

Same as MSE but with the absolute difference: no squaring.

More robust to noise. Linear penalty.


Harder to optimize: the gradient is non-smooth at zero error.
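
A matching NumPy sketch of MAE, with the same illustrative names:

    import numpy as np

    def mae(y_true, y_pred):
        # Mean of absolute residuals; every unit of error costs the same.
        return np.mean(np.abs(y_true - y_pred))

    print(mae(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 4.0])))  # ≈ 0.4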

Huber Loss
For Regression problems

Transition point δ: below this threshold, the loss behaves like MSE (quadratic); above it, like MAE (linear).
[Plot: Huber loss curve with δ = 0.6]

Combines MSE and MAE benefits.


Treat δ as a hyperparameter to tune.
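
A minimal NumPy sketch, assuming the standard piecewise definition with threshold δ (delta); names are illustrative:

    import numpy as np

    def huber(y_true, y_pred, delta=0.6):
        # Quadratic (MSE-like) for |error| <= delta, linear (MAE-like) beyond it.
        err = np.abs(y_true - y_pred)
        quadratic = 0.5 * err ** 2
        linear = delta * (err - 0.5 * delta)
        return np.mean(np.where(err <= delta, quadratic, linear))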
Binary Cross-Entropy (Log Loss)
For Binary Classification
[Plot: BCE against the predicted probability, one curve for y = 1 and one for y = 0]

Outputs interpretable probabilities.


High loss if the model is highly confident in a wrong prediction.
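
A minimal NumPy sketch of BCE; the clipping epsilon is an implementation detail, not something from the slides:

    import numpy as np

    def bce(y_true, p_pred, eps=1e-7):
        # Clip predictions away from 0 and 1 so log() stays finite.
        p = np.clip(p_pred, eps, 1 - eps)
        return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

    # Confidently wrong prediction -> very large loss:
    print(bce(np.array([1.0]), np.array([0.01])))  # ≈ 4.6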

Focal Loss
For Imbalanced Classification

When γ = 0, the curve matches BCE exactly.
Higher γ → more focus on hard examples.
[Plot: focal loss for γ = 0, 1, 5 against the model's confidence in the true class]

Down-weights easy examples, helping focus learning on challenging cases.
Treat γ as a hyperparameter to tune.
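
A minimal NumPy sketch of focal loss for binary targets; the optional class-balancing factor α used in some implementations is omitted here, and names are illustrative:

    import numpy as np

    def focal_loss(y_true, p_pred, gamma=2.0, eps=1e-7):
        p = np.clip(p_pred, eps, 1 - eps)
        # p_t = model's confidence in the true class.
        p_t = np.where(y_true == 1, p, 1 - p)
        # (1 - p_t)^gamma down-weights easy (high-confidence) examples;
        # gamma = 0 recovers plain BCE.
        return -np.mean((1 - p_t) ** gamma * np.log(p_t))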

Categorical Cross-Entropy
For Multi-Class Classification

Only the predicted probability of the true class affects the loss.
[Plot: CCE against the predicted probability of the true class]

For softmax + one-hot labels.


CCE generalizes BCE to multi-class problems.
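
A minimal NumPy sketch of CCE, assuming softmax probabilities and one-hot targets (names are illustrative):

    import numpy as np

    def categorical_cross_entropy(y_onehot, probs, eps=1e-7):
        # probs: softmax outputs, shape (batch, classes); y_onehot: one-hot targets.
        probs = np.clip(probs, eps, 1.0)
        # Only the probability assigned to the true class contributes to the sum.
        return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))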
Sparse Categorical Cross-Entropy
For Multi-Class, Integer Labels

Only the predicted probability of the true class affects the loss.
[Plot: same loss curve as CCE]

Saves memory and computation.


Same as CCE but with integer labels instead of one-hot encoded labels.
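
A minimal NumPy sketch of the sparse variant, assuming integer class labels:

    import numpy as np

    def sparse_categorical_cross_entropy(labels, probs, eps=1e-7):
        # labels: integer class indices, shape (batch,); no one-hot encoding needed.
        probs = np.clip(probs, eps, 1.0)
        return -np.mean(np.log(probs[np.arange(len(labels)), labels]))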

Hinge Loss
For Classification, SVMs / Margin-based Learning

Wrong prediction → strong penalty.
Correct but not confident → linear penalty.
Correct and confident → loss = 0.
[Plot: hinge loss with margin = 1]

Encourages confident classification.
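
A minimal NumPy sketch, assuming labels in {-1, +1} and raw decision scores:

    import numpy as np

    def hinge_loss(y_true, scores):
        # Loss is 0 once y * score clears the margin of 1; otherwise it grows linearly.
        return np.mean(np.maximum(0.0, 1.0 - y_true * scores))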



Dice Loss
For Image Segmentation
[Diagram: overlap (the numerator) between the predicted and ground-truth masks; the Dice coefficient is 0 for no overlap and 1 for a perfect match]

Measures overlap between predicted and true masks.
Well suited to unbalanced pixel classes (e.g., medical imaging).
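
A minimal NumPy sketch of Dice loss on soft masks; the smoothing term is an implementation detail added to avoid division by zero:

    import numpy as np

    def dice_loss(y_true, y_pred, smooth=1.0):
        # y_true, y_pred: flattened masks with values in [0, 1].
        intersection = np.sum(y_true * y_pred)
        dice = (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)
        return 1.0 - dice  # 0 = perfect match, approaches 1 with no overlap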

IoU Loss (Jaccard Loss)


For Object Detection / Segmentation
[Diagram: overlap (the numerator) vs. the union of the predicted and ground-truth regions]

Focuses on region-level accuracy.


Great when overlap quality is more important than pixel-level error.
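
A minimal NumPy sketch of IoU (Jaccard) loss on soft masks, again with a smoothing term added for numerical safety:

    import numpy as np

    def iou_loss(y_true, y_pred, smooth=1.0):
        intersection = np.sum(y_true * y_pred)
        union = np.sum(y_true) + np.sum(y_pred) - intersection
        return 1.0 - (intersection + smooth) / (union + smooth)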
Kullback-Leibler (KL) Divergence
For Distribution Learning
P(i): target distribution; Q(i): predicted distribution.
[Diagram: two P(i) vs. Q(i) bar charts over output classes i, one with higher KL divergence (poor fit) and one with lower KL divergence (better fit)]

Compare predicted vs target distributions.


High penalty when Q underestimates P.
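
A minimal NumPy sketch of KL divergence between two discrete distributions P and Q (arrays of class probabilities; the epsilon clipping is an implementation detail):

    import numpy as np

    def kl_divergence(p, q, eps=1e-7):
        # Sum over classes of P(i) * log(P(i) / Q(i)).
        p = np.clip(p, eps, 1.0)
        q = np.clip(q, eps, 1.0)
        return np.sum(p * np.log(p / q))

    # Q underestimating a high-probability class of P is punished hard:
    print(kl_divergence(np.array([0.9, 0.1]), np.array([0.1, 0.9])))  # ≈ 1.76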

Cosine Similarity Loss


For Text and Embedding Models
A smaller angle between embeddings corresponds to higher cosine similarity.
[Diagram: doc1, doc2, doc3 as vectors in embedding space]

Minimize loss, maximize similarity (push the angle toward 0).
Cosine similarity is scale-invariant: only direction matters.
Ranges from −1 (opposite) to +1 (identical); in most NLP uses: [0, 1].
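
A minimal NumPy sketch of one common formulation, loss = 1 − cosine similarity (names are illustrative):

    import numpy as np

    def cosine_similarity_loss(a, b):
        # 1 - cos(angle between a and b); scale-invariant, only direction matters.
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return 1.0 - cos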

Triplet Loss
For Image Recognition

[Diagram: embedding space before and after training, showing anchor, positive, and negative points]

Learn embeddings that cluster similar inputs.


Pull anchor and positive embeddings closer, and push negatives farther than a margin α.
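
A minimal NumPy sketch for a single triplet, assuming squared Euclidean distances in embedding space and margin α (alpha):

    import numpy as np

    def triplet_loss(anchor, positive, negative, alpha=0.2):
        # Squared distances between embedding vectors.
        d_pos = np.sum((anchor - positive) ** 2)
        d_neg = np.sum((anchor - negative) ** 2)
        # Zero loss once the negative is farther than the positive by at least alpha.
        return np.maximum(0.0, d_pos - d_neg + alpha)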
Contrastive Loss
For Image Recognition
Push negatives beyond the margin m.
[Diagram: embedding space with an anchor-positive pair and an anchor-negative pair separated by margin m]

Minimizes same-class distance.


Maximizes different-class separation within a margin.
m: threshold distance, the minimum distance required between dissimilar pairs; treat it as a hyperparameter.
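
A minimal NumPy sketch for a single pair, assuming a binary same-class label and margin m:

    import numpy as np

    def contrastive_loss(x1, x2, same_class, m=1.0):
        # same_class: 1 for a similar pair, 0 for a dissimilar pair.
        d = np.linalg.norm(x1 - x2)
        # Similar pairs are pulled together; dissimilar pairs are pushed beyond margin m.
        return same_class * d ** 2 + (1 - same_class) * np.maximum(0.0, m - d) ** 2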

What we saw in this post


Regression: MSE, MAE, Huber.
Classification: BCE, CCE, Focal, SCCE, Hinge.
Image Segmentation: Dice, IoU.
Distribution learning: KL divergence.
Representation learning: Triplet, Contrastive, Cosine.
Other loss functions
Negative Log-Likelihood (NLL) Loss: generalization of CCE when working directly with log-probabilities.
Tversky Loss: generalization of Dice/IoU.
Perceptual Loss: compares feature maps instead of pixels to better capture perceptual similarity.
Wasserstein Loss: measures distance between distributions in a more stable way than KL; core to Wasserstein GANs.
Poisson Loss: suitable for count data; assumes the target follows a Poisson distribution, common in event modeling (see the sketch after this list).
CTC (Connectionist Temporal Classification) Loss: enables sequence prediction without aligned labels; key for speech, handwriting, and OCR tasks.
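
As an example of the Poisson loss mentioned above, a minimal NumPy sketch, assuming the usual negative log-likelihood form with the constant log(y!) term dropped:

    import numpy as np

    def poisson_loss(y_true, y_pred, eps=1e-7):
        # Negative log-likelihood of a Poisson target with predicted rate y_pred.
        return np.mean(y_pred - y_true * np.log(y_pred + eps))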
