OOD-SEG: Out-of-Distribution Detection for Image Segmentation with Sparse Multi-Class Positive-Only Annotations
Preprint
Junwen Wang^a,∗, Zhonghao Wang^a, Oscar MacCormac^a,b, Jonathan Shapey^a,b, Tom Vercauteren^a
^a School of Biomedical Engineering & Imaging Sciences, King's College London, UK
^b Department of Neurosurgery, King's College Hospital, London, UK
Despite significant advancements, segmentation based on deep neural networks in medical and surgical imaging faces several challenges, two of which we aim to address in this work. First, acquiring complete pixel-level segmentation labels for medical images is time-consuming and requires domain expertise. Second, typical segmentation pipelines cannot detect out-of-distribution (OOD) pixels, leaving them prone to spurious outputs during deployment. In this work, we propose a novel segmentation approach exploiting OOD detection that learns only from sparsely annotated pixels from multiple positive-only classes. These multi-class positive annotations naturally fall within the in-distribution (ID) set. Unlabelled pixels may contain positive classes but also negative ones, including what is typically referred to as background in standard segmentation formulations. Here, we forgo the need for background annotation and consider these together with any other unseen classes as part of the OOD set. Our framework can integrate, at a pixel-level, any OOD detection approaches designed for classification tasks. To address the lack of existing OOD datasets and established evaluation metrics for medical image segmentation, we propose a cross-validation strategy that treats held-out labelled classes as OOD. Extensive experiments on both multi-class hyperspectral and RGB surgical imaging datasets demonstrate the robustness and generalisation capability of our proposed framework.

Keywords: Weakly supervised learning, Positive-Unlabelled learning, One-class classification, Out-of-distribution detection, Hyperspectral imaging, Semantic segmentation
Similarly, the Dresden Surgical Anatomy Dataset (DSAD) offers sparse positive-only annotations for RGB surgical imaging (Carstens et al., 2023). Yet, the proper application of WSL approaches to such cases lacking background class annotations remains an open question. WSL also preserves less information compared to dense annotations, losing supervisory signal for some object structures. Such difficulties make the training process from sparse positive-only labels challenging.

Furthermore, to deploy a fully automated system in a safety-critical environment, the system should not only be able to produce reliable results in a known context, but should also be able to flag situations in which it may fail (Amodei et al., 2016; European Commission, 2024). Conventional segmentation frameworks follow the assumption that all training data and testing data are drawn from the same distribution and are thus considered in-distribution (ID). Under this assumption, at inference, the model should only be used in a similar context, which may imply limiting the acquisition hardware and the presence of unexpected classes such as a new model of surgical instrument. This poses a safety issue when trying to deploy the model for real-world clinical use. Out-of-distribution (OOD) detection may thus be considered a mandatory feature in many clinical applications. It is an active research topic in many classification tasks (Hendrycks and Gimpel, 2017; Liang et al., 2017; Lee et al., 2018; Hsu et al., 2020), but has rarely been exploited in medical imaging (Lambert et al., 2024).

We argue that the two challenges outlined above, sparse annotations and the need for OOD detection, share enough similarities to address them under a single methodological approach. In medical image segmentation with sparse annotations, the absence of an annotation does not necessarily imply that a region is identified as negative. Two other possibilities could explain why a positive pixel remains unlabelled: 1) it may be deemed ambiguous by the annotator; or 2) it may simply be skipped due to time constraints. The most straightforward, albeit wrong, approach to handling unlabelled data would be to assume that all such data belongs to the negative or background class. In contrast, positive-unlabelled (PU) learning (Bekker and Davis, 2020) assumes that an unlabelled example could belong to either the positive or the negative class. Most existing work in PU learning focuses on binary classification problems rather than multi-class ones. PU learning can be seen as a specific case within the broader domain of OOD detection. Given the absence of the negative class, traditional PU learning methods are frequently formulated as one-class semi-supervised learning problems (Yang et al., 2024). However, research on segmentation within the frameworks of both PU learning and OOD detection is limited. Image segmentation problems often require multi-class learning, for which few PU-learning approaches have been proposed. This scarcity of published work is also partly due to the lack of OOD-based evaluation protocols and publicly available benchmark datasets. One potential solution could be to use a different dataset as OOD data during testing (Karimi and Gholipour, 2023). However, this approach poses significant challenges as annotating multiple medical datasets is labour-intensive and requires domain-specific expertise for each dataset.

In this paper, we propose a simple but effective medical image segmentation framework to achieve pixel-level OOD detection using sparsely annotated data from positive-only classes. Our framework effectively learns feature representations from sparsely annotated labels, enabling reliable detection of OOD pixels with classical OOD approaches (Yang et al., 2024) designed for classification purposes. This allows for state-of-the-art OOD detection performance without compromising the classification accuracy for ID classes.

To evaluate model performance on OOD data, we propose a protocol that involves isolating part of the labelled classes during training. These held-out annotations do not contribute to updating the model weights during training but are grouped as an additional outlier class for validation purposes. To effectively evaluate OOD performance for segmentation tasks, we propose using two threshold-independent metrics to measure model performance. Building on these metrics, we further design a threshold selection strategy to visualise OOD segmentation results.

Based on our framework, we compare four different classical OOD detection methods integrated in a common U-Net based backbone segmentation model. Our cross-validation results show that combining a model calibration method with the proposed framework achieves the best overall performance.

Our contributions are threefold:

• We introduce a novel framework based on positive-only learning for multi-class medical image segmentation. Our approach effectively segments negative/OOD data without compromising performance for multi-class positive/ID data.

• To assess model performance in both ID and OOD scenarios, we propose a two-level cross-validation method and metrics for evaluation. The cross-validation is based on both subjects/patients and classes present in the dataset. Our evaluation approach eliminates the need for an additional OOD testing set.

• The proposed framework can seamlessly incorporate any given OOD detection method or backbone architecture. In particular, we introduce a novel convolutional adaptation of the GODIN method, extending its applicability to segmentation tasks within our framework.

To the best of our knowledge, this represents the first work to address the setting of positive-only learning for multi-class medical image segmentation.

2. Related works

2.1. Medical image segmentation with sparse annotation

Existing WSL methods utilise sparse annotations at different levels, including image-level annotations (Kuang et al., 2024), bounding boxes (Wang and Xia, 2021; Wang et al., 2018; Xu et al., 2014), scribbles (Can et al., 2018; Wang et al., 2019c), points (Glocker et al., 2013; Qu et al., 2019; Dorent et al., 2021) and 2D slices within a 3D structure (Bitarafan et al., 2021;
Cai et al., 2023). These methods use weak labels as supervision signals to train the model and produce a full segmentation mask for the test image. Specifically, Glocker et al. (2013) introduced a semi-automatic labelling strategy that transforms sparse point-wise annotations into dense probabilistic labels for vertebrae localisation and identification; Xu et al. (2014) proposed to segment both healthy and cancerous tissue from colorectal histopathological biopsies using bounding boxes; Wang et al. (2018) reported improved CNN performance on sparsely annotated input through image-specific fine-tuning; and Wang et al. (2019c) combined sparsely annotated input with a CNN through geodesic distance transforms, followed by a resolution-preserving network resulting in better dense prediction. However, all of these methods primarily focused on addressing partial or incomplete annotations, thereby overlooking the context in which no background annotations are present.

2.2. Learning from positive-only data

Positive and unlabelled (PU) learning considers a scenario where only a subset of positive data are labelled, while the unlabelled set contains both positive and negative data (Bekker and Davis, 2020). It is closely related to semi-supervised learning and positive-only learning.

Positive-only or one-class learning, illustrated in Figure 1, is a supervised method which involves learning a decision boundary that corresponds to a desired density level of the positive data distribution (Perera et al., 2021). Early approaches utilised statistical features to build one-class classifiers. For instance, Principal Component Analysis (PCA) (Bishop, 2006) or Kernel PCA identifies a lower-dimensional subspace that best represents the training data distribution. Leveraging robust feature extraction capabilities, some studies have integrated deep learning models into one-class learning methods. One such method, the Deep Support Vector Data Descriptor (DeepSVDD) (Ruff et al., 2018), learns a representation that encloses the embeddings of all positively labelled data within the smallest possible hyper-sphere. One-class CNN (Oza and Patel, 2019) uses zero-centered Gaussian noise in the latent space as the pseudo-negative class and trains a CNN to learn a decision boundary for the given class.

Positive-only learning extends the binary classification in one-class methods by learning decision boundaries for multiple classes of positive labelled data. However, very few studies have examined the multi-class setup in detail. In this work, we frame positive-only learning for image segmentation as a multi-class problem with pixel-level OOD detection.

2.3. Out-of-distribution detection

Several studies have explored OOD detection within the context of image classification (Hendrycks and Gimpel, 2017; Liang et al., 2017; Lee et al., 2018; Hsu et al., 2020). As an early example exploiting deep learning, Hendrycks and Gimpel (2017) proposed using the maximum softmax score as a baseline for OOD detection, based on the observation that correctly classified images tend to have higher softmax probabilities than erroneously classified examples. Liang et al. (2017) found that applying confidence calibration through temperature scaling (Guo et al., 2017) effectively separates ID and OOD images. Lee et al. (2018) suggested measuring the Mahalanobis distance between test image features and the training distribution from the penultimate convolutional layer of the model. Hsu et al. (2020) proposed decomposing the confidence score to learn temperature parameters during training.

Despite methodological advances and positive demonstrations for image classification purposes, usage of OOD detection in medical image segmentation is uncommon. Some studies hypothesise that this may be due to the lack of OOD-based evaluation protocols and the difficulty in gathering relevant data for them (Lambert et al., 2024; Bulusu et al., 2020). Recent research has attempted to address this issue by using other datasets as OOD examples. Karimi and Gholipour (2023) used two separate datasets: one for training the neural network and evaluating its performance on ID data, and another for testing specifically for OOD detection. González et al. (2022) collected four types of OOD datasets to account for different distribution shifts from ID data for a COVID-19 lung lesion segmentation task. However, acquiring an additional dataset that can be considered OOD is a difficult and time-consuming process. Therefore, a more scalable approach would be to establish both training and evaluation within a single dataset.

2.4. Uncertainty estimation in medical image segmentation

As illustrated in the previous section, several typical OOD detection approaches rely on estimating the uncertainty of a deep learning prediction (Lambert et al., 2024). Better uncertainty modelling could thus benefit OOD detection. Several uncertainty estimation approaches rely on measuring the empirical variance of the network predictions under a set of perturbations. Strategies to generate ensembles of predictions include using several deep learning models with: differences in model hyperparameters (Wenzel et al., 2020); random initialisation of the network parameters; random shuffling of the data points (Lakshminarayanan et al., 2017); and applying dropout at test time (Gal and Ghahramani, 2016). In medical image segmentation, uncertainty estimation has mostly been applied with binary classes. By way of example, Wang et al. (2019a) apply test-time augmentation to estimate aleatoric uncertainty for fetal brain and brain tumour segmentation from 2D and 3D Magnetic Resonance Images (MRI); Wang et al. (2019b) propose a CNN-based cascaded framework with test-time augmentation for brain tumour segmentation. Beyond prediction ensembling, recent studies have focused on providing better uncertainty prediction out of the box by calibrating the model uncertainty using dedicated loss functions. In particular, Liang et al. (2020) proposed an auxiliary loss term based on the difference between accuracy and confidence. Barfoot et al. (2024) extend the expected calibration error (Guo et al., 2017) to a differentiable loss function to train a segmentation model. However, none of the works in the medical imaging field have demonstrated the benefits of improved uncertainty calibration in the context of unlabelled or OOD data.

2.5. Segmentation of surgical spectral imaging data

Having looked at related work in the key methodological areas of interest, we now turn to the related work in the main
clinical application of interest in this work, namely hyperspectral imaging for surgical guidance. Early works on segmentation of surgical HSI data are based on traditional machine learning techniques (Ravì et al., 2017; Fabelo et al., 2018; Moccia et al., 2018). For example, Ravì et al. (2017) trained a Semantic Texton Forest (Shotton et al., 2008) on HSI embeddings generated using an adapted version of the t-distributed stochastic neighbour embedding approach (t-SNE) (van der Maaten and Hinton, 2008). Fabelo et al. (2018) proposed a hybrid framework utilising supervised and unsupervised learning techniques: the supervised classification map is obtained by using a pixel-wise Support Vector Machine (SVM) classifier that was spatially homogenised through k-nearest neighbours filtering; the authors then combined it with a segmentation map obtained via unsupervised clustering using a hierarchical k-means algorithm. However, the experiment was conducted on 5 HSI datasets and the separation between training, validation and testing is unclear.

The use of deep learning for biomedical segmentation using spectral imaging data is increasing (Khan et al., 2021). Most studies adopt standard U-Net and similar architectures (Ronneberger et al., 2015; Jégou et al., 2017) and train their model with patch-based or pixel-based input. Some works have looked at the impact of training models with different types of input spanning different levels of granularity such as pixels, patches and images (Seidlitz et al., 2022; Garcia Peraza Herrera et al., 2023). In (Seidlitz et al., 2022), the authors segmented 20 types of organs from 506 HSI hypercubes taken from 20 pigs. They compared the segmentation performance obtained by training the model with single pixels (no spatial context), patches and full HSI images with the same hyperparameter setup. They reported that the best performance was achieved with full HSI image input (Seidlitz et al., 2022). Similarly, Garcia Peraza Herrera et al. (2023) used the ODSI-DB dataset (Hyttinen et al., 2020), segmenting 35 dental tissues from 30 human subjects after data preprocessing and partitioning of the training and testing sets. They trained a deep learning model on full HSI images and on hyperspectral pixels with spatial context removed, reporting a baseline segmentation result. Recently, work by Martín-Pérez et al. (2024) compared various pixel-level classification algorithms for brain tissue differentiation. The study evaluated conventional algorithms, deep learning methods, and advanced classification models. Their findings highlighted that reducing the number of training pixels could improve performance, regardless of the dataset and classifiers.

Overall, available surgical HSI data remains limited in size, and the inherent complexity and variability of the surgical environment further complicate its analysis. Furthermore, the available annotations are sparse, as the data often consist of annotations on isolated pixels or small regions rather than comprehensive labelling of entire images (Zhu et al., 2022). While relevant, none of the previous works have demonstrated effective methods for leveraging sparse, positive-only annotations.

3. Material and methods

This section starts by describing the HSI and RGB imaging datasets and associated annotations that serve as a foundation and motivation for this work (Section 3.1). We then describe our proposed learning framework for sparse multi-class positive-only medical image segmentation (Section 3.2). Lastly, we introduce our proposed OOD-focused evaluation framework (Section 3.3), evaluation metrics (Section 3.4), and threshold selection method for negative/OOD detection (Section 3.5).

3.1. Datasets

Hyperspectral imaging (HSI) and multispectral imaging are emerging optical imaging techniques that collect and process spectral data distributed across a number of wavelengths (Shapey et al., 2019). By splitting light into numerous narrow spectral bands beyond what human vision can observe, HSI captures details invisible to the naked eye. This technique gathers diagnostic data about tissue properties, allowing for objective characterisation of tissues without the use of any external contrast agents. Recently, several HSI databases have been released as open access, thereby easing research into medical HSI analysis (Studier-Fischer et al., 2023; Hyttinen et al., 2020; Fabelo et al., 2016).

The Heidelberg Porcine HyperSPECTRAL Imaging (Heiporspectral) dataset (Studier-Fischer et al., 2023) comprises 5758 hyperspectral images with a resolution of 480 × 640 acquired over the 500-1000 nm wavelength range. Hyperspectral images were captured using the TIVITA tissue hyperspectral camera system, which provides 100 spectral bands for each image. For consistency across all hyperspectral datasets used in this study, for each dataset we sample 16 bands in the available wavelength range at equal intervals. The background-free, sparse annotations include 20 physiological porcine organs, obtained from a total of 11 pigs. For each organ, annotations are distributed across 8 pigs. In each acquired organ image series, representative image regions of the 20 structures present in the respective series were annotated.

The Oral and Dental Spectral Image Database (ODSI-DB) (Hyttinen et al., 2020) contains 316 hyperspectral images of 30 human subjects, of which 215 have annotations. Images have varied resolutions and wavelength ranges due to two different cameras being used in the study: 59 annotated images were taken with a Nuance EX (CRI, PerkinElmer, Inc., Waltham, MA, USA) and 156 were obtained with a Specim IQ (Specim, Spectral Imaging Ltd., Oulu, Finland). The pictures taken by the Nuance EX contain 51 spectral bands (450–950 nm with 10 nm bands) and spatial resolution 1392 × 1040; those captured by the Specim IQ have 204 bands (400–1000 nm with approximately 3 nm steps) and spatial resolution 512 × 512. Some images are further cropped to ensure the anonymity of the testing subject. To alleviate the discrepancy from the camera setup, we sample 16 bands at equal intervals in the available range. We resize all images to a spatial size of 512 × 512 by either centrally cropping or padding the image. Annotations for these 215 images are sparse and background-free, and the number of annotated pixels varies from image to image. The annotated pixels can belong to 35 possible dental tissues, which do not contain the background class. Inspection of this dataset shows that the majority of classes are underrepresented. We select classes with at least 1 million pixel samples and discard the remaining classes.
Figure 2. Overview of the proposed OOD-SEG framework. During the training stage, only annotated pixels for the multiple positive classes are used to update the model weights via the cross-entropy loss L_CE. We define a confidence score S correlated with the probability distribution over the ID classes; S can be instantiated by multiple OOD detection methods (e.g. S̃_maha computed from d_maha(z, µ, Σ), or S̃_godin computed from h(x) and g(x); see bottom-left block). At the inference stage, we compute the maximum probability max_c softmax(S_c) followed by thresholding with a pre-selected threshold τ_m to obtain the predicted mask.
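To make the inference rule summarised in the caption concrete, the following is a minimal PyTorch sketch, assuming the per-pixel confidence scores S (e.g. softmax probabilities over the C positive classes) have already been computed; the function name and OOD label value are illustrative, not part of any released code.

```python
import torch

OOD_LABEL = -1  # illustrative value used to mark rejected (negative/OOD) pixels


def ood_segment(scores: torch.Tensor, tau: float) -> torch.Tensor:
    """Assign each pixel its most confident positive class, or reject it as OOD.

    scores: per-pixel confidence S of shape [C, H, W] over the C positive (ID) classes.
    tau:    pre-selected confidence threshold (tau_m in the text).
    """
    max_score, pred_class = scores.max(dim=0)   # both [H, W]
    pred_class[max_score < tau] = OOD_LABEL     # threshold max_c softmax(S_c)
    return pred_class


# Example: 5 positive classes on a 4x4 image
mask = ood_segment(torch.softmax(torch.randn(5, 4, 4), dim=0), tau=0.7)
```

Pixels whose maximum confidence falls below τ are thus rejected as negative/OOD rather than being forced into one of the positive classes.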
perturbation for two primary reasons. First, we aimed to simplify the training to allow fairer and more reliable comparisons across OOD approaches. Second, the original paper reported only minor improvements from applying adversarial perturbations, and these came at a significant computational cost (Liang et al., 2017).

Mahalanobis. Lee et al. (2018) propose an OOD mechanism based on a statistical analysis of features observed in each ID class. Let φ(x) be some pixel-level features obtained from intermediate layers of the network where, as before, the dependence on pixel location is dropped for brevity. We chose the features before the segmentation head as our intermediate features in this study. The class-conditioned distributions of the features are modelled as Gaussians with a class-specific mean µ_c and a tied, i.e. class-independent, covariance matrix Σ. A first scoring S̃_maha is obtained by computing the negative Mahalanobis distance between a prediction feature and each class Gaussian:

S̃_c^maha = −(φ(x) − µ_c)^T Σ^{−1} (φ(x) − µ_c)   (4)

To make a head-to-head comparison fairer and easier across OOD methods, we apply a softmax operator to the S̃_maha Mahalanobis scores and obtain normalised final scores:

S_maha = softmax([S̃_1^maha, ..., S̃_C^maha])   (5)

We note that this use of the softmax is not advocated by Lee et al. (2018) nor is it strictly necessary. We however found it to have no measurable impact on the performance while it helped provide more consistency in evaluation and mask visualisation. We thus use it in our subsequent experiments. Furthermore, as with our use of ODIN, to ensure a fair comparison and to reduce computational burden, we did not incorporate the adversarial perturbation and feature ensembling calibration techniques initially proposed in (Lee et al., 2018).

The mean vectors and covariance matrix in Equation (4) are dataset-wide parameters. To alleviate the computational burden associated with estimating µ_c and Σ at once from all pixel-level features extracted across the entire training dataset, we first compute the per-class mean and a shared covariance for each image in the training set through a spatial averaging procedure. These image-level estimates are then aggregated using standard reduction to produce the dataset-level estimates of µ_c and Σ.

Generalised ODIN (GODIN). Hsu et al. (2020) proposed a dividend and divisor structure for OOD detection that learns a temperature scaling function g(x) during training. Assuming a trivial extension for pixel-wise operation and dropping the dependence on pixel location from the equation for brevity, the unnormalised scoring is expressed per class as:

S̃_c^godin = h_c(x) / g(x)   (6)
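This excerpt does not spell out the layers of our convolutional GODIN adaptation. As a minimal sketch, assuming 1×1 convolutions for both the dividend h_c(x) and the divisor g(x), and a sigmoid after batch normalisation to keep g(x) in (0, 1) as in the original image-level GODIN design (Hsu et al., 2020), a pixel-wise head could look as follows (the layer choices are our assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn


class PixelwiseGODINHead(nn.Module):
    """Sketch of a dividend/divisor (GODIN-style) segmentation head.

    Assumes 1x1 convolutions so that h_c(x) and g(x) are computed
    independently at every pixel of the decoder feature map; the exact
    layer choices in the paper's adaptation may differ.
    """

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.h = nn.Conv2d(in_channels, num_classes, kernel_size=1)  # dividend h_c(x)
        self.g = nn.Sequential(                                       # divisor g(x) in (0, 1)
            nn.Conv2d(in_channels, 1, kernel_size=1),
            nn.BatchNorm2d(1),
            nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: [B, in_channels, H, W] decoder features
        # returns unnormalised scores S̃_godin of shape [B, num_classes, H, W]
        return self.h(feats) / self.g(feats)


# Usage: scores = PixelwiseGODINHead(64, 5)(decoder_features)
```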
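Similarly, the Mahalanobis scoring of Equations (4) and (5) admits a direct per-pixel implementation. A minimal sketch, assuming the per-class means µ_c and the inverse of the shared covariance Σ have already been estimated from training features as described above (tensor shapes are illustrative):

```python
import torch


def mahalanobis_scores(feats, mu, sigma_inv):
    """Per-pixel Mahalanobis confidence scores, Eqs. (4)-(5).

    feats:     [D, H, W] pixel-level features phi(x) taken before the
               segmentation head.
    mu:        [C, D] per-class feature means estimated on training data.
    sigma_inv: [D, D] inverse of the shared (tied) covariance matrix.
    Returns softmax-normalised scores of shape [C, H, W].
    """
    d, h, w = feats.shape
    z = feats.reshape(d, -1).T                 # [H*W, D] one feature vector per pixel
    diff = z.unsqueeze(1) - mu.unsqueeze(0)    # [H*W, C, D]
    # negative squared Mahalanobis distance for each pixel/class pair, Eq. (4)
    s_tilde = -torch.einsum("pcd,de,pce->pc", diff, sigma_inv, diff)
    # softmax normalisation across classes, Eq. (5)
    return torch.softmax(s_tilde, dim=1).T.reshape(-1, h, w)  # [C, H, W]
```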
TPR^ID = (Σ_{c=1}^{C} TP_c) / (Σ_{c=1}^{C} (TP_c + FN_c)),   FPR^OOD = FP_0^OOD / (TN_0^OOD + FP_0^OOD)   (11)

where FN_c = FN_c^OOD + FN_c^ID. It should be clear that, since our annotations are sparse, unlabelled data is omitted from these statistics.

By computing TPR^ID and FPR^OOD under multiple thresholds τ, we obtain a Receiver Operating Characteristic (ROC) curve. For clarity, we emphasize that this definition of the ROC curve specifically takes advantage of the distinction between the positive classes and the negative/OOD class to provide a single well-posed binarisation of the multi-class problem that doesn't rely on a one-vs-rest strategy. The area under the ROC curve (AUROC) is a threshold-independent metric which is commonly used by many image-level OOD detection methods (Hendrycks and Gimpel, 2017; Liang et al., 2017; Lee et al., 2018; Hsu et al., 2020). We thus use the AUROC metric (with our definition of TPR^ID and FPR^OOD) for quantitative evaluation.

Additionally, we propose to measure the Area Under the Precision-Recall curve (AUPR) (Saito and Rehmsmeier, 2015) as our second metric. Again, we define the precision within our multi-class setting by taking advantage of the distinction between the positive classes and the negative/OOD one:

Precision = (Σ_{c=1}^{C} TP_c) / (FP_0^OOD + Σ_{c=1}^{C} (TP_c + FN_c^ID))   (12)

Recall being a synonym for TPR, we use Equation (11) to define it. Finally, we measure AUPR by evaluating recall and precision under multiple τ thresholds.

Figure 4. Graphical illustration of the confusion matrix incorporating multi-class ID and OOD data; rows correspond to the actual class. Left: with actual OOD data as the negative class. Right: in the case without actual OOD data and with class 2 considered as positive while others are negative in a one-vs-rest approach.
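To illustrate how Equation (11) yields the proposed ROC curve, the sketch below sweeps the confidence threshold τ over flattened per-pixel predictions. It assumes an encoding where label 0 marks negative/OOD pixels and labels 1..C the positive classes, with unlabelled pixels already excluded (as the sparse annotations require); all names are ours.

```python
import numpy as np


def id_ood_roc(scores, labels, thresholds):
    """TPR^ID / FPR^OOD pairs from Eq. (11) over a threshold sweep.

    scores:  [N, C] per-pixel confidence S over the C positive classes.
    labels:  [N] ground truth; 0 = negative/OOD, 1..C = positive classes
             (an assumed encoding for this sketch).
    """
    max_score = scores.max(axis=1)
    pred = scores.argmax(axis=1) + 1          # predicted positive class in 1..C
    pos, ood = labels > 0, labels == 0
    tpr_id, fpr_ood = [], []
    for tau in thresholds:
        accept = max_score >= tau             # pixels kept as ID
        # TPR^ID: accepted and correctly classified positive pixels
        tpr_id.append((accept & pos & (pred == labels)).sum() / max(pos.sum(), 1))
        # FPR^OOD: OOD pixels wrongly accepted as some positive class
        fpr_ood.append((accept & ood).sum() / max(ood.sum(), 1))
    return np.array(tpr_id), np.array(fpr_ood)


# AUROC can then be obtained with np.trapz over the sorted (fpr, tpr) pairs.
```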
All-classes metrics. For our experiment using all labelled classes, we do not have any ground-truth pixels associated with the OOD class. As illustrated in Figure 4-right, FP_0^OOD is thus 0 by construction and this would skew the previous metrics. In this context, we thus choose to compute TPR, TNR, balanced accuracy (BACC) and F1 score based on a one-vs-rest strategy (Taha and Hanbury, 2015). To distinguish these one-vs-rest metrics used in the all-classes setting from the OOD-focused ones, we use a superscript OVR when referring to them. The expressions for each individual positive class c are given by:

TPR_c^OVR = TP_c / (TP_c + FN_c),
TNR_c^OVR = TN_c / (TN_c + FP_c),
BACC_c^OVR = (TPR_c^OVR + TNR_c^OVR) / 2,
F1_c^OVR = 2 TP_c / (2 TP_c + FP_c + FN_c)   (13)

where TN_c = TN_c^ID + TN_c^OOD and, in the OVR setting, FP_c = FP_c^ID. These class-specific OVR metrics are then averaged across the positive classes to provide mean scores: TPR^OVR, TNR^OVR, BACC^OVR and F1^OVR. Furthermore, we compute the AUROC^OVR and AUPR^OVR metrics in the OVR setting by computing TPR_c^OVR, FPR_c^OVR = 1 − TNR_c^OVR and Precision_c^OVR = TP_c / (TP_c + FP_c) under multiple thresholds.

3.5. OOD confidence threshold selection

As detailed in Equation (1), our approach relies on an OOD confidence threshold τ to generate the final segmentation masks. This threshold should be chosen to 1) accurately classify pixels belonging to an ID class, and 2) detect background/OOD test pixels. For comparison purposes, we can also define a baseline with τ_0 = 0 to represent the method without outlier detection. To fulfil the two criteria above within our two-level cross-validation setup, we propose to find the optimal threshold τ_m which maximises a weighted sum of ID and OOD performance across the two-level folds using a pair of threshold-sensitive metrics:

τ_m = argmax_τ (1/N) Σ_{k=1}^{N} [ w_ID Metric_ID^k(τ) + w_OOD Metric_OOD^k(τ) ]   (14)

where N = N_SP × N_CP is the total number of cross-validation folds, and Metric_ID^k(τ) (respectively Metric_OOD^k(τ)) represents the ID (respectively OOD) performance of the model on the k-th cross-validation fold when using threshold τ. In this work, we choose TPR^ID and TNR^OOD = 1 − FPR^OOD as our ID and OOD metrics respectively. The computation of TPR^ID and FPR^OOD can be found in Equation (11). For the weighting parameters in Equation (14), we choose w_ID = w_OOD = 0.5.

When used outside of our two-level cross-validation approach, the OOD performance metrics are skewed by the absence of negative/OOD annotations, in which case our threshold selection approach can be extended to only account for ID performance, essentially setting w_ID = 1 and w_OOD = 0. An alternative is to use the optimal threshold from the two-level cross-validation experiments. We empirically found this threshold to offer a good trade-off between ID classification and OOD detection performance. When the ID data distribution is similar to that of the validation set used during cross-validation, applying this threshold can be beneficial for generalisation purposes.
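In code, the selection of τ_m in Equation (14) reduces to a small grid search over candidate thresholds. A minimal sketch, assuming the per-fold TPR^ID and TNR^OOD curves have been precomputed (array names are illustrative):

```python
import numpy as np


def select_threshold(tpr_id_folds, tnr_ood_folds, thresholds,
                     w_id: float = 0.5, w_ood: float = 0.5):
    """Pick tau_m maximising Eq. (14) across the N cross-validation folds.

    tpr_id_folds:  [N, T] TPR^ID of each fold at each candidate threshold.
    tnr_ood_folds: [N, T] TNR^OOD = 1 - FPR^OOD of each fold per threshold.
    thresholds:    [T] candidate values of tau.
    """
    # weighted sum of ID and OOD performance, averaged over folds
    objective = (w_id * tpr_id_folds + w_ood * tnr_ood_folds).mean(axis=0)  # [T]
    return thresholds[int(np.argmax(objective))]


# Example with the paper's equal weighting w_ID = w_OOD = 0.5:
# tau_m = select_threshold(tpr, tnr, np.linspace(0, 1, 101))
```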
4. Experimental setup

We start by describing relevant details of our models and training setup in Section 4.1, followed by the data preprocessing pipeline in Section 4.2.

4.1. Deep learning model and training setup

For all experiments, we use a U-Net architecture with an efficientnet-b4 encoder (Tan and Le, 2019) pretrained on the ImageNet dataset (Deng et al., 2009). Our implementation relies on the Segmentation Models PyTorch library^1. The choice of the encoder is based on the good performance reported in a previous study (Seidlitz et al., 2022) and the graphics memory limits of the hardware used for this work. The model inputs are either a pre-processed hyperspectral imaging (HSI) hypercube or an RGB image. The number of input channels and the weights of the first convolutional layer are re-initialised and set to match the number of channels of our input data. The output of the network is passed on to a segmentation head to calculate the output logits. The number of output classes is set to be equal to the number of positive classes (i.e. marked as ID) for a given experimental setup. Note that during our two-level cross-validation, this number will be lower than the number of positive classes in the training dataset as some classes are being held out. The hyperparameters used for each dataset are listed in Table 1.

Table 1. Hyperparameter setup.

Dataset        | Init. LR | Batch size | Epochs
Heiporspectral | 1e-4     | 8          | 20
ODSI-DB        | 1e-3     | 4          | 80
DSAD           | 1e-4     | 4          | 10

4.2. Data preprocessing pipeline

For the DSAD dataset, data are stored in PNG format. We use the Pillow library^2 to read the RGB data and convert them into PyTorch tensors. For the two HSI datasets, we first extract the hypercube using the provided Python libraries (Studier-Fischer et al., 2023; Hyttinen et al., 2020) and manually select 16 channels at equal intervals from the total available spectral bands, sorted in ascending order.

After exporting the data, we apply ℓ1-normalisation at each spatial location ij to account for the non-uniform illumination of the tissue surface. This is routinely applied in hyperspectral imaging because of the dependency of the signal on the distance between the camera and the tissue (Bahl et al., 2023; Studier-Fischer et al., 2023). The uneven surface of the tissue can also cause some image areas to have different lighting conditions, which affects the classification accuracy and can be mitigated by data normalisation. For data augmentation, we adopt a setup similar to that reported in (Seidlitz et al., 2022): random rotation (rotation angle limit: 45°); random flip; random scaling (scaling factor limit: 0.1); random shift (shift factor limit: 0.0625). All transformations are applied with a probability of 0.5.
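The two HSI-specific steps described above, equal-interval band selection and per-pixel ℓ1-normalisation, can be sketched as follows. This is an illustration under our stated assumptions (function names are ours; the released pipeline may differ):

```python
import torch


def subsample_bands(hypercube: torch.Tensor, n_bands: int = 16) -> torch.Tensor:
    """Keep n_bands channels at equal intervals.

    hypercube: [B_total, H, W], channels assumed sorted by wavelength
    in ascending order.
    """
    idx = torch.linspace(0, hypercube.shape[0] - 1, n_bands).round().long()
    return hypercube[idx]


def l1_normalise(hypercube: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """l1-normalise the spectrum at each spatial location ij to reduce the
    effect of non-uniform illumination. hypercube: [B, H, W]."""
    return hypercube / (hypercube.abs().sum(dim=0, keepdim=True) + eps)


# Example: x = l1_normalise(subsample_bands(raw_cube))  # raw_cube: [100, 480, 640]
```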
5. Results

We begin by visualising confusion matrices and ROC curves in Section 5.1. These measures provide the foundation for both our qualitative and quantitative analysis in the later sections. Sections 5.2 and 5.5 show the overall performance of our proposed framework when comparing different methods as the scoring function. Sections 5.3 and 5.4 show a qualitative evaluation for all methods, plus a scenario in which our proposed OOD segmentation framework has not been applied. Furthermore, we have tested the performance of our method under the scenario in which all labelled classes are considered as ID. The results are shown in Section 5.5.

5.1. Visualising confusion matrix and ROC curve

Figure 5 demonstrates that our method effectively separates OOD pixels while maintaining classification accuracy for ID pixels. Figure 6 shows the ROC curve with the ID class on the y-axis and OOD classes on the x-axis. As τ increases, more data will be flagged as OOD. For the baseline method, fewer pixels are rejected under threshold 0.99, indicating that the model tends to make overconfident predictions and this decreases the overall AUC. The ODIN method addresses this issue by providing

^1 https://github.com/qubvel/segmentation_models.pytorch
^2 https://pillow.readthedocs.io/en/latest/