
Preprint (2024)


OOD-SEG: Out-Of-Distribution detection for image SEGmentation with sparse multi-class positive-only annotations

arXiv:2411.09553v1 [cs.CV] 14 Nov 2024

Junwen Wang a,∗, Zhonghao Wang a, Oscar MacCormac a,b, Jonathan Shapey a,b, Tom Vercauteren a

a School of Biomedical Engineering & Imaging Sciences, King's College London, UK
b Department of Neurosurgery, King's College Hospital, London, UK

ARTICLE INFO

Keywords: Weakly supervised learning, Positive-Unlabelled learning, One-class classification, Out-of-distribution detection, Hyperspectral imaging, Semantic segmentation

ABSTRACT

Despite significant advancements, segmentation based on deep neural networks in medical and surgical imaging faces several challenges, two of which we aim to address in this work. First, acquiring complete pixel-level segmentation labels for medical images is time-consuming and requires domain expertise. Second, typical segmentation pipelines cannot detect out-of-distribution (OOD) pixels, leaving them prone to spurious outputs during deployment. In this work, we propose a novel segmentation approach exploiting OOD detection that learns only from sparsely annotated pixels from multiple positive-only classes. These multi-class positive annotations naturally fall within the in-distribution (ID) set. Unlabelled pixels may contain positive classes but also negative ones, including what is typically referred to as background in standard segmentation formulations. Here, we forgo the need for background annotation and consider these together with any other unseen classes as part of the OOD set. Our framework can integrate, at a pixel level, any OOD detection approach designed for classification tasks. To address the lack of existing OOD datasets and established evaluation metrics for medical image segmentation, we propose a cross-validation strategy that treats held-out labelled classes as OOD. Extensive experiments on both multi-class hyperspectral and RGB surgical imaging datasets demonstrate the robustness and generalisation capability of our proposed framework.

1. Introduction

Despite significant progress of deep neural network based segmentation methods for medical and surgical image analysis in recent years, the efficacy of these methods is highly dependent on the quality and quantity of pixel-level annotations. Specifically, the manual annotation of medical images necessitates professional expertise from experienced domain experts, making the process both costly and time-consuming, thereby leading to a shortage of labelled data in clinical settings.

To reduce the intensive workload in acquiring pixel-level dense annotation from clinical experts, many efforts have been made to advance Weakly Supervised Learning (WSL) (Tajbakhsh et al., 2020; Can et al., 2018; Xu et al., 2014) and train learning-based algorithms from coarse-grained annotations instead of precise segmentation masks. WSL could allow domain experts to annotate regions that they are confident of, leaving intricate details unlabelled. Such annotations are already starting to be adopted in some open-access datasets. For example, a recent hyperspectral imaging (HSI) dataset adopted a sparse annotation protocol by annotating representative image regions, omitting marginal areas, superficial blood vessels, adipose tissue and other artefacts (Studier-Fischer et al., 2023).
∗ Corresponding author: junwen.wang@kcl.ac.uk

Similarly, the Dresden Surgical Anatomy Dataset (DSAD) offers sparse positive-only annotations for RGB surgical imaging (Carstens et al., 2023). Yet, the proper application of WSL approaches to such cases lacking background class annotations remains an open question. WSL also preserves less information compared to dense annotations, losing supervisory signal for some object structures. Such difficulties make the training process from sparse positive-only labels challenging.

Furthermore, to deploy a fully automated system in a safety-critical environment, the system should not only be able to produce reliable results in a known context, but should also be able to flag situations in which it may fail (Amodei et al., 2016; European Commission, 2024). Conventional segmentation frameworks follow the assumption that all training data and testing data are drawn from the same distribution and are thus considered in-distribution (ID). Under this assumption, at inference, the model should only be used in a similar context, which may imply limiting the acquisition hardware and the presence of unexpected classes such as a new model of surgical instrument. This poses a safety issue when trying to deploy the model for real-world clinical use. Out-of-distribution (OOD) detection may thus be considered a mandatory feature in many clinical applications. It is an active research topic in many classification tasks (Hendrycks and Gimpel, 2017; Liang et al., 2017; Lee et al., 2018; Hsu et al., 2020), but has rarely been exploited in medical imaging (Lambert et al., 2024).

We argue that the two challenges outlined above, sparse annotations and the need for OOD detection, share enough similarities to address them under a single methodological approach. In medical image segmentation with sparse annotations, the absence of an annotation does not necessarily imply that a region is identified as negative. Two other possibilities could explain why a positive pixel remains unlabelled: 1) it may be deemed ambiguous by the annotator; or 2) it may simply be skipped due to time constraints. The most straightforward, albeit wrong, approach to handle unlabelled data would be to assume that all such data belongs to the negative or background class. In contrast, positive-unlabelled (PU) learning (Bekker and Davis, 2020) assumes that an unlabelled example could belong to either the positive or the negative class. Most existing work in PU learning focuses on binary classification problems rather than multi-class ones. PU learning can be seen as a specific case within the broader domain of OOD detection. Given the absence of the negative class, traditional PU learning methods are frequently formulated as one-class semi-supervised learning problems (Yang et al., 2024). However, research on segmentation within the frameworks of both PU learning and OOD detection is limited. Image segmentation problems often require multi-class learning, for which few PU-learning approaches have been proposed. This scarcity of published work is also partly due to the lack of OOD-based evaluation protocols and publicly available benchmark datasets. One potential solution could be to use a different dataset as OOD data during testing (Karimi and Gholipour, 2023). However, this approach poses significant challenges as annotating multiple medical datasets is labour-intensive and requires domain-specific expertise for each dataset.

In this paper, we propose a simple but effective medical image segmentation framework to achieve pixel-level OOD detection using sparsely annotated data from positive-only classes. Our framework effectively learns feature representations using sparsely annotated labels, enabling reliable detection of OOD pixels with classical OOD approaches (Yang et al., 2024) designed for classification purposes. This allows for state-of-the-art OOD detection performance without compromising the classification accuracy for ID classes.

To evaluate the model performance on OOD data, we propose a protocol that involves isolating part of the labelled classes during training. These held-out annotations do not contribute to updating the model weights during training but are grouped as an additional outlier class for validation purposes. To effectively evaluate OOD performance for segmentation tasks, we propose using two threshold-independent metrics to measure model performance. Building on these metrics, we further design a threshold selection strategy to visualise OOD segmentation results.

Based on our framework, we compare four different classical OOD detection methods integrated in a common U-Net based backbone segmentation model. Our cross-validation results show that combining a model calibration method with the proposed framework achieves the best overall performance.

Our contributions are threefold:

• We introduce a novel framework based on positive-only learning for multi-class medical image segmentation. Our approach effectively segments negative/OOD data without compromising performance for multi-class positive/ID data.

• To assess model performance in both ID and OOD scenarios, we propose a two-level cross-validation method and metrics for evaluation. The cross-validation is based on both subjects/patients and classes present in the dataset. Our evaluation approach eliminates the need for an additional OOD testing set.

• The proposed framework can seamlessly incorporate any given OOD detection method or backbone architecture. In particular, we introduce a novel convolutional adaptation of the GODIN method, extending its applicability to segmentation tasks within our framework.

To the best of our knowledge, this represents the first work to address the setting of positive-only learning for multi-class medical image segmentation.

2. Related works

2.1. Medical image segmentation with sparse annotation

Existing WSL methods utilise sparse annotations at different levels, including image-level annotations (Kuang et al., 2024), bounding boxes (Wang and Xia, 2021; Wang et al., 2018; Xu et al., 2014), scribbles (Can et al., 2018; Wang et al., 2019c), points (Glocker et al., 2013; Qu et al., 2019; Dorent et al., 2021) and 2D slices within a 3D structure (Bitarafan et al., 2021; Cai et al., 2023).

These methods use weak labels as supervision signals to train the model and produce a full segmentation mask for the test image. Specifically, Glocker et al. (2013) introduced a semi-automatic labelling strategy that transforms sparse point-wise annotations into dense probabilistic labels for vertebrae localisation and identification; Xu et al. (2014) proposed to segment both healthy and cancerous tissue from colorectal histopathological biopsies using bounding boxes; Wang et al. (2018) reported improved CNN performance on sparsely annotated input through image-specific fine-tuning; and Wang et al. (2019c) combined sparsely annotated input with a CNN through geodesic distance transforms, followed by a resolution-preserving network resulting in better dense prediction. However, all of these methods primarily focussed on addressing partial or incomplete annotations, thereby overlooking the context in which no background annotations are present.

2.2. Learning from positive-only data

Positive and unlabelled (PU) learning considers a scenario where only a subset of positive data are labelled, while the unlabelled set contains both positive and negative data (Bekker and Davis, 2020). It is closely related to semi-supervised learning and positive-only learning.

Positive-only or one-class learning, illustrated in Figure 1, is a supervised method which involves learning a decision boundary that corresponds to a desired density level of the positive data distribution (Perera et al., 2021). Early approaches utilised statistical features to build one-class classifiers. For instance, Principal Component Analysis (PCA) (Bishop, 2006) or Kernel PCA identifies a lower-dimensional subspace that best represents the training data distribution. Leveraging robust feature extraction capabilities, some studies have integrated deep learning models into one-class learning methods. Deep Support Vector Data Description (DeepSVDD) (Ruff et al., 2018) learns a representation that encloses the embeddings of all positively labelled data within the smallest possible hyper-sphere. One-class CNN (Oza and Patel, 2019) uses zero-centered Gaussian noise in the latent space as the pseudo-negative class and trains a CNN to learn a decision boundary for the given class.

Positive-only learning extends the binary classification of one-class methods by learning decision boundaries for multiple classes of positively labelled data. However, very few studies have examined the multi-class setup in detail. In this work, we frame positive-only learning for image segmentation as a multi-class problem with pixel-level OOD detection.

2.3. Out-of-distribution detection

Several studies have explored OOD detection within the context of image classification (Hendrycks and Gimpel, 2017; Liang et al., 2017; Lee et al., 2018; Hsu et al., 2020). As an early example exploiting deep learning, Hendrycks and Gimpel (2017) proposed using the maximum softmax score as a baseline for OOD detection, based on the observation that correctly classified images tend to have higher softmax probabilities than erroneously classified examples. Liang et al. (2017) found that applying confidence calibration through temperature scaling (Guo et al., 2017) effectively separates ID and OOD images. Lee et al. (2018) suggested measuring the Mahalanobis distance between test image features and the training distribution from the penultimate convolutional layer of the model. Hsu et al. (2020) proposed decomposing the confidence score to learn temperature parameters during training.

Despite methodological advances and positive demonstrations for image classification purposes, the usage of OOD detection in medical image segmentation is uncommon. Some studies hypothesize that this may be due to the lack of OOD-based evaluation protocols and the difficulty in gathering relevant data for it (Lambert et al., 2024; Bulusu et al., 2020). Recent research has attempted to address this issue by using other datasets as OOD examples. Karimi and Gholipour (2023) used two separate datasets: one for training the neural network and evaluating its performance on ID data, and another for testing specifically for OOD detection. González et al. (2022) collected four types of OOD datasets to account for different distribution shifts from ID data for a COVID-19 lung lesion segmentation task. However, acquiring an additional dataset that can be considered OOD is a difficult and time-consuming process. Therefore, a more scalable approach would be to establish both training and evaluation within a single dataset.

2.4. Uncertainty estimation in medical image segmentation

As illustrated in the previous section, several typical OOD detection approaches rely on estimating the uncertainty of a deep learning prediction (Lambert et al., 2024). Better uncertainty modelling could thus benefit OOD detection. Several uncertainty estimation approaches rely on measuring the empirical variance of the network predictions under a set of perturbations. Strategies to generate ensembles of predictions include using several deep learning models with: differences in model hyperparameters (Wenzel et al., 2020); random initialization of the network parameters; random shuffling of the data points (Lakshminarayanan et al., 2017); and applying dropout at test time (Gal and Ghahramani, 2016). In medical image segmentation, uncertainty estimation has mostly been applied with binary classes. By way of example, Wang et al. (2019a) apply test-time augmentation to estimate aleatoric uncertainty for fetal brain and brain tumour segmentation from 2D and 3D Magnetic Resonance Images (MRI); Wang et al. (2019b) propose a CNN-based cascaded framework with test-time augmentation for brain tumour segmentation. Beyond prediction ensembling, recent studies have focused on providing better uncertainty prediction out of the box by calibrating the model uncertainty using dedicated loss functions. In particular, Liang et al. (2020) proposed an auxiliary loss term based on the difference between accuracy and confidence. Barfoot et al. (2024) extend the expected calibration error (Guo et al., 2017) to a differentiable loss function to train a segmentation model. However, none of the works in the medical imaging field have demonstrated the benefits of improved uncertainty calibration in the context of unlabelled or OOD data.

2.5. Segmentation of surgical spectral imaging data

Having looked at related work in the key methodological areas of interest, we now turn to the related work in the main clinical application of interest in this work, namely hyperspectral imaging for surgical guidance. Early works on segmentation of surgical HSI data are based on traditional machine learning techniques (Ravì et al., 2017; Fabelo et al., 2018; Moccia et al., 2018). For example, Ravì et al. (2017) trained a Semantic Texton Forest (Shotton et al., 2008) on an HSI embedding generated using an adapted version of the t-distributed stochastic neighbour embedding approach (t-SNE) (van der Maaten and Hinton, 2008); Fabelo et al. (2018) proposed a hybrid framework utilising supervised and unsupervised learning techniques. The supervised classification map is obtained using a pixel-wise Support Vector Machine (SVM) classifier that was spatially homogenized through k-nearest-neighbours filtering. The authors then combined it with a segmentation map obtained via unsupervised clustering using a hierarchical k-means algorithm. However, the experiment was conducted on 5 HSI datasets and the separation between training, validation and testing is unclear.

The use of deep learning for biomedical segmentation using spectral imaging data is increasing (Khan et al., 2021). Most studies adopt standard U-Net and similar architectures (Ronneberger et al., 2015; Jégou et al., 2017) and train their model with patch-based or pixel-based input. Some works have looked at the impact of training models with different types of input spanning different levels of granularity such as pixels, patches and images (Seidlitz et al., 2022; Garcia Peraza Herrera et al., 2023). In (Seidlitz et al., 2022), the authors segmented 20 types of organs from 506 HSI hypercubes taken from 20 pigs. They compared the segmentation performance obtained by training the model with single pixels (no spatial context), patches and full HSI images with the same hyperparameter setup. They reported that the best performance was achieved with full HSI image input (Seidlitz et al., 2022). Similarly, Garcia Peraza Herrera et al. (2023) used the ODSI-DB dataset (Hyttinen et al., 2020), segmenting 35 dental tissues from 30 human subjects after data preprocessing and partitioning into training and testing sets. They trained a deep learning model on full HSI images and on hyperspectral pixels with spatial context removed, reporting a baseline segmentation result. Recently, work by Martín-Pérez et al. (2024) compared various pixel-level classification algorithms for brain tissue differentiation. The study evaluated conventional algorithms, deep learning methods, and advanced classification models. Their findings highlighted that reducing the number of training pixels could improve performance, regardless of the dataset and classifiers.

Overall, available surgical HSI data remains limited in size, and the inherent complexity and variability of the surgical environment further complicate its analysis. Furthermore, the available annotations are sparse, as the data often consist of annotations on isolated pixels or small regions rather than comprehensive labelling of entire images (Zhu et al., 2022). While relevant, none of the previous works have demonstrated effective methods for leveraging sparse, positive-only annotations.

3. Material and methods

This section starts by describing the HSI and RGB imaging datasets and associated annotations that serve as a foundation and motivation for this work (Section 3.1). We then describe our proposed learning framework for sparse multi-class positive-only medical image segmentation (Section 3.2). Lastly, we introduce our proposed OOD-focused evaluation framework (Section 3.3), evaluation metrics (Section 3.4), and threshold selection method for negative / OOD detection (Section 3.5).

3.1. Datasets

Hyperspectral imaging (HSI) and multispectral imaging are emerging optical imaging techniques that collect and process spectral data distributed across a number of wavelengths (Shapey et al., 2019). By splitting light into numerous narrow spectral bands beyond what human vision can observe, HSI captures details invisible to the naked eye. This technique gathers diagnostic data about tissue properties, allowing for objective characterization of tissues without the use of any external contrast agents. Recently, several HSI databases have been released as open access, thereby easing research into medical HSI analysis (Studier-Fischer et al., 2023; Hyttinen et al., 2020; Fabelo et al., 2016).

The Heidelberg Porcine HyperSPECTRAL Imaging (Heiporspectral) dataset (Studier-Fischer et al., 2023) comprises 5758 hyperspectral images with a resolution of 480 × 640 acquired over the 500–1000 nm wavelength range. Hyperspectral images were captured using the TIVITA tissue hyperspectral camera system, which provides 100 spectral bands for each image. For consistency across all hyperspectral datasets used in this study, for each dataset we sample 16 bands at equal intervals in the available wavelength range. The background-free, sparse annotations include 20 physiological porcine organs, obtained from a total of 11 pigs. For each organ, annotations are distributed across 8 pigs. In each acquired organ image series, representative image regions of the 20 structures present in the respective series were annotated.
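As a concrete illustration of this equal-interval band selection, the following is a minimal sketch; the (height, width, bands) hypercube layout and the function name are our own assumptions rather than part of the released dataset loaders.

```python
import numpy as np

def subsample_bands(hypercube: np.ndarray, n_bands: int = 16) -> np.ndarray:
    """Select n_bands spectral channels at equal intervals.

    Assumes the hypercube is stored as (height, width, bands); the 16-band
    target follows the protocol described in Section 3.1.
    """
    total = hypercube.shape[-1]
    # Equally spaced indices over the available wavelength range,
    # kept in ascending order as in the original acquisition.
    idx = np.linspace(0, total - 1, num=n_bands).round().astype(int)
    return hypercube[..., idx]
```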

The Oral and Dental Spectral Image Database (ODSI-DB) (Hyttinen et al., 2020) contains 316 hyperspectral images of 30 human subjects, of which 215 have annotations. Images have a varied resolution and wavelength range due to two different cameras being used in the study. 59 annotated images were taken with a Nuance EX (CRI, PerkinElmer, Inc., Waltham, MA, USA) and 156 were obtained with a Specim IQ (Specim, Spectral Imaging Ltd., Oulu, Finland). The pictures taken by the Nuance EX contain 51 spectral bands (450–950 nm with 10 nm bands) and spatial resolution 1392 × 1040; those captured by the Specim IQ have 204 bands (400–1000 nm with approximately 3 nm steps) and spatial resolution 512 × 512. Some images are further cropped to ensure the anonymity of the imaged subject. To alleviate the discrepancy from the camera setup, we sample 16 bands at equal intervals in the available range. We resize all images to a spatial size of 512 × 512 by either centrally cropping or padding the image. Annotations from these 215 images are sparse and background-free. The number of annotated pixels per image varies from image to image. The annotated pixels can belong to 35 possible dental tissues, which do not contain the background class. Inspection of this dataset shows that the majority of classes are underrepresented. We select classes with at least 1 million pixel samples and discard the rest of the classes for further analysis, as done in previous work by Garcia Peraza Herrera et al. (2023), resulting in a total of 9 classes selected in this study.

While the application of our proposed methodological approach to spectral imaging data represents our main focus, we also demonstrate the capability of our method on a non-spectral imaging dataset. More specifically, we make use of an RGB laparoscopic dataset that shares many similarities in terms of anatomical content and annotation style.

The Dresden Surgical Anatomy Dataset (DSAD) (Carstens et al., 2023) comprises 13195 laparoscopic images from 32 patients undergoing robot-assisted anterior rectal resection or rectal extirpation surgeries. Images provided in the dataset were extracted from the video and stored in PNG format at a resolution of 1920 × 1080. The annotation of 11 abdominal organs provides pixel-wise segmentation with multiple inclusion criteria for anatomical structures, resulting in sparsely annotated images across the dataset. The majority of annotations in this dataset only account for a single organ per image. However, a subset of the data is associated with multi-organ segmentation for all 11 anatomical structures. This includes a total of 1430 images in 32 patients.

3.2. Positive-only learning for multi-class segmentation

We start by illustrating the concept of positive-only learning. Figure 1 shows various decision boundaries in the presence of unlabelled data. Conventional multi-class classifiers show no ability to detect outliers within the unlabelled data. Instead, they categorise them into one of the known classes. Alternatively, one-class classifiers can highlight outliers in the unlabelled data but they are limited to a single class. In contrast, our proposed method effectively identifies outliers in the unlabelled data while establishing distinct decision boundaries enclosing the labelled positive data for each class.

[Figure 1: three panels — multi-class learning; one-class positive-only learning; multi-class positive-only learning (ours).]
Figure 1. Decision boundaries of different learning settings. Coloured markers represent positive data in different classes. "?" represents unlabelled data. Our proposed framework forms a multi-class positive-only learning setting which has distinct decision boundaries that aim to enclose positively labelled data for each class and can thus serve as an OOD detection mechanism based on non-enclosed areas.

In this study, we propose addressing the positive-only learning scenario by leveraging concepts from OOD detection. Figure 2 shows an overview of our proposed framework for image segmentation. Given a 2D image x, each annotated spatial location (i, j) from x has a corresponding annotation y_ij, where y_ij ∈ {c} = {1, 2, . . . , C} and C is the number of classes marked as in-distribution. We purposely refrain from starting the numbering of positive classes from 0 to retain that 0 index for negative data such as background and OOD samples.

Our approach entails training a standard multi-class semantic segmentation network using a loss computed from the sparse positive-only annotations, as seen in Figure 2 (Training). In practice, we restrict the output of the network to a C-dimensional output per pixel. That is, if no further post-processing were to be applied, the network would not predict any background class.

To incorporate background predictions, as illustrated in Figure 2 (Inference), during inference we introduce a pixel-wise scoring function S^c_ij which aims to capture the probability of pixel (i, j) belonging to the ID class c while acknowledging the possibility of it being OOD. If a pixel-wise score is high, we maintain the assignment of that pixel to the best ID class. In contrast, if the ID class scores are all low, the pixel is considered as OOD. Our proposed framework utilises a confidence threshold τ to detect OOD samples at a pixel level:

\hat{y}_{ij} = \begin{cases} \arg\max_{c} S^{c}_{ij}, & \text{if } \max_{c} S^{c}_{ij} > \tau \\ 0, & \text{otherwise} \end{cases} \qquad (1)

Baseline OOD scoring. Let f(x) denote the logit output of the segmentation network trained using the positive-only ID sparse annotation. In its simplest implementation, the score function can be the softmax output from the network as shown in Equation (2) below, where dependence on pixel location is omitted for brevity:

S^{\mathrm{baseline}} = [S^{\mathrm{baseline}}_{1}, \ldots, S^{\mathrm{baseline}}_{C}] = \mathrm{softmax}\big(f(x)\big) \qquad (2)

The resulting OOD approach is a commonly used baseline method for OOD detection in classification tasks (Hendrycks and Gimpel, 2017).

Beyond this baseline, our framework allows integrating state-of-the-art OOD detection methods by changing the predefined pixel-level scoring function S. In this work, we investigate methods related to confidence calibration and Mahalanobis distance as they have demonstrated effectiveness in many OOD detection tasks for classification (Liang et al., 2017; Lee et al., 2018; Hsu et al., 2020).

ODIN. Liang et al. (2017) add temperature scaling and adversarial perturbation to the logit output from the pretrained network to improve OOD performance:

S^{\mathrm{odin}} = \mathrm{softmax}\!\left(\frac{f(x)}{T}\right) \qquad (3)

It has been shown that using a large temperature T is generally preferred for OOD classification tasks (Liang et al., 2017). However, from our experiments, we find that using a relatively small T (albeit still much larger than 1) is beneficial for segmentation. We chose a fixed value of T = 10 across our experiments.

Data & Backbone Model Training

𝐿𝐶𝐸

Input Model 𝐬𝐨𝐟𝐭𝐦𝐚𝐱(𝒇(𝒙)) Sparse Annotation

Out-of-Distribution Detection Methods Inference



𝑺𝒃𝒂𝒔𝒆𝒍𝒊𝒏𝒆 ෩
𝑺𝒐𝒅𝒊𝒏
𝒇 𝒙
𝒇(𝒙) τ𝑚
𝑻


𝑺𝒎𝒂𝒉𝒂 ෩
𝑺𝒈𝒐𝒅𝒊𝒏
𝒉 𝒙
𝒅𝒎𝒂𝒉𝒂 𝒛, 𝝁, 𝚺
𝒈(𝒙) 𝒎𝒂𝒙𝒄 𝐬𝐨𝐟𝐭𝐦𝐚𝐱(𝑺𝒄 ) Predicted Mask

Figure 2. Overview of the proposed OOD-SEG framework. During the training stage, only annotated pixels for the multiple positive classes are used to
update the model weights. We define a confidence score S to correlate probability distribution for ID classes. S can be replaced by multiple OOD detection
methods (See bottom left block). At the inference stage, we compute the maximum probability of S from c classes followed by thresholding from a pre-
selected threshold τm to obtain the predicted mask.

Furthermore, Liang et al. (2017) employed adversarial perturbation to further enhance OOD performance by optimising the value of σ using a validation set composed equally of ID and OOD data. In our study, we did not incorporate adversarial perturbation for two primary reasons. First, we aimed to simplify the training to allow fairer and more reliable comparisons across OOD approaches. Second, the original paper reported only minor improvements from applying adversarial perturbations, and these came at a significant computational cost (Liang et al., 2017).

Mahalanobis. Lee et al. (2018) propose an OOD mechanism based on a statistical analysis of features observed in each ID class. Let φ(x) be some pixel-level features obtained from intermediate layers of the network where, as before, the dependence on pixel location is dropped for brevity. We chose the features before the segmentation head as our intermediate features in this study. The class-conditioned distributions of the features are modelled as Gaussians with a class-specific mean µ_c and a tied, i.e. class-independent, covariance matrix Σ. A first scoring S̃^maha is obtained by computing the negative Mahalanobis distance between a prediction feature and each class Gaussian:

\tilde{S}^{\mathrm{maha}}_{c} = -\big(\phi(x) - \mu_{c}\big)^{T} \Sigma^{-1} \big(\phi(x) - \mu_{c}\big) \qquad (4)

To make a head-to-head comparison fairer and easier across OOD methods, we apply a softmax operator to the S̃^maha Mahalanobis scores and obtain normalised final scores:

S^{\mathrm{maha}} = \mathrm{softmax}\big([\tilde{S}^{\mathrm{maha}}_{1}, \ldots, \tilde{S}^{\mathrm{maha}}_{C}]\big) \qquad (5)

We note that this use of the softmax is not advocated by Lee et al. (2018), nor is it strictly necessary. We however found it to have no measurable impact on the performance while it helped provide more consistency in evaluation and mask visualisation. We thus use it in our subsequent experiments. Furthermore, as with our use of ODIN, to ensure a fair comparison and to reduce the computational burden, we did not incorporate the adversarial perturbation and feature ensembling calibration techniques initially proposed in (Lee et al., 2018).

The mean vectors and covariance matrix in Equation (4) are dataset-wide parameters. To alleviate the computational burden associated with estimating µ_c and Σ at once from all pixel-level features extracted across the entire training dataset, we first compute the per-class mean and a shared covariance for each image in the training set through a spatial averaging procedure. These image-level estimates are then aggregated using standard reduction to produce the dataset-level estimates of µ_c and Σ.

Generalised ODIN (GODIN). Hsu et al. (2020) proposed a dividend-and-divisor structure for OOD detection that learns a temperature scaling function g(x) during training. Assuming a trivial extension for pixel-wise operation and dropping the dependence on pixel location from the equation for brevity, the unnormalised scoring is expressed per class as:

\tilde{S}^{\mathrm{godin}}_{c} = \frac{h_{c}(x)}{g(x)} \qquad (6)

A softmax operator is then applied to get the final score:

S^{\mathrm{godin}} = \mathrm{softmax}\big([\tilde{S}^{\mathrm{godin}}_{1}, \ldots, \tilde{S}^{\mathrm{godin}}_{C}]\big) \qquad (7)

Both h_c(x) and g(x) are chosen to take features from the penultimate layer φ(x) of the backbone model f(x). For the temperature g(x), these features are fed through an extra pixel-wise linear layer with trainable weights w_g and bias b_g, batch normalisation (BN), and the sigmoid (σ) function:

g(x) = \sigma\big(\mathrm{BN}(w_{g}\,\phi(x) + b_{g})\big) \qquad (8)

For h_c(x), an extra layer with trainable per-class weights w_c and bias b_c is used to extract a class similarity. In the context of classification, Hsu et al. (2020) investigated three similarity measures. The default one in the original work is the inner product between the penultimate features and the per-class parameters:

h_{c}(x) = w_{c}^{T}\phi(x) + b_{c} \qquad (9)

The other proposed options consisted of the Euclidean distance \lVert\phi(x) - w_{c}\rVert_{2} and the cosine similarity \frac{w_{c}^{T}\phi(x)}{\lVert w_{c}\rVert\,\lVert\phi(x)\rVert} between the penultimate features and the per-class parameters. A trivial pixel-wise extension of this temperature scaling and of the three similarity measures for segmentation purposes is achieved by training spatially-invariant weights and bias terms. Of particular interest is the fact that the pixel-wise linear layer in Equation (8) and the inner product operation in Equation (9) can efficiently be implemented with a 1 × 1 convolution layer.

In this work, to capture some additional spatial context for both g and h_c, we extend the pixel-wise operations in Equation (8) and Equation (9) by introducing convolutional layers with 3 × 3 kernels:

g(x) = \sigma\big(\mathrm{BN}(\mathrm{Conv}_{g}(\phi(x)))\big), \qquad h_{c}(x) = \mathrm{Conv}_{h}(\phi(x)) \qquad (10)

where it should be understood that our notation makes the dependence on spatial location implicit.
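A minimal sketch of such a convolutional GODIN head is given below; the module and attribute names are our own, and the backbone providing the penultimate features φ(x) is assumed rather than shown.

```python
import torch
import torch.nn as nn

class ConvGODINHead(nn.Module):
    """Convolutional GODIN head following Equations (6), (7) and (10).

    Replaces the pixel-wise linear layers of the original GODIN formulation
    with 3x3 convolutions so that g(x) and h_c(x) capture local spatial
    context. `in_channels` is the dimension of the penultimate features
    phi(x); `num_classes` is the number of positive/ID classes.
    """

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.h = nn.Conv2d(in_channels, num_classes, kernel_size=3, padding=1)
        self.g = nn.Sequential(
            nn.Conv2d(in_channels, 1, kernel_size=3, padding=1),
            nn.BatchNorm2d(1),
            nn.Sigmoid(),
        )

    def forward(self, phi: torch.Tensor) -> torch.Tensor:
        # Unnormalised per-class scores h_c(x) / g(x), Equation (6).
        s_tilde = self.h(phi) / self.g(phi)
        # Softmax over classes gives the final GODIN score, Equation (7).
        return torch.softmax(s_tilde, dim=1)
```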

3.3. Two-level OOD-focused Cross-validation Evaluation

To evaluate model performance in detecting OOD samples, existing methods utilise other datasets as an OOD test set (Liang et al., 2017; Hsu et al., 2020). A distinctive aspect of these OOD datasets is the presence of class categories that were not encountered during training. Such a discrepancy is named semantic shift in the original OOD research, which remains an active research topic (Yang et al., 2024). This approach requires additional annotations beyond the target use case and thus poses an additional burden on the clinical experts.

In our context of sparse multi-class positive-only image segmentation, a semantic shift can already occur at the pixel level. This allows us to propose an evaluation framework established without using extra annotated medical image datasets. Figure 3 shows a simplified view of our proposed two-level cross-validation pipeline for pixel-level OOD detection. Our two-level cross-validation is built as a combination of two types of data partitions based on subjects and classes: Subject Partitions (SP) and Class Partitions (CP). The subject-level grouping in the SPs ensures that there is no patient overlap bias within our cross-validation experiments. The class-level groupings CPs allow us to hold out some annotated classes from the training in a specific fold. These classes can thus be considered as OOD for this fold. For clarity, we note that the number NSP of subject partitions (respectively NCP, the number of class partitions) is upper bounded by the number of subjects (respectively positive classes) in the training data. By combining these partitions, we obtain NSP × NCP two-level folds for cross-validation purposes.

[Figure 3: a 4 × 4 grid of Subject-Class Partitions (SP1–SP4 by CP1–CP4), with the training, testing and untouched partitions of the first fold highlighted.]
Figure 3. Graphical representation of the proposed OOD-focused two-level cross-validation strategy. For simplicity, only the first fold is shown in detail. In this example, the number of subject partitions (SP) and class partitions (CP) are set to 4, resulting in a total of 16 partitions. Subject-Class Partitions (SCP) marked in red, blue and grey respectively highlight training, testing or untouched data for a particular cross-validation fold.

While this approach is effective in establishing an OOD-focused evaluation with no need for OOD-specific annotations, it should be clear that none of the models trained for a particular two-level fold would be trained to recognise all ID classes in the training set. As such, a complete model for inference purposes should still be trained with all ID classes.

In this work, we chose NSP = 4 and NCP = 4 by default for our OOD-focused evaluations. To provide some insight into the performance of models trained with all ID classes, we also used a more standard subject-level-only cross-validation strategy by setting NSP = 4 and NCP = 1. This scenario only allowed us to evaluate the capability to recognise ID classes but could not be used for OOD evaluation.
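As an illustration of how the two-level folds can be enumerated, the sketch below pairs subject and class partitions; the NSP = NCP = 4 setting is taken from the text, while the round-robin partitioning, data structures and function name are our own assumptions.

```python
from itertools import product

def make_two_level_folds(subjects, classes, n_sp=4, n_cp=4):
    """Enumerate the N_SP x N_CP two-level cross-validation folds.

    Each fold holds out one subject partition for testing and one class
    partition whose annotations are treated as OOD (never used to update
    the model weights).
    """
    sp = [subjects[i::n_sp] for i in range(n_sp)]   # subject partitions SP1..SP4
    cp = [classes[i::n_cp] for i in range(n_cp)]    # class partitions CP1..CP4
    folds = []
    for s_idx, c_idx in product(range(n_sp), range(n_cp)):
        folds.append({
            "test_subjects": sp[s_idx],
            "train_subjects": [s for i, p in enumerate(sp) if i != s_idx for s in p],
            "ood_classes": cp[c_idx],   # held out and grouped as the outlier class
            "id_classes": [c for i, p in enumerate(cp) if i != c_idx for c in p],
        })
    return folds
```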

3.4. Evaluation metrics

OOD-focused metrics. Within our two-level cross-validation approach, assuming CP_k is the current held-out class partition and noting its class complement, we start by building a multi-class confusion matrix that includes all ID classes in the complement and uses a single outlier class for the classes in CP_k. This is illustrated in Figure 4-left. Specifically, the outlier class is obtained by aggregating the classes in the CP_k class partition. Every annotated class is excluded NSP times in our cross-validation approach, as it becomes part of the aggregated outlier class. We categorised pixels belonging to the outlier class as negative OOD examples and all other classes as positive ID examples for this particular two-level fold.

[Figure 4: two confusion matrices over the OOD and ID classes (actual versus predicted).]
Figure 4. Graphical illustration of the confusion matrix incorporating multi-class ID and OOD data. Left: with actual OOD data as the negative class. Right: in the case without actual OOD data and with class 2 considered as positive while others are negative in a one-vs-rest approach.

Subsequently, we define the true positive rate (TPR_ID) from the multi-class positive ID examples and the false positive rate (FPR_OOD) from the negative OOD examples as follows:

\mathrm{TPR}_{\mathrm{ID}} = \frac{\sum_{c=1}^{C} \mathrm{TP}_{c}}{\sum_{c=1}^{C} (\mathrm{TP}_{c} + \mathrm{FN}_{c})}, \qquad \mathrm{FPR}_{\mathrm{OOD}} = \frac{\mathrm{FP}_{0}^{\mathrm{OOD}}}{\mathrm{TN}_{0}^{\mathrm{OOD}} + \mathrm{FP}_{0}^{\mathrm{OOD}}} \qquad (11)

where \mathrm{FN}_{c} = \mathrm{FN}_{c}^{\mathrm{OOD}} + \mathrm{FN}_{c}^{\mathrm{ID}}. It should be clear that, since our annotations are sparse, unlabelled data is omitted from these statistics.

By computing TPR_ID and FPR_OOD under multiple thresholds τ, we obtain a Receiver Operating Characteristic (ROC) curve. For clarity, we emphasize that this definition of the ROC curve specifically takes advantage of the distinction between the positive classes and the negative/OOD class to provide a single well-posed binarisation of the multi-class problem that does not rely on a one-vs-rest strategy. The area under the ROC curve (AUROC) is a threshold-independent metric which is commonly used by many image-level OOD detection methods (Hendrycks and Gimpel, 2017; Liang et al., 2017; Lee et al., 2018; Hsu et al., 2020). We thus use the AUROC metric (with our definition of TPR_ID and FPR_OOD) for quantitative evaluation.

Additionally, we propose to measure the Area Under the Precision-Recall curve (AUPR) (Saito and Rehmsmeier, 2015) as our second metric. Again, we define the precision within our multi-class setting by taking advantage of the distinction between the positive classes and the negative/OOD one:

\mathrm{Precision} = \frac{\sum_{c=1}^{C} \mathrm{TP}_{c}}{\mathrm{FP}_{0}^{\mathrm{OOD}} + \sum_{c=1}^{C} (\mathrm{TP}_{c} + \mathrm{FN}_{c}^{\mathrm{ID}})} \qquad (12)

Recall being a synonym for TPR, we use Equation (11) to define it. Finally, we measure AUPR by evaluating recall and precision under multiple τ thresholds.

All-classes metrics. For our experiment using all labelled classes, we do not have any ground-truth pixels associated with the OOD class. As illustrated in Figure 4-right, FP_0^OOD is thus 0 by construction and this would skew the previous metrics. In this context, we thus choose to compute TPR, TNR, balanced accuracy (BACC) and F1 score based on a one-vs-rest strategy (Taha and Hanbury, 2015). To distinguish these one-vs-rest metrics used in the all-classes setting from the OOD-focused ones, we use a superscript OVR when referring to them. The expressions for each individual positive class c are given by:

\mathrm{TPR}_{c}^{\mathrm{OVR}} = \frac{\mathrm{TP}_{c}}{\mathrm{TP}_{c} + \mathrm{FN}_{c}}, \quad \mathrm{TNR}_{c}^{\mathrm{OVR}} = \frac{\mathrm{TN}_{c}}{\mathrm{TN}_{c} + \mathrm{FP}_{c}}, \quad \mathrm{BACC}_{c}^{\mathrm{OVR}} = \frac{1}{2}\big(\mathrm{TPR}_{c}^{\mathrm{OVR}} + \mathrm{TNR}_{c}^{\mathrm{OVR}}\big), \quad \mathrm{F1}_{c}^{\mathrm{OVR}} = \frac{2\,\mathrm{TP}_{c}}{2\,\mathrm{TP}_{c} + \mathrm{FP}_{c} + \mathrm{FN}_{c}} \qquad (13)

where \mathrm{TN}_{c} = \mathrm{TN}_{c}^{\mathrm{ID}} + \mathrm{TN}_{c}^{\mathrm{OOD}} and, in the OVR setting, \mathrm{FP}_{c} = \mathrm{FP}_{c}^{\mathrm{ID}}. These class-specific OVR metrics are then averaged across the positive classes to provide mean scores: TPR^OVR, TNR^OVR, BACC^OVR and F1^OVR. Furthermore, we compute the AUROC^OVR and AUPR^OVR metrics in the OVR setting by computing TPR_c^OVR, FPR_c^OVR = 1 − TNR_c^OVR and Precision_c^OVR = TP_c / (TP_c + FP_c) under multiple thresholds.

3.5. OOD confidence threshold selection

As detailed in Equation (1), our approach relies on an OOD confidence threshold τ to generate the final segmentation masks. This threshold should be chosen to 1) accurately classify pixels belonging to an ID class, and 2) detect background / OOD test pixels. For comparison purposes, we can also define a baseline with τ0 = 0 to represent the method without outlier detection. To fulfil the two criteria above within our two-level cross-validation setup, we propose to find the optimal threshold τm which maximises a weighted sum of ID and OOD performance across the two-level folds using a pair of threshold-sensitive metrics:

\tau_{m} = \arg\max_{\tau} \frac{1}{N} \sum_{k=1}^{N} \Big[ w_{\mathrm{ID}}\,\mathrm{Metric}_{k}^{\mathrm{ID}}(\tau) + w_{\mathrm{OOD}}\,\mathrm{Metric}_{k}^{\mathrm{OOD}}(\tau) \Big] \qquad (14)

where N = NSP × NCP is the total number of cross-validation folds, and Metric_k^ID(τ) (respectively Metric_k^OOD(τ)) represents the ID (respectively OOD) performance of the model on the k-th cross-validation fold when using threshold τ. In this work, we choose TPR_ID and TNR_OOD = 1 − FPR_OOD as our ID and OOD metrics respectively. The computation of TPR_ID and FPR_OOD can be found in Equation (11). For the weighting parameters in Equation (14), we choose w_ID = w_OOD = 0.5.

When used outside of our two-level cross-validation approach, the OOD performance metrics are skewed by the absence of negative/OOD annotations, in which case our threshold selection approach can be extended to only account for ID performance, essentially setting w_ID = 1 and w_OOD = 0. An alternative is to use the optimal threshold from the two-level cross-validation experiments. We empirically found this threshold to offer a good trade-off between ID classification and OOD detection performance. When the ID data distribution is similar to that of the validation set used during cross-validation, applying this threshold can be beneficial for generalisation purposes.
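A simplified sketch of this threshold search is shown below; it assumes that per-fold arrays of pixel scores, predicted ID classes and ground-truth labels have already been collected, with 0 encoding the held-out/OOD class, and all names are our own.

```python
import numpy as np

def select_threshold(fold_scores, fold_preds, fold_labels, thresholds,
                     w_id=0.5, w_ood=0.5):
    """Select tau_m following Equation (14).

    For each of the N = N_SP x N_CP folds we assume three flat arrays over
    the annotated pixels: `fold_scores` holds max_c S^c, `fold_preds` the
    arg-max ID class (1..C) and `fold_labels` the ground truth with 0 for
    pixels belonging to the held-out (OOD) class partition.
    """
    objective = []
    for tau in thresholds:
        per_fold = []
        for scores, preds, labels in zip(fold_scores, fold_preds, fold_labels):
            accept = scores > tau
            is_id = labels > 0
            # TPR_ID from Equation (11): accepted and correctly classified ID pixels.
            tpr_id = (accept & is_id & (preds == labels)).sum() / max(is_id.sum(), 1)
            # TNR_OOD = 1 - FPR_OOD: rejected OOD pixels.
            tnr_ood = (~accept & ~is_id).sum() / max((~is_id).sum(), 1)
            per_fold.append(w_id * tpr_id + w_ood * tnr_ood)
        objective.append(np.mean(per_fold))
    return thresholds[int(np.argmax(objective))]
```

The same sweep of TPR_ID and FPR_OOD values over the candidate thresholds can also be reused to trace the ROC curve and estimate the AUROC reported in Section 3.4.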

4. Experimental setup

We start by describing our deep learning model and training details in Section 4.1, followed by the data preprocessing pipeline in Section 4.2.

4.1. Deep learning model and training setup

For all experiments, we use a U-Net architecture with an efficientnet-b4 encoder (Tan and Le, 2019) pretrained on the ImageNet dataset (Deng et al., 2009). Our implementation relies on the Segmentation Models PyTorch library1. The choice of the encoder is based on the good performance reported in a previous study (Seidlitz et al., 2022) and on the graphics memory limits of the hardware used for this work. The model inputs are either a pre-processed hyperspectral imaging (HSI) hypercube or an RGB image. The number of input channels and the weights of the first convolutional layer are re-initialised and set to match the number of channels of our input data. The output of the network is passed on to a segmentation head to calculate the output logits. The number of output classes is set to be equal to the number of positive classes (i.e. those marked as ID) for a given experimental setup. Note that, during our two-level cross-validation, this number will be lower than the number of positive classes in the training dataset as some classes are being held out.

Table 1. Hyperparameter setup.
Dataset          Init. LR   Batch size   Epochs
Heiporspectral   1e-4       8            20
ODSI-DB          1e-3       4            80
DSAD             1e-4       4            10

To perform model training, we minimise the cross-entropy loss between the softmax output and the one-hot encoded sparse annotation mask for pixels marked as ID (Figure 2, training stage). This approach is used for the Baseline and GODIN models. We use the Adam optimizer (Kingma and Ba, 2017) (β1: 0.9 and β2: 0.999) and an exponential learning rate scheme with decay rate γ = 0.999. The initial learning rate, mini-batch size and total number of epochs vary across datasets. We show the choice of these hyperparameters in Table 1. For a fair comparison, we use the same hyperparameters for both methods.

To enable as fair a comparison as possible, we take advantage of the fact that the ODIN and Mahalanobis methods can be applied on a frozen, pre-trained model. In this case, we re-use the weights from the Baseline model and implement the scoring function as a post-processing step. We nonetheless allow for tuning the confidence threshold as described in Section 3.5.

4.2. Data preprocessing pipeline

For the DSAD dataset, data are stored in PNG format. We use the Pillow library2 to read the RGB data and convert it into PyTorch tensors. For the two HSI datasets, we first extract the hypercube using the provided Python libraries (Studier-Fischer et al., 2023; Hyttinen et al., 2020) and manually select 16 channels at equal intervals from the total available spectral bands, sorted in ascending order.

After exporting the data, we apply ℓ1-normalisation at each spatial location (i, j) to account for the non-uniform illumination of the tissue surface. This is routinely applied in hyperspectral imaging because of the dependency of the signal on the distance between the camera and the tissue (Bahl et al., 2023; Studier-Fischer et al., 2023). The uneven surface of the tissue can also cause some image areas to have different lighting conditions, which affects the classification accuracy and can be mitigated by data normalisation. For data augmentation, we adopt a setup similar to that reported in (Seidlitz et al., 2022): random rotation (rotation angle limit: 45°); random flip; random scaling (scaling factor limit: 0.1); random shift (shift factor limit: 0.0625). All transformations are applied with a probability of 0.5.

1 https://github.com/qubvel/segmentation_models.pytorch
2 https://pillow.readthedocs.io/en/latest/
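The following sketch summarises this setup using the Segmentation Models PyTorch library referenced above; the exact handling of unlabelled pixels (here via an assumed ignore label of 255) and the variable names are our own choices rather than the released training code.

```python
import torch
import segmentation_models_pytorch as smp

NUM_ID_CLASSES = 20      # e.g. the Heiporspectral organ classes marked as ID
IN_CHANNELS = 16         # sub-sampled spectral bands (3 for the RGB DSAD data)

model = smp.Unet(
    encoder_name="efficientnet-b4",
    encoder_weights="imagenet",      # first conv re-initialised for 16 channels
    in_channels=IN_CHANNELS,
    classes=NUM_ID_CLASSES,          # no explicit background/OOD output
)

# Unlabelled pixels carry no supervision: we assume they are encoded with an
# ignore label so that only sparsely annotated ID pixels contribute to L_CE.
criterion = torch.nn.CrossEntropyLoss(ignore_index=255)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
# scheduler.step() is assumed to be called once per epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)

def training_step(images: torch.Tensor, sparse_masks: torch.Tensor) -> float:
    """One optimisation step on a mini-batch of sparsely annotated images."""
    optimizer.zero_grad()
    logits = model(images)                    # (N, C, H, W)
    loss = criterion(logits, sparse_masks)    # masks: (N, H, W), 255 = unlabelled
    loss.backward()
    optimizer.step()
    return loss.item()
```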

5. Results

We begin by visualising confusion matrices and ROC curves as illustrated in Section 5.1. These measures provide the foundation for both our qualitative and quantitative analysis in the later sections. Section 5.2 and Section 5.5 show the overall performance of our proposed framework comparing different methods as the scoring function. Section 5.3 and Section 5.4 show qualitative evaluations for all methods, plus a scenario in which our proposed OOD segmentation framework is not applied. Furthermore, we have tested the performance of our method under the scenario in which all labelled classes are considered as ID. The results are shown in Section 5.5.

5.1. Visualising confusion matrix and ROC curve

Figure 5 demonstrates that our method effectively separates OOD pixels while maintaining classification accuracy for ID pixels. Figure 6 shows ROC curves with the ID classes on the y-axis and the OOD classes on the x-axis. As τ increases, more data will be flagged as OOD. For the baseline method, fewer pixels are rejected under a threshold of 0.99, indicating that the model tends to make overconfident predictions and this decreases the overall AUC. The ODIN method addresses this issue by providing a calibrated score that aligns with the actual likelihood of correctness, resulting in a less overconfident threshold τm for generating the predicted mask.

[Figure 5: two confusion-matrix heatmaps (axes: Actual versus Predicted).]
Figure 5. Example confusion matrix at threshold τ0 = 0 (left) and τm (right) for the ODIN method on the Heiporspectral dataset with a specific held-out class partition. The first row and column represent the negative / outlier class.

[Figure 6: ROC curves (TPR_ID versus FPR_OOD) for the Baseline, ODIN, Mahalanobis and GODIN methods, with the AUROC and the selected τm marked for each.]
Figure 6. ROC curves comparing different OOD detection methods within our framework. The curve shows the average performance of 16 models over all two-level folds. Shaded area represents standard deviation across the folds. Experiments were conducted on the Heiporspectral dataset.

5.2. Cross-validation results for OOD detection

Table 2 shows cross-validation results on both the AUROC and AUPR metrics. For each method and metric, we report results across different CP. For each CP, we aggregate results across all folds sharing the same SP and report mean and standard deviation. Similarly, Table 3 shows cross-validation results on both the AUROC and AUPR metrics. Here, for each method and metric, we report results across different SP by aggregating results across all folds sharing the same CP. Again we report the mean and standard deviation for all results. We observed a consistent trend between the two hyperspectral image datasets and the RGB colour image dataset, suggesting that our framework is applicable to these medical imaging modalities.

Overall, we found that using the ODIN method as a scoring function yielded the best performance. This suggests that a well-calibrated confidence score is crucial for detecting OOD data. The GODIN method employs a learnable temperature parameter for calibration during training, which may increase training complexity as it requires learning both the model parameters for optimal ID performance and the temperature parameter for effective calibration. The underperformance of the Mahalanobis distance based approach could be explained by the relatively high dimension of the feature space on which we use it. Covariance estimation in high-dimensional space is indeed prone to ill-conditioned results and thus poor reliability of inverse-covariance based features.

5.3. Qualitative results for all labelled classes

Figure 7 presents qualitative results when training for all labelled classes. For each dataset, we display the results for the same image obtained using different methods at a confidence threshold of τm. Additionally, we include baseline results at τ0 = 0 to illustrate the outcome without outlier detection. We selected examples with multi-class labels across our three datasets.

For the ODSI-DB dataset, one image with multi-class annotation was randomly selected, given that many images are annotated with more than one class. In the case of the Heiporspectral dataset, where no multi-class data is available within a single image, an experienced clinical expert in our team manually annotated an image to serve as the ground truth for this qualitative demonstration. For the DSAD dataset, we chose an image from the additional subset created from 1430 stomach frames that contain multi-class ground truth data.

5.4. Qualitative results with held-out OOD classes

We also considered qualitative results across all datasets using our CP hold-out approach. To keep the main content concise for readability, we refer the reader to the appendix for these results. Figure A.1 to Figure A.6 show some cases with segmentation mask overlays. For each case, we show the ground truth and predicted segmentation masks from one of the subject partitions SPk and all class partitions CP1 to CP4 within SPk. Different classes are held out in each CP-related fold. Each class from the training set is held out at least NSP times. For each CP, we visualise and compare masks generated using different methods at threshold τm. We also show the baseline results at τ0 = 0 to represent the baseline method without using our proposed framework.

Given the sparse nature of our ground-truth annotations, qualitative evaluation reveals insights that quantitative measures alone fail to capture. In a few cases, we observed that while the model performs well on annotated pixels, including those from held-out classes, the overall quality of the segmentation mask is relatively poor. This indicates that quantitative metrics are insufficient to fully represent the model's performance on unlabelled pixels. Our findings emphasize the importance of qualitative analysis in assessing model performance on unlabelled sections.

5.5. Cross-validation results using all labelled classes

We further analyse cross-validation results when training using all labelled classes. In this experiment, no classes are marked as OOD; therefore there is no negative data for training or evaluation. Instead, we adopt the one-vs-rest strategy discussed in Section 3.4 and measure TPR^OVR, TNR^OVR, BACC^OVR and F1^OVR across the positive classes. In addition to the threshold-dependent metrics, we further measure the threshold-independent metrics AUROC^OVR and AUPR^OVR.

Table 2. Cross-validation results for different class partitions (CP). For each CP, we aggregate results across all two-level folds sharing same subject
partition (SP) and report mean and standard deviation. We compared different OOD segmentation methods on two threshold independent metrics. Best
performances are highlighted in bold.
Dataset Method AUROC ↑ AUPR ↑
CP1 CP2 CP3 CP4 CPmean CP1 CP2 CP3 CP4 CPmean
Baseline 0.85±0.04 0.78±0.08 0.85±0.02 0.85±0.06 0.84±0.03 0.45±0.04 0.45±0.03 0.46±0.01 0.47±0.02 0.46±0.01
ODIN 0.97±0.02 0.95±0.03 0.96±0.02 0.97±0.01 0.96±0.01 0.98±0.01 0.98±0.02 0.98±0.02 0.99±0.00 0.98±0.00
Heiporspectral
Mahalanobis 0.93±0.01 0.87±0.03 0.88±0.07 0.86±0.03 0.88±0.03 0.95±0.01 0.94±0.02 0.93±0.03 0.92±0.02 0.93±0.01
GODIN 0.91±0.07 0.86±0.08 0.96±0.02 0.92±0.05 0.91±0.04 0.93±0.06 0.93±0.06 0.97±0.02 0.93±0.07 0.94±0.02
Baseline 0.66±0.08 0.76±0.02 0.80±0.05 0.71±0.04 0.73±0.05 0.52±0.05 0.51±0.03 0.45±0.04 0.49±0.02 0.49±0.02
ODIN 0.66±0.08 0.74±0.05 0.84±0.04 0.78±0.02 0.75±0.07 0.78±0.03 0.85±0.03 0.85±0.05 0.87±0.05 0.84±0.03
ODSI-DB
Mahalanobis 0.67±0.07 0.76±0.02 0.78±0.06 0.76±0.04 0.74±0.04 0.60±0.07 0.65±0.02 0.64±0.05 0.58±0.04 0.62±0.03
GODIN 0.66±0.08 0.65±0.06 0.69±0.05 0.50±0.25 0.62±0.08 0.71±0.07 0.73±0.05 0.66±0.07 0.49±0.32 0.65±0.09
Baseline 0.65±0.11 0.68±0.07 0.68±0.13 0.76±0.05 0.69±0.04 0.54±0.08 0.42±0.05 0.46±0.05 0.60±0.03 0.51±0.07
ODIN 0.65±0.12 0.70±0.09 0.70±0.13 0.78±0.06 0.71±0.04 0.80±0.11 0.73±0.09 0.72±0.13 0.85±0.03 0.78±0.05
DSAD
Mahalanobis 0.64±0.08 0.66±0.05 0.68±0.12 0.70±0.02 0.67±0.02 0.77±0.10 0.64±0.05 0.67±0.14 0.83±0.02 0.73±0.07
GODIN 0.68±0.04 0.67±0.10 0.68±0.08 0.71±0.10 0.68±0.02 0.76±0.06 0.68±0.15 0.67±0.11 0.80±0.05 0.72±0.05

Table 3. Cross-validation results for different subject partitions (SP). For each SP, we aggregate results across all two-level folds sharing the same class partition (CP) and report mean and standard deviation. We compare different OOD segmentation methods on two threshold-independent metrics. Best performances are highlighted in bold.

Dataset / Method     AUROC ↑ (SP1 | SP2 | SP3 | SP4 | SPmean)                 AUPR ↑ (SP1 | SP2 | SP3 | SP4 | SPmean)

Heiporspectral
  Baseline           0.81±0.03  0.84±0.09  0.83±0.06  0.86±0.03  0.84±0.02  |  0.44±0.02  0.46±0.04  0.47±0.02  0.46±0.02  0.46±0.01
  ODIN               0.96±0.03  0.97±0.01  0.97±0.01  0.95±0.02  0.96±0.01  |  0.97±0.01  0.99±0.01  0.99±0.01  0.97±0.01  0.98±0.01
  Mahalanobis        0.88±0.05  0.85±0.06  0.90±0.03  0.91±0.03  0.88±0.02  |  0.93±0.02  0.91±0.03  0.94±0.02  0.95±0.01  0.93±0.01
  GODIN              0.93±0.02  0.91±0.11  0.94±0.02  0.86±0.06  0.91±0.03  |  0.96±0.02  0.95±0.07  0.96±0.01  0.89±0.06  0.94±0.03

ODSI-DB
  Baseline           0.78±0.03  0.71±0.04  0.72±0.09  0.72±0.11  0.73±0.03  |  0.50±0.03  0.48±0.08  0.49±0.02  0.49±0.03  0.49±0.01
  ODIN               0.81±0.04  0.72±0.07  0.76±0.09  0.72±0.11  0.75±0.04  |  0.87±0.05  0.82±0.03  0.85±0.07  0.81±0.04  0.84±0.02
  Mahalanobis        0.80±0.04  0.76±0.06  0.72±0.08  0.69±0.05  0.74±0.04  |  0.64±0.07  0.63±0.02  0.60±0.07  0.60±0.02  0.62±0.02
  GODIN              0.56±0.23  0.64±0.03  0.59±0.19  0.70±0.04  0.62±0.05  |  0.54±0.28  0.67±0.08  0.61±0.18  0.77±0.04  0.65±0.08

DSAD
  Baseline           0.73±0.05  0.74±0.11  0.66±0.11  0.63±0.07  0.69±0.05  |  0.54±0.09  0.50±0.09  0.47±0.07  0.51±0.12  0.51±0.03
  ODIN               0.76±0.05  0.76±0.13  0.69±0.11  0.63±0.10  0.71±0.05  |  0.84±0.03  0.81±0.13  0.74±0.11  0.72±0.10  0.78±0.05
  Mahalanobis        0.71±0.06  0.70±0.08  0.62±0.07  0.64±0.08  0.67±0.04  |  0.78±0.10  0.77±0.12  0.67±0.12  0.71±0.10  0.73±0.05
  GODIN              0.67±0.07  0.77±0.04  0.67±0.07  0.63±0.06  0.68±0.05  |  0.77±0.04  0.80±0.04  0.67±0.12  0.66±0.14  0.72±0.06
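To make explicit how the entries of Tables 2 and 3 are aggregated, the snippet below sketches the mean and standard deviation computed over all two-level folds sharing one partition index. The dictionary layout of fold scores is a hypothetical stand-in for the experiment logs, not the actual aggregation script.

```python
# Hypothetical sketch of the fold aggregation used in Tables 2 and 3: each fold is
# indexed by a class partition (CP) and a subject partition (SP); a CP column pools
# all folds sharing that CP (and symmetrically for SP). Data layout is assumed.
import numpy as np

fold_scores = {(cp, sp): float(np.random.uniform(0.8, 1.0))   # e.g. per-fold AUROC
               for cp in range(1, 5) for sp in range(1, 5)}

def aggregate(fold_scores, by="CP"):
    table = {}
    for k in range(1, 5):
        vals = [v for (cp, sp), v in fold_scores.items()
                if (cp if by == "CP" else sp) == k]
        table[f"{by}{k}"] = (np.mean(vals), np.std(vals))      # mean ± std over 4 folds
    return table

print(aggregate(fold_scores, by="CP"))
```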

τm is selected as the optimal threshold from the two-level cross-validation experiments. However, we also experimented with the ID-only threshold selection strategy described in Section 3.5 for the OVR setting. We found the difference in performance between the two types of τm to be negligible. Furthermore, we found no drop in ID performance at τm compared to τ0, highlighting that our OOD rejection does not aggressively mark pixels as OOD. Since the total positive data remains the same, this indicates that our framework does not compromise the detection of true positives for ID classes but moves false negatives from misclassified ID classes to the outlier class. Since there is no ground truth for the outlier class, we can discard these pixels and only account for misclassified ID classes. For brevity, these results are shown in the Appendix, Table A.1 for the threshold-dependent metrics and Table A.2 for the threshold-independent metrics.
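To make the role of the confidence threshold concrete, the sketch below shows one way to apply pixel-wise OOD rejection and compute the threshold-dependent OVR metrics. The decision rule (argmax over positive classes, rejection when the top confidence falls at or below τ) follows the description above, while the variable names and the exact treatment of rejected pixels are illustrative assumptions rather than the paper's evaluation code.

```python
# Minimal sketch, assuming `conf` is an (N, C) array of per-pixel confidences for the
# C positive (ID) classes and `labels` an (N,) array of sparse ground-truth indices
# (-1 for unlabelled pixels). tau = 0 reproduces the baseline without rejection.
import numpy as np

def segment_with_rejection(conf, tau):
    pred = conf.argmax(axis=1)                   # most confident positive class
    pred[conf.max(axis=1) <= tau] = -1           # below threshold: assign outlier class
    return pred

def ovr_threshold_metrics(conf, labels, tau):
    keep = labels >= 0                           # evaluate on annotated pixels only
    pred, labels = segment_with_rejection(conf[keep], tau), labels[keep]
    tpr, tnr = [], []
    for c in range(conf.shape[1]):
        pos, neg = labels == c, labels != c
        if pos.any():
            tpr.append((pred[pos] == c).mean())  # true positive rate for class c
        if neg.any():
            tnr.append((pred[neg] != c).mean())  # true negative rate for class c
    tpr, tnr = float(np.mean(tpr)), float(np.mean(tnr))
    return tpr, tnr, 0.5 * (tpr + tnr)           # balanced accuracy
```

With softmax confidences, no pixel has a score of zero or below, so τ = 0 leaves all predictions intact and corresponds to the τ0 = 0 columns in the tables.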
uation pipeline bridge the gap between OOD detection meth-

[Figure 7 image grid. Columns: Heiporspectral, ODSI-DB, DSAD. Rows: sparsely annotated ground truth, Baseline (τ0 = 0), Baseline (τm), ODIN, Mahalanobis, GODIN.]

Figure 7. Qualitative results using all labelled classes during training, i.e. with N_CP = 1. For each dataset, we show the result of the same image using different methods at confidence threshold τm. τm is chosen as the optimal threshold from the two-level cross-validation experiments. Baseline results at τ0 = 0 are added to represent results without outlier detection.

6. Conclusion

In this work, we have presented a novel framework named OOD-SEG for detecting negative/out-of-distribution (OOD) pixels while preserving multi-class positive/in-distribution (ID) classification accuracy in medical image segmentation. Our framework is based on a positive-only learning setting, which establishes distinct decision boundaries that enclose positively labelled data for each class. This approach allows for training with sparse annotations under a weakly supervised learning (WSL) setting and demonstrates robustness against various anomalies compared to methods relying on full annotations.

To assess model performance in OOD scenarios, we proposed a novel evaluation protocol based on subjects and classes, facilitating a more comprehensive assessment of OOD detection capabilities in medical imaging. Our framework and evaluation protocol bridge the gap between OOD detection methods originally designed for image classification and their application in medical image segmentation. Extensive experiments conducted on two hyperspectral and one RGB laparoscopic imaging datasets validate the efficacy of our framework, showing improved OOD detection performance without compromising classification accuracy.

Limitations. Despite the promising results demonstrated by our OOD-SEG framework, limitations need to be acknowledged to provide a comprehensive understanding of its applicability and areas for improvement.

First, our framework is built on positive-only learning. Unlike PU learning, positive-only learning does not utilise unlabelled data during the learning process. Future work could explore the use of unlabelled data to expand the diversity of the data seen during training and improve OOD performance.

Second, unlike many OOD detection approaches used in image classification (Liang et al., 2017; Lee et al., 2018; Hsu et al., 2020), we did not include any adversarial perturbations in our experiments. Although the gains are typically modest and come at a high computational cost, adversarial perturbations have often been shown to improve OOD performance. Future studies could investigate the effectiveness of adding perturbations for image segmentation tasks, building on the proposed framework.

Lastly, our method exhibits an inherent trade-off between ID classification accuracy and OOD detection performance, whereby enhancing one aspect tends to diminish the other. Such a trade-off suggests that future work could explore strategies to better balance or synergistically improve both classification accuracy and OOD detection.

Impact. Our findings suggest that OOD-SEG has the potential to significantly impact downstream medical imaging applications. By enabling reliable OOD detection with sparse positive-only annotations, our framework can enhance the safety and robustness of automated segmentation systems used in clinical settings. This could reduce the risk of misclassification of unknown or anomalous tissue types. Our evaluation protocol may also serve as a benchmark for future research, promoting the development of more advanced OOD detection methods in medical image segmentation.

Acknowledgments

TV and JS are co-founders and shareholders of Hypervision Surgical Ltd, London, UK. The authors have no other relevant interests to declare.

This project received funding from the National Institute for Health and Care Research (NIHR) under its Invention for Innovation (i4i) Programme [NIHR202114]. The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. This work was supported by core funding from the Wellcome/EPSRC [WT203148/Z/16/Z; NS/A000049/1]. OM is funded by the EPSRC DTP [EP/T517963/1]. For the purpose of open access, the authors have applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission.

References

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., Mané, D., 2016. Concrete Problems in AI Safety. doi:10.48550/arXiv.1606.06565.
Bahl, A., Horgan, C.C., Janatka, M., MacCormac, O.J., Noonan, P., Xie, Y., Qiu, J., Cavalcanti, N., Fürnstahl, P., Ebner, M., Bergholt, M.S., Shapey, J., Vercauteren, T., 2023. Synthetic white balancing for intra-operative hyperspectral imaging. Journal of Medical Imaging 10, 046001. doi:10.1117/1.JMI.10.4.046001.
Barfoot, T., Garcia-Peraza-Herrera, L., Glocker, B., Vercauteren, T., 2024. Average Calibration Error: A Differentiable Loss for Improved Reliability in Image Segmentation. URL: http://arxiv.org/abs/2403.06759.
Bekker, J., Davis, J., 2020. Learning from positive and unlabeled data: A survey. Machine Learning 109, 719–760. doi:10.1007/s10994-020-05877-5.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Information Science and Statistics, New York.
Bitarafan, A., Nikdan, M., Baghshah, M.S., 2021. 3D Image Segmentation With Sparse Annotation by Self-Training and Internal Registration. IEEE Journal of Biomedical and Health Informatics 25, 2665–2672. doi:10.1109/JBHI.2020.3038847.
Bulusu, S., Kailkhura, B., Li, B., Varshney, P.K., Song, D., 2020. Anomalous Example Detection in Deep Learning: A Survey. IEEE Access 8, 132330–132347. doi:10.1109/ACCESS.2020.3010274.
Cai, H., Qi, L., Yu, Q., Shi, Y., Gao, Y., 2023. 3D Medical Image Segmentation with Sparse Annotation via Cross-Teaching between 3D and 2D Networks. doi:10.48550/arXiv.2307.16256.
Can, Y.B., Chaitanya, K., Mustafa, B., Koch, L.M., Konukoglu, E., Baumgartner, C.F., 2018. Learning to Segment Medical Images with Scribble-Supervision Alone, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Cham. pp. 236–244. doi:10.1007/978-3-030-00889-5_27.
Carstens, M., Rinner, F.M., Bodenstedt, S., Jenke, A.C., Weitz, J., Distler, M., Speidel, S., Kolbinger, F.R., 2023. The Dresden Surgical Anatomy Dataset for Abdominal Organ Segmentation in Surgical Data Science. Scientific Data 10, 3. doi:10.1038/s41597-022-01719-2.
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255.
Dorent, R., Joutard, S., Shapey, J., Kujawa, A., Modat, M., Ourselin, S., Vercauteren, T., 2021. Inter Extreme Points Geodesics for End-to-End Weakly Supervised Image Segmentation, in: Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, Cham. pp. 615–624. doi:10.1007/978-3-030-87196-3_57.
European Commission, 2024. Artificial Intelligence Act (Regulation (EU) 2024/1689). URL: https://artificialintelligenceact.eu/high-level-summary/.
Fabelo, H., Ortega, S., Kabwama, S., Callico, G.M., Bulters, D., Szolna, A., Pineiro, J.F., Sarmiento, R., 2016. HELICoiD project: A new use of hyperspectral imaging for brain cancer detection in real-time during neurosurgical operations, in: Hyperspectral Imaging Sensors: Innovative Applications and Sensor Standards 2016, p. 986002. doi:10.1117/12.2223075.
Fabelo, H., Ortega, S., Ravi, D., Kiran, B.R., Sosa, C., Bulters, D., Callicó, G.M., Bulstrode, H., Szolna, A., Piñeiro, J.F., Kabwama, S., Madroñal, D., Lazcano, R., J-O’Shanahan, A., Bisshopp, S., Hernández, M., Báez, A., Yang, G.Z., Stanciulescu, B., Salvador, R., Juárez, E., Sarmiento, R., 2018. Spatio-spectral classification of hyperspectral images for brain cancer detection during surgical operations. PLOS ONE 13, e0193721. doi:10.1371/journal.pone.0193721.
Gal, Y., Ghahramani, Z., 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning, in: Proceedings of The 33rd International Conference on Machine Learning, pp. 1050–1059. URL: https://proceedings.mlr.press/v48/gal16.html.
Garcia Peraza Herrera, L.C., Horgan, C., Ourselin, S., Ebner, M., Vercauteren, T., 2023. Hyperspectral image segmentation: A preliminary study on the Oral and Dental Spectral Image Database (ODSI-DB). Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 11, 1290–1298. doi:10.1080/21681163.2022.2160377.
Glocker, B., Zikic, D., Konukoglu, E., Haynor, D.R., Criminisi, A., 2013. Vertebrae Localization in Pathological Spine CT via Dense Classification from Sparse Annotations, in: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013, Berlin, Heidelberg. pp. 262–270. doi:10.1007/978-3-642-40763-5_33.
González, C., Gotkowski, K., Fuchs, M., Bucher, A., Dadras, A., Fischbach, R., Kaltenborn, I.J., Mukhopadhyay, A., 2022. Distance-based detection of out-of-distribution silent failures for Covid-19 lung lesion segmentation. Medical Image Analysis 82, 102596. doi:10.1016/j.media.2022.102596.
Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q., 2017. On Calibration of Modern Neural Networks, in: Proceedings of the 34th International Conference on Machine Learning, pp. 1321–1330. URL: https://proceedings.mlr.press/v70/guo17a.html.

Hendrycks, D., Gimpel, K., 2017. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks, in: International Conference on Learning Representations, pp. 1–12. URL: https://openreview.net/forum?id=Hkg4TI9xl.
Hsu, Y.C., Shen, Y., Jin, H., Kira, Z., 2020. Generalized ODIN: Detecting Out-of-distribution Image without Learning from Out-of-distribution Data. doi:10.48550/arXiv.2002.11297.
Hyttinen, J., Fält, P., Jäsberg, H., Kullaa, A., Hauta-Kasari, M., 2020. Oral and Dental Spectral Image Database—ODSI-DB. Applied Sciences 10, 7246. doi:10.3390/app10207246.
Jégou, S., Drozdzal, M., Vazquez, D., Romero, A., Bengio, Y., 2017. The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation. doi:10.48550/arXiv.1611.09326.
Karimi, D., Gholipour, A., 2023. Improving Calibration and Out-of-Distribution Detection in Deep Models for Medical Image Segmentation. IEEE Transactions on Artificial Intelligence 4, 383–397. doi:10.1109/TAI.2022.3159510.
Khan, U., Paheding, S., Elkin, C.P., Devabhaktuni, V.K., 2021. Trends in Deep Learning for Medical Hyperspectral Image Analysis. IEEE Access 9, 79534–79548. doi:10.1109/ACCESS.2021.3068392.
Kingma, D.P., Ba, J., 2017. Adam: A Method for Stochastic Optimization. URL: http://arxiv.org/abs/1412.6980.
Kuang, Z., Yan, Z., Yu, L., 2024. Weakly supervised learning for multi-class medical image segmentation via feature decomposition. Computers in Biology and Medicine 171, 108228. doi:10.1016/j.compbiomed.2024.108228.
Lakshminarayanan, B., Pritzel, A., Blundell, C., 2017. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles, in: Advances in Neural Information Processing Systems, pp. 1–12. URL: https://proceedings.neurips.cc/paper_files/paper/2017/hash/9ef2ed4b7fd2c810847ffa5fa85bce38-Abstract.html.
Lambert, B., Forbes, F., Doyle, S., Dehaene, H., Dojat, M., 2024. Trustworthy clinical AI solutions: A unified review of uncertainty quantification in Deep Learning models for medical image analysis. Artificial Intelligence in Medicine 150, 102830. doi:10.1016/j.artmed.2024.102830.
Lee, K., Lee, K., Lee, H., Shin, J., 2018. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks, in: Advances in Neural Information Processing Systems, pp. 1–11. URL: https://papers.nips.cc/paper_files/paper/2018/hash/abdeb6f575ac5c6676b747bca8d09cc2-Abstract.html.
Liang, G., Zhang, Y., Wang, X., Jacobs, N., 2020. Improved Trainable Calibration Method for Neural Networks on Medical Imaging Classification. URL: http://arxiv.org/abs/2009.04057.
Liang, S., Li, Y., Srikant, R., 2017. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks. URL: https://arxiv.org/abs/1706.02690v5.
van der Maaten, L., Hinton, G., 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9, 2579–2605. URL: http://jmlr.org/papers/v9/vandermaaten08a.html.
Martín-Pérez, A., Martinez-Vega, B., Villa, M., Leon, R., Martinez de Ternero, A., Fabelo, H., Ortega, S., Quevedo, E., Callico, G.M., Juarez, E., Sanz, C., 2024. Machine Learning Performance Trends: A Comparative Study of Independent Hyperspectral Human Brain Cancer Databases. URL: https://papers.ssrn.com/abstract=4898113.
Moccia, S., Wirkert, S.J., Kenngott, H., Vemuri, A.S., Apitz, M., Mayer, B., De Momi, E., Mattos, L.S., Maier-Hein, L., 2018. Uncertainty-Aware Organ Classification for Surgical Data Science Applications in Laparoscopy. IEEE Transactions on Biomedical Engineering 65, 2649–2659. doi:10.1109/TBME.2018.2813015.
Oza, P., Patel, V.M., 2019. One-Class Convolutional Neural Network. IEEE Signal Processing Letters 26, 277–281. doi:10.1109/LSP.2018.2889273.
Perera, P., Oza, P., Patel, V.M., 2021. One-Class Classification: A Survey. doi:10.48550/arXiv.2101.03064.
Qu, H., Wu, P., Huang, Q., Yi, J., Riedlinger, G.M., De, S., Metaxas, D.N., 2019. Weakly Supervised Deep Nuclei Segmentation using Points Annotation in Histopathology Images, in: Proceedings of The 2nd International Conference on Medical Imaging with Deep Learning, pp. 390–400. URL: https://proceedings.mlr.press/v102/qu19a.html.
Ravì, D., Fabelo, H., Callic, G.M., Yang, G.Z., 2017. Manifold Embedding and Semantic Segmentation for Intraoperative Guidance With Hyperspectral Brain Imaging. IEEE Transactions on Medical Imaging 36, 1845–1857. doi:10.1109/TMI.2017.2695523.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation, in: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Cham. pp. 234–241. doi:10.1007/978-3-319-24574-4_28.
Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S.A., Binder, A., Müller, E., Kloft, M., 2018. Deep One-Class Classification, in: Proceedings of the 35th International Conference on Machine Learning, pp. 4393–4402. URL: https://proceedings.mlr.press/v80/ruff18a.html.
Saito, T., Rehmsmeier, M., 2015. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE 10, e0118432. doi:10.1371/journal.pone.0118432.
Seidlitz, S., Sellner, J., Odenthal, J., Özdemir, B., Studier-Fischer, A., Knödler, S., Ayala, L., Adler, T.J., Kenngott, H.G., Tizabi, M., Wagner, M., Nickel, F., Müller-Stich, B.P., Maier-Hein, L., 2022. Robust deep learning-based semantic organ segmentation in hyperspectral images. Medical Image Analysis 80, 102488. doi:10.1016/j.media.2022.102488.
Shapey, J., Xie, Y., Nabavi, E., Bradford, R., Saeed, S.R., Ourselin, S., Vercauteren, T., 2019. Intraoperative multispectral and hyperspectral label-free imaging: A systematic review of in vivo clinical studies. Journal of Biophotonics 12, e201800455. doi:10.1002/jbio.201800455.
Shotton, J., Johnson, M., Cipolla, R., 2008. Semantic texton forests for image categorization and segmentation, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. doi:10.1109/CVPR.2008.4587503.
Studier-Fischer, A., Seidlitz, S., Sellner, J., Bressan, M., Özdemir, B., Ayala, L., Odenthal, J., Knoedler, S., Kowalewski, K.F., Haney, C.M., Salg, G., Dietrich, M., Kenngott, H., Gockel, I., Hackert, T., Müller-Stich, B.P., Maier-Hein, L., Nickel, F., 2023. HeiPorSPECTRAL - the Heidelberg Porcine HyperSPECTRAL Imaging Dataset of 20 Physiological Organs. Scientific Data 10, 414. doi:10.1038/s41597-023-02315-8.
Taha, A.A., Hanbury, A., 2015. Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool. BMC Medical Imaging 15, 29. doi:10.1186/s12880-015-0068-x.
Tajbakhsh, N., Jeyaseelan, L., Li, Q., Chiang, J.N., Wu, Z., Ding, X., 2020. Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation. Medical Image Analysis 63, 101693. doi:10.1016/j.media.2020.101693.
Tan, M., Le, Q., 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, in: Proceedings of the 36th International Conference on Machine Learning, pp. 6105–6114. URL: https://proceedings.mlr.press/v97/tan19a.html.
Wang, G., Li, W., Aertsen, M., Deprest, J., Ourselin, S., Vercauteren, T., 2019a. Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks. Neurocomputing 335, 34–45. doi:10.1016/j.neucom.2019.01.103.
Wang, G., Li, W., Ourselin, S., Vercauteren, T., 2019b. Automatic Brain Tumor Segmentation Based on Cascaded Convolutional Neural Networks With Uncertainty Estimation. Frontiers in Computational Neuroscience 13. doi:10.3389/fncom.2019.00056.
Wang, G., Li, W., Zuluaga, M.A., Pratt, R., Patel, P.A., Aertsen, M., Doel, T., David, A.L., Deprest, J., Ourselin, S., Vercauteren, T., 2018. Interactive Medical Image Segmentation Using Deep Learning With Image-Specific Fine Tuning. IEEE Transactions on Medical Imaging 37, 1562–1573. doi:10.1109/TMI.2018.2791721.
Wang, G., Zuluaga, M.A., Li, W., Pratt, R., Patel, P.A., Aertsen, M., Doel, T., David, A.L., Deprest, J., Ourselin, S., Vercauteren, T., 2019c. DeepIGeoS: A Deep Interactive Geodesic Framework for Medical Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 1559–1572. doi:10.1109/TPAMI.2018.2840695.
Wang, J., Xia, B., 2021. Bounding Box Tightness Prior for Weakly Supervised Image Segmentation, in: Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, Cham. pp. 526–536. doi:10.1007/978-3-030-87196-3_49.
Wenzel, F., Snoek, J., Tran, D., Jenatton, R., 2020. Hyperparameter Ensembles for Robustness and Uncertainty Quantification, in: Advances in Neural Information Processing Systems, pp. 6514–6527. URL: https://proceedings.neurips.cc/paper/2020/hash/481fbfa59da2581098e841b7afc122f1-Abstract.html.
Xu, Y., Zhu, J.Y., Chang, E.I.C., Lai, M., Tu, Z., 2014. Weakly supervised histopathology cancer image segmentation and classification. Medical Image Analysis 18, 591–604. doi:10.1016/j.media.2014.01.010.

Yang, J., Zhou, K., Li, Y., Liu, Z., 2024. Generalized Out-of-Distribution Detection: A Survey. International Journal of Computer Vision. doi:10.1007/s11263-024-02117-4.
Zhu, Q., Deng, W., Zheng, Z., Zhong, Y., Guan, Q., Lin, W., Zhang, L., Li, D., 2022. A Spectral-Spatial-Dependent Global Learning Framework for Insufficient and Imbalanced Hyperspectral Image Classification. IEEE Transactions on Cybernetics 52, 11709–11723. doi:10.1109/TCYB.2021.3070577.

Appendix A. Additional results

Table A.1. Cross-validation results when using all annotated classes for training, i.e. with N_CP = 1. For each method and metric, performance is reported at thresholds τ0 = 0 and τm. τm is chosen as the optimal threshold from the two-level cross-validation experiments. Best performances among all methods for each dataset are highlighted in bold.

Dataset / Method     TPR_OVR ↑ (τ0 = 0 | τm)    TNR_OVR ↑ (τ0 = 0 | τm)    BACC_OVR ↑ (τ0 = 0 | τm)    F1_OVR ↑ (τ0 = 0 | τm)

Heiporspectral
  Baseline           0.98±0.02  0.98±0.02       1.00±0.00  1.00±0.00       0.99±0.01  0.99±0.01        0.98±0.02  0.98±0.02
  ODIN               0.98±0.02  0.96±0.02       1.00±0.00  1.00±0.00       0.99±0.01  0.98±0.01        0.98±0.02  0.96±0.02
  Mahalanobis        0.97±0.02  0.98±0.02       1.00±0.00  1.00±0.00       0.99±0.01  0.99±0.01        0.98±0.02  0.98±0.02
  GODIN              0.97±0.02  0.86±0.05       1.00±0.00  1.00±0.00       0.98±0.01  0.93±0.03        0.97±0.02  0.86±0.05

ODSI-DB
  Baseline           0.86±0.03  0.88±0.05       0.99±0.00  1.00±0.00       0.92±0.01  0.94±0.03        0.85±0.02  0.89±0.05
  ODIN               0.86±0.03  0.90±0.05       0.99±0.00  1.00±0.00       0.92±0.01  0.95±0.03        0.85±0.02  0.91±0.05
  Mahalanobis        0.87±0.02  0.95±0.02       0.99±0.00  1.00±0.00       0.93±0.01  0.98±0.01        0.84±0.04  0.95±0.02
  GODIN              0.87±0.02  0.87±0.05       0.99±0.00  1.00±0.00       0.93±0.01  0.93±0.03        0.86±0.04  0.87±0.04

DSAD
  Baseline           0.80±0.04  0.88±0.07       0.99±0.00  1.00±0.00       0.90±0.02  0.94±0.04        0.80±0.07  0.88±0.07
  ODIN               0.80±0.04  0.88±0.08       0.99±0.00  1.00±0.00       0.90±0.02  0.94±0.04        0.80±0.07  0.90±0.08
  Mahalanobis        0.82±0.05  0.91±0.04       0.99±0.00  1.00±0.00       0.90±0.02  0.95±0.02        0.78±0.08  0.90±0.08
  GODIN              0.75±0.10  0.76±0.14       0.99±0.00  0.99±0.00       0.87±0.05  0.88±0.07        0.74±0.11  0.76±0.15

Table A.2. Cross-validation results when using all annotated classes for training, i.e. with N_CP = 1. For each method and metric, average performance among classes is reported with standard deviation. Best performances among all methods for each dataset are highlighted in bold. Some Mahalanobis distance results are not shown (—) due to stability issues.

Dataset / Method     AUROC_OVR ↑ (SP1 | SP2 | SP3 | SP4 | SPmean)             AUPR_OVR ↑ (SP1 | SP2 | SP3 | SP4 | SPmean)

Heiporspectral
  Baseline           0.99±0.04  0.97±0.11  1.00±0.00  0.99±0.03  0.98±0.01  |  0.84±0.14  0.87±0.11  0.86±0.10  0.82±0.12  0.85±0.02
  ODIN               0.99±0.04  0.97±0.11  1.00±0.00  0.99±0.03  0.99±0.01  |  0.99±0.04  0.97±0.11  1.00±0.00  0.99±0.03  0.99±0.01
  Mahalanobis        0.96±0.04  0.95±0.11  0.97±0.00  0.96±0.02  0.96±0.01  |  —          —          —          —          —
  GODIN              0.99±0.04  0.97±0.11  1.00±0.00  0.98±0.06  0.98±0.01  |  0.99±0.04  0.97±0.11  1.00±0.00  0.97±0.07  0.98±0.01

ODSI-DB
  Baseline           0.94±0.05  0.93±0.09  0.91±0.12  0.92±0.09  0.93±0.01  |  0.91±0.07  0.91±0.11  0.90±0.12  0.85±0.21  0.89±0.02
  ODIN               0.94±0.05  0.93±0.09  0.91±0.12  0.92±0.09  0.93±0.01  |  0.90±0.07  0.91±0.11  0.89±0.12  0.84±0.23  0.89±0.02
  Mahalanobis        0.92±0.05  0.91±0.08  0.92±0.06  0.89±0.07  0.91±0.01  |  —          —          —          —          —
  GODIN              0.94±0.05  0.94±0.05  0.93±0.08  0.92±0.09  0.93±0.01  |  0.90±0.09  0.91±0.11  0.90±0.12  0.86±0.15  0.89±0.02

DSAD
  Baseline           0.93±0.05  0.91±0.06  0.89±0.08  0.88±0.10  0.90±0.02  |  0.90±0.07  0.86±0.11  0.82±0.13  0.78±0.16  0.84±0.04
  ODIN               0.93±0.05  0.91±0.06  0.89±0.08  0.88±0.10  0.90±0.02  |  0.90±0.07  0.85±0.11  0.82±0.13  0.78±0.16  0.84±0.04
  Mahalanobis        0.89±0.04  0.87±0.06  0.86±0.07  0.84±0.09  0.87±0.02  |  —          —          —          —          —
  GODIN              0.93±0.05  0.90±0.09  0.81±0.13  0.84±0.14  0.87±0.05  |  0.90±0.07  0.80±0.22  0.63±0.25  0.77±0.18  0.77±0.10

[Figure A.1 image grid. Columns: CP1 to CP4. Rows: sparsely annotated ground truth, Baseline (τ0 = 0), Baseline (τm), ODIN, Mahalanobis, GODIN. Class legend: Stomach, Small Bowel, Colon, Liver, Gallbladder, Pancreas, Kidney, Spleen, Bladder, Omentum, Lung, Heart, Cartilage, Bone, Skin, Muscle, Peritoneum, Major Vein, Kidney With GF, Bile Fluid, Outlier.]

Figure A.1. Qualitative results for the first case from the Heiporspectral dataset. We show results of the same image for the four class partitions (CP1 to CP4). For each CP, classes that are held out are grouped into an extra outlier class for evaluation. We visualise and compare masks generated using different methods at threshold τm. Baseline results at τ0 = 0 are added to represent results without outlier detection.

[Figure A.2 image grid. Columns: CP1 to CP4. Rows: sparsely annotated ground truth, Baseline (τ0 = 0), Baseline (τm), ODIN, Mahalanobis, GODIN. Class legend: Stomach, Small Bowel, Colon, Liver, Gallbladder, Pancreas, Kidney, Spleen, Bladder, Omentum, Lung, Heart, Cartilage, Bone, Skin, Muscle, Peritoneum, Major Vein, Kidney With GF, Bile Fluid, Outlier.]

Figure A.2. Qualitative results for the second case from the Heiporspectral dataset. We show results of the same image for the four class partitions (CP1 to CP4). For each CP, classes that are held out are grouped into an extra outlier class for evaluation. We visualise and compare masks generated using different methods at threshold τm. Baseline results at τ0 = 0 are added to represent results without outlier detection.

[Figure A.3 image grid. Columns: CP1 to CP4. Rows: sparsely annotated ground truth, Baseline (τ0 = 0), Baseline (τm), ODIN, Mahalanobis, GODIN. Class legend: Enamel, Tongue, Attached gingiva, Hard palate, Lip, Oral mucosa, Skin, Hair, Soft palate, Outlier.]

Figure A.3. Qualitative results for the first case from the ODSI-DB dataset. We show results of the same image for the four class partitions (CP1 to CP4). For each CP, classes that are held out are grouped into an extra outlier class for evaluation. We visualise and compare masks generated using different methods at threshold τm. Baseline results at τ0 = 0 are added to represent results without outlier detection.

[Figure A.4 image grid. Columns: CP1 to CP4. Rows: sparsely annotated ground truth, Baseline (τ0 = 0), Baseline (τm), ODIN, Mahalanobis, GODIN. Class legend: Enamel, Tongue, Attached gingiva, Hard palate, Lip, Oral mucosa, Skin, Hair, Soft palate, Outlier.]

Figure A.4. Qualitative results for the second case from the ODSI-DB dataset. We show results of the same image for the four class partitions (CP1 to CP4). For each CP, classes that are held out are grouped into an extra outlier class for evaluation. We visualise and compare masks generated using different methods at threshold τm. Baseline results at τ0 = 0 are added to represent results without outlier detection.

[Figure A.5 image grid. Columns: CP1 to CP4. Rows: sparsely annotated ground truth, Baseline (τ0 = 0), Baseline (τm), ODIN, Mahalanobis, GODIN. Class legend: Inferior Mesenteric Artery, Intestinal Veins, Colon, Liver, Pancreas, Small Intestine, Stomach, Spleen, Abdominal Wall, Ureter, Vesicular Glands, Outlier.]

Figure A.5. Qualitative results for the first case from the DSAD dataset. We show results of the same image for the four class partitions (CP1 to CP4). For each CP, classes that are held out are grouped into an extra outlier class for evaluation. We visualise and compare masks generated using different methods at threshold τm. Baseline results at τ0 = 0 are added to represent results without outlier detection.

[Figure A.6 image grid. Columns: CP1 to CP4. Rows: sparsely annotated ground truth, Baseline (τ0 = 0), Baseline (τm), ODIN, Mahalanobis, GODIN. Class legend: Inferior Mesenteric Artery, Intestinal Veins, Colon, Liver, Pancreas, Small Intestine, Stomach, Spleen, Abdominal Wall, Ureter, Vesicular Glands, Outlier.]

Figure A.6. Qualitative results for the second case from the DSAD dataset. We show results of the same image for the four class partitions (CP1 to CP4). For each CP, classes that are held out are grouped into an extra outlier class for evaluation. We visualise and compare masks generated using different methods at threshold τm. Baseline results at τ0 = 0 are added to represent results without outlier detection.
