Noisy vs. clean training labels? ID mistakes vs. OOD images? Difficulty of the OOD set?
Considering how pervasive label noise is in real-world image classification datasets, its effect on OOD detection is crucial to understand, yet it has received little attention. To address this gap, we systematically analyse the label noise robustness of a wide range of OOD detectors. Specifically:
- We present the first study of post-hoc OOD detection in the presence of noisy classification labels, examining the performance of 20 state-of-the-art methods under different types and levels of label noise in the training data. Our study spans multiple classification architectures and datasets, ranging from the beloved CIFAR-10 to the more challenging Clothing1M, and shows that even a low noise rate poses a challenge for many methods.
- We revisit the notion that OOD detection performance correlates with ID accuracy, examining when and why this relation holds. Robustness to inaccurate classification requires that OOD detectors effectively separate mistakes on ID data from OOD samples, yet most existing methods conflate the two.
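As a rough illustration of this last point (this is not the paper's evaluation code): given an "ID-ness" score such as MSP, one can measure how well it separates misclassified ID samples from OOD samples with AUROC. The function and array names below are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def msp_score(probs):
    """Maximum softmax probability: higher = looks more in-distribution."""
    return probs.max(axis=1)

def id_mistakes_vs_ood_auroc(id_probs, id_labels, ood_probs):
    """AUROC of separating *misclassified* ID test samples from OOD samples.

    A detector that simply flags low-confidence inputs tends to do poorly here,
    because ID mistakes and OOD images both receive low confidence.
    """
    mistakes = id_probs.argmax(axis=1) != id_labels           # misclassified ID samples
    scores = np.concatenate([msp_score(id_probs[mistakes]),   # ideally still high-ish
                             msp_score(ood_probs)])           # ideally low
    is_id = np.concatenate([np.ones(int(mistakes.sum())), np.zeros(len(ood_probs))])
    return roc_auc_score(is_id, scores)
```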
- The analysis folder contains the scripts used to process and analyse results.
- The analysis/paper_figures.ipynb notebook is a good place to start: it reproduces all the visualizations and results in the paper, supplementary material and poster.
- The run folder contains bash scripts to train the base classifiers on different sets of (clean or noisy) labels (e.g. run/cifar10_train.sh), and then evaluate post-hoc OOD detectors (e.g. run/cifar10_eval.sh). Training checkpoints and OOD detection results are saved in the results folder.
The rest of the repo follows the structure of OpenOOD:
- data/images_classic contains the raw ID & OOD datasets and annotations. See data/README.md for download instructions.
- data/benchmark_imglist contains the list of images and corresponding labels for each train, val, test and OOD set. For example, the training labels for CIFAR-10N-Agg (9.01% noise rate) can be found in data/benchmark_imglist/train_cifar10n_agg.txt. We provide all the .txt files used in our experiments, as well as the scripts used to generate them.
- For the code used to generate the clean & real noisy label sets, see the dataset-specific notebooks in the data/images_classic folder (e.g. create_txt_files_cifar10.ipynb, create_txt_files_clothing1m.ipynb, create_txt_files_cub.ipynb, ...)
- Synthetic label sets are generated with the data/benchmark_imglist/generate_synth_labels.ipynb notebook.
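For intuition, here is a minimal sketch of how a synthetic (symmetric) noisy label set could be produced in the imglist format, i.e. one `path label` pair per line. This is only an illustration of the idea; the actual noise types, rates and file names are defined in the notebook above.

```python
import numpy as np

def add_symmetric_noise(in_txt, out_txt, noise_rate, num_classes, seed=0):
    """Flip a fraction `noise_rate` of labels to a different, uniformly chosen class.

    Assumes one "image/path label" pair per line (illustrative sketch only).
    """
    rng = np.random.default_rng(seed)
    pairs = [line.split() for line in open(in_txt) if line.strip()]
    paths = [p for p, _ in pairs]
    labels = np.array([int(y) for _, y in pairs])

    flip = rng.random(len(labels)) < noise_rate
    offsets = rng.integers(1, num_classes, size=len(labels))  # never 0, so flipped labels always change
    noisy = np.where(flip, (labels + offsets) % num_classes, labels)

    with open(out_txt, "w") as f:
        for path, label in zip(paths, noisy):
            f.write(f"{path} {label}\n")

# Hypothetical usage (file names for illustration only):
# add_symmetric_noise("data/benchmark_imglist/train_cifar10.txt",
#                     "data/benchmark_imglist/train_cifar10_sym20.txt",
#                     noise_rate=0.2, num_classes=10)
```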
This code was tested on Ubuntu 18.04 + CUDA 11.3 & Ubuntu 20.04 + CUDA 12.5 with Python 3.11.3 + PyTorch 2.0.1. CUDA & PyTorch are only necessary for training classifiers and evaluating OOD detectors yourself. If you are only interested in reproducing the paper's tables & visualizations, you can install a minimal environment:
conda create --name ood-labelnoise-viz python=3.11.3
conda activate ood-labelnoise-viz
pip install -r requirements_viz.txt
To train the classifiers and evaluate the OOD detectors yourself, install the full environment instead:
conda create -n ood-labelnoise python=3.11.3
conda activate ood-labelnoise
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 -f https://download.pytorch.org/whl/torch_stable.html
conda install gcc_linux-64 gxx_linux-64
pip install Cython==3.0.2
pip install -r requirements_full.txt
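As a quick, optional sanity check of the full environment, the versions reported below should match the setup described above:

```python
import torch, torchvision

print(torch.__version__, torchvision.__version__)    # expected: 2.0.1+cu118, 0.15.2+cu118
print("CUDA available:", torch.cuda.is_available())  # should be True if you plan to train/evaluate
```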
We benchmark the following 20 post-hoc OOD detection methods (listed in the order in which they are presented in the paper). Their implementations are based on the OpenOOD benchmark, except for MDSEnsemble and GRAM, which we modified to better align with the original papers. A minimal sketch of a few of the simplest scores is shown after the table.
Name | Implementation | Paper |
---|---|---|
MSP | BasePostprocessor | Hendrycks et al. 2017 |
TempScaling | TemperatureScalingPostprocessor | Guo et al. 2017 |
ODIN | ODINPostprocessor | Liang et al. 2018 |
GEN | GENPostprocessor | Liu et al. 2023 |
MLS | MaxLogitPostprocessor | Hendrycks et al. 2022 |
EBO | EBOPostprocessor | Liu et al. 2020 |
REACT | ReactPostprocessor | Sun et al. 2021 |
RankFeat | RankFeatPostprocessor | Song et al. 2022 |
DICE | DICEPostprocessor | Sun et al. 2022 |
ASH | ASHPostprocessor | Djurisic et al. 2023 |
MDS | MDSPostprocessor | Lee et al. 2018 |
MDSEnsemble | MDSEnsemblePostprocessorMod | Lee et al. 2018 |
RMDS | RMDSPostprocessor | Ren et al. 2021 |
KLM | KLMatchingPostprocessor | Hendrycks et al. 2022 |
OpenMax | OpenMax | Bendale et al. 2016 |
SHE | SHEPostprocessor | Zhang et al. 2023 |
GRAM | GRAMPostprocessorMod | Sastry et al. 2020 |
KNN | KNNPostprocessor | Sun et al. 2022 |
VIM | VIMPostprocessor | Wang et al. 2022 |
GradNorm | GradNormPostprocessor | Huang et al. 2021 |
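To make "post-hoc" concrete, below is a minimal sketch of three of the simplest scores, computed directly from a trained classifier's logits. This is only illustrative; the benchmark uses the OpenOOD implementations listed in the table.

```python
import torch
import torch.nn.functional as F

def msp(logits, T=1.0):
    """MSP / TempScaling: maximum softmax probability (Hendrycks et al. 2017; Guo et al. 2017)."""
    return F.softmax(logits / T, dim=1).max(dim=1).values

def max_logit(logits):
    """MLS: maximum raw logit (Hendrycks et al. 2022)."""
    return logits.max(dim=1).values

def energy(logits, T=1.0):
    """EBO: negative free energy (Liu et al. 2020)."""
    return T * torch.logsumexp(logits / T, dim=1)

# In all three cases, a higher score means "looks more in-distribution";
# OOD detection is performed by thresholding the score.
```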
- June 13th 2024: Code repo released
- June 7th 2024: Project page released
If you find our work useful, please cite:
@InProceedings{Humblot-Renaux_2024_CVPR,
author={Humblot-Renaux, Galadrielle and Escalera, Sergio and Moeslund, Thomas B.},
title={A Noisy Elephant in the Room: Is Your Out-of-Distribution Detector Robust to Label Noise?},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month={June},
year={2024},
pages={22626-22636},
doi={10.1109/CVPR52733.2024.02135}
}
If you have any issues or questions about the code, please create a GitHub issue. Otherwise, you can contact me at gegeh@create.aau.dk
- Our codebase heavily builds on the OpenOOD benchmark. We list our main changes in the paper's supplementary material.
- Our benchmark includes the CIFAR-N and Clothing1M datasets. These are highly valuable as they provide pairs of clean vs. real noisy labels.
- We use the deep-significance implementation of the Almost Stochastic Order test in our experimental comparisons.
- We follow the training procedure and splits from the Semantic Shift Benchmark to evaluate fine-grained semantic shift detection.
- The Compact Transformer and MLPMixer model implementations and training hyper-parameters are based on the following repositories: Compact-Transformers and vision-transformers-cifar10.