Stop Overkilling Simple Tasks With Black-Box

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO.
8, AUGUST 2021 1
Stop overkilling simple tasks with black-box

models and use transparent models instead
Matteo Rizzo, Matteo Marcuzzo, Alessandro Zangari, Andrea Gasparetto, and Andrea Albarelli
(correspondence author’s e-mail: matteo.rizzo@unive.it)
Abstract—The ability of deep learning-based approaches to

extract features autonomously from raw data while also out-
performing traditional methods has led to several major break-
arXiv:2302.02804v2 [cs.LG] 14 Jun 2023
throughs in artificial intelligence. However, it is well-known that

deep learning models suffer from an intrinsic opacity, making
it difficult to explain why they produce specific predictions.
This is problematic not only because it hinders debugging, but
most importantly because it negatively affects the perceived
trustworthiness of the systems. What is often overlooked is
that many tasks can be solved efficiently and effectively with
traditional models that are inherently more transparent. In this
work, we propose a top-down approach to solve a relatively
simple task (i.e., the classification of the ripeness of banana
crates) by planning explainability and model design together.
We showcase how this task can be solved using both opaque
deep learning models and a more transparent decision tree.
Notably, there is a minimal loss of accuracy, but a significant
gain in explainability, which is truthful to the model’s inner
workings. We perform a user study to evaluate the perception of
explainability by end users and discuss our findings.
Index Terms—Fruit ripeness classification, Explainable AI,
Vision Transformer, CNN, Decision Tree Fig. 1. Ripeness stages for crates of bananas from least ripe (1) to ripest (4).
I. I NTRODUCTION
There is now a broad spectrum of methods being designed
VER the last decade, machine learning research has been
O increasingly focused on the development of new deep
models based on Artificial Neural Networks (ANNs). Such
with the sole intent of obtaining small improvements over their
predecessors. Unfortunately, the speed at which these are being
developed far outpaces the development of strategies capable
methods have raised the bar in terms of accuracy for numerous of explaining them. Research towards explainability methods
cognitive tasks, leading to new and exciting opportunities has nevertheless brought interesting results, with milestone
but also serious challenges. Among these challenges, explain- techniques such as Local Interpretable Model-agnostic Ex-
ability is one that has sparked a vast amount of discourse planations (LIME) [3] and SHapley Additive exPlanations
and debate; briefly, it can be understood as the process of (SHAP) [4]. Such methods attempt to provide an explanation
understanding the internal decision-making process of such for the prediction of any classifier by local approximation
models by a human stakeholder. This is no trivial task, as through a linear model. However, as we will discuss in Section
modern deep models are extremely complex and rely upon IV-A, they have also received criticism, as they may provide
billions of parameters that need to be learned during training. explanations that are, at the very least, disputable.
The incredible performance of Deep Learning (DL) models
and their inner opacity constitutes a concrete problem. This is
A. Task and approach
especially true in high stake decision-making environments,
where the ability to provide an explanation for a model In this paper, we analyze a real-world scenario and how
prediction is a characteristic that has started to be enforced the aforementioned concepts of accuracy and explainability
by law [1]. Unfortunately, it is clear how this clashes with affect it. Our target task is the classification of the ripeness of
much of the design process of DL models, which is generally bananas stored in a crate, on a scale from 1 (least ripe) to 4
guided by researchers’ intuition, relies on trial and error for (ripest) (see Fig. 1 for an example).
tuning, and lacks a holistic approach including the upstream We act according to a top-down approach where we plan
definition of an explanation strategy [2]. for explainability as well as task accuracy. To tackle the
classification task we select a pool of three DL methods: (i)
Matteo Rizzo, Matteo Marcuzzo, Alessandro Zangari, and Andrea Al- a simple Convolutional Neural Network (CNN) model with
barelli are with the Department of Environmental Sciences, Informatics,
and Statistics, Ca’ Foscari University, Venice, Italy (e-mail: [matteo.rizzo, three convolutional blocks, (ii) a pre-trained convolutional
matteo.marcuzzo, alessandro.zangari, albarelli]@unive.it) model based on the MobileNetV2 framework [5], and (iii)
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 2
a pre-trained Vision Transformer (ViT) [6]. Model (i) was • We show that the same classification task can be solved
chosen to establish a lower bound to task accuracy for DL effectively and efficiently by a much simpler and more
models. We selected model (ii) because it is lightweight transparent model, a DT, with minimum feature engineer-
and fast, which are important characteristics for a model ing effort. More specifically, we constrain the number of
that ultimately has to be deployed for in-situ predictions. features to the three channels in the RGB color space,
Finally, model (iii) is a relatively new approach based on the which are easy to represent in a 3D space;
application of Transformers [7] to the vision domain. As we • We conduct a user study to determine which explanations
will show, the latter allows for almost perfect results and is best suit the stakeholders’ needs;
the best neural model across our proposed methods. • We release our code and self-collected dataset1 for repro-
ducibility and possible extensibility of our experiments.
B. Explaining the models
II. R ELATED W ORK
In order to generate an explanation for our selected DL
models, we use the well-studied and above-mentioned LIME Our work relates to two main paths of research; (i) the
and SHAP methods. With LIME, an explanation is a local optimization of fruit ripeness grading, and (ii) the advocacy
linear approximation of the model’s behavior. The underneath for more focus on explainability in the Artificial Intelligence
assumption of the method is that, while the model may be very (AI) community.
complex globally, it is relatively easy to approximate it around Grading the ripeness of the fruit is a long-studied problem
the vicinity of a particular instance. While treating the model for whom strategies based on statistics (e.g., [8], [9]), tradi-
as a black box, the instance to be explained is perturbed and a tional machine learning (e.g., [10], [11]), and DL (e.g., [12],
sparse linear model (considered inherently fully explainable) [13]) have been proposed. The top-performing methods are
is learned as an explanation. LIME’s explanations for image those based on DL, which, aside from reaching astonishing
tasks are masks that describe the pixel-level importance of accuracy, do away with the complex and error-prone task of
the input to the model’s output (see Fig. 2a for an example). feature engineering. However, the literature lacks extensive
On the other hand, SHAP is a game theoretic approach, comparisons among the three noted strategies. For more in-
designed to explain the output of any machine learning model. formation on the fruit ripeness grading problem and solutions,
It connects optimal credit allocation with local explanations a recent survey was authored by Rizzo et al. [14].
using the Shapley values from game theory and their related A common problem associated with DL models is their
extensions. The basic visualization of SHAP’s explanations for inner opacity. In fact, providing meaningful explanations for
image tasks are plots that show features each contributing to a DL model’s prediction is an arduous task. Much research
pushing the model output from the base value (the average has gone towards the extraction of explanations by using,
model output over the training dataset) to the model output. for instance, information from gradients [15], attention scores
Features pushing the prediction higher are shown in red, and [16], surrogate models [3], and latent prototypes [17]. Some
those pushing the prediction lower are in blue (see Fig. 2b). methods have been proposed with the promise of being
model-agnostic, i.e., to explain the prediction of any classifier.
Prominent examples are the aforementioned LIME and SHAP
C. Contributions methods [3], [4]. However, the proposed explanations have
In our experiments, we show that all three of the selected been challenged (e.g., [18]–[21]) and proved to be unreliable
neural models are capable of converging to very high (and, in multiple scenarios.
in some cases, close-to-perfect) accuracy in a relatively short Nevertheless, the focus of much of the most recent research
amount of training epochs. This leads us to question the still seems to be on scraping a few decimals of task accuracy,
difficulty of the task at hand and the actual need for such while still not accounting for explainability. On this topic,
powerful, yet black-box, methods. As the results suggest, the works such as the one by Rudin [22] discuss the necessity of
task is somewhat easy, and it is thus legitimate to try to tackle more carefully gauging the tasks being solved and, whenever
it with a simpler strategy. In particular, we propose to use one possible (or necessary due to high stakes), using more trans-
of the simplest and most transparent models (i.e., a Decision parent models rather than black box ones. In the same vein, we
Tree (DT) trained on the average RGB color of each image) advocate for the use of simpler and more explainable models
to achieve excellent accuracy while being able to provide when the task at hand is particularly simple.
informative and understandable explanations. We also discuss
our choice over other traditional methods, gauging benefits III. P RELIMINARIES
against loss of transparency. A. Task definition
Lastly, we perform a pilot user study involving a sample of
From a generic perspective, this work deals with a multi-
the main stakeholders of the classifiers to verify their attitude
class image classification task. Each image comprises a picture
towards the various models and their respective explanations.
of banana bunches within a crate, which are labeled based
In summary, our contributions are the following:
on an increasing ripeness value (1 to 4, least to most ripe,
• We propose an analysis of a selection of DL methods in see Fig. 1). It is assumed that all the bananas within a crate
terms of performance and explainability, utilizing relevant
models that offer a wide panoramic of the task; 1 https://github.com/matteo-rizzo/explainable-fruit-ripeness-classification
are in the same ripeness stage. The resulting classifier can be pre-processed and augmented dataset. The feature extraction
used in fruit wholesale markets to aid operators in the labeling procedure for the DT was selected such as it could also be used
process of large numbers of incoming crates. Data and expert to generate feature-based explanations that are intuitively un-
knowledge have been gathered from the wholesale market of derstandable by the user while remaining true to the decision-
the city of Treviso, Italy, which plans to propose the output making process of the model. The most intuitive factor to
of this project for practical use. determine the ripeness is the average color of banana bunches.
Hence, we use the R, G, and B values as color features for
B. Dataset the DT. More specifically, each image is reduced to a triple,
consisting of the per-channel average color values in the RGB
To train our selected models to accomplish this task, we color space, normalized in the [0 − 1] range. This color model
collected an ad-hoc dataset. Pictures of crates filled with var- was selected as it is well known in the literature, and encodes
ious bunches of bananas were acquired at a native resolution a color image taking human perception into account. One
of 4160 x 3120 pixels, and then manually labeled by experts notable thing to consider is how the luminance of the RGB
at the Treviso public fruit market. Pictures were taken with a color space is embedded within its three channels. This is
CZUR Shine Ultra scanner, with an effort made to achieve different, for instance, from color spaces such as YUV, where
consistent lighting. The dataset consists of a total of 927 luminance is encoded into the physical linear-space brightness
acquisitions, split among classes in a fairly balanced way. We (Y) channel. As the DT is based on average color values,
detail the class distribution in the supplementary material. To we find that this method benefits from a normalization of the
conduct the experiments we train all methods using 5-fold luminance. This is achieved by transposing all the images to
cross-validation, repeating it 10 times with different random the YUV color space, setting the Y channel to a common
seeds to strengthen the results. value, and then translating back to RGB. Note that this is
not a lossless operation and incurs a substantial degradation
C. Data Processing of the image structure. While this is acceptable for the color-
To obtain a reasonable inference time, each image was based DT, we do not include this pre-processing step for neural
resized to 224 x 224 pixels. This resizing was also chosen as models, which could lose access to important features.
it is fairly standard in many pre-trained models, allowing us to IV. M ETHODS AND EXPLANATIONS
easily make use of modern transfer learning approaches. The
A. Deep Learning Approaches
dataset was augmented with random transformations, includ-
ing rotation, affine transforms, elastic morphology transforms, In addressing the task of banana ripeness classification, we
random location crop, gaussian blur, as well as the erasure of run and compare three neural approaches. The first architecture
patches of the image and changes in perspective. The latter consists of a simple CNN using three convolutional blocks,
transformations were applied to account for different angles each characterized by two bi-dimensional convolutions and
and accidental occlusions, which are likely to happen when max pooling interleaved by ReLU activation functions. The
non-expert users are taking pictures with a smartphone (which convolutional layers extract features that are fed to a three-
is one potential end use of this classifier). We augmented layer feed-forward ANN which outputs the final prediction.
roughly 50% of the dataset and the new images are added Before being processed by the CNN, the data is normalized
to the original dataset before training. More details may be with respect to mean and standard deviation.
found in the supplementary material. The second architecture we consider is the pre-trained
A visual inspection of the dataset reveals that pictures MobileNetV2 network [5]. Still convolutional by nature, the
are noisy in that parts of the crate are being captured in strategy at the core of this method is based on depth-wise
the overall image (mostly the boundaries of the crate, as convolutions [24], [25] and inverted residual connections. The
well as, sometimes, its bottom). A way to circumvent this aim of the designers was to build a powerful, pre-trainable
problem is to perform semantic segmentation of the images model able to run on low-tier devices.
to filter out the background of banana bunches. Having no The third architecture that we examine is the Vision Trans-
manually segmented images, an unsupervised approach was former (ViT) [6]. Transformers [7] are neural architectures
the only feasible way to achieve this. After testing several based on multi-head attention [26], widely studied and em-
algorithms, we selected the SLIC algorithm [23] for this task. ployed by the NLP community [27], [28]. This architecture
In our experiments, all methods benefit from the inclusion of has seen recent applications to computer vision tasks with a
segmentation as a pre-processing step. As such, we only report variety of different strategies (see [29] for a survey). Briefly,
results on segmented images. ViT splits images into fixed-size patches and linearly embeds
each of them. Positional embeddings are then added to retain
position information before feeding the resulting sequence of
D. Feature Extraction vectors to a standard Transformer encoder. Classification is
DL models have the notable ability to extract features achieved through the addition of a learnable “classification
automatically (in this case, through convolution- or attention- token” to the sequence.
based operations). Conversely, traditional machine learning Explainability Strategy: Despite astonishing performances,
methods such as a DT require a set of hand-crafted features DL models remain “black boxes”; in other words, it is ex-
to operate correctly. In both cases, methods are applied to the tremely hard to explain their inner workings. To mitigate this
(a) LIME (b) SHAP

Fig. 2. Examples of explanations for deep models generated using LIME (a) and SHAP (b).
condition, we decide to use two well-known model-agnostic following a path up until a leaf node, which is labeled with a
explanation methods, i.e., LIME [3] and SHAP [4]. When specific ripeness value. The set of rules given by the traversed
dealing with images, both of them allow the generation of heat path defines an area within the RGB color space.
maps which are supposed to describe the importance of each Albeit simple to follow for very few features and relatively
pixel in the image toward the model’s prediction. Intuitively, shallow trees, the decision paths can grow exponentially with
warm colors indicate the regions of the image that contributed the number of features added. As anticipated, such numerical
the most to the prediction, while colder colors indicate regions features split within the DT can still appear opaque to the
of the image that contributed negatively to the prediction of average user. Thus, we take our explanation a step further by
the same class. Example explanations generated with LIME devising an interface that is human-understandable and tested
and SHAP are presented in Fig. 2. accordingly. More specifically, we use the rules extracted
from the decision path as constraints on the RGB gamut to
identify portions of such a space that are representative of
B. Decision Tree
the four classes of ripeness. Hence, it is easy to represent
In contrast to the inner complexity of the examined DL each unknown input data point as its average color in the 3D
methods, we propose to tackle the same task by using a RGB color space, and, consequently, determine which region
simple and more transparent model based on a DT classifier. it belongs to. This plot is our proposed explanation for the
In particular, we adopt the implementation offered by scikit- DT’s behavior. Fig. 3 is an example visualization of the whole
learn2 , which is based on the CART algorithm [30]. process (more examples are reported in the supplementary
Explainability Strategy: One may argue that a DT is an material).
intrinsically explainable model. We argue that there is no Note that, differently from feature attribution methods, our
such thing as intrinsic explainability: a transparent model still proposed solution relies on intuitive yet discriminant and
needs to provide some kind of explanation that is somewhat unambiguous features that a human can validate. On the
understandable to the users and fulfills their needs. Different other hand, the masks generated with methods like LIME
end-users are likely to have different requirements for ex- and SHAP leave vast room for interpretation by the user,
plainability. For example, machine learning experts may be who has to infer why a certain region of the input was
satisfied with understanding the range of feature values that considered important. Furthermore, this explanation process
are mapped to each target class (in our case, the RGB values). provides “local” explanations and no overall understanding of
Non-expert users may need these rules to be further processed the important features of each class. On the other hand, the
to be represented in a more intuitive way. Clearly, serving plots in Fig. 3 can be considered “global” explanations of
explainability is much easier with certain classes of models, the DT’s behavior, since each region of the RGB spectrum is
such as those regulated by a few parameters, though this is clearly defined by the whole set of rules learned by the tree.
yet to be clearly formalized in the literature.
Admittedly, a DT has a very intuitive interpretation. For
C. On the choice of models
every non-leaf node, the tree learns a threshold value for one
of its given features, thus producing two children (above and This work is meant to be a demonstration of how a simple
below the threshold). In our case, each instance is classified by transparent model should sometimes be considered in place
of a more complex one, for the sake of deriving intuitive and
2 https://scikit-learn.org/stable/modules/tree.html unambiguous explanations. However, several other methods
Accuracy Precision Recall F1

Decision Tree 0.9716 (± .0104) 0.9723 (± .0106) 0.9678 (± .0119) 0.9697 (± .0110)
CNN 0.9349 (± .0115) 0.9298 (± .0131) 0.9308 (± .0123) 0.9377 (± .0123)
MobileNet V2 0.9743 (± .0046) 0.9726 (± .0046) 0.9717 (± .0054) 0.9718 (± .0049)
ViT 0.9967 (± .0015) 0.9960 (± .0020) 0.9966 (± .0017) 0.9962 (± .0018)
Human performance 0.7500 (± .0589) 0.7588 (± .0453) 0.7500 (± .0589) 0.7519 (± .0524)
TABLE I
M ACRO - AVERAGED PERFORMANCE METRICS FOR THE MODELS , AVERAGED OVER 10 RANDOM SEEDS ( STANDARD DEVIATION IN BRACKETS ).
metrics: accuracy, macro-averaged precision, macro-averaged

recall, and macro-averaged F1 score.
Table I showcases the results achieved with both deep-
learning methods and the DT. We additionally report the hu-
man performance, which is the average of the scores obtained
by three stakeholders on the classification of a balanced dataset
of 300 randomly sampled images from the original non-
augmented dataset (∼ 20%). It is easy to see that all methods
achieve excellent results, with all metrics surpassing the 90s
percentile scores and greatly improving on a human baseline.
It is worth remembering that these results are achieved on
the datasets augmented with images that have gone through
various augmentations, which makes them more robust at the
cost of small decreases in performance. Error analysis, further
detailed in the supplementary material provided, reveals that
mistakes made always occur by way of the classifiers selecting
an adjacent class (e.g., class 2 instead of 1).
Among the selected methods, the ViT model achieves
a near-perfect score for all metrics. This corroborates the
intuition that the task at hand is easy to solve, which in
Fig. 3. Explanation generated from the constraints imposed by the DT on the
RGB color gamut. The 4 grades identify different areas within the gamut. turn justifies the attempt to solve it with the simpler DT.
The DT also obtains outstanding results, though this required
comparatively more effort (including the standardization of the
could be used in place of the DT, including Support Vector luminance and the extensive grid search). This process allows
Machines (SVMs) or probabilistic models. Indeed, we did the DT to have results comparable to those of MobileNetV2.
test an SVM with different kernels and a multinomial Naive Concerning inference time, all our models output predic-
Bayes classifier, performing a grid search to select the best tions within milliseconds. Thus, all of them could work in real-
hyperparameters. Only the SVM with the polynomial kernel time for in-situ operations. The tests of inference time were
(degree 8) improved over the DT, although by less than 0.5% performed and compared on both an AMD Ryzen 7 3700X
in accuracy and F1-score. Given the minimal difference in CPU and an NVIDIA GeForce RTX 2070 Super GPU. In terms
results, and the fact that the decision boundaries of an SVM of of memory required for storage, the DT and MobileNetV2
this type are difficult to interpret because of their complexity, would be the best fit as they are lightweight enough to be
the DT appears to be a better choice. The complete results are installed on low-tier devices, with the memory consumption
reported in the supplementary material. of the other models being much higher. We detail the inference
V. E XPERIMENTAL ANALYSIS time for each model in the supplementary material.
Notably, the DL models require way more computational
In this section, we compare the performance achieved by power than the DT counterpart for their training On a large
our employed methods. First, we perform an analysis of the scale, the energy consumption of the DL methods’ training has
classification metrics achieved by the three DL-based models a way larger (negative) ecological impact than simpler models
and their DT counterpart. Then, we study the explanations such as DTs.
generated according to the strategies proposed in Section IV-A
and Section IV-B and compare them through a user study
involving the stakeholders for the task of banana ripeness B. Explainability
classification in a real fruit market. In terms of explainability, we compare the strategies based
on LIME and SHAP for the DL models with the hand-crafted
A. Performance explanations based on RGB color designed for the DT. Fig. 2
To measure the ability of our selected models to produce and 3 showcase a comparison of the three types of explanations
correct predictions, we resort to commonly used classification for the same input. It is easy to see that the masks produced by
both LIME and SHAP do not highlight meaningful features of do not provide an unambiguous explanation, their visual nature
the image. Indeed, we can observe that the regions highlighted was still enough to make half of the participants deem them
are apparently random. Not only that; in our case, SHAP’s trustworthy.
visualization for the CNN always presented the same result Finally, 80% of respondents declared that the chosen ex-
for all classes, seemingly valuing features for grade 4 highly planation would improve their trust in the model, and 70%
(even when the CNN correctly classified other ripeness stages). are ready to trade about 5% of the classifier accuracy for
The situation does not change when we visually examine the a more transparent and human-explainable decision process.
explanations generated by the methods throughout the whole Considering that the accuracy loss between the DT and the
dataset. This does not necessarily mean that the explanations most accurate model is only around 2.5% for our classification
generated by LIME and SHAP are “untrue” to the inner work- task, and well above human performance, there appears to
ings of the model, but rather that our intuitive interpretation be little reason to prefer the latter to the simpler and more
of the highlighted regions is misaligned with how the model explainable one. We report the complete results of our study
actually makes use of those features internally. As such, we as supplementary material.
can only conclude that these visualizations are inadequate as
truly meaningful explanations. VI. F UTURE W ORK
Conversely, the explanation we generate for the DT is Using simple classifiers on a few manually extracted fea-
faithful to the model’s inner workings by design. One can tures can be much more problematic on more complex tasks,
easily utilize the thresholds of the tree’s nodes for an unknown as this could severely limit the performance of the models.
input to produce a path to a label. These explanations undoubt- Indeed, we do not make the point that simpler models should
edly reflect how the model works, but it is arguable whether always be used; many cognitive tasks would be nearly impos-
thresholds over RGB channels are really meaningful to a sible without the progress obtained through DL.
human user and thus may not constitute a useful explanation. For this specific task, we selected a simple strategy to
Foreseeing this complication, we devise a visualization to aid provide an intuitive explanation to non-AI-expert users, based
the explanation to the end user. In particular, we use the on the average color of the whole image. We plan on explor-
constraints learned by the DT over the RGB gamut to highlight ing strategies to serve explanations using higher numbers of
the portions (i.e., the shades of color) that relate to one class. features, possibly considering the pixel color distribution. The
Then, it is straightforward to map the RGB coordinates of a main drawback of similar strategies concerning our objective
new data point into the gamut to see which of the highlighted is the difficulty in providing understandable explanations to the
portions it pertains to. This strategy provides the user with a target users. Moreover, in line with the explainability by design
much more informative explanation that is both understandable principle, we plan to research the usage of regularization
and faithful to how the model works. strategies to improve the explainability of complex DL models.
This topic has been already researched [31], mostly tackling
the problem of robustness, which has indeed been linked to the
C. User study issue of explainability [32]. It would be interesting to explore
A user study was performed to investigate the preferences whether and how adding constraints on the features extracted
of the users about the generated explanations for the AI- by NNs could help in producing explanations that are more
powered model predictions. The users involved in the study are understandable by the end-users.
stakeholders in the grading of banana ripeness consisting of 20
people with different backgrounds and expertise with AI tools. VII. C ONCLUSIONS
We submitted an online questionnaire to each user, which In this paper, we propose three DL models and a DT for
we report in the supplementary material. The questionnaire the classification of bananas into 4 ripeness stages. The state-
introduces the task at hand and asks the users to compare of-the-art ViT model can achieve near-perfect accuracy, while
the informativeness of three types of explanation for the same other models follow closely behind. Notably, while the DT
input and prediction: (i) the mask generated by LIME, (ii) the leads to slightly lower accuracy scores, it produces easily
mask generated by SHAP, and (iii) the representation of the interpretable results. We use this task to show how an intuitive
input color in the RGB gamut. Explanations (i, ii) pertain to explanation strategy can be devised by model design, rather
the ViT model (the best-performing one) while explanation than with a post-hoc strategy. We argue that working with a
(iii) is generated from the DT. white-box model and human-understandable features, where
Discussion of results: First of all, when asked about the im- possible, can allow for satisfactory explanation with minimal
portance of having an explanation for the model’s behavior, all performance trade-off. To validate this claim, we carried out
participants believed that an associated explanation is at least a pilot user study on 20 users, comparing two popular model-
somewhat important, with most believing it to be essential. As agnostic explainability methods applied to our DL methods
for the preferred explanation method, 7 out of 20 respondents against our simple DT-based interpretation. The results of
considered the RGB gamut area produced by the DT to be the the study indicate a clear tendency of users to accept minor
most informative explanation and 3 declared that no expla- accuracy losses in favor of a more understandable model.
nation was useful to them. The remaining votes were almost However, they also showcase how non-expert users prefer
equally split between the SHAP and LIME visualizations. This simpler explanations, regardless of whether they are well-
result is certainly interesting; though the latter visualizations founded or not.
R EFERENCES [20] D. Garreau and D. Mardaoui, “What does LIME really see in images?”
in ICML 2021 - 38th International Conference on Machine Learning,
Virtual Conference, United States, Jul. 2021. [Online]. Available:
[1] A. D. Selbst and J. Powles, “Meaningful information and the right to
https://hal.science/hal-03233014
explanation,” International Data Privacy Law, vol. 7, no. 4, pp. 233–
[21] M. Nauta, A. Jutte, J. Provoost, and C. Seifert, “This looks like that,
242, 12 2017. [Online]. Available: https://doi.org/10.1093/idpl/ipx022
because ... explaining prototypes for interpretable image recognition,” in
[2] M. Rizzo, A. Veneri, A. Albarelli, C. Lucchese, and C. Conati, “A Machine Learning and Principles and Practice of Knowledge Discovery
theoretical framework for ai models explainability,” 2022. [Online]. in Databases. Springer International Publishing, 2021, pp. 441–456.
Available: https://arxiv.org/abs/2212.14447 [22] C. Rudin, “Stop explaining black box machine learning models for high
[3] M. T. Ribeiro, S. Singh, and C. Guestrin, “”why should I trust you?”: stakes decisions and use interpretable models instead,” Nat Mach Intell,
Explaining the predictions of any classifier,” CoRR, vol. abs/1602.04938, vol. 1, no. 5, pp. 206–215, May 2019.
2016. [Online]. Available: http://arxiv.org/abs/1602.04938 [23] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk,
[4] S. M. Lundberg and S. Lee, “A unified approach to interpreting model “Slic superpixels compared to state-of-the-art superpixel methods,” IEEE
predictions,” CoRR, vol. abs/1705.07874, 2017. [Online]. Available: Transactions on Pattern Analysis and Machine Intelligence, vol. 34,
http://arxiv.org/abs/1705.07874 no. 11, pp. 2274–2282, 2012.
[5] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen, [24] L. Sifre and S. Mallat, “Rigid-motion scattering for texture
“Inverted residuals and linear bottlenecks: Mobile networks for classification,” CoRR, vol. abs/1403.1687, 2014. [Online]. Available:
classification, detection and segmentation,” CoRR, vol. abs/1801.04381, http://arxiv.org/abs/1403.1687
2018. [Online]. Available: http://arxiv.org/abs/1801.04381 [25] F. Chollet, “Xception: Deep learning with depthwise separable convo-
[6] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, lutions,” in 2017 IEEE Conference on Computer Vision and Pattern
X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, Recognition (CVPR). IEEE, Jul. 2017.
S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is [26] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
worth 16x16 words: Transformers for image recognition at jointly learning to align and translate,” in 3rd International Conference
scale,” CoRR, vol. abs/2010.11929, 2020. [Online]. Available: on Learning Representations, ICLR 2015, San Diego, CA, USA, May
https://arxiv.org/abs/2010.11929 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun,
[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Eds., 2015. [Online]. Available: http://arxiv.org/abs/1409.0473
Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you [27] A. Gasparetto, M. Marcuzzo, A. Zangari, and A. Albarelli, “A
need,” in Advances in Neural Information Processing Systems, survey on text classification algorithms: From text to predictions,”
I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, Information, vol. 13, no. 2, 2022. [Online]. Available: https:
S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, //www.mdpi.com/2078-2489/13/2/83
Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper/ [28] A. Gasparetto, A. Zangari, M. Marcuzzo, and A. Albarelli, “A survey
2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf on text classification: Practical perspectives on the italian language,”
[8] F. Mendoza and J. M. Aguilera, “Application of image analysis for PLOS ONE, vol. 17, no. 7, pp. 1–6, 07 2022. [Online]. Available:
classification of ripening bananas,” J. Food Sci., vol. 69, no. 9, pp. E471– https://doi.org/10.1371/journal.pone.0270904
E477, May 2006. [29] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah,
[9] O. O. Olarewaju, I. Bertling, and L. S. Magwaza, “Non-destructive “Transformers in vision: A survey,” ACM Comput. Surv., vol. 54, no.
evaluation of avocado fruit maturity using near infrared spectroscopy 10s, sep 2022. [Online]. Available: https://doi.org/10.1145/3505244
and PLS regression models,” Sci. Hortic., vol. 199, pp. 229–236, Feb. [30] L. Breiman, Classification and Regression Trees. New York: Routledge,
2016. 1984.
[10] X. Ni, C. Li, H. Jiang, and F. Takeda, “Deep learning image segmenta- [31] C. Wu, M. J. F. Gales, A. Ragni, P. Karanasou, and K. C. Sim, “Im-
tion and extraction of blueberry fruit traits associated with harvestability proving interpretability and regularization in deep learning,” IEEE/ACM
and yield,” Hortic Res, vol. 7, p. 110, Jul. 2020. Transactions on Audio, Speech, and Language Processing, vol. 26, no. 2,
[11] A. Septiarini, H. Hamdani, H. R. Hatta, and K. Anwar, “Automatic pp. 256–265, 2018.
image segmentation of oil palm fruits by applying the contour-based [32] A. Ross and F. Doshi-Velez, “Improving the adversarial robustness
approach,” Sci. Hortic., vol. 261, p. 108939, Feb. 2020. and interpretability of deep neural networks by regularizing their
[12] N. Saranya, K. Srinivasan, and S. K. P. Kumar, “Banana ripeness stage input gradients,” Proceedings of the AAAI Conference on Artificial
identification: a deep learning approach,” J. Ambient Intell. Humaniz. Intelligence, vol. 32, no. 1, Apr. 2018. [Online]. Available: https:
Comput., vol. 13, no. 8, pp. 4033–4039, Aug. 2022. //ojs.aaai.org/index.php/AAAI/article/view/11504
[13] I. Sa, Z. Ge, F. Dayoub, B. Upcroft, T. Perez, and C. McCool,
“DeepFruits: A fruit detection system using deep neural networks,”
Sensors, vol. 16, no. 8, Aug. 2016.
[14] M. Rizzo, M. Marcuzzo, A. Zangari, A. Gasparetto, and A. Albarelli,
“Fruit ripeness classification: A survey,” Artificial Intelligence in
Agriculture, vol. 7, pp. 44–57, 2023. [Online]. Available: https:
//www.sciencedirect.com/science/article/pii/S2589721723000065
[15] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and
D. Batra, “Grad-cam: Visual explanations from deep networks via
gradient-based localization,” in 2017 IEEE International Conference on
Computer Vision (ICCV), 2017, pp. 618–626.
[16] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
jointly learning to align and translate,” in 3rd International Conference
on Learning Representations, ICLR 2015, San Diego, CA, USA, May
7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun,
Eds., 2015. [Online]. Available: http://arxiv.org/abs/1409.0473
[17] C. Chen, O. Li, C. Tao, A. J. Barnett, J. Su, and C. Rudin, This Looks
like That: Deep Learning for Interpretable Image Recognition. Red
Hook, NY, USA: Curran Associates Inc., 2019.
[18] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and
B. Kim, “Sanity checks for saliency maps,” in Proceedings of the 32nd
International Conference on Neural Information Processing Systems, ser.
NIPS’18. Red Hook, NY, USA: Curran Associates Inc., Dec. 2018,
pp. 9525–9536.
[19] S. Serrano and N. A. Smith, “Is attention interpretable?” in Proceedings
of the 57th Annual Meeting of the Association for Computational
Linguistics. Florence, Italy: Association for Computational Linguistics,
Jul. 2019, pp. 2931–2951.

Stop Overkilling Simple Tasks With Black-Box

Uploaded by

Copyright:

Available Formats

Stop Overkilling Simple Tasks With Black-Box

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Stop Overkilling Simple Tasks With Black-Box

Uploaded by

Copyright:

Available Formats

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO.

Stop overkilling simple tasks with black-box

Abstract—The ability of deep learning-based approaches to

throughs in artificial intelligence. However, it is well-known that

(a) LIME (b) SHAP

Accuracy Precision Recall F1

metrics: accuracy, macro-averaged precision, macro-averaged

You might also like