N-20206
Abstract—Finding the ideal number of layers and the size of each layer is a key challenge in deep neural network design. Two approaches for such networks exist: filter learning and architecture learning. While the first one starts with a given architecture and optimizes model weights, the second one aims to find the best architecture. Recently, several visual analytics (VA) techniques have been proposed to understand the behavior of a network, but few VA techniques support designers in architectural decisions. We propose a hybrid methodology based on VA to improve the architecture of a pre-trained network by reducing/increasing the size and number of layers. We introduce Activation Occurrence Maps, which show how likely each position of a convolutional kernel's output activates for a given class, and Class Selectivity Maps, which show the selectiveness of different positions in a kernel's output for a given label. Both maps help in the decision to drop kernels that do not significantly add to the network's performance, to increase the size of a layer having too few kernels, and to add extra layers to the model. The user interacts from the first to the last layer, and the network is retrained after each layer modification. We validate our approach with experiments on models trained with two widely-known image classification datasets and show how our method helps to make design decisions that improve or simplify the architectures of such models.

Index Terms—Deep Learning, CNNs, Visual Analytics, Model Understanding, Architecture Tuning

This work is partially supported by FAPESP, grant no. 2014/12236-1; CNPq, grant no. 303808/2018-7 and 308851/2015-3; FAPERGS, grant no. 17/2551-000; RCN, grant no. 240862; RCN and SIU, grant no. 261645; SIU, grant no. UTF-2016-short-term/10128.

I. INTRODUCTION

Designing the appropriate neural network for a learning task requires deciding over several factors, such as the optimizer algorithm, loss function, regularization parameters, activation functions, number of layers, and type and size of each layer [1]. Most such decisions are made empirically, using experience from previous similar problems and general 'good practice' guidelines, and often rely on a trial-and-error approach to search for the best architecture. This task is time-consuming and may not lead to models with the expected performance.

Visual Analytics (VA) techniques have recently been increasingly used to help designers with architecture decisions [2]. Most such approaches focus on feature understanding, i.e., explaining which type of features a neuron learned to recognize, and support interpretability [3]–[8] by helping to understand how the model processes the input features to predict output labels. However, there is still a gap between architectural and interpretability tasks. While many VA tools tackle the task of finding the particular features a neuron learned [9], [10], they do not address the end-to-end problem of deciding if these features are indeed enough or useful for the prediction task, and, implicitly, the question of whether a given (set of) neuron(s) helps the network's overall task.

To close this gap, we propose a VA tool to help designers make architectural decisions on the number and size of layers in a model. Our method uses three visualizations: Activation Occurrence Maps (AOMs), Occurrence Difference Matrices (ODMs), and Class Selectivity Maps (CSMs); and a novel metric to evaluate the overall selectivity of a neuron. These tools help the designer make decisions such as: (1) remove neurons performing redundant roles, i.e., recognizing the same features, or neurons that do not contribute to the prediction process; (2) increase the size of a layer if no neurons learned useful features for one or more classes; and (3) add more layers to the model if sets of classes still do not present very selective features. By following our methodology, practitioners can guide the design of novel models from scratch or improve pre-trained models by identifying neurons that can be dropped — reducing overfitting — or the need for more neurons or layers — improving performance. Additionally, our method can help the task of transfer learning [11], as our tools can guide the selection of the most useful features to transfer to the new model. While the focus of our method is to address convolutional networks (CNNs), whose convolutional neurons
[Figure: workflow diagram of network-architecture editing — visual analysis tools drive Tasks 1–3 (add kernels, add layers, edit the architecture); after each edit the network is retrained and its performance checked; if performance decreased, the edit is undone; the loop ends when architecture and network size are good.]

engineer undoes the edit (6). The process stops when no additional edits can be done without losing too much accuracy, when the engineer decides that the network has been re-architected satisfactorily, or when the available time for editing has finished (7).

We next describe the AOM, ODM, and CSM visualizations for convolutional kernels. These visualizations can be easily adapted to handle fully-connected neurons by interpreting the neuron's output activation as having one single position.
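The edit/retrain/undo loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the accuracy tolerance, and the `budget` parameter are our assumptions.

```python
def edit_network(model, propose_edit, evaluate, budget=100, tolerance=0.01):
    """Sketch of the edit/retrain/undo loop: apply an edit, re-evaluate the
    model, and undo the edit if accuracy drops too much."""
    best_acc = evaluate(model)
    for _ in range(budget):                # stop (7): editing time budget spent
        edit = propose_edit(model)
        if edit is None:                   # stop (7): no viable edits left
            break
        candidate = edit(model)            # edit the architecture and retrain
        acc = evaluate(candidate)          # check performance of the edit
        if acc >= best_acc - tolerance:    # keep the edit
            model, best_acc = candidate, max(acc, best_acc)
        # else step (6): undo the edit, i.e., keep the previous model
    return model
```

As a toy usage example, a "model" can be a list of kernel qualities where useless kernels (zeros) hurt accuracy; the loop then strips them out one edit at a time.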
A. Activation Occurrence Maps
Fig. 2. AOMs produced by the first two layers in a model trained with the MNIST dataset. Each row represents one of the 16 kernels in the layer, while each column is a class. The highlighted groups (red border rectangles) contain kernels recognizing similar features (see Sec. V). We notice that some kernels activate more often for specific features in the images, such as the digit structure (G1, H1), particular border orientations (G2, G4, G5), background (G3, H4), or different handwriting styles (H2, H3, H7, H9).

Fig. 3. ODMs for the first two layers in the MNIST model (see Sec. V). The ODM is a symmetric matrix displaying the average difference between the AOMs produced by each pair of kernels. It provides a more concise overview of the similarities than the AOMs, with the trade-off of also providing fewer details. Nonetheless, it can be used as a first step to search for groups of similar kernels in large layers.

correspond to kernels, and columns to classes, respectively. Rows are ordered using groups of kernels with similar AOMs.

When one kernel produces similar AOMs for all classes, it is very likely that this kernel is not effective for the prediction process. For example, if a position in the output of a kernel produces positive activations — or instead, activates very rarely — for all or almost all inputs from every class in the training set, this position is unlikely to give the next layer useful information about which label the model should assign. We see such a pattern for the kernels shown in rows 13 and 14 from the top in Fig. 2 (left). These kernels produce very similar AOMs for all classes in the training set, with all positions activating very rarely, or never at all. Hence, they are good candidates for removal or reinitialization. We show later in Sec. V that we can remove this type of kernel with little or no performance reduction.

AOMs are helpful to find redundant kernels as well. When multiple kernels display very similar maps for every class, this tells us that these kernels recognize very similar features, and thus may be redundant. The kernels in the groups denoted by the red rectangles in Fig. 2 show an example of this issue. These kernels recognize almost identical features across the classes, which may make it unnecessary to keep all of them.

B. Occurrence Difference Matrix

To analyze a single layer using AOMs, we need to display $|K| \cdot |C| \cdot W \cdot H$ squares, where $|K|$ is the number of kernels in the layer, $|C|$ is the number of classes in the training set, and $W$ and $H$ are the kernel's output dimensions. DL networks used in complex prediction tasks often have large layers with hundreds or even thousands of kernels. Also, training sets for such tasks can contain as many classes [33]. Constructing and using AOM matrices as shown in Fig. 2 has thus limited scalability. An alternative to finding kernels producing similar AOMs (that may thus be redundant) is to build the Occurrence Difference Matrix (ODM) of the layer.

The ODM is a symmetric matrix $D$ where each cell $D_{k_I,k_J} \in \mathbb{R}^+$ measures the average difference between the AOMs corresponding to kernels $k_I$ and $k_J$, for every corresponding position in the AOMs and every class in the training set, computed as

$$D_{k_I,k_J} = \frac{1}{|C| \cdot W \cdot H} \sum_{c \in C} \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} \left| M^{k_I,c}_{i,j} - M^{k_J,c}_{i,j} \right|, \qquad (2)$$

where $C$ is the set of classes in the training set; $W$ and $H$ are the dimensions of the kernels' output; and $M^{k,c}_{i,j}$ is the position $(i, j)$ in the AOM of kernel $k$ and class $c$ (Eqn. 1).

Figure 3 shows the ODMs calculated for the AOMs of both layers displayed earlier in Fig. 2. Each value $D_{k_I,k_J}$ is encoded using an ordinal blue-to-red colormap. Rows and columns in this matrix-like image follow the kernel order in Fig. 2. In practical settings, the designer can sort rows with a matrix-reordering algorithm [34] to easily identify similar-value cells. We can see in Fig. 3 that kernels whose AOMs display strong similarity in Fig. 2 also display strong similarity in the ODM. However, ODMs are more compact than AOM matrices: as each of its cells encodes just a single value by color, we can easily visualize ODMs for layers of up to a thousand kernels on a single screen. In contrast, the AOM matrix can visually scale only to a few tens of kernels, given that each of its cells requires a resolution equal to that of the activation output. The ODM is particularly helpful for Task 1, as it can concisely inform the designer about redundant kernels that can be removed or reinitialized to improve the model's accuracy.

C. Class Selectivity Maps

Our third visualization, Class Selectivity Maps (CSMs), helps the DL engineer find how selective each position of a kernel $k$'s output is towards a given class $c$. If a position often activates for items of a given class $c$ but does not often activate for items of any other class, this position is selective
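The AOM and ODM definitions can be condensed into a few lines of NumPy. Eqn. 1 is not included in this excerpt, so the positive-activation occurrence computed below is our reading of the AOM description, and all function and variable names are our assumptions.

```python
import numpy as np

def activation_occurrence_map(kernel_acts, labels, cls):
    """AOM for one kernel: the fraction of class-`cls` training images whose
    output at each position (i, j) is a positive activation (our reading of
    Eqn. 1, which is defined outside this excerpt)."""
    sel = kernel_acts[labels == cls]          # (n_images_of_cls, H, W)
    return (sel > 0).mean(axis=0)             # (H, W), values in [0, 1]

def occurrence_difference_matrix(acts, labels, classes):
    """ODM (Eqn. 2): average absolute difference between the AOMs of each
    kernel pair, over all classes and positions. `acts` has shape
    (n_images, n_kernels, H, W)."""
    n_kernels = acts.shape[1]
    aoms = np.stack([[activation_occurrence_map(acts[:, k], labels, c)
                      for c in classes]
                     for k in range(n_kernels)])   # (n_kernels, |C|, H, W)
    D = np.zeros((n_kernels, n_kernels))
    for a in range(n_kernels):
        for b in range(n_kernels):
            D[a, b] = np.abs(aoms[a] - aoms[b]).mean()  # 1/(|C|·W·H) · Σ|ΔM|
    return D
```

The `.mean()` over the stacked class dimension and both spatial dimensions matches the $1/(|C| \cdot W \cdot H)$ normalization of Eqn. 2.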
[Figure 4 image: CSMs for Layer 1 (left) and Layer 2 (right) of the MNIST model; rows are labeled by the kernel groups G1–G5 and H1–H11; the colorbar ranges from −1.0 to 1.0.]
Fig. 4. CSMs produced by the first two layers in the MNIST model (see Sec. V). Each row represents one of the 16 kernels in each layer, while each column is a class. Kernels in each layer follow the same order as in Fig. 2. With this view, we can identify regions of the kernel's output image where some classes are more selective, i.e., where they activate more often than other classes. For instance, the kernels in group H1 display a strong selectivity towards round-shaped inner structures in the digits, while group H2 displays stronger selectivity towards flat shapes, such as the digit one. Identifying such regions is important because they are the most likely to help the prediction process — e.g., discriminating between classes 0 and 1.
towards class $c$, i.e., it learned how to (partially) distinguish items of class $c$ from items of other classes.

A CSM $S^{k,c}$ is a matrix where each element $S^{k,c}_{i,j}$ tells how selective position $(i, j)$ of kernel $k$ is for class $c$, computed as

$$S^{k,c}_{i,j} = \frac{1}{|C| - 1} \sum_{d \in C,\, d \neq c} \left( M^{k,c}_{i,j} - M^{k,d}_{i,j} \right), \qquad (3)$$

where $C$ is the set of classes in the training set; and $M^{k,c}_{i,j}$ is the position $(i, j)$ in the AOM of kernel $k$ and class $c$ (Eqn. 1).

Figure 4 shows the CSMs produced by the first and second layers of our MNIST model (see Sec. V), i.e., the same layers analyzed in Figs. 2 and 3. Rows indicate kernels and columns indicate classes, as in the AOM matrix. The values $S^{k,c}_{i,j} \in [-1, 1]$ are color-coded using a two-segment colormap, ranging from cyan (0) to purple (1) and from yellow (0) to red (−1), respectively. Hence, purple regions indicate where the analyzed kernel $k$ is very selective towards the selected class $c$; red regions indicate positions where kernel $k$ is very 'dismissive' of class $c$, that is, the position outputs positive values for $c$ far less often than for any other class. For example, one can notice that several kernels in the 2nd layer in Fig. 4 (right) are selective in most of the digit area in images from class 'zero'. However, some kernels, such as the one in group H11, are more selective in the 'inner circle' of the digit zero. This behavior indicates how different kernels may learn different features of the same class. We give more details about the insights the user can take from this visualization in Section V.

D. Kernel Selectivity

While CSMs depict well the selectivity of different regions in each kernel of a layer, they do not assign an overall selectivity value to the whole kernel. Having such a value would help when deciding which kernel to keep in the model from a group of redundant kernels, or when evaluating whether a kernel is a good candidate for removal or reinitialization due to its poor contribution to the model (Task 1). This editing of the network has three added values. First, one can keep the most selective kernel and thereby minimize the potential performance loss when simplifying the network. Secondly, the network size is decreased, leading to the already mentioned speed and space benefits. Finally, the overall performance can be increased by reinitializing unsatisfactory kernels that do not aggregate useful features or that introduce overfitting to the model.

Given the matrix $S^k$ containing all the CSMs of kernel $k$ for every position and class, we compute the overall selectivity $s(k)$ of kernel $k$ as

$$s(k) = \sum_{c_0, c_1 \in C,\, c_0 \neq c_1} \left| \mathrm{avg}(S^{k,c_0}) - \mathrm{avg}(S^{k,c_1}) \right|, \qquad (4)$$

where $C$ is the set of classes in the training set, and $\mathrm{avg}(S^{k,c})$ is the average value of all positions in the CSM for kernel $k$
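Eqns. 3 and 4 translate directly into NumPy. The sketch below assumes the per-kernel AOMs are already available as an array of shape (|C|, H, W); note that the pair sum in Eqn. 4 runs over ordered pairs $c_0 \neq c_1$, so each unordered pair contributes twice.

```python
import numpy as np
from itertools import permutations

def class_selectivity_map(aoms_k, c):
    """CSM (Eqn. 3) for one kernel: `aoms_k` has shape (|C|, H, W) and holds
    the kernel's AOM for every class; returns S^{k,c} with values in [-1, 1]."""
    others = np.delete(aoms_k, c, axis=0)               # AOMs of classes d != c
    return (aoms_k[c] - others).sum(axis=0) / (len(aoms_k) - 1)

def kernel_selectivity(aoms_k):
    """Overall selectivity s(k) (Eqn. 4): summed absolute differences of the
    average CSM values over all ordered class pairs c0 != c1."""
    n = len(aoms_k)
    avgs = [class_selectivity_map(aoms_k, c).mean() for c in range(n)]
    return sum(abs(avgs[c0] - avgs[c1])
               for c0, c1 in permutations(range(n), 2))
```

A kernel with identical AOMs for every class gets $s(k) = 0$, while a kernel firing only for one class gets a large score, matching the metric's intent.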
and class $c$. In Sec. V, we demonstrate the usefulness of our metric with an experiment.

Fig. 5. The MNIST model (top) contains two convolutional layers followed by a max pooling layer, a hidden fully-connected layer, and the output decision layer. After training, this model achieves 96.96% accuracy on the test set. The CIFAR10 model (bottom) contains a more complex architecture, with three sequential groups of layers formed by two convolutions, one max pooling, and one dropout layer, followed by a hidden dense layer and the output decision layer. After training, the model achieves 80.54% accuracy on the test set.

V. EXPERIMENTS

We ran a series of experiments to show how designers can use our techniques to perform the tasks described in Sec. III. For this, we use two image classification models trained with two widely-known datasets: (1) MNIST, containing images of handwritten digits from 0 to 9; and (2) CIFAR10, containing images displaying an object belonging to one of ten different classes — airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. For both datasets, we aim to design a deep neural network that can classify images into the respective ten classes. Figure 5 displays the architecture of both models. In this experiment, we apply our method to the activation of each convolutional layer before pooling. However, the method can easily be applied after the subsequent pooling layer — regardless of the pooling technique — as these layers contain a more concise representation of the features identified in the previous layer.

A. Analysis and layer size reduction of the MNIST model

We next use our VA tools to simplify this model while keeping a high accuracy. After training, we create the visualizations shown in Figs. 2, 3, and 4 using all training-set images. Following the insights from our VA approach, we remove kernels that do not help the prediction task, either because they do not recognize features useful to distinguish any class or because they are redundant with other same-layer kernels (Task 1, Sec. III). A more complex alternative to kernel removal is to reinitialize these kernels, so we prefer removal for simplicity.

We start our analysis by looking at the AOMs of the first convolutional layer kernels (Fig. 2 left). In row 14, we easily spot a kernel most of whose positions rarely activate for any class. So, this is a kernel that most likely does not contribute to the prediction task. Also, we notice groups of kernels with very similar patterns of activation occurrence. For example, kernels in the G1 group often activate at pixels inside the digit area, and rarely at pixels outside it. These kernels learn a similar feature: to recognize the digit structure.

The other kernels in the visualized layer show the opposite pattern: they activate more often at pixels outside the digit area than inside. These kernels recognize features such as the different border orientations that appear in each class. However, they do not necessarily have the same role: some of them activate more often around the top of the digit, e.g., group G4, while others activate more often around the bottom of the digit, e.g., group G2. Still, some of these kernels look quite redundant with each other, and thus may not add useful information to the prediction. To check this, we look at the occurrence difference matrix (ODM) for this layer (Fig. 3 left). Here, we can spot several groups of kernels with strong similarity (cells with dark blue shades).

Next, we analyze the CSMs produced by these potentially redundant kernels (Fig. 4 left). While the AOMs show us which kernel regions activate for different classes, the CSMs show whether these activations are selective or not, i.e., if high-occurrence activations provide enough information to help the network decide if the input belongs to a given class or not. The CSMs show us that kernels in the last three rows are not selective towards any class, which makes them unlikely to be relevant to the model's decisions. In contrast, other kernels, e.g., group G1, show regions of strong selectivity for some classes, telling us that their features are meaningful enough to help the model's decision. We proceed by removing kernels we found not useful for the reasons stated above. These correspond to the last three rows in Figs. 2 and 4 (left). Also, we group kernels with strong similarity in the AOMs (kernel groups correspond to the red rectangles in Fig. 2) and keep a single representative of each group in the model (we delete the others). The kept kernel is the one with the highest kernel selectivity (see Sec. IV-D) for a given group.

After keeping only the chosen five kernels in the 1st layer of the model, we freeze the weights of this layer and retrain the model from the second layer onward. This way, we ensure that our model cannot modify the features learned by those five kernels and thus has to use only these features (and whatever more abstract features it builds in deeper layers) to perform the prediction. With just one retraining epoch, our model achieves 96.93% accuracy, virtually the same it had with all 16 kernels in the 1st layer. This shows that the five chosen kernels cover enough features to achieve the same classification capability we had with 16 kernels.

We repeat the process in the second convolutional layer of our MNIST model. The AOMs for this layer (Fig. 2 right) tell us that its kernels are much more diverse concerning the features they recognize than kernels in layer 1. Often, they learn particular features that appear in some writing styles. For instance, the kernel in group H8 recognizes digits written in a 'rounded' way, while the kernel in row H7 recognizes digits from a 'flatter' writing. Also, Figs. 2 and 4 (right) help us recognize kernels that are redundant or not selective enough to help our model. Red rectangles in Fig. 2 (right) show the found kernel groups. We perform a similar edit to layer 1, freeze weights
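The selection rule applied above — drop non-contributing kernels, then keep only the highest-selectivity representative of each redundant group — can be sketched as follows. The function name and the toy numbers (16 kernels reduced to 5, mirroring the MNIST example) are our assumptions.

```python
def select_kernels(selectivities, redundant_groups, not_useful):
    """Layer-reduction sketch: drop kernels flagged as not useful, and keep
    from each group of redundant kernels only the member with the highest
    kernel selectivity s(k). Returns the sorted indices of kept kernels."""
    keep = set(range(len(selectivities))) - set(not_useful)
    for group in redundant_groups:
        best = max(group, key=lambda k: selectivities[k])
        keep -= set(group) - {best}        # delete the other group members
    return sorted(keep)
```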
[Figure 6 image: AOMs of each of the 64 kernels in the 1st (top) and 6th (bottom) convolutional layers of the CIFAR10 model; columns are the ten CIFAR10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck); the colorbar ranges from 0.0 to 1.0.]
Fig. 6. AOMs for the 64 kernels in the 1st and 6th layers of the CIFAR10 model. Kernels are sorted by the similarity tree computed by agglomerative clustering of the AOM rows. The 1st layer cannot learn features discriminative enough for each class, which denotes the need for more layers in the model. Note that while kernels in the 1st layer may produce positive activations for several classes, such activations tend to appear in different regions of the output image for different kernel × class pairs. If this behavior is not present, the next layers cannot use these features to build more discriminative ones, which indicates a need for more kernels in the layer. Finally, the 6th layer provides much more discriminative kernels, often producing positive activations for just one class, indicating that the model is unlikely to improve performance if more layers are added.
in layer 2, retrain, and obtain an accuracy of 97.69%. So, our approach allowed us to not only simplify the model but also improve its accuracy. Note that this accuracy was obtained only by retraining the weights in the hidden fully-connected layer and the output layer of the model, so it relies solely on the features previously learned — before removal — by the convolutional kernels we selected to keep.

B. Analysis and layer size reduction of the CIFAR10 model

We repeat the previous experiment for the model trained with the CIFAR10 dataset. Fig. 6 shows the AOMs for the 1st and the 6th convolutional layers in the model. Both views give us interesting insights into the layers' behaviors. First, many kernels in both layers do not often activate for any class, suggesting kernels that did not learn to recognize features useful for classification. Secondly, kernels in the 1st layer that activate often usually do so for multiple classes. This tells us that this layer cannot learn features complex enough to distinguish individual classes, suggesting that the model needs more layers to perform the prediction task (Task 3, Sec. III).

Conversely, in the 6th layer, often only one class produces positive activations in a given kernel. This tells us that, at this point, the model already separates classes as best as it can, and more layers are unlikely to improve performance. Some classes, e.g., cat and deer, do not achieve an occurrence close to 100% in any kernel. This suggests that the features the layer learned for these classes are not enough to cover all their samples, indicating the need to learn more features about them. Hence, we found that this or previous layers need more kernels to discover more features (Task 2, Sec. III).

Figure 7 shows the CSMs of each of the 64 kernels in the 1st and 6th layers of the CIFAR10 model. We see that some kernels (or regions) with high occurrence values in the corresponding AOMs (Fig. 6) are not very selective among different classes, making them a poor choice to keep in the network. In contrast, we see in this layer kernels that are very selective for subsets of classes (e.g., ship and truck). This tells us that, while this layer is not enough to find class-specific features, it can find features that only appear in a small subset of classes, thus easing the prediction task of the next layers.
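The two diagnostic readings of the AOMs described above can be expressed as a simple heuristic. Both thresholds and all names below are our assumptions, not values from the paper; the input summarizes each kernel × class AOM by its peak occurrence value.

```python
import numpy as np

def diagnose_layer(peak_occurrence, multi_thresh=0.5, cover_thresh=0.9):
    """Heuristic layer diagnosis. peak_occurrence[k, c] is the highest AOM
    value of kernel k for class c. Returns a set of suggested actions."""
    actions = set()
    # Kernels firing strongly for many classes: features are not yet
    # discriminative enough, so more layers may help (Task 3).
    multi_class = (peak_occurrence > multi_thresh).sum(axis=1)
    if (multi_class >= 2).mean() > 0.5:
        actions.add("add layers")
    # A class never covered by a high-occurrence kernel: this or earlier
    # layers may need more kernels to learn more features (Task 2).
    if (peak_occurrence.max(axis=0) < cover_thresh).any():
        actions.add("add kernels")
    return actions
```

An early layer whose kernels all fire for every class triggers both suggestions, while a layer with one strongly selective kernel per class triggers neither.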
[Figure 7 image: CSMs of each of the 64 kernels in the 1st (top) and 6th (bottom) convolutional layers of the CIFAR10 model; columns are the ten CIFAR10 classes; the colorbar ranges from −1.0 to 1.0.]
Fig. 7. CSMs for the 64 kernels in the 1st and 6th layers of the CIFAR10 model. Kernels follow the same order as in Fig. 6. Notice that while several kernels in the 1st layer show a high activation occurrence for all classes (see Fig. 6), they are usually much more selective to only a couple of classes. The CSMs give more confidence to designer decisions, as they clearly state whether a high-occurrence pattern in the AOMs indeed indicates selectivity.
Due to the width and depth of this model, we only reduce the size of the 1st and 6th convolutional layers. As for the MNIST model, we spot and remove kernels that are not selective for any class, and simplify groups of redundant kernels (details omitted for brevity). With our VA tools, we reduced the 1st layer's size to 20 kernels. After retraining the rest of the model for one epoch, with the weights in the 1st layer frozen, our reduced model achieved 81.20% test-set accuracy, even higher than the initial 80.54% accuracy.

Following our VA approach, we reduced the size of the 6th layer from 64 to 35 kernels, achieving a test-set accuracy of 80.43% after retraining only the weights in the fully-connected layers for one epoch. This accuracy is marginally below the original one. We see here a trade-off between network size and performance: at some point, the network simplification has to stop, as the accuracy will inherently drop. As stated earlier, an alternative is to reinitialize the kernels selected for removal (instead of removing them) to achieve higher accuracy.

C. Kernel Selectivity Experiment

To show that our kernel selectivity method indeed captures the overall selectivity of a kernel, we run the following experiment: for each pair of kernel $k$ and class $c$ in the network, we compute the average of the pair's class selectivity map $S^{k,c}$, denoted $s^{k,c}$. Then, for each class $c_i$, we remove the kernels with $s^{k_j,c_i} \leq 0$ and compute the accuracy of the resulting network only for test-set elements of class $c_i$. Next, we do the opposite: for each class $c_i$, we remove the kernels with $s^{k_j,c_i} > 0$ and compute the accuracy of the network without such kernels for test-set elements of class $c_i$.

We did these experiments for the 6th CIFAR10 model layer (CSMs shown in Fig. 7 right). Fig. 8 (top) shows the number of kernels kept in each case. Fig. 8 (bottom) compares the test-set accuracy for elements of each class $c_i$ considering all kernels, kernels not selective for $c_i$, and kernels selective for $c_i$. In all cases, the number of selective kernels for a given class is much smaller than the number of non-selective kernels. Also, in all cases, the test-set accuracy drops significantly for the chosen class when we remove its highly selective kernels, and increases when we keep only the selective kernels. Hence, our selectivity metric indeed captures well how much a kernel contributes to the separation of a given class.

VI. CONCLUSION

In this work, we present a visual analytics set of tools, and an associated workflow, that helps machine learning designers in
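The kernel partitioning used in this experiment — splitting a layer's kernels by the sign of $s^{k,c}$ for a chosen class — can be sketched as below; the function name and input layout are our assumptions.

```python
import numpy as np

def split_by_selectivity(avg_csm, cls):
    """Partition the kernels of a layer by the sign of s^{k,c} = avg(S^{k,c})
    for class `cls`. avg_csm[k, c] holds these averages; returns the
    (selective, non_selective) kernel index lists."""
    col = avg_csm[:, cls]
    selective = np.flatnonzero(col > 0).tolist()       # kept in the first run
    non_selective = np.flatnonzero(col <= 0).tolist()  # kept in the second run
    return selective, non_selective
```

Evaluating per-class test accuracy after zeroing out or removing each partition in turn then reproduces the comparison shown in Fig. 8.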
[Figure 8: top — number of selective kernels per CIFAR10 class (y-axis: number of kernels); bottom — per-class test-set accuracy comparison (y-axis: class accuracy). Both panels list the ten CIFAR10 classes on the x-axis.]

[5] J. Yosinski, J. Clune, T. Fuchs, and H. Lipson, "Understanding neural networks through deep visualization," in Proc. ICML – Workshop on Deep Learning, 2015.
[6] H. Strobelt, S. Gehrmann, H. Pfister, and A. M. Rush, "LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks," IEEE TVCG, vol. 24, no. 1, pp. 667–676, Jan 2018.
[7] P. Rauber, S. G. Fadel, A. Falcão, and A. Telea, "Visualizing the hidden activity of artificial neural networks," IEEE TVCG, vol. 23, no. 1, 2017.
[8] M. Kahng, P. Andrews, A. Kalro, and D. Chau, "ActiVis: Visual exploration of industry-scale deep neural network models," IEEE TVCG, vol. 24, no. 1, 2018.
[9] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," CoRR, vol. abs/1312.6034, 2013.
[10] D. Erhan, Y. Bengio, A. Courville, and P. Vincent, "Visualizing higher-layer features of a deep network," University of Montreal, 2009.
[11] K. Weiss, T. M. Khoshgoftaar, and D. Wang, "A survey of transfer learning," Journal of Big Data, vol. 3, no. 1, 2016.
[12] T. Elsken, J. H. Metzen, and F. Hutter, "Neural architecture search: A