
IJCNN 2019 - International Joint Conference on Neural Networks, Budapest Hungary, 14-19 July 2019

A Methodology for Neural Network Architectural Tuning Using Activation Occurrence Maps

Rafael Garcia, Instituto de Informática, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil (rgarcia@inf.ufrgs.br)
Alexandre Xavier Falcão, Instituto de Computação, Universidade de Campinas, Campinas, Brazil (afalcao@ic.unicamp.br)
Alexandru C. Telea, Department of Computer Science, University of Groningen, Groningen, The Netherlands (a.c.telea@rug.nl)
Bruno Castro da Silva, Instituto de Informática, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil (bruno.silva@inf.ufrgs.br)
Jim Tørresen, Department of Informatics, University of Oslo, Oslo, Norway (jimtoer@ifi.uio.no)
João Luiz Dihl Comba, Instituto de Informática, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil (comba@inf.ufrgs.br)

Abstract—Finding the ideal number of layers and size for each layer is a key challenge in deep neural network design. Two approaches for such networks exist: filter learning and architecture learning. While the first one starts with a given architecture and optimizes model weights, the second one aims to find the best architecture. Recently, several visual analytics (VA) techniques have been proposed to understand the behavior of a network, but few VA techniques support designers in architectural decisions. We propose a hybrid methodology based on VA to improve the architecture of a pre-trained network by reducing/increasing the size and number of layers. We introduce Activation Occurrence Maps, which show how likely each image position of a convolutional kernel's output activates for a given class, and Class Selectivity Maps, which show the selectiveness of different positions in a kernel's output for a given label. Both maps help in the decision to drop kernels that do not significantly add to the network's performance, increase the size of a layer having too few kernels, and add extra layers to the model. The user interacts from the first to the last layer, and the network is retrained after each layer modification. We validate our approach with experiments on models trained with two widely-known image classification datasets and show how our method helps to make design decisions to improve or to simplify the architectures of such models.

Index Terms—Deep Learning, CNNs, Visual Analytics, Model Understanding, Architecture Tuning

I. INTRODUCTION

Designing the appropriate neural network for a learning task requires deciding over several factors such as the optimizer algorithm, loss function, regularization parameters, activation functions, number of layers, and type and size of each layer [1]. Most such decisions are made empirically, using experience from previous similar problems and general 'good practice' guidelines, and often use a trial-and-error approach to search for the best architecture. This task is time-consuming and may not lead to models with the expected performance.

Visual Analytics (VA) techniques have recently been increasingly used to help designers with architecture decisions [2]. Most such approaches focus on feature understanding, i.e., explaining which type of features a neuron learned to recognize, and support interpretability [3]–[8] by helping to understand how the model processes the input features to predict output labels. However, there is still a gap between architectural and interpretability tasks. While many VA tools tackle the task of finding the particular features a neuron learned [9], [10], they do not address the end-to-end problem of deciding if these features are indeed enough or useful for the prediction task, and, implicitly, the question whether a given (set of) neuron(s) helps the network's overall task.

To close this gap, we propose a VA tool to help designers make architectural decisions on the number and size of layers in a model. Our method uses three visualizations: Activation Occurrence Maps (AOMs), Occurrence Difference Matrices (ODMs), and Class Selectivity Maps (CSMs); and a novel metric to evaluate the overall selectivity of a neuron. These tools help the designer make decisions such as: (1) remove neurons performing redundant roles, i.e., recognizing the same features, or neurons that do not contribute to the prediction process; (2) increase the size of a layer if no neurons learned useful features for one or more classes; and (3) add more layers to the model if sets of classes still do not present very selective features. By following our methodology, practitioners can guide the design of novel models from scratch or improve pre-trained models by identifying neurons that can be dropped — reducing overfitting — or the need for more neurons or layers — improving performance. Additionally, our method can help the task of transfer learning [11], as our tools can guide the selection of the most useful features to transfer to the new model. While the focus of our method is to address convolutional networks (CNNs), whose convolutional neurons we from now on call kernels, our approach can easily be adapted to fully-connected and recurrent models.

This work is partially supported by FAPESP, grant no. 2014/12236-1; CNPq, grant no. 303808/2018-7 and 308851/2015-3; FAPERGS, grant no. 17/2551-000; RCN, grant no. 240862; RCN and SIU, grant no. 261645; SIU, grant no. UTF-2016-short-term/10128.

978-1-7281-2009-6/$31.00 ©2019 IEEE


Our paper is organized as follows. Section II discusses related work in both architecture modeling and VA for deep learning (DL). Section III details the DL engineering tasks that our approach assists. Section IV explains in detail our approach and the involved techniques and metrics. Section V presents a series of experiments where we validate our approach on two image classification problems. Finally, Section VI concludes the paper, outlining future work directions.

II. RELATED WORK

Architecture tuning is one of the main challenges for DL engineering [12]. The simplest way to tune such networks is to grid search for the best architecture, testing many combinations of number/type of layers, layer sizes, and other hyperparameters, and choosing the set-up maximizing performance. This strategy is impractical when working with deep models because training a single model is computationally expensive. Recent approaches involve learning the architecture adaptively during model training [13], training a reinforcement learning model to create architectures for a given input learning problem [14], or automatically adapting the network's topology to the input sample, so that the network can have a faster and simpler prediction process if the sample is easy to predict or a more complex topology otherwise [15]. Still, no such methods allow designers to use their experience to modify the model.

Several works tackle the subgoal of reducing redundant or unimportant model components. Cogswell et al. [16] use a regularization technique to reduce redundancy by minimizing the cross-covariance of activations in the model's layers. Pruning methods can reduce the number of weights in a fixed architecture without significant accuracy loss [17]–[20]. A model can learn the pruned architecture alongside weights during training [21]–[23]. Recent studies show that such methods often achieve results similar to the same reduced architecture learned from scratch, making over-parameterized training unnecessary [24]. Our approach, in contrast, brings designer reasoning into the network reduction process without the need for pre-training a much larger model than necessary.

Visual Analytics (VA) has provided significant support for deep learning [2]. Until now, VA mostly focused on feature understanding and model interpretability: Salience maps [9] highlight the image pixels contributing most to a given neuron's activation, thereby showing designers which image features contribute to the prediction process. Activation Maximization [10] creates an image that maximizes the activation of a given neuron, giving insights on which features the neuron recognizes. While such methods help interpreting the learned features, they do not tell if these features are selective towards the possible output labels. Our approach tackles this task, as it does not focus on which features a neuron learned, but on how useful these features are to the prediction process.

Closer to our goal, some VA methods address the goal of evaluating the quality of a model's components. Rauber et al. [7] project activation vectors of hidden layers to help deciding whether a given layer distinguishes the classes or not. While adequate for evaluating the quality of a layer, this approach does not give any information about the quality of individual neurons. Zhong et al. [25] propose two metrics to gauge a neuron's quality, but their metrics are not class-specific like ours. Arguably closest to our work, DeepEyes [26] uses heatmap matrices to let the user search for kernels that do not activate for any image or activate for all images, which in either case are ineffective kernels. Yet, they do not allow exploring activations in different regions of the kernel's output, which is vital in datasets where a feature's position may add value to the prediction process.

III. MODELING TASKS

A DL engineer must make several hard architectural decisions when using neural networks to solve a learning problem. First, one must choose the right number and size of model layers [27]. However, there is no analytic way to find the best architecture a model should have to solve a problem. Designers often end up choosing more or fewer layers/neurons than they should, leading to the issues listed below. These issues appear in the design of most types of model architectures — DFNs, CNNs, RNNs, etc. — and most learning tasks — classification, regression, generative models, etc. Our method can be adapted to address all of those.

Too few layers: A model with fewer layers than ideal may fail to distinguish classes separated by highly abstract features which are created from simple features found in early layers.

Too many layers: Conversely, the use of too many layers in a model also creates problems. Deep networks have, by construction, a high number of parameters, which significantly increases when adding new layers. As other ML methods, neural models are prone to overfitting, which mainly happens when the number of parameters in the model is too high. A high number of layers or neurons also severely increases the training time and the memory space required to store the model, making their use prohibitive in systems such as embedded devices, which are increasingly necessary for applications in fields like robotics and embedded computing [17].

Too small layers: For a neural network to operate well, it must iteratively, layer by layer, change the representation of the input to distinguish between the different classes in the final layer [7]. It does so by recognizing low-level features that are more likely to appear when the input belongs to a particular class. If a layer does not have enough neurons to learn all such features, that layer will likely harm the overall prediction capability. When this happens, increasing the number of layers does not help, as the next layers do not have enough 'simple' features to build meaningful higher-level features.

Too large layers: An overestimated number of neurons per layer can also harm the model, increasing the chance of overfitting, the training time, and the space required for the model, just as for a model with too many layers.
Ineffective neurons: Even if not overfitting, when training a model with a high number of neurons, it is unlikely that all neurons become equally important for the task at hand after training. While some neurons learn features that are crucial to the prediction task, others may learn features that do not contribute much to it or even harm the performance [23].

Redundant neurons: Multiple neurons may learn to identify closely-related features, leading the model to have redundant information across its components [28]. If one can find such groups of neurons and keep only a subset of them, one can reduce the size of trained models and maintain (nearly) the same performance as before reduction.

We identify three main tasks where our approach addresses a subset of the above issues:

1. Find redundant or ineffective kernels: Ineffective kernels (that do not learn to recognize features useful for recognizing any class) are strong candidates for removal without decreasing performance. Conversely, groups of kernels (in the same layer) that learned to recognize very similar input features can be reinitialized to see if they next learn more diverse features. Alternatively, the designer can simplify such a group by reducing it to one or a few kernels.

2. Find too small layers: When the number of kernels in a layer is not enough to learn all the required features, two or more classes may always activate the same set of kernels in that layer. This behavior makes it hard for the next layers to build more complex features capable of recognizing each class and can cause the model to underperform. In this case, the designer may want to increase the layer's size by adding more kernels.

3. Find too shallow models: Ideally, the activations of the model's last hidden layer should be very discriminative, with each class activating a different set of neurons. If this does not happen, e.g., if many neurons activate for multiple classes, the designer may want to add extra layers so that the model can build higher-level features.

Automating these tasks is challenging, as it is hard or impractical to define accurate quantitative tests to 'query' for the presence of too small layers, too shallow models, kernel redundancy, and kernel (in)effectiveness. Consider kernel redundancy: Comparing all activations of just two kernels is already expensive, as it needs |K| · |K| · |D| comparisons for |K| kernels in a layer and |D| elements in the training set. Moreover, convolutional kernels usually output high-dimensional activations, further increasing the complexity of such a comparison. Considering groups of more than two kernels makes the problem quickly intractable. Even if such computational costs were acceptable, there is no exact threshold for defining the degree of activation similarity that makes two kernels redundant. Comparing the learnable parameters of each kernel is also unreliable, as different parameter sets can learn to recognize (nearly) the same features, and the function of weight vectors is hard to interpret, especially in deeper layers.

IV. VISUAL ANALYTICS APPROACH

Most previous VA solutions for DL engineering used activation values to analyze neurons and answer questions such as what features a kernel is learning [3], [9], [29] or for which class it is specializing itself [8], [30]. The activations of a neuron contain the values that will be used as input by the next layer. We can see them as a new representation of the features of the input sample. If a neuron activates with a high value at some position because it found features the input sample has, this becomes an indication of what the neuron is doing. Activations are easier to understand than weight values, which are hard to interpret, especially in deeper layers, where they interact with activations produced by the composition of many neurons with equally hard-to-interpret weights.

To better understand the role of a neuron, we must look at activations produced by several inputs. Otherwise, we do not know how useful the feature producing high activations in the neuron is. Indeed, such a feature may appear in every class, or be an input-specific feature that does not appear in other elements of the same class. In both cases, learning such a feature does not help the prediction process. A typical way to overcome this issue is to look at the mean activation produced by a kernel for all elements of a given class [8], [30]. If a neuron has a high mean activation, we can assume more confidently that it indeed learned a feature useful for that class. Yet, high mean activations can be misleading. The activation produced by a neuron in a non-final layer serves as input to the next layer. If the weights from the next layer that interact with this activation value are small enough, having a high activation may not be that important. Moreover, finding the neuron(s) with the high(est) activations does not directly help the tasks defined in Sec. III. For instance, two kernels can recognize essentially the same feature, but have different activation values, due to the way their weights evolved during training. Thus, analyzing only their mean activation values does not tell us that there is redundancy between two kernels.

Given the above, we propose to look at the proportion of positive activations instead of their absolute values. When a kernel outputs a positive value, it tells that it found, to some extent, the learned feature. A non-positive activation value (clamped to zero when using the common ReLU activation function) tells that the kernel did not find such a feature. Considering this, we can assume that a kernel that learned a feature capable of distinguishing one class (or a set of classes) from the rest will produce positive activations for most elements from this class (or set of classes). Hence, if we find kernels with a high proportion of positive activations for elements of some class c_i and a low proportion for elements of some other class c_j, we can say that this kernel is selective toward this pair of classes.

To allow designers to find and analyze such kernels, and also to find cases, such as groups of classes, for which such patterns do not occur, we propose three visualization techniques:
data train & test where, e.g., which group of neurons is redundant and which
1 performance can be eliminated from it (5). Upon finding such a case,
architecture network
2 the engineer next edits the network and repeats from step
3 2. If the accuracy drops significantly after such an edit, the
remove kernels

performance Y engineer undoes the edit (6). The process stops when no
add kernels
network architecture editing

add layers
and network size ready additional edits can be done without losing too much accuracy,
good? when the engineer decides that the network has been re-
architected satisfactorily, or when the available time for editing
5 N
has finished (7).
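To make the loop of Fig. 1 concrete, the sketch below restates it as plain Python. It is our illustration, not the authors' tool: train_fn, evaluate_fn, propose_edit_fn, apply_edit_fn, and undo_edit_fn are hypothetical callables standing in for the training code and for the designer's visual inspection of the AOM/ODM/CSM views.

def architecture_editing_loop(model, train_fn, evaluate_fn,
                              propose_edit_fn, apply_edit_fn, undo_edit_fn,
                              max_edits=10, max_accuracy_drop=0.005):
    # Steps (1)-(2): start from an initial architecture, train and test it.
    train_fn(model)
    best_acc = evaluate_fn(model)
    for _ in range(max_edits):                  # step (7): bounded editing effort
        edit = propose_edit_fn(model)           # step (4): inspect the AOM / ODM / CSM views
        if edit is None:                        # step (3): architecture deemed good
            break
        apply_edit_fn(model, edit)              # step (5): remove/add kernels or layers
        train_fn(model)                         # retrain after the edit, repeating step (2)
        acc = evaluate_fn(model)
        if best_acc - acc > max_accuracy_drop:  # step (6): undo edits that hurt accuracy
            undo_edit_fn(model, edit)
        else:
            best_acc = max(best_acc, acc)
    return model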
We next describe the AOM, ODM, and CSM visualizations for convolutional kernels. These visualizations can be easily adapted to handle fully-connected neurons, by interpreting the neuron's output activation as having one single position.

A. Activation Occurrence Maps

Deep neural networks — particularly those employed in image-based problems — often have a sequence of convolutional layers with ReLU activation functions as the first layers, typically followed by a pooling layer and some regularization layers such as dropout [31]. The role of convolutional layers is to find particular features that help in the prediction task regardless of where these appear in the image. For example, a neuron from the first convolutional layer takes the original image as input and produces another image as activation. This activation image has positive values in places where the original image contains the feature(s) this neuron is looking for, and zero activations (after ReLU) elsewhere [1].

To let the designer identify how a kernel behaves for different classes, and how different kernels act for the same class, we must understand how the occurrence of positive activations changes in different regions of the kernel's image output. This knowledge is needed since different classes may present very similar features — particularly in the first layers — but, for a given class, these features may appear more often at a different image position than for other classes [30]. To allow the designer to analyze such differences, we construct an Activation Occurrence Map (AOM) for each (kernel, class) pair in the layer of interest.

For a given kernel k and a given class c in the training set, we compute the corresponding AOM M^{k,c} ∈ R^{W×H}, where W and H are the dimensions of the activation produced by the kernel k when the network processes any given input image. Each cell M^{k,c}_{i,j} in the AOM shows the proportion of training images from class c activating positive values at position (i, j) of the activation produced by k, and is computed as

M^{k,c}_{i,j} = \frac{1}{|D^c|} \sum_{x \in D^c} U(a_k(x)_{i,j})    (1)

where D^c is the set of images in the training set D belonging to class c; a_k(x)_{i,j} is the value at position (i, j) in the activation matrix produced by kernel k for the input sample x; and U is the Heaviside step function.
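As an illustration only (not the authors' implementation), Eqn. (1) can be computed with a few lines of NumPy. The array names and shapes below, and the helper all_aoms, are assumptions: activations of one kernel are taken as a (num_samples, H, W) stack, and labels as a vector of class indices.

import numpy as np

def activation_occurrence_map(activations, labels, target_class):
    # Eq. (1): proportion of class samples with a positive activation at each
    # position (i, j) of one kernel's output.
    # activations: (num_samples, H, W) activation maps a_k(x) of a single kernel k
    # labels:      (num_samples,) class index of each sample
    class_maps = activations[labels == target_class]      # the subset D^c
    fired = (class_maps > 0).astype(np.float64)            # Heaviside step U(.)
    return fired.mean(axis=0)                               # M^{k,c}, shape (H, W)

def all_aoms(layer_activations, labels, classes):
    # AOMs for every (kernel, class) pair of one layer, given activations of
    # shape (num_samples, H, W, num_kernels) as produced by a conv layer.
    return np.stack([
        np.stack([activation_occurrence_map(layer_activations[..., k], labels, c)
                  for c in classes])
        for k in range(layer_activations.shape[-1])
    ])                                                      # (num_kernels, num_classes, H, W)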
Figure 2 displays the AOMs corresponding to the kernels in the first (left) and second (right) layer of a model trained with the MNIST dataset [32] (see Sec. V), with the values M^{k,c}_{i,j} color-coded via an ordinal colormap. Rows in this image correspond to kernels, and columns to classes, respectively. Rows are ordered using groups of kernels with similar AOMs.

Fig. 2. AOMs produced by the first two layers in a model trained with the MNIST dataset. Each row represents one of the 16 kernels in the layer, while each column is a class. The highlighted groups (red border rectangles) contain kernels recognizing similar features (see Sec. V). We notice that some kernels activate more often for specific features in the images, such as the digit structure (G1, H1), particular border orientations (G2, G4, G5), background (G3, H4), or different handwriting styles (H2, H3, H7, H9).

Fig. 3. ODMs for the first two layers in the MNIST model (see Sec. V). The ODM is a symmetric matrix displaying the average difference between the AOMs produced by each pair of kernels. It provides a more concise overview of the similarities than the AOMs, with the trade-off of also providing fewer details. Nonetheless, it can be used as a first step to search for groups of similar kernels in large layers.

When one kernel produces similar AOMs for all classes, it is very likely that this kernel is not effective for the prediction process. For example, if a position in the output of a kernel produces positive activations — or instead, activates very rarely — for all or almost all inputs from every class in the training set, this position is unlikely to give the next layer useful information about which label the model should assign. We see such a pattern for the kernels shown in rows 13 and 14 from the top in Fig. 2 (left). These kernels produce very similar AOMs for all classes in the training set, with all positions activating very rarely, or never at all. Hence, they are good candidates for removal or reinitialization. We show later in Sec. V that we can remove this type of kernel with little or no performance reduction.

AOMs are helpful to find redundant kernels as well. When multiple kernels display very similar maps for every class, this tells us that these kernels recognize very similar features, and thus may be redundant. The kernels in the groups denoted by the red rectangles in Fig. 2 show an example of this issue. These kernels recognize almost identical features across the classes, which may make it unnecessary to keep all of them.

B. Occurrence Difference Matrix

To analyze a single layer using AOMs, we need to display |K| · |C| · W · H squares, where |K| is the number of kernels in the layer, |C| is the number of classes in the training set, and W and H are the kernel's output dimensions. DL networks used in complex prediction tasks often have large layers with hundreds or even thousands of kernels. Also, training sets for such tasks can contain as many classes [33]. Constructing and using AOM matrices as shown in Fig. 2 has thus limited scalability. An alternative for finding kernels producing similar AOMs (that may thus be redundant) is to build the Occurrence Difference Matrix (ODM) of the layer.

The ODM is a symmetric matrix D where each cell D_{k_i,k_j} ∈ R^+ measures the average difference between the AOMs corresponding to kernels k_i and k_j, over every corresponding position in the AOMs and every class in the training set, computed as

D_{k_I,k_J} = \frac{1}{|C| \cdot W \cdot H} \sum_{c \in C} \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} \left| M^{k_I,c}_{i,j} - M^{k_J,c}_{i,j} \right|    (2)

where C is the set of classes in the training set; W and H are the dimensions of the kernels' output; and M^{k,c}_{i,j} is the position (i, j) in the AOM of kernel k and class c (Eqn. 1).

Figure 3 shows the ODMs calculated for the AOMs of both layers displayed earlier in Fig. 2. Each value D_{k_I,k_J} is encoded using an ordinal blue-to-red colormap. Rows and columns in this matrix-like image follow the kernel order in Fig. 2. In practical settings, the designer can sort rows with a matrix-reordering algorithm [34] to easily identify similar-value cells. We can see in Fig. 3 that kernels whose AOMs display strong similarity in Fig. 2 also display strong similarity in the ODM. However, ODMs are more compact than AOM matrices: As each of its cells encodes just a single value by color, we can easily visualize ODMs for layers of up to a thousand kernels on a single screen. In contrast, the AOM matrix can visually scale only to a few tens of kernels, given that each of its cells requires a resolution equal to that of the activation output. The ODM is particularly helpful for Task 1, as it can concisely inform the designer about redundant kernels that can be removed or reinitialized to improve the model's accuracy.
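Under the same assumptions as the sketch after Eqn. (1) (AOMs stored as a (num_kernels, num_classes, H, W) array), Eqn. (2) reduces to a pairwise mean absolute difference; again, this is an illustrative sketch rather than the authors' code.

import numpy as np

def occurrence_difference_matrix(aoms):
    # Eq. (2): average absolute difference between the AOMs of every pair of
    # kernels, taken over all classes and all positions.
    # aoms: (num_kernels, num_classes, H, W) array of M^{k,c} values
    K = aoms.shape[0]
    flat = aoms.reshape(K, -1)                               # (K, |C|*W*H)
    odm = np.abs(flat[:, None, :] - flat[None, :, :]).mean(axis=2)
    return odm                                               # symmetric (K, K) matrix D

For layers with thousands of kernels, the broadcasted pairwise difference above can be replaced by a loop over kernel pairs to bound memory; the returned matrix can then be reordered [34] before display.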
C. Class Selectivity Maps

Our third visualization, Class Selectivity Maps (CSMs), helps the DL engineer find how selective each position of a kernel k's output is towards a given class c. If a position often activates for items of a given class c but does not often activate for items of any other class, this position is selective towards class c, i.e., it learned how to (partially) distinguish items of class c from items of other classes.

A CSM S^{k,c} is a matrix where each element S^{k,c}_{i,j} tells how selective position (i, j) of kernel k is for class c, computed as

S^{k,c}_{i,j} = \frac{1}{|C| - 1} \sum_{d \in C,\, d \neq c} \left( M^{k,c}_{i,j} - M^{k,d}_{i,j} \right)    (3)

where C is the set of classes in the training set; and M^{k,c}_{i,j} is the position (i, j) in the AOM of kernel k and class c (Eqn. 1).
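Continuing the same illustrative setting (one kernel's AOMs stored as a (num_classes, H, W) slice of the AOM array), Eqn. (3) is a mean of per-class differences; the function below is a sketch under those assumptions, not the authors' implementation.

import numpy as np

def class_selectivity_map(aoms_k, class_index):
    # Eq. (3): for each position (i, j), how much more often kernel k fires
    # for the chosen class than for the other classes, on average.
    # aoms_k: (num_classes, H, W) AOMs M^{k,c} of a single kernel k
    others = np.delete(aoms_k, class_index, axis=0)          # all classes d != c
    return (aoms_k[class_index] - others).mean(axis=0)       # S^{k,c}, values in [-1, 1]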
Fig. 4. CSMs produced by the first two layers in the MNIST model (see Sec. V). Each row represents one of the 16 kernels in each layer, while each column is a class. Kernels in each layer follow the same order as in Fig. 2. With this view, we can identify regions of the kernel's output image where some classes are more selective, i.e., they activate more often than other classes. For instance, the kernels in group H1 display a strong selectivity towards round-shaped inner structures in the digits, while group H2 displays stronger selectivity towards flat shapes, such as the digit one. Identifying such regions is important because they are the most likely to help the prediction process — e.g., discriminating between class 0 and 1.

Figure 4 shows the CSMs produced by the first and second layer of our MNIST model (see Sec. V), i.e., the same layers analyzed in Figs. 2 and 3. Rows indicate kernels and columns indicate classes, like in the AOM matrix. The values S^{k,c}_{i,j} ∈ [−1, 1] are color-coded using a two-segment colormap ranging from cyan (0) to purple (1) and from yellow (0) to red (−1), respectively. Hence, purple regions indicate where the analyzed kernel k is very selective towards the selected class c; red regions indicate positions where the kernel k is very 'dismissive' of class c, that is, the position outputs positive values for c far less often than for any other class. For example, one can notice that several kernels in the 2nd layer in Fig. 4 (right) are selective in most of the digit area in images from class 'zero'. However, some kernels, such as the one in group H11, are more selective in the 'inner circle' of the digit zero. This behavior indicates how different kernels may learn different features of the class. We give more details about the insights the user can take from this visualization in Section V.

D. Kernel Selectivity

While CSMs depict well the selectivity of different regions in each kernel of a layer, they do not assign an overall selectivity value to the whole kernel. Having such a value would help when deciding which kernel to keep in the model from a group of redundant kernels, or when evaluating if a kernel is a good candidate for removal or reinitialization due to its poor contribution to the model (Task 1). This editing of the network can have three added values. First, one can keep the most selective kernel and thereby minimize the potential performance loss when simplifying the network. Secondly, the network size is decreased, leading to the already mentioned speed and space benefits. Finally, the overall performance can be increased by reinitializing unsatisfactory kernels that do not aggregate useful features or that introduce overfitting to the model.

Given the matrix S^k containing all the CSMs of kernel k for every position and class, we compute the overall selectivity s(k) of kernel k as

s(k) = \sum_{c_0, c_1 \in C,\; c_0 \neq c_1} \left| \mathrm{avg}(S^{k,c_0}) - \mathrm{avg}(S^{k,c_1}) \right|    (4)

where C is the set of classes in the training set, and avg(S^{k,c}) is the average value of all positions in the CSM for kernel k and class c. In Sec. V, we demonstrate the usefulness of our metric with an experiment.
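A sketch of Eqn. (4) under the same assumptions (the CSMs of one kernel stored as a (num_classes, H, W) array). Here each unordered class pair is counted once; summing over ordered pairs would simply double every s(k) and leave the kernel ranking unchanged.

import numpy as np
from itertools import combinations

def kernel_selectivity(csms_k):
    # Eq. (4): overall selectivity of one kernel, from the per-class averages
    # avg(S^{k,c}) of its class selectivity maps.
    # csms_k: (num_classes, H, W) CSMs S^{k,c} of kernel k
    class_means = csms_k.mean(axis=(1, 2))
    return sum(abs(class_means[c0] - class_means[c1])
               for c0, c1 in combinations(range(len(class_means)), 2))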
V. EXPERIMENTS

We ran a series of experiments to show how designers can use our techniques to perform the tasks described in Sec. III. For this, we use two image classification models trained with two widely-known datasets: (1) MNIST, containing images of handwritten digits from 0 to 9; and (2) CIFAR10, containing images displaying an object belonging to one of ten different classes — airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. For both datasets, we aim to design a deep neural network that can classify images into the respective ten classes. Figure 5 displays the architectures of both models. In this experiment, we apply our method to the activation of each convolutional layer before pooling. However, the method can easily be applied after the subsequent pooling layer — regardless of the pooling technique — as these layers contain a more concise representation of the features identified in the previous layer.

Fig. 5. The MNIST model (top) contains two convolutional layers followed by a max pooling layer, a hidden fully-connected layer, and the output decision layer. After training, this model achieves 96.96% accuracy on the test set. The CIFAR10 model (bottom) has a more complex architecture with three sequential groups of layers formed by two convolutions, one max pooling, and one dropout layer, followed by a hidden dense layer and the output decision layer. After training, the model achieves 80.54% accuracy on the test set.

A. Analysis and layer size reduction of the MNIST model

We next use our VA tools to simplify this model while keeping a high accuracy. After training, we create the visualizations shown in Figs. 2, 3, and 4 using all training-set images. Following the insights from our VA approach, we remove kernels that do not help the prediction task, either because they do not recognize features useful to distinguish any class or because they are redundant vs other same-layer kernels (Task 1, Sec. III). A more complex alternative to kernel removal is to reinitialize these kernels, so we prefer removal for simplicity.

We start our analysis by looking at the AOMs of the first convolutional layer's kernels (Fig. 2 left). In row 14, we easily spot a kernel most of whose positions rarely activate for any class. So, this is a kernel that most likely does not contribute to the prediction task. Also, we notice groups of kernels with very similar patterns of activation occurrence. For example, kernels in the G1 group often activate at pixels inside the digit area, and rarely at pixels outside it. These kernels learn a similar feature: to recognize the digit structure.

The other kernels in the visualized layer show the opposite pattern: They activate more often at pixels outside the digit area than inside. These kernels recognize features such as the different border orientations that appear in each class. However, they do not necessarily have the same role: Some of them activate more often around the top of the digit, e.g., group G4, while others activate more often around the bottom of the digit, e.g., group G2. Still, some of these kernels look to be quite redundant vs each other, and thus may not add useful information to the prediction. To check this, we look at the occurrence difference matrix (ODM) for this layer (Fig. 3 left). Here, we can spot several groups of kernels with strong similarity (cells with dark blue shades).

Next, we analyze the CSMs produced by these potentially redundant kernels (Fig. 4 left). While the AOMs show us which kernel regions activate for different classes, the CSMs show whether these activations are selective or not, i.e., if high-occurrence activations provide enough information to help the network decide if the input belongs to a given class or not. The CSMs show us that kernels in the last three rows are not selective towards any class, which makes them unlikely to be relevant to the model's decisions. In contrast, other kernels, e.g., group G1, show regions of strong selectivity for some classes, telling that their features are meaningful enough to help the model's decision. We proceed by removing the kernels we found not useful for the reasons stated above. These correspond to the last three rows in Figs. 2 and 4 (left). Also, we group kernels with strong similarity in the AOMs (kernel groups correspond to the red rectangles in Fig. 2) and keep a single representative of each group in the model (we delete the others). The kept kernel is the one with the highest kernel selectivity (see Sec. IV-D) for a given group.

After keeping only the chosen five kernels in the 1st layer of the model, we freeze the weights of this layer and retrain the model from the second layer onward. This way, we ensure that our model cannot modify the features learned by those five kernels and thus has to use only these features (and whatever more abstract features it builds in deeper layers) to perform the prediction. With just one retraining epoch, our model achieves 96.93% accuracy, virtually the same it had with all 16 kernels in the 1st layer. This shows that the five chosen kernels cover enough features to achieve the same classification capability we had with 16 kernels.
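One possible way to reproduce this keep-freeze-retrain step with tf.keras is sketched below. It is not the authors' code: the exact MNIST architecture, the kernel indices, and the choice to warm-start only the second convolution are assumptions; the remaining dense layers could equally be copied from the trained model instead of being re-initialized.

from tensorflow.keras import layers, models

def prune_and_freeze_first_conv(trained_model, kept_kernels, num_classes=10):
    # Rebuild an MNIST-like CNN keeping only the chosen kernels of the first
    # convolutional layer, freeze that layer, and return a model ready to be
    # retrained from the second layer onward.
    old_conv1, old_conv2 = trained_model.layers[0], trained_model.layers[1]
    w1, b1 = old_conv1.get_weights()            # w1: (kh, kw, in_channels, filters)
    w2, b2 = old_conv2.get_weights()

    new_model = models.Sequential([
        layers.Conv2D(len(kept_kernels), old_conv1.kernel_size,
                      activation="relu", input_shape=(28, 28, 1)),
        layers.Conv2D(w2.shape[-1], old_conv2.kernel_size, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    # Copy the kept filters; slice conv2's input channels so they match them.
    new_model.layers[0].set_weights([w1[..., kept_kernels], b1[kept_kernels]])
    new_model.layers[1].set_weights([w2[:, :, kept_kernels, :], b2])
    new_model.layers[0].trainable = False       # freeze the pruned first layer

    new_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
    return new_model

# One retraining epoch, as in the experiment (the kernel indices are hypothetical):
# pruned = prune_and_freeze_first_conv(trained_model, kept_kernels=[0, 3, 5, 9, 12])
# pruned.fit(x_train, y_train, epochs=1, validation_data=(x_test, y_test))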
We repeat the process for the second convolutional layer of our MNIST model. The AOMs for this layer (Fig. 2 right) tell us that its kernels are much more diverse concerning the features they recognize than the kernels in layer 1. Often, they learn particular features that appear in some writing styles. For instance, the kernels in group H8 recognize digits written in a 'rounded' way, while the kernel in row H7 recognizes digits from a 'flatter' writing. Also, Figs. 2 and 4 (right) help us recognize kernels that are redundant or not selective enough to help our model. Red rectangles in Fig. 2 (right) show the found kernel groups. We perform an edit similar to that of layer 1, freeze the weights in layer 2, retrain, and obtain an accuracy of 97.69%. So, our approach allowed us to not only simplify the model but also improve its accuracy. Note that this accuracy was obtained only by retraining the weights in the hidden fully-connected layer and the output layer of the model, so this accuracy relies solely on the features previously learned — before removal — by the convolutional kernels we selected to keep.

B. Analysis and layer size reduction of the CIFAR10 model

Fig. 6. AOMs for the 64 kernels in the 1st and 6th layers of the CIFAR10 model. Kernels are sorted by the similarity tree computed by agglomerative clustering of the AOM rows. The 1st layer cannot learn features discriminative enough for each class, which denotes the need for more layers in the model. Note that while kernels in the 1st layer may produce positive activations for several classes, such activations tend to appear in different regions of the output image for different kernel x class pairs. If this behavior is not present, the next layers cannot use these features to build more discriminative ones, which indicates a need for more kernels in the layer. Finally, the 6th layer provides much more discriminative kernels, often activating positive values for just one class, indicating that the model is unlikely to improve performance if more layers are added.
We repeat the previous experiment for the model trained with the CIFAR10 dataset. Fig. 6 shows the AOMs for the 1st and the 6th convolutional layers in the model. Both views give us interesting insights into the layers' behaviors. First, many kernels in both layers do not often activate for any class, suggesting kernels that did not learn to recognize features useful for classification. Secondly, kernels in the 1st layer that activate often usually do so for multiple classes. This tells that this layer cannot learn features complex enough to distinguish individual classes, suggesting that the model needs more layers to perform the prediction task (Task 3, Sec. III).

Conversely, in the 6th layer, often only one class produces positive activations in a given kernel. This tells that, at this point, the model already separates classes as well as it can, and more layers are unlikely to improve performance. Some classes, e.g., cat and deer, do not achieve an occurrence close to 100% in any kernel. This suggests that the features the layer learned for these classes are not enough to cover all their samples, indicating the need to learn more features about them. Hence, we found that this or previous layers need more kernels to discover more features (Task 2, Sec. III).

Figure 7 shows the CSMs of each of the 64 kernels in the 1st and 6th layers of the CIFAR10 model. We see that some kernels (or regions) with high occurrence values in the corresponding AOMs (Fig. 6) are not very selective among different classes, making them a poor choice to keep in the network. In contrast, we see in this layer kernels that are very selective for subsets of classes (e.g., ship and truck). This tells that, while this layer is not enough to find class-specific features, it can find features that only appear in a small subset of classes, thus easing the prediction task of the next layers.

Fig. 7. CSMs for the 64 kernels in the 1st and 6th layers of the CIFAR10 model. Kernels follow the same order as in Fig. 6. Notice that while several kernels in the 1st layer show a high activation occurrence for all classes (see Fig. 6), they are usually much more selective to only a couple of classes. The CSMs give more confidence to designer decisions, as they clearly state whether a high occurrence pattern in the AOMs indeed indicates selectivity.

Due to the width and depth of this model, we only reduce the size of the 1st and 6th convolutional layers. As for the MNIST model, we spot and remove kernels that are not selective for any class, and simplify groups of redundant kernels (details omitted for brevity). With our VA tools, we reduced the 1st layer's size to 20 kernels. After retraining the rest of the model for one epoch, with the weights in the 1st layer frozen, our reduced model achieved 81.20% test-set accuracy, even higher than the initial 80.54% accuracy.

Following our VA approach, we also reduced the size of the 6th layer from 64 to 35 kernels, achieving a test-set accuracy of 80.43% after retraining only the weights in the fully-connected layers for one epoch. This accuracy is marginally below the original one. We see here a trade-off between network size and performance: At some point, the network simplification has to stop, as the accuracy will inherently drop. As stated earlier, an alternative is to reinitialize the kernels selected for removal (instead of removing them) to achieve higher accuracy.

C. Kernel Selectivity Experiment

To show how our kernel selectivity metric indeed captures the overall selectivity of a kernel, we run the following experiment: For each pair of kernel k and class c in the network, we compute the average of the pair's class selectivity map S^{k,c}, denoted s^{k,c}. Then, for each class c_i, we remove the kernels k_j with s^{k_j,c_i} ≤ 0, and compute the accuracy of the resulting network only for test-set elements of class c_i. Next, we do the opposite: For each class c_i, we remove the kernels k_j with s^{k_j,c_i} > 0 and compute the accuracy of the network without such kernels for test-set elements of class c_i.
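A sketch of how such a per-class evaluation can be carried out with tf.keras, assuming a Sequential model. It approximates kernel removal by zeroing the corresponding output channels of the chosen layer (so no layer has to be rebuilt); the variable names and the masking trick are our illustration, not the authors' implementation.

import numpy as np
import tensorflow as tf

def class_accuracy_without_kernels(model, layer_index, removed_kernels,
                                   x_test, y_test, target_class):
    # Zero out the output channels of the 'removed' kernels in one layer and
    # measure accuracy only on the test samples of a single class.
    keep = np.ones(model.layers[layer_index].filters, dtype="float32")
    keep[list(removed_kernels)] = 0.0

    # Replay the network on a fresh input, masking the chosen layer's output.
    x = inputs = tf.keras.Input(shape=model.input_shape[1:])
    for i, layer in enumerate(model.layers):
        x = layer(x)
        if i == layer_index:
            x = x * tf.constant(keep)        # broadcasts over the spatial positions
    masked_model = tf.keras.Model(inputs, x)

    # Accuracy restricted to elements of the chosen class.
    idx = np.where(y_test == target_class)[0]
    preds = np.argmax(masked_model.predict(x_test[idx], verbose=0), axis=1)
    return float(np.mean(preds == y_test[idx]))

Running it once with the kernels having s^{k_j,c_i} ≤ 0 removed and once with the selective kernels removed yields the two accuracy series compared in Fig. 8 (bottom).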
We did these experiments for the 6th CIFAR10 model layer (CSMs shown in Fig. 7 right). Fig. 8 (top) shows the number of kernels kept in each case. Fig. 8 (bottom) compares the test-set accuracy for elements of each class c_i when considering all kernels, only the kernels not selective for c_i, and only the kernels selective for c_i. In all cases, the number of selective kernels for a given class is much smaller than the number of non-selective kernels. In all cases, the test-set accuracy drops significantly for the chosen class when we remove the highly selective kernels for that class, and increases when we only keep the selective kernels. Hence, our selectivity metric indeed captures well how much a kernel contributes to the separation of a given class.

Fig. 8. Selectivity experiment on the 6th layer of the CIFAR10 model. The top figure displays the number of kernels considered to be selective (green) and not selective (red) for each class. The bottom figure displays how much the class accuracy changes when we keep only the kernels from each one of these groups.

VI. CONCLUSION

In this work, we present a visual analytics set of tools, and an associated workflow, that helps machine learning designers in their deep learning architectural decisions. We define three tasks related to such decisions and show how our techniques support them. We show how our toolset and workflow can be used in practice to considerably reduce the sizes of two trained deep-learning models for non-trivial classification tasks while keeping, or even increasing, classification performance.

Our work opens multiple future work possibilities. While our method is readily adaptable to fully-connected neurons, recurrent layers such as LSTMs are much more challenging. Such layers contain hidden internal states that modify their values at every input timestep. Analyzing how the AOMs of such hidden states change over time is an interesting direction. Separately, our visualizations do not scale well to large networks of tens of layers and thousands of neurons. We plan next to study how high-dimensional data visualization, well studied in the visual analytics community [35], could be applied to improve the scalability of our method.

REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. The MIT Press, 2016.
[2] R. Garcia, A. C. Telea, B. C. da Silva, J. Torresen, and J. L. D. Comba, "A task-and-technique centered survey on visual analytics for deep learning model engineering," Computers & Graphics, vol. 77, 2018.
[3] M. Liu, J. Shi, Z. Li, C. Li, J. Zhu, and S. Liu, "Towards better analysis of deep convolutional neural networks," IEEE TVCG, vol. 23, 2017.
[4] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proc. ECCV. Springer, 2014, pp. 818–833.
[5] J. Yosinski, J. Clune, T. Fuchs, and H. Lipson, "Understanding neural networks through deep visualization," in Proc. ICML – Workshop on Deep Learning, 2015.
[6] H. Strobelt, S. Gehrmann, H. Pfister, and A. M. Rush, "LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks," IEEE TVCG, vol. 24, no. 1, pp. 667–676, Jan 2018.
[7] P. Rauber, S. G. Fadel, A. Falcão, and A. Telea, "Visualizing the hidden activity of artificial neural networks," IEEE TVCG, vol. 23, no. 1, 2017.
[8] M. Kahng, P. Andrews, A. Kalro, and D. Chau, "ActiVis: Visual exploration of industry-scale deep neural network models," IEEE TVCG, vol. 24, no. 1, 2018.
[9] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," CoRR, vol. abs/1312.6034, 2013.
[10] D. Erhan, Y. Bengio, A. Courville, and P. Vincent, "Visualizing higher-layer features of a deep network," University of Montreal, 2009.
[11] K. Weiss, T. M. Khoshgoftaar, and D. Wang, "A survey of transfer learning," Journal of Big Data, vol. 3, no. 1, 2016.
[12] T. Elsken, J. H. Metzen, and F. Hutter, "Neural architecture search: A survey," CoRR, 2018.
[13] C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang, "AdaNet: Adaptive structural learning of artificial neural networks," CoRR, 2016.
[14] B. Baker, O. Gupta, N. Naik, and R. Raskar, "Designing neural network architectures using reinforcement learning," CoRR, 2016.
[15] A. Veit and S. J. Belongie, "Convolutional networks with adaptive computation graphs," CoRR, vol. abs/1711.11503, 2017.
[16] M. Cogswell, F. Ahmed, R. B. Girshick, L. Zitnick, and D. Batra, "Reducing overfitting in deep networks by decorrelating representations," CoRR, vol. abs/1511.06068, 2015.
[17] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," CoRR, vol. abs/1510.00149, 2015.
[18] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proc. ICCV, 2017.
[19] J. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," CoRR, vol. abs/1707.06342, 2017.
[20] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," CoRR, vol. abs/1608.08710, 2016.
[21] Z. Huang and N. Wang, "Data-driven sparse structure selection for deep neural networks," in Proc. ECCV, 2018.
[22] H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable architecture search," CoRR, vol. abs/1806.09055, 2018.
[23] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural networks," in Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 2015, pp. 1135–1143.
[24] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," CoRR, vol. abs/1810.05270, 2018.
[25] W. Zhong, C. Xie, Y. Zhong, Y. Wang, W. Xu, S. Cheng, and K. Mueller, "Evolutionary visual analysis of deep neural networks," in Proc. ICML – Workshop on Visualization for Deep Learning, 2017.
[26] N. Pezzotti, T. Hollt, J. V. Gemert, B. P. F. Lelieveldt, E. Eisemann, and A. Vilanova, "DeepEyes: Progressive visual analytics for designing deep neural networks," IEEE TVCG, vol. 24, no. 1, pp. 98–108, 2018.
[27] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, 2015.
[28] M. Denil, B. Shakibi, L. Dinh, M. A. Ranzato, and N. de Freitas, "Predicting parameters in deep learning," in Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013, pp. 2148–2156.
[29] A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune, "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," in Proc. NIPS, 2016, pp. 3395–3403.
[30] B. Alsallakh, A. Jourabloo, M. Ye, X. Liu, and L. Ren, "Do convolutional neural networks learn class hierarchy?" IEEE TVCG, 2018.
[31] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J Mach Learn Res, vol. 15, no. 1, pp. 1929–1958, 2014.
[32] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, 1998.
[33] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE CVPR, 2009.
[34] M. Behrisch, B. Bach, N. Henry Riche, T. Schreck, and J.-D. Fekete, "Matrix reordering methods for table and network visualization," Computer Graphics Forum, vol. 35, no. 3, pp. 693–716, 2016.
[35] S. Liu, D. Maljovec, B. Wang, P. Bremer, and V. Pascucci, "Visualizing high-dimensional data: Advances in the past decade," IEEE TVCG, 2017.
