
Feed-Forward Neural Networks Need Inductive Bias to Learn Equality Relations


arXiv:1812.01662v1 [cs.LG] 4 Dec 2018

Tillman Weyde
Department of Computer Science
City, University of London
London, United Kingdom
t.e.weyde@city.ac.uk

Radha Manisha Kopparti
Department of Computer Science
City, University of London
London, United Kingdom
radha.kopparti@city.ac.uk

32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.

Abstract

Basic binary relations such as equality and inequality are fundamental to relational data structures. Neural networks should learn such relations and generalise them to new, unseen data. We show in this study, however, that this generalisation fails with standard feed-forward networks on binary vectors. Even when trained with maximal training data, standard networks do not reliably detect equality. We introduce differential rectifier (DR) units that we add to the network in different configurations. The DR units create an inductive bias in the networks, so that they do learn to generalise, even from small numbers of examples, and we have not found any negative effect of their inclusion in the network. Given the fundamental nature of these relations, we hypothesise that feed-forward neural network learning would benefit from inductive bias for other relations as well. Consequently, the further development of suitable inductive biases will be beneficial to many tasks in relational learning with neural networks.

1 Introduction

Basic relations such as equality are fundamental to relational data structures. One goal of applying neural networks to relational data is that the networks learn to infer these relational structures from data. Although equality is typically not learned from data directly, equality or approximate equality may be embedded as part of other tasks. The modelling of equality is clearly in the hypothesis space of feed-forward neural networks (FFNNs) [1], but [2, 3] already highlighted that learning of identity relationships with neural networks may not generalise to unseen data. We therefore see learning to recognise equality as relevant from both a theoretical and a practical perspective.

In this study we test whether feed-forward networks learn equality, as well as numeric comparison, a thresholded digit sum, and digit reversal of pairs of binary vectors, and whether they generalise to new data in different settings regarding the task, the amount of data provided, and the depth of the network. We find that the recognition of binary relations is not generalised reliably by feed-forward networks. To address this problem, we introduce an inductive bias with additional predefined network structures that we call differential rectifier (DR) units. We find in our experiments that DR units induce reliably perfect generalisation for equality and all other tasks except digit reversal.

These results raise two questions: first, which other relations neural networks do not learn, and what that means for more complex tasks; second, what kinds of inductive biases to design and how to implement them.

The remainder of this paper is organised as follows: Section 2 reviews related literature, Section 3 introduces the task of learning vector equality and our DR units for inductive bias, Section 4 presents the experimental results, and Section 5 presents the conclusions.
2 Related work

In relational learning, equality is often not learned from the data, with the exception of the work by [4], who learn to detect equality of object attributes from images. Learning equality could be interesting in the context of constraint learning [5], to learn when equality constraints should be regarded as satisfied. Another relevant area is rule learning and application, where soft unification as in [6] could be replaced with a learnt model. Since neural networks are currently by far the most popular machine learning method, it is of interest whether they can learn equality.

There have been a number of theoretical contributions showing that feed-forward networks are universal approximators, most generally, to our knowledge, by [1]. Presumably because of these results, there has been relatively little interest in the question of which functions neural networks cannot learn. One of the few studies in this direction was undertaken in [2] in 1999, where a recurrent neural network failed to distinguish abstract patterns based on equality relations between sequence elements, although seven-month-old infants showed the ability to distinguish them after a few minutes of exposure. This was followed by a lively exchange on rule learning by neural networks and in human language acquisition, where results by [7-9] could not be reproduced by [10, 11], and [12] disputed claims by [11]. Other approaches, such as [13-15], use different network architectures, problem formulations or evaluation methods.

A more specific problem of learning equality relations was posed in [3], which showed that learning of equality on even numbers does not transfer to odd numbers in binary representation. This relates to the input neuron for the least significant bit never being set to 1 during training. Recently, [16] addressed this specific problem with different approaches, as an example of extrapolation and inductive biases for machine learning in natural language processing. However, they did not address the general question of learning equality with neural networks. If standard neural networks do not generalise equality relations despite the solution being in their hypothesis space, as we will show for FFNNs below, then the question is how we can enable the learning of solutions that do generalise. Inductive biases as a solution can be realised in a number of ways and have been of increased interest recently [17, 18].

3 Equality relation learning

The studies listed above motivated the approach taken here: to study a reduced problem outside common contexts such as image analysis or cognitive modelling, namely whether feed-forward neural networks trained with back-propagation generally have the ability to learn equality relations and generalise to unseen data. The general task is to learn a relation between pairs of binary vectors. This leads to a binary classification of the pairs according to the equality or otherwise of their element vectors.

We use a standard FFNN as sketched in Figure 1a). This network has 2n input neurons, where n is the vector dimensionality. The hidden layer has 10 neurons with ReLU activation. The output layer has two neurons representing the two classes (equal/unequal), which use softmax activation. The training uses the Adam optimiser [19] with cross-entropy loss. The data we train and test the network with is synthetically generated, and we vary the type and the distribution of the data in the experiments below. We are interested in how many training examples are needed until the network learns to correctly classify pairs of equal vs. unequal vectors. This network, like the following ones, has been implemented in Python using PyTorch (http://pytorch.org); a minimal sketch of the setup follows below.
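To make the setup concrete, the following is a minimal PyTorch sketch of the baseline experiment. It is our reconstruction, not the authors' code: the names (make_pairs, PlainFFNN) are illustrative, the data generation only approximates the sampling described above, and the stratified 75:25 split is omitted for brevity. As is idiomatic in PyTorch, the softmax is folded into the cross-entropy loss.

```python
import torch
import torch.nn as nn

def make_pairs(n_dim, n_pairs, seed=0):
    """Generate equal and unequal pairs of random binary vectors (sketch)."""
    g = torch.Generator().manual_seed(seed)
    vecs = torch.randint(0, 2, (n_pairs, n_dim), generator=g).float()
    equal = torch.cat([vecs, vecs], dim=1)                 # label 0: equal pair
    other = torch.randint(0, 2, (n_pairs, n_dim), generator=g).float()
    # If a random partner happens to match, flip all its bits to force inequality.
    other = torch.where((other == vecs).all(dim=1, keepdim=True), 1 - other, other)
    unequal = torch.cat([vecs, other], dim=1)              # label 1: unequal pair
    x = torch.cat([equal, unequal])
    y = torch.cat([torch.zeros(n_pairs), torch.ones(n_pairs)]).long()
    return x, y

class PlainFFNN(nn.Module):
    """Baseline: 2n inputs, one hidden ReLU layer of 10 units, 2-way output."""
    def __init__(self, n_dim, n_hidden=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_dim, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, 2))   # logits; softmax is inside the loss
    def forward(self, x):
        return self.net(x)

n = 10
x, y = make_pairs(n, 1000)
model = PlainFFNN(n)
opt = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()
for epoch in range(20):               # 20 epochs as in the paper; full-batch
    opt.zero_grad()                   # training is a simplification, since the
    loss = loss_fn(model(x), y)       # paper does not specify batching
    loss.backward()
    opt.step()
```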
Inductive bias creation with DR units

In our model, we use differential rectifier (DR) units that compare input values by calculating the absolute difference: f(x, y) = |x − y|. We create one DR unit for every vector dimension, with the weights from the inputs to the DR units fixed at 1; thus, learning suitable summation weights for the DR outputs is sufficient for creating a generalisable equality detector. We use two ways of integrating DR units into the neural networks: Early Fusion and Mid Fusion. In Early Fusion, the DR units are concatenated to the input units (Figure 1b), and in Mid Fusion they are added to the hidden layer (Figure 1c). In both cases the existing input and hidden units are unchanged; a sketch of both variants follows below.
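The two fusion variants can be sketched as follows. Again, this is an illustrative reconstruction under the description above, with names of our choosing; the DR computation itself has no learnable parameters, matching the fixed input weights.

```python
import torch
import torch.nn as nn

def dr_units(x, n_dim):
    """Differential rectifiers: elementwise |vec1 - vec2|, no learned weights."""
    vec1, vec2 = x[:, :n_dim], x[:, n_dim:]
    return (vec1 - vec2).abs()

class EarlyFusionFFNN(nn.Module):
    """DR outputs concatenated to the inputs (cf. Figure 1b)."""
    def __init__(self, n_dim, n_hidden=10):
        super().__init__()
        self.n_dim = n_dim
        self.net = nn.Sequential(
            nn.Linear(2 * n_dim + n_dim, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, 2))
    def forward(self, x):
        return self.net(torch.cat([x, dr_units(x, self.n_dim)], dim=1))

class MidFusionFFNN(nn.Module):
    """DR outputs appended to the hidden layer (cf. Figure 1c)."""
    def __init__(self, n_dim, n_hidden=10):
        super().__init__()
        self.n_dim = n_dim
        self.hidden = nn.Sequential(nn.Linear(2 * n_dim, n_hidden), nn.ReLU())
        self.out = nn.Linear(n_hidden + n_dim, 2)
    def forward(self, x):
        h = torch.cat([self.hidden(x), dr_units(x, self.n_dim)], dim=1)
        return self.out(h)
```

With this construction, a pair is equal exactly when every DR output is zero, i.e. when the sum of |x_i − y_i| over all dimensions i is zero, so a single linear read-out of the DR outputs can implement a perfectly generalising equality detector; learning only has to find suitable summation weights.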
[Figure 1: Network architectures: a) standard feed-forward network; DR integration with b) early fusion and c) mid fusion. The DR units receive their input in both cases from vec 1 and vec 2.]

Vector Dimensions  n=2   n=3   n=5   n=10  n=30  n=100
Plain FFNN         52%   55%   37%   52%   65%   75%
Early Fusion       82%   75%   67%   75%   75%   100%
Mid Fusion         100%  100%  100%  100%  100%  100%

Table 1: Accuracy of the different network types on pairs of vectors of different dimensions. The joint train and test data covers all possible equal vector pairs up to 1000, and a random selection where there are more. Only the Mid Fusion architecture reaches reliable equality detection.

4 Experiments and Results

We performed different sets of experiments using binary vectors, estimating vector equality in relation to vector dimensionality, data size and dataset structure. We also use two additional tasks to test the effect of DR units in different contexts.

Effect of network architecture and vector dimensionality

We generate pairs of random binary vectors with dimensionality n between 2 and 100, as shown in Table 1. We use all possible binary vectors to generate equal pairs, i.e. 2^n pairs, for n < 10, and a random selection of 1000 vectors otherwise. We also generate the same number of randomly selected unequal vector pairs. We then use stratified sampling to split the data 75:25 into train and test sets. The network is trained for 20 epochs, which led to convergence in all cases. We run 10 simulations for each configuration; the average results are shown in Table 1.

We see that the standard FFNNs never fully generalise, and in many cases barely exceed chance level (50%). The Early Fusion model improves results, but only reaches full performance for 100 dimensions. Mid Fusion reaches perfect test performance in all cases. For the plain FFNN, there appears to be a trend towards better performance at higher dimensionality, but given the observed variation this may be coincidental. We did not perform an exhaustive grid search over all hyperparameters, but we tested more hidden layers (2, 3) and larger hidden layers (20, 30 neurons) without observing a significant change in the results.

Effect of training data size

We study here how much the performance depends on the training data size. For this, we vary only the training data size and keep the test set and all other parameters constant. We use training data sizes of 1% to 50% (in relation to the total available data as defined above), and the accuracy achieved in the various conditions is plotted in Figure 2. It is worth noting that the Mid Fusion network reaches 100% accuracy from 10% data size onwards, while the FFNN shows only small learning effects.

[Figure 2: Accuracy of the FFNN for 10-dimensional binary vectors with training data sizes varied from 1% to 50%, keeping the test data fixed.]

Effect of vector coverage

A possible hypothesis for the results of the FFNN is that the coverage of the vectors in the training set plays a role. Sharing vectors in equal pairs between training and test set would mean training on the test data, but we created a training set that contains all vectors that appear in the test set in the unequal pairs, for n = 10. The results are shown in column a) of Table 2. We also created a training set where each vector appeared as above, but in both position 1 and 2; the results are shown in column b) of Table 2. The results in both cases are similar to those without this additional coverage in Table 1.

Type             a)    b)    c)    d)    e)
1) Plain FFNN    50%   52%   75%   77%   50%
2) Early Fusion  75%   87%   92%   82%   55%
3) Mid Fusion    100%  100%  100%  100%  58%

Table 2: Test set accuracy of the FFNN without and with DR units for different vector coverage (a, b, see text for details) and for classification by c) numeric comparison (≥), d) digit sum ≥ 3, and e) reversal of digits.

Other classification tasks

We evaluate here whether the DR units have a negative effect on other learning tasks (using n = 3). We evaluated the networks on classification by comparing the two vectors in the pair as binary numbers, with results shown in column c) of Table 2. We also tested a task that is not a comparison of the two vectors in the pair, calculating the digit sum and classifying by whether it is ≥ 3. In both c) and d) we see that performance is actually not hindered but helped by the DR units. We finally tested the task of recognising digit reversal (swapping least with most significant bits), which DR units are not designed for, as they compare corresponding digits. As we can see in column e), DR units do not deliver a perfect solution here, but they still lead to somewhat better results than a plain FFNN. Sketches of the label functions for these tasks follow below.
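For reference, here are sketches of the three label functions as we read the task descriptions. Both the bit-order convention and the exact digit-sum definition are our assumptions: we take the first element as the most significant bit, and we read the digit sum as being taken over the concatenated pair.

```python
import torch

def numeric_geq(vec1, vec2):
    """Task c): compare the two vectors as binary numbers (vec1 >= vec2).
    Assumes the first element is the most significant bit."""
    weights = 2 ** torch.arange(vec1.shape[-1] - 1, -1, -1).float()
    return (vec1 @ weights >= vec2 @ weights).long()

def digit_sum_geq3(vec1, vec2):
    """Task d): digit sum of the pair is at least 3 (our reading)."""
    return ((vec1.sum(-1) + vec2.sum(-1)) >= 3).long()

def is_reversal(vec1, vec2):
    """Task e): vec2 equals vec1 with the digit order reversed."""
    return (vec2 == torch.flip(vec1, dims=[-1])).all(-1).long()
```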
5 Conclusions

In this study we examined the learning behaviour of feed-forward neural networks in vector equality detection and observed that the networks do not generalise well to unseen data. We obtained similar results in other tasks, such as numeric inequality and the bit sum of binary vectors. We therefore introduced a simple modification to the network, differential rectifier (DR) units, and observed substantial improvements on unseen test data. This improvement is largely independent of vector dimension, data size and other parameters. The question of why standard FFNNs do not learn vector equality relations in a generalisable way is a relevant one and deserves further theoretical and empirical study. It is also important to investigate the design of further methods for creating and controlling inductive biases in neural network learning, as we find that even relatively simple tasks like generalising equality require them.

References

[1] Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861-867, 1993.

[2] G. F. Marcus, S. Vijayan, S. B. Rao, and P. M. Vishton. Rule learning by seven-month-old infants. Science, 283(5398):77-80, 1999.

[3] G. F. Marcus. The Algebraic Mind: Integrating Connectionism and Cognitive Science. MIT Press, Cambridge, MA, 2001.

[4] Adam Santoro, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4967-4976, 2017.

[5] Luc De Raedt, Andrea Passerini, and Stefano Teso. Learning constraints from examples. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[6] Andres Campero, Aldo Pareja, Tim Klinger, Josh Tenenbaum, and Sebastian Riedel. Logical rule induction and theory learning using neural theorem proving, 2018.

[7] Jeffrey Elman. Generalization, rules, and neural networks: A simulation of Marcus et al. https://crl.ucsd.edu/~elman/Papers/MVRVsimulation.html, 1999.

[8] Gerry Altmann and Zoltan Dienes. Technical comment on rule learning by seven-month-old infants and neural networks. Science, 284(5416):875, 1999.

[9] Thomas R. Shultz and Alan C. Bale. Neural network simulation of infant familiarization to artificial sentences: Rule-like behavior without explicit rules and variables. Infancy, 2(4):501-536, 2001.

[10] Marius Vilcu and Robert F. Hadley. Generalization in simple recurrent networks. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 23, 2001.

[11] Marius Vilcu and Robert F. Hadley. Two apparent 'counterexamples' to Marcus: A closer look. Minds and Machines, 15(3-4):359-382, 2005.

[12] Thomas R. Shultz and Alan C. Bale. Neural networks discover a near-identity relation to distinguish simple syntactic forms. Minds and Machines, 16(2):107-139, 2006.

[13] Shastri and Chang. A spatiotemporal connectionist model of algebraic rule-learning. Technical Report TR-99-011, International Computer Science Institute, 1999.

[14] P. Dominey and F. Ramus. Neural network processing of natural language: I. Sensitivity to serial, temporal and abstract structure of language in the infant. Language and Cognitive Processes, 15(1):87-127, 2000.

[15] Raquel G. Alhama and Willem Zuidema. Pre-wiring and pre-training: What does a neural network need to learn truly general identity rules? CoCo at NIPS, 2016.

[16] Jeff Mitchell, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Extrapolation in NLP. arXiv:1805.06648, 2018.

[17] Jessica B. Hamrick, Kelsey R. Allen, Victor Bapst, Tina Zhu, Kevin R. McKee, Joshua B. Tenenbaum, and Peter W. Battaglia. Relational inductive bias for physical construction in humans and machines. arXiv preprint arXiv:1806.01203, 2018.

[18] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077-4087, 2017.

[19] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.