Backpropagation and Resilient Propagation
Backpropagation and Resilient Propagation
Backpropagation and Resilient Propagation
BACK PROPAGATION Back propagation is a common method of training artificial neural networks so as to minimize the objective function. It is a supervised learning method. It requires a dataset of the desired output for many inputs, making up the training set. It is most useful for feed-forward networks (networks that have no feedback or simply, that have no connections). The term is an abbreviation for "backward propagation of errors". Back propagation requires that the activation function to be used by the artificial neurons (or "nodes") to be differentiable. For better understanding of the back propagation algorithm it can be divided into two phases: propagation and weight update.
Propagation
Each propagation involves the following steps: 1. Forward propagation of a training pattern's input through the neural network in order to generate the propagation's output activations. 2. Backward propagation of the propagation's output activations through the neural network using the training pattern's target in order to generate the deltas of all output and hidden neurons.
As the name implies, the errors propagate backwards from the output nodes to the inner nodes. Back propagation calculates the gradient of the error of the network regarding the network's modifiable weights. This gradient is almost always used in a simple stochastic gradient descent algorithm to find weights that minimize the error. Often the term "back propagation" is used in a more general sense, to refer to the entire procedure encompassing both the calculation of the gradient and its use in stochastic gradient descent. Back propagation usually allows quick convergence on satisfactory local minima for error in the kind of networks to which it is suited. Back propagation networks are necessarily multilayer perceptions (usually with one input, one hidden, and one output layer). In order for the hidden layer to serve any useful function, multilayer networks must have non-linear activation functions for the multiple layers: a multilayer network using only linear activation functions is equivalent to some single layer, linear network. Non-linear activation functions that are commonly used include the logistic function, the softmax function, and the Gaussian functions. It is also closely related to the GaussNewton algorithm.
RESILIENT PROPAGATION
Resilient Propagation is one of the best general purpose training methods for neural networks. It is not the best training method in every case, but in most cases it is. It can be used for feed forward neural networks and simple recurrent neural networks. Resilient propagation is a type of propagation training which is susceptible to the Flat Spot Problem for certain activation functions. It will typically outperform back propagation by a considerable factor. Additionally, it has no parameters that must be set. Back propagation requires that a learning rate and momentum value be specified. Finding an optimal learning rate and momentum value for back propagation can be difficult. This is not necessary in case of resilient propagation. The Resilient Propagation algorithm is one of the most popular adaptive learning rates training algorithms. It employs a sign-based scheme to eliminate harmful influences of derivatives magnitude on the weight updates, and is eminently suitable for applications where the gradient is numerically estimated or the error is noisy. It is easy to implement in hardware and is not susceptible to numerical problems. The idea behind it is that it has motivated the development of several variants with the aim to improve the convergence behavior and effectiveness of the original method. Thus hybrid learning schemes have been proposed to incorporate second derivative related information in it, which approximates the second derivative by onedimensional secant steps, and the Diagonal Estimation which directly computes the diagonal elements of the Hessian matrix. Also approaches inspired from global optimization theory have been developed to equip it with annealing strategies, such as the Simulated Annealing and the Restart mode Simulated Annealing in order to escape from shallow local minima. Recently, the Improved algorithm which applies a backtracking strategy (i.e. it decides whether to take back a step along a weight direction or not by means of a heuristic), has shown improved convergence speed when compared against existing variants, as well as other training methods. Relevant literature shows that its learning schemes exhibit fast convergence in empirical evaluations, but usually require introducing or even fine tuning additional heuristics. For example, annealing schedules require heuristics for the acceptance
probability and the visiting distribution, whilst second derivative methods employ heuristics in the various approximations of the second derivative. Moreover, literature shows a lack of theoretical results underpinning its modifications, particularly the firstorder methods. This is not surprising as heuristics make difficult to guarantee converge to a local minimiser of the error function when adaptive learning rates for each weight are used in calculating the weight updates.