To minimize this distance, the perceptron uses stochastic gradient descent (SGD) as its optimization algorithm. If the data is linearly separable, SGD is guaranteed to converge in a finite number of
steps. The last piece the perceptron needs is the activation function, the function that determines whether the
neuron will fire or not. Early perceptron models used the sigmoid function, and just by looking at its shape,
it makes a lot of sense! The sigmoid function maps any real input to a value between 0 and 1 and
introduces non-linearity. The neuron can receive negative numbers as input and still
produce an output between 0 and 1.
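As a rough sketch of that idea, here is a single sigmoid neuron in Python; the weights, bias, and inputs are made-up example values for illustration, not parameters from any particular model:

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Made-up example weights, bias, and inputs for a single neuron
weights = np.array([0.4, -0.6, 0.2])
bias = 0.1
x = np.array([1.0, -2.0, 3.0])   # inputs can be negative

output = sigmoid(np.dot(weights, x) + bias)
print(output)   # ~0.91, a value between 0 and 1
```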
But if you look at Deep Learning papers and algorithms from the last decade, you’ll see that most of
them use the Rectified Linear Unit (ReLU) as the neuron’s activation function. ReLU
became more widely adopted because it allows better optimization with SGD, is more efficient to compute, and
is scale-invariant, meaning its characteristics are not affected by the scale of the input. The neuron
receives inputs and picks an initial set of weights at random. These are combined in a weighted sum, and
then ReLU, the activation function, determines the value of the output.
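Below is a minimal sketch of the same neuron with a ReLU activation, with randomly drawn weights standing in for the initial weights; the seed, the 0.1 initialization scale, and the input values are arbitrary choices for illustration:

```python
import numpy as np

def relu(z):
    # Pass positive values through unchanged, clamp negatives to zero
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# Initial weights picked at random (the 0.1 scale is an arbitrary choice)
weights = rng.normal(scale=0.1, size=3)
bias = 0.0
x = np.array([1.0, -2.0, 3.0])

# Weighted sum of the inputs, then ReLU determines the output
pre_activation = np.dot(weights, x) + bias
output = relu(pre_activation)
print(output)

# Scale-invariance (positive homogeneity): relu(c * z) == c * relu(z) for c > 0
print(relu(2.0 * pre_activation), 2.0 * output)
```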