Deep learning uses continuous piecewise linear (CPL) functions to model complex relationships in data. CPL functions are composed of multiple linear pieces joined continuously. They can approximate complex functions while remaining feasible to compute. Neural networks implement CPL functions using layers of linear and ReLU activation functions. Weights connecting the layers are optimized using gradient descent to minimize error on training examples, enabling the network to generalize to new data. This approach has achieved breakthrough results in tasks like image recognition compared to previous shallow learning methods.
By Gilbert Strang

Suppose we draw one of the digits 0, 1, ..., 9. How does a human recognize which digit it is? That neuroscience question is not answered here. How can a computer recognize which digit it is? This is a machine learning question. Probably both answers begin with the same idea: learn from examples.

So we start with M different images (the training set). An image is a set of p small pixels - or a vector v = (v_1, ..., v_p). The component v_i tells us the "grayscale" of the ith pixel in the image: how dark or light it is. We now have M images, each with p features: M vectors v in p-dimensional space. For every v in that training set, we know the digit it represents.

In a way, we know a function. We have M inputs in R^p, each with an output from 0 to 9. But we don't have a "rule." We are helpless with a new input. Machine learning proposes to create a rule that succeeds on (most of) the training images. But "succeed" means much more than that: the rule should give the correct digit for a much wider set of test images, taken from the same population. This essential requirement is called generalization.

What form shall the rule take? Here we meet the fundamental question. Our first answer might be: F(v) could be a linear function from R^p to R^10 (a 10 by p matrix). The 10 outputs would be probabilities of the numbers 0 to 9. We would have 10p entries and M training samples to get mostly right.
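To see what that first (purely linear) rule would look like in code, here is a minimal sketch in Python with NumPy. The pixel count, the matrix entries, and the sample image are random placeholders rather than trained values, and the argmax at the end simply picks the largest of the 10 outputs.

```python
import numpy as np

p = 784                              # pixels per image (e.g. 28 x 28); an illustrative choice
rng = np.random.default_rng(0)

A = rng.standard_normal((10, p))     # the 10 by p matrix of the proposed linear rule
v = rng.random(p)                    # one image as a vector of grayscale values in [0, 1)

scores = A @ v                       # 10 outputs, one per digit 0..9
print("predicted digit:", int(scores.argmax()))   # take the largest output as the answer
```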
The difficulty is that linearity is far too limited. Artistically, two 0's could make an 8. A 1 and a 0 could combine into a handwritten 9 or possibly a 6. Images don't add. Recognizing faces instead of numbers requires a great many pixels - and the input-output rule is nowhere near linear.

Artificial intelligence languished for a generation, waiting for new ideas. There is no claim that the absolute best class of functions has now been found. That class needs to allow a great many parameters (called weights). And it must remain feasible to compute all those weights (in a reasonable time) from knowledge of the training set.

The choice that has succeeded beyond expectation - and has transformed shallow learning into deep learning - is continuous piecewise linear (CPL) functions. Linear for simplicity, continuous to model an unknown but reasonable rule, and piecewise to achieve the nonlinearity that is an absolute requirement for real images and data.

This leaves the crucial question of computability. What parameters will quickly describe a large family of CPL functions? Linear finite elements start with a triangular mesh. But specifying many individual nodes in R^p is expensive. It will be better if those nodes are the intersections of a smaller number of lines (or hyperplanes). Please note that a regular grid is too simple.

Figure 1 is a first construction of a piecewise linear function of the data vector v. Choose a matrix A and vector b. Then set to 0 (this is the nonlinear step) all negative components of Av + b. Then multiply by a matrix C to produce the output w = F(v) = C(Av + b)_+. That vector (Av + b)_+ forms a "hidden layer" between the input v and the output w.

[Figure 1. Neural net construction of a piecewise linear function of the data vector v: the layers v, (Av + b)_+, and w = C(Av + b)_+, with pq + 2q = 20 weights and r(4, 3) = 15 linear pieces.]

The nonlinear function called ReLU(x) = x_+ = max(x, 0) was originally smoothed into a logistic curve like 1/(1 + e^{-x}). It was reasonable to think that continuous derivatives would help in optimizing the weights A, b, C. That proved to be wrong.

The graph of each component of (Av + b)_+ has two half-planes (one is flat, from the 0's where Av + b is negative). If A is q by p, the input space R^p is sliced by q hyperplanes into r pieces. We can count those pieces! This measures the "expressivity" of the overall function F(v). The formula from combinatorics is

r(q, p) = (q choose 0) + (q choose 1) + ... + (q choose p),

a sum of binomial coefficients. For the q = 4 planes and p = 3 dimensions of Figure 1, this gives r(4, 3) = 1 + 4 + 6 + 4 = 15 linear pieces. This number gives an impression of the graph of F. But our function is not yet sufficiently expressive, and one more idea is needed.
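Here is a minimal sketch of the Figure 1 construction and of the combinatorial count of pieces, assuming the small illustrative sizes q = 4 and p = 3 from the figure; A, b, and C are random placeholders rather than learned weights.

```python
import numpy as np
from math import comb

def relu(x):
    """ReLU(x) = max(x, 0), applied componentwise: the nonlinear step."""
    return np.maximum(x, 0.0)

def F(v, A, b, C):
    """One hidden layer: w = F(v) = C (Av + b)_+, a continuous piecewise linear map."""
    return C @ relu(A @ v + b)

def r(q, p):
    """Number of linear pieces created when q hyperplanes slice R^p."""
    return sum(comb(q, k) for k in range(p + 1))

p, q = 3, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((q, p))     # q by p
b = rng.standard_normal(q)
C = rng.standard_normal((1, q))     # one output component

v = rng.standard_normal(p)
print("F(v) =", F(v, A, b, C))
print("pieces r(4, 3) =", r(4, 3))           # 1 + 4 + 6 + 4 = 15
print("weights pq + 2q =", p * q + 2 * q)    # 12 + 8 = 20
```

The printed piece count r(4, 3) = 15 and weight count pq + 2q = 20 match the numbers in the caption of Figure 1.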
Here is the indispensable ingredient in the learning function F. The best way to create complex functions from simple functions is by composition. Each F_i is linear (or affine) followed by the nonlinear ReLU: F_i(v) = (A_i v + b_i)_+. Their composition is F(v) = F_L(F_{L-1}(... F_2(F_1(v)))). We now have L - 1 hidden layers before the final output layer. The network becomes deeper as L increases. That depth can grow quickly for convolutional nets (with banded Toeplitz matrices A).

The great optimization problem of deep learning is to compute weights A_i and b_i that will make the outputs F(v) nearly correct - close to the digit w(v) that the image v represents. This problem of minimizing some measure of F(v) - w(v) is solved by following a gradient downhill. The gradient of this complicated function is computed by backpropagation: the workhorse of deep learning that executes the chain rule.

A historic competition in 2012 was to identify the 1.2 million images collected in ImageNet. The breakthrough neural network AlexNet had 60 million weights in eight layers. Its accuracy (after five days of stochastic gradient descent) cut in half the next best error rate. Deep learning had arrived.

Our goal here was to identify continuous piecewise linear functions as powerful approximators. That family is also convenient - closed under addition and maximization and composition. The magic is that the learning function F(A_i, b_i, v) gives accurate results on images v that F has never seen.

This article is published with very light edits.

Gilbert Strang teaches linear algebra at the Massachusetts Institute of Technology. A description of the January 2019 textbook Linear Algebra and Learning from Data is available at math.mit.edu/learningfromdata.
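To make the composition F(v) = F_L(... F_2(F_1(v))) and the "follow a gradient downhill" loop concrete, here is a tiny self-contained sketch of gradient descent with hand-coded backpropagation for a small ReLU network. The layer sizes, learning rate, squared-error loss, and random training example are assumptions for illustration only; they are not the setup of AlexNet or of any particular experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny deep net: v -> (A1 v + b1)_+ -> (A2 . + b2)_+ -> A3 . + b3 = F(v)
sizes = [5, 8, 8, 1]                           # p = 5 inputs, two hidden layers, one output
A = [0.5 * rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]

def forward(v):
    """Composition F(v) = F_L(...F_2(F_1(v))); keep each layer's values for backprop."""
    layers, x = [], v
    for i, (Ai, bi) in enumerate(zip(A, b)):
        z = Ai @ x + bi
        x = np.maximum(z, 0.0) if i < len(A) - 1 else z    # ReLU on hidden layers only
        layers.append((z, x))
    return x, layers

v_train = rng.standard_normal(sizes[0])        # one training input (placeholder)
w_target = np.array([1.0])                     # its known target value w(v) (placeholder)

lr = 0.05
for step in range(200):
    out, layers = forward(v_train)
    loss = 0.5 * np.sum((out - w_target) ** 2)             # a measure of F(v) - w(v)

    # Backpropagation: the chain rule, applied layer by layer from the output backward.
    grad = out - w_target                                   # d(loss) / d(output)
    for i in reversed(range(len(A))):
        z, _ = layers[i]
        if i < len(A) - 1:
            grad = grad * (z > 0)                           # derivative of the ReLU step
        x_prev = v_train if i == 0 else layers[i - 1][1]
        gA, gb = np.outer(grad, x_prev), grad               # gradients for A_i and b_i
        grad = A[i].T @ grad                                # pass the gradient to the layer below
        A[i] -= lr * gA                                     # follow the gradient downhill
        b[i] -= lr * gb

print("loss after training:", float(loss))
```

Real training replaces this single example with minibatches of many images and a classification loss, and computes the same chain-rule gradients automatically, but the structure of the loop is the one described above.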