Annotated History of Modern AI and Deep Neural Networks
Abstract. Machine learning (ML) is the science of credit assignment: finding patterns in
observations that predict the consequences of actions and help to improve future performance.
Credit assignment is also required for human understanding of how the world works, not only
for individuals navigating daily life, but also for academic professionals like historians who
interpret the present in light of past events. Here I focus on the history of modern artificial
intelligence (AI) which is dominated by artificial neural networks (NNs) and deep learning,[DL1-4]
both conceptually closer to the old field of cybernetics than to what's been called AI since 1956
(e.g., expert systems and logic programming). A modern history of AI will emphasize
breakthroughs outside of the focus of traditional AI text books, in particular, mathematical
foundations of today's NNs such as the chain rule (1676), the first NNs (linear regression, circa
1800), and the first working deep learners (1965-). From the perspective of 2022, I provide a
timeline of the—in hindsight—most important relevant events in the history of NNs, deep
learning, AI, computer science, and mathematics in general, crediting those who laid
foundations of the field. The text contains numerous hyperlinks to relevant overview sites from
my AI Blog. It also debunks certain popular but misleading historic accounts of deep learning,
and supplements my previous deep learning survey[DL1] which provides hundreds of additional
references. Finally, to round it off, I'll put things in a broader historic context spanning the time
since the Big Bang until when the universe will be many times older than it is now. The present
piece is also the draft of a chapter of my upcoming AI book.
Disclaimer. Some say a history of deep learning should not be written by someone who has
helped to shape it—"you are part of history not a historian."[CONN21] I cannot subscribe to that
point of view. Since I seem to know more about deep learning history than others,[S20][DL3,DL3a][T22][DL1-2] I consider it my duty to document and promote this knowledge, even if that seems to
imply a conflict of interest, as it means prominently mentioning my own team's work, because
(as of 2022) the most cited NNs are based on it.[MOST] Future AI historians may correct any era-
specific potential bias.
Table of Contents
Sec. 1: Introduction
Sec. 2: 1676: The Chain Rule For Backward Credit Assignment
Sec. 3: Circa 1800: First Neural Net (NN) / Linear Regression / Shallow Learning
Sec. 4: 1920-1925: First Recurrent NN (RNN) Architecture. ~1972: First Learning RNNs
Sec. 5: 1958: Multilayer Feedforward NN (without Deep Learning)
Sec. 6: 1965: First Deep Learning
Sec. 7: 1967-68: Deep Learning by Stochastic Gradient Descent
Sec. 8: 1970: Backpropagation. 1982: For NNs. 1960: Precursor.
Sec. 9: 1979: First Deep Convolutional NN (1969: Rectified Linear Units)
Sec. 10: 1980s-90s: Graph NNs / Stochastic Delta Rule (Dropout) / More RNNs / Etc
Sec. 11: Feb 1990: Generative Adversarial Networks / Artificial Curiosity / NN Online Planners
Sec. 12: April 1990: NNs Learn to Generate Subgoals / Work on Command
Sec. 13: March 1991: NNs Learn to Program NNs. Transformers with Linearized Self-Attention
Sec. 14: April 1991: Deep Learning by Self-Supervised Pre-Training. Distilling NNs
Sec. 15: June 1991: Fundamental Deep Learning Problem: Vanishing/Exploding Gradients
Sec. 16: June 1991: Roots of Long Short-Term Memory / Highway Nets / ResNets
Sec. 17: 1980s-: NNs for Learning to Act Without a Teacher
Sec. 18: It's the Hardware, Stupid!
Sec. 19: But Don't Neglect the Theory of AI (Since 1931) and Computer Science
Sec. 20: The Broader Historic Context from Big Bang to Far Future
Sec. 21: Acknowledgments
Sec. 22: 555+ Partially Annotated References (many more in the award-winning survey[DL1])
Introduction
Over time, certain historic events have become more important in the eyes of certain
beholders. For example, the Big Bang of 13.8 billion years ago is now widely considered an
essential moment in the history of everything. Until a few decades ago, however, it remained completely unknown to earthlings, who for a long time entertained quite
erroneous ideas about the origins of the universe (see the final section for more on the world's
history). Currently accepted histories of many more limited subjects are results of similarly
radical revisions. Here I will focus on the history of artificial intelligence (AI), which also isn't
quite what it used to be.
A history of AI written in the 1980s would have emphasized topics such as theorem proving,[GOD][GOD34][ZU48][NS56] logic programming, expert systems, and heuristic search.[FEI63,83][LEN83] This
would be in line with topics of a 1956 conference in Dartmouth, where the term "AI" was coined
by John McCarthy as a way of describing an old area of research seeing renewed interest.
Practical AI dates back at least to 1914, when Leonardo Torres y Quevedo (see below) built
the first working chess end game player[BRU1-4] (back then chess was considered an activity restricted to the realm of intelligent creatures). AI theory dates back at least to 1931-34 when
Kurt Gödel (see below) identified fundamental limits of any type of computation-based AI.[GOD][BIB3][GOD21,a,b]
A history of AI written in the early 2000s would have put more emphasis on topics such as
support vector machines and kernel methods,[SVM1-4] Bayesian (actually Laplacian or possibly
Saundersonian[STI83-85]) reasoning[BAY1-8][FI22] and other concepts of probability theory and
statistics,[MM1-5][NIL98][RUS95] decision trees,e.g.,[MIT97] ensemble methods,[ENS1-4] swarm intelligence,[SW1]
and evolutionary computation.[EVO1-7]([TUR1],unpublished) Why? Because back then such techniques
drove many successful AI applications.
A history of AI written in the 2020s must emphasize concepts such as the even older chain
rule[LEI07] and deep nonlinear artificial neural networks (NNs) trained by gradient descent,[GD'] in
particular, feedback-based recurrent networks, which are general computers whose programs
are weight matrices.[AC90] Why? Because many of the most famous and most commercially important recent AI applications depend on them.[DL4]
Such NN concepts are actually conceptually close to topics of the MACY conferences (1946-
1953)[MACY51] and the 1951 Paris conference on calculating machines and human thought, now
often viewed as the first conference on AI.[AI51][BRO21][BRU4] However, before 1956, much of what's
now called AI was still called cybernetics, with a focus very much in line with modern AI based
on "deep learning" with NNs.[DL1-2][DEC]
Some of the past NN research was inspired by the human brain, which has on the order of 100
billion neurons, each connected to 10,000 other neurons on average. Some are input neurons
that feed the rest with data (sound, vision, tactile, pain, hunger). Others are output neurons
that control muscles. Most neurons are hidden in between, where thinking takes place. Your
brain apparently learns by changing the strengths or weights of the connections, which
determine how strongly neurons influence each other, and which seem to encode all your
lifelong experience. Similar for our artificial NNs, which learn better than previous methods to recognize speech or handwriting or video, minimize pain, maximize pleasure, drive cars, etc.
[MIR](Sec. 0)[DL1-4]
How can NNs learn all of this? In what follows, I shall highlight essential historic contributions
that made this possible. Since virtually all of the fundamental concepts of modern AI were
derived in previous millennia, the section titles below emphasize developments only up to the
year 2000. However, many of the sections mention the later impact of this work in the new
millennium, which brought numerous improvements in hardware and software, a bit like the
20th century brought numerous improvements of the cars invented in the 19th.
The present piece also debunks a frequently repeated, misleading "history of deep learning"[S20][DL3,3a] which ignores most of the pioneering work mentioned below.[T22] See Footnote 6. The title
image of the present article is a reaction to an erroneous piece of common knowledge which
says[T19] that the use of NNs "as a tool to help computers recognize patterns and simulate
human intelligence had been introduced in the 1980s," although such NNs appeared long
before the 1980s.[T22] Ensuring proper credit assignment in all of science is of great importance
to me—just as it should be to all scientists—and I encourage an interested reader to also take
a look at some of my letters on this in Science and Nature, e.g., on the history of aviation,[NASC1-2] the telephone,[NASC3] the computer,[NASC4-7] resilient robots,[NASC8] and scientists of the 19th century.[NASC9]
Finally, to round it off, I'll put things in a broader historic context spanning the time since the
Big Bang until when the universe will be many times older than it is now.
paved the way for infinitesimals and published special cases of calculus, e.g., for spheres and
parabola segments, building on even earlier work in ancient Greece. Fundamental work on
calculus was also conducted in the 14th century by Madhava of Sangamagrama and
colleagues of the Indian Kerala school.[MAD86-05]
Footnote 2. Remarkably, Leibniz (1646-1716, aka "the world's first computer scientist"[LA14]) also
laid foundations of modern computer science. He designed the first machine that could
perform all four arithmetic operations (1673), and the first with an internal memory.[BL16] He
described the principles of binary computers (1679)[L79][L03][LA14][HO66][LEI21,a,b] employed by virtually
all modern machines. His formal Algebra of Thought (1686)[L86][WI48] was deductively
equivalent[LE18] to the much later Boolean Algebra (1847).[BOO] His Characteristica Universalis &
Calculus Ratiocinator aimed at answering all possible questions through computation;[WI48] his
"Calculemus!" is one of the defining quotes of the age of enlightenment. It is quite remarkable
that he is also responsible for the chain rule, foundation of "modern" deep learning, a key
subfield of modern computer science.
Footnote 3. Some claim that the backpropagation algorithm (discussed further down; now
widely used to train deep NNs) is just the chain rule of Leibniz (1676) & L'Hopital (1696).[CONN21]
No, it is the efficient way of applying the chain rule to big networks with differentiable nodes
(there are also many inefficient ways of doing this).[T22] It was not published until 1970, as
discussed below.[BP1,4,5]
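To illustrate Footnote 3 with a minimal worked example of my own (not part of the original footnote): for a composition y = f(g(h(x))), the chain rule gives dy/dx = f'(g(h(x))) · g'(h(x)) · h'(x). Backpropagation is the observation that in a network with many weights and shared intermediate nodes, such products should be accumulated from the output backwards, reusing each intermediate derivative exactly once; this yields the partial derivatives of the output with respect to all weights at roughly the cost of a single forward pass, instead of one pass per weight.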
This NN from over 2 centuries ago has two layers: an input layer with
several input units, and an output layer. For simplicity, let's assume the
latter consists of a single output unit. Each input unit can hold a real-
valued number and is connected to the output by a connection with a
real-valued weight. The NN's output is the sum of the products of the
inputs and their weights. Given a training set of input vectors and
desired target values for each of them, the NN weights are adjusted
such that the sum of the squared errors between the NN outputs and
the corresponding targets is minimized.
Of course, back then this was not called an NN. It was called the method of least squares, also
widely known as linear regression. But it is mathematically identical to today's linear NNs:
same basic algorithm, same error function, same adaptive parameters/weights. Such simple
NNs perform "shallow learning" (as opposed to "deep learning" with many nonlinear layers). In
fact, many NN courses start by introducing this method, then move on to more complex,
deeper NNs.
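To make this concrete, here is a minimal sketch (my own toy example with invented data, not part of the original text) of the circa-1800 method viewed as a linear NN: the output is the weighted sum of the inputs, and the weights are chosen to minimize the sum of squared errors.

import numpy as np

# Toy training set: input vectors (rows of X) and desired targets y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 examples, 3 input units
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)   # noisy targets

# The "NN output" is the weighted sum of the inputs.
# Least squares / linear regression: choose the weights that minimize
# the sum of squared errors between outputs and targets.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

print("learned weights:", w)                  # close to true_w
print("sum of squared errors:", np.sum((X @ w - y) ** 2))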
Footnote 5. "Shallow learning" with NNs experienced a new wave of popularity in the late
1950s. Rosenblatt's perceptron (1958)[R58] combined a linear NN as above with an output
threshold function to obtain a pattern classifier (compare his more advanced work on multi-
layer networks discussed below). Joseph[R61] mentions an even earlier perceptron-like device
by Farley & Clark. Widrow & Hoff's similar Adaline learned in 1962.[WID62]
Like the human brain, but unlike the more limited feedforward NNs
(FNNs), recurrent NNs (RNNs) have feedback connections, such that
one can follow directed connections from certain internal nodes to
others and eventually end up where one started. This is essential for
implementing a memory of past events during sequence processing.
In 1972, Shun-Ichi Amari made the Lenz-Ising recurrent architecture adaptive such that it could
learn to associate input patterns with output patterns by changing its connection weights.[AMH1]
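The following toy sketch (my simplified illustration in the spirit of such adaptive recurrent nets, not Amari's exact 1972 formulation) shows how Hebbian-style weight changes let a recurrent net of threshold units store patterns and restore them from corrupted inputs.

import numpy as np

def train_hebbian(patterns):
    # Outer-product (Hebbian) weight changes store the patterns
    # in the symmetric connection weights of a recurrent net.
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:            # each p has entries in {-1, +1}
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)
    return W / len(patterns)

def recall(W, x, steps=10):
    # Recurrent dynamics: repeatedly follow the feedback connections.
    for _ in range(steps):
        x = np.sign(W @ x)
        x[x == 0] = 1
    return x

patterns = np.array([[1, -1, 1, -1, 1, 1], [-1, -1, 1, 1, -1, 1]])
W = train_hebbian(patterns)
noisy = patterns[0].copy(); noisy[0] *= -1    # corrupt one bit
print(recall(W, noisy))                       # typically recovers patterns[0]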
Today, the most popular RNN is the Long Short-Term Memory (LSTM) mentioned below, which
has become the most cited NN of the 20th century.[MOST]
In 1958, Frank Rosenblatt not only combined linear NNs and threshold
functions (see the section on shallow learning since 1800), he also had
more interesting, deeper multilayer perceptrons (MLPs).[R58] His MLPs
had a non-learning first layer with randomized weights and an adaptive
output layer. Although this was not yet deep learning, because only the
last layer learned,[DL1] Rosenblatt basically had what much later was
rebranded as Extreme Learning Machines (ELMs) without proper
attribution.[ELM1-2][CONN21][T22]
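As a toy sketch of this architecture (my own illustration with invented data, in the later ELM style rather than Rosenblatt's exact 1958 model): a fixed random hidden layer produces nonlinear features, and only the output layer is adapted, here by least squares.

import numpy as np

rng = np.random.default_rng(1)

# Toy data: two classes in 2D with XOR-like labels (invented for this example).
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

# First layer: fixed random weights, not trained.
W1 = rng.normal(size=(2, 50))
b1 = rng.normal(size=50)
H = np.tanh(X @ W1 + b1)                      # random nonlinear features

# Only the output layer is adapted, here by least squares.
w2, *_ = np.linalg.lstsq(H, y, rcond=None)
acc = np.mean(((H @ w2) > 0.5) == (y > 0.5))
print("training accuracy:", acc)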
Today, the most popular FNN is a version of the LSTM-based Highway Net (mentioned below)
called ResNet,[HW1-3] which has become the most cited NN of the 21st century.[MOST]
Like later deep NNs, Ivakhnenko's nets learned to create hierarchical, distributed, internal
representations of incoming data.
He did not call them deep learning NNs, but that's what they were. In fact, the term "deep learning" itself was first introduced to Machine Learning much later by Dechter (1986), and to NNs by Aizenberg et al. (2000).[DL2] (Margin note: our 2005 paper on deep learning[DL6,6a] was the
first machine learning publication with the word combination "learn deep" in the title.[T22])
Ivakhnenko and Lapa (1965, see above) trained their deep networks
layer by layer. In 1967, however, Shun-Ichi Amari suggested training MLPs with many layers in non-incremental end-to-end fashion from
scratch by stochastic gradient descent (SGD),[GD1] a method proposed in
1951 by Robbins & Monro.[STO51-52]
See also Iakov Zalmanovich Tsypkin's even earlier work on gradient descent-based on-line
learning for non-linear systems.[GDa-b]
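A minimal sketch of SGD in the Robbins-Monro spirit (my toy example, not the 1967 MLP setup): after each randomly drawn training example, the parameters are nudged against the gradient of that single example's error, with a decreasing step size.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
for t in range(1, 5001):
    i = rng.integers(len(X))                  # draw one random example
    err = X[i] @ w - y[i]                     # its prediction error
    grad = err * X[i]                         # gradient of 0.5*err**2 w.r.t. w
    w -= (0.5 / t) * grad                     # decreasing step size (Robbins-Monro)

print(w)                                      # approaches w_true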
In 1970, Seppo Linnainmaa was the first to publish what's now known as backpropagation, the
famous algorithm for credit assignment in networks of differentiable nodes,[BP1,4,5] also known as
"reverse mode of automatic differentiation." It is now the foundation of widely used NN
software packages such as PyTorch and Google's Tensorflow.
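Here is a minimal sketch of the idea (my own toy example with invented data): the gradients for all weights of a small two-layer net are obtained in a single backward sweep that applies the chain rule from the output back towards the inputs, reusing each intermediate quantity once.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(64, 4))                  # toy inputs
y = np.tanh(X @ rng.normal(size=(4, 1)))      # toy targets a small net can fit

W1 = 0.5 * rng.normal(size=(4, 8))            # hidden layer weights
W2 = 0.5 * rng.normal(size=(8, 1))            # output layer weights
for step in range(1000):
    # Forward pass through the differentiable nodes.
    h = np.tanh(X @ W1)
    out = h @ W2
    loss = np.mean((out - y) ** 2)
    # Backward pass: the chain rule applied once, from the output
    # back towards the inputs.
    d_out = 2 * (out - y) / len(X)            # dLoss/d_out
    dW2 = h.T @ d_out                         # dLoss/dW2
    d_h = d_out @ W2.T                        # dLoss/dh
    d_hpre = d_h * (1 - h ** 2)               # through the tanh nodes
    dW1 = X.T @ d_hpre                        # dLoss/dW1
    W1 -= 0.1 * dW1
    W2 -= 0.1 * dW2
print(loss)                                   # decreases as training proceeds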
By 1985, compute had become about 1,000 times cheaper than in 1970, and the first desktop
computers had just become accessible in wealthier academic labs. An experimental analysis of
the known method[BP1-2] by David E. Rumelhart et al. then demonstrated that backpropagation
can yield useful internal representations in hidden layers of NNs.[RUM] At least for supervised
learning, backpropagation is generally more efficient than Amari's above-mentioned deep
learning through the more general SGD method (1967), which learned useful internal
representations in NNs about 2 decades earlier.[GD1-2a]
It took 4 decades until the backpropagation method of 1970[BP1-2] got widely accepted as a
training method for deep NNs. Before 2010, many thought that the training of NNs with many
layers requires unsupervised pre-training, a methodology introduced by myself in 1991[UN][UN0-3]
(see below), and later championed by others (2006).[UN4] In fact, it was claimed[VID1] that "nobody
in their right mind would ever suggest" to apply plain backpropagation to deep NNs. However,
in 2010, our team with my outstanding Romanian postdoc Dan Ciresan[MLP1-2] showed that deep
FNNs can be trained by plain backpropagation and do not at all require unsupervised pre-
training for important applications.[MLP2]
Our system set a new performance record[MLP1] on the back then famous and widely used
image recognition benchmark called MNIST. This was achieved by greatly accelerating deep
FNNs on highly parallel graphics processing units called GPUs (as first done for shallow NNs
with few layers by Jung & Oh in 2004[GPUNN]). A reviewer called this a "wake-up call to the
machine learning community." Today, everybody in the field is pursuing this approach.
Footnote 6. Unfortunately, several authors who re-published backpropagation in the 1980s did
not cite the prior art—not even in later surveys.[T22] In fact, as mentioned in the introduction,
there is a broader, frequently repeated, misleading "history of deep learning"[S20] which ignores
most of the pioneering work mentioned in the previous sections.[T22][DLC] This "alternative
history" essentially goes like this: "In 1969, Minsky & Papert[M69] showed that shallow NNs
without hidden layers are very limited and the field was abandoned until a new generation of
neural network researchers took a fresh look at the problem in the 1980s."[S20] However, the
1969 book[M69] addressed a "problem" of Gauss & Legendre's shallow learning (circa 1800)[DL1-2]
that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning
method,[DEEP1-2][DL2] and then also by Amari's SGD for MLPs.[GD1-2] Minsky neither cited this work
nor corrected his book later.[HIN](Sec. I)[T22] And even recent papers promulgate this revisionist
narrative of deep learning, apparently to glorify later contributions of their authors (such as the
Boltzmann machine[BM][HIN][SK75][G63][T22]) without relating them to the original work,[DLC][S20][T22]
although the true history is well-known. Deep learning research was alive and kicking in the 1970s.
Since 1989, Yann LeCun's team has contributed improvements of CNNs, especially for
images.[CNN2,4][T22] Baldi and Chauvin (1993) had the first application of CNNs with
backpropagation to biomedical/biometric images.[BA93]
CNNs became more popular in the ML community much later in 2011 when my own team
greatly sped up the training of deep CNNs (Dan Ciresan et al., 2011).[GPUCNN1,3,5] Our fast GPU-
based[GPUNN][GPUCNN5] CNN of 2011[GPUCNN1] known as DanNet[DAN,DAN1][R6] was a practical
breakthrough, much deeper and faster than earlier GPU-accelerated CNNs of 2006.[GPUCNN] In
2011, DanNet became the first pure deep CNN to win computer vision contests.[GPUCNN2-3,5]
For a while, DanNet enjoyed a monopoly. From 2011 to 2012 it won every contest it entered,
winning four of them in a row (15 May 2011, 6 Aug 2011, 1 Mar 2012, 10 Sep 2012).[GPUCNN5] In
particular, at IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved
the first superhuman visual pattern recognition[DAN1] in an international contest. DanNet was
also the first deep CNN to win: a Chinese handwriting contest (ICDAR 2011), an image
segmentation contest (ISBI, May 2012), a contest on object detection in large images (ICPR,
10 Sept 2012), and—at the same time—a medical imaging contest on cancer detection.[GPUCNN8]
In 2010, we introduced DanNet to Arcelor Mittal, the world's largest steel producer, and
were able to greatly improve steel defect detection.[ST] To the best of my knowledge, this was
the first deep learning breakthrough in heavy industry. In July 2012, our CVPR paper on
DanNet[GPUCNN3] hit the computer vision community. 5 months later, the similar GPU-accelerated
AlexNet won the ImageNet[IM09] 2012 contest.[GPUCNN4-5][R6] Our CNN image scanners were 1000
times faster than previous methods.[SCAN] This attracted tremendous interest from the
healthcare industry. Today IBM, Siemens, Google and many startups are pursuing this
approach. The VGG network (ImageNet 2014 winner)[GPUCNN9] and other highly cited CNNs[RCNN1-3] further extended the DanNet of 2011.[MIR](Sec. 19)[MOST]
ResNet, the ImageNet 2015 winner[HW2] (Dec 2015) and currently the most cited NN,[MOST] is a
version (with open gates) of our earlier Highway Net (May 2015).[HW1-3][R5] The Highway Net (see
below) is actually the feedforward net version of our vanilla LSTM (see below).[LSTM2] It was the
first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a
few tens of layers).
NNs with rapidly changing "fast weights" were introduced by v.d. Malsburg (1981) and others.[FAST,a,b][T22] Deep learning architectures that can manipulate structured data such as graphs were proposed in 1987 by Pollack[PO87-90] and extended/improved by Sperduti, Goller, and Küchler in the early 1990s.[SP93-97][GOL][KU][T22] See also our graph NN-like, Transformer-like Fast Weight Programmers of 1991,[FWP0-1][FWP6][FWP] which learn to continually rewrite mappings from
inputs to outputs (addressed below), and the work of Baldi and colleagues.[BA96-03] Today, graph
NNs are used in numerous applications.
The 80s and 90s also saw various proposals of biologically more plausible deep learning
algorithms that—unlike backpropagation—are local in space and time.[BB2][NAN1-4][NHE][HEL] See
overviews[MIR](Sec. 15, Sec. 17) and recent renewed interest in such methods.[NAN5][FWPMETA6][HIN22]
In 1990, Hanson introduced the Stochastic Delta Rule, a stochastic way of training NNs by
backpropagation. Decades later, a version of this became popular under the moniker
"dropout."[Drop1-4][GPUCNN4]
Many additional papers on NNs (including RNNs) were published in the 1980s and 90s—see
the numerous references in the 2015 survey.[DL1] Here, however, we mostly limit ourselves to
the—in hindsight—most essential ones, given the present (ephemeral?) perspective of 2022.
Generative Adversarial Networks (GANs) have become very popular.[MOST] They were first
published in 1990 in Munich under the moniker Artificial Curiosity.[AC90-20][GAN1] Two dueling NNs
(a probabilistic generator and a predictor) are trying to maximize each other's loss in a minimax
game.[AC](Sec. 1) The generator (called the controller) generates probabilistic outputs (using
stochastic units[AC90] like in the much later StyleGANs[GAN2]). The predictor (called the world
model) sees the outputs of the controller and predicts environmental reactions to them. Using
gradient descent, the predictor NN minimizes its error, while the generator NN tries to make
outputs that maximize this error: one net's loss is the other net's gain.[AC90] (The world model
can also be used for continual online action planning.[AC90][PLAN2-3][PLAN])
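A toy sketch of this two-network minimax principle (my own illustration, not the 1990 system; the environment function here is an invented stand-in): the predictor is trained to minimize its prediction error, while the generator is trained to produce outputs that maximize that same error.

import torch, torch.nn as nn

torch.manual_seed(0)
env = lambda a: torch.sin(3 * a)                  # toy "environmental reaction"

gen = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 1))   # controller
pred = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))  # world model
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-2)
opt_p = torch.optim.Adam(pred.parameters(), lr=1e-2)

for step in range(2000):
    z = torch.randn(32, 4)                        # noise driving the generator
    a = gen(z)                                    # generated outputs
    reaction = env(a)                             # environmental reaction

    # The predictor (world model) minimizes its prediction error ...
    p_loss = ((pred(a.detach()) - reaction.detach()) ** 2).mean()
    opt_p.zero_grad(); p_loss.backward(); opt_p.step()

    # ... while the generator tries to maximize that same error:
    # one net's loss is the other net's gain.
    g_loss = -((pred(a) - env(a)) ** 2).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()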
4 years before a 2014 paper on GANs,[GAN1] my well-known 2010 survey[AC10] summarised the
generative adversarial NNs of 1990 as follows: a "neural network as a predictive world model
is used to maximize the controller's intrinsic reward, which is proportional to the model's
prediction errors" (which are minimized).
The 2014 GANs are an instance of this where the trials are very short (like in bandit problems)
and the environment simply returns 1 or 0 depending on whether the controller's (or
generator's) output is in a given set.[AC20][AC][T22](Sec. XVII)
Other early adversarial machine learning settings[S59][H90] were very different—they neither
involved unsupervised NNs nor were about modeling data nor used gradient descent.[AC20]
The 1990 principle has been widely used for exploration in Reinforcement Learning[SIN5][OUD13][PAT17][BUR18] and for synthesis of realistic images,[GAN1,2] although the latter domain was recently
taken over by Rombach et al.'s Latent Diffusion, another method published in Munich,[DIF1]
building on Jarzynski's earlier work in physics from the previous millennium[DIF2] and more
recent papers.[DIF3-5]
In 1991, I published yet another ML method based on two adversarial NNs called Predictability
Minimization for creating disentangled representations of partially redundant data, applied to
images in 1996.[PM0-2][AC20][R2][MIR](Sec. 7)
Most NNs of recent centuries were dedicated to simple pattern recognition, not to high-level
reasoning, which is now considered a remaining grand challenge.[LEC] The early 1990s,
however, saw first exceptions: NNs that learn to decompose complex spatio-temporal
observation sequences into compact but meaningful chunks[UN0-3] (see further below), and NN-
based planners of hierarchical action sequences for compositional learning,[HRL0] as discussed
next. This work injected concepts of traditional "symbolic" hierarchical AI[NS59][FU77] into end-to-
end differentiable "sub-symbolic" NNs.
In 1990, our NNs learned to generate hierarchical action plans with end-to-end differentiable
NN-based subgoal generators for Hierarchical Reinforcement Learning (HRL).[HRL0] Soon
afterwards, this was also done with recurrent NNs that learn to generate sequences of
subgoals.[HRL1-2][PHD][MIR](Sec. 10) An RL machine gets extra command inputs of the form (start, goal).
An evaluator NN learns to predict the current rewards/costs of going from start to goal. An
(R)NN-based subgoal generator also sees (start, goal), and uses (copies of) the evaluator NN
to learn by gradient descent a sequence of cost-minimising intermediate subgoals. The RL
machine tries to use such subgoal sequences to achieve final goals. The system is learning
action plans at multiple levels of abstraction and multiple time scales and solves (at least in
principle) what recently (2022) has been called an "open problem."[LEC]
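A toy sketch of the core mechanism (my own illustration, not the original 1990 architecture; the evaluator here is an invented differentiable stand-in rather than a learned NN): a subgoal is refined by gradient descent so that the predicted cost of start-to-subgoal plus subgoal-to-goal is minimized.

import torch

torch.manual_seed(1)

# Differentiable evaluator: a stand-in that "knows" the cost of going from s to g
# (in the 1990-style system this would itself be a learned evaluator NN).
def cost(s, g):
    detour = torch.relu(1.0 - g[..., 1])    # toy penalty on the endpoint
    return ((g - s) ** 2).sum(-1) + detour

start = torch.tensor([0.0, 0.0])
goal = torch.tensor([4.0, 0.0])

# The subgoal is a free variable, refined by gradient descent through
# (copies of) the evaluator, as in the text above.
subgoal = torch.tensor([2.0, 0.0], requires_grad=True)
opt = torch.optim.SGD([subgoal], lr=0.05)
for _ in range(200):
    total = cost(start, subgoal) + cost(subgoal, goal)
    opt.zero_grad(); total.backward(); opt.step()

print(subgoal.detach())   # a cost-minimizing intermediate waypoint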
Compare other NNs that have "worked on command" since April 1990, in particular, for
learning selective attention,[ATT0-3] artificial curiosity and self-invented problems,[PP][PPa,1,2][AC]
upside-down reinforcement learning[UDRL1-2] and its generalizations.[GGP]
Recently, Transformers[TR1] have been all the rage, e.g., generating human-sounding texts.[GPT3]
Transformers with "linearized self-attention"[TR5-6] were first published in March 1991[FWP0-1][FWP6]
[FWP]
(apart from normalisation—see tweet of 2022 for 30-year anniversary). These so-called
"Fast Weight Programmers" or "Fast Weight Controllers"[FWP0-1] separated storage and control
like in traditional computers, but in an end-to-end-differentiable, adaptive, fully neural way
(rather than in a hybrid fashion[PDA1-2][DNC]). The "self-attention" in standard Transformers[TR1-4]
combines this with a projection and softmax (using attention terminology like the one I
introduced in 1993[ATT][FWP2][R4]).
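A minimal sketch of this correspondence (my own illustration): a slow net emits a key, value and query per step; outer products of values and keys are added into a fast weight matrix, which is then applied to the query. This is equivalent to self-attention with unnormalized dot-product scores and no softmax.

import numpy as np

rng = np.random.default_rng(5)
d_in, d_key, d_val = 8, 16, 16
# "Slow" weights: a simple linear slow net producing key, value, query per step.
Wk = 0.1 * rng.normal(size=(d_in, d_key))
Wv = 0.1 * rng.normal(size=(d_in, d_val))
Wq = 0.1 * rng.normal(size=(d_in, d_key))

def fast_weight_attention(xs):
    W_fast = np.zeros((d_val, d_key))          # fast weights, start at zero
    outputs = []
    for x in xs:                               # process the sequence step by step
        k, v, q = x @ Wk, x @ Wv, x @ Wq
        W_fast += np.outer(v, k)               # program the fast net: rank-1 update
        outputs.append(W_fast @ q)             # apply the fast net to the query
    return np.stack(outputs)
    # Equivalent to attention scores q·k summed over past steps, without softmax
    # ("linearized self-attention").

print(fast_weight_attention(rng.normal(size=(5, d_in))).shape)   # (5, 16)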
Today's Transformers heavily use unsupervised pre-training[UN0-3] (see next section), another
deep learning methodology first published in our Annus Mirabilis of 1990-1991.[MIR][MOST]
The 1991 fast weight programmers also led to meta-learning self-referential NNs that can run
their own weight change algorithm or learning algorithm on themselves, and improve it, and
improve the way they improve it, and so on. This work since 1992[FWPMETA1-9][HO1] extended my
1987 diploma thesis,[META1] which introduced algorithms not just for learning but also for meta-
learning or learning to learn,[META] to learn better learning algorithms through experience. This
became very popular in the 2010s[DEC] when computers were a million times faster.
Today's most powerful NNs tend to be very deep, that is, they have many layers of neurons or
many subsequent computational stages.[MIR] Before the 1990s, however, gradient-based
training did not work well for deep NNs, only for shallow ones[DL1-2] (but see a 1989 paper[MOZ]).
This Deep Learning Problem was most obvious for recurrent NNs. Like the human brain, but
unlike the more limited feedforward NNs (FNNs), RNNs have feedback connections. This
makes RNNs powerful, general purpose, parallel-sequential computers that can process input
sequences of arbitrary length (think of speech data or videos). RNNs can in principle
implement any program that can run on your laptop or any other computer in existence. If we
want to build an Artificial General Intelligence (AGI), then its underlying computational
substrate must be something more like an RNN than an FNN as FNNs are fundamentally
insufficient; RNNs and similar systems are to FNNs as general computers are to pocket
calculators. In particular, unlike FNNs, RNNs can in principle deal with problems of arbitrary
depth.[DL1] Before the 1990s, however, RNNs failed to learn deep problems in practice.[MIR](Sec. 0)
To overcome this drawback through RNN-based "general deep learning," I built a self-
supervised RNN hierarchy that learns representations at multiple levels of abstraction and
multiple self-organizing time scales:[LEC] the Neural Sequence Chunker[UN0] or Neural History
Compressor.[UN1] Each RNN tries to solve the pretext task of predicting its next input, sending
only unexpected inputs (and therefore also targets) to the next RNN above. The resulting
compressed sequence representations greatly facilitate downstream supervised deep learning
such as sequence classification.
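A toy sketch of the chunking principle (my drastic simplification: a trivial predictor that always expects the previous symbol to repeat stands in for a learning RNN): only mispredicted, i.e. unexpected, inputs are passed up to the next level.

def chunk(seq):
    unexpected, prev = [], None
    for s in seq:
        if s != prev:              # prediction failed: unexpected input,
            unexpected.append(s)   # so it is sent to the level above
        prev = s                   # correctly predicted inputs are absorbed here
    return unexpected

seq = "aaaabbbaaaaccccbb"
print("".join(chunk(seq)))         # "abacb": a much shorter description
# Stacking such predictors yields compressed representations at multiple,
# self-organizing time scales, which ease downstream supervised learning.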
Although computers back then were about a million times slower per dollar than today, by
1993, the Neural History Compressor above was able to solve previously unsolvable "very
deep learning" tasks of depth > 1000[UN2] (requiring more than 1,000 subsequent computational
stages—the more such stages, the deeper the learning). In 1993, we also published a
continuous version of the Neural History Compressor.[UN3] (See also recent work on
unsupervised NN-based abstraction.[OBJ1-5])
More than a decade after this work,[UN1] a similar unsupervised method for more limited
feedforward NNs (FNNs) was published, facilitating supervised learning by unsupervised pre-
training of stacks of FNNs called Deep Belief Networks (DBNs).[UN4] The 2006 justification was
essentially the one I used in the early 1990s for my RNN stack: each higher level tries to
reduce the description length (or negative log probability) of the data representation in the level
below.[HIN][T22][MIR]
The Long Short-Term Memory (LSTM) recurrent neural network[LSTM1-6] overcomes the
Fundamental Deep Learning Problem identified by Sepp Hochreiter in his above-mentioned 1991 diploma
thesis,[VAN1] which I consider one of the most important documents in the history of machine
learning. It also provided essential insights for overcoming the problem, through basic
principles (such as constant error flow) of what we called LSTM in a tech report of 1995.[LSTM0]
After the main peer-reviewed publication in 1997[LSTM1][25y97] (now the most cited NN article of the
20th century[MOST]), LSTM and its training procedures were further improved on my Swiss LSTM
grants at IDSIA through the work of my later students Felix Gers, Alex Graves, and others. A
milestone was the "vanilla LSTM architecture" with forget gate[LSTM2]—the LSTM variant of
1999-2000 that everybody is using today, e.g., in Google's Tensorflow. Alex was lead author of
our first successful application of LSTM to speech (2004).[LSTM10] 2005 saw the first publication
of LSTM with full backpropagation through time and of bi-directional LSTM[LSTM3] (now widely
used).
Another milestone of 2006 was the training method "Connectionist Temporal Classification" or
CTC[CTC] for simultaneous alignment and recognition of sequences. Our team successfully
applied CTC-trained LSTM to speech in 2007[LSTM4] (also with hierarchical LSTM stacks[LSTM14]).
This led to the first superior end-to-end neural speech recognition. It was very different from
hybrid methods since the late 1980s which combined NNs and traditional approaches such as
Hidden Markov Models (HMMs).[BW][BRI][BOU][HYB12][T22] In 2009, through the efforts of Alex, LSTM
trained by CTC became the first RNN to win international competitions, namely, three ICDAR
2009 Connected Handwriting Competitions (French, Farsi, Arabic). This attracted enormous
interest from industry. LSTM was soon used for everything that involves sequential data such
as speech[LSTM10-11][LSTM4][DL1] and videos. In 2015, the CTC-LSTM combination dramatically improved Google's speech recognition on Android smartphones.[GSR15] Many other
companies adopted this.[DL4] Google's new on-device speech recognition of 2019 (now on your
phone, not on the server) is still based on LSTM.
Through the work of my students Rupesh Kumar Srivastava and Klaus Greff, the LSTM
principle also led to our Highway Network[HW1] of May 2015, the first working very deep FNN
with hundreds of layers (previous NNs had at most a few tens of layers). Microsoft's
ResNet[HW2] (which won the ImageNet 2015 contest) is a version thereof (ResNets are Highway
Nets whose gates are always open). The earlier Highway Nets perform roughly as well as their
ResNet versions on ImageNet.[HW3] Variants of highway gates are also used for certain
algorithmic tasks where the pure residual layers do not work as well.[NDR]
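A compact sketch of this relation (my own illustration): a Highway layer gates between a transform H(x) and the carried input x; keeping both paths fully open yields the residual form y = x + H(x) used in ResNets.

import torch, torch.nn as nn

class HighwayLayer(nn.Module):
    # y = T(x) * H(x) + C(x) * x, with carry gate C = 1 - T (common coupling).
    def __init__(self, d):
        super().__init__()
        self.H = nn.Sequential(nn.Linear(d, d), nn.Tanh())     # transform
        self.T = nn.Sequential(nn.Linear(d, d), nn.Sigmoid())  # transform gate
    def forward(self, x):
        t = self.T(x)
        return t * self.H(x) + (1 - t) * x

class ResidualLayer(nn.Module):
    # The ResNet case: both the transform and carry paths are always open
    # (gates fixed to 1), giving y = x + H(x).
    def __init__(self, d):
        super().__init__()
        self.H = nn.Sequential(nn.Linear(d, d), nn.Tanh())
    def forward(self, x):
        return x + self.H(x)

x = torch.randn(4, 32)
print(HighwayLayer(32)(x).shape, ResidualLayer(32)(x).shape)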
The LSTM / Highway Net Principle is the Core of Modern Deep Learning
Deep learning is all about NN depth.[DL1] In the 1990s, LSTMs brought essentially unlimited
depth to supervised recurrent NNs; in the 2000s, the LSTM-inspired Highway Nets brought it to
feedforward NNs. LSTM has become the most cited NN of the 20th century; the Highway Net
version called ResNet the most cited NN of the 21st.[MOST] (Citations, however, are a highly
questionable measure of true impact.[NAT1])
The previous sections have mostly focused on deep learning for passive pattern
recognition/classification. However, NNs are also relevant for Reinforcement Learning (RL),
[KAE96][BER96][TD3][UNI][GM3][LSTMPG] the most general type of learning. General RL agents must
discover, without the aid of a teacher, how to interact with a dynamic, initially unknown, partially
observable environment in order to maximize their expected cumulative reward signals.[DL1]
There may be arbitrary, a priori unknown delays between actions and perceivable
consequences. The RL problem is as hard as any problem of computer science, since any task
with a computable description can be formulated in the general RL framework.[UNI]
Certain RL problems can be addressed through non-neural techniques invented long before
the 1980s: Monte Carlo (tree) search (MC, 1949),[MOC1-5] dynamic programming (DP, 1953),[BEL53]
artificial evolution (1954),[EVO1-7]([TUR1],unpublished) alpha-beta-pruning (1959),[S59] control theory and
system identification (1950s),[KAL59][GLA85] stochastic gradient descent (SGD, 1951),[STO51-52] and
universal search techniques (1973).[AIT7]
Deep FNNs and RNNs, however, are useful tools for improving certain types of RL. In the
1980s, concepts of function approximation and NNs were combined with system identification,[WER87-89][MUN87][NGU89] DP and its online variant called Temporal Differences (TD),[TD1-3] artificial
evolution,[EVONN1-3] and policy gradients.[GD1][PG1-3] Many additional references on this can be
found in Sec. 6 of the 2015 survey.[DL1]
When there is a Markovian interface[PLAN3] to the environment such that the current input to the
RL machine conveys all the information required to determine a next optimal action, RL with
DP/TD/MC-based FNNs can be very successful, as shown in 1994[TD2] (master-level
backgammon player) and the 2010s[DM1-2a] (superhuman players for Go, chess, and other
games).
For more complex cases without Markovian interfaces, where the learning machine must
consider not only the present input, but also the history of previous inputs, our combinations of
RL algorithms and LSTM[LSTM-RL][RPG] have become standard, in particular, our LSTM trained by
policy gradients (2007).[RPG07][RPG][LSTMPG]
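A minimal sketch of such a combination (my own toy illustration, not the 2007 system): an LSTM policy trained by the REINFORCE policy gradient on a tiny memory task, where the rewarded action at the last step depends on an observation seen only at the first step.

import torch, torch.nn as nn

torch.manual_seed(0)
T, hidden = 5, 32

class LSTMPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)          # logits over 2 actions
    def forward(self, obs):                       # obs: (batch, T, 1)
        h, _ = self.lstm(obs)
        return self.head(h[:, -1])                # act at the final step

policy = LSTMPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(500):
    # Episode: the relevant bit is shown only at t=0; later inputs are zero,
    # so the policy needs the LSTM's memory to earn reward.
    bit = torch.randint(0, 2, (64,))
    obs = torch.zeros(64, T, 1); obs[:, 0, 0] = bit.float()
    logits = policy(obs)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    reward = (action == bit).float()              # +1 for recalling the bit

    # REINFORCE: raise log-probabilities of actions in proportion to reward
    # (with a simple baseline to reduce variance).
    loss = -((reward - reward.mean()) * dist.log_prob(action)).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print("final average reward:", reward.mean().item())   # approaches 1.0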
For example, in 2018, a PG-trained LSTM was the core of OpenAI's famous Dactyl which
learned to control a dextrous robot hand without a teacher.[OAI1][OAI1a] Similar for video games: in 2019, DeepMind (co-founded by a student from my lab) famously beat a pro player in the game of StarCraft, which is theoretically harder than Chess or Go[DM2] in many ways, using AlphaStar, whose brain has a deep LSTM core trained by PG.[DM3] An RL LSTM (with 84% of the
model's total parameter count) also was the core of the famous OpenAI Five which learned to
defeat human experts in the Dota 2 video game (2018).[OAI2] Bill Gates called this a "huge
milestone in advancing artificial intelligence".[OAI2a][MIR](Sec. 4)[LSTMPG]
The recent breakthroughs of deep learning algorithms from the past millennium (see previous
sections) would have been impossible without continually improving and accelerating computer
hardware. Any history of AI and deep learning would be incomplete without mentioning this
evolution, which has been running for at least two millennia.
The first known gear-based computational device was the Antikythera mechanism (a kind of
astronomical clock) in Ancient Greece over 2000 years ago.
Perhaps the world's first practical programmable machine was an automatic theatre made in
the 1st century[SHA7a][RAU1] by Heron of Alexandria (who apparently also had the first known working steam engine; see below).
The 9th century music automaton by the Banu Musa brothers in Baghdad was perhaps the first
machine with a stored program.[BAN][KOE1] It used pins on a revolving cylinder to store programs
controlling a steam-driven flute—compare Al-Jazari's programmable drum machine of 1206.[SHA7b]
The 1600s brought more flexible machines that computed answers in response to input data.
The first data-processing gear-based special purpose calculator for simple arithmetic was built in 1623 by Wilhelm Schickard, one of the candidates for the title of "father of automatic
computing," followed by the superior Pascaline of Blaise Pascal (1642). In 1673, the already
mentioned Gottfried Wilhelm Leibniz (called "the smartest man who ever lived"[SMO13]) designed
the first machine (the step reckoner) that could perform all four arithmetic operations, and the
first with a memory.[BL16] He also described the principles of binary computers governed by
punch cards (1679),[L79][L03][LA14][HO66] and published the chain rule[LEI07-10] (see above), essential
ingredient of deep learning and modern AI.
The first commercial program-controlled machines (punch card-based looms) were built in
France circa 1800 by Joseph-Marie Jacquard and others—perhaps the first "modern"
programmers who wrote the world's first industrial software. They inspired Ada Lovelace and
her mentor Charles Babbage (UK, circa 1840). He planned but was unable to build a
programmable, general purpose computer (only his non-universal special purpose calculator
led to a working 20th century replica).
Between 1935 and 1941, Konrad Zuse created the world's first working
programmable general-purpose computer: the Z3. The corresponding patent of 1936[ZU36-38][RO98][ZUS21] described the digital circuits required by programmable physical hardware, predating
Claude Shannon's 1937 thesis on digital circuit design.[SHA37] Unlike Babbage, Zuse used
Leibniz' principles of binary computation (1679)[L79][LA14][HO66][L03] instead of traditional decimal
computation. This greatly simplified the hardware.[LEI21,a,b] Ignoring the inevitable storage
limitations of any physical computer, the physical hardware of Z3 was indeed universal in the
modern sense of the purely theoretical but impractical constructs of Gödel[GOD][GOD34,21,21a] (1931-
34), Church[CHU] (1935), Turing[TUR] (1936), and Post[POS] (1936). Simple arithmetic tricks can
compensate for Z3's lack of an explicit conditional jump instruction.[RO98] Today, most computers
are binary like Z3.
Z3 used electromagnetic relays with visibly moving switches. The first electronic special
purpose calculator (whose moving parts were electrons too small to see) was the binary ABC
(US, 1942) by John Atanasoff (the "father of tube-based computing"[NASC6a]). Unlike the gear-
based machines of the 1600s, ABC used vacuum tubes—today's machines use the transistor
principle patented by Julius Edgar Lilienfeld in 1925.[LIL1-2] But unlike Zuse's Z3, ABC was not
freely programmable. Neither was the electronic Colossus machine by Tommy Flowers (UK,
1943-45) used to break the Nazi code.[NASC6]
The first general working programmable machine built by someone other than Zuse (1941)[RO98]
was Howard Aiken's decimal MARK I (US, 1944). The much faster decimal ENIAC by Eckert
and Mauchly (1945/46) was programmed by rewiring it. Both data and programs were stored in
electronic memory by the "Manchester baby" (Williams, Kilburn & Tootill, UK, 1948) and the
1948 upgrade of ENIAC, which was reprogrammed by entering numerical instruction codes
into read-only memory.[HAI14b]
Since then, computers have become much faster through integrated circuits (ICs). In 1949,
Werner Jacobi at Siemens filed a patent for an IC semiconductor with several transistors on a
common substrate (granted in 1952).[IC49-14] In 1958, Jack Kilby demonstrated an IC with
external wires. In 1959, Robert Noyce presented a monolithic IC.[IC14] Since the 1970s, graphics
processing units (GPUs) have been used to speed up computations through parallel
processing. ICs/GPUs of today (2022) contain many billions of transistors (almost all of them of
Lilienfeld's 1925 FET type[LIL1-2]).
In 1941, Zuse's Z3 could perform roughly one elementary operation (e.g., an addition) per
second. Since then, every 5 years, compute got 10 times cheaper (note that this law is much older than Moore's Law, which states that the number of transistors[LIL1-2] per chip doubles every
18 months). As of 2021, 80 years after Z3, modern computers can execute about 10 million
billion instructions per second for the same (inflation-adjusted) price. The naive extrapolation of
this exponential trend predicts that the 21st century will see cheap computers with a thousand
times the raw computational power of all human brains combined.[RAW]
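The numbers above are easy to check (my back-of-the-envelope restatement of the claim in the text):

# 10x cheaper compute every 5 years, starting from roughly 1 operation per
# second on the Z3 in 1941:
years = 2021 - 1941                 # 80 years
factor = 10 ** (years / 5)          # 16 doublings of the exponent of 10
print(factor)                       # 1e+16, i.e., about 10 million billion
                                    # operations/second for the same price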
Physics seems to dictate that future efficient computational hardware will have to be brain-like,
with many compactly placed processors in 3-dimensional space, sparsely connected by many
short and few long wires, to minimize total connection cost (even if the "wires" are actually light
beams).[DL2] The basic architecture is essentially the one of a deep, sparsely connected, 3-
dimensional RNN, and Deep Learning methods for such RNNs are expected to become even
much more important than they are today.[DL2]
The core of modern AI and deep learning is mostly based on simple math of recent centuries:
calculus/linear algebra/statistics. Nevertheless, to efficiently implement this core on the modern
hardware mentioned in the previous section, and to roll it out for billions of people, lots of
software engineering was necessary, based on lots of smart algorithms invented in the past
century. There is no room here to mention them all. However, at least I'll list some of the most
important highlights of the theory of AI and computer science in general.
Like most great scientists, Gödel built on earlier work. He combined Georg Cantor's
diagonalization trick[CAN] (which showed in 1891 that there are different types of infinities) with
the foundational work by Gottlob Frege[FRE] (who introduced the first formal language in 1879),
Thoralf Skolem[SKO23] (who introduced primitive recursive functions in 1923) and Jacques
Herbrand[GOD86] (who identified limitations of Skolem's approach). These authors in turn built on
the formal Algebra of Thought (1686) by Gottfried Wilhelm Leibniz[L86][WI48] (see above), which is
deductively equivalent[LE18] to the later Boolean Algebra of 1847.[BOO]
In 1935, Alonzo Church derived a corollary / extension of Gödel's result by demonstrating that
Hilbert & Ackermann's Entscheidungsproblem (decision problem) does not have a general
solution.[CHU] To do this, he used his alternative universal coding language called Untyped
Lambda Calculus, which forms the basis of the highly influential programming language LISP.
In 1936, Alan M. Turing introduced yet another universal model: the Turing Machine.[TUR] He
rederived the above-mentioned result.[CHU][TUR][HIN][GOD21,21a][TUR21][LEI21,21a] In the same year of 1936,
Emil Post published yet another independent universal model of computing.[POS] Today we
know many such models.
Konrad Zuse not only created the world's first working programmable general-purpose
computer,[ZU36-38][RO98][ZUS21] he also designed Plankalkül, the first high-level programming
language.[BAU][KNU] He applied it to chess in 1945[KNU] and to theorem proving in 1948.[ZU48]
Compare Newell & Simon's later work on theorem proving (1956).[NS56] Much of early AI in the
1940s-70s was actually about theorem proving and deduction in Gödel style[GOD][GOD34,21,21a]
through expert systems and logic programming.
In the early 2000s, Marcus Hutter (while working under my Swiss National Science Foundation
grant[UNI]) augmented Solomonoff's universal predictor[AIT1][AIT10] by an optimal action selector (a
universal AI) for reinforcement learning agents living in initially unknown (but at least
computable) environments.[AIT20,22] He also derived the asymptotically fastest algorithm for all
well-defined computational problems,[AIT21] solving any problem as quickly as the unknown
fastest solver of such problems, save for an additive constant that does not depend on the
problem size.
The even more general optimality of the self-referential 2003 Gödel Machine[GM3-9] is not limited
to asymptotic optimality.
Nevertheless, such mathematically optimal AIs are not yet practically feasible for various
reasons. Instead, practical modern AI is based on suboptimal, limited, yet not extremely well-
understood techniques such as NNs and deep learning, the focus of the present article. But
who knows what kind of AI history will prevail 20 years from now?
Credit assignment is about finding patterns in historic data and figuring out how certain events
were enabled by previous events. Historians do it. Physicists do it. AIs do it, too. Let's take a
step back and look at AI in the broadest historical context: all time since the Big Bang. In 2014,
I found a beautiful pattern of exponential acceleration in it,[OMG] which I have presented in
many talks since then, and which also made it into Sibylle Berg's award-winning book "GRM:
Brainfuck."[OMG2] Previously published patterns of this kind span much shorter time intervals:
just a few decades or centuries or at most millennia.[OMG1]
It turns out that from a human perspective, the most important events since the beginning of
the universe are neatly aligned on a timeline of exponential speed up (error bars mostly below
10 percent). In fact, history seems to converge in an Omega point in the year 2040 or so. I like
to call it Omega, because a century ago, Teilhard de Chardin called Omega the point where
humanity will reach its next level.[OMG0] Also, Omega sounds much better than "Singularity"[SING1-2]—it sounds a bit like "Oh my God."[OMG]
Let's start with the Big Bang 13.8 billion years ago. We divide this time by 4 to obtain about 3.5
billion years. Omega is 2040 or so. At Omega minus 3.5 billion years, something very
important happened: life emerged on this planet.
And we take again a quarter of this time. We come out 900 million years ago when something
very important happened: animal-like, mobile life emerged.
And we divide again by 4. We come out 220 million years ago when mammals were invented,
our ancestors.
And we divide again by 4. 55 million years ago the first primates emerged, our ancestors.
And we divide by 4. 50 thousand years ago, behaviorally modern man emerged, our ancestor,
and started colonizing the world.
And we divide again by 4. We come out 13 thousand years ago when something very
important happened: domestication of animals, agriculture, first settlements—the beginning of
civilisation. Now we see that all of civilization is just a flash in world history, just one millionth of
the time since the Big Bang. Agriculture and spacecraft were invented almost at the same time.
And we divide by 4. 3,300 years ago saw the onset of the 1st population explosion in the Iron
Age.
And we divide by 4. Remember that the convergence point Omega is the year 2040 or so.
Omega minus 800 years—that was in the 13th century, when iron and fire came together in the form of guns and cannons and rockets in China. This has defined the world since then, and the West is still behind on the license fees it owes China.
And we divide again by 4. Omega minus 200 years—we hit the mid 19th century, when iron
and fire came together in ever more sophisticated form to power the industrial revolution
through improved steam engines, based on the work of Beaumont, Papin, Newcomen, Watt,
and others (1600s-1700s, going beyond the first simple steam engines by Heron of
Alexandria[RAU1] in the 1st century). The telephone (e.g., Meucci 1857, Reis 1860, Bell 1876)[NASC3] started to revolutionize communication. The germ theory of disease (Pasteur & Koch,
late 1800s) revolutionized healthcare and made people live longer on average. And circa 1850,
the fertilizer-based agricultural revolution (Sprengel & von Liebig, early 1800s) helped to trigger
the 2nd population explosion, which peaked in the 20th century, when the world population
quadrupled, letting the 20th century stand out among all centuries in the history of mankind,
driven by the Haber-Bosch process for creating artificial fertilizer, without which the world could
feed at most 4 billion people.[HAB1-2]
And we divide again by 4. Omega minus 50 years—that's more or less the year 1990, the end
of the 3 great wars of the 20th century: WW1, WW2, and the Cold War. The 7 most valuable
public companies were all Japanese (today most of them are US-based); however, both China
and the US West Coast started to rise rapidly, setting the stage for the 21st century. A digital
nervous system started spanning the globe through cell phones and the wireless revolution
(based on radio waves discovered in the 1800s) as well as cheap personal computers for all.
The WWW was created at the European particle collider in Switzerland by Tim Berners-Lee.
And Modern AI started also around this time: the first truly self-driving cars were built in the
1980s in Munich by the team of Ernst Dickmanns (by 1994, their robot cars were driving in
highway traffic, up to 180 km/h).[AUT] Back then, I worked on my 1987 diploma thesis,[META1]
which introduced algorithms not just for learning but also for meta-learning or learning to learn,[META] to learn better learning algorithms through experience (now a very popular topic[DEC]). And
then came our Miraculous Year 1990-91[MIR] at TU Munich, the root of today's most cited
NNs[MOST] and of modern deep learning through self-supervised/unsupervised learning (see
above),[UN][UN0-3] the LSTM/Highway Net/ResNet principle (now in your pocket on your
smartphone—see above),[DL4][DEC][MOST] artificial curiosity and generative adversarial NNs for
agents that invent their own problems (see above),[AC90-AC20][PP-PP2][SA17] Transformers with
linearized self-attention (see above),[FWP0-6][TR5-6] distilling teacher NNs into student NNs (see
above),[UN][UN0-3] learning action plans at multiple levels of abstraction and multiple time scales
(see above),[HRL0-2][LEC] and other exciting stuff. Much of this has become very popular, and
improved the lives of billions of people.[DL4][DEC][MOST]
And we divide again by 4. Omega minus 13 years—that's a point in the near future, more or less the year 2030, when many predict that cheap AIs will have human brain power. Then the
final 13 years or so until Omega, when incredible things will happen (take all of this with a grain
of salt, though[OMG1]).
But of course, time won't stop with Omega. Maybe it's just human-dominated history that will
end. After Omega, many curious meta-learning AIs that invent their own goals (which have
existed in my lab for decades[AC][AC90,AC90b]) will quickly improve themselves, restricted only by
the fundamental limits of computability and physics.
What will supersmart AIs do? Space is hostile to humans but friendly to appropriately designed
robots, and offers many more resources than our thin film of biosphere, which receives less
than a billionth of the sun's energy. While some curious AIs will remain fascinated with life, at
least as long as they don't fully understand it,[ACM16][FA15][SP16][SA17] most will be more interested in
the incredible new opportunities for robots and software life out there in space. Through
innumerable self-replicating robot factories in the asteroid belt and beyond they will transform
the solar system and then within a few hundred thousand years the entire galaxy and within
tens of billions of years the rest of the reachable universe. Despite the light-speed limit, the
expanding AI sphere will have plenty of time to colonize and shape the entire visible cosmos.
Let me stretch your mind a bit. The universe is still young, only 13.8 billion years old.
Remember when we kept dividing by 4? Now let's multiply by 4! Let's look ahead to a time
when the cosmos will be 4 times older than it is now: about 55 billion years old. By then, the
visible cosmos will be permeated by intelligence. Because after Omega, most AIs will have to
go where most of the physical resources are, to make more and bigger AIs. Those who don't
won't have an impact.[ACM16][FA15][SP16]
Acknowledgments
Some of the material above was taken from previous AI Blog posts.[MIR] [DEC] [GOD21]
[ZUS21] [LEI21] [AUT] [HAB2] [ARC06] [AC] [ATT] [DAN] [DAN1] [DL4] [GPUCNN5,8] [DLC] [FDL] [FWP] [LEC] [META] [MLP2] [MOST] [PLAN]
[UN] [LSTMPG] [BP4] [DL6a] [HIN] [T22]
Thanks to many expert reviewers (including several famous neural net
pioneers) for useful comments. Since science is about self-correction, let me know at
juergen@idsia.ch if you can spot any remaining error. Many additional relevant publications
can be found in my publication page and my arXiv page. This work is licensed under a
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
[25y97] In 2022, we are celebrating the following works from a quarter-century ago. 1. Journal
paper on Long Short-Term Memory, the most cited neural network (NN) of the 20th century
(and basis of the most cited NN of the 21st). 2. First paper on physical, philosophical and
theological consequences of the simplest and fastest way of computing all possible
metaverses (= computable universes). 3. Implementing artificial curiosity and creativity through
generative adversarial agents that learn to design abstract, interesting computational
experiments. 4. Journal paper on meta-reinforcement learning. 5. Journal paper on hierarchical
Q-learning. 6. First paper on reinforcement learning to play soccer: start of a series. 7. Journal
papers on flat minima & low-complexity NNs that generalize well. 8. Journal paper on Low-
Complexity Art, the Minimal Art of the Information Age. 9. Journal paper on probabilistic
incremental program evolution.
[AC] J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity.
Schmidhuber's artificial scientists not only answer given questions but also invent new
questions. They achieve curiosity through: (1990) the principle of generative adversarial
networks, (1991) neural nets that maximise learning progress, (1995) neural nets that
maximise information gain (optimally since 2011), (1997) adversarial design of surprising
computational experiments, (2006) maximizing compression progress like
scientists/artists/comedians do, (2011) PowerPlay... Since 2012: applications to real robots.
[AC90] J. Schmidhuber. Making the world differentiable: On using fully recurrent self-
supervised neural networks for dynamic reinforcement learning and planning in non-stationary
environments. Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990. PDF. The first
paper on online planning with reinforcement learning recurrent neural networks (NNs) (more)
and on generative adversarial networks where a generator NN is fighting a predictor NN in a
minimax game (more).
[AC91] J. Schmidhuber. Adaptive confidence and adaptive curiosity. Technical Report FKI-149-
91, Inst. f. Informatik, Tech. Univ. Munich, April 1991. PDF.
[AC97] J. Schmidhuber. What's interesting? Technical Report IDSIA-35-97, IDSIA, July 1997.
Focus on automatic creation of predictable internal abstractions of complex spatio-temporal
events: two competing, intrinsically motivated agents agree on essentially arbitrary algorithmic
experiments and bet on their possibly surprising (not yet predictable) outcomes in zero-sum
games. Each agent can profit from outwitting or surprising the other by inventing experimental
protocols on whose predicted outcome both modules disagree. The focus is on exploring the
space of general algorithms (as opposed to traditional simple mappings from inputs to
outputs); the system concentrates on the interesting things by losing interest in both the
predictable and the unpredictable aspects of the world. Unlike Schmidhuber et al.'s previous
systems with intrinsic motivation,[AC90-AC95] the system also takes into account the computational
cost of learning new skills, learning when to learn and what to learn. See later publications.[AC99]
[AC02]
[AC98b] M. Wiering and J. Schmidhuber. Learning exploration policies with models. In Proc.
CONALD, 1998.
[AC09] J. Schmidhuber. Art & science as by-products of the search for novel patterns, or data
compressible in unknown yet learnable ways. In M. Botta (ed.), Et al. Edizioni, 2009, pp. 98-
112. PDF. (More on artificial scientists and artists.)
[AC10] J. Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010).
IEEE Transactions on Autonomous Mental Development, 2(3):230-247, 2010. IEEE link. PDF.
With a brief summary of the generative adversarial neural networks of 1990[AC90,90b][AC20] where a
generator NN is fighting a predictor NN in a minimax game (more).
[ACM16] ACM interview by S. Ibaraki (2016). Chat with J. Schmidhuber: Artificial Intelligence &
Deep Learning—Now & Future. Link.
[AI51] Les Machines a Calculer et la Pensee Humaine (Calculating Machines and Human
Thought): Paris, 8-13 January 1951, Colloques internationaux du Centre National de la
Recherche Scientifique; no. 37, Paris 1953. H. Bruderer[BRU4] calls this the first conference on AI.
[AIT1] R. J. Solomonoff. A formal theory of inductive inference. Part I. Information and Control,
7:1-22, 1964.
[AIT3] G.J. Chaitin. On the length of programs for computing finite binary sequences: statistical
considerations. Journal of the ACM, 16:145-159, 1969 (submitted 1965).
[AIT4] P. Martin-Löf. The definition of random sequences. Information and Control, 9:602-619,
1966.
[AIT6] A. K. Zvonkin and L. A. Levin. The complexity of finite objects and the algorithmic
concepts of information and randomness. Russian Math. Surveys, 25(6):83-124, 1970.
[AIT8] L. A. Levin. Laws of information (nongrowth) and aspects of the foundation of probability
theory. Problems of Information Transmission, 10(3):206-210, 1974.
[AIT9] C. P. Schnorr. Process complexity and effective random tests. Journal of Computer and
System Sciences, 7:376-388, 1973.
[AIT11] P. Gacs. On the relation between descriptional complexity and algorithmic probability.
Theoretical Computer Science, 22:71-93, 1983.
[AIT13] J. Rissanen. Stochastic complexity and modeling. The Annals of Statistics, 14(3):1080-
1100, 1986.
[AIT17] J. Schmidhuber. Discovering neural nets with low Kolmogorov complexity and high
generalization capability. Neural Networks, 10(5):857-873, 1997.
[AIT19] C. S. Calude. Chaitin Omega numbers, Solovay machines and Gödel incompleteness.
Theoretical Computer Science, 2000.
[AIT21] M. Hutter. The fastest and shortest algorithm for all well-defined problems.
International Journal of Foundations of Computer Science, 13(3):431-443, 2002. (Based on
work done under J. Schmidhuber's SNF grant 20-61847: unification of universal induction and
sequential decision theory, 2000).
[AIT24] J. Schmidhuber. The Speed Prior: a new simplicity measure yielding near-optimal
computable predictions. In J. Kivinen and R. H. Sloan, editors, Proceedings of the 15th Annual
Conference on Computational Learning Theory (COLT 2002), Lecture Notes in Artificial
Intelligence, pages 216-228. Springer, Sydney, Australia, 2002.
[AM16] Blog of Werner Vogels, CTO of Amazon (Nov 2016): Amazon's Alexa "takes advantage
of bidirectional long short-term memory (LSTM) networks using a massive amount of data to
train models that convert letters to sounds and predict the intonation contour. This technology
enables high naturalness, consistent intonation, and accurate processing of texts."
[AMH1] S. I. Amari (1972). Learning patterns and pattern sequences by self-organizing nets of
threshold elements. IEEE Transactions on Computers, C-21, 1197-1206, 1972. PDF. First publication of what
was later sometimes called the Hopfield network[AMH2] or Amari-Hopfield Network,[AMH3] based on
the (uncited) Lenz-Ising recurrent architecture.[L20][I25][T22]
[AMH1b] W. A. Little. The existence of persistent states in the brain. Mathematical Biosciences,
19.1-2, p. 101-120, 1974. Mentions the recurrent Ising model[L20][I25] on which the (uncited) Amari
network[AMH1,2] is based.
[AMH2] J. J. Hopfield (1982). Neural networks and physical systems with emergent collective
computational abilities. Proc. of the National Academy of Sciences, vol. 79, pages 2554-2558,
1982. The Hopfield network or Amari-Hopfield Network was first published in 1972 by Amari.
[AMH1]
[AMH2] did not cite [AMH1].
[ATT0] J. Schmidhuber and R. Huber. Learning to generate focus trajectories for attentive
vision. Technical Report FKI-128-90, Institut für Informatik, Technische Universität München,
1990. PDF.
[ATT1] J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target
detection. International Journal of Neural Systems, 2(1 & 2):135-141, 1991. Based on TR FKI-
128-90, TUM, 1990. PDF. More.
[ATT2] J. Schmidhuber. Learning algorithms for networks with internal and external feedback.
In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, editors, Proc. of the 1990
Connectionist Models Summer School, pages 52-61. San Mateo, CA: Morgan Kaufmann,
1990. PS. (PDF.)
[AUT] J. Schmidhuber (AI Blog, 2005). Highlights of robot car history. Around 1986, Ernst
Dickmanns and his group at Univ. Bundeswehr Munich built the world's first real autonomous
robot cars, using saccadic vision, probabilistic approaches such as Kalman filters, and parallel
computers. By 1994, they were in highway traffic, at up to 180 km/h, automatically passing
other cars.
[AV1] A. Vance. Google Amazon and Facebook Owe Jürgen Schmidhuber a Fortune—This
Man Is the Godfather the AI Community Wants to Forget. Business Week, Bloomberg, May 15,
2018.
[BA93] P. Baldi and Y. Chauvin. Neural Networks for Fingerprint Recognition, Neural
Computation, Vol. 5, 3, 402-418, (1993). First application of CNNs with backpropagation to
biomedical/biometric images.
[BA96] P. Baldi and Y. Chauvin. Hybrid Modeling, HMM/NN Architectures, and Protein
Applications, Neural Computation, Vol. 8, 7, 1541-1565, (1996). One of the first papers on
graph neural networks.
[BA99] P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri. Exploiting the Past and the
Future in Protein Secondary Structure Prediction, Bioinformatics, Vol. 15, 11, 937-946, (1999).
[BA03] P. Baldi and G. Pollastri. The Principled Design of Large-Scale Recursive Neural
Network Architectures-DAG-RNNs and the Protein Structure Prediction Problem. Journal of
Machine Learning Research, 4, 575-602, (2003).
[BAN] Banu Musa brothers (9th century). The book of ingenious devices (Kitab al-hiyal).
Translated by D. R. Hill (1979), Springer, p. 44, ISBN 90-277-0833-9. As far as we know, the
Banu Musa music automaton was the world's first machine with a stored program.
[BAY1] S. M. Stigler. Who Discovered Bayes' Theorem? The American Statistician. 37(4):290-
296, 1983. Bayes' theorem is actually Laplace's theorem or possibly Saunderson's theorem.
[BAY2] T. Bayes. An essay toward solving a problem in the doctrine of chances. Philosophical
Transactions of the Royal Society of London, 53:370-418. Communicated by R. Price, in a
letter to J. Canton, 1763.
[BAY7] J. F. G. De Freitas. Bayesian methods for neural networks. PhD thesis, University of
Cambridge, 2003.
[BAY8] J. M. Bernardo, A.F.M. Smith. Bayesian theory. Vol. 405. John Wiley & Sons, 2009.
[BB2] J. Schmidhuber. A local learning algorithm for dynamic feedforward and recurrent
networks. Connection Science, 1(4):403-412, 1989. (The Neural Bucket Brigade—figures
omitted!). PDF. HTML. Compare TR FKI-124-90, TUM, 1990. PDF. Proposal of a biologically
more plausible deep learning algorithm that—unlike backpropagation—is local in space and
time. Based on a "neural economy" for reinforcement learning.
[BIB3] W. Bibel (2003). Mosaiksteine einer Wissenschaft vom Geiste (Mosaic pieces of a
science of the mind). Invited talk at the conference on AI and Gödel, Arnoldsheim, 4-6 April
2003. Manuscript, 2003.
[BL16] L. Bloch (2016). Informatics in the light of some Leibniz's works. Communication to XB2
Berlin Xenobiology Conference.
[BM] D. Ackley, G. Hinton, T. Sejnowski (1985). A Learning Algorithm for Boltzmann Machines.
Cognitive Science, 9(1):147-169. This paper neither cited relevant prior work by Sherrington &
Kirkpatrick[SK75] & Glauber[G63] nor the first working algorithms for deep learning of internal
representations (Ivakhnenko & Lapa, 1965)[DEEP1-2][HIN] nor Amari's work (1967-68)[GD1-2] on
learning internal representations in deep nets through stochastic gradient descent. Even later
surveys by the authors[S20][DLC] failed to cite the prior art.[T22]
[BOO] George Boole (1847). The Mathematical Analysis of Logic, Being an Essay towards a
Calculus of Deductive Reasoning. London, England: Macmillan, Barclay, & Macmillan, 1847.
Leibniz' formal Algebra of Thought (1686)[L86][WI48] was deductively equivalent[LE18] to the much
later Boolean Algebra.
[BPA] H. J. Kelley. Gradient Theory of Optimal Flight Paths. ARS Journal, Vol. 30, No. 10, pp.
947-954, 1960. Precursor of modern backpropagation.[BP1-5]
[BPB] A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. Proc.
Harvard Univ. Symposium on digital computers and their applications, 1961.
[BP1] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a
Taylor expansion of the local rounding errors. Master's thesis, University of Helsinki, 1970. See
also BIT Numerical Mathematics, 16:146-160, 1976. Link. The first publication on "modern"
backpropagation, also known as the reverse mode of automatic differentiation.
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation? More.
[DL2]
[BP5] A. Griewank (2012). Who invented the reverse mode of differentiation? Documenta
Mathematica, Extra Volume ISMP (2012): 389-400.
[BPTT1] P. J. Werbos. Backpropagation through time: what it does and how to do it.
Proceedings of the IEEE 78.10, 1550-1560, 1990.
[BRI] Bridle, J.S. (1990). Alpha-Nets: A Recurrent "Neural" Network Architecture with a Hidden
Markov Model Interpretation, Speech Communication, vol. 9, no. 1, pp. 83-92.
[BRU1] H. Bruderer. Computing history beyond the UK and US: selected landmarks from
continental Europe. Communications of the ACM 60.2 (2017): 76-84.
[BRU3] H. Bruderer. Milestones in Analog and Digital Computing. 2 volumes, 3rd edition.
Springer Nature Switzerland AG, 2020.
[BRO21] D. C. Brock (2021). Cybernetics, Computer Design, and a Meeting of the Minds. An
influential 1951 conference in Paris considered the computer as a model of—and for—the
human mind. IEEE Spectrum, 2021. Link.
[BW] H. Bourlard, C. J. Wellekens (1989). Links between Markov models and multilayer
perceptrons. NIPS 1989, p. 502-510.
[CHU] A. Church (1935). An unsolvable problem of elementary number theory. Bulletin of the
American Mathematical Society, 41: 332-333. Abstract of a talk given on 19 April 1935, to the
American Mathematical Society. Also in American Journal of Mathematics, 58(2), 345-363 (1
Apr 1936). First explicit proof that the Entscheidungsproblem (decision problem) does not have
a general solution.
[CNN1c] Bower Award Ceremony 2021: Jürgen Schmidhuber lauds Kunihiko Fukushima.
YouTube video, 2021.
[CNN3] Weng, J., Ahuja, N., and Huang, T. S. (1993). Learning recognition and segmentation
of 3-D objects from 2-D images. Proc. 4th Intl. Conf. Computer Vision, Berlin, Germany, pp.
121-128. A CNN whose downsampling layers use Max-Pooling (which has become very
popular) instead of Fukushima's Spatial Averaging.[CNN1]
[CNN4] M. A. Ranzato, Y. LeCun: A Sparse and Locally Shift Invariant Feature Extractor
Applied to Document Images. Proc. ICDAR, 2007
[CNN5a] S. Behnke. Learning iterative image reconstruction in the neural abstraction pyramid.
International Journal of Computational Intelligence and Applications, 1(4):427-438, 1999.
[CNN5b] S. Behnke. Hierarchical Neural Networks for Image Interpretation, volume LNCS
2766 of Lecture Notes in Computer Science. Springer, 2003.
[CON16] J. Carmichael (2016). Artificial Intelligence Gained Consciousness in 1991. Why A.I.
pioneer Jürgen Schmidhuber is convinced the ultimate breakthrough already happened.
Inverse, 2016. Link.
[CUB2] J. Schmidhuber. A fixed size storage O(n^3) time complexity learning algorithm for fully
recurrent continually running networks. Neural Computation, 4(2):243-248, 1992. PDF.
[DAN] J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the
deep convolutional neural network (CNN) revolution. Named after Schmidhuber's outstanding
postdoc Dan Ciresan, it was the first deep and fast CNN to win international computer vision
contests, and had a temporary monopoly on winning them, driven by a very fast
implementation based on graphics processing units (GPUs). 1st superhuman result in 2011.
[DAN1]
Now everybody is using this approach.
[DAN1] J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First
superhuman visual pattern recognition. At the IJCNN 2011 computer vision competition in
Silicon Valley, the artificial neural network called DanNet achieved half the error rate of
humans, a third of that of the closest artificial competitor (from LeCun's team), and a sixth of
that of the best non-neural method.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2021, 2022). The 2010s: Our Decade of
Deep Learning / Outlook on the 2020s. The recent decade's most important developments and
industrial applications based on the AI of Schmidhuber's team, with an outlook on the 2020s,
also addressing privacy and data markets.
[DEEP1a] A. G. Ivakhnenko. The group method of data handling; a rival of the method of
stochastic approximation. Soviet Automatic Control 13 (1968): 43-55.
[DL1] J. Schmidhuber, 2015. Deep learning in neural networks: An overview. Neural Networks,
61, 85-117. More. Got the first Best Paper Award ever issued by the journal Neural Networks,
founded in 1988.
[DL3] Y. LeCun, Y. Bengio, G. Hinton (2015). Deep Learning. Nature 521, 436-444. HTML. A
"survey" of deep learning that does not mention the pioneering works of deep learning [T22].
[DL3a] Y. Bengio, Y. LeCun, G. Hinton (2021). Turing Lecture: Deep Learning for AI.
Communications of the ACM, July 2021. HTML. Local copy (HTML only). Another "survey" of
deep learning that does not mention the pioneering works of deep learning [T22].
[DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world's most valuable public
companies: Apple, Google, Microsoft, Facebook, Amazon... By 2015-17, neural nets
developed in Schmidhuber's labs were on over 3 billion devices such as smartphones, and
used many billions of times per day, consuming a significant fraction of the world's compute.
Examples: greatly improved (CTC-based) speech recognition on all Android phones, greatly
improved machine translation through Google Translate and Facebook (over 4 billion LSTM-
based translations per day), Apple's Siri and Quicktype on all iPhones, the answers of
Amazon's Alexa, etc. Google's 2019 on-device speech recognition (on the phone, not the
server) is still based on LSTM.
[DL6] F. Gomez and J. Schmidhuber. Co-evolving recurrent neurons learn deep memory
POMDPs. In Proc. GECCO'05, Washington, D. C., pp. 1795-1802, ACM Press, New York, NY,
USA, 2005. PDF.
[DL6a] J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep"
in the title (2005). The deep reinforcement learning & neuroevolution developed in
Schmidhuber's lab solved problems of depth 1000 and more.[DL6] Soon after its publication,
everybody started talking about "deep learning." Causality or correlation?
[DL7] "Deep Learning ... moving beyond shallow machine learning since 2006!" Web site
deeplearning.net of Y. Bengio's MILA (2015, retrieved May 2020; compare the version in the
Internet Archive), referring to Hinton's[UN4] and Bengio's[UN5] unsupervised pre-training for deep
NNs[UN4] (2006) although this type of deep learning dates back to Schmidhuber's work of 1991.
[UN1-2][UN]
[DLC] J. Schmidhuber (AI Blog, June 2015). Critique of Paper by self-proclaimed[DLC2] "Deep
Learning Conspiracy" (Nature 521 p 436). The inventor of an important method should get
credit for inventing it. She may not always be the one who popularizes it. Then the popularizer
should get credit for popularizing it (but not for inventing it). More on this under [T22].
[DLC1] Y. LeCun. IEEE Spectrum Interview by L. Gomes, Feb 2015. Quote: "A lot of us
involved in the resurgence of Deep Learning in the mid-2000s, including Geoff Hinton, Yoshua
Bengio, and myself—the so-called 'Deep Learning conspiracy' ..."
[DLC2] M. Bergen, K. Wagner (2015). Welcome to the AI Conspiracy: The 'Canadian Mafia'
Behind Tech's Latest Craze. Vox recode, 15 July 2015. Quote: "... referred to themselves as
the 'deep learning conspiracy.' Others called them the 'Canadian Mafia.'"
[DLH] J. Schmidhuber (AI Blog, 2022). Annotated History of Modern AI and Deep Learning.
Technical Report IDSIA-22-22, IDSIA, Lugano, Switzerland, 2022. Preprint arXiv:2212.11279.
Tweet of 2022.
[DM2] V. Mnih, K. Kavukcuoglu, D. Silver, et al. Human-level control through deep
reinforcement learning. Nature, vol. 518, p. 529-533, 2015. Its abstract claims: "While
reinforcement learning agents have achieved some
successes in a variety of domains, their applicability has previously been limited to domains in
which useful features can be handcrafted, or to domains with fully observed, low-dimensional
state spaces." It also claims to bridge "the divide between high-dimensional sensory inputs and
actions." Similarly, the first sentence of the abstract of the earlier tech report version[DM1] of
[DM2] claims to "present the first deep learning model to successfully learn control policies
directly from high-dimensional sensory input using reinforcement learning." However, the first
such system (requiring no unsupervised pre-training) was created earlier by Jan Koutnik et al.
in Schmidhuber's lab.[CO2] DeepMind was co-founded by Shane Legg, a PhD student from this
lab; he and Daan Wierstra (another PhD student of Schmidhuber and DeepMind's 1st
employee) were the first persons at DeepMind who had AI publications and PhDs in computer
science. More.
[DM2a] D. Silver et al. A general reinforcement learning algorithm that masters chess, Shogi,
and Go through self-play. Science 362.6419:1140-1144, 2018.
[DM3] S. Stanford. DeepMind's AI, AlphaStar Showcases Significant Progress Towards AGI.
Medium ML Memoirs, 2019. AlphaStar has a "deep LSTM core."
[DIF4] O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional networks for biomedical image
segmentation. In MICCAI (3), vol. 9351 of Lecture Notes in Computer Science, pages 234-241.
Springer, 2015.
[DIF5] J. Ho, A. Jain, P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural
Information Processing Systems 33:6840-6851, 2020.
[DNC] A. Graves, G. Wayne, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538:7626, p
471, 2016. This work of DeepMind did not cite the original work of the early 1990s on neural
networks learning to control dynamic external memories.[PDA1-2][FWP0-1]
[Drop1] S. J. Hanson (1990). A Stochastic Version of the Delta Rule, PHYSICA D,42, 265-272.
What's now called "dropout" is a variation of the stochastic delta rule—compare preprint
arXiv:1808.03578, 2018.
[Drop2] N. Frazier-Logue, S. J. Hanson (2020). The Stochastic Delta Rule: Faster and More
Accurate Deep Learning Through Adaptive Weight Noise. Neural Computation 32(5):1018-
1032.
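To make the relation described in [Drop1-2] concrete, here is a minimal NumPy sketch of my own (an illustration under stated assumptions, not code from the cited papers): a stochastic-delta-rule-style layer samples each weight from a Gaussian with learnable mean and spread, while standard dropout multiplies units by a Bernoulli mask, i.e., injects binary rather than Gaussian noise. All function and variable names are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

def sdr_layer(x, mu, sigma):
    # Stochastic-delta-rule-style forward pass: each weight is drawn from
    # a Gaussian whose mean mu and spread sigma are learnable parameters
    # (only the sampling step is sketched here).
    w = mu + sigma * rng.standard_normal(mu.shape)
    return x @ w

def dropout_layer(x, w, p_keep=0.8):
    # Standard (inverted) dropout: units are multiplied by a Bernoulli mask,
    # i.e., binary rather than Gaussian noise is injected.
    mask = rng.random(x.shape) < p_keep
    return (x * mask / p_keep) @ w

x = rng.standard_normal((4, 3))            # a batch of 4 inputs with 3 features
mu = 0.1 * rng.standard_normal((3, 2))     # weight means
sigma = np.full((3, 2), 0.05)              # weight spreads
print(sdr_layer(x, mu, sigma).shape)       # (4, 2)
print(dropout_layer(x, mu).shape)          # (4, 2)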
[DYNA90] R. S. Sutton (1990). Integrated Architectures for Learning, Planning, and Reacting
Based on Approximating Dynamic Programming. Machine Learning Proceedings 1990, of the
Seventh International Conference, Austin, Texas, June 21-23, 1990, p 216-224.
[DYNA91] R. S. Sutton (1991). Dyna, an integrated architecture for learning, planning, and
reacting. ACM Sigart Bulletin 2.4 (1991):160-163.
[ELM1] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew. Extreme learning machine: A new learning
scheme of feedforward neural networks. Proc. IEEE Int. Joint Conf. on Neural Networks, Vol.
2, 2004, pp. 985-990. This paper does not mention that the "ELM" concept goes back to
Rosenblatt's work in the 1950s.[R62][T22]
[ELM2] ELM-ORIGIN, 2004. The Official Homepage on Origins of Extreme Learning Machines
(ELM). "Extreme Learning Machine Duplicates Others' Papers from 1988-2007." Local copy.
This overview does not mention that the "ELM" concept goes back to Rosenblatt's work in the
1950s.[R62][T22]
[ENS1] R. E. Schapire. The strength of weak learnability. Machine Learning, 5:197-227, 1990.
[EVO2] L. Fogel, A. Owens, M. Walsh. Artificial Intelligence through Simulated Evolution. Wiley,
New York, 1966.
[EVO6] S. F. Smith. A Learning System Based on Genetic Adaptive Algorithms, PhD Thesis,
Univ. Pittsburgh, 1980
[EVONN1] Montana, D. J. and Davis, L. (1989). Training feedforward neural networks using
genetic algorithms. In Proceedings of the 11th International Joint Conference on Artificial
Intelligence (IJCAI)—Volume 1, IJCAI'89, pages 762–767, San Francisco, CA, USA. Morgan
Kaufmann Publishers Inc.
[EVONN2] Miller, G., Todd, P., and Hedge, S. (1989). Designing neural networks using genetic
algorithms. In Proceedings of the 3rd International Conference on Genetic Algorithms, pages
379–384. Morgan Kauffman.
[EVONN3] H. Kitano. Designing neural networks using genetic algorithms with graph
generation system. Complex Systems, 4:461-476, 1990.
[FA15] Intelligente Roboter werden vom Leben fasziniert sein. (Intelligent robots will be
fascinated by life.) FAZ, 1 Dec 2015. Link.
[FAKE] H. Hopf, A. Krief, G. Mehta, S. A. Matlin. Fake science and the knowledge crisis:
ignorance can be fatal. Royal Society Open Science, May 2019. Quote: "Scientists must be
willing to speak out when they see false information being presented in social media,
traditional print or broadcast press" and "must speak out against false information and fake
science in circulation and forcefully contradict public figures who promote it."
[FAKE2] L. Stenflo. Intelligent plagiarists are the most dangerous. Nature, vol. 427, p. 777 (Feb
2004). Quote: "What is worse, in my opinion, ..., are cases where scientists rewrite previous
findings in different words, purposely hiding the sources of their ideas, and then during
subsequent years forcefully claim that they have discovered new phenomena."
[FAST] C. v.d. Malsburg. Tech Report 81-2, Abteilung f. Neurobiologie, Max-Planck Institut f.
Biophysik und Chemie, Goettingen, 1981. First paper on fast weights or dynamic links.
[FASTb] G. E. Hinton, D. C. Plaut. Using fast weights to deblur old memories. Proc. 9th annual
conference of the Cognitive Science Society (pp. 177-186), 1987. 3rd paper on fast weights
(two types of weights with different learning rates).
[FB17] By 2017, Facebook used LSTM to handle over 4 billion automatic translations per day
(The Verge, August 4, 2017); see also Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan
(August 3, 2017)
[FDL] J. Schmidhuber (AI Blog, 2013). My First Deep Learning System of 1991 + Deep
Learning Timeline 1960-2013.
[FEI63] E. A. Feigenbaum, J. Feldman. Computers and thought. McGraw-Hill: New York, 1963.
[FM] S. Hochreiter and J. Schmidhuber. Flat minimum search finds simple nets. Technical
Report FKI-200-94, Fakultät für Informatik, Technische Universität München, December 1994.
PDF.
[FU77] K. S. Fu. Syntactic Pattern Recognition and Applications. Berlin, Springer, 1977.
[FWP] J. Schmidhuber (AI Blog, 26 March 2021, updated 2022). 26 March 1991: Neural nets
learn to program neural nets with fast weights—like Transformer variants. 2021: New stuff! 30-
year anniversary of a now popular alternative[FWP0-1] to recurrent NNs. A slow feedforward NN
learns by gradient descent to program the changes of the fast weights[FAST,FASTa,b] of another NN,
separating memory and control like in traditional computers. Such Fast Weight
Programmers[FWP0-6,FWPMETA1-8] can learn to memorize past data, e.g., by computing fast weight
changes through additive outer products of self-invented activation patterns[FWP0-1] (now often
called keys and values for self-attention[TR1-6]). The similar Transformers[TR1-2] combine this with
projections and softmax and are now widely used in natural language processing. For long
input sequences, their efficiency was improved through Transformers with linearized self-
attention[TR5-6] which are formally equivalent to Schmidhuber's 1991 outer product-based Fast
Weight Programmers (apart from normalization). In 1993, he introduced the attention
terminology[FWP2] now used in this context,[ATT] and extended the approach to RNNs that
program themselves. See tweet of 2022.
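As a purely illustrative toy example (my own NumPy sketch, not the original 1991 code), the outer-product Fast Weight Programmer described above can be written in a few lines: key and value patterns, which a trained slow net would normally generate, program a fast weight matrix through additive outer products, and a query retrieves stored associations. Dropping the softmax from standard self-attention yields essentially this computation, which is why Transformers with linearized self-attention are formally equivalent to it (up to normalization). All names below are hypothetical.

import numpy as np

rng = np.random.default_rng(1)
d = 8                                   # size of the activation patterns
W_fast = np.zeros((d, d))               # fast weights, programmed at run time

def program(W_fast, key, value):
    # One additive outer-product update: the fast net's weights are changed
    # by value (x) key, as in linearized self-attention without softmax.
    return W_fast + np.outer(value, key)

def retrieve(W_fast, query):
    # Querying the fast weight matrix retrieves a stored association.
    return W_fast @ query

# Random stand-ins for the patterns a trained slow net would emit.
keys = rng.standard_normal((5, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)   # unit-norm keys
values = rng.standard_normal((5, d))
for k, v in zip(keys, values):
    W_fast = program(W_fast, k, v)

# Retrieves values[2] plus crosstalk from the other stored pairs; exact
# retrieval would require orthogonal keys or extra normalization.
print(np.round(retrieve(W_fast, keys[2]) - values[2], 2))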
[FWP2] J. Schmidhuber. Reducing the ratio between learning complexity and number of time-
varying variables in fully recurrent nets. In Proceedings of the International Conference on
Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993. PDF. First recurrent
NN-based fast weight programmer using outer products, introducing the terminology of
learning "internal spotlights of attention."
[FWP3] I. Schlag, J. Schmidhuber. Gated Fast Weights for On-The-Fly Neural Program
Generation. Workshop on Meta-Learning, @N(eur)IPS 2017, Long Beach, CA, USA.
[FWP3a] I. Schlag, J. Schmidhuber. Learning to Reason with Third Order Tensor Products.
Advances in Neural Information Processing Systems (N(eur)IPS), Montreal, 2018. Preprint:
arXiv:1811.12143. PDF.
[FWP4a] J. Ba, G. Hinton, V. Mnih, J. Z. Leibo, C. Ionescu. Using Fast Weights to Attend to the
Recent Past. NIPS 2016. PDF. Very similar to [FWP0-2], in both motivation [FWP2] and
execution.
[FWP5] F. J. Gomez and J. Schmidhuber. Evolving modular fast-weight networks for control. In
W. Duch et al. (Eds.): Proc. ICANN'05, LNCS 3697, pp. 383-389, Springer-Verlag Berlin
Heidelberg, 2005. PDF. HTML overview. Reinforcement-learning fast weight programmer.
[FWP6] I. Schlag, K. Irie, J. Schmidhuber. Linear Transformers Are Secretly Fast Weight
Programmers. ICML 2021. Preprint: arXiv:2102.11174.
[FWP7] K. Irie, I. Schlag, R. Csordas, J. Schmidhuber. Going Beyond Linear Transformers with
Recurrent Fast Weight Programmers. Preprint: arXiv:2106.06295 (June 2021).
1993. PDF.
[FWPMETA3] J. Schmidhuber. An introspective network that can learn to run its own weight
change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks, Brighton, pages 191-
195. IEE, 1993.
[FWPMETA4] J. Schmidhuber. A neural network that embeds its own meta-levels. In Proc. of
the International Conference on Neural Networks '93, San Francisco. IEEE, 1993.
[FWPMETA5] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. A recurrent neural net with
a self-referential, self-reading, self-modifying weight matrix can be found here.
[FWPMETA6] L. Kirsch and J. Schmidhuber. Meta Learning Backpropagation & Improving It.
Advances in Neural Information Processing Systems (NeurIPS), 2021. Preprint
arXiv:2012.14905 [cs.LG], 2020.
[GD'] C. Lemarechal. Cauchy and the Gradient Method. Documenta Mathematica, Extra Volume ISMP, pp. 251-254, 2012.
[GD''] J. Hadamard. Memoire sur le probleme d'analyse relatif a l'equilibre des plaques
elastiques encastrees. Memoires presentes par divers savants etrangers à l'Academie des
Sciences de l'Institut de France, 33, 1908.
[GDb] Y. Z. Tsypkin (1971). Adaptation and Learning in Automatic Systems, Academic Press,
1971. On gradient descent-based on-line learning for non-linear systems.
[GD1] S. I. Amari (1967). A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, EC-16, 279-307
(Japanese version published in 1965). PDF. Probably the first paper on using stochastic
gradient descent[STO51-52] for learning in multilayer neural networks (without specifying the
specific gradient descent method now known as reverse mode of automatic differentiation or
backpropagation[BP1]).
[GD2a] H. Saito (1967). Master's thesis, Graduate School of Engineering, Kyushu University,
Japan. Implementation of Amari's 1967 stochastic gradient descent method for multilayer
perceptrons.[GD1] (S. Amari, personal communication, 2021.)
[GD3] S. I. Amari (1977). Neural Theory of Association and Concept Formation. Biological
Cybernetics, vol. 26, p. 175-185, 1977. See Section 3.1 on using gradient descent for learning
in multilayer networks.
[GLA85] W. Glasser. Control theory. New York: Harper and Row, 1985.
[GM6] J. Schmidhuber (2006). Gödel machines: Fully Self-Referential Optimal Universal Self-
Improvers. In B. Goertzel and C. Pennachin, eds.: Artificial General Intelligence, p. 199-226,
2006. PDF.
[GT16] Google's dramatically improved Google Translate of 2016 is based on LSTM, e.g.,
WIRED, Sep 2016, or siliconANGLE, Sep 2016
[GAN0] O. Niemitalo. A method for training artificial neural networks to generate missing data
within a variable context. Blog post, Internet Archive, 2010. A blog post describing basic
ideas[AC][AC90,AC90b][AC20] of GANs.
[GOD] K. Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und
verwandter Systeme I (On formally undecidable propositions of Principia Mathematica and
related systems I). Monatshefte für Mathematik und Physik, 38:173-198, 1931. In the early
1930s, Gödel founded theoretical computer science. He identified fundamental limits of
mathematics, theorem proving, computing, and Artificial Intelligence.
[GOD56] R. J. Lipton and K. W. Regan. Gödel's lost letter and P=NP. Link. Gödel identified
what's now the most famous open problem of computer science.
[GOD86] K. Gödel. Collected works Volume I: Publications 1929-36, S. Feferman et. al.,
editors, Oxford Univ. Press, Oxford, 1986.
[GOD10] V. C. Nadkarni. Gödel, Einstein and proof for God. The Economic Times, 2010.
[GOD21] J. Schmidhuber (2021). 90th anniversary celebrations: 1931: Kurt Gödel, founder of
theoretical computer science, shows limits of math, logic, computing, and artificial intelligence.
This was number 1 on Hacker News.
[GOD21a] J. Schmidhuber (2021). Als Kurt Gödel die Grenzen des Berechenbaren entdeckte.
(When Kurt Gödel discovered the limits of computability.) Frankfurter Allgemeine Zeitung,
16/6/2021.
[GOD21b] J. Schmidhuber (AI Blog, 2021). 80. Jahrestag: 1931: Kurt Gödel, Vater der
theoretischen Informatik, entdeckt die Grenzen des Berechenbaren und der künstlichen
Intelligenz. (80th anniversary: 1931: Kurt Gödel, father of theoretical computer science,
discovers the limits of the computable and of artificial intelligence.)
[GPUNN] Oh, K.-S. and Jung, K. (2004). GPU implementation of neural networks. Pattern
Recognition, 37(6):1311-1314. Speeding up traditional NNs on GPU by a factor of 20.
[GPUCNN5] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet):
History of computer vision contests won by deep CNNs since 2011. DanNet was the first CNN
to win one, and won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a
Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer
vision.
[GPUCNN6] J. Schmidhuber, D. Ciresan, U. Meier, J. Masci, A. Graves. On Fast Deep Nets for
AGI Vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI-11), Google,
Mountain View, California, 2011. PDF.
[GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet). First
deep learner to win a contest on object detection in large images—first deep learner to win a
medical imaging contest (2012). Link. How the Swiss AI Lab IDSIA used GPU-based CNNs to
win the ICPR 2012 Contest on Mitosis Detection and the MICCAI 2013 Grand Challenge.
[GRO69] S. Grossberg. Some networks that can learn, remember, and reproduce any number
of complicated space-time patterns, Indiana University Journal of Mathematics and Mechanics,
19:53-91, 1969.
[H86] J. L. van Hemmen (1986). Spin-glass models of a neural network. Phys. Rev. A 34,
3435, 1 Oct 1986.
[H88] H. Sompolinsky (1988). Statistical Mechanics of Neural Networks. Physics Today 41, 12,
70, 1988.
[HAB1] V. Smil. Detonator of the population explosion. Nature, July 29 1999, p 415.
[HAB2] J. Schmidhuber (Blog, 2000). Most influential persons of the 20th century (according
to Nature, 1999). The Haber-Bosch process has often been called the most important
invention of the 20th century[HAB1] as it "detonated the population explosion," driving the world's
population from 1.6 billion in 1900 to about 8 billion today. Without it, half of humanity would
perish. Billions of people would never have existed without it; soon it will sustain 2 out of 3
persons.
[HAI14] T. Haigh (2014). Historical reflections. Actually, Turing did not invent the computer.
Communications of the ACM, Vol. 57(1): 36-41, Jan 2014. PDF.
[HEL] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The Helmholtz machine. Neural
Computation, 7:889-904, 1995. An unsupervised learning algorithm related to Schmidhuber's
supervised Neural Heat Exchanger.[NHE]
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not
allow corporate PR to distort the academic record. See also [T22].
[HO66] E. Hochstetter et al. (1966): Herrn von Leibniz' Rechnung mit Null und Eins (Mr. von
Leibniz's calculation with zero and one). Berlin: Siemens AG.
[HRL2] J. Schmidhuber and R. Wahnsiedler. Planning simple trajectories using neural subgoal
generators. In J. A. Meyer, H. L. Roitblat, and S. W. Wilson, editors, Proc. of the 2nd
International Conference on Simulation of Adaptive Behavior, pages 196-202. MIT Press,
1992. PDF.
[HW2] He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition.
Preprint arXiv:1512.03385 (Dec 2015). Residual nets are a version of Highway Nets[HW1] where
the gates are always open: g(x)=1 (a typical highway net initialization) and t(x)=1. More.
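To spell out the relation in the entry's own notation (a sketch of the standard formulation, not a quote from [HW2]): a Highway layer gates both the transformed signal h(x) and the identity path x; in the original Highway Net the carry gate is often tied to 1 - t(x), and fixing both gates to 1 removes the gating and leaves a plain residual connection.

% Highway layer with transform gate t and carry gate g (often g = 1 - t):
y = t(x) \odot h(x) + g(x) \odot x
% Always-open gates t(x) = 1 and g(x) = 1 give the residual (ResNet) case:
y = h(x) + x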
[HYB12] Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A.,
Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012). Deep neural networks for
acoustic modeling in speech recognition: The shared views of four research groups. IEEE
Signal Process. Mag., 29(6):82-97. This work did not cite the earlier LSTM[LSTM0-6] trained by
Connectionist Temporal Classification (CTC, 2006).[CTC] CTC-LSTM was successfully applied to
speech in 2007[LSTM4] (also with hierarchical LSTM stacks[LSTM14]) and became the first superior
end-to-end neural speech recogniser that outperformed the state of the art, dramatically
improving Google's speech recognition.[GSR][GSR15][DL4] This was very different from previous
hybrid methods since the late 1980s which combined NNs and traditional approaches such as
hidden Markov models (HMMs).[BW][BRI][BOU] [HYB12] still used the old hybrid approach and did
not compare it to CTC-LSTM. Later, however, Hinton switched to LSTM, too.[LSTM8]
[I24] E. Ising (1924). Beitrag zur Theorie des Ferro- und Paramagnetismus (Contribution to the
theory of ferro- and paramagnetism). Dissertation, 1924.
[I25] E. Ising (1925). Beitrag zur Theorie des Ferromagnetismus (Contribution to the theory of
ferromagnetism). Z. Phys., 31 (1): 253-258,
1925. Based on [I24]. The first non-learning recurrent NN architecture (the Ising model or
Lenz-Ising model) was introduced and analyzed by physicists Ernst Ising and Wilhelm Lenz in
the 1920s.[L20][I25][K41][W45][T22] It settles into an equilibrium state in response to input conditions,
and is the foundation of the first well-known learning RNNs.[AMH1-2]
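For readers unfamiliar with the model, a textbook form of the Lenz-Ising energy (my own summary, not a quote from [I25]) over binary spins s_i in {-1,+1} with couplings J_ij and external fields h_i is given below, together with the deterministic threshold dynamics later used in the Amari-Hopfield learning RNNs[AMH1-2] that build on this architecture; for symmetric couplings, repeated asynchronous updates never increase the energy, which is why the network settles into an equilibrium.

% Lenz-Ising energy (textbook form):
E(s) = -\sum_{i<j} J_{ij}\, s_i s_j - \sum_i h_i s_i
% Zero-temperature (deterministic) update, as in Amari-Hopfield nets:
s_i \leftarrow \mathrm{sign}\Big(\sum_j J_{ij}\, s_j + h_i\Big)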
[IC49] DE 833366 W. Jacobi / SIEMENS AG: Halbleiterverstärker (semiconductor amplifier)
patent, filed 14 April 1949, granted 15 May 1952. First integrated circuit with several transistors
on a common substrate.
[IC14] @CHM Blog, Computer History Museum (2014). Who Invented the IC?
[IM09] J. Deng, R. Socher, L.J. Li, K. Li, L. Fei-Fei (2009). Imagenet: A large-scale hierarchical
image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp.
248-255.
[JOU17] Jouppi et al. (2017). In-Datacenter Performance Analysis of a Tensor Processing Unit.
Preprint arXiv:1704.04760
[K56] S.C. Kleene. Representation of Events in Nerve Nets and Finite Automata. Automata
Studies, Editors: C.E. Shannon and J. McCarthy, Princeton University Press, p. 3-42,
Princeton, N.J., 1956.
[KAL59] R. Kalman. On the general theory of control systems. IRE Transactions on Automatic
Control 4.3 (1959): 110-110.
[KAL59a] R. Kalman, J. Bertram. General approach to control theory based on the methods of
Lyapunov. IRE Transactions on Automatic Control 4.3 (1959):20-20.
[KO2] J. Schmidhuber. Discovering neural nets with low Kolmogorov complexity and high
generalization capability. Neural Networks, 10(5):857-873, 1997. PDF.
[KU] A. Küchler & C. Goller (1996). Inductive learning in symbolic domains using structure-
driven recurrent neural networks. Lecture Notes in Artificial Intelligence, vol 1137. Springer,
Berlin, Heidelberg.
[L84] G. Leibniz (1684). Nova Methodus pro Maximis et Minimis (New method for maxima and
minima). First publication of "modern" infinitesimal calculus.
[L20] W. Lenz (1920). Beitrag zum Verständnis der magnetischen Erscheinungen in festen
Körpern (Contribution to the understanding of magnetic phenomena in solid bodies).
Physikalische Zeitschrift, 21:613-615. See also [I25].
[LA14] D. R. Lande (2014). Development of the Binary Number System and the Foundations of
Computer Science. The Mathematics Enthusiast, vol. 11(3):6 12, 2014. Link.
[LE18] W. Lenzen. Leibniz and the Calculus Ratiocinator. Technology and Mathematics, pp 47-
78, Springer, 2018.
[LEC] J. Schmidhuber (AI Blog, 2022). LeCun's 2022 paper on autonomous machine
intelligence rehashes but does not cite essential work of 1990-2015. Years ago, Schmidhuber's
team published most of what Y. LeCun calls his "main original contributions:" neural nets that
learn multiple time scales and levels of abstraction, generate subgoals, use intrinsic motivation
to improve world models, and plan (1990); controllers that learn informative predictable
representations (1997), etc. This was also discussed on Hacker News, reddit, and in the
media. See tweet1. LeCun also listed the "5 best ideas 2012-2022" without mentioning that
most of them are from Schmidhuber's lab, and older. See tweet2.
[LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer
science.
[LEI21a] J. Schmidhuber (2021). Der erste Informatiker. Wie Gottfried Wilhelm Leibniz den
Computer erdachte. (The first computer scientist. How Gottfried Wilhelm Leibniz conceived the
computer.) Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online: 19/5/2021.
[LEI21b] J. Schmidhuber (AI Blog, 2021). 375. Geburtstag des Herrn Leibniz, dem Vater der
Informatik. (375th birthday of Mr. Leibniz, the father of computer science.)
[LEN83] D. B. Lenat. Theory formation by heuristic search. Machine Learning, 21, 1983.
[LIL1] US Patent 1745175 by Austrian physicist Julius Edgar Lilienfeld for work done in Leipzig:
"Method and apparatus for controlling electric current." First filed in Canada on 22.10.1925.
The patent describes a field-effect transistor. Today, almost all transistors are field-effect
transistors.
[LIL2] US Patent 1900018 by Austrian physicist Julius Edgar Lilienfeld: "Device for controlling
electric current." Filed on 28.03.1928. The patent describes a thin film field-effect transistor.
Today, almost all transistors are field-effect transistors.
[LIT21] M. L. Littman (2021). Collusion Rings Threaten the Integrity of Computer Science
Research. Communications of the ACM, Vol. 64 No. 6, p. 43-44, June 2021.
[LSTM12] D. Wierstra, F. Gomez, J. Schmidhuber. Modeling systems with internal state using
Evolino. In Proc. of the 2005 conference on genetic and evolutionary computation (GECCO),
Washington, D. C., pp. 1795-1802, ACM Press, New York, NY, USA, 2005. Got a GECCO best
paper award.
[LSTM13] F. A. Gers and J. Schmidhuber. LSTM Recurrent Networks Learn Simple Context
Free and Context Sensitive Languages. IEEE Transactions on Neural Networks 12(6):1333-
1340, 2001. PDF.
[LSTMPG] J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on
deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent famous
applications: DeepMind's Starcraft player (2019) and OpenAI's dextrous robot hand & Dota
player (2018)—Bill Gates called this a huge milestone in advancing AI.
the 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2002),
Lausanne, 2002. PDF.
[M69] M. Minsky, S. Papert. Perceptrons (MIT Press, Cambridge, MA, 1969). A misleading
"history of deep learning" goes more or less like this: "In 1969, Minsky & Papert[M69] showed
that shallow NNs without hidden layers are very limited and the field was abandoned until a
new generation of neural network researchers took a fresh look at the problem in the 1980s."
[S20]
However, the 1969 book[M69] addressed a "problem" of Gauss & Legendre's shallow
learning (circa 1800)[DL1-2] that had already been solved 4 years prior by Ivakhnenko & Lapa's
popular deep learning method,[DEEP1-2][DL2] and then also by Amari's SGD for MLPs.[GD1-2] Minsky
was apparently unaware of this and failed to correct it later.[HIN](Sec. I)[T22](Sec. XIII)
[MAR71] D. Marr. Simple memory: a theory for archicortex. Philos Trans R Soc Lond B Biol
Sci, 262:841, p 23-81, 1971.
[MAD05] Neither Newton nor Leibniz—The Pre-History of Calculus and Celestial Mechanics in
Medieval Kerala. S. Rajeev, Univ. of Rochester, 2005.
[META] J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of first publication on
metalearning machines that learn to learn (1987). For its cover Schmidhuber drew a robot that
bootstraps itself. 1992-: gradient descent-based neural metalearning. 1994-: Meta-
Reinforcement Learning with self-modifying policies. 1997: Meta-RL plus artificial curiosity and
intrinsic motivation. 2002-: asymptotically optimal metalearning for curriculum learning. 2003-:
mathematically optimal Gödel Machine. 2020: new stuff!
[MGC] MICCAI 2013 Grand Challenge on Mitosis Detection, organised by M. Veta, M.A.
Viergever, J.P.W. Pluim, N. Stathonikos, P. J. van Diest of University Medical Center Utrecht.
[MIR] J. Schmidhuber (AI Blog, Oct 2019, updated 2021, 2022). Deep Learning: Our
Miraculous Year 1990-1991. Preprint arXiv:2005.05744, 2020. The deep learning neural
networks of Schmidhuber's team have revolutionised pattern recognition and machine
learning, and are now heavily used in academia and industry. In 2020-21, we celebrate that
many of the basic ideas behind this revolution were published within fewer than 12 months in
the "Annus Mirabilis" 1990-1991 at TU Munich.
[MLP2] J. Schmidhuber (AI Blog, Sep 2020). 10-year anniversary of supervised deep learning
breakthrough (2010). No unsupervised pre-training. By 2010, when compute was 100 times
more expensive than today, both the feedforward NNs[MLP1] and the earlier recurrent NNs of
Schmidhuber's team were able to beat all competing algorithms on important problems of that
time. This deep learning revolution quickly spread from Europe to North America and Asia. The
rest is history.
[MM1] R. Stratonovich (1960). Conditional Markov processes. Theory of Probability and its
Applications, 5(2):156-178.
[MM2] L. E. Baum, T. Petrie. Statistical inference for probabilistic functions of finite state
Markov chains. The Annals of Mathematical Statistics, pages 1554-1563, 1966.
[MM3] A. P. Dempster, N. M. Laird, D. B. Rubin. Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society, B, 39, 1977.
[MOC1] N. Metropolis, S. Ulam. The Monte Carlo method. Journal of the American Statistical
Association. 44 (247): 335-341, 1949.
[MOC3] B. Brügmann. Monte Carlo Go. Technical report, Department of Physics, Syracuse
University, 1993.
[MOC5] R. Coulom. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search.
Computers and Games, 5th International Conference, CG 2006, Turin, Italy, May 29–31,
2006.
[MOST] J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work
done in my labs. Foundations of the most popular NNs originated in Schmidhuber's labs at TU
Munich and IDSIA. (1) Long Short-Term Memory (LSTM), (2) ResNet (which is the earlier
Highway Net with open gates), (3) AlexNet and VGG Net (both building on the similar earlier
DanNet: the first deep convolutional NN to win image recognition competitions), (4) Generative
Adversarial Networks (an instance of the much earlier Adversarial Artificial Curiosity), and (5)
variants of Transformers (Transformers with linearized self-attention are formally equivalent to
the much earlier Fast Weight Programmers). Most of this started with the Annus Mirabilis of
1990-1991.[MIR]
[NAN3] J. Schmidhuber. Recurrent networks adjusted by adaptive critics. In Proc. IEEE/INNS International Joint
Conference on Neural Networks, Washington, D. C., volume 1, pages 719-722, 1990.
[NAS] B. Zoph, Q. V. Le. Neural Architecture Search with Reinforcement Learning. Preprint
arXiv:1611.01578 (PDF), 2017. Compare the earlier Neural Architecture Search of Bayer et al.
(2009) for LSTM-like topologies.[LSTM7]
[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p
689, Feb 2003.
[NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p.
1759, March 2008.
[NASC4] J. Schmidhuber. Turing: Keep his work in perspective. Correspondence, Nature, vol
483, p 541, March 2012, doi:10.1038/483541b.
[NASC5] J. Schmidhuber. Turing in Context. Letter, Science, vol 336, p 1639, June 2012. (On
Gödel, Zuse, Turing.) See also comment on response by A. Hodges
(DOI:10.1126/science.336.6089.1639-a)
[NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence,
Nature, 441 p 25, May 2006.
[NASC7] J. Schmidhuber. Turing's impact. Correspondence, Nature, 429 p 501, June 2004
[NAT1] J. Schmidhuber. Citation bubble about to burst? Nature, vol. 469, p. 34, 6 January
2011. HTML.
[NDR] R. Csordas, K. Irie, J. Schmidhuber. The Neural Data Router: Adaptive Control Flow in
Transformers Improves Systematic Generalization. Proc. ICLR 2022. Preprint
arXiv/2110.07732.
[NHE] J. Schmidhuber. The Neural Heat Exchanger. Oral presentations since 1990 at various
universities including TUM and the University of Colorado at Boulder. Also in S. Amari, L.
Xu, L. Chan, I. King, K. Leung, eds., Proceedings of the Intl. Conference on Neural Information
Processing (1996), pages 194-197, Springer, Hongkong. Link. Proposal of a biologically more
plausible deep learning algorithm that—unlike backpropagation—is local in space and time.
Inspired by the physical heat exchanger: inputs "heat up" while being transformed through
many successive layers, targets enter from the other end of the deep pipeline and "cool down."
[NGU89] D. Nguyen and B. Widrow. The truck backer-upper: An example of self-learning in
neural networks. In IEEE/INNS International Joint Conference on Neural Networks,
Washington, D.C., volume 1, pages 357-364, 1989.
[NS56] A. Newell and H. Simon. The logic theory machine—A complex information processing
system. IRE Transactions on Information Theory 2.3 (1956):61-79. Compare Konrad Zuse's
much earlier 1948 work on theorem proving[ZU48] based on Plankalkül, the first high-level
programming language.[BAU][KNU]
[NS59] A. Newell, J. C. Shaw, H. Simon. Report on a general problem solving program. IFIP
congress, vol. 256, 1959.
[NYT1] NY Times article by J. Markoff, Nov. 27, 2016: When A.I. Matures, It May Call Jürgen
Schmidhuber 'Dad'
[OAI2a] J. Rodriguez. The Science Behind OpenAI Five that just Produced One of the
Greatest Breakthrough in the History of AI. Towards Data Science, 2018. An LSTM with 84% of
the model's total parameter count was the core of OpenAI Five.
[OMG0] P. T. De Chardin. The antiquity and world expansion of human culture. Man's Role in
Changing the Face of the Earth. Univ. Chicago Press, Chicago, p. 103-112, 1956.
[OMG] J. Schmidhuber. An exponential acceleration pattern in the history of the most important
events from a human perspective, reaching all the way back to the Big Bang. Link. Ask me
Anything, Reddit/ML, 2014.
[OMG2] S. Berg. GRM: Brainfuck. Kiepenheuer & Witsch, 2019. Got the Swiss Book Award
2019. Contains 2 pages on Schmidhuber's timeline of history's exponential acceleration since
the Big Bang.[OMG]
[OOPS2] J. Schmidhuber. Optimal Ordered Problem Solver. Machine Learning, 54, 211-254,
2004. PDF. HTML. HTML overview. Download OOPS source code in crystalline format.
[OUD13] P.-Y. Oudeyer, A. Baranes, F. Kaplan. Intrinsically motivated learning of real world
sensorimotor skills with developmental constraints. In G. Baldassarre and M. Mirolli, editors,
Intrinsically Motivated Learning in Natural and Artificial Systems. Springer, 2013.
[PDA1] G.Z. Sun, H.H. Chen, C.L. Giles, Y.C. Lee, D. Chen. Neural Networks with External
Memory Stack that Learn Context-Free Grammars from Examples. Proceedings of the 1990
Conference on Information Science and Systems, Vol.II, pp. 649-653, Princeton University,
Princeton, NJ, 1990.
[PDA2] M. Mozer, S. Das. A connectionist symbol manipulator that discovers the structure of
context-free languages. Proc. NIPS 1993.
[PG2] Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999a). Policy gradient
methods for reinforcement learning with function approximation. In Advances in Neural
Information Processing Systems (NIPS) 12, pages 1057-1063.
[PLAN] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement
learning with recurrent world models and artificial curiosity (1990). This work also introduced
high-dimensional reward signals, deterministic policy gradients for RNNs, and the GAN
principle (widely used today). Agents with adaptive recurrent world models even suggest a
simple explanation of consciousness & self-awareness.
[PLAN5] J. Schmidhuber. One Big Net For Everything. Preprint arXiv:1802.08864 [cs.AI], Feb 2018.
[PLAN6] D. Ha, J. Schmidhuber. Recurrent World Models Facilitate Policy Evolution. Advances
in Neural Information Processing Systems (NIPS), Montreal, 2018. (Talk.) Preprint:
arXiv:1809.01999. Github: World Models.
[R1] Reddit/ML, 2019. Hinton, LeCun, Bengio receive ACM Turing Award. This announcement
contains more comments about Schmidhuber than about any of the awardees.
[R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier
work by J. Schmidhuber.
[R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the
most cited deep learning research paper of the 20th century.
[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won
4 image recognition challenges prior to AlexNet.
[R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965.
[R9] Reddit/ML, 2019. We find it extremely unfair that Schmidhuber did not get the Turing
award. That is why we dedicate this song to Juergen to cheer him up.
[R11] Reddit/ML, 2020. Schmidhuber: Critique of Honda Prize for Dr. Hinton
[R12] Reddit/ML, 2020. J. Schmidhuber: Critique of Turing Award for Drs. Bengio & Hinton &
LeCun
[R15] Reddit/ML, 2021. J. Schmidhuber's work on fast weights from 1991 is similar to
linearized variants of Transformers
[R58] Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and
organization in the brain. Psychological Review, 65(6):386. This paper described not only
single-layer perceptrons but also deeper multilayer perceptrons (MLPs). These MLPs did not
yet achieve deep learning, because only the last layer learned;[DL1] nevertheless, Rosenblatt
basically had what much later was rebranded as Extreme Learning Machines (ELMs) without proper
attribution.[ELM1-2][CONN21][T22]
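A minimal sketch of this "only the last layer learns" setup (an illustration for this annotated history, not code from Rosenblatt's paper or the ELM literature): a fixed random hidden layer followed by a linear readout fitted by least squares.

```python
import numpy as np

# Minimal sketch of an MLP in which only the last layer learns (not code
# from the cited papers): a fixed random hidden layer plus a linear readout
# fitted by least squares -- the setup later rebranded as an ELM.

rng = np.random.default_rng(0)

def fit_last_layer_only(X, y, n_hidden=20):
    W_h = rng.normal(size=(X.shape[1], n_hidden))   # fixed, never trained
    b_h = rng.normal(size=n_hidden)
    H = np.tanh(X @ W_h + b_h)                      # hidden activations
    w_out, *_ = np.linalg.lstsq(H, y, rcond=None)   # only this layer is fitted
    return W_h, b_h, w_out

def predict(X, W_h, b_h, w_out):
    return np.tanh(X @ W_h + b_h) @ w_out

# Toy usage: XOR-like targets, not linearly separable in the input space.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])
print(predict(X, *fit_last_layer_only(X, y)).round(2))
```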
[R61] Joseph, R. D. (1961). Contributions to perceptron theory. PhD thesis, Cornell Univ.
[RAU1] M. Rausch. Heron von Alexandria. Die Automatentheater und die Erfindung der ersten
antiken Programmierung. (Heron of Alexandria. The automaton theatres and the invention of the
first ancient programming.) Diplomica Verlag GmbH, Hamburg 2012. Perhaps the world's first
programmable machine was an automatic theatre made in the 1st century by Heron of
Alexandria, who apparently also had the first known working steam engine.
[RCNN] R. Girshick, J. Donahue, T. Darrell, J. Malik. Rich feature hierarchies for accurate
object detection and semantic segmentation. Preprint arXiv/1311.2524, Nov 2013.
[RCNN2] R. Girshick. Fast R-CNN. Proc. of the IEEE international conference on computer
vision, p. 1440-1448, 2015.
[RO98] R. Rojas (1998). How to make Zuse's Z3 a universal computer. IEEE Annals of the
History of Computing, vol. 19:3, 1998.
[ROB87] A. J. Robinson, F. Fallside. The utility driven dynamic error propagation network.
Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department,
1987.
[S93] D. Sherrington (1993). Neural networks: the spin glass approach. North-Holland
Mathematical Library, vol 51, 1993, p. 261-291.
[S2S] I. Sutskever, O. Vinyals, Quoc V. Le. Sequence to sequence learning with neural
networks. In: Advances in Neural Information Processing Systems (NIPS), 2014, 3104-3112.
[S59] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM
Journal of Research and Development, 3:210-229, 1959.
[SA17] J. Schmidhuber. Falling Walls: The Past, Present and Future of Artificial Intelligence.
Scientific American, Observations, Nov 2017.
[SHA7a] N. Sharkey (2007). A programmable robot from AD 60. New Scientist, 2007.
[SHA7b] N. Sharkey (2007). A 13th Century Programmable Robot. Univ. of Sheffield, 2007. On
a programmable drum machine of 1206 by Al-Jazari.
[SHA37] C. E. Shannon (1938). A Symbolic Analysis of Relay and Switching Circuits. Trans.
AIEE. 57 (12): 713-723. Based on his thesis, MIT, 1937.
[SHA48] C. E. Shannon. A mathematical theory of communication (parts I and II). Bell System
Technical Journal, XXVII:379-423, 1948. Boltzmann's entropy formula (1877) seen from the
perspective of information transmission.
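The correspondence can be stated compactly (standard textbook notation, not the original papers' symbols):

```latex
% Boltzmann (1877): entropy of a macrostate realized by W equally likely microstates
S = k_B \ln W .

% Shannon (1948): entropy (in bits per symbol) of a source emitting symbol i with probability p_i
H = -\sum_i p_i \log_2 p_i .

% For W equally likely outcomes (p_i = 1/W) the two agree up to the constant factor k_B \ln 2:
H = \log_2 W , \qquad S = (k_B \ln 2)\, H .
```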
[SING2] V. Vinge. The coming technological singularity: How to survive in the post-human era.
Science fiction criticism: An anthology of essential writings. p. 352-363, 1993.
[SK75] D. Sherrington, S. Kirkpatrick (1975). Solvable Model of a Spin-Glass. Phys. Rev. Lett.
35, 1792, 1975.
[SKO23] T. Skolem (1923). Begründung der elementaren Arithmetik durch die rekurrierende
Denkweise ohne Anwendung scheinbarer Veränderlichen mit unendlichem Ausdehnungsbereich.
(The foundations of elementary arithmetic established by means of the recursive mode of
thought, without the use of apparent variables ranging over infinite domains.) Skrifter utgit
av Videnskapsselskapet i Kristiania, I. Mathematisk-Naturvidenskabelig Klasse 6 (1923), 38 pp.
[SMO13] L. Smolin (2013). My hero: Gottfried Wilhelm von Leibniz. The Guardian, 2013. Link.
Quote: "And this is just the one part of Leibniz's enormous legacy: the philosopher Stanley
Rosen called him 'the smartest person who ever lived'."
[SNT] J. Schmidhuber, S. Heil (1996). Sequential neural text compression. IEEE Trans. Neural
Networks, 1996. PDF. An earlier version appeared at NIPS 1995. Much later this was called a
probabilistic language model.[T22]
[SON18] T. Sonar. The History of the Priority Dispute between Newton and Leibniz.
Birkhaeuser, 2018.
[ST] J. Masci, U. Meier, D. Ciresan, G. Fricout, J. Schmidhuber. Steel Defect Classification with
Max-Pooling Convolutional Neural Networks. Proc. IJCNN 2012. PDF. Apparently, this was the
first deep learning breakthrough in heavy industry.
[ST61] K. Steinbuch. Die Lernmatrix. (The learning matrix.) Kybernetik, 1(1):36-45, 1961.
[ST95] W. Hilberg (1995). Karl Steinbuch, ein zu Unrecht vergessener Pionier der künstlichen
neuronalen Systeme. (Karl Steinbuch, an unjustly forgotten pioneer of artificial neural
systems.) Frequenz, vol. 49, no. 1-2, 1995.
[SP16] J. Schmidhuber interviewed by C. Stoecker: KI wird das All erobern. (AI will conquer the universe.)
SPIEGEL, 6 Feb 2016. Link.
[SP93] A. Sperduti (1993). Encoding Labeled Graphs by Labeling RAAM. NIPS 1993: 1125-1132.
One of the first papers on graph neural networks.
[SP94] A. Sperduti (1994). Labelling Recursive Auto-associative Memory. Connect. Sci. 6(4):
429-459, 1994.
[SPG95] A. Sperduti, A. Starita, C. Goller (1995). Learning Distributed Representations for the
Classification of Terms. IJCAI 1995: 509-517.
[SPG97] A. Sperduti, A. Starita (1997). Supervised neural networks for the classification of
structures. IEEE Trans. Neural Networks 8(3): 714-735, 1997.
[STI81] S. M. Stigler. Gauss and the Invention of Least Squares. Ann. Stat. 9(3):465-474,
1981.
[STI83] S. M. Stigler. Who Discovered Bayes' Theorem? The American Statistician. 37(4):290-
296, 1983. Bayes' theorem is actually Laplace's theorem or possibly Saunderson's theorem.
[STI85] S. M. Stigler (1986). Inverse Probability. The History of Statistics: The Measurement of
Uncertainty Before 1900. Harvard University Press, 1986.
[SV20] S. Vazire (2020). A toast to the error detectors. Let 2020 be the year in which we value
those who ensure that science is self-correcting. Nature, vol 577, p 9, 2/2/2020.
[SVM3] V. Vapnik. The nature of statistical learning theory. Springer Science & Business Media,
1999.
[T19] ACM's justification of the 2018 A.M. Turing Award (announced in 2019). WWW link. Local
copy 1 (HTML only). Local copy 2 (HTML only). [T22] debunks this justification.
[T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio
& Hinton & LeCun. A precursor of [T22].
[T22] J. Schmidhuber (AI Blog, 2022). Scientific Integrity and the History of Deep Learning:
The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA,
Lugano, Switzerland, 2022. Debunking [T19] and [DL3a].
[TD3] R. Sutton, A. Barto. Reinforcement learning: An introduction. Cambridge, MA, MIT Press,
1998.
[TR3] K. Tran, A. Bisazza, C. Monz. The Importance of Being Recurrent for Modeling
Hierarchical Structure. EMNLP 2018, p 4731-4736. ArXiv preprint 1803.03585.
[TUR1] A. M. Turing. Intelligent Machinery. Unpublished Technical Report, 1948. Link. In: Ince
DC, editor. Collected works of AM Turing—Mechanical Intelligence. Elsevier Science
Publishers, 1992.
[TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It's not Turing's fault, though.
[TUR3] G. Oppy, D. Dowe (2021). The Turing Test. Stanford Encyclopedia of Philosophy.
Quote: "it is sometimes suggested that the Turing Test is prefigured in Descartes' Discourse on
the Method. (Copeland (2000:527) finds an anticipation of the test in the 1668 writings of the
Cartesian de Cordemoy. Abramson (2011a) presents archival evidence that Turing was aware
of Descartes' language test at the time that he wrote his 1950 paper. Gunderson (1964)
provides an early instance of those who find that Turing's work is foreshadowed in the work of
Descartes.)"
[UN] J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with
unsupervised or self-supervised pre-training. Unsupervised hierarchical predictive coding (with
self-supervised target generation) finds compact internal representations of sequential data to
facilitate downstream deep learning. The hierarchy can be distilled into a single deep neural
network (suggesting a simple model of conscious and subconscious information processing).
1993: solving problems of depth >1000.
[UN0] J. Schmidhuber. Neural sequence chunkers. Technical Report FKI-148-91, Institut für
Informatik, Technische Universität München, April 1991. PDF. Unsupervised/self-supervised
learning and predictive coding is used in a deep hierarchy of recurrent neural networks (RNNs)
to find compact internal representations of long sequences of data, across multiple time scales
and levels of abstraction. Each RNN tries to solve the pretext task of predicting its next input,
sending only unexpected inputs to the next RNN above. The resulting compressed sequence
representations greatly facilitate downstream supervised deep learning such as sequence
classification. By 1993, the approach solved problems of depth 1000 [UN2] (requiring 1000
subsequent computational stages/layers—the more such stages, the deeper the learning). A
variant collapses the hierarchy into a single deep net. It uses a so-called conscious chunker
RNN which attends to unexpected events that surprise a lower-level so-called subconscious
automatiser RNN. The chunker learns to understand the surprising events by predicting them.
The automatiser uses a neural knowledge distillation procedure to compress and absorb the
formerly conscious insights and behaviours of the chunker, thus making them subconscious.
The systems of 1991 allowed for much deeper learning than previous methods. More.
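A drastically simplified toy sketch of the history compression principle described above (a trivial repetition predictor stands in for the lower learning RNN; not the 1991 system): only the symbols the lower level fails to predict are passed up, so the level above sees a much shorter, more abstract sequence.

```python
# Toy illustration of history compression (a drastic simplification, not the
# 1991 RNN system): a lower-level predictor tries to guess the next symbol;
# only the symbols it fails to predict (the "unexpected" ones) are sent to
# the higher level, together with their time stamps.

def compress(sequence, predictor):
    higher_level_input = []
    prev = None
    for t, symbol in enumerate(sequence):
        if predictor(prev) != symbol:            # prediction failed: surprise
            higher_level_input.append((t, symbol))
        prev = symbol
    return higher_level_input

# Trivial stand-in for the lower RNN: predict that the next symbol repeats
# the previous one. A real chunker hierarchy uses learning RNNs instead.
repeat_predictor = lambda prev: prev

seq = list("aaaabbbbbbaaaacccc")
print(compress(seq, repeat_predictor))
# -> [(0, 'a'), (4, 'b'), (10, 'a'), (14, 'c')]: only the surprising events
#    reach the level above.
```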
[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history
compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991.[UN0] PDF.
First working Deep Learner based on a deep RNN hierarchy (with different self-
organising time scales), overcoming the vanishing gradient problem through unsupervised pre-
training and predictive coding (with self-supervised target generation). Also: compressing or
distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its
old skills—such approaches are now widely used. More.
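A toy sketch of the distillation idea (an illustration only, not the original collapsing procedure: here the teacher is an arbitrary function and the student just a polynomial, chosen to keep the example tiny; the essential point is that the student is fitted to the teacher's outputs rather than to external labels).

```python
import numpy as np

# Toy sketch of distillation: the student is fitted to the *teacher's
# outputs* on sampled inputs. Teacher and student here are placeholders;
# the original work used neural networks for both roles.

rng = np.random.default_rng(2)

def teacher(x):                          # stands in for the trained teacher net
    return np.tanh(3.0 * x) + 0.5 * np.sin(5.0 * x)

x_train = rng.uniform(-1.0, 1.0, size=300)
targets = teacher(x_train)               # distillation targets: teacher outputs

student = np.polyfit(x_train, targets, deg=9)   # fit student to imitate teacher

x_test = np.linspace(-1.0, 1.0, 5)
print(np.c_[teacher(x_test), np.polyval(student, x_test)].round(3))
# The two columns should nearly coincide: the teacher's behaviour has been
# absorbed by the much simpler student.
```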
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. An ancient experiment on "Very
Deep Learning" with credit assignment across 1200 time steps or virtual layers and
unsupervised / self-supervised pre-training for a stack of recurrent NNs can be found here
(depth > 1000).
[UNI] Theory of Universal Learning Machines & Universal AI. Work of Marcus Hutter (in the
early 2000s) on J. Schmidhuber's SNF project 20-61847: Unification of universal induction and
sequential decision theory.
[URQ10] A. Urquhart. Von Neumann, Gödel and complexity theory. Bulletin of Symbolic Logic
16.4 (2010): 516-530. Link.
[VAN4] Y. Bengio. Neural net language models. Scholarpedia, 3(1):3881, 2008. Link.
[VAR13] M. Y. Vardi (2013). Who begat computing? Communications of the ACM, Vol. 56(1):5,
Jan 2013. Link.
[VID1] G. Hinton. The Next Generation of Neural Networks. Youtube video [see 28:16].
GoogleTechTalk, 2007. Quote: "Nobody in their right mind would ever suggest" to use plain
backpropagation for training deep networks. However, in 2010, Schmidhuber's team in
Switzerland showed[MLP1-2] that unsupervised pre-training is not necessary to train deep NNs.
[W45] G. H. Wannier (1945). The Statistical Problem in Cooperative Phenomena. Rev. Mod.
Phys. 17, 50.
[WI48] N. Wiener (1948). Time, communication, and the nervous system. Teleological
mechanisms. Annals of the N.Y. Acad. Sci. 50 (4): 197-219. Quote: "... the general idea of a
computing machine is nothing but a mechanization of Leibniz's calculus ratiocinator."
[WID62] B. Widrow, M. Hoff (1962). Associative storage and retrieval of digital information
in networks of adaptive neurons. Biological Prototypes and Synthetic Systems, 1:160.
[WU] Y. Wu et al. Google's Neural Machine Translation System: Bridging the Gap between
Human and Machine Translation. Preprint arXiv:1609.08144 (PDF), 2016. Based on LSTM.
[XAV] X. Glorot, Y. Bengio. Understanding the difficulty of training deep feedforward neural
networks. Proc. 13th Intl. Conference on Artificial Intelligence and Statistics, PMLR 9:249-256,
2010.
[YB20] Y. Bengio. Notable Past Research. WWW link (retrieved 15 May 2020). Local copy
(plain HTML only). The author claims that in 1995 he "introduced the use of a hierarchy of time
scales to combat the vanishing gradients issue" although Schmidhuber's publications on
exactly this topic date back to 1991-93.[UN0-2][UN] The author also writes that in 1999 he
"introduced, for the first time, auto-regressive neural networks for density estimation" although
Schmidhuber & Heil used a very similar set-up for text compression already in 1995.[SNT]
[ZU36] K. Zuse (1936). Verfahren zur selbsttätigen Durchführung von Rechnungen mit Hilfe
von Rechenmaschinen. (Method for the automatic execution of calculations with the aid of
calculating machines.) Patent application Z 23 139 / GMD Nr. 005/021, 1936. First patent
application describing a general, practical, program-controlled computer.
[ZU37] K. Zuse (1937). Einführung in die allgemeine Dyadik. (Introduction to general dyadics,
i.e., the binary system.) Mentions the storage of program instructions in the computer's memory.
[ZU38] K. Zuse (1938). Diary entry of 4 June 1938. Description of computer architectures that
put both program instructions and data into storage—compare the later "von Neumann"
architecture [NEU45].
[ZU48] K. Zuse (1948). Über den Plankalkül als Mittel zur Formulierung schematisch
kombinativer Aufgaben. (On the Plankalkül as a means for formulating schematically
combinative tasks.) Archiv der Mathematik 1(6), 441-449 (1948). PDF. Apparently the first
practical design of an automatic theorem prover (based on Zuse's high-level programming
language Plankalkül).
[ZUS21] J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse
completes the first working general computer, based on his 1936 patent application.
[ZUS21a] J. Schmidhuber (AI Blog, 2021). 80. Jahrestag: 1941: Konrad Zuse baut ersten
funktionalen Allzweckrechner, basierend auf der Patentanmeldung von 1936. (80th anniversary:
1941: Konrad Zuse builds the first working general-purpose computer, based on the 1936 patent
application.)
[ZUS21b] J. Schmidhuber (2021). Der Mann, der den Computer erfunden hat. (The man who
invented the computer.) Weltwoche, Nr. 33.21, 19 August 2021. PDF.