Algorithmic Probability Theory and Applications
Ray J. Solomonoff
Visiting Professor, Computer Learning Research Centre
Royal Holloway, University of London
IDSIA, Galleria 2, CH–6928 Manno–Lugano, Switzerland
rjsolo@ieee.org http://world.std.com/˜rjs/pubs.html
August 16, 2009∗
∗ Revision of Chapter 1 in F. Emmert-Streib and M. Dehmer (Eds.), Information Theory and Statistical Learning, Springer Science+Business Media, N.Y., 2009, pp. 1–23.
Abstract
We first define Algorithmic Probability, an extremely powerful method
of inductive inference. We discuss its completeness, incomputability, di-
versity and subjectivity and show that its incomputability in no way in-
hibits its use for practical prediction. Applications to Bernoulli sequence
prediction and grammar discovery are described. We conclude with a note
on its employment in a very strong AI system for very general problem
solving.
Introduction
Ever since probability was invented, there has been much controversy as
to just what it meant, how it should be defined and above all, what is
the best way to predict the future from the known past. Algorithmic
Probability is a relatively recent definition of probability that attempts to
solve these problems.
We begin with a simple discussion of prediction and its relationship
to probability. This soon leads to a definition of Algorithmic Probabil-
ity (ALP) and its properties. The best-known properties of ALP are
its incomputability and its completeness (in that order). Completeness
means that if there is any regularity (i.e. property useful for prediction)
in a batch of data, ALP will eventually find it, using a surprisingly small
amount of data. The incomputability means that in the search for reg-
ularities, at no point can we make a useful estimate of how close we are
to finding the most important ones. We will show, however, that this
incomputability is of a very benign kind, so that in no way does it inhibit
the use of ALP for good prediction. One of the important properties of
ALP is subjectivity, the amount of personal experiential information that
the statistician must put into the system. We will show that this is a de-
sirable feature of ALP, rather than a “Bug”. Another property of ALP
is its diversity - it affords many explanations of the data, giving very good understanding of that data.
There have been a few derivatives of Algorithmic Probability - Mini-
mum Message Length (MML), Minimum Description Length (MDL) and
Stochastic Complexity - which merit comparison with ALP.
We will discuss the application of ALP to two kinds of problems: Pre-
diction of the Bernoulli Sequence and Discovery of the Grammars of Con-
text Free Languages. We also show how a variation of Levin’s search
procedure can be used to search over a function space very efficiently to
find good predictive models.
The final section is on the future of ALP — some open problems in
its application to AI and what we can expect from these applications.
A common definition of the probability of an event is the ratio of the number of favorable choices to the total number of choices in the past.
If the decision to use integration by parts in an integration problem has
been successful in the past 43% of the time, then its present probability
of success is about 0.43. This method has very poor accuracy if we only
have one or two cases in the past, and is undefined if the case has never
occurred before. Unfortunately it is just these situations that occur most
often in problem solving.
On a very practical level: If we cross a particular street 10 times and
we get hit by a car twice, we might estimate that the probability of getting
hit in crossing that street is about 0.2 = 2/10. However, if instead, we
only crossed that street twice and we didn’t get hit either time, it would be
unreasonable to conclude that our probability of getting hit was zero! By
seriously revising our definition of probability, we are able to resolve this
difficulty and clear up many others that have plagued classical concepts
of probability.
What is Induction?
Prediction is usually done by finding inductive models. These are deter-
ministic or probabilistic rules for prediction. We are given a batch of data
- typically a series of zeros and ones, and we are asked to predict any one
of the data points as a function of the data points that precede it.
In the simplest case, let us suppose that the data has a very simple
structure:
0101010101010...
In this case, a good inductive rule is “zero is always followed by one;
one is always followed by zero”. This is an example of deterministic in-
duction, and deterministic prediction. In this case it is 100% correct every
time!
There is, however, a common kind of induction problem in which our
predictions will not be that reliable. Suppose we are given a sequence
of zeros and ones with very little apparent structure. The only apparent
regularity is that zero occurs 70% of the time and one appears 30% of
the time. Inductive algorithms give a probability for each symbol in a
sequence that is a function of any or none of the previous symbols. In the
present case, the algorithm is very simple and the probability of the next
symbol is independent of the past - the probability of zero seems to be 0.7;
the probability of one seems to be 0.3. This kind of simple probabilistic
sequence is called a “Bernoulli sequence”. The sequence can contain many
different kinds of symbols, but the probability of each is independent of
the past. In Section 8 we will discuss the Bernoulli sequence in some
detail.
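As a concrete illustration, here is a minimal Python sketch of the two kinds of rules just described: a deterministic alternation rule for the sequence 0101... and a frequency-based rule for a Bernoulli sequence. The function names and the small smoothing counts are illustrative assumptions, not anything specified in the text.

def alternation_model(history):
    """Deterministic rule for 0101...: predict the opposite of the last symbol.
    Returns the probability that the next symbol is '1', given the history."""
    if not history:
        return 0.5                      # no data yet: no preference
    return 1.0 if history[-1] == '0' else 0.0

def bernoulli_model(history):
    """Bernoulli rule: the probability of '1' is its observed frequency so far,
    with one extra count of each symbol so the rule is defined on little data."""
    ones = history.count('1')
    return (ones + 1) / (len(history) + 2)

print(alternation_model("010101"))      # 1.0: the next symbol should be '1'
print(bernoulli_model("0010000100"))    # about 0.25 for a sequence with few ones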
In general we will not always be predicting Bernoulli sequences and
there are many possible algorithms (which we will call “models”) that tell
how to assign a probability to each symbol, based on the past. Which of
these should we use? Which will give good predictions in the future?
One desirable feature of an inductive model is that if it is applied
to the known sequence, it produces good predictions. Suppose Ri is an
inductive algorithm. Ri predicts the probability of a symbol aj in a sequence a1, a2 · · · an by looking at the previous symbols. More exactly, aj is the symbol for which we want the probability, and a1, a2 · · · aj−1 are the previous symbols in the sequence. Ri is then able to give the probability of a particular value of aj as a function of the past. Here the values of aj can range over the entire “alphabet” of symbols that occur in the sequence. If the sequence is binary, aj will range over the set {0, 1} only. If the sequence is English text, aj will range over all alphabetic and punctuation symbols. If Ri is a good predictor, the probability it assigns to most of the aj will be large - near one.
Consider S, the product of the probabilities that Ri assigns to the individual symbols of the sequence, a1, a2 · · · an. S gives the probability that Ri assigns to the sequence as a whole:

S = \prod_{j=1}^{n} R_i(a_j) = \prod_{j=1}^{n} R_i(a_j \mid a_1, a_2 \cdots a_{j-1})    (1)

Suppose that we use a particular model R to compress a text of N characters, each character coded with 8 bits in its raw form, so that the uncompressed text is 8N bits long. The code that R assigns
to the text will be in two parts: the first part is the code for R, which is
|R| bits long, and the second part is the code for the probability of the
data, as given by R - it will be just − log2 S bits in length. The sum of
these will be |R| − log2 S bits. The compression ratio achieved by R would
be 8N/(|R| − log2 S).
PPM, a commonly used prediction algorithm, achieves a compression of
about 3 for English text. For very large strings of data, compressions of as
much as 6 have been achieved by highly refined prediction algorithms. We
can use |R| − log2 S bits, the length of the compressed code, as a “figure
of merit” of a particular induction algorithm with respect to a particular
text.
We want an algorithm that will give good prediction, i.e., large S, and
small |R|, so |R|−log2 S, the figure of merit, will be as small as possible and
the probability it assigns to the text will be as large as possible. Models
with |R| larger than optimum are considered to be overfitted. Models in
which |R| is smaller than optimum are considered to be underfitted. By
choosing a model that minimizes |R| − log2 S, we avoid both underfitting
and overfitting, and obtain very good predictions. We will return to this
topic later, when we tell how to compute |R| and S for particular models
and data sets.
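To make the figure of merit concrete, here is a small Python sketch that computes |R| − log2 S and the corresponding compression ratio for a model applied to a short binary sequence. Representing the model simply by its description length in bits and the per-symbol probabilities it assigns is an illustrative assumption, not code from the text.

import math

def figure_of_merit(model_length_bits, symbol_probs):
    """|R| - log2(S): the model's description length plus the code length of
    the data under the model (S is the product of the symbol probabilities)."""
    code_length = -sum(math.log2(p) for p in symbol_probs)   # this is -log2 S
    return model_length_bits + code_length

def compression_ratio(raw_bits, model_length_bits, symbol_probs):
    """Ratio of the raw encoding length to the two-part code length."""
    return raw_bits / figure_of_merit(model_length_bits, symbol_probs)

# Example: 16 binary symbols (1 raw bit each) and a 20-bit model that assigns
# probability 0.9 to every symbol that actually occurred.
probs = [0.9] * 16
print(figure_of_merit(20, probs))        # about 22.4 bits
print(compression_ratio(16, 20, probs))  # below 1: this model does not compress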
Usually there are many inductive models available. In 1960, I de-
scribed Algorithmic Probability - ALP ([5, 6, 7]), which uses all possible
models in parallel for prediction, with weights dependent upon the figure
of merit of each model.
P_M(a_{n+1} \mid a_1, a_2 \cdots a_n) = \sum_i 2^{-|R_i|} S_i \, R_i(a_{n+1} \mid a_1, a_2 \cdots a_n)    (2)
The dependence of these weights on the choice of reference machine is nonetheless important. It will be discussed at greater length in Section 4 on “Subjectivity”.
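As a concrete illustration of Eqn. 2, here is a minimal Python sketch of the weighted mixture. The representation of a model as a pair (description length in bits, prediction function), the toy models, and the final normalization by the total weight are illustrative assumptions, not details given in the text.

import math

def alp_mixture(models, history, candidate):
    """Approximate P_M(candidate | history) in the spirit of Eqn. 2:
    a weighted sum of each model's prediction, with weight 2^(-|R|) * S.
    `models` is a list of (length_bits, predict) pairs, where predict(h, x)
    returns the model's probability that symbol x follows history h."""
    num, denom = 0.0, 0.0
    for length_bits, predict in models:
        # S: the probability the model assigned to the sequence seen so far
        s = math.prod(predict(history[:j], history[j]) for j in range(len(history)))
        weight = 2.0 ** (-length_bits) * s
        num += weight * predict(history, candidate)
        denom += weight
    # Normalizing by the total weight turns the sum into a proper
    # conditional probability over the candidate symbols.
    return num / denom if denom > 0 else 0.5

# Two toy models: a near-deterministic alternation rule and an unbiased coin.
alternate = lambda h, x: 0.9 if (not h or h[-1] != x) else 0.1
coin = lambda h, x: 0.5
print(alp_mixture([(10, alternate), (2, coin)], "010101", "0"))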
Normally in prediction problems we will have some time limit, T , in
which we have to make our prediction. In ALP what we want is a set of
models of maximum total weight. A set of this sort will give us an ap-
proximation that is as close as possible to ALP and gives best predictions.
To obtain such a set, we devise a search technique that tries to find, in
the available time, T , a set of Models, Ri , such that the total weight,
\sum_i 2^{-|R_i|} S_i    (3)
is as large as possible.
ALP has an associated convergence theorem that expresses its completeness: if the data are generated by an unknown computable probabilistic source P, then, using P_M for prediction,

E_P \left[ \sum_{m=1}^{n} \big( P_M(a_{m+1}=1 \mid a_1, a_2 \cdots a_m) - P(a_{m+1}=1 \mid a_1, a_2 \cdots a_m) \big)^2 \right] \le -\frac{1}{2} \ln P_0, \qquad \ln P_0 \approx -k \ln 2    (4)

Here P_0 is the a priori probability that the reference machine assigns to the generator P, and k is the length in bits of a shortest description of P. The total expected squared error in the probability estimates is thus bounded by a constant that depends only on the complexity of the generator.
3 Incomputability
It should be noted that in general, it is impossible to find the truly best
models with any certainty - there is an infinity of models to be tested and
some take an unacceptably long time to evaluate. At any particular time
in the search, we will know the best ones so far, but we can’t ever be sure
that spending a little more time will not give much better models! While it
is clear that we can always make approximations to ALP by using a limited
number of models, we can never know how close these approximations are
to the “True ALP”. ALP is, indeed, formally incomputable.
In this section, we will investigate how our models are generated and
how the incomputability comes about - why it is a necessary, desirable
feature of any high performance prediction technique, and how this in-
computability in no way inhibits its use for practical prediction.
The ALP mixture of Eqn. 2 is a sum over models:

P_M = \sum_{i=1}^{\infty} 2^{-|R_i|} S_i \, R_i    (5)
Here Ri gives the probability distribution for the next symbol as com-
puted by the ith model. Just what do we mean by these models, Ri ?
For a simple loop of this kind it is easy to tell for which values of X the loop will terminate. Here is a simple WHILE loop for which it is not so easy to tell:
WHILE X > 4
    IF X/2 is an integer THEN X ← X/2
    ELSE X ← 3*X + 1
ENDWHILE
This program has been tested with X starting at all positive integers up
to more than sixty million. The loop has always terminated, but no one
yet is certain as to whether it terminates for all positive integers!
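The loop transcribes directly into Python; whether the function below halts for every positive integer starting value is exactly the open question just mentioned. The function name is an illustrative choice, not from the text.

def collatz_like_steps(x):
    """Run the WHILE loop from the text, counting iterations until X <= 4.
    No one has proved that this loop terminates for every positive integer x."""
    steps = 0
    while x > 4:
        x = x // 2 if x % 2 == 0 else 3 * x + 1
        steps += 1
    return steps

# Spot checks: each of these starting values happens to terminate.
print([collatz_like_steps(x) for x in (5, 6, 7, 27, 97)])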
A Total Recursive Function has values for all possible arguments. Compositions and primitive recursive
functions are all total recursive. Many partial recursive functions are
total recursive, but some are not. As a consequence of the insolvability
of Turing’s “Halting Problem”, it will sometimes be impossible to tell if
a certain W HILE loop will terminate or not.
Suppose we use Eqn. 2 to approximate ALP by sequentially testing
functions in a list of all possible functions - these will be the partial re-
cursive functions because this is the only recursively enumerable function
class that includes all possible predictive functions. As we test to find
functions with good figures of merit (small (|Ri | − log2 Si )) we find that
certain of them don’t converge after, say, a time T of ten seconds. We
know that if we increase T enough, eventually, all converging trials will
converge and all divergent trials will still diverge - so eventually we will
get close to true ALP - but we cannot recognize when this occurs. Fur-
thermore for any finite T , we cannot ever know a useful upper bound on
how large the error in the ALP approximation is. That is why this partic-
ular method of approximating ALP is called “incomputable”. Could there
be another computable approximation technique that would converge? It
is easy to show that no computable technique can be “complete” - i.e., have very small errors in its probability estimates.
Consider an arbitrary computable probability method, R0 . We will
show how to generate a sequence for which R0 ’s errors in probability
would always be 0.5 or more. We start our sequence with a single bit,
say zero. We then ask R0 for the most probable next bit. If it says “one
is more probable”, we make the continuation zero, if it says “zero is more
probable”, we make the next bit one. If it says “both are equally likely”
we make the next bit zero. We generate the third bit in the sequence in
the same way, and we can use this method to generate an arbitrarily long
continuation of the initial zero.
For this sequence, R0 will always have an error in probability of at least
one half. Since completeness implies that prediction errors approach zero
for all finitely describable sequences, it is clear that R0 or any other com-
putable probability method cannot be complete. Conversely, any complete
probability method, such as ALP, cannot be computable.
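The construction can be written out directly. The sketch below takes any computable predictor and extends an initial zero so that the probability the predictor assigns to the bit that actually occurs never exceeds 0.5; the function names and the example predictor are illustrative assumptions.

def adversarial_sequence(predictor, length=20):
    """Build a bit string on which `predictor` assigns probability at most 0.5
    to every bit that actually occurs.  `predictor(history)` must return its
    probability that the next bit is '1'."""
    seq = "0"                                   # start with a single zero
    while len(seq) < length:
        p_one = predictor(seq)
        # Append whichever bit the predictor does NOT consider more probable;
        # ties go to '0', as in the text.
        seq += "0" if p_one >= 0.5 else "1"
    return seq

# Example: a predictor that just reports the frequency of '1' seen so far.
frequency_predictor = lambda history: history.count("1") / len(history)
print(adversarial_sequence(frequency_predictor))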
If we cannot compute ALP, what good is it? It would seem to be of
little value for prediction! To answer this objection, we note that from
a practical viewpoint, we never have to calculate ALP exactly - we can
always use approximations. While it is impossible to know how close our
approximations are to the true ALP, that information is rarely needed for
practical induction.
What we actually need for practical prediction:
1. Estimates of how good a particular approximation will be in future
problems (called the “Out of Sample Error”).
2. Methods to search for good models.
3. Quick and simple methods to compare models.
For 1., we can use Cross Validation or Leave One Out - well-known meth-
ods that work with most kinds of problems. In addition, because ALP
does not overfit or underfit, there is usually a better method for making such estimates.
For 2., in Section 10 we will describe a variant of Levin’s Search Procedure for an efficient search of a very large function space.
For 3., we will always find it easy to compare models via their associated
“Figures of Merit”, |Ri | − log2 (Si ).
4 Subjectivity
The subjectivity of probability resides in a priori information - the in-
formation available to the statistician before he sees the data to be ex-
trapolated. This is independent of what kind of statistical techniques we
use. In ALP this a priori information is embodied in M , our “Reference
Computer”. Recall our assignment of the value |R| to an induction model R -
it was the length of the program necessary to describe R. In general, this
will depend on the machine we use - its instruction set. Since the ma-
chines we use are Universal - they can imitate one another - the length of
description of programs will not vary widely between most reference ma-
chines we might consider. But nevertheless, using small samples of data
(as we often do in AI), these differences between machines can modify
results considerably.
For quite some time I felt that the dependence of ALP on the reference
machine was a serious flaw in the concept, and I tried to find some “objec-
tive” universal device, free from the arbitrariness of choosing a particular
universal machine. When I thought I finally found a device of this sort, I
realized that I really didn’t want it - that I had no use for it at all! Let
me explain:
In doing inductive inference, one begins with two kinds of information:
First, the data itself, and second, the a priori data - the information one
had before seeing the data. It is possible to do prediction without data,
but one cannot do prediction without a priori information. In choosing a
reference machine we are given the opportunity to insert into the a priori
probability distribution any information about the data that we know
before we see it.
If the reference machine were somehow “objectively” chosen for all
induction problems, we would have no way to make use of our prior in-
formation. This lack of an objective prior distribution makes ALP very
subjective - as are all Bayesian systems.
This certainly makes the results “subjective”. If we value objectivity,
we can routinely reduce the choice of a machine and representation to cer-
tain universal “default” values - but there is a tradeoff between objectivity
and accuracy. To obtain the best extrapolation, we must use whatever
information is available, and much of this information may be subjective.
Consider two physicians, A and B: A is a conventional physician: He
diagnoses ailments on the basis of what he has learned in school, what he
has read about and his own experience in treating patients. B is not a
conventional physician. He is “objective”. His diagnosis is entirely “by
the book” - things he has learned in school that are universally accepted.
He tries as hard as he can to make his judgements free of any bias that
might be brought about by his own experience in treating patients. As a
lawyer, I might prefer defending B’s decisions in court, but as a patient,
I would prefer A’s intelligently biased diagnosis and treatment.
To the extent that a statistician uses objective techniques, his recom-
mendations may be easily defended, but for accuracy in prediction, the
additional information afforded by subjective information can be a critical
advantage.
Consider the evolution of a priori information in a scientist during the course of his life. He starts at birth with minimal a priori information - but enough to be able to learn to walk, to learn to communicate, and to allow his immune system to adapt to certain hostilities in the environment. Soon after
birth, he begins to solve problems and incorporate the problem solving
routines into his a priori tools for future problem solving. This continues
throughout the life of the scientist - as he matures, his a priori information
matures with him.
In making predictions, there are several commonly used techniques for
inserting a priori information. First, by restricting or expanding the set
of induction models to be considered. This is certainly the commonest
way. Second, by selecting prediction functions with adjustable parame-
ters and assuming a density distribution over those parameters based on
past experience with such parameters. Third, we note that much of the
information in our sciences is expressed as definitions - additions to our
language. ALP, or approximations of it, avails itself of this information
by using these definitions to help assign code lengths, and hence a priori
probabilities to models. Computer languages are usually used to describe
models, and it is relatively easy to make arbitrary definitions part of the
language.
More generally, modifications of computer languages are known to be
able to express any conceivable a priori probability distribution. This
gives us the ability to incorporate whatever a priori information we like
into our computer language. It is certainly more general than any of the
other methods of inserting a priori information.
When we approximate ALP, we try to preserve this diversity in the subset of models we use.
6 Derivatives of ALP
After my first description of ALP in 1960 [5], several related induction models were described: Minimum Message Length (MML), Wallace and Boulton (1968) [13]; Minimum Description Length (MDL), Rissanen (1978) [3]; and Stochastic Complexity, Rissanen (1989) [4]. These models were conceived independently of ALP (though Rissanen had read Kolmogorov’s 1965 paper on minimum coding [1], which is closely related to ALP). MML and MDL select induction models by minimizing the figure of merit, |Ri| − log2(Si), just as ALP does. However, instead of using a
weighted sum of models, they use only the single best model.
MDL chooses a space of computable models then selects the best model
from that space. This avoids any incomputability, but greatly limits the
kinds of models that it can use. MML recognizes the incomputability
of finding the best model so it is in principle much stronger than MDL.
Stochastic Complexity, like MDL, first selects a space of computable models - then, like ALP, it uses a weighted sum of all models in that space. Like MDL, it differs from ALP in the limited types of models that are accessible to it. MML is about the same as ALP when the best model is
much better than any other found. When several models are of compa-
rable figure of merit, MML and ALP will differ. One advantage of ALP
over MML and MDL is in its diversity of models. This is useful if the
induction is part of an ongoing process of learning - but if the induction
is used on one problem only, diversity is of much less value. Stochastic
Complexity, of course, does obtain diversity in its limited set of models.
7 Extensions of ALP
The probability distribution for ALP that I’ve shown is called “The Uni-
versal Distribution for sequential prediction”. There are two other univer-
sal distributions I’d like to describe:
Suppose each model Ri is a stochastic grammar, and Si is the probability assigned to the corpus by Ri. If Rk is the best stochastic grammar that we’ve found, then we use Rk(an+1) as the probability of an+1. To obtain the ALP version, we simply sum over all models as before, using weights 2^{-|Ri|} Si.
This kind of ALP has an associated convergence theorem giving very
small errors in probability. This approach can be used in linguistics. The
aj can be examples of sentences that are grammatically correct. We can
use |Ri | − log2 Si as a likelihood that the data was created by grammar,
Ri . Section 9 continues the discussion of Grammatical Induction.
Recall that ALP tells us to sum the predictions of each model, with weight
given by the product of the a priori probability of the model (2−|Ri | ) and
Si , the probability assigned to the data by the model . . . i.e.:
\sum_i 2^{-|R_i|} S_i \, R_i(\,\cdot\,)    (9)
This method of extrapolating binary Bernoulli sequences is called “Laplace’s Rule”. The formula for the probability of a binary sequence, n0! n1!/(n0 + n1 + 1)!, can be generalized for an alphabet of k symbols.
A sequence containing k different kinds of symbols, in which the i-th symbol occurs ni times, has a probability of

\frac{(k-1)!\,\prod_{i=1}^{k} n_i!}{\left(k-1+\sum_{i=1}^{k} n_i\right)!}    (14)

This formula can be obtained by integration, over a (k − 1)-dimensional space, of the function p_1^{n_1} p_2^{n_2} \cdots p_{k-1}^{n_{k-1}} (1 - p_1 - p_2 - \cdots - p_{k-1})^{n_k}.

Through an argument similar to that used for the binary sequence, the probability of the next symbol being of the j-th type is

\frac{n_j + 1}{k + \sum_{i=1}^{k} n_i}    (15)
A way to visualize this result: the body of data (the “corpus”) consists of \sum_i n_i symbols. Think of a “pre–corpus” containing one of each of the k symbols. If we think of a “macro corpus” as “corpus plus pre–corpus”, we can obtain the probability of the next symbol being the j-th one by dividing the number of occurrences of that symbol in the macro corpus by the total number of symbols of all types in the macro corpus.
It is also possible to have different numbers of each symbol type in
the pre–corpus, enabling us to get a great variety of “a priori probability
distributions” for our predictions.
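A short Python sketch of Eqn. 15 and the pre-corpus picture: the predicted probability of each symbol is its count in the corpus plus its count in the pre-corpus, divided by the combined size of corpus and pre-corpus. The function names are illustrative, not from the text.

from collections import Counter

def next_symbol_probs(corpus, alphabet, pre_corpus=None):
    """Probability of each symbol being next, as in Eqn. 15.  By default the
    pre-corpus holds one of each symbol (Laplace's Rule); supplying other
    pre-corpus counts gives different a priori distributions."""
    pre = pre_corpus or {s: 1 for s in alphabet}
    counts = Counter(corpus)
    macro_total = len(corpus) + sum(pre.values())      # corpus plus pre-corpus
    return {s: (counts[s] + pre[s]) / macro_total for s in alphabet}

print(next_symbol_probs("aabab", "ab"))                    # {'a': 4/7, 'b': 3/7}
print(next_symbol_probs("aabab", "ab", {"a": 5, "b": 1}))  # prior biased toward 'a'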
Consider, as an example, the following context-free grammar:

S → Aac
S → BaAd
A → BAaS
A → AB
A → a
B → aBa
B → b
The capital letters (including S) are all nonterminal symbols. The lower
case letters are all terminal symbols. To generate a legal string, we start
with the symbol, S, and we perform either of the two possible substitu-
tions. If we choose BaAd, we would then have to choose substitutions for
the nonterminals B and A. For B, if we chose aBa we would again have
to make a choice for B. If we chose a terminal symbol, like b for B, then
no more substitutions can be made.
An example of a string generation sequence:
S, BaAd, aBaaAd, abaaAd, abaaABd, abaaaBd, abaaabd.
The string abaaabd is then a legally derived string from this grammar. The
set of all strings legally derivable from a grammar is called the language
of the grammar. The language of a grammar can contain a finite or
infinite number of strings. If we replace the deterministic substitution
rules with probabilistic rules, we have a probabilistic grammar. A grammar
of this sort assigns a probability to every string it can generate. In the
deterministic grammar above, S had two rewrite choices, A had three,
and B had two. If we assign a probability to each choice, we have a
probabilistic grammar.
Suppose S had substitution probability 0.1 for Aac and 0.9 for BaAd. Similarly, assign probabilities 0.3, 0.2 and 0.5 to A’s substitutions and 0.4, 0.6 to B’s substitutions:
S 0.1 Aac
0.9 BaAd
A 0.3 BAaS
0.2 AB
0.5 a
B 0.4 aBa
0.6 b
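A small Python sketch that generates strings from this probabilistic grammar and reports the probability of the particular derivation it followed; the data structure and function names are illustrative choices, not code from the text.

import random

# The stochastic grammar above: each nonterminal maps to a list of
# (probability, right-hand side) pairs.
GRAMMAR = {
    "S": [(0.1, "Aac"), (0.9, "BaAd")],
    "A": [(0.3, "BAaS"), (0.2, "AB"), (0.5, "a")],
    "B": [(0.4, "aBa"), (0.6, "b")],
}

def generate(symbol="S", rng=random):
    """Expand `symbol` by randomly chosen substitutions, returning a terminal
    string together with the probability of the particular derivation used."""
    if symbol not in GRAMMAR:                 # a terminal symbol: leave as is
        return symbol, 1.0
    choices = GRAMMAR[symbol]
    rule_prob, rhs = rng.choices(choices, weights=[p for p, _ in choices])[0]
    text, prob = "", rule_prob
    for s in rhs:                             # expand each symbol of the rhs
        sub_text, sub_prob = generate(s, rng)
        text += sub_text
        prob *= sub_prob
    return text, prob

string, prob = generate()
print(string, prob)   # e.g. "abaaabd" with the probability of that derivation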
Suppose we are given a set of strings, ab, aabb, aaabbb that were gen-
erated by an unknown grammar. How do we find the grammar?
I won’t answer that question directly, but instead I will tell how to
find a sequence of grammars that fits the data progressively better. The
best one we find may not be the true generator, but will give probabilities
to strings close to those given by the generator.
The example here is from A. Stolcke’s PhD thesis [12]. We start with an ad hoc grammar that can generate the data, but it overfits . . . it is too complex:
S → ab
→ aabb
→ aaabbb

Successive chunking and merging operations then give a sequence of progressively more compact grammars:

S → X
→ aXb
→ aaXbb
X → ab

S → X
→ Y
→ aY b
X → ab
Y → aXb

S → X
→ aSb
→ aXb
X → ab

S → aSb
→ ab
which is an adequate grammar. At each step there are usually several possible chunk or merge candidates; we choose the candidates that give minimum description length to the resultant grammar.
How do we calculate the length of description of a grammar and its
description of the data set?
Consider the grammar
S → X
→ Y
→ aY b
X → ab
Y → aXb
There are two kinds of terminal symbols and three kinds of nonterminals.
If we know the number of terminals and nonterminals, we need describe
only the right hand side of the substitutions to define the grammar. The
names of the nonterminals (other than the first one, S) are not relevant.
We can describe the right-hand sides by the string X s1 Y s1 aYb s1 s2 ab s1 s2 aXb s1 s2.
s1 and s2 are punctuation symbols. s1 marks the end of a string. s2 marks
the end of a sequence of strings that belong to the same nonterminal. The
string to be encoded has 7 kinds of symbols. The number of times each
occurs: X, 2; Y , 2; S, 0; a, 3; b, 3; s1 , 5; s2 , 3. We can then use the
formula
\frac{(k-1)!\,\prod_{i=1}^{k} n_i!}{\left(k-1+\sum_{i=1}^{k} n_i\right)!}    (17)
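As a worked example, Eqn. 17 can be applied directly to the symbol counts listed above to get the probability, and hence the code length in bits, of the string of right-hand sides. The code below is an illustrative sketch, not Stolcke's implementation.

import math
from math import factorial

def multinomial_code_prob(counts):
    """Probability of a symbol string under Eqn. 17:
    (k-1)! * prod(n_i!) / (k - 1 + sum(n_i))!  for an alphabet of k symbol kinds."""
    k = len(counts)
    numerator = factorial(k - 1)
    for n in counts:
        numerator *= factorial(n)
    return numerator / factorial(k - 1 + sum(counts))

# Counts of X, Y, S, a, b, s1, s2 in the right-hand-side string above.
counts = [2, 2, 0, 3, 3, 5, 3]
p = multinomial_code_prob(counts)
print(p, -math.log2(p))   # the probability and the corresponding code length in bits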
We also need to encode integers, such as the numbers of terminal and nonterminal kinds. For this we can use a universal distribution over the integers, P(n) = A\, 2^{-\log^*_2 n}, where \log^*_2 n = \log_2 n + \log_2\log_2 n + \log_2\log_2\log_2 n + \cdots, taking as many positive terms as there are, and A is a normalization constant. There seems to be no good reason to choose 2 as the base for the logs, and using different bases gives much different results. If we use natural logs, the sum diverges.
This particular form of P (n) was devised by Rissanen. It is an at-
tempt to approximate the shortest description of the integer n, e.g. the
Kolmogorov complexity of n. Its first moment is infinite, which means it
is very biased toward large numbers. If we have reason to believe, from
previous experience, that n will not be very large, but will be about λ,
then a reasonable form of P (n) might be P (n) = Ae−n/λ , A being a
normalization constant.
The foregoing enables us to evaluate P(Ri) of Eqn. 16. The \prod_{j=1}^{n} R_i(a_j)
part is evaluated by considering the choices made when the grammar
produces the data corpus. For each nonterminal, we will have a sequence
of decisions whose probabilities can be evaluated by an expression like
Eqn. 14, or perhaps the simpler technique of Eqn. 15 that uses the “pre–
corpus”. Since there are three nonterminals, we need the product of three
such expressions.
The process used by Stolcke in his thesis was to make various trials of
chunking or merging in attempts to successively get a shorter description
length - or, equivalently, to increase Eqn. 16. It is essentially a very greedy method. He
has been actively working on Context Free Grammar discovery since then,
and has probably discovered many improvements. There are many more
recent papers at his website.
Most, if not all CFG discovery has been oriented toward finding a
single best grammar. For applications in AI and genetic programming
it is useful to have large sets of not necessarily best grammars - giving
much needed diversity. One way to implement this: At each stage of
modification of a grammar, there are usually several different operations
that can reduce description length. We could pursue such paths in parallel
. . . perhaps retaining the best 10 or best 100 grammars thus far. Branches
taken early in the search could lead to very divergent paths and much
needed diversity. This approach helps avoid local optima in grammars
and premature convergence when applied to Genetic Programming.
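One simple way to implement this is a beam search over grammar modifications. The sketch below assumes two helper functions that are not defined in the text: candidate_modifications, which proposes chunked or merged variants of a grammar, and description_length, which scores a grammar together with its description of the data.

def beam_search(initial_grammar, candidate_modifications, description_length,
                beam_width=10, steps=20):
    """Keep the `beam_width` best grammars found so far, rather than committing
    to a single greedy choice, scoring each by its total description length."""
    beam = [initial_grammar]
    for _ in range(steps):
        candidates = list(beam)
        for grammar in beam:
            candidates.extend(candidate_modifications(grammar))  # chunk/merge variants
        # Retain the grammars with the smallest description length; branch points
        # that survive in the beam provide the diversity discussed above.
        beam = sorted(candidates, key=description_length)[:beam_width]
    return beam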
The goal of the search is to find, within the available time, a set of models, Ri, for which the total weight Σi PM(Ri)Si is as large as possible.
Next we choose a small initial time T - which might be the time needed
to execute 10 instructions in our Functional Language. The initial T is not
critical. We then compute PM (Ri )Si for all Ri for which ti /PM (Ri ) < T .
Here ti is the time needed to construct Ri and evaluate its Si .
There are only a finite number of Ri that satisfy this criterion and if
T is very small, there will be very few, if any. We remember which Ri ’s
have large PM(Ri)Si.
For each such trial, ti < T · PM(Ri), so Σi ti, the total time for this part of the search, is less than T · Σi PM(Ri). Since the PM(Ri) are a priori probabilities, Σi PM(Ri) must be less than or equal to 1, and so the total time for this part of the search must be less than T.
If we are not satisfied with the Ri we’ve obtained, we double T and
do the search again. We continue doubling T and searching until we
find satisfactory Ri’s or until we run out of time. If T′ is the value of T when we finish the search, then the total time for the search will be T′ + T′/2 + T′/4 + · · · ≈ 2T′.
If it took time tj to generate and test one of our “good” models, Rj ,
then when Rj was discovered, T′ would be no more than 2tj/PM(Rj) - so we would take no more time than twice this, or 4tj/PM(Rj), to find
Rj . Note that this time limit depends on Rj only, and is independent
of the fact that we may have aborted the Si evaluations of many Ri for
which ti was infinite or unacceptably large. This feature of Levin Search
is a mandatory requirement for search over a space of partial recursive
functions. Any weaker search technique would seriously limit the power
of the inductive models available to us.
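A schematic Python version of the doubling procedure just described. The details of model generation and evaluation are stand-in assumptions: candidate_models yields (model, a priori probability) pairs, and evaluate(model, budget) is assumed to return S, the probability the model assigns to the data, or None if it fails to finish within the allotted budget.

def levin_search(candidate_models, evaluate, total_time):
    """Search for high-weight models under a doubling time budget T.  Within a
    phase, a model R with a priori probability p gets at most T * p of
    computation, so each phase costs at most T in total."""
    found = []                        # list of (weight, model) pairs
    T = total_time / 1024.0           # small initial budget; the value is not critical
    spent = 0.0
    while spent + T <= total_time:
        for model, p in candidate_models():
            s = evaluate(model, T * p)        # per-trial limit t_i < T * P_M(R_i)
            if s is not None:
                found.append((p * s, model))
        spent += T
        T *= 2                        # double the budget and search again
    found.sort(key=lambda pair: pair[0], reverse=True)
    return found[:10]                 # the highest-weight models seen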
When ALP is being used in AI, we are solving a sequence of problems
of increasing difficulty. The machine (or language) M is periodically “up-
dated” by inserting subroutines and definitions, etc., into M so that the
solutions, Rj, to problems in the past result in larger PM(Rj). As a result the tj/PM(Rj) are smaller - giving quicker solutions to problems of the past - and usually to problems of the future as well.
To train such a system, we begin by giving the machine a simple problem. Using Levin Search, it finds one or more
solutions to the problem. The system then updates itself by modifying
the reference machine so that the solutions found will have higher a priori
probabilities. We then give it new problems somewhat similar to the pre-
vious problem. Again we use Levin Search to find solutions. We continue
with a sequence of problems of increasing difficulty, updating after each
solution is found. As the training sequence continues we expect that we
will need less and less care in selecting new problems and that the system
will eventually be able to solve a large space of very difficult problems.
For a more detailed description of the system, see Solomonoff 2003 [10].
The principal things that need to be done to implement such a system:
* We have to find a good reference language: Some good candidates
are APL, LISP, FORTH, or a subset of Assembly Language. These
languages must be augmented with definitions and subroutines that
we expect to be useful in problem solving.
* The design of good training sequences for the system is critical for
getting much problem–solving ability into it. I have written some
general principles on how to do this [9], but more work needs to be
done in this area. For early training, it might be useful to learn
definitions of instructions from Maple or Mathematica. For more
advanced training we might use the book that Ramanujan used to
teach himself mathematics —“A Synopsis of Elementary Results in
Pure and Applied Mathematics” by George S. Carr.
* We need a good update algorithm. It is possible to use PPM, a rela-
tively fast, effective method of prediction, for preliminary updating,
but to find more complex regularities, a more general algorithm is
needed. The universality of the reference language assures us that
any conceivable update algorithm can be considered. ALP’s diversity of solutions to problems maximizes the information that we are
able to insert into the a priori probability distribution. After a suit-
able training sequence the system should know enough to usefully
work on the problem of updating itself.
Because of ALP’s completeness (among other desirable properties), we
expect that the complete AI system described above should become an
extremely powerful general problem solving device — going well beyond
the limited functional capabilities of current incomplete AI systems.
Further Reading
Cover, T. and Thomas, J.: Elements of Information Theory. Wiley
and Sons, N.Y. (1991) – Good treatments of statistics, predictions, etc.
Li, M. and Vitányi, P.: An Introduction to Kolmogorov Complexity
and Its Applications. Springer-Verlag, N.Y. (1993)
Li, M. and Vitányi, P.: An Introduction to Kolmogorov Complexity
and Its Applications. Springer-Verlag, N.Y. (1997)
Li, Vitányi, 1993, 1997 – Starts with elementary treatment and de-
velopment. Many sections very clear, very well written. Other sections
difficult to understand. Occasional serious ambiguity of notation (e.g.
definition of “enumerable”). Treatment of probability is better in 1997
than in 1993 edition.
Y. Shan, R.I. McKay, R. Baxter et al.: Grammar Model-Based Pro-
gram Evolution. (Dec. 2003) A recent review of work in this area, and
what looks like a very good learning system. Discusses mechanics of fitting
Grammar to Data, and how to use Grammars to guide Search Problems.
A. Stolcke, S. Omohundro: Inducing Probabilistic Grammars by Bayesian
Model Merging. (1994) This is largely a summary of Stolcke’s On Learning Context Free Grammars [12].
Solomonoff, R.J.: The following papers are all available at the website:
http://world.std.com/˜rjs/pubs.html
A Preliminary Report on a General Theory of Inductive Inference.
(1960)
A Formal Theory of Inductive Inference. Information and Control,
Part I: (1964)
A Formal Theory of Inductive Inference, Part II: (June 1964.) Dis-
cusses fitting of context free grammars to data. Most of the discussion
is correct, but Sections 4.2.4 and 4.3.4 are questionable and equations 49
and 50 are incorrect.
– A Preliminary Report . . . and A Formal Theory... give some
intuitive justification for the way ALP does induction.
The Discovery of Algorithmic Probability. (1997) – Gives heuristic
background for discovery of ALP. Page 27 gives a time line of important
publications related to development of ALP.
Progress in Incremental Machine Learning; Revision 2.0. (Oct. 30,
2003) – A more detailed description of the system I’m currently working
on. There have been important developments since, however.
The Universal Distribution and Machine Learning. (2003) – Discus-
sion of irrelevance of incomputability to applications for prediction. Also
discussion of subjectivity.
References
[1] Kolmogorov, A.N.: Three Approaches to the Quantitative Definition
of Information. Problems Inform. Transmission 1(1) 1–7 (1965)
[2] Levin, L.A.: Universal Search Problems. Problemy Peredachi Informatsii 9 115–116 (1973). Translated in Problems of Information Transmission 9 265–266.
[3] Rissanen, J.: Modeling by the Shortest Data Description. Automat-
ica 14 465–471 (1978)
[4] Rissanen, J.: Stochastic Complexity in Statistical Inquiry. World Scientific Publishing Company (1989)
[5] Solomonoff, R.J.: A Preliminary Report on a General Theory of In-
ductive Inference. (Revision of Report V–131, Feb. 1960), Contract
AF 49(639)–376, Report ZTB–138, Zator Co., Cambridge, Mass.
(Nov, 1960) (http://www.world.std.com/˜rjs/pubs.html)
[6] Solomonoff, R.J.: A Formal Theory of Inductive Inference. Informa-
tion and Control, Part I: Vol 7, No. 1. 1–22 (March 1964)
[7] Solomonoff, R.J.: A Formal Theory of Inductive Inference. Informa-
tion and Control, Part II: Vol 7, No. 2. 224–254 (June 1964)
[8] Solomonoff, R.J.: Complexity-Based Induction Systems: Compar-
isons and Convergence Theorems. IEEE Trans. on Information The-
ory, Vol IT–24, No. 4. 422–432 (July, 1978)
[9] Solomonoff, R.J.: A System for Incremental Learning Based on Al-
gorithmic Probability. Proceedings of the Sixth Israeli Conference
on Artificial Intelligence, Computer Vision and Pattern Recognition
515–527 (Dec. 1989)
[10] Solomonoff, R.J.: Progress In Incremental Machine Learning. TR
IDSIA-16-03, revision 2.0. (2003)
[11] Solomonoff, R.J.: Three Kinds of Probabilistic Induction: Universal
Distributions and Convergence Theorems. — The Computer Journal:
51(5), pp. 566–570, 2008.
[12] Stolcke, A.: On Learning Context Free Grammars. PhD Thesis (1994)
[13] Wallace, C.S. and Boulton, D.M.: An Information Measure for Classification. Computer Journal 11 185–195 (1968)