0% found this document useful (1 vote)
151 views

Algorithmic Probability Theory and Applications

This document defines and discusses algorithmic probability, an extremely powerful method of inductive inference. It discusses algorithmic probability's completeness, incomputability, diversity, and subjectivity. It then describes applications of algorithmic probability to predicting Bernoulli sequences and discovering grammars of context-free languages. The document concludes by noting algorithmic probability's potential use in very strong AI systems for general problem solving.

Uploaded by

Avik Mitra
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (1 vote)
151 views

Algorithmic Probability Theory and Applications

This document defines and discusses algorithmic probability, an extremely powerful method of inductive inference. It discusses algorithmic probability's completeness, incomputability, diversity, and subjectivity. It then describes applications of algorithmic probability to predicting Bernoulli sequences and discovering grammars of context-free languages. The document concludes by noting algorithmic probability's potential use in very strong AI systems for general problem solving.

Uploaded by

Avik Mitra
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 23

Algorithmic Probability—Theory and

Applications
Ray J. Solomonoff
Visiting Professor, Computer Learning Research Centre
Royal Holloway, University of London
IDSIA, Galleria 2, CH–6928 Manno–Lugano, Switzerland
rjsolo@ieee.org http://world.std.com/˜rjs/pubs.html
August 16, 2009∗

Abstract
We first define Algorithmic Probability, an extremely powerful method
of inductive inference. We discuss its completeness, incomputability, di-
versity and subjectivity and show that its incomputability in no way in-
hibits its use for practical prediction. Applications to Bernoulli sequence
prediction and grammar discovery are described. We conclude with a note
on its employment in a very strong AI system for very general problem
solving.

Introduction
Ever since probability was invented, there has been much controversy as
to just what it meant, how it should be defined and above all, what is
the best way to predict the future from the known past. Algorithmic
Probability is a relatively recent definition of probability that attempts to
solve these problems.
We begin with a simple discussion of prediction and its relationship
to probability. This soon leads to a definition of Algorithmic Probabil-
ity (ALP) and its properties. The best-known properties of ALP are
its incomputibility and its completeness (in that order). Completeness
means that if there is any regularity (i.e. property useful for prediction)
in a batch of data, ALP will eventually find it, using a surprisingly small
amount of data. The incomputability means that in the search for reg-
ularities, at no point can we make a useful estimate of how close we are
to finding the most important ones. We will show, however, that this
incomputability is of a very benign kind, so that in no way does it inhibit
∗ Revision of Chapter 1 in F. Emmert-Streib and M. Dehmer (Eds.), Information

Theory and Statistical Learning, Springer Science+Business Media, N.Y., 2009, pp.
1–23.

1
the use of ALP for good prediction. One of the important properties of
ALP is subjectivity, the amount of personal experiential information that
the statistician must put into the system. We will show that this is a de-
sirable feature of ALP, rather than a “Bug”. Another property of ALP
is its diversity - it affords many explanations of data giving very good
understanding of that data.
There have been a few derivatives of Algorithmic Probability - Mini-
mum Message Length (MML), Minimum Description Length (MDL) and
Stochastic Complexity - which merit comparison with ALP.
We will discuss the application of ALP to two kinds of problems: Pre-
diction of the Bernoulli Sequence and Discovery of the Grammars of Con-
text Free Languages. We also show how a variation of Levin’s search
procedure can be used to search over a function space very efficiently to
find good predictive models.
The final section is on the future of ALP — some open problems in
its application to AI and what we can expect from these applications.

1 Prediction, Probability and Induction


What is prediction?
“An estimate of what is to occur in the future” — But also necessary is a
measure of confidence in the prediction: As a negative example consider
an early AI program called “Prospector”. It was given the characteristics
of a plot of land and was expected to suggest places to drill for oil. While it
did indeed do that, it soon became clear that without having any estimate
of confidence, it is impossible to know whether it is economically feasible
to spend $100,000 for an exploratory drill rig. Probability is one way to
express this confidence.
Say the program estimated probabilities of 0.1 for 1000 gallon yield, 0.1
for 10,000 gallon yield and 0.1 for 100,000 gallon yield. The expected yield
would be 0.1 x 1000 + 0.1 x 10,000 + 0.1 x 100,000 = 11100 gallons. At
$100 per gallon this would give $1,110,000. Subtracting out the $100,000
for the drill rig gives an expected profit of $1,010,000, so it would be worth
drilling at that point. The moral is that predictions by themselves are
usually of little value - it is necessary to have confidence levels associated
with the predictions.
A strong motivation for revising classical concepts of probability has
come from the analysis of human problem solving. When working on a
difficult problem, a person is in a maze in which he must make choices
of possible courses of action. If the problem is a familiar one, the choices
will all be easy. If it is not familiar, there can be much uncertainty in
each choice, but choices must somehow be made. One basis for choices
might be the probability of each choice leading to a quick solution - this
probability being based on experience in this problem and in problems
like it. A good reason for using probability is that it enables us to use
Levin’s Search Technique (Section 10) to find the solution in near minimal
time.
The usual method of calculating probability is by taking the ratio of

2
the number of favorable choices to the total number of choices in the past.
If the decision to use integration by parts in an integration problem has
been successful in the past 43% of the time, then its present probability
of success is about 0.43. This method has very poor accuracy if we only
have one or two cases in the past, and is undefined if the case has never
occurred before. Unfortunately it is just these situations that occur most
often in problem solving.
On a very practical level: If we cross a particular street 10 times and
we get hit by a car twice, we might estimate that the probability of getting
hit in crossing that street is about 0.2 = 2/10. However, if instead, we
only crossed that street twice and we didn’t get hit either time, it would be
unreasonable to conclude that our probability of getting hit was zero! By
seriously revising our definition of probability, we are able to resolve this
difficulty and clear up many others that have plagued classical concepts
of probability.

What is Induction?
Prediction is usually done by finding inductive models. These are deter-
ministic or probabilistic rules for prediction. We are given a batch of data
- typically a series of zeros and ones, and we are asked to predict any one
of the data points as a function of the data points that precede it.

In the simplest case, let us suppose that the data has a very simple
structure:
0101010101010...
In this case, a good inductive rule is “zero is always followed by one;
one is always followed by zero”. This is an example of deterministic in-
duction, and deterministic prediction. In this case it is 100% correct every
time!
There is, however, a common kind of induction problem in which our
predictions will not be that reliable. Suppose we are given a sequence
of zeros and ones with very little apparent structure. The only apparent
regularity is that zero occurs 70% of the time and one appears 30% of
the time. Inductive algorithms give a probability for each symbol in a
sequence that is a function of any or none of the previous symbols. In the
present case, the algorithm is very simple and the probability of the next
symbol is independent of the past - the probability of zero seems to be 0.7;
the probability of one seems to be 0.3. This kind of simple probabilistic
sequence is called a “Bernoulli sequence”. The sequence can contain many
different kinds of symbols, but the probability of each is independent of
the past. In Section 8 we will discuss the Bernoulli sequence in some
detail.
In general we will not always be predicting Bernoulli sequences and
there are many possible algorithms (which we will call “models”) that tell
how to assign a probability to each symbol, based on the past. Which of
these should we use? Which will give good predictions in the future?
One desirable feature of an inductive model is that if it is applied
to the known sequence, it produces good predictions. Suppose Ri is an

3
inductive algorithm. Ri predicts the probability of an symbol aj in a
sequence a1 , a2 · · · an by looking at the previous symbols: More exactly,

pj = Ri (aj |a1 .a2 · · · aj−1 )

aj is the symbol for which we want the probability. a1 , a2 · · · aj−1 are the
previous symbols in the sequence. Then Ri is able to give the probability
of a particular value of aj as a function of the past. Here, the values
of aj can range over the entire “alphabet” of symbols that occur in the
sequence. If the sequence is a binary, aj will range over the set 0 and 1
only. If the sequence is English text, aj will range over all alphabetic and
punctuation symbols. If Ri is a good predictor, for most of the aj , the
probability it assigns to them will be large - near one.
Consider S, the product of the probabilities that Ri assigns to the in-
dividual symbols of the sequence, a1 , a2 · · · an . S will give the probability
that Ri assigns to the sequence as a whole.
Y
n Y
n

S= Ri (aj |a1 , a2 · · · an ) = pj (1)


j=1 j=1

For good prediction we want S as large as possible. The maximum value


it can have is one, which implies perfect prediction. The smallest value
it can have is zero - which can occur if one or more of the pj are zero -
meaning that the algorithm predicted an event to be impossible, yet that
event occurred!
The “Maximum Likelihood” method of model selection uses S only to
decide upon a model. First, a set of models is chosen by the statistician,
based on his experience with the kind of prediction being done. The model
within that set having maximum S value is selected.
Maximum Likelihood is very good when there is a lot of data - which
is the area in which classical statistics operates. When there is only a
small amount of data, it is necessary to consider not only S, but the effect
of the likelihood of the model itself on model selection. The next section
will show how this may be done.

2 Compression and ALP


An important application of symbol prediction is text compression. If an
induction algorithm assigns a probability S to a text, there is a coding
method - Arithmetic Coding - that can re-create the entire text without
error using just − log2 S bits.
More exactly: Suppose x is a string of English text, in which each
character is represented by an 8 bit ASCII code, and there are n characters
in x. x would be directly represented by a code of just 8n bits. If we had
a prediction model, R , that assigned a probability of S to the text, then
it is possible to write a sequence of just − log2 S bits, so that the original
text, x, can be recovered from that bit sequence without error.
If R is a string of symbols (usually a computer program) that describes
the prediction model, we will use |R| to represent the length of the shortest
binary sequence that describes R. If S > 0, then the probability assigned

4
to the text will be in two parts: the first part is the code for R, which is
|R| bits long, and the second part is the code for the probability of the
data, as given by R - it will be just − log2 S bits in length. The sum of
these will be |R| − log2 S bits. The compression ratio achieved by R would
be
8N
|R| − log2 S
PPM, a commonly used prediction algorithm, achieves a compression of
about 3 for English text. For very large strings of data, compressions of as
much as 6 have been achieved by highly refined prediction algorithms. We
can use |R| − log2 S bits, the length of the compressed code, as a “figure
of merit” of a particular induction algorithm with respect to a particular
text.
We want an algorithm that will give good prediction, i.e., large S, and
small |R|, so |R|−log2 S, the figure of merit, will be as small as possible and
the probability it assigns to the text will be as large as possible. Models
with |R| larger than optimum are considered to be overfitted. Models in
which |R| are smaller than optimum are considered to be underfitted. By
choosing a model that minimizes |R| − log2 S, we avoid both underfitting
and overfitting, and obtain very good predictions. We will return to this
topic later, when we tell how to compute |R| and S for particular models
and data sets.
Usually there are many inductive models available. In 1960, I de-
scribed Algorithmic Probability - ALP ([5, 6, 7]), which uses all possible
models in parallel for prediction, with weights dependent upon the figure
of merit of each model.
X
PM (an+1 |a1 , a2 · · · an ) = 2−|Ri | Si Ri (an+1 |a1 , a2 · · · an ) (2)

PM (an+1 |a1 , a2 · · · an ) is the probability assigned by ALP to the (n + 1)th


symbol of the sequence, in view of the previous part of the sequence.
Ri (an+1 |a1 , a2 · · · an ) is the probability assigned by the ith model to
the (n + 1)th symbol of the sequence, in view of the previous part of the
sequence.
Si is the probability assigned by Ri , (the ith model) to the known
sequence, a1 , a2 · · · an via Eqn. 1.
2−|Ri | Si is 1/2 with an exponent equal to the figure of merit that Ri
has with respect to the data string a1 , a2 . . . an . It is the weight assigned
to Ri ( ). This weight is large when the figure of merit is good - i.e. small.
Suppose that |Ri | is the shortest program describing the ith model
using a particular “reference computer” or programming language - which
we will call M . Clearly the value of |Ri | will depend on the nature of M .
We will be using machines (or languages) that are “ Universal” - machines
that can readily program any conceivable function - almost all computers
and programming languages are of this kind. The subscript M in PM
expresses the dependence of ALP on choice of the reference computer or
language.
The universality of M assures us that the value of ALP will not depend
very much on just which M we use - but the dependence upon M is

5
nonetheless important. It will be discussed at greater length in Section 4
on “Subjectivity”.
Normally in prediction problems we will have some time limit, T , in
which we have to make our prediction. In ALP what we want is a set of
models of maximum total weight. A set of this sort will give us an ap-
proximation that is as close as possible to ALP and gives best predictions.
To obtain such a set, we devise a search technique that tries to find, in
the available time, T , a set of Models, Ri , such that the total weight,
X
2−|Ri | Si (3)
i

is as large as possible.

On the whole, ALP would seem to be a complex, time consuming way


to compute probabilities - though in fact, if suitable approximations are
used, these objections are not at all serious.
Does ALP have any advantages over other probability evaluation meth-
ods? For one, it’s the only method known to be complete. The complete-
ness property of ALP means that if there is any regularity in a body of
data, our system is guaranteed to discover it using a relatively small sam-
ple of that data. More exactly, say we had some data that were generated
by an unknown probabilistic source, P . Not knowing P , we use instead,
PM to obtain the Algorithmic Probabilities of the symbols in the data.
How much do the symbol probabilities computed by PM differ from their
true probabilities, P ? The Expected value with respect to P of the total
square error between P and PM is bounded by −1/2 lnP0 .

£X
n
¤ 1
EP (PM (am+1 = 1|a1 , a2 · · · am ) − P (am+1 = 1|a1 , a2 · · · am ))2 ≤ − lnP0
2
m=1
lnP0 ≈ kln2 (4)

P0 is the a priori probability of P . It is the probability we would assign


to P if we knew P .
k is the Kolmogorov complexity of the data generator, P . It’s the
shortest binary program that can describe P , the generator of the data.
This is an extremely small error rate. The error in probability ap-
proaches zero more rapidly than 1/n. Rapid convergence to correct prob-
abilities is a most important feature of ALP. The convergence holds for
any P that is describable by a computer program and includes many
functions that are formally incomputable. Various kinds of functions are
described in the next section. The convergence proof is in Solomonoff 78
[8].

3 Incomputability
It should be noted that in general, it is impossible to find the truly best
models with any certainty - there is an infinity of models to be tested and
some take an unacceptably long time to evaluate. At any particular time

6
in the search, we will know the best ones so far, but we can’t ever be sure
that spending a little more time will not give much better models! While it
is clear that we can always make approximations to ALP by using a limited
number of models, we can never know how close these approximations are
to the “True ALP”. ALP is indeed, formally incomputable.
In this section, we will investigate how our models are generated and
how the incomputability comes about - why it is a necessary, desirable
feature of any high performance prediction technique, and how this in-
computability in no way inhibits its use for practical prediction.

How incomputability arises and how we deal with it


Recall that for ALP. we added up the predictions of all models, using
suitable weights:
X

PM = 2−|Ri | Si Ri (5)
i=1

Here Ri gives the probability distribution for the next symbol as com-
puted by the ith model. Just what do we mean by these models, Ri ?

There are just four kinds of functions that Ri can be:


1. Finite compositions of a finite set of functions.
2. Primitive recursive functions.
3. Partial recursive functions.
4. Total recursive functions.
Compositions are combinations of a small set of functions. The finite
power series
3.2 + 5.98 ∗ X − 12.54 ∗ X 2 + 7.44 ∗ X 3
is a composition using the functions plus and times on the real numbers.
Finite series of this sort can approximate any continuous functions to ar-
bitrary precision.

Primitive Recursive Functions are defined by one or more DO loops. For


example to define F actorial(X) we can write
F actorial(0) ← 1
DO I = 1, X
F actorial(I) ← I ∗ F actorial(I − 1)
EndDO

Partial Recursive Functions are definable using one or more W HILE


loops. For example, to define the factorial in this way:
F actorial(0) ← 1
I←0
W HILE I 6= X
I ← I + 1 F actorial(I) ← I ∗ F actorial(I − 1)
EndW HILE
The loop will terminate if X is a non negative integer. For all other values
of X, the loop will run forever. In the present case it is easy to tell for

7
which values of X the loop will terminate.
A simple W HILE loop in which it is not so easy to tell:
W HILE X > 4
IF X/2 is an integer T HEN X ← X/2
ELSE X ← 3 ∗ X + 1
EndW HILE
This program has been tested with X starting at all positive integers up
to more than sixty million. The loop has always terminated, but no one
yet is certain as to whether it terminates for all positive integers!
For any Total Recursive Function we know all values of arguments
for which the function has values. Compositions and primitive recursive
functions are all total recursive. Many partial recursive functions are
total recursive, but some are not. As a consequence of the insolvability
of Turing’s “Halting Problem”, it will sometimes be impossible to tell if
a certain W HILE loop will terminate or not.
Suppose we use Eqn. 2 to approximate ALP by sequentially testing
functions in a list of all possible functions - these will be the partial re-
cursive functions because this is the only recursively enumerable function
class that includes all possible predictive functions. As we test to find
functions with good figures of merit (small (|Ri | − log2 Si )) we find that
certain of them don’t converge after say, a time T , of ten seconds. We
know that if we increase T enough, eventually, all converging trials will
converge and all divergent trials will still diverge - so eventually we will
get close to true ALP - but we cannot recognize when this occurs. Fur-
thermore for any finite T , we cannot ever know a useful upper bound on
how large the error in the ALP approximation is. That is why this partic-
ular method of approximating ALP is called “incomputable”. Could there
be another computable approximation technique that would converge? It
is easy to show that any computable technique cannot be “complete” -
i.e.having very small errors in probability estimates.
Consider an arbitrary computable probability method, R0 . We will
show how to generate a sequence for which R0 ’s errors in probability
would always be 0.5 or more. We start our sequence with a single bit,
say zero. We then ask R0 for the most probable next bit. If it says “one
is more probable”, we make the continuation zero, if it says “zero is more
probable”, we make the next bit one. If it says “both are equally likely”
we make the next bit zero. We generate the third bit in the sequence in
the same way, and we can use this method to generate an arbitrarily long
continuation of the initial zero.
For this sequence, R0 will always have an error in probability of at least
one half. Since completeness implies that prediction errors approach zero
for all finitely describable sequences, it is clear that R0 or any other com-
putable probability method cannot be complete. Conversely, any complete
probability method, such as ALP, cannot be computable.
If we cannot compute ALP, what good is it? It would seem to be of
little value for prediction! To answer this objection, we note that from
a practical viewpoint, we never have to calculate ALP exactly - we can
always use approximations. While it is impossible to know how close our
approximations are to the true ALP, that information is rarely needed for
practical induction.

8
What we actually need for practical prediction:
1. Estimates of how good a particular approximation will be in future
problems - (called “Out of Sample Error”).
2. Methods to search for good models.
3. Quick and simple methods to compare models.
For 1., we can use Cross Validation or Leave One Out - well-known meth-
ods that work with most kinds of problems. In addition, because ALP
does not overfit or underfit there is usually a better method to make such
estimates.
For 2. In Section 10 we will describe a variant of Levin’s Search Proce-
dure, for an efficient search of a very large function space.
For 3., we will always find it easy to compare models via their associated
“Figures of Merit”, |Ri | − log2 (Si ).

In summary, it is clear that all computable prediction methods have a


serious flaw - they cannot ever approach completeness. On the other
hand, while approximations to ALP can approach completeness, we can
never know how close we are to the final, incomputable result. We can,
however, get good estimates of the future error in our approximations,
and this is all that we really need in a practical prediction system.
That our approximations approach ALP assures us that if we spend
enough time searching we will eventually get as little error in prediction
as is possible. No computable probability evaluation method can ever give
us this assurance. It is in this sense that the incomputability of ALP is a
desirable feature.

4 Subjectivity
The subjectivity of probability resides in a priori information - the in-
formation available to the statistician before he sees the data to be ex-
trapolated. This is independent of what kind of statistical techniques we
use. In ALP this a priori information is embodied in M , our “Reference
Computer”. Recall our assignment of a |R| value to an induction model -
it was the length of the program necessary to describe R. In general, this
will depend on the machine we use - its instruction set. Since the ma-
chines we use are Universal - they can imitate one another - the length of
description of programs will not vary widely between most reference ma-
chines we might consider. But nevertheless, using small samples of data
(as we often do in AI), these differences between machines can modify
results considerably.
For quite some time I felt that the dependence of ALP on the reference
machine was a serious flaw in the concept, and I tried to find some “objec-
tive” universal device, free from the arbitrariness of choosing a particular
universal machine. When I thought I finally found a device of this sort, I
realized that I really didn’t want it - that I had no use for it at all! Let
me explain:

9
In doing inductive inference, one begins with two kinds of information:
First, the data itself, and second, the a priori data - the information one
had before seeing the data. It is possible to do prediction without data,
but one cannot do prediction without a priori information. In choosing a
reference machine we are given the opportunity to insert into the a priori
probability distribution any information about the data that we know
before we see it.
If the reference machine were somehow “objectively” chosen for all
induction problems, we would have no way to make use of our prior in-
formation. This lack of an objective prior distribution makes ALP very
subjective - as are all Bayesian systems.
This certainly makes the results “subjective”. If we value objectivity,
we can routinely reduce the choice of a machine and representation to cer-
tain universal “default” values - but there is a tradeoff between objectivity
and accuracy. To obtain the best extrapolation, we must use whatever
information is available, and much of this information may be subjective.
Consider two physicians, A and B: A is a conventional physician: He
diagnoses ailments on the basis of what he has learned in school, what he
has read about and his own experience in treating patients. B is not a
conventional physician. He is “objective”. His diagnosis is entirely “by
the book” - things he has learned in school that are universally accepted.
He tries as hard as he can to make his judgements free of any bias that
might be brought about by his own experience in treating patients. As a
lawyer, I might prefer defending B’s decisions in court, but as a patient,
I would prefer A’s intelligently biased diagnosis and treatment.
To the extent that a statistician uses objective techniques, his recom-
mendations may be easily defended, but for accuracy in prediction, the
additional information afforded by subjective information can be a critical
advantage.
Consider the evolution of a priori in a scientist during the course of his
life. He starts at birth with minimal a priori information - but enough to
be able to learn to walk, to learn to communicate and his immune system
is able to adapt to certain hostilities in the environment. Soon after
birth, he begins to solve problems and incorporate the problem solving
routines into his a priori tools for future problem solving. This continues
throughout the life of the scientist - as he matures, his a priori information
matures with him.
In making predictions, there are several commonly used techniques for
inserting a priori information. First, by restricting or expanding the set
of induction models to be considered. This is certainly the commonest
way. Second, by selecting prediction functions with adjustable parame-
ters and assuming a density distribution over those parameters based on
past experience with such parameters. Third, we note that much of the
information in our sciences is expressed as definitions - additions to our
language. ALP, or approximations of it, avails itself of this information
by using these definitions to help assign code lengths, and hence a priori
probabilities to models. Computer languages are usually used to describe
models, and it is relatively easy to make arbitrary definitions part of the
language.
More generally, modifications of computer languages are known to be

10
able to express any conceivable a priori probability distributions. This
gives us the ability to incorporate whatever a priori information we like
into our computer language. It is certainly more general than any of the
other methods of inserting a priori information.

5 Diversity and Understanding


Apart from accuracy of probability estimate, ALP has for AI another
important value: Its multiplicity of models gives us many different ways
to understand our data.
A very conventional scientist understands his science using a single
“current paradigm” - the way of understanding that is most in vogue at
the present time. A more creative scientist understands his science in
very many ways, and can more easily create new theories, new ways of
understanding, when the “current paradigm” no longer fits the current
data.
In the area of AI in which I’m most interested - Incremental Learning
- this diversity of explanations is of major importance. At each point in
the life of the System, it is able to solve with acceptable merit, all of the
problems it’s been given thus far. We give it a new problem - usually its
present Algorithm is adequate. Occasionally, it will have to be modified
a bit. But every once in a while it gets a problem of real difficulty and
the present Algorithm has to be seriously revised. At such times, we try
using or modifying once sub-optimal algorithms. If that doesn’t work we
can use parts of the sub-optimal algorithms and put them together in new
ways to make new trial algorithms. It is in giving us a broader basis to
learn from the past, that this value of ALP lies.

5.1 ALP and “The Wisdom of Crowds”


It is a characteristic of ALP that it averages over all possible models of the
data: There is evidence that this kind of averaging may be a good idea in a
more general setting. “The Wisdom of Crowds” is a recent book by James
Serowiecki that investigates this question. The idea is that if you take a
bunch of very different kinds of people and ask them (independently) for
a solution to a difficult problem, then a suitable average of their solutions
will very often be better than the best in the set. He gives examples of
people guessing the number of beans in a large glass bottle, or guessing
the weight of a large ox, or several more complex, very difficult problems.
He is concerned with the question of what kinds of problems can be
solved this way as well as the question of when crowds are wise and when
they are stupid. They become very stupid in mobs or in committees in
which a single person is able to strongly influence the opinions in the
crowd. In a wise crowd, the opinions are individualized, the needed infor-
mation is shared by the problem solvers, and the individuals have great
diversity in their problem solving techniques. The methods of combin-
ing the solutions must enable each of the opinions to be voiced. These
conditions are very much the sort of thing we do in ALP. Also, when we

11
approximate ALP we try to preserve this diversity in the subset of models
we use.

6 Derivatives of ALP
After my first description of ALP in 1960 [5], there were several related
induction models described, minimum message length (MML) Wallace
and Boulton (1968) [13], Minimum Description Length (MDL) Rissanen
(1978) [3], and Stochastic Complexity, Rissanen (1989) [4]. These models
were conceived independently of ALP - (though Rissanen had read Kol-
mogorov’s 1965 paper on minimum coding [1],which is closely related to
ALP). MML and MDL select induction models by minimizing the figure
of merit, |Ri | − log2 (Si ) just as ALP does. However, instead of using a
weighted sum of models, they use only the single best model.
MDL chooses a space of computable models then selects the best model
from that space. This avoids any incomputability, but greatly limits the
kinds of models that it can use. MML recognizes the incomputability
of finding the best model so it is in principle much stronger than MDL.
Stochastic complexity, like MDL, first selects a space of computable mod-
els - then, like ALP it uses a weighted sum of all models in the that space.
Like MDL, it differs from ALP in the limited types of models that are
accessible to it. MML is about the same as ALP when the best model is
much better than any other found. When several models are of compa-
rable figure of merit, MML and ALP will differ. One advantage of ALP
over MML and MDL is in its diversity of models. This is useful if the
induction is part of an ongoing process of learning - but if the induction
is used on one problem only, diversity is of much less value. Stochastic
Complexity, of course, does obtain diversity in its limited set of models.

7 Extensions of ALP
The probability distribution for ALP that I’ve shown is called “The Uni-
versal Distribution for sequential prediction”. There are two other univer-
sal distributions I’d like to describe:

7.1 A Universal Distribution for an Unordered


Set of Strings
Suppose we have a corpus of n unordered discrete objects, each described
by a finite string aj : Given a new string, an+1 , what is the probability
that it is in the previous set? In MML and MDL, we consider various
algorithms, Ri , that assign probabilities to strings. (We might regard
them as Probabilistic Grammars). We use for prediction, the grammar,
Ri , for which
|Ri | − log2 Si (6)
is minimum. Here |RQ
i | is the number of bits in the description of the
grammar, Ri . Si = Ri (aj ) is the probability assigned to the entire
j

12
corpus by Ri . If Rk is the best stochastic grammar that we’ve found,
then we use Rk (an+1 ) as the probability of an+1 . To obtain the ALP
version, we simply sum over all models as before, using weights 2−|Ri | Si
This kind of ALP has an associated convergence theorem giving very
small errors in probability. This approach can be used in linguistics. The
aj can be examples of sentences that are grammatically correct. We can
use |Ri | − log2 Si as a likelihood that the data was created by grammar,
Ri . Section 9 continues the discussion of Grammatical Induction.

7.2 A Universal Distribution for an Unordered


Set of Ordered Pairs of Strings
This type of induction includes almost all kinds of prediction problems
as “special cases”. Suppose you have a set of question answer pairs,
Q1 , A1 ; Q2 , A2 ; ...Qn , An : Given a new question, Qn+1 , what is the prob-
ability distribution over possible answers, An+1 ? Equivalently, we have
an unknown analog and/or digital transducer, and we are given a set of
input/output pairs Q1 , A1 ; ... - For a new input Qi , what is probability
distribution on outputs? Or, say the Qi are descriptions of mushrooms
and the Ai are whether they are poisonous or not.
As before, we hypothesize operators Rj (A|Q) that are able to assign
a probability to any A given any Q: The ALP solution is
X Y
n+1

P (An+1 |Qn+1 ) = 2−|Rj | Rj (Ai |Qi )


j i=1
X
= ajn Rj (An+1 |Qn+1 )
j

Y
n

ajn = 2−|Rj | Rj (Ai |Qi ) (7)


i=1

2−|Rj | are the a priori probabilities associated with the Rj .


ajn is the weight given to Rj ’s predictions in view of it’s success in
predicting the data set Q1 , A1 ...Qn , An .
This ALP system has a corresponding theorem for small errors in
probability. As before, we try to find a set of models of maximum weight
in the available time. Proofs of convergence theorems for these extensions
of ALP are in Solomonoff 2008 [11]. There are corresponding MDL, MML
versions in which we pick the single model of maximum weight.

8 Coding the Bernoulli Sequence


First, consider a binary Bernoulli sequence of length n. It’s only visible
regularity is that zeroes have occurred n0 times and ones have occurred
n1 times. One kind of model for this data is that the probability of 0 is p
and the probability of 1 is 1 − p. Call this model Rp . Sp is the probability
assigned to the data by Rp .
Sp = pn0 (1 − p)n1 (8)

13
Recall that ALP tells us to sum the predictions of each model, with weight
given by the product of the a priori probability of the model (2−|Ri | ) and
Si , the probability assigned to the data by the model . . . i.e.:
X
2−|Ri | Si Ri ( ) (9)
i

In summing we consider all models with 0 ≤ p ≤ 1.


We assume for each model, Rp , precision ∆ in describing p. So p is
1
specified with accuracy, ∆. We have ∆ models to sum so total weight is
1.
2−|Ri | = ∆
Si = Sp = pn0 (1 − p)n1
Ri ( ) = Rp = p
Summing the models for small ∆ gives the integral
Z 1 Z 1
n0 n1
p (1 − p) p dp = pn0+1 (1 − p)n1 dp (10)
0 0

This integral can be evaluated using the Beta function, B( , ).


Z 1
x! y!
px (1 − p)y dy = B(x + 1, y + 1) = (11)
0
(x + y + 1)!
(n0 +1)! n1 !
So our integral of Eqn. 10 equals (n 0 +n1 +2)!
.
We can get about the same result another way: The function pn0 (1 −
p)n1 is (if n0 and n1 are large), narrowly peaked at p0 = n0n+n 0
1
.If we
used MDL we would use the model with p = p0 . The a priori probability
of the model itself will depend on how accurately we have to specify p0 . If
the “width” of the peaked distribution is ∆, then the a priori probability
of model Mp0 will be just ∆ · pn 0 (1 − p0 )
0 n1
. q
p0 (1−p0 )
It is known that the width of the distribution is just 2 n0 +n1 +1
.1
q
p0 (1−p0 )
As a result the probability assigned to this model is n0 +n1 +1
· pn
0 (1
0

n1 −n n

p0 ) · 2. If we use Sterling’s approximation for n! (n! ≈ e n 2πn), it
is not difficult to show that
r
n0 !n1 ! p0 (1 − p0 ) √
≈ pn
0 (1 − p0 )
0 n1
· 2π (12)
(n0 + n1 + 1)! n0 + n1 + 1

2π = 2.5066 which is roughly equal to 2.
To obtain the probability of a zero following a sequence of n0 zeros
and n1 ones: We divide the probability of the sequence having the extra
zero, by the probability of the sequence without the extra zero. i.e.:
(n0 + 1)!n1 !
.
n0 !n1 ! n0 + 1
= (13)
(n0 + n1 + 2)! (n0 + n1 + 1)! (n0 + n1 + 2)
1 This can be obtained by getting the first and second moments of the distribution, using
R1 x!y!
the fact that px (1 − p)y dp = (x+y+1)!
.
0

14
This method of extrapolating binary Bernoulli sequences is called “Laplace’s
Rule”. The formula for the probability of a binary sequence, (n0n+n 0 !n1 !
1 +1)!
can be generalized for an alphabet of k symbols.
A sequence of k different kinds of symbols has a probability of

Q
k
(k − 1)! ni !
i=1
(14)
Pk
(k − 1 + ni )!
i=1

th
ni is the number of times the i symbol occurs.
This formula can be obtained by integration in a k − 1 dimensional
nk−1
space of the function pn1 n2
1 p2 · · · pk−1 (1 − p1 − p2 · · · − pk−1 )
nk
.
Through an argument similar to that used for the binary sequence,
the probability of the next symbol being of the j th type is
nj + 1
Pk (15)
k+ i=1
ni

A wayP to visualize this result: the body of data (the “corpus”) consists of
the ni symbols. Think of a “pre–corpus” containing one of each of the
k symbols. If we think of a “macro corpus” as “corpus plus pre–corpus”
we can obtain the probability of the next symbol being the j th one by
dividing the number of occurrences of that symbol in the macro corpus,
by the total number of symbols of all types in the macro corpus.
It is also possible to have different numbers of each symbol type in
the pre–corpus, enabling us to get a great variety of “a priori probability
distributions” for our predictions.

9 Context Free Grammar Discovery


This is a method of extrapolating an unordered set of finite strings: Given
the set of strings, a1 , a2 , · · · an , what is the probability that a new string,
an+1 , is a member of the set? We assume that the original set was gener-
ated by some sort of probabilistic device. We want to find a device of this
sort that has a high a priori likelihood (i.e. short description length) and
assigns high probability to the data set. A good model Ri , is one with
maximum value of
Y
n

P (Ri ) Ri (aj ) (16)


j=1

Here P (Ri ) is the a priori probability of the model Ri . Ri (aj ) is the


probability assigned by Ri to data string, aj .
To understand probabilistic models, we first define non–probabilistic
grammars. In the case of context free grammars, this consists of a set of
terminal symbols and a set of symbols called nonterminals, one of which
is the initial starting symbol, S.
A grammar could then be:

S → Aac

15
S → BaAd
A → BAaS
A → AB
A → a
B → aBa
B → b

The capital letters (including S) are all nonterminal symbols. The lower
case letters are all terminal symbols. To generate a legal string, we start
with the symbol, S, and we perform either of the two possible substitu-
tions. If we choose BaAd, we would then have to choose substitutions for
the nonterminals B and A. For B, if we chose aBa we would again have
to make a choice for B. If we chose a terminal symbol, like b for B, then
no more substitutions can be made.
An example of a string generation sequence:
S, BaAd, aBaaAd, abaaAd, abaaABd, abaaaBd, abaaabd.
The string abaaabd is then a legally derived string from this grammar. The
set of all strings legally derivable from a grammar is called the language
of the grammar. The language of a grammar can contain a finite or
infinite number of strings. If we replace the deterministic substitution
rules with probabilistic rules, we have a probabilistic grammar. A grammar
of this sort assigns a probability to every string it can generate. In the
deterministic grammar above, S had two rewrite choices, A had three,
and B had two. If we assign a probability to each choice, we have a
probabilistic grammar.
Suppose S had substitution probability 0.1 for Aac and 0.9 for BaAd.
Similarly, assigning probabilities 0.3, 0.2 and 0.5 for A’s substitutions and
0.4, 0.6 for B’s substitutions.

S 0.1 Aac
0.9 BaAd
A 0.3 BAaS
0.2 AB
0.5 a
B 0.4 aBa
0.6 b

In the derivation of abaaab of the previous example, the substitutions


would have probabilities 0.9 to get BaAd, 0.4 to get aBaaAd, 0.6 to
get abaaAd, 0.2 to get abaaABd, 0.5 to get abaaaBd, and 0.6 to get
abaaabd. The probability of the string abaabd being derived this way is
0.9 × 0.4 × 0.6 × 0.2 × 0.5 × 0.6 = 0.01296. Often there are other ways to
derive the same string with a grammar, so we have to add up the probabil-
ities of all of its possible derivations to get the total probability of a string.

16
Suppose we are given a set of strings, ab, aabb, aaabbb that were gen-
erated by an unknown grammar. How do we find the grammar?
I wouldn’t answer that question directly, but instead I will tell how to
find a sequence of grammars that fits the data progressively better. The
best one we find may not be the true generator, but will give probabilities
to strings close to those given by the generator.
The example here is that of A. Stolcke’s, PhD thesis [12]. We start
with an ad hoc grammar that can generate the data, but it overf its . . .
it is too complex:

S → ab
→ aabb
→ aaabbb

We then try a series of modifications of the grammar (Chunking and


M erging) that increase the total probability of description and thereby
decrease total description length. M erging consists of replacing two non-
terminals by a single nonterminal. Chunking is the process of defining
new nonterminals. We try it when a string or substring has occurred
two or more times in the data. ab has occurred three times so we define
X = ab and rewrite the grammar as

S → X
→ aXb
→ aaXbb
X → ab

aXb occurs twice so we define Y = aXb giving

S → X
→ Y
→ aY b
X → ab
Y → aXb

At this point there are no repeated strings or substrings, so we try the


operation M erge which coalesces two nonterminals. In the present case
merging S and Y would decrease complexity of the grammar, so we try:

S → X
→ aSb
→ aXb
X → ab

Next, merging S and X gives

S → aSb
→ ab

17
which is an adequate grammar. At each step there are usually several
possible chunk or merge candidates. We chose the candidates that give
minimum description length to the resultant grammar.
How do we calculate the length of description of a grammar and its
description of the data set?
Consider the grammar

S → X
→ Y
→ aY b
X → ab
Y → aXb

There are two kinds of terminal symbols and three kinds of nonterminals.
If we know the number of terminals and nonterminals, we need describe
only the right hand side of the substitutions to define the grammar. The
names of the nonterminals (other than the first one, S) are not relevant.
We can describe the right hand side by the string Xs1 Y s1 aY bs1 s2 abs1 s2 aXbs1 s2 .
s1 and s2 are punctuation symbols. s1 marks the end of a string. s2 marks
the end of a sequence of strings that belong to the same nonterminal. The
string to be encoded has 7 kinds of symbols. The number of times each
occurs: X, 2; Y , 2; S, 0; a, 3; b, 3; s1 , 5; s2 , 3. We can then use the
formula
Q
k
(k − 1)! ni !
i=1
(17)
Pk
(k − 1 + ni )!
i=1

to compute the probability of the grammar: k = 7, since there are 7


symbols and n1 = 2, n2 = 2, n3 = 0, n4 = 3, etc. We also have to
include the probability of 2, the number of kinds of terminals, and of 3,
the number of kinds of nonterminals.
There is some disagreement in the machine learning community about
how best to assign probability to integers, n. A common form is

P (n) = A2− log2 n (18)

where log∗2n = log2 n + log2 log2 n + log2 log2 log2 n · · · taking as many
positive terms as there are, and A is a normalization constant. There
seems to be no good reason to choose 2 as the base for logs, and using
different bases gives much different results. If we use natural logs, the
sum diverges.
This particular form of P (n) was devised by Rissanen. It is an at-
tempt to approximate the shortest description of the integer n, e.g. the
Kolmogorov complexity of n. Its first moment is infinite, which means it
is very biased toward large numbers. If we have reason to believe, from
previous experience, that n will not be very large, but will be about λ,
then a reasonable form of P (n) might be P (n) = Ae−n/λ , A being a
normalization constant.

18
Q
n
The forgoing enables us to evaluate P (Ri ) of Eqn. 16. The Ri (aj )
j=1
part is evaluated by considering the choices made when the grammar
produces the data corpus. For each nonterminal, we will have a sequence
of decisions whose probabilities can be evaluated by an expression like
Eqn. 14, or perhaps the simpler technique of Eqn. 15 that uses the “pre–
corpus”. Since there are three nonterminals, we need the product of three
such expressions.
The process used by Stolcke in his thesis was to make various trials of
chunking or merging in attempts to successively get a shorter description
length - or to increase Eqn. 16. Essentially a very greedy method. He
has been actively working on Context Free Grammar discovery since then,
and has probably discovered many improvements. There are many more
recent papers at his website.
Most, if not all CFG discovery has been oriented toward finding a
single best grammar. For applications in AI and genetic programming
it is useful to have large sets of not necessarily best grammars - giving
much needed diversity. One way to implement this: At each stage of
modification of a grammar, there are usually several different operations
that can reduce description length. We could pursue such paths in parallel
. . . perhaps retaining the best 10 or best 100 grammars thus far. Branches
taken early in the search could lead to very divergent paths and much
needed diversity. This approach helps avoid local optima in grammars
and premature convergence when applied to Genetic Programming.

10 Levin’s Search Technique


In the section on incomputability we mentioned the importance of good
search techniques for finding effective induction models. The procedure
we will describe was inspired by Levin’s search technique [2], but is applied
to a different kind of problem.
Here, we have a corpus of data to extrapolate, and we want to search
over a function space, to find functions (“models”) Ri ( ) such that 2−|Ri | Si
is as large as possible. In this search, for some Ri , the time needed
to evaluate Si , (the probability assigned to the corpus by Ri ), may be
unacceptably large - possibly infinite.
Suppose we have a (deterministic) context free grammar, G, that can
generate strings that are programs in some computer language. (Most
computer languages have associated grammars of this kind). In generat-
ing programs, the grammar will have various choices of substitutions to
make. If we give each substitution in a k–way choice, a probability of 1/k,
then we have a probabilistic grammar that assigns a priori probabilities
to the programs that it generates. If we use a functional language (such
as LISP), this will give a probability distribution over all functions it can
generate. The probability assigned to the function Ri will be denoted
by PM (Ri ). Here M is the name of the functional computer language.
PM (Ri ) corresponds to what we called 2−|Ri | in our earlier discussions.
|Ri | corresponds to − log2 PM (Ri ). As before, Si is the probability as-
signed to the corpus by Ri . We want to find functions Ri ( ) such that

19
PM (Ri )Si is as large as possible.
Next we choose a small initial time T - which might be the time needed
to execute 10 instructions in our Functional Language. The initial T is not
critical. We then compute PM (Ri )Si for all Ri for which ti /PM (Ri ) < T .
Here ti is the time needed to construct Ri and evaluate its Si .
There are only a finite number of Ri that satisfy this criterion and if
T is very small, there will be very few, if any. We remember which Ri ’s
have large PM (Ri )Si . P
ti < T · PM (R Pi ), so i ti , the total time for this part of the search
takes
P time < T · P (Ri ). Since the PM (Ri ) are a priori probabilities,
i M
P (Ri ) must be less than or equal to 1, and so the total time for this
i M
part of the search must be less than T .
If we are not satisfied with the Ri we’ve obtained, we double T and
do the search again. We continue doubling T and searching until we
find satisfactory Ri ’s or until we run out of time. If T 0 is the value of
T when we finish the search, then the total time for the search will be
T 0 + T 0 /2 + T 0 /4 · · · ≈ 2T 0 .
If it took time tj to generate and test one of our “good” models, Rj ,
then when Rj was discovered, T 0 would be no more than 2tj /PM (Rj ) —
so we would take no more time than twice this, or 4tj /PM (Rj ) to find
Rj . Note that this time limit depends on Rj only, and is independent
of the fact that we may have aborted the Si evaluations of many Ri for
which ti was infinite or unacceptably large. This feature of Levin Search
is a mandatory requirement for search over a space of partial recursive
functions. Any weaker search technique would seriously limit the power
of the inductive models available to us.
When ALP is being used in AI, we are solving a sequence of problems
of increasing difficulty. The machine (or language) M is periodically “up-
dated” by inserting subroutines and definitions, etc., into M so that the
solutions, Ri to problems in the past result in larger PM (Rj ). As a result
the tj /PM (Rj ) are smaller - giving quicker solutions to problems of the
past - and usually for problems of the future as well.

11 The Future of ALP — Some Open


Problems
We have described ALP and some of its properties:
First, its completeness: Its remarkable ability to find any irregularities
in an apparently small amount of data.
Second: That any complete induction system like ALP must be for-
mally incomputable.
Third: That this incomputability imposes no limit on its use for prac-
tical induction. This fact is based on our ability to estimate the future
accuracy of any particular induction model. While this seems to be easy
to do in ALP without using Cross Validation, more work needs to be done
in this area.
ALP was designed to work on difficult problems in AI. The particular
kind of AI considered was a version of “Incremental Learning”: We give

20
the machine a simple problem. Using Levin Search, it finds one or more
solutions to the problem. The system then updates itself by modifying
the reference machine so that the solutions found will have higher a priori
probabilities. We then give it new problems somewhat similar to the pre-
vious problem. Again we use Levin Search to find solutions—We continue
with a sequence of problems of increasing difficulty, updating after each
solution is found. As the training sequence continues we expect that we
will need less and less care in selecting new problems and that the system
will eventually be able to solve a large space of very difficult problems.
For a more detailed description of the system, see Solomonoff 2003 [10].
The principal things that need to be done to implement such a system:
* We have to find a good reference language: Some good candidates
are APL, LISP, FORTH, or a subset of Assembly Language. These
languages must be augmented with definitions and subroutines that
we expect to be useful in problem solving.
* The design of good training sequences for the system is critical for
getting much problem–solving ability into it. I have written some
general principles on how to do this [9], but more work needs to be
done in this area. For early training, it might be useful to learn
definitions of instructions from Maple or Mathematica. For more
advanced training we might use the book that Ramanujan used to
teach himself mathematics —“A Synopsis of Elementary Results in
Pure and Applied Mathematics” by George S. Carr.
* We need a good update algorithm. It is possible to use PPM, a rela-
tively fast, effective method of prediction, for preliminary updating,
but to find more complex regularities, a more general algorithm is
needed. The universality of the reference language assures us that
any conceivable update algorithm can be considered. APL’s diver-
sity of solutions to problems maximizes the information that we are
able to insert into the a priori probability distribution. After a suit-
able training sequence the system should know enough to usefully
work on the problem of updating itself.
Because of ALP’s completeness (among other desirable properties), we
expect that the complete AI system described above should become an
extremely powerful general problem solving device — going well beyond
the limited functional capabilities of current incomplete AI systems.

Further Reading
Cover, T. and Thomas, J.: Elements of Information Theory. Wiley
and Sons, N.Y. (1991) – Good treatments of statistics, predictions, etc.
Li, M. and Vitányi, P.: An Introduction to Kolmogorov Complexity
and Its Applications. Springer-Verlag, N.Y. (1993)
Li, M. and Vitányi, P.: An Introduction to Kolmogorov Complexity
and Its Applications. Springer-Verlag, N.Y. (1997)
Li, Vitányi, 1993, 1997 – Starts with elementary treatment and de-
velopment. Many sections very clear, very well written. Other sections

21
difficult to understand. Occasional serious ambiguity of notation (e.g.
definition of “enumerable”). Treatment of probability is better in 1997
than in 1993 edition.
Y. Shan, R.I. McKay, R. Baxter et al.: Grammar Model-Based Pro-
gram Evolution. (Dec. 2003) A recent review of work in this area, and
what looks like a very good learning system. Discusses mechanics of fitting
Grammar to Data, and how to use Grammars to guide Search Problems.
A. Stolcke, S. Omohundro: Inducing Probabilistic Grammars by Bayesian
Model Merging. (1994) This is largely a summary of Stolcke’s: On Learn-
ing Context Free Grammars [12].
Solomonoff, R.J.: The following papers are all available at the website:
http://world.std.com/˜rjs/pubs.html
A Preliminary Report on a General Theory of Inductive Inference.
(1960)
A Formal Theory of Inductive Inference. Information and Control,
Part I: (1964)
A Formal Theory of Inductive Inference, Part II: (June 1964.) Dis-
cusses fitting of context free grammars to data. Most of the discussion
is correct, but Sections 4.2.4 and 4.3.4 are questionable and equations 49
and 50 are incorrect.
– A Preliminary Report . . . and A Formal Theory... give some
intuitive justification for the way ALP does induction.
The Discovery of Algorithmic Probability. (1997) – Gives heuristic
background for discovery of ALP. Page 27 gives a time line of important
publications related to development of ALP.
Progress in Incremental Machine Learning; Revision 2.0. (Oct. 30,
2003) – A more detailed description of the system I’m currently working
on. There have been important developments since, however.
The Universal Distribution and Machine Learning. (2003) – Discus-
sion of irrelevance of incomputability to applications for prediction. Also
discussion of subjectivity.

References
[1] Kolmogorov, A.N.: Three Approaches to the Quantitative Definition
of Information. Problems Inform. Transmission 1(1) 1–7 (1965)
[2] Levin, L.A.: Universal Search Problems. Problemy Peredaci Informa-
cii 9 115–116 (1973) Translated in Problems of Information Trans-
mission 9 265–266.
[3] Rissanen, J.: Modeling by the Shortest Data Description. Automat-
ica 14 465–471 (1978)
[4] Rissanen, J.: Stochastical Complexity and Statistical Inquiry. World
Scientific Publishing Company (1989)
[5] Solomonoff, R.J.: A Preliminary Report on a General Theory of In-
ductive Inference. (Revision of Report V–131, Feb. 1960), Contract
AF 49(639)–376, Report ZTB–138, Zator Co., Cambridge, Mass.
(Nov, 1960) (http://www.world.std.com/˜rjs/pubs.html)

22
[6] Solomonoff, R.J.: A Formal Theory of Inductive Inference. Informa-
tion and Control, Part I: Vol 7, No. 1. 1–22 (March 1964)
[7] Solomonoff, R.J.: A Formal Theory of Inductive Inference. Informa-
tion and Control, Part II: Vol 7, No. 2. 224–254 (June 1964)
[8] Solomonoff, R.J.: Complexity-Based Induction Systems: Compar-
isons and Convergence Theorems. IEEE Trans. on Information The-
ory, Vol IT–24, No. 4. 422–432 (July, 1978)
[9] Solomonoff, R.J.: A System for Incremental Learning Based on Al-
gorithmic Probability. Proceedings of the Sixth Israeli Conference
on Artificial Intelligence, Computer Vision and Pattern Recognition
515–527 (Dec. 1989)
[10] Solomonoff, R.J.: Progress In Incremental Machine Learning. TR
IDSIA-16-03, revision 2.0. (2003)
[11] Solomonoff, R.J.: Three Kinds of Probabilistic Induction: Universal
Distributions and Convergence Theorems. — The Computer Journal:
51(5), pp. 566–570, 2008.
[12] Andreas Stolcke: On Learning Context Free Grammars. PhD Thesis
(1994)
[13] Wallace, C.S and Boulton, D.M.: An Information Measure for Clas-
sification. Computer Journal 11 185–195 (1968)

23

You might also like