Towards A Universal Theory of Artificial Intelligence Based On Algorithmic Probability and Sequential Decisions
Marcus Hutter
Table of Contents
• Sequential Decision Theory
• Iterative and functional AIµ Model
• Algorithmic Complexity Theory
• Universal Sequence Prediction
• Definition of the Universal AIξ Model
• Universality of ξ^AI and Credit Bounds
• Sequence Prediction (SP)
• Strategic Games (SG)
• Function Minimization (FM)
• Supervised Learning by Examples (EX)
• The Timebounded AIξ Model
• Aspects of AI included in AIξ
• Outlook & Conclusions
Overview
• Decision Theory solves the problem of rational agents in uncertain worlds if the
environmental probability distribution is known.
• Solomonoff’s theory of Universal Induction solves the problem of sequence
prediction for unknown prior distribution.
• We combine both ideas and get a parameterless model of Universal Artificial
Intelligence.
Preliminary Remarks
• The goal is to mathematically define a unique model superior to any other model
in any environment.
• The AIξ model is unique in the sense that it has no parameters which could be
adjusted to the actual environment in which it is used.
• In this first step toward a universal theory we are not interested in computational
aspects.
• Nevertheless, we are interested in maximizing a utility function, which means learning in as few cycles as possible. The interaction cycle is the basic unit, not the computation time per unit.
The system p interacts with the environment q in cycles k = 1, ..., T:

for k := 1 to T do
- p thinks/computes/modifies its internal state.
- p writes output y_k ∈ Y.
- q reads output y_k.
- q computes/modifies its internal state.
- q writes reward/utility input r_k := r(x_k) ∈ R.
- q writes regular input x'_k ∈ X'.
- p reads input x_k := r_k x'_k ∈ X.
endfor
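As an illustration, here is a minimal Python sketch of this interaction loop, with a toy agent and environment standing in for p and q (the class names and the simple echo/reward rule are illustrative assumptions, not part of the model):

```python
import random

class Agent:                      # stands in for the system p
    def act(self, k):
        return random.choice([0, 1])      # output y_k in Y = {0, 1}

class Environment:                # stands in for the environment q
    def step(self, y_k):
        x_prime = y_k             # toy rule: echo the action as regular input
        r_k = 1.0 if y_k == 1 else 0.0    # toy reward r(x_k)
        return r_k, x_prime

T = 10
p, q = Agent(), Environment()
for k in range(1, T + 1):
    y_k = p.act(k)                # p writes output y_k
    r_k, xp_k = q.step(y_k)       # q computes reward and regular input
    x_k = (r_k, xp_k)             # p reads input x_k = r_k x'_k
    print(f"cycle {k}: y={y_k}, r={r_k}, x'={xp_k}")
```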
Probabilistic Environment
Replace q by a probability distribution µ(q) over environments.
The total expected reward in cycles k to m is
$$V_{km}^{\mu}(p\,|\,\dot y\dot x_{<k}) \;:=\; \frac{1}{N}\sum_{q\,:\,q(\dot y_{<k})=\dot x_{<k}} \mu(q)\cdot V_{km}(p,q)$$
where N is a normalization constant.
With Bayes' rule, the probability of input $x_1 \ldots x_k$ if the system outputs $y_1 \ldots y_k$ is
$$\mu(y x_{1:k}) \;=\; \sum_{q\,:\,q(y_{1:k})=x_{1:k}} \mu(q)$$
The AIµ model then chooses its output by expectimax over future actions and perceptions:
$$\dot y_k \;=\; \operatorname{maxarg}_{y_k}\sum_{x_k}\,\max_{y_{k+1}}\sum_{x_{k+1}}\cdots\,\max_{y_{m_k}}\sum_{x_{m_k}} (r_k+\ldots+r_{m_k})\cdot\mu(\dot y\dot x_{<k}\,y x_{k:m_k})$$
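For intuition, the following Python sketch computes this expectimax recursion exactly for a tiny known environment (the alphabets, the conditional distribution mu_cond, and reward r(x) = x are illustrative assumptions, not Hutter's construction):

```python
# Toy expectimax for the AImu recursion with a known environment mu.
Y, X = [0, 1], [0, 1]                 # action and percept alphabets

def mu_cond(history, y, x):
    """P(percept x | history, action y): percept echoes action w.p. 0.9."""
    return 0.9 if x == y else 0.1

def value(history, k, m):
    """max_y sum_x mu(x|...) (r(x) + V_{k+1}) -- the expectimax recursion."""
    if k > m:
        return 0.0
    return max(sum(mu_cond(history, y, x) * (x + value(history + ((y, x),), k + 1, m))
                   for x in X) for y in Y)

def best_action(history, k, m):
    return max(Y, key=lambda y: sum(mu_cond(history, y, x) *
               (x + value(history + ((y, x),), k + 1, m)) for x in X))

print(best_action((), 1, 3))          # -> 1 (echo reward favours action 1)
```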
Remaining Problems:
• Computational aspects.
• The true prior probability µ is usually not known, not even approximately.
Typical orders of magnitude:
$$1 \;\ll\; \langle l(y_k x_k)\rangle \;\ll\; k \;\ll\; T \;\ll\; |Y\times X|$$
$$\text{e.g.}\quad 1 \;\ll\; 2^{16} \;\ll\; 2^{24} \;\ll\; 2^{32} \;\ll\; 2^{65536}$$
Universal Sequence Prediction
The universal semimeasure ξ(x) is the probability that the output of the universal Turing machine U starts with x when the input is provided by fair coin flips:
$$\xi(x) \;:=\; \sum_{p\,:\,U(p)=x*} 2^{-l(p)} \qquad \text{[Solomonoff 64]}$$
Furthermore, the µ-expected sum of squared distances between ξ and µ is finite for computable µ:
$$\sum_{k=1}^{\infty}\sum_{x_{1:k}} \mu(x_{<k})\,\big(\xi(x_k\,|\,x_{<k})-\mu(x_k\,|\,x_{<k})\big)^2 \;\overset{+}{<}\; \ln 2\cdot K(\mu)$$
[Solomonoff 78] for binary alphabet.
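Since ξ itself is incomputable, here is a hedged Python sketch of the same idea on a finite model class: a Bayesian mixture whose prior weights play the role of $2^{-K(\rho)}$ (the three Bernoulli environments and their weights are illustrative assumptions) and whose predictions converge to the true µ:

```python
import random

# Finite, computable stand-in for the universal mixture xi.
models  = [0.2, 0.5, 0.8]        # P(x_k = 1) under each candidate rho
weights = [0.25, 0.5, 0.25]      # prior weights, in the role of 2^{-K(rho)}
true_mu = 0.8                    # the (unknown) true environment

random.seed(0)
for k in range(200):
    x = 1 if random.random() < true_mu else 0      # sample x_k from mu
    # Bayes update: w_i <- w_i * rho_i(x_k), then normalize
    weights = [w * (m if x == 1 else 1 - m) for w, m in zip(weights, models)]
    s = sum(weights)
    weights = [w / s for w in weights]

xi_pred = sum(w * m for w, m in zip(weights, models))  # xi(1 | x_<k)
print(f"xi(1 | x_<k) = {xi_pred:.3f}  (approaches mu = {true_mu})")
```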
Definition of the Universal AIξ Model
The AIξ model is obtained by replacing the unknown true prior µ with the universal ξ in the expectimax expression:
$$\dot y_k \;=\; \operatorname{maxarg}_{y_k}\sum_{x_k}\,\max_{y_{k+1}}\sum_{x_{k+1}}\cdots\,\max_{y_{m_k}}\sum_{x_{m_k}} (r_k+\ldots+r_{m_k})\cdot\xi(\dot y\dot x_{<k}\,y x_{k:m_k})$$
Universality of ξ^AI
The proof is analogous to that for sequence prediction; the y_k are pure spectators.
$$\xi(y x_{1:n}) \;\overset{\times}{\geq}\; 2^{-K(\rho)}\,\rho(y x_{1:n}) \qquad \forall\ \text{chronological } \rho$$
Convergence of ξ^AI to µ^AI
The y_i are again pure spectators. To generalize the SP case to an arbitrary alphabet we need the entropy inequality
$$\sum_{i=1}^{|X|}(y_i-z_i)^2 \;\leq\; \sum_{i=1}^{|X|} y_i \ln\frac{y_i}{z_i} \qquad \text{for}\quad \sum_{i=1}^{|X|} y_i = 1 \geq \sum_{i=1}^{|X|} z_i$$
$$\Rightarrow\quad \xi^{AI}(y x_n\,|\,y x_{<n}) \;\xrightarrow{\;n\to\infty\;}\; \mu^{AI}(y x_n\,|\,y x_{<n}) \quad \text{with } \mu\text{-probability } 1.$$
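A quick numeric sanity check of the entropy inequality above (an illustrative Python snippet on random distributions, not part of the proof):

```python
import math, random

# Check: sum (y_i - z_i)^2 <= sum y_i ln(y_i / z_i)
# for y a probability vector and z a semimeasure (sum z_i < 1).
random.seed(1)
for _ in range(1000):
    y = [random.random() + 1e-9 for _ in range(4)]
    s = sum(y)
    y = [v / s for v in y]                 # sum y_i = 1
    z = [random.random() + 1e-9 for _ in range(4)]
    s = sum(z) + 0.5
    z = [v / s for v in z]                 # sum z_i < 1
    lhs = sum((yi - zi) ** 2 for yi, zi in zip(y, z))
    rhs = sum(yi * math.log(yi / zi) for yi, zi in zip(y, z))
    assert lhs <= rhs + 1e-12
print("entropy inequality held on 1000 random (y, z) pairs")
```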
Does replacing µ^AI with ξ^AI lead to an AIξ system with asymptotically optimal behaviour and rapid convergence?
This looks promising from the analogy with SP!
Problem: $D_k^{\mu\xi} \leq 2^{K(\mu)}$ is the best possible error bound we can expect, since it depends on $K(\mu)$ only. It is useless for $k \ll |Y| \overset{\times}{=} 2^{K(\mu)}$, although asymptotic convergence is satisfied.
But: a bound like $2^{K(\mu)}$ reduces to $2^{K(\mu\,|\,\dot x_{<k})}$ after k cycles, which is $O(1)$ if enough information about µ is contained in $\dot x_{<k}$ in any form.
Separability Concepts
Other concepts
• Deterministic µ.
• Chronological µ.
Sequence Prediction (SP)
The AIµ model always reduces exactly to the XXµ model if XXµ is the optimal solution in domain XX.
The AIξ model differs from the SPΘξ model: for $h_k = 1$,
$$\dot y_k^{AI\xi} \;=\; \operatorname{maxarg}_{y_k}\, \xi(\dot y\dot r_{<k}\, y_k 1) \;\neq\; \dot y_k^{SP\Theta\xi}$$
Weak error bound: $E_{n\xi}^{AI} \overset{\times}{<} 2^{K(\mu)} < \infty$ for deterministic µ.
Function Minimization (FM)
$$\mu^{FM}(y_1 z_1 \ldots y_n z_n) \;:=\; \sum_{f\,:\,f(y_i)=z_i\ \forall\, 1\leq i\leq n} \mu(f)$$
Trying to find the $y_k$ which minimizes f in the next cycle (greedy minimization) does not work.
General Ansatz for FMµ/ξ:
$$\dot y_k \;=\; \operatorname{minarg}_{y_k}\sum_{z_k}\ \cdots\ \min_{y_T}\sum_{z_T}\,(\alpha_1 z_1 + \ldots + \alpha_T z_T)\cdot\mu(\dot y\dot z_1 \ldots y z_T)$$
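A hedged Python sketch of this ansatz on a finite class of candidate functions (the function class, the prior weights, the horizon T = 2, and the weights α_i = 1 are all illustrative assumptions):

```python
# Brute-force expectimin for FMmu over a finite candidate class.
YS = [0, 1, 2]                       # query points
candidates = {                       # f : YS -> value z, with prior mu(f)
    (3, 1, 2): 0.4,
    (2, 1, 0): 0.4,
    (1, 3, 3): 0.2,
}

def mu_fm(history):
    """mu^FM of a history ((y1,z1),...): total weight of consistent f."""
    return sum(w for f, w in candidates.items()
               if all(f[y] == z for y, z in history))

def expectimin(history, t, T, alphas):
    """Min over y_t of the expected alpha_t z_t + ... (the FM recursion)."""
    if t > T:
        return 0.0
    base = mu_fm(history)
    best = float("inf")
    for y in YS:
        exp = 0.0
        for f, w in candidates.items():
            if all(f[yy] == zz for yy, zz in history):
                z = f[y]
                exp += (w / base) * (alphas[t - 1] * z +
                                     expectimin(history + ((y, z),), t + 1, T, alphas))
        best = min(best, exp)
    return best

# Two queries, both counted (alpha = (1, 1)): pick y_1 minimizing the ansatz.
y1 = min(YS, key=lambda y: sum(
    w * (f[y] + expectimin(((y, f[y]),), 2, 2, (1, 1)))
    for f, w in candidates.items()))
print("first query y_1 =", y1)
```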
Other methods might emerge in the short programs q if we analyzed them. AIξ does not seem to lack any important known methodology of AI, apart from computational aspects.
Conclusions
• We have developed a parameterless model of AI based on Decision Theory and Algorithmic Information Theory.
• We have reduced the AI problem to pure computational questions.
• A formal theory of something, even if not computable, is often a great step toward solving a problem, and also has merit in its own right.
• All other systems seem to make more assumptions about the environment, or it is
far from clear that they are optimal.
• Computational questions are very important and probably difficult. This is the point where AI could get complicated, as many AI researchers believe.
• Nice theory, yet complicated solution, as in physics.