Universal Intelligence: A Definition of Machine Intelligence
Shane Legg
IDSIA, Galleria 2, Manno-Lugano CH-6928, Switzerland
[email protected] www.vetta.org/shane
Marcus Hutter
RSISE @ ANU and SML @ NICTA, Canberra, ACT, 0200, Australia
[email protected] www.hutter1.net
December 2007
Abstract
A fundamental problem in artificial intelligence is that nobody really knows what
intelligence is. The problem is especially acute when we need to consider artificial
systems which are significantly different to humans. In this paper we approach this
problem in the following way: We take a number of well known informal definitions
of human intelligence that have been given by experts, and extract their essential
features. These are then mathematically formalised to produce a general measure of
intelligence for arbitrary machines. We believe that this equation formally captures
the concept of machine intelligence in the broadest reasonable sense. We then show
how this formal definition is related to the theory of universal optimal learning
agents. Finally, we survey the many other tests and definitions of intelligence that
have been proposed for machines.
Contents

1 Introduction
2 Natural Intelligence
  2.1 Human intelligence tests
  2.2 Animal intelligence tests
  2.3 Desirable properties of an intelligence test
  2.4 Static vs. dynamic tests
  2.5 Theories of human intelligence
  2.6 Ten definitions of human intelligence
  2.7 More definitions of human intelligence
References
1 Introduction
“Innumerable tests are available for measuring intelligence, yet no one is quite
certain of what intelligence is, or even just what it is that the available tests
are measuring.” — R. L. Gregory [Gre98]
What is intelligence? It is a concept that we use in our daily lives that seems to
have a fairly concrete, though perhaps naive, meaning. We say that our friend who got
an A in his calculus test is very intelligent, or perhaps our cat who has learnt to go
into hiding at the first mention of the word “vet”. Although this intuitive notion of
intelligence presents us with no difficulties, if we attempt to dig deeper and define it in
precise terms we find the concept to be very difficult to nail down. Perhaps the ability to
learn quickly is central to intelligence? Or perhaps the total sum of one’s knowledge is
more important? Perhaps communication and the ability to use language play a central
role? What about “thinking” or the ability to perform abstract reasoning? How about
the ability to be creative and solve problems? Intelligence involves a perplexing mixture
of concepts, many of which are equally difficult to define.
Psychologists have been grappling with these issues ever since humans first became
fascinated with the nature of the mind. Debates have raged back and forth concerning the
correct definition of intelligence and how best to measure the intelligence of individuals.
These debates have in many instances been very heated as what is at stake is not merely
a scientific definition, but a fundamental issue of how we measure and value humans: Is
one employee smarter than another? Are men on average more intelligent than women?
Are white people smarter than black people? As a result intelligence tests, and their
creators, have on occasion been the subject of intense public scrutiny. Simply determining
whether a test, perhaps quite unintentionally, is partly a reflection of the race, gender,
culture or social class of its creator is a subtle, complex and often politically charged issue
[Gou81, HM96]. Not surprisingly, many have concluded that it is wise to stay well clear
of this topic.
In reality the situation is not as bad as it is sometimes made out to be. Although the
details of the definition are debated, in broad terms a fair degree of consensus about the
scientific definition of intelligence and how to measure it has been achieved [Got97a, SB86].
Indeed it is widely recognised that when standard intelligence tests are correctly applied
and interpreted, they all measure approximately the same thing [Got97a]. Furthermore,
what they measure is both stable over time in individuals and has significant predictive
power, in particular for future academic performance and other mentally demanding pur-
suits. The issues that continue to draw debate are questions such as whether the tests
test only a part or a particular type of intelligence, or whether they are somehow biased
towards a particular group or set of mental skills. Great effort has gone into dealing with
these issues, but they are difficult problems with no easy solutions.
Somewhat disconnected from this exists a parallel debate over the nature of intelligence
in the context of machines. While the debate is less politically charged, in some ways
the central issues are even more difficult. Machines can have physical forms, sensors,
actuators, means of communication, information processing abilities and environments
that are totally unlike those that we experience. This makes the concept of “machine
intelligence” particularly difficult to get a handle on. In some cases, a machine may
display properties that we equate with human intelligence; in such cases it might be
reasonable to describe the machine as also being intelligent. In other situations this view
is far too limited and anthropocentric. Ideally we would like to be able to measure the
intelligence of a wide range of systems; humans, dogs, flies, robots or even disembodied
systems such as chat-bots, expert systems, classification systems and prediction algorithms
[Joh92, Alb91].
One response to this problem might be to develop specific kinds of tests for specific
kinds of entities; just as intelligence tests for children differ to intelligence tests for adults.
While this works well when testing humans of different ages, it comes undone when we
need to measure the intelligence of entities which are profoundly different to each other
in terms of their cognitive capacities, speed, senses, environments in which they operate,
and so on. To measure the intelligence of such diverse systems in a meaningful way
we must step back from the specifics of particular systems and establish the underlying
fundamentals of what it is that we are really trying to measure.
The difficulty of developing an abstract and highly general notion of intelligence is
readily apparent. Consider, for example, the memory and numerical computation tasks
that appear in some intelligence tests and which were once regarded as defining hallmarks
of human intelligence. We now know that these tasks are absolutely trivial for a machine
and thus do not appear to test the machine’s intelligence in any meaningful sense. Indeed
even the mentally demanding task of playing chess can be largely reduced to brute force
search [HCH95]. What else may in time be possible with relatively simple algorithms
running on powerful machines is hard to say. What we can be sure of is that as technology
advances, our concept of intelligence will continue to evolve with it.
How then are we to develop a concept of intelligence that is applicable to all kinds of
systems? Any proposed definition must encompass the essence of human intelligence, as
well as other possibilities, in a consistent way. It should not be limited to any particular
set of senses, environments or goals, nor should it be limited to any specific kind of
hardware, such as silicon or biological neurons. It should be based on principles which
are fundamental and thus unlikely to alter over time. Furthermore, the definition of
intelligence should ideally be formally expressed, objective, and practically realisable as
an effective intelligence test.
In this paper we approach the problem of defining machine intelligence as follows:
Section 2 overviews well known theories, definitions and tests of intelligence that have
been developed by psychologists. Our objective in this section is to gain an understand-
ing of the essence of intelligence in the broadest possible terms. In particular we are
interested in commonly expressed ideas that could be applied to arbitrary systems and
contexts, not just humans.
Section 3 takes these key ideas and formalises them. This leads to universal intelligence,
our proposed formal definition of machine intelligence. We then examine some of the
properties of universal intelligence, such as its ability to sensibly order simple learning
algorithms and connections to the theory of universal optimal learning agents.
Section 4 overviews other definitions and tests of machine intelligence that have been
proposed. Although surveys of the Turing test and its many variants exist, for example
[SCA00], as far as we know this section is the first general survey of definitions and tests of
machine intelligence. Given how fundamental this is to the field of artificial intelligence,
the absence of such a survey is quite remarkable. For any field to mature as a science,
questions of definition and measurement must be meticulously investigated. We conclude
our survey with a summary comparison of the various proposed tests and definitions of
machine intelligence.
Section 5 ends the paper with discussion, responses to criticisms, conclusions and direc-
tions for future research.
The genesis of this work lies in Hutter’s universal optimal learning agent, AIXI, de-
scribed in 2, 12, 60 and 300 pages in [Hut01b, Hut01a, Hut05, Hut07b], respectively. In
this work, an order relation for intelligent agents is presented, with respect to which the
provably optimal AIXI agent is maximal. The universal intelligence measure presented
here is a derivative of this order relation. A short description of the universal intelligence
measure appeared in [LH05], from which two articles followed in the popular scientific
press [GR05, Fié05]. An 8 page paper on universal intelligence appeared in [LH06b],
followed by an updated poster presentation [LH06a]. In the current paper we explore
universal intelligence in much greater detail, in particular the way in which it relates
to mainstream views on human intelligence and other proposed definitions of machine
intelligence.
2 Natural Intelligence
Human intelligence is an enormously rich topic with a complex intellectual, social and
political history. For an overview the interested reader might want to consult “Handbook
of Intelligence” [Ste00] edited by R. J. Sternberg. Our objective in this section is simply to
sketch a range of tests, theories and definitions of human and animal intelligence. We are
particularly interested in common themes and general perspectives on intelligence that
could be applicable to many kinds of systems, as these will form the foundation of our
definition of machine intelligence in the next section.
2.1 Human intelligence tests

Alfred Binet developed one of the earliest modern intelligence tests, designed to assess school
children [Bin11]. It was found that Binet's test results were a good predictor of children's
academic performance.
Lewis Terman of Stanford University developed an English version of Binet’s test.
As the age norms for French children did not correspond well with American children,
he revised Binet’s test in various ways, in particular he increased the upper age limit.
This resulted in the now famous Stanford-Binet test [TM50]. This test formed the basis
of a number of other intelligence tests, such as the Army Alpha and Army Beta tests
which were used to classify recruits. Since its development, the Stanford-Binet has been
periodically revised, with updated versions being widely used today.
David Wechsler believed that the original Binet tests were too focused on verbal skills
and thus disadvantaged certain otherwise intelligent individuals, for example the deaf or
people who did not speak the test language as a first language. To address this prob-
lem, he proposed that tests should contain a combination of both verbal and nonverbal
problems. He also believed that in addition to an overall IQ score, a profile should be pro-
duced showing the performance of the individual in the various areas tested. Borrowing
significantly from the Stanford-Binet, the US army Alpha test, and others, he developed
a range of tests targeting specific age groups from preschoolers up to adults [Wec58]. Due
in part to problems with revisions of the Stanford-Binet test in the 1960s and 1970s,
Wechsler’s tests became the standard. They continue to be well respected and widely
used.
Owing to a common lineage, modern versions of the Wechsler and the Stanford-Binet
have a similar basic structure [Kau00]. Both test the individual in a number of verbal
and non-verbal ways. In the case of the Stanford-Binet the test is broken up into 5 key
areas: Fluid reasoning, knowledge, quantitative reasoning, visual-spatial processing, and
working memory. In the case of the Wechsler Adult Intelligence Scale (WAIS-III), the
verbal tests include areas such as knowledge, basic arithmetic, comprehension,
vocabulary, and short term memory. Non-verbal tests include picture completion, spatial
perception, problem solving, symbol search and object assembly.
As part of an effort to make intelligence tests more culture neutral John Raven de-
veloped the progressive matrices test [Rav00]. In this test each problem consists of a
short sequence of basic shapes. For example, a circle in a box, then a circle with a cross
in the middle followed by a circle with a triangle inside. The test subject then has to
select from a second list the image that best continues the pattern. Simple problems
have simple patterns, while difficult problems have more subtle and complex patterns. In
each case however, the simplest pattern that can explain the observed sequence is the one
that correctly predicts its continuation. Thus, not only is the ability to recognise pat-
terns tested, but also the ability to evaluate the complexity of different explanations and
then correctly apply the philosophical principle of Occam’s razor. We will return to Oc-
cam’s razor and its importance in intelligence testing in Subsection 3.3 when considering
machine intelligence.
Today several different versions of the Raven test exist designed for different age
groups and ability levels. As the tests depend strongly on the ability to identify abstract
patterns, rather than knowledge, they are considered to be some of the most “g-loaded”
intelligence tests available (see Subsection 2.5). The Raven tests remain in common use
today, particularly when it is thought that culture or language bias could be an issue.
The intelligence quotient, or IQ, was originally introduced by Stern [Ste12]. It was
computed by taking the age of a child as estimated by their performance in the intelligence
test, and then dividing this by their true biological age and multiplying by 100. Thus
a 10 year old child whose mental performance was equal to that of a normal 12 year
old, had an IQ of 120. As the concept of mental age has now been discredited, and was
never applicable to adults anyway, modern IQ scores are simply normalised to a Gaussian
distribution with a mean of 100. The standard deviation used varies: in the United
States 15 is commonly used, while in Europe 25 is common. For children the normalising
Gaussian is based on peers of the same age.
Whatever normalising distribution is used, by definition an individual’s IQ is always
an indication of their cognitive performance relative to some larger group. Clearly this
would be problematic in the context of machines where the performance of some machines
could be many orders of magnitude greater than others. Furthermore, the distribution of
machine performance would be continually changing due to advancing technology. Thus,
for our purposes, an absolute measure will be more meaningful than a traditional IQ type
of measure.
For an overview of the history of intelligence testing and the structure of modern tests,
see [Kau00].
2.2 Animal intelligence tests

Test subjects that cannot read questions or understand verbal instructions pose an obvious
problem for standard testing. A simple solution is to use basic "rewards" to guide behaviour,
as we do with animals. Although this approach is extremely general, one difficulty is that
solving the task, and simply learning what the task is, become confounded and thus the
results need to be interpreted carefully [Zen97]. Due to our need for generality, we will
use this reward based approach for our formal measure of machine intelligence. Specifi-
cally, we will adopt the reinforcement learning framework from artificial intelligence (see
Subsection 3.1).
For good overviews of animal intelligence research see [Zen00] or [HP94].
2.3 Desirable properties of an intelligence test

These desirable properties are equally relevant to tests of machine intelligence. To some extent they are also relevant to formal
definitions of intelligence. We will return to these desirable properties when analysing
our definition of machine intelligence in Subsection 3.5, and later when comparing tests
of machine intelligence in Subsection 4.3.
2.6 Ten definitions of human intelligence

A great number of informal definitions of intelligence have been given by psychologists. Many of these definitions are well known. Although
the definitions differ, there are reoccurring features; in some cases these are explicitly
stated, while in others they are more implicit. We start by considering ten definitions
that take, in our view, a similar perspective:
Perhaps the most elementary common feature of these definitions is that intelligence
is seen as a property of an individual who is interacting with an external environment,
problem or situation. Indeed, at least this much is common to practically all proposed
definitions of intelligence.
"Intelligence measures an agent's ability to achieve goals in a wide range of environments."
We take this to be our informal working definition of intelligence. In the next section
we will use this definition as the starting point from which we will construct a formal
definition of machine intelligence. However before we proceed further, the reader may
wish to revise the 10 definitions above to ensure that the definition we have adopted is
indeed reasonable.
2.7 More definitions of human intelligence

"Intelligence is a very general mental capability that, among other things, involves
the ability to reason, plan, solve problems, think abstractly, comprehend complex
ideas, learn quickly and learn from experience." [Got97a]
Which capacities matter most clearly depends on the particular tasks that the agent must deal with. Our approach is to consider
intelligence to be the effect of capacities such as those listed above. It is not the result of
having any specific set of capacities. Indeed, intelligence could also be the effect of many
other capacities, some of which humans may not have. In summary, our definition is not
in conflict with the above definition; rather, our definition is more abstract and general.
“. . . in its lowest terms intelligence is present where the individual animal, or human
being, is aware, however dimly, of the relevance of his behaviour to an objective.
Many definitions of what is indefinable have been attempted by psychologists, of
which the least unsatisfactory are 1. the capacity to meet novel situations, or
to learn to do so, by new adaptive responses and 2. the ability to perform tests
or tasks, involving the grasping of relationships, the degree of intelligence being
proportional to the complexity, or the abstractness, or both, of the relationship.”
J. Drever [Dre52]
This definition has many similarities to ours. Firstly, it emphasises the agent’s ability
to choose its actions so as to achieve an objective, or in our terminology, a goal. It
then goes on to stress the agent’s ability to deal with situations which have not been
encountered before. In our terminology, this is the ability to deal with a wide range of
environments. Finally, this definition highlights the agent’s ability to perform tests or
tasks, something which is entirely consistent with our performance orientated perspective
of intelligence.
“Intelligence is not a single, unitary ability, but rather a composite of several func-
tions. The term denotes that combination of abilities required for survival and
advancement within a particular culture.” A. Anastasi [Ana92]
This definition does not specify exactly which capacities are important, only that they
should enable the individual to survive and advance within the culture. As such this is a
more abstract “success” orientated definition of intelligence, like ours. Naturally, culture
is a part of the agent’s environment.
This is not really much of a definition as it simply shifts the problem of defining
intelligence to the problem of defining abstract thinking. The same is true of many other
definitions that refer to things such as imagination, creativity or consciousness. The
following definition has a similar problem:
Nonetheless, our definition of intelligence is not entirely inconsistent with the above
definition in that an individual may be required to know many things, or have a significant
capacity for knowledge, in order to perform well in some environments. However our
definition is broader in that knowledge, or the capacity for knowledge, is not by itself
sufficient. We require that the knowledge can be used effectively for some purpose. Indeed
unless information can be effectively utilised for a number of purposes, it seems reasonable
to consider it to be merely “data”, rather than “knowledge”.
“The capacity to acquire capacity.” H. Woodrow quoted in [Ste00]
The definition of Woodrow is typical of those which emphasise not the current ability
of the individual, but rather the individual’s ability to expand and develop new abilities.
This is a fundamental point of divergence for many views on intelligence. Consider the
following question: Is a young child as intelligent as an adult? From one perspective,
children are very intelligent because they can learn and adapt to new situations quickly.
On the other hand, the child is unable to do many things due to a lack of knowledge
and experience and thus will make mistakes an adult would know to avoid. These need
not just be physical acts, they could also be more subtle things like errors in reasoning
as their mind, while very malleable, has not yet matured. In which case, perhaps their
intelligence is currently low, but will increase with time and experience?
Fundamentally, this difference in perspective is a question of time scale: Must an agent
be able to tackle some task immediately, or perhaps after a short period of time during
which learning can take place, or perhaps it only matters that they can eventually learn to
deal with the problem? Being able to deal with a difficult problem immediately is a matter
of experience, rather than intelligence. While being able to deal with it in the very long
run might not require much intelligence at all, for example, simply trying a vast number
of possible solutions might eventually produce the desired results. Intelligence then seems
to be the ability to adapt and learn as quickly as possible given the constraints imposed
by the problem at hand. It is this insight that we will use to neatly deal with temporal
preference when defining machine intelligence (see Measure of success in Subsection 3.2).
“Intelligence is a general factor that runs through all types of performance.”
A. Jensen
At first this might not look like a definition of intelligence, but it makes an important
point: Intelligence is not really the ability to do anything in particular, rather it is a
very general ability that affects many kinds of performance. Conversely, by measuring
many different kinds of performance we can estimate an individual’s intelligence. This is
consistent with our definition’s emphasis on the agent’s generality.
“Intelligence is what is measured by intelligence tests.” E. Boring [Bor23]
Boring’s famous definition of intelligence takes this idea a step further. If intelligence
is not the ability to do anything in particular, but rather an abstract ability that indirectly
affects performance in many tasks, then perhaps it is most concretely described as the
ability to do the kind of abstract problems that appear in intelligence tests? In which
case, Boring’s definition is not as facetious as it first appears.
This definition also highlights the fact that the concept of intelligence, and how it
is measured, are intimately related. In the context of this paper we refer to these as
definitions of intelligence, and tests of intelligence, respectively.
3 A Definition of Machine Intelligence
Figure 1: The agent and the environment interact by sending action, observation and
reward signals to each other.
For example, the agent might be rewarded for winning a game or solving a puzzle. If the agent is
to succeed in its environment, that is, receive a lot of reward, it must learn about the
structure of the environment and in particular what it needs to do in order to get reward.
Thus from a broad perspective, the goal is flexible. Not surprisingly, this is exactly the
way in which we condition an animal to achieve a goal: By selectively rewarding certain
behaviours (see Subsection 2.2). In a narrow sense the animal’s goal is fixed, perhaps to
get more treats to eat, but in a broader sense it is flexible as it may require doing a trick
or solving a puzzle of our choosing.
In our framework we will include the reward signal as a part of the perception generated
by the environment. The perceptions also contain a non-reward part, which we will refer
to as observations. This now gives us the complete system of interacting agent and
environment, as illustrated in Figure 1. The goal, in the broad flexible sense, is implicitly
defined by the environment as this is what defines when rewards are generated. Thus, in
the framework as we have defined it, to test an agent in any given way it is sufficient to
fully define the environment.
This widely used and very flexible structure is in itself nothing new. In artificial
intelligence it is the framework used in reinforcement learning [SB98]. By appropriately
renaming things, it also describes the controller-plant framework used in control theory
[BT96]. The interesting point for us is that this setup follows naturally from our informal
definition of intelligence and our desire to keep things as general as possible. The only
difficulty was how to deal with the notion of success, or profit. This required the existence
of some kind of an objective or goal. The most flexible and elegant way to bring this into
the framework was to use a simple reward signal.
3.1 Example. To make this model more concrete, consider the following “Two Coins
Game”. In each cycle two 50¢ coins are tossed. Before the coins settle the player must
guess at the number of heads that will result: either 0, 1, or 2. If the guess is correct
the player gets to keep both coins and then two new coins are produced and the game
repeats. If the guess is incorrect the player does not receive any coins, and the game is
repeated.
In terms of the agent-environment model, the player is the agent and the system that
produces all the coins, tosses them and distributes the reward when appropriate, is the
environment. The agent’s actions are its guesses at the number of heads in each iteration
of the game: 0, 1 or 2. The observation is the state of the coins when they settle, and the
reward is either $0 or $1.
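To make these roles concrete, here is a minimal sketch in Python (our own illustrative code, not part of the original framework; all function names are invented) of the interaction loop of Figure 1, using the Two Coins Game as the environment and an agent that always guesses one head.

import random

def two_coins_env(action):
    """Toss two fair coins and return (observation, reward).

    The observation is the number of heads that came up; the reward is
    1 if the agent's guess matched this number and 0 otherwise."""
    heads = sum(random.randint(0, 1) for _ in range(2))
    reward = 1 if action == heads else 0
    return heads, reward

def always_one_agent(history):
    """A deterministic agent that always guesses one head."""
    return 1

def interact(agent, environment, cycles=100):
    """Run the agent-environment loop of Figure 1 and return the total reward."""
    history, total = [], 0
    for _ in range(cycles):
        action = agent(history)
        observation, reward = environment(action)
        history.append((action, observation, reward))
        total += reward
    return total

print(interact(always_one_agent, two_coins_env))  # typically around 50

With two fair coins the constant guess of one head is optimal, so this agent wins in roughly half of the cycles.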
It is easy to see that for unbiased coins the most likely outcome is 1 head and thus the
optimal strategy for the agent is to always guess 1. However if the coins are significantly
biased it might be optimal to guess either 0 or 2 heads depending on the bias. If this were
the case, then after a number of iterations of the game an intelligent agent would realise
that the coins were probably biased and change its strategy accordingly.
With a little imagination, seemingly any sort of game, challenge, problem or test
can be expressed in this simple framework without too much effort. It should also be
emphasised that this agent-environment framework says nothing about how the agent or
the environment actually work; it only describes their roles.
The agent. Formally, the agent is a function, denoted by π, which takes the current
history as input and chooses the next action as output. We do not want to restrict the
agent in any way, in particular we do not require that it is deterministic. A convenient
way of representing the agent then is as a probability measure over actions conditioned
on the complete interaction history. Thus, π(a3 |o1 r1 a1 o2 r2 ) is the probability of action a3
in the third cycle, given that the current history is o1 r1 a1 o2 r2 . A deterministic agent is
simply one that always assigns a probability of 1 to a single action for any given history.
As the history that the agent can use to select its action expands indefinitely, the agent
need not be Markovian. Indeed, how the agent produces its distribution over actions
for any given history is left open. In artificial intelligence the agent will of course be a
machine and so π will be a computable function. In general however, π could be anything:
An algorithm that generates the digits of √e as outputs, an incomputable function, or
even a human pushing buttons on a keyboard.
3.2 Example. To illustrate this formalism, consider again the Two Coins Game intro-
duced in Example 3.1. Let P := {0, 1, 2} ×{0, 1} be the perception space representing the
number of heads after tossing the two coins and the value of the received reward. Like-
wise let A := {0, 1, 2} be the action space representing the agent’s guess at the number
of heads that will occur. Assuming two fair coins, we can represent this environment by
µ:
\[
\mu(o_k r_k \mid o_1 \ldots a_{k-1}) :=
\begin{cases}
1/4 & \text{if } o_k = a_{k-1} \in \{0,2\} \wedge r_k = 1,\\
3/4 & \text{if } o_k \neq a_{k-1} \in \{0,2\} \wedge r_k = 0,\\
1/2 & \text{if } o_k = a_{k-1} = 1 \wedge r_k = 1,\\
1/2 & \text{if } o_k \neq a_{k-1} = 1 \wedge r_k = 0,\\
0 & \text{otherwise.}
\end{cases}
\]
A very simple agent for this environment is one that assigns probability 1 to the action 1 in every cycle, irrespective of the history.
That is, always guess that one head will be the result of the two coins being tossed. A
more complex agent might keep count of how many heads occur in each cycle and then
adapt its strategy if it seems that the coins are sufficiently biased.
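The adaptive behaviour described above can be sketched as follows (again our own illustrative code, compatible with the interaction loop shown earlier; the threshold of 20 observations is an arbitrary choice).

from collections import Counter

def make_counting_agent(min_samples=20):
    """Return an agent that adapts its guess to an apparent coin bias.

    Until min_samples outcomes have been observed it guesses 1 head,
    which is optimal for fair coins; afterwards it guesses whichever
    number of heads has occurred most often so far."""
    def agent(history):
        counts = Counter(observation for _, observation, _ in history)
        if sum(counts.values()) < min_samples:
            return 1
        return counts.most_common(1)[0][0]
    return agent

Against fair coins this agent behaves like the constant agent, while against coins strongly biased towards heads it will eventually switch to guessing 2.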
Measure of success. Our next task is to formalise the idea of “profit” or “success”
for an agent. Informally, we know that the agent must try to maximise the amount of
reward it receives, however this could mean several different things. For example, one
agent might quickly find a way to get a reward of 0.9 in every cycle. After 100 cycles it
will have received a total reward of about 90 with an average reward per cycle of close to
0.9. A second agent might spend the first 80 cycles exploring different actions and their
consequences, during which time its average reward might only be 0.2. Having done this
exploration however, it might then know a way to get a reward of 1.0 in every cycle. Thus
after 100 cycles its total reward is only 80 × 0.2 + 20 × 1.0 = 36, giving an average reward per
cycle of just 0.36. After 1,000 cycles however, the second agent will be performing much
better than the first.
Which agent is the better one? The answer depends on how we value reward in the
near future versus reward in the more distant future. In some situations we may want
our agent to perform well fairly quickly, in others we might only care that it eventually
reaches a level of performance that is as high as possible.
A standard way of formalising this is to scale the value of rewards so that they decay
geometrically into the future at a rate given by a discount parameter γ ∈ (0, 1). For
example, with γ = 0.95 a reward of 0.7 that is 10 time steps into the future would be
given a value of 0.7 × (0.95)^10 ≈ 0.42. At 100 time steps into the future a reward of 0.7
would have a value of just over 0.004. By increasing γ towards 1 we weight long term
rewards more heavily, conversely by reducing it we weight them less so. In other words,
this parameter controls how short term greedy, or long term farsighted, the agent should
be.
To work out the expected future value for a given agent and environment interacting,
we take the sum of these discounted rewards into the infinite future and work out its
expected value,
\[
V^\pi_\mu(\gamma) := \frac{1}{\Gamma}\, E\!\left( \sum_{i=1}^{\infty} \gamma^i r_i \right). \tag{1}
\]
In the above, r_i is the reward in cycle i of a given history, γ is the discount rate, γ^i is the
discount applied to the ith reward into the future, the normalising constant is $\Gamma := \sum_{i=1}^{\infty} \gamma^i$,
and the expected value is taken over all possible interaction sequences between the agent
π and the environment µ.
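As a numerical illustration of Equation 1 (a sketch of ours; the two reward profiles correspond to the two hypothetical agents discussed above), the following code approximates the discounted value of an agent that earns 0.9 from the first cycle and of one that earns 0.2 for 80 cycles of exploration and 1.0 thereafter.

def discounted_value(rewards, gamma, horizon=10_000):
    """Approximate V(gamma) = (1/Gamma) * sum_{i>=1} gamma^i * r_i for a
    deterministic reward sequence, truncating the sum at `horizon` cycles."""
    gamma_sum = sum(gamma ** i for i in range(1, horizon + 1))
    value = sum(gamma ** i * rewards(i) for i in range(1, horizon + 1))
    return value / gamma_sum

quick = lambda i: 0.9                           # 0.9 reward in every cycle
explorer = lambda i: 0.2 if i <= 80 else 1.0    # explores for 80 cycles, then exploits

for gamma in (0.9, 0.99, 0.999):
    print(gamma, discounted_value(quick, gamma), discounted_value(explorer, gamma))

For γ = 0.9 the first agent has the higher value, while for γ = 0.999 the explorer comes out ahead; which agent is "better" is decided entirely by the free parameter γ, which is precisely the problem discussed next.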
Under geometric discounting an agent with γ = 0.95 will not plan further than about
20 cycles ahead. Thus we say that the agent has a constant effective horizon of 1/(1 − γ).
Since we are interested in universal intelligence, a limited farsightedness is not acceptable
because for every horizon there is a task that needs a larger horizon to be solved. For
instance, while a horizon of 5 is sufficient for tic-tac-toe, it is insufficient for chess. Clearly,
geometric discounting has not solved the problem of how to weight near term rewards
versus long term rewards, it has simply expressed this weighting as a parameter. What
we require is a single definition of machine intelligence, not a range of definitions that
vary according to a free parameter.
A more promising candidate for universal discounting is the near-harmonic, or
quadratic, discount, where we replace γ^i in Equation 1 by 1/i^2 and modify Γ accord-
ingly. This has some interesting properties, in particular the agent needs to look forward
into the future in a way that is proportional to its current age. This is appealing since it
seems that humans of age k years usually do not plan their lives for more than, perhaps,
the next k years. More importantly, it allows us to avoid the problem of having to choose
a global time scale or effective horizon [Hut05].
Although harmonic discounting has a number of attractive properties [Hut06a], an
even simpler and more general solution is possible. If we look at the value function in
Equation 1, we see that discounting plays two roles. Firstly, it normalises rewards received
so that their sum is always finite. Secondly, it weights the reward at different points in
the future which in effect defines a temporal preference. A direct way to solve both of
these problems, without needing an external parameter, is to simply require that the total
reward returned by the environment can never exceed 1. For such a reward summable
environment µ, it follows that the expected value of the sum of rewards is also finite and
thus discounting is no longer required,
\[
V^\pi_\mu := E\!\left( \sum_{i=1}^{\infty} r_i \right) \le 1. \tag{2}
\]
One way of viewing this is that the rewards returned by the environment now have the
temporal preference already factored in. The cost is that this is an additional condition
that we place on the space of environments. Previously we required that each reward
signal was in a subset of [0, 1] ∩ Q, now we have the additional constraint that the reward
sum is always bounded (see Subsection 5.2 for further discussion about why this constraint
is reasonable).
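To see that this constraint does not remove the ability to express a temporal preference (our own remark, not a derivation from the paper): writing w_i for any weight sequence with $\sum_{i=1}^{\infty} w_i \le 1$ (for example $w_i = \gamma^i/\Gamma$), an environment with rewards $r_i \in [0,1]$ can emit the modified rewards $r_i' := w_i r_i$ instead, so that
\[
\sum_{i=1}^{\infty} r_i' = \sum_{i=1}^{\infty} w_i r_i \le \sum_{i=1}^{\infty} w_i \le 1,
\]
and the undiscounted value $V^\pi_\mu = E(\sum_i r_i')$ of the modified environment equals the discounted value $V^\pi_\mu(\gamma)$ of the original one when $w_i = \gamma^i/\Gamma$.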
3.3 A formal definition of machine intelligence

If an agent is going to predict which hypotheses are the most likely to be correct, it must resort to
something other than just the observational information that it has. This is a frequently
occurring problem in inductive inference for which the most common approach is to invoke
the principle of Occam’s razor:
Given multiple hypotheses that are consistent with the data, the simplest should
be preferred.
This is generally considered the rational and intelligent thing to do [Wal05], indeed IQ
tests often implicitly test an individual’s ability to use Occam’s razor, as pointed out in
Subsection 2.1.
3.3 Example. Consider the following type of question which commonly appears in
intelligence tests. There is a sequence such as 2, 4, 6, 8, and the test subject needs
to predict the next number. Of course the pattern is immediately clear: The numbers
are increasing by 2 each time, or more mathematically, the kth item is given by 2k. An
intelligent person would easily identify this pattern and predict the next number to be 10.
However, the polynomial 2k^4 − 20k^3 + 70k^2 − 98k + 48 is also consistent with the data,
in which case the next number in the sequence would be 58. Why then, even if we are
aware of the larger polynomial, do we consider the first answer to be the most likely one?
It is because we apply, perhaps unconsciously, the principle of Occam’s razor. The fact
that intelligence tests define this as the “correct” answer, shows us that using Occam’s
razor is considered the intelligent thing to do. Thus, although we do not usually mention
Occam’s razor when defining intelligence, the ability to effectively use it is an important
facet of intelligent behaviour.
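Both hypotheses in this example are easy to check numerically; the following sketch (our own illustrative code) evaluates them side by side.

def simple(k):
    """The obvious hypothesis: the k-th item is 2k."""
    return 2 * k

def quartic(k):
    """An alternative hypothesis that also fits 2, 4, 6, 8."""
    return 2 * k**4 - 20 * k**3 + 70 * k**2 - 98 * k + 48

for k in range(1, 6):
    print(k, simple(k), quartic(k))
# Both give 2, 4, 6, 8 for k = 1..4; at k = 5 they diverge: 10 versus 58.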
In some cases we may even consider the correct use of Occam’s razor to be a more
important demonstration of intelligence than achieving a successful outcome. Consider,
for example, the following game:
3.4 Example. A questioner lays twenty $10 notes out on a table before you and then
points to the first one and asks “Yes or No?”. If you answer “Yes” he hands you the
money. If you answer “No” he takes it from the table and puts it in his pocket. He then
points to the next $10 note on the table and asks the same question. Although you, as an
intelligent agent, might experiment with answering both “Yes” and “No” a few times, by
the 13th round you would have decided that the best choice seems to be “Yes” each time.
However what you do not know is that if you answer “Yes” in the 13th round then the
questioner will pull out a gun and shoot you! Thus, although answering “Yes” in the 13th
round is the most intelligent choice, given what you know, it is not the most successful
one. An exceptionally dim individual may have failed to notice the obvious relationship
between answers and getting the money, and thus might answer “No” in the 13th round,
thereby saving his life due to what could truly be called “dumb luck”.
What is important then, is not that an intelligent agent succeeds in any given situation,
but rather that it takes actions that we would expect to be the most likely ones to lead
to success. Given adequate experience this might be clear, however often experience is
not sufficient and one must fall back on good prior assumptions about the world, such as
Occam’s razor. It is important then that we test the agents in such a way that they are,
at least on average, rewarded for correctly applying Occam’s razor, even if in some cases
this leads to failure.
There is another subtlety that needs to be pointed out. Often intelligence is thought
of as the ability to deal with complexity. Or in the words of the psychologist Gottfred-
son, “. . . g is the ability to deal with cognitive complexity — in particular, with complex
information processing." [Got97b] It is tempting then to equate the difficulty of an envi-
ronment with its complexity. Unfortunately, things are not so straightforward. Consider
the following environment:
3.5 Example. Imagine a very complex environment with a rich set of relationships
between the agent’s actions and observations. The measure that describes this will have
a high complexity. However, also imagine that the reward signal is always maximal
no matter what the agent does. Thus, although this is a very complex environment in
which the agent is unlikely to be able to predict what it will observe next, it is also an
easy environment in the sense that all policies are optimal, even very simple ones that do
nothing at all. The environment contains a lot of structure that is irrelevant to the goal
that the agent is trying to achieve.
From this perspective, a problem is thought of as being difficult if the simplest good
solution to the problem is complex. Easy problems on the other hand are those that have
simple solutions. This is a very natural way to think about the difficulty of problems, or
in our terminology, environments.
Fortunately, this distinction does not affect our use of Occam’s razor. When we talk
about an hypothesis, what we mean is a potential model of the environment from the
agent’s perspective, not just a model that is sufficient with respect to the agent’s goal.
From the agent’s perspective, an incorrect hypothesis that fails to model much of the
environment may be optimal if the parts of the environment that the hypothesis fails to
model are not relevant to receiving reward. However, when Occam’s razor is applied, we
apply it with respect to the complexity of the hypotheses, not the complexity of good
solutions with respect to an objective. Thus, to reward agents on average for correctly
using Occam’s razor, we must weight the environments according to their complexity, not
their difficulty.
Our remaining problem now is to measure the complexity of environments. The Kol-
mogorov complexity of a binary string x is defined as being the length of the shortest
program that computes x:
\[
K(x) := \min_{p} \{\, l(p) : U(p) = x \,\},
\]
where p is a binary string which we call a program, l(p) is the length of this string in bits,
and U is a prefix universal Turing machine called the reference machine.
To gain an intuition for how this works, consider a binary string 0000 . . . 0 that consists
of a trillion 0s. Although this string is very long, it clearly has a simple structure and
thus we would expect it to have a low complexity. Indeed this is the case because we
can write a very short program p that simply loops a trillion times outputting a 0 each
time. Similarly, other strings with simple patterns have a low Kolmogorov complexity.
On the other hand, if we consider a long irregular random string 111010110000010 . . .
then it is much more difficult to find a short program that outputs this string. Indeed it
is possible to prove that there are so many strings of this form, relative to the number
of short programs, that in general it is impossible for long random strings to have short
programs. In other words, they have high Kolmogorov complexity.
An important property of K is that it is nearly independent of the choice of U. To see
why, consider what happens if we switch from U, in the above definition of K, to some
other universal Turing machine U ′ . Due to the universality property of U ′ , there exists a
program q that allows U ′ to simulate U. Thus, if we give U ′ both q and p as inputs, it can
simulate U running p and thereby compute U(p). It follows then that switching from U to
U ′ in our definition of K above incurs at most an additional cost of l(q) bits in minimal
program length. The constant l(q) is independent of which string x we are measuring
the complexity of, and for reasonable universal Turing machines, this constant will be
small. This invariance property makes K an excellent universal complexity measure. For
an extensive treatment of Kolmogorov complexity see [LV97] or [Cal02].
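Although K itself is incomputable, the length of a compressed encoding gives a crude, computable upper-bound flavour of the same idea. The toy comparison below (our own illustration using zlib; it is emphatically not the definition used in this paper) shows the contrast between a highly regular string and a random one.

import os
import zlib

regular = b"0" * 1_000_000            # a long but highly regular string
irregular = os.urandom(1_000_000)     # an (almost certainly) incompressible string

# Compressed length is a rough, machine-dependent stand-in for K:
# simple structure compresses extremely well, random data hardly at all.
print(len(zlib.compress(regular, 9)))    # on the order of a thousand bytes
print(len(zlib.compress(irregular, 9)))  # close to a million bytes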
In our current application we need to measure the complexity of the computable
measures that describe environments. It can be shown that this set can be enumerated
µ1 , µ2 , µ3 , . . . (see Theorem 4.3.1 in [LV97]). Using a simple encoding method we can
express each index as a binary string, written ⟨i⟩. In a sense this binary string is a
description of an environment with respect to our enumeration. This lets us define the
complexity of an environment µi to be K(µi) := K(⟨i⟩). Intuitively, if a short program
can be used to describe the program for an environment µi , then this environment will
have a low complexity.
This answers our problem of needing to be able to measure the complexity of envi-
ronments, but we are not done yet. In order to formalise Occam’s razor we need to have
a way to assign an a priori probability to environments in such a way that complex en-
vironments are less likely, and simple environments more likely. If we consider that each
environment µi is described by a minimal length program that is a binary string, then the
natural way to do this is to consider each additional bit of program length to reduce the
environment’s probability by one half, reflecting the fact that each bit has two possible
states. This gives us what is known as the algorithmic probability distribution over the
space of environments, defined as 2^{−K(µ)}. This distribution has powerful properties that es-
sentially solve long-standing open philosophical, statistical, and computational problems
in the area of inductive inference [Hut07a]. Furthermore, the distribution can be used to
define powerful universal learning agents that have provably optimal performance [Hut05].
Bringing all these pieces together, we can now define our formal measure of intelligence
for arbitrary systems. Let E be the space of all computable reward summable environ-
mental measures with respect to the reference machine U, and let K be the Kolmogorov
complexity function. The expected performance of agent π with respect to the universal
distribution 2^{−K(µ)} over the space of all environments E is given by,
\[
\Upsilon(\pi) := \sum_{\mu \in E} 2^{-K(\mu)}\, V^{\pi}_{\mu}.
\]
We call Υ(π) the universal intelligence of the agent π. The agent's ability to achieve goals
in a wide range of environments is captured by summing its performance over all environments
in the set E. Occam's razor is given by the term 2^{−K(µ)} which weights the agent's perfor-
mance in each environment inversely proportional to its complexity. The definition is very
general in terms of which sensors or actuators the agent might have as all information
exchanged between the agent and the environment takes place over very general commu-
nication channels. Finally, the formal definition places no limits on the internal workings
of the agent. Thus, we can apply the definition to any system that is able to receive and
generate information with a view to achieving goals. The main drawback, however, is that
the Kolmogorov complexity function K is not computable and can only be approximated.
This is an important point that we will return to later.
A random agent. The agent with the lowest intelligence, at least among those that are
not actively trying to perform badly, would be one that makes uniformly random actions.
We will call this π^rand. Although this is clearly a weak agent, we cannot simply conclude
that the value of V^{π^rand}_µ will always be low as some environments will generate high reward
no matter what the agent does. Nevertheless, in general such an agent will not be very
successful as it will fail to exploit any regularities in the environment, no matter how
simple they are. It follows then that the values of V^{π^rand}_µ will typically be low compared to
other agents, and thus Υ(π^rand) will be low. Conversely, if Υ(π^rand) is very low, then the
equation for Υ implies that for simple environments, and many complex environments,
the value of V^{π^rand}_µ must also be relatively low. This kind of poor performance in general
is what we would expect of an unintelligent agent.
A very specialised agent. From the equation for Υ, we see that an agent could have
very low universal intelligence but still perform extremely well at a few very specific and
complex tasks. Consider, for example, IBM's Deep Blue chess supercomputer, which we
will represent by π^dblue. When µ^chess describes the game of chess, V^{π^dblue}_{µ^chess} is very high.
However 2^{−K(µ^chess)} is small, and for µ ≠ µ^chess the value function will be low as π^dblue
only plays chess. Therefore, the value of Υ(π^dblue) will be very low. Intuitively, this is
because Deep Blue is too inflexible and narrow to have general intelligence; a characteristic
weakness of specialised artificial intelligence systems.
A general but simple agent. Imagine an agent that performs very basic learning by
building up a table of observation and action pairs and keeping statistics on the rewards
that follow. Each time an observation that has been seen before occurs, the agent takes
the action with highest estimated expected reward in the next cycle with 90% probability,
or a random action with 10% probability. We will call this agent π^basic. It is immediately
clear that many environments, both complex and very simple, will have at least some
structure that such an agent would take advantage of. Thus, for almost all µ we will have
V^{π^basic}_µ > V^{π^rand}_µ and so Υ(π^basic) > Υ(π^rand). Intuitively, this is what we would expect as
π^basic, while very simplistic, is surely more intelligent than π^rand.
Similarly, as π^dblue will fail to take advantage of even trivial regularities in some of
the most basic environments, Υ(π^basic) > Υ(π^dblue). This is reasonable as our aim is to
measure a machine’s level of general intelligence. Thus an agent that can take advantage
of basic regularities in a wide range of environments should rate more highly than a
specialised machine that fails outside of a very limited domain.
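A minimal sketch of an agent in the spirit of π^basic (our own illustrative code; the table layout and the 90/10 action split follow the informal description above) might look as follows.

import random
from collections import defaultdict

class BasicAgent:
    """A table-based learner in the spirit of pi^basic: for every
    (observation, action) pair it tracks the average reward received in
    the following cycle, then picks the best-looking action 90% of the
    time and a uniformly random action 10% of the time."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.reward_sum = defaultdict(float)   # (obs, action) -> summed reward
        self.visits = defaultdict(int)         # (obs, action) -> number of visits
        self.previous = None                   # (obs, action) chosen last cycle

    def act(self, observation, reward):
        # Credit the reward just received to the previous (obs, action) pair.
        if self.previous is not None:
            self.reward_sum[self.previous] += reward
            self.visits[self.previous] += 1
        if random.random() < 0.1:
            action = random.choice(self.actions)
        else:
            action = max(
                self.actions,
                key=lambda a: self.reward_sum[observation, a]
                / max(1, self.visits[observation, a]),
            )
        self.previous = (observation, action)
        return action

An agent along these lines exploits any environment in which the expected next-cycle reward depends only on the current observation and action, which is exactly the kind of simple regularity discussed above.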
A simple agent with more history. The first order structure of π basic , while very
general, will miss many simple exploitable regularities. Consider the following environ-
ment µalt . Let R = [0, 1] ∩ Q, A = {up, down} and O = {ε}, where ε is the empty string.
In cycle k the environment generates a reward of 2^{−k} each time the agent's action is dif-
ferent to its previous action. Otherwise the reward is 0. We can define this environment
formally,
\[
\mu^{alt}(o_k r_k \mid o_1 \ldots a_{k-1}) :=
\begin{cases}
1 & \text{if } a_{k-1} \neq a_{k-2} \wedge r_k = 2^{-k},\\
1 & \text{if } a_{k-1} = a_{k-2} \wedge r_k = 0,\\
0 & \text{otherwise.}
\end{cases}
\]
Clearly the optimal strategy for an agent is simply to alternate between the actions up
and down. Even though this is very simple, this strategy requires the agent to correlate
its current action with its previous action, something that π basic cannot do.
A natural extension of π basic is to use a longer history of actions, observations and
rewards in its internal table. Let π 2back be the agent that builds a table of statistics
for the expected reward conditioned on the last two actions, rewards and observations.
It is immediately clear that π 2back will exploit the structure of the µalt environment.
Furthermore, by definition π^2back is a generalisation of π^basic and thus it will adapt to
any regularity that π^basic can adapt to. It follows then that in general V^{π^2back}_µ > V^{π^basic}_µ
and so Υ(π^2back) > Υ(π^basic), as we would intuitively expect. In the same way we can
extend the history that the agent utilises back further and produce even more powerful
agents that are able to adapt to more lengthy temporal structures and which will have
still higher machine intelligence.
Figure 2: A simple game in which the agent climbs a playground slide and slides back
down again. A shortsighted agent will always just rest at the bottom of the slide.
Because the value of r̂_{k+2} will potentially depend not only on a_k, but also on a_{k+1}, the agent assumes that
a_{k+1} is chosen to simply maximise the estimated reward r̂_{k+2}.

The π^2forward agent can see that by missing out on the resting reward of 2^{−k−4} for one
cycle and climbing, a greater reward of 2^{−k} will be had when sliding back down the slide
in the following cycle.

By definition π^2forward generalises π^2back in a way that more closely reflects the value
function V and thus in general V^{π^2forward}_µ > V^{π^2back}_µ. It then follows that Υ(π^2forward) >
Υ(π^2back) as we would intuitively expect for this more powerful agent.
In a similar way agents of increasing complexity and adaptability can be defined which
will have still greater intelligence. However with more complex agents it is usually difficult
to theoretically establish whether one agent has more or less universal intelligence than
another. Nevertheless, in the simple examples above we saw that the more flexible and
powerful an agent was, the higher its universal intelligence.
A very intelligent agent. A very smart agent would perform well in simple environ-
ments, and reasonably well compared to most other agents in more complex environments.
From the equation for universal intelligence this would clearly produce a very high value
for Υ. Conversely, if Υ was very high then the equation for Υ implies that the agent must
perform well in most simple environments and reasonably well in many complex ones also.
A super intelligent agent. Consider what would be required to maximise the value
of Υ. By definition, a “perfect” agent would always pick the action which had greatest
expected future reward. To do this, for every environment µ ∈ E the agent must take into
account how likely it is that it is facing µ given the interaction history so far, and the prior
probability of µ, that is, 2^{−K(µ)}. It would then consider all possible future interactions
that might occur, and how likely they are, and from this select the action in the current
cycle that maximises the expected future reward.
This perfect theoretical agent is known as AIXI. It has been precisely defined and
studied at length in [Hut05] (see [Hut07b] for a shorter exposition). The connection
between universal intelligence and AIXI is not coincidental: Υ was originally derived
from the so called “intelligence order relation” (see Definition 5.14 in [Hut05]), which in
turn was constructed to reflect the equations for AIXI. As such we can define the upper
bound on universal intelligence to be,
\[
\bar{\Upsilon} := \max_{\pi} \Upsilon(\pi) = \Upsilon(\pi^{AIXI}).
\]
AIXI is not computable due to the incomputability of K, and even if K were com-
putable, accurately computing the expectations to maximise future expected rewards
would be practically infeasible. Nevertheless, AIXI is interesting from a theoretical per-
spective as it defines, in an elegant way, what might be considered to be the perfect theo-
retical artificial intelligence. Indeed many strong optimality properties have been proven
for AIXI. For example, it has been proven that AIXI converges to optimal performance
in any environment where this is at all possible for a general agent (see Theorem 5.34 of
[Hut05]). This optimality result includes ergodic Markov decision processes, prediction
problems, classification problems, bandit problems and many others [LH04b, LH04a].
These mathematical results prove that agents with very high universal intelligence are
extremely powerful and general.
3.5 Properties of universal intelligence

In the case of randomness, the ideal definitions are incomputable, and so all we can do is
check a sequence against a battery of statistical tests; indeed this is what is done in practice.
Of course, just because a sequence passes all
our tests does not mean that it must be random. There could always be some deeper
structure to the sequence that our tests were not able to detect. All we can say is that
the sequence seems random with respect to our ability to detect patterns.
Some might argue that the definition of something should not just capture the concept,
it should also be practical. For example, the definition of intelligence should be such
that intelligence can be easily measured. The above example, however, illustrates why
this approach is sometimes flawed: If we were to define randomness with respect to a
particular set of tests, then one could specifically construct a sequence that followed a
regular pattern in such a way that it passed all of our randomness tests. This would
completely undermine our definition of randomness. A better approach is to define the
concept in the strongest and cleanest way possible, and then to accept that our ability
to test for this ideal has limitations. In other words, our task is to find better and more
effective tests, not to redefine what it is that we are testing for. This is the attitude we
have taken here, though in this paper our focus is almost entirely on the first part, that
is, establishing a strong theoretical definition of machine intelligence.
Although some of the criteria by which we judge practical tests of intelligence are not
relevant to a pure definition of intelligence, many of the desirable properties are similar.
Thus to understand the strengths and weaknesses of our definition, consider again the
desirable properties for a test of intelligence from Subsection 2.3.
Valid. The most important property of any proposed formal definition of intelligence
is that it does indeed describe something that can reasonably be called “intelligence”.
Essentially, this is the core argument of this report so far: We have taken a mainstream
informal definition and step by step formalised it. Thus, so long as our informal defini-
tion is reasonable, and our formalisation argument holds, the result can reasonably be
described as a formal definition of intelligence.
Meaningful. As we saw in the previous section, universal intelligence orders the power
and adaptability of simple agents in a natural way. Furthermore, a high value of Υ implies
that the agent performs well in most simple and moderately complex environments. Such
an agent would be an impressively powerful and flexible piece of technology, with many
potential uses. Clearly then, universal intelligence is inherently meaningful, independent
of whether or not one considers it to be a measure of intelligence.
Wide range. As we saw in the previous section, universal intelligence is able to order
the intelligence of even the most basic agents such as π^rand, π^basic, π^2back and π^2forward. At
the other extreme we have the theoretical super intelligent agent AIXI which has maximal
Υ value. Thus, universal intelligence spans trivial learning algorithms right up to super
intelligent agents. This seems to be the widest range possible for a measure of machine
intelligence.
General. As the agent’s performance on all well defined environments is factored into
its Υ value, a broader performance metric is difficult to imagine. Indeed, a well defined
measure of intelligence that is broader than universal intelligence would seem to contradict
the Church-Turing thesis as it would imply that we could effectively measure an agent’s
performance for some well defined problem that was outside of the space of computable
measures.
Practical. In its current form the definition cannot be directly turned into a test of
intelligence as the Kolmogorov complexity function is not computable. Thus in its pure
form we can only use it to analyse the nature of intelligence and to theoretically examine
the intelligence of mathematically defined learning algorithms.
In order to use universal intelligence more generally we will need to construct a work-
able test that approximates an agent’s Υ value. The equation for Υ suggests how we might
approach this problem. Essentially, an agent’s universal intelligence is a weighted sum
of its performance over the space of all environments. Thus, we could randomly gener-
ate programs that describe environmental probability measures and then test the agent’s
performance against each of these environments. After sampling sufficiently many envi-
ronments the agent’s approximate universal intelligence would be computed by weighting
its score in each environment according to the complexity of the environment as given by
the length of its program. Another possibility might be to approximate the sum by
enumerating environmental programs from short to long, as the short ones contribute
by far the most to the sum. In this case, however, we would need to be able to reset
the state of the agent so that it cannot cheat by learning our environmental enumeration
method. In any case, various practical challenges will need to be addressed before uni-
versal intelligence can be used to construct an effective intelligence test. As this would
be a significant project in its own right, in this paper we focus on the theoretical issues
surrounding universal intelligence.
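To make the sampling idea above concrete, here is a minimal sketch in Python. It assumes a hypothetical interface in which sample_environment() returns a randomly generated environment program together with its length, agent_factory() returns a fresh copy of the agent (so that its state is reset between environments), and the environment and agent expose reset, step and act methods; the 2^(-length) weighting stands in for the 2^(-K) complexity weighting discussed above. This is an illustration of the approach, not a calibrated test.

\begin{verbatim}
def estimate_upsilon(agent_factory, sample_environment,
                     num_samples=1000, num_steps=100):
    """Monte Carlo sketch of an agent's universal intelligence score.

    All interface names here (agent_factory, sample_environment, env.reset,
    env.step, agent.act) are assumptions made for this illustration only.
    """
    total = 0.0
    for _ in range(num_samples):
        program_length, env = sample_environment()  # randomly generated environment program
        agent = agent_factory()                     # fresh agent, so its state is reset
        observation, reward = env.reset()
        score = 0.0
        for _ in range(num_steps):
            action = agent.act(observation, reward)
            observation, reward = env.step(action)
            score += reward                         # reward accumulated in this environment
        # weight the score by 2^-(program length), standing in for 2^-K(environment)
        total += 2.0 ** (-program_length) * score
    return total / num_samples
\end{verbatim}

In the enumeration variant, sample_environment would instead be replaced by a loop over programs in order of increasing length, again with the agent reset before each environment.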
“Intelligence is the computational part of the ability to achieve goals in the world.
Varying kinds and degrees of intelligence occur in people, many animals and some
machines.” J. McCarthy [McC04]
The position taken by Albus is especially similar to ours. Although the quote above
does not explicitly mention the need to be able to perform well in a wide range of envi-
ronments, at a later point in the same paper he mentions the need to be able to succeed
in a “large variety of circumstances”.
“Intelligent systems are expected to work, and work well, in many different envi-
ronments. Their property of intelligence allows them to maximize the probability
of success even if full knowledge of the situation is not available. Functioning of
intelligent systems cannot be considered separately from the environment and the
concrete situation including the goal.” R. R. Gudwin [Gud00]
While this definition is consistent with the position we have taken, when it comes to
actually testing the intelligence of an agent Gudwin does not believe that a “black box” be-
haviour based approach is sufficient; rather, his approach is to look at the “. . . architectural
details of structures, organizations, processes and algorithms used in the construction of
the intelligent systems” [Gud00]. Our perspective is simply not to care whether an agent
looks intelligent on the inside. If it is able to perform well in a wide range of environments,
that is all that matters. For more discussion on this point see our response to Block’s and
Searle’s arguments in Subsection 5.2.
“We define two perspectives on artificial system intelligence: (1) native intelli-
gence, expressed in the specified complexity inherent in the information content
of the system, and (2) performance intelligence, expressed in the successful (i.e.,
goal-achieving) performance of the system in a complicated environment.” J. A.
Horst [Hor02]
Here we see two distinct notions of intelligence, a performance based one and an
information content one. This is similar to the distinction between fluid intelligence
and crystallized intelligence made by the psychologist Cattell (see Subsection 2.5). The
performance notion of intelligence is similar to our definition, with the exception that
performance is measured in a complex environment rather than across a wide range of
environments. This perspective appears in some other definitions also,
“[An intelligent agent does what] is appropriate for its circumstances and its goal, it
is flexible to changing environments and changing goals, it learns from experience,
and it makes appropriate choices given perceptual limitations and finite computa-
tion.” D. Poole [PMG98]
“. . . in any real situation behavior appropriate to the ends of the system and adaptive
to the demands of the environment can occur, within some limits of speed and
complexity.” A. Newell and H. A. Simon [NS76]
“Intelligence is the ability for an information processing agent to adapt to its envi-
ronment with insufficient knowledge and resources.” P. Wang [Wan95]
artificial intelligence is to find algorithms which have the greatest efficiency of intelligence,
that is, which achieve the most intelligence per unit of computational resources consumed.
It should also be pointed out that although universal intelligence does not test the
efficiency of an agent in terms of the computational resources that it uses, it does however
test how quickly the agent learns from past data. In a sense, an agent which learns very
quickly could be thought of as being very “data efficient”.
one instance of a human actually failing a Turing test. When queried about the latter,
one of the judges explained that “no human being would have that amount of knowledge
about Shakespeare”[Shi94].
Compression tests. Mahoney has proposed a particularly simple solution to the binary
pass or fail problem with the Turing test: Replace the Turing test with a text compression
test [Mah99]. In essence this is somewhat similar to a “Cloze test” where an individual’s
comprehension and knowledge in a domain is estimated by having them guess missing
words from a passage of text.
While simple text compression can be performed with symbol frequencies, the resulting
compression is relatively poor. By using more complex models that capture higher level
features such as aspects of grammar, the best compressors are able to compress text to
about 1.5 bits per character for English. However humans, who can also make use of
general world knowledge, the logical structure of the argument, and so on, are able to reduce
this to about 1 bit per character. Thus the compression statistic provides an easily
computed measure of how complete a machine’s models of language, reasoning and domain
knowledge are, relative to a human.
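As a simple illustration of the statistic, the following sketch measures the bits per character achieved by a standard general purpose compressor (here zlib from the Python standard library); such symbol and dictionary based compressors typically land well above the roughly 1.5 bits per character achieved by the best model based compressors, which is precisely the gap that the test is designed to expose.

\begin{verbatim}
import zlib

def bits_per_character(text: str) -> float:
    """Size of the zlib-compressed text, in bits per input character."""
    data = text.encode("utf-8")
    compressed = zlib.compress(data, 9)       # 9 = maximum compression level
    return 8.0 * len(compressed) / len(text)

# e.g. bits_per_character(open("corpus.txt").read()) on a large English corpus
\end{verbatim}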
To see the connection to the Turing test, consider a compression test based on a very
large corpus of dialogue. If a compressor could perform extremely well on such a test,
this is mathematically equivalent to being able to determine which sentences are probable
at a given point in a dialogue, and which are not (for the equivalence of compression
and prediction see [BCW90]). Thus, as failing a Turing test occurs when a machine (or
person!) generates a sentence which would be improbable for a human, extremely good
performance on dialogue compression implies the ability to pass a Turing test.
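The equivalence invoked here is the standard one from information theory: a model that assigns probability $P(x_1 x_2 \cdots x_n)$ to a text can, via arithmetic coding, encode it in essentially $-\log_2 P(x_1 x_2 \cdots x_n)$ bits, so that
\[ \text{bits per character} \;\approx\; -\frac{1}{n}\log_2 P(x_1 x_2 \cdots x_n). \]
Compressing dialogue well therefore amounts to assigning high probability to exactly those sentences that are plausible at each point in the conversation.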
A recent development in this area is the Hutter Prize [Hut06b]. In this test the corpus
is a 100 MB extract from Wikipedia. The idea is that this should represent a reasonable
sample of world knowledge and thus any compressor that can perform very well on this
test must have a good model not just of English, but also of world knowledge in general.
One criticism of compression tests is that it is not clear whether a powerful compressor
would easily translate into a general purpose artificial intelligence. Also, while a young
child has a significant amount of elementary knowledge about how to interact with the
world, this knowledge would be of little use when trying to compress an encyclopedia full
of abstract “adult knowledge” about the world.
Competitive games. The Turing Ratio method of Masum et al. places more emphasis
on tasks and games than on cognitive tests. Similar to our own definition,
they propose that “. . . doing well at a broad range of tasks is an empirical definition
of ‘intelligence’.”[MCO02] To quantify this they seek to identify tasks that measure im-
portant abilities, admit a series of strategies that are qualitatively different, and are
reproducible and relevant over an extended period of time. They suggest a system of
measuring performance through pairwise comparisons between AI systems that is similar
to that used to rate players in the international chess rating system. The key difficulty
however, which the authors acknowledge is an open challenge, is to work out what these
tasks should be, and to quantify just how broad, important and relevant each is. In our
view these are some of the most central problems that must be solved when attempting
to construct an intelligence test. Thus we consider this approach to be incomplete in its
current state.
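Masum et al. do not commit to a particular rating formula, so purely as an illustration we sketch an Elo style update of the kind used in chess, applied to a pairwise comparison between two AI systems on a task; the function and parameter names below are our own.

\begin{verbatim}
def elo_update(rating_a, rating_b, score_a, k=32.0):
    """One Elo-style rating update after a pairwise comparison.

    score_a is 1.0 if system A did better on the task, 0.5 for a tie,
    and 0.0 if system B did better. Returns the updated ratings.
    (Illustrative only; the Turing Ratio leaves the exact scheme open.)
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
\end{verbatim}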
exactly what, and what not, to test for. Thus we consider Psychometric AI, at least as it
is currently formulated, to only partially address this central question.
C-Test. One perspective among psychologists who support the g-factor view of intel-
ligence, is that intelligence is “the ability to deal with complexity”[Got97b]. Thus, in a
test of intelligence, the most difficult questions are the ones that are the most complex
because these will, by definition, require the most intelligence to solve. It follows then
that if we could formally define and measure the complexity of test problems using com-
plexity theory we could construct a formal test of intelligence. The possibility of doing
this was perhaps first suggested by Chaitin [Cha82]. While this path presents numerous
difficulties, we believe that it is the most natural and offers many advantages: It is formally
motivated, precisely defined, and could potentially be used to measure
the performance of both computers and biological systems on the same scale without the
problem of bias towards any particular species or culture.
Essentially this is the approach that we have taken. Universal intelligence is based
on the universally optimal AIXI agent for active environments, which in turn is based
on Kolmogorov complexity and Solomonoff’s universal model of sequence prediction.
A relative of universal intelligence is the C-Test of Hernández-Orallo, which was also
inspired by Solomonoff induction and Kolmogorov complexity [HO00b, HOMC98]. Glossing
over some technicalities, the essential relationship is that the C-Test stands to Solomonoff
induction roughly as universal intelligence stands to AIXI.
The C-Test consists of a number of sequence prediction and abduction problems similar
to those that appear in many standard IQ tests. The test has been successfully applied
to humans with intuitively reasonable results [HOMC98, HO00a]. Similar to standard
IQ tests, the C-Test always ensures that each question has an unambiguous answer in
the sense that there is always one hypothesis that is consistent with the observed pattern
that has significantly lower complexity than the alternatives. Besides making the test
easier to score, this has the added advantage of reducing the test’s sensitivity to changes in
the reference machine.
The key difference to sequence problems that appear in standard intelligence tests is
that the questions are based on a formally expressed measure of complexity. To overcome
the problem of Kolmogorov complexity not being computable, the C-Test instead uses
Levin’s Kt complexity [Lev73]. In order to retain the invariance property of Kolmogorov
complexity, Levin complexity requires the additional assumption that the universal Turing
machines are able to simulate each other in linear time. As far as we know, this is the
only formal definition of intelligence that has so far produced a usable test of intelligence.
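For reference, Levin's Kt complexity of a string $x$ with respect to a universal machine $U$ penalises both the length of a program and the logarithm of its running time,
\[ Kt(x) \;=\; \min_{p}\bigl\{\, \ell(p) + \log t \;:\; U(p) \text{ outputs } x \text{ within } t \text{ steps} \,\bigr\}, \]
which, unlike $K$, makes the quantity computable, since the time term bounds how long any candidate program needs to be run (though the evaluation can still be expensive).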
To illustrate the C-Test, example problems are given in [HOMC98], each accompanied by
its formally computed complexity; naturally, the more complex patterns are also the more
difficult ones.
Our main criticism of the C-Test is that it is a static test limited to passive environ-
ments. As we have argued earlier, we believe that a better approach is to use dynamic
intelligence tests where the agent must interact with an environment in order to solve
problems. As AIXI is a generalisation of Solomonoff induction from passive to active en-
vironments, universal intelligence could be viewed as generalising the C-Test from passive
to active environments.
Smith’s Test. Another complexity based formal definition of intelligence that appeared
recently in an unpublished report is due to W. D. Smith [Smi06]. His approach has
a number of connections to our work, indeed Smith states that his work is largely a
“. . . rediscovery of recent work by Marcus Hutter”. Perhaps this is overstating the
similarities: while there are some connections, there are also many important differences.
The basic structure of Smith’s definition is that an agent faces a series of problems
that are generated by an algorithm. In each iteration the agent must try to produce
the correct response to the problem that it has been given. The problem generator then
responds with a score of how good the agent’s answer was. If the agent so desires it can
submit another answer to the same problem. At some point the agent asks the
problem generator to move on to the next problem, and the score that the agent received
for its last answer to the current problem is then added to its cumulative score. Each
interaction cycle counts as one time step and the agent’s intelligence is then its total
cumulative score considered as a function of time. In order to keep things feasible, the
problems must all be in the complexity class P, that is, decision problems which can be
solved by a deterministic Turing machine in polynomial time.
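Read at face value, the protocol just described can be sketched as follows; the interface names and the explicit "next" signal are our own assumptions for illustration, since the report does not fix a concrete problem set or interface.

\begin{verbatim}
def smiths_test(agent, generator, max_steps=10000):
    """Sketch of the cumulative-scoring protocol described above."""
    cumulative = []                      # cumulative score as a function of time
    total, last_score = 0.0, 0.0
    problem = generator.next_problem()   # problems are assumed to lie in P
    for _ in range(max_steps):           # each interaction cycle is one time step
        action = agent.act(problem, last_score)
        if action == "next":             # agent asks to move to the next problem
            total += last_score          # the last answer's score is banked
            problem = generator.next_problem()
            last_score = 0.0
        else:                            # action is an answer to the current problem
            last_score = generator.score(problem, action)
        cumulative.append(total)
    return cumulative
\end{verbatim}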
We have three main criticisms of Smith’s definition. Firstly, while for practical reasons
it might make sense to restrict problems to be in P, we do not see why this practical
restriction should be a part of the very definition of intelligence. If some breakthrough
meant that agents could solve difficult problems in not just P but sometimes in NP as
well, then surely these new agents would be more intelligent? We had similar objections
to informal definitions of machine intelligence that included efficiency requirements in
Subsection 4.1.
Our second criticism is that the way intelligence is measured is essentially static, that
is, the environments are passive. As we have argued before, we believe that dynamic
testing in active environments is a better measure of a system’s intelligence. To put this
argument yet another way: Succeeding in the real world requires you to be more than an
insightful spectator!
The final criticism is that, although the definition is largely formal, it still
leaves open the important question of what exactly the tests should be. Smith suggests
that researchers should dream up tests and then contribute them to some common pool
of tests. As such, this is not a fully specified definition.
Wide range. A test/definition should cover very low levels of intelligence right up to
super human intelligence.
General. Ideally we would like to have a very general test/definition that could be
applied to everything from a fly to a machine learning algorithm.
Dynamic. A test/definition should directly take into account the ability to learn and
adapt over time as this is an important aspect of intelligence.
Formal. The test/definition should be specified with the highest degree of precision
possible, allowing no room for misinterpretation. Ideally, it should be described
using formal mathematics.
Objective. The test/definition should not appeal to subjective assessments such as the
opinions of human judges.
Fully Defined. Has the test/definition been fully defined, or are parts still unspecified?
Practical. A test should be able to be performed quickly and automatically, while from
a definition it should be possible to create an efficient test.
Test vs. Def. Finally, we note whether the proposal is more of a test, more of a
definition, or something in between.

Table 1: Comparison of the proposed tests and definitions of machine intelligence against
the criteria above. The rows are the Turing Test, Total Turing Test, Inverted Turing Test,
Toddler Turing Test, Linguistic Complexity, the Text Compression Test, the Turing Ratio,
Psychometric AI, Smith’s Test, the C-Test and Universal Intelligence; each is rated on each
criterion as “yes”, “debatable” (•), “no” (·) or unknown (?). When something is rated as
unknown that is usually because the test in question is not sufficiently specified. The final
column classifies the first six proposals as tests (T), the Turing Ratio, Psychometric AI,
Smith’s Test and the C-Test as lying between a test and a definition (T/D), and Universal
Intelligence as a definition (D).
It’s obviously false, there’s nothing in your definition, just a few equations.
Perhaps the most common criticism is also the most vacuous one: It’s obviously wrong!
These people seem to believe that defining intelligence with an equation is clearly impos-
sible, and thus there must be very large and obvious flaws in our work. Not surprisingly
these people are also the least likely to want to spend 10 minutes having the material
explained to them. Unfortunately, none of these people have been able to communicate
why the work is so obviously flawed in any concrete way — despite in one instance having
one of the authors chasing the poor fellow out of the conference centre and down the
street begging for an explanation. If anyone would like to properly explain their position
to us in the future, we promise not to chase you down the street!
It’s obviously correct, indeed everybody already knows this stuff. Curiously,
the second most common criticism is the exact opposite: The work is obviously right, and
indeed it is already well known. Digging deeper, the heart of this criticism comes from the
perception that we have not done much more than just describe reinforcement learning.
If you already accept that the reinforcement learning framework is the most general and
flexible way to describe artificial intelligence, and not everybody does, then by mixing
in Occam’s razor and a dash of complexity theory, the equation for universal intelligence
follows in a fairly straightforward way. While this is true, these ingredients have never
before been brought together in this way, although there are connections to other work,
as discussed in Subsection 4.2. Furthermore, simply writing down an equation is not
enough; one must argue that what the equation describes is in fact “intelligence” in a
sense that is reasonable for machines.
We have addressed this question in three main ways: Firstly, in Section 2 we developed
an informal definition of intelligence based on expert definitions which was then piece by
piece formalised leading to the equation for Υ in Subsection 3.3. This chain of argument
strongly ties our equation for intelligence with existing informal definitions and ideas
on the nature of intelligence. Secondly, in Subsections 3.4 and 3.5 we showed that the
equation has properties that are consistent with a definition of intelligence. Finally,
in Subsection 3.4 it was shown that universal intelligence is strongly connected to the
theory of universally optimal learning agents, in particular AIXI. From this it follows that
machines with very high universal intelligence have a wide range of powerful optimality
properties. Clearly then, what we have done goes far beyond merely restating elementary
reinforcement learning theory.
distribution over future events cannot, even in theory, be simulated to an arbitrary pre-
cision by a computable process. Some people take this position on various philosophical
grounds, such as the need for freewill. However, in standard physics there is no law of
the universe that is not computable in the above sense. Nor is there any experimental
evidence showing that such a physical law must exist. This includes quantum theory
and chaotic systems, both of which can be extremely difficult to compute for some phys-
ical systems, but are not fundamentally incomputable theories. In the case of quantum
computers, they can compute with lower time complexity than classical Turing machines,
however they are unable to compute anything that a classical Turing machine cannot,
when given enough time. Thus, as there is no hard evidence of incomputable processes in
the universe, our assumption that the agent’s environment has a computable distribution
is certainly not unreasonable.
If a physical process was ever discovered that was not Turing computable, then this
would likely result in a new extended model of computation. Just as we have based uni-
versal intelligence on the Turing model of computation, it might be possible to construct
a new definition of universal intelligence based on this new model in a natural way.
Finally, even if the universe was not computable, and we did not update our formal
definition of intelligence to take this into account, the fact that everything in physics so
far is computable means that a computable approximation to our universe would still be
extremely accurate over a huge range of situations. In which case, an agent that could
deal with a wide range of computable environments would most likely still function well
within such a universe.
very general definition. This is easier to do if we abstract over the internal workings of
the agent and define intelligence only in terms of external communications. Practically,
what matters is how well something works. By definition, if an agent has a high value of
Υ, then it must work well over a wide range of environments.
Block attacks this perspective by describing a machine that appears to be intelligent
as it is able to pass the Turing test, but is in fact no more than just a big look-up table
of questions and answers [Blo81] (for a related argument see [Gun71]). Although such
a look-up table based machine would be unfeasibly large, the fact that a finite machine
could in theory consistently pass the Turing test, seemingly without any real intelligence,
is worrisome. Our formal measure of machine intelligence could be challenged in the same
way, as could any test of intelligence that relies only on an agent’s external behaviour.
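To get a feeling for how unfeasibly large such a table would be, consider a rough estimate of our own (not Block's): storing a canned reply for every possible typed exchange of just 300 characters drawn from a 64 symbol keyboard would already require on the order of
\[ 64^{300} \;=\; 2^{1800} \;\approx\; 10^{542} \]
entries, absurdly far beyond the roughly $10^{80}$ atoms in the observable universe. The look-up table objection is thus a purely in-principle one.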
Our response to this is very simple: If an agent has a very high value of Υ then it is,
by definition, able to successfully operate in a wide range of environments. We simply
do not care whether the agent is efficient, due to some very clever algorithm, or absurdly
inefficient, for example by using an unfeasibly gigantic look-up table of precomputed
answers. The important point for us is that the machine has an amazing ability to solve
a huge range of problems in a wide variety of environments.
But you don’t deal with consciousness (or creativity, imagination, freewill,
emotion, love, soul, etc.) We apply the same argument to consciousness, emotions,
freewill, creativity, the soul and other such things. Our goal is to build powerful and
flexible machines and thus these somewhat vague properties are only relevant to our
goal to the extent to which they have some measurable effect on performance in some
well defined environment. If no such measurable effect exists, then they are not relevant
to our objective. Of course this is not the same as saying that these things do not
exist. The question is whether they are relevant or not. We would consider creativity,
appropriately defined, to have a significant impact on an agent’s ability to adapt to
challenging environments. Perhaps the same is also true of emotions, freewill and other
qualities.
so called “No Free Lunch” theorem [WM97]. However this theorem, or any of the standard
variants on it, cannot be applied to universal intelligence for the simple reason that we
have not taken a uniform distribution over the space of environments.
It is conceivable that there might exist some more general kind of “No Free Lunch”
theorem for agents that limits their maximal intelligence according to our definition.
Clearly any such result would have to apply only to computable agents given that the
incomputable AIXI agent faces no such limit. If such a result were true, it would suggest
that our definition of intelligence is perhaps too broad in its scope. Currently we know of
no such result.
Interestingly, if it could be shown that an upper limit on Υ existed for feasible machines
and that humans performed above this limit, then this would prove that humans have
some incomputable element to their operation, perhaps consciousness, which is of real
practical significance to their performance.
5.3 Conclusion
“. . . we need a definition of intelligence that is applicable to machines as well as
humans or even dogs. Further, it would be helpful to have a relative measure of
intelligence, that would enable us to judge one program more or less intelligent
than another, rather than identify some absolute criterion. Then it will be
possible to assess whether progress is being made . . . ”
— W. L. Johnson [Joh92]
Given the obvious significance of formal definitions of intelligence for research, and the
repeated calls for more direct measures of machine intelligence to replace the problematic
Turing test and other imitation based tests, remarkably little work has been done in this
area. In this paper we have attempted to tackle this problem by taking an informal definition
of intelligence modelled on expert definitions of human intelligence, and then generalising and formalising
it. We believe that the resulting mathematical definition captures the concept of ma-
chine intelligence in a very powerful and yet elegant way. Furthermore, by considering
alternative, more tractable measures of complexity, practical tests that estimate universal
intelligence should be possible. Developing such tests will be the next major task in this
direction of research.
The fact that we have stated our definition of machine intelligence in precise mathe-
matical terms, rather than the more usual vaguely worded descriptions, means that there
is no reason why criticisms of our approach should not be equally clear and precise. At
the very least we hope that this in itself will help raise the debate over the definition and
nature of machine intelligence to a new level of scientific rigour.
Acknowledgements
This work was supported by the Swiss NSF grant 200020-107616.
References
[AABL02] N. Alvarado, S. Adams, S. Burbeck, and C. Latta. Beyond the Turing test: Per-
formance metrics for evaluating a computer simulation of the human mind. In Performance
Metrics for Intelligent Systems Workshop, Gaithersburg, MD, USA, 2002.
North-Holland.
[Alb91] J. S. Albus. Outline for a theory of intelligence. IEEE Trans. Systems, Man and
Cybernetics, 21(3):473–509, 1991.
[Ana92] A. Anastasi. What counselors should know about the use and interpretation of psy-
chological tests. Journal of Counseling and Development, 70(5):610–615, 1992.
[Aso03] A. Asohan. Leading humanity forward. The Star, October 14, 2003.
[BCW90] T. C. Bell, J. G. Cleary, and I. H. Witten. Text compression. Prentice Hall, 1990.
[Bin11] A. Binet. Les idées modernes sur les enfants. Flammarion, Paris, 1911.
[Bin37] W. V. Bingham. Aptitudes and aptitude testing. Harper & Brothers, New York, 1937.
[Bor23] E. G. Boring. Intelligence as the tests test it. New Republic, 35:35–37, 1923.
[BS05] A. Binet and T. Simon. Méthodes nouvelles pour le diagnostic du niveau intellectuel des
anormaux. L’Année Psychologique, 11:191–244, 1905.
[Cal02] C. S. Calude. Information and Randomness. Springer, Berlin, 2nd edition, 2002.
[Cat87] R. B. Cattell. Intelligence: Its Structure, Growth, and Action. Elsevier, New York,
1987.
[Eis91] J. Eisner. Cognitive science and the search for intelligence. Invited paper presented to
the Socratic Society, University of Cape Town, 1991.
[FH98] K. M. Ford and P. J. Hayes. On computational wings: Rethinking the goals of artificial
intelligence. Scientific American, Special Edition(4), 1998.
[Fog95] D. B. Fogel. Review of computational intelligence: Imitating life. Proc. of the IEEE,
83(11), 1995.
[Fre90] R. M. French. Subcognition and the limits of the Turing test. Mind, 99:53–65, 1990.
[Gar93] H. Gardner. Frames of Mind: Theory of multiple intelligences. Fontana Press, 1993.
[GR05] D. Graham-Rowe. Spotting the bots with brains. In New Scientist magazine, volume
2512, page 27, 13 August 2005.
[Gre98] R. L. Gregory. The Oxford Companion to the Mind. Oxford University Press, Oxford,
UK, 1998.
[Gui67] J. P. Guilford. The Nature of Human Intelligence. McGraw-Hill, New York, 1967.
[Gun71] K. Gunderson. Mentality and machines. Doubleday and company, Garden City, New
York, USA, 1971.
[Har89] S. Harnad. Minds, machines and Searle. Journal of Theoretical and Experimental
Artificial Intelligence, 1:5–25, 1989.
[Hau81] J. Haugeland. Mind Design: Philosophy, psychology, and Artificial Intelligence. Brad-
ford Books, 1981.
[HCH95] F. H. Hsu, M. S. Campbell, and A. J. Hoane. Deep blue system overview. In Proceed-
ings of the 1995 International Conference on Supercomputing, pages 240–244, 1995.
[HM96] R. J. Herrnstein and C. Murray. The Bell Curve: Intelligence and Class Structure in
American Life. Free Press, 1996.
[HO00a] J. Hernández-Orallo. Beyond the Turing test. Journal of Logic, Language and Infor-
mation, 9(4):447–466, 2000.
[Hor02] J. Horst. A native intelligence metric for artificial systems. In Performance Metrics
for Intelligent Systems Workshop, Gaithersburg, MD, USA, 2002.
[HP94] L. M. Herman and A. A. Pack. Animal intelligence: Historical perspectives and con-
temporary approaches. In R. Sternberg, editor, Encyclopedia of Human Intelligence,
pages 86–96. Macmillan, New York, 1994.
[Hut01b] M. Hutter. Universal sequential decisions in unknown environments. Proc. 5th Euro-
pean Workshop on Reinforcement Learning (EWRL-5), 27:25–26, October 2001.
[Hut06a] M. Hutter. General discounting versus average reward. In Proc. 17th International
Conf. on Algorithmic Learning Theory (ALT’06), volume 4264 of LNAI, pages 244–258,
Barcelona, 2006. Springer, Berlin.
[Kur00] R. Kurzweil. The age of spiritual machines: When computers exceed human intelli-
gence. Penguin, 2000.
[LH04a] S. Legg and M. Hutter. Ergodic MDPs admit self-optimising policies. Technical Report
IDSIA-21-04, IDSIA, 2004.
[LH04b] S. Legg and M. Hutter. A taxonomy for abstract environments. Technical Report
IDSIA-20-04, IDSIA, 2004.
[LH05] S. Legg and M. Hutter. A universal measure of intelligence for artificial agents. In
Proc. 21st International Joint Conf. on Artificial Intelligence (IJCAI-2005), pages
1509–1510, Edinburgh, 2005.
[LH06a] S. Legg and M. Hutter. A formal definition of intelligence for artificial systems. In
Proc. 50th Anniversary Summit of Artificial Intelligence, pages 197–198, Monte Verita,
Switzerland, 2006.
[LH06b] S. Legg and M. Hutter. A formal measure of machine intelligence. In Proc. 15th Annual
Machine Learning Conference of Belgium and The Netherlands (Benelearn’06), pages
73–80, Ghent, 2006.
[MCO02] H. Masum, S. Christensen, and F. Oppacher. The Turing ratio: Metrics for open-
ended tasks. In GECCO 2002: Proceedings of the Genetic and Evolutionary Compu-
tation Conference, pages 973–980, New York, 2002. Morgan Kaufmann Publishers.
[Min85] M. Minsky. The Society of Mind. Simon and Schuster, New York, 1985.
[NS76] A. Newell and H. A. Simon. Computer science as empirical enquiry: Symbols and
search. Communications of the ACM, 19(3):113–126, 1976.
[Rav00] J. Raven. The Raven’s progressive matrices: Change and stability over culture and
time. Cognitive Psychology, 41:1–48, 2000.
[RR86] Zh. I. Reznikova and B.Ya. Ryabko. Analysis of the language of ants by information-
theoretic methods. Problems Inform. Transmission, 22:245–249, 1986.
[SCA00] A. Saygin, I. Cicekli, and V. Akman. Turing test: 50 years later. Minds and Machines,
10, 2000.
[Sch98] P. Schweizer. The truly total Turing test. Minds and Machines, 8:263–272, 1998.
[Sch02] J. Schmidhuber. The Speed Prior: a new simplicity measure yielding near-optimal
computable predictions. In Proc. 15th Annual Conference on Computational Learning
Theory (COLT 2002), Lecture Notes in Artificial Intelligence, pages 216–228, Sydney,
Australia, July 2002. Springer.
[SD03] P. Sanghi and D. L. Dowe. A computer program capable of passing I.Q. tests. In Proc.
4th ICCS International Conference on Cognitive Science (ICCS’03), pages 570–575,
Sydney, NSW, Australia, 2003.
[Sea80] J. Searle. Minds, brains, and programs. Behavioral & Brain Sciences, 3:417–458, 1980.
[SG02] R. J. Sternberg and E. L. Grigorenko, editors. Dynamic Testing: The nature and
measurement of learning potential. Cambridge University Press, 2002.
[Shi94] S. Shieber. Lessons from a restricted Turing test. CACM: Communications of the
ACM, 37, 1994.
[Spe27] C. E. Spearman. The abilities of man, their nature and measurement. Macmillan, New
York, 1927.
[TGH01] A. Treister-Goren and J. L. Hutchens. Creating AI: A unique interplay between the
development of learning algorithms and their education. In Proceeding of the First
International Workshop on Epigenetic Robotics, 2001.
[Thu38] L. L. Thurstone. Primary mental abilities. University of Chicago Press, Chicago, 1938.
[Vos05] P. Voss. Essentials of general intelligence: The direct path to AGI. In B. Goertzel and
C. Pennachin, editors, Artificial General Intelligence. Springer-Verlag, 2005.
[Wan95] P. Wang. On the working definition of intelligence. Technical Report 94, Center for
Research on Concepts and Cognition, Indiana University, 1995.
[Wat96] S. Watt. Naive psychology and the inverted Turing test. Psycoloquy, 7(14), 1996.
[Wec58] D. Wechsler. The measurement and appraisal of adult intelligence. Williams &
Wilkins, Baltimore, 4th edition, 1958.
[WM97] D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. IEEE
Trans. on Evolutionary Computation, 1(1):67–82, 1997.
[Zen97] T. R. Zentall. Animal memory: The role of instructions. Learning and Motivation,
28:248–267, 1997.