Extending the BDI Model with Q-learning in Uncertain Environment
Qian Wan†, Wei Liu, Longlong Xu and Jingzhi Guo
Hubei Province Key Laboratory of Intelligent Robot,
School of Computer Science and Engineering, Wuhan Institute of Technology
Wuhan, China
[email protected], [email protected], [email protected], [email protected]
ABSTRACT

The BDI model solves the problem of reasoning and decision-making for agents in a particular environment by procedural reasoning. But in an uncertain environment whose context is unknown, the BDI model is not applicable, because in the BDI model the context must be matched in the plan library. To address this issue, in this paper we propose a method that extends the BDI model with Q-learning, an algorithm of reinforcement learning, and improve the decision-making mechanism of ASL as an implementation model of BDI. Finally, we completed a maze simulation on the Jason simulation platform to verify the feasibility of the method.

KEYWORDS

BDI model, Agent, Q-learning, Jason, Plan Library

ACM Reference format:
Qian Wan, Wei Liu, Longlong Xu and Jingzhi Guo. 2018. Extending the BDI Model with Q-learning in Uncertain Environment. In Proceedings of 2018 International Conference on Algorithms, Computing and Artificial Intelligence (ACAI'18). Sanya, China, 6 pages. https://doi.org/10.1145/3302425.3302432

† Corresponding author: [email protected]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ACAI '18, December 21–23, 2018, Sanya, China
© 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6625-0/18/12…$15.00
https://doi.org/10.1145/3302425.3302432

1 Introduction

Research on agents acting in an uncertain and dynamic environment is a challenge. BDI [1] agents were designed as an agent-oriented programming model for building multi-agent systems. The BDI model is concerned with an agent's rule description and logical reasoning in multi-agent systems. However, these two factors are based on context and must be designed in advance. Reinforcement learning [2] (RL) is applied to solve the problem that a BDI agent does not know the environmental model. RL assumes that an agent uses the rewards observed from the environment to measure the utility of its actions in an uncertain and dynamic environment. According to the reward value, the agent can determine a sequence of actions even in an uncertain environment.

BDI concepts were first used to describe people's behavior and intention and were later introduced into artificial intelligence; the earliest abstract BDI model was put forward by Georgeff. On the basis of this model, different Procedural Reasoning Systems (PRS) were designed for reasoning. Based on the PRS model, researchers developed multi-agent systems (MAS) built on the BDI model, including JACK [3], DECAF [4], IRMA [5], JADEX [6], ASL [7], etc. Because ASL, which adds a plan library to the foundational BDI model, has a simulation system and a better extension interface, we take ASL as the starting point for studying the BDI model. The planning problem of the agent in an uncertain environment remains, so our research is aimed at improving the planning part of ASL, which is the implementation model of the BDI agent.
RL is learning to map situations to actions so as to maximize a numerical reward. Without knowing which actions to take, the learner must discover which actions yield the most reward by trying them. Actions may affect not only the immediate reward but also the next situation and all subsequent rewards. An agent in RL has the ability to learn and plan, but lacks the logic and reasoning ability of a BDI agent. RL can be divided into two types: model-based and model-free. Model-free RL is mainly used to solve the planning of agents in situations where environmental information is not known, and Q-learning is a model-free algorithm with high learning efficiency. Therefore, this study puts forward a method that uses RL to solve the problem that a BDI agent cannot make decisions in a dynamic and uncertain environment.

The structure of this paper is as follows. After this introductory section, Section II gives a brief discussion of related works. Section III introduces the BDI agent and AgentSpeak(L), RL and the Q-learning algorithm. Section IV describes our decision-improvement algorithm based on Q-learning in the ASL system. Section V describes the simulation experiment and evaluates and analyzes its results. Section VI summarizes our research and points to possible future developments.
2 Related Works acquired by the agent from the environment, the information of its
The inference mechanism in the BDI model is based on a preset environment, so the BDI system lacks planning under an unknown environment. For a known environment model, many methods have been used to solve planning, including decision trees, self-aware neural networks, and rule-learning algorithms used to optimize the agent's decisions. Pereira applies the Markov decision process to generate the optimal strategy for a BDI plan, but in unknown environments it is difficult to build the Markov environment model, so these methods do not apply when the environment information is unknown.

For the unknown environment model, self-adaptive research on agents acting in an unknown environment under the BDI model has received more attention. Google recently proposed a method that builds the BDI model of an agent by deep reinforcement learning to understand the agent's current real intention in order to improve its planning [8]. J. L. Feliu proposes an offline training session for plan generation in an uncertain and dynamic environment: Q-learning over training sessions consisting of interactions between agent and environment generates plans for agents in ASL [9]. Although this method solves the problem of agents knowing how to act, considering the efficiency of achieving goals in every state under an uncertain environment, when the state set is too large the quantity of generated plans explodes, which deviates from the original BDI model's focus on internal logic. Joost Broekens proposes to solve the rule-selection problem by learning rule priorities with RL, and to use a state to represent independent heuristic rules on active targets [10]. However, it is difficult for agents to express the state in a dynamic environment. For the autonomous agent mentioned in the literature [11], the incentive signal for the agent's behavior is given by a person. Supervised learning is used to generate an action selector according to this excitation signal, which finally enables the agent to perform tasks efficiently in a complex and unknown environment; this method enables the agent to learn the human's perception and decision. But human resources are consumed by the human being acting as the trainer.

Compared with the above methods, this study avoids taking the state as the unit of plan: decision-making in the unknown environment becomes a plan that is added to the plan library after the Q-learning algorithm has explored the environment. The plan is the unit after learning, which avoids the problem of an oversized plan library, and the state of the agent is easy to define.

3 Background

3.1 BDI-Agent and AgentSpeak(L)

The agent-based programming paradigm gives computer software modules higher intelligence and adaptive ability. BDI is an agent-oriented programming model composed of Belief, Desire and Intention. The belief includes the environmental information acquired by the agent from the environment, the information of its own operation and the information received from other agents. The desire represents the possible states of the agent when performing a task, and is driven by intention in actual operation. The intention represents the state that the agent decides to achieve in its actual operation. There are two types of intention: one is the target to be executed by the agent, and the other is the plan based on context matching in the BDI model. The BDI conceptual model is described as follows: the agent initializes beliefs and intentions, then perceives environmental information. Beliefs are updated by the belief update function. According to the current intentions and beliefs, some desires are selected as candidates, and one of them can be selected as the execution plan through matching rules designed in advance. The agent finishes the task by executing the plan. The BDI model is a conceptual model in which the agent has basic logical reasoning ability.
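To make the deliberation cycle just described more concrete, the following minimal Python sketch walks through one pass of a BDI-style loop: beliefs are revised from percepts, candidate desires are filtered against the current beliefs, and one option is committed to as an intention. All names here (brf, options, select_plan) are illustrative placeholders, not the paper's implementation or Jason's actual API.

# Minimal, illustrative BDI deliberation loop (not Jason's implementation).
# Beliefs are a set of ground facts; desires pair a goal with a context test;
# the selected plan is just a list of action names.

def brf(beliefs, percepts):
    """Belief revision function: add newly perceived facts."""
    return beliefs | set(percepts)

def options(desires, beliefs):
    """Keep only the desires whose context holds in the current beliefs."""
    return [d for d in desires if d["context"] <= beliefs]

def select_plan(candidates):
    """Commit to one applicable option as the intention (here: the first)."""
    return candidates[0]["plan"] if candidates else []

beliefs = {"at(start)"}
desires = [
    {"goal": "reach(exit)", "context": {"at(start)"}, "plan": ["move_right", "move_up"]},
]

percepts = ["wall(left)"]               # what the agent senses this cycle
beliefs = brf(beliefs, percepts)        # 1. revise beliefs
candidates = options(desires, beliefs)  # 2. generate applicable desires
intention = select_plan(candidates)     # 3. commit to one plan as intention
for action in intention:                # 4. execute the plan's actions
    print("executing", action)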
ASL (Agent Speak Language) [12] is extended on the basis of the BDI model, and its specific system structure diagram is shown in figure 1.

1. The agent's target and belief library are initialized, and the belief library is updated by the BRF (Belief Revision Function) according to the information the agent perceives from the environment or receives from other agents.
2. The agent recognizes changes in the environment, updates the belief library, and verifies whether the change triggers an event in the plan library. When multiple events are triggered simultaneously, the sequence of event execution is determined by the SE function.
3. According to the trigger event, the predicate symbol is matched in the plan library. For example, suppose the default trigger event in the plan library is +color(Object,Color). When the agent perceives a blue box from the environment, the data is transformed into the expression +color(box1,blue)[source(percept)]; after SE (Select Event) matches the trigger event, box1 is bound to Object and blue to Color. The applicable plan is matched, and the values of Object and Color are instantiated to the entities box1 and blue in the plan (a sketch of this matching step follows the list).
4. A plan whose context is matched among the applicable plans is selected through the SO function. The plan structure is trigger_event : context <- body, where the body includes the action sequence, subgoals and child trigger events; variables in the body are matched in the same way. The plans that conform to the context are pushed onto the intention stack. The agent performs the actions in turn, popping the stack; when the stack is empty, the system enters the next loop.
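The matching step in item 3 can be pictured as a small unification of the perceived literal against a plan's triggering event. The Python sketch below is a simplified illustration under the assumption that literals are (functor, args) pairs and that capitalized names are variables; it is not Jason's actual event-selection machinery.

# Simplified trigger-event matching: bind variables in the plan's trigger
# (capitalized names) to the constants of the perceived event.

def match_trigger(trigger, event):
    """Return a variable binding if the two literals unify, else None."""
    t_functor, t_args = trigger
    e_functor, e_args = event
    if t_functor != e_functor or len(t_args) != len(e_args):
        return None
    bindings = {}
    for t_arg, e_arg in zip(t_args, e_args):
        if t_arg[0].isupper():          # a variable such as Object or Color
            bindings[t_arg] = e_arg
        elif t_arg != e_arg:            # a constant that must match exactly
            return None
    return bindings

# Plan trigger +color(Object, Color) against percept +color(box1, blue)
trigger = ("+color", ["Object", "Color"])
event = ("+color", ["box1", "blue"])
print(match_trigger(trigger, event))    # {'Object': 'box1', 'Color': 'blue'}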
AgentSpeak(L) builds a multi-agent system based on the BDI model, which facilitates the programming and study of agents. This study mainly focuses on the improvement of ASL, which is the implementation model of BDI.

In Q-learning, the maximum value of Q(s, a) is updated every time the state is passed. Its iterative update equation is given in equation (1):

$$Q_{k+1}(s_t, a_t) = Q_k(s_t, a_t) + \alpha \bigl( r_{t+1} + \gamma \max_{a} Q_k(s_{t+1}, a) - Q_k(s_t, a_t) \bigr) \qquad (1)$$

$Q_{k+1}(s_t, a_t)$ represents the updated cumulative reward value, and $Q_k(s_t, a_t)$ is the cumulative value from the last time the agent passed the state. $\alpha$ is the learning rate. $\gamma \in (0, 1)$; the larger $\gamma$ is, the more the later rewards are considered. $r_{t+1}$ represents the reward obtained when state $s_t$ performs action $a_t$ and reaches $s_{t+1}$, and $\gamma \max_a Q_k(s_{t+1}, a)$ is the discounted cumulative reward, where $\max_a Q_k(s_{t+1}, a)$ denotes the maximum value at $s_{t+1}$ in the Q table. Q-learning is a greedy strategy algorithm. The value of Q(s, a) tends to be stable after finite iterations of formula (1); the maximum Q value of the current state is then selected from the Q table until the final state is reached, forming the optimal strategy.
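As a concrete reading of equation (1), the following Python sketch performs the tabular update on a toy state space. The environment interface (step), the reward values and the epsilon-greedy exploration are illustrative assumptions for the sketch, not the paper's maze setup.

import random
from collections import defaultdict

# Tabular Q-learning update of equation (1) on a toy 1-D corridor:
# states 0..4, actions -1/+1, reward 1.0 only when state 4 is reached.
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
ACTIONS = (-1, +1)
Q = defaultdict(float)                      # Q[(state, action)]

def step(state, action):
    """Illustrative environment: move along the corridor, reward at the goal."""
    nxt = min(max(state + action, 0), 4)
    return nxt, (1.0 if nxt == 4 else 0.0), nxt == 4

for _ in range(200):                        # training episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection over the current Q table
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next, r, done = step(s, a)
        # equation (1): Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next

print({k: round(v, 2) for k, v in Q.items() if k[1] == +1})

After enough episodes the Q values stabilize, and always taking the action with the largest Q value in the current state reproduces the greedy optimal strategy described above.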