Defining Problem from Solutions: Inverse Reinforcement Learning (IRL) and Its Applications for Next-Generation Networking

Yinqiu Liu, Ruichen Zhang, Hongyang Du, Dusit Niyato, Fellow, IEEE, Jiawen Kang, Zehui Xiong, and Dong In Kim, Fellow, IEEE

arXiv:2404.01583v1 [cs.NI] 2 Apr 2024

Y. Liu, R. Zhang, H. Du, and D. Niyato are with the School of Computer Science and Engineering, Nanyang Technological University, Singapore (e-mail: [email protected], [email protected], [email protected], and [email protected]). J. Kang is with the School of Automation, Guangdong University of Technology, China (e-mail: [email protected]). Z. Xiong is with the Pillar of Information Systems Technology and Design, Singapore University of Technology and Design, Singapore (e-mail: [email protected]). D. I. Kim is with the Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon 16419, South Korea (e-mail: [email protected]).

Abstract—Performance optimization is a critical concern in networking, on which Deep Reinforcement Learning (DRL) has achieved great success. Nonetheless, DRL training relies on precisely defined reward functions, which formulate the optimization objective and indicate positive/negative progress toward the optimum. With the ever-increasing environmental complexity and human participation in Next-Generation Networking (NGN), defining appropriate reward functions becomes challenging. In this article, we explore the applications of Inverse Reinforcement Learning (IRL) in NGN. Particularly, while DRL aims to find the optimal solutions to a problem, IRL finds a problem from the optimal solutions, where the optimal solutions are collected from experts and the problem is defined by reward inference. Specifically, we first formally introduce the IRL technique, including its fundamentals, workflow, and differences from DRL. Afterward, we present the motivations for IRL applications in NGN and survey existing studies. Furthermore, to demonstrate the process of applying IRL in NGN, we perform a case study about human-centric prompt engineering in Generative AI-enabled networks. We demonstrate the effectiveness of using both DRL and IRL techniques and prove the superiority of IRL.

Index Terms—Inverse Reinforcement Learning (IRL), Next-Generation Networking (NGN), Generative AI (GAI), Reward Engineering, Deep Reinforcement Learning (DRL).

I. INTRODUCTION

Deep Reinforcement Learning (DRL) has become indispensable in many fields, such as networking, robotics, and finance [1]. Following the Reinforcement Learning (RL) principle, an agent can optimize its decision-making capability by iteratively interacting with an environment, aiming to maximize cumulative rewards. Meanwhile, Deep Neural Networks (DNNs) enhance RL by enabling agents to represent complex environments and learn sophisticated policies. Although such a paradigm demonstrates remarkable success, explicitly defining or even determining rewards can be challenging in many real-world scenarios [2] due to environmental complexity, human participation, or information asymmetry. Take task offloading in mobile edge networks as an example, where users select edge servers to maximize their Quality of Experience (QoE). We know that several factors are related to QoE, such as service latency and fees. However, the weights and relationships of the observed factors, as well as how to fuse them appropriately to model user-side experience, are unknown. Moreover, the unobserved subjectivity of users, such as personal inner and behavioral preferences, should also be considered. Without knowing what constitutes the reward, a trained DRL policy might exhibit severe sub-optimality and bias.

Fortunately, Inverse Reinforcement Learning (IRL) emerges as a pivotal solution to overcome the obstacles caused by reward inaccessibility [3]. Specifically, IRL enhances DRL by introducing reward inference. In the above example, instead of manually defining a reward function without any accurate prior knowledge or precision guarantee, IRL utilizes DNNs to infer the rewards that can effectively explain user behaviors. Such behaviors are called expert trajectories and should be collected before training IRL. Successful IRL applications include ChatGPT, which involves large-scale human feedback to fine-tune model generation (https://cgnarendiran.github.io/blog/chatgpt-future-of-conversational-ai/). Meanwhile, IRL has been widely adopted in human-in-the-loop networking and systems, such as autonomous driving assistance [4]. We conclude that IRL offers the following benefits.

• Environment Exploration: IRL provides a means to break information asymmetry and explore complex or adversarial environments. By leveraging the inferred reward functions, agents are not only guided toward optimal policies but are also encouraged to explore uncharted territories within the environment. For instance, users collect malicious servers' behaviors to infer their objectives and patterns, thus taking corresponding defenses [3].
• Behavior Understanding: By inferring reward functions from observed expert behaviors, IRL offers profound insights into the underlying motivations of agents, enabling a deeper comprehension of complex behaviors. For example, human driving patterns are complicated, since the actions taken by drivers under different traffic conditions are based on their empirical experience, driving skills, and preferences. Leveraging DNNs to represent such non-linear features, IRL can learn a reward function that aligns with the driving behaviors (https://meixinzhu.github.io/project/irl/).

• Policy Imitation: IRL excels in distilling policies from demonstrations, allowing agents to imitate desired behaviors. This capability is especially beneficial in scenarios where the desired outcome is known but the path to achieving it is not. In the driving example, after acquiring a reward function, an autonomous driving agent can be trained to imitate the expert drivers' behaviors.

In this article, we delve into the applications of IRL in Next-Generation Networking (NGN), which is the foundation of numerous advanced technologies, e.g., 6G communications, the Metaverse, and the Internet of Everything. According to AT&T (https://www.business.att.com/learn/tech-advice/what-is-next-generation-network-ngn-technology-explained.html), NGN is projected to accommodate massive devices, support diverse communication/network protocols, and provide immersive services to users. Such explosive growth in networking scale, topology, and complexity greatly increases the difficulty of defining optimization objectives and rewards precisely. Noticing such issues, several studies [2], [5] adopted IRL to solve certain networking problems, such as workload balancing, routing schedules, and attack detection, while an in-depth analysis of IRL-enabled NGN is missing. We conclude that three key questions are still yet to be answered.

• Q1: How can IRL help NGN, and in which ways?
• Q2: How to design IRL algorithms to serve specific NGN scenarios?
• Q3: What are the open issues and future directions in this topic?

To this end, we provide forward-looking research on IRL-enabled NGN, as well as conducting a case study for demonstration. To the best of our knowledge, this is the first work answering why IRL is essential for NGN and showing how to deploy IRL algorithms to solve real-world NGN problems. Our main contributions are summarized as follows.

• We comprehensively discuss the fundamentals of IRL. Specifically, based on the Markov Decision Process (MDP) and RL principles, we introduce the basics of IRL and several representative IRL algorithms. Additionally, we analyze the benefits of enhancing DRL with IRL.
• We explore the applications of IRL in NGN. To do so, we first illustrate the driving forces in NGN that promote IRL adoption. Afterward, we review the existing literature on IRL-enabled NGN, as well as the limitations of current IRL techniques.
• We conduct a case study to demonstrate the process of applying IRL in NGN. In particular, we showcase a human-centric prompt engineering scheme for Generative AI (GAI)-enabled networks. Experimental results prove that IRL can effectively infer unobserved rewards of human users and achieve higher experience.

II. FUNDAMENTALS OF INVERSE REINFORCEMENT LEARNING

In this section, we comprehensively discuss the fundamentals of IRL, including MDP, conventional DRL, the basics of IRL, representative IRL algorithms, and IRL's advantages.

A. Preliminaries

1) Reward Engineering: To precisely formulate an optimization problem, the reward function should be defined appropriately. The term "reward" indicates the desirability of each action. By assigning positive/higher and negative/lower reward values to the actions that make positive and negative progress, respectively, the optimization objective can be described and the optimum can be gradually approached. The inputs of reward functions are observed/unobserved factors of the environments and stakeholders.

As shown in Fig. 1(a), users may encounter different situations when defining reward functions. First, if their objective is straightforward, such as maximizing the throughput, it can be directly adopted as the reward. If multiple physical factors, e.g., throughput, latency, and packet loss, are involved, the reward function should fuse them in linear/non-linear manners and acquire a combined expression. Note that the difficulty and potential errors grow dramatically with the increasing number of factors. Furthermore, in human-in-the-loop networks, the unobserved subjective factors of users should be modeled and considered as well, which is challenging. The precision of reward functions is measured by how well they can describe optimization objectives and elicit desired behaviors.

2) Markov Decision Process: Fig. 1(b) illustrates an MDP [1], the building block for DRL and IRL. As a discrete-time stochastic control process, an MDP presents a mathematical framework for modeling sequential decision-making in environments with uncertainty. The basic MDP is constructed by a four-element tuple ⟨states, actions, transition probabilities, rewards⟩. These components describe the environment in which an agent performs certain actions, acquires rewards, and transits from the current state to the next state.

• States: The set of all possible situations, each of which encapsulates various configurations or factors that describe the current environment.
• Actions: For each state, there are actions available to the agent that can change the state.
• Transition Probabilities: The likelihood of moving from one state to another, given an action taken by the agent.
• Rewards: After taking an action and moving to a new state, the agent receives a reward, whose meaning is discussed above. The reward function can be defined by mathematical expressions or be inferred by DNNs, which correspond to DRL and IRL, respectively.
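To make these definitions concrete, the minimal sketch below encodes such a four-element tuple, together with a hand-fused reward in the spirit of Section II-A1, for the running edge-server-selection example. All states, transition probabilities, latencies, and weights here are illustrative assumptions rather than values taken from this article.

```python
import random

# Toy MDP for edge-server selection (all numbers are illustrative assumptions).
STATES = ["low_load", "high_load"]     # observed network conditions
ACTIONS = ["server_A", "server_B"]     # candidate edge servers

# Transition probabilities P(next_state | state, action).
TRANSITIONS = {
    ("low_load", "server_A"):  {"low_load": 0.8, "high_load": 0.2},
    ("low_load", "server_B"):  {"low_load": 0.6, "high_load": 0.4},
    ("high_load", "server_A"): {"low_load": 0.3, "high_load": 0.7},
    ("high_load", "server_B"): {"low_load": 0.5, "high_load": 0.5},
}

def fused_reward(latency_ms: float, success_rate: float,
                 alpha: float = -0.01, beta: float = 2.0) -> float:
    """Hand-crafted linear fusion f = alpha * latency + beta * success_rate.
    Guessing alpha and beta is exactly what becomes unreliable once fees,
    subjective preferences, and further factors enter the picture."""
    return alpha * latency_ms + beta * success_rate

def step(state: str, action: str):
    """Sample the next state and return it with an (assumed) observed reward."""
    probs = TRANSITIONS[(state, action)]
    next_state = random.choices(list(probs), weights=list(probs.values()))[0]
    latency = 20.0 if next_state == "low_load" else 80.0   # assumed milliseconds
    success = 0.99 if action == "server_A" else 0.95       # assumed success rate
    return next_state, fused_reward(latency, success)

print(step("low_load", "server_A"))   # e.g., ('low_load', 1.78)
```

In DRL, a function such as fused_reward is fixed in advance; the IRL methods discussed later replace it with a network whose parameters are inferred from expert behavior.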
In the NGN context, most optimization problems can be modeled as an MDP, such as resource allocation, routing, attacks/defenses, and sub-carrier selection [1]. Hence, acquiring a reliable and effective method for MDP optimization is of significant importance in promoting NGN.

B. Deep Reinforcement Learning

Combining reinforcement learning (RL) principles with the representational capabilities of DNNs, DRL is the most famous and efficient approach to solving complex decision-making tasks formulated as MDPs. Next, we introduce the basic idea, architecture, and process of DRL.

Fig. 1: The fundamentals of IRL. (a): The cases of defining reward functions. (b): The illustration of MDP. (c): The workflow of the DRL algorithm. (d): The workflow of IRL. Note that we illustrate four representative approaches to infer rewards, namely maximum margin, maximum entropy, GAIL, and RLHF.

1) Basic Idea: The fundamental goal of DRL is to derive an optimal policy that maximizes the cumulative reward in an MDP setup [1]. Through iterative interaction with the environment, an agent learns to refine its decision-making strategy based on received feedback. DNNs play a crucial role in this process, approximating the functions that predict the future rewards of actions, thereby guiding the agent toward optimal decision-making.

2) General Architecture: As shown in Fig. 1(c), the general DRL framework consists of several essential components:

• Agent: The decision-maker that interacts with the environment, i.e., taking actions, transiting to the next state, and receiving rewards.
• Environment: The system that defines the state and action spaces and allows agents to perform the MDP.
• Policy Network: A neural network that maps states to actions, defining the agent's behavior.
• Value Network: A neural network that estimates future rewards from states or state-action pairs.
• Replay Buffer: A repository of numerous past trajectories, enabling the agent to reuse historical training data and reduce training costs.
• Optimizer: The mechanism that adjusts neural network parameters to maximize expected cumulative rewards under the current policy.

3) Algorithm Process: To illustrate the DRL process, we show an example of leveraging Proximal Policy Optimization (PPO) [6] to offload tasks. In this case, the agent, action, and state refer to the user, the selection of the task offloading server, and the edge network, respectively. In addition, the reward can be defined by fusing multiple factors, such as latency and success rate, thereby indicating user QoE. The PPO process is streamlined into three main stages, namely data collection, advantage estimation, and policy optimization. Initially, the agent interacts with the edge network using the current server selection policy and gathers ⟨action, state, reward⟩ records. Such records are buffered to calculate advantage estimates, which assess the relative value of actions. Then, PPO optimizes a specially designed objective function that includes a clipping mechanism. This mechanism prevents the new policy from straying too far from the old one, ensuring small, stable updates and avoiding drastic performance fluctuations. By iteratively cycling through these stages, PPO gradually fine-tunes the policy, striking a balance between exploring new strategies and exploiting known rewards, making it adept at navigating complex environments.
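As a complement to the three PPO stages just described, the snippet below sketches the clipped surrogate objective that realizes the clipping mechanism. It assumes PyTorch tensors of per-step log-probabilities and advantage estimates; the clip range of 0.2 is a common default rather than a value prescribed by this article.

```python
import torch

def ppo_clipped_loss(new_log_probs: torch.Tensor,
                     old_log_probs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Minimal PPO policy loss: -E[min(r * A, clip(r, 1-eps, 1+eps) * A)],
    where r is the probability ratio between the new and old policies.
    Clipping removes the incentive to move far from the data-collecting
    policy, which is what keeps the updates small and stable."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```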
C. Inverse Reinforcement Learning

Despite the success of DRL, in many network scenarios, precisely defining the reward function is impossible. In the above example, although a manually defined QoE can reflect users' preference for low latency and a high success rate, subjectivity factors are ignored. For instance, users with time-sensitive tasks may emphasize latency, while secure computing tasks should put the success rate at the highest priority. Following biased reward functions, the policy training may deviate from the real optimization objective. Accordingly, IRL is presented to infer such sophisticated reward functions by observing and analyzing users' behaviors [2]. As shown in Fig. 1(d), in the beginning, the users' selections of edge servers in different network states are collected, called expert datasets/trajectories. Afterward, instead of simply learning the action policy, IRL first infers the form of the reward function that best explains the demonstrated behaviors. Moreover, it can optimize the agent's policy towards closely aligning with or mimicking the expert policy according to the inferred reward function. Given its strong representation and learning ability, the reward function is generally inferred by a DNN.
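The workflow in Fig. 1(d) can thus be viewed as an alternation between reward inference and policy refinement. The skeleton below is a hedged illustration of one such round; the feature dimension, the margin-style reward loss, and the helpers collect_trajectories and ppo_update are placeholders assumed for illustration rather than components defined in this article.

```python
import torch
import torch.nn as nn

# Reward network R_theta(s, a): maps a state-action feature vector to a scalar.
reward_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
reward_opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def irl_round(expert_features: torch.Tensor, policy, env):
    """One IRL round: (1) adjust the reward so expert behavior scores higher
    than the current policy's behavior, (2) refine the policy on the inferred
    reward with any standard DRL algorithm such as PPO."""
    # Roll out the current policy to obtain comparison trajectories
    # (collect_trajectories is a hypothetical helper returning a feature tensor).
    policy_features = collect_trajectories(policy, env)

    # (1) Reward inference: expert state-actions should out-score policy ones.
    reward_loss = -(reward_net(expert_features).mean()
                    - reward_net(policy_features).mean())
    reward_opt.zero_grad()
    reward_loss.backward()
    reward_opt.step()

    # (2) Policy refinement: treat R_theta as the now-available reward signal.
    ppo_update(policy, env, reward_fn=reward_net)   # hypothetical helper
```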

Fig. 2: The motivation of applying IRL in NGN, using QoE optimization for task offloading as an example. A: The attackers' information is hidden from users. B: Subjectivity factors can hardly be represented precisely. C: The immediate reward of each action is unobserved; Actions 1 and 1' choose servers H and G for task offloading, respectively. D: SAGIN communications involve numerous physical factors from diverse devices, causing huge difficulty in QoE modeling. E: Human driving behaviors contain complicated physiological processes that can hardly be described by manually designed rewards.

agent’s policy towards closely aligning with or mimicking 3) Generative Imitation Learning: Recall that reward func-
the expert policy according to the inferred reward function. tions serve as an indicator to guide policy refinement. Apart
Given the strong representation and learning ability, the reward from inferring an explicit reward value for each action and
function is generally inferred by a DNN. maximizing the reward expectation, the policy can be refined
in an imitation manner. For instance, GAIL [4] leverages the
D. Evolution of Inverse Reinforcement Learning Generative Adversarial Networks (GANs) framework, where
The inference of reward functions can be implemented a DNN-based generator aims to approach expert behavior.
following different approaches. As illustrated in Fig. 1(c), we Meanwhile, another DNN acts as the discriminator, trying
introduce four representative and widely adopted IRL algo- to differentiate between the expert’s actions and those of
rithms, from basic maximum margin to advanced generative the generator. The two modules are trained alternately, i.e.,
imitation learning. freezing one module and adjusting the parameters of the
1) Maximum Margin: This algorithm [4] focuses on distin- other module to optimize its behavior imitation/differentiation
guishing the expert’s policy from all other policies by a margin performance. Afterward, the generator can effectively mimic
that depends on the difference in the accumulated reward. It the expert policy and solve MDP problems without requiring
operates under the principle that the correct reward function reward inference. Accordingly, the computational efficiency
should make the expert behavior appear significantly better can be significantly improved.
than all other possible behaviors. This method is particularly
useful in scenarios where clear demarcation between optimal 4) Reinforcement Learning with Human Feedback: In many
and suboptimal policies is possible. cases, the reward function is unavailable since reward values
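In the standard maximum-entropy formulation (stated here for completeness rather than quoted from this article), trajectories are assumed to be exponentially more likely the higher their accumulated reward, and the reward parameters are fitted by maximum likelihood over the expert demonstrations:

$$P(\tau \mid \theta) = \frac{\exp\big(R_\theta(\tau)\big)}{Z(\theta)}, \qquad \theta^{*} = \arg\max_{\theta} \sum_{\tau_E \in \mathcal{D}} \log P(\tau_E \mid \theta),$$

where $R_\theta(\tau)$ is the parameterized reward accumulated along trajectory $\tau$, $Z(\theta)$ is the partition function over all trajectories, and $\mathcal{D}$ is the expert dataset. The gradient of this objective is the gap between the expert's reward-gradient expectations and those induced by the current reward model, which is the precise sense in which the method favors reward functions supporting diverse but consistent behaviors.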
3) Generative Adversarial Imitation Learning: Recall that reward functions serve as an indicator to guide policy refinement. Apart from inferring an explicit reward value for each action and maximizing the reward expectation, the policy can be refined in an imitation manner. For instance, GAIL [4] leverages the Generative Adversarial Network (GAN) framework, where a DNN-based generator aims to approach the expert behavior. Meanwhile, another DNN acts as the discriminator, trying to differentiate between the expert's actions and those of the generator. The two modules are trained alternately, i.e., freezing one module and adjusting the parameters of the other to optimize its imitation/differentiation performance. Afterward, the generator can effectively mimic the expert policy and solve MDP problems without requiring explicit reward inference. Accordingly, the computational efficiency can be significantly improved.

4) Reinforcement Learning with Human Feedback: In many cases, the reward function is unavailable since reward values are given by humans. Human perception and evaluation is a complex physiological process and is affected by various subjective factors such as preference, personality, environment, and strictness. As a variant of IRL, RLHF combines the RL principle with direct feedback from humans, integrating subjective insights into the learning process. Specifically, a reward model is first trained based on action-score pairs, where scores are manually annotated by humans. Then, DRL can be leveraged for policy optimization, ensuring the policy aligns with human preferences.
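To make the RLHF reward-modeling step concrete, the sketch below fits a small reward model to human-annotated action-score pairs before any policy optimization takes place. The feature dimension, network width, and plain regression loss are assumptions for illustration; practical systems often use pairwise preference losses instead.

```python
import torch
import torch.nn as nn

# Reward model trained on human-annotated <state-action features, score> pairs.
reward_model = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def fit_reward_model(features: torch.Tensor, human_scores: torch.Tensor,
                     epochs: int = 100) -> nn.Module:
    """features: [N, 16] encodings of state-action pairs (assumed format);
    human_scores: [N, 1] scores annotated by human raters."""
    for _ in range(epochs):
        loss = loss_fn(reward_model(features), human_scores)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # The trained model then serves as the reward signal for a standard
    # DRL algorithm, aligning the learned policy with human preferences.
    return reward_model
```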

E. Comparison and Summary

Based on the above descriptions, the differences between DRL and IRL can be summarized as follows. First, DRL operates within a well-defined MDP environment whose four-element tuple is fully available. IRL, conversely, applies to incomplete MDPs, where the reward function is unknown. Accordingly, the primary goal in DRL is to learn an optimal policy that maximizes cumulative rewards, while IRL aims to infer the reward function from the expert's demonstrations in order to understand the objectives the agent is implicitly striving for. The inferred reward then guides the policy refinement. Finally, DRL policies are optimized by iteratively interacting with the environment. In contrast, IRL relies on offline expert demonstrations, learning indirectly from optimal behaviors and inferring the rewards that can elicit such behaviors. Overall, DRL is used to find a solution to the problem, while IRL is used to find a problem from the solutions.

The most significant advantage of IRL is its ability to infer unobserved reward functions precisely. Such capability greatly enhances conventional DRL since, if the reward is defined inappropriately, the trained policy cannot solve the optimization efficiently. In addition, IRL, especially RLHF, excels in understanding and mimicking human-like decision-making behaviors, making it ideal for applications requiring nuanced behavior modeling. Last but not least, IRL can provide insights into the underlying motivations and objectives behind observed actions, contributing to understanding environments.

III. INVERSE REINFORCEMENT LEARNING IN NEXT-GENERATION NETWORKING

A. Motivations from Next-Generation Networking

According to AT&T, NGN refers to the evolution and migration of fixed and mobile networking infrastructures from distinct and proprietary networks to converged networks with high efficiency, security, and experience. Hence, the manners of networking organization, management, and service provisioning will undergo significant changes [7]. In this part, we utilize QoE maximization for task offloading as an example to demonstrate the driving forces in NGN that promote the further development of IRL (see Fig. 2).

• Reward Unavailability: In some NGN scenarios, rewards are unavailable to decision-makers. As shown in Fig. 2(a), users may encounter malicious offloading servers, which steal sensitive information for attacks. The intelligent defense against such malicious behaviors relies on detecting the attackers' objectives, which are hidden from users. In addition, the complicated topology of NGN leads to sophisticated RL state spaces [7]. For instance, Fig. 2(b) illustrates a case where offloading servers are structured as a tree and users choose an offloading service chain from the leaf to the root. In this case, the immediate reward of performing each action is inaccessible since the association between the current state and the final result is unknown. Finally, NGN exhibits human-in-the-loop features since it aims to support diverse advanced applications and ensure high human-perceptual experiences [7]. Take the offloading of image generation tasks as an example (see Fig. 2(c)): user QoE depends not only on objective factors such as service latency and fee but also on users' preference for the painting style, whose modeling is intricate.
• Environmental Complexity: Given massive devices, as well as the advancement of access protocols and communication paradigms, NGN exhibits environmental complexity. For instance, in a Space-Air-Ground Integrated Network (SAGIN), multiple physical factors of the devices in each layer explicitly or implicitly contribute to QoE, while their specific contributions are unknown (see Fig. 2(d)). In this case, a manually defined reward function may suffer from incomprehensiveness and bias.
• Policy Optimization: If expert behaviors that maximize QoE are available, training the agent to mimic the expert can lead to the highest efficiency for QoE optimization. However, a manually designed reward function can hardly realize such action imitation. Fig. 2(e) demonstrates an example of autonomous driving. Owing to the strong capability of DNNs, IRL can effectively mimic the expert driver with perfect QoE by inferring the reward that elicits the desired behaviors.

B. Applications of Inverse Reinforcement Learning in Next-Generation Networking

In this part, we review related works about IRL in networking, as shown in TABLE I. Particularly, our survey is organized from the three perspectives shown above.

1) Infer Unobserved Rewards: As mentioned in Section II, IRL is effective in inferring unobserved rewards caused by adversarial relationships or human participation in the networks. For instance, Snow et al. [5] applied multi-agent IRL to identify coordination in cognitive radar networks, i.e., multiple radars collaborating to track one target. Since the radars are adversarial, the target cannot identify their numbers, configurations, and individual utility functions. Leveraging the IRL concept, the target first determines whether collaboration exists based on whether the emissions satisfy Pareto optimality [5]. If so, it then uses the emissions as the expert trajectories to infer the individual utility of each radar. Likewise, Parras et al. [3] leveraged GAIL to enhance IoT security. This is because, nowadays, DRL is widely adopted by attackers to create new attacks. Specifically, they select the attack objective, adopt DRL to train the optimal attack strategy, and launch the attacks on the victims. Accordingly, defenders can adopt IRL methods, such as GAIL, to infer the objective of the attackers and thus refine the defense policies.

In addition to adversarial networks, Wang et al. [8] discussed the applications of IRL in cases where the reward of each action cannot be mathematically calculated. Specifically, they utilize the branch and bound (B&B) algorithm to realize full-dimensional task offloading in edge computing. Although the B&B process can be modeled as an MDP, it is very challenging to design an appropriate reward function for the B&B algorithm: before obtaining an optimal solution, the branching order of the variables cannot be known in advance.

TABLE I: Summary of applications of IRL in networking.

Aspect: Infer unobserved rewards
• Snow et al. [5]: Coordination detection in cognitive radar networks. Motivation: the collaborative radars contain multiple entities, each of which has its own utility. Methodology: apply IRL to infer the radars' individual utilities. Performance: the utility functions are inferred with high accuracy.
• Parras et al. [3]: Inferring attackers' objectives and defending against IoT attacks. Motivation: the objective of attackers is unknown, causing difficulty in defining reward functions. Methodology: utilize GAIL to infer the attacker's objectives and design the defense. Performance: the defense can effectively address the attacks.
• Wang et al. [8]: Full-dimensional task offloading in edge computing. Motivation: the consequence of each step is unknown until the result of the entire B&B is obtained. Methodology: adopt GIRL to infer the reward of each B&B step. Performance: keeps at least 80% accuracy for guiding B&B.

Aspect: Overcome complex network environments
• Li et al. [9]: QoS detection in dynamic and time-variant networks. Motivation: the network is dynamic and time-variant; the projection between routes and QoS is vague. Methodology: apply IRL to predict QoS in dynamic networks. Performance: reduces QoS prediction error by up to 31.3%.
• Konar et al. [10]: Load balancing in communication networks. Motivation: different factors contribute to QoE with varying degrees. Methodology: present trajectory extrapolation in reward function forming. Performance: achieves a 32% improvement in QoE modeling for load balancing.
• Shamsoshoara et al. [11]: Trajectory scheduling and power allocation in UAV networks. Motivation: numerous physical factors and processes affect the UAV, e.g., the UAV's speed and power. Methodology: develop IRL to capture UAV-related factors. Performance: realizes much higher power allocation efficiency.

Aspect: Guide policy optimization
• Zhang et al. [12]: Power allocation optimization in cellular networks. Motivation: using a learned reward can better guide the agent to mimic the expert. Methodology: utilize maximum entropy to guide agent training. Performance: only 1% performance loss compared with the expert.
• Tian et al. [13]: Sum rate maximization in multi-cell multi-user networks. Motivation: the same as [12]. Methodology: utilize GAIL to guide agent training. Performance: achieves a 20% gain compared with DRL.
• Tian et al. [14]: Power usage minimization in MISO networks. Motivation: the same as [12]. Methodology: utilize GAIL to guide policy optimization. Performance: reduces power by 50% compared with DRL results.

This is because the B&B algorithm solves the problem by breaking it down into smaller sub-problems and using a bounding function to eliminate sub-problems with limited upper bounds, forming a tree-structured solution space. In this case, we cannot know in advance which variable to branch on to generate a smaller enumeration tree. Therefore, the authors present a Graph-based IRL (GIRL), which uses a Graph Neural Network (GNN) to infer the immediate reward of each action based on the structure of the tree.

Finally, the concept of IRL has been widely adopted in human-in-the-loop networks. For instance, the training of ChatGPT, the most famous multimodal AIGC model, involves self-reinforcement from large-scale human feedback, thereby aligning the model output with human preferences.

2) Overcome Complex Network Environments: Owing to complex network environments, even if manually defining some vague reward function is possible, it can hardly represent precisely the environmental changes caused by the agent's actions or efficiently indicate their desirability. In [9], the authors presented an IRL-based approach to predict QoS in dynamic and time-variant networks. Due to large state spaces and complex projections between routes and QoS values, it is difficult to define a precise reward function artificially. Likewise, Konar et al. [10] presented a trajectory extrapolation algorithm to model the quality of experience (QoE) for communication load balancing. Particularly, they consider the difficulty of gathering expert trajectories in practice. Hence, their proposed algorithm first sorts the gathered trajectories according to some objective and well-established metrics. Then, the trajectories are sampled to facilitate the reward inference and policy learning. In UAV networks, Shamsoshoara et al. [11] developed an interference-aware joint path planning and power allocation scheme to minimize the interference that UAVs cause to the existing ground user equipment. IRL is applied to capture all physical factors, including the UAV's transmission power, terrestrial user density, imposed interference, and the UAV's task completion capability, thereby inferring the optimization objective.

3) Guide Policy Optimization: In some cases where the optimal trajectories are available, directly inferring the reward function that facilitates the agent to mimic the expert can lead to the best policy optimization efficiency. For instance, Zhang et al. [12] adopted maximum entropy IRL to optimize the power allocation schemes in multi-user cellular networks. Experiments show that IRL greatly outperforms manually defined reward functions based on objective metrics, such as the weighted minimum mean square error. Similarly, Tian et al. [13] applied this principle to multi-cell multi-user networks to maximize the sum rate. Furthermore, in [14], they adopted the GAIL method to minimize power usage in multiple-input-single-output networks with multiple users. Compared with conventional DRL, the proposed method can reduce power consumption by over 50%.

Lesson Learned: From the above brief survey, we learn that IRL enhances the capabilities of DRL by introducing a reward inference step. Owing to DNNs' stronger representation and learning capabilities than those of humans, IRL can well handle complex environments, actions, and reward perception processes (such as human perception of AI-generated images). Such advantages make IRL inherently suitable for scenarios that are complex yet have expert trajectories available for demonstration. Nonetheless, IRL also exhibits certain drawbacks, such as the additional computational costs for reward inference and the dependency on expert trajectories, which are subject to further research.

IV. CASE STUDY: HUMAN-ORIENTED PROMPT ENGINEERING IN GENERATIVE AI-ENABLED NETWORKS

To illustrate how to solve practical NGN problems by IRL, this section performs a case study about human-oriented prompt engineering. Particularly, we simultaneously present the DRL- and IRL-based solutions to the same task, helping readers compare these two methods. The effectiveness of DRL and IRL is also discussed.

Fig. 3: The illustration of our case study. A: The system model of the GAI-enabled network. B: The efficacy of prompt engineering; we can observe that the image generated from the crafted prompt excels in object rendering and image composition. C: The illustration of DRL-based prompt engineering. D: The illustration of IRL-based prompt engineering.

A. System Model

With the rapid advancement of GAI, generative tasks, such as text-to-image generation, automatic conversation, and 3D avatar rendering, play an important role in NGN. As shown in Fig. 3(A), we consider a GAI-enabled network with users and service providers. Users describe their requests in natural language (a so-called prompt) and submit them to the service providers. Operating professional GAI models, the service providers perform generative inference and produce the required content for users.

Nonetheless, due to the lack of professionality/experience, user prompts may suffer from information insufficiency and ambiguity, causing low generation quality. To this end, service providers can strategically craft prompts that maintain semantic equivalence with raw prompts while being informative, clear, and suitable for the specific GAI model [15]. Such a process is called prompt engineering [15]. Taking text-to-image generation as an example, Fig. 3(B) depicts the impact of prompt engineering on the generated image. We can observe that the image generated by the crafted prompt outperforms in exquisiteness and composition. Despite such advantages, service providers may own multiple prompt engineering approaches since the prompts are crafted from open vocabularies. In this case, how to select the optimal prompt engineering approach according to the specific request is a remaining challenge. Next, we solve such an optimization problem following the DRL and IRL paradigms, respectively.

B. DRL- and IRL-based Prompt Engineering

1) DRL Workflow: First, we solve this problem following the DRL paradigm. Specifically, we adopt the PPO algorithm discussed in Section II-B. As shown in Fig. 3(C), the service provider acts as the agent, whose actions include all available operations to refine raw prompts. The raw prompts from users are accommodated by the environment. Finally, the reward evaluates the efficacy of applying the selected prompt engineering operation on the current raw prompt. To this end, we leverage ImageReward (https://github.com/THUDM/ImageReward), a learning-based metric for measuring the aesthetic quality of AI-generated images, as the reward function. Correspondingly, the quality score of the image generated from the crafted prompt is utilized as the reward. The DRL architecture and algorithm workflow follow the description in Section II-B, with the goal of optimizing the policy for selecting prompt engineering operations.

2) IRL Workflow: Then, we apply IRL to solve the same problem. Different from DRL, as shown in Fig. 3(D), IRL includes the following three steps.

• Expert Dataset Collection: First, we construct an expert dataset to indicate human preference for AI-generated images. Particularly, inspired by the outstanding capability of Large Language Models (LLMs) in multimodal understanding [15], we leverage an LLM-empowered agent to mimic real users. Then, we randomly compose 50 raw prompts and perform all the available prompt engineering operations on each of them. The agent is utilized to evaluate the generated images by scores. Note that we adopt two strategies for training the LLM, ensuring it generates rational scores. First, we instruct the LLM to act as an experienced user, recalling its pre-trained knowledge of image quality assessment. Additionally, we feed ten materials regarding computer vision, image generation, and painting to the LLM, thus enhancing its expertise. All the ⟨raw prompt, crafted prompt, score⟩ pairs constitute the expert dataset, showing which kind of prompt engineering is preferred by humans for the specific image generation task.
• Policy Optimization: Different from defining rewards via an existing metric, IRL utilizes DNNs to infer reward functions that align with expert decisions. To improve the learning efficiency, we adopt GAIL, explained in Section II-D. Specifically, the discriminator distinguishes between the selections of the expert prompt engineering policy and those of the agent's policy, thereby self-calibrating rewards to align with the expert decision-making mode. Meanwhile, the generator aims to learn a prompt engineering policy that mimics the expert behaviors, driven by feedback from the discriminator. In this way, the generator can gradually imitate humans in judging AI-generated images and perform prompt engineering accordingly.
• MDP Design: Finally, we configure the MDP for IRL. Similar to DRL, the action space contains all candidate prompt engineering operations. The state space is represented by the LLM-assigned scores of the image generated by the crafted prompt. This allows GAIL to evaluate the efficacy of each action by assessing its impact on the resulting image quality. Unlike traditional DRL, which relies on manually designed reward functions, GAIL leverages the discriminator network to infer rewards from our expert dataset, as sketched below.
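As a rough illustration of how the GAIL component above can be wired together, the sketch below defines a discriminator over ⟨state, prompt engineering operation⟩ pairs and converts its output into a reward for the generator policy. The hidden-layer width mirrors the experimental settings reported in Section IV-C, while the state feature dimension, the training helpers, and the specific reward shaping are assumptions.

```python
import torch
import torch.nn as nn

STATE_DIM, NUM_OPS = 8, 7   # 7 candidate prompt engineering operations (assumed encoding)

# Discriminator D(s, a): fully connected, two hidden layers of 100 units.
discriminator = nn.Sequential(
    nn.Linear(STATE_DIM + NUM_OPS, 100), nn.Tanh(),
    nn.Linear(100, 100), nn.Tanh(),
    nn.Linear(100, 1),
)
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(expert_sa: torch.Tensor, policy_sa: torch.Tensor) -> None:
    """Label expert pairs as 1 and generator pairs as 0, then update D."""
    logits = torch.cat([discriminator(expert_sa), discriminator(policy_sa)])
    labels = torch.cat([torch.ones(len(expert_sa), 1),
                        torch.zeros(len(policy_sa), 1)])
    loss = bce(logits, labels)
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()

def gail_reward(policy_sa: torch.Tensor) -> torch.Tensor:
    """Self-calibrated reward: higher when D mistakes generator pairs for expert ones.
    This signal replaces the hand-picked metric used in the DRL workflow."""
    with torch.no_grad():
        d = torch.sigmoid(discriminator(policy_sa))
    return -torch.log(1.0 - d + 1e-8)
```

During training, discriminator_step and the generator's PPO update are called alternately, so the reward the generator maximizes gradually tracks the human preferences embedded in the expert dataset.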

C. Experiments

1) Experimental Setting: We take text-to-image generation as an example. The service providers adopt Stable Diffusion v2 to generate images. Raw prompts from users take the form of "A [A] with [B]", e.g., "a city with a car" and "a garden with a fountain." Leveraging the GPT-4 model, seven kinds of operations to craft prompts can be performed on each raw prompt, as discussed in [15]. The LLM-based agent is also implemented by GPT-4. For IRL, the discriminator consists of a four-layer fully connected DNN with two intermediate layers, each containing 100 neurons. Meanwhile, the generator, containing the actor and critic, employs a similar four-layer, fully connected DNN architecture with 64 neurons in each of the two hidden layers. The hyperbolic tangent function serves as the activation mechanism for all hidden layers. The learning rate and gamma are set to 3 × 10^-4 and 0, respectively.

2) Result Analysis: Fig. 4 shows the trend of the cumulative rewards of the different algorithms during the training process. It shows that, as the number of episodes increases, the proposed IRL method achieves good convergence and significantly outperforms the baseline, i.e., a service provider that selects prompt engineering operations randomly. Meanwhile, DRL also converges rapidly. With the well-trained DRL and IRL policies, Fig. 5 evaluates their efficacy in selecting prompt engineering operations. Note that the quality scores are also assessed by the LLM-empowered agent. As shown in Fig. 5, IRL can increase the image quality by 0.33 on average, while DRL only achieves a 0.1 increment. This is because IRL is trained on expert datasets, which can effectively indicate the human preference for assessing AI-generated images. Consequently, the prompt engineering operations selected by IRL can better align with human desire, resulting in high-score images.

Fig. 4: The training curves of DRL and IRL.

Fig. 5: The efficacy of DRL and IRL in prompt engineering. Each case corresponds to one randomly selected raw prompt (Case 1: a mountain with a stone castle; Case 2: a garden with a bench; Case 3: a city with a blue car).

Discussion: Owing to the increased image quality, the service experience of human users can be improved drastically. Meanwhile, the re-generation caused by unqualified outputs can be reduced, which greatly decreases service latency and bandwidth consumption [15].

V. FUTURE DIRECTIONS

A. Mixture-of-Experts

Recall that IRL heavily relies on expert trajectories, while in some complex scenarios, collecting the optimal trajectories that achieve the objective with the lowest cost is impossible. Instead, there may only exist multiple locally optimal trajectories owned by distributed experts, each of which reaches the optimum in certain aspects/stages. To this end, the mixture-of-experts principle can be leveraged, which utilizes a learning-based gating network to dynamically select/combine different trajectories in different training stages.

B. IRL with Human Feedback

Human feedback has been successfully integrated into DRL (e.g., ChatGPT) to make policies align with human preferences. Likewise, humans can participate in the IRL process to further enhance the expert dataset and its annotations. For example, in our case study, the efficient calibration of inferred rewards based on human feedback is worth researching, thereby ensuring the reward function precisely represents the real human judgment of AI-generated images.

C. Security of IRL

The reliance on expert trajectories may cause security issues since attackers can easily mislead policy training by data poisoning. To this end, strict access control and privacy protection are urgently required for deploying IRL in practical NGN scenarios. Zero-trust can be a potential technique to dynamically manage data access and usage, thereby preventing privacy leakage.

VI. CONCLUSION

In this article, we have explored the applications of IRL in NGN. Specifically, we have comprehensively introduced the IRL technique, including its fundamentals, representative algorithms, and advantages. Then, we have discussed the vision of NGN, as well as the motivations for adopting IRL. Afterward, we have surveyed the existing literature on IRL proposals for solving networking problems. Furthermore, we have performed a case study on human-centric prompt engineering in GAI-enabled networks, comparing the workflow and effectiveness of both DRL and IRL. Finally, future directions to promote the further development of IRL in NGN have been summarized.

REFERENCES

[1] H. Du, R. Zhang, Y. Liu, J. Wang, Y. Lin, Z. Li, D. Niyato, J. Kang, Z. Xiong, S. Cui, B. Ai, H. Zhou, and D. I. Kim, "Beyond deep reinforcement learning: A tutorial on generative diffusion models in network optimization," arXiv preprint arXiv:2308.05384, 2023.
[2] G. Wang, P. Cheng, Z. Chen, W. Xiang, B. Vucetic, and Y. Li, "Inverse reinforcement learning with graph neural networks for IoT resource allocation," in ICASSP, 2023, pp. 1–5.
[3] J. Parras, A. Almodóvar, P. A. Apellániz, and S. Zazo, "Inverse reinforcement learning: A new framework to mitigate an intelligent backoff attack," IEEE Internet of Things Journal, vol. 9, no. 24, pp. 24790–24799, 2022.
[4] S. Arora and P. Doshi, "A survey of inverse reinforcement learning: Challenges, methods and progress," Artificial Intelligence, vol. 297, pp. 103500–103548, 2021.
[5] L. Snow, V. Krishnamurthy, and B. M. Sadler, "Identifying coordination in a cognitive radar network - a multi-objective inverse reinforcement learning approach," in ICASSP, 2023, pp. 1–5.
[6] R. Zhang, K. Xiong, Y. Lu, P. Fan, D. W. K. Ng, and K. B. Letaief, "Energy efficiency maximization in RIS-assisted SWIPT networks with RSMA: A PPO-based approach," IEEE Journal on Selected Areas in Communications, vol. 41, no. 5, pp. 1413–1430, 2023.
[7] V. S. Vaishnavi, Y. M. Roopa, and P. L. Srinivasa Murthy, "A survey on next generation networks," in ICCNCT, 2020, pp. 162–171.
[8] G. Wang, P. Cheng, Z. Chen, B. Vucetic, and Y. Li, "Inverse reinforcement learning with graph neural networks for full-dimensional task offloading in edge computing," IEEE Transactions on Mobile Computing, pp. 1–18, 2023.
[9] J. Li, H. Wu, Q. He, Y. Zhao, and X. Wang, "Dynamic QoS prediction with intelligent route estimation via inverse reinforcement learning," IEEE Transactions on Services Computing, pp. 1–18, 2023.
[10] A. Konar, D. Wu, Y. T. Xu, S. Jang, S. Liu, and G. Dudek, "Communication load balancing via efficient inverse reinforcement learning," in ICC, 2023, pp. 472–478.
[11] A. Shamsoshoara, F. Lotfi, S. Mousavi, F. Afghah, and I. Guvenc, "Joint path planning and power allocation of a cellular-connected UAV using apprenticeship learning via deep inverse reinforcement learning," arXiv preprint arXiv:2306.10071, 2023.
[12] R. Zhang, K. Xiong, X. Tian, Y. Lu, P. Fan, and K. B. Letaief, "Inverse reinforcement learning meets power allocation in multi-user cellular networks," in INFOCOM WKSHPS, 2022, pp. 1–2.
[13] X. Tian, K. Xiong, R. Zhang, P. Fan, D. Niyato, and K. B. Letaief, "Sum rate maximization in multi-cell multi-user networks: An inverse reinforcement learning-based approach," IEEE Wireless Communications Letters, vol. 13, no. 1, pp. 4–8, 2024.
[14] X. Tian, B. Gao, M. Liu, K. Xiong, P. Fan, and K. B. Letaief, "IRL-PM: An inverse reinforcement learning-based power minimization in multi-user MISO networks," in ICCCS, 2023, pp. 332–336.
[15] Y. Liu, H. Du, D. Niyato, J. Kang, S. Cui, X. Shen, and P. Zhang, "Optimizing mobile-edge AI-generated everything (AIGX) services by prompt engineering: Fundamental, framework, and case study," IEEE Network, pp. 1–1, 2024.
