Defining Problem From Solutions: Inverse Reinforcement Learning (IRL) and Its Applications For Next-Generation Networking
Abstract—Performance optimization is a critical concern in networking, on which Deep Reinforcement Learning (DRL) has achieved great success. Nonetheless, DRL training relies on precisely defined reward functions, which formulate the optimization objective and indicate the positive/negative progress towards the optimum. With the ever-increasing environmental complexity and human participation in Next-Generation Networking (NGN), defining appropriate reward functions becomes challenging. In this article, we explore the applications of Inverse Reinforcement Learning (IRL) in NGN. Particularly, if DRL aims to find optimal solutions to the problem, IRL finds a problem from the optimal solutions, where the optimal solutions are collected from experts and the problem is defined by reward inference. Specifically, we first formally introduce the IRL technique, including its fundamentals, workflow, and difference from DRL. Afterward, we present the motivations for IRL applications in NGN and survey existing studies. Furthermore, to demonstrate the process of applying IRL in NGN, we perform a case study about human-centric prompt engineering in Generative AI-enabled networks. We demonstrate the effectiveness of using both DRL and IRL techniques and prove the superiority of IRL.

Index Terms—Inverse Reinforcement Learning (IRL), Next-Generation Networking (NGN), Generative AI (GAI), Reward Engineering, Deep Reinforcement Learning (DRL).

Y. Liu, R. Zhang, H. Du, and D. Niyato are with the School of Computer Science and Engineering, Nanyang Technological University, Singapore (e-mail: [email protected], [email protected], [email protected], and [email protected]). J. Kang is with the School of Automation, Guangdong University of Technology, China (e-mail: [email protected]). Z. Xiong is with the Pillar of Information Systems Technology and Design, Singapore University of Technology and Design, Singapore (e-mail: [email protected]). D. I. Kim is with the Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon 16419, South Korea (e-mail: [email protected]).

I. INTRODUCTION

Deep Reinforcement Learning (DRL) has become indispensable in many fields, such as networking, robotics, and finance [1]. Following the Reinforcement Learning (RL) principle, an agent can optimize its decision-making capability by iteratively interacting with an environment, aiming to maximize cumulative rewards. Meanwhile, Deep Neural Networks (DNNs) enhance RL by enabling agents to represent complex environments and learn sophisticated policies. Although such a paradigm demonstrates remarkable success, explicitly defining or even determining rewards can be challenging in many real-world scenarios [2] due to environmental complexity, human participation, or information asymmetry. Take task offloading in mobile edge networks as an example, where users select edge servers to maximize their Quality of Experience (QoE). We know that several factors are related to QoE, such as service latency and fees. However, the weight and relationship of observed factors, as well as how to fuse them appropriately to model user-side experience, are unknown. Moreover, the unobserved subjectivity of users, such as personal inner and behavioral preferences, should also be considered. Without knowing what constitutes the reward, a trained DRL policy might exhibit severe sub-optimality and bias.

Fortunately, Inverse Reinforcement Learning (IRL) emerges as a pivotal solution to overcome the obstacles caused by reward inaccessibility [3]. Specifically, IRL enhances DRL by introducing reward inference. In the above example, instead of manually defining a reward function without any accurate prior knowledge or precision guarantee, IRL utilizes DNNs to infer the rewards that can effectively explain user behaviors. Such behaviors are called expert trajectories and should be collected before training IRL. Successful IRL applications include ChatGPT, which involves large-scale human feedback to fine-tune model generation (see https://cgnarendiran.github.io/blog/chatgpt-future-of-conversational-ai/). Meanwhile, IRL has been widely adopted in human-in-the-loop networking and systems, such as autonomous driving assistance [4]. We conclude that IRL offers the following benefits.

• Environment Exploration: IRL provides a means to break the information asymmetry and explore complex or adversarial environments. By leveraging the inferred reward functions, agents are not only guided toward optimal policies but are also encouraged to explore uncharted territories within the environment. For instance, users collect malicious servers' behaviors to infer their objectives and patterns, thus taking corresponding defenses [3].

• Behavior Understanding: By inferring reward functions from observed expert behaviors, IRL offers profound insights into the underlying motivations of agents, enabling a deeper comprehension of complex behaviors. For example, human driving patterns are complicated since the actions taken by drivers under different traffic conditions are based on their empirical experience, driving skills, and preferences. Leveraging DNNs to represent such non-linear features, IRL can learn a reward function that aligns with the driving behaviors (see https://meixinzhu.github.io/project/irl/).
Fig. 1: The fundamentals of IRL. (a): The cases of defining reward functions. (b): The illustration of MDP. (c): The workflow of the DRL algorithm. (d): The workflow of IRL. Note that we illustrate four representative approaches to infer rewards, namely maximum margin, maximum entropy, GAIL, and RLHF.
1) Basic Idea: The fundamental goal of DRL is to derive an optimal policy that maximizes the cumulative reward in an MDP setup [1]. Through iterative interaction with the environment, an agent learns to refine its decision-making strategy based on received feedback. DNNs play a crucial role in this process, approximating the functions that predict the future rewards of actions, thereby guiding the agent toward optimal decision-making.

2) General Architecture: As shown in Fig. 1(c), the general DRL framework consists of several essential components:

• Agent: The decision-maker that interacts with the environment, i.e., taking actions, transiting to the next state, and receiving rewards.
• Environment: The system that defines the state and action spaces and allows agents to perform the MDP.
• Policy Network: A neural network that maps states to actions, defining the agent's behavior.
• Value Network: A neural network that estimates future rewards from states or state-action pairs.
• Replay Buffer: A repository of numerous past trajectories, enabling the agent to reuse historical training data and reduce training costs.
• Optimizer: The mechanism that adjusts neural network parameters to maximize expected cumulative rewards under the current policy.

3) Algorithm Process: To illustrate the DRL process, we show an example of leveraging Proximal Policy Optimization (PPO) [6] to offload tasks. In this case, the agent, action, and state refer to the user, the selection of the task offloading server, and the edge network, respectively. In addition, the reward can be defined by fusing multiple factors, such as latency and success rate, thereby indicating user QoE. Afterward, the PPO process is streamlined into three main stages, namely data collection, advantage estimation, and policy optimization. Initially, the agent interacts with the edge network using the current server selection policy and gathers ⟨action, state, reward⟩ records. Such records are buffered to calculate advantage estimates, which assess the relative value of actions. Then, PPO optimizes a specially designed objective function that includes a clipping mechanism. This mechanism prevents the new policy from straying too far from the old one, ensuring small, stable updates and avoiding drastic performance fluctuations. By iteratively cycling through these stages, PPO gradually fine-tunes the policy, striking a balance between exploring new strategies and exploiting known rewards, making it adept at navigating complex environments.
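To make the clipping mechanism concrete, the following is a minimal sketch of one clipped PPO update in PyTorch. It is not the exact configuration used in this article: the PolicyValueNet class, the layer sizes, and the loss coefficients are illustrative assumptions, and the advantages are expected to be pre-computed from the buffered ⟨action, state, reward⟩ records (e.g., via generalized advantage estimation).

    import torch
    import torch.nn as nn

    class PolicyValueNet(nn.Module):
        """Shared body with a policy head (e.g., server selection) and a value head."""
        def __init__(self, state_dim, n_actions, hidden=64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
            self.pi_head = nn.Linear(hidden, n_actions)
            self.v_head = nn.Linear(hidden, 1)

        def forward(self, states):
            h = self.body(states)
            dist = torch.distributions.Categorical(logits=self.pi_head(h))
            return dist, self.v_head(h).squeeze(-1)

    def ppo_update(net, optimizer, states, actions, old_log_probs, returns, advantages,
                   clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
        """One clipped-surrogate PPO step over a batch of buffered records."""
        dist, values = net(states)
        ratio = torch.exp(dist.log_prob(actions) - old_log_probs)   # pi_new / pi_old
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        policy_loss = -torch.min(unclipped, clipped).mean()         # clipping keeps updates small
        value_loss = (returns - values).pow(2).mean()               # value head regression
        loss = policy_loss + value_coef * value_loss - entropy_coef * dist.entropy().mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Iterating data collection, advantage estimation, and this update realizes the three-stage cycle described above.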
C. Inverse Reinforcement Learning

Despite the success of DRL, in many network scenarios, precisely defining the reward function is impossible. In the above example, although a manually defined QoE can reflect users' preference for low latency and a high success rate, the subjectivity factors are ignored. For instance, users with time-sensitive tasks may emphasize latency, while secure computing tasks should put the success rate at the highest priority. Following biased reward functions, the policy training may deviate from the real optimization objective. Accordingly, IRL is presented to infer such sophisticated reward functions by observing and analyzing users' behaviors [2]. As shown in Fig. 1(d), in the beginning, the users' selections of edge servers in different network states are collected, called expert datasets/trajectories. Afterward, instead of simply learning the action policy, IRL first infers the form of the reward function that best explains the demonstrated behaviors.
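As a data-layout illustration only, an expert trajectory can be stored as an ordered list of ⟨state, action⟩ pairs, and the reward to be inferred can be parameterized by a small DNN over state-action pairs. The class and dimensions below are assumptions for illustration; the actual inference objectives are the algorithms surveyed in Section II-D.

    import torch
    import torch.nn as nn

    # An expert trajectory: the edge-server choices an expert user made in observed network states.
    ExpertTrajectory = list[tuple[torch.Tensor, int]]   # [(state, chosen_server_index), ...]

    class RewardNet(nn.Module):
        """Parameterized reward r(s, a) that IRL fits so expert state-action pairs score highly."""
        def __init__(self, state_dim, n_actions, hidden=64):
            super().__init__()
            self.n_actions = n_actions
            self.net = nn.Sequential(nn.Linear(state_dim + n_actions, hidden), nn.Tanh(),
                                     nn.Linear(hidden, 1))

        def forward(self, states, actions):
            one_hot = nn.functional.one_hot(actions, self.n_actions).float()
            return self.net(torch.cat([states, one_hot], dim=-1)).squeeze(-1)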
Fig. 2: The motivation of applying IRL in NGN, using QoE optimization for task offloading as an example. A: The attackers’
information is hidden from users. B: Subjectivity factors can hardly be represented precisely. C: The immediate reward of each
action is unobserved. Action 1 and 1’ choose servers H and G for task offloading, respectively. D: SAGIN communications
involve numerous physical factors from diverse devices, causing huge difficulty in QoE modeling. E: The human driving
behaviors contain complicated physiological processes that can hardly be described by manually designed rewards.
Moreover, it can optimize the agent's policy towards closely aligning with or mimicking the expert policy according to the inferred reward function. Given the strong representation and learning ability, the reward function is generally inferred by a DNN.

D. Evolution of Inverse Reinforcement Learning

The inference of reward functions can be implemented following different approaches. As illustrated in Fig. 1(d), we introduce four representative and widely adopted IRL algorithms, from basic maximum margin to advanced generative imitation learning.
1) Maximum Margin: This algorithm [4] focuses on distinguishing the expert's policy from all other policies by a margin that depends on the difference in the accumulated reward. It operates under the principle that the correct reward function should make the expert behavior appear significantly better than all other possible behaviors. This method is particularly useful in scenarios where clear demarcation between optimal and suboptimal policies is possible.
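Under the common simplification that the reward is linear in hand-crafted features, the margin idea can be sketched as below. This is a toy, perceptron-style variant under assumed inputs (expert and candidate-policy feature expectations), not a full apprenticeship-learning implementation.

    import numpy as np

    def max_margin_reward(mu_expert, mu_candidates, lr=0.1, steps=500):
        """Find weights w of a linear reward r = w . phi such that the expert's feature
        expectations are scored above every candidate policy's by a margin."""
        w = np.zeros_like(mu_expert, dtype=float)
        for _ in range(steps):
            # candidate policy that currently looks best under w (hardest competitor)
            worst = max(mu_candidates, key=lambda mu: float(w @ mu))
            if w @ mu_expert - w @ worst < 1.0:      # hinge condition violated
                w = w + lr * (mu_expert - worst)      # push the expert above the competitor
            norm = np.linalg.norm(w)
            if norm > 0:
                w = w / norm                          # keep the weights bounded
        return w

Here mu_expert and mu_candidates would be empirical feature expectations estimated from the expert trajectories and from rollouts of competing policies, respectively.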
2) Maximum Entropy: This method [4] introduces the concept of entropy to address the ambiguity in the reward function that could explain observed behaviors. It assumes that among all possible reward functions, the one that leads to the observed behavior while also maximizing entropy, or in other words, promoting diverse but consistent behaviors, is the most appropriate. Compared with maximum margin, this approach is adept at handling the inherent uncertainty in determining why an agent acts a certain way by favoring reward functions that support a wide range of plausible actions.
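A minimal sketch of the maximum-entropy idea, again assuming a linear reward over features: each update moves the reward weights so that the feature statistics of the entropy-regularized policy induced by the current reward match those of the expert. The rollout routine is an assumed callable supplied by the caller, not part of any particular library.

    import numpy as np

    def maxent_irl_step(w, expert_features, rollout_features_fn, lr=0.05):
        """One gradient step of maximum-entropy IRL with a linear reward r = w . phi.
        expert_features: array [N, K] of features along expert trajectories.
        rollout_features_fn: callable returning features [M, K] visited by the soft-optimal
        (entropy-maximizing) policy under the current reward weights w."""
        grad = expert_features.mean(axis=0) - rollout_features_fn(w).mean(axis=0)
        return w + lr * grad   # match expert feature expectations while staying maximally uncertain

Repeating this step until the two feature expectations agree yields the least-committal (highest-entropy) reward consistent with the demonstrations.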
3) Generative Imitation Learning: Recall that reward functions serve as an indicator to guide policy refinement. Apart from inferring an explicit reward value for each action and maximizing the reward expectation, the policy can be refined in an imitation manner. For instance, GAIL [4] leverages the Generative Adversarial Networks (GANs) framework, where a DNN-based generator aims to approach the expert behavior. Meanwhile, another DNN acts as the discriminator, trying to differentiate between the expert's actions and those of the generator. The two modules are trained alternately, i.e., freezing one module and adjusting the parameters of the other module to optimize its behavior imitation/differentiation performance. Afterward, the generator can effectively mimic the expert policy and solve MDP problems without requiring explicit reward inference. Accordingly, the computational efficiency can be significantly improved.
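The alternating GAIL updates can be sketched as follows. The discriminator architecture, the REINFORCE-style generator update (GAIL commonly uses TRPO or PPO instead), and all tensor shapes are simplifying assumptions.

    import torch
    import torch.nn as nn

    class Discriminator(nn.Module):
        """D(s, a): probability that a state-action pair comes from the expert."""
        def __init__(self, state_dim, n_actions, hidden=100):
            super().__init__()
            self.n_actions = n_actions
            self.net = nn.Sequential(nn.Linear(state_dim + n_actions, hidden), nn.Tanh(),
                                     nn.Linear(hidden, 1))

        def forward(self, states, actions):
            one_hot = nn.functional.one_hot(actions, self.n_actions).float()
            return torch.sigmoid(self.net(torch.cat([states, one_hot], dim=-1))).squeeze(-1)

    def gail_step(disc, d_opt, pi_opt, expert_s, expert_a, gen_s, gen_a, gen_log_probs):
        """Alternate the two modules: train D to separate expert pairs from generated ones,
        then reinforce the generator with the discriminator-derived reward -log(1 - D)."""
        bce = nn.BCELoss()
        d_loss = bce(disc(expert_s, expert_a), torch.ones(len(expert_a))) + \
                 bce(disc(gen_s, gen_a), torch.zeros(len(gen_a)))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

        with torch.no_grad():
            reward = -torch.log(1.0 - disc(gen_s, gen_a) + 1e-8)   # high when D is fooled
        # gen_log_probs must be log pi(a|s) of the generated actions, with gradients attached
        pi_loss = -(gen_log_probs * reward).mean()
        pi_opt.zero_grad()
        pi_loss.backward()
        pi_opt.step()
        return d_loss.item(), pi_loss.item()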
4) Reinforcement Learning with Human Feedback: In many cases, the reward function is unavailable since reward values are given by humans. Human perception and evaluation is a complex physiological process and is affected by various subjective factors such as preference, personality, environment, and strictness. As a variant of IRL, RLHF combines the RL principle with direct feedback from humans, integrating subjective insights into the learning process. Specifically, a reward model is first trained based on action-score pairs, where the scores are manually annotated by humans. Then, DRL can be leveraged for policy optimization, ensuring the policy aligns with human preferences.
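A minimal sketch of the first RLHF stage described above: a reward model is regressed onto human-annotated action-score pairs, and the resulting network is then handed to a DRL algorithm (such as the PPO update in Section II-B) as its reward function. Feature dimensions and training settings are assumptions.

    import torch
    import torch.nn as nn

    class HumanRewardModel(nn.Module):
        """Reward model fitted to human-annotated scores of candidate actions/outputs."""
        def __init__(self, feat_dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

        def forward(self, action_features):
            return self.net(action_features).squeeze(-1)

    def fit_reward_model(model, action_features, human_scores, epochs=200, lr=1e-3):
        """Regress the model onto <action, score> pairs annotated by humans."""
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            loss = nn.functional.mse_loss(model(action_features), human_scores)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return model

Large-scale systems such as ChatGPT instead learn from pairwise preference comparisons, but the principle of replacing an inaccessible reward with a learned model is the same.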
Summary of representative studies applying IRL in networking (grouped by motivation):

Infer unobserved rewards:
• Snow et al. [5], coordination detection in cognitive radar networks. Challenge: the collaborative radars contain multiple entities, each of which has its own utility. Approach: apply IRL to infer the radars' individual utilities. Result: the utility functions are inferred with high accuracy.
• Parras et al. [3], inferring attackers' objectives and defending IoT attacks. Challenge: the objective of the attackers is unknown, causing difficulty in defining reward functions. Approach: utilize GAIL to infer the attackers' objectives and design the defense. Result: the defense can effectively address the attacks.
• Wang et al. [8], full-dimension task offloading in edge computing. Challenge: the consequence of each step is unknown until the result of the entire B&B is obtained. Approach: adopt GIRL to infer the reward of each B&B step. Result: keep at least 80% accuracy for guiding B&B.

Overcome complex network environments:
• Li et al. [9], QoS prediction in dynamic and time-variant networks. Challenge: the network is dynamic and time-variant; the projection between routes and QoS is vague. Approach: apply IRL to predict QoS in dynamic networks. Result: reduce the QoS prediction error by up to 31.3%.
• Konar et al. [10], load balancing in communication networks. Challenge: different factors contribute to QoE to varying degrees when forming the reward function. Approach: present trajectory extrapolation for QoE modeling. Result: achieve a 32% improvement in load balancing.
• Shamsoshoara et al. [11], trajectory scheduling and power allocation in UAV networks. Challenge: numerous physical factors and processes affect the UAV, e.g., the UAV's speed and power. Approach: develop IRL to capture UAV-related factors. Result: realize much higher power allocation efficiency.

Guide policy optimization:
• Zhang et al. [12], power allocation optimization in cellular networks. Challenge: a learned reward can better guide the agent to mimic the expert. Approach: utilize maximum entropy to guide agent training. Result: only 1% performance loss compared with the expert.
• Tian et al. [13], sum rate maximization in multi-cell multi-user networks. Challenge: the same as [12]. Approach: utilize GAIL to guide agent training. Result: achieve a 20% gain compared with DRL.
• Tian et al. [14], power usage minimization in MISO networks. Challenge: the same as [12]. Approach: utilize GAIL to guide policy optimization. Result: reduce power by 50% compared with the DRL results.
branching order of variables cannot be known in advance. That is because the B&B algorithm solves the problem by breaking it down into smaller sub-problems and using a bounding function to eliminate sub-problems with limited upper bounds, forming a tree-structured solution space. In this case, we cannot know which variable to branch on to generate a smaller enumeration tree. Therefore, the authors present a Graph-based IRL (GIRL) method, which uses a Graph Neural Network (GNN) to infer the immediate reward of each action based on the structure of the tree.

Finally, the concept of IRL has been widely adopted in human-in-the-loop networks. For instance, the training of ChatGPT, the most famous multimodal AIGC model, involves self-reinforcement from large-scale human feedback, thereby aligning the model output with human preferences.

2) Overcome Complex Network Environments: Owing to complex network environments, even if manually defining some vague reward function is possible, it can hardly precisely represent the environmental change caused by the agent's actions or efficiently indicate the desirability. In [9], the authors presented an IRL-based approach to predict QoS in dynamic and time-variant networks. Due to large state spaces and complex projections between routes and QoS values, it is difficult to define a precise reward function artificially. Likewise, Konar et al. [10] presented a trajectory extrapolation algorithm to model the quality of experience (QoE) for communication load balancing. Particularly, they consider the difficulty of gathering expert trajectories in practice. Hence, their proposed algorithm first sorts the gathered trajectories according to some objective and well-established metrics. Then, the trajectories are sampled to facilitate the reward inference and policy learning. In UAV networks, Shamsoshoara et al. [11] developed an interference-aware joint path planning and power allocation scheme to minimize the interference that UAVs cause to the existing ground user equipment. IRL is applied to capture all physical factors, including the UAV's transmission power, terrestrial user density, imposed interference, and the UAV's task completion capability, thereby inferring the optimization objective.

3) Guide Policy Optimization: In some cases where the optimal trajectories are available, directly inferring the reward function that facilitates the agent to mimic the expert can lead to the best policy optimization efficiency. For instance, Zhang et al. [12] adopted maximum entropy to optimize the power allocation schemes in multi-user cellular networks. Experiments show that IRL greatly outperforms manually defined reward functions based on objective metrics, such as the weighted minimum mean square error. Similarly, Tian et al. [13] applied this principle to multi-cell multi-user networks to maximize the sum rate. Furthermore, in [14], they adopted the GAIL method to minimize power usage in multiple-input-single-output networks with multiple users. Compared with conventional DRL, the proposed method can reduce power consumption by over 50%.

Lesson Learned: From the above brief survey, we learn that IRL enhances the capabilities of DRL by introducing a reward inference step. Owing to DNNs' stronger representation and learning capabilities compared with humans, IRL can well handle complex environments, actions, and reward perception processes (such as human perception of AI-generated images). Such advantages make IRL inherently suitable for scenarios that are complex yet for which expert trajectories exist for demonstration. Nonetheless, IRL also exhibits certain drawbacks, such as the additional computational costs for reward inference and the dependency on expert trajectories, which are subject to further research.

IV. CASE STUDY: HUMAN-ORIENTED PROMPT ENGINEERING IN GENERATIVE AI-ENABLED NETWORKS

To illustrate how to solve practical NGN problems by IRL, this section performs a case study about human-oriented prompt engineering. Particularly, we simultaneously present the DRL- and IRL-based solutions to the same task, helping readers compare these two methods. The effectiveness of DRL and IRL is also discussed.
Fig. 3: The illustration of our case study. A: The system model of the GAI-enabled network. B: The efficacy of prompt engineering. We can observe that the image generated from the crafted prompt excels in object rendering and image composition. C: The illustration of DRL-based prompt engineering. D: The illustration of IRL-based prompt engineering.

A. System Model

With the rapid advancement of GAI, generative tasks, such as text-to-image generation, automatic conversation, and 3D avatar rendering, play an important role in NGN. As shown in Fig. 3(A), we consider a GAI-enabled network with users and service providers. Users describe their requests in natural language (so-called prompts) and submit them to the service providers. Operating professional GAI models, the service providers perform generative inference and generate the required content for users.

Nonetheless, due to the lack of professionality/experience, user prompts may suffer from information insufficiency and ambiguity, causing low generation quality. To this end, service providers can strategically craft prompts that maintain semantic equivalence with the raw prompts while being informative, clear, and suitable for the specific GAI model [15]. Such a process is called prompt engineering [15]. Taking text-to-image generation as an example, Fig. 3(B) depicts the impact of prompt engineering on the generated image. We can observe that the image generated by the crafted prompt outperforms in exquisiteness and composition. Despite such advantages, service providers may own multiple prompt engineering approaches since the prompts are crafted from open vocabularies. In this case, how to select the optimal prompt engineering approach according to the specific request is a non-trivial problem.

B. DRL- and IRL-based Prompt Engineering

1) DRL Workflow: First, we solve this problem following the DRL paradigm. Specifically, we adopt the PPO algorithm discussed in Section II-B. As shown in Fig. 3(C), the service provider acts as the agent, whose actions include all available operations to refine raw prompts. The raw prompts from users are accommodated by the environment. Finally, the reward evaluates the efficacy of applying the selected prompt engineering operation on the current raw prompt. To this end, we leverage ImageReward (https://github.com/THUDM/ImageReward), a learning-based metric for measuring the aesthetic quality of AI-generated images, as the reward function. Correspondingly, the quality score of the image generated from the crafted prompt is utilized as the reward. The DRL architecture and algorithm workflow follow the description in Section II-B, with the goal of optimizing the policy for selecting prompt engineering operations.
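The DRL setup above can be summarized by the following environment sketch. The apply_operation, generate_image, and score_image helpers are hypothetical stand-ins for the prompt-refinement operations, the Stable Diffusion call, and the ImageReward metric, respectively; they are assumptions for illustration, not the interfaces of our implementation.

    import random

    PROMPT_OPS = ["op_1", "op_2", "op_3"]   # placeholder names for the available refinement operations

    class PromptEngineeringEnv:
        """The service provider (agent) picks one refinement operation for the current raw prompt;
        the reward is the aesthetic score of the image generated from the crafted prompt."""
        def __init__(self, raw_prompts, apply_operation, generate_image, score_image):
            self.raw_prompts = raw_prompts
            self.apply_operation = apply_operation   # (raw_prompt, op) -> crafted prompt
            self.generate_image = generate_image     # crafted prompt -> image
            self.score_image = score_image           # (prompt, image) -> quality score

        def reset(self):
            self.prompt = random.choice(self.raw_prompts)   # state: the raw prompt to refine
            return self.prompt

        def step(self, action_index):
            crafted = self.apply_operation(self.prompt, PROMPT_OPS[action_index])
            image = self.generate_image(crafted)
            reward = self.score_image(crafted, image)       # ImageReward-style quality score
            return crafted, reward, True                    # one refinement per episode

A PPO agent then maximizes this reward by learning which operation to select for each kind of raw prompt.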
2) IRL Workflow: Then, we apply IRL to solve the same problem. Different from DRL, as shown in Fig. 3(D), IRL includes the following three steps.

• Expert Dataset Collection: First, we construct an expert dataset to indicate the human preference for AI-generated images. Particularly, inspired by the outstanding capability of Large Language Models (LLMs) in multimodal understanding [15], we leverage an LLM-empowered agent to mimic real users. Then, we randomly compose 50 raw prompts and perform all the available prompt engineering operations on each of them. The agent is utilized to evaluate the generated images by scores. Note that we adopt two strategies for training the LLM, ensuring it generates rational scores. First, we instruct the LLM to act as an experienced user, recalling its pre-trained knowledge of image quality assessment. Additionally, we feed ten materials regarding computer vision, image generation, and painting to the LLM, thus enhancing its expertise. All the ⟨raw prompt, crafted prompt, score⟩ pairs construct the expert dataset, showing which kind of prompt engineering is preferred by humans for the specific image generation task.

• Policy Optimization: Different from defining rewards via an existing metric, IRL utilizes DNNs to infer reward functions that align with expert decisions. To improve the learning efficiency, we adopt GAIL, explained in Section II-D. Specifically, the discriminator distinguishes between the selections of the expert prompt engineering policy and those of the agent's policy, thereby self-calibrating rewards to align with the expert decision-making mode. Meanwhile, the generator aims to learn a prompt engineering policy that mimics the expert behaviors, driven by feedback from the discriminator. In this way, the generator can gradually imitate humans in judging AI-generated images and perform prompt engineering accordingly.

• MDP Design: Finally, we configure the MDP for IRL. Similar to DRL, the action space contains all candidate prompt engineering operations.
The rewards are represented by the LLM-assigned scores of the image generated by the crafted prompt. This allows GAIL to evaluate the efficacy of each action by evaluating its impact on the resulting image quality. Unlike traditional DRL, which relies on manually designed reward functions, GAIL leverages the discriminator network to infer rewards from our expert dataset.
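To connect the three steps, the sketch below shows one way the scored ⟨raw prompt, crafted prompt, score⟩ records can be turned into expert demonstrations for GAIL. The craft and llm_score helpers, and the choice of keeping only the top-scoring operation per prompt, are assumptions for illustration rather than the exact procedure used in our experiments.

    def build_expert_dataset(raw_prompts, operations, craft, llm_score):
        """For each raw prompt, keep the operation whose crafted prompt the LLM-empowered
        agent scores highest; the resulting (raw prompt, operation) pairs act as expert
        state-action demonstrations."""
        expert_pairs = []
        for raw in raw_prompts:
            scored = [(op, llm_score(raw, craft(raw, op))) for op in operations]
            best_op, _ = max(scored, key=lambda item: item[1])
            expert_pairs.append((raw, best_op))   # state = raw prompt, action = preferred operation
        return expert_pairs

These pairs play the role of the expert state-action batch in the GAIL step sketched in Section II-D: the discriminator labels them as expert, labels the generator's selections as generated, and its output becomes the inferred reward that the prompt engineering policy maximizes.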
C. Experiments

1) Experimental Setting: We take text-to-image generation as an example. The service providers adopt Stable Diffusion v2 to generate images. Raw prompts from users take the form of "A [A] with [B]", e.g., "a city with a car" and "a garden with a fountain." Leveraging the GPT-4 model, seven kinds of operations to craft prompts can be performed on each raw prompt, as discussed in [15]. The LLM-based agent is also implemented by GPT-4. For IRL, the discriminator consists of a four-layer fully connected DNN with two intermediate layers, each containing 100 neurons. Meanwhile, the generator, containing the actor and critic, employs a similar four-layer fully connected DNN architecture with 64 neurons in each of the two hidden layers. The hyperbolic tangent function serves as the activation mechanism for all hidden layers. The learning rate and gamma are set to 3 × 10^-4 and 0, respectively.

Fig. 4: The training curves of DRL and IRL.

2) Result Analysis: Fig. 4 shows the trend of the cumulative rewards of different algorithms during the training process. It shows that as the number of episodes increases, the proposed IRL method achieves good convergence and significantly outperforms the baseline, i.e., a service provider that selects prompt engineering operations randomly. Meanwhile, DRL also converges rapidly. With well-trained DRL and IRL policies, Fig. 5 evaluates their efficacy in selecting prompt engineering operations. Note that the quality scores are also assessed by the LLM-empowered agent. As shown in Fig. 5, IRL can increase the image quality by 0.33 on average, while DRL can only achieve a 0.1 increment. This is because IRL is trained on expert datasets, which can effectively indicate the human preference for assessing AI-generated images. Consequently, the prompt engineering operations selected by IRL can better align with human desire, resulting in high-score images.

Fig. 5: The efficacy of DRL and IRL in prompt engineering. Each case corresponds to one randomly selected raw prompt (Case 1: a mountain with a stone castle; Case 2: a garden with a bench; Case 3: a city with a blue car).

Discussion: Thanks to the increased image quality, the service experience of humans can be improved drastically. Meanwhile, the re-generation caused by unqualified outputs can be reduced, which greatly decreases service latency and bandwidth consumption [15].

V. FUTURE DIRECTIONS

A. Mixture-of-Experts

Recall that IRL heavily relies on expert trajectories, while in some complex scenarios, collecting the optimal trajectories that achieve the objective with the lowest cost is impossible. Instead, there may only exist multiple locally optimal trajectories owned by distributed experts, each of which reaches the optimum in certain aspects/stages. To this end, the mixture-of-experts principle can be leveraged, which utilizes a learning-based gating network to dynamically select/combine different trajectories in different training stages.

B. IRL with Human Feedback

Human feedback has been successfully integrated into DRL (e.g., ChatGPT) to make policies align with human preferences. Likewise, humans can participate in the IRL process to further enhance the expert dataset/human annotation. For example, in our case study, the efficient calibration of inferred rewards based on human feedback is worth researching, thereby ensuring the reward function represents the real human judgment of AI-generated images precisely.

C. Security of IRL

The reliance on expert trajectories may cause security issues since attackers can easily mislead policy training by data poisoning. To this end, strict access control and privacy protection are urgently required for deploying IRL in practical NGN scenarios. Zero-trust can be a potential technique to dynamically manage data access and usage, thereby preventing privacy leakage.
VI. CONCLUSION

In this article, we have explored the applications of IRL in NGN. Specifically, we have comprehensively introduced the IRL technique, including its fundamentals, representative algorithms, and advantages. Then, we have discussed the vision of NGN, as well as the motivations for adopting IRL. Afterward, we have surveyed existing literature about IRL proposals for solving networking problems. Furthermore, we have performed a case study on human-centric prompt engineering in GAI-enabled networks, comparing the workflow and effectiveness of both DRL and IRL. Finally, the future directions to promote the further development of IRL in NGN have been summarized.

REFERENCES

[1] H. Du, R. Zhang, Y. Liu, J. Wang, Y. Lin, Z. Li, D. Niyato, J. Kang, Z. Xiong, S. Cui, B. Ai, H. Zhou, and D. I. Kim, "Beyond deep reinforcement learning: A tutorial on generative diffusion models in network optimization," arXiv preprint arXiv:2308.05384, 2023.
[2] G. Wang, P. Cheng, Z. Chen, W. Xiang, B. Vucetic, and Y. Li, "Inverse reinforcement learning with graph neural networks for IoT resource allocation," in ICASSP, 2023, pp. 1–5.
[3] J. Parras, A. Almodóvar, P. A. Apellániz, and S. Zazo, "Inverse reinforcement learning: A new framework to mitigate an intelligent backoff attack," IEEE Internet of Things Journal, vol. 9, no. 24, pp. 24790–24799, 2022.
[4] S. Arora and P. Doshi, "A survey of inverse reinforcement learning: Challenges, methods and progress," Artificial Intelligence, vol. 297, pp. 103500–103548, 2021.
[5] L. Snow, V. Krishnamurthy, and B. M. Sadler, "Identifying coordination in a cognitive radar network - a multi-objective inverse reinforcement learning approach," in ICASSP, 2023, pp. 1–5.
[6] R. Zhang, K. Xiong, Y. Lu, P. Fan, D. W. K. Ng, and K. B. Letaief, "Energy efficiency maximization in RIS-assisted SWIPT networks with RSMA: A PPO-based approach," IEEE Journal on Selected Areas in Communications, vol. 41, no. 5, pp. 1413–1430, 2023.
[7] V. S. Vaishnavi, Y. M. Roopa, and P. L. Srinivasa Murthy, "A survey on next generation networks," in ICCNCT, 2020, pp. 162–171.
[8] G. Wang, P. Cheng, Z. Chen, B. Vucetic, and Y. Li, "Inverse reinforcement learning with graph neural networks for full-dimensional task offloading in edge computing," IEEE Transactions on Mobile Computing, pp. 1–18, 2023.
[9] J. Li, H. Wu, Q. He, Y. Zhao, and X. Wang, "Dynamic QoS prediction with intelligent route estimation via inverse reinforcement learning," IEEE Transactions on Services Computing, pp. 1–18, 2023.
[10] A. Konar, D. Wu, Y. T. Xu, S. Jang, S. Liu, and G. Dudek, "Communication load balancing via efficient inverse reinforcement learning," in ICC, 2023, pp. 472–478.
[11] A. Shamsoshoara, F. Lotfi, S. Mousavi, F. Afghah, and I. Guvenc, "Joint path planning and power allocation of a cellular-connected UAV using apprenticeship learning via deep inverse reinforcement learning," arXiv preprint arXiv:2306.10071, 2023.
[12] R. Zhang, K. Xiong, X. Tian, Y. Lu, P. Fan, and K. B. Letaief, "Inverse reinforcement learning meets power allocation in multi-user cellular networks," in INFOCOM WKSHPS, 2022, pp. 1–2.
[13] X. Tian, K. Xiong, R. Zhang, P. Fan, D. Niyato, and K. B. Letaief, "Sum rate maximization in multi-cell multi-user networks: An inverse reinforcement learning-based approach," IEEE Wireless Communications Letters, vol. 13, no. 1, pp. 4–8, 2024.
[14] X. Tian, B. Gao, M. Liu, K. Xiong, P. Fan, and K. B. Letaief, "IRL-PM: An inverse reinforcement learning-based power minimization in multi-user MISO networks," in ICCCS, 2023, pp. 332–336.
[15] Y. Liu, H. Du, D. Niyato, J. Kang, S. Cui, X. Shen, and P. Zhang, "Optimizing mobile-edge AI-generated everything (AIGX) services by prompt engineering: Fundamental, framework, and case study," IEEE Network, pp. 1–1, 2024.