Dependency-Aware CAV Task Scheduling Via Diffusion-Based Reinforcement Learning
Abstract—In this paper, we propose a novel dependency-aware task scheduling strategy for dynamic unmanned aerial vehicle-assisted connected autonomous vehicles (CAVs). Specifically, different computation tasks of CAVs, each consisting of multiple dependent subtasks, are judiciously assigned to nearby CAVs or the base station so that tasks are completed promptly. We formulate a joint scheduling-priority and subtask-assignment optimization problem with the objective of minimizing the average task completion time. The problem aims at improving long-term system performance and is reformulated as a Markov decision process. To solve the problem, we further propose a diffusion-based reinforcement learning algorithm, named Synthetic DDQN based Subtasks Scheduling, which can make adaptive task scheduling decisions in real time. A diffusion model-based synthetic experience replay is integrated into the reinforcement learning framework to generate sufficient synthetic data in the experience replay buffer, thereby significantly accelerating convergence and improving sample efficiency. Simulation results demonstrate the effectiveness of the proposed algorithm in reducing task completion time compared with benchmark schemes.

I. INTRODUCTION

With advancements in communication and autonomous driving technologies, connected autonomous vehicles (CAVs) have become increasingly prevalent, catering to people's traffic demands [1]. CAVs have to execute various computation-intensive and delay-sensitive tasks, including perception fusion, real-time navigation based on video or Augmented Reality (AR), and multimedia entertainment [2]. These tasks necessitate joint processing to guarantee safe driving while satisfying the quality of service [3].

Due to the limited computing resources of CAVs, task completion time is inevitably prolonged when multiple tasks need to be processed simultaneously. To minimize the task completion time, some studies [4] [5] used mobile edge computing (MEC) to directly offload the entire task to the base station (BS) for fast processing through the vehicle-to-infrastructure (V2I) link, which may extend completion time due to the additional task transmission. Thus, other works [6] [7] proposed partial offloading schemes, in which one part of the task is assigned to the local vehicle while the remainder is offloaded to edge servers. Although such schemes reduce transmission delay, it remains challenging to guarantee efficient task scheduling due to the highly dynamic vehicular environment. Consequently, vehicle edge computing (VEC)-based task scheduling is gradually emerging: other vehicles with available computing resources act as extensions of the edge and are termed service vehicles (SVs) [8], so the tasks generated by task vehicles (TVs) can be offloaded to nearby service vehicles. Furthermore, fine-grained vehicular task partitioning and scheduling accelerates task completion [9] [10], since tasks can be divided into several dependent subtasks, modelled as a directed acyclic graph (DAG) [11] that describes subtask interdependency, and offloaded to other SVs or the BS. Existing research illuminates the advantages of the Internet of Vehicles (IoV) combined with VEC task offloading. However, due to geographical limitations, placing many BSs along a highway may not be economically feasible [12]. Thus, for tasks with diverse computation requirements, optimizing task scheduling with high-mobility SVs and limited BS servers is a challenging problem.

In this paper, we investigate the task scheduling problem for highway CAVs. TVs offload subtasks to SVs with available computing resources, but time-varying task computation requirements and computing resources make it challenging to make adaptive task scheduling decisions in real time, and completion time is prolonged when the required computing resources of a subtask exceed the computing capacity of the SVs. Owing to its flexibility and easy deployment, an unmanned aerial vehicle (UAV) can be deployed to relay subtasks to the surrounding BS server, compensating for the offloading needs of TVs when the number of SVs is insufficient or subtasks have a higher workload. Firstly, with the goal of on-demand task scheduling, we construct a task scheduling model, i.e., two-side priority adjustment, for determining the scheduling priority of subtasks while considering mobility and resources for selecting optimal offloading targets. Secondly, we formulate a long-term subtask scheduling problem that minimizes the average completion time of all tasks, and the deep reinforcement learning (DRL)-driven Synthetic DDQN based Subtasks Scheduling (SDSS) algorithm is proposed to solve the problem in a dynamic environment. Thirdly, simulation results demonstrate the effectiveness of the proposed algorithm in reducing task completion time. The main contributions of this paper are summarized as follows:

• We design a dependency-aware task scheduling strategy to reduce task completion time for CAV networks;
• We formulate a long-term optimization problem for minimizing the average completion time of all tasks and then reformulate it into a Markov decision process (MDP);
• We propose the SDSS algorithm, which combines DDQN with a diffusion model to make adaptive task scheduling decisions in real time.
In this section, we describe the long-term subtasks scheduling and offloading problem as an MDP and then propose the SDSS algorithm to solve it, in which the DDQN framework is combined with a diffusion model-based synthetic experience replay (SER) module.

Fig. 3. Model structure of the SDSS algorithm.

A. MDP Formulation

The optimization problem P is described as an MDP four-tuple {S, A, P, R}, including state S, action A, transition probability P, and reward R. The main components are as follows.

1) State: The state is denoted as the combination of subtask information, scheduling decisions, and offloading target information. The state for subtask w is denoted as
$S_w = \{I_w, a_w, O_r\}$,   (11)

where I_w denotes the information of subtask w, including the workload ω_m^w and the interdependency, i.e., the indicators of the predecessor and successor subtasks Pre_w and Suc_w; a_w denotes the scheduling decision of the subtask; and O_r is the offloading target information, including the available computing resources of the SVs and the BS server, f_s and f_j, and the distance between the TV and the SVs.
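As an illustration of Eq. (11), the following minimal Python sketch assembles the per-subtask state vector from the three components I_w, a_w, and O_r; the concrete field layout and dimensions are assumptions for exposition rather than the implementation used in this work.

```python
import numpy as np

def build_state(workload, pre_id, suc_id, sched_decision,
                sv_resources, bs_resource, tv_sv_distances):
    """Assemble the per-subtask state S_w = {I_w, a_w, O_r} of Eq. (11).

    workload        -- subtask workload (e.g., in KB), part of I_w
    pre_id, suc_id  -- indices of predecessor/successor subtasks, part of I_w
    sched_decision  -- current scheduling decision a_w (encoded as a float)
    sv_resources    -- available computing resources of the SVs (GHz), part of O_r
    bs_resource     -- available computing resource of the BS server (GHz), part of O_r
    tv_sv_distances -- distances between the TV and each SV (m), part of O_r
    """
    i_w = np.array([workload, pre_id, suc_id], dtype=np.float32)
    a_w = np.array([sched_decision], dtype=np.float32)
    o_r = np.concatenate([np.asarray(sv_resources, dtype=np.float32),
                          [bs_resource],
                          np.asarray(tv_sv_distances, dtype=np.float32)])
    return np.concatenate([i_w, a_w, o_r])

# Example: 5 SVs, one 2000 KB subtask whose predecessor is subtask 0.
s_w = build_state(2000.0, 0, 2, 0.0,
                  sv_resources=[2.0, 4.0, 6.0, 8.0, 3.0],
                  bs_resource=50.0,
                  tv_sv_distances=[120.0, 80.0, 200.0, 150.0, 60.0])
print(s_w.shape)  # (15,)
```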
2) Action: Each subtask w of vehicle task m can be computed on the local TV n or offloaded to SV s or the BS server j. For a whole vehicle task, the scheduling action of each subtask can be represented as

$a_w = \{\alpha_w, B_w\}$,   (12)

where α_w is the scheduling priority of the subtask and B_w = {x_{w,s}, y_{w,j}, z_{w,n}} is an array denoting the selection among the offloading targets.
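A minimal sketch of the action encoding in Eq. (12) is given below: the scheduling priority α_w plus a one-hot selector B_w over the candidate offloading targets (SVs, BS server, local TV). The flat action indexing used to interface with a discrete-action DDQN is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SubtaskAction:
    """Action a_w = {alpha_w, B_w} of Eq. (12)."""
    priority: int              # alpha_w: scheduling priority of the subtask
    target_onehot: List[int]   # B_w: one-hot over [SV_1..SV_S, BS server, local TV]

def decode_action(action_index: int, num_svs: int, num_priorities: int) -> SubtaskAction:
    """Map a flat DDQN action index to (priority, offloading target)."""
    num_targets = num_svs + 2                 # SVs + BS server + local TV
    assert 0 <= action_index < num_priorities * num_targets
    priority = action_index // num_targets    # which priority level
    target = action_index % num_targets       # which offloading target
    onehot = [1 if i == target else 0 for i in range(num_targets)]
    return SubtaskAction(priority=priority, target_onehot=onehot)

# Example: 5 SVs and 3 priority levels give 21 discrete actions in total.
print(decode_action(9, num_svs=5, num_priorities=3))
```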
3) Reward Function: The reward function is designed as the negative increment of delay after a scheduling action is made, which is denoted as

$R(s_w, a_w) = -\Delta\tau_w$,   (13)

where $\Delta\tau_w = \tau_{w+1}^{a_{w+1}} - \tau_w^{a_w}$ is the difference in completion delay between two adjacent subtask scheduling decisions. The goal is to find the optimal decision that obtains the maximum cumulative reward.
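The reward in Eq. (13) rewards decisions that lower the estimated completion delay and penalizes those that raise it. The short sketch below shows how this would be computed during an episode; the delay estimates themselves are assumed to come from the system model.

```python
def reward(prev_completion_delay: float, new_completion_delay: float) -> float:
    """R(s_w, a_w) = -delta_tau_w of Eq. (13), with
    delta_tau_w = tau_{w+1}^{a_{w+1}} - tau_w^{a_w}.

    A scheduling action that increases the estimated task completion
    delay yields a negative reward; one that decreases it is rewarded.
    """
    delta_tau = new_completion_delay - prev_completion_delay
    return -delta_tau

# Example: the previous decision gave 120 ms, the new one gives 95 ms.
print(reward(0.120, 0.095))  # 0.025 -> positive reward, the delay decreased
```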
B. SDSS Algorithm

The model structure of the SDSS algorithm, based on the Synthetic-DDQN, is shown in Fig. 3. The algorithm mainly includes a Q value network, a target Q value network, and an improved experience replay buffer.

On the one hand, the two networks are the Q value network θ, i.e., the policy network, and the target Q value network θ′. The policy network estimates the Q values of all possible actions a_w in state S_w and outputs the action corresponding to the maximum Q value; the action selection process is given by

$a_{max}(s, \theta) = \arg\max_{a'} Q_{estim}(s, a'; \theta)$,   (14)

and then the target Q value network θ′ uses the action with the maximum Q value to calculate the expected value of the next state S_{w+1}; the target Q value is calculated as

$y_{target} = r + \gamma Q\big(s', \arg\max_{a'} Q_{estim}(s', a'; \theta); \theta'\big)$,   (15)

where γ represents the discount factor, which adjusts the weight of long-term rewards and is set within γ ∈ (0.1, 0.99). Subsequently, the loss function is obtained by comparing the target Q value y_target with the output Q_estim(s, a; θ) of the Q network,

$L(\theta) = \mathbb{E}_{s,a,s'}\Big[\tfrac{1}{2}\big(r + \gamma Q\big(s', \arg\max_{a'} Q_{estim}(s', a'; \theta); \theta'\big) - Q_{estim}(s, a; \theta)\big)^2\Big]$.   (16)

This decoupling of action selection and value estimation effectively avoids overestimation of the target Q value, which helps improve the reliability of high-dimensional discrete strategies.
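The following NumPy sketch reproduces the double-DQN computation of Eqs. (14)-(16): the policy network θ selects the greedy next action, the target network θ′ evaluates it, and the squared error between y_target and Q_estim(s, a; θ) gives the loss. The toy linear Q-functions and dimensions are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, NUM_ACTIONS, GAMMA = 15, 21, 0.95

# Toy linear Q-functions standing in for the policy (theta) and target (theta') networks.
theta = rng.normal(scale=0.1, size=(STATE_DIM, NUM_ACTIONS))
theta_target = theta.copy()

def q_values(states, params):
    """Q(s, .; params) for a batch of states."""
    return states @ params

def ddqn_target_and_loss(batch, theta, theta_target, gamma=GAMMA):
    s, a, r, s_next = batch
    # Eq. (14): the policy network selects the greedy action in s'.
    a_star = np.argmax(q_values(s_next, theta), axis=1)
    # Eq. (15): the target network evaluates that action.
    q_next = q_values(s_next, theta_target)[np.arange(len(a_star)), a_star]
    y_target = r + gamma * q_next
    # Eq. (16): 1/2 squared TD error between y_target and Q_estim(s, a; theta).
    q_estim = q_values(s, theta)[np.arange(len(a)), a]
    loss = 0.5 * np.mean((y_target - q_estim) ** 2)
    return y_target, loss

# Example with a random batch of 32 transitions.
batch = (rng.normal(size=(32, STATE_DIM)),
         rng.integers(0, NUM_ACTIONS, size=32),
         rng.normal(size=32),
         rng.normal(size=(32, STATE_DIM)))
_, loss = ddqn_target_and_loss(batch, theta, theta_target)
print(f"loss = {loss:.4f}")
```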
On the other hand, the main purpose of placing the experience replay buffer between the networks and the environment interaction is to store the transitions obtained from the agent's interaction with the environment. These stored transitions (s, a, r, s′), as the exploration experience, provide training data for the networks and give the agent's training samples a certain diversity.

In practice, a large number of ineffective transitions exist in the initial strategy exploration stage, and sampling these data for agent training inevitably hinders convergence. Therefore, to improve the convergence ability and sample efficiency of the algorithm, we use the emerging generative diffusion model [17] to provide both real and synthetic transitions for agent training through diffusion-based transition generation, i.e., the SER module. The two steps of the SER module are as follows.

1) The forward noising: For an original data distribution p(x) with standard deviation σ, where x represents the real transition data, consider the noised distribution p(x; σ) obtained by adding independent and identically distributed Gaussian noise of deviation σ to p(x), i.e., p(x_t | x_0; σ_n). As the noise level grows, the data distribution becomes indistinguishable random noise.
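A minimal sketch of this forward noising step follows: i.i.d. Gaussian noise with standard deviation σ is added to real transitions (s, a, r, s′) flattened into vectors, so that at the largest σ the noised distribution p(x; σ) approaches pure random noise. The geometric σ schedule is an assumption, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigma_schedule(num_steps: int, sigma_min: float = 0.02, sigma_max: float = 10.0):
    """Monotonically increasing noise levels sigma_1 < ... < sigma_T (assumed geometric)."""
    return np.geomspace(sigma_min, sigma_max, num_steps)

def forward_noising(x0, sigmas):
    """Sample x_t ~ p(x_t | x_0; sigma_t) = N(x_0, sigma_t^2 I) for every noise level.

    x0 -- real transitions flattened to vectors, shape (batch, dim)
    Returns a list of progressively noisier copies of x0.
    """
    return [x0 + sigma * rng.standard_normal(x0.shape) for sigma in sigmas]

# Example: 64 real transitions of dimension 32, noised over 10 diffusion steps
# (matching the denoising-step setting in Table I).
x0 = rng.normal(size=(64, 32))
noised = forward_noising(x0, sigma_schedule(10))
print(len(noised), noised[-1].std())  # at sigma_max the data is close to pure noise
```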
Algorithm 1 Proposed SDSS algorithm
Input: The subtask information I_w, offloading target information O_r, data ratio r, and discount factor γ;
Output: The estimation values of the two Q networks;
1: Initialize the experience replay buffer D, the denoising model, and the network parameters θ and θ′;
2: for episode = 1 : M do
3:    Reset the environment and initialize the state space S;
4:    for step = 1 : T do
5:       The agent outputs the Q values, then selects and executes the action argmax_{a′} Q_estim(s, a′; θ);
6:       Complete subtask scheduling and offloading target selection while calculating r_w, and S_w ← S_{w+1};
7:       Store the transition (s, a, r, s′) into D;
8:       Sample real transitions from D into the SER module to update the forward diffusion process;
9:       Generate samples through the reverse denoising process by iterating over the diffusion steps and add them to D, D ← D ∪ D_syn;
10:      Train the agent to make the scheduling policy by sampling from D with ratio r;
11:      Calculate the target Q value using Eq. (15);
12:      Update the Q network θ by gradient descent on the loss function in Eq. (16);
13:      step ← step + 1
14:   end for
15:   episode ← episode + 1
16: end for
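For concreteness, the skeleton below mirrors the control flow of Algorithm 1: interact, store real transitions, let the SER module generate synthetic transitions, and train on a mixture controlled by the data ratio r. The env, agent, and ser objects are hypothetical placeholder interfaces, not the authors' implementation.

```python
import random

def train_sdss(env, agent, ser, num_episodes, steps_per_episode,
               data_ratio=0.5, batch_size=32):
    """Skeleton of Algorithm 1 (SDSS); env, agent, and ser are assumed interfaces."""
    real_buffer, syn_buffer = [], []
    for episode in range(num_episodes):                        # step 2
        state = env.reset()                                    # step 3
        for _ in range(steps_per_episode):                     # step 4
            action = agent.act(state)                          # step 5: argmax_a' Q_estim(s, a'; theta)
            next_state, r, done = env.step(action)             # step 6
            real_buffer.append((state, action, r, next_state)) # step 7
            ser.fit(random.sample(real_buffer,                 # step 8: update forward diffusion
                                  min(len(real_buffer), batch_size)))
            syn_buffer.extend(ser.generate(batch_size))        # step 9: reverse denoising samples
            n_real = int(batch_size * data_ratio)              # step 10: mix real/synthetic with ratio r
            batch = (random.sample(real_buffer, min(n_real, len(real_buffer)))
                     + random.sample(syn_buffer, min(batch_size - n_real, len(syn_buffer))))
            agent.update(batch)                                # steps 11-12: Eqs. (15)-(16)
            state = next_state
            if done:
                break
```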
TABLE I: Simulation parameters.
Parameter | Value
Number of TVs and SVs (N, S) | {2, 5}
Number of subtasks of a single task (W) | {4∼6}
Computing power of TVs and SVs (f_n, f_s) | {2, 2∼8} GHz
Computing power of BS server (f_j) | 50 GHz
Transmit power of vehicle and UAV (P_n, P_u) | {20, 30} dBm
Bandwidth of vehicle and UAV (B_n, B_u) | {5, 10} MHz
Maximum tolerant delay of a single task (τ_m) | 650 ms
Workload size of a subtask (ω_m^w) | {500∼5000} KB
Length of highway section | 1 km
Learning rate and discount factor | {0.001, 0.95}
Experience replay size | 100000
Denoising step | 10

Fig. 4. The algorithm convergence curve.