Traffic Pattern Sharing for Federated Traffic Flow Prediction with Personalization
The FL method FedAvg [15] aggregates model weights sent from local clients on the server and downloads the aggregated model back to the clients. However, the heterogeneity of data among different clients poses a critical challenge. To deal with this problem, FedProx [32] proposes a regularization term that minimizes the discrepancy between local models and the global model, thereby preventing local models from deviating too far from their local training data. FedAtt [33] achieves flexible aggregation through adaptive weights. Besides, FedFed [34] shares performance-sensitive features to mitigate data heterogeneity. Recently, PFL [35]-[39] has become popular for dealing with highly heterogeneous data; it trains a personalized local model suitable for each client in a collaborative manner. Therefore, PFL is usually more effective than traditional FL methods that only learn a single global model [32]-[34]. For example, FedPer [36] shares a common base layer while providing individualized local layers for each client to preserve local knowledge. Additionally, through model-agnostic meta-learning, Per-FedAvg [37] learns a meta-model that generates initial local models for each client in order to improve local performance.
Recently, various FL methods have been developed for spatial-temporal forecasting. For example, FedGRU [17] introduces an ensemble clustering-based FL scheme to capture the spatial-temporal correlation of traffic data. Furthermore, CNFGNN [20] aggregates parameters on the server based on spatial dependencies captured by GNNs. Considering the heterogeneity of spatial-temporal data, some studies aim to enhance model performance by learning personalized models for each client. For instance, FedDA [22] proposes a dual attention scheme that aggregates both intra- and inter-cluster models rather than simply averaging the weights of local models. In FML-ST [23], meta-learning is integrated into the FL scenario to solve the heterogeneity problem in spatial-temporal prediction. Analogously, FUELS [24] incorporates auxiliary contrastive tasks to inject detailed spatial-temporal heterogeneity into the latent representation space. However, the aforementioned methods overlook the common knowledge (e.g., common traffic patterns) within spatial-temporal data, and thus their performance could be limited.
III. PROBLEM DESCRIPTION

In this section, we formally define the setting of our investigated federated TFP problem. The traffic road network of a city can be represented as an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. Here, $\mathcal{V}$ denotes the set of nodes, where each node corresponds to a sensor that records traffic data, and $\mathcal{E}$ denotes the set of edges corresponding to the roads connecting the sensors. Besides, $\mathbf{A} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$ represents the weighted adjacency matrix depicting the proximity (e.g., geographical distance, causal dependencies, or traffic series similarity) between nodes. The notation $|\cdot|$ denotes the cardinality of a set.

In reality, traffic sensors in different regions of a city may belong to different traffic administration departments. Consequently, suppose there are $M$ traffic administration departments governing $M$ different regions. Then the $m$-th ($m = 1, 2, \ldots, M$) region possesses a subset of sensors $\mathcal{V}_m$, therefore forming a subgraph of the global traffic network $\mathcal{G}_m \subseteq \mathcal{G}$ along with its corresponding private dataset $\mathcal{D}_m = \{x_1^{(i)}, \ldots, x_t^{(i)}, \ldots, x_T^{(i)}\}_{i=1}^{|\mathcal{V}_m|}$, where $x_t^{(i)} \in \mathbb{R}^d$ represents the observed $d$-dimensional features recorded by sensor $i$ at time stamp $t$, and $T$ represents the total number of time stamps. Therefore, our target is to precisely predict the traffic flow at the locations of the sensors.

However, most existing works rely on centralized data collection, which is impractical since the traffic data possessed by different traffic administration departments is not allowed to be distributed due to privacy issues. To overcome this challenge, we propose to use FL to collaboratively train TFP models without the need to exchange private traffic data. In federated TFP, each traffic administration department can be considered as a client that trains a TFP model to capture the spatial-temporal dependencies of traffic roads from historical traffic data recorded by local sensors and make accurate predictions of future traffic flow. To be specific, the task of the $m$-th client is to train a model $f_{\mathcal{W}_m}$ parameterized by $\mathcal{W}_m$ such that it can predict the traffic flow for the future $T_2$ time stamps based on the historical $T_1$ time stamps, namely:

$$\mathbf{X}_{t-T_1+1}, \ldots, \mathbf{X}_t \xrightarrow{f_{\mathcal{W}_m}} \mathbf{X}_{t+1}, \ldots, \mathbf{X}_{t+T_2}, \quad (1)$$

where $\mathbf{X}_t = [x_t^{(1)}; x_t^{(2)}; \ldots; x_t^{(|\mathcal{V}_m|)}] \in \mathbb{R}^{|\mathcal{V}_m| \times d}$ represents the observation of the local traffic network $\mathcal{G}_m$ at time stamp $t$.

The objective of federated TFP is to collaboratively train TFP models across clients without compromising data privacy. The classic FL method, i.e., FedAvg [15], aggregates model parameters at the server after local training according to the following formula:

$$\mathcal{W} \leftarrow \sum_{m=1}^{M} \frac{|\mathcal{V}_m|}{|\mathcal{V}|} \mathcal{W}_m. \quad (2)$$

The aggregated model is then redistributed to clients for the next round of training. Due to the non-IID traffic data across different regions, this way of learning a global TFP model may exhibit suboptimal performance. To address this issue, PFL is implemented by training a customized model for each client to enhance its performance on local traffic data. The objective of PFL can then be formulated as

$$\min_{\mathcal{W}_1, \ldots, \mathcal{W}_M} \frac{1}{M} \sum_{m=1}^{M} \frac{|\mathcal{V}_m|}{|\mathcal{V}|} \mathcal{L}_m(\mathcal{W}_m, \mathcal{D}_m), \quad (3)$$

where $\mathcal{L}_m$ is the loss function of the $m$-th client.

IV. METHODOLOGY

This section details the proposed FedTPS (see Fig. 2). During the local training phase (see Fig. 2(a)), stable traffic dynamics are first obtained through the decomposition of traffic flow. To construct the traffic pattern repository, the stable traffic dynamics representation obtained through the pattern encoder (see Fig. 2(e)) is first projected to a query space through a linear layer (see Fig. 2(f)) to compute the matching scores with patterns in the repository. Then the matched traffic patterns are concatenated with the representation of the original traffic data and fed into the decoder for prediction.
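To make Eqs. (2) and (3) concrete, here is a minimal numpy sketch of node-weighted FedAvg aggregation and the PFL objective. This is our illustration rather than the authors' implementation, and the function names are hypothetical.

```python
import numpy as np

def fedavg_aggregate(client_weights, client_num_nodes):
    """Eq. (2): the server averages client parameters weighted by |V_m| / |V|."""
    total_nodes = sum(client_num_nodes)
    agg = np.zeros_like(client_weights[0])
    for w_m, n_m in zip(client_weights, client_num_nodes):
        agg += (n_m / total_nodes) * w_m
    return agg  # redistributed to every client for the next round

def pfl_objective(client_losses, client_num_nodes):
    """Eq. (3): node-weighted mean of per-client losses. Each client keeps its
    own personalized parameters W_m instead of a single global W."""
    total_nodes = sum(client_num_nodes)
    m = len(client_losses)
    return sum((n / total_nodes) * l
               for l, n in zip(client_losses, client_num_nodes)) / m
```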
Fig. 2. The framework of FedTPS: (a) the local training phase; (b) the global aggregation phase. During the local training phase (i.e., (a)), stable traffic dynamics are extracted through the decomposition of traffic flow and are further utilized to construct the traffic pattern repository consisting of common traffic patterns on each client. During the global aggregation phase (i.e., (b)), each client uploads the traffic pattern repository to the server and shares the repository with other clients via similarity-aware aggregation.
Fig. 3. Illustration of J-level DWT.

... the side effects of discrepancies arising from region-specific characteristics. To achieve this goal, we employ DWT [31] to decompose the traffic data into waveforms of different frequencies, with the expectation that the low-frequency component, corresponding to stable traffic dynamics, can be considered as the common traffic pattern. To be specific, given traffic data $\mathbf{Z} = [\mathbf{X}_{t-T_1+1}; \mathbf{X}_{t-T_1+2}; \ldots; \mathbf{X}_t] \in \mathbb{R}^{T_1 \times |\mathcal{V}_m| \times d}$, the $J$-level DWT is performed to obtain the low-frequency component $\bar{\mathbf{Z}}_j^l$ and high-frequency component $\bar{\mathbf{Z}}_j^h$ at the $j$-th level, namely:

$$\bar{\mathbf{Z}}_j^l = (\downarrow 2)(f_g \star \bar{\mathbf{Z}}_{j-1}^l), \quad (8)$$

$$\bar{\mathbf{Z}}_j^h = (\downarrow 2)(f_h \star \bar{\mathbf{Z}}_{j-1}^l), \quad (9)$$

where $f_g$ and $f_h$ represent the low-pass and high-pass filters of a 1D orthogonal wavelet, respectively. The symbol $\star$ denotes the convolution operation and $(\downarrow 2)$ represents naive down-sampling that halves the length of each component. The process of $J$-level DWT is illustrated in Fig. 3. We only employ one-level DWT in our model to reduce computational overhead. The low-frequency component, which represents stable traffic dynamics, is then transformed back to the time domain through Inverse DWT (IDWT), which reads:

$$\mathbf{Z}^l = f_g^{-1} \star (\uparrow 2)\bar{\mathbf{Z}}_1^l, \quad (10)$$

where $f_g^{-1}$ is the inverse low-pass filter and $(\uparrow 2)$ denotes the naive up-sampling operation that doubles the length of each component.

After decomposition, the original traffic time series $\mathbf{Z}$ and its low-frequency component $\mathbf{Z}^l$ are separately fed into different encoders to obtain the learned representations $\mathbf{H}_t^o$ and $\mathbf{H}_t^l$, respectively.
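As an illustration of Eqs. (8)-(10), the following numpy sketch performs a one-level DWT along the time axis and reconstructs only the low-frequency component. The paper specifies a generic 1D orthogonal wavelet; the Haar filters used here are our simplifying assumption.

```python
import numpy as np

def haar_lowfreq(z):
    """One-level Haar DWT + IDWT along the time axis (axis 0).

    z: array of shape (T1, num_nodes, d) with T1 even.
    Returns the stable (low-frequency) component Z^l of the same shape.
    """
    s = 1.0 / np.sqrt(2.0)
    # Analysis, Eq. (8): low-pass filter f_g = [s, s], then downsample by 2.
    approx = s * (z[0::2] + z[1::2])           # shape (T1/2, num_nodes, d)
    # (Eq. (9) would give the discarded detail: s * (z[0::2] - z[1::2]).)
    # Synthesis, Eq. (10): upsample by 2, then apply the inverse low-pass filter.
    z_low = np.repeat(approx, 2, axis=0) * s    # shape (T1, num_nodes, d)
    return z_low

# Example: 12 historical steps, 5 sensors, 1 feature.
z = np.random.rand(12, 5, 1)
z_l = haar_lowfreq(z)  # smoothed series that would feed the pattern encoder
```

For the Haar case this reduces to a pairwise average over adjacent time steps, which makes the "stable dynamics" interpretation of the low-frequency branch easy to see.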
2) Construction of Traffic Pattern Repository: After obtaining the common traffic patterns represented by the low-frequency component, we intend to exploit the knowledge contained in these patterns for subsequent sharing among the different clients. However, since these patterns are generated from traffic data on each client individually, directly sharing them may pose a risk of privacy leakage. In addition, given the observed variation in traffic patterns across different traffic road networks [25], we aim to learn a set of representative traffic patterns for each client to facilitate pattern sharing. Memory networks, which have achieved notable success in computer vision [40] and anomaly detection [41], [42] due to their powerful representation abilities, have been increasingly adopted for spatial-temporal data analysis [13], [25], [43]. Inspired by memory networks, we construct a learnable traffic pattern repository $\mathbf{W}^p \in \mathbb{R}^{N \times c}$, where $N$ and $c$ denote the number of representative traffic patterns and the dimension of each pattern, respectively. We first adopt a linear layer to project the stable traffic dynamics representation $\mathbf{H}_t^l$ to a query space, which can be formulated as

$$\mathbf{H}_t^q = \mathbf{H}_t^l * \mathbf{W}^q + \mathbf{b}^q, \quad (11)$$

where $\mathbf{H}_t^q \in \mathbb{R}^{|\mathcal{V}_m| \times c}$ denotes the query matrix and $*$ denotes the dot product operation. $\mathbf{W}^q \in \mathbb{R}^{h \times c}$ and $\mathbf{b}^q \in \mathbb{R}^c$ are learnable parameters. Then we compute the matching scores $\mathbf{Q}$ with patterns in the repository as follows:

$$\mathbf{Q} = \mathrm{softmax}\big(\mathbf{H}_t^q * (\mathbf{W}^p)^\top\big). \quad (12)$$

Subsequently, we calculate the matched traffic patterns $\mathbf{P}_t \in \mathbb{R}^{|\mathcal{V}_m| \times c}$ as a weighted sum of the patterns in $\mathbf{W}^p$, and obtain

$$\mathbf{P}_t = \mathbf{Q} * \mathbf{W}^p. \quad (13)$$

Equations (12) and (13) are used to retrieve the most relevant common traffic patterns for a given query matrix. Finally, we concatenate $\mathbf{P}_t$ with the representation of the original traffic data $\mathbf{H}_t^o$ and feed them into the decoder to obtain predictions $\mathbf{Z}' = [\mathbf{X}_{t+1}; \mathbf{X}_{t+2}; \ldots; \mathbf{X}_{t+T_2}] \in \mathbb{R}^{T_2 \times |\mathcal{V}_m| \times d}$, where the $\ell_1$ loss function is adopted to optimize the training process. The learnable parameters at the $m$-th client are denoted by $\mathbf{W}_m^{e_1}$, $\mathbf{W}_m^{e_2}$, $\mathbf{W}_m^d$, $\mathbf{W}_m^q$, and $\mathbf{W}_m^p$, where $\mathbf{W}_m^{e_1}$ and $\mathbf{W}_m^{e_2}$ refer to the parameters of the original encoder and the pattern encoder, respectively. Besides, $\mathbf{W}_m^d$ refers to the parameters of the decoder, $\mathbf{W}_m^q$ refers to the parameters of the linear layer, and $\mathbf{W}_m^p$ refers to the learnable traffic pattern repository.
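The repository lookup of Eqs. (11)-(13) can be sketched as a small PyTorch module; the class and variable names are ours, and the encoder producing $\mathbf{H}_t^l$ is assumed to exist elsewhere.

```python
import torch
import torch.nn as nn

class PatternRepository(nn.Module):
    """Learnable traffic pattern repository W^p with query-based matching."""

    def __init__(self, n_patterns: int, h: int, c: int):
        super().__init__()
        self.proj = nn.Linear(h, c)                                # W^q, b^q in Eq. (11)
        self.patterns = nn.Parameter(torch.randn(n_patterns, c))   # W^p

    def forward(self, h_l: torch.Tensor) -> torch.Tensor:
        # h_l: (num_nodes, h), the stable traffic dynamics representation.
        q = self.proj(h_l)                                   # Eq. (11): (num_nodes, c)
        scores = torch.softmax(q @ self.patterns.T, dim=-1)  # Eq. (12): (num_nodes, N)
        p_t = scores @ self.patterns                         # Eq. (13): (num_nodes, c)
        return p_t

repo = PatternRepository(n_patterns=16, h=64, c=32)
p_t = repo(torch.randn(307, 64))  # e.g., 307 sensors as in PEMS04
```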
C. Sharing Strategy of Traffic Pattern

Based on the constructed traffic pattern repository, FedTPS aims to share the global knowledge contained in the common traffic patterns across clients in a personalized manner. The model on the $m$-th client can be divided into two parts: the traffic pattern repository $\mathbf{W}_m^p$ that represents the common traffic patterns, and the other model parameters (i.e., $\mathbf{W}_m^{e_1}$, $\mathbf{W}_m^{e_2}$, $\mathbf{W}_m^d$, and $\mathbf{W}_m^q$) that learn the spatial-temporal dependencies of local traffic data. Our core idea is to share the learnable traffic pattern repository within the FL framework while keeping the remaining model parameters for local training. Moreover, to improve the alignment of traffic patterns from different clients during the aggregation process, we devise a similarity-aware aggregation rather than the conventional averaging aggregation. Specifically, denoting by $\mathbf{W}_m^p[i]$ the $i$-th representative pattern in the repository of the $m$-th client, we calculate the cosine similarity of $\mathbf{W}_m^p[i]$ with patterns from the repositories of other clients. Then we select and aggregate the top-$k$ similar patterns from each client, which can be expressed as

$$\mathbf{W}_m^p[i] \leftarrow \frac{1}{M} \sum_{n=1}^{M} \frac{1}{k} \sum_{j \in \mathcal{S}_k} \mathbf{W}_n^p[j], \quad (14)$$

where $\mathcal{S}_k$ indicates the set of $k$ indices of the representative patterns in $\mathbf{W}_n^p$ that are most similar to $\mathbf{W}_m^p[i]$. Afterwards, the server redistributes the aggregated traffic pattern repository to each client for the subsequent round of local training.
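A sketch of the server-side update in Eq. (14), assuming the server has gathered every client's repository into a list; this is our illustrative implementation, not the released code.

```python
import numpy as np

def similarity_aware_aggregate(repos, k):
    """Eq. (14): for each pattern W_m^p[i], average the top-k most
    cosine-similar patterns from every client's repository (including client m).

    repos: list of M arrays, each of shape (N, c).
    Returns the list of M aggregated repositories.
    """
    normed = [r / np.linalg.norm(r, axis=1, keepdims=True) for r in repos]
    M = len(repos)
    out = []
    for m in range(M):
        new_repo = np.zeros_like(repos[m])
        for i in range(repos[m].shape[0]):
            acc = np.zeros(repos[m].shape[1])
            for n in range(M):
                sims = normed[n] @ normed[m][i]       # cosine similarities
                top_k = np.argsort(sims)[-k:]         # indices S_k
                acc += repos[n][top_k].mean(axis=0)   # (1/k) * sum over S_k
            new_repo[i] = acc / M                     # (1/M) * sum over clients
        out.append(new_repo)
    return out
```

Because only the $N \times c$ repositories (and never raw traffic series) cross the network, the communication cost per round is independent of the number of sensors or time stamps.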
Algorithm 1 FedTPS on the client side
Input: Historical traffic flow $\mathbf{Z}$ from private dataset $\mathcal{D}_m$; number of local rounds $R_1$; federated traffic pattern repository $\bar{\mathbf{W}}_m^p$.
Output: Prediction of future traffic flow $\mathbf{Z}'$.
1: Download the federated traffic pattern repository $\bar{\mathbf{W}}_m^p$ from the server;
2: Update the traffic pattern repository $\mathbf{W}_m^p \leftarrow \bar{\mathbf{W}}_m^p$;
3: for each local round $r = 1, 2, \ldots, R_1$ do
4:   Compute the low-frequency component $\mathbf{Z}^l$ via (8) and (10);
5:   Compute the representations $\mathbf{H}_t^o$ and $\mathbf{H}_t^l$ via (4), (5), (6), and (7);
6:   Compute the matched patterns $\mathbf{P}_t$ via (11), (12), and (13);
7:   Concatenate $\mathbf{H}_t^o$ and $\mathbf{P}_t$, and predict the future traffic flow $\mathbf{Z}'$ through the decoder;
8:   Calculate gradients and update the learnable parameters $\mathbf{W}_m^{e_1}$, $\mathbf{W}_m^{e_2}$, $\mathbf{W}_m^d$, $\mathbf{W}_m^q$, and $\mathbf{W}_m^p$;
9: Upload $\mathbf{W}_m^p$ to the server.
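For intuition, one local round of Algorithm 1 might be wired together as below, reusing the haar_lowfreq and PatternRepository sketches above. The encoders and decoder are placeholders, since the paper's Eqs. (4)-(7) fall on a page not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFedTPSClient(nn.Module):
    """Toy client model; linear layers stand in for the real encoders/decoder."""

    def __init__(self, t1, d, h, n_patterns, c, t2):
        super().__init__()
        self.original_encoder = nn.Linear(t1 * d, h)  # stand-in for Eqs. (4)-(5)
        self.pattern_encoder = nn.Linear(t1 * d, h)   # stand-in for Eqs. (6)-(7)
        self.repo = PatternRepository(n_patterns, h, c)
        self.decoder = nn.Linear(h + c, t2 * d)

    def forward(self, z, z_l):
        # z, z_l: (num_nodes, T1 * d) flattened histories and low-freq component.
        h_o = self.original_encoder(z)
        h_l = self.pattern_encoder(z_l)
        p_t = self.repo(h_l)                          # Eqs. (11)-(13)
        return self.decoder(torch.cat([h_o, p_t], dim=-1))

def local_round(model, z, z_l, target, optimizer):
    """One step of Algorithm 1's inner loop (lines 4-8)."""
    pred = model(z, z_l)
    loss = F.l1_loss(pred, target)                    # the l1 training loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()  # afterwards, only W_m^p (model.repo.patterns) is uploaded
```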
Algorithm 2 FedTPS on the server side
Input: Number of clients $M$; number of communication rounds $R_2$; number of selected patterns $k$; the traffic pattern repository $\mathbf{W}_m^p$ from client $m$.
Output: Federated traffic pattern repository $\bar{\mathbf{W}}_m^p$ for client $m$.
1: Initialize $\mathbf{W}^{p(1)}$;
2: for each communication round $r = 1, 2, \ldots, R_2$ do
3:   for client $m \in \{1, 2, \ldots, M\}$ in parallel do
4:     if $r = 1$ then
5:       Send $\mathbf{W}^{p(1)}$ to client $m$;
6:     else
7:       $\bar{\mathbf{W}}_m^{p(r)} \leftarrow$ aggregate $\mathbf{W}_{1:M}^{p(r)}$ via (14);
8:       Send $\bar{\mathbf{W}}_m^{p(r)}$ to client $m$;
9:     Perform Algorithm 1 on client $m$;
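A skeletal server loop matching Algorithm 2, reusing the similarity_aware_aggregate sketch above; the client.local_train call stands in for running Algorithm 1 over whatever transport a deployment uses and is hypothetical.

```python
import numpy as np

def server_loop(clients, n_patterns, c, rounds, k):
    """Algorithm 2: broadcast, run clients, similarity-aware aggregate, repeat."""
    init_repo = np.random.randn(n_patterns, c)      # W^{p(1)}
    repos = [init_repo.copy() for _ in clients]
    for r in range(1, rounds + 1):
        if r > 1:
            repos = similarity_aware_aggregate(repos, k)   # Eq. (14), per client
        # Each client performs Algorithm 1 locally and returns its updated W_m^p.
        repos = [client.local_train(repo) for client, repo in zip(clients, repos)]
    return repos
```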
Through iterative training and aggregation of pattern repositories, common traffic patterns serve as additional global knowledge to further guide the TFP process. Meanwhile, the remaining model components, which learn the spatial-temporal dependencies specific to the region of each client, do not participate in the aggregation process, thereby forming the personalized FL style and mitigating the adverse effects of region-specific variations.

In summary, through our proposed common pattern extraction and sharing strategy, the federated framework can effectively explore common traffic patterns to enhance TFP capabilities with a personalized model. We provide the pseudocode of FedTPS on the client side and the server side in Algorithm 1 and Algorithm 2, respectively.

V. EXPERIMENTS

To evaluate the performance of our model, we carried out comparative experiments on four real-world highway traffic datasets in FL scenarios. First, we introduce the experimental settings, followed by a detailed presentation of the results, which includes a performance comparison, an ablation study, and parametric sensitivity.

A. Experimental Setup

1) Datasets Description and Preprocessing: We evaluate our proposed framework on four widely used datasets for TFP: PEMS03, PEMS04, PEMS07, and PEMS08. These datasets consist of traffic flow data collected by the California Transportation Agencies (CalTrans) Performance Measurement System (PeMS) [44], with the number representing the district code. The statistical details of these datasets are listed in Table I.

TABLE I
DATASET STATISTICS

Datasets  # Samples  # Nodes  Sample Rate  Time Span
PEMS03    26208      358      5 mins       09/2018 - 11/2018
PEMS04    16992      307      5 mins       01/2018 - 02/2018
PEMS07    28224      883      5 mins       05/2017 - 08/2017
PEMS08    17856      170      5 mins       07/2016 - 08/2016

Following the practice of previous methods [45], we split the datasets into training, validation, and test sets in chronological order with the ratio 6 : 2 : 2. Across all four datasets, we use the past 12 time stamps to predict the traffic flow for the upcoming 12 time stamps. Before training, we apply a standard normalization procedure to the datasets to ensure a stable training process. To simulate the FL scenario, we employ the graph partitioning algorithm METIS [46] to evenly partition the global traffic road network, with each client possessing a subgraph of the global traffic road network.
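The partition step could be scripted with the pymetis bindings roughly as follows; representing the sensor graph as an adjacency list and using pymetis (one of several METIS wrappers) are our assumptions.

```python
import numpy as np
import pymetis  # Python bindings for METIS

# Toy sensor graph as an adjacency list: node i -> list of neighbor ids.
adjacency = [[1, 2], [0, 2], [0, 1, 3], [2]]

num_clients = 2
_, membership = pymetis.part_graph(num_clients, adjacency=adjacency)

# Each client m receives the subgraph induced by its assigned sensors V_m.
clients_nodes = [np.where(np.array(membership) == m)[0]
                 for m in range(num_clients)]
print(clients_nodes)  # e.g., [array([0, 1]), array([2, 3])]
```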
2) Evaluation Metrics: The evaluation metrics in this paper include Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE), which are defined as follows:

$$\mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T}\left|\mathbf{X}_t - \hat{\mathbf{X}}_t\right|, \quad (15)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(\mathbf{X}_t - \hat{\mathbf{X}}_t\right)^2}, \quad (16)$$

$$\mathrm{MAPE} = \frac{1}{T}\sum_{t=1}^{T}\left|\frac{\mathbf{X}_t - \hat{\mathbf{X}}_t}{\mathbf{X}_t}\right|, \quad (17)$$

where $\mathbf{X}_t$ denotes the ground truth of all nodes at time stamp $t$ and $\hat{\mathbf{X}}_t$ denotes the predicted value. We evaluate the performance of the TFP task on the client side, and then average the performances across all clients.
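The three metrics reduce to a few numpy functions; masking zero ground-truth values in MAPE is our own guard (a common practice for PeMS data), not something stated in the paper.

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))             # Eq. (15)

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))     # Eq. (16)

def mape(y_true, y_pred, eps=1e-8):
    mask = np.abs(y_true) > eps                          # guard against division by zero
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask]))  # Eq. (17)

# Evaluate per client, then average across clients as in the paper.
rng = np.random.default_rng(0)
client_scores = []
for _ in range(4):  # e.g., 4 simulated clients
    y = rng.uniform(10, 100, size=(12, 170, 1))   # toy ground truth
    y_hat = y + rng.normal(0, 1, size=y.shape)    # toy predictions
    client_scores.append(mae(y, y_hat))
print(np.mean(client_scores))
```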
TABLE II
OVERALL PERFORMANCE ON FOUR DATASETS. THE BEST RESULTS ARE HIGHLIGHTED IN BOLDFACE.