
IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 71, 2022, Art. no. 2513412

Intrusion Detection of Foreign Objects in Overhead Power System for Preventive Maintenance in High-Speed Railway Catenary Inspection

Shangdong Zheng, Zebin Wu, Senior Member, IEEE, Yang Xu, Member, IEEE, and Zhihui Wei

Abstract— Foreign object intrusion detection in overhead power systems (OPSs) is critical during preventive maintenance in catenary inspection. However, the characteristics of OPS images, including appearance degradation and noisy representations, increase the difficulty of foreign object detection. In this article, a sparse cross attention (SCA) based transformer detector (SCATD) is developed for detecting foreign objects in OPSs. Specifically, a two-stage refinement architecture is proposed for extracting the features of foreign objects in OPS images. In the first stage, a spatiotemporal enhanced (STE) convolutional neural network (CNN) is proposed to leverage feature-level spatiotemporal coherence across frames to emphasize the spatial responses of foreign objects. Then, a spatial memory (SM) based feature aggregation (SMFA) module is constructed to iteratively update the feature affinities during training. Moreover, the SCA network sparsely builds different weak transformer detectors to produce weak predictive results. All the predictive results are fused to achieve better predictive performance via voting schemes in an adaptive way. Finally, our SCATD is compared with state-of-the-art deep learning-based object detection algorithms. Experiments on a real OPS dataset demonstrate the high effectiveness of the proposed SCATD scheme in detecting foreign instances in OPSs.

Index Terms— Convolutional neural network (CNN), foreign object detection, intrusion detection, overhead power system (OPS), sparse cross attention (SCA)-based transformer detector (SCATD).

Fig. 1. Overview of preventive maintenance in OPSs. Traditionally, foreign object detection (in static images or videos) is implemented by human staff. (a)–(f) show images of foreign objects captured in different locations or under different weather conditions. The red rectangular boxes denote the cameras. The orange rectangular boxes represent bird nests, which are foreign objects.

Manuscript received 3 May 2022; revised 9 June 2022; accepted 28 June 2022. Date of publication 11 July 2022; date of current version 20 July 2022. This work was supported in part by the National Natural Science Foundation of China under Grant 62071233, Grant 61971223, and Grant 61976117; in part by the Jiangsu Provincial Natural Science Foundation of China under Grant BK20211570, Grant BK20180018, and Grant BK20191409; in part by the Fundamental Research Funds for the Central Universities under Grant 30917015104, Grant 30919011103, Grant 30919011402, Grant 30921011209, and Grant JSGP202204; and in part by the Key Projects of University Natural Science Fund of Jiangsu Province under Grant 19KJA360001. The Associate Editor coordinating the review process was Dr. M. Shamim Hossain. (Corresponding author: Zebin Wu.)

Shangdong Zheng, Yang Xu, and Zhihui Wei are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China.

Zebin Wu is with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/TIM.2022.3189642

1557-9662 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: King Mongkuts Institute of Technology Ladkrabang provided by UniNet. Downloaded on September 27, 2023 at 20:37:45 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

OVERHEAD power systems (OPSs) are critical components of railway power units that play a vital role in high-speed railway systems. Foreign objects, such as bird nests, that appear in OPSs are caused by many factors, including bird migration, component aging, and weathering, and can result in significant energy loss and the destruction of overhead contact lines. According to [1], there are more than 9300 km of high-speed rail networks in China and more than 60 000 km of high-speed rail networks around the world. OPS images are captured by high-resolution cameras installed on operating vehicles due to their high functionality and interoperability. However, the manual examination of a large number of images and videos to detect foreign objects in OPSs is not only time-consuming and labor-intensive but also heavily reliant on manual experience. Fig. 1 visualizes the preventive maintenance pipeline in OPSs, which includes data collection and storage, foreign object detection, and preventive maintenance.

In previous studies [1], [2], image analysis and image matching methods were utilized to detect foreign objects in OPSs. Kouadio et al. [2] extracted all surrounding components, including contact lines and supporting brackets, to locate the catenary sections between two supporting brackets. Wu et al. [1] proposed a Canny edge detector to effectively identify the edges of overhead contact lines and bird nests. Foreign objects, which have good continuity and high contrast, can be accurately detected by line detectors and threshold methods. Recently, various convolutional neural network (CNN)-based methods [3]–[6] have been proposed for

damage, defect, and crack detection in high-speed railway


systems. Aydin et al. [7] proposed a two-step framework for
detecting anomalous objects in high-speed railway systems.
The mean-shift tracking strategy and Gaussian mixture model
were utilized to combine trajectory-based and region-based
information for object detection. Zheng et al. [8] used an atten-
tion mechanism and skip connection strategy to emphasize the
spatial contextual information of foreign objects. To record
catenary parameters such as the insulators, split pins and pillar
positions, a real-time image capture system [9] was installed
on the inspection train to inspect the catenary for damage.
Wang et al. [10] proposed a stackable attention-guided multi-
scale module (SAMM) for detecting pillar number plates in the
overhead catenary systems (OCSs). To ensure that the network
propagates high-level semantic features in the shallow layers
of the feature extraction network, the SAMM progressively
suppresses the responses of background features. Lu et al. [11]
utilized time-scale normalization to detect abnormalities in
high-speed railways. After different image pairs were matched
using geometric constraints, the matched pictures were divided
into different sub-patches to properly align the target images.
Zhong et al. [12] analyzed different types of split pins and
introduced three definitions for identifying the missing and
loosening split pins. A three-stage detection framework was
constructed to localize the areas of the split pins. In [13],
a Faster R-CNN detector and a deep multitask neural network
were used to detect insulator surface defects. In [14], binary
feature pooling was proposed to better represent insulator
strings in infrared images. To locate the broken insulator strings, a support vector machine (SVM) was merged with a sliding window framework to classify the insulator states. In addition, an automatic pillar number detection and recognition method [8] was developed for preventive maintenance in OPSs. This research improved fast fault localization.

Intrusion detection of bird nests, a primary foreign object detection task in OPSs, has attracted considerable attention in recent works. A possible representation and multistage bird nest detection system inspired by image binarization was analyzed in [1]. In [15], a nonlocal attention-based method was presented for detecting foreign objects in OPSs; however, because of its large computational overhead, this technique is difficult to deploy in industrial scenarios. Chen and He [16] utilized an enhanced RetinaNet to detect bird nests in the power transmission system and reduce the conflicts between the foreground and background. In [17], an improved single shot multibox detector (SSD) [18] was used to detect foreign objects in the high-speed railway system in real time. This method utilized a three-stage integrated detection strategy to improve the performance of small object detection. However, these deep learning-based methods cannot achieve effective performance in detecting foreign objects in OPSs. As illustrated in Fig. 2, the foreign objects include kites, balloons, and bird nests, which may be obscured by contact lines, supporting brackets, or split pins. In addition, the cameras always capture video data. All the methods above decompose video samples into a set of frames to detect target instances, ignoring the temporal correlations across video frames.

Fig. 2. Foreign objects captured in different weather conditions and locations in OPSs. The purple, green, and orange rectangular boxes represent balloons, kites, and bird nests, which are various kinds of foreign objects.

Video object detection (VID) [15], [19]–[24] plays an important role in computer vision tasks. How to utilize the abundant spatiotemporal information from support frames (SFs) to improve the detection accuracy of occluded and distorted instances in a reference frame (RF) is an essential issue in VID. Flow-guided feature aggregation (FGFA) [19] and deep feature flow (DFF) [20] utilized in-network optical flow models to measure the feature affinities between two frames. However, optical flow estimation has a large computation cost. Due to the dramatic changes in object locations, appearances, and poses across frames, optical flow models might be unreliable. Recent state-of-the-art VID methods have attempted to highlight the spatial responses of objects in images. Deng et al. [22] proposed a relation distillation network (RDN) to explore the spatiotemporal correlations between the RF and all SFs. The spatiotemporal sampling network (STSN) [23] utilized multiple deformable convolution networks (DCNs) to incorporate target temporal information. However, while the use of multiple DCNs can improve performance, it greatly increases the computational cost. Cheng et al. [21] explored the global-local context information of objects in different frames to better represent the appearance of targets, which achieved the SOTA performance on the ImageNet VID dataset [25]. The most frequently used evaluation metric in static image/VID tasks is the average precision (AP) [26], which measures the intersection over union (IOU) between the predicted bounding boxes and


Algorithm 1 Algorithm of our SCATD
Input: video frames {I_t}, sparse degree α
for k = t − K to t + K do                                  ▷ initialize feature buffer
    F_k = χ_fea(I_k)                                       ▷ feature extraction network
end for
for t = 1 to ∞ do
    F^c_{t,t+K} = concat[F_t, F_{t+K}]                     ▷ concat operation
    F'_{t+K} = F_{t+K} + f_ste(F^c_{t,t+K})                ▷ enhanced SFs
    F^e_t, F^e_{t+K} = ε(F_t, F'_{t+K})                    ▷ embedding features
    w_{t,t+K} = exp(F^e_t · F^e_{t+K} / (|F^e_t| |F^e_{t+K}|))   ▷ aggregation weight
    M_t = M_{t−K} + f_SMFA(F_t)                            ▷ compute spatial memory
end for
F^sm_t = M_t ⊙ Σ_{K=−T}^{T} w_{t,t+K} · F'_{t+K}           ▷ aggregate features
Q_t, K_t, V_t |_{t=t−K}^{t+K} = Linear(F'_{t+K})           ▷ linear layer
Q^sm_t, K^sm_t, V^sm_t = Linear(F^sm_t)                    ▷ linear layer
C_α = f_rs(comb(Q^sm_t, K^sm_t, V^sm_t, Q_{t+K}, K_{t+K}, V_{t+K}))   ▷ random selection
F^en_t = f_FA(Enc_SCA(C_α))                                ▷ SCA encoder
F^de_t = Dec(F^en_t)                                       ▷ decoder
Output: detection results {D(F^de_t)}

the ground truth. The precision [12] (Pr, how many predictions are accurate) and recall [12] (Re, how many ground truths are selected) are necessary for determining the AP. Thus, recent object detection frameworks have utilized these evaluation metrics to assess the performance of their methods.

However, it is a nontrivial task to use these VID methods for foreign object detection in OPSs. First, the size of the OPS images ranges from 1920 × 1080 to 2880 × 2160, which is much larger than the images in the ImageNet VID dataset. In addition, the foreign objects in OPSs are extremely small, and OPS images are always captured under different weather conditions (e.g., rainy and cloudy), which poses great difficulty in excavating the spatial features of foreign instances. The experiments in Section IV prove that there is a strong need to develop a method for detecting foreign objects in OPSs.

To realize automatic foreign object detection in OPSs, a sparse cross attention (SCA)-based transformer detector (SCATD) is proposed in this paper. During the feature extraction phase, an STE network is constructed to propagate high-level semantic features across frames in the shallow layers. The STE network establishes the spatiotemporal correlations between the feature maps to determine high-level semantic contextual information with a lower computational overhead. Moreover, a spatial memory (SM) based feature aggregation (SMFA) module is designed to preserve the key spatial features of each frame to guide the feature aggregation
operation. Finally, a SCA mechanism is proposed to further improve the precision of foreign object detection via ensemble learning. The specific contributions of this article are as follows.

1) A two-stage feature refinement framework is proposed to progressively emphasize the spatial responses of foreign objects in OPS images. In the first stage, the STE network is constructed to leverage the spatiotemporal coherence at the feature level to alleviate motion blur and the partial occlusion of foreign objects across frames. In the second stage, because the STE-CNN establishes only coarse features for foreign objects, a SMFA strategy is designed to iteratively update the feature affinities during training.

2) The SCA mechanism is constructed to leverage temporal coherence for video sequence prediction in the transformer detector. Constrained by the sparse degree α, our SCA sparsely selects several groups from different combinations of Q (Query), K (Key), and V (Value) features across frames to produce weak predictive results. Finally, our SCA adaptively fuses these predictive results to achieve better performance via voting schemes.

The remainder of this article is organized as follows. In Section II, we briefly analyze the overall architecture of our SCATD, and the details of our method are introduced in Section III. The results of our model and the comparisons with various state-of-the-art approaches are discussed in Section IV. Finally, Section V presents different perspectives for future works.

II. SYSTEM OVERVIEW

The foreign object video data are captured by front-mounted cameras on the operating vehicle (illustrated in Fig. 1). Conceptually, the video samples are defined as {I_t}_{t=t−K}^{t+K}, where the frame I_t is the RF and K represents the sampling window used to select the SFs. The video samples acquired in the OPSs are collected under different weather conditions. Because different types of cameras are utilized to capture the OPS images, the size of each image frame ranges from 1920 × 1080 to 2880 × 2160. The foreign objects are located at various locations in OPSs, which increases the difficulty of object detection. In this work, the image processing step has three main components: feature transformation, feature aggregation, and foreign object detection. Algorithm 1 shows the processing of the proposed SCATD, and Figs. 3 and 5 describe the pipeline of our method.

A. Feature Transformation

ResNet50 [27] is a frequently used backbone network for static image detection and VID [19], [24]. Because of its considerable feature learning ability for extracting high-level semantic information, recent VID methods [19]–[24] utilized ResNet50 as their backbone network for feature extraction. To facilitate a fair comparison, the proposed SCATD uses ResNet50 to extract the features of foreign objects in OPS images. One image I_t ∈ R^{1×3×608×608} is taken as the input of ResNet50, and the extracted feature F_t ∈ R^{1×1024×(608/32)×(608/32)} is fed into the feature transformation network to propagate spatiotemporal information and alleviate the motion blur and partial occlusions caused by deteriorated frame quality. Foreign objects are often covered by electrical components (e.g., insulators, pillars, and anchors) in OPSs. An efficient approach to address this issue


is to use the feature transformation function to align semantic information in the adjacent video frames.

TABLE I
ARCHITECTURE OF STAGE ONE: THE STE NETWORK
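As a rough illustration of this kind of channel-then-spatial attention refinement, the sketch below gates a concatenated feature map with pooled statistics. It is a toy only: the random projection matrices stand in for the convolutional layers of Table I (whose exact shapes are not reproduced here), and the spatial gate is simplified to a fixed average.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
C, H, W = 8, 6, 6
f_cat = rng.standard_normal((C, H, W))  # stand-in for the concatenated RF/SF feature

# Channel attention: pool over space, mix with two stand-in projections, gate channels.
avg_c = f_cat.mean(axis=(1, 2))                    # (C,)
max_c = f_cat.max(axis=(1, 2))                     # (C,)
W_phi = rng.standard_normal((C, C)) / np.sqrt(C)   # hypothetical stand-in for conv phi
W_vphi = rng.standard_normal((C, C)) / np.sqrt(C)  # hypothetical stand-in for conv varphi
ch_mask = sigmoid(W_phi @ avg_c + W_vphi @ max_c)  # (C,), values in (0, 1)
f_ch = f_cat * ch_mask[:, None, None]

# Spatial attention: pool over channels, gate each spatial location.
avg_s = f_ch.mean(axis=0)                          # (H, W)
max_s = f_ch.max(axis=0)                           # (H, W)
sp_mask = sigmoid(0.5 * (avg_s + max_s))           # the conv psi is approximated by a fixed average
f_ste = f_ch * sp_mask[None, :, :]
print(f_ste.shape)
```

The two gates mirror the order used in the paper: channel responses are re-weighted first, and the resulting map is then modulated per spatial location.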

B. Feature Aggregation
First, the STE-CNN is employed to determine the spatiotem-
poral features of foreign objects in images. Then, the SMFA
module is developed to iteratively update the feature affinities.
In the feature aggregation module, similarity weight matrices
are estimated to mitigate appearance degradation between the
RF and all SFs. Because foreign objects may be obscured by
electrical components in certain frames, the spatial information
of foreign objects in adjacent frames [19]–[24] can be explored
to detect occluded instances. Thus, the SMFA module is
designed to emphasize the spatial responses of foreign objects
in the RF.
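The similarity-weighted aggregation described here can be sketched numerically. The toy below follows the aggregation scheme of [19] in spirit: per-pixel cosine similarity between the RF and each SF, softmax-normalized over the frame axis, then a weighted sum. The learned embedding network is replaced by plain L2 normalization, which is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
ref = rng.standard_normal((C, H, W))                           # RF feature F_t
supports = [rng.standard_normal((C, H, W)) for _ in range(4)]  # SF features

def embed(f):
    # Stand-in for the small embedding network S(x): L2-normalize along channels.
    return f / (np.linalg.norm(f, axis=0, keepdims=True) + 1e-8)

e_ref = embed(ref)
# Per-pixel cosine similarity between the RF and each SF.
logits = np.stack([(e_ref * embed(s)).sum(axis=0) for s in supports])  # (T, H, W)

# Softmax over the frame axis so the weights sum to 1 at every spatial location.
logits -= logits.max(axis=0, keepdims=True)
w = np.exp(logits)
w /= w.sum(axis=0, keepdims=True)

# Weighted aggregation of SF features into a refined RF feature.
agg = sum(w[k][None] * supports[k] for k in range(len(supports)))
print(agg.shape)
```

Occluded pixels in the RF can thus borrow evidence from whichever SF looks most similar at that location, which is the intuition behind detecting obscured instances.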
C. Foreign Object Detection
After the spatiotemporal correlations between the RF and
all SFs are emphasized by our STE-CNN and SMFA module,
the final SCA mechanism is proposed in the transformer
detector to model video sequence spatiotemporal relations
via ensemble learning. To reduce the computational cost and
memory overhead, we design a random selection strategy to
capture the spatiotemporal features of foreign objects. The
sparse degree parameter α determines the complexity of our
ensemble detection processing. The impact of α is discussed in Section IV.

III. DETECTION MODULE

A. Spatiotemporal Enhanced CNN

With one RF feature defined as F_t ∈ R^{1×1024×H×W} and one SF feature defined as F_{t+K} ∈ R^{1×1024×H×W}, the proposed STE network aims to align the temporal information between these features in the channel and spatial dimensions to reduce false positives and inaccurate location detection. The framework of the STE network is shown in Table I, which can be summarized as

    F'_{t+K} = F_{t+K} + f_ste(concat[F_t, F_{t+K}])    (1)

where F'_{t+K} denotes the output of the STE network, and the operation of the STE network is formulated as f_ste.

First, we concatenate F_t and F_{t+K} in the channel dimension, and the concatenated feature is denoted as F^c_{t,t+K} ∈ R^{1×2048×H×W}. After a pair of global avg-pooling and max-pooling operations are applied in the spatial dimension, Avg(F^c_{t,t+K}) and Max(F^c_{t,t+K}) are transferred to R^{1×2048×1×1}. As shown in Table I, different sets of convolutional layers φ (Layer 3 and Layer 4) and ϕ (Layer 5 and Layer 6) are utilized to transfer F^c_{t,t+K} to various high-level semantic features. We merge these two features with an elementwise summation function and normalize each channel element with a sigmoid layer to generate a channel attention mask. Then, the preliminary attention-enhanced feature F̃^c_{t,t+K} ∈ R^{1×2048×H×W} is fed into the next attention branch. The operation can be formulated as

    F̃^c_{t,t+K} = F^c_{t,t+K} ⊗ sigmoid(φ(Avg(F^c_{t,t+K})) + ϕ(Max(F^c_{t,t+K})))    (2)

where ⊗ represents the elementwise product.

The global pooling operations in (2) compress the spatial dimension of F^c_{t,t+K} to 1 × 1, and the spatial features of the foreign objects might be missed during attention learning. Thus, our STE network explores more abundant features of foreign objects from the spatial aspect

    F'_{t+K} = F̃^c_{t,t+K} ⊗ sigmoid(ψ([Avg(F̃^c_{t,t+K}), Max(F̃^c_{t,t+K})]))    (3)

where ψ (Layer 13) denotes a convolutional operation. We use a pair of global avg-pooling and max-pooling operations in the channel dimension of F̃^c_{t,t+K}, and the results of Avg(F̃^c_{t,t+K}) and Max(F̃^c_{t,t+K}) are transferred to R^{1×1×H×W}. Then, a sigmoid function is utilized to normalize each spatial element to generate another attention mask. The final STE-enhanced feature F'_{t+K} is obtained with the elementwise product function ⊗. As a result, the STE network can capture the most similar and critical features to emphasize the responses of foreign objects in images from the channel and spatial aspects, reducing the problems of false positives and inaccurate location detection. The enhanced SF feature F'_{t+K} and the original RF feature F_t are then sent to the next stage for efficient feature aggregation.

B. Spatial Memory-Based Feature Aggregation Strategy

Feature aggregation is an efficient method for mitigating appearance degradation between the RF and all SFs during video detection. According to [19], the final refined RF feature F̄_t can be formulated as

    F̄_t(p) = Σ_{K=−T}^{T} w_{t,t+K}(p) · F'_{t+K}(p)    (4)

where the weight w indicates the spatial similarity between each SF and the RF. To compute the weight w, a 3-layer


Fig. 3. In practice, we randomly select three frames from one video {I_t}_{t=t−K}^{t+K} to introduce our SCATD. The corresponding feature maps are produced by the ResNet50 backbone network. In the first stage, the spatiotemporal enhanced (STE)-CNN is employed to determine the spatiotemporal information of foreign objects in video sequences. In the second stage, the SMFA module is developed to iteratively update the feature affinities. In the detector, the upgraded reference frame feature and all enhanced support frame features are determined to learn the object relations with the sparse cross attention (SCA) mechanism. Finally, ensemble learning is utilized to improve the precision of foreign object detection.
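The sparse-selection-plus-voting idea in this pipeline can be sketched as a toy example. The code below enumerates all cross-frame (Q, K, V) combinations, randomly keeps α of them, runs softmax attention on each as a "weak detector," and fuses the outputs by averaging. All dimensions and projection matrices are made up for illustration; the fusion rule is assumed to be a simple mean, standing in for the adaptive voting scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(Q, K, V):
    """Softmax attention, y = softmax(QK^T)V."""
    scores = Q @ K.T
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = scores / scores.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy per-frame features for frames {t-K, t, t+K}, flattened to (SP, C).
SP, C = 4, 8
feats = {f: rng.standard_normal((SP, C)) for f in ("t-K", "t", "t+K")}
Q = {f: feats[f] @ rng.standard_normal((C, C)) for f in feats}
K = {f: feats[f] @ rng.standard_normal((C, C)) for f in feats}
V = {f: feats[f] @ rng.standard_normal((C, C)) for f in feats}

# All cross-frame (Q, K, V) combinations: 3 x 3 x 3 = 27 for three frames.
frames = list(feats)
all_combos = [(q, k, v) for q in frames for k in frames for v in frames]

# Random sparse selection of alpha combinations (the f_rs step).
alpha = 3
chosen = rng.choice(len(all_combos), size=alpha, replace=False)

# Each selected combination acts as a weak detector head; fuse by averaging (voting).
outputs = [attention(Q[q], K[k], V[v]) for q, k, v in (all_combos[i] for i in chosen)]
fused = np.mean(outputs, axis=0)
print(fused.shape)
```

With three frames the full combination space already holds 27 attention variants, which makes the memory argument for sampling only α of them per iteration concrete.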

embedding network S(x) is introduced to the feature aggregation module to measure the similarity. The weight w can be calculated as

    w_{t,t+K}(p) = exp( S(F_t(p)) · S(F'_{t+K}(p)) / (|S(F_t(p))| |S(F'_{t+K}(p))|) ).    (5)

Finally, all the weights w_{t,t+K} for each spatial location p are normalized over the adjacent frames through a softmax layer.

However, this feature-level aggregation strategy [19], [20], [23], [28] is suboptimal for detecting foreign objects in OPS images. Redundant backgrounds, such as the sky, tend to have very similar features in a sequence of adjacent frames. As a result, these methods might misestimate redundant background features as the most distinctive ones during object detection.

The overall architecture of the SMFA module is shown in Fig. 3. The SMFA module aims to align the spatial information of foreign objects across frames. The original similarity weights w_{t,t+K} and the refined RF feature F̄_t can be obtained with the feature aggregation method [19]. Then, we aggregate the spatial information of the original RF feature F_t by using the max-pooling and avg-pooling layers, which generates two spatial context features F^mp_t ∈ R^{1×1×H×W} and F^ap_t ∈ R^{1×1×H×W}. We merge these two feature maps in the channel dimension using a concatenation operation. A standard convolutional layer ρ is used to integrate the spatial information features with the former SM M_{t−K} to produce M_t using elementwise summation. Finally, the SM M_t, which retains the spatial information of the foreign objects, is merged with F̄_t, serving as a control gate to filter out the inaccurate similarity weights. The overall SM strategy can be formulated as

    F^sm_t = M_t ⊙ Σ_{K=−T}^{T} w_{t,t+K} · F'_{t+K}    (6)

    M_t = M_{t−K} + sigmoid(ρ([F^mp_t, F^ap_t]))    (7)

where [ ] represents the concatenation operation and ρ denotes a 3 × 3 convolutional layer. The feature affinities, which are produced at time t, are progressively updated to determine better spatial correspondences based on the former SM M_{t−K}. The impact of the SMFA strategy is discussed in Section IV.

As shown in Fig. 4, the similarity weights of the bird nests (orange boxes) and background areas (gray boxes) can be calculated by the original feature aggregation method and our SMFA module. The red pixels denote the similarity of the corresponding spatial areas of two adjacent frames, ranging from 0.1 to 1.0. Fig. 4(a) clearly shows that the original similarity weights focus on useless background information, such as the sky, rather than the bird nests. Thus, these weights might misestimate the spatial correspondences between the two feature maps. After our SM is introduced, the SMFA-integrated network covers the target object regions better than the original weights, as shown in Fig. 4(b).

C. Sparse Cross Attention-Based Transformer Detector

Recent novel studies [29]–[31] have demonstrated the considerable potential of transformers in sequence data tasks. Duke et al. [30] proposed sparse spatiotemporal transformers (STTs) for video object segmentation, addressing the computational complexity of transformer structures.


Fig. 4. Orange and gray rectangular boxes represent bird nests and background areas, respectively. (a) shows the similarity matrices computed by the original feature aggregation method. (b) shows the similarity matrices computed by our SMFA module. The red pixel block, which ranges from 0.1 to 1.0, represents the level of similarity between two adjacent frames.

Fig. 5. Architecture of our sparse cross attention (SCA) transformer encoder.

Fig. 6. Visualization of different foreign objects in OPSs. The purple, green, and orange rectangular boxes represent the foreign objects of balloons, kites, and bird nests, respectively.

Neimark et al. [31] proposed a transformer-based method for recognizing video instances. Because this method uses a complete video as the input during the inference stage, this transformer network is more suitable for long video recognition tasks.

The attention mechanism in the transformer is based on a trainable associative memory with (key, value) vector pairs. The query matrices are matched against a set of key vectors using inner products. Then, these results are normalized by a softmax layer to gain the similarity weights. Formally, this process can be formulated as

    y = softmax(Q K^T) V    (8)

where Q, K, and V ∈ R^{C×SP} represent the query, key, and value vectors, respectively. C and SP denote the channel and spatial dimensions.

Inspired by [30], [31], our SCA module proposes an efficient variant of self-attention to capture the features of foreign objects in video sequences. Considering a multiframe pipeline to produce attention masks, these three matrices can be denoted as Q, K, V ∈ R^{T×C×SP}, where T denotes the temporal dimension.

As shown in Fig. 5, the SCA module uses the sequence features F^sm_t, F'_{t+K}, and F'_{t−K} from the former network as its inputs. The corresponding query, key, and value matrices are represented by Q^sm_t, K^sm_t, V^sm_t and Q_t, K_t, V_t |_{t=t−K}^{t+K}. We merge the different combinations of feature matrices using a concatenation operation in the temporal dimension. Because there are a large number of combinations of {Q, K, V}_t |_{t=t−K}^{t+K}, we cannot simply use (8) to perform the self-attention operations during training. For example, all the combinations of {Q, K, V}_t |_{t=t−K}^{t+K} are formulated as

    [ (Q_t K_t V_t),             (Q_{t−K} K_{t−K} V_{t−K})
      (Q_{t+K} K_{t+K} V_{t+K}), (Q_{t+K} K_{t+K} V_{t−K})
      (Q_t K_t V_{t+K}),         (Q_t K_{t+K} V_t)
      ...                        ...
      (Q_{t−K} K_{t+K} V_{t−K}), (Q_{t−K} K_{t−K} V_{t+K}) ].    (9)

The ideal detector can be achieved by training the detection network with all the combinations shown in (9). However, due to GPU memory limitations, this processing is extremely time-consuming and redundant. Thus, a random selection strategy is employed to overcome this computational barrier. With the control of our user-defined parameter α, our SCA randomly selects α pairs of combinations of {Q, K, V}_t |_{t=t−K}^{t+K}


Fig. 7. Loss curve of our static image detection model with different Fig. 8. Loss curve of our video object detection model with different
optimizers. (a) SGD. (b) Adam. optimizers. (a) SGD. (b) Adam.

to train our SCATD. In the training phase, the minimum value of α is 3, which guarantees that (Q_t, K_t, V_t), (Q_{t+K}, K_{t+K}, V_{t+K}), and (Q_{t−K}, K_{t−K}, V_{t−K}) are used for training; the maximum value of α is set to 8 to make full use of GPU memory. After each backpropagation, the SCA module builds different weak detectors (different selections of combinations). Thus, with the control of α, the proposed SCATD can construct various weak detectors during the training phase. In the inference stage, we use different values of α, ranging from 3 to 27 (27 denotes that all the combinations in (9) are used to detect foreign objects), to evaluate the performance of our ensemble-learning-enhanced SCATD.

We define sparse attention operators across the RF and SFs as C = {C_0, ..., C_α}, where C_α is one combination of {Q, K, V} vectors. As a result, each training iteration randomly selects a different C to generate multiple weak detectors. The final prediction can be refined by the joint efforts of these weak detectors with ensemble learning. Formally, our SCA is defined as

y_{C_α} = softmax(Q_{C_α} K_{C_α}^T) V_{C_α}    (10)

where α ranges from 3 to 8 during the training phase. Then, a feed-forward network (FFN) is applied to each position separately and identically to transform SCA(Q, K, V)_{C_α}. All the transformed features are fed into the feature aggregation module. In accordance with the settings of the temporal data sampling strategy [22], we randomly select two adjacent frames as SFs during training. The impact of α is discussed in Section IV.

IV. RESULTS AND DISCUSSIONS

In this section, we introduce the implementation details of our experiments and analyze the experimental results of our SCATD. First, we compare our SCATD with different object detection frameworks, including both static image object detection methods and video object detection (VID) methods. Moreover, we use a 10-fold cross-validation strategy to evaluate the effectiveness of our SCATD. The second part presents an ablation study reporting the effects of our SM strategy and the parameter α on the proposed SCATD.

A. Experimental Settings

1) Implementation Details: Following the protocols in [15], we conduct experiments on the high-speed foreign object dataset (HSFD) [15]. To facilitate a fair comparison, we adopt the same settings as in [15]. All our models were implemented in the PyTorch [39] framework with two NVIDIA GTX 1080 Ti GPUs. To compare our method with other VID methods [19]–[24], the learning rate was set to 0.001 and all the methods were trained on the HSFD dataset for 120k iterations. The learning rate is divided by 10 during the last 40k iterations.

2) Datasets: The dataset of foreign objects in OPSs, the HSFD, includes 166 video clips with 1920 × 1080, 2560 × 1440, and 2880 × 2160 pixel images that were captured by different cameras mounted on high-speed trains,
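The random sampling of cross-frame {Q, K, V} combinations and the attention of (10) described above can be sketched as follows. This is an illustrative NumPy version under our own naming assumptions: `sample_combinations`, `sca_attention`, and the mean-fusion of the weak-detector outputs are hypothetical stand-ins, not the authors' released implementation.

```python
import itertools
import random

import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sample_combinations(alpha, frames=("t-K", "t", "t+K")):
    """Randomly pick `alpha` (Q, K, V) frame-index triples out of the
    27 cross-frame combinations enumerated in (9)."""
    return random.sample(list(itertools.product(frames, repeat=3)), alpha)

def sca_attention(feats, alpha):
    """Each sampled triple acts as one weak detector head,
    y_C = softmax(Q_C K_C^T) V_C as in (10); the heads are ensembled
    here by simple averaging (an assumption for illustration)."""
    heads = []
    for fq, fk, fv in sample_combinations(alpha):
        q, k, v = feats[fq], feats[fk], feats[fv]
        heads.append(softmax(q @ k.T) @ v)
    return np.mean(heads, axis=0)

# Toy features for the reference frame and two support frames.
rng = np.random.default_rng(0)
feats = {tag: rng.standard_normal((5, 64)) for tag in ("t-K", "t", "t+K")}
out = sca_attention(feats, alpha=3)
print(out.shape)  # (5, 64)
```

Because every call draws a fresh subset of combinations, repeated training iterations yield different weak detectors, which is the ensemble effect the SCATD exploits.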
2513412 IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 71, 2022

TABLE II
QUANTITATIVE RESULTS OF DIFFERENT SINGLE-FRAME DETECTION METHODS ON THE HSFD DATASET

TABLE III
QUANTITATIVE RESULTS OF DIFFERENT VIDEO OBJECT DETECTION METHODS ON THE HSFD DATASET

as shown in Fig. 6. The video sequences were captured from different train lines between April 2018 and January 2020. The original HSFD dataset includes 166 videos with 2975 OPS images. One category (bird nests), with 2647 examples, is annotated in the HSFD dataset. We retain these annotations and introduce two new foreign object categories, kites and balloons, which were neglected in [15]. Approximately 114 images of 114 kites and 274 pictures of 386 balloons were annotated for our experiments. We divide the HSFD dataset into 132 training videos and 34 test videos and follow the widely used settings in accordance with [19] to facilitate a fair comparison.

3) Evaluation Metrics: The AP is the most frequently used evaluation metric in object detection [26]. More specifically, the final metrics used to evaluate the detection performance include mean AP@50 (mAP@50), mean AP@75 (mAP@75), and mean AP@50:95 (mAP@50:95) [40], averaged over all object categories. In addition, the recall (Re) rate, precision (Pr) rate, and F1-score (F1) [12] are introduced as indicators for evaluating the performance of the SCATD. These metrics are defined as follows:

Precision = TP / (TP + FP)    (11)
Recall = TP / (TP + FN)    (12)
F1-score = (2 × Precision × Recall) / (Precision + Recall)    (13)

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.

B. Overall Performance

The performances of the different static image object detection baselines on the HSFD dataset are listed in Table II. Based on the dense proposal selection strategy, Faster R-CNN [38] performs better than the one-stage detection frameworks. The mAP of Faster R-CNN is 77.8%, which is 3.5% higher than the mAP of FCOS [36]. This result shows the excellent performance of the two-stage network in the small object detection task. The results of DETR [37] on the HSFD dataset are not stable, which demonstrates its relatively poor small object detection performance. However, the DETR method has the minimum number of parameters. Compared with the other single-frame detection frameworks, our proposed method achieves the highest mAP, approximately 85.3%, on the HSFD dataset. Moreover, the proposed single-frame method achieves the highest F1 score. The loss function curves of our single-frame and video object detection models are shown in Figs. 7 and 8. After 80k iterations, when the learning rate decreases, the loss curves of our network converge further. Fig. 9(a)–(c) shows the precision-recall curves of the various approaches. The proposed method achieves the best balance between recall and precision during foreign object detection.

To further evaluate the performance of our SCATD, we compare our method with other VID methods. Table III reports the performance of different VID methods on the HSFD test set. Here, we compare our method with other state-of-the-art techniques without any post-processing. FGFA [19] achieves better predictions than DFF [20] because it selects more temporal spanning ranges to exploit instance-level calibration.
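The metrics in (11)–(13) can be computed directly from the confusion counts; a minimal sketch, where the function name and the example counts are illustrative rather than taken from the paper:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1-score as defined in (11)-(13)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for one foreign-object category.
p, r, f1 = detection_metrics(tp=90, fp=10, fn=30)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.9 0.75 0.818
```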


Fig. 9. Precision-recall curves of different static image object detection methods. (a) FCOS, (b) Faster R-CNN, and (c) the proposed method (single) and
video object detection methods, (d) STSN, (e) STSN-DC5, and (f) the proposed method (video) on the HSFD test set. We list the PR curves for six IoU
values: 0.5, 0.55, 0.6, 0.65, 0.7, and 0.75.

Because the locations, appearances, shapes, and poses of small foreign objects can change dramatically across frames in the HSFD dataset, the relationships of instance-level features across frames are difficult to model. Thus, the mAP of RDN [22] is relatively poor. Overall, with the same backbone network, our SCATD performs better than the other methods. In Fig. 10, the pink, blue, and brown boxes represent different object queries, which can contain fragments of foreign objects from adjacent frames. All the queries are fed into the SCA module to assist the RF in detecting occluded or distorted instances. The mAP of the proposed SCATD reaches 87.2% with ResNet50, an improvement of 4.1% over the best competitor network, STSN-DC5. Importantly, compared with the other VID methods, our SCATD has the minimum number of parameters. Fig. 8 illustrates the loss function curves of our VID method with different optimizers. The loss curves of our network converge after 80k iterations. Fig. 9(d)–(f) shows the precision-recall curves of the various approaches. The proposed method achieves the best balance between precision and recall during foreign object detection.

Fig. 10. Pink, blue, and brown rectangular boxes represent different object queries that could contain foreign objects from adjacent frames. All these queries are fed into the SCA to assist the reference frame in detecting occluded or distorted instances.

Additionally, we use the K-fold cross-validation [41] strategy to verify the effectiveness of our method. Specifically, we set K to 10 in our experiments and compare our SCATD with the other methods [19]–[24]. In the experiments listed in Table III, 80% of the HSFD data were used to train the models. Table IV lists the results of the 10-fold cross-validation experiments, in which 90% of the HSFD data were used to train the different methods. As a result, the performance of the various methods clearly improved. Due to the dramatic changes in foreign object locations and poses, optical flow models might be unreliable. Thus, the mean mAPs of DFF [20] and FGFA [19] are 70.42% and 74.46%, respectively, which are lower than those of the other methods. However, DFF and FGFA have relatively stable performance, with variances of 0.65% and 1.60%, respectively. The reason for this result is that the optical flow estimations between two frames are fixed. Our proposed SCATD achieves 90.83% mean mAP, which is 8.08% and 7.28% higher than those of RDN [22] and the spatiotemporal nonlocal block (STNB) [24], respectively. The variance of our method is also lower than those of the RDN and STNB methods, which demonstrates the stable performance of our SCATD. The best competitor, STSN [23], achieves 85.51% mean mAP, which is still 5.32% lower than that of our SCATD. Importantly, the proposed SCATD achieves the fastest testing FPS


TABLE IV
QUANTITATIVE RESULTS OF DIFFERENT VIDEO OBJECT DETECTION METHODS WITH A 10-FOLD CROSS-VALIDATION STRATEGY

TABLE V
PERFORMANCE AND RUN-TIME COMPARISONS BY USING THE SMFA MODULE OR NOT IN DIFFERENT METHODS. WE ADOPT RESNET50 AS THE BACKBONE NETWORK

Fig. 12. Effect of α, which ranges from 3 to 8 during the training stage. During testing, we use different values of α, ranging from 3 to 27, to evaluate the performance of the SCATD.

while having the lowest FLOPs. Our SCATD is 2.1 times faster than the STNB method and requires 81.31G FLOPs.

C. Ablation Study

During the training and inference phases, each GPU processes three images: one RF and two SFs, which are randomly sampled within the user-defined offsets.

1) Effect of the SMFA Components: To explore the effects of the SMFA components, we integrate the proposed SMFA with the feature aggregation module of various methods. Inspired by [22], after the models are trained for 120k iterations, the SMFA-refined methods are trained for an additional 60k iterations. As illustrated in Table V and Fig. 11, we introduce the SMFA module into DFF [20] and FGFA [19], which use optical flow models to extract temporal information. The mAP of the SMFA-enhanced FGFA increases from 66.82% to 68.69%, a 1.87% improvement over the base FGFA. DFF is improved by 1.13%. For the other methods [23], [24], which capture the spatiotemporal correspondence of the feature maps, the improved STNB and STSN achieve mAPs of 81.20% and 81.23%, respectively. The performance of our method without and with SMFA is 86.15% and 87.20%, respectively. All the evaluated methods prove that our SMFA boosts the baseline performance with a negligible increase in computational overhead, demonstrating the effectiveness of modeling SM during the feature selection phase. Importantly, our SCATD achieves the fastest testing speed, approximately 5.3 tasks per second.

Fig. 11. Improvements in the different methods when using our SMFA module.

2) Effect of SCA Components: We adopt pure ResNet50 [27] as our backbone network and do not utilize spatiotemporal models or optical flow networks to
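The per-GPU frame sampling described above can be sketched as follows. This is a hedged illustration: `sample_training_frames` is our own name, and `max_offset` stands in for the user-defined offset K; the paper does not publish this routine.

```python
import random

def sample_training_frames(num_frames, max_offset):
    """Pick one reference frame (RF) and two support frames (SFs) whose
    indices fall within a user-defined offset of the RF, mirroring the
    per-GPU batch of three frames used during training and inference."""
    rf = random.randrange(num_frames)
    # Candidate SF indices: within max_offset of the RF, excluding the RF.
    candidates = [i for i in range(max(0, rf - max_offset),
                                   min(num_frames, rf + max_offset + 1))
                  if i != rf]
    sfs = random.sample(candidates, 2)  # two distinct support frames
    return rf, sfs

rf, sfs = sample_training_frames(num_frames=100, max_offset=8)
print(rf, sfs)
```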


extract temporal information. Thus, all the gains can be attributed to our SCA module. Essentially, a larger sparse degree α means that more combinations of Q (query), K (key), and V (value) across frames are utilized to build the detector, resulting in a large computational overhead. Thus, we varied α from 3 to 8 in the SCA module to train our models and explore the effects of α on performance.

Ideally, training the detection network with a larger α could achieve the strongest detector for foreign object detection. However, as a result of GPU memory limitations, this processing is both time-consuming and redundant. As illustrated in Fig. 12, as α increases, the performance of our SCATD progressively improves from 76.3% to 81.9%. The best performance is achieved when α is set to 8 during the training phase, which proves that our random selection strategy can significantly improve the performance of our ensemble-learning-enhanced SCATD. Thus, the joint training of multiple weak detectors to build a strong detector in the proposed SCATD has great potential for improving the accuracy of foreign object detection in OPSs.

V. CONCLUSION

This article proposes an effective method for foreign object detection in overhead power systems (OPSs). A two-stage refinement architecture is proposed for extracting foreign object features in OPS images. We present the STE-CNN, which accurately estimates spatial correspondences across frames. Unlike conventional methods, which enhance the reference features by aggregating nearby features with cosine similarity, we propose an SMFA module for refining the feature affinities between the RF and SFs. As a result, similar and critical features can be emphasized with the SM. Moreover, an SCA mechanism is designed in the transformer detector to capture the spatiotemporal features of foreign objects in OPS images. A random selection strategy is employed during transformer training to generate multiple weak detectors. The final predictions can be refined by the joint efforts of these weak detectors via ensemble learning. The experimental results illustrate that our proposed method achieves excellent performance on a real OPS dataset. In the future, we will explore more complex designs for our three components.

ACKNOWLEDGMENT

The authors would like to gratefully acknowledge the support from NVIDIA Corporation for providing the GeForce GTX 1080 Ti used in this research.

REFERENCES

[1] X. Wu, P. Yuan, Q. Peng, C.-W. Ngo, and J.-Y. He, "Detection of bird nests in overhead catenary system images for high-speed rail," Pattern Recognit., vol. 51, pp. 242–254, Mar. 2016.
[2] R. Kouadio, V. Delcourt, L. Heutte, and C. Petitjean, "Video based catenary inspection for preventive maintenance on iris 320," in Proc. World Congr. Railway Res., May 2008, pp. 1–6.
[3] J. Wang, L. Luo, W. Ye, and S. Zhu, "A defect-detection method of split pins in the catenary fastening devices of high-speed railway based on deep learning," IEEE Trans. Instrum. Meas., vol. 69, no. 12, pp. 9517–9525, Dec. 2020.
[4] H. Ji, "Optimization-based incipient fault isolation for the high-speed train air brake system," IEEE Trans. Instrum. Meas., vol. 71, pp. 1–9, 2022.
[5] H. Wang, Z. Liu, A. Núñez, and R. Dollevoet, "Entropy-based local irregularity detection for high-speed railway catenaries with frequent inspections," IEEE Trans. Instrum. Meas., vol. 68, no. 10, pp. 3536–3547, Oct. 2019.
[6] Z. Liu, Y. Lyu, L. Wang, and Z. Han, "Detection approach based on an improved faster RCNN for brace sleeve screws in high-speed railways," IEEE Trans. Instrum. Meas., vol. 69, no. 7, pp. 4395–4403, Jul. 2020.
[7] I. Aydin, M. Karakose, and E. Akin, "A robust anomaly detection in pantograph-catenary system based on mean-shift tracking and foreground detection," in Proc. IEEE Int. Conf. Syst., Man, Cybern., Oct. 2013, pp. 4444–4449.
[8] S. Zheng et al., "Pillar number plate detection and recognition in unconstrained scenarios," J. Circuits, Syst. Comput., vol. 30, no. 11, Sep. 2021, Art. no. 2150201.
[9] H. Hofler, M. Dambacher, N. Dimopoulos, and V. Jetter, "Monitoring and inspecting overhead wires and supporting structures," in Proc. IEEE Intell. Vehicles Symp., Jun. 2004, pp. 512–517.
[10] Y. Wang et al., "A stackable attention-guided multi-scale CNN for number plate detection," in Proc. Int. Conf. Image Graph. Beijing, China: Springer, 2019, pp. 199–209.
[11] S. F. Lu, Z. Liu, and Y. Shen, "Automatic fault detection of multiple targets in railway maintenance based on time-scale normalization," IEEE Trans. Instrum. Meas., vol. 67, no. 4, pp. 849–865, Apr. 2018.
[12] J. Zhong, Z. Liu, Z. Han, Y. Han, and W. Zhang, "A CNN-based defect inspection method for catenary split pins in high-speed railway," IEEE Trans. Instrum. Meas., vol. 68, no. 8, pp. 2849–2860, Aug. 2019.
[13] G. Kang, S. Gao, L. Yu, and D. Zhang, "Deep architecture for high-speed railway insulator surface defect detection: Denoising autoencoder with multitask learning," IEEE Trans. Instrum. Meas., vol. 68, no. 8, pp. 2679–2690, Aug. 2019.
[14] Z. Zhao, G. Xu, and Y. Qi, "Representation of binary feature pooling for detection of insulator strings in infrared images," IEEE Trans. Dielectr., Electr. Insul., vol. 23, no. 5, pp. 2858–2866, Oct. 2016.
[15] W. Lu, W. Xu, Z. Wu, Y. Xu, and Z. Wei, "Video object detection based on non-local prior of spatiotemporal context," in Proc. 8th Int. Conf. Adv. Cloud Big Data (CBD), Dec. 2020, pp. 177–182.
[16] R. Chen and J. He, "Two-stage training method of RetinaNet for bird's nest detection," in Proc. Int. Conf. Bio-Inspired Comput., Theories Appl. Zhengzhou, China: Springer, 2019, pp. 586–596.
[17] M. Ju and C. D. Yoo, "Detection of bird's nest in real time based on relation with electric pole using deep neural network," in Proc. 34th Int. Tech. Conf. Circuits Syst., Comput. Commun. (ITC-CSCC), Jun. 2019, pp. 1–4.
[18] W. Liu et al., "SSD: Single shot MultiBox detector," in Proc. Eur. Conf. Comput. Vis. Amsterdam, The Netherlands: Springer, 2016, pp. 21–37.
[19] X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei, "Flow-guided feature aggregation for video object detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 408–417.
[20] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei, "Deep feature flow for video recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2349–2358.
[21] Y. Chen, Y. Cao, H. Hu, and L. Wang, "Memory enhanced global-local aggregation for video object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10337–10346.
[22] J. Deng, Y. Pan, T. Yao, W. Zhou, H. Li, and T. Mei, "Relation distillation networks for video object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 7023–7032.
[23] G. Bertasius, L. Torresani, and J. Shi, "Object detection in video with spatiotemporal sampling networks," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 331–346.
[24] J. Dai et al., "Deformable convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 764–773.
[25] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[26] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[28] C. Guo et al., "Progressive sparse local attention for video object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3909–3918.


[29] Y. Wang et al., "End-to-end video instance segmentation with transformers," 2020, arXiv:2011.14503.
[30] B. Duke, A. Ahmed, C. Wolf, P. Aarabi, and G. W. Taylor, "SSTVOS: Sparse spatiotemporal transformers for video object segmentation," 2021, arXiv:2101.08833.
[31] D. Neimark, O. Bar, M. Zohar, and D. Asselmann, "Video transformer network," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 3163–3172.
[32] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263–7271.
[33] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[34] A. Bochkovskiy, C.-Y. Wang, and H.-Y. Mark Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020, arXiv:2004.10934.
[35] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[36] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: Fully convolutional one-stage object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9627–9636.
[37] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Proc. Eur. Conf. Comput. Vis. Glasgow, U.K.: Springer, 2020, pp. 213–229.
[38] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[39] A. Paszke et al., "Automatic differentiation in PyTorch," in Proc. NIPS Workshop Autodiff Submission, 2017, pp. 1–4.
[40] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. Zürich, Switzerland: Springer, 2014, pp. 740–755.
[41] J. G. Moreno-Torres, J. A. Saez, and F. Herrera, "Study on the impact of partition-induced dataset shift on k-fold cross-validation," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 8, pp. 1304–1312, Aug. 2012.

Shangdong Zheng was born in Jiangsu, China, in 1995. He received the B.Sc. degree in software engineering from the School of Computer and Communication, Lanzhou University of Technology, Lanzhou, China, in 2017. He is currently pursuing the Ph.D. degree with the Nanjing University of Science and Technology, Nanjing, China.
His research interests include object detection, image processing, deep learning, and their applications in railway foreign object detection.

Zebin Wu (Senior Member, IEEE) received the B.Sc. and Ph.D. degrees in computer science and technology from the Nanjing University of Science and Technology, Nanjing, China, in 2003 and 2007, respectively.
He is currently a Professor with the School of Computer Science and Engineering, Nanjing University of Science and Technology. Before that, he was a Visiting Scholar with the GIPSA-Lab, Grenoble INP, Université Grenoble Alpes, Grenoble, France, from August 2018 to September 2018. He was a Visiting Scholar with the Department of Mathematics, University of California at Los Angeles, Los Angeles, CA, USA, from August 2016 to September 2016 and from July 2017 to August 2017. He was a Visiting Scholar with the Hyperspectral Computing Laboratory, Department of Technology of Computers and Communications, Escuela Politécnica, University of Extremadura, Cáceres, Spain, from June 2014 to June 2015. His research interests include hyperspectral image processing, parallel computing, big data processing, and their applications in railway foreign object detection.

Yang Xu (Member, IEEE) received the B.Sc. degree in applied mathematics and the Ph.D. degree in pattern recognition and intelligence systems from the Nanjing University of Science and Technology (NUST), Nanjing, China, in 2011 and 2016, respectively.
He is currently a Lecturer with the School of Computer Science and Engineering, NUST. His research interests include hyperspectral image classification, hyperspectral detection, image processing, machine learning, and their applications in railway foreign object detection.

Zhihui Wei was born in Jiangsu, China, in 1963. He received the B.Sc. and M.Sc. degrees in applied mathematics and the Ph.D. degree in communication and information system from Southeast University, Nanjing, China, in 1983, 1986, and 2003, respectively.
He is currently a Professor and a Doctoral Supervisor with the Nanjing University of Science and Technology (NUST), Nanjing. His research interests include partial differential equations, mathematical image processing, multiscale analysis, sparse representation, compressive sensing, and their applications in railway foreign object detection.