Intrusion Detection of Foreign Objects in Overhead Power System for Preventive Maintenance in High-Speed Railway Catenary Inspection

IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 71, 2022
the ground truth. The precision [12] (Pr, how many predictions are accurate) and recall [12] (Re, how many ground truths are selected) are necessary for determining the AP. Thus, recent object detection frameworks have utilized these evaluation metrics to assess the performance of their methods.

However, it is a nontrivial task to use these VID methods for foreign object detection in OPSs. First, the size of the OPS images ranges from 1920 × 1080 to 2880 × 2160, which is much larger than the images in the ImageNet VID dataset. In addition, the foreign objects in OPSs are extremely small, and OPS images are captured under varying weather conditions (e.g., rainy and cloudy), which makes it difficult to extract the spatial features of foreign instances. The experiments in Section IV prove that there is a strong need to develop a method dedicated to detecting foreign objects in OPSs.
To realize automatic foreign object detection in OPSs, a sparse cross attention (SCA)-based transformer detector (SCATD) is proposed in this article. During the feature extraction phase, a spatiotemporal enhanced (STE) network is constructed to propagate high-level semantic features across frames in the shallow layers. The STE network establishes spatiotemporal correlations between the feature maps to determine high-level semantic contextual information with a lower computational overhead. Moreover, a spatial memory (SM)-based feature aggregation (SMFA) module is designed to preserve the key spatial features of each frame to guide the feature aggregation operation. Finally, a SCA mechanism is proposed to further improve the precision of foreign object detection via ensemble learning. The specific contributions of this article are as follows.
1) A two-stage feature refinement framework is proposed to progressively emphasize the spatial responses of foreign objects in OPS images. In the first stage, the STE network is constructed to leverage spatiotemporal coherence at the feature level to alleviate motion blur and the partial occlusion of foreign objects across frames. In the second stage, because the STE-CNN establishes only coarse features for foreign objects, a SMFA strategy is designed to iteratively update the feature affinities during training.

2) The SCA mechanism is constructed to leverage temporal coherence for video sequence prediction in the transformer detector. Constrained by the sparse degree α, our SCA sparsely selects several groups from different combinations of Q (Query), K (Key), and V (Value) features across frames to produce weak predictive results. Finally, our SCA adaptively fuses these predictive results to achieve better performance via voting schemes.
The remainder of this article is organized as follows. In Section II, we briefly analyze the overall architecture of our SCATD, and the details of our method are introduced in Section III. The results of our model and comparisons with various state-of-the-art approaches are discussed in Section IV. Finally, Section V concludes and presents perspectives for future work.
II. SYSTEM OVERVIEW

The foreign object video data are captured by front-mounted cameras on the operating vehicle (illustrated in Fig. 1). Conceptually, the video samples are defined as {I_t}_{t=t−K}^{t+K}, where the frame I_t is the reference frame (RF) and K represents the sampling window used to select the support frames (SFs). The video samples acquired in the OPSs are collected under different weather conditions. Because different types of cameras are utilized to capture the OPS images, the size of each image frame ranges from 1920 × 1080 to 2880 × 2160. The foreign objects are located at various locations in OPSs, which increases the difficulty of object detection. In this work, the image processing pipeline has three main components: feature transformation, feature aggregation, and foreign object detection. Algorithm 1 shows the processing of the proposed SCATD, and Figs. 3 and 5 depict the pipeline of our method.

Algorithm 1 Algorithm of our SCATD
Input: video frames {I_t}, sparse degree α
for k = t − K to t + K do                               ▷ initialize feature buffer
    F_k = χ_fea(I_k)                                    ▷ feature extraction network
end for
for t = 1 to ∞ do
    F^c_{t,t+K} = concat[F_t, F_{t+K}]                  ▷ concat operation
    F̂_{t+K} = F_{t+K} + f_ste(F^c_{t,t+K})              ▷ enhanced SFs
    F^e_t, F^e_{t+K} = ε(F_t, F̂_{t+K})                  ▷ embedding features
    w_{t,t+K} = exp((F^e_t · F^e_{t+K}) / (|F^e_t||F^e_{t+K}|))   ▷ aggregation weight
    M_t = M_{t−K} + f_SMFA(F_t)                         ▷ compute spatial memory
end for
F^sm_t = M_t ⊗ Σ_{K=−T}^{T} w_{t,t+K} · F_{t+K}         ▷ aggregate features
Q_t, K_t, V_t |_{t=t−K}^{t+K} = Linear(F_{t+K})         ▷ linear layer
Q^sm_t, K^sm_t, V^sm_t = Linear(F^sm_t)                 ▷ linear layer
C_α = f_rs(comb(Q^sm_t, K^sm_t, V^sm_t, Q_{t+K}, K_{t+K}, V_{t+K}))   ▷ randomly selected combinations
F^en_t = f_FA(Enc_SCA(C_α))                             ▷ SCA encoder
F^de_t = Dec(F^en_t)                                    ▷ decoder
Output: detection results {D(F^de_t)}
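For concreteness, the control flow of Algorithm 1 can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' released code: backbone, ste, smfa_aggregate, and sca_detector are placeholder modules standing in for χ_fea, f_ste, the SMFA stage, and the SCA encoder-decoder, respectively.

```python
import torch

def scatd_forward(frames, backbone, ste, smfa_aggregate, sca_detector):
    """High-level sketch of Algorithm 1; all modules are placeholders."""
    feats = [backbone(f) for f in frames]   # feature buffer, RF in the middle
    f_t = feats[len(feats) // 2]            # reference-frame feature F_t
    # Stage 1 (STE): enhance each support frame with RF context, eq. (1)
    enhanced = [sf + ste(torch.cat([f_t, sf], dim=1)) for sf in feats]
    # Stage 2 (SMFA): memory-guided weighted aggregation into F^sm_t
    f_sm = smfa_aggregate(f_t, enhanced)
    # SCA transformer: sparse Q/K/V combinations -> encoder -> decoder -> boxes
    return sca_detector(f_sm, enhanced)
```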
A. Feature Transformation

ResNet50 [27] is a frequently used backbone network for static image detection and VID [19], [24]. Because of its considerable ability to learn high-level semantic information, recent VID methods [19]–[24] have utilized ResNet50 as their backbone network for feature extraction. To facilitate a fair comparison, the proposed SCATD also uses ResNet50 to extract the features of foreign objects in OPS images. One image I_t ∈ R^{1×3×608×608} is taken as the input of ResNet50, and the extracted feature F_t ∈ R^{1×1024×(608/32)×(608/32)} is fed into the feature transformation network, which propagates spatiotemporal information to alleviate the motion blur and partial occlusions caused by deteriorated frame quality. Foreign objects are often covered by electrical components (e.g., insulators, pillars, and anchors) in OPSs. An efficient approach to address this issue is to propagate complementary spatiotemporal information across frames, which is the role of the STE network introduced in Section III.
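As a point of reference, truncating a stock torchvision ResNet50 after layer3 yields the 1024-channel feature described above. Note that a stock layer3 runs at stride 16 (38 × 38 for a 608 × 608 input), so the 608/32 = 19 resolution reported here presumably reflects an additional stride or dilation adjustment not shown in this sketch:

```python
import torch
from torchvision.models import resnet50

backbone = resnet50(weights=None)
# keep conv1 ... layer3 (1024 output channels); drop layer4, avgpool, and fc
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-3])

x = torch.randn(1, 3, 608, 608)   # one OPS frame resized to 608 x 608
f_t = feature_extractor(x)        # (1, 1024, 38, 38) with the stock stride-16 layer3
```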
B. Feature Aggregation

First, the STE-CNN is employed to determine the spatiotemporal features of foreign objects in images. Then, the SMFA module is developed to iteratively update the feature affinities. In the feature aggregation module, similarity weight matrices are estimated to mitigate appearance degradation between the RF and all SFs. Because foreign objects may be obscured by electrical components in certain frames, the spatial information of foreign objects in adjacent frames [19]–[24] can be explored to detect occluded instances. Thus, the SMFA module is designed to emphasize the spatial responses of foreign objects in the RF.
C. Foreign Object Detection

After the spatiotemporal correlations between the RF and all SFs are emphasized by our STE-CNN and SMFA module, the final SCA mechanism is proposed in the transformer detector to model the spatiotemporal relations of the video sequence via ensemble learning. To reduce the computational cost and memory overhead, we design a random selection strategy to capture the spatiotemporal features of foreign objects. The sparse degree parameter α determines the complexity of our ensemble detection processing. The impact of α is discussed in Section IV.
III. DETECTION MODULE

A. Spatiotemporal Enhanced CNN

With one RF feature defined as F_t ∈ R^{1×1024×H×W} and one SF feature defined as F_{t+K} ∈ R^{1×1024×H×W}, the proposed STE network aims to align the temporal information between these features in the channel and spatial dimensions to reduce false positives and inaccurate location detection. The framework of the STE network is shown in Table I and can be summarized as

    \hat{F}_{t+K} = F_{t+K} + f_{ste}(\mathrm{concat}[F_t, F_{t+K}])    (1)

where \hat{F}_{t+K} denotes the output of the STE network and f_{ste} denotes the STE operation.

First, we concatenate F_t and F_{t+K} in the channel dimension; the concatenated feature is denoted as F^c_{t,t+K} ∈ R^{1×2048×H×W}. A pair of global avg-pooling and max-pooling operations is applied in the spatial dimension, transferring Avg(F^c_{t,t+K}) and Max(F^c_{t,t+K}) to R^{1×2048×1×1}. As shown in Table I, different sets of convolutional layers φ (Layers 3 and 4) and ϕ (Layers 5 and 6) are utilized to transfer F^c_{t,t+K} to various high-level semantic features. We merge these two features with an elementwise summation and normalize each channel element with a sigmoid layer to generate a channel attention mask. This operation can be formulated as

    \tilde{F}^c_{t,t+K} = F^c_{t,t+K} \otimes \mathrm{sigmoid}(\phi(\mathrm{Avg}(F^c_{t,t+K})) + \varphi(\mathrm{Max}(F^c_{t,t+K})))    (2)

where ⊗ represents the elementwise product. The preliminary attention-enhanced feature \tilde{F}^c_{t,t+K} ∈ R^{1×2048×H×W} is then fed into the next attention branch.

The global pooling operations in (2) compress the spatial dimension of F^c_{t,t+K} to 1 × 1, so the spatial features of the foreign objects might be missed during attention learning. Thus, our STE network explores more abundant features of foreign objects from the spatial aspect:

    \hat{F}_{t+K} = \tilde{F}^c_{t,t+K} \otimes \mathrm{sigmoid}(\psi([\mathrm{Avg}(\tilde{F}^c_{t,t+K}), \mathrm{Max}(\tilde{F}^c_{t,t+K})]))    (3)

where ψ (Layer 13) denotes a convolutional operation. Here, a pair of global avg-pooling and max-pooling operations is applied in the channel dimension of \tilde{F}^c_{t,t+K}, and the results Avg(\tilde{F}^c_{t,t+K}) and Max(\tilde{F}^c_{t,t+K}) are transferred to R^{1×1×H×W}. A sigmoid function is then utilized to normalize each spatial element to generate another attention mask, and the final STE-enhanced feature \hat{F}_{t+K} is obtained with the elementwise product ⊗. As a result, the STE network captures the most similar and critical features to emphasize the responses of foreign objects from both the channel and spatial aspects, reducing false positives and inaccurate location detection. The enhanced SF feature \hat{F}_{t+K} and the original RF feature F_t are then sent to the next stage for efficient feature aggregation.
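A minimal PyTorch sketch of f_ste as described by (1)-(3) follows. The channel widths match the text (a 2048-channel concatenated input); the bottleneck ratio, the 7 × 7 spatial kernel, and the final 1 × 1 reduction back to 1024 channels are our assumptions, since Table I is not reproduced in this excerpt:

```python
import torch
import torch.nn as nn

class STE(nn.Module):
    """Sketch of f_ste: channel attention (2) followed by spatial attention (3)."""

    def __init__(self, channels=2048, out_channels=1024, ratio=16):
        super().__init__()
        self.phi = nn.Sequential(nn.Conv2d(channels, channels // ratio, 1), nn.ReLU(),
                                 nn.Conv2d(channels // ratio, channels, 1))
        self.varphi = nn.Sequential(nn.Conv2d(channels, channels // ratio, 1), nn.ReLU(),
                                    nn.Conv2d(channels // ratio, channels, 1))
        self.psi = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # spatial-mask conv (assumed size)
        self.reduce = nn.Conv2d(channels, out_channels, 1)    # assumed: back to the SF width

    def forward(self, f_t, f_sf):
        f_c = torch.cat([f_t, f_sf], dim=1)                   # F^c_{t,t+K}
        avg = f_c.mean(dim=(2, 3), keepdim=True)              # global avg-pool -> 1x1
        mx = f_c.amax(dim=(2, 3), keepdim=True)               # global max-pool -> 1x1
        f_c = f_c * torch.sigmoid(self.phi(avg) + self.varphi(mx))     # eq. (2)
        pools = torch.cat([f_c.mean(1, keepdim=True),
                           f_c.amax(1, keepdim=True)], dim=1)          # channel-wise pools
        f_c = f_c * torch.sigmoid(self.psi(pools))                     # eq. (3)
        return f_sf + self.reduce(f_c)                                 # eq. (1)
```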
B. Spatial Memory-Based Feature Aggregation Strategy

Feature aggregation is an efficient method for mitigating appearance degradation between the RF and all SFs during video detection. According to [19], the final refined RF feature \bar{F}_t can be formulated as

    \bar{F}_t(p) = \sum_{K=-T}^{T} w_{t,t+K}(p) \cdot F_{t+K}(p)    (4)

where the weight w indicates the spatial similarity between each SF and the RF at position p. To compute the weight w, a 3-layer embedding network ε is applied to the RF and SF features, and the exponential of the cosine similarity of the embedded features is taken, as listed in Algorithm 1.
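The aggregation in (4), with the exponential cosine weight from Algorithm 1, can be sketched as below. The embedding network ε is passed in as a generic module, and the per-frame weights are left unnormalized because this excerpt does not state a normalization step (FGFA-style methods typically apply a softmax over frames):

```python
import torch

def smfa_aggregate_weights(f_t, support_feats, embed):
    """Eq. (4): position-wise weighted sum of (enhanced) support-frame features."""
    out = torch.zeros_like(f_t)
    e_t = embed(f_t)                                   # embedded RF feature
    for f_sf in support_feats:                         # F_{t+K}, K = -T .. T
        e_sf = embed(f_sf)
        # cosine similarity at every spatial position p, then exp (Algorithm 1)
        w = torch.exp(torch.cosine_similarity(e_t, e_sf, dim=1)).unsqueeze(1)
        out = out + w * f_sf                           # broadcast over channels
    return out
```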
Fig. 3. In practice, we randomly select three frames from one video {I_t}_{t=t−K}^{t+K} to introduce our SCATD. The corresponding feature maps are produced by the ResNet50 backbone network. In the first stage, the spatiotemporal enhanced (STE)-CNN is employed to determine the spatiotemporal information of foreign objects in video sequences. In the second stage, the SMFA module is developed to iteratively update the feature affinities. In the detector, the upgraded reference frame feature and all enhanced support frame features are used to learn the object relations with the sparse cross attention (SCA) mechanism. Finally, ensemble learning is utilized to improve the precision of foreign object detection.
Fig. 4. Orange and gray rectangular boxes represent bird nests and background areas, respectively. (a) Similarity matrices computed by the original feature aggregation method. (b) Similarity matrices computed by our SMFA module. The red pixel blocks, whose values range from 0.1 to 1.0, represent the level of similarity between two adjacent frames.

Fig. 5. Architecture of our sparse cross attention (SCA) transformer encoder.
Neimark et al. [31] proposed a transformer-based method for recognizing video instances. Because this method uses a complete video as the input during the inference stage, this transformer network is more suitable for long video recognition tasks.

The attention mechanism in the transformer is based on a trainable associative memory with (key, value) vector pairs. The query matrices are matched against a set of key vectors using inner products, and the results are normalized by a softmax layer to obtain the similarity weights. Formally, this process can be formulated as

    y = \mathrm{softmax}(Q K^{\top}) V    (8)

where Q, K, and V ∈ R^{C×SP} represent the query, key, and value vectors, respectively, and C and SP denote the channel and spatial dimensions.

Inspired by [30], [31], our SCA module proposes an efficient variant of self-attention to capture the features of foreign objects in video sequences. Considering a multiframe pipeline to produce attention masks, these three matrices can be denoted as Q, K, V ∈ R^{T×C×SP}, where T denotes the temporal dimension.

As shown in Fig. 5, the SCA module uses the sequence features F^sm_t, F_{t+K}, and F_{t−K} from the former network as its inputs. The corresponding query, key, and value matrices are represented by Q^sm_t, K^sm_t, V^sm_t and Q_t, K_t, V_t |_{t=t−K}^{t+K}. We merge the different combinations of feature matrices using a concatenation operation in the temporal dimension. Because there are a large number of combinations of {Q, K, V}_t |_{t=t−K}^{t+K}, we cannot simply use (8) to perform the self-attention operations during training. For example, all the combinations of {Q, K, V}_t |_{t=t−K}^{t+K} are formulated as

    \begin{bmatrix}
    (Q_t K_t V_t) & (Q_{t-K} K_{t-K} V_{t-K}) \\
    (Q_{t+K} K_{t+K} V_{t+K}) & (Q_{t+K} K_{t+K} V_{t-K}) \\
    (Q_t K_t V_{t+K}) & (Q_t K_{t+K} V_t) \\
    \cdots & \cdots \\
    (Q_{t-K} K_{t+K} V_{t-K}) & (Q_{t-K} K_{t-K} V_{t+K})
    \end{bmatrix}    (9)

The ideal detector could be achieved by training the detection network with all the combinations shown in (9). However, due to GPU memory limitations, this processing is extremely time-consuming and redundant. Thus, a random selection strategy is employed to overcome this computational barrier. Under the control of the user-defined parameter α, our SCA randomly selects α pairs of combinations of {Q, K, V}_t |_{t=t−K}^{t+K} to form the sparse input C_α of Algorithm 1.
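A sketch of this random selection, f_rs: enumerate the (Q, K, V) frame-index triples of (9), sample α of them, run the plain attention of (8) on each, and fuse the weak outputs. The uniform sampling and the mean fusion are our simplifications; the paper fuses the weak predictive results adaptively via voting:

```python
import itertools
import random
import torch

def sparse_cross_attention(Q, K, V, alpha):
    """Q, K, V: dicts mapping a frame index (e.g. -K, 0, +K) to a (C, SP) matrix."""
    combos = list(itertools.product(Q.keys(), repeat=3))  # all (q, k, v) frame triples
    outputs = []
    for q, k, v in random.sample(combos, alpha):          # f_rs: keep alpha combinations
        attn = torch.softmax(Q[q] @ K[k].T, dim=-1)       # eq. (8), no extra scaling
        outputs.append(attn @ V[v])                       # one weak attention result
    return torch.stack(outputs).mean(0)                   # simplified fusion of weak results
```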
Fig. 7. Loss curve of our static image detection model with different optimizers. (a) SGD. (b) Adam.

Fig. 8. Loss curve of our video object detection model with different optimizers. (a) SGD. (b) Adam.

TABLE II
QUANTITATIVE RESULTS OF DIFFERENT SINGLE-FRAME DETECTION METHODS ON THE HSFD DATASET

TABLE III
QUANTITATIVE RESULTS OF DIFFERENT VIDEO OBJECT DETECTION METHODS ON THE HSFD DATASET
as shown in Fig. 6. The video sequences were captured from different train lines between April 2018 and January 2020. The original HSFD dataset includes 166 videos with 2975 OPS images. One category (bird nests), with 2647 examples, is annotated in the HSFD dataset. We retain these annotations and introduce two new foreign object categories, kites and balloons, which were neglected in [15]. Approximately 114 images of 114 kites and 274 pictures of 386 balloons were annotated for our experiments. We divide the HSFD dataset into 132 training videos and 34 test videos and follow the widely used settings in accordance with [19] to facilitate a fair comparison.
3) Evaluation Metrics: The AP is the most frequently used evaluation metric in object detection [26]. More specifically, the final metrics used to evaluate the detection performance include mean AP@50 (mAP@50), mean AP@75 (mAP@75), and mean AP@50:95 (mAP@50:95) [40], averaged over all object categories. In addition, the recall (Re) rate, precision (Pr) rate, and F1-score (F1) [12] are introduced as indicators for evaluating the performance of the SCATD. These metrics are defined as follows:

    \mathrm{Precision} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})    (11)

    \mathrm{Recall} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})    (12)

    \mathrm{F1\text{-}score} = (2 \times \mathrm{Precision} \times \mathrm{Recall}) / (\mathrm{Precision} + \mathrm{Recall})    (13)

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
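For reference, (11)-(13) in code, with illustrative counts (not taken from the paper):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)                            # eq. (11)
    recall = tp / (tp + fn)                               # eq. (12)
    f1 = 2 * precision * recall / (precision + recall)    # eq. (13)
    return precision, recall, f1

# e.g., 90 correct detections, 10 false alarms, 30 missed objects
print(precision_recall_f1(90, 10, 30))   # -> (0.9, 0.75, 0.8181...)
```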
B. Overall Performance

The performances of the different static image object detection baselines on the HSFD dataset are listed in Table II. Based on its dense proposal selection strategy, Faster R-CNN [38] performs better than the one-stage detection frameworks. The mAP of Faster R-CNN is 77.8%, which is 3.5% higher than the mAP of FCOS [36]. This result shows the excellent performance of the two-stage network in the small object detection task. The results of DETR [37] on the HSFD dataset are not stable, which demonstrates its relatively poor small object detection performance. However, the DETR method has the minimum number of parameters. Compared with the other single-frame detection frameworks, our proposed method achieves the highest mAP, approximately 85.3%, on the HSFD dataset. Moreover, the proposed single-frame method achieves the highest F1-score. The loss curves of our single-frame and video object detection models are shown in Figs. 7 and 8. After 80k iterations, when the learning rate decreases, the loss curves of our network converge further. Fig. 9(a)-(c) shows the precision-recall curves of the various approaches. The proposed method achieves the best balance between recall and precision during foreign object detection.

To further evaluate the performance of our SCATD, we compare our method with other VID methods. Table III reports the performance of different VID methods on the HSFD test set. Here, we compare our method with other state-of-the-art techniques without any post-processing. FGFA [19] achieves better predictions than DFF [20] because it selects more temporal spanning ranges to exploit instance-level calibration. Because the locations, appearances, shapes, and poses
Fig. 9. Precision-recall curves of different static image object detection methods: (a) FCOS, (b) Faster R-CNN, and (c) the proposed method (single); and video object detection methods: (d) STSN, (e) STSN-DC5, and (f) the proposed method (video) on the HSFD test set. We list the PR curves for six IoU values: 0.5, 0.55, 0.6, 0.65, 0.7, and 0.75.
TABLE IV
QUANTITATIVE RESULTS OF DIFFERENT VIDEO OBJECT DETECTION METHODS WITH A 10-FOLD CROSS-VALIDATION STRATEGY

TABLE V
PERFORMANCE AND RUN-TIME COMPARISONS OF DIFFERENT METHODS WITH AND WITHOUT THE SMFA MODULE. WE ADOPT RESNET50 AS THE BACKBONE NETWORK

Fig. 12. Effect of α, which ranges from 3 to 8 during the training stage. During testing, we use different values of α, ranging from 3 to 27, to evaluate the performance of the SCATD.
extract temporal information. Thus, all the gains can be attributed to our SCA module. Essentially, a larger sparse degree α means that more combinations of Q (Query), K (Key), and V (Value) features across frames are utilized to train the detector, resulting in a large computational overhead. Thus, we varied α from 3 to 8 in the SCA module to train our models and explore the effect of α on performance.

Ideally, training the detection network with a larger α could yield the strongest detector for foreign object detection. However, as a result of GPU memory limitations, this processing is both time-consuming and redundant. As illustrated in Fig. 12, as α increases, the performance of our SCATD progressively improves from 76.3% to 81.9%. The best performance is achieved when α is set to 8 during the training phase, which proves that our random selection strategy can significantly improve the performance of our ensemble-learning-enhanced SCATD. Thus, the joint training of multiple weak detectors to build a strong detector in the proposed SCATD has great potential for improving the accuracy of foreign object detection in OPSs.
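The voting scheme itself is not spelled out in this excerpt; one plausible minimal reading, offered only as an assumption, is soft voting, i.e., averaging the class scores produced by the α weak detectors over a shared set of candidate boxes before the final thresholding:

```python
import torch

def soft_vote(weak_scores):
    """weak_scores: list of (num_boxes, num_classes) tensors, one per weak detector."""
    return torch.stack(weak_scores).mean(dim=0)   # fused scores for final thresholding

# usage sketch: fused = soft_vote([head(c) for c in sampled_combinations])
```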
V. CONCLUSION

This article proposes an effective method for foreign object detection in overhead power systems (OPSs). A two-stage refinement architecture is proposed for extracting foreign object features in OPS images. We present the STE-CNN, which accurately estimates spatial correspondences across frames. Unlike conventional methods, which enhance the reference features by aggregating nearby features with cosine similarity, we propose a SMFA module for refining the feature affinities between the RF and SFs. As a result, similar and critical features can be emphasized with the SM. Moreover, a SCA mechanism is designed in the transformer detector that captures the spatiotemporal features of foreign objects in OPS images. A random selection strategy is employed during transformer training to generate multiple weak detectors. The final predictions are refined by the joint efforts of these weak detectors via ensemble learning. The experimental results illustrate that our proposed method achieves excellent performance on a real OPS dataset. In the future, we will explore more complex designs for our three components.

ACKNOWLEDGMENT

The authors would like to gratefully acknowledge the support from NVIDIA Corporation for providing the GeForce GTX 1080 Ti used in this research.
REFERENCES

[1] X. Wu, P. Yuan, Q. Peng, C.-W. Ngo, and J.-Y. He, "Detection of bird nests in overhead catenary system images for high-speed rail," Pattern Recognit., vol. 51, pp. 242–254, Mar. 2016.
[2] R. Kouadio, V. Delcourt, L. Heutte, and C. Petitjean, "Video based catenary inspection for preventive maintenance on IRIS 320," in Proc. World Congr. Railway Res., May 2008, pp. 1–6.
[3] J. Wang, L. Luo, W. Ye, and S. Zhu, "A defect-detection method of split pins in the catenary fastening devices of high-speed railway based on deep learning," IEEE Trans. Instrum. Meas., vol. 69, no. 12, pp. 9517–9525, Dec. 2020.
[4] H. Ji, "Optimization-based incipient fault isolation for the high-speed train air brake system," IEEE Trans. Instrum. Meas., vol. 71, pp. 1–9, 2022.
[5] H. Wang, Z. Liu, A. Núñez, and R. Dollevoet, "Entropy-based local irregularity detection for high-speed railway catenaries with frequent inspections," IEEE Trans. Instrum. Meas., vol. 68, no. 10, pp. 3536–3547, Oct. 2019.
[6] Z. Liu, Y. Lyu, L. Wang, and Z. Han, "Detection approach based on an improved faster RCNN for brace sleeve screws in high-speed railways," IEEE Trans. Instrum. Meas., vol. 69, no. 7, pp. 4395–4403, Jul. 2020.
[7] I. Aydin, M. Karakose, and E. Akin, "A robust anomaly detection in pantograph-catenary system based on mean-shift tracking and foreground detection," in Proc. IEEE Int. Conf. Syst., Man, Cybern., Oct. 2013, pp. 4444–4449.
[8] S. Zheng et al., "Pillar number plate detection and recognition in unconstrained scenarios," J. Circuits, Syst. Comput., vol. 30, no. 11, Sep. 2021, Art. no. 2150201.
[9] H. Hofler, M. Dambacher, N. Dimopoulos, and V. Jetter, "Monitoring and inspecting overhead wires and supporting structures," in Proc. IEEE Intell. Vehicles Symp., Jun. 2004, pp. 512–517.
[10] Y. Wang et al., "A stackable attention-guided multi-scale CNN for number plate detection," in Proc. Int. Conf. Image Graph. Beijing, China: Springer, 2019, pp. 199–209.
[11] S. F. Lu, Z. Liu, and Y. Shen, "Automatic fault detection of multiple targets in railway maintenance based on time-scale normalization," IEEE Trans. Instrum. Meas., vol. 67, no. 4, pp. 849–865, Apr. 2018.
[12] J. Zhong, Z. Liu, Z. Han, Y. Han, and W. Zhang, "A CNN-based defect inspection method for catenary split pins in high-speed railway," IEEE Trans. Instrum. Meas., vol. 68, no. 8, pp. 2849–2860, Aug. 2019.
[13] G. Kang, S. Gao, L. Yu, and D. Zhang, "Deep architecture for high-speed railway insulator surface defect detection: Denoising autoencoder with multitask learning," IEEE Trans. Instrum. Meas., vol. 68, no. 8, pp. 2679–2690, Aug. 2019.
[14] Z. Zhao, G. Xu, and Y. Qi, "Representation of binary feature pooling for detection of insulator strings in infrared images," IEEE Trans. Dielectr. Electr. Insul., vol. 23, no. 5, pp. 2858–2866, Oct. 2016.
[15] W. Lu, W. Xu, Z. Wu, Y. Xu, and Z. Wei, "Video object detection based on non-local prior of spatiotemporal context," in Proc. 8th Int. Conf. Adv. Cloud Big Data (CBD), Dec. 2020, pp. 177–182.
[16] R. Chen and J. He, "Two-stage training method of RetinaNet for bird's nest detection," in Proc. Int. Conf. Bio-Inspired Comput., Theories Appl. Zhengzhou, China: Springer, 2019, pp. 586–596.
[17] M. Ju and C. D. Yoo, "Detection of bird's nest in real time based on relation with electric pole using deep neural network," in Proc. 34th Int. Tech. Conf. Circuits Syst., Comput. Commun. (ITC-CSCC), Jun. 2019, pp. 1–4.
[18] W. Liu et al., "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis. Amsterdam, The Netherlands: Springer, 2016, pp. 21–37.
[19] X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei, "Flow-guided feature aggregation for video object detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 408–417.
[20] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei, "Deep feature flow for video recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2349–2358.
[21] Y. Chen, Y. Cao, H. Hu, and L. Wang, "Memory enhanced global-local aggregation for video object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10337–10346.
[22] J. Deng, Y. Pan, T. Yao, W. Zhou, H. Li, and T. Mei, "Relation distillation networks for video object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 7023–7032.
[23] G. Bertasius, L. Torresani, and J. Shi, "Object detection in video with spatiotemporal sampling networks," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 331–346.
[24] J. Dai et al., "Deformable convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 764–773.
[25] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[26] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[28] C. Guo et al., "Progressive sparse local attention for video object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3909–3918.
[29] Y. Wang et al., "End-to-end video instance segmentation with transformers," 2020, arXiv:2011.14503.
[30] B. Duke, A. Ahmed, C. Wolf, P. Aarabi, and G. W. Taylor, "SSTVOS: Sparse spatiotemporal transformers for video object segmentation," 2021, arXiv:2101.08833.
[31] D. Neimark, O. Bar, M. Zohar, and D. Asselmann, "Video transformer network," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 3163–3172.
[32] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263–7271.
[33] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[34] A. Bochkovskiy, C.-Y. Wang, and H.-Y. Mark Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020, arXiv:2004.10934.
[35] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[36] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: Fully convolutional one-stage object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9627–9636.
[37] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Proc. Eur. Conf. Comput. Vis. Glasgow, U.K.: Springer, 2020, pp. 213–229.
[38] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[39] A. Paszke et al., "Automatic differentiation in PyTorch," in Proc. NIPS Workshop Autodiff Submission, 2017, pp. 1–4.
[40] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. Zürich, Switzerland: Springer, 2014, pp. 740–755.
[41] J. G. Moreno-Torres, J. A. Saez, and F. Herrera, "Study on the impact of partition-induced dataset shift on k-fold cross-validation," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 8, pp. 1304–1312, Aug. 2012.

Zebin Wu (Senior Member, IEEE) received the B.Sc. and Ph.D. degrees in computer science and technology from the Nanjing University of Science and Technology, Nanjing, China, in 2003 and 2007, respectively. He is currently a Professor with the School of Computer Science and Engineering, Nanjing University of Science and Technology. Before that, he was a Visiting Scholar with the GIPSA-Lab, Grenoble INP, Université Grenoble Alpes, Grenoble, France, from August 2018 to September 2018. He was a Visiting Scholar with the Department of Mathematics, University of California at Los Angeles, Los Angeles, CA, USA, from August 2016 to September 2016 and from July 2017 to August 2017. He was a Visiting Scholar with the Hyperspectral Computing Laboratory, Department of Technology of Computers and Communications, Escuela Politécnica, University of Extremadura, Cáceres, Spain, from June 2014 to June 2015. His research interests include hyperspectral image processing, parallel computing, big data processing, and their applications in railway foreign object detection.

Yang Xu (Member, IEEE) received the B.Sc. degree in applied mathematics and the Ph.D. degree in pattern recognition and intelligence systems from the Nanjing University of Science and Technology (NUST), Nanjing, China, in 2011 and 2016, respectively. He is currently a Lecturer with the School of Computer Science and Engineering, NUST. His research interests include hyperspectral image classification, hyperspectral detection, image processing, machine learning, and their applications in railway foreign object detection.