A Visual Vibration Characterization Method For Int - 2023 - Mechanical Systems A
ARTICLE INFO
Communicated by Janko Slavič
Keywords: Intelligent fault diagnosis; Vibration extraction; Vibration image; Deep learning

ABSTRACT
Health monitoring and fault diagnosis are key to ensuring the safe operation of equipment. This work proposes a novel fault diagnosis method based on visual extraction and vibration characterization. Instead of using conventional accelerometers to obtain fault data, the visual extraction method obtains full-field vibration information with rich texture features and produces no mass loading effect on the measured object. The method extracts time-domain vibration information from the collected image sequences through the image phase difference and then encodes it into gray-scale images as input for a convolutional neural network model. Experimental results on the bearing vibration image dataset show that the proposed method achieves high classification and recognition accuracy in fault diagnosis.
1. Introduction
With the development of intelligent manufacturing technology, the complexity, precision and intelligence of modern industrial
equipment have been continuously improved. In order to ensure the safe operation of mechanical systems, fault detection and
diagnosis technology is used as an effective means to assess motor reliability and reduce the risk of unplanned shutdowns. It can not
only increase productivity but also reduce maintenance costs [1,2]. For rotating machinery equipment, analyzing and processing the
vibration signal [3–5] is an effective method to judge the operating state and the fault condition of the machine.
Fault diagnosis methods mostly use vibration data collected by contact accelerometers. However, an accelerometer can cause a mass loading effect when the measured object is light, and it can affect the operation of small mechanical structures. Moreover, installing a large number of accelerometers on the surface is time-consuming and labor-intensive when measuring large objects. To achieve multi-point vibration measurement, multiple accelerometers must be installed simultaneously, and the extracted vibration information is difficult to synchronize, which leads to a complex signal processing procedure.
In recent years, vision-based vibration measurement has been promoted and applied in practice due to its low cost, good flexibility, and full-field vibration measurement ability. Unlike the conventional contact measurement method, the vision-based method obtains vibration images of the measured object with a digital camera and then extracts the feature information through image processing. This method achieves a wider measurement range and adapts to different complex environments. Fleet et al. [6,7] proposed the optical flow method based on the video phase, which extracts motion by analyzing the phase change of the image instead of processing the original pixel intensity values. Chen et al. [8] combined this idea with the motion amplification algorithm and extracted the resonance frequencies and motion deformation modes of several laboratory-scale reference structures. Moreover, they [9] extracted the displacement to perform simple modal and frequency analysis from video vibration measurements of small mechanical structures. Wahbeh, Pan and Lee et al. [10–12] applied visual vibration measurement to large-scale civil structures and obtained the displacement information due to vibration. Javh et al. [13] proposed a simplified gradient-based optical flow method and carried out experiments on steel beam structures to achieve sub-pixel precision displacement and vibration measurement. Caetano et al. [14] developed a vision system based on the optical flow method, with which the vibration of large engineering structures was observed to provide an important data source for more complete vibration monitoring.
* Corresponding author.
E-mail address: [email protected] (C. Peng).
https://doi.org/10.1016/j.ymssp.2023.110229
Received 1 September 2022; Received in revised form 15 February 2023; Accepted 19 February 2023
Available online 3 March 2023
0888-3270/© 2023 Elsevier Ltd. All rights reserved.
C. Peng et al. Mechanical Systems and Signal Processing 192 (2023) 110229
Fig. 1. Flow chart of transforming image sequences into time-domain vibration signal through Gabor filter.
However, the main problems of the optical flow method are its high computational complexity and strict condition requirements. Different from the traditional optical flow method, Wadhwa and Liu et al. [15,16] provided new ideas for visual vibration measurement by performing motion analysis and amplification on the video. Their approach directly calculates the pixel displacement from the local phase change, which reduces the computational complexity. Based on the Eulerian perspective, Wu et al. [17] proposed a magnification algorithm for subtle movements. Furthermore, Wadhwa et al. [18] improved phase-based video motion processing so that the noise is not amplified together with the motion, which achieves a better amplification effect. Based on these methods, Peng et al. [19,20] used video collected by a camera to accurately obtain vibration information and improve the active vibration reduction performance of a magnetic levitation rotor. Zona et al. [21] discussed the results of vision-based structural and infrastructure vibration monitoring in full-scale civil engineering field tests.
In the field of fault detection and diagnosis, vibration data is often characterized by hand-designed filters. Bin et al. detected early faults of rotating machinery by extracting vibration features based on wavelet packets and empirical mode decomposition [22]. Currently, visual vibration extraction is mostly combined with structural dynamics identification and modal analysis. Sarrafi et al. [23] extracted damage-sensitive features from acquired vibration video to monitor the structural health of wind turbine blades. However, these methods require human prior knowledge, which increases the complexity of fault diagnosis. With the rapid development of big data technology, deep learning [24] has become a very active field of fault diagnosis research [25,26]. Deep learning can effectively learn features from massive historical data and accurately construct the relationship between signal data and fault types in a multi-layer structure. Miao et al. [27] proposed a deep sparse representation network for fault diagnosis of rotating machinery, which can suppress noise and learn effective features directly from noisy signals. Zhao et al. [28] improved the feature learning ability of deep learning methods to conduct rotating machinery fault diagnosis with highly noisy vibration signals. Thus far, visual vibration extraction has not been applied to fault diagnosis methods based on deep learning. Meanwhile, existing data-driven fault diagnosis methods share the problem that it is difficult to collect enough effective data in a short time, and, because effective data processing methods are lacking, it is difficult to obtain a well-fitted model. Selecting a suitable vibration extraction and characterization method is an effective way to solve these problems.
This work focuses on the intelligent fault diagnosis of rotating machinery by using the bearing vibration image sequences collected
from a high frame rate industrial camera. The main contributions of this work are as follows:
1) Full-field vibration information extracted by a vision-based method is proposed for fault diagnosis without conventional contact accelerometer data; multipoint extraction augments the data effectively.
2) A new fault diagnosis method is proposed in which vibration features are encoded into two-dimensional grayscale images and used as inputs for a CNN model. It achieves high-accuracy fault diagnosis without any prior knowledge. The influence of the number of convolutional layers and fully connected layers in the network model on diagnostic performance is also explored.
3) Experiments on the produced rotating machinery fault dataset show that the proposed method has significant advantages over the traditional fault diagnosis method.
The rest of this paper is organized as follows. Section II presents the methodologies, including the vision-based vibration characterization data construction method and the CNN model. Section III presents the details of our dataset and the testing results of the proposed method. The conclusion and future research works are presented in Section IV.
Fig. 1 shows the proposed vibration characterization method based on the video phase. First, the high frame rate video sampling
method is used to obtain dynamic continuous vibration image sequences with the high-speed industrial camera. Then, the two-
dimensional Gabor filter is used to spatially filter the video image. And the local phase and amplitude information of each pixel at
each frame of the video is calculated. Based on the correlation between image frequency domain information and vibration, the local
phase is converted into local vibration information. In addition, the local amplitude is used to denoise the calculated phase signal,
which can improve the accuracy of the measurement.
In order to extract the local motion phase from the video, the Gabor transform is introduced here. The Gabor transform is a short-
time windowed Fourier transform, which can extract relevant features in different scales and directions in the frequency domain. Each
image decomposed by Gabor transform reflects the intensity change of frequency and direction within a local range. Through the
Gabor transform, each frame in the video is transformed into the complex frequency domain. So local motion is expressed more
accurately. In the spatial domain, the two-dimensional Gabor filter is a sine function modulated by a Gaussian function. The two-
dimensional Gabor function can be expressed as,
$$G(x, y, w, d, \psi, \sigma, \gamma) = T_d + iX_d \tag{1}$$
where i is the imaginary unit, x and y are the spatial coordinates, w is the wavelength of the sine wave, d is the direction of the parallel fringes of the Gabor filter kernel, ψ is the phase offset, σ is the standard deviation of the Gaussian function, and γ is the spatial aspect ratio. Td and Xd can be expressed as,
$$T_d = \exp\left(-\frac{x_d^2 + \gamma^2 y_d^2}{2\sigma^2}\right)\cos\left(2\pi\frac{x_d}{w} + \psi\right) \tag{2}$$
$$X_d = \exp\left(-\frac{x_d^2 + \gamma^2 y_d^2}{2\sigma^2}\right)\sin\left(2\pi\frac{x_d}{w} + \psi\right) \tag{3}$$
where xd, yd are the spatial coordinates along and perpendicular to the filter direction d, which can be expressed as,
$$\begin{cases} x_d = x\cos d + y\sin d \\ y_d = -x\sin d + y\cos d \end{cases} \tag{4}$$
The image intensity in the spatial domain of the collected image sequences at spatial position (x, y) and frame t is denoted S(x, y, t). Convolving the image intensity with the two-dimensional Gabor filter converts the spatial domain information of the image into local frequency domain information. The frequency domain form of the image intensity in direction d is
$$f(x, y, t) = L_d(x, y, t)\exp\big(i\,p_d(x, y, t)\big) = S(x, y, t) \otimes (T_d + iX_d) \tag{5}$$
where f(x, y, t) represents the frequency domain intensity of the image at frame t, Ld(x, y, t) is the local amplitude of the pixel, and pd(x, y, t) is the local phase.
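The filtering in Eqs. (1)-(5) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `gabor_kernel` and `local_amplitude_phase`, the kernel size, the parameter values and the use of FFT-based circular convolution are all choices made here for brevity.

```python
import numpy as np

def gabor_kernel(size, w, d, psi, sigma, gamma):
    """Complex Gabor kernel T_d + i*X_d on an odd (size x size) grid.

    d is the filter direction in radians; the other symbols follow Eqs. (1)-(4).
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate the coordinates into the filter direction.
    x_d = x * np.cos(d) + y * np.sin(d)
    y_d = -x * np.sin(d) + y * np.cos(d)
    envelope = np.exp(-(x_d**2 + (gamma * y_d)**2) / (2.0 * sigma**2))
    carrier = 2.0 * np.pi * x_d / w + psi
    return envelope * (np.cos(carrier) + 1j * np.sin(carrier))

def local_amplitude_phase(frame, kernel):
    """Return local amplitude L_d and local phase p_d of one frame, as in Eq. (5).

    Assumption: circular (FFT-based) convolution is used instead of a padded one.
    """
    rows, cols = frame.shape
    kh, kw = kernel.shape
    padded = np.zeros((rows, cols), dtype=complex)
    padded[:kh, :kw] = kernel
    # Move the kernel centre to the origin so the response is not shifted.
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    response = np.fft.ifft2(np.fft.fft2(frame) * np.fft.fft2(padded))
    return np.abs(response), np.angle(response)

# Synthetic frame with horizontal fringes of wavelength 8 pixels (illustrative only).
kernel = gabor_kernel(size=9, w=8.0, d=0.0, psi=0.0, sigma=3.0, gamma=1.0)
frame = np.tile(np.cos(2.0 * np.pi * np.arange(64) / 8.0), (32, 1))
amplitude, phase = local_amplitude_phase(frame, kernel)
```

Matching the kernel wavelength w to the dominant fringe spacing of the image keeps the local amplitude high, which makes the local phase less sensitive to noise.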
Here, we use an improved video phase vibration measurement method. According to the relationship between the local phase and
the actual displacement, the displacement signal of the pixel is directly extracted. Then the time-domain vibration acceleration signal
is calculated through Laplacian of Gaussian (LOG) operation [29].
We assume that the pixel at a certain spatial position (x, y) of the image sequence has a local movement (Δx, Δy) at the motion frame tn, and we set the frame t0 as the reference frame. The time interval between the two frames is Δt = tn − t0. The image intensity of the reference frame is S(x, y, t0), and the image intensity of the motion frame is S(x + Δx, y + Δy, t0 + Δt).
Taking the horizontal motion extraction as an example, we choose the direction d = 0°, so that xd = x, yd = y. Then the integral form of the reference frame and the motion frame can be obtained as,
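$$f(x, y, t_0) = \exp\left(-2\pi i\frac{x}{w}\right) \iint_{-\infty}^{+\infty} S(u, v, t)\,\exp\left(-\frac{(x-u)^2 + \gamma^2 (y-v)^2}{2\sigma^2}\right) \exp\left(2\pi i\frac{u}{w} + \psi i\right) du\,dv \tag{6}$$
$$f(x, y, t_0+\Delta t) = \exp\left(-2\pi i\frac{x+\Delta x}{w}\right) \iint_{-\infty}^{+\infty} S(u+\Delta x, v+\Delta y, t+\Delta t)\,\exp\left(-\frac{(x-u-\Delta x)^2 + \gamma^2 (y-v-\Delta y)^2}{2\sigma^2}\right) \exp\left(2\pi i\frac{u}{w} + \psi i\right) du\,dv \tag{7}$$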
For the above two formulas, the definite integrals have the same phase, which we express as a constant term. Calculating the phase angle of the two formulas gives (8) and (9),
$$p\big(f(x, y, t_0)\big) = -2\pi\frac{x}{w} + p' \tag{8}$$
$$p\big(f(x, y, t_0+\Delta t)\big) = -2\pi\frac{x+\Delta x}{w} + p' \tag{9}$$
where p′ denotes the integral constant. Subtracting (9) from (8), the phase difference between the two frames is
$$\Delta p = 2\pi\frac{\Delta x}{w} \tag{10}$$
The horizontal motion Δx is proportional to the phase difference Δp, so this method does not need to calculate the phase gradient of the entire image. The phase difference Δp between two frames, multiplied by the scale factor w/(2π), represents the horizontal displacement of the pixel.
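As a worked check of Eq. (10), the sketch below converts a phase difference into a displacement. The function name `phase_to_displacement` and the numerical values are illustrative assumptions. The phase difference is wrapped to (−π, π], so the conversion is only unambiguous for motions smaller than half a wavelength.

```python
import numpy as np

def phase_to_displacement(p_ref, p_motion, w):
    """Invert Eq. (10): dx = w * dp / (2*pi), with dp = p_ref - p_motion."""
    dp = np.asarray(p_ref, dtype=float) - np.asarray(p_motion, dtype=float)
    # Wrap the difference to (-pi, pi] to remove 2*pi jumps.
    dp = np.angle(np.exp(1j * dp))
    return w * dp / (2.0 * np.pi)

# Phases built from Eqs. (8) and (9) with illustrative values.
w, x, dx_true, p_const = 8.0, 3.0, 0.25, 0.7
p_ref = -2.0 * np.pi * x / w + p_const
p_motion = -2.0 * np.pi * (x + dx_true) / w + p_const
dx_est = float(phase_to_displacement(p_ref, p_motion, w))
```

Applying the same function to the phase of one pixel over all frames yields the one-dimensional displacement signal of that pixel.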
The phase signal in the image is affected by low-amplitude noise, which interferes with the accuracy of the extracted displacement. To reduce this effect, the phase is spatially blurred with a Gaussian filter weighted by the local amplitude. h(x, y) is the two-dimensional Gaussian function, and its standard deviation ρ represents the width of the spatial domain filter. The larger the standard deviation, the wider the two-dimensional Gaussian and the stronger the smoothing. The Gaussian function can be expressed as,
$$h(x, y) = \exp\left[-\frac{x^2 + y^2}{2\rho^2}\right] \tag{11}$$
For the frame s whose phase is ps and amplitude is Ls, the weighted phase signal is calculated as follows,
$$p_s = \frac{(p_s L_s) \otimes h(x, y)}{L_s \otimes h(x, y)} \tag{12}$$
The LOG operator used to obtain the acceleration signal is
$$\mathrm{LOG} = \frac{\partial^2}{\partial l^2} G_\sigma(l) = \frac{l^2 - 2\sigma^2}{\sigma^2}\, e^{-\frac{l^2}{2\sigma^2}} \tag{13}$$
where σ represents the standard deviation of the Gaussian function and l represents the displacement. The displacement signal is denoised by the Gaussian function Gσ(l), and then the Laplace operator is used to calculate the second derivative of the displacement signal, which is the time-domain acceleration signal.
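The two denoising steps can be sketched as follows. This is a hedged illustration rather than the authors' code: the helper names, kernel radii, zero padding at the borders and the small `eps` guard are assumptions, and the acceleration is obtained by Gaussian smoothing followed by a finite-difference second derivative, which follows the verbal description of the LOG step instead of evaluating Eq. (13) in closed form.

```python
import numpy as np

def gaussian_kernel_2d(radius, rho):
    """Unnormalised 2-D Gaussian h(x, y) of Eq. (11) on a (2*radius+1)^2 grid."""
    ax = np.arange(-radius, radius + 1, dtype=float)
    xx, yy = np.meshgrid(ax, ax)
    return np.exp(-(xx**2 + yy**2) / (2.0 * rho**2))

def filter2_same(img, kernel):
    """Zero-padded 'same' filtering; equals convolution for a symmetric kernel."""
    r = kernel.shape[0] // 2
    padded = np.pad(img, r, mode="constant")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def amplitude_weighted_phase(phase, amplitude, rho=2.0, radius=4, eps=1e-12):
    """Eq. (12): blur phase*amplitude and divide by the blurred amplitude."""
    h = gaussian_kernel_2d(radius, rho)
    return filter2_same(phase * amplitude, h) / (filter2_same(amplitude, h) + eps)

def acceleration_from_displacement(disp, dt, sigma=2.0, radius=6):
    """Gaussian-smooth a 1-D displacement signal, then differentiate twice."""
    ax = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-ax**2 / (2.0 * sigma**2))
    g /= g.sum()  # unit gain so the signal level is preserved
    smooth = np.convolve(disp, g, mode="same")
    return np.gradient(np.gradient(smooth, dt), dt)
```

Samples within a few kernel radii of the borders are affected by the zero padding and are best discarded before further processing.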
Then the one-dimensional displacement time-domain signal extracted from the original vibration image is encoded. The extracted
multi-point one-dimensional vibration information is converted from the time-domain pixel space to the two-dimensional image space
to construct a vibration characterization image dataset.
We take a one-dimensional vibration signal segment of length n², T(i), i = 1, 2, …, n², where T(i) is the value of the i-th sample of the segment. The converted grayscale image has size n × n, and the intensity of each pixel is P(j, k), j = 1, 2, …, n, k = 1, 2, …, n, expressed as follows,
$$P(j, k) = \mathrm{round}\left\{\frac{T\big((j-1)\times n + k\big) - \min(T)}{\max(T) - \min(T)} \times 255\right\} \tag{14}$$
where round(⋅) is the rounding function. The value of each pixel is normalized to [0, 255]. Each pixel in the grayscale image represents the intensity of the vibration, which contains richer temporal features.
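Eq. (14) amounts to a min-max normalization followed by a row-major reshape. A minimal sketch is given below; the function name `signal_to_gray_image` and the handling of a constant segment are assumptions introduced here.

```python
import numpy as np

def signal_to_gray_image(segment, n):
    """Encode a length n*n signal segment as an n x n 8-bit image, following Eq. (14)."""
    t = np.asarray(segment, dtype=float)
    if t.size != n * n:
        raise ValueError("segment length must equal n * n")
    span = t.max() - t.min()
    if span == 0:
        raise ValueError("a constant segment cannot be normalised")
    scaled = (t - t.min()) / span * 255.0
    # Row-major reshape: pixel (j, k) holds sample (j - 1) * n + k (1-based).
    # np.round rounds exact halves to the nearest even integer.
    return np.round(scaled).reshape(n, n).astype(np.uint8)

# Illustrative 16-sample ramp encoded as a 4 x 4 image.
image = signal_to_gray_image(np.arange(16), 4)
```

With n = 32, a 1024-sample segment maps to the 32 × 32 images used later in the experiments.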
[Fig. 2(b) diagram. Layer labels: Input, Conv1,2, Pool 1, Conv3,4, Pool 2, Conv5,6,7, Pool 3, Fc1, Softmax. Size labels: 64, 64, 128, 128, 256, 256, 256, 4096, 4.]
Fig. 2. (a) Common fault diagnosis network model. (b) Proposed CNN model. The proposed network model is deeper and more complex than the common model.
In this work, a novel deep 2D-CNN based on two-dimensional gray images is proposed for fault diagnosis to explore the ability of a deep network to automatically extract and discriminate features. Fig. 2(a) shows the Lenet5 model [30] normally used in fault diagnosis tasks. Fig. 2(b) shows the proposed 2D-CNN model. The overall network architecture consists of seven convolutional layers and fully connected layers. Compared with Lenet5, it has a deeper and more complex structure with more parameters. The larger effective receptive field brought by the deeper stack of convolution kernels yields a richer combination of underlying features, which makes this network model more suitable for the proposed fault diagnosis problem.
Since a smaller filter size has been shown to benefit model performance [31], the size of all convolution filters is set to 3 × 3. As the feature extraction unit, convolutional layers and pooling layers perform nonlinear mapping to extract effective fault features from the original vibration image. The fully connected layer that follows the feature extraction module performs the classification task. Sharing weights and biases in the convolutional layers reduces the connections between network layers, which decreases the number of training parameters and reduces the overfitting risk. The calculation process can be expressed as,
$$y_{i,j}^{k,l} = g\left(\sum_{k=1}^{D}\sum_{i,j=1}^{N}\sum_{m,p=1}^{N} W_{m,p}^{k,l} \otimes x_{i+m-1,\,j+p-1}^{k,l-1} + b^{k,l}\right) \tag{15}$$
In the pooling step that follows, maxpool(·) represents the max pooling, Rp denotes the pooling receptive region, and $z_{i,j}^{k,l}$ is the value of the kth output feature map of the lth layer at the position (i, j) after pooling.
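To make the convolution and pooling steps concrete, the sketch below implements one convolutional layer and one max pooling layer with NumPy. It is an illustration under assumptions that the text does not state: the activation g is taken to be ReLU, the convolution is computed without padding, and the pooling windows do not overlap. The function names are hypothetical.

```python
import numpy as np

def conv_layer(x, weights, bias):
    """One convolutional layer in the spirit of Eq. (15).

    x: (C_in, H, W) input maps; weights: (C_out, C_in, N, N); bias: (C_out,).
    Uses unpadded sliding-window correlation, then ReLU (assumed activation g).
    """
    c_out, c_in, n, _ = weights.shape
    h_out, w_out = x.shape[1] - n + 1, x.shape[2] - n + 1
    y = np.zeros((c_out, h_out, w_out))
    for m in range(n):
        for p in range(n):
            patch = x[:, m:m + h_out, p:p + w_out]  # inputs seen by tap (m, p)
            y += np.einsum("oc,chw->ohw", weights[:, :, m, p], patch)
    y += bias[:, None, None]
    return np.maximum(y, 0.0)

def max_pool(y, size=2):
    """Non-overlapping max pooling over size x size regions R_p."""
    c, h, w = y.shape
    h2, w2 = h // size, w // size
    blocks = y[:, :h2 * size, :w2 * size].reshape(c, h2, size, w2, size)
    return blocks.max(axis=(2, 4))

# Tiny example: one 4 x 4 input map, one 3 x 3 filter of ones, bias 1.
feature = conv_layer(np.ones((1, 4, 4)), np.ones((1, 1, 3, 3)), np.array([1.0]))
pooled = max_pool(feature)
```

Each output value of the example equals 3 × 3 × 1 + 1 = 10, and pooling the 2 × 2 feature map leaves a single value.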
Then the feature map is input to the fully connected layer to classify the extracted abstract features to achieve intelligent diagnosis.
For multi-class classification problems, the softmax function is frequently used for label prediction. In the supervised intelligent fault
diagnosis problem, the cross-entropy loss function is generally used to measure the error between the real label and the predicted
result. The overall optimization goal of intelligent fault diagnosis can be summarized as,
$$L = -\frac{1}{n}\sum_{i}\sum_{c=1}^{s} y_{ic}\log(p_{ic}) \tag{17}$$
where s represents the number of health condition categories, yic is an indicator function that takes 1 if the true category of sample i is c, and pic denotes the predicted probability that sample i belongs to category c. The proposed network uses the stochastic gradient descent algorithm to continuously update and optimize the weights of the model, minimizing the loss function to achieve high-precision classification and diagnosis.
[Fig. 3 flow chart, labels recovered from the figure: rotating machine running under different faults and speeds → high-speed industrial camera → pictures → video to signal (Gabor filter, phase difference, displacement, 1D displacement signal) → signal to image (normalization, 32 × 32 image) → training dataset → model training (feature extraction with conv and pool layers, classification with fc and softmax) → diagnostic result.]
Fig. 4. The visual acquisition platform and the simulated fault bearings. (a) The platform, mainly composed of a high-speed camera, a non-stroboscopic light source, a bearing seat, a brushless direct current motor, and a motor speed controller. (b) The normal bearing. (c) The bearing with inner race fault. (d) The bearing with outer race fault. (e) The bearing with roller fault.
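The loss in Eq. (17) and a single gradient descent update can be illustrated with NumPy. This is a sketch under simplifying assumptions: the logits themselves are treated as the trainable quantities, the learning rate is arbitrary, and the function names are introduced here for illustration only.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels, eps=1e-12):
    """Eq. (17): mean negative log-probability of the true class."""
    n = probs.shape[0]
    return float(-np.mean(np.log(probs[np.arange(n), labels] + eps)))

def sgd_step_on_logits(logits, labels, lr):
    """One gradient step; d(loss)/d(logits) = (softmax - one_hot) / n."""
    n, s = logits.shape
    one_hot = np.eye(s)[labels]
    grad = (softmax(logits) - one_hot) / n
    return logits - lr * grad

# Two samples, four health conditions, uninformative initial scores.
logits = np.zeros((2, 4))
labels = np.array([0, 3])
loss_before = cross_entropy(softmax(logits), labels)
loss_after = cross_entropy(softmax(sgd_step_on_logits(logits, labels, lr=1.0)), labels)
```

With uniform predictions over four classes the loss equals log 4, and the update lowers it, which is the behaviour the training procedure relies on.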
Fig. 3 shows the intelligent fault diagnosis of a rotating machine, which encodes the visually extracted vibration into a two-dimensional image and constructs a new CNN structure. The main flow is as follows,
Step 1: Use a high-speed camera to collect vibration videos of rotors at different speeds and with different types of faults.
Step 2: Select pixels in the acquired vibration image sequence. Construct the mapping between vibration displacement and image phase based on the Gabor frequency shift characteristics to obtain sufficient full-field time-domain vibration samples.
Step 3: Encode the extracted time-domain vibration signals to obtain images as input for the two-dimensional convolutional neural network, and construct a bearing fault image dataset.
Step 4: Divide the dataset into a training set, a validation set and a test set. Use sufficient training samples to train the parameters of the feature extraction and state classification modules of the proposed CNN model. Then use the validation set to adjust the hyperparameters of the model and preliminarily evaluate its performance.
Step 5: Use the test data to evaluate the model and verify the superiority of the proposed intelligent fault diagnosis method.
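The dataset division in Step 4 can be sketched as a shuffled index split. The split ratios are not given in the text, so the 70/15/15 proportions, the fixed seed and the function name `split_indices` below are purely illustrative assumptions.

```python
import random

def split_indices(n_samples, ratios=(0.7, 0.15, 0.15), seed=0):
    """Shuffle sample indices and cut them into train, validation and test lists.

    The default ratios are hypothetical; the text does not state the proportions.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_train = round(ratios[0] * n_samples)
    n_val = round(ratios[1] * n_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(100)
```

Keeping the test indices untouched until the final evaluation prevents the hyperparameter tuning on the validation set from leaking into the reported accuracy.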
Fig. 5. The edge detection result and the frequency domain comparison between the accelerometer ground truth and the visual vibration extraction. (a) The bearing seat edge extraction result of a frame in the video. The vibration features of pixels at the edges of the image are easy to extract. (b) The normal condition. (c) The inner race fault condition. (d) The outer race fault condition. (e) The roller fault condition.
In this section, the effect of the proposed vision-based fault diagnosis method on the rotating machinery fault dataset is evaluated through a series of experiments. All the experiments are carried out in Python 3.6 with PyTorch and run on a workstation equipped with an RTX2080Ti GPU.
A. Dataset description
First, a high-speed industrial camera (Flare 12M180MCX) is used to sample video data of the rotating machinery bearing seat running under different operating conditions. The bearing seat vibration is transmitted from the bearing and can therefore represent the fault features. According to the Whittaker-Shannon sampling theorem, the camera frame rate is set to 1000 fps, which satisfies the characteristic frequency requirements of bearing faults at the motor speeds used in the experiments. At the same time, in order to meet the image quality requirements of the visual processing method, the resolution is set to 760 × 800 pixels. Each recording contains 10,000 frames.
Fig. 4(a) shows the visual acquisition platform. The bearing seats are installed with hard contact to isolate the high-speed camera from the vibration generated by the bearing test bench during operation. Manual calibration and other common methods are used to minimize installation errors. The bearing has an inner diameter of 20 mm, an outer diameter of 47 mm, and a thickness of 14 mm. Fig. 4(b)-(e) show the bearings under four health conditions: normal, inner race fault, outer race fault and roller fault. Different damage severities are considered when artificially machining the bearing faults. Using electrical discharge machining equipment, fine grooves of different sizes are machined in the inner race, outer race and roller of the bearings, which simulates different types of bearing health conditions with different damage severity. Each fault type has three fine grooves of 0.5 mm, 1.0 mm and 1.5 mm, respectively. The speed of the rotor bearing ranges from 2000 rpm to 3000 rpm, and a video is recorded every 100 rpm within this range, so 4 × 11 vibration video sequences are collected.
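The acquisition figures above can be checked with a few lines of arithmetic. The sketch only uses numbers stated in the text; the bearing fault characteristic frequencies themselves depend on the bearing geometry (for example the number of rolling elements), which is not given here, so only the shaft rotation frequency is compared against the Nyquist limit.

```python
# Operating conditions: 2000 to 3000 rpm in steps of 100 rpm, four health conditions.
speeds_rpm = list(range(2000, 3001, 100))
conditions = ["normal", "inner race fault", "outer race fault", "roller fault"]
n_videos = len(conditions) * len(speeds_rpm)

# Sampling: 1000 fps gives a Nyquist frequency of 500 Hz.
frame_rate_hz = 1000
nyquist_hz = frame_rate_hz / 2

# Shaft rotation frequency at the highest speed, and the duration of one recording.
max_shaft_hz = max(speeds_rpm) / 60
frames_per_video = 10000
duration_s = frames_per_video / frame_rate_hz

print(len(speeds_rpm), n_videos, nyquist_hz, max_shaft_hz, duration_s)
```

The shaft frequency of at most 50 Hz leaves a factor of ten below the 500 Hz Nyquist limit for the higher fault characteristic frequencies.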
Then the video sequences of the bearing seat are encoded into vibration characterization images. Fig. 5(a) shows the bearing seat edge detection result for one frame of the video. First, for each original vibration image sequence, 20 pixels with obvious vibration features are manually selected from the edge. A total of 4 × 11 × 20 one-dimensional time-domain phase difference signals are extracted. Since the bearing seat vibrates more obviously in the horizontal direction, the Gabor filter direction d = 0° gives better vibration extraction results. Fig. 5(b)-(e) show the frequency domain comparison between the visual vibration extraction and the accelerometer ground truth for the four health conditions. Here, the Pearson correlation coefficient r is used to verify the effectiveness of visual vibration extraction.
$$r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}} \tag{18}$$
where X, Y are the frequency domain signals of the visually extracted vibration and the accelerometer measurement. At 3000 rpm, the correlation coefficient is 0.9342 for the normal condition, 0.9153 for the inner race fault condition, 0.9079 for the outer race fault condition, and 0.8745 for the roller fault condition.
Next, the original vibration signal is encoded to convert the extracted full-field one-dimensional vibration information from the time-domain pixel space to the two-dimensional image space. Taking one extracted phase difference signal as an example, 10 overlapping sliding-window segments of length 1024 are sampled from the 1 × 10000 signal. Each segment is then reorganized into a 32 × 32 image matrix. Finally, based on the above gray-scale image transformation, the elements in the matrix are normalized to [0, 255].
Table 1
Layer configurations of CNN model.
No.    Layer
L1     Conv(3 × 3 × 64)
L2     Conv(3 × 3 × 64)
L3     Maxpool(2 × 2)
L4     Conv(3 × 3 × 128)
L5     Conv(3 × 3 × 128)
L6     Maxpool(2 × 2)
L7     Conv(3 × 3 × 256)
L8     Conv(3 × 3 × 256)
L9     Conv(3 × 3 × 256)
L10    Maxpool(2 × 2)
[Fig. 6 plots: accuracy (%) and standard deviation for the models Conv5FC1, Conv5FC3, Conv7FC1, Conv7FC3, Conv10FC1 and Conv10FC3, in panels (a) and (b).]
Fig. 6. Results of different datasets with different models. (a) Result of phase difference dataset with CNN models. (b) Result of acceleration dataset with CNN models.
First, CNN models of different structures are compared, focusing on the impact of the number of convolutional layers and fully connected layers on fault diagnosis performance.
Conv7FC1 denotes a network with seven convolutional layers and one fully connected layer. The layer structure of the network model is shown in Table 1. Every two or three convolutional layers are followed by a max pooling layer, Maxpool (2 × 2), with a 2 × 2 filter. Conv (3 × 3 × 64) represents a convolutional layer with 3 × 3 filters and 64 channels. The strides of the convolutional and pooling layers are set to 1.
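The feature-map sizes implied by Table 1 can be traced with a short script. Two assumptions are made here that the text does not state: each 3 × 3 convolution is padded so that it preserves the spatial size, and each 2 × 2 pooling layer halves both spatial dimensions. The second assumption differs from the stated pooling stride of 1; it is used because it reproduces the 4096 flattened features shown in Fig. 2(b) for a 32 × 32 input.

```python
# Layer list transcribed from Table 1: ("conv", output channels) or ("pool", None).
TABLE_1 = [
    ("conv", 64), ("conv", 64), ("pool", None),
    ("conv", 128), ("conv", 128), ("pool", None),
    ("conv", 256), ("conv", 256), ("conv", 256), ("pool", None),
]

def trace_shapes(height, width, layers, in_channels=1):
    """Return (channels, height, width) after every layer.

    Assumptions: size-preserving 3 x 3 convolutions and pooling that halves
    each spatial dimension (neither is confirmed by the text).
    """
    shapes = []
    channels = in_channels
    for kind, out_channels in layers:
        if kind == "conv":
            channels = out_channels
        else:
            height, width = height // 2, width // 2
        shapes.append((channels, height, width))
    return shapes

shapes = trace_shapes(32, 32, TABLE_1)
final_channels, final_h, final_w = shapes[-1]
flattened = final_channels * final_h * final_w
print(shapes[-1], flattened)
```

Under these assumptions the fully connected layer receives 256 × 4 × 4 = 4096 inputs.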
Fig. 6(a) shows the experimental results of using different network structures on the phase difference dataset of 20 pixels. This work explores the impact of 5, 7, and 10 convolutional layers and 1 and 3 fully connected layers on classification accuracy. The maximum, minimum, average, and standard deviation of the results over 10 repeated experiments are used to evaluate the accuracy.
From the results, Conv7FC1 has the highest average accuracy of 99.792% and the lowest standard deviation of 0.0887. The accuracy is the ratio of the number of correctly classified vibration images in the test set to the total number, and the standard deviation measures the fluctuation in accuracy across the repeated tests. The network with 7 convolutional layers therefore has higher accuracy and stability than Conv5FC1, with an average accuracy of 99.215%, or Conv10FC1, with 98.453%. At the same time, compared with the three fully connected layer network Conv7FC3, with an average accuracy of 99.198%, the single fully connected layer network exhibits better performance. These experiments indicate that more fully connected layers are not suitable for the vision-based fault diagnosis problem addressed in this work: they can affect the generalization performance of the model, lengthen the training time, and easily lead to overfitting.
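The evaluation metrics described above can be computed as in the following sketch. The function names are hypothetical, the example numbers are synthetic placeholders rather than results from the experiments, and the use of the population standard deviation (ddof = 0) is an assumption because the text does not say which estimator was used.

```python
import numpy as np

def accuracy_percent(predicted, true):
    """Share of correctly classified test images, in percent."""
    predicted = np.asarray(predicted)
    true = np.asarray(true)
    return float(np.mean(predicted == true) * 100.0)

def summarize_runs(accuracies):
    """Maximum, minimum, mean and standard deviation over repeated runs."""
    a = np.asarray(accuracies, dtype=float)
    return {
        "max": float(a.max()),
        "min": float(a.min()),
        "mean": float(a.mean()),
        "std": float(a.std(ddof=0)),  # assumption: population standard deviation
    }

# Synthetic placeholder values, not measured results.
acc = accuracy_percent([0, 1, 2, 3], [0, 1, 2, 0])
summary = summarize_runs([99.0, 100.0])
```

Reporting the spread together with the mean makes it possible to distinguish a model that is better on average from one that is merely better in a single run.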
Next, in order to explore the impact of different data processing methods on diagnostic accuracy, we repeat the above experiment with an acceleration dataset of 20 pixels. The experimental results are shown in Fig. 6(b). Compared with the acceleration data usually used in vibration-based diagnosis, the phase difference data unique to this visual diagnosis method gives better diagnostic performance. Since the acceleration data is derived from the phase difference data through LOG operator filtering and other operations, the quality of the data is reduced. The highest accuracy on the acceleration data is 99.168%, obtained with the Conv7FC3 network, which is 0.624% below the best result on the phase difference data.
[Fig. 7 plot: accuracy (%) for 5, 10 and 20 pixels; legend: 2D SVM, Lenet5, AlexNet, Proposed Method.]
Fig. 7. Vibration acceleration signal grayscale image. Result of different number of pixels compared with other 2D method.
Table 2
Result compared with other 1D method.
Method Max Min Mean Std
Therefore, Conv7FC1 is selected as the network model, and the phase difference data is adopted as the network input. The following comparative experiments are carried out.
In order to evaluate the diagnostic performance of the proposed method, comparative experiments are conducted with other 2D classification methods using the phase difference grayscale dataset. These include the machine learning method 2D support vector machine (SVM) [32] and the deep learning classification networks Lenet5 and AlexNet [33].
At the same time, the effect of selecting different numbers of pixels as data input is also discussed. The same experiment is repeated 10 times and the average accuracy is selected as the evaluation standard. The results are shown in Fig. 7.
On the 20-pixel dataset, the average accuracy of the proposed method is 99.792%, which is better than that of the other diagnostic methods: 81.760% for 2D SVM, 96.272% for Lenet5 and 97.517% for AlexNet. This shows that the designed network outperforms these popular machine learning methods on this dataset. It is more suitable for vision-based fault diagnosis problems and shows greater potential for future applications.
Through the comparison of datasets with different numbers of pixels, as the number of selected pixels increases, the diagnostic
accuracy rate is also rising when processing the same original vibration image sequences. For example, the average accuracy of 20
pixels in the proposed method is significantly higher than 99.189% of 10 pixels and 97.517% of 5 pixels. It is verified that the data
collection method based on the whole field vision can effectively expand the vibration dataset, obtain effective fault information, and
improve the accuracy of fault diagnosis. It embodies the superiority of this vision-based diagnostic method.
In this experiment, the proposed method is compared with one-dimensional data-based diagnostic methods, including 1D SVM and
deep neural network (DNN). This comparative experiment is based on a phase difference dataset of 20 pixels. And it also uses
Conv7FC1 as the network model. For the one-dimensional method, the dataset used is the same sampled data before conversed to
grayscale. The experimental results are shown in Table 2.
From the results, compared with the average accuracies of 90.977% for 1D SVM and 94.937% for DNN, the proposed method achieves
the highest accuracy, 99.792%. The purpose of this comparative experiment is to demonstrate the advantage of transforming the
one-dimensional temporal features extracted by the visual method into two-dimensional images. For the same data samples, a
one-dimensional network needs more parameters, and the long input sequence makes it easy to lose features.
Using two-dimensional images as the CNN input not only reduces the number of network parameters and the training time, but
also extracts information across different time dimensions and gives better diagnostic performance.
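The parameter argument can be illustrated with a rough count for a single first layer. The sketch below compares a fully connected layer applied to a flattened one-dimensional segment with a convolution layer applied to the same samples arranged as an image. All sizes (4096 samples, 256 hidden units, 32 filters of size 3 x 3) are hypothetical and are not the configurations of the DNN or Conv7FC1 used in this work.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

def conv2d_params(kernel, in_channels, out_channels):
    """Weights plus biases of a 2D convolution layer.

    The kernel weights are shared across all image positions, so the
    count does not depend on the image size.
    """
    return kernel * kernel * in_channels * out_channels + out_channels

# Hypothetical sizes: a 4096-sample segment, equivalently a 64 x 64 image
dense = dense_params(n_inputs=4096, n_units=256)
conv = conv2d_params(kernel=3, in_channels=1, out_channels=32)
print(dense, conv)  # 1048832 320
```

The gap comes from weight sharing: the dense layer needs one weight per input sample and unit, while the convolution reuses the same small kernels everywhere. A complete network comparison would also have to include the later layers.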
Finally, to prove the superiority of the proposed visual vibration extraction, the experiment compared with the vibration data
C. Peng et al. Mechanical Systems and Signal Processing 192 (2023) 110229
[Fig. 8 plot: accuracy (%) versus epochs (10 to 200); legend entries recovered from the extraction: accelerometer train loss, accelerometer test loss.]
Fig. 8. The accuracy curve for CNN training with the accelerometer data and the visual data.
Table 3
Result compared with accelerometer dataset.
Vision dataset for CNN | Accelerometer dataset for CNN | Accelerometer data for traditional method
The fault diagnosis method for rotating machinery proposed in this work, based on visual extraction and vibration characterization,
overcomes the shortcomings of traditional accelerometers: it avoids the mass loading effect and enriches the extracted vibration
information. The time-domain vibration characterization information is converted into an image dataset containing information in
multiple time dimensions, and high-precision fault diagnosis is then achieved through the CNN model. Different CNN models
based on the phase difference and acceleration data are compared, and the method is also compared with one-dimensional and
two-dimensional fault diagnosis methods. The experimental results show that the proposed fault diagnosis method has good
robustness, good generalization performance and the highest accuracy, 99.792%. It is therefore promising for practical
engineering applications. Future work will focus on fault diagnosis under variable working conditions using transfer learning.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.
Data availability
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under Grant 62122038, and in part by the Natural
Science Foundation of Jiangsu Province under Grant BK20211565.
References
[1] J. Lee, F. Wu, W. Zhao, M. Ghaffari, L. Liao, D. Siegel, Prognostics and health management design for rotary machinery systems—Reviews methodology and
applications, Mech. Syst. Signal Process. 42 (1) (2014) 314–334.
[2] L. Liu, H. Guo, Z. Gao, Y.Y. You, B. Zhang, Machine vision based condition monitoring and fault diagnosis of machine tools using information from machined
surface texture: A review, Mech. Syst. Signal Process. 164 (2022).
[3] T. Wang, Q. Han, F. Chu, Z. Feng, Vibration based condition monitoring and fault diagnosis of wind turbine planetary gearbox: A review, Mech. Syst. Signal
Process. 126 (2019) 662–685.
[4] Z. Ye, J. Yu, Deep morphological convolutional network for feature learning of vibration signals and its applications to gearbox fault diagnosis, Mech. Syst.
Signal Process. 161 (2021).
[5] M. Altaf, T. Akram, M.A. Khan, M. Iqbal, M.M.I. Ch, C.-H. Hsu, A new statistical features based approach for bearing fault diagnosis using vibration signals,
Sensors 22 (5) (2022) 2012.
[6] D.J. Fleet, A.D. Jepson, Computation of component image velocity from local phase information, Int. J. Comput. Vis. 5 (1) (1990) 77–104.
[7] T. Gautama, M. Van Hulle, A phase-based approach to the estimation of the optical flow field using spatial filtering, IEEE Trans. Neural Networks 13 (5) (2002).
[8] J.G. Chen, N. Wadhwa, Y.-J. Cha, F. Durand, W.T. Freeman, O. Büyüköztürk, Modal identification of simple structures with high-speed video using motion
magnification, J. Sound Vib. 345 (2015) 58–71.
[9] J.G. Chen, A. Davis, N. Wadhwa, F. Durand, W.T. Freeman, O. Büyüköztürk, Video camera–based vibration measurement for civil infrastructure applications,
J. Infrastruct. Syst. 23 (3) (2016).
[10] A.M. Wahbeh, J.P. Caffrey, S.F. Masri, A vision-based approach for the direct measurement of displacements in vibrating systems, Smart Mater. Struct. 12 (5)
(2003) 785–794.
[11] B. Pan, Digital image correlation for surface deformation measurement: Historical developments, recent advances and future goals, Meas. Sci. Technol. 29 (8)
(2018).
[12] J.J. Lee, H.N. Ho, A vision-based dynamic rotational angle measurement system for large civil structures, Sensors 12 (2012) 7326–7336.
[13] J. Javh, J. Slavič, M. Boltežar, The subpixel resolution of optical-flow-based modal analysis, Mech. Syst. Signal Process. 88 (2017) 89–99.
[14] E. Caetano, S. Silva, J. Bateira, A vision system for vibration monitoring of civil engineering structures, Exp. Tech. 35 (4) (2011) 74–82.
[15] C. Liu, A. Torralba, W. Freeman, F. Durand, E.H. Adelson, Motion magnification, ACM Trans. Graphics 24 (3) (2005) 519–526.
[16] N. Wadhwa, et al., Eulerian video magnification and analysis, Commun. ACM 60 (1) (2016) 87–95.
[17] H.-Y. Wu, M. Rubinstein, E. Shih, J. Guttag, F. Durand, W. Freeman, Eulerian video magnification for revealing subtle changes in the world, ACM Trans. Graph.
31 (4) (2012).
[18] N. Wadhwa, M. Rubinstein, F. Durand, W.T. Freeman, Phase-based video motion processing, ACM Trans. Graph. 32 (4) (2013) 80.
[19] C. Peng, C. Zeng, Y. Wang, Camera-based micro-vibration measurement for lightweight structure using an improved phase-based motion extraction, Sensors 20
(5) (2020) 2590–2599.
[20] C. Peng, M. Zhu, Y. Wang, J. Jiang, Phase-based video measurement for active vibration suppression performance of the magnetically suspended rotor system,
IEEE Trans. Ind. Electron. 68 (2) (2021) 1497–1505.
[21] A. Zona, Vision-based vibration monitoring of structures and infrastructures: An overview of recent applications, Infrastruct. 6 (1) (2020) 4.
[22] G.F. Bin, J.J. Gao, X.J. Li, B.S. Dhillon, Early fault diagnosis of rotating machinery based on wavelet packets – Empirical mode decomposition feature extraction
and neural network, Mech. Syst. Signal Process. 27 (2012) 696–711.
[23] A. Sarrafi, Z. Mao, C. Niezrecki, P. Poozesh, Vibration-based damage detection in wind turbine blades using phase-based motion estimation and motion
magnification, J. Sound Vib. 421 (2018) 300–318.
[24] G. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006).
[25] B. Yang, S. Xu, Y. Lei, C.G. Lee, E. Stewart, C. Roberts, Multi-source transfer learning network to complement knowledge for intelligent diagnosis of machines
with unseen faults, Mech. Syst. Signal Process. 162 (2022).
[26] Y. Lei, B. Yang, X. Jiang, F. Jia, N. Li, A.K. Nandi, Applications of machine learning to machine fault diagnosis: A review and roadmap, Mech. Syst. Signal
Process. 138 (2020).
[27] M. Miao, Y. Sun, J. Yu, Deep sparse representation network for feature learning of vibration signals and its application in gearbox fault diagnosis, Knowl.-Based
Syst. 240 (2022) 108–116.
[28] M. Zhao, S. Zhong, X. Fu, B. Tang, M. Pecht, Deep residual shrinkage networks for fault diagnosis, IEEE Trans. Ind. Informat. 16 (7) (2020) 4681–4690.
[29] D. Marr, E. Hildreth, Theory of edge detection, Proc. Roy. Soc. London Ser. B Biol. Sci. 207 (1167) (1980) 187–217.
[30] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[31] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, Proc. Int. Conf. Learn. Represent. (2015).
[32] X. Zhang, Y. Liang, J. Zhou, A novel bearing fault diagnosis model integrated permutation entropy ensemble empirical mode decomposition and optimized SVM,
Measurement 69 (2015) 164–179.
[33] A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, Proc. Adv. Neural Inf. Process. Syst. (2012) 1097–1105.
[34] N.E. Huang, Z. Shen, S.R. Long, M.C. Wu, H.H. Shih, Q. Zheng, et al., The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-
stationary time series analysis, Proc. Royal Soc. A: Math. Phys. Eng. Sci. 454 (1971) (1998) 903–995.