Multi-Source Precipitation Data Merging For Heavy Rainfall Events Based On Cokriging and Machine Learning Methods
Article
Multi-Source Precipitation Data Merging for Heavy Rainfall
Events Based on Cokriging and Machine Learning Methods
Junmin Zhang 1,2 , Jianhui Xu 2,3, *, Xiaoai Dai 1 , Huihua Ruan 4 , Xulong Liu 2,3 and Wenlong Jing 2,3
Abstract: Gridded precipitation data with a high spatiotemporal resolution are of great importance for studies in hydrology, meteorology, and agronomy. Observational data from meteorological stations cannot accurately reflect the spatiotemporal distribution and variations of precipitation over a large area. Meanwhile, radar-derived precipitation data are restricted by low accuracy in areas of complex terrain and satellite-based precipitation data by low spatial resolution. Therefore, hourly precipitation models were employed to merge data from meteorological stations, radar, and satellites; the models used five machine learning algorithms (XGBoost, gradient boosting decision tree (GBDT), random forests (RF), LightGBM, and multiple linear regression (MLR)), as well as the CoKriging method. In the north of Guangdong Province, data of four heavy rainfall events in 2018 were processed with geographic data to obtain merged hourly precipitation data. The CoKriging method secured the best prediction of spatial distribution of accumulated precipitation, followed by the tree-based machine learning (ML) algorithms, while the MLR prediction deviated significantly from the actual pattern. All machine learning methods showed poor performances for timepoints with little precipitation during the heavy rainfall events. The tree-based ML method showed poor performance at some timepoints when precipitation was over-related to latitude, longitude, and distance from the coast.
Citation: Zhang, J.; Xu, J.; Dai, X.; Ruan, H.; Liu, X.; Jing, W. Multi-Source Precipitation Data Merging for Heavy Rainfall Events Based on Cokriging and Machine Learning Methods. Remote Sens. 2022, 14, 1750. https://doi.org/10.3390/rs14071750
sensing, with the most recent satellite-based efforts attempting to integrate the advantages
of infrared and microwave sensing [9–11]. These merging projects include pioneering
efforts such as the CPC Merged Analysis of Precipitation (CMAP), the Global Precipitation
Climatology Project (GPCP) [12], and the Tropical Rainfall Measuring Mission (TRMM),
launched in 1997. The current standard for such projects is the Global Precipitation
Measurement (GPM) mission, which forms the basis for two widely used data products:
Integrated Multi-satellite Retrievals for GPM (IMERG) and Global Satellite Mapping of
Precipitation (GSMaP). IMERG can provide good half-hourly precipitation estimates with
a spatial resolution of 0.1° × 0.1°, but some studies of extreme weather have found this
resolution too coarse [13]. GSMaP data outperform IMERG for extreme precipitation
events [14], but they are likewise limited by low resolution. There may be a significant
systematic deviation between individual satellite-derived and radar-derived precipitation
data, and the utility of satellite and radar data can be greatly enhanced by merging them
with station data for correction and calibration [15,16].
In recent years, methods have been developed to merge precipitation data from differ-
ent sources in order to improve spatiotemporal resolution and combine the advantages of
the different sources [17]. Various methods have been introduced for merging precipitation
data from meteorological stations and satellites, including artificial neural networks [18,19],
optimal interpolation [20], the Filtersim multiple-point statistics method [21], convolutional
neural network–long short-term memory (CNNLSTM) deep fusion modeling [22], and geo-
graphically weighted ridge regression [23]. The ordinary Kriging [24,25], CoKriging [26],
and spatial–temporal local weighted linear regression Kriging (STLWLRK) [27] methods
have been developed for merging data from meteorological stations and ground-based
radar. Multilayer perceptron networks [28] and the MetNet neural weather model [29]
can be used to merge satellite and ground-based radar data. To further improve spatial resolution without reducing accuracy, researchers have introduced high-resolution spatial structure analysis of radar precipitation data based on techniques for merging station and satellite data, and have developed methods for merging data from all three sources (stations, satellite, and radar), including Monte Carlo-based multi-objective optimization [30], Bayesian averaging [31], geographically weighted regression, and artificial neural networks [32].
Despite the progress in developing methods for daily precipitation estimation based
on multi-source data merging, the use of daily precipitation as an indicator of precipitation
intensity remains a potential source of bias. The intensity of prolonged light precipitation
may be overestimated, while the intensity of brief heavy precipitation may be underesti-
mated, and two different intensity figures may be reported for a single precipitation event
spanning two days [33]. Using hourly data provides a more accurate indicator of the actual
precipitation intensity, reducing the sampling error while recording more details regarding
each precipitation event [34]. Compared to normal rainfall events, heavy rainfall events
are associated with higher precipitation values and more pronounced spatial differences
in precipitation, leading to lower data accuracy. In addition, there is a need to test the
applicability of multi-source hourly precipitation data merging for studying different types
of heavy rainfall events, such as monsoon rainstorms and typhoons. Therefore, in this
study, we analyzed the correlations between selected variables and the hourly precipitation
observed at 250 meteorological stations in the mountainous areas of Northern Guangdong
Province during four heavy rainfall events in 2018 (event I, 23–27 April; event II, 7–10 May;
event III, 26–30 August; and event IV, 16 and 17 September). The auxiliary data analyzed
included radar precipitation data, satellite precipitation data, elevation, distance from the
coastline, and latitude and longitude. In addition, we sought to determine the optimal
multi-source precipitation data merging method under the theoretical framework of ma-
chine learning and geostatistics. Accordingly, we analyzed the heavy rainfall events by
data merging using five machine learning algorithms (XGBoost, GBDT, RF, LightGBM, and
MLR) and the CoKriging precipitation merging model, then compared the results.
2.3. Methodology
2.3.1. Multi-Source Precipitation Data Merging Methods
We constructed station–radar–satellite hourly precipitation merging models based on
machine learning algorithms and geostatistical methods, together with auxiliary geographic
parameters, including topography, latitude and longitude, and distance from the coastline.
Additionally, to facilitate the high-accuracy merging of multi-source precipitation data, we
developed a CoKriging data merging model with station-observed precipitation as the
primary variable and the radar precipitation data as a covariate. Finally, merged hourly
precipitation data with a spatial resolution of 1 km were obtained. A flowchart of the
hourly precipitation data merging methods in this study is shown in Figure 2. There
were four main steps: First, the IMERG and GSMaP data, each with a spatial resolution of 0.1°,
were spatially downscaled using the geostatistical area-to-point Kriging (ATPK) method. Second, a
regression prediction model was constructed using machine learning algorithms based
on the correlation of station precipitation data with radar precipitation data, satellite
precipitation data, and auxiliary geographic variables. Third, the residuals between the
model estimates and station observations were interpolated using the ordinary Kriging
interpolation algorithm. Fourth, the model prediction results were corrected using the
interpolated model residuals, producing high-accuracy hourly precipitation merging data
with a spatial resolution of 1 km.
2.3.2. Machine Learning-Based Hourly Precipitation Data Merging Models
In this study, five machine learning algorithms (GBDT, XGBoost, LightGBM, RF, and MLR) were used to construct regression models for station precipitation data, radar precipitation data, and auxiliary geographic parameters. The model estimates were compared with CoKriging interpolation results, as shown in Equations (1) and (2). In Equation (1),
$$\widehat{Prec}_{ML} = f_{ML}(Radar, IMERG, GSMaP, Lon, Lat, DEM, Coastline) \tag{1}$$
where $\widehat{Prec}_{ML}$ denotes precipitation data predicted by a machine learning algorithm; $f_{ML}$ denotes a regression model constructed based on machine learning algorithms; $Radar$, $IMERG$, and $GSMaP$ denote radar and satellite precipitation data; $Lon$ and $Lat$ denote longitude and latitude, respectively; $DEM$ denotes elevation; and $Coastline$ denotes distance from the coastline. The inputs of the constructed machine learning model were radar
precipitation data (spatial resolution of 1 km) and the auxiliary geographic variables, and
the output was predicted precipitation data (spatial resolution of 1 km). Then, the residuals
of the model were interpolated using the ordinary Kriging interpolation algorithm, as
shown in Equation (2):
$$\hat{\varepsilon}_{ML}(x) = \sum_{i=1}^{n} \lambda_i \, \varepsilon_{ML}(X_i) \tag{2}$$
where $\hat{\varepsilon}_{ML}(x)$ is the ordinary Kriging estimate for the residual of the machine learning model at spatial location $x$, $\varepsilon_{ML}(X_i)$ is the model residual at the $i$-th of the $n$ stations, and $\lambda_i$ denotes the corresponding weight of the ordinary Kriging interpolation method. Thus, to obtain high-accuracy precipitation merging data with a spatial resolution of 1 km, the interpolated model residuals were used to correct the 1-km precipitation prediction results.
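The residual correction in Equation (2) can be illustrated with a minimal sketch. It assumes a linear variogram γ(h) = h with no nugget, which is an illustrative choice and not the variogram model fitted in the study; the function names `solve`, `ok_weights`, and `merged_estimate` are hypothetical helpers, and the machine learning predictions are taken as given numbers.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ok_weights(stations, target, variogram=lambda h: h):
    """Ordinary Kriging weights lambda_i for one target location.

    The linear variogram default is an assumption made for this sketch only.
    """
    n = len(stations)
    # Kriging matrix: variogram values between stations, bordered by the
    # unbiasedness constraint (weights sum to 1).
    a = [[variogram(math.dist(p, q)) for q in stations] + [1.0] for p in stations]
    a.append([1.0] * n + [0.0])
    b = [variogram(math.dist(p, target)) for p in stations] + [1.0]
    return solve(a, b)[:n]  # drop the Lagrange multiplier

def merged_estimate(ml_at_target, stations, observed, ml_at_stations, target):
    """ML prediction at the target plus the Kriged station residuals (Equation (2))."""
    residuals = [o - p for o, p in zip(observed, ml_at_stations)]
    weights = ok_weights(stations, target)
    return ml_at_target + sum(w * r for w, r in zip(weights, residuals))
```

With a zero-nugget variogram, ordinary Kriging is an exact interpolator, so the corrected estimate at a station location reproduces the station observation; away from the stations, the correction is a weighted average of nearby residuals.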
2.3.3. GBDT
The gradient boosting decision tree (GBDT) is an iterative decision tree model, based on
a boosting algorithm, that achieves classification and regression by continuously reducing
residuals. The GBDT algorithm generates a weak learner with each iteration, and each learner is trained with the residuals of the learner in the previous round until a strong learner is finally obtained. The core concept of the GBDT algorithm is to let each tree fit the residuals generated by the previous trees and to use the sum of the outputs of all the trees as the final prediction.
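The residual-fitting idea can be sketched in a few lines. The example below is a toy version, assuming squared-error loss, a single one-dimensional feature, and depth-one trees (stumps); the names `fit_stump`, `gbdt_fit`, and `mse` and the learning rate are illustrative choices, not the configuration used in the study.

```python
def fit_stump(x, r):
    """Best single-split regression stump for residuals r on a 1-D feature x."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        thr = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= thr]
        right = [ri for xi, ri in zip(x, r) if xi > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda v: lm if v <= thr else rm

def gbdt_fit(x, y, n_trees=20, lr=0.5):
    """Each new stump is fitted to the residuals left by the current ensemble."""
    base = sum(y) / len(y)          # initial constant prediction
    trees = []
    pred = [base] * len(y)
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, resid)
        trees.append(tree)
        pred = [pi + lr * tree(xi) for xi, pi in zip(x, pred)]
    # Final prediction: base value plus the (shrunken) sum over all trees.
    return lambda v: base + lr * sum(t(v) for t in trees)

def mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
```

Calling `gbdt_fit(x, y)` on a small training set returns a function whose training error is lower than that of the constant mean prediction, because every round removes part of the remaining residual.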
2.3.4. XGBoost
XGBoost is an ensemble learning algorithm based on the method of boosting [42]. It is an optimized ensemble tree-based algorithm, improved and extended from the GBDT algorithm. Its main idea is to use feature splitting to grow trees continuously, with each generated tree representing a new function used to fit the residuals of the previous trees; finally, the scores of the corresponding leaf nodes of all trees are added to obtain the final predictive value:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(X_i), \quad f_k \in F \tag{3}$$
where $\hat{y}_i$ is the model-predicted value, $K$ denotes the number of trees, $F$ is the ensemble space of the regression trees (also known as CART), and $X_i$ denotes the feature vector of the $i$-th data point. $F = \{ f(X) = w_{q(X)} \}$ with $q: \mathbb{R}^m \to T$ and $w \in \mathbb{R}^T$, where $q$ denotes the structure of each tree by which the examples are mapped to the corresponding leaf indices, $T$ is the number of leaves on the tree, and $f_k$ corresponds to the structure $q$ and the leaf weights $w$ of the $k$-th independent tree. Each regression tree contains a continuous score on each leaf, and the score on the $i$-th leaf is denoted by $w_i$. The objective function of the XGBoost algorithm includes a loss function and a regularization term:
$$Obj = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{4}$$
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 \tag{5}$$
where $l(\hat{y}_i, y_i)$ denotes the training error between the predicted value $\hat{y}_i$ and the true value of the target $y_i$. The regularization term $\Omega$ penalizes the complexity of the model to smooth the final learned weights and avoid overfitting, and $\gamma$ and $\lambda$ denote the penalty coefficients of the model.
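As a worked illustration of the regularized objective, the sketch below evaluates the penalty of Equation (5) for a tree described only by its leaf weights and adds it to a training loss. The squared-error loss and the function names `omega` and `objective` are assumptions made for this example; the source does not state which loss function was used.

```python
def omega(leaf_weights, gamma, lam):
    """Equation (5): gamma * T + 0.5 * lambda * ||w||^2 for one tree."""
    n_leaves = len(leaf_weights)
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)

def objective(y_pred, y_true, trees, gamma, lam):
    """Training loss (squared error, assumed here) plus the penalty of every tree.

    `trees` is a list of leaf-weight lists, one list per tree.
    """
    loss = sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    return loss + sum(omega(w, gamma, lam) for w in trees)

# One tree with two leaves (weights 1.0 and -2.0), gamma = 0.5, lambda = 2.0:
# penalty = 0.5 * 2 + 0.5 * 2.0 * (1.0 + 4.0) = 6.0
print(omega([1.0, -2.0], gamma=0.5, lam=2.0))
```

Larger values of γ discourage additional leaves, and larger values of λ shrink the leaf weights, which is how the penalty limits model complexity.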
Remote Sens. 2022, 14, 1750 7 of 23
2.3.5. LightGBM
Light Gradient Boosting Machine (LightGBM) is a GBDT variation for big data pro-
cessing that balances efficiency and accuracy [43]. The characteristics of the LightGBM
algorithm are as follows: (1) A leaf-wise algorithm with a depth limit is adopted to replace
the level-wise strategy used by most GBDT tools, (2) data volume and accuracy are balanced
using a gradient-based one-side sampling (GOSS) algorithm that can exclude most
samples with small gradients and calculate information gain using the remaining samples,
and (3) the exclusive feature bundling (EFB) method is used to reduce the data volume
by reducing the number of features. LightGBM uses a histogram algorithm to reduce the
memory occupied by the method and the complexity of the data separation. Its core idea
is to convert continuous features into discrete values and construct a histogram, and the
cumulative statistics of each discrete value in the histogram are counted by traversing the
training data. During feature selection, the optimal splitting point can be determined by
simply traversing the discrete values in the histogram. Moreover, histogram construction can be accelerated by subtraction: the histogram of one leaf node can be obtained as the difference between the histogram of its parent and that of its sibling, so only the smaller of two sibling leaves needs to be built from the data, which reduces the computational effort of obtaining histograms for each leaf node.
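The histogram trick can be shown with a small sketch. It assumes the feature values have already been mapped to integer bin indices and stores, per bin, the sum of gradients and the sample count; `build_hist` and `subtract_hist` are hypothetical names, and this is an illustration of the idea rather than LightGBM's internal implementation.

```python
def build_hist(bins, grads, n_bins):
    """Accumulate [gradient sum, sample count] per bin by one pass over the data."""
    hist = [[0.0, 0] for _ in range(n_bins)]
    for b, g in zip(bins, grads):
        hist[b][0] += g
        hist[b][1] += 1
    return hist

def subtract_hist(parent, child):
    """Histogram of the sibling leaf = parent histogram - child histogram."""
    return [[pg - cg, pc - cc] for (pg, pc), (cg, cc) in zip(parent, child)]

# Seven samples, three bins; samples 0, 2, and 5 fall into the (smaller) left leaf.
bins = [0, 1, 1, 2, 0, 2, 1]
grads = [0.5, -1.0, 0.25, 2.0, 1.5, -0.5, 0.75]
left_idx = [0, 2, 5]

parent = build_hist(bins, grads, 3)
left = build_hist([bins[i] for i in left_idx], [grads[i] for i in left_idx], 3)
right = subtract_hist(parent, left)  # no second pass over the right-leaf samples
```

Only the smaller leaf is scanned; the larger sibling costs one subtraction per bin, which is independent of the number of samples it contains.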
2.3.6. RF
Random forest (RF) is a combination of decision trees where each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [44]. RF is a product of ensemble learning, which combines the Bagging (bootstrap aggregating) [44] and classification and regression tree (CART) algorithms. The idea of RF is to repeatedly draw random samples with replacement from the original training sample set to form N sample subsets and then generate N decision trees based on the subsets. Each decision tree produces its own result, and the final result is determined by voting for classification or by averaging for regression. RF has the following characteristics: (1) the subsets are independent of each other, which enables parallel computing and ensures high efficiency; (2) because of the Bagging method, the decision tree is not too complex and does not require pruning; and (3) the existence of out-of-bag (OOB) data makes it unnecessary to select a validation set separately.
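The sampling scheme behind Bagging and the out-of-bag data can be sketched as follows. The helper names `bootstrap_indices`, `out_of_bag`, and `forest_predict` are hypothetical, and the trees are represented as plain callables rather than fitted CART models.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n sample indices with replacement (one bootstrap subset)."""
    return [rng.randrange(n) for _ in range(n)]

def out_of_bag(n, sample):
    """Indices never drawn into the bootstrap subset; usable for validation."""
    drawn = set(sample)
    return [i for i in range(n) if i not in drawn]

def forest_predict(trees, x):
    """Regression forest: average the predictions of the individual trees."""
    preds = [tree(x) for tree in trees]
    return sum(preds) / len(preds)

rng = random.Random(0)           # fixed seed so the sketch is reproducible
sample = bootstrap_indices(10, rng)
oob = out_of_bag(10, sample)     # these samples were not seen by this tree
```

Because each tree is trained only on its own bootstrap subset, its out-of-bag samples act as a built-in hold-out set, which is why a separate validation set is not strictly required.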
2.3.7. MLR
Based on linear relationships between the precipitation data from meteorological stations, radar precipitation data, and auxiliary geographic parameters, we constructed a multi-source precipitation data merging model based on the multiple linear regression (MLR) method. The parameters of the MLR model were solved by the least squares method to fulfill the requirement that the residual sum of squares $Q = \sum_{i=1}^{m} \left( Prec_i - \widehat{Prec}_i \right)^2$ be minimized.
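For reference, the least squares solution can be written in matrix form. The notation below is an assumption introduced for this derivation and does not appear in the source: $y$ is the vector of the $m$ station observations $Prec_i$, $X$ is the $m \times (p+1)$ design matrix whose first column is all ones and whose remaining columns hold the $p$ predictors, and $\beta$ is the coefficient vector, so that $\widehat{Prec} = X\beta$.

```latex
\begin{aligned}
Q(\beta) &= \sum_{i=1}^{m} \bigl( Prec_i - \widehat{Prec}_i \bigr)^2
          = \lVert y - X\beta \rVert^2, \\
\frac{\partial Q}{\partial \beta} &= -2\, X^{\top} (y - X\beta) = 0
          \;\Longrightarrow\; X^{\top} X \,\hat{\beta} = X^{\top} y, \\
\hat{\beta} &= (X^{\top} X)^{-1} X^{\top} y
          \quad \text{(when } X^{\top} X \text{ is invertible)}.
\end{aligned}
```

The second line gives the normal equations; strongly correlated predictors make $X^{\top}X$ close to singular, in which case the estimated coefficients become unstable.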
$$Z^{*}_{gauge,CK}(x_0) = \sum_{i=1}^{N_1} \lambda_{1i} Z_{gauge}(x_i) + \sum_{j=1}^{N_2} \lambda_{2j} Z_{radar}(x_j) \tag{6}$$
where $Z^{*}_{gauge,CK}(x_0)$ is the estimated value at $x_0$; $Z_{gauge}(x_i)$ is the value of the primary variable, station-observed precipitation; $Z_{radar}(x_j)$ is the value of the covariate, the radar precipitation data; $\lambda_{1i}$ and $\lambda_{2j}$ are the weights of $Z_{gauge}$ and $Z_{radar}$, respectively; $\sum_{i=1}^{N_1} \lambda_{1i} = 1$; and $\sum_{j=1}^{N_2} \lambda_{2j} = 0$. The CoKriging method uses the variance function and covariance function to perform unbiased optimal estimations according to the following formulae:
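$$\gamma_{gauge}(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} \left[ Z_{gauge}(x_i) - Z_{gauge}(x_i + h) \right]^2 \tag{7}$$
$$\gamma_{gauge,radar}(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} \left[ Z_{gauge}(x_i) - Z_{gauge}(x_i + h) \right] \cdot \left[ Z_{radar}(x_i) - Z_{radar}(x_i + h) \right] \tag{8}$$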
where N (h) is the number of samples used to calculate the variance function, and h is the
sample distance.
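Equations (7) and (8) can be evaluated empirically from station pairs. The sketch below is a minimal illustration that treats every pair whose separation lies within a tolerance `tol` of the lag `h` as one of the N(h) samples; the binning tolerance and the function names are assumptions made for this example.

```python
import math

def lag_pairs(coords, h, tol):
    """Index pairs (i, j) whose separation distance is within tol of lag h."""
    n = len(coords)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(math.dist(coords[i], coords[j]) - h) <= tol]

def cross_semivariogram(coords, z1, z2, h, tol=0.5):
    """Equation (8): half the mean product of the paired increments of z1 and z2."""
    pairs = lag_pairs(coords, h, tol)
    if not pairs:
        raise ValueError("no station pairs at this lag")
    total = sum((z1[i] - z1[j]) * (z2[i] - z2[j]) for i, j in pairs)
    return total / (2 * len(pairs))

def semivariogram(coords, z, h, tol=0.5):
    """Equation (7): the cross form with both variables equal."""
    return cross_semivariogram(coords, z, z, h, tol)

# Three collinear stations one unit apart (illustrative values only).
coords = [(0, 0), (1, 0), (2, 0)]
gauge = [1.0, 3.0, 6.0]
radar = [2.0, 2.0, 5.0]
print(semivariogram(coords, gauge, h=1, tol=0.1))               # (4 + 9) / 4
print(cross_semivariogram(coords, gauge, radar, h=1, tol=0.1))  # (0 + 9) / 4
```

In practice a parametric model is fitted to these empirical values before the CoKriging weights in Equation (6) are solved for.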
where $m$ is the number of stations, $Prec_i$ denotes the precipitation data from the $i$-th ground-based meteorological station, $\widehat{Prec}_i$ denotes the multi-source precipitation merging data, $\overline{Prec}$ denotes the average station-observed precipitation, and $\overline{\widehat{Prec}}$ denotes the average of the multi-source merging precipitation.
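The three evaluation indices used in the Results (CC, RMSE, and MAE) can be computed as in the sketch below. It assumes the standard definitions (Pearson correlation coefficient, root mean square error, and mean absolute error), which match the symbols defined above; the function names are illustrative.

```python
import math

def cc(obs, est):
    """Pearson correlation coefficient between observed and merged precipitation."""
    mo = sum(obs) / len(obs)
    me = sum(est) / len(est)
    sxy = sum((o - mo) * (e - me) for o, e in zip(obs, est))
    sxx = sum((o - mo) ** 2 for o in obs)
    syy = sum((e - me) ** 2 for e in est)
    return sxy / math.sqrt(sxx * syy)   # undefined if either series is constant

def rmse(obs, est):
    """Root mean square error."""
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(obs, est)) / len(obs))

def mae(obs, est):
    """Mean absolute error."""
    return sum(abs(o - e) for o, e in zip(obs, est)) / len(obs)
```

Note that CC measures only the linear association: an estimate with a constant bias of 1 mm still has CC = 1, while its RMSE and MAE are both 1 mm, which is why the three indices are reported together.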
3. Results
3.1. Evaluation of the Accuracy of Merging Results
Hourly precipitation observations were selected from 20% of the meteorological sta-
tions (50 stations), which were evenly distributed over the study area and varied at each
timepoint (Figure 1), to evaluate the accuracy of the merged hourly multi-source pre-
cipitation data for four heavy rainfall events in terms of the three indices (CC, RMSE,
and MAE).
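A hold-out split of this kind can be sketched as follows. The example draws 20% of the stations at random for one timepoint; it is a simplification, since a plain random draw does not by itself guarantee the even spatial distribution described above, and the function name `split_stations` and the integer station IDs are hypothetical.

```python
import random

def split_stations(station_ids, rng, frac=0.2):
    """Return (training stations, validation stations) for one timepoint."""
    k = round(len(station_ids) * frac)
    test = set(rng.sample(station_ids, k))
    train = [s for s in station_ids if s not in test]
    return train, sorted(test)

rng = random.Random(2018)                 # seed chosen only for reproducibility
stations = list(range(250))               # placeholder IDs for the 250 stations
train, test = split_stations(stations, rng)   # call again for the next timepoint
```

Calling the function once per timepoint yields a different validation subset each time, which mirrors the statement that the selected stations varied at each timepoint.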
Figures 3–6 present the merged precipitation data obtained for the four heavy rainfall
events by the six merging methods. The scatterplots show that the main errors were
underestimation for high-precipitation timepoints and overestimation for low- or no-
precipitation timepoints. CoKriging produced fewer errors in both cases than the machine
learning methods, so it was considered the most accurate, giving CC values for the four
heavy rainfall events of 0.783, 0.806, 0.727, and 0.828, respectively. Comparing the different
machine learning algorithms, GBDT and XGBoost had the best performances for events
I (CC = 0.756) and IV (CC = 0.820), respectively, whereas RF provided the highest CC for
events II (CC = 0.787) and III (CC = 0.678). MLR resulted in the lowest CC and the highest
RMSE and MAE, indicating the lowest accuracy. Moreover, MLR provided the most severe
overestimation for the low- and no-precipitation timepoints during events I, II, and III.
Figure 3. Comparison of the observed and estimated precipitation for heavy rainfall event I: (a) XGBoost, (b) MLR, (c) RF, (d) GBDT, (e) LightGBM, (f) CoKriging (the red line represents the fitted curve between in-situ and estimated precipitation; the black line represents a 1:1 ratio of in-situ precipitation to estimated precipitation).
Figure 4. Comparison of the observed and estimated precipitation for heavy rainfall event II: (a) XGBoost, (b) MLR, (c) RF, (d) GBDT, (e) LightGBM, (f) CoKriging (the red line represents the fitted curve between in-situ and estimated precipitation; the black line represents a 1:1 ratio of in-situ precipitation to estimated precipitation).
Figure 7 shows the boxplots of the evaluation indices for the precipitation timepoints
during each heavy rainfall event (when the sum of the observations at the 250 stations was
greater than 10 mm). CoKriging produced the highest average and median CC for all four
events, as well as the lowest RMSE and MAE, indicating that CoKriging had the highest
merging accuracy. MLR produced the lowest CC distribution and the highest MAE and
RMSE. Among the machine learning methods, RF produced the highest upper limit of CC
for event I, but its lower limit was still lower than that of CoKriging. For all methods, CC
was the lowest for event III compared with the other heavy rainfall events, and the mean
CC was below 0.6, except with CoKriging, indicating that the merging performance was
worst for event III. The CC values of event IV were generally high, with relatively small
variations, and the merging effect was the best, which is because event IV involved high cumulative precipitation, and the rainfall was continuous without abrupt declines (e.g., the periods from 00:00 on 24 April to 23:00 on 25 April, 19:00 on 7 May to 22:00 on 9 May, and 15:00 to 20:00 on 26 August).
Figure 5. Comparison of the observed and estimated precipitation for heavy rainfall event III: (a) XGBoost, (b) MLR, (c) RF, (d) GBDT, (e) LightGBM, (f) CoKriging (the red line represents the fitted curve between in-situ and estimated precipitation; the black line represents a 1:1 ratio of in-situ precipitation to estimated precipitation).
Figure 7. Boxplots of the evaluation indices for the four heavy rainfall events: event I, 23–27 April; event II, 7–10 May; event III, 26–30 August; and event IV, 16 and 17 September.
Figures 8–11 demonstrate the time series of the evaluation indices of the hourly precipitation merging data for the four heavy rainfall events and the sum of the observed hourly precipitation data from all ground-based stations at the corresponding time. It can be seen that CoKriging provides highly accurate hourly precipitation merging data for the four heavy rainfall events, with a CC dominance ratio of over 39%, much higher than that of the five machine learning methods. In addition, the RMSE and MAE of the four heavy rainfall events are highly correlated with the variations of the precipitation values, with basically consistent trends of increases and decreases.
As shown in Figure 10, event III contained many precipitation peaks with relatively
short precipitation intervals, which was quite different from the characteristics of the other
monsoon rainstorm events, I and II. For event III, the five machine learning methods all
had a CC dominance ratio of about 10%, and CoKriging was still the one with the highest overall accuracy. The advantage of CoKriging was especially pronounced for the period from 0:00 on 26 August to 0:00 on 27 August at the beginning of the heavy rainfall event.
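The CC dominance ratio can be made concrete with a short sketch. It assumes that the ratio is the share of timepoints at which a method attains the highest CC among all methods, which is an interpretation of the term rather than a definition given in this excerpt; the function name and the CC values are made up for illustration and are not results from the study.

```python
def dominance_ratio(cc_by_method):
    """Share of timepoints at which each method has the highest CC.

    cc_by_method maps a method name to its list of hourly CC values.
    Ties are credited to the method listed first.
    """
    methods = list(cc_by_method)
    n = len(cc_by_method[methods[0]])
    wins = {m: 0 for m in methods}
    for t in range(n):
        best = max(methods, key=lambda m: cc_by_method[m][t])
        wins[best] += 1
    return {m: wins[m] / n for m in methods}

# Illustrative CC series for four timepoints (hypothetical numbers).
example = {
    "CoKriging": [0.8, 0.7, 0.4, 0.9],
    "RF": [0.7, 0.75, 0.5, 0.6],
}
print(dominance_ratio(example))
```

Under this reading, the ratios of all six methods sum to 1 for each event, so a value of about 10% for each machine learning method leaves roughly half of the timepoints to CoKriging.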
As shown in Figure 11, during event IV, the accumulated precipitation at the stations began to increase from 3:00 on 16 September and peaked at 13:00 on that day. After that, the precipitation continually decreased, so that the time series was roughly bell-shaped overall. The evaluation indices CC, RMSE, and MAE were still optimal for CoKriging for event IV, yet the ratio of CoKriging dominance decreased compared with the first three heavy rainfall events. In addition, the CC of all the methods except for MLR mostly exceeded 0.5, indicating that the overall merging performance for event IV was better.
Figure 10. Evaluation indices of heavy rainfall event III.
As shown in Figure 8, the accumulated precipitation at the stations for event I started
to increase at 12:00 on 23 April and reached the peak of the entire event of about 800 mm at
18:00 on that day. After a pause of two days, the precipitation reached two small peaks at
0:00 on the last two days, respectively; the accuracy of CoKriging for event I was generally
higher than that of the other methods. However, there were cases when CoKriging was
inferior to other methods. For example, the CC of CoKriging was much lower than that of
the other five methods for the timepoint of 23:00 on 26 April. The reasons may be that the
precipitation at that timepoint was relatively low and that the rainfall centers were far apart.
Thus, the spatial correlation between the precipitation data was relatively weak, making
the accuracy of CoKriging lower than that of the machine learning methods. Figure 8
also shows that the CC of MLR was 0 for the timepoint of 18:00 on 23 April when the
accumulated precipitation at the stations reached its peak, and the corresponding RMSE
and MAE indices were very high. The comparison between the MLR merging results plotted in Figures 12 and 13 and the merging results of the other methods revealed that this result was caused by a miscalculation of the precipitation centers with MLR.
Figure 12. Distribution of the merging results based on the XGBoost, GBDT, RF, LightGBM, and MLR algorithms and the CoKriging interpolation results: (a) 18:00 on 23 April 2018, (b) 9:00 on 7 May 2018, (c) 7:00 on 30 August 2018, and (d) 13:00 on 16 September 2018.
Figure 13. Spatial distribution of the accumulated precipitation merging results of different merging methods for the four heavy rainfall events: (a) heavy rainfall event I, (b) heavy rainfall event II, (c) heavy rainfall event III, and (d) heavy rainfall event IV.
4. Discussion
4.1. Spatial Distribution Characteristics of Accumulated Precipitation
Figure 13 shows the spatial distribution of accumulated precipitation obtained by
the different merging methods for the four heavy rainfall events in 2018. It can be seen
that the accumulated precipitation as determined by the four tree-based machine learning
models GBDT, XGBoost, LightGBM, and RF has similar spatial distribution characteristics
that are significantly different from those obtained by CoKriging and MLR. Among them,
the range of areas with high accumulated precipitation in the results of the RF method is
smaller than that for GBDT, XGBoost, and LightGBM; the accumulated precipitation results of the XGBoost method are not smooth, with prominent precipitation variations; and the accumulated precipitation results of CoKriging are obviously jagged, which is characteristic of the interpolation method. Finally, the accumulated precipitation results of MLR have a significantly different spatial distribution from the results of the other methods, and the accumulated precipitation values are much higher than those of the other methods; these are consistent with the hourly prediction results of MLR. The evaluation indices in Figures 3–6
and the comparison between the merging results and station-observed precipitation in
Figure 13 reveal that the spatial distribution of accumulated precipitation predicted by
CoKriging agrees the best with the actual pattern, followed by the results of the tree-based
machine learning methods. Furthermore, the distribution characteristics of the accumulated
precipitation of the MLR significantly deviate from the actual pattern, showing obvious
problems in events II and III.
Chao [47] used the multiscale geographically weighted regression (MGWR) method
to merge hourly precipitation data for the Ziwu River Basin, and the CC of
the merging result was 0.724. Li [48] applied the space–time multiscale analysis system
(STMAS), a multigrid variational analysis technique, to merge hourly precipitation data
during the heavy precipitation period in Jiangxi Province from May to June 2019, and the
CC of the merging result was 0.76. Figures 3–6 show the hourly precipitation merging
accuracy of the six methods for the four heavy rainfall events. The CC values of the
CoKriging merging results for events I, II, and IV (0.783, 0.806, and 0.828) are higher
than the CC values reported in the two studies mentioned above, and the CC of CoKriging
for event III (0.726) is comparable to them. RF and XGBoost, the machine learning methods
with the highest merging accuracy for events II and IV, also provide higher CC values of
0.787 and 0.82. Therefore, the merging results for the four heavy rainfall
events in this paper have practical and reference value.
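The CC comparison above can be reproduced for any pair of merged and station-observed series. The following is a minimal sketch, assuming that CC denotes the Pearson correlation coefficient; the function name and the sample values are hypothetical and are not taken from the study.

```python
import math

def correlation_coefficient(merged, observed):
    """Pearson correlation coefficient between merged and gauge-observed values."""
    if len(merged) != len(observed) or len(merged) < 2:
        raise ValueError("need two equally long series with at least two values")
    mean_m = sum(merged) / len(merged)
    mean_o = sum(observed) / len(observed)
    cov = sum((m - mean_m) * (o - mean_o) for m, o in zip(merged, observed))
    var_m = sum((m - mean_m) ** 2 for m in merged)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_m == 0 or var_o == 0:
        return float("nan")  # CC is undefined for a constant series
    return cov / math.sqrt(var_m * var_o)

# Hypothetical hourly precipitation (mm) at five validation stations
observed = [0.0, 2.5, 10.0, 25.0, 40.0]
merged = [0.5, 3.0, 8.0, 20.0, 33.0]
cc = correlation_coefficient(merged, observed)
print(f"CC = {cc:.3f}")
```

Because CC measures only linear association, a merged product can score a high CC while still underestimating peak rainfall, which is why CC is read here together with the other evaluation indices.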
Figure 14. Residual correction for the results of the XGBoost, LightGBM, and RF methods at 9:00 on
7 May 2018: (a) XGBoost, (b) LightGBM, and (c) RF.
Although CoKriging has the highest precipitation merging accuracy, its modeling time
is also longer than the machine learning methods. Considering the issues of accuracy and
time, we suggest that spatial and temporal autocorrelation parameters should be introduced
into machine learning models in further studies and that the influences of ambient
precipitation and precipitation at previous timepoints on the modeling results should be
fully considered. In addition, the accuracy of the multi-source precipitation data merging
model could be improved by incorporating more satellite precipitation data into the model.
5. Conclusions
We selected the mountainous area of Northern Guangdong Province as the study area
and used data from four heavy rainfall events during the 2018 flood season (23–27 April,
7–10 May, 26–30 August, and 16 September) to establish precipitation data merging models
suitable for the area based on XGBoost, GBDT, LightGBM, RF, MLR, and CoKriging. The
residuals of these machine learning models were corrected with the ordinary Kriging
method to obtain hourly precipitation merging data with a spatial resolution of 1 km,
and the merging results were assessed and analyzed based on the spatial precipitation
distribution and accuracy indices. The research conclusions are as follows:
(1) The errors in these precipitation merging results mainly involve underestimations
for high-precipitation timepoints and overestimations for low- or no-precipitation
timepoints.
(2) The spatial distribution of the accumulated precipitation predicted by CoKriging
agrees the best with the actual pattern, followed by the results of the tree-based
machine learning methods, whereas the distribution of accumulated precipitation
predicted by MLR is significantly different from the actual pattern. The merging results
of CoKriging have a higher accuracy than the machine learning methods, because
precipitation during heavy rainfall events has pronounced spatial autocorrelation, and
radar precipitation data as a covariate are highly correlated with the station-observed
precipitation.
(3) Different machine learning methods are applicable for different types of heavy rainfall
events. The RF-based hourly precipitation merging model is suitable for analyzing
monsoon rainstorm events, and the XGBoost-based hourly precipitation merging
model is suitable for analyzing typhoon events.
(4) The merging performance of the machine learning methods is relatively poor for
timepoints with little precipitation during the heavy rainfall events. One reason
is that the models have difficulty extracting features when a small number of
meteorological stations observe little precipitation; another is that the models do not
capture the temporal variability of precipitation well, whereas steady rain is the
easiest to capture.
(5) The hourly merging results of the tree-based machine learning models contain striped
textures at some timepoints, which are caused by an excessively high correlation
between the precipitation at these timepoints and latitude and distance from the
coastline; the MLR method miscalculates the precipitation values and
locations and overestimates the accumulated precipitation for heavy rainfall events II,
III, and IV.
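The residual correction step used for the machine learning models (ordinary Kriging of the station residuals, added back to the model prediction) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the exponential variogram and its parameters, the station coordinates, the precipitation values, and the zero floor on the corrected value are all hypothetical.

```python
import math

def variogram(h, nugget=0.0, sill=1.0, rng=50.0):
    """Exponential semivariogram; the parameters are illustrative, not fitted."""
    if h == 0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - math.exp(-3.0 * h / rng))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ordinary_kriging(points, values, target):
    """Ordinary kriging estimate at `target` from (x, y) `points` with `values`."""
    n = len(points)
    # Station-to-station semivariances plus the unbiasedness constraint (weights sum to 1)
    a = [[variogram(math.dist(points[i], points[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [variogram(math.dist(p, target)) for p in points] + [1.0]
    weights = solve(a, b)[:n]  # the last unknown is the Lagrange multiplier
    return sum(w * v for w, v in zip(weights, values))

# Hypothetical stations: (x_km, y_km), gauge observation and ML prediction in mm/h
stations = [(0.0, 0.0), (30.0, 5.0), (10.0, 40.0), (45.0, 35.0)]
observed = [12.0, 3.0, 20.0, 8.0]
ml_pred = [10.0, 4.5, 15.0, 8.5]
residuals = [o - p for o, p in zip(observed, ml_pred)]

def corrected(ml_value, cell_xy):
    """ML prediction at a grid cell plus the kriged residual (assumed floor at zero)."""
    return max(0.0, ml_value + ordinary_kriging(stations, residuals, cell_xy))

print(corrected(11.0, (20.0, 20.0)))
```

With a zero nugget, the kriged residual reproduces the station residual exactly at each station, so the corrected field matches the gauge observations there and relaxes toward the machine learning prediction away from the stations.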
In summary, the hourly precipitation merging models proposed in this paper can
provide high-accuracy gridded precipitation data for heavy rainfall events in the
mountainous areas of Northern Guangdong Province. The models combine the advantages of
different types of precipitation data by merging precipitation observed at meteorological
stations with radar and satellite precipitation data, which is essential for studying
different types of heavy rainfall events in mountainous regions. The study has two
limitations. First, although CoKriging has obvious advantages for the heavy rainfall events
in South China, its applicability in other regions needs further research. Second, the
models do not consider the spatiotemporal autocorrelation of precipitation during heavy
rainfall events, and the spatial distributions of precipitation produced by the machine
learning methods have defects at some timepoints. The models need to be improved in
further studies.
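One possible way to expose spatiotemporal autocorrelation to a machine learning model in such further work is to add the previous-hour precipitation at the same station and the mean precipitation at nearby stations as predictors. The sketch below only illustrates this idea; it is not part of the models evaluated in this paper, and the function name, the feature names, and the neighbor table are hypothetical.

```python
def build_feature_rows(series_by_station, neighbors):
    """Return one training row per station and hour, starting at hour 1.

    series_by_station: station id -> list of hourly precipitation (mm)
    neighbors: station id -> list of nearby station ids (hypothetical input)
    """
    rows = []
    for station, series in series_by_station.items():
        nearby = neighbors.get(station, [])
        for t in range(1, len(series)):
            ambient = [series_by_station[n][t] for n in nearby]
            rows.append({
                "station": station,
                "hour": t,
                "target": series[t],
                # temporal autocorrelation: same station, previous hour
                "prev_precip": series[t - 1],
                # spatial autocorrelation: mean of nearby stations, same hour
                "ambient_mean": sum(ambient) / len(ambient) if ambient else None,
            })
    return rows

# Hypothetical three-hour records for two neighboring stations
series = {"A": [0.0, 1.0, 4.0], "B": [2.0, 3.0, 5.0]}
rows = build_feature_rows(series, {"A": ["B"], "B": ["A"]})
for row in rows:
    print(row)
```

Rows built this way could be appended to the existing radar, satellite, and geographic predictors. At ungauged grid cells the ambient term would have to come from an interpolated field, which is a design question left open here.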
Author Contributions: Conceptualization, J.X.; Methodology, J.Z. and J.X.; Software, J.Z.; Validation,
J.Z. and H.R.; Formal Analysis, J.Z., J.X. and X.D.; Investigation, J.X. and J.Z.; Resources, H.R. and
W.J.; Data Curation, J.X.; Writing—Original Draft Preparation, J.Z.; Writing—Review and Editing,
J.X., X.D. and X.L.; Visualization, J.Z.; and Supervision and Project Administration, J.X. All authors
have read and agreed to the published version of the manuscript.
Funding: This research was jointly funded by the National Natural Science Foundation of China,
grant number 41901371; the Science and Technology Planning Project of Guangdong Province, grant
number 2018B020207012; the Key Special Project for Introduced Talents Team of Southern Marine
Science and Engineering Guangdong Laboratory (Guangzhou), grant number GML2019ZD0301; the
Guangdong Innovative and Entrepreneurial Research Team Program, grant number 2016ZT06D336;
the GDAS’ Project of Science and Technology Development, grant number 2019GDASYL-0301001;
and the Science and Technology Program of Guangdong, grant number 2021B1212100006.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Acknowledgments: The authors would like to thank the anonymous reviewers for their constructive
comments. We also thank the Geographical Science Data Center of the Greater Bay Area for providing
the relevant data in this study.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Sapiano, M.; Arkin, P.A. An intercomparison and validation of high-resolution satellite precipitation estimates with 3-hourly
gauge data. J. Hydrometeorol. 2009, 10, 149–166. [CrossRef]
2. Taylor, C.M.; de Jeu, R.A.; Guichard, F.; Harris, P.P.; Dorigo, W.A. Afternoon rain more likely over drier soils. Nature 2012, 489,
423–426. [CrossRef] [PubMed]
3. Jieru, Y.; András, B. Short time precipitation estimation using weather radar and surface observations: With rainfall displacement
information integrated in a stochastic manner. J. Hydrol. 2019, 574, 672–682.
4. Tapiador, F.J.; Turk, F.J.; Walt, P.; Arthur, Y.H.; Eduardo, G.; Luiz, A.T.M.; Carlos, F.A.; Paola, S.; Chris, K.; George, J.H.; et al.
Global precipitation measurement: Methods, datasets and applications. Atmos. Res. 2012, 104, 70–97. [CrossRef]
5. Kidd, C.; Becker, A.; Huffman, G.J.; Muller, C.L.; Joe, P.; Skofronick-Jackson, G.; Kirschbaum, D.B. So, how much of the Earth’s
surface is covered by rain gauges? Bull. Am. Meteorol. Soc. 2017, 98, 69–78. [CrossRef]
6. Rana, S.; McGregor, J.; Renwick, J. Precipitation seasonality over the Indian subcontinent: An evaluation of gauge, reanalyses,
and satellite retrievals. J. Hydrometeorol. 2015, 16, 631–651. [CrossRef]
7. Xie, P.; Arkin, P.A. Analyses of global monthly precipitation using gauge observations, satellite estimates, and numerical model
predictions. J. Clim. 1996, 9, 840–858. [CrossRef]
8. Yilmaz, K.K.; Adler, R.F.; Tian, Y.; Hong, Y.; Pierce, H.F. Evaluation of a satellite-based global flood monitoring system. Int. J.
Remote Sens. 2010, 31, 3763–3782. [CrossRef]
9. Arkin, P.A.; Meisner, B.N. The relationship between large-scale convective rainfall and cold cloud over the western hemisphere
during 1982–84. Mon. Weather Rev. 1987, 115, 51–74. [CrossRef]
10. Berg, W.; Chase, R. Determination of mean rainfall from the Special Sensor Microwave/Imager (SSM/I) using a mixed lognormal
distribution. J. Atmos. Ocean. Technol. 1992, 9, 129–141. [CrossRef]
11. Xie, P.; Arkin, P.A. Global precipitation: A 17-year monthly analysis based on gauge observations, satellite estimates, and
numerical model outputs. Bull. Am. Meteorol. Soc. 1997, 78, 2539–2558. [CrossRef]
12. Huffman, G.J.; Adler, R.F.; Arkin, P.; Chang, A.; Ferraro, R.; Gruber, A.; Janowiak, J.; McNab, A.; Rudolf, B.; Schneider, U. The
global precipitation climatology project (GPCP) combined precipitation dataset. Bull. Am. Meteorol. Soc. 1997, 78, 5–20. [CrossRef]
13. Ziqiang, M.; Jintao, X.; Kang, H.; Xiuzhen, H.; Qingwen, J.; TseChun, W.; Wentao, X.; Yang, H. An updated moving window
algorithm for hourly-scale satellite precipitation downscaling: A case study in the Southeast Coast of China. J. Hydrol. 2020, 581,
124378.
14. Gao, Y.; Xu, H.; Liu, G. Evaluation of the GSMaP Estimates on Monitoring Extreme Precipitation Events.
Remote Sens. Technol. Appl. 2019, 34, 1121–1132.
15. Michaelides, S.; Levizzani, V.; Anagnostou, E.; Bauer, P.; Kasparis, T.; Lane, J.E. Precipitation: Measurement, remote sensing,
climatology and modeling. Atmos. Res. 2009, 94, 512–533. [CrossRef]
16. Zhang, J.; Howard, K.; Langston, C.; Kaney, B.; Qi, Y.; Tang, L.; Grams, H.; Wang, Y.; Cocks, S.; Martinaitis, S. Multi-Radar
Multi-Sensor (MRMS) quantitative precipitation estimation: Initial operating capabilities. Bull. Am. Meteorol. Soc. 2016, 97,
621–638. [CrossRef]
17. Shen, Y.; Zhao, P.; Pan, Y.; Yu, J. A high spatiotemporal gauge-satellite merged precipitation analysis over China. J. Geophys. Res.
Atmos. 2014, 119, 3063–3075. [CrossRef]
18. Alharbi, R.; Hsu, K.; Sorooshian, S. Bias adjustment of satellite-based precipitation estimation using artificial neural networks-
cloud classification system over Saudi Arabia. Arab. J. Geosci. 2018, 11, 1–17. [CrossRef]
19. Xu, G.; Wang, Z.; Xia, T. Mapping Areal Precipitation with Fusion Data by ANN Machine Learning in Sparse Gauged Region.
Appl. Sci. 2019, 9, 2294. [CrossRef]
20. Shen, Y.; Pan, S.; Xu, B.; Y, J. Parameter Improvements of Hourly Automatic Weather Stations Precipitation Analysis by Optimal
Interpolation over China. J. Chengdu Univ. Technol. 2012, 27, 219–224.
21. Kunwei, L.; Xiong, Y.; Xin, Z.; Fen, T. Multi-source Precipitation Data Fusion Method Based on Filtersim. J. Syst. Simul. 2019,
31, 1232.
22. Wu, H.; Yang, Q.; Liu, J.; Wang, G. A spatiotemporal deep fusion model for merging satellite and gauge precipitation in China. J.
Hydrol. 2020, 584, 124664. [CrossRef]
23. Chen, S.; Xiong, L.; Ma, Q.; Kim, J.; Chen, J.; Xu, C. Improving daily spatial precipitation estimates by merging gauge observation
with multiple satellite-based precipitation products based on the geographically weighted ridge regression method. J. Hydrol.
2020, 589, 125156. [CrossRef]
24. Delrieu, G.; Wijbrans, A.; Boudevillain, B.; Faure, D.; Bonnifait, L.; Kirstetter, P. Geostatistical radar–raingauge merging: A novel
method for the quantification of rain estimation accuracy. Adv. Water Resour. 2014, 71, 110–124. [CrossRef]
25. Sideris, I.V.; Gabella, M.; Sassi, M.; Germann, U. Real-Time Spatiotemporal Merging of Radar and Raingauge Precipitation
Measurements in Switzerland. In Proceedings of the 9th International Workshop on Precipitation in Urban Areas, St. Moritz,
Switzerland, 6–9 December 2012.
26. Azimi-Zonooz, A.; Krajewski, W.F.; Bowles, D.S.; Seo, D.J. Spatial rainfall estimation by linear and non-linear co-kriging of
radar-rainfall and raingage data. Stoch. Hydrol. Hydraul. 1989, 3, 51–67. [CrossRef]
27. Zhang, G.; Tian, G.; Cai, D.; Bai, R.; Tong, J. Merging radar and rain gauge data by using spatial–temporal local weighted linear
regression kriging for quantitative precipitation estimation. J. Hydrol. 2021, 601, 126612. [CrossRef]
28. Chen, H.; Chandrasekar, V.; Cifelli, R.; Xie, P. A Machine Learning System for Precipitation Estimation Using Satellite and Ground
Radar Network Observations. IEEE Trans. Geosci. Remote Sens. 2019, 58, 982–994. [CrossRef]
29. Sønderby, C.K.; Espeholt, L.; Heek, J.; Dehghani, M.; Oliver, A.; Salimans, T.; Agrawal, S.; Hickey, J.; Kalchbrenner, N. Metnet: A
neural weather model for precipitation forecasting. arXiv 2020, arXiv:2003.12140.
30. Hazra, A.; Maggioni, V.; Houser, P.; Antil, H.; Noonan, M. A Monte Carlo-based multi-objective optimization approach to merge
different precipitation estimates for land surface modeling. J. Hydrol. 2019, 570, 454–462. [CrossRef]
31. Pang, Y.; Shen, Y.; Yu, J.; Xiong, A. An experiment of high-resolution gauge-radar-satellite combined precipitation retrieval based
on the Bayesian merging method. Acta Meteorol. Sin. 2015, 73, 177–186.
32. Wehbe, Y.; Temimi, M.; Adler, R.F. Enhancing precipitation estimates through the fusion of weather radar, satellite retrievals, and
surface parameters. Remote Sens. 2020, 12, 1342. [CrossRef]
33. Li, J.; Yu, R.; Sun, W. Duration and seasonality of the hourly extreme rainfall in the central-eastern part of China. Acta Meteorol.
Sin. 2013, 71, 652–659.
34. Trenberth, K.E.; Dai, A.; Rasmussen, R.M.; Parsons, D.B. The changing character of precipitation. Bull. Am. Meteorol. Soc. 2003, 84,
1205–1218. [CrossRef]
35. Li, D.; Chen, W.; Ye, A. Climatic characteristics and forecast focus of heavy rain in Qingyuan. Guangdong Meteorol. 1999, 2, 8–10.
36. Roe, G.H. Orographic precipitation. Annu. Rev. Earth Planet. Sci. 2005, 33, 645–671. [CrossRef]
37. Huffman, G.J.; Bolvin, D.T.; Braithwaite, D.; Hsu, K.; Joyce, R.; Xie, P.; Yoo, S. NASA global precipitation measurement (GPM)
integrated multi-satellite retrievals for GPM (IMERG). Algorithm Theor. Basis Doc. ATBD Version 2015, 4, 26.
38. Shige, S.; Yamamoto, T.; Tsukiyama, T.; Kida, S.; Ashiwake, H.; Kubota, T.; Seto, S.; Aonashi, K.; Okamoto, K. The GSMaP
precipitation retrieval algorithm for microwave sounders—Part I: Over-ocean algorithm. IEEE Trans. Geosci. Remote Sens. 2009, 47,
3084–3097. [CrossRef]
39. Hou, A.Y.; Kakar, R.K.; Neeck, S.; Azarbarzin, A.A.; Kummerow, C.D.; Kojima, M.; Oki, R.; Nakamura, K.; Iguchi, T. The global
precipitation measurement mission. Bull. Am. Meteorol. Soc. 2014, 95, 701–722. [CrossRef]
40. Ushio, T.; Sasashige, K.; Kubota, T.; Shige, S.; Okamoto, K.; Aonashi, K.; Inoue, T.; Takahashi, N.; Iguchi, T.; Kachi, M. A
Kalman filter approach to the Global Satellite Mapping of Precipitation (GSMaP) from combined passive microwave and infrared
radiometric data. J. Meteorol. Soc. Jpn. Ser. II 2009, 87, 137–151. [CrossRef]
41. Kyriakidis, P.C. A geostatistical framework for area-to-point spatial interpolation. Geogr. Anal. 2004, 36, 259–289. [CrossRef]
42. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
43. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T. LightGBM: A highly efficient gradient boosting decision
tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3146–3154.
44. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [CrossRef]
45. Zhang, R. Spatial Variation Theory and Applications; Science Press: Beijing, China, 2005.
46. Huang, X.; He, L.; Zhao, H.; Huang, Y.; Wu, Y. Prediction model based on the Laplacian eigenmap method combined with a
random forest algorithm for rainstorm satellite images during the first annual rainy season in South China. Nat. Hazards 2021,
107, 331–353. [CrossRef]
47. Chao, L.; Zhang, K.; Li, Z.; Zhu, Y.; Wang, J.; Yu, Z. Geographically weighted regression based methods for merging satellite and
gauge precipitation. J. Hydrol. 2018, 558, 275–289. [CrossRef]
48. Li, X.; Wei, Z.; Shaoping, H.; Weihua, D.; Xueying, Z. Analysis of fusion test results on hourly precipitation from meteorological
and hydrological stations and radar. Torrential Rain Disasters 2020, 39, 276–284.