2015 Elsevier Comparison of Feature Selection Methods Using ANNs in MCP Wind Speed Methods a Case Study
Applied Energy
journal homepage: www.elsevier.com/locate/apenergy
Highlights
An analysis is carried out of the benefits of feature selection in MCP methods which use ANNs.
The wrapper approach (WA) generated lower mean errors than the filter approach (FA).
No significant statistical difference was observed between the WA and the FA in certain cases.
The FA generated models somewhat simpler and more interpretable than the WA.
The WA displayed better predictive capacity than the FA, but is more computationally intensive.
Article info

Article history:
Received 8 May 2015
Received in revised form 27 July 2015
Accepted 21 August 2015
Available online 7 September 2015

Keywords:
Measure–correlate–predict method
Artificial neural networks
Wind speed
Wind direction
Feature selection
Cross-validation technique

Abstract

Recent studies in the field of renewable energies, and specifically in wind resource prediction, have shown growing interest in proposals for Measure–Correlate–Predict (MCP) methods which simultaneously use data recorded at various reference weather stations. In this context, the use of a high number of reference stations may result in overspecification with its associated negative effects. These include, amongst others, an increase in the estimation error and/or overfitting which could be detrimental to the generalisation capacity of the model when handling new data (prediction).
This paper analyses the benefits of feature selection for use with Artificial Neural Network (ANN) techniques with a multilayer perceptron (MLP) structure when the ANNs are used as MCP methods to predict mean hourly wind speeds at a target site. The features considered in this study were the mean hourly wind speeds and directions recorded in 2003 and 2004 at five weather stations in the Canary Archipelago (Spain).
The two feature selection techniques considered in the analysis were the Correlation Feature Selection (CFS), which is a correlation-based filter approach (FA), and an MLP-based wrapper approach (WA). The metrics used to compare the results were the mean absolute error (MAE), the mean absolute percentage error (MAPE) and the index of agreement (IoA).
Evaluation of the mean errors obtained in the 10-fold cross-validation tests for the year used to represent the short-term wind data period resulted in several conclusions. These included, notably, that the WA gave lower mean errors than the FA in 100% of the cases analysed independently of the metric employed. However, the FA resulted in a significant reduction in computational load and considerable enhancement of model interpretability. When very good correlation coefficients were obtained between the target and reference stations, no significant statistical difference was observed at 5% level between the three models (FA, WA and the models constructed with all the variables) in most of the cases analysed.
© 2015 Elsevier Ltd. All rights reserved.
⇑ Corresponding author. Tel.: +34 928 45 14 83; fax: +34 928 45 14 84.
E-mail address: [email protected] (J.A. Carta).
http://dx.doi.org/10.1016/j.apenergy.2015.08.102
0306-2619/© 2015 Elsevier Ltd. All rights reserved.
J.A. Carta et al. / Applied Energy 158 (2015) 490–507 491
Nomenclature
A  parameter defined by Eq. (2)
AEMET  State Meteorological Agency of the Ministry of the Environment and Rural and Marine Environs of the Spanish Government (Spanish initials)
ANOVA  analysis of variance
ANNs  artificial neural networks
agl  above ground level
B  parameter defined by Eq. (2)
BH  Benjamini and Hochberg step-up procedure [49]
C  parameter defined by Eq. (2)
CCC  circular correlation coefficient, Eq. (1)
CFS  Correlation Feature Selection
CL  circular-linear correlation coefficient, Eq. (3)
CPU  Central Processing Unit
D, D′  variables that represent wind direction (degrees)
D1, ..., D5  variables that represent wind directions of weather stations no. 1, ..., 5, respectively
E  parameter defined by Eq. (2)
e  vector which contains the estimated values of the wind speed (m/s) at a target site, Eqs. (5)–(7)
F  parameter defined by Eq. (2)
FA  filter approach
FA-1, FA-2  used to mark the itinerary followed in the use of the filter approach (FA), Fig. 7
FS  full feature set (no feature selection)
G  parameter defined by Eq. (2)
GNU  recursive acronym which stands for GNU is Not Unix
H  parameter defined by Eq. (2)
H0  null hypothesis, Eq. (8)
H1  alternative hypothesis, Eq. (8)
IoA  index of agreement, Eq. (7)
ITC  Technological Institute of the Canary Islands (Spanish initials)
MAE  mean absolute error, Eq. (5) (m/s)
MAPE  mean absolute percentage error, Eq. (6) (%)
MCP  Measure–Correlate–Predict
MLP  multilayer perceptron
n  number of data, Eqs. (1), (2), (5), (6) and (7)
ncv  number of errors obtained in 10-fold cross-validation
N  number of reference stations, Fig. 5
o  vector which contains the observed wind speed values (m/s) at a target site, Eqs. (5)–(7)
ō  arithmetic mean in m/s of observed wind speed values, Eq. (7)
p-value  the estimated probability of rejecting the null hypothesis (H0) of a study question when that null hypothesis is true [50]
Q  parameter defined by Eq. (2)
q  number of neurons in the hidden layer of the neural network, Fig. 5
r_cs  correlation coefficient, Eq. (4)
r_vc  correlation coefficient, Eq. (4)
r_vs  correlation coefficient, Eq. (4)
S  variable representing wind speed (m/s)
S1, ..., S5  variables which represent the wind speeds of weather stations number 1, ..., 5, respectively
WA  wrapper approach
WA-1, WA-2  used to mark the itinerary followed in the use of the wrapper approach (WA), Fig. 7
Weka  Waikato Environment for Knowledge Analysis
WS  weather station
WS-k  weather station identified with the number k

Greek letters
α  level of statistical significance
μi, μj  population means of models i and j, Eq. (8)
1.2. Aim of this paper

As previously stated, a growing trend has been noted in proposals that employ MCP methods, which simultaneously use as reference variables wind speed and direction data series registered at various WSs, for long-term wind speed estimation of a target site. However, despite an exhaustive search of the literature, it appears that no proposals have been made to use, within the general framework of strategies of MCP methods, feature selection techniques. According to the bibliography consulted, MCP methods which propose the use of multiple reference WSs employ all the available features without taking into consideration the pros and cons that may be associated with such an action strategy.
Accordingly, one of the aims of the present study is to determine the benefits, if any, of feature selection [28–31] when using neural networks as MCP methods. The idea behind this is to offer users and designers of MCP methods a strategy that provides them with greater information and the possibility of improving the use of such methods. Users of traditional MCP methods, which employ a single reference station, tend to turn to rules of thumb when selecting this reference station [10]. Such rules are based on the degree of linear correlation that exists between the wind speeds of the target station and those of the reference station. However, no rules of thumb have been proposed for the selection of various reference stations. In view of this, the present paper will consider a specific MCP problem which consists of predicting the mean hourly wind speeds at different target sites using the wind speeds and directions from four reference WSs in the Canary Archipelago (Spain) as features. Given the different levels of correlation observed between the features of the WSs used (see Section 2.1), this study also aims to outline rules of thumb that can serve as guidelines when choosing the most appropriate feature selection technique for use in MCP methods using multiple WSs.
In view of their popularity in this type of problem, the ANNs used in this study are MLP neural networks with a hidden layer and a linear-type output layer in accordance with the continuous nature of the wind speed variable.
ANNs are a group of machine learning techniques which could fit into the category of non-parametric statistical techniques given their capacity to approximate any continuous function [32]. In consequence, their flexibility is greater than that of parametric techniques, but interpretability is lower and there is also a higher risk of overfitting.
The essential question that lies behind the objectives of this work is whether MLP algorithms are sufficiently capable by themselves of carrying out the necessary selection of features in an MCP context, or whether it is necessary or at the very least beneficial to use selection techniques to complement the MLP algorithms. Three model types are considered in this study. One involves no feature selection (FS – Full Set) and the other two are commonly used feature selection strategies: a filter approach (FA) based on correlation, the so-called Correlation Feature Selection (CFS) [33–35], and a wrapper approach (WA) based on MLPs [29,33]. This study aims to compare and determine which of the above model types gives the best results of the following metrics: the mean absolute error (MAE), the mean absolute percentage error (MAPE) and a refined version of Willmott’s dimensionless index of agreement (IoA) [36]. Importantly, this study aims to offer the strongest possible grounds for its conclusions in order to avoid having to treat purely circumstantial differences between the results of the different methods as being due to important structural mechanisms when, in fact, they were simply caused by the randomness of the samples. Therefore, the comparison in this study will not only involve determination, for the particular sample used, of the differences between the results obtained with the different models according to the various metrics in a 10-fold cross-validation process [33]. An additional comparison will involve statistical hypothesis testing to determine whether there is a significant statistical difference (at 5% level) between the results obtained with the three strategies considered (FA, WA and FS).
The article is structured as follows: the following section describes the materials, including the data sample, the MLP models used and the feature selection techniques that are compared. The methodology employed for the comparison study is then described. Next, the results that were obtained are presented and analysed and, finally, a description is given of the conclusions drawn from the study.

2. Materials

2.1. Meteorological data used

The meteorological data used in this paper (mean hourly wind speeds and directions) were recorded at five WSs installed on three of the seven major islands that make up the Canary Archipelago (Spain) (Fig. 1).
The wind speeds were captured using rotating cup type anemometers situated on masts at 10 m above ground level (agl) located on the coasts of the islands (Fig. 1). The data series used were recorded during the years 2003 and 2004 and were provided by the Technological Institute of the Canary Islands (Spanish initials: ITC)¹ and by the State Meteorological Agency (Spanish initials: AEMET) of the Ministry of Environment and Rural and Marine Affairs of the Government of Spain.
The codes assigned to each WS are shown in the first column of Table 1. The ITC provided the data series from WS-1 and WS-2 and AEMET the data series from WS-3, WS-4 and WS-5. The geographical coordinates of each station (latitude, longitude and altitude) are also shown in Table 1 along with the annual mean and standard deviation of the wind speed.
The predominant wind regime in the Canary Islands is that of the so-called trade winds, which blow mostly in a NE direction. The wind roses of 2003 for each WS are shown in Fig. 2a and c and of 2004 in Fig. 2b and d.
For ease of interpretation, the scatterplot matrix of the 10 variables of the 2003 data sample has been divided into three figures. Fig. 3 represents the scatterplot matrix of the different wind direction variables, while Fig. 4 shows the scatterplot matrix that corresponds to the different wind speed and wind direction variables. The 2004 scatterplots have very similar structures. The sample therefore comprises eight features or input variables (wind speeds and directions of the four reference stations) for each of the five target stations, with a sampling size of 6738 data for 2003 and 8114 for 2004 for each of the variables.
Table 2 shows the correlation between the mean hourly wind speeds of the different WSs in 2003 and 2004. The correlation was quantified by calculating the Pearson product-moment coefficient of linear correlation (commonly called Pearson’s correlation coefficient). Pearson’s correlation coefficient, in this case, measures the magnitude or strength of the linear association, as well as the direction (rising or falling, depending respectively on whether the sign is positive or negative), between the recorded wind speeds at the target station and reference station [10]. The correlation coefficients recorded in the two years considered in this study ranged between 0.634 and 0.948. The most notable correlations are between wind speed variables S1 and S4, corresponding to WS-1 and WS-4.

¹ Company belonging to the Board of Industry of the Autonomous Canary Government.
Table 1
Weather stations used in the study.
Table 3 shows the circular-correlation coefficient (CCC) between the mean hourly wind directions of the different stations in 2003 and 2004. The CCC proposed by Fisher and Lee [37,38] to analyse the relationship between two angular variables was used, and is defined in Eq. (1).

\[
\mathrm{CCC} = \frac{4(AB - CQ)}{\left[(n^2 - E^2 - F^2)(n^2 - G^2 - H^2)\right]^{1/2}} \tag{1}
\]

where n is the number of data and A, B, C, Q, E, F, G and H are given by Eq. (2)

\[
\left.
\begin{aligned}
A &= \sum_{i=1}^{n} \cos(D_i)\cos(D'_i), & B &= \sum_{i=1}^{n} \sin(D_i)\sin(D'_i) \\
C &= \sum_{i=1}^{n} \cos(D_i)\sin(D'_i), & Q &= \sum_{i=1}^{n} \sin(D_i)\cos(D'_i) \\
E &= \sum_{i=1}^{n} \cos(2D_i), & F &= \sum_{i=1}^{n} \sin(2D_i) \\
G &= \sum_{i=1}^{n} \cos(2D'_i), & H &= \sum_{i=1}^{n} \sin(2D'_i)
\end{aligned}
\right\} \tag{2}
\]

hypothesis of non-existence of linear association is rejected if CCC differs greatly from zero.
The range of CCC determined in the two years varies between 0.2 and 0.794. The linear character of the relationships that this coefficient detects can be seen in Fig. 3. The most notable relationships are between the wind direction variables D2 and D4, corresponding to WS-2 and WS-4.
Table 4 shows the circular-linear (CL) correlation coefficient between the mean hourly wind directions (D) and speeds (S) of the different stations in 2003 and 2004. The CL proposed by Mardia [39] and Johnson and Wehrley [40] was used and is formulated in Eq. (3).

\[
\mathrm{CL} = \left[\frac{r_{vc}^2 + r_{vs}^2 - 2\, r_{vc}\, r_{vs}\, r_{cs}}{1 - r_{cs}^2}\right]^{1/2} \tag{3}
\]

where
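As a rough illustration of Eqs. (1)–(3), the two coefficients can be computed along the following lines. This is only a sketch under stated assumptions: the function names are hypothetical, directions are assumed to be given in degrees, and the three correlations entering Eq. (3) are assumed to be the Pearson correlations between S and cos D (r_vc), S and sin D (r_vs), and cos D and sin D (r_cs), which is the usual reading of Mardia's coefficient.

```python
import numpy as np

def circular_correlation(d_deg, d_prime_deg):
    """CCC of Eq. (1), with the sums of Eq. (2). Inputs are directions in degrees."""
    d = np.radians(np.asarray(d_deg, dtype=float))
    dp = np.radians(np.asarray(d_prime_deg, dtype=float))
    n = d.size
    A = np.sum(np.cos(d) * np.cos(dp))
    B = np.sum(np.sin(d) * np.sin(dp))
    C = np.sum(np.cos(d) * np.sin(dp))
    Q = np.sum(np.sin(d) * np.cos(dp))
    E = np.sum(np.cos(2 * d))
    F = np.sum(np.sin(2 * d))
    G = np.sum(np.cos(2 * dp))
    H = np.sum(np.sin(2 * dp))
    denom = np.sqrt((n**2 - E**2 - F**2) * (n**2 - G**2 - H**2))
    return 4.0 * (A * B - C * Q) / denom

def circular_linear_correlation(s, d_deg):
    """CL of Eq. (3). Assumed definitions: r_vc = corr(S, cos D),
    r_vs = corr(S, sin D), r_cs = corr(cos D, sin D)."""
    s = np.asarray(s, dtype=float)
    d = np.radians(np.asarray(d_deg, dtype=float))
    r_vc = np.corrcoef(s, np.cos(d))[0, 1]
    r_vs = np.corrcoef(s, np.sin(d))[0, 1]
    r_cs = np.corrcoef(np.cos(d), np.sin(d))[0, 1]
    return np.sqrt((r_vc**2 + r_vs**2 - 2 * r_vc * r_vs * r_cs) / (1 - r_cs**2))

# Illustrative (made-up) directions and speeds, not data from the study.
directions = np.array([10.0, 80.0, 150.0, 220.0, 300.0, 45.0])
speeds = 3.0 + 2.0 * np.cos(np.radians(directions))
ccc_same = circular_correlation(directions, directions)
ccc_mirror = circular_correlation(directions, 360.0 - directions)
cl_value = circular_linear_correlation(speeds, directions)
```

With these made-up inputs, identical direction series give a CCC of 1, mirrored series give −1, and a speed that is an exact linear function of cos D gives a CL of 1, which matches the ranges quoted in the text for each coefficient.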
Fig. 2. Wind roses for each of the stations. The (a) and (c) columns represent the wind roses for 2003, and the (b) and (d) columns for 2004.
(sin D, cos D)). This coefficient takes values between 0 and 1, which is to say it does not indicate negative correlation.
The CL correlation coefficients obtained in the two years considered in this study are in the range 0.252–0.521. The non-linear periodic structure, typical of the sinusoidal relationships that this coefficient detects, of the relationships between the wind speed and wind direction variables can be seen in Fig. 4.

2.2. Techniques used in the comparison

2.2.1. Architecture of the ANNs used
Given the dominance in the scientific literature concerned with renewable energies of proposals for biologically-inspired algorithms [3,11,18–23,26,27] to estimate long-term wind speed characteristics using multiple reference stations, the models employed in this paper for such estimation were configured using ANNs.
More specifically, MLP neural networks were used. These possess a multilayer feedforward structure with a single hidden layer comprised of hidden units (neurons) with sigmoid activation functions, and a single neuron output layer with linear activation, as corresponds to the continuous nature of the wind speed at the target site. Moreover, this architecture has shown its capacity to satisfactorily approximate any continuous transformation [32–34] and its use has been proposed by several authors [18,20–22]. This property, known as the universal approximation property, ensures that an MLP network with a hidden layer can reproduce the structure of relationships that exist between the input variables and the target variable, supposing that the relationships are continuous. In consequence, it is not in theory necessary to use MLP networks with more than one hidden layer. In practice, moreover, networks with more than one hidden layer often make model training and selection processes much more difficult without offering any additional benefits. This is due to the higher degree of non-linearity which is introduced into the target function of the optimisation algorithm and to the addition of an extra parameter for each new hidden layer (the number of neurons in that layer) in what is already a complex model selection process. In this context, only the presence of correlations with a high degree of non-linearity between the input variables and the target variable, together with a large sample size that can reveal these complex relationships, would make such multiple architecture advisable. However, these are not the circumstances of our particular case in view of the moderate complexity of the relationships shown in Fig. 3. A more in-depth discussion of this question which is not limited to the set of techniques used in the present study can be found in [41].
In this paper, the input neurons of the models are fed with series of wind speeds (S) and directions (D) recorded at various reference stations (Fig. 5). If N reference stations are used, then the maximum number of features (input variables) is 2N and the number of neurons of the output layer is one (the target site wind speed).

2.2.2. Selection and training algorithm of the ANN model
The algorithm used in the training of the model of all the MLP neural networks used in this work is the backpropagation algorithm, which aims to optimise estimation of the parameters of the networks. More specifically, we used the MultilayerPerceptron algorithm implemented in Weka (Waikato Environment for Knowledge Analysis), free software available under the GNU General Public License [33] and developed by Waikato University (New Zealand).
The 10-fold cross-validation technique was used to estimate the error of the estimation models. This technique is widely used and accepted in the data mining community [33].
The cross-validation mechanism is schematically represented in Fig. 6. This method consists of dividing the data, once randomly ordered, into ncv = 10 discrete subsets of similar size. Model learning then takes place using 9 subsets, with the mean prediction error being determined with the different metrics in the remaining subset. The procedure is repeated 10 times, omitting from the training a different subset each time. The final mean prediction error with each metric is obtained by calculating the arithmetic mean of the 10 mean prediction errors obtained in each of the 10 subsets that were successively excluded from the training. The variance of the 10 mean prediction errors in the 10 cross-validation groups gives an idea of the variability of these partial means and, therefore, of model performance stability when handling new data.
The number of neurons in the hidden layer (q in Fig. 5) must be specified by the designer when configuring the structure of the network in Weka. Various heuristic rules have been proposed in

Fig. 3. Scatterplot matrix of the wind directions recorded during 2003 at the five weather stations used in the study.
Fig. 4. Scatterplot matrix of the wind speeds and wind directions recorded during 2003 at the five weather stations used in the study.
Table 2
Linear correlation coefficients between the wind speeds of the five anemometer weather stations.
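The architecture described in Section 2.2.1 (2N inputs, one hidden layer of q sigmoid neurons, one linear output neuron) can be sketched as a forward pass in a few lines. This is an illustrative NumPy sketch, not the Weka MultilayerPerceptron implementation used in the study; the function names, the weight initialisation and the value q = 5 are all assumptions made for the example, and training by backpropagation is omitted.

```python
import numpy as np

def init_mlp(n_features, q, seed=0):
    """Create randomly initialised weights for a 2N -> q -> 1 network (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(scale=0.1, size=(n_features, q)),  # input -> hidden weights
        "b1": np.zeros(q),                                   # hidden biases
        "W2": rng.normal(scale=0.1, size=(q, 1)),            # hidden -> output weights
        "b2": np.zeros(1),                                   # output bias
    }

def forward(params, X):
    """Sigmoid hidden layer followed by a single linear output neuron."""
    hidden = 1.0 / (1.0 + np.exp(-(X @ params["W1"] + params["b1"])))
    return (hidden @ params["W2"] + params["b2"]).ravel()

# N = 4 reference stations -> at most 2N = 8 features (4 speeds and 4 directions).
n_reference_stations = 4
params = init_mlp(n_features=2 * n_reference_stations, q=5)
X_demo = np.zeros((3, 2 * n_reference_stations))  # three made-up input rows
y_demo = forward(params, X_demo)                  # one estimated wind speed per row
```

The linear output neuron is what lets the network return an unbounded continuous value, in line with the continuous nature of the target wind speed.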
Table 3
Circular-correlation coefficients between the wind directions of the five anemometer weather stations.
Table 4
Circular-linear correlation coefficients between the wind directions (horizontal) and wind speeds (vertical) of the five anemometer weather stations.
Fig. 6. Schematic representation of the 10-fold cross-validation mechanism used in the study.
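The mechanism of Fig. 6, combined with the three metrics defined in Eqs. (5)–(7), can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the random partition is generated with NumPy rather than Weka, and an ordinary least-squares model stands in for the MLP purely to keep the example self-contained.

```python
import numpy as np

def mae(e, o):
    """Mean absolute error, Eq. (5)."""
    return np.mean(np.abs(e - o))

def mape(e, o):
    """Mean absolute percentage error, Eq. (6). Assumes no observed value is zero."""
    return 100.0 * np.mean(np.abs((e - o) / o))

def ioa(e, o):
    """Refined index of agreement, Eq. (7). Assumes the observations are not all equal."""
    err = np.sum(np.abs(e - o))
    dev = 2.0 * np.sum(np.abs(o - np.mean(o)))
    return 1.0 - err / dev if err <= dev else dev / err - 1.0

def cross_validate(X, y, fit, predict, n_cv=10, seed=0):
    """Return (mean, variance) of each metric over the n_cv held-out subsets."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_cv)  # random, similar-sized subsets
    scores = {"MAE": [], "MAPE": [], "IoA": []}
    for k in range(n_cv):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_cv) if j != k])
        model = fit(X[train], y[train])
        e = predict(model, X[test])
        scores["MAE"].append(mae(e, y[test]))
        scores["MAPE"].append(mape(e, y[test]))
        scores["IoA"].append(ioa(e, y[test]))
    return {name: (np.mean(v), np.var(v, ddof=1)) for name, v in scores.items()}

# Stand-in model (NOT the MLP of the study): least squares with an intercept.
def fit_linear(X, y):
    return np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)[0]

def predict_linear(coef, X):
    return np.c_[X, np.ones(len(X))] @ coef

# Made-up data: 200 samples, 8 features, target always positive.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0.0, 1.0, size=(200, 8))
y_demo = 5.0 + X_demo @ np.arange(1.0, 9.0)
cv_results = cross_validate(X_demo, y_demo, fit_linear, predict_linear)
```

Because the same seed reproduces the same partition, calling `cross_validate` with identical `seed` values for different models gives them the same ten held-out subsets, which is the precondition for the paired comparisons described in Section 3.2.1.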
“o”. MAE is expressed in the same units as the parameters it compares.

\[
\mathrm{MAE} = \frac{\sum_{i=1}^{n} |e_i - o_i|}{n} \tag{5}
\]

With the MAE, all sizes of error are treated evenly according to their magnitude. MAPE is defined by Eq. (6).

\[
\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{e_i - o_i}{o_i} \right| \tag{6}
\]

The MAPE is a relative measurement that expresses the error as a percentage of the observed data.
Willmott et al. [36] propose the IoA that is given by Eq. (7), where ō is the arithmetic mean of the values observed. The IoA is dimensionless and its value is found in the range from −1 to 1.

\[
\mathrm{IoA} =
\begin{cases}
1 - \dfrac{\sum_{i=1}^{n} |e_i - o_i|}{2 \sum_{i=1}^{n} |o_i - \bar{o}|}, & \text{when } \sum_{i=1}^{n} |e_i - o_i| \le 2 \sum_{i=1}^{n} |o_i - \bar{o}| \\[2ex]
\dfrac{2 \sum_{i=1}^{n} |o_i - \bar{o}|}{\sum_{i=1}^{n} |e_i - o_i|} - 1, & \text{when } \sum_{i=1}^{n} |e_i - o_i| > 2 \sum_{i=1}^{n} |o_i - \bar{o}|
\end{cases}
\tag{7}
\]

According to Willmott et al. [36], in general, the IoA is more rationally related to model accuracy than other indices in use. They also point out that this index is quite flexible and so is applicable to a wide range of model-performance problems. When IoA values are close to 1 this indicates strong agreement between the results of the model and the observations. If IoA is equal to 0, this means according to Willmott et al. [36] that “the sum of the magnitudes of the errors and the sum of the perfect-model-deviation and observed-deviation magnitudes are equivalent”. However, the same authors also point out that “values of IoA near −1 can mean that the model-estimated deviations about ō are poor estimates of the observed deviations; but, they also can mean that there simply is little observed variability. As the lower limit of IoA is approached, interpretations should be made cautiously”.

2.3. Hardware used for the calculations

Given the methodological procedure used (described in Section 3) to attain the objectives outlined and given the volume of data involved in the study, it is clear that the computational time required would be considerable if using a PC with a low number of microprocessors. It was therefore decided to use a supercomputer, in this case Atlante which forms part of the Spanish
Supercomputing Network and has a distributed-memory cluster [48]. Atlante consists of 84 IBM JS21 compute nodes (blades). Each blade has two dual-core processors at 2.5 GHz running the Linux operating system with 8 GB of RAM and 73 GB of local disk storage. In total, peak performance of Atlante is 3.09 TFlops. One of the networks that interconnects Atlante (Myrinet Network) has a high bandwidth which it uses for parallel communication applications.

3. Methodology

3.1. Preamble

A general outline of the procedure used is shown in Fig. 7. For easier interpretation the procedure has been particularised for the case in which the short-term series of wind speeds and directions at the reference stations are the four shown in Table 1 and Fig. 1 (with codes WS-2 to WS-5) and the target site is WS-1. The year 2003 is taken as the time period representative of the short-term common to all stations. As indicated by various authors [1,2,4,5,9,10], a long series of wind data at a target site is required to estimate the corresponding long-term wind resource (some authors speak of the need for 20 or 30 years' worth of data [1,5,7–10]). Given that no such long historical series of data (which would meet the typical constraints of MCP methods [10]) were available for a significant number of stations which would allow the analysis of different degrees of correlation, the stages indicated in Fig. 7 by a circled ‘3’ and circled ‘4’ will not be taken into consideration in the procedure employed in this paper. In other words, the model construction and testing stages are covered, as the length of the short-term representative series extends to one year and, therefore, the seasonal variation influence of the wind characteristics is considered to be picked up, as is generally recommended [10]. However, long-term hindcasting of the wind conditions of the target station is not carried out.
Though Fig. 7 represents a specific case, the work actually performed was much more extensive in that the cases analysed included each of the five WSs of Table 1 as target site and both 2003 and 2004 were used as the short-term representative year.

3.2. Procedure followed

As can be seen in Fig. 7, the procedure consists of three independent action blocks to produce the different MCP models that result from using MLP neural networks.
One block uses the complete set of variables, which is to say that no type of feature selection takes place. This is indicated by the letters ‘FS’ (Full Set) enclosed in a polygon in Fig. 7. This block produces MLP neural networks trained with all the variables according to the procedure and model selection described in Sections 2.2.1 and 2.2.2. The other two blocks employ the feature selection methods explained in Section 2.2.3. The FA block is represented in Fig. 7 by the itinerary marked with the initials “FA-1” and “FA-2” enclosed in polygons. MLP neural networks are generated in this process according to the model training and selection algorithm described in Section 2.2.2, using as input the subset of variables with highest merit after an exhaustive search through the space of feature subsets. The WA block is represented in Fig. 7 by the itinerary marked with the initials “WA-1” and “WA-2” enclosed in polygons. This block produces MLP neural networks trained with the model selection and training algorithm described in Section 2.2.2, using as input the subset of variables which produces the best predictive results in a 10-fold cross-validation process using MLP neural networks (also trained as described in Section 2.2.2).
The itinerary of each of the three blocks (FS, FA, WA) leads to the selection of the best model for each strategy. Once the best models for the FS, FA and WA strategies have been selected, a comparison between them is made based on the magnitudes of the mean errors obtained in the cross-validations. This comparison focuses on two aspects. One is based on a comparison of the magnitudes of the metrics obtained in the cross-validations (indicated in Fig. 7 by a circled ‘1’) and the other on classical null hypothesis testing (see Section 3.2.1). The aim is to know whether, from a statistical point of view, there exists a significant difference between the results obtained with the three strategies (this action is indicated by a circled ‘2’ in Fig. 7).

3.2.1. Testing for statistical significance
The purpose of the statistical analysis presented in this subsection is to determine whether the mean error of the ten mean errors obtained with the 10-fold cross-validation of a model is significantly larger or smaller than the mean error of another model. Note that, as three metrics have been proposed (IoA, MAE and MAPE) to evaluate wind speed prediction (Section 2.2.4), there are three samples for each metric (one sample for each model: FA, WA and FS) and each sample has ten data values.
Note also that, in order to rationalise the computational load of the procedure, the same random cross-validation partitions were used in the process to obtain each of the models (FA, WA and FS). The reason behind this was also to try to minimize the variance of the difference between the mean metrics obtained by the three models, as these share the same experimental unit (the same partition of data of each cross-validation iteration: the same nine groups used for the training and the same group excluded for validation in which the metric is evaluated). So, for each metric we have a scenario of analysis of variance with three levels of treatment and three potentially dependent samples (ANOVA within subjects), in which the two-by-two comparisons will be made. As multiple comparisons will be made, the adjusted p-values will be calculated using the procedure proposed by Benjamini and Hochberg [49], normally called the BH step-up procedure.
The decision problem consists of choosing between the null hypothesis H0 and the alternative hypothesis H1, Eq. (8), with a significance level of 5%. In the case of the IoA metric, if the null hypothesis is rejected then model i is considered better than model j. In the case of the MAE and MAPE metrics, if the null hypothesis is rejected then model j is considered better than model i. In the case that it is not possible to reject the null hypothesis then it cannot be said that either of the two models that are being compared is better than the other.

\[
H_0 : \mu_i = \mu_j; \qquad H_1 : \mu_i > \mu_j \tag{8}
\]

In Eq. (8), μi and μj represent the population means of the models i and j, respectively.
To resolve the problem, Eq. (8), both the paired and unpaired parametric Student’s t-test [50,51] will be used in this paper and a value judgement will be made in view of the correlations that exist between the samples. When the hypothesis of normality is rejected (the modified statistic of Anderson–Darling [52] was applied for this purpose), given that the number of data is small (ncv = 10), non-parametric permutation tests were used for both dependent (paired) samples and independent (unpaired) samples. The choice between them will depend on the correlations that exist between the samples.
The non-parametric permutation tests used in this study were first proposed by Fisher [43] and Pitman [44] and have continued to evolve to the present day on the basis of subsequent work carried out by the same authors [45,46].
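To make the testing scheme of Section 3.2.1 concrete, the sketch below shows one common form of a paired permutation test (exact enumeration of sign flips of the ten paired differences) for the one-sided hypotheses of Eq. (8), followed by the BH step-up adjustment of a set of p-values. The function names are hypothetical and the exact sign-flip formulation is an assumption; the source does not state which variant or software was used.

```python
import itertools
import numpy as np

def paired_permutation_pvalue(x_i, x_j):
    """Exact one-sided paired permutation test of H0: mu_i = mu_j vs H1: mu_i > mu_j.

    Under H0 the sign of each paired difference is exchangeable, so all 2**n
    sign assignments are enumerated (1024 for n = 10 cross-validation folds).
    """
    d = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    observed = d.mean()
    signs = np.array(list(itertools.product([1.0, -1.0], repeat=d.size)))
    perm_means = (signs * d).mean(axis=1)
    # Small tolerance guards against floating-point ties with the observed mean.
    return float(np.mean(perm_means >= observed - 1e-12))

def bh_adjust(pvalues):
    """Benjamini-Hochberg step-up adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)          # p_(k) * m / k
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]   # enforce monotonicity
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

# Made-up fold errors: model i is worse than model j by exactly 1 in every fold.
errors_j = np.arange(1.0, 11.0)
errors_i = errors_j + 1.0
p_demo = paired_permutation_pvalue(errors_i, errors_j)
adjusted_demo = bh_adjust([0.01, 0.04, 0.03, 0.005])
alpha = 0.05
reject_demo = adjusted_demo <= alpha
```

In this made-up case only the unflipped arrangement reaches the observed mean difference, so the p-value is 1/1024, the smallest value an exact test with ten pairs can return. This lower bound is worth keeping in mind when many comparisons are adjusted together.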
The null hypothesis H0 will be discarded if p-value (adjusted) ≤ α in favour of the alternative hypothesis H1.

4. Analysis of results

The most important results obtained by the best models generated following the procedure described in Section 3.2 for the filter, wrapper and full feature set methods are shown in Figs. 8–10. Each of these three figures corresponds to one of the metrics used in the comparison (Section 2.2.4) and is comprised of two graphical representations which each show the mean loss obtained when predicting the mean wind speeds using one of the years as a short-term representative training-validation sample (a on the left for 2003; b on the right for 2004).
Each of these two graphical representations shows, from top to bottom, the optimum number of neurons for the model selected in each strategy, the variables finally selected in each model, and a bar diagram indicating the mean loss in the 10 cross-validation groups used and an interval centred on this mean of the standard deviation of the loss in the cross-validation process.
The computational load required for selection of the MCP models (including feature selection) was quantified by measuring the CPU times (using 32 of the 84 blades of the hardware indicated in Section 2.3) needed to run the models. By way of example, the CPU time required for selection of the MCP model of the target station WS-1 (year 2003) using the IoA metric and FA method was 1019 s, while the time required for the same WS and metric in the case of the WA method was 66803 s. Expressed another way, the CPU time required for the WA-based model was 6456% higher than for the FA model.
Tables 5–7, each of which corresponds to one of the three metrics (IoA, MAE and MAPE) used in the comparison, show the results of the application of the different statistical hypothesis tests explained in Section 3.2.1 to the losses obtained by the different models (FA, WA and FS) in the cross-validation process used in their training-validation.
The content of each of the three tables is the same, but particularised for each metric. The first column indicates the target station in each MCP problem, the second shows the alternative hypothesis (H1), the third the data year (short-term representative) used for training-validation of the models whose 10 cross-validation results are being compared, and the fourth the linear correlation coefficient (ρ) obtained between the samples being compared. The fifth, sixth and seventh columns show the p-values of the following statistical tests: the Anderson–Darling Test for normality of the difference between two samples of mean losses produced by the two models being compared, and the Student Test and Permutation Test (both for paired samples). The last column shows the adjusted p-value, following the BH procedure [49] of the paired Permutation
Fig. 8. Results obtained for the IoA metric: (a) Estimation of mean hourly wind speeds (training year: 2003), (b) estimation of mean hourly wind speeds (training year: 2004).
Test. Note that the p-values obtained from a total of 90 comparison tests (5 WS × 3 models (FA, WA and FS) × 2 years (2003 and 2004) × 3 metrics (IoA, MAE and MAPE)) were used to calculate the p-values shown in the final column of Tables 5–7 with the BH method. Note too that the adjusted p-values of Tables 5–7 are written in bold when the null hypothesis is rejected and the alternative hypothesis shown in the second column of these tables is accepted. The adjusted p-values which correspond to target station WS-4, year 2003 and alternative hypotheses WA > FA and WA > FS are written in parentheses in Table 5 in order to show negative and low correlation coefficients between samples. The same thing occurs in Table 6 for the same station and year, in the case of the alternative hypotheses FA > WA and FS > WA. In these cases, the unpaired tests described in Section 3.2.1 (Table 8) were also used in view of these correlations to make a value judgement. The adjusted p-values shown in the final column of Table 8 and those shown in parentheses in Tables 5 and 6 lead to the conclusion that, in the cases analysed, both the paired and unpaired tests indicate that there is no evidence to reject the null hypothesis that there are no significant differences with a significance level of 5%.
An analysis of the results is performed below with a distinction between the results that could be considered of a methodological type, as they are associated with the comparison made between the filter, wrapper and full variable set strategies using MLP neural networks, and those which could be considered specific results of the different MCP problems considered for this comparison in the wind scenario of the Canary Archipelago.
4.1. Methodological type results
The results shown in Figs. 8–10 are synthesised in Fig. 11, where it can be seen that, independently of the metric used (IoA, MAE or MAPE), the WA gave lower mean errors than the FA in 100% of the cases analysed (10 cases: 5 stations × 2 years). It can also be seen that in at least 80% of the cases analysed the WA gave lower mean errors than the FS and that the FS never gave lower errors than the WA. In addition, in at least 80% of the cases analysed the FS gave lower mean errors than the FA. However, the FS generated higher mean errors than the FA in 20% of the cases when the IoA and MAE metrics were used.
Despite the results shown in Fig. 11, it should be noted that, at first sight and considering the magnitudes of the metrics generated by the three models (WA, FA and FS) (Figs. 8–10), no notable difference can be observed between them and, therefore, none of the models can be totally discarded a priori without considering other possible benefits. In particular, the FA appears to be competitive at predictive level in the cases in which the errors were lower (WS-1 and WS-4), despite using a smaller number of variables than the other two models. This means that fewer computational resources are required and, no less important, allows for much greater interpretability.
As can be observed in Figs. 8–10, the WA produces models with a higher number of variables than the FA which contribute some information in the prediction and which are useful to the MLP networks. This happens despite the redundancy between variables observed in Section 2.1. So, the MLP networks can manage this redundancy without missing out on the specific information that each variable may provide about the target variable. However, the FA generated simpler models than the WA, as it used between 40% and 75% fewer variables than the WA, and with a smaller number of neurons in the hidden layers. The higher number of neurons of the hidden layer of the MLP produced by the WA has an impact on the number of iterations and the computational load of the training algorithms. So, the MLP training algorithms, in the case of using FA, are faster and potentially more effective at estimating the parameters.
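To make the link between model size and training load concrete, the sketch below counts the trainable parameters of an MLP. It assumes a fully connected network with a single hidden layer and bias terms, which may not match the exact architecture used in the study. The input and neuron counts are hypothetical and are not taken from Figs. 8–10.

```python
def mlp_parameter_count(n_inputs, n_hidden, n_outputs=1):
    """Weights plus biases of a fully connected MLP with one hidden layer."""
    # Each hidden neuron has one weight per input plus one bias.
    hidden_params = (n_inputs + 1) * n_hidden
    # Each output neuron has one weight per hidden neuron plus one bias.
    output_params = (n_hidden + 1) * n_outputs
    return hidden_params + output_params


# Hypothetical model sizes, chosen only for illustration.
wrapper_like = mlp_parameter_count(n_inputs=8, n_hidden=20)
filter_like = mlp_parameter_count(n_inputs=3, n_hidden=8)
print(wrapper_like, filter_like)
```

Under these assumptions the parameter count grows with the product of the number of inputs and the number of hidden neurons, so a model with fewer selected variables and fewer neurons has far fewer parameters to estimate.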
Fig. 9. Results obtained for the MAE metric: (a) Estimation of mean hourly wind speeds (training year: 2003), (b) estimation of mean hourly wind speeds (training year:
2004).
It is also clear from Figs. 8–10 that the differences in the sets of variables selected by both methods (WA and FA) are fundamentally to be found in the direction variables. This is coherent with the larger non-linear component of the relationships between the direction variables and the speed variables which act as target (Fig. 4), resulting in low Pearson’s correlation coefficients which are insufficient for the CFS filter method, which uses these coefficients, to select the direction variables in the subset. In all cases, the absolute value of Pearson’s correlation coefficient was below 0.5 and in some cases below 0.04 (Table 9). The authors of the present paper consider that it would be convenient for the filter method to use with these pairs of variables the circular-linear correlation coefficient, Eq. (3), which detects these non-linear relationships, with the aim of working under the same conditions as the WA. With this in mind, the authors propose to undertake this methodological improvement in a future study with a view to confirming or qualifying the parsimony differences and the predictive differences between the two strategies that were obtained in the present study.
The results of Tables 5–7 are summarised in Fig. 12 and show that, independently of the metric used, the WA performed significantly better than the FA in 60% of cases and in no case did it perform worse. It can also be seen in Fig. 12 that the WA did not give significantly different results from the FS for the IoA and MAPE metrics in 90% of the cases analysed (in the remaining 10% of cases the WA was significantly better). There were no significant differences between the WA and the FS in 100% of cases with the MAE metric. Meanwhile, the FS was significantly better than the FA in 40% of cases with the IoA and MAE metrics, and never worse. With the MAPE metric, the FS was better than the FA in 20% of cases with no significant differences observed between them in the remaining 80%.
4.2. Results of the different MCP problems posed in the Canary Archipelago
For the analysis of the particular results of the different stations considered as target, a distinction will be made between two groups of stations: one group comprised of WS-1 and WS-4 and a second of WS-2, WS-3 and WS-5.
Stations WS-1 and WS-4: it can be seen in Figs. 8–10 that the smallest errors in mean hourly wind speed prediction were made for these target stations, independently of the models (WA, FA and FS) used and of the year (2003 or 2004) used to represent the short-term. It can even be observed in the case of these two stations that in most cases analysed (metrics and year) there was no significant statistical difference between the three models (FA, WA and FS) at a 5% level (Tables 5–8). Specifically, in the case of the MAE metric (Tables 6 and 8) in no case is there evidence to reject the null hypothesis at that significance level. The same can be said for the MAPE metric (Table 7) in the case of 2004 (in the case of 2003, the p-values are not much lower than 0.05 and there would be no evidence to reject the null hypothesis in the case of a significance level below 4%). It can be deduced from the analyses that an absence of evidence to reject the null hypothesis is fundamentally a consequence of the high degree of linear correlation between the wind speeds recorded at these target stations and the data available at the stations which served as reference (Table 2). The direction variables are less relevant. It can be seen that the direction variables did not intervene in the FA models used in wind speed estimation of the target stations WS-1 and WS-4. The degree of linear correlation influences especially the magnitude of the IoA metric. The highest correlation coefficients
Fig. 10. Results obtained for the MAPE metric: (a) Estimation of mean hourly wind speeds (training year: 2003), (b) estimation of mean hourly wind speeds (training year:
2004).
Table 5
Analysis of statistically significant differences for the IoA metric. Tests for paired samples.
Table 6
Analysis of statistically significant differences for the MAE metric. Tests for paired samples.
Table 7
Analysis of statistically significant differences for the MAPE metric. Tests for paired samples.
Table 8
Analysis of statistically significant differences. Test for unpaired samples.
Fig. 11. Comparison between the magnitudes of the metrics of the three models (WA, FA and FS) obtained in the 10-fold cross-validation process, for the ten cases analysed (5 weather stations × 2 years).
Table 9
Pearson’s correlation coefficients between the wind speeds (vertical) and wind directions (horizontal) of the five anemometer weather stations.
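The contrast between Pearson's coefficient and a circular-linear coefficient for a speed-direction pair can be sketched as follows. The formula used is the standard circular-linear correlation associated with the work of Mardia [39] and Johnson and Wehrly [40]; it is assumed here, not confirmed, that this corresponds to the coefficient of Eq. (3). The data are synthetic, directions are assumed to be in degrees, and the function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson's linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def circular_linear_correlation(speeds, directions_deg):
    """Circular-linear correlation R in [0, 1] between a linear and an angular variable."""
    cos_t = [math.cos(math.radians(d)) for d in directions_deg]
    sin_t = [math.sin(math.radians(d)) for d in directions_deg]
    r_xc = pearson(speeds, cos_t)   # speed vs cos(direction)
    r_xs = pearson(speeds, sin_t)   # speed vs sin(direction)
    r_cs = pearson(cos_t, sin_t)    # cos(direction) vs sin(direction)
    r_squared = (r_xc**2 + r_xs**2 - 2 * r_xc * r_xs * r_cs) / (1 - r_cs**2)
    # Clamp to [0, 1] to guard against float round-off.
    return math.sqrt(min(1.0, max(0.0, r_squared)))


# Synthetic data: speed depends on direction through a cosine.
directions = [30.0 * k for k in range(12)]
speeds = [6.0 + 2.0 * math.cos(math.radians(d)) for d in directions]
print(abs(pearson(speeds, directions)), circular_linear_correlation(speeds, directions))
```

In this synthetic case the speed is fully determined by the direction, yet the absolute Pearson coefficient stays well below 0.5, while the circular-linear coefficient is close to 1.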
significant differences (5% significance level) between the errors generated by the filter models and the wrapper models, with the WA errors being lower. Similar conclusions can be drawn when comparing the FS and FA models, except in the case of target station WS-5, where the differences between the errors generated by these models were not statistically significant at a 5% level. It can also be seen in Tables 5–8 that, for these three target stations, in no case was there evidence to reject the null hypothesis chosen to compare the WA and the FS.
Although the study undertaken focussed on the short-term period, it should be remembered that the purpose of MCP models is to estimate long-term wind speed series at a target site. In this context, and as mentioned in Section 3.1, MCP methods require a series of conditions to be met for them to be useful [10]. These conditions are as follows: (a) The wind data series must have been recorded by the WSs in compliance with certain standards. In particular, there must have been no changes to the area surrounding any of the WSs (new buildings, installation of wind farms, major changes to the vegetation, etc.) that could have modified the relationships between the wind data of the WSs. (b) The data sets must be statistically stationary (wind behaviour in the future during the working life of the energy project must be analogous to past behaviour). (c) The short-term data series recorded at the target WS must allow conclusions to be drawn about seasonal variations (wind speed and direction), and (d) The wind climate of the different WSs must be similar.
In the framework of the present study, it should be noted that the non-availability of long wind data series meant it was not possible to analyse some of the previously described determinants of MCP methods. Nonetheless, the conclusions drawn from the comparative analysis of WA, FA and FS methods in the short-term periods for each of the two years considered are very similar in terms of predictive ranking, the selection of variables made and even the number of hidden units obtained in the models of the two years. Moreover, this was despite the instability that MLP algorithms usually present due to the multiple local optima in which non-linear optimisation algorithms can terminate. This suggests that the conclusions obtained in this study are applicable to MCP methods if it can be assumed that the wind climate of the WSs fits a sufficiently stable pattern (as is the case of the two years studied).
5. Conclusions
A specific analysis was undertaken in this paper of the usefulness of feature selection methods using ANNs in MCP methods. To date, no similar analysis has been published in the literature and no feature selection methods have been implemented in MCP modules of the software programmes employed in the wind industry. This study compared two general strategies for feature selection in real conditions, the CFS filter method and the wrapper method, and the two were also compared to a model (FS) which made no feature selection. MLP neural networks trained with quadratic loss through the backpropagation algorithm were used as an inductive algorithm.
The comparison was made in the context of an MCP strategy for the prediction of the wind resource at five stations chosen successively as the target site, using the four remaining stations as reference sites. This entailed posing five MCP problems in which the dependent variable was the wind speed at the target site and the independent variables were the wind speeds and directions at the four reference stations. Below, we present the main conclusions obtained from the study:
When a statistical comparison of the wind speeds of the target WS and reference WSs gives Pearson’s correlation coefficients which are ranked in the literature as very good (between 0.9 and 1), the FA method can be considered competitive in terms of predictive ability and is more interpretable by the analyst than the WA method. The FA method uses a lower number of features than the WA method and also requires fewer computational resources. However, the best results of the WA show that when MLP neural networks are being used as a predictive model the redundancy between variables can be beneficial for the prediction as well as lending robustness to the model. As the degree of non-linearity in the relationships between features increases (for example, when the WSs are located in complex terrain where the direction variable may acquire more significance), use of the WA method for feature selection becomes more recommendable, although the FA can still be used as an interpretive tool. However, it must be remembered that using WA methods has a high computational cost. Moreover, these methods require enough data for complex non-linear relationships to show up.
In the case analysed, the differences between the sets of features selected by the WA and FA methods were fundamentally found in the wind direction variables. As the CFS filter method uses Pearson’s linear correlation coefficients for feature selection, when these are low (the case of wind direction variables) the CFS method tends to discard them in the subset. For this reason, it is recommended that filter methods incorporate correlation coefficients of a non-linear nature which are able to detect the non-linear relationships that may exist between the wind direction features and between these and the linear features. In this way, the lower computational load of these methods can be taken advantage of without losing predictive capacity.
Acknowledgements
The authors would like to thank Belén Esteban, IT engineer and the technician in charge of the Atlante Supercomputer of the ITC (Technological Institute of the Canary Islands) for the technical assistance she provided. We would also like to thank the ITC and AEMET for providing the wind data used in this study.
References
[1] Jain P. Wind energy engineering. 1st ed. New York: McGraw-Hill; 2011.
[2] Brower MC. Wind resource assessment. 1st ed. New Jersey: Wiley; 2012.
[3] Velázquez S, Carta JA, Matías JM. Comparison between ANNs and linear MCP algorithms in the long-term estimation of the cost per kWh produced by a wind turbine at a candidate site. A case study in the Canary Islands. Appl Energy 2011;88:3869–81.
[4] Hiester TR, Pennell WT. The siting handbook for large wind energy systems. 1st ed. New York: WindBook; 1981.
[5] Justus CG, Mani K, Mikhail AS. Interannual and month-to-month variations of wind speed. J Appl Meteorol 1979;18:913–20.
[6] Landberg L, Myllerup L, Rathmann O, Petersen EL, Jørgensen BH, Niels BJ, et al. Wind resource estimation – an overview. Wind Energy 2003;6:261–71.
[7] Baker R, Walker SN, Wade JE. Annual and seasonal variations in mean wind speed and wind turbine energy production. Sol Energy 1990;45:285–9.
[8] Klink K. Trends and interannual variability of wind speed distributions in Minnesota. J Clim 2002;15:3311–7.
[9] Gasch R, Twele J. Wind power plants. 2nd ed. Berlin: Springer; 2012.
[10] Carta JA, Velázquez S, Cabrera P. A review of measure–correlate–predict (MCP) methods used to estimate long-term wind characteristics at a target site. Renew Sustain Energy Rev 2013;27:362–400.
[11] Jung S, Kwon SD. Weighted error functions in artificial neural networks for improved wind energy potential estimation. Appl Energy 2013;111:778–90.
[12] Zhang J, Chowdhury S, Messac A, Hodge BM. A hybrid measure–correlate–predict method for long-term wind condition assessment. Energy Convers Manage 2014;87:697–710.
[13] Rogers AL, Rogers JW, Manwell JF. Comparison of the performance of four measure–correlate–predict algorithms. J Wind Eng Ind Aerodynam 2005;93:243–64.
[14] Carta JA, Velázquez S. A new probabilistic method to estimate the long-term wind speed characteristics at a potential wind energy conversion site. Energy 2011;36:2671–85.
[15] Casella L. Improving long-term wind speed assessment using joint probability functions applied to three wind data sets. Wind Eng 2012;36:473–84.
[16] Probst O, Cárdenas D. State of the art and trends in wind resource assessment. Energies 2010;3:1087–141.
[17] Carta JA, Velázquez S, Matías JM. Use of Bayesian networks classifiers for long-term mean wind turbine energy output estimation at a potential wind energy conversion site. Energy Convers Manage 2011;52:1137–49.
[18] Öztopal A. Artificial neural network approach to spatial estimation of wind velocity. Energy Convers Manage 2006;47:395–406.
[19] Bilgili M, Sahin B, Yasar A. Application of artificial neural networks for the wind speed prediction of target station using reference stations data. Renew Energy 2007;32:2350–60.
[20] Monfared M, Rastegar H, Kojabadi HM. A new strategy for wind speed forecasting using artificial intelligent methods. Renew Energy 2009;34:845–8.
[21] Lopez P, Velo R, Maseda F. Effect of direction on wind speed estimation in complex terrain using neural networks. Renew Energy 2008;33:2266–72.
[22] Velázquez S, Carta JA, Matías JM. Influence of the input layer signals of ANNs on wind power estimation for a target site: a case study. Renew Sustain Energy Rev 2011;15:1556–66.
[23] Bechrakis DA, Deane JP, McKeogh EJ. Wind resource assessment of an area using short term data correlated to a long term data set. Sol Energy 2004;76:725–32.
[24] Haykin S. Neural networks. A comprehensive foundation. New Jersey (USA): Prentice Hall; 1999.
[25] Bishop CM. Neural networks for pattern recognition. New York: Oxford University Press; 1995.
[26] Kalogirou SA. Artificial neural networks in renewable energy systems applications: a review. Renew Sustain Energy Rev 2001;5:373–401.
[27] Fadare DA. The application of artificial neural networks to mapping of wind speed profile for energy application in Nigeria. Appl Energy 2010;87:934–42.
[28] Blum AL, Langley P. Selection of relevant features and examples in machine learning. Artif Intell 1997;97:245–71.
[29] Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell 1997;97:273–324.
[30] Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003;3:1157–82.
[31] Liu H, Motoda H. Computational methods of feature selection. Chapman & Hall/CRC; 2008.
[32] Leshno M, Lin VY, Pinkus A, Schocken S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw 1993;6:861–7.
[33] Witten IH, Frank E. Data mining. 3rd ed. San Francisco: Elsevier; 2011.
[34] Hall MA. Correlation-based feature selection for machine learning. PhD thesis. Hamilton (New Zealand): Department of Computer Science, University of Waikato; 1999. <www.cs.waikato.ac.nz/~mhall/thesis.pdf> [accessed 16.01.15].
[35] Hall MA. Feature selection for discrete and numeric class machine learning. Working paper 99/4; 1999. <www.cs.waikato.ac.nz/ml/publications/1999/99MH-Feature-Select.pdf> [accessed 16.01.15].
[36] Willmott CJ, Robeson SM, Matsuura K. A refined index of model performance. Int J Climatol 2012;32:2088–94.
[37] Fisher NI, Lee AJ. A correlation coefficient for circular data. Biometrika 1983;70:327–32.
[38] Fisher NI. Statistical analysis of circular data. 1st ed. Cambridge: Cambridge University Press; 1995.
[39] Mardia KV. Linear-circular correlation coefficients and rhythmometry. Biometrika 1976;63:403–5.
[40] Johnson RA, Wehrly T. Measures and models for angular correlation and angular-linear correlation. J Roy Stat Soc Ser B 1977;39:222–9.
[41] Bengio Y, LeCun Y. Scaling learning algorithms toward AI. In: Bottou L, Chapelle O, DeCoste D, Weston J, editors. Large-scale learning machines. Cambridge (MA): The MIT Press; 2007. p. 321–59.
[42] Sheela K, Deepa SN. Review on methods to fix number of hidden neurons in neural Networks. Math Probl Eng 2013;2013:11. http://dx.doi.org/10.1155/2013/425740.
[43] Fisher RA. The design of experiments. 2nd ed. London: Oliver & Boyd; 1937.
[44] Pitman EJG. Significance tests which may be applied to samples from any populations. Suppl J Roy Stat Soc 1937;4:119–30.
[45] Good P. Permutation, parametric and bootstrap tests of hypotheses. 3rd ed. New York: Springer; 2005.
[46] Berry KJ, Johnston JE, Mielke PW. A chronicle of permutation statistical methods. 1st ed. New York: Springer; 2014.
[47] Chakraborty R, Pal NR. Feature selection using a neural network framework with controlled redundancy. IEEE Trans Neural Netw Learn Syst 2015;26:35–50.
[48] http://atlante.itccanarias.org/ [accessed 22.03.14].
[49] Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc 1995;57:289–300.
[50] Devore JL. Probability and statistics for engineering and sciences. 8th ed. New York: Brooks/Cole; 2012.
[51] Montgomery DC, Runger GC. Applied statistics and probability for engineers. 6th ed. New York: Wiley; 2014.
[52] D’Agostino RB, Stephens MA. Goodness of fit techniques. 1st ed. New York: Dekker; 1986.