
Journal of Natural Gas Science and Engineering 22 (2015) 515–522

Contents lists available at ScienceDirect

Journal homepage: www.elsevier.com/locate/jngse

Investigating the effect of correlation-based feature selection on the performance of support vector machines in reservoir characterization

Kabiru O. Akande a, *, Taoreed O. Owolabi b, Sunday O. Olatunji c

a Electrical Engineering Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
b Physics Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
c Computer Science Department, University of Dammam, Dammam, Saudi Arabia

Article info

Article history:
Received 13 October 2014
Received in revised form 12 December 2014
Accepted 5 January 2015
Available online 8 January 2015

Keywords: Feature selection; Correlation; Support vector machines; Reservoir characterization; Permeability

Abstract

Permeability is an important property of a hydrocarbon reservoir, as crude oil lies underneath rock formations of lower permeability, and its accurate estimation is paramount to successful oil and gas exploration. In this work, we investigate the effect of feature selection on the generalization performance and predictive capability of support vector machines (SVM) in predicting the permeability of carbonate reservoirs. The feature selection is based on estimating the correlation between the target attribute and each of the available predictors. SVM performance is improved through the feature selection approach employed, and the uniqueness of the approach is that it uses a smaller dataset to achieve that improvement. Its effect has been investigated using real industrial datasets obtained during petroleum exploration from five distinct oil wells located in a Middle Eastern oil and gas field. The results of this approach are very promising and suggest a way to improve the performance of this algorithm, and of many other computational intelligence methods, through systematic selection of the best features, thereby reducing the number of features employed.

© 2015 Elsevier B.V. All rights reserved.

* Corresponding author.
E-mail addresses: [email protected] (K.O. Akande), [email protected] (T.O. Owolabi), [email protected] (S.O. Olatunji).
http://dx.doi.org/10.1016/j.jngse.2015.01.007
1875-5100/© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Permeability is an indication of the ease with which fluid flows through a material (Darcy, 1856); in petroleum engineering, it is the ability of porous rock to allow the passage of oil and gas (Olatunji et al., 2014a,b). A porous rock (one with pores) is not necessarily permeable; rather, permeability is an indication of the interconnection of the pore spaces and the ease with which fluid passes through them (Ayan et al., 2001). It is one of the most important flow characterizations of an oil and gas reservoir, and knowledge of it can lead to various important deductions and decisions regarding the reservoir under investigation (Tusiani and Shearer, 2007). Its accurate estimation is fundamental to a successful flow characterization of the reservoir as well as to determining the scale of the medium. Specifically, important information regarding the amount of oil and gas present in a reservoir, the percentage of this oil and gas that is recoverable, the fluid saturation distribution (flow rate), estimates of future exploration, and the adequate and proper design of exploration equipment can all be gleaned from permeability estimation. Also, several fundamental issues in the petroleum industry can only be resolved with adequate knowledge of permeability (Olatunji et al., 2011). However, permeability is very difficult to estimate, and this has been an area of rigorous research for engineers and practitioners in the oil and gas industry. There are many available techniques for permeability estimation, such as well-log evaluation, core measurement and well testing (Ahmed et al., 1991). These techniques use well-log and core data obtained from boreholes. The main tool used in correlating the data is regression analysis, which assumes the existence of a linear or non-linear relationship between the predictors and the target; the predictors in this case are the various properties of the rock, while the target attribute is the permeability. However, this approach has yielded little success and is far from achieving accurate prediction of permeability (Olatunji et al., 2014a,b).

The various techniques used in estimating permeability can be grouped into three major categories: empirical, statistical and computational intelligence methods (Olatunji et al., 2011). The empirically determined values are those measured in the laboratory, while statistical models make use of multiple regression analysis. Computational intelligence methods are the various machine learning techniques applied to permeability prediction. We briefly delve into these methods in order to provide an overview of the development in these fields over the years.

One of the early contributors to this task is Kozeny, whose work was extended by Carman (Carrier, 2003). They formulated a formula relating permeability to other measurable rock properties. This relation has its shortcomings in that it is only applicable to uniformly sized spheres; moreover, the physical properties of the rock could only be measured from core analysis with the aid of special equipment (Olatunji et al., 2011). Tixier proposed and established a relationship between permeability and the resistivity gradient using an empirical model (Tixier, 1949). However, his model only estimates average permeability over the range of area covered by the resistivity gradients. He later developed a simpler model, which is more frequently used, based on the work of Wyllie and Rose (Wyllie and Rose, 2013). Many other models were developed from the Kozeny equation, as outlined by Pirson (Pirson, 1963).

Multiple regression analysis was employed by Wendt and Sakurai (Wendt et al., 1986) in the estimation of permeability, though with some drawbacks. Despite boosting the performance of the model in predicting extreme values of permeability using a weighted average of high and low values, the model can still be statistically biased and prone to instability, and its distribution is narrower than that of the original data set. A lot of advancement has been made in using empirical and statistical models for permeability prediction, yet despite the best efforts many inaccuracies persist; hence the need for further exploration of computational intelligence, which has enjoyed great success in recognizing complex patterns (Olatunji et al., 2011).

Machine learning has revolutionized data analysis for parameter prediction. It enjoys wide-ranging successes in the diverse engineering problems to which it has been applied, and researchers have proffered solutions to many real-life complex tasks with the aid of various machine learning techniques. The techniques have excelled and enjoyed wide acceptance in areas such as medicine (Olatunji and Arif, 2013), manufacturing (Stoneking, 1999), face detection (Osuna et al., 1997), speech-related applications (Olatunji et al., 2013; available at http://dl.acm.org/citation.cfm?id=1909246), material properties (Owolabi et al., 2014a,b) and many others. This has spurred revolutionary research into their application in the oil and gas industry (Bruce et al., 2000). The field of artificial intelligence is full of techniques for learning complex patterns between predictors and their targets. A very popular technique that enjoyed early success in permeability prediction is the neural network based approach, also aptly referred to as the virtual measurement technique (Wong et al., 2010). Several works have made use of artificial neural networks in predicting the permeability of carbonate reservoirs from other well-log data. A pioneering work in this field is Bruce et al., which comprehensively dealt with the details of permeability estimation from well logs with the aid of a Bayesian neural network (Bruce et al., 2000).

A hybrid of fuzzy logic and artificial neural networks (ANN) was applied to naturally fractured reservoirs, and the authors were able to propose a method for the complete description of a fractured reservoir (El Ouahed et al., 2005). The permeability of fractured reservoirs is of great importance as it affects the migration of oil and gas. Despite the various proposed methods, there is still no established practice that is general, and the performance of existing techniques requires improvement (Saemi et al., 2007). Hence, a simple and effective data preprocessing technique has been investigated in this work through systematic selection of the best features from the available pool of features or attributes. The proposed approach is found to have a profound effect on improving the performance of SVM, as indicated in the results discussed later in this work. Additionally, the performance improvement recorded was achieved using a subset of the available data. This is highly desirable, as the challenge of huge-data modeling presently dominates the computing world: industries are being inundated with enormous amounts of data, and storing and modeling all these data is very challenging. Therefore, the strategy considered in this work can serve as an effective and efficient pre-selection method for extracting a sufficiently small subset of the available data without compromising performance. In fact, since the strategy employed here tends to select the features that are most predictive of a given target attribute, superior performance is recorded in most of the cases considered, in addition to a reduction in the number of features employed.

The choice of SVM is due to its many unique features, which include a sound mathematical foundation, non-convergence to local minima, and accurate generalization and predictive ability when trained on small datasets (Akande et al., 2014). The rest of this paper is organized as follows: Section 2 briefly describes the statistical learning algorithm and the correlation-based approach considered in this work. Section 3 gives a description of the datasets and details of the experiments. Section 4 discusses the results and their interpretation. Section 5 concludes the paper and offers recommendations.

2. Computational intelligence method

2.1. Support vector machine

The support vector machine is a tool derived from statistical learning theory for classification and regression tasks (Cortes and Vapnik, 1995). In classification problems, SVM employs the optimal separation principle: it selects, from among an infinite number of linear classifiers, the hyperplane with the maximum margin between linearly separable classes. The optimum separation hyperplane is selected based on minimization of the generalization error, or of a defined upper bound on the error, using structural risk minimization. The optimum separation hyperplane is the one with maximum distance from the closest points of the two classes (Cortes and Vapnik, 1995; Cristianini and Shawe-Taylor, 1999). For non-separable classes, SVM seeks a hyperplane with maximum margin that also minimizes a quantity directly proportional to the number of misclassification errors; the tradeoff between the number of misclassification errors and the maximum margin is set using a predefined positive constant. In order to construct linear decision surfaces with SVM, a finite set of variables can be transformed onto a higher dimensional space where linear classification can then be carried out (Cortes and Vapnik, 1995). This technique is applied in extending SVM to regression problems.

The concept of maximum margin in SVM classification is extended to regression problems by defining the ε-insensitive loss function proposed by Vapnik (Cortes and Vapnik, 1995); the result is termed ε-Support Vector Regression (SVR). SVR seeks, over all the training data, a function having at most ε deviation from the actual target vectors. This means that SVR limits the error between its hypothesis and the actual values to a maximum of ε (ensuring that it does not exceed this constant), and the function that achieves this has to be as flat as possible. The SVR linear decision function can be formulated as

f(x; \theta) = \langle w, x \rangle + b \qquad (1)

A smaller value for the adjustable parameter yields a better model in general and indicates the flatness of the linear decision function. Thus, the objective function of SVR is the minimization of the Euclidean norm of the adjustable parameters, \lVert\theta\rVert^2, and this optimization objective can be stated as follows:
\min_{\theta}\ \frac{1}{2}\lVert\theta\rVert^2 \quad \text{subject to} \quad \begin{cases} y_k - g(w, x_k) - b \le \varepsilon \\ g(w, x_k) + b - y_k \le \varepsilon \end{cases} \qquad (2)

The optimization problem above is subject to the existence of a function which satisfies the ε-error limit on all training pairs. In order to ease this limit and allow more degrees of freedom, slack variables \varepsilon_k, \varepsilon'_k can be incorporated. The modified objective is then stated as:

\min_{\theta}\ \frac{1}{2}\lVert\theta\rVert^2 + C \sum_{k=1}^{K} \left(\varepsilon_k + \varepsilon'_k\right) \qquad (3)

\text{subject to} \quad \begin{cases} y_k - g(w, x_k) - b \le \varepsilon + \varepsilon_k \\ g(w, x_k) + b - y_k \le \varepsilon + \varepsilon'_k \\ \varepsilon_k,\ \varepsilon'_k \ge 0 \end{cases} \quad \text{for all } k = 1, 2, \ldots, K.

The C in equation (3) is a regularization factor, chosen by the user, which controls the tradeoff between the flatness of the function, represented by the optimization of the model parameters, and the extent to which errors exceed ε. In other words, C attempts to regularize the conflicting tasks of minimizing model error and minimizing model complexity. The modified optimization objective in Eq. (3) can be solved using Lagrangian multipliers, which simplify the equation by transforming it into a dual space. The resulting solution of this mathematical manipulation is

f(x; \theta) = \sum_{k=1}^{K} \left(\lambda_k - \lambda'_k\right) \langle x_k, x \rangle + b \qquad (4)

where λ and λ' are positive Lagrangian multipliers. Vapnik extended this technique to non-linear SVR by introducing the concept of kernel functions. The input data are first transformed onto a higher dimensional feature space using a non-linear mapping function, and linear regression is then performed in this space. The linear decision function in equation (4) can be modified to accommodate this extension, and the solution becomes

f(x; \theta) = \sum_{k=1}^{K} \left(\lambda_k - \lambda'_k\right) \varphi(x_k, x) + b \qquad (5)

where \varphi(x_k, x) is the non-linear transformation function. Kernel functions are used in the higher dimensional feature space to simplify computation, and they greatly affect the generalization performance of SVM; the popular ones are the polynomial and Gaussian kernels. Optimal setting of SVM is done by the user, who carefully selects the regularization factor C, the type of kernel function and its specific parameters, as well as the ε of the insensitive loss function.
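As a concrete illustration of these user-set quantities, the following minimal sketch fits an ε-SVR with a Gaussian (RBF) kernel on synthetic data. The paper's experiments were carried out in MATLAB (see Section 3.2); this Python/scikit-learn version is only an assumed illustration, and the data and parameter values shown are placeholders rather than the paper's settings.

    # Minimal epsilon-SVR sketch (scikit-learn). C, epsilon and gamma play the
    # roles of the regularization factor, the insensitive-loss width and the
    # RBF kernel parameter discussed above; the values are illustrative only.
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(200, 6))                      # 200 samples, 6 predictors
    y = X @ np.array([2.0, -1.0, 0.5, 0.0, 1.5, -0.5]) + 0.1 * rng.standard_normal(200)

    model = SVR(kernel="rbf", C=100.0, epsilon=0.1, gamma=0.5)
    model.fit(X, y)
    print("support vectors used:", len(model.support_))
    print("prediction for first sample:", model.predict(X[:1])[0])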

2.2. Feature selection: correlation approach

Statistical correlation is an indispensable tool in understanding the relationship between two variables. It gives insight into the strength of this relationship and serves as a means through which positive inference can be deduced by accurately interpreting its result. A typical correlation value has both a magnitude and a direction, either negative or positive. However, the direction is of little importance as regards the strength of the relationship and the degree of association existing between the variables, as it only indicates the trend of dependence. Specifically, if two variables are negatively correlated, it means that as one increases the other decreases, and vice versa. There is no unit attached to the measured value of correlation. Table 1 offers a rule of thumb for interpreting the correlation coefficient, as its interpretation is paramount to its useful application (Weber, 1970). It is seen that as the degree of association between variables increases, the value of r approaches 1, and when |r| = 1 the variables are said to be perfectly correlated. Although correlation has its limitations (Weber, 1970), it is a very powerful tool for discerning the degree of association between variables, and this advantage is taken into consideration in this research work.

Intuitively, what we have done is to systematically separate out those features with the least contribution to the task at hand. The effect of this is that we have been able to retain useful information while at the same time eliminating features which could hinder the efficiency of the algorithm being implemented. This can be likened to the established practice of removing outliers from datasets in order to improve performance, as outliers are known to affect even simple data analyses (Osborne and Overbay, 2004).

Hence, this is a sort of data discrimination in which attributes perceived to have a negative influence on the algorithm's performance, based on their correlation with the variable to be predicted (the target), are set aside. Indeed, this has a profound effect on the efficiency and performance improvement of the algorithm, as indicated in the results of this research work.

Table 1
Rule of thumb for interpreting statistical correlation.

Absolute correlation value r | Strength of correlation
0.68 ≤ r ≤ 1.00 | Strong or High
0.36 ≤ r ≤ 0.67 | Moderate or Modest
r ≤ 0.35 | Low or Weak
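To make this screening step concrete, the sketch below ranks each predictor by the absolute value of its Pearson correlation with the target and keeps those meeting a threshold. It is a minimal sketch assuming the logs are held in a pandas DataFrame with a "PERM" target column; the column name and the 0.5 default are placeholders (Section 4 describes the thresholds actually applied per well).

    # Correlation-based feature screening: keep predictors whose |r| with the
    # target meets the threshold. Column names here are assumptions.
    import pandas as pd

    def select_by_correlation(df: pd.DataFrame, target: str = "PERM",
                              threshold: float = 0.5):
        """Return (selected predictor names, full correlation series).
        The sign of r is deliberately ignored: as noted above, direction
        only indicates the trend of dependence, not the strength."""
        r = df.corr(method="pearson")[target].drop(target)
        return r[r.abs() >= threshold].index.tolist(), r

    # Usage (hypothetical file):
    #   logs = pd.read_csv("well_B.csv")
    #   selected, r = select_by_correlation(logs, target="PERM", threshold=0.5)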
3. Empirical study

3.1. Data set description and statistical analysis

Real industrial datasets obtained during petroleum exploration from five distinct oil wells located in a Middle Eastern oil and gas field have been used in this study. The total number of data points available for each well and the various geophysical data acquired from them are presented in Table 2. It is very useful to carry out a statistical analysis of the datasets before proceeding to model training; this provides great insight into each of the variables and how they evolve from measurement to measurement. Criteria such as the standard deviation (std), which measures how much variation exists within the data points of each predictor, and the mean, maximum and minimum values are very useful. Table 3 provides a full statistical analysis of the datasets for all the wells used in this study. The analysis was carried out using an Excel spreadsheet.

Table 2
Available input parameters for each well.

Well-Code | No. of data | Available well-log data (predictors)
Well-A | 388 | MSFL, DT, NPHI, PHIT, RHOB, SWT, CALI, CT, DRHO, GR, RT
Well-B | 357 | CPERM, CPOR, MSFL, NPHI, PHIT, RHOB, SWT, CALI, CT, DRHO, GR, RT
Well-C | 478 | MSFL, DT, NPHI, PHIT, RHOB, SWT, CALI, CT, DRHO, GR, RT
Well-D | 388 | CPERM, CPOR, MSFL, DT, NPHI, PHIT, RHOB, SWT, CALI, CT, DRHO, GR, RT
Well-E | 41 | CPERM, CPOR, DT, NPHI, PHIT, RHOB, SWT, CALI, CT, DRHO, GR, RT

Key: micro spherically focused log (MSFL), neutron porosity (NPHI), total porosity (PHIT), bulk density (RHOB), water saturation (SWT), resistivity (RT), bulk density correction (DRHO), electrical conductivity (CT), sonic travel time (DT), log10 core permeability (CPERM), log10 core porosity (CPOR), caliper (CALI), gamma ray (GR).
Table 3
Statistical analysis of all the datasets ("–" marks logs not recorded for a well).

Well | Stat | PERM | NPHI | PHIT | RHOB | SWT | CALI | CT | DRHO | GR | RT | MSFL | DT | CPERM | CPOR
Well-A | Mean | 0.66 | 0.09 | 0.10 | 2.53 | 0.55 | 6.13 | 0.06 | 0.01 | 13.90 | 17.58 | 1.61 | 61.37 | – | –
Well-A | Std | 0.63 | 0.14 | 0.18 | 2.67 | 1.00 | 6.13 | 0.07 | 0.02 | 15.16 | 20.98 | 1.67 | 68.75 | – | –
Well-A | Max | 0.03 | 0.05 | 0.07 | 0.14 | 0.45 | 0.00 | 0.01 | 0.01 | 1.27 | 3.40 | 0.06 | 7.39 | – | –
Well-A | Min | 0.69 | 0.03 | 0.03 | 2.40 | 0.11 | 6.13 | 0.05 | 0.01 | 12.63 | 14.17 | 1.55 | 53.98 | – | –
Well-B | Mean | 47.14 | 0.14 | 0.15 | 2.44 | 0.17 | 8.41 | 0.05 | 0.06 | 14.79 | 1310.84 | 1.18 | – | 0.48 | 0.16
Well-B | Std | 102.39 | 0.05 | 0.07 | 0.14 | 0.18 | 0.10 | 0.03 | 0.03 | 3.97 | 3308.42 | 0.46 | – | 1.24 | 0.08
Well-B | Max | 642.04 | 0.26 | 0.29 | 2.67 | 1.00 | 8.49 | 0.11 | 0.13 | 31.04 | 10000.00 | 2.44 | – | 2.99 | 0.31
Well-B | Min | 0.03 | 0.03 | 0.03 | 2.18 | 0.04 | 8.16 | 0.00 | 0.00 | 6.04 | 8.92 | 0.54 | – | 1.81 | 0.04
Well-C | Mean | 0.77 | 0.04 | 0.03 | 2.68 | 0.89 | 6.06 | 0.01 | 0.03 | 16.09 | 2652.77 | 1.72 | 52.32 | – | –
Well-C | Std | 0.73 | 0.02 | 0.01 | 0.00 | 0.11 | 0.05 | 0.01 | 0.02 | 0.72 | 2592.50 | 0.01 | 3.04 | – | –
Well-C | Max | 1.50 | 0.06 | 0.04 | 2.69 | 1.00 | 6.11 | 0.02 | 0.05 | 16.81 | 5245.27 | 1.73 | 55.37 | – | –
Well-C | Min | 0.04 | 0.02 | 0.02 | 2.68 | 0.77 | 6.01 | 0.01 | 0.01 | 15.37 | 60.27 | 1.70 | – | – | –
Well-D | Mean | 46.12 | 0.13 | 0.14 | 2.47 | 0.27 | 6.27 | 0.06 | 0.01 | 16.86 | 26.37 | 1.96 | 65.06 | 0.64 | 0.14
Well-D | Std | 117.94 | 0.06 | 0.07 | 0.13 | 0.26 | 0.16 | 0.03 | 0.01 | 5.15 | 22.09 | 0.48 | 8.24 | 1.17 | 0.07
Well-D | Max | 862.52 | 0.27 | 0.30 | 2.73 | 1.00 | 7.09 | 0.18 | 0.10 | 35.82 | 146.10 | 4.03 | 82.68 | 3.21 | 0.29
Well-D | Min | 0.01 | 0.02 | 0.02 | 2.19 | 0.04 | 6.19 | 0.01 | 0.00 | 8.72 | 5.60 | 1.13 | 51.22 | 1.70 | 0.00
Well-E | Mean | 22.43 | 0.10 | 0.12 | 2.51 | 0.53 | 6.42 | 0.02 | 0.04 | 24.87 | 57.56 | – | 60.22 | 0.25 | 0.15
Well-E | Std | 22.42 | 0.07 | 0.09 | 0.18 | 0.47 | 0.04 | 0.01 | 0.00 | 2.14 | 17.33 | – | 9.63 | 1.95 | 0.09
Well-E | Max | 44.85 | 0.17 | 0.21 | 2.68 | 1.00 | 6.46 | 0.03 | 0.04 | 27.01 | 74.89 | – | 69.85 | 2.20 | 0.24
Well-E | Min | 0.02 | 0.03 | 0.03 | 2.33 | 0.06 | 6.39 | 0.01 | 0.04 | 22.73 | 40.24 | – | 50.59 | 1.70 | 0.06
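The summary statistics in Table 3 were produced in Excel; for reference, the same quantities can be obtained in one line with pandas. A minimal sketch, with the file name and column layout assumed:

    import pandas as pd

    logs = pd.read_csv("well_A.csv")     # hypothetical export of one well's logs
    summary = logs.describe().loc[["mean", "std", "max", "min"]]
    print(summary.round(2))              # mean/std/max/min per column, as in Table 3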
3.2. Description of the experiment

All the programming tasks related to this work were carried out in the MATLAB computing environment. The datasets were first normalized to ensure that all the data fall within the same range and that data with large values do not bias the model. The normalized data were then randomized and segregated into training and testing partitions in the ratio 8:2. Data randomization was done in order to ensure fair representation of the datasets in both the training and testing partitions. After this data preprocessing, model training was carried out using the 80% of the data set aside for this purpose. After training, the model was tested using the 20% of the data disjoint from the training set. This practice is done in order to assess the generalization accuracy of the trained model and to ascertain its level of performance in using the learned pattern to predict target values for previously unseen data. This is referred to as validation of the model, and the performance assessment will only be as good as the criteria set for this purpose. In this work, three common criteria have been used; they are explained later in this section.
3.3. System setup

The SVM model was developed primarily using the radial basis kernel function; the polynomial kernel function did not work well with these data and did not give good results. The best parameters for the model were obtained using an optimum search procedure whereby the combination of parameters giving the best generalization performance is chosen. The SVM user-defined parameters include the regularization factor C, epsilon, lambda and the kernel option. While both lambda and epsilon had little effect on the generalization performance of the model, varying the value of C and the kernel option (h) had a profound effect. We arrived at the optimum parameters for the model after carrying out extensive simulations.

3.4. Optimal parameter search procedure for SVM

The values of the SVR parameters that give the model its optimum performance are referred to as the optimum parameters. In order to obtain them, the SVM parameters were optimized through test-set cross-validation on all the available data (the uncorrelated dataset); we then proceeded to use the optimum parameters for the correlated dataset. The detailed procedure for the test-set cross-validation used in optimizing the SVM parameters goes thus: for each run of a generated training and testing set, the values of the assessment criteria (correlation coefficient, root mean-squared error and absolute error) were monitored for a group of parameters C (bound on the Lagrangian multipliers), λ (conditioning parameter for QP methods), ε (epsilon) and the kernel option (h). The performance measures and the corresponding parameter values were noted and recorded. This experiment is then repeated for every available SVM kernel function, with an incremental step in the parameter values. The optimal values of the parameters and the kernel function associated with the best performance measure were identified. The procedure can be summarized as follows:

Step 1: An initial kernel function is chosen from the available ones.
Step 2: The test-set cross-validation procedure is used to select the best values for C, ε, λ and the kernel option.
Step 3: Steps 1 and 2 are repeated for all the available kernel functions.
Step 4: The kernel function and parameter values corresponding to the best performance measures are noted as the optimum parameters.
Step 5: The optimum parameters are then used to train the final SVM model.
Step 6: Performance measures for both the training and testing sets resulting from Step 5 are recorded.

A mathematical statement of the test-set cross-validation procedure is as follows. We first define Ki(j), where K contains all the available kernel functions; i, j and k are the indexes for the kernel function, the selected value of C and the kernel option h respectively, while iy, jy and ky represent the indexes of the optimum kernel function, C and h respectively. The total number of available kernel functions is ni, and the assumed maximum values of C and h are nj and nk respectively. The recorded performance measure is stored in pf. [Algorithm listing omitted: it is shown as a figure in the original paper and is not recoverable from this extraction.]

The optimum values of C, epsilon, lambda and the kernel option (h) are 750, 1.2, 10^{-7} and 1.8 respectively, and the best kernel function is the radial basis function (RBF).
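Steps 1–6 amount to a grid search over kernel functions and their parameters. The sketch below expresses the idea with scikit-learn's GridSearchCV; it substitutes ordinary k-fold cross-validation for the paper's test-set cross-validation, gamma stands in for the kernel option h, and the grid values are illustrative assumptions rather than the grid actually used.

    import numpy as np
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVR

    rng = np.random.default_rng(1)
    X = rng.uniform(size=(200, 6))                       # placeholder data
    y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(200)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

    param_grid = [                                       # one sub-grid per kernel (Steps 1 and 3)
        {"kernel": ["rbf"], "C": [10, 100, 750],
         "epsilon": [0.1, 0.5, 1.2], "gamma": [0.3, 1.0, 1.8]},
        {"kernel": ["poly"], "C": [10, 100, 750], "degree": [2, 3]},
    ]
    search = GridSearchCV(SVR(), param_grid, scoring="r2", cv=5)
    search.fit(X_train, y_train)                         # Steps 2-4
    best_model = search.best_estimator_                  # Step 5: final model
    print(search.best_params_)                           # the optimum parameters
    print("held-out R^2:", best_model.score(X_test, y_test))  # Step 6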
3.5. Assessment criteria

The criteria implemented for assessing performance are the three most commonly used in regression analysis and in petroleum engineering journals (Olatunji et al., 2014a,b). This is done in order to align with best practice and to carry out a fair assessment of the developed model. The criteria are detailed in the next subsections.

3.5.1. Correlation coefficient (cc)
This measures the statistical correlation between the predicted and actual values. The value ranges between 0 and 1: a value of 0 indicates no similarity and a bad performance, while 1 represents perfect correlation between the model output and the actual value to be predicted, indicating perfect generalization accuracy. The formula for the correlation coefficient is depicted in Eq. (6):

cc = \frac{\sum (y_a - y'_a)(y_p - y'_p)}{\sqrt{\sum (y_a - y'_a)^2 \, \sum (y_p - y'_p)^2}} \qquad (6)

where y_a and y_p stand for the actual and predicted values respectively, while y'_a and y'_p represent their respective means.

3.5.2. Root mean-squared error (RMSE)
This is calculated by taking the mean of the squares of the differences between each predicted output x_i and its corresponding actual value y_i, and then taking the square root of the resulting value. The formula is shown in Eq. (7):

\mathrm{rmse} = \sqrt{\frac{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}{n}} \qquad (7)

where n is the size of the dataset.

3.5.3. Average absolute percent relative error (Ea)
The absolute error expresses the difference (error) between each predicted output and its corresponding actual value. The relative error is the ratio of the absolute error to the actual value, and the percent relative error is the relative error measured in percentage. Hence, the formula is as expressed in Eq. (8):

E_a = \sum_{i=1}^{n} \left| \frac{y_a^i - y_p^i}{y_a^i} \right| \times 100 \qquad (8)
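The three criteria translate directly into code. The sketch below implements Eqs. (6)–(8) with NumPy; note that in ea the sum of Eq. (8) is taken as a mean, an assumption consistent with the word "average" in the criterion's name.

    import numpy as np

    def cc(y_actual: np.ndarray, y_pred: np.ndarray) -> float:
        """Correlation coefficient, Eq. (6)."""
        da, dp = y_actual - y_actual.mean(), y_pred - y_pred.mean()
        return float((da * dp).sum() / np.sqrt((da ** 2).sum() * (dp ** 2).sum()))

    def rmse(y_actual: np.ndarray, y_pred: np.ndarray) -> float:
        """Root mean-squared error, Eq. (7)."""
        return float(np.sqrt(np.mean((y_pred - y_actual) ** 2)))

    def ea(y_actual: np.ndarray, y_pred: np.ndarray) -> float:
        """Average absolute percent relative error, Eq. (8)."""
        return float(np.mean(np.abs((y_actual - y_pred) / y_actual)) * 100.0)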
4. Results and discussion

The simulations carried out in this research work were done in two batches, and the results are presented in a comparative manner so as to highlight the effect of our findings. The first set of simulations was done using the whole data obtained from the exploration field, as presented in Table 2. The SVM model was trained and tested, and the generalization accuracy of the resulting model was recorded. Prior to the first simulation, we only normalized and randomized the dataset as a form of data preprocessing; the reasons for this were given in Section 3.

In order to further improve the accuracy of the model, we looked into the correlation between the target and the predictors; the correlation coefficients are presented in Table 4.

Table 4
Correlation between permeability and well logs for each of the wells.

Well logs | Well-A | Well-B | Well-C | Well-D | Well-E
NPHI | 0.694691 | 0.708764 | 0.344422 | 0.672612 | 0.577893
PHIT | 0.693479 | 0.687566 | 0.384443 | 0.670098 | 0.698268
RHOB | 0.66741 | 0.64889 | 0.3924 | 0.65338 | 0.64518
SWT | 0.70598 | 0.21393 | 0.14001 | 0.27487 | 0.2097
CALI | 0.25465 | 0.227449 | 0.132242 | 0.09444 | 0.18433
CT | 0.1304 | 0.289395 | 0.315517 | 0.063035 | 0.614203
DRHO | 0.25051 | 0.49807 | 0.19888 | 0.031367 | 0.002007
GR | 0.15625 | 0.05571 | 0.11949 | 0.02225 | 0.143211
RT | 0.10518 | 0.17643 | 0.03552 | 0.12154 | 0.25594
MSFL | 0.35745 | 0.53207 | 0.48473 | 0.37085 | –
DT | 0.63615 | – | 0.376058 | 0.658873 | 0.646934
CPERM | – | 0.536471 | – | 0.509236 | 0.405953
CPOR | – | 0.648363 | – | 0.634595 | 0.599875

Based on the rule of thumb stated in Table 1, we defined a threshold correlation value of 0.5 for any of the well logs to be considered in the second part of the simulation. It is seen that, except for Well-C, all the wells have well logs falling into this category, so for Well-C we defined a correlation value of 0.3 as the threshold. As a result of the defined thresholds, the features for each well have now been pruned down to six well logs, as is clear from the correlation values presented in Table 4.

We then proceeded to train the model using the new preprocessed data with the 6 features selected from the available features; the result is presented in Table 5.

Before analyzing the result, it is beneficial to get an insight into what was done to realize the datasets for the second part of the simulation; this is detailed in Section 2. Briefly, the motivation is to improve the accuracy of the resulting model by systematically eliminating those data that may hinder the efficiency of the algorithm. Of course, this could have been done in a trial-and-error manner, trying different combinations of the well logs (predictors) until a better result is obtained; however, that leads to a waste of valuable time and computing resources.

The results of the simulation are presented in Table 5 and are further analyzed below. The uncorrelated results are those obtained in the first part of the simulation, when the whole data were used; the correlated results are those obtained in the second part, when correlation-based feature selection preprocessing was implemented in the development of the model.

Table 5
Testing accuracy of the SVM model in permeability prediction.

Criteria | Well-A Uncorr | Well-A Corr | Well-B Uncorr | Well-B Corr | Well-C Uncorr | Well-C Corr | Well-D Uncorr | Well-D Corr | Well-E Uncorr | Well-E Corr
R2 | 0.87 | 1.00 | 0.89 | 0.98 | 0.87 | 0.92 | 0.99 | 0.98 | 0.85 | 0.97
RMSE | 0.57 | 0.99 | 40.12 | 17.49 | 69.80 | 0.13 | 18.84 | 20.74 | 100.43 | 13.18
Ea | 333.79 | 1.12 | 4039.32 | 241.65 | 5581.49 | 10.20 | 6359.19 | 156.66 | 1103.05 | 291.34

Uncorr = uncorrelated model; Corr = correlated model.

The performance criteria R2, RMSE and Ea represent the coefficient of correlation, the root mean-squared error and the average absolute percent relative error respectively. It should be noted that for R2 the higher the value, the better the performance of the system; the opposite holds for the remaining two criteria, as they measure the error in the performance. Hence, for RMSE and Ea, a lower value represents better performance and an improvement. Using the coefficient of correlation, Table 5 shows how much improvement correlation preprocessing has on the model's performance.

This improvement is seen across the board, an indication of the effectiveness of the correlation-based feature extraction approach considered in this work. Specifically, the coefficient of correlation increases from 0.87 to 1.00 for Well-A, from 0.89 to 0.98 for Well-B and from 0.87 to 0.92 for Well-C. This is indeed a notable and sizeable improvement, as it corresponds to increases of 14.37%, 10.24% and 5.38% respectively, and it has been achieved using a reduced number of features. This performance increase is also noted in the remaining wells and for the other two performance criteria, as shown in Table 6, which gives the percentage gain in performance for the correlation-based SVM model; a positive value indicates a performance gain and vice versa.

Table 6
Performance gain of the SVM correlated model over the uncorrelated model (in percent).

 | Well-A | Well-B | Well-C | Well-D | Well-E
R2 | 14.37% | 10.24% | 5.38% | −1.01% | 13.51%
RMSE | −74.52% | 56.42% | 99.81% | −10.12% | 86.88%
Ea | 99.67% | 94.02% | 99.81% | 97.54% | 73.59%

A closer look at the results for Well-D, where there seems to be no improvement, reveals that the performance of the system was excellent prior to correlation preprocessing and that there was not much to improve upon, as opposed to the other wells. In fact, it is seen from Table 5 that the full feature set produced a 99% coefficient of correlation, which is very high and shows that this particular well does not require preprocessing. However, considering the fact that the preprocessing approach employed reduces the number of features and takes less processing time and computing resources, it can be argued that 1.01% is a small trade-off for so many benefits, especially in cases of huge data processing. The same argument can be put forward for Well-A.

A pictorial representation of Table 5 is given in Figs. 1–3. These give an easily viewed overview of the performance improvement recorded using the correlation-based feature extraction approach. It is seen that the model developed with the correlation-based approach outperformed the one developed without it. From Fig. 1, the correlation coefficient increases in virtually all the wells for the correlation-based approach, while the recorded error reduces greatly. The absolute error decreases greatly across the wells, with reductions varying between 73% and more than 99%. Specifically, reductions of 73.59% and 99.67% in the value of the absolute error were recorded in Well-E and Well-A respectively, and a similar trend is recorded in the remaining wells. This sort of reduction improves the accuracy of permeability prediction, which translates to increased production from the wells, since a lower error means a reduction in the uncertainty of the predictive model.

Fig. 1. Pictorial representation of SVM testing accuracy using the coefficient of correlation.

Fig. 2. Pictorial representation of SVM testing accuracy using RMSE.

Fig. 3. Pictorial representation of SVM testing accuracy using absolute percent error.


The coefficient-of-correlation bars for the model developed after feature extraction clearly lie above those of the model developed without feature extraction, an indication of the uniform improvement recorded in the generalization and predictive capability of SVM after implementing correlation-based feature extraction, since a higher correlation coefficient means better generalization accuracy. This improvement is also noted in RMSE and Ea when feature extraction was implemented, as the error bars distinctly lie lower, which means the models consistently produce smaller errors. This suggests that if this simple technique is applied, it can improve the generalization accuracy and predictive ability of SVM.

Finally, an important aspect of this correlation-based feature extraction approach is shown in Fig. 4, where a reduced dataset has been used to realize better performance. From Fig. 4, less than half of the original features were used in Well-D after implementing the correlation-based feature extraction approach, 50% were used in Wells B and E, while 55% were used in the remaining wells. Though all features have to be used for the preprocessing procedure, it is neither time-consuming nor computationally demanding: calculating a correlation coefficient is a simple analysis in Excel and is nowhere near as computationally demanding as executing machine learning algorithms. In fact, it can be incorporated among the statistical analyses (such as the mean and standard deviation shown in Table 3) routinely carried out as preprocessing steps. Moreover, SVM is reputed to be efficient for small datasets and becomes computationally infeasible for large datasets, because its complexity is highly dependent on the size of the data set (Owolabi et al., 2014a,b; Li and Yu, 2013; Cervantes et al., 2008). This infeasibility becomes pronounced in the oil and gas sector, where the datasets from a well may run into thousands of data points. Hence, the proposed preprocessing procedure eliminates a huge chunk of the datasets, making SVM a feasible approach for the large-dataset scenario; in fact, this is one of the motivations for this work. Furthermore, once the correlation coefficients are obtained for the features of a particular well, subsequent data points from this well are characterized based on these correlation coefficient values, which eliminates further preprocessing. Therefore, correlation-based feature selection is a simple statistical analysis which leads to a substantial reduction of large datasets. This makes training an SVM model with these datasets a feasible approach, resulting in less computational cost and time.

Fig. 4. Data size comparison between the correlated and uncorrelated SVM models.

From the preceding discussion, it has been shown that despite this feature extraction and the resulting dataset reduction, which requires less run time and less usage of computational resources, superior performance has been recorded in virtually all the wells. This is indeed a notable achievement, because big industries are inundated with streams of data nowadays, and any data processing technique able to extract better performance from a subset of a large dataset is highly desired, since it leads to a gain in valuable time and computational resources. It was reported by SINTEF that 90% of all the data in the world has been generated over the last two years, which means that the future will definitely require Big Data modeling techniques (Dragland, 2013). The technique reported in this work can find application in tasks involving datasets with a large number of features whose modeling and processing might be impractical due to the sheer size of the datasets; hence, the approach investigated here offers a means through which such datasets can be effectively reduced without compromising performance. This is indeed the case in the petroleum industry, as shown briefly in this work, and in many other big industries. In fact, there is a whirlwind of research activity presently ongoing in the intelligence community regarding techniques for handling huge data (Labrinidis and Jagadish, 2012; Anon, 2008; McCallum et al., 2000; Sun et al., 2009). Therefore, this research work is timely and a step in the right direction.

5. Conclusion and recommendation

The permeability of a carbonate reservoir has been predicted in this work using an SVM model developed from real datasets obtained from oil and gas field exploration. The performance of the algorithm was then improved greatly by considering only the features (attributes) with high correlation to the target variable. This feature selection was carried out systematically as a preprocessing stage prior to the use of the SVM approach. It can be inferred from the improved performance recorded that correlation-based feature extraction preprocessing eliminates ill-conditioned features, and this approach greatly improves the efficiency of the algorithm. The use of a reduced dataset with better performance is also very advantageous, as it translates to improved performance, less run time and less computational overhead. This makes a great deal of difference in cases with huge datasets, as is common in real-life industrial scenarios. Based on the encouraging results obtained in this work, we recommend that researchers employ this simple and efficient approach in their research work. Future research could investigate the effect of this approach on other learning algorithms.

References
Ahmed, U., Crary, S.F., Coates, G.R., 1991. Permeability estimation: the various sources and their interrelationships. J. Petroleum Technol. 43 (5), 578–587. Available at: https://www.onepetro.org/journal-paper/SPE-19604-PA (accessed 02.09.14).

Akande, K.O., et al., 2014. Performance comparison of SVM and ANN in predicting compressive strength of concrete, 16 (5), 88–94.

Anon, 2008. Handbook for Team-based Qualitative Research. Rowman Altamira. Available at: http://books.google.com/books?hl=en&lr=&id=nnwJbi52StwC&pgis=1 (accessed 17.09.14).

Ayan, C., et al., 2001. Characterizing permeability with formation testers. Oilfield Rev. 13 (3), 2–23.

Bruce, A.G., Wong, P.M., Zhang, Y., Salisch, H.A., Fung, C.C., Gedeon, T.D., 2000. A state-of-the-art review of neural networks for permeability prediction. Aust. Petroleum Prod. Explor. Assoc. 40 (1), 343–354.

Carrier, W.D., 2003. Goodbye, Hazen; Hello, Kozeny-Carman. J. Geotechnical Geoenviron. Eng. 129 (11), 1054–1056. Available at: http://ascelibrary.org/doi/abs/10.1061/%28ASCE%291090-0241%282003%29129%3A11%281054%29 (accessed 17.09.14).

Cervantes, J., et al., 2008. Support vector machine classification for large data sets via minimum enclosing ball clustering. Neurocomputing 71 (4–6), 611–619. Available at: http://www.sciencedirect.com/science/article/pii/S0925231207002962 (accessed 10.12.14).

Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20, 273–297. Available at: http://link.springer.com/article/10.1007/BF00994018 (accessed 16.08.14).

Cristianini, N., Shawe-Taylor, J., 1999. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Available at: http://dl.acm.org/citation.cfm?id=345662 (accessed 03.08.14).

Darcy, H., 1856. Les fontaines publiques de la ville de Dijon: exposition et application des principes à suivre et des formules à employer dans les questions de distribution d'eau. Available at: http://gallica.bnf.fr/ark:/12148/bpt6k624312 (accessed 02.09.14).

Dragland, Å., 2013. Big Data – for Better or Worse. SINTEF. Available at: http://www.sintef.no/home/Press-Room/Research-News/Big-Dataefor-better-or-worse/ (accessed 10.07.14).

El Ouahed, A.K., Tiab, D., Mazouzi, A., 2005. Application of artificial intelligence to characterize naturally fractured zones in Hassi Messaoud Oil Field, Algeria. J. Petroleum Sci. Eng. 49 (3–4), 122–141. Available at: http://www.sciencedirect.com/science/article/pii/S0920410505001373 (accessed 17.09.14).

Labrinidis, A., Jagadish, H.V., 2012. Challenges and opportunities with big data. Proc. VLDB Endow. 5 (12), 2032–2033. Available at: http://dl.acm.org/citation.cfm?id=2367502.2367572 (accessed 17.09.14).

Li, X., Yu, W., 2013. Fast support vector machine classification for large data sets. Int. J. Comput. Intell. Syst. 7 (2), 197–212. Available at: http://www.tandfonline.com/doi/abs/10.1080/18756891.2013.868148 (accessed 10.12.14).

McCallum, A., Nigam, K., Ungar, L.H., 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '00). ACM Press, New York, USA, pp. 169–178. Available at: http://dl.acm.org/citation.cfm?id=347090.347123 (accessed 17.09.14).

Olatunji, S., Arif, H., 2013. Identification of erythemato-squamous skin diseases using extreme learning machine and artificial neural network. ictactjournals.in, 6956 (October). Available at: http://ictactjournals.in/paper/Vol4Iss1_1_page_627_632.pdf (accessed 20.08.14).

Olatunji, S.O., Selamat, A., Raheem, A.A.A., 2011. Predicting correlations properties of crude oil systems using type-2 fuzzy logic systems. Expert Syst. Appl. 38 (9), 10911–10922. Available at: http://dl.acm.org/citation.cfm?id=1975002.1975073 (accessed 11.07.14).

Olatunji, S.O., et al., 2013. Identification of Question and Non-Question Segments in Arabic Monologues Using Prosodic Features: Novel Type-2 Fuzzy Logic and Sensitivity-based Linear Learning Approaches, 2013 (August), pp. 165–175.

Olatunji, S.O., Selamat, A., Abdul Raheem, A.A., 2014a. Improved sensitivity based linear learning method for permeability prediction of carbonate reservoir using interval type-2 fuzzy logic system. Appl. Soft Comput. 14, 144–155. Available at: http://dl.acm.org/citation.cfm?id=2560970.2562525 (accessed 14.08.14).

Olatunji, S.O., Selamat, A., Abdulraheem, A., 2014b. A hybrid model through the fusion of type-2 fuzzy logic systems and extreme learning machines for modelling permeability prediction. Inf. Fusion 16, 29–45. Available at: http://dl.acm.org/citation.cfm?id=2542687.2542759 (accessed 14.08.14).

Osborne, J.W., Overbay, A., 2004. The power of outliers (and why researchers should ALWAYS check for them). Pract. Assess. Res. Eval. 9 (6).

Osuna, E., Freund, R., Girosi, F., 1997. Training support vector machines: an application to face detection. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Comput. Soc., pp. 130–136. Available at: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=609310 (accessed 20.08.14).

Owolabi, T.O., Akande, K.O., Olatunji, S.O., 2014a. Estimation of superconducting transition temperature TC for superconductors of the doped MgB2 system from the crystal lattice parameters using support vector regression. J. Supercond. Nov. Magnetism. Available at: http://link.springer.com/10.1007/s10948-014-2891-7 (accessed 03.12.14).

Owolabi, T.O., Akande, K.O., Olatunji, S.O., 2014b. Support vector machines approach for estimating work function of semiconductors: addressing the limitation of metallic plasma model. Appl. Phys. Res. 6 (5), 122. Available at: http://www.ccsenet.org/journal/index.php/apr/article/view/39961 (accessed 24.09.14).

Pirson, S., 1963. Handbook of Well Log Analysis for Oil and Gas Formation Evaluation. Prentice-Hall, Englewood Cliffs, N.J.

Saemi, M., Ahmadi, M., Varjani, A.Y., 2007. Design of neural networks using genetic algorithm for the permeability estimation of the reservoir. J. Petroleum Sci. Eng. 59 (1–2), 97–105. Available at: http://www.sciencedirect.com/science/article/pii/S0920410507000472 (accessed 17.09.14).

Stoneking, D., 1999. Improving the manufacturability of electronic designs. IEEE Spectr. 36 (6), 70–76. Available at: http://dl.acm.org/citation.cfm?id=328136.328192 (accessed 20.08.14).

Sun, T., et al., 2009. An efficient hierarchical clustering method for large datasets with map-reduce. In: 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies. IEEE, pp. 494–499. Available at: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5372757 (accessed 17.09.14).

Tixier, M.P., 1949. Evaluation of permeability from electric-log resistivity gradients. Oil Gas J. 111–113.

Tusiani, M.D., Shearer, G., 2007. LNG: a Nontechnical Guide. PennWell Books. Available at: http://books.google.com/books?id=b14hnWUAOPYC&pgis=1 (accessed 17.09.14).

Weber, J.C., 1970. Statistics and Research in Physical Education. C.V. Mosby.

Wendt, W.A., Sakurai, S., Nelson, P.H., 1986. Permeability Prediction from Well Logs Using Multiple Regression. Academic Press.

Wong, P., Aminzadeh, F., Nikravesh, M., 2010. Soft Computing for Reservoir Characterization and Modeling. Available at: http://dl.acm.org/citation.cfm?id=1965484 (accessed 15.09.14).

Wyllie, M.R.J., Rose, W.D., 2013. Some theoretical considerations related to the quantitative evaluation of the physical characteristics of reservoir rock from electrical log data. J. Petroleum Technol. 2 (04), 105–118. Available at: https://www.onepetro.org/journal-paper/SPE-950105-G (accessed 17.09.14).
