Convolutional Neural Networks for Automated Seismic Interpretation
¹University of Oslo, Department of Informatics, Oslo, Norway. E-mail: [email protected]; [email protected]; [email protected].
²University of Oslo, Department of Geosciences, Oslo, Norway. E-mail: [email protected].
Figure 1. A visualization of how faces may be represented in a CNN. We observe that (a) edges with different orientations form (b) eyes, noses, and mouths that can be
used to represent (c) faces. This figure originally appeared in Lee et al. (2011) and is printed with permission from the author of the original article.
Figure 2. Filter responses for some of the nodes in the first four layers of AlexNet (Krizhevsky et al., 2012), visualized using the technique of Zeiler and Fergus (2014). This figure originally appeared in Zeiler and Fergus (2014) and is printed with permission from the authors of the original article.
Figure 3. A CNN applied for image classification. The input is an image, and the output is a probability vector in which each element contains the estimated probability
of the image containing an object of the given class.
Conceptually, we can think of it as a soft classifier that is applied to the extracted feature image. The RELU determines whether the previous filter detected the given feature (the input is higher than zero) and, if detected, returns how strongly the feature is present. By applying convolution plus activation, each node becomes a small feature extractor and classification unit.

Between some of the convolutional layers, the spatial size of the images is sometimes reduced by downsampling or pooling. This gradually introduces spatial invariance, as it helps the network move information from the spatial arrangement of pixel values (image domain) to features containing information relevant for the classification task (feature domain).
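To make the idea of each node acting as a small feature extractor concrete, the following minimal NumPy/SciPy sketch (illustrative only; the edge filter and image sizes are made-up examples, not taken from the article) applies a single hand-crafted filter, a RELU activation, and 2 × 2 max pooling:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
image = rng.standard_normal((64, 64))        # toy single-channel "image"
vertical_edge = np.array([[1., 0., -1.],
                          [1., 0., -1.],
                          [1., 0., -1.]])    # hand-crafted edge filter, for illustration

response = correlate2d(image, vertical_edge, mode="valid")  # convolutional filtering
feature_map = np.maximum(response, 0.0)                     # RELU: keep only positive evidence

# 2x2 max pooling: keep the strongest response in each non-overlapping block,
# halving the spatial size and introducing some spatial invariance.
h, w = feature_map.shape
h, w = h - h % 2, w - w % 2
pooled = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

print(feature_map.shape, pooled.shape)       # (62, 62) -> (31, 31)
```

In a trained CNN, the filter coefficients are not hand-crafted like this; they are learned from the training data, as discussed below.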
Fully connected layers. After the final convolutional layer, we have hopefully detected high-level features that will help identify the class of the object. The next step is to predict the class of the image based on these features. For example, if we have detected a mouth and eyes, we can predict that the image contains a human face. If we have detected eyes and fur, we can predict that the image contains an animal. Such hierarchical decision rules are modeled in the fully connected part of the network.

The output vector from the final convolutional layer is the input to the first fully connected layer. Each of the subsequent layers receives the output vector from the previous layer. Each node applies a weighted sum of its inputs followed by an activation function.

Training the network
Now that we have investigated how a trained CNN can solve classification tasks, the most prominent question is: how is the network trained? Defining the filters manually is not an option, since there might be thousands of filters in a network. Instead, we train the CNN by iteratively updating the filter weights to minimize the error on the training set. For simplicity, we use the term weights for all tunable parameters in the network. This includes the filter coefficients in the convolutional layers and the weights in the fully connected layers.

Cost function. The network is trained by minimizing a differentiable cost function. The cost function is used instead of the error rate because the error rate is discontinuous and would be harder to minimize numerically. Because the softmax function gives a “soft” label with values between 0 and 1, the output of this layer is used to construct a continuous cost function. A common cost function, especially for multiclass problems, is cross entropy:

E = −∑_{∀j} y′_j log(y_j),   (4)

where y_j is the softmax output for class j and y′_j is the corresponding element of the true label.
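As a small worked example of equation (4), the following NumPy sketch (illustrative only, not the authors' implementation) computes the softmax output and the resulting cross-entropy cost for a single two-class sample:

```python
import numpy as np

def softmax(z):
    """Turn raw network outputs into a 'soft' label with values between 0 and 1."""
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, -1.0])       # made-up outputs for two classes
y = softmax(logits)                  # predicted soft label y, e.g. [0.95, 0.05]
y_true = np.array([1.0, 0.0])        # one-hot true label y'

E = -np.sum(y_true * np.log(y))      # cross entropy, equation (4)
print(y, E)                          # cost is small when the correct class gets high probability
```

Because E changes smoothly as the weights change, its gradient can be used to update the weights iteratively, which is what makes it preferable to the discontinuous error rate.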
Figure 6. (a) The inline slice used for training. The manual interpretation was reproduced from Rojo et al. (2016). (b) The test slice. No postprocessing was applied to the
predicted salt pixels.
Dropout. Dropout is a regularization technique applied during training. At each iteration, 50% of the nodes in each layer are randomly dropped (set to zero). The weights in each of the nodes then become less dependent on the other nodes, which prevents overfitting. It also forces the network to learn redundant features and increases robustness.
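A minimal NumPy sketch of dropout at training time (illustrative only; the rescaling by the keep probability is the common "inverted dropout" convention and is not described in the text):

```python
import numpy as np

def dropout(activations, keep_prob=0.5, rng=np.random.default_rng(0)):
    """Randomly zero out nodes; rescale survivors so the expected output is unchanged."""
    mask = rng.random(activations.shape) < keep_prob   # keep each node with probability keep_prob
    return activations * mask / keep_prob

layer_output = np.array([0.3, 1.2, 0.0, 2.5, 0.7])     # made-up activations from one layer
print(dropout(layer_output))                           # at test time, dropout is simply not applied
```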
Augmentation. Given that we have gathered the largest training set practically possible, random augmentation can be used to simulate a larger training data set. Small random geometrical transforms (rotation, flipping of axes, scaling, etc.) are applied to the training images, which artificially generates more training samples. The technique prevents the network from overfitting to a limited set of training examples and makes it invariant to the types of augmentation that are applied. The amount of augmentation should be chosen such that we get samples with as much variety as possible, but it should not alter a sample so much that it is no longer representative of its original class.
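A short sketch of such random geometrical augmentation (illustrative only; the transform types follow the text, but the rotation range and the use of SciPy are arbitrary choices):

```python
import numpy as np
from scipy.ndimage import rotate

def augment(image, rng=np.random.default_rng()):
    """Return a randomly flipped and rotated copy of a 2D training image."""
    if rng.random() < 0.5:
        image = np.flip(image, axis=1)       # random flip of the horizontal axis
    angle = rng.uniform(-15.0, 15.0)         # small random rotation (range chosen arbitrarily)
    return rotate(image, angle, reshape=False, mode="nearest")

sample = np.random.randn(64, 64)
augmented = augment(sample)                  # a "new" training sample of the same class
```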
Example: Salt classification
In this section, a CNN is applied in the context of automated seismic interpretation for delineation of salt bodies. Interpretation of salt bodies is of interest for two reasons. Due to the low permeability associated with salt bodies, they may form seals for reservoirs. In addition, they have a relatively high sound velocity, which makes it important to obtain an accurate velocity model in the vicinity of the salt bodies.
Data set and network configuration. In our example, we use a 3D data set acquired in the Barents Sea at the 7228/7 block, which contains one salt wall and four salt stocks (Rojo et al., 2016). We define two classes — “salt” and “not salt.” We select small 65 × 65 × 65 cubes of seismic amplitudes from the full cube. The goal is to have the network predict the class of the center pixel of the small cubes.

The network is trained on one manually labeled inline slice (Figure 6a). We select 3D cubes around the pixels in this slice (including amplitudes from the neighboring inline slices) and use these cubes as training samples. When the network has been trained, we go through the full 3D volume, select all possible 65 × 65 × 65 cubes, and apply the network to predict the class at each location.
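A minimal NumPy sketch of how such training samples can be assembled (illustrative only; the array layout, indices, and labels below are made-up assumptions, not taken from the authors' code):

```python
import numpy as np

def extract_cube(volume, inline, xline, time, size=65):
    """Cut out the size^3 amplitude cube centered on one pixel."""
    half = size // 2
    return volume[inline - half:inline + half + 1,
                  xline - half:xline + half + 1,
                  time - half:time + half + 1]

volume = np.random.randn(201, 301, 401).astype(np.float32)   # toy (inline, crossline, time) volume
train_inline = 100                                            # the manually labeled inline slice
labeled_pixels = [(150, 200, 1), (120, 250, 0)]               # (crossline, time, class); made-up labels

X = np.stack([extract_cube(volume, train_inline, x, t) for x, t, _ in labeled_pixels])
y = np.array([c for _, _, c in labeled_pixels])
print(X.shape, y.shape)    # (2, 65, 65, 65) amplitude cubes and their center-pixel labels
```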
We used the network configuration presented in Figure 7. This is a further development of the network we proposed in Waldeland and Solberg (2017). Batch norm was applied before the RELU activations. The network was trained for 2000 iterations with a batch size of 32, with 16 samples from each class. Having only one training slice does not give enough training data to avoid overfitting. To remedy this, we applied random augmentation to simulate a larger training set. The augmentation was random scaling (±20%), random flipping of the nondepth axes, random rotation (±180°), and random tilting (±15°). The chosen network and training configurations were the result of some experimentation.³ The training lasted approximately 20 minutes on an Nvidia Titan GPU (2013 model).

³Initially, we investigated how the number of layers and the number of nodes affected the performance. The results were very similar for network configurations with everything between four and 10 layers and with 40–80 nodes in each layer. The network seemed to converge after 1000–2000 training iterations and did not improve notably when training for more iterations (up to 20,000 iterations were tested). We also tested dropout, max and average pooling, and the ELU activation function, but the results did not change much. The network did not perform well, however, without augmentation and batch norm.
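For concreteness, the following PyTorch sketch outlines a small 3D CNN of the kind described above: a few convolutional layers, each with batch norm applied before the RELU activation, followed by a classifier that outputs two class scores. It is illustrative only; the layer count, filter sizes, and the use of PyTorch are placeholders (the article states that the actual implementation used TensorFlow), and this is not the exact configuration of Figure 7.

```python
import torch
import torch.nn as nn

class SaltNet(nn.Module):
    """Toy 3D CNN for classifying the center pixel of a 65x65x65 amplitude cube."""

    def __init__(self, n_filters=50, n_classes=2):
        super().__init__()

        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv3d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm3d(c_out),   # batch norm before the RELU activation
                nn.ReLU(),
            )

        self.features = nn.Sequential(
            block(1, n_filters),
            block(n_filters, n_filters),
            block(n_filters, n_filters),
            block(n_filters, n_filters),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),     # collapse the remaining spatial axes
            nn.Flatten(),
            nn.Linear(n_filters, n_classes),
        )

    def forward(self, x):                # x: (batch, 1, 65, 65, 65) amplitude cubes
        return self.classifier(self.features(x))

net = SaltNet()
scores = net(torch.randn(2, 1, 65, 65, 65))
print(scores.shape)                      # torch.Size([2, 2]): "salt" / "not salt" scores
```

A softmax over the two scores gives the soft label used in the cross-entropy cost of equation (4).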
Figure 7. The full 3D cube is partitioned into small amplitude cubes of 65 × 65 × 65 samples, which are input into the salt classification network. The network then
predicts the class of the center pixel.
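The prediction stage described in Figure 7 can be outlined as a sliding window over the volume. In the NumPy sketch below (illustrative only), predict_class is a stand-in for a forward pass of the trained network; in practice the cubes are processed in batches on the GPU.

```python
import numpy as np

def predict_class(cube):
    """Placeholder for the trained CNN's decision for the cube's center pixel."""
    return int(cube[32, 32, 32] > 0)

def classify_volume(volume, size=65, stride=1):
    half = size // 2
    labels = np.zeros(volume.shape, dtype=np.uint8)
    ni, nx, nt = volume.shape
    for i in range(half, ni - half, stride):
        for x in range(half, nx - half, stride):
            for t in range(half, nt - half, stride):
                cube = volume[i - half:i + half + 1,
                              x - half:x + half + 1,
                              t - half:t + half + 1]
                labels[i, x, t] = predict_class(cube)   # class of the center pixel
    return labels

volume = np.random.randn(80, 80, 80).astype(np.float32)   # toy volume; real surveys are far larger
salt_mask = classify_volume(volume)
```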
Results and discussion. To assess the quality of the automatic interpretation, we compare it with a manual interpretation of an inline test section (Figure 6b). The automatic interpretation generally coincides well with the interpreted salt body. At the flanks, the manual interpretation is slightly more conservative than the automatic interpretation using the CNN. It should be noted that it is not trivial to determine the boundaries at the salt flanks and that manual interpretations might vary from interpreter to interpreter. The resulting classification of the full 3D data set shows that the network has successfully delineated the salt wall and the four salt stocks (Figure 8). Although the automatic interpretation is somewhat more conservative for the four salt stocks (visible in the time slices), the automatic interpretation is very close to the manual 3D interpretation. In our example, we only trained on one inline containing the salt wall. It is likely that by including training slices from the four salt stocks, the classification would improve.
Figure 8. (a–c) The predicted salt body is marked with red. (d) The predicted full 3D salt body is visualized with color indicating the time. No postprocessing was applied to the predicted salt pixels. Panels (e) and (f) show the manual interpretation done in Rojo et al. (2016) and are printed with permission from the authors of the original article. Note that the manual 3D interpretation was conducted only for the salt wall and one of the four salt stocks.

Learned attributes. We can construct sections with the learned attributes to gain insight into how the classification task is solved. To compute the attribute sections, we select the small 3D cube for a given center pixel and run it through the network. The output from a given node can then be collected as the attribute value at this location. Figure 9 shows five of the learned attributes for the test slice. The first attribute (Figure 9a) is sensitive to horizontally layered geology. This is a good indicator of regions outside the salt but does not give a good separation close to the boundary. In these regions, we often have dipping salt flanks. The next attributes (Figures 9b and 9c) are sensitive
Figure 9. Some of the learned attributes from the last convolutional layer computed for the test inline slice.
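One way to read out such a learned attribute is to record the output of a chosen node while a cube is run through the network. The PyTorch sketch below is illustrative only; the tiny network, the hooked layer, and the node index are arbitrary stand-ins, not the authors' TensorFlow implementation.

```python
import torch
import torch.nn as nn

# Tiny stand-in network, just to demonstrate collecting an intermediate output.
net = nn.Sequential(
    nn.Conv3d(1, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv3d(8, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),
).eval()

captured = {}

def save_output(module, inputs, output):
    captured["activation"] = output.detach()

handle = net[2].register_forward_hook(save_output)   # hook the last convolutional layer

cube = torch.randn(1, 1, 65, 65, 65)                 # cube centered on one pixel of the test slice
with torch.no_grad():
    net(cube)

# The response of one node (index 0; an arbitrary choice) becomes the attribute
# value stored at this center-pixel location; repeating this over the slice
# builds an attribute section like those in Figure 9.
attribute_value = captured["activation"][0, 0].mean().item()
handle.remove()
print(attribute_value)
```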
One difference between natural images and seismic data is that the objects we typically want to detect in seismic images are less complex than the objects we want to detect in natural images. This means the network architecture needed to solve seismic interpretation problems can have fewer layers than the architectures used for natural images, which is beneficial because smaller networks are more easily trained with less data.

Much of the success of CNNs has come from efficient implementation of convolutions on powerful GPUs, which has made it possible to train networks with more iterations within a feasible time frame. For 3D seismic data, this poses a challenge due to the limited amount of memory on GPUs. For 3D data, this quickly limits the sizes of networks, samples, and batches that can be used.

In the context of seismic data, we need to be aware that the underlying geology might be very different depending on the region where the data were acquired. Also, the data quality depends on the processing workflow, the sampling interval, and possible errors in the velocity model. If we want to train a network that should generalize to new data sets, we need to have training samples from data sets covering a wide range of these differences. However, when a new seismic data set is processed and interpreted, a large amount of manual work and quality control is invested in each data set. Therefore, the cost of labeling some parts (e.g., some slices) of a data set is relatively small compared to the entire workflow. This makes it possible to train the network on a few slices from the data set and use it to interpret the remaining slices.

Suggested reading
The network architecture we used is rather simple and is based on texture classification. State-of-the-art models for classification (Szegedy et al., 2015; He et al., 2016) should be considered for complex classification tasks, and segmentation networks (Ronneberger et al., 2015) for non-texture-based problems. In cases with little training data, reuse of existing networks (Razavian et al., 2014) or domain adaptation (e.g., Ganin and Lempitsky, 2015) could improve results, but it is not trivial to succeed with such methods. Visualization of trained networks (e.g., Zeiler and Fergus, 2014; Mordvintsev et al., 2015) is important in order to understand how a network solves a given classification task. It should be noted that the field of deep learning is actively being developed and that best practices for network configurations, hyperparameter choices, and training strategies change rapidly.

Conclusions
The use of CNNs on seismic data is promising and may lead to increased accuracy for automated seismic interpretation, as it has done for tasks in natural and medical image analysis. CNNs consider the spatial aspect of images and exhibit the hierarchical structure for detecting complex objects through layers of convolutional nodes. The filter weights are optimized during the training process. In this work, we have demonstrated the use of CNNs in the context of seismic images by using them to delineate salt bodies. One manually labeled slice was used to train the network, and the network was used to successfully delineate the full 3D salt body. This was confirmed by a comparison with a manual interpretation.

Acknowledgments
This work is funded by the Norwegian Research Council, Grant 234019. The CNN used in this paper was implemented using the TensorFlow framework by Google. Example code is available at https://github.com/waldeland/CNN-for-ASI.

Corresponding author: [email protected]

References
Araya-Polo, M., T. Dahlke, C. Frogner, C. Zhang, T. Poggio, and D. Hohl, 2017, Automated fault detection without seismic processing: The Leading Edge, 36, no. 3, 208–214, https://doi.org/10.1190/tle36030208.1.
Bengio, Y., 2012, Practical recommendations for gradient-based training of deep architectures, in G. Montavon, G. B. Orr, and K.-R. Müller, eds., Neural networks: Tricks of the trade, second edition: Springer Heidelberg, 437–478.
Gabor, D., 1946, Theory of communication. Part 1: The analysis of information: Journal of the Institution of Electrical Engineers — Part III: Radio and Communication Engineering, 93, no. 26, 429–457, https://doi.org/10.1049/ji-3-2.1946.0074.
Ganin, Y., and V. Lempitsky, 2015, Unsupervised domain adaptation by backpropagation: Proceedings of the International Conference on Machine Learning, 1180–1189.
Glorot, X., and Y. Bengio, 2010, Understanding the difficulty of training deep feedforward neural networks: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 249–256.
Goodfellow, I., Y. Bengio, and A. Courville, 2016, Deep learning: MIT Press.
He, K., X. Zhang, S. Ren, and J. Sun, 2016, Deep residual learning for image recognition: IEEE Conference on Computer Vision and Pattern Recognition, 770–778, https://doi.org/10.1109/CVPR.2016.90.
Hensman, P., and D. Masko, 2015, The impact of imbalanced training data for convolutional neural networks: PhD thesis, KTH Royal Institute of Technology.
Huang, L., X. Dong, and T. E. Clee, 2017, A scalable deep learning platform for identifying geologic features from seismic attributes: The Leading Edge, 36, no. 3, 249–256, https://doi.org/10.1190/tle36030249.1.
Ioffe, S., and C. Szegedy, 2015, Batch normalization: Accelerating deep network training by reducing internal covariate shift: Proceedings of the 32nd International Conference on Machine Learning.
Kingma, D., and J. Ba, 2015, Adam: A method for stochastic optimization: Proceedings of the 3rd International Conference on Learning Representations, 1–15.