Learning and Transferring Representations For Image Steganalysis Using Convolutional Neural Network
timize all the parameters in both steps. Unlike that work, we consider here the use of transfer learning to help the training of a CNN model and achieve a better performance for steganalysis.

Generally, the goal of transfer learning is to leverage shared domain-specific knowledge contained in related tasks to help improve the performance of the target task. In [13–15], transfer learning with CNNs is explored for visual recognition in a manner of joint training CNNs from unsu-

[Figure: Source task (pre-training): training images (stegos with payload A and covers) pass through the image processing layer (P), the convolutional layers (C, C, C, C, C) and the fully connected layers (F, F, F). Feature transfer: transfer parameters. Target task (fine-tuning): training images (stegos with payload B and covers) pass through the same P, C, C, C, C, C, F, F, F layers.]

where f() denotes a non-linearity operation, pool() denotes the pooling operation, $X_j^l$ is the j-th feature map in layer l, $X_i^{l-1}$ is the i-th feature map in layer l − 1, $K_{ij}$ is the trainable filter connecting the j-th output map and the i-th input map, and $b_j^l$ is a trainable bias parameter for the j-th output map. Note that the weights of the filter kernels and the biases have to be learned and are modified during training.

The first convolutional layer accepts the noise residual from the image processing layer as input and filters it with
16 trainable kernels of size 5×5. The second, third and fourth convolutional layers apply convolutions with a kernel size of 3×3. The size of the convolution kernel used in the fifth convolutional layer is 5×5. The filtering stride of all convolution operations in the five convolutional layers is 1. At each convolutional layer, the Gaussian activation function is applied element-wise to the output of the convolution operations. Moreover, each of the five convolutional layers applies an overlapping average pooling operation with a window size of 3×3 and a stride of 2.

After five layers of convolution and pooling operations, the input image has been converted into a 256-D feature vector capturing the steganographic traces in the input image, which is finally fed to the classification module consisting of three fully connected layers. Each of the first two fully connected layers has 128 neurons, and the output of each neuron is activated by the ReLU activation function [18]. The last fully connected layer has 2 neurons, and its outputs are fed to a two-way softmax for classification.

3.2. Learning and transferring features

Based on the described network architecture, we now introduce how features can be learned from the source task and transferred to the target task. First, we pre-train the network on the task of detecting stegos with a higher payload A (the source task) using the back-propagation algorithm. For the source task, the training images are composed of stegos with a higher payload A and the corresponding covers. Note that the KV kernel in the image processing layer is fixed, while all the trainable parameters in the network are initialized randomly and learned during training. The trainable parameters here include the filter kernels and biases in the convolutional layers, as well as the weights and biases in the fully connected layers.

After pre-training the network on the source task, we transfer the parameters of the five convolutional layers C1, C2, C3, C4, C5 and the three fully connected layers F1, F2, F3 to the target task of detecting stegos with a lower payload B; that is, we initialize the network for the target task with the parameters learned from the source task. Then we fine-tune the network using training images consisting of stegos with a lower payload B and the corresponding covers. Note that the pre-training and fine-tuning procedures are similar, except that the former initializes the trainable parameters randomly, while the latter initializes the network with the representations already learned by the pre-trained network. This initialization from the pre-trained network plays an important role in improving the training of the CNN on the target task. In fact, although CNNs have shown great discriminative power in many image classification tasks, they are prone to getting stuck in local minima, which is a common weakness of neural networks, especially deep networks. For the steganalysis task, when embedding with a very low payload, the differences between stegos and covers are quite small, which makes the CNN hard to train. However, the shared representations of the network pre-trained on the source task of detecting stegos with a much higher payload already capture some important patterns caused by embedding operations, hence providing a good regularization to drive the network training for the target task.

4. EXPERIMENTS

In this paper, all experiments were carried out on the standardized BOSSbase 1.01 dataset [19] containing 10,000 cover images of size 512×512. We split the dataset by assigning 70% of the images to a training set, 10% to a validation set, and 20% to a test set. We use the same split for all the experiments.

Due to the GPU memory limitation, it is hard for our proposed deep network to directly use an image of size 512×512 as the input. In our experiments, we tackle this problem by first extracting five 256×256 patches (the four corner patches and the center patch) and their flipped versions from each image of size 512×512 to represent the whole image, and then feeding these extracted patches to the CNN described above. At test time, we first make a prediction on each of the ten patches extracted from an image of size 512×512, and then average the ten predictions to produce an estimate of the class probabilities for the entire image. This
transformation can greatly reduce the GPU memory requirement, while artificially enlarging the dataset to reduce the effects of overfitting.

The training of our proposed network was carried out using the code provided by Krizhevsky et al. [20]. We use a mini-batch size of 128 and a momentum of 0.9. The weight decay is 0 for the convolutional layers and 0.01 for the fully connected layers. All models are initialized with a learning rate of 0.001. The training is stopped when the validation error stops improving. In our experiments, the number of iterations is 100 to 200 for pre-training. During fine-tuning, we first train the pre-trained model for 10 to 20 iterations, then divide the learning rate by 10 and train the model for another 10 to 20 iterations.

We evaluate the performance of the proposed approach on detecting WOW [21] and S-UNIWARD [22], two of the state-of-the-art spatial domain steganographic algorithms, across five payloads: 0.1, 0.2, 0.3, 0.4, and 0.5 bpp (bits per pixel).

Table 1. Detection error for the WOW algorithm.
Payload       0.1bpp   0.2bpp   0.3bpp   0.4bpp   0.5bpp
No-pretrain   50.00%   33.30%   27.88%   20.28%   18.50%
Pre-0.6bpp    40.85%   33.55%   28.28%   22.73%   18.55%
Pre-0.5bpp    40.13%   33.18%   27.48%   21.95%   -
Pre-0.4bpp    38.43%   30.78%   24.87%   -        -
Pre-0.3bpp    40.40%   32.67%   -        -        -
Pre-0.2bpp    39.83%   -        -        -        -
SRM + EC      39.77%   31.75%   24.92%   20.67%   16.23%

Table 2. Detection error for the S-UNIWARD algorithm.
Payload       0.1bpp   0.2bpp   0.3bpp   0.4bpp   0.5bpp
No-pretrain   50.00%   37.40%   30.60%   24.08%   17.33%
Pre-0.6bpp    43.80%   35.38%   29.78%   23.33%   18.63%
Pre-0.5bpp    42.93%   34.38%   28.42%   22.05%   -
Pre-0.4bpp    43.18%   35.78%   29.57%   -        -
Pre-0.3bpp    43.30%   36.50%   -        -        -
Pre-0.2bpp    43.90%   -        -        -        -
SRM + EC      40.25%   32.10%   24.95%   20.55%   16.64%
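The patch-based input described earlier in this section (five 256×256 patches plus their flipped copies, with the ten predictions averaged at test time) can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not the cuda-convnet pipeline used in the experiments: the image is assumed to be a list of pixel rows, the flip is assumed to be horizontal (the text does not state the direction), and `classify_patch` is a hypothetical stand-in for the trained CNN.

```python
def crop(image, top, left, size):
    """Return a size x size patch of an image stored as a list of rows."""
    return [row[left:left + size] for row in image[top:top + size]]


def hflip(patch):
    """Mirror a patch left-to-right (flip direction is an assumption)."""
    return [row[::-1] for row in patch]


def ten_patches(image, size=256):
    """Four corner patches, the center patch, and a flipped copy of each."""
    h, w = len(image), len(image[0])
    offsets = [
        (0, 0), (0, w - size),                # top-left, top-right
        (h - size, 0), (h - size, w - size),  # bottom-left, bottom-right
        ((h - size) // 2, (w - size) // 2),   # center
    ]
    patches = [crop(image, top, left, size) for top, left in offsets]
    return patches + [hflip(p) for p in patches]


def average_predictions(probabilities):
    """Average per-patch class probabilities into one image-level estimate."""
    n = len(probabilities)
    n_classes = len(probabilities[0])
    return [sum(p[c] for p in probabilities) / n for c in range(n_classes)]


def predict_image(image, classify_patch, size=256):
    """classify_patch is a hypothetical callable returning [p_cover, p_stego]."""
    return average_predictions([classify_patch(p) for p in ten_patches(image, size)])
```

With a 512×512 image and `size=256`, `ten_patches` returns ten 256×256 patches, so each forward pass sees a quarter of the pixels of the full image while every training image contributes ten samples.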
For each of the two steganographic algorithms, we first train a CNN on stegos with a payload A and the corresponding covers, and then apply our transfer learning scheme to the tasks of detecting the same steganographic algorithm with a payload lower than A. For example, “Pre-0.6bpp” means that we first pre-train the network on stegos with a payload of 0.6 bpp and the corresponding covers, and then transfer the learned parameters to the tasks of detecting stegos with payloads of 0.5, 0.4, 0.3, 0.2, and 0.1 bpp, respectively. We compare our method with the CNN model without transfer learning (“No-pretrain”), that is, the framework proposed by Qian et al. [11], and also with one of the state-of-the-art traditional steganalysis schemes based on handcrafted features, that is, the SRM feature set [4] implemented with the Ensemble Classifier (“SRM + EC”).

We report the detection error in Table 1 and Table 2. Here the detection error is computed as follows:

$$P_E = \min_{P_{FA}} \frac{1}{2}\left(P_{FA} + P_{MD}(P_{FA})\right), \qquad (3)$$

where $P_{FA}$ and $P_{MD}$ denote the false-alarm and missed-detection probabilities, respectively.

The comparison between our proposed method and the framework proposed by Qian et al. [11] shows that with this transfer learning scheme we obtain significant improvements in detecting stegos with a payload lower than 0.3 bpp for the WOW algorithm, and lower than 0.4 bpp for the S-UNIWARD algorithm. We found that the CNN without pre-training does not converge when the payload is as low as 0.1 bpp in our experiments, giving a detection error of 50%. With our transfer learning scheme, however, the CNN converges and obtains a much better performance. We also observed that the best choice of payload for obtaining stegos in the pre-training step is 0.4 bpp when detecting WOW, and 0.5 bpp when detecting S-UNIWARD. Choosing either a lower or a higher payload leads to a significant performance drop compared with the best choice. In fact, as the payload for the source task goes lower, it becomes harder for the pre-trained CNN to capture enough shared patterns for transferring. Moreover, both WOW and S-UNIWARD are content-adaptive steganographic algorithms, which embed messages into areas of images where changes are relatively hard to detect. Note that the payload in the target task is relatively low, which means the messages are more likely to be embedded in noisy or textured areas that are hard to model. As the payload for the source task goes higher, some messages may be embedded in smooth areas that are easier to model, and it is more likely that the pre-trained CNN will capture, for transferring, the patterns in smooth areas, which are very different from the patterns in noisy areas. Finally, when compared with the traditional “SRM + EC” method, our approach achieves a better performance on detecting the WOW algorithm.

5. CONCLUSIONS

This paper proposes a novel framework based on transfer learning to improve the learning of features with CNN models for steganalysis. In the proposed framework, we first pre-train a CNN model using training images composed of stegos with a high payload and the corresponding covers, and then transfer the learned feature representations to regularize the CNN model for a better performance in detecting stegos with a lower payload. In this manner, the auxiliary information from stegos with a high payload can be efficiently utilized to help the task of detecting stegos with a low payload. Experimental results show that the proposed framework brings an improvement over the previous CNN model for steganalysis that does not use the transfer learning scheme. Our approach also achieves a better performance than the traditional steganalysis scheme that uses the SRM feature set when detecting the WOW algorithm.
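For reference, the detection error $P_E$ of Eq. (3), which is the quantity reported in Table 1 and Table 2, can be estimated from detector outputs by sweeping a decision threshold and keeping the smallest average of the false-alarm and missed-detection rates. The sketch below is only an illustration of that formula: the function name, the score convention (higher means more stego-like) and the use of empirical rates on a finite test set are assumptions, not details taken from the paper.

```python
def detection_error(cover_scores, stego_scores):
    """Estimate P_E = min over thresholds of (P_FA + P_MD) / 2.

    Assumption: a score is a "stego-likeness" value and an image is
    declared stego when its score is >= the threshold.
    """
    # Candidate thresholds: every observed score, plus one that rejects all.
    thresholds = sorted(set(cover_scores) | set(stego_scores))
    thresholds.append(float("inf"))

    best = 1.0
    for t in thresholds:
        # False alarm: a cover image declared stego.
        p_fa = sum(s >= t for s in cover_scores) / len(cover_scores)
        # Missed detection: a stego image declared cover.
        p_md = sum(s < t for s in stego_scores) / len(stego_scores)
        best = min(best, 0.5 * (p_fa + p_md))
    return best
```

A detector whose scores do not separate covers from stegos at all yields 0.5 under this definition, which matches the 50% reported for the network that fails to converge at 0.1 bpp.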
References

[1] Jan Kodovský, Jessica Fridrich, and Vojtěch Holub, “Ensemble classifiers for steganalysis of digital media,” IEEE Transactions on Information Forensics and Security, vol. 7, pp. 432–444, 2012.
[2] Gokhan Gul and Fatih Kurugollu, “A new methodology in steganalysis: breaking highly undetectable steganograpy (HUGO),” in Information Hiding. Springer, 2011, pp. 71–84.
[3] Jessica Fridrich, Jan Kodovský, Vojtěch Holub, and Miroslav Goljan, “Steganalysis of content-adaptive steganography in spatial domain,” in Information Hiding. Springer, 2011, pp. 102–117.
[4] Jessica Fridrich and Jan Kodovský, “Rich models for steganalysis of digital images,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 868–882, 2012.
[5] Yun Q. Shi, Patchara Sutthiwan, and Licong Chen, “Textural features for steganalysis,” in Information Hiding. Springer, 2013, pp. 63–77.
[6] Vojtěch Holub and Jessica Fridrich, “Random projections of residuals for digital image steganalysis,” IEEE Transactions on Information Forensics and Security, vol. 8, no. 12, pp. 1996–2006, 2013.
[7] Weixuan Tang, Haodong Li, Weiqi Luo, and Jiwu Huang, “Adaptive steganalysis against WOW embedding algorithm,” in Proceedings of the 2nd ACM workshop on Information hiding and multimedia security, 2014, pp. 91–96.
[8] Tomas Denemark, Vahid Sedighi, Vojtěch Holub, Rémi Cogranne, and Jessica Fridrich, “Selection-channel-aware rich model for steganalysis of digital images,” in 2015 National Conference on Parallel Computing Technologies (PARCOMPTECH), 2015, pp. 48–53.
[9] Jan Kodovský and Jessica Fridrich, “Steganalysis of JPEG images using rich models,” in IS&T/SPIE Electronic Imaging, 2012, pp. 83030A–83030A.
[10] Miroslav Goljan, Jessica Fridrich, Rémi Cogranne, et al., “Rich model for steganalysis of color images,” in Parallel Computing Technologies (PARCOMPTECH), 2015 National Conference on, 2015, pp. 185–190.
[11] Yinlong Qian, Jing Dong, Wei Wang, and Tieniu Tan, “Deep learning for steganalysis via convolutional neural networks,” in IS&T/SPIE Electronic Imaging, 2015, pp. 94090J–94090J.
[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[13] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu, “3D convolutional neural networks for human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.
[14] Amr Ahmed, Kai Yu, Wei Xu, Yihong Gong, and Eric Xing, “Training hierarchical feed-forward visual recognition models using transfer learning from pseudo-tasks,” in ECCV, 2008, pp. 69–82.
[15] Hossein Mobahi, Ronan Collobert, and Jason Weston, “Deep learning from temporal coherence in video,” in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 737–744.
[16] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1717–1724.
[17] A. Karpathy, G. Toderici, S. Shetty, and T. Leung, “Large-scale video classification with convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
[18] Vinod Nair and Geoffrey E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[19] Patrick Bas, Tomáš Filler, and Tomáš Pevný, “‘Break our steganographic system’: The ins and outs of organizing BOSS,” in Information Hiding. Springer, 2011, pp. 59–70.
[20] A. Krizhevsky, “cuda-convnet,” 2012, http://code.google.com/p/cuda-convnet/.
[21] Vojtěch Holub and Jessica Fridrich, “Designing steganographic distortion using directional filters,” in The IEEE International Workshop on Information Forensics and Security (WIFS), 2012, pp. 234–239.
[22] Vojtěch Holub and Jessica Fridrich, “Digital image steganography using universal distortion,” in Proceedings of the first ACM workshop on Information hiding and multimedia security. ACM, 2013, pp. 59–68.