
The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)

RRL: Regional Rotate Layer in Convolutional Neural Networks


Zongbo Hao, Tao Zhang, Mingwang Chen, Zou Kaixu
University of Electronic Science and Technology of China
No.4, Section 2, North Jianshe Road
Chengdu, China, 610054
[email protected], [email protected], [email protected], [email protected]

Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Convolutional Neural Networks (CNNs) have performed very well in image classification and object detection in recent years, but even the most advanced models have limited rotation invariance. Known solutions include augmenting the training data or globally merging rotation-equivariant features. These methods either increase the workload of training or increase the number of model parameters. To address this problem, this paper proposes a module that can be inserted into existing networks and directly incorporates rotation invariance into the feature extraction layers of the CNN. The module has no learnable parameters and does not increase the complexity of the model. At the same time, trained only on upright data, it performs well on rotated testing sets. These advantages make it suitable for fields such as biomedicine and astronomy, where it is difficult to obtain upright samples or the target has no inherent orientation. Evaluating the module with LeNet-5, ResNet-18 and tiny-yolov3, we obtain impressive results.

Introduction

Deep learning and convolutional neural networks have made great progress in many tasks such as image classification and object detection. The inherent properties of the convolution and pooling layers alleviate the influence of local translation and distortion. However, because they cannot handle large rotations of the input image, convolutional neural networks are limited in some visual tasks, including target boundary detection (Maninis et al. 2016; Dalal and Triggs 2005), multi-directional target detection (Cheng, Zhou, and Han 2016) and image classification (Jaderberg et al. 2015; Laptev et al. 2016). In recent years, CNN-based image classification and object detection have been used in biomedical, industrial and astronomical research. In these fields, objects can appear in any orientation, such as microscopic images, objects on conveyor belts or observed objects. So the research on rotation invariance of neural networks has become more and more important.

At present, most deep networks use data augmentation to make the network recognize objects in different orientations (Ojala, Pietikäinen, and Harwood 1996; Quiroga et al. 2018; Tamura, Horiguchi, and Murakami 2019; Mash, Borghetti, and Pecarina 2016), or merge rotation-equivariant features (Gao et al. 2019; Wiersma, Eisemann, and Hildebrandt 2020). These methods either increase the workload of training or increase the number of model parameters. In this paper, by making the feature maps before and after convolution rotation equivariant, the whole neural network becomes rotation invariant. With rotation angles θ ∈ {0, 90, 180, 270}°, the feature maps are exactly the same. When the input image is rotated by an arbitrary angle, there is only a small difference between feature maps before and after rotation.

The main contributions of this paper are:

1) A Local Binary Pattern (LBP) operator based Regional Rotation Layer (RRL) is proposed. RRL can be embedded in CNNs without substantial changes to the network to achieve rotation invariance.

2) Without learning new parameters, RRL makes the feature maps before and after convolution satisfy rotation equivariance, and thus makes the entire neural network rotation invariant. With rotation angles θ ∈ {0, 90, 180, 270}°, the feature maps are exactly the same. With an arbitrary rotation angle, there is only a small difference between feature maps.

3) We evaluate RRL with LeNet-5, ResNet-18 and tiny-yolov3 and obtain impressive results.

Related Work

For deep learning based methods, the most direct way is data augmentation (Van Dyk and Meng 2001), which simply changes the size and orientation of the input images to create more training data. TI-Pooling (Laptev et al. 2016) uses rotated images as input and applies a pooling layer before outputting features to unify the network's results for different rotation angles. Dieleman proposed a deep neural network model that uses translational and rotational symmetry to classify galaxy morphology (Dieleman, Willett, and Dambre 2015). They create multiple rotated and flipped galaxy image samples, and then concatenate the feature maps to the classifier. Polar Transformer Networks (PTN) (Esteves et al. 2018) converts the input to polar coordinates. PTN is composed of a polar coordinate prediction module, a polar coordinate conversion module, and a classifier, and achieves translation invariance and equivariance to expansion/rotation groups. (Jiang and Mei 2019) also proposed a Polar Coordinate Convolutional Neural Network (PCCNN)
to convert the input image to polar coordinates to achieve rotation invariance. The overall structure of the model is the same as a traditional CNN, except that a central loss function is used to learn rotation-invariant features. In addition, (Cohen and Welling 2016) proposed Group Equivariant Convolutional Networks (GCNN) as a special case of steerable CNNs, proving that a spatial transformation of the image has a counterpart in the feature maps and the filters. GCNN is composed of group convolution kernels; these convolutions include filter rotations and merging operations over the rotations. Group convolution is limited to integer multiples of 90° rotation and flipping. Cohen et al. also proposed Steerable CNNs (Cohen and Welling 2017) and Spherical CNNs (Cohen et al. 2018) to achieve rotation equivariance. Steerable CNNs are limited to discrete groups, such as discrete rotations acting on planar images or permutations acting on point clouds. Spherical CNNs show good robustness, but because FFT and IFFT are used in the spherical convolution, some information is lost in the conversion process. Spherical convolution achieves rotation invariance for ideal 3D objects with no interference from background or other noise; if there are multiple 3D objects in a natural scene, the 3D objects must be segmented first, and then the rotation-invariant features are extracted. The Rotation Equivariant Vector Field Networks (RotEqNet) (Marcos et al. 2017) uses multiple rotated instances of a single canonical filter to perform convolution, that is, the filter is rotated at different intervals. Although the RotEqNet model is small, the increase in the number of convolution kernels brings more memory usage and longer computing time. (Dieleman, De Fauw, and Kavukcuoglu 2016) encodes cyclic symmetry in CNNs by parameter sharing to achieve rotation equivariance. They introduce four operations: slice, pool, stack and roll. The operations can be cast as layers in a neural network to build networks that are equivariant to cyclic rotations and share parameters across different orientations. But the operations change the size of the minibatch (slicing, pooling), the number of feature maps (rolling), or both (stacking). To alleviate the excessive time consumption and memory usage, (Li et al. 2018) proposed the Deep Rotation Equivariant Network. They apply the rotation transformation to filters rather than feature maps, achieving a speed-up of more than 2 times with even less memory overhead. However, all of these methods need to learn new parameters to achieve rotation equivariance.

Rotation Invariance Based on LBP Operator

Standard convolutional neural networks do not have the property of rotation invariance: trained on upright samples, their performance drops significantly when tested on rotated images. To solve this problem, we add a regional rotation layer (RRL) before the convolutional layers and the fully connected layers. The main idea is that we indirectly achieve rotation invariance by restricting the convolutional features to be rotation equivariant.

Local Binary Pattern

Local Binary Pattern (LBP) (Ojala, Pietikäinen, and Harwood 1996) is an operator that describes image texture features. Suppose the window size is 3 × 3; setting the central pixel as the threshold, we can get a binary encoding of the local texture and convert it to a decimal value. As shown in Figure 1(a), the central pixel value 6 is used as the comparison reference, and the differences between the surrounding eight pixels and the central point are calculated. If a neighbouring value is less than the central value, the corresponding location is marked as 0, otherwise it is marked as 1, as shown in Figure 1(b). Taking the upper left corner of the matrix as the starting point, each position is given an index that is a power of 2 according to the flattening direction of the matrix, as shown in Figure 1(c). A dot product is performed between the weight matrix and the binary matrix, as shown in Figure 1(d); only the positions with value 1 in the binary matrix keep their weights. Finally, the surrounding elements of the result matrix are summed to form the decimal LBP identifier (in this example 169) of the local texture. A series of LBP values is obtained by rotating the surrounding points, and the minimum of these values is selected as the LBP value of the central pixel. In this paper, the points are rotated to the minimum-LBP state, so as to achieve rotation invariance. In this case, the minimum state is shown in Figure 1(e); that is, the original window is rotated 135° clockwise, as shown in Figure 1(f).

Figure 1: Local Binary Pattern. (a) image window of size 3 × 3, where the numbers are the gray values; (b) the binary values of the surrounding pixels; (c) the index values of the surrounding pixels; (d) the dot product of (b) and (c); (e) rotating (b) by 135° clockwise gives the minimum LBP; (f) the image window after rotation.
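As a concrete illustration of this procedure, the sketch below (NumPy; our own illustrative code, not the authors' implementation) thresholds the eight neighbours against the centre, weights them with powers of two, and shifts the neighbour ring until the LBP value is minimal. The ring ordering, the shift direction and the helper names are assumptions made for the example.

import numpy as np

# Clockwise ring order of the eight neighbours in a 3x3 window
# (top-left, top, top-right, right, bottom-right, bottom, bottom-left, left).
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

def min_lbp_rotation(window):
    """Return (minimum LBP value, number of 45-degree ring shifts) for a 3x3 window."""
    center = window[1, 1]
    # 1 where the neighbour is >= the centre, 0 otherwise (Figure 1(b)).
    bits = np.array([1 if window[r, c] >= center else 0 for r, c in RING])
    best_val, best_shift = None, 0
    for shift in range(8):                               # try all ring rotations
        rolled = np.roll(bits, -shift)
        val = int((rolled * (2 ** np.arange(8))).sum())  # weights 2^0..2^7 (Figure 1(c))
        if best_val is None or val < best_val:
            best_val, best_shift = val, shift
    return best_val, best_shift

def rotate_to_min_state(window):
    """Shift the neighbour ring so the window reaches its minimum-LBP state."""
    _, shift = min_lbp_rotation(window)
    rotated = window.copy()
    ring_vals = np.array([window[r, c] for r, c in RING])
    ring_vals = np.roll(ring_vals, -shift)
    for (r, c), v in zip(RING, ring_vals):
        rotated[r, c] = v
    return rotated

# Example 3x3 window (values are illustrative, not the ones in Figure 1).
w = np.array([[9, 7, 1],
              [8, 6, 2],
              [5, 4, 3]])
print(min_lbp_rotation(w))
print(rotate_to_min_state(w))

Each ring shift corresponds to a 45° rotation of the window contents, so the canonical state is reached by a rotation that is a multiple of 45°, as in the 135° example of Figure 1.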
Regional Rotation Layer (RRL)

LBP operates within a window, while RRL operates on the feature maps. The feature maps are sampled window by window in a sliding-window fashion, and LBP is applied in each window. In this way, we can rotate the feature maps to the same state even with different input orientations.

RRL is usually added before a convolutional layer. Here we take the first convolution operation on a three-channel RGB image as an example to illustrate the workflow of RRL.
Algorithm 1: RRL in local windows
Input: RGB image sample batch {I_1, I_2, ..., I_t, ...}
Output: Feature maps rotated to the same state
1: Load image I_t, I_t ∈ R^{H×W×3}.
2: Perform LBP on each channel to get V_t, V_t ∈ R^{F×F×(O_H×O_W×3)}, where F is the kernel size and O_H (O_W) is the height (width) of the feature map. The order of the feature maps within a channel is not changed, shown as process 1 in Figure 2.
3: Reshape V_t, then concatenate the window features belonging to the same channel into a matrix I'_t, I'_t ∈ R^{(F×O_H)×(F×O_W)×3}, shown as process 2 in Figure 2.
4: Perform a convolution with stride F on I'_t and get the output feature O_t, O_t ∈ R^{O_H×O_W×k}, shown as process 3 in Figure 2.

Figure 2: RRL works on the local windows of I_t. Step 1: perform LBP on each channel to rotate the windows of the feature maps. Step 2: reshape V_t and concatenate the features into the matrix I'_t. Step 3: perform a convolution with stride F on I'_t to get the output feature O_t.
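A minimal sketch of Algorithm 1 for a single image is given below. It reuses the rotate_to_min_state helper from the previous sketch, so it assumes F = 3 and stride-1 window sampling; the plain Python loops and variable names are illustrative only, not the authors' code.

import numpy as np

def rrl_forward(x, kernel, F):
    """Illustrative RRL + convolution: x is (H, W, C), kernel is (F, F, C, K)."""
    H, W, C = x.shape
    OH, OW = H - F + 1, W - F + 1          # stride-1 sliding windows (process 1)
    tiled = np.zeros((F * OH, F * OW, C))
    for i in range(OH):
        for j in range(OW):
            for c in range(C):
                win = x[i:i+F, j:j+F, c]
                # canonical rotation of each window; assumes F == 3 here
                tiled[i*F:(i+1)*F, j*F:(j+1)*F, c] = rotate_to_min_state(win)
    # Convolution with stride F over the tiled matrix (processes 2 and 3):
    # each output position sees exactly one canonically rotated window.
    out = np.zeros((OH, OW, kernel.shape[-1]))
    for i in range(OH):
        for j in range(OW):
            patch = tiled[i*F:(i+1)*F, j*F:(j+1)*F, :]
            out[i, j, :] = np.tensordot(patch, kernel, axes=([0, 1, 2], [0, 1, 2]))
    return out

# Example: an 8x8 RGB input and four 3x3 filters.
x = np.random.rand(8, 8, 3)
k = np.random.rand(3, 3, 3, 4)
print(rrl_forward(x, k, F=3).shape)   # (6, 6, 4)

In practice this would be vectorized, but the loops make the correspondence with processes 1 to 3 in Figure 2 explicit.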
Rotation Equivariance and Invariance Derivation

Equivariance means that the effect of a transformation can still be measured in the output of an operator; the operator and the transformation are then equivariant, as shown in eq. 1:

f(T(x)) = T(f(x))    (1)

where x is the input, T(·) is the transformation and f(·) is the operator.

An operator is invariant with respect to a transformation when the effect of the transformation cannot be detected in the output of the operator, as shown in eq. 2:

f(T(x)) = f(x)    (2)

In order to achieve rotation invariance of the entire CNN, the input features should be rotated into a uniform state after the convolution layers and before the fully connected layers that perform the classification task.

The local convolution operation is equivariant. For a feature window w (of size F × F), when w is rotated by r (r means a counterclockwise rotation of 90°, and r^n means a counterclockwise rotation of n·90°), if the convolution kernel is rotated by the same angle r, the result is unchanged: f(w) = L_r[f](rw), where L_r[f] indicates that the convolution kernel is rotated by r counterclockwise. But the local convolution operation is not invariant: when the convolution kernel is unchanged, the results before and after rotating w are different, f(w) ≠ f(rw).

The global convolution operation has neither rotation equivariance nor invariance. When the feature map is rotated, not only must the convolution kernel be rotated by the same angle, but the output feature must also be rotated back by the same angle in the opposite direction, so that the result of the original convolution operation is kept consistent, as shown in eq. 3:

f(x) = r^{-1} L_r[f](rx)    (3)

From the above analysis, we know that after rotation (only for r^n rotations), the features still maintain equivariance after layer-by-layer convolution. Therefore, the entire CNN will be rotation invariant if we reversely rotate the feature maps before the fully connected layer.

First, to achieve equivariance of the global convolution, the core function R(x) of the algorithm needs to satisfy eq. 4:

f^F[R(x)] = r^{-n} f^F[R(r^n x)]    (4)

where f^F(x) is the convolution operation with stride F. In other words, when the filter does not change, the convolution result of the rotated input is equal to that of the non-rotated input after reverse-rotating the output. To satisfy eq. 4, the local convolution needs to be invariant: f[R(w)] = f[R(r^n w)]. That is:

R(w) = R(r^n w)    (5)

Here, we use the core function R(x), named the RRL module, to make the window convolution invariant and thus achieve rotation invariance of the CNN. RRLs are placed before each convolutional layer and after the last convolutional layer with stride F. In particular, for the last RRL, the global feature map x is treated as a local window w and satisfies R(x) = R(r^n x). Because the activation function, BN layer and pooling layer are rotation equivariant and do not affect the final result, they are not discussed here.
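The following toy check (reusing rotate_to_min_state from the LBP sketch above) verifies these statements numerically for a single 3 × 3 window: rotating both window and kernel leaves the convolution unchanged, rotating only the window does not, and applying the canonical LBP rotation first makes the result invariant to 90° rotations of the window. Ties in the minimum LBP value are ignored in this sketch.

import numpy as np

# Example 3x3 feature window and random 3x3 kernel (values are illustrative).
w = np.array([[9., 7., 1.],
              [8., 6., 2.],
              [5., 4., 3.]])
k = np.random.rand(3, 3)

conv = lambda win, ker: float((win * ker).sum())   # one "valid" convolution output

# Equivariance of the local convolution: rotating both the window and the
# kernel by 90 degrees leaves the result unchanged, f(w) = L_r[f](r w).
assert np.isclose(conv(w, k), conv(np.rot90(w), np.rot90(k)))

# But the local convolution alone is not invariant: f(w) != f(r w) in general.
print(conv(w, k), conv(np.rot90(w), k))

# Eq. 5 with R implemented as the LBP-based canonical rotation from the
# earlier sketch: R(w) = R(r^n w), so the subsequent convolution becomes
# invariant to 90-degree rotations of the window.
print(np.array_equal(rotate_to_min_state(w), rotate_to_min_state(np.rot90(w))))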
Integrate RRL with CNN

Each RRL R_i is embedded before each conv layer f_i. Suppose that the original feature x and the 90°-rotated feature rx are fed into R_1 respectively.

After R_1 and f_1, we have:

x → f_1^{F_1}[R_1(x)]
rx → f_1^{F_1}[R_1(rx)]
∴ f_1^{F_1}[R_1(x)] = r^{-1} f_1^{F_1}[R_1(rx)]

After R_2 and f_2, we have:

f_1^{F_1}[R_1(x)] → f_2^{F_2}[R_2[f_1^{F_1}[R_1(x)]]]
f_1^{F_1}[R_1(rx)] → f_2^{F_2}[R_2[f_1^{F_1}[R_1(rx)]]]
∴ f_2^{F_2}[R_2[f_1^{F_1}[R_1(x)]]] = r^{-1} f_2^{F_2}[R_2[f_1^{F_1}[R_1(rx)]]]
So, we have:

R_{n+1}[f_n^{F_n}[... R_2[f_1^{F_1}[R_1(x)]]]] = R_{n+1}[f_n^{F_n}[... R_2[f_1^{F_1}[R_1(rx)]]]]    (6)

The conclusion can be extended to other CNNs: when RRL is added in the right positions, rotation invariance of the model can be achieved.

Experiments

Image Classification Based on LeNet-5

Dataset and LeNet-5  CIFAR-10 is used in our experiment. The dataset was proposed by Krizhevsky in 2009. It contains 60,000 32 × 32 colour images belonging to 10 categories. There are 50,000 images in the training set (5,000 per category) and 10,000 images in the test set (1,000 per category). The test images are rotated in two ways: images rotated in the first way (θ ∈ {0, 90, 180, 270}°) are called CIFAR10-rot, and images rotated in the second way (θ ∈ [0, 360)°) are called CIFAR10-rot+, as shown in Figure 3. In order to keep the effective content area of the image fixed, the largest inscribed circle of the square image is selected: only the area inside the circle keeps the original image pixels, and the area outside the circle is filled with black, as shown in Figure 3(b).

Figure 3: Examples of CIFAR10-rot and CIFAR10-rot+. (a) CIFAR10-rot: horse, 0°; bird, 90°; ship, 180°; airplane, 270°. (b) CIFAR10-rot+: cat, 40°; truck, 165°; horse, 270°; automobile, 300°.
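A possible way to build such rotated test images is sketched below with Pillow and NumPy. The file name is a placeholder, and the exact interpolation and masking details used by the authors are not specified in the paper.

import numpy as np
from PIL import Image

def make_rot_plus(img, angle):
    """Rotate a square image by `angle` degrees and keep only the largest
    inscribed circle, filling the outside with black (CIFAR10-rot+ style)."""
    rotated = img.rotate(angle)                      # keeps size; black fill
    arr = np.array(rotated)
    h, w = arr.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    cy, cx, r = (h - 1) / 2.0, (w - 1) / 2.0, min(h, w) / 2.0
    mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
    arr[~mask] = 0                                   # outside the circle -> black
    return Image.fromarray(arr)

# Example: a CIFAR10-rot style image (90-degree multiple) and a
# CIFAR10-rot+ style image (arbitrary angle).
img = Image.open("example_cifar_image.png")          # hypothetical 32x32 RGB image
rot_90 = make_rot_plus(img, 90)
rot_arbitrary = make_rot_plus(img, 40)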
LeNet-5 (LeCun et al. 1998) is one of the earliest CNNs. It has two convolutional layers and three fully connected layers, so three RRLs are plugged in. Keeping the convolutional layers and fully connected layers unchanged as in (LeCun et al. 1998), the three RRLs are inserted in front of conv1, in front of conv2, and after conv2, respectively.
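The placement can be sketched as follows in PyTorch, with RRL left as a stub module that only marks its position. The channel sizes are the standard LeNet-5 ones, and the exact placement relative to activations and pooling is our assumption; this is a sketch of where the layers sit, not the authors' implementation.

import torch
import torch.nn as nn

class RRL(nn.Module):
    """Placeholder for the regional rotation layer: a real implementation would
    rotate each FxF window of its input to the canonical (minimum-LBP) state.
    This stub only marks where RRL sits in the network."""
    def __init__(self, window_size):
        super().__init__()
        self.window_size = window_size
    def forward(self, x):
        return x  # identity stub; see the NumPy sketch of Algorithm 1 above

class LeNet5RRL(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            RRL(5), nn.Conv2d(3, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
            RRL(5), nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
            RRL(5),   # last RRL: the whole feature map treated as one window
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(), nn.Linear(84, num_classes),
        )
    def forward(self, x):
        return self.classifier(self.features(x))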
Training Data    CIFAR-10      CIFAR-10       CIFAR-10-rot   CIFAR-10-rot+
Testing Data     CIFAR-10-rot  CIFAR-10-rot+  CIFAR-10-rot   CIFAR-10-rot+
LeNet-5          33.2          18.2           38.7           25.4
LeNet-5+RRL      71.3          52.8           70.9           49.1

Table 1: Comparison of accuracy (%) of LeNet-5 with and without RRL.

Experimental Result and Analysis  Table 1 shows the test accuracy of LeNet-5 on CIFAR-10 with and without RRL. The second column is trained on the original training set (without rotated images) and tested on CIFAR10-rot. The third column is trained on the original training set (without rotation) and tested on CIFAR10-rot+. The fourth and fifth columns are trained and tested on CIFAR10-rot and CIFAR10-rot+ respectively.

From Table 1 we can find that:

1) Keeping the original CNN structure and adding only the RRLs greatly improves the recognition accuracy on rotated images;

2) Trained with the augmented data, the accuracy of the improved network decreases. This implies that LeNet-5 cannot provide enough convolution kernels to learn the same patterns in different orientations, so augmentation reduces its generalization;

3) Without RRLs, the recognition accuracy can be improved somewhat by data augmentation, but the training cost increases and the problem is not essentially solved.

In order to analyze the role of RRL more intuitively, the feature maps are visualized in Figure 4. In Figure 4, the left columns are the input images. The middle columns are the output features of the last layer of the original LeNet-5 network: except for the upright image, which is correctly classified, the other three cases are misidentified. The right columns are the output features of the last layer of the LeNet-5+RRL network, whose features do not change with the rotation angle, and all predictions are correct. This shows that with the RRLs, the same features are extracted from images at different angles, and invariant coding is realized.

Figure 4: Feature distributions of LeNet-5 with and without RRLs, for inputs with different rotation angles.

The visualization results of Grad-CAM (Selvaraju et al. 2017) are shown in Figure 5. The input images are rotated by θ ∈ {0, 90, 180, 270}°. Conv1-grad, Conv2-grad and RRL3-grad are the heatmaps obtained by gradient calculation of the
feature maps after the first convolutional layer, the second convolutional layer and the last regional rotation layer, respectively. It can be seen that before the last regional rotation, the features are direction dependent, and the focus of the model still changes with the rotation angle of the input; at this stage, the network is only rotation equivariant. After RRL3, the neural network completes the invariant coding of rotation, and the features shown by RRL3-grad hardly change with the rotation.

Figure 5: Visualization of the three RRLs' output heatmaps (Conv1-grad, Conv2-grad, RRL3-grad) in LeNet-5 + RRL, for example images (horse, frog, bird, truck) rotated by 0°, 90°, 180° and 270°.

Image Classification Based on ResNet-18

ResNet-18 (He et al. 2016) was proposed in 2016, and it consists of 17 convolution layers and a fully connected layer. The core component of ResNet is the residual module, which consists of two consecutive convolution layers and a skip connection.

Table 2 shows the comparison of the accuracy of ResNet-18 with and without RRL. It can be seen from Table 2 that the effect of data augmentation is not significant with RRL: no matter whether the input sample is rotated or not, as long as the sample itself remains unchanged, the local features remain unchanged after the RRLs.

Training Data    CIFAR-10      CIFAR-10       CIFAR-10-rot   CIFAR-10-rot+
Testing Data     CIFAR-10-rot  CIFAR-10-rot+  CIFAR-10-rot   CIFAR-10-rot+
ResNet-18        46.5          38.7           73.6           58.7
ResNet-18+RRL    75.0          65.3           77.9           63.1

Table 2: Comparison of accuracy (%) of ResNet-18 with and without RRL.

A comparison with other methods on CIFAR-10 is shown in Table 3. We can see that ResNet-18+RRL obtains high accuracy on both data sets. This shows that RRL can help the original CNN to improve its encoding ability without increasing the parameters and model complexity, and obtain stronger generalization ability.

Training Data                          CIFAR-10
Testing Data                           CIFAR-10-rot   CIFAR-10-rot+
LeNet-5                                33.2           18.2
LeNet-5+RRL                            71.3           52.8
ResNet-18                              46.5           38.7
ResNet-18+RRL                          75.0           65.3
CyResNet56-P (Cowen et al. 2015)       -              61.3
PR RF 1 (Follmann and Bottger 2018)    -              44.1
ORN (Zhou et al. 2017)                 60.9           40.7

Table 3: Comparison of accuracy (%) with other methods on CIFAR-10.

From Table 3, we can also find that ResNet + RRL improves performance less than LeNet + RRL does (28.5% vs 38.1% on CIFAR10-rot, 26.6% vs 34.1% on CIFAR10-rot+). The LBP operator tends to rotate the brighter texture of the image to the left part of the window. We can guess that, with the restriction of RRL, the obtained features tend to be similar, which reduces the diversity of features. After the training data are augmented, the gap between the two also becomes significantly smaller (improvements of 4.3% on CIFAR10-rot and 4.4% on CIFAR10-rot+ respectively). For rotated images, a traditional convolutional network will specially customize a filter for each orientation of the same texture. However, due to the constraint of RRL, even if the input data are more diverse, the number of feature types will not increase significantly when the content changes little. Therefore, when the model is deepened, traditional neural networks improve more than networks with RRL.

Even with data augmentation, the rotation invariance of a conventional convolutional network is still not as good as one plugged with RRL. However, it can be predicted that with the deepening of the network, the rising trend of accuracy with RRL will slow down. Therefore, the algorithm is suitable for shallow or medium neural networks, or for limited training samples and limited computing resources.

Figure 6 shows the classification results of the same image at different rotation angles with and without RRLs. The blue sections mark the angle ranges in which the image is correctly classified as "frog". The image is rotated every 12°, thus there are 30 rotation angle sections. Figure 6(a) shows the classification output of the model without RRLs. When the rotation angle is between θ ∈ [−36, 24]° or θ ∈ [36, 60]°, the image is classified correctly; in other states, different prediction results are obtained at different angles. Figure 6(b) shows the classification output of the improved model. The blue area is larger than that of (a), indicating that the rotation layers make the model less sensitive to the input rotation angle.
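An angle sweep like the one behind Figure 6 can be reproduced with a sketch such as the following, which reuses make_rot_plus from the CIFAR example above and assumes a trained PyTorch classifier; the names and the missing input normalization are illustrative assumptions.

import numpy as np
import torch

def accuracy_by_angle(model, img, true_label, step=12):
    """Sweep rotation angles (as in Figure 6) and record which angles are
    classified correctly. `img` is a PIL image."""
    model.eval()
    correct_angles = []
    for angle in range(0, 360, step):                 # 30 sections for step=12
        rotated = make_rot_plus(img, angle)
        x = torch.from_numpy(np.array(rotated)).permute(2, 0, 1).float() / 255.0
        with torch.no_grad():
            pred = model(x.unsqueeze(0)).argmax(dim=1).item()
        correct_angles.append((angle, pred == true_label))
    return correct_angles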
Figure 6: Recognition results with arbitrary rotation angles. (a) ResNet-18, (b) ResNet-18 + RRL.

Plankton Recognition Based on ResNet-44

The plankton dataset (Cowen et al. 2015) consists of 30,336 gray images of different sizes, which are unevenly divided into 121 categories corresponding to different kinds of plankton. There are 27,299 images in the training set and 3,037 images in the testing set. In order to unify the sample number of each category, the data are augmented for the categories with a small number of images. Finally, each category in the training set contains 2,000 images and each category in the testing set contains 100 images. Each sample is resized to 50 × 50, then padded with a white background to 64 × 64, and finally its maximum inscribed circle is taken so that the content is in the centre of the image. Figure 7 shows the results of this data processing and the corresponding categories.

Figure 7: Sample images in the Plankton dataset (appendicularian_slight_curve, trichodesmium_multiple, protist_other, echinoderm_larva_seastar_bipinnaria).
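A sketch of this preprocessing, under the assumption that the area outside the inscribed circle is filled with the white background, is given below (Pillow/NumPy; not the authors' code).

import numpy as np
from PIL import Image

def preprocess_plankton(img):
    """Resize to 50x50, pad with white to 64x64, then keep the largest
    inscribed circle (background assumed white), as described above."""
    img = img.convert("L").resize((50, 50))
    canvas = Image.new("L", (64, 64), color=255)      # white background
    canvas.paste(img, ((64 - 50) // 2, (64 - 50) // 2))
    arr = np.array(canvas)
    yy, xx = np.ogrid[:64, :64]
    mask = (yy - 31.5) ** 2 + (xx - 31.5) ** 2 <= 32 ** 2
    arr[~mask] = 255                                   # outside the circle -> white
    return Image.fromarray(arr)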


Each image contains a single organism, which may be oriented in any direction in three-dimensional space, since the small influence of gravity can be ignored. The ocean is also full of debris particles, so there will inevitably be some noise in the images. The existence of unknown categories requires the model to deal with unrecognized objects, so it is necessary to classify plankton with large shape differences into the same category. The above factors make the classification more difficult.

The standard ResNet-44 consists of 43 convolutional layers and a fully connected layer. A regional rotation layer is added in front of every convolutional layer.

After 100,000 epochs of training on the training set, the final plankton classification model is obtained. In order to compare the effect of the regional rotation layer, the original ResNet-44 and the ResNet-44 + RRL model are trained and tested respectively.

Figure 8(a) shows the loss curves of the two algorithms on the training set. The orange curve is the ResNet-44 + RRL model, and the blue curve is the original ResNet-44 model. It can be seen that the orange curve is always below the blue curve, that is, the regional rotation layer makes the model error smaller. Figure 8(b) shows the accuracy curves of the two algorithms on the testing set. Obviously, after adding the regional rotation layer, the error on the training set is reduced, there is no over-fitting, and the performance is improved. Table 4 shows the results of applying the two models to the real test set without published labels; a lower score means the model performs better. The performance on this dataset shows that, without increasing the model parameters, a convolutional neural network can achieve stronger generalization by adding a regional rotation layer, which gives the network the ability to capture global and local rotation.

Figure 8: Comparison of loss and accuracy with and without RRL on ResNet-44. The blue curve is the original ResNet-44 model, and the orange curve is ResNet-44+RRL.

Model           Multi-class Score
ResNet-44       3.67862
ResNet-44+RRL   2.18777

Table 4: Comparison of loss scores on the real testing set of the Plankton dataset.

Object Detection Based on MobileNet-tiny-yolov3

MobileNet-tiny-yolov3 is selected as the basic network. Compared with darknet53, which uses residual blocks as its main structure, MobileNet achieves a better balance between computation, storage space and accuracy. Using the pruned tiny-yolov3, the model is smaller and has more advantages when computing resources are limited, and the fast detection speed also makes tiny-yolov3 more cost-effective and easier to apply in practice.

Rotation Transformation of Coordinate  In the object detection task, the coordinates of the upper left corner and the lower right corner of the target bounding boxes are usually provided as labels, so these locations change with the rotation of the target, and the corresponding coordinate labels need to be recalculated. Here we only consider four rotation angles, θ ∈ {0, 90, 180, 270}°, as shown in Figure 9. There is a rectangular box surrounding the target object, defined by the upper left coordinate (x1, y1) and the lower right coordinate (x2, y2). When the image
rotates 90° counterclockwise, the point (x1, y1) is transformed into (y1, w − x1) and the point (x2, y2) is transformed into (y2, w − x2); these now represent the points at the lower left corner and upper right corner of the rectangular box respectively. The final coordinate label becomes a red hollow point (y1, w − x1) and a red solid point (y2, w − x2). Similarly, (b) and (d) give the position labels when rotating 180° and 270° counterclockwise.

Figure 9: Coordinate transformation of the bounding box corners. (a) 0°: (x1, y1), (x2, y2); (b) 180°: (w − x2, h − y2), (w − x1, h − y1); (c) 90°: (y1, w − x2), (y2, w − x1); (d) 270°: (h − y2, x1), (h − y1, x2).
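These transformations can be written as a small helper that returns the new upper-left and lower-right corners as labelled in Figure 9. Coordinates follow the usual image convention (x to the right, y downward), and the function name is ours.

def rotate_box(x1, y1, x2, y2, w, h, angle):
    """Return the new (upper-left, lower-right) corners of a bounding box when
    the image of width w and height h is rotated counterclockwise by `angle`."""
    if angle == 0:
        return (x1, y1), (x2, y2)
    if angle == 90:
        return (y1, w - x2), (y2, w - x1)
    if angle == 180:
        return (w - x2, h - y2), (w - x1, h - y1)
    if angle == 270:
        return (h - y2, x1), (h - y1, x2)
    raise ValueError("only multiples of 90 degrees are handled")

# Example: a box with corners (30, 40) and (40, 60) in a 100x80 image, rotated 90 degrees.
print(rotate_box(30, 40, 40, 60, w=100, h=80, angle=90))   # -> ((40, 60), (60, 70))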
Dataset  The Pascal VOC dataset contains 20 categories. It has been widely used in object detection, semantic segmentation and classification tasks, and serves as a common test benchmark. VOC 2007 and VOC 2012 are used in this experiment. In total, 16,551 images and 40,058 objects are used for training, and 4,952 images and 12,032 objects are used for testing.

Model                          mAP
MobileNet-tiny-yolov3          43.76%
MobileNet-tiny-yolov3+RRL      61.78%

Table 5: mAP of MobileNet-tiny-yolov3 on the Pascal VOC dataset, trained on upright images and tested on rotated images.

Experimental Results and Analysis  The detection results of MobileNet-tiny-yolov3 with and without RRL are shown in Figure 10. Both models are trained with upright images. The first column shows the ground truth after rotating and scaling the original image; the second and third columns are the testing results of the basic model and of the model with RRLs. For the top two rows, the image contains only one label, but the basic model outputs two prediction boxes, one of which does not belong to the correct category, as shown in the green box. It can be seen that the basic model has some recognition ability for rotated images, but is misled into other wrong categories; the output of the improved model is quite similar to the real label. The bottom two rows contain multiple labels, and they all overlap to some extent. The output of the basic model contains only one box covering all objects, which shows that the model can only roughly locate them and can no longer separate them finely. The improved model can detect each object, with some localization errors. With the IoU threshold set to 0.5, trained with upright pictures and tested with rotated pictures, the mAP of the two models is shown in Table 5.

Figure 10: Examples of detection results before and after the model improvement (ground truth, baseline, and baseline+RRL columns for images rotated by 90° and 270°).

Conclusion

This paper proposes a regional rotation layer (RRL) to help CNNs learn rotation-invariant features. With data augmentation, a CNN needs to train more filters for each variation of the samples, which increases the number of parameters, so it is important to balance the network size and the data size. In this paper, the LBP operator is used to encode each local region so that it has the same local features before and after rotation. Thus, when the input rotates, the local features remain the same. RRL is then integrated with LeNet-5, ResNet-18 and tiny-yolov3, which verifies the effectiveness of the method. The experimental results are analysed in detail, and the applicable scenarios and shortcomings of the method are presented.

Acknowledgments

This project is supported by the fund of Science and Technology of Sichuan Province (No.2021YFG0330). Thanks to Shuyu Zhang and Yiting Wang.

References

Cheng, G.; Zhou, P.; and Han, J. 2016. Rifd-cnn: Rotation-invariant and fisher discriminative convolutional neural networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2884–2893. LAS VEGAS: IEEE.
Cohen, T.; and Welling, M. 2016. Group equivariant convolutional networks. In International conference on machine learning, 2990–2999. PMLR.

Cohen, T. S.; Geiger, M.; Köhler, J.; and Welling, M. 2018. Spherical CNNs. In International Conference on Learning Representations.

Cohen, T. S.; and Welling, M. 2017. Steerable cnns. In International Conference on Learning Representations.

Cowen, R. K.; Sponaugle, S.; Robinson, K.; and Luo, J. 2015. Planktonset 1.0: Plankton imagery data collected from fg walton smith in straits of florida from 2014-06-03 to 2014-06-06 and used in the 2015 national data science bowl (ncei accession 0127422). NOAA National Centers for Environmental Information.

Dalal, N.; and Triggs, B. 2005. Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05), volume 1, 886–893. Boston, MA: IEEE.

Dieleman, S.; De Fauw, J.; and Kavukcuoglu, K. 2016. Exploiting cyclic symmetry in convolutional neural networks. In International conference on machine learning, 1889–1898. PMLR.

Dieleman, S.; Willett, K. W.; and Dambre, J. 2015. Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly notices of the royal astronomical society, 450(2): 1441–1459.

Esteves, C.; Allen-Blanchette, C.; Zhou, X.; and Daniilidis, K. 2018. Polar Transformer Networks. In International Conference on Learning Representations.

Follmann, P.; and Bottger, T. 2018. A rotationally-invariant convolution module by feature map back-rotation. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 784–792. IEEE.

Gao, L.; Li, H.; Lu, Z.; and Lin, G. 2019. Rotation-equivariant convolutional neural network ensembles in image processing. In Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers, 551–557. London: Association for Computing Machinery, New York.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.

Jaderberg, M.; Simonyan, K.; Zisserman, A.; et al. 2015. Spatial transformer networks. Advances in neural information processing systems, 28: 2017–2025.

Jiang, R.; and Mei, S. 2019. Polar coordinate convolutional neural network: from rotation-invariance to translation-invariance. In 2019 IEEE International Conference on Image Processing (ICIP), 355–359. IEEE.

Laptev, D.; Savinov, N.; Buhmann, J. M.; and Pollefeys, M. 2016. Ti-pooling: transformation-invariant pooling for feature learning in convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 289–297. LAS VEGAS: IEEE.

LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278–2324.

Li, J.; Yang, Z.; Liu, H.; and Cai, D. 2018. Deep rotation equivariant network. Neurocomputing, 290: 26–33.

Maninis, K.-K.; Pont-Tuset, J.; Arbeláez, P.; and Van Gool, L. 2016. Convolutional oriented boundaries. In European conference on computer vision, 580–596. Amsterdam, The Netherlands: Springer.

Marcos, D.; Volpi, M.; Komodakis, N.; and Tuia, D. 2017. Rotation equivariant vector field networks. In Proceedings of the IEEE International Conference on Computer Vision, 5048–5057.

Mash, R.; Borghetti, B.; and Pecarina, J. 2016. Improved aircraft recognition for aerial refueling through data augmentation in convolutional neural networks. In International symposium on visual computing, 113–122. Las Vegas, Nevada, USA: Springer.

Ojala, T.; Pietikäinen, M.; and Harwood, D. 1996. A comparative study of texture measures with classification based on featured distributions. Pattern recognition, 29(1): 51–59.

Quiroga, F.; Ronchetti, F.; Lanzarini, L.; and Bariviera, A. F. 2018. Revisiting data augmentation for rotational invariance in convolutional neural networks. In International Conference on Modelling and Simulation in Management Sciences, 127–141. Girona, Spain: Springer.

Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, 618–626.

Tamura, M.; Horiguchi, S.; and Murakami, T. 2019. Omnidirectional pedestrian detection by rotation invariant training. In 2019 IEEE winter conference on applications of computer vision (WACV), 1989–1998. Hawaii: IEEE.

Van Dyk, D. A.; and Meng, X.-L. 2001. The art of data augmentation. Journal of Computational and Graphical Statistics, 10(1): 1–50.

Wiersma, R.; Eisemann, E.; and Hildebrandt, K. 2020. Cnns on surfaces using rotation-equivariant features. ACM Transactions on Graphics (TOG), 39(4): 92–1.

Zhou, Y.; Ye, Q.; Qiu, Q.; and Jiao, J. 2017. Oriented response networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 519–528.
