to convert the input image to polar coordinates to achieve rotation invariance. The overall structure of the model is the same as a traditional CNN, except that the central loss function is used to learn rotation invariant features. In addition, (Cohen and Welling 2016) proposed Group Equivariant Convolutional Networks (GCNN) as a special case of controllable CNNs, which proved that spatial transformations of the image have corresponding transformations in the feature maps and the filters. GCNN is composed of group convolution kernels; these convolutions include filter rotation and merging operations over the rotations. Group convolution is limited to integer multiples of 90° rotation and to flipping. Cohen et al. also proposed Steerable CNNs (Cohen and Welling 2017) and Spherical CNNs (Cohen et al. 2018) to achieve rotation equivariance. Steerable CNNs are limited to discrete groups,
such as discrete rotations acting on planar images or permutations acting on point clouds. Spherical CNNs show good robustness. However, because FFT and IFFT are used in the spherical convolution, some information is lost in the conversion process. Spherical convolution achieves rotation invariance for ideal 3D objects, without interference from background or other noise; if there are multiple 3D objects in a natural scene, the objects must be segmented first before the rotation invariant features are extracted. The Rotation Equivariant Vector Field Networks (RotEqNet) (Marcos et al. 2017) use multiple rotated instances of a single canonical filter to perform convolution, that is, the filter is rotated at different intervals. Although the RotEqNet model is small, the increase in the number of convolution kernels brings more memory usage and longer computing time. (Dieleman, De Fauw, and Kavukcuoglu 2016) encode cyclic symmetry in CNNs by parameter sharing to achieve rotation equivariance. They introduce four operations: slice, pool, stack and roll. The operations can be cast as layers in a neural network, building networks that are equivariant to cyclic rotations and share parameters across different orientations. But the operations change the size of the minibatch (slicing, pooling), the number of feature maps (rolling), or both (stacking). To alleviate the excessive time consumption and memory usage, (Li et al. 2018) proposed the Deep Rotation Equivariant Network. They apply rotation transformations to filters rather than feature maps, achieving a speed-up of more than 2 times with even less memory overhead. However, all of these methods need to learn new parameters to achieve rotation equivariance.

Rotation Invariance Based on LBP Operator
Standard convolutional neural networks do not have the property of rotation invariance: trained on upright samples, their performance drops significantly when tested on rotated images. To solve this problem, we add a regional rotation layer (RRL) before the convolutional layers and the fully connected layers. The main idea is that we indirectly achieve rotation invariance by restricting the convolutional features to be rotation equivariant.

Local Binary Pattern
Local Binary Pattern (LBP) (Ojala, Pietikäinen, and Harwood 1996) is an operator that describes image texture features. Suppose the window size is 3 × 3; taking the central pixel as the threshold, we can get a binary encoding of the local texture and convert it to a decimal value. As shown in Figure 1(a), the central pixel value 6 is used as the comparison reference, and each of the eight surrounding pixels is compared with it. If the neighbouring value is less than the central value, the corresponding location is marked as 0, otherwise it is marked as 1, as shown in Figure 1(b). Taking the upper left corner of the matrix as the starting point, each position is given an index that is a power of 2, following the flattening direction of the matrix, as shown in Figure 1(c). A dot product is performed between this weight matrix and the binary matrix, as shown in Figure 1(d): only the positions with value 1 in the binary matrix keep their weights. Finally, the surrounding elements of the result matrix are summed to form the decimal LBP identifier of the local texture (169 in this example). A series of LBP values is obtained by rotating the surrounding points, and the minimum of these values is selected as the LBP value of the central pixel. In this paper, the points are rotated to the minimum-LBP state, so as to achieve rotation invariance with respect to angle. In this example, the minimum state is shown in Figure 1(e); that is, the original feature is rotated 135° clockwise, as shown in Figure 1(f).

Figure 1: Local Binary Pattern. (a) image of size 3 × 3; the numbers are the gray values; (b) the binary values of the surrounding pixels; (c) the index values of the surrounding pixels; (d) the dot product result of (b) and (c); (e) rotating (b) by 135° clockwise gives the minimum LBP; (f) the image window after rotation.

Regional Rotation Layer (LBP)
LBP operates on a single window, while RRL operates on the feature maps: the feature maps are sampled window by window in a sliding-window fashion, and LBP is applied in each window. In this way we can rotate the feature maps to the same state even when the input orientations differ.
RRL is usually added before a convolutional layer. Here we take the first convolution operation on a three-channel RGB image as an example to illustrate the workflow of RRL.
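As a concrete illustration of this window-level operation (thresholding the 3 × 3 neighbourhood, computing the LBP code, and rotating the window to its minimum-LBP state), a minimal sketch is given below. It is not the authors' code; the starting corner and direction of the neighbour ring are assumptions, since the paper only states that the weights follow the flattening direction of the matrix.

import numpy as np

# Ring of the 8 neighbours of a 3x3 window, listed clockwise from the top-left
# corner; the starting corner and direction are assumptions.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
WEIGHTS = 2 ** np.arange(8)                      # index values 1, 2, 4, ..., 128

def min_rotation_lbp(window):
    """Return (minimum LBP code, ring shift) of a 3x3 window."""
    center = window[1, 1]
    bits = np.array([window[r, c] >= center for r, c in RING], dtype=np.uint8)
    codes = [int(np.dot(np.roll(bits, -k), WEIGHTS)) for k in range(8)]
    k_min = int(np.argmin(codes))
    return codes[k_min], k_min

def rotate_window_to_min(window):
    """Rotate the ring of neighbours so that the window reaches its minimum-LBP
    state (each ring step corresponds to a 45 degree rotation of the ring)."""
    _, k = min_rotation_lbp(window)
    ring_vals = np.roll([window[r, c] for r, c in RING], -k)
    out = window.copy()
    for (r, c), v in zip(RING, ring_vals):
        out[r, c] = v
    return out

Because a 90° rotation of the 3 × 3 window shifts the ring by two positions while fixing the center, all four 90° orientations of the same window are mapped to the same canonical state whenever the minimum code is unique, which is the property used in the following derivation.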
Algorithm 1: RRL in local windows
Input: RGB image sample batch {I_1, I_2, ..., I_t, ...}
Output: Feature maps rotated to the same state
1: Load image I_t, I_t ∈ R^{H×W×3}.
2: Perform LBP on each channel to get V_t, V_t ∈ R^{F×F×(O_H×O_W×3)}, where F is the kernel size and O_H (O_W) is the height (width) of the feature map. The sequence of feature maps in each channel is not changed, shown as process 1 in Figure 2.
3: Reshape V_t, then concatenate the window features belonging to the same channel into a matrix I_t', I_t' ∈ R^{(F·O_H)×(F·O_W)×3}, shown as process 2 in Figure 2.
4: Perform a convolution operation with step size F on I_t' and get the output feature O_t, O_t ∈ R^{O_H×O_W×k}, shown as process 3 in Figure 2.

Figure 2: RRL works in a local window I_t. Step 1: perform LBP on each channel to rotate the feature maps. Step 2: reshape V_t and concatenate the features into the matrix I_t'. Step 3: perform a convolution operation with step size F on I_t' to get the output feature O_t.

Rotation Equivariance and Invariance Derivation
Equivariance means that, when the transformation can be measured in the output of an operator, the operator and the transformation are equivariant, as shown in eq. 1:

f(T(x)) = T(f(x))    (1)

where x is the input, T(·) is the transformation, and f(·) is the operator.
An operator is invariant with respect to a transformation when the influence of the transformation cannot be detected in the output of the operator, as shown in eq. 2:

f(T(x)) = f(x)    (2)

In order to achieve rotation invariance of the entire CNN, it is expected that the input features can be rotated uniformly after the convolution layers and before the fully connected layers that perform the classification task.
The local convolution operation is equivariant. In a feature window w (F × F), when w is rotated by r (r denotes a counterclockwise rotation of 90°, and r^n a counterclockwise rotation of n·90°), if the convolution kernel is rotated by the same angle r, the result is unchanged: f(w) = L_r[f](rw), where L_r[f] indicates that the convolution kernel is rotated counterclockwise by r. But the local convolution operation is not invariant: when the convolution kernel is unchanged, the results before and after rotating w are different, f(w) ≠ f(rw).
The global convolution operation has neither rotation equivariance nor invariance. When the feature map is rotated, not only does the convolution kernel need to be rotated by the same angle, but the output features must also be rotated by the same angle in the opposite direction, so that the result of the original convolution operation is kept consistent, as shown in eq. 3:

f(x) = r^{-1} L_r[f](rx)    (3)

From the above analysis, we know that after rotation (only for r^n rotations), the features remain equivariant after layer-by-layer convolution. Therefore, the entire CNN will be rotation invariant if we reversely rotate the feature maps before the fully connected layer.
First, to achieve equivariance of the global convolution, the core function R(x) of the algorithm needs to satisfy eq. 4:

f^F[R(x)] = r^{-n} f^F[R(r^n x)]    (4)

where f^F(x) is the convolution operation with step size F. In other words, when the filter does not change, the convolution result of the rotated input is equal to that of the non-rotated input after reversely rotating the output. To satisfy eq. 4, the local convolution needs to be invariant: f[R(w)] = f[R(r^n w)]. That is:

R(w) = R(r^n w)    (5)

Here, we use the core function R(x), which is named the RRL module, to make the window convolution invariant and thus achieve rotation invariance of the CNN. The RRLs are placed before each convolutional layer and after the last convolutional layer, with step size F. In particular, for the last RRL, the global feature map x is treated as a local window w, and it satisfies R(x) = R(r^n x). Because the activation function, BN layer and pooling layer are rotation equivariant and do not affect the final result, they are not discussed here.
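The equivariance in eqs. 1 and 3, and the lack of invariance of plain convolution, can be checked numerically. The following snippet is only a sanity check of these properties with SciPy, not part of the method:

import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))      # feature map
k = rng.standard_normal((3, 3))      # convolution kernel
r = lambda a: np.rot90(a, 1)         # r: counterclockwise 90° rotation
r_inv = lambda a: np.rot90(a, -1)    # r^{-1}

# eq. 3: reverse-rotating the output of the rotated input (convolved with the
# rotated kernel) recovers the original convolution result.
assert np.allclose(correlate2d(x, k, mode="valid"),
                   r_inv(correlate2d(r(x), r(k), mode="valid")))

# Local convolution on a single F x F window: equivariant, but not invariant.
w = x[:3, :3]
assert np.isclose((w * k).sum(), (r(w) * r(k)).sum())     # f(w) = L_r[f](rw)
assert not np.isclose((w * k).sum(), (r(w) * k).sum())    # f(w) != f(rw)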
Integrate RRL with CNN
Each RRL R_i is embedded before the corresponding conv layer f_i. Suppose that the original feature x and the 90°-rotated feature rx are fed into R_1 respectively.
After R_1 and f_1, we have:

x → f_1^{F_1}[R_1(x)]
rx → f_1^{F_1}[R_1(rx)]
∴ f_1^{F_1}[R_1(x)] = r^{-1} f_1^{F_1}[R_1(rx)]

After R_2 and f_2, we have:

f_1^{F_1}[R_1(x)] → f_2^{F_2}[R_2[f_1^{F_1}[R_1(x)]]]
f_1^{F_1}[R_1(rx)] → f_2^{F_2}[R_2[f_1^{F_1}[R_1(rx)]]]
∴ f_2^{F_2}[R_2[f_1^{F_1}[R_1(x)]]] = r^{-1} f_2^{F_2}[R_2[f_1^{F_1}[R_1(rx)]]]

So, we have:

R_{n+1}[f_n^{F_n}[... R_2[f_1^{F_1}[R_1(x)]]]] = R_{n+1}[f_n^{F_n}[... R_2[f_1^{F_1}[R_1(rx)]]]]    (6)

Figure 3: Examples of CIFAR10-rot and CIFAR10-rot+. (a) CIFAR10-rot: horse, 0°; bird, 90°; ship, 180°; airplane, 270°. (b) CIFAR10-rot+: cat, 40°; truck, 165°; horse, 270°; automobile, 300°.

Training Data | CIFAR-10 | CIFAR-10 | CIFAR10-rot | CIFAR10-rot+
Testing Data | CIFAR10-rot | CIFAR10-rot+ | CIFAR10-rot | CIFAR10-rot+
LeNet-5 | 33.2 | 18.2 | 38.7 | 25.4
LeNet-5+RRL | 71.3 | 52.8 | 70.9 | 49.1

Table 1: Comparison of accuracy (%) of LeNet-5 with and without RRL.
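A minimal PyTorch sketch of the RRL of Algorithm 1 and of the stride-F convolution that follows it is given below. It is an illustration only: the per-window canonicalization is left as an assumed function (for example, a batched version of rotate_window_to_min above), and an identity stand-in is used here.

import torch
import torch.nn as nn

class RRL(nn.Module):
    """Sketch of the Regional Rotation Layer (Algorithm 1). `rotate_to_min`
    is an assumed per-window canonicalization, not the authors' code."""
    def __init__(self, window_size, rotate_to_min):
        super().__init__()
        self.f = window_size              # F: kernel size of the following conv
        self.rotate_to_min = rotate_to_min

    def forward(self, x):                 # x: (B, C, H, W)
        f = self.f
        # Step 1: extract all F x F sliding windows and rotate each one to its
        # canonical (minimum-LBP) state -> (B, C, O_H, O_W, F, F).
        wins = x.unfold(2, f, 1).unfold(3, f, 1)
        wins = self.rotate_to_min(wins)
        # Step 2: tile the rotated windows into a (F*O_H) x (F*O_W) map.
        b, c, oh, ow = wins.shape[:4]
        return wins.permute(0, 1, 2, 4, 3, 5).reshape(b, c, oh * f, ow * f)

# Step 3: the convolution that follows an RRL uses stride F, so each window
# contributes exactly one output value, as in Algorithm 1.
rrl = RRL(window_size=5, rotate_to_min=lambda w: w)   # identity stand-in for illustration
conv = nn.Conv2d(3, 6, kernel_size=5, stride=5)
x = torch.randn(1, 3, 32, 32)
y = conv(rrl(x))                                      # shape (1, 6, 28, 28)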
Experiments
Image Classification Based on LeNet-5
Dataset and LeNet-5 CIFAR-10 is used in our experiment. The dataset was proposed by Krizhevsky in 2009. It contains 60,000 32 × 32 colour images belonging to 10 categories. There are 50,000 images in the training set (5,000 in each category) and 10,000 images in the test set (1,000 in each category). The images rotated in the first way (θ ∈ {0, 90, 180, 270}°) are called CIFAR10-rot, and those rotated in the second way (θ ∈ [0, 360)°) are called CIFAR10-rot+, as shown in Figure 3. In order to ensure that the effective content area of the image is fixed, the largest inscribed circle of the square image is selected: only the inner area of the circle retains the original image pixels, and the outer area of the circle is filled with black, as shown in Figure 3(b).
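A sketch of how such rotated and circle-masked samples could be generated is given below; rotate_and_mask is an illustrative helper and the interpolation settings are assumptions, not the authors' preprocessing code.

import numpy as np
from scipy.ndimage import rotate

def rotate_and_mask(img, angle):
    """Rotate an H x W x 3 image by `angle` degrees and keep only the largest
    inscribed circle; pixels outside the circle are set to black."""
    out = rotate(img, angle, reshape=False, order=1, mode="constant", cval=0)
    h, w = out.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    cy, cx, radius = (h - 1) / 2.0, (w - 1) / 2.0, min(h, w) / 2.0
    out[(yy - cy) ** 2 + (xx - cx) ** 2 > radius ** 2] = 0
    return out

rng = np.random.default_rng(0)
# CIFAR10-rot: angles drawn from {0, 90, 180, 270}; CIFAR10-rot+: angles in [0, 360).
angle_rot = rng.choice([0, 90, 180, 270])
angle_rot_plus = rng.uniform(0, 360)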
LeNet-5 (LeCun et al. 1998) is one of the earliest CNNs. It has two convolutional layers and three fully connected layers, so three RRLs are plugged in. Keeping the convolutional layers and fully connected layers unchanged as in (LeCun et al. 1998), the three RRLs are inserted in front of conv1, in front of conv2, and after conv2, respectively.

Experimental Result and Analysis Table 1 shows the test accuracy of LeNet-5 on CIFAR-10 with and without RRL. The second column is trained on the original training set (without rotated images) and tested on CIFAR10-rot. The third column is trained on the original training set (without rotation) and tested on the CIFAR10-rot+ data set. The fourth and fifth columns are trained and tested on the CIFAR10-rot and CIFAR10-rot+ datasets, respectively.
From Table 1 we can find that:
1) Keeping the original CNN structure and adding only the RRLs greatly improves the recognition accuracy on rotated images;
2) Trained with the augmented data, the accuracy of the improved network decreases. It implies that LeNet-5 cannot provide more convolution kernels to learn the same patterns in different directions, so augmentation reduces its generalization;
3) Without RRLs, the recognition accuracy can be improved somewhat by data augmentation, but the training cost increases and the problem is not solved essentially.

Figure 4: Feature distributions of LeNet-5 with and without RRLs, for inputs with different rotation angles.

In order to analyze the role of RRL more intuitively, the feature maps are visualized in Figure 4. In Figure 4, the left columns are the input images. The middle columns are the output features of the last layer of the original LeNet-5 network: except for the upright image, which is correctly classified, the other three cases are misidentified. The right columns are the output features of the last layer of the LeNet-5+RRL network, whose features do not change with the rotation angle, and all four cases predict the correct category. It shows that with the RRLs, the same features are extracted from images at different angles, and invariant coding is realized.
The visualization results of Grad-CAM (Selvaraju et al. 2017) are shown in Figure 5. The input images are rotated by θ ∈ {0, 90, 180, 270}°. Conv1-grad, Conv2-grad and RRL3-grad are the heatmaps obtained by gradient computation on the feature maps after the first convolution layer, the second convolution layer and the last regional rotation layer, respectively. It can be seen that before the last regional rotation, the features are direction dependent, and the focus position of the model still changes with the rotation angle of the input: at this stage, the network is only rotation equivariant. After RRL3, the network completes the invariant coding of rotation, and the features shown by RRL3-grad hardly change with the rotation.

Figure 5: Visualization of the three RRLs' output heatmaps in LeNet-5+RRL (Conv1-grad, Conv2-grad and RRL3-grad for Horse, Frog, Bird and Truck inputs rotated by 0°, 90°, 180° and 270°).

Image Classification Based on ResNet-18
ResNet-18 (He et al. 2016) was proposed in 2016, and it consists of 17 convolution layers and a fully connected layer. The core component of ResNet is the residual module, which consists of two consecutive convolution layers and a skip connection.
Table 2 shows the comparison of the accuracy of ResNet-18 with and without RRL. It can be seen from Table 2 that the effect of data augmentation is not significant with RRL: no matter whether the input sample is rotated or not, as long as the sample itself remains unchanged, the local features remain unchanged after the RRLs.

Training Data | CIFAR-10 | CIFAR-10 | CIFAR10-rot | CIFAR10-rot+
Testing Data | CIFAR10-rot | CIFAR10-rot+ | CIFAR10-rot | CIFAR10-rot+
ResNet-18 | 46.5 | 38.7 | 73.6 | 58.7
ResNet-18+RRL | 75.0 | 65.3 | 77.9 | 63.1

Table 2: Comparison of accuracy (%) of ResNet-18 with and without RRL.

Comparison with other methods on CIFAR-10 is shown in Table 3. We can see that ResNet-18+RRL obtains high accuracy on both data sets. It shows that RRL can help the original CNN improve its encoding ability without increasing the parameters and model complexity, and obtain stronger generalization ability.

Training Data | CIFAR-10 | CIFAR-10
Testing Data | CIFAR-10-rot | CIFAR-10-rot+
LeNet-5 | 33.2 | 18.2
LeNet-5+RRL | 71.3 | 52.8
ResNet-18 | 46.5 | 38.7
ResNet-18+RRL | 75.0 | 65.3
CyResNet56-P (Cowen et al. 2015) | - | 61.3
PR RF 1 (Follmann and Bottger 2018) | - | 44.1
ORN (Zhou et al. 2017) | 60.9 | 40.7

Table 3: Comparison of accuracy (%) with other methods on CIFAR-10.

From Table 3, we can also find that ResNet-18+RRL improves performance less than LeNet-5+RRL does (28.5% vs 38.1% on CIFAR10-rot, 26.6% vs 34.1% on CIFAR10-rot+). The LBP operator tends to rotate the brighter texture of the image to the left part of the window; we can guess that, with the restriction of the RRL, the obtained features tend to be similar, which reduces the diversity of features. After the training data are augmented, the gap between the two is also significantly smaller (improvements of 4.3% on CIFAR10-rot and 4.4% on CIFAR10-rot+, respectively). For rotated images, a traditional convolutional network will specially customize filters for each direction of the same texture. However, due to the constraint of the RRL, even if the input data are more diverse, the number of feature types will not increase significantly when the content changes little. Therefore, when the model is deepened, traditional neural networks improve more than networks with RRL.
Even with data augmentation, the rotation invariance of a conventional convolutional network is still not as good as that of one plugged with RRL. However, it can be predicted that, with the deepening of the network, the rising trend of accuracy with RRL will slow down. Therefore, the algorithm is suitable for shallow or medium-depth neural networks, or for limited training samples and limited computing resources.
Figure 6 shows the classification results of the same image at different rotation angles with and without RRLs. The blue sections mark the angle ranges in which the image is correctly classified as "frog". The image is rotated every 12°, so there are 30 rotation-angle sections. Figure 6(a) shows the classification output of the model without RRLs: when the rotation angle is in θ ∈ [−36, 24]° or θ ∈ [36, 60]°, the image is classified correctly; in the other states, different prediction results are obtained at different angles. Figure 6(b) shows the classification output of the improved model. The blue area is larger than that of (a), indicating that the rotation layers make the model less sensitive to the input rotation angle.
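A per-angle evaluation of the kind summarized in Figure 6 could be scripted as in the following sketch; it reuses the hypothetical rotate_and_mask helper from the dataset sketch above, and the preprocessing is an assumption rather than the authors' pipeline.

import torch

@torch.no_grad()
def angle_sweep(model, img, label, step=12):
    """Rotate `img` (H x W x 3 uint8 array) in `step`-degree increments and
    record, for each angle, whether `model` still predicts `label`."""
    model.eval()
    results = []
    for angle in range(0, 360, step):                        # 30 sections for step=12
        x = rotate_and_mask(img, angle)
        x = torch.from_numpy(x).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        pred = model(x).argmax(dim=1).item()
        results.append((angle, pred == label))
    return results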
Figure 9: Coordinate transformation of a bounding box with corners (x1, y1) and (x2, y2) in a w × h image: (a) 0°: (x1, y1), (x2, y2); (b) 180°: (w − x2, h − y2), (w − x1, h − y1); (c) 90°: (y1, w − x2), (y2, w − x1); (d) 270°: (h − y2, x1), (h − y1, x2).

Figure 10: Detection results on rotated images; columns show the ground truth, the baseline model, and the baseline+RRL model (examples contain dog, cat, sheep, person and motorbike objects rotated by 90° and 270°).

Model | mAP
MobileNet-tiny-yolov3 | 43.76%
MobileNet-tiny-yolov3+RRL | 61.78%

Table 5: mAP of MobileNet-tiny-yolov3 on the Pascal VOC dataset, trained on upright images and tested on rotated images.

Experimental Results and Analysis The detection results of MobileNet-tiny-yolov3 with and without RRL are shown in Figure 10. Both models are trained with upright images. The first column shows the ground truth after rotating and scaling the original image; the second and third columns are the test results of the basic model and of the model with RRLs added. For the top two rows, the image contains only one label, but the basic model outputs two prediction boxes, one of which does not belong to the correct category, as shown in the green box. It can be seen that the basic model has some recognition ability for rotated images, but is misled into other wrong categories, while the output of the improved model is quite similar to the real label. The bottom two rows contain multiple labels, and they all overlap to some extent. The output of the basic model contains only one box covering all the objects, which shows that it can only roughly localize them and cannot separate them finely; the improved model can detect each object, with some localization errors. With the IoU threshold set to 0.5, trained on upright images and tested on rotated images, the mAP of the two models is shown in Table 5.
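The coordinate mapping of Figure 9, used to rotate the ground-truth boxes together with the images, can be written directly as below; here (x1, y1) and (x2, y2) are assumed to be the top-left and bottom-right corners, and the rotation angle is counterclockwise.

def rotate_box(box, w, h, angle):
    """Map a box (x1, y1, x2, y2) in a w x h image to its coordinates after the
    image is rotated counterclockwise by `angle` (the mapping of Figure 9)."""
    x1, y1, x2, y2 = box
    if angle == 0:
        return (x1, y1, x2, y2)
    if angle == 90:                      # rotated image size: h x w
        return (y1, w - x2, y2, w - x1)
    if angle == 180:
        return (w - x2, h - y2, w - x1, h - y1)
    if angle == 270:                     # rotated image size: h x w
        return (h - y2, x1, h - y1, x2)
    raise ValueError("angle must be one of 0, 90, 180, 270")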
Conclusion
This paper proposes a regional rotation layer (RRL) to help
References
Cohen, T.; and Welling, M. 2016. Group equivariant convolutional networks. In International conference on machine learning, 2990–2999. PMLR.
Cohen, T. S.; Geiger, M.; Köhler, J.; and Welling, M. 2018. Spherical CNNs. In International Conference on Learning Representations.
Cohen, T. S.; and Welling, M. 2017. Steerable CNNs. In International Conference on Learning Representations.
Cowen, R. K.; Sponaugle, S.; Robinson, K.; and Luo, J. 2015. PlanktonSet 1.0: Plankton imagery data collected from FG Walton Smith in Straits of Florida from 2014-06-03 to 2014-06-06 and used in the 2015 National Data Science Bowl (NCEI Accession 0127422). NOAA National Centers for Environmental Information.
Dalal, N.; and Triggs, B. 2005. Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05), volume 1, 886–893. Boston, MA: IEEE.
Dieleman, S.; De Fauw, J.; and Kavukcuoglu, K. 2016. Exploiting cyclic symmetry in convolutional neural networks. In International conference on machine learning, 1889–1898. PMLR.
Dieleman, S.; Willett, K. W.; and Dambre, J. 2015. Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly Notices of the Royal Astronomical Society, 450(2): 1441–1459.
Esteves, C.; Allen-Blanchette, C.; Zhou, X.; and Daniilidis, K. 2018. Polar Transformer Networks. In International Conference on Learning Representations.
Follmann, P.; and Bottger, T. 2018. A rotationally-invariant convolution module by feature map back-rotation. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 784–792. IEEE.
Gao, L.; Li, H.; Lu, Z.; and Lin, G. 2019. Rotation-equivariant convolutional neural network ensembles in image processing. In Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers, 551–557. London: Association for Computing Machinery, New York.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
Jaderberg, M.; Simonyan, K.; Zisserman, A.; et al. 2015. Spatial transformer networks. Advances in Neural Information Processing Systems, 28: 2017–2025.
Jiang, R.; and Mei, S. 2019. Polar coordinate convolutional neural network: from rotation-invariance to translation-invariance. In 2019 IEEE International Conference on Image Processing (ICIP), 355–359. IEEE.
Laptev, D.; Savinov, N.; Buhmann, J. M.; and Pollefeys, M. 2016. TI-pooling: transformation-invariant pooling for feature learning in convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 289–297. Las Vegas, NV: IEEE.
LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278–2324.
Li, J.; Yang, Z.; Liu, H.; and Cai, D. 2018. Deep rotation equivariant network. Neurocomputing, 290: 26–33.
Maninis, K.-K.; Pont-Tuset, J.; Arbeláez, P.; and Van Gool, L. 2016. Convolutional oriented boundaries. In European conference on computer vision, 580–596. Amsterdam, The Netherlands: Springer.
Marcos, D.; Volpi, M.; Komodakis, N.; and Tuia, D. 2017. Rotation equivariant vector field networks. In Proceedings of the IEEE International Conference on Computer Vision, 5048–5057.
Mash, R.; Borghetti, B.; and Pecarina, J. 2016. Improved aircraft recognition for aerial refueling through data augmentation in convolutional neural networks. In International symposium on visual computing, 113–122. Las Vegas, Nevada, USA: Springer.
Ojala, T.; Pietikäinen, M.; and Harwood, D. 1996. A comparative study of texture measures with classification based on featured distributions. Pattern Recognition, 29(1): 51–59.
Quiroga, F.; Ronchetti, F.; Lanzarini, L.; and Bariviera, A. F. 2018. Revisiting data augmentation for rotational invariance in convolutional neural networks. In International Conference on Modelling and Simulation in Management Sciences, 127–141. Girona, Spain: Springer.
Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, 618–626.
Tamura, M.; Horiguchi, S.; and Murakami, T. 2019. Omnidirectional pedestrian detection by rotation invariant training. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), 1989–1998. Hawaii: IEEE.
Van Dyk, D. A.; and Meng, X.-L. 2001. The art of data augmentation. Journal of Computational and Graphical Statistics, 10(1): 1–50.
Wiersma, R.; Eisemann, E.; and Hildebrandt, K. 2020. CNNs on surfaces using rotation-equivariant features. ACM Transactions on Graphics (TOG), 39(4): 92–1.
Zhou, Y.; Ye, Q.; Qiu, Q.; and Jiao, J. 2017. Oriented response networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 519–528.