1 Introduction
Computer-aided diagnostics (CAD), especially for the thyroid nodule classification task, has a long history. Following the huge boost in image classification performance, researchers started fine-tuning existing networks to classify thyroid nodules [5]. Such fine-tuning, even though it exhibits decent performance on certain data sets, cannot achieve universally optimal performance, since the fine-tuned networks may be weak in extrapolation. On the other hand, directly fine-tuning networks for classification misses out entirely on segmentation, which still burdens doctors during diagnosis. To address this, there has been an emergence of interest in detection and segmentation methods [4]. However, there are also questions regarding whether segmentation should be included, since it requires significant computational resources and it is relatively easy for radiologists to segment a thyroid nodule from an image.
2 Method
Fig. 1. Demonstration of the ensembled segmentation workflow. Two masks are generated by Mask Scoring R-CNN and by a two-stage segmentation method consisting of CenterNet and Deep Snake.
Fig. 2. Two-step attention network. The CBAM module is responsible for telling the network where to focus; the CAM module is responsible for guiding the feature maps.
$$M_c = \sigma(F^c_{\max} + F^c_{\mathrm{avg}}),$$
where $\sigma$ denotes the sigmoid activation. After generating this channel-wise attention, we apply it via
$$f = M_c \otimes F,$$
where $\otimes$ denotes element-wise multiplication and $f$ denotes the feature map after applying channel attention.
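For concreteness, a minimal PyTorch-style sketch of this channel attention step is given below; the module name, the reduction ratio, and the pooling implementation are illustrative choices of ours rather than details taken from the paper.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """CBAM-style channel attention: a shared MLP over max- and average-pooled features."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared two-layer MLP applied to both pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = feat.shape
        # Global max and average pooling give two C-dimensional descriptors.
        f_max = self.mlp(feat.amax(dim=(2, 3)))
        f_avg = self.mlp(feat.mean(dim=(2, 3)))
        # M_c = sigmoid(F_max^c + F_avg^c), then f = M_c (x) F, broadcast over space.
        m_c = torch.sigmoid(f_max + f_avg).view(b, c, 1, 1)
        return m_c * feat
```

The shared MLP keeps the parameter count low while letting the max- and average-pooled descriptors interact through the same weights.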
The feature map and the weights of the last fully connected layer undergo a 1×1 convolution to generate the attention map. Formally, let $f$ denote the feature maps and $w$ be the weight matrix of the fully connected layer. The attention map $A$ is given by:
$$A = \mathrm{ReLU}(\mathrm{conv}(f, w)).$$
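A hedged sketch of this step is shown below, reusing the classifier's fully connected weights as 1×1 convolution kernels in the spirit of CAM [13]; tensor shapes and function names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def cam_attention_map(feature_maps: torch.Tensor, fc_weight: torch.Tensor) -> torch.Tensor:
    """Compute A = ReLU(conv(f, w)) by reusing the classifier weights as 1x1 kernels.

    feature_maps: (B, C, H, W) output of the last convolutional block.
    fc_weight:    (num_classes, C) weight matrix of the final fully connected layer.
    """
    # Each row of the fully connected weight matrix becomes one 1x1 convolution kernel.
    kernel = fc_weight.view(fc_weight.size(0), fc_weight.size(1), 1, 1)
    # The resulting map has shape (B, num_classes, H, W), i.e. the same spatial size
    # as a single channel of the feature map.
    return torch.relu(F.conv2d(feature_maps, kernel))
```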
The attention map therefore has the same spatial shape as a single channel of the feature map. The attention map is then upsampled to the original input size and undergoes color normalization. We then perform soft masking with a sigmoid function:
$$T(A) = \frac{1}{1 + \exp(-\alpha(A - B))},$$
where $T(A)$ is the soft attention mask generated by this online attention module. Furthermore, this online module uses a combined loss so that we can calibrate both the attention map and the classification result, i.e., a weighted combination of a segmentation term on the attention map and a classification term, where $L_{\mathrm{classification}}$ is the binary cross-entropy (BCE) loss.
We use the Dice loss to maximize the overlap between the attention map and the input mask. The classification loss is set to the binary cross-entropy loss. $\lambda$ provides a leveraging effect between the two tasks; since classification is the main task, we set $\lambda = 0.4$.
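The following sketch illustrates one plausible reading of the soft masking and the combined loss. The values of $\alpha$ and $B$ and the exact weighting form $L = L_{\mathrm{classification}} + \lambda \cdot L_{\mathrm{Dice}}$ are assumptions, since the text only states that the Dice and BCE terms are balanced by $\lambda = 0.4$.

```python
import torch
import torch.nn.functional as F


def soft_mask(att_map: torch.Tensor, alpha: float = 10.0, beta: float = 0.5) -> torch.Tensor:
    """T(A) = 1 / (1 + exp(-alpha * (A - B))); alpha and B are hypothetical values here."""
    return torch.sigmoid(alpha * (att_map - beta))


def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice loss encouraging overlap between the soft attention map and the input mask."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1.0 - (2.0 * inter + eps) / (union + eps)


def combined_loss(logits, labels, att_map, mask, lam: float = 0.4):
    """Assumed weighting L = L_BCE + lam * L_Dice with lam = 0.4; the paper only
    states that lam balances the two tasks, with classification as the main task."""
    l_cls = F.binary_cross_entropy_with_logits(logits, labels)
    l_seg = dice_loss(soft_mask(att_map), mask).mean()
    return l_cls + lam * l_seg
```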
Fig. 3. Channel attention architecture. The input feature is mapped to max pooling and average pooling separately, and both pooled descriptors are then passed through an MLP network.
3 Experiment
3.1 Data Set and Augmentation Techniques
The TN-SCUI 2020 data set features 3644 ultrasound images of the thyroid gland, of which 2003 are malignant and 1641 are benign. The data set is provided courtesy of Shanghai Ruijin Hospital. It is partitioned into training and validation sets in a 7:3 ratio. To further increase the robustness of our method, we employ a variety of data augmentation methods. Specifically, we randomly rotate the images and apply small affine transformations to mimic the positional and hardware variances in the image acquisition process. Furthermore, we increase the diversity of our data by adjusting brightness and contrast and by adding Gaussian noise. Finally, we train the network with five-fold cross-validation and cast a majority vote on the testing set featuring 910 images.
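A minimal sketch of such an augmentation pipeline and the fold-level majority vote, written with torchvision-style transforms; the rotation angle, jitter magnitudes, and noise level are illustrative assumptions, not necessarily the values used in our experiments.

```python
import torch
from torchvision import transforms


class AddGaussianNoise:
    """Additive Gaussian noise on a tensor image (a small custom helper, not a
    torchvision built-in)."""

    def __init__(self, std: float = 0.02):
        self.std = std

    def __call__(self, img: torch.Tensor) -> torch.Tensor:
        return (img + torch.randn_like(img) * self.std).clamp(0.0, 1.0)


# Rotation, small affine jitter, brightness/contrast changes, and Gaussian noise.
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05), scale=(0.9, 1.1)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    AddGaussianNoise(std=0.02),
])


def majority_vote(fold_predictions: torch.Tensor) -> torch.Tensor:
    """fold_predictions: (num_folds, num_images) of 0/1 labels, one row per fold model.
    Returns the per-image majority label across the folds."""
    return (fold_predictions.float().mean(dim=0) >= 0.5).long()
```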
Table 1. Segmentation results of the Ensemble UNet. DLA34 stands for the Deep Layer Aggregation model with 34 layers, and FPN stands for Feature Pyramid Network.
Table 3. Ablation study of our two-step attention network. TA stands for the two-step attention, CBAM stands for the usage of the channel attention module, and CAM stands for the usage of the CAM attention module. Heat map represents the usage of the class-discriminative detection heat map generated by CenterNet. Ratio represents the additional input channel consisting of the height and width ratio. mIoU represents the mean intersection over union between the attention map generated by CAM and the segmentation mask.
Fig. 4. Instances of the attention maps generated by our two-step attention mechanism. The left three are benign cases and the remaining are malignant. The first row shows the original ultrasound images of the thyroid nodules. The second row shows the attention maps produced by the network without mask guidance. The third row shows the attention maps generated with mask guidance but without the heat map and width-height ratio. The fourth row shows the attention maps generated with mask guidance, as well as the heat map and the height-width ratio as additional inputs. We denote the leftmost image in the first row by (1, 1), the rightmost image in the same row by (1, 6), and the rightmost image in the fourth row by (4, 6).
The above results provide empirical evidence that even though the split-attention module of ResNeSt is able to boost classification performance, its performance is still much lower than that of our method (i.e., 81.01% in Table 2). This phenomenon demonstrates that using the localization cues from the detection and segmentation tasks is well suited to improving performance on this task, whereas roughly cropping the nodule regions may mislead the final classification.
4 Conclusion
In this paper, we explored the inherent connections between segmentation and classification, and designed a two-step attention network that utilizes segmentation results to achieve better classification results. Our method achieved third place in classification at the TN-SCUI 2020 challenge. Furthermore, our method provides explainable learning by explicitly producing the attention maps generated by the network, which we hope will aid doctors in the clinical diagnostic process.
References
1. Ma, J., Wu, F., Jiang, T., Zhu, J., Kong, D.: Cascade convolutional neural networks
for automatic detection of thyroid nodules in ultrasound images. Med. Phys. 44(5),
1678–1691 (2017). https://doi.org/10.1002/mp.12134
2. Song, W., et al.: Multitask cascade convolution neural networks for automatic
thyroid nodule detection and recognition. IEEE J. Biomed. Health Inform. 23(3),
1215–1224 (2019). https://doi.org/10.1109/JBHI.2018.2852718
3. Castellino, R.A.: Computer aided detection (CAD): an overview. Cancer Imaging 5(1), 17–19 (2005). https://doi.org/10.1102/1470-7330.2005.0018
4. Chi, J., Walia, E., Babyn, P., Wang, J., Groot, G., Eramian, M.: Thyroid nodule
classification in ultrasound images by fine-tuning deep convolutional neural net-
work. J. Digit. Imaging 30(4), 477–486 (2017). https://doi.org/10.1007/s10278-
017-9997-y
5. Li, X., et al.: Diagnosis of thyroid cancer using deep convolutional neural net-
work models applied to sonographic images: a retrospective, multicohort, diagnos-
tic study. Lancet Oncol. 20(2), 193–201 (2019). https://doi.org/10.1016/s1470-
2045(18)30762-9
6. Prochazka, A., Gulati, S., Holinka, S., Smutek, D.: Classification of thyroid nod-
ules in ultrasound images using direction-independent features extracted by two-
threshold binary decomposition. Technol. Cancer Res. Treat. 18 (2019). https://
doi.org/10.1177/1533033819830748
7. Guo, D., et al.: Organ at risk segmentation for head and neck cancer using stratified
learning and neural architecture search. In: 2020 IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 4222–4231
(2020). https://doi.org/10.1109/CVPR42600.2020.00428
8. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: 2017 IEEE Inter-
national Conference on Computer Vision (ICCV), Venice, pp. 2980–2988 (2017).
https://doi.org/10.1109/ICCV.2017.322
9. Huang, Z., Huang, L., Gong, Y., Huang, C., Wang, X.: Mask scoring R-CNN.
In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), Long Beach, CA, USA, pp. 6402–6411 (2019). https://doi.org/10.1109/
CVPR.2019.00657
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Las Vegas, NV, pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
11. Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., Tian, Q.: CenterNet: keypoint triplets for object detection. arXiv:1904.08189 (2019)
12. Peng, S., Jiang, W., Pi, H., Li, X., Bao, H., Zhou, X.: Deep snake for real-time
instance segmentation. In: 2020 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), Seattle, WA, USA, pp. 8530–8539 (2020). https://
doi.org/10.1109/CVPR42600.2020.00856
13. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features
for discriminative localization. In: 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Las Vegas, NV, pp. 2921–2929 (2016). https://doi.
org/10.1109/CVPR.2016.319
14. Ouyang, X., et al.: Dual-sampling attention network for diagnosis of COVID-19
from community acquired pneumonia. IEEE Trans. Med. Imaging 39(8), 2595–
2605 (2020). https://doi.org/10.1109/TMI.2020.2995508
15. Mayo Clinic: Thyroid nodules (2020)
16. Mayo Clinic: Needle biopsy (2020)
17. Woo, S., Park, J., Lee, J.-Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_1
18. Liu, S., Deng, W.: Very deep convolutional neural network based image classifi-
cation using small training sample size. In: 2015 3rd IAPR Asian Conference on
Pattern Recognition (ACPR), Kuala Lumpur, pp. 730–734 (2015). https://doi.
org/10.1109/ACPR.2015.7486599
19. Zhang, H., et al.: ResNeSt: split-attention networks. arXiv:2004.08955 (2020)
20. Fu, J., et al.: Dual attention network for scene segmentation. In: 2019 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach,
CA, USA, pp. 3141–3149 (2019). https://doi.org/10.1109/CVPR.2019.00326