Enhancing Cell Segmentation
Enhancing Cell Segmentation
Abstract—This study presents a comprehensive workflow for tasks. These methods can result in over-segmentation, under-
cell segmentation using the Monuseg dataset. Our method inte- segmentation, or mis-identification of cells. Consequently,
grates several advanced techniques to enhance the performance recent research has increasingly focused on deep learning
of segmentation models. We apply Principal Component Analysis
(PCA) and Gaussian Mixture Model - Expectation Maximization techniques, which offer improved accuracy and robustness
(GMM-EM) clustering for data pre-processing. The training in handling diverse and complex cell images (see detail in
and evaluation significantly improved the mean IoU metric, Section II).
demonstrating the enhanced performance of Language meets Deep Learning-based innovative algorithms generally
Vision Transformer (LViT) compared to other deep-learning archive state-of-the-art performance in medical imaging seg-
networks. This constructed workflow shows substantial potential
in improving the accuracy of cell segmentation tasks, which is mentation. Among the models, we are inspired by the LViT
crucial and beneficial for various biomedical applications. (Language meets Vision Transformer) model [2] since it tack-
les the challenge of limited labeled data in medical image
I. I NTRODUCTION segmentation by incorporating medical text annotations. This
Cytopathology, the study of cellular-level diseases, is a text information supplements the image data, allowing the
pivotal tool in cancer screening. Artificial intelligence can model to learn even with limited labeled examples.
potentially revolutionize diagnostics, especially segmenting We improve the performance of the LViT model by ensuring
cell images from micrographs. high-quality data input. In more detail, we integrate Princi-
Accurate cell segmentation is critical, the cornerstone of pal Component Analysis (PCA) for dimensionality reduction
medical image analysis. It enables the precise identification and Gaussian Mixture Models with Expectation-Maximization
and classification of cellular structures within microscopic (GMM-EM) in the pre-processing pipeline to optimize the
images. It is significant for various biomedical applications, clustering process. This has resulted in improved preliminary
such as computational pathology, disease diagnosis, prognosis, segmentation maps, a promising outcome that paves the way
and computer-aided diagnosis. for further advancements in cell segmentation.
For the classical approach, the Split and Merge Watershed The following parts are organized as follows: Section II
(SM-Watershed) method [1] for cell segmentation in fluo- shows some base deep learning models applied in cell seg-
rescence microscopy images. Initially, the Marker-Controlled mentation. Section III focuses on our contribution to the com-
Watershed (MC-Watershed) algorithm provides preliminary bination of GMM-EM and LVit model. Moreover, Section IV
segmentation. The Split phase separates clusters using cell indicates the experiment results, and Section V determines this
characteristics like size and convexity, while the Merge phase method’s conclusion and discussion.
refines these segments by eliminating over-segmentation. This
method effectively balances segmentation accuracy without II. D EEP L EARNING A PPROACHES FOR C ELL
needing labeled data. S EGMENTATION IN M ICROSCOPIC I MAGES
The segmentation method that combines the K-L transform This section reviews some deep-learning approaches that
and the OTSU method is not just a standard cell segmentation have had good results in the last few years, particularly on
method. It is a proven and effective technique. The K-L the MoNuSeg dataset, a widely used benchmark dataset in
transform is utilized to select the most informative channel cell segmentation. Understanding the performance of these
from the image, followed by the OTSU method to determine approaches on this dataset will help us decide the direction
an optimal automatic threshold value for segmentation. This of continuing research in this field.
approach has been applied to various types of cell images, In 2020, Jha et al. [3] proposed a Double U-Net model
demonstrating its practicality and effectiveness. by adding another U-Net at the bottom of the network to
While effective in specific scenarios, classical approaches capture supplementary semantic information efficiently. Fur-
such as Watershed, OTSU, and K-L transform often need help ther, Atrous Spatial Pyramid Pooling (ASPP) was adapted to
with the complexity and variability of all cell segmentation capture contextual data, and the post-processing techniques
significantly improved the result of automatic polyp detec- used in pathology image segmentation, leveraging skip connec-
tion. However, since this network uses two Unet models, the tions to recover detailed information. However, the semantic
increase in the number of parameters is a limitation of this gap between encoder and decoder can hinder performance.
model. To address this, the FusionU-Net [11] incorporates a fusion
Wang et al. [4] with a Bending Loss Regularized (BLR) module to reduce semantic gaps by exchanging information
model is successful in tackling the challenge of segmenting between skip connections. Our two-round fusion design con-
overlapped nuclei in histopathology images. This model ap- siders local relevance and bi-directional information exchange
plied high and low penalties to contour points with large and across layers. In addition, another model, named BiU-Net [12],
small curvature. In addition, the bending loss helps to avoid combines CNNs and transformers using a two-stage fusion
the generation of boundaries for two or more nuclei that are strategy. The Single-Scale Fusion (SSF) stage integrates local
touching. The BLR model performs better than other models. and long-range features, while the Multi-Scale Fusion (MSF)
However, there is still a problem with the segmentation of stage eliminates the semantic gap between deep and shallow
overlapping nuclei. layers. Additionally, a Context-Aware Block (CAB) in the bot-
Hassan et al. [5] proposed a Pyramid Scene Parsing with tleneck enhances multi-scale features in the decoder, improving
SegNet (PSPSegNet) to identify and delineate the boundaries segmentation performance.
of nuclei. The experiment indicates that the PSPSegNet model
is effective with F1-Score and AJI at 0.8815 and 0.7080, re- III. T HE P ROPOSED A PPROACH : C OMBINATION LV IT
spectively. Concerning the object level, the PSPSegNet model M ODEL WITH GMM-EM M ETHODS
relies on training data and, therefore, is unsuitable for exacting Compared to the successful methods mentioned in Sec-
cell shapes. tion II, the LViT (Language meets Vision Transformer)
The author Lagree et al. [6] proposed a gradient-boosting model [2] overcomes these limitations by significantly enhanc-
U-Net(GB U-Net) to segment breast tumor cell nuclei. This re- ing context awareness and reducing dependency on extensive
search shows that deep convolutional neural networks are suit- labeled data, making it more suitable for cell segmentation
able for training with transfer learning on a set of histopatho- tasks. LViT effectively captures global relationships within
logical images independent of breast tissue to segment or not images, leading to more accurate and efficient segmentation
tumor nuclei of the breast. results than Unet++. Thus, we enhanced this model in this
In the same year, 2021, Li et al. [7] proposed the Bagging paper by applying the pre-processing stage to improve perfor-
Ensemble Deep segmentation (BEDs) model, which aggre- mance (see Figure 2).
gates self-ensemble learning and testing stage augmentation Before describing our contribution in more detail, we revise
to improve the robustness of nucleus segmentation. However, the main idea of the LViT model, as illustrated (Figure 1):
this model needs to segment better when the images are LViT uses text information to supplement the image data,
complicated in shape and structure. allowing the model to learn even with limited labeled exam-
One year later, Qin et al. [8] proposed the REU-Net model ples. Furthermore, LViT leverages semi-supervised learning,
to improve segmentation accuracy by focusing on region- utilizing text data to generate high-quality pseudo-labels for
specific features within images. It leverages enhanced feature unlabeled images. An Exponential Pseudo Label Iteration
extraction techniques to identify better and segment nuclei in (EPI) mechanism assists the Pixel-Level Attention Module
medical images. However, while REU-Net boosts performance, (PLAM) preserve local image features during this process.
it may introduce additional computational complexity due to The Exponential Pseudo Label Iteration (EPI) mechanism is
its region-focused approach. Liang et al. [9] also use region a critical component in LViT, designed to enhance the quality
information to build a model that integrates Guided Anchoring of pseudo-labels for unlabeled image iterative. EPI refines
(GA) into the Region Proposal Network (RPN) and using a the pseudo-labels in each iteration by incorporating feedback
fusion box score (FBS) with soft non-maximum suppression from the model’s predictions, which become increasingly
(SoftNMS). This model improves accuracy over traditional accurate over time. This iterative process improves pseudo-
CNN-based approaches and enhances cell-level analysis in labels’ reliability, facilitating more effective semi-supervised
digital tissue images. learning. Complementing EPI, the Pixel-Level Attention Mod-
In 2023, the large-scale model named Segment Anything ule (PLAM) ensures that fine-grained details are preserved
Model (SAM) was introduced [10]. This model is trained on 11 during segmentation. PLAM assigns attention scores to each
million images with over 1 billion masks for general-purpose pixel, allowing the model to focus on critical regions within
segmentation. Although SAM doesn’t initially provide high- the image and maintain local feature integrity. This dual
quality segmentation for medical images, its masks, features, mechanism of EPI and PLAM enables LViT to achieve high
and stability scores are valuable for improving medical image precision in cell segmentation tasks, even with limited labeled
segmentation models. This model can be applied to augment data.
inputs for models like U-Net, with experiments on three tasks Compared to models like U-Net++ and BiO-LinkNet, which
that show its effectiveness. rely solely on convolutional operations, LViT’s transformer
In recent years, U-Net and its variants have been widely architecture allows it to capture local and global information
2
Fig. 1. Illustration of (a) the proposed LViT model and (b) the Pixel-Level Attention Module (PLAM). The proposed LViT model is a Double-U structure that
combines a U-shape CNN branch with a U-shaped ViT branch [2].
more effectively. While MaxViT-UNet achieves high perfor- Where z is the PCA-transformed feature vector, K is the
mance using a combination of convolutions and transformers, number of clusters, πk is the weight of the k-th cluster,
LViT’s advantage lies in its ability to leverage image data and and N (z|µk , Σk ) is the probability density function of the
text annotations. This adaptability makes it more effective in Gaussian distribution with mean vector µk and covariance
cell segmentation tasks. matrix Σk .
Our pre-processing stage prepares the data and ensures An iterative experimental process meticulously determined
a robust and reliable process for LViT optimally. We use the number of clusters K. This process was designed to
Principal Component Analysis (PCA) to reduce the data’s di- optimize the separation between clusters and the compactness
mensionality, followed by Gaussian Mixture Models (GMMs) within each cluster, ensuring the highest quality results. Upon
with Expectation Maximization (EM) to group images into completion of the clustering process, each pixel in the image
distinct clusters. was labeled according to the cluster with the highest posterior
The Principle Component Analysis was first applied in our probability, generating a preliminary segmentation map.
pre-processing pipeline to reduce the feature vectors’ dimen- Our research has achieved several notable outcomes:
sions, including pixel intensity values and texture features • Developed a pre-processing pipeline that effectively in-
extracted from the cell images. The PCA transformation, a tegrates PCA for dimensionality reduction with GMM-
crucial step in our pipeline, is defined in Equation 1. EM for clustering, providing high-quality input for the
subsequent LViT model.
z = WT (x − µ) (1) • Our research has led to a significant achievement-the
combination of PCA and GMM-EM has optimized the
Where x is the original feature vector, µ is the mean of clustering process. This has resulted in improved prelim-
the feature vectors, W is the matrix of eigenvectors of the inary segmentation maps, a promising outcome that paves
covariance matrix, and z is the transformed feature vector in the way for further advancements in cell segmentation.
the principal component space. IV. E XPERIMENTS RESULTS
Following dimensionality reduction, GMM-EM was applied
The experiment is evaluated using the highly regarded
for image clustering. GMM-EM was chosen due to its ability
MICCAI MoNuSeg dataset. This dataset comprises 44 images,
to effectively process diverse and complex cellular data, its
each sized at 1000 × 1000 pixels with 28, 846 labeled cell
flexibility in modeling complex distributions, and its capacity
nuclei distributed across nine organs: breast, liver, kidney,
to provide soft probabilities for each cluster. The GMM model
prostate, bladder, colon, stomach, lungs, and brain. The dataset
is defined as in Equation 2.
is organized into 24 images for training, 6 for validation, and
K 14 reserved for testing.
X
p(z) = πk N (z|µk , Σk ) (2) Our approach involved extensive experimentation with var-
k=1 ious sizes of image patches in images of the MoNuSeg2018
3
Fig. 2. Proposed Model for Cell Segmentation over Microscopic Images combining GMM-EM to LViT Model
Fig. 3. Cell segmentation results on MoNuSeg2018 dataset [13] using the proposed model, achieving a Dice score of 0.80, IoU of 0.66, and a runtime of
17155.2 seconds on K = 2 clusters, trained on an NVIDIA A100 (Google Colab Pro).
Method LVit [2] Proposed model MaxViT-UNet [14] Dice Unet [15] Unet++ [16] BiO-LinkNet [17] LinkNet [17] R2U-Net [18]
Dice 0.78 0.80 0.83 0.76 0.77 0.77 0.77 0.80
IoU 0.65 0.66 0.72 0.62 0.63 0.62 0.63 0.68
TABLE I
S EMANTIC SEGMENTATION RESULTS ON THE M O N U S EG 2018 DATASET.
dataset. We were cropping images to 256×256 pixels produced [0, 1]; when the value is close to 0, the model’s accuracy is
trade-off results. We also addressed memory constraints by low; the closer it is to 1, the higher its accuracy.
adopting overlapping patches with a 70-pixel overlap while
maintaining the original organ distribution. Using the LViT method, each image in the Monuseg2018
dataset is accompanied by a text passage providing a spe-
Given the relatively small dataset size of 1, 100 patches, cific description and evaluation of that particular image. In
this paper employs data augmentation techniques to enhance more detail, each image describes the characteristics of the
model training and accuracy, effectively expanding the dataset nuclei, such as evenly/sparse distribution and higher/lower
to 2, 200 images. These techniques, including random hori- density areas, etc. There may also be several images with
zontal flipping, rotating, and adding a Gaussian filter with a the same text passage. We pre-processed the images using
random parameter, were developed to address the challenges of the GMM-EM (Gaussian Mixture Model with Expectation-
working with a small dataset, demonstrating our commitment Maximization) technique to effectively process the visual data
to overcoming research obstacles. before applying the LViT method. This process clusters only
The semantic segmentation model’s accuracy is evaluated the image data, creating groups of visually similar images
using the Intersection over Union (IoU) measure and the while maintaining the individual text descriptions for each
Dice coefficient. For the individual segmentation problem, the image. We conducted experiments with different K values to
model’s accuracy is assessed based on the Score value. The determine the optimal number of clusters (K) for the image
IoU, Dice, and Score measurements have values in the range data. We adjusted based on experimental results and dataset
4
characteristics to optimize model performance. This approach across both evaluation metrics, reinforcing its effectiveness for
allows us to leverage both the clustered visual information and accurate and efficient cell segmentation.
the unique textual descriptions for each image in our dataset, Figure 3 presents three cropped images from the original
enhancing the overall performance of the LViT method. MoNuSeg2018 dataset. The figure indicates that using the
GMM-EM and LViT effectively segments overlapping cells
K Dice Score IoU Score Execution Time (seconds) and distinguishes between cells and other components, such
1 0.7659 0.6307 13449.25
2 0.8015 0.6566 17155.2
as blood, which are not the focus of this study.
3 0.7039 0.5522 23643.5
V. C ONCLUSIONS AND DISCUSSIONS
TABLE II
P ERFORMANCE METRICS WITH DIFFERENT NUMBERS OF CLUSTERS ON This paper presented an enhanced cell segmentation ap-
THE TEST DATASET. proach by integrating the LViT model with GMM-EM pre-
processing. The use of clustering is crucial in improving the
learning process, as it allows the model to capture more distinct
Following the data pre-processing with GMM-EM, we ex- and characteristic features from the data. By applying PCA for
perimented with different numbers of clusters K, starting from dimensionality reduction, followed by GMM-EM clustering,
1 and incrementally increasing. After training the LViT model the data is effectively segmented into meaningful groups,
from scratch on each clustered dataset, we selected K = 2 as it making it easier for the model to focus on relevant patterns.
yielded the highest accuracy compared to other configurations. This structured input significantly boosts the accuracy and
Choosing K = 2 allows the model to capture enough variance reliability of the model. This combination achieved superior
to segment the cells accurately while maintaining robustness performance when evaluating the MICCAI MoNuSeg dataset.
against noise or unnecessary complexity. This balance likely In the future, we plan to collect data from diverse sources and
leads to better generalization, as evidenced by the superior further optimize the model parameters to enhance the accuracy
performance metrics. Table II shows the performance metrics of our approach significantly.
for each K. The model’s performance was then evaluated on
the test set using metrics such as IoU and the Dice coefficient. ACKNOWLEDGMENT
The results (see Table I) are a testament to our progress, with This research was funded by the research project QG.23.71
the LViT method combined with GMM-EM achieving an IoU of Vietnam National University, Hanoi.
of 0.66 and a Dice coefficient of 0.8, surpassing the baseline
R EFERENCES
LViT model (IoU = 0.65, Dice = 0.78). Compared with stan-
dard LViT and U-Net++, our proposed method demonstrates [1] M. Gamarra, E. Zurek, H. J. Escalante, L. Hurtado, and
that our approach is competitive and a significant step forward H. San-Juan-Vergara, “Split and merge watershed: A
in the microscopy image processing field. two-step method for cell segmentation in fluorescence
Table I illustrates our results compared to another method. microscopy images,” Biomedical signal processing and
The results of the proposed method show significant improve- control, vol. 53, p. 101 575, 2019.
ment in cell segmentation performance on the MoNuSeg2018 [2] Z. Li, Y. Li, Q. Li, et al., “Lvit: Language meets vi-
dataset compared to other methods. Our approach achieved sion transformer in medical image segmentation,” IEEE
the second highest Dice score of 0.80 and an IoU of 0.66, transactions on medical imaging, 2023.
demonstrating better accuracy and reliability. In contrast, the [3] D. Jha, M. A. Riegler, D. Johansen, P. Halvorsen, and
standard LViT method scored a Dice of 0.78 and IoU of 0.65, H. D. Johansen, Doubleu-net: A deep convolutional
indicating a noticeable enhancement with improvements. The neural network for medical image segmentation, 2020.
proposed method outperforms both metrics compared to Dice arXiv: 2006 . 04868 [eess.IV]. [Online]. Available:
Unet and Unet++, which scored Dice values of 0.76 and 0.77 https://arxiv.org/abs/2006.04868.
and IoU values of 0.62 and 0.63, respectively. [4] H. Wang, M. Xian, and A. Vakanski, Bending loss regu-
Although MaxViT-UNet achieves a higher Dice score of larized network for nuclei segmentation in histopathol-
0.83, its superior performance can be attributed to its multi- ogy images, 2020. arXiv: 2002 . 01020 [eess.IV].
axis attention mechanism, which better captures local and [Online]. Available: https://arxiv.org/abs/2002.01020.
global contexts. This, together with combining convolutional [5] L. Hassan, A. Saleh, M. Abdel-Nasser, O. A. Omer, and
layers and transformers, enhances its segmentation accuracy. D. Puig, “Promising deep semantic nuclei segmentation
However, its increased complexity and computational demands models for multi-institutional histopathology images of
highlight the trade-offs, whereas the proposed method offers a different organs,” 2021.
strong balance between performance and efficiency. [6] A. Lagree, M. Mohebpour, N. Meti, et al., “A review
Other methods, like BiO-LinkNet and LinkNet, with Dice and comparison of breast tumor cell nuclei segmen-
scores of 0.77 and IoU scores of 0.62 and 0.63, respectively, tation performances using deep convolutional neural
also fall short of our results. The R2U-Net method also networks,” Scientific Reports, vol. 11, no. 1, p. 8025,
performs well, with a Dice score of 0.80 and an IoU of 2021.
0.68. However, our method remains robust and consistent
5
[7] X. Li, H. Yang, J. He, et al., Beds: Bagging ensemble
deep segmentation for nucleus segmentation with testing
stage stain augmentation, 2021. arXiv: 2102 . 08990
[cs.CV]. [Online]. Available: https : / / arxiv. org / abs /
2102.08990.
[8] J. Qin, Y. He, Y. Zhou, J. Zhao, and B. Ding, “Reu-net:
Region-enhanced nuclei segmentation network,” Com-
puters in Biology and Medicine, vol. 146, p. 105 546,
2022.
[9] H. Liang, Z. Cheng, H. Zhong, A. Qu, and L. Chen, “A
region-based convolutional network for nuclei detection
and segmentation in microscopy images,” Biomedical
Signal Processing and Control, vol. 71, p. 103 276,
2022.
[10] Y. Zhang, T. Zhou, S. Wang, P. Liang, Y. Zhang, and
D. Z. Chen, “Input augmentation with sam: Boosting
medical image segmentation with segmentation foun-
dation model,” in International Conference on Medical
Image Computing and Computer-Assisted Intervention,
Springer, 2023, pp. 129–139.
[11] Z. Li, H. Lyu, and J. Wang, “Fusionu-net: U-net with
enhanced skip connection for pathology image seg-
mentation,” in Asian Conference on Machine Learning,
PMLR, 2024, pp. 694–706.
[12] Z. Huang, Y. Zhao, Z. Yu, et al., “Biu-net: A dual-
branch structure based on two-stage fusion strategy for
biomedical image segmentation,” Computer Methods
and Programs in Biomedicine, vol. 252, p. 108 235,
2024.
[13] A. Goodman, A. Carpenter, E. Park, et al., 2018
data science bowl, Kaggle Competition, 2018. [Online].
Available: https : / / kaggle . com / competitions / data -
science-bowl-2018.
[14] A. R. Khan and A. Khan, “Maxvit-unet: Multi-axis at-
tention for medical image segmentation,” arXiv preprint
arXiv:2305.08396, 2023.
[15] K. C. T. Nguyen, “Segment automatically cells over
microscopic images based on deep learning techniques,”
Undergraduate thesis in VNU University of Science,
2021.
[16] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J.
Liang, “Unet++: A nested u-net architecture for medical
image segmentation,” in Deep learning in medical image
analysis and multimodal learning for clinical decision
support, Springer, 2018, pp. 3–11.
[17] A. Chaurasia and E. Culurciello, “Linknet: Exploiting
encoder representations for efficient semantic segmen-
tation,” in 2017 IEEE visual communications and image
processing (VCIP), IEEE, 2017, pp. 1–4.
[18] M. Z. Alom, C. Yakopcic, T. M. Taha, and V. K. Asari,
“Nuclei segmentation with recurrent residual convolu-
tional neural networks based u-net (r2u-net),” in NAE-
CON 2018-IEEE National Aerospace and Electronics
Conference, IEEE, 2018, pp. 228–233.