26 - Sentiment Analysis of Linguistic Cues To Assist Medical Image Classification
https://doi.org/10.1007/s11042-023-16538-9
Parminder Kaur ([email protected]), Avleen Kaur Malhi ([email protected]), Husanbir Singh Pannu ([email protected])
Affiliations: 1 Computer Science and Engineering Department, Thapar Institute of Engineering and Technology, Patiala, India; 2 Department of Computing and Informatics, Bournemouth University, Bournemouth, UK
Abstract
Image classification is a challenging problem and often suffers from the bottleneck of visual
features. With the ever-growing availability of multimedia data with the help of the Internet
and social platforms, many images are available along with their collateral text. These lin-
guistic keywords can be used as additional “sensors” to enhance efficiency while acting as
another mode of information. This article has proposed a framework to perform the sentiment
analysis on the rich textual information available from the linguistic cues of related images
and incorporate them to enhance image classification. The case study has been performed on
the binary classification of in-vivo gastral images and related text obtained from a known gas-
troenterologist. After the image classification is performed, there is a certain complex family
of images that often cannot be further classified. Thus, the classification accuracy is further
assisted by performing sentiment analysis using a Long Short Term Memory (LSTM) deep
learning network and Bag of Words. Experimental results of the proof-of-concept have been
compared with the state-of-the-art techniques to demonstrate the performance improvement
of the multi-modal system.
1 Introduction
The amount of data available for automatic pattern recognition has increased with social
media and the Internet in the form of multiple modalities such as image, text, audio, video,
and meta-data; however, learning systems involving these modalities and data fusion are
still in their infancy. For instance, text-based search through Google, Yahoo, and Bing is
convenient and fast but still involves mismatches, low relevance, and duplicate results.
1.1 Motivation
Visual content analysis is also under-constrained with the available visual features such as
color, texture, and boundaries. Survey work has shown that one modality can compensate for the
deficits of another, with a positive impact on the
integrated learning system. Content-based textual analysis can assist sentiment classification, interpretation, and selection, with the potential to enhance the performance of image
tion, interpretation, and selection with the possibility to enhance the performance of image
classification or retrieval systems. Images often carry contextual information incorporating
meta-data, tags, keywords, and captions, which are highly valuable and complement
image features by supplying missing information and expert opinions based upon subtle
visual features or historical patterns of image objects.
Content-based image retrieval often suffers from visual feature extraction, and thus an
additional mode of information, such as collateral text is required to classify images. Some-
times the information required is just not present in the image, so linguistic cues (from
experts) are useful. Biological images often suffer from various challenges such as moving
objects/cameras and changes in shape such as microscopic or capsule endoscopic images;
optimal intensity and contrast balance trade-off; casting light might change the intensity and
contrasts; light sources and cameras are not standard; colors can be ignored in microscopic
images but not in macroscopic ones; colors should be separated from their intensities (RGB to
LUV); and segmentation algorithms introduce their own errors. Image processing that employs
only image features is called uni-modal learning, whereas it is called bi-modal learning
when an association of image and text features is utilized. Bi-modal learning is claimed to
be more efficient than uni-modal learning [2].
Collateral text can be classified to assist the image classification in discerning confounding
images and reduce false positives as shown in Fig. 1. It shows that images may not be separated
correctly into healthy and sick classes merely using visual features. So, textual information
is required for efficient classification.
An instance of an endoscopic image along with the collateral text is depicted in Fig. 2.
Figure 3 describes the text analysis used to classify the remaining confounding images
which went unclassified due to the semantic gap between the visual features and the expert
knowledge of the field. The related keywords of the images thus help to further separate the
sick and bleeding categories.
1.2 Contributions
Most traditional visual and text learning methods have followed a unimodal paradigm
in which text features guide only the text classification or image features guide only the
image processing. However, medical image features are complex and cannot be automatically
extracted by the algorithm. They often need an expert to identify the symptoms and therefore
the descriptive linguistic cues play the role of an expert description. This paper has proposed
a multimodal learning technique in which a freely available image caption is used to guide
the medical image classification. The major contributions of the article are summarized as
follows:
Fig. 1 Image features cannot classify images into sick and healthy classes: (a) Active bleeding in small bowel
(b) False positive (air bubble) [1]
Fig. 3 Image analysis followed by text analysis to assist the overall classification for confounding cases
1. Sentiment analysis for the captions of images has been performed to assist image classi-
fication.
2. Gastral images and related captions have been obtained from a known gastroenterolo-
gist for the case study.
3. Use of machine learning techniques such as Bag of Words (BoW) and Long Short Term
Memory (LSTM) have been explored to analyze the text instances.
4. Accuracy comparison of image classification alone versus image + caption classification
has been performed to support the claim that collateral text adds useful information to
the visual features.
The rest of the article is organized as follows: Section 2 reviews the related literature
as motivation, Section 3 describes the background, Section 4 discusses the
proposed architecture, Section 5 presents the experimental case study, and Section 6 concludes
the paper.
2 Related work
This section briefly reviews the literature related to multi-modal techniques, followed by the
works related to LSTM methods.
In [2], image captions, titles, collateral text, and references of the underlying image pro-
vided in the article have been considered, along with image features for image annotation
and classification. In [3] multimedia social content has been compiled to form an image-
text database called TumEmo and emotion analysis has been performed with Multi-view
Attentional Network (MVAN).
Fusing visual features and textual aspects has been a point of attention in various applica-
tions to leverage performance, such as a content-aware ranking model for sports data studied
in [4]. With an increase in the amount of multi-modal data, preserving data privacy becomes
challenging. To overcome this challenge, a privacy protection technique dubbed deep adver-
sarial privacy-preserving cross-modal hashing has been introduced in [5]. It comprises a
deep cross-modal hashing model and a secure index structure. A universal weighting metric
learning framework has been proposed in [6] for an effective cross-modal retrieval process.
It can sample the informative modality pairs, and weight values are assigned to them as per
the similarity scores such that diverse pairs favor different penalty strengths. Two polynomial
losses are also introduced in this framework: a self-similarity loss and a relative-similarity loss.
The self-similarity polynomial loss provides a polynomial function that links the weight val-
ues with the self-similarity scores, however, weight values are linked with relative-similarity
scores in the relative-similarity polynomial loss.
Web video applications using contextual information to eliminate near duplicate real-time
video using thumbnails, view count, and video duration have been studied in [7]. Local points
and color information for content analysis help in the verification of the reported duplication.
It also helps to improve the tagging ability of the related videos, images, and textual content.
In [8], the Recipe1M+ food and cooking database has been utilized, involving 13 million
food images and over 1 million cooking recipes. A multi-modal system has been developed
for efficient image-recipe retrieval tasks. The proposed model, data, and source code are
publicly available. Heterogeneous graph embeddings are proposed in [9] to preserve the
modality-specific information and increase the cross-modal retrieval accuracy. One modality
embedding is compensated by the other modality’s aggregated embedding. The label noise
issue is reduced by constructing a self-denoising tree search that makes the heterogeneous
neighborhood more semantically relevant. A summary of recent works has been discussed
in Table 1 according to various application areas.
In [15] a survey of deep learning techniques for medical imaging and NLP (natural lan-
guage processing) has been studied. In [16], a self-supervised technique involving language
tasks, vision and NSP has been proposed to study rich medical images. Masked vision-
language modelling (MVLM) is used to extract text semantics for medical images with an
associated caption dataset. Two datasets, VQA-RAD and VQA-Med 2019, have been used to
demonstrate the case study; they involve radiology images and visual question answering
(VQA) and provide machine learning solutions that are interpretable through attention maps.
In [17], a BERT (Bidirectional Encoder Representations from Transformers) model has been used
which involves multiple modalities for image captioning and VQA. Unstructured reports have
been used along with radiology images, and a medical vision language learner (MedViLL)
has been developed. In [18], implicit sentiment text classification has been studied on the EWECT
and SMP2019 datasets using a text-to-picture (TTP) technique. In [19], the deconfounded visio-linguistic
BERT (DeVLBert) framework has been proposed to increase the generalizability of
visio-linguistic representation learning using causal intervention.
Since its inception in 1995, several variants of Long Short Term Memory (LSTM), a type of Recur-
rent Neural Network (RNN), have been proposed [20]. For LSTM networks, an adaptive forget
gate was introduced in [21], which allows a cell to reset itself and release its resources at appropriate
instances, with performance superior to standard RNN algorithms. LSTM
has been used for disambiguation in the Punjabi language to identify the accurate contextual
meaning in [22]. Word vectors for 66 ambiguous nouns have been considered for deep learn-
ing systems using unigram and bigram feature sets and a Punjabi language corpus. In [23],
an analysis of 8 variants of LSTM for polyphonic music modeling, handwritten text, and
speech recognition has been discussed. An ANOVA test has been performed to see the effect
of hyper-parameter tuning of all variants, which are separately optimized with the help of
random search. Empirical analysis has demonstrated the importance of output activation
and the forget gate as critical components. LSTM has been used for short-term traffic fore-
cast in [24]. It used a two-dimensional network with multiple memory units to incorporate a
temporal-spatial correlation traffic system. LSTM has also been studied in the power systems
for volatile load forecasting in [25] for short-term load prediction to assist customers and
grid operations and future planning. A CNN-LSTM model has been utilized in [26] for accurate
prediction of gas field production based on a gas field in southwest China. It is a vital task
for reservoir engineers, and its prediction is difficult because of multiple unknown reser-
voir parameters. CNN is utilized for feature extraction and LSTM for learning the sequence
dependence.
For real-time data collection, a three-layered health care system has been proposed in
[27], including preprocessing as well as transmission. It utilizes the Internet of Things, cloud
computing and fog computing for end users in current and future applications.
In [27], three cloud-oriented AI-driven models (collectively named D-espy) have been studied for
COVID-19 detection and prevention: Stacked LSTM, Vanilla LSTM and ARIMA, using the JHU
dataset.
For text classification, a deep pyramid CNN has been studied in [28], CNN for sentence
classification has been studied in [29], recurrent CNN for text classification in [30] and a
transformer model has been proposed in [31].
Deep learning has gained wide popularity in medical imaging diagnostics in recent years,
but information retrieval remains under-constrained when it relies on visual features alone. This is due to
the requirement of a large labeled input dataset; the variability in contrast, resolution, colors, noise,
artifacts, and blur in the case of in-vivo endoscopy images [32]; and the lack of expert knowledge in
the machine to interpret the medical images. Moreover, the data often come from different machines,
models, and formats, which causes a shift in the data distribution. Thus generality is lacking for
deep learning models, and performance is poor if they rely on just one modality, i.e. the
image. Machine learning models heavily depend on the statistical patterns of the data distribution
and the model parameters, which get disturbed by varying image acquisition instances and
protocols [32]. Therefore an additional feature, such as a supporting legend, is often useful
with a given image. Text has steady features, structured grammar and more detail compared
to an image, which yields only a conclusive view and may even convey different meanings to
different people according to their background knowledge. Thus text provides language
to the vision and enhances the algorithmic learning [17].
3 Background
This section discusses image/text classification and the LSTM model used for sentiment
analysis in the proposed technique. The goal is to collect data in the form of
images and related referral collateral text. A snapshot of such an example is shown
in Fig. 2. Images and text are classified separately, and an example of collateral text
prepared for sentiment analysis is depicted in Table 2. Prominent metrics
Table 2 Collateral text associated with the normal and sick images (columns: Sr., Type, Collateral Text)
used for text analysis, such as binary or frequency features, the weirdness coefficient, and TF-IDF,
are summarized in Table 3.
Text features are in the form of strings, so we first need to convert these string features into
numerical features, for which the following methods can be utilized.

3.1.1 Bag of Words

Bag of Words is a basic model used in natural language processing. The order of the words
in the document is discarded, and it just indicates the presence of a word in the document.
For example, Fig. 4 shows the text matrix generated from five example sentences using Bag of Words.
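As a rough illustration of this step (a Python sketch, not the MATLAB pipeline used later in the paper; the caption-like sentences are hypothetical), scikit-learn's CountVectorizer builds such a word-count matrix:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical caption-like sentences (not from the study dataset).
captions = [
    "active bleeding in the small bowel",
    "normal mucosa with no bleeding",
    "air bubble mimicking a bleeding spot",
]

# Bag of Words: word order is discarded, only per-document word counts are kept.
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(captions)

print(vectorizer.get_feature_names_out())  # vocabulary (matrix columns)
print(bow_matrix.toarray())                # one count vector per sentence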
3.1.2 TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency which explains the impor-
tance of the word in the given corpus or data. It incorporates two concepts: Term Frequency
(TF) and Inverse Document Frequency (IDF). Term Frequency is defined as how frequently
a word appears in the document or corpus. Let $f_{t,d}$ represent the frequency of the term $t$ in
the document $d$. The term frequency $tf(t, d)$ means the term count in the given document
and is defined as [35]:

$$tf(t, d) = \frac{f_{t,d}}{\max\{f_{s,d} : s \in d\}} \qquad (1)$$

To avoid bias due to a larger document, it is scaled by dividing by the count of the most frequent
term of the document. Let $N$ be the total number of documents in the corpus such that $N = |D|$. So
$\{d \in D : t \in d\}$ means the documents in the corpus in which the term $t$ appears. Then $idf(t, D)$
is defined as:

$$idf(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \qquad (2)$$
Fig. 4 Example of text matrix generated using five sentences and Bag of Words
Higher TF-IDF means that the word or term is more important in the given document.
$$tfidf(t, d, D) = tf(t, d) \times idf(t, D) \qquad (3)$$
So term frequency is the count of a word in the document, and document frequency (DF)
means how many documents contain that word. IDF will be low for stop words that occur most
often, such as “the”, “is”, “a”, etc.
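As a worked sketch of (1)-(3) in Python (the tokenized documents are invented for illustration and are not the study's captions):

import math
from collections import Counter

# Hypothetical tokenized documents.
docs = [
    ["active", "bleeding", "in", "small", "bowel"],
    ["normal", "mucosa", "no", "bleeding"],
    ["air", "bubble", "no", "lesion"],
]

def tf(term, doc):
    counts = Counter(doc)
    return counts[term] / max(counts.values())    # eq. (1): scaled by the most frequent term

def idf(term, docs):
    df = sum(1 for d in docs if term in d)        # number of documents containing the term
    return math.log(len(docs) / df)               # eq. (2)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)        # eq. (3)

print(tf_idf("bleeding", docs[0], docs))  # appears in several documents, so lower idf
print(tf_idf("bowel", docs[0], docs))     # rare term, so higher score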
3.1.3 Word2Vec
Word2Vec derives word embeddings using a family of shallow neural network models with
only two layers that capture the linguistic context of words. After processing a large text corpus, it
yields a vector space in which each word is assigned a distinct vector. The idea is to place
similar words closer together in the space, measured for example with
the cosine similarity metric. Details about Word2vec can be
found in [36].
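A minimal sketch with the gensim library is shown below; the toy corpus and parameter values are assumptions, and a real corpus would be far larger:

from gensim.models import Word2Vec

# Hypothetical tokenized captions.
sentences = [
    ["active", "bleeding", "small", "bowel"],
    ["normal", "mucosa", "no", "bleeding"],
    ["air", "bubble", "false", "positive"],
]

# Two-layer neural embedding model; vector_size is the embedding dimensionality.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100)

print(model.wv["bleeding"][:5])                  # first components of the word vector
print(model.wv.similarity("bleeding", "bowel"))  # cosine similarity between two words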
A long short-term memory network is a modified version of the Recurrent Neural Network
(RNN) that addresses its long-term dependency problem through a memory gate. It enables long-
term temporal dynamics and complex long sequences. Memory cells ($c_i$) make up the repeated
module structure of LSTM, which resembles a chain and avoids exploding-gradient situations
by controlling the gradient flow. Memory cells are linearly dependent in LSTM, i.e., $c_i$ with $c_{i+1}$,
where $i$ is the state index of the cell. Cell states are functions of the current information, which
can be controlled through inclusion-exclusion operations. Figure 5 demonstrates the gate
structure of LSTM, where the different elements are described below:
• The sigmoid layer is the forget gate f (t) to decide about removing the past information
from the state of a cell through computation of the current state of the memory cell. The
range of the Sigmoid function defines retaining or discarding the information from the
Fig. 5 Basic LSTM structure and involved function [37]. The activation function tanh has been used for φ,
and σ is the Sigmoid function
binary output 1 and 0, respectively. Let $W \in \mathbb{R}^{h \times d}$ and $w \in \mathbb{R}^{h \times h}$ be weight matrices and
$b \in \mathbb{R}^{h}$ a bias, calculated at the time of training, where $d$ and $h$ denote the
cardinality of the feature set and the number of hidden units. $f \in \mathbb{R}^{h}$ is the forget gate, $i \in \mathbb{R}^{h}$ is
the input gate, and $c^1, c \in \mathbb{R}^{h}$ are the cell input activation and the cell state vector.

$$f_t = \sigma\left(b_f + W_f \cdot [w_t, h_{t-1}]\right) \qquad (4)$$
• The input word $w_t$ is processed through the Sigmoid function using the input gate, the input bias $b_i$,
and the previous hidden state $h_{t-1}$ to decide whether or not the information is preserved.
Equations (5), (6), (7) describe, respectively, the input generated through the input gate; the tanh function
applied to obtain the cell input activation; and the derivation of the new memory state by scaling the
current information $i_t$ and the cell input activation $c^1$ while incorporating the forget gate
$f_t$. Finally, the summation of the input-gate and forget-gate terms in (7) generates the final
state of the memory.

$$i_t = \sigma\left(b_i + W_i \cdot [w_t, h_{t-1}]\right) \qquad (5)$$
$$c^1 = \tanh\left(b_c + W_c \cdot [w_t, h_{t-1}]\right) \qquad (6)$$
$$c_t = c^1 \times i_t + c_{t-1} \times f_t \qquad (7)$$

where “$\times$” is the point-wise multiplication operator.
• Finally, the exposure or output gate determines which cell state information acts as the
final output by separating the final memory from the hidden state. Let the output $o_t$ be the
final gate information and $h_{t-1}$ the last hidden state for the input word $w_t$; then (8) and (9)
define the output gate information and the new hidden state:

$$o_t = \sigma\left(b_o + W_o \cdot [w_t, h_{t-1}]\right) \qquad (8)$$
$$h_t = \tanh(c_t) \times o_t \qquad (9)$$
Thus based on the context dependence of the neighboring words, LSTM captures the
sequence of the words.
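A minimal NumPy sketch of a single LSTM time step following (4)-(9) is given below. For brevity it uses one combined weight matrix per gate acting on the concatenation [w_t, h_{t-1}] (the paper separates W and w), and the toy dimensions and random weights are purely illustrative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(w_t, h_prev, c_prev, params):
    """One LSTM time step following equations (4)-(9); w_t is the embedded input word."""
    x = np.concatenate([w_t, h_prev])                          # [w_t, h_{t-1}]
    f_t = sigmoid(params["b_f"] + params["W_f"] @ x)           # forget gate, eq. (4)
    i_t = sigmoid(params["b_i"] + params["W_i"] @ x)           # input gate, eq. (5)
    c_1 = np.tanh(params["b_c"] + params["W_c"] @ x)           # cell input activation, eq. (6)
    c_t = c_1 * i_t + c_prev * f_t                             # new cell state, eq. (7)
    o_t = sigmoid(params["b_o"] + params["W_o"] @ x)           # output gate, eq. (8)
    h_t = np.tanh(c_t) * o_t                                   # new hidden state, eq. (9)
    return h_t, c_t

# Toy dimensions: d-dimensional word vector, h hidden units (random weights for illustration).
d, h = 4, 3
rng = np.random.default_rng(0)
params = {name: rng.standard_normal((h, d + h)) for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: np.zeros(h) for name in ("b_f", "b_i", "b_c", "b_o")})
h_t, c_t = lstm_step(rng.standard_normal(d), np.zeros(h), np.zeros(h), params)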
4 Proposed technique
The proposed model incorporates text features into the image classification task, which
suggests implicit features of the image. Usually, image classification is performed by image
features and text classification is learned by text features, but this article highlights the coupling of both modalities, as illustrated in the workflow of Fig. 6.
Fig. 6 Work flow diagram to illustrate the linguistic cues assisting image classification
4.1 Architecture
The proposed architecture has been demonstrated in Fig. 7. Images and captions are input
to the system. Each caption is converted into useful keywords by removing punctuations,
Fig. 7 Architecture of the proposed bi-modal system for image and text classification
stop words, and words that are too common or too rare with regard to a lexicon of medical
vocabulary; the keywords are then vectorized using Bag of Words as input to the LSTM model for sentiment
analysis. On the parallel track, image vectorization is performed using wavelet transforms and
Zernike moments, and the resulting vectors are used by an ANN for classification. If the image label is
neutral, then the label of the underlying caption is obtained from the LSTM model. The final
accuracy is calculated by taking the union of the decisions made by the image and text classifiers.
For the LSTM (text classifier), word embedding dimensions = 100, epochs = 50, fully connected
layers = 2, hidden units = 180, learning rate = 0.05. For the ANN (image classifier), the Levenberg-
Marquardt training algorithm, mean squared error, and 10 hidden layers were used, which are
the standard settings of MATLAB's curve fitting app for neural networks. Sentiment analysis
of the linguistic cues provided by collateral text along with the medical images is fused
to the image segmentation to improve the classification accuracy. The case study includes
300 images and captions which are obtained from a known gastroenterologist. These 300
instances are processed for image classification and text classification separately. For image
segmentation, we used Wavelet Transforms for denoising, Zernike moments for vectorization,
and a neural network for classification. For text analysis, Bag of Words on preprocessed data
followed by machine learning techniques (including LSTM) has been applied. Any other set
of algorithms can also be adopted for the feature extraction and classification stages.
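To make the reported text-classifier settings concrete, the following Keras sketch is an assumed stand-in for the MATLAB LSTM used in the case study; the vocabulary size, the width of the intermediate fully connected layer, and the choice of SGD optimizer are illustrative assumptions, while the embedding dimension, hidden units, number of fully connected layers, learning rate, and epochs follow the values stated above:

import tensorflow as tf

vocab_size, num_classes = 5000, 2  # placeholders; the real vocabulary comes from the BoW step

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 100),                 # word embedding dimensions = 100
    tf.keras.layers.LSTM(180),                                  # hidden units = 180
    tf.keras.layers.Dense(64, activation="relu"),               # first fully connected layer (width assumed)
    tf.keras.layers.Dense(num_classes, activation="softmax"),   # second fully connected layer
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.05),  # learning rate = 0.05
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=50, validation_split=0.1)        # epochs = 50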
After the decision of image segmentation has been made to predict the bleeding cases
(positives), the sentiment analysis is tested for the detected negative cases to detect new
positive sick cases. Figure 8 illustrates the steps followed in the sentiment analysis of the collateral
text (available with the images), which is inspired by [38]. The idea is to catch as many positive
(sick) cases as possible through image and text classification. This is because sensitivity
has been prioritized over specificity for delicate medical situations. Algorithm 1 shows the
step-by-step procedure to implement the proposed approach.
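A minimal sketch of this union rule, assuming binary labels with 1 = sick/bleeding and 0 = healthy (illustrative Python, not the paper's MATLAB implementation):

def fuse_predictions(image_pred, text_pred):
    """Bi-modal decision: an instance is flagged sick (1) if either modality says so.

    This favours sensitivity: a bleeding case missed by the image classifier can
    still be caught by the sentiment label of its caption.
    """
    return [max(i, t) for i, t in zip(image_pred, text_pred)]

# Hypothetical labels for four instances.
image_pred = [1, 0, 0, 1]   # image classifier output
text_pred  = [1, 1, 0, 0]   # LSTM sentiment on the matching captions
print(fuse_predictions(image_pred, text_pred))  # [1, 1, 0, 1]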
5 Experiments
This section discusses the system description, dataset, and empirical analysis.
Fig. 9 Raw data versus cleaned data using pre-processing (reduction = 68.9%)
else
    foreach caption k of the captions do
        Tokenize(k)
        Erase punctuations, remove stop words, empty words
        foreach caption do
            Remove too short/long/infrequent words using lexicon
        end
        Text vectorization using Bag of Words
    end
    for numeric data n do
        LSTM_Train(n)
        foreach Params p ∈ P, set do
            word embedding dimensions = 100
            epochs = 50
            fully connected layers = 2
            hidden units = 180
            learning rate = 0.05
        end
        LSTM_Testing
        Calculate Confusion Matrix
        Calculate Accuracy_text
        Accuracy ← Accuracy_image ∪ Accuracy_text
    end
end
end
Pre-processing of the raw captions associated with images includes tokenization, lemma-
tization, creation of the BoW model, removal of infrequent words, too short/long words,
punctuations, and finally, calculation of the reduction ratio. Closed-class words
with no information content (the, a, is, etc.) are removed to reduce the sparsity of the text vectors. There
is a need to extract the crucial information from the text while considering the rarity of the
Fig. 13 Training accuracy with blue and dotted lines for each iteration and the average, respectively. Average
validation accuracy for 5 trials is 96.67%
Fig. 14 Loss curve with dotted average and the red solid line for the iterations. Loss almost becomes 0 after
15 iterations
word in the given text relative to all texts. The frequency of the words must be analyzed to
avoid acronyms and spelling mistakes.
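A hedged Python sketch of this cleaning step is shown below; lemmatization and the medical lexicon are omitted, and the length/frequency thresholds and example captions are assumptions:

import re
from collections import Counter

def clean_captions(captions, stop_words, min_len=3, max_len=15, min_freq=2):
    """Sketch of caption pre-processing: punctuation removal, stop-word and
    rare/too-short/too-long word filtering, plus the overall reduction ratio."""
    tokenized = [re.findall(r"[a-z]+", c.lower()) for c in captions]
    freq = Counter(w for doc in tokenized for w in doc)
    cleaned = [
        [w for w in doc
         if w not in stop_words and min_len <= len(w) <= max_len and freq[w] >= min_freq]
        for doc in tokenized
    ]
    raw_count = sum(len(doc) for doc in tokenized)
    kept_count = sum(len(doc) for doc in cleaned)
    reduction = 1.0 - kept_count / raw_count   # fraction of tokens removed
    return cleaned, reduction

captions = ["Active bleeding seen in the small bowel.",
            "The mucosa is normal, no bleeding is seen."]
cleaned, reduction = clean_captions(captions, stop_words={"the", "is", "in", "no"})
print(cleaned, round(reduction, 2))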
The architecture of the proposed method has been shown in Fig. 7. Algorithm 1 has been
applied to the dataset of 300 images and captions (examples shown in Fig. 2 and Table 2). A
summary of the process flow of sentiment analysis has been depicted in Fig. 8, starting from
the raw text captions, pre-processing, model training, category prediction for the captions,
and the final result during deployment. Pre-processing of the text includes removal of closed-
class words and punctuation, lemmatization, removal of rare words, vectorization using Bag of
Words, and finally feeding the vectors into the LSTM machine learning model. A snapshot of the
pre-processing word cloud and the binary categories of the cleaned data is summarized
in Figs. 9 and 10. The data distribution of the normal and sick classes is shown
in Fig. 11. A histogram of the lengths of the individual raw captions is shown in
Fig. 12. All these figures and the data analysis have been produced using MATLAB (2020)
software.
The LSTM model has been trained, and the accuracy and loss curves are depicted in Figs. 13
and 14, respectively. It can be observed that the training accuracy increases and becomes
stable after 10 iterations. Similarly, the loss curve becomes stable after 10 iterations. Finally,
the results of sentiment analysis have been summarized in Table 4. LSTM is the winner
Table 4 Performance comparison of sentiment analysis of linguistic cues on the test data (train:test = 270:30). Columns: Sr., Method, Test Accuracy (%)
Fig. 15 Confusion matrix of the test results for LSTM. 30 captions are used for testing and 270 for training
(training:testing ratio = 9 : 1)
having the highest accuracy as compared to other machine learning models, and thus has
been highlighted in the table. The confusion matrix for LSTM results has been shown in
Fig. 15. After choosing LSTM as the best text classifier for the image captions,
the image segmentation has been performed using Wavelet transforms, Zernike moments,
and neural networks. The accuracy results have been reported in Table 5 for image, text
classification, and image + text classification. Image + text has the highest accuracy, which
has been represented in bold. Figure 16 shows the comparison of the proposed sentiment
analysis approach with other recent techniques.
For the images which were detected as negative (healthy), their respective captions were
tested through the proposed sentiment analysis model to update the accuracy measure of the
image segmentation results. This resulted in some accuracy boost, as shown in Table 5. With the
labels for sick and normal being 1 and 0 (for true positive and true negative), the final decision
is the union (logical OR) of the image and text predictions, which explains the updated accuracy.
It means that if the bleeding instance went undetected through the image model, it should
be detected by the text analysis model. Thus both modalities help to detect the true positive
to confirm the detection of sick cases. Incorporating more modalities besides image
and text could contribute to the learning accuracy if orchestrated properly using data fusion
methods, appropriate feature selection, and learning models.
6 Conclusion
A bi-modal framework for text sentiment analysis to assist image classification has been
proposed. The claim is that image classification alone can be under-constrained if it relies only
on visual features. Evidence provided by linguistic cues has been exploited using the text
features, BoW and LSTM to learn the health status provided by the collateral text to assist
the image classification task. Candidate terms were selected while studying the linguistic
evidence from the real data provided by a known gastroenterologist. The obtained results
demonstrate the improved performance of utilizing the textual features for image analysis
and classification in complex situations where visual features are insufficient.
The assumption underlying the technique is the availability of medical images and
expert descriptions in the form of captions. The annotation of medical images is time con-
suming and expensive in terms of the manual hours of labor for technical experts, since the
medical domain is entirely different from common object images. LSTM is efficient at learn-
ing complex relationships within the data due to its long-term information memory; its
activation functions are robust and do not suffer from vanishing gradients, but it also has some
limitations. LSTMs are complex, not well suited to non-sequential online input data, need
large training data, are slow to train, and are not efficient if the data contain a lot of noise [43]. The
classification accuracy is still not 100%, which means there is still room for improvement
with better feature extraction and classification algorithms.
This research is useful for medical practitioners, as image classification assistance;
for patients, to obtain a preliminary diagnosis through automatic app-based analysis of lab reports
of images and descriptions; for medical interns, to relate image features with the technical
terminology; and for AI experts, to extend this model by incorporating other modalities
besides image and text towards a robotic system to assist humanity. The future plan is to improve
the proposed algorithm by utilizing better feature extraction and classification techniques and
exploring novel multimodal association methods.
Acknowledgements Authors are thankful to (a) Dr. Sunil Arya, Gastroenterologist at Leela Bhawan Patiala,
and Dr. G.S. Sidhu at Max Hospital Mohali, India, for the dataset and technical feedback; and (b) Professor
Khurshid Ahmad, Trinity College Dublin Ireland, for research direction.
Data Availability The dataset analyzed during the current study is not publicly available due to medical data
privacy.
Declarations
Competing interests The authors declare that they have no known competing interests.
References
1. Boal Carvalho P, Magalhães J, Dias de Castro F, Monteiro S, Rosa B, Moreira MJ, et al (2017) Suspected
blood indicator in capsule endoscopy: a valuable tool for gastrointestinal bleeding diagnosis. Arquivos
de gastroenterologia 54(1):16–20
2. Jing M, Scotney BW, Coleman SA, McGinnity MT, Zhang X, Kelly S et al (2016) Integration of text
and image analysis for flood event image recognition. In: 2016 27th Irish Signals and Systems Conference
(ISSC). IEEE; pp 1–6
3. Yang X, Feng S, Wang D, Zhang Y (2020) Image-text multimodal emotion classification via multi-view
attentional network. IEEE Trans Multimedia 23:4014–4026
4. Shih HC (2017) A survey of content-aware video analysis for sports. IEEE Trans Circuits Syst Video
Technol 28(5):1212–1231
5. Zhu L, Song J, Yang Z, Huang W, Zhang C, Yu W (2022) DAP2CMH: Deep Adversarial Privacy-
Preserving Cross-Modal Hashing. Neural Processing Letters 54(4):2549–2569
6. Wei J, Yang Y, Xu X, Zhu X, Shen HT (2021) Universal weighting metric learning for cross-modal
retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence
7. Shen L, Hong R, Hao Y (2020) Advance on large scale near-duplicate video retrieval. Frontiers of
Computer Science 14(5):145702
8. Marin J, Biswas A, Ofli F, Hynes N, Salvador A, Aytar Y et al (2019) Recipe1m+: A dataset for learning
cross-modal embeddings for cooking recipes and food images. IEEE transactions on pattern analysis and
machine intelligence
9. Chen D, Wang M, Chen H, Wu L, Qin J, Peng W (2022) Cross-Modal Retrieval with Heterogeneous Graph
Embedding. In: Proceedings of the 30th ACM International Conference on Multimedia 3291–3300
10. Pitcher BJ, Briefer EF, Baciadonna L, McElligott AG (2017) Cross-modal recognition of familiar con-
specifics in goats. Royal Society open science 4(2):160346
11. Frermann L, Cohen SB, Lapata M (2018) Whodunnit? crime drama as a case for natural language under-
standing. Trans Assoc Comput Linguist 6:1–15
12. Tripathi P, Watwani PP, Thakur S, Shaw A, Sengupta S (2018) Discover Cross-Modal Human Behav-
ior Analysis. In: 2018 Second International Conference on Electronics, Communication and Aerospace
Technology (ICECA). IEEE 1818–1824
13. Calhoun VD, Sui J (2016) Multimodal fusion of brain imaging data: a key to finding the missing link (s) in
complex mental illness. Biological psychiatry: cognitive neuroscience and neuroimaging 1(3):230–244
14. Goyal P, Sahu S, Ghosh S, Lee C (2020) Cross-modal Learning for Multi-modal Video Categorization.
arXiv:2003.03501
15. Pandey B, Pandey DK, Mishra BP, Rhmann W (2022) A comprehensive survey of deep learning in the
field of medical imaging and medical natural language processing: Challenges and research directions. J
King Saud Univ Comput Inf 34(8):5083–5099
16. Khare Y, Bagal V, Mathew M, Devi A, Priyakumar UD, Jawahar C (2021) Mmbert: Multimodal bert
pretraining for improved medical vqa. In: 2021 IEEE 18th International Symposium on Biomedical
Imaging (ISBI). IEEE 1033–1036
17. Moon JH, Lee H, Shin W, Kim YH, Choi E (2022) Multi-modal understanding and generation for medical
images and text via vision-language pre-training. IEEE J Biomed Health Inform 26(12):6070–6080
18. Chen M, Ubul K, Xu X, Aysa A, Muhammat M (2022) Connecting text classification with image
classification: a new preprocessing method for implicit sentiment text classification. Sensors 22(5):1899
19. Zhang S, Jiang T, Wang T, Kuang K, Zhao Z, Zhu J et al (2020) Devlbert:Learning deconfounded visio-
linguistic representations. In: Proceedings of the 28th ACM International Conference on Multimedia
4373–4382
20. Kaliyar RK, Goswami A, Narang P (2021) FakeBERT: Fake news detection in social media with a
BERT-based deep learning approach. Multimedia tools and applications 80(8):11765–11788
21. Staudemeyer RC, Morris ER (2019) Understanding LSTM-a tutorial into long short-term memory recur-
rent neural networks. arXiv:1909.09586
22. pal Singh V, Kumar P (2019) Word sense disambiguation for Punjabi language using deep learning
techniques. Neural Computing and Applications 1–11
23. Greff K, Srivastava RK, Koutník J, Steunebrink BR, Schmidhuber J (2016) LSTM: A search space odyssey.
IEEE Trans Neural Netw Learn Syst 28(10):2222–2232
24. Zhao Z, Chen W, Wu X, Chen PC, Liu J (2017) LSTM network: a deep learning approach for short-term
traffic forecast. IET Intell Transp Syst 11(2):68–75
25. Kong W, Dong ZY, Jia Y, Hill DJ, Xu Y, Zhang Y (2017) Short-term residential load forecasting based
on LSTM recurrent neural network. IEEE Transactions on Smart Grid 10(1):841–851
26. Zha W, Liu Y, Wan Y, Luo R, Li D, Yang S et al (2022) Forecasting monthly gas field production based
on the CNN-LSTM model. Energy 2022–124889
27. Kumari A, Tanwar S, Tyagi S, Kumar N (2018) Fog computing for Healthcare 4.0 environment: Oppor-
tunities and challenges. Computers & Electrical Engineering 72:1–13
28. Johnson R, Zhang T (2017) Deep pyramid convolutional neural networks for text categorization. In:
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers) 562–570
29. Chen Y (2015) Convolutional neural network for sentence classification. University of Waterloo
30. Lai S, Xu L, Liu K, Zhao J (2015) Recurrent convolutional neural networks for text classification. In:
Proceedings of the AAAI conference on artificial intelligence 29
31. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN et al (2017) Attention is all you need.
Advances in neural information processing systems 30
32. Perone CS, Cohen-Adad J (2019) Promises and limitations of deep learning for medical image segmen-
tation. J Med Artif Intell 2(1):1–2
33. Manaka T, van Zyl T, Kar D (2022) Improving Cause-of-Death Classification from Verbal Autopsy
Reports. In: Artificial Intelligence Research: Third Southern African Conference, SACAIR 2022, Stel-
lenbosch, South Africa, December 5–9, 2022, Proceedings. Springer 46–59
34. Ölçer D, Taşkaya Temizel T (2022) Quality assessment of web-based information on type 2 diabetes.
Online Information Review 46(4):715–732
35. Devi MD, Saharia N (2023) Unsupervised tweets categorization using semantic and statistical features.
Multimedia Tools and Applications 82(6):9047–9064
36. Chen Q, Sokolova M (2021) Specialists, scientists, and sentiments: Word2Vec and Doc2Vec in analysis
of scientific and medical texts. SN Computer Science 2:1–11
37. Guo L, Li N, Jia F, Lei Y, Lin J (2017) A recurrent neural network based health indicator for remaining
useful life prediction of bearings. Neurocomputing 240:98–109
38. Gorr H (2020) Classify Sentiment of Tweets Using Deep Learning. MathWorks. May
21,2020;online https://www.mathworks.com/matlabcentral/fileexchange/68264-classify-sentiment-of-
tweets-using-deep-learning, MATLAB Central File Exchange
39. Patel R, Passi K (2020) Sentiment analysis on twitter data of world cup soccer tournament using machine
learning. IoT 1(2):14
40. Bilal M, Israr H, Shahid M, Khan A (2016) Sentiment classification of Roman-Urdu opinions using Naïve
Bayesian, Decision Tree and KNN classification techniques. J King Saud Univ Comput Inf 28(3):330–344
41. Jain PK, Pamula R, Srivastava G (2021) A systematic literature review on machine learning applications
for consumer sentiment analysis using online reviews. Computer science review 41:100413
42. Neogi AS, Garg KA, Mishra RK, Dwivedi YK (2021) Sentiment analysis and classification of Indian
farmers’ protest using twitter data. Int J Inf Manag Data Insights 1(2)
43. Manaswi NK (2018) RNN and LSTM. In: Deep Learning with Applications Using Python:
Chatbots and Face, Object, and Speech Recognition With TensorFlow and Keras 115–126
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and applicable
law.