26 - Sentiment Analysis of Linguistic Cues To Assist Medical Image Classification
https://doi.org/10.1007/s11042-023-16538-9
Parminder Kaur ([email protected]), Avleen Kaur Malhi ([email protected]), Husanbir Singh Pannu ([email protected])
Affiliations: 1 Computer Science and Engineering Department, Thapar Institute of Engineering and Technology, Patiala, India; 2 Department of Computing and Informatics, Bournemouth University, Bournemouth, UK
Abstract
Image classification is a challenging problem and often suffers from the bottleneck of visual
features. With the ever-growing availability of multimedia data with the help of the Internet
and social platforms, many images are available along with their collateral text. These lin-
guistic keywords can be used as additional “sensors” to enhance efficiency while acting as
another mode of information. This article has proposed a framework to perform the sentiment
analysis on the rich textual information available from the linguistic cues of related images
and incorporate them to enhance image classification. The case study has been performed on
the binary classification of in-vivo gastral images and related text obtained from a known gas-
troenterologist. After the image classification is performed, there is a certain complex family
of images that often cannot be further classified. Thus, the classification accuracy is further
assisted by performing sentiment analysis using a Long Short Term Memory (LSTM) deep
learning network and Bag of Words. Experimental results of the proof-of-concept have been
compared with the state-of-the-art techniques to demonstrate the performance improvement
of the multi-modal system.
1 Introduction
The amount of data available for automatic pattern recognition has increased with social
media and the Internet in the form of multiple modalities such as image, text, audio, video,
and meta-data; however, learning systems involving these modalities and data fusion are
still in their infancy. For instance, text-based search through Google, Yahoo, and Bing is
convenient and fast but still involves mismatches, low relevance, and duplicate results.
1.1 Motivation
Visual content analysis is also under-constrained with the available visual features such as
color, texture, and boundaries. Survey work has shown that one modality can compensate for the
deficits of another, with a positive impact on the
integrated learning system. Content-based textual analysis can assist sentiment classification, interpretation, and selection, with the potential to enhance the performance of image
tion, interpretation, and selection with the possibility to enhance the performance of image
classification or retrieval systems. Images often carry contextual information incorporating
meta-data, tags, keywords, and captions, which are highly valuable and complement
image features by supplying missing information and expert opinions based upon subtle
visual features or historical patterns of image objects.
Content-based image retrieval often suffers from visual feature extraction, and thus an
additional mode of information, such as collateral text is required to classify images. Some-
times the information required is just not present in the image, so linguistic cues (from
experts) are useful. Biological images often suffer from various challenges such as moving
objects/cameras and changes in shape such as microscopic or capsule endoscopic images;
optimal intensity and contrast balance trade-off; casting light might change the intensity and
contrasts; light sources and cameras are not standard; colors can be ignored in microscopic
images but not in macroscopic ones; colors should be separated from their intensities (RGB to
LUV); and segmentation algorithms introduce their own errors. Image processing that employs
only image features is called uni-modal learning, whereas it is called bi-modal learning
when an association of image and text features is utilized. Bi-modal learning is claimed to
be more efficient than uni-modal learning [2].
Collateral text can be classified to assist the image classification in discerning confounding
images and reduce false positives as shown in Fig. 1. It shows that images may not be separated
correctly into healthy and sick classes merely using visual features. So, textual information
is required for efficient classification.
An instance of an endoscopic image along with the collateral text is depicted in Fig. 2.
Figure 3 describes the text analysis used to classify the remaining confounding images
which went unclassified due to the semantic gap between the visual features and the expert
knowledge of the field. The related keywords of the images thus help to further separate the
sick and bleeding categories.
1.2 Contributions
Most traditional visual and text learning methods have followed a unimodal paradigm
in which text features guide only the text classification or image features guide only the
image processing. However, medical image features are complex and cannot be automatically
extracted by the algorithm. They often need an expert to identify the symptoms and therefore
the descriptive linguistic cues play the role of an expert description. This paper has proposed
a multimodal learning technique in which a freely available image caption is used to guide
the medical image classification. The major contributions of the article are summarized as
follows:
Fig. 1 Image features cannot classify images into sick and healthy classes: (a) Active bleeding in small bowel
(b) False positive (air bubble) [1]
Fig. 3 Image analysis followed by text analysis to assist the overall classification for confounding cases
1. Sentiment analysis for the captions of images has been performed to assist image classi-
fication.
2. Gastral images and related captions have been obtained from a known gastroenterolo-
gist for the case study.
3. Use of machine learning techniques such as Bag of Words (BoW) and Long Short Term
Memory (LSTM) have been explored to analyze the text instances.
4. Accuracy comparison of image classification alone versus image + caption classification
has been performed to support the claim that collateral text adds useful information to
the visual features.
The rest of the article is organized as follows: Section 2 reviews the related literature
as motivation, Section 3 describes the background, Section 4 discusses the
proposed architecture, Section 5 presents the experimental case study, and Section 6 concludes
the paper.
2 Related work
This section briefly reviews the literature related to multi-modal techniques, followed by the
works related to LSTM methods.
In [2], image captions, titles, collateral text, and references of the underlying image pro-
vided in the article have been considered, along with image features for image annotation
and classification. In [3] multimedia social content has been compiled to form an image-
text database called TumEmo and emotion analysis has been performed with Multi-view
Attentional Network (MVAN).
Fusing visual features and textual aspects has been a point of attention in various applica-
tions to leverage performance, such as a content-aware ranking model for sports data studied
in [4]. With an increase in the amount of multi-modal data, preserving data privacy becomes
challenging. To overcome this challenge, a privacy protection technique dubbed deep adver-
sarial privacy-preserving cross-modal hashing has been introduced in [5]. It comprises a
deep cross-modal hashing model and a secure index structure. A universal weighting metric
learning framework has been proposed in [6] for an effective cross-modal retrieval process.
It can sample the informative modality pairs, and weight values are assigned to them as per
the similarity scores such that diverse pairs favor different penalty strengths. Two polynomial
losses are also introduced in this framework: a self-similarity loss and a relative-similarity loss.
The self-similarity polynomial loss provides a polynomial function that links the weight val-
ues with the self-similarity scores, however, weight values are linked with relative-similarity
scores in the relative-similarity polynomial loss.
Web video applications using contextual information to eliminate near duplicate real-time
video using thumbnails, view count, and video duration have been studied in [7]. Local points
and color information for content analysis help in the verification of the reported duplication.
It also helps to improve the tagging ability of the related videos, images, and textual content.
In [8], the Recipe1M+ food and cooking database has been utilized, involving 13 million
food images and over 1 million cooking recipes. A multi-modal system has been developed
for efficient image-recipe retrieval tasks. The proposed model, data, and source code are
publicly available. Heterogeneous graph embeddings are proposed in [9] to preserve the
modality-specific information and increase the cross-modal retrieval accuracy. One modality
embedding is compensated by the other modality’s aggregated embedding. The label noise
issue is reduced by constructing a self-denoising tree search that makes the heterogeneous
neighborhood more semantically relevant. A summary of recent works has been discussed
in Table 1 according to various application areas.
In [15] a survey of deep learning techniques for medical imaging and NLP (natural lan-
guage processing) has been studied. In [16], a self-supervised technique involving language
tasks, vision and NSP has been proposed to study rich medical images. Masked vision-
language modelling (MVLM) is used to extract text semantics for medical images with an
associated caption dataset. Two datasets, VQA-RAD and VQA-Med 2019, have been used to
demonstrate the case study; they involve radiology images and visual question answering
(VQA) and provide machine learning solutions that are interpretable through attention maps.
In [17], a BERT (Bidirectional Encoder Representations from Transformers) model has been used
which involves multiple modalities for image captioning and VQA. Unstructured reports have
been used along with radiology images, and a medical vision language learner (MedViLL)
has been developed. In [18], implicit sentiment text classification has been studied on the EWECT
and SMP2019 datasets using a text-to-picture (TTP) technique. In [19], the deconfounded visio-linguistic
BERT (DeVLBert) framework has been proposed to increase the generalizability of
visio-linguistic representation learning using causal intervention.
Since its inception in 1995, several variants of Long Short Term Memory (LSTM), a type of Recur-
rent Neural Network (RNN), have been proposed [20]. For LSTM networks, an adaptive forget
gate was introduced in [21], which allows a cell to reset itself and release its resources at appropriate
instances, with performance superior to standard RNN algorithms. LSTM
has been used for disambiguation in the Punjabi language to identify the accurate contextual
meaning in [22]. Word vectors for 66 ambiguous nouns have been considered for deep learn-
ing systems using unigram and bigram feature sets and a Punjabi language corpus. In [23],
an analysis of 8 variants of LSTM for polyphonic music modeling, handwritten text, and
speech recognition has been discussed. An ANOVA test has been performed to see the effect
of hyper-parameter tuning of all variants, which are separately optimized with the help of
random search. Empirical analysis has demonstrated the importance of output activation
and the forget gate as critical components. LSTM has been used for short-term traffic fore-
cast in [24]. It used a two-dimensional network with multiple memory units to incorporate a
temporal-spatial correlation traffic system. LSTM has also been studied in the power systems
for volatile load forecasting in [25] for short-term load prediction to assist customers and
grid operations and future planning. A CNN-LSTM model has been utilized in [26] for accurate
prediction of gas field production based on a gas field in southwest China. It is a vital task
for reservoir engineers, and its prediction is difficult because of multiple unknown reser-
voir parameters. CNN is utilized for feature extraction and LSTM for learning the sequence
dependence.
For real-time data collection, a three-layered health care system has been proposed in
[27], including preprocessing as well as transmission. It utilizes the Internet of Things, cloud
computing and fog computing for end users in current and future applications.
In [27], three cloud-oriented AI-driven models (collectively named D-espy) have been studied for
COVID-19 detection and prevention: Stacked LSTM, Vanilla LSTM and ARIMA, using the JHU
dataset.
For text classification, a deep pyramid CNN has been studied in [28], CNN for sentence
classification has been studied in [29], recurrent CNN for text classification in [30] and a
transformer model has been proposed in [31].
Deep learning has gained wide popularity in medical imaging diagnostics in recent years,
but information retrieval remains under-constrained when it relies on visual features alone. This is due to
the requirement of a large labeled input dataset; the variability in contrast, resolution, colors, noise,
artifacts, and blur in the case of in-vivo endoscopy images [32]; and the lack of expert knowledge in
the machine to interpret the medical images. Moreover, the data often come from different machines,
models, and formats, which causes a shift in the data distribution. Thus generality is lacking for
deep learning models, and performance is poor if they rely on just one modality, i.e. the
image. Machine learning models heavily depend on the statistical patterns of the data distribution
and the model parameters, which get disturbed by varying image acquisition instances and
protocols [32]. Therefore an additional feature, such as a supporting legend, is often useful
with a given image. Text has steady features, structured grammar and more detail compared
to an image, which yields only a conclusive view and may even convey different meanings to
different people according to their background knowledge. Thus text provides language
to the vision and enhances the algorithmic learning [17].
3 Background
This section discusses image/text classification and the LSTM model used for sentiment
analysis in the proposed technique. The goal is to collect data in the form of
images and related referral collateral text. A snapshot of such an example is shown
in Fig. 2. Images and text are classified separately, and an example of collateral text
prepared for sentiment analysis is depicted in Table 2. Prominent metrics
Table 2 Collateral text associated with the normal and sick images (columns: Sr., Type, Collateral Text)
used for text analysis, such as binary or frequency features, the weirdness coefficient, and TF-IDF,
are summarized in Table 3.
Text features are in the form of strings, so we first need to convert these string features into
numerical features, for which the following methods can be utilized.

3.1.1 Bag of Words

Bag of Words is a basic model used in natural language processing. The order of the words
in the document is discarded, and it just indicates the presence of a word in the document.
For example, Fig. 4 shows the text matrix generated from five example sentences using Bag of Words.
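As a rough illustration of this step (a Python sketch, not the MATLAB pipeline used later in the paper; the caption-like sentences are hypothetical), scikit-learn's CountVectorizer builds such a word-count matrix:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical caption-like sentences (not from the study dataset).
captions = [
    "active bleeding in the small bowel",
    "normal mucosa with no bleeding",
    "air bubble mimicking a bleeding spot",
]

# Bag of Words: word order is discarded, only per-document word counts are kept.
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(captions)

print(vectorizer.get_feature_names_out())  # vocabulary (matrix columns)
print(bow_matrix.toarray())                # one count vector per sentence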
3.1.2 TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency which explains the impor-
tance of the word in the given corpus or data. It incorporates two concepts: Term Frequency
(TF) and Inverse Document Frequency (IDF). Term Frequency is defined as how frequently
a word appears in the document or corpus. Let $f_{t,d}$ represent the frequency of the term $t$ in
the document $d$. The term frequency $tf(t, d)$ means the term count in the given document
and is defined as [35]:

$$tf(t, d) = \frac{f_{t,d}}{\max\{f_{s,d} : s \in d\}} \qquad (1)$$

To avoid bias due to a larger document, it is scaled by dividing by the count of the most frequent
term of the document. Let $N$ be the total number of documents in the corpus such that $N = |D|$. So
$\{d \in D : t \in d\}$ means the documents in the corpus in which the term $t$ appears. Then $idf(t, D)$
is defined as:

$$idf(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \qquad (2)$$
Fig. 4 Example of text matrix generated using five sentences and Bag of Words
Higher TF-IDF means that the word or term is more important in the given document.
$$tfidf(t, d, D) = tf(t, d) \times idf(t, D) \qquad (3)$$
So term frequency is the count of a word in the document, and document frequency (DF)
means how many documents contain that word. IDF will be low for stop words that occur most
often, such as “the”, “is”, “a”, etc.
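As a worked sketch of (1)-(3) in Python (the tokenized documents are invented for illustration and are not the study's captions):

import math
from collections import Counter

# Hypothetical tokenized documents.
docs = [
    ["active", "bleeding", "in", "small", "bowel"],
    ["normal", "mucosa", "no", "bleeding"],
    ["air", "bubble", "no", "lesion"],
]

def tf(term, doc):
    counts = Counter(doc)
    return counts[term] / max(counts.values())    # eq. (1): scaled by the most frequent term

def idf(term, docs):
    df = sum(1 for d in docs if term in d)        # number of documents containing the term
    return math.log(len(docs) / df)               # eq. (2)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)        # eq. (3)

print(tf_idf("bleeding", docs[0], docs))  # appears in several documents, so lower idf
print(tf_idf("bowel", docs[0], docs))     # rare term, so higher score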
3.1.3 Word2Vec
Word2Vec derives word embeddings using a family of shallow neural network models with
only two layers that capture the linguistic context of words. After processing a large text corpus, it
yields a vector space in which each word is assigned a distinct vector. The idea is to place
similar words closer together in the space, measured for example with
the cosine similarity metric. Details about Word2vec can be
found in [36].
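A minimal sketch with the gensim library is shown below; the toy corpus and parameter values are assumptions, and a real corpus would be far larger:

from gensim.models import Word2Vec

# Hypothetical tokenized captions.
sentences = [
    ["active", "bleeding", "small", "bowel"],
    ["normal", "mucosa", "no", "bleeding"],
    ["air", "bubble", "false", "positive"],
]

# Two-layer neural embedding model; vector_size is the embedding dimensionality.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100)

print(model.wv["bleeding"][:5])                  # first components of the word vector
print(model.wv.similarity("bleeding", "bowel"))  # cosine similarity between two words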
A long short-term memory network is a modified version of the Recurrent Neural Network
(RNN) that addresses its long-term dependency problem through a memory gate. It enables long-
term temporal dynamics and complex long sequences. Memory cells ($c_i$) make up the repeated
module structure of LSTM, which resembles a chain and avoids exploding-gradient situations
by controlling the gradient flow. Memory cells are linearly dependent in LSTM, i.e., $c_i$ with $c_{i+1}$,
where $i$ is the state index of the cell. Cell states are functions of the current information, which
can be controlled through inclusion-exclusion operations. Figure 5 demonstrates the gate
structure of LSTM, where the different elements are described below:
• The sigmoid layer is the forget gate f (t) to decide about removing the past information
from the state of a cell through computation of the current state of the memory cell. The
range of the Sigmoid function defines retaining or discarding the information from the
Fig. 5 Basic LSTM structure and involved function [37]. The activation function tanh has been used for φ,
and σ is the Sigmoid function
binary output 1 and 0, respectively. Let $W \in \mathbb{R}^{h \times d}$ and $w \in \mathbb{R}^{h \times h}$ be weight matrices and
$b \in \mathbb{R}^{h}$ a bias, calculated at the time of training, where $d$ and $h$ denote the
cardinality of the feature set and the number of hidden units. $f \in \mathbb{R}^{h}$ is the forget gate, $i \in \mathbb{R}^{h}$ is
the input gate, and $c^1, c \in \mathbb{R}^{h}$ are the cell input activation and the cell state vector.

$$f_t = \sigma\left(b_f + W_f \cdot [w_t, h_{t-1}]\right) \qquad (4)$$
• The input word $w_t$ is processed through the Sigmoid function using the input gate, the input bias $b_i$,
and the previous hidden state $h_{t-1}$ to decide whether or not the information is preserved.
Equations (5), (6), (7) describe, respectively, the input generated through the input gate; the tanh function
applied to obtain the cell input activation; and the derivation of the new memory state by scaling the
current information $i_t$ and the cell input activation $c^1$ while incorporating the forget gate
$f_t$. Finally, the summation of the input-gate and forget-gate terms in (7) generates the final
state of the memory.

$$i_t = \sigma\left(b_i + W_i \cdot [w_t, h_{t-1}]\right) \qquad (5)$$
$$c^1 = \tanh\left(b_c + W_c \cdot [w_t, h_{t-1}]\right) \qquad (6)$$
$$c_t = c^1 \times i_t + c_{t-1} \times f_t \qquad (7)$$

where “$\times$” is the point-wise multiplication operator.
• Finally, the exposure or output gate determines which cell state information acts as the
final output by separating the final memory from the hidden state. Let the output $o_t$ be the
final gate information and $h_{t-1}$ the last hidden state for the input word $w_t$; then (8) and (9)
define the output gate information and the new hidden state:

$$o_t = \sigma\left(b_o + W_o \cdot [w_t, h_{t-1}]\right) \qquad (8)$$
$$h_t = \tanh(c_t) \times o_t \qquad (9)$$
Thus based on the context dependence of the neighboring words, LSTM captures the
sequence of the words.
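A minimal NumPy sketch of a single LSTM time step following (4)-(9) is given below. For brevity it uses one combined weight matrix per gate acting on the concatenation [w_t, h_{t-1}] (the paper separates W and w), and the toy dimensions and random weights are purely illustrative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(w_t, h_prev, c_prev, params):
    """One LSTM time step following equations (4)-(9); w_t is the embedded input word."""
    x = np.concatenate([w_t, h_prev])                          # [w_t, h_{t-1}]
    f_t = sigmoid(params["b_f"] + params["W_f"] @ x)           # forget gate, eq. (4)
    i_t = sigmoid(params["b_i"] + params["W_i"] @ x)           # input gate, eq. (5)
    c_1 = np.tanh(params["b_c"] + params["W_c"] @ x)           # cell input activation, eq. (6)
    c_t = c_1 * i_t + c_prev * f_t                             # new cell state, eq. (7)
    o_t = sigmoid(params["b_o"] + params["W_o"] @ x)           # output gate, eq. (8)
    h_t = np.tanh(c_t) * o_t                                   # new hidden state, eq. (9)
    return h_t, c_t

# Toy dimensions: d-dimensional word vector, h hidden units (random weights for illustration).
d, h = 4, 3
rng = np.random.default_rng(0)
params = {name: rng.standard_normal((h, d + h)) for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: np.zeros(h) for name in ("b_f", "b_i", "b_c", "b_o")})
h_t, c_t = lstm_step(rng.standard_normal(d), np.zeros(h), np.zeros(h), params)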
4 Proposed technique
The proposed model incorporates text features into the image classification task, which
suggests implicit features of the image. Usually, image classification is performed by image
features and text classification is learned by text features, but this article highlights the coupling of both modalities, as illustrated in the workflow of Fig. 6.
Fig. 6 Work flow diagram to illustrate the linguistic cues assisting image classification
4.1 Architecture
The proposed architecture has been demonstrated in Fig. 7. Images and captions are input
to the system. Each caption is converted into useful keywords by removing punctuations,
Fig. 7 Architecture of the proposed bi-modal system for image and text classification
stop words, and words that are too common or too rare with regard to a lexicon of medical
vocabulary; the keywords are then vectorized using Bag of Words as input to the LSTM model for sentiment
analysis. On the parallel track, image vectorization is performed using wavelet transforms and
Zernike moments, and the resulting vectors are used by an ANN for classification. If the image label is
neutral, then the label of the underlying caption is obtained from the LSTM model. The final
accuracy is calculated by taking the union of the decisions made by the image and text classifiers.
For the LSTM (text classifier), word embedding dimensions = 100, epochs = 50, fully connected
layers = 2, hidden units = 180, learning rate = 0.05. For the ANN (image classifier), the Levenberg-
Marquardt training algorithm, mean squared error, and 10 hidden layers were used, which are
the standard settings of MATLAB's curve fitting app for neural networks. Sentiment analysis
of the linguistic cues provided by collateral text along with the medical images is fused
to the image segmentation to improve the classification accuracy. The case study includes
300 images and captions which are obtained from a known gastroenterologist. These 300
instances are processed for image classification and text classification separately. For image
segmentation, we used Wavelet Transforms for denoising, Zernike moments for vectorization,
and a neural network for classification. For text analysis, Bag of Words on preprocessed data
followed by machine learning techniques (including LSTM) has been applied. Any other set
of algorithms can also be adopted for the feature extraction and classification stages.
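To make the reported text-classifier settings concrete, the following Keras sketch is an assumed stand-in for the MATLAB LSTM used in the case study; the vocabulary size, the width of the intermediate fully connected layer, and the choice of SGD optimizer are illustrative assumptions, while the embedding dimension, hidden units, number of fully connected layers, learning rate, and epochs follow the values stated above:

import tensorflow as tf

vocab_size, num_classes = 5000, 2  # placeholders; the real vocabulary comes from the BoW step

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 100),                 # word embedding dimensions = 100
    tf.keras.layers.LSTM(180),                                  # hidden units = 180
    tf.keras.layers.Dense(64, activation="relu"),               # first fully connected layer (width assumed)
    tf.keras.layers.Dense(num_classes, activation="softmax"),   # second fully connected layer
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.05),  # learning rate = 0.05
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=50, validation_split=0.1)        # epochs = 50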
After the decision of image segmentation has been made to predict the bleeding cases
(positives), the sentiment analysis is tested for the detected negative cases to detect new
positive sick cases. Figure 8 illustrates the steps followed in the sentiment analysis of the collateral
text (available with the images), which is inspired by [38]. The idea is to catch as many positive
(sick) cases as possible through image and text classification. This is because sensitivity
has been prioritized over specificity for delicate medical situations. Algorithm 1 shows the
step-by-step procedure to implement the proposed approach.
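A minimal sketch of this union rule, assuming binary labels with 1 = sick/bleeding and 0 = healthy (illustrative Python, not the paper's MATLAB implementation):

def fuse_predictions(image_pred, text_pred):
    """Bi-modal decision: an instance is flagged sick (1) if either modality says so.

    This favours sensitivity: a bleeding case missed by the image classifier can
    still be caught by the sentiment label of its caption.
    """
    return [max(i, t) for i, t in zip(image_pred, text_pred)]

# Hypothetical labels for four instances.
image_pred = [1, 0, 0, 1]   # image classifier output
text_pred  = [1, 1, 0, 0]   # LSTM sentiment on the matching captions
print(fuse_predictions(image_pred, text_pred))  # [1, 1, 0, 1]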
5 Experiments
This section discusses the system description, dataset, and empirical analysis.
Fig. 9 Raw data versus cleaned data using pre-processing (reduction = 68.9%)
else
    foreach caption k of the captions do
        Tokenize(k)
        Erase punctuations, remove stop words, empty words
        foreach caption do
            Remove too short/long/infrequent words using lexicon
        end
        Text vectorization using Bag of Words
    end
    for numeric data n do
        LSTM_Train(n)
        foreach Params p ∈ P, set do
            word embedding dimensions = 100
            epochs = 50
            fully connected layers = 2
            hidden units = 180
            learning rate = 0.05
        end
        LSTM_Testing
        Calculate Confusion Matrix
        Calculate Accuracy_text
        Accuracy ← Accuracy_image ∪ Accuracy_text
    end
end
end
Pre-processing of the raw captions associated with images includes tokenization, lemma-
tization, creation of the BoW model, removal of infrequent words, too short/long words,
punctuations, and finally, calculation of the reduction ratio. Closed-class words
with no information content (the, a, is, etc.) are removed to reduce the sparsity of the text vectors. There
is a need to extract the crucial information from the text while considering the rarity of the
Fig. 13 Training accuracy with blue and dotted lines for each iteration and the average, respectively. Average
validation accuracy for 5 trials is 96.67%
Fig. 14 Loss curve with dotted average and the red solid line for the iterations. Loss almost becomes 0 after
15 iterations
word in the given text relative to all texts. The frequency of the words must be analyzed to
avoid acronyms and spelling mistakes.
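A hedged Python sketch of this cleaning step is shown below; lemmatization and the medical lexicon are omitted, and the length/frequency thresholds and example captions are assumptions:

import re
from collections import Counter

def clean_captions(captions, stop_words, min_len=3, max_len=15, min_freq=2):
    """Sketch of caption pre-processing: punctuation removal, stop-word and
    rare/too-short/too-long word filtering, plus the overall reduction ratio."""
    tokenized = [re.findall(r"[a-z]+", c.lower()) for c in captions]
    freq = Counter(w for doc in tokenized for w in doc)
    cleaned = [
        [w for w in doc
         if w not in stop_words and min_len <= len(w) <= max_len and freq[w] >= min_freq]
        for doc in tokenized
    ]
    raw_count = sum(len(doc) for doc in tokenized)
    kept_count = sum(len(doc) for doc in cleaned)
    reduction = 1.0 - kept_count / raw_count   # fraction of tokens removed
    return cleaned, reduction

captions = ["Active bleeding seen in the small bowel.",
            "The mucosa is normal, no bleeding is seen."]
cleaned, reduction = clean_captions(captions, stop_words={"the", "is", "in", "no"})
print(cleaned, round(reduction, 2))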
The architecture of the proposed method has been shown in Fig. 7. Algorithm 1 has been
applied to the dataset of 300 images and captions (examples shown in Fig. 2 and Table 2). A
summary of the process flow of sentiment analysis has been depicted in Fig. 8, starting from
the raw text captions, pre-processing, model training, category prediction for the captions,
and the final result during deployment. Pre-processing of the text includes removal of closed-
class words and punctuation, lemmatization, removal of rare words, vectorization using Bag of
Words, and finally feeding the vectors into the LSTM machine learning model. A snapshot of the
pre-processing word cloud and the binary categories of the cleaned data is summarized
in Figs. 9 and 10. The data distribution of the normal and sick classes is shown
in Fig. 11. A histogram of the lengths of the individual raw captions is shown in
Fig. 12. All these figures and the data analysis have been produced using MATLAB (2020)
software.
The LSTM model has been trained, and the accuracy and loss curves are depicted in Figs. 13
and 14, respectively. It can be observed that the training accuracy increases and becomes
stable after 10 iterations. Similarly, the loss curve becomes stable after 10 iterations. Finally,
the results of sentiment analysis have been summarized in Table 4. LSTM is the winner
Table 4 Performance comparison of sentiment analysis of linguistic cues on the test data (train:test = 270:30). Columns: Sr., Method, Test Accuracy (%)
Fig. 15 Confusion matrix of the test results for LSTM. 30 captions are used for testing and 270 for training
(training:testing ratio = 9 : 1)
having the highest accuracy as compared to other machine learning models, and thus has
been highlighted in the table. The confusion matrix for LSTM results has been shown in
Fig. 15. After choosing LSTM as the best text classifier for the image captions,
the image segmentation has been performed using Wavelet transforms, Zernike moments,
and neural networks. The accuracy results have been reported in Table 5 for image, text
classification, and image + text classification. Image + text has the highest accuracy, which
has been represented in bold. Figure 16 shows the comparison of the proposed sentiment
analysis approach with other recent techniques.
For the images which were detected as negative (healthy), their respective captions were
tested through the proposed sentiment analysis model to update the accuracy measure of the
image segmentation results. This resulted in some accuracy boost, as shown in Table 5. With the
labels for sick and normal being 1 and 0 (for true positive and true negative), the final decision
is the union (logical OR) of the image and text predictions, which explains the updated accuracy.
It means that if the bleeding instance went undetected through the image model, it should
be detected by the text analysis model. Thus both modalities help to detect the true positive
to confirm the detection of sick cases. Incorporating more modalities besides image
and text could contribute to the learning accuracy if orchestrated properly using data fusion
methods, appropriate feature selection, and learning models.
6 Conclusion
A bi-modal framework for text sentiment analysis to assist image classification has been
proposed. The claim is that image classification alone can be under-constrained if it relies only
on visual features. Evidence provided by linguistic cues has been exploited using the text
features, BoW and LSTM to learn the health status provided by the collateral text to assist
the image classification task. Candidate terms were selected while studying the linguistic
evidence from the real data provided by a known gastroenterologist. The obtained results
demonstrate the improved performance of utilizing the textual features for image analysis
and classification in complex situations where visual features are insufficient.
The assumption underlying the technique is the availability of medical images and
expert descriptions in the form of captions. The annotation of medical images is time con-
suming and expensive in terms of the manual hours of labor for technical experts, since the
medical domain is entirely different from common object images. LSTM is efficient at learn-
ing complex relationships within the data due to its long-term information memory; its
activation functions are robust and do not suffer from vanishing gradients, but it also has some
limitations. LSTMs are complex, not well suited to non-sequential online input data, need
large training data, are slow to train, and are not efficient if the data contain a lot of noise [43]. The
classification accuracy is still not 100%, which means there is still room for improvement
with better feature extraction and classification algorithms.
This research is useful for medical practitioners, as image classification assistance;
for patients, to obtain a preliminary diagnosis through automatic app-based analysis of lab reports
of images and descriptions; for medical interns, to relate image features with the technical
terminology; and for AI experts, to extend this model by incorporating other modalities
besides image and text towards a robotic system to assist humanity. The future plan is to improve
the proposed algorithm by utilizing better feature extraction and classification techniques and
exploring novel multimodal association methods.
Acknowledgements Authors are thankful to (a) Dr. Sunil Arya, Gastroenterologist at Leela Bhawan Patiala,
and Dr. G.S. Sidhu at Max Hospital Mohali, India, for the dataset and technical feedback; and (b) Professor
Khurshid Ahmad, Trinity College Dublin Ireland, for research direction.
Data Availability The dataset analyzed during the current study is not publicly available due to medical data
privacy.
Declarations
Competing interests The authors declare that they have no known competing interests.
References
1. Boal Carvalho P, Magalhães J, Dias de Castro F, Monteiro S, Rosa B, Moreira MJ, et al (2017) Suspected
blood indicator in capsule endoscopy: a valuable tool for gastrointestinal bleeding diagnosis. Arquivos
de gastroenterologia 54(1):16–20
2. Jing M, Scotney BW, Coleman SA, McGinnity MT, Zhang X, Kelly S et al (2016) Integration of text
and image analysis for flood event image recognition. In: 2016 27th Irish Signals and Systems Conference
(ISSC). IEEE; pp 1–6
3. Yang X, Feng S, Wang D, Zhang Y (2020) Image-text multimodal emotion classification via multi-view
attentional network. IEEE Trans Multimedia 23:4014–4026
4. Shih HC (2017) A survey of content-aware video analysis for sports. IEEE Trans Circuits Syst Video
Technol 28(5):1212–1231
5. Zhu L, Song J, Yang Z, Huang W, Zhang C, Yu W (2022) DAP2CMH: Deep Adversarial Privacy-
Preserving Cross-Modal Hashing. Neural Processing Letters 54(4):2549–2569
6. Wei J, Yang Y, Xu X, Zhu X, Shen HT (2021) Universal weighting metric learning for cross-modal
retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence
7. Shen L, Hong R, Hao Y (2020) Advance on large scale near-duplicate video retrieval. Frontiers of
Computer Science 14(5):145702
8. Marin J, Biswas A, Ofli F, Hynes N, Salvador A, Aytar Y et al (2019) Recipe1m+: A dataset for learning
cross-modal embeddings for cooking recipes and food images. IEEE transactions on pattern analysis and
machine intelligence
9. Chen D, Wang M, Chen H, Wu L, Qin J, Peng W (2022) Cross-Modal Retrieval with Heterogeneous Graph
Embedding. In: Proceedings of the 30th ACM International Conference on Multimedia 3291–3300
10. Pitcher BJ, Briefer EF, Baciadonna L, McElligott AG (2017) Cross-modal recognition of familiar con-
specifics in goats. Royal Society open science 4(2):160346
11. Frermann L, Cohen SB, Lapata M (2018) Whodunnit? crime drama as a case for natural language under-
standing. Trans Assoc Comput Linguist 6:1–15
12. Tripathi P, Watwani PP, Thakur S, Shaw A, Sengupta S (2018) Discover Cross-Modal Human Behav-
ior Analysis. In: 2018 Second International Conference on Electronics, Communication and Aerospace
Technology (ICECA). IEEE 1818–1824
13. Calhoun VD, Sui J (2016) Multimodal fusion of brain imaging data: a key to finding the missing link (s) in
complex mental illness. Biological psychiatry: cognitive neuroscience and neuroimaging 1(3):230–244
14. Goyal P, Sahu S, Ghosh S, Lee C (2020) Cross-modal Learning for Multi-modal Video Categorization.
arXiv:2003.03501
15. Pandey B, Pandey DK, Mishra BP, Rhmann W (2022) A comprehensive survey of deep learning in the
field of medical imaging and medical natural language processing: Challenges and research directions. J
King Saud Univ Comput Inf 34(8):5083–5099
16. Khare Y, Bagal V, Mathew M, Devi A, Priyakumar UD, Jawahar C (2021) Mmbert: Multimodal bert
pretraining for improved medical vqa. In: 2021 IEEE 18th International Symposium on Biomedical
Imaging (ISBI). IEEE 1033–1036
17. Moon JH, Lee H, Shin W, Kim YH, Choi E (2022) Multi-modal understanding and generation for medical
images and text via vision-language pre-training. IEEE J Biomed Health Inform 26(12):6070–6080
18. Chen M, Ubul K, Xu X, Aysa A, Muhammat M (2022) Connecting text classification with image
classification: a new preprocessing method for implicit sentiment text classification. Sensors 22(5):1899
19. Zhang S, Jiang T, Wang T, Kuang K, Zhao Z, Zhu J et al (2020) Devlbert:Learning deconfounded visio-
linguistic representations. In: Proceedings of the 28th ACM International Conference on Multimedia
4373–4382
20. Kaliyar RK, Goswami A, Narang P (2021) FakeBERT: Fake news detection in social media with a
BERT-based deep learning approach. Multimedia tools and applications 80(8):11765–11788
21. Staudemeyer RC, Morris ER (2019) Understanding LSTM-a tutorial into long short-term memory recur-
rent neural networks. arXiv:1909.09586
22. pal Singh V, Kumar P (2019) Word sense disambiguation for Punjabi language using deep learning
techniques. Neural Computing and Applications 1–11
23. Greff K, Srivastava RK, Koutník J, Steunebrink BR, Schmidhuber J (2016) LSTM: A search space odyssey.
IEEE Trans Neural Netw Learn Syst 28(10):2222–2232
24. Zhao Z, Chen W, Wu X, Chen PC, Liu J (2017) LSTM network: a deep learning approach for short-term
traffic forecast. IET Intell Transp Syst 11(2):68–75
25. Kong W, Dong ZY, Jia Y, Hill DJ, Xu Y, Zhang Y (2017) Short-term residential load forecasting based
on LSTM recurrent neural network. IEEE Transactions on Smart Grid 10(1):841–851
26. Zha W, Liu Y, Wan Y, Luo R, Li D, Yang S et al (2022) Forecasting monthly gas field production based
on the CNN-LSTM model. Energy 2022–124889
27. Kumari A, Tanwar S, Tyagi S, Kumar N (2018) Fog computing for Healthcare 4.0 environment: Oppor-
tunities and challenges. Computers & Electrical Engineering 72:1–13
28. Johnson R, Zhang T (2017) Deep pyramid convolutional neural networks for text categorization. In:
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers) 562–570
29. Chen Y (2015) Convolutional neural network for sentence classification. University of Waterloo
30. Lai S, Xu L, Liu K, Zhao J (2015) Recurrent convolutional neural networks for text classification. In:
Proceedings of the AAAI conference on artificial intelligence 29
31. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN et al (2017) Attention is all you need.
Advances in neural information processing systems 30
32. Perone CS, Cohen-Adad J (2019) Promises and limitations of deep learning for medical image segmen-
tation. J Med Artif Intell 2(1):1–2
33. Manaka T, van Zyl T, Kar D (2022) Improving Cause-of-Death Classification from Verbal Autopsy
Reports. In: Artificial Intelligence Research: Third Southern African Conference, SACAIR 2022, Stel-
lenbosch, South Africa, December 5–9, 2022, Proceedings. Springer 46–59
34. Ölçer D, Taşkaya Temizel T (2022) Quality assessment of web-based information on type 2 diabetes.
Online Information Review 46(4):715–732
35. Devi MD, Saharia N (2023) Unsupervised tweets categorization using semantic and statistical features.
Multimedia Tools and Applications 82(6):9047–9064
36. Chen Q, Sokolova M (2021) Specialists, scientists, and sentiments: Word2Vec and Doc2Vec in analysis
of scientific and medical texts. SN Computer Science 2:1–11
37. Guo L, Li N, Jia F, Lei Y, Lin J (2017) A recurrent neural network based health indicator for remaining
useful life prediction of bearings. Neurocomputing 240:98–109
38. Gorr H (2020) Classify Sentiment of Tweets Using Deep Learning. MathWorks. May
21,2020;online https://www.mathworks.com/matlabcentral/fileexchange/68264-classify-sentiment-of-
tweets-using-deep-learning, MATLAB Central File Exchange
39. Patel R, Passi K (2020) Sentiment analysis on twitter data of world cup soccer tournament using machine
learning. IoT 1(2):14
40. Bilal M, Israr H, Shahid M, Khan A (2016) Sentiment classification of Roman-Urdu opinions using Naïve
Bayesian, Decision Tree and KNN classification techniques. J King Saud Univ Comput Inf 28(3):330–344
41. Jain PK, Pamula R, Srivastava G (2021) A systematic literature review on machine learning applications
for consumer sentiment analysis using online reviews. Computer science review 41:100413
42. Neogi AS, Garg KA, Mishra RK, Dwivedi YK (2021) Sentiment analysis and classification of Indian
farmers’ protest using twitter data. Int J Inf Manag Data Insights 1(2)
43. Manaswi NK (2018) RNN and LSTM. In: Deep Learning with Applications Using Python:
Chatbots and Face, Object, and Speech Recognition With TensorFlow and Keras 115–126
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and applicable
law.