Article
Depression Detection Based on Hybrid Deep Learning SSCL
Framework Using Self-Attention Mechanism: An Application
to Social Networking Data
Aleena Nadeem 1, Muhammad Naveed 2, Muhammad Islam Satti 3, Hammad Afzal 1, Tanveer Ahmad 4 and Ki-Il Kim 5,*
Abstract: In today's world, mental health diseases have become highly prevalent, and depression is one of the mental health problems that has become widespread. According to WHO reports, depression is the second-leading cause of the global burden of diseases. Amid the proliferation of such issues, social media has proven to be a great platform for people to express themselves. Thus, a user's social media can speak a great deal about his/her emotional state and mental health. Considering the high pervasiveness of the disease, this paper presents a novel framework for depression detection from textual data, employing Natural Language Processing and deep learning techniques. For this purpose, a dataset consisting of tweets was created, which were then manually annotated by domain experts to capture the implicit and explicit depression context. Two variations of the dataset were created, one having binary and the other ternary labels. Ultimately, a deep-learning-based hybrid Sequence, Semantic, Context Learning (SSCL) classification framework with a self-attention mechanism is proposed that utilizes GloVe (pre-trained word embeddings) for feature extraction; LSTM and CNN were used to capture the sequence and semantics of tweets; finally, GRUs and a self-attention mechanism were used, which focus on contextual and implicit information in the tweets. The framework outperformed the existing techniques in detecting the explicit and implicit context, with an accuracy of 97.4 for binary labeled data and 82.9 for ternary labeled data. We further tested our proposed SSCL framework on unseen data (random tweets), for which an F1-score of 94.4 was achieved. Furthermore, in order to showcase the strengths of the proposed framework, we validated it on the "News Headline Data set" for sarcasm detection, considering a dataset from a different domain. It also outmatched the performance of existing techniques in cross-domain validation.
Keywords: depression detection; implicit depressive tweets; ternary classification; self-attention mechanism; deep learning
between 6 and 21 percent [2]. The situation of mental illnesses, particularly in developing
and underdeveloped countries, where healthcare resources are scarce, is even worse. The
presence of depression in the young population is estimated to be 5 percent across the
world and 20 percent in its milder forms (i.e., mild depression, partial symptoms, and
probable depression) [3]. The provision of mental health care is infrequent; thus, often
depression and other mental disorders remain untreated. Sometimes, the disease is not
even diagnosed or identified.
The most important part of the treatment of all diseases, particularly mental disorders including depression, is the identification of symptoms of depression among patients. The detection of depressive symptoms well before assessment and treatment can significantly improve the likelihood of the reduction of depressive indications and underlying diseases.
Moreover, it can alleviate negative implications for health and well-being, as well as for social and economic life [4]. The potential diagnostic techniques are often expensive
for the poor populations of underdeveloped countries and involve the patient reporting
to a mental health care practitioner, and patients or caregivers are required to explain
the symptoms. This aspect of diagnostic techniques often leads to non-identification of
depression in patients and results in untreated individuals. There is a need to develop
a diagnostic method without the physical involvement of the patient at a reduced or no
financial cost.
Over the last decade, social media has emerged as an extensively used platform for the exchange of ideas and information. A small chunk of text can echo the mental state
of a person. Hence, practitioners can gain a good amount of intuition about the mental
well-being and health of an individual from a Facebook post, a tweet, or an Instagram
post [5,6]. There is an explosive amount of literature that examines the role of social media in the anatomy of social dealings such as the break-up of relationships, smoking and drinking relapse, suicidal ideation, sexual harassment, and mental illnesses [6,7]. The ever-growing use of social media across the globe provides the opportunity to use social media as a tool for depression detection [8].
Referring to the diagnostic techniques required for depression detection, many studies
have experimented with various techniques and suggested numerous frameworks. The
techniques make use of different forms of data based on social-media-based depression
datasets, multi-modal datasets (audio and video interviews), survey-based datasets, and
EEG-signal-based depression datasets. Studies that have used social-media-based depres-
sion datasets have mostly resorted to Facebook, Twitter, Reddit, Reachout.com, Instagram
data, and some other social channels. Most of the studies focus on explicit depression
detection in which there is a clear statement regarding the disease [7–10]. Moreover, with
explicit detection, the time factor of social media usage has also been considered. Depressed
patients usually show the tendency of insomnia, so they stay awake at night and use social
media more frequently [8,11–14]. In this task, SVM [7,8,15] has shown great success with
the highest accuracy of 85 percent and Convolutional Neural Networks (CNNs) [12,14]
have shown good results with the highest accuracy of 88 percent. However, an explicit
statement of depression and the time factor are not the ultimate measures to rely upon, and
our goal is to detect a sarcastic tone or an implicit statement about the disease. A person may not state it clearly, but may still even be suicidal.
Contrary to social media data, where a person is openly expressing thoughts, multi-
modal depression datasets, which contain audio and video interviews of patients, have
also been used in various studies along with machine learning and deep learning mod-
els [16–24]. The shortcoming of these methods of detecting through multi-modal datasets
is that they involve patients directly. In developing countries, where resources are scarce relative to the population [4,8], patients cannot indulge in such long interactions with doctors.
Moreover, it becomes a difficult and expensive method to analyze audio and video. In such
methods, CNNs again have proven to be highly efficient because of being proficient in
image/video processing [7,15,20,22]. Furthermore, the survey-based depression datasets
have the same issue of involving the patient and obtaining biased answers [19–22]. In addition to that, EEG-signal-based datasets have also been employed in studies and involved
image processing [25–27].
For the stated task, we resorted to social media data and focused on the problem of implicit depression detection. In the studies that were surveyed, there is no facilitation
of implicit depression categorization. The datasets are even annotated on an “explicit”
basis, where a patient states that he/she is depressed. Moreover, there is no study on the
detection of first-person (I, we), second-person (you, they), and third-person (he, she, my
mother/father, etc.) depression. This paper proposes the exploration of publicly available
Twitter depression datasets through Natural Language Processing and involves the manual
annotation of data into binary and ternary labels, where binary labels are for first-person
depression and ternary labels are for second- and third-person depression. Additionally,
machine learning and deep learning algorithms were applied for the classification task,
along with various feature extraction techniques and pretrained word embeddings.
Ultimately, a novel hybrid deep learning SSCL framework with an attention mecha-
nism is presented. The SSCL model consists of LSTM and CNN layers in order to boost the
depression detection classification task along with a GRU and self-attention mechanism, to
capture the contextual information. Overall, models employing the GRU have proven to
show better efficiency and effectiveness as compared to other Recurrent Neural Networks
(RNNs) such as the Long Short-Term Memory (LSTM) model [9] alone. Moreover, to
boost the efficiency of the model, an attention layer was added to enhance the capturing of contextual/implicit information in a tweet; this improved the accuracy and F1-score of depression detection as compared to other novel models. In addition
to that, the experimental results attained on a real-world/unseen tweet dataset showed
the dominance of the proposed SSCL method as compared to the baseline methods. The
pivotal contributions of this study are stated as follows:
• A hybrid deep learning model is proposed, which uses a self-attention mechanism
focusing on the sequential, long-distance relationship of words, as well as their con-
textual information, by considering the weights of the words.
• An existing benchmark dataset was annotated with the help of domain experts, to
discriminate between the implicit and explicit nature of expression in tweets. This
improved the accuracy by reducing the false positive and false negative rates.
• The dataset was further annotated to formulate the problem of depression detection
into binary and ternary classification tasks, catering to first-, second-, as well as
third-person depression mentioned in tweets.
• The proposed framework was evaluated and shown to work well on unseen data
(random tweets, other than in the training and test dataset splits) and text classification
tasks from a different domain (sarcasm detection).
The paper is structured as follows. Section 2 reviews the related work and the relevant papers. Section 3 explains the materials and methodology of this research.
Section 4 presents the experimental results of machine learning and deep learning on raw,
as well as modified labeled data. Furthermore, Section 5 gives the results and discussion,
elaborating on the analysis of the experimental results. Section 6 concludes the work, and
Section 7 discusses the future work possibilities and the directions of the current research.
2. Related Work
Depression detection is of great importance, as the prevalence of the disorder keeps increasing around us. It is a widespread mental disorder that carries huge burdens with it, such as suicide, economic cost, societal disturbance, physical damage, and many profound health concerns. Due to the damaging nature of the disease, there has been a good amount of research and literature on it. In this section, we analyze the related work and cover the main dimensions of the work and techniques that have been used in the domain of depression detection. The relevant studies that were included use datasets based on social
media (Facebook, Twitter, Reddit, and Reachout.com), multi-modal datasets (audio and
video), survey-based datasets, and EEG-signal-based depression datasets.
Social-media-based depression datasets have been used in various studies. Studies incorporating datasets from frequently used social media sites, such as Facebook, Twitter, Reddit, and Reachout.com, have been explored. Uddin et al. used texts depicting depression symptoms. They used a Long Short-Term Memory (LSTM) Recurrent Neural Network
(RNN) to classify texts of self-perceived symptoms of depression. The symptoms were
predefined by medical professionals. The model was used to discriminate texts depicting
depression symptoms from posts with no such descriptions. Finally, the trained RNN was
used to automatically predict depression posts. This RNN along with LSTM gave a mean
accuracy of 98 percent, but did not cater to implicit depression detection [9].
The study [10] used the textual content of the tweets of users and analyzed the semantic
context in texts through deep learning models. The proposed model was a hybrid of two
deep learning algorithms, a Convolutional Neural Network (CNN) and Bi-directional Long
Short-Term Memory (BiLSTM). After optimization, it gave an accuracy of 94.28 percent.
Moreover, in [11], machine-learning- and lexicon- (dictionary and corpus) based approaches
were used for text classification in the task of depression detection. The results of balanced
and imbalanced data techniques were also explored. Machine learning algorithms were
applied, but a hybrid technique provided better results in detecting depression. A hybrid
model of LSTM and SVMs was created to detect depression with more accuracy. With data
balancing from SMOTE and using LSTM with SVM, the highest accuracy achieved was
83 percent.
The study [12] proposed Multi-aspect Depression Detection with a Hierarchical Atten-
tion Network (MDHAN) for detecting depression through the social media posts of users.
The task was performed by extracting features from the user behavior and the user’s online
timeline (posts). The model consisted of a hybrid setting to tackle user behavior through the
MLP and user timeline posts in the HAN to calculate each tweet and word’s importance and
capture semantic sequence features from the user timelines (posts). The MDHAN achieved
the best performance with 89 percent for the F1-score, to detect depression on Twitter.
Islam et al. used Facebook users’ comments for behavioral examination in [15] and ap-
plied basic machine learning algorithms, using primary linguistic features. Moreover, they
also considered the timings of Facebook posts and classified the data based on particular
words and timings, for which SVM performed the best. In [8], a socially meditated patient
portal app was developed for the detection of depression-related markers on Facebook
using ML techniques. The observed features were generated from Facebook, and the
unobserved features were generated using SMPP. In this study, again, the SVM model
resulted in the best average result. Both of the studies used basic ML algorithms, and the
classification criteria were explicit rather than implicit.
In another study, the text classification model SS3 for early depression detection was
presented in [7], using CLEF 2017—the content posted by users/subjects on Reddit. The
model consisted of two phases. In the first phase of classification, a document was split into
multiple blocks until the words were reached, and in the second phase, the global value
of a word for each category was computed. The technique was novel, but a higher rate of
false positives and false negatives could be expected. Another study [13] also employed the
same dataset (CLEF-2017). Various feature representations with features such as stylometric
and morphology features containing parts-of-speech proportions were used. SVM and
random forest were used for the classification of the different features, in which SVM gave
the best result. The set of features was limited in this study, and the context could not be
captured. Instagram data with images, text, and temporal aspects were researched in [14].
A summed depression score for each post was calculated using a static weight and a time-adapted weight. A CNN was used to extract image features, and Word2Vec was used to learn vector representations. The results were based on the score, and again, the implicit
nature of the content could not be captured.
In [12], a hybrid model was proposed using a Twitter-based dataset. The model used
pre-trained word embeddings to encode the words in user posts, and the proposed model
attained superior performance by merging a BiGRU with a CNN to detect depression on
Twitter. In [18], the Reachout.com (a platform for young people to discuss their everyday
issues) dataset was manually labeled based on posts’ urgency (labels: crisis, red, amber,
green). Psycho-linguistic features and network features were utilized, in combination, and
LSTM was used for anomaly detection. The study categorized depression on the basis of
labels and did not include the classification of depressed and non-depressed patients.
Many studies have used multi-modal datasets including different modes of data such
as audio, video, and text. DAIC-WOZ, i.e., the “Distress Analysis Interview Corpus”,
and AVEC-2019, the "Audio/Visual Emotion Challenge & Workshop", were the most
commonly used. The study [16] employed speech signal processing with hand-crafted
features to apply depression detection techniques using deep learning methods. This
paper elaborated on methods for acoustic feature extraction and presented algorithms
for classification. A multimodal dataset comprising AVEC2013 and DAIC-WOZ (2014)
was used, and the classification task was performed using a CNN and LSTM. Similarly,
Reference [17] used audio recordings of the individuals’ voices. The DAIC-WOZ database
was used to conduct text analysis on a word level using Natural Language Processing
(NLP) techniques and a voice quality analysis model on the tense-to-breathy dimension. The
text analysis model showed the best performance with an F1-score equal to 0.8 (0.42) for
non-depressed (depressed) individuals, while the voice quality model score was 0.76.
In [28], the model that was presented for the depression detection task consisted of
transformers at the top and a 1D-CNN at the bottom. Features from both models were
fed to the feedforward network, which classified the depressed and normal people. A
graph attention model to handle multi-modal (audio, text, images) data for depression
detection was proposed in [29]. A 10-layer temporal CNN was used to obtain the features
of multi-modal data and perform classification for the depression detection tasks. In [30],
the DAIC-WOZ dataset was used in combination with the AVEC depression dataset. AVEC
was used for training purposes, whereas the tuning was performed on DAIC. A novel
multi-modal framework with a hybrid deep learning approach combining supervised deep
learning, few-shot learning, and human-allied interactions was presented in the study. In [19],
again, DAIC-WOZ was used, and a multitask setting was employed to combine regression
and classification. The model consisted of a three-layer BGRU model with 128 hidden units
and four different text embeddings, namely Word2Vec, FastText, ELMo, and BERT.
Another study [20] explored the AVEC challenge for audio, video, and text descriptors
and developed a hybrid classification and estimation framework for depression. A Deep
Neural Network (DNN) and a Deep-Convolutional-Neural-Network (DCNN)-based audio and visual multi-modal depression recognition framework was proposed. In particular, teenager psychological stress was studied in [21] via a self-developed mobile app, "Happort", which collected sleep and exercise data via a wearable wrist sensor. A multimodal interactive fusion method involving text, images, sleep, and exercise data was proposed.
Survey-based depression datasets have also been employed by many studies. Survey-
based datasets were either collected in the form of self-developed questionnaires, which
were filled in by participants, or by using an already available dataset such as Depression
Anxiety & Stress Scale Questionnaire—21 Questions (DASS-21). In [24], the data collection
was performed through the Amazon Mechanical Turk (MTurk) portal, by using a survey.
Afterward, complex and nonlinear relationships between features were developed using
machine learning algorithms, namely: KNN, LR, NN, RF, and SVM-Linear. In another
study [25], machine learning algorithms were deployed to distinguish participants who were symptomatic for depression from those who were not, and a differentiation between depression and anxiety was made by sensing a unique pattern of biased responses to emotional inducements. Random forest was used to identify patterns in the behavioral
measures. Studies using survey-based datasets require patient involvement, whereas the
idea of our research is to detect depression requiring minimum difficulties for the patients.
The techniques used in these studies are also basic and have little novelty.
An Electroencephalogram (EEG) is an assessment of the brain by means of electrodes, which are metallic disks attached to the head/scalp. The test senses the electrical activity of brain cells, which communicate via electrical impulses. The activity is shown in the form of curvy lines on an EEG recording. The EEG of depressed patients differs from that of non-depressed
patients in terms of brain cell activity. Many studies have used EEG recordings’ datasets
to study abnormal activity in depressed patients. In [31–33], EEG-based datasets were
used. Deep learning models such as CNNs and LSTMs have been employed to perform
classification based on EEG signals or images.
For the depression detection task, our approach is different from all the other studies
in that we manually annotated the data in consultation with a group of mental health practitioners to capture the explicit, as well as the implicit context of tweets. Furthermore, after preparing the quality dataset, we used pre-trained Twitter word embeddings to extract the features. Then, various methods from comparative studies were applied, which gave better results on our dataset, and our own hybrid deep learning model performed exceptionally well.
Table 1. Definitions of classification outcomes.

True Positive (TP)    Depression instances that are positive and determined as positive/depressed
True Negative (TN)    Instances that are negative and determined as negative/not depressed
False Positive (FP)   Instances that are negative, but determined as positive/depressed
False Negative (FN)   Depression instances that are positive, but determined as negative
The raw dataset contained tweets labeled as "1" and "0". The annotation was challenging in that the raw labels contained false positive and false negative labels. An experienced mental health practitioner was consulted for the stated purpose, and a criterion was developed for the modification of the annotation. The process was carried out manually to ensure reliability. Table 3 states the existing criteria for data labeling.
Another significant part of the data preprocessing included the cleaning and preparation of the data for further analysis and effective model training. Initially, the raw file contained 31,000 tweets in total, out of which 15,999 were labeled "1" and 15,000 were labeled "0". The detailed statistics are listed in Table 4. In line with the depression detection domain, the preprocessing steps carried out on the data included removing duplicate comments, to avoid skewed metrics and keep the learning process impartial; appending hashtags to the tweet text for information extraction; and removing hyperlinks, punctuation, extra spaces, other-language text, and special characters. Moreover, lower-casing of letters was performed. Stop words that play a significant role in identifying first-person (I, me, myself, we, us, our, and ourselves) or third-person (she, her, herself, he, him, and himself) depression, i.e., pronouns, were retained to make the learning of the model more accurate. Other stop words were removed. Tokenization is an integral part of preprocessing for all ML and DL algorithms. In this research, the spaCy tokenizer was utilized for the segmentation of the text. Segmentation is performed on the basis of white spaces between words; moreover, the tokenizer checks for other symbols such as single inverted commas. For example, "shouldn't" is split into "should" and "n't". The spaCy tokenizer has been built specifically for Natural Language Processing and deep learning, so it was applied to our dataset for tokenization.
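To illustrate this step, the following is a minimal sketch of spaCy tokenization; the sample sentence is illustrative, and the small English model (en_core_web_sm) is assumed to be installed.

```python
# Minimal sketch of the spaCy tokenization step described above.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I shouldn't have to feel this empty every night")

# Contractions are split on the apostrophe, e.g., "shouldn't" -> "should", "n't"
print([token.text for token in doc])
```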
Label correction or data annotation was carried out as a part of the preprocessing.
Following are the statistics concerning label correction. After the label correction, the
number of tweets in every class is stated in Table 4.
Moreover, for the deep learning models, feature extraction was performed using pre-trained word embeddings, mainly Word2Vec, FastText, and GloVe. These word embeddings were utilized along with various deep learning models and deep learning ensembles:
Word2Vec: Word2Vec is a pre-trained word embedding model developed by Google. It employs the skip-gram and continuous bag-of-words architectures for calculating vector representations of words. The Google News vector model (3 million 300-dimensional English word vectors) was utilized in this study.
FastText: This is another way of learning word representation, developed by Facebook’s
AI. FastText is basically a library that is used for the learning of word embeddings and
obtaining vector representations of words. In this study, the pre-trained FastText English word vectors (2 million 300-dimensional word vectors trained with subword information on Common Crawl, 600B tokens) were used as one of the feature extraction methods.
GloVe—Global Vectors for Word Representation: Stanford's GloVe is an unsupervised learning model for finding meaningful vector representations for words. Pre-trained word embeddings are already trained on a larger corpus and can be used to solve similar Natural Language Processing tasks. In this study, GloVe's pre-trained vector model was used to map the tweets onto the vector space. The GloVe Twitter model (2B tweets, 27B tokens, a 1.2 M vocabulary, 200-dimensional vectors) was used in this study for extracting the textual features of tweets.
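As an illustration of how these pre-trained vectors feed the models below, the following sketch loads the GloVe Twitter 200-dimensional vectors into an embedding matrix of the (5000, 200) shape described in Section 5; the file name and the use of the Keras Tokenizer are assumptions made for the sake of a runnable example.

```python
# Hedged sketch: building a (5000, 200) embedding matrix from the GloVe
# Twitter vectors. Assumes glove.twitter.27B.200d.txt has been downloaded
# from the Stanford GloVe page.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

MAX_WORDS, EMB_DIM = 5000, 200

def load_glove(path="glove.twitter.27B.200d.txt"):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

def build_embedding_matrix(tokenizer, vectors):
    matrix = np.zeros((MAX_WORDS, EMB_DIM))
    for word, idx in tokenizer.word_index.items():
        if idx < MAX_WORDS and word in vectors:
            matrix[idx] = vectors[word]  # out-of-vocabulary rows stay zero
    return matrix

tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(["i feel so low today"])  # illustrative corpus
embedding_matrix = build_embedding_matrix(tokenizer, load_glove())
```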
The evaluation metrics that were used in this study to gauge the performance of our classifiers were the accuracy and F1-score. Both metrics take into account all four significant dimensions, i.e., True Positives (TPs), True Negatives (TNs), False Positives (FPs), and False Negatives (FNs) (refer to Table 1).
Accuracy: This is the measure of all the cases that have been correctly identified, which
means it takes into account all the accurately/correctly classified cases.
Accuracy = (TP + TN) / (TP + FP + TN + FN)  (2)
F1-score: The F1-score takes the precision and recall into account; it is the harmonic mean of the precision and recall. The F1-score is an improved measure over the accuracy metric because the F1-score also reflects the cases that have been incorrectly classified.
F1-score = 2 × (Recall × Precision) / (Recall + Precision)  (3)
Precision is the measure of the positive cases that are correctly identified from all the
predicted positive cases.
Precision = TP / (TP + FP)  (4)
Recall is the measure of the positive cases that are correctly identified from all the
positive cases existing in the dataset.
Recall = TP / (TP + FN)  (5)
The accuracy measure is used when all of the classes are equally significant. Therefore, we use accuracy when true positives and true negatives are equally important and there exists a class balance. However, accuracy is not a reliable measure in real life, because class imbalance is common in real-life problems. Therefore, keeping this in consideration, we also used the F1-score. The F1-score is the harmonic mean of the precision and recall of the model; thus, the F1-score considers both false positives and false negatives. Hence, it is used when the false negative and false positive cases are crucial, and in our case, the wrong detection of depression (FP) and no detection of depression when it exists (FN) are vital information that cannot be lost. Moreover, the F1-score handles class imbalance well, and in our ternary classification problem, the classes are imbalanced. In this study, we present both the accuracy and F1-score as evaluation metrics, but for the mentioned reasons, the F1-score is the more informative metric.
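For reference, the following minimal sketch computes both metrics with scikit-learn on illustrative labels; the weighted averaging shown for the ternary case is our assumption, as the paper does not state which averaging mode was used.

```python
# Hedged sketch: accuracy and F1-score as defined in Equations (2)-(5).
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 2, 0, 2, 1, 0]   # illustrative ternary labels
y_pred = [1, 0, 0, 2, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))
# 'weighted' averages per-class F1 by support, mitigating class imbalance
print(f1_score(y_true, y_pred, average="weighted"))
```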
4. Experimental Analysis
For the above-mentioned classification tasks, seven prevalent machine learning classi-
fiers were employed to evaluate the performance of each. The ML classification models
used were the K-Nearest Neighbors (KNNs), Naïve Bayes (NB), Random Forest (RF),
Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), and Multi-
layer Perceptron (MLP). Deep learning algorithms were also experimented with, namely:
feedforward networks, GRUs, CNN, LSTM, BiLSTM, attention layer mechanism, and en-
semble/hybrid models. The subsection "Machine Learning Classification Model" presents the experimentation results on raw, binary, and ternary labeled data with the ML algorithms. The subsection "Deep Learning Classification Model" presents the experimentation results on binary and ternary labeled data with the DL algorithms.
Table 6. Results of ML classifiers on raw labeled binary data with traditional features (each cell: Accuracy / F1-score).

Technique  TF-IDF       Uni-Gram     Bi-Gram      Uni–Bi-Gram  Uni–Bi–Tri-Gram
KNN        82.1 / 81.6  94 / 93.9    75.9 / 75.5  92.8 / 92.8  91.9 / 91.9
NB         81.7 / 81    92 / 91.9    89.3 / 89.1  89 / 88.8    89.6 / 89.4
RF         98.6 / 98.5  99 / 99      95.8 / 95.8  98.6 / 98.6  98.5 / 98.5
LR         99.1 / 99    99.8 / 99.8  95.8 / 95.8  99.8 / 99.8  99.7 / 99.7
SVM        99.6 / 99.6  99.8 / 99.8  95.4 / 95.4  99.7 / 99.7  99.6 / 99.6
DT         99 / 98.9    99.6 / 99.6  95 / 95      99.3 / 99.3  99.4 / 99.4
MLP        98.7 / 98.7  99.6 / 99.6  97.4 / 97.3  99 / 99      98.7 / 98.7
Table 6 presents the results of all the machine learning classifiers applied on raw
labeled binary data, in which implicit tweets were not labeled as depressed, but rather as non-depressed. The feature extraction techniques used with the classifiers were
TF-IDF and N-grams (uni-gram, bi-gram, uni–bi-gram, and uni–bi–tri-gram). The data
used for the training and testing of the classifiers were divided on the basis of the 80:20 rule,
i.e., 80 percent data for training and 20 percent data for testing.
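To make the experimental setup concrete, the following is a minimal sketch of one such baseline, TF-IDF features with a linear SVM and the 80:20 split; the tiny tweet list is purely illustrative, and ngram_range can be changed to reproduce the n-gram configurations.

```python
# Hedged sketch of a TF-IDF + SVM baseline with an 80:20 split.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = ["i feel so empty these days", "lovely weather today",
          "nothing matters anymore", "excited for the weekend",
          "i can't get out of bed", "great game last night"]  # illustrative
labels = [1, 0, 1, 0, 1, 0]

X_tr, X_te, y_tr, y_te = train_test_split(
    tweets, labels, test_size=0.2, random_state=42, stratify=labels)

# ngram_range=(1, 3) would give the uni–bi–tri-gram configuration
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LinearSVC())
clf.fit(X_tr, y_tr)
print(f1_score(y_te, clf.predict(X_te), average="weighted"))
```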
It is noticeable from Table 6 that SVM and logistic regression performed equally well
with all of the feature extraction techniques except bi-grams. With bi-grams, multilayer
perceptron gave the best results overall. However, it is evident from the results that the
classifiers were overfitting.
In order to evaluate the performances of classifiers on unseen data and establish the
fact that the classifiers were overfitting, a set of real-time tweets was tested separately for
each model with every feature extraction technique. The real-time data were a small set
of tweets that consisted of both implicit and explicit tweets. Table 7 presents the results
of the ML classifiers on the real-time tweets dataset. It was observed that the model was
unable to capture the implicit statements about depression and was highly overfit. Mostly,
the classifiers output the wrong labels for implicit statements for depression, whereas they
classified the explicit statements well for the real-time tweets. The classifiers gave diverse
results with every feature extraction technique. The highest F1-score for SVM, 76.4, was achieved with uni–bi–tri-gram.
Table 7. Real-time tweets dataset results for ML classifiers with raw labeling (each cell: Accuracy / F1-score).

Technique  TF-IDF       Uni-Gram     Bi-Gram      Uni–Bi-Gram  Uni–Bi–Tri-Gram
KNN        47 / 43.9    70.5 / 70.1  35.5 / 34.3  70.5 / 70.1  70.5 / 70.5
NB         47 / 31.9    52.9 / 48.4  47 / 31.9    47 / 31.9    47 / 31.9
RF         64.7 / 61.3  75.4 / 75.7  70.5 / 70.1  70.5 / 68.8  58.8 / 56.4
LR         70.5 / 70.1  64.7 / 63.5  70.5 / 70.1  70.5 / 70.1  76.4 / 76.3
SVM        64.7 / 63.5  64.7 / 63.5  70.5 / 70.5  75.4 / 76.3  76.4 / 76.4
DT         58.8 / 56.4  58.8 / 56.4  70.8 / 68.8  58.8 / 56.4  52.9 / 48.4
MLP        58.8 / 58.2  58.8 / 58.2  64.7 / 61.3  58.8 / 58.2  64.7 / 64.5
In order to capture the implicit content and fix the problem of overfitting, the labels
were modified and implicit statements about depression were also marked as depressed (la-
beled as 1) with the help of a medical practitioner. This labeling helped the model/classifiers
learn the implicit statements as well.
Table 8 states the results of all the mentioned machine learning classifiers along with
all the stated techniques for feature extraction applied to the data with modified labeling.
Table 8. Results of ML classifiers with modified labeling (each cell: Accuracy / F1-score).

Technique  TF-IDF       Uni-Gram     Bi-Gram      Uni–Bi-Gram  Uni–Bi–Tri-Gram
KNN        80.2 / 79.8  82.8 / 82.6  63.1 / 57.7  84.8 / 84.7  59.9 / 56.1
NB         86.9 / 86.7  92.6 / 92.5  89 / 88.9    90.8 / 90.7  90 / 89.8
RF         96.1 / 96    96.4 / 96.2  93.9 / 93.6  96.4 / 96.2  95.9 / 95.8
LR         95.3 / 95.2  96.5 / 96.3  93.5 / 93.3  96.9 / 96.7  96.6 / 96.5
SVM        96.4 / 96.3  96.6 / 96.5  92.7 / 92.6  96.8 / 96.7  96.5 / 96.4
DT         95.3 / 95    96 / 95.8    92.3 / 92.2  96.3 / 96.2  96.3 / 96.1
MLP        94.7 / 94.3  95.3 / 95    93.9 / 93.7  95.7 / 95.5  91.8 / 91.7
It is evident from Table 8 that SVM outperformed all the other classifiers. SVM gave the best results with TF-IDF, uni-gram, and uni–bi-gram; with uni–bi-gram, it gave the highest F1-score of 96.7. ML classifiers trained with the modified labeled data were also tested with real-time tweets. Table 9 shows the results of the classifiers on the unseen data. It is evident from the results that, when trained with the modified labels, all the classifiers performed better on the unseen data. The highest accuracy achieved was 82.3, and the highest F1-score was 82.1, by logistic regression. LR performed better on the test dataset with TF-IDF and uni–bi–tri-gram, whereas random forest gave the same highest accuracy and F1-score of 82.3 and 82.1, respectively, on the test data with uni–bi-gram.
Table 9. Real-time tweets dataset results for ML classifiers with modified labeling (each cell: Accuracy / F1-score).

Technique  TF-IDF       Uni-Gram     Bi-Gram      Uni–Bi-Gram  Uni–Bi–Tri-Gram
KNN        70.5 / 70.1  64.7 / 64.5  58.8 / 47.1  58.8 / 58.2  70.5 / 70.1
NB         58.8 / 52.9  64.7 / 63.5  52.9 / 43.3  64.7 / 61.3  58.8 / 52.9
RF         76.4 / 75.7  64.7 / 63.5  76.4 / 76.3  82.3 / 82.1  64.7 / 63.5
LR         82.3 / 82.1  70.5 / 70.4  64.7 / 63.5  70.5 / 70.1  82.3 / 82.3
SVM        70.5 / 70.1  70.5 / 70.1  70.5 / 70.1  76.4 / 76.3  76.4 / 76.3
DT         70.5 / 70.4  64.7 / 63.5  70.5 / 70.1  64.7 / 64.5  58.8 / 56.4
MLP        76.4 / 75.7  70.5 / 70.1  52.9 / 52.7  64.7 / 63.5  76.4 / 76.3
Additionally, Table 10 lists the highest accuracy and F1-scores we obtained from the
raw label and modified label data when tested on the real-time tweets. Other than that,
comparing the performances of the classifiers trained on raw labeled data with training on
the modified labeled data, it is evident from Tables 7 and 9 that, with the unseen/real-life
data, the classifiers trained on the modified labeled data performed better. Moreover, if
we consider the rate of true negative and true positive classifications, the classifiers trained with the modified labels captured more of the implicit tweets, hence fulfilling the purpose of applying machine learning in the context of depression detection.
Table 11. Experimental analysis of ML classifiers on ternary labeled data (each cell: Accuracy / F1-score).

Technique  TF-IDF       Uni-Gram     Bi-Gram      Uni–Bi-Gram  Uni–Bi–Tri-Gram
KNN        78 / 59.6    84.3 / 61.4  50.8 / 34.8  80 / 55.5    66.8 / 45.5
NB         84.8 / 58.6  86.5 / 60.4  83.7 / 59    86.3 / 59.8  86 / 59.9
RF         89.7 / 62.4  90 / 63.1    88.1 / 61.5  90.5 / 63.2  89.7 / 62.6
LR         90.4 / 72.7  90.8 / 77.1  88.1 / 70.3  91.5 / 77.1  91.4 / 77.3
SVM        90.8 / 71.3  91.2 / 72.8  86.7 / 59.9  91.5 / 70.2  90.3 / 67.7
DT         88 / 72.1    88.3 / 72    85.2 / 66.3  89.4 / 73.2  87.9 / 72.1
MLP        88.7 / 72.8  89.6 / 75.3  88.7 / 69.7  89.4 / 72.9  88.8 / 74.2
Another aim of this study was to train the machine learning classifiers such that they have an increased rate of true positive and true negative detection whilst decreasing the rate of false positives and false negatives. Figure 3 shows the confusion matrices of TF-IDF + LR and uni–bi–tri-gram + LR; Figure 4 shows the confusion matrix of uni–bi–tri-gram + SVM. It is obvious that, in Figure 3, the rate of TPs and TNs is higher than in Figure 4 (uni–bi–tri-gram + SVM trained on raw labeled data).
Tweets that talked about somebody else's depression or provided some generic information were labeled as "2" (refer to Table 3: annotation criteria). The information is also important if somebody is talking about some other person's depression, such as his/her father, mother, or any other acquaintance, or providing some general information, so this should also be considered. Therefore, to extend our work, we applied seven machine learning algorithms, i.e., K-Nearest Neighbors (KNNs), Naïve Bayes (NB), Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), and Multilayer Perceptron (MLP), on the ternary labeled data. The feature extraction techniques
used were the same as those used for the binary task, i.e., TF-IDF, uni-gram, bi-gram,
uni–bi-gram, and uni–bi–tri-gram.
The evaluation metrics used for assessing the performance of the classifiers were the
accuracy and F1-score. The F1-score was more reliable in the ternary classification task
because of the class imbalance. The number of tweets labeled "0" was 13,533; the number of tweets labeled "1" was 11,843; the number of tweets labeled "2" was 1742 (refer to Table 4). The F1-score remains reliable when classes are imbalanced and true positives and true negatives are equally important for us.
The machine learning classifiers applied with the different feature extraction methods
are stated in Table 11. The experimental analysis of the ternary classification task showed
that logistic regression performed the best in terms of the F1-score, with the highest score of 77.3 with the uni–bi–tri-gram method. The accuracy is also given in the table, but as the classes were imbalanced, the F1-score is more pertinent in the case of the ternary classification. With the TF-IDF method, multilayer perceptron performed the best out of all the classifiers with an F1-score of 72.8. Logistic regression also performed well with the uni-gram, bi-gram, and uni–bi-gram methods. In short, logistic regression performed the most efficiently of all the classifiers.
4.2.1. Binary Classification Task with Deep Learning and Comparative Methods
Multiple deep learning models along with word embeddings were experimented with.
In Table 12, the results are shown for various combinations of word embeddings and deep
learning techniques. Apart from the various combinations with word embeddings, the
hybrid models/ensembles were also employed.
Table 12. Experimental analysis of deep learning models on binary labeled data.
Another study [12], with which we compared our work, gave an F1-score of 85 on the binary classification of tweets by using the skip-gram model for word representation, BiGRUs for learning latent representations of words, and a CNN for classification using multimodal features, whereas, in our study, the results in terms of the F1-score improved a great deal.
Later on, with GloVe embedding, we experimented with an ensemble of LSTM + CNN
and LSTM + GRU. These ensembles did not improve the F1-scores. When the GRU along
with the attention layer was incorporated, the model performance greatly increased. The
model’s details are elaborated in Section 5.
Table 14. Experimental analysis of deep learning models on ternary labeled data.
In the proposed framework, GloVe was used for feature extraction along with a hybrid model consisting of Long Short-Term Memory (LSTM), a Convolutional Neural Network (CNN), a Gated Recurrent Unit (GRU), and an attention layer. The proposed model gave the best performance in terms of the accuracy and F1-score. Figure 5 illustrates the architecture of the proposed methodology.
Word embeddings can be categorized as learning models that help draw the representation or underlying connotation of words in our vocabulary in a lower-dimensional semantic space. Their fundamental principle is founded on optimizing an objective function that places words frequently occurring with each other under a specific contextual window in closer proximity in the semantic space. An extraordinary aptitude of these models is that they can efficiently capture numerous lexical properties in NLP tasks, such as the similarity among words, correspondences between words, the global occurrence of these words in a document or a sentence, etc. These models have become increasingly prevalent in the NLP domain, and they have been utilized as the input to deep learning models. In this study, we experimented with the most popular models, but GloVe gave the best performance with our proposed model.
As stated before, GloVe is Stanford's unsupervised learning model for discovering meaningful vector representations of words. In this study, we used GloVe's pre-trained word embeddings, which have already been trained on a huge data corpus and can thus be used to solve similar NLP tasks. We used GloVe's pre-trained vector model to map the tweets onto the lower-dimensional vector space. The GloVe Twitter model (two billion tweets, 27 billion tokens, a 1.2 million-word vocabulary, 200-dimensional vectors) was used in this study to extract the textual features of the tweets into our defined size of the embedding matrix. The embedding matrix basically contains the word embeddings extracted with the help of the word embedding models.
The word embeddings from GloVe are in the form of an embedding layer with an input dimension of 5000, an input length of 120, an output dimension of 200, and an embedding matrix of shape (5000, 200). Afterwards, a dropout layer with a rate of 0.3 was deployed on the embeddings. This was performed to avoid overfitting by randomly dropping neurons during training.
The input to our LSTM layer was the GloVe embedding matrix containing word
embeddings, which not only incorporates local context information (local statistics), but
also considers global word occurrence (global statistics). These word embeddings were
further fed to our novel hybrid deep learning attention-based model with a learning rate of
0.01. The first layer of our proposed model consisted of LSTM with 96 LSTM cells, which
are capable of acting as memory cells. LSTM contains memory cells instead of the hidden
layer, which makes it good at finding long-range dependencies in the data. Long-range
dependencies are crucial in finding the sentence structure. LSTMs are fundamentally
made of three gates, the input gate, forget gate, and output gate, which help determine which information should be conserved and which information should be thrown out. Moreover, LSTMs are also capable of storing input sequences in their memory cells for longer periods of time, and they fix the problem of Recurrent Neural Networks (RNNs), where there is
no regulation of the input. As RNNs do not know which past inputs need to be forgotten
and which should be carried forward, LSTMs are more appropriate for the task of learning
word embeddings in a semantic way and are suitable for maintaining the sequence of the
word embeddings of tweets.
The output sequences of LSTM were given as the input to a 1D Convolutional Neu-
ral Network (CNN). While LSTM maintains the order/sequence of the semantics in a
tweet/sentence, the CNN is capable of extracting semantic information from a sequence of
sentences, and it can learn effective and suitable feature representations. The 1D CNN was
used with a filter of 32 and a kernel size of five. After deploying a CNN layer, a dropout
layer with a rate of 0.3 was used to avoid any overfitting from the combination of LSTM
with the CNN.
Afterward, the GRU layer was deployed. While LSTM can effectively remember longer
sequences than the GRU, at the same time, the GRU has better performance in capturing
long-distance relationships between words in a sequence/sentence. Moreover, the GRU
has proven to be computationally effective because of its precise internal structure with
an update and reset gate, and it can capture long-distance relations among information.
However, the GRU focuses equally on all the input variables output from its previous layer, i.e., the CNN, so it does not by itself yield good precision/accuracy for the output sequence. As the input variables can contribute differently to the output prediction, the GRU as a stand-alone assistance technique to the LSTM+CNN cannot perform well.
Hence, in order to pay more attention to the contextual information, we used a GRU
with a self-attention layer. The GRU outputs the input variables fed to it with a focus on all
of the input variables, so an attention layer was deployed to focus more on the relevant
words’ context. A self-attention layer was chosen instead of an attention layer because the
attention mechanism only lets the output focus the attention on a particular input, whereas
self-attention lets inputs of the network interact with each other, hence calculating the
attention of all other inputs with respect to one input. The self-attention phenomenon is
capable of drawing global dependencies between inputs and outputs. Figure 6 shows the
connection of the GRU with the attention layer.
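Putting the pieces above together, the following Keras sketch mirrors the stated hyperparameters (vocabulary of 5000, sequence length of 120, 200-dimensional GloVe matrix, dropout of 0.3, 96 LSTM cells, 32 convolution filters of size five, learning rate of 0.01); the number of GRU units, the pooling step, and the output head are our assumptions, as they are not specified in the text.

```python
# Hedged sketch of the SSCL architecture (LSTM -> CNN -> GRU -> self-attention).
import tensorflow as tf
from tensorflow.keras import layers

def build_sscl(embedding_matrix, num_classes=2, gru_units=64):  # gru_units assumed
    inputs = layers.Input(shape=(120,))
    x = layers.Embedding(5000, 200, weights=[embedding_matrix],
                         trainable=False)(inputs)        # frozen GloVe embeddings
    x = layers.Dropout(0.3)(x)
    x = layers.LSTM(96, return_sequences=True)(x)        # sequence learning
    x = layers.Conv1D(32, 5, activation="relu")(x)       # semantic features
    x = layers.Dropout(0.3)(x)
    x = layers.GRU(gru_units, return_sequences=True)(x)  # long-distance context
    x = layers.Attention()([x, x])                       # self-attention: query = value
    x = layers.GlobalAveragePooling1D()(x)               # assumed pooling step
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```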
With the modified labels, the performance of the classifiers was better as compared to the raw labeled data.
For the raw-labeled-trained models, the highest F1-score was obtained by LR and SVM
with the uni–bi–tri-gram technique. The highest F1-scores of 76.3 and 76.4 were for LR and
SVM, respectively, on the unseen data. Furthermore, with the modified-labeled-trained
models, the unseen data were also tested. The highest F1-score was obtained by LR with
uni–bi–tri-gram on the unseen data, and the score was 82.3. If we compare the scores of the
models trained with raw labels and modified labels, there was a huge difference in terms
of the accuracy and F1-score. Hence, implicit depression detection improved, and the number of true positives and true negatives increased.
Deep learning models were also employed for the depression detection task. Referring
to Table 12, it can be seen that the CNN+LSTM combination worked better than the CNN+BiLSTM when used with no embedding, whilst, when the same ensembles were used with Word2Vec, the CNN+BiLSTM performed better than the CNN+LSTM. When we used FastText embedding with the CNN+BiLSTM, it gave an accuracy and F1-score of 96.6. With our proposed model, which used GloVe 200-d with the LSTM+CNN+GRU+attention, the F1-score increased to 97.4 percent.
Similarly, all of the machine learning algorithms that were experimented with for
binary labels were also applied for ternary labeled data. In the case of ternary labels, LR
performed best with the uni–bi–tri-gram technique and gave an accuracy of 91.4 and an F1-score of 77.3 (refer to Table 11).
with, the simple feedforward network performed relatively better on the ternary data
as compared to other methods such as the GRU and Word2Vec with the CNN (refer to
Table 14). The BiLSTM + CNN performed very well with an accuracy and F1-score of 80.1.
With GloVe, the results dropped below the 70s, but when the GRU along with the attention
layer was added to the model, the performance was improved in terms of the accuracy and
F1-score. The highest accuracy and F1-score achieved were both 82.9, and the ensemble that gave
this result was our proposed model.
To achieve better results for the task of depression classification in terms of implicit
and explicit depression detection, we proposed a model in which, along with an ensemble
of the CNN and LSTM, we used the GRU and self-attention mechanism to pay attention to
the contextual information of the text also. In other methods, the focus was on extracting
the sequential and semantic information, whereas, in our proposed model, LSTM acted as a
sequence learning mechanism and the CNN paid attention to the semantics and significant
features. To further enhance the performance of implicit depression detection, we added the
GRU along with a self-attention layer to grasp the contextual information that a text carries.
The GRU captured long-distance relationships between the occurrence of words, whereas
the self-attention layer focused on relevant and pertinent features, hence improving the
performance in terms of the accuracy and F1-score, as well as enhancing the performance
in terms of implicit depression detection tasks and increasing the number of true positive
and true negative results.
Table 15 presents the improved results with the proposed model. It is evident from the experimental results on the binary and ternary labeled data and the results in Table 15 that our model improved the F1-score and accuracy. Our model was implemented with
GloVe word embedding and, afterward, also tested with FastText. Our model performed
better with GloVe. On the unseen test data, our model gave an accuracy and F1-score of
94.4 in the case of binary labeled data. Moreover, if we observe our results as compared to
studies included in the related work, the proposed model improved the accuracy of binary
labeled data a great deal, as shown in Table 13.
We also implemented the comparative methods listed in Table 13 with our ternary
data. The three-gram with CNN + LSTM gave an accuracy of 60.1 and an F1-score of 60,
whereas, with our deep-learning-based method, the accuracy and F1-score improved for
ternary labeled data also, with an accuracy and F1-score of 82.9.
Table 15. Proposed method results on binary and ternary labeled data.
Figure 8 depicts the validation accuracy graph for binary labeled data, which increased
with learning. The loss decreased as the model’s learning increased.
Figure 9 illustrates the accuracy curve for ternary labeled data.
Table 17 presents the results of the comparative models from a few studies based on the same dataset, along with our model's result on the sarcasm detection dataset. Our model improved the accuracy and F1-score of sarcasm detection based on the news headlines data.
Figure 10 depicts the validation accuracy in the case of sarcasm detection through our proposed model.
Author Contributions: Conceptualization, A.N. and M.N.; Methodology, H.A.; Software, T.A.;
Formal analysis, M.N., M.I.S. and K.-I.K.; Investigation, M.I.S. and T.A.; Resources, H.A.; Data
curation, T.A.; Project administration, T.A.; Funding acquisition, K.-I.K. All authors have read and
agreed to the published version of the manuscript.
Funding: This work was supported by Institute for Information & communications Technology
Planning & Evaluation (IITP) grant funded by the Korean Government (MSIT) (No. 2022-0-01200,
Training Key Talents in Industrial Convergence Security).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. World Health Organization. Depression. Available online: https://www.who.int/news-room/fact-sheets/detail/depression
(accessed on 15 November 2021).
2. Serrat, O. Social Network Analysis in Knowledge Solutions; Springer: New York, NY, USA, 2017.
3. Kessler, R.C.; Bromet, E.J. The epidemiology of depression across cultures. Annu. Rev. Public Health 2013, 34, 119. [CrossRef]
[PubMed]
4. Weinberger, A.H.; Gbedemah, M.; Martinez, A.M.; Nash, D.; Galea, S.; Goodwin, R.D. Trends in depression prevalence in the
USA from 2005 to 2015: Widening disparities in vulnerable groups. Psychol. Med. 2017, 48, 1308–1315. [CrossRef] [PubMed]
5. Cambria, E.; White, B. Jumping NLP curves: A review of natural language processing research. IEEE Comput. Intell. Mag. 2014,
9, 48–57. [CrossRef]
6. Havigerová, J.M.; Haviger, J.; Kučera, D.; Hoffmannová, P. Text-Based Detection of the Risk of Depression. Front. Psychol. 2019, 10, 513.
7. Burdisso, S.G.; Errecalde, M.; Montes-y-Gómez, M. A text classification framework for simple and effective early depression detection over social media streams. Expert Syst. Appl. 2019, 133, 182–197. [CrossRef]
8. Hussain, J.; Satti, F.A.; Afzal, M.; Khan, W.A.; Bilal, H.S.M.; Ansaar, M.Z.; Ahmad, H.F.; Hur, T.; Bang, J.; Kim, J.I.; et al. Exploring
the dominant features of social media for depression detection. J. Inf. Sci. 2020, 46, 739–759. [CrossRef]
9. Uddin, M.Z.; Dysthe, K.K.; Følstad, A.; Brandtzaeg, P.B. Deep learning for prediction of depressive symptoms in a large textual
dataset. Neural Comput. Appl. 2021, 34, 721–744. [CrossRef]
10. Kour, H.; Gupta, M.K. An hybrid deep learning approach for depression prediction from user tweets using feature-rich CNN and
bi-directional LSTM. Multimed. Tools Appl. 2022, 81, 23649–23685. [CrossRef]
11. Gupta, H.; Goel, L.; Singh, A.; Prasad, A.; Ullah, M.A. Psychological Analysis for Depression Detection from Social Networking
Sites. Ment. Illn. Detect. Anal. Soc. Media 2022, 2022, 4395358. [CrossRef]
12. Zogan, H.; Razzak, I.; Jameel, S.; Xu, G. Explainable depression detection with multi-aspect features using a hybrid deep learning model on social media. World Wide Web 2022, 25, 281–304.
13. Stankevich, M.; Isakov, V.; Devyatkin, D.; Smirnov, I. Feature Engineering for Depression Detection in Social Media. In
Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods, Madeira, Portugal, 16–18
January 2018; pp. 426–431.
14. Chiu, C.Y.; Lane, H.Y.; Koh, J.L.; Chen, A.L. Multimodal depression detection on instagram considering time interval of posts. J.
Intell. Inf. Syst. 2021, 56, 25–47. [CrossRef]
15. Islam, M.; Kabir, M.A.; Ahmed, A.; Kamal, A.R.M.; Wang, H.; Ulhaq, A. Depression detection from social network data using
machine learning techniques. Health Inf. Sci. Syst. 2018, 6, 1–12. [CrossRef] [PubMed]
16. Uddin, M.Z.; Dysthe, K.K.; Følstad, A.; Brandtzaeg, P.B. Automatic depression recognition by intelligent speech signal processing: A systematic survey. CAAI Trans. Intell. Technol. 2022.
17. Solieman, H.; Pustozerov, E.A. The Detection of Depression Using Multimodal Models Based on Text and Voice Quality Features.
In Proceedings of the 2021 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (ElConRus),
Moscow, Russia, 26–29 January 2021.
18. Shrestha, A.; Serra, E.; Spezzano, F. Multi-modal social and psycho-linguistic embedding via recurrent neural networks to identify
depressed users in online forums. Netw. Model. Anal. Health Inform. Bioinform. 2020, 9, 1–11. [CrossRef]
19. Dinkel, H.; Wu, M.; Yu, K. Text-based depression detection on sparse data. arXiv 2019, arXiv:1904.05154.
20. Yang, L. Multi-Modal Depression Detection and Estimation. In Proceedings of the 2019 8th International Conference on Affective
Computing and Intelligent Interaction Workshops and Demos (ACIIW), Cambridge, UK, 3–6 September 2019; pp. 26–30.
21. Zhang, H.; Cao, L.; Feng, L.; Yang, M. Multi-modal interactive fusion method for detecting teenagers' psychological stress. J. Biomed. Inform. 2020, 106, 103427. [CrossRef]
22. Huang, Z.; Epps, J.; Joachim, D. Exploiting vocal tract coordination using dilated cnns for depression detection in naturalistic
environments. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6549–6553.
23. Yang, L.; Jiang, D.; Sahli, H. Integrating deep and shallow models for multi-modal depression analysis—Hybrid architectures.
IEEE Trans. Affect. Comput. 2018, 12, 239–253. [CrossRef]
24. Razavi, R.; Gharipour, A.; Gharipour, M. Depression screening using mobile phone usage metadata: A machine learning
approach. J. Am. Med. Inform. Assoc. 2020, 27, 522–530. [CrossRef]
25. Richter, T.; Fishbain, B.; Markus, A.; Richter-Levin, G.; Okon-Singer, H. Using machine learning-based analysis for behavioral
differentiation between anxiety and depression. Sci. Rep. 2020, 10, 16381. [CrossRef]
26. Priya, A.; Garg, S.; Tigga, N.P. Predicting anxiety, depression and stress in modern life using machine learning algorithms.
Procedia Comput. Sci. 2020, 167, 1258–1267. [CrossRef]
27. Kumar, P.; Garg, S.; Garg, A. Assessment of anxiety, depression and stress using machine learning models. Procedia Comput. Sci.
2020, 171, 1989–1998. [CrossRef]
28. Lam, G.; Dongyan, H.; Lin, W. Context-aware deep learning for multi-modal depression detection. In Proceedings of the ICASSP
2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019;
pp. 3946–3950.
29. Zheng, W.; Yan, L.; Gou, C.; Wang, F.Y. Graph attention model embedded with multi-modal knowledge for depression detection.
In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2020; pp. 1–6.
30. Shalu, H.; CN, H.S.; Das, A.; Majumder, S.; Datar, A.; MS, S.M.; Das, A.; Kadiwala, J. Depression Status Estimation by Deep
Learning based Hybrid Multi-Modal Fusion Model. arXiv 2020, arXiv:2011.14966.
31. Li, X.; Zhang, X.; Zhu, J. Depression recognition using machine learning methods with different feature generation strategies. Artif. Intell. Med. 2019, 99, 101696. [CrossRef] [PubMed]
32. Ay, B.; Yildirim, O.; Talo, M.; Baloglu, U.B.; Aydin, G.; Puthankattil, S.D.; Acharya, U.R. Automated Depression Detection Using
Deep Representation and Sequence Learning with EEG Signals. J. Med. Syst. 2019, 43, 1–12. [CrossRef] [PubMed]
33. Čukić, M.; Stokić, M.; Simić, S.; Pokrajac, D. The successful discrimination of depression from EEG could be attributed to proper
feature extraction and not to a particular classification method. Cogn. Neurodynamics 2020, 14, 443–455. [CrossRef]
34. Depression Tweets Kaggle. Available online: https://www.kaggle.com/datasets/samrats/depressiontweets?select=train2Data.
csv (accessed on 15 October 2021).