Report
End of studies Project Report submitted in partial fulfillment of the requirements for the
degree of
Submitted by:
Imen Bouzidi
First, all praises and thanks are due to the Almighty Allah for guiding me to
the right path, for giving me the strength, knowledge, ability and opportunity
to undertake this work and to persevere and to complete it satisfactorily.
Depressed people often do not consult doctors at early stages, which can lead to
a serious deterioration of their condition and, in the worst case, to suicide.
Meanwhile, they increasingly use social media to disclose their emotions, thereby
generating valuable data that could be leveraged for the early detection of
depression. We started by collecting visual and textual data from various social
media platforms (Twitter, Pexels and Unsplash) using keywords extracted from
the Diagnostic and Statistical Manual of Mental Disorders (DSM-5). Then, we
trained multiple state-of-the-art models. For text classification, after
pre-processing, we trained a default embedding + LSTM model and a pre-trained
Global Vectors (GloVe) + BiLSTM model; for image classification, we trained a
Deep CNN, ResNets and BiT-L (Big Transfer Large). We also proposed aggregating
the text and image models to create a multi-modal model. 300K tweets were
collected from 5460 users (223014 of depressed class and 187360 of not-depressed
class). Moreover, we collected 11,484 images from Unsplash and Pexels (6250
images labeled as Depressed and 5234 images labeled as Not-Depressed). After
training the models, BiT-L (ResNet101x1) classified images with an accuracy of
0.82, with an F1-score of 0.81 for the depressed class and 0.82 for the
not-depressed class. As for texts, GloVe+BiLSTM reached an accuracy of 0.69,
with an F1-score of 0.65 for the depressed class and 0.73 for the not-depressed
class. Finally, we aggregated the two models into one model that accepts
multi-modal input and observed an improvement in classification, which is
promising for future research on this topic. This study demonstrates the
feasibility of leveraging social media data to make evidence-based mental health
services more widely available to those in need.
Keywords— Mental health, Depression, Social networks, Deep learning,
Multi-modal learning
Contents
List of Figures
List of Tables
Acronyms
Introduction
2 Background
2.1 Deep Learning (DL)
2.1.1 Basics of DL
2.1.2 Convolutional Neural Network (CNN)
2.1.3 Recurrent Neural Network (RNN)
2.2 Text Pre-processing techniques
2.3 Word Embedding
2.4 Classification Performance Metrics
2.4.1 Confusion matrix
2.4.2 Metrics
3.2.2 Textual Data
3.3 Models
3.3.1 Images Classification
3.3.2 Text Classification
3.3.3 Models Aggregation
3.4 Hyper-parameters
3.5 Project Workflow
3.5.1 Project’s folder structure
3.5.2 Training loop
Bibliography
List of Figures
4.4 Post of a user who is labeled not depressed and the results of prediction of each model
4.5 Post of a user who is labeled depressed and the results of prediction of each model
List of Tables
Acronyms
Introduction
In recent years, social media has become a host for nearly 3.5 billion people [1],
who increasingly rely on these platforms to communicate their emotions, concerns
and even daily routines. Valuable insights can be extracted from online behavior
to understand and predict offline behavior. Social media has therefore been
widely used by both mental health researchers and data scientists to understand
the mental state of users, which is reflected in the growing amount of data
shared on social networks. The aim of this study is to help reduce the alarming
suicide rate. Detecting depressive disorder via social networks (Twitter) has
been shown to be effective [2]. However, the greatest challenge of online
depression detection lies in the signs of depression themselves, which can vary
from one person to another.
In this work, we propose a study that focuses on using both texts and images to
detect signs of depression through social networks. Only a few studies have
addressed this topic. In this project, we therefore propose our own method of
aggregating two models: we train one model on image classification for depressed
users and another on text classification, and once both models reach good
accuracy, we combine their predictions for multi-modal input. In this
perspective, we build two robust models, one for image classification and one
for text classification, and aggregate them into a single model.
The first chapter of this report introduces the host company, the project
context and the current state of depression detection. In the second chapter,
we explore the theoretical and practical basis of Deep Neural Networks,
presenting Convolutional Neural Networks and Recurrent Neural Networks with
a brief explanation of word embedding and text pre-processing techniques. In
the third chapter, we look at the design of the solution, from data collection
to model training details.
Finally, we will present in chapter four the practical steps of data preparation
and the experimentation of the chosen models as well as the aggregation of the
best models and its results.
Chapter 1
This chapter introduces the host company where this internship was conducted
and the context of the graduation project. It also explains the problem
addressed by the project and reviews some studies on the current state of
detecting depression via social networks using deep learning techniques.
Chapter 1: Project’s Context and Process
• Build and train a powerful model for text classification of potentially de-
pressive users.
• Test the performance of the two models separately and check the contribu-
tion of the aggregation of both models to assist a multi-modal detection of
depression.
Conclusion
In this chapter, we introduced the general context and the problem addressed by
this project. In the following chapter, we present the theoretical tools that
made working on this project feasible, with an accessible introduction to the
models we used, which will make the design of our proposed solution easier to
understand.
Chapter 2
Background
The challenge of distinguishing users who present signs of depression from those
who do not, based on multi-modal data, requires extracting a massive number of
features from raw unstructured data, principally images and texts. This leads
us naturally to Deep Learning (DL), thanks to its ability to learn and
perform increasingly complex tasks.
In this chapter, we introduce the theory behind the models used in this study,
starting with the basics of deep learning. Then we present the Convolutional
Neural Network (CNN), used for image classification of users presenting signs of
depression and others who do not. Thereafter, for the text classification task,
we explain the Recurrent Neural Network (RNN), with some details about Long
Short-Term Memory (LSTM) and Bi-directional Long Short-Term Memory (Bi-LSTM)
networks and a brief look at word embedding and text pre-processing techniques.
Finally, we present the classification metrics suited to evaluating image and
text classification models.
2.1.1 Basics of DL
2.1.1.1 The Perceptron
The perceptron is a single-layer neural network model designed for supervised
binary classification, invented in 1957 by Frank Rosenblatt [10]. This model is
composed of one neuron (computational unit); its simplicity therefore helps in
understanding the basics of how neural networks function.
The structure of the perceptron in forward propagation is presented in figure 2.1.
Where:
• (x1 , ..., xn ): the input vector
• (w1 , ..., wn ): the weights vector
• f : the activation function
• b: the bias.
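With the notation above, the forward pass of a perceptron can be sketched in a few lines. This is only an illustration under our own assumptions: the step activation, the example weights and the helper name `perceptron_forward` are hypothetical and are not taken from the project code.

```python
import numpy as np

def step(z):
    """Threshold activation (an assumed choice of f): 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def perceptron_forward(x, w, b, f=step):
    """Compute f(sum_i x_i * w_i + b) for a single input vector."""
    z = float(np.dot(x, w)) + b  # weighted sum of the inputs plus the bias
    return f(z)

# Hypothetical input, weights and bias, chosen only for illustration.
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -0.4, 0.2])
b = -0.6
y = perceptron_forward(x, w, b)
```

Here the weighted sum is slightly positive, so the neuron fires; with an all-zero input only the negative bias remains and the output is 0.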
In the forward propagation, once we randomly initialize the weights’
vector $(w_1, ..., w_n)$, the input layer takes a numerical representation of data
$(x_1, ..., x_n)$ (e.g. pixels of an image, sequences of a text) then $\sum_{0 \le i \le n} x_i w_i$ is
• Rectified Linear Unit (ReLU): is one of the most widely used activation
functions in neural networks, as it allows the network to converge quickly.
Mathematically, it is defined as:
$$f : \mathbb{R} \to [0, +\infty[, \quad x \mapsto \max(0, x) \tag{2.2}$$
• Sigmoid or Logistic activation function: is generally used for binary
classification problems as it outputs a value between 0 and 1. It is defined as:
$$f : \mathbb{R} \to \, ]0, 1[, \quad x \mapsto \frac{1}{1 + e^{-x}} \tag{2.3}$$
• Hyperbolic tangent (tanh): has almost the same properties as the sigmoid
function, but it outputs values that range between -1 and 1. Its mathematical
expression is:
$$f : \mathbb{R} \to \, ]-1, 1[, \quad x \mapsto \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{2.4}$$
• Softmax: is a generalization of the logistic function to several classes,
which turns numbers into probabilities that sum to 1. Its output is a
probability vector that predicts the membership of each target class.
Mathematically, let N be the number of target classes to be predicted; the
softmax function is defined as:
$$f_i : \mathbb{R}^N \to \, ]0, 1[, \quad x \mapsto \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}, \quad \text{for } i = 1, ..., N \tag{2.5}$$
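The four activation functions listed above can be written directly from their definitions. The following is a minimal NumPy sketch; the function names and the example scores are ours, and the subtraction of the maximum inside `softmax` is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def relu(x):
    """max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """1 / (1 + e^(-x)), with values in ]0, 1[."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """(e^x - e^(-x)) / (e^x + e^(-x)), with values in ]-1, 1[."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def softmax(x):
    """e^(x_i) / sum_j e^(x_j); the outputs sum to 1."""
    shifted = x - np.max(x)  # stability shift, mathematically neutral
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([-1.0, 0.0, 2.0])  # illustrative values only
probs = softmax(scores)
```

The largest score receives the largest probability, which is why softmax is typically used on the output layer of a classifier.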
Input Layer:
Unlike fully connected neural networks, where the input is a vector of pixels,
in a CNN an image is usually represented as a three-dimensional (3D) matrix of
pixel values. Height and width correspond to the dimensions of the input image,
and depth is generally three, for the RGB (Red-Green-Blue) channels.
Convolutional Layer:
The convolution operation is the basic component of the CNN, introduced by Yann
LeCun [15]. It consists in applying a filter to the input image to detect
features related to each class.
Figure 2.6: Convolutional Layer of an input image of 9*9 pixels with stride 1 and size of filter 3*3
1 It is the size of the filter matrix; it is considered a hyper-parameter of the convolution operation.
2 Stride is the number of pixels by which the filter is shifted at each step of the convolution; it is considered a hyper-parameter of the convolution operation.
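To make the roles of filter size and stride concrete, the sketch below applies a 3*3 filter with stride 1 to a 9*9 input, as in figure 2.6. It is a naive illustration, not the implementation used by deep learning libraries: we assume a single channel, no padding, and an averaging filter chosen arbitrarily. Like most deep learning frameworks, it does not flip the filter.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and sum the element-wise products."""
    h, w = image.shape
    kh, kw = kernel.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(81, dtype=float).reshape(9, 9)  # toy 9*9 "image"
kernel = np.ones((3, 3)) / 9.0                    # assumed averaging filter
feature_map = conv2d(image, kernel, stride=1)
```

With these settings the output has (9 - 3) / 1 + 1 = 7 rows and columns, which shows how the convolution already reduces the spatial dimensions.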
Pooling Layer:
Similar to the convolutional layer, the pooling layer is used to further reduce
the dimensions of each of the previous matrices independently (dimensionality
reduction), which significantly decreases the number of parameters and the
computational power required to process the data.
Furthermore, this operation extracts dominant features from the input and helps
train the model effectively, as the network becomes invariant to small
transformations and translations of the input image.
Pooling can be accomplished with several approaches: Max pooling, Min pooling
and Average (mean) pooling. The most common approach is Max pooling. The
process of applying Max pooling is presented in figure 2.7.
Figure 2.7: Pooling Layer of an input image of 7*7 pixels with stride 2 and size 3*3
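The Max pooling operation of figure 2.7 (7*7 input, 3*3 window, stride 2) can be sketched as follows. This is a minimal single-channel illustration with toy values of our own; the function name `max_pool2d` is hypothetical.

```python
import numpy as np

def max_pool2d(x, size=3, stride=2):
    """Keep the maximum of each size*size window, moving `stride` pixels at a time."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

x = np.arange(49, dtype=float).reshape(7, 7)  # toy 7*7 input
pooled = max_pool2d(x, size=3, stride=2)
```

The 7*7 input is reduced to a 3*3 output, since (7 - 3) / 2 + 1 = 3.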
Figure 2.9: Summary of the CNN architecture for the example of an input of 9*9 pixels
In general the more convolutional layers we have, the more features the
model will be able to recognize.
As cited in [18], RNNs suffer from vanishing and exploding gradient problems
on complicated tasks where more context needs to be taken into consideration.
Long Short-Term Memory (LSTM) networks were designed to solve this problem [19];
the following part explains LSTM.
N.B.: $W_i$ and $W_c$ are the weights corresponding to each input of each layer,
while $b_i$ and $b_c$ are the biases.
• Output Gate:
This is the final gate; it decides which part of the current cell state makes
it to the output. A sigmoid function decides which values to let through by
applying $o_t = \sigma(W_o \cdot [y^{<t-1>}, x^{<t>}] + b_o)$, just like the forget gate, and then
tanh is applied to the current cell state $C_t$, which has been updated through
the gates. So the final output of the node is:
$$y^{<t>} = o_t * \tanh(C_t)$$
where $*$ denotes the element-wise product.
This explanation was inspired by Colah’s blog [22].
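The output gate can be illustrated with a short NumPy sketch. It follows the standard LSTM formulation presented in Colah's blog, in which the gate activation multiplies tanh of the cell state element-wise. The dimensions and the random weights below are arbitrary assumptions used only to make the example runnable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_gate(y_prev, x_t, c_t, W_o, b_o):
    """Return (o_t, y_t) for one LSTM node, given the already updated cell state c_t."""
    concat = np.concatenate([y_prev, x_t])  # [y<t-1>, x<t>]
    o_t = sigmoid(W_o @ concat + b_o)       # which values to let through
    y_t = o_t * np.tanh(c_t)                # element-wise product with tanh(C_t)
    return o_t, y_t

rng = np.random.default_rng(0)
hidden, inputs = 4, 3                       # assumed sizes
y_prev = rng.normal(size=hidden)
x_t = rng.normal(size=inputs)
c_t = rng.normal(size=hidden)
W_o = rng.normal(size=(hidden, hidden + inputs))
b_o = np.zeros(hidden)
o_t, y_t = output_gate(y_prev, x_t, c_t, W_o, b_o)
```

Because the gate values lie strictly between 0 and 1 and tanh lies between -1 and 1, every component of the output stays within ]-1, 1[.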
The LSTM network takes into consideration the current input and only the
previous output of each node. Therefore, an LSTM network struggles to capture
the context of the whole sentence. In NLP, for example, to understand a word we
need not just the previous word but also the next word. To solve this problem,
Bi-LSTM networks were designed, inspired by Bidirectional Recurrent Neural
Networks (BRNN) [23].
A Bi-LSTM is composed of multiple nodes just like an LSTM, but it applies
forward propagation twice: once for the forward cells (like LSTM) and once for
the backward cells (taking the next word into consideration). To sum up, the
Bi-LSTM output depends on the previous cell, the current cell and the next
cell. This is illustrated in figure 2.12.
As we can notice from the figure 2.12, Bi-LSTM network trains two LSTMs
on the input sequence, the first LSTM on the sequence as it is and the other
on a reversed version of the sequence.
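The bidirectional mechanism itself can be sketched independently of the cell type. In the illustration below we deliberately replace the LSTM cell with a simple tanh recurrent cell to keep the code short; this is a simplification of ours, and the sizes and random weights are arbitrary. One pass reads the sequence as it is, the other reads the reversed sequence, and the two hidden states of each time step are concatenated.

```python
import numpy as np

def run_rnn(seq, W_x, W_h, b):
    """Run a simple tanh recurrent cell over `seq` and return all hidden states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in seq:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

def bidirectional(seq, fwd_params, bwd_params):
    """Concatenate a forward pass and a backward pass, aligned per time step."""
    forward = run_rnn(seq, *fwd_params)
    backward = run_rnn(seq[::-1], *bwd_params)[::-1]  # re-align to original order
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(1)
T, d_in, d_h = 5, 3, 4                                # assumed sizes

def init_params():
    return (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))

seq = rng.normal(size=(T, d_in))
fwd_params, bwd_params = init_params(), init_params()
outputs = bidirectional(seq, fwd_params, bwd_params)
```

Each time step now carries 2 * 4 = 8 values: the first half summarizes what precedes the word, the second half what follows it.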
In fact, this may seem obvious and easy, but removing certain words can affect
the processing, so the list of stop words depends on the context of the study
[24] [25].
• Stemming: is a process of reducing words of the same root to a com-
mon form (e.g. ’Likes’ and ’Liked’ will be reduced to ’Like’). The main
algorithms used for stemming are:
– Porter Stemming Algorithm [26] removes common endings from words.
– Lancaster Stemming Algorithm [27] is the most aggressive stemming
algorithm; it can transform words into strange stems that have no
meaning (e.g. ’having’ becomes ’hav’).
• Lemmatization: is also a process of reducing words having the same
root to a common word. Unlike stemming, lemmatization does not simply
slice off endings but uses a lexical base to get the correct form of the
common word. For example, ’meeting’ becomes ’meet’ and ’played’ becomes ’play’.
• Text Tokenization: consists in splitting the text, after pre-processing it,
into smaller pieces called tokens. This can be done at character level
(splitting words into their composing characters) or at word level (splitting
sentences into their composing words).
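The two tokenization levels and the removal of stop words can be sketched with the Python standard library alone. The regular expression and the small stop-word set are illustrative assumptions of ours, not the ones used in this project.

```python
import re

def word_tokenize(text):
    """Word-level tokenization: lowercase, then keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def char_tokenize(word):
    """Character-level tokenization of a single word."""
    return list(word)

def remove_stop_words(tokens, stop_words):
    """Drop tokens that belong to the (context-dependent) stop-word list."""
    return [t for t in tokens if t not in stop_words]

STOP_WORDS = {"the", "a", "is", "and"}  # hypothetical list
tokens = word_tokenize("The meeting is long and tiring")
filtered = remove_stop_words(tokens, STOP_WORDS)
chars = char_tokenize("sad")
```

Stemming or lemmatization would then be applied to the remaining tokens, for instance turning 'meeting' into 'meet'.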
For the example of figure 2.13, we suppose that the dimensions have a clear
meaning. We can notice that the vector representations of words sharing a
meaning have close values, as represented in the space on the right side. We
can easily apply mathematical operations to calculate similarities between
words, e.g. $w_{King} - w_{Man} + w_{Woman} \approx w_{Queen}$.
There are multiple methods of word embedding, such as Word2Vec, GloVe,
FastText and ELMo.
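The analogy between King, Man, Woman and Queen can be reproduced on a toy example. The three-dimensional vectors below are invented for illustration (their dimensions loosely stand for royalty, masculinity and food); real embeddings such as GloVe have tens to hundreds of dimensions without such a clear meaning.

```python
import numpy as np

# Hypothetical 3-D embeddings, made up for this illustration.
embeddings = {
    "king":  np.array([0.9, 0.9, 0.0]),
    "man":   np.array([0.1, 0.9, 0.0]),
    "woman": np.array([0.1, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.0]),
    "apple": np.array([0.0, 0.1, 0.9]),
}

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vectors[a] - vectors[b] + vectors[c]."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

result = analogy("king", "man", "woman", embeddings)
```

Cosine similarity is used here as the closeness measure; it is a common choice, though not the only one.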
As we can see in table 2.1, the diagonal values of this matrix represent the
correct predictions for each class, while the off-diagonal values represent the
samples that are misclassified.
In simple words:
• True Positives: the model predicted "Depressed" and the actual output was
also "Depressed".
• False Positives: the model predicted "Depressed" and the actual output was
"Not Depressed".
• True Negatives: the model predicted "Not Depressed" and the actual output
was "Not Depressed".
• False Negatives: the model predicted "Not Depressed" and the actual output
was "Depressed".
2.4.2 Metrics
2.4.2.1 Classification Accuracy
Precision and Recall are both class-specific performance metrics.
Mathematically, they are defined for both classes as follows:
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
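The link between the confusion matrix and these two metrics can be sketched as follows. The label lists are invented for illustration, and the helper names are ours; in practice a library routine would usually be used instead.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, TN and FN, treating `positive` as the class of interest."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            counts["TP"] += 1
        elif predicted == positive:
            counts["FP"] += 1
        elif actual == positive:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

def precision_recall(counts):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels for five users.
y_true = ["Depressed", "Depressed", "Not Depressed", "Not Depressed", "Depressed"]
y_pred = ["Depressed", "Not Depressed", "Not Depressed", "Depressed", "Depressed"]
counts = confusion_counts(y_true, y_pred, positive="Depressed")
precision, recall = precision_recall(counts)
```

Calling the same functions with `positive="Not Depressed"` gives the metrics of the other class, which is what "class-specific" means here.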
2.4.2.3 F1-Score
The F1-score summarizes both the precision and the robustness of the model. It
ranges from 0 to 1; the greater it is, the better the performance of our model.
Mathematically, it is defined as the harmonic mean of precision and recall,
therefore it gives importance to both recall and precision. Its expression
is as follows:
$$F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
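As a small worked example of the harmonic mean, the sketch below computes the F1-score from hypothetical counts (8 true positives, 2 false positives, 4 false negatives; these numbers are not results of this study).

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

tp, fp, fn = 8, 2, 4   # hypothetical counts
p = precision(tp, fp)  # 8 / 10
r = recall(tp, fn)     # 8 / 12
f1 = f1_score(p, r)
```

The harmonic mean lies between the two values but closer to the smaller one, so a model cannot obtain a high F1-score by favoring precision at the expense of recall or the reverse. The same value can be obtained directly as 2TP / (2TP + FP + FN).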
Conclusion
In this chapter, we presented the basics of deep neural networks alongside the
explanations of CNN and RNN. We also went through texts’ preprocessing
techniques and word embedding to understand how sequential data is prepared
for computer processing and finally we finished with the metrics that will allow
us to evaluate our models.
Now that we have covered the theory of the models used in the study, we will
present in the next chapter the solution we propose to address the problem of
the project.
Chapter 3
In this chapter, we will go through the technical steps of building the models
of text and image classification. First, we will start with the motivation of this
study and follow with the data collection for both visual and textual data and
how we managed to clean it and prepare it for the processing phase. Then we
will present the architectures and explanations of the models that we worked
with. Finally, we will finish with defining the hyper-parameters and detail the
project workflow.
Chapter 3: Design of the solution
The extraction of self-declared cases was inspired by a scientific article that
supports the efficiency of this method [8]. The keywords used to search for
depressed users were: "I am diagnosed with depression", "I am fighting
depression" and "I suffer from depression". To extract "not depressed" users,
we looked for an approach that would retrieve profiles of users who are very
unlikely to be depressed. This approach consists in searching for trending
happiness hashtags such as "#happyme", "#lifeisgood" and "#lovemylife".
3.3 Models
3.3.1 Images Classification
To accurately classify images that depressed users are likely to post, we
trained multiple models with different architectures based on the CNN explained
in Section 2.1.2 of Chapter 2, in order to obtain the best performance. We
started with the simplest architecture and finished with a state-of-the-art
model:
ResNets learn residual functions with reference to the layer inputs, instead of
learning unreferenced functions. They take a standard feed-forward CNN and
add skip connections that bypass a few convolution layers at a time.
Each bypass gives rise to a residual block in which the convolution layers
predict a residual that is added to the block’s input tensor. They stack residual
blocks on top of each other to form a network: e.g. a ResNet-50 has fifty layers
using these blocks.
Typical ResNet models are implemented with double layer skips that contain
non linearities (ReLU) and batch normalization 3 in between.
Formally, denoting the desired underlying mapping as H(x) , we let the
stacked nonlinear layers fit another mapping of F (x) = H(x) − x. The original
mapping is recast into F (x) + x.
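The recast mapping F(x) + x can be illustrated with a deliberately simplified residual block. To keep the sketch short we assume fully connected layers instead of convolutions and omit batch normalization; the weights are random placeholders.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Two stacked layers learn the residual F(x); the skip connection adds x back."""
    f = W2 @ relu(W1 @ x)  # residual branch F(x)
    return relu(f + x)     # H(x) = F(x) + x, followed by a non-linearity

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 3.0])
W1 = rng.normal(size=(3, 3))
W2_zero = np.zeros((3, 3))  # a residual branch that has learned nothing
identity_out = residual_block(x, W1, W2_zero)
```

When the residual branch outputs zero, the block simply passes its input through (up to the final ReLU). This is one intuition for why very deep stacks of such blocks remain easy to optimize: each block only has to learn a correction to the identity.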
The building block of a ResNet is presented in figure 3.4:
There is empirical evidence that these types of network are easier to opti-
mize, and can gain accuracy from considerably increased depth [34].
In our study, we used two ResNet architectures that differ in their number
of layers: ResNet50 (50 layers) and ResNet101 (101 layers).
Transfer Learning:
Since training is very expensive, both in time and resources, we chose to work
with a pre-trained ResNet model. Transfer of pre-trained representations im-
proves sample efficiency and simplifies hyperparameter tuning when training
deep neural networks for vision. Basically, transfer learning, as its name indi-
cates, transfers the learning from a model trained on a specific task to a new
model. In our case we chose to not fine-tune weights, so we will keep the model
as it is and only train the output layer.
3 It normalizes the input layer by adjusting and scaling the activations to allow each layer of a network to learn by itself
The pre-trained model acts as a feature extractor that was previously trained
on huge data-sets and is therefore very good at this task.
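The idea of keeping the pre-trained model as it is and training only the output layer can be sketched without any deep learning framework. In the toy example below, a fixed random projection stands in for the pre-trained backbone (an assumption made purely for illustration), and a logistic output layer is trained by gradient descent on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained backbone: fixed weights that are never updated.
W_frozen = rng.normal(size=(16, 8))

def extract_features(x):
    """Frozen feature extractor: only used for inference."""
    return np.maximum(0.0, x @ W_frozen.T)

def log_loss(features, labels, w, b):
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

def train_head(features, labels, lr=0.01, epochs=300):
    """Train only the output layer (w, b) with plain gradient descent."""
    w, b = np.zeros(features.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad = p - labels
        w -= lr * features.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

X = rng.normal(size=(200, 8))          # synthetic inputs
y = (X[:, 0] > 0).astype(float)        # synthetic binary labels
backbone_before = W_frozen.copy()
feats = extract_features(X)
loss_before = log_loss(feats, y, np.zeros(16), 0.0)
w, b = train_head(feats, y)
loss_after = log_loss(feats, y, w, b)
```

Only 17 values (16 weights and one bias) are learned here, while the backbone weights stay untouched, which is what makes this form of transfer learning cheap in time and resources.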
Big Transfer (BiT)
To go even further and enhance the performance of the image classification
model, we used BiT, a state-of-the-art model [35] developed in 2019. Specifically,
we used BiT-L (Big Transfer Large), which is trained on the JFT-300M dataset
(about 300 million images).
We can choose the architecture of the model when importing it; we chose to
try ResNet50 and ResNet101, since these two gave the best results on other
datasets.
3.3.2.1 LSTM:
GloVe embedding:
GloVe is an unsupervised learning algorithm developed at Stanford in 2014 [36];
it is one of the most commonly used word embedding approaches. It encodes the
co-occurrence probability ratio between two words as vector differences, so
that differences between word vectors are meaningful. The context of two words
is defined by whether or not they appear within N words of each other.
The model architecture contains 5 separate types of layers as figure 3.6 indi-
cates, these layers are:
• The input layer
• Embedding layer : a pre-trained GloVe word embedding.
• The Bidirectional layer : which is a layer composed of 512 BiLSTM units.
• A dropout layer to avoid over-fitting.
• The dense layer is the layer that outputs the result.
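Before the embedding layer can use pre-trained GloVe vectors, they have to be arranged into a matrix aligned with the vocabulary of the tokenizer. The sketch below shows one common way to do this. The two sample lines imitate the GloVe text format (a word followed by its vector components), but their values, the `word_index` mapping and the helper names are hypothetical.

```python
import numpy as np

def load_glove(lines):
    """Parse GloVe-style text lines: a word followed by its vector components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay at zero."""
    matrix = np.zeros((len(word_index) + 1, dim))  # row 0 is reserved for padding
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix

sample_lines = ["sad 0.1 0.2 0.3", "happy -0.1 0.4 0.0"]  # invented values
glove_vectors = load_glove(sample_lines)
word_index = {"sad": 1, "happy": 2, "lol": 3}             # hypothetical vocabulary
embedding_matrix = build_embedding_matrix(word_index, glove_vectors, dim=3)
```

Such a matrix would typically be passed as the initial, non-trainable weights of the embedding layer, so that the Bidirectional layer receives GloVe vectors rather than randomly initialized ones.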
3.4 Hyper-parameters
In machine learning, a hyper-parameter is a variable whose value controls the
learning process of the model. These parameters express the complexity of the
model and how fast it should learn. Hyper-parameters are usually fixed before
the actual training process begins.
We can distinguish two types of hyper-parameters:
4 We used two types of optimizer: SGD, which is the gradient descent (with momentum) optimizer, and Adam, which implements
the Adam algorithm.
2. Load the model’s script: Next, we import the script created for the
model, which prepares the data and trains the model.
    from models.Image_classifier import Deep_CNN
3. Train the model: we can train the model by simply calling the model
function that we created, as follows:
    CM, CR = Deep_CNN.model(train, validation, test, batch_size=128, lr=0.0001, epochs=10)
This will return the textual progress of the training on one hand, and return
the confusion matrix with a confusion report on the other hand.
4. Test the model: we scan the results given by the fit command to check if
we have problems whether with data or with the model. For example, if we
notice an important difference between train accuracy and test accuracy,
we have to think about over-fitting and find a solution accordingly. Also we
calculate the confusion matrix and the performance metrics for the model.
5. Save Model: The model’s architecture and weights are automatically saved
while training in order to gain time and load the model when needed.
For each loop of each model, we may repeat steps three and four, while manually
tuning the hyper-parameters when calling the model function.
Conclusion
In this chapter we presented the design of our solution for this study including
the sources of our different data-sets and the inspiration of the idea. Then
we proceeded with the architectures of the different models that we suggested
for both images and texts classification and next we presented the method of
aggregation of the models. Finally, we drew the big picture of the project
workflow and the steps followed for each model. In the next chapter, we will
present the method of crawling data and some visualization, then we will present
the experimental results of the models and their aggregation.
Chapter 4
This chapter presents the results of our study and the implementation of our
models. We start with the tools we used to carry out this study, then we
present the process and results of data collection and pre-processing of the
visual and textual data-sets. Finally, we finish with the main goal of this
study, which includes the results of both image classification and text
classification. Moreover, we assess the value of aggregating these two models
to better predict users suffering from depression.
4.1.1 Hardware
In the process of the implementation of our solution we used two main machines,
a local machine for refactoring codes, testing models and research, and a virtual
machine (VM) on Google Cloud Platform (GCP) to run models and codes that
are heavy in terms of computation and time. Following are the specifications
of these machines:
• Local Machine: Lenovo E330
∗ Operating System: Kali Linux 2020.2
∗ CPU: Intel Core i5-3230M 2.6 GHz
∗ RAM: 8 GB DDR3
∗ Disk: 320 GB HDD
Chapter 4: Implementation and Experimental Results
4.2.1 Images
We will start with the process of crawling images, then explain how we cleaned
these data, and finish with creating the loader that simplifies writing the
script for the model.
As explained in chapter three, we used keywords to search for images on two
websites, Unsplash and Pexels, and crawled the data using two different methods.
Although the overall process that we used for crawling images is common to the
two websites (figure 3.1), the practical functions differ. We present how data
were collected for each website:
Unsplash:
We created a function to crawl the links of all images for a specific keyword
using the Unsplash API. To crawl links, we simply call this function with
the proper parameters to save the links in a csv file.
Pexels:
We did not use the Pexels API to crawl links of images for a specific keyword.
Instead, we created a function using Selenium Python 1, which is easy to
implement, to save the links in a csv file.
After scraping the links of the photos, we used a function to save these images
in a local directory for further processing. This task took almost two weeks
to complete due to various problems such as interruptions of the internet
connection, the size of the images (every image is almost 4.0 MB) and technical
issues.
We collected a total of :
• 6250 images labeled as "Depressed"
• 5234 images labeled as "Not Depressed"
Images were collected using the same keywords, therefore, we had to think
about a solution to eliminate repetitive images. Fdupes 2 for Linux was the best
option we used to remove duplicate images.
After cleaning data, the number of images decreased from 6250 images la-
beled as "Depressed" to 4023 images and from 5234 images labeled as "Not
Depressed" to 4197 images.
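The principle behind this cleaning step, removing files whose content is byte-for-byte identical, can be sketched in Python. This is only an illustration of the idea in the spirit of Fdupes; it is not how Fdupes is implemented, and the temporary files below are stand-ins for the crawled images.

```python
import hashlib
import os
import tempfile

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so that large images fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def remove_duplicates(directory):
    """Keep the first file of each content hash and delete the others."""
    seen, removed = set(), []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        digest = file_digest(path)
        if digest in seen:
            os.remove(path)
            removed.append(name)
        else:
            seen.add(digest)
    return removed

# Demonstration on throwaway files standing in for images.
with tempfile.TemporaryDirectory() as tmp:
    for name, content in [("a.jpg", b"same"), ("b.jpg", b"same"), ("c.jpg", b"other")]:
        with open(os.path.join(tmp, name), "wb") as handle:
            handle.write(content)
    removed = remove_duplicates(tmp)
    remaining = sorted(os.listdir(tmp))
```

Note that hashing only detects exact copies; the same photo saved at another resolution would not be caught by this approach.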
To optimize the data generation process and simplify training, we created a
loader for crawled images following these steps:
• Resize and convert images to npy 3 format.
• Use npz 4 to zip all images, reducing their size from 63.9GB to 1.2GB.
• Upload zipped files to mediafire.
• Create a loader to make the generation of data easier when needed.
It returns a data-set split into train, validation and test sets.
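The steps above can be sketched with NumPy. In this illustration the images are dummy arrays, the archive is written to an in-memory buffer instead of a file, and the 80/10/10 split is an assumption of ours; the actual proportions used by the loader are not stated here.

```python
import io
import numpy as np

def split_dataset(images, labels, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then cut into train, validation and test subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    n_test = int(len(images) * test_frac)
    n_val = int(len(images) * val_frac)
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return tuple((images[idx], labels[idx]) for idx in (train_idx, val_idx, test_idx))

# Dummy resized images (20 images of 8*8 pixels, 3 channels) and labels.
images = np.zeros((20, 8, 8, 3), dtype=np.uint8)
labels = np.array([0, 1] * 10)

# Compress both arrays into a single npz archive, then load it back.
buffer = io.BytesIO()
np.savez_compressed(buffer, images=images, labels=labels)
buffer.seek(0)
archive = np.load(buffer)

train_set, val_set, test_set = split_dataset(archive["images"], archive["labels"])
```

Storing resized `uint8` arrays in a compressed archive is what allows a large reduction in size compared with the original image files.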
1 Selenium Python provides a simple API to write functional using Selenium WebDriver. Through Selenium Python API we can
Results:
We may notice in the sample of images in figure 4.1 that images present different
features in spite of belonging to the same class.
The final step of preparing data is resizing images based on the specificity of
each model, converting it into a matrix of pixels and normalizing it (e.g. divide
it by 255).
Now that we have images data ready to be used in training our models, we
move to how we manage to get textual labeled data.
5 It is a package written in Python for scraping Tweets from Twitter profiles without using Twitter’s API,
https://github.com/twintproject/twint
Crawled tweets were saved in separate csv files for each user. The data loader
combines these files into one csv file having the tweets and the corresponding
labels. This loader returns the train, validation and test set of textual data.
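The combination of per-user files into one labeled file can be sketched with the standard `csv` module. In-memory files are used below so that the example is self-contained, and the column name `tweet` is an assumption about the layout of the crawled files.

```python
import csv
import io

def combine_user_files(user_files, out_file):
    """Write every tweet of every user to `out_file`, together with the user's label."""
    writer = csv.writer(out_file)
    writer.writerow(["tweet", "label"])
    rows = 0
    for handle, label in user_files:
        for record in csv.DictReader(handle):
            writer.writerow([record["tweet"], label])
            rows += 1
    return rows

# Two hypothetical per-user files with invented tweets.
user_a = io.StringIO("tweet\nfeeling low today\ncannot sleep again\n")
user_b = io.StringIO("tweet\nwhat a great day\n")
combined = io.StringIO()
n_rows = combine_user_files([(user_a, "depressed"), (user_b, "not depressed")], combined)
```

Each tweet inherits the label of its author, which matches the way users, rather than individual tweets, are labeled in this study.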
We noticed that words are flexible and variable in the raw data crawled from
social media, which causes great difficulties in word matching and semantic
analysis.
Therefore, we created a customized function to pre-process our data, keeping in
mind that our study aims to detect textual features that are specific to
depressed users. We wanted to keep as much of the information in these tweets
as possible. For this, we created a personalized list of stop words to remove
and a list of words to replace with their substitutes (such as bad words and
contractions). We also decoded emoticons to emoticon aliases using emoji 6, a
Python package. Finally, tweets were tokenized with SocialTokenizer, a
tokenizer belonging to ekphrasis 7, a text processing tool adapted to texts
from social networks.
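The overall shape of such a pre-processing function can be sketched in plain Python. The dictionaries below are small hypothetical stand-ins for the personalized lists described above, and a simple whitespace split replaces the emoji package and the SocialTokenizer, which are not used in this sketch.

```python
CONTRACTIONS = {"can't": "cannot", "i'm": "i am", "don't": "do not"}  # hypothetical
STOP_WORDS = {"the", "a", "an", "to"}                                 # hypothetical
EMOJI_ALIASES = {"\U0001F602": ":face_with_tears_of_joy:"}            # one sample entry

def preprocess(tweet):
    """Lowercase, decode emojis to aliases, expand contractions, drop stop words."""
    text = tweet.lower()
    for emoji_char, alias in EMOJI_ALIASES.items():
        text = text.replace(emoji_char, f" {alias} ")
    tokens = [CONTRACTIONS.get(token, token) for token in text.split()]
    tokens = " ".join(tokens).split()  # re-split expanded contractions
    return [token for token in tokens if token not in STOP_WORDS]

result = preprocess("I'm going to the doctor \U0001F602")
```

Keeping the emoji as a textual alias, instead of deleting it, preserves a signal that may differ between the two classes.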
Results:
The word clouds of tweets from both classes "Depressed" and "Not depressed"
are respectively presented in figure 4.2 and figure 4.3.
The content of textual posts clearly differs between the two classes: various
words are used more often by individuals labeled depressed, such as bad words,
"need" and emojis such as "face_with_tears_of_joy". One common word that we may
notice is "pic", which means that the user shared a post containing a photo.
This reinforces our interest in studying the detection of depression using the
combination of texts and images.
Conclusion:
The BiT-L (ResNet101x1) model outperformed all the other models we tested;
therefore, we saved this model using Keras so that it can be loaded later for
the aggregation of models.
At this point, we have a robust model for detecting depression using image
features. We now move on to the results of text classification.
According to table 4.4, the GloVe+BiLSTM model outperformed the LSTM model.
Since we calculated the confusion matrix and the metrics for each model, we
present the results for the models that gave us the best accuracy (0.6957 and
0.7219) to decide which one to choose:
                Precision   Recall   F1-score   Support
Depressed         0.70       0.61      0.65       635
Not depressed     0.70       0.77      0.73       739
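As a quick consistency check of the reported values, the F1-scores can be recomputed from the precision and recall of each class, and the accuracy can be approximated from the recalls weighted by the supports. The calculation below uses the rounded values of the table, so the results are approximate.

```latex
\begin{align*}
F1_{Depressed} &= \frac{2 \times 0.70 \times 0.61}{0.70 + 0.61} = \frac{0.854}{1.31} \approx 0.65 \\
F1_{Not\ depressed} &= \frac{2 \times 0.70 \times 0.77}{0.70 + 0.77} = \frac{1.078}{1.47} \approx 0.73 \\
Accuracy &\approx \frac{0.61 \times 635 + 0.77 \times 739}{635 + 739} = \frac{956.38}{1374} \approx 0.696
\end{align*}
```

Both F1-scores agree with the table, and the recovered accuracy is consistent with the reported value of 0.6957.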
Conclusion:
Judging by the recall and precision of the depressed class, which is the main
focus of our study, we choose the model with an accuracy of 0.6957 for the
aggregation and further prediction.
At this stage, we have a robust model for image classification and a good
model for text classification. We now move on to the aggregation of these
models.
Figure 4.4: Post of a user who is labeled not depressed and the results of prediction of each model
Figure 4.5: Post of a user who is labeled depressed and the results of prediction of each model
Conclusion
In this final chapter, we described the work environment set up for this
graduation project. Then, we presented the steps we followed to collect and
clean the textual and visual data for processing.
Subsequently, based on the models’ performances, we decided which models
to keep for textual and visual data. Finally, we aggregated the two best
models obtained from the conducted experiments.
Conclusion and future work
‘The important thing is to not stop questioning. Curiosity has its own reason for existing.’
Albert Einstein
Bibliography
[1] J. Clement. Number of global social network users 2010-2023, April 2020.
[2] Michael Gamon, Munmun Choudhury, Scott Counts, and Eric Horvitz.
Predicting depression via social media. Association for the Advancement
of Artificial Intelligence, 07 2013.
[3] J Myers and Myrna Weissman. Use of self-report symptom scale to detect
depression in a community sample. The American journal of psychiatry,
137:1081–4, 10 1980.
[4] Aron Halfin. Depression: The benefits of early and appropriate treatment.
The American journal of managed care, 13:S92–7, 12 2007.
[5] Luca Avena, Fabienne Castell, Alexandre Gaudillière, and Clothilde Melot.
Random forests and networks analysis. Journal of Statistical Physics, 11
2017.
[6] Md Rafiqul Islam, Ashad Kabir, Ashir Ahmed, Abu Kamal, Hua Wang,
and Anwaar Ulhaq. Depression detection from social network data using
machine learning techniques. Health Information Science and Systems, 6:8,
12 2018.
[7] Michael Gamon, Munmun Choudhury, Scott Counts, and Eric Horvitz.
Predicting depression via social media. Association for the Advancement
of Artificial Intelligence, 07 2013.
[8] Sharath Chandra Guntuku, David Yaden, Margaret Kern, Lyle Ungar, and
Johannes Eichstaedt. Detecting depression and mental illness on social
media: an integrative review. Current Opinion in Behavioral Sciences,
18:43–49, 12 2017.
[9] Tiancheng Shen, Jia Jia, Guangyao Shen, Fuli Feng, Xiangnan He, Huanbo
Luan, Jie Tang, Thanassis Tiropanis, Tat-Seng Chua, and Wendy Hall.
Cross-domain depression detection via harvesting social media. pages 1611–
1617, 07 2018.
[10] Marvin Minsky and Seymour Papert. Perceptrons, volume 1. M.I.T. Press,
1969.
[11] Kunihiko Fukushima. Neocognitron: A self-organizing neural network model
for a mechanism of pattern recognition unaffected by shift in position.
Biological Cybernetics, 36:193–202, 1980.
[12] K. Fukushima. Neocognitron. Scholarpedia, 2(1):1717, 2007.
[13] Yann LeCun, Leon Bottou, Y. Bengio, and Patrick Haffner. Gradient-based
learning applied to document recognition. Proceedings of the IEEE,
86:2278–2324, 12 1998.
[14] Maria Valueva, Nikolay Nagornov, Pavel Lyakhov, G.V. Valuev, and N.I.
Chervyakov. Application of the residue number system to reduce hardware
costs of the convolutional neural network implementation. Mathematics
and Computers in Simulation, 177, 05 2020.
[15] Y. Bengio and Yann Lecun. Convolutional networks for images, speech,
and time-series. 11 1997.
[16] Hidenori Ide and Takio Kurita. Improvement of learning for cnn with relu
activation by sparse regularization. IEEE, 07 2017.
[17] Robert DiPietro and Gregory D. Hager. Deep learning: RNNs and LSTM,
volume 1. Handbook of Medical Image Computing and Computer Assisted
Intervention, 2020.
[18] Sepp Hochreiter. The vanishing gradient problem during learning recurrent
neural nets and problem solutions. International Journal of Uncertainty,
Fuzziness and Knowledge-Based Systems, 6:107–116, 04 1998.
[19] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural
computation, 9:1735–1780, 04 1997.
[20] Sepp Hochreiter and Jürgen Schmidhuber. Lstm can solve hard long time
lag problems. pages 473–479, 01 1996.
[21] Nir Arbel. How lstm networks solve the problem of vanishing gradients,
December 2018.
[22] Christopher Olah. Understanding lstm networks, August 2015.
[23] Mike Schuster and Kuldip Paliwal. Bidirectional recurrent neural networks.
Signal Processing, IEEE Transactions on, 45:2673–2681, 12 1997.
[24] Hassan Saif, Miriam Fernandez, and Harith Alani. On stopwords, filtering
and data sparsity for sentiment analysis of twitter. 05 2014.
[25] Jianqiang Zhao and Xiaolin Gui. Comparison research on text pre-
processing methods on twitter sentiment analysis. IEEE Access, PP:1–1,
02 2017.
[26] MF Porter. An algorithm for suffix stripping. Program: Electronic Library
and Information Systems, 14, 03 1980.
[27] Chris Paice. Another stemmer. SIGIR Forum, 24:56–61, 11 1990.
[28] Liang-Chih Yu, Jin Wang, K. Lai, and Xuejie Zhang. Refining word
embeddings using intensity scores for sentiment analysis. IEEE/ACM
Transactions on Audio, Speech, and Language Processing, pages 1–1, 12 2017.
[29] John Watson. Psychology as a behaviorist views it. Psychological Review
- PSYCHOL REV, 20:158–177, 01 1913.
[30] Ahmed Husseini Orabi, Prasadith Buddhitha, Mahmoud Husseini Orabi,
and Diana Inkpen. Deep learning for depression detection of twitter users.
pages 88–97, 01 2018.
[31] Guangyao Shen, Jia Jia, Liqiang Nie, Fuli Feng, Cunjun Zhang, Tianrui
Hu, Tat-Seng Chua, and Wenwu Zhu. Depression detection via harvesting
social media: A multimodal dictionary learning solution. pages 3838–3844,
08 2017.
[32] American Psychiatric Association. Diagnostic and Statistical Manual of
Mental Disorders, volume 5. American Psychiatric Publishing, 2013.
[33] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks
from overfitting. Journal of Machine Learning Research, 15:1929–1958, 06
2014.
[34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. pages 770–778, 06 2016.
[35] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica
Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General
visual representation learning. pages 1–28, 12 2019.
[36] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove:
Global vectors for word representation. EMNLP, 14:1532–1543, 01 2014.