
Bachelor Degree Project

Wed 2.0: improving customer experience with wedding service providers
through investigation of the ranking mechanism and sentiment analysis
of user feedback on Instagram

Author: Maria Jäderlund
Supervisor: Daniel Toll
Semester: VT 2019
Subject: Computer Science
Abstract
Instagram is one of the main social platforms for business promotion [1]. Millions of
potential customers and endless visual marketing opportunities make Instagram a
perfect place to increase online sales. There are many tools and mechanisms to promote
brands on Instagram, such as paid advertising or pre-generated sets of popular
hashtags. In this regard, the presence and content of users’ comments become an
important socio-psychological factor in the motivation to buy or use a product or
service. The goal of this degree project is to investigate natural language processing
techniques applied to users’ comments on Instagram in order to define a new
algorithm that adds content analysis to the list of feed ranking factors. As it is
now, the user has to read through posts on Instagram to get an idea of the quality of a
product or service. Therefore, a way to classify and rank products and services is
needed. We propose a new algorithm called "Wed 2.0" that can assist consumers in
their search for wedding services and products on Instagram. Data mining techniques [2]
and sentiment analysis [3] are used to determine the mood of the comments and structure
user opinions, as well as to rank accounts based on this knowledge.

Keywords: sentiment analysis, natural language processing, ranking, Instagram, VADER.

Contents
1 Introduction
1.1 Background
1.2 Related Work
1.3 Motivation
1.4 Problem Formulation
1.5 Objectives
1.6 Scope/Limitation
1.7 Target Group
1.8 Outline
2 Method
2.1 Method Description
2.2 Reliability and Validity
2.3 Ethical Considerations
3 Implementation
3.1 Tool Choice
3.2 Ranking Model
3.3 Data Collection
3.4 Data Storage
3.5 Data Preprocessing
3.6 Sentiment Analysis
3.7 Account Ranking
3.7.1 Total Score of All Posts
3.7.2 Total Likes Score
3.7.3 Total Followers Score
3.7.4 Total Account Score
4 Results
5 Analysis
6 Discussion
7 Conclusion
7.1 Future work
References

1. Introduction
According to Instagram business statistics, there are 25 million business profiles, and
over 200 million users visit at least one business profile every day [1]. The
accessibility of this public network as an advertising platform and its potential
for business development make Instagram a popular networking service
among entrepreneurs. In the era of Web 2.0 [4], with its emphasis on user-generated
content, it becomes essential to increase customer engagement in order to maintain the
reputation and promotion of a company.
The project is named by analogy with Web 2.0, rephrased to match the wedding
services that are the focus of this research. Just as with Web 2.0, the Wed 2.0
algorithm relies largely on user participation and dynamic content.
Content analysis of customer feedback can be beneficial not only for businesses but
also for consumers themselves, because it makes it easier for users to evaluate a
certain product or service before deciding whether to purchase it. This thesis
investigates natural language processing techniques combined with sentiment analysis in
order to create a new ranking algorithm for a better user experience on Instagram. The
algorithm determines which accounts are ranked high in search results, based not
only on the number of likes or followers, but mainly on the sentiment analysis of users’
comments.

1.1 Background

Instagram [5], one of the most popular social networks, pays close attention to search
functionality and content discovery. Instagram search results are presented in four
categories: top, people, tags and places (see Image 1.1).
The top tab shows popular accounts, hashtags and geotags containing some or all of
the words from the search query. The people tab is convenient for finding
people by nickname. The tags tab shows all publications marked with
hashtags that contain words from the search query. The places tab shows locations that
contain words from the search query.
The standard Instagram search algorithm presents results based on
multiple factors, including Instagram and Facebook activity, the user’s interests, the level of
engagement, and the number of followers and likes of the accounts [6]. The latter
two can be misleading because of the variety of services for buying followers and likes on
Instagram. The goal of this project is to explore an alternative way of making recommendations
based on users’ feedback in the comments section. The idea is to present search
results based less on the number of followers and likes and more on the level of customer
satisfaction with a specific product. The more positive feedback an account gets,
the higher it can rank in search results.
The project focuses on data extraction and data mining techniques, natural language
processing and sentiment analysis. Data extraction from the Web, or “Web scraping” [7],
is a method of retrieving data from Web sources. It allows us to manually or
automatically extract new or updated data and save it for later use. There are plenty of
Web scraping tools available on the Internet that provide direct access to real-time data
[8]. This project makes use of an open-source library called “Instagram PHP
Scraper” [9]. It is flexible and allows the Web scraping process to be customized
according to the user’s needs.

Image 1.1: Instagram search results: top results including hashtags and accounts (on
the left) and top accounts (on the right).

In order to analyze the extracted data, different data mining techniques can be
applied. Data mining is the process of exploring raw data to detect hidden
knowledge that is nontrivial, practically useful, previously unknown and open to
human interpretation [10]. Data mining is a multidisciplinary field, which results in a
variety of methods and algorithms implemented in its different systems; some
of these systems integrate several approaches at once. Key components include neural
networks, case-based reasoning, decision trees, genetic algorithms and support vector
machines. The effectiveness of an algorithm depends on many factors, such as the size and
structure of the dataset. For this reason, different algorithms will be investigated in
order to find the most suitable one for the project.
Another core concept of the project is machine learning [11], which can be described
as an area within artificial intelligence. Machine learning is based on statistical and
computational principles and combines a variety of approaches, which include
but are not limited to logic, statistics, probability theory, mathematical optimization and
computational algorithms. It is widely used in data mining, robotics, pattern
recognition, and the processing of textual data as well as audio and video files.
Most machine learning tasks can be divided into two categories: supervised
learning and unsupervised learning. In supervised learning, the data contains both
the inputs and the desired outputs, on the basis of which a prediction is made. In
unsupervised learning, the data contains only inputs, whose properties we want to
discover. The first category covers, for example, regression and classification tasks.
The difference between regression and classification lies in their output variables:
regression predicts a continuous quantity, while classification predicts a discrete value, such as a
label. The second category covers problems such as clustering and dimension reduction.
Clustering means partitioning data into groups of similar items. Dimension reduction
reduces the number of random variables by obtaining a set of principal variables.
Another automated process used in this project is sentiment analysis [12]. This
discipline explores not only the content of textual data, but also its tonality. A
natural language text can express an emotional assessment of what is
being reported. The purpose of sentiment analysis is to find users’ opinions in the
text and determine their properties. It is common to distinguish three parameters of
tonality: the subject (the author of the text), the evaluation
(positive, negative or neutral) and the object (what the
opinion is about).
The simplest method of determining the author’s opinion is to count and compare the
number of words in the text that have a positive or negative tint. However, this approach
can lead to an unreliable result, since the same words have different meanings and
different tonalities in different contexts. Therefore, more advanced approaches are used
for classifying the polarity of a given text. They can be divided into four main
categories: rule-based approaches, dictionary-based approaches, supervised
learning and unsupervised learning.
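The word-counting method and its weakness can be illustrated with a minimal Python sketch. The two word lists and the function name naive_polarity are assumptions made for this illustration only; they are not taken from any real lexicon or from the project code.

```python
# Tiny illustrative word lists (assumptions, not a real sentiment lexicon).
POSITIVE = {"good", "great", "love", "beautiful"}
NEGATIVE = {"bad", "poor", "hate", "ugly"}


def naive_polarity(text):
    """Classify text by comparing counts of positive and negative words."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


print(naive_polarity("I love this dress, it is beautiful!"))
# The negation is ignored, so this comment is also counted as positive.
print(naive_polarity("This dress is not good"))
```

The second call shows why pure counting is unreliable: the word "good" is counted as positive even though the sentence negates it.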
The first approach requires an advanced set of rules that determines what is
considered positive or negative tonality. For example, if there are no negative
grammatical constructions in a sentence, it is considered to have a positive tonality.
The second approach uses specific dictionaries such as SenticNet or WordNet to
analyze the text. These dictionaries are lists of words in which each word has a
tonality value assigned to it. To determine the tonality of a given text,
each word is assigned its tonality value from the dictionary and
the overall tonality of the text is then calculated. This can be done either by taking
the arithmetic mean of all values or by using a base classifier such as Naive
Bayes or Support Vector Machine. The latter also belongs to the third approach, where
the model is trained on a labeled training dataset with predefined tonality values
according to a supervised machine learning method. Unsupervised learning, which is used
in the fourth approach, provides a less accurate result and can be found, for example, in
automated text clustering.
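The arithmetic-mean variant of the dictionary-based approach can be sketched as follows. The lexicon values and the name mean_tonality are hypothetical; they are not taken from SenticNet or WordNet and only show the mechanics.

```python
# Hypothetical tonality values on a -1..1 scale (not from SenticNet or WordNet).
LEXICON = {"great": 0.8, "good": 0.5, "slow": -0.3, "terrible": -0.9}


def mean_tonality(text, lexicon=LEXICON):
    """Average the tonality values of the words that appear in the lexicon."""
    tokens = [w.strip(".,!?").lower() for w in text.split()]
    values = [lexicon[t] for t in tokens if t in lexicon]
    # A text without any known word is treated as neutral.
    return sum(values) / len(values) if values else 0.0


print(mean_tonality("Great dress, slow delivery"))
```

Words that are missing from the dictionary simply do not contribute, which is one reason why the coverage of the chosen lexicon matters.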

1.2 Related Work
Instagram data analysis is a broad field of research, and there are many previous
studies on sentiment analysis and different machine learning techniques. When it
comes to Instagram, the majority of works are dedicated to the analysis of images and tags or
the content of posts (i.e. descriptions under images or videos). However, there are
relatively few works that combine both concepts and apply them to determine
users’ sentiment in the comments section. This subchapter reviews the studies most
closely related to our project.
Zhang et al. investigated product and service popularity on Instagram based on the
temporal evolution of certain tags [13]. Their focus is on
analyzing and extracting data from hashtags and images using image clustering, not
from the comments themselves. Nevertheless, the work highlights that social media
popularity results from different factors such as factual information, vividness,
entertainment parameters and sentiment. It also shows that clustering can be used to
discover meaningful subsets of related data.
Previous research has also started to investigate Instagram ranking mechanisms,
for example the work of Chilet et al., which analyzes social media marketing in the high-end fashion
industry [14]. The authors apply Named Entity Recognition (NER) from
Natural Language Processing in order to structure posts based on their information
content and to obtain measures of brand leadership and similarity. They
developed their own fashion glossary for the selection and extraction of fashion
symbols on Instagram. This approach does not include analysis of users’ feedback;
however, it provides a valuable method for identifying and ranking the leaders in the industry.
An interesting conclusion is that leading brands tend to publish more brand-building posts,
while less popular brands focus on informative posts.
Another analysis of Instagram data was carried out by Hammar et al. [15].
The authors apply methods of generative modeling and unsupervised mining of fashion
attributes. Word embedding proved to be the most useful asset for information
extraction. In this research, a deep clothing classifier is trained with weak supervision to
classify Instagram data based on the associated text. This approach partially analyzes
the comments section, but mostly characterizes its multilingual nature rather than
its sentiment polarity.
Prabowo and Purwarianti attempted Instagram comment classification using a statistical
approach [16]. The study uses several techniques,
such as unigrams and word embedding, support vector machines (SVM) and
convolutional neural networks (CNN). The authors aim to create a system that can
classify users’ comments based on the response that should be given. For this purpose,
they use a statistical approach with information features extracted from the text. The
main focus is on the replies that online store owners must give to user
comments on their Instagram accounts. Comments are categorized by
manual labeling (“answered”, “read” or “ignored”), and little attention is paid
to the sentiment of the comments themselves. One of the study’s results is particularly
important for our research. According to the authors, including a pre-processing step in
the data analysis, in combination with feature selection, can significantly improve the
accuracy of the classification model. They also mention that, for a better
result, there should be a method to handle the appearance of product names in Instagram
comments.

1.3 Motivation

In the era of Web 2.0 [4], which is driven by user-generated content, it becomes vital to
take into account the knowledge and interests of customers. Instagram’s advertising
mechanism gives recommendations based on, among other things, prepaid deals with
companies. There are many business strategies on Instagram for
attracting clients and gaining followers. These are designed from a business point of view;
less attention is paid to benefits from the user’s point of view.
A new ranking mechanism based on sentiment analysis of users’ feedback may be
beneficial for both customers and small businesses. This will be investigated by
analyzing customer experience with wedding service providers on Instagram.
Organizing a wedding requires serious expenses, but
not all couples are able or willing to take on the financial costs associated with wedding
planning. The proposed ranking algorithm may contribute to the creation of a
direct-to-consumer channel that bypasses intermediaries and overpayments.
Small businesses with overall positive feedback would get direct access to potential
customers without the need for paid advertising to get to the top of search results. Users, in
turn, would get a list of services and products based on real customers’ opinions rather
than prepaid ads or recommendations from wedding agencies.
Moreover, this research can be beneficial for science as well, since it contributes to a
better understanding and further development of Instagram’s Web 2.0 technologies. This
is done by evaluating and implementing a ranking model based on the emotional
sentiment of users’ comments.

1.4 Problem Formulation


The Instagram search bar provides great opportunities to present a brand or service to a
larger circle of users and attract new subscribers. It is no surprise that brands
spend large amounts of money on digital advertising to get into users’ feeds or search
results [17].
As it is now, the user has to read through posts on Instagram to get an idea of the
quality of the recommended product or service. So far there is no algorithm that
generates personalized recommendations on Instagram based on both an account’s
characteristics and sentiment analysis of users’ feedback on the respective service or
product.
Therefore, we need to investigate natural language processing algorithms and data
mining techniques applied to users’ comments on Instagram. This will help us to
define a new ranking algorithm that adds content and feedback analysis to
the list of feed ranking factors. Based on this knowledge, a better customer experience
can be achieved.

1.5 Objectives

Objective  Description
O1         Investigation of the Instagram API
O2         Research on machine learning algorithms, sentiment analysis and data mining techniques
O3         Proposal of a ranking model suitable for the project (based on O2)
O4         Development of a new ranking algorithm, “Wed 2.0”
O5         Running the algorithm in the baseline and the preprocessed experiments
Table 1.1: The project’s objectives

A new ranking algorithm called "Wed 2.0" will be created following the objectives
presented in Table 1.1. It aims to assist consumers in their search for wedding
services on Instagram. The mood of the comments will be determined and structured with
the help of natural language processing techniques and sentiment analysis. As a result, a list
of highly ranked products and services based on the accounts’ descriptions and customer
feedback will be presented.
The new ranking mechanism is expected to provide an alternative way of making
recommendations that is directed more toward users than toward product and service
providers.

1.6 Scope/Limitation

To limit the scope of the project, only Instagram accounts with a description in English
and a single category of wedding products (wedding dresses) are processed.
Another limitation is that only a fixed number of accounts (150) is analyzed.
Because of the JSON request limit on Instagram (200 requests an hour) [23],
querying a large amount of Instagram comments and media becomes very
slow. In order to avoid being blocked during long-running queries, requests must be sent at
varying time intervals. Considering the time constraints imposed by the project
deadlines, only 150 accounts were chosen for this research.
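A minimal sketch of how requests could be spaced to stay under the quoted limit of 200 requests an hour is given below. It is written in Python for illustration only; the project itself uses a PHP library for scraping. The names next_delay and fetch_all, the jitter value and the fetch callable are all assumptions.

```python
import random
import time

MAX_REQUESTS_PER_HOUR = 200                 # limit quoted in the text
BASE_DELAY = 3600 / MAX_REQUESTS_PER_HOUR   # 18 seconds between requests


def next_delay(jitter=0.5, rng=random):
    """Return a randomized delay that never drops below the base delay."""
    return BASE_DELAY * (1 + rng.uniform(0, jitter))


def fetch_all(items, fetch, sleep=time.sleep):
    """Call fetch(item) for every item, pausing between consecutive requests."""
    results = []
    for i, item in enumerate(items):
        if i > 0:
            sleep(next_delay())
        results.append(fetch(item))
    return results
```

Because every pause is at least 18 seconds, no more than 200 requests can be sent in an hour, and the random component varies the interval between requests. At this pace the time per account grows quickly with the number of posts and comments, which is consistent with limiting the dataset to 150 accounts.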
Each post must also have a minimum of 5 comments. This minimum was
introduced in order to avoid adding posts without comments to the database, since the
research includes Instagram profiles with different engagement rates and
different numbers of followers. The maximum number of comments is not limited in this
project.

1.7 Target Group

This thesis will be of interest mostly to Instagram developers and to those who work
in the fields of natural language processing, sentiment analysis and Web 2.0
technologies. The research can also be useful for those who develop any kind of
application that uses social profiling or feedback information for building user
behavior models.

1.8 Outline

Section 2 is dedicated to the choice of methods used in the development of the
ranking model. It also discusses the validity and reliability of the project and highlights
some ethical considerations related to confidentiality and privacy
obligations.
Section 3 describes the implementation process with an emphasis on the code. It also
provides a description of the ranking model developed in the project.
Section 4 presents the results of the implementation. Section 5 gives a thorough
analysis of the results described in Section 4.
The results and analysis chapters are followed by the discussion chapter, which
highlights the findings of the previous sections and whether they solve the research
problem. The last section presents the conclusions of the work done in this thesis and
provides some thoughts about possible future research.

2. Method
In this project, we will use Design Science as a method and research approach. Creating
a new ranking mechanism is an exploratory process that will require a thorough
analysis of the Instagram API [18], a study of scientific literature in the field of natural
language processing techniques, and an evaluation of different approaches and
solutions. Sentiment and data analysis is an essential component of the development
process, and to answer this part of the research problem a controlled experiment will be
conducted. In this experiment, the independent variable is the dataset that is tested
against natural language processing techniques. The dependent variable is
the total account score obtained by applying these techniques.

2.1 Method Description

According to Cross [19], Design Science is "not just the utilisation of scientific
knowledge of artefacts, but design in some sense as a scientific activity itself". Here,
design can be interpreted both as a process of creating something new that does not
exist yet and as the product that results from this process. Design Science is an iterative search
process that can be represented as a Generate/Test Cycle [20]. Figure 2.1 illustrates the
design research process, which includes designing artifacts to solve the identified problem,
demonstrating and evaluating the design, and presenting the results.

Figure 2.1: Design Science as a Generate/Test Cycle.

The design process will be carried out as follows. First, the data will be gathered
from 150 public Instagram accounts with the help of Instagram PHP Scraper. Three basic
attributes of the feedback will be extracted using sentiment analysis: subject (the service
or product being commented on), opinion holder (the user who expresses his or her
opinion) and sentiment intensity (how positive or negative the comment is). Information
about the type of service will be extracted by Web scraping the account description.
The retrieved data will be structured and saved in MongoDB for further analysis.
A given product’s social sentiment will be determined by analyzing the text and
assigning sentiment scores to the customers’ comments. For this purpose, different
machine learning and natural language processing techniques will be analyzed to find
the most appropriate method for sentiment analysis. Before scores are assigned to
comments, the textual data will be preprocessed. From the score of every comment, the
overall score of each post will be calculated, and from the post scores the overall score of the
account. After the total scores for all 150 accounts in the dataset have been obtained, the accounts
will be sorted in descending order. Finally, the top 10 accounts with the highest scores
will be presented.
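The flow from comment scores to a ranked list can be sketched as below. This is only an outline under stated assumptions: the aggregation by arithmetic mean, the nested data layout and the name rank_accounts are chosen for illustration, and the scoring formulas actually used by the project are the ones defined in the account ranking subsections of the implementation chapter.

```python
from statistics import mean


def rank_accounts(accounts, score_comment, top_n=10):
    """Rank accounts by the mean of their post scores (assumed aggregation).

    accounts: dict mapping an account name to a list of posts,
              where each post is a list of comment strings.
    score_comment: callable returning a sentiment score for one comment.
    """
    totals = {}
    for name, posts in accounts.items():
        # Score of a post = mean score of its comments; empty posts are skipped.
        post_scores = [mean(score_comment(c) for c in post) for post in posts if post]
        totals[name] = mean(post_scores) if post_scores else 0.0
    # Sort in descending order of score and keep the best top_n accounts.
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```

Any function that maps a comment to a number can be passed as score_comment, for example a wrapper around the compound score of a sentiment analysis tool.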
The quality assessment will be done by formulating and testing requirements in which
we define the relevance ratio of the presented accounts. Experiment participants will be
asked to judge the relevance of the result on a scale of 0-3, with 0 meaning not relevant and
3 meaning highly relevant. Additionally, the number of likes and followers will be
taken into account in the analysis and conclusions.

2.2 Reliability and Validity

The data used in the project is collected using a Web scraping technique in real time,
which means that only the set of data available at a given moment will be analyzed. Data on
Instagram is not stable, i.e. users may modify or remove comments or account
descriptions. This problem is addressed by applying rules and
restrictions to the dataset. For example, each post in the dataset must have at least
five comments.
Another problem concerns the characteristics of Instagram comments. They are
likely to include informal language, exaggerated word emphasis, emoji and
incorrect punctuation, or to be written in a language other than English. All these factors
could negatively affect the determination of sentiment polarity. To reduce problems
with reliability, basic preprocessing of the dataset shall be performed, such as word
standardization, removal of hashtags and links to other users, expansion of abbreviations
and so on.
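The preprocessing steps listed above can be sketched with regular expressions. The abbreviation table, the rule of reducing repeated letters to two as a form of word standardization, and the name preprocess are assumptions for this example; the project's actual preprocessing is described in the implementation chapter.

```python
import re

# Hypothetical abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"u": "you", "thx": "thanks", "pls": "please"}


def preprocess(comment):
    """Drop hashtags and mentions, squeeze repeated letters, expand abbreviations."""
    text = re.sub(r"[#@]\w+", " ", comment)             # remove #hashtags and @mentions
    text = re.sub(r"([A-Za-z])\1{2,}", r"\1\1", text)   # "sooooo" -> "soo"
    # Only whole, punctuation-free tokens are looked up in the table.
    words = [ABBREVIATIONS.get(w.lower(), w) for w in text.split()]
    return " ".join(words)


print(preprocess("thx @anna sooooo pretty #wedding"))
```

Emoticons, emoji and exclamation marks are deliberately left untouched in this sketch, since a sentiment tool aimed at social media text may use them as signals.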
A construct validity problem might also occur, for example regarding the
interpretation of sentiment tonality. The validity of the results largely depends on
efficient and valid measurement of sentiment. A more or less fine-grained sentiment
classification is possible by using a sentiment score, but the range from very
negative to very positive may vary significantly between studies. This should be
taken into consideration and explained in detail in the implementation chapter.

Finally, some external validity problems might arise. These concern the generality of the
results and include three main concerns: the development of a new ranking mechanism that
has not existed before, restrictions on the dataset (a limited number of accounts) and
a limited number of participants in the validation process (three participants). Therefore,
the results of the research can only be generalized to a certain extent.

2.3 Ethical Considerations

There are some ethical considerations regarding privacy and legality. The first
is that the data in the dataset is extracted from accounts and comments belonging
to real users. The second is that the mining of information relates to real
people’s behavior.
The data presented in this thesis will be anonymized as far as possible;
however, some information, such as comment content, will appear as examples in the
implementation details (usernames of the comments’ owners and their ids will not be
shown in the report). Only public accounts will be used for data processing, and
there will be no evident relation between an opinion holder and his or her comment
content. Moreover, the application itself is not intended to become a commercial
software product. The project results will be used for research purposes only and will be
publicly available.
It is worth noting that the web scraping used in the project does not contradict the
General Data Protection Regulation (GDPR), since it does not deal with personal data as
defined there, i.e. “any personally identifiable information (PII) that
could be used to directly or indirectly identify a specific individual” [21]. This
definition refers to credit card details, IP addresses, social security numbers, email addresses
and so forth. None of these will be extracted by the Wed 2.0 algorithm.
Another ethical consideration is the privacy of the people participating in the
validation process. This issue will be addressed by conducting an anonymous survey.

3. Implementation
The application was initially planned as a real-time Web application built with
Python and the Django framework. However, the implementation part of the project had to
be revised because of recent changes by Instagram that restricted access to its public
API and lowered the request limit [22][23]. The ranking mechanism has been
developed with the help of the Instagram Web version and a third-party library (Instagram
PHP Scraper, v0.8.28) [9]. Extracted data is stored and organized in MongoDB (v.
4.0.8) [24]. Sentiment analysis and Instagram account ranking are implemented in
Python (v. 3.7.3) using the VADER (Valence Aware Dictionary and sEntiment Reasoner)
sentiment analysis tool [25].
The implementation process is divided into six stages:
1) Downloading the requested unstructured data from Instagram via the Web crawler
module in Instagram PHP Scraper
2) Extracting structured data via the parser module in Instagram PHP Scraper
3) Storing data in MongoDB
4) Data preprocessing in Python
5) Sentiment analysis with the help of VADER
6) Account ranking in Python

3.1 Tool Choice

VADER is available as part of Python’s Natural Language Toolkit (NLTK) [26] and combines
rule-based and lexicon-based approaches to sentiment analysis. It was developed with a
focus on social media text (microblogging in particular) and provides numerical
scores for each sentiment class. Words are mapped to sentiment according to a
predefined lexicon with tonality values for each word. Besides values for positive,
neutral and negative sentiment, VADER provides an additional value called the
compound score. It sums the valence scores of the words in the text, adjusted by VADER’s
rules, and normalizes the result to a scale from -1 to 1, where 1 is extreme positivity and -1 is
extreme negativity.
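The normalization step is not spelled out in the text. The sketch below assumes the form x / sqrt(x^2 + alpha) with alpha = 15, which is, to our knowledge, the form used in the open-source VADER code; both the formula and the constant should be treated as assumptions here, and the function name normalize is chosen for illustration.

```python
import math


def normalize(score, alpha=15):
    """Map an unbounded sum of valence scores into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


for raw in (-10, -1, 0, 1, 10):
    print(raw, round(normalize(raw), 3))
```

The mapping keeps the sign of the raw sum, is monotonic, and approaches -1 or 1 only for very large sums, so texts with many strongly charged words end up close to the ends of the scale.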
The preference for VADER in sentiment analysis is explained by the particular
qualities of social media data:
● informal language and slang (nah, yay, meh and so forth)
● emoticons (:), :(, :o and so on)
● emoji (for example, 😊, ☹)
● acronyms (omg, lol, rofl)
● variation in sentiment intensity between words within the same tonal category (for
example, the positive words "good" and "great" should have different sentiment values)
● polysemy, i.e. the capacity of a word to have multiple meanings depending on
context (this problem is solved in VADER by performing sentiment classification
with word-sense disambiguation)
● the emotional effect of punctuation marks in the text ("The dress is good!!!" should have
a higher score than "The dress is good")
● capitalization for emphasis or as a sign of an emotional uplift ("The dress is
GOOD")
● the influence of degree adverbs on the sentiment intensity of words (such as "incredibly",
"quite" and so on)
● sentiment polarity reversal in the presence of the contrasting conjunction "but" (for
example, "This wedding dress is great, but I don’t like its color")
● a possible shift in sentiment polarity when double negatives are used

VADER takes care of all these properties. The last five properties are
VADER’s base rules that affect the overall compound score of the text [27].
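To make the idea of such rules concrete, the toy sketch below imitates two of them, punctuation and capitalization emphasis. It is not VADER's implementation: the increments, the cap on exclamation marks and the name emphasize are illustrative assumptions.

```python
# Illustrative constants; they are assumptions, not the values used by VADER.
EXCLAMATION_STEP = 0.3
MAX_EXCLAMATIONS = 4
CAPS_STEP = 0.7


def emphasize(base, text, word):
    """Push the valence of `word` further from zero for '!' and ALL-CAPS usage."""
    sign = 1 if base >= 0 else -1
    # Each exclamation mark, up to a cap, intensifies the score.
    score = base + sign * EXCLAMATION_STEP * min(text.count("!"), MAX_EXCLAMATIONS)
    tokens = [t.strip("!?.,") for t in text.split()]
    # An ALL-CAPS word in otherwise mixed-case text adds extra emphasis.
    if word.upper() in tokens and not text.isupper():
        score += sign * CAPS_STEP
    return score


print(emphasize(1.9, "The dress is good", "good"))
print(emphasize(1.9, "The dress is good!!!", "good"))
print(emphasize(1.9, "The dress is GOOD", "good"))
```

With these toy constants the second and third sentences receive a higher score than the first, which mirrors the ordering described in the list above.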
In general, the sentiment of a text depends to a large extent on the order of words in
the sentence. That is why a purely statistical approach to sentiment analysis such as Bag of
Words [28] cannot be used in this project. In the bag-of-words model, text is represented
as an unordered collection of words without links between them. Unlike this
approach, VADER is sensitive to word order and considers semantic relationships between
extracted features.
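The loss of word order in the bag-of-words model can be shown in a few lines of Python. The two example sentences and the helper name bag_of_words are made up for this illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Represent text as unordered word counts."""
    return Counter(text.lower().replace(",", "").split())


a = "the dress is good, not bad"
b = "the dress is bad, not good"
# Both sentences contain exactly the same words, so their bags are identical.
print(bag_of_words(a) == bag_of_words(b))
```

The two sentences express opposite opinions, yet a model that only sees word counts cannot tell them apart.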
Sentiment analysis could also be implemented with help of either polarity-based or
valence-based lexicons. Linguistic Inquiry and Word Count [29] or the General Inquirer
[30] are an example of the first category. Both dictionaries categorize words into several
categories inclusive positive and negative classes, however, no attention is paid either to
the words and symbols that are used within social media domain or their sentiment
intensity. Words are only labeled as ​positive,​ ​negative o​ r ​neutral​. In order to rank
accounts, it is important to know not only whether they are positive or negative, but also
how positive or negative they are.
The second category of lexicons takes into consideration not only the sentiment
polarity of words, but also their sentiment intensity providing numerical scores on a
predefined scale for each word. SentiWordNet [31] and SenticNet [32] belong to this
category. VADER interfaces with both of them to produce a more accurate result [33].
From previous controlled experiments [33] VADER is known to outperform basic
machine learning algorithms from Python’s scikit-learn library [34] for the Naïve
Bayes, Maximum Entropy (with probabilistic classifiers) and Support Vector Machine
(with a linear classifier) models. These classifiers are time-consuming, largely depend
on the set of labeled training data (i.e. a set of inputs with known outputs are required)
and only predict the class probabilities (positive, negative or neutral) with a certain level
of accuracy. VADER does not depend on labelled training data and achieves better
accuracies by introducing a human-curated (with help of human experts in giving
sentiment scores and labeling data sets) sentiment lexicon that was specifically adapted
to the social media domain conveying the intensity aspects of sentiment analysis [33].
Another advantage of VADER is that it is written in Python, which makes it convenient
for working with large datasets. Python also has numerous libraries intended for data
analysis and preprocessing, such as scikit-learn [34] and numpy [35], that will be used
in this project in order to improve the algorithm presented by VADER.

3.2 Ranking Model

Figure 3.1: The proposed ranking model

The proposed ranking algorithm (Figure 3.1) includes several modules:
● retrieving module for downloading and parsing data with help of PHP Instagram
Scraper
● data storage module that uses Python script and MongoDB as database
● preprocessing module for text normalization, in Python
● sentiment analysis module (with help of the VADER tool)
● ranking module that implements calculation based on the sentiment scores.
The following subsections describe each module in detail.

3.3 Data Collection


Web scraping is used as a method for downloading and parsing data from public
accounts on the Web version of Instagram. The PHP library Instagram Scraper is
used for this purpose [9]. It makes it possible to get account, post and comment
information without authorization. Data is extracted from Instagram profiles using the
JSON object on the web page itself.
The latest uploaded Instagram posts can be collected by calling the getMediasByTag
function and specifying the number of posts to be parsed. In this and the following
subsections the hashtag weddingdress will be used as a starting point for ranking
Instagram profiles that provide wedding outfit services.

Image 3.1: Collecting account information.

After receiving the requested posts, the system loops through this data and reads in
information about the posts' owners and their account IDs. The account ID is used to
fetch the account's information such as the username, the account description and the
number of followers. At this stage the preg_replace function is used to remove special
characters, and the account description is converted to lower case. The system also
checks whether keywords related to #weddingdress are present in the account
description (Image 3.1). This allows us to filter the collected accounts and keep only
those that are related to wedding services.
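The filtering step described above is implemented in PHP in the project (Image 3.1). As a rough illustration only, the same idea can be sketched in Python. The keyword list, the function names and the choice to replace special characters with spaces are assumptions; the actual keywords used in the project are not listed in the text.

```python
import re

# Hypothetical keyword list; the real list used in the project is not given in the text.
WEDDING_KEYWORDS = {"wedding", "bridal", "bride", "gown"}


def normalize_biography(biography: str) -> str:
    """Lowercase the description and replace special characters with spaces."""
    return re.sub(r"[^a-z0-9\s]", " ", biography.lower())


def is_wedding_account(biography: str) -> bool:
    """Return True if at least one keyword occurs as a whole word in the description."""
    words = set(normalize_biography(biography).split())
    return not WEDDING_KEYWORDS.isdisjoint(words)


print(is_wedding_account("Bespoke BRIDAL gowns & veils!"))
print(is_wedding_account("Street food in Lisbon"))
```

Matching on whole words after normalization keeps the check simple, but it accepts any account whose description contains one of the keywords, regardless of the service actually offered.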
After filtering the collected accounts, the system fetches the posts and comments of
each remaining account and saves this data into MongoDB (Image 3.2).

Image 3.2: Saving data into MongoDB.

The collected social media data includes 15,852 comments under 3,201 posts
obtained from 150 public accounts. Because of the reduced limit on requests per hour,
the data was pulled at different times and on different days.

3.4 Data Storage


Collected data is stored in MongoDB. The database model consists of three collections:
accounts, comments and posts (Figure 3.2).

Figure 3.2: Database model

Each document in the collections has a predefined set of field-and-value pairs
(colored white in Figure 3.2). The system adds a few new fields to every document after
performing sentiment analysis and calculating scores (colored green in Figure 3.2).
The id field of the account document holds the Instagram account's id. This primary key
is used to connect the account, post and comment collections (Account.id =
Post.account_id = Comment.account_id). The username field holds the unique username of
the Instagram account, biography holds a string value with the Instagram profile
description, and followers holds an integer value with the profile's number of followers.
The post document also has an id field with a unique value. This primary key is used
to connect the post and comment collections (Post.id = Comment.post_id). As stated
earlier, the account_id field is used for the connection with the account collection.
Every post on Instagram can have a description (caption), which is stored in the text
field. The likes field holds the number of likes that a post has.

Each comment that users leave under posts has a unique id (the id field in the
comment document). In order to determine the owner of the post that attracts comments,
the account_id field is added to the comment document. In a similar way the post_id
field is used, but in this case for determining which post the comment belongs to.
The contents of the comment are stored in the text field of the comment document.
Finally, the user_id field is used for identifying the author of the comment.
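The relationships between the three collections can be summarized in a short Python sketch. The field names follow the description above; the dataclass representation, the helper `comments_for_post` and all sample values are illustrative assumptions and do not reproduce the MongoDB code of the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Account:
    id: str
    username: str
    biography: str
    followers: int


@dataclass
class Post:
    id: str
    account_id: str  # refers to Account.id
    text: str        # caption of the post
    likes: int


@dataclass
class Comment:
    id: str
    account_id: str  # owner of the commented post, refers to Account.id
    post_id: str     # refers to Post.id
    user_id: str     # author of the comment
    text: str


def comments_for_post(post: Post, comments: List[Comment]) -> List[Comment]:
    """Select the comments that belong to a post (Post.id = Comment.post_id)."""
    return [c for c in comments if c.post_id == post.id]


# Hypothetical sample documents.
account = Account("a1", "example_bridal", "wedding dresses", 1200)
post = Post("p1", account.id, "New collection", 85)
comments = [
    Comment("c1", "a1", "p1", "u9", "Gorgeous dress!"),
    Comment("c2", "a1", "p2", "u7", "Love it"),
]
print([c.id for c in comments_for_post(post, comments)])
```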

3.5 Data Preprocessing


The collected data has been preprocessed before applying sentiment analysis to it.
VADER has limited functionality (tokenization, expanding abbreviations) when it
comes to data preprocessing, so it had to be extended in order to achieve a higher level
of accuracy. Due to the specifics of microblogging (sarcasm, irony, use of emoji and so
on), the basic preprocessing methods had to be revised in order to avoid misleading
results. For example, one of the common preprocessing methods is punctuation
removal. In the case of Instagram, however, punctuation can carry valuable information:
additional exclamation marks may indicate an increased sentiment intensity level, a
question mark may mean uncertainty and so on. For the same reason stemming [36] is not
used in this project, since different forms of a word and different words with similar
meanings may still have different sentiment intensity depending on the context.
Therefore, the processed words should not be stemmed or lemmatized. Finally,
converting all letters to lowercase is also avoided, since VADER is sensitive to
capitalization and gives a higher compound score to words in uppercase.
The following methods are used for preprocessing Instagram comments in this
project:
● removing mentions (via @) from the text. The usernames mentioned by other
users may carry a certain semantic tonality and affect the overall score of the
comment. This is done with a regular expression:
text = re.sub(r'@[^\s]+', '', text)
● removing the # sign from hashtags. Here the situation is the opposite of the
previous one: hashtags are often used to emphasize the user's opinion and
are therefore important for semantic analysis. This is done with a regular
expression:
text = re.sub(r'#([^\s]+)', r'\1', text)
● adding spaces between emoji. VADER has difficulty interpreting repeated emoji
that are not separated by spaces; in such cases the sentiment score is always
neutral. In this project the emoji package is used to identify emoji in the text
and return their Unicode references (Image 3.3).

Image 3.3: Adding spaces between emoji

● word standardization (removing repeated characters in words). Only words that
are not present in the lexicon have been cleaned, in order to avoid removing
duplicate letters from actual English words such as "good". For this
purpose, the text input is first split into a list of strings. Every item is then
checked against the lexicon, and if it is not present there, repeated characters
are removed from this word (Image 3.4).

Image 3.4: Word standardization

● expansion of abbreviations and slang words. This is done automatically by
VADER.
● emoticon conversion. This is done automatically by VADER.
● emoji conversion. This is done automatically by VADER.
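The text-based steps of the list above can be combined into a small pipeline. The following sketch is an approximation written for illustration, not the code shown in Image 3.4. The tiny `LEXICON` set stands in for the VADER lexicon, the function names are hypothetical, and collapsing every run of repeated letters to a single letter is an assumption about how "removing repeated characters" is done. The emoji spacing step is omitted because it relies on the external emoji package.

```python
import re
import string

# Hypothetical stand-in for the VADER lexicon, which contains thousands of entries.
LEXICON = {"good", "love", "dress", "so", "this"}


def remove_mentions(text: str) -> str:
    """Drop @username mentions."""
    return re.sub(r"@\S+", "", text)


def strip_hashtag_signs(text: str) -> str:
    """Keep the hashtag word but drop the leading # sign."""
    return re.sub(r"#(\S+)", r"\1", text)


def standardize_word(word: str, lexicon=LEXICON) -> str:
    """Collapse repeated letters unless the word is already in the lexicon."""
    core = word.strip(string.punctuation).lower()
    if core in lexicon:
        return word  # e.g. "good!!!" keeps its double letter and its punctuation
    return re.sub(r"([A-Za-z])\1+", r"\1", word)


def preprocess(text: str) -> str:
    """Apply mention removal, hashtag cleaning and word standardization in order."""
    text = strip_hashtag_signs(remove_mentions(text))
    return " ".join(standardize_word(w) for w in text.split())


print(preprocess("@anna I loooove this #dress sooo good!!!"))
```

Only letters are collapsed, so repeated exclamation marks survive and can still raise the sentiment intensity. Capitalization is also left untouched, in line with the decision not to lowercase the comments.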

3.6 Sentiment Analysis


After retrieving, storing and preprocessing the Instagram data, sentiment analysis has
been performed. With the help of VADER it requires one line of code:
sentimentScore = sentAnalyzer.polarity_scores(textUpd)
The polarity_scores function returns four values. Three of them describe to what
extent the text belongs to the negative, neutral and positive categories, and the
fourth is the compound score.
The VADER algorithm matches every word in the comment text against its lexicon
of sentiment values. Image 3.5 illustrates how these values may vary depending on
the applied rules (described in subsection 3.1 Tool Choice).

Image 3.5: Sentiment Analysis


The fourth value, the compound score, is the one on which the account ranking is
built. This score provides a single measure of sentiment polarity and is calculated
according to the following formula (Image 3.6).

Image 3.6 Compound Score Formula [37]


Sentiment(w_i) is the valence score of the i-th word and α is a hyper-parameter that is
supposed to approximate the maximum expected value (the default value is 15 in NLTK)
[37]. Basically, the formula normalizes the sum of all scores in the text and presents
the result on a scale from -1 to 1, where 1 is extreme positivity and -1 is extreme
negativity. In order to avoid confusion in the calculations, the compound scale has been
shifted to positive values only (that is, from 0 to 2, where 0 is most negative and 2 is
most positive).
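A minimal sketch of this normalization and of the shift to the 0 to 2 scale is given below. It assumes that the formula in Image 3.6 is the usual VADER normalization, x / sqrt(x² + α), where x is the sum of the word valence scores. The function names and the shift by adding 1 are assumptions made for illustration.

```python
import math


def normalize(score_sum: float, alpha: float = 15.0) -> float:
    """Map the sum of valence scores to the range [-1, 1] (assumed VADER normalization)."""
    norm = score_sum / math.sqrt(score_sum * score_sum + alpha)
    return max(-1.0, min(1.0, norm))


def shift_to_positive(compound: float) -> float:
    """Move the compound score from the [-1, 1] scale to the [0, 2] scale."""
    return compound + 1.0


# Hypothetical sum of word valence scores for one comment.
compound = normalize(4.0)
print(round(compound, 4), round(shift_to_positive(compound), 4))
```

With this shift a neutral comment (compound 0) receives 1 point out of 2, which matters for the percentage calculations in the ranking step.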
After obtaining the compound polarity score value for every comment, all
documents in the comments collection have been updated with new field-value pairs
(Image 3.7).

Image 3.7: Example of the Compound Score in the Comments collection

3.7 Account ranking
The ranking process is implemented in Python and includes several steps. In order to
get the top 10 accounts out of the 150 processed Instagram accounts, the total account
score of each profile has to be calculated. This total score consists of three parts:
● total score of all posts (based on the compound score of the comments)
● total likes score (based on the number of likes under the posts)
● total followers score (based on the number of followers that a profile has).
Since the goal of this project was to develop a ranking mechanism that
predominantly depends on user feedback, the total score of all posts has the most
weight (100 points). Because followers and likes can be bought on Instagram [38], the
total likes score and the total followers score have a reduced impact on the ranking (50
points each) compared to the total score of all posts, which is based on the content of
the comments. Since likes and followers are not always bought, these two categories
could not be completely excluded from the total account score.
The introduced fixed point assignment system (100/50/50) is solely the researcher's
choice. It aims to emphasize the importance of users' feedback over the number of
followers and likes. However, these numbers can vary depending on the goal of the
research or application.

3.7.1 Total Score of All Posts


The total score of a post is based on the sum of the compound scores of the comments
under the post. Since the number of comments differs from post to post, a plain sum of
individual scores would not be comparable between posts. The proposed algorithm
first finds the mean percentage of the total comments score and then converts this
value to points on a scale from 0 to 100, where 1% is equal to 1 point.

Image 3.8: Calculation of Total Comments Score

The mean percentage score is calculated by dividing the total compound score of all
comments under one post by the highest possible total compound score of these
comments. The latter is calculated by multiplying the number of comments under the
post by 2, the value that indicates extremely positive sentiment tonality (Image 3.8).
As in the case of the comments collection in MongoDB, the post collection has also
been updated with new field-value pairs after running the script (Image 3.9).
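The calculation for a single post can be sketched as follows. This is an illustration of the description above rather than the code in Image 3.8; the function name, the sample scores and the handling of a post without comments are assumptions.

```python
from typing import List


def post_comments_score(shifted_scores: List[float]) -> float:
    """Mean percentage of the comments score of one post, in points from 0 to 100.

    shifted_scores holds the compound scores of the comments on the 0 to 2 scale.
    """
    if not shifted_scores:
        return 0.0  # assumption: a post without comments gets no points
    highest_possible = 2.0 * len(shifted_scores)
    return sum(shifted_scores) / highest_possible * 100.0


# Hypothetical post with one very positive and two neutral comments.
print(post_comments_score([2.0, 1.0, 1.0]))
```

Note that a post with only neutral comments (score 1 each) receives 50 points on this scale, because the shifted scale starts at 0 for the most negative sentiment.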

Image 3.9: Example of the Total Comments Score in the Posts Collection
The account's total posts score is calculated in a similar way, i.e. by dividing the
total score of all posts by the highest possible total score (Image 3.10).
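A short sketch of the account-level step is given below. It assumes that the highest possible total score is 100 points per post multiplied by the number of posts, which is an interpretation of the description and not a reproduction of Image 3.10. The function name and the sample values are hypothetical.

```python
from typing import List


def account_posts_score(post_scores: List[float]) -> float:
    """Total posts score of an account, in points from 0 to 100.

    post_scores holds the comments score of every post on the 0 to 100 scale.
    """
    if not post_scores:
        return 0.0  # assumption: an account without scored posts gets no points
    highest_possible = 100.0 * len(post_scores)
    return sum(post_scores) / highest_possible * 100.0


# Hypothetical account with two scored posts.
print(account_posts_score([80.0, 60.0]))
```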

Image 3.10. Calculation of Total Posts Score

After obtaining the value for total posts score based on the total comments score,
every document in the account collection has been updated with a new field (Image
3.11).

Image 3.11: Example of the Total Comments Score in the Account Collection

3.7.2 Total Likes Score

The total likes score is also calculated with the help of the mean percentage value.
First, the sum of all likes under all posts of the given account is computed. The post
that gained the most likes is considered to be the highest ranked post, and its number
of likes is equal to 100% or 50 points. As discussed earlier in this chapter, the total
likes score has a reduced impact on the ranking compared to the total score of all
posts. Therefore, the maximum number of points it can have is set to 50, not 100.
Knowing the post with the maximum number of likes in the account, the overall
likes score can be estimated by dividing the total number of likes in the account by the
highest possible number of likes for all posts and multiplying the result by 50 (Image
3.12).
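The following sketch shows one way to read this calculation. It assumes that the highest possible number of likes is the like count of the most liked post multiplied by the number of posts. This reading, the function name and the sample numbers are assumptions and may differ from the code in Image 3.12.

```python
from typing import List


def total_likes_score(likes_per_post: List[int], max_points: float = 50.0) -> float:
    """Likes score of an account relative to its own most liked post."""
    if not likes_per_post:
        return 0.0
    top = max(likes_per_post)
    if top == 0:
        return 0.0  # assumption: no likes at all means no points
    highest_possible = top * len(likes_per_post)
    return sum(likes_per_post) / highest_possible * max_points


# Hypothetical accounts: uneven likes versus identical likes on every post.
print(total_likes_score([100, 50, 50]))
print(total_likes_score([10, 10]))
```

Under this reading an account reaches the full 50 points only when all of its posts have as many likes as its most liked post, independently of how many likes that is.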

Image 3.12: Calculation of Total Likes Score

The example below illustrates one of the updated documents in the account
collection with a new field containing the likes score (Image 3.13).

Image 3.13: Example of the Total Likes Score in the Account Collection

3.7.3 Total Followers Score

The logic is similar to the one used in the calculation of the total likes score, but in
this case the number of followers of one profile is divided by the number of followers
of the account with the highest number of followers (Image 3.14).

Image 3.14: Calculation of Total Followers Score


Unlike the total likes score, where the mean percentage is calculated relative to the
number of likes within the account, the total followers score is estimated in relation
to the number of followers of other accounts (Image 3.14).
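This relative calculation can be sketched in a few lines. The scaling to 50 points follows the point system of this chapter; the function name and the sample numbers are hypothetical.

```python
def total_followers_score(followers: int, max_followers: int,
                          max_points: float = 50.0) -> float:
    """Followers score relative to the most followed account in the dataset."""
    if max_followers <= 0:
        return 0.0  # assumption: guard against an empty or follower-less dataset
    return followers / max_followers * max_points


# Hypothetical profile with 500 followers; the largest account has 2,000.
print(total_followers_score(500, 2000))
```

Because the denominator comes from the whole dataset, this score changes whenever a more followed account is added to the collection.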

Image 3.15: Example of the Total Followers Score in the Account Collection

3.7.4 Total Account Score
The total account score can be estimated as the sum of the values for total posts score,
total likes score and total followers score. After updating documents in the account
collection with new field-value pairs, a list of top-rated accounts can be obtained by
calling the sort and limit functions (Image 3.16).
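The final step can be illustrated with an in-memory sketch that mirrors the sort and limit calls. The field names `posts_score`, `likes_score` and `followers_score`, the function names and the sample accounts are assumptions; the project itself performs this step on the account collection in MongoDB (Image 3.16).

```python
from typing import Dict, List


def total_account_score(account: Dict) -> float:
    """Sum of the three partial scores (at most 100 + 50 + 50 = 200 points)."""
    return account["posts_score"] + account["likes_score"] + account["followers_score"]


def top_accounts(accounts: List[Dict], limit: int = 10) -> List[Dict]:
    """Sort accounts by total score in descending order and keep the first ones."""
    return sorted(accounts, key=total_account_score, reverse=True)[:limit]


# Hypothetical accounts with already calculated partial scores.
sample = [
    {"username": "account_a", "posts_score": 80.0, "likes_score": 40.0, "followers_score": 10.0},
    {"username": "account_b", "posts_score": 75.0, "likes_score": 50.0, "followers_score": 25.0},
    {"username": "account_c", "posts_score": 60.0, "likes_score": 30.0, "followers_score": 5.0},
]
print([a["username"] for a in top_accounts(sample, limit=2)])
```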

Image 3.16. Calculation of Total Account Score

4. Results
The proposed ranking algorithm was run on the dataset retrieved from the web
version of Instagram. It includes 150 public accounts with 3,201 posts and
15,852 comments in total. Because of the reduced limit on requests, the data was pulled
at different times and on different days. The results have been limited to the top 10
accounts and arranged from the highest rank at the top to the lowest rank at the bottom.
The maximum number of points one account could get was 200 (100 for comments, 50 for
followers and 50 for likes).
The algorithm has been run as a baseline experiment (without preprocessing of the
comments) and as a preprocessed experiment. In the latter case the goal was to
investigate whether preprocessing of the comments has an impact on the account
ranking mechanism.
The results of the baseline experiment, comparing accounts by total score, are shown
in Image 4.1 below.

Image 4.1: Top 10 Instagram accounts sorted by total score (unpreprocessed data)

Image 4.2 and Image 4.3 illustrate a pure result based solely on the sentiment
analysis of the comments (with preprocessed and unpreprocessed data respectively).

Image 4.2: Top 10 Instagram accounts sorted by total comments score (preprocessed
data)

Image 4.3: Top 10 Instagram accounts sorted by total comments score (unpreprocessed
data)

Image 4.4 and Image 4.5 provide the results of the accounts ranking based either on
the total likes score or the total followers score without considering the VADER
compound sentiment score of the comments.

Image 4.4: Top 10 Instagram accounts sorted by total likes score

The highest possible score (50) in Image 4.4 indicates that top 10 accounts receive quite
a high number of likes on all posts relative to the most liked post in the given account.

Image 4.5: Top 10 Instagram accounts sorted by total followers score

Finally, the results from the preprocessed experiment are presented, where accounts
are sorted by total score, including the comments sentiment and the mean percentage
of likes and followers (Image 4.6).

Image 4.6: Top 10 Instagram accounts sorted by total score (preprocessed data)

The following table (Table 4.1) represents the survey results where experiment
participants have been asked to evaluate the relevance of the result on a scale of 0-3
with 0 meaning not relevant, 3 meaning highly relevant.

29
Method | Question | Participant 1 | Participant 2 | Participant 3
Sorting by total score with unpreprocessed data (Image 4.1) | Is the profile content relevant considering that higher-ranked accounts receive more positive scores? | 2 (not sure about Indian dresses in top-ranked account) | 1 (Indian dresses are not relevant, some comments are not relevant) | 1 (shree_fashion_studio_india is not relevant)
Sorting by total score (preprocessed data) (Image 4.6) | Is the profile content relevant considering that higher-ranked accounts receive more positive scores? | 3 | 2 (some messages are irrelevant for ranking, like 'thank you' messages) | 1 (@nuriozkancouture is underrated due to multilingual comments)
Sorting by total likes score (Image 4.4) | Is the profile content relevant considering that higher-ranked accounts receive more likes than lower-ranked ones? | 0 (makeup account is not relevant, score is the same for all) | 1 (account is relevant, but score system is weird) | 0 (score assigning system does not seem to work)
Sorting by total followers score (Image 4.5) | Is the profile content relevant considering that higher-ranked accounts have more followers than lower-ranked ones? | 3 | 3 | 3

Table 4.1 Assessment of the ranking algorithm. Survey results.

The given data will be compared and analyzed in the following chapters.

5. Analysis
The above results illustrate possible outputs of the proposed model depending on
different combinations of ranking criteria and data preprocessing. There is a clear
difference in score between sorting accounts with preprocessed or unpreprocessed
datasets (Figure 5.1).

Figure 5.1: Top accounts sorted by total score with preprocessed and unpreprocessed
data
This difference is essential for the project because it results in changes in the
rank placement of accounts. It can be observed by following one specific account, for
example bridalrelovedleicester. Table 5.1 shows the ranking of this account under
different methods and sorting criteria.

Method | Score | Rank Placement
Sorted by total score (unpreprocessed data) | 126.64 | 4
Sorted by total score (preprocessed data) | 129.48 | 6
Sorted by total comments score (preprocessed data) | 79.25 | 10
Sorted by total comments score (unpreprocessed data) | 76.42 | 5
Sorted by total likes score | 50 | 4

Table 5.1. Rank placement of bridalrelovedleicester's Instagram profile

Since the total score consists of three components (comments score, likes score,
followers score), it makes sense to separate these values and analyze whether the score
difference still remains high. The likes and followers scores do not depend on data
preprocessing (they are calculated from numerical values only), which is why only the
comments score will be analyzed in detail.
The focus of the ranking mechanism is on the total compound sentiment score of the
comments, rather than on the relative frequency of positive, neutral and negative
comments. The sentiment intensity of the comments has a larger impact on the ranking
(by providing score values) than the assignment of comments to a certain category.
Figure 5.2 shows that there is a score difference even within the comments component.
The blue bar indicates the higher total compound score of the preprocessed dataset.

Figure 5.2: Top accounts sorted by total comments score with preprocessed and
unpreprocessed data

However, this observation is not fully supported by the results of statistical tests. In
order to compare the two average values, a two-tailed t-test (two-sample, assuming
unequal variances) has been performed [39]. This project meets the conditions for
applying the t-test, such as
● the test statistic should follow a normal distribution (the compound score scale
that the calculation is based on is known in advance)
● the two datasets have the same variance
● the data is not dependently sampled (not sampled in clusters)
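For reference, the test statistic of a two-sample t-test assuming unequal variances (often called Welch's t-test) can be computed with the Python standard library as sketched below. The function name and the two samples are purely illustrative and are not the project data. The P-value is then obtained from the t distribution with the computed degrees of freedom; this last step is omitted here because the standard library does not provide that distribution.

```python
import math
from statistics import mean, variance
from typing import List, Tuple


def welch_t(sample_a: List[float], sample_b: List[float]) -> Tuple[float, float]:
    """Return the t statistic and the Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na  # squared standard error of the first mean
    vb = variance(sample_b) / nb  # squared standard error of the second mean
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical score samples, not taken from the collected dataset.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 6.0])
print(t_stat, dof)
```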
The test results are twofold. On the one hand, if a T-test is performed between total
scores on preprocessed and unpreprocessed datasets, the resulting P-value (lower than
0.05) indicates that the difference is still statistically significant (Table 5.2).

Table 5.2: T-test between total scores on preprocessed and unpreprocessed datasets

On the other hand, if the same test is performed on the total comments score only
(excluding the likes and followers scores), the observed difference between the
samples becomes less clear, which is indicated by the higher P-value (Table 5.3).

Table 5.3: T-test between total comments scores on preprocessed and unpreprocessed
datasets

Does it mean that data preprocessing does not have too much impact on the ranking
mechanism in the end? The answer can still be no and it can be explained by several
factors. Firstly, the proposed model does not take into account the presence of
comments in languages other than English. VADER evaluates such comments as
neutral by default (Image 5.1).
Secondly, the total compound score largely depends on the time factor. Every
account has a different audience, and Instagram activity varies between time zones.
For example, if data is collected when it is daytime in Cairo, Egypt and night in
Phoenix, Arizona, one can expect more comments in Arabic. As stated above, such
comments are not processed correctly. This is a drawback of the restricted request
limit on Instagram, caused by the latest changes in their policy [23]. Otherwise, data
could be retrieved and processed in real time, taking the user's time zone into account.

Image 5.1: Evaluation of comments in languages other than English

Another possible reason for the relatively low difference in scores between the
preprocessed and unpreprocessed datasets is the unpredictability of Instagram comments.
Even though the VADER lexicon is comparatively large (more than 7500 lexical features),
it cannot cover all aspects of a language that is constantly evolving (at the time of
writing, the VADER lexicon was last updated a year ago). Moreover, misspellings have a
significant negative impact on the total compound score, since misspelled words are
treated as neutral by default.
Another big issue that may negatively affect the proposed model is the ability of
account owners to remove unwanted comments. This problem could be eliminated by
analyzing the reviews that users post on their own profiles instead (where the account
is mentioned via @). However, the Instagram parser does not provide a way to retrieve
feedback on a profile through mentions in other profiles. In order to get the comment
content where the service or product is mentioned, one has to know the id or username
of the person who leaves the feedback.
Overall, positive and neutral comments prevail in the collected data. That can also be
explained by the specifics of the service (wedding outfit) that usually causes positive
emotions. However, the proposed ranking model is not tied to a specific type of service
or product and can be applied to any other category (for example, cosmetics where the
number of complaining users may be higher).

6. Discussion
The goal of this degree project is to investigate natural language processing techniques
applied to users' comments on Instagram in order to determine a new algorithm that
will add content analysis to the list of feed ranking factors. A new model for ranking
Instagram accounts is proposed as a result of this study.
VADER proved to be the most suitable tool for applications of this kind, which
depend on the analysis of user feedback. Work by Hutto and Gilbert [33] provides
evidence that the VADER sentiment analysis tool used in the project outperforms
machine learning algorithms based on Naïve Bayes, Maximum Entropy and Support Vector
Machine models. However, the conducted experiment shows that a combination of
rule-based and lexicon-based approaches to sentiment analysis is not efficient for
applications that rely largely on social media texts, especially in the microblogging
domain. Many of the functionalities implemented by VADER fail when it comes to
sentiment score estimation. The project results show that the proposed model performs
better with the preprocessed dataset due to the increased accuracy of the scores of
individual comments (Figure 5.1 and Figure 5.2).
In this regard, a similar concern can be found in the previous research by Araujo et
al. [40]. According to the authors, the sentence-level methods in sentiment analysis with
VADER may lead to bias in the results due to unidentified messages (in our case they
were labelled as neutral). Therefore, it is important to provide high coverage of the
data in order to achieve better accuracy of the results. This can be achieved by several
techniques that include but are not limited to
● using multilingual lexical database
● controlling spelling and grammar
● testing against stop word lists
● using ignore word lists
● updating lexicons to keep the project up-to-date
Even if the proposed ranking mechanism with the extended and modified VADER
algorithm provides a more accurate result, it still has a few drawbacks identified after
analyzing the survey results (Table 4.1). During the assessment of the ranking
algorithm, the survey participants have been asked to judge the relevance of the results
on a scale of 0-3, with 0 meaning not relevant at all and 3 meaning highly relevant. In
the case of sorting by total followers score, the participants agree on both the content
relevance and the ranking. This is logical, since this method is based on sorting purely
numerical data and should not cause any misunderstanding.
Evaluation of the sorting by total likes score method is less clear-cut. The
participants have been misled by the same score value for all accounts (Image 4.4).
These values are still correct because the total likes score is calculated with the help
of the mean percentage, not just the sum of all likes. The highest possible score (50)
indicates that the top 10 accounts receive quite a high number of likes on all posts
relative to the most liked post in the given account. However, a warning has been issued
regarding one account with mismatching content (makeup instead of wedding outfit). This
reveals some flaws in the account selection system based on the input keyword. In the
proposed system the account description is tested against an array of words related to
the given category, i.e. wedding outfit (Image 3.1). In our case, both the array and the
mismatching account's description contain the word bridal (compare bridal dress versus
bridal makeup). Therefore, a more precise algorithm is needed to identify appropriate
words.
The sorting by total score and by total comments score has been evaluated by
providing the participants with structured samples from the database for each account.
Since the algorithm retrieves only posts with at least 5 comments, it would otherwise
take the participants a long time to find these specific posts.
The evaluation of the sorting by total score with unpreprocessed data shows
that all three participants disagree with the highest ranked account
(@shree_fashion_studio_india), because they find it irrelevant to the category of
wedding outfit. As discussed in the previous chapter, the time factor plays a big role
in the data collection process. In this case, this piece of data has been collected
during bursts of activity in India. The given account still offers wedding dresses, but
not in the classic western style.
In the case of sorting by total score based on the preprocessed data, Participant 2
has some uncertainty about the reliability of the result, pointing at the irrelevance of
some comments. Image 6.1 shows a sentence that gets a slightly positive sentiment score
but does not express any assessment of the product itself (just gratitude for being
mentioned). Such information should be excluded from sentiment analysis and should not
affect the overall compound score of the account.

Image 6.1. Ranking evaluation. Irrelevant comment.

Participant 3 highlights another important issue. Even though nuriozkancouture's
profile has a relatively high number of followers (7,307) and receives many likes, the
total score is lowered due to incorrect scoring of the comments written in languages
other than English (Image 6.2).

Image 6.2. Ranking evaluation. Incorrect scoring.

Even if the algorithm chooses profiles with description in English only, there is
always high probability that the comments section turns out multilingual. Therefore, the
proposed model should be further evaluated to consider such cases. The work of Chilet
et al. on applying Named Entity Recognition (NER) in Natural Language Processing for
data structuring can be used as a guideline for implementing language identification
with help of NER tools [14].
The proposed ranking algorithm also has a few other flaws in terms of internal and
external validity. Due to time constraints and Instagram restrictions on requests, only a
static dataset has been used for the ranking algorithm development. Implementation of
the algorithm on real-time data could probably reveal the model's additional faults.
Moreover, the dataset itself is relatively small (150 profiles and 15,852 comments). For
comparison, Araque et al. in their research on deep learning sentiment analysis use the
Sentiment140 dataset containing 1.6 million Twitter messages [41]. Along with
randomization of data, the internal validity could be improved by additional
experimental manipulation of the data. For example, modifying existing comments in the
dataset or adding new comments with a strong negative or positive sentiment could
confirm or refute the validity of the proposed model.
The external validity of the project is negatively affected by several factors. The fact
that data is collected at certain periods of time does not guarantee the generalizability of
findings. Secondly, the ranking point system (100/50/50) is solely the researcher's
choice, intended to emphasize the importance of users' feedback over the number of
followers and likes. These numbers can vary.
A construct validity problem may also occur when it comes to the
interpretation of sentiment tonality. VADER uses a human-curated sentiment lexicon,
which means that sentiment scores have been assigned with the help of human experts.
Not everyone may agree with the assigned scores.
To sum up, the proposed ranking model based on the combination of rule-based and
lexicon-based approaches to sentiment analysis provides an alternative way to rate
Instagram accounts depending on user feedback. Due to time constraints, it is not
implemented in a way that covers all aspects of natural language processing. The
problem stated in the thesis has been addressed; however, the proposed ranking
algorithm requires subsequent refinement.

7. Conclusion
This project is an attempt to create an alternative way of ranking Instagram accounts
based not only on the number of likes or followers but mostly on sentiment
analysis of user comments. In order to achieve this aim, five objectives have been
identified in this project (see subchapter 1.5 Objectives).
The first objective (O1) has been met by downloading and extracting data from
Instagram (see subchapter 3.3 Data Collection). Due to the latest changes to Instagram
policy [23], which restricted access to the official API, the implementation part of
the project has been performed with an unofficial, open-source Instagram API project
(Instagram PHP Scraper). It gives access to Instagram data but, unlike the official
Instagram API, does not require an access token or authorized requests. The
requested data has been extracted by analyzing JSON objects on the Web version of
Instagram, whose structure generally corresponds to that of the official API responses.
The implementation part of this project has been preceded by research on machine
learning algorithms and data mining techniques as well as on sentiment analysis
(objective O2), described in subchapters 1.1 Background and 3.1 Tool Choice. This project
utilizes a combination of rule-based and lexicon-based approaches, performed
with the help of the VADER sentiment analysis tool. Unlike other approaches such as
Bag-of-Words models or classifiers, VADER uses a lexicon in which each word has a
tonality value assigned to it. This is essential for developing the ranking algorithm,
since VADER not only determines the sentiment polarity of words but also quantifies their
sentiment intensity with numerical scores. Moreover, VADER takes semantic
relationships between words into account and is well adapted to the social media domain
(handling slang, emoji, acronyms and so on).
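How a lexicon and a small set of rules combine into one numerical score can be illustrated with a short Python sketch. It does not use VADER itself: the three-word lexicon, the booster and negation constants and the function toy_compound are hypothetical stand-ins chosen for illustration. Only the final normalization, x / sqrt(x^2 + 15), mirrors the form used in the VADER reference implementation [25] to map the summed valence into the range [-1, 1].

```python
import math

# Hypothetical mini-lexicon: word -> valence (sign = polarity, size = intensity).
LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5}
BOOSTERS = {"very": 0.3, "really": 0.3}   # hypothetical intensity increments
NEGATIONS = {"not", "never"}
NEGATION_FACTOR = -0.75                    # hypothetical flip-and-damp factor


def toy_compound(text, alpha=15.0):
    """Return a sentiment score in [-1, 1] for a short comment (toy example)."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    total = 0.0
    for i, token in enumerate(tokens):
        valence = LEXICON.get(token)
        if valence is None:
            continue  # words outside the lexicon carry no sentiment here
        previous = tokens[i - 1] if i > 0 else ""
        before_previous = tokens[i - 2] if i > 1 else ""
        # Rule 1: a booster right before the word strengthens its valence.
        if previous in BOOSTERS:
            valence += math.copysign(BOOSTERS[previous], valence)
        # Rule 2: a negation within the two preceding tokens flips and damps it.
        if previous in NEGATIONS or before_previous in NEGATIONS:
            valence *= NEGATION_FACTOR
        total += valence
    # Normalize the summed valence into [-1, 1].
    return total / math.sqrt(total * total + alpha)


scores = {c: toy_compound(c) for c in ["good", "very good", "not good", "the venue"]}
```

In this toy version, "very good" scores higher than "good", "not good" becomes negative, and a comment without lexicon words scores zero. A plain word-count representation would not capture these distinctions.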
Based on this knowledge, a ranking model suitable for this project has
been presented in subchapter 3.2 Ranking Model (objective O3). This model includes
several modules, each of which is responsible for a specific range of functions:
retrieving and downloading information from Instagram, data storage, data
preprocessing, sentiment analysis and ranking.
The development process of these modules is described in detail in subchapters
3.3-3.7 (objective O4). Downloading and extracting data has been performed with the help
of Instagram PHP Scraper. The extracted data has been stored in MongoDB and
preprocessed in Python. Sentiment analysis has been performed on both preprocessed
and unpreprocessed data. Finally, the acquired information has been ranked and
analyzed.
The proposed algorithm has been tested on both preprocessed and unpreprocessed
datasets (objective O5). The results show that the proposed model performs
better on the preprocessed data when sorting both by total score and by total comments
score (see Figure 5.1 and Figure 5.2). However, the results can be negatively affected by
internal and external factors, as suggested by the mixed results of a two-tailed t-test.
The test indicates a statistically significant difference between total scores on the
preprocessed and unpreprocessed datasets, but for the total comments score (which is
based solely on sentiment analysis) the observed difference decreases. This can be
explained by particular features of the microblogging domain that obstruct data
preprocessing, such as multilingual data, the time factor, the unpredictability of
Instagram comments and so on.
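The comparison of the two datasets can be sketched in Python as follows. The snippet assumes, for illustration only, that the same accounts are scored once on unpreprocessed and once on preprocessed comments, so that a paired two-tailed t-test applies. This text does not state which variant of the test was used, and the sample values and the function paired_t_statistic are hypothetical.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for a paired t-test on two samples."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("samples must have equal length and at least 2 values")
    differences = [a - b for a, b in zip(after, before)]
    mean_diff = statistics.mean(differences)
    sd_diff = statistics.stdev(differences)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(differences)))
    return t, len(differences) - 1


# Hypothetical total scores for five accounts (not data from the thesis).
unpreprocessed = [120.0, 95.0, 130.0, 88.0, 101.0]
preprocessed = [126.0, 99.0, 133.0, 95.0, 104.0]

t_value, dof = paired_t_statistic(unpreprocessed, preprocessed)
# Two-tailed critical value of Student's t for alpha = 0.05 and 4 degrees
# of freedom is about 2.776; a larger |t| indicates a significant difference.
significant = abs(t_value) > 2.776
```

For real data, a statistics package would normally be used to obtain the p-value directly, for example scipy.stats.ttest_rel for paired samples.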
Even though the results revealed a number of drawbacks in the proposed model, most
of them can be corrected given additional time. The project demonstrates that it is
possible to provide an alternative way of ranking search results based on user
feedback.
A new ranking mechanism based on sentiment analysis of user feedback may be
beneficial for both customers and small businesses. The proposed ranking algorithm
may contribute to the creation of a direct-to-consumer channel that bypasses
intermediaries. With the help of this algorithm, small businesses with overall positive
feedback get direct access to potential customers without needing paid advertising
to reach the top of search results. Users, in turn, get a list of services and products
based on real customers' opinions rather than paid ads. Moreover, this research
contributes to a better understanding and further development of Instagram's Web 2.0
technologies.
The proposed ranking model is not tied to wedding services and can, in principle, be
applied to any category of services or products. However, it has some drawbacks and
limitations that can be addressed in future work.

7.1 Future work


Several issues identified in this project require further investigation.
First of all, the existing data preprocessing solutions need to be developed further.
Assigning sentiment compound scores to all comments regardless of language and
filtering spam are two priority tasks that could significantly improve the accuracy of
the result.
Further research may also include tying the ranking to users' geolocation. So far, this is
hard to implement due to multilingual comments and differences in time zones.
Clustering hashtags and incorporating this technique into data retrieval from Instagram
profile descriptions could also enhance the accuracy of the result.
As discussed earlier, the latest changes to the Instagram API policy had a significant
negative impact on the project implementation, making it impossible to test the
algorithm with real-time data. Future policy revisions may change the situation
for the better.

References

[1] T. Clarke. (2019, Mar. 5). 24+ Instagram Statistics That Matter to Marketers in
2019. [Online]. Available: https://blog.hootsuite.com/instagram-statistics/

[2] Wikipedia contributors. (2019, Feb. 24). Data mining. Wikipedia, The Free
Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Data_mining

[3] Wikipedia contributors. (2019, Jan. 28). Sentiment analysis. Wikipedia, The Free
Encyclopedia. [Online]. Available:
https://en.wikipedia.org/w/index.php?title=Sentiment_analysis&oldid=880653981

[4] Wikipedia contributors. (2019, Feb. 5). Web 2.0. Wikipedia, The Free Encyclopedia.
[Online]. Available:
https://en.wikipedia.org/w/index.php?title=Web_2.0&oldid=881846097

[5] Instagram. (2019). [Online]. Available: https://www.instagram.com/

[6] A. Lua. (2019, Feb. 19). How the Instagram Algorithm Works in 2019: Everything
You Need to Know. Buffer. [Online]. Available:
https://buffer.com/library/instagram-feed-algorithm

[7] Wikipedia contributors. (2019, Feb. 27). Web Scraping. Wikipedia, The Free
Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Web_scraping

[8] Ashutosh KS. (2017, Apr. 13). 10 Web Scraping Tools to Extract Online Data.
Hongkiat. [Online]. Available: https://www.hongkiat.com/blog/web-scraping-tools/

[9] Github. (2019). Instagram PHP Scraper. [Online]. Available:
https://github.com/postaddictme/instagram-php-scraper

[10] Wikipedia contributors. (2019, Feb. 24). Data Mining. Wikipedia, The Free
Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Data_mining

[11] Wikipedia contributors. (2019, Feb. 21). Machine Learning. Wikipedia, The Free
Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Machine_learning

[12] Wikipedia contributors. (2019, Jan. 28). Sentiment Analysis. Wikipedia, The Free
Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Sentiment_analysis

[13] K. Zhang et al. “Product and Service Popularity Analysis on Instagram,” in IEEE
International Conference on Consumer Electronics-Taiwan (ICCE-TW), 2018. pp.
13-18.

[14] J.A. Chilet et al. “Analyzing social media marketing in the high-end fashion
industry using Named Entity Recognition,” in IEEE/ACM International Conference on
Advances in Social Networks Analysis and Mining (ASONAM), 2016. pp. 621-622.

[15] K. Hammar et al. “Deep Text Mining of Instagram Data without Strong
Supervision,” in IEEE/WIC/ACM International Conference on Web Intelligence (WI),
2018. pp. 158-165.

[16] F. Prabowo and A. Purwarianti. “Instagram online shop's comment classification
using statistical approach,” in 2nd International conferences on Information
Technology, Information Systems and Electrical Engineering (ICITISEE), 2017. pp.
282-287.

[17] B. Mullin. (2018, Nov. 13). Brands Now Spend Nearly Two Thirds of Digital
Advertising on Mobile, IAB Says. The Wall Street Journal. [Online]. Available:
https://www.wsj.com/articles/brands-now-spend-nearly-two-thirds-of-digital-advertising-on-mobile-iab-says-1542124801

[18] API - Instagram. (2019). Instagram. [Online]. Available:
https://www.instagram.com/developer/

[19] N. Cross. “A History of Design Methodology,” in Design Methodology and
Relationships with Science, 1 ed., Dordrecht, Netherlands: Kluwer Academic
Publishers, 1993. pp. 15-27.

[20] A. Hevner et al. “Design Science in Information Systems Research,” in MIS
Quarterly, Vol. 28, No. 1, 2004. pp. 75-105.

[21] I. Kerins. (2018, July 25). GDPR Compliance for Web Scrapers: the Step-by-Step
Guide. The Scrapinghub Blog. [Online]. Available:
https://blog.scrapinghub.com/web-scraping-gdpr-compliance-guide

[22] S. Neeraj. (2018). Facebook and Instagram API Access, Update and Changes
2018. Taggbox. [Online]. Available:
https://taggbox.com/blog/facebook-instagram-api-access-update-changes-2018/

[23] API and Other Platform Product Changes. (2018). Facebook for developers.
[Online]. Available:
https://developers.facebook.com/blog/post/2018/04/04/facebook-api-platform-product-changes/

[24] MongoDB. (2019). [Online]. Available: https://www.mongodb.com/

[25] VADER Sentiment Analysis. (2018). GitHub. [Online]. Available:
https://github.com/cjhutto/vaderSentiment

[26] Natural Language Toolkit. (2019). NLTK Project. [Online]. Available:
https://www.nltk.org/

[27] A. Bhuyan. Practical Data Analysis: Using Open Source Tools & Techniques
(Volume Book 1). Amazon Digital Services LLC, 2018.

[28] Wikipedia contributors. (2019, May 1). Bag-of-Words model. Wikipedia, The Free
Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Bag-of-words_model

[29] LIWC. (2015). Pennebaker Conglomerates, Inc. [Online]. Available:
http://liwc.wpengine.com

[30] The General Inquirer. (2002). [Online]. Available:
http://www.wjh.harvard.edu/~inquirer/Home.html

[31] SentiWordNet. (2007). [Online]. Available: http://ontotext.fbk.eu/sentiwn.html

[32] SenticNet. [Online]. Available: https://sentic.net/

[33] C. J. Hutto, E. Gilbert. “VADER: A Parsimonious Rule-based Model for Sentiment
Analysis of Social Media Text,” in Proceedings of the Eighth International AAAI
Conference on Weblogs and Social Media, At Ann Arbor, MI. Georgia Institute of
Technology. Atlanta, 2015. pp. 216-225.

[34] scikit-learn. (2019). Machine learning in Python. [Online]. Available:
https://scikit-learn.org/stable/

[35] NumPy. (2019). [Online]. Available: https://www.numpy.org/

[36] Python Web Scraping – Dealing with Text. (2019). TutorialsPoint. [Online].
Available:
https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_dealing_with_text.htm

[37] G. Bonaccorso. Machine Learning Algorithms: Popular algorithms for data
science and machine learning. 2nd ed. Birmingham, UK: Packt Publishing Ltd, 2018. p.
422.

[38] M. Aynsley. (2017, March 20). Want to Buy Instagram Followers? This is What
Happens When You Do. [Online]. Available:
https://blog.hootsuite.com/buy-instagram-followers-experiment/

[39] Wikipedia contributors. (2019, May 23). Student’s t-Test. Wikipedia, The Free
Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Student%27s_t-test

[40] M. Araujo et al. (2016, Apr. 4). “An Evaluation of Machine Translation for
Multilingual Sentence-level Sentiment Analysis,” in SAC, 2016. [Online]. Available:
http://blackbird.dcc.ufmg.br:1210/pdfs/sac2016-translation.pdf

[41] O. Araque et al. (2017, Feb. 3). “Enhancing deep learning sentiment analysis with
ensemble techniques in social applications,” in Expert Systems with Applications: An
International Journal, Vol. 77, 2017. pp. 236-246.
