2015 International Conference on Information Technology Systems and Innovation (ICITSI)

Bandung - Bali, November 16 - 19, 2015


ISBN: 978-1-4673-6664-9

TF-IDF Method in Ranking Keywords of Instagram Users’ Image Captions

Bernardus Ari Kuncoro
Master of Information Technology
Bina Nusantara University
Jakarta, Indonesia
Email: b.kuncoro [at] binus.ac.id

Bambang Heru Iswanto
Department of Physics
Jakarta State University
Jakarta, Indonesia
Email: bhi [at] unj.ac.id

Abstract—Instagram is one of the popular social media applications used by a wide range of people around the world. The significant growth of active Instagram users affects the size of Instagram data: the more users there are, the larger and more varied the posted data becomes. In line with its popularity, in recent years many researchers have begun to study and analyze it for various purposes, such as detecting event photos based on location, clustering photo content, and designing advertising strategies based on user types. At present, three types of data are available on Instagram: text, image, and video. In this paper we propose the Term Frequency and Inverse Document Frequency (TF-IDF) method to rank the keywords of the top twenty most followed Instagram users based on their image captions. The objective of this research is to automatically identify the main idea of an Instagram user from the 50 most recent image captions posted. In our experiments, TF-IDF was successfully implemented to reveal a set of keywords with their rankings. The highest-ranked keyword is indeed the main topic of a user, as indicated by its TF-IDF value. The results indicate that the TF-IDF method is very useful for finding and ranking the keywords of Instagram users’ image captions. In future research, the ranked keywords can be used as extracted features in classification and clustering tasks.

Keywords—Instagram; text mining; Term Frequency and Inverse Document Frequency; social media

978-1-4673-6664-9/15/$31.00 ©2015 IEEE

I. INTRODUCTION

Instagram is one of the popular social media platforms that provides users a quick way to capture and share their life moments with followers through a series of filter-manipulated photos and videos. It is more popular among a younger demographic: over 35% of people using Instagram are between 18 and 29 years old [1]. From its establishment in October 2010 until the time of writing, the number of active Instagram users has grown significantly. According to an update by the official Instagram account in September 2015 [2], Instagram has 400 million registered users, 25% more than in December 2014. Another interesting fact is that users upload more than 80 million photos per day on average.

Despite Instagram's popularity, the number of studies focused on Instagram is very low. In 2014, Hu, Manikonda, and Kambhampati [3] wrote that their work was believed to be the first study to conduct a deep analysis of photo content, user activities, and user types on Instagram. In their study, computer vision and clustering were successfully applied, revealing eight popular categories of photos and five distinct types of Instagram users. A dissertation related to Instagram was reported by McCune, who investigated people's motivations for using Instagram through a survey of 23 Instagram users [4]. In 2013, Silva, Vaz de Melo, Almeida, Salles, and Loureiro applied visualization and cultural analytics to Instagram photos from different cities around the world to trace their social and cultural differences [5].

Instagram has three types of data: text, image, and video. To narrow the scope of this study, only text data was used, namely the image caption, which describes the image. As illustrated in Fig. 1, the image caption is located under the image posted by the user.

Fig. 1. Example of Cristiano’s and Instagram’s Posts with Image Captions

The research question of this study is “How to find the keywords of an Instagram account, and their rankings, based on the image caption data posted?” To answer this question, a text mining method (TF-IDF) is used. The output of this study is a set of keywords with ranking values. The higher the ranking, the more relevant the keyword is to the captions that the user posted.

The significance of this study is as follows. First, the ranked keywords of a username's image captions can be used as features in further research such as clustering, classification, and profiling of Instagram usernames. Second, this study adds to the diversity of Instagram data research with a different approach, namely text mining. Third, the method can help researchers retrieve the significant words of users faster, since it works automatically instead of requiring manual retrieval by keeping an eye on the captions posted by the users.

II. DATASET

TABLE I
TOP 20 INSTAGRAM PROFILES

Rank  Username         Media    Followers  Following
1     instagram        2,509  103,226,690        182
2     taylorswift        732   49,451,242         77
3     kimkardashian    3,167   48,014,416         96
4     beyonce          1,172   47,173,577          0
5     selenagomez      1,028   45,858,936        173
6     arianagrande     1,869   44,598,791        952
7     justinbieber     2,508   40,228,982         73
8     kendalljenner    2,343   38,055,799        170
9     kyliejenner      3,338   38,075,231        186
10    nickiminaj       3,387   35,185,711        382
11    khloekardashian  2,935   33,091,863        149
12    natgeo           8,432   32,835,979         94
13    neymarjr         3,018   32,555,977      1,023
14    cristiano          602   31,865,306        198
15    mileycyrus       4,280   29,539,569        384
16    katyperry          366   28,891,826        217
17    therock          1,343   28,778,259         64
18    jlo              1,185   27,824,613        966
19    badgalriri       3,267   26,976,059      1,166
20    kourtneykardash  2,021   25,875,453         72

The dataset was crawled using the Instagram API. First, the top 20 most followed Instagram usernames were collected. The list is based on http://socialblade.com/instagram/top/100/followers, accessed on October 7, 2015 [6], and is shown in Table I. Second, in order to capture each user's most up-to-date keywords, only the 50 most recent image captions were used. Each username is treated as one document containing a bag of words, hence there are 20 documents in total.

The following are characteristics of Instagram image caption data. Note that these may change in the future without prior notice due to Instagram updates.

1) Image caption character limit: captions on a photo and subsequent comments are each capped at 2200 characters. A user is also allowed not to write a caption at all.
2) Hashtag limit: at most 30 hashtags are allowed per caption.
3) Symbol characters: some users use the symbol characters provided on the smartphone keyboard.
4) Writing technique: because Instagram allows up to 2200 characters, abbreviated spelling and cyber slang are used less often in image captions than in tweets on Twitter.
5) Availability: the amount of data available is extremely large. According to the official Instagram release in September 2015, 80 million photos are uploaded daily. The Instagram API facilitates the collection of image captions as well as the URL of each image.
6) Topics: Instagram users post photos and videos on a wide variety of topics. Previous research observed eight main photo categories posted on Instagram: friends, food, gadget, captioned photo, pet, activity, selfie, and fashion [3].
7) Weekend peak: users tend to post photos and videos during weekends and at the end of the day [7].

III. METHODOLOGY

The methodology of this study is illustrated in Fig. 2. There are three modules: retrieval, preprocessing, and ranking. In the retrieval module, each username was used to request its 50 most recent image captions via the Instagram API. The output of this module is a group of text files; since 20 usernames were used, there are 20 text files. This methodology is inspired by the work of Kumar and Sebastian in 2012 [8].

The next module is preprocessing. It is needed to keep the important words and to filter out irrelevant words and characters in each document. The first preprocessing step is the removal of HTML and symbol characters. This is important because users commonly write symbol characters that carry little meaning or are not keywords. The second step is the removal of punctuation, #tags, @tags, and stopwords. The main goal of this step is to retrieve essential words and to eliminate words that have little significance for the documents, such as “the”, “is”, “are”, “an”, “of”, “to”, etc. It also reduces the size of the index file, improving efficiency and effectiveness. The third step is standardizing words. For example, a user sometimes writes 'go hooooome'; the output of this step is 'go home'. The last preprocessing step is URL removal, since a URL is clearly not useful for revealing the keywords.

After preprocessing, the ranking process is applied. The first step of this module is tokenization, whose objective here is to break the text up into words or other meaningful elements called tokens. The tokens, commonly referred to as terms, are then used to form the vector space model.
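The preprocessing and tokenization steps described above can be sketched in Python as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function name `preprocess`, the regular expressions, and the short `STOPWORDS` set are all hypothetical, since the paper does not list its stopwords or exact rules. URLs are removed before punctuation here (the text lists URL removal last) so that a link is dropped whole rather than broken into word fragments.

```python
import html
import re
from typing import List

# Illustrative stopword list (assumption: the paper does not give its own list).
STOPWORDS = {"the", "is", "are", "an", "a", "of", "to", "and", "in", "on"}


def preprocess(caption: str) -> List[str]:
    """Clean one image caption and return its tokens (terms)."""
    text = html.unescape(caption)                # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)         # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"[#@]\w+", " ", text)         # remove #tags and @tags
    text = text.lower()
    # Standardize elongated words: a character repeated 3+ times becomes one,
    # e.g. 'hooooome' -> 'home'. Crude: 'coool' would become 'col'.
    text = re.sub(r"(\w)\1{2,}", r"\1", text)
    text = re.sub(r"[^\w\s]", " ", text)         # remove punctuation and symbols
    tokens = text.split()                        # whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]
```

Under these assumptions, `preprocess("Go hooooome to the <b>beach</b>! #sun @friend http://t.co/abc")` returns `['go', 'home', 'beach']`. Concatenating the tokens of a username's 50 captions would give the bag of words for that username's document.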
Fig. 2. Methodology

TF-IDF stands for “Term Frequency, Inverse Document Frequency”. It is a way to score the importance of words (or “terms”) in a document based on how frequently they appear across multiple documents. It is also the most common weighting method used to describe documents in the Vector Space Model (VSM), particularly in Information Retrieval problems. TF-IDF is a relatively old method, proposed by Salton and Buckley in 1988 [9]. Despite its age, it is simple and effective, making it a popular starting point compared to more recent algorithms. TF and IDF are described below.

1) TF measures how many times the term t is present in each document d. In mathematical notation:

$$\mathrm{TF}(t, d) = \sum_{x \in d} \mathrm{fr}(x, t) \tag{1}$$

where fr(x, t) is a simple function defined as

$$\mathrm{fr}(x, t) = \begin{cases} 1, & \text{if } x = t \\ 0, & \text{otherwise} \end{cases}$$

Hence, TF(t, d) returns how many times the term t is present in the document d.

2) IDF is defined by the following formula:

$$\mathrm{IDF}(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|} \tag{2}$$

where |D| is the total number of documents and |{d : t ∈ d}| is the number of documents in which the term t appears, that is, those for which TF(t, d) ≠ 0. The 1 in the denominator is added only to avoid division by zero.

3) The TF-IDF formula is defined as follows:

$$\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t) \tag{3}$$

The TF-IDF value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Thus the more often a word appears in a document, the more significant it is estimated to be in that document, unless it is also common across the corpus.

IV. RESULT AND DISCUSSION

The proposed method was applied to the top twenty most followed Instagram usernames as input. The number of ranked keywords can be varied; in this study it was limited to 10. Thus the result is 20 usernames with 10 ranked keywords each. Three samples of the results, showing the top 10 words and their TF-IDF values for individual Instagram users, are illustrated as bar charts in Figs. 3, 4, and 5. The first bar is the highest-ranked word, i.e. the most relevant word for that user. According to these figures, the most relevant words for @instagram, @taylorswift, and @cristiano are weekend, toronto, and drive, respectively.

Fig. 3. Top 10 Words of @instagram Account
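The scoring in Eqs. (1)-(3) and the top-10 cut-off used for these results can be sketched as follows. This is a minimal sketch under assumptions rather than the authors' code: the function name `rank_keywords`, the use of the natural logarithm (the paper does not state the base), the alphabetical tie-breaking, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def rank_keywords(
    docs: Dict[str, List[str]], top_k: int = 10
) -> Dict[str, List[Tuple[str, float]]]:
    """Rank the terms of each document (username) by TF-IDF, Eqs. (1)-(3)."""
    n_docs = len(docs)  # |D|

    # |{d : t in d}|: number of documents in which each term appears.
    doc_freq: Counter = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    rankings = {}
    for name, tokens in docs.items():
        term_freq = Counter(tokens)  # Eq. (1): occurrences of t in d
        scores = {
            # Eq. (3) = TF * IDF, with IDF from Eq. (2) (natural log assumed).
            term: count * math.log(n_docs / (1 + doc_freq[term]))
            for term, count in term_freq.items()
        }
        # Highest score first; ties broken alphabetically for reproducibility.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        rankings[name] = ranked[:top_k]
    return rankings


# Hypothetical toy corpus: one token list per username (not the paper's data).
toy_docs = {
    "user_a": ["weekend", "weekend", "photo"],
    "user_b": ["concert", "photo"],
    "user_c": ["drive", "photo"],
    "user_d": ["drive", "photo"],
}
toy_ranking = rank_keywords(toy_docs, top_k=2)
```

In this toy corpus, 'weekend' ranks first for user_a with a score of 2·log(4/2) ≈ 1.386, while 'photo', which occurs in all four documents, scores log(4/5) < 0. With the 1 added to the denominator of Eq. (2), a term present in every document receives a slightly negative weight, which pushes corpus-wide words to the bottom of each ranking.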
Fig. 4. Top 10 Words of @taylorswift Instagram Account

Fig. 5. Top 10 Words of @cristiano Instagram Account

Looking more closely at the highest-ranked keyword of each username, it turns out that each became the highest for a different reason. The term 'weekend' is the highest-ranked keyword for @instagram because, at the time the data was crawled, @instagram was holding the Weekend Hashtag Project. The username @taylorswift has 'toronto' as her highest-ranked keyword because she had just shared several photos of her concert in Toronto, Canada. The term 'drive' is the highest-ranked keyword for @cristiano because he was endorsing his new sports drink product, named CR7Drive.

Figs. 6 and 7 illustrate the keyword rankings for the remaining 17 usernames, arranged from the highest rank to the lowest. For example, the highest-ranked keyword of @arianagrande is 'focus', followed by 'babes', 'andrea', and so on down to the lowest-ranked term, 'tulsa'. In addition, some of the terms in the results are not easily understood because they are slang terms (e.g. yall, j4, poo), usernames of other Instagram users (e.g. ronyalwin), numbers (mostly dates), or non-English words. This needs to be improved in future work.

Fig. 6. Result of Keyword Ranking for Remaining Usernames - part 1

Fig. 7. Result of Keyword Ranking for Remaining Usernames - part 2

V. CONCLUSION

A set of keywords with their rankings has been successfully revealed from the image captions of the top 20 most followed Instagram users. The proposed method, in which TF-IDF is implemented, is simple and effective in revealing the keywords of a given user and their rankings. The results show that the highest-ranked keyword is indeed the main topic of a user, as indicated by its TF-IDF value: the higher the TF-IDF value, the more relevant the keyword is to the specific Instagram username. However, this work still needs to be improved in terms of handling slang words and non-English language, adding keyword features based on image annotation, and so on.

REFERENCES

[1] J. Golbeck, Introduction to Social Media Investigation: A Hands-On Approach, 1st ed. Massachusetts: Syngress, 2015.
[2] Instagram, “Instagram 400,000,000,” 2015. [Online]. Available: https://instagram.com/p/78n-7MBQU8/
[3] Y. Hu, L. Manikonda, and S. Kambhampati, “What we Instagram: a first analysis of Instagram photo content and user types,” Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media, pp. 595–598, 2014.
[4] Z. McCune and J. Thompson, “Consumer Production in Social Media Networks: A Case Study of the Instagram iPhone App,” Ph.D. dissertation, University of Cambridge, 2011.
[5] T. H. Silva, P. O. S. V. D. Melo, J. M. Almeida, J. Salles, and A. A. F. Loureiro, “A picture of Instagram is worth more than a thousand words: Workload characterization and application,” Proceedings - IEEE International Conference on Distributed Computing in Sensor Systems, DCoSS 2013, no. i, pp. 123–132, 2013.
[6] Socialblade, “Top 100 Instagram Users by Followers,” 2015. [Online]. Available: http://socialblade.com/instagram/top/100/followers
[7] C. S. Araujo, L. P. D. Correa, A. P. C. D. Silva, R. O. Prates, and W. Meira, “It is Not Just a Picture: Revealing Some User Practices in Instagram,” 2014 9th Latin American Web Congress, no. May, pp. 19–23, 2014. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=7000167
[8] A. Kumar and T. M. Sebastian, “Sentiment Analysis on Twitter,” International Journal of Computer Science Issues, vol. 9, no. 4, pp. 372–378, 2012.
[9] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.
