TF-IDF Method in Ranking Keywords of Instagram Users' Image Captions
1) Image caption character limit: Captions on a photo and subsequent comments are each capped at 2200 characters. A user may also post without writing a caption at all.
2) Hashtag limit: A caption may contain at most 30 hashtags.
3) Symbol characters: Some users write the symbol characters provided by the smartphone keyboard.
4) Writing technique: Because Instagram allows up to 2200 characters, nonstandard spelling and cyber slang appear less often in image captions than in tweets on Twitter.
5) Availability: The amount of data available is extremely large. According to the Instagram official release in September 2015, 80 million photos are uploaded daily. The Instagram API facilitates the collection of image captions as well as the URL of each image.
6) Topics: Instagram users post photos and videos on a wide variety of topics. Previous research observed eight main categories of photos posted on Instagram: friends, food, gadget, captioned photo, pet, activity, selfie, and fashion [3].
7) Weekend peak: Users tend to post photos and videos during weekends and at the end of the day [7].

The next module is the preprocessing module. It passes the important words through and filters out irrelevant words and characters in each document. The first preprocessing step is the removal of HTML and symbol characters. This matters because users commonly write symbol characters that carry little meaning or are not keywords. The second step is the removal of punctuation, #tags, @tags, and stopwords. The main goal of this step is to retrieve essential words and to eliminate words that have little significance for the documents, such as "the", "is", "are", "an", "of", "to", etc. It also reduces the size of the index file, improving efficiency and effectiveness. The third step is standardizing words. For example, a user sometimes writes 'go hooooome'; the output of this step is 'go home'. The last preprocessing step is URL removal, since a URL does not help to reveal the keywords.

Once the data has been preprocessed, the ranking process is applied. The first step of this module is tokenization, which breaks the text up into words or other meaningful elements called tokens. The tokens, commonly referred to as terms, are then used to form the vector space model.
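The preprocessing and tokenization steps described above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function names `preprocess` and `tokenize`, the small stopword set, and the regular expressions are all assumptions. It also assumes that whole #tags and @tags are dropped (not just the leading symbol), that runs of three or more repeated letters collapse to a single letter, and that only ASCII letters are kept. URLs are removed before punctuation here, earlier than in the order given in the text, because stripping punctuation first would break a URL into stray words.

```python
import html
import re

# Tiny illustrative stopword set (an assumption; the source does not list its stopwords).
STOPWORDS = {"the", "is", "are", "an", "of", "to", "a", "and", "in"}


def preprocess(caption):
    """Return a cleaned, lowercased caption with irrelevant words and characters removed."""
    text = html.unescape(caption)                       # decode HTML entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)                # drop HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs before punctuation is stripped
    text = re.sub(r"[#@]\w+", " ", text)                # drop whole #tags and @tags (assumption)
    text = re.sub(r"[^A-Za-z\s]", " ", text)            # drop punctuation, digits, symbols, emoji
    text = text.lower()
    text = re.sub(r"([a-z])\1{2,}", r"\1", text)        # crude standardization: 'hooooome' -> 'home'
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)


def tokenize(text):
    """Split preprocessed text into tokens (terms) on whitespace."""
    return text.split()
```

The letter-collapsing rule is deliberately crude: it reproduces the 'go hooooome' example but would also turn 'goood' into 'god', so a dictionary-based check would be needed in practice.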
Fig. 3. Top 10 Words of @instagram Account
\[
fr(x, t) =
\begin{cases}
1, & \text{if } x = t \\
0, & \text{otherwise}
\end{cases}
\]
Hence, TF(t, d) returns how many times the term t is
present in the document d.
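A short Python sketch of this counting, assuming that TF(t, d) is obtained by summing fr(x, t) over every token x of the document d (the statement above implies this, but the sum itself is an assumption here). Representing a document as a list of tokens is also an assumption.

```python
def fr(x, t):
    """Indicator function: 1 if token x equals term t, 0 otherwise."""
    return 1 if x == t else 0


def tf(t, d):
    """Term frequency: number of times term t occurs in document d (a list of tokens)."""
    return sum(fr(x, t) for x in d)
```

For example, `tf("home", ["go", "home", "sweet", "home"])` counts two occurrences of "home".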
2) IDF is defined with the following formula:
\[
IDF(t) = \log \frac{|D|}{1 + |\{d : t \in d\}|} \qquad (2)
\]
where |D| is the total number of documents and |{d : t ∈ d}| is the number of documents in which the term t appears, that is, the documents for which TF(t, d) ≠ 0. The 1 is added to the denominator only to avoid division by zero.
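The IDF formula can be sketched in Python as below. The function name `idf`, the list-of-token-lists corpus representation, and the use of the natural logarithm are assumptions; the text does not state the base of the logarithm.

```python
import math


def idf(t, docs):
    """Inverse document frequency of term t over docs, a list of token lists."""
    n_containing = sum(1 for d in docs if t in d)  # |{d : t in d}|
    return math.log(len(docs) / (1 + n_containing))  # the +1 avoids division by zero
```

One consequence of adding 1 to the denominator is that a term appearing in every document receives a slightly negative value, and a term appearing in all but one document receives exactly zero.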
3) TF-IDF formula is defined as follows:
\[
TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t) \qquad (3)
\]
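A minimal Python sketch of keyword ranking, assuming the conventional definition in which the TF-IDF score of a term is the product of TF(t, d) and IDF(t), with IDF computed as in formula (2). The function names `tf_idf_scores` and `top_keywords`, the natural logarithm, and ranking terms within a single caption are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(doc, docs):
    """Map each term of doc (a list of tokens) to its TF-IDF score over the corpus docs."""
    counts = Counter(doc)  # TF(t, d) for every term in the document
    n_docs = len(docs)
    scores = {}
    for term, count in counts.items():
        n_containing = sum(1 for d in docs if term in d)
        scores[term] = count * math.log(n_docs / (1 + n_containing))
    return scores


def top_keywords(doc, docs, k=10):
    """Return the k highest-scoring terms of doc, breaking ties alphabetically."""
    scores = tf_idf_scores(doc, docs)
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]
```

Terms that occur often in one caption but in few other captions rise to the top, while terms spread across most captions score near zero or below.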
REFERENCES