The Impact of Preprocessing in Natural Language For Open Source Intelligence and Criminal Investigation
Abstract—Underground forums serve as gathering places for like-minded cybercriminals and are a continued threat to law and order. Law enforcement agencies can use Open-Source Intelligence (OSINT) to gather valuable information to proactively counter existing and new threats, for example by shifting a criminal investigation's focus onto cybercriminals with a large impact in underground forums and related criminal business models. This paper presents our study on text preprocessing requirements and document construction for the topic model algorithm Latent Dirichlet Allocation (LDA). We identify a set of preprocessing requirements based on a literature review and demonstrate them on a real-world forum, similar to those used by cybercriminals. Our results show that topic modelling processes need to follow a strict procedure to provide significant results that are useful in OSINT. Additionally, more reliable results are produced by tuning the hyper-parameters and the number of topics for LDA. We demonstrate improved results by iterative preprocessing to continuously improve the model, which provides more coherent and focused topics.

Index Terms—Digital forensics, Latent Dirichlet Allocation, reliability, document construction, underground marketplace, criminal investigation

The research leading to these results has received funding from the Research Council of Norway programme IKTPLUSS, under the R&D project 'Ars Forensica - Computational Forensics for Large-scale Fraud Detection, Crime Investigation & Prevention', grant agreement 248094/O70.

I. INTRODUCTION

OSINT exploits publicly available data such as pictures, video and text to piece together factual data – i.e. information – for an end goal. Two overlapping developments have particularly influenced the growth of OSINT: the expansion of social media and big data [13]. Social media is a good example of big data in practice, as vast amounts of user-produced videos and texts are uploaded to the Internet every day. Information gathered from open sources can give insights into world events; however, piecing together relevant data from the vast sea of material can be difficult. Furthermore, big data consists mostly of unstructured data, which traditional analytical tools are not built to handle.

Researchers frequently repeat the '80 per cent rule', which refers to the quantification of the open-source contribution to intelligence [10], [18]. It is difficult to estimate how much OSINT contributes to an intelligence operation, and the 80 per cent figure is generally considered a mischievous red herring [10]; however, it highlights that OSINT can offer significant value to proactive Cyber Threat Intelligence (CTI), informing organisations about threats they were not previously aware of [5], [17]. Consequently, data acquisition from OSINT is largely automated and can cause an increase in false positives [17]. In other words, the results of automated processes can have a negative effect on information reliability.

Law enforcement agencies have primarily used reactive approaches in criminal investigations for decades. New proactive approaches that utilise vast amounts of unstructured data can assist law enforcement agencies in preventing crime and upholding the law. Information is key to any criminal investigation [2], where information is constructed from data. However, correctly structuring, analysing and extracting useful knowledge or facts from unstructured data is a challenge. The goal is to gather sufficient information to accurately and adequately explain the circumstances of a situation or incident. Additionally, the reliability and validity of data can change with attributes of the data source and the methods used to process the data [2].

One goal of OSINT is to make sense of large amounts of unstructured data, e.g. by automatically analysing various discussion forums to understand new trends or the progression of malware development. Natural Language Processing (NLP) is used to process and analyse large amounts of natural language data, where LDA is one of the more popular algorithms. LDA is a generative statistical model, commonly used to categorise a set of observations (i.e. text) into unobserved groups that explain why some parts of the data are similar. LDA is described further in Section III.

Every algorithm, including LDA, is susceptible to the expression 'garbage in, garbage out'. In other words, results will be incorrect if the input is erroneous, regardless of the algorithm's accuracy. The way these LDA models are trained, and in particular how their inputs are preprocessed (if at all), is something we find missing in previous research. Therefore, our research concentrates on improving our current understanding of how best to construct documents as input for the LDA algorithm. We first briefly explain how the Machine Learning (ML) and forensic process models can be linked, and then we define which requirements must apply when using LDA in a digital forensic context. With these requirements in mind, we cross-validate three different document
Authorized licensed use limited to: NED UNIV OF ENGINEERING AND TECHNOLOGY. Downloaded on November 03,2024 at 10:33:23 UTC from IEEE Xplore. Restrictions apply.
construction methods for LDA and study them in detail on the Nulled dataset. We primarily focus on OSINT in the context of digital forensics, but the same holds for intelligence operations.

Recently, LDA has been widely studied from a digital forensic perspective. Anwar et al. [3] analyse authorship attribution for Urdu text; Porter [14] splits his dataset into time intervals to find the evolution of hacker tools and trends; Caines et al. [6] use ML and rule-based classifiers to automatically label post type and intent from posts in underground forums; Samtani et al. [15] designed a novel CTI framework to analyse and understand threats present in hacker communities; and L'huillier et al. [12] combine text mining and social network analysis to extract key members from dark web forums.

Text preprocessing varies widely in these studies, e.g. grammatical mistakes and word preferences are relevant in authorship attribution [3], and hacker forums contain atypical language [14]. The studies also have a few issues, such as using Google Translate to convert text into English [15] or not checking model fit [12]. Additionally, they frequently do not describe how they structure the LDA input.

This article is structured in the following way: Section II describes previous work relevant to our research, linking the ML and forensic process models and defining LDA preprocessing requirements; Sections III and IV report the preprocessing of the data, define the LDA document construction and provide the results of our real-world scenario demonstration. We discuss the significance of our results and give a recapitulation of this article in Section V.

II. PREVIOUS WORK

Data preprocessing is an integral step from the perspective of the ML process model – as described by Kononenko and Kukar [11] – where data quality directly affects the ability of ML models to learn. Furthermore, a survey by CrowdFlower [8] found that 60 per cent of professionals spend much of their time cleaning and organising data. The same emphasis on data quality also holds for digital forensics. Andersen [2] details the digital forensic process in relation to criminal cases. He points out that information is crucial, and that it must be reliable to have any value in a court of law. It is beyond this article to make a complete comparison of both process models, but there is a mutual understanding in both domains that preprocessing is the most crucial step. Data preprocessing is a time-consuming step that consolidates and structures data to improve the accuracy of results.

Both the user of a system and the system itself have some requirements for it to be accurate and precise, i.e. reliable. We focus our requirements on the user's perspective: what they need to do to adeptly use the system, such as LDA in a digital forensic context. Text analysis typically begins with preprocessing the input data, but the related literature varies widely with regard to which preprocessing methods are used. Requirements should improve the algorithm's ability to identify interesting or important patterns in the data, instead of noise. The following list comprises some common recommendations for cleaning the data [9], [14].

• Word normalisation: Inflected languages modify words to express different grammatical categories. Stemming and lemmatisation are two methods to normalise text, as they help find the root form of words. Stemming removes suffixes or prefixes from a word, without considering whether the resulting word belongs to the language. Lemmatisation reduces inflected words properly, ensuring that the root word belongs to the language.
• Stop word removal: The most common words in a language tend to be over-represented in the result unless removed, and they do not carry any important significance. However, removing stop words indiscriminately means you can accidentally filter out important data.
• Uninformative word removal: Similar to stop word removal, but using a domain-specific list of uninformative words. The list can be quite long and depends on the domain producing the text in question.
• Word length removal: Remove words that have fewer than x (e.g. three) characters.
• Document de-duplication: Eliminate duplicate copies of repeated data, i.e. remove identical documents that appear frequently.
• Expanding/replacing acronyms: Acronyms are used quite often and may need some subject matter expertise to understand.
• Other: Convert everything to lowercase and remove punctuation marks/special symbols. Finally, remove extra white-spaces.

Requirements which reduce the vocabulary size have clear advantages for quality. For example, removing stop words leaves terms that convey clearly topic-specific semantic content. Schofield et al. [16] examined some of the common practices we have listed and found that many have either no effect or a negative effect. For example: i) the effects of document duplication were minimal until duplicates made up a substantial proportion of the corpus; ii) stop word removal (determiners, conjunctions and prepositions) can improve model fit and quality; and iii) stemming methods perform worse.

III. METHODOLOGY

There are several topic modelling algorithms [1]; however, we selected LDA because it is typically more effective and generalises better than other algorithms. This is beneficial as our proposed method may generalise to more specific domains, such as those of underground forums. Furthermore, LDA can extract human-interpretable topics from a document corpus, where each topic is characterised by the words it is most associated with. LDA [4] is a way of 'soft clustering' using a set of documents and a pre-defined number k of topics. Each document has some probability of belonging to several topics, which allows for a nuanced way of categorising documents.

The three hyper-parameters k, α and η adjust the LDA learning, where k is the predefined number of topics and α
and η regulate two Dirichlet distributions. These Dirichlet distributions adjust the LDA model's document-topic density and topic-word density, respectively. More specifically, LDA models assume documents consist of fewer topics at low α values, while at higher α values documents can consist of more than one topic. Higher values will likely produce a more uniform distribution, so a document will have an even mixture of all the topics. The hyper-parameter η works similarly, but adjusts the word distribution per topic. Thus, topics consist of fewer words at low η values and more words at higher values. LDA is most commonly used to i) shrink a large corpus of text to some sequence of keywords, ii) reduce the task of clustering or searching a huge number of documents, iii) summarise a large collection of text or iv) automatically tag new incoming text with the learned topics.
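As an illustration of how these hyper-parameters are exposed in practice, the following sketch uses the Scikit-learn implementation (the library used in our experiments), where doc_topic_prior corresponds to α and topic_word_prior to η. The three-document corpus is purely illustrative:

```python
# Sketch: mapping k, alpha and eta onto Scikit-learn's LDA implementation.
# The corpus below is a stand-in; any list of documents works the same way.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "selling cracked accounts and fresh proxy lists",
    "new proxy list updated daily with working ports",
    "free game accounts giveaway reply to unlock",
]

# Bag-of-words counts are the usual input representation for LDA.
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,         # k: number of latent topics
    doc_topic_prior=0.05,   # alpha: low -> documents dominated by few topics
    topic_word_prior=0.05,  # eta: low -> topics dominated by few words
    random_state=0,
)
doc_topics = lda.fit_transform(counts)  # rows sum to 1: topic mixture per document
print(doc_topics.shape)  # (3, 2)
```

Setting doc_topic_prior or topic_word_prior to None makes Scikit-learn default them to 1/k, matching the 'inferred from the data' setting used later in our experiments.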
We use the previously mentioned requirements and preprocessing recommendations from Schofield et al. [16], such as removing about 700 of the most common English stop words. Following their recommendations, we decided not to remove duplicated documents nor to use stemming, as this was reported to have little effect. We removed additional text such as HTML tags (incl. their attributes), HTML entities, symbols and extra spaces. Finally, we removed all rows with an empty text field and converted everything to lower-case characters.
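A minimal sketch of these cleaning steps might look as follows; the regular expressions are one possible realisation, and the tiny stop-word set is an illustrative stand-in for the roughly 700-word list we used:

```python
# Sketch of the cleaning steps described above (hypothetical helper; the
# stop-word set is a tiny stand-in for the ~700-word list).
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "it"}

def clean_post(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)               # HTML tags incl. attributes
    text = re.sub(r"&[#\w]+;", " ", text)             # HTML entities, e.g. &nbsp;
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # symbols, lower-casing
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)                           # collapses extra whitespace

print(clean_post('<a href="x">Click</a> the &nbsp; LINK to win!!!'))
# -> "click link win"
```

Rows whose cleaned text is empty would then be dropped before document construction.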
Users can write public posts to communicate with other forum users. These posts come in two kinds: a subject is started by an initial post from a user, while other users can reply to subjects with their own posts. There are zero or more replies associated with each subject. Figure 1 illustrates this type of interaction between users, where each user is depicted with a different colour.
We based our document construction methods on the criterion of including all available posts found on the forum, and ended up identifying three distinct approaches that we named A, B and C. Figure 1 also portrays these document construction methods, where A is subject-centred, B is subject-user-centred and C is user-centred. Other construction approaches than those shown in Figure 1 could be created and would yield different results. However, we decided not to consider them further, as they would suffer too much information loss from ignoring many posts.
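Assuming a posts table shaped like the forum data (the column names and toy rows below are illustrative), the three constructions can be sketched as Pandas group-by operations:

```python
# Sketch of the three document constructions on a hypothetical posts table
# with columns 'topic_id' (subject), 'author_id' (user) and 'post' (text).
import pandas as pd

posts = pd.DataFrame({
    "topic_id":  [1, 1, 1, 2, 2],
    "author_id": [10, 20, 10, 20, 30],
    "post": ["selling accounts", "how much", "pm me", "free proxies", "thanks"],
})

# A: subject-centred -- one document per subject (starter plus all replies)
doc_a = posts.groupby("topic_id")["post"].agg(" ".join)

# B: subject-user-centred -- one document per user within each subject
doc_b = posts.groupby(["topic_id", "author_id"])["post"].agg(" ".join)

# C: user-centred -- one document per user across the whole forum
doc_c = posts.groupby("author_id")["post"].agg(" ".join)

print(len(doc_a), len(doc_b), len(doc_c))  # 2 4 3
```

The resulting text series would then be vectorised and fed to LDA as three separate corpora.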
Construction A keeps the original subject structure found on the forum. In other words, one document is the combination of the subject starter and all its replies. Construction B builds upon this idea of being subject-centred; however, this approach separates the posts of each user within a subject into separate documents. Finally, construction C combines all posts of each distinct user into a separate document; i.e. one document consists of all posts written by a specific user.

Fig. 1. Document construction approaches analysed in this article. Construction A is subject-centred; construction B is subject-user-centred; while construction C is user-centred. Unique users are marked with different colours.

The motivation for construction A is to capture the overall activity on the forum, to get a high-level overview of the topics users are talking about. However, combining all posts from various users per subject might obscure the result. Therefore, we designed construction B to be subject- and user-centred, as this could produce a more accurate result. Construction C, in turn, is user-centred and should capture more of the interests of forum users.

The number of latent topics, k, is a parameter we have to set in LDA models; we explore k = 10, 20, 30, 40, 50 and 60 in our experiment. The other parameters α and η are either inferred from the data (1/k when they are set to None) or set to
the values 0.05, 0.1, 0.5, 1, 5 and 10.

Finally, we have to evaluate the model quality after the unsupervised learning process. We use k-fold cross-validation to assess how well the LDA models generalise to an independent data set. For each analysis, we split the data into five folds: each fold is used for training the LDA model four times and for testing it once. We use perplexity to objectively measure how well our model predicts the testing fold, where a low perplexity score indicates a better model. Furthermore, we use the mean perplexity (i.e. the arithmetic mean over the folds) to compare all 882 (k × α × η combinations) models against each other. We select the models with the lowest perplexity for further manual inspection.

IV. EXPERIMENT AND RESULTS

We explicitly concentrate on data preprocessing in this research article, with LDA document construction at its centre. It is, therefore, out of our scope to focus on the data gathering process, such as running web scraping tools to extract OSINT from real-world underground forums. Instead, we use a dataset of 'Nulled' that was leaked in May 2016. Nulled is a hacker forum on the deep web that facilitates the brokering of compromised passwords, stolen bitcoins and other sensitive data. Nulled's Structured Query Language (SQL) database was leaked in its original form, without any filtration or preprocessing. The database contained details of 599 085 user accounts, 800 593 private messages and 3 495 596 public messages. We imported it into a MySQL server and exported the necessary information from tables and fields with a Python script, using the Pandas package. More specifically, we stored the information found in the database tables 'topics' and 'posts' (columns: 'author_id', 'post', 'topic_id') in a file for further analysis.

TABLE I
TEN BEST MODELS WITH HYPER-PARAMETER COMBINATIONS

Construction A
#   α     η     k   Perplexity
1   0.05  0.05  10  5855.00
2   0.10  0.05  10  5886.47
3   None  0.05  10  5960.00
4   0.50  0.05  10  6035.86
5   0.05  0.10  10  6279.13
6   1.00  0.05  10  6299.63
7   None  None  10  6325.16
8   0.10  None  10  6354.32
9   0.10  0.10  10  6354.98
10  0.50  0.10  10  6476.63

Construction B
#   α     η     k   Perplexity
1   None  0.05  10  7088.24
2   0.10  0.05  10  7133.69
3   0.50  0.05  10  7133.89
4   0.05  0.05  10  7268.40
5   1.00  0.05  10  7484.53
6   0.05  None  10  7763.43
7   None  None  10  7768.09
8   0.05  0.10  10  7870.99
9   None  0.10  10  7877.45
10  0.50  None  10  7937.83

Construction C
#   α     η     k   Perplexity
1   None  0.05  10  8111.34
2   0.05  0.05  10  8276.60
3   0.10  0.05  10  8344.80
4   0.50  0.05  10  8492.75
5   0.10  0.10  10  8687.27
6   None  None  10  8785.00
7   0.50  None  10  8865.91
8   0.05  None  10  8889.34
9   None  0.10  10  8930.34
10  1.00  0.05  10  8947.48
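The five-fold perplexity search described in Section III might be sketched as follows; the toy corpus and the reduced parameter grid stand in for our data and the full 294-combination grid:

```python
# Minimal version of the model selection: five-fold cross-validation,
# scoring each (k, alpha, eta) combination by its mean perplexity on the
# held-out fold (lower is better). Corpus and grid are illustrative.
from itertools import product
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold

docs = ["account crack free", "proxy list port", "account free giveaway",
        "port scan proxy", "crack tool download", "download free tool"] * 5
X = CountVectorizer().fit_transform(docs)

results = []
for k, alpha, eta in product([2, 3], [0.05, None], [0.05, None]):
    fold_scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                     random_state=0).split(docs):
        lda = LatentDirichletAllocation(n_components=k, doc_topic_prior=alpha,
                                        topic_word_prior=eta, random_state=0)
        lda.fit(X[train_idx])
        fold_scores.append(lda.perplexity(X[test_idx]))  # held-out perplexity
    results.append((np.mean(fold_scores), k, alpha, eta))

best = min(results, key=lambda r: r[0])  # lowest mean perplexity wins
print(best)
```

The best-scoring combinations would then be refit on the full corpus for manual inspection of their topics.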
We used Pandas to group the three construction methods
following the design described in Section III and depicted
in Figure 1. The text column 'post' was further processed (as described in Section III) to make it suitable for LDA and document generation. We fit the LDA implementation from the Scikit-learn package for all possible parameter combinations. All three document construction approaches were analysed using 294 distinct combinations of LDA hyper-parameters. We ran a total of 882 (294 × 3) LDA analyses to find the optimal combination of parameters. Table I shows the ten best models with the lowest perplexity.

Interestingly, our best results had very low hyper-parameter values and 10 topics, while Samtani et al. [15] found an optimal topic number ranging between 80 and 100. More importantly, Chang et al. [7] found that perplexity is not strongly correlated with human interpretation, as the most frequent words in topics usually do not describe a coherent idea for those topics. A human forensic analyst would find fewer topics easier to interpret and understand than 80 to 100 topics. However, fewer topics with a low perplexity score are not guaranteed to be easier for a human analyst to interpret. An important note is that low hyper-parameter values also result in a slower convergence rate. While this solution might not be suitable for a time-critical criminal investigation, it could be applied to proactive OSINT gathering.

Table II shows the five most frequent words from each topic, from the three best models, which were manually inspected. These topics are not sorted in any particular order. Some words appear in multiple topics, such as hide, color, http/https and numbers, which do not provide any meaningful interpretation of a topic. For example, 'hide' is a tag in the BBCode lightweight markup language, commonly used to format posts on many message boards. It is frequently used to withhold information until a visitor creates a user account on the forum and gains privileges to view the hidden content.

The various document construction methods (as seen in Table II) do not show much variance in the identified keywords. The main difference was the number of documents that the LDA could learn from. Document construction A has 120 875 documents, B contains 2 794 304, and C has 272 023. Although document construction B had 2 212 per cent more documents to learn from than method A, it did not produce significantly different results. Thus, it can be recommended to go with the two other document
TABLE II
FIVE MOST FREQUENT WORDS FOR TOPICS

# Construction A
1 account, good, help, time, accounts
2 80, 8080, 120, 195, 3128

analyses. Table III shows that the perplexity increases over the iterative preprocessing steps.

TABLE III
ITERATIVE TEN BEST MODELS WITH HYPER-PARAMETER COMBINATIONS
TABLE IV
ITERATIVE FIVE MOST FREQUENT WORDS FOR TOPICS

# Construction A
1 account, file, bot, download, link
2 comcast, music, song, sbcglobal, rr
3 game, origin, sims, email, github
4 capture, type, key, unit, local
5 mail, password, username, unknown, user
6 member, wp, pro, stealer, clean
7 game, play, watch, best, good
8 script, update, enemy, auto, download
9 account, bol, legend, help, crack
10 thx, nice, share, test, man

# Construction B
1 help, crack, link, guy, bol
2 share, check, skin, gg, account
3 download, dude, bot, update, version
4 bro, great, watch, rep, hello
5 thx, nice, test, hope, wow
6 file, tnx, download, gonna, password
7 wub, member, god, omg, gj
8 tks, cool, awesome, wp, tyty
9 good, script, mate, love, best
10 account, man, kappa, ban, lot

# Construction C
1 nice, bro, tnx, tyy, gg
2 tks, ea, member, mail, info
3 account, game, link, crack, free
4 script, bol, update, download, game
5 thx, man, share, nice, good
6 file, download, bot, version, update
7 clean, stealer, rat, crypter, password
8 capture, account, member, gmx, key
9 wp, thnx, pro, unit, local
10 unknown, user, creed, assassin, unite

of less technically skilled cyber criminals. Additional steps of removing unnecessary and less informative words may result in highlighting more skilled cyber criminals.

V. CONCLUSION

Cybercrime continues to be a threat to our economy and the general sense of justice. Law enforcement agencies can exploit OSINT to gather proactive CTI, which might make them more effective in combating cybercriminals. The challenge of OSINT stems from large amounts of unstructured data, which may result in unreliable information from automated processes. Our research shows that automated algorithms such as LDA must follow a set of requirements to reduce the vocabulary size and improve quality. We recommend repeated preprocessing steps, e.g. continuously removing common words, until the result contains coherent and clear topics. Data cleaning is invariably an iterative process, as there are always problems that are overlooked the first time around.

Contemporary related research mostly focuses on using topic modelling to get a quick overview of a large number of documents. This article tries to close the reliability gap of automated processes to make them applicable in digital forensic contexts. We identified three distinct ways users' posts can be constructed into documents, each approach focused on a different aspect: subject-centred, subject-user-centred and user-centred. While they did not produce significantly different results in keywords between topics, our results show that more documents do not necessarily improve the quality of the topics.

Data is key to piecing together any criminal investigation, and more research is needed to further improve the reliability of automated processes and algorithms. Small changes in the input can produce an unreliable output, which forensic analysts can in turn misinterpret. Thus, we need to move beyond contemporary research's focus on using LDA to produce a general overview of a large corpus of text, for example by applying the techniques described in this article to real-world dark web underground forums. Furthermore, we need to design reliable and automated processes suitable for a digital forensic context, for example to distinguish between individuals who produce advanced tools for cybercrime and those who are simply consumers of such tools. Finally, research similar to Chang et al. [7] should be conducted to analyse human-understandable topics and evaluation metrics (e.g. perplexity) in a digital forensic context.

REFERENCES

[1] Rubayyi Alghamdi and Khalid Alfalqi. A Survey of Topic Modeling in Text Mining. International Journal of Advanced Computer Science and Applications, 6(1), 2015.
[2] Stig Andersen. Technical Report: A Preliminary Process Model for Investigation. Preprint, SocArXiv, May 2019.
[3] Waheed Anwar, Imran Sarwar Bajwa, M. Abbas Choudhary, and Shabana Ramzan. An Empirical Study on Forensic Analysis of Urdu Text Using LDA-Based Authorship Attribution. IEEE Access, 7:3224–3234, 2019.
[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] Matt Bromiley. Threat Intelligence: What It Is, and How to Use It Effectively, 2016.
[6] Andrew Caines, Sergio Pastrana, Alice Hutchings, and Paula J. Buttery. Automatically identifying the function and intent of posts in underground forums. Crime Science, 7(1):19, December 2018.
[7] Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei. Reading Tea Leaves: How Humans Interpret Topic Models. In Advances in Neural Information Processing Systems, 2009.
[8] CrowdFlower. Data Science Report. URL: https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf, 2016.
[9] Isuf Deliu, Carl Leichter, and Katrin Franke. Extracting cyber threat intelligence from hacker forums: Support vector machines versus convolutional neural networks. In 2017 IEEE International Conference on Big Data (Big Data), pages 3648–3656, Boston, MA, December 2017. IEEE.
[10] R. Dover, M. S. Goodman, and C. Hillebrand. Routledge Companion to Intelligence Studies. Routledge Companions. Taylor & Francis, 2013.
[11] I. Kononenko and M. Kukar. Machine Learning and Data Mining. Elsevier Science, 2007.
[12] Gaston L'Huillier, Hector Alvarez, Sebastián A. Ríos, and Felipe Aguilera. Topic-based social network analysis for virtual communities of interests in the Dark Web. 2011.
[13] Matthew Moran. Big data brings new power to open-source intelligence, 2014.
[14] Kyle Porter. Analyzing the DarkNetMarkets subreddit for evolutions of tools and trends using LDA topic modeling. Digital Investigation, 26:S87–S97, July 2018.
[15] Sagar Samtani, Ryan Chinn, Hsinchun Chen, and Jay F. Nunamaker. Exploring Emerging Hacker Assets and Key Hackers for Proactive Cyber Threat Intelligence. Journal of Management Information Systems, 34(4):1023–1053, 2017.
[16] Alexandra Schofield, Måns Magnusson, Laure Thompson, and David Mimno. Understanding Text Pre-Processing for Latent Dirichlet Allocation. 2017.
[17] Ryan Williams, Sagar Samtani, Mark Patton, and Hsinchun Chen. Incremental Hacker Forum Exploit Collection and Classification for Proactive Cyber Threat Intelligence: An Exploratory Study. In 2018 IEEE International Conference on Intelligence and Security Informatics (ISI), pages 94–99, Miami, FL, November 2018. IEEE.
[18] Hamid Akın Ünver. Digital Open Source Intelligence and International Security: A Primer. 2018.