The Impact of Preprocessing in Natural Language For Open Source Intelligence and Criminal Investigation
Abstract—Underground forums serve as gathering places for like-minded cybercriminals and are a continued threat to law and order. Law enforcement agencies can use Open-Source Intelligence (OSINT) to gather valuable information to proactively counter existing and new threats, for example by shifting a criminal investigation's focus onto cybercriminals with a large impact in underground forums and related criminal business models. This paper presents our study on text preprocessing requirements and document construction for the topic model algorithm Latent Dirichlet Allocation (LDA). We identify a set of preprocessing requirements based on a literature review and demonstrate them on a real-world forum, similar to those used by cybercriminals. Our results show that topic modelling processes need to follow a strict procedure to provide significant results that are useful in OSINT. Additionally, more reliable results are produced by tuning the hyper-parameters and the number of topics for LDA. We demonstrate improved results by iterative preprocessing to continuously improve the model, which provides more coherent and focused topics.

Index Terms—Digital forensics, Latent Dirichlet Allocation, reliability, document construction, underground marketplace, criminal investigation

The research leading to these results has received funding from the Research Council of Norway programme IKTPLUSS, under the R&D project 'Ars Forensica - Computational Forensics for Large-scale Fraud Detection, Crime Investigation & Prevention', grant agreement 248094/O70.

I. INTRODUCTION

OSINT exploits publicly available data such as pictures, video and text to piece together factual data – i.e. information – for an end goal. Two overlapping developments have particularly influenced the growth of OSINT: the expansion of social media and big data [13]. Social media is a good example of big data in practice, as vast amounts of user-produced videos and texts are uploaded to the Internet every day. Information gathered from open sources can give insights into world events; however, piecing together relevant data from the vast sea of material can be difficult. Furthermore, big data consists mostly of unstructured data, which traditional analytical tools are not built to handle.

Researchers frequently repeat the '80 per cent rule', which refers to the quantification of the open-source contribution to intelligence [10], [18]. It is difficult to estimate how much OSINT contributes to an intelligence operation, and the 80 per cent figure is generally considered a mischievous red herring [10]; however, it highlights that OSINT can offer significant value to proactive Cyber Threat Intelligence (CTI), informing organisations about threats they were not previously aware of [5], [17]. Consequently, data acquisition from OSINT is largely automated and can cause an increase in false positives [17]. In other words, the results of automated processes can have a negative effect on information reliability.

Law enforcement agencies have primarily used reactive approaches in criminal investigations for decades. New proactive approaches that utilise vast amounts of unstructured data can assist law enforcement agencies in preventing crime and upholding the law. Information is key to any criminal investigation [2], where information is constructed from data. However, correctly structuring, analysing and extracting useful knowledge or facts from unstructured data is a challenge. The goal is to gather sufficient information to accurately and adequately explain the circumstances of a situation or incident. Additionally, the reliability and validity of data can change with attributes of the data source and the methods used to process the data [2].

One goal of OSINT is to make sense of large amounts of unstructured data, e.g. by automatically analysing various discussion forums to understand new trends or the progression of malware development. Natural Language Processing (NLP) is used to process and analyse large amounts of natural language data, where LDA is one of the more popular algorithms. LDA is a generative statistical model, commonly used to categorise a set of observations (i.e. text) into unobserved groups that explain why some parts of the data are similar. LDA is described further in Section III.

Every algorithm, including LDA, is susceptible to the expression 'garbage in, garbage out'. In other words, results will be incorrect if the input is erroneous, regardless of the algorithm's accuracy. The way these LDA models are trained, and in particular how their inputs are preprocessed (if at all), is something we find missing in previous research. Therefore, our research concentrates on improving our current understanding of how best to construct documents as input for the LDA algorithm. We first briefly explain how the Machine Learning (ML) and forensic process models can be linked, and then we define which requirements must apply when using LDA in a digital forensic context. With these requirements in mind, we cross-validate three different document
Authorized licensed use limited to: NED UNIV OF ENGINEERING AND TECHNOLOGY. Downloaded on November 03,2024 at 10:33:23 UTC from IEEE Xplore. Restrictions apply.
construction methods for LDA and study them in detail on the Nulled dataset. We primarily focus on OSINT in the context of digital forensics, but the same holds for intelligence operations.

Recently, LDA has been widely studied from a digital forensic perspective. Anwar et al. [3] analyse authorship attribution for Urdu text; Porter [14] splits his dataset into time intervals to find the evolution of hacker tools and trends; Caines et al. [6] use ML and rule-based classifiers to automatically label post type and intent from posts in underground forums; Samtani et al. [15] designed a novel CTI framework to analyse and understand threats present in hacker communities; and L'huillier et al. [12] combine text mining and social network analysis to extract key members from dark web forums.

Text preprocessing varies widely in these studies, e.g. grammatical mistakes and word preferences are relevant in authorship attribution [3], and hacker forums contain atypical language [14]. The studies also have a few issues, such as using Google Translate to convert text into English [15] or not checking model fit [12]. Additionally, they frequently do not describe how they structure the LDA input.

This article is structured in the following way: Section II describes previous work relevant to our research, linking the ML and forensic process models and defining LDA preprocessing requirements; Sections III and IV report the preprocessing of the data, define the LDA document construction and provide the results of our real-world scenario demonstration. We discuss the significance of our results and give a recapitulation of this article in Section V.

II. PREVIOUS WORK

Data preprocessing is an integral step from the perspective of the ML process model – as described by Kononenko and Kukar [11] – where data quality directly affects the ability of ML models to learn. Furthermore, a survey by CrowdFlower [8] found that 60 per cent of professionals spend much of their time cleaning and organising data. The same emphasis on data quality also holds for digital forensics. Andersen [2] details the digital forensic process in relation to criminal cases. He points out that information is crucial, and that it must be reliable to have any value in a court of law. It is beyond this article to make a complete comparison of both process models, but there is a mutual understanding in both domains that preprocessing is the most crucial step. Data preprocessing is a time-consuming step that consolidates and structures data to improve the accuracy of results.

Both the user of a system and the system itself have some requirements for it to be accurate and precise, i.e. reliable. We focus our requirements on the user's perspective: what they need to do to adeptly use the system, such as LDA in a digital forensic context. Text analysis typically begins with preprocessing the input data, but the related literature varies widely with regard to which preprocessing methods are used. Requirements should improve the algorithm's ability to identify interesting or important patterns in the data, instead of noise. The following list comprises some common recommendations for cleaning the data [9], [14].

• Word normalisation: Inflected languages modify words to express different grammatical categories. Stemming and lemmatisation are two methods to normalise text, as they help find the root form of words. Stemming removes suffixes or prefixes from a word, without considering whether the resulting word belongs to the language. Lemmatisation reduces inflected words properly, ensuring that the root word belongs to the language.
• Stop word removal: The most common words in a language tend to be over-represented in the result unless removed, and they do not carry any important significance. However, removing stop words indiscriminately means you can accidentally filter out important data.
• Uninformative word removal: Similar to stop word removal, but using a domain-specific list of uninformative words. The list can be quite long and depends on the domain producing the text in question.
• Word length removal: Remove words that have fewer than x (e.g. three) characters.
• Document de-duplication: Eliminate duplicate copies of repeated data, i.e. remove identical documents that appear frequently.
• Expanding/replacing acronyms: Acronyms are used quite often and may need some subject matter expertise to understand.
• Other: Convert everything to lowercase and remove punctuation marks/special symbols. Finally, remove extra white-spaces.

Requirements which reduce the vocabulary size have clear advantages for quality. For example, removing stop words leaves terms that convey clearly topic-specific semantic content. Schofield et al. [16] examined some of the common practices we have listed and found that many have either no effect or a negative effect. For example: i) the effects of document duplication were minimal until duplicates made up a substantial proportion of the corpus; ii) stop word removal (determiners, conjunctions and prepositions) can improve model fit and quality; and iii) stemming methods perform worse.

III. METHODOLOGY

There are several topic modelling algorithms [1]; however, we selected LDA because it is typically more effective and generalises better than other algorithms. This is beneficial as our proposed method may generalise to more specific domains, such as those of underground forums. Furthermore, LDA can extract human-interpretable topics from a document corpus, where each topic is characterised by the words it is most associated with. LDA [4] is a way of 'soft clustering' using a set of documents and a pre-defined number k of topics. Each document has some probability of belonging to several topics, which allows for a nuanced way of categorising documents.

The three hyper-parameters k, α and η adjust the LDA learning, where k is the predefined number of topics and α
and η regulate two Dirichlet distributions. These Dirichlet distributions adjust the LDA model's document-topic density and topic-word density, respectively. More specifically, LDA models assume documents consist of fewer topics at low α values, while at higher α values documents can consist of more than one topic. Higher values will likely produce a more uniform distribution, so a document will have an even mixture of all the topics. The hyper-parameter η works similarly, but adjusts the word distribution per topic. Thus, topics consist of fewer words at low η values and more words at higher values. LDA is most commonly used to i) shrink a large corpus of text to some sequence of keywords, ii) reduce the task of clustering or searching a huge number of documents, iii) summarise a large collection of text or iv) automatically tag new incoming text with the learned topics.
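As an illustration of how these hyper-parameters are exposed in practice, the following sketch uses the Scikit-learn implementation (the library used in our experiments), where doc_topic_prior corresponds to α and topic_word_prior to η. The three-document corpus is purely illustrative:

```python
# Sketch: mapping k, alpha and eta onto Scikit-learn's LDA implementation.
# The corpus below is a stand-in; any list of documents works the same way.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "selling cracked accounts and fresh proxy lists",
    "new proxy list updated daily with working ports",
    "free game accounts giveaway reply to unlock",
]

# Bag-of-words counts are the usual input representation for LDA.
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,         # k: number of latent topics
    doc_topic_prior=0.05,   # alpha: low -> documents dominated by few topics
    topic_word_prior=0.05,  # eta: low -> topics dominated by few words
    random_state=0,
)
doc_topics = lda.fit_transform(counts)  # rows sum to 1: topic mixture per document
print(doc_topics.shape)  # (3, 2)
```

Setting doc_topic_prior or topic_word_prior to None makes Scikit-learn default them to 1/k, matching the 'inferred from the data' setting used later in our experiments.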
We use the previously mentioned requirements and preprocessing recommendations from Schofield et al. [16], such as removing about 700 of the most common English stop words. Following their recommendations, we decided not to remove duplicated documents nor to use stemming, as this was reported to have little effect. We removed additional text such as HTML tags (incl. their attributes), HTML entities, symbols and extra spaces. Finally, we removed all rows with an empty text field and converted everything to lower-case characters.
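A minimal sketch of these cleaning steps might look as follows; the regular expressions are one possible realisation, and the tiny stop-word set is an illustrative stand-in for the roughly 700-word list we used:

```python
# Sketch of the cleaning steps described above (hypothetical helper; the
# stop-word set is a tiny stand-in for the ~700-word list).
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "it"}

def clean_post(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)               # HTML tags incl. attributes
    text = re.sub(r"&[#\w]+;", " ", text)             # HTML entities, e.g. &nbsp;
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # symbols, lower-casing
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)                           # collapses extra whitespace

print(clean_post('<a href="x">Click</a> the &nbsp; LINK to win!!!'))
# -> "click link win"
```

Rows whose cleaned text is empty would then be dropped before document construction.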
Users can write public posts to communicate with other forum users. These posts come in two kinds: a subject is started by an initial post from a user, while other users can reply to subjects with their own posts. There are zero or more replies associated with each subject. Figure 1 illustrates this type of interaction between users, where each user is depicted with a different colour.
We based our document construction methods on the criterion of including all available posts found on the forum, and ended up identifying three distinct approaches that we named A, B and C. Figure 1 also portrays these document construction methods, where A is subject-centred, B is subject-user-centred and C is user-centred. Other construction approaches than those shown in Figure 1 could be created and would yield different results. However, we decided not to consider them further, as they would suffer too much information loss from ignoring many posts.
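Assuming a posts table shaped like the forum data (the column names and toy rows below are illustrative), the three constructions can be sketched as Pandas group-by operations:

```python
# Sketch of the three document constructions on a hypothetical posts table
# with columns 'topic_id' (subject), 'author_id' (user) and 'post' (text).
import pandas as pd

posts = pd.DataFrame({
    "topic_id":  [1, 1, 1, 2, 2],
    "author_id": [10, 20, 10, 20, 30],
    "post": ["selling accounts", "how much", "pm me", "free proxies", "thanks"],
})

# A: subject-centred -- one document per subject (starter plus all replies)
doc_a = posts.groupby("topic_id")["post"].agg(" ".join)

# B: subject-user-centred -- one document per user within each subject
doc_b = posts.groupby(["topic_id", "author_id"])["post"].agg(" ".join)

# C: user-centred -- one document per user across the whole forum
doc_c = posts.groupby("author_id")["post"].agg(" ".join)

print(len(doc_a), len(doc_b), len(doc_c))  # 2 4 3
```

The resulting text series would then be vectorised and fed to LDA as three separate corpora.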
Construction A keeps the original subject structure found on the forum. In other words, one document is the combination of the subject starter and all its replies. Construction B builds upon this idea of being subject-centred; however, this approach separates the posts of each user within a subject into separate documents. Finally, construction C combines all posts of each distinct user into a separate document; i.e. one document consists of all posts written by a specific user.

Fig. 1. Document construction approaches analysed in this article. Construction A is subject-centred; construction B is subject-user-centred; while construction C is user-centred. Unique users are marked with different colours.

The motivation for construction A is to capture the overall activity on the forum, to get a high-level overview of the topics users are talking about. However, combining all posts from various users per subject might obscure the result. Therefore, we designed construction B to be subject- and user-centred, as this could produce a more accurate result. Construction C, in turn, is user-centred and should capture more of the interests of forum users.

The number of latent topics, k, is a parameter we have to set in LDA models; we explore k = 10, 20, 30, 40, 50 and 60 in our experiment. The other parameters α and η are either inferred from the data (1/k when they are set to None) or set to
the values 0.05, 0.1, 0.5, 1, 5 and 10.

Finally, we have to evaluate the model quality after the unsupervised learning process. We use k-fold cross-validation to assess how well the LDA models generalise to an independent data set. For each analysis, we split the data into five folds: each fold is used for training the LDA model four times and for testing it once. We use perplexity to objectively measure how well our model predicts the testing fold, where a low perplexity score indicates a better model. Furthermore, we use the mean perplexity (i.e. the arithmetic mean over the folds) to compare all 882 (k × α × η combinations) models against each other. We select the models with the lowest perplexity for further manual inspection.

IV. EXPERIMENT AND RESULTS

We explicitly concentrate on data preprocessing in this research article, with LDA document construction at its centre. It is, therefore, out of our scope to focus on the data gathering process, such as running web scraping tools to extract OSINT from real-world underground forums. Instead, we use a dataset of 'Nulled' that was leaked in May 2016. Nulled is a hacker forum on the deep web that facilitates the brokering of compromised passwords, stolen bitcoins and other sensitive data. Nulled's Structured Query Language (SQL) database was leaked in its original form, without any filtration or preprocessing. The database contained details of 599 085 user accounts, 800 593 private messages and 3 495 596 public messages. We imported it into a MySQL server and exported the necessary information from tables and fields with a Python script, using the Pandas package. More specifically, we stored the information found in the database tables 'topics' and 'posts' (columns: 'author_id', 'post', 'topic_id') in a file for further analysis.

TABLE I
TEN BEST MODELS WITH HYPER-PARAMETER COMBINATIONS

Construction A
#   α     η     k   Perplexity
1   0.05  0.05  10  5855.00
2   0.10  0.05  10  5886.47
3   None  0.05  10  5960.00
4   0.50  0.05  10  6035.86
5   0.05  0.10  10  6279.13
6   1.00  0.05  10  6299.63
7   None  None  10  6325.16
8   0.10  None  10  6354.32
9   0.10  0.10  10  6354.98
10  0.50  0.10  10  6476.63

Construction B
#   α     η     k   Perplexity
1   None  0.05  10  7088.24
2   0.10  0.05  10  7133.69
3   0.50  0.05  10  7133.89
4   0.05  0.05  10  7268.40
5   1.00  0.05  10  7484.53
6   0.05  None  10  7763.43
7   None  None  10  7768.09
8   0.05  0.10  10  7870.99
9   None  0.10  10  7877.45
10  0.50  None  10  7937.83

Construction C
#   α     η     k   Perplexity
1   None  0.05  10  8111.34
2   0.05  0.05  10  8276.60
3   0.10  0.05  10  8344.80
4   0.50  0.05  10  8492.75
5   0.10  0.10  10  8687.27
6   None  None  10  8785.00
7   0.50  None  10  8865.91
8   0.05  None  10  8889.34
9   None  0.10  10  8930.34
10  1.00  0.05  10  8947.48
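The five-fold perplexity search described in Section III might be sketched as follows; the toy corpus and the reduced parameter grid stand in for our data and the full 294-combination grid:

```python
# Minimal version of the model selection: five-fold cross-validation,
# scoring each (k, alpha, eta) combination by its mean perplexity on the
# held-out fold (lower is better). Corpus and grid are illustrative.
from itertools import product
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold

docs = ["account crack free", "proxy list port", "account free giveaway",
        "port scan proxy", "crack tool download", "download free tool"] * 5
X = CountVectorizer().fit_transform(docs)

results = []
for k, alpha, eta in product([2, 3], [0.05, None], [0.05, None]):
    fold_scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                     random_state=0).split(docs):
        lda = LatentDirichletAllocation(n_components=k, doc_topic_prior=alpha,
                                        topic_word_prior=eta, random_state=0)
        lda.fit(X[train_idx])
        fold_scores.append(lda.perplexity(X[test_idx]))  # held-out perplexity
    results.append((np.mean(fold_scores), k, alpha, eta))

best = min(results, key=lambda r: r[0])  # lowest mean perplexity wins
print(best)
```

The best-scoring combinations would then be refit on the full corpus for manual inspection of their topics.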
We used Pandas to group the three construction methods
following the design described in Section III and depicted
in Figure 1. The text column 'post' was further processed (as described in Section III) to make it suitable for LDA and document generation. We fit the LDA implementation from the Scikit-learn package for all possible parameter combinations. All three document construction approaches were analysed using 294 distinct combinations of LDA hyper-parameters. We ran a total of 882 (294 × 3) LDA analyses to find the optimal combination of parameters. Table I shows the ten best models with the lowest perplexity.

Interestingly, our best results had very low hyper-parameter values and 10 topics, while Samtani et al. [15] found an optimal topic number ranging between 80 and 100. More importantly, Chang et al. [7] found that perplexity is not strongly correlated with human interpretation, as the most frequent words in topics usually do not describe a coherent idea for those topics. A human forensic analyst would find fewer topics easier to interpret and understand than 80 to 100 topics. However, fewer topics with a low perplexity score are not guaranteed to be easier for a human analyst to interpret. An important note is that low hyper-parameter values also result in a slower convergence rate. While this solution might not be suitable for a time-critical criminal investigation, it could be applied to proactive OSINT gathering.

Table II shows the five most frequent words from each topic, from the three best models, which were manually inspected. These topics are not sorted in any particular order. Some words appear in multiple topics, such as hide, color, http/https and numbers, which do not provide any meaningful interpretation of a topic. For example, 'hide' is a tag in the BBCode lightweight markup language, commonly used to format posts on many message boards. It is frequently used to withhold information until a visitor creates a user account on the forum and gains privileges to view the hidden content.

The various document construction methods (as seen in Table II) do not show much variance in the identified keywords. The main difference was the number of documents that the LDA could learn from. Document construction A has 120 875 documents, B contains 2 794 304, and C has 272 023. Although document construction B had 2 212 per cent more documents to learn from than method A, it did not produce significantly different results. Thus, it can be recommended to go with the two other document
TABLE II
FIVE MOST FREQUENT WORDS FOR TOPICS

# Construction A
1 account, good, help, time, accounts
2 80, 8080, 120, 195, 3128

analyses. Table III shows that the perplexity increases over the iterative preprocessing steps.

TABLE III
ITERATIVE TEN BEST MODELS WITH HYPER-PARAMETER COMBINATIONS
TABLE IV
ITERATIVE FIVE MOST FREQUENT WORDS FOR TOPICS

# Construction A
1 account, file, bot, download, link
2 comcast, music, song, sbcglobal, rr
3 game, origin, sims, email, github
4 capture, type, key, unit, local
5 mail, password, username, unknown, user
6 member, wp, pro, stealer, clean
7 game, play, watch, best, good
8 script, update, enemy, auto, download
9 account, bol, legend, help, crack
10 thx, nice, share, test, man

# Construction B
1 help, crack, link, guy, bol
2 share, check, skin, gg, account
3 download, dude, bot, update, version
4 bro, great, watch, rep, hello
5 thx, nice, test, hope, wow
6 file, tnx, download, gonna, password
7 wub, member, god, omg, gj
8 tks, cool, awesome, wp, tyty
9 good, script, mate, love, best
10 account, man, kappa, ban, lot

# Construction C
1 nice, bro, tnx, tyy, gg
2 tks, ea, member, mail, info
3 account, game, link, crack, free
4 script, bol, update, download, game
5 thx, man, share, nice, good
6 file, download, bot, version, update
7 clean, stealer, rat, crypter, password
8 capture, account, member, gmx, key
9 wp, thnx, pro, unit, local
10 unknown, user, creed, assassin, unite

of less technically skilled cyber criminals. Additional steps of removing unnecessary and less informative words may result in highlighting more skilled cyber criminals.

V. CONCLUSION

Cybercrime continues to be a threat to our economy and the general sense of justice. Law enforcement agencies can exploit OSINT to gather proactive CTI, which might make them more effective in combating cybercriminals. The challenge of OSINT stems from large amounts of unstructured data, which may result in unreliable information from automated processes. Our research shows that automated algorithms such as LDA must follow a set of requirements to reduce the vocabulary size and improve quality. We recommend repeated preprocessing steps, e.g. continuously removing common words, until the result contains coherent and clear topics. Data cleaning is invariably an iterative process, as there are always problems that are overlooked the first time around.

Contemporary related research mostly focuses on using topic modelling to get a quick overview of a large number of documents. This article tries to close the reliability gap of automated processes to make them applicable in digital forensic contexts. We identified three distinct ways users' posts can be constructed into documents, each approach focused on a different aspect: subject-centred, subject-user-centred and user-centred. While they did not produce significantly different results in keywords between topics, our results show that more documents do not necessarily improve the quality of the topics.

Data is key to piecing together any criminal investigation, and more research is needed to further improve the reliability of automated processes and algorithms. Small changes in the input can produce an unreliable output, which forensic analysts can in turn misinterpret. Thus, we need to move beyond contemporary research's focus on using LDA to produce a general overview of a large corpus of text, for example by applying the techniques described in this article to real-world dark web underground forums. Furthermore, we need to design reliable and automated processes suitable for a digital forensic context, for example to distinguish between individuals who produce advanced tools for cybercrime and those who are simply consumers of such tools. Finally, research similar to Chang et al. [7] should be conducted to analyse human-understandable topics and evaluation metrics (e.g. perplexity) in a digital forensic context.

REFERENCES

[1] Rubayyi Alghamdi and Khalid Alfalqi. A Survey of Topic Modeling in Text Mining. International Journal of Advanced Computer Science and Applications, 6(1), 2015.
[2] Stig Andersen. Technical Report: A Preliminary Process Model for Investigation. Preprint, SocArXiv, May 2019.
[3] Waheed Anwar, Imran Sarwar Bajwa, M. Abbas Choudhary, and Shabana Ramzan. An Empirical Study on Forensic Analysis of Urdu Text Using LDA-Based Authorship Attribution. IEEE Access, 7:3224–3234, 2019.
[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] Matt Bromiley. Threat Intelligence: What It Is, and How to Use It Effectively, 2016.
[6] Andrew Caines, Sergio Pastrana, Alice Hutchings, and Paula J. Buttery. Automatically identifying the function and intent of posts in underground forums. Crime Science, 7(1):19, December 2018.
[7] Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei. Reading Tea Leaves: How Humans Interpret Topic Models. In Advances in Neural Information Processing Systems, 2009.
[8] CrowdFlower. Data Science Report. URL: https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf, 2016.
[9] Isuf Deliu, Carl Leichter, and Katrin Franke. Extracting cyber threat intelligence from hacker forums: Support vector machines versus convolutional neural networks. In 2017 IEEE International Conference on Big Data (Big Data), pages 3648–3656, Boston, MA, December 2017. IEEE.
[10] R. Dover, M. S. Goodman, and C. Hillebrand. Routledge Companion to Intelligence Studies. Routledge Companions. Taylor & Francis, 2013.
[11] I. Kononenko and M. Kukar. Machine Learning and Data Mining. Elsevier Science, 2007.
[12] Gaston L'Huillier, Hector Alvarez, Sebastián A. Ríos, and Felipe Aguilera. Topic-based social network analysis for virtual communities of interests in the Dark Web. 2011.
[13] Matthew Moran. Big data brings new power to open-source intelligence, 2014.
[14] Kyle Porter. Analyzing the DarkNetMarkets subreddit for evolutions of tools and trends using LDA topic modeling. Digital Investigation, 26:S87–S97, July 2018.
[15] Sagar Samtani, Ryan Chinn, Hsinchun Chen, and Jay F. Nunamaker. Exploring Emerging Hacker Assets and Key Hackers for Proactive Cyber Threat Intelligence. Journal of Management Information Systems, 34(4):1023–1053, 2017.
[16] Alexandra Schofield, Måns Magnusson, Laure Thompson, and David Mimno. Understanding Text Pre-Processing for Latent Dirichlet Allocation. 2017.
[17] Ryan Williams, Sagar Samtani, Mark Patton, and Hsinchun Chen. Incremental Hacker Forum Exploit Collection and Classification for Proactive Cyber Threat Intelligence: An Exploratory Study. In 2018 IEEE International Conference on Intelligence and Security Informatics (ISI), pages 94–99, Miami, FL, November 2018. IEEE.
[18] Hamid Akın Ünver. Digital Open Source Intelligence and International Security: A Primer. 2018.