
Identification of Cybercriminals in Social Media Using Machine Learning

The document discusses identifying cybercriminals on social media using machine learning. It proposes combining user content analysis and network analysis, and using machine learning algorithms like Naive Bayes, KNN, random forest, and neural networks to detect criminal activity. Text mining is also used to continuously monitor for suspicious users and identify suspected criminals.


2022 International Conference on Smart Generation Computing, Communication and Networking (SMART GENCON)

Karnataka, India. Dec 23-25, 2022

Identification of Cybercriminals in Social Media using Machine Learning

2022 International Conference on Smart Generation Computing, Communication and Networking (SMART GENCON) | 978-1-6654-5499-5/22/$31.00 ©2022 IEEE | DOI: 10.1109/SMARTGENCON56628.2022.10084119


Atika Gupta, Dr Priya Matta, Dr Bhasker Pant
Graphic Era Deemed to be University, Graphic Era Hill University, Dehradun, India

Abstract - The significant growth in online social media has allowed users worldwide to communicate freely and share their ideas. Social media is also becoming a powerful communication tool for businesses and organizations. With this rapid adoption of technology, new ways of committing crime have also emerged. Criminals can use social networking sites to extract information for criminal activity. Such activities include cyber fraud, cyberbullying, cyberstalking, hacking, etc., which may harm an online user. Identifying such proficient cybercriminals is therefore of utmost importance in mitigating these cyberattacks. With growing technology, these cybercriminals have also become more advanced and hide in large underground forums. Most crime-related activity is in text format, which makes identifying the cybercriminal a tedious task. Detecting the crime and identifying the cybercriminal are the primary tasks of the crime analysis process. The article proposes a method to solve this problem by combining user-generated content analysis and user network analysis. Machine learning algorithms such as Multinomial Naïve Bayes, K-Nearest Neighbour, Random Forest, Multilayer Perceptron and ID3 can be used. Text mining approaches are also used to detect and predict criminal activities effectively. These techniques continuously check for suspicious users and create a graph to find the suspected user.

Keywords: cybercrime, cybercriminal, social media, machine learning, text mining.

I. INTRODUCTION
The growth and evolution of computers have proved to be both a blessing and a curse for people facing issues of privacy, child protection, national security, and fighting and prosecuting cybercrime. Users on the internet are exposed to attacks like fraud, identity loss, abuse and many more; they are also not aware of their internet rights for managing and catching the cybercriminals involved in these crimes. Computers and information technology have made their way into our day-to-day lives and have transformed our way of living drastically. Still, people have simultaneously suffered from security breaches and alarming cybercrime rates [1]. According to some studies, cybercrime is no longer just hacking and attacking the system; it is now an attack [2]. The rapid development of internet technologies like cloud computing has paved the way for cybercriminals by providing a favourable environment to commit such crimes. Many people have moved to the internet, enabling thousands of websites to serve as a platform for cybercrime. The web has changed, as shown in Fig. 2: in Web 1.0, users were only consumers/readers, but in Web 2.0 there is a two-way link. The previous readers are now also contributors. The user is no longer just a consumer but at the same time contributes to production in different ways, such as blogs, wikis, microblogs, social media posts and many other forms.

Fig. 1. Social Media User

Fig. 2. Web 1.0 to Web 2.0

The flow of cybercrime is the same as that of traditional crime: 1) activity, 2) activity target/purpose, 3) mode of operation, 4) criminal, 5) harm caused/loss incurred. In this flow, the first element is the criminal activity on social media. The activity leads to the target or purpose of the cybercrime, which leads to the mode of operation. The mode of operation can lead to the online footprints of criminals, where

we can check the harm caused to the user or the loss incurred if the crime is financial. The research uses criminal profiling to investigate the cybercriminal, which depends on behavioural analysis and psychological patterns to determine the person's personality traits. Cybercriminal profiling helps identify the criminal behaviour, the modus operandi and probably the motivation behind the crime. The purpose of criminal profiling is not only to obtain the criminal's characteristics but also to take preventive measures so that the offence is not repeated in the future [3].

Online communication content can uncover a lot of information about a participant, such as their social activities, network, choices, and recommendations. However, manually analyzing the chat conversations of a criminal case to find evidence is a very time-consuming and tedious task. Most existing work uses crawling search engines to extract meaningful information. Still, this method has its limitations. The extracted data may contain information related to the search criteria but does not necessarily match the suspect's social activity and network [4][5]. The sequence-matching technique is unsuitable for crime investigation, as a drug dealer rarely uses terms like 'drug' or 'cocaine' in chats. Prior knowledge about the person is essential for good results; otherwise the results will be irrelevant and inconsistent [6].

This research aims to mine textual data to create a framework that can collect intuitive and explainable evidence from the chat history to simplify and facilitate the investigation process, particularly when the investigator does not have many clues [7]. Precise calculation and comprehension of the current and past cyber-attack situation can help prosecute the criminals. It can also help agencies keep an eye on such activities so that no further damage occurs. One standard method to identify cybercriminals is to track their footprints, such as their IP address [8]. Criminals fabricate false footprints to cover up the original ones and mislead the investigator, making it very hard to catch them. Another way to find these cybercriminals is to analyze the history of related attacks and discover the relationship between different aspects of the attack. What was the corresponding attack type? What was the related target? Where was that related target located? Answering these questions helps to profile cyber attackers, predict future threats, learn from past failures, and improve prevention techniques [9][10].

The knowledge acquired by data mining techniques is beneficial, and it can help and support the investigator in finding the cybercriminals. Classification and clustering techniques can help identify the crime pattern and the criminal. Data mining has played an essential role in assisting humans in forensics and investigations.

The motivation for conducting this research work is to help researchers in criminal investigation and crime prediction. The study will also help the bodies that work to identify the actors involved in cybercrime. The article provides insight into the criminal analysis method and produces different types of crime prediction.

II. LITERATURE SURVEY
Changing aspects behind the relationships of criminals are used in identifying the suspects and knowing the activities of interest for them. The investigator must consider and check a criminal's network information for any cybercrime. This investigation can uncover a suspect's network with hidden subnetworks, roles, and communication.

Sameera et al. [11] state that the text mining approach for cybercrime detection and prediction is efficient. These techniques continuously check for malicious text, even in short forms or code words. A "Self-Customized Hyperlink-Induced Topic Search" (HITS) framework is proposed for text-based social network data mining to find the suspicious user.

Johnsen et al. [12] combined network centrality methods with Latent Dirichlet Allocation (LDA) to detect competent criminals. The study removed a significant part of the irrelevant users' network, which shrank the graph, as many nodes and edges were removed. The article first proposed a study of the underground economy from the user's perspective, identifying the small group of skilled cybercriminals.

Ford et al. [13] analyzed the application of machine learning techniques to keep cyberspace secure from cybercriminals. They also discuss the difficulties faced in implementing these techniques. Machine learning is expanding the ways to keep cyberspace safe, but many vulnerabilities exist, and there is scope for advancement with these techniques.

Jiang et al. [14] present a short review of several journals linked to applying machine learning models to improve cyber security. They noted some commonly occurring barriers to machine learning techniques, such as obtaining proper datasets, and the most competitive applications for particular security problems.

Hodo et al. [15] compare the performance of different machine learning techniques for detecting abnormalities. The authors also highlight the effect of feature selection on ML performance and suggest that a convolutional neural network (CNN) can be very useful in this context. CNNs can bring new advancements in cybersecurity if used to their full potential.

Apruzzese et al. [16] analyzed the role of several machine learning techniques in detecting several types of anomalies, such as malicious code or malicious activity. The study concluded that no ML technique is free of vulnerabilities to cyber-attacks. As criminals are upgrading themselves continuously, no method offers full protection; each process is struggling to keep pace with the changing requirements.

Sheikhi et al. [17] proposed a new machine learning technique for anomaly detection that processes text messages using content-based detection features. Text messages are critical in anomaly detection, as they allow historical data to be read and lessons to be learned from previous evidence. Knowing previous evidence makes it possible to forecast future anomalies. The article concluded that the proposed neural network and content-based feature selection method outperformed most machine learning techniques in accuracy on the same dataset.

Mercaldo et al. [18] state that using signature-based categorization techniques produces higher error rates for malware detection. They proposed an approach which uses

image-based deep learning methods for malware detection to showcase the difference between the group of negative characteristics and the positive characteristics by acquiring grey-scale images.

Biron et al. [19] propose a Crime Detection System (CTS) to facilitate crime detection and investigation, as all crime entries are made on this system. Data science has provided opportunities to detect crime by creating quantifiable information on criminal social media profiles, which can then be analyzed to predict crime based on location proximity. Cluster graphs study the network of a criminal to understand its centrality, whereas geolocation proximity offers the researcher knowledge of previous crime occurrences with the help of the heat-map visualization technique.

Marin et al. [20] state that it is challenging to investigate criminals on the dark web, as it is a combination of public, private, and restricted profiles. However, a user's activities mainly come into the picture when he is publicly active. The publicly available posts were analyzed to find the potential criminal. The authors also used a reputation system to validate the findings.

III. PROPOSED SOLUTION
This segment gives a brief idea of the solutions and the advanced methods used to identify a cybercriminal. Fig. 3 shows a user on social media, who has several profile attributes, network attributes, and content attributes. Profile attributes include the name, city, gender, etc. The network attributes consist of the user's friend circle and the groups he may have joined. Lastly, the content attributes consist of all the user-generated content: whatever he posts, reposts, likes or shares on social media. The goal is to identify the cybercriminal on social media. Whether a user is involved in cybercrime cannot be concluded by analyzing his profile features, as these only contain the user's personal information. Table I shows some of the profile, network, and content elements.

TABLE I. USER'S ATTRIBUTES
Profile: Name, Age, Birth Date, Gender, Location, Qualification, Profile Picture, etc.
Network: Connections with friends and the community, web address, etc.
Content: User's posts, likes, comments, shares/reposts, etc.

In Fig. 3, we have designed a scenario where we have a social media user and have to predict whether he can be a cybercriminal or not. His network attributes are considered by analysing his friends and the communities he may have joined. If anyone in his network is found to have been involved in suspicious activity in the past or present, this user is also marked as suspicious and must be monitored. On the other hand, his content can also be analyzed using the text mining approach to see what words he uses in his posts. If his content is objectionable, this user can be kept in the suspect category for further analysis. When any feature, such as the network or content of a user, is found suspicious, the user can be analysed further for crime detection.

Fig. 3. Social Media User

IV. IMPLEMENTATION TECHNIQUES
As discussed earlier, the primary goal is to determine whether a user is involved in cybercriminal activity. The preceding section considered the features that may affect a cybercriminal's detection. This segment defines the machine learning model used for the implementation.

This can be done in two parts: 1) by analysing the text/content of the user and 2) by analysing the user's social network.

A. Analysing the content
Figure 4 shows a systematic flow of the proposed model for analysing the text, which comprises four significant steps: pre-processing, feature extraction, feature selection, and machine learning algorithms. The input text is pre-processed to transform the raw data into an understandable form by eliminating inconsistencies. The main steps involved in pre-processing are removing redundant characters and tokenization, where the text is broken down into small parts called tokens. For example, "This is an apple" will be tokenized as "this", "is", "an", "apple". After this, in the final pre-processing step, we remove the stop words.

The text prepared by removing the stop words is still not in a form to which machine learning algorithms can be applied. Feature extraction creates a numeric mapping of the text to extract some meaning. The techniques used for feature extraction from the text can be term frequency-inverse document frequency (tf-idf) and bag-of-words (BoW). BoW uses word frequency as the feature: every cell gives the count (c) of the word (fw_i) in the text document (t_i). Here, unwanted words may get a higher weight than they should. The tf-idf approach mitigates this problem by down-weighting words that occur in many documents.

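As an illustration of the pre-processing steps in Section IV-A, the following minimal Python sketch tokenizes a text, removes stop words, and builds bag-of-words count vectors. The stop-word set and the regular-expression tokenizer are assumptions made for this example; the paper does not specify either.

```python
import re
from collections import Counter

# Illustrative stop-word set (an assumption; the paper does not list one).
STOP_WORDS = {"this", "is", "an", "a", "the", "of", "to", "and", "in"}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    cleaned = [remove_stop_words(tokenize(doc)) for doc in documents]
    vocabulary = sorted({tok for tokens in cleaned for tok in tokens})
    vectors = []
    for tokens in cleaned:
        counts = Counter(tokens)
        # One cell per vocabulary word: the count of that word in the document.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


if __name__ == "__main__":
    print(tokenize("This is an apple"))
    vocab, vecs = bag_of_words(["send the money now", "money money transfer"])
    print(vocab)
    print(vecs)
```

Each row of the returned matrix corresponds to one document and each column to one vocabulary word, which is the numeric mapping that a classifier can consume.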
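The tf-idf weighting can be sketched in the same spirit. The example below assumes documents that are already tokenized and uses the common form tf × log(m / df), where m is the number of documents and df is the number of documents containing the word. The natural logarithm and the absence of smoothing are assumptions, since the paper does not state them.

```python
import math
from collections import Counter


def tf_idf(token_lists):
    """Compute tf-idf weights for pre-tokenized documents.

    Returns one {word: weight} dictionary per document, where
    weight = (count of word in document) * log(m / document frequency).
    """
    m = len(token_lists)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        weights.append(
            {word: count * math.log(m / doc_freq[word])
             for word, count in term_freq.items()}
        )
    return weights


if __name__ == "__main__":
    docs = [
        ["send", "money", "now"],
        ["money", "money", "transfer"],
        ["nice", "weather", "now"],
    ]
    for doc_weights in tf_idf(docs):
        print(doc_weights)
```

A word that occurs in every document receives weight zero under this form, which is how frequent but uninformative words are kept from dominating the features.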
The tf-idf weight is computed as

$$\text{tf-idf}(fw_i, t_i) = \text{tf}(fw_i, t_i) \cdot \log \frac{m}{|\{t \in m : fw \in t\}|}$$

Here, tf-idf(fw_i, t_i) designates the tf-idf value of the word fw_i in a text document t_i, tf(fw_i, t_i) defines the frequency of word fw_i in the text document t_i, m means the total number of text documents, and |{t ∈ m : fw ∈ t}| denotes the number of text documents t containing the word fw.

Data is split into training and testing sets; the training part uses the features obtained in the previous step to train the machine learning classifiers. These algorithms are Multinomial Naïve Bayes (MNB), Random Forest (RF), K-Nearest Neighbour (KNN), Multilayer Perceptron and ID3.

a) Multinomial Naïve Bayes: This algorithm is practical for classifying discrete features such as text or documents. It uses the multinomial distribution and Bayes' theorem, under the naive assumption that the features are conditionally independent given the class C. The equations used for text classification, in their standard form, are:

$$P(C \mid V) = \frac{P(C)\, P(V \mid C)}{P(V)}$$

$$P(V \mid C) = \prod_{i=1}^{n} P(v_i \mid C)$$

In the above equations, C is a class variable, and V = (v_1, v_2, …, v_n) represents the feature vector.

Fig. 4. Proposed Model

b) Random Forest: This algorithm comprises many decision trees which run individually. To find the branches that are more likely to occur, the 'Gini Index' method is used, which is calculated as below:

$$\text{Gini} = 1 - \sum_{i=1}^{c} p_i^2$$

In the above equation, the total number of classes is represented by c, and p_i is the probability of the i-th class.

c) KNN: This is a supervised learning algorithm built on the idea that similar things lie close together, so KNN compares the features of the stored data points with those of a new point. The Euclidean distance between two data points x and y with n features is calculated as

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

as illustrated in Fig. 5.

Fig. 5. Euclidean Distance

d) Multilayer Perceptron: It is a feed-forward neural network consisting of three kinds of layers: input, hidden, and output. The input layer receives the input signals to be processed. The output layer performs the task of prediction and classification. Several hidden layers are placed between the input and output layers and perform the computation. For a single hidden layer, the computation, in its standard form, is

$$f(x) = G\left(b^{(2)} + W^{(2)}\, s\left(b^{(1)} + W^{(1)} x\right)\right)$$

where b^{(1)} and b^{(2)} are the bias vectors, W^{(1)} and W^{(2)} are the weight matrices, and G and s are the activation functions.

e) ID3: ID3 stands for Iterative Dichotomiser 3, which iteratively divides features into two or more groups at every step. It applies a top-down approach to building a decision tree, testing an attribute at each node of the tree. The tree is built by finding the entropy, which is the measure of uncertainty, and the maximum information gain amongst all the feature columns; the selected feature is represented as a node on the tree. In their standard form, for a dataset S with c classes and a feature A:

$$H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

$$IG(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$

where p_i is the probability of the i-th class and S_v is the subset of S for which feature A takes the value v.

The steps involved are calculating the entropy for the dataset, finding the feature with maximum information gain, and repeating until the desired tree is formed.

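Two of the classifiers listed in Section IV-A, Multinomial Naïve Bayes and KNN, can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the Laplace (add-one) smoothing, the class labels, and the value of k are assumptions introduced for the example.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, token_lists, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for tokens in token_lists for tok in tokens}
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(token_lists, labels):
            self.word_counts[label].update(tokens)
        self.total = len(labels)
        return self

    def log_posterior(self, tokens, label):
        """log P(C) + sum_i log P(v_i | C), up to the constant log P(V)."""
        log_prob = math.log(self.class_counts[label] / self.total)
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # words never seen in training are ignored
                log_prob += math.log((self.word_counts[label][tok] + 1) / denom)
        return log_prob

    def predict(self, tokens):
        return max(self.classes, key=lambda c: self.log_posterior(tokens, c))


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train_points, train_labels, query, k=3):
    """Majority label among the k training points closest to the query."""
    nearest = sorted(zip(train_points, train_labels),
                     key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

For instance, after `MultinomialNB().fit(...)` on a few labelled token lists, `predict(["money", "transfer"])` returns the class whose training documents used those words most often. Working in log space avoids numerical underflow when many word probabilities are multiplied.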
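The remaining quantities from Section IV-A can be illustrated as well: the Gini index used by Random Forest, the entropy and information gain used by ID3, and the forward pass of a perceptron with one hidden layer. The sketch below is a hedged illustration only. Representing each sample as a dictionary of feature values, and choosing tanh and softmax as the two activation functions, are assumptions; the paper does not name its activations.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index: 1 - sum_i p_i^2 over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on rows[i][feature]."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_feature(rows, labels):
    """One ID3 step: pick the feature with maximum information gain."""
    return max(rows[0].keys(), key=lambda f: information_gain(rows, labels, f))


def mlp_forward(x, w1, b1, w2, b2):
    """Forward pass: softmax(b2 + W2 * tanh(b1 + W1 * x))."""
    hidden = [math.tanh(b + sum(w * xi for w, xi in zip(row, x)))
              for row, b in zip(w1, b1)]
    scores = [b + sum(w * h for w, h in zip(row, hidden))
              for row, b in zip(w2, b2)]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

A full ID3 tree would call `best_feature` recursively on each subset produced by the chosen split until the subsets are pure or no features remain.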
Fig. 6. The proposed system for network analysis

B. Analysing the social graph

The primary objective of the system proposed in this section is to monitor suspicious activity over the network and track down the cybercriminal. Fig. 6 shows the architecture proposed for criminal identification by analysing the user's network. The steps involved are:
i. Information Extraction: Here, the raw data is extracted from the chats and converted to a machine-readable format. A considerable amount of data is extracted from social media to perform data analysis. The extracted messages come in different forms and architectures. Each log file has information about more logs and other discussions.
ii. Data Normalization: Mostly, the data obtained from social media contains noisy text and needs to be cleaned. The most common noise is the repeated use of stop words, punctuation, and other information that can be written without repetition. For example, a simple "hello" can be found as "heeelllooooo." Such an elongated word creates disturbance and is very difficult to process. Regular expressions are required to handle these repeated characters. Slang words also have no place in the dictionary and hence are to be replaced by familiar words.
iii. Vocabulary and Key Information Extraction: Vocabulary is defined as the set of words which a user uses in his lifetime for communication. Vocabulary extraction is therefore the process of extracting the vocabulary of the group involved in the discussion. It consists of n-gram extraction, stop-word removal, stemming and folding. After this, the key information is extracted: the user using these words is identified. Using this, we also recognise the group of the chat session containing these words, to identify the suspected user.
iv. Social Graph Construction: A chat session can provide us with the group or community interacting with each other, and this interaction creates a relationship among these user groups. A social network graph is constructed as a weighted graph by analysing the chats' interaction patterns. After this analysis, a chart shows every connected group, user, and suspected user. The graph is shown below:

Fig. 7. Social Network Graph

v. User Group and Profile Identification: A clustering method is used to find user groups. Clustering can be achieved by grouping users with the same essential components. For profile identification, the researchers use text mining methods. When a suspected word is found, the system checks the user's information, such as the profile, and further monitors the user's behaviour and activities.

V. CONCLUSIONS
Cyber threats are increasing at a rapid rate. The conventional techniques that have been used for years are not capable of coping with this increasing rate of cybercrime. Cybercriminals are becoming smarter with technology, and they find ways to escape or hide. The article proposes a system with which we can identify and predict cybercriminals. It suggests that a user can be identified by analysing his user-generated content or his network. User-generated content is the text, image or video that he may produce on social media. This includes his posts, comments, shares, likes, etc. Network analysis means analysing his friends, groups or community. This can be achieved by checking the chat logs of the user. The article proposed a system for both approaches. The text mining approach uses machine learning classifiers like Multinomial Naïve Bayes, KNN, Random Forest, Multilayer Perceptron and ID3. Machine learning techniques have proved to be very useful for overcoming the limitations of conventional methods. Text mining approaches are an effective way of detecting and predicting criminal activities. To detect illegal activities, text mining approaches continuously check for suspicious words, even when they appear in slang, code words or short forms. A graph shows the user connections and finds the suspect user. The suspect user is continuously monitored for criminal activity or behaviour.
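The network-analysis steps of Section IV-B (data normalization, n-gram extraction, and social graph construction) can be sketched as follows. This is a minimal illustration under stated assumptions: the message format (sender, receiver, text), the list of suspicious terms, and the rule that flags a sender are all hypothetical. Shortening every run of three or more identical characters to two is also a simplification; it turns "heeelllooooo" into "heelloo", so a dictionary lookup would still be needed to recover "hello".

```python
import re
from collections import defaultdict

# Hypothetical watch list; a real system would derive it from labelled data.
SUSPICIOUS_TERMS = {"stash", "drop"}


def normalize(text):
    """Lower-case and shorten runs of 3+ identical characters to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_graph(messages):
    """Build an undirected weighted graph from (sender, receiver, text) tuples.

    Edge weight = number of messages exchanged between the two users.
    Also returns the set of users who sent a suspicious term.
    """
    weights = defaultdict(int)
    suspects = set()
    for sender, receiver, text in messages:
        edge = tuple(sorted((sender, receiver)))
        weights[edge] += 1
        tokens = re.findall(r"[a-z]+", normalize(text))
        if SUSPICIOUS_TERMS.intersection(tokens):
            suspects.add(sender)
    return dict(weights), suspects


def neighbours_of_suspects(weights, suspects):
    """Users directly connected to a suspect, to be monitored as well."""
    linked = set()
    for u, v in weights:
        if u in suspects:
            linked.add(v)
        if v in suspects:
            linked.add(u)
    return linked - suspects
```

The edge weights give a simple measure of interaction strength, and `neighbours_of_suspects` mirrors the idea that users connected to a suspicious user are themselves kept under observation.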
REFERENCES
[1] K. Shaukat, S. Luo, S. Chen, & D. Liu (2020). Cyber threat detection using machine learning techniques: A performance evaluation perspective. In 2020 International Conference on Cyber Warfare and Security (ICCWS) (pp. 1-6). IEEE.
[2] V. Tumalavičius, J. Ivančiks, & O. Karpishchenko (2016). Issues of society security: public safety under globalization conditions in Lithuania, Journal of Security and Sustainability Issues 4(9): 545–573. https://doi.org/10.9770/jssi.2016.5.4(9)
[3] A. Kipane (2019). Meaning of profiling of cybercriminals in the security context. In SHS Web of Conferences (Vol. 68, p. 01009). EDP Sciences.
[4] M. Rai & B. Pillai (2021). Criminal Activities Predictive Analysis Using Data Mining Techniques. International Journal of Scientific Research & Engineering Trends.
[5] M. Bada & J. R. C. Nurse, "Profiling the Cybercriminal: A Systematic Review of Research," 2021 International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA), 2021, pp. 1-8, doi: 10.1109/CyberSA52016.2021.9478246.
[6] R. Mahajan & V. Mansotra (2021). Correlating crime and social media: using semantic sentiment analysis. Int J Adv Comput Sci Appl.
[7] G. Cascavilla, D. A. Tamburri & W. J. Van Den Heuvel (2021). Cybercrime threat intelligence: A systematic multi-vocal literature review. Computers & Security, 105, 102258.
[8] E. Marttila, A. Koivula & P. Räsänen (2021). Cybercrime Victimization and Problematic Social Media Use: Findings from a Nationally Representative Panel Study. American Journal of Criminal Justice, 46(6), 862-881.
[9] N. Lykousas, V. Koutsokostas, F. Casino & C. Patsakis (2021). The Cynicism of Modern Cybercrime: Automating the Analysis of Surface Web Marketplaces. arXiv preprint arXiv:2105.11805.
[10] M. R. Kumar and P. K. Malathi, "An Innovative Method in Classifying and predicting the accuracy of intrusion detection on cybercrime by comparing Decision Tree with Support Vector Machine," 2022 International Conference on Business Analytics for Technology and Security (ICBATS), 2022, pp. 1-6, doi: 10.1109/ICBATS54253.2022.9759021.
[11] K. Sameera & P. Vishwakarma (2018). Cybercrime: To Detect Suspected User's Chat Using Text Mining. Smart Innovation, Systems and Technologies, 381–390. doi:10.1007/978-981-13-1742-2_37
[12] J. W. Johnsen & K. Franke (2020). Identifying Proficient Cybercriminals Through Text and Network Analysis. 2020 IEEE International Conference on Intelligence and Security Informatics (ISI). https://doi.org/10.1109/isi49825
[13] V. Ford & A. Siraj (2014). Applications of machine learning in cyber security. In Proceedings of the 27th International Conference on Computer Applications in Industry and Engineering (Vol. 118). Kota Kinabalu, Malaysia: IEEE Xplore.
[14] H. Jiang, J. Nagra, & P. Ahammad, "Sok: Applying machine learning in security-a survey," arXiv preprint arXiv:1611.03186, 2016.
[15] E. Hodo, X. Bellekens, A. Hamilton, C. Tachtatzis, & R. Atkinson (2017). Shallow and deep networks intrusion detection system: A taxonomy and survey. arXiv preprint arXiv:1701.02145.
[16] G. Apruzzese, M. Colajanni, L. Ferretti, A. Guido & M. Marchetti (2018). On the effectiveness of machine and deep learning for cyber security. In 2018 10th International Conference on Cyber Conflict (CyCon) (pp. 371-390). IEEE.
[17] S. Sheikhi, M. Kheirabadi, and A. Bazzazi (2020). An effective model for SMS spam detection using content-based features and averaged neural network. International Journal of Engineering, 33(2), 221-228.
[18] F. Mercaldo and A. Santone (2020). Deep learning for image-based mobile malware detection. Journal of Computer Virology and Hacking Techniques, 16(2), 157-171.
[19] K. Biron, W. Mansoor, S. Miniaoui, S. Atalla, H. Mukhtar & K. F. Bin Hashim (2019). Data Science tools for crime investigation, archival, and analysis. In 2019 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI) (pp. 1263-1266). IEEE.
[20] E. Marin, J. Shakarian, & P. Shakarian (2018). Mining key-hackers on darkweb forums. In 2018 1st International Conference on Data Intelligence and Security (ICDIS) (pp. 73-80). IEEE.
