Identification of Cybercriminals in Social Media Using Machine Learning
Abstract - The significant growth in online social media has allowed users worldwide to communicate freely and share their ideas. Social media is also becoming a powerful communication tool for businesses and organizations. With this rapid adoption of technology, new ways of committing crime have also emerged. Criminals can use social networking sites to extract information for criminal activity. Such activities include cyber fraud, cyberbullying, cyberstalking, hacking, etc., which may harm an online user. Identifying such proficient cybercriminals is therefore of utmost importance in mitigating these cyberattacks. As technology has grown, these cybercriminals have also become more advanced and hide in large underground forums. Crime-related activity is mostly in text format, which makes identifying the cybercriminal a tedious task. Detecting the crime and identifying the cybercriminal are the primary tasks of the crime analysis process. The article proposes a method to solve this problem by combining user-generated content analysis and user network analysis. Machine learning algorithms such as Multinomial Naïve Bayes, K-Nearest Neighbour, Random Forest, Multilayer Perceptron and ID3 can be used. Text mining approaches are also used to detect and predict criminal activities effectively. These techniques continuously check for suspicious users and create a graph to find the suspected user.

Fig. 1. Social Media User

I. INTRODUCTION

The growth and evolution of computers have proved to be both a blessing and a curse for people facing the issues of privacy, child protection, national security, and fighting and prosecuting cybercrime. Internet users are exposed to attacks such as fraud, identity loss, abuse and many more; they are also not aware of their internet rights for managing and catching the cybercriminals involved in these crimes. Computers and information technology have made their way into our day-to-day lives and have transformed our way of living drastically. Still, people have simultaneously suffered from security breaches and alarming cybercrime rates [1].

According to some studies, cybercrime is no longer just hacking and attacking the system; it is now an attack [2]. The rapid development of internet technologies like cloud computing has paved the way for cybercriminals by providing a favourable environment to commit such crimes. Many people have moved to the internet, enabling thousands of websites to serve as a platform for cybercrime. The web has changed its view, as shown in figure 2: in Web 1.0, users were only consumers/readers, but in Web 2.0 there is a two-way link. The previous readers are now also contributors. The user is no longer just a consumer but at the same time contributes to production in different ways, like blogs, wikis, microblogs, social media posts and many other forms.

Fig. 2. Web 1.0 to Web 2.0

The flow of cybercrime is the same as that of traditional crime: 1) activity, 2) activity - target/purpose, 3) mode-of-operation, 4) criminal, 5) harm caused/loss incurred. In this flow, the first step is the criminal activity on social media. The activity leads to the target or purpose of the cybercrime, which leads to the mode of operation. The mode of operation can lead to the online footprints of criminals, where
Authorized licensed use limited to: Temple University. Downloaded on December 10,2023 at 22:21:13 UTC from IEEE Xplore. Restrictions apply.
image-based deep learning methods for malware detection to showcase the difference between the group of negative characteristics and the positive characteristics by acquiring grey-scale images.

Biron et al. [19] propose a Crime Detection System (CTS) to facilitate crime detection and investigation, as all crime entries are recorded in this system. Data science has provided opportunities to detect crime by creating quantifiable information on criminal social media profiles, which can then be analyzed to predict crime based on location proximity. Cluster graphs study the network of a criminal to understand its centrality. In contrast, geolocation proximity offers the researcher knowledge of previous crime occurrences with the help of the heat map visualization technique.

Marin et al. [20] state that it is challenging to investigate criminals on the dark web, as it is a combination of public, private, and restricted profiles. However, a user's activities mainly come into the picture when he is publicly active. The publicly available posts were analyzed to find potential criminals. The authors also used a reputation system to validate the findings.

III. PROPOSED SOLUTION

This segment gives a brief idea of the solutions and the advanced methods used to identify a cybercriminal. "Fig. 3" shows a user on social media. The user has several profile attributes, network attributes, and content attributes. Profile attributes include the user's name, city, content, gender, etc. The network attributes consist of the user's friend circle and the groups he may have joined. Lastly, the content attributes consist of all the user-generated content: whatever he posts, reposts, likes or shares on social media. "Table I" shows some of the profile, network, and content elements.

TABLE I. USER'S ATTRIBUTES

Attribute type | Attributes of the user
Profile        | Name, Age, Birth Date, Gender, Location, Qualification, Profile Picture, etc.
Network        | Connections with friends and the community, web address, etc.
Content        | User's posts, likes, comments, shares/reposts, etc.

The goal is to identify the cybercriminal on social media. Whether a user is involved in cybercrime cannot be concluded by analyzing his profile features, as they contain only the user's personal information. In figure 3, we have designed a scenario where we have a social media user, and we have to predict whether he can be a cybercriminal or not. We consider his network attributes by analyzing his friends and the communities he may have joined. If anyone in his network is found to have had suspicious activity in the past or present, this user is also marked as a suspicious user and must be monitored. On the other hand, his content can also be analyzed using the text mining approach to see what words he uses in his posts. If his content is objectionable, this user can be kept in the suspect category for further analysis.

When any of the features, like the network or content of a user, is found suspicious, the user can be analysed further for crime detection.

IV. IMPLEMENTATION TECHNIQUES

As discussed earlier, the primary goal is to determine whether a user is involved in cybercriminal activity. In the preceding section, we also considered the features that may affect a cybercriminal's detection. This segment defines the machine learning model used for the implementation.

Fig. 3. Social Media User

This can be done in two parts: 1) by analysing the text/content of the user and 2) by analysing the user's social network.

A. Analysing the content

Figure 4 shows a systematic flow of the proposed model for analysing the text, which comprises four significant steps: feature extraction, pre-processing, feature selection, and machine learning algorithms. The input text is pre-processed to transform the raw data into an understandable form by eliminating inconsistencies. The main steps involved in pre-processing are removing redundant characters and tokenization, where the text is broken down into small parts called tokens. For example, "This is an apple" will be tokenized as "this", "is", "an", "apple". After this, in the final pre-processing step, we remove the stop words.

The text we have just prepared by removing the stop words is not yet in a form to which machine learning algorithms can be applied. Feature extraction creates a numeric mapping of the text for extracting some meaning. The techniques used here for feature extraction from the text can be term frequency-inverse document frequency (tf-idf) and bag-of-words (BoW). BoW uses word frequency as the feature. In this approach, every cell gives the count (c) of the word (fwi) in the text document (ti). Here, unwanted words may get a higher weight than they should. The tf-idf approach mitigates this problem by using the following equation.
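Assuming the standard tf-idf formulation, which is consistent with the symbols used in this section (fw_i for a word, t_i for a text document, m for the total number of text documents), the weight can be written as shown here. The choice of logarithm base is an assumption, and smoothed variants also exist.

```latex
\[
\mbox{tf-idf}(fw_i, t_i) \;=\; \mathrm{tf}(fw_i, t_i) \times
\log\!\left( \frac{m}{\left| \{\, t \in m : fw_i \in t \,\} \right|} \right)
\]
```

A word that occurs in every text document gets a weight of log(1) = 0, which is how frequent but uninformative words are kept from dominating the features.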
Here tf-idf(fwi, ti) designates the tf-idf value of the word fwi in a text document (ti), tf(fwi, ti) defines the frequency of word fwi in the text document (ti), m means the total number of text documents, and |{t ∈ m : fwi ∈ t}| denotes the number of text documents t containing the word fwi.

Data is split into training and testing sets, wherein the training part uses the features obtained in the previous step to train the machine learning classifiers. These algorithms are Multinomial Naïve Bayes (MNB), Random Forest (RF), K-Nearest Neighbour (KNN), Multilayer Perceptron and ID3.

a) Multinomial Naïve Bayes: This algorithm is practical for classifying discrete features such as text or documents. It uses the multinomial distribution and Bayes' theorem, where the variables of each class C are not dependent on any other class. The equations given below are used for text classification:

appear closer. So, KNN utilises identical features in the data point and new points. The Euclidean distance between the data points is calculated for the given graph.
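As a minimal sketch of the K-Nearest Neighbour idea, assuming the standard Euclidean distance d(p, q) = sqrt(sum_i (p_i - q_i)^2) between two feature vectors, the snippet below labels a new point by majority vote among its k closest training points. The function names, the toy feature vectors and the labels are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def euclidean_distance(p, q):
    """Standard Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_points, train_labels, new_point, k=3):
    """Label new_point by majority vote among its k nearest training points."""
    distances = sorted(
        (euclidean_distance(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D feature vectors (for example, two tf-idf weights per user).
train_points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.75)]
train_labels = ["normal", "normal", "suspicious", "suspicious", "suspicious"]

prediction = knn_predict(train_points, train_labels, (0.7, 0.8), k=3)
```

The value of k is a tuning choice; an odd k avoids ties when there are two classes.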
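A minimal, standard-library-only sketch of the content-analysis steps described in this section (pre-processing, bag-of-words counts, tf-idf weighting, and a Multinomial Naïve Bayes classifier) might look as follows. The stop-word list, the example posts and their labels are invented for illustration, Laplace (add-one) smoothing is an assumption, and the tf-idf weight assumes the standard tf × log(m / df) form.

```python
import math
import re
from collections import Counter, defaultdict

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"this", "is", "an", "a", "the", "to", "of", "and",
              "my", "for", "you", "your", "me"}


def preprocess(text):
    """Lowercase, drop non-letter characters, tokenize, and remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def tf_idf(documents):
    """Return one {word: weight} dict per document using tf * log(m / df)."""
    m = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return [
        {word: count * math.log(m / doc_freq[word])
         for word, count in Counter(doc).items()}  # Counter(doc) is the BoW count
        for doc in documents
    ]


class MultinomialNB:
    """Multinomial Naive Bayes over bag-of-words counts with Laplace smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc)
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior of the class
            for word in doc:
                if word in self.vocabulary:  # ignore words never seen in training
                    likelihood = (self.word_counts[label][word] + 1) / (total_words + vocab_size)
                    score += math.log(likelihood)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical labelled posts, invented purely for illustration.
posts = [
    "Send me your card number and password",
    "Selling stolen card data, password dumps for sale",
    "Lovely weather for a picnic this weekend",
    "Watching the football match with friends this weekend",
]
labels = ["suspicious", "suspicious", "normal", "normal"]

tokenized = [preprocess(p) for p in posts]
weights = tf_idf(tokenized)
model = MultinomialNB().fit(tokenized, labels)
prediction = model.predict(preprocess("password and card dumps"))
```

The classifier here is trained on raw word counts, which is what the multinomial model expects; the tf-idf weights are computed separately to show the alternative feature representation.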
Fig. 6. The proposed system for network analysis
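The network-analysis rule described in Section III, where a user is marked as suspicious when anyone in his network has shown suspicious activity, can be sketched as a simple graph lookup. The friendship graph, the user names and the set of known suspicious users below are hypothetical; how suspicious activity is established in the first place is outside this sketch.

```python
def flag_suspects(friend_graph, known_suspicious):
    """Return users to monitor: known suspicious users plus their direct contacts."""
    suspects = set(known_suspicious)
    for user, friends in friend_graph.items():
        if any(friend in known_suspicious for friend in friends):
            suspects.add(user)
    return suspects


# Hypothetical undirected friendship graph stored as an adjacency mapping.
friend_graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "mallory"},
    "carol": {"alice"},
    "mallory": {"bob"},
    "dave": set(),
}
known_suspicious = {"mallory"}

suspects = flag_suspects(friend_graph, known_suspicious)
```

Only direct contacts are flagged here; extending the rule to friends of friends would be a design decision that trades wider coverage for more false positives.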
and finds the suspect user. The suspect user is continuously monitored for criminal activity or behaviour.

REFERENCES

[1] K. Shaukat, S. Luo, S. Chen, & D. Liu (2020). Cyber threat detection using machine learning techniques: A performance evaluation perspective. In 2020 International Conference on Cyber Warfare and Security (ICCWS) (pp. 1-6). IEEE.
[2] V. Tumalavicius, J. Ivančiks, O. Karpishchenko. Issues of society security: public safety under globalization conditions in Lithuania, Journal of Security and Sustainability Issues 4(9): 545–573. https://doi.org/10.9770/jssi.2016.5.4(9)
[3] A. Kipane (2019). Meaning of profiling of cybercriminals in the security context. In SHS Web of Conferences (Vol. 68, p. 01009). EDP Sciences.
[4] M. Rai & B. Pillai (2021). Criminal Activities Predictive Analysis Using Data Mining Techniques. International Journal of Scientific Research & Engineering Trends.
[5] M. Bada & J. R. C. Nurse, "Profiling the Cybercriminal: A Systematic Review of Research," 2021 International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA), 2021, pp. 1-8, doi: 10.1109/CyberSA52016.2021.9478246.
[6] R. Mahajan & V. Mansotra (2021). Correlating crime and social media: using semantic sentiment analysis. Int J Adv Comput Sci Appl.
[7] G. Cascavilla, D. A. Tamburri & W. J. Van Den Heuvel (2021). Cybercrime threat intelligence: A systematic multi-vocal literature review. Computers & Security, 105, 102258.
[8] E. Marttila, A. Koivula & P. Räsäne (2021). Cybercrime Victimization and Problematic Social Media Use: Findings from a Nationally Representative Panel Study. American Journal of Criminal Justice, 46(6), 862-881.
[9] N. Lykousas, V. Koutsokostas, F. Casino & C. Patsakis (2021). The Cynicism of Modern Cybercrime: Automating the Analysis of Surface Web Marketplaces. arXiv preprint arXiv:2105.11805.
[10] M. R. Kumar and P. K. Malathi, "An Innovative Method in Classifying and predicting the accuracy of intrusion detection on cybercrime by comparing Decision Tree with Support Vector Machine," 2022 International Conference on Business Analytics for Technology and Security (ICBATS), 2022, pp. 1-6, doi: 10.1109/ICBATS54253.2022.9759021.
[11] K. Sameera & P. Vishwakarma (2018). Cybercrime: To Detect Suspected User's Chat Using Text Mining. Smart Innovation, Systems and Technologies, 381–390. doi:10.1007/978-981-13-1742-2_37
[12] J. W. Johnsen & K. Franke (2020). Identifying Proficient Cybercriminals Through Text and Network Analysis. 2020 IEEE International Conference on Intelligence and Security Informatics (ISI). https://doi.org/10.1109/isi49825
[13] V. Ford & A. Siraj (2014). Applications of machine learning in cyber security. In Proceedings of the 27th International Conference on Computer Applications in Industry and Engineering (Vol. 118). Kota Kinabalu, Malaysia: IEEE Xplore.
[14] H. Jiang, J. Nagra, & P. Ahammad, "Sok: Applying machine learning in security-a survey," arXiv preprint arXiv:1611.03186, 2016.
[15] E. Hodo, X. Bellekens, A. Hamilton, C. Tachtatzis, & R. Atkinson (2017). Shallow and deep networks intrusion detection system: A taxonomy and survey. arXiv preprint arXiv:1701.02145.
[16] G. Apruzzese, M. Colajanni, L. Ferretti, A. Guido & M. Marchetti (2018). On the effectiveness of machine and deep learning for cyber security. In 2018 10th International Conference on Cyber Conflict (CyCon) (pp. 371-390). IEEE.
[17] S. Sheikhi, M. Kheirabadi, and A. Bazzazi (2020). An effective model for SMS spam detection using content-based features and averaged neural network. International Journal of Engineering, 33(2), 221-228.
[18] F. Mercaldo and A. Santone (2020). Deep learning for image-based mobile malware detection. Journal of Computer Virology and Hacking Techniques, 16(2), 157-171.
[19] K. Biron, W. Mansoor, S. Miniaoui, S. Atalla, H. Mukhtar & K. F. Bin Hashim (2019). Data Science tools for crime investigation, archival, and analysis. In 2019 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI) (pp. 1263-1266). IEEE.
[20] E. Marin, J. Shakarian, & P. Shakarian (2018). Mining key-hackers on darkweb forums. In 2018 1st International Conference on Data Intelligence and Security (ICDIS) (pp. 73-80). IEEE.