Cyber-threat
SYNOPSIS ON
COMPUTER ENGINEERING
Submitted By:
Sign of Guide
Abstract
Cyberspace is one of the most complicated systems ever created by humanity; many people use
cyber-technology resources on a daily basis, yet most of them have little understanding of it. The use of
social media cannot replace the need for security experts to conduct in-depth analyses of specific kinds
of attacks, such as detecting anomalies in network traffic, worms, and port scans.
Analysing social media data can, however, help discover new patterns of cyber threats and
security threats, including data theft, carding, and hijacking. In the proposed system, we use machine
learning to predict cyber threats. Candidate models are trained on a Twitter cyber-threat dataset
using the SVM, NB, DT, RF and ANN algorithms. The best-performing model is then used to predict
whether a tweet indicates a cyber threat and which category it belongs to.
Features of System:
Preparing the dataset
Data Pre-processing
Feature extraction
Classification using Algorithm
Python
Scikit-learn
Pandas
SVM
DT
NB
RF
ANN
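The steps listed above (pre-processing, feature extraction, and classification with each algorithm) can be sketched with scikit-learn as follows. This is only an illustrative sketch: the toy tweets, the labels, the `preprocess` helper, the choice of TF-IDF features, and the use of 2-fold cross-validation accuracy to pick the best model are all assumptions, not details taken from the proposed system.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def preprocess(text):
    """Lowercase a tweet and strip URLs, @mentions, and non-letter characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())


# Hypothetical toy data: 1 = threat-related tweet, 0 = unrelated tweet.
tweets = [
    "New malware campaign is spreading through email attachments",
    "Major data breach exposes customer passwords",
    "Attackers exploit a vulnerability in a popular firewall",
    "Spear phishing emails target bank employees",
    "Lovely weather for a walk in the park today",
    "The football match last night was amazing",
    "Trying a new pasta recipe for dinner",
    "Looking forward to the weekend road trip",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# One candidate classifier per algorithm named in the synopsis.
candidates = {
    "SVM": LinearSVC(),
    "NB": MultinomialNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "ANN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
}

# Score every candidate with the same features and the same folds.
scores = {}
for name, classifier in candidates.items():
    pipeline = make_pipeline(TfidfVectorizer(preprocessor=preprocess), classifier)
    scores[name] = cross_val_score(pipeline, tweets, labels, cv=2).mean()

best_name = max(scores, key=scores.get)
print(scores)
print("Best model:", best_name)
```

On a dataset this small the scores are not meaningful; the sketch only shows how the five algorithms can be compared on equal terms.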
Dataset:-
In the proposed system, we collected a dataset of tweets related to cyber threats from the Kaggle website.
In this dataset, a list of keywords was used to filter the tweets retrieved from the stream listener.
These keywords include the usernames of selected cybersecurity organizations and a list of buzzwords
related to cybersecurity (‘ciphertext’, ‘cryptography’, ‘hacked’, ‘breach’, ‘sniffer’,
‘firewall’, ‘hijacking’, ‘clickjacking’, ‘malware’, ‘spear phishing’, ‘virus’, and ‘vulnerability’)
suggested by cybersecurity domain experts.
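The keyword filtering described above might look like the following sketch. The sample tweets and the `matches_keywords` helper are hypothetical, and the organization usernames are left out because the synopsis does not name them; only the buzzwords come from the list above.

```python
import re

# Buzzwords from the dataset description, normalised to lowercase.
KEYWORDS = [
    "ciphertext", "cryptography", "hacked", "breach", "sniffer", "firewall",
    "hijacking", "clickjacking", "malware", "spear phishing", "virus",
    "vulnerability",
]

# One case-insensitive pattern; \b keeps "virus" from matching inside "antivirus".
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(keyword) for keyword in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def matches_keywords(text):
    """Return True if the tweet text contains at least one buzzword."""
    return PATTERN.search(text) is not None


# Hypothetical tweets standing in for the output of a stream listener.
tweets = [
    "New Malware campaign targets banks",
    "Lovely weather for a walk today",
    "Our firewall blocked a spear phishing attempt",
]
filtered = [tweet for tweet in tweets if matches_keywords(tweet)]
print(filtered)
```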
Algorithms:
SVM (Support Vector Machine):
Support Vector Machine (SVM) is a supervised machine learning approach that is suitable for both
classification and regression problems, although it is used mainly for classification. In the SVM
algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features),
with the value of each feature being the value of a particular coordinate. Classification is then carried
out by finding the hyperplane that best separates the two classes. Support vectors are the coordinates of
the individual observations that lie closest to this boundary, and the SVM classifier is the boundary
(a hyperplane or line) that separates the two classes.
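As a small illustration of the separating hyperplane and the support vectors, the sketch below fits a linear SVM on a handful of made-up 2-D points using scikit-learn's `SVC`. The data and the choice of a linear kernel are assumptions made for the example; text features would have many more dimensions.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable points: n = 2 features per data item.
X = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # class 0
    [5.0, 5.0], [6.0, 5.5], [5.5, 6.5],   # class 1
])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The hyperplane is w . x + b = 0; points are classified by the sign of w . x + b.
w = clf.coef_[0]
b = clf.intercept_[0]

new_points = np.array([[1.2, 1.5], [5.8, 6.0]])
preds = clf.predict(new_points)

print("Support vectors:\n", clf.support_vectors_)
print("w =", w, "b =", b)
print("Predictions:", preds)
```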
DT (Decision Tree):
The goal of using a Decision Tree is to create a training model that can be used to predict the class or
value of the target variable by learning simple decision rules inferred from prior data (training data).
To predict a class label for a record, we start from the root of the tree and, at each node, follow the
branch that matches the record's attribute value until a leaf is reached.
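The prediction step, walking from the root of the tree down to a leaf, can be illustrated with a tiny hand-written tree. The tree, its feature names, and its labels are hypothetical and written by hand here; in practice the rules would be learned from training data.

```python
# A hypothetical decision tree stored as nested dictionaries.
# Internal nodes test one boolean feature; leaves carry a class label.
tree = {
    "feature": "contains_malware",
    "yes": {"label": "threat"},
    "no": {
        "feature": "contains_breach",
        "yes": {"label": "threat"},
        "no": {"label": "not_threat"},
    },
}


def predict(node, record):
    """Start at the root and follow the matching branch until a leaf is reached."""
    while "label" not in node:
        branch = "yes" if record[node["feature"]] else "no"
        node = node[branch]
    return node["label"]


record = {"contains_malware": False, "contains_breach": True}
print(predict(tree, record))
```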
NB (Naive Bayes):
The number of parameters required by Naive Bayes classifiers is linear in the number of variables
(features/predictors) in a learning problem. Maximum-likelihood training can be done in linear time by
evaluating a closed-form expression, rather than by the time-consuming iterative approximation required
by many other types of classifiers.
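To make the closed-form training concrete, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. The multinomial variant, the toy documents, and the function names are assumptions. The `alpha` parameter adds Laplace smoothing, which is an addition to plain maximum likelihood (obtained with `alpha=0`) so that unseen word/class pairs do not get zero probability.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log-priors and log-likelihoods by counting, in a single pass."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict_nb(model, tokens):
    """Pick the class with the highest posterior log-score; unknown words are skipped."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][w] for w in tokens if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical tokenised tweets.
docs = [["malware", "attack"], ["data", "breach"],
        ["nice", "weather"], ["football", "match"]]
labels = ["threat", "threat", "safe", "safe"]

model = train_nb(docs, labels)
print(predict_nb(model, ["malware", "breach"]))
```

Training only counts words per class, so the number of stored parameters grows linearly with the vocabulary size.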
RF (Random Forest):
Random forest is a supervised learning algorithm that can be used for both classification and
regression, although it is mainly used for classification problems. Just as a forest is made up of trees,
and more trees make for a more robust forest, the random forest algorithm builds decision trees on
random samples of the data, obtains a prediction from each of them, and finally selects the best solution
by voting. It is an ensemble method that generally performs better than a single decision tree because it
reduces over-fitting by averaging the results.
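The voting idea can be seen by asking each tree in a fitted forest for its own prediction. The sketch below uses scikit-learn's `RandomForestClassifier` on made-up 2-D points; the data and the number of trees are assumptions. Note that scikit-learn itself combines trees by averaging their predicted class probabilities rather than by counting hard votes, so the explicit vote count here is for illustration.

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical, well-separated points with labels 0 and 1.
X = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [1.0, 2.5],   # class 0
    [5.0, 5.0], [6.0, 5.5], [5.5, 6.5], [6.5, 6.0],   # class 1
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Each of the 25 trees is trained on its own bootstrap sample of the data.
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

sample = np.array([[5.8, 6.0]])

# Individual trees return class indices, which equal the labels here (0 and 1).
votes = Counter(int(tree.predict(sample)[0]) for tree in forest.estimators_)
majority_class = votes.most_common(1)[0][0]
forest_class = int(forest.predict(sample)[0])

print("Votes per class:", dict(votes))
print("Majority vote:", majority_class, "| forest prediction:", forest_class)
```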
ANN (Artificial Neural Network):
An Artificial Neural Network (ANN) is an information processing technique that works in a way loosely
modelled on how the human brain processes information. An ANN consists of a large number of connected
processing units that work together to process information and generate meaningful results from it.
An artificial neural network is typically organized in layers. Layers are made up of many
interconnected ‘nodes’, each of which contains an ‘activation function’. A neural network may contain the
following three layers: (a) input layer, (b) hidden layer and (c) output layer.
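The three-layer structure can be sketched as a single forward pass in plain Python. The layer sizes, the weights, the biases, and the choice of a sigmoid activation are all hypothetical; a real network would learn its weights from the training data.

```python
import math


def sigmoid(x):
    """Activation function applied at every node."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One layer: each node takes a weighted sum of its inputs, then activates."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical fixed parameters: 3 inputs -> 2 hidden nodes -> 1 output node.
hidden_w = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0]]
output_b = [0.0]


def forward(features):
    """Input layer -> hidden layer -> output layer."""
    hidden = dense(features, hidden_w, hidden_b, sigmoid)
    return dense(hidden, output_w, output_b, sigmoid)


output = forward([1.0, 0.0, 1.0])
print(output)
```

With a sigmoid on the output node, the result lies between 0 and 1 and can be read as a score for the positive class.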