1822-b.e-cse-batchno-103
By
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
JEPPIAAR NAGAR, RAJIV GANDHI SALAI,
CHENNAI - 600 119
MARCH-2022
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
BONAFIDE CERTIFICATE
This is to certify that this Project Report is the bonafide work of Henry D Samuel
(REG NO: 38110199) and Santhana Kumar M. (REG NO: 38110503), who carried
out the project entitled "AUTOMATION DETECTION OF MALWARE AND
STENOGRAPHICAL CONTENT USING MACHINE LEARNING" as a team under my
supervision from November 2021 to March 2022.
Internal Guide
Ms. AISHWARYA R M.E.,
DECLARATION
DATE:
ACKNOWLEDGEMENT
I convey my thanks to Dr. T. Sasikala M.E., Ph.D., Dean, School of Computing, and
to Dr. L. Lakshmanan M.E., Ph.D., and Dr. S. Vigneshwari M.E., Ph.D., Heads of the
Department of Computer Science and Engineering, for providing me the necessary
support and details at the right time during the progressive reviews.
ABSTRACT
In recent times, malware attacks have been increasing in our society. Image-based
malware attacks in particular are spreading worldwide, and many people receive
harmful malware-laden images concealed through a technique called
steganography. In the existing system, only overt (non-hidden) malware and files
from the internet are identified.
LIST OF FIGURES
TABLE OF CONTENTS
3 IMPLEMENTATION
3.1 Image Processing
3.1.1 Digital Image Processing
3.1.2 Pattern Recognition
3.2 Basic approaches to malware detection
3.3 Machine learning
3.4 Unsupervised Learning
4 METHODOLOGY
4.1 Methodology
5 RESULT
5.1 Result
5.2 Performance Analysis
CHAPTER 1
INTRODUCTION
No one other than the sender and the receiver can discover the hidden message.
Because the secret message is embedded in the cover file, anyone observing it
sees an ordinary file and does not notice that it contains secret information,
which makes steganography more secure.
Only a person who knows that the cover file contains secret information can
attempt to steal it. Machine learning is the main domain used for modern
steganography purposes, largely because a modern problem needs a modern
solution. Machine learning's powerful prediction algorithms help to find stego
content. They can also be useful for filtering content in the transmission area.
CHAPTER 2
LITERATURE SURVEY
The domain analysis that we have done for the project mainly involved
understanding the neural networks
2.2 TensorFlow:
Features: TensorFlow provides stable Python (for version 3.7 across all
platforms) and C APIs, and, without an API backwards-compatibility guarantee,
C++, Go, Java, JavaScript and Swift (early release). Third-party packages are
available for C#, Haskell, Julia, MATLAB, R, Scala, Rust, OCaml, and
Crystal. "New language support should be built on top of the C API. However,
not all functionality is available in C yet." Some more functionality is provided
by the Python API.
2.3 OpenCV:
OpenCV includes a statistical machine learning library that contains:
Boosting
Decision tree learning
Gradient boosting trees
Expectation-maximization algorithm
k-nearest neighbor algorithm
Naive Bayes classifier
Artificial neural networks
Random forest
Support vector machine (SVM)
Deep neural networks (DNN)
ROS (Robot Operating System). OpenCV is used as the primary vision
package in ROS.
Integrating Vision Toolkit (IVT), a fast and easy-to-use C++ library with an
optional interface to OpenCV.
OpenCV Functionality
Image/video I/O, processing, display (core, imgproc, highgui)
Object/feature detection (objdetect, features2d, nonfree)
Geometry-based monocular or stereo computer vision (calib3d,
stitching, videostab)
Computational photography (photo, video, superres)
Machine learning & clustering (ml, flann)
CUDA acceleration (gpu)
Image-Processing:
processing is the analysis and manipulation of a digitized image, especially in
order to improve its quality.
Digital-Image :
Robotics Application
Localization − Determine robot location automatically
Navigation
Obstacle avoidance
Assembly (peg-in-hole, welding, painting)
Manipulation (e.g. PUMA robot manipulator)
Human Robot Interaction (HRI) − Intelligent robotics to interact with and
serve people
Medicine Application
Classification and detection (e.g. lesion or cell classification and tumor
detection)
2D/3D segmentation
3D human organ reconstruction (MRI or ultrasound)
Vision-guided robotics surgery
Industrial Automation Application
Industrial inspection (defect detection)
Assembly
Barcode and package label reading
Object sorting
Document understanding (e.g. OCR)
Security Application
Biometrics (iris, fingerprint, face recognition)
Surveillance − Detecting certain suspicious activities or behaviors
Transportation Application
Autonomous vehicle
Safety, e.g., driver vigilance monitoring
2.4 Keras:
coding necessary for writing deep neural network code. The code is hosted on
GitHub, and community support forums include the GitHub issues page, and a
Slack channel.
The Keras applications module provides pre-trained models for deep neural
networks. Keras models are used for prediction, feature extraction and
fine-tuning.
Pre-trained models
ResNet
VGG16
MobileNet
InceptionResNetV2
InceptionV3
2.5 NumPy:
dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays. The ancestor of NumPy,
Numeric, was originally created by Jim Hugunin with contributions from
several other developers. In 2005, Travis Oliphant created NumPy by
incorporating features of the competing Numarray into Numeric, with extensive
modifications. NumPy is open-source software and has many contributors.
Python bindings of the widely used computer vision library OpenCV utilize
NumPy arrays to store and operate on data. Since images with multiple
channels are simply represented as three-dimensional arrays, indexing, slicing
or masking with other arrays are very efficient ways to access specific pixels
of an image. The NumPy array as universal data structure in OpenCV for
images, extracted feature points, filter kernels and many more vastly simplifies
the programming workflow and debugging.
possible as it is with Python's lists. The np.pad(...) routine to extend arrays
actually creates new arrays of the desired shape and padding values, copies
the given array into the new one and returns it.
NumPy's np.concatenate([a1,a2]) operation does not actually link the two
arrays but returns a new one, filled with the entries from both given arrays in
sequence. Reshaping the dimensionality of an array with np.reshape(...) is
only possible as long as the number of elements in the array does not change.
These circumstances originate from the fact that NumPy's arrays must be
views on contiguous memory buffers. A replacement package called Blaze
attempts to overcome this limitation.
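The copying behaviour of np.pad, np.concatenate and np.reshape described above can be checked with a short sketch. It assumes NumPy is installed; the array names and values are illustrative only and do not come from the project code.

```python
import numpy as np

a1 = np.array([1, 2, 3])
a2 = np.array([4, 5, 6])

# np.pad returns a new, larger array; the original a1 is left untouched.
padded = np.pad(a1, (1, 2), constant_values=0)

# np.concatenate copies both inputs into a freshly allocated array.
joined = np.concatenate([a1, a2])
joined[0] = 99  # changing the result does not change a1

# np.reshape works only when the element count is preserved (6 == 2 * 3).
grid = np.reshape(joined, (2, 3))

# Requesting a shape with a different element count raises ValueError.
try:
    np.reshape(joined, (4, 2))
    reshape_failed = False
except ValueError:
    reshape_failed = True
```

Because every one of these calls allocates a new array, growing an array inside a loop is costly; collecting the pieces first and concatenating once is usually the cheaper pattern.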
Algorithms that are not expressible as a vectorized operation will typically run
slowly because they must be implemented in "pure Python", while
vectorization may increase memory complexity of some operations from
constant to linear, because temporary arrays must be created that are as large
as the inputs. Runtime compilation of numerical code has been implemented
by several groups to avoid these problems; open source solutions that
interoperate with NumPy include scipy.weave, numexpr and Numba. Cython
and Pythran are static-compiling alternatives to these.
classifies information according to a specific architecture. The network bears a
strong resemblance to statistical methods such as curve fitting and regression
analysis.
Areas of Application
The following are some of the areas where ANNs are being used. They suggest that
ANNs have an interdisciplinary approach in their development and applications.
Speech Recognition
Great progress has been made in this field; however, such systems still face
the problem of limited vocabulary or grammar, along with the issue of
retraining the system for different speakers in different conditions. ANN is
playing a major role in this area. The following ANNs have been used for
speech recognition −
Multilayer networks
Character Recognition
Neocognitron
Signatures are one of the most useful ways to authorize and authenticate a
person in legal transactions. Signature verification technique is a non-vision
based technique.
For this application, the first approach is to extract the feature set, or rather
the geometrical feature set, representing the signature. With these feature sets,
we have to train the neural network using an efficient neural network algorithm.
The trained neural network then classifies the signature as genuine or forged
during the verification stage.
It is one of the biometric methods to identify the given face. It is a typical task
because of the characterization of "non-face" images. However, a well-trained
neural network can divide images into two classes, namely images that have
faces and images that do not.
First, all the input images must be preprocessed. Then, the dimensionality of
each image must be reduced. Finally, it must be classified using a neural
network training algorithm. The following neural networks are used for training
purposes with preprocessed images −
Deep Learning:
Earlier versions of neural networks such as the first perceptrons were shallow,
composed of one input and one output layer, and at most one hidden layer in
between. More than three layers (including input and output) qualifies as
"deep" learning. So deep is not just a buzzword to make algorithms seem like
they read Sartre and listen to bands you haven't heard of yet. It is a strictly
defined term that means more than one hidden layer.
In deep-learning networks, each layer of nodes trains on a distinct set of
features based on the previous layer's output. The further you advance into
the neural net, the more complex the features your nodes can recognize,
since they aggregate and recombine features from the previous layer.
For example, deep learning can take a million images, and cluster them
according to their similarities: cats in one corner, ice breakers in another, and
in a third all the photos of your grandmother. This is the basis of so-called
smart photo albums.
When training on unlabeled data, each node layer in a deep network learns
features automatically by repeatedly trying to reconstruct the input from which
it draws its samples, attempting to minimize the difference between the
network's guesses and the probability distribution of the input data itself.
Restricted Boltzmann machines, for example, create so-called
reconstructions in this manner.
Instead of the image, the computer sees an array of pixels. For example, if the
image size is 300 x 300, the size of the array will be 300x300x3, where the first
300 is the width, the second 300 is the height and 3 is the number of RGB
channel values. Each of these numbers is assigned a value from 0 to 255. This
value describes the intensity of the pixel at each point.
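The array view of an image can be made concrete with a small sketch. It assumes NumPy is installed; the all-black image and the single red pixel are made-up illustration data. Note that NumPy indexes rows (height) before columns (width).

```python
import numpy as np

height, width, channels = 300, 300, 3

# An all-black RGB image: one unsigned 8-bit value (0 to 255) per channel per pixel.
image = np.zeros((height, width, channels), dtype=np.uint8)

# Set the top-left pixel to pure red (R=255, G=0, B=0).
image[0, 0] = [255, 0, 0]

# Total number of intensity values the computer "sees": 300 * 300 * 3.
total_values = image.size
```

A 300 x 300 colour image therefore presents 270,000 separate intensity values to the network.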
To solve this problem, the computer looks for base-level characteristics. In
human understanding, such characteristics are, for example, the trunk or large
ears. For the computer, these characteristics are boundaries or curvatures.
Then, through groups of convolutional layers, the computer constructs more
abstract concepts. In more detail: the image is passed through a series of
convolutional, nonlinear, pooling and fully connected layers, and then
generates the output.
Applications of convolution neural network:
The SGD algorithm is good at identifying harmful websites, but it can make
mistakes when evaluating which ones are safe, labelling them as risky.
[1] The web mining algorithm extracts textual information from web pages and
identifies those associated with terrorism. The main purpose of the system is to
create a website where people may inspect any webpage or website for evidence of
terrorist activity.
[2] The guiding idea is to check whether the feature's equivalent word appears in
the mail. When a good classifier is employed to create the classification model,
the experimental results of this method show high TPR and precision values, and
the false positive rate is kept within an acceptable range.
[3] Linguistic characteristics are critical for distinguishing between users'
writing styles. Sentiment analysis is most effective with content that has a
subjective context, such as a suicide note. These characteristics can be derived
explicitly from the user profile or inferred implicitly using various data mining
tools and methodologies.
[4] Using a document embedding, a decision tree model with gradient boosting
predicts dangerous categories of gathered web pages.
[5] The ROC curve was used as a performance measurement for a classification
task at various thresholds. The false positive rate should be kept as low as
possible, whereas the true positive rate should be maximized.
[6] The analytic bias of hierarchical spatial scaling improves the model's ability
to handle detection problems in documents of varying sizes. In this approach, a
feature extractor parses a series of tokens from HTML pages and a neural network
model makes the judgments.
[7] Cross-channel scripting defence methods follow a website's whole path,
including persistent storage systems. If the tainted information is not sanitized,
an alert is generated. An adversary can use this threat to insert inappropriate
material into the user's embedded system, causing web applications to
malfunction and information to be leaked.
[8] Clustering methods (e.g., k-means, DBSCAN) are used to detect malicious
domains: these unsupervised methods evaluate the data and derive the necessary
details from it.
[9] The latent Dirichlet allocation topic model revealed a pattern in which tweets
about crime are more likely to be positive. However, although a stash of tweets is
available, obtaining a series of old tweets is either impossible or prohibitively
expensive.
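The threshold trade-off mentioned under [5] above can be sketched in a few lines. This is a minimal illustration in plain Python, not the evaluation code of any surveyed paper; the scores and labels are hypothetical.

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, false positive rate, true positive rate) triples.

    A sample is predicted positive when its score >= threshold.
    labels: 1 = malicious (positive), 0 = benign (negative).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / negatives, tp / positives))
    return points


# Hypothetical classifier scores and ground-truth labels, for illustration only.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels, [0.25, 0.5, 0.75])
```

Raising the threshold from 0.25 to 0.75 lowers the false positive rate from 2/3 to 0 in this toy data, at the cost of the true positive rate dropping from 1 to 2/3. Plotting such points over all thresholds gives the ROC curve.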
The key insight underlying our work is that an infected application run will modify
the control flow or data structures compared to a benign run. This will be
reflected in its memory access pattern. This is obvious for the important class of
memory corruption vulnerabilities for code in memory-unsafe programming
languages such as C/C++. The same is true for another important class of
malware, kernel rootkits, which modify control flow in the operating system.
While these two classes are used extensively in this paper, control-flow and/or
data structure modification are intrinsic characteristics of malware. Thus, we
propose hardware monitoring of memory accesses for classifying individual
application runs as being malicious or benign. Since virtual addresses provide for
a more consistent signature than physical addresses, we propose obtaining the
virtual address trace through in-processor monitoring.
This paper also focuses on offline learning but with recent breakthroughs in the
development of machine learning cores, e.g. [21], we believe even online learning
of the detection model is realizable in hardware.
II. Malware and Memory Access Patterns
We now describe certain common types of kernel and user-level malware and how
they affect memory accesses.
Kernel Rootkits: Kernel rootkits modify kernel data structures to redirect control
flow in system calls to malevolent code. The two most common ways are: system
call table modification, which changes a function pointer in the syscall table, and
virtual file system (VFS) function pointer hooking, which replaces function pointers
in the VFS file operation structure.
effective epoch markers for kernel-level malware. Function calls are effective
epoch markers for user-level malware. Using the entire program run as an epoch is
not feasible for continuously running programs such as web browsers but can be
effective for small programs. It still has limitations, and this is discussed
further in §VI.
CHAPTER 3
image analysis.
There are two types of methods used for image processing namely, analogue and
digital image processing. Analogue image processing can be used for the hard
copies like printouts and photographs. Image analysts use various fundamentals of
interpretation while using these visual techniques.
Digital image processing techniques help in the manipulation of digital images by
using computers. The three general phases that all types of data have to undergo
while using digital techniques are pre-processing, enhancement and display, and
information extraction.
due to the increased availability of big data and a new abundance of processing
power. These activities can be viewed as two facets of the same field of
application, and they have undergone substantial development over the past few
decades.
Pattern recognition systems are commonly trained from labeled "training" data.
When no labeled data are available, other algorithms can be used to discover
previously unknown patterns. KDD and data mining have a larger focus on
unsupervised methods and stronger connection to business use. Pattern
recognition focuses more on the signal and also takes acquisition and Signal
Processing into consideration. It originated in engineering, and the term is popular
in the context of computer vision: a leading computer vision conference is
named Conference on Computer Vision and Pattern Recognition.
Pattern recognition algorithms generally aim to provide a reasonable answer for all
possible inputs and to perform "most likely" matching of the inputs, taking into
account their statistical variation. This is opposed to pattern matching algorithms,
which look for exact matches in the input with pre-existing patterns. A common
example of a pattern-matching algorithm is regular expression matching, which
looks for patterns of a given sort in textual data and is included in the search
capabilities of many text editors and word processors.
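As a small illustration of exact pattern matching, the sketch below uses Python's standard `re` module. The e-mail pattern and the sample text are simplified, made-up examples rather than a complete address grammar.

```python
import re

# A deliberately simple e-mail pattern, for illustration only.
email_pattern = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

text = "Contact admin@example.com or visit example.com; backup: ops@example.org"

# Substrings either match the pattern exactly or are ignored entirely.
emails = email_pattern.findall(text)

# A near miss produces no match at all; there is no "most likely" answer.
near_miss = email_pattern.findall("admin(at)example.com")
```

A pattern recognition system, by contrast, would be expected to give a best guess even for the obfuscated `admin(at)example.com` input.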
component of every cybersecurity product. Malware recognition modules decide if
an object is a threat, based on the data they have collected on it. This data may be
collected at different phases:
• Pre-execution phase data is anything you can tell about a file without executing
it. This may include executable file format descriptions, code descriptions, binary
data statistics, text strings and information extracted via code emulation and
other similar data.
• Post-execution phase data conveys information about behavior or events caused
by process activity in a system.
In the early part of the cyber era, the number of malware threats was relatively
low, and simple manually created pre-execution rules were often enough to detect
threats. The rapid rise of the Internet and the ensuing growth in malware meant
that manually created detection rules were no longer practical, and new, advanced
protection technologies were needed. Anti-malware companies turned to machine
learning, an area of computer science that had been used successfully in image
recognition, searching and decision-making, to augment their malware detection
and classification. Today, machine learning boosts malware detection using
various kinds of data on host, network and cloud-based anti-malware components.
Machine learning: concepts and definitions
clustering. Clustering is a task that includes splitting a data set into groups of
similar objects. Another task is representation learning, which includes building
an informative feature set for objects based on their low-level description (for
example, an autoencoder model).
Machine Learning Methods for Malware Detection
In this paper, we summarize our extensive experience using machine learning to
build advanced protection for our customers.
Unsupervised learning
Large unlabeled datasets are available to cybersecurity vendors and the cost of
their manual labeling by experts is high, which makes unsupervised learning
valuable for threat detection. Clustering can help to optimize efforts for the
manual labeling of new samples. With informative embedding, we can decrease the
number of labeled objects needed for the next machine learning approach in our
pipeline: supervised learning.
Supervised learning is a setting that is used when both the data and the right
answers for each object are available. The goal is to fit the model that will
produce the right answers for new objects.
the model to new objects. In this phase, the type of the model and its parameters
do not change. The model only produces predictions. In the case of malware
detection, this is the protection phase. Vendors often deliver a trained model to
users where the product makes decisions based on model predictions
autonomously. Mistakes can cause devastating consequences for a user – for
example, removing an OS driver. It is crucial for the vendor to select a model
family properly. The vendor must use an efficient training procedure to find the
model with a high detection rate and a low false positive rate.
collect a training set, and we overlook the fact that, by chance, all files in it
larger than 10 MB are malware and none are benign (which is certainly not true for
real world files). While training, the model will exploit this property of the
dataset, and will learn that any file larger than 10 MB is malware. It will use
this property for detection. When this model is applied to real world data, it
will produce many false positives. To prevent this outcome, we need to add benign
files with larger sizes to the training set. Then, the model will not rely on an
erroneous data set property.
Generalizing this, we must train our models on a data set that correctly
represents the conditions where the model will be working in the real world. This
makes the task of collecting a representative dataset crucial for machine learning
to be successful.
model that allows us to fix false-positives on the fly, without completely retraining
the model. Examples of this are implemented in our pre- and post-execution
models, which are described in the following sections.
Outside the malware detection domain, machine learning algorithms regularly work
under the assumption of a fixed data distribution, which means that it doesn't
change with time. When we have a training set that is large enough, we can train
the model so that it will reason effectively about any new sample in a test set.
As time goes on, the model will continue working as expected. After applying
machine learning to malware detection, we have to face the fact that our data
distribution isn't fixed:
• Active adversaries (malware writers) constantly work on avoiding detections and
releasing new versions of malware files that differ significantly from those that
have been seen during the training phase.
• Thousands of software companies produce new types of benign executables that
are significantly different from previously known types. The data on these types
was lacking in the training set, but the model, nevertheless, needs to recognize
them as benign.
This causes serious changes in data distribution and raises the problem of
detection rate degradation over time in any machine learning implementation.
Cybersecurity vendors that implement machine learning in their antimalware
solutions face this problem and need to overcome it. The architecture needs to be
flexible and has to allow model updates 'on the fly' between retraining. Vendors
must also have effective processes for collecting and labeling new samples,
enriching training datasets and regularly retraining models.
representative sequence of bytes or other features indicating malware. During the
detection, an antiviral engine in a product checked the presence of the malware
fingerprint in a file against known malware fingerprints stored in the antivirus
database. However, malware writers invented techniques like server-side
polymorphism. This resulted in a flow of hundreds of thousands of malicious
samples being discovered every day. At the same time, the fingerprints used were
sensitive to small changes in files. Minor changes in existing malware took it off
the radar. The previous approach quickly became ineffective because:
• Creating detection rules manually couldn't keep up with the emerging flow of
malware.
• Checking each file's fingerprint against a library of known malware meant that
you couldn't detect new malware until analysts manually create a detection rule.
We were interested in features that were robust against small changes in a file.
These features would detect new modifications of malware, but would not require
more resources for calculation. Performance and scalability are the key priorities
of the first stages of anti-malware engine processing. To address this, we focused
on extracting features that could be:
• calculated quickly, like statistics derived from file byte content or code
disassembly
• directly retrieved from the structure of the executable, like a file format
description.
Using this data, we calculated a specific type of hash functions called
locality-sensitive hashes (LSH). Regular cryptographic hashes of two almost
identical files differ as much as hashes of two very different files. There is no
connection between the similarity of files and their hashes. However, LSHs of
almost identical files map to the same binary bucket (their LSHs are very similar)
with high probability. LSHs of two different files differ substantially.
But we went further. The LSH calculation was unsupervised. It didn't take into
account our additional knowledge of each sample being malware or benign. Having a
dataset of similar and non-similar objects, we enhanced this approach by
introducing a training phase. We implemented a similarity hashing approach. It's
similar to LSH, but it's supervised and capable of utilizing information about
pairs of similar and non-similar objects. In this case:
• Our training data X would be pairs of file feature representations [X1, X2]
• Y would be the label that would tell us whether the objects were actually
semantically similar or not.
• During training, the algorithm fits parameters of hash mapping h(X) to maximize
the number of pairs from the training set, for which h(X1) and h(X2) are identical
for similar objects and different otherwise.
This algorithm that is being applied to executable file features provides specific
similarity hash mapping with useful detection capabilities. In fact, we train
several versions of this mapping that differ in their sensitivity to local
variations of different sets of features. For example, one version of similarity
hash mapping could be more focused on capturing the executable file structure,
while paying less attention to the actual content. Another could be more focused
on capturing the ASCII-strings of the file. This captures the idea that different
subsets of features could be more or less discriminative to different kinds of
malware files. For one of them, file content statistics could reveal the presence
of an unknown malicious packer. For the others, the most important piece of
information regarding potential behavior is concentrated in strings representing
used OS API, created file names, accessed URLs or other feature subsets. For more
precise detection in products, the results of a similarity hashing algorithm are
combined with other machine learning-based detection methods.
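The contrast between cryptographic and locality-sensitive hashes can be illustrated with a short sketch. The text does not say which LSH scheme is used, so SimHash over byte shingles is swapped in here as one well-known example. The supervised similarity hashing described above is not reproduced, and the "files" are synthetic byte strings.

```python
import hashlib


def simhash(data: bytes, shingle: int = 4, bits: int = 64) -> int:
    """SimHash over overlapping byte shingles (an illustrative LSH choice)."""
    weights = [0] * bits
    for i in range(len(data) - shingle + 1):
        digest = hashlib.blake2b(data[i:i + shingle], digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        for b in range(bits):
            weights[b] += 1 if (value >> b) & 1 else -1
    # Each output bit is the sign of the accumulated vote for that position.
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(x: int, y: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(x ^ y).count("1")


# Synthetic "files": the second differs from the first by a single byte.
original = bytes(range(256)) * 8
changed = bytearray(original)
changed[1000] ^= 0xFF
modified = bytes(changed)

sha_a = hashlib.sha256(original).hexdigest()
sha_b = hashlib.sha256(modified).hexdigest()
lsh_distance = hamming(simhash(original), simhash(modified))
```

The two SHA-256 digests are unrelated even though only one byte changed. The SimHash values typically differ in only a few of their 64 bits, which is what lets near-duplicates fall into the same or neighbouring buckets.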
CHAPTER 4
4.1 METHODOLOGY
4.1.1 TRAINING MODULE:
Supervised machine learning: It is one of the ways of machine learning where the
model is trained by input data and expected output data. To create such a model,
it is necessary to go through the following phases:
1. model construction
2. model training
3. model testing
4. model evaluation
optimizer='name_of_optimizer_alg')
The loss function measures the error of each prediction made by the model, and
so indicates how accurate the model is.
Before model training it is important to scale data for their further use.
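The report does not say how the data are scaled. One common choice for 8-bit image data is min-max scaling of pixel intensities to the range 0 to 1, sketched below in plain Python. The function name `min_max_scale` and the sample values are illustrative assumptions.

```python
def min_max_scale(values, lo=0.0, hi=255.0):
    """Map values from the range [lo, hi] onto [0.0, 1.0]."""
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical pixel intensities from an 8-bit image.
pixels = [0, 51, 102, 255]
scaled = min_max_scale(pixels)
```

Keeping all inputs in a small, common range tends to make gradient-based training better behaved than feeding raw 0 to 255 values.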
4.1.2 SEGMENTATION
Image classification is the process of taking an input (like a picture) and
outputting its class or the probability that the input is a particular class.
Neural networks are applied in the following steps:
1. One hot encode the data: A one-hot encoding can be applied to the integer
representation. This is where the integer encoded variable is removed and
a new binary variable is added for each unique integer value.
2. Define the model: Put very simply, a model is nothing but a function that
takes in a certain input, performs a certain operation on it to the best of its
ability (learning and then predicting/classifying) and produces a suitable
output.
3. Compile the model: The optimizer controls the learning rate. We will be using
'adam' as our optimizer. Adam is generally a good optimizer to use for many
cases. The adam optimizer adjusts the learning rate throughout training. The
learning rate determines how fast the optimal weights for the model are
calculated. A smaller learning rate may lead to more accurate weights (up to a
certain point), but the time it takes to compute the weights will be longer.
4. Train the model: Training a model simply means learning (determining) good
values for all the weights and the bias from labeled examples. In supervised
learning, a machine learning algorithm builds a model by examining many
examples and attempting to find a model that minimizes loss; this process is
called empirical risk minimization.
5. Test the model
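The one-hot encoding in step 1 of the list above can be sketched in plain Python. The three class indices and their meanings are hypothetical and are not taken from the project's dataset.

```python
def one_hot(labels, num_classes):
    """Replace each integer label with a binary vector of length num_classes."""
    encoded = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1  # exactly one position is set per sample
        encoded.append(row)
    return encoded


# Hypothetical class indices: 0 = clean image, 1 = stego image, 2 = malware image.
labels = [0, 2, 1, 2]
encoded = one_hot(labels, num_classes=3)
```

In Keras, `keras.utils.to_categorical` performs the same conversion on arrays of integer labels.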
A convolutional neural network convolves learned features with input data and
uses 2D convolution layers.
Convolution Operation:
Here are the three elements that enter into the convolution operation:
Input image
Feature detector
Feature map
You place the feature detector over the input image beginning from the top-left
corner within the borders you see demarcated above, and then you count the
number of cells in which the feature detector matches the input image.
The number of matching cells is then inserted in the top-left cell of the feature
map.
You then move the feature detector one cell to the right and do the same thing.
This movement is called a stride, and since we are moving the feature detector
one cell at a time, that would be called a stride of one pixel.
What you will find in this example is that the feature detector's middle-left cell
with the number 1 inside it matches the cell that it is standing over inside the
input image. That's the only matching cell, and so you write "1" in the next cell
in the feature map, and so on and so forth.
After you have gone through the whole first row, you can then move it over to
the next row and go through the same process.
There are several uses that we gain from deriving a feature map. These are the
most important of them: Reducing the size of the input image, and you should
know that the larger your strides (the movements across pixels), the smaller
your feature map.
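The sliding-and-counting procedure described above can be sketched in plain Python. The sketch assumes a binary image and a binary feature detector, where a "match" means both cells hold a 1. The 4x4 image and 2x2 detector are made-up examples, not the figure from the report.

```python
def feature_map(image, detector, stride=1):
    """Slide the detector over a binary image and count cells where both hold a 1."""
    k = len(detector)
    rows = (len(image) - k) // stride + 1
    cols = (len(image[0]) - k) // stride + 1
    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            top, left = r * stride, c * stride
            matches = sum(
                image[top + i][left + j] * detector[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(matches)
        result.append(row)
    return result


image = [
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
]
detector = [
    [0, 1],
    [1, 1],
]
fmap_stride1 = feature_map(image, detector, stride=1)
fmap_stride2 = feature_map(image, detector, stride=2)
```

With a stride of one the 4x4 input yields a 3x3 feature map, and with a stride of two it yields a 2x2 map, which illustrates why larger strides give smaller feature maps.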
ReLU Layer:
them as 0's. The purpose of applying the rectifier function is to increase the
non-linearity in our images. The reason we want to do that is that images are
naturally non-linear. The rectifier serves to break up the linearity even further
in order to make up for the linearity that we might impose on an image when we
put it through the convolution operation. What the rectifier function does to an
image like this is remove all the black elements from it, keeping only those
carrying a positive value (the grey and white colors). The essential difference
between the non-rectified version of the image and the rectified one is the
progression of colors. After we rectify the image, you will find the colors
changing more abruptly. The gradual change is no longer there. That indicates
that the linearity has been disposed of.
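A minimal sketch of the rectifier applied to a feature map, in plain Python; the input values are invented for illustration.

```python
def relu(feature_map):
    """Replace every negative entry with 0 and keep all other entries unchanged."""
    return [[max(0, v) for v in row] for row in feature_map]


# A made-up feature map containing negative (dark) and positive (bright) values.
fmap = [[-3, 1, 0], [2, -1, 4]]
rectified = relu(fmap)
```

Only the negative entries are changed, so positive responses pass through untouched while everything below zero is clipped.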
Pooling Layer:
The pooling (POOL) layer reduces the height and width of the input. It helps
reduce computation, and it helps make feature detectors more invariant
to their position in the input. This process is what provides the convolutional
neural network with the "spatial invariance" capability. In addition to that,
pooling serves to minimize the size of the images as well as the number of
parameters which, in turn, helps prevent "overfitting".
Overfitting, in a nutshell, is when you create an excessively complex model in
order to account for the idiosyncrasies of the training data. The result of using
a pooling layer and creating down-sampled or pooled feature maps is a
summarized version of the features detected in the input. They are useful because
small changes in the location of the feature in the input detected by the
convolutional layer will result in a pooled feature map with the feature in the
same location. This capability added by pooling is called the model's
invariance to local translation.
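The following sketch shows one common pooling variant, max pooling, on a small grid. It is an assumption that max pooling is the variant intended here; the window size, the stride and the name max_pool are illustrative choices.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep the largest value inside each size x size window."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# The same feature (value 9) at two slightly different positions.
shifted_a = [[9, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
shifted_b = [[0, 0, 0, 0], [0, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

pooled_a = max_pool(shifted_a)
pooled_b = max_pool(shifted_b)
```

Both inputs pool to the same 2x2 map because the feature moved only within one pooling window, which is the local translation invariance described above.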
The role of the artificial neural network is to take this data and
combine the features into a wider variety of attributes that make the
convolutional network more capable of classifying images, which is the whole
purpose of creating a convolutional neural network. It has neurons linked
to each other; a neuron activates if it identifies a pattern and sends signals to the
output layer. The output layer gives the output class based on the weight values. For
now, all you need to know is that the loss function informs us of how
accurate our network is, which we then use in optimizing our network in order
to increase its effectiveness. That requires certain things to be altered in our
network. These include the weights (the blue lines connecting the neurons,
which are basically the synapses) and the feature detector, since the network
often turns out to be looking for the wrong features and has to be
reviewed multiple times for the sake of optimization. This full connection
process practically works as follows:
The neuron in the fully-connected layer detects a certain feature; say, a nose.
4.3 TESTING
Testing Objectives:
There are several rules that can serve as testing objectives. They are:
Testing is a process of executing a program with the intent of finding an error.
A good test case is one that has a high probability of finding an
undiscovered error.
Types of Testing:
In order to make sure that the system does not have errors,
different levels of testing strategies are applied at different phases of
software development:
Unit Testing:
Code Inspection
Performance error
In this testing only the output is checked for correctness.
In this the test cases are generated on the logic of each module by
drawing flow graphs of that module and logical decisions are tested on all the
cases.
It has been used to generate the test cases in the following cases:
Execute all loops at their boundaries and within their operational bounds.
Integration Testing
System Testing
System testing involves in-house testing of the entire system before delivery to the user.
The aim is to satisfy the user that the system meets all requirements of the
client's specifications. It is conducted by the testing organization if a company
has one. Test data may range from and generated to production.
Inclusion of changes/fixes.
One common approach is graduated testing: as system testing progresses
and (hopefully) fewer and fewer defects are found, the code is frozen for
testing for increasingly longer time periods.
Acceptance Testing
Requirements traceability:
Model training:
After model construction it is time for model training. In this phase, the model is
trained using training data and the expected output for this data. It looks like this:
model.fit(training_data, expected_output). Progress is visible on the console
when the script runs.
At the end it will report the final accuracy of the for training the first level
classifier and 200 attacks and corresponding benign versions for the second
level training. The remaining 200 pairs were used in testing. MAP only needs
one training phase. Thus, 400 attacks and corresponding benign versions were
used for training and the remaining 351 pairs are used for testing. We developed
a "pintool" for Pin version .
Function Call vs Entire-Program Epoch: Since using the entire program run as
an epoch gives reasonable results with MAP's feature sets, it is worth comparing
this with the function call epoch. The main problem with using the
entire-program-run epoch is that several applications run for an indeterminate
amount of time (e.g. a web browser). Summarizing over the entire program run
would require the program to finish which, if not making the training phase
impossible, would
likely add error as the malicious part could be a small part of the run. In contrast,
by summarizing over a function call, the training data is easier to collect and
more accurate. For applications that do not execute continuously, summarizing
over the entire run is feasible.
RIPE has this characteristic. To evaluate the value of using function call as
epoch for it, we built the memory access histograms for the entire program run,
and trained them using different classifiers.
There are three kinds of features based on: (i) architectural events, (ii) memory
addresses, and (iii) the instruction mix. MAP collects data for every 10K
instructions (its epoch) to form feature vectors and each feature vector is labeled
as malicious/benign based on the program being executed. The detection model
is then trained to label these 10K-instruction epochs as malicious/benign.
Model Testing:
During this phase a second set of data is loaded. This data set has never been
seen by the model and therefore its true accuracy will be verified. After the
model training is complete, and it is understood that the model shows the right
result, it can.
QEMU was used to execute a target machine which ran Debian "Squeeze" with
Linux kernel v2.6.32. In a rootkit-infected system, the behavior of a system utility
is changed to meet the purpose of the rootkit, e.g., ps may hide malicious
processes.
Both benign and malicious traces were collected by running the following system
utilities: ls, ps, lsmod, netstat. The utilities were executed with different current
directories, background processes, arguments, etc. for a total of 50 runs for each
rootkit. Both benign and malicious memory traces are summarized using the
system call epoch.
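To make the idea of a summary histogram concrete, the sketch below counts memory accesses per address bin for one epoch. It assumes the histogram groups raw addresses into fixed-size bins; the bin size of 4096, the address list and the name access_histogram are illustrative assumptions rather than details of the original tooling.

```python
from collections import Counter


def access_histogram(addresses, bin_size=4096):
    """Count how many accesses in one epoch fall into each address bin."""
    return Counter(address // bin_size for address in addresses)


# Hypothetical memory addresses accessed during a single epoch.
epoch_addresses = [0, 100, 4096, 8191, 8192]
histogram = access_histogram(epoch_addresses)
```

Each epoch then yields one histogram, which can be turned into a fixed-length feature vector and labeled benign or malicious for training.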
A detection model was trained for each kind of affected system call. knark and
override had to be modified to run on our target system. The training set
contained the rootkits avg coder, adore-ng, kbeast and AFkit (bold in Table I).
We then test the ability of the learned model to distinguish between infected and
benign systems on the remaining rootkits. The machine learning algorithm is
trained on the 4 rootkits, but is asked to detect 6 rootkits it has never seen before.
These experiments demonstrate that our framework can detect new malware.
Detection Results: Figure 3 shows the rootkit detection results using our
detection framework as a Receiver-Operating Characteristic (ROC) graph. The x-
axis shows the false positive rate and the y-axis shows true positive rate. Points
on the graph show the achieved detection rate at a certain false positive rate.
The graphs show the classification results for sys_read and sys_getdents using
different machine learning classifiers with a 4k histogram bin size. Detection
performance is not sensitive to histogram bin size: 1k and 16k bin sizes yield
similar results. For both system calls, the best performing classifier (random
forest) reaches 100% true positive rate, i.e., detects all attacks, at < 1% false
positive rate.
B. Case Study: User Level Memory Corruption Malware
In this section, we focus on user-level applications. We provide a direct comparison with
Malware-Aware Processors (MAP) [17], which also targets user-level programs.
This suite was executed on a Linux system running the Ubuntu 6.06 distribution with
kernel version 2.6.15. We also created a "benign" version of RIPE where each of
the attack targets is patched.
Methodology: Among the 850 RIPE attacks, 751
were successful on our target system. For our two-level classification model, we
used 351 attacks and corresponding benign versions
Our first step is to convert the normal image into a binary representation. If some
kind of stego content is present, we can identify it using the machine learning
algorithm. If the image does not pass this test, it undergoes the next step
of the process, which is extraction to form the hex code.
RGB values of pixels in the cover picture are converted into their corresponding
octal values. As a cover for the payload, different image formats were used, and
analytic values were calculated for the red, green, and blue channels of the input
image and the resulting image, as well as the difference between the average pixel
values of both images.
Our algorithm next converts the images to grayscale to create a separate
channel for each image. The stego content in the image is encrypted with different types
of algorithms, so the message cannot be decoded, but we can identify the
malware content. Many hazardous contents can be sent with an image file and
can be used to compromise anyone's personal system. The algorithm examines each individual
channel of the RGB layer and identifies the spyware.
In Fig. 1, the picture is represented by the red, green, and blue channels of the
image. The value of the red channel is 137.488, the value of the green channel is
173.29, and the value of the blue channel is 206.06.
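Per-channel averages like the ones quoted above can be computed as sketched here. The sketch assumes the image is available as a flat list of (R, G, B) tuples; the pixel values and the name channel_means are illustrative and do not reproduce the figure's data.

```python
def channel_means(pixels):
    """Return the average red, green and blue value over all pixels."""
    count = len(pixels)
    return tuple(sum(pixel[ch] for pixel in pixels) / count for ch in range(3))


# Two hypothetical pixels given as (red, green, blue).
means = channel_means([(0, 10, 20), (100, 30, 40)])
```

Comparing these averages between the cover image and the suspected stego image gives the per-channel difference discussed in this chapter.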
Fig 4.2. Output image
The output displays the changes in the image due to hiding messages. The amount of change
depends on the number of bits (1, 2, or 3) used and the technique (LSB or MSB)
used. A difference of 1.654% is shown using only 1-bit substitution on the red
channel, and a difference of 0.605 and 0.57 on the two channels (red and green).
Therefore, if LSB is used with a one-bit substitution, the least amount will appear
on only one channel (blue). From a significant change in the bits of the image, we
are able to find the stego content embedded in the image.
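The effect of one-bit LSB substitution on a channel can be sketched as follows. This is an illustration under assumptions: the channel is a list of 8-bit integers, the payload is a list of 0/1 bits no longer than the channel, and the helper names embed_lsb and mean_abs_change are hypothetical.

```python
def embed_lsb(values, bits):
    """Replace the least significant bit of each value with a payload bit."""
    out = list(values)
    for index, bit in enumerate(bits):
        out[index] = (out[index] & ~1) | bit  # clear the LSB, then set it
    return out


def mean_abs_change(original, modified):
    """Average absolute difference between two equally long channels."""
    total = sum(abs(a - b) for a, b in zip(original, modified))
    return total / len(original)


# Hypothetical blue-channel values and a 4-bit payload.
blue = [206, 207, 200, 201]
stego_blue = embed_lsb(blue, [1, 1, 0, 0])
change = mean_abs_change(blue, stego_blue)
```

Each value moves by at most 1, so the average change stays small, which is why one-bit LSB substitution is hard to see but can still be measured statistically.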
This algorithm uses a classification method based on logistic classification and
linear classification. This system allows separating the files which contain stego
content. This algorithm is based on the steganalysis method.
The substitution method makes the problem easier to solve. A change in the
least significant bit of the image indicates that the file may contain
stego content. This RGB layer identification method can be divided into the
following steps, shown in figure 5.
Step 1: For i = 1 to k
[p(1|dj) · (1 − p(1|dj))]
Step 4: Initialize the weight of instance dj to p(1|dj) · (1 − p(1|dj)).
Three commonly used classifiers are used in our work: logistic regression, SVM and
random forest. We consider two design points for the classifiers. Direct
Classification of Memory Access Histograms: This performs binary classification
directly on the summary histograms computed in each epoch. Empirically, this was
effective in detecting kernel-level malware. During the training phase, each system
call affected by a rootkit is labelled as malicious, while other system calls are
labelled benign (see footnote 1). Since rootkits corrupt syscall execution, the classifier then learns
to distinguish between benign and malicious syscalls.
Weighted Classification of Memory Access Histograms: For user-level malware
the epoch boundaries correspond to function calls, which unlike system calls are
not limited in number and are different across programs. Therefore, detecting
malware by analyzing the histogram of a single function is hard.
Using the MAP Features in Our Detection Scenario: MAP provides us with
potential features for malware detection. However, their framework is not
applicable to our detection scenario as it is designed to label different programs
while ours detects whether one particular application is infected by malware. In
most malware infected program runs, the malicious behavior only occurs at certain
phases of the execution, the other phases are normal program behavior.
This makes it hard to label the 10K-instruction epochs correctly without manual
intervention. If we label all the epochs in a malware infected run as malicious,
most epochs that reflect normal program behavior would be wrongly labeled,
resulting in huge training error. Therefore, instead of using the 10K-instruction
epoch, we use the entire program run as the epoch. Although it is hard to correctly
label the 10K-instruction epochs, any malicious behavior is reflected in the feature
vectors of entire malicious runs, so labels can be correctly applied. These feature
vectors are collected over the entire run and the classifier is trained to label the
entire run. Data collection for the MAP features was also done using our "pin tool".
The second training phase uses a different training dataset. Assume we have n
models for the n function calls and each model m1 to mn provides its
classification result. An accuracy rate is assigned to each model. We discard
weak models with accuracy below 60% as they have poor correlation between the
run being malicious and the function being malicious.
Footnote 1: In this scenario we need some human input to identify the system calls
affected by each rootkit used in training. This is relatively easy compared to the
manual analysis required for the techniques in [19], [15], [12].
Assume that k models, M1 to Mk, are left, and they have accuracy rates r1 to rk.
We assign a weight $w_i = \frac{r_i - 0.5}{\sum_{j=1}^{k} (r_j - 0.5)}$ to each model.
Assume the classification result from each $M_i$ is $c_i$ (0 for benign, 1 for
malicious; if $f_i$ is called multiple times, $c_i$ is the average value of all the
classification results). The classifier result is defined as the weighted sum over the
models, $C = \sum_{i=1}^{k} c_i w_i$. If C is above a threshold T,
decided in the second phase, the run is classified as malicious; otherwise it is
classified as benign.
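The weighting scheme above can be illustrated with a short sketch. It assumes the weights are normalized by the sum of (r_j - 0.5) over the kept models and that models below 60% accuracy are discarded first; the example accuracies, votes, threshold and the name weighted_vote are hypothetical.

```python
def weighted_vote(accuracies, votes, threshold, min_accuracy=0.6):
    """Combine per-function model outputs into one malicious/benign decision."""
    # Discard weak models whose accuracy is below the cut-off.
    kept = [(r, c) for r, c in zip(accuracies, votes) if r >= min_accuracy]
    if not kept:
        raise ValueError("no model meets the minimum accuracy")
    total = sum(r - 0.5 for r, _ in kept)
    # Weighted sum C = sum of c_i * w_i with w_i = (r_i - 0.5) / total.
    score = sum(c * (r - 0.5) / total for r, c in kept)
    return score, score > threshold


# Three hypothetical models; the third is discarded (accuracy 0.55 < 0.6).
score, is_malicious = weighted_vote(
    accuracies=[0.9, 0.7, 0.55],
    votes=[1, 0, 1],
    threshold=0.5,  # assumed value; the text derives T in the second phase
)
```

Models that are only slightly better than chance receive weights close to zero, so the decision is dominated by the most reliable function-level models.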
Let the standard for good detection be > 95% true positive rate with false positives
< 5%. Two feature sets/detection methods meet these criteria. Ours using two-
level classification of memory accesses in the function call epoch not only meets
the standard but also has the highest accuracy under any false positive rate.
Among the MAP feature sets, only INS2 with the logistic regression classifier is
slightly above the standard.
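The detection standard stated above can be checked mechanically once classifier scores are available. The sketch below computes one ROC point for a given decision threshold and tests it against the criterion of a true positive rate above 95% with a false positive rate below 5%. The scores, labels and function names are illustrative assumptions, not experimental data.

```python
def roc_point(scores, labels, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    false_pos = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return false_pos / negatives, true_pos / positives


def meets_standard(fpr, tpr):
    """Criterion from the text: TPR above 95% with FPR below 5%."""
    return tpr > 0.95 and fpr < 0.05


# Hypothetical classifier scores; label 1 = malicious, 0 = benign.
scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]
labels = [1, 1, 1, 0, 0, 0]

strict = roc_point(scores, labels, threshold=0.5)
loose = roc_point(scores, labels, threshold=0.15)
```

Sweeping the threshold over all score values and plotting the resulting points gives the ROC graph used to compare the feature sets.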
Table III shows the comparison between our method ("FUNC") and "COMB",
"INS2" and "MEM1", which are the most successful among the MAP features. We
see that for each false positive rate (FP), our method has the best true positive rate
(TP). Also, as the allowed false positive rate decreases, our method maintains a
good detection rate while others drop quickly. These results show the strength of
our method, especially in the very low false positive rate regime.
The results are presented using ROC graphs. From the graphs, we see that when
using small histogram bin sizes, the two epochs perform similarly. But when the
bin size is larger, the results of the function epoch stay mostly the same while the
detection rate using the entire-program epoch deteriorates.
Thus, the function call epoch is resilient to changes in histogram bin size, while
with the entire-program epoch, the histogram bin size needs to be chosen
carefully. This may need human input and may possibly differ between
applications.
Fig. 2: [Figure. (a) Training: program, summary histograms for epochs, offline
histogram storage, train classifier, trained model. (b) Operation: exec, verify
binary signature, load model to HW classifier, HW execution monitor,
authenticated handler, malware detected.]
randomized offset to their initial addresses. Since this offset is known to the
operating system, ASLR can be "de-randomized".
CHAPTER 5
5.1 RESULT
We experimentally covered both kernel and user level threats and demonstrated
very high detection accuracy against kernel level rootkits (100% detection rate with
less than 1% false positives) and user level memory corruption attacks (99.0%
detection rate with less than 5% false positives).
This first success signal, based on a limited sample size, needs to be confirmed in
bigger datasets. The method demonstrates how EHR notes for adolescent suicide
attempts can be enhanced with clinically relevant information to identify children
with a history of suicide risk. This can aid in organizing patients' health care during a
highly vulnerable period for this high-risk demographic.
Fig 5.1. LSB Graph
Fig 5.4. Output image
5.2 PERFORMANCE ANALYSIS
Performance analysis

Paper | Technique | Limitation | Result
Detection of LSB Steganography Based on Distribution of Pixel Differences in Natural Images | Statistical Distribution of Pixel Differences | It can detect only using gray scale images | It can detect gray scale with high accuracy but not stego content.
An analysis of LSB Based Image Steganography Technique | LSB Replacement algorithm | It cannot include more sized images. | Not mentioned
Detection of LSB Replacement and LSB Matching Steganography Using Gray Level Run Length Matrix | LSB steganography methods | It can only identify by converting gray scale images | 82% stego accuracy
CHAPTER 6
To curb the menace of terrorism and to destroy the online presence of dangerous
terrorist organizations like ISIS and other radicalization websites, we need a
proper system to detect and terminate websites which spread harmful
content used to radicalize youth and helpless people. We analysed the usage of
Online Social Networks (OSNs) in the event of a terrorist attack. We used different
metrics such as the number of tweets, whether users in developing countries tended to
tweet, re-tweet or reply, demographics, and geo-location, and we defined new metrics
(reach and impression of the tweet) and presented their models. While
developing countries face many limitations in using OSNs, such as
unreliable power and poor Internet connection, the study's findings still challenge
the traditional media of reporting during disasters like terrorist attacks. We
recommend centers globally to make full use of OSNs for crisis communication
in order to save more lives during such
6.2 CONCLUSION
In this paper we examined approaches to a data steganography system
and present the results obtained from them in
the form of bar graphs. These graphs are collated according to the results
produced by the different methods.
Based on our evaluation, to preserve the quality of the resulting
picture we must use the JPG format as the cover for the container,
and we must hide only one least significant bit, since our trial showed
a difference of 0.081 between its average value and the cover. There is no
significant difference between LSB and MSB for other picture format combinations
and numbers of bits hidden.
In this paper, we presented a framework for detecting malware that uses online
steganalysis with machine learning to identify stego content with
a precision of 98 percent. Our framework has been applied to detecting infected
malware from well-known applications via a case study that focuses on detecting
application-specific malicious software.
The purpose of this study is to gather online memory data with the aid of a
summary of memory accesses based on the system call epoch. Both kernel and
user-level threats were covered in experiments that show high detection accuracy
against stego content (100% detection rate with less than 1% false positives) as
well as memory steganalysis attacks (99.0% detection rate with less than 4
percent false positives). As part of the proposed methodology, machine
learning is used to identify malware signatures for classification rather than relying
on manual human effort, a crucial step in automating this analytical problem.
REFERENCES
[1] Idika, N., & Mathur, A. P. (2007). A survey of malware detection
techniques. Purdue University, 48.
[3] Moser, A., Kruegel, C., & Kirda, E. (2007, December). Limits of static analysis
for malware detection. In Twenty-Third Annual Computer Security Applications
Conference (ACSAC 2007) (pp. 421-430). IEEE.
[4] Burguera, I., Zurutuza, U., & Nadjm-Tehrani, S. (2011, October). Crowdroid:
behavior-based malware detection system for android. In Proceedings of the 1st
ACM workshop on Security and privacy in smartphones and mobile devices (pp.
15-26).
[5] Ye, Y., Wang, D., Li, T., & Ye, D. (2007, August). IMDS: Intelligent malware
detection system. In Proceedings of the 13th ACM SIGKDD international
conference on Knowledge discovery and data mining (pp. 1043-1047).
[6] Vinod, P., Jaipur, R., Laxmi, V., & Gaur, M. (2009, March). Survey on malware
detection methods. In Proceedings of the 3rd Hackers’ Workshop on computer
and internet security (IITKHACK’09) (pp. 74-79).
[7] Sahs, J., & Khan, L. (2012, August). A machine learning approach to android
malware detection. In 2012 European Intelligence and Security Informatics
Conference (pp. 141-147). IEEE.
[8] McLaughlin, N., Martinez del Rincon, J., Kang, B., Yerima, S., Miller, P., Sezer,
S., ... & Joon Ahn, G. (2017, March). Deep android malware detection.
In Proceedings of the seventh ACM on conference on data and application
security and privacy (pp. 301-308).
[9] Ye, Y., Li, T., Adjeroh, D., & Iyengar, S. S. (2017). A survey on malware
detection using data mining techniques. ACM Computing Surveys (CSUR), 50(3),
1-40.
[10] Kolbitsch, C., Comparetti, P. M., Kruegel, C., Kirda, E., Zhou, X. Y., & Wang,
X. (2009, August). Effective and efficient malware detection at the end host.
In USENIX security symposium (Vol. 4, No. 1, pp. 351-366).
[11] Yuan, Z., Lu, Y., Wang, Z., & Xue, Y. (2014, August). Droid-sec: deep learning
in android malware detection. In Proceedings of the 2014 ACM conference on
SIGCOMM (pp. 371-372).
[12] Shabtai, A., Kanonov, U., Elovici, Y., Glezer, C., & Weiss, Y. (2012).
"Andromaly": a behavioral malware detection framework for android
devices. Journal of Intelligent Information Systems, 38(1), 161-190.
[13] Preda, M. D., Christodorescu, M., Jha, S., & Debray, S. (2007). A semantics-
based approach to malware detection. ACM SIGPLAN Notices, 42(1), 377-388.
[14] Bazrafshan, Z., Hashemi, H., Fard, S. M. H., & Hamzeh, A. (2013, May). A
survey on heuristic malware detection techniques. In The 5th Conference on
Information and Knowledge Technology (pp. 113-120). IEEE.
[15] Zarni Aung, W. Z. (2013). Permission-based android malware
detection. International Journal of Scientific & Technology Research, 2(3), 228-
234.
7. APPENDIX
a) SAMPLE CODE
import subprocess
import sys
shouldPreprocess = False
if __name__ == '__main__':
    if shouldPreprocess:
        data = pp.getDataset('RedditDataSentimental.xlsx')
        data = pp.cleanDataset(data, True, True)
    if False:
        # not executed except for testing due to time required to evaluate
        model.validateModel(data)
    # Final training
    model.trainFinalModel(data)
import math
import pipeline as pl
import preprocess as pp
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from pprint import pprint
from sklearn.utils import shuffle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.svm import LinearSVC  # used by trainFinalModel
    # Shuffle if required
    if shouldShuffle:
        suicidalGroup = shuffle(suicidalGroup)
        notSuicidal = shuffle(notSuicidal)
    return data
def validateModel(data):
    """
    validateModel - tests validity and prints accuracy score of Logistic Regression model
    """
    # Fit model
    skf = StratifiedKFold(n_splits=5, shuffle=True)
    skf.get_n_splits(trainContentTfid, data['Suicidal'])
    LR = LogisticRegression()
    scores = []
    for train_index, test_index in skf.split(trainContentTfid, data['Suicidal']):
        x_train, x_test = trainContentTfid[train_index], trainContentTfid[test_index]
        y_train, y_test = data.iloc[train_index, 1], data.iloc[test_index, 1]
        LR.fit(x_train, y_train)
        y_pred = LR.predict(x_test)
        scores.append(round(metrics.accuracy_score(y_test, y_pred) * 100, 2))
        # print(metrics.classification_report(y_test, y_pred, target_names=['Not Suicidal', 'Suicidal']))
    # Print scores
    scores = [{'Split': i, 'Accuracy': x} for i, x in enumerate(scores)]
    print('____LR indication____\n')
    pprint(scores)
    print('_____________________')
    """
    evaluateModel - evaluate pipeline model from pl
    """
    return gs_eval
def gridSearchParameters(data):
    """
    gridSearchParameters - hyper parameter tweaking as per pl model pipelines to trade
    accuracy for higher sensitivity
    :param data: pandas DataFrame
    :return: None
    """
    names, results, params = evaluateModels(data.iloc[:, 0], data.iloc[:, 1])
    for i, x in enumerate(names):
        print('Test -->' + x)
        print('Results -->')
        print(results[i].head())
def trainFinalModel(data):
    """
    trainFinalModel - final training model with best parameters
    """
    # Fit model
    LR = LogisticRegression(C=4, max_iter=1000, penalty='l2')
    LR.fit(x_train, y_train)
    # Predict
    y_pred = LR.predict(x_test)
    print(f' Accuracy of LR : {round(metrics.accuracy_score(y_test, y_pred) * 100, 2)}%')
    print(metrics.classification_report(y_test, y_pred))
    LogisticRegression_score = metrics.accuracy_score(y_test, y_pred)
    # Fit model
    classifier = LinearSVC(C=4, max_iter=1000, penalty='l2')
    classifier.fit(x_train, y_train)
    # Predict
    y_prediction = classifier.predict(x_test)
    print(f' Accuracy Score of SVC: {round(metrics.accuracy_score(y_test, y_prediction) * 100, 2)}%')
    print(metrics.classification_report(y_test, y_prediction))
    ACC1 = metrics.accuracy_score(y_test, y_prediction)
    # Visualise model prediction vs truth
    sns.heatmap(metrics.confusion_matrix(y_test, y_prediction), annot=True, fmt='d',
                cmap='YlGnBu')
    plt.xlabel('Predicted')
    plt.ylabel('Truth')
    plt.title('Confusion matrix of SVC', y=1.1)
    plt.show()
# Model scorers
REFIT_SCORE = 'recall_score'
MODEL_SCORERS = {
    'precision_score': make_scorer(precision_score),
    'recall_score': make_scorer(recall_score),
    'accuracy_score': make_scorer(accuracy_score)
}
# GridSearchCV pipelines
MODEL_PIPELINES = {
    'fe=tfid+bigram+default_c=LR': {
        'Pipeline': Pipeline([
            ('tfid', TfidfVectorizer()),
            ('clf', LogisticRegression())
        ]),
        'Parameters': {
            'tfid__norm': ['l1', 'l2'],
            'tfid__ngram_range': [(1, 2)],
            'clf__penalty': ['l2'],
            'clf__max_iter': [1000],
            'clf__C': [0.25, 0.5, 1, 2, 4]
        }
    },
    'fe=tfid+trigram+default_c=LR': {
        'Pipeline': Pipeline([
            ('tfid', TfidfVectorizer()),
            ('clf', LogisticRegression())
        ]),
        'Parameters': {
            'tfid__norm': ['l1', 'l2'],
            'tfid__ngram_range': [(1, 3)],
            'clf__penalty': ['l2'],
            'clf__max_iter': [1000],
            'clf__C': [0.25, 0.5, 1, 2, 4]
        }
    }
}
import os
import emoji
import nltk
import pipeline as pl
import texthero as hero
import pandas as pd
import regex as re
from nltk.stem import WordNetLemmatizer
# Utilities
def getDataset(filename):
    """
    getDataset - gets and reads the specified file from the input directory,
    drops null rows and columns excluded from our analysis.
    """
    data = None
    if 'csv' in split[1]:
        data = pd.read_csv(filename)
    elif 'xlsx' in split[1]:
        data = pd.read_excel(filename)
    return data
def removeHandles(text):
    """
    removeHandles - removes @alphanumeric handles from text
    """
def removeHashtag(text):
    """
    removeHashtag - removes #hashtag from text
    """
def removeEmoji(text):
    """
    removeEmoji - removes all instances of emoji in the string passed.
    :param lemmatise: boolean, defaults True - determines whether function performs
    lemmatisation
    :return: returns array of cleaned strings
    """
    # Remove emoji
    text = text.apply(lambda x: ' '.join([removeEmoji(word) for word in x.split()]))
    # Remove hashtag
    text = text.apply(lambda x: ' '.join([removeHashtag(word) for word in x.split()]))
    # Remove handles
    text = text.apply(lambda x: ' '.join([removeHandles(word) for word in x.split()]))
    # Lemmatisation
    if lemmatise:
        nltk.download('wordnet')
        text = text.apply(lambda x: ' '.join([WordNetLemmatizer().lemmatize(word) for word in x.split()]))
    return text
return data
import json  # to print list/dict in textbox
import tkinter as tk  # root GUI module
import tkinter.scrolledtext as scrolledtext  # module for scrollable text widget
import tkinter.ttk as ttk  # themed GUI module
from tkinter.filedialog import askopenfile  # module to read file
import requests  # used by open_n_scan to fetch each page
from bs4 import BeautifulSoup  # used by open_n_scan to extract page text
        container.grid_rowconfigure(0, weight=1)
        container.grid_columnconfigure(0, weight=1)
        self.frames = {}
        menu = tk.Menu(container)
        ex = tk.Menu(menu, tearoff=0)
        menu.add_cascade(menu=ex, label="Exit")
        ex.add_command(label="Exit", command=self.destroy)
        tk.Tk.config(self, menu=menu)
        self.show_frame(Startpage)
# Home page
class Startpage(ttk.Frame):
    def __init__(self, parent, controller):
        ttk.Frame.__init__(self, parent)
        ttk.Label(self, text="").pack()
        ttk.Label(self, text="").pack()
        ttk.Label(self, text="").pack()
        ttk.Label(self, text="\n").pack()
        ttk.Label(self, text="").pack()
        for line in f:
            j.append(line.strip())
        f.close()
        d = dict.fromkeys(j, 0)
        ttk.Label(self, text="").pack()
        # code to open and scan the list of websites given in a text file
        def open_n_scan():
            files = askopenfile(mode='r', filetypes=[("Text File", "*.txt")])
            l3.config(state=tk.NORMAL)
            l3.delete('1.0', "end")
            for url in files:
                count = 0
                result = requests.get(url.strip())
                soup = BeautifulSoup(result.content, 'lxml')
                for i in soup.get_text().split():
                    if i.lower() in j:
                        count += 1
                l3.insert(tk.END, url.strip() + " = " + str(count) + "\n")
            l3.config(state=tk.DISABLED)
        ttk.Label(self, text="").pack()
        ttk.Label(self, text="").pack()
        ttk.Label(self, text="").pack()
# About page
class PageTwo(ttk.Frame):
        ttk.Label(self, text="").pack()
        ttk.Label(self, text="").pack()
        ttk.Label(self, text="").pack()
can help prevent suicide by learning the warning signs, promoting prevention and
resilience, and a committing to social change.").pack()
        # Info on suicide
        ttk.Label(self, text="").pack()
app = MyApp()
app.resizable(0, 0)
app.title("Detect web pages with suicidal content")  # app title
app.state('zoomed')  # maximized app by default
app.mainloop()